Assignment – 16
You are tasked with developing Python code for sentiment extraction using a
provided sample dataset. The dataset consists of textual data annotated with
labels categorizing sentiments into four categories: "rude," "normal," "insult," and
"sarcasm."
Dataset:
● Real News:
https://drive.google.com/file/d/1FL2HqgLDAP5550nd1h_8iBhAV-ISTnzr/view?usp=sharing
● Fake News:
https://drive.google.com/file/d/1EdI_HyUeI_Fi2nld7rQnnGEpQqn_BwM-/view?usp=sharing
1. Outline the key steps involved in developing a sentiment extraction
algorithm using Python.
answer:
Sentiment analysis is a technique used to determine the emotional tone or sentiment expressed in a text. It involves analyzing the words and phrases used in the text to identify the underlying sentiment, whether it is positive, negative, or neutral.

Sentiment analysis has a wide range of applications, including social media monitoring, customer feedback analysis, and market research.

One of the main challenges in sentiment analysis is the inherent complexity of human language. Text data often contains sarcasm, irony, and other forms of figurative language that can be difficult to interpret using traditional methods. 

However, recent advances in natural language processing (NLP) and machine learning have made it possible to perform sentiment analysis on large volumes of text data with a high degree of accuracy. 

Three Methodologies for Sentiment Analysis
There are several ways to perform sentiment analysis on text data, with varying degrees of complexity and accuracy. The most common methods include a lexicon-based approach, a machine learning (ML) based approach, and a pre-trained transformer-based deep learning approach. Let’s look at each in more detail:

Lexicon-based analysis
This type of analysis, such as the NLTK Vader sentiment analyzer, involves using a set of predefined rules and heuristics to determine the sentiment of a piece of text. These rules are typically based on lexical and syntactic features of the text, such as the presence of positive or negative words and phrases. 

While lexicon-based analysis can be relatively simple to implement and interpret, it may not be as accurate as ML-based or transformer-based approaches, especially when dealing with complex or ambiguous text data.
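
The rule-based idea above can be illustrated with a minimal sketch. It is not VADER's actual algorithm: the word scores in `LEXICON`, the `NEGATIONS` set, and the function names are all hypothetical and chosen only to show how predefined word scores and one simple negation rule combine into a label.

```python
import re

# Hypothetical mini-lexicon for illustration only; real lexicons such as
# VADER's contain thousands of scored entries.
LEXICON = {"good": 1.0, "great": 2.0, "love": 2.0,
           "bad": -1.0, "terrible": -2.0, "hate": -2.0}
NEGATIONS = {"not", "no", "never"}

def lexicon_score(text):
    """Sum word scores, flipping the sign of a word that follows a negation."""
    tokens = re.findall(r"[a-z']+", text.lower())
    score = 0.0
    for i, token in enumerate(tokens):
        value = LEXICON.get(token, 0.0)
        if i > 0 and tokens[i - 1] in NEGATIONS:
            value = -value
        score += value
    return score

def lexicon_label(text):
    """Turn the summed score into a positive/negative/neutral label."""
    score = lexicon_score(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

With this toy lexicon, `lexicon_label("This is not good")` comes out as "negative" because the negation rule flips the score of "good". Sarcasm or words missing from the lexicon would still be misread, which is the accuracy limitation described above.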

Machine learning (ML) 
This approach involves training a model to identify the sentiment of a piece of text based on a set of labeled training data. These models can be trained using a wide range of ML algorithms, including decision trees, support vector machines (SVMs), and neural networks. 

ML-based approaches can be more accurate than rule-based analysis, especially when dealing with complex text data, but they require a larger amount of labeled training data and may be more computationally expensive.
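
As a rough sketch of the supervised approach, the block below trains a tiny multinomial Naive Bayes classifier written from scratch. Naive Bayes is swapped in here because it fits in a few lines; it is not one of the algorithms listed above. The training sentences are made up, and the label names are borrowed from the assignment statement.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(examples):
    """examples: list of (text, label) pairs. Returns the fitted model as a dict."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in examples:
        label_counts[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocabulary.add(word)
    return {"labels": label_counts, "words": word_counts, "vocab": vocabulary}

def predict(model, text):
    """Return the label with the highest log-probability (add-one smoothing)."""
    total_docs = sum(model["labels"].values())
    vocab_size = len(model["vocab"])
    best_label, best_score = None, -math.inf
    for label, doc_count in model["labels"].items():
        score = math.log(doc_count / total_docs)
        total_words = sum(model["words"][label].values())
        for word in text.lower().split():
            count = model["words"][label][word]
            score += math.log((count + 1) / (total_words + vocab_size))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy, made-up training data using the assignment's label names.
training_data = [
    ("thanks for your help", "normal"),
    ("see you at the meeting", "normal"),
    ("you are an idiot", "insult"),
    ("what an idiot you are", "insult"),
]
model = train_naive_bayes(training_data)
```

Calling `predict(model, "you idiot")` returns "insult" on this toy data. A real model would need far more labeled examples per class, which is the cost noted above.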

Pre-trained transformer-based deep learning
A deep learning-based approach, as seen with BERT and GPT-4, involves using pre-trained models trained on massive amounts of text data. These models use complex neural networks to encode the context and meaning of the text, allowing them to achieve state-of-the-art accuracy on a wide range of NLP tasks, including sentiment analysis. However, these models require significant computational resources and may not be practical for all use cases.

Lexicon-based analysis is a straightforward approach to sentiment analysis, but it may not be as accurate as more complex methods. 
Machine learning-based approaches can be more accurate, but they require labeled training data and may be more computationally expensive. 
Pre-trained transformer-based deep learning approaches can achieve state-of-the-art accuracy but require significant computational resources and may not be practical for all use cases. 
The choice of approach will depend on the specific needs and constraints of the project at hand.

2. Describe the structure and format of the sample dataset required for
sentiment extraction.
answer:
Datasets for Supervised Learning
This section lists public datasets that we can use to develop a base model, with a description of the features and scientific uses of each.

We’ll start by listing the most common datasets for supervised learning in sentiment analysis. They’re all particularly suitable for developing machine learning models that classify texts according to a predetermined typology.

3.1. The MPQA Opinion Corpus
The MPQA Opinion Corpus comprises 70 annotated documents that correspond to news items published in the English-language press. It uses a specific annotation scheme that comprises the following tags or labels:

● An agent label, which refers to the entity that is the recipient of the author’s sentiment
● An expressive-subjectivity tag, which marks the elements of the texts that contain an indirect judgment over one of the entities labeled as agents
● The direct-subjective tag, which refers to the direct expression of sentiment in relation to particular entities
● An objective-speech-event tag, which indicates a neutral statement with regard to its sentiment
● The attitude value, which contains the polarization of the sentiment in regard to an expressed statement

The two tags, expressive and direct-subjectivity, also contain the measurement of polarity assigned to the particular sentence to which they refer. This dataset is particularly suitable for training models that learn both the explicit and implicit expressions of sentiments in regard to particular entities. It has also been used for training deep learning models for sentiment analysis and, more generally, for opinion mining.

3.2. Sentiment 140
The dataset Sentiment 140 contains 1,600,000 tweets from various English-speaking users, and it’s suitable for developing models for the classification of sentiments. The name comes, of course, from the defining character limitation of the original Twitter messages.

This dataset comprises automatically-tagged messages, marked as “positive” or “negative” according to whether, respectively, they contain the emoticons :) or :(. This automatic approach to tagging, although commonly used, has known limitations, especially its blindness to irony.
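
A minimal sketch of this kind of emoticon-based labeling is shown below. The function name and the decision to return `None` for tweets with no emoticon or with both emoticons are assumptions made for illustration, not the dataset authors' actual procedure.

```python
def emoticon_label(tweet):
    """Return 'positive', 'negative', or None based only on :) and :( ."""
    has_happy = ":)" in tweet
    has_sad = ":(" in tweet
    if has_happy and not has_sad:
        return "positive"
    if has_sad and not has_happy:
        return "negative"
    return None  # no emoticon, or mixed signals
```

An ironic tweet such as "Oh great, another delay :)" would be labeled "positive" by this rule, which illustrates the blindness to irony mentioned above.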

The features of the dataset are:

● Polarity, from negative to positive
● ID and date of the tweet, useful if we want to do time-series analysis
● Author’s Twitter handle
● And of course, the text itself of the tweet

Sentiment 140 has proved useful for training Maximum Entropy models in particular. The scientific literature also shows its use with Naive Bayesian models, and with support vector machines to analyze the population’s attitude towards pandemics.

3.3. Paper Reviews
The Paper Reviews dataset contains 405 reviews, in Spanish and English, on papers submitted to an international conference on computer science. The number of papers to which they refer is a bit more than half since it’s common in scientific publishing to use at least two reviewers per paper. The dataset itself is in JSON format, and contains the following features:

● ID and date of the paper to which the review refers
● The decision to accept or refuse the paper by the reviewer
● The text of the review itself, sent by the reviewers to the editors of the conference proceedings, and also to the paper’s authors
● A second text called remark, which the editors receive but the authors of the paper don’t
● Orientation, which is the sentiment score assigned by the authors of the dataset to each individual review
● Evaluation, which is the score or judgment on a given paper
● And finally, confidence, which indicates a measure of the certainty that the reviewer has in assigning the evaluation score to an article

The Paper Reviews dataset finds usage for the training of hybrid models that include swarm optimization. It’s also suitable for general classification and regression tasks, given that the evaluation score has a numerical ordinal value. It should also be useful, and not well-exploited as of yet, to study the relationship between emotions, objectivity, and scores in the peer-review process.

One common belief in science is that the peer review process is generally fair and equitable. This belief is however questionable, especially in light of known human biases related to gender, institutional prestige, and, most importantly for us in natural language processing, language. This dataset is, therefore, particularly apt for analyzing human bias and its role in the publication of scientific discoveries.

3.4. Large Movie Review Dataset
Another popular dataset containing reviews, in this case on movies, is the Large Movie Review Dataset. The dataset contains 50,000 reviews divided into training and testing, all containing highly polarized texts. It’s particularly suitable for binary classification, and it comprises just two features:

● The text of the review
● And a polarization value, either “positive” or “negative”

This dataset found usage in the training of hybrid supervised-unsupervised learning models, as well as support vector classifiers, naive Bayesian classifiers, and, jointly, neural networks and k-nearest neighbors. A large collection of notebooks containing models for classification of this dataset is available on Kaggle.

3.5. Deeply Moving, Stanford Sentiment Treebank
The Stanford Sentiment Treebank is a corpus of texts used in the paper Deeply Moving: Deep Learning for Sentiment Analysis. The dataset comprises 10,605 texts extracted from Rotten Tomatoes, a website that specializes in movie reviews. It has the following features:

● The texts themselves, in the original and unprocessed form
● The phrases contained in the texts, and a unique ID for each of them
● And lastly, the structure of the tree that parses the texts in the dataset

The Stanford Sentiment Treebank finds usage in the training of support vector classifiers and deep learning models. It also inspired the development of similar datasets for other languages, with the creation of the Arabic Sentiment Treebank.

3.6. Multi-Domain Sentiment Dataset
This dataset for multi-domain analysis was initially developed by the University of Pennsylvania on the basis of Amazon product reviews scraped from the website. The products belong to four categories: electronics, books, kitchen utensils, and DVDs. Each review possesses a polarization score of “positive” or “negative”, corresponding to, respectively, more than three stars or less than three stars out of a maximum of five.
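
The star-to-polarity rule described above can be sketched as a small function. The function name is hypothetical, and returning `None` for exactly three stars is an assumption, since the text only defines the labels for ratings above and below three.

```python
def stars_to_polarity(stars):
    """Map a 1-5 star rating to a polarity label; 3 stars is left unlabeled."""
    if not 1 <= stars <= 5:
        raise ValueError("stars must be between 1 and 5")
    if stars > 3:
        return "positive"
    if stars < 3:
        return "negative"
    return None
```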

Both an unprocessed and a preprocessed version of the reviews are available. The latter comes already tokenized into unigrams or bigrams. The features of the preprocessed version are:

● The tokens themselves
● For each token, the count of occurrences
● A label, containing the polarization value

The two classes of positive and negative reviews possess 1000 elements each. Unlabeled data is also present, in the form of 3685 reviews for DVDs and 5945 for kitchen utensils. The usage of unlabeled data might help to compare the predictions of different models against previously unseen data.

The dataset has found ample usage in the literature on sentiment analysis. Among these, a joint sentiment-topic model proved useful in learning the factors that predict the emotional connotation of a review. Naive Bayesian models and sequential minimal optimization also successfully performed the classification of the texts from this dataset.

3.7. Pros and Cons
The Pros and Cons Dataset relates to the task of opinion mining at the sentence-level. It contains around 23,000 sentences indicating positive and negative judgments and is meant to be used in relation to the Comparative Sentences dataset. The dataset is suitable for two usages:

● As a lexicon or lookup dictionary, to determine the polarity of identical sentences in new texts
● To assign polarity to new sentences on the basis of their similarity with those contained in this dataset

The papers in the scientific literature that leverage this dataset fall into two categories: model development, and extension of automatic polarity classification to languages other than English.

Regarding the first category, the usage of this dataset was effective for automated speech processing. In relation to this task, the dataset provides the classification labels for polarity that a model for audio processing can use to determine the sentiment of user speech. Its related dataset Comparative Sentences also found a similar usage, in the attribution of sentiment to YouTube videos.

Regarding the second category, the dataset inspired the creation of a corpus of polarized sentences in Norwegian, but also a multi-lingual corpus for deep sentiment analysis. Multi-lingual sentiment analysis is notoriously difficult because it’s language-dependent, and the usage of this dataset together with others in different languages can help address this problem.

3.8. Opinosis Opinion Dataset
The Opinosis Opinion Dataset is a resource that comprises user reviews for products and services, grouped by topic. It covers 51 different topics related to products sold on the websites Amazon, Tripadvisor, and Edmunds. For each topic, there are about 100 distinct sentences that mostly relate to electronics, hotels, or cars.

All sentences are divided into tokens, which are subsequently augmented with parts-of-speech tags. Because it lacks polarization labels, the dataset is used mainly for text summarization. Used in conjunction with a sentiment lexicon, though, it also supports supervised sentiment analysis, as was the case for all previous datasets.

The advantage of the Opinosis Opinion Dataset lies in its parts-of-speech tags. Studies suggest that a model that uses adjectives and adverbs outperforms one that uses adjectives alone, and we need parts-of-speech tags to discriminate between the two groups. This dataset, therefore, allows the construction of models for sentiment analysis that implement parts-of-speech tags as well as lexica.

3.9. Twitter US Airlines
Another dataset originating from Twitter is the Twitter US Airlines Dataset, which comprises thematic messages on the quality of service by American aviation companies. The dataset contains these features:

● A unique ID for each message
● A polarity score, assigned by volunteer contributors
● If the polarity is negative, a sentence in natural language in which the human tagger identifies the reason
● The self-assessed confidence by the human tagger in assigning a polarization score
● The number of retweets, useful for studying the distribution or influence of messages
● And of course the name of the specific aviation company concerned by the message

In scientific literature, the dataset is used for classification tasks in general and, more specifically, with support vector machines and AdaBoost, and with ensemble approaches that combine predictions from multiple algorithms.

Interestingly, we can note that some US airline companies that are represented in this dataset react to negative customer feedback on Twitter surprisingly quickly. This may lead us to believe that they themselves might have adopted a system for the detection of negative polarity in user tweets.

3. Implement the Python code to read and preprocess the sample dataset for
sentiment analysis. Ensure that the code correctly handles text data and labels.
answer:
Installing NLTK and Setting up Python Environment
To use the NLTK library, you must have a Python environment on your computer. The easiest way to install Python is to download and install the Anaconda Distribution. This distribution comes with the Python 3 base environment and other extras, including Jupyter Notebook. You also do not need to install the NLTK library separately, as Anaconda comes with NLTK and many other useful libraries pre-installed.

If you choose to install Python without any distribution, you can directly download and install Python from python.org. In this case, you will have to install NLTK once your Python environment is ready.

To install the NLTK library, open the command terminal and type:


pip install nltk

It's worth noting that NLTK also requires some additional data to be downloaded before it can be used effectively. This data includes pre-trained models, corpora, and other resources that NLTK uses to perform various NLP tasks. To download this data, run the following in a Python session or script:


import nltk

nltk.download('all')

Preprocessing Text
Text preprocessing is a crucial step in performing sentiment analysis, as it helps to clean and normalize the text data, making it easier to analyze. The preprocessing step involves a series of techniques that help transform raw text data into a form you can use for analysis. Some common text preprocessing techniques include tokenization, stop word removal, stemming, and lemmatization.

Tokenization
Tokenization is a text preprocessing step in sentiment analysis that involves breaking down the text into individual words or tokens. This is an essential step in analyzing text data as it helps to separate individual words from the raw text, making it easier to analyze and understand. Tokenization is typically performed using NLTK's built-in `word_tokenize` function, which can split the text into individual words and punctuation marks.
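
To make the idea concrete without requiring the NLTK data files, here is a minimal regular-expression tokenizer. It is a simplified stand-in and not the algorithm behind `word_tokenize` (which, for example, splits contractions differently); the function name `simple_tokenize` is hypothetical.

```python
import re

def simple_tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", text)

print(simple_tokenize("I don't like it!"))
# → ['I', "don't", 'like', 'it', '!']
```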

Stop words
Stop word removal is a crucial text preprocessing step in sentiment analysis that involves removing common and irrelevant words that are unlikely to convey much sentiment. Stop words are words that are very common in a language and do not carry much meaning, such as "and," "the," "of," and "it." These words can cause noise and skew the analysis if they are not removed.

By removing stop words, the remaining words in the text are more likely to indicate the sentiment being expressed. This can help to improve the accuracy of the sentiment analysis. NLTK provides a built-in list of stop words for several languages, which can be used to filter out these words from the text data.
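
The filtering step can be sketched as follows. The `STOP_WORDS` set here is a hypothetical, deliberately tiny list used only for illustration; NLTK's English list is much longer.

```python
# Hypothetical, deliberately tiny stop list; NLTK's English list is much longer.
STOP_WORDS = {"a", "an", "and", "the", "of", "it", "is", "was", "to"}

def remove_stop_words(tokens):
    """Drop tokens whose lowercase form is in STOP_WORDS, keeping order."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

print(remove_stop_words(["The", "service", "was", "rude", "and", "slow"]))
# → ['service', 'rude', 'slow']
```

One design point worth checking on real data: NLTK's English list includes negation words such as "not" and "no", so removing them may change the sentiment that a downstream lexicon or model sees.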

Stemming and Lemmatization
Stemming and lemmatization are techniques used to reduce words to their root forms. Stemming involves removing the suffixes from words, such as "ing" or "ed," to reduce them to their base form. For example, the word "jumping" would be stemmed to "jump." 

Lemmatization, however, maps a word to its dictionary form (lemma) based on its part of speech. For example, when treated as verbs, both "jumped" and "jumping" are lemmatized to "jump." NLTK's WordNetLemmatizer treats every word as a noun unless a part of speech is supplied, so without that information "jumping" is returned unchanged.
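
The suffix-stripping idea behind stemming can be sketched with a toy function. It is far cruder than a real stemmer such as NLTK's PorterStemmer; the suffix list and the minimum stem length of three characters are arbitrary assumptions.

```python
def naive_stem(word):
    """Strip a few common suffixes; a toy stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print(naive_stem("jumping"), naive_stem("jumped"), naive_stem("running"))
# → jump jump runn
```

The output "runn" for "running" shows why stems are not always dictionary words, which is the gap lemmatization is meant to close.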

Bag of Words (BoW) Model
The bag of words model is a technique used in natural language processing (NLP) to represent text data as a set of numerical features. In this model, each document or piece of text is represented as a "bag" of words, with each word in the text represented by a separate feature or dimension in the resulting vector. The value of each feature is determined by the number of times the corresponding word appears in the text.

The bag of words model is useful in NLP because it allows us to analyze text data using machine learning algorithms, which typically require numerical input. By representing text data as numerical features, we can train machine learning models to classify text or analyze sentiments. 
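
A minimal sketch of building bag-of-words vectors with only the standard library is shown below. The function name and the whitespace tokenization are simplifying assumptions; in practice a library vectorizer would normally be used.

```python
from collections import Counter

def bag_of_words(documents):
    """Return (vocabulary, vectors) where each vector holds word counts."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors

vocabulary, vectors = bag_of_words(["good movie", "bad movie bad acting"])
print(vocabulary)  # → ['acting', 'bad', 'good', 'movie']
print(vectors)     # → [[0, 0, 1, 1], [1, 2, 0, 1]]
```

Each column corresponds to one vocabulary word and each row to one document, so word order is discarded and only counts remain.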

The example in the next section will use the NLTK Vader model for sentiment analysis on the Amazon customer dataset. In this particular example, we do not need to perform this step because the NLTK Vader API accepts text as an input instead of numeric vectors, but if you were building a supervised machine learning model to predict sentiment (assuming you have labeled data), you would have to transform the processed text into a bag of words model before training the machine learning model. 

End-to-end Sentiment Analysis Example in Python
To perform sentiment analysis using NLTK in Python, the text data must first be preprocessed using techniques such as tokenization, stop word removal, and stemming or lemmatization. Once the text has been preprocessed, we will then pass it to the Vader sentiment analyzer for analyzing the sentiment of the text (positive or negative).

Step 1 - Import libraries and load dataset
First, we’ll import the necessary libraries for text analysis and sentiment analysis, such as pandas for data handling, nltk for natural language processing, and SentimentIntensityAnalyzer for sentiment analysis.

We’ll then download all of the NLTK corpus (a collection of linguistic data) using nltk.download().

Once the environment is set up, we will load a dataset of Amazon reviews using pd.read_csv(). This will create a DataFrame object in Python that we can use to analyze the data. We'll display the contents of the DataFrame using df.


# import libraries
import pandas as pd
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

# download nltk corpus (first time only)
nltk.download('all')

# Load the amazon review dataset
df = pd.read_csv('https://raw.githubusercontent.com/pycaret/pycaret/master/datasets/amazon.csv')
df

Step 2 - Preprocess text
Let’s create a function preprocess_text in which we first tokenize the documents using the word_tokenize function from NLTK, then remove stop words using the stopwords module from NLTK, and finally lemmatize the filtered_tokens using WordNetLemmatizer from NLTK.


# create preprocess_text function
def preprocess_text(text):
    # Tokenize the lowercased text
    tokens = word_tokenize(text.lower())

    # Remove stop words (build the set once rather than once per token)
    stop_words = set(stopwords.words('english'))
    filtered_tokens = [token for token in tokens if token not in stop_words]

    # Lemmatize the tokens
    lemmatizer = WordNetLemmatizer()
    lemmatized_tokens = [lemmatizer.lemmatize(token) for token in filtered_tokens]

    # Join the tokens back into a string
    return ' '.join(lemmatized_tokens)

# apply the function to the reviewText column
df['reviewText'] = df['reviewText'].apply(preprocess_text)
df


Notice the changes in the reviewText column as a result of the preprocess_text function that we applied in the above step.
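
The same read-and-preprocess pattern could be adapted to the assignment's four-label data. The sketch below is only an illustration under stated assumptions: the column names `text` and `label`, the CSV layout, and the sample rows are all hypothetical, since the actual file structure is not shown here. It reads rows with the standard `csv` module, normalizes label case, and skips rows with empty text or labels outside the four categories.

```python
import csv
import io

ALLOWED_LABELS = {"rude", "normal", "insult", "sarcasm"}

def load_labeled_rows(file_obj, text_column="text", label_column="label"):
    """Read (text, label) pairs, skipping rows with empty text or unknown labels."""
    rows = []
    for record in csv.DictReader(file_obj):
        text = (record.get(text_column) or "").strip()
        label = (record.get(label_column) or "").strip().lower()
        if text and label in ALLOWED_LABELS:
            rows.append((text, label))
    return rows

# Made-up sample standing in for the real file.
sample_csv = io.StringIO(
    "text,label\n"
    "Thanks for the update,normal\n"
    "Oh sure because that always works,Sarcasm\n"
    ",rude\n"
    "Nice weather today,happy\n"
)
rows = load_labeled_rows(sample_csv)
print(rows)
# → [('Thanks for the update', 'normal'), ('Oh sure because that always works', 'sarcasm')]
```

For a file on disk, the same function could be called with `open(path, newline="", encoding="utf-8")`, and each returned text could then be passed through a preprocessing function before training.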

Step 3 - NLTK Sentiment Analyzer
First, we’ll initialize a Sentiment Intensity Analyzer object from the nltk.sentiment.vader module.

Next, we’ll define a function called get_sentiment that takes a text string as its input. The function calls the polarity_scores method of the analyzer object to obtain a dictionary of sentiment scores for the text, which includes scores for positive, negative, and neutral sentiment as well as an overall compound score.

The function will then check whether the positive score is greater than 0 and return 1 if it is, and 0 otherwise. This means that any text with a positive score above 0 will be classified as having a positive sentiment, and any other text will be classified as having a negative sentiment.

Finally, we’ll apply the get_sentiment function to the reviewText column of the df DataFrame using the apply method. This creates a new column called sentiment in the DataFrame, which stores the sentiment score for each review. We’ll then display the updated DataFrame using df.


# initialize NLTK sentiment analyzer
analyzer = SentimentIntensityAnalyzer()

# create get_sentiment function
def get_sentiment(text):
    # polarity_scores returns a dict with 'neg', 'neu', 'pos' and 'compound'
    scores = analyzer.polarity_scores(text)
    # 1 (positive) if any positive signal was found, otherwise 0
    sentiment = 1 if scores['pos'] > 0 else 0
    return sentiment

# apply get_sentiment function
df['sentiment'] = df['reviewText'].apply(get_sentiment)
df

sentiment analysis

The NLTK sentiment analyzer's compound score ranges from -1 to +1, while the pos, neg, and neu scores each range from 0 to 1. In the get_sentiment function above we applied a cut-off threshold of 0 to the pos score: anything above 0 is classified as 1 (meaning positive). Since we have actual labels, we can evaluate the performance of this method by building a confusion matrix.
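Because the analyzer reports several scores, it can help to see what labeling from the compound value would look like. The following is a minimal sketch, not part of the original tutorial: the helper name label_from_scores is hypothetical, the input dictionaries are hand-written stand-ins shaped like the output of polarity_scores, and the 0.05 cut-off is an assumed convention that should be tuned on labeled data.

```python
def label_from_scores(scores, threshold=0.05):
    """Return 1 (positive) if the compound score reaches the threshold, else 0.

    `scores` is assumed to be a dict shaped like the result of
    SentimentIntensityAnalyzer.polarity_scores, i.e. with the keys
    'neg', 'neu', 'pos' and 'compound'.
    """
    return 1 if scores["compound"] >= threshold else 0


# Hand-written example dictionaries (not real analyzer output)
mostly_positive = {"neg": 0.0, "neu": 0.4, "pos": 0.6, "compound": 0.63}
mixed_but_negative = {"neg": 0.5, "neu": 0.3, "pos": 0.2, "compound": -0.42}

print(label_from_scores(mostly_positive))
print(label_from_scores(mixed_but_negative))
```

The second example has a pos score above 0, so a rule based only on pos would label it positive, while the compound-based rule labels it negative.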


from sklearn.metrics import confusion_matrix

print(confusion_matrix(df['Positive'], df['sentiment']))

Output:


[[ 1131  3636]
 [  576 14657]]

We can also check the classification report:


from sklearn.metrics import classification_report

print(classification_report(df['Positive'], df['sentiment']))

Classification report

As you can see, the overall accuracy of this rule-based sentiment analysis model is 79%. Since this is labeled data, you can also try building an ML model to see whether an ML-based approach results in better accuracy.
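The accuracy figure can be checked by hand from the confusion matrix printed earlier. The sketch below assumes scikit-learn's layout (rows are true labels 0 and 1, columns are predicted labels 0 and 1); the variable names are illustrative.

```python
# Confusion matrix from the output above, in scikit-learn's layout:
# rows are true labels (0, 1), columns are predicted labels (0, 1).
matrix = [[1131, 3636],
          [576, 14657]]

tn, fp = matrix[0]   # true label 0: predicted 0, predicted 1
fn, tp = matrix[1]   # true label 1: predicted 0, predicted 1
total = tn + fp + fn + tp

accuracy = (tp + tn) / total
precision_pos = tp / (tp + fp)   # share of predicted positives that are correct
recall_pos = tp / (tp + fn)      # share of actual positives that were found
recall_neg = tn / (tn + fp)      # share of actual negatives that were found

print(f"accuracy:      {accuracy:.3f}")
print(f"precision (1): {precision_pos:.3f}")
print(f"recall (1):    {recall_pos:.3f}")
print(f"recall (0):    {recall_neg:.3f}")
```

The overall accuracy rounds to 79%, but recall for the negative class is far lower than for the positive class, which the single accuracy number hides.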


Conclusion
NLTK is a powerful and flexible library for performing sentiment analysis and other natural language processing tasks in Python. By using NLTK, we can preprocess text data, convert it into a bag of words model, and perform sentiment analysis using Vader's sentiment analyzer. 

Through this tutorial, we have explored the basics of NLTK sentiment analysis, including preprocessing text data, creating a bag of words model, and performing sentiment analysis using NLTK Vader. We have also discussed the advantages and limitations of NLTK sentiment analysis, and provided suggestions for further reading and exploration.

Overall, NLTK is a powerful and widely used tool for performing sentiment analysis and other natural language processing tasks in Python. By mastering the techniques and tools presented in this tutorial, you can gain valuable insights into the sentiment of text data and use these insights to make data-driven decisions in a wide range of applications.

4. Discuss the process of classifying sentiments into the specified categories:
"rude," "normal," "insult," and "sarcasm." Explain any techniques or
algorithms employed for this classification task.
answer: The advancement of internet technology and the tremendous growth in users' online activities and social media networks have led to the generation of an unprecedented volume of data. The data that users generate through their online activities, whether in the form of text, images, music, videos, log files, reviews, etc., typically comes from a variety of sources, is voluminous, and includes structured as well as unstructured data. Processing and analysing these types of unstructured and structured data has a great impact on the big data field [1]. Such data can be analysed for decision making using machine learning, data mining, web mining and text mining techniques. However, because the data can be voluminous, extracting patterns from it is quite a difficult process.
Further, microblogging services like Twitter, YouTube, Instagram, Facebook, Snapchat, WhatsApp, LinkedIn, blogs, wikis, etc., support a variety of data formats, with or without proper grammatical rules, as well as short texts written without regard to grammar [2]. Fig. 1 shows the percentage of users on social network platforms. The information (opinions) [3], [4] shared by users on these platforms can be used for analysing opinions about products and political movements, financial and political forecasting, monitoring company strategies, marketing analysis, disseminating news, crime forecasting, product preferences, tracing terrorist activities, e-health and e-tourism, monitoring reputations, detecting hate speech in public forums, etc. To find meaningful information in the text (corpus) or data coming from public forums, Natural Language Processing (NLP) techniques are used [5].


Fig. 1. Active users and their percentage in social networks.

The advent of social media and online forums has revolutionized the way people communicate and express their opinions. However, this newfound freedom of expression has also given rise to the proliferation of hate speech, cyberbullying, and offensive content, which can have severe implications for individuals and society as a whole. Identifying and curbing such harmful content has become a critical task for maintaining a respectful and safe online space. For instance, Modha et al. [6] dealt with the identification of aggression types in texts on online platforms and divided the texts into aggressive and non-aggressive. Fig. 2 depicts the percentage of hate speech texts posted on Instagram during the four quarters of the years 2020 and 2021. Kaur et al. [7] describe abusive content detection based on four categories of features, namely activity-based, user-based, context-based, and network-based features. Their survey also mentions many parameters for identifying abusive content, such as posts per day, age, gender, etc., and provides researchers with fundamental concepts and key insight areas, including recent trends and techniques. The relationship between hate speech, aggressiveness and offensive speech is discussed in [8].


Fig. 2. Actioned Hate Speech on Instagram from 2020 to 2021.

Traditional rule-based methods for hate speech detection and sentiment analysis often lack the scalability and adaptability to handle the vast amount of user-generated content on social media platforms. In contrast, machine learning and deep learning techniques have shown promising results in automating the process of identifying hate language and analyzing sentiments expressed in text data. The primary objective of this survey is to present an in-depth analysis of hate speech detection and sentiment analysis techniques, focusing on the application of machine learning and deep learning models. By exploring the challenges faced by the present approaches, this paper aims to provide researchers with insights into the evolving landscape of hate speech detection and sentiment analysis. By investigating a wide range of methodologies, and datasets, this survey seeks to shed light on the advancements made in this crucial field and highlight the challenges that lie ahead. The main contributions of this survey include:
1. Review of Datasets: We present a detailed list of datasets used for hate speech detection and sentiment analysis.

2. Review of machine learning approaches: We begin by exploring the early efforts in hate speech detection and sentiment analysis, which relied on machine learning algorithms such as Support Vector Machines (SVM), Naive Bayes (NB), Decision Trees (DT), and Logistic Regression (LR).

3. Emergence of deep learning models: Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and Transformer-based architectures like BERT and GPT, which have shown remarkable capabilities in text classification and sentiment analysis, have also been reviewed.

4. Challenges in hate speech detection: We then delve into the unique challenges faced by hate speech detection models, including the dynamic nature of language, context-dependent interpretations, and the subtleties involved in identifying sarcastic or disguised hate speech.
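To make the classical approaches concrete for the four categories in this assignment ("rude", "normal", "insult", "sarcasm"), here is a minimal sketch of a multinomial Naive Bayes classifier with add-one smoothing, written in plain Python. Everything in it is an illustrative assumption: the function names train_naive_bayes and predict are hypothetical, the tokenization is a simple whitespace split, and the eight training sentences are hand-made toy examples rather than the provided dataset.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(samples):
    """Fit a multinomial Naive Bayes model on (text, label) pairs."""
    word_counts = defaultdict(Counter)   # label -> word frequencies
    label_counts = Counter()             # label -> number of documents
    vocabulary = set()
    for text, label in samples:
        tokens = text.lower().split()
        word_counts[label].update(tokens)
        label_counts[label] += 1
        vocabulary.update(tokens)
    return word_counts, label_counts, vocabulary


def predict(model, text):
    """Return the label with the highest log-probability for `text`."""
    word_counts, label_counts, vocabulary = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        score = math.log(label_counts[label] / total_docs)   # log prior
        total_words = sum(word_counts[label].values())
        for token in text.lower().split():
            if token not in vocabulary:
                continue  # ignore words never seen in training
            # Laplace (add-one) smoothing avoids zero probabilities
            count = word_counts[label][token] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy, hand-made examples; a real model needs the labeled dataset.
toy_samples = [
    ("shut up nobody asked you", "rude"),
    ("go away and stop talking", "rude"),
    ("the meeting starts at noon", "normal"),
    ("please send the report today", "normal"),
    ("you are a worthless idiot", "insult"),
    ("what a stupid pathetic loser", "insult"),
    ("oh great another monday how wonderful", "sarcasm"),
    ("wow brilliant idea genius", "sarcasm"),
]

model = train_naive_bayes(toy_samples)
print(predict(model, "you stupid idiot"))
print(predict(model, "send the report at noon"))
```

A word-count model like this has no notion of context, which is one reason sarcasm in particular tends to need richer features or the deep learning models listed above.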


1.1. Review methodology
The main aim of this work is to review the articles that show and describe the importance of text mining and NLP to online social media networks.

1. The survey process started by gathering a large number of papers from standard academic and research search engines such as ScienceDirect, Springer, IEEE Xplore, Taylor & Francis, Google Scholar and DBLP. The key terms used in the search were: 1. “Text classification” and “NLP”, 2. “Sentiment analysis” and “NLP”, 3. “Hate speech” and “NLP”, 4. “Hate speech” and “Machine Learning”, 5. “Offensive message” and “Deep Learning”, 6. “Online social media” and “Hate speech”, 7. “Online Sources” and “Hate speech”, 8. “Sentiment analysis” and “Hate Speech”, 9. “Abusive content” and “Social Networks”. This search strategy was used to collect an initial set of recently published articles, and it was then expanded by identifying further articles cited by this initial set.

2. Further, we refined our survey by exploring more than 100 articles published over the past decade. We also investigated the importance of hate speech detection through the Statista web platform to learn about the activities, posts, and classification of messages on online platforms such as Twitter, Facebook, Instagram, etc. In the same way, this survey covers artificial intelligence techniques, in particular deep learning and machine learning approaches, for hate speech detection and sentiment analysis.

3. An in-depth investigation of deep learning and machine learning approaches to hate speech detection is carried out by presenting datasets. This study focuses on showing the issues present in hate speech detection from users’ posts, blogs, etc.

4. Additionally, this survey covers the importance of and need for hate speech detection and sentiment analysis in day-to-day life and, subsequently, addresses the need for highly sophisticated machine learning and deep learning approaches to analyse and classify data from online social media.

The rest of the article is organized as follows: Section 2 presents an overview of data preprocessing and datasets used for hate speech detection and sentiment analysis. This section enumerates the datasets with their class labels and languages. In Section 3, we present the role of machine learning and deep learning algorithms in the present study. In addition, we tabulate the models developed for hate speech detection and sentiment analysis along with the details of the datasets used in those models. The challenges in the objectives of the proposed study are enumerated in Section 4. Finally, we conclude our work and present future research directions of the proposed study in Section 5.

2. Preliminary steps for detecting hate speech in text
2.1. Data acquisition
Comments on social media websites such as Twitter, Facebook, YouTube, etc., are not always good for users; in some cases the posts may contain rude or hateful words. On social media, offensive remarks might include indiscriminate slang, abusive language and vulgarity. Because of the drastic increase in online resources, data collection depends heavily on the type of media used to share the content, and the data format is also important for analysing the data. Twitter, Sina Weibo and other microblogging services have made their Application Programming Interfaces (APIs) available to extract public data from their sites. Twitter provides a REST API for static data such as user profiles and a Streaming API for streaming data such as tweets [1]. The Twitter4J API [2] is used to extract streaming tweets. The Facebook Graph API and Tencent API are also made available by Facebook and Sina Weibo, respectively. These APIs are also used to collect articles as well as other data from their sites for further analysis.

2.2. Data pre-processing
Data pre-processing is the first and foremost step, and it includes data cleaning, tokenization, stop word removal, normalization, etc. In data cleaning, links, punctuation marks, hashtags, and numeric characters are all regarded as non-essential in NLP and are removed. However, eliminating punctuation and hashtags, for example, may not be the most effective technique to clean up text information. Punctuation marks can be used as alternative emojis to represent the users' feelings, and hashtags typically contain extensive semantic meaning that could be useful for detecting abusive comments. As a result, the pre-processing step has been tested with and without the data cleaning step to see how it affects the outcomes.
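The idea of running the pipeline with and without parts of the cleaning step can be sketched with a small helper that makes hashtag and punctuation removal optional. This is only an illustration under stated assumptions: the function name clean_text, its flags, and the regular expressions are hypothetical, and the letter filter assumes English text.

```python
import re

URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
HASHTAG_PATTERN = re.compile(r"#\w+")


def clean_text(text, keep_hashtags=False, keep_punctuation=False):
    """Lowercase the text and strip links, optionally keeping hashtags
    and punctuation/digits so both variants can be compared."""
    text = URL_PATTERN.sub(" ", text.lower())   # links are always dropped
    if not keep_hashtags:
        text = HASHTAG_PATTERN.sub(" ", text)
    if not keep_punctuation:
        # Keep only letters and whitespace (plus '#' when hashtags are kept)
        unwanted = r"[^a-z\s#]" if keep_hashtags else r"[^a-z\s]"
        text = re.sub(unwanted, " ", text)
    return " ".join(text.split())               # collapse repeated whitespace


sample = "Check this out!!! https://example.com #NoHate 2024 :("
print(clean_text(sample))
print(clean_text(sample, keep_hashtags=True))
print(clean_text(sample, keep_hashtags=True, keep_punctuation=True))
```

Training the same classifier on each variant and comparing the evaluation metrics is one way to test whether hashtags and emoticon-like punctuation carry useful signal for a given dataset.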

Tokenization is the process that divides the text data into words and sentences, which are referred to as tokens. These tokens aid in the comprehension of the context or the development of the NLP model. By evaluating the sequence of words, tokenization aids in understanding the context of the text. The comments can be tokenized based on punctuation marks, whitespaces, etc. Stop words like formatting tags, numerals, pronouns, prepositions, conjunctions, and auxiliary verbs can be eliminated from the comments. Text is normalized to lessen its unpredictability and move it closer to a defined standard. As a result, the amount of variation in the data is reduced, and efficiency could be enhanced.
5. Evaluate the effectiveness of the sentiment extraction algorithm on the
provided sample dataset. Consider metrics such as accuracy, precision, recall and F1-score.
answer: Terminology of a specific domain is often difficult to start with. With a software engineering background, machine learning has many such terms that I find I need to remember to use the tools and read the articles.

Some basic terms are Precision, Recall, and F1-Score. These relate to getting a finer-grained idea of how well a classifier is doing, as opposed to just looking at overall accuracy. Writing an explanation forces me to think it through, and helps me remember the topic myself. That’s why I like to write these articles.

I am looking at a binary classifier in this article. The same concepts do apply more broadly, just require a bit more consideration on multi-class problems. But that is something to consider another time.

Before going into the details, an overview figure is always nice:


Hierarchy of Metrics from raw measurements / labeled data to F1-Score. Image by Author.
At first look, it is a bit of a messy web. No need to worry about the details for now, but we can look back at this during the following sections when explaining the details from the bottom up. The metrics form a hierarchy starting with the true/false negatives/positives (at the bottom), and building up all the way to the F1-score to bind them all together. Let's build up from there.

True/False Positives and Negatives
A binary classifier can be viewed as classifying instances as positive or negative:

Positive: The instance is classified as a member of the class the classifier is trying to identify. For example, a classifier looking for cat photos would classify photos with cats as positive (when correct).
Negative: The instance is classified as not being a member of the class we are trying to identify. For example, a classifier looking for cat photos should classify photos with dogs (and no cats) as negative.
The basis of precision, recall, and F1-Score comes from the concepts of True Positive, True Negative, False Positive, and False Negative. The following table illustrates these (consider value 1 to be a positive prediction):


Examples of True/False Positive and Negative
True Positive (TP)
The following table shows 3 examples of a True Positive (TP). The first row is a generic example, where 1 represents the Positive prediction. The following two rows are examples with labels. Internally, the algorithms would use the 1/0 representation, but I used labels here for a more intuitive understanding.


Examples of True Positive (TP) relations.
False Positive (FP)
These False Positive (FP) examples illustrate wrong predictions: predicting Positive for samples that are actually Negative. Such a failed prediction is called a False Positive.


True Negative (TN)
For the True Negative (TN) example, the cat classifier correctly identifies a photo as not having a cat in it, and the medical image as the patient having no cancer. So the prediction is Negative and correct (True).


False Negative (FN)
In the False Negative (FN) case, the classifier has predicted a Negative result, while the actual result was positive. Like no cat when there is a cat. So the prediction was Negative and wrong (False). Thus it is a False Negative.


Confusion Matrix
A confusion matrix is sometimes used to illustrate classifier performance based on the above four values (TP, FP, TN, FN). These are plotted against each other to show a confusion matrix:


Confusion Matrix. Image by Author.
Using the cancer prediction example, a confusion matrix for 100 patients might look something like this:


Confusion matrix for the cancer example. Image by Author.
This example has:

TP: 45 positive cases correctly predicted
TN: 25 negative cases correctly predicted
FP: 18 negative cases are misclassified (wrong positive predictions)
FN: 12 positive cases are misclassified (wrong negative predictions)
Thinking about this for a while, there are different severities to the different errors here. Classifying someone who has cancer as not having it (false negative, denying treatment), is likely more severe than classifying someone who does not have it as having it (false positive, consider treatment, do further tests).

As the severity of different kinds of mistakes varies across use cases, the metrics such as Accuracy, Precision, Recall, and F1-score can be used to balance the classifier estimates as preferred.

Accuracy
The base metric used for model evaluation is often Accuracy, describing the number of correct predictions over all predictions:


Accuracy Formulas. Image by Author.
These three show the same formula for calculating accuracy, but in different wording. From more formalized to more intuitive (my opinion). In the above cancer example, the accuracy would be:

(TP+TN)/DatasetSize=(45+25)/100=0.7=70%.
This is perhaps the most intuitive of the model evaluation metrics, and thus commonly used. But often it is useful to also look a bit deeper.

Precision
Precision is a measure of how many of the positive predictions made are correct (true positives). The formula for it is:


Precision formulas. Image by Author.
All three above are again just different wordings of the same, with the last one using the cancer case as a concrete example. In this cancer example, using the values from the above example confusion matrix, the precision would be:

45/(45+18)=45/63=0.714=71.4%.
Recall / Sensitivity
Recall is a measure of how many of the positive cases the classifier correctly predicted, over all the positive cases in the data. It is sometimes also referred to as Sensitivity. The formula for it is:


Recall formulas. Image by Author.
Once again, this is just the same formula worded three different ways. For the cancer example, using the confusion matrix data, the recall would be:

45/(45+12)=45/57=0.789=78.9%.
Specificity
Specificity is a measure of how many of the negative cases the classifier correctly predicted as negative (true negatives), over all the negative cases in the data. The formula for it is:


Specificity formulas. Image by Author.
In the above medical example, the specificity would be:

25/(25+18)=0.581=58.1%.
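The four calculations above can be reproduced in a few lines. This is a small sketch using the counts from the cancer example; the variable names are illustrative.

```python
# Counts from the cancer example confusion matrix
tp, tn, fp, fn = 45, 25, 18, 12

accuracy = (tp + tn) / (tp + tn + fp + fn)   # correct predictions over all predictions
precision = tp / (tp + fp)                   # correct positives over predicted positives
recall = tp / (tp + fn)                      # correct positives over actual positives
specificity = tn / (tn + fp)                 # correct negatives over actual negatives

for name, value in [("accuracy", accuracy), ("precision", precision),
                    ("recall", recall), ("specificity", specificity)]:
    print(f"{name}: {value:.3f}")
```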
F1-Score
F1-Score is a measure combining both precision and recall. It is generally described as the harmonic mean of the two. Harmonic mean is just another way to calculate an “average” of values, generally described as more suitable for ratios (such as precision and recall) than the traditional arithmetic mean. The formula used for F1-score in this case is:


F1-Score formula. Image by Author.
The idea is to provide a single metric that weights the two ratios (precision and recall) in a balanced way, requiring both to have a higher value for the F1-score value to rise. For example, a Precision of 0.01 and Recall of 1.0 would give :

an arithmetic mean of (0.01+1.0)/2=0.505,
F1-score (formula above) of 2*(0.01*1.0)/(0.01+1.0)=~0.02.
This is because the F1-score is much more sensitive to one of the two inputs having a low value (0.01 here). Which makes it great if you want to balance the two.

Some advantages of F1-score:

Very small precision or recall will result in lower overall score. Thus it helps balance the two metrics.
If you choose your positive class as the one with fewer samples, F1-score can help balance the metric across positive/negative samples.
As illustrated by the first figure in this article, it combines many of the other metrics into a single one, capturing many aspects at once.
In the cancer example further above, the F1-score would be

2 * (0.714*0.789)/(0.714+0.789)=0.75 = 75%
Exploring F1-score
I find it easiest to understand concepts by looking at some examples. First a function in Python to calculate F1-score:


Python implementation of the F1-score formula. Image by Author.
To compare different combinations of precision and recall, I generate example values for precision and recall in range of 0 to 1 with steps of 0.01 (100 values of 0.01, 0.02, 0.03, … , 1.0):


Generating example values for precision and recall. Image by Author.
This produces a list for both precision and recall to experiment with:


Generated precision and recall values. Image by Author.
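The code images from the original article are not reproduced here, so the following is a sketch of what such a helper and value list might look like, not the author's implementation. The name f1_score and the choice to return 0.0 when both inputs are zero are assumptions.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# 100 example values: 0.01, 0.02, ..., 1.0
values = [i / 100 for i in range(1, 101)]

print(len(values), values[0], values[-1])
print(round(f1_score(0.714, 0.789), 2))   # cancer example
print(round(f1_score(0.01, 1.0), 2))      # one very low input
```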
F1-score when precision=recall
To see what is the F1-score if precision equals recall, we can calculate F1-scores for each point 0.01 to 1.0, with precision = recall at each point:


Calculating F1-Score for the example values, where precision = recall at each 100 points. Image by Author.

F1-score when precision = recall. F1-score equals precision and recall at each point when p=r. Image by Author.
F1-score equals precision and recall if the two input metrics (P&R) are equal. The Difference column in the table shows the difference between the smaller value (Precision/Recall) and F1-score. Here they are equal, so no difference, in following examples they start to vary.

F1-score when Recall = 1.0, Precision = 0.01 to 1.0
So, the F1-score should handle reasonably well cases where one of the inputs (P/R) is low, even if the other is very high.

Let's try setting Recall to the maximum of 1.0 and varying Precision from 0.01 to 1.0:


Calculating F1-Score when recall is always 1.0 and precision varies from 0.01 to 1.0. Image by Author.

F1-score when recall = 1.0 and precision varies from 0.01 to 1.0. Image by Author.
As expected, the F1-score stays low when one of the two inputs (Precision / Recall) is low. The difference column shows how the F1-score in this case rises a bit faster than the smaller input (Precision here), gaining more towards the middle of the chart, weighted up a bit by the bigger value (Recall here). However, it never goes very far from the smaller input, balancing the overall score based on both inputs. These differences can also be visualized on the figure (difference is biggest at the vertical red line):


F1-Score with precision = 1.0, recall = 0–1.0 with highlighted points. Image by Author.
F1-score when Precision = 1.0 and Recall = 0.01 to 1.0
If we swap the roles of Precision and Recall in the above example, we get the same result (due to F1-score formula):


Calculating F1-Score when precision is always 1.0 and recall varies from 0.01 to 1.0. Image by Author.

F1-score when precision = 1.0 and recall varies from 0.01 to 1.0. Image by Author.
This is to say, regardless of which one is higher or lower, the overall F1-score is impacted in the exact same way (which seems quite obvious in the formula but easy to forget).

F1-score when Precision=0.8 and Recall = 0.01 to 1.0
Besides fixing one input at maximum, let's try a bit lower. Here precision is fixed at 0.8, while Recall varies from 0.01 to 1.0 as before:


Calculating F1-Score when precision is always 0.8 and recall varies from 0.01 to 1.0. Image by Author.

F1-score when precision = 0.8 and recall varies from 0.01 to 1.0. Image by Author.
The top score with inputs (0.8, 1.0) is 0.89. The rising curve shape is similar as the Recall value rises. At the maximum of Recall = 1.0, it achieves a value about 0.1 (or 0.09) higher than the smaller value (0.89 vs 0.8).

F1-score when Precision=0.1 and Recall=0.01 to 1.0
And if we fix one value near minimum at 0.1?


Calculating F1-Score when precision is always 0.1 and recall varies from 0.01 to 1.0. Image by Author.

F1-score when precision = 0.1 and recall varies from 0.01 to 1.0. Image by Author.
Because one of the two inputs is always low (0.1), the F1-score never rises very high. However, interestingly, at its maximum it again rises to about 0.08 above the smaller input (Precision = 0.1, F1-score = 0.18). This is quite similar to the fixed value of Precision = 0.8 above, where the maximum value reached was 0.09 higher than the smaller input.
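The two fixed-precision experiments can be re-run with a short sweep. This is a sketch under the same assumptions as the article's description (recall from 0.01 to 1.0 in steps of 0.01); the helper f1_score and the dictionary best_scores are illustrative names.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


recalls = [i / 100 for i in range(1, 101)]  # 0.01, 0.02, ..., 1.0

best_scores = {}
for fixed_precision in (0.8, 0.1):
    scores = [f1_score(fixed_precision, r) for r in recalls]
    best_scores[fixed_precision] = max(scores)
    # Gap between the best F1-score and the smaller input (the fixed precision)
    gap = best_scores[fixed_precision] - fixed_precision
    print(f"precision={fixed_precision}: max F1={best_scores[fixed_precision]:.2f}, gap={gap:.2f}")
```

The maximum is reached at Recall = 1.0 in both cases, because the F1-score increases with either input while the other is held fixed.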

Focusing F1-score on precision or recall
Besides the plain F1-score, there is a more generic version, called Fbeta-score. F1-score is a special instance of Fbeta-score, where beta=1. It allows one to weight the precision or recall more, by adding a weighting factor. I will not go deeper into that in this post, however, it is something to keep in mind.
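For reference, the Fbeta-score is usually written as (1 + beta^2) * P * R / (beta^2 * P + R). The sketch below implements that formula; the function name fbeta_score and the example inputs are illustrative assumptions.

```python
def fbeta_score(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall.

    beta > 1 weights recall more heavily, beta < 1 weights precision more.
    Returns 0.0 when both inputs are zero.
    """
    if precision == 0 and recall == 0:
        return 0.0
    beta_sq = beta ** 2
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)


# Example with low precision and perfect recall
precision, recall = 0.5, 1.0
for beta in (0.5, 1.0, 2.0):
    print(f"beta={beta}: {fbeta_score(precision, recall, beta):.3f}")
```

With these inputs the score moves toward the recall value as beta grows and toward the precision value as beta shrinks.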

F1-score vs Accuracy
Accuracy is commonly described as a more intuitive metric, with F1-score better addressing a more imbalanced dataset. So how does the F1-score (F1) vs Accuracy (ACC) compare across different types of data distributions (ratios of positive/negative)?

Imbalance: Few Positive Cases
In this example, there is an imbalance of 10 positive cases, and 90 negative cases, with different TN, TP, FN, and FP values for a classifier to calculate F1 and ACC:


F1-score vs accuracy with varying prediction rates and imbalanced data. Image by Author.
The maximum accuracy with the class imbalance is with a result of TN=90 and TP=10, as shown on row 2.

In each case where TP = 0, the Precision and Recall both become 0, and F1-score cannot be calculated (division by 0). Such cases can be scored as F1-score = 0, or the classifier can generally be marked as useless, because it cannot predict any correct positive result. This is rows 0, 4, and 8 in the above table. These also illustrate some cases of high Accuracy for a broken classifier (e.g., row 0 with 90% Accuracy while always predicting only negative).

The remaining rows illustrate how the F1-score is reacting much better to the classifier making more balanced predictions. For example, F1-score=0.18 vs Accuracy = 0.91 on row 5, to F1-score=0.46 vs Accuracy = 0.93 on row 7. This is only a change of 2 positive predictions, but as it is out of 10 possible, the change is actually quite large, and the F1-score emphasizes this (and Accuracy sees no difference to any other values).
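Since the table image is not reproduced here, the quoted values can be re-derived from counts. The sketch below uses 10 positive and 90 negative cases; the specific TP/TN/FP/FN combinations are assumptions chosen to match the numbers quoted in the text, and the helper name accuracy_and_f1 is hypothetical.

```python
def accuracy_and_f1(tp, tn, fp, fn):
    """Return (accuracy, F1-score); F1 is scored as 0.0 when undefined."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, f1


# 10 positive and 90 negative cases in every scenario
scenarios = {
    "always negative": dict(tp=0, tn=90, fp=0, fn=10),
    "1 of 10 positives found": dict(tp=1, tn=90, fp=0, fn=9),
    "3 of 10 positives found": dict(tp=3, tn=90, fp=0, fn=7),
}

results = {name: accuracy_and_f1(**counts) for name, counts in scenarios.items()}
for name, (accuracy, f1) in results.items():
    print(f"{name}: accuracy={accuracy:.2f}, F1={f1:.2f}")
```

Accuracy barely moves across the three scenarios, while the F1-score goes from 0 to roughly 0.18 and then 0.46.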

Balance 50/50 Positive and Negative cases:
How about when the datasets are more balanced? Here are similar values for a balanced dataset with 50 negative and 50 positive items:


F1-score vs accuracy with varying prediction rates and balanced data. Image by Author.
F1-score is still a slightly better metric here, when there are only very few (or none) of the positive predictions. But the difference is not as huge as with imbalanced classes. In general, it is still always useful to look a bit deeper into the results, although in balanced datasets, a high accuracy is usually a good indicator of a decent classifier performance.

Imbalance: Few Negative Cases
Finally, what happens if the minority class is measured as the negative and not positive? F1-score no longer balances it but rather the opposite. Here is an example with 10 negative cases and 90 positive cases:


F1-score vs Accuracy when the positive class is the majority class. Image by Author.
For example, row 5 has only 1 correct prediction out of 10 negative cases, but the F1-score is still around 95%, so very good and even higher than accuracy. When the same ratio applied to the positive cases as the minority, the F1-score was 0.18; now it is 0.95. The earlier value was a much better indicator of quality than this one.

This result with minority negative cases is because of how the formula to calculate F1-score is defined over precision and recall (emphasizing positive cases). If you look back at the figure illustrating the metrics hierarchy at the beginning of this article, you will see how True Positives feed into both Precision and Recall, and from there to F1-score. The same figure also shows how True Negatives do not contribute to F1-score at all. This seems to be visible here if you reverse the ratios and have fewer true negatives.
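
A small sketch makes the asymmetry concrete. It uses the equivalent form F1 = 2·TP / (2·TP + FP + FN), in which True Negatives never appear, and then relabels the minority class as positive. The counts are assumptions chosen to match the scores quoted above (90 positive and 10 negative cases, only 1 negative predicted correctly), and the function names are illustrative.

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 = 2*TP / (2*TP + FP + FN); TN is not an input at all."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def swap_classes(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Treat the former negative class as positive: TP<->TN and FP<->FN."""
    return {"tp": tn, "tn": tp, "fp": fn, "fn": fp}


counts = {"tp": 90, "tn": 1, "fp": 9, "fn": 0}  # assumed counts
majority_positive = f1_from_counts(counts["tp"], counts["fp"], counts["fn"])
swapped = swap_classes(**counts)
minority_positive = f1_from_counts(swapped["tp"], swapped["fp"], swapped["fn"])
print(round(majority_positive, 2), round(minority_positive, 2))  # 0.95 0.18
```

The classifier and its predictions are identical in both cases; only the choice of which class counts as positive changes the score.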

So, as usual, I believe it is good to keep in mind how to represent your data, and do your own data exploration, not blindly trusting any single metric.

Conclusions
So what are these metrics good for?

The traditional Accuracy is a good measure if you have quite balanced datasets and are interested in all types of outputs equally. I like to start with it in any case, as it is intuitive, and dig deeper from there as needed.

Precision is great to focus on if you want to minimize false positives. For example, you build a spam email classifier. You want to see as little spam as possible. But you do not want to miss any important, non-spam emails. In such cases, you may wish to aim for maximizing precision.

Recall is very important in domains such as medical (e.g., identifying cancer), where you really want to minimize the chance of missing positive cases (predicting false negatives). These are typically cases where missing a positive case has a much bigger cost than wrongly classifying something as positive.

Neither precision nor recall is necessarily useful alone, since we are generally interested in the overall picture. Accuracy is always good to check as one option. F1-score is another.

F1-score combines precision and recall, and works also for cases where the datasets are imbalanced as it requires both precision and recall to have a reasonable value, as demonstrated by the experiments I showed in this post. Even if you have a small number of positive cases vs negative cases, the formula will weight the metric value down if the precision or recall of the positive class is low.

Besides these, there are various other metrics and ways to explore your results. A popular and very useful approach is also use of ROC- and precision-recall curves. These allow fine-tuning the evaluation thresholds according to what type of error we want to minimize. But that is a different topic to explore.
6. Propose potential enhancements or modifications to improve the
performance of the sentiment extraction algorithm. Justify your recommendations.
answer: Sentiment analysis is the process of classifying whether a block of text is positive, negative, or neutral. The goal of sentiment mining is to analyse people’s opinions in a way that can help businesses expand. It focuses not only on polarity (positive, negative, and neutral) but also on emotions (happy, sad, angry, etc.). It uses various Natural Language Processing approaches, such as rule-based, automatic, and hybrid.

Let’s consider a scenario: we want to analyze whether a product is satisfying customer requirements, or whether there is a need for this product in the market. We can use sentiment analysis to monitor that product’s reviews. Sentiment analysis is also efficient when there is a large set of unstructured data that we want to classify by automatically tagging it. Net Promoter Score (NPS) surveys are used extensively to learn how a customer perceives a product or service. Sentiment analysis has also gained popularity because it can process large volumes of NPS responses and obtain consistent results quickly.

Why is Sentiment Analysis Important?
Sentiment analysis is the contextual mining of words that indicates the social sentiment of a brand. It also helps a business determine whether the product it is manufacturing is going to be in demand in the market.


According to the survey, 80% of the world’s data is unstructured. Whether it is in the form of emails, texts, documents, articles, or something else, this data needs to be analyzed and brought into a structured form.

Sentiment analysis is required because it handles this data in an efficient and cost-friendly manner.
Sentiment analysis can also help solve real-time issues as they arise.
Here are some key reasons why sentiment analysis is important for business:

Customer Feedback Analysis: Businesses can analyze customer reviews, comments, and feedback to understand the sentiment behind them, which helps in identifying areas for improvement and addressing customer concerns, ultimately enhancing customer satisfaction.
Brand Reputation Management: Sentiment analysis allows businesses to monitor their brand reputation in real time. By tracking mentions and sentiments on social media, review platforms, and other online channels, companies can respond promptly to both positive and negative sentiments, mitigating potential damage to their brand.
Product Development and Innovation: Understanding customer sentiment helps identify features and aspects of their products or services that are well-received or need improvement. This information is invaluable for product development and innovation, enabling companies to align their offerings with customer preferences.
Competitor Analysis: Sentiment analysis can be used to compare the sentiment around a company’s products or services with those of competitors. Businesses can identify their strengths and weaknesses relative to competitors, allowing for strategic decision-making.
Marketing Campaign Effectiveness: Businesses can evaluate the success of their marketing campaigns by analyzing the sentiment of online discussions and social media mentions. Positive sentiment indicates that the campaign is resonating with the target audience, while negative sentiment may signal the need for adjustments.
What are the Types of Sentiment Analysis?
Fine-Grained Sentiment Analysis
This type grades polarity on a finer scale: very positive, positive, neutral, negative, or very negative. The rating is given on a scale of 1 to 5; for example, a rating of 5 is very positive, 3 is neutral, and 2 is negative.

Emotion detection
Sentiments such as happy, sad, angry, upset, jolly, and pleasant come under emotion detection. It is commonly implemented with a lexicon-based method of sentiment analysis.

Aspect-Based Sentiment Analysis
It focuses on particular aspects of a product or service. For instance, if a person wants to evaluate a cell phone, aspect-based analysis checks the sentiment towards individual aspects such as the battery, screen, and camera quality.

Multilingual Sentiment Analysis
Multilingual sentiment analysis covers text in different languages, where each text still needs to be classified as positive, negative, or neutral. This is highly challenging and comparatively difficult.

How does Sentiment Analysis work?
Sentiment analysis in NLP is used to determine the sentiment expressed in a piece of text, such as a review, comment, or social media post.

The goal is to identify whether the expressed sentiment is positive, negative, or neutral. Let’s walk through the process in two general steps:

Preprocessing
The process starts with collecting the text data that needs to be analysed for sentiment, such as customer reviews, social media posts, news articles, or any other form of textual content. The collected text is then pre-processed to clean and standardize the data, with tasks such as:


Removing irrelevant information (e.g., HTML tags, special characters).
Tokenization: Breaking the text into individual words or tokens.
Removing stop words (common words like “and,” “the,” etc. that don’t contribute much to sentiment).
Stemming or Lemmatization: Reducing words to their root form.
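
The preprocessing tasks listed above can be sketched with the Python standard library alone. This is a simplified illustration, not a production pipeline: the stop-word list is a tiny hypothetical sample, and `naive_stem` is a crude suffix stripper standing in for a real stemmer or lemmatizer (such as those provided by NLTK).

```python
import html
import re

# Assumption: a tiny illustrative stop-word list; real lists are much longer.
STOP_WORDS = {"a", "an", "and", "the", "is", "are", "was", "to", "of", "in", "it", "this"}
# Assumption: crude suffix list used instead of a real stemming algorithm.
SUFFIXES = ("ing", "ed", "ly", "es", "s")


def strip_markup(text: str) -> str:
    """Remove HTML tags and special characters, and lowercase the text."""
    text = html.unescape(text)
    text = re.sub(r"<[^>]+>", " ", text)
    return re.sub(r"[^a-z\s]", " ", text.lower())


def naive_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text: str) -> list[str]:
    """Clean, tokenize on whitespace, drop stop words, then stem."""
    tokens = strip_markup(text).split()
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]


print(preprocess("<p>The delivery was delayed and the packaging is damaged!</p>"))
# ['delivery', 'delay', 'packag', 'damag']
```

The stems are not always dictionary words ("packag", "damag"); that is typical of stemming, whereas lemmatization would return real base forms.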
Analysis
Text is converted into numerical form for analysis using techniques like bag-of-words or word embeddings (e.g., Word2Vec, GloVe). Models are then trained with labeled datasets, associating text with sentiments (positive, negative, or neutral).

After training and validation, the model predicts sentiment on new data, assigning labels based on learned patterns.

What are the Approaches to Sentiment Analysis?
There are three main approaches used:

Rule-based
The rule-based approach includes the lexicon method, tokenization, and parsing. It counts the number of positive and negative words in the given text. If the number of positive words is greater than the number of negative words, the sentiment is positive; otherwise it is negative.
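
A minimal sketch of this word-counting rule follows. The two word lists are tiny hypothetical lexicons, and returning "neutral" on a tie is an added assumption, since the description above only covers the case where one count is larger.

```python
# Assumption: tiny illustrative lexicons; real ones contain thousands of entries.
POSITIVE = {"good", "great", "love", "excellent", "happy"}
NEGATIVE = {"bad", "poor", "hate", "terrible", "sad"}


def rule_based_sentiment(text: str) -> str:
    """Label text by comparing counts of positive and negative lexicon words."""
    words = [w.strip(".,!?") for w in text.lower().split()]
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "neutral"  # assumption: ties (including no matches) are neutral


print(rule_based_sentiment("Great camera, but terrible battery and poor screen."))
# negative
```

The example also shows the main weakness of pure counting: it ignores negation, sarcasm, and which aspect each word refers to.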

Machine Learning
This approach relies on machine learning techniques. First, features (words) are extracted from the text; a model is then trained on a labeled dataset and used for prediction. Classification can be done using techniques such as Naive Bayes, Support Vector Machines, hidden Markov models, and conditional random fields.
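
To illustrate one of the named techniques, here is a from-scratch sketch of a multinomial Naive Bayes classifier over word counts with add-one smoothing. The class name, the four training sentences, and the labels are hypothetical; in practice a library implementation and a much larger labeled dataset would be used.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes over bag-of-words counts with add-one smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.lower().split())
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            total_words = sum(self.word_counts[label].values())
            score = math.log(n_docs / total_docs)  # log prior
            for word in text.lower().split():
                if word not in self.vocab:
                    continue  # ignore words never seen in training
                count = self.word_counts[label][word]
                score += math.log((count + 1) / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical toy training set.
train_texts = [
    "i love this phone",
    "great battery and great screen",
    "i hate the slow delivery",
    "terrible screen and poor battery",
]
train_labels = ["positive", "positive", "negative", "negative"]

model = NaiveBayes().fit(train_texts, train_labels)
print(model.predict("great phone"))    # positive
print(model.predict("poor delivery"))  # negative
```

The same structure extends to more than two classes, since the prediction loop simply picks the label with the highest log-probability.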

Neural Network
In the last few years, neural networks have evolved at a very fast rate. This approach uses artificial neural networks, which are inspired by the structure of the human brain, to classify text into positive, negative, or neutral sentiments. Architectures such as recurrent neural networks (RNN), long short-term memory (LSTM), and gated recurrent units (GRU) are used to process sequential data like text.

Hybrid Approach
It is the combination of two or more approaches, e.g. rule-based and machine learning approaches. The advantage is that its accuracy is typically higher than that of the individual approaches.

Sentiment analysis Use Cases
Sentiment analysis has a wide range of applications, such as:

Social Media
Consider, for instance, the comments on a social media site such as Instagram: all the comments are analyzed and categorized as positive, negative, or neutral.

Nike Analyzing Instagram Sentiment for New Shoe Launch
Nike, a leading sportswear brand, launched a new line of running shoes with the goal of reaching a younger audience. To understand user perception and assess the campaign’s effectiveness, Nike analyzed the sentiment of comments on its Instagram posts related to the new shoes.

Nike collected all comments from the past month on Instagram posts featuring the new shoes.
A sentiment analysis tool was used to categorize each comment as positive, negative, or neutral.
The analysis revealed that 60% of comments were positive, 30% were neutral, and 10% were negative. Positive comments praised the shoes’ design, comfort, and performance. Negative comments expressed dissatisfaction with the price, fit, or availability.

The positive sentiment majority indicates that the campaign resonated well with the target audience. Nike can focus on amplifying positive aspects and addressing concerns raised in negative comments.
7. Reflect on the ethical considerations associated with sentiment analysis,
particularly regarding privacy, bias, and potential misuse of extracted sentiments.
answer:
Data Analysis
Data analysis is the process of collecting, cleaning, transforming, and modelling data to extract information that supports decision making. Its main use is to obtain information from raw data. It consists of the following steps: data requirement gathering, data collection, data cleaning, data analysis, data interpretation, and data visualization. First, the need for the analysis should be established. Then, the data for the research should be collected from different data sources. The next important step is data cleaning: the data should be error-free, so unwanted details such as duplicate records, white spaces, and mistakes are removed from the collected data. In the analysis step, critical analysis is done on the cleaned and processed data. After the analysis, the results are interpreted in the form of simple words, charts, tables, etc. The final step is data visualization, where the results are presented as charts, graphs, etc., as the final output. Figure 1 shows the necessary data analysis steps.

Fig. 1 Data analysis steps

Natural Language Processing
Natural Language Processing (NLP) is a method of communicating with an intelligent system using a natural language such as English. It can be used to perform many tasks on such systems. Lexical analysis involves identifying and analysing the structure of the words in a sentence. Syntactic analysis identifies the grammar and the relationships among the words. Semantic analysis extracts the exact (dictionary) meaning of the text. Discourse integration identifies the meaning of a sentence in relation to the meaning of the preceding sentence. Pragmatic analysis re-interprets what was said in terms of what was actually meant.

Sentiment Analysis
Sentiment analysis is the classification of a block of text as positive, negative, or neutral. Its main aim is to analyse people’s opinions in a way that can help businesses expand. It focuses not only on polarity (positive, negative, neutral) but also on emotions (happy, sad, angry, etc.). It uses various Natural Language Processing algorithms. It is the contextual mining of words that indicates the social sentiment of a brand, and it helps a business determine whether the product it is manufacturing is going to be in demand in the market. Figure 2 shows the necessary sentiment analysis steps.

Fig. 2 Sentiment analysis steps

Two techniques are involved in sentiment analysis:

Rule-based sentiment analysis: It uses rules and a collection of words labelled by polarity to identify the opinion expressed in a text. Sentiment values typically need to be combined with additional rules to handle sentences containing sarcasm, negations, or dependent clauses.

Machine learning-based sentiment analysis: It involves training a machine learning model on a sentiment-labelled training set so that it learns to predict polarity from the words and their order.

Social Network Analysis
Social network analysis is the process of identifying and understanding the relationships and data flows between people, groups, organizations, computers and other connected information entities. The network nodes are the people and groups, whereas the links show the relationships between the respective nodes. It provides both visual and mathematical analysis of human relationships. Researchers measure the network activity of a node using the concept of degree, the total number of direct connections a node has.

A centralized network is dominated by one or a few nodes known as central nodes. If the central nodes are damaged or removed, the network quickly fragments into unconnected sub-networks, so a central node can become a single point of failure. A centralized system built around a well-connected hub can fail if that hub is removed or disabled. Hubs are nodes with a high degree centrality.
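
The ideas of degree, hub, and single point of failure can be illustrated on a small hypothetical network. The sketch below stores the graph as an adjacency dictionary, finds the node with the highest degree, and counts connected components before and after removing it; the node names and helper functions are assumptions made for the example.

```python
from collections import deque


def degrees(graph: dict) -> dict:
    """Degree of each node: the number of direct connections it has."""
    return {node: len(neighbours) for node, neighbours in graph.items()}


def count_components(graph: dict, removed=frozenset()) -> int:
    """Count connected sub-networks, ignoring any nodes in `removed`."""
    seen, components = set(removed), 0
    for start in graph:
        if start in seen:
            continue
        components += 1
        seen.add(start)
        queue = deque([start])
        while queue:  # breadth-first search from `start`
            node = queue.popleft()
            for neighbour in graph[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
    return components


# Hypothetical centralized network: one hub plus a single side link d-e.
graph = {
    "hub": {"a", "b", "c", "d"},
    "a": {"hub"},
    "b": {"hub"},
    "c": {"hub"},
    "d": {"hub", "e"},
    "e": {"d"},
}

deg = degrees(graph)
hub = max(deg, key=deg.get)
print(hub, deg[hub])                           # hub 4
print(count_components(graph))                 # 1
print(count_components(graph, removed={hub}))  # 4
```

Removing the hub splits the single network into four unconnected sub-networks (a, b, c, and the pair d-e), which is the fragmentation described above.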

Depression
Depression is a medical condition which affects how you feel, think and act. Its symptoms include feelings of sadness and a loss of interest in activities you once enjoyed, and they may decrease your ability to function at work and at home. According to the American Psychiatric Association, depression affects one in 15 adults (6.7%) in any given year, and one in six people (16.6%) will experience depression at some time in their life. Studies report that women are more likely than men to experience depression, and that one in three women will experience a significant depressive episode in their lifetime. There is a high chance of inheriting depression when first-degree relatives (parents/children/siblings) have it.

Section 1. Deals with the Introduction to the paper.

Section 2. Describes the Preliminary Review that was done for the study.

Section 3. Explains the Outcome of Survey.

Section 4. Depicts the Summary, Conclusion and Future Work.

Section 5. Deals with the Compliance with the Ethical Standards.

Preliminary Review
A survey was conducted on papers from different research areas, which include:

Data analysis.

Social media analysis.

Natural language processing.

Sentiment analysis.

Depression detection.

Research papers were collected from the above-mentioned areas, and papers outside these areas were excluded. A total of 101 papers belonging to these areas were collected from different journals. The selected papers come from IEEE, Springer, ACM and medical journals, and span different publication years. A comparative study was conducted among the papers to identify the methods and techniques followed by different authors. Figure 3 shows the total number of papers based on the research area.

Fig. 3 Total number of papers selected based on the research areas

The research papers were studied and compared with one another. The techniques and methods in the journals were compared, and this review was written on that basis. The comparisons made in the literature survey were:

1. Texts, Emoticon and Emoji Analysis.
   Multi-class Sentiment Analysis.
   Feature Extraction Techniques.
   Emoticon and Emoji Analysis.

2. Artificial Intelligence Techniques.
   Machine Learning Techniques.
   Deep Learning Techniques.

3. Depression detection.

4. Data Source.

Texts, Emoticon Emoji Analysis
This section covers the analysis of texts, emoticons and emojis for sentiment classification. It mainly compares Binary and Ternary Classification and gives the reason for the introduction of Multi-class Classification. The various Feature Extraction Techniques are also explained, and the section closes with Emoticon and Emoji Analysis.

Multi-class Sentiment Analysis
The data are classified according to their sentiment using various Machine Learning and Deep Learning methods. Initially, sentiment was classified into two polarities or classes, Positive and Negative, which is also called Binary Classification [7]. In Tanna et al. [7], the sentiments were classified into Positive and Negative classes [84]. This allows different sectors, such as universities or businesses, to analyse users’ opinions depending on their circle. All positive values were assigned to the Positive class, whereas all negative values or words were assigned to the Negative class. After classification, the accuracy is computed for the model or algorithm. Later, Ternary Classification [1,2,3, 8] came into existence, in which sentiments are categorized into three classes: Positive, Negative and Neutral. All data that are neither Positive nor Negative are assigned to the Neutral class. In Ternary Classification, we can expect a lower accuracy than in Binary Classification.

Table 1 Multi-class classifications
Instead of Binary or Ternary Classification, the collected data can be categorized using Multi-class Classification [7, 8, 12, 13, 18, 32, 33, 40, 61, 65, 70], from which a more precise classification can be expected. Mohammed Jabreel et al. [32] state that “A multi-label problem is represented as one or more single-label (i.e., binary or multi class) problems. The single-label classifiers are basically learned and implemented, then the predictions of classifiers are transformed into multi-label predictions”. TextBlob returns the polarity and subjectivity of a sentence [7, 8]. Here, the data are classified into various subclasses [76] based on the sentiment polarity. We can conclude that Multi-class Classification performs well, as it gives a precise classification in which the data are organized into different subclasses or polarities depending on the dataset. The SentiWordNet [3, 8] dictionary is used to identify the positivity and negativity, i.e. the sentiment, of a sentence. According to Ali Shariq Imran et al. [40], sentiment analysis on tweets refers to the classification of an input tweet text into sentiment polarities, including positive, negative and neutral, whereas emotion classification refers to classifying tweet text into emotion labels including joy, surprise, sadness, fear, anger and disgust.

Table 1 shows the comparison of different classes in the Multi-class Classification. In Nofiz Al Asad et al., depression was classified as Considered Normal, Mild Depression, Borderline Depression, Moderate Depression and Severe Depression. Subhan Tariq [18] states the classes as Anxiety, ADHD, Bipolar and Depression. Reshma Radheshamjee Baheti [29] states the classes as Depression, Stress, Normal, Relax, Happy and Others. In Shakeel Ahmed et al., the classes are described as Non Extremist, Extremist, Anger, Joy, Fear, Sadness and Analytical. In Mohammed Jabreel et al., the classes were Anger, Anticipation, Disgust, Fear, Joy, Love, Optimism, Pessimism, Surprise and Trust. Mondher Bouzazi [33] states the classes as Love, Happiness, Anger, Neutral, Fun, Hate and Sadness. Ali Shariq Imran [40] described the classes as Joy, Surprise, Sad, Fear, Anger and Disgust. Jonathan G. D Harb et al. state the classes as Sadness, Anger, Neutral, Surprise, Fear and Disgust. In Alex M. G. Almeida et al. [77], the classes were Joy, Disgust, Fear, Anger, Surprise, Sadness and Neutral. The classes stated by Govin Gaikwad et al. [79] were Funny, Happy, None, Sad and Angry. In Jaewoo Kim et al. [80], the classes are Anger, Disgust, Fear, Joy and Sadness, and in Jayakrishnan et al. [82], the classes defined were Happy, Sad, Anger, Fear and Surprise. Nur Maulidiah Elfajr et al. [87] defined the classes as Happy, Sad, Cry, Exciting and Laugh. In Yongcai Tao et al. [88], the classes defined were Happiness, Like, Anger, Sadness, Fear, Disgust and Surprise.

TextBlob: TextBlob is a Python library for Natural Language Processing (NLP). It builds on the Natural Language Toolkit (NLTK) to achieve its tasks. TextBlob is a simple library which supports complex analysis and operations on textual data. It returns the polarity and subjectivity of a sentence. Polarity lies in [-1, 1], where -1 denotes a negative sentiment and 1 a positive sentiment. Negation words reverse the polarity. TextBlob has semantic labels that help with fine-grained analysis, for example emoticons, exclamation marks, and emojis. Subjectivity lies in [0, 1]. TextBlob has one more parameter, intensity, which it uses when calculating subjectivity. Intensity determines whether a word modifies the next word. For English, adverbs are used as modifiers (‘very good’).
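
The interplay of polarity, negation, and intensity can be illustrated with a toy scorer. This is not TextBlob's actual implementation: the lexicon values, the intensifier multipliers, the simple sign flip for negation, and the function name `toy_polarity` are all assumptions made for the sketch.

```python
# Assumptions: hypothetical word scores and modifier strengths.
LEXICON = {"good": 0.7, "great": 0.8, "bad": -0.7, "awful": -0.9}
INTENSIFIERS = {"very": 1.3, "extremely": 1.5}
NEGATIONS = {"not", "never", "no"}


def toy_polarity(text: str) -> float:
    """Average polarity in [-1, 1] with negation flips and intensity scaling."""
    scores = []
    weight, negate = 1.0, False
    for raw in text.lower().split():
        word = raw.strip(".,!?")
        if word in NEGATIONS:
            negate = True
        elif word in INTENSIFIERS:
            weight *= INTENSIFIERS[word]
        elif word in LEXICON:
            score = LEXICON[word] * weight
            if negate:
                score = -score
            scores.append(max(-1.0, min(1.0, score)))  # clamp to [-1, 1]
            weight, negate = 1.0, False  # modifiers apply to one scored word
    return sum(scores) / len(scores) if scores else 0.0


for phrase in ["good", "very good", "not good", "extremely great", "a table"]:
    print(phrase, "->", round(toy_polarity(phrase), 2))
```

In this sketch a modifier stays active until the next scored word, so "not a good idea" is also treated as negative; real libraries handle scope and weighting in more refined ways.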

SentiWordNet Dictionary: An opinion lexicon derived from the WordNet database in which each term is associated with numerical scores indicating positive and negative sentiment information. SentiWordNet is built in a two-stage approach. First, WordNet term relationships such as synonymy, antonymy and hyponymy are explored to extend a core of seed words that are known a priori to carry positive or negative opinion bias. After a fixed number of iterations, a subset of WordNet terms is obtained with either a positive or a negative label. Second, these labelled terms are used to train a committee of classifiers; to minimize bias, the classifiers are trained using different algorithms and different training set sizes. The predictions from the classifier committee are then used to determine the sentiment orientation of the remaining terms in WordNet.

Feature Extraction Techniques
Feature extraction reduces the initial raw data to more manageable groups for processing. It is the method of selecting and combining variables into features, which reduces the amount of data that must be processed while still describing the original data set accurately and thoroughly. The amount of redundant data for a given analysis is also reduced. In addition, the data reduction and the machine’s effort in building variable combinations (features) speed up the learning and generalization steps of the machine learning process.

Various techniques used for Feature Extraction [58] are:

Term Frequency Inverse Document Frequency (TF-IDF): TF-IDF [2, 12, 13, 18, 43] measures how important a word is to a document within a collection of documents. It is computed by multiplying two metrics: how often the word appears in the document (term frequency) and the inverse document frequency of the word across the set of documents.

Term document matrix (TDM): A term-document matrix [4, 11,12,13, 18] describes the frequency of terms that occur in a collection of documents, where columns correspond to terms and rows correspond to documents in the collection.

Bag-of-words (BOW): A bag-of-words [4, 11, 19] is a text representation that describes the occurrence of words within a document. It involves two things: (1) a vocabulary of known words and (2) a measure of the presence of the known words.

Negation Handling: The process of determining the scope of a negation and inverting the polarities of the opinionated words affected by it.

LIWC: Linguistic Inquiry and Word Count [5, 7, 16, 21, 25] is a text-processing tool that counts how many words in a text fall into various categories.

Word2Vec: Word2vec [11, 14, 38, 44, 53, 56, 65] uses a neural network model to learn word associations from a large corpus of text. Once trained, such a model can detect synonymous words or suggest additional words for a partial sentence.

N-Gram: A type of probabilistic language model [16, 43, 46] for predicting the next item in a sequence, in the form of an (n - 1)-order Markov model. An N-gram is a sequence of N words. An N-gram model is built by counting how often word sequences occur in a text corpus and then estimating the probabilities.

Word2Seq: A word sequencing [44] approach that builds sequences from the text without taking the weights of individual words into consideration. This technique maps the word sequences into a matrix with the length (input size) and height (number of observations).

Tokenization: Tokenization [8, 11, 12] is the process of splitting a sequence of characters into groups called tokens. The tokens made from a text, and their counts, can be used as features.

Stemming: Stemming [3, 8] is the process of reducing words to their base form by removing suffixes, so that words that are similar in meaning are identified as the same term. This helps to reduce redundancy.

GloVe: GloVe [11, 14, 41,42,43, 53] is an unsupervised learning algorithm for obtaining vector representations of words. It maps words into a meaningful space where the distance between words is related to their semantic similarity.

POS Vector: It calculates the Part-Of-Speech [8] tags using an array and when the noun, verb, adjective and adverb are found in the pos tag list, it increases by 1.

Gram features: After the sentence is converted into tokens, bigrams are formed and the Porter stemmer is used to stem the words; the mapping of the sentence is then calculated from these gram features [8].

Fasttext: The Fasttext [14, 15, 38] model provides unsupervised or supervised learning algorithms in which words are represented as vectors. Facebook provides pretrained models for 294 languages.

Static Word Embedding: A (static) word embedding is a function that maps each word type to a single vector. These vectors are typically dense and have much lower dimensionality than the size of the vocabulary. This mapping function typically ignores that the same string of letters may have different senses (a dining table vs. a table of contents) or parts of speech (to table a motion vs. a table). Table 2 shows the different feature extraction techniques.

Table 2 Feature extraction techniques
Emoticons and Emoji Analysis
In sentiment analysis of social media, text is the main form of data used. The data collected from the different social media platforms is pre-processed to remove unwanted data or details. Features are then extracted from the pre-processed data using feature extraction techniques. According to Fazeel Abid et al. [14], emojis and emoticons are one of the significant sources of sentiment. Users on social networks often express their feelings using different kinds of emojis and emoticons. Table 3 shows examples of emoticons.

Table 3 Emoticons examples
Like textual data, emoticons and emojis [3, 9, 11, 14] can also be used for sentiment analysis. They carry sentiment score values of their own, which can be taken into account. Li-Chen Cheng et al. [11] state that informal short messages contain unique words and symbols, including emoticons and slang, so standard natural language processing (NLP) tools cannot be used to pre-process these reviews. Emoticons and emojis are represented using various punctuation symbols and marks, which are usually removed during the data preprocessing stage. Since they carry sentiment value, they should not be removed during preprocessing. In Yong Chen et al. [9], documents from the WeChat friends circle contain emoticons that are used in perinatal depression analysis, as WeChat emoticons help in the sentiment classification of the whole record. Emoticons and emojis thus help in the study of various sentiments. For example, the NLTK tokenizer can be used to tokenize social media data into individual words while preserving all emoticons and emojis. Nur Maulidiah Elfajr et al. [87] apply a weighting score to emoticons. They assume that emoticons have more effect on a tweet and describe an emotion better than ordinary words, so each identified emoticon is given double weight. A sentiment score is then computed for each sentence: after the text is split into words and emoticons, the positive and negative scores are obtained from SentiWordNet. In Yongcai Tao et al. [88], each emoticon is converted into the distributed representation vector of the corresponding emotional type.
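The double-weighting idea attributed to Elfajr et al. [87] can be sketched as follows. The tiny lexicons `WORD_SCORES` and `EMOTICON_SCORES` and their values are hypothetical stand-ins for SentiWordNet scores, and the scoring function is only an illustration of the weighting step, not the authors' implementation.

```python
WORD_SCORES = {"good": 0.5, "bad": -0.5, "love": 0.75}   # hypothetical word lexicon
EMOTICON_SCORES = {":)": 0.5, ":(": -0.5}                # hypothetical emoticon lexicon
EMOTICON_WEIGHT = 2.0                                    # emoticons count double

def sentence_score(tokens):
    """Sum word scores plus double-weighted emoticon scores for one sentence."""
    score = 0.0
    for tok in tokens:
        if tok in EMOTICON_SCORES:
            score += EMOTICON_WEIGHT * EMOTICON_SCORES[tok]
        else:
            score += WORD_SCORES.get(tok.lower(), 0.0)
    return score

score = sentence_score(["bad", "day", ":)"])
print(score)  # the emoticon outweighs the negative word
```

With this weighting, a single emoticon can flip the polarity that the words alone would suggest, which is the behaviour the double weight is meant to produce.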

Artificial Intelligence Techniques
Artificial Intelligence (AI) is the theory and development of computer systems able to perform tasks that normally require human intelligence, such as visual perception, speech recognition, decision-making, and translation between languages. This section deals with various machine learning and deep learning techniques.

Machine Learning Techniques
Machine learning is the process of teaching a system to make accurate predictions [24, 27, 29] when fed data. An algorithm [2, 16] becomes more accurate in its predictions [2] as it learns. The basic approaches involved in machine learning are supervised learning, unsupervised learning, semi-supervised learning, and reinforcement learning. Tables 4 and 5 show a comparison of various machine learning algorithms.

Table 4 Comparison of machine learning techniques(1)
Table 5 Comparison of machine learning techniques(2)
Tables 4 and 5 compare the different machine learning techniques based on the accuracy values obtained in the classification procedures used in different research papers. In Bohang Chen et al. [1], the Support Vector Machine gives an accuracy of 74.18%. In Akshi Kumara et al. [26], Multinomial Naive Bayes scored around 77.89%, Random Forest 81.04%, Ensemble Vote Classifier 85.09% and Gradient Boosting 79.12%. Shakeel Ahmad et al. [30] state that K-nearest neighbour scored 72.0%, Random Forest 82.0%, Naive Bayesian 71.0% and SVM 79.0%. In Gonzalo A Ruz et al. [4], the accuracy using Naive Bayes was 74.2%, SVM 81.2%, Random Forest 72.5%, TAN 72.1% and BF TAN 76.4%.

According to Govin Gaikwad et al. [79], the accuracy for SVM is 82%, Naive Bayes is 64% and KNN is 73%. Jayakrishna et al. [82] state that the accuracy with SVM as the classifier is 90%. In Georgios S Solakidis et al. [89], the accuracy with Multinomial Naive Bayes is 92.2%, SVM is 93.1% and LOG is 93.2%. Mandar Despande et al. [92] state that the accuracy for Multinomial Naive Bayes is 83% and for SVM is 79%. In Rinki Chatterjee et al. [95], Naive Bayes gives an accuracy of 76.6%. Rincy Jose et al. [96] state that the accuracy using SentiWordNet was 21.05%, Naive Bayes was 69.92%, HMM was 64.06% and the Ensemble Approach was 71.46%.

1. Supervised learning [26]. Algorithms are supplied with labelled [18, 21, 22] data, and the variables the system should assess for correlations [8] are defined. Both the input and the output of the learning algorithm are specified.

2. Unsupervised learning. The algorithm trains on unlabelled data and scans through data sets looking for any meaningful connection. The data the algorithms are trained on and the predictions they output are predetermined.

3. Semi-supervised learning. A combination of unsupervised and supervised learning. Scientists may let an algorithm explore the data independently so that it develops its own understanding of the dataset.

4. Reinforcement learning. Teaching a system to complete a multi-step process for which there are defined rules. As the algorithm works to complete a task, it receives positive or negative results that determine how it should proceed. For the most part, the algorithm decides on its own which steps to take along the way.

Various machine learning techniques are:

Multinomial Naïve Bayes: Used for classification with discrete features [7, 8, 68], as the multinomial distribution [3] typically requires integer feature counts. The conditional probabilities [66] are estimated by maximum likelihood from the training data; term frequencies can be used after normalization. Manoj Sethi et al. state that the Multinomial Naïve Bayes classifier [2, 7] is a modified version of the Naive Bayes algorithm that considers distinct features, such as the frequency of words, for text classification.
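A compact from-scratch sketch shows how word counts drive the classifier. It assumes whitespace tokenization and add-one (Laplace) smoothing, which the text above does not specify; the function names and the four toy training sentences are invented for illustration, with labels borrowed from the assignment's categories.

```python
import math
from collections import Counter, defaultdict

def train_mnb(docs, labels):
    """Count documents per class and word occurrences per class."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.split())
    vocab = {word for counts in word_counts.values() for word in counts}
    return class_counts, word_counts, vocab

def predict_mnb(model, doc):
    """Return the class with the highest log posterior (add-one smoothing)."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_logp = None, -math.inf
    for label, n_docs in class_counts.items():
        logp = math.log(n_docs / total_docs)              # log prior
        total_words = sum(word_counts[label].values())
        for word in doc.split():
            if word in vocab:                             # skip unseen words
                logp += math.log((word_counts[label][word] + 1)
                                 / (total_words + len(vocab)))
        if logp > best_logp:
            best_label, best_logp = label, logp
    return best_label

# Toy training data, for illustration only.
docs = ["you are an idiot", "have a nice day", "what an idiot", "nice to meet you"]
labels = ["insult", "normal", "insult", "normal"]
model = train_mnb(docs, labels)
print(predict_mnb(model, "such an idiot"))
print(predict_mnb(model, "nice day"))
```

Working in log space avoids numeric underflow when many small probabilities are multiplied.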

Logistic Regression: Logistic Regression [26] is used when the dependent variable is binary. It describes the data and explains the relationship between one dependent binary variable and one or more nominal, ordinal, interval, or ratio-level independent variables. Manoj Sethi et al. state that a sigmoid function [2] is applied to the output to provide a probability score, which is mapped to the different classes.
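The sigmoid step can be sketched in a few lines. The weights, bias and the 0.5 decision threshold below are arbitrary assumptions for illustration; in practice the weights are learned from training data.

```python
import math

def sigmoid(z):
    """Map any real number to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(weights, bias, features):
    """Probability of the positive class for one feature vector."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

def predict_label(weights, bias, features, threshold=0.5):
    """Map the probability score to a binary class label."""
    return 1 if predict_proba(weights, bias, features) >= threshold else 0

# Hypothetical weights for two features.
print(predict_proba([1.0, -1.0], 0.0, [2.0, 2.0]))
print(predict_label([1.0, -1.0], 0.0, [3.0, 1.0]))
```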

Random Forest: A random forest [26] fits several decision tree classifiers on various sub-samples of the dataset and uses averaging to improve accuracy and control over-fitting. Manoj Sethi et al. state that the most suitable solution in Random Forest [2] is selected by voting.

Support Vector Machine: A Support Vector Machine [12, 22] is an algorithm that can be used for both classification and regression [3, 7, 68]. It plots each data item as a point in n-dimensional [2, 8, 12] space, with the value of each feature being the value of a particular coordinate. In Manoj Sethi et al. [2], classification is performed by finding the hyper-plane that differentiates the classes well.

Decision Tree: A supervised learning [2] method used for classification and regression. It creates a model that predicts the value of a target variable by learning simple decision rules inferred from the data features.

Maximum Entropy: Among the probability distributions [3] of a random variable that are consistent with what is known, the one that leaves the largest remaining uncertainty is chosen as the estimate. The classifier is a conditional probability model that allows us to predict [7, 68] class labels given a set of features for a given data point.

XGBoost: XGBoost is an implementation of gradient boosted decision trees [2] designed for speed and performance.

Bayesian networks classifiers: Bayesian networks [4, 50] are a powerful graphical model for encoding the probabilistic relationships among a set of variables, and they can be used for classification. They are ideal for taking an event that has occurred and predicting the likelihood that any one of several possible known causes was the contributing factor.

MS3VM: A multi-class semi-supervised SVM [71], implemented by the authors to augment unlabelled tweets without adaptive features. CoMS3VM is the MS3VM algorithm in a co-training scheme, which naturally splits the common features into text and non-text views.

Multilayer perceptron: The most popular neural network technique. It consists of a feed-forward network of processing neurons [60], which are grouped into layers and connected by weighted links.

K Nearest Neighbour: It calculates the similarity [5] between the new data and the available cases and puts the new data into the category that is most similar to the available categories. It can be used for regression as well as classification, but it is mostly used for classification [24] problems.
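A minimal sketch of the classification case is given below. It assumes Euclidean distance as the similarity measure and a majority vote among the k closest cases; the function name `knn_predict`, the default `k=3` and the toy points are illustrative choices.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Label a query point by majority vote among its k nearest neighbours."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest = [label for _, label in distances[:k]]
    return Counter(nearest).most_common(1)[0][0]

# Toy two-dimensional feature vectors, for illustration only.
points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
labels = ["negative", "negative", "negative", "positive", "positive", "positive"]
print(knn_predict(points, labels, (0.5, 0.5)))
print(knn_predict(points, labels, (5, 5.5)))
```

`math.dist` requires Python 3.8 or later.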

Classification and Regression Trees: A predictive model [5] that predicts the value of a target variable based on other labelled variables. It is a decision tree in which each fork is a split on a predictor variable and each node at the end holds a prediction for the target variable.

Clustering: Clustering is the process of splitting the data into groups of similar objects, so that the data is represented by a smaller number of clusters [60]. The data is modelled by its clusters, an approach with roots in mathematics, numerical analysis and statistics. From a machine learning perspective, the clusters correspond to hidden patterns and the search for clusters is unsupervised learning. There are several types of clustering algorithms, including the following:

Hierarchical methods.

Partitioning methods.

Grid based methods.

BERT: BERT stands for Bidirectional Encoder Representations from Transformers. BERT is designed to pre-train deep bidirectional representations from unlabelled text by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications.

Deep Learning Techniques
Deep learning is a technique that teaches systems to do what comes naturally to humans: learn by example. In deep learning, a model learns to perform classification tasks directly from images, text, or sound. It can achieve state-of-the-art accuracy, sometimes exceeding human-level performance. Models are trained using a set of labelled data and neural network architectures containing many layers. Most deep learning methods use neural network architectures, so deep learning models are often referred to as deep neural networks. Tables 6, 7 and 8 show the comparison of different deep learning techniques.

Table 6 Comparison of deep learning techniques(1)
Table 7 Comparison of deep learning techniques(2)
Table 8 Comparison of deep learning techniques(3)
Tables 6, 7 and 8 compare the different deep learning techniques based on the accuracy values obtained in the classification procedures used in different research papers. In Bohang Chen et al. [1], accuracy using LSTM was 74.18%, CNN was 75.97% and CNN-LSTM was 74.70%. Li-Chen Cheng et al. [11] report an accuracy of 80.83% using LSTM, 87.17% using BiLSTM and 64.92% for GRU. In Kantinee Katchapakirin et al. [22], accuracy for LSTM was 85.0%. Shakeel Ahmad et al. [30] report an accuracy of 85.07% for 1-layer LSTM, 83.09% for 1-layer CNN and 92.06% for LSTM + CNN.

In Renata L. Rosa et al. [34], the accuracy of CNN BLSTM-RNN using SoftMax was 89% and of CNN BLSTM-RNN using SVM was 87%. Li Yang et al. [35] state that accuracy using CNN was 90.9%, CNN + Attention was 91.4%, BiGRU was 92.6% and BiGRU + Attention was 93.1%. In Mehmet Umut Salur et al. [38], accuracy for CNN + FastText Embedding was 68.48%, CNN + Character Embedding was 69.25%, CNN + Word Embedding was 67.14%, LSTM + FastText Embedding was 65.35% and CNN + BiLSTM FastText Embedding + Word Embedding was 82.14%. In Ali Shariq Imran et al. [40], the accuracy using DNN was 64.5%, LSTM + FastText was 66.0%, LSTM + GloVe was 67.7%, LSTM + GloVe Twitter was 69.9% and LSTM + w/o Pretrained Embed was 66%. Yue Han et al. [52] state that accuracy using LSTM was 71.15%, BiGRU was 71.35% and TD-LSTM was 72.83%.

According to Anisha Mukherjee et al. [75], the accuracy using Simple RNN was 55.92%, GRU was 65.51%, LSTM was 63.47% and BiLSTM was 66.44%. Tianyi Wang et al. [83] state that CNN scored an accuracy of 71.19% and LSTM 57.73%. In Kan Liu et al. [85], the accuracy using CNN was 86.28%, LSTM was 85.74% and BiLSTM was 86.56%. Shan Huang et al. [90] state that the accuracy using SG + GRU was 87.15%, SG + LSTM was 87.34%, SG + BiGRU was 87.18%, SG + BiLSTM was 86.64% and SG + Emoji was 88.35%. In Li Yang et al. [98], SLCABG scored an accuracy of 83.5%. Shivam Behl et al. [101] report an accuracy of 82% for CNN – W and 87% for CNN – WP. Various deep learning techniques are:

LSTM: Long Short Term Memory networks [1, 5, 9, 10] are a special kind of RNN capable of learning long-term dependencies and are designed to avoid the long-term dependency problem. All recurrent neural networks have the form of a chain of repeating modules of a neural network. In a standard RNN, the repeating module has a very simple structure, such as a single tanh layer. In an LSTM, the repeating module has a different structure: instead of a single neural network layer, there are four. In Bohang Chen et al. (2018), the proposed method is compared with CNN, LSTM, CNN-LSTM [30] and SVM. Badr Ait Hammou et al. (2020) state that the proposed solution increases the accuracy of well-known deep learning models, i.e. LSTM, BiLSTM and GRU.

CNN: A convolutional neural network [14, 15, 19, 31, 45] is a deep learning method that has become widely used in various deep learning and image processing tasks and is attracting interest across a variety of areas. It consists of multiple building blocks, such as convolution layers, pooling layers, and fully connected layers, and is designed to learn spatial hierarchies of features automatically and adaptively through the backpropagation algorithm. Fazeel Abid et al. [14] state that a single RNN layer generates a “DECR” that carries the important information and is examined as input, as CNN [78] gives results compared with random initialization for sentiment classification.

GRU: The Gated Recurrent Unit (GRU) network [11, 14, 15] was developed to solve the vanishing and exploding gradient problem often encountered during the operation of a basic Recurrent Neural Network [32]. It consists of two gates (an update gate and a reset gate) and does not maintain an internal cell state; the information that an LSTM recurrent unit stores in its internal cell state is instead incorporated into the GRU’s hidden state.

BiLSTM: A bidirectional LSTM (BiLSTM) [14, 15] layer learns long-term bidirectional dependencies in time sequence data, which is useful when the network should learn from the complete time series at each time step. In Badr Ait Hammou et al. [15], a BiLSTM layer and fastText are both part of the proposed architecture.

BiGRU: A Bidirectional GRU [14] is a sequence processing model that consists of two GRUs, each with only the input and forget gates. One takes the input in the forward direction, and the other in the backward direction.

CNN-LSTM: CNN can extract local information but may fail to capture long-distance dependency. LSTM can address this limitation by sequentially modelling texts across sentences. The CNN-LSTM [43] architecture involves using CNN layers for feature extraction on input data combined with LSTM to support sequence prediction.

Deep Belief Network: A generative graphical model [40, 45] of deep neural network, composed of multiple layers of latent variables [51] with connections between the layers but not between units within each layer. They are used to recognize, cluster and generate images, video sequences and motion-capture data. A continuous deep-belief network is simply an extension of a deep-belief network that accepts a continuum of decimals, rather than binary data.

Data Source
A data source is where the data used to run a report or gain information originates. For sentiment analysis, the data is collected from various social media such as Twitter [2, 4, 6, 8, 31–71], Facebook [5, 22], WeChat [9], Weibo [13, 37, 69], Amazon, Reddit [18, 54], feedback [56], etc. The posts made by users on these social media are collected for sentiment analysis. According to IGI Global, “A social network is an online social connection which consists of people and individuals that can be called “nodes,” and the links are the types of relationships established between above mentioned nodes”. Today, the type of communication can modify the behaviour of nodes and the communication habits of OSN [28] users, as social networks use web-based services [86]. In Yong Chen et al. (2018), the perinatal depression screening discussed is based on social data from the WeChat circle of friends. Table 9 shows different data sources.

Table 9 Data sources
Depressed people tend to post their views and opinions about their personal lives and social issues on social media such as Twitter, Facebook, YouTube, Reddit, Yelp, etc. Their feelings and emotions can be readily identified from these posts, comments, tweets, etc. Social media [99] thus acts as the data source for identifying people’s emotions. Not only textual data but also emoticons and emojis can be used for sentiment identification, as they too carry sentiment scores and help in sentiment analysis for depression detection. Table 10 shows the depression datasets used in the respective papers.

Table 10 Depression data sources
For example, on Twitter, tweets [26] are posted by various users. The data is collected using hashtags; for instance, the hashtag #twitter is used to extract data based on the term Twitter. In Manoj Sethi et al. [2], Twitter is the data source used to gather tweets specific to the coronavirus. Three datasets were used in total: Dataset 1, tweets with the hashtag #coronavirus; Dataset 2, tweets with the hashtag #COVID19; and Dataset 3, manually created as a combination of the other datasets. The data collected from the various data sources is pre-processed, features are extracted, and then various machine learning [24, 29] or deep learning [17, 30,31,32] algorithms are used for analysis and classification. Kantinee Katchapakirin et al. [22] cite a statistical study which concluded that the most popular social network in Thailand is Facebook and that it is used as a tool to share feelings, opinions, and life events.

Jonathas G. D. Harb et al. [70] compared pre- and post-event tweets to investigate the emotional impact of an event. To identify the tweets referring to the mass shootings, Twitter trending topics were manually inspected, raw data was gathered from the web, and samples were extracted using the official Twitter API on the respective event dates. Recurrent hashtags were identified for each of the events and used as query search terms. In Samah Mansour [64], a tweet is a collection of words that reflects the user’s opinion about a certain topic. R, a language and environment for statistical computing and graphics, was used to collect and analyze the tweets. Through R, Twitter’s API was used to get the tweets. The method searchTwitter was used to collect the tweets using the keyword ISIS.

Hamed Jelodar et al. [54] used Reddit, an American social media platform for discussing various topics that also includes web content ratings. Users are able to post questions and comments, and to respond to each other regarding different subjects, such as COVID-19. It is an ideal source for collecting health-related information about COVID-19-related issues. Irum Sindhu et al. (2019) tested on a data set manually tagged as positive or negative, constructed from students’ comments of the last five years at Sukkur IBA University, as well as on the standard SemEval-2014 data set. The presence of negative comments within feedback is indicated through a manual highlighting process done by domain experts.

Md. Mokhlesur Rahmana et al. [100] state that the study used Twitter data collected between April 30, 2020, and May 08, 2020 to understand people’s sentiment towards the reopening of the US economy. This method is generally applicable for collecting data from any social media platform (e.g., Facebook, LinkedIn, Instagram, and news agencies) regarding any real-world social event (e.g., man-made and natural disasters, political affairs, religious and racial conflicts). In Li Yang et al. (2019), the dataset consisted of book reviews collected from Dangdang using web crawler technology. The book reviews in the original data are rated at five levels, from one to five stars; these five levels are divided into two categories, with 1–2 stars defined as negative reviews and 3–5 stars defined as positive reviews. In Masum Billah et al. [94], a self-developed dataset was created from data collected on Facebook. The dataset contains only Bangla statuses; some short English words that may carry additional information were kept. Fidel Cacheda et al. [91] state that the data was collected from Reddit and the resulting dataset consists of a collection of tuples of the form (id, writing), where id is a unique identifier for each social network user and writing represents a writing instance in the social network. Kan Liu et al. [85] take patient description text on medical social media as an example and treat the patient’s required medical treatment as a classification problem.
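The star-to-polarity mapping described for the Dangdang book reviews can be written as a small labelling function. The function name `stars_to_label` and the error handling for out-of-range ratings are assumptions added for this sketch.

```python
def stars_to_label(stars):
    """Map a 1-5 star rating to a binary sentiment label."""
    if not 1 <= stars <= 5:
        raise ValueError("star rating must be between 1 and 5")
    # 1-2 stars are treated as negative, 3-5 stars as positive.
    return "negative" if stars <= 2 else "positive"

labels = [stars_to_label(s) for s in [1, 2, 3, 4, 5]]
print(labels)
```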

Tianyi Wang et al. [83] state that 2.4 million Weibo posts from 1 January 2020 to 19 February 2020 were crawled by the CCIR 2020 organizer. The crawler mainly uses SciPy and Beautiful Soup techniques, and duplicates and reposts are deleted to construct the Weibo posts dataset. The dataset includes posts by around 640 thousand users, with user location information excluded. In Hay Mar Su Aung et al. [74], the authors created their own dataset from the public comments on a Facebook page related to a “Celebrity” page in Myanmar. The social media data (Facebook comments) was collected through data crawling using the Facepager tool. Lixia Yu et al. [69] downloaded social media posts via the Sina Weibo API. These posts were generated between January of 2014 and July of 2017. The data was then filtered, retaining the posts that contained the keyword ”. Mohammad Ehsan Basiri et al. [59] state that long reviews and short reviews from various Twitter datasets such as Kindle, Movies, U.S Airline, Sentiment 140 and Airline Twitter were used. Irum Sindhu et al. [56] used the last five years of student feedback from Sukkur IBA University. Usually, the university processes feedback manually by tagging each student’s comments as positive or negative. The presence of negative comments within feedback is indicated through a manual highlighting process done by domain experts. Mehmet Umut Salur et al. [38] used a dataset collected from user tweets about a GSM operator in Turkey. The dataset contains 17,289 Turkish tweets posted between 2011 and 2017. The tweets have three sentiment classes: positive, negative, and neutral. In Yong Chen et al. (2018), the data from the WeChat circle of friends has its own particularities, such as a large amount of e-commerce data, emoticons, etc.; there are more than 30 emojis. YuWen Lyua et al. [13] state that the data consisted of Weibo comments, which were collected from the Weibo API and a web crawler. The API interface is provided on Weibo’s open platform, and the platform’s application was used to access the Weibo API and collect the required data.

Fazeel Abid et al. [14] state that the proposed techniques were tested on multi-source corpora of various domains and sizes (small, medium, and large) to assess applicability, efficiency, and reliability. The first dataset, the sentiment strength Twitter (SST) dataset, contains a total of 4242 tweets, with 1252 classified as positive and 1037 as negative. The second dataset, which has been used extensively as a benchmark, is IMDB. It contains 25,000 tweets, with a polarity of 12,500 positives and 12,500 negatives. The third dataset, known as the Stanford Twitter Sentiment Corpus (STS), is among the best known, has been applied in many fields, and contains 1.6 million tweets. Subhan Tariq et al. (2019) used PRAW, the Python API for Reddit, to download the top 1000 posts from each of the following subreddits: Depression, Anxiety, ADHD and Bipolar. Guozheng Rao et al. [19] state that the large-scale Reddit Self-reported Depression Diagnosis (RSDD) dataset contains over 9000 users diagnosed with depression, matched with approximately 107,000 control users who have a healthy mental state.

Depression Detection
Depression is one of the mental health problems faced by people globally. Studies show that less than half of those who have this emotional problem gain access to mental health services. This could be due to a lack of awareness about the disease. Depression often begins in adulthood. It is now recognized as also occurring in children and adolescents, although it sometimes presents with more prominent irritability than low mood. Many chronic mood and anxiety disorders in adults begin as high levels of anxiety in children. Depression in midlife or in older adults can co-occur with other serious medical illnesses such as diabetes, cancer, heart disease, and Parkinson’s disease. These conditions are often worse when depression is present. Medications taken for these physical illnesses may cause side effects that contribute to depression. A doctor experienced in treating these complicated illnesses can help work out the best treatment strategy. Risk factors include:

Personal or family history of depression.

Major life changes, trauma, or stress.

Certain physical illnesses and medications.

Depression is usually diagnosed by healthcare workers with the help of various questionnaires and self-reporting. These methods depend not only on the current mood of the patient but also on the experience of people who are reluctant to seek help. People express their feelings and thoughts to friends and family through various social media. According to the American Psychiatric Association, “Depression affects an estimated one in 15 adults (6.7%) in any given year. And one in six people (16.6%) will experience depression at some time in their life. Depression can occur at any time, but on average, first appears during the late teens to mid-20s. Women are more likely than men to experience depression. Some studies show that one-third of women will experience a major depressive episode in their lifetime”. Figure 4 shows a bar chart that illustrates the percentage of depression among people in different age groups. Figure 5 shows the spike in anxiety and depression during the pandemic, according to the CDC, NCHS and U.S. Census Bureau.

In India, 30% of the 103 million people above the age of 60 exhibit symptoms of depression, according to a recent government survey. It estimated that 8.3% of the country’s elderly population have probable major depression. This means that one in every 12 elderly persons in the country has depression. The prevalence figure is 10 times higher than the self-reported diagnosed depression of 0.8% in the elderly population, pointing at the burden of undiagnosed cases, the report said. Among people 45–59 years of age, 26% show depressive symptoms.

More elderly women (9%) than men (7%) have probable major depression. Additionally, the figure is higher among rural residents (9%) than among their urban counterparts (6%). The report also says that 10% of the elderly population who live alone suffer from depression. The study shows that 3% of all the elderly have some form of mental impairment. Fewer people above the age of 60 who have 10 or more years of schooling (5%) have depression than those with less than primary education (9%).

Over a tenth of the elderly population have probable major depression in Madhya Pradesh (17%), Uttar Pradesh (14%), Delhi (11%), Bihar (10%), and Goa (10%). Among older adults above the age of 45 years, over 60% were hospitalised at a private facility in the 12 months prior to the survey. The mean out-of-pocket expenditure in a private health facility among the elderly is Rs. 31,933, compared to Rs. 71,232 among those aged 45–59.
8. Write a complete code for this assignment.
answer:"""
For this assignment, you have to write a complete Python program. Paste your code in the window below.

You may define additional auxiliary functions as needed.
There are some public test cases and some (hidden) private test cases.
"Compile and run" will evaluate your submission against the public test cases.
"Submit" will evaluate your submission against the hidden private test cases. There are 6 private test cases, with equal weightage. You will get feedback about which private test cases pass or fail, though you cannot see the actual test cases.
Ignore warnings about "Presentation errors".
Here are some basic facts about tennis scoring: A tennis match is made up of sets. A set is made up of games.

To win a set, a player has to win 6 games with a difference of 2 games. At 6-6, there is often a special tie-breaker. In some cases, players go on playing till one of them wins the set with a difference of two games.

Tennis matches can be either 3 sets or 5 sets. The player who wins a majority of sets wins the match (i.e., 2 out of 3 sets or 3 out of 5 sets). The score of a match lists out the games in each set, with the overall winner's score reported first for each set. Thus, if the score is 6-3, 5-7, 7-6, it means that the first player won the first set by 6 games to 3, lost the second one 5 games to 7 and won the third one 7 games to 6 (and hence won the overall match as well by 2 sets to 1).

You will read input from the keyboard (standard input) containing the results of several tennis matches. Each match's score is recorded on a separate line with the following format:

Winner:Loser:Set-1-score,...,Set-k-score, where 2 ≤ k ≤ 5

For example, an input line of the form

Halep:Wozniacki:3-6,6-3,6-3
indicates that Halep beat Wozniacki 3-6, 6-3, 6-3 in a best of 3 set match.

The input is terminated by a blank line.

You have to write a Python program that reads information about all the matches and compiles the following statistics for each player:

1. Number of best-of-5 set matches won
2. Number of best-of-3 set matches won
3. Number of sets won
4. Number of games won
5. Number of sets lost
6. Number of games lost
You should print out to the screen (standard output) a summary in decreasing order of ranking, where the ranking is according to the criteria 1-6 in that order (compare item 1, if equal compare item 2, if equal compare item 3, etc., noting that for items 5 and 6 the comparison is reversed: fewer sets or games lost ranks higher).

For instance, given the following data

Federer:Nadal:2-6,6-7,7-6,6-3,6-1
Nadal:Federer:6-3,4-6,6-4,6-3
Federer:Nadal:6-0,7-6,6-7,6-3
Nadal:Federer:6-4,6-4
Federer:Nadal:2-6,6-2,6-0
Nadal:Federer:6-3,4-6,6-3,6-4
Federer:Nadal:7-6,4-6,7-6,2-6,6-2
Nadal:Federer:7-5,7-5
Halep:Wozniacki:3-6,6-3,6-3
your program should print out the following

Federer 3 1 13 142 16 143
Nadal 2 2 16 143 13 142
Halep 0 1 2 15 1 12
Wozniacki 0 0 1 12 2 15
You can assume that there are no spaces around the punctuation marks ":", "-" and ",". Each player's name will be spelled consistently and no two players have the same name.
"""
# Statistics will be stored as a dictionary
# Each key is a player name, each value is a list of 6 integers 
# representing 
#   Best of 5 set matches won,
#   Best of 3 set matches won,
#   Sets won
#   Games won
#   Sets lost (store as negative number for comparison)
#   Games lost (store as negative number for comparison)
stats = {}   

# Read a line of input
line = input()
while line:
  # Keep track of sets/games won and lost in this match
  # with respect to winner of the match
  (wsets,lsets,wgames,lgames) = (0,0,0,0)

  # Extract winner, loser and string of setscores
  (winner,loser,setscores) = line.strip().split(':',2)

  # Extract sequence of sets from setscores
  sets = setscores.split(',')

  for setscore in sets:
    # Process each set (renamed from "set" to avoid shadowing the builtin)
    (winstr,losestr) = setscore.split('-')
    win = int(winstr)
    lose = int(losestr)
    wgames = wgames + win
    lgames = lgames + lose
    if win > lose:
      wsets = wsets + 1
    else:
      lsets = lsets + 1

  # Update statistics for each of the players

  for player in [winner,loser]:
    # Create an all-zero entry the first time a player is seen
    if player not in stats:
      stats[player] = [0,0,0,0,0,0]

  if wsets >= 3:
    stats[winner][0] = stats[winner][0] + 1
  else:
    stats[winner][1] = stats[winner][1] + 1

  stats[winner][2] = stats[winner][2] + wsets
  stats[winner][3] = stats[winner][3] + wgames
  stats[winner][4] = stats[winner][4] - lsets
  stats[winner][5] = stats[winner][5] - lgames

  stats[loser][2] = stats[loser][2] + lsets
  stats[loser][3] = stats[loser][3] + lgames
  stats[loser][4] = stats[loser][4] - wsets
  stats[loser][5] = stats[loser][5] - wgames

  line = input()

# Collect each player's stats as a tuple, name last    
statlist = [tuple(stat) + (name,) for (name,stat) in stats.items()]

# Sort the statistics in descending order
# Losing games are stored negatively for sorting correctly
statlist.sort(reverse = True)

# Print
for entry in statlist:
  print(entry[6],entry[0],entry[1],entry[2],entry[3], -entry[4], -entry[5])
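
The script above reads from standard input, which makes it awkward to check against the sample data. The following is a minimal sketch of the same logic wrapped in functions that take a list of lines, so it can be tested directly. The names `parse_match`, `summarize` and `SAMPLE` are illustrative assumptions and not part of the original answer. It stores losses as positive numbers and negates them only in the sort key, which is one possible alternative to storing them negatively.

```python
def parse_match(line):
    """Split 'Winner:Loser:a-b,c-d,...' into names and (winner_games, loser_games) pairs."""
    winner, loser, setscores = line.strip().split(":", 2)
    sets = []
    for setscore in setscores.split(","):
        won, lost = setscore.split("-")
        sets.append((int(won), int(lost)))
    return winner, loser, sets


def summarize(lines):
    """Return ranked rows: (name, bo5 won, bo3 won, sets won, games won, sets lost, games lost)."""
    stats = {}
    for line in lines:
        winner, loser, sets = parse_match(line)
        wsets = sum(1 for won, lost in sets if won > lost)
        lsets = len(sets) - wsets
        wgames = sum(won for won, _ in sets)
        lgames = sum(lost for _, lost in sets)

        wrow = stats.setdefault(winner, [0, 0, 0, 0, 0, 0])
        lrow = stats.setdefault(loser, [0, 0, 0, 0, 0, 0])

        # Winning 3 sets means the match was best-of-5, otherwise best-of-3
        wrow[0 if wsets >= 3 else 1] += 1
        wrow[2] += wsets
        wrow[3] += wgames
        wrow[4] += lsets
        wrow[5] += lgames
        lrow[2] += lsets
        lrow[3] += lgames
        lrow[4] += wsets
        lrow[5] += wgames

    def sort_key(item):
        name, row = item
        # Items 5 and 6 (sets/games lost) are negated so fewer losses rank higher
        return (row[0], row[1], row[2], row[3], -row[4], -row[5], name)

    ranked = sorted(stats.items(), key=sort_key, reverse=True)
    return [(name, *row) for name, row in ranked]


# Sample data taken from the problem statement
SAMPLE = [
    "Federer:Nadal:2-6,6-7,7-6,6-3,6-1",
    "Nadal:Federer:6-3,4-6,6-4,6-3",
    "Federer:Nadal:6-0,7-6,6-7,6-3",
    "Nadal:Federer:6-4,6-4",
    "Federer:Nadal:2-6,6-2,6-0",
    "Nadal:Federer:6-3,4-6,6-3,6-4",
    "Federer:Nadal:7-6,4-6,7-6,2-6,6-2",
    "Nadal:Federer:7-5,7-5",
    "Halep:Wozniacki:3-6,6-3,6-3",
]

for row in summarize(SAMPLE):
    print(*row)
```

Run on the sample data, this sketch should reproduce the four summary lines listed in the problem statement. Keeping parsing and ranking in separate functions also makes it straightforward to feed in lines read from `input()` until a blank line is reached.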