1 Introduction

The coronavirus pandemic confronted people around the world with horrific scenes in 2020 and 2021. As there were no vaccines or medicines that could cure the symptoms of the virus, governments adopted various measures of their own, which were not standardized but remained in place [1, 2]. Precautions included quarantine, lockdown, self-isolation, social distancing, and many more, which to some extent reduced the havoc that shook the whole world.

This was only one part of the situation. The other part involved activists, political leaders, the public, the media, and many others, with social media playing the largest role. Social media has become a key channel for spreading positivity, but it can also have a negative impact, depending on how accurate the information is and how readers receive it. This was the case during the COVID-19 pandemic, when a large volume of news and rumors circulated across social media platforms: a tweet or post that offered relief from the distress of the virus made people excited and happy, while one describing something unimaginable made them sad and more panicked. There was therefore a need to categorize posts and tweets according to whether a particular post is likely to have a healthy impact on the reader's mind or a shocking one.

The main purpose of this research work is to perform emotion detection and sentiment analysis of tweets about COVID-19. To this end, various machine learning, ensemble and deep learning methods are used to extract emotion-based sentiments from tweets. The paper is organized as follows. Section 2 presents related work in this direction. The complete methodology followed in this work is presented in Sect. 3. Detailed results and analysis are presented in Sect. 4, followed by the conclusion in Sect. 5.

2 Previous Work

This section presents previously reported work in this direction. Table 1 presents an approach- and feature-based analysis of various works related to sentiment analysis of COVID-19.

Table 1 Approach and feature based analysis for COVID-19 sentiment analysis

From Table 1, it can be observed that a number of different techniques (both supervised and unsupervised) have been used to identify sentiments related to COVID-19 on different social media platforms. Figure 1 depicts the distribution of the approaches used. Among machine learning (ML) algorithms, the most prominent are Support Vector Machine (SVM), Naïve Bayes (NB) and Logistic Regression (LR) [3,4,5]. Deep Learning (DL) techniques such as Long Short-Term Memory (LSTM), Bidirectional Encoder Representations from Transformers (BERT) and Bi-directional Long Short-Term Memory (Bi-LSTM) were used by many researchers [6,7,8,9]. Other techniques include unsupervised learning (US) techniques, Latent Dirichlet Allocation (LDA), Multi-layer Perceptron (MLP), Growing Self-organizing Map (GSOM) and lexicon-based (LB) Natural Language Processing (NLP). These techniques were applied to extract emotions and sentiments from COVID-19 text on different social media platforms. Figure 2 presents an average-based performance analysis of existing COVID-19 sentiment and emotion detection works.

Fig. 1 Different approaches used for COVID-19 sentiment analysis

Fig. 2 Performance analysis of existing COVID-19 sentiment analysis works

A dataset-based analysis is provided in Table 2. Sentiment and emotion detection analysis of COVID-19 related text was carried out from different perspectives, such as false news during COVID-19, COVID-19 related awareness, opinions on COVID-19 vaccination, COVID-19 and politics, public response to COVID-19, and many more. The time span considered in these works is March 2020 to May 2021. Figure 3 shows that much of the sentiment analysis work was carried out during the first wave of COVID-19. The majority of the tweets are in English. From the reviewed literature, it can be concluded that the main class labels used for the sentiment analysis task are positive, negative and neutral, whereas the main class labels for the emotion detection task are fear, sadness, anger, disgust and optimism.

Table 2 COVID-19 dataset analysis for sentiment analysis task

Fig. 3 COVID-19 sentiment analysis time span

3 Methodology

The architecture of the proposed methodology is depicted in Fig. 4. The proposed system consists of two main phases: phase 1 and phase 2. A detailed description of both phases is presented in Fig. 4.

Fig. 4 Architecture of COVID-19 sentiment and emotion detection system

3.1 Phase 1

Phase 1 consists of the following sub-phases.

3.1.1 Data Collection and Understanding the Dataset

For this research work, we used a dataset from Kaggle [5]. The dataset comprises 5000 tweets, which were divided into categories and sub-categories. For sentiment analysis of COVID-19 related tweets, the tweets were divided into positive, negative and neutral. For emotion detection related to the COVID-19 pandemic, 11 different labels were selected: Optimistic (0), Thankful (1), Empathetic (2), Pessimistic (3), Anxious (4), Sad (5), Annoyed (6), Denial (7), Surprise (8), Official report (9) and Joking (10). A statistical analysis of the dataset is provided in Table 3.

Table 3 Dataset description

Figures 5 and 6 show the distribution of the dataset across categories and sub-categories. A basic experimental analysis was performed to understand human emotions in three polarities, i.e., positive, negative and neutral; our findings showed that 28% of the tweets were positive, 52% were negative and 19% were neutral in response to COVID-19 worldwide. The emotion class distribution is presented in Fig. 6. Out of the 11 emotion labels, the most frequent were optimistic (23%), annoyed (17%), sad (13%) and anxious (11%).

Fig. 5 Sentiment class distribution in dataset

Fig. 6 Emotion class distribution in dataset

For a better understanding of the dataset, a word analysis was carried out. The top words in each sentiment category (positive, negative, neutral) are presented in Fig. 7. Table 4 presents the top 5 words in each emotion class.

Fig. 7 Word cloud for the 3 sentiment analysis class labels

Table 4 Top 5 words in each emotion class label

3.1.2 Data Pre-processing

All tweets were passed through the following pre-processing steps:

  1. Removing Numbers, Special Characters and Punctuations

Punctuation marks, numbers and special characters are not helpful in analyzing emotions, so it is best to remove them from the text. We therefore replace everything except letters with spaces.

  2. Stopwords Removal

In NLP work, stopwords (very common words, e.g., that, are, have) carry little meaning because they are not connected with emotions. Removing them therefore saves computation and can increase the accuracy of the model.

  3. Stemming using Porter Stemmer

Stemming removes suffixes (such as ‘-ing’, ‘-ly’, ‘-es’, ‘-s’) to obtain the root form of a word. We used the Porter Stemmer in our work, which applies a five-step process, each step with its own rules. The Porter Stemmer is widely used because of its ease of use, speed and efficiency. The output is the word in its root form.

  4. Label Encoding of target variables

Label encoding converts categorical values into integer values in the range from 0 to the number of classes minus 1. For example, 5 distinct categorical classes are converted to (0, 1, 2, 3, 4).

3.1.3 Feature Extraction and Feature Weighing

After pre-processing, the ‘Bag of Words’ model is used for feature extraction, and a vector space representation is created for the entire dataset. Term frequency (TF) and term frequency-inverse document frequency (TF-IDF) are used for feature weighing.
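To make the two weighing schemes concrete, the following sketch builds a bag-of-words vocabulary and computes TF and TF-IDF vectors for tokenized tweets. The exact formulas are not stated here, so the sketch assumes raw counts for TF and the unsmoothed form idf(t) = log(N / df(t)) for IDF; library implementations often use smoothed or normalized variants. Function names are hypothetical.

```python
import math
from collections import Counter


def build_vocabulary(docs):
    """Bag-of-words vocabulary: sorted list of distinct tokens in the corpus."""
    return sorted({tok for doc in docs for tok in doc})


def tf_vector(doc, vocab):
    """Term frequency: raw count of each vocabulary term in one document."""
    counts = Counter(doc)
    return [counts[term] for term in vocab]


def idf_weights(docs, vocab):
    """Inverse document frequency, assumed here as log(N / df) without smoothing."""
    n_docs = len(docs)
    weights = []
    for term in vocab:
        df = sum(1 for doc in docs if term in doc)  # documents containing the term
        weights.append(math.log(n_docs / df))
    return weights


def tfidf_vector(doc, vocab, idf):
    """TF-IDF: term frequency scaled by the inverse document frequency."""
    return [tf * w for tf, w in zip(tf_vector(doc, vocab), idf)]


if __name__ == "__main__":
    docs = [["covid", "fear", "fear"], ["covid", "hope"]]
    vocab = build_vocabulary(docs)
    idf = idf_weights(docs, vocab)
    for doc in docs:
        print(tf_vector(doc, vocab), tfidf_vector(doc, vocab, idf))
```

Under this definition a term that occurs in every tweet receives a TF-IDF weight of zero, whereas plain TF keeps its count.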

3.1.4 Model Training

In total, 12 models were trained and tested on this dataset. Based on their type, these models were divided into three categories: Baseline Learners (BL), Ensemble Learners (EL) and Deep Learners (DL). Baseline learners consist of Logistic Regression (LR), K-Nearest Neighbour (KNN), Support Vector Machine (SVM), Multinomial Naïve Bayes (MNB) and Decision Tree (DT). LR, NB (and its variations) and SVM are statistical models in nature. LR models the probability of a discrete outcome given an input variable [3, 5]. NB is based on conditional probability and Bayes' theorem [3, 4]. SVM performs classification by finding a hyperplane that distinctly separates the data points of different classes [3, 4]. Ensemble learners consist of Random Forest (RF), XGBoost (XGB), Bagging (BG) and Gradient Boosting (GB) [20]. RF operates by constructing a multitude of decision trees. XGB uses the gradient boosting technique to generate boosted trees with enhanced performance. BG aggregates the predictions of several weak models. GB tries to minimize the loss function by adding weak learners using gradient descent. Deep learners consist of Artificial Neural Network (ANN), Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM) [6, 7]. ANN is a nonlinear statistical model that captures complex relationships between input and output. CNN is a class of deep neural network that consists of an input layer, an output layer and numerous hidden layers. LSTM is a type of recurrent neural network that maintains a cell state across the sequence to perform the classification.
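As an illustration of one of the baseline learners, the sketch below implements a small Multinomial Naïve Bayes classifier over tokenized tweets, following the description of NB as an application of conditional probability and Bayes' theorem. It is not the configuration used in this work: the add-one (Laplace) smoothing, the handling of unseen tokens and the class name are assumptions made for the example.

```python
import math
from collections import Counter, defaultdict


class MultinomialNB:
    """Multinomial Naive Bayes over token lists with add-alpha smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # smoothing strength (assumption: Laplace, alpha = 1)

    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        self.vocab = {tok for doc in docs for tok in doc}
        self.class_counts = Counter(labels)        # documents per class
        self.token_counts = defaultdict(Counter)   # token counts per class
        for doc, label in zip(docs, labels):
            self.token_counts[label].update(doc)
        self.n_docs = len(docs)
        return self

    def _log_posterior(self, doc, label):
        # log P(class) + sum over tokens of log P(token | class)
        log_prob = math.log(self.class_counts[label] / self.n_docs)
        total = sum(self.token_counts[label].values())
        denom = total + self.alpha * len(self.vocab)
        for tok in doc:
            if tok in self.vocab:  # tokens unseen in training are ignored
                count = self.token_counts[label][tok]
                log_prob += math.log((count + self.alpha) / denom)
        return log_prob

    def predict(self, doc):
        """Return the class with the highest (unnormalized) log posterior."""
        return max(self.classes, key=lambda c: self._log_posterior(doc, c))


if __name__ == "__main__":
    train_docs = [["hope", "recover"], ["hope", "thank"],
                  ["fear", "death"], ["fear", "panic"]]
    train_labels = ["positive", "positive", "negative", "negative"]
    model = MultinomialNB().fit(train_docs, train_labels)
    print(model.predict(["hope"]), model.predict(["panic", "fear"]))
```

Working in log space avoids numerical underflow when many token probabilities are multiplied together.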

3.2 Phase 2

Performance evaluation is carried out using Accuracy, Precision, Recall and F1-measure [21].
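The four metrics can be computed from true and predicted labels as in the sketch below. For the multi-class tasks considered here, precision, recall and F1 must be averaged over classes; the averaging scheme is not stated, so macro-averaging (an unweighted mean over classes) is assumed. The function name is hypothetical.

```python
def evaluate(y_true, y_pred):
    """Return accuracy and macro-averaged precision, recall and F1."""
    classes = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)

    precisions, recalls, f1s = [], [], []
    for c in classes:
        tp = sum(t == c and p == c for t, p in pairs)  # true positives
        fp = sum(t != c and p == c for t, p in pairs)  # false positives
        fn = sum(t == c and p != c for t, p in pairs)  # false negatives
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            f1 = 2 * precision * recall / (precision + recall)
        else:
            f1 = 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)

    n = len(classes)
    return {
        "accuracy": accuracy,
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


if __name__ == "__main__":
    print(evaluate([0, 0, 1, 1], [0, 1, 1, 1]))
```

With macro-averaging, the reported F1 is the mean of per-class F1 values and therefore need not equal the harmonic mean of the averaged precision and recall.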

4 Results and Analysis

This section presents the results and analysis of applying the 12 algorithms with two feature weighing criteria (TF and TF-IDF) to both the sentiment analysis and emotion detection tasks.

4.1 Result and Analysis on Sentiment Analysis

4.1.1 Sentiment Analysis Results Using TF

From Table 5, it can be observed that SVM, with an accuracy of 59.1%, precision of 64.9%, recall of 46.3% and F1-score of 47.3%, performed better than the other baseline algorithms, followed by Multinomial Naïve Bayes. Among the ensemble learning methods, gradient boosting is the best, with accuracy, precision, recall and F1-score of 60.1%, 63.1%, 48.2% and 49.5%, respectively. Among the deep learning methods, CNN is the best, with accuracy, precision, recall and F1-score of 34.9%, 54.0%, 81.5% and 64.3%, respectively.

Table 5 Results of algorithms using term frequency as feature weighing

4.1.2 Sentiment Analysis Results Using TF-IDF

From Table 6, it can be observed that SVM, with an accuracy of 59.3%, precision of 66.5%, recall of 46.2% and F1-score of 47.1%, performed better than the other baseline learners, followed by Logistic Regression. Gradient boosting is the best among the ensemble learners, with accuracy, precision, recall and F1-score of 58.8%, 58.9%, 47.1% and 48.2%, respectively. CNN is the best among the deep learners.

Table 6 Results of algorithms using TF-IDF as feature weighing

Figure 8 indicates that ensemble learners performed better than the other categories. The performance of TF and TF-IDF is approximately equal for the sentiment analysis task. In the existing research reviewed in this direction (Table 1), ensemble learning techniques were not applied to sentiment analysis or emotion detection. Deep learners were not suitable for the sentiment analysis task. Analysis based on the other performance metrics (precision, recall, F1-score) is presented in Fig. 9.

Fig. 8 Average Accuracy based analysis of COVID-19 related sentiments

Fig. 9 Average Precision, Recall, F1-Score based analysis of COVID-19 related sentiments

4.2 Result and Analysis on Emotion Detection Task

4.2.1 Result and Analysis Using TF

From Table 7, it can be seen that MNB, XGBoost and CNN are the best performers in the BL, EL and DL categories, respectively. The accuracy, precision, recall and F1-score are 37.5%, 29.4%, 24.5% and 22.6% for MNB, 36.3%, 47.4%, 26.0% and 28.8% for XGBoost, and 20.2%, 85.4%, 82.5% and 89.7% for CNN.

Table 7 Results of algorithms using term frequency as feature weighing

4.2.2 Result and Analysis Using TF-IDF

From Table 8, it can be seen that SVM, with accuracy, precision, recall and F1-score of 35.1%, 37.3%, 21.1% and 19.2%, respectively, performed better than the other BL algorithms. XGBoost, with accuracy, precision, recall and F1-score of 35.2%, 37.0%, 23.0% and 22.9%, respectively, was the best in the EL category, and CNN was the best in the DL category. The accuracy (Fig. 10), precision, recall and F1-score (Fig. 11) for CNN were 18.7%, 92.2%, 89.4% and 87.8%, respectively.

Table 8 Results of algorithms using TF-IDF as feature weighing

Fig. 10 Average Accuracy based analysis of COVID-19 related emotions

Fig. 11 Average Precision, Recall, F1-Score based analysis of COVID-19 related emotions

5 Conclusion

Social media is a platform for expressing opinions, viewpoints and thoughts freely and without hesitation. During the COVID-19 pandemic, the world was physically disconnected due to COVID-19 restrictions but more connected in the virtual environment. This research work was carried out on the coronavirus outbreak using Twitter data. The main focus of this study is to understand the emotions and sentiments of people during COVID-19. This work helps to understand people's perception of the coronavirus and its impact on the public. Tweets carrying sentiments and emotions during this period were collected, and the public's reaction to the outbreak was analyzed. The dataset was passed through various pre-processing phases. Term frequency and term frequency-inverse document frequency were used for feature extraction and feature weighing. To analyze sentiments and emotions, a total of 12 models were trained and tested on the Twitter dataset. These models were categorized into baseline, ensemble and deep learners. The results revealed that, for the sentiment analysis task, the gradient boosting algorithm (an ensemble learning model) with term frequency as feature weighing outperformed all other models, with an accuracy of 60.1% and a precision of 63.1%. For the emotion detection task, the Multinomial Naïve Bayes model with term frequency performed better than the other models.