1 Introduction

Social media now plays an important role among Internet users. Twitter is a social micro-blogging platform where people express their opinions, feelings, and arguments about different topics across the world. Tweets often contain sentiments and emotions expressed by the users. Several lines of research on tweets try to understand the emotion or sentiment attached to them. Sentiment analysis determines whether a tweet is positive, negative, or neutral. Emotion detection assigns a tweet to one of the given emotion categories (e.g., anger, fear, joy, and sadness). Existing research in this context has mainly focused on either sentiment analysis or emotion detection in Twitter [1, 27]. Emotion intensity prediction has received limited attention in the literature, although the intensity of emotion in text is often useful in applications such as crisis management, product quality, and event reporting.

In this paper, we focus on the following problem: given a tweet and an emotion category, predict the intensity of that emotion in the tweet. We use algorithms from three different families of machine learning methods, Convolution Neural Networks (CNN), XGBoost, and Support Vector Regression (SVR), to predict the emotion intensity of tweets. Each algorithm is widely used across machine learning tasks. The predictions of the three algorithms are averaged to obtain the final prediction.

Recently, a dataset was published in the WASSA-2017 shared task on emotion intensity [17], in which the tweets are labeled with four emotion categories: anger, fear, joy, and sadness. For each tweet, the intensity of the labeled emotion is also provided. A few example instances from that dataset are presented in Table 1. We use this dataset to evaluate our proposed method.

Table 1. Example tweets showing emotion intensity

The rest of the paper is organized as follows. Related literature is described in Sect. 2. The problem statement is defined in Sect. 3. Details of the proposed method are presented in Sect. 4. The experimental evaluation of the method is described in Sect. 5. We conclude and provide directions for future research in Sect. 6.

2 Related Work

A large amount of work has been done on detecting sentiment in Twitter data. Although sentiment analysis is different from emotion intensity prediction, features used in sentiment analysis can also be used in emotion intensity prediction. Hence, in this section, we present related work from the literature on both sentiment analysis and emotion intensity prediction.

Sentiment Analysis: Part-of-speech tags, lexicons, bag-of-words, emoticons, linguistic features, and semantic features are some of the common features used in sentiment analysis. A hybrid approach that uses both corpus-based and dictionary-based methods to find the semantic orientation of the opinion words in tweets is described in [13]. Agarwal et al. [1] used POS-specific prior polarity features and a tree kernel for sentiment analysis. Bag-of-words features along with SentiWordNet, lexicons, emoticons, etc. are used in [25]. Semantic features are added to traditional features for sentiment analysis in [28]. Kouloumpis et al. [12] used linguistic features and lexical resources. However, none of the above methods considers the emotion category.

Emotion Detection: A method with distant supervision for emotion classification is described in [26]. The public mood is modeled using Twitter messages in [4]. A dataset for emotion detection in Twitter is developed in [27]. The authors considered seven emotion categories, namely, anger, disgust, fear, joy, love, sadness, and surprise. Another large dataset containing \(\langle \)tweet, emotion category\(\rangle \) annotations is created in [31]. The authors used the emotion-related hashtags present in the tweets to create the dataset, and used unigrams, bigrams, sentiment words, and part-of-speech features for emotion detection. They also considered seven emotion categories similar to those of Roberts et al. [27], but used a thankfulness category instead of anger. However, none of the above methods considers the intensity of the emotion. A word-emotion association lexicon is built using crowdsourcing in [20]. An annotation scheme for emotion intensities in blog posts is introduced in [2]. A supervised framework for identifying emotional expressions and intensities is developed in [7]. However, the emotion intensities there are categorical (high, medium, and low). An ensemble method for predicting emotion intensities is described in [14]. The authors used two SVR methods with different features and a neural network method in the ensemble. However, word embedding features are not used.

3 Problem Definition

We now briefly define the problem addressed in this paper: given a tweet T and an emotion E, determine the intensity \(Y_{T, E}\) of emotion E felt by the author of the tweet T. \(Y_{T, E}\) is a real-valued score between 0 and 1. A score of 1 is the maximum possible and means that the author of T feels the maximum amount of emotion E. Similarly, a score of 0 is the minimum possible and means the least amount of emotion E.

4 Methodology

We model the problem of predicting emotion intensity as a regression problem. We identified three methods, namely, Convolution Neural Networks (CNN), XGBoost, and Support Vector Regression (SVR), from three different families of algorithms for this prediction. These methods are selected because they are widely accepted in the machine learning literature for predictive analytics. The three methods are combined in an ensemble to retain the predictive power of the individual algorithms and to exploit the synergy between them.

Tweets often contain noise in the form of slang words, elongated words, spelling mistakes, abbreviations, @ mentions, etc. The maximum length of a tweet is 140 characters. We apply the following text preprocessing steps to improve the performance of the model: URLs, @ mentions, and numbers are removed, and all words are converted to lower case. The preprocessed tweets are given to each of the individual methods in the ensemble for training and testing. We now describe these methods in detail.
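The preprocessing steps above can be sketched as follows. This is a minimal illustration, not the exact implementation used in the experiments: the regular expressions and the function name `preprocess_tweet` are assumptions, since the paper does not specify how URLs, mentions, and numbers are matched.

```python
import re

# Assumed patterns; the paper does not give the exact matching rules.
URL_RE = re.compile(r"(?:https?://|www\.)\S+")
MENTION_RE = re.compile(r"@\w+")
# Only standalone numbers are removed; digits inside words (e.g. "b4") are kept.
NUMBER_RE = re.compile(r"\b\d+(?:[.,]\d+)*\b")


def preprocess_tweet(text: str) -> str:
    """Remove URLs, @ mentions, and numbers, then lower-case the tweet."""
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = NUMBER_RE.sub(" ", text)
    text = text.lower()
    # Collapse the whitespace left behind by the removals.
    return " ".join(text.split())


if __name__ == "__main__":
    print(preprocess_tweet("@Bob I am SO angry!!! 100 times http://t.co/abc"))
```

URLs are removed before mentions and numbers so that digits or `@` characters inside a URL are not processed separately.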

4.1 Convolution Neural Networks (CNN)

Convolution Neural Networks (CNN) are popular in computer vision for various tasks, e.g., face recognition, image classification, action recognition, human pose estimation, and scene labeling. CNNs are also used in many Natural Language Processing (NLP) tasks, e.g., named entity recognition, part-of-speech tagging, and chunking. We use a CNN for our problem along the lines of the approach given in [10]. The CNN architecture has five layers, namely, an input layer, a convolution layer, a pooling layer, a hidden layer, and an output layer.

The input to the model is a tweet. Let each tweet consist of a sequence of words: \(\{term_1, term_2, term_3,...,term_n\}\). The tweet vector is then represented as

$$\begin{aligned} T_v = w_1 \circ w_2 \circ w_3 \circ ...\circ w_n \end{aligned}$$
(1)

where \(w_i\) is the word embedding vector of \(term_i\), and \(\circ \) is the concatenation operator. Each \(w_i \in \mathbb {R}^{1 \times d}\) is taken from pre-trained word vectors. These word embeddings are looked up in the embedding matrix \(W \in \mathbb {R}^{V \times d}\), where V is the number of words in the vocabulary. Words are mapped to indices from 1 to V, and row i of the embedding matrix holds the embedding of the word with index i. The tweet matrix \(T_m \in \mathbb {R}^{n \times d}\), in which each word is represented by its word embedding \(w_i \in \mathbb {R}^{1 \times d}\), is given as input to the model. GloVe Twitter word embeddings, which are publicly available [23], are used in our method. Tweet lengths vary, so padding is applied to give all tweet vectors equal length. The next layer is the convolution layer. Convolution feature maps are created to extract emotion features. A convolution feature is calculated as follows.
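As an illustration of the lookup and padding described above, the following sketch builds a padded tweet matrix from a toy embedding matrix. The function name `build_tweet_matrix`, the toy vocabulary, and the use of row 0 as an all-zero vector for padding and out-of-vocabulary words are assumptions made for this example; the paper only states that words are indexed from 1 to V and that padding is applied.

```python
from typing import Dict, List, Sequence


def build_tweet_matrix(
    terms: Sequence[str],
    vocab_index: Dict[str, int],
    embedding_matrix: Sequence[Sequence[float]],
    max_len: int,
) -> List[List[float]]:
    """Return a max_len x d matrix whose i-th row is the embedding of term i.

    Row 0 of embedding_matrix is assumed to be an all-zero vector that is
    used both for padding and for words missing from the vocabulary.
    """
    d = len(embedding_matrix[0])
    rows: List[List[float]] = []
    for term in terms[:max_len]:            # truncate tweets longer than max_len
        idx = vocab_index.get(term, 0)      # unknown words map to row 0
        rows.append(list(embedding_matrix[idx]))
    while len(rows) < max_len:              # pad shorter tweets with zero rows
        rows.append([0.0] * d)
    return rows


if __name__ == "__main__":
    vocab = {"happy": 1, "day": 2}                   # toy vocabulary, V = 2
    W = [[0.0, 0.0], [1.0, 2.0], [3.0, 4.0]]         # toy embeddings, d = 2
    for row in build_tweet_matrix(["happy", "birthday", "day"], vocab, W, 5):
        print(row)
```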

$$\begin{aligned} o_i = g(\alpha \cdot w_{i:i+h-1}+\beta ) \end{aligned}$$
(2)

where \(\alpha \) is a convolution filter, \(\beta \in \mathbb {R}\) is a bias term, h is the window size, \(w_{i:i+h-1}\) is the concatenation of the embeddings of the terms occurring in a window of length h, from positions i to \(i+h-1\), and g is a non-linear function such as the hyperbolic tangent. The convolution filter is applied to each possible window of words in the tweet to produce a convolution feature map \(o = [o_1, o_2, \ldots , o_{n-h+1}] \in \mathbb {R}^{n-h+1}\). The next layer is the max pooling layer. The main idea of this layer is to capture the most important activation, i.e., the largest of the output values \(o_i \in \mathbb {R}\) of the filter. Max-over-time pooling is computed as follows.

$$\begin{aligned} c = \max _i(o_i) \in \mathbb {R}\end{aligned}$$
(3)

The output of the max-pooling layer is given as input to the dense hidden layer. The output of the hidden layer is passed to the final output layer, which uses the sigmoid as its activation function. The value output by the sigmoid is taken as the predicted emotion intensity for the input tweet. Dropout is applied to avoid overfitting.
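The convolution and max-over-time pooling steps of Eqs. (2) and (3) can be sketched for a single filter as follows. This is a plain-Python illustration under stated assumptions: the function names are hypothetical, g is taken to be the hyperbolic tangent, and the filter is stored as a flat vector of length h times d so that it can be applied to the concatenated window.

```python
import math
from typing import List, Sequence


def convolution_feature_map(
    tweet_matrix: Sequence[Sequence[float]],
    conv_filter: Sequence[float],
    bias: float,
    h: int,
) -> List[float]:
    """Apply one filter to every window of h consecutive word embeddings.

    Returns the n - h + 1 values o_i = tanh(filter . w_{i:i+h-1} + bias).
    """
    n = len(tweet_matrix)
    if n < h:
        raise ValueError("tweet must contain at least h rows (pad it first)")
    feature_map = []
    for i in range(n - h + 1):
        # Concatenate the embeddings of the h terms in the current window.
        window = [x for row in tweet_matrix[i:i + h] for x in row]
        activation = sum(a * w for a, w in zip(conv_filter, window)) + bias
        feature_map.append(math.tanh(activation))
    return feature_map


def max_over_time(feature_map: Sequence[float]) -> float:
    """Keep only the largest activation of the feature map."""
    return max(feature_map)


if __name__ == "__main__":
    tweet = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # n = 3 words, d = 2
    alpha = [1.0, 0.0, 0.0, 1.0]                   # one filter, h * d = 4 weights
    o = convolution_feature_map(tweet, alpha, bias=0.0, h=2)
    print(o, max_over_time(o))
```

With several filters, each filter yields one pooled value, and the pooled values together form the input of the hidden layer.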

The dataset used in our experiments contains four emotion categories, so four CNNs are used, one per category. Each CNN is trained separately on its emotion category and predicts the emotion intensities for that category. The same configuration (filter length, number of filters, word embedding dimension, dropout rate, number of neurons in the hidden layer, number of layers, etc.) is used for all the categories. The CNN model is static, i.e., the word embeddings are not updated during training.

4.2 Extreme Gradient Boosting (XGBoost)

This is the second method in the ensemble. XGBoost is based on the original Gradient Boosting Machine (GBM) framework [6]. It is a supervised learning algorithm that builds a tree ensemble model, i.e., a set of Classification and Regression Trees (CARTs). A single tree is usually not strong enough in practice. In a tree ensemble, the predictions of multiple trees are added to obtain the final prediction. Mathematically, the model is written as

$$\begin{aligned} \hat{y_i} = \sum _{k=1}^{K}f_k(x_i), f_k \in F \end{aligned}$$
(4)

where K is the number of trees, each \(f_k\) is a function in the functional space F, F is the set of all possible CARTs, \(x_i\) is the feature vector of the \(i^{th}\) training instance, and \(\hat{y_i}\) is its prediction. If \(y_i\) is the target variable, then the objective function can be written as

$$\begin{aligned} Obj = \sum _{i=1}^{n}l(y_i, \hat{y_i}) + \sum _{k=1}^{K}\varOmega (f_k) \end{aligned}$$
(5)

The first term in the above equation is the training loss and the second term is the regularization. The model is trained additively, i.e., trees are added one at a time, each chosen to improve the objective given the trees already in the model. XGBoost is often used in data science competitions. It performs computations in parallel and is very fast. Word n-gram and character n-gram features are used in this model.
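The word and character n-gram features mentioned above can be extracted along the following lines. This sketch is illustrative only: the paper does not state the n-gram orders, so the defaults below (word unigrams and bigrams, character trigrams), the whitespace tokenization, and the key format are assumptions.

```python
from collections import Counter
from typing import List, Sequence


def word_ngrams(text: str, n: int) -> List[str]:
    """Return all runs of n consecutive whitespace-separated tokens."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def char_ngrams(text: str, n: int) -> List[str]:
    """Return all substrings of n consecutive characters (spaces included)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def ngram_counts(
    text: str,
    word_orders: Sequence[int] = (1, 2),   # assumed orders, not from the paper
    char_orders: Sequence[int] = (3,),     # assumed orders, not from the paper
) -> Counter:
    """Count word and character n-grams; keys look like 'w2:so angry'."""
    counts: Counter = Counter()
    for n in word_orders:
        counts.update(f"w{n}:{g}" for g in word_ngrams(text, n))
    for n in char_orders:
        counts.update(f"c{n}:{g}" for g in char_ngrams(text, n))
    return counts


if __name__ == "__main__":
    print(ngram_counts("so very angry"))
```

Each distinct key would correspond to one column of the sparse feature matrix given to the regressor.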

4.3 Support Vector Regression (SVR)

This is the third method used in our ensemble, and it is taken from [16]. The features used are word n-grams, character n-grams, word embeddings, and lexicons. Word embeddings with 400 dimensions are trained on the Edinburgh Twitter corpus [24] using \(\textit{Word2Vec}\). The lexicons used in this method are AFINN [22], BingLiu [9], MPQA [32], NRC Affect Intensity Lexicon [15], NRC Word-Emotion Association Lexicon [20], NRC10 Expanded [5], NRC Hashtag Emotion Association Lexicon [18], NRC Hashtag Sentiment Lexicon [19], Sentiment140 [19], SentiWordNet [3], and SentiStrength [29]. If a lexicon provides categorical labels for words, the number of words in the tweet matching each category is counted. If a lexicon provides scores for words, the scores of the words in the tweet are summed. An SVR model is trained on these features for each category separately, and the emotion intensities are predicted.
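The two kinds of lexicon features described above can be illustrated with toy lexicons. The lexicon contents, category names, and function names below are hypothetical and only demonstrate the counting and summing rules; they do not reproduce any of the cited lexicons.

```python
from typing import Dict, Iterable, Sequence


def categorical_lexicon_features(
    tokens: Iterable[str],
    lexicon: Dict[str, Sequence[str]],
    categories: Sequence[str],
) -> Dict[str, int]:
    """Count, per category, how many tokens of the tweet carry that label."""
    counts = {category: 0 for category in categories}
    for token in tokens:
        for label in lexicon.get(token, ()):
            if label in counts:
                counts[label] += 1
    return counts


def score_lexicon_feature(tokens: Iterable[str], lexicon: Dict[str, float]) -> float:
    """Sum the lexicon scores of the tokens; unknown words contribute 0."""
    return sum(lexicon.get(token, 0.0) for token in tokens)


if __name__ == "__main__":
    tweet_tokens = ["furious", "and", "scared", "furious"]
    toy_categorical = {"furious": ["anger"], "scared": ["fear"]}   # hypothetical
    toy_scores = {"furious": 0.9, "scared": 0.7}                   # hypothetical
    print(categorical_lexicon_features(tweet_tokens, toy_categorical, ["anger", "fear", "joy"]))
    print(score_lexicon_feature(tweet_tokens, toy_scores))
```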

4.4 Ensemble

Ensemble methods have proven to be very successful for classification problems. A system named Webis achieved the best performance in SemEval-2015 subtask B, “Sentiment Analysis in Twitter” [8]. In the Netflix competition [30] and KDD Cup 2009 [21], the winners used ensemble-based methods. There are several ways to combine models, e.g., bagging, boosting, simple averaging, majority voting, and stacking. We tested some of them with our methods, and simple averaging performed better than the others. Our ensemble method works as follows. The CNN with word embedding features is trained on each category of the training data separately and applied to the testing data, and its predictions are recorded. Similarly, XGBoost with word n-gram and character n-gram features is trained, and its predictions on the testing data are saved. In the same way, SVR with lexicon and word embedding features is trained and applied to the testing data, and its predictions are recorded. Finally, for each tweet, the average of the predictions of the individual methods is taken as the final prediction.
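The simple averaging step can be sketched as follows. The function name and the example prediction values are made up for illustration; the sketch only assumes that each model outputs one intensity per test tweet, in the same tweet order.

```python
from typing import List, Sequence


def average_ensemble(*model_predictions: Sequence[float]) -> List[float]:
    """Average, tweet by tweet, the predictions of several models."""
    lengths = {len(p) for p in model_predictions}
    if not model_predictions or len(lengths) != 1:
        raise ValueError("need at least one model and equally long prediction lists")
    k = len(model_predictions)
    return [sum(values) / k for values in zip(*model_predictions)]


if __name__ == "__main__":
    cnn_pred = [0.6, 0.2]   # illustrative values, not experimental results
    xgb_pred = [0.3, 0.4]
    svr_pred = [0.9, 0.0]
    print(average_ensemble(cnn_pred, xgb_pred, svr_pred))
```

Because each base model already outputs values on the same 0 to 1 scale as the target, the plain mean needs no rescaling.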

5 Experiments

5.1 Data

The dataset used in our experiments is obtained from [16]. Statistics of the data are given in Table 2. Each row of the dataset contains an id, the text, the emotion category, and the emotion intensity, as shown in Table 1. The emotion intensity is a real value between 0 and 1. There are four categories of emotions, namely, anger, fear, joy, and sadness. The dataset is created using a technique called best-worst scaling (BWS), which improves annotation consistency and yields reliable emotion intensity values.

Table 2. Number of tweets in each category

5.2 Evaluation Metrics

In this section we describe the evaluation metrics used in our approach.

  • Pearson correlation (PC):

    It measures the correlation between two variables. Pearson correlation is calculated between predicted values and gold values. Pearson correlation coefficient is calculated as

    $$\begin{aligned} PC = \frac{\sum _{i=1}^{n}(x_{i}-\bar{x})(y_{i}-\bar{y})}{\sqrt{\sum _{i=1}^{n}(x_i-\bar{x})^2}\sqrt{\sum _{i=1}^{n}(y_i-\bar{y})^2}} \end{aligned}$$
    (6)

    In our problem, n is the number of test tweets, \(x_i\) is predicted emotion intensity value for \(i^{th}\) test tweet, \(y_i\) is ground truth value, \(\bar{x}\) is mean of x, and \(\bar{y}\) is mean of y.

  • Spearman rank correlation (SC):

    It measures the relationship between two rankings. Let X denote the set of actual intensity values and Y denote the set of predicted intensity values, and let X and Y be converted to ranks \(rg_X\) and \(rg_Y\), respectively. The Spearman rank correlation coefficient is calculated as

    $$\begin{aligned} SC = \frac{cov(rg_X,rg_Y)}{\sigma _{rg_X}\sigma _{rg_Y}} \end{aligned}$$
    (7)

    where \(cov(rg_X,rg_Y)\) is the covariance of the rank variables, and \(\sigma _{rg_X}\) and \(\sigma _{rg_Y}\) are the standard deviations of the rank variables.

In some applications, only tweets with high emotion content are relevant, so it is useful to measure how well such content is identified. To evaluate this, we use two additional metrics, Pearson 0.5 to 1.0 (PCH) and Spearman 0.5 to 1.0 (SCH). Pearson 0.5 to 1.0 is calculated by considering only the instances whose ground truth emotion intensity is greater than or equal to 0.5; the rest are ignored. Spearman 0.5 to 1.0 is calculated in the same way.
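The four metrics can be sketched in plain Python as follows. This is an illustrative implementation rather than the official evaluation script: the function names are hypothetical, and tied values are given their average rank, which is a common convention that the paper does not specify.

```python
import math
from typing import List, Sequence, Tuple


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation as in Eq. (6); undefined if x or y is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def average_ranks(values: Sequence[float]) -> List[float]:
    """Convert values to 1-based ranks; ties share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation as in Eq. (7): Pearson on the rank variables."""
    return pearson(average_ranks(x), average_ranks(y))


def high_range(
    pred: Sequence[float], gold: Sequence[float], threshold: float = 0.5
) -> Tuple[List[float], List[float]]:
    """Keep only the instances whose gold intensity is >= threshold."""
    kept = [(p, g) for p, g in zip(pred, gold) if g >= threshold]
    return [p for p, _ in kept], [g for _, g in kept]


if __name__ == "__main__":
    gold = [0.1, 0.4, 0.6, 0.9]   # illustrative values, not experimental data
    pred = [0.2, 0.3, 0.7, 0.8]
    print("PC ", pearson(pred, gold))
    print("SC ", spearman(pred, gold))
    pred_h, gold_h = high_range(pred, gold)
    print("PCH", pearson(pred_h, gold_h))
    print("SCH", spearman(pred_h, gold_h))
```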

5.3 Results and Discussions

The first method used in the ensemble is the CNN. GloVe Twitter word embeddings with 25, 50, 100, and 200 dimensions are used [23]. In our setting, the maximum sentence length is 100, the window size is 3, the number of filters is 250, the hidden layer has 200 neurons, and the dropout rate is 0.2. The results of the CNN with 25D, 50D, 100D, and 200D word embeddings are reported in Tables 3, 4, 5, and 6, respectively. We observe that performance improves as the embedding dimension increases: the CNN with 50D embeddings performs better than with 25D, the CNN with 100D better than with 50D, and the CNN with 200D better than with 100D. Therefore, the CNN with 200D embeddings is used in our method.

Table 3. CNN with Glove 25D.
Table 4. CNN with Glove 50D.
Table 5. CNN with Glove 100D.
Table 6. CNN with Glove 200D.
Table 7. XGBoost.
Table 8. SVR.
Table 9. Ensemble (proposed method).
Table 10. Comparison of our proposed method with other approaches.
Fig. 1. Category-wise comparison of four emotion categories

The second method used in the ensemble is XGBoost. Its parameters are: learning rate = 0.1, number of estimators = 100, gradient boosted trees as the booster, and maximum depth = 3. The results for the four emotion categories are reported in Table 7. The Pearson and Spearman coefficients are higher than those of the CNN with 25D embeddings but lower than those of the CNN with the other word embedding dimensions (50D, 100D, 200D). The final method used in the ensemble is SVR (Table 8). This model uses a linear kernel with penalty term C = 0.001. Radial Basis Function (RBF) and polynomial kernels were also tested, but the linear kernel performed better. The evaluation metric values of SVR are better than those of XGBoost and of the CNN with 25D and 50D embeddings. The ensemble is created by averaging the predictions of the three methods described in Sect. 4, and its results are reported in Table 9. For the Pearson coefficient, over both the 0 to 1 and the 0.5 to 1.0 ranges, the ensemble values are higher than those of all other methods. The same holds for the Spearman coefficient over both ranges.

A comparison with SVR [16], IMS [11], and combinations of the methods in the ensemble is presented in Table 10. We observe that our ensemble method significantly outperforms the baselines and the other combinations, which shows the efficacy of our proposed method. The category-wise comparison of our approach with other methods for the four emotion categories, anger, fear, joy, and sadness, is presented in Fig. 1a, b, c, and d, respectively. For the anger, fear, and joy categories, our method performs better than the other methods. We attribute this to the diverse features used by the individual methods of the ensemble. For the sadness category, the values of our proposed method are higher for PC and SC, whereas those of the CNN+SVR combination are slightly higher for PCH and SCH.

6 Conclusion

In this paper, we presented an ensemble approach to predict the emotion intensity in tweets. The ensemble combines three methods: Convolution Neural Networks (CNN), XGBoost, and Support Vector Regression (SVR). GloVe Twitter word embeddings of different dimensions are used for training the CNN model. The diverse features used by these three methods make the ensemble stronger at predicting emotion intensities. Experimental results show that our method significantly outperforms other methods. For future work, we would like to identify new features and new methods to include in the ensemble.