
1 Introduction

With the rapid development of social media, more and more people are willing to express their feelings and emotions via microblogging platforms. Microblogs have aggregated a huge number of tweets containing people's opinions about personal lives, celebrities, social events and so on. Therefore, analyzing people's sentiments in microblogs has attracted increasing attention from both the academic and industrial communities.

Traditional sentiment analysis usually classifies a microblog into a positive or negative orientation, or identifies people's fine-grained emotions in microblogs, such as happiness, sadness, like, anger, disgust, fear and surprise. In these methods, each microblog is associated with a single sentiment orientation or emotion label, which is commonly referred to as a single-label classification problem. In fact, however, multiple emotions may coexist in a single tweet or even a single sentence of a microblog, as shown in the examples of Table 1.

Table 1. Examples of multiple coexisting emotions in microblogs

Different from the traditional single-label sentiment classification problem, detecting emotions in microblogs poses critical new challenges. First, a microblog usually has a length limit of 140 characters, which leads to extremely sparse vectors for the learning algorithms. The free and informal writing styles of users also pose obstacles to emotion detection in microblogs. Second, although the text is short, multiple emotions may coexist in one microblog. As shown in Table 1, the second tweet expresses a disgust emotion and a fear emotion simultaneously. Therefore, for the emotion detection task, each microblog can be associated with multiple fine-grained emotion labels, which has rarely been studied in the previous literature.

To tackle these challenges, in this paper we regard emotion detection in microblogs as a multi-label classification problem. We utilize distributed word representations to alleviate the sparse vector problem and leverage a Convolutional Neural Network (CNN) based method to detect the coexisting mixed emotions in the short text of microblogs.

Multi-label classification has been extensively studied in many applications [1]. However, only limited literature is available on analyzing coexisting emotions in text. Ye et al. employed a multi-label classification algorithm to classify input online news articles into categories according to different readers' emotions [2]. Bhowmick et al. presented a novel method for classifying news into multiple emotion categories using an ensemble based multi-label classification technique called RAKEL [3]. Read et al. used restricted Boltzmann machines to develop better feature-space representations that could make labels less interdependent and easier to predict in multi-label classification [4]. Although these studies achieved promising results, the first two of them [2, 3] focused on long online news articles with a dense vector space. Liu et al. gave a detailed empirical study of different classic multi-label learning methods for the multiple emotion sentiment classification task on Chinese microblogs [5]. The goal of Liu's work is similar to ours. However, they conducted their experiments on two very small datasets that contained only a few hundred microblogs each, and the classic learning methods they used were heavily dependent on the quality of the lexicons and manually designed features.

In this paper, we utilize a Convolutional Neural Network (CNN) based method to solve the multi-label emotion classification problem in Chinese microblogs without any manually designed features. The details of our proposed framework are shown in Fig. 1. (1) We preprocess the microblogs and transform the multi-label training dataset into a single-label dataset. (2) We utilize the skip-gram language model and an extremely large corpus to train the distributed representation of each word in the single-label dataset. Dense vector representations of the microblog sentences are generated from the word vectors with an appropriate semantic composition method. (3) We use a CNN based method to train one multi-class classifier and k binary classifiers to obtain the probability values of the k emotion labels. (4) We leverage a multi-label learning method based on Calibrated Label Ranking (CLR) to obtain the final emotion labels of each microblog. As a powerful deep learning algorithm, CNN has achieved remarkable performance in computer vision and speech recognition. However, as far as we know, CNN has not previously been reported for solving the multi-label text classification task.

Fig. 1. The details of our proposed framework

The main contributions of this paper are summarized as follows: (1) To the best of our knowledge, this is the first attempt to explore the feasibility and effectiveness of a CNN model, with the help of word embeddings, for detecting coexisting emotions in microblog short text. (2) We utilize the skip-gram language model to obtain distributed word representations instead of TFIDF based extremely sparse vectors. Meanwhile, the input features are learned by the CNN automatically rather than being designed manually. (3) We conduct extensive experiments on two public short text datasets. The experimental results demonstrate that the proposed method achieves excellent performance in terms of multi-label classification metrics.

2 Related Work

Analyzing the sentiments and emotions in microblogs has attracted increasing research interest in recent years. The methods used for microblog sentiment analysis can generally be classified into lexicon-based [6] and learning-based [7] categories. The performance of many existing learning based methods is heavily dependent on the choice of feature representation of tweets. For example, Mohammad et al. designed a number of hand-crafted features to implement a Twitter sentiment classification system that achieved the best performance in the SemEval 2013 task [8]. Unlike previous studies, on the one hand, the CNN model used in this paper can automatically learn discriminative features for the classification task. On the other hand, we utilize a multi-label learning based framework to detect coexisting multiple emotions in tweets, which has rarely been studied in the previous literature.

The goal of multi-label classification is to associate a sentence with a set of labels. There are two main types of methods for multi-label classification: algorithm adaptation methods and problem transformation methods [9]. Algorithm adaptation methods generalize existing classification algorithms to handle multi-label datasets. Zhang et al. proposed a multi-label lazy learning approach named ML-KNN to solve the multi-label problem [10]. Problem transformation methods transform a multi-label classification problem into single-label problems. Wang et al. proposed a calibrated label ranking based framework for detecting multi-label fine-grained emotions in Chinese microblogs [11]. Following the detailed empirical study of different multi-label classification methods in [5] and our previous work, we utilize Calibrated Label Ranking (CLR) as the learning method for the multi-label microblog emotion classification task.

As a powerful deep learning algorithm, the Convolutional Neural Network can decrease the number of features by capturing local semantic features, and it enables end-to-end, automatically trainable systems that do not rely on human-designed heuristics. CNN has been successfully used in image recognition, speech recognition [12] and machine translation [13]. Wei et al. proposed a flexible deep CNN infrastructure, called Hypotheses-CNN-Pooling (HCP), to solve the multi-label problem for images [14]. In recent years, CNN has achieved remarkable performance on various challenging NLP tasks, such as part-of-speech tagging, chunking, named entity recognition and semantic role labeling [15]. Santos et al. proposed a new deep CNN that exploits information from the character level to the sentence level to perform sentiment analysis of short text [16]. Kim et al. proposed a simple CNN architecture that allows the use of both task-specific and static vectors for sentence-level classification tasks [17]. Although many papers have been published on CNN based text classification, no literature has been reported on exploring the feasibility and effectiveness of the CNN model for the multi-label microblog emotion classification task. In this paper, we introduce a CNN framework to solve the multi-label emotion classification problem in Chinese microblogs without any manually designed features.

3 Microblog Modeling Using Convolutional Neural Network

In this section, we propose a CNN framework to solve the multi-label sentiment classification problem in Chinese microblogs. Suppose \( Y = \left\{ {y_{1} ,y_{2} , \ldots ,y_{k} } \right\} \) represents the emotion label set. Given the training microblog dataset \( D = \left\{ {\left( {x_{i} ,Y_{i} } \right)|1 \le i \le p} \right\} \), where \( p \) is the number of sentences in \( D \), \( x_{i} \) is a sentence and \( Y_{i} \subseteq Y \) represents its emotion label subset, our aim is to use a CNN classifier to predict the emotion label set \( h\left( x \right) \subseteq Y \) of a sentence \( x \) in the test dataset \( T \).

3.1 Preprocessing of Datasets and Training Dataset Transformation

The raw microblog data is inconsistent and redundant. We remove all URLs, @ symbols, stop words and English tweets. Then, we use the Jieba (see Footnote 1) Python package for Chinese text segmentation, which breaks the microblogs into words.
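As an illustration, a minimal Python sketch of this preprocessing step is given below. The regular expressions and the tiny stop-word set are our own assumptions; the paper does not specify the exact cleaning rules or stop-word list.

```python
import re
import jieba

STOP_WORDS = {"的", "了", "是"}  # placeholder; the actual stop-word list is unspecified

def preprocess(text):
    """Strip URLs and @-mentions, then segment a microblog into words."""
    text = re.sub(r"https?://\S+", "", text)  # remove URLs
    text = re.sub(r"@\S+", "", text)          # remove @ symbols/mentions
    tokens = jieba.lcut(text)                 # Jieba Chinese word segmentation
    return [t for t in tokens if t.strip() and t not in STOP_WORDS]
```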

Because we use Calibrated Label Ranking (CLR) as the method for multi-label classification, we need to transform the multi-label training microblog dataset into a single-label dataset. The goal of the problem transformation method is to turn the multi-label problem into another well-established learning scenario. In the given training dataset, each microblog sentence may carry one or more relevant emotion labels; therefore we transform the multi-label training dataset into a single-label training dataset so that we can use the CNN learning algorithm to address the multi-label learning problem. The dataset transformation method is shown in Algorithm 1.

In Algorithm 1, we associate each microblog sentence with only one emotion label, which means that a sentence \( x_{i} \) in \( D \) may appear multiple times in \( D^{'} \).
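Algorithm 1 itself is not reproduced here; the following is a minimal sketch consistent with the description above, copying each sentence once per relevant label.

```python
def transform_to_single_label(dataset):
    """Copy each multi-labeled sentence once per relevant emotion label.

    dataset: list of (sentence, label_set) pairs, as in D = {(x_i, Y_i)}.
    Returns D' in which a sentence with m labels appears m times.
    """
    single_label = []
    for sentence, labels in dataset:
        for label in labels:
            single_label.append((sentence, label))
    return single_label
```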

3.2 Convolutional Neural Network Model

The CNN model is a variant of the neural network that includes an input layer, a convolutional layer, a pooling layer and a fully connected layer. The CNN model used in this paper for modeling the microblog sentences is shown in Fig. 2 below.

Fig. 2. Convolutional approach to sentence-level classification with different filter widths

The neurons in a CNN are usually arranged in three dimensions: depth, width and height. The sizes of the input layer, convolutional layer, pooling layer and output layer are depth \( \times \) width \( \times \) height. For example, when a sentence has 70 words and each word is represented by a 300-dimensional vector, the size of the input layer is 1 \( \times \) 70 \( \times \) 300.

Input Layer.

This layer contains the input data. Let \( s_{i} \in {\mathbb{R}}^{k} \) be the k-dimensional word vector corresponding to the i-th word in the sentence. The maximum sentence length in the datasets is n. If a sentence is shorter than n, it is padded as necessary. Therefore, every sentence is represented as

$$ s_{1:n} = s_{1} \oplus s_{2} \oplus \ldots \oplus s_{n} $$
(1)

where \( \oplus \) is the concatenation operator.
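As a concrete reading of Eq. 1, the sketch below stacks the k-dimensional word vectors of a sentence into an n \( \times \) k input matrix; zero-padding is our assumption, since the paper only states that short sentences are padded.

```python
import numpy as np

def build_sentence_matrix(word_vectors, n, k):
    """Pad a sentence to length n and stack its k-dim word vectors (Eq. 1).

    word_vectors: list of np.ndarray of shape (k,), one per word s_i.
    Zero-padding is one common choice; the paper leaves the scheme open.
    """
    padded = word_vectors + [np.zeros(k)] * (n - len(word_vectors))
    return np.stack(padded[:n])  # shape (n, k): s_1 ⊕ s_2 ⊕ ... ⊕ s_n
```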

Convolutional Layer.

In this layer, local semantic features are captured: each neuron in the convolutional layer is connected only to a local window of the input layer. A convolution operation involves a filter \( w \in {\mathbb{R}}^{hk} \), which is applied to a window of \( h \) words to produce a new feature. The filter w can also be considered the weight matrix of the convolutional layer. We compute over a window of \( h \) words to generate a new feature \( z_{i} \):

$$ z_{i} = f\left( {w \cdot s_{i:i + h - 1} + b} \right) $$
(2)

where \( f \) is an activation function, such as the rectified linear unit, and \( b \in {\mathbb{R}} \) is a bias term. This filter is applied to each window of the sentence \( \left\{ {s_{1:h} ,s_{2:h + 1} , \ldots ,s_{n - h + 1:n} } \right\} \) to produce a feature map \( z = \left[ {z_{1} ,z_{2} , \ldots ,z_{n - h + 1} } \right] \), with \( z \in {\mathbb{R}}^{n - h + 1} \).
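A direct NumPy transcription of Eq. 2, using ReLU as the activation f, might look as follows (a sketch, not the authors' implementation):

```python
import numpy as np

def feature_map(S, w, b, h):
    """Apply one convolution filter over a sentence matrix (Eq. 2).

    S: (n, k) sentence matrix; w: (h, k) filter; b: scalar bias.
    Returns z of length n - h + 1, using ReLU as the activation f.
    """
    n = S.shape[0]
    z = np.empty(n - h + 1)
    for i in range(n - h + 1):
        window = S[i:i + h]                      # s_{i:i+h-1}
        z[i] = max(0.0, np.sum(w * window) + b)  # f(w · s_{i:i+h-1} + b)
    return z
```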

Pooling Layer.

In this layer, the max-over-time pooling operation selects global semantic features and attempts to capture the most important feature, the one with the highest value, for each feature map. The pooling layer samples features with common statistics, such as the mean value, the max value or the L2-norm; this also alleviates overfitting to a certain degree. We apply the max-over-time pooling operation to each feature map produced by the convolution operation and take the maximum value \( z_{max} = { \hbox{max} }\left( z \right) \) as the feature corresponding to the filter. Because the model employs several parallel convolutional layers with different word window sizes, we obtain multiple features, which form the input vector for the classifier.
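In code, max-over-time pooling reduces each variable-length feature map to a single scalar, so the concatenated result has a fixed length regardless of sentence length:

```python
import numpy as np

def max_over_time(feature_maps):
    """Max-over-time pooling: keep the largest value z_max of each feature map.

    feature_maps: list of 1-D arrays produced by filters with different
    window sizes h. Returns the fixed-length vector fed to the classifier.
    """
    return np.array([fm.max() for fm in feature_maps])
```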

Fully Connected Layer.

In this layer, the dropout regularization method is applied while the features are summarized. The neurons in a fully connected layer have full connections to all neurons in the previous layer; the structure of a fully connected layer is the same as that of a layer in an ordinary neural network. To avoid overfitting, and as an approximate way of combining exponentially many neural network architectures, we apply dropout on the fully connected layer. When training the model, each neuron is temporarily removed from the network with probability p, as shown in Fig. 2. The dropped neurons are ignored when calculating inputs and outputs in both forward propagation and back propagation, which prevents neurons from co-adapting too much.
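A sketch of the dropout mechanism described here is shown below; the inverted-dropout scaling by 1/(1 - p) is a common implementation choice and our assumption, not something the paper specifies.

```python
import numpy as np

def dropout(activations, p, training=True):
    """Inverted-dropout sketch: drop each neuron with probability p in training.

    Surviving activations are scaled by 1/(1 - p) during training, so the
    full network can be used unchanged at test time.
    """
    if not training:
        return activations
    mask = (np.random.rand(*activations.shape) >= p) / (1.0 - p)
    return activations * mask
```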

Softmax Layer.

Finally, these features are passed to the softmax layer, whose output is the probability distribution over labels.

Input Word Vectors.

The input of the CNN model is the distributed representation of the words, namely the word vectors, in the microblog sentences. These vectors can be initialized randomly during CNN model training, but a popular method to improve the performance of a CNN is to initialize the word vectors with an unsupervised neural language model. In this paper, we utilize the skip-gram model [18] to pre-train word vectors on an extremely large corpus of more than 30 million Chinese microblogs. During the training of the CNN, one strategy is to keep the pre-trained word vectors static; another is to fine-tune the vectors via back propagation.
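In a modern framework, the "static" and "non-static" strategies differ only in whether the embedding layer is trainable. The Keras-style sketch below is our illustration; the paper does not name the framework it used, and the matrix here is a random stand-in for the pre-trained skip-gram vectors.

```python
import numpy as np
from tensorflow.keras.layers import Embedding
from tensorflow.keras.initializers import Constant

vocab_size, dim = 50000, 300                        # illustrative sizes
embedding_matrix = np.random.rand(vocab_size, dim)  # stand-in for skip-gram vectors

# "CNN-static": pre-trained vectors are frozen during training.
static_embedding = Embedding(vocab_size, dim,
                             embeddings_initializer=Constant(embedding_matrix),
                             trainable=False)

# "CNN-non-static": the same vectors are fine-tuned via back propagation.
non_static_embedding = Embedding(vocab_size, dim,
                                 embeddings_initializer=Constant(embedding_matrix),
                                 trainable=True)
```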

4 Calibrated Label Ranking Based Multi-label Classification

Based on the transformed single-label microblog dataset and the CNN classification algorithm, in this section we leverage the calibrated label ranking method to detect the multiple emotions in microblog short text.

4.1 Ranking Function

A single-label multi-class classifier \( f\left( \cdot \right) \) is built so that we can obtain the confidence of each emotion label for a given microblog sentence. We use \( f\left( {x_{i} ,y_{j} } \right) \) to represent the confidence that sentence \( x_{i} \) has emotion label \( y_{j} \). The \( f\left( {x_{i} ,y_{j} } \right) \) values can subsequently be transformed into a ranking function \( rank_{f} \left( {x_{i} ,y_{j} } \right) \); we use Formula 3 to obtain the rank value.

$$ rank_{f} \left( {x_{i} ,y_{j} } \right) = \sum\nolimits_{k = 1,k \ne j}^{q} \llbracket f\left( {x_{i} ,y_{j} } \right) > f\left( {x_{i} ,y_{k} } \right) \rrbracket $$
(3)

where \( y_{j} \in Y \), \( Y \) is the label set, \( q \) is the number of labels, and \( \llbracket \pi \rrbracket = 1 \) if the predicate \( \pi \) holds, and 0 otherwise.
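A direct implementation of Formula 3 is straightforward: the rank of a label is simply the number of other labels it outscores. A sketch:

```python
import numpy as np

def rank_f(scores):
    """Formula 3: rank of each label = number of labels it outscores.

    scores: length-q array with scores[j] = f(x_i, y_j) from the
    multi-class CNN classifier.
    """
    q = len(scores)
    return np.array([sum(scores[j] > scores[k] for k in range(q) if k != j)
                     for j in range(q)])
```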

4.2 Threshold Selection and Label Set Result

After we obtain the rank values of all labels from the ranking function, the key issue is to find a threshold that divides the ranked label list into a relevant label set and an irrelevant label set. To determine the threshold, the procedure is as follows:

(1) Transform the multi-class single-label training dataset into \( q \) separate datasets, where each dataset is associated with a binary label (with/without a given emotion label).

(2) From the \( q \) binary-label training datasets, we construct \( q \) binary CNN classifiers \( g_{j} \left( {x_{i} } \right) \), \( j \in \left\{ {1,2, \ldots ,q} \right\} \), where \( x_{i} \) represents a sentence. We use Formula 4 to construct the training datasets of the \( q \) binary classifiers.

    $$ D_{j} = \left\{ {\left( {x_{i} ,\phi \left( {Y_{i} ,y_{j} } \right)} \right)|1 \le i \le p} \right\} $$
    $$ \phi \left( {Y_{i} ,y_{j} } \right) = \left\{ {\begin{array}{*{20}c} { + 1, y_{j} \in Y_{i} } \\ { - 1, otherwise} \\ \end{array} } \right. $$
    (4)

    where \( D_{j} \) is the training dataset for label \( y_{j} \), \( j \in \left\{ {1,2, \ldots ,q} \right\} \), \( p \) is the number of sentences in the training dataset and \( Y_{i} \) is the relevant label set of sentence \( x_{i} \). If label \( y_{j} \) is in the label set \( Y_{i} \), then \( \phi \left( {Y_{i} ,y_{j} } \right) = + 1 \); otherwise \( \phi \left( {Y_{i} ,y_{j} } \right) = - 1 \).

    The \( q \) binary classifiers \( g_{j} \left( {x_{i} } \right) \), trained on the separate training datasets, classify every sentence in the testing dataset. If \( x_{i} \) is associated with \( y_{j} \), then \( g_{j} \left( {x_{i} } \right) = + 1 \); otherwise \( g_{j} \left( {x_{i} } \right) = - 1 \).

(3) Apply the outputs of the \( q \) binary classifiers to produce the threshold and the rank value for every microblog sentence in the testing dataset. The rank value \( rank_{f}^{*} \left( {x_{i} ,y_{j} } \right) \) of label \( y_{j} \) and the threshold value \( rank_{f}^{*} \left( {x_{i} ,y_{v} } \right) \) are updated as follows.

    $$ rank_{f}^{*} \left( {x_{i} ,y_{j} } \right) = rank_{f} \left( {x_{i} ,y_{j} } \right) + \llbracket g_{j} \left( {x_{i} } \right) > 0 \rrbracket $$
    (5)
    $$ rank_{f}^{*} \left( {x_{i} ,y_{v} } \right) = \sum\nolimits_{j = 1}^{q} \llbracket g_{j} \left( {x_{i} } \right) < 0 \rrbracket $$
    (6)

Finally, we compare each rank value with the threshold value to obtain the relevant label set and the irrelevant label set. If the rank value of a label is larger than the threshold, namely \( rank_{f}^{*} \left( {x_{i} ,y_{v} } \right) \), we put the label into the relevant label set; otherwise we put it into the irrelevant label set. We regard the relevant label set as the final result of the multi-label classification. The formula for the multi-label emotion classifier \( h\left( {x_{i} } \right) \) is:

$$ h\left( {x_{i} } \right) = \left\{ {y_{j} |rank_{f}^{*} \left( {x_{i} ,y_{j} } \right) > rank_{f}^{*} \left( {x_{i} ,y_{v} } \right),y_{j} \in Y} \right\} $$
(7)

In summary, for a sentence in the test dataset, the process of predicting its multi-label emotion set is shown as Algorithm 2.
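Algorithm 2 is not reproduced here, but the prediction step follows directly from Formulas 3 and 5-7. The following sketch combines them for a single test sentence, given the multi-class confidences and the q binary classifier outputs:

```python
import numpy as np

def predict_labels(scores, binary_outputs):
    """Calibrated label ranking prediction (Formulas 3, 5-7), a minimal sketch.

    scores: length-q array of multi-class confidences f(x_i, y_j).
    binary_outputs: length-q array with g_j(x_i) in {+1, -1}.
    Returns the indices of the predicted relevant emotion labels.
    """
    q = len(scores)
    rank = np.array([sum(scores[j] > scores[k] for k in range(q) if k != j)
                     for j in range(q)])                      # Formula 3
    rank_star = rank + (binary_outputs > 0)                   # Formula 5
    threshold = int(np.sum(binary_outputs < 0))               # Formula 6: rank of y_v
    return [j for j in range(q) if rank_star[j] > threshold]  # Formula 7
```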

5 Experiment

5.1 Experiment Setup and Datasets

We conduct our experiments on the NLPCC 2014 Emotion Analysis in Chinese Weibo Text (EACWT) task (see Footnote 2) and the Ren_CECps [19] dataset. In EACWT, the training dataset contains 15,677 microblog sentences, the testing dataset contains 3,816 microblog sentences, the fine-grained emotion label set contains seven basic emotions, i.e. Y = {like, happiness, sadness, fear, surprise, disgust, anger}, and every sentence has at most two emotion labels. In Ren_CECps, the training dataset contains 26,000 Chinese blog sentences, the testing dataset contains 6,153 Chinese blog sentences, the fine-grained emotion label set contains eight basic emotions, i.e. Y = {anger, anxiety, expect, hate, joy, love, sorrow, surprise}, and every sentence has at most three emotion labels. Note that Ren_CECps is a Chinese blog dataset containing documents on different topics. However, the average sentence length is similar to that of the microblog dataset EACWT. Therefore, we can utilize the sentences in Ren_CECps as well as EACWT to evaluate the performance of our proposed short text multi-label emotion classification algorithm.

Table 2 shows statistics on the number of sentences with different numbers of emotion labels in the two datasets. As Table 2 shows, the Ren_CECps dataset has many more sentences with two emotion labels. We argue that by using these two datasets with different emotion label distributions, we can better evaluate the feasibility and effectiveness of our proposed model and method.

Table 2. The number of sentences in the datasets with different number of emotion labels

5.2 Hyper-parameters Settings

For the CNN model we use: rectified linear units, filter windows (h) of 2, 3 and 4 with 100 feature maps each, a dropout rate (p) of 0.5, an \( l_{2} \) constraint (s) of 3, and a mini-batch size of 50 with stochastic gradient descent. The Adadelta update rule is used. The maximum sentence length n is 70 for EACWT and 90 for Ren_CECps.
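For concreteness, the sketch below assembles a model with these hyper-parameters in Keras. The framework choice is ours (the paper does not name one), and we read the \( l_{2} \) constraint of 3 as a max-norm weight constraint, as in Kim's CNN [17]; training would then call model.fit with batch_size=50.

```python
from tensorflow.keras import layers, models, optimizers
from tensorflow.keras.constraints import max_norm

def build_cnn(n=70, k=300, num_classes=7):
    """Sketch of the described CNN with the stated hyper-parameters.

    Filter windows h = 2, 3, 4 with 100 feature maps each, ReLU,
    max-over-time pooling, dropout 0.5, a max-norm constraint of 3,
    and a softmax output over the emotion labels.
    """
    inputs = layers.Input(shape=(n, k))          # n words, k-dim word vectors
    pooled = []
    for h in (2, 3, 4):
        conv = layers.Conv1D(100, h, activation="relu")(inputs)
        pooled.append(layers.GlobalMaxPooling1D()(conv))  # max-over-time pooling
    merged = layers.Concatenate()(pooled)
    dropped = layers.Dropout(0.5)(merged)
    outputs = layers.Dense(num_classes, activation="softmax",
                           kernel_constraint=max_norm(3))(dropped)
    model = models.Model(inputs, outputs)
    model.compile(optimizer=optimizers.Adadelta(),
                  loss="categorical_crossentropy")
    return model
```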

We utilize the well-known word embedding tool word2vec to train the word vectors on a crawled corpus of 30 million Chinese microblogs. The hyper-parameters of our pre-trained word vectors are: the skip-gram algorithm, a dimensionality of 50, 100, 200 or 300, a window size of 8, a sample value of 1e-4, and 15 training iterations.
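With gensim, a common word2vec implementation though not one the paper names, this pre-training configuration can be reproduced as follows (parameter names follow gensim 4.x; older releases use size and iter instead of vector_size and epochs):

```python
from gensim.models import Word2Vec

# Stand-in for the segmented 30-million-microblog corpus: an iterable of
# token lists, as produced by the preprocessing step.
sentences = [["今天", "天气", "很好"], ["电影", "好看"]]

model = Word2Vec(sentences, sg=1, vector_size=300,  # skip-gram, 300-dim vectors
                 window=8, sample=1e-4, epochs=15)  # as reported above
model.wv.save_word2vec_format("weibo_vectors.txt")
```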

5.3 Model Variations and Baselines

In this paper, we conduct experiments with several variants of the CNN model. The SVM classifier, which has achieved excellent performance in many previous sentiment classification tasks, is used as the baseline model.

SVM-tfidf: We use tfidf to generate unigram feature vectors and apply an SVM classifier with the CLR algorithm to solve the microblog multi-label classification problem. In previous work, the SVM-tfidf method achieved the best performance in the EACWT evaluation task [11]. We regard it as a strong baseline.

SVM-word2vec: We use a vector semantic composition based method for modeling the microblog sentence: the vectors of the words in the sentence are added together to form the distributed representation of the microblog sentence.

CNN-static: The inputs of the CNN are the pre-trained word2vec vectors of the words in the microblog sentences. If a word is not in the set of pre-trained word vectors, it is randomly initialized. Here "-static" means that the word vectors are kept static during training and only the other parameters of the model are learned.

CNN-non-static: The same as CNN-static, but the pre-trained vectors are fine-tuned during training on each dataset.

CNN-rand: All word vectors are randomly initialized and then modified during training.

Evaluation metrics. The classic multi-label learning evaluation metrics Ranking Loss (RL), Hamming Loss (HL), One-Error (OE), Subset Accuracy (SA), macro F1 (maF), micro F1 (miF) and Average Precision (AP) are selected for evaluating the performance of the proposed model and algorithm. These seven metrics can be grouped into label-based, example-based and ranking-based metrics. For more details about these metrics, please refer to Madjarov et al. [20] and Zhang et al. [9].
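Most of these metrics are available off the shelf, e.g. in scikit-learn, as the sketch below shows; Subset Accuracy corresponds to accuracy_score on indicator matrices, and One-Error would need a few custom lines. The toy matrices here are purely illustrative.

```python
import numpy as np
from sklearn.metrics import (hamming_loss, f1_score, label_ranking_loss,
                             label_ranking_average_precision_score)

# Y_true/Y_pred: binary indicator matrices of shape (n_samples, q);
# Y_score: real-valued label scores (e.g. the rank values).
Y_true = np.array([[1, 0, 1], [0, 1, 0]])
Y_pred = np.array([[1, 0, 0], [0, 1, 0]])
Y_score = np.array([[0.9, 0.1, 0.4], [0.2, 0.8, 0.3]])

print("HL :", hamming_loss(Y_true, Y_pred))
print("maF:", f1_score(Y_true, Y_pred, average="macro"))
print("miF:", f1_score(Y_true, Y_pred, average="micro"))
print("RL :", label_ranking_loss(Y_true, Y_score))
print("AP :", label_ranking_average_precision_score(Y_true, Y_score))
```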

5.4 Experimental Results

We show the best result of every model in Tables 3 and 4. "↓" means that smaller values are better; "↑" means that larger values are better. In SVM-word2vec-300, CNN-static-300, CNN-non-static-300 and CNN-rand-300, "-300" indicates that the input word vectors of the corresponding model have a dimensionality of 300. From Table 3 we can clearly see that on the EACWT dataset the CNN-non-static-300 model achieves the best performance under every evaluation metric compared with the other models. On the Ren_CECps dataset, Table 4 shows that CNN-static-300 is better than the other baselines and CNN models under all metrics except Hamming Loss.

Table 3. Microblog multi-label emotion classification results on the EACWT dataset
Table 4. Microblog multi-label emotion classification results on the Ren_CECps dataset

Based on the observations from the above two tables, we offer the following discussion.

(1) With appropriate parameter settings, the CNN based models outperform the SVM algorithms. We argue that the CNN models can automatically learn higher-level feature representations, and are therefore better than the SVM based methods.

(2) The performance of the CNN models can be significantly improved by using pre-trained word vectors. This is because, compared with randomly initialized vectors, pre-trained vectors can learn the semantic information of words from an extremely large corpus. Moreover, pre-trained vectors can be fine-tuned on the dataset in the "non-static" model, so that the vectors better reflect word semantics in the dataset. We can see that the CNN-non-static model achieves the best performance on the EACWT dataset.

(3) The "non-static" model does not perform best, as we had expected, on the Ren_CECps dataset. This may be because Ren_CECps covers various topics and has more scattered emotion labels; different documents have different topics, so it is more difficult to learn effective feature representations and semantic vectors. The CNN-static model achieves the best performance on the Ren_CECps dataset. A similar observation about the "static" model is reported for a single-label text classification task in [17]. We think the relative performance of the "static" and "non-static" CNN models is related to the topic and label distributions of the text classification task.

Finally, we demonstrate the effect of word vector dimensionality on the EACWT dataset in Fig. 3. Generally, we find that a larger word vector dimensionality yields better performance for the CNN-non-static model. Similar behavior can be observed on the Ren_CECps dataset. We omit the results of the micro F1 metric on EACWT and all metrics on Ren_CECps due to the length limitation.

Fig. 3. The experiment results for different word dimensions on the EACWT dataset

6 Conclusions and Future Work

Different from the traditional sentiment orientation classification problem, multiple fine-grained emotions can coexist in microblog sentences, which can be regarded as a multi-label learning problem. In this paper, we propose a Convolutional Neural Network framework using Calibrated Label Ranking to detect the multiple emotions in Chinese microblog sentences. Extensive experiments on two datasets show that the static and non-static CNN methods, with the help of word embeddings, outperform other strong baseline algorithms by a large margin.

In the future, on the one hand we want to design new problem transformation methods for the multiple emotion detection task. On the other hand, we intend to revise the architecture of CNN models for the multi-label learning problem and to incorporate other deep learning models, such as RNNs, to further improve performance on the task.