
1 Introduction

Social media has become a very popular communication tool among Internet users. In China, the number of users of social networking websites had reached 288 million by the end of June 2013, and social networking service (SNS) users accounted for 48.8 % of all Internet users [5]. Sina Weibo (hereafter Weibo) is a Chinese microblog website, widely regarded as the Chinese counterpart of Twitter; it is one of the most popular sites in China, with 60.2 million daily active users [6], and has therefore become a valuable source of people’s opinions and sentiments.

Microblog texts (called statuses on Weibo) are very different from general newspaper or web text. Weibo statuses are shorter and more casual; many topics are discussed, with less coherence between texts. Combined with the huge lexical and syntactic variety in Weibo data (misspelt words, new words, emoticons, unconventional sentence structures), this means that many existing methods for emotion and sentiment detection which depend on grammar- or lexicon-based information are no longer suitable.

Machine learning via supervised classification, on the other hand, is robust to such variety but usually requires hand-labelled training data. The labelling process is difficult and time-consuming with large datasets, and can be unreliable when attempting to infer an author’s emotional state from short texts [31]. Our solution is to use distant supervision: we adapt the approach of [17, 31] to Weibo data, using emoticons and Weibo’s built-in smilies as author-generated emotion labels for training, allowing us to learn a model of the associated language which can classify Weibo statuses into different basic emotion classes. Adapting this approach to Chinese data poses several research problems: finding accurate and reliable labels to use, segmenting Chinese text and extracting sensible lexical features.

Our experiments show that choice of labels has a significant effect, with emoticons generally providing higher accuracy than Weibo’s smilies, and that choice of text segmentation method is crucial, with current word segmentation tools providing poor accuracy on microblog text and character-based features proving superior.

2 Background

2.1 Sentiment Analysis and Emotion Detection

Most research in this area focuses on sentiment analysis—classifying text as positive or negative [27]. However, finer-grained emotion detection is required to provide cues for further human-computer interaction, and is critical for the development of intelligent interfaces. It is hard to reach a consensus on how the basic emotions should be categorised, but here we follow [8] and others in adopting the definition of [11], giving six basic emotions: anger, disgust, fear, happiness, sadness, and surprise.

Algorithms previously used for this task range from matching words in a sentiment lexicon to training classifiers with labelled data. In early work, Turney [41] used mutual information between document phrases and the words “excellent” and “poor” to estimate the average sentiment orientation of reviews; this unsupervised approach achieved an average accuracy of 74 %. Phrases containing adjectives or adverbs were extracted and used, since they are good indicators of subjectivity [19]. Pang et al. [28] were the first to apply machine learning methods to detecting the polarity of movie reviews. They showed that machine learning techniques easily beat human-produced baselines for sentiment classification, although performance did not match that of traditional topic-based text classification. They evaluated three machine learning methods (Naïve Bayes (NB), Maximum Entropy (ME) and Support Vector Machines (SVMs)), and their results showed unigram presence information to be the most effective feature. Yessenov and Misailovic [45] used movie review comments from the social network Digg, evaluating both supervised learning (NB, ME, decision trees) and unsupervised learning (k-means); in addition to a bag-of-words model, they also incorporated WordNet synonym information. They reached a conclusion similar to that of [28]: the simple bag-of-words model performs relatively well. Tsutsumi et al. [40] proposed combining three different classifiers into a multiple classifier, showing that the integrated method outperformed each of the three single classifiers.

2.2 Distant Supervision

Distant supervision is an approach which combines standard supervised classification methods with a weakly labelled training dataset; it can be seen as an example of semi-supervised learning in that it exploits large amounts of data without access to expert gold-standard labels. Go et al. [17] and Pak and Paroubek [26], following [32], use emoticons in Twitter messages to provide these weak (or noisy) labels, then learn a classifier on the basis of the remaining text (after removal of the emoticons) to classify positive/negative sentiment with above 80 % accuracy.

Yuasa et al. [46] showed that emoticons have an important role in emphasizing the emotions conveyed in a sentence; they can therefore give us direct access to authors’ own emotions. Derks et al. [10] and Provine et al. [29] similarly found that emoticons tend to increase the intensity of the associated verbal content, rather than replacing it (perhaps playing a similar role to laughter, facial expressions and other non-verbal behaviour). We would therefore expect them to be suitable for use as labels in a distant supervision approach, indexing the emotional content while leaving its verbal expression largely unaffected when the emoticons are removed. Purver and Battersby [31] investigated the applicability of this approach to English Twitter messages, using a broader set of emoticons to extend the distant supervision approach to six-way emotion classification, and we apply a similar approach here to Chinese Weibo statuses. However, in addition to the widely used, domain-independent emoticons, other markers have emerged for particular interfaces or domains. Weibo provides a built-in set of smilies that can work as special emoticons that help us better understand authors’ emotions.

2.3 Chinese Text Processing

In Chinese text, sentences are represented as strings of Chinese characters without explicit word delimiters as used in English (e.g., white space). Therefore, it is important to determine word boundaries before running any word-based linguistic processing on Chinese.

There is a large body of research into Chinese word segmentation [12, 15, 18, 21, 35, 43]. These methods can be roughly classified into two categories: lexicon-based methods and character-tagging methods.

The idea behind lexicon-based methods is “segmentation”: distinct words are identified using a lexicon-based identification scheme [4]. This approach performs word segmentation with matching algorithms, matching input character strings against a known lexicon. However, since the real-world lexicon is open-ended, new words appear every day, and this is especially true of social media; a lexicon for such a domain is therefore difficult to construct or maintain accurately.

The character-tagging method was first introduced by [44]. It is more like a “word-building” process: it treats word segmentation as a sequence labelling problem, assigning a label to every character indicating whether it is located at the beginning of, inside, or at the end of a word. Several discriminative sequential learning algorithms have been exploited (e.g., conditional random fields (CRFs) [39], latent-variable CRFs [37], the structured perceptron [20], and the Passive-Aggressive algorithm [36]). However, performance on social media data is unsatisfactory, as the data differ greatly from the corpora these models were trained on.
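
To make the labelling scheme concrete, here is a minimal sketch (our illustration, not code from the cited work) of converting a segmented sentence into the per-character tags that such sequence models are trained to predict:

```python
def words_to_tags(words):
    """Convert segmented words into per-character tags: B = beginning of a
    multi-character word, M = middle, E = end, S = single-character word."""
    tags = []
    for word in words:
        if len(word) == 1:
            tags.append((word, "S"))
        else:
            tags.append((word[0], "B"))
            tags.extend((ch, "M") for ch in word[1:-1])
            tags.append((word[-1], "E"))
    return tags

# A segmenter that predicts these tags over raw character strings inverts
# this process, e.g. words_to_tags(["今天", "天气", "好"]) gives
# [('今', 'B'), ('天', 'E'), ('天', 'B'), ('气', 'E'), ('好', 'S')]
```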

3 Weibo Corpus

3.1 Corpus Collection

Our training data consisted of Weibo statuses with emoticons or smilies (see Sect. 3.2). Since Weibo has a public API, training data can be collected through automated means; to use the API, we also needed to create a Weibo account and register an application. We wrote a Python script which requested the statuses public_timeline API every 30 s and inserted the collected data into a MongoDB database. We constructed a corpus of Weibo data, filtering out messages not containing emotion labels (see Sects. 3.2 and 3.4 for details).
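
A minimal sketch of such a collection script is shown below. It is illustrative only: it assumes the statuses/public_timeline endpoint and an OAuth access token obtained for the registered application, and the exact parameters are placeholders.

```python
import time
import requests
from pymongo import MongoClient

API_URL = "https://api.weibo.com/2/statuses/public_timeline.json"
ACCESS_TOKEN = "..."  # token for the registered application (placeholder)

collection = MongoClient()["weibo"]["statuses"]

while True:
    resp = requests.get(API_URL, params={"access_token": ACCESS_TOKEN})
    for status in resp.json().get("statuses", []):
        # upsert on the status id to avoid duplicates across polls
        collection.replace_one({"id": status["id"]}, status, upsert=True)
    time.sleep(30)  # poll every 30 s, as in our collection script
```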

3.2 Emotion Labels

Two kinds of emotion labels (emoticons and smilies) were used as noisy labels. By “noisy”, we mean that the emoticons and smilies are themselves noisy compared to gold-standard manual labels: to some degree ambiguous or vague in their meaning. Not all emoticons and smilies are closely related to the six emotion classes considered in our work, and some may be used differently in different situations, as people understand them differently. Smilies are Weibo built-in smilies (see Fig. 1), which form a finite, fixed set defined by the Weibo interface. Emoticons here are Eastern-style emoticons, which are made up of several characters and can thus be defined by the user; note that they are very different from Western-style emoticons [23] (see Table 1).

Eastern-style and Western-style emoticons differ, largely because of habits formed by using very different languages. Western-style emoticons are read from left to right and are generally taken as being rotated by 90 degrees [30]; they are usually made of two to four characters, are relatively few in number, and generally focus on some feature of mouth shape. Eastern emoticons, in contrast, are usually un-rotated and present faces, gestures, or postures from a point of view easily comprehensible to the reader.

Fig. 1 Screenshot of the first page of Weibo built-in smilies

Table 1 Emoticons: Eastern style versus Western style

Initially, we examined all available Eastern-style emoticons and Weibo built-in smilies. This investigation found that not all of them can be classified into Ekman’s six emotion classes [11], and that for some less frequently used labels, authors have widely different understandings. We therefore identified the most widely used and well-known emoticons/smilies; to determine whether these would be reliable as labels, we set up a web survey to examine whether people could classify them consistently.

Our survey contained two parts. In the first part, we asked people to choose the one of the six emotion classes that best matched each of our identified emoticons/smilies; we also provided a None of the above option allowing participants to give their own definitions. In the second part, we asked people to tick all the emoticons and smilies they would use to convey each of the six emotions, and allowed them to fill in other emoticons/smilies of their own for each emotion class. The survey was distributed via Weibo and only Chinese Weibo users were allowed to take part. 56 individuals completed the survey within two days; full results are given in Appendix Table 9.

From these results, we identified 12 emoticons and 10 smilies to use as emotion labels (see Table 2). It is worth noting that we found no reliable emoticons for disgust, nor any reliable labels of either kind for fear. One reason may be that both disgust and fear are, as emotion classes, difficult to represent as facial expressions using only punctuation and letters. For fear, we found no relevant smilies in the Weibo interface at all; we believe this is because a fearful face has no obvious distinguishing feature, and because people seem to express fear together with other emotions, like “nervous” or “cry”. To ensure reliable labelling, we decided to use only one smiley for disgust, and a Chinese keyword meaning fear as the label for fear. However, keywords must be treated with caution, as they may not work well: removing a word from a text may change the meaning of the message itself and leave the remaining text less informative and reliable. In addition, words are verbal, and so subject to phenomena such as negation; using keywords as emotion labels may therefore be less reliable and result in many false positive examples.

Table 2 Conventional markers used for emotion classes

3.3 Text Processing

Initial investigation also found that some Weibo statuses mix different language units: as well as Chinese, English words were sometimes present and provided useful information. Therefore, in our work, not only Chinese characters/words but also lexical items from other languages were included as features. Weibo usernames (starting with @) and URLs were removed. Punctuation was included as a feature (treated like a lexical unigram), with any repeated punctuation normalised to three characters. We then removed the labelling emoticons and smilies from the texts, using them only as positive/negative labels for the relevant emotion classes for training and testing purposes. Finally, we extracted different kinds of lexical features: segmented Chinese words, Chinese characters, and higher-order n-grams.
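
The following sketch illustrates these normalisation steps; it is a simplified illustration of the processing described above, not our exact script:

```python
import re

def preprocess(status):
    """Simplified version of the text normalisation described above."""
    text = re.sub(r"@\S+", "", status)        # remove Weibo usernames
    text = re.sub(r"https?://\S+", "", text)  # remove URLs
    # normalise runs of 4 or more identical punctuation marks to exactly 3
    text = re.sub(r"([!?！？。.,，])\1{3,}", r"\1\1\1", text)
    return text.strip()

# preprocess("@user 太开心了！！！！！ http://t.cn/abc") -> "太开心了！！！"
```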

To use word-based features, we need to segment the statuses into words. Many Chinese word segmentation tools exist, but most are unsuitable for online social media text; we compared Pymmseg, Smallseg and the Stanford Chinese Word Segmenter, which all appeared to give reasonable results. Pymmseg uses the MMSEG algorithm [38]; Smallseg is an open-source Chinese segmentation tool based on a DFA; the Stanford Segmenter is CRF-based [39].

3.4 Corpus Analysis

Our corpus contains 1,027,853 Weibo statuses with emotion labels; Table 3 shows statistics. The number of Weibo statuses varied with the popularity of labels themselves: labels for happiness and sadness are much more frequent than others; very similar results were observed on English Twitter (see e.g., [31]), suggesting that these frequencies are relatively stable across very different languages.

Table 3 Number of Weibo statuses per emotion class

Overall frequencies show that users of Weibo are more likely to use built-in smilies rather than emoticons. One possible reason is that smilies can be inserted with a single mouse click, whereas emoticons must be typed using several keystrokes—Eastern-style emoticons are usually made of five or more characters.

4 Experiments and Discussions

Machine learning techniques have been shown to be effective for traditional text classification and sentiment analysis. Here, we use Support Vector Machines (SVMs) [42], a state-of-the-art supervised kernel method. The basic idea is to find a maximum-margin hyperplane: a hyperplane that separates two classes correctly while maximizing the margin (the distance) between the hyperplane and the “difficult points” close to it. These points are called support vectors, and the decision function is fully specified by them; new test examples are then assigned to one side of the hyperplane. Classifiers trained using SVMs have been shown to perform well relative to other classifiers: Joachims [22] showed that SVMs consistently achieved good performance on text categorization tasks, outperforming other methods substantially and significantly, and Pang et al. [28], evaluating Naïve Bayes (NB), Maximum Entropy (ME) and SVMs on movie-review polarity detection, found that SVMs performed best and NB worst. SVMs also handle high-dimensional feature spaces well [22], whereas many other classifiers become expensive to train with large numbers of features.

In our work, classification used SVMs throughout, implemented with LIBLINEAR [13]. LIBLINEAR inherits many features of LIBSVM [3], but is more efficient for training large-scale problems without kernels. Performance was evaluated using 10-fold cross validation.

Cross validation is used to estimate how well a model generalises [24]. In one round of cross validation, the dataset is partitioned into two subsets, one for training (the training set) and one for testing (the validation or test set). Several rounds are performed with different partitionings in order to assess variance; the results are then averaged and the standard deviation (\(\sigma \)) calculated. F-fold cross validation was introduced by [16]: a single dataset is divided into F chunks, and in each fold one chunk is retained as the test set while the remaining \(F-1\) chunks are used as training data. This process is repeated F times so that each of the F chunks is used exactly once as a test set.
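
For illustration, such an evaluation can be run with scikit-learn as follows (a sketch with synthetic stand-in data; our experiments used LIBLINEAR directly):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC  # backed by LIBLINEAR

# synthetic stand-in for the real feature matrix and noisy labels
X, y = make_classification(n_samples=1000, n_features=50, random_state=0)

scores = cross_val_score(LinearSVC(), X, y, cv=10, scoring="accuracy")
print("mean accuracy %.2f%%, sigma %.4f" % (100 * scores.mean(), scores.std()))
```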

Our training datasets were balanced: a dataset of size N contained N / 2 positive instances (Weibo statuses containing labels for the emotion class in question) and N / 2 negative ones (statuses containing labels from other classes). For the negative instances, we sampled randomly from the other emotion classes for larger datasets (\(N>50{,}000\)), but ensured an even weighting across the negative classes for smaller sets, to prevent bias towards any one negative class.
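
A sketch of the balancing procedure for smaller sets (a hypothetical helper illustrating the even-weighting strategy, with remainder handling omitted):

```python
import random

def build_balanced_set(statuses_by_class, target, n, seed=0):
    """n/2 positives from the target class, n/2 negatives drawn evenly
    from the other classes (the even-weighting strategy for small n)."""
    rng = random.Random(seed)
    positives = rng.sample(statuses_by_class[target], n // 2)
    others = [c for c in statuses_by_class if c != target]
    per_class = (n // 2) // len(others)  # remainder handling omitted
    negatives = []
    for c in others:
        negatives.extend(rng.sample(statuses_by_class[c], per_class))
    return positives, negatives
```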

4.1 Feature Selection

An important part of any data-driven approach is converting a piece of text (the “observation”) into a feature vector. A suitable feature vector should be designed to contain no more features than necessary. There is much work addressing the feature extraction problem for machine learning (e.g., see [14, 33]). In this section, we focus on two types of lexical features: word-based features and character-based features.

4.1.1 Word-Based Features

Chinese is written without spaces between words, so to obtain lexical features we must segment the text first. Classification performance depends largely on the quality of the lexical features obtained from the different Chinese word segmentation tools.
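
As an illustration of word-based feature extraction (using the open-source jieba segmenter here purely for demonstration; it is not one of the three tools we compared):

```python
import jieba  # open-source segmenter, used here only for illustration

words = list(jieba.cut("今天天气很好"))
# e.g. ['今天', '天气', '很', '好']: each segmented word becomes a unigram feature
```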

Fig. 2 Classification results of word-based features based on different segmentation tools. a Anger, b disgust, c fear, d happiness, e sadness, f surprise

However, existing segmentation tools can be difficult to apply to social media data. On the one hand, unconventional words are used in microblogs: misspelt words, cyber words, and new words (see e.g., [1]). On the other hand, there are some pre-defined structures not used in other domains: Weibo usernames (@username), hashtags (#topic#), URLs, emoticons, smilies, etc.

These latter unconventional (but known) structures can be treated separately and removed before passing the text through the segmenter. For unconventional and misspelled words, however, this is not possible in general, and existing tools find it difficult to identify them correctly; better segmentation algorithms may be needed, with new models trained on social media data. We investigated the effect of three different segmentation tools; the results are presented in Fig. 2.

Results showed that, as the training dataset size increased, Pymmseg outperformed Smallseg and the Stanford Segmenter for all emotion classes except surprise, where the Stanford Segmenter performed best. The results also show that accuracy increased as more training examples were used (see Sect. 4.2). In terms of segmentation speed, Pymmseg was the fastest and the Stanford Segmenter the slowest; we therefore used Pymmseg in later experiments.

4.1.2 Character-Based Features

For character-based features, rather than requiring word segmentation, we simply treat each Chinese character as a unigram feature, as well as each punctuation character, emoticon and smiley (see Table 4).

Whether higher-order n-grams are useful features appears to be a matter of some debate. Pang et al. [28] reported that unigrams outperformed bigrams when classifying movie reviews by sentiment polarity, but [9] found that bigrams and trigrams can give better product-review polarity classification.

Table 4 An example of one Weibo status and its n-gram features: repeated punctuation was normalised to 3 characters and retained as a unigram; the smiley was retained as a unigram; the Weibo username was removed
Fig. 3 Classification results of character-based n-gram features. a Anger, b disgust, c fear, d happiness, e sadness, f surprise

In our experiments with higher-order n-grams, we also included all lower-order n-grams (e.g., for 5-grams, we used all unigrams, bigrams, trigrams, 4-grams and 5-grams as features; see Table 4), since many Chinese words consist of only one character.
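
In scikit-learn terms, this feature set corresponds to character n-grams with all lower orders included (a sketch of the idea; our experiments used our own feature extraction with LIBLINEAR):

```python
from sklearn.feature_extraction.text import CountVectorizer

# character n-grams of order 1-4, lower orders included; binary=True records
# presence rather than counts, following the unigram-presence result in Sect. 2.1
vectorizer = CountVectorizer(analyzer="char", ngram_range=(1, 4), binary=True)

docs = ["今天天气很好", "太开心了！！！"]  # preprocessed statuses
X = vectorizer.fit_transform(docs)
print(X.shape)  # (number of statuses, number of distinct n-gram features)
```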

Results showed that higher-order n-grams are useful features for our wide-topic social media Weibo data. Higher-order n-grams (bigrams, trigrams, 4-grams and 5-grams) outperformed unigrams for all emotion classes by a large margin (see Fig. 3).

We stopped at 5-grams, since accuracy no longer improved; moreover, adding higher-order n-gram features increased classifier training time.

4.1.3 Word-Based Features Versus Character-Based Features

Looking at all six emotion classes, we found that word-based features did not beat character-based ones: character-based higher-order n-gram features outperformed word-based features (even with the most effective segmenter, Pymmseg) for all emotion classes except sadness (see Table 5).

Table 5 Classification accuracy for all six emotion classes (\(N=15{,}000\)). The best one for each emotion class is marked in bold

Our results suggested that we could just use Chinese characters, rather than doing any word segmentation. Three out of six emotion classes achieved their best performance by using character-based 4-gram features: disgust, fear, and happiness.

Examination of the segmented data showed that the three segmentation tools did not work well with our social media data, making many segmentation mistakes and producing many segmented “words” containing only one character. Character-based features were therefore preferred, and 4-gram features were used in later experiments.

4.2 Increasing Dataset Size

So far, experimental results have also shown that increasing the dataset size increased accuracy up to \(N=15{,}000\) (see Figs. 2 and 3). In this experiment, we kept increasing the training dataset size for all six emotion classes and compared classification results. Character-based 4-gram features were used and, as mentioned above, for larger datasets (\(N>50{,}000\)) we randomly selected negative training examples from the other emotion classes (see Sect. 4).

Because of the unbalanced numbers of Weibo statuses across emotion classes (see Sect. 3.4), the largest training dataset size varied by class: from \(N=15{,}000\) for disgust to \(N=800{,}000\) for happiness. Classification accuracy (using cross-validation) increased as we added more training examples, and does not appear to approach an asymptote until the largest sizes (see Fig. 4 and Table 6). As our dataset grows over time, we therefore expect further improvements in accuracy for all six emotion classes.

However, performance differs considerably across classes (see Table 6): fear is the most accurately predicted emotion (92.01 %), using the keyword as its emotion label, followed by happiness (87.17 %), anger (80.56 %), sadness (78.85 %), surprise (77.36 %) and disgust (77.31 %).

Fig. 4 Classification results for all six emotion classes

Table 6 Classification results (accuracy (%)) for all six emotion classes. The best one for each emotion class is marked in bold

4.3 Emotion Labels

In all the experiments above, we used a random sample of instances “labelled” with either emoticons or smilies. In this experiment, we compared the two types of emotion label (emoticons and smilies) in terms of classification accuracy. Four kinds of training dataset were constructed and tested for happiness, sadness and surprise:

  • A dataset containing only instances collected with emoticons;

  • A dataset containing only instances collected with smilies;

  • A dataset in which half of the training examples were collected with emoticons and the other half with smilies;

  • A dataset in which the training examples were randomly selected from all instances collected with either emoticons or smilies.

Comparing the accuracies across these sets tells us which label type is used more consistently: association with a more consistent distribution of words/characters will result in higher classification accuracy (accuracy of predicting the emotion label). Results (see Fig. 5) showed that emoticon labels were easier to classify than smilies. By examining a sample of the data directly, we found that people use emoticons in a more systematic and consistent way: they tend to use emoticons to tell others what their real emotions are (happiness, sadness, etc.), whereas they use smilies for a much wider range of things, such as jokes, sarcasm, etc. Some people use smilies just to make their Weibo statuses more interesting and lively, apparently without any subjective feeling.

Fig. 5 Comparison of two different types of labels. Character-based 4-gram features were used; performance was evaluated using 10-fold cross validation. a Happiness, b sadness, c surprise

4.4 Manual Labelling

So far, we have used only the distant (“noisy”) labels for both training and testing. In other words, classification accuracy is strictly a measure of the ability to predict the noisy label’s presence (i.e., the use of an emoticon or smiley), rather than necessarily the ability to predict the author’s emotion. To examine how well the two correspond, we must test against human judgements.

Amazon’s Mechanical Turk (MTurk) service has been shown to be useful for gathering human judgements for many simple NLP tasks (e.g., see [2, 7, 25, 34]). In our final experiment, we used MTurk to collect manually labelled test data.

A further set of 2,190 instances was used for human annotation. These instances were collected using either emoticons or smilies, and were evenly distributed across our six emotion classes. Human annotators were asked to choose the strongest emotion class behind each message, with only one class allowed, although a None of the above option was also provided. Each instance was labelled by three different annotators.

Agreement between annotators was poor: only 26 % of instances (571 out of 2,190) were assigned the same label by all three annotators. These unanimous instances were quite unbalanced, ranging from 5 examples for fear to 289 for happiness. Looking at instances agreed by a majority (i.e., at least two annotators), we obtained 1,335 (out of 2,190) examples, varying from 27 for fear to 553 for happiness (see Table 7).

Table 7 Number of agreed instances for each emotion class
Table 8 Classification results on manually labelled data

Two rounds of evaluation were performed, using the instances agreed unanimously and by majority respectively. The best classifier for each emotion class from Sect. 4.2 was used. Since the test dataset was unbalanced, precision, recall and F1 for the class in question were used instead of accuracy. With default settings, recall was much higher than precision for some emotions (sadness, surprise, disgust and fear); in order to have a consistent F-score to compare across emotion classes, we also tuned the classifiers so that recall approximately equals precision. Overall performance is shown in Table 8.
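
One simple way to perform such tuning (a sketch of one plausible mechanism, shifting the SVM decision threshold on held-out scores; the exact procedure we used is not detailed here) is:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def balanced_pr_threshold(y_true, scores):
    """Choose the decision threshold at which precision and recall are
    closest, giving a consistent F-score to compare across classes."""
    precision, recall, thresholds = precision_recall_curve(y_true, scores)
    # precision/recall arrays are one element longer than thresholds
    gap = np.abs(precision[:-1] - recall[:-1])
    return thresholds[np.argmin(gap)]

# scores = classifier.decision_function(X_test)
# predictions = scores >= balanced_pr_threshold(y_test, scores)
```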

As before, results for happiness and anger are quite good, which shows that:

  1. These two emotion classes are easier to detect;

  2. The distant labels used for these two emotion classes are reliable;

  3. Our classifiers are able to detect these two emotions.

Results for surprise, sadness and disgust can perhaps be considered reasonable, considering there are far fewer positive examples than negative ones in their test sets.

However, the result for fear is poor. Considering the low number of annotated positive test examples (see Table 7), we may conclude that this emotion class is difficult to identify even for human annotators. It is interesting that our classifier failed to detect fear in these annotated examples even though it achieved high cross-validation accuracy (see Sects. 4.1 and 4.2). Fear was the only emotion category for which we used the presence of a keyword rather than a non-verbal sign (emoticon or smiley); this suggests that, as suspected, using keywords is a poor method for distant supervision.

5 Conclusion

In our work, we used SVMs for automatic emotion detection in Chinese microblog texts. We collected our own Weibo corpus and identified a set of emoticons and smilies to use as distant labels. Our results showed that using emoticons and smilies as noisy labels can be an effective way to perform distant supervision for Chinese, while the use of keywords extracted from the text is not effective. Emoticons seem to be more reliable for emotion detection than smilies.

It was also found that many existing Chinese word segmentation tools do not work well on social media data. Instead, characters can be used as lexical features, with performance improving for higher-order n-grams; character-based 4-gram features seem to be the most effective. Increasing the dataset size also improves performance, and our future work will examine larger sets.

Performance differs considerably across emotion classes: happiness is the most accurately predicted emotion (87.17 %), followed by anger (80.56 %). The effectiveness of our classifiers for these two emotion classes was also verified using human-annotated test data. Test results on manually labelled data also showed that the other four emotion classes (sadness, surprise, disgust and fear) are difficult to classify, either because reliable labels are hard to find (especially in the case of fear), and/or because they are difficult to detect even for human annotators.