1 Introduction

In recent years, much attention has been given to the content generated by Internet users. Since people can express their opinions and emotions about any target, such as products, services, and events around the globe, many consumers and companies can make decisions based on this ever-growing opinionated content. However, as a huge amount of opinions is published every day, manually seeking them out and identifying whether they convey a positive or negative sentiment is impractical. In this context, Sentiment Analysis, or Opinion Mining, is the field of study that analyzes people’s opinions, sentiments, appraisals, attitudes, and emotions toward entities and their attributes expressed in written text (Liu 2015).

One of the key challenges in this field concerns the automatic identification of opinions and emotions expressed in short informal texts, such as tweets. Tweets, which are short texts published on Twitter, make the task of sentiment analysis very complex due to their inherent characteristics, such as their informal linguistic style, the presence of misspelled words, and the careless use of grammar (Martínez-Cámara et al. 2014). Although sentiment analysis has recently been recognized as a suitcase research problem (Cambria et al. 2017; Chaturvedi et al. 2018), which involves various Natural Language Processing (NLP) tasks, including sarcasm detection, aspect extraction, and subjectivity detection, our focus is on the polarity detection task. Regarding supervised machine learning strategies, which are also the focus of this work, much effort has been made in the literature of Twitter sentiment analysis to achieve an effective representation of tweets. In this context, distinct types of features have already been proposed, from the simple n-gram-based representation to meta-level features to word embeddings.

N-grams are the most basic feature representation when dealing with text classification problems, having motivated early works on Twitter sentiment analysis (Go et al. 2009; Pak and Paroubek 2010; Pang et al. 2002). In this scenario, raw sequences of n words extracted from tweets constitute a sparse and high-dimensional feature space for the classification task. Later, in an attempt to mitigate the sparsity issue, several state-of-the-art studies proposed different sets of features by developing an abstract representation of tweets, comprising meta-information extracted from their textual content (Barbosa and Feng 2010). These features, also called meta-level features, can capture new, insightful information from tweets, taking into account their peculiarities. More recently, distributed representations of words generated by deep learning approaches, namely word embeddings, have emerged as an efficient feature representation for text documents and are currently the main focus of most works on sentiment detection in tweets. Word embeddings encode linguistic patterns of words from a vast corpus of textual data and can represent the textual content of tweets in low-dimensional feature vectors.

As far as we know, despite the efforts on designing effective and efficient feature representations in the literature of Twitter sentiment analysis, there is a gap regarding the effect of combining the distinct types of features proposed in state-of-the-art works. In this study, we recognize three main groups of features according to their structural properties and how they are engineered: n-grams, meta-level features, and word embedding-based features. Each of these groups comprises a rich, disjoint set of features which may boost classification effectiveness if appropriately combined.

Moreover, as for meta-level features, we have observed that each work in the literature employs only a small, and often different, fraction of them. Hence, we propose to fill another gap by aggregating meta-level features designed in different works. We believe that combining them into a unique set might benefit sentiment detection in tweets, as we shall see later. Also, we categorize this aggregated set of meta-level features, putting together features that share similar aspects, so as to examine whether the sentiment classification of tweets can benefit from different categories of meta-level features.

In this work, our main goal is to improve classification performance in Twitter sentiment analysis. In this context, this study is conducted in order to provide a response to the following three main research questions:

RQ1. Which group of features is the most effective in Twitter sentiment analysis? Given the large number of features of distinct types designed and employed in the literature, such as n-grams, meta-level features, and word embedding-based features, we propose to perform a comparative evaluation of their predictive performances, by means of a large collection of datasets of tweets. Our goal is to detect the most powerful feature set in the sentiment classification of tweets from various domains.

However, we believe that an improper choice of a learning algorithm to be used with a specific feature set may hinder classification performance. As a result, it might prevent the classifier from learning how to assign a sentiment label to tweets accurately. With that said, in order to take maximum advantage of the features from each feature set, we leverage the best classifiers constructed for each feature set, instead of comparing them by merely relying on the same learning algorithm. More specifically, we answer the intermediate question “Which classification strategies are the most suitable for each group of features?” by evaluating distinct supervised learning algorithms for each feature set. After identifying the best classifiers under the individual evaluation of each feature set, we then carry out a fair comparative assessment of their predictive potential.

As a result of the comparative study among the best classifiers for each feature set, as we shall see, a classifier made up of a concise, yet rich, set of meta-level features from well-referenced works (Agarwal et al. 2011; Barbosa and Feng 2010; Bravo-Marquez et al. 2014; Buscaldi and Hernandez-Farias 2015; da Silva et al. 2014; Davidov et al. 2010; Go et al. 2009; Hagen et al. 2015; Jiang et al. 2011; Khuc et al. 2012; Kouloumpis et al. 2011; Mohammad et al. 2013; Park et al. 2018; Vo and Zhang 2016; Zhang et al. 2011) achieves improved results, which may be evidence that this feature set plays an essential role in this task. Going further, we propose to categorize this rich set of meta-level features, as an extension of our previous study (Carvalho and Plastino 2016). In this work, the categories proposed in Carvalho and Plastino (2016) are revisited, and we include some new meta-level features. In addition to this categorization, we investigate whether the classification of tweets from different domains can benefit from these distinct categories of meta-level features. For this purpose, we evaluate the predictive power of those categories in order to give a more general understanding of the relevance of the most common meta-level features proposed in the literature.

Lastly, regarding the word embedding-based features, we also present an extensive evaluation of a significant collection of generic and affective pre-trained embedding models identified in the literature, in order to determine the most effective one for the polarity classification of tweets. Pre-trained models are publicly available embedded representations of words, trained with different deep learning methods. While generic pre-trained models comprise word vectors trained for general purposes, the affective ones are specifically trained for the sentiment and emotion detection tasks.

RQ2. Can the concatenation of different types of features proposed in the literature boost classification performance in Twitter sentiment analysis? We propose to evaluate distinct combinations of the feature sets investigated in this work (i.e., n-grams, meta-level features, and word embedding-based features), considering that features from different groups might complement one another, leading to an improvement in detecting the polarity of tweets. Our goal is to determine which combinations of distinct feature sets may provide the core information in Twitter sentiment analysis. To this end, we adopt a simple feature concatenation approach that combines features from distinct groups into a unique feature vector. In this work, we investigate whether the concatenation of all feature sets, as well as of pairs of distinct feature sets, can improve sentiment classification effectiveness.
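For illustration purposes, the sketch below shows one way of implementing this concatenation step (a minimal example assuming scikit-learn/SciPy data structures; the feature matrices are randomly generated placeholders, not our actual representations): the sparse n-gram matrix and the dense meta-level and embedding matrices are stacked column-wise into a single feature space.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack

# Placeholder feature blocks for the same four tweets (rows must be aligned).
X_ngrams = csr_matrix(np.random.randint(0, 2, size=(4, 1000)))  # sparse binary n-grams
X_meta = np.random.rand(4, 50)                                   # dense meta-level features
X_embed = np.random.rand(4, 300)                                 # dense tweet embeddings

# Concatenate the three groups column-wise into a unified feature vector per tweet.
X_all = hstack([X_ngrams, csr_matrix(X_meta), csr_matrix(X_embed)]).tocsr()
print(X_all.shape)  # (4, 1350)
```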

Furthermore, despite the acknowledged use of SVM due to its robustness on large feature spaces (Carvalho and Plastino 2016; Hagen et al. 2015; Jabreel and Moreno 2017; Mohammad et al. 2013), to the best of our knowledge, no study in the literature evaluates the effectiveness of different learning methods in the presence of the different types of features studied in this work. We believe that some learning algorithms may be more effective than others when features of distinct natures are put together, depending on their intrinsic properties and on how the learning algorithms can harness them. In this scenario, we also conduct experiments to identify which classification strategies are the most suitable when combining features of different types.

RQ3. Can the sentiment classification of tweets benefit from the use of ensemble classification strategies having the best classifiers for each type of feature as base learners? Another approach to combine the discriminative power of different sets of features is through ensemble classification methods. Ensemble methods are learning algorithms that create a set of classifiers, also called base classifiers or base learners, which are used to classify new instances by taking a vote of their predictions (Dietterich 2000).

According to Zhang and Duin (2011), in practice, there exist two main kinds of ensemble strategies. In the first, the predictions of homogeneous classifiers are combined according to some rule. The second is marked by the use of heterogeneous classifiers. While homogeneous classifiers use the same learning algorithm with different representations of the feature space, the heterogeneous ones apply different classification algorithms to the same input features. In this work, we exploit a hybrid approach to ensemble learning.

Specifically, given the varied nature of the features studied in this work, we use different learning algorithms as base classifiers, each one provided with a specific feature representation for the same dataset of tweets (i.e., n-grams, meta-level features, or embedding-based features). For most situations, we show that those classifiers can complement one another in the sentiment detection of tweets, properly dealing with the peculiarities of the data that might be uncovered by some of them. In addition, we provide an in-depth analysis of the correlation among the base classifiers, showing that there is sufficient diversity among them, which is an imperative condition for ensemble strategies to succeed (Dietterich 2000).
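The sketch below illustrates this hybrid scheme (a minimal example assuming scikit-learn; the choice of learning algorithms and the names of the feature matrices are merely illustrative, not the exact configuration evaluated in our experiments): each base learner is trained on its own feature representation of the same tweets, and the final label is obtained by majority voting.

```python
import numpy as np
from collections import Counter
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

def hybrid_ensemble_predict(train_feature_sets, y_train, test_feature_sets):
    """Train one base learner per feature representation and majority-vote their predictions."""
    base_learners = [LinearSVC(), RandomForestClassifier(n_estimators=200),
                     LogisticRegression(max_iter=1000)]
    votes = []
    for clf, X_train, X_test in zip(base_learners, train_feature_sets, test_feature_sets):
        clf.fit(X_train, y_train)          # each classifier sees only its own feature set
        votes.append(clf.predict(X_test))  # e.g., n-grams, meta-features, or embeddings
    votes = np.vstack(votes)               # shape: (n_base_learners, n_test_tweets)
    # Majority vote per tweet; ties are resolved by the label seen first.
    return np.array([Counter(col).most_common(1)[0][0] for col in votes.T])
```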

In summary, the main contributions of this study are: (i) a literature review and analysis of the most common feature representations of tweets for supervised sentiment classification, including n-grams, meta-level features, and word embedding-based features; (ii) the categorization of a rich set of meta-level features developed in state-of-the-art works and the evaluation of each proposed category; (iii) a comparative study of a significant collection of publicly available pre-trained word embedding models in the sentiment classification of tweets; (iv) an assessment of the effectiveness of combining the different sets of features studied in this work, by feature concatenation and ensemble learning; and (v) the use of twenty-two datasets of tweets in all experiments performed in this work. To the best of our knowledge, this is the first study that evaluates different types of features and classifiers for a significant number of tweet datasets.

This article is organized as follows. In Sect. 2, we present the related work, offering a description of the distinct types of feature representation, as well as how they have been combined in the literature to increase the predictive performance of Twitter sentiment analysis. Sections 3, 4, and 5 present a literature review of the features exploited in this work, namely n-grams, meta-level features, and word embedding-based features, respectively. The computational experiments conducted in this work to answer the research questions introduced in this section are described in Sect. 6. Finally, in Sect. 7, we present the conclusions of this work and directions for future research.

2 Related work

Sentiment analysis. Over the years, sentiment analysis has been broadly used to summarize people’s opinions and sentiments about products, services, organizations, individuals, and events (Liu 2012). In pioneering works on sentiment analysis, Pang et al. (2002) and Turney (2002) applied distinct machine learning methods in the domain of product reviews.

In Pang et al. (2002), Pang et al. applied three supervised machine learning algorithms to determine the polarity of movie reviews. Conversely, using an unsupervised approach, Turney (2002) presented a simple strategy for classifying reviews of automobiles, banks, movies, and travel destinations as recommended or not recommended, i.e., whether the reviews convey a positive or a negative opinion. Since then, sentiment analysis has been applied in various domains to solve distinct types of problems (Cambria et al. 2010; Tumasjan et al. 2010; Valdivia et al. 2017; Wang et al. 2012a; Yoo et al. 2018).

In past years, sentiment analysis has been used to generate real-time insights during political debates (Tumasjan et al. 2010; Wang et al. 2012a), detect real-time events (Yoo et al. 2018), and in health (Cambria et al. 2010) and tourism applications (Valdivia et al. 2017). Also, as social media interactions grow, companies can collect customers’ feedback and influence their decisions by designing intelligent marketing systems, as well as use public mood to predict the stock market (Bollen et al. 2011). In this scenario, applications of sentiment analysis to social media marketing and financial forecasting have received attention from the research community in recent years (Li et al. 2020; Xing et al. 2018, 2019).

Xing et al. (2018) addressed the problem of incorporating public mood into the asset allocation problem, an investment strategy that aims at balancing the trade-off between asset returns and the risk taken by investors. In Xing et al. (2018), they developed an ensemble of an evolving clustering method and a long short-term memory (LSTM) neural network to formalize sentiment information in market views. To this end, they proposed to compute sentiment time series from social media by using the sentic computing framework (Cambria and Hussain 2015), arguing that it enables sentiment analysis not only at the document or sentence level but also at the concept level.

Recently, Li et al. (2020) studied how to combine technical indicators from stock prices and news sentiments from textual news articles, which is considered an open research topic in financial markets. To this end, they used different sentiment dictionaries to model news sentiment and constructed a two-layer LSTM network to make stock predictions. They showed that the LSTM incorporating both technical indicators and news sentiments outperformed baseline models that use only one of these information sources at a time.

Although much effort in the literature of sentiment analysis has been devoted to exploiting only English content, Lo et al. (2017) claim that this is no longer sufficient, considering that Asia now has the most Internet users (52.2%), followed by Europe (15.1%). Thus, dealing with multilingual content represents one of the major challenges in sentiment analysis (Araújo et al. 2020; Dashtipour et al. 2016; Lo et al. 2017). For example, Araújo et al. (2020) investigated how a simple translation strategy can address the problem of sentiment analysis in multiple languages. In Araújo et al. (2020), they showed that machine translation systems such as Google Translate, Microsoft Translator Text API, and Yandex Translate are mature enough to produce reliable translations to English that can be used for sentence-level sentiment analysis.

At present, with the explosion of social media networks, semi-supervised strategies have also been emerging in the literature of sentiment analysis, taking advantage of the massive amount of unlabeled data available (Fu et al. 2019; Hussain and Cambria 2018). Hussain and Cambria (2018) describe semi-supervised learning as a supervised learning problem biased by an unsupervised reference solution. In Hussain and Cambria (2018), they proposed a novel semi-supervised learning model for the task of emotion recognition based on the combined use of random projections and support vector machines. Fu et al. (2019) built a novel model to perform aspect-level sentiment classification, called AL-SSVAE (Semi-supervised Aspect Level Sentiment Classification Model based on Variational Autoencoder), based on the variational autoencoder framework (Kingma and Welling 2013). The proposed model introduces a given aspect of the text into the encoder and decoder, and adds an aspect-level sentiment classifier for semi-supervised learning in aspect-level sentiment classification.

Feature representation. One of the most significant challenges when dealing with text classification problems is related to feature engineering, especially in short texts such as tweets. Among the broad set of features that have emerged in the literature of Twitter sentiment analysis, the n-gram features have been widely employed because of their simplicity in representing tweets (Agarwal et al. 2011; Araque et al. 2017; Arif et al. 2018; Barbosa and Feng 2010; Bermingham and Smeaton 2010; Bifet and Frank 2010; Chikersal et al. 2015; Cozza and Petrocchi 2016; da Silva et al. 2016, 2014; Davidov et al. 2010; Emadi and Rahgozar 2019; Go et al. 2009; Hagen et al. 2015; Hamdan 2016; Hamdan et al. 2015; Jabreel and Moreno 2017; Jiang et al. 2011; Kouloumpis et al. 2011; Lin and Kolcz 2012; Lochter et al. 2016; Miranda-Jiménez et al. 2017; Mohammad et al. 2013; Narr et al. 2012; Pak and Paroubek 2010; Saif et al. 2012; Siddiqua et al. 2016; Speriosu et al. 2011; Wang et al. 2012b; Zhang et al. 2011).

N-gram features are contiguous sequences of n words from a text. Despite their simplicity, it has already been acknowledged that this type of feature may negatively impact the predictive performance of the classification because of the large number of uncommon words on Twitter (Saif 2015), and because people tend to use far fewer characters than the 140-character limit for tweets allows (da Silva et al. 2016). Indeed, analyzing a corpus of 1.6M tweets, Go et al. (2009) reported that the average length of a tweet is 14 words, or 78 characters. Further, Saif et al. (2012) brought to attention that 93% of the words in a corpus of 60,000 tweets are highly infrequent, occurring less than ten times. These drawbacks make the data very sparse, leading to the curse of dimensionality, which can sometimes prevent the classifier from correctly learning how to assign a sentiment label to unseen tweets.

Beyond the sparsity issue, another factor that makes sentiment classification even harder is the challenging nature of tweets, such as their informal linguistic style and the careless use of grammar (Martínez-Cámara et al. 2014), resulting in a new form of written text, termed microtext (Cambria et al. 2017). In this context, while some studies propose methods for normalizing tweets to plain English, thereby improving classification accuracy (Satapathy et al. 2017), other state-of-the-art works have explored feature engineering by designing hand-crafted features, or meta-level features. Meta-level features are usually extracted from other features and are able to capture insightful new information about the data (Canuto et al. 2016). These features include summations and counts of: part-of-speech tags of words (Agarwal et al. 2011; Barbosa and Feng 2010; Bravo-Marquez et al. 2014; Go et al. 2009; Kouloumpis et al. 2011; Mohammad et al. 2013), punctuation marks (Agarwal et al. 2011; Barbosa and Feng 2010; Davidov et al. 2010; Hagen et al. 2015; Jiang et al. 2011; Mohammad et al. 2013), specific characteristics of Twitter and short messages, such as hashtags, user mentions, retweets (RT), abbreviations, etc. (Agarwal et al. 2011; Barbosa and Feng 2010; Hagen et al. 2015; Jiang et al. 2011; Kouloumpis et al. 2011; Mohammad et al. 2013; Zhang et al. 2011), emoticons (Agarwal et al. 2011; da Silva et al. 2014; Hagen et al. 2015; Mohammad et al. 2013), and lexicon features (Agarwal et al. 2011; Bravo-Marquez et al. 2014; da Silva et al. 2014; Hagen et al. 2015; Jiang et al. 2011; Khuc et al. 2012; Kouloumpis et al. 2011; Mohammad et al. 2013; Vo and Zhang 2016), which use the prior sentiment information of words annotated in existing lexical resources. For example, Mohammad et al. (2013) implemented a large set of meta-features (referred to as NRC-features), while also emphasizing the importance of a set of lexicon-based features. In Mohammad et al. (2013), the authors designed lexicon-based features such as the total number of positive and negative tokens in a tweet, the overall and the maximal token score of a tweet, and the score of the last token of a tweet. All those features were extracted for each of five different sentiment lexicons. The results of their experiments showed that the most influential features for the two assessed datasets of tweets were the lexicon-based ones, which led to an improvement of 8.5% in terms of the macro-averaged F-score of the positive, negative, and neutral classes.
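To make such lexicon-based meta-features concrete, the fragment below sketches their computation for a single tokenized tweet (the tiny lexicon is purely illustrative, and this is a simplification rather than the exact implementation of Mohammad et al. 2013):

```python
# Illustrative lexicon mapping tokens to prior sentiment scores (positive > 0, negative < 0).
LEXICON = {"love": 0.75, "great": 0.5, "bad": -0.25, "hate": -1.0, "boring": -0.5}

def lexicon_meta_features(tokens, lexicon=LEXICON):
    """Compute simple lexicon-based meta-features for one tokenized tweet."""
    scores = [lexicon[t] for t in tokens if t in lexicon]
    return {
        "num_positive": sum(1 for s in scores if s > 0),
        "num_negative": sum(1 for s in scores if s < 0),
        "total_score": sum(scores),
        "max_score": max(scores, default=0.0),
        "last_token_score": scores[-1] if scores else 0.0,  # score of the last lexicon token
    }

print(lexicon_meta_features("i love this phone but the battery is bad".split()))
# {'num_positive': 1, 'num_negative': 1, 'total_score': 0.5, 'max_score': 0.75, 'last_token_score': -0.25}
```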

With the revival and success of deep learning techniques in traditional machine learning applications, distributed representations of words have emerged as a solution to the curse of dimensionality issue (Bengio et al. 2003; Collobert et al. 2011; Mikolov et al. 2013a, b; Pennington et al. 2014). In this context, neural networks based on dense vector representations have been producing superior results in many NLP tasks (Young et al. 2018). Bengio et al. (2003) discussed two main shortcomings of the n-gram model that can lead to misclassification problems: neither the context nor the similarity between words is taken into consideration. Although some context can be captured by using higher-order n-grams, such as 5-grams, this does not consider contexts farther than n words and makes the dimensionality even higher. Collobert et al. (2011) introduced a method to overcome these limitations, which relies largely on unlabeled data and uses a multilayer neural network architecture to learn word representations, namely word embeddings. Word embeddings are dense, low-dimensional, real-valued vectors, each one representing a word in the vocabulary, and encode linguistic patterns that can capture context from a massive corpus of textual data. This method has been successfully applied to many NLP tasks such as part-of-speech tagging, named entity recognition, and semantic role labeling (Collobert et al. 2011).

In the context of sentiment analysis, some works have effectively designed sentiment- and emotion-specific embedding learning methods (Agarwal et al. 2018; Felbo et al. 2017; Tang et al. 2014; Xu et al. 2018). For example, Tang et al. (2014) observed that traditional methods for learning word embeddings ignore the sentiment information of text, which may become a problem since words that appear in similar contexts but carry opposite polarities are mapped into close vectors (for example, good and bad). In Tang et al. (2014), this issue is addressed by extending the method proposed in Collobert et al. (2011). Specifically, Tang et al. developed a sentiment-specific word embedding (SSWE) neural network that incorporates the sentiment information of texts into the embedding learning process, using a corpus of 10M tweets with emoticons as noisy, distantly supervised training data. In the experiments conducted to evaluate their approach, Tang et al. showed that the results achieved by the SSWE learning method are competitive with those achieved by the state-of-the-art meta-level features proposed in Mohammad et al. (2013) (84.98% and 84.73% in macro-F1, respectively).

Recently, deep learning methods have also been successfully applied to aspect-based sentiment analysis (Chen et al. 2017; Ma et al. 2018; Wang et al. 2017), which aims at identifying the polarity of specific aspects rather than of the document in its entirety (Poria et al. 2016). For example, Ma et al. (2018) proposed a long short-term memory (LSTM) neural architecture that incorporates the attention mechanism. LSTM is a recurrent neural network (RNN) that can handle sequences of data. The attention mechanism takes an external memory and representations of a sequence as input and produces a probability distribution over the positions of the sequence. In Ma et al. (2018), the authors modeled attention in two steps, target-level attention and sentence-level attention, and showed that the proposed attention architecture can outperform state-of-the-art methods in aspect-based sentiment analysis.

Combination strategies. Arguing that the combination of classifiers has not been properly explored in the literature of Twitter sentiment analysis, da Silva et al. (2014) show that a classifier ensemble formed by Multinomial Naive Bayes (MNB), Support Vector Machines (SVM), Random Forest (RF), and Logistic Regression (LR) can improve classification accuracy on the four sentiment datasets used in their investigation, when combined in a majority voting strategy. The diversity of the classifier ensemble is addressed by varying only the base learners, all of which use the same bag-of-words feature representation. Prusa et al. (2015) evaluated seven base classifiers combined with either bagging or boosting ensemble strategies on the sentiment classification of tweets, using only unigrams as features. In bagging, different training partitions are sampled from the original training dataset (with replacement), and a single base learner is trained on each partition. Boosting, on the other hand, iteratively creates the base classifiers, where in each iteration a classifier is trained based on the instances misclassified in previous iterations. At the end of the process, both ensemble techniques aggregate the resulting classifiers by averaging the posterior probabilities of each model in the ensemble. In Prusa et al. (2015), they show that using ensemble strategies such as bagging and boosting can benefit the sentiment classification of tweets, particularly on high-dimensional datasets.

In Fersini et al. (2014), Fersini et al. propose a Bayesian Ensemble Learning approach based on Bayesian Model Averaging (BMA), which uses a greedy backward elimination strategy to select the optimal set of base classifiers. The candidate base classifiers that make up the search space are a dictionary-based approach (DIC), NB, SVM, Maximum Entropy (ME), and Conditional Random Fields (CRF). The feature space used for learning is the bag-of-words model, except for the DIC approach, which relies on the polarities of words in sentiment lexicons. Interestingly, although the dictionary approach presents the lowest individual performance on the datasets used in the experimental evaluation, the optimal ensemble provided by BMA always includes DIC as one of the base classifiers for all datasets.

Recently, Fersini et al. (2016) pointed out that not only are words key features in detecting the sentiment polarity of tweets, but some strong signals can also help to discriminate positive messages from negative ones. In this context, in Fersini et al. (2016), the combination of the bag-of-words representation of tweets with adjectives, pragmatic particles (emoticons, initialisms for emphatic expressions, and onomatopoeic expressions), and expressive lengthening is investigated both independently and as part of an ensemble learning strategy. More precisely, the bag-of-words vectors representing each tweet are expanded with five new features: the number of positive and negative adjectives, the number of positive and negative pragmatic particles, and the expressive lengthening of a tweet. In the experimental investigation, they show that using the bag-of-words model expanded with all those expressive signals in an ensemble learning framework (BMA, Fersini et al. 2014) can lead to a significant improvement in terms of accuracy.

The combination of distinct preprocessing techniques with well-established classification algorithms has been investigated by Lochter et al. (2016). In Lochter et al. (2016), they propose an ensemble system that performs a grid search to select the best combination of text preprocessing techniques and classification methods, such as Naive Bayes (NB), SVM, LR, k-Nearest Neighbors (k-NN), and Decision Trees (DT). In Lochter et al. (2016), they evaluate the predictive power of the ensemble system on nine datasets of tweets. As their goal is to detect the best combination of text preprocessing techniques and classifiers, they use a small fixed set of features for each learning method assessed, such as unigrams and the count of positive and negative terms in each tweet. Emadi and Rahgozar (2019) have recently proposed a classifier ensemble approach that combines supervised and unsupervised methods in Twitter sentiment classification. To this end, three supervised machine learning algorithms, namely SVM, NB, and ME, are used as base classifiers, each of them supplied with unigrams, bigrams, and a combination of both. In addition to those classifiers, an unsupervised NLP-based method is used. The classifiers are chosen based on diversity measures in order to select methods that complement one another. Once a set of classifiers with sufficient diversity is identified, a learning fusion method is applied to assign a polarity orientation to each tweet. In Emadi and Rahgozar (2019), the Choquet Fuzzy Integral (CFI) method is used as a meta-learning strategy, which combines the decisions of the classifiers. Araque et al. (2017) have investigated different combinations of features via ensemble learning and through feature concatenation. They evaluate and compare the predictive performance of these combinations against a supervised baseline model fed with word embeddings trained on a corpus of 1.28M tweets. For the ensemble model, they use as base classifiers six different sentiment methods, each one trained with various, though rather simple, features (e.g., n-grams, POS features, and polarity values for each word), in addition to classifiers trained with generic and affective word embeddings, i.e., word vectors trained for general purposes and for the sentiment analysis task, respectively.

Different from ensemble learning methods, which combine the strength of classifiers and features at prediction time, feature concatenation consists of combining different sets of features into a unified set, as a preprocessing step prior to the classification process. Aiming at evaluating the combination of several types of features, Araque et al. (2017) have proposed three feature concatenation models. The first one, denoted MSG, combines a small set of meta-level features and generic word embedding vectors. The second, MGA, combines generic and affective word vectors. Finally, the third, MSGA, consists of the combination of the features included in the first and second models, i.e., meta-level features and generic and affective word vectors. In the experimental evaluation, both the ensemble model and the feature concatenation model MSG achieved the best results, with no statistically significant difference between them. Agarwal et al. (2011) have proposed a rich set of meta-level features, termed Senti-features, which were divided into three categories: \({\mathbb {N}}\), \({\mathbb {R}}\), and \({\mathbb {B}}\). Features from category \({\mathbb {N}}\) are those whose value is a positive integer (e.g., #hashtags, #positive words, etc.). Features from category \({\mathbb {R}}\) are those whose value is a real number (e.g., the polarity score of words in some lexicon). Lastly, features whose value is a boolean (e.g., the presence of capitalized text) make up category \({\mathbb {B}}\). Besides, they adopted unigrams as a baseline. In the experimental evaluation of the proposed set of features, features were added incrementally to the baseline unigram model, and the authors show that the best result is achieved by using all meta-level features in combination with the unigrams, through feature concatenation. Mansour et al. (2015) examine a large set of features introduced in the literature of Twitter sentiment analysis. In Mansour et al. (2015), the authors performed an exhaustive combination of features aiming at identifying a compact feature subset that can reduce the computational complexity without harming classification accuracy. To this end, in addition to unigrams and bigrams, they also investigated Senti-features (Agarwal et al. 2011), NRC-features (Mohammad et al. 2013), and SSWE embeddings (Tang et al. 2014). In their experimental evaluation, they employed two datasets of tweets and the Maximum Entropy classification algorithm. For polarity detection, they identified that the best results were achieved by concatenating unigrams, bigrams, NRC-features, and SSWE embeddings, with a macro-F1 of 89%.

In Tang et al. (2014), Tang et al. explore the combination of the SSWE embeddings and the state-of-the-art NRC-features (Mohammad et al. 2013) through feature concatenation, which improved prediction performance from 84.98% to 86.58%. In order to obtain rich sources of information, Vo and Zhang (2015) employed, as features, a combination of word vectors trained with two different embedding learning approaches, namely Google’s word2vec (Mikolov et al. 2013b) and SSWE (Tang et al. 2014). To this end, they trained the embeddings on a large-scale corpus of 5M unlabeled tweets and show that the combination of generic and affective word vectors is beneficial to the sentiment classification of tweets. Xu et al. (2018) investigate the performance of their proposed affective embedding learning system, Emo2Vec, by combining the word vectors obtained with their approach and Stanford’s GloVe vectors (Pennington et al. 2014), in an attempt to render the feature representation more accurate, since Emo2Vec is weak at capturing syntactic and semantic meaning. Table 1 presents a summary of the combination methods discussed in this section.

Table 1 Summary of combination strategies on Twitter sentiment classification, separated by classifier ensemble and feature concatenation approaches

Discussion. Diversity is a key point in designing ensemble approaches (Brown et al. 2005). Despite the application of combination methods in Twitter sentiment classification, most works use the same feature representation and vary only the classification algorithms, as shown in Table 1, with the exception of Araque et al. (2017), whose base learners are state-of-the-art classifiers from the literature. We believe that different classification strategies may benefit from the use of an appropriate set of features. For example, an SVM classifier fed with n-grams may be successful in this task, but this might not hold true if we employ the same feature representation with another inducer, such as Random Forest. We investigate this hypothesis by evaluating the predictive power of features of varied types, feeding them to different state-of-the-art learning algorithms. Furthermore, we combine the best classification strategies for each group of features through feature concatenation, as well as using them as base classifiers in ensemble strategies. The features used in this investigation are n-grams, meta-level features, and word embeddings, which have been widely adopted in the literature as of late.

Though some other relevant studies present assessments of different Twitter sentiment analysis methods (Maynard and Bontcheva 2016; Zimbra et al. 2018), their primary focus is on the evaluation of existing systems, treating them as black boxes. Our assessment, on the other hand, is more fine-grained in the sense that we analyze the performance of distinct classification strategies at the feature level, i.e., we evaluate the effect of different kinds of features on the polarity detection task. Also, even though some other works sought to examine the combination of different kinds of features, such as the one presented in Mansour et al. (2015), we carry out a more robust evaluation. Specifically, Mansour et al. (2015) have only employed a subset of the features we examine in our experiments. For instance, regarding the representation provided by pre-trained embeddings, Mansour et al. (2015) have only investigated the SSWE embeddings (Tang et al. 2014), whereas we present an extensive analysis of ten different pre-trained embeddings from the literature. Lastly, while Mansour et al. (2015) adopted only two datasets of tweets and one classification algorithm, we present a more robust assessment on 22 datasets, making use of three distinct classification algorithms.

Another point we have observed is that works on the sentiment classification of tweets do not fully exploit the predictive power of meta-level features, especially those in which combination strategies are proposed. Over the years, many researchers have experimented with meta-level features in sentiment classification, but each work adopts only a small subset of them. Hence, in this work, we show that aggregating meta-level features from different works into a unique feature set can achieve higher accuracies when compared to n-grams and word embeddings. Also, we show that combining meta-level features with either n-grams or word embedding vectors can significantly improve the predictive performance of supervised Twitter sentiment analysis. In the following sections, we present an overview of each of these feature representations.

3 N-gram features

Different types of features have been engineered and used in Twitter sentiment analysis, from the most common representation, such as n-grams, to meta-level features and word embeddings. N-grams are contiguous sequences of n tokens from a text. The most common representation of textual data is the bag-of-words or unigram model (\(n = 1\)), in which each word of a tweet is considered as a feature. In general, the feature space is represented by a binary feature vector indicating whether each word of the vocabulary occurs in the tweet or not. In that case, the values 0 and 1 represent the absence and presence of each word in the tweet, respectively (Pak and Paroubek 2010).
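As a concrete illustration, the snippet below builds such a binary representation with unigrams and bigrams (a minimal sketch using scikit-learn's CountVectorizer; the example tweets are invented):

```python
from sklearn.feature_extraction.text import CountVectorizer

tweets = [
    "i love this new phone",
    "worst service ever , never again",
    "the new update is not bad at all",
]

# Binary presence/absence of unigrams and bigrams (n = 1 and n = 2).
vectorizer = CountVectorizer(ngram_range=(1, 2), binary=True)
X = vectorizer.fit_transform(tweets)        # sparse matrix: tweets x n-gram vocabulary

print(X.shape)                              # (3, vocabulary size)
print(vectorizer.get_feature_names_out()[:5])
```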

In the task of sentiment analysis, Pang et al. (2002) were the pioneers in using n-grams as features to detect the polarity of movie reviews. In the sentiment classification of tweets, Go et al. (2009) used the same approach as in Pang et al. (2002) to classify the sentiment expressed in tweets using a distant supervision method. This method relies on positive and negative emoticons as noisy labels in a training dataset of 1.6M tweets. Since then, n-grams have been one of the most adopted features in supervised learning strategies due to their simplicity in representing tweets (Agarwal et al. 2011; Araque et al. 2017; Arif et al. 2018; Barbosa and Feng 2010; Bermingham and Smeaton 2010; Bifet and Frank 2010; Chikersal et al. 2015; Cozza and Petrocchi 2016; da Silva et al. 2016, 2014; Davidov et al. 2010; Emadi and Rahgozar 2019; Go et al. 2009; Hagen et al. 2015; Hamdan 2016; Hamdan et al. 2015; Jabreel and Moreno 2017; Jiang et al. 2011; Kouloumpis et al. 2011; Lin and Kolcz 2012; Lochter et al. 2016; Miranda-Jiménez et al. 2017; Mohammad et al. 2013; Narr et al. 2012; Pak and Paroubek 2010; Saif et al. 2012; Siddiqua et al. 2016; Speriosu et al. 2011; Wang et al. 2012b; Zhang et al. 2011).

Table 2 presents an overview of the n-gram features in the literature of Twitter sentiment analysis. As shown in Table 2, most studies in the literature discourage the use of higher-order n-grams, such as 4- and 5-grams, trying to minimize the sparsity problem.

Table 2 Overview of the n-gram features used in the literature of Twitter sentiment classification, ordered by publication year

4 Meta-level features

Meta-level features, also called hand-crafted features, are usually extracted from other features and can capture insightful new information about the data (Canuto et al. 2016), exploring the content of tweets more effectively than merely relying on raw sequences of words. In this study, we consider as meta-level features those related to counts and summations, which are, in general, secondary information extracted from tweets. Meta-level features are referred to hereafter as meta-features.

In this section, we present and categorize the most common types of meta-features that we have examined in a set of well-referenced works on the supervised sentiment classification of tweets (Agarwal et al. 2011; Barbosa and Feng 2010; Bravo-Marquez et al. 2014; Buscaldi and Hernandez-Farias 2015; da Silva et al. 2014; Davidov et al. 2010; Go et al. 2009; Hagen et al. 2015; Jiang et al. 2011; Khuc et al. 2012; Kouloumpis et al. 2011; Mohammad et al. 2013; Park et al. 2018; Vo and Zhang 2016; Zhang et al. 2011). This categorization is an extension of the study presented in Carvalho and Plastino (2016), whose categories are revisited in this work. Considering that features sharing structural aspects should fall into the same group, we have organized the meta-features into five categories, namely: Microblog, Part-of-Speech, Surface, Emoticon, and Lexicon-based features. An overview of the meta-features and their respective categories is presented in Table 3. The number in parentheses right below the name of each category corresponds to the total number of features in that category. In the following, we describe each category of meta-features.

Table 3 Overview of the meta-level features proposed in the literature of Twitter sentiment classification

Microblog features. The Microblog category refers to those features that leverage the syntax and the vocabulary used in tweets and microblog messages, as used in Agarwal et al. (2011), Barbosa and Feng (2010), Hagen et al. (2015), Jiang et al. (2011), Kouloumpis et al. (2011), Mohammad et al. (2013), Zhang et al. (2011). More specifically, some characteristics of how microblog posts are written may be good indicators of sentiment, such as the use of repeated letters and internet slang present in the vocabulary of this type of text. Furthermore, Twitter-specific tokens, such as user mentions (followed by the special character @), retweets (indicated by RT), URLs, and hashtags (followed by the special character #) have also been explored in the literature.
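The fragment below sketches how a few meta-features of this category can be extracted from a raw tweet with simple regular expressions (an illustrative simplification, not the exact feature set of the cited works):

```python
import re

def microblog_features(tweet):
    """Count Twitter-specific tokens and informal writing cues in a raw tweet."""
    return {
        "num_hashtags": len(re.findall(r"#\w+", tweet)),
        "num_mentions": len(re.findall(r"@\w+", tweet)),
        "num_urls": len(re.findall(r"https?://\S+", tweet)),
        "has_retweet": int(bool(re.search(r"\bRT\b", tweet))),
        # Words with a character repeated three or more times (e.g., "soooo").
        "num_elongated": len(re.findall(r"\b\w*(\w)\1{2,}\w*\b", tweet)),
    }

print(microblog_features("RT @user soooo happy with this!!! #blessed http://t.co/abc"))
```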

Part-of-Speech features. Although some studies have already acknowledged that part-of-speech (POS) features are not useful for sentiment classification (Go et al. 2009; Pang et al. 2002), this category of features is still used to determine the sentiment of tweets, in combination with other features (Agarwal et al. 2011; Barbosa and Feng 2010; Bravo-Marquez et al. 2014; Go et al. 2009; Kouloumpis et al. 2011; Mohammad et al. 2013). For example, assuming that some adjectives and verbs are good indicators of positive and negative sentiment, Barbosa and Feng (2010) map each word in a tweet to its POS, identifying nouns, verbs, adjectives, adverbs, interjections, and others. Similarly, Agarwal et al. (2011) consider the numbers of adjectives, adverbs, verbs, and nouns as features. In order to capture the informal aspects of tweets, some works (Bravo-Marquez et al. 2014; Mohammad et al. 2013) use the POS tagset presented in Gimpel et al. (2011) to identify special characteristics of short and noisy texts, such as misspelled words.

Surface features. Surface features capture superficial stylistic content of the tweet, such as the number of words, capitalized words, words with all caps, capital letters, and punctuation (Agarwal et al. 2011; Barbosa and Feng 2010; Davidov et al. 2010; Hagen et al. 2015; Jiang et al. 2011; Kouloumpis et al. 2011; Mohammad et al. 2013; Park et al. 2018). Punctuation may also play an important role in sentiment detection of microblog messages. Thus, punctuation features have also been explored in the literature (Agarwal et al. 2011; Barbosa and Feng 2010; Davidov et al. 2010; Hagen et al. 2015; Jiang et al. 2011; Mohammad et al. 2013; Park et al. 2018). The most usual meta-features in this category are the number of exclamation and question marks, as appearing in (Agarwal et al. 2011; Barbosa and Feng 2010; Davidov et al. 2010; Hagen et al. 2015; Jiang et al. 2011; Park et al. 2018). Some works have already proposed more sophisticated meta-features, such as the number of contiguous sequences of exclamation and question marks (Hagen et al. 2015; Mohammad et al. 2013), regarding their use in microblog messages to convey intonation.

Emoticon features. The polarity of emoticons may also be another relevant characteristic for Twitter sentiment analysis. Since emoticons are used by microblog users to summarize the sentiment they intend to communicate, some works have also extracted meta-features from emoticons, such as the number of positive and negative emoticons in a tweet, as employed in Agarwal et al. (2011), da Silva et al. (2014), Hagen et al. (2015), Mohammad et al. (2013), Park et al. (2018).

Lexicon-based features. A different manner of exploring the content of tweets in order to determine the sentiment expressed in them is by using existing sentiment lexical resources, or dictionaries, from the literature. These lexicons consist of lists of positive and negative terms, such as Bing Liu’s opinion lexicon (Liu 2012), NRC-emotion (Mohammad and Turney 2013), and the OpinionFinder lexicon (Wilson et al. 2005), as well as lexical resources containing words and phrases scored on a range of real values, such as AFINN (Nielsen 2011), SentiWordNet (SWN) (Baccianella et al. 2010), NRC-hashtag (Mohammad et al. 2013), and the Sentiment140 lexicon (Sent140) (Mohammad et al. 2013). Meta-features of this category have been widely explored in the sentiment classification of tweets (Agarwal et al. 2011; Bravo-Marquez et al. 2014; Buscaldi and Hernandez-Farias 2015; da Silva et al. 2014; Hagen et al. 2015; Jiang et al. 2011; Khuc et al. 2012; Kouloumpis et al. 2011; Mohammad et al. 2013; Vo and Zhang 2016), especially the total counts of positive and negative words.

It has already been acknowledged that negation can affect the polarity of an expression (Wiegand et al. 2010). Indeed, the expression not good is the opposite of good. In this context, an interesting meta-feature proposed in the literature to handle negation is the number of negated contexts (Mohammad et al. 2013). Mohammad et al. (2013) have defined a negated context as a segment of a tweet that starts with a negation word, such as shouldn’t, and ends on the first punctuation mark after the negation word.
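A minimal sketch of such a count is given below (the list of negation cues is illustrative, and the procedure is a simplification of the one described in Mohammad et al. 2013):

```python
import re

NEGATION_WORDS = {"not", "no", "never", "cannot", "shouldn't", "won't", "don't", "didn't"}
PUNCTUATION = re.compile(r"[.,:;!?]")

def count_negated_contexts(tweet):
    """Count segments that start with a negation cue and end at the next punctuation mark."""
    count = 0
    in_negated_context = False
    for token in tweet.lower().split():
        if PUNCTUATION.search(token):
            in_negated_context = False          # punctuation closes the current context
        elif not in_negated_context and token in NEGATION_WORDS:
            in_negated_context = True           # a new negated context begins here
            count += 1
    return count

print(count_negated_contexts("I don't like this phone, and I won't buy it again"))  # 2
```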

Regarding irony, Reyes et al. (2013) argue that it represents a meaningful obstacle for determining the polarity of texts accurately. For example, in domains like politics, health campaigns, and natural disasters, Twitter users post ironic messages criticizing and blaming the government, and most sentiment analysis models cannot deal properly with them. To this end, in Reyes et al. (2013), they have proposed features to help capture irony in text, such as the number of counter-factuality words (e.g., nonetheless, nevertheless) and temporal compression words (e.g., suddenly, now), which have been used in Twitter sentiment analysis (Buscaldi and Hernandez-Farias 2015). As described in Reyes et al. (2013), while counter-factuality words are discursive terms that hint at contradiction in a text, temporal compression words are focused on identifying elements related to the opposition in time, i.e., words that indicate an abrupt change in a narrative.

5 Word embedding-based features

Although the well-known bag-of-words and n-gram representations have been extensively used owing to their simplicity, they make the feature space highly dimensional, leading to the curse of dimensionality, as discussed in Sect. 2. Also, hand-crafted features are time-consuming to design, often incomplete, and require significant human effort (Pouyanfar et al. 2018; Young et al. 2018). On the other hand, deep learning approaches perform feature extraction in an automated way, which allows researchers to extract discriminative features with minimal domain knowledge (Pouyanfar et al. 2018). In recent research, with the increasing interest in deep learning approaches for NLP applications, distributed representations of words in a vector space, or word embeddings, have received much attention due to their ability to achieve high performance in many text classification tasks.

Word embeddings can capture the semantic and syntactic relations between words from a large amount of unlabeled text data, representing them in dense real-valued vectors that can be used as features in supervised machine learning frameworks. As described in Bengio et al. (2003), the feature vectors associated with each word are learned from large corpora, and each value represents a different aspect, or dimension, of the word. The main idea is that words that frequently occur together in the same contexts are mapped to similar regions of the vector space (Agarwal et al. 2018).

In Mikolov et al. (2013a), Mikolov et al. designed the word2vec tool (w2v), comprising the CBOW and the Skip-gram models, which are neural architectures to train word embeddings. More specifically, given a massive text corpus, these architectures learn vector representations of the words in its vocabulary. As described in Mikolov et al. (2013a), the CBOW method predicts the source word based on its context, while the Skip-gram predicts nearby words given a source word. Later, in Mikolov et al. (2013b), they improved the Skip-gram model, making it much more computationally efficient. In Mikolov et al. (2013b), they used an internal dataset of news articles from Google with one billion words to train the model, generating 300-dimensional word vectors. Pennington et al. (2014) argue that the statistics of the words in a given training corpus are underutilized by the Skip-gram model (Mikolov et al. 2013b), since it does not take into account global co-occurrence counts of words. For that reason, they propose a weighted least squares model, namely GloVe (Global Vectors), that leverages global word-word co-occurrence counts in the word embedding training phase. They trained 300-dimensional word vectors and evaluated the proposed model on the word analogy, word similarity, and named entity recognition tasks, showing that GloVe outperforms the w2v models (CBOW and Skip-gram) by a significant margin.
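For illustration, the sketch below shows how pre-trained vectors of this kind are commonly turned into tweet-level features by averaging the vectors of the words in a tweet (the file name is a placeholder for any GloVe-style plain-text model, and the parsing assumes one space-separated entry per line):

```python
import numpy as np

def load_vectors(path):
    """Load a GloVe-style text file: each line holds a word followed by its vector values."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            word, *values = line.rstrip().split(" ")
            vectors[word] = np.asarray(values, dtype=np.float32)
    return vectors

def tweet_embedding(tokens, vectors, dim=300):
    """Average the vectors of in-vocabulary words; out-of-vocabulary words are skipped."""
    known = [vectors[t] for t in tokens if t in vectors]
    return np.mean(known, axis=0) if known else np.zeros(dim, dtype=np.float32)

# "pretrained_vectors.txt" is a placeholder for any pre-trained embedding file.
vectors = load_vectors("pretrained_vectors.txt")
print(tweet_embedding("i love this new phone".split(), vectors).shape)  # (300,)
```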

Most techniques to train word vectors ignore the internal structure of words, making it difficult to learn good representations for morphologically rich languages, which have many different inflected forms of the same word. Thus, Bojanowski et al. (2016) proposed the fastText model, which learns representations for character n-grams as an extension of the Skip-gram model (Mikolov et al. 2013b). Later, Mikolov et al. (2018) combined some pre-processing strategies rarely used together to improve the standard fastText model and achieved state-of-the-art results on several tasks.

Arguing that traditional approaches to training word embeddings are inefficient for sentiment analysis, some authors have designed solutions to train word vectors specifically for this task (Agarwal et al. 2018; Felbo et al. 2017; Tang et al. 2014; Xu et al. 2018). Tang et al. (2014) developed a neural network to learn sentiment-specific word embeddings (SSWE) on a massive corpus of tweets. They used the SSWE word vectors as features in a supervised machine learning strategy and reported results comparable to those achieved by applying the meta-level features proposed in Mohammad et al. (2013). Felbo et al. (2017) took advantage of the vast number of emoji occurrences in tweets to train models with rich emotional representations by using a transfer learning approach, namely DeepMoji. They evaluated the DeepMoji model on eight benchmark datasets for the emotion, sarcasm, and sentiment classification tasks, and their results outperformed the state of the art for all assessed datasets, including the results achieved with the SSWE (Tang et al. 2014) method.

In Xu et al. (2018), Xu et al. proposed Emo2Vec, a multi-task training framework that incorporates different emotion-related tasks in the training process, such as sentiment analysis, emotion classification, sarcasm detection, abusive language classification, stress detection, insult classification, and personality recognition. They argue that including the affective information from all those domains may benefit the learning process, thus enabling the creation of a more general emotional embedding space. Compared with the SSWE and DeepMoji models, the Emo2Vec word vectors achieved competitive results. Also, claiming that Emo2Vec is weak at capturing the syntactic and semantic meaning of words, they concatenated Emo2Vec with the pre-trained GloVe (Pennington et al. 2014) vectors for comparison with state-of-the-art results on 14 datasets from distinct domains. In the experimental evaluation, the combination of Emo2Vec with GloVe vectors fed to an LR classifier achieved comparable performance for some datasets.

Discussing the challenges of the emotion classification problem, Agarwal et al. (2018) address some limitations of this task by leveraging noisy training data covering a large range of emotions to learn emotion-enriched word representations, namely Emotion Word Embeddings (EWE). Instead of tweets, they have explored product reviews, as this type of text may generalize better to other domains. They evaluated the predictive performance of EWE against state-of-the-art pre-trained word vectors (Felbo et al. 2017; Mikolov et al. 2013b; Pennington et al. 2014; Tang et al. 2014) on four datasets from various domains, such as fairy tales, blogs, experiences, and tweets. To this end, they used LR and SVM as the learning strategies, showing that the proposed method outperforms all the other methods with a statistically significant difference.

Recent advances in language modeling using neural networks have made it viable to model language as distributions over characters (Akbik et al. 2018). Akbik et al. (2018) have proposed a method that passes sentences as sequences of characters into a character-level language model to form word-level embeddings. The character-level language model can capture syntactic-semantic word features and disambiguate words in context, resulting in state-of-the-art performance in NLP sequence labeling tasks. The method proposed in Akbik et al. (2018) can produce different embeddings for the same word depending on its context. Later, in Akbik et al. (2019), Akbik et al. have developed FLAIR, an NLP framework which facilitates the training and distribution of state-of-the-art language models, thus reducing architectural complexity.

Recently, Peters et al. (2018) introduced ELMo (Embeddings from Language Models), a deep contextualized word representation that models not only complex characteristics of word usage, such as syntax and semantics, but also how these uses vary across linguistic contexts. ELMo is a feature-based approach for applying pre-trained language representations to downstream tasks, which concatenates context-sensitive features independently extracted from a left-to-right and a right-to-left language model.

In contrast to ELMo (Peters et al. 2018), which is not deeply bidirectional since it simply concatenates left-to-right and right-to-left representations, Devlin et al. (2019) presented a strategy called BERT (Bidirectional Encoder Representations from Transformers), which is a fine-tuning approach that alleviates the unidirectionality constraint by applying masked language models to enable pre-trained deep bidirectional representations. This model randomly masks some percentage of the input tokens, and the objective is to predict those masked tokens based on their context (Devlin et al. 2019). In addition, they also use a next sentence prediction (NSP) task for predicting whether two text segments follow each other in the original text. In Devlin et al. (2019), Devlin et al. showed that BERT outperforms many task-specific architectures and achieved state-of-the-art performance for eleven NLP tasks.

Later, in Liu et al. (2019), Liu et al. proposed modifications for training BERT models, which they called RoBERTa (Robustly optimized BERT approach). For example, regarding the masked language model, while BERT performs masking once during data preprocessing, resulting in a single static mask, RoBERTa applies dynamic masking, where the masking pattern is generated every time a sequence is fed to the model. Also, Liu et al. questioned the necessity of the NSP loss and trained RoBERTa without it. In Liu et al. (2019), Liu et al. combined these improvements and evaluated their combined impact on downstream tasks using three benchmark datasets, achieving state-of-the-art results.

Another point recently investigated in the literature is the application of capsule networks in NLP tasks (Yang et al. 2018; Zhao et al. 2019). Capsule networks (Sabour et al. 2017) are neural network architectures, proposed in the domain of image classification, that address some limitations of convolutional neural networks (CNNs). For example, Zhao et al. (2019) argue that pooling operations in CNNs wrongly discard positional information and do not consider hierarchical relationships between local features, thus requiring massive amounts of training samples for generalization. On the other hand, capsule networks have the ability to learn hierarchical relationships between consecutive layers by using routing processes (Zhao et al. 2019). In Zhao et al. (2019), Zhao et al. extended existing capsule networks to NLP tasks and achieved state-of-the-art results on question answering and multi-label text classification.

It has already been acknowledged that achieving suitable and sufficient representations of words depends on the volume of data used to train the word embedding models. Much effort in recent research is mainly focused on scalability issues of existing methods. For that reason, many researchers make the word vectors trained with their architectures available for public use. Those publicly available word vectors are referred to as pre-trained word embeddings.

Table 4 presents the characteristics of the pre-trained word embeddings generated by some methods discussed in this section. The \(|\textit{D}|\) and \(|\textit{V}|\) columns refer to the dimension and vocabulary size of each pre-trained embedding, respectively. The Type column separates the word embeddings trained for general purpose (generic) from those specially trained for the sentiment analysis and emotion detection tasks (affective). Additionally, under the Corpus column, we present information about the textual corpora used to train the embeddings.

Table 4 Characteristics of the pre-trained word embeddings separated by type and ordered by the number of dimensions (\(|\textit{D}|\) column)

As described in Table 4, the GloVe-TWT and GloVe-WP word vectors (Pennington et al. 2014) were trained on massive text corpora from Twitter and Wikipedia+Gigaword, respectively. The fastText vectors (Mikolov et al. 2018) were trained on rich and vast sources of data, including Wikipedia, news from statmt.org, and the UMBC text corpus.

Regarding the word vectors trained with the word2vec tool, w2v-GN is the earliest one, and its construction is detailed in Mikolov et al. (2013b). Bravo-Marquez et al. (2016) have used the Skip-gram method implemented in the word2vec tool to train word vectors on a vast corpus of ten million tweets from the Edinburgh Twitter corpus (Petrović et al. 2010). In Bravo-Marquez et al. (2016), they have optimized the parameters for classifying words into emotions and made the pre-trained vectors publicly available (w2v-Edin). More recently, Araque et al. (2017) developed a supervised learning system using word vectors as features. The w2v-Araque vectors were trained on a corpus of 1,280,000 tweets with the word2vec tool, and the system was used as a baseline for comparison with other approaches.

Regarding the affective pre-trained vectors, which leverage the sentiment or emotion information during the training phase, the SSWE (Tang et al. 2014), Emo2Vec (Xu et al. 2018), and DeepMoji (Felbo et al. 2017) word vectors were trained on tweets, while the EWE (Agarwal et al. 2018) representations were trained on product reviews from Amazon. All of them were generated using specific methods for creating word representations to incorporate the sentiment information of texts during the training process.

6 Experimental evaluation

This section presents the computational results obtained by evaluating the different feature representations presented in Sects. 3, 4, and 5, namely n-grams, meta-level features, and word embeddings, respectively. Next, we present the results achieved by combining those distinct types of features through combination strategies, such as feature concatenation and classifier ensembles, as discussed in Sect. 2.

We begin by describing the experimental protocol we followed in Sect. 6.1. Then, in Sects. 6.2 and 6.3, we report and discuss the results of a set of experiments so as to arrive at answers to research questions RQ1 through RQ3, introduced in Sect. 1.

6.1 Experimental setting

Attempting to answer the research questions presented in Sect. 1, we adopted Weka’s (Hall et al. 2009) implementations of the machine learning algorithms Support Vector Machines (SVM), L2-regularized Logistic Regression (LR), and Random Forests (RF). For SVM and LR, we used the LIBSVMFootnote 4 (Chang and Lin 2011) and LIBLINEARFootnote 5 (Fan et al. 2008) implementations, respectively. Also, we set the regularization parameter to its default value (C = 1.0), and we employed the linear kernel for LIBSVM. Table 5 shows a summary of the classification algorithms adopted in this work, highlighting their advantages and disadvantages.

Table 5 Summary of the classification algorithms adopted in this work
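The original experiments were run in Weka with the LIBSVM and LIBLINEAR wrappers; as an illustrative, non-authoritative stand-in, the sketch below configures analogous classifiers in scikit-learn with the settings reported above (linear kernel, C = 1.0, L2-regularized LR). The variable names and the number of trees for RF are assumptions, not values taken from the paper.

```python
# Rough scikit-learn analogue of the classifier configuration described above.
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

classifiers = {
    "SVM": SVC(kernel="linear", C=1.0, probability=True),          # linear kernel, C = 1.0
    "LR": LogisticRegression(penalty="l2", C=1.0, max_iter=1000),  # L2-regularized LR
    "RF": RandomForestClassifier(n_estimators=100, random_state=42),
}
```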

We used a set of twenty-two datasets for the computational experiments reported in this section. These datasets have been extensively used in the literature of Twitter sentiment analysis. To the best of our knowledge, this is the first study using such a significant number of Twitter datasets in the evaluation of the different types of features employed in the literature over the years. The datasets are: irony (Gonçalves et al. 2015), sarcasm (Gonçalves et al. 2015), aisopos,Footnote 6 SemEval-Fig (Ghosh et al. 2015), sentiment140 (Go et al. 2009), person (Chen et al. 2012), hobbit (Lochter et al. 2016), iphone6 (Lochter et al. 2016), movie (Chen et al. 2012), sanders,Footnote 7 Narr (Narr et al. 2012), archeage (Lochter et al. 2016), SemEval18 (Mohammad et al. 2018), OMD (Diakopoulos and Shamma 2010), HCR (Speriosu et al. 2011), STS-Gold (Saif et al. 2013), SentiStrength (Thelwall et al. 2012), Target-dependent (Dong et al. 2014), Vader (Hutto and Gilbert 2014), SemEval13 (Nakov et al. 2013), SemEval16 (Nakov et al. 2016), and SemEval17 (Rosenthal et al. 2017). Some characteristics of these datasets are presented in Table 6, namely their total number of tweets and their numbers of positive and negative tweets.

It is worth mentioning that the original sentiment140 dataset, as described in Go et al. (2009), contains 1.6 million training tweets, which were annotated using a distant supervision approach, meaning that emoticons from such tweets were used as noisy labels. However, in our experiments, we have only used the test partition, which contains 177 negative and 182 positive tweets manually labeled by a set of human annotators. Although the test partition is relatively small, it has been widely used in Twitter sentiment analysis (Bakliwal et al. 2012; Bravo-Marquez et al. 2013; Saif et al. 2013, 2012; Speriosu et al. 2011).

In the experimental evaluation, the predictive performance of the sentiment classification is measured in terms of classification accuracy. For each evaluated dataset, the accuracy of the classification was computed as the ratio between the number of correctly classified tweets and the total number of tweets, following a stratified tenfold cross-validation.
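A minimal sketch of this evaluation protocol, assuming scikit-learn utilities as a stand-in for the Weka pipeline; `clf`, `X`, and `y` are placeholders for a classifier, a dataset's feature matrix, and its labels.

```python
# Stratified tenfold cross-validation with classification accuracy as the measure.
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score

def tenfold_accuracy(clf, X, y, seed=42):
    skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
    scores = cross_val_score(clf, X, y, cv=skf, scoring="accuracy")
    return 100.0 * np.mean(scores)  # accuracy (%) averaged over the ten folds
```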

Moreover, as suggested by Demšar (2006), we ran the Friedman test followed by the Nemenyi post-hoc test to determine whether the differences among the accuracies are statistically significant at a 0.05 significance level. Whenever applicable, we present the results of the statistical tests immediately below each results table. For this purpose, we use the symbol \(\succ\) to show that some classifier x is significantly better than some classifier y, so that {x} \(\succ\) {y}.
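The following sketch illustrates this statistical procedure, assuming SciPy for the Friedman test and the third-party scikit-posthocs package for the Nemenyi post-hoc test; the accuracy values are illustrative only.

```python
# Friedman test over per-dataset accuracies, followed (when significant) by the
# Nemenyi post-hoc test.
import numpy as np
from scipy.stats import friedmanchisquare
import scikit_posthocs as sp

# Rows = datasets, columns = classifiers (e.g., SVM, LR, RF); values are illustrative.
accuracies = np.array([
    [81.0, 80.2, 75.4],
    [78.3, 79.1, 72.0],
    [85.2, 84.7, 80.9],
    [69.5, 70.1, 66.3],
    [90.4, 90.0, 87.2],
])

stat, p_value = friedmanchisquare(*accuracies.T)
if p_value < 0.05:  # significance level adopted in the paper
    # Pairwise p-values; entries below 0.05 indicate significant differences.
    print(sp.posthoc_nemenyi_friedman(accuracies))
```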

Table 6 Characteristics of the Twitter sentiment datasets ordered by size (#tweets column)

6.2 Answering research question RQ1

The experiments conducted in this section aim at answering research question RQ1, as follows:

RQ1. Which group of features is the most effective in Twitter sentiment analysis?

We answer this question throughout Sects. 6.2.1, 6.2.2, and 6.2.3, by assessing the distinct groups of features we have identified in the literature. Those features include n-grams, meta-features, and word embeddings. Then, after determining the best classifiers for each group of features, we perform a comparison between them to determine the most representative one for Twitter sentiment analysis. The discussion on this comparison is presented in Sect. 6.2.4.

Besides the comparative evaluation of the feature sets, we present an analysis of the categories of the meta-features introduced in Sect. 4, as well as an assessment study of a significant collection of pre-trained embedding models in Sects. 6.2.2 and 6.2.3, respectively.

6.2.1 Effectiveness of n-gram features

The n-gram features used in the computational experiments reported in this section are unigrams, bigrams, and trigrams. We do not explore higher-order n-grams so as to minimize the negative effect of high dimensionality. Besides, unigrams, bigrams, and trigrams are the most adopted n-gram features in the literature of sentiment detection in tweets, as previously seen in Table 2.

As a preprocessing step, we used the same strategy as in Mohammad et al. (2013). Each tweet was tokenized and labeled with its part-of-speech tags, using the Twitter-specific part-of-speech tagging tool (Gimpel et al. 2011). This tag set consists of twenty-five POS tags, specifically designed for tweets, that take into account the different aspects that tweets have as compared to regular text. Then, for each tweet in a given dataset, we replaced URLs with the token “http://someurl” and user mentions with the token “@someuser”. Regarding stopword removal, we discarded stopwords only as unigrams, since it has been acknowledged that stopwords can affect the polarity of some expressions in higher-order n-grams (Speriosu et al. 2011). Finally, considering that negation wordsFootnote 8 (“shouldn’t”, for example) can affect the n-gram-based features, we handled negation with the same approach used by Mohammad et al. (2013): in a negated context, the tag _NEG is concatenated to every token between the negation word and the first punctuation mark after it. For example, in the sentence “He isn’t a great book writer, but I read his books.”, the unigrams “great”, “book”, and “writer” become “great_NEG”, “book_NEG”, and “writer_NEG”, respectively.
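A minimal sketch of this _NEG tagging convention; the negation word list and the punctuation set that closes a negated context are illustrative assumptions, not the exact resources used by Mohammad et al. (2013).

```python
import re

# Words that open a negated context and punctuation that closes it (both illustrative).
NEGATION = re.compile(r"(?:\b(?:never|no|nothing|nowhere|not|cannot)\b|n't\b)", re.IGNORECASE)
CLAUSE_PUNCT = re.compile(r"^[.,:;!?]+$")

def tag_negated_context(tokens):
    """Append _NEG to every token between a negation word and the next punctuation mark."""
    tagged, in_scope = [], False
    for tok in tokens:
        if CLAUSE_PUNCT.match(tok):
            in_scope = False
            tagged.append(tok)
        elif in_scope:
            tagged.append(tok + "_NEG")
        else:
            tagged.append(tok)
            if NEGATION.search(tok):
                in_scope = True
    return tagged

# Example from the text (pre-tokenized):
# tag_negated_context("He isn't a great book writer , but I read his books .".split())
# -> ['He', "isn't", 'a_NEG', 'great_NEG', 'book_NEG', 'writer_NEG', ',', 'but', ...]
```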

After preprocessing all tweets and extracting the n-gram features, the feature space is represented by a binary feature vector indicating whether each n-gram existing in the vocabulary occurs in the tweet or not. In that case, the values 0 and 1 represent the absence and presence of each n-gram in the tweet, respectively.
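As an illustration of this binary n-gram representation, the sketch below uses scikit-learn's CountVectorizer on pre-tokenized tweets; the example tweets are placeholders.

```python
# Unigrams, bigrams, and trigrams encoded as presence/absence (0/1) indicators.
from sklearn.feature_extraction.text import CountVectorizer

preprocessed_tweets = [
    "he isn't a_NEG great_NEG book_NEG writer_NEG , but i read his books .",
    "loving the new phone http://someurl",
]

vectorizer = CountVectorizer(
    ngram_range=(1, 3),    # unigrams, bigrams, and trigrams
    binary=True,           # 1 if the n-gram occurs in the tweet, 0 otherwise
    tokenizer=str.split,   # tweets are already tokenized; just split on whitespace
    token_pattern=None,
    lowercase=False,
)
X = vectorizer.fit_transform(preprocessed_tweets)  # sparse 0/1 feature matrix
```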

Table 7 shows the results of the evaluation of the n-gram features in terms of classification accuracy (%), as well as the number of features extracted for each dataset (#features column). The boldfaced values indicate the best results, and the total number of wins for each classifier is presented in the #wins row. Also, we compute a ranking to make a fair comparison between the results. Precisely, for each dataset, we assign scores from 1.0 to 3.0 for each tested situation (each column), in ascending order of accuracy, where the score 1.0 is assigned to the situation with the highest accuracy. Thus, low score values indicate better results. Finally, we sum up the assigned scores for each classifier, as shown in the rank sums row.
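A small sketch of this per-dataset scoring scheme: on each dataset, the most accurate classifier receives score 1.0 and the least accurate receives 3.0. Ties are assumed to receive average ranks (a detail not specified above), and the accuracy values are illustrative only.

```python
import numpy as np
from scipy.stats import rankdata

# Rows = datasets, columns = classifiers (SVM, LR, RF); illustrative values.
accuracies = np.array([
    [80.1, 79.5, 74.2],
    [68.3, 69.0, 65.7],
    [91.2, 90.8, 88.4],
])

scores = np.vstack([rankdata(-row) for row in accuracies])  # 1.0 = best on that dataset
rank_sums = scores.sum(axis=0)       # lower rank sums indicate better overall results
wins = (scores == 1.0).sum(axis=0)   # number of datasets on which each classifier is best
```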

Table 7 Accuracies (%) achieved by evaluating the n-gram features using SVM, LR, and RF classifiers, respectively

As we can observe in Table 7, the best results were achieved by SVM in 12 out of the 22 datasets. Indeed, SVM has proven its robustness on large feature spaces in Twitter sentiment analysis (Mohammad et al. 2013). The LR classifier achieved performance comparable to SVM. Conversely, the worst performance was achieved by the RF classifier. The poor performance of RF may be due to the sparse nature of the data, in which most feature values are zero, increasing the risk of selecting a subset of irrelevant or noisy features when splitting the data at an internal node in the tree.

Another point worth highlighting is that the n-gram model does not seem to be a good choice for representing tweets from datasets irony and sarcasm. This can be justified by the rather small numbers of tweets these datasets contain, that is, 65 and 71, respectively. It appears that the n-gram-based features may not be representative enough in the sentiment classification of the tweets from these datasets, since classification is performed based on the vocabulary extracted from the training set, that is, the n-grams themselves. Finally, the Friedman test followed by the Nemenyi post-hoc test detected that both SVM and LR are significantly better than RF for this particular type of feature, but there is no significant difference between them.

6.2.2 Effectiveness of meta-level features

In this section, we present an assessment study of the meta-features in two parts. First, we show and compare the predictive performance of SVM, LR, and RF by using the full set of meta-features, in order to isolate the most appropriate one when this type of feature is exploited. Then, we evaluate each category of meta-features to identify the most effective one in the sentiment classification of tweets.

The meta-features evaluated in this section are those described and categorized in Sect. 4 (see Table 3 for details). To determine the polarity of adjectives, nouns, adverbs, and verbs, we used the SentiWordNet sentiment lexicon (Baccianella et al. 2010). In addition, we made use of the internet slang and emoticon lists introduced in Agarwal et al. (2011) in order to identify such language in the tweets. For abbreviations, we adopted the Internet Lingo Dictionary (Wasden 2010), as employed in Kouloumpis et al. (2011).
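To illustrate the kind of lexicon-based meta-features involved, the sketch below computes a few simple counts from a hypothetical word-polarity table standing in for resources such as SentiWordNet; the feature names are illustrative, not the exact ones listed in Table 3.

```python
# Minimal sketch of lexicon-based meta-feature extraction; LEXICON is hypothetical.
LEXICON = {"great": 1.0, "love": 0.8, "terrible": -0.9, "worst": -1.0}

def lexicon_meta_features(tokens):
    scores = [LEXICON[t.lower()] for t in tokens if t.lower() in LEXICON]
    return {
        "num_positive_words": sum(1 for s in scores if s > 0),
        "num_negative_words": sum(1 for s in scores if s < 0),
        "sum_of_scores": sum(scores),
        "polarity_of_last_token": LEXICON.get(tokens[-1].lower(), 0.0) if tokens else 0.0,
    }

print(lexicon_meta_features("the worst phone but a great camera".split()))
# {'num_positive_words': 1, 'num_negative_words': 1, 'sum_of_scores': 0.0,
#  'polarity_of_last_token': 0.0}
```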

The results of the first experiment are reported in Table 8. The RF classifier performed significantly better than SVM and LR, achieving the highest accuracies in 16 out of the 22 datasets. Although RF may not be a good choice on sparse feature spaces, it is robust to outliers and noise and can handle class imbalance (Breiman 2001). Those characteristics may have allowed for an improvement in classification accuracy, as compared to SVM and LR. In general, SVM and LR achieved comparable performances. However, although LR performed slightly better than SVM, as shown in the rank sums row, there is no significant difference between their results.

Table 8 Accuracies (%) achieved by evaluating the meta-features using SVM, LR, and RF classifiers, respectively

Categories of meta-level features. The second part of the experiments reported in this section consists of determining the most predictive categories of meta-features, following the categorization proposed in Sect. 4.

Table 9 presents the accuracies achieved by assessing each category of meta-features using an RF classifier (MICRO, POS, SUR, EMO, and LEX columns), and their comparison with the results achieved by using the set of all meta-features (ALL column). We used an RF classifier since it achieved significantly better results than SVM and LR in the previous experiment. The best overall results are in boldface type, and the best results among the categories are underlined. The values immediately below each category name refer to the number of features in that particular category.

Table 9 Accuracies (%) achieved by evaluating each category of meta-features using a RF classifier

As we can see in Table 9, the category Lexicon-based (LEX column) achieved the best results among all the categories, with the highest number of wins in 21 out of the 22 datasets. In the overall evaluation, this category outperformed the set of all meta-features in three datasets (sarcasm, hobbit, and STS-gold). None of the other categories, namely Microblog (MICRO column), Part-of-speech (POS column), Surface (SUR column), and Emoticon (EMO column) achieved meaningful results. The Friedman and the Nemenyi tests detected that category Lexicon-based is significantly better than all other categories, while category Part-of-speech is only better than the categories Microblog and Surface.

Although category Emoticon does not seem to be useful in the sentiment classification of tweets, it is worth pointing out that it achieved the best accuracy for the dataset aisopos. Analyzing the tweets from this dataset, we note that about 80% of them contain emoticons. Since the polarity of emoticons is taken into account in this category’s features, the sentiment detection of tweets may benefit from this information. As a matter of fact, when we rank all the meta-features for this dataset with the Information Gain (IG) relevance measure, four out of the top five most relevant features are whether the tweet has negative emoticon, number of negative emoticons, whether the last token is negative emoticon, and whether the tweet has positive emoticon, all of them from category Emoticon.
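As a sketch of this ranking procedure, the snippet below estimates feature relevance with scikit-learn's mutual information utility, used here as a proxy for Information Gain; `X_meta`, `y`, and `feature_names` are placeholders for a dataset's meta-feature matrix, its labels, and the feature names.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

def top_meta_features(X_meta, y, feature_names, k=5):
    ig = mutual_info_classif(X_meta, y, random_state=42)  # IG-style relevance estimate
    ranked = np.argsort(ig)[::-1]                         # most informative features first
    return [(feature_names[i], ig[i]) for i in ranked[:k]]
```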

Still, regarding the rank generated by the IG measure, the meta-features belonging to category Surface appear at the bottom of the rank for most datasets. This is reasonable, taking into account that this category achieved the worst performance across all categories assessed, as shown in Table 9. Interestingly, for datasets irony and OMD, surface features referring to punctuation, such as whether the tweet has exclamation mark, number of exclamation mark, whether the tweet has question mark, and number of question mark are ranked among the top 25 most significant features. This fact is in agreement with recent findings on the irony detection task, which acknowledged that punctuation marks are useful to identify irony, especially in tweets (Farias and Rosso 2017). Also, it is worth mentioning that dataset OMD, whose tweets are related to a political debate, may contain ironic content due to its nature.

In addition to the individual assessment of each category, we also investigate the reverse situation. We analyze how each category of meta-features contributes to the set of all features. Table 10 shows the results of this investigation. The Loss column shows the loss (or gain) in accuracy when one category is removed, as compared to the set of all meta-features (ALL column). As we can see in the #gains and #losses rows, removing one category at a time from the full set of meta-features is not beneficial, especially considering the categories Emoticon and Lexicon-based (losses in 20 and 22 datasets, respectively). In general, all gains achieved by removing the meta-features from some category do not seem to be significant, except for dataset aisopos, whose accuracy increased by up to 1.1% by removing the meta-features from the category Part-of-Speech (ALL−POS column).

Table 10 Accuracies (%) achieved by evaluating different subsets of meta-features

6.2.3 Effectiveness of word embedding-based features

In this section, we present the evaluation of the word embedding-based features. We used the ten different pre-trained embedding models summarized in Table 4, aiming at determining the most discriminative one in distinguishing the sentiment expressed in tweets.

We adopted Weka’s AffectiveTweets package (Bravo-Marquez et al. 2019) for calculating the features from the pre-trained word embeddings. More precisely, for each dataset, we applied the default configuration of the TweetToEmbeddingFeatureVector filter to create a representation for each tweet by aggregating the embedding values of the words. In the default configuration of the filter, the aggregation is done by averaging the word vectors. Also, as a preprocessing step, we replaced URLs with the token “someurl”, user mentions with the token “someuser”, and removed stopwords.
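A minimal sketch of this aggregation step, assuming the pre-trained vectors have been loaded into a plain word-to-vector mapping (e.g., with gensim); skipping out-of-vocabulary words is an assumption about how missing words are handled, and only approximates the filter's default behavior.

```python
# Represent a tweet by the average of the pre-trained vectors of its words.
import numpy as np

def tweet_vector(tokens, embeddings, dim):
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return np.zeros(dim)
    return np.mean(vectors, axis=0)
```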

We evaluate the word embedding representations in two steps. Initially, to determine which classification strategy is the most suitable for this type of feature, we evaluate the predictive performance of SVM, LR, and RF by using the features extracted from each of the ten pre-trained word vectors, one at a time, and for each algorithm. For space reasons, we only report a summary of the results (refer to Online Resource 1 for the detailed evaluation). Then, after determining the best classification strategy, we compare and analyze the predictive power of the features extracted from each pre-trained word vector, to identify the most appropriate one for Twitter sentiment analysis.

Table 11 shows a summary of the results achieved by evaluating each strategy (SVM, LR, and RF columns) on the 22 datasets, and by using as features those calculated from one embedding model at a time (Embedding column). For each scenario assessed, we present the number of wins achieved by each classifier as well as the rank sums, in parenthesis. We also show whether the differences between the results are statistically significant (Friedman and Nemenyi post-hoc test columns).

Table 11 Overview of the results achieved by evaluating SVM, RF, and LR classifiers on the 22 datasets of tweets, and by using as features those calculated from each pre-trained word embedding model

From Table 11, we can observe that LR presented the best results for nine out of the ten pre-trained models tested, while SVM performed slightly better only when using SSWE embeddings. Moreover, LR outperformed SVM with a statistically significant difference in seven out of the nine wins. Conversely, RF did not achieve meaningful results, with the lowest number of wins. Based on this evaluation, we chose LR as the most suitable classifier for the embedding-based features.

Next, we analyze the performance of an LR classifier fed with the embedding-based features from each pre-trained model, so as to identify which one is better-suited for the tweet polarity detection. The results are presented in Table 12. The number of dimensions immediately below each embedding name refers to the number of features calculated from each pre-trained model.

Table 12 Comparison among the results achieved with each pre-trained embedding model by using an LR classifier

As we can see in Table 12, the w2v-Edin model achieved the best performance in eight out of the 22 datasets and was ranked first in the overall evaluation (rank position row). Although this model did not leverage any sentiment information during its construction, as explained by its authors in Bravo-Marquez et al. (2016), its training parameters were optimized for the emotion detection task on tweets, which may have benefited the sentiment classification of tweets. The fastText model achieved the second-best results, followed by GloVe-TWT, DeepMoji, and w2v-GN, respectively.

Among the affective embeddings, the SSWE model featured the worst performance, which is in agreement with other works (Agarwal et al. 2018; Felbo et al. 2017; Xu et al. 2018). Surprisingly, the generic embeddings w2v-Edin, fastText, and GloVe-TWT outperformed all the affective embeddings (DeepMoji, Emo2Vec, SSWE, and EWE). Although unexpected, Agarwal et al. (2018) have reported similar results. One possible reason is the number of words embedded in the models, i.e., the vocabulary size of each pre-trained word vector, as shown in Table 4 (\(|\textit{V}|\) column). Indeed, the vocabulary sizes of the fastText and GloVe-TWT embeddings (1M and 1.2M, respectively) are much larger than those of DeepMoji, EWE, and SSWE (50K, 183K, and 137K, respectively). Although the number of words embedded in the Emo2Vec model is as large as in the GloVe-TWT one (1.2M), Emo2Vec may have performed poorly considering that it is weak at capturing the syntactic and semantic meaning of words, as reported in Agarwal et al. (2018).

Table 13 presents a coverage analysis of the pre-trained models for the five best-ranked embeddings (w2v-Edin, fastText, GloVe-TWT, DeepMoji, and w2v-GN, respectively). More specifically, for each dataset, we show the fraction of words in the dataset that appear in a given pre-trained model. The information below each model name refers to its vocabulary size. We also show, in parenthesis, the rank assigned to each model. We can observe that the w2v-Edin model, which achieved the best overall accuracies, has the highest coverage for all datasets, except for Semeval13. Also, fastText and GloVe-TWT, whose vocabulary sizes are much larger than DeepMoji’s, have the second and third highest coverage, followed by DeepMoji. The w2v-GN model presents the lowest coverage, even though it has the largest vocabulary size (3M). Since this model was trained on a corpus of Google news articles, its predictive power might not have generalized well to short, noisy texts such as tweets.
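A small sketch of this coverage computation; measuring coverage over distinct (lowercased) word types rather than tokens is an assumption, since the exact counting unit is not detailed here.

```python
# Percentage of distinct words in a dataset that appear in a model's vocabulary.
def coverage(dataset_tokens, model_vocab):
    words = {tok.lower() for tweet in dataset_tokens for tok in tweet}
    return 100.0 * len(words & model_vocab) / len(words) if words else 0.0

# coverage([["great", "phone"], ["worst", "service"]], {"great", "phone", "service"})
# -> 75.0
```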

Lastly, the Friedman test detected a significant difference between the results. The Nemenyi test showed that the accuracies achieved by the w2v-Edin embeddings are significantly better than those by w2v-GN, GloVe-WP, EWE, w2v-Araque, SSWE, and Emo2Vec. Furthermore, fastText, GloVe-TWT, and DeepMoji results are significantly better than those of w2v-Araque’s, which presented the worst overall performance.

Table 13 Coverage analysis (%) of the pre-trained word vectors vocabulary for the five best ranked embeddings

6.2.4 Overall analysis of features

In previous Sects. 6.2.1, 6.2.2, and 6.2.3, we have identified the best classifiers for each group of features proposed in state-of-the-art research on Twitter sentiment analysis, i.e., n-grams, meta-features, and word embedding-based features. Here, we present an overall analysis of these different feature sets. More specifically, we aim at effectively responding to research question RQ1 (“Which group of features is the most effective in Twitter sentiment analysis?”), by performing a comparison between the following classifiers: RF with meta-features, SVM with n-grams, and LR with embedding-based features from the w2v-Edin model.

Table 14 presents the comparison made between the aforementioned classifiers (meta-features, n-grams, and w2v-Edin columns). We can see that the RF classifier fed with meta-features achieved the highest accuracies in 13 out of the 22 datasets, followed by the embedding-based representation provided by the w2v-Edin word vectors. In general, the n-gram features achieved the worst predictive performance. The Friedman test followed by the Nemenyi post-hoc test detected a significant difference between the results of the meta-features and n-grams classifiers, with the Nemenyi test showing that the accuracies achieved with meta-features are significantly better than those achieved with n-grams.

Table 14 Comparison among the best classifiers for each group of features

Note that the n-gram features outperformed meta-features and w2v-Edin only for datasets SemEval-Fig, hobbit, and HCR. The tweets from SemEval-Fig and HCR are considered to belong to challenging domains, namely metaphorical language and health campaigns, respectively. For that reason, the n-grams may have succeeded in capturing more context from the specific language used in these datasets. Indeed, by analyzing the most relevant amongst all features for dataset HCR, we observed that the unigram “#tcot”, which means “top conservatives on Twitter”, appears at the top of the ranking as the most important feature. Since this term is very context-sensitive, the n-gram classifier may have benefited from this particular information.

Regarding meta-features, we can observe that applying them on tweets from datasets irony and sarcasm led to a significant gain in accuracy as compared to n-grams and embedding-based features. As previously mentioned, ironic and sarcastic tweets usually contain signals, such as punctuation marks, that may help determine the sentiment expressed through them.

It is worth mentioning that the number of meta-features is much smaller than the number of n-grams. As shown in Table 7 (#features column), the number of n-grams varies from 1.8K to 252.1K (datasets irony and SemEval16, respectively), while an increased predictive performance was achieved by using only a small set of 130 meta-features. Similarly, the number of meta-features is smaller than the number of features extracted from the w2v-Edin pre-trained model, i.e., 400 features, as shown in Table 4 (\(|\textit{D}|\) column).

Another advantage of meta-features over word embedding representations is the fact that meta-features can be easily interpreted. For example, by applying relevance measures, such as IG, to determine the most predictive meta-features, we are able to determine the type of information that may be useful in distinguishing the positive tweets from the negative ones, given some specific domain. On the other hand, the features calculated from pre-trained embedding models, i.e., real values corresponding to distinct aspects of words, are complex to explain.

Also, unlike meta-features, pre-trained embedding models are language-dependent. In general, the word vectors are trained on huge text corpora containing documents from the same language; otherwise, it is not possible to capture semantic and syntactic relationships between words. Meta-features, in turn, can be employed regardless of language limitations, with the exception of lexicon-based meta-features, which rely on sentiment lexicons built for specific languages. Nevertheless, since lexicons generated for other languages can be used whenever available, this is not necessarily a limitation of meta-features. In fact, Sousa et al. (2018) successfully used a subset of meta-features, which we examined and categorized in our previous work (Carvalho and Plastino 2016), to identify tweets in Portuguese that are relevant for preventing mosquito-borne diseases, such as the Zika virus. Therefore, meta-features are not only language-independent but can also be easily employed in cross-domain problems.

6.3 Answering research questions RQ2 and RQ3

In this section, we explore the combinations of the feature sets evaluated in the previous sections. Specifically, we address research questions RQ2 and RQ3, as follows:

RQ2. Can the concatenation of different types of features proposed in the literature boost classification performance in Twitter sentiment analysis?

After evaluating the individual performance of each feature set (Sect. 6.2), we examine how they complement one another in the polarity detection task on Twitter by using a simple feature concatenation approach. We address this question in Sect. 6.3.1.

RQ3. Can the sentiment classification of tweets benefit from the use of ensemble classification strategies having the best classifiers for each type of feature as base learners?

In Sect. 6.3.2, we use and evaluate the best individual classifiers as base learners of two distinct ensemble learning strategies: majority voting, implemented by averaging the probability distributions of the base learners, and stacking (Wolpert 1992), a meta-learning technique that uses the probability distributions of the base learners as meta-features for a new learning problem.

6.3.1 Combining features through feature concatenation (RQ2)

Here, we present the results achieved by combining n-grams, meta-features, and embedding-based features through feature concatenation, i.e., by concatenating the feature sets into a single feature vector. We evaluated each possible combination of feature sets (n-grams, meta-features, and word embeddings) with one classification algorithm at a time (SVM, RF, and LR). Due to space constraints, we only show the best results for each combined feature vector, as presented in Table 15 (feature concatenation column).
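As a sketch of this concatenation step, the snippet below stacks a sparse n-gram matrix with dense meta-feature and embedding matrices column-wise using SciPy; the variable names are placeholders.

```python
# Concatenate the three feature blocks into a single feature space.
from scipy.sparse import hstack, csr_matrix

def concatenate_features(X_ngrams, X_meta, X_embed):
    # Wrap the dense blocks so that the concatenated matrix remains sparse.
    return hstack([X_ngrams, csr_matrix(X_meta), csr_matrix(X_embed)]).tocsr()
```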

Table 15 Accuracies (%) achieved by combining different feature sets through feature concatenation

The best results when combining n-grams with any other feature set, that is, meta-features or embedding-based features, were achieved by using SVM (fifth and seventh columns). This may be due to the large number of n-grams, since SVM performed best under the individual evaluation of the n-gram features, as seen in Table 7. Regarding the combination of meta-features with word embedding features (sixth column), LR outperformed SVM and RF. Finally, the combination of all feature sets into a single feature vector benefited most from using LR (last column). Indeed, LR achieved performance comparable to SVM for the n-gram features (Table 7) and the second-best results in the meta-features evaluation (Table 8). In both assessments, LR was ranked as the second-best classifier. Therefore, combining all feature vectors may have allowed LR to surpass SVM and RF.

We can observe that the sentiment classification of tweets benefits from the concatenation of all feature sets, i.e., meta-features + n-grams + w2v-Edin (last column), achieving the best overall results in 15 out of the 22 datasets. The second best results were achieved by meta-features + n-grams (fifth column), followed by meta-features + w2v-Edin (sixth column). The least accurate results were yielded by n-grams + w2v-Edin (seventh column).

Interestingly, concerning the combinations of pairs of feature sets (fifth, sixth and seventh columns), only the concatenation provided by meta-features + n-grams performed statistically better than all individual classifiers (MF, n-grams, and w2v-Edin columns). This may be evidence that combining meta-features with n-grams is beneficial in detecting the sentiment expressed in tweets.

It is also noteworthy that the combination of all feature sets is significantly better than all individual classifiers and also than n-grams + w2v-Edin. On the other hand, the results achieved by concatenating all feature sets are not significantly better than those of the meta-features + n-grams and meta-features + w2v-Edin classifiers. Moreover, the meta-features classifier (MF column) achieved an overall performance comparable to that of n-grams + w2v-Edin, as we can see in the rank sums (overall) row (100.0 and 94.0, respectively). These results emphasize the predictive power of meta-features and their importance in the context of Twitter sentiment analysis.

6.3.2 Combining features via ensemble learning (RQ3)

In this section, we present the predictive performance achieved by combining all individual classifiers as base learners of two distinct ensemble strategies. More precisely, we use the best individual classifiers for each type of feature, i.e., RF with meta-features, SVM with n-grams, and LR with embedding-based features from the w2v-Edin model, as base learners of two ensemble classification strategies, namely majority vote and stacking (Wolpert 1992).

An ensemble of classifiers combines the decisions of a set of classifiers according to some rule. In the majority voting ensemble, we used the average of probabilities combination rule, which averages the posterior probability distributions for each class value. Thus, the class value with the highest averaged probability is chosen as the final prediction. Stacking, or stacked generalization (Wolpert 1992), is an ensemble technique that uses the predictions made by the base learners as inputs for a meta-learning task. First, the base classifiers, also referred to as level-0 models, are trained on the original feature space, or level-0 data, and their predictions are used as new data (level-1 data) for another learning problem. Next, in the second stage, a meta-learning algorithm, or level-1 generalizer, is trained on the level-1 data to solve this new learning problem (Ting and Witten 1999). In this work, we used LR as the level-1 generalizer.
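A rough scikit-learn analogue of these two ensemble strategies (the experiments themselves were run in Weka): soft voting, which averages the class probability distributions, and stacking with LR as the level-1 generalizer. The base learners are assumed to be pipelines pairing each feature extractor with its classifier, and the names used are placeholders.

```python
from sklearn.ensemble import StackingClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression

def build_ensembles(base_learners):
    # base_learners: list of (name, estimator) pairs, e.g.
    # [("meta_rf", rf_pipeline), ("ngram_svm", svm_pipeline), ("w2v_lr", lr_pipeline)];
    # each estimator must expose predict_proba (e.g., SVC(probability=True)).
    avg_prob = VotingClassifier(estimators=base_learners, voting="soft")
    stacking = StackingClassifier(
        estimators=base_learners,
        final_estimator=LogisticRegression(max_iter=1000),
        stack_method="predict_proba",  # level-1 data = class probability distributions
    )
    return avg_prob, stacking
```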

Table 16 summarizes the results. As we can see, both ensemble strategies (ensemble column) effectively outperformed all individual classifiers, except for datasets irony and sarcasm. This may be due to the poor performance of the n-gram features on both datasets (66.2% and 50.7%, respectively). For dataset sarcasm, not only the n-grams performed poorly, but also the embedding-based features (56.3%). In general, the stacking technique (stacking column) achieved the best results in 13 out of the 22 datasets. Notwithstanding, the ensemble by the average of probabilities rule (avg. prob. column) achieved a performance comparable to that of stacking. Regarding the Friedman and Nemenyi tests, both ensemble strategies are statistically better than all individual classifiers, though there is no significant difference between them.

Table 16 Accuracies (%) achieved by combining different feature sets as base classifiers of an ensemble strategy

As stated by Dietterich (2000), for the predictive performance of an ensemble of classifiers to be better than its base learners, they must be accurate and diverse, meaning that they should make good but different decisions. In this context, we present an analysis of the correlation between the predictions made by each classifier comprising the ensembles. Precisely, we computed the Pearson correlation coefficient between the outputs (predictions) of each pair of classifiers. The Pearson coefficient ranges from −1 to +1, where a value less (greater) than zero indicates a negative (positive) association between the outputs. In that case, for any pair of classifiers, the closer to zero the Pearson coefficient, the more different the decisions made by them. Table 17 shows the Pearson correlation matrices for selected datasets considering the predictions made by each classifier.
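A minimal sketch of this diversity analysis: given each classifier's predicted labels on a dataset, the pairwise Pearson coefficients can be read off a correlation matrix; the dictionary layout is an assumption.

```python
# Pairwise Pearson correlations between the predictions of the base classifiers.
import numpy as np

def prediction_correlations(predictions):
    # predictions: dict mapping classifier name -> array of predicted labels (0/1)
    names = list(predictions)
    matrix = np.corrcoef([np.asarray(predictions[n], dtype=float) for n in names])
    return names, matrix  # matrix[i, j] is the coefficient between classifiers i and j
```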

Table 17 Pearson correlation matrices regarding the predictions made on distinct datasets by using the meta-features (RF), n-grams (SVM), and w2v-Edin (LR) classifiers

We can note that, in general, the predictions made by the base classifiers are sufficiently uncorrelated, leading to improved predictive performance of the ensemble strategies for most datasets. For example, analyzing the correlation matrix for dataset HCR, we see that the correlations between the predictions of each pair of classifiers are sufficiently low. Besides, as shown in Table 16, each classifier has achieved competitive accuracies for this dataset (77.5, 79.1, and 78.5%). As a result, the predictive performance achieved by ensembling them (i.e., 81.5%) effectively surpassed the best individual classifier by up to 2.4%. Similarly, for dataset OMD, we can observe that the low correlations between the predictions made by the base learners, along with their fair accuracies (79.8, 81.2, and 83.3%), may lead to improved ensemble performance, i.e., 86.5% for the average probability ensemble (avg. prob.), and 85.9% for stacking. As compared to the best base model (83.3%), this represents a gain in accuracy of up to 3.2 and 2.6%, respectively. We can see an analogous effect on datasets person, iphone6, SemEval18, and SemEval13.

Interestingly, for dataset STS-gold, it is evident that although the correlation coefficients between meta-features and n-grams, and between n-grams and w2v-Edin base classifiers are moderately low (0.6193 and 0.5638, respectively), the n-gram classifier does not seem to be as accurate as the meta-features one. More specifically, while the meta-features classifier achieved a classification accuracy of 93.1%, the n-gram classifier achieved 84.0% only. It is possible that, for this reason, the ensemble strategies did not achieve meaningful results for dataset STS-gold. As can be observed in Table 16, the accuracy achieved by the best ensemble classifier is 93.2% (stacking), which represents a gain of only 0.1% over the best base classifier (meta-features).

For dataset hobbit, even though all individual classifiers have achieved very high and competitive accuracies (91.6, 92.9, and 92.5%), the correlation coefficients between any pair of classifiers are greater than 0.8, which means that their predictions are very similar to one another. Hence, there is not enough diversity among the base classifiers in the ensembles. This may have caused the ensemble strategies to achieve performances merely comparable to the best individual base learner, i.e., 93.1% (avg. prob.) and 92.7% (stacking), against 92.9% (n-gram classifier), respectively.

In order to illustrate how diversity is relevant when choosing the base learners of an ensemble model, we show that the predictive performance of the ensemble can be improved if we select different base classifiers by leveraging the Pearson coefficients between their predictions. For example, regarding dataset hobbit, Table 18 shows that the predictive performance of the ensemble is improved by up to 1.0%, by switching from the w2v-Edin classifier to the fastText one. Indeed, as shown in Table 19, analyzing the correlation coefficients among the base classifiers of this new ensemble model (0.8580, 0.7464, and 0.7661), we note that they are lower than the coefficients of the original base models, i.e., the meta-features, n-grams, and w2v-Edin classifiers (0.8580, 0.8127, and 0.8796). However, though the fastText classifier is less accurate than the w2v-Edin one, it is still a good choice as compared to a simple classifier that classifies each instance into the majority class (majority class column).

Table 18 Accuracies (%) achieved by combining different feature sets as base classifiers of an ensemble strategy
Table 19 Pearson correlation matrix regarding the predictions made on the dataset hobbit by using the meta-features (RF), n-grams (SVM), and fastText (LR) classifiers

The result of the previous experiment gives us evidence that selecting the most accurate classifiers as members of an ensemble model does not ensure higher predictive performance. To verify this hypothesis, we performed an experiment to test whether the predictive performance of an ensemble improves when the least accurate base classifier is replaced with a more accurate counterpart. The result is presented in Table 20.

As we can observe in Table 20, regarding dataset SemEval18, when the least accurate classifier for this dataset (the n-gram classifier) is switched to the fastText one, which is the second-best embedding-based classifier, the classification accuracy of the stacking ensemble drops from 87.3 to 86.6%. Analyzing the correlation coefficients across the base classifiers of this new ensemble, as shown in Table 21, we can see that their predictions are much more correlated than the predictions of the base classifiers from the original ensemble (i.e., 0.5725, 0.6847, 0.5371). Lastly, although the n-gram classifier is less accurate than the fastText one, it can still be regarded as a good and accurate classifier as compared to the majority class one (majority class column).

Table 20 Accuracies (%) achieved by combining different feature sets as base classifiers of an ensemble strategy
Table 21 Pearson Correlation matrix regarding the predictions made on the dataset SemEval18 by using the meta-features (RF), w2v-Edin (LR), and fastText (LR) classifiers

6.3.3 Comparing combination methods

In this section, we compare the combination methods exploited in this study, namely feature concatenation and classifier ensembles.

Table 22 presents the comparison between the feature concatenation and stacking ensemble methods, both of which performed better than the avg. prob. ensemble. We can see that the ensemble learning approach outperformed the concatenation of feature vectors with an LR classifier in 12 out of the 22 datasets. Nevertheless, the feature concatenation technique achieved a comparable performance to the stacking ensemble.

Table 22 Comparison among the results achieved by evaluating distinct methods of feature combination (feature concatenation and ensemble learning)

Disregarding datasets irony and sarcasm, for which the best performances were achieved by using the RF with meta-features classifier (classification accuracies of 81.5 and 80.3%, respectively), it appears as though smaller datasets, such as aisopos, SemEval-Fig, sentiment140, person, and hobbit, have benefited from the feature concatenation approach. On the other hand, larger datasets, such as STS-gold, Target-dependent, Vader, SemEval13, SemEval17, and SemEval16, achieved higher predictive performances by using the ensemble learning technique.

With regard to the differences between the results achieved by the combination methods, the Friedman test did not detect any statistically significant difference.

7 Conclusions and future work

In this article, we presented a thoughtful evaluation of the distinct types of features employed in state-of-the-art research on Twitter sentiment analysis. The rich feature space exploited in this work includes features extracted from the basic n-gram language model to more sophisticated features such as meta-features and word embeddings. Besides the individual evaluation of each feature set, we also investigated the effect of combining them through feature concatenation and via ensemble learning strategies, considering that features from different sets can complement one another.

The meta-features examined in this work were collected from a wide range of studies in the literature of Twitter sentiment analysis. Although these studies have proposed different meta-features, we filled the existing gap by aggregating and evaluating the predictive power of the meta-features proposed in the literature over the years. Further, as an extension of our previous study (Carvalho and Plastino 2016), we categorized this rich set of meta-features to examine the effectiveness of different types of meta-features in distinguishing positive tweets from negative ones. Moreover, given the vast number of publicly available pre-trained word embeddings, we conducted experiments to identify the most suitable ones for detecting the sentiment expressed in tweets.

Based on the experimental results of this study, we can draw the following conclusions:

  • Regarding the categories of meta-features, we observed that the features from the Lexicon-based category are the most relevant ones. In this work, we employed seven different lexicons and word lists. Since each lexicon comprises different words, we believe that they can effectively complement one another in representing the tweets. Nevertheless, we encourage the use of the set of all meta-features, as it is able to achieve even better results.

  • For each feature set studied in this work, we could see that an appropriate choice of a supervised learning algorithm can boost classification effectiveness on a large collection of 22 datasets of tweets. Specifically, for most situations, we showed that n-grams, meta-features, and embedding-based features could achieve significantly better results when fed to SVM, RF, and LR, respectively.

  • We showed that the rich set of meta-features exploited in this study outperformed n-grams and word embedding-based features. Nevertheless, the sentiment classification of tweets benefits from the combination of all feature sets through feature concatenation. Despite that, we could see that only the combination provided by meta-features + n-grams performed statistically better than all individual classifiers. For that reason, we believe that meta-features and n-grams can effectively complement each other in the sentiment classification of tweets.

  • Finally, we showed that the combination of n-grams, meta-features, and word embedding-based features via an ensemble technique can achieve better overall performance than a simple feature concatenation approach. Furthermore, we could see that the classification effectiveness of an ensemble of classifiers can be improved provided that the diversity among its base classifiers is leveraged.

For future work, we plan to investigate more specific types of embedding models, such as Tweet2Vec (Vosoughi et al. 2016), which is a method for generating a general-purpose representation of tweets, using a character-level neural architecture. We also intend to reproduce the experiments conducted in this work on different types of text data, such as Facebook and YouTube comments, in order to attest whether the results would carry over to those domains.