1 Introduction

Recent developments in smart technologies based on mobile communication have generated massive amounts of data. Handling this volume requires automatic systems that allow us to sort, classify, and select information. Social media has become a powerful communication channel that is now exploited for many purposes. Its appeal is that it enables businesses to hold real-time conversations directly with their customers at very low cost.

User-generated data on Twitter is a gold mine for analyzing a wide variety of aspects, such as traces of individual behavior or how a brand is perceived. Sentiment Analysis (SA) is a very popular natural language processing task whose main aim is to determine the subjective component of a given piece of text. The important role of SA has been recognized beyond computer science. It has emerged as a trending topic in industry due to the wide range of applications of its results: the outcomes of SA can be used for evaluating customer service, gathering consumer feedback, and developing marketing campaigns, among others [1].

Identifying the opinions expressed by airline customers has been recognized as a powerful tool that these corporations can use to identify opportunities for improvement. The task has been investigated from different perspectives. In [2], the authors used a soft voting classifier that combines logistic regression and stochastic gradient descent with both traditional weighting schemes and pre-trained word embeddings used in text classification in order to categorize tweets in the airline domain. In [3], feature selection and class imbalance techniques were used to classify travelers’ feedback comments about airlines. Particular aspects related to airlines, such as punctuality, food and beverage quality, and ticket prices, among others, were investigated in [4]. In [5], the authors analyzed the location of a set of tweets to determine how this aspect can help airline companies.

In this paper, we propose to exploit not only the terms contained in the tweets but also a wide range of lexical resources that capture different kinds of information useful for performing sentiment analysis on tweets expressing opinions about airline companies. Aiming to propose a set of features that captures the sentiment in such texts, we performed various experiments to identify potentially useful aspects coming from different information sources. Our intuition is that, with a small number of features, sentiment analysis of these tweets can achieve a performance comparable to that obtained when the whole vocabulary is used. We experimented with a benchmark corpus in this domain, obtaining results that are competitive with the state of the art. Furthermore, we carried out an analysis of this dataset.

The rest of the paper is organized as follows. Section 2 introduces the methodology we propose for classifying tweets in the airline domain by exploiting both the vocabulary in the data and lexicon-based information. Section 3 describes the experiments carried out and the results obtained. Finally, conclusions and directions for future work are presented in Section 4.

2 Proposed Methodology

We are interested in performing sentiment analysis on comments about airlines. To do so, we propose to exploit a wide range of lexical resources reflecting different aspects, as well as traditional and word-embedding representations. The SA task was cast as a text classification problem using the following machine learning algorithms: Naive Bayes (NB), Decision Tree (DT), K-Nearest Neighbors (KNN), Support Vector Machine (SVM), and Logistic Regression (LR). We also experimented with a majority voting ensemble of classifiers. For text representation, we used the methodologies described below.
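As an illustration of the majority voting ensemble mentioned above, the following minimal sketch combines the label predictions of several classifiers. The function name `majority_vote`, the tie-breaking rule (the first classifier in the list wins), and the toy predictions are assumptions made for this example; they are not taken from the experimental setup.

```python
from collections import Counter

def majority_vote(predictions_per_classifier):
    """Combine per-classifier label lists into one label per instance.

    predictions_per_classifier: list of equal-length lists, one per classifier.
    Ties are broken in favour of the classifier listed first (an assumption).
    """
    n_instances = len(predictions_per_classifier[0])
    combined = []
    for i in range(n_instances):
        votes = [preds[i] for preds in predictions_per_classifier]
        counts = Counter(votes)
        best = max(counts.values())
        # First label, in classifier order, that reaches the top vote count.
        combined.append(next(v for v in votes if counts[v] == best))
    return combined

# Toy predictions from three hypothetical classifiers on three tweets.
nb = ["negative", "neutral", "positive"]
lr = ["negative", "negative", "positive"]
svm = ["positive", "negative", "neutral"]
ensemble = majority_vote([nb, lr, svm])
```

Each instance receives the label chosen by most classifiers, so a single classifier's error is outvoted whenever the other two agree.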

2.1 Vocabulary-Based Experiments

We carried out a set of experiments considering only the content of the tweets for determining the polarity of each instance. Five different configurations with term frequency as the weighting scheme were used:

  • BOW_0: text without any kind of pre-processing.

  • BOW_1: text tokenized, lowercased, and with stop-words removed, discarding terms with a frequency lower than 3.

  • BOW_2: text tokenized, lowercased, with stop-words removed, and with hashtags, mentions, and URLs replaced by corresponding labels, discarding terms with a frequency lower than 3.

  • BOW_3: text tokenized, lowercased, and with stop-words removed, considering only terms with an Information Gain Rate (IGR) greater than 0.001.

  • BOW_4: text tokenized, lowercased, with stop-words removed, and with hashtags, mentions, and URLs replaced by corresponding labels, considering only terms with an IGR greater than 0.001.

We also used pre-trained word embeddings to represent each tweet as the average of the vectors of its words. We took advantage of three well-known pre-trained word embedding models: word2vec, GloVe, and FastText.
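The pre-processing behind a configuration such as BOW_2 and the averaged embedding representation can be sketched as follows. This is only an illustrative sketch: the stop-word list, the regular expressions, the label names (`URL`, `MENTION`, `HASHTAG`), the decision to skip out-of-vocabulary words when averaging, and the two-dimensional toy embeddings are all assumptions, since the text does not specify them.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (assumption; the paper does not name one).
STOP_WORDS = {"the", "a", "an", "is", "to", "and", "my", "for", "of", "was"}

def normalize_tweet(text):
    """Lowercase, replace URLs/mentions/hashtags with labels, tokenize, drop stop-words."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " URL ", text)
    text = re.sub(r"@\w+", " MENTION ", text)
    text = re.sub(r"#\w+", " HASHTAG ", text)
    tokens = re.findall(r"[A-Za-z']+", text)
    return [t for t in tokens if t not in STOP_WORDS]

def build_vocabulary(tokenized_tweets, min_freq=3):
    """Keep terms whose corpus frequency is at least min_freq."""
    counts = Counter(t for tokens in tokenized_tweets for t in tokens)
    return sorted(term for term, c in counts.items() if c >= min_freq)

def tf_vector(tokens, vocabulary):
    """Raw term-frequency vector over the vocabulary."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]

def average_embedding(tokens, embeddings, dim):
    """Mean of the vectors of the tokens found in `embeddings`; zeros if none are found."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return [0.0] * dim
    return [sum(v[k] for v in vectors) / len(vectors) for k in range(dim)]

tweets = [
    "@airline my flight was delayed again http://t.co/x #fail",
    "@airline delayed flight, delayed bags",
    "Thanks @airline for the smooth flight!",
]
tokenized = [normalize_tweet(t) for t in tweets]
vocab = build_vocabulary(tokenized, min_freq=3)
bow = tf_vector(tokenized[1], vocab)

# Hypothetical 2-dimensional embeddings, standing in for a pre-trained model.
toy_embeddings = {"delayed": [-1.0, 0.0], "flight": [0.0, 1.0]}
avg = average_embedding(tokenized[1], toy_embeddings, dim=2)
```

With real pre-trained models the same averaging would simply be applied to higher-dimensional vectors.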

2.2 Lexicon-Based Experiments

The important role of lexical resources in SA has been widely recognized, since they allow us to capture different nuances of affect ranging from sentiment polarity to finer-grained emotions [6]. The most basic approach in such resources involves creating lists of terms associated with two polarities, positive and negative; in other resources, a word is labeled with a score reflecting its value with regard to a particular aspect. We are interested in evaluating the performance of such resources for determining the polarity of tweets in the airline domain. To do so, we selected a set of 14 lexical resources comprising different facets of affect. Two main groups can be distinguished: those including information strongly related to sentiment and emotions, and those where psycho-linguistic information is also considered.

Sentiment and Emotions Resources.

The first set of lexical resources can be further divided into two subgroups: (i) the SA group, which includes AFINN [7], Hu&Liu [8], SentiWordNet (SWN) [1], EffectWordNet [9], Semantic Orientation [10], and the Subjectivity lexicon [11]. For each of these resources we calculated three scores: positive (denoted as pos_resource), negative (denoted as neg_resource), and the sum of both (denoted as tot_resource); and (ii) the EMOT group, which is composed of two subgroups divided according to the main theories of emotions: the categorical model (emotCat), with EmoLex [12], EmoSenticNet [13], and SentiSense [14]; and the dimensional model (emotDim), with SenticNet [15], ANEW (Affective Norms for English Words) [16], and the Dictionary of Affect in Language (DAL) [17], which contains three dimensions, namely Pleasantness, Attention, and Imagery. For emotCat, we calculated the frequency of words belonging to a given emotion in each tweet, while for emotDim we used the sum of each dimension over the words in each instance. It is important to mention that EmoLex and SenticNet also provide elements that were included in the SA group: a set of positive and negative words, and an equation (denoted as SN-eq) defined in [18] for calculating the polarity of a given text in terms of affective dimensions, respectively. We also considered the positive and negative categories included in the two psycho-linguistic resources introduced in the following section. In the end, we have a vector composed of 60 features.
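The three kinds of lexicon features described above (polarity scores, emotion-category frequencies, and dimension sums) can be sketched as follows. The toy lexicons, their entries and values, and the convention that the negative score is a negative number (so that the total is the signed sum) are assumptions for illustration only; the actual resources are much larger and have their own formats.

```python
# Hypothetical toy lexicons; real resources such as AFINN or EmoLex are far larger.
POLARITY_LEXICON = {"great": 3.0, "thanks": 2.0, "delayed": -2.0, "lost": -3.0}
EMOTION_LEXICON = {"great": {"joy"}, "lost": {"sadness", "anger"}, "delayed": {"anger"}}
DIMENSION_LEXICON = {"great": {"pleasantness": 0.9}, "lost": {"pleasantness": -0.7}}

def polarity_scores(tokens, lexicon):
    """pos_resource, neg_resource and tot_resource for one tweet."""
    pos = sum(lexicon[t] for t in tokens if lexicon.get(t, 0) > 0)
    neg = sum(lexicon[t] for t in tokens if lexicon.get(t, 0) < 0)
    return {"pos": pos, "neg": neg, "tot": pos + neg}

def emotion_counts(tokens, lexicon, emotions):
    """emotCat: number of words in the tweet associated with each emotion."""
    counts = {e: 0 for e in emotions}
    for t in tokens:
        for e in lexicon.get(t, ()):
            if e in counts:
                counts[e] += 1
    return counts

def dimension_sums(tokens, lexicon, dimensions):
    """emotDim: sum of each affective dimension over the words in the tweet."""
    sums = {d: 0.0 for d in dimensions}
    for t in tokens:
        for d, value in lexicon.get(t, {}).items():
            if d in sums:
                sums[d] += value
    return sums

tokens = ["great", "crew", "but", "lost", "bag", "delayed"]
sa = polarity_scores(tokens, POLARITY_LEXICON)
cat = emotion_counts(tokens, EMOTION_LEXICON, ["joy", "anger", "sadness"])
dim = dimension_sums(tokens, DIMENSION_LEXICON, ["pleasantness"])
```

Concatenating such per-resource values for every lexicon yields a compact feature vector per tweet, in the spirit of the 60-feature vector described above.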

Psycho-Linguistics Resources.

The second subset of lexical resources includes two dictionaries where sets of words are associated with different aspects reflecting the use of language from a psycho-linguistic perspective. Both of them have been successfully applied in different natural language processing tasks such as Author Profiling [19] and Emotion Identification [20]. The Linguistic Inquiry and Word Count (henceforth LIWC) [21] is a dictionary containing 64 categories, such as social and affective processes and personal concerns, as well as grammatical categories (verbs, nouns, etc.). The General Inquirer (henceforth GI) [22] is composed of 182 categories. It was developed with the aim of analyzing different aspects of language such as cognition, emotions, and interpersonal relations, among others. We used the category-based representation as defined in [23]. Each resource was exploited individually, and we also combined both of them into a single one (GI+LIWC). We therefore experimented with vectors of 64, 182, and 246 features for LIWC, GI, and GI+LIWC, respectively.

3 Results

3.1 Corpus Description

We experimented with the freely available Twitter US Airline Sentiment corpus (henceforth TwAS). It is a set of tweets posted in February 2015 regarding some well-known airline companies in the US. TwAS is composed of 14,485 tweets manually labeled according to three categories: positive (2332 instances), negative (9088 instances), and neutral (3065 instances). The class distribution in TwAS is markedly imbalanced towards the negative category. Besides the overall sentiment annotations, the tweets included in TwAS carry other types of labels, such as the target airline and the username of the author of each tweet.
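To make the imbalance concrete, the class shares can be computed directly from the counts given above. The majority-class baseline in this sketch (always predicting the most frequent label) is added here only as a reference point; it is not a result reported in the experiments.

```python
# Class counts as reported for the TwAS corpus.
counts = {"positive": 2332, "negative": 9088, "neutral": 3065}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

# Accuracy of a trivial classifier that always predicts the most frequent class.
majority_label = max(counts, key=counts.get)
majority_baseline_accuracy = shares[majority_label]
```

The negative class accounts for roughly 63% of the corpus, so any accuracy figure should be read against that trivial baseline.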

3.2 Vocabulary-Based Experiments

Figure 1 shows the results obtained in terms of accuracy. As can be observed, the best performance was achieved using BOW_0 with the ensemble of classifiers (denoted as ENS and composed of NB, LR, and SVM). Regarding the word embeddings, FastText shows the highest accuracy.

Fig. 1. Precision of the first set of experiments carried out on the airline corpus.

Since we are interested in determining an optimal set of features for the tweets by using different kinds of lexical resources, we plot a reduced-dimensionality version of the BOW representation of the instances (see Fig. 2) obtained with the t-SNE technique. Samples of the negative class are plotted in red, those of the positive class in green, and those of the neutral class in yellow. The skew towards the negative class is clearly visible. Moreover, there are no salient clusters for any class; instead, a high degree of overlap can be observed among the main components of the three categories of the corpus under study.

Fig. 2. t-SNE dimensionality reduction using the BOW representation (Color figure online)

3.3 Lexicon-Based Experiments

Sentiment and Emotions Resources.

We experimented with each group of lexical resources on its own, and also by combining them into a single one (SA+EMOT). Figure 3 shows the results obtained when the aforementioned resources are exploited. Concerning the Sentiment and Emotions resources, the best performance is achieved when all of them are used. Interestingly, using only the sentiment-related ones leads to just a small decrease in accuracy, confirming their usefulness for characterizing the sentiment of a piece of text. With respect to the emotion resources, the subset based on the categorical model shows a slightly higher performance than the dimensional one; however, it is important to emphasize that both models are composed of only a small number of features. In terms of the psycho-linguistic resources, overall, LIWC performs better than both GI and GI+LIWC. The best result was obtained with LIWC and the ensemble of classifiers.

Fig. 3. Obtained results when the aforementioned resources are exploited

In Figure 4, we plot the t-SNE representation based on lexical resources. More clearly defined clusters can be seen; this is directly related to a better separation between the classes, which is also reflected in the results shown in the graph above.

Fig. 4. Representation of the main components by class using lexical resources

Table 1. Best ranked features for each group

3.4 Selecting the Most Relevant Features

With the aim of generating a representation with a lower dimensionality, we performed an Information Gain analysis over the features obtained from both the lexical and the psycho-linguistic resources. Table 1 shows the best-ranked features for each group. All the scores obtained from the AFINN and Hu&Liu lexicons emerged as very informative. Regarding emotCat, we observe that opposite emotions serve to capture useful information. With respect to both psycho-linguistic resources, the best-ranked dimensions include those with a negative connotation, which may be caused by the skew of the data towards the negative class. We also identified some dimensions reflecting activities in the past, which is in line with the fact that users tend to post about their experiences after traveling.
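The idea behind ranking features by Information Gain can be sketched as below. This is a simplified illustration, not the tool or discretization used in the experiments: each numeric feature is binarized at a fixed threshold (an assumption), and the feature names and toy values are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(feature_values, labels, threshold=0.0):
    """IG of the binary split `feature > threshold` (binarization is an assumption)."""
    left = [y for x, y in zip(feature_values, labels) if x > threshold]
    right = [y for x, y in zip(feature_values, labels) if x <= threshold]
    n = len(labels)
    conditional = sum(len(part) / n * entropy(part) for part in (left, right) if part)
    return entropy(labels) - conditional

labels = ["negative", "negative", "positive", "positive"]
features = {
    "tot_lexicon": [-2.0, -1.0, 1.0, 3.0],   # separates the classes perfectly
    "noise": [1.0, -1.0, 1.0, -1.0],          # unrelated to the class
}
gains = {name: information_gain(values, labels) for name, values in features.items()}
ranked = sorted(gains, key=gains.get, reverse=True)
```

Features whose split removes more uncertainty about the class rank higher, which is the basis for keeping only the top-ranked features.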

Proposed Representations.

We defined three different subsets of features coming from the lexical resources described before. The first one, denoted as subset-1, is composed of the fifteen best-ranked features according to the Information Gain value (IGV) obtained over the whole set of features coming from the Sentiment and Emotions resources. It includes all three scores from Hu&Liu and AFINN, all the dimensions in DAL, the negative and objectivity dimensions from SWN, pos_LIWC and tot_LIWC, and pleasantness and sensitivity from SenticNet. The second one, denoted as subset-2, comprises all the dimensions from GI and LIWC in Table 1. Finally, the last one, denoted as subset-3, includes all the features from SA, emotCat, and emotDim listed in Table 1. In summary, the vector representations are composed of 15, 20, and 24 features for subset-1, subset-2, and subset-3, respectively. In addition to the experiments carried out with each set of features, we decided to combine each of them with the best performing vocabulary-based representations: BOW_4 and FastText.

The results obtained are shown in Fig. 5. Regarding the proposed representations, subset-1 and subset-3 show a higher performance than subset-2. With the ensemble of classifiers, an accuracy of 0.73 is reached. When the proposed representations are combined with BOW_4, the best performance is also achieved with the ensemble of classifiers, with any of the subsets of features. With respect to the word-embedding representation, the highest accuracy obtained is 0.79, with both the ensemble and SVM, when it is merged with subset-1 and subset-2. It is important to highlight that this result is the closest to the baseline, i.e., BOW_0, with the important difference that instead of more than 10,000 features, only 315 and 320 were used. Figure 6 shows the dimensionality reduction of each of the proposed representations. In this case, we can observe the effect of lexical resources on the definition of clusters by graphically representing the main components of each class.

Fig. 5. Results obtained when combining each subset with the best performing representation based on vocabulary

The TwAS dataset has been used before for evaluating sentiment analysis methods. In [2], the highest accuracy reported was 0.792 when the proposed methodology was used with a TF-IDF scheme, compared with 0.783 using word2vec pre-trained embeddings and 0.686 using an LSTM classifier. As can be observed, the results obtained with the different combinations we propose are very competitive, even against more sophisticated techniques.

Fig. 6. Clusters defined by each subset of features

All the results presented so far are in terms of accuracy. However, we decided to also include the outcomes in terms of F-score for each class. We selected the seven best performing representations in the experiments carried out: a: BOW_0, b: BOW_4, c: FastText, d: SA+EMOT, e: LIWC, f: subset-1, and g: FastText+subset-1. Figure 7 shows the results obtained. As can be observed, the relative performance among the classes is similar across the different configurations. There is a significant drop in performance for the neutral class, while the F-score for the negative one remains almost the same. There is a slight improvement in F-score for the positive and negative classes when FastText+subset-1 is used.
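For reference, the per-class F-score (the harmonic mean of precision and recall, computed separately for each label) can be obtained as in the following sketch. The function name and the toy gold and predicted labels are illustrative assumptions and do not correspond to the reported results.

```python
def f_score_per_class(gold, predicted, labels):
    """F1 for each label, treating that label as the positive class."""
    scores = {}
    for label in labels:
        tp = sum(1 for g, p in zip(gold, predicted) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, predicted) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, predicted) if g == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        scores[label] = (2 * precision * recall / (precision + recall)
                         if precision + recall else 0.0)
    return scores

# Toy example: one neutral tweet is misclassified as negative.
gold = ["negative", "negative", "negative", "neutral", "neutral", "positive"]
pred = ["negative", "negative", "negative", "negative", "neutral", "positive"]
scores = f_score_per_class(gold, pred, ["positive", "negative", "neutral"])
```

In this toy case a single error on a small class lowers its F-score much more than that of the large class, which is the kind of effect that accuracy alone hides on an imbalanced corpus.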

Fig. 7. Obtained results in terms of F-score on the three classes of tweets: positive, negative and neutral

3.5 Data Analysis

Taking as a starting point the overlap between instances observed in the t-SNE-based representations, we carried out a manual analysis of the instances in TwAS. We identified some cases where instances composed of almost the same terms were labeled with different, even contradictory, classes.

  • @airline thank you Labels: neutral and positive

  • @airline What a really GREAT& FLATTERING story about you! You should be very proud :) URL (via @mention) Labels: negative and positive

  • @airline has getaway deals through May, from $59 one-way. Lots of cool cities URL #CheapFlights #FareCompare Labels: negative, positive, and neutral

Then, in an attempt to remove such instances, we applied two sentiment analysis libraries, namely NLTK and TextBlob, to determine the sentiment of each tweet. Each tweet was “re-labeled” according to the following criterion: when the class assigned by a library and the original label of the tweet are equal, the instance is kept. We also considered both libraries at the same time; in this case, an instance is selected only if the three labels are the same. Table 2 shows the distribution of each subsample of data. In parentheses we also include the F-score obtained for each class when the classification task was performed over each subsample. The FastText+subset-1 group of features was used.
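The selection criterion described above amounts to an agreement filter, which can be sketched as follows. The sketch takes the labels assigned by each tool as precomputed lists; the tool names `tool_a` and `tool_b` and all labels shown are made up for illustration and are not real output of NLTK or TextBlob.

```python
def filter_by_agreement(instances, tool_labels):
    """Keep instances whose original label matches the label of every tool.

    instances: list of (text, original_label) pairs.
    tool_labels: dict mapping a tool name to a list of labels aligned with `instances`.
    """
    kept = []
    for i, (text, original) in enumerate(instances):
        if all(labels[i] == original for labels in tool_labels.values()):
            kept.append((text, original))
    return kept

instances = [
    ("@airline thank you", "positive"),
    ("@airline thank you", "neutral"),
    ("@airline lost my bag again", "negative"),
]
# Labels below are made up for illustration, not real tool output.
tool_a = ["positive", "positive", "negative"]
tool_b = ["positive", "positive", "neutral"]

kept_a = filter_by_agreement(instances, {"tool_a": tool_a})
kept_both = filter_by_agreement(instances, {"tool_a": tool_a, "tool_b": tool_b})
```

Requiring agreement with both tools is stricter than requiring agreement with one, so the corresponding subsample can only be the same size or smaller.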

Table 2. Distribution of each subsample of data

We also analyzed the results of re-annotating the tweets. Table 3 shows some samples. The first instance was labeled as positive during the manual annotation, while both SA tools identified it as negative; correctly classifying such a complex expression is a challenge because the sentiment expressed by the user is very subtle. The second and third samples have the neutral label, although both are clearly positive; these instances are similar to the ones presented above, since in TwAS we found tweets with almost the same content annotated with contradictory classes. The last two sentences carry a negative connotation despite being annotated as neutral. Finally, we also identified some instances with irony and sarcasm, another important challenge for sentiment analysis [24], such as: @airline never fails to disappoint. and @airline Another delay. Wow.

Table 3. Obtained results of re-annotating tweets

4 Conclusions

The use of lexical and linguistic resources can help in identifying subjective expressions in tweets. In this work, the proposed methodology was evaluated using a corpus of tweets in the domain of commercial airlines. Incorporating lexical and psycho-linguistic resources makes it possible to reduce the dimensionality significantly (from more than 12000 to less than 200). The main advantages are a lower computational cost and the ability to obtain a performance similar to that of a bag-of-words representation when carrying out sentiment analysis. As future work, we are interested in further analyzing this corpus, considering the role of irony and sarcasm, as well as in evaluating the performance of the proposed methodology in other domains.