1 Introduction

Nowadays, a huge amount of social media and user-generated content is available on the Web. A major portion of social media texts such as blog posts, tweets and comments contains opinion-related information. Internet users give opinions in various domains such as politics, cricket and other sports, movies and music. This vast amount of online social media text can be collected and mined to derive intelligence useful in many applications across domains such as marketing, politics/political science, policy making, sociology and psychology. A social media user expresses sentiment in the form of an opinion, a subjective impression, or a thought or judgment prompted by feelings [1]; it also includes emotions. Sentiment analysis for detecting sentiment polarity (positive, negative or neutral) in social media texts has been recognized by researchers as a major research topic, though it is hard to draw a concrete boundary between the two related research areas of sentiment analysis and opinion mining.

Thus, to derive knowledge from this vast amount of social media data, there is a need for an efficient and accurate system that can analyze opinions coming from heterogeneous sources and detect their sentiment polarity.

The most common approaches to sentiment analysis use machine learning techniques [2,3,4,5,6,7,8,9], though earlier approaches relied on natural language processing (NLP) and computational linguistics techniques [10,11,12,13] that require in-depth linguistic knowledge. Research on sentiment polarity detection has already been carried out in different genres: blogs [14], discussion boards or forums [15], user reviews [16] and expert reviews [17].

In contrast to the machine learning based approach, the lexicon based approach [18] relies solely on a background knowledge base or sentiment lexicon, which is either a manually constructed lexicon of polarity (positive and negative) terms [19] or a lexicon constructed by some automatic process [20,21,22,23,24,25]. One such sentiment lexicon [26] was constructed manually based on quantitative analysis of the glosses associated with synsets retrieved from WordNet [27]. Though the sentiment lexicon plays a crucial role in most sentiment analysis tasks, the main problem with the lexicon based approach is that it is difficult to build and maintain a universal sentiment lexicon.

The main advantage of the machine learning based approach is that it can be ported to any domain quickly by changing the training dataset. Hence most researchers prefer the supervised machine learning based sentiment analysis approach. A supervised method for sentiment polarity detection uses machine learning algorithms trained on a corpus of social media texts labeled with sentiment polarity, where each text is turned into a feature vector. The most commonly used features are word n-grams, surrounding words, or even punctuation. The machine learning algorithms most commonly used for the sentiment analysis task are Naïve Bayes, SVM, k-nearest neighbor, decision trees and artificial neural networks [18, 28,29,30,31,32,33,34,35,36].

Most previous research on sentiment polarity detection involves analysis of English texts. But due to the multilingual nature of Indian social media texts, there is also a need for systems that can perform sentiment analysis of Indian language texts. In this line, a shared task on Sentiment Analysis in Indian Languages (SAIL) Tweets was held in conjunction with the MIKE 2015 conference at IIIT Hyderabad, India [37]. Bengali was included in this shared task as one of the major Indian languages; Bengali (Bangla) is also one of the most widely spoken languages in the world. In recent years, some researchers have attempted to develop sentiment analysis systems for Bengali [35, 38,39,40,41].

In this paper, we present a stacked ensemble approach for sentiment polarity detection in Bengali tweets. This approach first constructs three base models, each of which makes use of a subset of the input features representing the tweets. The tweets are represented by word n-gram, character n-gram and SentiWordNet features. SentiWordNet (sentiment WordNet) is an external knowledge base containing a collection of polarity words (discussed in the next section).

The features are grouped into three subsets: (1) word n-gram features and Sentiment-WordNet features, (2) character n-gram features and Sentiment-WordNet features, and (3) unigram features and Sentiment-WordNet features. The first base model is developed using multinomial Naïve Bayes with the word n-gram and Sentiment-WordNet features, the second using multinomial Naïve Bayes with the character n-gram and Sentiment-WordNet features, and the third using a support vector machine with the unigram and Sentiment-WordNet features. The base-level classifiers' predictions are combined using a meta-classifier; this process is popularly known as stacking.

2 Proposed Methodology

The proposed system uses a stacked ensemble model for sentiment polarity detection in Bengali tweets. The proposed model has three important steps: (1) data cleaning, (2) features and base classifiers, and (3) model development and sentiment polarity classification.

2.1 Data Cleaning

At the preprocessing step, the entire data collection is processed to remove irrelevant characters. This step is important because tweet data is noisy.

2.2 Features and Base Classifiers

Our idea of the stacked ensemble is to combine the outputs of base classifiers, each of which makes use of a subset of the features representing the tweets. As mentioned earlier, a tweet is represented by word n-gram, character n-gram and SentiWordNet features. Word n-grams and character n-grams that do not occur at least 3 times in the training data are removed as noise.

Fig. 1. System architecture of the stacked ensemble based machine learning for Bengali tweet sentiment analysis

We have developed three base classifiers at the first level: (1) multinomial Naïve Bayes with word n-gram features and Sentiment-WordNet features, (2) multinomial Naïve Bayes with character n-gram features and Sentiment-WordNet features, and (3) a linear kernel support vector machine with unigram (1-gram) features and Sentiment-WordNet features. At the second level, we have used an MLP classifier as the meta-classifier. The overall architecture of our proposed model is shown in Fig. 1, and a rough code sketch of it follows.
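As a rough sketch of this architecture in scikit-learn: here `min_df` stands in for the count-based frequency cutoff described above, and `relu` stands in for the softplus activation used in Sect. 2.3 (scikit-learn's MLPClassifier offers no softplus), so this is an approximation rather than the exact implementation.

```python
from sklearn.ensemble import StackingClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

base_learners = [
    # (1) multinomial NB over word 1- to 3-grams; min_df=3 approximates
    #     dropping n-grams seen fewer than 3 times in the training data
    ("nb_word", make_pipeline(
        CountVectorizer(ngram_range=(1, 3), min_df=3), MultinomialNB())),
    # (2) multinomial NB over character 2- to 5-grams
    ("nb_char", make_pipeline(
        CountVectorizer(analyzer="char", ngram_range=(2, 5), min_df=3),
        MultinomialNB())),
    # (3) linear SVM over word unigrams
    ("svm_uni", make_pipeline(CountVectorizer(), LinearSVC())),
]

model = StackingClassifier(
    estimators=base_learners,
    # 2 hidden nodes as in Sect. 2.3; 'relu' stands in for softplus
    final_estimator=MLPClassifier(hidden_layer_sizes=(2,), max_iter=1000),
    cv=10,
)
# tweets: array-like of augmented tweet strings, labels: polarity labels
# model.fit(tweets, labels); predictions = model.predict(test_tweets)
```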

Base Classifiers

Multinomial Naïve Bayes with word n-gram features and Sentiment-WordNet features. As mentioned earlier, the first base model in our proposed stacked ensemble uses multinomial Naïve Bayes [38, 40], and the associated feature set includes word n-gram features and Sentiment-WordNet features. For this model, we have taken unigrams, bigrams and trigrams as features (i.e., n = 1, 2, 3). Word n-grams that do not occur at least 3 times in the training data are removed as noise. With word n-grams as features, the sentiment class of a tweet T is determined by the posterior probability of a sentiment class given the sequence of word n-grams in the tweet:

$$ P(C \mid T) = P(C)\prod_{i=1}^{m} P(t_i \mid C) $$

(1)

where:

T is a tweet represented as a sequence of word n-grams, T = (t1, t2, …, tm),

ti is the i-th word n-gram,

C is a sentiment class,

P(C) is the prior probability of class C, and

m is the number of word n-grams (n = 1 to 3) in the tweet, counted with repetition (a word n-gram may repeat in the tweet).

The details of how multinomial Naïve Bayes is applied to the sentiment analysis task can be found in [38]. In addition to word n-gram features, we have also incorporated an external knowledge base called Sentiment-WordNet, from which polarity information is retrieved for the tweet words. Though the polarity of a word does not always depend on its literal meaning, there are many words which are usually used as positive words (for example, the word “good”); the same is true for negative polarity words. Such information may be useful for sentiment polarity detection in tweets. For our work, the Sentiment-WordNet for Indian Languages [42] (retrieved from http://amitavadas.com/sentiwordnet.php) has been used. This is a collection of positive, negative and neutral words along with their broad part-of-speech categories. To incorporate Sentiment-WordNet, each word in a tweet of the corpus is augmented with a special word: “#P” if the tweet word is found in the list of positive polarity words, “#N” if it is found in the negative set and “#NU” if it is found in the neutral set. For example, a Bengali tweet whose English gloss is “This is a very good food” is augmented so that its gloss reads “This is a very #P good #P food”. With this augmentation, the formula for the posterior probability is modified as follows:

$$ P(C \mid T) = P(C)\left[ \prod_{i=1}^{m} P(t_i \mid C) \right] P(\#P \mid C)^{m_1}\, P(\#N \mid C)^{m_2}\, P(\#NU \mid C)^{m_3} $$

(2)

where:

m = number of word n-grams (n = 1 to 3) in the tweet (counted with repetition),

m1 = number of tweet words found in the positive word list of Sentiment-WordNet,

m2 = number of tweet words found in the negative word list of Sentiment-WordNet, and

m3 = number of tweet words found in the neutral word list of Sentiment-WordNet.

From Eq. 2, it is evident that the posterior probability for a tweet is boosted according to how many polarity words it contains. For example, if a tweet contains more positive polarity words than words of the other two types, the overall polarity of the tweet is boosted in the direction of positivity: with hypothetical estimates P(#P|C = positive) = 0.10 and P(#P|C = negative) = 0.02, a tweet with m1 = 2 contributes a factor of 0.01 to the positive class but only 0.0004 to the negative class. A minimal code sketch of the augmentation step is given below.
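The following sketch, a hypothetical helper assuming whitespace tokenization and in-memory lexicon sets, illustrates the augmentation described above:

```python
# Placeholder lexicon sets; in practice these are loaded from the
# positive/negative/neutral word lists of Sentiment-WordNet [42].
POSITIVE, NEGATIVE, NEUTRAL = {"bhalo"}, set(), set()

def augment(tweet: str) -> str:
    """Append '#P', '#N' or '#NU' after each word found in the lexicon."""
    out = []
    for word in tweet.split():
        out.append(word)
        if word in POSITIVE:
            out.append("#P")
        elif word in NEGATIVE:
            out.append("#N")
        elif word in NEUTRAL:
            out.append("#NU")
    return " ".join(out)

print(augment("khub bhalo cinema"))  # -> "khub bhalo #P cinema"
```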

Multinomial Naïve Bayes with Character n-gram Features and Sentiment-WordNet Features.

This base classifier is based on the same principle described in the above sub-section. The only difference is that this model makes use of a different subset of input features: it uses character n-gram and Sentiment-WordNet features to represent a tweet. Examples of word n-grams and character n-grams are given below:

Example Input text: “khub bhalo cinema” (very good movie).

Word n-grams (for n = 1, 2) are: “khub”, “bhalo”, “cinema”, “khub bhalo”, “bhalo cinema”.

Character n-grams for n = 4 are: “khub”, “hub ”, “ub b”, “b bh”, “ bha”, “bhal”, “halo”, “alo ”, “lo c”, “o ci”, “ cin”, “cine”, “inem”, “nema” (each n-gram spans exactly four characters, spaces included).
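These character n-grams can be reproduced with a simple sliding window; the snippet below is purely illustrative:

```python
def char_ngrams(text: str, n: int) -> list:
    """All overlapping character n-grams of the text, spaces included."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

print(char_ngrams("khub bhalo cinema", 4))
# ['khub', 'hub ', 'ub b', 'b bh', ' bha', 'bhal', 'halo',
#  'alo ', 'lo c', 'o ci', ' cin', 'cine', 'inem', 'nema']
```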

It is very common that a word occurring in a test tweet is absent from the training data; this is known as the out-of-vocabulary problem. Character n-gram features are useful for dealing with the out-of-vocabulary problem. Character n-grams with n varying from 2 to 5 are used for developing this base model, and those that do not occur at least 3 times in the training data are removed as noise.

For this base model, the set of character n-gram features and the Sentiment-WordNet features are used, and the posterior probability for the tweet is calculated using Eq. 2, with the only difference that the variables t1, t2, …, tm in Eq. 2 refer to the distinct character n-grams in the tweet; that is, the probability value is taken only once in the equation even if the character n-gram repeats several times in the tweet.
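In scikit-learn terms, one plausible realization of this feature set, assuming that “taken only once” corresponds to binary presence counts, is:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Character 2- to 5-grams, rare n-grams dropped; binary=True records
# each distinct n-gram at most once per tweet (presence features).
char_vectorizer = CountVectorizer(analyzer="char", ngram_range=(2, 5),
                                  min_df=3, binary=True)
```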

Support Vector Machines with Unigram and Sentiment-WordNet Features.

Support Vector Machines (SVM) [43] with a linear kernel have proven useful in text classification tasks due to their inherent capability of dealing with high-dimensional data. So, for the third base classifier, an SVM with linear kernel has been used. This base model also uses a different subset of tweet features: unigram (word 1-gram) features and Sentiment-WordNet features. Since the tweet words found in Sentiment-WordNet are augmented with one of the pseudo-words “#P”, “#N” and “#NU”, the Sentiment-WordNet features are automatically taken into account while computing the unigram feature set.

For developing this base model, we did not take all unigrams as features; a subset of unigrams is used because we observed that increasing the number of unigram features hampers the individual performance of this base model. For this purpose, the 1000 most frequent unigrams per class are considered as features. Thus, under the bag-of-unigrams model, each tweet is represented by a feature vector of length 3000 (1000 per class × 3 classes), where each component of the vector corresponds to the frequency of the corresponding unigram in the tweet under consideration, and each vector is labeled with the label of the corresponding training tweet. A sketch of this feature construction follows.
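The following minimal sketch, with hypothetical function names and whitespace tokenization assumed, illustrates the per-class unigram selection and the resulting frequency vectors:

```python
from collections import Counter

def select_top_unigrams(tweets, labels, per_class=1000):
    """The most frequent unigrams per sentiment class, concatenated."""
    vocab = []
    for cls in ("positive", "negative", "neutral"):
        counts = Counter(word
                         for tweet, label in zip(tweets, labels)
                         if label == cls
                         for word in tweet.split())
        vocab.extend(word for word, _ in counts.most_common(per_class))
    return vocab  # length up to 3 * per_class (3000 here)

def unigram_vector(tweet, vocab):
    """Frequency of each selected unigram in the tweet."""
    words = tweet.split()
    return [words.count(word) for word in vocab]
```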

2.3 Model Development and Sentiment Classification

As shown in Fig. 1, three base classifiers are used for model development and their predictions are combined using a meta-classifier. We have used a multilayer perceptron (MLP) neural network as the meta-classifier at the second level. From the training data provided to the model, it learns how to classify a tweet into one of three sentiment polarity classes: Positive, Negative and Neutral. The MLP classifier has one hidden layer of 2 nodes with the softplus activation function.
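Since scikit-learn's MLPClassifier offers no softplus activation, the sketch below uses Keras; the input width is an assumption about how the base classifiers' outputs are stacked:

```python
import tensorflow as tf

n_meta_features = 9  # assumed: 3 base classifiers x 3 class scores each

meta_classifier = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(n_meta_features,)),
    tf.keras.layers.Dense(2, activation="softplus"),  # one hidden layer, 2 nodes
    tf.keras.layers.Dense(3, activation="softmax"),   # Positive/Negative/Neutral
])
meta_classifier.compile(optimizer="adam",
                        loss="sparse_categorical_crossentropy",
                        metrics=["accuracy"])
```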

During the testing phase, each unlabeled tweet is presented for classification to the trained model, and the label assigned by the model is taken as the sentiment label of the tweet.

3 Evaluation and Experimental Results

We have used Bengali datasets released for a shared task on Sentiment Analysis in Indian Languages (SAIL) Tweets, held at IIIT Hyderabad, India [37]. The training set consists of 1000 tweets and the test set consists of 500 tweets.

3.1 Experiments and Results

We combined the SAIL training and test data to form a dataset of 1500 tweets, performed 10-fold cross-validation, and computed the average accuracy over the 10 folds for each model presented in this paper; these average accuracies are the figures we report.
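Assuming the stacked model exposes a scikit-learn interface (as in the sketch in Sect. 2.2), this evaluation protocol amounts to:

```python
from sklearn.model_selection import cross_val_score

# model as defined in the Sect. 2.2 sketch;
# tweets: all 1500 augmented tweet strings, labels: their polarity labels
scores = cross_val_score(model, tweets, labels, cv=10, scoring="accuracy")
print("average accuracy over 10 folds: %.4f" % scores.mean())
```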

We have compared our proposed stacked ensemble model with some existing Bengali tweet sentiment analysis systems published in the literature. For meaningful comparison among the systems, we re-implemented the existing systems that previously used the SAIL 2015 datasets for system development. Brief descriptions of the systems to which our proposed system is compared are given below.

  • A deep learning model for Bengali tweet sentiment analysis has been presented in [41]. It uses a recurrent neural network architecture, LSTM, for model development. This LSTM based model takes into account the entire sequence of tokens in a tweet while detecting the tweet's sentiment polarity. The same tweet augmentation strategy used in our proposed model is also used in this model.

  • The sentiment polarity detection model in [38] uses Multinomial Naïve Bayes with word unigram, bigram and Sentiment-WordNet features. The details can be found in [38].

  • The sentiment polarity detection model reported in [38] uses SVM with word unigram and Sentiment-WordNet features. The details of this model can be found in [38].

  • The sentiment polarity detection model presented in [40] uses character n-gram features and Sentiment-WordNet features. The details of this model can be found in [40].

Table 1. Performance comparisons of our proposed model and other four existing machine learning based models applied to sentiment polarity detection in Bengali tweets

We have compared the results obtained by our proposed stacked ensemble model with the four existing sentiment polarity detection models described above; the comparison is shown in Table 1. It is evident from Table 1 that our proposed stacked ensemble model performs better than the other existing models it is compared to. Since each of the existing models in Table 1 uses a single machine learning algorithm with either word n-gram and Sentiment-WordNet features or character n-gram and Sentiment-WordNet features for Bengali tweet sentiment classification, the results show that combining classifiers with stacking improves performance over the individual classifiers applied to sentiment polarity detection in Bengali tweets. As Table 1 also shows, our proposed model outperforms the LSTM based deep learning model presented in [41].

4 Conclusion and Future Work

In this paper, we have described a stacked ensemble model for Bengali tweet sentiment classification. Two multinomial Naïve Bayes models using different subsets of features and an SVM based model with linear kernel have been combined in a stacked ensemble using an MLP meta-classifier. We experimented to choose the appropriate meta-classifier, and our experiments reveal that the MLP classifier with softplus activation in the hidden units performs best among the alternatives we considered.

The insufficiency of training data is one of the major problems in developing systems for Bengali tweet sentiment analysis. We also observe that the SAIL 2015 data is not error free: some tweets have been wrongly labeled by the human annotators. However, for meaningful comparison of the systems, we have left those errors uncorrected. We hope that system performance can be improved with a larger amount of properly annotated training data. We also expect that our proposed system can easily be ported to other Indian languages such as Hindi and Tamil.

Choosing more appropriate base classifiers and meta-classifier is another possible direction for improving system performance.