1 Introduction

The proliferation of Web 2.0 has led to a digital landscape in which people socialize by publicly expressing and sharing their thoughts and opinions on various issues, through a variety of means and applications. Indicative of this new form of social interaction is microblogging, an online broadcast medium that allows users to post and share short messages with an online audience. Twitter is the most popular of such services, with millions of users exchanging a huge volume of text messages, called tweets, every day. This has resulted in an enormous source of unstructured data (Big Data), which can be integrated into the decision-making process through Big Data analytics. Big Data analytics is a complex process of examining large and varied data sets to uncover information, including hidden patterns, unknown correlations, market trends and customer preferences, that can help organizations make better-informed business decisions [1]. Twitter is a valuable medium for this process. This paper focuses on the analysis of the opinions expressed in tweets, called Twitter Sentiment Analysis.

Sentiment Analysis is the process of detecting the sentiment content of a text unit in order to identify people’s attitudes and opinions towards various topics [2]. Twitter, with nearly 600 million users and over 250 million messages per day, has become one of the largest and most dynamic datasets for sentiment analysis. Twitter Sentiment Analysis refers to the classification of tweets based on the emotion or polarity that the user intends to convey. This information is extremely valuable in the many circumstances where decision making is crucial. For instance, opinion and sentiment detection is useful in politics to forecast election outcomes or estimate the acceptance of politicians [3], in marketing for sales predictions, product recommendations and investors’ choices [1], and in education for incorporating the learner’s emotional state into the student model in order to provide an affective learning environment [4].

Twitter sentiment analysis is becoming increasingly important for social media mining, as it gives access to the valuable opinions of numerous participants on various business and social issues. Consequently, many researchers have studied applications and enhancements of sentiment analysis algorithms that provide more efficient and accurate results [5]. There are three main sentiment analysis approaches: a. machine learning-based, which uses a classification technique to classify the text entity, b. lexicon-based, which uses a sentiment dictionary of opinion words whose weights determine the polarity, and c. hybrid, where different approaches are combined [4, 6, 7]. Experimental results have shown that machine learning methods have higher precision, while lexicon-based methods are competitive when the training dataset lacks quantity or quality, as they require little human effort for document labeling [5, 7]. Combining the proper techniques, on the other hand, fosters better results, as hybrid approaches can collectively exhibit the accuracy of machine learning algorithms and the speed of lexicons [6]. Recently, however, a new approach has arisen from the machine learning field, namely deep learning, overshadowing the aforementioned methods.
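To make the lexicon-based approach concrete, the following Python sketch scores a text by summing the weights of the opinion words it contains. The lexicon entries, their weights and the function name are hypothetical and chosen only for illustration; they are not taken from any of the cited works.

```python
# Hypothetical toy lexicon: words and weights are invented for illustration.
TOY_LEXICON = {"good": 1.0, "great": 2.0, "bad": -1.0, "terrible": -2.0}


def lexicon_polarity(text, lexicon=TOY_LEXICON):
    """Sum the weights of known opinion words and map the total to a label."""
    score = sum(lexicon.get(token, 0.0) for token in text.lower().split())
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


print(lexicon_polarity("great debate but bad moderator"))  # total weight 2.0 - 1.0
```

No training data is needed here, which is why such methods remain competitive when labeled tweets are scarce.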

Deep learning is a branch of machine learning that refers to artificial neural networks with multiple layers: an input layer, an output layer and at least one hidden layer [8]. The “deep” in deep learning refers to having more than one hidden layer. Loosely modeled on the human brain, neural networks consist of a large number of information processing units, called neurons, organized in layers, which work in unison. Deep learning performs hierarchical feature learning, where each layer learns to transform its input into a slightly more abstract and composite representation through nonlinear processing, using a weight-based model with an activation function at each neuron to determine the importance of the input value. Training iterations continue until an acceptable level of accuracy is reached at the output.
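The layer-wise transformation described above is commonly written as follows. The notation is an assumption introduced here for illustration: x is the input, W^{(l)} and b^{(l)} are the weights and biases of layer l, f is the nonlinear activation function, and L is the number of layers.

```latex
h^{(0)} = x, \qquad
h^{(l)} = f\left( W^{(l)} h^{(l-1)} + b^{(l)} \right), \quad l = 1, \dots, L
```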

Deep learning can be used to extract valuable information that is buried in Big Data [1]. It has been extensively applied in the artificial intelligence field, in areas such as computer vision, transfer learning, semantic parsing, and natural language processing. Thus, for more accurate sentiment analysis, many researchers have implemented different models of this approach, namely CNN (convolutional neural networks), DNN (deep neural networks), RNN (recurrent neural networks) and DBN (deep belief networks) [8]. The results show that deep learning is very beneficial for sentiment classification. Besides the higher performance, there is no need for carefully optimized hand-crafted features and feature engineering, one of the most time-consuming parts of machine learning practice. Big Data challenges, such as semantic indexing, data tagging and immediate information retrieval, can be better addressed using deep learning. However, it requires large data sets and is extremely computationally expensive to train.

In this study, a convolutional neural network was applied to three well-known Twitter datasets, namely Obama-McCain Debate (OMD), Health Care Reform (HCR) and Stanford Twitter Sentiment Gold Standard (STS-Gold), for sentiment classification, using different pre-trained word embeddings. Word embeddings are among the most useful deep learning methods for constructing vector representations of words and documents. Their novelty is the ability to capture the syntactic and semantic relations among words. The most successful word embedding methods are Word2Vec [9, 10], GloVe [11] and FastText [12]. Many researchers have used these methods in their sentiment analysis experiments [13, 14]. Despite their effectiveness, one of their limitations is the need for large corpora in order to train acceptable word vectors. Thus, researchers have to use pre-trained word vectors when their datasets are small. In this research, we experiment on four notable pre-trained word embeddings, namely Google’s Word2Vec, Stanford’s Crawl GloVe, Stanford’s Twitter GloVe, and Facebook’s FastText, and examine their effect on the performance of sentiment classification.

The structure of this paper is as follows. First, we present the related work on Twitter sentiment analysis using deep learning techniques. Next, we present the evaluation procedure of this research, describing the datasets, the pre-trained word embeddings and the deep learning algorithm used. Section 4 deals with the comparative analysis and discussion of the experimental results. Finally, we present our conclusions and future work.

2 Related Work

The need to analyze the Big Data generated by the proliferation of social media, and to classify sentiment efficiently, has led many researchers to adopt deep learning models. The models used vary in the datasets employed, the deep learning techniques applied (word embeddings and algorithm), and the purpose served. This section briefly describes representative studies related to Twitter sentiment analysis using deep learning approaches, and tabulates them along the aforementioned dimensions.

In [13], the authors experiment with deep learning models along with modern training strategies in order to achieve a better sentiment classifier for tweets. The proposed model was an ensemble of 10 CNNs and 10 LSTMs combined through soft voting. The ensembled models were initialized with different random weights and used different numbers of epochs, filter sizes and embedding pre-training algorithms, i.e. Word2Vec or FastText. The GloVe variation was excluded from the ensemble, as it gave a lower score than the other two embeddings.

In [15], the authors propose a three-step process to train their deep learning model for predicting polarities at both message and phrase levels: i. word embeddings are initialized using the Word2Vec neural language model, which is trained on a large unsupervised tweet dataset; ii. a CNN is used to further refine the embeddings on a large distant supervised dataset; iii. the word embeddings and other parameters of the network obtained at the previous stage are used to initialize a network with the same architecture, which is then trained on a supervised dataset from SemEval-2015.

In [16], the authors develop a Bidirectional Long Short-Term Memory (BLSTM) RNN, which combines two RNNs: a forward RNN where the sequence is processed from left to right, and a reverse RNN where the sequence is processed from right to left. The average of both RNNs’ outputs for each token is used to compute the model’s final label for adverse drug reaction (ADR) identification. For pre-trained word embeddings, they utilized the Word2Vec 400M Twitter model (by Frederic Godin), which is a set of 400-dimensional word embeddings trained using the skip-gram algorithm on more than 400 million tweets.

In [17], the authors propose a new classification method based on deep learning algorithms to address issues in Twitter spam detection. They first collected a corpus of real tweets spanning 10 days and applied Word2Vec to pre-process it, instead of feature extraction. Afterwards, they implemented binary detection using an MLP and compared their approach to other techniques, such as traditional text-based methods and feature-based methods supported by machine learning.

In [18], the authors apply a convolution algorithm to Twitter sentiment analysis to train a deep neural network, in order to improve accuracy and analysis speed. First, they used the GloVe model to implement unsupervised learning of word-level embeddings on a 20 billion Twitter corpus. Afterwards, they combined this word representation with the prior polarity score feature and sentiment feature set, and used it as input to a deep convolutional neural network to train and predict the sentiment classification labels of five Twitter datasets. Their method was compared to baseline approaches regarding the preprocessing techniques and classification algorithms used.

In [19], the authors present an effective deep neural architecture for performing language-agnostic sentiment analysis over tweets in four languages. The proposed model is a convolutional network with character-level embedding, which is designed to solve the inherent problem of word-embedding and n-gram based approaches. This model was compared to other CNN and LSTM models with different embeddings, and to a traditional SVM-based approach.

In [14], the authors first construct a tweet processor using semantic rules, and afterwards train a character-embedding DeepCNN to produce feature maps capturing the morphological and shape information of a word. Moreover, they integrated global fixed-size character feature vectors and word-level embeddings for a Bi-LSTM. As pre-trained word vectors, they used the publicly available 300-dimensional Word2Vec and the 200-dimensional Twitter GloVe of Stanford. The results showed that using character embeddings from a DeepCNN to enrich word embeddings built on top of Word2Vec or GloVe improves classification accuracy in Twitter sentiment analysis.

In [20], the authors apply deep learning techniques to classify sentiment of Thai Twitter data. They used Word2Vec to train initial word vectors for LSTM and Dynamic CNN methods. Afterwards, they studied the effect of deep neural network parameters in order to find the best settings, and compared these two methods to other classic techniques. Finally, they demonstrated that the sequence of words influences sentiment analysis.

Feeding the network with the proper word embeddings is a key factor for achieving effective and accurate sentiment analysis. Therefore, this article concentrates on the evaluation of the most popular pre-trained word embeddings, namely Word2Vec, Crawl GloVe, Twitter GloVe and FastText. The experiments were conducted using a CNN in the context of sentiment analysis on three Twitter datasets.

Table 5.1 summarizes the deep learning studies described in this section with regard to the datasets used, the deep learning techniques applied, and the scope served. It is worth noting that the authors mainly used the well-known word vectors Word2Vec and GloVe for training their networks, and CNN as the base algorithm. Moreover, in the vast majority of cases, the purpose is to present a comparative analysis of deep learning approaches, reporting their performance metrics.

Table 5.1 Analysis of deep learning researches

3 Evaluation Procedure

The aim of the current research is to investigate how the top pre-trained word embeddings, obtained by unsupervised learning on large text corpora, affect the training of a deep convolutional neural network for Twitter sentiment analysis. Figure 5.1 illustrates the steps followed.

Fig. 5.1 Twitter classification using word embeddings to train CNN

First, word embeddings are used to transform the words of each tweet into their corresponding vectors, which build up the sentence vectors. Then, the generated vectors are divided into a training set and a test set. The training set is used as input to train the CNN, which then makes predictions on the test set. Finally, the experimental outcomes are tabulated and a descriptive analysis is conducted, comparing the CNN results with baseline machine learning approaches. All the preprocessing and classification steps were carried out in the Weka data mining package.
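The first two steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Weka workflow used in the experiments: the toy vectors, the zero-vector handling of unknown words, the padding length and the 80/20 split ratio are all hypothetical choices.

```python
import random

# Hypothetical 3-dimensional vectors; real pre-trained vectors have far more dimensions.
TOY_VECTORS = {
    "health": [0.1, 0.3, -0.2],
    "care": [0.0, 0.2, 0.4],
    "reform": [-0.3, 0.1, 0.1],
}
DIM = 3


def tweet_to_matrix(tokens, vectors, max_len):
    """Look up each token, use a zero vector for unknown words, pad to max_len rows."""
    rows = [list(vectors.get(tok, [0.0] * DIM)) for tok in tokens[:max_len]]
    rows += [[0.0] * DIM for _ in range(max_len - len(rows))]
    return rows


def train_test_split(items, test_ratio=0.2, seed=0):
    """Shuffle a copy of the items and cut it into training and test parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_ratio))
    return shuffled[:-n_test], shuffled[-n_test:]


matrix = tweet_to_matrix(["health", "care", "now"], TOY_VECTORS, max_len=4)
train, test = train_test_split(range(10))
print(len(matrix), len(train), len(test))
```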

3.1 Datasets

The experiments described in this paper were performed on three well-known Twitter datasets that are freely available on the Web. They were chosen because they contain a significant volume of tweets, were created by reputable universities for academic purposes, and have been used in various studies. The STS-Gold dataset [21] consists of random tweets with no particular topic focus, whereas OMD [22] and HCR [23] include tweets on specific domains. In particular, the OMD dataset contains tweets related to the Obama-McCain Debate and the HCR dataset focuses on Health Care Reform. The statistics of the datasets are shown in Table 5.2 and examples of tweets are given in Table 5.3.

Table 5.2 Statistics of the three Twitter datasets used
Table 5.3 Examples of tweets

3.2 Data Preprocessing

Due to their short length and informal nature, tweets usually contain a variety of noise and language irregularities. This noise and the unstructured sentences affect the performance of sentiment classification [24]. Thus, a preprocessing stage is essential in the classification task before feature selection. The preprocessing includes:

  • Stemming (reducing inflected words to their word stem, base or root form).

  • Remove all numbers, punctuation symbols and some special symbols.

  • Force lower case for all tokens.

  • Remove stopwords (the most common words in a language).

  • Replace the emoticons and emoji with their textual form using an emoticon dictionary.

  • Tokenize the sentence.

In our experiments, we use the Wikipedia emoticons list, the Weka stemmer, the Rainbow stopwords list, and Tweet-NLP.
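The preprocessing steps listed above can be sketched as follows. This is a simplified Python illustration, not the Weka configuration used in the experiments: the emoticon dictionary, the stopword list and the suffix-stripping stemmer are hypothetical stand-ins for the resources named above. Note that emoticons are replaced before punctuation is removed, since otherwise they would be lost.

```python
import re

# Hypothetical stand-ins for the emoticon list and stopword list named above.
EMOTICONS = {":)": "smile", ":(": "sad"}
STOPWORDS = {"the", "a", "is", "to", "and", "of"}


def naive_stem(token):
    """Crude suffix stripping; a placeholder, not the stemmer used in the paper."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def preprocess(tweet):
    # Replace emoticons first, before punctuation removal would destroy them.
    for emoticon, word in EMOTICONS.items():
        tweet = tweet.replace(emoticon, f" {word} ")
    tweet = tweet.lower()
    # Drop digits, punctuation and other non-letter symbols.
    tweet = re.sub(r"[^a-z\s]", " ", tweet)
    tokens = tweet.split()  # simple whitespace tokenization
    tokens = [t for t in tokens if t not in STOPWORDS]
    return [naive_stem(t) for t in tokens]


print(preprocess("The debate is starting at 9!! :)"))
# → ['debate', 'start', 'at', 'smile']
```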

3.3 Pre-trained Word Embeddings

Word embeddings are a set of natural language processing techniques where individual words are mapped to real-valued vectors in a high-dimensional space [9]. The vectors are learned in such a way that words with similar meanings have similar representations in the vector space. This is a more semantic representation of text than more classical methods like bag-of-words, where relationships between words or tokens are ignored, or forced in bigram and trigram approaches [10]. This numeric representation is essential in sentiment analysis, as many machine learning algorithms and deep learning approaches are incapable of processing strings in their raw form and require numeric values as input to perform their tasks. Thus, there is a variety of word embedding methods that can be applied to any corpus in order to build a vocabulary of word vectors with hundreds of dimensions, capturing a great deal of semantic patterns.
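A common way to check that similar words have similar representations is cosine similarity between their vectors. The Python sketch below uses hypothetical 3-dimensional vectors invented for illustration; real embeddings have hundreds of dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical toy vectors: "good" and "great" point in similar directions.
toy = {
    "good": [0.9, 0.1, 0.0],
    "great": [0.8, 0.2, 0.1],
    "tax": [0.0, 0.1, 0.9],
}

print(cosine_similarity(toy["good"], toy["great"]))  # close to 1
print(cosine_similarity(toy["good"], toy["tax"]))    # close to 0
```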

In 2013, Mikolov et al. proposed Word2Vec [9, 10], which has since become the mainstream word embedding method. Word2Vec employs a two-layer neural network, where Huffman coding is used in the hierarchical softmax to assign shorter codes to frequent words. The model is trained through stochastic gradient descent, and the gradient is computed by backpropagation. The word vectors are learned with either the CBOW or the Skip-gram architecture.
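As an illustration of the Skip-gram setting, the Python sketch below generates the (center word, context word) pairs on which such a model would be trained. The function name and the window size are assumptions made for this example; the neural network and the hierarchical softmax are not shown.

```python
def skipgram_pairs(tokens, window=2):
    """Pair every word with each neighbour at most `window` positions away."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


print(skipgram_pairs(["health", "care", "reform"], window=1))
# → [('health', 'care'), ('care', 'health'), ('care', 'reform'), ('reform', 'care')]
```

CBOW uses the same windows in the opposite direction, predicting the center word from its surrounding context words.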

In 2014, Pennington et al. introduced GloVe [11], which has also become very popular. GloVe is essentially a log-bilinear model with a weighted least-squares objective. The model is based on the observation that ratios of word-word co-occurrence probabilities have the potential to encode some form of meaning.
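For reference, the weighted least-squares objective of GloVe, as given in [11], can be written as below. Here X_{ij} is the number of times word j occurs in the context of word i, w_i and \tilde{w}_j are the word and context vectors, b_i and \tilde{b}_j are bias terms, V is the vocabulary size, and f is a weighting function that limits the influence of very frequent co-occurrences.

```latex
J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^{\top} \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^{2},
\qquad
f(x) =
\begin{cases}
(x / x_{\max})^{\alpha} & \text{if } x < x_{\max} \\
1 & \text{otherwise}
\end{cases}
```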

In 2016, Bojanowski et al. proposed FastText [12], which is based on Word2Vec, can handle subword units and is fast to compute. FastText extends Word2Vec by introducing subword modeling. Instead of feeding individual words into the neural network, it represents each word as a bag of character n-grams. The embedding vector of a word is the sum of the vectors of all its n-grams.
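The subword idea can be illustrated with a short Python sketch. Following [12], the word is wrapped in the boundary markers `<` and `>` before its character n-grams are extracted. The n-gram vector table below is hypothetical; the sketch also simplifies the actual method, which additionally includes the whole word as a unit and hashes n-grams into a fixed number of buckets.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of the word wrapped in boundary markers."""
    marked = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(marked) - n + 1):
            grams.append(marked[i:i + n])
    return grams


def word_vector(word, ngram_vectors, dim, n_min=3, n_max=6):
    """Sum the vectors of the word's n-grams that are present in the table."""
    total = [0.0] * dim
    for gram in char_ngrams(word, n_min, n_max):
        vec = ngram_vectors.get(gram)
        if vec is not None:
            total = [t + v for t, v in zip(total, vec)]
    return total


print(char_ngrams("where", 3, 3))
# → ['<wh', 'whe', 'her', 'ere', 're>']

# Hypothetical 2-dimensional n-gram vectors, invented for illustration.
toy_ngrams = {"<wh": [1.0, 0.0], "re>": [0.0, 2.0]}
print(word_vector("where", toy_ngrams, dim=2, n_min=3, n_max=3))
# → [1.0, 2.0]
```

Because the vector is assembled from n-grams, a word that never appeared in the training corpus still receives a representation, which is useful for the misspellings and creative spellings common in tweets.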

Training word embeddings is a demanding process that requires large corpora in order to produce acceptable word vectors. Thus, rather than training the word vectors from scratch, pre-trained word embeddings that are publicly available on the Internet can be used for sentiment analysis with deep learning architectures. In this paper, we use four well-known sets of pre-trained word vectors, namely Google’s Word2Vec, Stanford’s Crawl GloVe, Stanford’s Twitter GloVe, and Facebook’s FastText, with the intention of evaluating their effect on sentiment analysis. A brief description of these vectors follows.

Google’s Word2Vec was trained on part of the Google News dataset, about 100 billion words. The model contains 300-dimensional vectors for 3 million words and phrases. Crawl GloVe was trained on a Common Crawl dataset of 42 billion tokens (words), providing a vocabulary of 2 million words with an embedding vector size of 300 dimensions. On the other hand, Twitter GloVe was trained on a dataset of 2 billion tweets with 27 billion tokens. This representation has a vocabulary of 2 million words and is available in several embedding vector sizes, including 50, 100 and 200 dimensions. Finally, FastText consists of 1 million word vectors of 300 dimensions, trained on Wikipedia 2017, the UMBC webbase corpus and the statmt.org news dataset (16 billion tokens) (Table 5.4).

Table 5.4 Statistics of pre-trained word embeddings
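Pre-trained vectors are often distributed as plain-text files with one word per line followed by its vector components. The Python sketch below parses such a file, assuming a GloVe-style layout with space-separated values and no header line; other distributions (for example binary files, or text files that begin with a header line) would need different handling. The function name, the optional vocabulary filter and the sample data are hypothetical.

```python
import io


def load_text_vectors(handle, vocabulary=None):
    """Read 'word v1 v2 ...' lines; optionally keep only words in `vocabulary`."""
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip empty or malformed lines
        word, values = parts[0], parts[1:]
        if vocabulary is not None and word not in vocabulary:
            continue
        vectors[word] = [float(v) for v in values]
    return vectors


# An in-memory sample stands in for a real vector file.
sample = io.StringIO("good 0.1 0.2 0.3\nbad -0.1 0.0 0.4\n")
vectors = load_text_vectors(sample, vocabulary={"good"})
print(vectors)
# → {'good': [0.1, 0.2, 0.3]}
```

Restricting the load to the vocabulary of the dataset at hand keeps memory use small, since only a fraction of the millions of pre-trained vectors is needed for a few thousand tweets.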

3.4 Deep Learning

In this paper, we use a deep Convolutional Neural Network (CNN) for tweet classification. CNN is a class of deep learning approaches that use a variation of multilayer perceptrons designed to require minimal preprocessing [25]. First, the tweet is tokenized and transformed into a list of word embeddings. Afterwards, the resulting sentence matrix is convolved with multiple filters of variable window size. Thus, local sentiment feature vectors are generated for each possible word window size. Then, the feature maps pass through a non-linear activation function into a 1-max-pooling layer. Finally, this pooling layer is densely connected to the output layer, which uses softmax activation to generate the probability of each sentiment class, with optional dropout regularization to prevent over-fitting. The architecture of the convolutional neural network used for sentiment classification is shown in Fig. 5.2.

Fig. 5.2 Architecture of Convolutional Neural Network (CNN) used
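The forward pass of such a network (convolution over word windows, non-linear activation, 1-max pooling, dense softmax output) can be sketched in plain Python. This is a didactic sketch under stated assumptions, not the implementation used in the experiments: the weights are random, the dimensions are tiny, all filters share one window size, and training and dropout are omitted. The sentence must contain at least as many words as the filter window, which in practice is ensured by padding.

```python
import math
import random


def conv1d_over_words(sentence, filt, bias):
    """Slide a filter spanning len(filt) consecutive words over the sentence matrix."""
    h = len(filt)
    feature_map = []
    for start in range(len(sentence) - h + 1):
        total = bias
        for k in range(h):
            total += sum(w * x for w, x in zip(filt[k], sentence[start + k]))
        feature_map.append(total)
    return feature_map


def relu(values):
    return [max(0.0, v) for v in values]


def softmax(scores):
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cnn_forward(sentence, filters, biases, out_weights, out_biases):
    # One pooled feature per filter: convolution -> ReLU -> 1-max pooling.
    pooled = [max(relu(conv1d_over_words(sentence, f, b)))
              for f, b in zip(filters, biases)]
    # Dense output layer followed by softmax over the sentiment classes.
    scores = [sum(w * p for w, p in zip(row, pooled)) + b
              for row, b in zip(out_weights, out_biases)]
    return softmax(scores)


# Hypothetical sizes: 7 words, 4-dimensional embeddings, 3 filters of window 5.
rng = random.Random(0)
dim, n_words, window, n_filters = 4, 7, 5, 3
sentence = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(n_words)]
filters = [[[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(window)]
           for _ in range(n_filters)]
biases = [0.0] * n_filters
out_weights = [[rng.uniform(-1, 1) for _ in range(n_filters)] for _ in range(2)]
out_biases = [0.0, 0.0]

probs = cnn_forward(sentence, filters, biases, out_weights, out_biases)
print(probs)  # two class probabilities that sum to 1
```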

4 Comparative Analysis and Discussion

In this paper, we experimented with a CNN for the sentiment analysis of three well-known datasets, using four different word embeddings. The aim of the paper is to evaluate the effect of pre-trained word embeddings, namely Word2Vec, Crawl GloVe, Twitter GloVe, and FastText, on the performance of Twitter classification. For all datasets, the same preprocessing steps were applied. Many experiments were run in order to determine the best settings for the CNN. As a result, we set a mini-batch size of 50. A learning-rate technique was applied to stochastic gradient descent, with a maximum of 100 training epochs. We tested various filter windows and settled on a filter window of 5. Moreover, we use the ReLU activation function for the convolution layer and softmax for the output layer. The evaluation metric is the accuracy in classifying positive and negative tweets, and the best CNN results are compared to the performance of baseline classifiers reported in a previous study [5].
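For convenience, the settings reported above and the accuracy metric can be written down as a small Python sketch. The class and field names are hypothetical and do not correspond to Weka options; only the values (mini-batch size 50, at most 100 epochs, filter window 5, ReLU and softmax activations) come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CnnSettings:
    """Settings reported in the text; field names are illustrative only."""
    batch_size: int = 50
    max_epochs: int = 100
    filter_window: int = 5
    conv_activation: str = "relu"
    output_activation: str = "softmax"


def accuracy(true_labels, predicted_labels):
    """Fraction of tweets whose predicted polarity matches the true polarity."""
    if not true_labels or len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must be non-empty and of equal length")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


settings = CnnSettings()
print(settings.batch_size, settings.filter_window)
print(accuracy(["pos", "neg", "pos", "neg"], ["pos", "neg", "neg", "neg"]))
# → 0.75
```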

Table 5.5 shows the accuracy (in percent) achieved by the CNN with each of the word embeddings used. We observe that FastText gives better performance than the other embeddings in the majority of experiments. Moreover, Twitter GloVe performs satisfactorily, despite having lower dimensionality than the other word vectors and being trained on a considerably smaller dataset. Its good accuracy may be due to the fact that the corpus used to train it originated from Twitter. On the other hand, Word2Vec seems not to be as effective as the other models.

Table 5.5 CNN accuracy using different word vectors on Twitter datasets

We further compare our method to five representative machine learning algorithms, namely Naïve Bayes, Support Vector Machine, k-Nearest Neighbor, Logistic Regression and C4.5. In [5], these classifiers were evaluated on the same datasets as the current research, and their performance was tabulated. We use the best CNN accuracy for each dataset in the comparison with the baseline algorithms. As shown in Fig. 5.3, the deep learning technique outperforms the other machine learning methods on all the datasets.

Fig. 5.3 Comparison of accuracy values between our best CNN version and baseline approaches

5 Conclusion and Future Work

Deep learning and Big Data analytics are, nowadays, two prominent topics in data science. Big Data analytics is important for organizations that need to extract information from the huge amounts of data collected through social networking services. Microblogging is one such service, with Twitter being its most representative platform. Through Twitter, millions of users exchange public messages, called “tweets”, every day, expressing their opinions and feelings towards various issues. This makes Twitter one of the largest and most dynamic Big Data sets for data mining and sentiment analysis. Deep learning is a valuable tool for Big Data analytics. Deep learning models have achieved remarkable results in a diversity of fields, including natural language processing and classification. Thus, in this paper, we employ a deep learning approach for Twitter sentiment classification, with the intention of evaluating different settings of word embeddings.

The development and use of word embeddings is an essential task in the classification process, as deep learning techniques require numeric values as input, and feeding the network with the proper input can boost the performance of the algorithms. In this work, we examined four well-known pre-trained word embeddings, namely Word2Vec, Crawl GloVe, Twitter GloVe and FastText, which have been developed by reliable research teams and have been used extensively in deep learning projects. The results show that FastText provides more consistent results across datasets. Moreover, Twitter GloVe obtains very good accuracy rates despite its lower dimensionality.

Part of our future work is to study word embeddings trained from scratch on Twitter data using different parameters, such as the training model, the dimensionality and the relevance of the vocabulary to the test dataset, and to evaluate their effect on classification performance. Another interesting research avenue is the comparative analysis of different deep learning approaches with regard to the algorithm structure, layers, filters, functions etc., in order to identify the proper network settings.