1 Introduction

Natural Language Processing (NLP) is an area of artificial intelligence that enables computers to read, understand and process human language. It is a discipline focused on building a bridge between natural language data and data science. Natural language data generated from conversations, videos, speeches, image captions, etc. is unstructured, as it does not fit the conventional row-and-column arrangement of a database. NLP is a set of algorithms and techniques for extracting meaningful information from such data by adding the necessary structure to it (Collobert et al. 2011). It can be divided into the following three sets of approaches:

  1. Rule based: Bird et al. (2009) stated that rule-based systems perform sentiment analysis using a hand-crafted set of rules. As these systems are based on linguistic structures resembling the human way of building grammar, they tend to focus on decoding the linguistic relationships between words to interpret the polarity of a piece of text. The rule-based approach comprises a definite set of rules employing traditional NLP techniques such as part-of-speech tagging, stemming, tokenization, lexicons, etc.

  2. Statistics based: Manning et al. (1999) explained that the statistical approach does not rely on hand-crafted rules but on a traditional machine learning perspective, including probabilistic modeling, likelihood maximization, linear classifiers, etc. This approach is based on statistical models such as Hidden Markov Models, Perceptrons, Logistic Regression, etc. It treats the sentiment task as a classification problem and feeds it to a classifier. A statistical hypothesis, in general, assumes that the data is generated according to some unknown probability distribution and draws inferences from it. The classifier predicts the most probable outcome from the distribution as the corresponding label.

  3. Neural network based: Neural networks are designed to find generalized predictive patterns, as they do not rest on hypotheses about the correlations among variables. According to Young et al. (2018), this is the biggest advantage of neural networks over traditional NLP techniques. Feature engineering is skipped in these networks, as they automatically learn to capture the important features of the data (Jean-François 2017). Neural networks commonly employed in NLP include Recurrent Neural Networks (RNN), Convolutional Neural Networks (CNN), Long Short-Term Memory (LSTM) networks, etc. These state-of-the-art networks (Bouazizi and Ohtsuki 2017; Chen et al. 2017) find applications in many areas such as text analytics, sentiment analysis, speech recognition, financial trading, etc.

For computers to precisely comprehend unstructured data, it is necessary for them to understand the sentiments or contextual meanings hidden in natural language (Collobert et al. 2011). The present paper focuses on this task, known as sentiment analysis.

Sentiment analysis (Pathak et al. 2020) refers to automatically analyzing sentiments and assigning them to certain classes. It can be performed at three major levels: aspect level, sentence level and document level.

  • Aspect level sentiment analysis is a text mining technique that divides the text into different attributes and allots each attribute a sentiment class (Gupta et al. 2019).

  • Sentence level sentiment analysis involves dividing the sentence into two classes: subjective and objective (Gupta et al. 2019). The subjective part comprises emotions, beliefs, views, etc., whereas the objective part consists of fact-based data.

  • Document level sentiment analysis (Wu et al. 2020) considers the whole document as the fundamental data block. It focuses on extracting sentiment(s) from the document to determine the overall opinion of the document or of any particular entity within it (Gupta et al. 2019).

With the proliferation of social media, multimodal sentiment analysis is set to bring new opportunities: complementary data streams allow improving on, and going beyond, text-based sentiment analysis.

Soleymani et al. (2017) illustrated that sentiment analysis techniques can be applied to any mode of data, such as audio, video or text. The proposed work applies automated analysis techniques to text data at the sentence level to detect the polarity of labelled text solely based on its content. Experimental results are obtained on three datasets, which help in testing the overall effectiveness of our model.

The comprehensive contributions of this paper are as follows:

  1. We propose a bidirectional model, inspired by ULMFit, which is based on transfer learning.

  2. We demonstrate experimental results for two important aggregation architectures used to extract features from embedding sequences, namely Concat Pool and Attention, each with and without zeta. Incorporating zeta into the model strengthens the performance of both aggregation architectures relative to the case where it is not used. It is also observed that, irrespective of the scaling parameter, Concat Pool performs better than the Attention mechanism.

  3. We find that our proposed solution is an excellent and novel substitute for the Attention mechanism. Moreover, we do not confine the comparison to this mechanism alone; we also compare our proposed model with other state-of-the-art deep and non-deep learning models.

The organization of the present paper is as follows: Sect. 2 provides an insight into previous work related to the paper; Sect. 3 lists the datasets used in the research work along with their detailed study; Sect. 4 reviews some of the index terms used in the research work for better understanding; Sect. 5 introduces the proposed model; Sect. 6 presents the experimental results of our proposed model and its comparison with other state-of-the-art models; Sect. 7 summarizes the paper and provides suggestions for future work.

2 Literature survey

There has been a lot of research in the field of sentiment classification, making it one of the most popular tasks in NLP (Cambria 2016). The advancements include traditional methods, deep learning methods, transfer learning methods, etc., based on different sentiment classification algorithms. As stated by Wang et al. (2020) and Mirończuk and Protasiewicz (2018), these methods have evolved over time to classify sentiments accurately in accordance with their contextual meanings. The present section discusses some of the significant NLP approaches used for sentiment classification.

2.1 Traditional methods for sentiment analysis

In recent years, several machine learning techniques based on shallow models, such as Support Vector Machines, Logistic Regression, etc. (Jiang et al. 2018), have been employed for sentiment analysis. These models are trained on a limited number of hand-crafted features, which have to be identified by a domain expert in order to reduce data complexity and make the problem tractable for these traditional techniques. Collobert et al. (2011) showed that state-of-the-art techniques in various NLP tasks, such as Named-Entity Recognition (NER), Part-of-Speech (POS) tagging and semantic role labelling, are outperformed by a simple deep learning model. For modeling tedious NLP tasks, statistical NLP turned out to be one of the primary options (Haddoud et al. 2016; de Araujo et al. 2020). However, in its early phases it often suffered from the curse of dimensionality while learning joint probability functions of language models, as stated by Bengio et al. (2003). This motivated learning distributed representations of words in a low-dimensional space. Mikolov et al. (2013b) proposed novel architectures for producing high-quality word representations, yielding more accurate results than those obtained from traditional techniques.

2.2 Deep learning methods for sentiment analysis

Distributional vectors follow the distributional hypothesis, which states that words with similar semantics are likely to occur in similar contexts. Semantically similar words therefore lie close together in the vector space (Mikolov et al. 2013b); the similarity between these vectors is typically measured using cosine similarity.

Word embeddings can be pretrained in such a way that they capture semantic information and give predictions based on the contextual meanings of words. Owing to their low dimensionality, these embeddings have proved to be fast and efficient in capturing conceptual semantics (Cambria et al. 2017; Turney and Pantel 2010). In recent years, the models producing these word embeddings have remained simple, so deep learning models (Rane and Kumar 2018; Saif et al. 2016) were not strictly needed to create quality embeddings. Nevertheless, word embeddings are responsible for state-of-the-art results in NLP tasks in which words, phrases, etc. are represented using these embeddings (Jianqiang and Xiaolin 2017). For instance, Glorot et al. (2011) employed embeddings along with stacked autoencoders for domain adaptation in sentiment analysis, whereas Hermann and Blunsom (2013) learnt sentence composition by introducing Combinatory Categorial Autoencoders (CCAE). Developments in NLP over these years marked the foundation of research in distributed representations.

Mikolov et al. (2013a) developed a method for creating word embeddings known as word2vec. It includes two models, namely Continuous Bag of Words (CBOW) and Skip-Gram, which are used to obtain word embeddings. In the CBOW architecture, the distributed representations of the words surrounding the target position are combined to predict the current target word. Skip-Gram does the opposite: it predicts the surrounding words given the target word. Pennington et al. (2014) showed that these word2vec methods do not use corpus statistics efficiently, because they train on restricted context windows rather than on global co-occurrence information. They proposed a new model called GloVe, which directly captures the global structure of the data. GloVe builds a global word-to-word co-occurrence matrix, relating word embeddings to the co-occurrences of words over the whole corpus. A toy illustration of the two word2vec training modes is sketched below.
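The following minimal sketch uses the gensim library (not used in this paper) to contrast CBOW and Skip-Gram training; the toy corpus, vector size and gensim 4.x API are assumptions chosen purely for demonstration.

```python
# Minimal word2vec sketch with gensim (illustrative only; not the paper's setup).
from gensim.models import Word2Vec

toy_corpus = [
    ["the", "movie", "was", "great"],
    ["the", "film", "was", "terrible"],
    ["a", "great", "film", "overall"],
]

# sg=0 selects CBOW (predict the target word from its context);
# sg=1 selects Skip-Gram (predict the context words from the target word).
cbow = Word2Vec(sentences=toy_corpus, vector_size=50, window=2, min_count=1, sg=0)
skipgram = Word2Vec(sentences=toy_corpus, vector_size=50, window=2, min_count=1, sg=1)

print(cbow.wv["film"].shape)                      # (50,) dense embedding
print(skipgram.wv.most_similar("film", topn=2))   # nearest neighbours in vector space
```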

In 2016, Facebook developed fastText, an extension of the vanilla word2vec model. It takes word parts into account and learns representations for sub-words, also known as character n-grams. Bojanowski et al. (2017) explained how fastText exploits morphology, which strengthens learning in highly inflected languages. Static embeddings, however, come with limitations, as discussed in Sect. 4.1; as a result, they fail to capture higher-level information.

2.3 Transfer learning methods for sentiment analysis

McCann et al. (2017), for instance, utilized the encoder of a supervised neural machine translation model to extract context for word embeddings; the resulting contextualized word vectors are then concatenated with the pretrained word embeddings. Neelakantan et al. (2015), on the other hand, trained a separate vector for each individual word sense, i.e., multiple embeddings per word were learned. These techniques eliminated the issue of missing context; however, they could not remove the need to train the actual task model from scratch.

Many machine learning algorithms assume that the training data and future data have the same dimensions and lie in the same vector space (Pan and Yang 2009). However, this does not hold in real-life applications. This assumption was relaxed by a concept in which knowledge gained in one domain of interest is transferred to another. Transfer learning therefore emerged as a new learning strategy for developing deep learning models (Liu et al. 2019).

Peters et al. (2018) introduced deep contextualized word representations known as ELMo (Embeddings from Language Models), capable of modeling complex characteristics of semantics. They showed that the ELMo bi-LSTM can be trained on a large dataset and then used as a component in other NLP models. The vectors assigned to a token are a learned function of the internal states of the language model; in other words, they are context-dependent, and the same word can have different vectors in different contexts. This method still trains the main task model from scratch and treats the pre-trained embeddings as fixed parameters, thereby restricting its success.

Howard and Ruder (2018) proposed the ULMFit (Universal Language Model Fine-tuning) model. It performs language modeling on a massive corpus and uses this language model as the backbone for fine-tuning a classifier. The architecture used by ULMFit for language modeling is AWD-LSTM, which stands for Average Stochastic Gradient Descent Weight-Dropped LSTM. To apply this concept to other datasets, the parameters of the language model are fine-tuned and a classifier layer is appended and trained. To delve into contextualization, an intuitive way of finding the alignment between words was introduced by Bahdanau et al. (2015); it is performed by visualizing the weights corresponding to the annotations.

Peters et al. (2017) extracted contextual features in the right-to-left and left-to-right directions using an LSTM-based deep learning model. Devlin et al. (2019) instead developed an attention-based model called BERT (Bidirectional Encoder Representations from Transformers) for training deep bidirectional representations. This architecture extracts contextual features jointly from the left and right context, rather than from each direction separately, and allows a high level of parallelism.

3 Datasets

Sentiment detection relies on text data. For our experiments, we have chosen three datasets commonly used in sentiment analysis: IMDb (Maas et al. 2011), US Airline Twitter (Crowdflower 2016) and Real Life Deception Detection (Pérez-Rosas et al. 2015). A brief description of the datasets is presented below.

  1. IMDb: A movie review dataset containing 50,000 movie reviews for binary sentiment classification. Each sample in this dataset is a text document; the samples are combined to form training and test files. The reviews carry binary labels, namely positive and negative, as shown in Table 1 below.

  2. US Airline Twitter: This dataset was first released by Crowdflower in 2015 and comprises tweets about major US airlines such as United, US Airways, Southwest, Delta and Virgin America. The tweets are classified into three categories: positive, negative and neutral. The present paper uses the tweets and their labels for sentiment classification, as shown in Table 1 below.

  3. Real Life Deception Detection (RLDD): A multimodal dataset comprising real-life videos of courtroom trials. It aggregates text and visual data divided into two classes: deceptive and truthful. The videos have been sourced from various YouTube channels. We use the transcripts of the courtroom trials to build the required dataset; a text document is created using both truthful and deceptive cases (refer to Table 1 for details).

Table 1 A brief outline of statistics of experimental datasets

4 Index terms

4.1 Word embeddings

According to Le and Mikolov (2014), the concept of word embeddings has been one of the outstanding developments in the field of Natural Language Processing over the last few years. Word embeddings are also known as distributed representations of text in an n-dimensional space; they primarily bridge the human understanding of language and that of a machine. They are a set of strategies in which tokens are mapped to real-valued vectors in a meaningful vector space, such that the distance between vectors reflects the semantic similarity between the tokens. The vector values are learned in a manner analogous to training a neural network.

There are many word embeddings used by researchers nowadays, but the present paper focuses on two kinds of pre-trained embeddings, namely GloVe (Pennington et al. 2014) and fastText (Joulin et al. 2017). We selected these embeddings because they are pre-trained on large corpora and can be employed in diverse downstream tasks such as named entity recognition, part-of-speech tagging, language modeling, etc.


GloVe


GloVe stands for Global Vectors for word representation. It is an unsupervised learning algorithm developed by Pennington et al. (2014). GloVe finds its prime application in generating word embeddings based on the co-occurrence information of words present in a corpus. The embeddings are built by training on aggregated global word-to-word co-occurrence statistics from the corpus, and the resulting vectors exhibit interesting linear substructures of the vector space. The present paper uses 300-dimensional GloVe embeddings trained on 840 billion tokens for sentiment classification. A minimal loading-and-similarity sketch follows.
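The sketch below loads plain-text GloVe vectors and compares two words with cosine similarity; the file name `glove.840B.300d.txt` and the word list are assumptions for illustration, not part of our pipeline.

```python
# Sketch: load plain-text GloVe vectors and compare words with cosine similarity.
# The file name 'glove.840B.300d.txt' is assumed; adjust it to your local copy.
import numpy as np

def load_glove(path, vocab=None, dim=300):
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            if len(parts) != dim + 1:        # skip malformed lines
                continue
            word, vec = parts[0], np.asarray(parts[1:], dtype=np.float32)
            if vocab is None or word in vocab:
                vectors[word] = vec
    return vectors

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

glove = load_glove("glove.840B.300d.txt", vocab={"good", "great", "terrible"})
print(cosine(glove["good"], glove["great"]))      # expected to be high
print(cosine(glove["good"], glove["terrible"]))   # expected to be lower
```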


fastText


GloVe embeddings were a remarkable development in NLP, but they cannot generalize to words absent from the GloVe vocabulary. In 2016, Facebook's AI Research team released fastText, a library for learning word embeddings and performing text classification. According to Joulin et al. (2017), this library can be used to build unsupervised or supervised models to obtain word vectors. Their results showed major improvements of fastText over other embeddings such as word2vec, GloVe, etc. The improvement stems from dividing each word into sub-words, known as character n-grams (where n is the length of a character sub-sequence). The building blocks in this case are sub-words or n-grams rather than whole words or sentences, which allows generalization to unseen words as long as their character sequences appeared in the training data. The decomposition of a word into character n-grams is illustrated below.
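This toy sketch only enumerates the character n-grams of a word with the boundary markers described by Bojanowski et al. (2017); fastText itself additionally hashes these n-grams into embedding buckets, which is omitted here.

```python
# Sketch: decompose a word into character n-grams, as in fastText (Bojanowski et al. 2017).
def char_ngrams(word, n_min=3, n_max=6):
    token = f"<{word}>"  # boundary symbols mark the start and end of the word
    grams = set()
    for n in range(n_min, n_max + 1):
        for i in range(len(token) - n + 1):
            grams.add(token[i:i + n])
    return grams

print(sorted(char_ngrams("where", n_min=3, n_max=3)))
# ['<wh', 'ere', 'her', 're>', 'whe']
```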

Although pre-trained embeddings have immensely influenced research in NLP, they still have the following major limitations:

  • Static embeddings presume that a word has the same interpretation across the corpus; in other words, they do not support polysemy. Due to this limitation, the contextual meaning of the word is lost (Wang et al. 2020).

  • When the embeddings are loaded into a model, only the embedding layer is pre-trained and not the hidden layers (Yosinski et al. 2014). Changes made in the embeddings do not propagate to the hidden and output layers of the model, so these layers still have to be trained from scratch. Thus, pre-trained embeddings fail to capture higher-level information.

As a result of the above-mentioned limitations, researchers felt the need for a trained model capable of capturing the contextual meanings of words in a corpus. They introduced a technique known as transfer learning, wherein models are pre-trained on a large corpus and fine-tuned for specific NLP tasks. The knowledge gained in the pre-training step is utilized during fine-tuning on the target task. The strength of this technique lies in accelerating the training of a model by reusing modules of previously developed models. These trained models are open-sourced and have proved successful in various NLP classification challenges (Liu et al. 2019; Zheng et al. 2020).

4.2 Pre-trained models


ULMFit


Howard and Ruder (2018) proposed ULMFit (Universal Language Model Fine-Tuning), a pre-trained model whose backbone is transfer learning. ULMFit is fully implemented in the fastai library, which simplifies training fast and accurate neural nets using modern best practices. A brief outline of ULMFit is presented below; a code sketch of the full workflow follows the list.

  1. Language model pre-training on a general-domain corpus: Text data used in the model has to be tokenized and encoded, and the unique tokens are ordered by frequency from most to least often used. The text data is fed to the embedding layer of the model. The embedding matrix holds a vector for each token in the general-domain corpus, WikiText-103; the encoded tokens in the corpus are matched to their corresponding vectors using one-hot encoding. After the embedding layer, the text passes through three stacked LSTM layers known as Average Stochastic Gradient Descent Weight-Dropped LSTM (AWD-LSTM), where the model training takes place. The output of the embedding layer is a tensor of word embeddings, which is fed as input to the first LSTM layer and passed on to the second and third LSTM layers. The hidden state of the last LSTM layer has the same shape as the embedded input, and a softmax function is applied to its output to obtain the corresponding probabilities. This first step is computationally expensive, which is why fastai has made the pre-trained model publicly available (Howard and Ruder 2018).

  2. Language model fine-tuning on the target task: The pre-trained model is fine-tuned on the target task in this step. To achieve this, the target dataset has to be pre-processed and the AWD-LSTM rebuilt in order to load the weights of the language model. The embeddings also have to be adjusted to the target dataset before recalibrating the AWD-LSTM and loading the weights. However, the recurrent connections of an LSTM are particularly susceptible to overfitting, and applying ordinary dropout to the LSTM's hidden state compromises its ability to retain long-term dependencies.

    To resolve this problem, researchers derived an alternative to dropout known as DropConnect (Merity et al. 2018). Unlike classical dropout, where a random subset of activations is set to zero, DropConnect selects a random subset of weights and sets them to zero. As a result, each unit receives input from a random subset of units in the previous layer. Figures 1, 2 and 3 illustrate the difference between dropout and DropConnect relative to a standard neural network built from three LSTM layers; the LSTM blocks inside these layers act as the activations of the network.

    The mathematical formulation of an LSTM can be expressed as the set of subequations (a)–(f) of Eq. (1):

    $$\begin{aligned} {\begin{array}{ll} i_t = \sigma (W^ix_t + U^ih_{t-1}) &{}\qquad (\mathrm{a}) \\ f_t = \sigma (W^fx_t + U^fh_{t-1}) &{}\qquad (\mathrm{b}) \\ o_t = \sigma (W^ox_t + U^oh_{t-1}) &{} \qquad (\mathrm{c}) \\ {\tilde{c}}_t = \tanh (W^cx_t + U^ch_{t-1}) &{}\qquad (\mathrm{d}) \\ c_t = i_t \odot {\tilde{c}}_t + f_t \odot c_{t-1} &{} \qquad (\mathrm{e}) \\ h_t = o_t \odot \tanh (c_t) &{}\qquad (\mathrm{f}) \end{array}} \end{aligned}$$
    (1)

    where [\(W^i\), \(W^f\), \(W^o\), \(U^i\), \(U^f\), \(U^o\)] are weight matrices, \(x_t\) is the input vector at timestep t, \(h_t\) is the current exposed hidden state, \(c_t\) is the memory cell state, and \(\odot\) denotes element-wise multiplication. Merity et al. (2018) apply DropConnect to the hidden-to-hidden weight matrices [\(U^i\), \(U^f\), \(U^o\)] instead of to the hidden or memory states, thereby preventing overfitting on the recurrent connections of the LSTM. Howard and Ruder (2018) explained that training the entire model at once leads to catastrophic forgetting, as the three LSTMs still hold the old weights of the pre-trained language model. It is therefore necessary to ‘freeze’ the weights of the stacked LSTM layers while the embedding and output layers are trained and adjusted to the LSTM weights. After this, all the AWD-LSTM layers are unfrozen for fine-tuning. In this way the language model learns task-specific features of the language.

  3. Target task classification: Sentiment classification is performed by adjusting the language-model architecture. To achieve this, two linear blocks with ReLU and softmax activations, respectively, are appended to the stacked LSTMs. The resulting sentiment classification of the text is obtained as probabilities from the softmax layer.
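The three stages above can be reproduced with the fastai v1 text API used in this work. The following is a minimal sketch rather than our exact training script: the CSV file name, column names, epoch counts and learning rates are placeholder assumptions.

```python
# Sketch of the three ULMFit stages with the fastai v1 text API (paths/columns assumed).
from fastai.text import *  # TextLMDataBunch, language_model_learner, AWD_LSTM, ...

# Stages 1-2: an AWD-LSTM language model pre-trained on WikiText-103 is downloaded
# by fastai and fine-tuned on the target corpus ('texts.csv' is a placeholder).
data_lm = TextLMDataBunch.from_csv('.', 'texts.csv', text_cols='text', label_cols='label')
lm_learner = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.5)
lm_learner.fit_one_cycle(1, 1e-2)          # train embedding/output layers first
lm_learner.unfreeze()
lm_learner.fit_one_cycle(3, 1e-3)          # then fine-tune all layers
lm_learner.save_encoder('ft_encoder')

# Stage 3: reuse the fine-tuned encoder in a classifier and unfreeze gradually.
data_clas = TextClasDataBunch.from_csv('.', 'texts.csv', text_cols='text',
                                       label_cols='label',
                                       vocab=data_lm.train_ds.vocab, bs=32)
clf = text_classifier_learner(data_clas, AWD_LSTM, drop_mult=0.5)
clf.load_encoder('ft_encoder')
clf.fit_one_cycle(1, 1e-2)
clf.freeze_to(-2)                          # gradual unfreezing
clf.fit_one_cycle(1, slice(1e-3, 1e-2))
clf.unfreeze()
clf.fit_one_cycle(2, slice(1e-4, 1e-3))
```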

Fig. 1 An arrangement of three stacked, interconnected LSTM layers containing multiple LSTM blocks, constituting a neural network. The nodes of this network are the LSTM blocks and the weights are their interconnections

Fig. 2 Dropout method applied to a neural network, where a random subset of activations is set to zero

Fig. 3 DropConnect method applied to a neural network, where a random subset of weights is deactivated


Attention in ULMFit


Soft attention is convenient because it addresses the entire input state space and keeps the model fully differentiable and deterministic (Vaswani et al. 2017; Xu et al. 2015). It lets the model decide how much attention should be paid to each token using cosine similarity, computed by representing the tokens and the query in the same vector space. Since the cosine distance is differentiable with respect to its inputs, the final model remains differentiable. In the soft attention mechanism, gradients can be computed directly, unlike hard attention, where they are estimated through a stochastic process. This mechanism therefore fits well into any existing model, with gradients propagating through it and the remaining layers of the neural network.

The soft attention mechanism compares the source and target states to generate attention scores, denoted by \(\alpha\) (Bahdanau et al. 2015). These scores signify how well two words are aligned at a particular position t with respect to a source sequence x and a target sequence y, as shown in Eq. (2). A softmax activation normalizes these alignment scores (Eq. (3)), which are then used to generate a context vector (Eq. (4)); here s and h are the attention mechanism's states. Equation (5) shows how the scores are calculated from the context vector (\(c_t\)), where \(W_c\) and \(b_c\) are a weight matrix and a bias to be learned. A minimal sketch of this pooling operation is given after the equations.

$$\begin{aligned}&\alpha _{t,i}= align(y_t,x_i) \end{aligned}$$
(2)
$$\begin{aligned}&\alpha _{t,i} = \frac{\exp (score(s_{t-1}, h_i))}{\sum _{i'=1}^{n} \exp (score(s_{t-1}, h_{i'}))} \end{aligned}$$
(3)
$$\begin{aligned}&c_t= \sum _{i=1}^{n} \alpha _{t,i}\, h_i \end{aligned}$$
(4)
$$\begin{aligned}&score(s_t,h_i)= \tanh (W_c\, c_t + b_c) \end{aligned}$$
(5)
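The following PyTorch sketch illustrates soft-attention pooling over a sequence of hidden states in the spirit of Eqs. (2)–(5); the tensor shapes, the learned vector v and the random toy inputs are assumptions for illustration and may differ in detail from the implementation used in our experiments.

```python
# Minimal PyTorch sketch of soft-attention pooling over LSTM hidden states (cf. Eqs. 2-5).
# Shapes are assumptions: H has shape (batch, seq_len, dim).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftAttentionPool(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.W_c = nn.Linear(dim, dim)           # W_c and b_c as in Eq. (5)
        self.v = nn.Parameter(torch.randn(dim))  # word-level latent vector v

    def forward(self, H):
        u = torch.tanh(self.W_c(H))                 # (batch, seq_len, dim)
        scores = u @ self.v                         # unnormalized alignment scores
        alpha = F.softmax(scores, dim=1)            # Eq. (3): normalized weights
        context = (alpha.unsqueeze(-1) * H).sum(1)  # Eq. (4): weighted sum c_t
        return context, alpha

pool = SoftAttentionPool(dim=400)
H = torch.randn(2, 10, 400)   # toy batch of hidden-state sequences
c, alpha = pool(H)
print(c.shape, alpha.shape)   # torch.Size([2, 400]) torch.Size([2, 10])
```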

BERT


BERT (Bidirectional Encoder Representations from Transformers) is a technique developed by Google in 2018. BERT finds its roots in pre-trained contextual representations such as ELMo (Peters et al. 2018), ULMFit, Generative Pre-Training, Semi-Supervised Sequence Learning, etc. Unlike the above-mentioned representations, where the text is processed either left-to-right, right-to-left, or with the two directions trained separately and concatenated, BERT is considered the first deeply bidirectional language representation. Devlin et al. (2019) brought the concept of bidirectional training of a well-known attention model, the Transformer, to language modelling. The Transformer comprises two distinct functional blocks: an encoder, which reads the text input, and a decoder, which produces a prediction for the task. Rather than processing the tokens in a sequence one by one, attention models process each token in relation to all other tokens in that sequence. According to the authors, bidirectional training is achieved through an approach known as Masked Language Modelling, through which language models acquire an insightful sense of language context and flow.

5 Proposed model

The present manuscript proposes a new model inspired by the pre-trained model ULMFit. Our model follows an ensemble strategy of forward and backward language models; both are versions of the same proposed architecture, as shown in Fig. 4 and discussed later.


Algorithm 1 elaborates our proposed model in three steps. As shown in the algorithm, we perform both forward and backward probability modeling of a sequence of tokens. All the steps of the algorithm are explained in detail in Sects. 5.1 and 5.2.

5.1 Forward language model

The probability of a token \(t_s\) at position s in a token sequence under the forward language model is given by Eq. (6):

$$\begin{aligned} p(t_1, t_2, ..., t_N) = \prod _{s=1}^{N} p(t_s|t_1, t_2, ..., t_{s-1}) \end{aligned}$$
(6)

where the factor \(p(\cdot \mid \cdot )\) denotes the conditional probability of a token given its context; the same notation is used for the forward and backward token sequences. A toy illustration of this factorization follows.
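As a toy illustration of Eq. (6), the joint probability of a sequence is the product of the per-step conditional probabilities, usually accumulated in log space for numerical stability; the numbers below are made up.

```python
# Toy illustration of Eq. (6): p(t_1,...,t_N) = product of p(t_s | t_1,...,t_{s-1}).
import math

cond_probs = [0.20, 0.35, 0.10, 0.50]   # p(t_1), p(t_2|t_1), p(t_3|t_1,t_2), ...
log_prob = sum(math.log(p) for p in cond_probs)
print(math.exp(log_prob))   # product of the factors, computed stably in log space
```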

For sentiment analysis, our language model is pre-trained on a massive generic corpus, WikiText-103 (Howard and Ruder 2018). Pre-training commences with pre-processing of the data. Although many word embeddings are available for pre-processing, our language model is trained to obtain representations for full sentences in addition to the distributed representations of the tokens inside them.

For the same sequence of tokens as in Eq. (6), each word is converted into an embedding vector \(w_s^F\), encoded as a column of an embedding matrix \(W_s^{F_{emb}}\); each column corresponds to the embedding of one word in the vocabulary. The databunch object in fastai performs the pre-processing in the background. The order of the words affects the semantics of each word, owing to its dependence on the preceding words. We therefore employ three stacked LSTM layers (AWD-LSTM) after the embedding layer, where the word embeddings are fed. The hidden state \(H_j^F\) of dimension D, having the same shape as the embedded input, is obtained as the output of the last AWD-LSTM layer, as shown in Eq. (7):

$$\begin{aligned} H_j^F= {LSTM}_j(H_{j-1}^F, w_s^F) \end{aligned}$$
(7)

where the subscript ‘j’ distinguishes the three stacked LSTM layers. Finally, the hidden state is multiplied with a decoder matrix. Eqs. (8), (9) illustrate the softmax function transforming all values in the decoder matrix into probabilities.

$$\begin{aligned} u_s^F= & {} \tanh (W_s^{F_{emb}} H_j^F + b^F) \end{aligned}$$
(8)
$$\begin{aligned} a_s^F= & {} softmax (v^T u_s^F ) \end{aligned}$$
(9)

where \(W_s^{F_{emb}} \in R^{D*D}\), v is a word-level latent vector and \(b^F \in R^D\); therefore \(u_s^F \in R^D\) and \(a_s^F \in R\).

After the first stage of pre-training, we fine-tune the pre-trained forward language model on the task-specific datasets mentioned in Sect. 3. Two strategies are used for fine-tuning the language model: discriminative fine-tuning and slanted triangular learning rates. Discriminative fine-tuning is based on the fact that different layers of a neural network should have different learning rates, as these layers capture different types of information (Yosinski et al. 2014). Thus, we can tune each layer with a different learning rate. Stochastic gradient descent (SGD) for the model parameters \(\theta\) at time step t with a learning rate \(\eta\) is given by Eq. (10):

$$\begin{aligned} \theta _t = \theta _{t-1} - \eta \cdot \nabla _\theta J(\theta ) \end{aligned}$$
(10)

The model parameters are split into \(({\theta ^1, \theta ^2, ..., \theta ^L})\), where L is the number of layers in the network and \(\theta ^l\) contains the parameters of the lth layer. Analogously, we obtain \(({\eta ^1, \eta ^2, ..., \eta ^L})\) for the learning rates, where \(\eta ^l\) is the learning rate of the lth layer. Equation (11) shows the updated SGD rule (SGD with discriminative fine-tuning):

$$\begin{aligned} \theta _t^l = \theta _{t-1}^l - \eta ^l \cdot \nabla _{\theta ^l} J(\theta ) \end{aligned}$$
(11)
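The per-layer update of Eq. (11) can be realized with optimizer parameter groups. The following PyTorch sketch is illustrative only; the two-layer toy model is an assumption, and the decay factor of 2.6 between layers follows Howard and Ruder (2018) rather than being fixed by our experiments.

```python
# Sketch of discriminative fine-tuning (Eq. 11): each layer gets its own learning rate.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Embedding(100, 16), nn.LSTM(16, 32, batch_first=True))
base_lr, decay = 1e-2, 2.6

layers = list(model.children())               # [embedding, lstm]
param_groups = [
    # earlier layers receive smaller learning rates, the last layer gets base_lr
    {"params": layer.parameters(), "lr": base_lr / (decay ** (len(layers) - 1 - i))}
    for i, layer in enumerate(layers)
]
optimizer = torch.optim.SGD(param_groups, lr=base_lr)
for group in optimizer.param_groups:
    print(group["lr"])    # later layers train with larger learning rates
```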

The slanted triangular learning rate is based on the observation that the model should converge quickly to a suitable region of the parameter space when training starts, in order to adapt the model's parameters to task-specific features; an annealed learning rate cannot achieve this behaviour. We therefore use the slanted triangular learning rate shown in Eq. (12), which increases the learning rate linearly in an initial phase and then decays it according to a follow-up schedule.

$$\begin{aligned} cut= & {} \big \lfloor T \cdot cut\_frac \big \rfloor \\ p= & {} \left\{ {\begin{array}{ll} t/cut,&{}\quad \text{if } t\le cut\\ 1-\frac{t-cut}{cut\cdot (\frac{1}{cut\_frac}-1)}, &{}\quad \text{otherwise} \end{array}}\right. \\ \eta _t= & {} \eta _{max} \cdot \frac{1 + p \cdot (ratio -1 )}{ratio} \end{aligned}$$
(12)

where cut is the iteration at which we switch from the increasing phase to the decay schedule, T is the total number of training iterations, \(\eta _t\) is the learning rate at iteration t, ratio specifies how much smaller the lowest learning rate is compared with the maximum \(\eta _{max}\), p is the fraction of iterations for which the learning rate has been increased or decreased, and \(cut\_frac\) is the fraction of iterations during which the learning rate rises.
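A minimal sketch of the schedule of Eq. (12) follows; the default values cut_frac = 0.1 and ratio = 32 are taken from Howard and Ruder (2018) and are assumptions rather than the exact values used in every run here.

```python
# Sketch of the slanted triangular learning-rate schedule (Eq. 12).
def stlr(t, T, eta_max=0.01, cut_frac=0.1, ratio=32):
    cut = int(T * cut_frac)                       # iteration where decay begins
    if t <= cut:
        p = t / cut                               # linear warm-up phase
    else:
        p = 1 - (t - cut) / (cut * (1 / cut_frac - 1))  # linear decay phase
    return eta_max * (1 + p * (ratio - 1)) / ratio

T = 100
schedule = [stlr(t, T) for t in range(1, T + 1)]
print(max(schedule), schedule[-1])   # peaks at eta_max, decays toward eta_max/ratio
```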

Now, the final task is to build a classifier using the \(forward\_encoder\) saved in the language-model fine-tuning step described around Eqs. (11) and (12). The classifier is also fine-tuned using the two strategies mentioned above. The recurrent weights of the AWD-LSTM are scaled by a factor zeta, denoted ‘\(\zeta\)’, which lies in the range (0,1]. This scaling factor generalizes the concept of DropConnect in the stacked LSTM layers.

Subequations (a)–(f) of Eq. (13) show the mathematical formulation of the LSTM of Eq. (1) after scaling:

$$\begin{aligned} {\begin{array}{ll} i_t = \sigma (W^ix_t + \zeta *U^ih_{t-1}) &{} \qquad (\mathrm{a}) \\ f_t = \sigma (W^fx_t + \zeta *U^fh_{t-1}) &{} \qquad (\mathrm{b}) \\ o_t = \sigma (W^ox_t + \zeta *U^oh_{t-1}) &{} \qquad (\mathrm{c}) \\ {\tilde{c}}_t = \tanh (W^cx_t + \zeta *U^ch_{t-1}) &{}\qquad (\mathrm{d}) \\ c_t = i_t \odot {\tilde{c}}_t + f_t \odot c_{t-1} &{}\qquad (\mathrm{e}) \\ h_t = o_t \odot \tanh (c_t) &{}\qquad (\mathrm{f}) \\ \end{array}} \end{aligned}$$
(13)
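To make the scaling concrete, the following PyTorch sketch implements a single LSTM cell with the recurrent matrices multiplied by \(\zeta\), as in Eq. (13). It is a toy illustration rather than the AWD-LSTM implementation used in our experiments; the shapes and random initialisation are arbitrary assumptions.

```python
# Sketch of an LSTM cell with the hidden-to-hidden weights scaled by zeta (Eq. 13).
# With zeta = 1 the cell reduces to a standard LSTM cell; with 0 < zeta < 1 the
# recurrent weights are uniformly scaled, generalising DropConnect's zeroing
# of a random weight subset.
import torch

def zeta_lstm_cell(x_t, h_prev, c_prev, W, U, zeta=0.7):
    # W stacks the four input-to-hidden matrices (4*hidden, input);
    # U stacks the four hidden-to-hidden matrices (4*hidden, hidden).
    gates = x_t @ W.t() + h_prev @ (zeta * U).t()
    i, f, o, g = gates.chunk(4, dim=-1)
    i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)   # Eq. 13 (a)-(c)
    g = torch.tanh(g)                                                # Eq. 13 (d)
    c_t = i * g + f * c_prev                                         # Eq. 13 (e)
    h_t = o * torch.tanh(c_t)                                        # Eq. 13 (f)
    return h_t, c_t

hidden, inp = 8, 4
W = torch.randn(4 * hidden, inp)
U = torch.randn(4 * hidden, hidden)
h, c = torch.zeros(1, hidden), torch.zeros(1, hidden)
h, c = zeta_lstm_cell(torch.randn(1, inp), h, c, W, U, zeta=0.7)
print(h.shape, c.shape)
```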

When \(\zeta\) is active, we do not lose any information, since the recurrent weights are scaled rather than zeroed; when \(\zeta\) is not active, the traditional DropConnect behaviour of the model is retained. The scaled interconnections of the AWD-LSTM layers carry information that can occur anywhere in the document. We therefore concatenate the last hidden state of the last time step, \(h_T\), from \({H_j^F} =(h_1, h_2, ...., h_T)\) with both the average-pooled and max-pooled representations, as illustrated in Eq. (14):

$$\begin{aligned} h_{c}^F= [h_T, maxpool(H_j^F), meanpool(H_j^F)] \end{aligned}$$
(14)
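A minimal sketch of the Concat Pool operation of Eq. (14), assuming hidden states of shape (batch, sequence length, dimension):

```python
# Sketch of Concat Pool (Eq. 14): concatenate the final hidden state with
# max- and mean-pooled hidden states over the sequence dimension.
import torch

def concat_pool(H):
    # H: (batch, seq_len, dim) hidden states from the last AWD-LSTM layer
    last = H[:, -1, :]                 # h_T
    max_pool = H.max(dim=1).values     # maxpool(H)
    mean_pool = H.mean(dim=1)          # meanpool(H)
    return torch.cat([last, max_pool, mean_pool], dim=-1)   # (batch, 3*dim)

H = torch.randn(2, 10, 400)
print(concat_pool(H).shape)   # torch.Size([2, 1200])
```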

We did not fine-tune all the layers of the network concurrently, but gradually, through gradual unfreezing of the layers. The output of the classifier is obtained as predictions in the form of probabilities. These steps are repeated for each dataset.

5.2 Backward language model

Apart from the traditional forward language model, it is necessary to consider a backward language model in order to capture future context in the embeddings (Peters et al. 2017). As discussed earlier in this section, the forward and backward language models are two versions of the same proposed architecture; hence, the backward language model can be implemented analogously to the forward one.

A backward language model predicts the preceding token given the future context. For the backward model, Eq. (15) gives the probability of the token sequence of Eq. (6):

$$\begin{aligned} p(t_1, t_2, ..., t_N) = \prod _{s=N}^{1} p(t_s|t_N, t_{N-1}, ..., t_{s+1}) \end{aligned}$$
(15)

For the sequence of tokens run in reverse by the backward language model, each word is converted into an embedding vector \(w_s^B\). Analogously to the forward language model, word embeddings are produced for the sequence \((t_N, t_{N-1}, ..., t_{s+1})\) and encoded as column vectors of an embedding matrix \(W_s^{B_{emb}}\).

Subsequently, the hidden state \(H_j^B\) of dimension D, having the same shape as the embedded input, is obtained as the output of the last AWD-LSTM layer, as shown in Eq. (16):

$$\begin{aligned} H_j^B = {LSTM}_j(H_{j-1}^B, w^B) \end{aligned}$$
(16)

where the subscript ‘j’ distinguishes the three stacked LSTM layers. Finally, the hidden state is multiplied with a decoder matrix. The softmax function transforms all values in the decoder matrix into probabilities as shown in Eqs. (17) and (18):

$$\begin{aligned} u_s^B= & {} \tanh (W^{B_{emb}} H_j^B + b^B) \end{aligned}$$
(17)
$$\begin{aligned} a_s^B= & {} softmax (v^T u_s^B ) \end{aligned}$$
(18)

where \(W_s^{B_{emb}} \in R^{D*D}\), v is a word-level latent vector and \(b^B \in R^D\); therefore \(u_s^B \in R^D\) and \(a_s^B \in R\).

After the pre-trained language model is built, it is fine-tuned on the target datasets discussed in Sect. 3, using the same fine-tuning strategies explained in Eqs. (11) and (12). A classifier is built using the \(backward\_encoder\) saved in the language-model fine-tuning step, and it is also fine-tuned using the same two strategies as in the forward language model. Subsequently, the scaling factor \(\zeta\) is applied to govern the weight scaling in backpropagation through time (BPTT); the reason for introducing \(\zeta\) is again to extrapolate the idea of DropConnect in the AWD-LSTM. Correspondingly, we concatenate the last hidden state of the last time step, \(h_T\), from \({H_j^B}\) = \((h_1, h_2, ...., h_T)\) with both the average-pooled and max-pooled representations, as illustrated in Eq. (19):

$$\begin{aligned} h_{c}^B= [h_T, maxpool(H_j^B), meanpool(H_j^B)] \end{aligned}$$
(19)

Thus, after gradual unfreezing of the backward language-model layers, we obtain the predictions in the form of probabilities. The predictions of the forward and backward language models are ensembled to obtain the final predictions as probabilities \(a_s^{FB}\); a sketch of this averaging step is given below.
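A toy sketch of the ensembling step, assuming simple averaging of the two softmax outputs; the probability values are made up for illustration.

```python
# Sketch of ensembling: average the class probabilities of the forward and
# backward classifiers to obtain the final prediction a_s^{FB}.
import numpy as np

probs_fwd = np.array([[0.80, 0.20], [0.40, 0.60]])   # forward model softmax outputs
probs_bwd = np.array([[0.70, 0.30], [0.30, 0.70]])   # backward model softmax outputs

probs_ens = (probs_fwd + probs_bwd) / 2.0
pred_labels = probs_ens.argmax(axis=1)
print(probs_ens)
print(pred_labels)   # [0 1]
```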

To evaluate the effectiveness of our model, we used the same set of hyperparameters, listed in Sect. 6, across all datasets. We utilized an NVIDIA Tesla P100 GPU for faster and more efficient computation.

Fig. 4 Proposed model architecture: an ensemble of forward and backward language models with scaled recurrent weights of AWD-LSTM

The proposed model is depicted in Fig. 4. The AWD-LSTM section of each language model is shown separately; the AWD-LSTM consists of three stacked LSTM layers. The arrows represent the strength of the connections, i.e., the weights of the layers. The size and thickness of the arrows is directly related to the weights, which in turn relate to the features extracted in the AWD-LSTM layers. For instance, a thick arrow indicates unscaled recurrent weights when \(\zeta\) is unity; we use the term unscaled because a \(\zeta\) of unity does not alter the recurrent weights, so it is considered inactive in this case. When \(\zeta\) becomes active, it scales the recurrent weights, represented by plain arrows. The broken arrows also indicate scaled weights within the given range; these are the extremely small weights that have a higher chance of dying out in BPTT. They are therefore eliminated using \(\zeta\), which makes the model faster.

A fine-grained view of the AWD-LSTM layers inside our model is displayed in Fig. 5. The figure contains three stacked LSTM layers, where each unit inside a layer is an LSTM block. The interconnections between these layers, i.e., their weights, are scaled by the factor zeta (\(\zeta\)) as explained earlier. Unscaled recurrent weights are shown with black arrows and scaled weights with grey arrows of variable length. Figures 1, 2 and 3 can be consulted to understand the strategy behind the generalization of the DropConnect method.

Fig. 5 Fine-grained view of the AWD-LSTM layers inside our proposed model, showing the generalized DropConnect strategy

We first tried uni-directional language models; however, introducing bidirectionality proved effective, and it was observed that ensembling the predictions improves the model's performance. Ensembling not only reduces the variance of our model but also yields predictions that are better than those of any single model. Moreover, the scaling factor eliminates extremely low weights during backpropagation, thereby making the model faster.

6 Experimental results

The present section compares the experimental results of the proposed model with existing state-of-the-art deep and non-deep learning models. Table 2 presents the text classification results obtained with two word embeddings, GloVe and fastText. In the case of GloVe, the data was trained using a biLSTM, and the classifiers employed were Logistic Regression, Linear SVM and a polynomial kernel; a hierarchical classifier was employed in the case of fastText embeddings.

Subsequently, we extended our research and performed experiments taking the soft attention mechanism into consideration. We tried combining the attention layer and the concat pool layer, but this performed considerably worse at test time than at training time, i.e., it overfit. Thus, the attention layer and the concat pool layer cannot be utilized together in the same model. To solve this issue, we replaced the concat pool layer with the attention layer; this replacement is necessary for testing the attention layer's overall performance. We also incorporated \(drop\_mult\) (DM) and \(\zeta\), each within its own range, when using the attention layer. The empirical results obtained with three variations of DM and zeta are illustrated in Tables 3, 4 and 5. Table 3 reports the model's performance when \(\zeta\) is inactive and only the weight dropout parameters are varied over their entire range; Table 4 shows the results when only the scaling factor \(\zeta\) is active; and results for the case in which both DM and zeta are active are compiled in Table 5.

Next, we present the results of the proposed model. These experiments were performed using the aforementioned variations of DM and zeta. In Table 6, the scaling factor \(\zeta\) is inactive and the weight dropout parameters are varied by a factor DM lying in the range [0,1], independently for the forward and backward language models. In Table 7, the scaling factor \(\zeta\) is active while DM is set to zero. In Table 8, both \(\zeta\) and DM are active and varied jointly within their respective limits. We chose the values of \(\zeta\) and DM so that their effect on the model could be tested efficiently over their entire ranges.

Following this, we compare the two aggregation architectures, namely concat pool and attention; this comparison is shown in Table 9. The final ensemble classification accuracy of the forward and backward language models is the measure of comparison across all datasets. Table 10 shows how effectively our model performs when juxtaposed with other models. Along with the ensemble classification accuracy results, we present graphical results in Figs. 6, 7 and 8 relating loss and learning rate for the forward and backward language models.

We use four key performance indicators to evaluate the effectiveness of our approach: Accuracy, Precision, Recall and F1 score. Although widely used, accuracy alone is not enough to decide whether the proposed model makes robust predictions, because on unbalanced datasets a high accuracy can be achieved by an over- or under-fit model that does not generalize well to new data. Therefore, in order to evaluate the model fully, the additional metrics Precision, Recall and F1 score are used. A sketch computing the four metrics follows the list below.

  1. Accuracy refers to the overall correctness of classification. It measures the ratio of correctly classified instances over the total number of instances.

    $$\begin{aligned} Accuracy= \frac{TP+TN}{TP+TN+FP+FN} \end{aligned}$$

    where

    • True Positive (TP) Prediction is positive and it is true.

    • True Negative (TN) Prediction is negative and it is true.

    • False Positive (FP) Prediction is positive and it is false.

    • False Negative (FN) Prediction is negative and it is false.

  2. Precision is the ratio of correctly positive-labeled instances to all positive-labeled instances. In other words, it defines the reliability of the model's positive predictions.

    $$\begin{aligned} Precision= \frac{TP}{TP+FP} \end{aligned}$$
  3. Recall tells how many of the actual positive cases the model predicts correctly. It expresses how well the model is able to detect a particular class.

    $$\begin{aligned} Recall= \frac{TP}{TP+FN} \end{aligned}$$
  4. F1 Score is the harmonic mean of precision and recall. The F1 score reported in our results is “macro” averaged because it lets all labels contribute equally regardless of their frequency in the corpus.

    $$\begin{aligned} F_1 = 2* \frac{Precision*Recall}{Precision+Recall} \end{aligned}$$
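The four metrics can be computed with scikit-learn as in the following sketch; the label vectors are toy values, and macro averaging is used for precision, recall and F1 as in the paper.

```python
# Sketch: computing the four reported metrics with scikit-learn (toy labels).
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

y_true = [1, 0, 1, 1, 0, 2, 2, 1]
y_pred = [1, 0, 0, 1, 0, 2, 1, 1]

acc = accuracy_score(y_true, y_pred)
prec, rec, f1, _ = precision_recall_fscore_support(y_true, y_pred, average="macro")
print(f"accuracy={acc:.3f} precision={prec:.3f} recall={rec:.3f} f1={f1:.3f}")
```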

Hyperparameters


Hyperparameters control the learning process of the model. Our model uses an embedding size of 400 and three LSTM layers with 1152 hidden units per layer; it is trained with backpropagation through time (BPTT) with a batch size of 70. For classification, we use dropout of 0.4 and 0.05 for the input and output embedding layers, respectively. The Adam optimization method is used with parameter values \(\beta _1 = 0.9\) and \(\beta _2 = 0.99\). We use a batch size of 32 and learning rates of 0.004 and 0.01 for fine-tuning the language model and the classifier, respectively. The weight dropout parameter and the scaling factor (\(\zeta\)) are varied within their respective ranges. We use the concept of bidirectional modeling by fine-tuning forward and backward language models separately and ensembling their predictions at the end. These settings are summarized in the sketch below.
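For reference, the reported settings can be collected into a single configuration dictionary; this is a summary sketch only, and the range entries for zeta and DM are indicative rather than exact sweep grids.

```python
# Summary sketch of the hyperparameters reported above (values taken from the text).
hyperparams = {
    "embedding_size": 400,
    "lstm_layers": 3,
    "hidden_units_per_layer": 1152,
    "bptt_batch_size": 70,
    "input_embedding_dropout": 0.4,
    "output_embedding_dropout": 0.05,
    "adam_betas": (0.9, 0.99),
    "classifier_batch_size": 32,
    "lr_language_model": 0.004,
    "lr_classifier": 0.01,
    "zeta_range": "(0, 1]",     # scaling factor, varied within its range
    "drop_mult_range": "[0, 1]" # DM, varied within its range
}
print(hyperparams["embedding_size"], hyperparams["lr_classifier"])
```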

Table 2 Classification results of GloVe and fastText embeddings for each dataset
Table 3 Classification results of both forward and backward language models using attention mechanism when zeta (\(\zeta\)) is inactive and weight dropout parameters varied by a factor of DM
Table 4 Classification results of both forward and backward language models by employing attention mechanism when zeta (\(\zeta\)) is active and weight dropout parameters remain independent of DM
Table 5 Classification results of both forward and backward language models by using attention mechanism when zeta (\(\zeta\)) is varied according to certain DM values
Table 6 Classification results of both forward and backward language models of our proposed model when zeta (\(\zeta\)) is inactive and weight dropout parameters varied by a factor of DM
Table 7 Classification results of both forward and backward language models of our proposed model when zeta (\(\zeta\)) is active and weight dropout parameters remain independent of DM
Table 8 Classification results of both forward and backward language models of our proposed model when zeta (\(\zeta\)) is varied according to certain DM values
Table 9 Comparison of experimental results of ensemble classification accuracy of Concat Pool as used in our proposed model to existing state-of-the-art Attention mechanism with respect to same variations of DM and zeta as used in aforementioned experimental results for all the three databases
Fig. 6 Loss versus learning rate for the Twitter data

Fig. 7 Loss versus learning rate for the RLDD data

Fig. 8 Loss versus learning rate for the IMDb data

6.1 Comprehensive comparison

In this section, we compare our evaluation results with existing state-of-the-art models on the sentiment analysis task. The first two approaches, Logistic Regression and Linear SVM, are traditional machine learning techniques, as mentioned in Sect. 2. The next two approaches, GloVe+biLSTM and fastText, are non-deep learning methods based on distributed representations of words. The remaining strategies, including ours, are deep learning models; all of them are based on transfer learning except the ANN. Apart from the first four approaches, every approach is capable of automatically extracting features from the input data without manual feature engineering.

Table 10 shows that the classification accuracy of our model is superior when tested on the three datasets mentioned in Sect. 3. Considering algorithm effectiveness, fastText proves the strongest among the non-deep learning techniques in its category. Our proposed model, on the other hand, achieves superior results not only among the deep learning approaches but across all eight approaches mentioned above. The classification accuracy of our model is 82.27%, 84% and 94.94% for the US Airline Twitter, Real Life Deception Detection and IMDb datasets, respectively. In contrast to ULMFit, we bring in bidirectional contextualization using ensemble methods, and we also focus on regularization by generalizing the DropConnect method. Our aim is to create an ambitious yet reasonable framework for the sentiment analysis task. Overall, our model has performed remarkably on the experimental datasets, and with such performance it can be considered a robust framework for sentiment classification problems.

Table 10 Comparison of experimental results of classification accuracy of our proposed model to existing state-of-the-art models

All the variations of our model are reported independently in Tables 6, 7 and 8 above. The combined results of all model variations are illustrated in Figs. 9 and 10. The two bar graphs show the ensemble classification accuracy of the forward and backward language models for all experimental datasets. The sensitivity of our model variations is analysed using two parameters, DM and zeta (\(\zeta\)): in the first bar graph, every value of DM is varied over all zeta values in the entire range; correspondingly, the second graph illustrates the variation of every value of zeta with DM.

For IMDb, the accuracy is at its best when around 50% of the dropout parameters and 70% of the recurrent weights of the AWD-LSTM are scaled; DM is relatively less sensitive in this case. In the case of RLDD, our model achieves maximum accuracy when only 1% of the dropout parameters and 70% of the AWD-LSTM weights are scaled; thus, our model is most sensitive to zeta for this corpus. For the Twitter dataset, our model works best when we scale 50% of the AWD-LSTM weights; it does not depend on the dropout-parameter scaling and is sensitive to the weight scaling only. Altogether, it can be observed that our model is relatively more sensitive to the weight scaling factor zeta (\(\zeta\)). Keeping DM in consideration, the performance of our model varies directly with the scaled AWD-LSTM weights within their respective limits.

Overall, the combination of \(\text{DM}=0.5\) and \(\text{zeta}=0.7\) proves best for IMDb, where our model achieves its highest classification accuracy of 94.94%. Similarly, for the RLDD dataset it achieves its highest accuracy of 84% when \(\text{DM}=0.01\) and \(\text{zeta}=0.7\). On Twitter, our model attains a maximum accuracy of 82.27% for \(\text{DM}=0.0\) and \(\text{zeta}=0.5\).

Fig. 9 Sensitivity analysis of classification accuracy w.r.t. zeta (\(\zeta\)) for the experimental datasets

Fig. 10 Sensitivity analysis of classification accuracy w.r.t. DM for the experimental datasets

7 Conclusion and future work

In this manuscript, we have focused on the sentiment classification task and proposed a new transfer-learning-based model. It addresses challenging aspects of sentiment analysis such as contextualization and regularization. The main steps followed by our model are: (1) pre-train a language model; (2) fine-tune the language model on the target data while scaling the recurrent weights of the AWD-LSTM with a scaling factor zeta (\(\zeta\)). To introduce bidirectionality, we perform both forward and backward language modeling based on state-of-the-art transfer learning methods; our model is an ensemble of the forward and backward language models.

After extensive research, we have demonstrated that the model variations \(\zeta\)+concat pool and \(\zeta\)+attention perform considerably better than their conventional versions in which no zeta is incorporated; the performance of the model is strengthened when \(\zeta\) is included appropriately. In addition, our model surpasses the experimental results of the soft attention mechanism. The comparison with the attention layer has been made chiefly on the grounds that both attention and concat pool are state-of-the-art feature aggregation architectures; separate experiments were therefore carried out with each architecture individually, and our model emerged as the superior one.

Finally, it has been validated that the incorporation of \(\zeta\) and its variation with DM yields superior results. With these parameters, our model no longer relies on just the maximum and average values but also retains other dominant features of the embedding layer. The intention is to learn the significant features of the embeddings rather than presuming that a concatenation of maximum and average values, or the attention mechanism alone, provides the best results. We therefore conclude that the inclusion of zeta makes our model an apt and excellent substitute for the attention mechanism. Our model has performed remarkably well in terms of classification accuracy when compared with other state-of-the-art approaches.

As mentioned, we have evaluated the efficiency of our model on three widely used datasets. All of these datasets, listed in Table 1, cover a diverse range. We have obtained outstanding experimental results for our model compared with those obtained from other state-of-the-art frameworks, which shows that our proposed model is versatile and robust.

In the future, we plan to extend our work to multi-class text classification by incorporating Transformer-based models with other attention mechanisms.