1 Introduction

One of the common applications of domain-dependent problems in internet research is sentiment analysis of tweets, movie reviews, news headlines or other textual data such as general web content. Sentiment analysis is widely used in varied domains for different purposes, and the literature mentions several notable techniques for this task. Traditional sentiment analysis has been accomplished based on manual selection of features [1] and also using evolutionary approaches such as greedy heuristics [2] and genetic algorithms for feature reduction [3]. Genetic algorithms have also been widely used in the studies of [4,5,6,7] for various other feature-reduction and optimization tasks. Sentiment analysis is widely considered a domain-specific task, and traditional sentiment analysis requires labelled data in every domain for these techniques to work efficiently, which makes the problem computationally intensive. Therefore, traditional techniques cannot achieve the same level of effectiveness on data that spans more than one domain of interest. Segregation of data into different domains is itself a classification task, resulting in a higher computation time, which is why researchers are seeking ways to develop systems that can work on multiple domains.

Multi-domain sentiment analysis has been attempted by many researchers recently in a variety of ways, such as the random walk-based solution by Xue et al. [8], methods for extracting domain features automatically [9,10,11] and multi-domain aspect-based methods [12,13,14,15]. Advances in deep learning and recurrent networks have also led to numerous studies in this field. The use of the attention mechanism has also been encouraging in natural language processing for extracting the important features of specific words, as in recently proposed studies [16,17,18]. Despite the availability of these works, effective sentiment classification across multiple domains remains challenging, since most of these methods rely on machine learning tools or neural networks for feature selection, which suffer from performance issues, the vanishing gradient problem and the need for labelled data. Although the attention mechanism has paved the way for focussing on relevant features only and seems to be a promising option, its applicability to multiple domains is still relatively unexplored.

This paper proposes a sentiment classification model capable of working on multiple domains, employing a bi-directional deep recurrent neural network with an attention mechanism for effective classification. The bi-directional deep recurrent network helps to remember contextual information and to select independent and specific features from both directions. The attention mechanism helps the network learn important features of specific words and share information across domains. The hidden layers select domain-shared and domain-specific features simultaneously, and the subsequent hidden layers extract the sentiment of the given sentence in combination with its domain obtained from the previous layers. The proposed architecture was arrived at by experimenting with various architectures and observing the results of adding layers and changing the network architecture for domain representation and classification. The problem of vanishing gradients has been overcome by using Gated Recurrent Units (GRU) in the bi-directional recurrent network. More details regarding recurrent networks, LSTM and related concepts are given in the “Appendix”, to which readers may refer for wider reference.

The paper is organized as follows: Sect. 2 details the related studies in the proposed field, with focus on the use of recurrent networks, attention mechanisms, deep neural networks, multi-stage networks etc. Section 3 describes the proposed architecture, with the sentiment and domain modules described individually. Section 4 explains the experiments and the required discussion, while Sect. 5 concludes the experimental findings.

2 Related Works

Designing a sentiment analysis system requires the extraction of useful features of interest specific to various domains. The attention mechanism is an effective approach for detecting useful features, but its usefulness for multiple domains is still unexplored. Kim et al. [19], at Microsoft AI and Research and Bloomberg LP, proposed a domain attention model with an ensemble of experts. They assumed ‘N’ domain-specific intent and slot models trained on separate domains. Given a new domain, the model uses a weighted combination of feedback from the ‘N’ domain experts along with its own opinion to anticipate the new domain. The dynamic memory network proposed by Kumar et al. [20] for natural language processing follows a similar procedure and uses a fixed query representation as attention for selecting the features relevant to an intended task. However, a drawback of the approach was that it was restricted to single-domain scenarios.

The use of recurrent neural networks and attention-based architectures for multi-domain data also finds mention in a few works in the literature. Chauhan et al. [18] proposed a technique combining linguistic patterns with a bidirectional recurrent network supported by new word embeddings. Their fine-tuned word embeddings helped extract the most relevant domain-specific terms. An attention mechanism was used to capture long-term dependencies between specific words. Valid aspect terms were extracted using unsupervised learning and were used to train the proposed bi-LSTM model. Lee et al. [21] proposed a word attention-based mechanism to classify negative and positive sentences using a CNN-based weakly supervised learning algorithm. The word attention mechanism generated a class activation map for impactful words and helped produce sentence-level and word-level polarity scores using weak labels only.

Liu et al. [22] utilized an attention mechanism over a BiLSTM-based architecture in order to extract useful low-level information from the hidden layers, which was combined with higher-level phrase information extracted using a CNN layer. Fu et al. [23] extended an LSTM-based architecture by integrating a lexicon into the word embeddings. A global attention mechanism was used to extract global information from the text, which was combined with the representational power of the lexicon-based word embeddings. Kardakis et al. [24] provided a detailed analysis of various attention-based studies used for sentiment analysis.

It is also evident from human behaviour that people frequently use shared expressions or domain-specific terms to express their feelings, which can support the training of sentiment classifiers for different domains. Therefore, multi-domain sentiment analysis can be regarded as a particular application of multi-task learning [25]. Liu et al. [17] proposed a framework for multi-task feature learning where different tasks share similar sparsity patterns. The efficiency of such methods strongly depends on assumptions of domain relatedness that may not hold in different scenarios. More important work in the direction of multi-stage models employing recurrent networks has been done by Poria et al. [12], Rana and Cheah [13] and Wu et al. [14], who proposed extracting nouns or noun phrases using rule-based methods in a first stage; these were then used in the next stage for training or feature sharing in an attention-based or bi-LSTM network. Do et al. [15], Yuan et al. [16] and Chauhan et al. [18] proposed using linguistic rules with an attention-based approach to improve the extraction of noun phrases. Yu et al. [26] proposed a Wasserstein-based Transfer Network (WTN) to share the domain-invariant information of source and target domains and employed BERT embeddings for deep-level semantic information of text. Another recent work in the direction of multi-domain sentiment analysis was proposed by Yu et al. [27], which extends the analysis to multimedia data, processing the internal features of all modalities and correlating the modalities with each other to gauge sentiment trends. That recurrent architecture uses BERT for extracting text features and ResNet for image features, along with multimodal feature fusion.

After deliberating on the studies available in the literature, we infer that there is wide scope for a technique suitable for multi-domain sentiment analysis using recurrent neural networks. Therefore, this paper proposes a sentiment classification model capable of multi-domain classification, employing a bi-directional deep recurrent neural network with an attention mechanism for effective classification. The methodology has been widely tested and experimented with, using both Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) cells in different combinations for sentiment and domain analysis to obtain the most effective combination.

3 Proposed Methodology

Unlike the aforementioned methods, this paper proposes a sentiment classification model for multiple domains employing an attention mechanism and a Bidirectional Recurrent Neural Network (BRNN) with GRU cells. More details about these basic building blocks, including the working and basic structures of the RNN, LSTM, GRU and attention mechanism, are available in “Appendix 1”.

Use of the attention mechanism helps to select independent and specific features simultaneously. The proposed architecture is shown in Fig. 1 and consists of two modules, a Domain module and a Sentiment module, both employing a BRNN.

Fig. 1 Proposed architecture for multi-domain classification

The task of the Domain module is to predict an appropriate domain by learning representations of the various domains. An attention selection is triggered by the domain representation for assembling the most crucial domain-related features in the corresponding sentiment module. The Domain module makes use of a bidirectional GRU network to obtain the most useful representation of the domain. A bidirectional GRU runs the inputs in two modes, one from forward to backward and one from backward to forward, which distinguishes it from the uni-directional approach in which the recurrence runs only backwards. The principle of the BRNN is to split each neuron of a regular RNN into two directions, one positive and one negative; by using the two time directions, input information from the past and the future can be used for the current time frame, which helps make effective domain predictions.

We implemented Gated Recurrent Unit (GRU) cells as our recurrent network. GRU has been preferred over the LSTM network because it trains quickly while still retaining long-term information. We use \(\pounds_{GRU}(.)\) to denote the processing on embeddings.

$$\begin{aligned} \theta_{d} &= \pounds_{GRU^{d}}\left( X_{j}^{i} \right) \\ \theta_{d} &= \pounds_{GRU^{d}}\left( w_{x1}, w_{x2}, w_{x3}, \ldots, w_{xn} \right) \end{aligned}$$
(1)

where \(\theta_{d}\) is the output of the GRU network and \(w_{xj}\) represents the word representation of the jth word in the text.

Since the proposed architecture is a bi-directional network, \(\theta_{d}\) is a combination of two GRU networks and is therefore represented as

$$\begin{aligned} \theta_{d} &= \theta_{d}^{forw} \mathbin{\S} \theta_{d}^{back} \\ \theta_{d} &= \pounds_{GRU^{d}}^{forw}\left( X_{j}^{i} \right) \mathbin{\S} \pounds_{GRU^{d}}^{back}\left( X_{j}^{i} \right) \end{aligned}$$
(2)

where § denotes concatenation and \(\theta_{d}\) is obtained by combining the text processing in both the forward and backward directions as in Eq. 1.

A softmax layer at the end has been employed for domain prediction. In the experiments that follow, this label has been used to gauge the domain classification accuracy.
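
The description above can be made concrete with a minimal PyTorch sketch, shown purely as an illustration; the embedding size, hidden size and number of domains used here are assumptions, since the paper does not report these values.

```python
import torch
import torch.nn as nn

class DomainModule(nn.Module):
    """Bidirectional GRU encoder with a linear head for domain prediction (Eqs. 1 and 2)."""
    def __init__(self, embed_dim=300, hidden_dim=128, num_domains=4):  # illustrative sizes
        super().__init__()
        self.bigru = nn.GRU(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.domain_out = nn.Linear(2 * hidden_dim, num_domains)

    def forward(self, embeddings):                      # embeddings: (batch, seq_len, embed_dim)
        _, h_n = self.bigru(embeddings)                 # h_n: (2, batch, hidden_dim)
        theta_d = torch.cat([h_n[0], h_n[1]], dim=-1)   # forward § backward concatenation (Eq. 2)
        domain_logits = self.domain_out(theta_d)        # a softmax over these logits gives the domain label
        return theta_d, domain_logits
```

In this sketch, the domain representation \(\theta_d\) is what the sentiment module consumes as the attention query, while the domain logits feed the domain term of the loss described in Sect. 3.2.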

3.1 Sentiment Module

Another attention-based bidirectional recurrent network has been used for the sentiment module. The prime idea behind using the attention mechanism is that the sentiment module should attend to all outputs of the recurrent process, which differentiates it from the domain module. The sentiment module also implements a GRU network, and \(\theta_{s}\) denotes the outputs of the sentiment module network, which can be seen as a representation of the details of the text.

The representation of the domain from the domain module is also used, and attention weights are extracted using a single feed-forward layer in the sentiment module. A softmax layer converts the attention weights to probabilistic attention weights, which are then fed to a sequence of two fully connected layers and a softmax layer to obtain the final sentiment representation for predicting the sentiment.

The proposed network is trained on a combination of cross-entropy losses for both sentiment prediction and domain prediction. The sentiment module also uses a Gated Recurrent Unit with an attention mechanism, whose output can be expressed as follows:

$$\theta_{s} = f_{GRU^{s}}\left( \left( w_{v1}, w_{v2}, w_{v3}, \ldots, w_{vn} \right)^{t} \right)$$
(3)

where \(\theta_{s}\) is the output of the GRU network, which is a list of vectors.

Attention weights are learned using Bahdanau’s model [56] with a feed-forward network as follows,

$$y_{i}^{att} = f\left( w^{att} \left( \theta_{d} \mathbin{\S} \theta_{s}^{k} \right) + b^{att} \right)$$
(4)

where \(w^{att}\) and \(b^{att}\) are the parameters of the attention mechanism, and \(\theta_{d}\) and \(\theta_{s}^{k}\) refer to the domain representation (annotation) and the kth sentiment vector representation, respectively.

The attention weights are calculated by normalizing the output scores of the feed-forward neural network in Eq. 4:

$$\alpha_{i} = \frac{\exp\left( y_{i}^{att} \right)}{\sum_{k=1}^{n} \exp\left( y_{k}^{att} \right)}$$
(5)

The final value of \(\theta_{s}\) is the weighted sum over all the vectors:

$$\theta_{s} = \sum_{k=1}^{n} \alpha_{k} \, \theta_{s}^{k}$$
(6)

where \(\alpha_{k}\) are the attention weights and \(\theta_{s}\) is the sentiment text representation.

This is further fed to a dense layer and a softmax layer for the final sentiment prediction.
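
As a hedged illustration of Eqs. 4–6, the sketch below scores every sentiment GRU output against the domain representation and forms the attended text representation. The extra scalar projection `v` (reducing the feed-forward output to a single score per time step) and the layer sizes are assumptions in the spirit of Bahdanau-style attention, not values taken from the paper.

```python
import torch
import torch.nn as nn

class DomainGuidedAttention(nn.Module):
    """Additive attention over sentiment GRU outputs, queried by the domain vector (Eqs. 4-6)."""
    def __init__(self, domain_dim, sent_dim, attn_dim=64):        # attn_dim is illustrative
        super().__init__()
        self.w_att = nn.Linear(domain_dim + sent_dim, attn_dim)   # w^att and b^att of Eq. 4
        self.v = nn.Linear(attn_dim, 1, bias=False)               # assumed scalar score projection

    def forward(self, theta_d, theta_s):      # theta_d: (batch, D), theta_s: (batch, T, S)
        seq_len = theta_s.size(1)
        query = theta_d.unsqueeze(1).expand(-1, seq_len, -1)      # pair theta_d with each theta_s^k
        scores = self.v(torch.tanh(self.w_att(torch.cat([query, theta_s], dim=-1))))  # Eq. 4
        alpha = torch.softmax(scores, dim=1)                      # normalized attention weights, Eq. 5
        return (alpha * theta_s).sum(dim=1)                       # weighted sum, Eq. 6
```

The returned vector plays the role of the final \(\theta_s\) and is fed to the dense and softmax layers mentioned above.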

3.2 Loss

The loss used is the sum of the sentiment and domain prediction losses, given by

$$L_{0} = \frac{1}{N_{k}}\sum_{k = 1}^{M}\sum_{i = 1}^{N_{k}} L\left(s_{i}^{k}, p\left(s_{i}^{k} \mid x_{i}^{k}\right)\right) + \frac{\lambda}{N_{k}}\sum_{k = 1}^{M}\sum_{i = 1}^{N_{k}} L\left(d_{i}^{k}, p\left(d_{i}^{k} \mid x_{i}^{k}\right)\right)$$
(7)

Equation 7 is the global loss implemented for the proposed architecture, assuming the presence of M domains and \(N_k\) samples for the kth domain. \(s_{i}^{k}\) and \(d_{i}^{k}\) are the true sentiment value and true domain value, respectively. The probabilistic outputs of the sentiment and domain modules are given by \(p(s_{i}^{k} \mid x_{i}^{k})\) and \(p(d_{i}^{k} \mid x_{i}^{k})\).

L(.) is the cross-entropy loss function used to calculate the difference between the predicted and true labels, with Adam as the optimizer. The first L(.) term refers to the cross-entropy loss observed for sentiment and the second L(.) term refers to the domain loss. \(L_0\) is the global loss, which is the combination of the sentiment loss and the domain loss. The domain loss is weighted by the \(\lambda\) term, whose value was set to 0.04 after experimentation with multiple values; \(\lambda\) acts as a regularization term that controls the importance of the domain error.
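
A minimal sketch of Eq. 7 in the same PyTorch setting, assuming the modules emit raw logits; the per-domain averaging of Eq. 7 is collapsed into PyTorch's default mean reduction here, which is an implementation simplification rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def global_loss(sent_logits, sent_labels, dom_logits, dom_labels, lam=0.04):
    """L0 = sentiment cross-entropy + lambda * domain cross-entropy (cf. Eq. 7)."""
    sentiment_loss = F.cross_entropy(sent_logits, sent_labels)   # first L(.) term
    domain_loss = F.cross_entropy(dom_logits, dom_labels)        # second L(.) term
    return sentiment_loss + lam * domain_loss                    # lambda = 0.04 as reported above

# optimizer = torch.optim.Adam(model.parameters())  # Adam, as stated; learning rate unspecified
```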

Use of the bidirectional recurrent network provides good accuracy across all the domains because bidirectional networks can understand the context better, while the attention mechanism in the sentiment module helps to focus on the features that differentiate one domain from another. The proposed architecture models the related domains at an intrinsic feature level rather than having separate layers for a single task. The hidden layers select domain-specific and domain-shared features simultaneously, and the following hidden layers extract the sentiment of the given sentence in combination with its domain obtained from the previous layers.

More details are available in the next section, where several models for these modules with variations in their design were built and experiments were conducted to gain meaningful insights into the architecture. The explanation of the experiments helps to understand the evolution of the proposed architecture. These experiments lay out the architectures used and the results obtained by each method in detail. We conclude by proposing the architecture for multi-domain classification for which the most promising results were obtained.

4 Results and Discussions

This section details the conducted experiments and the observed results, the aim being to derive a single model that can work on different domains. All experimentation has been done on the Amazon review dataset [31], which comprises four types of reviews, namely Books, Electronics, Kitchen appliances and DVDs. The dataset consists of 2000 reviews per product, 1000 positive and 1000 negative for each of the mentioned products, for a total of 8000 reviews. The reviews were suitably split into train, test and validation sets. These experiments were performed on a workstation with an Intel i7-6850K CPU @ 3.60 GHz (x86_64) and 2 Titan Xp-Pascal GPUs.
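
For illustration, a small sketch of how such a split could be produced with scikit-learn; `load_amazon_reviews()` is a hypothetical loader returning (text, sentiment, domain) tuples, and the 80/10/10 ratio is an assumption since the paper does not report the exact split.

```python
from sklearn.model_selection import train_test_split

reviews = load_amazon_reviews()          # hypothetical loader: list of (text, sentiment, domain)
domains = [r[2] for r in reviews]

# 80% train, then split the remaining 20% equally into validation and test, stratified by domain
train_set, rest = train_test_split(reviews, test_size=0.2, stratify=domains, random_state=42)
val_set, test_set = train_test_split(rest, test_size=0.5,
                                     stratify=[r[2] for r in rest], random_state=42)
```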

4.1 Experiment 1

We first experimented with a model inspired by the work of Yuan et al. [14]. This architecture consists of a sentiment module and a domain module for sentiment and domain prediction and employs LSTM as the recurrent network. We extracted a large number of additional parameters during the training process, which are listed in Table 1. These include the various losses observed on the training and validation data (for more information about the parameters in the table, readers are encouraged to refer to “Appendix 1”). Table 1 shows significantly higher values for the validation metrics, indicating scope for further improvement. We further evaluated the model on additional metrics such as Precision, Recall and F1-score, and extracted separate domain-wise and sentiment-wise evaluations that were not part of the original study. These evaluations are listed in Tables 2 and 3, respectively. The results of this experiment indicate that the model achieved almost the same level of accuracy in all domains, with a marginal advantage for Books, as the model was able to understand the context using a strong bidirectional network. The overall sentiment accuracy for both positive and negative sentiments shows the same pattern. A receiver operating characteristic curve is also depicted in Fig. 2a, using the test values for deriving true positives and true negatives, for more insight into the model.

Table 1 Parameter comparison of the models
Table 2 Domain-wise comparison of the various models
Table 3 Sentiment-wise comparison of the various models
Fig. 2 ROC plots of the various experiments

4.2 Experiment 2

This experiment extends the model of the previous experiment by incorporating a dense layer before the final layer, with the number of neurons equal to the number of domains. A dense layer is a fully connected layer: every neuron in the previous layer is connected to every neuron in the current layer. This differs from convolutional layers, in which weights are reused across different sections of the input, whereas a dense layer has a unique weight for every neuron-to-neuron pair. An appropriate value of dropout was also set for regularization to prevent over-fitting. This architecture was targeted at improving the accuracy of certain domains, because the dense layer might allow domains with fewer reviews to obtain better results in their sentiment analysis. The effect of incorporating the dense layer into the Domain Attention model was computed, and the results are presented in Tables 2 and 3. The observed values indicate that adding the dense layer increased almost all metrics, with a considerable decrease in the validation loss. The values of the other parameters are also shown in Table 1, where a considerable dip in the validation loss metrics can be observed.
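
A sketch of this modification, again in PyTorch, showing a dense layer whose width equals the number of domains, followed by dropout, before the final classifier; the dropout rate, the activation and the input width (matching the bidirectional GRU size assumed earlier) are illustrative, since the paper only states that an appropriate dropout value was chosen.

```python
import torch.nn as nn

num_domains = 4
extended_head = nn.Sequential(
    nn.Linear(2 * 128, num_domains),     # added dense layer, neurons = number of domains
    nn.ReLU(),                           # assumed activation
    nn.Dropout(p=0.3),                   # illustrative dropout rate for regularization
    nn.Linear(num_domains, num_domains)  # final layer; softmax applied at prediction time
)
```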

4.3 Experiment 3-a

This experiment was designed to obtain a compatible structural ensemble of LSTM and GRU for the hidden layers of the Domain Attention model, with the aim of finding a suitable recurrent network for the proposed architecture. We modified the architecture of the previous experiment by using LSTM cells as the bidirectional RNN in the domain module and GRU cells in the sentiment module. The domain module makes use of an LSTM network to gain the domain representation, while the sentiment module is a bidirectional Gated Recurrent Unit (GRU) network with an attention mechanism. Unlike the LSTM, the GRU has only two gates and does not maintain an internal cell state. The GRU uses a reset gate and an update gate, which act as vectors deciding what information is passed to the output, helping avoid the vanishing gradient problem of a standard RNN; the standard gating equations are given below for reference. This architecture rendered notable results with a lower training time, since GRUs tend to train more quickly than LSTMs. Here as well, the sentiment module attends to all outputs, unlike the domain module. The results and other parameters for the experiment are shown in Tables 1, 2 and 3, and indicate an improvement over the previously observed values.
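
For reference, these are the textbook GRU gating equations (update gate \(z_t\), reset gate \(r_t\)); they are standard formulations and not specific to this paper.

$$\begin{aligned} z_t &= \sigma\left(W_z x_t + U_z h_{t-1} + b_z\right) \quad \text{(update gate)} \\ r_t &= \sigma\left(W_r x_t + U_r h_{t-1} + b_r\right) \quad \text{(reset gate)} \\ \tilde{h}_t &= \tanh\left(W_h x_t + U_h \left(r_t \odot h_{t-1}\right) + b_h\right) \quad \text{(candidate state)} \\ h_t &= \left(1 - z_t\right) \odot h_{t-1} + z_t \odot \tilde{h}_t \quad \text{(new hidden state)} \end{aligned}$$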

4.4 Experiment 3-b

In this experiment, the recurrent units of the two modules are reversed, i.e. GRU is used as the bidirectional RNN in the domain module to gain the domain representation and LSTM in the sentiment module. The GRU units help determine, through the update gates, how much information gained from previous steps needs to be passed further. This makes it possible for the model to decide whether to keep all the information gained from past steps and helps mitigate the risk of vanishing gradients. The values observed in Tables 1, 2 and 3 are comparable to the previous experiments on most metrics, but a marginal increase in accuracy was observed across all the domains because of the fine compatibility of the GRUs.

4.5 Experiment 4

We then experimented with using GRU cells for the bidirectional RNN in both the domain module and the sentiment module. The domain module employs a GRU network to gain the domain representation, with update gates deciding which information from previous steps is passed on. The sentiment module of this architecture is also a GRU network with an attention mechanism. This model was successful in providing good accuracy across all domains because the modules gelled together quickly. The model gave the highest accuracy in most of the domains and also required less training time. The corresponding results are available in Tables 1, 2 and 3.

4.6 Experiment 5: Proposed Architecture

The last experiment was performed to improve the results further by adding an additional dense layer for a higher degree of information processing. The additional dense layer helps retain specific information, prevents over-fitting and improves accuracy. The addition of the dense layer increased the values of every utilized metric, and these evaluations are presented in Tables 1, 2 and 3. The overall values of the proposed architecture were better than in the rest of the experiments because the new layer enabled the model to learn more of the comparative dissimilarities between the domain and sentiment modules, thereby improving the final results. The resulting values of precision, recall and F1-score were also better than in all other experiments for both the domain and sentiment modules. The training of the network was also quick; this was true for every model, and the time taken to train the various networks did not vary much, since pre-trained models were employed and the dataset was relatively small. The proposed model was also tested on low-end devices such as the Nvidia Jetson Nano [28], and the observed lag was negligible.

In order to dig deeper into the comparisons, we plotted Receiver Operating Characteristic (ROC) curves for the various experiments; since the classes to be tested are not imbalanced, the ROC curve was preferred over Precision-Recall curves. A comparison of the ROC curve plots of all experiments is shown in Fig. 2, which further illustrates the success of the proposed model. This is also indicated by the Area Under the Curve (AUC) values for the proposed model in Fig. 2f, illustrating that the final model was able to make correct predictions.
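
A minimal sketch of how such ROC curves and AUC values can be generated with scikit-learn and matplotlib; `y_true` and `y_score` stand for the test-set sentiment labels and the predicted positive-class probabilities, and are placeholders rather than artefacts of the paper.

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

# y_true: binary test labels, y_score: predicted positive-class probabilities (placeholders)
fpr, tpr, _ = roc_curve(y_true, y_score)
plt.plot(fpr, tpr, label=f"AUC = {auc(fpr, tpr):.2f}")
plt.plot([0, 1], [0, 1], linestyle="--")   # chance diagonal
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()
```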

To summarize, the average values of all metrics tend to be highest for the proposed model (experiment 5), indicating a better overall choice for multiple domains. The proposed model outperformed the other architectures in all domains, demonstrating its effectiveness not only overall but in each domain class as well. Experiment 2 achieved high F1-scores for the Kitchen and Electronics domains and higher accuracy for the Books domain, and experiment 1 also demonstrated comparable results, but the proposed architecture performed better in terms of validation loss and validation accuracy (Table 1). For instance, the proposed model achieved the lowest validation loss for domain prediction, and the other loss values are small. Overall, it can be concluded that the proposed architecture, using the attention mechanism and employing GRU as the recurrent network with the addition of a dense layer, demonstrates improvements over the existing models available in the literature. This is also evident from the ROC plot comparison of all experiments presented in Fig. 2.

We further compared the results of the proposed model with similar state-of-the-art studies in the literature, as presented in Table 4. The results in Table 4 indicate that the proposed system outperforms all methods outlined in the table. It has convincingly outperformed methods working on individual domains, such as LS and SVM; methods working on multiple domains (RMTL, MTL-Graph and CMSC); and methods combining sentiment classifiers, such as MDSC. This is possible because the proposed method is capable of extracting common information across domains, and the attention mechanism helps it learn and retain the important information effectively.

Table 4 Comparison with other state-of-the-art methods

5 Conclusions

The paper proposes a deep learning-based sentiment analysis system employing a domain attention mechanism for successfully predicting sentiment values across multiple domains. The proposed architecture employs GRU cells for the Sentiment and Domain modules, implemented as bi-directional recurrent networks, with an additional dense layer to improve accuracy. A number of experiments for evaluating the proposed architecture were conducted, and it was found that using the GRU layer as the bidirectional RNN performed more efficiently than the other tested models. It not only took less training and testing time but also provided higher validation accuracy in most of the domains. The predictions of the model across different domains were efficient and better than in the rest of the experiments, and the proposed architecture also demonstrated higher validation accuracy in the sentiment module. The evaluation of the domain and sentiment modules was conducted separately using a large number of metrics such as Precision, Recall and F1-score. Overall, it can be concluded that employing GRU cells in the sentiment and domain modules together with a dense layer increases prediction accuracy and yields a better classifier. Future extensions of this work are possible by extending its training to multilingual datasets and building a web utility for public sentiment prediction, which will also help to test the proposed model for scalability and deployment on low-end devices.