Introduction

The Web 2.0 era brought a marked shift in online shopping culture, as customers now readily share their opinions by writing reviews on online platforms. These reviews are subjective and vital for new customers deciding whether to purchase a product. They also benefit product manufacturers and sellers, who gain insight into what customers like and dislike about their products and can improve them accordingly.

Unlike document-level and sentence-level sentiment analysis, Aspect-Based Sentiment Analysis (ABSA) concentrates on identifying the sentiment expressed towards a particular aspect or characteristic in a given review, providing a finer-grained understanding of sentiment than the broader document-level and sentence-level approaches. For example, in the review “Excellent screen but the battery life could be better”, the aspect “screen” carries positive sentiment while the aspect “battery life” carries negative sentiment. By considering aspect information, ABSA avoids the inaccuracies of sentence-level sentiment analysis and better captures users’ emotional expressions towards different aspects. Jiang et al. [1] stressed the significance of aspects in sentiment classification and showed that most errors could be attributed to neglecting aspect information.

Methods for ABSA can be broadly classified into three categories: rule-based, machine learning, and deep learning models [2]. Rule-based models rely on criteria designed from domain knowledge and linguistic patterns. Their learning ability is limited because the rules are static: they are applied to the data without being updated or changed. Machine learning models require labeled (annotated) training data for supervision; they are built on feature extraction techniques (feature engineering) and refine their parameters iteratively on the training data. Deep learning models were introduced to bypass manual feature extraction and let the model learn the significant features by itself. A deep learning model is a neural network with a fixed architecture; one of its benefits is its inherent adaptability, as it adjusts its parameters according to the errors generated during training. In recent years, deep learning has made breakthroughs in computer vision [3], speech recognition [4], NLP [5], and the medical domain [6, 7].

In NLP tasks, the length of an input sentence can vary, necessitating neural networks capable of accommodating such variability. This led to sequential models, which capture the meaning of a word in the context of the surrounding word sequence. Rather than treating words independently, these models capture sequential information during training. Recurrent Neural Networks (RNNs) [8] were developed to handle variable-length sequences. Long Short-Term Memory (LSTM) [9] is a refined type of RNN designed to overcome the vanishing and exploding gradient issues [10] common in traditional RNNs; it uses gates that selectively remember and forget learned information. Several LSTM variants, such as the Gated Recurrent Unit (GRU) and Bidirectional LSTM (BiLSTM), were developed and applied to sentiment analysis subtasks. For example, Target-Dependent LSTM (TD-LSTM) and Target-Connection LSTM (TC-LSTM) [11] combine LSTM models with aspect targets for sentiment analysis, and ATE-SPD [12] uses a BiLSTM and CRF for ABSA. A significant drawback of LSTM-based sequence models is that they cannot extract semantic information in parallel, resulting in substantial training time overhead.

Attention [13] is a crucial concept in deep learning: it enables the model to concentrate on a specific part of the given sentence, and it has been widely used in sentiment analysis tasks. Previous studies [14,15,16] have demonstrated that neural network models can be improved by focusing on specific aspect terms of the input sentence through attention mechanisms. ATAE-LSTM [14] combines attention and LSTM for sentiment analysis. IAN [15] applies an attention mechanism between the aspect vector and its corresponding context vector, both obtained from an LSTM over GloVe embeddings [16]; experimental results show that this interactive attention mechanism enhances sentiment classification performance. However, GloVe, the static encoding method used as the word embedding in IAN, cannot encode a word differently according to its context. In addition, the model employs separate LSTMs to learn the semantic information of the aspect and the context.

Bidirectional Encoder Representations from Transformers (BERT) [17], a self-supervised masked language model, significantly enhances performance when fine-tuned for specific NLP tasks. BERT models bidirectional context through a multi-layer self-attention mechanism, and variants of BERT constitute the state of the art in ABSA [18,19,20].

BERT is trained on Wikipedia and BookCorpus, aiming to learn general-purpose knowledge from extensive corpus data. Recent studies [21,22,23] show that learning domain-specific knowledge is more beneficial, since it captures long-tailed information that can be important for domain-specific end tasks. A BERT model post-trained on data from the domain of the end task performs better than the general-purpose BERT trained on Wikipedia and BookCorpus.

To this end, we propose IAN-BERT, an aspect sentiment analysis model. It leverages post-trained BERT to dynamically encode word vectors, using the transformer encoder [24] to extract semantic features in parallel; this alleviates both the static word encoding and the significant time overhead of LSTM-based sentiment analysis models. The contextualized representation is then refined using an attention mechanism between the aspect and context vectors, and a sentiment classification layer determines the sentiment orientation of the aspect. Experiments on the SemEval-14 Restaurant and Laptop datasets and the MAMS dataset demonstrate that our model is more accurate than the baseline models.

The following are the key contributions of this work:

  1. We present a new approach, IAN-BERT, which leverages BERT representations and recognizes the mutual influence of aspects and context.

  2. We incorporate an attention mechanism that interacts between the aspect and context representations obtained through the self-attention mechanism.

  3. We utilize a post-trained BERT model, trained on Yelp and Amazon review datasets, to obtain representations enriched with domain-specific knowledge.

The rest of the paper is organized as follows: “Related Work” highlights BERT, post-trained BERT, and attention-based approaches for sentiment analysis. “Proposed Work” provides a comprehensive explanation of our proposed method, IAN-BERT. “Experiments and Results” presents the experiment setup and result analysis. Finally, the paper is concluded in “Discussion and Conclusion”.

Related Work

Significant advancements have been made in NLP, particularly in tasks such as Question Answering, Sentiment Analysis, and Named Entity Recognition, due to the use of large pre-trained language models. In this work, we use post-trained BERT and attention for the ABSA task. BERT shows significant improvement over models using static encoding-based representations, and a BERT model post-trained on domain-specific datasets performs better still than standard BERT. Prior work on BERT, post-trained BERT, and attention mechanisms for ABSA is discussed in the following subsections.

BERT-Based ABSA

Li et al. [18] investigated the use of BERT-derived contextualized embeddings for ABSA tasks and designed neural baselines for the problem. The SemEval laptop and restaurant datasets were used in this work. The BERT output is combined with a GRU, self-attention (SAN), a Transformer layer, and a CRF for comparison. The results indicate that BERT-GRU excels on the laptop dataset while BERT-SAN is superior on the restaurant dataset. The authors also claim that a BERT-based model with simple linear layers can give comparable results.

Li et al. [19] introduced a new unified approach for the two subtasks of E2E-ABSA. They utilized two stacked recurrent networks: the first RNN finds the target boundary of the aspect, which the second RNN then uses to predict the unified tags as the final output. They explicitly modeled the constrained transition from the first task to the second to capture inter-task dependency and employed a gate mechanism for sentiment consistency that captures the relation between contiguous words. Experiments were performed on the SemEval laptop, restaurant, and Twitter datasets.

Hu et al. [20] proposed a span-based extract-then-classify framework for open-domain targeted sentiment analysis. Instead of using a sequence tagging scheme, they find all the spans containing aspect words and then use them for polarity detection. They investigated three approaches: pipeline, joint, and collapsed models. The framework is divided into two parts, a multi-target extractor and a polarity classifier, which handle the two subtasks. Experiments were performed on the SemEval laptop and restaurant datasets and a Twitter dataset; the pipelined method outperforms the other two.

Post-trained BERT for Downstream Tasks

Xu et al. [21] introduced a new task called Review Reading Comprehension (RRC), in which review sentences serve as a source of knowledge for answering user questions. They built the ReviewRC dataset from popular ABSA benchmarks and used BERT as the base model. Since the dataset is limited and standard BERT lacks domain-specific knowledge, they proposed a post-training approach: joint post-training injects both domain and task knowledge and improves BERT's performance on RRC. Experiments showed that the post-training technique also enhances performance on aspect extraction and aspect sentiment classification compared with standard BERT.

Xu et al. [23] examined the hidden representations acquired from BERT for the ABSA task. The authors observe that masked language modeling (MLM) learns fine-grained features but treats every word/token equally: when learning aspect representations it focuses on aspect features rather than opinions, and vice versa. This proves advantageous for feature extraction but not for sentiment classification, and many end-task examples are required to map the BERT feature space. They conclude that alternative learning tasks besides MLM are required, since MLM's major drawback is its equal treatment of all words; being a sentiment word or an aspect word does not affect MLM. Better aspect representations may be obtained by grouping reviews of the same item, giving the model a clue about the item's aspects, while sentiment representations may be obtained by considering ratings.

To bridge the gap between standard language models (such as ELMo and BERT) and domain understanding, Xu et al. [22] designed a language model that captures domain knowledge guided by the end tasks. It combines a standard language model trained on large, mixed-domain datasets with low-resource domain-specific knowledge. The authors introduced DomBERT, a language model that extends BERT by incorporating domain knowledge, facilitating domain-knowledge-enhanced language models with low resources. Results on the SemEval laptop and restaurant datasets showed that incorporating domain knowledge into BERT improves performance across various tasks.

Attention-Based ABSA

Song et al. [25] introduced an attentional encoder network (AEN) that employs attention mechanisms rather than the traditional RNN architecture to capture the interrelationship between aspect and context. The model has two attention layers: an attentional encoder layer and a target-specific attention layer. The attentional encoder layer encodes the interactions between the words of a sentence; it uses multi-head attention (MHA) to analyze the context and the context-aware aspect, followed by a point-wise convolution transformation (PCT) applied identically to each token. The target-specific attention layer employs another multi-head attention mechanism to acquire a context representation tailored to the target. Label smoothing regularization (LSR) is used to address label unreliability. Experiments were conducted on three datasets with three sentiment polarities (positive, negative, and neutral): the SemEval-14 restaurant and laptop datasets and the ACL-14 Twitter dataset. Pre-trained BERT was also employed in this work.

Wu et al. [26] used an attention mechanism to generate representations of aspect and context for aspect-based sentiment analysis. The proposed model employs attention for sequence modeling, ensuring that the context and aspect are aware of each other during modeling. This eliminates the need to model aspect and context separately and yields an interactive aspect-context representation. The model can handle multiple aspects and sentiments in a given sentence. Experiments were conducted on the Twitter and SemEval-14 datasets.

Ma et al. [15] divided the input sentence into aspect and context, utilizing LSTMs to learn the sequential representation of the aspect target separately from its context and vice versa. They proposed the Interactive Attention Network (IAN) to determine the mutual significance of context and aspect terms and create two separate representations. Together, these representations reflect the target aspect and its surroundings, enabling identification of the sentiment orientation of the specified aspect. The authors designed a variety of models for performance comparison. The idea behind the model is to extract important information from the aspect and the context independently and then use the combined representation for sentiment identification. Experimental results on the SemEval-14 dataset indicate that IAN effectively captures relevant features and provides crucial information for sentiment classification.

Ambartsoumian et al. [27] investigated the efficacy of Self-Attention Networks (SAN) for various sentiment analysis tasks. Experiments on six datasets demonstrated the superior performance of SAN, which outperforms RNN and CNN variants and is competitive on benchmark metrics including training time, memory consumption, and accuracy. Several modifications, such as the number of heads and sequence position information, were explored. The experiments used word embeddings obtained from the Word2Vec algorithm. This work emphasizes the importance of relative position representations and claims they perform better than other variants of position encodings.

BERT is trained on a general-purpose dataset; it provides context-aware representations but lacks domain-specific knowledge. BERT post-trained on domain-specific datasets can effectively capture domain knowledge, producing word representations specific to that domain. This motivates us to use post-trained BERT instead of standard BERT for contextualized representation. Moreover, the core of BERT is self-attention, which aims to learn the importance of each word with respect to every other word; each word is treated equally, regardless of whether it is an aspect word, so BERT cannot exploit the explicit aspect information fed to the model. This motivates us to apply an attention mechanism on top of the self-attention-based contextualized representation.

Proposed Work

This section will provide a brief overview of the components of our model, followed by an in-depth explanation of our proposed model.

BERT Model

BERT [17] is the encoder component of the Transformer [24] architecture, designed for natural language processing. The input is transformed into a contextualized representation through a self-attention mechanism. In contrast to earlier language models such as Word2Vec [28], GloVe [16], and ELMo [29], BERT learns representations by considering the context from both directions, hence the name bidirectional. BERT uses a fine-tuning approach that makes it well suited to end tasks.

Unlike Word2Vec, which generates static embeddings, BERT considers the context and relates each word (token) to all other words (tokens) in the given sentence; hence it creates dynamic, context-aware embeddings. BERT uses a multi-headed attention mechanism to build a context-based representation of every word in the input sentence, followed by residual connections and layer normalization to obtain the final representation.
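For illustration, the following generic sketch (not the authors' code) shows the scaled dot-product self-attention that underlies BERT's multi-head attention, where Q, K, and V are the query, key, and value projections of the token representations.

```python
# A generic sketch of scaled dot-product self-attention, the core operation of
# BERT's multi-head attention; Q, K, V are (n_tokens, d_k) projections.
import torch

def scaled_dot_product_attention(Q, K, V):
    d_k = K.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5   # pairwise token similarities
    weights = torch.softmax(scores, dim=-1)         # how much each token attends to the others
    return weights @ V                              # context-aware token representations
```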

Along with token embeddings, BERT takes segment embeddings and position embeddings as input. The segment embedding distinguishes the two sentences given as input. Also, since BERT eliminates the recurrence required in previous sequential models and processes all words simultaneously, it uses position embeddings to sense the relative positions of words in the given sentence.

BERT has two standard configurations, viz. BERT-base and BERT-large, as shown in Table 1.

Table 1 Parameters of BERT

In this work, we have used bert-base-uncased and bert-large-uncased (here uncased signifies that words are lowercased before tokenization).
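As a small illustration (not part of the proposed model), the following sketch loads the uncased tokenizer with the Hugging Face transformers library and shows that the input is lowercased before WordPiece tokenization; the example sentence is our own.

```python
# Minimal sketch: loading bert-base-uncased and inspecting its WordPiece output.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

sentence = "Excellent screen but the battery life could be better"
tokens = tokenizer.tokenize(sentence)   # input is lowercased first ("uncased")
print(tokens)
# ['excellent', 'screen', 'but', 'the', 'battery', 'life', 'could', 'be', 'better']
```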

Post-trained BERT Model

The BERT model was trained on the Wikipedia and BookCorpus datasets, which contain information from various domains. This is beneficial for learning word vectors, since words become aware of many possible contexts during training. However, the training data for a downstream task like ABSA belongs to a specific domain. Post-training the BERT model injects this domain knowledge so that words become aware of domain-specific context; it equips the language model, originally trained on diverse and extensive general-purpose datasets, with the domain expertise it otherwise lacks. In this study, we utilized a post-trained BERT model trained on a combination of Amazon and Yelp datasets [21].
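For illustration, the sketch below continues masked-language-model (MLM) training on a plain-text file of domain reviews (reviews.txt, a hypothetical file name) using the Hugging Face Trainer; the hyperparameters are placeholders, and the actual post-trained model of [21] follows its own recipe.

```python
# A minimal sketch of domain post-training via continued MLM training.
from transformers import (BertTokenizerFast, BertForMaskedLM,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)
from datasets import load_dataset

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

# reviews.txt: one domain review per line (hypothetical file).
reviews = load_dataset("text", data_files={"train": "reviews.txt"})["train"]
reviews = reviews.map(lambda b: tokenizer(b["text"], truncation=True, max_length=128),
                      batched=True, remove_columns=["text"])

collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
args = TrainingArguments(output_dir="bert-post-trained",
                         per_device_train_batch_size=32,
                         num_train_epochs=3,          # placeholder value
                         learning_rate=5e-5)          # placeholder value
Trainer(model=model, args=args, train_dataset=reviews, data_collator=collator).train()
```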

IAN Model

Ma et al. [15] created the Interactive Attention Network (IAN) to capture the impact of context on aspect terms and vice versa. GloVe word vectors are fed to an LSTM to obtain a context-aware representation; the context and aspect are then separated, and attention mechanisms are applied. The two representations obtained from the different attention mechanisms are concatenated, and finally a dense layer with three nodes (corresponding to positive, negative, and neutral sentiment) and a nonlinear activation function is applied for sentiment classification.

In our research, we have applied an interactive attention mechanism that facilitates the exchange of information between the aspect and context vectors, both of which were derived from post-trained BERT.

Proposed Model

In this study, we have integrated post-trained BERT with interactive attention to determine the sentiment polarity of a specified aspect within a sentence. The design of our model is illustrated in Fig. 1.

Fig. 1 Proposed architecture of the IAN-BERT model

For a given sentence and aspect, the BERT model first processes the sentence to obtain a context-aware representation. The context and aspect representations are then separated, and attention is applied to these two representations to determine their contribution to each other. Finally, the resulting vectors are concatenated, and a sentiment classification layer determines the sentiment polarity of the aspect in the given sentence. The following subsections describe these steps in detail.

Contextualized Representation Using BERT

In this work, we have used BERT and post-trained BERT to obtain the contextualized representation of input words/tokens. Unlike earlier language models, BERT focuses on the entire input sentence to compute the vector representation of tokens. Given an input sentence and aspect as

$$\begin{aligned} S&= [w_1,w_2 ,\dots ,w_n] \end{aligned}$$
(1)
$$\begin{aligned} A&= [a_1,a_2 ,\dots ,a_m] \end{aligned}$$
(2)

where \(A \in S\), n and m are sentence and aspect length, respectively. The input vector fed to BERT is represented as

$$\begin{aligned} E = [e_1,e_2 ,\dots ,e_n] \end{aligned}$$
(3)

where

$$\begin{aligned} e_t = T_t + S_t + P_t \end{aligned}$$
(4)

Here, \(T_t\) represents the token embedding, \(S_t\) the segment embedding, and \(P_t\) the position embedding corresponding to the input token \(w_t\). The L-layer transformer processes the input vector E to produce refined context-aware token features. Specifically, the final representation of the input after passing through all L layers of the transformer is given in Eq. 5:

$$\begin{aligned} H_L = {\text {Transformer}}_L(E) \end{aligned}$$
(5)

For a more detailed understanding of how BERT works, readers should refer to [24]. We consider \(H_L = [h_1,h_2,\dots ,h_n]\) as the context-aware representations of the input tokens and utilize them for prediction in subsequent tasks.
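As a hedged illustration of Eq. 5, the following sketch obtains \(H_L\) from the last transformer layer using the Hugging Face transformers library; the standard bert-base-uncased checkpoint is used here for concreteness, and the post-trained checkpoint can be substituted by changing the model path.

```python
# Minimal sketch (assumed, not the authors' code): extracting H_L = [h_1, ..., h_n].
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

sentence = "Excellent screen but the battery life could be better"
inputs = tokenizer(sentence, return_tensors="pt")  # token, segment, position embeddings handled internally

with torch.no_grad():
    outputs = model(**inputs)

H_L = outputs.last_hidden_state   # shape: (1, n_tokens, 768), the context-aware token representations
```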

Attention Mechanism

In this step, the context-aware word representations obtained from the BERT model are used to ascertain the contribution of context words to aspect terms and vice versa. The contextualized representation is separated into context and aspect. Suppose the input sentence has n tokens and the aspect has m tokens, with the aspect starting at the \(k^{th}\) position in the sentence. Then we extract two representations as follows:

$$\begin{aligned} C&= [h_1, \dots , h_{k-1},h_{k+m}, \dots , h_{n}] \end{aligned}$$
(6)
$$\begin{aligned} A&= [h_k,h_{k+1}, \dots , h_{k+m-1} ] \end{aligned}$$
(7)

Two additional representations are acquired by taking the average of the aspect and context representations:

$$\begin{aligned} C_\textrm{avg}&= \frac{\sum _{i=1}^{k-1} h_i + \sum _{i=k+m}^{n} h_i}{(n-m)} \end{aligned}$$
(8)
$$\begin{aligned} A_\textrm{avg}&= \frac{\sum ^{k+m-1}_{j=k} h_j}{m} \end{aligned}$$
(9)

We leverage the information from Eqs. 6 to 9 and use an attention mechanism to identify the key information for determining the sentiment polarity of the aspect in the sentence. Our approach considers both the impact of the aspect on the context and the impact of the context on the aspect, providing a deeper understanding of the relevant sentiment features. We apply the attention mechanism to a pair of context and aspect representations, as shown in Fig. 1. The attention mechanism generates an attention vector \(\alpha _i\) from the aspect representation \(A_\textrm{avg}\) and the context word representations \(C_i\), as expressed below:

$$\begin{aligned} \alpha _i = \frac{\textrm{exp}(\beta (C_i,A_\textrm{avg}))}{\sum _{j=1}^{n-m} \textrm{exp}(\beta (C_j,A_\textrm{avg}))} \end{aligned}$$
(10)

where \(\beta\) is a score function that signifies how much the context word \(C_i\) attends to the aspect \(A_\textrm{avg}\). The score function \(\beta\) is calculated as

$$\begin{aligned} \beta (C_i,A_\textrm{avg}) = \textrm{tan}h(C_i.W_a.A_\textrm{avg}^T + b_a) \end{aligned}$$
(11)

where \(W_a\) and \(b_a\) are trainable parameters.

Similarly, the attention vector \(\gamma _i\) is calculated by taking into account both the average context representation \(C_\textrm{avg}\) and the aspect representation \(A_i\) as follows:

$$\begin{aligned} \gamma _i = \frac{\textrm{exp}(\beta (A_i,C_\textrm{avg}))}{\sum _{j=1}^{m} \textrm{exp}(\beta (A_j,C_\textrm{avg}))} \end{aligned}$$
(12)

The context and aspect representations are derived from Eqs. 13 and 14, respectively, as

$$\begin{aligned} C_r&= \sum _{i=1}^{n-m} \alpha _i C_i \end{aligned}$$
(13)
$$\begin{aligned} A_r&= \sum _{i=1}^{m} \gamma _i A_i \end{aligned}$$
(14)

Finally, the aspect representation \(A_r\) and context representation \(C_r\) are combined into a single vector S, and the final representation is obtained as

$$\begin{aligned} S = \textrm{Concat}(A_r,C_r) \end{aligned}$$
(15)
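The sketch below is an illustrative PyTorch implementation of Eqs. 6 to 15 (not the authors' code). It assumes a single, unbatched sentence, a 0-indexed aspect span starting at position k with length m, and separate score-function parameters for the two attention directions (the paper's notation could also be read as sharing them).

```python
import torch
import torch.nn as nn

class InteractiveAttention(nn.Module):
    """Hedged sketch of the interactive attention of Eqs. (6)-(15)."""

    def __init__(self, hidden_dim):
        super().__init__()
        # Trainable parameters of the score function beta (Eq. 11), one set per direction.
        self.W_c = nn.Parameter(torch.empty(hidden_dim, hidden_dim))
        self.W_a = nn.Parameter(torch.empty(hidden_dim, hidden_dim))
        self.b_c = nn.Parameter(torch.zeros(1))
        self.b_a = nn.Parameter(torch.zeros(1))
        nn.init.xavier_uniform_(self.W_c)
        nn.init.xavier_uniform_(self.W_a)

    @staticmethod
    def _score(X, W, avg, b):
        # beta(X_i, avg) = tanh(X_i . W . avg^T + b), Eq. (11)
        return torch.tanh(X @ W @ avg.unsqueeze(-1) + b).squeeze(-1)

    def forward(self, H, k, m):
        # Eqs. (6)/(7): split context C and aspect A out of H = [h_1, ..., h_n]
        C = torch.cat([H[:k], H[k + m:]], dim=0)       # (n - m, d)
        A = H[k:k + m]                                  # (m, d)
        C_avg, A_avg = C.mean(dim=0), A.mean(dim=0)     # Eqs. (8), (9)

        alpha = torch.softmax(self._score(C, self.W_c, A_avg, self.b_c), dim=0)  # Eq. (10)
        gamma = torch.softmax(self._score(A, self.W_a, C_avg, self.b_a), dim=0)  # Eq. (12)

        C_r = (alpha.unsqueeze(-1) * C).sum(dim=0)      # Eq. (13)
        A_r = (gamma.unsqueeze(-1) * A).sum(dim=0)      # Eq. (14)
        return torch.cat([A_r, C_r], dim=-1)            # Eq. (15): S = Concat(A_r, C_r)
```

Given the representation \(H_L\) from the previous step, the module returns the concatenated vector S of size 2d, which is fed to the sentiment classification layer described next.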

Sentiment Classification

The vector S is passed through a fully connected layer with three output neurons, each corresponding to a sentiment class: positive, negative, and neutral. Softmax is applied to find the output probability distribution over the sentiment classes for the given aspect:

$$\begin{aligned} O = \textrm{Softmax}(W*S + b) \end{aligned}$$
(16)

where parameters W and b are trainable.

The loss function utilized in this work is cross-entropy, represented by Eq. 17:

$$\begin{aligned} L = - \sum ^c_{i=1} Y_i \textrm{log}(O_i) \end{aligned}$$
(17)

where \(Y_i\) is the \(i^{th}\) component of the one-hot ground-truth polarity vector and c is the number of polarity classes.
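As a short illustration of this classification step, the sketch below applies the linear layer and softmax of Eq. 16 and the cross-entropy loss of Eq. 17 in PyTorch; the hidden size 768 (bert-base) is assumed, and nn.CrossEntropyLoss applies log-softmax internally, so the logits are passed directly.

```python
# Hedged sketch of the sentiment classification layer and loss.
import torch
import torch.nn as nn

num_classes = 3                                  # positive, negative, neutral
classifier = nn.Linear(2 * 768, num_classes)     # S = Concat(A_r, C_r) has size 2 * hidden_dim

S = torch.randn(1, 2 * 768)                      # placeholder for the concatenated representation
logits = classifier(S)                           # W*S + b
probs = torch.softmax(logits, dim=-1)            # Eq. (16): probability distribution O

y_true = torch.tensor([0])                       # index of the gold polarity class
loss = nn.CrossEntropyLoss()(logits, y_true)     # Eq. (17), computed from the logits
```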

Experiments and Results

This section describes the datasets used for the targeted sentiment analysis experiments, the parameter settings, and the model training, followed by the performance evaluation.

Dataset

We conducted experiments using three publicly available datasets: the SemEval-14 Restaurant and Laptop datasets [30] and the Multi-Aspect Multi-Sentiment (MAMS) dataset [31]. The SemEval-14 dataset includes labeled customer reviews for the Laptops and Restaurants categories. The MAMS dataset contains review sentences with at least two aspects. In all three datasets, the sentiment labels fall into three predefined classes: positive, negative, and neutral. The statistics of the datasets are given in Table 2.

Table 2 Statistics of dataset

Model Hyperparameters

Experiments were performed on three review datasets. We evaluated the “bert-base-uncased” and “bert-large-uncased” pre-trained models and the “bert-base-uncased” post-trained model. The number of encoder layers in the pre-trained and post-trained “bert-base-uncased” models is 12, while the “bert-large-uncased” model has 24. The learning rate is 2e-5, and the batch size is set to 32 for both the Laptop and Restaurant datasets. We train the model for 25 epochs. Following the settings outlined in Table 3, we train five models with different random seeds and report the average results.

Table 3 Hyperparameters
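A hedged sketch of this seed-averaging protocol is given below; the seed values and the choice of the Adam optimizer are assumptions rather than reported details, and build_model, run_epoch, and evaluate are placeholders for the surrounding model construction, training, and evaluation code.

```python
# Hedged sketch of averaging results over random seeds (lr 2e-5, batch 32, 25 epochs).
import random
import numpy as np
import torch

def average_over_seeds(build_model, run_epoch, evaluate,
                       seeds=(13, 42, 7, 99, 2023), epochs=25, lr=2e-5):
    scores = []
    for seed in seeds:
        random.seed(seed); np.random.seed(seed); torch.manual_seed(seed)
        model = build_model()                                   # placeholder constructor
        optimizer = torch.optim.Adam(model.parameters(), lr=lr)  # optimizer choice assumed
        for _ in range(epochs):
            run_epoch(model, optimizer)                          # one pass over the training set
        scores.append(evaluate(model))                           # accuracy or F1 on the test set
    return sum(scores) / len(scores)                             # average over seeds
```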

Evaluation Metrics

For performance evaluation, accuracy and F1-score were used for the targeted sentiment analysis task. Accuracy measures the proportion of correct predictions among all predictions, as given in Eq. 18:

$$\begin{aligned} \textrm{Accuracy}=\frac{\textrm{number}\ \textrm{of}\ \textrm{correct}\ \textrm{predictions}}{\textrm{total}\ \textrm{predictions}} \end{aligned}$$
(18)

F1-score is defined as the harmonic mean of precision and recall as given in Eq. 19:

$$\begin{aligned} \textrm{F1}-\textrm{score}=\frac{2\times \textrm{Precision}\times \textrm{Recall}}{\textrm{Precision}+\textrm{Recall}} \end{aligned}$$
(19)
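Both metrics can be computed, for example, with scikit-learn; in the sketch below, the macro averaging of the F1-score over the three polarity classes is an assumption shown for illustration.

```python
# Minimal sketch of computing the two reported metrics.
from sklearn.metrics import accuracy_score, f1_score

y_true = [0, 1, 2, 1, 0]   # example gold polarities (0=positive, 1=negative, 2=neutral)
y_pred = [0, 1, 1, 1, 0]   # example model predictions

acc = accuracy_score(y_true, y_pred)              # Eq. (18)
f1 = f1_score(y_true, y_pred, average="macro")    # Eq. (19), averaged over classes (assumption)
```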

Performance Evaluation

Variation of Our Model

We conducted experiments using various BERT variants, namely bert-base-uncased, bert-large-uncased, and post-trained BERT. In Table 4, we present six different models; the first three (bert_base, bert_large, and bert_base_PT) use bert-base-uncased, bert-large-uncased, and post-trained BERT, respectively, to generate contextualized representations. To obtain the aspect vector, we apply an aspect mask over the contextualized representation and then use the mean of the aspect token vectors to determine the sentiment orientation.
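A minimal sketch of this aspect-mask mean pooling, assuming a boolean mask marking the aspect tokens:

```python
# Hedged sketch: mean-pool the contextualized vectors of the aspect tokens.
import torch

def aspect_mean(H, aspect_mask):
    # H: (n_tokens, hidden_dim) contextualized representation from BERT
    # aspect_mask: (n_tokens,) boolean tensor, True at aspect positions
    return H[aspect_mask].mean(dim=0)   # aspect vector fed to the sentiment classifier
```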

On the other hand, the last three models (IAN-BERT_BB, IAN-BERT_BL, and IAN-BERT_BB_PT) apply interactive attention over the contextualized representations obtained from the different BERT variants (bert-base-uncased, bert-large-uncased, and post-trained BERT, respectively). The interactive attention produces the aspect and context vectors, which are concatenated and passed through a softmax layer for polarity detection.

Our findings demonstrate that bert-base achieved the lowest accuracy and F1-score on the Restaurant and MAMS datasets, whereas bert-large showed the lowest accuracy on the Laptop dataset. The combination of post-trained BERT and interactive attention performed best on all three datasets, as depicted in Table 4.

Table 4 Different variations of the proposed model for various datasets

Base Models

In this section, we discuss the base models used for comparison with our method.

  1. TD-LSTM [11] employs two LSTMs that simultaneously analyze the context on either side of the targeted aspect. The output of these LSTMs is combined to form the final representation, which is then used for sentiment classification.

  2. ATAE-LSTM [14] uses an attention-based LSTM for sentiment analysis of aspect terms and aspect categories. The target embedding is combined with the hidden states of the LSTM to calculate the attention weights.

  3. MemNet [32] treats ABSA as a question answering task and combines the context word vectors by summing linearly transformed aspect vectors with the attention output.

  4. IAN [15] resorts to an attention mechanism that interactively learns the scores between the target word and its context. These attention scores are then used to create a context-aware target representation and a target-aware context representation.

  5. RAM [33] utilizes a 2-layer BiLSTM to generate memory, incorporating position information to produce a customized memory. It employs a recurrent attention network to extract sentiment features related to the designated target.

  6. MGAN [34] employs fine-grained and coarse-grained attention for aspect term and aspect category sentiment classification. Fine-grained attention is used for word-level interaction between aspects and their context.

  7. CDT [35] employs a BiLSTM over the input and further learns the aspect representation using a Graph Convolutional Network (GCN) over the dependency tree.

  8. AOA-MultiACIA [26] employs interactive attention between aspect and context for classifying the aspects present in the input. Multiple groups of key and value pairs of aspect and context are used to generate the aspect and context representations, respectively. The result is obtained by combining these two representations.

  9. AEN-BERT [25] uses an attention mechanism to model the interaction between aspect and context for targeted sentiment analysis and incorporates label smoothing regularization to address the issue of fuzzy labels.

  10. BERT-SPC [25] feeds the input sentence and aspect as a sentence pair to pre-trained BERT. Aspect classification into predefined classes is done using the pooled embedding.

Table 5 Accuracy of various techniques on the SemEval-14 dataset

Result Analysis

Table 5 demonstrates that our IAN-BERT_BB_PT model outperforms the other models on the Laptop, Restaurant, and MAMS datasets. These datasets contain domain-specific terms that require specialized word embeddings, which are more effectively captured by post-trained BERT models trained on relevant sources such as the Amazon and Yelp datasets. Furthermore, our model leverages interactive attention between the aspect and context vectors, enhancing the quality of the representations and improving aspect classification accuracy. Hence, our IAN-BERT_BB_PT model provides superior performance on these challenging datasets thanks to its ability to leverage domain-specific knowledge and attention mechanisms.

BiLSTM models outperform LSTM models by capturing the complete contextual information for every word. Consequently, models using an LSTM, such as TD-LSTM and ATAE-LSTM, exhibit inferior performance compared to those using a BiLSTM, such as RAM and IAN. BERT-based models, in turn, outperform both the BiLSTM and attention models owing to their deep bidirectional contextualized representations. Therefore, models that incorporate BERT, such as AEN-BERT, IAN-BERT_BB, IAN-BERT_BL, and IAN-BERT_BB_PT, demonstrate superior performance compared to the other models.

Standard BERT vs Post-trained BERT

The idea behind applying post-trained BERT is to capture domain-specific knowledge in the word (token) representations, which standard BERT, trained on general datasets, lacks. Table 6 illustrates instances where standard BERT (IAN-BERT_BB) falls short compared with post-trained BERT (IAN-BERT_BB_PT). The table shows sentences and the sentiment polarities predicted by both models. It can be observed that, compared with BERT, the model with post-trained BERT correctly predicts the sentiment even for sentences containing multiple aspect terms.

Table 6 Result comparison of BERT(IAN-BERT_BB) vs post-trained BERT(IAN-BERT_BB_PT)

Attention vs Interactive Attention

As a transformer model, BERT uses self-attention to learn the contextualized representation. With self-attention, every word or token is given equal consideration, and an effort is made to understand the influence of a word or token on the other words or tokens in the sentence. In the ABSA task, we have both a sentence and an aspect, and the goal is to determine the sentiment toward the aspect within the sentence. To exploit this additional aspect information, we have added interactive attention on top of the post-trained contextualized word representation.

Discussion and Conclusion

This work integrates interactive attention with state-of-the-art context-sensitive representations to perform aspect-based sentiment analysis. Post-trained BERT, trained on Amazon and Yelp datasets, is used instead of generalized BERT trained on the Wikipedia and BookCorpus datasets. An interactive attention mechanism between aspect and context is then applied on top of the domain-knowledge-enhanced, context-aware BERT representation. Experiments were performed on the SemEval-14 Laptop, Restaurant, and MAMS datasets, and the results demonstrate superior performance compared with existing works. Applying graph attention over BERT would be interesting future work.