1 Introduction

Aspect-Based Sentiment Analysis (ABSA) focuses on identifying the aspect terms explicitly mentioned in sentences and detecting the sentiment polarities of the aspect-terms [1]. For example, in the review sentence “Tasty food but the service was slow!”, the user mentions two aspect-terms, namely, “food” and “service”, and conveys positive sentiment about the first, and negative sentiment for the second. Generally, the ABSA task can be broken into two sub-tasks: aspect-terms extraction and aspect-term sentiment classification. Aspect-term extraction aims to identify the aspect(s) mentioned in the text, and it has been broadly studied in [2,3,4]. The second sub-task, i.e., aspect-term sentiment classification, enhances the extracted aspect-term(s) usefulness by detecting its sentiment polarity. This sub-task has been also extensively studied in [5,6,7].

A unified approach that integrates both the subtasks has been adopted by previous researchers [8,9,10] to enhance the performance of ABSA. Despite the effectiveness of unified methods, we argue that most of the previous works have not given enough attention to the generalization and robustness of the model. A learned model is expected to perform well on unseen test examples and should be able to combat adversarial samples, which are created by adding small perturbations to the original inputs [11]. These adversarial samples are un-noticeable to human judges and can mislead the neural networks to incorrect predictions. For example in the original review text “Finally, I got sick of the badservice,obnoxious smirks, and snotty back talk.” the word “bad” can be replaced with “terrible” to generate a semantically coherent adversarial example “Finally, I got sick of the terrible service,obnoxious smirks, and snotty back talk.”. A highly generalized and robust unified ABSA model is expected to detect aspect-term as “service” with associated negative sentiment in both original and adversarially generated review text. Adversarial training makes the neural network robust to such examples and helps the model generalize better.

Given the above point, we formulate the end-to-end ABSA as a single sequence labelling task with a unified tagging schemeFootnote 1 and propose a novel BERT-Based Interactive Learning with Ensemble Adversarial Training (BILEAT) for the same. BILEAT is a multi-layer unified framework that handles ABSA end-to-end, along with two auxiliary tasks AE and OE. Performance of main ABSA task improves by exchanging clues between AE and OE auxiliary tasks. We further add adversarial examples in an ensemble way to improve the generalization and robustness of our model.

The adversarial attack in BILEAT is two-folded. First, we build an adversarially post-trained White-Box Domain Knowledge BERT (WBDK-BERT) to efficiently capture the context-dependent meaning of the word in a sentence. WBDK-BERT is built by doing post-training of BERT [12] on the masked language model (MLM) task using a domain-specific dataset. During post-training, white-box adversarial training [13] is applied that augments the standard training objective with an additional term to maximize the adversarial loss via applying perturbation in the embedding space. BILEAT utilizes WBDK-BERT to generate a representation of words in a given sentence.

Second, to further enhance the robustness and reliability of our proposed model, we generate adversarial examples using a black-box [14] technique. We utilize the potential of BERT to create adversarial examples that are grammatically correct and semantically in line with the original input. Such adversarial examples can fool the neural network. Our method is inspired by Li et al. [15], but it differs from them in two ways: (1). we consider only the aspect and opinion terms in the sentence for replacement (2). And apply our scoring function along with BERT-MLM in a semantic-preserving way to produce substitutes for these words. As a perturbation generator, we use the masked language model and choose perturbations that maximize the likelihood of making the wrong prediction [11] by the model for the given sequence of words in a sentence. Our proposed model is trained using the original dataset along with generated adversarial examples.

The main contributions of our work can be summarized as follows:

  • We propose a novel, BERT-Based Interactive Learning with Ensemble Adversarial Training (BILEAT) model for a unified ABSA task. BILEAT does interactive learning to understand the mutual relation between AE and OE auxiliary tasks and uses a domain-specific white-box adversarially trained WBDK-BERT built by us to generate context-aware word embeddings. Further, we apply a black-box attack to create fluent and semantically coherent adversarial examples. BILEAT is trained using both original inputs and such adversarial examples in a combined way.

  • For a unified ABSA task, we create grammatically correct and semantically coherent adversarial datasets, which will be helpful for future research work.

  • We do an ablation study of BILEAT for evaluating the impact of interactive learning between AE and OE, the usefulness of WBDK-BERT, and the effect of combined training of original inputs and generated adversarial examples.

  • We utilize above mentioned adversarial test datasets to evaluate the robustness of various methods. Experimental results show BILEAT outperforms state-of-the-art methods. To the best of our knowledge, we are the first to perform such a detailed study about the robustness of unified ABSA methods. Our experimental results can serve as a benchmark for future research.

The rest of this paper is organized as follows, after discussing related work in Section 2, we present a detailed description of our proposed model, in Section 3. In Sections 4 and 5, we discuss the details of our extensive experiments and do the analysis of results. Finally, we summarize our work in Section 6.

2 Related work

Existing Aspect-Based Sentiment Analysis approaches are broken down into two subtasks: Aspect Extraction (AE) [16,17,18,19,20] and Aspect-level Sentiment Classification (ASC) [5, 21,22,23,24,25]. The former refers to detecting aspect terms in a sentence, while the latter refers to detecting a review sentence’s sentiment polarity towards a given aspect. These approaches have been studied extensively in previous works. Most existing methods solving the ASC assume that the aspects are already mentioned with the review sentence, which limits the practical use of such methods. One way to employ these methods in practical settings is to use them in a pipelined manner. However, treating these tasks in a pipelined approach leads to error propagation across subtasks giving us poor results.

Some studies [8, 10] have a unified modelling approach to handle the above tasks in an end-to-end manner. These methods are modelled as sequence labelling tasks which fall into two types: collapsed tagging and joint training. The former uses shared features for each subtask, whereas the latter uses a multi-task learning framework that uses shared and private features. Li et al. [10] have identified auxiliary tasks such as boundary guidance and sentiment consistency for joint modelling. These tasks guide their model to learn the unified tagging scheme for a review sentence. He et al. [26] have tried to model the interaction using a message passing mechanism. They learn semantically related tasks (such as aspect-level sentiment classification and document-level sentiment classification) through joint training to get better results. However, these methods do not model the interactions between the subtasks to their full potential. Li et al. [27] have used contextual word embeddings instead of GloVe or Word2Vec embeddings to get better context-aware representations. Chen et al. [28] interact the semantic information between the encoded features generated from AE, ASC, and OE (opinion extraction) to enhance the sub-modules through mutual knowledge transfer. Liang et al. [29] have further introduced document-level sub-tasks (mainly domain classification and document-level sentiment classification) to infuse document-level information for enhancing the performance of the aspect-extraction sub-tasks. Luo et al. [30] have proposed a method called GRACE that uses post-trained BERT and applies a gradient harmonized method with virtual adversarial training to solve the ABSA adopting a cascaded labelling approach. Mao et al. [31] proposed a joint training framework that constructs machine reading comprehension tasks to solve AE, OE, and ASC problems using BERT-MRC models with parameter sharing. Lee et al. [32] have proposed a unified model for completing ABSA tasks by interacting signals between ATE and OE tasks. They also use self-supervised strategies such as pairwise relation masking, which help the model to better exploit the relations between aspects and opinions at a sentence level.

2.1 Adversarial training

Developing robust deep neural models for natural language processing continues to be a long-standing real-world problem. Attackers develop examples for inputs that can flip the prediction, thereby decreasing the model’s accuracy. Adversarial training can enhance robustness, but past works have shown that it also affects generalization. There have been several studies for adversarial attacks on continuous data. In general, adversarial attacks are of two types (1). white-box and (2). black-box. In white-box attacks [13] model parameters can be accessed, while black-box attacks [14] work without accessing model parameters and only uses the input and output. However, generating adversarial examples for text continues to be a challenging task.

  • White Box Attack: Xu et al. [33] proposed TextTricker for targeted and non-target attacks on classification model. These attacks have been implemented using two ways: loss-based and gradient-based. Liu et al. [34] introduce Adversarial training for Large Neural Language Models, an algorithm that regularizes and improves both generalization and robustness of a deep neural network. Karimi et al. [35] add perturbations using gradients of the loss function to the encoded inputs and generate adversarial examples.

  • Black Box Attack: Previous studies for generating adversarial examples rely on introducing error at the character level [36] or adding/deleting word [37] in a sentence. However, the added perturbations may result in a grammatically incorrect sentence, hence easily identifiable by a human. Rule-based approaches have been shown to come up with more natural-looking sentences. However, these approaches rely on external tools such as POS Tagger, NER Tagger, WordNet, etc., and do not generate semantically coherent sentences. Pruthi et al. [38] predict each word’s correct substitution for all possibly misspelled words in a sentence using some back-off strategies. The predictions are passed to the downstream tasks for further training. Recent studies have used language models for adding perturbations to sentences. Li et al. [15] use BERT-MLM for getting word substitutions in a sentence. The examples generated have word substitutions that are context-aware, and the overall sentences are semantically coherent. Following their work to extract important words from input sentences, Hofer et al. [39] use character-level adversarial attacks, which are inconspicuous to human observers. These attacks include replacing characters with visually similar-looking symbols, adding misspellings and irrelevant punctuation marks in a sentence.

Previous works have adopted various effective approaches to solve the ABSA in a unified way. We have presented the summary of the same in the Table 1. However, these previous works are effective but have not given enough focus on the generalization and robustness of the model. A learned model is expected to perform well on unseen test examples and should be able to combat adversarial samples, which are created by adding small perturbations to the original inputs. Our proposed model, BILEAT, utilize interactive learning between auxiliary tasks to produce a collaborative signal and uses a domain-specific and white-box adversarially trained WBDK-BERT built by us to generate context-aware word embeddings. Further, we utilize BERT-MLM and apply a black-box attack to create fluent and semantically coherent adversarial examples. BILEAT is trained using both original inputs and such adversarial examples in a combined way, which makes our proposed model highly generalized and robust for the unified ABSA task.

Table 1 Summary of previous works related to ABSA

3 Our method

We formulate unified ABSA as sequence labelling problem and use a unified tagging scheme \(\mathcal {Y}=\{B\)-POS, I-POS, B-NEG, I-NEG, B-NEU, I-NEU, O}, which consists of 7 tags. Each tag except O contains information about aspect-term and its associated sentiment. For example B-POS denotes beginning of an aspect-term with positive sentiment. For a given a sentence S = {w1,w2,....,wN}, our ultimate goal is to is to predict a tag sequence Yu = {y1,y2,....,yN}, where \({Y_{i}^{u}} \in \mathcal {Y}\).

3.1 Proposed model

In this section, we describe the architecture of BERT-Based Interactive Learning with Ensemble Adversarial Training (BILEAT). BILEAT is a highly generalized and robust model that uses both white-box [13] and black-box [14] adversarial training in a combined way. In white-box attacks, adversarial examples are generated by accessing model parameters, while black-box attacks create such examples using only the input and output without accessing model parameters. As illustrated in Fig. 1, BILEAT contains white-box adversarially trained domain knowledge BERT (WBDK-BERT), word encoding layer, interactive learning layer, adversarial perturbations generated through a black-box attack, and the objective function to be optimized. The ultimate goal is to solve ABSA in a unified way.

Fig. 1
figure 1

The proposed BILEAT framework

3.1.1 WBDK-BERT: white-box adversarially trained domain knowledge BERT

BERT is a pre-trained language representation model, which consists of a 12-layer bidirectional Transformer encoder [42]. Xu et al. in [43] have shown that inducing domain-specific information to BERTBASE boosts the performance of ABSA. Following their work, we take BERTBASE and perform post-training using domain-specific datasets, where the aim of the standard objective function is to minimize the standard error on training data with the training objectives derived from the self-supervision MLM task.

In general, the training algorithm aims to learn a function f(w;β): \(w \rightarrow V\), parametrized by β. For MLM task, V is the vocabulary, and f(w;β) attempts to predict the masked token u. During post-training, V becomes the task-specific label set, and f(w;β) acts as the classifier. Given a training dataset \(\mathcal {D}\) consisting of input-output pairs (w;u) and the loss function l(:;:) (e.g., cross entropy), f(w;β) is trained to minimize the standard loss:

$$ \min_{\beta} \mathbb{E}_{(w,u) \sim \mathcal{D}} [l(f(w; \beta), u) ] $$
(1)

Liu et al. [34] have shown that white box adversarial training for large language models improves the generalization and robustness in downstream tasks. Inspired by their work, we build our white-box adversarially trained domain knowledge BERT (WBDK-BERT). To build WBDK-BERT, we randomly choose perturbation δ from a normal distribution and add the same to the embedding level. Subsequently, using the perturbated embedding, we compute adversarial gradient gadv.

$$ g_{adv} \leftarrow \nabla_{\delta}l(f(w;\beta), f(w+\delta; \beta)) $$
(2)

The motive behind adding the perturbation δ to the embedding level is to generate such an adversarial example that can maximize the adversarial loss of the model. Hence, we find the optimal value of δ by moving in the direction of increasing loss through the gradient ascent.

$$ \delta \leftarrow {\varPi}_{||\delta||_{\infty} \leq \epsilon} (\delta + \eta g_{adv}) $$
(3)

Here, 𝜖 is the upper bound of perturbation δ.

Using the obtained value of perturbation δ adversarial examples are created. Finally, for MLM task we use both original input and adverserial examples and during training, we compute the overall loss of the model by combining the standard objective loss and adverserial loss:

$$ \min_{\beta} \mathbb{E}_{(w,u) \sim \mathcal{D}} [l(f(w; \beta), \!u) + \alpha \max_{\delta} l(f(w + \delta; \beta),\! f\!(w;\! \beta))] $$
(4)

Finally, to minimize the oveall loss, the model parameters β are updated using the global learning rate τ through gradient descent.

$$ g_{\beta} \leftarrow \nabla_{\beta}(f(w; \beta), y) + \alpha \nabla_{\beta}l(f(w; \beta), f(w+\delta; \beta)) $$
(5)
$$ \beta \leftarrow \beta - \tau g_{\beta} $$
(6)

We have summarized the adversarial training of WBDK-BERT in Algorithm 1.

figure a

3.1.2 Word encoding layer

Our proposed model employs WBDK-BERT for generating contextual word representations. For a sentence \(S = [w_{1}, w_{2}, \dots , w_{N}]\), which consists of N words, (\(w_{1}, w_{2}, \dots , w_{N}\)) is passed to WBDK-BERT, to obtain hidden representations \(H \in \mathbb {R}^{N \times d}\) for each word.

$$ H = WBDK\texttt{-}BERT(S) $$
(7)

Here, d is the size of the hidden dimension of WBDK-BERT, and HiH is the hidden representation of ith word of the sentence S.

3.1.3 Interactive learning

We pass H to two different linear layers to learn the two separate word representations corresponding to auxiliary tasks AE and OE.

$$ H^{a} = H{W_{a}^{1}} $$
(8)
$$ H^{o} = H{W_{o}^{1}} $$
(9)

Here, \({W_{a}^{1}}\) and \({W_{o}^{1}}\) are trainable parameters. \(H^{a} \in \mathbb {R}^{N \times d}\) and \(H^{o} \in \mathbb {R}^{N \times d}\) are the two learnt representations of the words in the given sentence S. Subsequently, Ha is passed to Softmax layer to predict the probabilities \(\hat {Y}^{a}\in \mathbb {R}^{N \times 3}\) of {BA, IA, OA} tags for AE task. Likewise, Ho is passed to to Softmax layer to predict the probabilities \(\hat {Y}^{o}\in \mathbb {R}^{N \times 3}\) of {BO, IO, OO} tags for OE task.

$$ \hat{Y}^{a} = Softmax\left( H^{a}{W_{a}^{2}}\right) $$
(10)
$$ \hat{Y}^{o} = Softmax\left( H^{o}{W_{o}^{2}}\right) $$
(11)

Here, \({W_{a}^{2}}\) and \({W_{o}^{2}}\) are trainable weights.

Usually, aspect-term and opinion-term are strongly correlated, and interaction between these two tasks can exchange important clues about the unified ABSA task. Hence, to learn the non-linear interactions between aspect and opinion words we define a score function Qi,j:

$$ Q_{i,j} = {H^{a}_{i}} W_{c} ({H^{o}_{j}})^{T} * \frac{1}{|i - j|} $$
(12)

Here, Wc is a trainable weight matrix, which learns non-linear interactions between aspect and opinion words. We argue that aspect term and its corresponding opinion term occur in closer proximity. Thus, the second term in the (12) shows that scores are inversely proportional to the number of words between each other. Moreover, we define Qi,i = 0 since a word cannot be both aspect and opinion word at the same time.

We make use of the above score function Qi,j and generate an interaction matrix \(\mathcal {A}^{a|o}\) to capture the contribution of jth word from OE-oriented features to the ith word in the AE-oriented features.

$$ \mathcal{A}_{i,j}^{a|o} = \frac{exp\left( Q_{i,j} * \hat{Y}^{o}_{j, \{BO, IO\}}\right)}{{\sum}_{j=1}^{n} exp\left( Q_{i,j} * \hat{Y}^{o}_{j, \{BO, IO\}}\right)} $$
(13)

Where, the term \(\hat {Y}^{o}_{j, \{BO, IO\}}\) is the sum of probabilities of BO and IO output tag which denotes the predicted probability that the j-th token is part of any opinion term.

Similarly, we define interaction matrix \(\mathcal {A}^{o|a}\) to determine the contribution of jth word from AE-oriented features to the ith word in the OE-oriented features.

$$ \mathcal{A}_{i,j}^{o|a} = \frac{exp\left( Q^{T}_{i,j} * \hat{Y}^{a}_{j, \{BA, IA\}}\right)}{{\sum}_{j=1}^{n} exp\left( Q^{T}_{i,j} * \hat{Y}^{a}_{j, \{BA, IA\}}\right)} $$
(14)

Here, \(\hat {Y}^{a}_{j, \{BA, IA\}}\) in (13) is the sum of probabilities of BA and IA output tag which is associated with the predicted probability that the j-th token is part of any aspect term.

Now, using the interaction matrices \(\mathcal {A}^{o|a}\) we compute overall opinion representation of a word wi with respect to each aspect word in the (14) and likewise \(\mathcal {A}^{a|o}\) is used to compute overall aspect representation of the same word with respect to each opinion word in the (13).

$$ X_{i}^{o|a} = \sum\limits_{j=1}^{n} \left( \mathcal{A}^{o|a}_{i,j} * {H^{a}_{j}}\right) $$
(15)
$$ X_{i}^{a|o} = \sum\limits_{j=1}^{n} \left( \mathcal{A}^{a|o}_{i,j} * {H^{o}_{j}}\right) $$
(16)

At last, we compute final representation \(G_{i} \in \mathbb {R}^{n \times 4d}\) of word wiS in the following way:

$$ G_{i} = {H^{a}_{i}} \oplus X_{i}^{o|a} \oplus {H^{o}_{i}} \oplus X_{i}^{a|o} $$
(17)

Here, ⊕ is the concatenation operation. We use G for predicting unified labels for ABSA task.

$$ \hat{Y}^{u} = Softmax(G W_{g}) $$
(18)

Here, Wg is a trainable weight.

3.1.4 Objective function

For every word in a sentence, we compute the loss for each of the three tasks (Unified, AE, and OE) using multi-margin loss or hinge loss l as given in (19)

$$ l = \sum\limits_{i; i \neq c}^{|\mathcal{Y}|} max(0, margin -{p}_{c} + {p}_{i}) $$
(19)

Here, margin is a hyperparameter, pc is the prediction logit of the correct label \(c \in \mathcal {Y}\) and pi is predicted logit of a wrong label. Standard loss \({\mathscr{L}}\) of our model is calculated by summing loss of unified ASBA lu, loss of AE la, and loss of OE lo.

$$ \mathcal{L} = l_{u} + \lambda (l_{a} + l_{o}) $$
(20)

Here, λ is a hyperparameter that controls the contribution of loss of auxiliary task in the overall loss \({\mathscr{L}}\).

In addition, we also built our model using cross-entropy loss and compared the model performance with hinge loss in Fig. 2b and a. This study shows that hinge loss helps better convergence of loss, which is getting translated in the performance of respective variants of BILEAT. Hence, we choose hinge loss over cross-entropy for building our model.

Fig. 2
figure 2

Performances of our proposed model

3.1.5 Adversarial examples generation using black-box attack

In this step, we apply black-box adversarial training to make our model highly generalized and robust. Unlike the white-box attack, the black-box attack doesn’t have access to the model parameters, i.e., it treats the model as a black box. The attack is only allowed to query the model on input and retrieve the prediction probabilities. Using this prediction output, the attack tries to craft an adversarial example. We generate quality adversarial examples by transferring the perturbations from another model. Ensuring changes are unnoticeable to human judges yet capable of fooling the neural network, maintaining grammatical fluency and semantic consistency with original inputs.

Recently Li. et al. [15] have utilized BERT-MLM to generate the fluent and semantically consistent adversarial examples. Our adversarial sentence generation method is inspired by their work, but it differs from them in two ways: (1) We consider only the aspect and opinion terms to be vulnerable or important words in the given sentence (2) We replace these important words in the sentence using BERT-MLM by applying our scoring function.

We take the original input sentence S and pass it to BERT-MLM for finding the replacement candidates for important words IWS and then generating semantically coherent replacements.

$$ P = BERT{\text -}MLM(S) $$
(21)
$$ C_{i} = TopK(Filter(P_{i})) $$
(22)

BERT-MLM provides replacement probability score \(P \in \mathbb {R}^{N \times |V|}\) where V is the vocabulary set used by BERT. We take all replacement candidates PiP of an important word IWiIW and pass it to Filter and TopK function in a sequence, where first all stops words, punctuation, and antonyms are removed and then top K words are selected as replacement candidates Ci based on its probability score.

We replace important words IWiIW of the sentence S with replacement candidates Ci, after each replacement, a modified sentence is created. We compute the semantic similarity score between each pair of the original sentence and a modified sentence using a Universal Sentence Encoder USE [44]. Modified sentences that carry a similarity score of more than a pre-defined threshold sim-threshold are considered as candidate adversarial examples for the sentence S. We pass such candidate adversarial examples one by one to a pre-trained unified model \({\mathscr{M}}\) to calculate Score by taking the sum of predicted probabilities corresponding to true labels of each important word. The lower Score associated with a candidate adversarial example indicates its better ability to fool the model’s prediction. Hence, among candidate adversarial examples, we consider the one which has the lowest Score as an adversarial example for the sentence S.

Previous work [45] has shown that perturbations transferred from other models help the original model to become more robust to black-box attacks on image data. Motivated by this, we choose a strong baseline model BERT+Linear [46] and use it as a pre-trained unified model \({\mathscr{M}}\) in the above-mentioned black-box attack to generate adversarial examples. Algorithm 2 describes the steps involved in the generation of adversarial examples.

figure b

3.1.6 Adversarial training

We take original input and its corresponding adversarial example to train BILEAT, where the standard loss \({\mathscr{L}}\) for original input is calculated as mentioned in (20) and on the similar lines adversarial loss \({\mathscr{L}}_{adv}\) is calculated for the adversarial example. Total loss \({\mathscr{L}}_{t}\) of BILEAT is defined as follows:

$$ \mathcal{L}_{t} = \mathcal{L} + \gamma \mathcal{L}_{adv} $$
(23)

Where, γ is a hyperparameter that controls the contribution of adversarial loss \({\mathscr{L}}_{adv}\). During training, model loss \({\mathscr{L}}_{t}\) is minimized.

4 Experiments

4.1 Datasets

We conduct experiments using Laptop and Restaurant review datasets taken from SemEval ABSA challenges. These datasets are re-prepared by Li et al. in [10]. The laptop dataset is prepared using SemEval ABSA challenge 2014 [47], which contains a train-test split same as the original dataset. Restaurant review dataset is union of SemEval ABSA challenge 2014, 2015 and 2016. The training dataset of Restaurant is created by merging training dataset of three years, and a similarly testing dataset is also built. For both the datasets, we take 10% randomly held-out of training data as the development set. Table 2 presents details of the datasets.

Table 2 Details of the laptop & restaurant datasets

4.2 Baseline methods

We compare the performance of BILEAT with two groups of baseline. The first group has those models, wherein results are either copied or reproduced using original code from other papers.

  • CRF-Unified (Mitchell et al. [48]): built this model by leveraging hand-crafted linguistic features with CRF to perform the sequence labeling task using a unified approach.

  • NN+CRF-Unified (Mitchell et al. [48]): is an improved version of CRF-Unified, where target word embedding and context word embeddings are concatenated, and also hand-crafted linguistic features are used with CRF.

    We have taken the results of the above two models from [9].

  • LSTM+ CRF: is the standard LSTM model that uses CRF layer for predicting the unified tags.

  • HAST-TNet (Li et al. [3, 23]): is a pipeline approach based on two state-of-the-art models HAST [3] and TNet [23] on the tasks of aspect-term extraction and aspect sentiment classification respectively.

  • CMLA (Wang et al. [40]): is a multi-layer coupled-attention architecture. Each layer of CMLA has two coupled GRUs that performs aspect and opinion terms co-extraction.

  • E2E-TBSA (Li et al. [10]): is an end-to-end model that adopts unified tagging scheme to address complete ABSA task.

    We have taken results of HAST-TNet, and E2E-TBSA from [10].

  • IMN (He et al. [26]): jointly learns multiple related tasks simultaneously using an iteratively message passing architecture.

  • DOER (Luo et al. [9]): provides a framework that extracts aspect and its polarity simultaneously. It employs a dual RNN to extract the respective representation of each task, and a cross-shared unit help in understanding the relationship between each other.

  • E2E-Triplet-Unified (Peng et al. [41]): provides an end-to-end framework that extracts triplet in the form of aspect-term, its sentiment, and associated opinion word. In one of the versions of their model, they also adopt a unified tagging scheme to only extract aspect-term and its sentiment. We call that version E2E-Triplet-Unified.

    Results of CMLA, and E2E-Triplet-Unified are taken from [41].

  • BERT+Linear (Li et al. [46]): uses BERT to generate representations for tokens in a sentence, these representations are passes to a linear layer to address ABSA in a unified manner.

  • BERT+GRU (Li et al. [46]): applies a stacked architecture of BERT with GRU to solve ABSA in a unified manner.

  • BERT+SAN (Li et al. [46]): applies a stacked architecture of BERT with a self-attention network (SAN) to solve ABSA in a unified manner.

    We reproduce the results of BERT-Linear, BERT-GRU, and BERT-SAN using its original code.

  • GRACE (Luo et al. [30]): provides a multi-head attention architecture with virtual adversarial training that uses a gradient harmonized method. We have taken the result of this model from the respective paper.

    The second group has variations of our BILEAT model. These models are also used in ablation studies.

  • BILEAT w/o BBT: it doesn’t use black-box adversarial training.

  • BILEAT w/o WBDK-BERT: uses DK-BERT instead of white-box domain knowledge BERT (WBDK-BERT).

  • BILEAT w/o BBT & WBDK-BERT: Both black-box adversarial training and WBDK-BERT are not used. For token representation generation DK-BERT is used in this model.

  • BILEAT w/o BBT, WBDK-BERT& DK-BERT: Black-box adversarial training, WBDK-BERT, and DK-BERT are removed. BERTBASE is used to generate the token representation in this model.

  • BILEAT w/o BBT, WBDK-BERT, DK-BERT & A/O: aspect-to-opinion interaction is switched off in this model, and also Black-box adversarial training, WBDK-BERT and DK-BERT are removed. BERTBASE is used to generate the token representation in this model.

  • BILEAT w/o BBT, WBDK-BERT, DK-BERT & O/A: opinion-to-aspect interaction is switched off in this model, and also Black-box adversarial training, WBDK-BERT and DK-BERT are removed. BERTBASE is used to generate the token representation in this model.

  • BILEAT w/o BBT, WBDK-BERT, DK-BERT, A/O & O/A: both opinion-to-aspect and aspect-to-opinion interactions are switched off in this model, and also Black-box adversarial training, WBDK-BERT, and DK-BERT are removed. BERTBASE is used to generate the token representation in this model.

4.3 Evaluation metrics

We evaluate the performance of a model using Precision (P), Recall (R), and F-score (F), which means that extracted aspect is considered to be correct when it exactly matches with the gold standard span of the mentioned aspect and its corresponding sentiment.

4.4 Settings

In order to build post-trained WBDK-BERT, we use Amazon laptop reviews and Yelp Dataset Challenge reviews provided by [49] and perform post-training of BERTBASE. We do white-box adversarial training of WBDK-BERT by following [34], and set the gradient steps K to 1, the variance for initializing perturbation σ to 0.00001 and the step size η as 0.001. For generating a black-box adversarial example, we set sim-threshold to 0.8. BILEAT uses hidden size d as 768, learning rate of 4e-5 with Adam optimizer and batch size of 16 for both datasets. We train the model up to 2000 steps. After training for 1000 steps, we conduct model selection on the development set for every 100 steps using the F-score for comparison. For the auxiliary task of opinion term classification, we use the existing opinion lexiconFootnote 2 to provide opinion words. For the hinge loss function, we set the margin as 1, the loss contribution of auxiliary tasks λ is set to 0.1, and the loss contribution of adversarial example γ is set to 0.2.

5 Results & discussion

  • Generalization: We discuss the generalization aspect of our proposed model by comparing its performance with various baseline models. Table 3 shows that our proposed model BILEAT performs the best on both Restaurant and Laptop datasets. CRF-based models perform quite poorly among all the baseline models. The performance of CRF depends on the quality of handcrafted features; also, CRF, like traditional statistical models, focuses more on learning explicit features and is unable to learn implicit features efficiently. This could be the primary reason for the poor performance of CRF-Unified. However, the performance of another CRF-based model NN+CRF-Unified enhances slightly by utilizing the pre-trained word embeddings. LSTM+CRF uses LSTM to encode the meaning of a word in the input text by learning latent features efficiently and utilizing CRF for classification. This architecture makes LSTM+CRF a strong classifier and helps in performing better than CRF-Unified. HAST and TNet are attention mechanisms based on two different models on the tasks of aspect extraction and sentiment classification, respectively. HAST-TNet is built by integrating both strong models in a pipeline. Thus, HAST-TNet perform better than LSTM+CRF on both the datasets with a good margin. CMLA is a multi-layer coupled-attention architecture that helps it performing slightly better than HAST-TNet. Instead of relying on the pipeline approach, E2E-TBSA is built by using two stacked recurrent neural networks to explore the inter-task dependency and predict the aspect-term and its related sentiment in a unified way. Learning inter-task dependency using gate mechanism helps E2E-TBSA to perform better than HAST-TNet and CMLA. It shows that a nicely designed integrated model can be more effective than pipeline-based methods. Unlike conventional multi-task learning methods IMN jointly learns common features for the different tasks using an iteratively message passing architecture. This unconventional architecture enables IMN to perform better than E2E-TBSA. DOER and E2E-Triplet-Unified are a joint model that learns from the interaction between the two relevant tasks. Learning through mutual influence through a cross-shared unit may be one potential reason for both these models to perform better than IMN.

    Table 3 Precision, Recall and F score of experimental results on original test datasets

    Language models based on Transformer architectures are very powerful in understanding the contextual meaning of the word. Hence, BERT+Linear that uses BERT embeddings to encode the meaning of the word with a linear output layer surprisingly outperforms many existing strong and complex architecture-based baselines (i.e., CMLA, IMN, and E2E-TABSA), which does not use BERT embeddings. The architecture of BERT+GRU and BERT+SAN are also based on BERT but use more powerful output layers like GRU and Self-Attention-Network, respectively. Such output layers lead both these models to achieve better performance than BERT+Linear. GRACE uses domain-specific post-trained BERT and applies a gradient harmonized method along with virtual adversarial training. The domain-specific BERT embedding and virtual adversarial training enhances the performance of GRACE and helps in performing better than all the baselines including BERT+Linear, BERT+GRU, and BERT+SAN. Our proposed model BILEAT outperforms better than the strongest baseline model GRACE on both the datasets. F-score comparison shows that BILEAT performs better than GRACE by a margin of 2.36% and 1.98% on Laptop and Restaurant datasets, respectively.

  • Robustness: We compare the robustness of BILEAT and its variants with some strong baseline models against adversarial attacks. We evaluate the performance of models using Laptop and Restaurant adversarial test datasets (details are mentioned in Section 3.1.5) generated by us. Table 4 shows that GRACE performs the best among all the baselines. GRACE is built using a post-pretraining BERT and virtual adversarial training (VAT), which makes this model robust and helps in performing better than other baseline models e.g. (E2E-TABSA, BERT-Linear, BERT+GRU, and BERT+SAN). Comparison of F1-score shows BILEAT outperform the strongest baseline model GRACE) by a margin of 3.63% and 3.91% on the Laptop and Restaurant adversarial test dataset, respectively. Ensemble adversarial training is the primary reason that makes BILEAT such a robust model and helps it to perform better than all baselines.

In terms of both generalization and robustness, the better performance of BILEAT over various baselines can be attributed to the following reasons:

  • BILEAT utilizes interactive learning between AE and OE auxiliary tasks to produce a collaborative signal;

  • BILEAT uses a domain-specific and white-box adversarially trained WBDK-BERT built by us to generate more effective context-aware word embeddings;

  • By applying a black-box attack, quality adversarial examples are generated, BILEAT is trained using both original inputs and such adversarial examples in a combined way by doing task-specific fine-tuning of WBDK-BERT. This combined training makes BILEAT more robust and generalized.

Table 4 Precision, Recall and F-score of experimental results on adversarial test datasets

5.1 Ablation study

To understand the effectiveness of different key components in improving the generalization and robustness of BILEAT, we conduct the ablation study.

In order to do an ablation study for generalization, we sequentially remove each component one after another and obtain six simplified variants. The second block of Table 3 has the results of all different variants of BILEAT. The result shows that each O/A and A/O component are individually contributing to improving the performance. However, when both these components are combined, the performance gets enhanced to a large extent. Result analysis also reveals that both white-box and black-box adversarial training individually contribute to improving the performance, and when both these training are combined, the performance gain is even better.

For a robustness ablation study, we evaluate the performance of our proposed model on the adversarial test datasets by sequentially removing both white-box and black-box adversarial components one after another and obtain three variants. The second block of Table 4 has the results of all different variants of BILEAT. The result shows the inclusion of white-box adversarial pre-trained WBDK-BERT, and black-box examples individually contribute to improving the robustness of BILEAT. Performance of BILEAT w/o BBT is marginally better than BILEAT w/o WBDK-BERT, which means the contribution of black-box examples is slightly larger than WBDK-BERT in enhancing the robustness of BILEAT. BILEAT performs better than both BILEAT w/o WBDK-BERT and BILEAT w/o BBT, which reveals combining WBDK-BERT with black-box adversarial examples further enhances the robustness of BILEAT.

5.2 Case study

In this subsection, we show the effectiveness of key components of BILEAT by presenting a case study. We pick some review sentences from our original datasets to study the contribution of important components in enhancing the performance of our proposed model. Table 5 presents the details of predicted aspect(s) and its related sentiment in a given sentence by various models. Actual aspects and their associated sentiment are shown in the bold italic font style under the Sentence column. In the first two sentences only BILEAT and BILEAT w/o BBT are able to make correct predictions. It shows that white-box adversarially post-trained WBDK-BERT contributes to improving the generalization of BILEAT. Similarly, the prediction results of the third and fourth sentences quantify the importance of black-box adversarial examples, as only BILEAT w/o WBDK-BERT and BILEAT make the correct prediction. In the result of the sixth sentence, expect BILEAT w/o WBDK-BERT & BBT all models make the correct prediction. It reveals that at the individual level, both WBDK-BERT and black-box training are sufficient to make the correct prediction. Result analysis of the last two sentences is very interesting as except BILEAT no other model is able to make the correct prediction. It exhibits that the combination of WBDK-BERT and black-box adversarial examples complement each other, and a combination of these two helps BILEAT to achieve better performance.

Table 5 Case study : Original Input sentences with predicted aspect-terms & its sentiment by various models

We also pick a few review sentences from generated adversarial datasets (refer Section 3.1.5 for details) to study the impact of key components in enhancing the robustness of our proposed model. Table 6 provides the details of adversarial examples, which are generated by replacing underlined blue text in the corresponding original sentence. This table also gives information about the predicted aspect(s) and related sentiment in the given adversarial example by various models. It is evident that generated adversarial examples are grammatically fluent, semantically coherent with the original sentence, and have misled the BILEAT w/o WBDK-BERT & BBT to make an incorrect prediction. It shows the quality of these adversarial examples. In the first two adversarial examples, except for BILEAT w/o WBDK-BERT & BBT, all other models have made a correct prediction. It shows the individual potential of WBDK-BERT and black-box adversarial examples in improving the robustness of our proposed model. The prediction results of the third and fourth adversarial examples exhibit the importance of black-box adversarial training, as only BILEAT w/o WBDK-BERT and BILEAT can make the correct prediction. Result analysis of fifth and sixth adversarial examples reveals the resultant effect of combining WBDK-BERT with black-box adversarial examples in our proposed model, as except BILEAT all other models make the wrong prediction. It shows the combined effect of WBDK-BERT and black-box adversarial examples in enhancing the robustness of BILEAT.

Table 6 Case study : Adversarial Input sentences with predicted aspect-terms & its sentiment by various models

5.3 Error analysis

In some of the review sentences of our original test datasets, BILEAT is unable to identify either the aspect terms or its associated sentiment correctly. We analyze those errors and classify the same into the following categories:

  • Multi-token aspect terms containing three or more words: Some review sentences contain aspect terms that have three or more words. For example, in the review sentence “I opted for the SquareTrade 3-Year Computer Accidental Protection Warranty $1500-2000 which also supports accidents like drops and spills that are NOT covered by AppleCare” aspect term is “SquareTrade 3-Year Computer Accidental Protection Warranty”.

  • Use of idioms in review sentence: Few review comments use idioms to express sentiment about aspect term. For example, in the comment “The two waitresses looked like they had been sucking on lemons” sentiment about aspect “waitress” is expressed using “sucking on lemons”.

  • Sentence without aspect & sentiment: There are some review sentences, which do not contain any aspect and its associated sentiment. For example, in the comment “Besides, the Apple stocks have been falling due to lack of sales”, no aspect term and associated sentiment exist, but “sales” is detected as aspect term with negative sentiment.

  • Implicit opinion expressed about aspect: Some review comments do not express direct sentiment towards the aspects. The example includes “Overall, I would go back and eat at the restaurant again”, where sentiment about the aspect “restaurant” is expressed implicitly.

6 Conclusion

In this work, we investigate the importance of interactive learning and the effectiveness of adversarial training for unified ABSA tasks. We build a white-box adversarially post-trained domain knowledge BERT (WBDK-BERT) and use the same to generate robust and contextualized embeddings in our proposed model. To enhance the sentence representation for unified ABSA, we introduced two auxiliary tasks. Interactive learning between these two tasks produces a collaborative signal that helps in improving the performance of our model. In order to make our model more generalized and robust, we generated adversarial examples using a black-box technique and trained our model using original inputs and such adversarial examples in a combined way. The experimental results show the superiority of our proposed model in terms of generalization and robustness compared to existing methods. Future work will focus on extending our proposed method to non-English languages. The multilingual language models mBERT [12], and XLM-R [50] will be used to investigate how cross-lingual transfer helps to solve unified ABSA tasks in multilingual settings.