1 Introduction

Aspect-based sentiment analysis [6, 27, 28, 48] is a fine-grained task in sentiment analysis [2, 38, 41, 43] whose goal is to predict the sentiment polarity (e.g., positive, neutral or negative) toward each specific aspect term in a given sentence.

Aspect-based sentiment analysis comprises two subtasks: aspect-category sentiment analysis and aspect-term sentiment analysis [48]. Fig. 1 presents a sample sentence. Aspect-category sentiment analysis targets a general entity category that is described only implicitly. For instance, in the sentence “The sushi is delicious but the waiter is very rude”, “sushi” describes the aspect category “food”, and “waiter” describes the aspect category “service”. The user expresses positive and negative sentiments toward the two aspect categories “food” and “service”, respectively. Aspect-term sentiment analysis targets specific entities that occur explicitly in a sentence. For the example sentence, the aspect terms are “sushi” and “waiter”, and the user expresses positive and negative sentiments toward them, respectively. In terms of aspect granularity, the aspect category is coarse-grained, while the aspect term is fine-grained.

Fig. 1 An example sentence with two instances of aspect-based sentiment analysis. The user expresses both positive and negative sentiments toward two aspect categories “food” and “service”, and expresses positive and negative sentiments toward two aspect terms “sushi” and “waiter”

Earlier studies introduced recurrent neural network (RNN) [4] models into aspect-based sentiment analysis due to their ability to flexibly capture the semantic relations between an aspect and its context words. However, not all the information in a sequence is important; therefore, an attention mechanism [47] was introduced into the RNN model so that the model pays more attention to the more important parts of the sequence. Gu et al. [11] proposed a position-aware bidirectional attention network (PBAN) based on the Bi-GRU model. PBAN not only concentrates on the positional information of aspect terms but also mutually models the relation between an aspect term and the sentence by employing a bidirectional attention mechanism. With the development of word embedding technology, the convolutional neural network (CNN) [14] has been widely applied to aspect-based sentiment analysis. On the one hand, a CNN can utilize word embeddings to map a sentence into a lower-dimensional semantic representation while also maintaining the sequence information of the words. On the other hand, a CNN can extract a local text representation. Xue et al. [39] proposed a model based on convolutional neural networks with gating mechanisms. First, the novel gated tanh-ReLU units selectively output the sentiment features according to a specified aspect or entity. Second, the model computation is easy to parallelize during training.

However, the above models ignore both syntactical constraints [29] and long-range sentiment dependencies [7], and they can mistakenly identify irrelevant contextual words as clues for judging aspect sentiment. For instance, in the sentence “Certainly not the best sushi in New York, however, it is always fresh, and the place is very clean and sterile”, the CNN model convolves only the information of consecutive words; thus it cannot judge the sentiment of nonadjacent words. Accordingly, it may mistakenly identify the phrase “not the best” as a clue for judging the aspect “sushi” but ignore the influence of the word “fresh” on the aspect “sushi”. In addition, these models usually use aspect-independent encoders to encode sentences, which could result in a lack of aspect information. In that same sentence, the words “sushi”, “best” and “fresh” are irrelevant for sentiment prediction when the considered aspect is “place”. The use of an aspect-independent encoder when encoding sentences can cause these words to be mistaken as clues for judging “place”, leading to an erroneous prediction. Therefore, we aim to use a graph convolutional network (GCN) [13] containing an aspect gate to address these shortcomings. A GCN can process data with a generalized topological graph structure and extract spatial features; therefore, it can update the feature information by capturing the long-range sentiment dependencies between adjacent nodes. Zhang et al. [45] were the first to apply a GCN to aspect-based sentiment analysis. Their model exploits the syntactical dependency structures within a sentence and resolves the long-range multiword dependency issue for aspect-based sentiment classification, but it still uses aspect-independent encoders to encode sentences, which can lead to a lack of aspect information.

In this paper, we propose the aspect-gated graph convolutional network (AGGCN) model for aspect-based sentiment analysis. First, we design an aspect gate based on a long short-term memory (LSTM) network that can guide the encoding of aspect-specific information from the outset while discarding aspect-independent information. Then, we generate a dependency tree based on aspect-specific information and construct a GCN on the dependency tree to fully capitalize on the syntactical information and long-range sentiment dependencies. Finally, we use a novel retrieval-based attention mechanism to obtain the hidden representation of the attention from the GCN to predict aspect sentiment. Experimental results on multiple SemEval datasets demonstrate the effectiveness of our proposed approach, and our model outperforms the strong baseline models.

Our main contributions can be summarized as follows:

  1. We propose an aspect-gate mechanism based on LSTM. The aspect gate can select an aspect-specific representation by controlling the token embedding transformation at each time step, which enables the LSTM to guide the encoding of aspect-specific information from the outset and discard aspect-independent information. This mechanism addresses the noise and bias problems caused by the weaker encoders used in previous models.

  2. We propose a novel aspect-based sentiment analysis framework that employs a GCN to capture syntactical information and long-range sentiment dependencies. The proposed framework enables the model to perceive context through the syntactical information and long-range sentiment dependencies, and uses a novel attention mechanism to obtain the hidden representation of the attention from the GCN. This framework can help identify irrelevant context words more accurately and avoid identifying them as clues for judging aspect sentiment.

  3. We evaluate our method on multiple SemEval datasets. The experiments show that our model achieves higher accuracy than most of the baseline models and outperforms the strong baseline models.

The remainder of this paper is organized as follows. After introducing the related works in Section 2, we elaborate our proposed model in Section 3, and then conduct experiments in Section 4. Finally, we summarize our work and provide an outlook of future work in Section 5.

2 Related works

Aspect-based sentiment analysis has become one of the most active research fields in natural language processing (NLP) and has spread from computer science to the social sciences [18, 20] and network sciences [16, 19], including marketing, finance, politics, communication, medical science and even history. Due to its clear commercial value, aspect-based sentiment analysis has attracted wide attention. Earlier aspect-based sentiment analysis models [10, 33] mostly relied on dictionary-based methods and traditional machine learning methods. However, training those models depended on the quality of the annotated datasets, and obtaining high-quality datasets requires substantial labor and cost.

With the in-depth study of deep learning technology, neural networks are now widely used in aspect-based sentiment analysis. RNN models [23, 25, 31] have been applied to aspect-based sentiment analysis because they can flexibly capture the semantic relationship between an aspect and its context words. Tang et al. [31] were the first to introduce an LSTM into aspect-based sentiment analysis. They developed two target-dependent LSTM models that automatically considered target information. However, not all the information in a sequence is important to an RNN; thus, the attention mechanism [22, 35, 37] was introduced into the RNN model to cause it to pay more attention to the more important parts of the sequence. Wang et al. [36] proposed an attention-based LSTM for aspect-level sentiment classification. The attention mechanism concentrated on different parts of a sentence as different aspects were taken as input. Recursive neural network (RecNN) models [1, 26, 30] were applied to aspect-based sentiment analysis to replace RNNs because RNNs were unsuited for the tree and graph structures that information contains. Dong et al. [8] were the first to apply a RecNN model to aspect-based sentiment analysis. They proposed the adaptive recursive neural network (AdaRNN), which adaptively propagated the word sentiments to targets based on the context and syntactic relationships between words. However, RecNN may suffer from syntax parsing errors [34, 42]. With the development of word embedding technology, CNNs [5, 39, 40] have been widely employed in aspect-based sentiment analysis due to their ability to extract both local and global representations. Fan et al. [9] proposed a novel convolutional memory network that incorporated an attention mechanism. Their model sequentially computed the weights of multiple memory units corresponding to multiple words and can capture both word and multi-word expressions in sentences for use in aspect-based sentiment analysis.

Because neural networks such as RNNs and CNNs perform unsatisfactorily on graph data, graph algorithms [15] have been introduced into NLP; in particular, GCNs have been widely used in aspect-based sentiment analysis. A GCN performs well on graph data containing rich relational information. A GCN possesses a multilayer architecture in which each layer encodes and updates the node representations in the graph using the features of the node’s immediate neighbors. Zhao et al. [46] proposed a novel aspect-level sentiment classification model based on GCNs that effectively captures the sentiment dependencies between multiple aspects in one sentence. The model first introduced a bidirectional attention mechanism with position encoding to model the aspect-specific representations between each aspect and its context words and then employed a GCN over the attention mechanism to capture the sentiment dependencies between the different aspects in one sentence.

3 Our proposed model

In this section, we describe the proposed AGGCN model for aspect-based sentiment analysis in detail. The AGGCN is shown in Fig. 2. Specifically, we first define the model notation in Section 3.1. Then, we introduce an aspect-gated LSTM in Section 3.2 and construct the GCN based on the output of the aspect-gated LSTM in Section 3.3. Next, we describe the use of a retrieval-based attention mechanism to generate an attention representation for sentiment analysis in Section 3.4. Finally, we present the model training process in Section 3.5.

Fig. 2 The framework of our proposed model. A special aspect gate is designed to guide the encoding of aspect-specific information from the outset. The model constructs a graph convolutional network on the dependency tree. In particular, we use position-aware transformation and GCN masking in the GCN to fully utilize the syntactical information and long-range sentiment dependencies. The position-aware transformation reduces noise and bias during the graph convolution process, and GCN masking perceives contexts around the aspect in a way that considers both syntactical information and long-range sentiment dependencies

3.1 Definition

First, we introduce some notation to facilitate the subsequent descriptions: \( S= \left \lbrace {x^{c}_{1}}, {x^{c}_{2}},\cdots ,{x^{c}_{n}} \right \rbrace \) denotes an input sentence, which contains a corresponding aspect \( X = \{ x^{a}_{k + 1 }, x^{a}_{k + 2 }, \cdots , x^{a}_{k + m} \} \) starting from the (k + 1)-th token. We embed each word token into a low-dimensional real-valued vector space [3] with an embedding matrix \( E \in {}\mathbb {R}^{d_{emp} \times |V|} \), where \(d_{emp}\) denotes the dimension of the word embeddings, and |V| indicates the number of words in the vocabulary.
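To make the notation concrete, the following minimal sketch locates the aspect span (k, m) in the example sentence from Section 1 and looks up toy embeddings. The whitespace tokenisation, the 4-dimensional random vectors, the dictionary standing in for the embedding matrix E, and the helper name `find_aspect` are all illustrative assumptions rather than part of the proposed model.

```python
import random

# Toy sentence and aspect; whitespace tokenisation is an assumption.
sentence = "the sushi is delicious but the waiter is very rude".split()
aspect = ["waiter"]

# Vocabulary and a random lookup table standing in for the embedding matrix E.
vocab = sorted(set(sentence))
d_emp = 4  # toy embedding dimension (the experiments use 300-dimensional GloVe)
rng = random.Random(0)
E = {w: [rng.uniform(-0.1, 0.1) for _ in range(d_emp)] for w in vocab}

def find_aspect(tokens, aspect_tokens):
    """Return (k, m) so that the aspect covers 1-based positions k+1 .. k+m."""
    m = len(aspect_tokens)
    for k in range(len(tokens) - m + 1):
        if tokens[k:k + m] == aspect_tokens:
            return k, m
    raise ValueError("aspect not found in sentence")

k, m = find_aspect(sentence, aspect)
n = len(sentence)
x_c = [E[w] for w in sentence]  # context embeddings x^c_1 .. x^c_n
x_a = [E[w] for w in aspect]    # aspect embeddings x^a_{k+1} .. x^a_{k+m}
print(n, k, m)  # 10 6 1
```

Here “waiter” is the seventh token, so k = 6 and m = 1, matching the convention that the aspect starts at the (k + 1)-th token.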

3.2 Aspect-gated LSTM

A conventional LSTM first discards irrelevant information via its forget gates, then it adds useful information through the input gate, and finally, it determines which information will be output through the output gate. Instead, we design an aspect gate in LSTM that selects aspect-specific representations by controlling the transformation of token embedding at each time step. At time step t, the hidden state \( {h_{t}^{c}} \) is formulated as follows:

$$ {h_{t}^{c}} = o_{t} \times \tanh \left( C_{t}\right) $$
(1)

where \( {h_{t}^{c}} \in \mathbb {R}^{2d_{h}}\) represents the hidden state vector at time step t from the bidirectional aspect-gated LSTM; \(d_{h}\) represents the dimension of the hidden state vector from the unidirectional aspect-gated LSTM; \(o_{t}\) is the output gate, which filters the cell state through a sigmoid layer and an element-wise multiplication. Each element of the sigmoid layer output is a real number between 0 and 1 that represents the weight of the corresponding information passing through. For example, 0 means “let no information pass” and 1 means “let all information pass”. We process the cell state \(C_{t}\) through tanh (obtaining a value between −1 and 1) and multiply it element-wise with the output of the output gate to obtain the hidden state \( {h_{t}^{c}} \). The formula for the cell state \(C_{t}\) is as follows:

$$ C_{t} = f_{t} \times C_{t-1} + i_{t} \times \tilde{C}_{t}^{a} $$
(2)

where \(f_{t}\) is the forget gate, which determines what information is discarded from the cell state. The forget gate outputs a value between 0 and 1 computed from \(h_{t-1}^{c}\) and \( {x_{t}^{c}} \), where 1 means “fully retained” and 0 means “completely discarded”. \(i_{t}\) represents the input gate, which determines how much new information is added to the cell state. We multiply the previous cell state \(C_{t-1}\) by \(f_{t}\) to discard the information deemed discardable. Then, we add the product of \(i_{t}\) and the candidate cell state \( \tilde {C}_{t}^{a} \) to obtain the new cell state \(C_{t}\). In particular, we place the aspect gate between the input gate \(i_{t}\) and the output gate \(o_{t}\). The aspect gate controls the transformation of aspect information together with the tanh function, and it plays a part in determining the candidate cell state \( \tilde {C}_{t}^{a} \).

The \( \tilde {C}_{t}^{a} \) is formulated as follows:

$$ \begin{array}{@{}rcl@{}} \tilde{C}_{t}^{a} &=&\tanh \left( W_{c} h_{t-1}^{c} + g_{t} \cdot \left( W_{g}{x_{t}^{c}} \right) \right) + l_{t} \cdot {H_{t}^{c}}\left( {x_{t}^{c}} \right) \\&+& g_{t} \cdot {H_{t}^{a}}\left( {x_{t}^{a}}\right) \end{array} $$
(3)

where \( g_{t} \) is the aspect gate, which is designed to guide the encoding of aspect-specific information from the outset. \( {x_{t}^{c}}\) and \( {x_{t}^{a}}\) represent the input word embedding and the aspect embedding at time step t. \(l_{t}\) [21] is the linear transformation gate for \( {x_{t}^{c}} \). \( {H_{t}^{c}} \) and \( {H_{t}^{a}} \) represent the linear transformations of the inputs \( {x_{t}^{c}} \) and \( {x_{t}^{a}} \), respectively, controlled by the linear transformation gate \(l_{t}\) and the aspect gate \(g_{t}\). Equations (1)–(3) show that the hidden state \( {h_{t}^{c}} \) is controlled by the previous cell state \(C_{t-1}\), \(\tilde {C}_{t}^{a} \), \(l_{t}\) and \(g_{t}\). The aspect gate structure alleviates the vanishing gradient problem because it provides a linear transformation path as a supplement between consecutive hidden states. \(W_{c}\) and \(W_{g}\) are weight matrices. The remaining terms, \(o_{t}\), \(i_{t}\), \(f_{t}\), \(l_{t}\), \(g_{t}\), \( {H_{t}^{c}}\left ({x_{t}^{c}} \right ) \) and \( {H_{t}^{a}}\left ({x_{t}^{a}} \right ) \), are formulated as follows:

$$ \begin{array}{@{}rcl@{}} o_{t}&= &\sigma \left( W_{o} \cdot \left[ h_{t-1}^{c}, {x_{t}^{c}}\right] + b_{o} \right)\\ i_{t}&= &\sigma \left( W_{i} \cdot \left[ h_{t-1}^{c}, {x_{t}^{c}}\right] + b_{i} \right)\\ f_{t}&= &\sigma \left( W_{f} \cdot \left[ h_{t-1}^{c}, {x_{t}^{c}}\right] + b_{f} \right)\\ l_{t}&= &\sigma \left( W_{l} \cdot \left[ h_{t-1}^{c}, {x_{t}^{c}}\right] + b_{l} \right)\\ g_{t}&= &Relu \left( W_{g} \cdot \left[ h_{t-1}^{c}, {x_{t}^{a}}\right] + b_{g} \right) \end{array} $$
(4)
$$ \begin{array}{@{}rcl@{}} {H_{t}^{c}}\left( {x_{t}^{c}} \right ) &= &W_{hc}{x_{t}^{c}}\\ {H_{t}^{a}}\left( {x_{t}^{a}} \right)& = &W_{ha}{x_{t}^{a}} \end{array} $$
(5)

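where σ represents the sigmoid function and ReLU is the rectified linear unit activation function. \(W_{o}\), \(W_{i}\), \(W_{f}\), \(W_{l}\), \(W_{g}\), \(W_{hc}\) and \(W_{ha}\) are weight matrices, and \(b_{o}\), \(b_{i}\), \(b_{f}\), \(b_{l}\) and \(b_{g}\) are bias vectors to be learned during training.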

In (3)–(5), the aspect gate \(g_{t}\) controls the nonlinear transformation of the input \( {x_{t}^{c}} \) under the guidance of the given aspect at time step t. Based on the current inputs \( {x_{t}^{c}} \) and \( {x_{t}^{a}} \) and the previous hidden state \( h_{t-1}^{c} \), the linear transformation gate \(l_{t}\) works in cooperation with the aspect gate \(g_{t}\) to control the linear transformation of the input. Therefore, the aspect gate can select an aspect-specific representation by controlling the token embedding transformation at each time step, which enables the LSTM to guide the encoding of aspect-specific information from the outset and discard aspect-independent information.
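One time step of Eqs. (1)–(5) can be sketched in plain Python as follows. This is a minimal, unidirectional illustration under several assumptions: the matrix that multiplies \( {x_{t}^{c}} \) inside the tanh of Eq. (3) is stored as `W_gx` to keep it apart from the gate matrix `W_g` of Eq. (4), the same aspect vector is fed at every time step, and the tiny dimensions, random initialisation and function names are hypothetical.

```python
import math
import random

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def add(*vs):
    return [sum(xs) for xs in zip(*vs)]

def mul(a, b):  # element-wise product
    return [x * y for x, y in zip(a, b)]

def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-x)) for x in v]

def relu(v):
    return [max(0.0, x) for x in v]

def tanh(v):
    return [math.tanh(x) for x in v]

def init_params(d_h, d_e, seed=0):
    """Random toy parameters; shapes follow Eqs. (3)-(5)."""
    rng = random.Random(seed)
    rand = lambda r, c: [[rng.uniform(-0.1, 0.1) for _ in range(c)] for _ in range(r)]
    p = {name: rand(d_h, d_h + d_e) for name in ("W_o", "W_i", "W_f", "W_l", "W_g")}
    p.update({name: [0.0] * d_h for name in ("b_o", "b_i", "b_f", "b_l", "b_g")})
    p["W_c"] = rand(d_h, d_h)
    p.update({name: rand(d_h, d_e) for name in ("W_gx", "W_hc", "W_ha")})
    return p

def aspect_gated_step(p, h_prev, C_prev, x_c, x_a):
    hx_c = h_prev + x_c  # concatenation [h_{t-1}, x_t^c]
    hx_a = h_prev + x_a  # concatenation [h_{t-1}, x_t^a]
    # Eq. (4): output, input, forget, linear-transformation and aspect gates.
    o = sigmoid(add(matvec(p["W_o"], hx_c), p["b_o"]))
    i = sigmoid(add(matvec(p["W_i"], hx_c), p["b_i"]))
    f = sigmoid(add(matvec(p["W_f"], hx_c), p["b_f"]))
    l = sigmoid(add(matvec(p["W_l"], hx_c), p["b_l"]))
    g = relu(add(matvec(p["W_g"], hx_a), p["b_g"]))
    # Eq. (3): candidate cell state, with the linear paths of Eq. (5).
    inner = add(matvec(p["W_c"], h_prev), mul(g, matvec(p["W_gx"], x_c)))
    C_tilde = add(tanh(inner),
                  mul(l, matvec(p["W_hc"], x_c)),
                  mul(g, matvec(p["W_ha"], x_a)))
    C = add(mul(f, C_prev), mul(i, C_tilde))  # Eq. (2)
    h = mul(o, tanh(C))                       # Eq. (1)
    return h, C

d_h, d_e = 3, 4
params = init_params(d_h, d_e)
h, C = [0.0] * d_h, [0.0] * d_h
rng = random.Random(1)
x_a = [0.5] * d_e  # toy aspect embedding, reused at every step (assumption)
for _ in range(5):
    x_c = [rng.uniform(-1.0, 1.0) for _ in range(d_e)]
    h, C = aspect_gated_step(params, h, C, x_c, x_a)
print(len(h), len(C))  # 3 3
```

A bidirectional version would run the same step over the reversed sequence and concatenate the two hidden states, which gives the \(2d_{h}\)-dimensional vector used in the text.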

3.3 Graph convolutional network

In Section 3.2, we obtain the output \( H^{c} = \{ {h_{1}^{c}},\ {h_{2}^{c}}, {\cdots } , h_{k+1}^{c}, \cdots , h_{k+m}^{c}, \cdots , {h_{n}^{c}} \} \) of the aspect-gated LSTM. Based on this \(H^{c}\) output, we first construct a syntactical dependency tree (Footnote 1) and convert each tree into its corresponding adjacency matrix A so that a GCN can model the dependency tree. Then, the GCN is executed in an L-layer convolutional fashion on top of the aspect-gated LSTM output, i.e., \(H^{0} = H^{c}\), to create context-aware nodes. Finally, the hidden representation of each node is updated through a graph convolution operation with a normalization factor [13]. The graph convolution was inspired by contextualized Graph Convolutional Networks [44] as shown below:

$$ {h_{i}^{l}} =Relu \left( \sum\limits_{j=1}^{n} A_{ij} W_{l} \ g_{j}^{l-1} / d_{i} + b_{l} \right) $$
(6)

where \( {h_{i}^{l}} \in \mathbb {R}^{2d_{h}} \) is the i-th token’s hidden representation at the l-th GCN layer, and \( g_{j}^{l-1} \in \mathbb {R}^{2d_{h} }\) is the j-th token’s representation evolved from the \( \left (l-1\right ) \)-th GCN layer. \( A \in \mathbb {R}^{n \times n } \) denotes the adjacency matrix with entries \(A_{ij}\). Specifically, following the idea of self-looping, each word is set adjacent to itself, i.e., the diagonal values of A are all ones. \(W_{l}\) is the weight matrix and \(b_{l}\) is the bias vector to be learned during training, and \( d_{i} = \sum \nolimits _{j=1}^{n}A_{ij}\) represents the degree of the i-th token in the tree.
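The adjacency construction and one graph convolution layer of Eq. (6) can be sketched as follows. The example treats the dependency tree as undirected, which is an assumption (the text does not state the edge direction), and the three-token tree, identity weight matrix and function names are purely illustrative.

```python
def build_adjacency(n, edges):
    """Adjacency matrix with self-loops from 0-based (head, dependent) pairs."""
    A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        A[i][i] = 1.0  # self-loop: every word is adjacent to itself
    for head, dep in edges:
        A[head][dep] = 1.0
        A[dep][head] = 1.0  # undirected edges (assumption)
    return A

def gcn_layer(A, G, W, b):
    """Eq. (6): h_i = ReLU(sum_j A_ij W g_j / d_i + b)."""
    n = len(A)
    # Apply the weight matrix to every node representation once.
    WG = [[sum(w * x for w, x in zip(row, g)) for row in W] for g in G]
    H = []
    for i in range(n):
        d_i = sum(A[i])  # degree of node i, including the self-loop
        agg = [sum(A[i][j] * WG[j][c] for j in range(n)) / d_i
               for c in range(len(b))]
        H.append([max(0.0, a + bc) for a, bc in zip(agg, b)])
    return H

# Toy tree over three tokens: token 1 is the head of tokens 0 and 2.
A = build_adjacency(3, [(1, 0), (1, 2)])
G = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W = [[1.0, 0.0], [0.0, 1.0]]  # identity, so the averaging is easy to follow
b = [0.0, 0.0]
H = gcn_layer(A, G, W, b)
print(H[0], H[2])  # [0.5, 0.5] [0.5, 1.0]
```

With the identity weight matrix, each output row is simply the degree-normalised average of the node and its neighbours, which shows how stacking L layers lets information travel L hops along the tree.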

Specifically, to reduce the noise and bias during the graph convolution process, we conduct a position-aware transformation [11, 17, 36] before \( {h_{i}^{l}} \) is input into GCN.

$$ {g_{i}^{l}} = p_{i} {h_{i}^{l}} $$
(7)
$$ p_{i}= \left\{\begin{array}{ll} |\tfrac{i-k-1}{n}| \quad &0<i< k+1\\ 0 \quad &k+1 \leq i \leq k+m\\ |\tfrac{k+m-i}{n}| \quad &k+m<i \leq n \end{array}\right. $$
(8)

where \(p_{i} \in \mathbb {R} \) is the position weight of the i-th token. The final hidden representation of the L-layer GCN is \(H_{L} = \left \lbrace {h_{1}^{L}},{h_{2}^{L}}, \cdots ,h_{k+1}^{L},\cdots ,h_{k+m}^{L},\cdots , {h_{n}^{L}}\right \rbrace \), where \({h_{t}^{L}}\in \mathbb {R}^{2d_{h}}\). Table 1 describes the above process.

Table 1 The formal pseudo-code for Graph Convolution is presented in Algorithm 1
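The position-aware transformation of Eqs. (7) and (8) can be sketched in a few lines. The code follows Eq. (8) exactly as printed, using 1-based token positions; the function names and the toy hidden vectors are illustrative assumptions.

```python
def position_weights(n, k, m):
    """Eq. (8) as printed, for 1-based token positions i = 1 .. n."""
    p = []
    for i in range(1, n + 1):
        if i < k + 1:            # tokens to the left of the aspect
            p.append(abs((i - k - 1) / n))
        elif i <= k + m:         # aspect tokens
            p.append(0.0)
        else:                    # tokens to the right of the aspect
            p.append(abs((k + m - i) / n))
    return p

def position_aware(H, k, m):
    """Eq. (7): scale each hidden vector h_i by its position weight p_i."""
    p = position_weights(len(H), k, m)
    return [[p_i * v for v in h] for p_i, h in zip(p, H)]

# Five tokens with a one-token aspect at position 3 (k = 2, m = 1).
print(position_weights(5, 2, 1))  # [0.4, 0.2, 0.0, 0.2, 0.4]
G = position_aware([[1.0, 1.0]] * 5, 2, 1)
```

The aspect token itself receives weight 0, so under this formula its own representation does not enter the next graph convolution directly and is instead rebuilt from its neighbours.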

3.4 Retrieval-based attention

In this section, we use a retrieval-based attention mechanism to generate an attention representation. This idea was derived from the aspect-specific graph convolutional network [45]. The retrieval-based attention mechanism retrieves significant features that are semantically relevant to the aspect words from the hidden state vectors and sets a retrieval-based attention weight accordingly for each context word.

We first add a masking mechanism on top of the GCN to mask out nonaspect words. This operation enables the model to perceive context through syntactical information and long-range sentiment dependencies.

$$ H_{Mask}^{L}= \left\{\begin{array}{ll} 0 \quad &0<t< k+1\\ {h_{t}^{L}} \quad &k+1 \leq t \leq k+m\\ 0 \quad &k+m<t \leq n \end{array}\right. $$
(9)

When the current word is not an aspect word, we set the value of \(H_{Mask}^{L}\) to 0. Conversely, when the current word is an aspect word, we use the value from (6). Table 2 describes the process of GCN Masking.

Table 2 The formal pseudo-code for GCN Masking is presented in Algorithm 2

Then, we produce the retrieval-based attention representation based on \(H^{c}\) and \( H_{Mask}^{L} \) and formulate it as follows:

$$ h_{R} = \sum\limits_{t=1}^{n} \alpha_{t}{h_{t}^{c}} $$
(10)
$$ \alpha_{t} = \frac{exp\left( \beta_{t}\right) }{ \sum \nolimits_{i=1}^{n}exp\left( \beta_{i} \right) } $$
(11)
$$ \beta_{t} = \sum\limits_{i=1}^{n} \left( {h_{t}^{c}}\right)^{\top} {h_{i}^{L}} = \sum\limits_{i=k+1}^{k+m} \left( {h_{t}^{c}}\right)^{\top} {h_{i}^{L}} $$
(12)

where hR is the retrieval-based attention representation, αt represents the attention weight, and βt is the attention-aware function to obtain the semantic correlation between the aspect and context.
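The masking of Eq. (9) and the attention of Eqs. (10)–(12) can be sketched together. This is a minimal illustration with toy two-dimensional vectors; the function names are hypothetical, and subtracting the maximum score before exponentiation is a standard numerical-stability step that does not change the result of Eq. (11).

```python
import math

def mask_non_aspect(H_L, k, m):
    """Eq. (9): zero every GCN output outside 1-based positions k+1 .. k+m."""
    return [h if k < t <= k + m else [0.0] * len(h)
            for t, h in enumerate(H_L, start=1)]

def retrieval_attention(H_c, H_L, k, m):
    H_mask = mask_non_aspect(H_L, k, m)
    # Eq. (12): beta_t = sum_i (h_t^c)^T h_i^L; masked rows contribute zero,
    # so only the aspect positions remain in the sum.
    aspect_sum = [sum(col) for col in zip(*H_mask)]
    beta = [sum(a * b for a, b in zip(h, aspect_sum)) for h in H_c]
    # Eq. (11): softmax over the scores.
    mx = max(beta)
    exps = [math.exp(b - mx) for b in beta]
    z = sum(exps)
    alpha = [e / z for e in exps]
    # Eq. (10): attention-weighted sum of the LSTM hidden states.
    h_R = [sum(a * h[c] for a, h in zip(alpha, H_c))
           for c in range(len(H_c[0]))]
    return alpha, h_R

H_c = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy LSTM outputs
H_L = [[5.0, 5.0], [2.0, 0.0], [7.0, 7.0]]  # toy GCN outputs
alpha, h_R = retrieval_attention(H_c, H_L, k=1, m=1)  # aspect is token 2
print([round(a, 3) for a in alpha])
```

Because only the aspect row of the GCN output survives the mask, tokens 1 and 3 (whose LSTM states align with it) receive equal and larger weights than token 2 in this toy case.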

Finally, we input the attention representation hR into the softmax layer for aspect-based sentiment analysis:

$$ y = softmax\left( W_{p}h_{R} + b_{p}\right) $$
(13)

where \( y \in \mathbb {R}^{|C|}\) is the predicted sentiment distribution, and \(W_{p} \in \mathbb {R}^{|C| \times 2d_{h}}\) and \(b_{p} \in \mathbb {R}^{|C|}\) are the trainable parameters. |C| is the number of sentiment labels.
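The prediction step of Eq. (13) amounts to a linear layer followed by a softmax. The sketch below uses hand-picked toy weights with one row per sentiment class; the label order and all numeric values are assumptions made only for illustration.

```python
import math

def softmax(z):
    mx = max(z)  # subtract the maximum for numerical stability
    e = [math.exp(v - mx) for v in z]
    s = sum(e)
    return [v / s for v in e]

def predict(W_p, b_p, h_R):
    """Eq. (13): y = softmax(W_p h_R + b_p), one row of W_p per class."""
    logits = [sum(w * x for w, x in zip(row, h_R)) + b
              for row, b in zip(W_p, b_p)]
    return softmax(logits)

LABELS = ["positive", "neutral", "negative"]  # order is an assumption
W_p = [[1.0, 0.0], [0.0, 0.0], [0.0, 1.0]]   # toy weights
b_p = [0.0, 0.0, 0.0]
y = predict(W_p, b_p, [2.0, -1.0])
print(LABELS[y.index(max(y))])  # positive
```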

3.5 Training of model

The purpose of model training is to optimize all the parameters so as to minimize the loss function. Our model is trained using the cross-entropy loss with an L2-regularization term, formulated as follows:

$$ loss = -{\sum\limits_{i}^{N}} y_{i}\log \left( \hat{y_{i}}\right) + \lambda \lVert \theta \rVert^{2} $$
(14)

where N is the number of samples in the dataset, \(y_{i}\) is the ground truth probability, and \(\hat {y_{i}}\) is the estimated probability of an aspect. λ is the L2-regularization factor and θ represents all the trainable parameters.
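As a worked illustration of Eq. (14), the sketch below computes the regularized cross-entropy for two toy samples. It assumes that each \(y_{i}\) is a one-hot distribution over the classes and that the inner product with \(\log \hat {y_{i}}\) sums over classes, which is the usual reading; the flat parameter list and the function name are hypothetical.

```python
import math

def cross_entropy_l2(y_true, y_pred, params, lam):
    """Eq. (14): summed cross-entropy plus lam times the squared L2 norm."""
    ce = 0.0
    for y, y_hat in zip(y_true, y_pred):
        # Only classes with non-zero ground-truth probability contribute.
        ce -= sum(t * math.log(p) for t, p in zip(y, y_hat) if t > 0)
    l2 = sum(w * w for w in params)
    return ce + lam * l2

y_true = [[1, 0, 0], [0, 0, 1]]
y_pred = [[0.5, 0.25, 0.25], [0.25, 0.25, 0.5]]
params = [1.0, 2.0]  # stand-in for all trainable parameters
loss = cross_entropy_l2(y_true, y_pred, params, lam=1e-5)
print(round(loss, 5))  # 1.38634
```

Both samples assign probability 0.5 to the correct class, so the data term is 2 ln 2 ≈ 1.38629, and the regularizer adds 1e-5 × (1 + 4) = 5e-5.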

4 Experiments

In this section, we first describe the datasets and experimental settings in Section 4.1. Then, we describe the baseline models in Section 4.2 and the experimental results and analyses in Section 4.3. Next, we provide a discussion of AGGCN in Section 4.4, and we present an ablation study in Section 4.5. Finally, we describe a case study in Section 4.6.

4.1 Datasets and experimental settings

To demonstrate the effectiveness of our proposed model, we conduct experiments on five datasets, namely Lap14, Rest14, Rest15, Rest16 and Twitter, which originate from SemEval 2014 task 4 (Footnote 2), SemEval 2015 task 12 (Footnote 3), SemEval 2016 task 5 (Footnote 4) and Twitter (Footnote 5), respectively. The SemEval datasets consist of data in two categories: Restaurant and Laptop. The Twitter dataset consists of data in one category: Twitter. The reviews carry three sentiment polarity labels: positive, negative, and neutral. The dataset statistics are shown in Table 3.

Table 3 Dataset description

In our experiments, we use the pretrained GloVe vectors with 300 dimensions to initialize the word embeddings. The dimension of the hidden state vectors is set to 300. All the weight matrices are initialized from the uniform distribution U(−0.1, 0.1). All the models are optimized using the Adam optimizer with the learning rate set to 0.001. The L2-regularization factor is set to 0.00001, and the batch size is set to 32. In addition, the number of GCN layers is set to 2, which was the best depth found in our experiments. To evaluate the performance, we average the results of 20 runs with random initialization, and we adopt accuracy and the macro-averaged F1-score (Macro-F1) as the evaluation metrics. The Macro-F1 metric is more appropriate when the dataset is unbalanced.
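The difference between the two metrics on unbalanced data can be seen in a small worked example. The sketch below implements accuracy and Macro-F1 from their standard definitions; the toy labels and the majority-class predictor are invented for illustration and are not drawn from the datasets above.

```python
def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of the per-class F1 scores."""
    f1s = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # A class that is never predicted and never present scores 0 here.
        f1s.append(2 * tp / denom if denom else 0.0)
    return sum(f1s) / len(f1s)

labels = ["positive", "negative", "neutral"]
y_true = ["positive"] * 4 + ["negative", "neutral"]
y_pred = ["positive"] * 6  # a predictor that always outputs the majority class
acc = accuracy(y_true, y_pred)
mf1 = macro_f1(y_true, y_pred, labels)
print(round(acc, 3), round(mf1, 3))  # 0.667 0.267
```

The majority-class predictor reaches 66.7% accuracy but only 26.7% Macro-F1, because the two minority classes each contribute an F1 of zero to the average.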

4.2 Baseline models

To evaluate the effectiveness of our model, we compare it with the following baseline models on all five datasets:

TD-LSTM:

constructs aspect-specific representation by the left context with aspect and the right context with aspect and then employs two LSTMs to model them [31].

ATAE-LSTM:

generates an attention vector by combining aspect embedding with hidden state, and it appends the aspect embedding into each word vector to better capitalize on the aspect information [36].

MemNet:

uses a deep memory network on the context word embeddings for sentence representation to capture the relevance between each context word and the aspect. Finally, the output of the last attention layer is used to infer aspect polarity [32].

IAN:

first learns attention from the contexts and aspect terms. Then, the representations for aspect terms and contexts are generated separately. Finally, it concatenates the aspect term representation and the context representation to predict the sentiment polarity of the aspect terms within its contexts [24].

AOA:

jointly learns the representations for aspects and sentences and automatically focuses on the important parts in sentences [12].

TNET:

proposes a context-preserving transformation (C-PT) to preserve and strengthen the informative part of contexts [17].

AS-GCN:

exploits syntactical dependency structures within a sentence and resolves the long-range multiword dependency issue for aspect-based sentiment classification [45].

DMTL:

uses a shared layer to learn the common features of sentiment prediction (SP) and position prediction (PP). Then, it uses two task-specific layers to learn the features specific to the tasks and perform PP and SP in parallel [49].

4.3 Experimental results and analyses

As shown in Table 4, we report the performance of all the baseline models and our proposed AGGCN model. From Table 4, we can make the following observations:

Table 4 Average accuracy and macro-F1 score over 20 runs with random initialization. The two best results on each dataset are shown in bold font

Compared with the baseline models, AGGCN achieves the best performance on the Rest14, Rest16 and Twitter datasets. On the Rest14 dataset, compared with the best baseline model DMTL, AGGCN achieves absolute increases of 2.14% and 1.33% in accuracy and Macro-F1, respectively. These results demonstrate the effectiveness of using syntactic information and long-range sentiment dependencies. On the Rest16 and Twitter datasets, compared with the best baseline model AS-GCN, AGGCN achieves absolute increases of 1.54% and 1.07% in accuracy, respectively, and absolute increases of 6.44% and 1.80% in Macro-F1. These results demonstrate that encoding the aspect-specific information from the outset can increase the model accuracy for sentiment analysis. The accuracy of AGGCN is slightly below that of AS-GCN on Lap14, where AS-GCN achieves the best performance. AS-GCN makes full use of the syntactical dependency structures within a sentence and resolves the long-range multiword dependency issue. One possible reason for this discrepancy is that the Lap14 dataset is less sensitive to aspect-specific information and more sensitive to syntactic information. In addition, the performance of AGGCN is lower than that of DMTL on Rest15, where DMTL shows the best performance. DMTL uses a shared layer to learn the common features of SP and PP. Then, it utilizes two task-specific layers to learn the features specific to the tasks and performs PP and SP in parallel. DMTL pays more attention to the influence of position information on the model. Its best result on Rest15 may be because the Rest15 dataset is less sensitive to syntactic information and long-range dependencies than it is to position information.

4.4 Discussion of AGGCN

One highly important parameter in AGGCN is the number of GCN layers because that value affects the performance of our model. To demonstrate the effectiveness of our proposed model, we investigate the effect of the layer number L on the final performance of AGGCN, and we conduct experiments with different numbers of GCN layers from 1 to 9. The performance results are shown in Figs. 3 and 4.

Fig. 3 Effect of the number of GCN layers on accuracy

Fig. 4 Effect of the number of GCN layers on Macro-F1

As the results in Figs. 3 and 4 show, the model achieves the best performance when the number of GCN layers is 2. When the number of GCN layers is larger than 2, the performance degrades as the number of GCN layers increases on both datasets. One possible reason for this drop is that as the number of model parameters increases, the model becomes more difficult to train and tends to overfit.

4.5 Ablation study

To investigate the impacts of each component on AGGCN, we conducted an ablation study. The results are shown in Table 5.

Table 5 Ablation study of the AGGCN on five datasets, AG means aspect gate

AGGCN w/o AG denotes a model with the aspect gate component removed. As the results show, when we remove the aspect gate, the performance of AGGCN degrades on the Rest14, Rest15, Rest16 and Twitter datasets but improves on the Lap14 dataset. These results demonstrate that the aspect gate helps the model better identify and extract information on specific aspects. Recalling the result on the Lap14 dataset in Table 4, the reason that the full model performs worse there may be that the Lap14 dataset is less sensitive to aspect-specific information but more sensitive to syntactic information. AGGCN w/o GCN denotes a model from which we removed the GCN mechanism. When we remove the GCN component, the performance of AGGCN drops on the Lap14, Rest14, Rest15, and Rest16 datasets but improves on the Twitter dataset. This result demonstrates that the GCN captures both the syntactic information and the long-range sentiment dependencies. One possible reason for the exception on Twitter is that the Twitter dataset is not sensitive to syntactic information and long-range sentiment dependencies. AGGCN w/o pos. denotes a model from which we removed the position-aware transformation component. Compared with the complete AGGCN, the performance of AGGCN w/o pos. falls on all five datasets, especially on Rest15. Recall that the baseline model DMTL in Table 4, which pays more attention to position information, performs best on Rest15. We can conclude that the Rest15 dataset is more sensitive to position information. AGGCN w/o mask denotes a model from which we removed the mask component. The performance of AGGCN w/o mask falls on all five datasets, demonstrating that the mask mechanism helps the AGGCN perceive contexts around the aspect in a way that considers both syntactical dependencies and long-range sentiment dependencies.

4.6 Case study

To provide an intuitive understanding of how the AGGCN works with different components, we adopted a case study as a test example for illustrative purposes. We constructed heat maps to visualize the attention weights on the words computed by the three models in which the color depth denotes the semantic relatedness level between the given aspect and each word. More depth indicates a stronger relation to the given aspect. The results are shown in Fig. 5.

Fig. 5 Visualization of attention from AGGCN, AGGCN w/o AG and AGGCN w/o GCN on a testing example

In this sentence, the aspect word “restaurant” is the target of negative sentiment; the aspect words “drinks” and “food” are connected by the conjunction “and” and are targets of positive sentiment; and the conjunction “but” reverses the preceding negative sentiment. By comparing the heat maps of AGGCN and AGGCN w/o AG, we find that AGGCN accurately focuses on the three aspects in the sentence, “restaurant”, “drinks” and “food”, and it also pays attention to the conjunctions “but” and “and”. This phenomenon indicates that AGGCN not only identifies aspect information but also perceives the context in a way that considers both syntactic information and long-range sentiment dependencies. When we remove the aspect gate, as can be seen in the heat map in the second row, the color of the aspect words “restaurant”, “drinks” and “food” becomes lighter. This indicates that the ability of AGGCN to focus on aspects is weakened. When we remove the GCN, as seen in the third heat map, the model no longer focuses on the conjunctions “but” and “and”. The AGGCN w/o GCN model predicts the polarity of the aspect “drinks” by the word “fantastic” and the polarity of the aspect “food” by the word “superb” in isolation, ignoring the relation between the two aspects.

From the above results, we can conclude that our proposed model can not only identify the aspect and address the lack of aspect information in prior models through the special aspect gate but also perceive the contexts around the aspect by considering both syntactical dependencies and long-range sentiment dependencies. These mechanisms make our model better for aspect-based sentiment analysis.

5 Conclusion and future work

In this paper, we proposed an Aspect-gated Graph Convolutional Network (AGGCN) for aspect-based sentiment analysis. The AGGCN not only guides the encoding of aspect-specific information from the outset and discards aspect-independent information but also perceives contexts around the aspect by considering both syntactical dependencies and long-range sentiment dependencies. The experimental results on multiple SemEval datasets demonstrate the effectiveness of our proposed approach, and our model outperforms the strong baseline models.

In future work, we plan to further improve the performance of the model from the following aspects. First, noise and biases may occur during the encoding of aspect-specific information; therefore, it is necessary to introduce a deep conversion transformation mechanism that can decode the aspect information to ensure that it is completely and accurately embedded into the model. Second, domain knowledge could be incorporated to improve model generalizability.