1 Introduction

For the past few years, aspect-based sentiment analysis (ABSA) has become a popular research field in natural language processing [5, 25]. Different from sentence-level sentiment analysis, ABSA is a fine-grained task that aims at inferring the sentiment polarity (e.g., positive, negative, neutral) of one specific aspect, even when a sentence contains multiple aspects. For example, in Fig. 1, the sentiment polarities of the two aspects “works” and “apple OS” are both positive.

Fig. 1. An example sentence. The words in blue and red are aspects and opinions, respectively. The arrows above manifest the correspondence between aspects and opinions. The arrows below suggest dependencies between words. (Color figure online)

It is a key challenge for ABSA to learn the critical sentiment information concerning the given aspect from the sentence [11, 21, 25]. In early works, attention mechanism-based deep learning models were a promising paradigm due to their ability to attend to the important parts of a sentence regarding a specific aspect [5, 21]. However, these models lack a mechanism to account for sentiment dependencies between long-range words and may focus on irrelevant sentiment information when predicting aspect sentiment. Recent research focuses on applying graph convolutional networks (GCN) over syntactic dependency trees [5, 19, 25]. Dependency trees clarify the connections between contextual words and aspect words in a sentence. For example, in Fig. 1, the aspect word “Works” has syntactic dependencies with its corresponding opinion word “well”, as well as with the two words “am” and “and”.

How to leverage the global structure of the graph to improve model performance is a widely studied problem in the field of graph neural networks [1, 3, 12,13,14, 23]. Most GCNs for the ABSA task, however, lack a mechanism to effectively capture the global information of the graph. To our knowledge, the GCN models in ABSA can only afford 2–3 layers [4, 11, 19, 22, 25], meaning that each node can only collect local information from neighbors within 2–3 hops under the message-passing scheme [14]. If a GCN model goes deeper so that each node can have a larger receptive field and learn more global information, the vanishing gradient problem will make the model unstable [11]. Due to this limited receptive field, the nodes in these GCNs often fail to capture the critical sentiment clues. For example, in Fig. 1, the nodes representing the aspect “apple OS” are 4–5 hops away from the nodes representing the corresponding opinion word “happy”. Although the opinion nodes contain significant sentiment information for determining the sentiment polarity of the aspect, this information cannot be transmitted to the aspect nodes.

To tackle the challenges mentioned above, a novel model, the virtual node augmented graph convolutional network (ViGCN), is proposed in this paper, whose architecture is shown in Fig. 2. In ViGCN, an artificial virtual node is added to the graph over the dependency tree and connected to all the real nodes to give them a global receptive field. Real nodes refer to all the nodes in the graph before adding the virtual node. The virtual node was originally proposed to represent the entire graph [9]. Recent research finds that it can also be used for graph augmentation [23], because it aggregates global information from the whole graph and propagates the aggregated information to each real node. Considering that the graphs in ABSA are generated from sentences [25], the global information gathered from these graphs contains the sentiment expression of the sentences, which is crucial for the model to predict the sentiment polarities of aspects. Moreover, inspired by SenticNet [2] and the previously successful LCF-BERT model [24], weighted edges between the virtual node and real nodes are established based on affective commonsense knowledge and the semantic-relative distances between contextual words and the given aspect. With this approach, the virtual node can focus more on the context containing critical sentiment information in a sentence, making the preserved global information better reflect the sentence's emotional expression towards the given aspect.

Fig. 2. Overview of the proposed virtual node augmented graph convolutional network.

This paper mainly makes the following contributions: (1) The GCN models applied to the ABSA task are reconsidered so as to exploit the global information regarding the given aspect. (2) A novel ViGCN model for the ABSA task is proposed in this paper, which can effectively preserve global information via a virtual node. Specifically, the proposed ViGCN model leverages affective commonsense knowledge and semantic-relative distance to augment the preserved global information. (3) Extensive experiments on the SemEval 2014 and Twitter datasets demonstrate the superiority of the proposed ViGCN model in ABSA. The code of ViGCN is available at: https://github.com/code4scie/ViGCN.

2 Related Work

Aspect-based sentiment analysis is a fine-grained subtask of sentiment analysis. Early works in this field are feature engineering-based models like SVM [10], which are time- and labor-intensive. Later, deep neural networks were widely adopted because of their ability to capture features automatically from sentences. Representative works are based on the recursive neural network (RNN) [7, 16], long short-term memory (LSTM) [21], convolutional neural network (CNN) [8] and deep memory network [5]. However, these neural networks generally lack a mechanism to leverage the syntax information that is of crucial importance for ABSA [25].

Most of the current state-of-the-art methods are graph network-based models combined with syntactic dependency trees [4, 11, 19, 20, 22, 25]. For instance, Zhang et al. [25] applied GCNs over dependency trees to exploit word dependencies. Wang et al. [20] proposed an R-GAT model based on an aspect-oriented dependency tree. Chen et al. [4] developed a dotGCN model based on a discrete latent opinion tree. Tang et al. [19] adopted a bidirectional GCN to consider BERT outputs and graph-based representations jointly. Xiao et al. [22] utilized grammatical sequential features from BERT and the syntactic knowledge from dependency graphs to augment GCNs. Li et al. [11] put forward a DualGCN model that contains two GCNs to preserve syntactic structure information and semantic correlations, respectively. Despite exhibiting appealing power in ABSA, graph network-based models learn node representations mainly by aggregating features from each node's neighborhood [14], overlooking the global information of the graph, which can represent the aspect-specific sentiment expression of the sentence. Therefore, how to enhance node representations by effectively leveraging global information should be considered in the ABSA task.

3 Proposed Model

The overall architecture of the proposed virtual node augmented graph convolutional network (ViGCN) is plotted in Fig. 2. In this section, the details of ViGCN are presented. In particular, the aspect-based sentiment analysis task is first defined, and then the initialization of the node embeddings of a sentence is illustrated. After that, the construction of the virtual node augmented graph is introduced, together with how the initial node embeddings and the graph are fed into ViGCN. Finally, how to obtain the sentiment polarity and train the model is detailed.

3.1 Task Definition

Given an n-word sentence \(S=\left\{ s_{1},...,s_{a_{1}},..., s_{a_{k}},...,s_{n} \right\} \) with a k-word aspect \(A=\left\{ s_{a_{1}},...,s_{a_{k}} \right\} \) included, ABSA aims at predicting the sentiment polarity p of the aspect A, where \(p\in \left\{ Negative, Neutral, Positive \right\} \). It is worth noting that the aspect may contain one or several words, with \(1\le k< n\).

3.2 Node Embeddings Initialization

Bidirectional encoder representations from transformers (BERT) [6] is utilized as the aspect-based encoder to learn the hidden contextual representations of sentences. To be specific, inspired by the LCF-BERT model [24], the sentence-aspect pair \(G = \left[ CLS \right] + S+ \left[ SEP \right] + A+ \left[ SEP \right] \) is first constructed as input, so that BERT can capture the semantic relationship between the sentence S and the aspect A through its next-sentence-prediction mechanism [6]. \( \left[ CLS \right] \) and \( \left[ SEP \right] \) are the special tokens of BERT. BERT first tokenizes each word in G into subwords, and the sentence sequence S is tokenized into an m-subword sequence \(S_t=\left\{ t_{1},...,t_{a_{1}},...,t_{a_{j}},...,t_{m} \right\} \). Then BERT transforms each subword in G into a hidden state vector, and \(S_t\) becomes \(S_h=\left\{ h_{1},...,h_{a_{1}},...,h_{a_{j}},...,h_{m} \right\} \), where \(h_i\in \mathbb {R}^{d_h}\) is the hidden state vector of the \({i}\text{- }{th}\) subword \(t_i\). We use \(S_h\) as the initial real node set. Besides, the virtual node is initialized with a zero vector \({0} \in \mathbb {R}^{d_h}\). The initial node embeddings are \(V^{0}=\left\{ h_{1},...,h_{a_{1}},...,h_{a_{j}},...,h_{m},0 \right\} \).
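As a concrete illustration, the following is a minimal sketch of this initialization using the HuggingFace `transformers` library (an implementation assumption; the paper only specifies BERT-base-uncased in Sect. 4.1, and the example sentence is adapted from Fig. 1):

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

sentence = "Works well , and I am extremely happy to be back to an apple OS ."
aspect = "apple OS"

# Sentence-aspect pair: [CLS] + S + [SEP] + A + [SEP]
encoding = tokenizer(sentence, aspect, return_tensors="pt")
with torch.no_grad():
    hidden = bert(**encoding).last_hidden_state  # (1, seq_len, 768)

# Real nodes: hidden states of the m sentence subwords, i.e., the
# positions between [CLS] and the first [SEP].
m = encoding["input_ids"][0].tolist().index(tokenizer.sep_token_id) - 1
real_nodes = hidden[0, 1 : m + 1]                  # (m, d_h)

# Virtual node: initialized with a zero vector.
virtual_node = torch.zeros(1, hidden.size(-1))     # (1, d_h)
V0 = torch.cat([real_nodes, virtual_node], dim=0)  # (m + 1, d_h)
```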

3.3 Construction of the Virtual Node Augmented Graph

For each sentence, a virtual node augmented graph \(\mathcal {G}=(V,E)\) is constructed to represent the syntactic relationships among subwords, and \(A\in \mathbb {R} ^{(m+1)\times (m+1) }\) is the adjacency matrix of \(\mathcal {G}\). \(V=\left\{ x_{1},...,x_{a_1},...,x_{a_j},...,x_{m},x_{m+1} \right\} \) is the set of nodes, where \(\left\{ x_{1},...,x_{a_1},...,x_{a_j},...,x_{m} \right\} \) are the real nodes and \(x_{m+1}\) is the added virtual node. E is the set of edges.

To build the edges between real nodes, a syntactic dependency tree is first constructed for each input sentence using the dependency parsing model LAL-Parser [15]. To match the subword sequence generated by BERT, each word-level dependency is expanded to all the subwords of the words involved. Then, an edge is established between two subwords if there is a dependency between them. Specifically, for \(i,j\in \left[ 1,m \right] \):

$$\begin{aligned} A_{i,j}={\left\{ \begin{array}{ll}p_e&{}\text {if there is a dependency between } t_i \text { and } t_j,\\ 0&{}\text {otherwise}.\end{array}\right. }\end{aligned}$$
(1)

As suggested by [11], \(p_e\in \left( 0,1 \right] \) is the probability that the dependency parser assigns to the dependency between the two subwords. Its purpose is to reduce the adverse impact of parsing errors on model performance.
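A minimal sketch of this construction is given below, assuming the parser output has already been converted into word-level (head, dependent, probability) triples and that a word-to-subword alignment is available; LAL-Parser's actual output format would need to be adapted accordingly.

```python
import numpy as np

def build_real_adjacency(dep_edges, word2subwords, m):
    """Adjacency among the m real (subword) nodes, Eq. (1).

    dep_edges: iterable of (head, dep, p_e) word-index triples with the
        parser's dependency probability p_e (an assumed format).
    word2subwords: maps each word index to the indices of its subwords.
    The last row/column (index m) is reserved for the virtual node and
    is filled later by Eqs. (2)-(4).
    """
    A = np.zeros((m + 1, m + 1))
    for head, dep, p_e in dep_edges:
        # Expand the word-level dependency to all subword pairs.
        for i in word2subwords[head]:
            for j in word2subwords[dep]:
                A[i, j] = A[j, i] = p_e
    return A
```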

Inspired by [23], a virtual node is then added to the graph to preserve global information. Firstly, “naive connections” are constructed between the virtual node and the real nodes. To be specific, following previous work [23], an undirected edge is established between the virtual node and each real node in the graph. Therefore, for \(i\in \left[ 1,m \right] \):

$$\begin{aligned} A_{i,m+1}=A_{m+1,i}= 1. \end{aligned}$$
(2)

Like previous work [23], a self-loop is not added to the virtual node, thereby leading to \(A_{m+1,m+1}=0\).

To learn more global information from words with stronger sentiment, the edge weights between the virtual node and real nodes are further enhanced by utilizing affective commonsense knowledge from SenticNet. In particular, we use SenticNet 6, a public commonsense knowledge base that provides polarity scores for 200,000 concepts [2]. Therefore, for \(i\in \left[ 1,m \right] \):

$$\begin{aligned} A_{i,m+1}=A_{m+1,i}= \left| Sentics(t_i) \right| + 1, \end{aligned}$$
(3)

where \(Sentics(t_i)\in \left[ -1,1 \right] \) represents the polarity score of the subword \(t_i\): the closer the score is to \(+1\), the more positive the word, and the closer it is to \(-1\), the more negative. The polarity score of each word in SenticNet 6 is obtained first and then used as the polarity score of its subwords; the polarity scores of words not covered by SenticNet 6 are set to 0. Since only the strength of a subword's sentiment matters here, regardless of whether that sentiment is positive or negative, the absolute value of the polarity score is taken.

In LCF-BERT [24], the semantic-relative distance (SRD) is proposed to focus on the subword tokens generated by the local context, given that the local context close to the aspect usually contains significant sentiment information. Inspired by [24], the connections between the virtual node and the nodes representing the local context are further enhanced based on SRD:

$$\begin{aligned} A_{i,m+1}=A_{m+1,i}={\left\{ \begin{array}{ll} \left| Sentics(t_i) \right| + 2 &{} \text { if } SRD_{i}< \varphi , \\ \left| Sentics(t_i) \right| + 1 &{} \text { otherwise }. \end{array}\right. } \end{aligned}$$
(4)

Here, \(i\in \left[ 1,m \right] \) and \(SRD_i = \left| P_i-P_a \right| -\left\lfloor \frac{k}{2} \right\rfloor \) is the SRD between the \({i}\text{- }{th}\) token and the targeted aspect, where \(P_a\) is the central position of the aspect and \(P_i\) is the position of the context word generating the \({i}\text{- }{th}\) token; k refers to the length of the aspect sequence; \(\varphi \) stands for the SRD threshold.
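The full virtual-node weighting of Eqs. (2)–(4) can be sketched as follows, assuming a `sentics` dictionary that maps a word to its SenticNet 6 polarity score (0.0 for uncovered words) and precomputed token-to-word positions; these data structures are illustrative, not taken from the released code.

```python
def add_virtual_node_edges(A, subword_words, sentics, positions, P_a, k, phi):
    """Weighted edges between the virtual node (index m) and the real
    nodes, Eqs. (2)-(4). `sentics` maps a word to its SenticNet 6 score
    (assumed 0.0 for out-of-vocabulary words)."""
    m = A.shape[0] - 1
    for i in range(m):
        word = subword_words[i]            # word that produced subword i
        s = abs(sentics.get(word, 0.0))    # |Sentics(t_i)|, Eq. (3)
        srd = abs(positions[i] - P_a) - k // 2  # semantic-relative distance
        A[i, m] = A[m, i] = s + (2.0 if srd < phi else 1.0)  # Eq. (4)
    # No self-loop on the virtual node: A[m, m] stays 0.
    return A
```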

3.4 Virtual Node Augmented Graph Convolutional Network

GCN is a special convolutional neural network that works directly on graphs and takes advantage of the graph-structured information. After initialization, the node embeddings and the constructed graph are fed into an \({L}\text{- }{layer}\) GCN to learn local and global information for the given aspect. In an \({L}\text{- }{layer}\) GCN, \(V^{l}=\left\{ x_{1}^{l},...,x_{a_1}^{l},...,x_{a_j}^{l},...,x_{m}^{l},x_{m+1}^{l} \right\} \) (\(l \in \left\{ 1,...,L \right\} \)) denotes the node representations of the \({l}\text{- }{th}\) layer, where \(\left\{ x_{1}^{l},...,x_{a_1}^{l},...,x_{a_j}^{l},...,x_{m}^{l} \right\} \) represents the real nodes, and \(x_{m+1}^{l}\) is the virtual node. The output of the \({i}\text{- }{th}\) node in the \({l}\text{- }{th}\) layer is calculated as follows:

$$\begin{aligned} x_{i}^{l}=\sigma \left( \left( \sum _{j=1}^{m+1}{A_{i,j}} W^{l}x_{j}^{l-1} \right) /\left( d_i+1 \right) + b^{l} \right) , \end{aligned}$$
(5)

where \(W^{l}\) is a trainable weight matrix and \(b^{l}\) is a bias vector; \(d_i\) is the degree of the \({i}\text{- }{th}\) node (the graph is undirected, so in- and out-degrees coincide); \(\sigma \) refers to an activation function, and ReLU is used here. The initial node set is \(V^{0}\), obtained in Subsect. 3.2.
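A minimal PyTorch sketch of one such layer is given below, assuming \(d_i = \sum_{j} A_{i,j}\), i.e., the degree implied by the weighted adjacency matrix; the released code may implement the normalization differently.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ViGCNLayer(nn.Module):
    """One GCN layer implementing Eq. (5)."""

    def __init__(self, d_h):
        super().__init__()
        self.linear = nn.Linear(d_h, d_h)  # holds W^l and b^l

    def forward(self, X, A):
        # X: (m + 1, d_h) node embeddings; A: (m + 1, m + 1) adjacency.
        d = A.sum(dim=1, keepdim=True)                 # weighted node degrees
        H = A @ (X @ self.linear.weight.T)             # sum_j A_ij * W x_j
        return F.relu(H / (d + 1) + self.linear.bias)  # Eq. (5)
```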

3.5 Model Training

The final output of the L-layer GCN is \(V^{L}=\left\{ x_{1}^{L} ,...,x_{a_{1}}^{L},...,x_{a_{j}}^{L},...,x_{m}^{L},x_{m+1}^{L} \right\} \).

Then, the final feature r is obtained by applying average pooling \(f_a\left( \cdot \right) \) over the aspect nodes:

$$\begin{aligned} r=f_a\left( x_{a_{1}}^{L},...,x_{a_{j}}^{L} \right) . \end{aligned}$$
(6)

Next, the final feature r is fed into a fully connected layer, followed by a softmax layer to learn a sentiment polarity probability distribution p:

$$\begin{aligned} p=softmax(W_c r+b_c), \end{aligned}$$
(7)

where \(W_c\) and \(b_c\) are the trainable weight and bias, respectively.

The model is trained to minimize the objective function composed of a cross-entropy loss function \(\ell \) and an \(L_{2}\) regularization:

$$\begin{aligned} \mathfrak {L} = \ell + \lambda \left\| \varTheta \right\| _{2}, \end{aligned}$$
(8)

where \(\varTheta \) represents all the trainable parameters, and \(\lambda \) is the coefficient of \(L_{2}\) regularization. In this paper, \(\ell \) is defined as follows:

$$\begin{aligned} \ell = -\sum _{i=1}^{D}\sum _{j=1}^{C}\hat{p}_{i}^{j}\log {p}_{i}^{j}, \end{aligned}$$
(9)

where D is the number of training samples, and C is the number of different sentiment polarities. \(\hat{p}\) is the ground-truth sentiment polarity distribution.
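Putting Eqs. (6)–(9) together, the classification head can be sketched as follows; `classifier` is assumed to be an `nn.Linear(d_h, C)`, and the \(L_2\) term of Eq. (8) is realized via the optimizer's weight decay rather than being added to the loss explicitly.

```python
import torch.nn.functional as F

def predict_and_loss(V_L, aspect_idx, classifier, label):
    """Eqs. (6)-(9): average-pool the aspect nodes, classify, and compute
    the cross-entropy loss. `label` is a (1,)-shaped tensor of class ids."""
    r = V_L[aspect_idx].mean(dim=0, keepdim=True)  # Eq. (6): (1, d_h)
    logits = classifier(r)                         # Eq. (7), pre-softmax
    loss = F.cross_entropy(logits, label)          # Eq. (9)
    return logits.softmax(dim=-1), loss            # p and the loss
```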

4 Experiments

4.1 Datasets and Experiment Settings

We evaluate ViGCN on three public datasets. The Restaurant and Laptop datasets are from SemEval-2014 Task 4 [17]. Following [5], all the data samples with the “conflict” label are removed. The Twitter dataset is provided by [7]. All three datasets contain three sentiment polarities: positive, neutral and negative. Table 1 shows the statistics of the datasets.

In this experiment, we use the pre-trained BERT-base-uncased model [6] to initialize word embeddings. The dimension of word embeddings and hidden states \(d_h\) is 768. The number of ViGCN layers L is set to 2, and the dropout rate of ViGCN is set to 0.1 to avoid overfitting. The SRD threshold \(\varphi \) is set to 3 on the Restaurant and Twitter datasets, and to 6 on the Laptop dataset. The Adam optimizer with a learning rate of 0.001 is utilized to optimize the model parameters. The coefficient \(\lambda \) of the \(L_{2}\) regularization is \(10^{-4}\). The model is trained for 15 epochs with a batch size of 16. The experimental results are obtained by averaging 10 runs with random initialization, where accuracy and macro F1 score are the evaluation metrics adopted to evaluate the model performance.
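For reference, these settings can be collected into a single configuration sketch; the key names below are illustrative and not taken from the released code.

```python
config = {
    "bert_model": "bert-base-uncased",
    "d_h": 768,              # word-embedding / hidden-state dimension
    "num_gcn_layers": 2,     # L
    "dropout": 0.1,
    "srd_threshold": {"restaurant": 3, "twitter": 3, "laptop": 6},  # phi
    "optimizer": "adam",
    "learning_rate": 1e-3,
    "l2_coefficient": 1e-4,  # lambda
    "epochs": 15,
    "batch_size": 16,
    "num_runs": 10,          # results averaged over random initializations
}
```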

Table 1. Statistics of the experimental datasets.

4.2 Comparison Baselines

We compare ViGCN with the following baselines: (1) ATAE-LSTM [21] is an attention-based LSTM model for ABSA; (2) AEN [18] is an attentional encoder network based on BERT; (3) RAM [5] uses a recurrent attention network on memory to learn the sentence representation; (4) ASGCN [25] is an aspect-specific GCN model over the dependency tree; (5) BERT4GCN [22] is a GCN augmented with intermediate layers of BERT and positional information between words for the ABSA task; (6) R-GAT+BERT [20] proposes a relational graph attention network based on an aspect-oriented dependency tree; (7) DualGCN+BERT [11] integrates syntactic knowledge and semantic information simultaneously with dual GCNs, namely, SynGCN and SemGCN; (8) DGEDT+BERT [19] is a dual-transformer model which jointly considers graph-based representations and flat representations; (9) dotGCN+BERT [4] is a graph convolutional network based on a discrete opinion tree.

4.3 Comparison Results

The main experimental results are shown in Table 2. On all three experimental datasets, the ViGCN model outperforms almost all compared attention-based and graph neural network-based models with respect to both accuracy and macro F1 score. It also performs quite competitively against the remarkable DualGCN+BERT model. To be specific, ViGCN outperforms DualGCN+BERT on the Restaurant and Twitter datasets, although its accuracy on the Laptop dataset is slightly lower, by 0.31. This demonstrates that the ability to effectively preserve global information enables ViGCN to achieve significant gains in ABSA.

Compared with attention-based models such as ATAE-LSTM, AEN and RAM, ViGCN exploits a syntactic dependency tree to explicitly model the connections between the aspect and the context, and thus can avoid the noise introduced by the attention mechanism. In comparison with previous state-of-the-art graph neural network-based models like ASGCN, BERT4GCN and R-GAT+BERT, the enhancement is mainly attributable to two factors. One is that a virtual node is added to the graph, which enhances node representations by leveraging the global information of the whole graph. The other is that effective weights are set for the edges between the virtual node and real nodes based on affective commonsense knowledge and semantic-relative distance, which refines how the virtual node aggregates and propagates global information.

Table 2. Comparisons of ViGCN with baselines. Accuracy (ACC.) and macro F1 score (F1) are used for metrics. Best results are in bold and second best results are underlined.

4.4 Ablation Analysis

To further clarify the impact of each component of the virtual node augmented graph construction, ablation studies are conducted on the proposed ViGCN. The results are shown in Table 3. First, it can be observed that the model without a virtual node (ViGCN w/o N+S+D) performs poorly on all datasets compared with the other models in Table 3. This indicates that the virtual node can improve the performance of GCNs in ABSA. When only “naive connections” are constructed between the virtual node and real nodes (ViGCN w/o S+D), the performance of the model, although improved, is still far below the best performance. Comparatively, ViGCN w/o D and ViGCN w/o S evidently perform better, which demonstrates the effectiveness of the SenticNet-based and SRD-based refinements of the “naive connections”. Also, the “naive connections” are indispensable to the model, given that their removal leads to poorer performance (ViGCN w/o N). It is worth noting that ViGCN w/o D does not outperform ViGCN w/o S on the Twitter dataset as it does on the Restaurant and Laptop datasets, which may be because the Twitter data is biased towards colloquial expressions and less sensitive to sentiment information [7]. Finally, ViGCN outperforms all its ablated variants, revealing that each component of ViGCN is indispensable.

Table 3. Results of ablation analysis. “S” represents SenticNet, “D” represents SRD, “N” represents “naive connections”.
Fig. 3.
figure 3

Effect of the SRD threshold \(\varphi \).

4.5 Parameter Sensitivity

Figure 3 illustrates the performance of ViGCN with SRD thresholds \(\varphi \) ranging from 1 to 10. ViGCN performs best with an SRD threshold of 3 on the Restaurant and Twitter datasets, and of 6 on the Laptop dataset. As the SRD threshold \(\varphi \) increases, the performance of the model gradually improves until it reaches its peak, and then shows a downward trend. A possible reason is that when \(\varphi \) is too small, the model cannot capture enough information from the local context; conversely, when it is too large, noise may be introduced into the global information preserved by the model.

4.6 Case Study

ViGCN, RAM and ASGCN are compared on several sample cases. The results are shown in Table 4. For the first sample, the words “apple” and “OS” in the aspect “apple OS” are five and four hops away, respectively, from the opinion word “happy” on the graph over the dependency tree. For a 2-layer ASGCN, the information contained in the opinion word cannot be passed to the aspect. In contrast, in ViGCN, information can be passed from the opinion word to the aspect in 2 hops via the virtual node. Given this, ViGCN succeeds while ASGCN fails. For the second sample, the attention-based model RAM wrongly predicts the sentiment polarity of the aspect “Saketini”, possibly because it attends to the noise word “Disappointingly”. For the third sample, impacted by the noise word “wonderful”, ASGCN and RAM mispredict the sentiment polarity of the aspect “burger”. Nonetheless, ViGCN predicts it correctly, because the model combines global information preserved from the entire sentence to make predictions, thereby significantly mitigating the interference of noise words.

Table 4. Case study. The words in red denote the aspects. The symbols P, N and O represent positive, negative and neutral sentiment, respectively.

5 Conclusion

In this paper, the task of aspect-based sentiment analysis is investigated and a virtual node augmented graph convolutional network called ViGCN is proposed. Taking advantage of the virtual node, the proposed ViGCN can effectively preserve global information to precisely predict the sentiment polarity towards a given aspect. Empirical results on three public datasets demonstrate the effectiveness of our model. Future work includes applying the virtual node to other graph-based models in ABSA, e.g., graph attention network [20].