1 Introduction

Aspect-based sentiment analysis (ABSA) [1,2,3,4,5] aims to determine the sentiment polarity for a specific aspect in a sentence. For example, in the sentence shown in Fig. 1, given the two aspects “fish” and “variety of fish”, the goal of ABSA is to infer the sentiment polarities for the aspect words: positive for “fish” and negative for “variety of fish”.

Fig. 1

An example sentence with two aspects and different sentiment polarities

Many previous studies have used recurrent neural networks (RNNs) and their variants [1, 6, 7] to learn the relationships among words in a sentence. However, RNNs cannot effectively distinguish important information in long texts. When a sentence is long, important information is likely to fade during propagation, which degrades the model. To solve this problem, the attention mechanism and its variants have been widely used for this task [8,9,10,11]. These mechanisms capture crucial information between aspects and context words and thereby improve ABSA accuracy. Yadav et al. [12] masked the aspect words, replaced the opinion words using an opinion lexicon, and trained the model with two bidirectional GRUs and an attention layer. Liu et al. [13] proposed a co-attention mechanism to capture the relationships between aspect and context, in which a 1-pair hop mechanism analyzes these relationships at the lexical level and an interactive mechanism analyzes them at the feature level. Nevertheless, attention-based models risk matching aspect words with the wrong opinion words, which leads to prediction errors. This problem can be alleviated to some extent by extracting local features with a convolutional neural network (CNN) [14,15,16]. However, this approach easily introduces noise that decreases prediction accuracy, because an aspect and its actual opinion words may be far apart in the sentence.

To address these problems, recent methods have introduced dependency trees to encode the structural information of sentences [17,18,19,20,21] and have encoded the dependencies between words using graph convolutional networks (GCNs) [22,23,24,25,26] or graph attention networks (GATs) [27,28,29,30]. Liang et al. [31] enhanced the dependency graphs of sentences with affective knowledge and then fed the enhanced graphs into a GCN for training. In addition, applying a GCN after a position-encoding attention mechanism [32] can effectively capture the sentiment dependencies between different aspects in a sentence. These models substantially outperform earlier conventional models. Despite these significant improvements, the ABSA task still faces major challenges. First, even state-of-the-art parsers cannot guarantee completely correct parses, so the resulting dependency trees inevitably introduce noise. Second, when identifying the sentiment polarity of aspects, some sentences rely primarily on syntactic information and others on semantic information. Handling sentences that differ in their sensitivity to syntactic and semantic information is therefore another issue that needs to be addressed.

To overcome these challenges, we propose a multiple GCN (MultiGCN) model that obtains the syntactic structure information and the semantic content information of each sentence through an RGCN and a contextual encoder, respectively. The outputs of the RGCN and the contextual encoder are then combined by a common information extraction module, and the final result is obtained through a fusion mechanism. Furthermore, we extend the loss function with difference and similarity losses. The difference loss helps the model distinguish among structural information alone, content information alone, and the common information obtained by combining them. The similarity loss encourages the model to strengthen the association between structure and content information. The experimental results confirm that both losses contribute to better model training.

Our contributions are summarized as follows:

  1) We propose a MultiGCN model based on GCN, which makes full use of both the structured and unstructured information of sentences and combines different types of information through a fusion mechanism.

  2) We propose difference and similarity losses, which modify the traditional loss function and help the model better learn the degree of difference or similarity between different types of information.

  3) We validate our model on four benchmark datasets, and the results show that MultiGCN outperforms the baseline models on all of them.

2 Related work

ABSA is a fine-grained branch of sentiment analysis in natural language processing [7, 33,34,35,36] that determines the sentiment polarity toward one or more specific aspects within the same sentence. According to the neural network architecture they use, existing ABSA methods can be divided into those based on conventional deep learning and those based on graph neural networks.

2.1 Conventional methods

A CNN [37] can extract high-level features from raw data through convolution and pooling operations. Huang et al. [14] incorporated aspect information into a CNN for sentence encoding using a parameterized filter and a parameterized gate. Fan et al. [15] proposed a convolutional memory network that incorporates an attention mechanism and captures both words and multi-word expressions in sentences. To help the CNN feature extractor locate sentiment indicators more accurately, Li et al. [16] examined the drawbacks of the attention mechanism and the obstacles that keep CNNs from performing well in this classification task, and then proposed a new classification model with a proximity strategy that scales the input of the convolutional layers according to the positional relevance between words and aspect words.

In addition, memory networks [38] have been applied to this task. Tang et al. [39] applied a deep memory network to aspect-level sentiment classification; it uses attention mechanisms with explicit memory to capture the importance of each context word for the given aspect. Chen et al. [40] proposed a recursive attention mechanism based on memory networks to extract sentiment information separated by long distances. The memory slices are weighted according to their positional proximity to the aspect, and gated recurrent units are adopted to update the representation of aspect mentions. To better model sentiment interactions, Li et al. [41] integrated aspect detection into sentiment classification.

ABSA is also often modeled with RNNs because of their strong capacity for sequence learning. Tang et al. [6] regarded the given aspect words as features and combined them with context features to predict sentiment polarity. Zhang et al. [1] used a gated neural network to model the interaction between aspect words and surrounding context words, jointly considering the syntax and semantics of a sentence. To further improve the accuracy of sentiment polarity discrimination, Wang et al. [8] introduced the attention mechanism and set an attention vector for each aspect on top of a long short-term memory (LSTM) network, which is an effective way to make the neural model focus on the relevant parts of a sentence. Subsequently, interactive attention mechanisms were explored to model the aspect-context relationship [9, 10]; they learn the important parts of both sentences and aspects and thus provide sufficient information for judging sentiment polarity. Fan et al. [11] combined fine-grained and coarse-grained attention to capture aspect-context interaction at the word level, and proposed an aspect alignment loss to describe the interaction between aspects that share a common context.

2.2 Graph neural network methods

Zhang et al. [22] constructed a GCN over the dependency tree of a sentence to exploit syntactic information and word dependencies. The model starts with a BiLSTM layer that captures contextual information about word order, and a GCN followed by a masking mechanism then retains aspect-specific features for predicting aspect-based sentiment polarity. In the same year, Sun et al. [23] also proposed building a GCN on the syntactic dependency tree and combining it with a BiLSTM. Later, Chen et al. [24] proposed a new gating mechanism for merging multiple tree structures in GCN encoding, which improves classification accuracy on noisy texts and more effectively captures the relationship between aspect and opinion words. Furthermore, Liang et al. [25] designed an aspect-focused graph and an inter-aspect graph for each instance, considering both the context words related to aspect words and the dependence of aspect words on other aspects, and then performed ABSA with a new interactive graph-aware model.

Recently, Zhang et al. [42] integrated word co-occurrence information and dependency type information using a hierarchical graph structure, addressing the problem that most previous studies used only dependency relations and ignored the different types of relations. Bai et al. [27] also used dependency label information to distinguish different dependency types and integrated label features into the attention mechanism. They proposed a new relational GAT that improved sentiment classification accuracy.

In this paper, our proposed model incorporates several different types of node features so that the sentiment polarity of a sentence can be predicted accurately. Moreover, we improve the loss function to help the model train better.

3 Methodology

In this section, we describe the details of our proposed model. An overview of MultiGCN is depicted in Fig. 2. Given a sentence \(S=\left \{w_{1}, w_{2}, \ldots , a_{1}, \ldots , a_{\mathrm {m}}, \ldots , w_{\mathrm {n}}\right \}\) that contains the specific aspect \(\left \{a_{1}, a_{2}, \ldots , a_{\mathrm {m}}\right \}\), each word is first mapped to its embedding through a word embedding lookup table \(E \in R^{|V| \times d_{e}}\), where \(|V|\) is the vocabulary size and \(d_{e}\) is the dimension of the word embeddings. The resulting word embeddings are then fed into a BiLSTM that encodes the sentence into hidden state vectors \(H=\left \{h_{1}, h_{2}, \ldots , h_{n}\right \}\), where \(h_{i} \in R^{2d}\) and d is the dimension of the hidden state vector of a unidirectional LSTM.

Fig. 2

The overall framework of MultiGCN

The overall architecture of our model consists of four components: the RGCN, the contextual encoder, the common information extraction module, and the fusion mechanism. First, the hidden representation obtained from the BiLSTM is fed into both the RGCN and the contextual encoder. Both output representations are then fed into the common information extraction module, which combines them into a new sentence representation. Next, the fusion mechanism fuses the outputs of these three components, and the final feature representation is obtained through pooling and concatenation operations. Finally, the model predicts the sentiment polarity \(y \in \{1, -1, 0\}\) of sentence S with respect to the given aspect, where 1 denotes positive, -1 negative, and 0 neutral.

3.1 Relational graph convolutional network

A GCN performs convolution operations over directly connected nodes in graph-structured data, allowing each node to aggregate information from its neighbors and thus obtain more global information. Therefore, by feeding in a dependency probability matrix, each word in a sentence can obtain information about the words it depends on, so that the syntactic structure information of the whole sentence is captured. Inspired by Bai et al. [27], to make more comprehensive use of the dependency parser output, we take both dependency relations and their types into account and propose RGCN, as shown in Fig. 3.

Fig. 3

Details of the RGCN structure for relation embeddings

To begin with, an adjacency matrix \(A_{sy}=\left \{a_{i, j}\right \}_{n \times n}\) and a relation matrix \(R=\left \{r_{i, j}\right \}_{n \times n}\) are generated from the dependency tree, representing whether there is a dependency arc between two words and what the dependency type is, respectively. \(A_{sy}\) is a 0-1 matrix: \(a_{i,j} = 1\) if there is an arc between words i and j, and \(a_{i,j} = 0\) otherwise. For R, if \(a_{i,j} = 1\), then \(r_{i, j} \in \left \{r_{1}, r_{2}, \ldots , r_{k}\right \}\) is the corresponding dependency type, where k is the number of dependency types; otherwise \(r_{i,j} = 0\). Next, we convert every \(r_{i,j}\) in R into an embedding \(e_{i, j}^{r}\) and combine it with the output of the BiLSTM to obtain \(H_{sy}^{(0)}\). Given the adjacency matrix \(A_{sy}\), the structure representation \(H_{sy}\) of the sentence is computed as follows:

$$ H_{s y}^{(l)}=ReLU\left( A_{s y} H_{s y}^{(l-1)} W_{s y}^{(l)}\right) $$
(1)

where \(W_{s y}^{(l)}\) is the learnable matrix of the l-th RGCN layer.
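The construction of \(A_{sy}\) and R and the layer update in Eq. (1) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: lists of rows stand in for tensors, arcs are treated as undirected, no self-loops are added, and because the text does not specify how the relation embeddings are combined with the BiLSTM output, the hypothetical helper `initial_features` simply appends the mean embedding of each word's relations to its hidden state. The parse, hidden states, and embedding values are toy numbers.

```python
# Minimal sketch of A_sy / R construction and the RGCN update in Eq. (1).
# Assumptions: undirected arcs, no self-loops, and a hypothetical way of
# combining relation embeddings with the BiLSTM output.


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def relu(m):
    return [[max(0.0, v) for v in row] for row in m]


def build_graph(n, arcs):
    """arcs: (head, dependent, label) triples. Returns the 0-1 matrix A_sy and R."""
    A = [[0] * n for _ in range(n)]
    R = [[0] * n for _ in range(n)]  # 0 means "no arc", as in the text
    for i, j, label in arcs:
        A[i][j] = A[j][i] = 1
        R[i][j] = R[j][i] = label
    return A, R


def initial_features(H, R, rel_emb):
    """Hypothetical H_sy^(0): append the mean embedding of each word's relations."""
    dim = len(next(iter(rel_emb.values())))
    H0 = []
    for h, row in zip(H, R):
        labels = [r for r in row if r != 0]
        mean = [sum(rel_emb[r][k] for r in labels) / len(labels) if labels else 0.0
                for k in range(dim)]
        H0.append(list(h) + mean)
    return H0


def rgcn(A, H0, weights):
    """Apply Eq. (1) once per layer: H^(l) = ReLU(A_sy H^(l-1) W^(l))."""
    H = H0
    for W in weights:
        H = relu(matmul(matmul(A, H), W))
    return H


# "The staff was horrible": det(staff, The), nsubj(horrible, staff), cop(horrible, was)
A, R = build_graph(4, [(1, 0, "det"), (3, 1, "nsubj"), (3, 2, "cop")])
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]  # toy BiLSTM outputs
rel_emb = {"det": [1.0], "nsubj": [2.0], "cop": [3.0]}  # toy relation embeddings
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
H_sy = rgcn(A, initial_features(H, R, rel_emb), [identity])
print(H_sy)
```

With an identity weight matrix, each output row is just the sum of the neighbors' input features, which makes it easy to check by hand that only words joined by a dependency arc exchange information.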

3.2 Contextual encoder

Because the structural information of some sentences is not obvious, relying only on the structural information extracted by the RGCN may reduce the accuracy of the analysis. Therefore, we also model the context words with another GCN, i.e., content modeling. Unlike the RGCN, however, this GCN does not use a dependency tree; instead, its adjacency matrix is an attention matrix \(A_{att}\) derived from the self-attention mechanism. The final contextual representation \(H_{con}\) of the sentence is obtained as follows:

$$ A_{a t t}=\frac{\left(H_{q} W_{q}\right)\left(H_{k} W_{k}\right)^{T}}{\sqrt{d}} $$
(2)
$$ H_{con}^{(l)}=ReLU\left( A_{a t t} H_{c o n}^{(l-1)} W_{c o n}^{(l)}\right) $$
(3)

where \(H_{q}\), \(H_{k}\), and \(H_{con}^{(0)}\) are identical and are all the output of the BiLSTM; d is the dimension of the output; \(W_{q}\) and \(W_{k}\) are trainable weight matrices; \(W_{c o n}^{(l)}\) is the learnable matrix of the l-th GCN layer.
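A minimal sketch of Eqs. (2)-(3) in plain Python follows, assuming \(H_q = H_k = H_{con}^{(0)} = H\) as stated above. The key projection is transposed so that \(A_{att}\) is an n×n matrix; no softmax is applied because Eq. (2) does not include one, and taking d as the width of the projected vectors is our assumption. The weights and hidden states are toy values, not trained parameters.

```python
import math

# Minimal sketch of Eqs. (2)-(3): a self-attention matrix used as the GCN adjacency.


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def relu(m):
    return [[max(0.0, v) for v in row] for row in m]


def attention_adjacency(H, Wq, Wk):
    """Eq. (2) with H_q = H_k = H: A_att = (H W_q)(H W_k)^T / sqrt(d)."""
    Q, K = matmul(H, Wq), matmul(H, Wk)
    d = len(Q[0])  # assumption: d is the width of the projected vectors
    return [[v / math.sqrt(d) for v in row] for row in matmul(Q, transpose(K))]


def contextual_gcn(H, Wq, Wk, weights):
    """Eq. (3): H_con^(l) = ReLU(A_att H_con^(l-1) W_con^(l)), with H_con^(0) = H."""
    A_att = attention_adjacency(H, Wq, Wk)
    H_con = H
    for W in weights:
        H_con = relu(matmul(matmul(A_att, H_con), W))
    return A_att, H_con


H = [[1.0, 2.0], [3.0, 0.0]]  # toy BiLSTM outputs for a 2-word sentence
I2 = [[1.0, 0.0], [0.0, 1.0]]
A_att, H_con = contextual_gcn(H, I2, I2, [I2])
```

Unlike \(A_{sy}\), every entry of \(A_{att}\) can be nonzero, so each word can draw on every other word in proportion to their learned similarity, regardless of the parse.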

3.3 Common information extraction module

Intuitively, most sentences contain both syntactic structure information and semantic content information, and the two are closely related. For example, in the sentence “The staff was horrible”, there is a clear semantic relationship between “staff” and “horrible”, and there is also a dependency arc between the two words. We therefore propose a common information extraction module that combines the two kinds of features. It uses a gating mechanism to integrate the syntactic representation \(H_{sy}\) from (1) and the content representation \(H_{con}\) from (3), and the common representation \(H_{c}\) is obtained from the following equations:

$$ H_{c}=g \times H_{s y}+(1-g) \times H_{con} $$
(4)
$$ g=\sigma\left( \left[H_{s y}, H_{c o n}\right] \times W_{g}+b_{g}\right) $$
(5)

where g is the gate; \(\left [H_{s y}, H_{c o n}\right ]\) is the concatenation of \(H_{sy}\) and \(H_{con}\); \(W_{g}\) and \(b_{g}\) are the model weight and bias, respectively; σ is the sigmoid activation function.
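The gate in Eqs. (4)-(5) can be sketched row by row in plain Python, assuming \(W_{g}\) has shape 2d×d so that g is an element-wise gate of the same width as \(H_{sy}\). The values are toy numbers chosen to show the two extremes: with zero weights the gate is 0.5 everywhere and \(H_{c}\) is the average of the two views, while a large positive or negative bias makes a dimension copy \(H_{sy}\) or \(H_{con}\), respectively.

```python
import math

# Minimal sketch of the gate in Eqs. (4)-(5), applied row by row.


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def common_information(H_sy, H_con, W_g, b_g):
    """H_c = g * H_sy + (1 - g) * H_con with g = sigmoid([H_sy, H_con] W_g + b_g)."""
    H_c = []
    for h_sy, h_con in zip(H_sy, H_con):
        z = h_sy + h_con  # list concatenation [h_sy, h_con], length 2d
        g = [sigmoid(sum(z[k] * W_g[k][j] for k in range(len(z))) + b_g[j])
             for j in range(len(b_g))]
        H_c.append([gj * a + (1.0 - gj) * b for gj, a, b in zip(g, h_sy, h_con)])
    return H_c


H_sy = [[1.0, 2.0]]
H_con = [[3.0, 4.0]]
W_zero = [[0.0, 0.0] for _ in range(4)]  # shape (2d, d) with d = 2
balanced = common_information(H_sy, H_con, W_zero, [0.0, 0.0])   # g = 0.5 everywhere
skewed = common_information(H_sy, H_con, W_zero, [20.0, -20.0])  # g close to [1, 0]
```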

3.4 Fusion mechanism

Before obtaining the final sentence representation, we apply a fusion mechanism that extracts the relevant features of the different modules to obtain a more accurate representation. \(H_{sy}\), \(H_{con}\), and \(H_{c}\) interact through the following formulas:

$$ H_{s y}^{\prime}=softmax\left( H_{s y} W_{1}\left( H_{c o n}\right)^{T}\right) H_{s y} $$
(6)
$$ H_{con }^{\prime}=softmax\left( H_{con} W_{2}\left( H_{s y}\right)^{T}\right) H_{c o n} $$
(7)
$$ H_{c\_sy}^{\prime}=softmax\left( H_{c} W_{3}\left( H_{con }\right)^{T}\right) H_{s y} $$
(8)
$$ H_{c\_con}^{\prime}=softmax\left( H_{c} W_{4}\left( H_{s y}\right)^{T}\right) H_{con} $$
(9)

where \(W_{1}\), \(W_{2}\), \(W_{3}\) and \(W_{4}\) are trainable parameters. Then \(H_{c\_sy}^{\prime }\) and \(H_{c\_con}^{\prime }\) are combined to give \(H_{c}^{\prime}\) as follows:

$$ H_{c}^{\prime}=\frac{\alpha H_{c\_sy}^{\prime}+\beta H_{c\_con}^{\prime}}{2} $$
(10)

where α and β are model parameters.
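Eqs. (6)-(9) share one pattern, softmax(X W Y^T) V, with different choices of query, key, and value matrices, so the fusion step can be sketched with a single helper. This is a minimal plain-Python illustration, not the authors' code; treating α and β as scalars is an assumption. Zero weights are used in the example because they make every attention row uniform, so each output row is simply the mean of the value rows and can be checked by hand.

```python
import math

# Minimal sketch of the fusion mechanism, Eqs. (6)-(10).


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def softmax_rows(m):
    out = []
    for row in m:
        top = max(row)  # subtract the max for numerical stability
        e = [math.exp(v - top) for v in row]
        s = sum(e)
        out.append([v / s for v in e])
    return out


def cross_attend(X, W, Y, V):
    """softmax(X W Y^T) V: the common form of Eqs. (6)-(9)."""
    return matmul(softmax_rows(matmul(matmul(X, W), transpose(Y))), V)


def fuse(H_sy, H_con, H_c, W1, W2, W3, W4, alpha, beta):
    H_sy_p = cross_attend(H_sy, W1, H_con, H_sy)    # Eq. (6)
    H_con_p = cross_attend(H_con, W2, H_sy, H_con)  # Eq. (7)
    H_c_sy = cross_attend(H_c, W3, H_con, H_sy)     # Eq. (8)
    H_c_con = cross_attend(H_c, W4, H_sy, H_con)    # Eq. (9)
    # Eq. (10), with alpha and beta treated as scalars (assumption)
    H_c_p = [[(alpha * a + beta * b) / 2 for a, b in zip(ra, rb)]
             for ra, rb in zip(H_c_sy, H_c_con)]
    return H_sy_p, H_con_p, H_c_p


Z = [[0.0, 0.0], [0.0, 0.0]]  # zero weights give uniform attention weights
H_sy_p, H_con_p, H_c_p = fuse([[1.0, 0.0], [0.0, 1.0]], [[2.0, 2.0], [4.0, 0.0]],
                              [[1.0, 1.0], [1.0, 1.0]], Z, Z, Z, Z, 1.0, 1.0)
```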

Furthermore, we use the masking mechanism and average pooling to obtain the aspect representations \(h_{s y}^{\prime }\), \(h_{c o n}^{\prime }\), and \(h_{c}^{\prime }\), and concatenate them to obtain the final aspect representation \(h_{a}\):

$$ h_{s y}^{\prime}=f\left( mask\left( h_{s y_{1}}^{\prime}, h_{s y_{2}}^{\prime}, \ldots, h_{s y_{n}}^{\prime}\right)\right) $$
(11)
$$ h_{c o n}^{\prime}=f\left( mask\left( h_{c o n_{1}}^{\prime}, h_{c o n_{2}}^{\prime}, \ldots, h_{c o n_{n}}^{\prime}\right)\right) $$
(12)
$$ h_{c}^{\prime}=f\left( mask\left( h_{c_{1}}^{\prime}, h_{c_{2}}^{\prime}, \ldots, h_{c_{n}}^{\prime}\right)\right) $$
(13)
$$ h_{a}=\left[h_{s y}^{\prime}, h_{c o n}^{\prime}, h_{c}^{\prime}\right] $$
(14)

where mask(⋅) is the masking function, which filters the representations produced by the different modules so that only those of the aspect words are kept; f(⋅) is an average pooling operation.

Finally, we input the aspect representation ha into the softmax layer to obtain the final probability distribution of sentiment polarity, completing the ABSA task:

$$ y=softmax\left( W h_{a}+b\right) $$
(15)

where W and b are the model weight and bias, respectively.
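Eqs. (11)-(15) can be sketched as follows: keep the rows at the aspect positions, average them, concatenate the three pooled vectors, and apply a linear layer with softmax. This is a minimal plain-Python illustration; the aspect positions, representations, and classifier weights are hypothetical toy values, and the order of the three output polarities is an arbitrary choice for the example.

```python
import math

# Minimal sketch of Eqs. (11)-(15): aspect masking, average pooling,
# concatenation, and a softmax classifier.


def masked_mean(H, aspect_positions):
    """mask(.) keeps the rows of the aspect words; f(.) averages them."""
    rows = [H[i] for i in aspect_positions]
    return [sum(col) / len(rows) for col in zip(*rows)]


def aspect_representation(H_sy_p, H_con_p, H_c_p, aspect_positions):
    """Eq. (14): h_a = [h'_sy, h'_con, h'_c] (list concatenation)."""
    return (masked_mean(H_sy_p, aspect_positions)
            + masked_mean(H_con_p, aspect_positions)
            + masked_mean(H_c_p, aspect_positions))


def classify(h_a, W, b):
    """Eq. (15): softmax(W h_a + b); W has one row per polarity."""
    logits = [sum(w * x for w, x in zip(row, h_a)) + bj for row, bj in zip(W, b)]
    top = max(logits)
    e = [math.exp(v - top) for v in logits]
    s = sum(e)
    return [v / s for v in e]


# Toy 3-word sentence whose aspect occupies positions 1 and 2
h_a = aspect_representation([[1, 0], [3, 2], [5, 4]],
                            [[0, 0], [2, 2], [4, 4]],
                            [[9, 9], [1, 1], [1, 1]], [1, 2])
W = [[1, 0, 0, 0, 0, 0], [0] * 6, [0] * 6]  # hypothetical weights, 3 polarities
probs = classify(h_a, W, [0.0, 0.0, 0.0])
```

Note that the non-aspect word at position 0 (with the large values 9, 9 in the third view) has no effect on \(h_{a}\); after the fusion step, context reaches the aspect rows only through the attention in Eqs. (6)-(9).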

3.5 Model training

To optimize the model parameters and further improve performance, we adopt the regularization terms proposed by Li et al. [43]:

$$ R_{O}=\left\|A_{a t t}\left( A_{a t t}\right)^{T}-I\right\|_{F} $$
(16)
$$ R_{D}=\frac{1}{\left\|A_{a t t}-A_{s y}\right\|_{F}} $$
(17)

where I is an identity matrix, and \(A_{att}\) and \(A_{sy}\) are the attention score matrix and the adjacency matrix of the sentence, respectively.
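The two regularizers in Eqs. (16)-(17) are easy to compute directly. From the formulas, \(R_{O}\) is zero exactly when the rows of \(A_{att}\) are orthonormal, and \(R_{D}\) grows as \(A_{att}\) approaches \(A_{sy}\), so minimizing it keeps the attention graph distinct from the dependency graph. The plain-Python sketch below follows the equations as written; the small `eps` term is our own addition, not part of Eq. (17), to avoid division by zero when the two matrices coincide.

```python
import math

# Minimal sketch of the regularization terms in Eqs. (16)-(17).


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def frobenius(m):
    return math.sqrt(sum(v * v for row in m for v in row))


def orthogonal_reg(A_att):
    """Eq. (16): R_O = ||A_att A_att^T - I||_F."""
    n = len(A_att)
    P = matmul(A_att, transpose(A_att))
    return frobenius([[P[i][j] - (1.0 if i == j else 0.0) for j in range(n)]
                      for i in range(n)])


def differential_reg(A_att, A_sy, eps=1e-12):
    """Eq. (17): R_D = 1 / ||A_att - A_sy||_F.

    eps is our own addition to avoid dividing by zero when A_att equals A_sy.
    """
    diff = [[a - s for a, s in zip(ra, rs)] for ra, rs in zip(A_att, A_sy)]
    return 1.0 / (frobenius(diff) + eps)


I2 = [[1.0, 0.0], [0.0, 1.0]]
A_sy = [[0, 1], [1, 0]]
r_o = orthogonal_reg(I2)           # an orthogonal A_att gives R_O = 0
r_d = differential_reg(I2, A_sy)   # ||I - A_sy||_F = 2, so R_D is about 0.5
```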

Furthermore, we extend the traditional cross-entropy loss with similarity and difference losses. First, the similarity loss \(L_{sim}\) is motivated by the fact that the sentence representations \(H_{sy}\) and \(H_{con}\) derived from the RGCN and the contextual GCN are strongly related: context words that are semantically related in a sentence are usually also structurally connected, i.e., there is often a dependency arc between them. \(L_{sim}\) should therefore be as small as possible.

$$ L_{sim}=\left( H_{con}^{\prime}-H_{s y}^{\prime}\right)^{2} $$
(18)

The second improvement, the difference loss \(L_{dif}\), reflects the fact that the common information extraction module combines the content and structure information of a sentence into a representation \(H_{c}\) that carries both kinds of information, which should therefore differ from the representations \(H_{con}\) and \(H_{sy}\) that each use only one kind. The difference loss computes the products of \(H_{c}^{\prime}\) with \(H_{con}^{\prime}\) and with \(H_{sy}^{\prime}\), squares them, and sums the results. A smaller value indicates that the model makes fuller use of the different types of information:

$$ L_{dif}=\left( \left( H_{con}^{\prime}\right)^{T} H_{c}^{\prime}\right)^{2}+\left( \left( H_{s y}^{\prime}\right)^{T} H_{c}^{\prime}\right)^{2} $$
(19)
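Eqs. (18)-(19) apply a square to matrix-valued expressions without stating how the result is reduced to a scalar. The sketch below assumes the squares mean squared Frobenius norms (sums of squared entries), which is one common reading but is our assumption rather than something the text confirms. Under this reading, \(L_{dif}\) vanishes when the columns of \(H_{c}^{\prime}\) are orthogonal to those of both single-source representations.

```python
# Minimal sketch of L_sim (Eq. 18) and L_dif (Eq. 19). We read the squares of
# matrices as squared Frobenius norms (sum of squared entries); this is an
# assumption, since the text does not spell out the reduction.


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def sq_frobenius(m):
    return sum(v * v for row in m for v in row)


def similarity_loss(H_con_p, H_sy_p):
    """Eq. (18): small when the content and structure views agree."""
    return sq_frobenius([[a - b for a, b in zip(ra, rb)]
                         for ra, rb in zip(H_con_p, H_sy_p)])


def difference_loss(H_con_p, H_sy_p, H_c_p):
    """Eq. (19): small when the columns of H'_c are orthogonal to both views."""
    return (sq_frobenius(matmul(transpose(H_con_p), H_c_p))
            + sq_frobenius(matmul(transpose(H_sy_p), H_c_p)))


H_con_p = [[1.0, 0.0], [0.0, 1.0]]
H_sy_p = [[1.0, 0.0], [0.0, 0.0]]
l_sim = similarity_loss(H_con_p, H_sy_p)
l_dif = difference_loss(H_con_p, H_sy_p, [[0.0, 1.0], [0.0, 0.0]])
```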

Combining the above regularization and loss functions, the overall objective function is obtained, and the parameters of the model are optimized by backpropagation as follows:

$$ L=-\sum\limits_{i \in S} \sum\limits_{j \in P} {y_{i}^{j}} \log {p_{i}^{j}}+\lambda_{1} L_{s i m}+\lambda_{2} L_{d i f}+\lambda_{3}\left( R_{O}+R_{D}\right)+\lambda_{4}\|\theta\|_{2} $$
(20)

where S is the set of all input sentences; P is the set of all possible sentiment polarities; y is the true label; p is the probability predicted by the model; \(\lambda_{1}\) and \(\lambda_{2}\) are the coefficients of the two loss terms; \(\lambda_{3}\) and \(\lambda_{4}\) are the regularization coefficients; θ denotes the trainable parameters.
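As a worked example of Eq. (20), the sketch below adds the cross-entropy over a toy batch of two sentences to the weighted auxiliary terms. All inputs, including the λ values, are hypothetical; the parameter penalty uses the (unsquared) L2 norm exactly as Eq. (20) is written.

```python
import math

# Minimal sketch of the overall objective in Eq. (20). All numbers below,
# including the lambda values, are hypothetical.


def cross_entropy(labels, probs):
    """-sum_i sum_j y_i^j log p_i^j for one-hot labels (zero labels are skipped)."""
    return -sum(y * math.log(p)
                for ys, ps in zip(labels, probs)
                for y, p in zip(ys, ps) if y > 0)


def total_loss(labels, probs, l_sim, l_dif, r_o, r_d, theta, lambdas):
    lam1, lam2, lam3, lam4 = lambdas
    theta_norm = math.sqrt(sum(t * t for t in theta))  # ||theta||_2, as written
    return (cross_entropy(labels, probs) + lam1 * l_sim + lam2 * l_dif
            + lam3 * (r_o + r_d) + lam4 * theta_norm)


labels = [[1, 0, 0], [0, 1, 0]]                 # two sentences, one-hot polarities
probs = [[0.8, 0.1, 0.1], [0.25, 0.5, 0.25]]    # model outputs from Eq. (15)
loss = total_loss(labels, probs, l_sim=1.0, l_dif=2.0, r_o=0.0, r_d=0.5,
                  theta=[3.0, 4.0], lambdas=(0.1, 0.1, 0.1, 1e-4))
```

Because y is one-hot, only the log-probability of the correct class contributes to the cross-entropy term for each sentence.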

4 Experiments

In this section, we first introduce the public benchmark datasets used for evaluation and the implementation details. We then introduce the baseline models for comparison and describe the experimental results and analysis, followed by the ablation experiments and a case study. Finally, we discuss the effects of the number of GCN layers and of different dependency labels on the results, and analyze the efficiency of the model.

We implement MultiGCN with the PyTorch framework and run it on two NVIDIA GeForce RTX 3090 GPUs with CUDA 11.0.

4.1 Datasets

We evaluate our proposed model on four benchmark datasets: the Restaurant and Laptop datasets from the SemEval 2014 ABSA challenge [44], the Twitter dataset from Dong et al. [45], and the MAMS dataset recently released by Jiang et al. [46]. In all four datasets, each aspect in a sentence is labeled with one of three sentiment polarities (positive, negative, or neutral). The statistics of the four datasets are listed in Table 1.

Table 1 Statistics of datasets

4.2 Implementation details

In this work, we use the LAL parser [47] to obtain dependency trees. The word embeddings are initialized with pretrained 300-dimensional GloVe vectors [48]. Before the BiLSTM encoder, the relative position embedding of each word with respect to the aspect words and the part-of-speech embedding of each word are concatenated with the word embedding; both of these embeddings have dimension 30. The hidden layer dimension is set to 50. Our model is optimized with the Adam optimizer at a learning rate of 0.001. To prevent overfitting, the L2 regularization coefficient is set to 0.0001, and the dropout rates on the input word embeddings and the GCN modules are 0.7 and 0.1, respectively. We train for 50 epochs with a batch size of 16.
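For reproduction, the reported hyperparameters can be gathered in one place, which also makes the resulting layer widths explicit. The dataclass below is a hypothetical container with field names of our own choosing; it assumes the hidden size of 50 refers to each LSTM direction (d in Section 3), and it includes the 2-layer GCN setting chosen in Section 4.7.

```python
from dataclasses import dataclass

# Hypothetical container for the hyperparameters reported in Section 4.2
# (plus the 2-layer GCN chosen in Section 4.7). Field names are our own.


@dataclass(frozen=True)
class MultiGCNConfig:
    glove_dim: int = 300         # pretrained GloVe vectors
    position_dim: int = 30       # relative position embedding
    pos_tag_dim: int = 30        # part-of-speech embedding
    hidden_dim: int = 50         # assumed to be d, the size of each LSTM direction
    gcn_layers: int = 2
    learning_rate: float = 1e-3  # Adam
    l2_reg: float = 1e-4
    input_dropout: float = 0.7
    gcn_dropout: float = 0.1
    epochs: int = 50
    batch_size: int = 16

    @property
    def bilstm_input_dim(self) -> int:
        # Word, position, and part-of-speech embeddings are concatenated.
        return self.glove_dim + self.position_dim + self.pos_tag_dim

    @property
    def bilstm_output_dim(self) -> int:
        # Forward and backward states are concatenated, giving 2d per word.
        return 2 * self.hidden_dim


cfg = MultiGCNConfig()
```

Under these assumptions, the BiLSTM reads 300 + 30 + 30 = 360-dimensional inputs and emits 2 × 50 = 100-dimensional hidden states per word.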

We use two evaluation metrics: accuracy and macro-averaged F1 (Macro-F1). The former is the most common performance indicator for classification, while the latter is more suitable for datasets with imbalanced classes.
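The difference between the two metrics shows up clearly on an imbalanced toy split. The plain-Python sketch below computes accuracy and Macro-F1 (the unweighted mean of per-class F1 scores) using the 1 / -1 / 0 label convention from Section 3; the labels themselves are made up.

```python
# Minimal sketch of the two evaluation metrics. Labels follow the
# 1 / -1 / 0 (positive / negative / neutral) convention of Section 3.


def accuracy(gold, pred):
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def macro_f1(gold, pred, classes=(1, -1, 0)):
    """Unweighted mean of per-class F1; classes with no true positives score 0."""
    scores = []
    for c in classes:
        tp = sum(g == c and p == c for g, p in zip(gold, pred))
        fp = sum(g != c and p == c for g, p in zip(gold, pred))
        fn = sum(g == c and p != c for g, p in zip(gold, pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        scores.append(2 * precision * recall / denom if denom else 0.0)
    return sum(scores) / len(scores)


# An imbalanced toy split: the single negative example is misclassified.
gold = [1, 1, 1, 1, -1, 0]
pred = [1, 1, 1, 1, 1, 0]
acc = accuracy(gold, pred)   # 5/6, about 0.83
f1 = macro_f1(gold, pred)    # (8/9 + 0 + 1) / 3, about 0.63
```

Missing the only negative example costs one sixth of the accuracy but one third of the Macro-F1, which is why Macro-F1 is the more informative metric on skewed label distributions.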

4.3 Baseline models

We compare MultiGCN with several state-of-the-art baseline models to demonstrate its effectiveness. The baseline models are briefly described below.

  1) ATAE-LSTM [8] combines attention with LSTM for aspect-level sentiment analysis, using attention to obtain the contextual information that is most important to each aspect.

  2) IAN [9] first models the target and the context separately and then links them with an attention mechanism to obtain representations that incorporate their interaction.

  3) AOA [10] learns aspect representations and sentence representations that explicitly capture the interaction between the aspect and the context, and automatically focuses on the important parts of the sentence.

  4) MGAN [11] captures word-level interactions between the aspect and the context, alleviating the information loss of coarse-grained attention mechanisms.

  5) ASGCN [22] uses the syntactic dependency structure of sentences to address long-distance word dependencies in ABSA.

  6) CDT [23] enhances embeddings with GCNs that act directly on the sentence dependency tree to better learn sentence representations.

  7) kumaGCN [24] uses automatically induced aspect-specific graphs to obtain more comprehensive syntactic features.

  8) BiGCN [42] fully integrates hierarchical syntactic graphs and lexical graphs and obtains aspect-oriented representations through masking and gating mechanisms.

  9) InterGCN [25] simultaneously obtains important aspect-focused and inter-aspect information by using heterogeneous graphs.

  10) CL-GCN [49] makes use of both global and local structural information and fuses the two kinds of information with attention mechanisms.

  11) RGAT [27] uses dependency label information and a new attention function to help targets better capture useful contextual information.

4.4 Experimental results and analysis

Table 2 presents the experimental results for all baseline models and our proposed MultiGCN model using accuracy and Macro-F1 as evaluation metrics.

Table 2 Experimental results comparison on four publicly available datasets

Our model achieves the best results on the Restaurant, Laptop, Twitter and MAMS datasets. MAMS is the largest dataset, and it is also where MultiGCN improves most over the other state-of-the-art models: both accuracy and Macro-F1 increase by 1.86 percent, which suggests the potential of our model on large datasets. On the Restaurant, Laptop and Twitter datasets, accuracy improves by 0.27, 0.78 and 0.27 percent, and Macro-F1 by 1.11, 0.97 and 0.15 percent, respectively. These results demonstrate the generality of our approach and its ability to exploit structural, contextual, and common information in sentences.

Furthermore, a comparison of all models shows that those using a dependency tree or its variants achieve higher accuracy, indicating that structural information benefits the ABSA task. However, the gains from considering structural information alone have become increasingly limited. On the one hand, models can be improved by enriching the structural information, e.g., by adding the types of dependencies between words. On the other hand, other information about the sentence, such as the content relations among context words, can be modeled jointly with the structure.

4.5 Ablation study

As shown in Table 3, ablation experiments were conducted on the Restaurant, Laptop, Twitter, and MAMS datasets to verify the contribution of each component of our model. Four main components are ablated: the dependency labels, the common information extraction module, the fusion mechanism, and the improved loss function.

Table 3 Ablation study on the four datasets

MultiGCN w/o DL denotes our model with the dependency labels removed. The results show that removing them causes the largest performance drop on the Restaurant, Laptop and MAMS datasets, and also reduces both accuracy and Macro-F1 on the Twitter dataset. This suggests that dependency labels clarify the relationship between aspect words and context, locating the opinion words that carry sentiment polarity more precisely and thus improving prediction accuracy.

MultiGCN w/o ComInfo denotes the model without the common information extraction module. Compared with the complete MultiGCN, its performance declines on all datasets, most dramatically on Twitter. We believe this is because the Twitter dataset contains both well-structured sentences and many colloquial expressions, which makes it more sensitive to information that combines structure and content.

MultiGCN w/o FM denotes the model without the fusion mechanism. In other words, instead of letting the structural information, the contextual information, and the common information containing both interact, the generated sentence representations are fed directly into the pooling layer. Owing to the lack of interaction between features, performance decreases to varying degrees on all datasets.

Finally, we conduct experiments on our modified loss function. MultiGCN w/o Ldif refers to the model trained without the difference loss, i.e., the difference between features that contain both types of information and features that contain only structure or content information is not considered. MultiGCN w/o Lsim denotes the model trained without the similarity loss, i.e., the correlation between structure information and content information is not considered. The experimental results show that the similarity loss has a greater impact on most datasets. We attribute this to the fact that the vast majority of the data contain both syntactic structure and context information, so the model can be trained more effectively when the two features are well combined. MultiGCN w/o Ldif&Lsim denotes the model with both the difference loss and the similarity loss removed. As can be seen, using the two losses together further benefits the overall training of the model.

4.6 Case study

To further analyze the MultiGCN model, we visually investigate the effect of its different components through a case study. We adopt the masking method proposed by Sun et al. [23] to calculate the contribution of each word in a sentence to the final sentiment prediction; the larger the value, the more the word influences the final judgment. The calculation is given as follows:

$$ \gamma(w, s)=\frac{1}{m}\lvert h_{s}-h_{s / w} \rvert $$
(21)
$$ m=\max_{w' \in s} \lvert h_{s}-h_{s / w'} \rvert $$
(22)

where s denotes the selected example sentence, “The pizza is the best if you like thin crusted pizza.”; w denotes a word in the sentence; \(h_{s}\) denotes the final representation of the example sentence output by the model; \(h_{s/w}\) denotes the sentence representation obtained when word w is masked out, with words masked one at a time from the beginning to the end of the sentence; and m is a normalization factor.
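The masking procedure can be sketched independently of the model. In the plain-Python sketch below, the encoder is a hypothetical bag-of-words stand-in for MultiGCN, the mask token is our own choice, |·| is taken to be the Euclidean norm, and m is taken as the largest unnormalized difference \(\lvert h_{s}-h_{s/w} \rvert\) over all words, so the most influential word scores exactly 1.

```python
import math

# Minimal sketch of the word relevance score in Eqs. (21)-(22). The encoder
# below is a hypothetical bag-of-words stand-in for MultiGCN, the mask token is
# our own choice, and |.| is taken to be the Euclidean norm.

WORD_VECTORS = {"pizza": [1.0, 0.0], "best": [0.0, 2.0]}  # toy embeddings


def toy_encode(tokens):
    """Sum the vectors of known words; unknown words (and the mask) add nothing."""
    h = [0.0, 0.0]
    for t in tokens:
        h = [a + b for a, b in zip(h, WORD_VECTORS.get(t, [0.0, 0.0]))]
    return h


def relevance_scores(tokens, encode, mask_token="<mask>"):
    h_s = encode(tokens)
    raw = []
    for i in range(len(tokens)):  # mask one word at a time, front to back
        masked = tokens[:i] + [mask_token] + tokens[i + 1:]
        raw.append(math.dist(h_s, encode(masked)))
    m = max(raw)  # normalizer: the largest unnormalized difference
    return [r / m for r in raw] if m > 0 else raw


scores = relevance_scores(["the", "pizza", "is", "the", "best"], toy_encode)
```

Words whose removal leaves the sentence representation unchanged score 0, and these normalized scores are the kind of values a heat map such as Fig. 4 would display.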

In the example sentence, the aspect term is “thin crusted pizza”, and its sentiment polarity is neutral. Figure 4 visualizes the relevance scores of the words in the sentence as a heat map, which shows that MultiGCN accurately identifies the contextual information most relevant to the aspect words and minimizes attention to irrelevant information. When the dependency labels or the common information extraction module are removed, the scores of the words change, and more attention is paid to irrelevant context or even distracting information. Of the two, the common information extraction module has the greater impact on the whole sentence: after it is removed, the scores of interfering words increase significantly and become almost identical to those of the relevant words, which can easily lead to wrong judgments. These observations support the claim that our model can accurately locate opinion words and correctly predict the sentiment polarity of sentences.

Fig. 4

Visualization of word relevance scores

4.7 Effect of the number of GCN layers

To analyze the effect of the number of GCN layers, we conduct experiments on the Laptop and Restaurant datasets with 1 to 9 GCN layers. The results are plotted as line graphs in Fig. 5. The model performs best with 2 GCN layers, where both accuracy and Macro-F1 are highest. When the number of layers increases to 8, accuracy and Macro-F1 drop sharply, which we attribute to overfitting caused by the higher complexity of the model. We therefore adopt a 2-layer GCN on all datasets, which achieves good results.

Fig. 5

Effect of GCN layers on accuracy and Macro-F1

4.8 Effects of different dependency labels

To further explore the effect of dependency labels on accuracy, we select the 10 most common labels and remove them one at a time, measuring the resulting decrease in accuracy on the Laptop and Restaurant datasets. The results are illustrated in Fig. 6.

Fig. 6

Number of dependency labels and the impact on the accuracy

We make the following observations. First, on the Laptop dataset, the nsubj, dobj, amod and advmod labels have a large impact on accuracy. We believe this is because these labels mark subjects, direct objects, and adjectival and adverbial modifiers, which are crucial for determining the sentiment of a sentence; removing them therefore has the greatest impact. The ccomp, nmod, and det labels play a smaller role, with an average decrease in accuracy of 0.7. The remaining labels, mainly those indicating prepositions or conjunctions such as conj and cc, have little effect on the model. Second, on the Restaurant dataset, most results are consistent with those on the Laptop dataset. However, the dobj label has a much smaller impact than on the Laptop dataset. We speculate that this is because dobj accounts for only 35 percent of the Restaurant dataset but 67 percent of the Laptop dataset, so it affects model accuracy relatively little.

4.9 Analysis of model efficiency

To further evaluate the training efficiency of the model, we compare MultiGCN with RGAT in terms of the average training time per epoch and the total number of parameters, as shown in Table 4. For a fair comparison, we use the same batch size and hidden size for both models. The average training time per epoch of MultiGCN is slightly longer than that of RGAT because of additional operations such as the fusion mechanism and the extra loss terms, but the overall difference is not significant. MultiGCN has fewer parameters in total because it encodes features with GCNs.

Table 4 Efficiency comparison between MultiGCN and RGAT

5 Conclusion and future work

In this paper, we propose a MultiGCN model for the ABSA task. Different types of GCN, the common information extraction module, and the fusion mechanism are used to integrate the structure and content information of sentences, and the improved loss function further guides the optimization of the model parameters. The experimental results on four benchmark datasets show that our model achieves state-of-the-art performance. In future work, we will consider incorporating external knowledge to help the model understand the relationships between words in a sentence and thereby improve its performance.