1 Introduction

With the explosion of information in the era of big data, social networks and short video platforms have generated enormous amounts of comment data. Users give authentic and valuable reviews of specific products or services, and these reviews effectively reflect the dynamic changes in user needs. Commercial companies therefore need to identify the potential value of user reviews in a timely manner and improve their products accordingly to respond to market changes. Sentiment analysis of comment data is a natural solution, but a comment often involves more than one subject, a situation that traditional sentiment classification does not consider. The quality of a product can be judged by analyzing the sentiment polarity of a specific aspect in the review data. Hence, fine-grained sentiment analysis, specifically Aspect Category Sentiment Analysis (ACSA), has attracted widespread attention from researchers and businesses (Singh & Singh, 2021; Ozyurt & Akcayol, 2021), and is a hot topic in academic research.

The main goal of sentiment analysis is to determine the sentiment polarity of texts with emotional overtones. Common sentiment polarities include “positive”, “negative”, “neutral”, and so on. When there are more than two sentiment polarities, the task becomes a multi-class classification problem and can also be treated as a regression problem (Berka, 2020). According to the granularity of classification, sentiment classification can be divided into document-level, sentence-level, and aspect-level. While the first two emphasize the macro level, aspect-level sentiment analysis is more detailed and more demanding, requiring a correct judgment of the sentiment polarity of a specific aspect object in a text (Cambria, 2016). ACSA is a major research direction within aspect-level sentiment classification. For example, in the sentence “The pizza is very good and huge”, the word “pizza” can be assigned to the aspect category “food”, whose sentiment can be judged as positive from the semantics of the context. The aspect-category words may or may not appear directly in the text, and a text may involve more than one aspect-category. When inferring the sentiment of a given aspect, it is important to correctly analyze the contextual semantic features of that aspect.

In recent years, methods based on deep neural networks for extracting text semantics have emerged and achieved good results on ACSA tasks. To mine text features from the temporal information of text, schemes based on the long short-term memory model (LSTM; Hochreiter et al., 1997) have been continuously adopted. In addition, related studies have noted many other rich structures between words and constructed graph networks for aspect-level sentiment analysis (Zhu et al., 2022). Although these studies are effective, how to add the feature information contained in aspect words to the context remains a worthwhile research problem. Most studies transform aspect words into a vector that is embedded into every word of the text. Although this method retains the complete features of the aspect words, it is not sufficiently targeted and may weaken the extraction of some key semantic features. In addition, since Bahdanau et al. (2015) proposed applying attention mechanisms to natural language processing, many researchers have used attention mechanisms to capture important features of aspect words, and have devised many methods to compute attention scores, such as Self-Attention (Xiao et al., 2020) and Hierarchical Attention (Geed et al., 2022). Although most attention mechanisms can find sentiment words easily, they are likely to focus on sentiment features that are not relevant to the given aspect-category because they cannot locate it.

With this motivation, we pay more attention to the importance of aspect-category information at different locations in the context and propose “ALAN”, Aspect-Location Attention Networks for ACSA. ALAN consists of four modules. The first is the semantic representation module, which produces vector representations of the words in the sentence and of the aspect-category words. The second is the proposed novel aspect-location embedding module. It differs from approaches that combine aspect-category information indiscriminately, because it dynamically incorporates aspect-category information into the context, making the features that represent the aspect-category in the context more prominent. An improved attention mechanism is designed in the third module to focus on aspect-specific sentiment features. This attention network differs from traditional attention methods: instead of directly using the aspect-category embedding as the reference for computing attention scores, the proposed method uses the whole-sentence representation combined with the aspect-location embedding as the attention reference, so the attention weights can be calculated in a more targeted manner. The last module is the classifier, which takes the output features and performs sentiment classification. To demonstrate the universality of the aspect-location embedding module, we derive a variant of ALAN (ALANvar) without attention support. ALANvar utilizes convolutional neural networks (CNNs; Kim, 2014) with a gating mechanism to integrate aspect-related sentiment information in the context.

ALAN has the following advantages: (I) Our proposed aspect-location embedding module can be applied to most ACSA tasks, and other ACSA approaches can also use it as a special embedding module to enhance their initial representations. (II) The attention modeling approach combined with aspect-location embedding can better integrate aspect-specific sentiment features in sentences. (III) Each module in ALAN performs its own feature extraction, resulting in low overall coupling and good robustness.

We summarize our main contributions as follows:

  • A novel aspect-location embedding module is proposed that dynamically combines aspect categories and contextual information. This is the most prominent contribution of our work.

  • An improved attention network based on aspect-location embedding representation is further proposed to aggregate aspect-specific sentiment features in sentences.

  • A variant model of ALAN without the support of an attention mechanism is devised, and shows the superiority of the aspect-location embedding method.

  • Results on three experimental datasets show that ALAN consistently outperforms other compared baseline models and demonstrate the effectiveness of ALAN and its variant.

The remainder of this article is organized as follows. Section 2 reviews related work. Section 3 details the structure and realization of our model. Section 4 presents the results and descriptions of all experiments. Section 5 concludes the paper and provides an outlook on future research directions.

2 Related work

2.1 Early research

Most early papers use machine learning methods such as Naive Bayes, Maximum Entropy, Logistic Regression, Support Vector Machines (SVM), etc. (Tripathy et al., 2016; Al-Smadi et al., 2017). These methods focus on traditional sentiment analysis and aspect-term sentiment classification and cannot be fully applied to ACSA tasks, but they still offer useful insights. The basic idea is to apply these algorithms to predict the most likely class based on a complex combination of features, but such features usually need to be designed manually. Varghese and Jayasree (2013) combined dependency parsing, co-reference parsing, and SentiWordNet, with SVM as the primary classifier. Singh et al. (2013) proposed aspect-term sentiment classification of movie reviews using different linguistic features and n-gram feature extraction based on the SentiWordNet scheme. Karagoz et al. (2019) proposed a framework focusing on aspect extraction and aspect sentiment word retrieval using an unsupervised approach and provided a tool to visualize the analysis results. The methods mentioned above have achieved some results. However, due to the explosive growth of data on platforms such as social networks, traditional sentiment analysis methods face great obstacles in the era of big data.

2.2 Deep learning methods for ACSA

The rapid development of neural networks and deep learning has driven the development of natural language processing, and a large number of deep learning methods for ACSA have been proposed. LSTM and GRU (Chung et al., 2014), based on Recurrent Neural Networks, have been widely adopted for various sentiment classification tasks. Tang et al. (2016) proposed TD-LSTM and TC-LSTM based on the connection between the target words and their context. TD-LSTM uses two LSTMs to model the parts of the sentence before and after the target words, respectively, and finally fuses the extracted features to determine the sentiment polarity of the text. To better exploit the relationship between the target words and the entire text, TC-LSTM builds on TD-LSTM by explicitly concatenating the target-word representation with each word of the text at the embedding layer. Although the correlation between the target words and the context is taken into account, such a simple linkage is not sufficient to maximize their internal association.

Later, inspired by the success of the attention mechanism in machine translation and the ability of Memory Networks to improve machine reading comprehension (Hermann et al., 2015), Wang et al. (2016) proposed an attention-based LSTM. This strategy incorporates aspect word embeddings to compute attention weights, thereby forcing the model to attend to the important parts of the text. However, the model considers only the aspect content when calculating contextual weights, and the aspect embedding itself remains unchanged. Ma et al. (2017) noticed that the evaluation object and the context can be modeled separately and proposed IAN. They used an interactive method to calculate the attention weights of the two, and the learned features are concatenated as the sentiment representation.

CNNs have also proven effective in NLP (Kim, 2014; Ramaswamy & Chinnappan, 2022), since the convolution operation can capture semantically rich text features. TextCNN (Kim, 2014) applied CNNs to text classification tasks. Its core idea is to capture local features; for text, local features are sliding windows consisting of several words, similar to n-grams (Mikolov et al., 2013). Xue and Li (2018) proposed a model based on CNNs and a gating mechanism, in which a new gate unit controls the output sentiment features for a given aspect-category. Since convolutional layers are not time-dependent, training computations can be parallelized, reducing the time cost; however, this same lack of temporal dependency means word-order features are ignored.

With the rise of graph networks, some studies advocate constructing relational graphs for ACSA tasks. Liang et al. (2021) proposed an aspect-aware graph convolutional network (AAGCN). They designed a beta-distribution-guided aspect-aware algorithm to compute the relational weights between the aspect and external affective knowledge, and the obtained results are transferred to the syntactic dependency tree of the original sentence. In this way, GCN networks are constructed for aspect-category sentiment classification. Subsequently, they investigated a new few-shot aspect category sentiment analysis task (Liang et al., 2022) and proposed a meta-learning framework combined with their previous aspect-aware information.

2.3 Pre-trained models (PTMs) for ACSA

In recent years, a great deal of research has shown that PTMs trained on large corpora can learn general language representations. This is the effect of transfer learning, which gives the model prior knowledge instead of starting from scratch. Results on several international NLP benchmarks confirm the advantages of PTMs. PTMs have evolved through roughly two stages. The first stage trains a representation for each single word, so that words with similar semantics or of the same category are related in the vector space; however, these representations are context-independent, e.g., Word2Vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014). The second stage brought a major breakthrough: BERT (Kenton & Toutanova, 2019), which adopts the encoder of the Transformer (Vaswani et al., 2017). BERT (Bidirectional Encoder Representations from Transformers) is a language model trained by Google in an unsupervised manner on a large unlabeled corpus. It learns contextually relevant word vectors and obtains a better representation of the text before processing downstream tasks. Xu et al. (2019) framed ACSA as a new task called Review Reading Comprehension and explored a post-training approach for BERT, treating the aspect and the text as two sentences connected by a special token. Gao et al. (2019) designed a target-dependent model based on BERT (TD-BERT) that locates the output at the target term, optionally with a sentence containing the target. Although it produced good results in context-aware representation, the embedding representation of aspects was not incorporated into the model. Dai et al. (2021) investigated whether PTMs contain sufficient syntactic information for aspect-level sentiment analysis. After experimentally comparing trees induced from PTMs with dependency parsing trees, their proposed RoBERTa-MLP demonstrated that PTMs implicitly encompass task-oriented grammatical information. Liu et al. (2021) took a more direct approach, transforming the ACSA task into a natural language generation task using the pre-trained language model BART. They designed templated natural language sentences to represent the output, and the last word of the template sentence is used as the basis for determining sentiment polarity.

Recently, some researchers have considered ACSA and rating prediction (RP) as two highly related tasks and combined information from both tasks to construct models. Bu et al. (2021) collected a new dataset of Chinese reviews (called ASAP) containing the required labeled information for both ACSA and RP tasks. They also designed a joint model based on BERT to enhance the accuracy on both tasks. Fei et al. (2022) followed up their research by taking inspiration from human intuition and proposed a from-fine-to-coarse reasoning framework to obtain better performance on the joint task.

3 ALAN model

3.1 Problem definition

The ACSA task is to predict the sentiment polarity of a given aspect-category of a piece of text. The input can be regarded as a tuple (A,X), consisting of an aspect-category and a contiguous segment of text. The text X = {x1,x2,x3,...,xn} consists of n words, and the aspect-category A = {a1,a2,a3,...,am} contains m words. The output of ACSA is the sentiment label y ∈{1,2,...,K} of the given aspect-category, where K denotes the number of sentiment polarities. For ACSA, the number of aspect categories is always finite and each A has only a few words; in contrast, the text X is arbitrary, and generally m is smaller than n. The range of y can be further refined according to the specific task requirements. The same text may have multiple aspect-categories, and aspect-category words may appear directly in the text or may not appear at all.
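To make the notation concrete, the following minimal Python sketch shows one ACSA input-output instance; the class name `ACSAExample` and the mapping of label integers to polarities are illustrative assumptions, not part of the original formulation.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class ACSAExample:
    """One ACSA instance: a text X, one aspect-category A, and a sentiment label y."""
    text: List[str]      # X = {x1, ..., xn}
    aspect: List[str]    # A = {a1, ..., am}, usually m < n
    label: int           # y in {1, ..., K}

# The same text can appear in several examples, one per aspect-category.
example = ACSAExample(
    text="The pizza is very good and huge".split(),
    aspect=["food"],     # the aspect-category word need not occur in the text
    label=1,             # e.g. 1 = positive (label encoding assumed for illustration)
)
```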

3.2 Overview of ALAN

The overall architecture of ALAN is shown in Fig. 1. ALAN consists of four modules: the semantic representation module, the aspect-location embedding module, the aspect-location attention learning module, and the classifier module. In ALAN, the input tuple (A,X) of the ACSA task defined above first enters the semantic representation module, where the aspect-category and the text are mapped to semantic embedding representations. After that, the aspect-category representation and the sentence representation are dynamically fused into aspect-specific embedding representations by the proposed aspect-location embedding module. The original sentence representations pass through the LSTM layer in the aspect-location attention module and enter the attention layer together with the aspect-specific embedding representations to generate aspect-specific sentiment features through the aspect-location attention mechanism. The final features are fed into the classifier module for aspect-category sentiment classification. These four modules and their combination are described in detail below.

Fig. 1 The overall architecture of ALAN

3.3 Semantic representation

The main work of this part is to extract the semantic representations of the initial text and the aspect-category words. We use the pre-trained model BERT as the basis of our entire model architecture to obtain the embedding representations of the text. BERT uses the WordPiece tokenizer to convert the text X into a sequence of tokens, which is then expanded into a high-dimensional embedding representation

$$ \begin{array}{@{}rcl@{}} E_{x}=[e_{c},e_{1},e_{2},...,e_{n},e_{s}] \end{array} $$
(1)

through the embedding layer. \(E_{x}\in \mathbb {R}^{d\times (n+2)}\) where d is the dimension of the embedding layer of BERT, n is the length of the original text sequence. ec represents the embedding vector of the first token [CLS], and es represents the embedding vector of the last token [SEP]. After the multi-layer encoding of BERT, we take all the hidden states of the last layer as the initial embedding matrix

$$ \begin{array}{@{}rcl@{}} T=[t_{c},t_{1},t_{2},...,t_{n},t_{s}] \end{array} $$
(2)

where \(t_{i}\in \mathbb {R}^{d}\) is the final feature of each token; note that the input and output dimensions of BERT are consistent. We also need to convert the aspect-category words into an initial embedding matrix Ta and average-pool them to obtain the embedding vector \(v_{a}\in \mathbb {R}^{d}\) of the aspect category.

$$ \begin{array}{@{}rcl@{}} &T_{a}=[{t_{c}^{a}},{t_{1}^{a}},{t_{2}^{a}},...,{t_{m}^{a}},{t_{s}^{a}}] \end{array} $$
(3)
$$ \begin{array}{@{}rcl@{}} &v_{a}=\frac{1}{m+2}\left( {t_{c}^{a}} + {t_{s}^{a}} + \displaystyle\sum\limits^{m}_{i=1}{t_{i}^{a}}\right) \end{array} $$
(4)
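A hedged sketch of Eqs. (2)-(4) using the HuggingFace `transformers` library is shown below. The paper builds on BERT, but the checkpoint name, the helper function `encode`, and the use of `last_hidden_state` are our assumptions for illustration; the authors' implementation is in TensorFlow, whereas PyTorch is used here only for readability.

```python
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")
bert.eval()

def encode(text: str) -> torch.Tensor:
    """Return the last-layer hidden states, including the [CLS] and [SEP] tokens."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = bert(**inputs)
    return out.last_hidden_state.squeeze(0)     # shape: (n + 2, d), d = 768 for bert-base

T = encode("The pizza is very good and huge")   # Eq. (2): token features of the text
T_a = encode("food")                            # Eq. (3): token features of the aspect
v_a = T_a.mean(dim=0)                           # Eq. (4): average pooling over all m + 2 tokens
```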

3.4 Aspect-location embedding

This module mines the location information of the aspect-category in the text. Figure 2 shows the structure of the aspect-location embedding module. The text embedding matrix T and the aspect-category embedding va obtained from the semantic representation module are the inputs of this part. The same text may contain multiple aspect-categories, but standard BERT cannot express their different semantic features; directly predicting the sentiment of different aspect-categories with such a representation produces no difference and degrades to sentiment classification of the whole text. To alter the original semantic representation according to the given aspect-category, we propose a method to match location information and design a specialized aspect-location embedding function to construct location features. Specifically, a similarity score is computed between the aspect-category vector va and each vector in the matrix T, and the index \(s_{\max \limits }\) of the vector with the maximum similarity is taken. Then the embedding weight ri of each vector in the matrix T is calculated by

$$ \begin{array}{@{}rcl@{}} s_{\max}&=&\arg\max({v_{a}^{T}} T) \end{array} $$
(5)
$$ \begin{array}{@{}rcl@{}} r_{i}&=&\exp\left( -\frac{(i-s_{\max})^{2}}{2\sigma^{2}}\right) \end{array} $$
(6)

where \({v_{a}^{T}}\) is the transpose of the vector va and i ∈{0,1,2,...,n + 1} is the index of each token in the input text. \(\sigma \in \mathbb {R}\) is the location embedding rate, an adjustable hyperparameter. The reason for this design is twofold: the similarity score locates the approximate position of the aspect-category in the text, and the weights calculated by our aspect-location function retain more significant features near the aspect-category while weakening text features farther away. In this way, aspect-category information is dynamically embedded into the text.
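The location step of Eqs. (5)-(6) can be sketched as follows, continuing the tensors `T` and `v_a` from the previous snippet; the value of `sigma` is an illustrative choice of the location embedding rate σ (close to the best value reported for SemEval-14 in Section 4.5).

```python
import torch

sigma = 6.0                                    # location embedding rate, a tunable hyperparameter

scores = T @ v_a                               # dot-product similarity of every token with v_a
s_max = int(torch.argmax(scores))              # Eq. (5): index of the most aspect-similar token
idx = torch.arange(T.size(0), dtype=torch.float)
r = torch.exp(-((idx - s_max) ** 2) / (2 * sigma ** 2))   # Eq. (6): weights stay near 1 close to s_max
```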

Fig. 2 The detailed structure of aspect-location embedding module

We further combine the text embedding and the aspect-category embedding to learn the aspect-location representation P, which makes the location features better fit the aspect-category. Mathematically, we compute P as

$$ \begin{array}{@{}rcl@{}} t_{i}^{\ast}&=&r_{i}\times t_{i} \end{array} $$
(7)
$$ \begin{array}{@{}rcl@{}} {v_{a}^{i}}&=&(1-r_{i})\times v_{a} \end{array} $$
(8)
$$ \begin{array}{@{}rcl@{}} p_{i}&=&t_{i}^{\ast} +{v_{a}^{i}} \end{array} $$
(9)
$$ \begin{array}{@{}rcl@{}} P&=&[p_{0},p_{1},p_{2},...,p_{n},p_{n+1}] \end{array} $$
(10)

where \(p_{i}\in \mathbb {R}^{d}\) denotes the i-th vector in the aspect-location representation P.
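Continuing the sketch, Eqs. (7)-(10) blend each token vector with the aspect vector according to its location weight; the broadcasting below assumes the row-major `(n + 2, d)` layout of the earlier snippets.

```python
# Eqs. (7)-(9): tokens close to the located aspect keep their own features,
# while distant tokens are pulled toward the aspect-category vector v_a.
P = r.unsqueeze(1) * T + (1.0 - r).unsqueeze(1) * v_a     # Eq. (10): shape (n + 2, d)
```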

3.5 Aspect-location attention learning

This module mainly learns critical information in the text by exploiting the attention mechanism. Figure 3 is a schematic diagram of the aspect-location attention learning module. The inputs of the module are the text embedding matrix T and the aspect-location embedding P. A standard LSTM is unable to focus on the important semantic features related to the aspect-category in the text. To overcome this limitation, we design an attention mechanism based on the aspect-location embedding P to improve the LSTM. We feed the text embedding matrix T into an LSTM to obtain the hidden vector matrix H, consisting of the hidden vectors hi, by

$$ \begin{array}{@{}rcl@{}} f_{i}&=&sigmoid(W_{f}\cdot [h_{i-1},t_{i}]+b_{f}) \end{array} $$
(11)
$$ \begin{array}{@{}rcl@{}} g_{i}&=&sigmoid(W_{g}\cdot [h_{i-1},t_{i}]+b_{g}) \end{array} $$
(12)
$$ \begin{array}{@{}rcl@{}} \tilde{c}_{i}&=&tanh(W_{c}\cdot [h_{i-1},t_{i}]+b_{c}) \end{array} $$
(13)
$$ \begin{array}{@{}rcl@{}} c_{i}&=&g_{i}\odot\tilde{c}_{i}+f_{i}\odot c_{i-1} \end{array} $$
(14)
$$ \begin{array}{@{}rcl@{}} o_{i}&=&sigmoid(W_{o}\cdot [h_{i-1},t_{i}]+b_{o}) \end{array} $$
(15)
$$ \begin{array}{@{}rcl@{}} h_{i}&=&o_{i}\odot tanh(c_{i}) \end{array} $$
(16)
$$ \begin{array}{@{}rcl@{}} H&=&[h_{0},h_{1},h_{2},...,h_{n+1}] \end{array} $$
(17)

where Wf, Wg, Wc and Wo are weight matrices, bf, bg, bc, and bo are biases, ⊙ denotes element-wise multiplication, and ti is the i-th vector of the text embedding matrix T.
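Because Eqs. (11)-(17) are the standard LSTM recurrence, the hidden sequence H can be obtained with a library implementation; the following is a minimal PyTorch sketch under the assumption of a single-layer, unidirectional LSTM with hidden size d (the paper's implementation is in TensorFlow).

```python
import torch.nn as nn

d = T.size(-1)                        # embedding dimension, 768 for bert-base
lstm = nn.LSTM(input_size=d, hidden_size=d, batch_first=True)

H, _ = lstm(T.unsqueeze(0))           # Eq. (17): H has shape (1, n + 2, d)
H = H.squeeze(0)                      # drop the batch dimension for the later sketches
```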

Fig. 3 The detailed structure of aspect-location attention learning module

After obtaining the hidden vector sequence H, we calculate the correlation between each hidden state and the aspect-location embedding matrix P. The aspect-location attention mechanism will generate an attention-weight vector \(\alpha \in \mathbb {R}^{n+2}\) and a weighted hidden representation \(v_{\alpha }\in \mathbb {R}^{d}\).

$$ \begin{array}{@{}rcl@{}} M&=&tanh([W_{h} H; W_{p} P]) \end{array} $$
(18)
$$ \begin{array}{@{}rcl@{}} \alpha&=&softmax(w_{\alpha}^{T} M) \end{array} $$
(19)
$$ \begin{array}{@{}rcl@{}} v_{\alpha}&=&H \alpha^{T} \end{array} $$
(20)

where \(M\in \mathbb {R}^{2d\times (n+2)}\), \(W_{h}\in \mathbb {R}^{d\times d}\), \(W_{p}\in \mathbb {R}^{d\times d}\) and \(w_{\alpha }\in \mathbb {R}^{2d}\) are projection parameters.
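A hedged sketch of the aspect-location attention of Eqs. (18)-(20) is given below, continuing `H` and `P` from the earlier snippets. It is written with row-major `(n + 2, d)` tensors rather than the column-major matrices of the text, and the `nn.Linear` layers stand in for the projection parameters W_h, W_p, and w_α.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d = H.size(-1)
W_h = nn.Linear(d, d, bias=False)               # projection W_h
W_p = nn.Linear(d, d, bias=False)               # projection W_p
w_alpha = nn.Linear(2 * d, 1, bias=False)       # scoring vector w_alpha

M = torch.tanh(torch.cat([W_h(H), W_p(P)], dim=-1))    # Eq. (18): shape (n + 2, 2d)
alpha = F.softmax(w_alpha(M).squeeze(-1), dim=0)        # Eq. (19): attention weights, shape (n + 2,)
v_alpha = alpha @ H                                     # Eq. (20): weighted hidden representation, shape (d,)
```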

3.6 Classifier

The basic approach of ALAN is to directly apply the aspect-location representation vα for sentiment classification. A linear layer is added to compress vα to a length equal to the number of sentiment polarities, and we then convert it to a conditional probability distribution yα by

$$ \begin{array}{@{}rcl@{}} y_{\alpha}=softmax(W_{\alpha}v_{\alpha}+b_{\alpha}) \end{array} $$
(21)

where Wα and bα are the projection parameters of the linear layer. The sentiment polarity with the largest value in the conditional probability distribution yα is taken as the final sentiment classification prediction.

3.7 The variant of ALAN

ALANvar is constructed on the basis of the aspect-location embedding module, as shown in Fig. 4. It transforms the aspect-location embedding representations into n-gram features of the sentence and then max-pools these features to obtain sentiment representations for the corresponding aspect category.

Fig. 4 The structure of ALANvar

The aspect-location embedding representations are integrated with multiple (two in our experiments) convolutional networks with different convolutional kernel sizes, where different activation functions are utilized to control the range of the output. The results of the convolution are successively subjected to pooling and concatenating operations to learn different feature representations. We compute two convolutional representations \({c_{i}^{t}}\in \mathbb {R}^{d}\) and \({c_{i}^{r}}\in \mathbb {R}^{d}\) by

$$ \begin{array}{@{}rcl@{}} c_{i}^{t}&=&\tanh(P_{(i:i+k_{1})}\ast W_{t}+b_{t}) \end{array} $$
(22)
$$ \begin{array}{@{}rcl@{}} c_{i}^{r}&=&\text{relu}(P_{(i:i+k_{2})}\ast W_{r}+b_{r}) \end{array} $$
(23)

where ∗ represents the convolution operation, bt and br are biases, and k1 and k2 are the convolution kernel sizes. The activation functions are tanh and relu. Different kernel sizes are used in the two equations above, allowing the obtained convolutional features to be more representative and to present more specific aspect-location information. According to the characteristics of the two activation functions, tanh generates features that conform more consistently to the semantics of the text, while relu additionally accepts the changing features generated by the combination of text and aspect-category. The two convolutional sequences are max-pooled separately, and the resulting vectors are concatenated to obtain the final aspect-location representation vector \(v_{\beta }\in \mathbb {R}^{2{\times }d}\).

$$ \begin{array}{@{}rcl@{}} v_{\beta}={c_{m}^{t}}{\Vert}{c_{m}^{r}} \end{array} $$
(24)

where ∥ represents a concatenation operation, \({c_{m}^{t}}\) and \({c_{m}^{r}}\) are the vectors of the two convolutional sequences after max pooling. Finally, vβ is fed into the classifier module to obtain the conditional probability distribution yβ of the sentiment polarities.

$$ \begin{array}{@{}rcl@{}} y_{\beta}=softmax(W_{\beta}v_{\beta}+b_{\beta}) \end{array} $$
(25)

where Wβ denotes the weight matrix and bβ the bias of the classifier.
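A minimal sketch of ALANvar's convolution, pooling, and classification steps (Eqs. (22)-(25)) follows, reusing `P` from the earlier snippets; the kernel sizes `k1`, `k2` and the number of classes `K` are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d, K = P.size(-1), 3                        # embedding dimension; K = 3 polarities in our datasets
k1, k2 = 3, 4                               # two different kernel sizes (illustrative values)
conv_t = nn.Conv1d(d, d, kernel_size=k1)    # branch with tanh activation, Eq. (22)
conv_r = nn.Conv1d(d, d, kernel_size=k2)    # branch with relu activation, Eq. (23)
fc = nn.Linear(2 * d, K)                    # classifier of Eq. (25)

x = P.t().unsqueeze(0)                      # (1, d, n + 2): channels-first layout for Conv1d
c_t = torch.tanh(conv_t(x))                 # Eq. (22)
c_r = F.relu(conv_r(x))                     # Eq. (23)
c_m_t = c_t.max(dim=-1).values              # max pooling over positions
c_m_r = c_r.max(dim=-1).values
v_beta = torch.cat([c_m_t, c_m_r], dim=-1)  # Eq. (24): shape (1, 2d)
y_beta = F.softmax(fc(v_beta), dim=-1)      # Eq. (25): class probabilities
```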

3.8 Model efficiency analysis

We assume that the length of a sentence is n and the dimension of the embedding vector is d. The main time overhead of the semantic representation module lies in the multi-layer self-attention mechanism computed in BERT, so the time complexity of this process is O(n2d). The aspect-location embedding module mainly includes the computation of the maximum similarity and the aspect-location embedding representations, whose time complexities are O(n2d) and O(nd), respectively. The aspect-location attention learning module goes through the LSTM and the aspect-location attention mechanism successively, where the time complexity of the LSTM is O(nd2) and that of the aspect-location attention mechanism is O(n2d). Generally speaking, d is a large constant that cannot be ignored. Therefore, without restricting the sentence length, the overall time complexity of ALAN is O(n2d).

ALANvar also contains the semantic representation module and the aspect-location embedding module, so the time overhead of this part is the same. The difference is that ALANvar then uses convolutional computation. Assuming that the size of the convolution kernel is k (the other kernel dimension defaults to d), the required time overhead is O(knd2). Since the overhead of the convolution operation is much smaller than that of the first two modules, the time complexity of ALANvar is also O(n2d).

From the above theoretical analysis, the semantic representation module and the aspect-location embedding module account for the major time overhead. Although ALAN and ALANvar end up with the same time complexity, the computation of LSTM relies on the results of the previous time step for each time step, while CNNs can be computed in parallel. Therefore, in practice, the model efficiency of ALANvar is higher than that of ALAN.

4 Experiments

4.1 Datasets and experiment preparation

We have verified the effect of ALAN and ALANvar on three publicly available social review datasets, including two in English and one in Chinese.

The English datasets consist of restaurant review data from SemEval-2014 (Manandhar, 2014), SemEval-2015 (Pontiki et al., 2015), and SemEval-2016 (Pontiki et al., 2016). The SemEval-14 dataset has 5 aspect-categories (“food”, “anecdotes/miscellaneous”, “service”, “ambience”, “price”) as target objects for sentiment classification, while the SemEval-15 and SemEval-16 datasets include 12 aspect-categories. Since SemEval-16 extends SemEval-15 and a large portion of their training data overlaps, we merge their training sets and their test sets, respectively. Data labeled “conflict” in the original datasets are excluded; for example, the sentence “not a large place, but it’s cute and cozy” has the sentiment label “conflict” for the “ambience” aspect-category. We only keep data with the sentiment labels “positive”, “negative”, and “neutral”. Since the amount of “neutral” data in the training set of SemEval-15&16 is too small, the “neutral” category cannot be distinguished during training, which has a strong negative impact on all models. Therefore, we add four copies of the original “neutral” data to the training set as a form of data augmentation.

The Chinese dataset is from the “Fine-grained user comment sentiment analysis” track of “Global AI Challenger 2018”. It contains comment data from a social platform and is divided into four parts: training, validation, test A, and test B. The evaluation objects in the dataset are divided into two levels of granularity. The first level contains coarse-grained evaluation objects, such as the “service” and “location” mentioned in the review text; the second level contains fine-grained objects, such as “waiter’s attitude” and “waiting time” under the “service” category. There are four sentiment polarities for every fine-grained element: positive, neutral, negative, and unmentioned, labeled as 1, 0, −1, and −2. We simplify the second level by setting the sentiment polarity of each first-level aspect-category according to a “majority voting” strategy over the fine-grained sentiment labels. Samples with the “unmentioned” label are discarded. After processing, we divide the data of the original validation set into training and test data, and call the resulting dataset “AC-SemVal-2018”. The statistics of these datasets are shown in Table 1, and the distribution of aspect-categories in each dataset is shown in Fig. 5.
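A hedged sketch of the “majority voting” simplification described above is given below; the function name, the handling of ties, and the rule for discarding fully unmentioned aspects are our assumptions rather than the exact preprocessing script.

```python
from collections import Counter

def coarse_label(fine_labels):
    """Aggregate fine-grained labels (1, 0, -1; -2 = unmentioned) of one coarse-grained
    aspect-category into a single sentiment label by majority vote."""
    mentioned = [l for l in fine_labels if l != -2]
    if not mentioned:
        return None                      # sample discarded: the aspect is not mentioned at all
    return Counter(mentioned).most_common(1)[0][0]

# e.g. the "service" category aggregated from its fine-grained elements
print(coarse_label([1, 1, 0, -2]))       # -> 1 (positive)
```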

Table 1 Dataset statistics
Fig. 5 The distribution of aspect-categories: (a) SemEval-2014; (b) AC-SemVal-2018; (c) SemEval-15&16

4.2 Compared methods and experimental settings

We select some of the proposed and powerful baseline methods, conduct experimental comparisons, and evaluate our models. The following is a description of the comparison models:

  • LSTM (Hochreiter et al., 1997): Since the standard LSTM cannot incorporate any aspect-level information, the same text with different aspect-categories receives the same predicted sentiment polarity.

  • TextCNN (Kim, 2014): TextCNN uses three CNNs with different convolutional kernel sizes to convolve text features and stitch the pooled results together as the final representation.

  • ATAE-LSTM (Wang et al., 2016): The model combines aspect-category word vectors and LSTM-encoded hidden state sequences to learn attention weights, and weights all hidden vectors as aspect-level sentiment classification representations.

  • IAN (Ma et al., 2017): IAN models the aspect-category and the input text separately and designs an interactive attention mechanism based on the two LSTMs.

  • GCAE (Xue and Li, 2018): This method proposes an efficient model based on CNNs and Gating Mechanisms. The Tanh-ReLU gating unit retrieves the important information in the text based on the given aspect-category.

  • BERT-PT (Xu et al., 2019): BERT-PT combines the aspect words and the text as two sentences and inputs them into the BERT model for training, and the obtained [CLS] vector is used as the final representation of the sentiment polarity.

  • TD-BERT-QA-CON (Gao et al., 2019): This method is a variant of TD-BERT that focuses on fine-tuning the output of the BERT model: it pools the target-word outputs and concatenates them with the [CLS] vector as the final representation.

  • BERT-MLP (Dai et al., 2021): A simple and effective baseline model (RoBERTa-MLP/BERT-MLP) was proposed in Dai et al. (2021). It takes the output of the target words in RoBERTa/BERT and adds a maximum pooling layer and a multi-layer perceptron (MLP) to perform sentiment classification.

  • AAGCN-BERT (Liang et al., 2021): The model utilizes the beta distribution to calculate the relational weights of aspect category words with external sentiment knowledge and constructs graph networks on the syntactic dependency trees of the sentences.

  • BART generation (Liu et al., 2021): The model uses natural language generation to design template sentences to represent the output, and takes the features of the last word of the generated sentences as the basis for determining sentiment polarity.

The word vector initialization tool Word2Vec,Footnote 1 used for the English datasets (SemEval-14/SemEval-15&16), contains about 3 million common words and phrases with 300-dimensional vectors, trained by Google on the huge Google News corpus with a large amount of computational power. Words not covered by the Word2Vec vocabulary are randomly initialized from a uniform distribution U(−0.5, 0.5). We only use the 200,000 most frequent words, treating the others as unknown and initializing them randomly. For the Chinese dataset (AC-SemVal-2018), the Word2Vec modelFootnote 2 produced by Li et al. (2018) and Qiu et al. (2018) is used, which is trained on a large Weibo corpus. The BERT models used in the experiments for the English and Chinese datasets are bert-base-uncasedFootnote 3 and bert-base-chinese,Footnote 4 respectively.

TD-BERT-QA-CON and BERT-MLP analyze the sentiment of target words in the text, whereas the object of study in this paper is the aspect-category, which does not necessarily appear directly in the text. Therefore, we adopt a target-word matching approach: we find the word in the text with the highest similarity to the aspect-category and take it, together with the word before and the word after it, as the target words of the given aspect-category.
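A small sketch of this matching procedure is given below; the function name and the exact handling of the neighbor window at sentence boundaries are our assumptions.

```python
import numpy as np

def match_target_window(token_vecs, aspect_vec, tokens):
    """Pick the token most similar to the aspect-category vector and return it together
    with one token before and one after, used as the target words for the baseline."""
    sims = token_vecs @ aspect_vec            # token_vecs: (n, d), aspect_vec: (d,)
    i = int(np.argmax(sims))
    lo, hi = max(0, i - 1), min(len(tokens), i + 2)
    return tokens[lo:hi]
```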

All the compared models are implemented in the TensorFlow (Abadi et al., 2016) framework and trained on a single NVIDIA TITAN RTX GPU (24GB RAM). Dataset preprocessing is performed in all experiments. In the English datasets, all uppercase letters are converted to lowercase. For the Chinese dataset, we remove stop words using a stop-word list collected from the web, and we also remove words related to time, numbers, and symbols, as they are not relevant to the sentiment classification task. Regarding text length, if a text exceeds the set maximum length it is truncated at the end, and if it is shorter it is padded with zeros at the front. For the parameters of the baseline methods, we either use the values from the original papers or perform several experiments to select the best values. With BERT as the embedding layer, the parameters are set according to the recommendations in the BERT source code.Footnote 5 In our method, the embedding size of all word vectors is 768, and other hyperparameters, including the number of epochs, batch size, maximum sequence length, learning rate, and optimizer, are reported in Table 2. Since the datasets used in the experiments do not provide a specified validation set, we randomly select 10% of the samples from the training set as a validation set to tune the above hyperparameters. We also adopt this approach for the compared methods to ensure fairness.

Table 2 The main hyperparameter settings

4.3 Results analysis of aspect-category sentiment classification

To validate the performance of ALAN and ALANvar, we use the overall classification accuracy (Acc) and macro F1-score (F1) as evaluation criteria, with identical training data for all methods. Table 3 shows the performance of ALAN and ALANvar compared with the other baseline models on the three datasets.

Table 3 The performance of ALAN compared with other baseline models. The bold emphasis indicates the maximum value of the performance comparison in its column

In general, the standard LSTM and TextCNN perform poorly, especially in terms of macro F1-score, where they lag behind the other methods by a large margin. Although their overall classification accuracy is higher than that of ATAE-LSTM and IAN on the English datasets, this is only because ordinary neural networks tend to be skewed toward the majority classes in the training samples. The underlying reason for their poor performance is that aspect-category information is not considered, so every word is treated equally by the network, which harms the sentiment classification of different aspect-categories. The experimental results on the Chinese dataset similarly show that the standard LSTM and TextCNN are far inferior to the models that consider the aspect-category. Therefore, these two methods may be more suitable for ordinary text classification tasks. GCAE outperforms ATAE-LSTM and IAN on the English datasets but slightly underperforms IAN on the Chinese dataset, because the attention mechanism adopted by ATAE-LSTM and IAN can use aspect-category information as supervision to effectively capture important contextual information; owing to its special gating mechanism, GCAE can obtain richer features in combination with CNNs. The methods fine-tuned from the BERT model (BERT-PT, TD-BERT-QA-CON, BERT-MLP) consistently surpass the earlier methods on all datasets. Although the overall classification accuracy of TD-BERT-QA-CON and BERT-MLP is slightly inferior to IAN on the Chinese dataset, their macro F1-scores are stable and better than those of ATAE-LSTM, IAN, and GCAE. Benefiting from its exploitation of external knowledge and rich graph structure, AAGCN-BERT has slightly higher accuracy than ALAN and ALANvar on SemEval-15&16. BART generation, although slightly inferior to ALAN and ALANvar, still outperforms all the compared BERT-based methods; this is mainly due to differences in the pre-training corpus and the number of parameters, where BART has a clear advantage over BERT.

Our ALAN and ALANvar further strengthen the role of aspect-category objects and their related contextual information by combining aspect-category location distribution features with a special location attention mechanism. From the experimental results, our proposed ALAN method consistently outperforms all compared methods (accuracy improves by 0.1% on SemEval-14 and 0.71% on AC-SemVal-2018; macro-F1 improves by 0.38% on SemEval-14, 0.17% on SemEval-15&16, and 0.89% on AC-SemVal-2018). State-of-the-art results are achieved on the three experimental datasets in terms of both performance metrics (Acc and F1).

To test whether the performance differences between methods are statistically significant, we apply the non-parametric McNemar’s test (Dietterich, 1998). This test is well suited for our purpose because it does not require the data to be normally distributed and has been used in related studies (Chen et al., 2019). To compare method A and method B with McNemar’s test, we count the number of samples correctly classified by A but not by B (denoted n10) and the number of samples correctly classified by B but not by A (denoted n01). We can then compute the statistic

$$ \begin{array}{@{}rcl@{}} \chi^{2}=\frac{(\lvert n_{01}-n_{10} \lvert -1)^{2}}{n_{01}+n_{10}} \end{array} $$
(26)

which is distributed as χ2 with 1 degree of freedom. Performance is considered to be statistically significantly different only if the p-value of the computed statistic is below a pre-specified significance level.
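Eq. (26) and the corresponding p-value can be computed as in the following sketch; the counts passed to the function are illustrative, not values taken from our experiments.

```python
from scipy.stats import chi2

def mcnemar(n01: int, n10: int, alpha: float = 0.05):
    """Continuity-corrected McNemar statistic of Eq. (26) and its p-value under the
    chi-square distribution with 1 degree of freedom."""
    stat = (abs(n01 - n10) - 1) ** 2 / (n01 + n10)
    p = chi2.sf(stat, df=1)
    return stat, p, p < alpha               # significant if p is below the chosen level

print(mcnemar(n01=35, n10=12))              # illustrative counts only
```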

The results of the statistics are shown in Tables 4 and 5, where we use a significance level of 5%. From the data in the tables, the performance of our proposed ALAN and ALANvar on the three datasets is mostly superior to that of the baseline methods, and only a few individual methods do not show significant differences. As noted earlier, on SemEval-15&16 the accuracy of AAGCN-BERT is slightly higher than that of ALANvar while its macro-F1 is much lower, so it is difficult to distinguish their performance from those metrics alone. However, from the results of McNemar’s test, the p-value of ALANvar vs. AAGCN-BERT is significant at the 5% level. Thus, it can be verified that ALAN and ALANvar have superior performance.

Table 4 McNemar’s statistics between the results of ALANvar and other methods on each dataset
Table 5 McNemar’s statistics between the results of ALAN and other methods on each dataset

4.4 Ablation study

To further investigate how different components of ALAN affect ACSA performance, we evaluate ablated versions of the model. The ablations exclude the BERT semantic representations (word2vec-ALANvar, word2vec-ALAN), the aspect-location embedding (ALANvar/AE, ALAN/AE), and the aspect-location attention mechanism (ALAN/Atten), respectively. In word2vec-ALANvar and word2vec-ALAN, we use Word2Vec instead of BERT as the word embedding layer. The comparison of the complete ALAN model with its ablated versions is shown in Table 3.

Overall, the performance of ALANvar and ALAN suffers to different degrees after removing important components, with some differences across datasets. We observe that ALANvar and ALAN degrade to some extent on all datasets after losing the support of BERT. However, compared to all Word2Vec-based baseline models (LSTM, TextCNN, ATAE-LSTM, IAN, and GCAE), the performance of word2vec-ALANvar and word2vec-ALAN is still superior (accuracy improves by 1.84% on SemEval-14, 1.95% on SemEval-15&16, and 0.93% on AC-SemVal-2018; macro-F1 improves by 4.37% on SemEval-14, 3.42% on SemEval-15&16, and 0.69% on AC-SemVal-2018). Compared with the full models, ALANvar/AE, ALAN/AE, and ALAN/Atten show small decreases in accuracy and F1, around 1-2%, on the two relatively small sentiment datasets (SemEval-14, SemEval-15&16). Their performance drops by about 10% on the larger Chinese dataset (AC-SemVal-2018), even falling below the Word2Vec-based ALANvar and ALAN. This demonstrates the effectiveness of the aspect-location embedding and the aspect-location attention mechanism, which are indispensable to the proposed model. It can also be seen that BERT shows excellent performance on small-scale datasets with only fine-tuning, but for larger and more complex datasets, specially designed neural network structures are needed.

4.5 Impact of the proposed parameter

For the location embedding rate σ proposed in our method, we recommend choosing a wide candidate range in the experiments, which varies with the length of the text, and then using a large step size to select candidate values within the range and verify the best one. For example, for the SemEval-14 dataset, we vary σ from 6 to 51 in increments of 5. For the Chinese dataset (AC-SemVal-2018), since Chinese texts are much longer than the English texts, we choose a larger candidate interval and vary σ from 16 to 64 in increments of 8.

Figures 6, 7 and 8 show the changes in the performance of our models after adjusting σ on each dataset. We evaluate the models with classification accuracy and macro F1-score, train the model after fixing the value of σ, and take the optimal result for analysis. From the line graphs, the location embedding rate σ has a fairly obvious effect on the performance of ALANvar and ALAN. The performance of ALANvar and ALAN peaks around σ = 6 on SemEval-14 and SemEval-15&16, and around σ = 24 on AC-SemVal-2018. The trend of the lines shows that the performance of ALANvar decreases relatively steadily as σ increases. Although the change in the performance of ALAN is less pronounced, it can still be seen that smaller values of σ work better. This is because, as σ increases, the aspect-location embedding weights become smoother until the embedding no longer distinguishes the given aspect-category. Aspect-specific representations are then missing, leading to a decrease in model performance.

Fig. 6 The performance of different σ in SemEval-14: (a) Accuracy (%); (b) Macro-F1(%)

Fig. 7 The performance of different σ in SemEval-15&16: (a) Accuracy (%); (b) Macro-F1(%)

Fig. 8 The performance of different σ in AC-SemVal-2018: (a) Accuracy (%); (b) Macro-F1(%)

4.6 Stability evaluation

The aim of this section is to validate a model trained on a small dataset and analyze whether it maintains the same performance (defined as the stability of predictive performance) when predicting a large amount of data. This experiment also avoids the evaluation error caused by small test sets. For this purpose, we provide a large Chinese test dataset, obtained by processing the training set of the “Fine-grained user comment sentiment analysis” track of “Global AI Challenger 2018”. It comes from the same source as AC-SemVal-2018 but has no overlap with it, and is denoted AC-SemTra-2018. Its processing is the same as for AC-SemVal-2018 (see Section 4.1 for details), and its data distribution is recorded in Table 1. Table 6 shows the results of predicting AC-SemTra-2018 with the models trained on AC-SemVal-2018. It clearly shows that our models are robust, consistently superior to the baseline methods by a significant margin, and achieve higher overall classification accuracy. We find a slight decrease in macro F1-scores, so we further investigate the classification of the three sentiment polarities. Figure 9 compares the classification results of our models on AC-SemVal-2018 and AC-SemTra-2018. From Fig. 9, we can see that ALANvar and ALAN do not change much in predicting the sentiment labels “Positive” and “Negative”, but the prediction performance for “Neutral” decreases somewhat. It can be inferred that the models have some difficulty in capturing the sentiment features of texts with the sentiment polarity “Neutral”.

Table 6 The performance of our methods and baseline models in AC-SemTra-2018. The bold emphasis indicates the maximum value of the performance comparison in its column
Fig. 9 F1-scores of the three sentiment polarities: (a) ALANvar; (b) ALAN

4.7 Case study

4.7.1 Qualitative evaluation

We randomly select some qualitative examples from the test data of SemEval-2014 and present the prediction results of the proposed models ALANvar and ALAN and a baseline model BERT-PT in Table 7.

Table 7 Example predictions on SemEval-2014 (test)

Firstly, there are cases where only one aspect-category exists in the sentence. Such sentences are very likely to contain obvious sentiment words (“expensive” in example 1, “uncomfortable” in example 2, and “disappointed” in example 3), and the three models are able to predict their sentiment polarity relatively accurately. However, as in example 4, when multiple words (“a bit more friendly”) must be considered jointly to judge the sentiment of the aspect-category “service”, the performance of BERT-PT and ALAN is not very good. This is mainly because these two models judge the sentiment polarity one-sidedly from the sentiment word “friendly”. Furthermore, in examples 5 to 8, the sentence contains more than one aspect-category, and it is obviously more difficult to correctly predict their respective sentiment polarities. BERT-PT, which does not consider the embedding of aspect-categories, performs well only in example 6, and its correct prediction there is due to the explicit sentiment words “top-notch” and “unforgettable” in the context. For some other examples, such as examples 5, 7, and 8, where the polarities of some aspect-categories cannot be directly inferred from their complex contexts, BERT-PT performs much worse than ALANvar and ALAN. In examples 6 and 7, the lack of an attention mechanism weakens the focus on some important sentiment features, which leads to the poorer performance of ALANvar.

Overall, ALANvar and ALAN are better than BERT-PT in these cases, although they also make some misjudgments. For the misclassification cases mentioned above, we consider two mitigation methods. One is to extend the attention mechanism from a single head to multiple heads, thereby stabilizing the attention learning process. The other is to increase the number of samples similar to the error cases; data augmentation can be employed to collect more such samples to train the model. These schemes will be explored and validated in future work.

4.7.2 Validation of attention

To investigate whether the attention mechanism of ALAN is effective, we propose an intuitive method to examine how the aspect-category matches the relevant words in the text. This is achieved by visualizing the attention distribution over all words in the text as a heat map; the attention distribution ρ is calculated by

$$ \begin{array}{@{}rcl@{}} \rho=softmax(n\times \alpha[1:n+1]) \end{array} $$
(27)

where n is the length of the original text sequence. The slice α[1 : n + 1] denotes the weights from the first word to the last word of the text in the attention weight vector α (i.e., excluding the positions of [CLS] and [SEP]).
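A hedged sketch of Eq. (27) and the heat-map visualization is shown below, continuing the earlier sketches (`alpha` from the Eq. (19) snippet and `tokenizer` from the semantic representation snippet); the plotting details are illustrative only.

```python
import torch
import torch.nn.functional as F
import matplotlib.pyplot as plt

def attention_distribution(alpha: torch.Tensor, n: int) -> torch.Tensor:
    """Eq. (27): keep only the word positions of alpha (dropping [CLS]/[SEP]),
    rescale by n, and renormalize with softmax."""
    return F.softmax(n * alpha[1:n + 1], dim=0)

tokens = tokenizer.tokenize("The pizza is very good and huge")
rho = attention_distribution(alpha, n=len(tokens))

plt.imshow(rho.detach().unsqueeze(0).numpy(), cmap="Reds", aspect="auto")
plt.xticks(range(len(tokens)), tokens, rotation=45)
plt.yticks([])
plt.show()
```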

In this section, we study two exemplary reviews from SemEval-14 as experimental cases. ALAN is applied to both cases and obtains the correct sentiment classification. In Case 1, Fig. 10 shows two distinct attention distributions for the sentence “the food is reliable and the price is moderate” with its two given aspect-categories. The shade of the color represents the intensity of attention: the darker the color, the more important the corresponding word. For the explicit aspect-category “food”, the word “reliable” is colored red to black, and similarly for the aspect-category “price”, the word “moderate” has a striking color. This is in line with human judgment of semantic emotion in natural language and shows that ALAN can locate the emotional information of explicit aspect-categories very well. In Case 2, the given aspect-categories “service” and “ambience” do not appear as direct words in the sentence “the restaurant is rather small but we were lucky to get a table quickly”. This situation is genuinely difficult; after all, the machine cannot understand the overall semantics of the sentence as well as a human can. However, in Fig. 11, we find that ALAN focuses most of its attention on the first half of the sentence, “the restaurant is rather small”, for the aspect-category “ambience”, and pays more attention to the second half, “but we were lucky to get a table quickly”, for the aspect-category “service”. Therefore, to a certain extent, our model is effective in grasping the overall semantics of the text. In addition, as can be seen from Figs. 10 and 11, the prepositions and punctuation marks in the text are rarely attended to by ALAN, which is consistent with the normal logic of judging sentiment polarity: we do not care about these common words.

Fig. 10 Case 1, the explicit aspect-categories are “food” and “price”

Fig. 11 Case 2, the implicit aspect-categories are “service” and “ambience”

As we expected, aspect-category related words and sentiment characteristic words are attended to by ALAN and play a dominant role in correctly judging sentiment polarity. Thus, we conclude that our aspect-location attention mechanism captures important information and is able to model the overall semantics of the text.

5 Conclusion and future work

In this paper, we propose a memory neural network incorporating an aspect-location embedding and an aspect-location attention mechanism for the ACSA task. The aspect-location embedding is combined with the idea of attention to form a new attention model (ALAN), which pays more attention to the semantic relationships among the words of the sequence itself. Quantitatively, we compare the performance of ALAN and ALANvar with other network models through extensive experiments on English and Chinese datasets, and ALAN achieves state-of-the-art results. To avoid the evaluation error of small test sets, we validate the model trained on a small dataset on a large dataset, where it also maintains good performance. In addition, visualizations of the attention weights show that ALAN reasonably attends to the salient information in the input text, which is crucial for judging the sentiment polarity of sentences. Inspired by the analysis of the error cases, we hope to consider other content and different embedding methods in the aspect-location embedding module to further mine location memory information. Since neural networks have some instability, our model can be trained to improve its robustness according to the classification theory proposed by Colbrook et al. (2022). Moreover, it is also necessary to optimize the connection between the aspect-location embedding module and the semantic representation module in future work.