
1 Introduction

Aspect-based sentiment analysis (ABSA) is a fine-grained sentiment analysis task that aims to detect the sentiment polarity towards a given aspect in a sentence  [14, 17, 20]. The given aspect usually refers to an aspect term or an aspect category. An aspect term is a word or phrase explicitly mentioned in the sentence that represents a feature or entity of products or services. Aspect categories are pre-defined coarse-grained aspect descriptions, such as food, service, and staff in the restaurant review domain. ABSA therefore contains two subtasks, namely Aspect Term Sentiment Analysis (ATSA) and Aspect Category Sentiment Analysis (ACSA). Figure 1 shows an example of ATSA and ACSA. Given the sentence “The salmon is tasty while the waiter is very rude”, the sentiments toward the two aspect terms “salmon” and “waiter” are respectively positive and negative. ACSA detects the sentiment polarity towards a given pre-defined aspect category, which may be expressed explicitly or implicitly in the sentence. There are two aspect categories in the sentence of Fig. 1, i.e., food and waiter, and their sentiments are respectively positive and negative. Note that the annotations for ATSA and ACSA can be separated.

Fig. 1. An example of the ATSA and ACSA subtasks. The terms in red are two given aspect terms. Note that the annotations for ATSA and ACSA can be separated. (Color figure online)

To study ABSA, several public datasets have been constructed, including multiple SemEval Challenge datasets  [18,19,20] and the Twitter dataset  [5]. However, in these datasets, most sentences contain only one aspect or multiple aspects with the same sentiment polarity, which makes ABSA degenerate into sentence-level sentiment analysis  [9]. For example, only 0.09% of instances in the Twitter dataset belong to the case of multiple aspects with different sentiment polarities. To promote research on ABSA, NLPCC 2020 Shared Task 2 releases a Multi-Aspect Multi-Sentiment (MAMS) dataset, in which each sentence contains at least two aspects with different sentiment polarities. Obviously, the property of multi-aspect multi-sentiment makes this dataset more challenging than existing ABSA datasets.

To deal with ABSA, recent works employ neural networks and achieve promising results on previous datasets, such as attention networks  [6, 16, 25], memory networks  [2, 22], and BERT  [9]. These works separate the multiple aspects of a sentence into several instances and process one aspect at a time. As a result, they only consider local sentiment information for the given aspect while neglecting the sentiments of the other aspects in the same sentence as well as the relations between multiple aspects. This setting is unsuitable, especially for the new MAMS dataset, because multiple aspects of a sentence usually have different sentiment polarities in MAMS, and knowing the sentiment of one aspect can help infer the sentiments of the others. To address this issue, we re-formalize ABSA as a task of multi-aspect sentiment analysis and propose a Transformer-based Multi-aspect Modeling method (TMM) to simultaneously detect the sentiment polarities of all aspects in a sentence. Specifically, we adopt the pre-trained RoBERTa  [15] as the backbone network, build a multi-aspect scheme for MAMS based on the Transformer  [23] architecture, and then employ multi-head attention to learn the sentiments of and relations between multiple aspects. Compared with existing works, our method has three advantages:

  1. It can capture the sentiments of all aspects in a sentence synchronously, as well as the relations between them, thereby avoiding mistakenly focusing on sentiment information that belongs to other aspects.

  2. Modeling multiple aspects simultaneously improves computation efficiency considerably without requiring additional computing resources.

  3. Our method applies the strategy of transfer learning, which exploits large-scale pre-trained semantic and syntactic knowledge to benefit the downstream MAMS task.

Finally, our proposed method obtains clear improvements for both ATSA and ACSA on the MAMS dataset, and ranks second place in the NLPCC 2020 Shared Task 2 evaluation.

2 Proposed Method

In this section, we first re-formalize the ABSA task, then present our proposed Transformer-based Multi-aspect Modeling scheme for ATSA and ACSA. The final part introduces the fine-tuning and training objective.

2.1 Task Formalization

Prior studies separate multiple aspects and formalize ABSA as a problem of sentiment classification toward one given aspect a in the sentence \(s=\{w_1, w_2, \cdots , w_n\}\). In ATSA, the aspect term a is a span of the sentence s representing a feature or entity of products or services. For ACSA, the aspect category \(a\in A\), where A is the pre-defined aspect set, i.e., {food, service, staff, price, ambience, menu, place, miscellaneous} for the new MAMS dataset. The goal of ABSA is to assign a sentiment label \(y\in C\) to the aspect a of the sentence s, where C is the set of sentiment polarities (i.e., positive, neutral, and negative).

In this work, we re-formalize ABSA as a task of multi-aspect sentiment classification. Given a sentence \(s=\{w_1, w_2, \cdots , w_n\}\) and m aspects \(\{a_1, a_2, \cdots , a_m\}\) mentioned in s, the objective of MAMS is to simultaneously detect the sentiment polarities \(\{y_1, y_2, \cdots , y_m\}\) of all aspects \(\{a_1, a_2, \cdots , a_m\}\), where \(y_i\) corresponds to the sentiment label of the aspect \(a_i\).
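To make the re-formalized setting concrete, the following is a minimal sketch of how a multi-aspect instance could be represented in code; the class and field names are illustrative assumptions, not the official MAMS data format.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class MultiAspectInstance:
    """One MAMS instance: a sentence together with all of its aspects."""
    sentence: str        # the raw review sentence s
    aspects: List[str]   # aspect terms (ATSA) or aspect categories (ACSA)
    labels: List[str]    # one sentiment polarity per aspect, aligned by index

example = MultiAspectInstance(
    sentence="The salmon is tasty while the waiter is very rude",
    aspects=["salmon", "waiter"],
    labels=["positive", "negative"],
)
assert len(example.aspects) == len(example.labels)
```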

Fig. 2. Transformer-based Multi-Aspect Modeling for ATSA. In the above example, the aspect \(a_i\) may contain multiple words, and each word of the sentence might be split into several subwords. For simplicity, we do not show subword tokens here.

2.2 Transformer-Based Multi-aspect Modeling for ATSA

Recently, Bidirectional Encoder Representations from Transformers (BERT)  [4] has achieved great success by pre-training a language representation model on large-scale corpora and then fine-tuning it on downstream tasks. When fine-tuning on classification tasks, BERT uses the special token [CLS] to obtain a task-specific representation, then applies one additional output layer for classification. For ABSA, previous work concatenates the given single aspect and the original sentence as the input of the BERT encoder, then leverages the representation of [CLS] for sentiment classification  [9].

Inspired by BERT, we design a novel Transformer-based Multi-Aspect Modeling scheme (TMM) to address the MAMS task by simultaneously detecting the sentiments of all aspects in a sentence. Here we take the ATSA subtask as an example to elaborate on it. Specifically, given a sentence \(\{w_1, \cdots , a_1, \cdots , a_m, \cdots , w_n\}\), where the aspect terms are denoted in the original sentence for ease of description, we introduce two special tokens [AS] and [AE] to respectively mark the start position and end position of each aspect in the sentence. With these two tokens, the original sentence \(\{w_1, \cdots , a_1, \cdots , a_m, \cdots , w_n\}\) is transformed into the sequence \(\{w_1, \cdots , {{\mathtt {[AS]}}}, a_1, {{\mathtt {[AE]}}}, \cdots , {{\mathtt {[AS]}}}, a_m, {{\mathtt {[AE]}}}, \cdots , w_n\}\). Based on this new input sequence, we then employ a multi-layer transformer to automatically learn the sentiments of and relations between multiple aspects.
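The sketch below illustrates this input transformation for ATSA, assuming the aspect terms are given as word-level spans; subword tokenization and the registration of [AS]/[AE] as special tokens in the RoBERTa vocabulary are omitted for brevity.

```python
def mark_aspect_terms(tokens, aspect_spans):
    """Insert [AS]/[AE] markers around each aspect term.

    tokens: list of words in the sentence.
    aspect_spans: list of (start, end) word indices (end exclusive),
                  assumed non-overlapping and sorted.
    """
    starts = {s for s, _ in aspect_spans}
    ends = {e for _, e in aspect_spans}
    marked = []
    for i, tok in enumerate(tokens):
        if i in starts:
            marked.append("[AS]")
        marked.append(tok)
        if i + 1 in ends:
            marked.append("[AE]")
    return marked

tokens = "The salmon is tasty while the waiter is very rude".split()
print(mark_aspect_terms(tokens, [(1, 2), (6, 7)]))  # spans of "salmon", "waiter"
# ['The', '[AS]', 'salmon', '[AE]', 'is', 'tasty', 'while', 'the',
#  '[AS]', 'waiter', '[AE]', 'is', 'very', 'rude']
```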

As shown in Fig. 2, we finally fetch the representation \(\mathbf {H}_\mathrm{{[AS]}}\) of the start token [AS] of each aspect as the feature vector to classify the sentiment of that aspect.

Fig. 3. Transformer-based Multi-Aspect Modeling for ACSA.

2.3 Transformer-Based Multi-aspect Modeling for ACSA

Since aspect categories are pre-defined and may not be mentioned explicitly in the sentence, the above TMM scheme needs some modifications for ACSA. Given the sentence \(s=\{w_1, w_2, \cdots , w_n\}\) and the aspect categories \(\{a_1, a_2, \cdots , a_m\}\) in s, we concatenate the sentence and the aspect categories, using only the token [AS] to separate the aspects because each aspect category is a single word, finally forming the input sequence \(\{w_1, w_2, \cdots , w_n, {\mathtt {[AS]}}, a_1, {\mathtt {[AS]}}, a_2, \cdots , {\mathtt {[AS]}}, a_m\}\). As Fig. 3 shows, after the multi-layer transformer, we use the representation \(\mathbf {H}_\mathrm{{[AS]}}\) of the indication token [AS] of each aspect category for sentiment classification.
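A minimal sketch of this ACSA input construction follows, under the same assumptions as before (word-level tokens, special-token registration omitted); the category pairing in the usage example is illustrative.

```python
def build_acsa_input(sentence_tokens, aspect_categories):
    """Append each aspect category to the sentence, preceded by a single [AS] marker."""
    sequence = list(sentence_tokens)
    for category in aspect_categories:
        sequence.extend(["[AS]", category])
    return sequence

tokens = "The salmon is tasty while the waiter is very rude".split()
print(build_acsa_input(tokens, ["food", "staff"]))
# [..., 'very', 'rude', '[AS]', 'food', '[AS]', 'staff']
```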

2.4 Fine-Tuning and Training Objective

As mentioned above, we adopt the pre-trained RoBERTa as the backbone network, then fine-tune it on the MAMS dataset with the proposed TMM scheme. RoBERTa is a robustly optimized BERT approach, pre-trained with larger corpora and larger batch sizes.

In the fine-tuning stage, we employ a softmax classifier to map the representation \(\mathbf {H}^i_\mathrm{{[AS]}}\) of aspect \(a_i\) into the sentiment distribution \(\hat{\mathbf {y}}_i\) as follows:

$$\begin{aligned} \hat{\mathbf {y}}_i=\mathrm {softmax}(\mathbf {W}_o\mathbf {H}^i_\mathrm{{[AS]}}+\mathbf {b}_o), \end{aligned}$$
(1)

where \(\mathbf {W}_o\) and \(\mathbf {b}_o\) respectively denote weight matrix and bias.

Finally, we use the cross-entropy between the predicted sentiment distribution and the gold sentiment label as the training loss, which is defined as follows:

$$\begin{aligned} Loss=- \sum _{s\in D}\sum _{i=1}^{m}\sum _{j\in C}\mathbb {I}(y_i=j) \log \hat{y}_{i,j}, \end{aligned}$$
(2)

where s and D respectively denote a sentence and training dataset, m represents the number of aspects in the sentence s, C is the sentiment label set, \(y_i\) denotes the ground truth sentiment of aspect \(a_i\) in s, and \(\hat{y}_{i,j}\) is the predicted probability of the j-th sentiment towards the aspect \(a_i\) in the input sentence.
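To make Eqs. (1) and (2) concrete, below is a minimal PyTorch-style sketch of the TMM classification head and training loss, assuming the positions of the [AS] tokens are recorded when building each input; loading RoBERTa via the Hugging Face transformers library and padding unused aspect slots with label -100 are implementation assumptions, not details prescribed here.

```python
import torch
import torch.nn as nn
from transformers import RobertaModel

class RobertaTMM(nn.Module):
    """Sketch: classify each aspect from the hidden state of its [AS] token."""

    def __init__(self, model_name="roberta-large", num_labels=3):
        super().__init__()
        self.encoder = RobertaModel.from_pretrained(model_name)
        hidden = self.encoder.config.hidden_size
        self.classifier = nn.Linear(hidden, num_labels)  # W_o and b_o in Eq. (1)

    def forward(self, input_ids, attention_mask, as_positions):
        # as_positions: (batch, max_aspects) indices of the [AS] tokens per sentence
        h = self.encoder(input_ids, attention_mask=attention_mask).last_hidden_state
        idx = as_positions.unsqueeze(-1).expand(-1, -1, h.size(-1))
        h_as = torch.gather(h, 1, idx)      # (batch, max_aspects, hidden)
        return self.classifier(h_as)        # one logit vector per aspect

# Eq. (2): cross-entropy summed over all aspects of all sentences;
# padded aspect slots carry the label -100 and are ignored.
criterion = nn.CrossEntropyLoss(ignore_index=-100, reduction="sum")
# logits: (batch, max_aspects, num_labels), labels: (batch, max_aspects)
# loss = criterion(logits.view(-1, logits.size(-1)), labels.view(-1))
```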

3 Experiment

3.1 Dataset and Metrics

Similar to the SemEval 2014 Restaurant Review dataset  [20], the original sentences in NLPCC 2020 Shared Task 2 come from the Citysearch New York dataset  [7]. Each sentence is annotated by three experienced researchers working on natural language processing. In the released MAMS dataset, the annotations for ATSA and ACSA are separated. For ACSA, eight coarse-grained aspect categories are pre-defined, i.e., food, service, staff, price, ambience, menu, place, and miscellaneous. Sentences containing only one aspect or multiple aspects with the same sentiment polarity are removed, so each sentence contains at least two aspects with different sentiments. This property makes the MAMS dataset more challenging. The statistics of the MAMS dataset are shown in Table 1.

Table 1. Statistics of the MAMS dataset. Sen. and Asp. respectively denote the numbers of sentences and given aspects in the dataset. Ave. represents the average number of aspects per sentence. Pos., Neu., and Neg. respectively indicate the numbers of positive, neutral, and negative sentiment labels.

NLPCC 2020 Shared Task 2 uses Macro-F1 to evaluate the performance of different systems, which is calculated as follows:

$$\begin{aligned} \mathrm {Precision}\;(P)&= TP/(TP+FP),\end{aligned}$$
(3)
$$\begin{aligned} \mathrm {Recall}\;(R)&= TP/(TP+FN),\end{aligned}$$
(4)
$$\begin{aligned} F1&= 2*P*R/(P+R), \end{aligned}$$
(5)

where TP, FP, TN, and FN respectively represent true positives, false positives, true negatives, and false negatives. The Macro-F1 value is the average of the F1 values over all categories. The final evaluation result is the average of the Macro-F1 values on the two subtasks (i.e., ATSA and ACSA). In this work, we also use standard accuracy as a metric to evaluate different methods.
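For reference, the per-class F1 and Macro-F1 described above can be computed with scikit-learn as in the short sketch below; the integer label encoding is an assumption for illustration.

```python
from sklearn.metrics import accuracy_score, f1_score

# Flat per-aspect predictions; 0/1/2 stand for positive/neutral/negative here.
y_true = [0, 1, 2, 0, 2]
y_pred = [0, 1, 2, 2, 2]

macro_f1 = f1_score(y_true, y_pred, average="macro")  # F1 per class, then averaged
accuracy = accuracy_score(y_true, y_pred)
print(f"Macro-F1: {macro_f1:.4f}, Accuracy: {accuracy:.4f}")
```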

3.2 Experiment Settings

We use the pre-trained RoBERTa as the backbone network, then fine-tune it on the downstream ATSA or ACSA subtask with our proposed Transformer-based Multi-aspect Modeling scheme. The RoBERTa model has 24 layers of transformer blocks, each with 16 self-attention heads, and a hidden size of 1024. When fine-tuning on ATSA or ACSA, we apply the Adam optimizer  [10] to update model parameters. The initial learning rate is set to 1e-5, and the mini-batch size is 32. We use the official validation set for hyperparameter tuning. Finally, we run each model 3 times and report the average results on the test set.
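A hypothetical setup mirroring these hyperparameters is sketched below; it reuses the RobertaTMM sketch from Sect. 2.4, and the use of the Hugging Face tokenizer with [AS]/[AE] registered as additional special tokens is an implementation assumption.

```python
import torch
from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("roberta-large")
tokenizer.add_special_tokens({"additional_special_tokens": ["[AS]", "[AE]"]})

model = RobertaTMM(model_name="roberta-large", num_labels=3)  # sketch from Sect. 2.4
model.encoder.resize_token_embeddings(len(tokenizer))         # account for [AS]/[AE]

optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)     # reported settings
batch_size = 32
```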

3.3 Compared Methods

To evaluate the performance of different methods, we compare our RoBERTa-TMM method with the following baselines on ATSA and ACSA.

  • LSTM: We use a vanilla LSTM to encode the sentence and apply the average of all hidden states for sentiment classification.

  • TD-LSTM: TD-LSTM  [21] employs two LSTM networks respectively to encode the left context and right context of the aspect term, then concatenates them for sentiment classification.

  • AT-LSTM: AT-LSTM  [25] uses the aspect representation as query, and employs the attention mechanism to capture aspect-specific sentiment information. For ATSA, the aspect term representation is the average of word vectors in the aspect term. For ACSA, the aspect category representation is randomly initialized and learned in the training stage.

  • ATAE-LSTM: ATAE-LSTM  [25] is an extension of AT-LSTM. It concatenates the aspect representation and word embedding as the input of LSTM.

  • BiLSTM-Att: BiLSTM-Att is our implementation of a model similar to AT-LSTM, which uses a bidirectional LSTM to encode the sentence and applies aspect attention to capture the aspect-dependent sentiment.

  • IAN: IAN  [16] applies two LSTM networks to respectively encode the sentence and the aspect term, then proposes interactive attention to learn representations of the sentence and aspect term interactively. Finally, the two representations are concatenated for sentiment prediction.

  • RAM: RAM  [2] employs BiLSTM to build memory and then applies GRU-based multi-hops attention to generate the aspect-dependent sentence representation for predicting the sentiment of the given aspect.

  • MGAN: MGAN  [6] proposes fine-grained attention mechanism to capture the word-level interaction between aspect term and context, then combines it with coarse-grained attention for ATSA.

In addition, we also compare with strong transformer-based models, including \(\text {BERT}_\text {BASE}\) and RoBERTa. They adopt the conventional ABSA scheme and deal with one aspect at a time.

  • \(\mathbf{BERT} _\mathbf{BASE} \): \(\text {BERT}_\text {BASE}\)  [4] has 12 layers of transformer blocks, and each block has 12 self-attention heads. When fine-tuning for ABSA, it concatenates the aspect and the sentence to form a segment pair, then uses the representation of the [CLS] token after the multi-layer transformers for sentiment classification.

  • RoBERTa: RoBERTa  [15] is a robustly optimized BERT approach. It replaces the static masking in BERT with dynamic masking, removes the next sentence prediction, and pre-trains with larger batches and corpora.

Table 2. Main experiment results on ATSA and ACSA (%). The results marked with \(^*\) are from the official evaluation, which does not provide accuracy.

3.4 Main Results and Analysis

Table 2 gives the results of different methods on two subtasks of ABSA.

The first part shows the performance of the non-transformer-based baselines. We observe that the vanilla LSTM performs very poorly on this new MAMS dataset, because it does not consider any aspect information and is a sentence-level sentiment classification model. In fact, LSTM can obtain fairly good results on previous ABSA datasets, which reveals the challenge of the MAMS dataset. Compared with the other attention-based models, RAM and MGAN achieve better performance on ATSA, which validates the effectiveness of multi-hop attention and multi-grained attention for detecting aspect sentiment. It is surprising that TD-LSTM obtains competitive results among the non-transformer-based baselines. This result indicates that modeling the position information of the aspect term may be crucial for the MAMS dataset.

The second part gives two strong baselines, i.e., \(\text {BERT}_\text {BASE}\) and RoBERTa. They follow the conventional ABSA scheme and deal with one aspect at a time. We observe that they outperform the non-transformer-based models significantly, which shows the power of pre-trained language models. Benefiting from larger pre-training corpora, larger batch sizes, and more parameters, RoBERTa obtains better performance than \(\text {BERT}_\text {BASE}\) on both ATSA and ACSA.

Compared with the strongest baseline RoBERTa, our proposed Transformer-based Multi-aspect Modeling method RoBERTa-TMM still achieves clear improvements on the challenging MAMS dataset. Specifically, it outperforms RoBERTa by 1.93% and 1.91% respectively in accuracy and F1-score for ATSA. In terms of ACSA, the improvement of RoBERTa-TMM over RoBERTa is relatively limited. This may be because the pre-defined aspect categories are abstract, and it is challenging to find their corresponding sentiment spans in the sentence even under the multi-aspect scheme. Nevertheless, the improvement on ACSA is still meaningful, since the MAMS dataset is sufficiently large for ABSA research. Finally, our RoBERTa-TMM-based ensemble system achieves F1-scores of 85.24% and 79.41% respectively for ATSA and ACSA, and ranks second in the NLPCC 2020 Shared Task 2 evaluation.

3.5 Case Study

Fig. 4. Attention visualization of RoBERTa-TMM and RoBERTa on ATSA. The words in red are the two given aspect terms. Darker blue denotes a larger attention weight. (Color figure online)

To further validate the effectiveness of the proposed TMM scheme, we take a sentence from ATSA as an example, average the attention weights over the different heads of the RoBERTa-TMM and RoBERTa models, and visualize them in Fig. 4.

From the attention visualization, we can see that in the RoBERTa-TMM model the two aspect terms attend to their corresponding sentiment spans correctly through multi-aspect modeling. In contrast, given the aspect term “Food”, RoBERTa mistakenly focuses on the sentiment span of the other aspect term “fish” because it lacks information about the other aspects, and thus makes a wrong sentiment prediction. The attention visualization indicates that RoBERTa-TMM can detect the corresponding sentiment spans of different aspects and avoid incorrect attention as much as possible by modeling multiple aspects simultaneously and considering the potential relations between them.

4 Related Work

4.1 Aspect-Based Sentiment Analysis

Aspect-based sentiment analysis (ABSA) has been studied over the last decade. Early works devoted themselves to designing effective hand-crafted features, such as n-gram features  [8, 11] and sentiment lexicons  [24]. Motivated by the success of deep learning in many tasks  [1, 3, 12], recent works adopt neural network-based methods to automatically learn low-dimensional continuous features for ABSA.  [21] separates the sentence into the left context and right context according to the aspect term, then employs two LSTM networks to respectively encode them from the two ends of the sentence towards the aspect term. To capture aspect-specific context,  [25] proposes an aspect attention mechanism to aggregate important sentiment information from the sentence toward the given aspect. Following this idea,  [16] introduces interactive attention networks (IAN) to learn attentions over the context and aspect term interactively, and generates representations for the aspect and context words separately. Besides, some works employ memory networks to detect more powerful sentiment information with multi-hop attention and achieve promising results  [2, 22]. Instead of recurrent networks,  [26] uses aspect information as a gating mechanism on top of a convolutional neural network, dynamically selecting aspect-specific information for aspect sentiment detection. Subsequently, a BERT-based method achieves state-of-the-art performance on the ABSA task  [9].

However, the above methods perform ABSA with the conventional scheme, which separates the multiple aspects in a sentence and analyzes one aspect at a time. They only consider local sentiment information for the given aspect and may mistakenly focus on sentiment information belonging to other aspects. In contrast, our proposed Transformer-based Multi-aspect Modeling scheme (TMM) aims to learn the sentiment information of and relations between multiple aspects for better prediction.

4.2 Pre-trained Language Model

Recently, substantial work has shown that pre-trained language models can learn universal language representations that benefit downstream NLP tasks and avoid training a new model from scratch  [4, 13, 15, 27]. These pre-trained models, e.g., GPT, BERT, XLNet, and RoBERTa, use the strategy of first pre-training and then fine-tuning, and achieve great success on many NLP tasks. Specifically, they first pre-train with self-supervised objectives, such as masked language modeling (MLM), next sentence prediction (NSP), or sentence order prediction (SOP)  [13], on large corpora to learn complex semantic and syntactic patterns from raw text. When fine-tuning on downstream tasks, they generally employ one additional output layer to learn task-specific knowledge.

Following this successful learning paradigm, we employ RoBERTa as the backbone network in this work, then fine-tune it with the TMM scheme on the MAMS dataset to perform ATSA and ACSA.

5 Conclusion

Facing the challenging MAMS dataset, we re-formalize ABSA as a task of multi-aspect sentiment analysis and propose a novel Transformer-based Multi-aspect Modeling scheme (TMM) for MAMS, which determines the sentiments of all aspects in a sentence simultaneously. Specifically, TMM transforms the original sentence into a new multi-aspect sequence, then applies multi-layer transformers to automatically learn the sentiment clues and potential relations of the multiple aspects in the sentence. Compared with previous works that analyze one aspect at a time, our TMM scheme not only improves computation efficiency but also achieves substantial improvements on the MAMS dataset. Finally, our method achieves second place in the NLPCC 2020 Shared Task 2 evaluation. Experimental results and analysis also validate the effectiveness of the proposed method.