1 Introduction

Aspect-based sentiment analysis (ABSA) has attracted increasing attention recently due to its broad applications. It aims to identify the sentiment polarity towards a specific aspect in a sentence. A target aspect is a word or phrase describing an aspect of an entity. For example, in the sentence “The salmon is tasty while the waiter is very rude”, there are two aspect terms, “salmon” and “waiter”, which are associated with “positive” and “negative” sentiment, respectively.

Recently, neural network methods have dominated the study of ABSA, since they can learn important features automatically from the input sequences and be trained in an end-to-end manner. [1] proposed to model the preceding and following contexts of the target via two separate long short-term memory (LSTM) networks. [2] proposed to learn an embedding vector for each aspect and used these aspect embeddings to calculate attention weights that capture important information for aspect-level sentiment analysis. [3] developed a deep memory network that computes the importance of each context word and the text representation with multiple attention layers. [4] introduced interactive attention networks (IAN) to interactively learn attention vectors for the context and the target, generating the representations for the target and the context words separately. [5] extracted sentiment features with convolutional neural networks and selectively output aspect-related features for sentiment classification with gating mechanisms. Subsequently, Transformer [6] and BERT-based methods [7] have achieved noticeable success on the ABSA task. [8] combined the capsule network with BERT to improve the performance of ABSA. Several studies have also attempted to simulate the process of human reading cognition to further improve the performance of ABSA [9, 10].

So far, several ABSA datasets have been constructed, including the SemEval-2014 Restaurant and Laptop review datasets [11] and the Twitter dataset [12]. Although these three datasets have become the benchmarks for the ABSA task, most of their sentences contain only one aspect or multiple aspects with the same sentiment polarity, which makes the ABSA task degenerate to sentence-level sentiment analysis. In our empirical observation, sentence-level sentiment classifiers (TextCNN and LSTM) that ignore aspects can still achieve results competitive with more advanced ABSA methods (e.g., GCAE [5]). On the other hand, even advanced ABSA methods (e.g., AEN [13]) trained on these datasets can hardly distinguish the sentiment polarities towards different aspects in sentences that contain multiple aspects and multiple sentiments.

For NLPCC 2020, we manually annotated a large-scale restaurant review corpus for Multi-Aspect-based Multi-Sentiment Analysis (MAMS), in which each sentence contains at least two different aspects with different sentiment polarities, making the MAMS dataset more challenging than existing ABSA datasets [8]. A model that considers only the sentence-level sentiment of a sample cannot perform well on the MAMS dataset.

The NLPCC 2020 shared task on MAMS attracted 50 registered teams, 17 of which submitted final results. We provided training and development sets for participating teams to build their models in the first stage, and evaluated the final results on the test set in the second stage. The final ranking is based on the average of the Macro-F1 scores on the two sub-tasks, i.e., aspect term sentiment analysis (ATSA) and aspect category sentiment analysis (ACSA).

2 Task Description

Conventional sentiment classification aims to identify the sentiment polarity of a whole document or sentence. In practice, however, a single sentence or document may contain multiple target aspects. For example, the sentence “the salmon is tasty while the waiter is very rude” expresses negative sentiment towards the “service” aspect but positive sentiment towards the “food” aspect. Considering only the document- or sentence-level sentiment cannot capture this fine-grained, aspect-specific sentiment.

Aspect-based sentiment analysis [11], which aims to automatically predict the sentiment polarity of a specific aspect in its context, has gained increasing popularity in recent years due to its many useful applications, such as online customer review analysis and conversation monitoring. Similar to SemEval-2014 Task 4, the NLPCC-2020 MAMS task includes two subtasks: (1) aspect term sentiment analysis (ATSA) and (2) aspect category sentiment analysis (ACSA). Next, we describe the two subtasks in detail.

2.1 Aspect Term Sentiment Analysis (ATSA)

The ATSA task aims to identify the sentiment polarity (i.e., positive, negative or neutral) towards the given aspect terms, which are entities present in the sentence. For example, as shown in Fig. 1, the sentence “the salmon is tasty while the waiter is very rude” contains two aspect terms, “salmon” and “waiter”, and the sentiment polarities towards them are positive and negative, respectively. Different from the ATSA task in SemEval-2014 Task 4, each sentence in MAMS contains at least two different aspect terms with different sentiment polarities, making our ATSA task more challenging.

2.2 Aspect Category Sentiment Analysis (ACSA)

The ACSA task aims to identify the sentiment polarity (i.e., positive, negative or neutral) towards the given aspect categories, which are pre-defined and may not be explicitly present in the sentence. We pre-defined eight aspect categories: food, service, staff, price, ambience, menu, place, and miscellaneous. For example, the sentence “the salmon is tasty while the waiter is very rude” contains two aspect categories, “food” and “service”, and the sentiment polarities towards them are positive and negative, respectively. For our NLPCC-2020 ACSA task, each sentence contains at least two different aspect categories with different sentiment polarities.

Fig. 1. An example for the ATSA and ACSA tasks.
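To make the two subtasks concrete, the running example can be written down as data. The following is only an illustrative sketch: the class names, field names, and the use of character offsets for term positions are assumptions made for this example, not part of the released dataset.

```python
from dataclasses import dataclass


@dataclass
class AspectTerm:
    """ATSA instance: a term that appears literally in the sentence."""
    term: str
    start: int      # assumed: character offset, inclusive
    end: int        # assumed: character offset, exclusive
    polarity: str   # "positive", "negative" or "neutral"


@dataclass
class AspectCategory:
    """ACSA instance: a pre-defined category that need not appear in the text."""
    category: str
    polarity: str


def locate(sentence: str, term: str) -> tuple:
    """Return (start, end) offsets of the first occurrence of term."""
    start = sentence.index(term)
    return start, start + len(term)


sentence = "the salmon is tasty while the waiter is very rude"

# ATSA: one polarity per aspect term found in the sentence.
atsa = [
    AspectTerm("salmon", *locate(sentence, "salmon"), "positive"),
    AspectTerm("waiter", *locate(sentence, "waiter"), "negative"),
]

# ACSA: one polarity per aspect category; "food" and "service"
# are not words of the sentence.
acsa = [
    AspectCategory("food", "positive"),
    AspectCategory("service", "negative"),
]

for t in atsa:
    print(t.term, sentence[t.start:t.end], t.polarity)
```

The same sentence thus yields two classification instances per subtask, each with its own label, which is what distinguishes both subtasks from sentence-level classification.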

3 Dataset Construction

Similar to the SemEval-2014 Restaurant Review dataset [11], we annotate sentences from the Citysearch New York dataset collected by [14]. We split each document in the corpus into sentences and remove the sentences consisting of more than 70 words. The original MAMS dataset was presented in [8]. For the NLPCC-2020 shared task, we relabeled the MAMS dataset to provide more high-quality validation and test data.

For the ATSA subtask, we invited three experienced researchers who work on natural language processing (NLP) to extract aspect terms in the sentences and assign the sentiment polarities with respect to the aspect terms. The sentences that consist of only one aspect term or multiple aspects with the same sentiment polarities are deleted. We also provide the start and end positions for each aspect term in the sentence.

For the ACSA subtask, we pre-defined eight coarse aspect categories: food, service, staff, price, ambience, menu, place, and miscellaneous. Five of these categories (i.e., food, service, price, ambience, anecdotes/miscellaneous) are adopted from the SemEval-2014 Restaurant Review dataset. We added three more aspect categories (i.e., staff, menu, place) to deal with some confusing situations. Three experienced NLP researchers were asked to identify the aspect categories described in the given sentences and to determine the sentiment polarities towards these aspect categories. We only keep the sentences that contain at least two unique aspect categories with different sentiment polarities.
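The retention rule used for both subtasks (at least two distinct aspects and at least two distinct polarities) can be sketched as a small predicate. The function name and the representation of annotations as `(aspect, polarity)` pairs are assumptions made for illustration; the actual filtering was performed as part of the manual annotation process.

```python
from typing import Iterable, Tuple


def keep_sentence(annotations: Iterable[Tuple[str, str]]) -> bool:
    """Return True if a sentence satisfies the multi-aspect, multi-sentiment rule.

    annotations: (aspect, polarity) pairs for one sentence, where the aspect
    is an aspect term (ATSA) or an aspect category (ACSA).
    """
    pairs = list(annotations)
    aspects = {aspect for aspect, _ in pairs}
    polarities = {polarity for _, polarity in pairs}
    # Need at least two unique aspects and at least two different polarities.
    return len(aspects) >= 2 and len(polarities) >= 2


# Kept: two aspects with different polarities.
print(keep_sentence([("food", "positive"), ("service", "negative")]))
# Dropped: two aspects but the same polarity.
print(keep_sentence([("food", "positive"), ("price", "positive")]))
# Dropped: a single aspect.
print(keep_sentence([("food", "negative")]))
```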

The detailed statistics of the datasets for the ATSA and ACSA subtasks are reported in Table 1. The released datasets are stored in XML format, as shown in Fig. 2. Each sample contains the given sentence, the aspect terms with their sentiment polarities, and the aspect categories with their sentiment polarities. In total, the ATSA dataset consists of 11,186 training samples, 2,668 development samples, and 2,676 test samples. The ACSA dataset consists of 7,090 training samples, 1,789 development samples, and 1,522 test samples.

Table 1. Statistics of the MAMS dataset.

Fig. 2. Dataset format of the MAMS task.
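As a rough illustration of how such an XML file might be read, the sketch below parses a small inline sample with Python's standard library. The element and attribute names (`sentence`, `text`, `aspectTerm`, `aspectCategory`, `term`, `category`, `polarity`, `from`, `to`) are assumptions modeled on the SemEval-2014 style; the released files and Fig. 2 are authoritative and may differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample in an assumed SemEval-2014-like layout.
SAMPLE = """<sentences>
  <sentence>
    <text>the salmon is tasty while the waiter is very rude</text>
    <aspectTerms>
      <aspectTerm term="salmon" polarity="positive" from="4" to="10"/>
      <aspectTerm term="waiter" polarity="negative" from="30" to="36"/>
    </aspectTerms>
    <aspectCategories>
      <aspectCategory category="food" polarity="positive"/>
      <aspectCategory category="service" polarity="negative"/>
    </aspectCategories>
  </sentence>
</sentences>"""


def parse_mams(xml_text: str) -> list:
    """Parse sentences with their aspect terms and aspect categories."""
    root = ET.fromstring(xml_text)
    samples = []
    for node in root.iter("sentence"):
        terms = [
            (t.get("term"), int(t.get("from")), int(t.get("to")), t.get("polarity"))
            for t in node.iter("aspectTerm")
        ]
        categories = [
            (c.get("category"), c.get("polarity"))
            for c in node.iter("aspectCategory")
        ]
        samples.append(
            {"text": node.findtext("text"), "terms": terms, "categories": categories}
        )
    return samples


samples = parse_mams(SAMPLE)
for term, start, end, polarity in samples[0]["terms"]:
    # With character offsets, the slice should reproduce the term.
    print(term, samples[0]["text"][start:end], polarity)
```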

4 Evaluation Metrics

Both the ATSA and ACSA tasks are evaluated using the Macro-F1 score, which is calculated as follows:

$$\begin{aligned} \text{Precision}\ (P) = \frac{TP}{TP+FP} \end{aligned} \tag{1}$$

$$\begin{aligned} \text{Recall}\ (R) = \frac{TP}{TP+FN} \end{aligned} \tag{2}$$

$$\begin{aligned} F1 = 2 \times \frac{P \times R}{P+R} \end{aligned} \tag{3}$$

where TP, FP, and FN represent the numbers of true positives, false positives, and false negatives, respectively, counted separately for each category. We average the F1 values of all categories to obtain the Macro-F1 score. The final result for the MAMS task is the average of the Macro-F1 scores on the two sub-tasks (i.e., ATSA and ACSA).
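The metric can be sketched in a few lines of Python. This is a minimal illustration rather than the official scoring script: the function names are ours, the categories are assumed to be the three sentiment polarities, and returning 0 when a denominator is zero is an assumed convention.

```python
from typing import Sequence

POLARITIES = ("positive", "negative", "neutral")  # assumed label set


def macro_f1(gold: Sequence[str], pred: Sequence[str],
             labels: Sequence[str] = POLARITIES) -> float:
    """Average the per-label F1 of Eqs. (1)-(3) over all labels."""
    f1_scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        # Assumed convention: a zero denominator yields a score of 0.
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        f1_scores.append(f1)
    return sum(f1_scores) / len(f1_scores)


def final_score(atsa_macro_f1: float, acsa_macro_f1: float) -> float:
    """Ranking score: mean of the two sub-task Macro-F1 scores."""
    return (atsa_macro_f1 + acsa_macro_f1) / 2


gold = ["positive", "negative", "neutral", "positive"]
pred = ["positive", "negative", "positive", "positive"]
# Per-label F1: positive 0.8, negative 1.0, neutral 0.0, so Macro-F1 is 0.6.
print(round(macro_f1(gold, pred), 4))
```

Because every label contributes equally, the missed neutral sample lowers the score far more than it would lower accuracy (0.75 in this toy case).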

5 Evaluation Results

In total, 50 teams registered for the NLPCC-2020 MAMS task, and 17 teams submitted their final results for evaluation. Table 2 shows the Macro-F1 scores and ranks of these 17 teams. The Macro-F1 results confirmed our expectations. Notably, we checked the technical reports of the top three teams and reproduced their results using their code. Next, we briefly introduce the implementation strategies of the top three teams.

The best average Macro-F1 score (82.4230\(\%\)) was achieved by the Baiding team. They tackled the MAMS task as a sentence pair classification problem and employed pre-trained language models as feature extractors. In addition, a bidirectional gated recurrent unit (Bi-GRU) was connected to the last hidden layer of the pre-trained language models to further enhance the representations of aspects and contexts. More importantly, a weighted voting strategy was applied to produce an ensemble that combines the results of several models with different network architectures, pre-trained language models, and training steps.
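Weighted voting over several classifiers can be illustrated generically as follows. This is not the Baiding team's implementation: the function name, the use of per-model class probabilities (soft voting), and the example weights are all assumptions made for the sketch.

```python
from typing import Dict, Sequence


def weighted_vote(model_probs: Sequence[Dict[str, float]],
                  weights: Sequence[float]) -> str:
    """Combine per-model class probabilities with per-model weights.

    model_probs: one {label: probability} dict per model, for one instance.
    weights: one non-negative weight per model (hypothetical values, e.g.
             derived from development-set performance).
    """
    totals: Dict[str, float] = {}
    for probs, weight in zip(model_probs, weights):
        for label, prob in probs.items():
            totals[label] = totals.get(label, 0.0) + weight * prob
    # The label with the highest weighted sum wins.
    return max(totals, key=totals.get)


# Two hypothetical models that disagree on one (sentence, aspect) pair.
probs = [
    {"positive": 0.6, "negative": 0.3, "neutral": 0.1},
    {"positive": 0.2, "negative": 0.7, "neutral": 0.1},
]
print(weighted_vote(probs, weights=[1.0, 2.0]))  # second model dominates
print(weighted_vote(probs, weights=[2.0, 1.0]))  # first model dominates
```

The weights let a stronger model override a weaker one when they disagree, while still allowing several weaker models to outvote it when they agree.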

The “Just a test” team won second place in the MAMS shared task. They achieved a Macro-F1 score of 85.2435\(\%\) on the ATSA task and 79.4187\(\%\) on the ACSA task. Their averaged Macro-F1 score was 82.3311\(\%\), slightly lower than that of the Baiding team. RoBERTa-large was used as the pre-trained language model. The team added word-level sentiment polarity prediction as an auxiliary task and predicted the sentiment polarities of all aspects in a sentence simultaneously to enhance model performance. In addition, data augmentation via EDA (easy data augmentation) [15], which doubled the training data, was adopted to further improve the performance.

The CUSAPA team won third place, achieving a Macro-F1 score of 84.1585\(\%\) on the ATSA task and 79.7468\(\%\) on the ACSA task. Their averaged Macro-F1 score was 81.9526\(\%\). The CUSAPA team employed a joint learning framework to train the two sub-tasks in a unified manner, which improved the performance of both tasks simultaneously. Furthermore, three BERT-based models were adopted to capture different aspects of the semantic information of the context. The best performance was achieved by combining these models with a stacking strategy.

Table 2. Macro-F1 scores (\(\%\)) on the MAMS dataset.

6 Conclusion

In this paper, we presented an overview of the NLPCC-2020 shared task on Multi-Aspect-based Multi-Sentiment Analysis (MAMS). We manually annotated a large-scale restaurant review corpus for MAMS, in which each sentence contains at least two different aspects with different sentiment polarities, making the MAMS dataset more challenging than existing ABSA datasets. The MAMS task attracted 50 registered teams, 17 of which submitted final results for evaluation. The 17 teams proposed different approaches, which achieved promising results. In the future, we would like to create a new MAMS dataset with samples from different domains and add a new cross-domain aspect-based sentiment analysis task.