1 Introduction

As shown in Fig. 1, Aspect-Based Sentiment Analysis (ABSA) aims to identify the sentiment polarity of one or more specific aspects [1]. The input is a review and the output is the aspects in that review together with their corresponding sentiment labels (Pos, Neu, Neg). According to Bu et al. [1], as training data volume increases, the performance of document-level sentiment analysis shows a downward trend, while that of ABSA shows an upward trend.

ABSA is a highly data-dependent task in Natural Language Processing (NLP). It has spawned many sub-tasks, making it one of the most challenging tasks in the field and attracting the attention of a large number of researchers. Due to the complexity of ABSA, models require a large amount of labeled training data, and many state-of-the-art (SOTA) ABSA methods currently rely on supervised learning because of its effectiveness [2]. In this paper, we design reasonable augmentation strategies for ABSA with low computational resource requirements. Figure 2 shows an example: an unlabeled review contains one or more aspects with corresponding sentiment and can therefore serve as training data for an ABSA model. Our prompt-based model locates the aspects, recognizes the corresponding polarities, and extracts them as data labels.

Fig. 1
figure 1

The main purpose of ABSA

Data Augmentation (DA) was initially applied in computer vision [3, 4] and has since been widely applied in NLP [5, 6]. Common methods include random insertion, deletion, and swapping. Synonym Replacement (SR) is a simple and intuitive data augmentation method that replaces some words in the source sentence with synonyms from WordNet or with similar words in the word-vector space [7]. Another method is back-translation, which translates the original sentence between the source language and one or more target languages several times before translating it back to the source language [8]. Text generation-based DA is also popular: the Generative Adversarial Network (GAN) [9] and the Variational AutoEncoder (VAE) [10] are two neural generation models whose output, generated from the input text, can be used for data augmentation in sentiment analysis [11]. Recently, the widespread use of pre-trained language models has also brought new methods to DA [12]; Kobayashi [13] proposed randomly replacing words with words predicted by a pre-trained language model. Existing methods have effectively improved the performance of ABSA models. Wang et al. [14] proposed a contrastive cross-channel data augmentation framework to generate more in-domain and multi-dimensional samples, and trained a more robust ABSA model on the generated data. However, none of these methods is perfect, because semantic distortion, syntax errors, and other data noise may be introduced during augmentation.

Fig. 2
figure 2

Our DA system takes unlabeled data as input and extracts the aspects and their corresponding labels from the unlabeled review as output

Small changes in training data can mislead a model into making incorrect predictions [15]. Compared with reworking existing data, we believe that automated labeling of genuinely unlabeled data is a more worthwhile direction to explore. Li et al. [16] summarized common DA methods in NLP and likewise argued that external datasets are highly valuable. Compared with spending extensive resources on existing datasets, automatic annotation of external datasets is more meaningful, but data annotation is an arduous task.

Data is the foundation of Artificial Intelligence (AI). However, most existing ABSA datasets are manually annotated, which is inefficient and cannot guarantee the accuracy of the annotations. The uncertainty of manual annotation affects the distribution of data in these datasets, and the proportion of each data type directly affects the performance of ABSA models [17,18,19]. As neural networks have developed, their structures have become more and more complex [20, 21], and the demand for data has risen dramatically. The problem of insufficient data is alleviated by models that are pre-trained on unlabeled data and then fine-tuned on downstream tasks. However, in the fine-tuning phase, insufficient training data can lead to over-fitting, and model performance suffers from the lack of sufficient labeled data. Prompt-based DA has been widely applied in other NLP domains [22,23,24,25], but it remains rare in ABSA. We therefore need an automatic data labeling method to provide more high-quality labeled data.

According to Huang et al. [26], Amazon and Yelp (document-level sentiment datasets) contain a large amount of data that can be used as ABSA training data. However, in these document-level datasets the data is only divided into ratings from 1 to 5, which requires further processing.

In this paper, we propose a prompt-based DA method. The contributions of our paper are summarized as follows:

  1. To our knowledge, data augmentation for ABSA via prompt learning is rare.

  2. We propose a prompt-based data augmentation method for ABSA and its sub-tasks, which provides more real and diverse data with the help of external knowledge. Unlike previous generative data augmentation methods, our model requires only low computational resources.

  3. We provide qualitative analysis and discussion of why our augmentation method works and test its implementation.

The structure of this paper is as follows. Section 2 introduces the background and related work, including a discussion of existing ABSA and DA models and the development of prompt learning. Section 3 presents our method and the implementation details of automated data annotation. Section 4 reports the experimental settings, results, discussion, case study, and ablation study. Finally, Sect. 5 gives the conclusion and future work.

2 Related work

2.1 Aspect-based sentiment analysis

ABSA is a classic task of great significance for practical applications. Zhao et al. [27] regarded the problem as joint extraction of terms and relationships and designed a span-based multi-task learning (SpanMlt) framework to jointly extract aspect/opinion terms and their pairing relationships. Chen et al. [28] proposed a model containing two channels that extract aspect/opinion terms and relationships respectively; two synchronization mechanisms are further designed to realize information interaction between the two channels. End-to-End ABSA (E2E-ABSA) extracts aspect terms and their corresponding sentiment polarities at the same time and can be divided into two subtasks. Some common ideas appear frequently in different models: the Relation-Aware Collaborative Learning (RACL) framework proposed by Chen and Qian [29] explicitly models the interaction of multiple tasks and uses a relation propagation mechanism to coordinate them, and Liang et al. [30] further designed a routing algorithm to improve knowledge transfer between these tasks. One of the subtasks is Aspect Category Sentiment Analysis (ACSA). The aspects handled by E2E-ABSA must appear explicitly in the sentence, whereas ACSA can handle aspects whether implicit or explicit; because of this, ACSA is more widely used in industry. Cai et al. [31] proposed a hierarchical classification method, Hier-GCN, to solve the ACSA problem: it first recognizes aspect categories and then jointly predicts the sentiment of each recognized category. Similarly, Li et al. [32] used a shared sentiment prediction layer to share sentiment knowledge between different categories and alleviate the problem of insufficient data. Liu et al. [33] adopted the Seq2Seq modeling paradigm for ACSA: based on a pre-trained generation model, they use natural language sentences to represent the required output, and its performance is better than previous models.

In the real world, review texts are mostly informal expressions with complex grammatical structures. Li et al. [34] proposed DualGCN, which combines syntactic structure and semantic correlation as complementary signals: the SynGCN module (with rich syntactic knowledge) aims to reduce dependency-parsing errors, while the SemGCN module (a self-attention mechanism) aims to capture semantic correlations. Zhong et al. [35] introduced external knowledge into ABSA and, through complementary information between external knowledge, context, and syntax, captured sentiment features from three perspectives: context-based, syntax-based, and knowledge-based. Context and syntax features are extracted from pre-trained word embeddings; the context is encoded through a BiLSTM, while syntax encoding first builds syntactic dependencies to obtain the sentence's adjacency matrix, which is then encoded through two GCN layers. The WordNet knowledge graph is introduced as external knowledge, knowledge embeddings are learned through semantic matching, and aspect-specific knowledge representations are learned through a soft attention mechanism.

To date, one of the main challenges for ABSA lies in the lack of labeled data; moreover, with the widespread application of large-scale pre-trained models, many DA methods are no longer able to improve model performance [36].

2.2 Data augmentation and prompt learning

The increase in training data does not have a simple linear relationship with model performance, but it cannot be denied that the amount of data is still important for AI models. Typical examples include rotations and RGB-channel changes for computer vision, and changes of pitch and speed for speech recognition [12]. A DA model involves source data D, an algorithm A, and new data N; early ideas can be traced back to LeNet [37]. Adding noise to existing data is a common method: Coulombe et al. [38] added spelling errors common in daily life to the data, yielding an additional 1.5% improvement with XGBoost. Similar studies include [39, 40], among others.

Embedding-based replacement is also popular. Wang and Yang [41] used KNN over word embeddings to select substitute words for the training data; compared to the baseline, their logistic regression-based method achieved a 2.4% F1 improvement. Similar studies include [42,43,44], which differ mainly in how the embedding-based substitute words are chosen.

Pre-trained language models are widely used across NLP, and DA is no exception. Wu et al. [45] modified BERT into a label-conditioned variant (c-BERT), effectively enhancing baseline performance on multiple tasks. However, c-BERT does not perform well enough in low-resource settings; to address this, Hu et al. [46] introduced reinforcement learning on top of c-BERT, further improving its performance. Related work also includes Qu et al. [47] and Anaby-Tavor et al. [48].

Prompt learning can effectively improve the performance of pre-trained models [49]. Brown et al. suggested that large-scale models can fully exploit their inference and understanding capabilities with the help of suitable templates. However, prompt learning was initially designed for large pre-trained models, and researchers later tried to apply it to more general models such as ALBERT. Schick and Schütze [50] proposed PET (Pattern-Exploiting Training) for text classification tasks, converting classification into cloze-style completion consistent with the Masked Language Model (MLM) objective; they also designed the key components of prompt tuning, namely the Pattern (Template) and the Verbalizer.

We denote the Template as \(\mathcal {T}\) and the Verbalizer as \(\mathcal {V}\); together, these two components are defined as a Pattern-Verbalizer Pair (PVP) in this work. Their formal description is shown in Eq. 1.

$$\begin{aligned} p(y \mid x)=\prod _{j=1}^n p\left( [{\text {mask}}]_j=\mathcal {V}(y) \mid \mathcal {T}(x)\right) \end{aligned}$$
(1)

where x is the input sequence and y is the corresponding label.

Figure 3 shows an example. The main steps are template design, answer search, and answer mapping.

Fig. 3
figure 3

Schematic diagram of prompt learning
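As a concrete illustration of these three steps, the sketch below uses a masked language model to fill a hard template and a small verbalizer to map the predicted word to a sentiment label. The checkpoint name, the example review, and the verbalizer words are our own illustrative assumptions, not the exact configuration used later in this paper.

```python
from transformers import pipeline

# Answer search is done by an off-the-shelf masked language model (assumed checkpoint).
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# 1) Template design: wrap the input x into T(x) with a [MASK] slot.
review = "The pasta was cold and the waiter ignored us."
prompt = review + " The food made me feel [MASK]."

# 2) Answer search: let the MLM rank candidate fillers for the [MASK] position.
predictions = fill_mask(prompt, top_k=10)

# 3) Answer mapping: a verbalizer V maps predicted words to sentiment labels.
verbalizer = {"good": "Pos", "great": "Pos", "happy": "Pos",
              "bad": "Neg", "terrible": "Neg", "awful": "Neg"}
label = next((verbalizer[p["token_str"]] for p in predictions
              if p["token_str"] in verbalizer), "Neu")
print(label)
```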

AutoPrompt achieves better performance [51]. It can be summarized as follows: given the original input, a number of additional discrete tokens are searched to form a template, and the probability of the corresponding answers is predicted by a pre-trained language model. Continuous prompts convert templates into continuous vectors that can be optimized [52, 53]; representing templates as vectors in the semantic space facilitates the optimization search. The form of a continuous template is shown in Eq. 2.

$$\begin{aligned} \mathcal {T}=[x]\left[ v_1\right] \left[ v_2\right] \ldots \left[ v_m\right] [mask] \end{aligned}$$
(2)

where each \([v_i]\) is a trainable vector.
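The following is a minimal sketch of such a continuous template: m trainable vectors are appended to the token embeddings of the input x, matching the form of Eq. 2. The module name, prompt length, and hidden size are our own assumptions.

```python
import torch
import torch.nn as nn

class ContinuousTemplate(nn.Module):
    """Sketch of Eq. 2: T = [x][v_1]...[v_m][mask], where each [v_i] is trainable."""
    def __init__(self, m: int = 10, hidden_size: int = 768):
        super().__init__()
        # m continuous prompt vectors, optimized directly in the embedding space
        self.prompt = nn.Parameter(torch.randn(m, hidden_size) * 0.02)

    def forward(self, x_embeddings: torch.Tensor) -> torch.Tensor:
        # x_embeddings: (batch, seq_len, hidden) token embeddings of the input x;
        # the [mask] position is assumed to be appended afterwards by the caller.
        prompt = self.prompt.unsqueeze(0).expand(x_embeddings.size(0), -1, -1)
        return torch.cat([x_embeddings, prompt], dim=1)
```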

Prompt-based DA has been widely used; we introduce some representative work here. Chen et al. [22] proposed a framework named GOTTA that integrates a cloze task imitating the main QA task format and efficiently utilizes generative prompt learning, enabling the model to learn all tasks simultaneously. Liu et al. [23] used prompt learning to extract knowledge from pre-trained models and designed a label-conditioned approach to generate more data with the same labels; in addition, a prompt-based QA method was designed to generate new training data from unlabeled text. Wang et al. [24] used a generative DA method, utilizing soft prompts and NLU models to filter the generated data from multiple perspectives and ensure the quality of the newly added data. Abaskohi et al. [25] found that fine-tuning on small datasets did not perform as well as expected; they combined contrastive learning with prompt tuning and proposed LM-CPPF (Contrastive Paraphrasing-guided Prompt-based Fine-tuning of Language Models) for data augmentation using GPT-3 and OPT-175B.

Summarizing existing methods, we find that most strive to make the model output as close to natural language as possible, but the effect is still not satisfactory.

3 Our approach

As mentioned before, the basic idea of the strategies in our study is to label data automatically. Based on the DA techniques described in the related work, we believe that real reviews can provide more information than artificially augmented data (e.g., synonym replacement, back-translation). In this section, we first introduce the problem formulation and then describe our augmentation method in detail.

3.1 Problem formulation

Given an initial dataset D, a review text X can be written as \(X=\left[ x_1, x_2, \ldots , x_n\right]\) with a label sequence \(L=\left[ l_1, l_2, \ldots , l_n\right]\), where each \(l_i\) is an aspect. Our goal is to design an algorithm that identifies each \(l_i\) and assigns a label (Pos, Neu, Neg) to X. On this basis, a new dataset with a reasonable class proportion is composed of real data and prompt templates filled by pre-trained models.

The first objective of our augmentation task is to assign corresponding aspects and sentiment labels (Pos, Neg) to unlabeled data so that they can be used as training data for the ABSA model.

The second objective of our augmentation task is to use neutral data from the SemEval datasets (labeled data) as input; the prompt templates are then filled by the pre-trained model, and the filled templates constitute the neutral data of the enhanced dataset (label: Neu). Since many hard prompt template forms are available, the diversity of the data is ensured. Finally, we obtain a new dataset that contains the labeled real reviews (labels: Pos, Neg) and the filled templates (label: Neu).
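For clarity, the target format of the augmented dataset can be pictured as follows; the example records and field names are our own illustration, not entries from the actual datasets.

```python
# Real unlabeled reviews receive aspects and Pos/Neg labels via Review2Extreme;
# hard prompt templates filled by a pre-trained model supply the Neu portion.
augmented_dataset = [
    {"text": "The battery lasts two full days.",  "aspect": "battery", "label": "Pos"},
    {"text": "The screen cracked within a week.", "aspect": "screen",  "label": "Neg"},
    {"text": "The service is average.",           "aspect": "service", "label": "Neu"},
]
```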

3.2 The detail of our model

Fig. 4
figure 4

Framework of our augmentation method

The framework of our method is shown in Fig. 4. The general procedures for Review2Extreme and Review2Neutral are given in Algorithm 1 and Algorithm 2, and the corresponding sub-model frameworks are shown in Figs. 5 and 6.

Fig. 5
figure 5

Automated data labeling with Review2Extreme

Fig. 6
figure 6

Automated data labeling with Review2Neutral

We design two strategies for the problem, Review2Extreme and Review2Neutral, with different responsibilities. The input of Review2Extreme is unlabeled data from YELP and Amazon together with preset positive prompt templates; the output is the aspects and their corresponding labels (Pos, Neg). The input of Review2Neutral is the SemEval data (the widely used labeled ABSA datasets) and normal prompt templates; the output is the templates filled by pre-trained models. As shown in Fig. 5, we transform the task into probing unlabeled data with a prompt template. For example, consider an unlabeled comment "The food in this restaurant is very bad"; this comment is obviously contrary to our preset positive template "The food in this restaurant made me feel happy", so we can infer its label.
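The sketch below illustrates this probing idea with a natural language inference model: the unlabeled review is treated as the premise and the preset positive template as the hypothesis, and a contradiction points to a negative label. We use roberta-large-mnli purely as a stand-in checkpoint (the paper uses BERT weights trained on MNLI, see Sect. 4.4), and the decision rule is our own assumption.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

nli_name = "roberta-large-mnli"   # stand-in MNLI checkpoint (assumption)
tokenizer = AutoTokenizer.from_pretrained(nli_name)
nli = AutoModelForSequenceClassification.from_pretrained(nli_name)

review = "The food in this restaurant is very bad."    # unlabeled review (premise)
positive_template = "The food made me feel happy."     # preset positive template (hypothesis)

inputs = tokenizer(review, positive_template, return_tensors="pt")
with torch.no_grad():
    probs = nli(**inputs).logits.softmax(dim=-1).squeeze(0)

# Read the checkpoint's own label names instead of hard-coding indices.
scores = {nli.config.id2label[i].lower(): probs[i].item() for i in range(probs.size(0))}
label = "Pos" if scores["entailment"] > scores["contradiction"] else "Neg"
print(label)   # the review contradicts the positive template, so we expect "Neg"
```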

We decompose the task into the following steps: aspect extraction, automated labeling, and proportional distribution of the data to obtain an enhanced dataset with a balanced class distribution. In this paper, we keep the data of the corresponding domains in YELP and Amazon, where reviews carry ratings at five levels from one to five stars. We discard some data and select only the reviews x with one or five stars (\(1 \bigstar ,5 \bigstar\) in the pseudocode); this step eliminates the noise of document-level labeling and is a common practice in previous work. Unlike augmentation strategies for sentence-level sentiment analysis, a DA method for ABSA should ensure that no substitution, splitting, or replacement of aspect words occurs during the process: if aspects are replaced indiscriminately, the introduced noise will interfere locally with the aspect words and harm the model's sentiment classification for them [54]. We therefore record the aspects that appear in the public ABSA datasets and select, from the corresponding dataset (YELP, Amazon), the comments that contain the same aspects. Finally, we obtain the raw data and aspects.
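A minimal sketch of this pre-filtering step is given below; the aspect set, field names, and helper function are hypothetical and only illustrate the selection of 1-star and 5-star reviews that mention a recorded aspect.

```python
known_aspects = {"food", "service", "battery", "screen"}   # recorded from the ABSA datasets

def select_candidates(reviews):
    """reviews: iterable of dicts such as {"stars": 5, "text": "..."} (hypothetical schema)."""
    for r in reviews:
        if r["stars"] not in (1, 5):            # keep only extreme ratings to reduce label noise
            continue
        tokens = set(r["text"].lower().split())
        matched = known_aspects & tokens        # the review must mention a known aspect term
        if matched:
            yield r["text"], sorted(matched)    # raw review plus its candidate aspects
```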

However, there is a problem we must solve. For instance, after the first step we may obtain two sentences containing "look": 'It has a bad look' and 'It is so awful, I look for its spare parts but unsuccessfully'. In the first sentence "look" is a noun meaning appearance, but in the second it is a verb carrying no corresponding sentiment. We must therefore annotate the lexical category (part of speech) to eliminate this noise, which is a sequence labeling problem. As before, the review text X can be considered as a sequence \(X=\left[ x_1, x_2, \ldots , x_n\right]\), and we define the corresponding tag sequence as \(Y=\left[ y_1, y_2, \ldots , y_n\right]\). X and Y can be modeled as a Hidden Markov Model (HMM): the HMM describes the process by which a sentence is produced, where X is observed and Y is hidden.

Algorithm 1
figure a

Algorithm 1: Review2Extreme (5⋆ and 1⋆ are rating levels in the document-level dataset D)

Algorithm 2
figure b

Algorithm 2: Review2Neutral (\(sentence_{neutral}\) is the Neu data in the aspect-level dataset)

The hidden states transition to one another according to the transition probabilities, and the emission probabilities connect Y to X. For instance, for the sentence "John saw the saw.", the probability of producing this sentence from its tag sequence is governed by the emission probabilities. We obtain the following equation:

$$\begin{aligned} \textrm{P}(\textrm{x}, \textrm{y})=\textrm{P}(\textrm{y}) \textrm{P}(\textrm{x} \mid \textrm{y}) \end{aligned}$$
(3)

where

$$\begin{aligned} \begin{aligned}&\begin{aligned} \textrm{P}(\textrm{y})&=\textrm{P}(\textrm{P} \textrm{N} \mid \text{ start } ) \\&\quad \times \textrm{P}(\textrm{V} \mid \textrm{P} \textrm{N}) \\&\quad \times \textrm{P}(\textrm{D} \mid \textrm{V}) \\&\quad \times \textrm{P}(\textrm{N} \mid \textrm{D}) \\&\quad \times \textrm{P}( \text{ end } \mid \textrm{N}) \end{aligned}\\&\begin{aligned} \textrm{P}(\textrm{x} \mid \textrm{y})&=\textrm{P}( \text{ John } \mid \textrm{PN}) \\&\quad \times \textrm{P}( \text{ saw } \mid \textrm{V}) \\&\quad \times \textrm{P}( \text{ the } \mid \textrm{D}) \\&\quad \times \textrm{P}( \text{ saw } \mid \textrm{N}) \end{aligned} \end{aligned} \end{aligned}$$
(4)

PN denotes a person's name, V a verb, D a determiner, and N a noun. In general, we obtain the following equations:

$$\begin{aligned} \begin{aligned}&\textrm{P}(\textrm{y})=\textrm{P}\left( \textrm{y}_1 \mid \text{ start } \right) \times \prod _{\textrm{l}=1}^{\textrm{L}-1} \textrm{P}\left( \textrm{y}_{\textrm{l}+1} \mid \textrm{y}_{\textrm{l}}\right) \times \textrm{P}\left( \text{ end } \mid \textrm{y}_{\textrm{L}}\right) \\&\textrm{P}(\textrm{x} \mid \textrm{y})=\prod _{\textrm{l}=1}^{\textrm{L}} \textrm{P}\left( \textrm{x}_{\textrm{l}} \mid \textrm{y}_{\textrm{l}}\right) \end{aligned} \end{aligned}$$
(5)

In Eq. 5, \(\textrm{P}(\textrm{y})\) is expressed through the transition probabilities, and \(\textrm{P}(\textrm{x} \mid \textrm{y})\) through the emission probabilities.

\(\textrm{P}(y_1\mid \text {start})\) is the probability of the first tag at the beginning of the sequence, \(\prod _{\textrm{l}=1}^{\textrm{L}-1} \textrm{P}\left( \textrm{y}_{\textrm{l}+1} \mid \textrm{y}_{\textrm{l}}\right)\) is the product of the tag-to-tag transition probabilities within the sequence, and \(\textrm{P}\left( \text{end} \mid \textrm{y}_{\textrm{L}}\right)\) is the probability of the last tag ending the sequence. Substituting Eq. 5 into Eq. 3, we obtain Eq. 6:

$$\begin{aligned} \textrm{P}(\textrm{x}, \textrm{y})=\textrm{P}\left( \textrm{y}_1 \mid \text{ start } \right) \prod _{\textrm{l}=1}^{\textrm{L}-1} \textrm{P}\left( \textrm{y}_{\textrm{l}+1} \mid \textrm{y}_{\textrm{l}}\right) \textrm{P}\left( \text{ end } \mid \textrm{y}_{\textrm{L}}\right) \prod _{\textrm{l}=1}^{\textrm{L}} \textrm{P}\left( \textrm{x}_{\textrm{l}} \mid \textrm{y}_{\textrm{l}}\right) \end{aligned}$$
(6)

The emission and transition probabilities are estimated from the training data and then substituted into Eq. 6; the estimates are obtained by counting, as shown in Eq. 7.

$$\begin{aligned} \textrm{P}\left( \textrm{y}_{l+1}=\textrm{s}^{\prime } \mid \textrm{y}_{l}=\textrm{s}\right) =\frac{{\text {count}}\left( \textrm{s} \rightarrow \textrm{s}^{\prime }\right) }{{\text {count}}(\textrm{s})} \end{aligned}$$
(7)

where \({\text {count}}\left( \textrm{s} \rightarrow \textrm{s}^{\prime }\right)\) is the number of times tag s is followed by tag \(\textrm{s}^{\prime }\) in the training data. Analogously, the emission probability is estimated as \(\textrm{P}\left( \textrm{x}_{l}=\textrm{t} \mid \textrm{y}_{l}=\textrm{s}\right) =\frac{{\text {count}}(\textrm{s} \rightarrow \textrm{t})}{{\text {count}}(\textrm{s})}\).
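The count-based estimates of Eq. 7 can be computed with a few lines of code; the toy tagged corpus below is only an illustration.

```python
from collections import Counter

tagged_corpus = [  # toy training data; tags follow the paper's example
    [("John", "PN"), ("saw", "V"), ("the", "D"), ("saw", "N")],
]

trans, emit, tag_count = Counter(), Counter(), Counter()
for sent in tagged_corpus:
    tags = ["start"] + [t for _, t in sent] + ["end"]
    for a, b in zip(tags, tags[1:]):
        trans[(a, b)] += 1                      # count(s -> s')
        tag_count[a] += 1                       # count(s)
    for word, tag in sent:
        emit[(tag, word)] += 1                  # count(s -> t)

P_trans = {k: v / tag_count[k[0]] for k, v in trans.items()}   # Eq. 7
P_emit  = {k: v / tag_count[k[0]] for k, v in emit.items()}
```

The sequence labeling problem can then be summarized as Eq. 8.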

$$\begin{aligned} \begin{aligned} \textrm{y}&=\arg \max _{\textrm{y} \in \mathbb {Y}} \textrm{P}(\textrm{y} \mid \textrm{x}) \\&=\arg \max _{\textrm{y} \in \mathbb {Y}} \frac{\textrm{P}(\textrm{x}, \textrm{y})}{\textrm{P}(\textrm{x})} \\&=\arg \max _{\textrm{y} \in \mathbb {Y}} \textrm{P}(\textrm{x}, \textrm{y}) \end{aligned} \end{aligned}$$
(8)

The predicted tag sequence is \(\tilde{\textrm{y}}=\arg \max _{\textrm{y} \in \mathbb {Y}} \textrm{P}(\textrm{x}, \textrm{y})\), and Eq. 8 can be solved with the Viterbi algorithm. The Viterbi algorithm introduces two variables \(\delta\) and \(\psi\): \(\delta _t(i)\) is the maximum probability over all partial paths \(\left( I_1, I_2, \ldots , I_t\right)\) that end in state i at time t, and its recursion is given in Eq. 9.

$$\begin{aligned} \begin{aligned} \delta _{t+1}(i)=\max _{1 \le j \le N}\left[ \delta _t(j) a_{j i}\right] b_i\left( o_{t+1}\right) , \\ i=1,2, \ldots , N ; t=1,2, \ldots , T \end{aligned} \end{aligned}$$
(9)

We define \(\psi _t(i)\) as the \((t-1)\)-th node of the single path with the highest probability among all partial paths ending in state i at time t, as in Eq. 10.

$$\begin{aligned} \psi _t(i)=\arg \max _{1 \le j \le N}\left[ \delta _{t-1}(j) a_{j i}\right] , i=1,2, \ldots , N ; t=1,2, \ldots , T \end{aligned}$$
(10)

The algorithm initialization can be shown as Eq. 11:

$$\begin{aligned} \begin{aligned}&\delta _1(i)=\pi _i b_i\left( o_1\right) , \quad i=1,2, \ldots , N \\&\psi _1(i)=0,\quad i=1,2, \ldots , N \end{aligned} \end{aligned}$$
(11)

As shown in Eq. 12, for \(t=2, \ldots , T\), we have the equation:

$$\begin{aligned} \begin{aligned} \delta _t(i)&=\max _{1 \le j \le N}\left[ \delta _{t-1}(j) a_{j i}\right] b_i\left( o_t\right) , \quad i=1,2, \ldots , N \\ \psi _t(i)&=\arg \max _{1 \le j \le N}\left[ \delta _{t-1}(j) a_{j i}\right] , \quad i=1,2, \ldots , N \end{aligned} \end{aligned}$$
(12)

The algorithm terminates with \(P^*=\max _{1 \le i \le N} \delta _T(i)\) and \(i_T^*=\arg \max _{1 \le i \le N}\left[ \delta _T(i)\right]\). The optimal path is then recovered by backtracking: for \(t=T-1,T-2,\ldots ,1\), as shown in Eq. 13:

$$\begin{aligned} \begin{aligned}&i_t^*=\psi _{t+1}\left( i_{t+1}^*\right) \\&I^*=\left( i_1^*, i_2^*, \ldots , i_T^*\right) \end{aligned} \end{aligned}$$
(13)
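A compact decoder implementing Eqs. 9-13 is sketched below; pi, A, and B denote the initial, transition, and emission probabilities in generic HMM notation (our own naming), for instance as estimated by the counting scheme above.

```python
import numpy as np

def viterbi(pi, A, B, obs):
    """pi: (N,) initial probs; A: (N, N) transitions a_ji; B: (N, M) emissions b_i(o);
    obs: list of observation indices. Returns the most probable hidden state path."""
    N, T = len(pi), len(obs)
    delta = np.zeros((T, N))
    psi = np.zeros((T, N), dtype=int)
    delta[0] = pi * B[:, obs[0]]                      # initialization (Eq. 11)
    for t in range(1, T):
        for i in range(N):
            scores = delta[t - 1] * A[:, i]           # recursion (Eqs. 9 and 12)
            psi[t, i] = np.argmax(scores)             # backpointer (Eq. 10)
            delta[t, i] = scores[psi[t, i]] * B[i, obs[t]]
    path = [int(np.argmax(delta[T - 1]))]             # termination: i_T^*
    for t in range(T - 1, 0, -1):                     # backtracking (Eq. 13)
        path.append(int(psi[t, path[-1]]))
    return path[::-1]
```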

We further leverage a self-consistency mechanism [2, 55] to consolidate labeling correctness. Specifically, the PLM generates an answer independently for each of the three prompt templates, and we select the label with the highest voting consistency.
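A sketch of this voting step is shown below; label_with_template stands in for the per-template labeling routine described above, and the rule of discarding low-consistency samples is our own assumption.

```python
from collections import Counter

def vote_label(review, aspect, templates, label_with_template):
    """Each template yields an independent label; keep the majority label."""
    votes = [label_with_template(review, aspect, t) for t in templates]
    label, count = Counter(votes).most_common(1)[0]
    return label if count > len(templates) // 2 else None   # None: low consistency, drop sample
```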

4 Experiments

4.1 Datasets

The SemEval datasets [56, 57] have been widely used in ABSA for many years. Their statistics show that the amount of data falls far short of the requirements of large models. Details are given in Table 1.

Table 1 Statistics of the datasets

The Amazon review dataset records user reviews of products on Amazon.com; it is a classic dataset for recommendation systems and is continually updated.

The YELP dataset includes 4.7 million user reviews covering 12 metropolitan areas. It also contains 1 million tips from 1.1 million users and over 1.2 million business attributes (such as hours of operation, parking availability, reservation availability, and ambience information). The data in YELP and Amazon is document-level data, divided into five sentiment levels from one to five stars.

4.2 Experimental models

ASGCN [58]: ASGCN is based on a GCN. Contextual information about the word order is first captured by an LSTM layer, and a multi-layer graph convolution structure is then applied on top of the LSTM output to obtain aspect-specific features.

CABASC [59]: CABASC is based on a sentence-level content attention mechanism that embeds sentences and aspects separately, allowing the important information about a specific aspect word to be extracted from a global perspective while taking the position and relevance of the information into account. A context attention mechanism is further designed to generate customized memory for the aspect words in each sentence, considering the order of and correlation between words and aspect words.

LSTM [60]: LSTM is a special type of recurrent neural network (RNN) for modeling time-series data. Two fully connected layers and a softmax layer form the main network structure. ASGCN, CABASC, and LSTM are also used by Li et al. (DRAWS and PWSS) [54].

R-GAT+BERT [61]: R-GAT addresses ABSA by effectively encoding syntactic information. It first reshapes and prunes the ordinary dependency parse tree to define a unified aspect-oriented dependency tree structure rooted at the target aspect, and then encodes this new tree structure with a relational graph attention mechanism to predict the sentiment label.

4.3 Baseline models

DRAWS and PWSS [54]: POS-wise Synonym Substitution (PWSS) and Dependency Relation-based Word Swap (DRAWS) are proposed by Li et al. PWSS selects synonyms from general dictionaries for substitution, enabling a reasonable increase in the size of the training set. DRAWS better preserves sentence semantics with respect to polarity orientation (positive, negative, and neutral).

4.4 Setting of experiment

To make maximum use of the knowledge in the pre-trained model, we selected three prompt templates and append them to the text; the number of templates can be changed arbitrarily.

  • I felt the [aspect] was [MASK].

  • The [aspect] made me feel [MASK].

  • The [aspect] is [MASK].

where [aspect] is the placeholder for the aspect term, and [MASK] represents the masked token to be predicted by BERT, which is pre-trained on the MNLI dataset.
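For illustration, the templates are instantiated with a concrete aspect term before being handed to the PLM; the aspect "battery" below is only an example value.

```python
templates = [
    "I felt the {aspect} was [MASK].",
    "The {aspect} made me feel [MASK].",
    "The {aspect} is [MASK].",
]
prompts = [t.format(aspect="battery") for t in templates]
# ['I felt the battery was [MASK].', 'The battery made me feel [MASK].', 'The battery is [MASK].']
```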

We use the weights released by Morris et al. [62], which were trained on the MNLI dataset. The models are trained for 20 epochs. Following [63], we fine-tune the prompt-based DA model until the training loss is around 1e-07 to obtain a stable model.

As described by Li et al. [54], according to the original proportions of polarity in the training set, 50, 500, and all instances are selected in turn, while the sizes of the test set and validation set remain unchanged.

4.5 Results and discussion

The additional experimental results on the pre-trained model (R-GAT+BERT) are shown in Table 2. The performance of each model is shown in Table 3 and illustrated in Figure 7. Our experimental setup was consistent with the baseline models; specifically, the experimental models ASGCN, CABASC, and LSTM were trained with the same training sizes as in the baseline work [54].

Our method achieves better results than the baseline on all experimental models. We make the following observations. In terms of training-set size, the classification performance of the three models shows an upward trend as the data capacity increases, confirming that the augmentation strategies are effective. In particular, from training size 50 to 500 the Macro-F1 increases by a large margin, while from training size 500 to the full set the classification results improve only slowly.

Table 2 R-GAT performance comparison among none and ours
Table 3 Comparison of the performance Macro-F1 (%) among None, PWSS [54], DRAWS [54] and Ours for different training sizes
Fig. 7
figure 7

Comparison of macro-F1(%) on different training sizes among none, PWSS [54], DRAWS [54] and ours

This result ties in well with previous studies [64, 65] and is attributed to the injection of noise as the data scale increases. The results show that model gain is not simply linearly related to the amount of data; they also illustrate the need for data correction and noise removal in existing datasets, which should be verified by further experiments. Unlike existing methods, our approach provides real data. Although the SemEval dataset is also composed of real data, our data contains less noise than manual annotation, resulting in better performance than the original data. Furthermore, Table 1 shows that the class distributions are dramatically imbalanced on the 15_rest and 16_rest datasets; the data we annotate can be added to the relevant categories to balance the distribution and further improve the performance of ABSA models. The baselines DRAWS and PWSS [54] did not verify their methods on pre-trained models, so to better demonstrate our method we supplemented the relevant experiments on pre-trained models, conducted with the code published by the authors. The results are shown in Table 2.

In Table 4, we present a case study to illustrate the model's behavior. R1 is difficult for our models: it contains misleading words (food was exceeding expectations), but there is a precondition (when they actually gave people the meals they ordered), so it is in fact a negative review. This case demonstrates that our model has some shortcomings; in a few situations, the model's overall grasp of the information in the corpus is inadequate, and hand-constructed templates do not efficiently exploit the knowledge present in pre-trained models. R2 indicates that our method can effectively label data in most cases. R3 and R4 are implicit sentiment reviews; these cases demonstrate that our model recognizes implicitly expressed sentiment well. R5 and R6 indicate that different function words can affect the performance of the model: words that signal semantic progression (e.g. 'also') may mislead the model into ignoring subsequent information. R7 is a template filled by pre-trained models; we select many templates for the Neu data, with aspects identical to those of the SemEval datasets. Our method can perform data completion on datasets with imbalanced label distributions, as well as bias correction on existing datasets. As a low-computational-resource method, it has great application prospects. It is undeniable that there is still much room for improvement, for example in handling the complexity of human language (such as function words with diverse usages) and the erroneous judgments caused by multiple aspects coexisting in one sentence.

Table 4 Case study

Based on the experimental results, we list the advantages and disadvantages of our method and other classic data augmentation techniques (including the baselines) for a more detailed comparison:

The overview of Wei et al. [8]: four data augmentation techniques are proposed: synonym replacement, random insertion, random swap, and random deletion. A comparative study was conducted on text classification with five datasets using RNN and CNN deep learning models.

The advantages of Wei et al. [8]: random insertion, swapping, and deletion were used for data augmentation for the first time, and the methods were validated on convolutional and recurrent neural networks; the method is simple, intuitive, and easy to understand; and under appropriate parameters, each raw sentence can generate nine augmented sentences while, in most cases, retaining the original labels.

The disadvantages of Wei et al. [8]: firstly, the output of the EDA model may change the semantics, which can conflict with the original labels and provide incorrect learning samples for the ABSA model. Secondly, pre-trained models have become the mainstream of current NLP research, and simple DA is no longer able to substantially improve model performance [36]. Finally, according to Ebrahimi et al. [15], small changes in training data have a significant impact on the performance of the ABSA model, and the uncontrollability of EDA (such as no improvement when head words are replaced with synonyms [66]) may introduce unknown data noise.

The overview of Li et al. [54]: a method based on EDA that utilizes part of speech, external domain knowledge, and syntactic dependencies to achieve DA through synonym replacement and dependency-based word swapping. These strategies were evaluated through extensive experiments with three representative deep learning models (ASGCN, CABASC, and LSTM) on four common datasets (the SemEval datasets).

The advantages of Li et al. [54]: unlike previous augmentation methods for sentiment analysis, this method combines DA with dependency parse trees and incorporates external knowledge to improve data quality. An analysis was conducted on common deep-learning architectures, and the proposed method enhances the generalization ability of the model.

The disadvantages of Li et al. [54]: word replacement adds noise that interferes with the true labels and cannot balance semantic integrity with syntactic correctness. The data constructed by the method is still an extension of old data and can bring only limited new information to the model.

The overview of ours: automated annotation of real unlabeled data based on prompt learning. Rich hard prompt templates are constructed from the labeled Neu data, and the filled templates are introduced back into the training set. Because the existing Neu data is manually annotated, our ablation experiments show that it contains a significant amount of noise, and this operation further corrects the label bias of the existing Neu data. The method can also quantitatively introduce data into imbalanced datasets, improving the effectiveness of the ABSA model.

The advantages of ours: the method requires few computational resources, is intuitive and easy to understand, and is simple to use; it provides authentic data while balancing semantic integrity and syntactic correctness; and it offers controllable data augmentation.

The disadvantages of ours: the discriminative ability for complex sentences needs to be strengthened; compared with soft prompt templates, hard prompt templates cannot fully exploit the knowledge in the pre-trained model; and further research on semantic and syntactic structural information is needed to improve model performance and eliminate data noise.

4.6 Ablation study

4.6.1 Review2Extreme and Review2Neutral’s respective influence

As mentioned before, our approach consists of two parts. We use the following symbols to represent the corresponding data:

\(Sem\_Neu\): Raw neutral data in the Semeval dataset

\(Sem\_Ext\): Raw positive and negative data in the Semeval dataset

\(New\_Neu\): The new neutral data

\(New\_Ext\): The new positive and negative data we labeled

In this subsection, we block Review2Extreme and Review2Neutral separately to observe their influence on the performance of different ABSA models. The experimental results are shown in Tables 5, 6, 7, and 8. Analyzing these results, we believe that the Neu data from the SemEval dataset contains some mistakes, which have a certain negative influence on the performance of the ABSA model.

Table 5 Comparison of the macro-F1(%) (no Review2Extreme : \(Sem\_Ext\) + \(New\_Neu\))
Table 6 Comparison of the Macro-F1(%) (No Review2Neutral : \(New\_Ext\) + \(Sem\_Neu\))
Table 7 Comparison of the performance(%) (no Review2Extreme : \(Sem\_Ext\) + \(New\_Neu\))
Table 8 Comparison of the performance(%) (No Review2Neutral : \(New\_Ext\) + \(Sem\_Neu\))

4.6.2 Aspect independent prompts

As mentioned before, our approach computes the relationship between the prompt templates and unlabeled data and then assigns labels to expand the training data. In this subsection, we replace the different aspects with a unified placeholder (e.g., "The [aspect] made me feel good." becomes "This made me feel good."). The performance is shown in Tables 9 and 10: this operation causes a significant decrease in the quality of the newly labeled data, indicating that the determined aspects have a crucial impact on whether the model can correctly label data.

Table 9 Comparison of the Macro-F1(%) (determined aspects/unified aspect)
Table 10 Comparison of the R-GAT (%) (determined aspects/unified aspect)

4.6.3 Single prompt template

As mentioned before, the labels (positive and negative) are voted upon by multiple templates. In this subsection, we attempt to assign the corresponding labels with a single template. The performance is shown in Tables 11 and 12; the results indicate that multiple templates can effectively correct model errors and improve the quality of the dataset.

Table 11 Comparison of the macro-F1(%) (multiple templates/random single prompt)
Table 12 Comparison of the R-GAT (%) (multiple templates/random single prompt)

5 Conclusions and future work

In this paper, we systematically explore the effects of data augmentation for ABSA and propose a novel prompt learning-based data augmentation method, which computes the relationship between prompt templates and unlabeled data and leverages it to enrich data resources and improve generalization capability via prompt learning. The experimental results on four well-studied datasets demonstrate that our model achieves results on par with existing state-of-the-art data augmentation methods on a few occasions and significantly outperforms existing ABSA base models and data augmentation methods on most occasions, indicating strong robustness across various base ABSA models, including ASGCN [58] and CABASC [59]. Future work will explore incorporating more types of soft templates (soft prompts optimized in vector space) and additional data augmentation methods to further enhance ABSA performance.