1 Introduction

Deep learning methods usually need large-scale labeled data to train neural networks. Since collecting and annotating data incur high time and labor costs, recent studies have paid attention to data augmentation techniques for automatically generating synthetic instances and increasing data diversity [38]. Data augmentation was first used in the field of computer vision (CV), where images are augmented by rotation, cropping, masking, color jittering, gray scaling, etc. [15, 41]. It was then quickly extended to the field of natural language processing (NLP). Different from images in CV, language is more delicate, since even a slight modification may change the original semantic meaning. To avoid changing the labels after perturbing language, existing data augmentation studies in NLP mainly concentrate on coarse-grained sentence-level tasks such as machine translation [8, 37], text classification [14, 44], and question answering [1, 21]. Frequently used perturbation methods include back translation, random insertion, random swap, and random deletion.

Data augmentation research on fine-grained sequence tagging tasks like named entity recognition (NER) and aspect-based sentiment analysis (ABSA) is still limited. The main reason is that sequence tagging tasks are defined at the token level: neural models are trained to capture the one-to-one correspondence between tokens and their labels, so perturbing a token sequence may produce a wrong label sequence.

For example, if “Disney” is deleted from the text segment “love Disney Land” tagged as “O B-facility I-facility”, the perturbed label sequence will be “O I-facility” which is not allowed by the tagging rule in NER (no beginning of an entity).

Among the few data augmentation methods for sequence tagging tasks, most modify sentence-level perturbation methods by adding constraints to keep the token-label correspondence [4]. Related techniques include label-wise token replacement, shuffle within segments, mention replacement, etc. The main drawback of these methods is that, due to the randomness in perturbation, they are not stable enough to produce high-quality synthetic data. Another type of method generates synthetic instances by pre-training a customized language model and then sampling from it [7]. However, the language model needs to be trained on enough labeled data, which is not suitable for low-resource scenarios. Moreover, when sampling synthetic instances from the language model, several manual rules must be defined in advance and then used to filter out low-quality outputs. Besides these specific drawbacks, both types of methods are limited to instance-level augmentation. In other words, they only focus on generating more synthetic instances, but neglect to help sequence taggersFootnote 1 make better use of the limited training data.

In this paper, we propose a description and demonstration guided data augmentation (D3A) method for sequence tagging, which not only enhances the quality of the produced synthetic instances, but also strengthens the learning capability of the neural models. In particular, the description is a collection of dependency paths that acts as the reference for producing new data in instance-level augmentation, and the demonstration is a set of syntactically or semantically related tokens that serves as the evidence for feature-level augmentation.

At the instance level, our goal is to generate more reliable synthetic instances with the help of descriptions. In a sequence tagging task, we can divide the token sequence into two types of tokens, namely mentions (e.g., named entities in NER and aspect terms in ABSA) and contexts (non-mention tokens). To increase the data diversity, mention replacement is a feasible way that keeps a mention’s contexts unchanged and replaces the mention itself with another one to create a synthetic instance. However, existing methods [4] often select mentions for replacement at random and thus cannot ensure high-quality synthetic instances. To solve this problem, we propose to construct descriptions for mentions to refine the replacement procedure. Specifically, we first summarize each mention with a description set that includes the involved dependency paths (e.g., MENTION\(\overset {nsubj}{\longrightarrow }\)JJ). Then, in order to find compatible mentions that match the original contexts, we calculate the correlation between the original mention and each candidate substitutive mention. Lastly, we rank the candidate mentions w.r.t. their correlations and randomly choose several mentions that are qualified for synthesizing new instances.

At the feature level, we turn to making better use of the real but limited training samples. For this purpose, we introduce demonstrations for the token sequence to enhance the learning capability of sequence taggers. Specifically, for a token in the sequence, its demonstrations consist of a set of tokens that have appeared in training instances. These demonstration tokens play similar syntactic and semantic roles to the original token and can provide extra evidence for tagging. After augmenting token features with demonstrations in the learning procedure, a sequence tagger can model an instance not only under the guidance of its own label, but also under the guidance of other training instances. Consequently, the coupling relationship among instances can help the sequence tagger converge to a better state than before.

We conduct extensive experiments on two sequence tagging tasks including NER and ABSA, with two datasets per task. The experimental results demonstrate that after introducing descriptions at the instance level and demonstrations at the feature level, our proposed method can significantly improve the performance of sequence taggers, especially in low-resource scenarios.

2 Related work

In this section, we first introduce two related sequence tagging tasks: NER and ABSA. We then present a brief summary of existing data augmentation methods for sequence tagging tasks.

2.1 Named entity recognition (NER)

NER is a task for identifying and categorizing target keyphrases (entities) such as person, organization, time, and location. NER usually acts as the first step in the NLP processing pipeline, and serves as the information source for downstream tasks like question answering, information retrieval, and relation extraction. Early studies mainly use handcrafted features, manual rules, and linguistic lexicons [22, 30, 35]. Recent studies focus on designing neural models which require little feature engineering and expert knowledge [16, 19, 46, 48, 53]. Under the deep learning framework, NER is typically defined as a sequence tagging task. For example, in the text sentence “Made it back home to GA. Time to start planning the next Disney World trip.”, “GA, Disney World” are tagged as “B-Location, B-facility I-facility”, respectively, while other tokens are tagged as “O”.Footnote 2

2.2 Aspect-based sentiment analysis (ABSA)

ABSA is a fine-grained task that aims to summarize the opinions of users towards specific aspects in reviews. With the rapid growth of the World Wide Web and social media, ABSA has been widely applied to various fields to analyze texts like product reviews, forum discussions, and blog posts. For example, given the sentence “The pizza here is also absolutely delicious.”, ABSA needs to extract the aspect term and classify its sentiment polarity, tagging “pizza” as “B-positive” and the other tokens as “O”. Most existing studies treat ABSA as a two-step task and develop separate methods for aspect term extraction [23, 24, 26, 34, 42, 47, 51] and aspect-level sentiment classification [2, 11, 12, 18, 20, 25, 54]. To obtain the complete ABSA predictions, results from the two steps must be merged in a pipeline manner, which however may lead to error propagation. To address this problem, recent studies in ABSA design end-to-end sequence taggers that directly map tokens to their collapsed labels [3, 17, 28, 50].

2.3 Data augmentation for sequence tagging

Data augmentation originated in computer vision [38] and was then quickly applied to natural language processing [9]. Most existing data augmentation methods are designed for document-level and sentence-level tasks such as machine translation [8, 37], text classification [14, 44], and question answering [1, 21]. Frequently used methods for perturbing language include back translation, random insertion, random swap, random deletion, etc. In this scenario, synthetic data is relatively easy to obtain since the labels usually remain unchanged after language perturbation.

For fine-grained sequence tagging tasks, where tokens and labels have a fragile one-to-one correspondence, there are only a few studies available. Sahin and Steedman [36] use dependency tree morphing (sentence cropping and sentence rotating) to generate synthetic data for POS tagging, but the produced synthetic data is not semantically smooth and is hard to interpret. Ding et al. [7] first pre-train a customized language model by concatenating tokens with their labels, then sample outputs from this language model and transform them into synthetic data. Their language model is trained on at least 1k annotated sentences, which is not suitable for low-resource scenarios. Moreover, several manual rules need to be pre-defined and then used to filter out low-quality outputs. Dai et al. [4] modify sentence-level augmentation methods by adding additional constraints, and propose several methods for NER including label-wise token replacement, synonym replacement, and mention replacement. These methods show improved performance in both recurrent and transformer models, but the synthetic data is not stable enough and contains harmful noise. Zhang et al. [52] adapt the MIXUP [49] technique for active sequence labeling by augmenting queried samples, which also requires manual labor. Guo et al. [10] also refer to MIXUP and create new synthetic instances by softly combining token/label sequences following the Beta distribution, which blends mentions with non-mentions and does not perform well for sequence tagging. Apart from these specific drawbacks, existing data augmentation methods are all limited to instance-level augmentation. They only pay attention to synthesizing instances, but neglect to help taggers make better use of the limited training data.

Different from prior methods, our proposed approach generates synthetic samples with the guidance of descriptions, and hence it can produce stable results. In addition, our approach performs feature-level augmentation for training samples, which can further improve the learning capability of neural networks with limited training data.

3 Methodology

In this section, we first introduce the definition of sequence tagging and the overview of the proposed method. We then present two backbone sequence taggers as the carrier and tester of data augmentation methods. Lastly, we illustrate our method in detail.

3.1 Problem definition

In sequence tagging tasks, a sequence tagger is trained to learn a mapping function f between a token sequence x = {x1, ... , xn} and a label sequence y = {y1, ... , yn}, i.e., \(f:x\rightarrow y\). Each xi is a token in natural language and each yi belongs to the label set {B-X, I-X, O}. Specifically, the first token of a mention of type X is tagged as B-X, the tokens inside that mention are tagged as I-X, and the contexts (non-mention tokens) are tagged as O.

In this work, we focus on the data augmentation issue for sequence tagging. Given a training set containing (usually limited) gold-labeled instances \(\mathcal {D}^{train}\), our goal is to generate some synthetic labeled instances \(\mathcal {D}^{syn}\) to improve the diversity of the training set, and promote the sequence tagger’s performance in tagging the unseen test set \(\mathcal {D}^{test}\).
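For concreteness, the example from Section 2.1 can be written as parallel token and label sequences. The following is a minimal illustration; the tokenization shown is an assumption, not taken from the datasets.

```python
# Illustrative BIO-tagged instance (example sentence from Section 2.1).
tokens = ["Made", "it", "back", "home", "to", "GA", ".",
          "Time", "to", "start", "planning", "the", "next",
          "Disney", "World", "trip", "."]
labels = ["O", "O", "O", "O", "O", "B-Location", "O",
          "O", "O", "O", "O", "O", "O",
          "B-facility", "I-facility", "O", "O"]
assert len(tokens) == len(labels)  # the one-to-one token/label correspondence
```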

3.2 Backbone sequence tagger

To examine the effectiveness of different data augmentation methods, we consider two types of backbone sequence taggers: one is based on static GloVe embeddings, and the other is based on contextual BERT representations.

In the GloVe backbone, we first map each token xi to its vector ei using the static GloVe embeddings, then use an additional encoder consisting of several convolutional layers without pooling to extract the hidden state hi:

$$ \begin{array}{ll} \{{e}_{1}, ..., {e}_{n}\}&=\textit{GloVe-Lookup}~(\{x_{1}, ..., x_{n}\}),\\ \{{h}_{1}, ..., {h}_{n}\}&=\textit{CNN-Encoder}~(\{{e}_{1}, ..., {e}_{n}\}), \end{array} $$
(1)

where the parameters of the CNN encoder are learned from scratch. In contrast, in the BERT backbone, we directly use the pre-trained BERT encoder to obtain the hidden state hi:

$$ \begin{array}{ll} \{{h}_{1}, ..., {h}_{n}\}=\textit{BERT-Encoder}~(\{x_{1}, ..., x_{n}\}) \end{array} $$
(2)

In both backbones, after extracting the hidden state of each token, a classifier which consists of a linear transformation layer and a softmax function is used to predict the tags of tokens:

$$ \{\hat{y}_{1}, ..., \hat{y}_{n}\}=\textit{Classifier}~(\{{h}_{1}, ..., {h}_{n}\}) $$
(3)

Lastly, we compute the cross-entropy loss and train learnable parameters with back propagation:

$$ \mathcal{L}= - \sum\nolimits_{i=1}^{n} \sum\nolimits_{j=1}^{J} {y}_{ij} \cdot \log(\hat{y}_{ij}), $$
(4)

where n is the length of x, J is the number of label categories, and \(\hat{y}_{i}\) and yi are the prediction and the ground-truth label of token xi, respectively. Given the gold instances from \(\mathcal {D}^{train}\) and the synthetic instances from \(\mathcal {D}^{syn}\), we can train the backbone sequence taggers accordingly and then make inference on \(\mathcal {D}^{test}\).
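To make the backbones concrete, the following is a minimal PyTorch sketch of the GloVe backbone and its training step (Eqs. 1, 3, and 4); the class and function names, layer sizes, and other details are illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class GloVeCNNTagger(nn.Module):
    """Sketch of the GloVe backbone: embedding lookup, a CNN encoder without
    pooling, and a linear token classifier (softmax is folded into the loss)."""
    def __init__(self, glove_vectors, num_labels, hidden=300, kernel=3, layers=4):
        super().__init__()
        self.embed = nn.Embedding.from_pretrained(glove_vectors, freeze=False)
        blocks, in_dim = [], glove_vectors.size(1)
        for _ in range(layers):  # parameters learned from scratch
            blocks += [nn.Conv1d(in_dim, hidden, kernel, padding=kernel // 2), nn.ReLU()]
            in_dim = hidden
        self.encoder = nn.Sequential(*blocks)
        self.classifier = nn.Linear(hidden, num_labels)

    def forward(self, token_ids):                              # token_ids: (batch, n)
        e = self.embed(token_ids)                              # (batch, n, 300)
        h = self.encoder(e.transpose(1, 2)).transpose(1, 2)    # (batch, n, hidden)
        return self.classifier(h)                              # per-token logits

def training_step(model, token_ids, gold_labels, optimizer):
    """One update with the token-level cross-entropy loss (Eq. 4)."""
    logits = model(token_ids)
    loss = nn.functional.cross_entropy(
        logits.view(-1, logits.size(-1)), gold_labels.view(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The BERT backbone follows the same pattern, with the CNN encoder replaced by the pre-trained BERT encoder (Eq. 2).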

3.3 Description guided instance-level augmentation

At the instance level, D3A aims to synthesize reliable instances and add them to the training set for boosting the performance and generalization of sequence taggers. For example, for the token sequence “Can’t upload payload to my apache 2 server, pentesting exercise.” in the NER task, we can divide it into two types of tokens, namely the mention (i.e., “apache 2 server”) and the contexts (non-mention tokens). To improve the diversity of the training data, Dai et al. [4] propose to keep a mention’s contexts unchanged and replace the mention with another one to generate a synthetic instance. However, they select the substitutive mentions only based on the mention type (e.g., “PRODUCT” in this example). Consequently, an incompatible mention like “Touchscreen” may be selected to join the contexts and cause incongruity in semantics. Different from this naive mention replacement method, we propose to construct descriptions for mentions by resorting to the involved dependency paths and then search for compatible substitutive mentions for replacement. In this way, our proposed D3A can ensure the quality of the synthetic data and conduct effective instance-level data augmentation.

3.3.1 Construct descriptions via dependency paths

To capture the correlation among different mentions, we need to characterize mentions completely. Therefore, we propose to construct descriptions for mentions according to their involved dependency paths. As shown in Figure 1, for the NER instance with the mention “apache 2 server” and the mention type “PRODUCT”, we can construct the mention’s description set by collecting the involved dependency paths. According to the parsing results, the paths can be divided into two categories as follows.

  • Head-related paths. In a dependency parse tree, each token has exactly one head token (a.k.a. governor). Inspired by a recent study on aspect-level sentiment classification [43], we first reshape the original parse tree into a mention-oriented tree so as to keep the integrity of the target mention. Specifically, we treat a mention containing multiple tokens as a whole and only consider the paths outside the mention. For example, two paths inside the target mention, i.e., apache\(\overset {nummod}{\longrightarrow }\)2 and server\(\overset {compound}{\longrightarrow }\)apache, are discarded. After reshaping, we can easily collect two head-related paths: payload\(\overset {nmod}{\longrightarrow }\)M and NN\(\overset {nmod}{\longrightarrow }\)M, which correspond to the head token and its part-of-speech (POS) tag, respectively. (Here we use M to denote an arbitrary mention for simplicity.)

  • Tail-related paths. Similar to the head-related paths, after reshaping the parse tree, we can collect four tail-related paths: M\(\overset {case}{\longrightarrow }\)to, M\(\overset {case}{\longrightarrow }\)IN, M\(\overset {nmod:pass}{\longrightarrow }\)my, and M\(\overset {nmod:pass}{\longrightarrow }\)PRP$. One small difference here is that the number of tail tokens (a.k.a. dependents) is not limited to one.

After traversing the entire training data, for each mention M, we can construct its description (the set S) that contains all involved dependency paths.
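A minimal sketch of this construction step is given below; the parse representation (a list of (head index, relation, dependent index) triples with accompanying POS tags) and the function name are assumptions made for illustration.

```python
def mention_description(tokens, pos_tags, dep_edges, mention_span):
    """Collect a mention's description set from its dependency paths.
    dep_edges: (head_index, relation, dependent_index) triples from a parser;
    mention_span: (start, end) token indices, end exclusive. Paths inside the
    mention are discarded and the mention is treated as a single node M."""
    start, end = mention_span
    inside = set(range(start, end))
    description = set()
    for head, rel, dep in dep_edges:
        if head in inside and dep in inside:
            continue                                   # path inside the mention
        if dep in inside:                              # head-related path: head -> M
            description.add((tokens[head], rel, "M"))
            description.add((pos_tags[head], rel, "M"))
        elif head in inside:                           # tail-related path: M -> tail
            description.add(("M", rel, tokens[dep]))
            description.add(("M", rel, pos_tags[dep]))
    return description
```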

Fig. 1 Illustration of the description guided instance-level data augmentation

3.3.2 Search compatible substitutive mentions

For the target mention M, we consider all other mentions that belong to the same type as its candidate substitutive mentions. Once the descriptions are constructed, we start to search compatible candidates to replace the target mention and generate synthetic instances. Given the target mention M and a random candidate mention \(\hat {\texttt {M}}\), we calculate their correlation via the Jaccard similarity of their corresponding description sets S and \(\hat {\texttt {S}}\):

$$ Correlation(\texttt{M}, \hat{\texttt{M}})= Jaccard(\texttt{S}, \hat{\texttt{S}})=\frac{\lvert \texttt{S}\cap\hat{\texttt{S}}\rvert}{\lvert\texttt{S}\cup\hat{\texttt{S}}\rvert}. $$
(5)

By ranking candidate mentions based on the Jaccard similarities, we preserve the top τ proportion of related candidates and filter out the others, where τ is a hyperparameter. Generally, τ is inversely proportional to the size of the training data \({\mathcal {D}^{train}}\), as we will show in the analysis section. After that, we randomly select a substitutive mention \(\hat {\texttt {M}}\) from the preserved candidates and combine \(\hat {\texttt {M}}\) with the unchanged contexts of M to generate a synthetic instance. By this means, we refine the selection process of the naive mention replacement method and obtain more reliable synthetic instances \(\mathcal {D}^{syn}\). Finally, instance-level data augmentation is achieved by training backbone sequence taggers on the combination of \(\mathcal {D}^{train}\) and \(\mathcal {D}^{syn}\).
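The selection procedure can be sketched as follows; the helper names and the treatment of τ as a fraction of the ranked candidates are illustrative assumptions.

```python
import random

def jaccard(s1, s2):
    """Correlation of two mentions via their description sets (Eq. 5)."""
    union = s1 | s2
    return len(s1 & s2) / len(union) if union else 0.0

def sample_substitute(target, candidates, descriptions, tau=0.5):
    """Rank same-type candidates by correlation with the target mention,
    keep the top tau fraction, and draw one substitute at random."""
    ranked = sorted(candidates,
                    key=lambda m: jaccard(descriptions[target], descriptions[m]),
                    reverse=True)
    kept = ranked[:max(1, int(len(ranked) * tau))]
    return random.choice(kept)
```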

3.4 Demonstration guided feature-level augmentation

Existing data augmentation methods for sequence tagging tasks focus on synthesizing new instances when the labeled training set is small. However, we can also consider this issue from another point of view, i.e., how to make full use of limited labeled data. To this end, we propose to further conduct feature-level data augmentation by enhancing the training process with demonstrations. Specifically, when extracting features from the token sequence, the sequence tagger will receive a set of demonstration tokens that can provide extra information for tagging. As shown in Figure 2, for a rare token “apache” without enough sample exposure, the tagger can hardly predict its correct label “B-PRODUCT”. However, if we associate “apache” with some demonstrations like “proxy, git, anaconda, android”, the tagging of “apache” will become less challenging.

Fig. 2 Illustration of the demonstration guided feature-level data augmentation

We now illustrate the feature-level data augmentation in two steps, i.e., retrieving demonstrations from the training data and augmenting token features with demonstrations. We take a training instance x = {x1, ... , xn} in \(\mathcal {D}^{train}\) as the example.

3.4.1 Retrieve demonstrations from training data

For a target token xi ∈ x, we define its demonstrations as a set of tokens {d1, d2, ... , dK} where each dj has similar syntactic and semantic characteristics to xi. For convenience, we map both xi and dj to the word vocabulary \(\mathcal {V}\) and change the notation accordingly. The problem can then be reformulated as follows: for a target token v, how do we select another token \(\widetilde {v} \in \mathcal {V}^{train}\) (the vocabulary of \(\mathcal {D}^{train}\)) that is qualified to serve as v’s demonstration? To answer this question, we resort to three different attributes: the semantic meaning, the part-of-speech tag, and the dependency relation.

  • Semantic meaning. We use the pre-trained GloVe embedding to obtain the vectors vsem and \(\widetilde { {v}}_{sem}\) for v and \(\widetilde {v}\), respectively. We then calculate the semantic similarity between v and \(\widetilde {v}\):

    $$ sem.sim(v, \widetilde{v}) = cosine({{v}_{sem}},{\widetilde{{v}}_{sem}}), $$
    (6)

    where cosine(⋅,⋅) is the cosine similarity.

  • Part-of-speech tag. In each sentence where v has appeared, we can use a one-hot vector vpos \(\in \mathcal {R}^{N_{pos}}\) to represent its POS tag, where Npos is the number of POS types. Notice that many tokens (e.g., “like”) are polysemous and can serve as different POS tags in different contexts. Therefore, we choose to summarize the global usages < vpos > of v by merging its POS vectors in all sentences:

    $$ <{v}_{pos}> = \{{v}_{pos,l=1} ~\lvert~ {v}_{pos,l=2} ~\lvert~ ... ~\lvert~ {v}_{pos,l=\lvert \mathcal{D}^{train}\rvert} \} $$
    (7)

    where \(\lvert \) is the dimension-wise OR operation. Similarly, we can obtain \(<\widetilde { {v}}_{pos}>\) for \(\widetilde {v}\):

    $$ <\widetilde{{v}}_{pos}> = \{\widetilde{{v}}_{pos,l=1} ~\lvert~ \widetilde{{v}}_{pos,l=2} ~\lvert~ ... ~\lvert~ \widetilde{{v}}_{pos,l=\lvert\mathcal{D}^{train}\rvert} \} $$
    (8)

    We then calculate the POS similarity between v and \(\widetilde {v}\) as follows:

    $$ pos.sim(v, \widetilde{v}) = cosine(<{v}_{pos}>,<\widetilde{{v}}_{pos}>). $$
    (9)
  • Dependency relation. As we illustrated in Section 3.3, dependency relations can be divided into head- and tail-related ones. In each sentence where v has appeared, we can use a one-hot vector vhead and a multi-hot vector vtail to represent the involved head and tail relation, where each vector \(\in \mathcal {R}^{N_{dep}}\) and Ndep is the number of relation types. Then we concatenate them to form the whole dependency vector vdep \(\in \mathcal {R}^{2\times N_{dep}}\). Following the steps in calculating the POS similarity, we can obtain the global usages < vdep > for v and \(<{\widetilde { {v}}_{dep}}>\) for \(\widetilde {v}\), then calculate the dependency similarity between v and \(\widetilde {v}\):

    $$ dep.sim(v, \widetilde{v}) = cosine(<{v}_{dep}>,<\widetilde{{v}}_{dep}>). $$
    (10)

After calculating three different types of attributes’ similarities, we can obtain the overall similarity score between v and \(\widetilde {v}\):

$$ attr.sim(v, \widetilde{v}) = sem.sim + pos.sim + dep.sim. $$
(11)

Consequently, we can obtain an attr.sim score matrix M\(^{train}\in \mathcal {R}^{\lvert V^{train}\rvert \times \lvert {V}^{train}\rvert }\). After ranking, for v, we select the top-K tokens as its demonstrations {d1, d2, ... , dK} and record their attr.sim scores {a1, a2, ... , aK}.
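A sketch of this retrieval step is shown below, assuming the GloVe vectors and the OR-merged global usage vectors (Eqs. 7-8) are pre-computed and stored in dictionaries; the function and variable names are illustrative.

```python
import numpy as np

def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

def attr_sim(v, v_tilde, glove, pos_usage, dep_usage):
    """Overall similarity (Eq. 11): semantic + POS + dependency similarities."""
    return (cosine(glove[v], glove[v_tilde])
            + cosine(pos_usage[v], pos_usage[v_tilde])
            + cosine(dep_usage[v], dep_usage[v_tilde]))

def retrieve_demonstrations(v, train_vocab, K, glove, pos_usage, dep_usage):
    """Return the top-K demonstration tokens for v and their attr.sim scores."""
    scored = [(u, attr_sim(v, u, glove, pos_usage, dep_usage))
              for u in train_vocab if u != v]
    scored.sort(key=lambda t: t[1], reverse=True)
    return [u for u, _ in scored[:K]], [a for _, a in scored[:K]]
```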

We retrieve demonstrations for both training and test data, so they can benefit both the training and inference processes. During testing, we can calculate the attr.sim score matrix M\(^{test}\in \mathcal {R}^{\lvert V^{test}\rvert \times \lvert {V}^{train}\rvert }\) in a similar way. However, a problem here is that we cannot obtain < vpos > and < vdep > since the whole test data is unseen. Therefore, we use the local vpos and vdep of the current test sample to calculate the part-of-speech and dependency similarities. The retrieval process is a one-time job and often finishes in ten seconds.

3.4.2 Augment token features with demonstrations

After obtaining demonstrations, we can conduct feature-level augmentation with them. Generally, we follow a simple rule, i.e., injecting the demonstrations after the pre-trained module. Moreover, considering the difference in the amount of information carried by GloVe and BERT, we propose two augmentation methods accordingly.

In the GloVe backbone, our target is the vector ei of each token xi. Specifically, we first map xi’s demonstration tokens di,k to the vectors di,k, then aggregate them to a single vector \(\widetilde { {d}}_{i}\) according to the similarity scores ai,k:

$$ \begin{array}{ll} \{{d}_{i,1}, ... , {d}_{i,K}\} &= \textit{GloVe-Lookup}~(\{{d}_{i,1}, ... , {d}_{i,K}\}),\\ \widetilde{{d}}_{i} &= \sum\limits_{k=1}^{K} {d}_{i,k} \cdot {{a}}_{i,k}. \end{array} $$
(12)

We then calculate a dimension-wise gate gi to augment ei with \(\widetilde { {d}}_{i}\):

$$ \begin{array}{ll} {g}_{i}&= \sigma~(\textbf{W}_{1}({e_{i}} \oplus\widetilde{{d}}_{i})),\\ {r}_{i}&= {g}_{i} \odot({e_{i}}\oplus\widetilde{{d}}_{i}), \end{array} $$
(13)

where ei is the GloVe word vector, \(\widetilde {d}_{i}\) is the demonstration vector, W1 is a transformation matrix, σ is the Sigmoid function, ⊕ is concatenation, and ⊙ is element-wise multiplication. Lastly, we send the augmented vector ri instead of ei to the CNN encoder (trained from scratch) for extracting hidden states, while the other modules remain unchanged.
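A minimal PyTorch sketch of this gating step (Eqs. 12-13) is given below; the module name and tensor shapes are assumptions.

```python
import torch
import torch.nn as nn

class GloVeDemoAugment(nn.Module):
    """Aggregate demonstration vectors by their attr.sim scores (Eq. 12) and
    gate the concatenation of the token vector and the aggregated
    demonstration vector (Eq. 13)."""
    def __init__(self, embed_dim):
        super().__init__()
        self.W1 = nn.Linear(2 * embed_dim, 2 * embed_dim)

    def forward(self, e_i, demo_vecs, demo_scores):
        # demo_vecs: (K, d) GloVe vectors of demonstrations; demo_scores: (K,)
        d_i = (demo_vecs * demo_scores.unsqueeze(-1)).sum(dim=0)   # Eq. 12
        cat = torch.cat([e_i, d_i], dim=-1)                        # e_i concat d_i
        g_i = torch.sigmoid(self.W1(cat))                          # dimension-wise gate
        return g_i * cat                                           # r_i, fed to the CNN encoder
```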

In the BERT backbone, we adopt a different strategy since the contextualized BERT representations are very informative. Therefore, we do not interfere with the encoding process but inject demonstrations into the hidden states hi. Specifically, we first use the embedding layer inside BERT to transform the demonstration tokens di,k into \(\widetilde { {d}}_{i}\), then pass them to the BERT encoder and obtain the hidden states \(\widetilde { {h}}_{i}\). Afterwards, we calculate a single-value gate gi to combine hi and \(\widetilde { {h}}_{i}\):

$$ \begin{array}{ll} {g}_{i}&= \sigma~(\textbf{W}_{2}({h_{i}} \oplus\widetilde{{h}}_{i})),\\ {r}_{i}&= {g}_{i} \cdot{h_{i}} + (1-{g_{i}}) \cdot \widetilde{{h}}_{i} , \end{array} $$
(14)

where W2 is a transformation matrix, hi and \(\widetilde { {h}}_{i}\) are the hidden states of input tokens and demonstrations. Lastly, we send ri instead of hi to the token classifier and make predictions. After augmenting token features with demonstrations in the learning procedure, a sequence tagger can model an instance not only under the guidance of its own label, but also under the guidance of other training instances. Consequently, the coupling relationship among instances can help the sequence tagger converge to a better state given limited training data.
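A corresponding sketch of the BERT-side gate (Eq. 14) might look as follows; again, the module name is an assumption.

```python
import torch
import torch.nn as nn

class BertDemoAugment(nn.Module):
    """Interpolate the token's BERT hidden state h_i and the demonstration
    hidden state with a single-value gate (Eq. 14); the result r_i replaces
    h_i before the token classifier."""
    def __init__(self, hidden_dim):
        super().__init__()
        self.W2 = nn.Linear(2 * hidden_dim, 1)

    def forward(self, h_i, h_demo):
        g_i = torch.sigmoid(self.W2(torch.cat([h_i, h_demo], dim=-1)))
        return g_i * h_i + (1 - g_i) * h_demo
```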

4 Experiment

In this section, we first present the experimental setup, then compare the proposed D3A method with the state-of-the-art data augmentation baselines.

4.1 Experimental setup

4.1.1 Datasets

We examine D3A on two sequence tagging tasks, named entity recognition (NER) and aspect-based sentiment analysis (ABSA), with two datasets per task. For NER, we use the WNUT16 [40] and WNUT17 [5] datasets constructed from Twitter and adopt the original train/development/test splits. For ABSA, we merge the restaurant datasets from the ABSA tasks in SemEval 2014 [33], 2015 [32], and 2016 [31], and use the laptop dataset from SemEval 2014 Task 4 [33]. Since there are no official development sets, we randomly sample 20% of the training instances from each dataset as the development set and use the remaining instances for training. The detailed statistics of the datasets are presented in Table 1. We use four different training set sizes to examine the effectiveness of data augmentation methods in different scenarios: SMALL (S) contains 50 training instances, MEDIUM (M) contains 150, LARGE (L) contains 300, and FULL (F) uses the complete training set. Generally, the S, M, and L settings can be considered low-resource scenarios.

Table 1 The statistics of datasets

4.1.2 Settings

We pre-process each dataset by lowercasing all words and use Stanford CoreNLP [27] for dependency parsing. There are Npos = 45 classes of POS tags and Ndep = 40 classes of dependency relations in four datasets.

In the GloVe-based backbone, we use the glove.840B.300d.txt vectors. The kernel size and the number of convolution layers in the CNN encoder are set to 3 and 4, respectively. Dropout [39] is applied to the convolution layers’ outputs with a probability of 0.5. In the BERT-based backbone, we use the officially released bert-base-uncased pre-trained model [6]. We train the GloVe/BERT backbones for 100/15 epochs, respectively, using the Adam optimizer [13] with learning rates of 1e-4/3e-5 and a batch size of 8 on a single NVIDIA 3090 GPU.

In the description guided instance-level augmentation, the threshold τ for preserving candidate mentions is tuned from 0.1 to 1.0 in steps of 0.1. In the demonstration guided feature-level augmentation, we set the number of demonstrations K = 10. If there are synthetic instances in the training data, feature-level augmentation is also conducted on these instances.

4.1.3 Evaluation protocol

We report F1-scores for both NER and ABSA tasks in different scenarios, and also present the mean improvement δ for clear comparison. To compute F1-scores, a prediction is considered correct only if it exactly matches both the mention span and the mention type. We run the experiments five times with random initialization and report the averaged results. The checkpoint achieving the maximum F1-score on the development set is used for evaluation on the test set.
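For clarity, exact-match span-level F1 can be sketched as follows, assuming gold and predicted mentions are collected as sets of (sentence id, start, end, type) tuples over the test set; this is a generic illustration rather than the evaluation script used in the paper.

```python
def span_f1(gold_mentions, pred_mentions):
    """A prediction counts as correct only if span and type both match exactly."""
    tp = len(gold_mentions & pred_mentions)
    precision = tp / len(pred_mentions) if pred_mentions else 0.0
    recall = tp / len(gold_mentions) if gold_mentions else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```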

4.2 Compared methods

Details of compared methods are listed below. For all data augmentation methods, the ratio of gold data to synthetic data is 1:3, which means the augmented training set is four times larger than before.

  • NoAug : No augmentation. It only uses the gold training data.

  • DUP : Duplication. A naive augmentation method that simply duplicates the gold data three times. This is an important baseline to observe the effectiveness of other data augmentation methods.

  • LwTR : Label-wise token replacement [4]. For each token in the sequence, a binomial distribution is used to randomly decide whether it should be replaced. If yes, the token is replaced by a randomly selected token with the same label.

  • SR : Synonym replacement [4]. It is similar to LwTR, except that the token is replaced with one of its synonyms retrieved from WordNet.

  • MR : Mention replacement [4]. For each mention in the instance, a binomial distribution is used to randomly decide whether it should be replaced. If yes, the mention is replaced by another mention from the original training set which has the same entity type (a minimal sketch is given after this list).

  • SiS : Shuffle within segments [4]. It first splits the token sequence into segments of the same label, so that each segment corresponds to either a mention or a sequence of out-of-mention tokens. Then, for each segment, a binomial distribution is used to randomly decide whether it should be shuffled. If yes, the order of the tokens within the segment is shuffled, while the label order is kept unchanged.

  • SeqMix : Sequence mixup [10]. It creates new synthetic instances by softly combining token/label sequences from the training data. The proportion of the mixture is sampled from a Beta distribution.

  • DAGA : Data augmentation with a generation approach [7]. It is a two-step augmentation method. First, a language model over sequences of labels and words linearized as per a certain scheme is learned. Second, sequences are sampled from the fixed language model and de-linearized to generate new tokens and labels.
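For reference, the MR baseline described above can be sketched as follows; the mention_pool structure (mapping an entity type to same-type mentions from the training set) and the function name are illustrative assumptions.

```python
import random

def mention_replacement(tokens, labels, mention_pool, p=0.5):
    """Replace each mention, with probability p, by a random same-type mention."""
    out_tokens, out_labels, i = [], [], 0
    while i < len(tokens):
        if labels[i].startswith("B-") and random.random() < p:
            etype = labels[i][2:]
            j = i + 1
            while j < len(labels) and labels[j] == f"I-{etype}":
                j += 1                                    # skip the original mention
            new_toks, new_labs = random.choice(mention_pool[etype])
            out_tokens += new_toks
            out_labels += new_labs
            i = j
        else:
            out_tokens.append(tokens[i])
            out_labels.append(labels[i])
            i += 1
    return out_tokens, out_labels
```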

4.3 Main results

The comparison results of different methods are shown in Table 2. Clearly, our proposed D3A achieves a new state-of-the-art performance on all four datasets. For GloVe-ABSA, BERT-ABSA, GloVe-NER, and BERT-NER,Footnote 3 D3A outperforms NoAug by 11.94%, 5.22%, 7.40%, and 6.77%. It also outperforms the second-best augmentation baselines by 2.90%, 1.75%, 0.66%, and 1.60%, respectively. When inspecting the results in detail, we can further draw three conclusions.

Table 2 Comparison of different methods for two sequence tagging tasks

Firstly, compared with the backbone models without synthetic data (i.e., NoAug in each table), almost all data augmentation methods improve the performance on sequence tagging tasks. However, we find that simply duplicating the gold data several times (i.e., DUP) can also achieve promising performance, and even occasionally surpasses some baseline methods like LwTR. The reason is that, in NoAug, the hyper-parameters of the sequence taggers are identical to those of the other methods but the training instances in settings S, M, and L are very limited. Therefore, the insufficient exposure of instances causes underfitting of the sequence taggers and deteriorates the performance. After duplicating training instances, the taggers can converge to a better state than before and achieve a performance gain. We believe that the comparison with DUP is an important standard for judging the effectiveness of data augmentation methods, but it is often ignored by previous work. Notably, the proposed D3A consistently outperforms DUP in terms of the mean F1-score improvement δ in all scenarios.

Secondly, compared with NoAug, the performance gain brought by data augmentation methods is approximately inversely proportional to the size of the training data. On small (S) training sets, all data augmentation methods achieve significant improvements over NoAug. On full (F) training sets, however, performance plateaus and even decreases in some scenarios. The reason is intuitive. When the training instances are inadequate, the mentions only co-occur with limited contexts; therefore, synthetic instances can increase the data diversity and bring about performance gains. As the training set grows, more and more collocations between mentions and contexts are already covered by the gold data, and some low-quality synthetic instances instead act as noise and may even poison the sequence taggers. Therefore, data augmentation sometimes becomes optional when there is enough training data.

Thirdly, augmenting BERT-based sequence taggers is more difficult than augmenting GloVe-based ones. For example, in the ABSA task, D3A improves the GloVe backbone by 11.94%, but this value for the BERT backbone is only 5.22%. As shown in previous studies [4, 6], the stacked transformer encoders pre-trained on large-scale external data make BERT more powerful than static word embeddings like GloVe in natural language understanding. Therefore, compared with the GloVe backbones, the knowledge carried by the synthetic instances is less useful for the BERT backbones.

5 Deep analysis

In this section, we present an in-depth analysis including the augmentation of SOTA methods with D3A, ablation study, parameter study, and case study.

5.1 Augmentation of SOTA methods with D3A

To demonstrate the effectiveness of D3A, we further examine it with state-of-the-art methods for aspect-based sentiment analysis and named entity recognition, respectively. According to the public leaderboards,Footnote 4 we select BERT-PTFootnote 5 [45] and SANERFootnote 6 [29] as the competitors and present the results in Table 3.

Table 3 Augmentation of SOTA methods in ABSA and NER with D3A

For ABSA, BERT-PT post-trains the BERT model with in-domain corpus from the large-scale Yelp and Amazon reviews. With full training data, BERT-PT achieves 74.84% and 66.03% F1-scores on the Restaurant and Laptop datasets (our best backbone achieves 72.19% and 60.92%). For NER, SANER trains the transformer encoders with semantic augmentation. With full training data, SANER achieves 48.22% and 46.08% F1-scores on the WNUT16 and WNUT17 datasets (our best backbone achieves 43.10% and 41.86%).

After augmenting SOTA methods with D3A, we further obtain 2.02% and 8.12% mean improvements on two tasks. The improvements mainly come from situations with limited training data like S and M. Since BERT-PT is fully pre-trained, it is relatively stable under all settings. On the contrary, the encoders inside SANER are trained from scratch and can only achieve promising performance when training data is adequate.

5.2 Ablation study

To validate the effectiveness of the designs in D3A, we conduct a series of ablation studies on both instance-level and feature-level augmentation. The results are presented in Table 4.

Table 4 Ablation study

In variants 1∼2, we remove the feature-level or instance-level augmentation, respectively, and the drop in F1-scores demonstrates the effectiveness of both levels of augmentation. Moreover, the description guided instance-level augmentation is more important than the demonstration guided feature-level augmentation, especially in low-resource scenarios (S, M, L). The reason is that the feature-level augmentation also needs to learn from the training data, and is thus less effective when labeled instances are scarce. Therefore, in practice, we suggest a hierarchical augmentation framework that places the feature-level augmentation on top of the instance-level augmentation.

In variants 3∼4, we examine the effectiveness of different dependency paths for constructing the descriptions in the instance-level augmentation. In ABSA, the tail-related paths are more important than the head-related paths, but the opposite is true in NER. Since the tails of a given token may comprise multiple tokens while its head is exactly one token, tail-related paths show more diversity while head-related paths show higher accuracy. In ABSA, the polarity of an aspect term is determined by its contexts, such as verbs (“love”) and adjectives (“good”), so considering more related words is beneficial to sentiment classification. In contrast, the categories of named entities in NER depend mainly on the entities themselves, and hence the more accurate paths are more useful for generating synthetic instances.

In variants 5∼7, we examine the impacts of different similarities for retrieving the demonstrations in the feature-level augmentation. By only preserving one of three similarities, we can find that all the similarities are important and none of them can completely cover the others.

5.3 Parameter study

There are two key hyperparameters in D3A: the threshold τ in constructing descriptions at the instance level and the number of demonstrations K at the feature level. Here we investigate their impacts by varying them in certain ranges and observing the performance trends.

Figure 3 shows the impacts of τ by varying its value in the range [0.1, 1.0] in steps of 0.1. Although the trends of the curves are not very obvious, we can analyze the results by marking the best-performing τ for different training data sizes. For example, with small (S) training data, diversity is more important since candidate mentions are rare, and the best results are achieved when τ ∈ [0.6, 1.0]. On the contrary, with full (F) training data, τ ∈ [0.1, 0.4] brings promising performance since accuracy becomes dominant when candidate mentions are adequate.

Fig. 3 Impacts of the threshold τ in the instance-level augmentation

Figure 4 shows the impacts of K by varying its value in the range [1, 10] in steps of 1. When more demonstrations are injected, the curves of GloVe-ABSA and GloVe-NER are generally upward. This trend is reasonable since GloVe embeddings contain limited knowledge and more demonstrations provide more supporting information. For BERT-ABSA and BERT-NER, in contrast, only 1∼3 demonstrations are needed to achieve satisfactory performance since the BERT backbone already embeds sufficient knowledge. In this case, the additional demonstrations with low similarity scores tend to be noisy rather than informative.

Fig. 4 Impacts of the number of demonstrations K in the feature-level augmentation

5.4 A Closer look at D3A

In this section, we take a closer look at D3A. As shown in Table 5, we first present several synthetic instances generated by MRFootnote 7 and D3A, and compare them with the gold instances to observe the synthetic quality of the description-guided instance-level augmentation. Take the instance from the Restaurant dataset as an example. For “staff” tagged as “B-Positive”, MR replaces it with “tables” while D3A replaces it with “maitre-d”. Obviously, “maitre-d” is a more suitable mention than “tables” to serve as a subject having the attitude “horrible”. Thanks to such qualified synthetic instances, D3A is powerful for instance-level data augmentation, as shown in the ablation study. We then examine the retrieved demonstrations in Table 6 to observe the influence of the demonstration-guided feature-level augmentation. For a target token like “food”, its encoded feature in the sequence tagger is augmented by related tokens like “meal, pizza, dessert”. Therefore, it becomes much easier for sequence taggers to recognize “food” as an aspect term.

Table 5 Comparison of gold instances and synthetic instances generated by MR and D3A
Table 6 Case study of tokens and their demonstrations in different datasets

6 Conclusion

In this paper, we propose a description and demonstration guided data augmentation method D3A for sequence tagging. By combining both instance-level and feature-level augmentation, D3A can effectively improve the performance and generalization of sequence taggers. Specifically, at the instance level, we construct descriptions for mentions via head-related and tail-related dependency paths and generate reliable synthetic data. At the feature level, we retrieve demonstrations for tokens to enhance the learning capability of sequence taggers given limited training data. We conduct extensive experiments on NER and ABSA using different sizes of training sets. The results on both GloVe-based and BERT-based backbone sequence taggers demonstrate that D3A can significantly improve the performance for sequence tagging tasks, especially in low-resource scenarios. In the future, we plan to investigate other methods for instance-level and feature-level augmentation, and generalize the data augmentation methods to more NLP tasks like relation extraction and event extraction.