1 Introduction

Deep learning methods usually need large-scale labeled data to train neural networks. Since collecting and annotating data incur high time and labor costs, recent studies have paid attention to data augmentation techniques for automatically generating synthetic instances and increasing data diversity [38]. Data augmentation was first used in the field of computer vision (CV), where images are augmented by rotation, cropping, masking, color jittering, gray scaling, etc. [15, 41]. It was then quickly extended to the field of natural language processing (NLP). Different from images in CV, language is more delicate, since even a slight modification may change the original semantic meaning. To avoid changing the labels after perturbing language, existing data augmentation studies in NLP mainly concentrate on coarse-grained sentence-level tasks such as machine translation [8, 37], text classification [14, 44], and question answering [1, 21]. Frequently used perturbation methods include back translation, random insertion, random swap, and random deletion.

Data augmentation research on fine-grained sequence tagging tasks like named entity recognition (NER) and aspect-based sentiment analysis (ABSA) is still limited. The main reason is that sequence tagging tasks are defined at the token level: neural models are trained to capture the one-to-one correspondence between tokens and their labels, so perturbing a token sequence may produce a wrong label sequence.

For example, if “Disney” is deleted from the text segment “love Disney Land” tagged as “O B-facility I-facility”, the perturbed label sequence will be “O I-facility” which is not allowed by the tagging rule in NER (no beginning of an entity).

Among the few data augmentation methods for sequence tagging tasks, most modify sentence-level perturbation methods by adding constraints to keep the token-label correspondence [4]. Related techniques include label-wise token replacement, shuffle within segments, mention replacement, etc. The main drawback of these methods is that, due to the randomness in perturbation, they are not stable enough to produce high-quality synthetic data. Another type of method generates synthetic instances by pre-training a customized language model and then sampling from it [7]. However, the language model needs to be trained on enough labeled data, which is not suitable for low-resource scenarios. Moreover, when sampling synthetic instances from the language model, several manual rules must be defined in advance and then used to filter out low-quality outputs. Besides these specific drawbacks, both types of methods are limited to instance-level augmentation. In other words, they only focus on generating more synthetic instances, but neglect to help sequence taggersFootnote 1 make better use of the limited training data.

In this paper, we propose a description and demonstration guided data augmentation (D3A) method for sequence tagging, which not only enhances the quality of the produced synthetic instances, but also strengthens the learning capability of the neural models. In particular, the description is a collection of dependency paths that acts as the reference for producing new data in instance-level augmentation, and the demonstration is a set of syntactically or semantically related tokens that serves as the evidence for feature-level augmentation.

At the instance level, our goal is to generate more reliable synthetic instances with the help of descriptions. In a sequence tagging task, we can divide the token sequence into two types of tokens, namely mentions (e.g., named entities in NER and aspect terms in ABSA) and contexts (non-mention tokens). To increase the data diversity, mention replacement is a feasible way that keeps a mention’s contexts unchanged and replaces the mention itself with another one to create a synthetic instance. However, existing methods [4] often select mentions for replacement at random and thus cannot ensure high-quality synthetic instances. To solve this problem, we propose to construct descriptions for mentions to refine the replacement procedure. Specifically, we first summarize each mention with a description set that includes the involved dependency paths (e.g., MENTION\(\overset {nsubj}{\longrightarrow }\)JJ). Then, in order to find compatible mentions that match the original contexts, we calculate the correlation between the original mention and each candidate substitutive mention. Lastly, we rank the candidate mentions w.r.t. their correlations and randomly choose several mentions that are qualified for synthesizing new instances.

At the feature level, we turn to making better use of the real but limited training samples. For this purpose, we introduce demonstrations for the token sequence to enhance the learning capability of sequence taggers. Specifically, for a token in the sequence, its demonstrations consist of a set of tokens that have appeared in training instances. These demonstration tokens play similar syntactic and semantic roles to the original token and can provide extra evidence for tagging. After augmenting token features with demonstrations in the learning procedure, a sequence tagger can model an instance not only under the guidance of its own label, but also under the guidance of other training instances. Consequently, the coupling relationship among instances can help the sequence tagger converge to a better state than before.

We conduct extensive experiments on two sequence tagging tasks including NER and ABSA, with two datasets per task. The experimental results demonstrate that after introducing descriptions at the instance level and demonstrations at the feature level, our proposed method can significantly improve the performance of sequence taggers, especially in low-resource scenarios.

2 Related work

In this section, we first introduce two related sequence tagging tasks: NER and ABSA. We then present a brief summary of existing data augmentation methods for sequence tagging tasks.

2.1 Named entity recognition (NER)

NER is a task for identifying and categorizing target keyphrases (entities) such as person, organization, time, and location. NER usually acts as the first step in the NLP processing pipeline, and serves as the information source for downstream tasks like question answering, information retrieval, and relation extraction. Early studies mainly use handcrafted features, manual rules, and linguistic lexicons [22, 30, 35]. Recent studies focus on designing neural models which require little feature engineering and expert knowledge [16, 19, 46, 48, 53]. Under the deep learning framework, NER is typically defined as a sequence tagging task. For example, in the text sentence “Made it back home to GA. Time to start planning the next Disney World trip.”, “GA, Disney World” are tagged as “B-Location, B-facility I-facility”, respectively, while other tokens are tagged as “O”.Footnote 2

2.2 Aspect-based sentiment analysis (ABSA)

ABSA is a fine-grained task that aims to summarize the opinions of users towards specific aspects in reviews. With the rapid growth of the World Wide Web and social media, ABSA has been widely applied to various fields to analyze texts like product reviews, forum discussions, and blog posts. For example, given the sentence “The pizza here is also absolutely delicious.”, ABSA needs to extract the aspect term and classify its sentiment polarity, tagging “pizza” as “B-positive” and the other tokens as “O”. Most existing studies treat ABSA as a two-step task and develop separate methods for aspect term extraction [23, 24, 26, 34, 42, 47, 51] and aspect-level sentiment classification [2, 11, 12, 18, 20, 25, 54]. To obtain the complete ABSA predictions, results from the two steps must be merged in a pipeline manner, which however may lead to error propagation. To address this problem, recent studies in ABSA design end-to-end sequence taggers that directly map tokens to their collapsed labels [3, 17, 28, 50].

2.3 Data augmentation for sequence tagging

Data augmentation originated in computer vision [38] and was then quickly applied to natural language processing [9]. Most existing data augmentation methods are designed for document-level and sentence-level tasks such as machine translation [8, 37], text classification [14, 44], and question answering [1, 21]. Frequently used methods for perturbing language include back translation, random insertion, random swap, random deletion, etc. In this scenario, synthetic data is relatively easy to obtain since the labels usually remain unchanged after language perturbation.

For fine-grained sequence tagging tasks, where tokens and labels have a fragile one-to-one correspondence, there are only a few studies available. Sahin and Steedman [36] use dependency tree morphing (sentence cropping and sentence rotating) to generate synthetic data for POS tagging, but the produced synthetic data is not semantically smooth and is hard to interpret. Ding et al. [7] first pre-train a customized language model by concatenating tokens with their labels, then sample outputs from this language model and transform them into synthetic data. Their language model is trained on at least 1k annotated sentences, which is not suitable for low-resource scenarios. Moreover, several manual rules need to be pre-defined and then used to filter out low-quality outputs. Dai et al. [4] modify sentence-level augmentation methods by adding additional constraints, and propose several methods for NER including label-wise token replacement, synonym replacement, and mention replacement. These methods show improved performance in both recurrent and transformer models, but the synthetic data is not stable enough and contains harmful noise. Zhang et al. [52] adapt the MIXUP [49] technique for active sequence labeling by augmenting queried samples, which also requires manual labor. Guo et al. [10] also refer to MIXUP and create new synthetic instances by softly combining token/label sequences following the Beta distribution, which blends mentions with non-mentions and does not perform well for sequence tagging. Apart from these specific drawbacks, existing data augmentation methods are all limited to instance-level augmentation. They only pay attention to synthesizing instances, but neglect to help taggers make better use of the limited training data.

Different from prior methods, our proposed approach generates synthetic samples with the guidance of descriptions, and hence it can produce stable results. In addition, our approach performs feature-level augmentation for training samples, which can further improve the learning capability of neural networks with limited training data.

3 Methodology

In this section, we first introduce the definition of sequence tagging and the overview of the proposed method. We then present two backbone sequence taggers as the carrier and tester of data augmentation methods. Lastly, we illustrate our method in detail.

3.1 Problem definition

In sequence tagging tasks, a sequence tagger is trained to learn a mapping function f between a token sequence x = {x1, ... , xn} and a label sequence y = {y1, ... , yn}, i.e., \(f:x\rightarrow y\). Each xi is a token in natural language and each yi belongs to the label set {B-X, I-X, O}. Specifically, the first token of a mention of type X is tagged as B-X, the tokens inside that mention are tagged as I-X, and the contexts (non-mention tokens) are tagged as O.

In this work, we focus on the data augmentation issue for sequence tagging. Given a training set containing (usually limited) gold-labeled instances \(\mathcal {D}^{train}\), our goal is to generate some synthetic labeled instances \(\mathcal {D}^{syn}\) to improve the diversity of the training set, and promote the sequence tagger’s performance in tagging the unseen test set \(\mathcal {D}^{test}\).
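For concreteness, the example from Section 2.1 can be written as parallel token and label sequences. The following is a minimal illustration; the tokenization shown is an assumption, not taken from the datasets.

```python
# Illustrative BIO-tagged instance (example sentence from Section 2.1).
tokens = ["Made", "it", "back", "home", "to", "GA", ".",
          "Time", "to", "start", "planning", "the", "next",
          "Disney", "World", "trip", "."]
labels = ["O", "O", "O", "O", "O", "B-Location", "O",
          "O", "O", "O", "O", "O", "O",
          "B-facility", "I-facility", "O", "O"]
assert len(tokens) == len(labels)  # the one-to-one token/label correspondence
```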

3.2 Backbone sequence tagger

To examine the effectiveness of different data augmentation methods, we consider two types of backbone sequence taggers: one is based on static GloVe embeddings, and the other is based on contextual BERT representations.

In the GloVe backbone, we first map each token xi to its vector ei using the static GloVe embeddings, then use an additional encoder consisting of several convolutional layers without pooling to extract the hidden state hi:

$$ \begin{array}{ll} \{{e}_{1}, ..., {e}_{n}\}&=\textit{GloVe-Lookup}~(\{x_{1}, ..., x_{n}\}),\\ \{{h}_{1}, ..., {h}_{n}\}&=\textit{CNN-Encoder}~(\{{e}_{1}, ..., {e}_{n}\}), \end{array} $$
(1)

where the parameters of the CNN encoder are learned from scratch. In contrast, in the BERT backbone, we directly use the pre-trained BERT encoder to obtain the hidden state hi:

$$ \begin{array}{ll} \{{h}_{1}, ..., {h}_{n}\}=\textit{BERT-Encoder}~(\{x_{1}, ..., x_{n}\}) \end{array} $$
(2)

In both backbones, after extracting the hidden state of each token, a classifier which consists of a linear transformation layer and a softmax function is used to predict the tags of tokens:

$$ \{\hat{y}_{1}, ..., \hat{y}_{n}\}=\textit{Classifier}~(\{{h}_{1}, ..., {h}_{n}\}) $$
(3)

Lastly, we compute the cross-entropy loss and train learnable parameters with back propagation:

$$ \mathcal{L}= - \sum\nolimits_{i=1}^{n} \sum\nolimits_{j=1}^{J} {y}_{ij} \cdot \log(\hat{y}_{ij}), $$
(4)

where n is the length of x, J is the number of label categories, and \(\hat{y}_{i}\) and yi are the prediction and the ground-truth label of token xi, respectively. Given the gold instances from \(\mathcal {D}^{train}\) and the synthetic instances from \(\mathcal {D}^{syn}\), we can train the backbone sequence taggers accordingly and then make inference on \(\mathcal {D}^{test}\).
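To make the backbones concrete, the following is a minimal PyTorch sketch of the GloVe backbone and its training step (Eqs. 1, 3, and 4); the class and function names, layer sizes, and other details are illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class GloVeCNNTagger(nn.Module):
    """Sketch of the GloVe backbone: embedding lookup, a CNN encoder without
    pooling, and a linear token classifier (softmax is folded into the loss)."""
    def __init__(self, glove_vectors, num_labels, hidden=300, kernel=3, layers=4):
        super().__init__()
        self.embed = nn.Embedding.from_pretrained(glove_vectors, freeze=False)
        blocks, in_dim = [], glove_vectors.size(1)
        for _ in range(layers):  # parameters learned from scratch
            blocks += [nn.Conv1d(in_dim, hidden, kernel, padding=kernel // 2), nn.ReLU()]
            in_dim = hidden
        self.encoder = nn.Sequential(*blocks)
        self.classifier = nn.Linear(hidden, num_labels)

    def forward(self, token_ids):                              # token_ids: (batch, n)
        e = self.embed(token_ids)                              # (batch, n, 300)
        h = self.encoder(e.transpose(1, 2)).transpose(1, 2)    # (batch, n, hidden)
        return self.classifier(h)                              # per-token logits

def training_step(model, token_ids, gold_labels, optimizer):
    """One update with the token-level cross-entropy loss (Eq. 4)."""
    logits = model(token_ids)
    loss = nn.functional.cross_entropy(
        logits.view(-1, logits.size(-1)), gold_labels.view(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The BERT backbone follows the same pattern, with the CNN encoder replaced by the pre-trained BERT encoder (Eq. 2).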

3.3 Description guided instance-level augmentation

At the instance level, D3A aims to synthesize reliable instances and add them to the training set for boosting the performance and generalization of sequence taggers. For example, for the token sequence “Can’t upload payload to my apache 2 server, pentesting exercise.” in the NER task, we can divide it into two types of tokens, namely the mention (i.e., “apache 2 server”) and the contexts (non-mention tokens). To improve the diversity of the training data, Dai et al. [4] propose to keep a mention’s contexts unchanged and replace the mention with another one to generate a synthetic instance. However, they select the substitutive mentions only based on the mention type (e.g., “PRODUCT” in this example). Consequently, an incompatible mention like “Touchscreen” may be selected to join the contexts and cause incongruity in semantics. Different from this naive mention replacement method, we propose to construct descriptions for mentions by resorting to the involved dependency paths and then search for compatible substitutive mentions for replacement. In this way, our proposed D3A can ensure the quality of the synthetic data and conduct effective instance-level data augmentation.

3.3.1 Construct descriptions via dependency paths

To capture the correlation among different mentions, we need to characterize mentions completely. Therefore, we propose to construct descriptions for mentions according to their involved dependency paths. As shown in Figure 1, for the NER instance with the mention “apache 2 server” and the mention type “PRODUCT”, we can construct the mention’s description set by collecting the involved dependency paths. According to the parsing results, the paths can be divided into two categories as follows.

  • Head-related paths. In a dependency parse tree, each token has exactly one head token (a.k.a. governor). Inspired by a recent study on aspect-level sentiment classification [43], we first reshape the original parse tree into a mention-oriented tree so as to keep the integrity of the target mention. Specifically, we treat a mention containing multiple tokens as a whole and only consider the paths outside the mention. For example, two paths inside the target mention, i.e., apache\(\overset {nummod}{\longrightarrow }\)2 and server\(\overset {compound}{\longrightarrow }\)apache, are discarded. After reshaping, we can easily collect two head-related paths: payload\(\overset {nmod}{\longrightarrow }\)M and NN\(\overset {nmod}{\longrightarrow }\)M, which correspond to the head token and its part-of-speech (POS) tag, respectively. (Here we use M to denote an arbitrary mention for simplicity.)

  • Tail-related paths. Similar to the head-related paths, after reshaping the parse tree, we can collect four tail-related paths: M\(\overset {case}{\longrightarrow }\)to, M\(\overset {case}{\longrightarrow }\)IN, M\(\overset {nmod:pass}{\longrightarrow }\)my, and M\(\overset {nmod:pass}{\longrightarrow }\)PRP$. One small difference here is that the number of tail tokens (a.k.a. dependents) is not limited to one.

After traversing the entire training data, for each mention M, we can construct its description (the set S) that contains all involved dependency paths.
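A minimal sketch of this construction step is given below; the parse representation (a list of (head index, relation, dependent index) triples with accompanying POS tags) and the function name are assumptions made for illustration.

```python
def mention_description(tokens, pos_tags, dep_edges, mention_span):
    """Collect a mention's description set from its dependency paths.
    dep_edges: (head_index, relation, dependent_index) triples from a parser;
    mention_span: (start, end) token indices, end exclusive. Paths inside the
    mention are discarded and the mention is treated as a single node M."""
    start, end = mention_span
    inside = set(range(start, end))
    description = set()
    for head, rel, dep in dep_edges:
        if head in inside and dep in inside:
            continue                                   # path inside the mention
        if dep in inside:                              # head-related path: head -> M
            description.add((tokens[head], rel, "M"))
            description.add((pos_tags[head], rel, "M"))
        elif head in inside:                           # tail-related path: M -> tail
            description.add(("M", rel, tokens[dep]))
            description.add(("M", rel, pos_tags[dep]))
    return description
```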

Fig. 1 Illustration of the description guided instance-level data augmentation

3.3.2 Search compatible substitutive mentions

For the target mention M, we consider all other mentions that belong to the same type as its candidate substitutive mentions. Once the descriptions are constructed, we start to search compatible candidates to replace the target mention and generate synthetic instances. Given the target mention M and a random candidate mention \(\hat {\texttt {M}}\), we calculate their correlation via the Jaccard similarity of their corresponding description sets S and \(\hat {\texttt {S}}\):

$$ Correlation(\texttt{M}, \hat{\texttt{M}})= Jaccard(\texttt{S}, \hat{\texttt{S}})=\frac{\lvert \texttt{S}\cap\hat{\texttt{S}}\rvert}{\lvert\texttt{S}\cup\hat{\texttt{S}}\rvert}. $$
(5)

By ranking candidate mentions based on the Jaccard similarities, we preserve the top τ proportion of related candidates and filter out the others, where τ is a hyperparameter. Generally, τ is inversely proportional to the size of the training data \({\mathcal {D}^{train}}\), as we will show in the analysis section. After that, we randomly select a substitutive mention \(\hat {\texttt {M}}\) from the preserved candidates and combine \(\hat {\texttt {M}}\) with the unchanged contexts of M to generate a synthetic instance. By this means, we refine the selection process of the naive mention replacement method and obtain more reliable synthetic instances \(\mathcal {D}^{syn}\). Finally, instance-level data augmentation is achieved by training backbone sequence taggers on the combination of \(\mathcal {D}^{train}\) and \(\mathcal {D}^{syn}\).
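The selection procedure can be sketched as follows; the helper names and the treatment of τ as a fraction of the ranked candidates are illustrative assumptions.

```python
import random

def jaccard(s1, s2):
    """Correlation of two mentions via their description sets (Eq. 5)."""
    union = s1 | s2
    return len(s1 & s2) / len(union) if union else 0.0

def sample_substitute(target, candidates, descriptions, tau=0.5):
    """Rank same-type candidates by correlation with the target mention,
    keep the top tau fraction, and draw one substitute at random."""
    ranked = sorted(candidates,
                    key=lambda m: jaccard(descriptions[target], descriptions[m]),
                    reverse=True)
    kept = ranked[:max(1, int(len(ranked) * tau))]
    return random.choice(kept)
```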

3.4 Demonstration guided feature-level augmentation

Existing data augmentation methods for sequence tagging tasks focus on synthesizing new instances when the labeled training set is small. However, we can also consider this issue from another point of view, i.e., how to make full use of limited labeled data. To this end, we propose to further conduct feature-level data augmentation by enhancing the training process with demonstrations. Specifically, when extracting features from the token sequence, the sequence tagger will receive a set of demonstration tokens that can provide extra information for tagging. As shown in Figure 2, for a rare token “apache” without enough sample exposure, the tagger can hardly predict its correct label “B-PRODUCT”. However, if we associate “apache” with some demonstrations like “proxy, git, anaconda, android”, the tagging of “apache” will become less challenging.

Fig. 2 Illustration of the demonstration guided feature-level data augmentation

We now illustrate the feature-level data augmentation in two steps, i.e., retrieving demonstrations from the training data and augmenting token features with demonstrations. We take a training instance x = {x1, ... , xn} in \(\mathcal {D}^{train}\) as the example.

3.4.1 Retrieve demonstrations from training data

For a target token xi ∈ x, we define its demonstrations as a set of tokens {d1, d2, ... , dK} where each dj has similar syntactic and semantic characteristics to xi. For convenience, we map both xi and dj to the word vocabulary \(\mathcal {V}\) and change the notation accordingly. The problem can then be reformulated as follows: for a target token v, how do we select another token \(\widetilde {v} \in \mathcal {V}^{train}\) (the vocabulary of \(\mathcal {D}^{train}\)) that is qualified to serve as v’s demonstration? To answer this question, we resort to three different attributes: the semantic meaning, the part-of-speech tag, and the dependency relation.

  • Semantic meaning. We use the pre-trained GloVe embedding to obtain the vectors vsem and \(\widetilde { {v}}_{sem}\) for v and \(\widetilde {v}\), respectively. We then calculate the semantic similarity between v and \(\widetilde {v}\):

    $$ sem.sim(v, \widetilde{v}) = cosine({{v}_{sem}},{\widetilde{{v}}_{sem}}), $$
    (6)

    where cosine(⋅,⋅) is the cosine similarity.

  • Part-of-speech tag. In each sentence where v has appeared, we can use a one-hot vector vpos \(\in \mathcal {R}^{N_{pos}}\) to represent its POS tag, where Npos is the number of POS types. Notice that many tokens (e.g., “like”) are polysemous and can serve as different POS tags in different contexts. Therefore, we choose to summarize the global usages < vpos > of v by merging its POS vectors in all sentences:

    $$ <{v}_{pos}> = \{{v}_{pos,l=1} ~\lvert~ {v}_{pos,l=2} ~\lvert~ ... ~\lvert~ {v}_{pos,l=\lvert \mathcal{D}^{train}\rvert} \} $$
    (7)

    where \(\lvert \) is the dimension-wise OR operation. Similarly, we can obtain \(<\widetilde { {v}}_{pos}>\) for \(\widetilde {v}\):

    $$ <\widetilde{{v}}_{pos}> = \{\widetilde{{v}}_{pos,l=1} ~\lvert~ \widetilde{{v}}_{pos,l=2} ~\lvert~ ... ~\lvert~ \widetilde{{v}}_{pos,l=\lvert\mathcal{D}^{train}\rvert} \} $$
    (8)

    We then calculate the POS similarity between v and \(\widetilde {v}\) as follows:

    $$ pos.sim(v, \widetilde{v}) = cosine(<{v}_{pos}>,<\widetilde{{v}}_{pos}>). $$
    (9)
  • Dependency relation. As we illustrated in Section 3.3, dependency relations can be divided into head- and tail-related ones. In each sentence where v has appeared, we can use a one-hot vector vhead and a multi-hot vector vtail to represent the involved head and tail relation, where each vector \(\in \mathcal {R}^{N_{dep}}\) and Ndep is the number of relation types. Then we concatenate them to form the whole dependency vector vdep \(\in \mathcal {R}^{2\times N_{dep}}\). Following the steps in calculating the POS similarity, we can obtain the global usages < vdep > for v and \(<{\widetilde { {v}}_{dep}}>\) for \(\widetilde {v}\), then calculate the dependency similarity between v and \(\widetilde {v}\):

    $$ dep.sim(v, \widetilde{v}) = cosine(<{v}_{dep}>,<\widetilde{{v}}_{dep}>). $$
    (10)

After calculating three different types of attributes’ similarities, we can obtain the overall similarity score between v and \(\widetilde {v}\):

$$ attr.sim(v, \widetilde{v}) = sem.sim + pos.sim + dep.sim. $$
(11)

Consequently, we can obtain an attr.sim score matrix M\(^{train}\in \mathcal {R}^{\lvert V^{train}\rvert \times \lvert {V}^{train}\rvert }\). After ranking, for v, we select the top-K tokens as its demonstrations {d1, d2, ... , dK} and record their attr.sim scores {a1, a2, ... , aK}.
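A sketch of this retrieval step is shown below, assuming the GloVe vectors and the OR-merged global usage vectors (Eqs. 7-8) are pre-computed and stored in dictionaries; the function and variable names are illustrative.

```python
import numpy as np

def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

def attr_sim(v, v_tilde, glove, pos_usage, dep_usage):
    """Overall similarity (Eq. 11): semantic + POS + dependency similarities."""
    return (cosine(glove[v], glove[v_tilde])
            + cosine(pos_usage[v], pos_usage[v_tilde])
            + cosine(dep_usage[v], dep_usage[v_tilde]))

def retrieve_demonstrations(v, train_vocab, K, glove, pos_usage, dep_usage):
    """Return the top-K demonstration tokens for v and their attr.sim scores."""
    scored = [(u, attr_sim(v, u, glove, pos_usage, dep_usage))
              for u in train_vocab if u != v]
    scored.sort(key=lambda t: t[1], reverse=True)
    return [u for u, _ in scored[:K]], [a for _, a in scored[:K]]
```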

We retrieve demonstrations for both training and test data, so they can benefit both the training and inference processes. During testing, we can calculate the attr.sim score matrix M\(^{test}\in \mathcal {R}^{\lvert V^{test}\rvert \times \lvert {V}^{train}\rvert }\) in a similar way. However, a problem here is that we cannot obtain < vpos > and < vdep > since the whole test data is unseen. Therefore, we use the local vpos and vdep of the current test sample to calculate the part-of-speech and dependency similarities. The retrieval process is a one-time job and often finishes in ten seconds.

3.4.2 Augment token features with demonstrations

After obtaining demonstrations, we can conduct feature-level augmentation with them. Generally, we follow a simple rule, i.e., injecting the demonstrations after the pre-trained module. Moreover, considering the difference in the amount of information carried by GloVe and BERT, we propose two augmentation methods accordingly.

In the GloVe backbone, our target is the vector ei of each token xi. Specifically, we first map xi’s demonstration tokens di,k to the vectors di,k, then aggregate them to a single vector \(\widetilde { {d}}_{i}\) according to the similarity scores ai,k:

$$ \begin{array}{ll} \{{d}_{i,1}, ... , {d}_{i,K}\} &= \textit{GloVe-Lookup}~(\{{d}_{i,1}, ... , {d}_{i,K}\}),\\ \widetilde{{d}}_{i} &= \sum\limits_{k=1}^{K} {d}_{i,k} \cdot {{a}}_{i,k}. \end{array} $$
(12)

We then calculate a dimension-wise gate gi to augment ei with \(\widetilde { {d}}_{i}\):

$$ \begin{array}{ll} {g}_{i}&= \sigma~(\textbf{W}_{1}({e_{i}} \oplus\widetilde{{d}}_{i})),\\ {r}_{i}&= {g}_{i} \odot({e_{i}}\oplus\widetilde{{d}}_{i}), \end{array} $$
(13)

where ei is the GloVe word vector, \(\widetilde {d}_{i}\) is the demonstration vector, W1 is a transformation matrix, σ is the Sigmoid function, ⊕ is concatenation, and ⊙ is element-wise multiplication. Lastly, we send the augmented vector ri instead of ei to the CNN encoder (trained from scratch) for extracting hidden states, while the other modules remain unchanged.
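A minimal PyTorch sketch of this gating step (Eqs. 12-13) is given below; the module name and tensor shapes are assumptions.

```python
import torch
import torch.nn as nn

class GloVeDemoAugment(nn.Module):
    """Aggregate demonstration vectors by their attr.sim scores (Eq. 12) and
    gate the concatenation of the token vector and the aggregated
    demonstration vector (Eq. 13)."""
    def __init__(self, embed_dim):
        super().__init__()
        self.W1 = nn.Linear(2 * embed_dim, 2 * embed_dim)

    def forward(self, e_i, demo_vecs, demo_scores):
        # demo_vecs: (K, d) GloVe vectors of demonstrations; demo_scores: (K,)
        d_i = (demo_vecs * demo_scores.unsqueeze(-1)).sum(dim=0)   # Eq. 12
        cat = torch.cat([e_i, d_i], dim=-1)                        # e_i concat d_i
        g_i = torch.sigmoid(self.W1(cat))                          # dimension-wise gate
        return g_i * cat                                           # r_i, fed to the CNN encoder
```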

In the BERT backbone, we adopt a different strategy since the contextualized BERT representations are very informative. Therefore, we do not interfere with the encoding process but inject demonstrations into the hidden states hi. Specifically, we first use the embedding layer inside BERT to transform the demonstration tokens di,k into \(\widetilde { {d}}_{i}\), then pass them to the BERT encoder and obtain the hidden states \(\widetilde { {h}}_{i}\). Afterwards, we calculate a single-value gate gi to combine hi and \(\widetilde { {h}}_{i}\):

$$ \begin{array}{ll} {g}_{i}&= \sigma~(\textbf{W}_{2}({h_{i}} \oplus\widetilde{{h}}_{i})),\\ {r}_{i}&= {g}_{i} \cdot{h_{i}} + (1-{g_{i}}) \cdot \widetilde{{h}}_{i} , \end{array} $$
(14)

where W2 is a transformation matrix, hi and \(\widetilde { {h}}_{i}\) are the hidden states of input tokens and demonstrations. Lastly, we send ri instead of hi to the token classifier and make predictions. After augmenting token features with demonstrations in the learning procedure, a sequence tagger can model an instance not only under the guidance of its own label, but also under the guidance of other training instances. Consequently, the coupling relationship among instances can help the sequence tagger converge to a better state given limited training data.
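A corresponding sketch of the BERT-side gate (Eq. 14) might look as follows; again, the module name is an assumption.

```python
import torch
import torch.nn as nn

class BertDemoAugment(nn.Module):
    """Interpolate the token's BERT hidden state h_i and the demonstration
    hidden state with a single-value gate (Eq. 14); the result r_i replaces
    h_i before the token classifier."""
    def __init__(self, hidden_dim):
        super().__init__()
        self.W2 = nn.Linear(2 * hidden_dim, 1)

    def forward(self, h_i, h_demo):
        g_i = torch.sigmoid(self.W2(torch.cat([h_i, h_demo], dim=-1)))
        return g_i * h_i + (1 - g_i) * h_demo
```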

4 Experiment

In this section, we first present the experimental setup, then compare the proposed D3A method with the state-of-the-art data augmentation baselines.

4.1 Experimental setup

4.1.1 Datasets

We examine D3A on two sequence tagging tasks, named entity recognition (NER) and aspect-based sentiment analysis (ABSA), with two datasets per task. For NER, we use the WNUT16 [40] and WNUT17 [5] datasets constructed from Twitter and adopt the original train/development/test splits. For ABSA, we merge the restaurant datasets from the ABSA tasks in SemEval 2014 [33], 2015 [32], and 2016 [31], and use the laptop dataset from SemEval 2014 Task 4 [33]. Since there are no official development sets, we randomly sample 20% of the training instances from each dataset as the development set and use the remaining instances for training. The detailed statistics of the datasets are presented in Table 1. We use four different training set sizes to examine the effectiveness of data augmentation methods in different scenarios: SMALL (S) contains 50 training instances, MEDIUM (M) contains 150, LARGE (L) contains 300, and FULL (F) uses the complete training set. Generally, the S, M, and L settings can be considered low-resource scenarios.

Table 1 The statistics of datasets

4.1.2 Settings

We pre-process each dataset by lowercasing all words and use Stanford CoreNLP [27] for dependency parsing. There are Npos = 45 classes of POS tags and Ndep = 40 classes of dependency relations in four datasets.

In the GloVe-based backbone, we use the glove.840B.300d.txt vectors. The kernel size and the number of convolution layers in the CNN encoder are set to 3 and 4, respectively. Dropout [39] is applied to the convolution layers’ outputs with a probability of 0.5. In the BERT-based backbone, we use the officially released bert-base-uncased pre-trained model [6]. We train the GloVe/BERT backbones for 100/15 epochs, respectively, using the Adam optimizer [13] with learning rates of 1e-4/3e-5 and a batch size of 8 on a single NVIDIA 3090 GPU.

In the description guided instance-level augmentation, the threshold τ for preserving candidate mentions is tuned from 0.1 to 1.0 in steps of 0.1. In the demonstration guided feature-level augmentation, we set the number of demonstrations K = 10. If there are synthetic instances in the training data, feature-level augmentation is also conducted on these instances.

4.1.3 Evaluation protocol

We report F1-scores for both NER and ABSA tasks in different scenarios, and also present the mean improvement δ for clear comparison. To compute F1-scores, a prediction is considered correct only if it exactly matches both the mention span and the mention type. We run the experiments five times with random initialization and report the averaged results. The checkpoint achieving the maximum F1-score on the development set is used for evaluation on the test set.
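For clarity, exact-match span-level F1 can be sketched as follows, assuming gold and predicted mentions are collected as sets of (sentence id, start, end, type) tuples over the test set; this is a generic illustration rather than the evaluation script used in the paper.

```python
def span_f1(gold_mentions, pred_mentions):
    """A prediction counts as correct only if span and type both match exactly."""
    tp = len(gold_mentions & pred_mentions)
    precision = tp / len(pred_mentions) if pred_mentions else 0.0
    recall = tp / len(gold_mentions) if gold_mentions else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```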

4.2 Compared methods

Details of compared methods are listed below. For all data augmentation methods, the ratio of gold data to synthetic data is 1:3, which means the augmented training set is four times larger than before.

  • NoAug : No augmentation. It only uses the gold training data.

  • DUP : Duplication. A naive augmentation method that simply duplicates the gold data three times. This is an important baseline to observe the effectiveness of other data augmentation methods.

  • LwTR : Label-wise token replacement [4]. For each token in the sequence, a binomial distribution is used to randomly decide whether it should be replaced. If yes, the token is replaced by a randomly selected token with the same label.

  • SR : Synonym replacement [4]. It is similar to LwTR, except that the token is replaced with one of its synonyms retrieved from WordNet.

  • MR : Mention replacement [4]. For each mention in the instance, a binomial distribution is used to randomly decide whether it should be replaced. If yes, the mention is replaced by another mention from the original training set which has the same entity type (a minimal sketch is given after this list).

  • SiS : Shuffle within segments [4]. It first splits the token sequence into segments of the same label, so that each segment corresponds to either a mention or a sequence of out-of-mention tokens. Then, for each segment, a binomial distribution is used to randomly decide whether it should be shuffled. If yes, the order of the tokens within the segment is shuffled, while the label order is kept unchanged.

  • SeqMix : Sequence mixup [10]. It creates new synthetic instances by softly combining token/label sequences from the training data. The proportion of the mixture is sampled from a Beta distribution.

  • DAGA : Data augmentation with a generation approach [7]. It is a two-step augmentation method. First, a language model over sequences of labels and words linearized as per a certain scheme is learned. Second, sequences are sampled from the fixed language model and de-linearized to generate new tokens and labels.
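For reference, the MR baseline described above can be sketched as follows; the mention_pool structure (mapping an entity type to same-type mentions from the training set) and the function name are illustrative assumptions.

```python
import random

def mention_replacement(tokens, labels, mention_pool, p=0.5):
    """Replace each mention, with probability p, by a random same-type mention."""
    out_tokens, out_labels, i = [], [], 0
    while i < len(tokens):
        if labels[i].startswith("B-") and random.random() < p:
            etype = labels[i][2:]
            j = i + 1
            while j < len(labels) and labels[j] == f"I-{etype}":
                j += 1                                    # skip the original mention
            new_toks, new_labs = random.choice(mention_pool[etype])
            out_tokens += new_toks
            out_labels += new_labs
            i = j
        else:
            out_tokens.append(tokens[i])
            out_labels.append(labels[i])
            i += 1
    return out_tokens, out_labels
```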

4.3 Main results

The comparison results of different methods are shown in Table 2. Clearly, our proposed D3A achieves a new state-of-the-art performance on all four datasets. For GloVe-ABSA, BERT-ABSA, GloVe-NER, and BERT-NER,Footnote 3 D3A outperforms NoAug by 11.94%, 5.22%, 7.40%, and 6.77%. It also outperforms the second-best augmentation baselines by 2.90%, 1.75%, 0.66%, and 1.60%, respectively. When inspecting the results in detail, we can further draw three conclusions.

Table 2 Comparison of different methods for two sequence tagging tasks

Firstly, compared with the backbone models without synthetic data (i.e., NoAug in each table), almost all data augmentation methods improve the performance on sequence tagging tasks. However, we find that simply duplicating the gold data several times (i.e., DUP) can also achieve promising performance, and even occasionally surpasses some baseline methods like LwTR. The reason is that, in NoAug, the hyper-parameters of the sequence taggers are identical to those of the other methods but the training instances in settings S, M, and L are very limited. Therefore, the insufficient exposure of instances causes underfitting of the sequence taggers and deteriorates the performance. After duplicating training instances, the taggers can converge to a better state than before and achieve a performance gain. We believe that the comparison with DUP is an important standard for judging the effectiveness of data augmentation methods, but it is often ignored by previous work. Notably, the proposed D3A consistently outperforms DUP in terms of the mean F1-score improvement δ in all scenarios.

Secondly, compared with NoAug, the performance gain brought by data augmentation methods is approximately inversely proportional to the size of the training data. On small (S) training sets, all data augmentation methods achieve significant improvements over NoAug. On full (F) training sets, however, performance plateaus and even decreases in some scenarios. The reason is intuitive. When the training instances are inadequate, the mentions only co-occur with limited contexts; therefore, synthetic instances can increase the data diversity and bring about performance gains. As the training set grows, more and more collocations between mentions and contexts are already covered by the gold data, and some low-quality synthetic instances instead act as noise and may even poison the sequence taggers. Therefore, data augmentation sometimes becomes optional when there is enough training data.

Thirdly, augmenting BERT-based sequence taggers is more difficult than augmenting GloVe-based ones. For example, in the ABSA task, D3A improves the GloVe backbone by 11.94%, but this value for the BERT backbone is only 5.22%. As shown in previous studies [4, 6], the stacked transformer encoders pre-trained on large-scale external data make BERT more powerful than static word embeddings like GloVe in natural language understanding. Therefore, compared with the GloVe backbones, the knowledge carried by the synthetic instances is less useful for the BERT backbones.

5 Deep analysis

In this section, we present an in-depth analysis including the augmentation of SOTA methods with D3A, ablation study, parameter study, and case study.

5.1 Augmentation of SOTA methods with D3A

To demonstrate the effectiveness of D3A, we further examine it with state-of-the-art methods for aspect-based sentiment analysis and named entity recognition, respectively. According to the public leaderboards,Footnote 4 we select BERT-PTFootnote 5 [45] and SANERFootnote 6 [29] as the competitors and present the results in Table 3.

Table 3 Augmentation of SOTA methods in ABSA and NER with D3A

For ABSA, BERT-PT post-trains the BERT model with in-domain corpus from the large-scale Yelp and Amazon reviews. With full training data, BERT-PT achieves 74.84% and 66.03% F1-scores on the Restaurant and Laptop datasets (our best backbone achieves 72.19% and 60.92%). For NER, SANER trains the transformer encoders with semantic augmentation. With full training data, SANER achieves 48.22% and 46.08% F1-scores on the WNUT16 and WNUT17 datasets (our best backbone achieves 43.10% and 41.86%).

After augmenting SOTA methods with D3A, we further obtain 2.02% and 8.12% mean improvements on two tasks. The improvements mainly come from situations with limited training data like S and M. Since BERT-PT is fully pre-trained, it is relatively stable under all settings. On the contrary, the encoders inside SANER are trained from scratch and can only achieve promising performance when training data is adequate.

5.2 Ablation study

To validate the effectiveness of the designs in D3A, we conduct a series of ablation studies on both instance-level and feature-level augmentation. The results are presented in Table 4.

Table 4 Ablation study

In variants 1∼2, we remove the feature-level or instance-level augmentation, respectively, and the drop in F1-scores demonstrates the effectiveness of both levels of augmentation. Moreover, the description guided instance-level augmentation is more important than the demonstration guided feature-level augmentation, especially in low-resource scenarios (S, M, L). The reason is that the feature-level augmentation also needs to learn from the training data, and is thus less effective when labeled instances are scarce. Therefore, in practice, we suggest a hierarchical augmentation framework that places the feature-level augmentation on top of the instance-level augmentation.

In variants 3∼4, we examine the effectiveness of different dependency paths for constructing the descriptions in the instance-level augmentation. In ABSA, the tail-related paths are more important than the head-related paths, but the opposite is true in NER. Since the tails of a given token may comprise multiple tokens while its head is exactly one token, tail-related paths show more diversity while head-related paths show higher accuracy. In ABSA, the polarity of an aspect term is determined by its contexts, such as verbs (“love”) and adjectives (“good”), so considering more related words is beneficial to sentiment classification. In contrast, the categories of named entities in NER depend mainly on the entities themselves, and hence the more accurate paths are more useful for generating synthetic instances.

In variants 5∼7, we examine the impacts of different similarities for retrieving the demonstrations in the feature-level augmentation. By only preserving one of three similarities, we can find that all the similarities are important and none of them can completely cover the others.

5.3 Parameter study

There are two key hyperparameters in D3A: the threshold τ in constructing descriptions at the instance level and the number of demonstrations K at the feature level. Here we investigate their impacts by varying them in certain ranges and observing the performance trends.

Figure 3 shows the impacts of τ by varying its value in the range [0.1, 1.0] in steps of 0.1. Although the trends of the curves are not very obvious, we can analyze the results by marking the best-performing τ for different training data sizes. For example, with small (S) training data, diversity is more important since candidate mentions are rare, and the best results are achieved when τ ∈ [0.6, 1.0]. On the contrary, with full (F) training data, τ ∈ [0.1, 0.4] brings promising performance since accuracy becomes dominant when candidate mentions are adequate.

Fig. 3 Impacts of the threshold τ in the instance-level augmentation

Figure 4 shows the impacts of K by varying its value in the range [1, 10] in steps of 1. When more demonstrations are injected, the curves of GloVe-ABSA and GloVe-NER are generally upward. This trend is reasonable since GloVe embeddings contain limited knowledge and more demonstrations provide more supporting information. For BERT-ABSA and BERT-NER, in contrast, only 1∼3 demonstrations are needed to achieve satisfactory performance since the BERT backbone already embeds sufficient knowledge. In this case, the additional demonstrations with low similarity scores tend to be noisy rather than informative.

Fig. 4 Impacts of the number of demonstrations K in the feature-level augmentation

5.4 A Closer look at D3A

In this section, we take a closer look at D3A. As shown in Table 5, we first present several synthetic instances generated by MRFootnote 7 and D3A, and compare them with the gold instances to observe the synthetic quality of the description-guided instance-level augmentation. Take the instance from the Restaurant dataset as an example. For “staff” tagged as “B-Positive”, MR replaces it with “tables” while D3A replaces it with “maitre-d”. Obviously, “maitre-d” is a more suitable mention than “tables” to serve as a subject having the attitude “horrible”. Thanks to such qualified synthetic instances, D3A is powerful for instance-level data augmentation, as shown in the ablation study. We then examine the retrieved demonstrations in Table 6 to observe the influence of the demonstration-guided feature-level augmentation. For a target token like “food”, its encoded feature in the sequence tagger is augmented by related tokens like “meal, pizza, dessert”. Therefore, it becomes much easier for sequence taggers to recognize “food” as an aspect term.

Table 5 Comparison of gold instances and synthetic instances generated by MR and D3A
Table 6 Case study of tokens and their demonstrations in different datasets

6 Conclusion

In this paper, we propose a description and demonstration guided data augmentation method D3A for sequence tagging. By combining both instance-level and feature-level augmentation, D3A can effectively improve the performance and generalization of sequence taggers. Specifically, at the instance level, we construct descriptions for mentions via head-related and tail-related dependency paths and generate reliable synthetic data. At the feature level, we retrieve demonstrations for tokens to enhance the learning capability of sequence taggers given limited training data. We conduct extensive experiments on NER and ABSA using different sizes of training sets. The results on both GloVe-based and BERT-based backbone sequence taggers demonstrate that D3A can significantly improve the performance for sequence tagging tasks, especially in low-resource scenarios. In the future, we plan to investigate other methods for instance-level and feature-level augmentation, and generalize the data augmentation methods to more NLP tasks like relation extraction and event extraction.