
1 Introduction

As a classic research topic, Named Entity Recognition (NER) is widely applied in the field of Natural Language Processing (NLP) [1]. It is widely acknowledged that strong performance on the NER task depends on the scale and quality of an effective and robust pre-trained model [2]. NER models have indeed improved significantly with recent advances in pre-trained language models [3], yet obtaining massive and diverse labeled data is often expensive and time-consuming [4]. Even when a large annotated corpus is available, it inevitably contains rare entities that do not appear often enough for the model to learn to recognize them accurately in text [5]. Data augmentation for NER is therefore of crucial importance [6].

Previous work has studied many data augmentation techniques for sentence-level tasks such as text classification [4, 7,8,9, 21]. However, because of the finer semantic granularity and more fine-grained labels, these techniques are difficult to adapt to token-level classification tasks such as named entity recognition. Moreover, large models are often brittle to adversarial examples because of overfitting, which hurts robustness on NER tasks [10]. Dai & Adel [11] focused mainly on simple token-level data augmentation methods adapted to NER, but did not study data augmentation at the sentence level. Some studies have explored the impact of mixup augmentation on robustness evaluation, but they lack further variants and do not examine the impact on the effectiveness of pre-trained models [10, 12].

To facilitate research in this direction, and inspired by the Cutmix [13], Augmix [14], and Mosaic [15] augmentation methods in computer vision, we propose three sentence-level data augmentation methods for named entity recognition: CMix, CombiMix, and TextMosaic. We conduct empirical experiments on three authoritative datasets, comparing our proposed methods with a strong baseline (no data augmentation) and with mention replacement (MR), a representative token-level data augmentation [11]. We find that data augmentation does not necessarily improve the effectiveness of the models, but our proposed methods consistently outperform the token-level method in both effectiveness and robustness. When the number of augmented samples is controlled, these methods enhance the performance of the pre-trained models. The results also show that our approaches can greatly improve the robustness of the pre-trained model even over strong baselines, and we achieved SOTA in the robustness evaluation of the CCIR CUP 2021. We release our code at https://github.com/jrmjrm01/SenDA4NER-NLPCC2022.

2 Methodology

2.1 CMix

The core idea of CMix is to randomly cut a span of data from a replacement sentence source and use it to replace a randomly chosen span in a target sentence source. CMix therefore requires two sentence sources: the target sentence source and the replacement sentence source. For each target sentence, a span cut at random from the replacement source substitutes a span of the target at a random mixing ratio between 0% and 50%; that is, at most 50% of the tokens in a target sentence are replaced with data from the replacement source.

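The following is a minimal sketch of how CMix could be implemented for a pair of token/label sequences; the function name, the exact span-selection policy, and the toy sentences are our own illustration under the description above, not the exact code used in our experiments.

```python
import random
from typing import List, Tuple

def cmix(target_tokens: List[str], target_labels: List[str],
         repl_tokens: List[str], repl_labels: List[str],
         max_ratio: float = 0.5) -> Tuple[List[str], List[str]]:
    """Replace a random contiguous span of the target sentence (at most
    max_ratio of its tokens) with a span cut from the replacement sentence.
    The labels travel together with the replacing tokens."""
    ratio = random.uniform(0.0, max_ratio)                       # mixing ratio in [0, 0.5]
    span_len = max(1, int(len(target_tokens) * ratio))
    span_len = min(span_len, len(repl_tokens))                   # cannot cut more than available

    r_start = random.randint(0, len(repl_tokens) - span_len)     # cut position in replacement source
    t_start = random.randint(0, len(target_tokens) - span_len)   # paste position in target source

    new_tokens = (target_tokens[:t_start]
                  + repl_tokens[r_start:r_start + span_len]
                  + target_tokens[t_start + span_len:])
    new_labels = (target_labels[:t_start]
                  + repl_labels[r_start:r_start + span_len]
                  + target_labels[t_start + span_len:])
    return new_tokens, new_labels

# Toy example (labels in BIO format)
tgt = (["John", "lives", "in", "Berlin", "."], ["B-PER", "O", "O", "B-LOC", "O"])
rep = (["Acme", "Corp", "hired", "Mary", "."], ["B-ORG", "I-ORG", "O", "B-PER", "O"])
print(cmix(*tgt, *rep))
```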

2.2 CombiMix

The core idea of CombiMix is to apply different data augmentation methods to the samples and fuse the results. This approach also requires two sentence sources: the target sentence source and the replacement sentence source. CombiMix builds on two token-level augmentation methods [11], mention replacement (MR) and label-wise token replacement (LwTR). MR uses a binomial distribution to decide whether each mention in the target sentence source is replaced; if so, it is replaced with a mention of the same label drawn from the replacement sentence source. LwTR uses a binomial distribution to decide whether each word in the target sentence source is replaced; if so, a word with the same label is randomly drawn from the replacement sentence source, and the original label sequence remains unchanged. Finally, we fuse the dataset processed by MR, the dataset processed by LwTR, and the original dataset to form the final CombiMix dataset.

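Below is a minimal sketch of CombiMix under our reading of the method: MR and LwTR are applied to the target sentences using mentions and tokens drawn from the replacement source, and the two augmented sets are fused with the original. All function names, the replacement probability p, and the BIO-handling details are illustrative assumptions rather than the exact implementation.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple

Sentence = Tuple[List[str], List[str]]   # (tokens, BIO labels)

def extract_mentions(labels: List[str]) -> List[Tuple[int, int, str]]:
    """Return (start, end, type) entity spans from a BIO label sequence."""
    spans, start = [], None
    for i, lab in enumerate(labels + ["O"]):
        if start is not None and not lab.startswith("I-"):
            spans.append((start, i, labels[start][2:]))
            start = None
        if lab.startswith("B-"):
            start = i
    return spans

def lwtr(sent: Sentence, token_index: Dict[str, List[str]], p: float = 0.3) -> Sentence:
    """Label-wise token replacement: each token is swapped with probability p
    for another token carrying the same label; the label sequence is unchanged."""
    tokens, labels = sent
    new_tokens = [random.choice(token_index[lab])
                  if random.random() < p and token_index[lab] else tok
                  for tok, lab in zip(tokens, labels)]
    return new_tokens, list(labels)

def mr(sent: Sentence, mention_index: Dict[str, List[List[str]]], p: float = 0.3) -> Sentence:
    """Mention replacement: each mention is swapped with probability p for a
    mention of the same type; BIO labels are rebuilt for the new span."""
    tokens, labels = sent
    new_tokens, new_labels, cursor = [], [], 0
    for start, end, etype in extract_mentions(labels) + [(len(tokens), len(tokens), None)]:
        new_tokens += tokens[cursor:start]
        new_labels += labels[cursor:start]
        if etype is not None:
            span = tokens[start:end]
            if random.random() < p and mention_index[etype]:
                span = random.choice(mention_index[etype])
            new_tokens += span
            new_labels += ["B-" + etype] + ["I-" + etype] * (len(span) - 1)
        cursor = end
    return new_tokens, new_labels

def combimix(target: List[Sentence], replacement: List[Sentence]) -> List[Sentence]:
    """Fuse the original set with its MR-augmented and LwTR-augmented copies."""
    token_index, mention_index = defaultdict(list), defaultdict(list)
    for tokens, labels in replacement:
        for tok, lab in zip(tokens, labels):
            token_index[lab].append(tok)
        for s, e, t in extract_mentions(labels):
            mention_index[t].append(tokens[s:e])
    return (list(target)
            + [mr(s, mention_index) for s in target]
            + [lwtr(s, token_index) for s in target])
```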

2.3 TextMosaic

In this section, the TextMosaic method is introduced in detail. It consists of three approaches: span sampling, random sampling, and over-sampling. These approaches exploit sentence-level contexts and help train a more effective and robust NER model.

Span Sampling.

Span sampling allows a training example to be sampled across one or more sentences to obtain richer training contexts, as illustrated in Fig. 1. By randomly selecting a head position and a sampling length, truncated parts of one or more sentences are concatenated into a new example. In general, adjacent sentences are logically related, especially the end of one sentence and the beginning of the next.

Fig. 1. The schematic diagram of span sampling. For example, sentence B is the next sentence of sentence A; the word sequence C may be sampled across the two sentences.
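
A minimal sketch of span sampling, assuming the corpus is a list of (tokens, labels) sentence pairs in document order; the window length and function name are illustrative.

```python
import random
from typing import List, Tuple

Sentence = Tuple[List[str], List[str]]   # (tokens, BIO labels)

def span_sample(sentences: List[Sentence], sample_len: int = 64) -> Sentence:
    """Draw one training example by cutting a window of sample_len tokens that
    may start inside one sentence and run into the following one(s)."""
    # Flatten the document into a single token/label stream, keeping sentence order.
    tokens = [t for toks, _ in sentences for t in toks]
    labels = [l for _, labs in sentences for l in labs]
    if len(tokens) <= sample_len:
        return tokens, labels
    head = random.randint(0, len(tokens) - sample_len)   # random head position
    return tokens[head:head + sample_len], labels[head:head + sample_len]
```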

Random Sampling.

Random sampling extracts two or more fragments from different places in the original data and recombines them into a new example for training, as shown in Fig. 2. Because certain entities can be identified from common sense without much attention to context, superimposing multiple fragments significantly enhances the diversity of the textual information. Note that when fragments are cut, the cut position is generally three to five tokens before the start tag of an entity.

Fig. 2. The schematic diagram of random sampling. The training example is recombined from two or more word-sequence fragments; for example, the word sequence C is sampled from the two sentences A and B.
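
A minimal sketch of random sampling under the description above; the number of fragments, the fragment length, and the "three to five tokens before an entity" lead are illustrative parameters (a lead of 4 is used here).

```python
import random
from typing import List, Tuple

Sentence = Tuple[List[str], List[str]]   # (tokens, BIO labels)

def random_sample(sentences: List[Sentence], n_pieces: int = 2,
                  piece_len: int = 16, lead: int = 4) -> Sentence:
    """Recombine n_pieces fragments cut from different sentences into one new
    example. Each cut starts `lead` tokens before an entity mention so the
    fragment keeps some left context for the entity."""
    new_tokens, new_labels = [], []
    for tokens, labels in random.sample(sentences, k=min(n_pieces, len(sentences))):
        starts = [i for i, lab in enumerate(labels) if lab.startswith("B-")]
        if starts:
            cut = max(0, random.choice(starts) - lead)
        else:                                   # no entity: fall back to a random cut
            cut = random.randint(0, max(0, len(tokens) - piece_len))
        new_tokens += tokens[cut:cut + piece_len]
        new_labels += labels[cut:cut + piece_len]
    return new_tokens, new_labels
```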

Over-Sampling.

Over-sampling addresses the uneven distribution of entity labels, as shown in Fig. 3. We use a sliding window to amplify the data, which can be viewed as a sieving process: the window slides over the original context with a specific step. On the one hand, BERT's position encoding is learned, so texts sampled by the sliding window do not overlap in representation because they occupy different positions. On the other hand, sampling with a fixed step reduces the amount of padding required and streamlines the training process of the model.

Fig. 3. The schematic diagram of over-sampling. For example, the word sequence A is sampled from the full sequence C, and after shifting by one step, the word sequence B is sampled from the same sequence.
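
A minimal sketch of sliding-window over-sampling; the window size and step are illustrative, and in practice the step would be chosen so that regions containing rare labels are sampled more than once.

```python
from typing import List, Tuple

def sliding_window_oversample(tokens: List[str], labels: List[str],
                              window: int = 64, step: int = 16):
    """Slide a fixed-size window over a long token/label stream with a given
    step, emitting one (overlapping) training example per window position."""
    samples = []
    for start in range(0, max(1, len(tokens) - window + 1), step):
        samples.append((tokens[start:start + window],
                        labels[start:start + window]))
    return samples
```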

3 Experiment

This section first introduces three authoritative NER datasets and their post-attack counterparts generated by TextFlint [16] in Sect. 3.1, and then describes the experimental setup in Sect. 3.2. We present the main results of the effectiveness evaluation in Sect. 3.3 and further explore how the augmented sample size influences the effectiveness of these methods in Sect. 3.4. Section 3.5 shows that our methods greatly improve the robustness of the pre-trained model even over strong baselines. Finally, Sect. 3.6 reports on the NER robustness competition hosted by TextFlint, in which our approach achieved state-of-the-art (SOTA) results in the CCIR Cup 2021.

3.1 Datasets

In order to evaluate the effectiveness of our proposed approaches, we conduct experiments on three authoritative and popular NER datasets across two languages, including the OntoNotes4.0 Chinese dataset [17], OntoNotes5.0 English dataset [18], and CoNLL-03 English dataset [19]. We show descriptive statistics of these datasets in Table 1.

To evaluate the robustness of our proposed approaches, we also conduct experiments on the above datasets after they are attacked by TextFlint [16], which provides many diverse attack methods such as universal text transformations, adversarial attacks, subpopulations, and their combinations. We combine the OntoNotes5.0 and CoNLL-03 datasets mentioned above after attacking them with 20 different attack methodsFootnote 1 Footnote 2 and use them to evaluate the robustness of our methods. The OntoNotes4.0 dataset attacked by TextFlint serves as the benchmark of the CCIR CUP 2021 competitionFootnote 3.

Table 1. The statistics of the adopted datasets.

3.2 Experimental Setup

Baseline.

Named entity recognition can be modeled as a sequence labeling task. State-of-the-art sequence models consist of distributed representations for input, a context encoder, and a tag decoder [20]. We adopt the BERT model [3] as the backbone, decode with linear layers, and fine-tune on the NER dataset, as shown in Fig. 4. For the BERT embeddings, we use the following Huggingface pre-trained models: “bert-base-chinese”Footnote 4 for the Chinese dataset and “bert-base-uncased”Footnote 5 for the English datasets. The baseline uses no data augmentation and shares the same hyperparameters and pipeline with all following experiments.
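
A minimal sketch of the baseline architecture (BERT backbone with a linear token-classification head) using the Huggingface transformers library; the dropout rate and the number of labels are illustrative and do not correspond to a specific dataset.

```python
import torch
from torch import nn
from transformers import AutoModel, AutoTokenizer

class BertNER(nn.Module):
    """BERT backbone followed by a linear token-classification (tag decoding) layer."""
    def __init__(self, model_name: str, num_labels: int):
        super().__init__()
        self.backbone = AutoModel.from_pretrained(model_name)
        self.dropout = nn.Dropout(0.1)                       # illustrative dropout rate
        self.classifier = nn.Linear(self.backbone.config.hidden_size, num_labels)

    def forward(self, input_ids: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
        hidden = self.backbone(input_ids=input_ids,
                               attention_mask=attention_mask).last_hidden_state
        return self.classifier(self.dropout(hidden))         # (batch, seq_len, num_labels)

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
model = BertNER("bert-base-chinese", num_labels=9)           # label count is illustrative
```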

Token-level Augmentation.

Existing token-level data augmentations dedicated to named entity recognition are label-wise token replacement (LwTR), mention replacement (MR), and synonym replacement (SR) [11]. We choose MR as a representative token-level method and compare our three sentence-level data augmentations against it.

Training.

To improve the convergence and robustness of the model, we use a bag of tricks [21] and select the optimal hyper-parameters shown in Table 2. Gradient accumulation achieves an effect similar to a large batch size when memory is limited, and learning-rate warm-up is used to speed up convergence. In addition, the AdamW optimizer [22] gives each parameter an independent learning rate and takes the past gradient history into account. To alleviate over-fitting, we apply label smoothing and limit the non-linear parameters. For multi-model ensembling, we use stacking with 3-fold cross-validation.

Table 2. The hyper-parameters of the experiment
Fig. 4. The training model: BERT (backbone) + NER (head).
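
The following sketch shows how these tricks fit together in a PyTorch training loop: gradient accumulation, linear warm-up, AdamW, and label smoothing. It reuses the BertNER module from the sketch above and assumes a `train_loader` that yields token-classification batches; the step counts and learning rate are illustrative, and Table 2 lists the values actually used.

```python
import torch
from torch import nn
from transformers import get_linear_schedule_with_warmup

# Illustrative values; see Table 2 for the hyper-parameters actually used.
accum_steps, total_steps, warmup_steps = 4, 10_000, 500
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5, weight_decay=0.01)
scheduler = get_linear_schedule_with_warmup(optimizer,
                                            num_warmup_steps=warmup_steps,
                                            num_training_steps=total_steps)
criterion = nn.CrossEntropyLoss(label_smoothing=0.1, ignore_index=-100)

model.train()
for step, batch in enumerate(train_loader):                  # assumed DataLoader of batches
    logits = model(batch["input_ids"], batch["attention_mask"])
    loss = criterion(logits.reshape(-1, logits.size(-1)), batch["labels"].reshape(-1))
    (loss / accum_steps).backward()                          # gradient accumulation
    if (step + 1) % accum_steps == 0:
        optimizer.step()                                     # AdamW update
        scheduler.step()                                     # linear warm-up / decay
        optimizer.zero_grad()
```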

Metric.

We adopted span-based Micro F1 to measure the effectiveness and robustness except for CCIR Cup. The CCIR Cup used span-based Macro F1 to evaluate the robustness.
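
For reference, a minimal sketch of span-based micro F1 over BIO sequences: an entity counts as correct only if both its boundaries and its type match exactly. The helper names are our own.

```python
from typing import List, Tuple

def spans(labels: List[str]) -> List[Tuple[int, int, str]]:
    """Extract (start, end, type) entity spans from a BIO label sequence."""
    out, start = [], None
    for i, lab in enumerate(labels + ["O"]):
        if start is not None and not lab.startswith("I-"):
            out.append((start, i, labels[start][2:]))
            start = None
        if lab.startswith("B-"):
            start = i
    return out

def span_micro_f1(gold_seqs: List[List[str]], pred_seqs: List[List[str]]) -> float:
    """Micro-averaged F1 over exact entity spans: type and boundaries must both match."""
    tp = fp = fn = 0
    for gold, pred in zip(gold_seqs, pred_seqs):
        g, p = set(spans(gold)), set(spans(pred))
        tp += len(g & p)
        fp += len(p - g)
        fn += len(g - p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```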

3.3 Results of Effectiveness Evaluation

Table 3 shows the overall results of the effectiveness evaluation on the CoNLL-03, OntoNotes4, and OntoNotes5 datasets. We also report a negative result: the gains do not hold on all datasets; on OntoNotes4 and CoNLL-03, performance drops for all data augmentation methods except CMix. At the same time, compared with MR, our proposed methods mostly obtain better results, demonstrating that sentence-level data augmentation is also relatively effective.

Table 3. Results of effectiveness evaluation

3.4 Study of the Sample Size After Data Augmentation

We counted the number of samples after data augmentation for the three training sets, as shown in Table 4. MR and CMix were twice as large as the baseline, and CombiMix was three times as large. The sample size of TextMosaic (sample length set to 64) was the largest on OntoNotes4 and OntoNotes5.

Table 4. Sample sizes of the OntoNotes4, CoNLL-03, and OntoNotes5 training sets after processing by the five data augmentation methods

To further explore how the number of samples and their distribution characteristics affect model performance after data augmentation, we take OntoNotes4 as an example and balance all methods to the same sample size with the following strategy (a generic balancing helper is sketched after the list).

  • Baseline: Duplicate original samples three times to 3860 samples

  • MR: Duplicate original samples once, to 3860 samples

  • CMix: Duplicate original samples once, to 3860 samples

  • CombiMix: Shuffle original samples and randomly select 3860 samples

  • TextMosaic: Shuffle original samples and randomly select 3860 samples
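
The sketch below illustrates the balancing strategy in generic form: sets smaller than the target are topped up by duplication, larger ones are shuffled and randomly sub-sampled. The target size of 3860 corresponds to the OntoNotes4 setting above; the function name is our own.

```python
import random

def balance(samples, target_size: int = 3860):
    """Balance a sample set to target_size: duplicate whole samples if the set
    is too small, randomly sub-sample if it is too large."""
    if len(samples) >= target_size:
        return random.sample(samples, target_size)
    out = list(samples)
    while len(out) < target_size:
        out.extend(samples[:target_size - len(out)])
    return out
```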

Table 5 shows the change in F1 score before and after balancing the number of samples. We find that duplicating samples can itself serve as a form of data augmentation: although duplication does not change the characteristics of the data distribution, both the baseline and CMix improved by about 1%. In our analysis, the augmented data may carry a large amount of irregular semantic information and noise, which reduces the effectiveness of the model even though it may still improve its robustness.

Table 5. F1 scores of the five data augmentation methods before and after balancing the number of samples

3.5 Results of Robustness Evaluation

The two datasets contain twenty transformations, such as word changes, back translation, contraction, sentences extended with irrelevant sentences, and keyboard errors, as illustrated in Fig. 5 and Fig. 6. Table 6 shows the overall results of the robustness evaluation on the CoNLL-03 and OntoNotes5 datasets attacked by TextFlint. On the one hand, the F1 of the model drops by approximately 7%-17% on both datasets. On the other hand, these methods improve the robustness of the model: both CombiMix and TextMosaic are higher than the baseline and MR on CoNLL-03, and on OntoNotes5 all three sentence-level data augmentations show significant improvements over MR and the baseline, with the best method, CMix, 17% higher than the strong baseline.

Fig. 5. CoNLL-03 data size distribution after being attacked by TextFlint

Fig. 6. OntoNotes5 data size distribution after being attacked by TextFlint

Table 6. Comparison of the F1 scores of the five methods in the robustness evaluation

3.6 Results of CCIR Cup

We participated in a robustness evaluation competition hosted by TextFlint at CCIR Cup 2021Footnote 6. The validation and test sets used for the evaluation were generated by TextFlint by applying eleven forms of changes to OntoNotes4. The evaluation was divided into two phases: LeaderBoard A (LB-A) focused on contextual changes, while LeaderBoard B (LB-B) combined more forms of contextual changes with entity changes. We tested the three proposed sentence-level augmentations and report the main results in Table 7. We achieved first place in both phases: in LB-A we obtained the highest F1 score of 85.99, which is 7.96 higher than the second place (F1 = 78.03), and in LB-B we obtained an F1 score of 76.54, which is 2.53 higher than the second place (F1 = 74.01).

Different from the experimental setup above, here we applied the data augmentation methods before pre-training and used semi-supervised learning in combination with an out-of-domain dataset [23]. In our analysis, using generic data augmentation as a noise source for the consistency training method may be a good choice.

Furthermore, we tested the predicted sequence length at inference time and found that a sequence length of 126 works best. In our analysis, when the sequence length is too long, the transformer struggles to learn long-distance dependencies, resulting in poor model performance; conversely, when the sequence length is too short, it is difficult to learn the semantic information of the entity context, resulting in poor NER performance.

Table 7. Results of the CCIR CUP (S510/S254/S126: sequence length set to 510/254/126)

4 Conclusion

This research proposes three different strategies for sentence-level data augmentation for named entity recognition, a token-level task. Through experiments on three authoritative datasets, we find that data augmentation does not necessarily improve the effectiveness of the models, but controlling the number of augmented samples enhances the performance of the pre-trained models in fitting the feature distribution of the input contextual embeddings. The results also show that our approaches can greatly improve the robustness of the pre-trained model even over strong baselines, and we achieved state-of-the-art (SOTA) results in the CCIR CUP 2021.