
1 Introduction

Named entity recognition (NER) is an important task in natural language processing that recognizes predefined entity types in input text. Early NER systems, e.g., NetOwl [1], relied on manually defined rules. Feature-based supervised learning methods regard NER as a multi-class classification or sequence labeling problem, e.g., CRF [2]. However, traditional NER methods cannot capture the semantic information in the text, so it is difficult to improve their performance further. Deep learning methods, e.g., BiLSTM + CRF [3], have been widely applied to NER tasks; they can capture hidden features and exhibit better generalization ability than traditional methods.

Fig. 1. Framework of our model

Fig. 2. Trigger representation learning

Although deep-learning based methods have made great progress on NER tasks, many challenges remain to be addressed, such as the lack of sufficient annotated data in low-resource domains. Many NER systems achieve good results on general-domain datasets, but they need a large amount of annotated data for training, and acquiring such data usually requires rich domain knowledge as well as high labor costs. High-quality annotated data is scarce in many practical scenarios. Therefore, it is of great significance to develop NER systems for few-shot settings.

In this paper, we propose a few-shot NER model based on self-training, with machine reading comprehension (MRC) as its built-in base model. The overall structure of our model is shown in Fig. 1. It consists of three steps: 1) train the base model on the labeled data; 2) compute the confidence of the weak annotations inferred by the model trained in the previous step, and select high-confidence data to expand the labeled set; 3) repeat steps 1 and 2 until the stopping condition is met. The MRC-based model can encode external knowledge about entities through appropriately designed queries, which is beneficial in few-shot settings. Within the self-training framework, we use entity triggers to compute the confidence of weak annotations; triggers mine information from a different perspective of the labeled data and provide an effective rule for filtering out noisy data. Since self-training has proved its effectiveness in few-shot settings, we apply this new confidence measure to the self-training process and conduct experiments to show the effectiveness of our method.
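To make the three-step procedure concrete, the following is a minimal sketch of the loop, assuming hypothetical placeholders: train_mrc_model (step 1), model.predict for weak annotation, and trigger_confidence for the trigger-based confidence measure introduced in Sect. 2.3.

```python
# A minimal sketch of the three-step self-training loop described above.
# train_mrc_model, model.predict, and trigger_confidence are hypothetical
# placeholders, not the exact interfaces of our implementation.

def self_train(labeled, unlabeled, max_iters=5, conf_threshold=0.8):
    model = None
    for _ in range(max_iters):
        model = train_mrc_model(labeled)                       # step 1: train the base model
        weak = [(x, model.predict(x)) for x in unlabeled]      # infer weak annotations
        reliable = [(x, y) for x, y in weak
                    if trigger_confidence(x, y) >= conf_threshold]  # step 2: keep high-confidence data
        if not reliable:                                       # stop condition: nothing reliable left
            break
        labeled = labeled + reliable                           # expand the labeled set
        accepted = {x for x, _ in reliable}
        unlabeled = [x for x in unlabeled if x not in accepted]
    return model
```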

In summary, the contributions of this paper are as follows:

  • We propose a self-training based framework to recognize named entities in few-shot settings.

  • We select a machine reading comprehension model as the base model of our self-training framework, so that NER is cast as answering type-specific queries. Besides, we compute the confidence of weakly labeled data based on entity triggers.

  • Extensive experiments are conducted on two benchmarks to confirm the effectiveness of the proposed method.

2 Our Model

2.1 MRC-NER

We first transform the tagging-style NER dataset into MRC-style. Specifically, we generate the query set \(Q=\left\{ q_{y_{1}},\ldots ,q_{y_{k}}\right\} \), where \(q_{y_{i}}\) denotes the query for entity type \(y_{i}\). For an input sentence S we then obtain the corresponding answer set \(A=\left\{ a_{start_{1},end_{1}},\ldots ,a_{start_{p},end_{p}}\right\} \), where each \(a_{start_{i},end_{i}}=\left\{ w_{start_{i}},\ldots ,w_{end_{i}}\right\} \) denotes an entity mention. Each MRC-style annotation sample therefore takes the form \(\left( Question,Answer,Context\right) \). After transforming the tagging-style dataset into MRC-style, we can extract entities of a certain type by answering the corresponding question. Solving NER with an MRC-based model has a key advantage over traditional methods: prior knowledge about entity categories can be encoded in the query, and a specific description of similar entity categories can effectively eliminate ambiguity.
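As an illustration, the sketch below converts one tagging-style sentence into MRC-style (Question, Answer, Context) samples; the example sentence and the query wording are assumptions made here for clarity (in the paper, queries come from the annotation guidelines).

```python
# Illustrative sketch of turning one tagging-style sentence into MRC-style
# (Question, Answer, Context) samples. The example data and query texts are
# assumptions for illustration only.

def to_mrc_samples(tokens, entity_spans, queries):
    """tokens: word list; entity_spans: list of (start, end, type) spans;
    queries: dict mapping entity type -> natural-language query."""
    context = " ".join(tokens)
    samples = []
    for ent_type, question in queries.items():
        answers = [(s, e) for (s, e, t) in entity_spans if t == ent_type]
        samples.append({"question": question, "answer": answers, "context": context})
    return samples

tokens = ["Alice", "works", "for", "Google", "in", "London"]
entity_spans = [(0, 0, "PER"), (3, 3, "ORG"), (5, 5, "LOC")]
queries = {
    "PER": "Find person names in the text.",
    "ORG": "Find organizations such as companies, agencies, and institutions.",
    "LOC": "Find locations such as cities and countries.",
}
print(to_mrc_samples(tokens, entity_spans, queries))
```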

In few-shot learning, the annotated data is limited, so it is necessary to import external knowledge. We therefore choose the MRC-based NER method [5] as the base model and improve its performance through self-training.

2.2 Entity Triggers

Entity triggers [6] are defined as a set of words that help explain the entity recognition process in a sentence. When we recognize an entity in a sentence, we usually take certain words or phrases in the sentence as the basis for our judgment, even if the entity itself is unfamiliar. In short, entity triggers help us understand the training process of the model and enable the model to better summarize the information of each entity category. This notion was proposed by Lin et al. [6], who achieved good results in few-shot settings by using labeled data with entity triggers. Fig. 3 presents an example, where \(t_{i}\) denotes an entity and its corresponding trigger.

Fig. 3. Example of entity trigger

When annotated data is scarce, entity triggers can provide supplementary information beyond the original labels. They can be regarded as auxiliary annotations that help the model learn and generalize better from the limited annotated data. We therefore use trigger information as an auxiliary signal to compute the confidence of weakly labeled data during self-training, which effectively filters out noisy data and improves the performance of our model.

Trigger Extractor. Although manually annotated entity triggers may be of high quality, producing them requires domain knowledge and high labor costs, which is not practical for NER in few-shot settings. Therefore, we design a model for automatic trigger extraction based on the AutoTrigger model proposed by Lee et al. [7]. We use the SOC (Sampling and Occlusion) algorithm [8], a model interpretation technique, to compute the context-independent importance of phrases, which can then be used to extract triggers. The importance score of a phrase p in input sequence x is:

$$\begin{aligned} \phi \left( p,x\right) = \frac{1}{\left| S\right| }\sum _{\widehat{x}_{\delta }\in S} \left[ s\left( {x}_{-\delta };\widehat{x}_{\delta }\right) -s\left( {x}_{-\left\{ \delta ,p \right\} } ;\widehat{x}_{\delta };0_{p}\right) \right] \end{aligned}$$
(1)

where \(s\left( x\right) \) denotes the prediction score of the model, \(x_{-\delta }\) denotes the sequence after masking a context of length N surrounding the phrase p in the input sequence x, \(\widehat{x}_{\delta }\) denotes a sequence of length N drawn from the sampling distribution \(p\left( \widehat{x}_{\delta }|x_{-\delta }\right) \) given by a pre-trained language model, \(0_{p}\) denotes padding for the phrase p, and S denotes the collection of samples \(\widehat{x}_{\delta }\) drawn from the language model. The importance score of phrase p can therefore be interpreted as the expected difference in prediction scores after masking phrase p over all possible contexts \(\widehat{x}_{\delta }\) of p, which removes the dependence of the importance score on the particular context of the phrase.
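A simplified sketch of this score is given below; sample_context (drawing \(\widehat{x}_{\delta }\) from a pre-trained language model given \(x_{-\delta }\)), splice (rebuilding the sequence and optionally padding the phrase out as \(0_{p}\)), and predict_score (the model score s) are hypothetical helpers, not part of the original SOC implementation.

```python
# A simplified sketch of the SOC importance score of Eq. (1).
# sample_context, splice, and predict_score are hypothetical helpers.

def soc_importance(x, phrase_span, num_samples=20, window=10):
    total = 0.0
    for _ in range(num_samples):
        ctx = sample_context(x, phrase_span, window)                  # draw \hat{x}_delta
        x_with_p = splice(x, phrase_span, ctx, keep_phrase=True)      # s(x_{-delta}; \hat{x}_delta)
        x_without_p = splice(x, phrase_span, ctx, keep_phrase=False)  # phrase padded out (0_p)
        total += predict_score(x_with_p) - predict_score(x_without_p)
    return total / num_samples                                        # average over the sample set S
```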

The process of automatic trigger extraction can be briefly described as follows:

  1) We first train a classifier \(M_{t}\) on the annotated data \(D_{L}\). For an input \(x=\left( x^{\left( 1\right) },x^{\left( 2\right) },\ldots ,x^{\left( n\right) }\right) \), the classifier \(M_{t}\) outputs the conditional probability \(P\left( y|x\right) \), where y denotes the corresponding label sequence. The prediction score of a target entity e is:

    $$\begin{aligned} s\left( x,e\right) =\frac{1}{\left| e\right| }\sum _{x^{\left( j\right) }\in e}P\left( y^{\left( j\right) } | x^{\left( j\right) }\right) \end{aligned}$$
    (2)
  2) We then generate the candidate trigger set P from the phrase nodes of the constituency parse tree and, for each phrase \(p_{i} \in P\), compute the importance score with respect to its target entity:

    $$\begin{aligned} \phi \left( p_{i},x,e\right) = \frac{1}{\left| S\right| }\sum _{\widehat{x}_{\delta }\in S} \left[ s\left( {x}_{-\delta },e;\widehat{x}_{\delta }\right) -s\left( {x}_{-\left\{ \delta ,p_{i} \right\} },e;\widehat{x}_{\delta };0_{p_{i}}\right) \right] \end{aligned}$$
    (3)
  3) Among all candidate triggers \(p_{i} \in P\), we select the top-K triggers with the highest importance scores (a brief sketch of the whole procedure is given below).
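The following sketch ties the three steps together; candidate_phrases (the phrase nodes of the constituency parse) and soc_importance_for_entity (the entity-conditioned SOC score of Eq. (3)) are hypothetical helpers.

```python
# A brief sketch of the three extraction steps above.
# candidate_phrases and soc_importance_for_entity are hypothetical helpers.

def extract_triggers(x, entity_span, k=2):
    candidates = candidate_phrases(x)                     # step 2: candidate trigger set P
    scored = [(p, soc_importance_for_entity(x, p, entity_span)) for p in candidates]
    scored.sort(key=lambda item: item[1], reverse=True)   # rank by importance score
    return [p for p, _ in scored[:k]]                     # step 3: keep the top-K triggers
```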

2.3 Self-training Framework

Trigger Representation Learning. After extracting entity triggers, we train the model to learn the representation of triggers.

First, for the annotated data with triggers, we obtain the embeddings of the input sentence S and the trigger t following the method proposed by Lin et al. [9], denoted as \(g_{s}\) and \(g_{t}\) respectively. \(g_{s}\) is the weighted sum of the token embeddings in the sentence, and \(g_{t}\) is the weighted sum of the embeddings of the trigger tokens. We then learn the weight matrices by training on two tasks and obtain the trigger embedding; Fig. 2 shows the framework. The first task learns trigger vectors using entity types as supervision. The second task aims at matching the trigger vector with its sentence. The final loss is the weighted sum of the losses of these two tasks.
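As a minimal PyTorch-style sketch of this joint objective, the module below supervises \(g_{t}\) with its entity type (task 1) and pulls \(g_{t}\) toward its sentence vector \(g_{s}\) (task 2); the cosine-based matching loss and the weight alpha are illustrative assumptions, not the exact losses used in [9].

```python
import torch.nn as nn

# Minimal sketch of the two-task trigger representation loss.
# The matching loss and the weight alpha are illustrative assumptions.

class TriggerRepLoss(nn.Module):
    def __init__(self, hidden_dim, num_types, alpha=0.5):
        super().__init__()
        self.type_clf = nn.Linear(hidden_dim, num_types)   # task 1: entity-type classifier
        self.alpha = alpha                                  # weight of the matching loss
        self.ce = nn.CrossEntropyLoss()

    def forward(self, g_t, g_s, type_labels):
        # task 1: classify the trigger vector g_t into its entity type
        loss_type = self.ce(self.type_clf(g_t), type_labels)
        # task 2: pull matched (trigger, sentence) vectors together
        loss_match = (1 - nn.functional.cosine_similarity(g_t, g_s)).mean()
        return loss_type + self.alpha * loss_match          # weighted sum of the two losses
```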

Confidence. In the iterative process of self-training, it is critical to find and remove noisy data. By selecting reliable weakly annotated data, we can improve the quality of the expanded labeled set and thus the model performance.

Based on the trigger vectors learned in the last subsection, we compute the distance \(d=\left\| g_{x}-g_{t}\right\| _{2}\) between a trigger t and a weakly annotated sentence x, and set a threshold \(\lambda \). For the set of triggers \(T_{x}=\left\{ t^{\left( 1\right) }_{x},t^{\left( 2\right) }_{x},\ldots \right\} \) satisfying \(d<\lambda \), the corresponding set of entity types and counts is \(E_{x}=\left\{ \left( e_{1},n_{1}\right) ,\left( e_{2},n_{2}\right) ,\ldots ,\left( e_{k},n_{k}\right) \right\} \), where \(e_{i}\) denotes an entity type and \(n_{i}\) denotes the number of matched triggers belonging to that type.

For a weakly annotated sample \(\left( x,y\right) \) whose annotated entity type is \(e_{i}\) and whose entity type and count set is \(E_{x}\), we regard the sample as reliable if the following condition is satisfied:

$$\begin{aligned} \frac{n_{i}}{\sum ^{k}_{j=1}n_{j}}\ge \theta _{1} \quad \text {or} \quad n_{i} \ge \theta _{2} \end{aligned}$$
(4)

where \(\theta _{1}\) and \(\theta _{2}\) are thresholds. For the reliable weakly annotated data obtained after each iteration, together with the previously labeled data, we define the loss function of the next iteration as follows:

$$\begin{aligned} L_{ST}=\frac{1}{\left| D^{L}\right| }\sum _{\left( x,y\right) \in D^{L}}L\left( f\left( x\right) ,y\right) + \frac{\lambda _{U}}{\left| D^{U}\right| }\sum _{\left( x,y\right) \in D^{U}}L\left( f\left( x\right) ,y\right) \end{aligned}$$
(5)

where \(f\left( \cdot \right) \) denotes the model newly trained on \(D^{L}\) and \(D^{U}\), and \(\lambda _{U}\) denotes the weight of the weakly labeled term. Self-training proceeds iteratively through these steps until the maximum number of iterations is reached or the stopping condition is met.
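Putting Eq. (4) and Eq. (5) together, the sketch below shows how reliable weak annotations could be selected and how the self-training loss could be assembled; the threshold values, the model/loss_fn placeholders, and the per-sample data layout are assumptions made here for illustration.

```python
# Sketch of the reliability test of Eq. (4) and the loss of Eq. (5).
# Threshold values and data layout are illustrative assumptions.

def is_reliable(entity_counts, annotated_type, theta1=0.6, theta2=5):
    """entity_counts: dict mapping entity type -> number of triggers matched
    within distance lambda; annotated_type: the weakly predicted type e_i."""
    total = sum(entity_counts.values())
    n_i = entity_counts.get(annotated_type, 0)
    return total > 0 and (n_i / total >= theta1 or n_i >= theta2)

def self_training_loss(model, loss_fn, labeled, reliable_weak, lambda_u=0.5):
    """labeled: D^L, reliable_weak: D^U, both lists of (x, y) pairs."""
    l_sup = sum(loss_fn(model(x), y) for x, y in labeled) / len(labeled)
    l_weak = (sum(loss_fn(model(x), y) for x, y in reliable_weak) / len(reliable_weak)
              if reliable_weak else 0.0)
    return l_sup + lambda_u * l_weak
```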

3 Experiments

3.1 Datasets

We use two datasets, CoNLL2003 [10] and BC5CDR [11], in our experiments. CoNLL2003 is an English general-domain dataset with four entity types: Location, Organization, Person, and Miscellaneous. BC5CDR is an English dataset in the biomedical domain with two entity types: Disease and Chemical. The tagging-style annotations of both datasets are transformed into the corresponding MRC-style annotations. The queries for each entity category are derived from the annotation guidelines.

3.2 Baselines

We select the following models as baselines:

  • BiLSTM-CRF [3]: A classical sequence labeling model.

  • Trigger Matching Network (TMN) [6]: An NER model based on manually labeled triggers.

  • TMN with Self-training [6]: TMN trained with self-training, where the confidence is computed with the MNLP measure proposed by Shen et al. [12].

  • Bert-Tagger [4]: A sequence labeling model based on BERT.

3.3 Results and Analysis

Table 1 and Table 2 show the results on CoNLL2003 and BC5CDR, respectively. When the training data is 1\(\%\) of the dataset, the F1 value of the BiLSTM-CRF model is only 24.81\(\%\): so little training data leads to poor generalization ability. Although the performance improves significantly as more training samples become available, there is still a large gap between BiLSTM-CRF and our model (STM). The performance of Bert-Tagger is similar to that of BiLSTM-CRF. STM performs much better than BiLSTM-CRF and Bert-Tagger when there are few training samples. Compared with the two trigger-matching models, TMN and TMN + self-training, STM performs slightly worse when the training samples are below 5\(\%\). The reason may be that when the sample size is small, the quality of the extracted triggers is not high enough, and the query information imported into the MRC model cannot be learned well. When the training samples reach 7\(\%\) or more, however, the performance improves and STM shows an advantage over TMN (+self-training). Overall, when the training samples are below 20\(\%\), STM performs relatively well by importing external knowledge and mining information from the limited training data. Its disadvantage is that when the training set is very small (below 5\(\%\)), the extracted triggers are of poor quality and the model cannot fully filter out noisy data, resulting in lower performance. The model could therefore be improved by raising the quality of extracted triggers in few-shot settings, for example by transferring existing entity triggers to the low-resource domain.

Table 1. Results on CoNLL2003, where P and R denote Precision and Recall, respectively
Table 2. Results on BC5CDR

Entity definitions in the biomedical domain are complex and the entities are difficult to identify, so the overall performance of all models is much lower than on CoNLL2003. Compared with the results on CoNLL2003, STM has a more significant advantage on BC5CDR (its F1 value is about 4\(\%\)–5\(\%\) higher on average). A possible reason is that STM can make full use of external knowledge about biomedical entities by setting appropriate queries. In this way, the salient features of each entity category can be extracted, and easily confused noisy data can be filtered out based on triggers, so the advantage is more obvious than on CoNLL2003.

Fig. 4. Effect of varying percentage of training samples on CoNLL2003

Fig. 5. Effect of varying percentage of training samples on BC5CDR

Line charts of the performance of STM and BiLSTM-CRF under different percentages of training data on CoNLL2003 and BC5CDR are shown in Fig. 4 and Fig. 5, respectively. It can be seen that the less training data is available, the greater the advantage of STM over BiLSTM-CRF. The reason is that with a small training set it is hard for BiLSTM-CRF to learn the important features of each entity category, which leads to poor generalization ability. In contrast, the external knowledge introduced by STM and the information mined from different perspectives of the limited training data give it good generalization ability even when the training set is small.

The results of the ablation experiments are shown in Table 3. It reports the results of STM, the BERT-MRC model without self-training, and the self-training based MRC model without noise filtering, all trained on only 1\(\%\) of the CoNLL2003 training samples. With entity triggers introduced to filter out noisy data and high-quality weak annotations used to expand the training data, the performance of STM (F1 value of 57.18\(\%\)) improves considerably over the BERT-MRC model (F1 value of 54.35\(\%\)). Without the noise-filtering step, STM without Triggers uses all weak annotations to expand the training data; although the training set grows, its quality drops and the prediction errors of the model accumulate, so its F1 value is about 3\(\%\) lower than that of the BERT-MRC model. We can therefore conclude that filtering out noisy data by mining trigger information from the training data is crucial.

Table 3. Ablation results on CoNLL2003

4 Conclusion

In this paper, we propose a self-training based NER method to improve the generalization ability of the model in few-shot settings. Our model uses an MRC-based model as the base model and trains it within a self-training framework. The experimental results show that the proposed method outperforms the existing methods.