
1 Introduction

Named entity recognition (NER) is a subtask of information extraction that detects spans in text and classifies their types. Among mainstream research methods, the NER task is commonly treated as a sequence labeling problem [1, 3, 6, 12, 24]: for each token of the input sequence, predict the class label assigned to it. The sequence labeling framework solves NER in an end-to-end way and has achieved strong results on various datasets.

Fig. 1. Human annotation process of named entity extraction and recognition. The annotation rules and example set are chosen from the CoNLL 2003 dataset.

However, this formalization of NER is quite different from how humans recognize entities. Figure 1 shows the human convention when annotating entity labels. The annotation rules are first summarized according to human experience and background knowledge. The annotator then annotates a few examples according to the rules and adjusts the rules based on this example set. Finally, the annotation rules and the example set are combined as prior knowledge to carry out the complete data annotation process.

Inspired by this human convention, we propose a new framework that integrates knowledge from annotation rules and an example set. Instead of treating NER as a sequence labeling problem, we formulate it as a deep semantic matching task [5, 14, 22]. Following the principle of the two-phase framework [10], we design three sub-modules: 1) Prior Knowledge Encoding: encode the representation of entity types from the annotation rules and example set; 2) Boundary Detection: predict the start and end indices of candidate entities and extract their representations; 3) Semantic Matching: calculate the similarity between each candidate span and the different types. The input sentence is first sent to the boundary detection module to extract a set of candidates.

At the same time, we combine the annotation rules and example set corresponding to each entity type and encode them to obtain the representation vector of that entity type. In the second phase, we input the representation vectors of each candidate span and the entity types into the semantic matching module. The label of a candidate span is determined by the similarity of the semantic representations between them. To measure the similarities between spans and entity types of different lengths, we introduce the Word Mover’s Distance (WMD) [7], a distance function based on the Earth Mover’s Distance (EMD) [20].

We conduct experiments on public NER datasets to show the effectiveness of our approach. Experimental results show that our deep semantic matching based framework outperforms both sequence labeling and machine reading comprehension based frameworks. In addition, we conduct ablation experiments to verify the influence of different prior knowledge on our method. Our main contributions are summarized as follows:

  • We propose a novel deep semantic matching based NER framework which exploits prior knowledge and is closer to human annotation behavior.

  • Our boundary detection module overcomes the problems of excessive sample size and imbalance between positive and negative samples in previous entity classification methods.

  • We introduce the Word Mover’s Distance into semantic matching modeling for the first time, to directly measure the similarity of sequences of unequal length.

2 Related Work

Named Entity Recognition (NER). Traditional entity recognition methods treat the NER task as a sequence labeling problem and use CRFs as the backbone [8, 25]. More recently, neural models were introduced for NER under the sequence labeling framework. Collobert et al. [2] presented a CNN-CRF structure; Huang et al. [6] first applied the BiLSTM-CRF model to NER; Lample et al. [9] proposed a BiLSTM-CRF model with character-based word representations; Ma and Hovy [12] and Chiu and Nichols [1] extended the BiLSTM-CRF structure with a character CNN to extract features; Strubell et al. [24] proposed an iterated dilated convolutions NER model to accelerate parallel computing on GPUs. With the rise of large-scale pre-trained language models [3, 16, 18, 19], sequence labeling style NER models achieved state-of-the-art performance.

In addition to the recognition of flat entities, there are also studies on nested entities. Previous work was mainly based on the two-phase framework, which first enumerates all possible spans and then predicts their entity types. Following this idea, Sohrab et al. [23] proposed a deep exhaustive model that limits all regions to a specified maximum length. Zheng et al. [28] leveraged entity boundaries to improve the performance of identifying entities.

Moreover, Li et al. [11] migrated the NER task to the machine reading comprehension framework, making the model compatible with recognizing both flat and nested entities.

Semantic Textual Matching. Huang et al. [5] first proposed the deep structured semantic model (DSSM) in the web search area to map a query to its relevant documents at the semantic level. The query and documents are embedded into semantic vectors, the distance between them is computed by cosine similarity, and the semantic matching model is trained on this signal. To address the shortcomings of the bag-of-words model used by DSSM, Shen et al. [22] replaced the DNN with a CNN so that the model can make up for the loss of context. Since the CNN-based model cannot capture features from long-term context, Palangi et al. [14] introduced the LSTM to overcome this problem.

Word Mover’s Distance. Kusner et al. [7] proposed the document distance metric called Word Mover’s Distance (WMD), which can be cast as an instance of the Earth Mover’s Distance (EMD). In statistics, the EMD is a measure of the distance between two probability distributions over a region D. If the distributions are interpreted as two different ways of piling up a certain amount of dirt over the region D, the EMD is the minimum cost of turning one pile into the other, where the cost is the amount of dirt moved times the distance by which it is moved. The concept of the EMD was first introduced by Gaspard Monge [13] in the context of transportation theory. The use of the EMD as a distance measure for monochromatic images was described by Peleg et al. [15]. The name “Earth Mover’s Distance” was proposed by Stolfi [20], and Rubner et al. [20] first used it on the image retrieval task to measure the distance between images.

Fig. 2. Overview of the deep semantic matching entity recognition framework (DSMER).

3 NER as Semantic Matching

Figure 2 shows the architecture of DSMER. Given an input sequence \(X = \{x_{1},x_{2},...,x_{l}\}\), where l denotes the length of the sequence, we extract every candidate entity span from X and then assign a label \(t \in T\) to it through the semantic matching model, where T is the set of all entity types. The framework is a two-phase model composed of three modules. In the first phase, the representations of candidate spans are extracted, and entity types are encoded from prior knowledge such as annotation rules and the example set. In the second phase, we separately measure the similarity between each candidate span and all entity types through the semantic matching module. BERT [3] is used as the encoder in each module of the first phase. The following subsections describe the modules of DSMER in detail.

3.1 Prior Knowledge Encoding

The prior knowledge encoding procedure is important for DSMER, since external text such as annotation rules contains informative semantics and has a significant impact on the final result. Seyler et al. [21] discussed the importance of different categories of external knowledge for the NER task, including name-based, knowledge-base-based, and entity-based knowledge. Li et al. [11] encoded annotation guideline notes as reference queries and achieved a substantial performance boost over the then-SOTA models. In this paper, we take both annotation rules and an example set of entity mentions as prior knowledge. Annotation rules are not only the guidelines provided to the annotators of the dataset but also the Wikipedia definition and synonyms of each entity type.

Let \(E_{t}\) be the representation of entity type t. Given a list of annotation rules \(R = [r_1,r_2,...,r_n]\) and a set of example mentions \(S = \{s_1,s_2,...,s_m\}\), where n and m denote the numbers of rules and mentions, we first encode the annotation rules and the example set separately, and then concatenate their hidden representations as \(E_{t}\):

$$\begin{aligned} E_{t} = tanh(W_{t}[E_{R},E_{S}] + b_{t}) \end{aligned}$$
(1)

where \(E_{R}\) and \(E_{S}\) are both encoded by BERT, and \(W_t\) and \(b_t\) are the trainable weight and bias:

$$\begin{aligned} \begin{aligned}&E_{R} = \frac{1}{n}\sum _{i=1}^{n}BERT(r_{i}) \\&E_{S} = \frac{1}{m} \sum _{j=1}^{m}BERT(s_{j}) \end{aligned} \end{aligned}$$
(2)

In particular, we take only the output context representation at the [CLS] position to compute the average representations of rules and mentions of different lengths.
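To make this concrete, the following is a minimal sketch of the prior knowledge encoder under Eqs. (1)-(2), assuming the HuggingFace transformers implementation of BERT; the class and variable names are illustrative, not from the paper.

```python
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

class PriorKnowledgeEncoder(nn.Module):
    """Encodes an entity type from its annotation rules and example set."""

    def __init__(self, bert_name="bert-base-cased", hidden=768):
        super().__init__()
        self.tokenizer = BertTokenizer.from_pretrained(bert_name)
        self.bert = BertModel.from_pretrained(bert_name)
        # W_t and b_t of Eq. (1): project the concatenation [E_R; E_S].
        self.proj = nn.Linear(2 * hidden, hidden)

    def _mean_cls(self, texts):
        # Encode each text and average the [CLS] representations (Eq. 2).
        batch = self.tokenizer(texts, padding=True, truncation=True,
                               return_tensors="pt")
        cls = self.bert(**batch).last_hidden_state[:, 0]
        return cls.mean(dim=0)

    def forward(self, rules, examples):
        e_r = self._mean_cls(rules)     # E_R over annotation rules
        e_s = self._mean_cls(examples)  # E_S over example mentions
        # E_t = tanh(W_t [E_R, E_S] + b_t)  (Eq. 1)
        return torch.tanh(self.proj(torch.cat([e_r, e_s], dim=-1)))
```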

3.2 Boundary Detection

The boundary detection module is designed to recognize all possible candidate spans in the input sentence X. Previous work [23, 28] simply set a maximum entity length and enumerated all possible spans as the candidate set, which caused an imbalance between positive and negative samples and an excessive number of candidates as the input sequence grows longer. To tackle this problem, we use two binary classifiers: one to predict whether each token is a start index, the other to predict whether it is an end index. Figure 3 shows the architecture of the boundary detection module.

Fig. 3. The workflow of the boundary detection module.

Given the representation matrix \(E_X\) output from BERT,

$$\begin{aligned} E_{X} = BERT(X), \quad E_{X}\in {R^{l\times {d}}} \end{aligned}$$
(3)

where d is the dimension size of the output layer of BERT. The module adopts two fully-connected layers to detect the start and end position indices respectively, by assigning each token a binary tag (0/1).

$$\begin{aligned} P^{i}_{start} = \sigma (W_{start} E_{x_i} + b_{start}) \end{aligned}$$
(4)
$$\begin{aligned} P^{i}_{end} = \sigma (W_{end} E_{x_i} + b_{end}) \end{aligned}$$
(5)

where \(P^{i}_{start}\) and \(P^{i}_{end}\) represent the probabilities of identifying the i-th token in the input sequence X as the start and end position of a candidate span.

After predicting the start and end positions, we combine each start index with every end index greater than it to form a candidate span c, and extract its representation \(E_c = \{E_{x_{start}},E_{x_{end}}\}\) for semantic matching in the next phase.
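Below is a minimal sketch of the boundary detection module (Eqs. 4-5) and the span pairing step in PyTorch; the 0.5 decision threshold and all names are assumptions, since the paper only specifies the two binary classifiers.

```python
import torch
import torch.nn as nn

class BoundaryDetector(nn.Module):
    """Two token-level binary classifiers over BERT representations."""

    def __init__(self, hidden=768):
        super().__init__()
        self.start_fc = nn.Linear(hidden, 1)  # W_start, b_start (Eq. 4)
        self.end_fc = nn.Linear(hidden, 1)    # W_end, b_end (Eq. 5)

    def forward(self, token_reprs):
        # token_reprs: (l, d) representation matrix E_X from BERT (Eq. 3).
        p_start = torch.sigmoid(self.start_fc(token_reprs)).squeeze(-1)
        p_end = torch.sigmoid(self.end_fc(token_reprs)).squeeze(-1)
        return p_start, p_end

def extract_candidates(p_start, p_end, threshold=0.5):
    """Pair each predicted start index with every later predicted end index."""
    starts = (p_start > threshold).nonzero(as_tuple=True)[0].tolist()
    ends = (p_end > threshold).nonzero(as_tuple=True)[0].tolist()
    return [(s, e) for s in starts for e in ends if e > s]
```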

3.3 Semantic Matching

The semantic matching module is a deep neural network following DSSM [5] and CLSM [22]. Figure 4 shows the structure of this module. The ground-truth type \(t^{+} \in T\) should be closer to the candidate span than any other type in the semantic space, so we can use the deep semantic model to calculate the relevance of each pair \((c, t)\).

Fig. 4. The structure of the deep semantic matching module. Let \(t_1\) be the matched entity type of candidate span \(c_i\), and all other types negative examples. Their representations are fed into the model, the similarity of each pair is calculated, and the posterior probability is output through a softmax layer.

To directly measure the difference between two sequences of different lengths, we introduce the Word Mover’s Distance. Given the embedding \(E_c\) of an entity span and the embedding \(E_t\) of an entity type, the WMD cost is calculated by:

$$\begin{aligned} \begin{aligned}&\min \limits _{d_{i,j}\ge 0} \sum _{i,j}d_{i,j}\left\| e_{i} - e'_{j} \right\| \\&\mathrm { s.t.} \sum _{i}d_{i,j}=\frac{1}{l_c},\sum _{j}d_{i,j}=\frac{1}{l_t} \end{aligned} \end{aligned}$$
(6)

where \(l_c\) and \(l_t\) are the lengths of the candidate span and the entity type vector, and \(e_i\) and \(e'_{j}\) are the i-th and j-th embedding vectors in \(E_c\) and \(E_t\). The semantic relevance score between a candidate c and an entity type t is then measured as:

$$\begin{aligned} M(c,t) = WMD(E_{c},E_{t}) \end{aligned}$$
(7)
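As an illustration of Eq. (6), the sketch below solves the WMD transport problem as a small linear program with SciPy. The paper does not specify its solver, and this version is not differentiable, so it only demonstrates the distance computation itself.

```python
import numpy as np
from scipy.optimize import linprog

def wmd(E_c, E_t):
    """Word Mover's Distance between E_c (l_c, d) and E_t (l_t, d), Eq. (6)."""
    l_c, l_t = len(E_c), len(E_t)
    # Pairwise transport costs ||e_i - e'_j||, flattened row-major.
    cost = np.linalg.norm(E_c[:, None, :] - E_t[None, :, :], axis=-1).ravel()
    # Equality constraints: each row of d sums to 1/l_c, each column to 1/l_t.
    A_eq = np.zeros((l_c + l_t, l_c * l_t))
    for i in range(l_c):
        A_eq[i, i * l_t:(i + 1) * l_t] = 1.0
    for j in range(l_t):
        A_eq[l_c + j, j::l_t] = 1.0
    b_eq = np.concatenate([np.full(l_c, 1.0 / l_c), np.full(l_t, 1.0 / l_t)])
    res = linprog(cost, A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
    return res.fun
```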

After obtaining the semantic relevance score, we compute the posterior probability through a softmax function:

$$\begin{aligned} P(t|c) = \frac{exp(M(c,t))}{\sum _{t'\in T}exp(M(c,t'))} \end{aligned}$$
(8)

In particular, we adopt shortcut connections parallel to the linear transformation every other layer, before the activation function, as in ResNet [4]. This eases the training of the deep neural network.
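Putting Eqs. (7)-(8) together, here is a small sketch that scores a candidate span against every entity type with the `wmd` helper above and normalizes the scores with a softmax. The `type_embs` mapping is an assumed data structure, and the sketch exponentiates the raw distance exactly as Eq. (8) is written.

```python
import numpy as np

def type_posterior(span_emb, type_embs):
    """span_emb: (l_c, d) array; type_embs: dict of type name -> (l_t, d)."""
    names = list(type_embs)
    # M(c, t) = WMD(E_c, E_t) for every entity type t (Eq. 7).
    scores = np.array([wmd(span_emb, type_embs[t]) for t in names])
    # Softmax over types (Eq. 8); negate `scores` here if a
    # smaller-distance-means-higher-probability behavior is desired.
    exp = np.exp(scores - scores.max())
    return dict(zip(names, exp / exp.sum()))
```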

3.4 Loss Function

At training time, X is paired with two label sequences \(Y_{start}\) and \(Y_{end}\) that give the ground-truth boundary labels of each token \(x_i\). We use the binary cross-entropy loss for the prediction of the start and end indices:

$$\begin{aligned} L_{start} = BCE(P_{start}, Y_{start}) \end{aligned}$$
(9)
$$\begin{aligned} L_{end} = BCE(P_{end}, Y_{end}) \end{aligned}$$
(10)

The parameters of the semantic matching module are estimated to maximize the likelihood of \(t^{+}\). Equivalently, we minimize the following loss function:

$$\begin{aligned} L_{match} = -log\prod _{(c,t^{+})}P(t^{+}|c) \end{aligned}$$
(11)

The overall training objective to be minimized is as follows:

$$\begin{aligned} L = \alpha L_{start} + \beta L_{end} + \gamma L_{match} \end{aligned}$$
(12)

where \(\alpha ,\beta ,\gamma \in [0,1]\) are hyper-parameters that control the contributions of the different modules. The three losses from the two phases of DSMER are jointly trained, with the BERT parameters shared.
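A minimal sketch of the joint objective in Eqs. (9)-(12) follows; the weight values are placeholders for the hyper-parameters in Table 2, and `match_log_probs` is assumed to hold \(\log P(t^{+}|c)\) for each candidate in the batch.

```python
import torch
import torch.nn.functional as F

def total_loss(p_start, p_end, y_start, y_end, match_log_probs,
               alpha=1.0, beta=1.0, gamma=1.0):
    l_start = F.binary_cross_entropy(p_start, y_start.float())  # Eq. (9)
    l_end = F.binary_cross_entropy(p_end, y_end.float())        # Eq. (10)
    # Eq. (11): -log prod P(t+|c) = -sum log P(t+|c) over candidates.
    l_match = -match_log_probs.sum()
    return alpha * l_start + beta * l_end + gamma * l_match     # Eq. (12)
```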

At test time, candidate spans are first extracted by the boundary detection module. Then the semantic matching model measures the similarity between each candidate span and the entity types, yielding the final predictions.

4 Experiments and Discussions

In this section, we conduct experiments on several public datasets and compare DSMER with models from different NER frameworks. The following subsections describe the implementation details and the ablation analysis.

4.1 Datasets and Preprocessing

Datasets. We use the corpora provided by the CoNLL 2003 Shared Task [26] and OntoNotes 5.0 [17] to evaluate the model presented in this paper. CoNLL 2003 is an English dataset with four types of named entities: Location, Organization, Person, and Miscellaneous. OntoNotes 5.0 includes 18 named entity types, consisting of 11 entity types (Person, Organization, etc.) and 7 value types (Date, Percent, etc.).

Data Reconstruction. Most NER corpora provide labeled data for the sequence labeling framework. Different from other NER frameworks, DSMER needs to extract the rules from the annotation documents and randomly sample part of the entities of each type from the raw dataset.

For each training set, we randomly choose 10% of the annotated entities as the example set and keep the remaining 90% as the training set as usual. The statistical details are listed in Table 1. We also test ratios of 5%, 15%, 20%, and 40% in the following experiments.

Table 1. The entity statistics of preprocessed datasets.

As for the boundary detection module, the training data requires binary labels for the start and end indices. The ground-truth entity labels are converted into two lists for start and end positions, which are set to 1 only when the token is a boundary of an entity.
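A minimal sketch of this conversion is shown below; the (start, end) span format with an inclusive end index is an assumption about the preprocessed data.

```python
def spans_to_boundary_labels(seq_len, entities):
    """entities: list of (start, end) token spans, end index inclusive."""
    y_start = [0] * seq_len
    y_end = [0] * seq_len
    for start, end in entities:
        y_start[start] = 1  # token opens an entity
        y_end[end] = 1      # token closes an entity
    return y_start, y_end

# Example: "John Smith lives in New York" with PER(0,1) and LOC(4,5):
# spans_to_boundary_labels(6, [(0, 1), (4, 5)])
# -> ([1, 0, 0, 0, 1, 0], [0, 1, 0, 0, 0, 1])
```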

4.2 Implementation Details

We use fastNLP to implement the model and evaluate all experiments. The DSMER model uses BERT as the skeleton. To ensure the comparison isolates the effectiveness of the semantic matching method, we only use BERT-base as the semantic encoder in all the comparison experiments below. All experiments are run on an Nvidia Tesla V100 GPU, whose 32 GB memory accommodates larger batch sizes.

Table 2. Hyper-parameter settings.

We train the model using the AdamW optimizer with an initial learning rate of 2e-5, and adjust the learning rate with a warm-up mechanism on a linear schedule. To avoid gradient explosion, gradient clipping is used as a callback during training. The semantic matching module of DSMER follows the deep structured neural network in [5]: we use 5 fully connected layers, and the input dimension for candidate spans and entity types is 300. All other hyper-parameter details are listed in Table 2.

4.3 Experimental Results

To verify the effectiveness of DSMER, we choose classic and SOTA models under different NER frameworks for comparison. For the sequence labeling framework, we vary the encoder module connected to the CRF among Bi-LSTM, IDCNN, and Transformer, and BERT is also introduced for the pretrain-and-finetune framework. Finally, we use the MRC-BERT model to represent the machine reading comprehension framework. All comparison results on CoNLL 2003 and OntoNotes 5.0 are listed in Tables 3 and 4.

Table 3. Comparison with other NER models on CoNLL 2003.

Because we use BERT-base as the model skeleton, we also report experimental results without the annotation rules and without the example set, respectively, to verify the effectiveness of the semantic matching framework itself.

Experimental results on CoNLL 2003 show a slight improvement by DSMER without the example set, while a significant improvement is achieved when only the example set is used. At the same time, we observe that using both the example set and the annotation rules does not improve all metrics. This is because the example set can better represent the scope of the entity type in the semantic space, whereas the descriptive text of the annotation rules may introduce a certain offset, which in turn affects the calculation of semantic similarity.

Table 4. Comparison with other NER models on OntoNotes 5.0.

Similar results are observed in the experiments on the OntoNotes 5.0 dataset. However, the use of annotation rules can still improve the F1 score, so we consider them effective prior knowledge. The comparative experiments show that DSMER can handle NER problems. We conduct further ablation experiments in Subsect. 4.4 to analyze the impact of different model designs on performance.

4.4 Ablation Studies

The Impact of Example Set. As shown in Tables 3 and 4, whether the example set is used has a great influence on model performance. To observe the impact of the size of the example set, we split the dataset according to the split ratios of Subsect. 4.1 and test on the CoNLL 2003 dataset. The results are shown in Table 5:

Table 5. The impact of the percentage of example set, experiments on CoNLL 2003.

It can be seen that the 10% and 15% split ratios work best. As the proportion of the example set increases, the overall performance decreases due to the lack of training data. Since all entities in the example set are phrases that express their entity type, a large number of entity examples can better locate the entity type in the high-dimensional semantic space, making the distance calculation between candidate spans and entity types more accurate. But as the example set grows, the shrinking training data makes the model prone to overfitting. Dataset segmentation is thus a trade-off; for comparison with other models, we choose 10% as the split ratio.

The Impact of Annotation Rules. How the annotation rule sentence is constructed also has a significant influence on the final results. In this subsection, we explore different sources for constructing annotation rules and their influence, including:

  • Annotation guideline: the annotation rule from the dataset documents, like “find organizations including companies, agencies and institutions”.

  • Wikipedia: the Wikipedia definition of the entity type, like “an organization is an entity comprising multiple people, such as an institution or an association”.

  • Synonyms: words or phrases from a dictionary that mean nearly the same as the entity type word, like “association”.

  • All above: encode the three sources above and use the average representation.

Table 6. Results of different types of annotation rules on CoNLL 2003.

Table 6 shows the experimental results on CoNLL 2003. DSMER outperforms BERT-tagger with each type of annotation rule. Among them, the annotation guideline works best among the three categories, because it is the text description closest to the entity annotation. At the same time, the combined usage of the three different kinds of rules achieves a further performance improvement.

5 Conclusion

In this paper, we introduce a novel framework for the named entity recognition task that reflects the natural entity annotation process of human beings. The proposed model obtains state-of-the-art results on public datasets, which indicates the effectiveness of DSMER. The deep semantic matching based framework shows a possible new paradigm for tackling such problems. We would like to explore more variants of the framework in the future.