
1 Introduction

Named entity recognition (NER) has long been regarded as a fundamental task in natural language processing. Previously, flat NER was treated as a sequence labeling task that assigns a label to each word in a sentence [12, 25, 34]. This formulation assumes that entities are short and do not overlap. However, in real applications, as illustrated in Fig. 1(a), an organizational noun may be nested inside a personal noun. The emergence of nested entities makes this assumption no longer applicable. Therefore, it is necessary to design a model that can identify both flat and nested entities. Recent methods for nested NER can be divided into four categories: 1) sequence labeling methods have been extended to identify nested entities, for example by stacking flat NER layers [9, 23]; however, such practice is prone to error propagation. 2) Hypergraph-based methods represent all entity segments as graph nodes and combine them into a hypergraph [17]; however, these methods suffer from structural errors and structural ambiguities during inference. 3) Sequence-to-sequence (Seq2Seq) methods generate entities directly [29], which leads to an inefficient decoding process and the common drawbacks of Seq2Seq models, such as exposure bias. 4) Span-based methods enumerate all spans in a sentence and classify them, taking the boundaries as the key to the span representation [19, 35]; however, boundaries alone cannot effectively detect complex nested entities [32], so focusing only on them is not comprehensive.

Fig. 1. (a) An example sentence with nested entities from ACE2005. (b) Information that can help determine the entity.

As shown in Fig. 1, the information available for identifying a span includes not only the boundaries but also auxiliary information such as inside tokens, labels, related spans, and relative positions. Utilizing this information helps solve the entity recognition problem. Although some works exploit part of it [6, 27], several issues remain. Firstly, enumerating all possible spans in a sentence to obtain related spans is computationally expensive. Secondly, existing methods leverage only part of the aforementioned auxiliary information, and most overlook the importance of relative positions. Lastly, the use of related spans involves subjective selection, which can introduce errors.

To solve the problems mentioned above, we propose a simple but effective method, Auxiliary Information Enhanced Span-based NER (AIESNER), that simultaneously utilizes all of the above-mentioned auxiliary information. Our method follows two steps: entity extraction and entity classification. In the entity extraction stage, we design an adaptive convolution layer that contains a position-aware module, a dilated gated convolution (DGConv) module, and a gate module. These modules dynamically acquire position-aware head and tail representations of spans through two single-layer fully connected layers and capture relationships between close and distant words. By collecting connections between words at different distances, the information-aware layer obtains auxiliary information, while the head and tail representations provide the boundaries; the necessary parts are then incorporated into the span representations. Because span representations have different association strengths under different labels in the entity classification stage, we design an information-agnostic layer that applies a multi-head self-attention mechanism to establish a span-level correlation for each label. To avoid excessive attention to auxiliary information, this layer emphasizes the boundaries by using only the head and tail representations.

To prove the effectiveness of the proposed model, we conduct experiments on six NER datasets, three nested and three flat. For the nested datasets, our model achieves \(F _{1}\) scores of 87.73, 87.23, and 81.40 on ACE2004, ACE2005, and GENIA, respectively. For the flat datasets, our model achieves \(F _{1}\) scores of 97.07, 73.81, and 93.07 on Resume, Weibo, and CoNLL2003, respectively. Using BERT as the encoder, the proposed model outperforms the state-of-the-art methods on ACE2004, ACE2005, Resume, and Weibo, and obtains comparable results on GENIA and CoNLL2003. Our contributions are summarized as follows:

  • To the best of our knowledge, this is the first work to use boundaries together with the complete auxiliary information (i.e., inside tokens, labels, related spans, and relative positions), which is more efficient and reduces subjective interference.

  • Our method involves no hand-crafted rules and requires no external knowledge resources to achieve promising results, so it can be easily adapted to domain-specific data in most usage scenarios.

  • Experiments show that our proposed method performs better than existing span-based methods and achieves state-of-the-art performance on ACE2004, ACE2005, Resume, and Weibo.

2 Related Work

2.1 Nested NER

Here we mainly focus on four categories of nested NER methods that are most related to our work: sequence labeling methods, hypergraph-based methods, sequence-to-sequence methods, and span-based methods.

By stacking flat NER layers, sequence labeling methods can obtain nested entities, but this leads to error propagation. Using dynamic stacking of flat decoding layers, [9] construct revised representations of the entities identified in lower layers and provide the identified entities to the next layer. Later work improves this method by designing a reverse pyramid structure to achieve a reverse flow of information [23]. Others divide NER into two steps: merging entities and sequence labeling [4].

The hypergraph-based method was first proposed by [14] as a solution to the problem of nested entities and has been further consolidated and enhanced by subsequent work [16, 22]. These methods require complex structures to deal with nested entities and suffer from structural errors and structural ambiguities during inference.

Span-based methods enumerate all span representations in a sentence and predict their types. The span representation of an entity can be obtained in various ways [11, 19, 31]. Several works introduce external knowledge resources, such as machine reading comprehension (MRC) [13] and dependency relations [10], for span prediction. Span-based methods can identify entities and their corresponding types directly [11], or split the process into two stages, entity extraction and entity classification [19, 26, 27]. Compared with these previous methods, our method exploits the auxiliary information that the span representation possesses.

Seq2Seq methods generate entities directly. [5] first proposed a Seq2Seq model whose input is the original sentence and whose output consists of the entity start position, entity length, and entity type. [29] combine the Seq2Seq model with a BART-based pointer network. Other methods use contrastive learning [33], generative adversarial networks [8], and reinforcement learning [24] for entity recognition.

3 Approach

Figure 2 shows an overview of our approach, which consists of four main layers: the encoder layer, the adaptive convolution layer, the information-aware layer, and the information-agnostic layer.

Fig. 2. The architecture of our method. MLP represents a multi-layer perceptron. \(\oplus \) and \(\otimes \) represent concatenation and dot-product operations.

3.1 Encoder Layer

We follow [11] to encode the text. Given the input sentence \(X =\left\{ x_{1}, x_{2}, \ldots , x_{N}\right\} \) of N tokens, we first generate contextual embeddings of word pieces using BERT [3] and combine them via max-pooling to produce word representations. We then adopt a BiLSTM [7] to enhance the word representations. Finally, the sentence X is represented as word representations H:

$$\begin{aligned} H&= \left\{ h_{1}, h_{2}, \ldots , h_{N}\right\} \in \mathbb {R}^{N \times d_{w}} \end{aligned}$$
(1)

where \(d_{w}\) denotes the dimension of the word representation, and N is the length of the input sentence.
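For concreteness, the following is a minimal PyTorch-style sketch of this encoder layer; the class name, the specific BERT checkpoint, and the word-piece pooling implementation are our own illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class EncoderLayer(nn.Module):
    """BERT word-piece embeddings -> max-pooled word representations -> BiLSTM (Eq. 1)."""
    def __init__(self, bert_name="bert-base-cased", d_w=512):
        super().__init__()
        self.bert = AutoModel.from_pretrained(bert_name)
        hidden = self.bert.config.hidden_size
        self.bilstm = nn.LSTM(hidden, d_w // 2, batch_first=True, bidirectional=True)

    def forward(self, input_ids, attention_mask, word2pieces):
        # word2pieces: [B, N, L] boolean mask, True where word piece l belongs to word n
        pieces = self.bert(input_ids, attention_mask=attention_mask).last_hidden_state  # [B, L, h]
        B, N, L = word2pieces.shape
        expanded = pieces.unsqueeze(1).expand(B, N, L, pieces.size(-1)).clone()
        expanded = expanded.masked_fill(~word2pieces.unsqueeze(-1), float("-inf"))
        # max-pool the word pieces of each word (padded words with no pieces stay -inf
        # and should be masked downstream)
        words = expanded.max(dim=2).values        # [B, N, h]
        H, _ = self.bilstm(words)                 # word representations H: [B, N, d_w]
        return H
```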

3.2 Adaptive Convolution Layer

Position-Aware Module. To represent the head and tail of a span, we use two single-layer fully connected layers to transform each \(h_{i}\) into the head and tail vector spaces, obtaining the head and tail representations. In addition, since position plays an essential role in identifying entities [28], we attach the position embedding from [20] to the word representation:

$$\begin{aligned} h_{i}^{s}&= \left( W_{s} h_{i}+b_{1}\right) \otimes R_{i}\end{aligned}$$
(2)
$$\begin{aligned} h_{i}^{t}&= \left( W_{t} h_{i}+b_{2}\right) \otimes R_{i}\end{aligned}$$
(3)
$$\begin{aligned} H_{\delta }&=\left\{ h_{1}^{\delta }, h_{2}^{\delta }, \ldots , h_{N}^{\delta }\right\} \end{aligned}$$
(4)

where \(W_{s}, W_{t} \in \mathbb {R}^{d_{w} \times d_{h}}\) and \(b_{1}, b_{2} \in \mathbb {R}^{d_{h}}\) are trainable parameters and bias terms, respectively. \(R_{i}\) is the position embedding of the i-th word, and \(\otimes \) denotes element-wise multiplication. \(\delta \in \{s, t\}\), where s and t denote the head and tail, respectively.
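A sketch of the position-aware module follows. We realize the multiplicative position embedding \(R_{i}\) as a rotary-style embedding; this reading of [20] is our assumption, chosen because it is consistent with the property used later in Eq. (14), and all dimensions are illustrative.

```python
import torch
import torch.nn as nn

def rotary_position_embedding(x):
    """Apply a rotary-style position embedding R_i to x of shape [B, N, d] (d even).

    This realizes the multiplicative position term in Eqs. (2)-(3); treating R_i as
    a rotary embedding is an assumption on our part.
    """
    B, N, d = x.shape
    pos = torch.arange(N, dtype=torch.float, device=x.device)                       # [N]
    freq = 10000 ** (-torch.arange(0, d, 2, dtype=torch.float, device=x.device) / d)
    angle = pos[:, None] * freq[None, :]                                             # [N, d/2]
    sin, cos = angle.sin(), angle.cos()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

class PositionAwareModule(nn.Module):
    """Project words into head/tail spaces and attach positions (Eqs. 2-4)."""
    def __init__(self, d_w, d_h):
        super().__init__()
        self.head = nn.Linear(d_w, d_h)   # W_s, b_1
        self.tail = nn.Linear(d_w, d_h)   # W_t, b_2

    def forward(self, H):                 # H: [B, N, d_w]
        H_s = rotary_position_embedding(self.head(H))   # Eq. (2)
        H_t = rotary_position_embedding(self.tail(H))   # Eq. (3)
        return H_s, H_t
```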

DGConv Module. We feed the head and tail representations into the same convolution module, which allows the head and tail to learn each other's word representations without introducing additional parameters. To capture the interactions between words at different distances, we use multiple DGConvs with different dilation rates r \(\text{(e.g., } r \in [1,2,5])\). The gated convolution avoids gradient vanishing and controls the information flow, and these interactions form the auxiliary information. Each dilated gated convolution can be expressed as:

$$\begin{aligned} {\text {DGConv}}(H_{\delta })&=D_{1} \otimes H_{\delta }+\left( {\textbf {1}}-D_{1}\right) \otimes \phi \left( D_{2}\right) \end{aligned}$$
(5)
$$\begin{aligned} C_{\delta }^{r}&=\sigma \left( {\text {DGConv}} \left( H_{\delta }\right) \right) \end{aligned}$$
(6)

where \(D_{1}\) and \(D_{2}\) are parameter-independent 1-dimensional convolutions with \(H_{\delta }\) as input, \(\sigma \) and \(\phi \) are the ReLU and sigmoid activation functions, respectively, \(\otimes \) is element-wise multiplication, and 1 is a 1-vector with its dimension matching \(D_{1}\). After that, we concatenate the outputs at the different dilation rates to obtain \(C_{\delta }=\left[ C_{\delta }^{1}; C_{\delta }^{2}; C_{\delta }^{5}\right] \in \mathbb {R}^{N \times 3 d_{h}}\) and feed it into a multi-layer perceptron (MLP) to reduce the dimension:

$$\begin{aligned} Q_{\delta }=MLP\left( C_{\delta }\right) \in \mathbb {R}^{N \times d_{h}} \end{aligned}$$
(7)
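The DGConv module can be sketched as below, following Eqs. (5)-(7) as written; the kernel size, the MLP activation, and the other hyperparameters are assumptions on our part.

```python
import torch
import torch.nn as nn

class DGConvModule(nn.Module):
    """Dilated gated convolutions at several dilation rates, then an MLP (Eqs. 5-7)."""
    def __init__(self, d_h, dilations=(1, 2, 5), kernel_size=3):
        super().__init__()
        self.convs1, self.convs2 = nn.ModuleList(), nn.ModuleList()
        for r in dilations:
            pad = r * (kernel_size - 1) // 2          # keep sequence length unchanged
            self.convs1.append(nn.Conv1d(d_h, d_h, kernel_size, padding=pad, dilation=r))  # D_1
            self.convs2.append(nn.Conv1d(d_h, d_h, kernel_size, padding=pad, dilation=r))  # D_2
        self.mlp = nn.Sequential(nn.Linear(len(dilations) * d_h, d_h), nn.GELU())

    def forward(self, H_delta):                       # H_delta: [B, N, d_h]
        x = H_delta.transpose(1, 2)                   # [B, d_h, N] for Conv1d
        outs = []
        for conv1, conv2 in zip(self.convs1, self.convs2):
            d1, d2 = conv1(x), conv2(x)
            gated = d1 * x + (1 - d1) * torch.sigmoid(d2)   # Eq. (5), as written
            outs.append(torch.relu(gated))                   # Eq. (6), sigma = ReLU
        C = torch.cat(outs, dim=1).transpose(1, 2)           # [B, N, 3*d_h]
        return self.mlp(C)                                    # Eq. (7): Q_delta, [B, N, d_h]
```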

Gate Module. Since previous work [6, 32] demonstrated that the boundaries are effective, we balance the original word representation against the representation extracted at different distances and thereby filter out unnecessary information. The gate module is defined as follows:

$$\begin{aligned} r_{\delta }&=W_{1} H_{\delta }+W_{2} Q_{\delta }+b \end{aligned}$$
(8)
$$\begin{aligned} O_{\delta }&=r_{\delta } \otimes H_{\delta }+\left( {\textbf {1}}-r_{\delta }\right) \otimes Q_{\delta } \end{aligned}$$
(9)

where \(H_{\delta }\) and \(Q_{\delta }\) are from Eqs. 4 and 7, \(W_{1}, W_{2} \in \mathbb {R}^{d_{h} \times d_{h}}\) and \(b \in \mathbb {R}^{d_{h}}\) are trainable parameters and a bias term, respectively, 1 is a 1-vector with its dimension matching \(H_{\delta }\), and \(\otimes \) is element-wise multiplication. Finally, the gate outputs give the head and tail representations:

$$\begin{aligned} S&=O_{s}=\left\{ s_{1}, \ldots , s_{N}\right\} \in \mathbb {R}^{N \times d_{h}}\end{aligned}$$
(10)
$$\begin{aligned} T&=O_{t}=\left\{ t_{1}, \ldots , t_{N}\right\} \in \mathbb {R}^{N \times d_{h}} \end{aligned}$$
(11)
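Eqs. (8)-(9) amount to a learned interpolation between \(H_{\delta }\) and \(Q_{\delta }\); a minimal sketch, following the equations literally with the gate \(r_{\delta }\) used as written, is:

```python
import torch.nn as nn

class GateModule(nn.Module):
    """Balance the original representation H_delta with the convolved Q_delta (Eqs. 8-9)."""
    def __init__(self, d_h):
        super().__init__()
        self.w1 = nn.Linear(d_h, d_h, bias=False)    # W_1
        self.w2 = nn.Linear(d_h, d_h)                # W_2 and bias b

    def forward(self, H_delta, Q_delta):             # both [B, N, d_h]
        r = self.w1(H_delta) + self.w2(Q_delta)      # Eq. (8)
        # Eq. (9): O_delta, used as S (delta = s) and T (delta = t) in Eqs. (10)-(11)
        return r * H_delta + (1 - r) * Q_delta
```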

3.3 Information-Aware Layer

To integrate boundaries and auxiliary information into the span representation, we obtain \({\text {Span}}(i, j)\) by the dot product of \(s_{i}\) and \(t_{j}\), where T denotes transposition:

$$\begin{aligned} {\text {Span}}(i, j)=s_{i}^{T} t_{j} \end{aligned}$$
(12)

\({\text {Span}}(i, j)\in \mathbb {R}^{1 \times 1}\) indicates the region of a candidate span from the i-th word to the j-th word in a sentence. Owing to the filtering of the gate module [1] and the local attention of the convolution [2], the model can learn to discriminate the importance of words acquired at different distances. Thus, \(s_{i}\) and \(t_{j}\) themselves yield the boundary, while the close and distant words strongly associated with the current word pair (\(s_{i}\), \(t_{j}\)) serve as the inside tokens and related spans, respectively:

$$\begin{aligned} (A+B)^{T}(C+D)&= A^{T}C+A^{T}D+B^{T}C+B^{T}D \end{aligned}$$
(13)

Here we simplify the process and ignore the weights. As in Fig. 1, suppose the current entity \({\text {Span}}(i,j)\) is [U.S. Army], where A represents U, B represents Persian, C represents Army, and D represents Gulf. \(A+B\) represents the word representation of U after it has absorbed the information of Persian from the upper layers. \(A^{T}C\) represents the boundary, and \(B^{T}D\) represents the required related span. Thus, instead of enumerating all spans, \({\text {Span}}(i,j)\) can obtain boundaries, inside tokens, and related spans, while the model can learn weights to adjust their importance. Additionally, the relative positions can be determined through the position embedding attached to the word representation. Taking the boundary as an example:

$$\begin{aligned} \begin{aligned} \left( R_{i} h_{i}^{s}\right) ^{T}\left( R_{j} h_{j}^{t}\right)&=h_{i}^{s T} R_{j-i} h_{j}^{t} \end{aligned} \end{aligned}$$
(14)

where \(R_{i}\) and \(R_{j}\) are the position embeddings of the i-th and j-th words mentioned earlier (Eqs. 2 and 3). Related spans and inside tokens can likewise acquire their relative positions.
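Computationally, Eq. (12) is just a batched matrix product between the head and tail representations. A sketch of the scoring follows; masking the cells with j < i is our assumption, consistent with a span's head never following its tail.

```python
import torch

def span_scores(S, T):
    """Entity-extraction scores Span(i, j) = s_i^T t_j for all word pairs (Eq. 12).

    S, T: [B, N, d_h] head/tail representations from the adaptive convolution layer.
    Returns a [B, N, N] matrix whose (i, j) entry scores the span from word i to word j.
    """
    scores = torch.einsum("bid,bjd->bij", S, T)
    # mask spans whose tail precedes their head (assumed invalid)
    lower = torch.tril(torch.ones_like(scores[0], dtype=torch.bool), diagonal=-1)
    return scores.masked_fill(lower, float("-inf"))
```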

3.4 Information-Agnostic Layer

Excess auxiliary information cluttering the span representation during entity classification may cause incorrect boundary predictions, so the boundaries become more significant in this layer. To learn the correlation intensities of span representations with different labels, motivated by the multi-head self-attention mechanism, we set the number of heads to the number of entity types and apply attention operations. We denote by \(\text {c}_{\alpha }(i, j)\) the correlation intensity of \({\text {Span}}(i, j)\) under label \(\alpha \), and use only the boundaries to form the span representation.

$$\begin{aligned} \text {c}_{\alpha }(i, j)=W_{\alpha }^{T}\left[ h_{i}^{s} ; h_{j}^{t}\right] \end{aligned}$$
(15)

where \(\alpha \in \{1,2, \ldots ,|T|\}\) and |T| is the number of labels. \(W_{\alpha } \in \mathbb {R}^{2 d_{h}}\) is a trainable parameter, and [;] denotes the concatenation operation. \(h_{i}^{s}\) and \(h_{j}^{t}\) are from Eq. 2 and Eq. 3. We combine the results of entity extraction and entity classification to get the final span score:

$$\begin{aligned} p_{i, j}^{\alpha }={\text {Span}}(i, j)+\text {c}_{\alpha }(i, j) \end{aligned}$$
(16)
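A sketch of how Eqs. (15)-(16) can be vectorized over all labels and word pairs is shown below; the parameter layout (one weight vector \(W_{\alpha }\) per label, split into head and tail halves) is an implementation assumption.

```python
import torch
import torch.nn as nn

class InformationAgnosticLayer(nn.Module):
    """Per-label correlation scores from boundary representations only (Eqs. 15-16)."""
    def __init__(self, d_h, num_labels):
        super().__init__()
        # one weight vector W_alpha per label, stacked: [|T|, 2*d_h]
        self.w = nn.Parameter(torch.empty(num_labels, 2 * d_h))
        nn.init.xavier_uniform_(self.w)

    def forward(self, H_s, H_t, span):               # H_s, H_t: [B, N, d_h]; span: [B, N, N]
        d_h = H_s.size(-1)
        w_s, w_t = self.w[:, :d_h], self.w[:, d_h:]
        # c_alpha(i, j) = W_alpha^T [h_i^s ; h_j^t] decomposes into a head term and a tail term
        c = torch.einsum("bid,ad->bai", H_s, w_s).unsqueeze(-1) \
          + torch.einsum("bjd,ad->baj", H_t, w_t).unsqueeze(-2)   # [B, |T|, N, N]
        return span.unsqueeze(1) + c                               # Eq. (16): p_{i,j}^alpha
```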

3.5 Training and Inference Details

During training, we follow [21], which generalizes the softmax cross-entropy loss to multi-label classification and effectively addresses the imbalance between positive and negative labels. In addition, as in [31], we set a threshold \(\gamma \) to determine whether a span belongs to label \(\alpha \). The loss function is formulated as follows:

$$\begin{aligned} \begin{aligned} \mathcal {L}_{\alpha }=\log \!\! \left( \!\!e^{\gamma }\!+\!\!\!\!\sum _{\!(i, j) \in \varOmega _{\alpha }} \!\!e^{-p_{i, j}^{\alpha }\!\!}\right) \!\!+\log \!\! \left( \!\!e^{\gamma }\!+\!\!\!\!\sum _{\!(i, j) \notin \varOmega _{\alpha }} \!\!e^{p_{i, j}^{\alpha }\!\!}\right) \end{aligned} \end{aligned}$$
(17)

where \(\varOmega _{\alpha }\) represents the set of entity spans belonging to label \(\alpha \), and \(\gamma \) is set to 0. Finally, we sum the loss over all labels to get the total loss:

$$\begin{aligned} \mathcal {L}=\sum _{\alpha \in \varepsilon }\mathcal {L}_{\alpha } \end{aligned}$$
(18)

where \(\varepsilon =\left\{ 1,2,...,\left| T \right| \right\} \) and |T| is the number of labels.

During inference, any span satisfying \(p_{i,j}^\alpha >0\) is output as an entity belonging to label \(\alpha \).
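Under these definitions, Eqs. (17)-(18) and the decoding rule can be sketched as follows. This is a minimal version assuming the span scores are arranged as a [batch, |T|, N, N] tensor and that invalid or padded cells have already been masked; the function names are ours.

```python
import torch

def multilabel_ce_loss(scores, labels, gamma=0.0):
    """Eqs. (17)-(18): multi-label softmax cross-entropy with threshold gamma (0 in the paper).

    scores: [B, |T|, N, N] span scores p_{i,j}^alpha.
    labels: same-shape 0/1 tensor marking the gold spans of each label.
    """
    pos_term = (-scores).masked_fill(labels == 0, float("-inf"))   # keep e^{-p} for gold spans only
    neg_term = scores.masked_fill(labels == 1, float("-inf"))      # keep e^{p} for non-entities only
    thr = torch.full_like(scores[..., :1, :1], gamma).flatten(2)   # the e^{gamma} term, [B, |T|, 1]
    loss_pos = torch.logsumexp(torch.cat([thr, pos_term.flatten(2)], dim=-1), dim=-1)
    loss_neg = torch.logsumexp(torch.cat([thr, neg_term.flatten(2)], dim=-1), dim=-1)
    return (loss_pos + loss_neg).sum()                              # sum over labels (Eq. 18) and batch

def decode(scores, gamma=0.0):
    """Inference: every span with p_{i,j}^alpha > gamma is emitted for label alpha."""
    b, a, i, j = torch.nonzero(scores > gamma, as_tuple=True)
    return list(zip(b.tolist(), a.tolist(), i.tolist(), j.tolist()))
```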

4 Experiments

4.1 Datasets

To evaluate the performance of our model on the two NER subtasks, we conduct experiments on six datasets.

Flat NER Datasets. We conduct experiments on the English dataset CoNLL2003 and the Chinese datasets Resume and Weibo, employing the same experimental settings as previous work [29].

Nested NER Datasets. We conduct experiments on GENIA, ACE2004, and ACE2005. For ACE2004 and ACE2005, we use the same dataset split as [14]. For GENIA, we follow [11] in using five entity types and splitting train/dev/test as 8.1:0.9:1.0.

Table 1. Results for flat NER datasets. \(\dagger \) represents our re-implementation with their code.
Table 2. Results for nested NER datasets.

4.2 Results for Flat NER

We evaluate our model on CoNLL2003, Weibo, and Resume. As shown in Table 1, the \(F_{1}\) scores of our model are 93.07, 73.81, and 97.07, respectively, outperforming the representatives of the other method categories (+0.02 on CoNLL2003, +3.31 on Weibo, +0.45 on Resume). Compared with other span-based methods, our model achieves the best \(F_{1}\) scores on Resume (+0.41 vs. baseline+BS) and Weibo (+1.14 vs. baseline+BS), reaching state-of-the-art results, and achieves competitive results on CoNLL2003 (+0.00 vs. W2NER). Furthermore, our model achieves the best precision, demonstrating the effectiveness of the auxiliary information we incorporate.

4.3 Results for Nested NER

Table 2 shows the performance of our model on ACE2004, ACE2005, and GENIA. The \(F_{1}\) scores of our model are 87.73, 87.23, and 81.40, respectively, substantially outperforming the representatives of the other method categories (+0.89 on ACE2004, +1.83 on ACE2005, +0.80 on GENIA) and proving the advantage of span-based methods for nested NER. Compared with other span-based methods, our model outperforms previous state-of-the-art methods in \(F_{1}\) on ACE2004 (+0.12 vs. CNN-NER) and ACE2005 (+0.41 vs. Triaffine), and achieves competitive performance on GENIA (+0.00 vs. CNN-NER).

Table 3. Model ablation studies (\(F_{1}\) scores). DGConv(r=1) denotes the convolution with dilation rate 1. “-” means the module is removed.

4.4 Ablation Study

As shown in Table 3, we ablate or replace each part of the model on ACE2005, Weibo, and GENIA. First, we remove the gate module; the resulting performance drop proves the importance of the boundaries. Replacing the gates with direct addition prevents the model from selectively using the obtained information, and the overall performance drop is more pronounced than when only the boundary information is weakened. The model's performance also drops after replacing DGConv with plain dilated convolution (DConv). After removing the adaptive convolution layer or the position embedding, the performance of the model decreases significantly.

Table 4. Case study on ACE2005 and GENIA dataset. The colored brackets indicate the boundary and label of the entity. “AUX infor” is the abbreviation for auxiliary information.

4.5 Case Study

To analyze the effectiveness of the auxiliary information, we show two examples from the ACE2005 and GENIA datasets in Table 4. We remove the position embedding, the DGConv module, and the gate module from the adaptive convolution layer to eliminate the effect of auxiliary information. In the first example, the model misclassifies “Sukhoi Su-27” as “None” and “Su-27” as “VEH” in the absence of auxiliary information; with the help of auxiliary information, the model corrects them to “VEH” and “None”. In the second example, the model successfully corrects “promoter” from “None” to “DNA”. In addition, with the help of the auxiliary information, the confidence \(p_{i, j}^{\alpha }\) of the model for the correct label is significantly improved.

5 Conclusion

In this paper, we propose a span-based method for nested and flat NER. We argue that boundaries and auxiliary information, including relative positions, inside tokens, labels, and related spans, should be used jointly to enhance span representation and classification. To this end, we design a model that automatically learns the correlation between boundaries and auxiliary information, avoiding the errors and tedium of human-defined rules. Experiments show that our method outperforms all span-based methods and achieves state-of-the-art performance on four datasets.