
1 Introduction

Nested Named Entity Recognition (NNER) refers to the simultaneous recognition of multiple nested levels of named entities in text. For example, in Fig. 1 the phrase “哈尔滨医科大学附属第一医院 (First Affiliated Hospital of Harbin Medical University)” contains three entities: “哈尔滨 (Harbin)” is a LOC entity, “哈尔滨医科大学 (Harbin Medical University)” is an ORG entity, and “哈尔滨医科大学附属第一医院 (First Affiliated Hospital of Harbin Medical University)” is an ORG entity. These entities overlap one another and are therefore nested. NNER has a wide range of applications in information extraction, question answering systems, natural language understanding, and other fields. However, most current NNER research focuses on English corpora, while relatively few studies address Chinese corpora. There are notable differences between Chinese and English nested named entity recognition. First, the vocabulary structure of Chinese differs from that of English: the fundamental unit of composition in Chinese is the character, whereas in English it is the letter. Identifying nested named entities in Chinese therefore requires accounting for more complex linguistic structures and features, such as polyphonic characters, ambiguous words, and word order, which makes Chinese NNER more challenging than its English counterpart. Second, Chinese named entity recognition often needs to resolve ambiguity, because Chinese words frequently have multiple meanings, and contextual information must be considered to determine the correct entity type. Owing to these ambiguities and context dependencies, named entity recognition is more challenging in Chinese than in English. Incorporating external information is therefore even more crucial for Chinese nested named entity recognition than for English.

Fig. 1. Nested entity structure example

Integrating external knowledge has been shown to be effective in various natural language processing tasks, such as text classification [1], semantic matching [2], text generation [3], and named entity recognition [4]. Among these, dictionary-enhanced approaches have demonstrated notable improvements on the Chinese named entity recognition task. For Chinese, BERT operates at character-level granularity and cannot capture multi-character words as a whole, which hinders the identification of entity boundaries. Incorporating additional lexical knowledge is therefore crucial for improving the accuracy of named entity recognition. Existing methods, such as Lattice LSTM [5] and FLAT [6], feed extra vocabulary information into the model alongside the sentence sequence and apply a specific attention mechanism to process them separately. However, this results in longer sequences, increasing computation time and memory consumption and introducing noise into the semantic representation. Recently, Liu et al. [7] proposed LEBERT, a model that integrates external dictionary information into the middle layers of BERT as an additional module, achieving promising results. In this study, we apply the LEBERT dictionary-enhanced approach to Chinese nested named entity recognition to improve its performance.

In the field of nested named entity recognition, span enumeration is one of the prevailing approaches. Sohrab and Miwa [8] proposed exhaustively enumerating all possible spans up to a specified length by concatenating the outputs of start- and end-position LSTMs, which are then used to compute a score for each span. To overcome the length limitation in predicting entities, a bi-affine structural model can be employed. By constructing the token-token table in parallel, the bi-affine decoder generates a global view of the sentence, including vector representations of all possible spans, thus improving efficiency. This approach has been demonstrated to be effective in several works, such as Dozat and Manning (2017) [9] and Yu et al. (2020) [10]. In recent nested named entity recognition research, Yan et al. [11] treated the feature matrix as an image and used CNNs to model the spatial relationships between adjacent spans in the scoring matrix, which yielded significant improvements in task performance.

To improve the performance of Chinese nested named entity recognition, this paper proposes a dictionary-assisted method that captures richer semantics. The model constructs character-word pairs from phrases matched against a wiki dictionary and integrates them into the middle layers of BERT, fully utilizing its representational capacity. Chinese phrases carry richer semantic information than single characters, so introducing dictionary information enriches the features. The model uses a bi-affine structure to obtain a global view over spans, avoiding the limitation of enumerating spans only up to a specific length. Additionally, the local interaction between spans is modeled with a Convolutional Neural Network (CNN) to capture the spatial correlation between adjacent spans. Finally, the model's robustness is enhanced with an R-drop based contrastive learning approach. The proposed model is tailored to the characteristics of the Chinese language and improves the accuracy and efficiency of Chinese nested named entity recognition.

The main contributions of this work are as follows:

  1) Proposing a simple and effective model for Chinese nested named entity recognition, aimed at improving the accuracy and efficiency of the task.

  2) Considering that phrases provide richer semantics and better handle Chinese nested entity structures, integrating dictionary information into BERT to achieve deep lexical knowledge fusion, and applying R-drop based contrastive learning to enhance the robustness and generalization ability of the model while reducing overfitting.

  3) Evaluating and validating the proposed method on both Chinese flat and nested datasets, and comparing it with baseline models, achieving the best results.

2 Related Work

Currently, methods for nested named entity recognition can be classified into four main categories: 1) improved sequence labeling frameworks, which design labeling schemes that allow sequence labeling to handle nested named entities; 2) hypergraph-based methods, which use a hypergraph structure to represent nested structures effectively; 3) parsing tree-based methods, which treat nested named entity recognition similarly to constituency parsing; and 4) span-based methods, which first exhaustively enumerate candidate spans and then assign each a category.

2.1 Improved Sequence Labeling Framework

Traditional sequence labeling methods, such as Hidden Markov Models and Conditional Random Fields, are usually inadequate for dealing with nested named entities. Improved sequence labeling methods, however, can handle nested entities by introducing additional features and constraints. In 2018, Ju et al. [13] proposed a dynamically stacked flat NER method, which treats each layer of flat NER as a single-layer sequence labeling task and extracts entities from the inside out; however, this approach is prone to error propagation. To model multiple named entity labels, Strakova et al. [14] proposed a linearized encoding scheme that combines all categories that may co-occur in pairs to generate new labels (e.g., combining B-Location with B-Organization to construct a new label B-Loc | Org). Shibuya et al. [15] provided a second-best path solution that treats the label sequence of nested entities as the second-best path within the span of their parent entities, extracting entities from the outside in. To identify nested named entities from the bottom up, Li et al. [16] proposed a Chinese NER model based on a self-attention aggregation mechanism, which connects a series of multi-layer sequence labeling sub-models. Wang et al. [17] designed a pyramid framework to recognize nested entities. Improved sequence labeling methods are straightforward and convenient to use, but they do not model the nesting relationship with sufficient accuracy.

2.2 Hypergraph-Based Approaches

A hypergraph is a graphical structure in which a node can be associated with multiple edges; it can be used to model the nested structure of a sentence, where each entity is a node and the nesting relationship between entities is an edge. Hypergraph-based methods aim to better capture dependencies between entities by using hypergraphs, and they typically transform the reasoning problem on hypergraphs into an integer linear programming problem. Lu et al. [18] proposed a joint entity extraction and classification model for nested NER that can effectively capture nested entities of unbounded length. Katiyar et al. [19] extracted a hypergraph representation from an RNN and trained the model using greedy search. Wang et al. [20] proposed a segmental hypergraph representation that avoids structural ambiguity. Luo et al. [21] proposed a bipartite flat-graph structure that uses a flat NER module for the outermost entities and a graph module for all inner entities, performing two-way information interaction between the layers. Although hypergraph-based methods can explicitly capture nested entities, they require careful hypergraph design to handle complex reasoning problems and may result in long running times.

2.3 Parsing Tree-Based Methods

Parsing tree-based methods use tree-based algorithms to analyze the relationships between nested entities, similar to the constituency parse trees used in syntactic analysis. A parse tree is constructed in a bottom-up or top-down manner, and different features can be used for classification. In 2009, Finkel et al. [22] proposed converting a sentence into a constituency tree, with each entity corresponding to a phrase in the tree and a root node connecting the entire sentence. Fu et al. [23] proposed treating nested NER as constituency parsing with partially observed trees, where all labeled entity spans are observed nodes in the constituency tree and other spans are latent nodes. Lou et al. [24] improved Fu's method with a two-stage strategy and a head-aware loss, effectively exploiting entity head information. Yang et al. [25] proposed a new pointer network for bottom-up nested NER and constituency parsing. Parsing tree-based methods can accurately capture nested relationships but require more computing resources.

2.4 Span-Based Methods

Span-based methods are among the most widely used approaches for nested NER. These methods enumerate potential spans in a sentence and classify each one. Some approaches exhaustively list all possible spans, such as Sohrab et al.'s method [25], but this is computationally intensive. Others, like Lin et al.'s [26], first locate an anchor word and then match the entire span for classification, but this only works for specific structures. Xia et al. [27] proposed a multi-granularity NER method that includes a detector for entity locations and a classifier for entity types. The boundary-aware model proposed by Zheng et al. [28] uses sequence labeling to determine span boundaries before classification. Yu et al. [29] applied the bi-affine model to nested NER, pinpointing spans and scoring each one using start and end markers. Xu et al. [30] proposed a supervised multi-head self-attention mechanism, where each head identifies one category, with a boundary detection module as an auxiliary task. Finally, Shen et al. [31] developed a two-stage method that generates candidate spans by filtering and boundary regression of seed spans before assigning the corresponding category.

3 Model

This paper proposes the KBCNNER model. As shown in Fig. 2, the model consists of three parts: the first is the dictionary information module, which integrates the matched character-word pair information into the intermediate layers of BERT; the second is the bi-affine decoder layer, which obtains a global view of the sentence; and the third is the Convolutional Neural Network (CNN) layer, which models the relationship between adjacent spans using a CNN.

Fig. 2. KBCNNER model diagram

3.1 Import Dictionary Information

Define the input as a Chinese sentence S = {c1, c2, …, cn}, where n is the number of characters in the sentence. Two operations are then performed on the input sentence S in parallel. In the first, the BERT embedding layer extracts the vector representation of each character, yielding E = {e1, e2, …, en}, which is fed into the Transformer encoder for the following computation:

$$ G = \mathrm{LN}\left(H^{l-1} + \mathrm{MHAttn}\left(H^{l-1}\right)\right) $$
(1)
$$ H^{l} = \mathrm{LN}\left(G + \mathrm{FFN}(G)\right) $$
(2)

where \(H^{l} = \left\{ h_{1}^{l}, h_{2}^{l}, \ldots, h_{n}^{l} \right\}\) represents the output of the l-th Transformer layer.

LN denotes layer normalization, MHAttn is the multi-head attention mechanism, and FFN is a two-layer feed-forward network with ReLU as the activation function.
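For concreteness, the following is a minimal PyTorch sketch of the post-layer-norm Transformer encoder layer written in Eqs. (1)-(2); the layer sizes match BERT-base, but the class and variable names are illustrative rather than the authors' code.

```python
# A minimal PyTorch sketch of one Transformer encoder layer as written in
# Eqs. (1)-(2) (post-layer-norm, BERT style). Dimensions are illustrative.
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    def __init__(self, d_model=768, n_heads=12, d_ff=3072):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                 nn.Linear(d_ff, d_model))
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)

    def forward(self, h):                      # h: (batch, n, d_model)
        a, _ = self.attn(h, h, h)              # MHAttn(H^{l-1})
        g = self.ln1(h + a)                    # Eq. (1): G = LN(H + MHAttn(H))
        return self.ln2(g + self.ffn(g))       # Eq. (2): H^l = LN(G + FFN(G))
```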

In the second operation, the sentence S is matched against the dictionary D, which is prepared in advance, to construct character-word pairs, as shown in Fig. 3. Specifically, a trie is first built from the dictionary D; we then iterate through all character subsequences of the input sentence and match them against the trie, obtaining a list of candidate words. For example, the sentence “中国人民” matches “中国”, “中国人”, “国人”, and “人民”. Each matched word is then assigned to the characters it contains; for example, the matched word “中国” is assigned to the characters “中” and “国”. Pairing each character with its matched words yields the character-word pairs \(s_{cw} = \left\{ \left( c_{1}, ws_{1} \right), \ldots, \left( c_{i}, ws_{i} \right), \ldots, \left( c_{n}, ws_{n} \right) \right\}\), where \(c_{i}\) denotes the i-th character in the sentence and \(ws_{i}\) denotes the matched words assigned to \(c_{i}\).
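This matching step can be sketched as follows; the toy dictionary and the maximum word length are illustrative assumptions, not the wiki dictionary actually used in the paper.

```python
# A minimal sketch of the character-word matching step: build a trie from the
# dictionary D and, for every character position, collect all dictionary words
# that start there, then assign each matched word to the characters it covers.

def build_trie(words):
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node["#end"] = True          # marks a complete dictionary word
    return root

def char_word_pairs(sentence, trie, max_len=4):
    pairs = [[] for _ in sentence]   # pairs[i] = words assigned to character i
    for i in range(len(sentence)):
        node = trie
        for j in range(i, min(i + max_len, len(sentence))):
            node = node.get(sentence[j])
            if node is None:
                break
            if "#end" in node:       # sentence[i:j+1] is a dictionary word
                for k in range(i, j + 1):
                    pairs[k].append(sentence[i:j + 1])
    return list(zip(sentence, pairs))

D = ["中国", "中国人", "国人", "人民"]   # toy dictionary for illustration
print(char_word_pairs("中国人民", build_trie(D)))
# [('中', ['中国', '中国人']), ('国', ['中国', '中国人', '国人']),
#  ('人', ['中国人', '国人', '人民']), ('民', ['人民'])]
```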

Fig. 3. Character-word pairs

Next, the Fusion module in Fig. 4 injects the lexical information into BERT. Its input is a character-word pair \((h_{i}^{c}, x_{i}^{ws})\), where \(h_{i}^{c}\) is the character vector output by a certain Transformer layer of BERT, and \(x_{i}^{ws} = \left\{ x_{i1}^{w}, x_{i2}^{w}, \ldots, x_{im}^{w} \right\}\) is the set of word embeddings assigned to the i-th character, with m the number of matched words. The j-th word embedding in \(x_{i}^{ws}\) is \(x_{ij}^{w} = e^{w}(w_{ij})\), where \(e^{w}\) is a pre-trained word embedding lookup table and \(w_{ij}\) is the j-th word in \(ws_{i}\). A nonlinear transformation aligns the dimensions of the word representations with those of the character representations:

$$ v_{ij}^{w} = W_{2}\left(\tanh\left(W_{1} x_{ij}^{w} + b_{1}\right)\right) + b_{2} $$
(3)

where \(W_{1} \in \mathbb{R}^{d_{c} \times d_{w}}\), \(W_{2} \in \mathbb{R}^{d_{c} \times d_{c}}\), and \(b_{1}\) and \(b_{2}\) are bias terms; \(d_{w}\) and \(d_{c}\) denote the dimension of the word embeddings and the hidden size of BERT, respectively.

To select the most relevant words among all matched words, we introduce a character-to-word attention mechanism. We denote all \(v_{ij}^{w}\) assigned to the i-th character as \(V_{i} = (v_{i1}^{w}, v_{i2}^{w}, \ldots, v_{im}^{w})\). The relevance of each word is computed as:

$$ a_{i} = \mathrm{softmax}\left(h_{i}^{c} W_{\mathrm{attn}} V_{i}^{\top}\right) $$
(4)

where \(W_{\mathrm{attn}}\) is the weight matrix of the bilinear attention. Consequently, the weighted sum of all words is obtained by:

$$ z_{i}^{w} = \sum_{j=1}^{m} a_{ij} v_{ij}^{w} $$
(5)

Finally, the weighted lexicon information is injected into the character vector by:

$$ \tilde{h}_{i} = h_{i}^{c} + z_{i}^{w} $$
(6)

The fused vectors are fed into the remaining Transformer layers, finally yielding \(H^{l} = \left\{ h_{1}^{l}, h_{2}^{l}, \ldots, h_{n}^{l} \right\}\).

Fig. 4. Structure of the Fusion module. The inputs are a character vector and its paired word features. Bilinear attention between the character and the words weights the lexical features into a single vector, which is added to the character-level vector, followed by layer normalization.
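A minimal PyTorch sketch of the Fusion module of Fig. 4 and Eqs. (3)-(6) is given below; the hidden sizes (768 for BERT, 200 for the word embeddings) follow Sect. 4.2, while the class and variable names are illustrative, not the authors' implementation.

```python
# A minimal PyTorch sketch of the Fusion module (Eqs. (3)-(6)): project the
# matched word embeddings to the BERT hidden size, weight them with bilinear
# character-to-word attention, and add the result to the character vector,
# followed by layer normalization as in Fig. 4.
import torch
import torch.nn as nn

class LexiconFusion(nn.Module):
    def __init__(self, d_c=768, d_w=200):
        super().__init__()
        self.w1 = nn.Linear(d_w, d_c)          # Eq. (3): W1 x + b1
        self.w2 = nn.Linear(d_c, d_c)          # Eq. (3): W2 tanh(.) + b2
        self.w_attn = nn.Parameter(torch.randn(d_c, d_c))  # bilinear attention
        self.ln = nn.LayerNorm(d_c)

    def forward(self, h_c, x_ws):
        # h_c:  (batch, n, d_c)      character vectors from a BERT layer
        # x_ws: (batch, n, m, d_w)   m matched-word embeddings per character
        v = self.w2(torch.tanh(self.w1(x_ws)))            # (b, n, m, d_c)
        scores = torch.einsum("bnd,de,bnme->bnm", h_c, self.w_attn, v)
        a = scores.softmax(dim=-1)                        # Eq. (4)
        z = torch.einsum("bnm,bnmd->bnd", a, v)           # Eq. (5)
        return self.ln(h_c + z)                           # Eq. (6) + LayerNorm
```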

3.2 Bi-affine Decoder

The vector representation of each character is fed into the bi-affine decoder and mapped to a scoring matrix R of size L × L × |k|, as shown in Fig. 5, where L is the sentence length, k ∈ {1, …, |k|} indexes the entity types, and |k| is the number of entity types. Specifically, each span (i, j) can be expressed as a tuple (i, j, k), where i and j are the start and end indices of the entity. After BERT encoding, the embeddings of the tokens at positions i and j are hi, hj ∈ Rd, where d is the hidden size of the embeddings. The score of a span (i, j) is computed as:

$$ f(i, j) = h_{i}^{\top} U h_{j} + W\left[h_{i}; h_{j}\right] + b $$
(7)

where U is a d × |k| × d tensor, W is a 2d × |k| matrix, and b is the bias.

Fig. 5. Scoring matrix R. Each entry of the matrix is a |k|-dimensional vector representing the distribution over named entity categories for the text span at that position.
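The bi-affine scorer of Eq. (7) can be sketched as follows; it produces the L × L × |k| matrix R of Fig. 5 in one batched computation. Shapes, names, and the number of entity types are illustrative assumptions.

```python
# A minimal PyTorch sketch of the bi-affine scorer in Eq. (7): every (i, j)
# token pair receives a |k|-dimensional score vector, yielding R of Fig. 5.
import torch
import torch.nn as nn

class BiaffineScorer(nn.Module):
    def __init__(self, d=768, n_types=9):
        super().__init__()
        self.U = nn.Parameter(torch.randn(d, n_types, d) * 0.01)
        self.W = nn.Linear(2 * d, n_types)     # W[h_i; h_j] + b

    def forward(self, h):                      # h: (batch, L, d)
        b, L, d = h.shape
        # Bilinear term: h_i^T U h_j for every pair (i, j)
        bilinear = torch.einsum("bid,dke,bje->bijk", h, self.U, h)
        # Concatenation term: broadcast h_i and h_j to all pairs
        hi = h.unsqueeze(2).expand(b, L, L, d)
        hj = h.unsqueeze(1).expand(b, L, L, d)
        linear = self.W(torch.cat([hi, hj], dim=-1))
        return bilinear + linear               # R: (batch, L, L, n_types)
```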

3.3 CNN on Score Matrix

The scoring matrix can be viewed as an image with |k| channels and spatial size L × L; a Convolutional Neural Network (CNN), widely used in computer vision, is then applied to model this spatial relationship:

$$ {\mathrm{R}^{\prime}} = {\mathrm{Conv2d}}({\mathrm{R}}) $$
(8)
$$ {\mathrm{R}^{\prime\prime}} = {\mathrm{GELU}}({\mathrm{LayerNorm}}({\mathrm{R}^{\prime}} + {\mathrm{R}})) $$
(9)

where Conv2d is a 2D CNN whose convolution kernel slides over the two-dimensional space, LayerNorm is layer normalization applied over the feature dimension, and GELU is the activation function. Since sentences contain different numbers of tokens, their score matrices R have different shapes. To ensure consistent results when processing R in batches, the 2D CNN uses no bias and R is zero-padded, as shown in Fig. 6.

Fig. 6. CNN

We use a perceptron to get the prediction logits as follows:

$$ {\mathrm{P}} = {\mathrm{Sigmoid}}({\mathrm{w}}_{0} \left( {{\mathrm{R}^{{\prime}}} + {\mathrm{R}^{{\prime\prime}}} } \right) + {\mathrm{b}}) $$
(10)

where \(w_{0} \in \mathbb{R}^{|k| \times d}\), \(b \in \mathbb{R}^{|k|}\), and \(P \in \mathbb{R}^{L \times L \times |k|}\).
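A minimal sketch of Eqs. (8)-(10) is given below: the score matrix is treated as an image with |k| channels, convolved without bias and with zero padding, combined with a residual connection, LayerNorm, and GELU, and finally projected to per-span probabilities. The layer sizes and the input/output projection dimensions are illustrative assumptions.

```python
# A minimal PyTorch sketch of Eqs. (8)-(10) on the score matrix R.
import torch
import torch.nn as nn

class SpanCNN(nn.Module):
    def __init__(self, n_types=9, kernel=3):
        super().__init__()
        self.conv = nn.Conv2d(n_types, n_types, kernel, padding=kernel // 2,
                              bias=False)          # Eq. (8): no bias, zero padding
        self.ln = nn.LayerNorm(n_types)
        self.out = nn.Linear(n_types, n_types)     # Eq. (10): w0 (.) + b

    def forward(self, R):                          # R: (batch, L, L, n_types)
        r = R.permute(0, 3, 1, 2)                  # to (batch, channels, L, L)
        r1 = self.conv(r).permute(0, 2, 3, 1)      # R', back to (b, L, L, k)
        r2 = torch.nn.functional.gelu(self.ln(r1 + R))   # Eq. (9): R''
        return torch.sigmoid(self.out(r1 + r2))    # Eq. (10): P in (0, 1)
```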

3.4 Loss Function

(1) Binary cross-entropy loss

We use binary cross-entropy to compute the loss of the model itself:

$$ \mathcal{L}_{\mathrm{BCE}} = - \sum_{0 \le i, j < L} Y_{ij} \log\left(P_{ij}\right) $$
(11)

where \(Y_{ij}\) is the ground-truth label and \(P_{ij}\) is the predicted probability.

The tags of the score matrix are symmetric, namely, the tag in the (i, j)-th entry is the same as that in the (j, i)-th entry. At inference time, we compute the scores in the upper triangle as:

$$ \widehat{P}_{ij} = \left(P_{ij} + P_{ji}\right)/2 $$
(12)

where i ≤ j. Only the upper-triangle scores are then used to obtain the final prediction.
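The following sketch illustrates Eqs. (11)-(12). Note that Eq. (11) writes only the positive term; the sketch uses the standard two-sided binary cross-entropy, which is an assumption about the intended loss, and the tensor shapes are toy values.

```python
# A minimal sketch of the span loss (Eq. (11)) and the symmetric
# upper-triangle averaging used at inference (Eq. (12)).
import torch

def bce_loss(P, Y):
    # P, Y: (L, L, n_types); standard two-sided BCE (assumption, see text)
    return torch.nn.functional.binary_cross_entropy(P, Y, reduction="sum")

def symmetric_scores(P):
    # Eq. (12): average each entry with its transpose; keep the upper triangle
    return (P + P.transpose(0, 1)) / 2

L, k = 4, 3
P, Y = torch.rand(L, L, k), torch.zeros(L, L, k)
print(bce_loss(P, Y).item(), symmetric_scores(P).shape)
```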

(2) Contrastive learning based on R-drop

To enhance the robustness and generalization ability of the model and reduce overfitting, this paper adopts contrastive learning based on R-drop. During training, because dropout randomly discards some hidden units, feeding the same sentence into the model twice yields two different vector representations that share the same label. This data augmentation requires no modification to the network structure; only a KL-divergence loss term is added, so no extra noise is introduced.

To construct positive examples, the dropout-based data augmentation feeds a sample sentence into the model twice, obtaining two probability distributions p(i, j) and p+(i, j) through the BERT, bi-affine, and CNN modules. To construct negative examples, M score tensors of size K × L × L are initialized from a Gaussian distribution, their losses against the labels are computed, and the N tensors with the largest losses are selected as the negative examples \(p_{\overline{n}}(i, j)\). The purpose is to introduce noise and increase the robustness of the model while avoiding an excessive negative impact on training. The contrastive learning loss is expressed as:

$$ \mathcal{L}_{\mathrm{KL}} = \frac{\mathrm{KL}\left(p(i, j) \,\|\, p^{+}(i, j)\right)}{\sum_{n=0}^{N} \mathrm{KL}\left(p(i, j) \,\|\, p_{\overline{n}}(i, j)\right)} $$
(13)

The goal is to minimize the KL divergence for positive examples and maximize it for negative examples, thereby improving the training of the model.

(3) Final Loss Function

The final loss function is expressed as

$$ {\mathcal{L}} = {\mathcal{L}}_{{{\mathrm{BCE}}}} + \, {\mathcal{L}}_{{{\mathrm{KL}}}} $$
(14)
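Under the reconstruction of Eq. (13) above (a ratio of the positive-pair KL to the summed negative-pair KLs), the contrastive term can be sketched as follows; the exact combination, the element-wise Bernoulli-style KL, and all shapes are illustrative assumptions. In training, this term would be added to the binary cross-entropy loss as in Eq. (14).

```python
# A minimal sketch of the R-drop style contrastive term: two forward passes
# of the same sentence give p and p_pos, N randomly initialised score tensors
# give the negatives, and the loss pushes the positive-pair KL down while
# pushing the negative-pair KLs up (assumed ratio form, see text).
import torch

def kl(p, q, eps=1e-8):
    # KL(p || q) for element-wise probabilities in (0, 1)
    p, q = p.clamp(eps, 1 - eps), q.clamp(eps, 1 - eps)
    return (p * (p / q).log() + (1 - p) * ((1 - p) / (1 - q)).log()).sum()

def rdrop_contrastive_loss(p, p_pos, negatives):
    pos = kl(p, p_pos)
    neg = sum(kl(p, p_n) for p_n in negatives)
    return pos / (neg + 1e-8)      # minimise positive KL, maximise negative KL

L, k, N = 4, 3, 2
p, p_pos = torch.rand(L, L, k), torch.rand(L, L, k)
negatives = [torch.rand(L, L, k) for _ in range(N)]
print(rdrop_contrastive_loss(p, p_pos, negatives).item())
```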

3.5 Entity Decoding

First, all spans with a predicted probability below 0.5 are discarded. The remaining spans are sorted by predicted probability in descending order, and the span with the highest current probability is selected in turn: if it does not conflict with the previously decoded named entities, it is decoded as a new named entity; otherwise it is discarded. Iterating in this way yields all non-conflicting named entities predicted by the model for the input sequence.
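A minimal sketch of this greedy decoding procedure follows; treating partially overlapping (crossing) spans as conflicts while allowing fully nested spans is our reading of the description above, not a detail stated in the paper.

```python
# Greedy, non-conflicting span decoding: drop spans below 0.5, sort the rest
# by probability, and keep a span only if it neither crosses nor exactly
# duplicates an already accepted span (containment, i.e. nesting, is allowed).

def decode(span_probs, threshold=0.5):
    # span_probs: list of ((start, end, label), probability)
    candidates = [(s, p) for s, p in span_probs if p >= threshold]
    candidates.sort(key=lambda x: x[1], reverse=True)
    selected = []
    for (i, j, label), p in candidates:
        clash = any(
            (i < a <= j < b) or (a < i <= b < j) or (i, j) == (a, b)
            for (a, b, _) in selected
        )
        if not clash:
            selected.append((i, j, label))
    return selected

spans = [((0, 2, "LOC"), 0.9), ((0, 6, "ORG"), 0.8),
         ((1, 3, "PER"), 0.7), ((0, 2, "ORG"), 0.4)]
print(decode(spans))   # [(0, 2, 'LOC'), (0, 6, 'ORG')]
```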

4 Experimental Analysis

4.1 Dataset

We conduct experiments on both Chinese nested NER datasets and flat NER datasets. For Chinese nested NER we use the “人民日报” dataset and the Chinese medical dataset CMeEE, and for Chinese flat NER we use the Weibo and Resume datasets.

The “人民日报” dataset comes from the news domain and contains three entity types: person names, place names, and organization names. Nested entities account for about 12.81% of all entities. CMeEE is the Chinese Medical Entity Extraction dataset. It contains nine types of medical entities, such as common pediatric diseases, body parts, clinical manifestations, and medical procedures. Nesting is allowed in the “clinical manifestations” category, within which entities of the other eight types may appear. The Weibo dataset was built by filtering the historical data of Sina Weibo from November 2013 to December 2014 and contains 1890 Weibo messages; its entities are divided into four categories: people, organizations, addresses, and geopolitical entities. The Resume dataset was built by screening and manually labeling resume summaries of senior managers of listed companies from Sina Finance; it contains 1027 resume summaries, with entity annotations in eight categories: name, nationality, place of origin, race, major, degree, institution, and job title. The statistics of these datasets are shown in Table 1.

Table 1. The statistics of the datasets

4.2 Experimental Settings

The BERT-base-chinese pre-trained model, with 12 hidden layers, 768-dimensional outputs, 12 self-attention heads, and about 110M parameters, is used in this study; it is pre-trained on Simplified and Traditional Chinese texts. The 200-dimensional pretrained word embeddings of Song et al. [32], trained on news and webpage texts with a directional skip-gram model, are used, and the dictionary D is built from texts such as Wikipedia and Baidu Baike. The Adam optimizer with a learning rate of 2e-5 is used for model optimization during training. The maximum number of epochs on all datasets is 30, and the maximum input length is 150. The character-word pair information is fused between the 1st and 2nd Transformer layers of BERT, and both BERT and the pretrained word embeddings are fine-tuned during training. The CNN convolution kernel size is set to 3. An entity is considered correct only when both the predicted class and the predicted span are exactly correct. Evaluation metrics are Precision (P), Recall (R), and F1-score (F1). The hyperparameter settings are summarized in Table 2.
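For reference, the main hyper-parameters of this section can be collected in a single illustrative configuration dictionary (not the authors' actual code):

```python
# The main hyper-parameters from Sect. 4.2, gathered as an illustrative config.
config = {
    "pretrained_model": "bert-base-chinese",   # 12 layers, 768-dim, 12 heads, ~110M params
    "word_embedding_dim": 200,                 # pretrained embeddings of Song et al. [32]
    "fusion_position": 1,                      # fuse between the 1st and 2nd Transformer layers
    "optimizer": "Adam",
    "learning_rate": 2e-5,
    "max_epochs": 30,
    "max_input_length": 150,
    "cnn_kernel_size": 3,
}
```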

Table 2. The hyper-parameter in this paper

4.3 Analysis of Results

Baselines:

  • LSTM-CRF [33]: The LSTM-CRF model is a traditional sequence labeling model composed of two parts. The LSTM (Long Short-Term Memory) network maps each input element into a high-dimensional vector space by learning contextual information in the input sequence and can handle variable-length sequences; a CRF layer then decodes the label sequence.

  • BERT-CRF [34]: The BERT-CRF model is a sequence labeling model based on a pre-trained Transformer model. It first uses the pre-trained BERT model to encode the input sequence and obtain context-aware word embeddings. These embeddings capture rich semantic information of the input sequence and are fed to the CRF layer for label prediction. Compared with traditional models such as LSTM-CRF, BERT-CRF better captures rich semantic information and the dependency relationships between labels.

  • LEBERT-CRF [8]: The LEBERT-CRF model is a sequence labeling model that integrates external lexical knowledge directly into the BERT layers. Specifically, it incorporates external lexical knowledge through a lexicon adapter module and uses a linear transformation layer to fuse the external knowledge with the internal embeddings; a CRF layer is then used for label decoding. The advantage of LEBERT-CRF is that it integrates external knowledge directly into the model, thereby improving performance.

To assess the efficacy of the model proposed in this study, we compared its experimental outcomes against those of the baseline model across four datasets. The comparative findings are presented in Tables 3 and 4.

Table 3. Nested dataset comparison experiment results

According to the results in Table 3, on the Chinese nested dataset “人民日报”, both the BERT-CRF and LEBERT-CRF models outperformed the LSTM-CRF in terms of precision, recall, and F1-score. The proposed method in this paper achieved higher precision, recall, and F1-score (96.08%, 96.13%, and 96.11%, respectively) than the corresponding values of the other three models. Specifically, compared with the LEBERT-CRF model, our proposed method showed improvements of 3.57%, 2.64%, and 3.11%, indicating better performance on this dataset. On the nested dataset CMeEE, our proposed method achieved better precision, recall, and F1-score than the other three models. Compared with the F1-score values of the LSTM-CRF, BERT-CRF, and LEBERT-CRF models (47.00%, 56.45%, and 57.35%, respectively), the proposed method achieved an F1-score of 65.22%, representing improvements of 18.22%, 8.77%, and 7.87%, respectively. These results demonstrate that our proposed method achieves better performance and usability than the other three models on this dataset.

Table 4. Flat dataset comparison experiment results

According to the results in Table 4, the proposed method achieves significant improvements in precision, recall, and F1 score on the flat dataset Weibo compared to the LSTM-CRF and BERT-CRF models. Specifically, compared to the LSTM-CRF model, the proposed method improves precision, recall, and F1 score by 17.80, 11.62, and 13.53 percentage points, respectively. Compared to the BERT-CRF model, it improves precision, recall, and F1 score by 13.70, 7.20, and 12.40 percentage points, respectively. Compared to the LEBERT-CRF model, the proposed method has similar precision but significantly higher recall and F1 score, improving by 6.85 and 1.82 percentage points, respectively. These results demonstrate that the proposed method outperforms the other three models, indicating superior classification ability and generalization on this dataset.

On the flat dataset Resume, the proposed method also exhibits excellent performance, with higher precision (96.96%), recall (96.35%), and F1 score (96.65%) than the other three models. Specifically, compared to the LSTM-CRF model, the proposed method improves precision, recall, and F1 score by 1.15, 2.24, and 1.19 percentage points, respectively. Compared to the BERT-CRF model, it improves precision, recall, and F1 score by 1.59, 1.51, and 1.54 percentage points, respectively. Compared to the LEBERT-CRF model, it improves precision, recall, and F1 score by 1.21, 1.25, and 1.23 percentage points, respectively. The proposed method therefore performs clearly better on the flat dataset Resume than the other three models.

5 Ablation Study

To verify the effectiveness of the proposed method for Chinese nested named entity recognition, we conduct ablation experiments on the Chinese nested datasets “人民日报” and CMeEE. We remove individual components and run four experiments: 1) the complete model, which uses dictionary assistance to incorporate character-word pair information into BERT, uses bi-affine encoding to obtain a 3D feature matrix, treats the feature matrix as an image so that a Convolutional Neural Network (CNN) models the local interaction between spans and exploits the spatial correlation between adjacent spans, and finally outputs all non-conflicting named entities predicted for the input sequence; 2) removing the CNN module, skipping formulas (8)–(10), and obtaining the predicted entities directly after bi-affine decoding; 3) removing the dictionary-assisted module, skipping formulas (3)–(6), and using only character information; 4) removing the contrastive learning module, skipping formula (13), and using only binary cross-entropy as the loss function. The experimental results are shown in Table 5.

Table 5. The comparison between our full model and ablated models

On the nested dataset “人民日报”, the performance of the model decreased slightly after removing the CNN module, with precision, recall, and F1 score decreasing by 1.37%, 2.05%, and 1.02%, respectively. Removing the dictionary-assisted module had a greater impact, with precision, recall, and F1 score decreasing by 3.73%, 4.26%, and 3.55%, respectively. Removing the contrastive learning module had a relatively small impact, with precision, recall, and F1 score decreasing by 0.33%, 0.03%, and 0.69%, respectively. On the nested dataset CMeEE, the F1 score of our approach (65.22%) was higher than in the other three experiments, with the experiment that removed the dictionary-assisted module achieving the lowest F1 score. When the CNN module was removed, the F1 score decreased by 2.72%, indicating that the CNN module is advantageous for modeling local interactions between spans on this dataset. The dictionary-assisted module had a significant effect on Chinese nested named entity recognition, with a marked drop in F1 score after its removal. The effect of the contrastive learning module was relatively stable, and its removal also led to a slight decrease in F1 score, indicating that the module enhances the model's robustness and generalization ability while reducing overfitting. Overall, the various components contribute to the model's performance to varying degrees.

6 Conclusion

In this paper, we propose KBCNNER, a dictionary-assisted Chinese nested named entity recognition model. Matching words are obtained from a dictionary, and the resulting character-word pairs are integrated into BERT. Because Chinese phrases carry richer semantic information than single characters, introducing dictionary information enriches the features and yields richer semantics. The bi-affine structure provides a global view over spans. At the same time, the feature matrix is treated as an image, and the local interaction between spans is modeled with a Convolutional Neural Network (CNN), which improves the recognition accuracy of nested entities. Finally, contrastive learning based on R-drop is adopted to enhance the robustness of the model. In the experiments, the model is evaluated on Chinese nested NER datasets (“人民日报”, CMeEE) and flat NER datasets (Weibo, Resume). We also conduct ablation experiments to analyze in detail the influence of the main components on performance. The model achieves better performance than the baseline models on all datasets, indicating strong adaptability and versatility across different domains and datasets.