1 Introduction

Information extraction [1] aims to meet individuals' demand for rapid and accurate access to information and knowledge on the Internet. Specifically, information extraction refers to the process of identifying, from extensive unstructured data, the information that caters to user needs. Within this process, Named Entity Recognition (NER) [2, 3], a text mining task, plays a crucial role in converting unstructured text into structured content. It offers robust support for information extraction and enables subsequent downstream applications. Compared to English NER, Chinese NER [4] faces greater challenges, primarily due to the complexities of the Chinese language: word segmentation, the difficulty of determining entity boundaries, and intricate grammatical structures. In Chinese NER, entity boundaries align with character boundaries, and the segmentation process inevitably propagates errors, presenting a significant challenge for accurate entity recognition.

Introducing lexical information can effectively alleviate this issue. To fully leverage lexical information, Lattice-LSTM [5], a novel variant of the LSTM model, was introduced. It matches the words contained in a character sequence against a lexicon and encodes them into a directed acyclic graph. To address the non-parallelizability of Lattice-LSTM, WC-LSTM [6] introduced four word embedding strategies: shortest, longest, average, and self-attention. The FLAT [7] model incorporates a lattice structure and employs fully connected self-attention to capture long-distance dependencies within sequences. LEBERT [8] injects lexical information into the underlying BERT model through a Lexicon Adapter, facilitating the integration of lexical knowledge into the BERT architecture. However, because of polysemy in the Chinese language, the amount of injected lexical information must be carefully balanced.

In recent years, span-based models have gained significant attention in NER research. This approach typically enumerates all candidate spans and classifies them into entity types (including a “non-entity” type). A Bi-LSTM captures the contextual information of the sentence, and a Biaffine attention module [9] then scores each span, which helps predict the entities in the text. Some studies reformulate NER as a machine reading comprehension (MRC) [10, 11] task: entity types serve as queries, and the model is asked whether a given segment belongs to a specific entity type. The W2NER model [12] converts NER into predicting the relationship categories between pairs of words. Span-based methods parallelize well and are easy to decode, so this formulation has been widely adopted [13, 14]. However, previous work has overlooked the spatial relationships between adjacent spans.

By dynamically extracting the words corresponding to each character and feeding them into the Transformer layers of BERT, we ensure consistent dimensions for characters and words using a bilinear attention mechanism, and we fine-tune part of the BERT parameters to fully leverage word information. Furthermore, the spans surrounding a central span exhibit particular relationships with it, as shown in Fig. 1, which can enhance the model’s understanding of contextual information; for further details, please refer to CNN-NER [14]. To leverage these correlations, we employ a biaffine decoder to generate a 3D feature matrix. Treating this feature matrix as an image, we use a CNN to model the local interactions between adjacent spans.

Fig. 1

Spatial relationships between adjacent spans. For instance, o(2-3) represents “京 (Capital) 市 (City)” and is contained in “南京市 (Nanjing city)”, which is a location entity (LOC). It conflicts with “市长 (Mayor)” on the character “市 (City)”, since “市长 (Mayor)” is a person entity (PER). The center span can thus have particular relationships with its surrounding spans (different relations are depicted in different colors)

In summary, our main contributions are as follows:

  • We propose Char–Words pairs, injecting lexical information into the Transformer layers of BERT and adaptively adjusting the word representations.

  • We observed interconnections between adjacent spans. After the BiLSTM layer, we employed a multi-head Biaffine to obtain a span feature matrix. Treating this matrix as an image, we utilized CNN to model the interactions between adjacent spans.

  • We proposed the LB-BMBC (Lexicon BERT + BiLSTM + MHBiaffine + CNN) model and conducted experiments on four Chinese datasets: Resume [5], Weibo [15], Ontonotes [16], and MSRA [17]. Additionally, we performed ablation experiments to validate the effectiveness of the method.

2 Related Work

2.1 Methods Based on Statistical Machine Learning

In previous work, supervised machine learning classification models were employed for NER, including HMM, MEM, SVM, and CRF. Zhang et al. [18] proposed an automatic Chinese person name recognition method based on role tagging with HMM. It identifies and categorizes named entities by maximizing the matching of the best role sequence, addressing challenges such as names lacking distinctive features, internal word formation, and the difficulty of recalling person names within context-dependent word formations. Bikel et al. [19] used an HMM to compute the probability that a word belongs to an entity type based on features such as capitalization, numeric symbols, and sentence-initial position. Zhou et al. [20] were among the first to apply MEM to the recognition of Chinese noun phrases, transforming phrase recognition into a labeling problem: candidate features are extracted from pre-defined feature templates over the corpus, and noun phrases are identified from these candidate features. Zhang et al. [21] proposed an MEM model that combines multiple features, integrating both local and global features; it combines rule-based and machine learning methods and incorporates heuristic knowledge to address efficiency and space issues. Takeuchi et al. [22] applied SVM to NER on the MUC-6 evaluation corpus and in the field of molecular biology, finding that SVM performed well on biological NER. Li Lishuang et al. [23] proposed an SVM-based method for automatic recognition of Chinese place names, incorporating characteristic information of place names as vector features and employing active learning strategies to gradually increase the size of the classifier’s training set, further improving recognition. CRF models, which compute global probabilities and normalize not only locally but also globally, have been widely applied to NER. McCallum et al. [24] proposed a CRF-based feature induction method that automatically induces features, enhancing accuracy while significantly reducing the number of features. Feng et al. [25] introduced small-scale common suffix features into the CRF framework, improving training speed while maintaining recognition accuracy. Yan Yang et al. [26] proposed a stacked CRF approach for NER in Chinese electronic medical records; in the second layer, a feature set containing entity and lexicon information is used to recognize two types of named entities, disease names and clinical symptoms.

Fig. 2

BERT structure

2.2 Methods Based on Deep Learning

Traditional supervised learning methods based on feature engineering consume a significant amount of human effort, and errors often propagate during feature extraction because prior experience does not always apply. In contrast, deep learning methods, with their end-to-end feature extraction capabilities, effectively address this issue. Huang et al. [27] used BiLSTM to extract character-level features of words. Gregoric et al. [28] adopted multiple independent BiLSTM units over the input and promoted diversity among them through inter-model regularization, reducing model parameters. Yang et al. [29] introduced a self-attention mechanism before the BiLSTM, allowing the model to adjust its focus on different parts of the input sequence and capture long-distance dependencies. Xu et al. [30] incorporated multi-head self-attention and dictionary information to adjust the weight relationships between Chinese characters and multi-level semantic features. While most BiLSTM architectures capture the global features of sentences, they often miss local features. CNNs, initially popular in computer vision for their ability to capture local features, have gradually been applied in NLP. Wu et al. [31] used a CNN to represent the entire sentence as a global feature while extracting local features; after extracting both, they fed them into a fully connected network for sequence labeling and entity recognition. Kong et al. [32] proposed a Chinese clinical NER method that combines multi-level CNNs and attention mechanisms, addressing the limitation of LSTM in capturing global information for long sentences. Strubell et al. [33] proposed the Iterated Dilated Convolutional Neural Network (ID-CNN), which offers better context and structured prediction capabilities than traditional CNNs while significantly reducing training time by fully exploiting parallel computation. Jiang et al. [34] proposed a word-embedding-based BiLSTM-IDCNN-CRF model, using different network architectures to obtain global and local features. The advent of pre-trained models has ushered NLP into a new era: research has shown that models pre-trained on very large corpora learn text representations that transfer across domains and benefit various downstream NLP tasks, including NER. BERT is a Transformer-based bidirectional encoder that leverages self-supervised learning tasks to mine contextual representations. Researchers have proposed improved pre-trained models based on BERT, such as RoBERTa [35], ALBERT [36], and BioBERT [37]. Chang et al. [38] concatenated a CRF with pre-trained BERT for NER. Liu et al. [39] proposed the BERT-BiLSTM-CRF method and applied it to research on citrus pests and diseases. Gan et al. [40] proposed the BERT-Transformer-BiLSTM-CRF model to handle the challenges posed by pronouns and polysemous words in Chinese NER. Li et al. [41] addressed the excessive parameters and long training times of BERT with the BERT-IDCNN-CRF model.

Sequence labeling-based [42] methods encounter challenges when dealing with nested entities, because the Cartesian product of entity labels leads to a long-tail label distribution. Hypergraph-based methods [43] can effectively identify spans, but their decoding is challenging. The Seq2Seq [44] framework can generate entity sequences, either as entity pointer sequences [45] or as text sequences [46]; however, Seq2Seq suffers from high decoding time.

2.3 Lexical Information for NER

Compared to English NER, Chinese NER faces more challenges, primarily due to the difficulty of determining entity boundaries in Chinese text and the complex syntactic structure of the language. Previous research [4] compared character-based and word-based approaches and showed that character-based NER methods often fail to fully exploit explicit word and word-sequence information, despite its potential value. To leverage lexical features, Zhang et al. [5] proposed the Lattice-LSTM model, which encodes all words matched by the characters of a sentence into a DAG. However, the DAG structure may sometimes struggle to select the correct path, potentially causing the model to degrade into a character-based model. Liu et al. [6] proposed the WC-LSTM model to integrate word information into character-based models, employing four word encoding strategies: shortest, longest, average, and self-attention. These strategies encode word information into fixed-size vectors, enabling batch training and adaptability to various application scenarios. To maximize the benefits of pre-trained models, Lai et al. [7] adopted a lattice structure and employed fully connected self-attention to capture long-range dependencies in sequences. Liu et al. [8] introduced a Lexicon Adapter into the Transformer Encoder layers of BERT, allowing individual characters within sentences to interact with lexical information, which effectively enhances the model’s ability to resolve the meaning of individual characters from contextual semantics.

2.4 Biaffine for NER

Yu et al. [47] adapted the Biaffine decoder from dependency parsing, converting span classification into classifying pairs of start and end tokens. Treating text spans as candidate entities and span tuples as candidate relation tuples allows span semantics to be shared [48]. Hanoi et al. [49] proposed BiLSTM-Biaffine, using the context-rich word vector representations provided by BiLSTM to more accurately predict the entity category of each word and the start and end positions of entities. Li et al. [50] introduced an attention mechanism into Biaffine for the first time, achieving faster training with performance comparable to BiLSTM. Gu et al. [51] combined Biaffine with a Regularity-aware Module to explore the internal compositional information of entities, using the special naming patterns or naming rules of entities to further strengthen entity boundaries. Yan et al. [14] connected a CNN after the Biaffine decoder and used it to model the spatial relationships between adjacent spans.

3 Method

3.1 BERT Pre-training Module

The internal architecture of BERT is primarily composed of multiple Transformer Encoder layers. BERT takes embedded vectors as input, denoted as \(E=\{E_1,E_2,\ldots ,E_n\}\). These vectors are processed through multiple Transformer Encoders to yield the output layer, represented as \(T=\{T_1,T_2,\ldots ,T_n\}\), as shown in Fig. 2. In our approach, we introduce the Lexicon Adapter, which modifies one of these Transformer Encoders. Each encoder includes multi-head attention layers, feed-forward neural networks, and layer normalization, as depicted in Fig. 3. BERT is described by three parameters: L, H, and T, where L is the number of Transformer layers, H denotes the output dimensionality, and T represents the total number of model parameters. In this study, we use bert-base-chinese, which comprises 12 Transformer Encoder layers.

Fig. 3

Transformer encoder unit

To obtain the input vectors E for a Chinese sequence \(S=\{s_1,s_2,\ldots ,s_n\}\), we sum the token, segment, and position embeddings

$$\begin{aligned} E = {\text {Token}}_{\text {Embeddings}}(S) + {\text {Segment}}_{\text {Embeddings}}(S) + {\text {Position}}_{\text {Embeddings}}(S). \end{aligned}$$
(1)

We then feed the vector E into the Transformer Encoder layers, with \(H^0\) initialized as E

$$\begin{aligned} G = {\text {LN}}\left( H^{l-1}+{\text {MHAttn}}\left( H^{l-1}\right) \right) \end{aligned}$$
(2)
$$\begin{aligned} H^l = {\text {LN}}\left( G+{\text {FFN}}\left( G\right) \right) . \end{aligned}$$
(3)

MHAttn denotes the multi-head attention mechanism, LN stands for layer normalization, and FFN refers to the feed-forward network. Specifically, FFN is a two-layer feed-forward network with ReLU as the hidden activation function. The final layer of the Transformer Encoder serves as the output and is denoted as T.

In the attention mechanism, each token corresponds to three different vectors: the Query vector (Q), the Key vector (K), and the Value vector (V). These are obtained by multiplying the embedding vector by three different weight matrices \(w_q,w_k,w_v\). Each token is then scored by taking the dot product of its Query vector with the Key vectors; the scores are scaled, smoothed with softmax, and the result is multiplied with the Value vectors

$$\begin{aligned} {\text {Attention}}(Q,K,V) = {\text {softmax}}\left( \frac{QK^T}{\sqrt{d_k}}\right) V. \end{aligned}$$
(4)
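As a concrete illustration, the snippet below computes Eq. (4) for a batch of token representations. This is a minimal PyTorch sketch, not the exact implementation used in our experiments, and the tensor shapes in the usage example are illustrative.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    # Eq. (4): softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5   # pairwise Query-Key scores
    weights = F.softmax(scores, dim=-1)             # smooth the scores per query
    return weights @ V                              # weighted sum of Value vectors

# toy usage: one sentence with 5 tokens and d_k = 64
Q = K = V = torch.randn(1, 5, 64)
out = scaled_dot_product_attention(Q, K, V)         # shape (1, 5, 64)
```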

Furthermore, the Transformer encoder unit incorporates a residual connection and layer normalization to address degradation and enhance model performance

$$\begin{aligned} {\text {LN}}(x_i) = \alpha \frac{x_i - u_L}{\sqrt{\sigma ^2_L + \epsilon }} + \beta \end{aligned}$$
(5)
$$\begin{aligned} {\text {FFN}}(x) = \max (0,xW_1 + b_1)W_2 + b_2, \end{aligned}$$
(6)

where \(\alpha \) and \(\beta \) are learnable parameters, and \(u_L\) and \(\sigma ^2_L\) denote the mean and variance of the input, respectively.

3.2 Lexicon Adapter

The main architecture of the Lexicon Adapter is illustrated in Fig. 4: Chinese sentences are converted into a sequence of Char–Words pairs, and the Lexicon Adapter is placed between the Transformer layers of BERT, effectively integrating lexical knowledge into BERT. In this section, we describe (1) how the Char–Words pair sequence is generated, and (2) how the Adjust Lexicon Adapter functions within BERT.

Fig. 4

The architecture of Lexicon Enhanced BERT, in which lexicon features are integrated between the kth and \((k+1)\)th Transformer layers using the Lexicon Adapter. \(s_i\) denotes the ith Chinese character in the sentence, and \(w_i\) denotes the matched words assigned to character \(s_i\)

3.2.1 Char–Words Pair Sequence

Chinese sentences are typically represented as sequences of characters, containing only character-level features. To fully leverage lexical information, we extend the character sequence into a sequence of Char–Words pairs.

Given a Chinese dictionary D (see Footnote 1) with associated embedding vectors, we traverse D to construct a Trie tree. For a Chinese sequence \(S=\{s_1,s_2,\ldots ,s_m\}\), by traversing all character subsequences of the sentence and matching them against the Trie, we obtain all potential words. Taking “南(South) 京(Capital) 市(City) 长(Long) 江(River) 大(Major) 桥(Bridge)” as an example, we obtain the words “南京(Nanjing), 南京市(Nanjing city), 市长(Mayor), 长江(Yangtze), 大桥(Major bridge)”. We then assign these words to the individual characters of the sequence, as illustrated in Fig. 5. For instance, “南京(Nanjing)” is assigned to the characters “南(South)” and “京(Capital)”. Characters matching no word are padded with PAD. This yields the Char–Words pair sequence \(S_w=\{\left( s_1,w_1\right) ,\left( s_2,w_2\right) ,\ldots ,\left( s_m,w_m\right) \}\), where \(w_i\) is the set of words assigned to \(s_i\).
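The following sketch illustrates this matching procedure under stated assumptions: a toy lexicon stands in for the dictionary D, and the function and variable names are illustrative rather than taken from our implementation.

```python
# Minimal sketch of Char-Words pair construction with a Trie over a toy lexicon.
class TrieNode:
    def __init__(self):
        self.children = {}
        self.is_word = False

def build_trie(words):
    root = TrieNode()
    for w in words:
        node = root
        for ch in w:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True
    return root

def char_words_pairs(sentence, trie, pad="PAD"):
    """Assign every lexicon word matched in the sentence to each character it covers."""
    words_per_char = [[] for _ in sentence]
    for i in range(len(sentence)):
        node = trie
        for j in range(i, len(sentence)):
            node = node.children.get(sentence[j])
            if node is None:
                break
            if node.is_word:                          # sentence[i:j+1] is in the lexicon
                for k in range(i, j + 1):
                    words_per_char[k].append(sentence[i:j + 1])
    # characters that match no word are padded with PAD
    return [(c, ws if ws else [pad]) for c, ws in zip(sentence, words_per_char)]

lexicon = ["南京", "南京市", "市长", "长江", "大桥"]
pairs = char_words_pairs("南京市长江大桥", build_trie(lexicon))
# e.g. the character "市" is paired with ["南京市", "市长"]
```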

Fig. 5

Character–words pair sequence of the truncated Chinese sentence “南(South) 京(Capital) 市(City) 长(Long) 江(River) 大(Major) 桥(Bridge)”. The words that match “市 (City)” are “南京市 (Nanjing city)” and “市长 (Mayor)”. PAD denotes the padding value, and each word is assigned to the characters it contains.

3.2.2 Adjust Lexicon Influence BERT Adapter

To inject lexical information into BERT, we use Char–Words pairs as input features. Let us first review the workings of the Transformer Encoder

$$\begin{aligned} H^0 = E \end{aligned}$$
(7)
$$\begin{aligned} G = \textrm{LN}\left( H^{l-1}+\textrm{MHAttn}\left( H^{l-1}\right) \right) \end{aligned}$$
(8)
$$\begin{aligned} H^l = \textrm{LN}\left( G+\textrm{FFN}\left( G\right) \right) . \end{aligned}$$
(9)

Here, E corresponds to the outputs obtained after transforming \(s_1,s_2,\ldots ,s_m\) from \(S_w=\{\left( s_1,w_1\right) ,\left( s_2,w_2\right) ,\ldots ,\left( s_m,w_m\right) \}\) into tokens and adding segment and position embeddings. E is then fed into the Transformer encoders, and each Transformer layer acts as above: MHAttn is the multi-head attention mechanism, LN is layer normalization, and FFN is a two-layer feed-forward network with ReLU as the hidden activation function.

In a typical BERT layer, we have \(H^l=\{h_1^l,h_2^l,\ldots ,h_m^l\}\), without any lexical information. In the Adjust Lexicon Layer, however, we compute \(\widetilde{H^l}=\{\widetilde{h_1^l},\widetilde{h_2^l},\ldots ,\widetilde{h_m^l}\}\), where \(\widetilde{h_i^l}\) is defined as follows:

$$\begin{aligned} \widetilde{h_i^l}=h_i^l+\gamma z_i. \end{aligned}$$
(10)

Here, \(\gamma \) is a weight coefficient that governs the influence of lexical information, and \(z_i\) is the lexical information associated with the ith character.

For the Adjust Lexicon Layer, given the char–word pairs \(S_w=\{\left( s_1,w_1\right) ,\left( s_2,w_2\right) ,\ldots ,\left( s_m,w_m\right) \}\), we look up the embedding of each matched word, where each character can match at most p words

$$\begin{aligned} x_{ij}=e^w\left( w_{ij}\right) ,\quad i = 1,\ldots ,m,\quad j=1,\ldots ,p. \end{aligned}$$
(11)

Here, \(x_{ij}\in R^{d_w}\) represents the embedding value of the jth matching word for the ith character. \(e^w\) is the pre-trained vocabulary embedding matrix, with an embedding dimension \(d_w = 200\). After obtaining the vocabulary embedding values, we apply a nonlinear transformation

$$\begin{aligned} v_{ij}=W_2\left( \textrm{tanh}\left( W_1x_{ij}+b_1\right) \right) +b_2. \end{aligned}$$
(12)

Here, \( W_1\in R^{d_c \times d_w}, W_2\in R^{d_c\times d_c}\), and \(b_1\) and \(b_2\) are bias terms. \(d_c = 768\) is the hidden size of BERT.

Specifically, we denote all \(v_{ij}\) assigned to the ith character as \(V_i=\{v_{i1},v_{i2},\ldots ,v_{ip}\}\). The relevance of each word can be calculated as

$$\begin{aligned} \alpha _i = \textrm{softmax}(h_i^lW_{attn}V_i^T), \end{aligned}$$
(13)

where \( W_\textrm{attn} \in R^{d_c \times d_c}\) is the weight matrix of bilinear attention. Consequently, we can get the weighted sum of all words by

$$\begin{aligned} z_i=\sum _{j=1}^{p}\alpha _{ij}v_{ij}. \end{aligned}$$
(14)
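A minimal PyTorch sketch of Eqs. (10)–(14) is given below. The dimensions \(d_w = 200\) and \(d_c = 768\) follow the text; the module and parameter names, and the masking of PAD words, are illustrative assumptions rather than our exact implementation.

```python
import torch
import torch.nn as nn

class LexiconAdapter(nn.Module):
    """Sketch of Eqs. (10)-(14): fuse matched-word embeddings into character states."""
    def __init__(self, d_w=200, d_c=768, gamma=1.0):
        super().__init__()
        # Eq. (12): W2(tanh(W1 x + b1)) + b2
        self.proj = nn.Sequential(nn.Linear(d_w, d_c), nn.Tanh(), nn.Linear(d_c, d_c))
        self.w_attn = nn.Parameter(torch.empty(d_c, d_c))   # bilinear attention weight
        nn.init.xavier_uniform_(self.w_attn)
        self.gamma = gamma                                   # weight of lexical information

    def forward(self, h, word_emb, word_mask):
        # h: (batch, m, d_c) character states; word_emb: (batch, m, p, d_w)
        # word_mask: (batch, m, p), 1 for real matched words, 0 for PAD
        v = self.proj(word_emb)                              # (batch, m, p, d_c)
        scores = torch.einsum("bmd,de,bmpe->bmp", h, self.w_attn, v)  # Eq. (13)
        scores = scores.masked_fill(word_mask == 0, -1e9)    # ignore PAD words
        alpha = scores.softmax(dim=-1)
        z = torch.einsum("bmp,bmpd->bmd", alpha, v)          # Eq. (14): weighted word summary
        return h + self.gamma * z                            # Eq. (10)
```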

3.3 CNN-Span

The main architecture of CNN-Span is illustrated in Fig. 6: we introduce lexical information into BERT, pass the output through a Bi-LSTM to capture contextual information, and then feed it to a multi-head Biaffine layer to obtain entity scores. Finally, we use a residual-connected CNN to model the spatial correlation between adjacent spans.

Fig. 6

The main architecture of LB-BMBC model

3.3.1 BiLSTM and MHBiaffine

We treat NER as a span classification task, in which the model assigns an entity label to each valid span. We first input the Chinese sentence \(S_w=\{\left( s_1,w_1\right) ,\left( s_2,w_2\right) ,\ldots ,\left( s_m,w_m\right) \}\) into the lexicon-enhanced Encoder

$$\begin{aligned} H=\textrm{Encoder}_\textrm{Lexicon}\left( S_w\right) . \end{aligned}$$
(15)

Here, \(H\in R^{m \times d_c}\). Next, we feed the output H to a BiLSTM to obtain the Head and Tail representations of each span

$$\begin{aligned} H_H = \textrm{BiLSTM}\left( H\right) \end{aligned}$$
(16)
$$\begin{aligned} H_T = \textrm{BiLSTM}\left( H\right) . \end{aligned}$$
(17)

\(H_H\in R^{m \times d_h},H_T\in R^{m \times d_h}\), where \(d_h\) is the size of the BiLSTM output layer. By traversing the sequence in both the forward and backward directions, the BiLSTM captures the contextual information of each element, which is crucial for understanding the context of an entity.

Then, we feed both \(H_H\) and \(H_T\) through a multi-head Biaffine decoder to obtain the score matrix Q

$$\begin{aligned} Q_{ij} = H_H(i)^T U H_T(j) + W(H_H(i) \oplus H_T(j)) + b; \end{aligned}$$
(18)

\(Q\in R^{m \times m \times |T|}\), where each cell (i, j) of Q can be seen as the feature vector of the corresponding span. For the upper triangle of Q (where \(i \le j\)), the span covers the characters from the ith to the jth position; for the lower triangle (where \(i > j\)), it covers the characters from the jth to the ith position.
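The sketch below illustrates the biaffine scoring of Eq. (18) in PyTorch. It shows the single-head form; a multi-head variant would split \(d_h\) into several groups and score each group separately. The class and parameter names are illustrative.

```python
import torch
import torch.nn as nn

class BiaffineDecoder(nn.Module):
    """Sketch of Eq. (18): produces the span score tensor Q of shape (m, m, |T|)."""
    def __init__(self, d_h, n_types):
        super().__init__()
        self.U = nn.Parameter(torch.empty(n_types, d_h, d_h))   # bilinear term
        self.W = nn.Linear(2 * d_h, n_types, bias=True)         # linear term + bias b
        nn.init.xavier_uniform_(self.U)

    def forward(self, h_head, h_tail):
        # h_head, h_tail: (batch, m, d_h) from the two BiLSTM views
        bilinear = torch.einsum("bid,tde,bje->bijt", h_head, self.U, h_tail)
        concat = torch.cat(
            [h_head.unsqueeze(2).expand(-1, -1, h_tail.size(1), -1),
             h_tail.unsqueeze(1).expand(-1, h_head.size(1), -1, -1)], dim=-1)
        return bilinear + self.W(concat)                         # (batch, m, m, |T|)
```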

3.3.2 CNN

We now apply a CNN to model the interactions between adjacent spans, repeating the following steps:

$$\begin{aligned} Q^\prime = \textrm{Conv2d}\left( Q\right) \end{aligned}$$
(19)
$$\begin{aligned} Q^{\prime \prime } = \textrm{GeLU}\left( \textrm{LayerNorm}\left( Q^\prime +Q\right) \right) , \end{aligned}$$
(20)

where Conv2d, LayerNorm, and GeLU denote the 2D CNN, layer normalization, and the GeLU activation function, respectively. Layer normalization is performed across the feature dimension, which allows the network to use a higher learning rate without destabilizing training and accelerates convergence. Note that because the number of tokens m varies across sentences, the Q matrices have different shapes. To ensure consistent results in batch processing, the 2D CNN has no bias term, and all padded positions in Q are filled with 0. After passing through multiple CNN blocks, \(Q^{\prime \prime }\) is further processed by another 2D CNN module.
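Below is a minimal sketch of one such CNN block (Eqs. (19)–(20)) under the conventions just described (bias-free convolution, zero-filled padding positions, LayerNorm over the feature dimension); the kernel size and names are illustrative.

```python
import torch
import torch.nn as nn

class CNNBlock(nn.Module):
    """Sketch of Eqs. (19)-(20): residual 2D convolution over the span tensor Q."""
    def __init__(self, channels, kernel_size=3):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size,
                              padding=kernel_size // 2, bias=False)  # no bias, see text
        self.norm = nn.LayerNorm(channels)                           # over feature dim
        self.act = nn.GELU()

    def forward(self, q, pad_mask):
        # q: (batch, m, m, channels); pad_mask: (batch, m, m, 1), 0 at padded cells
        x = (q * pad_mask).permute(0, 3, 1, 2)        # zero out padding before the conv
        out = self.conv(x).permute(0, 2, 3, 1)        # Eq. (19)
        return self.act(self.norm(out + q))           # Eq. (20): residual + LN + GeLU
```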

We use a Sigmoid Function to get the prediction score as follows:

$$\begin{aligned} P=\textrm{Sigmoid}\left( W_o\left( Q+Q^{\prime \prime }\right) +b\right) . \end{aligned}$$
(21)

Here, \(P\in R^{m \times m\times |T|}\). We then use binary cross entropy to compute the loss

$$\begin{aligned} \textrm{L}_{\textrm{BCE}}=-\sum _{0\le i,j<m} y_{ij}\log \left( P_{ij}\right) . \end{aligned}$$
(22)

To facilitate batch processing, we do not compute only the upper triangular part; we compute and output both the upper and lower triangles. Since the labels of the score matrix are symmetric, the label for (i, j) is the same as the label for (j, i). During inference, we compute the scores of the upper-triangle spans as follows:

$$\begin{aligned} \hat{P_{ij}}=\left( P_{ij}+P_{ji}\right) /2. \end{aligned}$$
(23)

We filter out non-entity spans (those whose maximum entity score is below 0.5), arrange the remaining spans in descending order of their maximum entity scores, and then select spans following this order.
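A minimal sketch of this decoding step is given below, assuming the sigmoid outputs P for a single sentence; the function name and the greedy ordering are illustrative.

```python
import torch

def decode_spans(P, threshold=0.5):
    """Symmetrize the score tensor (Eq. (23)), drop spans whose best entity score is
    below the threshold, and return the rest sorted by score.
    P: (m, m, n_types) sigmoid outputs for one sentence."""
    P_hat = (P + P.transpose(0, 1)) / 2                # average upper/lower triangles
    m = P_hat.size(0)
    candidates = []
    for i in range(m):
        for j in range(i, m):                          # upper triangle: span s_i..s_j
            score, label = P_hat[i, j].max(dim=-1)
            if score > threshold:
                candidates.append((score.item(), i, j, label.item()))
    candidates.sort(reverse=True)                      # highest-scoring spans first
    return [(i, j, label) for _, i, j, label in candidates]
```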

Algorithm 1

LB-BMBC

4 Experiments

4.1 Metrics

We adopt a strict metric: a prediction is counted as correct only when both the entity boundary and the entity type are correct.

  • TP: entities whose boundaries and types are both correctly recognized.

  • FP: predicted entities whose category or boundary is wrong.

  • FN: gold entities that are not recognized.

$$\begin{aligned} \textrm{Precision}(P) = \frac{\textrm{TP}}{\textrm{TP} + \textrm{FP}} \end{aligned}$$
(24)
$$\begin{aligned} \textrm{Recall}(R) = \frac{\textrm{TP}}{\textrm{TP} + \textrm{FN}}. \end{aligned}$$
(25)

F1 is used to balance P and R

$$\begin{aligned} F1 = \frac{2 \times P \times R}{P + R}. \end{aligned}$$
(26)
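As a concrete illustration of the strict metric, the snippet below computes precision, recall, and F1 from sets of gold and predicted (start, end, type) spans; the function name and the toy example are illustrative.

```python
def strict_prf(gold_spans, pred_spans):
    """Strict matching: a prediction counts as TP only if its boundary (start, end)
    and its entity type both match a gold span exactly."""
    gold, pred = set(gold_spans), set(pred_spans)
    tp = len(gold & pred)
    fp = len(pred - gold)        # predicted spans with wrong boundary or type
    fn = len(gold - pred)        # gold entities that were not recovered
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

# example: spans are (start, end, type) tuples
print(strict_prf({(0, 2, "LOC"), (3, 4, "PER")}, {(0, 2, "LOC"), (3, 5, "PER")}))
```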

4.2 Dataset

We employ four prominent Chinese NER benchmark datasets: Resume [5], Weibo [15], OntoNotes 4.0 [16], and MSRA [17]. The statistics of the datasets are shown in Table 1.

  • Resume: The Resume dataset is derived from filtering and manually annotating executive summary data from Sina Finance. This dataset encompasses 1027 executive summaries, with entity annotations distributed across eight distinct categories: CONT, EDU, LOC, PER, ORG, PRO, RACE, and TITLE.

  • Weibo: The Weibo dataset is curated from historical data of Sina Weibo, covering the period from November 2013 to December 2014. This dataset comprises 1890 microblog messages, with entity annotations spanning four categories: PER, ORG, LOC, and GPE.

  • OntoNotes 4.0: OntoNotes 4.0 is a Chinese dataset primarily sourced from the news domain. It encompasses entity annotations for four categories: GPE, LOC, ORG, and PER.

  • MSRA: MSRA is a news-domain entity recognition dataset meticulously annotated by Microsoft Research Asia. It also served as one of the datasets in the SIGHAN Bakeoff 2006 entity recognition task. It contains over 50,000 Chinese entity annotations, classified into three basic categories: ORG, PER, and LOC.

4.3 Hardware Environment and Experimental Parameters

Our model is implemented with Python 3.8 and PyTorch 1.10; the GPU is an NVIDIA GeForce RTX 3090.

Table 2 shows the hyper-parameter values of our model.

Table 1 Statistics of the Chinese datasets
Table 2 Model hyper-parameter settings

4.4 Baselines

We compare against previous SoTA methods as baselines.

  • Lattice-LSTM [5]: For Chinese NER, an LSTM model with a lattice structure that encodes both the character features of the input sequence and all potential words matched against the lexicon, fusing word and word-sequence information for NER.

  • CAN-NER [52]: Extracts local character information through a CNN, then captures adjacent-character and context information using a global self-attention layer composed of GRUs.

  • LR-CNN [53]: Uses a CNN to encode sentences and proposes a rethinking mechanism to resolve lexical conflicts.

  • LGN [54]: Introduces a lexicon-based graph neural network, modeling Chinese NER as a node classification task and resolving ambiguous word boundaries through an iterative aggregation mechanism.

  • PLT [55]: Enhances self-attention through positional relationship representations, and introduces a porous mechanism to strengthen local modeling while preserving the ability to capture long-term dependencies.

  • FLAT [7]: Designs a position encoding scheme to incorporate lattice structures for introducing lexicon information, and proposes a relative position encoding to make the Transformer suitable for NER tasks.

  • SoftLexicon (LSTM) [56]: Proposes a simple yet effective method to incorporate the word lexicon into character representations by adding lexical information to the character representation layer.

  • MECT [57]: Proposes a two-stream Transformer encoding model that incorporates the structural features of Chinese characters.

To analyze the contribution of each component in our model, we ablate the full model and demonstrate the effectiveness of each component:

  • B-MB: The model consists of BERT + MHBiaffine and is the most basic model we compare against.

  • B-BMB [47]: The model consists of BERT + BiLSTM + MHBiaffine, using a BiLSTM to obtain the Head and Tail representations of the sentence before feeding them into the Biaffine classifier.

  • B-MBC: The model consists of BERT + MHBiaffine + CNN, using a CNN to capture the spatial relations between adjacent spans.

  • B-BMBC: The model consists of BERT + BiLSTM + MHBiaffine + CNN. We use a BiLSTM to obtain the Head and Tail representations of the sentence, feed them into the Biaffine decoder to score spans, and finally use a CNN to capture the spatial relationships between adjacent spans.

  • LB-BMBC: Introduces lexical information into BERT’s Transformer layers and then stacks BiLSTM + MHBiaffine + CNN.

Table 3 Experiment results (%) on Resume and Weibo
Table 4 Experiment results (%) on Ontonotes and MSRA

4.5 Results and Discussion

The results on the public datasets are shown in Tables 3 and 4. Compared with the other models, the LB-BMBC model achieves the best performance on Resume (Precision 96.15%, Recall 94.44%, F1 96.29%), Weibo (Precision 65.01%, Recall 72.71%, F1 68.64%), Ontonotes (Precision 80.49%, Recall 82.22%, F1 81.35%), and MSRA (Precision 95.54%, Recall 95.45%, F1 95.50%).

Ablation Study All components of our model contribute to its performance; removing any component degrades the results. We conducted the following ablation experiments on LB-BMBC

  • B-MB: Compared with the LB-BMBC model, the F1 score of the B-MB model decreases to varying degrees (3.42% on Resume, 13.14% on Weibo, 11.35% on Ontonotes, and 3.77% on MSRA). The experiments on the four datasets show that lexicon information and the CNN Block play a key role in improving the performance of the NER system.

  • B-BMB: In this setting, we add a BiLSTM to the B-MB model. Compared with B-MB, the F1 score improves to varying degrees (2.95% on Resume, 6.94% on Ontonotes, and 2.21% on MSRA). The BiLSTM captures the Head–Tail relationships of the sentence quite well.

  • B-MBC: In this setting, we add the CNN Block to the B-MB model. Compared with B-MB, the F1 score improves to varying degrees (2.71% on Resume, 8.44% on Weibo, 8.49% on Ontonotes, and 3.33% on MSRA). Using the CNN Block to capture the spatial information of adjacent spans brings a significant improvement in performance.

  • B-BMBC: In this setting, we add both the BiLSTM and the CNN Block to the B-MB model. Compared with B-MB, the F1 score improves to varying degrees (3.02% on Resume, 9.97% on Weibo, 10.62% on Ontonotes, and 3.68% on MSRA). Applying the BiLSTM to capture the Head–Tail relationships of the sentence further improves performance.

  • LB-BMBC: The F1 scores of our proposed LB-BMBC model are the best (96.29%, 68.64%, 81.35%, 95.50%). Injecting lexicon information into BERT alleviates the problem of words having multiple meanings and improves the recognition of entity boundaries.

4.6 The Impact of Different CNN Block Numbers

To investigate the effectiveness of the CNN in modeling adjacent spans, we further conducted experiments with different numbers of CNN Blocks, as shown in Tables 5 and 6 and illustrated in Fig. 7. Introducing CNN Blocks allows more True Positives (TP) to be predicted correctly, thereby improving Recall (R). When only one CNN Block is used, the capacity for modeling the span space is insufficient, resulting in some False-Positive (FP) predictions and thus reducing Precision (P); this can be rectified by increasing the number of CNN Blocks. In summary, using more than one CNN Block effectively enhances the model’s recognition capability. With too few CNN Blocks, the network may also be difficult to train. As the number of CNN Blocks increases, the model’s capacity grows, giving it more parameters with which to learn the features of the data; this improves the model’s expressive ability to some extent and thereby its performance.

Table 5 Results (%) with different numbers of CNN Blocks on Resume and Weibo
Fig. 7

F1 (%) for different numbers of CNN Blocks in Resume, Weibo, Ontonotes, and MSRA

5 Conclusion

Named Entity Recognition is one of the important tasks in information extraction within natural language processing, playing a crucial role in downstream tasks such as knowledge graphs and question-answering systems. Chinese NER faces even greater challenges than English NER due to complexities such as word segmentation and intricate grammatical structures. In this paper, we propose the LB-BMBC model, which incorporates lexical information into the Transformer layers of BERT, allowing substantial interaction between lexicon information and individual characters. Additionally, we model the spatial relationships between adjacent spans by introducing a CNN after the Biaffine decoder. We validate the effectiveness of the proposed method on four Chinese NER datasets, outperforming other lexicon-based models, and further confirm its efficacy through extensive ablation experiments.

Table 6 Results (%) with different numbers of CNN Blocks on Ontonotes and MSRA

In our future work, we will consider the following three aspects:

  • Behind the visual form of Chinese characters lies rich linguistic information. For instance, characters like “液” (liquid), “河” (river), and “湖” (lake) all share the semantic element “氵” (water), indicating their semantic association with water. Intuitively, leveraging the visual form of Chinese characters could enhance Chinese NLP capabilities. We will explore the integration of character forms into the underlying layers of BERT.

  • Chinese characters often exhibit polysemy, where a single character can have multiple meanings. For example, the character “乐” has two distinct pronunciations: “乐” (yue), referring to music, and “乐” (le), referring to happiness. We will consider incorporating Chinese character phonetic information (pinyin) into the underlying layers of BERT to address this phenomenon.

  • While applying CNN to model the spatial relationships between adjacent spans, both non-entity and entity spans are currently treated uniformly. We plan to enhance the modeling of entity spans’ spatial relations by applying specific techniques to these spans, acknowledging their distinct nature.