1 Introduction

Named Entity Recognition (NER) is a fundamental task in natural language processing (NLP) that aims to recognize specific entities in unstructured text, such as persons, countries and institutions. It is an upstream task for many currently popular NLP applications, including question generation [7], event extraction [41] and knowledge graph construction [10]. In English, NER is typically cast as a sequence labeling task, with the BiLSTM-CRF architecture [19] serving as the standard model backbone. Popular models operate on words, since English sentences contain natural delimiters [5, 24, 33, 47].

Fig. 1

The lattice structure. Green directed arrows denote information flow paths which connect word cells with their first character cells and last character cells

However, word-based NER models do not work well for Chinese NER. Chinese sentences have no natural delimiters, so they must be segmented into words before a word-based model can be applied. Segmentation is error-prone for machines, and incorrect segmentation can mislead NER inference. To eliminate the influence of word segmentation errors, character-based models have become mainstream in Chinese NER [9, 14, 37]. Character-based models, however, gradually hit a bottleneck because they ignore the fact that many Chinese characters have multiple meanings, and the intended meaning can often only be determined by referring to the words they appear in. Guided by the idea that a lexicon can play a supplementary role, researchers have turned to integrating lexicon information into character-based models, and the well-known Lattice-LSTM [48] was proposed in this context. The schematic diagram of Lattice-LSTM is shown in Fig. 1. Additional information flow paths are added to the standard LSTM layer [19], connecting words with their first and last characters. Through a well-designed computation, Lattice-LSTM effectively integrates lexicon information into the character-based model. Although Lattice-LSTM is proven effective, its drawbacks are obvious: the complex architecture greatly slows down inference, which limits its practicality. Subsequent models such as WC-LSTM [26] and SoftLexicon [28] successfully solve the inefficiency problem of Lattice-LSTM, but they are limited to simply concatenating lexicon information with character embeddings. As a result, finding a more suitable way to integrate lexicon information remains challenging.

In addition, word information is often treated as the only information that can be obtained from a lexicon in recent models, and it is presented as word embeddings when integrated into a model. According to our observation, however, another kind of information can also be extracted from the lexicon, i.e., the position of a character within a word. This information is rarely taken seriously by recent models, yet it can be valuable for distinguishing the different meanings of a character. Take, for example, a character that means either "substitute" or "era" depending on its position. If it is at the beginning of a word, it most likely conveys the meaning of "substitute" or "acting", and in this case character position information helps to recognize TITLE entities such as "acting general manager". If it is at the end of a word, it usually conveys the meaning of "era" or "epoch", which is conducive to identifying TIME entities such as "Ming Dynasty".

Fig. 2

Comparison between KV-MemNN and TFM. The KV-MemNN cell is on the left and the TFM cell is on the right

To utilize character position information and address the problems left by previous models, we propose TFM, a triple fusion module that fuses character, word and character position information from the lexicon. This module is inspired by the key-value memory network (KV-MemNN) [30], which was designed for the question answering task. The original KV-MemNN first stores information in key-value slots. It then uses the keys and a fixed question embedding to compute addressing weights, by which the values are selectively retained. As shown in Fig. 2, the differences between KV-MemNN and TFM lie mainly in the input form and the internal calculation process. TFM takes lexicon information triples as input. Weights are then computed from the character and word information. Finally, the output is derived from embeddings obtained by concatenating the word and character position information. Despite these differences, TFM keeps the main idea of KV-MemNN, i.e., memorizing information according to weights. Following a line of recent works [39, 40], we insert TFM into the general BiLSTM-CRF architecture to form our model. The contributions of our work can be summarized as follows:

  1.

    We propose TFM to integrate lexicon information into the character-based Chinese NER model. TFM fuses information neither through complicated calculation nor through simple concatenation, but in a manner that lies between the two.

  2.

    Apart from word information, we exploit another kind of information from the lexicon that is overlooked by other methods, i.e., character position information. This information is fused together with the character and word information in TFM and finally integrated into the character-based model.

  3.

    To investigate the performance of our model, we evaluate it on three public Chinese NER datasets, i.e., Resume [48], Weibo [34] and MSRA [21]. In these experiments, our model outperforms all the comparison models.

The remaining sections of this article are organized as follows. Section 2 reviews related work. Section 3 introduces the details of our model. Section 4 describes the experimental setup and reports the results. Section 5 draws a conclusion.

2 Related Work

2.1 NER with Lexicon Information

Languages such as English do not suffer from the loss of word information caused by missing delimiters, so there is no need to deliberately consider word information in NER for these languages. Even so, utilizing words and phrases from a lexicon can let models know specific entity instances in advance, thus enhancing their inference ability. For example, Liu et al. [25] concatenated lexicon query results to the output of a BiLSTM and obtained tags with a Semi-CRF. Peshterliev et al. [35] put gazetteer embeddings together with word embeddings, introducing lexicon information in the embedding layer. This concatenation-based way of fusing knowledge is also seen in Japanese and Hindi NER [13, 31]. Unlike NER in word-delimited languages, the demand for word information in Chinese NER is not confined to particular entity nouns but extends to all words in a sentence, so the integration of word information needs further investigation.

Different methods have been proposed. On the one hand, joint training and transfer learning have been explored. Peng and Dredze [34] jointly trained Chinese NER and Chinese word segmentation (CWS) models, improving the Chinese NER model by nearly 5% absolute. Wu et al. [44] used a CNN to capture local context and also jointly trained Chinese NER and CWS models. Cao et al. [3] applied adversarial transfer learning to Chinese NER, incorporating word boundary information from the CWS task. On the other hand, lexicon information has been exploited directly. Ding et al. [8] constructed a directed acyclic graph to connect characters and lexicon words, integrating both with a graph neural network. Gui et al. [15] and Sui et al. [37] also utilized graph neural networks to integrate word information. Zhang and Yang [48] changed the architecture of the standard LSTM, using shortcut paths to link character cells and word cells, which forms a lattice structure. In view of the complexity and inefficiency of the lattice structure, Li et al. [23] proposed FLAT, Liu et al. [26] proposed WC-LSTM and Ma et al. [28] proposed SoftLexicon, all focusing on simplifying the complicated lattice structure, improving running speed and advancing the applicability of the models. More recently, Hu and Wei [18] revisited the second-order lexicon knowledge of a character to relieve word boundary conflicts, and Gong et al. [12] constructed a hierarchical tree structure to utilize characters, subwords and lexicon words. Unlike all these models, we exploit one more type of information from the lexicon, i.e., character position information, and design TFM to integrate lexicon knowledge into the character-based model.

2.2 Key-Value Memory Network

KV-MemNN [30] was proposed for the question answering task, aiming to narrow the gap between querying knowledge bases and reading directly from documents. It encodes and integrates prior knowledge in key-value pairs. This way of incorporating information transfers well to other settings and is often superior to simple embedding concatenation. As a result, it has been applied to other tasks, such as image recognition [2], clinical diagnostic inferencing [36] and machine translation [42]. KV-MemNN has also proven its merits in sequence labeling. Tian et al. [40] utilized wordhood information with KV-MemNN for CWS. In slot filling for dialogue systems, Wu et al. [45] adopted this method to trace long-term slot context. For NER, Gui et al. [16] recorded context representations and label embeddings to obtain document-level representations, while Luo et al. [27] incorporated word representations and hidden states, both following the idea of KV-MemNN. Other researchers have leveraged syntactic information: Nie et al. [32] learned three types of syntactic information and Tian et al. [39] injected syntactic knowledge into a biomedical NER model. It is worth noting that all the mentioned methods for sequence labeling generate task-specific key-value pairs while keeping the basic structure of KV-MemNN unchanged. Different from these models, we incorporate lexicon knowledge into the NER model and modify KV-MemNN to accept triples as input instead of key-value pairs.
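For reference, the following is a minimal sketch of the addressing-and-reading step at the core of KV-MemNN; the function name, tensor shapes and variable names are our own illustration, not the authors' code.

```python
import torch
import torch.nn.functional as F

def kv_memory_read(query: torch.Tensor,
                   keys: torch.Tensor,
                   values: torch.Tensor) -> torch.Tensor:
    """One key-value memory read: address the memory slots with the query,
    then return a weighted sum of the values.

    query:  (d,)    e.g., a fixed question embedding
    keys:   (m, d)  one key vector per memory slot
    values: (m, d)  the value vector paired with each key
    """
    scores = keys @ query                # relevance of each slot, shape (m,)
    weights = F.softmax(scores, dim=0)   # addressing weights
    return weights @ values              # weighted read-out, shape (d,)
```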

Fig. 3

The architecture of our model. The left part is the standard BiLSTM-CRF model and the right part is the proposed TFM. Word information is fused with character and character position information, where \(\oplus \) denotes concatenation operation, \(\otimes \) denotes element-wise product operation, \(\textcircled {s}\) denotes the softmax function and \(\diamond \) denotes the formation of triples

3 Method

In this work, we propose TFM to incorporate lexicon information into the character-based model. The architecture of our model is illustrated in Fig. 3: the general BiLSTM-CRF model is on the left, and the proposed TFM on the right operates between the BiLSTM layer and the CRF layer. The first layer is an embedding layer, which maps characters to dense vectors. The second layer is a BiLSTM layer, which produces character representations with context information. Then, triples containing character, word and character position information are fed into TFM to obtain the fusion information. The last layer is the CRF layer, which takes the fusion information as input and outputs the predicted named entity tags. The details are elaborated in the rest of this section.

3.1 Embedding Layer

In the embedding layer, the characters of the input sentence are mapped to dense vectors. Formally, given an input sentence \(X=\left\{ x_1,x_2,\ldots ,x_n\right\} \in V_C\), where n denotes the length of the sentence and \(V_C\) denotes the character vocabulary, each character \(x_i\) is represented as:

$$\begin{aligned} \varvec{e}_i^c=BERT\left( x_i\right) , \end{aligned}$$
(1)

where \(BERT\left( \cdot \right) \) denotes the pre-trained BERT [6] model, which encodes abundant semantic information and has been widely used in NLP tasks. Unlike static embeddings, BERT generates dynamic embeddings for the same character depending on its surrounding context.
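For concreteness, the sketch below shows one way to obtain per-character BERT embeddings (Eq. 1) with the HuggingFace transformers library; the library calls are standard, but treating each input character as exactly one token is our simplification (it generally holds because the Chinese BERT vocabulary is character-level).

```python
import torch
from transformers import BertModel, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")
bert = BertModel.from_pretrained("bert-base-chinese")
bert.eval()  # BERT parameters are kept fixed in this work

def char_embeddings(sentence: str) -> torch.Tensor:
    """Return one contextual embedding per character (Eq. 1)."""
    # Chinese BERT tokenizes essentially character by character, so each
    # input character is assumed to map to a single token here.
    enc = tokenizer(list(sentence), is_split_into_words=True,
                    return_tensors="pt", add_special_tokens=True)
    with torch.no_grad():
        out = bert(**enc).last_hidden_state   # (1, n + 2, 768)
    return out[0, 1:-1]                       # drop [CLS]/[SEP] -> (n, 768)
```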

3.2 BiLSTM Layer

BiLSTM [19] is good at capturing context information. Since the standard structure of BiLSTM is not modified in our model, we only briefly introduce its forward calculation process:

$$\begin{aligned}&\begin{bmatrix} \varvec{i}_i \\ \varvec{f}_i \\ \varvec{o}_i \\ \widetilde{\varvec{c}}_i \end{bmatrix} = \begin{bmatrix} \sigma \\ \sigma \\ \sigma \\ tanh \end{bmatrix}\left( \varvec{W} \begin{bmatrix} \varvec{e}_i^c \\ \varvec{h}_{i-1} \end{bmatrix} + \varvec{b} \right) , \end{aligned}$$
(2)
$$\begin{aligned}&\varvec{c}_i = \widetilde{\varvec{c}}_i *\varvec{i}_i + \varvec{c}_{i-1} *\varvec{f}_i, \end{aligned}$$
(3)
$$\begin{aligned}&\varvec{h}_i = \varvec{o}_i *tanh(\varvec{c}_i), \end{aligned}$$
(4)

where \(\sigma \) is the sigmoid function, \(*\) is element-wise product, \(\varvec{W}\) and \(\varvec{b}\) are trainable parameters.

Given a sequence of character embeddings, BiLSTM is applied to exploit hidden expressions of characters from global context:

$$\begin{aligned} \begin{aligned} \begin{bmatrix} \overrightarrow{\varvec{h}}_1 , \overrightarrow{\varvec{h}}_2 ,\ldots , \overrightarrow{\varvec{h}}_n \end{bmatrix} = \overrightarrow{LSTM}\left( \begin{bmatrix} \varvec{e}_1^c,\varvec{e}_2^c,\ldots ,\varvec{e}_n^c \end{bmatrix} \right) , \\ \begin{bmatrix} \overleftarrow{\varvec{h}}_1 , \overleftarrow{\varvec{h}}_2 ,\ldots , \overleftarrow{\varvec{h}}_n \end{bmatrix} = \overleftarrow{LSTM}\left( \begin{bmatrix} \varvec{e}_1^c,\varvec{e}_2^c,\ldots ,\varvec{e}_n^c \end{bmatrix} \right) , \end{aligned} \end{aligned}$$
(5)

where \(\overrightarrow{LSTM}\) and \(\overleftarrow{LSTM}\) denote the forward and backward LSTMs and \(\varvec{e}_i^c(i=1,2,\ldots ,n)\) are character embeddings. Finally, we get the representation with context information of \(x_i\) by concatenating \(\overrightarrow{\varvec{h}}_i\) and \(\overleftarrow{\varvec{h}}_i\):

$$\begin{aligned} \varvec{h}_i = \begin{bmatrix} \overrightarrow{\varvec{h}}_i, \overleftarrow{\varvec{h}}_i\end{bmatrix}. \end{aligned}$$
(6)
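A minimal PyTorch sketch of Eqs. 2-6 using the built-in bidirectional LSTM is given below; the hidden size is illustrative, not the value used in our experiments.

```python
import torch
import torch.nn as nn

class CharBiLSTM(nn.Module):
    """Bidirectional LSTM over character embeddings (Eqs. 2-6)."""
    def __init__(self, embed_dim: int = 768, hidden_dim: int = 256):
        super().__init__()
        # bidirectional=True realizes the forward/backward LSTMs of Eq. 5;
        # their outputs are already concatenated as in Eq. 6.
        self.lstm = nn.LSTM(embed_dim, hidden_dim,
                            batch_first=True, bidirectional=True)

    def forward(self, char_embs: torch.Tensor) -> torch.Tensor:
        # char_embs: (batch, n, embed_dim) -> h: (batch, n, 2 * hidden_dim)
        h, _ = self.lstm(char_embs)
        return h
```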

3.3 Triple Fusion Module

To adapt the idea of KV-MemNN to our need of taking triples as input, we modify its architecture and fuse information in three steps.

Generating triples. For each character \(x_i\) in the input sentence X, we first get its matched words in a lexicon, represented as \(W_i = \left\{ w_{i1}, w_{i2},\ldots ,w_{im}\right\} \). Here, \(w_{i\cdot }\) is a sub-sequence of X that contains \(x_i\), i.e., \(w_{i\cdot }=\{x_{i-a},\ldots ,x_i,\ldots ,x_{i+b}\}\), where \(0 \le a \le i\) and \(0 \le b \le n-i\). For each word \(w_{ij}\) in \(W_i\), a triple is generated by:

$$\begin{aligned} t_{ij} = \left( x_i, p_{ij}, w_{ij} \right) , \end{aligned}$$
(7)

which means that \(x_i\) is at position \(p_{ij}\) of \(w_{ij}\). Here, \(p_{ij}\) is a position tag drawn from \(\left\{ B,E,S,M_1,M_2,\ldots ,M_{k-2}\right\} \), where k is the maximum length of the words in the lexicon. Specifically, as in common sequence labeling tag sets, B denotes that the character is at the Beginning position of the word, and the other labels are defined analogously. Note that S is treated as a position tag denoting that the word consists of a single character. Furthermore, we distinguish the different middle positions because different middle characters are lexically closer to different parts of a word, and this finer-grained position information benefits NER inference. For example, the characters meaning "labor" and "intelligence" both occupy middle positions in the word for "artificial intelligence", but the "labor" character is closer to the character meaning "human" while the "intelligence" character is closer to the character meaning "ability"; consequently, their positions should not be confused. We use \(M_1\) to denote the first Middle position of the word, \(M_2\) the second Middle position, and so on.
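The triple generation step can be sketched as follows; the lexicon lookup, the maximum word length and the helper names are our own illustrative assumptions rather than the exact implementation.

```python
from typing import List, Tuple

def position_tag(index: int, length: int) -> str:
    """Map a character's index inside a word to a position tag."""
    if length == 1:
        return "S"              # single-character word
    if index == 0:
        return "B"              # beginning of the word
    if index == length - 1:
        return "E"              # end of the word
    return f"M{index}"          # first, second, ... middle position

def generate_triples(sentence: str, lexicon: set,
                     max_word_len: int = 5) -> List[List[Tuple[str, str, str]]]:
    """Return, for each character, its (character, position, word) triples."""
    n = len(sentence)
    triples = [[] for _ in range(n)]
    for start in range(n):
        for end in range(start + 1, min(start + max_word_len, n) + 1):
            word = sentence[start:end]
            if word in lexicon:
                for k, ch in enumerate(word):
                    triples[start + k].append(
                        (ch, position_tag(k, len(word)), word))
    return triples
```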

To illustrate the process of generating triples concretely, we take the sentence meaning "artificial intelligence is interesting" as an example. The character meaning "labor" occurs in three matched words: the single-character word "labor", the word for "artificial" and the word for "artificial intelligence". Three triples are therefore generated, meaning respectively that the "labor" character forms a single-character word, that it is at the end position of the word for "artificial", and that it is at the first middle position of the word for "artificial intelligence". The triples belonging to the character meaning "intelligence" are produced in the same way. The illustration is shown in Fig. 4.

Fig. 4

An example of generating triples. The triples strung together with lines belong to the same character

Digitizing triple sets. The triples are digitized before TFM fuses the information they contain. For each triple \(t_{ij}\) of \(x_i\), the word embedding and character position embedding are obtained as follows:

$$\begin{aligned}&\varvec{e}_{ij}^w = Word\left( w_{ij}\right) , \end{aligned}$$
(8)
$$\begin{aligned}&\varvec{e}_{ij}^p = Position\left( p_{ij}\right) , \end{aligned}$$
(9)

where \(Word(\cdot )\) and \(Position(\cdot )\) are embedding lookup tables for words and character positions. Also, if the embedding of \(x_i\) in \(t_{ij}\) is represented as \(\varvec{e}_{ij}^x\), then the triple is updated as:

$$\begin{aligned} \varvec{e}_{ij}^t=\left( \varvec{e}_{ij}^x,\varvec{e}_{ij}^p,\varvec{e}_{ij}^w\right) . \end{aligned}$$
(10)

In this work, since \(\varvec{h}_i\) can be viewed as the character representation with context information, we set \(\varvec{e}_{ij}^x \left( j=1,\ldots ,m \right) \) as \(\varvec{h}_i\).
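A sketch of the digitization step (Eqs. 8-10) is given below, where \(Word(\cdot)\) and \(Position(\cdot)\) are realized as embedding lookup tables and the character slot is filled with the BiLSTM state \(\varvec{h}_i\); the dimensions and class names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TripleDigitizer(nn.Module):
    """Map symbolic triples to embedding triples (Eqs. 8-10)."""
    def __init__(self, word_vocab: dict, pos_vocab: dict,
                 word_dim: int = 50, pos_dim: int = 50):
        super().__init__()
        self.word_vocab, self.pos_vocab = word_vocab, pos_vocab
        self.word_emb = nn.Embedding(len(word_vocab), word_dim)   # Word(.)
        self.pos_emb = nn.Embedding(len(pos_vocab), pos_dim)      # Position(.)

    def forward(self, h_i: torch.Tensor, triples):
        """h_i: BiLSTM state of the character; triples: list of (char, pos, word)."""
        pos_idx = torch.tensor([self.pos_vocab[p] for _, p, _ in triples])
        word_idx = torch.tensor([self.word_vocab[w] for _, _, w in triples])
        e_p = self.pos_emb(pos_idx)                       # (m, pos_dim), Eq. 9
        e_w = self.word_emb(word_idx)                     # (m, word_dim), Eq. 8
        e_x = h_i.unsqueeze(0).expand(len(triples), -1)   # reuse h_i as e^x (Eq. 10)
        return e_x, e_p, e_w
```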

Fusing information. After obtaining the digitized triples, we fuse the character, word and character position information. Considering the instability of word frequency statistics across different corpora, and following the main idea of KV-MemNN, the weight of each word \(w_{ij}\) in the triple set of \(x_i\) is computed by:

$$\begin{aligned} q_{ij} = \frac{exp\left( \varvec{e}_{ij}^x \cdot \varvec{e}_{ij}^w \right) }{\sum \nolimits _{k=1}^m exp\left( \varvec{e}_{ik}^x \cdot \varvec{e}_{ik}^w \right) }. \end{aligned}$$
(11)

Then word information and character position information are concatenated:

$$\begin{aligned} \varvec{e}_{ij}^f =\varvec{W}_f \cdot \varvec{e}_{ij}^w \oplus \varvec{e}_{ij}^p, \end{aligned}$$
(12)

where \(\varvec{W}_f\) is a trainable parameter. The lexicon information of \(x_i\) is then computed by:

$$\begin{aligned} \varvec{e}_i^l = \sum _{j=1}^{m}{q_{ij} \varvec{e}_{ij}^f}. \end{aligned}$$
(13)

Afterwards, \(\varvec{e}_i^l\) and \(\varvec{h}_i\) are concatenated and returned as the fusion information:

$$\begin{aligned} \varvec{v}_i = \varvec{h}_i \oplus \varvec{e}_i^l. \end{aligned}$$
(14)

Here, we continue with the previous example to describe the details. As shown in the right part of Fig. 2, the triples assigned to the three matched words of character \(x_2\) are \(\left( \varvec{e}_{21}^x,\varvec{e}_{21}^p,\varvec{e}_{21}^w\right) ,\left( \varvec{e}_{22}^x,\varvec{e}_{22}^p,\varvec{e}_{22}^w\right) ,\left( \varvec{e}_{23}^x,\varvec{e}_{23}^p,\varvec{e}_{23}^w\right) \). In this step, the weights \(q_{21}, q_{22}, q_{23}\) are first calculated by Eq. 11. Then the word and character position embeddings are concatenated as \(\varvec{e}_{21}^f, \varvec{e}_{22}^f, \varvec{e}_{23}^f\). Based on the weights and the concatenated embeddings, the lexicon information \(\varvec{e}_2^l\) is calculated according to Eq. 13. Finally, the fusion information \(\varvec{v}_2\) is output by concatenating \(\varvec{e}_2^l\) with \(\varvec{h}_2\).
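The fusion step (Eqs. 11-14) can be sketched as follows. Since the paper does not spell out how the dimensions of \(\varvec{e}_{ij}^x\) and \(\varvec{e}_{ij}^w\) are matched for the dot product in Eq. 11, the sketch adds a projection of the character state into the word-embedding space, and it reads Eq. 12 as applying \(\varvec{W}_f\) to the word embedding before concatenation; both are our assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TripleFusion(nn.Module):
    """Fuse character, word and character position information (Eqs. 11-14)."""
    def __init__(self, char_dim: int, word_dim: int, pos_dim: int):
        super().__init__()
        # Projection so the dot product in Eq. 11 is well defined (our assumption).
        self.proj = nn.Linear(char_dim, word_dim, bias=False)
        # W_f in Eq. 12, read here as acting on the word embedding only (assumption).
        self.W_f = nn.Linear(word_dim, word_dim, bias=False)

    def forward(self, h_i: torch.Tensor, e_w: torch.Tensor,
                e_p: torch.Tensor) -> torch.Tensor:
        # h_i: (char_dim,), e_w: (m, word_dim), e_p: (m, pos_dim)
        q = F.softmax(e_w @ self.proj(h_i), dim=0)     # Eq. 11, weights over m words
        e_f = torch.cat([self.W_f(e_w), e_p], dim=-1)  # Eq. 12, (m, word_dim + pos_dim)
        e_l = q @ e_f                                  # Eq. 13, lexicon information
        return torch.cat([h_i, e_l], dim=-1)           # Eq. 14, fusion vector v_i
```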

3.4 CRF Layer

A standard CRF [20] layer is used on top of the BiLSTM layer and TFM. Given a predicted tag sequence \(Y=\left\{ y_1,y_2,\ldots ,y_n\right\} \in V_l\), where \(V_l\) denotes the label set, the probability of the predicted sequence is

$$\begin{aligned} P\left( Y \mid X \right) = \frac{exp \left( \sum _{i} \left( \varvec{W}_{CRF}^{y_i}\varvec{v}_i+\varvec{b}_{CRF}^{\left( y_{i-1}, y_i\right) }\right) \right) }{\sum _{Y^\prime }{exp \left( \sum _i \left( \varvec{W}_{CRF}^{y_i^\prime }\varvec{v}_i + \varvec{b}_{CRF}^{\left( y_{i-1}^\prime , y_i^\prime \right) } \right) \right) }}, \end{aligned}$$
(15)

where \(Y^\prime \) denotes an arbitrary tag sequence, and \(\varvec{W}_{CRF}^{y_i}\) and \(\varvec{b}_{CRF}^{\left( y_{i-1},\ y_i\right) }\) are trainable parameters. We use the Viterbi algorithm [11] to obtain the predicted tag sequence. Given a set of training data \(\left\{ \left( X_i,Y_i\right) \right\} |_{i=1}^N\), a log-likelihood loss function is used to train the model:

$$\begin{aligned} L=\sum _{i=1}^{N}{\log P\left( Y_i | X_i\right) }. \end{aligned}$$
(16)
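As a stand-in for this layer, the sketch below uses the third-party pytorch-crf package for the sequence log-likelihood of Eqs. 15-16 and Viterbi decoding; it is an illustration under that assumption, not the authors' implementation.

```python
import torch
import torch.nn as nn
from torchcrf import CRF   # third-party package: pip install pytorch-crf

class CRFLayer(nn.Module):
    """Score tags from fusion vectors and decode with a linear-chain CRF."""
    def __init__(self, fusion_dim: int, num_tags: int):
        super().__init__()
        self.emission = nn.Linear(fusion_dim, num_tags)  # per-tag scores (W_CRF, b_CRF)
        self.crf = CRF(num_tags, batch_first=True)

    def loss(self, v: torch.Tensor, tags: torch.Tensor,
             mask: torch.Tensor) -> torch.Tensor:
        # Negative log-likelihood of the gold tag sequences (Eqs. 15-16).
        return -self.crf(self.emission(v), tags, mask=mask, reduction="mean")

    def decode(self, v: torch.Tensor, mask: torch.Tensor):
        # Viterbi decoding of the most probable tag sequence.
        return self.crf.decode(self.emission(v), mask=mask)
```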

4 Experiments

We conduct a series of experiments on public Chinese NER datasets to study the effectiveness of our model. Standard precision (P), recall (R) and F1-score (F1) are used to evaluate the performance.
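Precision, recall and F1 are computed at the entity level over exact-match spans; a minimal sketch, assuming well-formed BMES tag sequences and hypothetical helper names, is shown below.

```python
from typing import List, Set, Tuple

def bmes_spans(tags: List[str]) -> Set[Tuple[int, int, str]]:
    """Extract (start, end, type) entity spans, assuming well-formed BMES tags."""
    spans, start = set(), None
    for i, tag in enumerate(tags):
        prefix, _, label = tag.partition("-")
        if prefix == "S":
            spans.add((i, i, label)); start = None
        elif prefix == "B":
            start = i
        elif prefix == "E" and start is not None:
            spans.add((start, i, label)); start = None
        elif prefix == "O":
            start = None
        # "M" tags simply continue the current entity
    return spans

def entity_prf1(gold: List[List[str]], pred: List[List[str]]):
    """Entity-level precision, recall and F1 over a corpus of tag sequences."""
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        gs, ps = bmes_spans(g), bmes_spans(p)
        tp += len(gs & ps); fp += len(ps - gs); fn += len(gs - ps)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```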

4.1 Experiment Setup

Preparation. We perform experiments on three public Chinese NER datasets: Resume [48], Weibo [34] and MSRA [21]. The three datasets come from different domains: Resume is collected from Sina Finance, Weibo from Sina Weibo and MSRA from newswire. Gold-standard segmentation is not available for these datasets. The lexicon we use is the one released by [48] and contains 704.4k words. The position lookup table is randomly initialized and trained. The BERT pre-trained model we use is bert-base-chinese, and its parameters are kept fixed. All tagging schemes are transformed to the BMES tagging style for consistency across experiments.
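The conversion to the BMES tagging style can be done as in the sketch below; the original datasets use different schemes, and this common BIO-to-BMES mapping is our illustration rather than the exact procedure used in the paper.

```python
from typing import List

def bio_to_bmes(tags: List[str]) -> List[str]:
    """Convert a BIO-tagged sequence to the BMES tagging style."""
    bmes = []
    for i, tag in enumerate(tags):
        if tag == "O":
            bmes.append("O")
            continue
        prefix, _, label = tag.partition("-")
        nxt = tags[i + 1] if i + 1 < len(tags) else "O"
        continues = nxt == "I-" + label          # does the entity continue?
        if prefix == "B":
            bmes.append(("B-" if continues else "S-") + label)
        else:  # prefix == "I": interior or final character of the entity
            bmes.append(("M-" if continues else "E-") + label)
    return bmes

# Example: ["B-ORG", "I-ORG", "I-ORG", "O"] -> ["B-ORG", "M-ORG", "E-ORG", "O"]
```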

Hyper-parameter settings. The main hyper-parameters used in our experiments are shown in Table 1. We keep some parameters the same as Lattice-LSTM, including the word embedding size, dropout and learning rate decay. Other parameters, such as the LSTM hidden size and learning rate, are adjusted to fit our model.

Table 1 Hyper-parameter settings

Models for comparison. Since our model focuses on integrating lexicon information, recent models that share the same goal are selected as baselines, i.e.,

  1.

    BERT-tagger [6] fine-tunes BERT for encoding and uses a classification layer for decoding.

  2.

    BERT+BiLSTM-CRF is based on BiLSTM-CRF and uses BERT as the encoder.

  3.

    CAN-NER [49] captures context information with a character-based CNN and a GRU.

  4.

    self-attention+BiLSTM-CRF [4] adds the self-attention mechanism to the BiLSTM-CRF model, which integrates character and word information.

  5.

    BERT+MFE [22] incorporates semantic, glyph and phonetic features into the character-based model.

  6.

    Multi-digraph model [8] uses graph neural networks to integrate lexicon information.

  7.

    LGN [15] utilizes a graph neural network to solve word ambiguities by integrating characters, potential words and sentence semantics.

  8.

    CGN [37] is proposed to integrate self-matched lexical words and nearest contextual lexical words.

  9.

    Lattice-LSTM [48] is a variant of LSTM which incorporates all potential words into a character-based model.

  10.

    LR-CNN [14] is a CNN-based method that incorporates lexicons using a rethinking mechanism.

  11.

    WC-LSTM [26] adds word information to the start or the end character of the word, aiming to solve some problems found in Lattice-LSTM. WC-LSTM provides four word encoding strategies; for each dataset, we pick the best-performing strategy for comparison.

  12.

    HiLSTM [12] is a hierarchical LSTM framework that considers not only words in the lexicon but also words and subwords in the sentences.

  13.

    BERT+SoftLexicon [28] follows the idea of Lattice-LSTM while avoiding its complex structure and improving model performance.

  14.

    SLK-NER [18] fuses different second-order lexicon knowledge (SLK) with the global attention information to alleviate the impact of word boundary conflicts.

  15.

    AM-BiLSTM [46] enhances character embeddings with a multi-word information feature, which preserves word information by matching against a lexicon.

4.2 Results on Benchmark Datasets

Table 2 Results on Resume
Table 3 Results on Weibo
Table 4 Results on MSRA

Experimental results on Resume, Weibo and MSRA are shown in Tables 2, 3 and 4, respectively. We divide the baselines into four groups: general Chinese NER models that do not integrate any lexicon information, models using other kinds of external knowledge, models utilizing lexicon information through graph neural networks, and state-of-the-art Chinese NER models that focus on integrating lexicon information.

As can be seen from the tables, our model outperforms all the baselines in the first group. This confirms that integrating lexicon information has a positive effect on the character-based model.

Baselines in the second group integrate other kinds of knowledge, including word segmentation information and character feature information. Our model performs better than these, since word segmentation inevitably introduces errors and character features alone cannot provide enough information for NER inference.

Comparing our model with the baselines in the third group, we find that the F1-scores of the graph-based models do not surpass ours. One reason is that these baselines do not exploit the full lexicon information, a shortcoming that TFM overcomes.

Our model also outperforms all the state-of-the-art baselines in the fourth group. We attribute the improvements to the proposed TFM, which integrates more lexicon information and fuses the multiple types of information in a more appropriate manner.

Overall, our model achieves the best F1 and P, which demonstrates its merits.

4.3 Efficiency Study

In this section, we evaluate the inference speed of our model, following the settings of [28]. We choose the general BERT+BiLSTM-CRF model and three models that integrate lexicon information for comparison.

Fig. 5

Relative inference speed on three datasets

Results are shown in Fig. 5. The inference speed of our model is higher than that of Lattice-LSTM and LR-CNN, and is more than twice that of Lattice-LSTM, which demonstrates the improvement in efficiency. Although TFM is inserted, the inference speed of our model remains very close to that of the BERT+BiLSTM-CRF model. Moreover, SoftLexicon focuses on integrating lexicon information while reducing running time; our model integrates more information, yet its inference speed does not drop significantly.

4.4 Influence of Different Sequence Modeling Layers

Although the general sequence labeling model BiLSTM-CRF is used as the backbone of our model, we still want to explore the influence of different sequence modeling layers. To this end, we replace BiLSTM with a CNN and a Transformer [43]. The F1-scores are shown in Table 5.

Table 5 F1-scores with different sequence modeling layers

From the table, we find that our model empirically works best with BiLSTM, and the F1-scores drop when other sequence modeling layers are applied. Since most existing models choose BiLSTM, we follow them to make the comparisons fairer.

4.5 Influence of Different Embedding Methods

We try three other embedding methods, i.e., fastText [1], word2vec [29] and ERNIE [38], to evaluate the influence of different embedding methods. For fastText, we use the pre-trained model provided by Facebook. For word2vec, we use the pre-trained embeddings released by [48]. For ERNIE, we use a PyTorch version. The results are shown in Table 6.

Table 6 F1-scores with different embedding methods

The F1-scores with fastText and word2vec are lower than those with BERT and ERNIE, especially on the Weibo dataset. One reason is that Weibo texts are written in an informal register and static embeddings cannot express the exact meanings of characters. The F1-scores with ERNIE are close to those with BERT, but BERT is more widely used and our model works better with it, so we choose BERT as our embedding layer.

4.6 Performance on Different Entities

We further analyze the F1-scores on different entity types to investigate whether the improvement of our model is general or entity-specific. We compare our model with Lattice-LSTM and BERT+BiLSTM-CRF. Experimental results are shown in Fig. 6.

Fig. 6

F1-scores of different entity types on three datasets

All three models perform well on Resume, and our model improves only on certain entity types, such as ORG, TITLE and PRO. On Weibo, the F1-scores of our model exceed those of Lattice-LSTM on all four entity types. Compared with BERT+BiLSTM-CRF, the F1-score of our model on the ORG entity drops; this drop, however, does not affect the overall improvement, since the F1-scores on the other entity types rise. Finally, our model achieves slight F1-score increases on three entity types in MSRA, which contributes to its overall advantage.

Table 7 An example from Weibo

4.7 Case Study

In this section, we analyze a sentence from the Weibo dataset meaning "Amazon's official WeChat package tracking service launched". The design of this experiment follows [26]. Experimental results are shown in Table 7. Even with the help of lexicon knowledge, Lattice-LSTM fails to recognize the organization entity "Amazon". In contrast, our model makes the correct prediction. Inspecting the lexicon used in the experiment, we find that almost every word ending with the same final character as the Chinese word for "Amazon" is an entity, e.g., the Chinese renderings of "Robinson", "Tao Xun" and "Nelson". In this situation, the position information of that character within a word plays an important role in recognizing the named entity. TFM fuses this character position information for the word "Amazon", which contributes to the correct inference of our model.

During the lexicon matching process, many words can be assigned to a character, and most of them are useless. Although all matched words are fused, we still hope that TFM pays more attention to the potentially correct words. We record the weights of the words for each character in TFM and obtain the heatmap shown in Fig. 7. As the heatmap shows, correct words such as "Amazon" and "official" are assigned greater weights. This proves that TFM can automatically diminish the disturbance of useless words.

Fig. 7

Heatmap of the word weights for the example sentence meaning "Amazon's official WeChat package tracking service launched". We truncate the sentence for brevity. The darker the color, the greater the weight

Table 8 An ablation study on our model

4.8 Ablation Study

To investigate the factors that affect the performance of our model, we conduct an ablation study on the three datasets; the results are shown in Table 8.

  1.

    In the “- TFM” experiment, we remove the proposed TFM, and the model reduces to the BERT+BiLSTM-CRF model. The performance drops in this case, which demonstrates the effectiveness of our proposed module.

  2.

    Through the “-position” and “-word” experiments, we study the influence of different information combinations. When only one kind of lexicon information is fused, the F1-scores drop. This shows that our model works best when the word and character position information are combined.

  3.

    Affected by the labeling paradigm of sequence labeling, models like SoftLexicon overlook the distinction between different middle positions and put them into one group. In the “- ‘M’ distinction” experiment, we likewise do not distinguish middle positions, and the performance of the model declines. This proves that distinguishing different middle positions during the fusion process has a positive effect on the performance of our model.

5 Conclusion

In this paper, we draw attention to the character position information available in a lexicon, which is ignored by other models, and propose TFM to integrate lexicon knowledge into the character-based model. TFM effectively handles the loss of word information in the character-based model and, through character position information, enhances the ability to understand the relationship between a word and its characters. Experiments on public datasets show that our model achieves strong performance in terms of both efficiency and F1-score.

Finally, we outline our future work. The proposed model works well when training data is sufficient, but it can overfit when training examples are extremely scarce, i.e., in few-shot settings [17]. We therefore plan to explore how to integrate external knowledge into few-shot NER models.