1 Introduction

Named Entity Recognition (NER) is a fundamental task in natural language processing (NLP) that aims to recognize specific entities in unstructured text, such as persons, countries and institutions. It is an upstream task for many currently popular NLP applications, including question generation [7], event extraction [41] and knowledge graph construction [10]. In English, NER is typically cast as a sequence labeling task, with the BiLSTM-CRF architecture [19] serving as the standard model backbone. Popular models operate on words, since English sentences contain natural delimiters [5, 24, 33, 47].

Fig. 1

The lattice structure. Green directed arrows denote information flow paths which connect word cells with their first character cells and last character cells

However, word-based NER models do not work well for Chinese NER. Chinese sentences have no natural delimiters, so they must be segmented into words before a word-based model can be applied. Segmentation is error-prone for machines, and incorrect segmentation can mislead NER inference. To eliminate the influence of word segmentation errors, character-based models have become mainstream in Chinese NER [9, 14, 37]. Character-based models, however, gradually hit a bottleneck because they ignore the fact that many Chinese characters have multiple meanings, and the intended meaning can often only be determined by referring to the words they appear in. Guided by the idea that a lexicon can play a supplementary role, researchers have turned to integrating lexicon information into character-based models, and the well-known Lattice-LSTM [48] was proposed in this context. The schematic diagram of Lattice-LSTM is shown in Fig. 1. Additional information flow paths are added to the standard LSTM layer [19], connecting words with their first and last characters. Through a well-designed computation, Lattice-LSTM effectively integrates lexicon information into the character-based model. Although Lattice-LSTM is proven effective, its drawbacks are obvious: the complex architecture greatly slows down inference, which limits its practicality. Subsequent models such as WC-LSTM [26] and SoftLexicon [28] successfully solve the inefficiency problem of Lattice-LSTM, but they are limited to simply concatenating lexicon information with character embeddings. As a result, finding a more suitable way to integrate lexicon information remains challenging.

In addition, word information is often treated as the only information that can be obtained from a lexicon in recent models, and it is presented as word embeddings when integrated into a model. According to our observation, however, another kind of information can also be extracted from the lexicon, i.e., the position of a character within a word. This information is rarely taken seriously by recent models, yet it can be valuable for distinguishing the different meanings of a character. Take, for example, a character that means either "substitute" or "era" depending on its position. If it is at the beginning of a word, it most likely conveys the meaning of "substitute" or "acting", and in this case character position information helps to recognize TITLE entities such as "acting general manager". If it is at the end of a word, it usually conveys the meaning of "era" or "epoch", which is conducive to identifying TIME entities such as "Ming Dynasty".

Fig. 2

Comparison between KV-MemNN and TFM. The KV-MemNN cell is on the left and the TFM cell is on the right

To utilize character position information and address the problems left by previous models, we propose TFM, a triple fusion module that fuses character, word and character position information from the lexicon. This module is inspired by the key-value memory network (KV-MemNN) [30], which was designed for the question answering task. The original KV-MemNN first stores information in key-value slots. It then uses the keys and a fixed question embedding to compute addressing weights, by which the values are selectively retained. As shown in Fig. 2, the differences between KV-MemNN and TFM lie mainly in the input form and the internal calculation process. TFM takes lexicon information triples as input. Weights are then computed from the character and word information. Finally, the output is derived from embeddings obtained by concatenating the word and character position information. Despite these differences, TFM keeps the main idea of KV-MemNN, i.e., memorizing information according to weights. Following a line of recent works [39, 40], we insert TFM into the general BiLSTM-CRF architecture to form our model. The contributions of our work can be summarized as follows:

  1.

    We propose TFM to integrate lexicon information into the character-based Chinese NER model. TFM fuses information neither through complicated calculation nor through simple concatenation, but in a manner that lies between the two.

  2.

    Apart from word information, we exploit another kind of information from the lexicon that is overlooked by other methods, i.e., character position information. This information is fused together with the character and word information in TFM and finally integrated into the character-based model.

  3.

    To investigate the performance of our model, we evaluate it on three public Chinese NER datasets, i.e., Resume [48], Weibo [34] and MSRA [21]. In these experiments, our model outperforms all the comparison models.

The remaining sections of this article are organized as follows. Section 2 reviews related work. Section 3 introduces the details of our model. Section 4 describes the experimental setup and reports the results. Section 5 draws a conclusion.

2 Related Work

2.1 NER with Lexicon Information

Languages such as English do not suffer from the loss of word information caused by missing delimiters, so there is no need to deliberately consider word information in NER for these languages. Even so, utilizing words and phrases from a lexicon can let models know specific entity instances in advance, thus enhancing their inference ability. For example, Liu et al. [25] concatenated lexicon query results to the output of a BiLSTM and obtained tags with a Semi-CRF. Peshterliev et al. [35] put gazetteer embeddings together with word embeddings, introducing lexicon information in the embedding layer. This concatenation-based way of fusing knowledge is also seen in Japanese and Hindi NER [13, 31]. Unlike NER in word-delimited languages, the demand for word information in Chinese NER is not confined to particular entity nouns but extends to all words in a sentence, so the integration of word information needs further investigation.

Different methods have been proposed. On the one hand, joint training and transfer learning have been explored. Peng and Dredze [34] jointly trained Chinese NER and Chinese word segmentation (CWS) models, improving the Chinese NER model by nearly 5% absolute. Wu et al. [44] used a CNN to capture local context and also jointly trained Chinese NER and CWS models. Cao et al. [3] applied adversarial transfer learning to Chinese NER, incorporating word boundary information from the CWS task. On the other hand, lexicon information has been exploited directly. Ding et al. [8] constructed a directed acyclic graph to connect characters and lexicon words, integrating both with a graph neural network. Gui et al. [15] and Sui et al. [37] also utilized graph neural networks to integrate word information. Zhang and Yang [48] changed the architecture of the standard LSTM, using shortcut paths to link character cells and word cells, which forms a lattice structure. In view of the complexity and inefficiency of the lattice structure, Li et al. [23] proposed FLAT, Liu et al. [26] proposed WC-LSTM and Ma et al. [28] proposed SoftLexicon, all focusing on simplifying the complicated lattice structure, improving running speed and advancing the applicability of the models. More recently, Hu and Wei [18] revisited the second-order lexicon knowledge of a character to relieve word boundary conflicts, and Gong et al. [12] constructed a hierarchical tree structure to utilize characters, subwords and lexicon words. Unlike all these models, we exploit one more type of information from the lexicon, i.e., character position information, and design TFM to integrate lexicon knowledge into the character-based model.

2.2 Key-Value Memory Network

KV-MemNN [30] was proposed for the question answering task, aiming to narrow the gap between querying knowledge bases and reading directly from documents. It encodes and integrates prior knowledge in key-value pairs. This way of incorporating information transfers well to other settings and is often superior to simple embedding concatenation. As a result, it has been applied to other tasks, such as image recognition [2], clinical diagnostic inferencing [36] and machine translation [42]. KV-MemNN has also proven its merits in sequence labeling. Tian et al. [40] utilized wordhood information with KV-MemNN for CWS. In slot filling for dialogue systems, Wu et al. [45] adopted this method to trace long-term slot context. For NER, Gui et al. [16] recorded context representations and label embeddings to obtain document-level representations, while Luo et al. [27] incorporated word representations and hidden states, both following the idea of KV-MemNN. Other researchers have leveraged syntactic information: Nie et al. [32] learned three types of syntactic information and Tian et al. [39] injected syntactic knowledge into a biomedical NER model. It is worth noting that all the mentioned methods for sequence labeling generate task-specific key-value pairs while keeping the basic structure of KV-MemNN unchanged. Different from these models, we incorporate lexicon knowledge into the NER model and modify KV-MemNN to accept triples as input instead of key-value pairs.
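For reference, the following is a minimal sketch of the addressing-and-reading step at the core of KV-MemNN; the function name, tensor shapes and variable names are our own illustration, not the authors' code.

```python
import torch
import torch.nn.functional as F

def kv_memory_read(query: torch.Tensor,
                   keys: torch.Tensor,
                   values: torch.Tensor) -> torch.Tensor:
    """One key-value memory read: address the memory slots with the query,
    then return a weighted sum of the values.

    query:  (d,)    e.g., a fixed question embedding
    keys:   (m, d)  one key vector per memory slot
    values: (m, d)  the value vector paired with each key
    """
    scores = keys @ query                # relevance of each slot, shape (m,)
    weights = F.softmax(scores, dim=0)   # addressing weights
    return weights @ values              # weighted read-out, shape (d,)
```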

Fig. 3

The architecture of our model. The left part is the standard BiLSTM-CRF model and the right part is the proposed TFM. Word information is fused with character and character position information, where \(\oplus \) denotes concatenation operation, \(\otimes \) denotes element-wise product operation, \(\textcircled {s}\) denotes the softmax function and \(\diamond \) denotes the formation of triples

3 Method

In this work, we propose TFM to incorporate lexicon information into the character-based model. The architecture of our model is illustrated in Fig. 3: the general BiLSTM-CRF model is on the left, and the proposed TFM on the right operates between the BiLSTM layer and the CRF layer. The first layer is an embedding layer, which maps characters to dense vectors. The second layer is a BiLSTM layer, which produces character representations with context information. Then, triples containing character, word and character position information are fed into TFM to obtain the fusion information. The last layer is the CRF layer, which takes the fusion information as input and outputs the predicted named entity tags. The details are elaborated in the rest of this section.

3.1 Embedding Layer

In the embedding layer, the characters of the input sentence are mapped to dense vectors. Formally, given an input sentence \(X=\left\{ x_1,x_2,\ldots ,x_n\right\} \in V_C\), where n denotes the length of the sentence and \(V_C\) denotes the character vocabulary, each character \(x_i\) is represented as:

$$\begin{aligned} \varvec{e}_i^c=BERT\left( x_i\right) , \end{aligned}$$
(1)

where \(BERT\left( \cdot \right) \) denotes the pre-trained BERT [6] model, which encodes abundant semantic information and has been widely used in NLP tasks. Unlike static embeddings, BERT generates dynamic embeddings for the same character depending on its surrounding context.
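For concreteness, the sketch below shows one way to obtain per-character BERT embeddings (Eq. 1) with the HuggingFace transformers library; the library calls are standard, but treating each input character as exactly one token is our simplification (it generally holds because the Chinese BERT vocabulary is character-level).

```python
import torch
from transformers import BertModel, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")
bert = BertModel.from_pretrained("bert-base-chinese")
bert.eval()  # BERT parameters are kept fixed in this work

def char_embeddings(sentence: str) -> torch.Tensor:
    """Return one contextual embedding per character (Eq. 1)."""
    # Chinese BERT tokenizes essentially character by character, so each
    # input character is assumed to map to a single token here.
    enc = tokenizer(list(sentence), is_split_into_words=True,
                    return_tensors="pt", add_special_tokens=True)
    with torch.no_grad():
        out = bert(**enc).last_hidden_state   # (1, n + 2, 768)
    return out[0, 1:-1]                       # drop [CLS]/[SEP] -> (n, 768)
```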

3.2 BiLSTM Layer

BiLSTM [19] is good at capturing context information. Since the standard structure of BiLSTM is not modified in our model, we only briefly introduce its forward calculation process:

$$\begin{aligned}&\begin{bmatrix} \varvec{i}_i \\ \varvec{f}_i \\ \varvec{o}_i \\ \widetilde{\varvec{c}}_i \end{bmatrix} = \begin{bmatrix} \sigma \\ \sigma \\ \sigma \\ tanh \end{bmatrix}\left( \varvec{W} \begin{bmatrix} \varvec{e}_i^c \\ \varvec{h}_{i-1} \end{bmatrix} + \varvec{b} \right) , \end{aligned}$$
(2)
$$\begin{aligned}&\varvec{c}_i = \widetilde{\varvec{c}}_i *\varvec{i}_i + \varvec{c}_{i-1} *\varvec{f}_i, \end{aligned}$$
(3)
$$\begin{aligned}&\varvec{h}_i = \varvec{o}_i *tanh(\varvec{c}_i), \end{aligned}$$
(4)

where \(\sigma \) is the sigmoid function, \(*\) is element-wise product, \(\varvec{W}\) and \(\varvec{b}\) are trainable parameters.

Given a sequence of character embeddings, BiLSTM is applied to exploit hidden expressions of characters from global context:

$$\begin{aligned} \begin{aligned} \begin{bmatrix} \overrightarrow{\varvec{h}}_1 , \overrightarrow{\varvec{h}}_2 ,\ldots , \overrightarrow{\varvec{h}}_n \end{bmatrix} = \overrightarrow{LSTM}\left( \begin{bmatrix} \varvec{e}_1^c,\varvec{e}_2^c,\ldots ,\varvec{e}_n^c \end{bmatrix} \right) , \\ \begin{bmatrix} \overleftarrow{\varvec{h}}_1 , \overleftarrow{\varvec{h}}_2 ,\ldots , \overleftarrow{\varvec{h}}_n \end{bmatrix} = \overleftarrow{LSTM}\left( \begin{bmatrix} \varvec{e}_1^c,\varvec{e}_2^c,\ldots ,\varvec{e}_n^c \end{bmatrix} \right) , \end{aligned} \end{aligned}$$
(5)

where \(\overrightarrow{LSTM}\) and \(\overleftarrow{LSTM}\) denote the forward and backward LSTMs and \(\varvec{e}_i^c(i=1,2,\ldots ,n)\) are character embeddings. Finally, we get the representation with context information of \(x_i\) by concatenating \(\overrightarrow{\varvec{h}}_i\) and \(\overleftarrow{\varvec{h}}_i\):

$$\begin{aligned} \varvec{h}_i = \begin{bmatrix} \overrightarrow{\varvec{h}}_i, \overleftarrow{\varvec{h}}_i\end{bmatrix}. \end{aligned}$$
(6)
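A minimal PyTorch sketch of Eqs. 2-6 using the built-in bidirectional LSTM is given below; the hidden size is illustrative, not the value used in our experiments.

```python
import torch
import torch.nn as nn

class CharBiLSTM(nn.Module):
    """Bidirectional LSTM over character embeddings (Eqs. 2-6)."""
    def __init__(self, embed_dim: int = 768, hidden_dim: int = 256):
        super().__init__()
        # bidirectional=True realizes the forward/backward LSTMs of Eq. 5;
        # their outputs are already concatenated as in Eq. 6.
        self.lstm = nn.LSTM(embed_dim, hidden_dim,
                            batch_first=True, bidirectional=True)

    def forward(self, char_embs: torch.Tensor) -> torch.Tensor:
        # char_embs: (batch, n, embed_dim) -> h: (batch, n, 2 * hidden_dim)
        h, _ = self.lstm(char_embs)
        return h
```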

3.3 Triple Fusion Module

To adapt the idea of KV-MemNN to our need of taking triples as input, we modify its architecture and fuse information in three steps.

Generating triples. For each character \(x_i\) in the input sentence X, we first get its matched words in a lexicon, represented as \(W_i = \left\{ w_{i1}, w_{i2},\ldots ,w_{im}\right\} \). Here, \(w_{i\cdot }\) is a sub-sequence of X that contains \(x_i\), i.e., \(w_{i\cdot }=\{x_{i-a},\ldots ,x_i,\ldots ,x_{i+b}\}\), where \(0 \le a \le i\) and \(0 \le b \le n-i\). For each word \(w_{ij}\) in \(W_i\), a triple is generated by:

$$\begin{aligned} t_{ij} = \left( x_i, p_{ij}, w_{ij} \right) , \end{aligned}$$
(7)

which means that \(x_i\) is at position \(p_{ij}\) of \(w_{ij}\). Here, \(p_{ij}\) is a position tag drawn from \(\left\{ B,E,S,M_1,M_2,\ldots ,M_{k-2}\right\} \), where k is the maximum length of the words in the lexicon. Specifically, as in common sequence labeling tag sets, B denotes that the character is at the Beginning position of the word, and the other labels are defined analogously. Note that S is treated as a position tag denoting that the word consists of a single character. Furthermore, we distinguish the different middle positions because different middle characters are lexically closer to different parts of a word, and this finer-grained position information benefits NER inference. For example, the characters meaning "labor" and "intelligence" both occupy middle positions in the word for "artificial intelligence", but the "labor" character is closer to the character meaning "human" while the "intelligence" character is closer to the character meaning "ability"; consequently, their positions should not be confused. We use \(M_1\) to denote the first Middle position of the word, \(M_2\) the second Middle position, and so on.
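The triple generation step can be sketched as follows; the lexicon lookup, the maximum word length and the helper names are our own illustrative assumptions rather than the exact implementation.

```python
from typing import List, Tuple

def position_tag(index: int, length: int) -> str:
    """Map a character's index inside a word to a position tag."""
    if length == 1:
        return "S"              # single-character word
    if index == 0:
        return "B"              # beginning of the word
    if index == length - 1:
        return "E"              # end of the word
    return f"M{index}"          # first, second, ... middle position

def generate_triples(sentence: str, lexicon: set,
                     max_word_len: int = 5) -> List[List[Tuple[str, str, str]]]:
    """Return, for each character, its (character, position, word) triples."""
    n = len(sentence)
    triples = [[] for _ in range(n)]
    for start in range(n):
        for end in range(start + 1, min(start + max_word_len, n) + 1):
            word = sentence[start:end]
            if word in lexicon:
                for k, ch in enumerate(word):
                    triples[start + k].append(
                        (ch, position_tag(k, len(word)), word))
    return triples
```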

To illustrate the process of generating triples concretely, we take the sentence meaning "artificial intelligence is interesting" as an example. The character meaning "labor" occurs in three matched words: the single-character word "labor", the word for "artificial" and the word for "artificial intelligence". Three triples are therefore generated, meaning respectively that the "labor" character forms a single-character word, that it is at the end position of the word for "artificial", and that it is at the first middle position of the word for "artificial intelligence". The triples belonging to the character meaning "intelligence" are produced in the same way. The illustration is shown in Fig. 4.

Fig. 4

An example of generating triples. The triples strung together with lines belong to the same character

Digitizing triple sets. The triples are digitized before TFM fuses the information they contain. For each triple \(t_{ij}\) of \(x_i\), the word embedding and character position embedding are obtained as follows:

$$\begin{aligned}&\varvec{e}_{ij}^w = Word\left( w_{ij}\right) , \end{aligned}$$
(8)
$$\begin{aligned}&\varvec{e}_{ij}^p = Position\left( p_{ij}\right) , \end{aligned}$$
(9)

where \(Word(\cdot )\) and \(Position(\cdot )\) are embedding lookup tables for words and character positions. Also, if the embedding of \(x_i\) in \(t_{ij}\) is represented as \(\varvec{e}_{ij}^x\), then the triple is updated as:

$$\begin{aligned} \varvec{e}_{ij}^t=\left( \varvec{e}_{ij}^x,\varvec{e}_{ij}^p,\varvec{e}_{ij}^w\right) . \end{aligned}$$
(10)

In this work, since \(\varvec{h}_i\) can be viewed as the character representation with context information, we set \(\varvec{e}_{ij}^x \left( j=1,\ldots ,m \right) \) as \(\varvec{h}_i\).
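A sketch of the digitization step (Eqs. 8-10) is given below, where \(Word(\cdot)\) and \(Position(\cdot)\) are realized as embedding lookup tables and the character slot is filled with the BiLSTM state \(\varvec{h}_i\); the dimensions and class names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TripleDigitizer(nn.Module):
    """Map symbolic triples to embedding triples (Eqs. 8-10)."""
    def __init__(self, word_vocab: dict, pos_vocab: dict,
                 word_dim: int = 50, pos_dim: int = 50):
        super().__init__()
        self.word_vocab, self.pos_vocab = word_vocab, pos_vocab
        self.word_emb = nn.Embedding(len(word_vocab), word_dim)   # Word(.)
        self.pos_emb = nn.Embedding(len(pos_vocab), pos_dim)      # Position(.)

    def forward(self, h_i: torch.Tensor, triples):
        """h_i: BiLSTM state of the character; triples: list of (char, pos, word)."""
        pos_idx = torch.tensor([self.pos_vocab[p] for _, p, _ in triples])
        word_idx = torch.tensor([self.word_vocab[w] for _, _, w in triples])
        e_p = self.pos_emb(pos_idx)                       # (m, pos_dim), Eq. 9
        e_w = self.word_emb(word_idx)                     # (m, word_dim), Eq. 8
        e_x = h_i.unsqueeze(0).expand(len(triples), -1)   # reuse h_i as e^x (Eq. 10)
        return e_x, e_p, e_w
```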

Fusing information. After obtaining the digitized triples, we fuse the character, word and character position information. Considering the instability of word frequency statistics across different corpora, and following the main idea of KV-MemNN, the weight of each word \(w_{ij}\) in the triple set of \(x_i\) is computed by:

$$\begin{aligned} q_{ij} = \frac{exp\left( \varvec{e}_{ij}^x \cdot \varvec{e}_{ij}^w \right) }{\sum \nolimits _{k=1}^m exp\left( \varvec{e}_{ik}^x \cdot \varvec{e}_{ik}^w \right) }. \end{aligned}$$
(11)

Then word information and character position information are concatenated:

$$\begin{aligned} \varvec{e}_{ij}^f =\varvec{W}_f \cdot \varvec{e}_{ij}^w \oplus \varvec{e}_{ij}^p, \end{aligned}$$
(12)

where \(\varvec{W}_f\) is a trainable parameter. The lexicon information of \(x_i\) is then computed by:

$$\begin{aligned} \varvec{e}_i^l = \sum _{j=1}^{m}{q_{ij} \varvec{e}_{ij}^f}. \end{aligned}$$
(13)

Afterwards, \(\varvec{e}_i^l\) and \(\varvec{h}_i\) are concatenated and returned as the fusion information:

$$\begin{aligned} \varvec{v}_i = \varvec{h}_i \oplus \varvec{e}_i^l. \end{aligned}$$
(14)

Here, we continue with the previous example to describe the details. As shown in the right part of Fig. 2, the triples assigned to the three matched words of character \(x_2\) are \(\left( \varvec{e}_{21}^x,\varvec{e}_{21}^p,\varvec{e}_{21}^w\right) ,\left( \varvec{e}_{22}^x,\varvec{e}_{22}^p,\varvec{e}_{22}^w\right) ,\left( \varvec{e}_{23}^x,\varvec{e}_{23}^p,\varvec{e}_{23}^w\right) \). In this step, the weights \(q_{21}, q_{22}, q_{23}\) are first calculated by Eq. 11. Then the word and character position embeddings are concatenated as \(\varvec{e}_{21}^f, \varvec{e}_{22}^f, \varvec{e}_{23}^f\). Based on the weights and the concatenated embeddings, the lexicon information \(\varvec{e}_2^l\) is calculated according to Eq. 13. Finally, the fusion information \(\varvec{v}_2\) is output by concatenating \(\varvec{e}_2^l\) with \(\varvec{h}_2\).
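The fusion step (Eqs. 11-14) can be sketched as follows. Since the paper does not spell out how the dimensions of \(\varvec{e}_{ij}^x\) and \(\varvec{e}_{ij}^w\) are matched for the dot product in Eq. 11, the sketch adds a projection of the character state into the word-embedding space, and it reads Eq. 12 as applying \(\varvec{W}_f\) to the word embedding before concatenation; both are our assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TripleFusion(nn.Module):
    """Fuse character, word and character position information (Eqs. 11-14)."""
    def __init__(self, char_dim: int, word_dim: int, pos_dim: int):
        super().__init__()
        # Projection so the dot product in Eq. 11 is well defined (our assumption).
        self.proj = nn.Linear(char_dim, word_dim, bias=False)
        # W_f in Eq. 12, read here as acting on the word embedding only (assumption).
        self.W_f = nn.Linear(word_dim, word_dim, bias=False)

    def forward(self, h_i: torch.Tensor, e_w: torch.Tensor,
                e_p: torch.Tensor) -> torch.Tensor:
        # h_i: (char_dim,), e_w: (m, word_dim), e_p: (m, pos_dim)
        q = F.softmax(e_w @ self.proj(h_i), dim=0)     # Eq. 11, weights over m words
        e_f = torch.cat([self.W_f(e_w), e_p], dim=-1)  # Eq. 12, (m, word_dim + pos_dim)
        e_l = q @ e_f                                  # Eq. 13, lexicon information
        return torch.cat([h_i, e_l], dim=-1)           # Eq. 14, fusion vector v_i
```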

3.4 CRF Layer

A standard CRF [20] layer is used on top of the BiLSTM layer and TFM. Given a predicted tag sequence \(Y=\left\{ y_1,y_2,\ldots ,y_n\right\} \in V_l\), where \(V_l\) denotes the label set, the probability of the predicted sequence is

$$\begin{aligned} P\left( Y \mid X \right) = \frac{exp \left( \sum _{i} \left( \varvec{W}_{CRF}^{y_i}\varvec{v}_i+\varvec{b}_{CRF}^{\left( y_{i-1}, y_i\right) }\right) \right) }{\sum _{Y^\prime }{exp \left( \sum _i \left( \varvec{W}_{CRF}^{y_i^\prime }\varvec{v}_i + \varvec{b}_{CRF}^{\left( y_{i-1}^\prime , y_i^\prime \right) } \right) \right) }}, \end{aligned}$$
(15)

where \(Y^\prime \) denotes an arbitrary tag sequence, and \(\varvec{W}_{CRF}^{y_i}\) and \(\varvec{b}_{CRF}^{\left( y_{i-1},\ y_i\right) }\) are trainable parameters. We use the Viterbi algorithm [11] to obtain the predicted tag sequence. Given a set of training data \(\left\{ \left( X_i,Y_i\right) \right\} |_{i=1}^N\), a log-likelihood loss function is used to train the model:

$$\begin{aligned} L=\sum _{i=1}^{N}{\log P\left( Y_i | X_i\right) }. \end{aligned}$$
(16)
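As a stand-in for this layer, the sketch below uses the third-party pytorch-crf package for the sequence log-likelihood of Eqs. 15-16 and Viterbi decoding; it is an illustration under that assumption, not the authors' implementation.

```python
import torch
import torch.nn as nn
from torchcrf import CRF   # third-party package: pip install pytorch-crf

class CRFLayer(nn.Module):
    """Score tags from fusion vectors and decode with a linear-chain CRF."""
    def __init__(self, fusion_dim: int, num_tags: int):
        super().__init__()
        self.emission = nn.Linear(fusion_dim, num_tags)  # per-tag scores (W_CRF, b_CRF)
        self.crf = CRF(num_tags, batch_first=True)

    def loss(self, v: torch.Tensor, tags: torch.Tensor,
             mask: torch.Tensor) -> torch.Tensor:
        # Negative log-likelihood of the gold tag sequences (Eqs. 15-16).
        return -self.crf(self.emission(v), tags, mask=mask, reduction="mean")

    def decode(self, v: torch.Tensor, mask: torch.Tensor):
        # Viterbi decoding of the most probable tag sequence.
        return self.crf.decode(self.emission(v), mask=mask)
```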

4 Experiments

We conduct a series of experiments on public Chinese NER datasets to study the effectiveness of our model. Standard precision (P), recall (R) and F1-score (F1) are used to evaluate the performance.
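Precision, recall and F1 are computed at the entity level over exact-match spans; a minimal sketch, assuming well-formed BMES tag sequences and hypothetical helper names, is shown below.

```python
from typing import List, Set, Tuple

def bmes_spans(tags: List[str]) -> Set[Tuple[int, int, str]]:
    """Extract (start, end, type) entity spans, assuming well-formed BMES tags."""
    spans, start = set(), None
    for i, tag in enumerate(tags):
        prefix, _, label = tag.partition("-")
        if prefix == "S":
            spans.add((i, i, label)); start = None
        elif prefix == "B":
            start = i
        elif prefix == "E" and start is not None:
            spans.add((start, i, label)); start = None
        elif prefix == "O":
            start = None
        # "M" tags simply continue the current entity
    return spans

def entity_prf1(gold: List[List[str]], pred: List[List[str]]):
    """Entity-level precision, recall and F1 over a corpus of tag sequences."""
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        gs, ps = bmes_spans(g), bmes_spans(p)
        tp += len(gs & ps); fp += len(ps - gs); fn += len(gs - ps)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```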

4.1 Experiment Setup

Preparation. We perform experiments on three public Chinese NER datasets: Resume [48], Weibo [34] and MSRA [21]. The three datasets come from different domains: Resume is collected from Sina Finance, Weibo from Sina Weibo and MSRA from newswire. Gold-standard segmentation is not available for these datasets. The lexicon we use is the one released by [48] and contains 704.4k words. The position lookup table is randomly initialized and trained. The BERT pre-trained model we use is bert-base-chinese, and its parameters are kept fixed. All tagging schemes are transformed to the BMES tagging style for consistency across experiments.
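The conversion to the BMES tagging style can be done as in the sketch below; the original datasets use different schemes, and this common BIO-to-BMES mapping is our illustration rather than the exact procedure used in the paper.

```python
from typing import List

def bio_to_bmes(tags: List[str]) -> List[str]:
    """Convert a BIO-tagged sequence to the BMES tagging style."""
    bmes = []
    for i, tag in enumerate(tags):
        if tag == "O":
            bmes.append("O")
            continue
        prefix, _, label = tag.partition("-")
        nxt = tags[i + 1] if i + 1 < len(tags) else "O"
        continues = nxt == "I-" + label          # does the entity continue?
        if prefix == "B":
            bmes.append(("B-" if continues else "S-") + label)
        else:  # prefix == "I": interior or final character of the entity
            bmes.append(("M-" if continues else "E-") + label)
    return bmes

# Example: ["B-ORG", "I-ORG", "I-ORG", "O"] -> ["B-ORG", "M-ORG", "E-ORG", "O"]
```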

Hyper-parameter settings. The main hyper-parameters used in our experiments are shown in Table 1. We keep some parameters the same as Lattice-LSTM, including the word embedding size, dropout and learning rate decay. Other parameters, such as the LSTM hidden size and learning rate, are adjusted to fit our model.

Table 1 Hyper-parameter settings

Models for comparison. Since our model focuses on integrating lexicon information, recent models that share the same goal are selected as baselines, i.e.,

  1.

    BERT-tagger [6] fine-tunes BERT for encoding and uses a classification layer for decoding.

  2.

    BERT+BiLSTM-CRF is based on BiLSTM-CRF and uses BERT as the encoder.

  3.

    CAN-NER [49] captures context information with a character-based CNN and a GRU.

  4.

    self-attention+BiLSTM-CRF [4] adds the self-attention mechanism to the BiLSTM-CRF model, which integrates character and word information.

  5.

    BERT+MFE [22] incorporates semantic, glyph and phonetic features into the character-based model.

  6.

    Multi-digraph model [8] uses graph neural networks to integrate lexicon information.

  7.

    LGN [15] utilizes a graph neural network to solve word ambiguities by integrating characters, potential words and sentence semantics.

  8.

    CGN [37] is proposed to integrate self-matched lexical words and nearest contextual lexical words.

  9.

    Lattice-LSTM [48] is a variant of LSTM which incorporates all potential words into a character-based model.

  10.

    LR-CNN [14] is a CNN-based method that incorporates lexicons using a rethinking mechanism.

  11.

    WC-LSTM [26] adds word information to the start or the end character of the word, aiming to solve some problems found in Lattice-LSTM. WC-LSTM provides four word encoding strategies; for each dataset, we pick the best-performing strategy for comparison.

  12.

    HiLSTM [12] is a hierarchical LSTM framework that considers not only words in the lexicon but also words and subwords in the sentences.

  13.

    BERT+SoftLexicon [28] follows the idea of Lattice-LSTM while avoiding its complex structure and improving model performance.

  14.

    SLK-NER [18] fuses different second-order lexicon knowledge (SLK) with the global attention information to alleviate the impact of word boundary conflicts.

  15.

    AM-BiLSTM [46] enhances character embeddings with a multi-word information feature, which preserves word information by matching against a lexicon.

4.2 Results on Benchmark Datasets

Table 2 Results on Resume
Table 3 Results on Weibo
Table 4 Results on MSRA

Experimental results on Resume, Weibo and MSRA are shown in Tables 2, 3 and 4, respectively. We divide the baselines into four groups: general Chinese NER models that do not integrate any lexicon information, models using other kinds of external knowledge, models utilizing lexicon information through graph neural networks, and state-of-the-art Chinese NER models that focus on integrating lexicon information.

As can be seen from the tables, our model outperforms all the baselines in the first group. This confirms that integrating lexicon information has a positive effect on the character-based model.

Baselines in the second group integrate other kinds of knowledge, including word segmentation information and character feature information. Our model performs better than these, since word segmentation inevitably introduces errors and character features alone cannot provide enough information for NER inference.

Comparing our model with the baselines in the third group, we find that the F1-scores of the graph-based models do not surpass ours. One reason is that these baselines do not exploit the full lexicon information, a shortcoming that TFM overcomes.

Our model also outperforms all the state-of-the-art baselines in the fourth group. We attribute the improvements to the proposed TFM, which integrates more lexicon information and fuses the multiple types of information in a more appropriate manner.

Overall, our model achieves the best F1 and P, which demonstrates its merits.

4.3 Efficiency Study

In this section, we evaluate the inference speed of our model, following the settings of [28]. We choose the general BERT+BiLSTM-CRF model and three models that integrate lexicon information for comparison.

Fig. 5

Relative inference speed on three datasets

Results are shown in Fig. 5. The inference speed of our model is higher than that of Lattice-LSTM and LR-CNN, and is more than twice that of Lattice-LSTM, which demonstrates the improvement in efficiency. Although TFM is inserted, the inference speed of our model remains very close to that of the BERT+BiLSTM-CRF model. Moreover, SoftLexicon focuses on integrating lexicon information while reducing running time; our model integrates more information, yet its inference speed does not drop significantly.

4.4 Influence of Different Sequence Modeling Layers

Although the general sequence labeling model BiLSTM-CRF is used as the backbone of our model, we still want to explore the influence of different sequence modeling layers. To this end, we replace BiLSTM with a CNN and a Transformer [43]. The F1-scores are shown in Table 5.

Table 5 F1-scores with different sequence modeling layers

From the table, we find that our model empirically works best with BiLSTM, and the F1-scores drop when other sequence modeling layers are applied. Since most existing models choose BiLSTM, we follow them to make the comparisons fairer.

4.5 Influence of Different Embedding Methods

We try three other embedding methods, i.e., fastText [1], word2vec [29] and ERNIE [38], to evaluate the influence of different embedding methods. For fastText, we use the pre-trained model provided by Facebook. For word2vec, we use the pre-trained embeddings released by [48]. For ERNIE, we use a PyTorch version. The results are shown in Table 6.

Table 6 F1-scores with different embedding methods

The F1-scores with fastText and word2vec are lower than those with BERT and ERNIE, especially on the Weibo dataset. One reason is that Weibo texts are written in an informal register and static embeddings cannot express the exact meanings of characters. The F1-scores with ERNIE are close to those with BERT, but BERT is more widely used and our model works better with it, so we choose BERT as our embedding layer.

4.6 Performance on Different Entities

We further analyze the F1-scores on different entity types to investigate whether the improvement of our model is general or entity-specific. We compare our model with Lattice-LSTM and BERT+BiLSTM-CRF. Experimental results are shown in Fig. 6.

Fig. 6

F1-scores of different entity types on three datasets

All three models perform well on Resume, and our model improves only on certain entity types, such as ORG, TITLE and PRO. On Weibo, the F1-scores of our model exceed those of Lattice-LSTM on all four entity types. Compared with BERT+BiLSTM-CRF, the F1-score of our model on the ORG entity drops; this drop, however, does not affect the overall improvement, since the F1-scores on the other entity types rise. Finally, our model achieves slight F1-score increases on three entity types in MSRA, which contributes to its overall advantage.

Table 7 An example from Weibo

4.7 Case Study

In this section, we analyze a sentence from the Weibo dataset meaning "Amazon's official WeChat package tracking service launched". The design of this experiment follows [26]. Experimental results are shown in Table 7. Even with the help of lexicon knowledge, Lattice-LSTM fails to recognize the organization entity "Amazon". In contrast, our model makes the correct prediction. Inspecting the lexicon used in the experiment, we find that almost every word ending with the same final character as the Chinese word for "Amazon" is an entity, e.g., the Chinese renderings of "Robinson", "Tao Xun" and "Nelson". In this situation, the position information of that character within a word plays an important role in recognizing the named entity. TFM fuses this character position information for the word "Amazon", which contributes to the correct inference of our model.

During the lexicon matching process, many words can be assigned to a character, and most of them are useless. Although all matched words are fused, we still hope that TFM pays more attention to the potentially correct words. We record the weights of the words for each character in TFM and obtain the heatmap shown in Fig. 7. As the heatmap shows, correct words such as "Amazon" and "official" are assigned greater weights. This proves that TFM can automatically diminish the disturbance of useless words.

Fig. 7

Heatmap of the word weights for the example sentence meaning "Amazon's official WeChat package tracking service launched". We truncate the sentence for brevity. The darker the color, the greater the weight

Table 8 An ablation study on our model

4.8 Ablation Study

To investigate the factors that affect the performance of our model, we conduct an ablation study on the three datasets; the results are shown in Table 8.

  1.

    In the “- TFM” experiment, we remove the proposed TFM, and the model reduces to the BERT+BiLSTM-CRF model. The performance drops in this case, which demonstrates the effectiveness of our proposed module.

  2.

    Through the “-position” and “-word” experiments, we study the influence of different information combinations. When only one kind of lexicon information is fused, the F1-scores drop. This shows that our model works best when the word and character position information are combined.

  3.

    Affected by the labeling paradigm of sequence labeling, models like SoftLexicon overlook the distinction between different middle positions and put them into one group. In the “- ‘M’ distinction” experiment, we likewise do not distinguish middle positions, and the performance of the model declines. This proves that distinguishing different middle positions during the fusion process has a positive effect on the performance of our model.

5 Conclusion

In this paper, we draw attention to the character position information available in a lexicon, which is ignored by other models, and propose TFM to integrate lexicon knowledge into the character-based model. TFM effectively handles the loss of word information in the character-based model and, through character position information, enhances the ability to understand the relationship between a word and its characters. Experiments on public datasets show that our model achieves strong performance in terms of both efficiency and F1-score.

Finally, we outline our future work. The proposed model works well when training data is sufficient, but it can overfit when training examples are extremely scarce, i.e., in few-shot settings [17]. We therefore plan to explore how to integrate external knowledge into few-shot NER models.