
1 Introduction

Named entity recognition (NER), also known as proper name recognition, is a fundamental task in natural language processing (NLP) with a wide range of applications. A named entity generally refers to an entity with a specific meaning or a strong reference in text, typically including person names, place names, organization names, and so on. NER is also the basis of many other NLP tasks, such as relation extraction, event extraction, knowledge graph construction, machine translation, and question answering.

NER has long been a research hotspot in NLP. Early NER systems were based on dictionaries and hand-crafted rules. Later, traditional machine learning methods, especially probabilistic graphical models such as the hidden Markov model (HMM) [1, 2], the maximum entropy Markov model (MEMM), and the conditional random field (CRF) [3, 4], became the focus of NER research. In recent years, NER based on deep learning, such as the long short-term memory (LSTM) network, has become popular [5]. However, each family of models is incomplete on its own: probabilistic graphical models do not learn the deep hidden information in sentences, while deep learning models do not consider the probabilistic relations between output tags. Some researchers found that combining the two approaches addresses these problems [8]. Huang Z, et al. used a bidirectional long short-term memory (BiLSTM) network with a CRF for NER and obtained very good results [6, 7]. Lample G, et al. then showed that stacking two LSTMs with a CRF captures deeper hidden information and achieves better accuracy [9], but their method did not consider the reverse meaning of texts. Zheng S, et al. used a BiLSTM network to deal with this problem and an LSTM network in place of the CRF to obtain deeper semantics [10], but they did not consider the probabilistic relationships.

To solve these problems, we propose an improved BiLSTM network that captures deeper semantics together with probabilistic relationships. We use two stacked forward LSTMs to process a sequence and return a sequence, and two stacked backward LSTMs to process the same sequence and return another sequence. We then concatenate the two sequences into one, so that the deep semantics and the relationships covering both the forward and the backward meaning are obtained together. Finally, we use a CRF to capture the probabilistic relations between tags.

In this paper, Sect. 1 introduces the background and the overall idea, Sect. 2 describes the important models in our algorithm, Sect. 3 presents our experimental results, and Sect. 4 gives the conclusion.

2 Mathematic Models

2.1 Improved BiLSTM

The Disadvantage of RNN: A recurrent neural network (RNN) is implemented by reusing a single cell structure, so the output at the current moment depends on the past.

The expression function of RNN is:

$$\begin{aligned} h_{t} = f(W \cdot x_{t} + U \cdot h_{t-1} + b), \end{aligned}$$
(1)
$$\begin{aligned} y_{t} = V \cdot h_{t}, \end{aligned}$$
(2)

where \(x_{t}\) and \(y_{t}\) are the input and output at time t, respectively, \(h_{t}\) is the memory information at time t, and f(z) is an activation function, usually the tanh function. W, U, b, and V are the parameters of the network.

Because the RNN reuses a single cell structure, we only need to train W, U, b, and V by iteration; no other parameters are required.
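For illustration, the following NumPy sketch implements the recurrence of Eqs. (1) and (2); the dimensions, the random initialization, and the choice of tanh are assumptions made for this example only, not settings used in our experiments.

```python
import numpy as np

# Illustrative dimensions; not the settings used in the experiments.
input_dim, hidden_dim, output_dim = 4, 8, 3
rng = np.random.default_rng(0)
W = rng.normal(size=(hidden_dim, input_dim))   # input-to-hidden weights
U = rng.normal(size=(hidden_dim, hidden_dim))  # hidden-to-hidden weights
V = rng.normal(size=(output_dim, hidden_dim))  # hidden-to-output weights
b = np.zeros(hidden_dim)

def rnn_step(x_t, h_prev):
    """Eq. (1): h_t = tanh(W x_t + U h_{t-1} + b); Eq. (2): y_t = V h_t."""
    h_t = np.tanh(W @ x_t + U @ h_prev + b)
    y_t = V @ h_t
    return h_t, y_t

# The same W, U, b, V are reused at every time step.
h = np.zeros(hidden_dim)
for x_t in rng.normal(size=(5, input_dim)):  # a toy sequence of length 5
    h, y = rnn_step(x_t, h)
```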

However, RNN has a disadvantage: it cannot remember information over long time spans because of gradient diffusion. The appearance of LSTM solved this problem.

The Advantage of LSTM: LSTM is a variant of RNN, which can effectively solve the gradient diffusion problem of simple RNN. It mainly improves the following two parts.

First, it adds a new internal state \(c_{t}\) and retains the original external state \(h_{t}\). Hence, gradient diffusion is suppressed by combining linearity and nonlinearity. Second, it controls the amount of information transmitted through three gates, ensuring that the linear transmission does not lead to too much information. Therefore, its formulas are written as:

$$\begin{aligned} c_{t} = f_{t} \odot c_{t-1} + i_{t} \odot \hat{c}_{t}, \end{aligned}$$
(3)
$$\begin{aligned} h_{t} = o_{t} \odot \tanh (c_{t}), \end{aligned}$$
(4)

where \(f_{t},i_{t},o_{t}\) are the forget gate, input gate, and output gate, respectively, and their values lie between 0 and 1. \(\odot \) denotes the element-wise product of the gate value with each element of the vector. \(\hat{c}_{t}\) is the candidate state obtained by a nonlinear function. The details are as follows:

$$\begin{aligned} \hat{c}_{t} = \tanh (W_{c} \cdot [ x_{t},h_{t-1} ]+b_{c}), \end{aligned}$$
(5)
$$\begin{aligned} i_{t} = \sigma (W_{i} \cdot [x_{t},h_{t-1}] + b_{i}), \end{aligned}$$
(6)
$$\begin{aligned} f_{t} = \sigma (W_{f} \cdot [x_{t},h_{t-1}] + b_{f}), \end{aligned}$$
(7)
$$\begin{aligned} o_{t} = \sigma (W_{o} \cdot [x_{t},h_{t-1}] + b_{o}), \end{aligned}$$
(8)

where \([x_{t},h_{t-1}]\) denotes the concatenation of \(x_{t}\) and \(h_{t-1}\), and \(\sigma \) is the sigmoid function:

$$\begin{aligned} \sigma (x) = \frac{1}{1+{\text {e}}^{-x}}. \end{aligned}$$
(9)

Because of these three gates and the internal state, LSTM can successfully remember past information and discard useless information.
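As a hedged illustration, one LSTM step of Eqs. (3)-(9) can be sketched in NumPy as follows; the weight shapes and random initialization are assumptions for this example only.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))  # Eq. (9)

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM step: gates from Eqs. (6)-(8), candidate from Eq. (5),
    state updates from Eqs. (3) and (4)."""
    z = np.concatenate([x_t, h_prev])           # [x_t, h_{t-1}]
    i = sigmoid(p["W_i"] @ z + p["b_i"])        # input gate, Eq. (6)
    f = sigmoid(p["W_f"] @ z + p["b_f"])        # forget gate, Eq. (7)
    o = sigmoid(p["W_o"] @ z + p["b_o"])        # output gate, Eq. (8)
    c_hat = np.tanh(p["W_c"] @ z + p["b_c"])    # candidate state, Eq. (5)
    c_t = f * c_prev + i * c_hat                # internal state, Eq. (3)
    h_t = o * np.tanh(c_t)                      # external state, Eq. (4)
    return h_t, c_t

input_dim, hidden_dim = 4, 8                    # illustrative sizes
rng = np.random.default_rng(0)
p = {k: rng.normal(size=(hidden_dim, input_dim + hidden_dim))
     for k in ("W_i", "W_f", "W_o", "W_c")}
p.update({k: np.zeros(hidden_dim) for k in ("b_i", "b_f", "b_o", "b_c")})

h, c = np.zeros(hidden_dim), np.zeros(hidden_dim)
h, c = lstm_step(rng.normal(size=input_dim), h, c, p)
```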

Although LSTM is the most popular network for time sequence problems, it still has a small shortcoming: it only learns the forward information and does not learn the backward information, which is a limitation in language models.

The Meaning of BiLSTM: As mentioned above, we sometimes need a network that learns the forward information and the backward information together, especially in text problems such as text classification, text translation, and NER.

BiLSTM is simple to understand: it consists of two LSTMs. The first one’s input runs from \(x_{1}\) to \(x_{T}\) and its external states are \(h_{1}^{1}\) to \(h_{T}^{1}\). The second one’s input runs from \(x_{T}\) to \(x_{1}\) and its external states are \(h_{1}^{2}\) to \(h_{T}^{2}\). The concatenation of the two LSTMs’ external states is then \(h_{t} = [h_{t}^{1},h_{t}^{2}]\).
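In Keras, such a standard BiLSTM can be sketched with the Bidirectional wrapper; the feature dimension and unit count below are illustrative assumptions, not the settings of our experiments.

```python
from keras.layers import Input, LSTM, Bidirectional
from keras.models import Model

inputs = Input(shape=(None, 200))  # (time steps, feature dimension), both illustrative
# Forward and backward states are concatenated per time step: h_t = [h_t^1, h_t^2].
outputs = Bidirectional(LSTM(128, return_sequences=True), merge_mode="concat")(inputs)
model = Model(inputs, outputs)
```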

Our Improved BiLSTM: In this paper, in order to capture deeper sentence information together with the forward and backward textual connections, we modify the BiLSTM structure.

We divide the model into two parts. The first part is a forward LSTM connected to another forward LSTM: the first forward LSTM’s input is the input data, and the second forward LSTM’s input is the sequence returned by the first. The second part is a backward LSTM connected to another backward LSTM: likewise, the first backward LSTM’s input is the input data, and the second backward LSTM’s input is the sequence returned by the first.

With this structure, we first obtain deeper meanings of the forward and backward text sequences. We then combine the two states and use the combined result as the input of the CRF.
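A minimal Keras sketch of this modified structure is given below, assuming 128-unit LSTM layers and a 200-dimensional input; the Lambda layer that restores the time order of the backward branch is an implementation detail of the sketch, not part of the formal description above.

```python
from keras.layers import Input, LSTM, Lambda, Concatenate
from keras.models import Model
import keras.backend as K

inputs = Input(shape=(None, 200))  # illustrative feature dimension

# First part: a forward LSTM feeding its returned sequence into a second forward LSTM.
fwd = LSTM(128, return_sequences=True)(inputs)
fwd = LSTM(128, return_sequences=True)(fwd)

# Second part: two backward LSTMs over the same input. With go_backwards=True the
# first layer returns its sequence in reversed time order, so the second LSTM simply
# continues along that reversed sequence.
bwd = LSTM(128, return_sequences=True, go_backwards=True)(inputs)
bwd = LSTM(128, return_sequences=True)(bwd)
bwd = Lambda(lambda t: K.reverse(t, axes=1))(bwd)  # restore original time order before merging

# Concatenate the two returned sequences into one sequence (later used as the CRF input).
merged = Concatenate()([fwd, bwd])
model = Model(inputs, merged)
```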

2.2 Conditional Random Field

The Model Definition: A CRF is a conditional probability distribution model of one set of output variables given another set of input variables. It is characterized by the assumption that the output random variables constitute a Markov random field.

The Parametric Form: If P(Y|X) is a linear-chain CRF and x is a given value of the random variable X, the conditional probability of a value y of the random variable Y is:

$$\begin{aligned} P(y|x) = \frac{1}{Z(x)} \exp \left( \sum _{i,k} \lambda _{k} t_{k}(y_{i-1},y_{i},x,i) + \sum _{i,l} \mu _{l} s_{l}(y_{i},x,i) \right) , \end{aligned}$$
(10)

where \(t_{k}\) and \(s_{l}\) are feature functions taking values 0 or 1, \(\lambda _{k}\) and \(\mu _{l}\) are the corresponding weights, and \(Z(x) = \sum _{y} \exp \bigg (\sum _{i,k} \lambda _{k} t_{k}(y_{i-1},y_{i},x,i) + \sum _{i,l} \mu _{l} s_{l}(y_{i},x,i) \bigg )\) is a normalization factor.

The parameters are learned by the BFGS method, which is not described here.
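As a hedged illustration, under the common simplification that the weighted feature functions collapse into a tag-transition matrix (for the \(t_{k}\)) and per-position emission scores (for the \(s_{l}\)), the unnormalized score inside the exponential of Eq. (10) can be sketched as follows; the tag count and random scores are assumptions for the example.

```python
import numpy as np

def crf_score(tags, emissions, transitions):
    """Unnormalized score of a tag sequence: per-position emission scores plus
    transition scores between consecutive tags. exp() of this value, divided by
    Z(x), gives P(y|x) as in Eq. (10)."""
    score = emissions[0, tags[0]]
    for i in range(1, len(tags)):
        score += transitions[tags[i - 1], tags[i]] + emissions[i, tags[i]]
    return score

rng = np.random.default_rng(0)
n_positions, n_tags = 5, 7          # e.g. B/I-PER, B/I-LOC, B/I-ORG, O
emissions = rng.normal(size=(n_positions, n_tags))
transitions = rng.normal(size=(n_tags, n_tags))
print(crf_score([0, 1, 6, 2, 3], emissions, transitions))
```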

The Prediction Algorithm: The CRF prediction problem is to find the output sequence \(y^{*}\) with the maximum conditional probability, given the CRF P(Y|X) and the input sequence x. This problem is usually solved by the Viterbi algorithm. The output sequence is written as:

$$\begin{aligned} y^{*} = \arg \max _{y} P_{w}(y|x) = \arg \max _{y} \frac{\exp (w\cdot F(y,x))}{Z_{w}(x)}, \end{aligned}$$
(11)

Since \(Z_{w}(x)\) is a constant and \(\exp (\cdot )\) is a monotonically increasing function, this reduces to

$$\begin{aligned} y^{*} = \arg \max _{y} (w \cdot F(y,x)) \end{aligned}$$
(12)

So the CRF prediction problem becomes the optimal path problem with the largest unnormalized probability:

$$\begin{aligned} \max _{y} (w \cdot F(y,x)), \end{aligned}$$
(13)

where the path means the tag sequence and

$$\begin{aligned} w = (w_{1},w_{2},...,w_{K})^\mathrm{{T}}, \end{aligned}$$
(14)
$$\begin{aligned} F(y,x) = (f_{1}(y,x),f_{2}(y,x),...,f_{K}(y,x))^\mathrm{{T}}, \end{aligned}$$
(15)
$$\begin{aligned} f_{k}(y,x) = \sum _{i=1}^{n} f_{k}(y_{i-1},y_{i},x,i), \quad k=1,2,...,K. \end{aligned}$$
(16)

To find the optimal path, (13) can be rewritten as:

$$\begin{aligned} \max _{y} \sum _{i=1}^{n} w \cdot F_{i}(y_{i-1},y_{i},x), \end{aligned}$$
(17)

where \(F_{i}(y_{i-1},y_{i},x) = (f_{1}(y_{i-1},y_{i},x,i),f_{2}(y_{i-1},y_{i},x,i),...,f_{K}(y_{i-1},y_{i},x,i))^\mathrm{{T}}\) is the local feature vector.

Then the Viterbi algorithm is applied. First, the unnormalized probability of each tag \(j=1,2,...,m\) at position 1 is:

$$\begin{aligned} \delta _{1}(j) = w \cdot F_{1}(y_{0}={\text {start}},y_{1}=j,x), \quad j=1,2,...,m. \end{aligned}$$
(18)

In general, a recursive formula is used to find the maximum unnormalized probability of each tag \(l=1,2,...,m\) at position i:

$$\begin{aligned} \delta _{i}(l) = \max _{1 \le j \le m} \{\delta _{i-1}(j)+w \cdot F_{i}(y_{i-1}=j,y_{i}=l,x)\}, \quad l=1,2,...,m, \end{aligned}$$
(19)
$$\begin{aligned} \Phi _{i}(l) = \arg \max _{1 \le j \le m} \{\delta _{i-1}(j)+w \cdot F_{i}(y_{i-1}=j,y_{i}=l,x)\}, \quad l=1,2,...,m. \end{aligned}$$
(20)

When \(i=n\), the maximum value of the unnormalized probability is:

$$\begin{aligned} \max _{y} (w \cdot F(y,x)) = \max _{1 \le j \le m} \delta _{n}(j), \end{aligned}$$
(21)

and the optimal terminal tag is:

$$\begin{aligned} y_{n}^{*} = \arg \max _{1 \le j \le m} \delta _{n}(j) \end{aligned}$$
(22)

Backtracking from the end of this optimal path:

$$\begin{aligned} y_{i}^{*} = \Phi _{i+1}(y_{i+1}^{*}),\quad i=n-1,n-2,...,1, \end{aligned}$$
(23)

Finally, the optimal path is \(y^{*}=(y_{1}^{*},y_{2}^{*},...,y_{n}^{*})^\mathrm{{T}}\).
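The recursion of Eqs. (18)-(23) can be sketched in NumPy as follows, again using the simplified transition/emission form of the unnormalized score introduced after Eq. (10); the random scores are assumptions for the example.

```python
import numpy as np

def viterbi_decode(emissions, transitions):
    """Return the tag sequence that maximizes the unnormalized score, Eq. (21)."""
    n, m = emissions.shape
    delta = np.zeros((n, m))            # delta[i, l]: best score ending at position i with tag l
    psi = np.zeros((n, m), dtype=int)   # psi[i, l]: best previous tag, Eq. (20)
    delta[0] = emissions[0]             # initialization, Eq. (18)
    for i in range(1, n):
        # scores[j, l] = delta[i-1, j] + transition(j -> l) + emission(i, l)
        scores = delta[i - 1][:, None] + transitions + emissions[i][None, :]
        psi[i] = scores.argmax(axis=0)  # Eq. (20)
        delta[i] = scores.max(axis=0)   # Eq. (19)
    best = [int(delta[-1].argmax())]    # optimal terminal tag, Eq. (22)
    for i in range(n - 1, 0, -1):       # backtracking, Eq. (23)
        best.append(int(psi[i, best[-1]]))
    return best[::-1]

rng = np.random.default_rng(0)
print(viterbi_decode(rng.normal(size=(6, 7)), rng.normal(size=(7, 7))))
```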

2.3 Our Model Structure

This part explains what we do and how the improved BiLSTM and the CRF are combined.

Figure 1 is the whole structure of our network.

Fig. 1. Whole structure of our model

Firstly, because all the words in the corpus are in one-hot encoded form and the dimension is very high, we use an embedding layer to find a low-dimensional representation of the words and reduce the training time. Moreover, the low-dimensional dense representation has better representational ability than the high-dimensional sparse one.

Secondly, we feed the embedded word sequence into our improved BiLSTM to learn the forward and backward deep semantic meaning. A merge layer then concatenates the external state of the two-layer forward LSTM and the external state of the two-layer backward LSTM.

Thirdly, the combined external state is used as the input of the CRF layer, which learns its own parameters.

Finally, the output sequence of the CRF layer gives the named entities.
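A minimal Keras sketch of the whole pipeline in Fig. 1 is shown below; the vocabulary size, sequence length, tag count, and the CRF layer from the keras_contrib package are assumptions made for illustration, not a statement of our exact implementation.

```python
from keras.layers import Input, Embedding, LSTM, Lambda, Concatenate
from keras.models import Model
from keras_contrib.layers import CRF   # assumed CRF implementation
import keras.backend as K

vocab_size, max_len, n_tags = 5000, 100, 7   # illustrative sizes

inputs = Input(shape=(max_len,))
# Embedding layer: map high-dimensional one-hot indices to a dense low-dimensional representation.
emb = Embedding(vocab_size, 200)(inputs)

# Improved BiLSTM: two stacked forward LSTMs and two stacked backward LSTMs.
fwd = LSTM(128, return_sequences=True)(emb)
fwd = LSTM(128, return_sequences=True)(fwd)
bwd = LSTM(128, return_sequences=True, go_backwards=True)(emb)
bwd = LSTM(128, return_sequences=True)(bwd)
bwd = Lambda(lambda t: K.reverse(t, axes=1))(bwd)  # restore original time order

# Merge layer: concatenate the forward and backward external states.
merged = Concatenate()([fwd, bwd])

# CRF layer: learns transition parameters and outputs the tag sequence.
outputs = CRF(n_tags)(merged)
model = Model(inputs, outputs)
```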

3 Experiment and Results

3.1 The Form of Experiment Data

To evaluate our algorithm, we conduct an experiment on real data.

We downloaded an open-source Chinese NER dataset from the Internet. The data format is as follows: sentences are separated by “\(\backslash \)n\(\backslash \)n”, each pair consisting of a Chinese character and its NER label is separated by “\(\backslash \)n”, and the two elements of a pair (the Chinese character and the NER label) are separated by “\(\backslash \)t”. The labels are “B-PER”, “I-PER”, “B-LOC”, “I-LOC”, “B-ORG”, “I-ORG”, and “O”.
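A minimal parsing sketch for this format is given below; the file name is a placeholder.

```python
def load_ner_data(path):
    """Parse sentences separated by blank lines, one tab-separated character/label pair per line."""
    sentences, label_seqs = [], []
    with open(path, encoding="utf-8") as f:
        for block in f.read().strip().split("\n\n"):   # one block per sentence
            chars, labels = [], []
            for line in block.split("\n"):             # one character/label pair per line
                char, label = line.split("\t")
                chars.append(char)
                labels.append(label)
            sentences.append(chars)
            label_seqs.append(labels)
    return sentences, label_seqs

sentences, labels = load_ner_data("train.txt")  # hypothetical file name
```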

3.2 Data Preprocessing and Model Compile

Obviously, this data cannot be used directly; some preprocessing is required.

First of all, we combine the characters in each sentence into an input sequence and the NER labels in each sentence into an output sequence. However, the characters cannot be fed into a mathematical model directly, so we count the characters that appear and convert each character into its one-hot encoded form.

Moreover, the LSTM requires a fixed input dimension, which means that the input sequences should have the same length. We therefore find the longest sequence and use zero-padding to make the shorter sequences the same length as the longest one.
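A hedged sketch of these preprocessing steps is given below, continuing from the parsing sketch in Sect. 3.1; mapping characters to integer indices (which an embedding layer consumes in place of explicit one-hot vectors) and padding label sequences with the “O” tag are assumptions for the example.

```python
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical

# Index 0 is reserved for padding; character indices start at 1.
char2idx = {c: i + 1 for i, c in enumerate(sorted({c for s in sentences for c in s}))}
tag2idx = {t: i for i, t in enumerate(sorted({t for seq in labels for t in seq}))}

max_len = max(len(s) for s in sentences)   # length of the longest sequence
X = pad_sequences([[char2idx[c] for c in s] for s in sentences],
                  maxlen=max_len, padding="post", value=0)
y = pad_sequences([[tag2idx[t] for t in seq] for seq in labels],
                  maxlen=max_len, padding="post", value=tag2idx["O"])
y = to_categorical(y, num_classes=len(tag2idx))   # one-hot labels for the output layer
```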

Finally, we use Keras to set up our model, and Fig. 2 shows a simple example of our model.

Fig. 2. Simple example of our experiment

In our experiment, we use 50,658 training samples and 4,631 test samples. The embedding dimension is 200, and each LSTM layer has 128 units. There are also batch normalization and dropout layers. We use “categorical cross entropy” as the training loss and metric, and “adam” as the optimizer.
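Continuing the sketches above, compiling and training the model could look as follows; the paper uses categorical cross entropy and “adam”, while the keras_contrib CRF layer assumed in Sect. 2.3 is normally paired with its own CRF loss and Viterbi accuracy, so those are shown here as assumptions. The batch size and epoch count are illustrative.

```python
from keras_contrib.losses import crf_loss                 # assumed loss for the keras_contrib CRF layer
from keras_contrib.metrics import crf_viterbi_accuracy    # matches the "CRF Viterbi accuracy" metric reported below

model.compile(optimizer="adam", loss=crf_loss, metrics=[crf_viterbi_accuracy])
model.fit(X, y, batch_size=64, epochs=10, validation_split=0.1)  # illustrative training settings
```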

3.3 Experiment Result

After training our model, we first use a new sentence to check whether it can successfully recognize named entities. The predicted “B-PER” + “I-PER”, “B-LOC” + “I-LOC”, and “B-ORG” + “I-ORG” entities all match the entities in the sentence, so the result is completely correct.

Table 1. Comparison result of different algorithms.

Then, we compared our algorithm with traditional algorithms on all test data, and the results are shown in Table 1.

From Table 1, we can see that our model achieves higher accuracy than the other models.

However, because our model has a deeper structure than the traditional models, its training time is longer. Therefore, we add batch normalization and dropout layers to reduce the training time, and we also reduce the number of units in the LSTM layers appropriately. The comparison of training times is shown in Table 2.

Table 2. Comparison of training time

Table 2 shows that our model with batch normalization and dropout layers achieves a better CRF Viterbi accuracy within the same training time as the traditional models.

4 Conclusion

In order to solve the problem that the traditional BiLSTM cannot obtain deeper latent semantics, we modify the BiLSTM model and successfully combine it with a probabilistic graphical model (a CRF layer) to achieve better prediction accuracy on the NER problem. The experiments show that our model is effective for Chinese NER and achieves the highest CRF Viterbi accuracy.