1 Introduction

Named entity recognition (NER) is a fundamental Natural Language Processing (NLP) task. An NER system labels each word in a sentence with a predefined type, such as Person (PER), Location (LOC), or Organization (ORG). The results of NER can be used in many downstream NLP tasks, such as question answering [27] and relation extraction [1]. Recently, neural network methods [4, 10] have been used to build NER systems. These methods require large amounts of annotated data, which is usually scarce.

To improve the performance of NER systems in low-resource settings, multi-domain and multi-task methods are often used [2, 7, 18, 26]. Multi-domain learning tries to transfer information from a source domain to a target domain [7], while multi-task learning tries to transfer information from a source task to a target task [2].

There exist some challenges in previous works. First, most previous models are only evaluated on English. Can these models work well for Chinese NER? For example, English multi-task learning combines part-of-speech (POS) tagging with named entity recognition [4], whereas Chinese multi-task learning combines Chinese Word Segmentation (CWS) with named entity recognition [17]. Second, previous works often treat multi-domain and multi-task methods separately. For example, Cao et al. only consider multi-task learning [2]. Can multi-task models be directly used for multi-domain learning, and vice versa?

In this manuscript, we conduct an empirical study on Chinese NER. First, we summarize previous multi-domain and multi-task models according to their model architectures; only neural network methods are considered. Second, we assume that multi-domain and multi-task learning methods are independent of language. We use the Chinese social media domain as the target domain and the Chinese news domain as the source domain; the Chinese NER task is the target task and CWS is the source task. These domains and tasks are similar, so information can be transferred between them. Third, we assume that the methods used in multi-domain and multi-task learning are similar: both come from transfer learning, so a method used in multi-domain learning can be directly used in multi-task learning, and vice versa. In other words, the model architecture does not need to change when the model is applied to multi-domain or multi-task learning; only the data changes. Three types of universal model architectures are demonstrated: SHA (shared model), FEAT (feature-based model) and ADV (adversarial network model). Experiments show that the universal models are useful for Chinese NER and outperform the baseline model.

Specifically, we make the following contributions:

  • We summarize the previous multi-domain and multi-task models in NER.

  • We explore the performance of the multi-domain and multi-task methods on the Chinese NER task.

  • We demonstrate three types of universal model architectures in multi-domain and multi-task learning.

2 Overview

2.1 Previous Summaries of NER

The existing surveys mainly focus on summarizing the methods used in named entity recognition, including supervised, semi-supervised, and unsupervised methods [14, 20]. Yadav et al. reviewed recent advances in NER based on deep learning models [25]. Transfer learning surveys mainly focus on general methods for multi-domain and multi-task learning [15]. Tan et al. presented a survey of deep transfer learning [21]. Compared with these previous summaries, we focus on multi-domain and multi-task learning for Chinese NER.

2.2 Domains and Tasks in Multi-domain and Multi-task NER

Previous works show that multi-domain and multi-task learning improve the performance of English NER [4, 12]. In multi-domain and multi-task learning, the target domains and tasks are often similar to the source domains and tasks. In English, the target domain is often the Twitter domain and the source domain is the news domain; the target task is NER and the source task is chunking or POS tagging. In this manuscript, the source domain is the Chinese news domain and the target domain is the Chinese weibo domain; the source task is CWS and the target task is NER. An example is shown in Fig. 1. We assume that Chinese weibo NER is similar to Chinese news NER: they are the same task applied to data from different domains, and some tokens and labels are shared. For example, the location word in Fig. 1 is labeled as "LOC" in both the news and weibo domains. We also assume that the CWS task is similar to NER: both are sequence labeling tasks. CWS tries to find word boundaries, while NER tries to find both word boundaries and entity types. For example, the location word in Fig. 1 is segmented as an independent word by CWS and additionally labeled as "LOC" by NER.

Fig. 1. The first block is the Chinese-English translation pair for understanding. The second block is from weibo NER. The third block is from weibo CWS. The fourth block is from news NER.

2.3 Methods in Multi-domain and Multi-task NER

A list of neural multi-domain and multi-task learning models is shown in Table 1. The models are divided into four types: SHA, FEAT, ADV and BV (variant of the base model). The SHA model is prevalent in previous works [4, 8, 18, 26]. Multi-task learning for English named entity recognition was first proposed by Collobert et al. using a neural network model [4]. Lee et al. trained a model on the source data and retrained it on the target data [8]. Yang et al. explored the transferring module in multi-domain and multi-task learning separately [26]. Peng and Dredze combined multi-domain and multi-task learning using domain projection and task-specific Conditional Random Fields (CRFs) [18]. The FEAT model was first proposed by [17] for multi-task NER. Cao et al. used an adversarial network to integrate task-shared word boundary information into Chinese NER [2]. The BV models are variants of the base model and cannot be directly reused across multi-domain and multi-task learning.

Table 1. A summary of multi-domain and multi-task learning models.

3 Model

3.1 Modules

All the models are composed of several basic modules, which we discuss first. Four types of modules are considered: Character embedding, Bi-LSTM, CRF and Classifier.

Character Embedding. Character embedding is the first step of neural network models for Chinese NER and is analogous to word embedding in English: each character is mapped to a low-dimensional vector in the character embedding layer. Pre-trained character embeddings are often used to utilize information from large unannotated data. For a character sequence \( X = \{ x_1, x_2,...,x_n \} \), we obtain \( x =\{e_{x_1} ,e_{x_2},...,e_{x_n} \} \) by looking up the pre-trained character embeddings.
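As a concrete illustration, the lookup can be implemented with an embedding layer initialized from pre-trained vectors. This is a minimal sketch assuming a PyTorch implementation (the paper does not specify a framework); the matrix size and character indices below are placeholders, not the actual embeddings used in the experiments.

```python
import torch
import torch.nn as nn

# Placeholder pre-trained matrix: one 100-dimensional row per character in the vocabulary.
pretrained = torch.randn(5000, 100)
embedding = nn.Embedding.from_pretrained(pretrained, freeze=False)

# A sentence as character indices x_1, ..., x_n (indices are illustrative).
char_ids = torch.tensor([[12, 7, 309, 4]])
x = embedding(char_ids)   # e_{x_1}, ..., e_{x_n}, shape (1, 4, 100)
```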

Bi-LSTM. The Bi-LSTM is used to extract features from the sentence. It concatenates the forward LSTM output and the backward LSTM output as the final output, so it can capture information about a character from both its left and right context [6]. A common LSTM unit is composed of a cell, an input gate, an output gate and a forget gate. The LSTM is implemented as follows:

$$\begin{aligned} i_t&= \sigma (W_ih_{t-1} + U_ie_{x_t} + b_i) \end{aligned}$$
(1)
$$\begin{aligned} f_t&= \sigma (W_fh_{t-1} + U_fe_{x_t} + b_f) \end{aligned}$$
(2)
$$\begin{aligned} \tilde{c_t}&= tanh(W_ch_{t-1} + U_ce_{x_t} + b_c) \end{aligned}$$
(3)
$$\begin{aligned} c_t&= f_t\odot c_{t-1} + i_t\odot \tilde{c_t} \end{aligned}$$
(4)
$$\begin{aligned} o_t&= \sigma (W_oh_{t-1}+U_oe_{x_t} + b_o) \end{aligned}$$
(5)
$$\begin{aligned} h_t&= o_t \odot tanh(c_t) \end{aligned}$$
(6)

where \( e_{x_t} \) is the input vector at time t, \( h_t \) is the output of LSTM model, \( \sigma \) is the element-wise sigmoid function, and \( \odot \) is the element-wise product.
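Under the same PyTorch assumption as above, this module reduces to a bidirectional LSTM whose output at each position concatenates the forward and backward hidden states; the sketch below is illustrative only.

```python
import torch
import torch.nn as nn

# Bidirectional LSTM over the character embeddings; hidden size 100 per direction,
# so each h_t has dimension 200 after concatenation.
bilstm = nn.LSTM(input_size=100, hidden_size=100, batch_first=True, bidirectional=True)

x = torch.randn(1, 4, 100)   # stand-in for the embeddings e_{x_1}, ..., e_{x_n}
h, _ = bilstm(x)             # h: (1, 4, 200), forward and backward outputs concatenated
```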

CRF. The CRF is used to predict the label sequence \( y = \{y_1,y_2,...,y_n \} \). The CRF uses the features extracted by the Bi-LSTM and considers the neighborhood information in a sequence to make predictions. We define the score s of the sentence, where X is the input sequence and y is the output NER tag sequence:

$$\begin{aligned} s(X,y) = \sum _{i=0}^{n}A_{y_i,y_{i+1}}+\sum _{i=1}^{n}P_{h_i,y_i} \end{aligned}$$
(7)

where \( A_{y_i, y_{i+1}} \) describes the cost of transferring from tag \( y_i \) to tag \( y_{i+1} \), and \( P_{h_i,y_i} \) represents the score of predicting tag \( y_i \) from \( h_i \). The probability of a tag sequence can be represented as:

$$\begin{aligned} P(y|X)=\frac{e^{s(X,y)}}{\sum _{\tilde{y}\in {Y_{all}}}e^{s(X,\tilde{y})}} \end{aligned}$$
(8)

where \( \tilde{y} \) is a possible NER tag sequence and \( Y_{all} \) is the set of all possible tag sequences. During training, we maximize the log-probability of the correct sequence:

$$\begin{aligned} logP(y|X) = s(X,y)-log(\sum _{\tilde{y}}e^{s(X,\tilde{y})}) \end{aligned}$$
(9)

When the model is tested, we can obtain the best NER tag sequence \( y^{*} \) by:

$$\begin{aligned} y^{*} = \mathop {{\text {argmax}}}_{ \tilde{y}\in {Y_{all}} } s(X,\tilde{y}) \end{aligned}$$
(10)
Fig. 2. Four types of architectures in multi-domain and multi-task learning. The blue block is the source part, the black block is the target part, and the red block is the shared part. (Color figure online)

Classifier. For the multi-domain models, the classifier discriminates whether a sentence comes from the news domain or the weibo domain. For the multi-task models, the classifier discriminates whether a sentence comes from the NER task or the CWS task. The classifier consists of max pooling and softmax:

$$\begin{aligned} h&= Maxpooling(H) \end{aligned}$$
(11)
$$\begin{aligned} D(h,\theta _d)&= softmax(W_dh + b_d) \end{aligned}$$
(12)

where H is the feature representation of the sentence and \( \theta _d \) denotes the parameters of the softmax layer, including \( W_d \) and \( b_d \).
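A minimal sketch of the classifier under the same PyTorch assumption: max pooling over the sequence dimension followed by a linear layer and softmax, matching Eqs. (11) and (12). The class and parameter names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DomainTaskClassifier(nn.Module):
    """Discriminates the source of a sentence from its Bi-LSTM features H."""
    def __init__(self, feat_dim=200, num_classes=2):
        super().__init__()
        self.linear = nn.Linear(feat_dim, num_classes)   # W_d and b_d of Eq. (12)

    def forward(self, H):                  # H: (batch, seq_len, feat_dim)
        h, _ = torch.max(H, dim=1)         # Eq. (11): max pooling over the sequence
        return F.softmax(self.linear(h), dim=-1)   # Eq. (12)
```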

3.2 Base Model

We use the LSTM-CRF model as the base model, which is widely used for single-domain, single-task NER [5, 10]. The architecture is shown in Fig. 2(a). The model contains three parts: the character embedding layer utilizes information from large unannotated data, the Bi-LSTM extracts sentence-level features, and the features are fed into the CRF to predict the label sequence. Compared with the multi-domain and multi-task models, the LSTM-CRF base model handles a single task and domain and only uses the weibo NER dataset as input.
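The base model is the straightforward composition of the modules above. The sketch below (illustrative names, same PyTorch assumption) produces the per-character emission scores; the CRF scoring and decoding of Eqs. (7)-(10) are applied on top of the returned emissions.

```python
import torch.nn as nn

class LSTMEncoder(nn.Module):
    """Base model of Fig. 2(a): character embedding -> Bi-LSTM -> emission scores.
    A CRF layer (Eqs. (7)-(10)) consumes the returned emissions."""
    def __init__(self, pretrained, num_tags):
        super().__init__()
        self.embedding = nn.Embedding.from_pretrained(pretrained, freeze=False)
        self.bilstm = nn.LSTM(pretrained.size(1), 100,
                              batch_first=True, bidirectional=True)
        self.emit = nn.Linear(200, num_tags)   # per-character scores P_{h_i, y_i}

    def forward(self, char_ids):
        h, _ = self.bilstm(self.embedding(char_ids))
        return self.emit(h)                     # (batch, seq_len, num_tags)
```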

3.3 Multi-domain and Multi-task Model

Multi-domain and multi-task methods have been shown to improve single-task, single-domain NER performance. However, most previous models focus only on English NER. Moreover, most models are used in only one specific setting: either multi-domain learning or multi-task learning. In this manuscript, we conduct an empirical study on three types of Chinese named entity recognition models: SHA, FEAT and ADV. Because of the specificity of the BV models, we leave their experiments for future work. The multi-domain and multi-task models take two types of data as input: the multi-domain model uses the weibo NER dataset and the news NER dataset, while the multi-task model uses the weibo NER dataset and the weibo CWS dataset.

SHA. The SHA model shares the feature extractor between different tasks or domains [4, 8, 18, 26]. A feature extractor trained on the source data can contain useful information for the target task or domain. The architecture of the SHA model is shown in Fig. 2(b): the character embedding layer and the Bi-LSTM layer are shared, while different domains or tasks have their own CRFs. Depending on the training method, the SHA model can be divided into three sub-models.

SHA-INIT. Training proceeds in two steps. First, the model is trained on the source data until convergence. Second, the target data is fed to the model and training continues until convergence. The two steps use different CRFs, while the parameters of the character embedding layer and the Bi-LSTM layer are updated in both steps.

SHA-CRF. The training method is the same as SHA-INIT except for the second step: the parameters of the character embedding and Bi-LSTM layers are frozen and only the parameters of the CRF layer are updated.

SHA-MUL. The model is trained on the source data and target data simultaneously. Within one epoch, source and target batches are fed to the model alternately. A hyper-parameter \( \alpha \) can be used to control the ratio between source data and target data.
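The three SHA training strategies can be sketched as follows. This assumes a model object exposing a loss(batch, head=...) method with "source" and "target" CRF heads, a shared_parameters() helper, and standard data loaders; all of these names are illustrative rather than taken from the paper.

```python
def train_sha_init(model, source_loader, target_loader, optimizer, freeze_shared=False):
    """SHA-INIT / SHA-CRF sketch: train on source data first, then continue on target data.
    With freeze_shared=True (SHA-CRF), the embedding and Bi-LSTM are frozen in step two."""
    for batch in source_loader:                        # step 1: source data, source CRF
        optimizer.zero_grad()
        model.loss(batch, head="source").backward()
        optimizer.step()
    if freeze_shared:
        for p in model.shared_parameters():            # embedding + Bi-LSTM (assumed helper)
            p.requires_grad = False
    for batch in target_loader:                        # step 2: target data, target CRF
        optimizer.zero_grad()
        model.loss(batch, head="target").backward()
        optimizer.step()

def train_sha_mul(model, source_loader, target_loader, optimizer):
    """SHA-MUL sketch: alternate source and target batches within one epoch."""
    for src_batch, tgt_batch in zip(source_loader, target_loader):
        for batch, head in ((src_batch, "source"), (tgt_batch, "target")):
            optimizer.zero_grad()
            model.loss(batch, head=head).backward()
            optimizer.step()
```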

FEAT. The FEAT model assumes that the features extracted by the source part can serve as auxiliary information for the target part [17]. The architecture is shown in Fig. 2(c): the target part uses the intermediate results of the source part, and the Bi-LSTM output of the target part is concatenated with the output of the source part before being fed into the CRF layer. Three different training methods lead to three sub-models.

FEAT-INIT. A base model is first trained on the source data. Then, the source part of the FEAT model is initialized with the pre-trained parameters. Finally, the target data is used to train the model, and all parameters are updated.

FEAT-CRF. The training method is the same as FEAT-INIT except for the final update step: the source part is initialized with the pre-trained parameters and then frozen, so only the target part parameters are updated.

FEAT-MUL. The source part and the target part are trained alternately. When the source data is used as input, the parameters of the source part are updated; when the target data is used as input, the parameters of both the source part and the target part are updated.
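The core of the FEAT architecture, the concatenation of target and source Bi-LSTM outputs before the target CRF, can be sketched as below (same PyTorch assumption, illustrative names; the CRF of Sect. 3.1 consumes the returned emissions).

```python
import torch
import torch.nn as nn

class FeatEncoder(nn.Module):
    """FEAT sketch (Fig. 2(c)): target emissions are computed from the target Bi-LSTM
    output concatenated with the source Bi-LSTM output."""
    def __init__(self, pretrained, num_tags):
        super().__init__()
        self.embedding = nn.Embedding.from_pretrained(pretrained, freeze=False)
        dim = pretrained.size(1)
        self.source_lstm = nn.LSTM(dim, 100, batch_first=True, bidirectional=True)
        self.target_lstm = nn.LSTM(dim, 100, batch_first=True, bidirectional=True)
        self.target_emit = nn.Linear(400, num_tags)   # 200 target + 200 source features

    def forward(self, char_ids):
        x = self.embedding(char_ids)
        h_src, _ = self.source_lstm(x)   # intermediate result of the source part
        h_tgt, _ = self.target_lstm(x)
        return self.target_emit(torch.cat([h_tgt, h_src], dim=-1))
```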

ADV. The ADV model uses a private feature extractor to extract private information and a shared feature extractor to extract shared information [2]. The architecture is shown in Fig. 2(d). The model uses private character embeddings, Bi-LSTMs and CRFs to capture the information that differs between the source and target domains, and a shared Bi-LSTM to capture the information common to both. The character embedding captures word-level representations, the Bi-LSTM extracts sentence-level feature representations of the words, and the classifier ensures that task-specific features do not exist in the shared space. The source data and target data are fed to the model alternately.
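The adversarial part can be implemented in several ways. One common choice, assumed here rather than taken from [2], is a gradient reversal layer in front of the classifier, so that minimizing the classifier loss trains the shared Bi-LSTM to remove domain- or task-specific information from the shared space.

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass, negated (scaled) gradient in the backward pass."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

def adversarial_logits(shared_features, classifier, lam=1.0):
    # The classifier of Sect. 3.1 receives reversed gradients, so training it pushes
    # domain/task-specific information out of the shared features.
    return classifier(GradReverse.apply(shared_features, lam))
```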

BV. The BV models are those for which no universal model exists across multi-domain and multi-task learning [3, 7, 12, 22]. For example, Wang et al. require the source and target domains to have the same label set [22], but different tasks have different label sets. Lin et al. used a domain adaptation layer to reduce the disparity between different pre-trained character embeddings [12], but pre-trained embeddings are the same across different tasks.

4 Experiments and Results

4.1 Datasets

The Chinese weibo NER corpus is from [16]. The Chinese news NER corpus is from Sighan NER [11]. The Chinese weibo word segmentation corpus is from [19]. The number of sentences in each corpus is shown in Table 2.

Table 2. The details of the corpora.

4.2 Parameters Setting

The character embeddings are initialized with pre-trained embeddings. The news embeddings are pre-trained on Chinese Wikipedia data using word2vec [13], and the weibo embeddings are pre-trained on Chinese social media data using word2vec. The embedding dimension is 100, and the LSTM dimension in both the source and target parts is 100. We use Adam [9] for optimization.
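For reference, these settings reduce to a small configuration; the dictionary below is just an illustrative way to collect the values listed above.

```python
config = {
    "char_embedding_dim": 100,   # word2vec vectors (Wikipedia for news, social media for weibo)
    "lstm_hidden_dim": 100,      # both source and target Bi-LSTMs
    "optimizer": "adam",
}
```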

4.3 Results

The overall results on Chinese weibo NER are shown in Table 3. The results show that multi-domain and multi-task learning can be used for Chinese named entity recognition. The FEAT and ADV models consistently outperform the baseline model. The SHA-CRF model obtains the worst F1 score, far below the baseline; the reason may be that it is hard for the CRF alone to process features extracted by a source-trained feature extractor. Meanwhile, the experiments show that multi-domain and multi-task learning can use the universal models: the same model architecture can be used directly in both settings.

Table 3. The overview results of multi-domain and multi-task learning in Chinese weibo NER. P represents precision, R represents recall, and F represents F1 score.

In Table 4, we survey the performance of previous works on Chinese weibo NER. Compared with those results, the three types of models achieve competitive performance.

Table 4. The performance of previous Chinese weibo NER models.

To show the generalization of the models on Chinese NER, the OntoNotes NER dataset is also tested [23]. The broadcast news domain is used as the source domain and the web text domain as the target domain; the Chinese weibo word segmentation task is the source task and Chinese OntoNotes web text NER is the target task. The broadcast news data contains 10083 sentences and the web text data contains 8405 sentences. In Table 5, the results show that the universal models can be used on different datasets.

5 Discussion

Experiments show that multi-domain and multi-task learning can improve the performance of Chinese NER. More work can be done in the future.

Table 5. The overall results of multi-domain and multi-task learning on the OntoNotes dataset.

First, Chinese-specific features can be considered in multi-domain and multi-task learning. In this manuscript, we only use Chinese models that are similar to the English models. Some Chinese-specific features have been shown to be very helpful for Chinese NER, such as radical features [5] and glyph representations of Chinese characters [24]. We will explore these features in multi-domain and multi-task learning.

Second, the BV models could become universal models through small changes. For example, some models require the same label set in the source and target domains; this requirement could be relaxed so that the source and target labels only need to be related. Such small changes to the model architectures are left for future work.

Third, multi-domain and multi-task learning are highly similar, and the two lines of work can be combined. For example, Peng and Dredze combined multi-domain and multi-task learning using domain projection and task-specific CRFs [18]. However, they only handled the situation where the domains share the same label set. In the future, more general models could be considered.

6 Conclusion

In this manuscript, we focus on utilizing Chinese news domain information and Chinese word segmentation information to improve Chinese weibo named entity recognition through multi-domain and multi-task learning. Three types of universal model architectures are explored. Experiments show that the universal models outperform the baseline model.