1 Introduction

Chinese word segmentation, part-of-speech (POS) tagging and dependency parsing are three fundamental tasks for Chinese natural language processing, and their accuracy directly affects downstream tasks such as semantic comprehension, machine translation and question answering. The traditional approach processes them as a pipeline: word segmentation, then POS tagging, then dependency parsing. However, the pipeline approach suffers from two problems. One is error propagation: incorrect word segmentation directly degrades POS tagging and dependency parsing. The other is information sharing: the three tasks are strongly related and the label information of one task can help the others, but the pipeline approach cannot exploit these correlations.

A joint model for Chinese word segmentation, POS tagging and dependency parsing is a solution to both problems. Previous joint models [7, 13, 21] mainly adopted a transition-based framework to integrate the three tasks: based on the standard sequential shift-reduce transitions, they designed extra actions for word segmentation and POS tagging. Although these transition-based models long held the best performance on word segmentation, POS tagging and dependency parsing, their local decisions lead to low precision on long-distance dependencies, which limits the overall precision of dependency parsing.

In contrast to the transition-based framework, the graph-based framework can make global decisions. Before the advent of neural networks, the graph-based framework was rarely applied to joint models because of its large decoding space. With the development of neural network technology, graph-based dependency parsing has improved rapidly and come back into researchers' attention. [19] first proposed a graph-based unified model for joint Chinese word segmentation and dependency parsing with a neural network and an attention mechanism, which is superior to the best transition-based joint models in both word segmentation and dependency parsing. This work, which does not include a POS tagging task, shows that dependency parsing is beneficial to Chinese word segmentation.

Chinese word segmentation, POS tagging and dependency parsing are three highly correlated tasks that can improve each other's performance. Dependency parsing is beneficial to word segmentation and POS tagging, while word segmentation and POS tagging are also helpful to dependency parsing, as demonstrated by considerable work on transition-based joint models of the three tasks. We consider that a joint POS tagging task can further improve the performance of dependency parsing. In addition, it enables the model to provide POS information for downstream tasks. For these reasons, this paper proposes a graph-based joint model for word segmentation, POS tagging and dependency parsing. First, we design a character-level POS tagging task, and then combine it with the graph-based joint model for word segmentation and dependency parsing [19]. For the combination, this paper proposes two ways: one combines the two tasks by hard parameter sharing [3], and the other combines them by introducing a tag attention mechanism into the shared parameter layers. Finally, we analyze our proposed models on the Chinese Treebank (CTB5) dataset.

2 The Proposed Model

In this section, we introduce our proposed graph-based joint model for Chinese word segmentation, POS tagging and dependency parsing. Through the joint POS tagging task, we explore joint learning among multiple tasks and seek a better joint model to further improve the performance of Chinese dependency parsing.

Fig. 1. An example of a character-level dependency tree

2.1 Character-Level Chinese Word Segmentation and Dependency Parsing

This paper follows [19]'s approach of combining word segmentation and dependency parsing into a character-level dependency parsing task. First, we transform word segmentation into a special arc prediction problem between characters. Specifically, we treat each word as a dependency subtree in which the last character is the root node and every other character takes the next character as its head node. For example, the root node of the dependency subtree of the word " " is " ", and the head node of the character " " is " ", which constitutes an intra-word dependency arc " ". To distinguish these arcs from dependencies between words, a special dependency label "Append (A)" is added to represent the dependencies between characters within a word. We use the last character of each word (the root node of its dependency subtree) as the representation of that word, so a dependency between words can be replaced by a dependency between the last characters of those words. For example, the dependency relationship " " is transformed into " ". Figure 1 shows an example from the CTB5 dataset converted to a character-level dependency tree.
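
For concreteness, the following Python sketch shows how a segmented sentence with word-level heads could be converted into such a character-level tree; the input and output data structures here are our own assumptions, not those of [19].

```python
# A minimal sketch of the word-to-character tree conversion described above.
def to_char_level_tree(words, word_heads, word_labels):
    """words: list of word strings; word_heads: 1-based head word index
    (0 = root) for each word; word_labels: dependency label per word."""
    # 1-based position of the last character of each word in the flat
    # character sequence; it acts as the root of the word's subtree.
    last_char, offset = [], 0
    for w in words:
        offset += len(w)
        last_char.append(offset)

    char_heads, char_labels = [], []
    for w_idx, w in enumerate(words):
        start = last_char[w_idx] - len(w) + 1
        # Intra-word arcs: every character depends on the next one, label "A".
        for pos in range(start, last_char[w_idx]):
            char_heads.append(pos + 1)
            char_labels.append("A")
        # The last character inherits the word-level dependency.
        head_word = word_heads[w_idx]
        head_char = 0 if head_word == 0 else last_char[head_word - 1]
        char_heads.append(head_char)
        char_labels.append(word_labels[w_idx])
    return char_heads, char_labels
```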

2.2 Character-Level POS Tagging

To transform POS tagging into a character-level task, this paper adopts the following rule to convert word-level POS tags into character-level tags: the POS tag of each character is the POS tag of the word it belongs to. When predicting a word's POS tag, we use the POS tag predicted for its last character. For example, if the predicted POS tag sequence of the characters of the word " " is "NN, VV, NN", then the tag "NN" of the last character " " is taken as the POS tag of the whole word. Note that a word's POS tag is counted as correct only if the word is segmented correctly and the last character's POS tag is also predicted correctly.
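
The following small sketch restates this rule and the correctness convention in code; the helper names and the span-based word representation are hypothetical.

```python
# Every character receives its word's POS tag; a word's predicted tag counts as
# correct only if the word span is segmented correctly and the last character's
# predicted tag matches the gold tag.
def chars_pos(words, word_pos):
    return [tag for w, tag in zip(words, word_pos) for _ in w]

def word_pos_correct(gold_span, gold_tag, pred_spans, pred_char_pos):
    # gold_span = (start, end) character offsets of the gold word, end exclusive.
    return gold_span in pred_spans and pred_char_pos[gold_span[1] - 1] == gold_tag
```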

2.3 Graph-Based Joint Model for Word Segmentation, POS Tagging and Dependency Parsing

Following Sects. 2.1 and 2.2, after converting the three tasks into two character-level tasks, we design a shared deep Bi-LSTM network to encode the input characters and obtain contextual character vectors. As shown in Fig. 2, given the input sentence (character sequence) \(X = \{x_1, ..., x_n\}\), we first vectorize each character \(x_i\) to obtain a vector \(e_i\), which consists of two parts: a pre-trained vector \(p_i\) that is fixed during training, and a randomly initialized embedding \(s_i\) that is adjusted during training. The pre-trained and random embeddings are added element-wise to form the final input character embedding, i.e., \(e_i = p_i + s_i\). Then we feed the character embeddings into a multi-layer Bi-LSTM network and obtain each character's contextual representation \(C = \{c_1, ..., c_n\}\).

$$\begin{aligned} {\overrightarrow{c_i} = \overrightarrow{\mathrm{LSTM}}(e_i, \overrightarrow{c}_{i-1}, \overrightarrow{\theta });\ \overleftarrow{c_i} = \overleftarrow{\mathrm{LSTM}}(e_i, \overleftarrow{c}_{i+1}, \overleftarrow{\theta });\ c_{i}=\overrightarrow{c_{i}}\oplus \overleftarrow{c_{i}}} \end{aligned}$$
(1)
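
As an illustration, a minimal PyTorch sketch of this shared encoder might look as follows; the class name, hidden size per direction and tensor shapes are assumptions based on the description above and the configuration in Sect. 3.2.

```python
import torch
import torch.nn as nn

class CharEncoder(nn.Module):
    """Fixed pre-trained character embeddings plus trainable random embeddings,
    summed element-wise and fed to a multi-layer Bi-LSTM (Eq. 1)."""
    def __init__(self, pretrained, d_emb=200, d_hidden=400, n_layers=3, dropout=0.33):
        super().__init__()
        self.pre = nn.Embedding.from_pretrained(pretrained, freeze=True)   # p_i, fixed
        self.rand = nn.Embedding(pretrained.size(0), d_emb)                # s_i, trainable
        self.bilstm = nn.LSTM(d_emb, d_hidden, num_layers=n_layers,
                              batch_first=True, bidirectional=True, dropout=dropout)

    def forward(self, char_ids):                      # (batch, seq_len)
        e = self.pre(char_ids) + self.rand(char_ids)  # e_i = p_i + s_i
        c, _ = self.bilstm(e)                         # (batch, seq_len, 2 * d_hidden)
        return c
```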

After the contextual character vectors are obtained, character-level POS tagging and dependency parsing are carried out separately. We adopt the graph-based framework for the character-level dependency parsing task. By taking each character as a node of the graph and the possibility of forming a dependency relationship between two characters as a weighted directed edge (pointing from the head node to the dependent node), we define dependency parsing as finding the highest-scoring dependency tree that conforms to the dependency grammar on a complete directed graph. Dependency parsing thus contains two subtasks: prediction of the dependency relationship and prediction of the dependency relationship type.

Prediction of Dependency Relationship: We use \(x_i\leftarrow x_j\) to represent the dependency relation in which \(x_i\) is the dependent node and \(x_j\) is the head node. After context encoding, each character obtains a vector representation \(c_i\). Since each character may act both as a dependent node and as a head node, we use two vectors \(d^{arc}_i\) and \(h^{arc}_i\) to represent these two roles, obtained from \(c_i\) through two different MLPs, as shown in formula (2).

$$\begin{aligned} d^{arc}_i = \mathrm{{MLP}}^{arc}_d(c_i);\ h^{arc}_i = \mathrm{{MLP}}^{arc}_h(c_i) \end{aligned}$$
(2)
Fig. 2. A joint model of segmentation, POS tagging and dependency parsing with parameter sharing

To calculate the score \(s^{arc}_{ij}\) of \(x_i\leftarrow x_j\), we use the biaffine attention mechanism proposed by [5].

$$\begin{aligned} s^{arc}_{ij} = \mathrm{{Biaffine}}^{arc}(h^{arc}_j, d^{arc}_i) = h^{arc}_j U^{arc}d^{arc}_i + h^{arc}_j u^{arc} \end{aligned}$$
(3)

where \(U^{arc}\) is a matrix of dimension (\(d_c, d_c\)), \(d_c\) is the dimension of the vector \(c_i\), and \(u^{arc}\) is a bias vector. After obtaining the scores of all candidate head nodes of the i-th character, we select the node with the highest score as its head.

$$\begin{aligned} s^{arc}_i = [s^{arc}_{i1}, ..., s^{arc}_{in}];\ y^{arc}_i = \arg \max (s^{arc}_i) \end{aligned}$$
(4)
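
A hedged PyTorch sketch of the arc scorer of formulas (2)-(4) is given below; the module names, zero initialization and the ReLU activation inside the MLPs are assumptions, and the dimensions follow Sect. 3.2.

```python
import torch
import torch.nn as nn

class BiaffineArc(nn.Module):
    """Two MLPs project each contextual vector into head and dependent roles,
    and a biaffine product scores every head candidate for every dependent."""
    def __init__(self, d_c=800, d_arc=500):
        super().__init__()
        self.mlp_d = nn.Sequential(nn.Linear(d_c, d_arc), nn.ReLU())
        self.mlp_h = nn.Sequential(nn.Linear(d_c, d_arc), nn.ReLU())
        self.U = nn.Parameter(torch.zeros(d_arc, d_arc))   # U^arc
        self.u = nn.Parameter(torch.zeros(d_arc))           # u^arc

    def forward(self, c):                                   # c: (batch, n, d_c)
        d = self.mlp_d(c)                                    # dependent representations
        h = self.mlp_h(c)                                    # head representations
        # s[b, i, j] = h_j U d_i + h_j u, i.e. the score of x_i <- x_j
        s = torch.einsum('bjd,de,bie->bij', h, self.U, d) \
            + torch.einsum('bjd,d->bj', h, self.u).unsqueeze(1)
        return s                                             # heads = s.argmax(dim=-1)
```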

Prediction of Dependency Relationship Type: After obtaining the best predicted unlabeled dependency tree, we calculate the label scores \(s^{label}_{ij}\) for each dependency relationship \(x_i\leftarrow x_j\). In our joint model, the arc label set consists of the standard word-level dependency labels and the special label "A" indicating an intra-word dependency. We again use two vectors \(d^{label}_i\) and \(h^{label}_i\) for the dependent and head roles, obtained from \(c_i\) through two different MLPs, and use another biaffine attention network to calculate the label scores \(s^{label}_{ij}\).

$$\begin{aligned}&d^{label}_i = \mathrm{{MLP}}^{label}_d(c_i);\ h^{label}_i = \mathrm{{MLP}}^{label}_h(c_i) \end{aligned}$$
(5)
$$\begin{aligned}&s^{label}_{ij} = \mathrm{{Biaffine}}^{label}(h^{label}_j, d^{label}_i) = h^{label}_j U^{label}d^{label}_i + (h^{label}_j\oplus d^{label}_i) V^{label} + b \end{aligned}$$
(6)

where \(U^{label}\) is a tensor of dimension (\(k, d_c, d_c\)), k is the number of dependency relationship labels, \(V^{label}\) has dimension (\(k, 2d_c\)), and b is a bias vector. The best label of the dependency relationship \(x_i\leftarrow x_j\) is:

$$\begin{aligned} y^{label}_{ij} = \arg \max (s^{label}_{ij}) \end{aligned}$$
(7)
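
A companion sketch for the label scorer of formulas (5)-(7); the number of labels, the zero initialization and the module names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class BiaffineLabel(nn.Module):
    """Per-label biaffine term plus a linear term over the concatenated head
    and dependent vectors and a bias, following Eq. (6)."""
    def __init__(self, d_c=800, d_label=100, n_labels=14):
        super().__init__()
        self.mlp_d = nn.Sequential(nn.Linear(d_c, d_label), nn.ReLU())
        self.mlp_h = nn.Sequential(nn.Linear(d_c, d_label), nn.ReLU())
        self.U = nn.Parameter(torch.zeros(n_labels, d_label, d_label))  # U^label
        self.V = nn.Parameter(torch.zeros(n_labels, 2 * d_label))       # V^label
        self.b = nn.Parameter(torch.zeros(n_labels))                    # b

    def forward(self, c, heads):           # heads: (batch, n) predicted head indices (LongTensor)
        d = self.mlp_d(c)
        # gather the head representation h_j of each dependent x_i
        h = self.mlp_h(c).gather(1, heads.unsqueeze(-1).expand(-1, -1, d.size(-1)))
        bil = torch.einsum('bik,lkm,bim->bil', h, self.U, d)            # h_j U^l d_i
        lin = torch.einsum('bik,lk->bil', torch.cat([h, d], -1), self.V)
        return bil + lin + self.b           # (batch, n, n_labels); argmax over labels
```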

Prediction of POS Tags: We use a multi-layer perceptron (MLP) to calculate the POS tag probability distribution for each character.

$$\begin{aligned} s^{POS}_i = \mathrm{{MLP}}^{POS}(c_i) \end{aligned}$$
(8)

The best POS tag of the character \(x_i\) is

$$\begin{aligned} y^{POS}_i = \arg \max (s^{POS}_i) \end{aligned}$$
(9)

Loss Function for the Joint Model: For the three prediction tasks described above, we adopt the cross-entropy loss, and denote the resulting losses as \(Loss_{arc}\), \(Loss_{dep}\) and \(Loss_{pos}\) respectively. The common way to deal with the losses of multiple tasks is to simply add them together, but this does not balance the losses of the individual tasks. Therefore, we adopt the method proposed by [10], which uses uncertainty to weight the losses of the three tasks.

$$\begin{aligned} \mathcal {L}(\theta ) = \dfrac{1}{\delta _{arc}^{2}}Loss_{arc} + \dfrac{1}{\delta _{dep}^{2}}Loss_{dep} + \dfrac{1}{\delta _{pos}^{2}}Loss_{pos} + \log \delta _{arc}^{2} + \log \delta _{dep}^{2} + \log \delta _{pos}^{2} \end{aligned}$$
(10)
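
A minimal sketch of this uncertainty-based weighting is shown below; parameterizing the weights as learnable log-variances is our own choice for numerical stability, while the underlying method is that of [10].

```python
import torch
import torch.nn as nn

class UncertaintyLoss(nn.Module):
    """One learnable log-variance per task: each task loss is scaled by
    1 / sigma^2 and regularized by log sigma^2, as in Eq. (10)."""
    def __init__(self, n_tasks=3):
        super().__init__()
        self.log_var = nn.Parameter(torch.zeros(n_tasks))   # log(sigma^2) per task

    def forward(self, losses):            # losses: [loss_arc, loss_dep, loss_pos]
        total = 0.0
        for loss, lv in zip(losses, self.log_var):
            total = total + torch.exp(-lv) * loss + lv
        return total
```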

2.4 Introduction of Tag Attention Mechanism

The above model joins the three tasks by sharing the Bi-LSTM layers that encode contextual character information. However, there is no explicit representation of POS information in the shared encoding layers, so the POS tagging task cannot provide its predictions to word segmentation and dependency parsing. Therefore, we introduce vector representations of the POS tags and propose a tag attention mechanism (TAM) to integrate the POS information of contextual characters into the vector representation of each character, so that this POS information can also be used by word segmentation and dependency parsing. This structure is similar to the hierarchically-refined label attention network (LAN) proposed by [4], but we use it to obtain POS information at each layer for the subsequent character-level dependency parsing task. LAN differs from TAM in that LAN predicts only at the last layer while TAM predicts at each layer. We tried predicting only at the last layer, but the results of segmentation and dependency parsing were slightly lower than when predicting at each layer. The model is shown in Fig. 3.

Fig. 3. A joint model of segmentation, POS tagging and dependency parsing with tag attention mechanism

First, we vectorize the POS tags. Each POS tag is represented by a vector \(e^t_i\), and the set of POS tag representations is denoted as \(E^t = \{e^t_1, ..., e^t_m\}\), which is randomly initialized before training and adjusted during training. Then, we calculate the attention weights between the contextual character vectors and the POS tag vectors:

$$\begin{aligned}&\alpha = \mathrm{{softmax}}(\frac{{{QK}}^T}{\sqrt{d_c}}) \end{aligned}$$
(11)
$$\begin{aligned}&E^+ = \mathrm{{Attention}}(Q, K, V) = \alpha V \end{aligned}$$
(12)
$$\begin{aligned}&C^+ = \mathrm{{LayerNorm}}(C + E^+) \end{aligned}$$
(13)

where Q, K and V are matrices composed of a set of queries, keys and values; we set \(Q = C, K = V = E^t\). The i-th row of \(\alpha \) represents the POS tag probability distribution of the i-th character of the sentence. According to this distribution \(\alpha \), we calculate the representation of the predicted POS tag of each character, denoted as \(E^+\). \(E^+\) is added to the contextual vectors C to inject the POS tag information. After layer normalization [1], we obtain character vectors \(C^+\) containing the POS information, which are taken as the input of the next Bi-LSTM layer. After the second Bi-LSTM layer, each character vector contains every character's POS information, which can be used by word segmentation and dependency parsing.

When the tag attention mechanism is applied, the i-th row of the attention weights computed at each layer is the POS tag distribution of the i-th character. Different from the POS tagging prediction in the previous model, we add up the attention weights of all layers as the final POS tag distribution:

$$\begin{aligned} s^{POS}_i = \sum ^{m}_{j}\alpha ^{j}_i \end{aligned}$$
(14)

where m is the number of layers. The predicted POS tag is:

$$\begin{aligned} y^{POS}_i = \arg \max (s^{POS}_i) \end{aligned}$$
(15)
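
To make the mechanism concrete, a minimal PyTorch sketch of one TAM layer and of the layer-summed POS prediction is given below; the assumption that the tag embeddings share the dimension of the contextual character vectors, as well as all names and sizes, are our own.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TagAttentionLayer(nn.Module):
    """One TAM layer (Eqs. 11-13): characters attend over the POS tag embedding
    table, the attended tag representation is added back with a residual
    connection and layer normalization, and the attention weights are returned
    so that all layers can be summed for POS prediction (Eq. 14)."""
    def __init__(self, d_c=800, n_tags=33):
        super().__init__()
        self.tag_emb = nn.Parameter(torch.randn(n_tags, d_c))   # E^t, randomly initialized
        self.norm = nn.LayerNorm(d_c)

    def forward(self, c):                                        # c: (batch, n, d_c)
        # alpha[b, i, t]: probability of tag t for character i (Eq. 11), Q=C, K=V=E^t
        alpha = F.softmax(c @ self.tag_emb.t() / c.size(-1) ** 0.5, dim=-1)
        e_plus = alpha @ self.tag_emb                             # E^+ (Eq. 12)
        c_plus = self.norm(c + e_plus)                            # C^+ (Eq. 13)
        return c_plus, alpha

# POS prediction sums the attention weights of all layers (Eq. 14):
#   s_pos = sum(all_alphas); y_pos = s_pos.argmax(dim=-1)
```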

For word segmentation and dependency parsing, we use the same approach as in the previous model. The losses of the three tasks are also calculated in the same way as in the previous model.

3 Experiment

3.1 Dataset and Evaluation Metrics

We conduct experiments on the Penn Chinese Treebank 5 (CTB5). We adopt the same data split as previous works [7, 13, 19]: the training set consists of sections 1\(\sim \)270, 400\(\sim \)931 and 1001\(\sim \)1151, the development set of sections 301\(\sim \)325, and the test set of sections 271\(\sim \)300. Statistics of the dataset are shown in Table 1.

Table 1. The statistics of the dataset.

Following previous works [9, 13, 19], we use the standard word-level F1 score to evaluate word segmentation, POS tagging and dependency parsing. The F1 score is calculated from the precision P and the recall R as \(F = 2PR/(P+R)\) [9]. Dependency parsing is evaluated with the unlabeled attachment score, excluding punctuation. An output POS tag or dependency arc cannot be counted as correct unless the corresponding word is correctly segmented.
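
For clarity, a small sketch of the span-based word-level F1 computation follows, under the assumption that words are compared as character spans; the function names are hypothetical.

```python
def seg_f1(gold_words, pred_words):
    """Word segmentation F1: precision and recall over matching (start, end) spans."""
    def spans(words):
        out, start = set(), 0
        for w in words:
            out.add((start, start + len(w)))
            start += len(w)
        return out
    g, p = spans(gold_words), spans(pred_words)
    correct = len(g & p)
    if correct == 0:
        return 0.0
    prec, rec = correct / len(p), correct / len(g)
    return 2 * prec * rec / (prec + rec)
```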

3.2 Model Configuration

We use the same Tencent pre-trained embeddings [17] and configuration as [19]; the dimension of character vectors is 200, and the dimension of POS tag vectors is also 200. Each Bi-LSTM layer has 400 units, and the number of layers is 3. The dependency arc MLP output size is 500 and the label MLP output size is 100. All dropout rates are 0.33.

The models are trained with the Adam algorithm [11] to minimize the total cross-entropy loss of arc predictions, label predictions and POS tag predictions, combined with uncertainty weights. The initial learning rate is 0.002 and is annealed by multiplying by a fixed decay rate of 0.75 whenever parsing performance stops increasing on the development set. To reduce the effect of exploding gradients, we clip gradients at 5.0 [16]. All models are trained for 100 epochs.
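
A hedged sketch of this training setup in PyTorch is given below; the model, data loader and development evaluation function are placeholders passed in by the caller.

```python
import torch

def train(model, train_loader, evaluate_dev, epochs=100):
    """Adam with lr 0.002, lr multiplied by 0.75 when the dev metric plateaus,
    gradient-norm clipping at 5.0; a sketch, not the exact training script."""
    optimizer = torch.optim.Adam(model.parameters(), lr=0.002)
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='max', factor=0.75)
    for _ in range(epochs):
        for batch in train_loader:
            optimizer.zero_grad()
            loss = model(batch)                  # uncertainty-weighted total loss
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), 5.0)
            optimizer.step()
        scheduler.step(evaluate_dev(model))      # e.g. dev UAS
```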

3.3 Results

We compare our models with other joint parsing models. The model shown in Fig. 2 is denoted as Ours and the model shown in Fig. 3 as Ours-TAM (with the tag attention mechanism). The comparison models are of three types: transition-based joint models with feature templates [7, 13, 21], transition-based joint models with neural networks [13] (4-g, 8-g), and the graph-based neural model without a POS tagging task [19]. The results are shown in Table 2.

Table 2. Performance comparison of Chinese dependency parsing joint models.

From the table, we see that transition-based joint models using feature templates maintained the best performance on word segmentation, POS tagging and dependency parsing for a long time. Although [13] (4-g, 8-g) adopted a neural network approach, it still did not surpass the joint models with feature templates. In contrast, the graph-based joint model [19] obtained better performance on word segmentation and dependency parsing than all transition-based models.

Our models Ours and Ours-TAM exceed [19] by 0.67 and 0.38 percentage points respectively on dependency parsing, indicating that POS tag information contributes to dependency parsing. Although they are 0.13 and 0.05 percentage points lower than [19] on word segmentation respectively, they still exceed the best transition-based joint model with feature templates [13]. [19] has no POS tagging task, while our models do, and their POS tagging performance exceeds that of the previous best joint model [13] by 0.11 and 0.35 percentage points respectively, indicating that once POS tagging is introduced, the other tasks such as dependency parsing also help POS tagging itself.

3.4 Detailed Analysis

We further investigate the reasons for the improvement of dependency parsing after the POS tagging task is added. For a dependency relationship \(x_i\leftarrow x_j\), we use \(X\leftarrow Y\) to represent its POS dependency pattern, where X is the POS tag of \(x_i\) and Y is the POS tag of \(x_j\). We calculated the distribution of Y for each X in the training set and found that for some X the probability of a particular Y is very high. For example, when X is P (preposition), the distribution of Y is {VV (78.5%), DEG (5.1%), ..., NN (3.1%), ...}. To verify whether our models can exploit this POS information from the training set, we calculated the accuracy of each POS dependency pattern on the test set for our models and for a re-implementation of [19]. The patterns on which our models are more accurate than [19] are shown in the left part of Fig. 4.

Fig. 4. Comparison of precision on different POS tag patterns before and after joint POS tagging task

Table 3. Head POS distribution

The X values of these 5 patterns are {DT, P, ETC, CD, CC}, and the distribution of Y for each X is shown in Table 3. We find that all 5 patterns select the Y with the highest probability, indicating that our models can fully utilize the POS information to improve the accuracy of dependencies with these POS dependency patterns. As the example in Fig. 5 shows, when predicting the head node of " ", [19] predicts the wrong node " ", while both of our models predict the right node " ". The POS tag of " " is P, and the POS tag of the correct head node " " is VV, whose conditional probability is 78.5%, while the wrong head node's POS tag is NN, whose probability is only 3.1%. Our models can use this POS information to exclude candidate head nodes with low-probability POS tags, thus improving the performance of dependency parsing.

Fig. 5. An example of POS information contributes to dependency parsing

Although Ours-TAM achieves better results on segmentation and POS tagging, its dependency parsing performance drops compared with Ours. The right part of Fig. 4 shows the patterns on which the accuracy of our models is worse than [19]. We find that the dependency probabilities of these patterns are small, so adding POS information actually reduces the accuracy. Since Ours-TAM incorporates stronger POS information, its accuracy on these patterns is lower than that of Ours, and thus the overall precision of dependency parsing of Ours-TAM decreases compared with Ours.

Fig. 6. The influence of dependency length and sentence length on dependency parsing

Next, we investigate the difference between the graph-based and transition-based joint models in dependency parsing. We compare our graph-based joint models with the transition-based joint model [13] with respect to dependency length and sentence length. The results are shown in Fig. 6. From the figure, we can see that our joint models have an obvious advantage on long-distance dependencies, and their parsing accuracy remains relatively stable as sentence length increases, while the transition-based joint model shows an obvious downward trend. This indicates that our graph-based joint models can predict long-distance dependencies more effectively than the transition-based joint model.

4 Related Work

[7] first proposed character-level dependency parsing, which combines word segmentation, POS tagging and dependency parsing. They combined key feature templates from previous feature-engineering research on the three tasks and processed the three tasks simultaneously. [21] annotated the internal structure of words and treated word segmentation as dependency parsing among characters in order to process it jointly with the other tasks. [13] first applied neural networks to character-level dependency parsing. Although these transition-based joint models achieved the best accuracy in dependency parsing, they still suffer from the limitation of local decisions.

With the development of neural networks, graph-based dependency parsing models [5, 12] have advanced rapidly. These models fully exploit the ability of bidirectional long short-term memory networks (Bi-LSTM) [8] and attention mechanisms [2, 18] to capture the interactions of words in a sentence. Different from transition-based models, graph-based models can make global decisions when predicting dependency arcs, but few joint models have adopted this framework. [19] first proposed a graph-based neural joint model for Chinese word segmentation and dependency parsing, but it does not use POS tags.

According to research on existing transition-based joint models, word segmentation, POS tagging and dependency parsing are three highly correlated tasks that influence each other. Therefore, we integrate the POS tagging task into the graph-based joint model of [19] to further improve performance and to provide POS information for downstream tasks. We transform POS tagging into a character-level sequence labeling task and then combine it with [19] using multi-task learning. Among the many multi-task learning approaches, such as [3, 14, 15] and [6], we use parameter sharing [3] to realize the joint model, and then improve it with the tag attention mechanism. Finally, we analyze the models on the CTB5 dataset.

5 Conclusion

This paper proposed a graph-based joint model for Chinese word segmentation, POS tagging and dependency parsing. Word segmentation and dependency parsing are transformed into a character-level dependency parsing task, POS tagging is transformed into a character-level sequence labeling task, and we use two ways to join them into a multi-task model. Experiments on the CTB5 dataset show that adding the POS tagging task is beneficial to dependency parsing, that the POS tag attention mechanism can exploit more POS information from contextual characters, which benefits POS tagging and dependency parsing, and that our graph-based joint model outperforms the existing best transition-based joint model on all three tasks. In the future, we will explore other joint approaches to make the three tasks more mutually reinforcing and further improve their performance.