6.1 Introduction

Considerable research has been devoted to parsing technology because of the importance of dependency parsing: many natural language applications, such as information retrieval and question-answering systems [10], employ it as a base component, and their performance may rely heavily on the parsing result.

Recent methods of dependency parsing can be divided into two classes: data-driven methods and rule-based methods. Data-driven methods are usually statistical parsers that use machine learning algorithms to capture the statistical features of the data in order to produce the syntactic relations between the words of a sentence.

Most state-of-the-art parsers are statistical parsers whose accuracy relies heavily on the quality and quantity of the treebank [3, 6, 11]. Several Chinese treebanks exist for syntactic parsing, such as the Penn Chinese Treebank (CTB) [8] and the Chinese Dependency Treebank (CDT). The CTB carries constituency annotation and was built from Xinhua newswire, Hong Kong news, Sinorama, and ACE broadcast news, while the CDT is a dependency treebank built from People’s Daily newswire stories. The occurrence frequency of a syntactic substructure (subtree) may vary from one treebank to another: some substructures occur frequently while others do not. The rare syntactic substructures of some sentences are well formed for a human reader, but they tend to be abandoned when common syntactic substructures conflict with them, preventing the parser from handling rare syntactic structures correctly. This partly degrades the label attachment recall rate.

To examine this assumption, we segment the treebank into k parts in one round, train a transition-based parser (subparser) on each part, and then compare the result set generated by the subparsers with the k-best results generated by a parser trained on the full treebank. We find that the label attachment recall rate of the k-segment result set is higher than that of the corresponding k-best list, which confirms the assumption. Building on this phenomenon, we further employ a parser to post-reparse the subparsers’ output. Since transition-based and graph-based parsers have different training and inference algorithms [5, 7] and therefore behave differently, we construct the post-reparser with a constrained Eisner’s algorithm [2, 4] to find maximum spanning trees (MST). The experiments show that the pipeline can improve parsing accuracy while also computing a parsing reliability score.

6.2 Parsing Pipeline

The pipeline in this paper addresses the general structured prediction problem, which maps an input sentence \( x \in {\mathbf{X}} \) to an output dependency structure \( y \in {\mathbf{Y}} \) composed of edges e,

$$ e = \begin{cases} \left\langle i, j, \leftarrow, l \right\rangle \\ \left\langle i, j, \rightarrow, l \right\rangle \end{cases} $$
(6.1)

where i and j are the endpoints of the relation and l is the dependency label from the label set L. We employ a transition-based parser with beam search [9] as the subparser and an MST parser with a conditional random field (CRF) as the post-reparser. In the CRF model, the probability of the output y is

$$ p\left( y|x \right) = \exp\left( f\left( y,x \right) \cdot \boldsymbol{\lambda} \right) / Z\left( x \right) $$
(6.2)

where \( f\left( y,x \right) \) maps y and x to a feature vector, λ is the corresponding weight vector, and \( Z\left( x \right) \) is the normalization factor. For a sentence x, the parsing result y is obtained by finding the result with the highest probability among all possible results,

$$ O\left( x \right) = \arg\max\nolimits_{y \in {\mathbf{PSET}}\left( x \right)} p\left( y|x \right) $$
(6.3)

where \( {\mathbf{PSET}}\left( x \right) \) denotes the set of possible results for the sentence x.
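Equations (6.2) and (6.3) can be sketched in a few lines, assuming feature vectors and weights are plain lists and `candidates` stands in for \( {\mathbf{PSET}}\left( x \right) \); all names here are ours, not the authors’:

```python
def crf_score(feat_vec, weights):
    """Unnormalized log score f(y, x) . lambda (the exponent in Eq. 6.2)."""
    return sum(f * w for f, w in zip(feat_vec, weights))

def best_parse(candidates, feature_fn, weights):
    """Eq. 6.3: pick the candidate y with the highest p(y|x).

    Z(x) is identical for every candidate of the same sentence, so
    comparing the unnormalized scores is sufficient.
    """
    return max(candidates, key=lambda y: crf_score(feature_fn(y), weights))
```

Because the partition function \( Z\left( x \right) \) cancels in the argmax, the decoder never needs to compute it.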

The pipeline is composed of a training procedure and a parsing procedure. The training procedure mainly creates a set of subparsers, each a transition-based parser. In the parsing procedure, we first use this set of subparsers to obtain a result set and then reparse the set to compute the best output.

6.2.1 Training

The training procedure is as follows:

  • Segmenting the treebank into equally sized sub-treebanks \( {\varvec{\Omega}} = \left\{ {b_{i} } \right\}_{i = 1, \ldots ,N} \).

  • Training subparser \( t_{i} \) with sub-treebank \( b_{i} \in\Omega \) to build a set of subparsers \( {\varvec{\Gamma}}_{N} = \left\{ {t_{i} } \right\}_{i = 1, \cdots ,N} \).

  • Training the MST parser T with the whole treebank to calculate the weight vector \( {\varvec{\uplambda}}_{T} \) for features.

Through this procedure, we obtain a trained model \( \left\{ {N,{\varvec{\Gamma}}_{N} ,{\varvec{\uplambda}}_{T} ,f_{T} } \right\} \), where N is the number of subparsers trained on the segmented treebanks and \( f_{T} \) is the feature extraction function of the MST parser, built from feature templates.
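The training steps above can be sketched as follows; `train_subparser` and `train_mst` are placeholders for the transition-based and MST training routines, and the random shuffle before splitting is our own assumption about how the segments are formed:

```python
import random

def segment_treebank(treebank, n, seed=0):
    """Split a treebank into n equally sized sub-treebanks (Omega)."""
    trees = list(treebank)
    random.Random(seed).shuffle(trees)
    size = len(trees) // n
    return [trees[i * size:(i + 1) * size] for i in range(n)]

def train_pipeline(treebank, n, train_subparser, train_mst):
    """Build the model {N, Gamma_N, lambda_T, f_T} of Sect. 6.2.1."""
    omega = segment_treebank(treebank, n)
    gamma = [train_subparser(b) for b in omega]   # Gamma_N, one per segment
    lambda_t, f_t = train_mst(treebank)           # MST weights and feature function
    return {"N": n, "subparsers": gamma, "lambda_T": lambda_t, "f_T": f_t}
```

Note that only the subparsers see segmented data; the MST post-reparser is trained on the whole treebank.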

6.2.2 Post-reparsing

In the parsing step, a result set \( {\mathbf{R}} = \left\{ {r_{i} } \right\}_{i = 1, \ldots ,N} \), which differs from an N-best list, is generated for an input sentence x by the subparsers in \( {\varvec{\Gamma}}_{N} \). We use it to constrain the search space for the sentence (far smaller than the full search space) and then employ the constrained Eisner’s algorithm to extract the best result. The constraint scores are obtained as follows:

$$ s_{c}\left( e, {\mathbf{R}} \right) = \begin{cases} f_{T}\left( e \right) \cdot {\varvec{\uplambda}}_{T} & \text{if } e \in r_{i},\; i = 1, \ldots, N \\ 0 & \text{else} \end{cases} $$
(6.4)

where \( f_{T} \left( e \right) \) maps edge e to a feature vector. The mixture score is as follows:

$$ s_{\text{mixc}} \left( {e,{\mathbf{R}},\alpha } \right) = \alpha \cdot f_{T} \left( e \right) \cdot {\varvec{\uplambda}}_{T} + \left( {1 - \alpha } \right) \cdot s_{c} \left( {e,{\mathbf{R}}} \right) $$
(6.5)

where α is a mixture factor that controls the strength of the constraint from the result set. Given a sentence \( S = w_{0} w_{1} \cdots w_{N} \) and the corresponding R, the post-reparsing procedure is as follows (Table 6.1).
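Equations (6.4) and (6.5) can be sketched as below, assuming edges are hashable tuples \( \left\langle i, j, \text{dir}, l \right\rangle \) and `result_edges` collects every edge appearing in some subparser result \( r_{i} \) (the function names are ours):

```python
def constraint_score(edge, result_edges, f_T, lambda_T):
    """Eq. 6.4: the MST edge score if some subparser proposed the edge, else 0."""
    if edge in result_edges:
        return sum(f * w for f, w in zip(f_T(edge), lambda_T))
    return 0.0

def mixture_score(edge, result_edges, alpha, f_T, lambda_T):
    """Eq. 6.5: interpolate the plain MST score with the constrained score.

    alpha = 1 ignores the subparsers entirely; alpha = 0 scores only
    edges that at least one subparser proposed.
    """
    plain = sum(f * w for f, w in zip(f_T(edge), lambda_T))
    return alpha * plain + (1 - alpha) * constraint_score(edge, result_edges, f_T, lambda_T)
```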

Table 6.1 Pseudo-code for constrained Eisner’s algorithm
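The pseudo-code of Table 6.1 is not reproduced here. As a substitute, the following is a minimal first-order projective Eisner decoder over a precomputed score matrix (our own sketch, not the authors’ exact procedure); in the pipeline, `score[h][m]` would be the mixture score \( s_{\text{mixc}} \) of the edge from head h to modifier m:

```python
NEG = float("-inf")

def eisner(score):
    """First-order projective Eisner decoder.

    score[h][m] is the score of attaching modifier m to head h;
    token 0 is the artificial root.  Returns heads[0..n-1], where
    heads[0] is unused.
    """
    n = len(score)
    # [s][t][d]: d = 1 head on the left (s), d = 0 head on the right (t)
    comp = [[[NEG, NEG] for _ in range(n)] for _ in range(n)]
    inc = [[[NEG, NEG] for _ in range(n)] for _ in range(n)]
    comp_bp = [[[0, 0] for _ in range(n)] for _ in range(n)]
    inc_bp = [[[0, 0] for _ in range(n)] for _ in range(n)]
    for i in range(n):
        comp[i][i][0] = comp[i][i][1] = 0.0
    for k in range(1, n):
        for s in range(n - k):
            t = s + k
            # incomplete spans: add the arc t -> s (d = 0) or s -> t (d = 1)
            for r in range(s, t):
                base = comp[s][r][1] + comp[r + 1][t][0]
                if base + score[t][s] > inc[s][t][0]:
                    inc[s][t][0], inc_bp[s][t][0] = base + score[t][s], r
                if base + score[s][t] > inc[s][t][1]:
                    inc[s][t][1], inc_bp[s][t][1] = base + score[s][t], r
            # complete span, head on the right
            for r in range(s, t):
                v = comp[s][r][0] + inc[r][t][0]
                if v > comp[s][t][0]:
                    comp[s][t][0], comp_bp[s][t][0] = v, r
            # complete span, head on the left
            for r in range(s + 1, t + 1):
                v = inc[s][r][1] + comp[r][t][1]
                if v > comp[s][t][1]:
                    comp[s][t][1], comp_bp[s][t][1] = v, r

    heads = [0] * n

    def back(s, t, d, complete):
        if s == t:
            return
        if complete:
            r = comp_bp[s][t][d]
            if d == 1:
                back(s, r, 1, False); back(r, t, 1, True)
            else:
                back(s, r, 0, True); back(r, t, 0, False)
        else:
            if d == 1:
                heads[t] = s
            else:
                heads[s] = t
            r = inc_bp[s][t][d]
            back(s, r, 1, True); back(r + 1, t, 0, True)

    back(0, n - 1, 1, True)
    return heads
```

Restricting or reweighting the entries of `score` with \( s_{c} \) or \( s_{\text{mixc}} \) yields the constrained variant: edges proposed by no subparser simply receive no constraint credit and are rarely chosen.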

6.2.3 Reliability Score of Dependency Relation

With the pipeline, we obtain a set of subparsers \( {\varvec{\Gamma}}_{N} \), in which each subparser is trained on a different part of the training corpus. Because each part of the corpus can be seen as unseen data with respect to the other parts, we can assume that each subparser’s parsing result for a sentence is supported by the corresponding part of the training corpus. Thus, a reliability score can be calculated by a weighted voting scheme [1] as follows:

$$ c\left( e_{i} \right) = \sum\limits_{\varepsilon \in {\mathbf{E}}_{i},\, \varepsilon = e_{i}} \exp\left( \xi \cdot f_{T}\left( \varepsilon \right) \cdot {\varvec{\uplambda}}_{T} \right) \Big/ \sum\limits_{\varepsilon \in {\mathbf{E}}_{i}} \exp\left( \xi \cdot f_{T}\left( \varepsilon \right) \cdot {\varvec{\uplambda}}_{T} \right) $$
(6.6)

where \( {\mathbf{E}}_{i} \) is the set of dependency relations \( \left\langle {i, \cdot , \cdot , \cdot } \right\rangle \), i.e., the ith relation of each dependency structure in the set R, and ξ is an adjusting factor; when \( \xi = 0 \), \( c\left( {e_{i} } \right) \) is a plain voting score for edge \( e_{i} \).
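Equation (6.6) amounts to a softmax-weighted vote among the candidate edges that the subparsers proposed at position i. A sketch under that reading (we assume the denominator ranges over all of \( {\mathbf{E}}_{i} \); the names are ours):

```python
import math

def reliability(edges_at_i, target_edge, f_T, lambda_T, xi=0.0):
    """Eq. 6.6: weighted-vote reliability of target_edge among the
    candidates E_i proposed by the N subparsers at position i.

    With xi = 0 every weight is exp(0) = 1, so the score reduces to
    the plain vote ratio for target_edge.
    """
    def weight(e):
        return math.exp(xi * sum(f * v for f, v in zip(f_T(e), lambda_T)))
    num = sum(weight(e) for e in edges_at_i if e == target_edge)
    den = sum(weight(e) for e in edges_at_i)
    return num / den
```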

6.3 Baseline and Experiments

This section presents the pipeline experiments on segmentation and post-reparsing. Because of the heavy computational cost of CRF training, we evaluate the pipeline only on the Penn Chinese Treebank, without loss of generality. We split the sentences of Chinese Treebank 6.0 into training, development, and test sets as shown in Table 6.2 and then apply head-finding rules to convert them into dependency structures.

Table 6.2 The training, development, and test data for CTB6

The baseline parsers are the ZPar dependency parser, MSTParser, and crfParser, which are open-source projects that have achieved state-of-the-art accuracy. They are trained on the training data in Table 6.2 with their respective default feature templates. The test results are as follows (Table 6.3).

Table 6.3 The test results of the baseline parsers

where MSTParser1 and crfParser are first-order graph-based parsers, MSTParser is a second-order parser, and ZPar is a transition-based dependency parser.

To explore the phenomenon induced by corpus segmentation, we segment the training data into different numbers of parts, namely \( {\varvec{\Theta}} = [2,3,4,6,12] \), and then build the set of subparsers \( {\varvec{\Gamma}}_{K} \) for each \( k \in\Theta \).

6.3.1 Attachment Recall Rate

The labeled/unlabeled attachment recall rate (LAR/UAR) is the ratio of correctly labeled/unlabeled attachments among the dependency structures of a result set,

$$ \begin{cases} {\text{LAR}}\left( \Pi \right) = \sum\limits_{\left( {\mathbf{R}}_{K}, r_{c} \right) \in \Pi} \delta_{L}\left( {\mathbf{R}}_{K}, r_{c} \right) \Big/ \sum\limits_{\left( {\mathbf{R}}_{K}, r_{c} \right) \in \Pi} \left| r_{c} \right| \\ {\text{UAR}}\left( \Pi \right) = \sum\limits_{\left( {\mathbf{R}}_{K}, r_{c} \right) \in \Pi} \delta_{U}\left( {\mathbf{R}}_{K}, r_{c} \right) \Big/ \sum\limits_{\left( {\mathbf{R}}_{K}, r_{c} \right) \in \Pi} \left| r_{c} \right| \end{cases} $$

where \( {\mathbf{R}}_{K} \) and \( r_{c} \) are the set of parsing results and the gold dependency structure for a sentence, \( {\varvec{\Pi}} \), generated from the test data, is a set of pairs \( \left( {{\mathbf{R}}_{K} ,r_{c} } \right) \), \( \left| {r_{c} } \right| \) is the number of relations in the dependency structure \( r_{c} \), the function \( \updelta_{L} ({\mathbf{R}}_{K} ,r_{c} ) \) counts the labeled dependency relations in \( r_{c} \) that also appear in the result set \( {\mathbf{R}}_{K} \), and \( \delta_{U} \) counts likewise without labels.
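The LAR/UAR definition can be sketched as follows, assuming a gold tree is a list of (head, dependent, label) arcs and a result set is a list of such trees; an arc counts as recalled if at least one tree in the result set contains it (our reading of \( \delta_{L} \) and \( \delta_{U} \)):

```python
def attachment_recall(pairs, labeled=True):
    """LAR (labeled=True) or UAR (labeled=False) over a set Pi of
    (result_set, gold_tree) pairs, as in the equation above."""
    hit = total = 0
    for result_set, gold in pairs:
        for head, dep, label in gold:
            total += 1                          # contributes to sum |r_c|
            for tree in result_set:
                arcs = tree if labeled else [(h, d) for h, d, _ in tree]
                probe = (head, dep, label) if labeled else (head, dep)
                if probe in arcs:               # recalled by some subparser
                    hit += 1
                    break
    return hit / total
```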

We calculate the LAR and UAR for the k-segment result set \( {\mathbf{R}}_{K} \) and for the baseline parser’s k-best results. The relationship between the number of parts and the attachment recall rate is as follows.

With data segmentation, we achieve a higher LAR and UAR than with the k-best parsing results; this means that the result set \( {\mathbf{R}}_{K} \) covers more correct dependency relations than the k-best set. This benefits postprocessing steps in the pipeline, such as reparsing and reranking, which operate on a small search space.

6.3.2 Post-reparsing

In the post-reparsing stage, we analyze the result set generated by the set of subparsers \( {\varvec{\Gamma}}_{N} \), or the N-best results produced by the baseline parser, to obtain the final parsing result; the accuracies are as follows.

For each \( k \in {\varvec{\Theta}} \), we search for the best \( \alpha \in \left[ {0,1} \right] \) when calculating the LAS/UAS of each k-segment or k-best result. The highest LAS of 2-segmentation is 83.2112 %, which is 0.7751 % higher than ZPar among the baseline parsers, while the 2-best LAS is 83.1275 %. From Fig. 6.2, we can see that the LAS of both the k-segment and k-best results is lower than ZPar’s when k > 2, and that the k-segment LAS is lower than the k-best LAS. The reason may be that the postparser, a first-order minimum-spanning-tree parser with local features, is not powerful enough to exploit the higher label attachment recall rate; that is also why only the 2-segment LAS exceeds the 2-best LAS. From Fig. 6.1, we find that the k-segment LAR ascends faster than the k-best LAR, while in Fig. 6.2 the k-best LAS descends more slowly than the k-segment LAS for k > 3; this again shows that the postparser needs a finer design. In addition, we employ the reliability score \( c\left( {e_{i} } \right) \) to rerank the result set \( {\mathbf{R}}_{K} \); the result is as follows:

Fig. 6.1

The labeled/unlabeled attachment recall rate (LAR/UAR) versus the number of parts

Fig. 6.2

The LAS/UAS of reparsing for k-segment’s and k-best result set

From Fig. 6.3, we find that using the reliability score \( c\left( {\varvec{e}_{i} } \right) \) to select dependency relations directly is feasible: the resulting LAS/UAS is higher than that of every element of the k-segment result set, although the output may not be a tree and needs further processing. The LAS/UAS of the reranked result is lower than the ZPar baseline, which may likewise be due to the weakness of the postparser.

Fig. 6.3

The LAS/UAS of reranking for the k-segment result set. The dotted line is the maximum LAS/UAS over the elements of the k-segment set

6.4 Conclusions and Future Work

We have built a hybrid parsing pipeline that employs transition-based dependency parsers as subparsers in the front end and a graph-based dependency parser in the next stage, and we have investigated the influence of different k-segment datasets on the pipeline. From the experiments, we found that segmenting the training data largely improves the labeled/unlabeled attachment recall rate at the cost of some final LAS/UAS, and that the result set generated by the subparsers, with its high attachment recall rate, can be used to calculate a reliability score, such as the simple voting scheme used in this paper. Moreover, the hybrid pipeline improves the final LAS/UAS when using 2-segmentation, but it cannot further improve accuracy because of the weak postparser. In future work, we will try a more sophisticated postparser to explore the search space constructed by the subparsers more effectively.