
1 Introduction

Machine translation quality estimation (QE) aims to automatically estimate the quality of machine translations. Unlike standard evaluation metrics such as BLEU [15], NIST [4] and METEOR [1], QE models assess translations without relying on gold references. Over the past decade, research on QE has attracted increasing attention [7], since QE can be utilized to ensure the diversity and robustness of NMT systems [25].

Fig. 1. An example of a translation that is correct at the sentence level but incorrect at the document level. We use THUMT [20] and 2M Chinese-English parallel data to train the NMT model.

Currently, mainstream QE research [2, 13, 26] focuses on sentence-level QE models, which normally ignore document-level information. However, previous studies [21, 23] have shown that document-level information is important for estimating translation quality. As shown in Fig. 1, the word “predicts” in the current translation should be “predicted” according to the context, but is wrongly translated into the present tense. Obviously, a QE model that does not consider document-level information cannot detect this kind of error.

To alleviate this problem, we propose a document-level QE model called CpQE, which introduces Centering Theory (CT) [24] to model the relations between sentences. Concretely, our CpQE model uses the Preferred Center (Cp), defined in Subsect. 3.1, to represent context features. Moreover, we adapt a BERT-based [3] sequence labeling model to extract the Cps. In addition, a semi-supervised pseudo-label learning method is adopted to alleviate the low-resource problem of Cp extraction.

2 Related Work

Traditional QE works [6, 17] relied on feature engineering, e.g. QuEst++ [19] designs word-, sentence- and document-level features for multi-level QE. Recently, neural QE methods have outperformed these hand-crafted approaches. [16] treated QE as a slot filling problem and proposed a language-independent word-level QE system using a Recurrent Neural Network (RNN). [14] proposed a stacked model with multi-task learning, which achieved the best result for word-level and sentence-level QE at that time.

More recently, the Predictor-Estimator framework [10] was reported to achieve superior performance and has become a mainstream approach for neural QE. To combine the Predictor and the Estimator into one architecture, [13] proposed a unified neural network that is trained jointly to minimize the mean absolute error over the QE training samples. Furthermore, [5] proposed a neural bilingual expert model, which replaced the RNN layers with a bidirectional transformer [22] for feature extraction, and [11] applied the pre-trained BERT model [3] as the feature extractor. However, these methods evaluate each translation independently, leading to inconsistency problems when evaluating document-level machine translation.

3 Centering Theory and Extraction of the Preferred Centers

3.1 Centering Theory and Preferred Centers

Centering Theory (CT) [8, 9, 24] is a theoretical model of the local coherence of discourse. Compared with other related theories, CT can be parameterized and calculated easily, and it provides a quantitative standard for evaluating the contextual consistency of translations. Therefore, in this work, we apply CT to capture discourse coherence information for document-level QE.

In CT, an entity in a sentence that may relate to entities in the following sentences is called a Forward-looking Center (Cf), and an entity related to entities in the previous sentences is called a Backward-looking Center (Cb). The Preferred Center (Cp) is the entity in Cf that is most likely to be associated with a Cb. For example, given a current sentence “Xiao Hong likes to wear a red skirt” and the following sentence “She went shopping today and met Xiao Fang”, the entities in the current sentence are “Xiao Hong” and “skirt”, so Cf = [“Xiao Hong”, “skirt”]; the Cb in the following sentence is “she”, i.e. Cb = [“she”]. Among the Cf, “Xiao Hong” is the most closely related to the Cb, so “Xiao Hong” is the preferred center. It should be noted that a sentence may contain more than one Cp.

3.2 The Preferred Centers Extraction Model

Conventional methods for extracting Cps are mainly rule-based. In this paper, however, we cast Cp extraction as a sequence labeling problem and construct a BERT-BiLSTM-CRF model to solve it.

Fig. 2. The overview of the preferred center extraction model.

Figure 2 presents the overview of our extraction model. The input sentences are first encoded by BERT. Then, the outputs of BERT are fed to a BiLSTM layer, in which the LSTM operations are as follows:

$$\begin{aligned} i_{t}&= \sigma (W_{i}[h_{t-1},x_{t}]+b_{i}), \end{aligned}$$
(1)
$$\begin{aligned} f_{t}&=\sigma (W_{f}[h_{t-1},x_{t} ]+b_{f} ), \end{aligned}$$
(2)
$$\begin{aligned} c_{t}&=f_{t}c_{t-1}+i_{t}\tanh (W_{c}[h_{t-1},x_{t}]+b_{c}), \end{aligned}$$
(3)
$$\begin{aligned} o_{t}&=\sigma (W_{o}[h_{t-1},x_{t}]+b_{o}), \end{aligned}$$
(4)
$$\begin{aligned} h_{t}&= o_{t}\tanh (c_{t}), \end{aligned}$$
(5)

where \(x_{t}\) represents the output of BERT. \(i_t\), \(f_t\) and \(c_t\) are the input gate, forget gate and cell vectors, respectively. \(o_t\) is the output gate and \(h_t\) is the hidden vector. The subscript t denotes the t-th time step of the LSTM.

After that, the outputs of the forward and backward LSTMs are concatenated as in Eq. (6):

$$\begin{aligned} h_{t}=[{\mathop {h_{t}}\limits ^{\longrightarrow }},{\mathop {h_{t}}\limits ^{\longleftarrow }}] \end{aligned}$$
(6)

Finally, the outputs of the BiLSTM are fed to a Conditional Random Field (CRF) [12] layer to decode the Cp labels.
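For concreteness, the following is a minimal sketch of such a BERT-BiLSTM-CRF tagger in PyTorch. The HuggingFace `transformers` library, the `pytorch-crf` package, and the BIO-style tag set are our assumptions for illustration; the paper does not fix these implementation details.

```python
# A minimal sketch of the BERT-BiLSTM-CRF tagger described above.
# Assumptions (not from the paper): HuggingFace `transformers` for BERT,
# the `pytorch-crf` package for the CRF layer, and a BIO-style tag set
# {O, B-CP, I-CP} for marking preferred-center spans.
import torch
import torch.nn as nn
from transformers import BertModel
from torchcrf import CRF


class CpTagger(nn.Module):
    def __init__(self, bert_name="bert-base-chinese", num_tags=3, lstm_hidden=256):
        super().__init__()
        self.bert = BertModel.from_pretrained(bert_name)
        self.bilstm = nn.LSTM(
            input_size=self.bert.config.hidden_size,
            hidden_size=lstm_hidden,
            bidirectional=True,
            batch_first=True,
        )
        self.emission = nn.Linear(2 * lstm_hidden, num_tags)
        self.crf = CRF(num_tags, batch_first=True)

    def forward(self, input_ids, attention_mask, tags=None):
        # BERT token representations (x_t in Eqs. (1)-(5))
        x = self.bert(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        # BiLSTM hidden states h_t = [forward h_t; backward h_t] (Eq. (6))
        h, _ = self.bilstm(x)
        emissions = self.emission(h)
        mask = attention_mask.bool()
        if tags is not None:
            # negative log-likelihood for training
            return -self.crf(emissions, tags, mask=mask, reduction="mean")
        # Viterbi decoding of the Cp label sequence
        return self.crf.decode(emissions, mask=mask)
```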

Table 1. The format of preferred center annotation.
Fig. 3. The pipeline of our semi-supervised training method.

3.3 The Semi-supervised Preferred Center Extraction Method

Since there is no public dataset for Cp extraction, we manually annotated a small-scale Cp extraction dataset. Concretely, the English corpus is annotated at the word level, while the Chinese corpus is annotated at the character level. Table 1 shows the annotation format. Considering that such a small annotated dataset is not sufficient for training an automatic annotation model, we propose a semi-supervised method to train it. The training pipeline is shown in Fig. 3.

First, we divide the annotated dataset into a training set and a development set. Then we train the BERT-BiLSTM-CRF model on these sets to obtain Model 1. After that, we label the unlabeled parallel corpus with Model 1 to obtain a pseudo-labeled dataset. Next, we filter the pseudo-labeled data with rules to alleviate the effect of noise. The rules are defined as follows:

  • Remove a sentence if the ratio of the total length of its preferred centers to the sentence length is more than 1/4.

  • For each preferred center, compute its maximum similarity to the words in the following sentence. If the similarity is less than 0.5 and the preferred center is not the subject, direct object or indirect object, mark this preferred center. If the marked preferred centers account for at least 50% of the preferred centers extracted from the sentence, the sentence is removed.

Roughly, Rule 1 limits the number of preferred centers to avoid selecting excessive entities as preferred centers merely for higher recall, and Rule 2 removes samples containing ambiguous Cps. For measuring the similarity between words, we use a word2vec model to encode the words into vectors and calculate their cosine similarity:

$$\begin{aligned} similarity(w_i,w_j)=\frac{emb_i \cdot emb_j}{\Vert emb_i\Vert \,\Vert emb_j\Vert } \end{aligned}$$
(7)

where \(emb_i\) is the vectorized representation of \(w_i\). For an out-of-vocabulary word, the similarity is set to 1 if the word itself appears in the following sentence, and to 0 otherwise.
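The following sketch illustrates the word similarity of Eq. (7) together with the two filtering rules. The gensim `KeyedVectors` interface, the model path, the dependency-role labels, and the helper names `word_similarity`, `keep_sentence` and `syntactic_roles` are hypothetical; only the 1/4, 0.5 and 50% thresholds come from the rules above.

```python
# A sketch of the word similarity (Eq. (7)) and the two filtering rules.
# The gensim KeyedVectors API is assumed; the model path, the role labels
# and all helper names are hypothetical placeholders.
import numpy as np
from gensim.models import KeyedVectors

w2v = KeyedVectors.load_word2vec_format("word2vec.bin", binary=True)  # hypothetical path


def word_similarity(w_i, w_j, next_tokens):
    """Eq. (7); out-of-vocabulary words follow the rule described above."""
    if w_i in w2v and w_j in w2v:
        e_i, e_j = w2v[w_i], w2v[w_j]
        return float(np.dot(e_i, e_j) / (np.linalg.norm(e_i) * np.linalg.norm(e_j)))
    return 1.0 if w_i in next_tokens else 0.0  # OOV: 1 if it reappears, else 0


def keep_sentence(tokens, next_tokens, cps, syntactic_roles):
    """Rules 1 and 2: decide whether a pseudo-labeled sentence is kept.
    cps: preferred centers extracted from `tokens`;
    syntactic_roles: hypothetical mapping from a Cp to its dependency role."""
    if not cps:
        return False
    # Rule 1: preferred centers must not cover more than 1/4 of the sentence
    if sum(len(cp.split()) for cp in cps) > len(tokens) / 4:
        return False
    # Rule 2: count ambiguous Cps (low similarity and not a subject/object)
    ambiguous = 0
    for cp in cps:
        sims = [word_similarity(cp, w, next_tokens) for w in next_tokens] or [0.0]
        if max(sims) < 0.5 and syntactic_roles.get(cp) not in {"nsubj", "dobj", "iobj"}:
            ambiguous += 1
    return ambiguous < 0.5 * len(cps)
```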

After filtering, the pseudo-labeled dataset is randomly sampled to obtain three sampled datasets. Each of them is combined with the initial training set to train a new model, and the model with the highest recall on the development set is chosen as Model 2. We use recall to select the optimal model because our goal is to obtain preferred centers as comprehensively as possible. This completes one iteration; the above steps are then repeated.
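Putting the pieces together, one iteration of the pipeline in Fig. 3 can be summarized as follows. The helper functions `train_tagger`, `predict_labels` and `recall_on` are hypothetical placeholders, and `keep_sentence` refers to the filtering sketch above.

```python
# One iteration of the semi-supervised pipeline sketched in Fig. 3.
# train_tagger, predict_labels and recall_on are hypothetical placeholders;
# keep_sentence is the filtering sketch shown earlier.
import random


def one_iteration(model, train_set, dev_set, unlabeled_corpus, num_samples=3):
    # 1. pseudo-label the unlabeled corpus with the current model
    pseudo = predict_labels(model, unlabeled_corpus)
    # 2. filter noisy pseudo labels with Rule 1 and Rule 2
    filtered = [ex for ex in pseudo if keep_sentence(*ex)]
    # 3. train one candidate model per random sample of the filtered data
    candidates = []
    for _ in range(num_samples):
        sampled = random.sample(filtered, k=min(len(filtered), len(train_set)))
        candidates.append(train_tagger(train_set + sampled, dev_set))
    # 4. keep the candidate with the highest recall on the development set
    return max(candidates, key=lambda m: recall_on(m, dev_set))
```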

Fig. 4. The overall framework of our CpQE model.

4 The Quality Estimation Model

In this section, we present our CT-based document-level QE model. As shown in Fig. 4, the outer-extractor extracts preferred-center features from two aspects: first, it obtains the embeddings of preferred centers on both the source and target sides; second, it computes the consistency between the current sentence and its context on both sides. Finally, these two types of features, together with the intra-sentence features extracted by the inner-extractor, are passed to the quality evaluator for scoring.

4.1 The Inner-Extractor

As shown in Fig. 5, the encoder of the inner-extractor is a standard transformer encoder [22], while the decoder is bidirectional. The forward self-attention network decodes the target words from left to right, and the backward self-attention network decodes them from right to left. Combining the two self-attention networks allows the model to attend to the whole sentence.

Fig. 5. The architecture of the inner-extractor.
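As a rough illustration of the bidirectional decoder, the sketch below builds the left-to-right and right-to-left self-attention passes from standard PyTorch transformer modules and simply concatenates their outputs; the exact way the two directions are combined in our model is not specified by this sketch, so the concatenation is an assumption.

```python
# A rough sketch of the bidirectional decoding used by the inner-extractor,
# built from standard PyTorch transformer modules. Concatenating the two
# directional outputs is an assumption about how they are combined.
import torch
import torch.nn as nn


def directional_mask(T, device, reverse=False):
    # Boolean masks for nn.Transformer: True marks positions that may NOT be attended.
    m = torch.triu(torch.ones(T, T, dtype=torch.bool, device=device), diagonal=1)
    return m.T if reverse else m  # transposing blocks the past instead of the future


class BiDirectionalInnerExtractor(nn.Module):
    def __init__(self, d_model=512, nhead=8, num_layers=6):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers)
        self.fwd_decoder = nn.TransformerDecoder(dec_layer, num_layers)
        self.bwd_decoder = nn.TransformerDecoder(dec_layer, num_layers)

    def forward(self, src_emb, tgt_emb):
        # src_emb, tgt_emb: already-embedded source and translation, (B, T, d_model)
        T, device = tgt_emb.size(1), tgt_emb.device
        memory = self.encoder(src_emb)
        # forward (left-to-right) and backward (right-to-left) self-attention
        h_fwd = self.fwd_decoder(tgt_emb, memory, tgt_mask=directional_mask(T, device))
        h_bwd = self.bwd_decoder(tgt_emb, memory,
                                 tgt_mask=directional_mask(T, device, reverse=True))
        # together the two directions cover the whole sentence
        return torch.cat([h_fwd, h_bwd], dim=-1)
```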

4.2 The Outer-Extractor

The outer-extractor extracts Cp features from two aspects: sentence relation features and embeddings of the preferred centers. The sentence relation features evaluate the coherence between the source text and the translation. We define four rules for designing these features:

  • The number of preferred centers of the current sentence on the source and target sides, and the difference between the two numbers.

  • The number of preferred centers of the previous sentence on the source and target sides, and the difference between the two numbers.

  • The similarity between the preferred centers of the previous sentence and the current sentence on the source and target sides, and the difference between the two similarities.

  • The similarity between the preferred centers of the previous sentence and the preferred centers of the current sentence on the source and target sides, and the difference between the two similarities.

Rules 1 and 2 focus on the number of preferred centers, which reflects the consistency between the source text and the translation to some extent. Rule 3 uses a quantitative measurement to evaluate the consistency between the previous sentence and the current sentence. Rule 4 measures the change of entities, which reflects a change of topic. If a sentence is at the beginning of a document, the preferred center set of its previous sentence is treated as empty; likewise, the preferred center set of the last sentence in a document is empty. The similarity between two sequences is computed as follows:

$$\begin{aligned} \begin{aligned} similarity(l_1,l_2)=&\frac{l_{v1}+l_{v2}}{L_1+L_2}cosine(emb_{w_{v1}},emb_{w_{v2}}) \\&+\frac{2}{L_1+L_2}\sum _{w \in w_{o1}}f(w,l_2) \end{aligned} \end{aligned}$$
(8)
$$\begin{aligned} f(w,l_2)=\left\{ \begin{aligned} 1&,\ w \in l_2, \\ -1&,\ w \notin l_2. \end{aligned} \right. \end{aligned}$$
(9)

where \(w_{v1}\) denotes the words in sequence 1 that can be found in the vocabulary, while \(w_{o1}\) denotes the words in sequence 1 that are out of the vocabulary. \(l_{v1}\) is the length of \(w_{v1}\) and \(L_1\) is the length of sequence 1. The embeddings of \(w_{v1}\) and \(w_{v2}\) are calculated by the Word2Vec model. According to the four rules, we design 12 features to represent the sentence relation information. The running process of the outer-extractor is provided in Appendix A.
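One possible reading of Eqs. (8)-(9) is sketched below. The equations do not specify how \(emb_{w_{v1}}\) aggregates several in-vocabulary words; here we assume the mean of their word2vec vectors, and the `w2v` argument is a gensim `KeyedVectors` model such as the one loaded in the earlier sketch.

```python
# One possible reading of Eqs. (8)-(9): the in-vocabulary parts of the two
# sequences are compared through averaged word2vec embeddings (an assumption),
# and out-of-vocabulary words in sequence 1 contribute +1/-1 via Eq. (9).
import numpy as np


def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def sequence_similarity(l1, l2, w2v):
    """l1, l2: token lists (e.g. preferred centers of two sentences); w2v: KeyedVectors."""
    L1, L2 = len(l1), len(l2)
    if L1 + L2 == 0:
        return 0.0
    in1 = [w for w in l1 if w in w2v]
    in2 = [w for w in l2 if w in w2v]
    oov1 = [w for w in l1 if w not in w2v]
    score = 0.0
    if in1 and in2:
        emb1 = np.mean([w2v[w] for w in in1], axis=0)
        emb2 = np.mean([w2v[w] for w in in2], axis=0)
        # first term of Eq. (8): length-weighted cosine of the in-vocabulary parts
        score += (len(in1) + len(in2)) / (L1 + L2) * cosine(emb1, emb2)
    # second term of Eq. (8): Eq. (9) applied to each out-of-vocabulary word of l1
    score += 2.0 / (L1 + L2) * sum(1.0 if w in l2 else -1.0 for w in oov1)
    return score
```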

4.3 The Evaluator

Finally, we provide the features to the evaluator. Since the preferred center embedding is a word-level feature, while the sentence relation feature describes both the sentence and its context, we integrate the preferred center embedding before the BiLSTM, and the sentence relation feature is concatenated with the whole-sentence feature output by the BiLSTM:

$$\begin{aligned} {\mathop {h_{1:T+n}}\limits ^{\longrightarrow }},{\mathop {h_{1:T+n}}\limits ^{\longleftarrow }}=BiLSTM(f) \end{aligned}$$
(10)
$$\begin{aligned} f=[f_{inner};CpEmb] \end{aligned}$$
(11)

where T is the length of the translation and n is the number of preferred centers. \(f_{inner}\) represents the features extracted by the inner-extractor. The sentence relation feature makes the evaluator focus on the consistency between the source text and the translation. Finally, a sigmoid function \(\sigma \) is used to score the translations:

$$\begin{aligned} Score=\sigma (w^T[{\mathop {h_{1:T+n}}\limits ^{\longrightarrow }};{\mathop {h_{1:T+n}}\limits ^{\longleftarrow }};f_{outer}]) \end{aligned}$$
(12)

where w is a trainable parameter vector and \(f_{outer}\) represents the features extracted by the outer-extractor. The optimization objective is calculated as follows:

$$\begin{aligned} \mathop {\arg \min }\ \Vert HTER-Score\Vert ^2_2 \end{aligned}$$
(13)
$$\begin{aligned} HTER=\frac{N_{edit}}{N_{reference}} \end{aligned}$$
(14)

where \(N_{edit}\) is the number of edits from the translation to the reference and \(N_{reference}\) is the number of words in the reference. The Human-targeted Translation Edit Rate (HTER) [18] is the most widely used metric for QE. Calculating HTER requires finding the closest reference for the translation and then computing the edit rate from the translation to that reference.
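The sketch below shows how the evaluator of Eqs. (10)-(12) and the objective of Eqs. (13)-(14) fit together. The feature dimensions, the use of the final forward and backward hidden states as the sentence representation, and the plain word-level edit distance (via the `editdistance` package) as a stand-in for TER-style edits are our assumptions, not details fixed by the paper.

```python
# A minimal sketch of the evaluator (Eqs. (10)-(12)) and the training
# objective (Eqs. (13)-(14)). Dimensions and the sentence pooling are
# assumptions; HTER is approximated here by plain word-level edit distance,
# whereas the real metric also allows shift operations.
import editdistance
import torch
import torch.nn as nn


class Evaluator(nn.Module):
    def __init__(self, feat_dim, outer_dim=12, hidden=256):
        super().__init__()
        self.bilstm = nn.LSTM(feat_dim, hidden, bidirectional=True, batch_first=True)
        self.w = nn.Linear(2 * hidden + outer_dim, 1)  # w in Eq. (12)

    def forward(self, f_inner, cp_emb, f_outer):
        # Eq. (11): concatenate inner features (length T) and Cp embeddings (length n)
        f = torch.cat([f_inner, cp_emb], dim=1)                # (B, T+n, feat_dim)
        h, _ = self.bilstm(f)                                  # Eq. (10)
        hid = self.bilstm.hidden_size
        fwd_final, bwd_final = h[:, -1, :hid], h[:, 0, hid:]   # pooled sentence states
        sent = torch.cat([fwd_final, bwd_final, f_outer], dim=-1)
        return torch.sigmoid(self.w(sent)).squeeze(-1)         # Score in Eq. (12)


def hter(translation_tokens, reference_tokens):
    # Eq. (14): edits from translation to reference, normalized by reference length
    return editdistance.eval(translation_tokens, reference_tokens) / len(reference_tokens)


# Eq. (13): regression toward HTER with a squared-error loss
loss_fn = nn.MSELoss()
# loss = loss_fn(model(f_inner, cp_emb, f_outer), hter_targets)
```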

5 Experiments

5.1 Metrics

For preferred center extraction, our goal is to maximize the total number of preferred centers that are correctly tagged by our method, so we use standard Accuracy and Recall scores to measure the performance of our BERT-based extraction model.

For the quality estimation model, following previous works such as [5, 14], we use the Pearson correlation coefficient, which is calculated as follows:

$$\begin{aligned} \rho _{X,Y} = \frac{\sum _{i=1}^{n} (x_i - \mu _X)(y_i - \mu _Y)}{\sqrt{\sum _{i=1}^{n} (x_i - \mu _X)^{2}\sum _{i=1}^{n}(y_i - \mu _Y)^{2}}} \end{aligned}$$
(15)

where n is the number of samples, and \(\mu _X\) and \(\mu _Y\) denote the means of the samples. A larger coefficient indicates that X and Y are more strongly correlated.
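For reference, Eq. (15) can be computed directly, e.g. with NumPy:

```python
# Eq. (15) computed directly with NumPy.
import numpy as np


def pearson(x, y):
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    return float(np.sum(xc * yc) / np.sqrt(np.sum(xc ** 2) * np.sum(yc ** 2)))
```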

Table 2. Preferred center extraction performance
Table 3. Pearson correlation coefficients of the models. CpQE+CpRuled indicates that the preferred centers are extracted by rules; CpQE+CpSeq indicates that they are extracted by our sequence labeling model.

5.2 Dataset Description

Owing to the lack of document-level QE corpora, we manually annotated an open-source Chinese-English document-level dataset. Concretely, our document-level QE corpus is built from the test set of the WMT2019 MT automatic evaluation task. We select 996 Chinese source sentences from the corpus, covering 112 articles each shorter than 14 sentences, together with the corresponding 1,992 English translations. HTER values are computed for these 1,992 translations to construct our corpus.

For the preferred center extraction experiment, we use our annotated dataset, which includes 1,432 Chinese sentences and 1,432 English sentences. The Chinese-English parallel corpus used to generate pseudo-labeled data comes from the FBIS corpus and contains 10,355 documents and 228,611 sentence pairs.

For the quality estimation experiment, we use the CCMT19 Chinese-English sentence-level translation quality estimation dataset with 11,213 sentences and our document-level QE corpus with 1,992 sentences. We randomly select 50% of the sentences and delete or replace 20%–70% of their words, enlarging the corpus to 2,565 sentences. The Word2Vec model is trained on 23 GB of Chinese and English monolingual text from Wikipedia and Sohu News. The CCMT19 Chinese-English parallel corpus and the FBIS Chinese-English corpus are used to train the inner-extractor.

Table 4. Case study results.

5.3 Preferred Centers Extraction

In this experiment, we use a rule-based method as the baseline, in which StanfordNLP is used for syntactic analysis and nominal subjects, clausal subjects, direct objects and indirect objects are chosen as preferred centers. The setup of our model is presented in Appendix B.

The experimental results are shown in Table 2. With our semi-supervised method, the model is trained for two iterations on both the Chinese and English data. The recall and accuracy of the Chinese Model 3 reach 60.70% and 59.44% respectively, and the English Model 3 reaches 63.61% recall and 61.08% accuracy. Both semi-supervised models significantly outperform the rule-based model, and each iteration performs better than the previous one, indicating that the proposed semi-supervised method improves the model. We focus on recall because we want to obtain preferred centers as comprehensively as possible.

5.4 QE Results

In this experiment, we use a Transformer-based feature extractor-evaluator as the baseline model; compared with the baseline, our model introduces an inner-extractor. The setup of the CpQE model is shown in Appendix C, and the results are shown in Table 3. The Pearson correlation coefficients measure the correlation between the model score and HTER. In sentence-level QE, the differences among the three models are about 0.01. In document-level QE, our CpQE+CpSeq model achieves the best performance with 0.6035, outperforming the baseline by 0.0499. The rule-based Cp extractor, with only 40.43% recall, still improves the QE model, indicating that besides the preferred centers themselves, other information also plays a role in document-level QE. When the recall of Cp extraction increases, the performance of the QE model further improves, which shows the effectiveness of the preferred centers. In sentence-level QE, owing to the way context-boundary features are obtained, the proposed model cannot obtain any hint of the preferred centers, which is equivalent to having no additional information, so its performance is comparable to that of the baseline model.

5.5 Case Study

As shown in Table 4, we provide an example of the CpQE model and the baseline model scoring a translation in document-level QE.

In the given example, the word “ (china-europe train)” has two meanings: “the train from China to Europe” and “the train in central Europe”. Since the previous sentences of the same document have mentioned the train from Chengdu, China to Europe, the word in this sentence should be translated as “the train from China to Europe”. Unfortunately, the translation output to be evaluated, i.e. mt1, provides an incorrect translation in which the word “China” is missing. To test whether our proposed document-level QE system is sensitive to such errors, we simply recover the missing word “China” while ignoring the other mistakes in mt1, producing another output, namely mt2. We then evaluate the two outputs with the baseline model and our model, respectively. The results show that both models reflect the decrease in the edit rate. The relative reduction given by our model is larger than that of the baseline model, which is consistent with the HTER values listed in the fourth column. These results imply that our proposed model is more sensitive to such errors.

6 Conclusion

This work focuses on document-level machine translation quality estimation. Concretely, based on the concept of the Preferred Center in Centering Theory and its evaluation of local textual coherence, we manually annotated a small-scale dataset for Preferred Center extraction. We then trained a model to extract Preferred Centers from given texts and incorporated the extracted Preferred Centers as context information into the Predictor-Estimator model to improve QE performance. Furthermore, we constructed a document-level Chinese-English QE dataset to measure the performance of our document-level QE models.