
1 Introduction

Quality estimation (QE) refers to the task of evaluating the quality of MT results without any human-annotated references [2]. We participate in the CCMT 2019 QE task in both the EN\(\rightarrow \)ZH and ZH\(\rightarrow \)EN directions. Each direction consists of two subtasks: word-level and sentence-level. The word-level task is to predict OK/BAD labels for each word and gap in the translation results, corresponding to mistranslation, over-translation and under-translation. The sentence-level task is to predict the Human-targeted Translation Edit Rate (HTER) score [14], which represents the overall quality of a translation result.

In early works, human-crafted features were widely used. A typical framework was QUEST++ [15], which provided a variety of features and machine learning methods for building QE models. In recent years, neural models have significantly improved performance on this task. Kim et al. [7] proposed a neural network architecture called predictor-estimator, which adopts a bilingual recurrent neural network (RNN) language model [9] as the predictor to extract feature vectors and a bidirectional RNN as the estimator to predict QE scores. Fan et al. [5] introduced a bidirectional Transformer-based pre-trained model for feature extraction and used 4-dimensional mismatching features from this model to improve performance.

In our work, all the tasks we submit share the same model architecture based on the predictor-estimator. We pre-train left-to-right and right-to-left deep Transformer models on a large amount of bilingual data as predictors. Byte-pair-encoding (BPE) [12] tokenization is applied to reduce the number of unknown tokens. A multi-layer Bi-GRU is then used as the estimator and is jointly trained with the predictors on the quality estimation task data. We cast the word-level tasks as binary classification problems and the sentence-level tasks as regression problems, so that the estimator predicts labels or scores from the feature information extracted by the predictor.

To further improve the performance of the predictor, we use target-side monolingual data to construct pseudo-data with various back-translation [3] methods, including beam search, sampling and sampling-topk [4]. Due to the scarcity of QE data, we also construct pseudo QE data: we regard the real target-side sentences of the bilingual data as post-edited results and use beam search, sampling or sampling-topk to generate machine translation results. Finally, we use the TER tool [14] to generate word-level OK/BAD labels and sentence-level HTER scores.

Our system also employs an ensemble strategy to further improve performance. We train multiple sub-models and fuse their outputs by voting or averaging, depending on the task.

Fig. 1. The architecture of our model based on the predictor-estimator.

2 Deep Transformer

A strong and effective feature extraction model is essential for the estimator to make accurate predictions. We choose a pre-trained machine translation model to extract features. Neural machine translation (NMT) based on multi-layer self-attention has shown strong results on many translation tasks. To improve translation performance and extract the information contained in the sentences more fully, we apply the pre-norm Transformer-DLCL structure. In this section, we describe the details of our deep architecture as follows:

Pre-norm Transformer: For the Transformer [16], learning deeper networks [1] is not easy because they are difficult to optimize due to the gradient vanishing/exploding problem. However, Wang et al. [17] emphasized that the location of layer normalization [8] plays a vital role when training deep Transformers. In the original Transformer, layer normalization is placed after the element-wise residual addition, while in more recent implementations it is applied to the input of every sublayer, which provides a direct path for passing the error gradient from top to bottom. In this way, the pre-norm Transformer is easier to train than the post-norm (vanilla) Transformer as the model goes deeper.
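
For concreteness, here is a minimal PyTorch sketch (not the authors' implementation) contrasting the two residual sublayer layouts; `sublayer` stands for either the self-attention or the feed-forward block, and dropout is omitted.

```python
# Minimal sketch: post-norm vs. pre-norm residual sublayers.
import torch
import torch.nn as nn

class PostNormSublayer(nn.Module):
    """Vanilla Transformer: LayerNorm after the residual addition."""
    def __init__(self, d_model, sublayer):
        super().__init__()
        self.sublayer = sublayer
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):
        return self.norm(x + self.sublayer(x))

class PreNormSublayer(nn.Module):
    """Pre-norm Transformer: LayerNorm on the sublayer input, so the residual
    path passes the error gradient directly from top to bottom."""
    def __init__(self, d_model, sublayer):
        super().__init__()
        self.sublayer = sublayer
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):
        return x + self.sublayer(self.norm(x))
```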

Transformer-DLCL: In addition, a dynamic linear combination of previous layers (DLCL) [17] is used in our Transformer model. Transformer-DLCL employs direct links to all previous layers, offering efficient access to lower-level representations in a deep stack. An additional weight matrix \(W_{l+1} \in R^{L \times L}\) is used to weight each incoming layer in a linear manner. This method can be formulated as:

$$\begin{aligned} \varPsi (y_0, y_1, \ldots , y_l) = \sum _{k=0}^{l}{W_k^{l+1}\mathrm {LN}(y_k)} \end{aligned}$$
(1)

Equation 1 provides a way to learn the preference for layers at different levels of the stack, where \(\varPsi (y_0, y_1, \ldots , y_l)\) is the combination of the previous layer representations. Furthermore, this method is architecture-agnostic and can be integrated with either the pre-norm Transformer or the relative position Transformer [13] for further enhancement. More details can be found in Wang et al. [17].
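
A minimal PyTorch sketch of Eq. 1 may clarify the mechanism; this is not the implementation of Wang et al. [17], and the averaging initialization and module names are illustrative assumptions.

```python
# Sketch of the dynamic linear combination of layers (Eq. 1).
import torch
import torch.nn as nn

class DLCL(nn.Module):
    """Row l of `weights` mixes the (l+1) outputs y_0..y_l that feed layer l+1."""
    def __init__(self, num_layers, d_model):
        super().__init__()
        n = num_layers + 1  # y_0 is the embedding-layer output
        self.weights = nn.Parameter(torch.zeros(n, n))
        self.norms = nn.ModuleList(nn.LayerNorm(d_model) for _ in range(n))
        with torch.no_grad():
            for l in range(n):
                self.weights[l, : l + 1] = 1.0 / (l + 1)  # start from averaging

    def combine(self, layer_outputs):
        """layer_outputs: [y_0, ..., y_l], each of shape (batch, seq, d_model)."""
        l = len(layer_outputs) - 1
        w = self.weights[l, : l + 1]
        return sum(w[k] * self.norms[k](y_k) for k, y_k in enumerate(layer_outputs))
```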

We use a Transformer-DLCL model with a 25-layer encoder; Table 1 shows the performance of Transformer-DLCL compared with Transformer-Base and Transformer-Big.

Table 1. BLEU score and \(\bigtriangleup \) BLEU [%] on WMT ZH\(\rightarrow \)EN and EN\(\rightarrow \)ZH newstest2017.

3 System

3.1 Architecture

The model architecture of the whole system is presented in Fig. 1. It consists of two parts: a predictor that combines left-to-right and right-to-left pre-norm Transformer-DLCL models, and an estimator built from a multi-layer Bi-GRU. The predictor is used to extract semantic information from the given machine translation results according to the source-side sentences. To fully consider the forward and backward information in the sentences, we use the left-to-right and right-to-left translation models to extract bidirectional semantic information independently and then fuse them to obtain the quality vectors. The quality vector is then fed into the bidirectional GRU to predict the HTER score or OK/BAD labels. We first pre-train the forward and backward translation models, then jointly train the estimator with the predictor to maximize the evaluation capability of the system.

3.1.1 Deep Bi-Predictor

The sequence-to-sequence Transformer models [16] are powerful at extracting information and have proven strong in many translation tasks; the pre-norm Transformer-DLCL further improves this feature extraction ability. The encoder receives the input sequence \(x=\{x_0, x_1, \ldots , x_n\}\) and maps it to a vector sequence \(z=\{z_0, z_1, \ldots , z_n\}\) of the same length, which contains the source sentence features. The decoder takes the translation sequence \(y=\{y_0, y_1, \ldots , y_m\}\) as input and generates a top-level representation containing sufficient semantic and grammatical information.

Because of the decoder mask, a unidirectional model cannot observe future information. To make the extracted vectors contain sufficient context, we use left-to-right and right-to-left translation models and extract the feature vectors l2r and r2l independently. The final quality vector is obtained by concatenation (\(q=[l2r:r2l]\)).
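
The following sketch illustrates how the two predictors could be fused into the quality vector, assuming hypothetical predictor modules that return the top decoder representation under teacher forcing; padding handling is omitted.

```python
# Sketch: fusing left-to-right and right-to-left predictor features into q.
import torch

def quality_vector(l2r_predictor, r2l_predictor, src_tokens, tgt_tokens):
    # Each predictor is assumed to return the top decoder representation of
    # shape (batch, tgt_len, d_model) via teacher forcing on tgt_tokens.
    l2r = l2r_predictor(src_tokens, tgt_tokens)                 # left-to-right features
    r2l = r2l_predictor(src_tokens, tgt_tokens.flip(dims=[1]))  # right-to-left features
    r2l = r2l.flip(dims=[1])             # re-align to left-to-right word order
    return torch.cat([l2r, r2l], dim=-1)  # q = [l2r : r2l]
```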

3.1.2 Bi-GRU Estimator

RNNs are widely used for sequence modeling, and we use a Bi-GRU as our estimator. The Bi-GRU consists of a forward and a backward part. It reads the quality vector q and computes the forward hidden states \((\overrightarrow{\mathbf {h}}_{1}, \cdots , \overrightarrow{\mathbf {h}}_{T})\) and backward hidden states \((\overleftarrow{\mathbf {h}}_{1}, \cdots , \overleftarrow{\mathbf {h}}_{T})\) respectively, where T is the sequence length. We obtain the representation of each word by concatenating the forward hidden state \(\overrightarrow{\mathbf {h}}_{j}\) and the backward one \(\overleftarrow{\mathbf {h}}_{j}\), \(\varvec{h}_{j}=[\overrightarrow{\mathbf {h}}_{j}, \overleftarrow{\mathbf {h}}_{j}]\). We convert the word-level tasks into classification problems; Eqs. 2 and 3 show our objectives for the word and gap subtasks, respectively. The sentence-level task is converted into a regression problem, as shown in Eq. 4.

$$\begin{aligned}&\arg \min \sum _{j=1}^{T} \mathbf {cross\_entropy}\left( y_{j}, \mathbf {W}_{1}\varvec{h}_{j}\right) \end{aligned}$$
(2)
$$\begin{aligned}&\arg \min \sum _{j=0}^{T} \mathbf {cross\_entropy}\left( y_{j}, \mathbf {W}_{2}\mathbf {Conv}(\varvec{h}_{j},\varvec{h}_{j+1})\right) \end{aligned}$$
(3)
$$\begin{aligned}&\arg \min \left\| h-\mathbf {sigmoid}\left( \mathbf {W}_{3}\varvec{h}_{T}\right) \right\| _{2}^{2} \end{aligned}$$
(4)

where h is the real HTER score, \(y_j\) are the real labels, \(\mathbf {W}_{1}\), \(\mathbf {W}_{2}\) and \(\mathbf {W}_{3}\) are trainable parameter matrices, and T is the length of the target side. \(\mathbf {cross\_entropy}\) is the cross-entropy loss (with logits), and \(\mathbf {Conv}\) is a convolution operation that fuses information from adjacent positions for predicting gap tags.
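
To make Eqs. 2-4 concrete, here is a minimal PyTorch sketch of the estimator; the hidden size, number of layers, the boundary padding for the gap convolution and the use of the last time step for sentence-level regression are illustrative assumptions rather than our exact configuration.

```python
# Sketch of the Bi-GRU estimator with word, gap and sentence-level heads.
import torch
import torch.nn as nn
import torch.nn.functional as F

class BiGRUEstimator(nn.Module):
    def __init__(self, q_dim, hidden=512):
        super().__init__()
        self.gru = nn.GRU(q_dim, hidden, num_layers=2,
                          batch_first=True, bidirectional=True)
        self.word_head = nn.Linear(2 * hidden, 2)   # W1: OK/BAD per word
        self.gap_conv = nn.Conv1d(2 * hidden, 2 * hidden, kernel_size=2)  # fuse h_j, h_{j+1}
        self.gap_head = nn.Linear(2 * hidden, 2)    # W2: OK/BAD per gap
        self.sent_head = nn.Linear(2 * hidden, 1)   # W3: HTER regression

    def forward(self, q):
        h, _ = self.gru(q)                          # (batch, T, 2*hidden)
        word_logits = self.word_head(h)             # Eq. 2
        # Pad so that the convolution also covers the boundary gaps.
        h_pad = F.pad(h.transpose(1, 2), (1, 1))    # (batch, 2*hidden, T+2)
        gap_logits = self.gap_head(self.gap_conv(h_pad).transpose(1, 2))  # Eq. 3, T+1 gaps
        hter = torch.sigmoid(self.sent_head(h[:, -1]))                    # Eq. 4
        return word_logits, gap_logits, hter.squeeze(-1)
```

During joint training, the word and gap logits would be optimized with (weighted) cross-entropy against the OK/BAD labels, and the HTER output with a squared error loss, following Eqs. 2-4.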

We dynamically adjust the number of Bi-GRU layers according to the amount of available data. We also try a self-attention layer and a self-attention layer + Bi-GRU as the estimator, but find no improvement in performance. Nevertheless, we keep them as candidate models for the ensemble to enhance diversity.

3.1.3 BPE Matrix

BPE is used to reduce the number of unknown tokens in many NLP tasks, and we also apply it to our model. However, this creates a problem for the word-level task: the length \(L_{b}\) of the quality vector extracted by the predictor differs from the number \(L_{w}\) of word tokens in the sentence. Following Fan et al. [5], we solve this with an \(L_{w} \times L_{b}\) sparse matrix that averages the features of the subwords belonging to each word token, reducing the length of the quality vector from \(L_{b}\) to \(L_{w}\).
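
A small sketch of how such an averaging matrix could be built, assuming subwords carry the "@@" continuation marker used by subword-nmt:

```python
# Sketch: L_w x L_b matrix that averages subword features into word features.
import torch

def bpe_to_word_matrix(bpe_tokens):
    """M has shape (num_words, num_subwords); M @ features averages the
    subword features belonging to each word."""
    groups, current = [], []
    for i, tok in enumerate(bpe_tokens):
        current.append(i)
        if not tok.endswith("@@"):       # last subword of the current word
            groups.append(current)
            current = []
    if current:                          # trailing subwords without terminator
        groups.append(current)
    m = torch.zeros(len(groups), len(bpe_tokens))
    for w, idxs in enumerate(groups):
        m[w, idxs] = 1.0 / len(idxs)     # average the subwords of each word
    return m

# Example: ["trans@@", "lation", "quality"] -> a 2 x 3 matrix (2 words, 3 subwords).
```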

3.2 Data Construction

3.2.1 Bilingual Data for Pre-training

We use the WMT 2019 ZH-EN parallel data to pre-train our predictors, which consists of the CWMT, wikititles-v1, NewsCommentary-v14 and UN corpora. After filtering, about 11M sentence pairs are selected. Furthermore, we use 6M monolingual sentences from WMT 2019 to construct pseudo-data by back-translation [3] in both directions. All parallel data is segmented with the NiuTrans [18] word segmentation toolkit. After preprocessing, we train BPE [12] models with 32,000 merge operations for each side.

3.2.2 Quality Estimation Data

The dataset for the QE task consists of three parts: source sentences, machine translations and QE scores (HTER scores for the sentence level or OK/BAD labels for the word level). The amount of data provided by the CCMT 2019 QE task is no more than 15K sentences, which we consider insufficient to train a strong model, so we construct 50K pseudo-data items using parallel data from WMT 2019. To obtain high-quality bilingual data, we score the parallel data with a machine translation model and a language model: first, we score the real bilingual data with the translation model by forced decoding; second, we score the source and target sentences with language models; we then combine the three scores to sort the real data and select the sentence pairs with higher scores. After obtaining the high-quality bilingual data, we decode it in a variety of ways to obtain machine translation outputs, including beam search [11] and sampling-topk. We regard the target sentences of the bilingual data as post-edited data and generate the sentence-level HTER scores or word-level labels with the TER tool [14].
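
The data-selection step can be sketched as follows, where `nmt_score`, `src_lm_score` and `tgt_lm_score` are hypothetical scoring helpers (forced-decoding log-probability and language-model log-probabilities) and the combination weights are illustrative:

```python
# Sketch: rank parallel sentence pairs by a combined translation/LM score.
def select_high_quality(pairs, nmt_score, src_lm_score, tgt_lm_score,
                        top_n=50000, w_nmt=1.0, w_src=0.5, w_tgt=0.5):
    scored = []
    for src, tgt in pairs:
        score = (w_nmt * nmt_score(src, tgt)   # forced decoding on real bitext
                 + w_src * src_lm_score(src)   # source-side LM score
                 + w_tgt * tgt_lm_score(tgt))  # target-side LM score
        scored.append((score, src, tgt))
    scored.sort(key=lambda x: x[0], reverse=True)  # higher combined score first
    return [(src, tgt) for _, src, tgt in scored[:top_n]]
```

In practice, length normalization of the scores would also matter; it is omitted here for brevity.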

In addition, we find that the ratio of OK to BAD labels in the word gap subtask is about 20:1, which means the BAD labels between words (corresponding to missing translations) are too few, making it hard for the trained model to predict BAD. We therefore randomly drop some words in our machine translation results to increase the number of BAD labels between words.
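
A minimal sketch of this word-dropping trick; the drop probability and minimum length are illustrative assumptions. Deleting tokens from the machine translation output creates missing-word positions, so the TER labelling then produces more BAD gap labels.

```python
# Sketch: randomly drop tokens from an MT output to create more gap BAD labels.
import random

def drop_words(mt_tokens, drop_prob=0.1, min_len=3):
    if len(mt_tokens) <= min_len:
        return list(mt_tokens)
    kept = [tok for tok in mt_tokens if random.random() > drop_prob]
    return kept if len(kept) >= min_len else list(mt_tokens)
```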

3.3 Model Ensemble

In MT systems, ensemble decoding is widely used to boost translation quality by integrating the predictions of several single models at each decoding step. We try a similar approach for the QE task. However, we find that this kind of ensemble becomes expensive as more models are fused, so it is infeasible to try many combinations within limited time. We therefore adopt an external fusion method:

  • We select twelve high-scoring single models trained with different model architectures or datasets and decode 12 sets of results as candidates.

  • We enumerate all combinations of the twelve models externally.

  • For each combination, we ensemble the word-level tasks by voting and the sentence-level tasks by averaging the HTER scores.

  • We pick the best-performing combination.

In this way, we can quickly try out all the candidate combinations and more easily pick the optimal one.
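
The external fusion can be sketched as follows, with hypothetical prediction containers and a `dev_score` helper standing in for the validation metric (e.g. F1-mult or Pearson):

```python
# Sketch: external ensemble by voting (word-level) and averaging (sentence-level).
from itertools import combinations
from collections import Counter

def vote_words(label_seqs):
    # Majority vote per position across the selected models (ties resolved arbitrarily).
    return [Counter(labels).most_common(1)[0][0] for labels in zip(*label_seqs)]

def best_combination(word_preds, hter_preds, dev_score):
    """word_preds[i][s]: OK/BAD label sequence of model i on sentence s;
    hter_preds[i][s]: its HTER score; dev_score evaluates fused predictions."""
    best, best_combo = float("-inf"), None
    num_models = len(word_preds)
    num_sents = len(word_preds[0])
    for r in range(1, num_models + 1):
        for combo in combinations(range(num_models), r):
            fused_words = [vote_words([word_preds[i][s] for i in combo])
                           for s in range(num_sents)]
            fused_hter = [sum(hter_preds[i][s] for i in combo) / len(combo)
                          for s in range(num_sents)]
            score = dev_score(fused_words, fused_hter)
            if score > best:
                best, best_combo = score, combo
    return best_combo, best
```

With twelve candidates there are only 4095 combinations, so the exhaustive search finishes quickly.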

Table 2. Word-level (word subtask) results on CCMT QE valid2019. We use a joint l2r and r2l Transformer-DLCL as the predictor and a Bi-GRU as the estimator, jointly trained with different datasets.

4 Experiments and Results

We implement our QE models based on Fairseq [10]. The Transformer-DLCL models are pre-trained on eight 1080Ti GPUs. We use the Adam optimizer with \(\beta _{1}=0.97\), \(\beta _{2}=0.997\) and \(\epsilon =10^{-6}\). The training data is reshuffled after each epoch, and we batch sentence pairs by target-side sentence length with 8192 tokens per GPU. A large learning rate and many warmup steps are chosen for faster convergence: we set the maximum learning rate to 0.002 and the warmup steps to 8000. The predictor-estimator architecture is jointly trained on one 1080Ti GPU with 1024 tokens per step, a maximum learning rate of 0.0005 and 200 warmup steps.
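
As an illustration, the pre-training optimizer settings could be reproduced outside Fairseq roughly as below; the inverse square-root decay after linear warmup is an assumption based on common Transformer practice, not a detail stated above.

```python
# Sketch: Adam with linear warmup to the maximum learning rate, then inverse
# square-root decay.
import torch

def build_optimizer(model, max_lr=0.002, warmup_steps=8000):
    optimizer = torch.optim.Adam(model.parameters(), lr=max_lr,
                                 betas=(0.97, 0.997), eps=1e-6)
    def lr_lambda(step):
        step = max(step, 1)
        return min(step / warmup_steps, (warmup_steps / step) ** 0.5)
    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler
```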

Moreover, because the number of BAD labels in the word-level tasks is relatively small, the model tends to predict all labels as OK at inference time. We therefore introduce a bad-enhanced parameter that strengthens the weight of the BAD labels when calculating the loss, improving the model's ability to predict BAD. We present details in the following subsections.
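
A minimal sketch of such a bad-enhanced loss; the weight value and label indexing are illustrative assumptions.

```python
# Sketch: up-weight the BAD class in the word-level cross-entropy loss.
import torch
import torch.nn as nn

OK, BAD = 0, 1

def bad_enhanced_loss(word_logits, labels, bad_weight=3.0):
    # word_logits: (batch, T, 2); labels: (batch, T) with values in {OK, BAD}.
    class_weights = torch.tensor([1.0, bad_weight], device=word_logits.device)
    criterion = nn.CrossEntropyLoss(weight=class_weights)
    return criterion(word_logits.reshape(-1, 2), labels.reshape(-1))
```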

4.1 QE Pseudo Data

We compare different pseudo-data construction methods on the ZH\(\rightarrow \)EN word-level task. The methods we use are as follows:

  • We use high-quality bilingual data such as newstest2016 and newstest2017, regard the target side as the post-edited result, and decode the source side with sampling-topk to construct the dataset.

  • We use the data selected from the bilingual data, with pseudo translations decoded by beam search [11] or sampling-topk.

  • We translate target-side monolingual data into the source language and then translate the generated sentences back to the target side; this method is called round-trip translation [6]. The detailed results are shown in Table 2.
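
Whereas beam search keeps the highest-probability hypotheses, sampling-topk draws the next token from the k most probable candidates, which injects the noise that produces BAD labels. A toolkit-independent sketch of one decoding step is shown below; the default k and temperature are illustrative assumptions.

```python
# Sketch: one sampling-topk decoding step given the decoder logits.
import torch

def sample_topk(logits, k=10, temperature=1.0):
    """logits: (batch, vocab). Sample the next token from the k most
    probable candidates."""
    topk_logits, topk_ids = torch.topk(logits / temperature, k, dim=-1)
    probs = torch.softmax(topk_logits, dim=-1)        # renormalize over top-k
    choice = torch.multinomial(probs, num_samples=1)  # (batch, 1)
    return topk_ids.gather(-1, choice)                # map back to vocabulary ids
```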

Table 3. Word-level results on CCMT QE valid2019. We use a GRU as the estimator, jointly trained on the officially released data.

The round-trip and sampling-topk methods mainly target the imbalanced distribution of OK and BAD labels in the word-level tasks: we increase the number of BAD tags by introducing noise during decoding. Table 2 shows that pseudo-data constructed from high-quality bilingual data delivers the greatest performance improvement under the same architecture. However, there are no significant differences in the average label distribution among the different ways of introducing noise. We speculate that the target side of high-quality bilingual data is closer to real post-editing results, so the generated tags are more consistent with the real data, which makes the model more accurate. The different datasets are also used to increase data diversity in model fusion.

4.2 Different Predictor

Our model is based on the predictor-estimator architecture. Recent research shows that the Transformer [16] has powerful information extraction capabilities. Therefore, we use a translation model as the predictor to extract the semantic information contained in the sentences. At the same time, we empirically believe that a stronger translation model brings greater performance improvements to the QE task. To verify the impact of the pre-trained translation model on the QE model, we conduct multiple experiments with different left-to-right predictors and the same estimator. The word-level results are shown in Table 3 and the sentence-level results in Table 4.

Table 4. Sentence-level results on CCMT QE valid2019. We use a GRU as the estimator, jointly trained on the officially released data.

From Tables 3 and 4, we find that the estimator performs better with a more powerful translation model.

4.3 Different Estimator

After determining the architecture of the predictor, we try a variety of architectures as the estimator, including a GRU, a Bi-GRU and self-attention. We take the ZH\(\rightarrow \)EN word-level task as an example; Table 5 shows the prediction results of the different architectures.

Table 5. ZH\(\rightarrow \)EN word-level (word subtask) results on CCMT QE valid2019. We use a joint l2r and r2l Transformer-DLCL as the predictor.

We use a total of 30K sentences of real data and pseudo-data constructed from high-quality bilingual data as the joint training data. We observe that the Bi-GRU performs significantly better than the other architectures on the same dataset. However, since data scarcity may prevent more complex architectures from being trained adequately, we also increase the amount of pseudo-data for the self-attention and self-attention + Bi-GRU estimators. We find that increasing the amount of data does improve these more complex estimators, but they still perform slightly worse than the Bi-GRU. We keep them as seed models for system ensembling to increase diversity.

4.4 Ensemble

We build multiple sub-models with different model architectures and datasets, and externally fuse the results of the multiple systems on all tasks to further improve the stability and performance of the system. We use a left-to-right Transformer-DLCL as the predictor and a GRU as the estimator to build our baseline system. Table 6 shows the final results for all of our participating tasks.

Table 6. All word-level and sentence-level results on CCMT QE valid2019.

5 Conclusion

This paper describes our systems for the CCMT 2019 Quality Estimation tasks, including both the word-level and sentence-level subtasks.

We adopt the predictor-estimator architecture, use a deep Transformer-DLCL [1] as the predictor, and combine left-to-right and right-to-left models to further enhance the predictor's feature extraction capabilities. The estimator adopts a Bi-GRU and uses the quality vector extracted by the predictor to make predictions for the different tasks.

At the same time, we further improve the performance of the translation model used as the predictor, and thus the prediction performance of the estimator, by constructing pseudo-data. In addition, an external ensemble algorithm helps search for a robust combination of models.