
1 Introduction

Machine translation is widely used nowadays, which calls for quality estimation systems to ensure its appropriate usage. However, traditional estimation methods depend purely on human judgment, which is inefficient and resource-intensive. It is therefore important to perform automatic translation quality estimation with high efficiency and low cost. Machine translation Quality Estimation (QE) is an automatic evaluation method that chooses the best translation from several machine-translated sentence (mt) candidates of one source sentence (src) without a reference (ref).

QE can be performed at the sentence level or the word level. Sentence-level QE aims to predict a quality score for an mt sentence, while word-level QE aims to predict an OK/BAD label for each word in mt to indicate whether the word is translated properly. The quality estimation task of CCMT 2023 focuses on the English-to/from-Chinese language directions and predicts a Human Translation Edit Rate (HTER) [7, 16] score for each mt sentence, where HTER measures the number of edits required to change mt into ref. Our method uses different pretrained language models (PLMs) to encode sentence pairs and predict HTER scores. We also explore new pretraining approaches for quality estimation and deep interaction between words in src and mt, which brings significant improvements on this task. In addition, ensemble methods further boost model performance, as expected.

2 Related Work

Quality estimation algorithms before deep learning usually extract various features from words, POS tags, syntax, sentence length, and other binary features representing different aspects of src or mt. A machine learning model then uses these features to make predictions. Many tools are designed for this purpose, such as QuEst [1] and QuEst++ [2]. Such procedures can be abstracted as the predictor-estimator framework [11].

With the broad adoption of neural networks and Transformers [10], deep learning models play the role of feature extractor or even score estimator. DeepQuest [3] and OpenKiwi [4] provide tools for multilevel quality estimation based on neural networks. In recent years, TransQuest [5] and COMET [6] have shown significant improvements on quality estimation tasks by using large PLMs as feature extractors and then predicting sentence-level scores or word-level labels. These models encode src and its mt into high-dimensional embedding vectors through multilingual PLMs, which are then fed into another neural network to obtain quality predictions. Diverse multilingual PLMs can serve as sentence encoders, such as mBERT [8], XLM-RoBERTa [21], InfoXLM [20], mDeBERTa [19], RemBERT [22], and so on.

3 Feature-Enhanced Estimator for Sentence-Level QE

3.1 Model Architecture

The quality estimation task involves measuring the number of edits in mt word by word, which requires the meaning of each word to be translated precisely and unambiguously. PLMs such as BERT [8] and RoBERTa [9] have shown remarkable representation and feature extraction capability in natural language processing. Consequently, we design several feature interaction structures between src and mt on top of multilingual PLM encoders. src and mt are concatenated and fed into the PLM to obtain their last hidden state representations \(({s}_{1},\, {s}_{2},\, ...,\, {s}_{m})\) and \(({t}_{1},\, {t}_{2},\, ...,\, {t}_{n})\), where m and n are the numbers of words in src and mt, as shown in Eq. 1 and Eq. 2. Afterward, three kinds of feature interaction modules are built on top of the PLMs, as illustrated in Fig. 1.

$$\begin{aligned} outputs = PLM\_encoder([src;mt]) \end{aligned}$$
(1)
$$\begin{aligned}{}[{s}_{1}, {s}_{2}, ..., {s}_{m}], [{t}_{1}, {t}_{2}, ..., {t}_{n}] = split\_ src\_ mt(outputs) \end{aligned}$$
(2)
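For concreteness, the following is a minimal sketch of how Eq. 1 and Eq. 2 could be realized with the transformers library. The choice of xlm-roberta-large, the example sentences, and the separator-based splitting are illustrative assumptions rather than our exact implementation.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Illustrative sketch of Eq. 1 and Eq. 2 (encoder choice and splitting
# logic are assumptions, not the exact implementation in this paper).
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large")
encoder = AutoModel.from_pretrained("xlm-roberta-large")

src = "The cat sat on the mat."
mt = "猫坐在垫子上。"

# Concatenate src and mt into one input; the tokenizer inserts the
# model-specific separator tokens between the two segments.
batch = tokenizer(src, mt, return_tensors="pt")
with torch.no_grad():
    outputs = encoder(**batch)
hidden = outputs.last_hidden_state  # shape: (1, seq_len, hidden_dim)

# Split the sequence back into src and mt representations by locating
# the separator tokens between the two segments.
sep_id = tokenizer.sep_token_id
sep_pos = (batch["input_ids"][0] == sep_id).nonzero().flatten().tolist()
src_states = hidden[0, 1:sep_pos[0]]                  # (s_1, ..., s_m), skip <s>
mt_states = hidden[0, sep_pos[-2] + 1:sep_pos[-1]]    # (t_1, ..., t_n)
```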
Fig. 1. Model architecture with interactive module

Simple Interactive Module (SIM). Following the settings of CometKiwi [12], we use mean pooling to generate the vector representations of src and mt. Then the element-wise product and the absolute difference of the src and mt representations are concatenated with the two pooled vectors, and an MLP module predicts the HTER score from these representations as follows:

$$\begin{aligned} {s}_{mean} = mean\_pooling([{s}_{1}, {s}_{2}, ..., {s}_{m}]) \end{aligned}$$
(3)
$$\begin{aligned} {t}_{mean} = mean\_pooling([{t}_{1}, {t}_{2}, ..., {t}_{n}]) \end{aligned}$$
(4)
$$\begin{aligned} hter = MLP([{s}_{mean}; {t}_{mean}; {s}_{mean}\odot {t}_{mean}; |{s}_{mean}-{t}_{mean}|]) \end{aligned}$$
(5)
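A minimal PyTorch sketch of SIM (Eq. 3–Eq. 5) is given below; the hidden size and the depth of the MLP are assumptions.

```python
import torch
import torch.nn as nn

class SimpleInteractiveModule(nn.Module):
    """Sketch of SIM (Eq. 3-5); hidden size and MLP depth are assumptions."""

    def __init__(self, hidden_dim: int = 1024):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(4 * hidden_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, src_states: torch.Tensor, mt_states: torch.Tensor) -> torch.Tensor:
        # Mean pooling over the word dimension (Eq. 3 and Eq. 4).
        s_mean = src_states.mean(dim=1)   # (batch, hidden_dim)
        t_mean = mt_states.mean(dim=1)
        # Concatenate pooled vectors with their element-wise product and
        # absolute difference (Eq. 5).
        features = torch.cat(
            [s_mean, t_mean, s_mean * t_mean, torch.abs(s_mean - t_mean)], dim=-1
        )
        return self.mlp(features).squeeze(-1)  # predicted HTER score
```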

RNN-Based Interactive Module (RIM). The recurrent neural network (RNN) is a popular model in natural language processing that can capture intra-sentence dependencies in a text sequence. Therefore, we use a Bidirectional LSTM [15] (BiLSTM) layer to encode the word-level output hidden states of the encoder, further reinforcing the feature interactions between src and mt before predicting HTER scores as follows:

$$\begin{aligned} lstm\_output = BiLSTM([{s}_{1}, {s}_{2}, ..., {s}_{m};{t}_{1}, {t}_{2}, ..., {t}_{n}]) \end{aligned}$$
(6)
$$\begin{aligned} hter = MLP(mean\_pooling(lstm\_output)) \end{aligned}$$
(7)
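The following PyTorch sketch illustrates RIM (Eq. 6–Eq. 7); the BiLSTM size and MLP depth are assumptions.

```python
import torch
import torch.nn as nn

class RNNInteractiveModule(nn.Module):
    """Sketch of RIM (Eq. 6-7); LSTM size and MLP depth are assumptions."""

    def __init__(self, hidden_dim: int = 1024, lstm_dim: int = 256):
        super().__init__()
        self.bilstm = nn.LSTM(
            input_size=hidden_dim,
            hidden_size=lstm_dim,
            batch_first=True,
            bidirectional=True,
        )
        self.mlp = nn.Sequential(
            nn.Linear(2 * lstm_dim, lstm_dim), nn.Tanh(), nn.Linear(lstm_dim, 1)
        )

    def forward(self, src_states: torch.Tensor, mt_states: torch.Tensor) -> torch.Tensor:
        # Concatenate the word-level hidden states of src and mt along the
        # sequence dimension before the BiLSTM pass (Eq. 6).
        joint = torch.cat([src_states, mt_states], dim=1)
        lstm_output, _ = self.bilstm(joint)
        # Mean pooling over the sequence, then an MLP regressor (Eq. 7).
        return self.mlp(lstm_output.mean(dim=1)).squeeze(-1)
```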

Multilevel Interactive Module (MIM). Inspired by ESIM [14] and RE2 [13], we use cross attention between src and mt to reflect the similarity between words in different languages. Considering that different layers of the encoder capture different levels of information about src and mt, we combine these two kinds of features to strengthen the representations of src and mt. Specifically, a weighted sum of the layer-wise hidden states of src or mt

$$\begin{aligned} {s}^{l} = mean\_pooling([{s}^{l}_{1}, {s}^{l}_{2}, ..., {s}^{l}_{m}]), for\, each\, layer\, l \end{aligned}$$
(8)
$$\begin{aligned} {t}^{l} = mean\_pooling([{t}^{l}_{1}, {t}^{l}_{2}, ..., {t}^{l}_{n}]), for\, each\, layer\, l \end{aligned}$$
(9)
$$\begin{aligned} {s}_{mix}\, =\, \sum ^{L}_{l=1} {{w}^{l}_{s}}\cdot {s}^{l},\, where \sum ^{L}_{l=1} {{w}^{l}_{s}}=1 \end{aligned}$$
(10)
$$\begin{aligned} {t}_{mix}\, =\, \sum ^{L}_{l=1} {{w}^{l}_{t}}\cdot {t}^{l}, where \sum ^{L}_{l=1} {{w}^{l}_{t}}=1 \end{aligned}$$
(11)

with a total layer number L (Eq. 8–Eq. 11) is concatenated with the corresponding cross attention output (Eq. 12–Eq. 16) and transformed into a rich representation through an MLP layer.

$$\begin{aligned} {e}_{ij}\, =\, {s}^{T}_{i}{t}_{j} \end{aligned}$$
(12)
$$\begin{aligned} {s}^{ca}_{i}\, =\, \sum ^{n}_{j=1} {\frac{exp({e}_{ij})}{\sum ^{n}_{k=1} {exp({e}_{ik})}}}{t}_{j},\, \forall i\in [1,2,...,m] \end{aligned}$$
(13)
$$\begin{aligned} {t}^{ca}_{j} = \sum ^{m}_{i=1} {\frac{exp({e}_{ij})}{\sum ^{m}_{k=1} {exp({e}_{kj})}}}{s}_{i}, \forall j\in [1,2,...,n] \end{aligned}$$
(14)
$$\begin{aligned} {s}_{ca} = mean\_pooling([{s}^{ca}_{1}, {s}^{ca}_{2}, ..., {s}^{ca}_{m}]) \end{aligned}$$
(15)
$$\begin{aligned} {t}_{ca} = mean\_pooling([{t}^{ca}_{1}, {t}^{ca}_{2}, ..., {t}^{ca}_{n}]) \end{aligned}$$
(16)

Features of src and mt are fused separately (Eq. 17, Eq. 18) for further combination.

$$\begin{aligned} {s}_{comb} = MLP([{s}_{ca};{s}_{mix}; |{s}_{ca} \odot {s}_{mix}|;{s}_{ca} - {s}_{mix}]) \end{aligned}$$
(17)
$$\begin{aligned} {t}_{comb} = MLP([{t}_{ca};{t}_{mix};|{t}_{ca}\odot {t}_{mix}|;{t}_{ca} - {t}_{mix}]) \end{aligned}$$
(18)

Then the HTER score is computed by another MLP layer.

$$\begin{aligned} hter\, =\, MLP([{s}_{comb}; {t}_{comb}]) \end{aligned}$$
(19)

This module is illustrated in Fig. 3.
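The following PyTorch sketch summarizes the MIM computation (Eq. 8–Eq. 19). It assumes the per-layer hidden states of src and mt are available (e.g., by running the encoder with output_hidden_states=True and splitting each layer as in Eq. 2); the layer count and MLP sizes are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultilevelInteractiveModule(nn.Module):
    """Sketch of MIM (Eq. 8-19); layer count and MLP sizes are assumptions."""

    def __init__(self, hidden_dim: int = 1024, num_layers: int = 24):
        super().__init__()
        # Learnable layer-mixing weights, normalised with softmax so they sum to 1.
        self.src_layer_logits = nn.Parameter(torch.zeros(num_layers))
        self.mt_layer_logits = nn.Parameter(torch.zeros(num_layers))
        self.src_fuse = nn.Sequential(nn.Linear(4 * hidden_dim, hidden_dim), nn.Tanh())
        self.mt_fuse = nn.Sequential(nn.Linear(4 * hidden_dim, hidden_dim), nn.Tanh())
        self.regressor = nn.Sequential(
            nn.Linear(2 * hidden_dim, hidden_dim), nn.Tanh(), nn.Linear(hidden_dim, 1)
        )

    def forward(self, src_layers, mt_layers):
        # src_layers / mt_layers: lists of per-layer hidden states,
        # each of shape (batch, seq_len, hidden_dim).
        w_s = F.softmax(self.src_layer_logits, dim=0)
        w_t = F.softmax(self.mt_layer_logits, dim=0)
        # Eq. 8-11: per-layer mean pooling, then a weighted sum across layers.
        s_mix = sum(w * layer.mean(dim=1) for w, layer in zip(w_s, src_layers))
        t_mix = sum(w * layer.mean(dim=1) for w, layer in zip(w_t, mt_layers))

        # Eq. 12-16: cross attention between last-layer word states.
        s_last, t_last = src_layers[-1], mt_layers[-1]
        e = torch.bmm(s_last, t_last.transpose(1, 2))            # (batch, m, n)
        s_ca = torch.bmm(F.softmax(e, dim=2), t_last).mean(dim=1)
        t_ca = torch.bmm(F.softmax(e, dim=1).transpose(1, 2), s_last).mean(dim=1)

        # Eq. 17-18: fuse the cross-attended and layer-mixed vectors.
        s_comb = self.src_fuse(
            torch.cat([s_ca, s_mix, torch.abs(s_ca * s_mix), s_ca - s_mix], dim=-1)
        )
        t_comb = self.mt_fuse(
            torch.cat([t_ca, t_mix, torch.abs(t_ca * t_mix), t_ca - t_mix], dim=-1)
        )
        # Eq. 19: final regression on the concatenated combined features.
        return self.regressor(torch.cat([s_comb, t_comb], dim=-1)).squeeze(-1)
```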

Loss Selection. We choose the Mean Squared Error (MSE) loss for finetuning models. In addition, the square root of the original HTER score is used as the label with the MSE loss, since the distribution of HTER scores in the training data is concentrated in the lower score range (see Fig. 2).

$$\begin{aligned} loss\, =\, MSE({hter}_{pred},\, \sqrt{{hter}_{true}}) \end{aligned}$$
(20)

Taking the square root of the HTER score increases the divergence among different scores and makes them easier for the model to predict.
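A minimal sketch of the loss in Eq. 20 follows; squaring the prediction back to the HTER scale at inference time is our assumption, as the paper does not state that step explicitly.

```python
import torch
import torch.nn as nn

mse = nn.MSELoss()

def qe_loss(hter_pred: torch.Tensor, hter_true: torch.Tensor) -> torch.Tensor:
    # Eq. 20: regress against the square root of the gold HTER score,
    # which spreads out the densely packed low scores.
    return mse(hter_pred, torch.sqrt(hter_true))

# At inference time the prediction would presumably be squared back to the
# HTER scale (an assumption, not stated in the paper).
```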

Fig. 2. The ranking of HTER scores on En-Zh training data in ascending order

Fig. 3. The structure of Multilevel Interactive Module

3.2 Pretraining Corpus Generation

Transformer models often need a large amount of training data for supervised learning, so we generate more data for pretraining quality estimation models from a bilingual parallel corpus. Motivated by [17], to generate mt data we use several open-source neural machine translation models, including mbart50-m2m [28], Helsinki NLP Opus-en-zh [29], Helsinki NLP Opus-zh-en [30], M2M100 [31], and NLLB [32], which are available on huggingface.com.

Simple rules such as length restrictions and removal of special characters, together with the LaBSE [18] model, are used to filter semantically unrelated sentence pairs from the original parallel corpus, and we sample part of the filtered corpus due to limited computing power. For each sentence pair in the sampled corpus, we translate each sentence into the other language, which yields two (src, mt) pairs. Then we compute the HTER score between mt and ref using sacrebleu [27] and filter out sentence pairs with an HTER score greater than 1.
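As a rough sketch of this labeling step, sacrebleu's TER metric can compute an edit rate for each (mt, ref) pair; the score scaling, the example sentences, and the omission of Chinese-specific tokenization options below are assumptions.

```python
from sacrebleu.metrics import TER

ter_metric = TER()  # Chinese may need tokenization options; omitted here for brevity.

def hter_label(mt: str, ref: str) -> float:
    # sacrebleu reports TER on a 0-100 scale; divide by 100 to get an
    # HTER-style edit rate (scale handling may differ across versions).
    return ter_metric.sentence_score(mt, [ref]).score / 100.0

pairs = [("猫坐在垫子上。", "那只猫坐在垫子上。")]  # (mt, ref) examples
labeled = [(mt, ref, hter_label(mt, ref)) for mt, ref in pairs]
# Keep only pairs whose edit rate does not exceed 1, as described above.
labeled = [item for item in labeled if item[2] <= 1.0]
```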

After these steps, we obtain nearly three million (src, mt, ref, HTER score) tuples as the pretraining dataset used before finetuning on the quality estimation dataset. These data are used for two separate pretraining tasks on the encoders. The first task, mlm, is masked language modeling on the pretraining dataset, identical to that in BERT-like models [8, 9, 21]: src and ref are concatenated, part of the words are masked randomly as input to the encoder, and the encoder predicts what the masked tokens should be. The other pretraining task, hter, predicts HTER scores on the pretraining dataset from the concatenated src and mt with an MSE loss.
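A minimal sketch of the input preparation for the mlm task, using the transformers masking collator; the xlm-roberta-large tokenizer, the example sentences, and the 15% masking probability are assumptions.

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large")
# Standard BERT-style masking; the 15% probability is an assumption here.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

src = "The cat sat on the mat."
ref = "那只猫坐在垫子上。"
# The mlm task concatenates src and ref as one input and masks random tokens;
# the collator also builds the labels for the masked positions.
encoded = tokenizer(src, ref, return_tensors="pt")
batch = collator([{"input_ids": encoded["input_ids"][0]}])
# batch["input_ids"] now contains mask tokens, batch["labels"] the originals.
```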

3.3 Model Ensemble

Since several models are finetuned to predict HTER scores, we perform model filtering and ensembling for better results on the Dev set. For model filtering, we select the models with higher Pearson correlation coefficients. Then we integrate the results from different models in two ways. The first is to simply average the models' scores to obtain the final prediction for each mt

$$\begin{aligned} score\, =\, \frac{1}{model\_num}\sum ^{model\_num}_{i=1} {{score}_{i}} \end{aligned}$$
(21)

and the second is to calculate a weighted sum of the predictions for each sample, where the weights are based on the performance rank of each model on the Dev set for each language pair.

$$\begin{aligned} total\, =\, \sum ^{model\_num}_{i=1} {i} \end{aligned}$$
(22)
$$\begin{aligned} score\, =\, \sum ^{model\_num}_{i=1} {\frac{i}{total}{score}_{i}} \end{aligned}$$
(23)
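The two ensemble schemes can be sketched as follows; the convention that the best model receives the largest rank weight is an assumption consistent with Eq. 22–Eq. 23, and the example numbers are illustrative.

```python
import numpy as np

def average_ensemble(scores: np.ndarray) -> np.ndarray:
    # Eq. 21: scores has shape (model_num, sample_num).
    return scores.mean(axis=0)

def rank_weighted_ensemble(scores: np.ndarray, dev_pearson: np.ndarray) -> np.ndarray:
    # Eq. 22-23: weight each model by its performance rank on the Dev set.
    # Assumption: the best model receives the largest rank index; the paper
    # only states that weights follow the Dev-set ranking.
    ranks = dev_pearson.argsort().argsort() + 1  # 1 = worst, model_num = best
    weights = ranks / ranks.sum()
    return (weights[:, None] * scores).sum(axis=0)

# Example: three models scoring four Test sentences.
preds = np.array([[0.12, 0.40, 0.05, 0.33],
                  [0.10, 0.42, 0.07, 0.30],
                  [0.15, 0.38, 0.04, 0.35]])
pearson_on_dev = np.array([0.52, 0.48, 0.55])
print(average_ensemble(preds))
print(rank_weighted_ensemble(preds, pearson_on_dev))
```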

4 Experiments

4.1 Datasets

We use the data from the CCMT 2023 Machine Translation Quality Estimation task for model finetuning, restoring the tokenized sentences to their original forms by removing spaces. The bilingual parallel dataset of the CCMT 2023 Chinese-English translation task is filtered and used for pretraining as described in Sect. 3.2. The QE dataset statistics are shown in Table 1.

Table 1. QE data statistics of CCMT Quality Estimation Task

4.2 Training and Evaluation

Training and model code are written in Python with PyTorch [24] 1.13 and transformers [25] 4.26.1. All models are trained on an NVIDIA GeForce RTX 3090 (24 GB), both for pretraining on the generated corpus and for finetuning on the quality estimation data. Models are finetuned with the AdamW [23] optimizer, a learning rate of 1e-5, a max sequence length of 128, a batch size of 16, and 3 epochs. The model checkpoint with the best Pearson correlation coefficient on the Dev set, computed with SciPy [26], is selected for the Test set. The Chinese-to-English and English-to-Chinese directions are trained and evaluated separately. We also evaluate finetuned results with and without pretraining.
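A minimal sketch of this finetuning and checkpoint-selection loop is shown below; the model and dataloaders are placeholders that yield encoder inputs and gold HTER scores, and the square-root label follows Eq. 20 (plain MSE on the raw score is the alternative compared in Sect. 4.3).

```python
import torch
from torch.optim import AdamW
from scipy.stats import pearsonr

def finetune(model, train_loader, dev_loader, epochs: int = 3, lr: float = 1e-5):
    """Sketch of the finetuning loop; model and dataloaders are placeholders."""
    optimizer = AdamW(model.parameters(), lr=lr)
    mse = torch.nn.MSELoss()
    best_pearson = float("-inf")
    for epoch in range(epochs):
        model.train()
        for batch in train_loader:  # sequences truncated to 128 tokens, batch size 16
            optimizer.zero_grad()
            pred = model(**batch["inputs"])
            # Eq. 20 variant: regress on the square root of the gold HTER score.
            loss = mse(pred, torch.sqrt(batch["hter"]))
            loss.backward()
            optimizer.step()

        # Keep the checkpoint with the best Pearson correlation on the Dev set.
        model.eval()
        preds, golds = [], []
        with torch.no_grad():
            for batch in dev_loader:
                preds.extend(model(**batch["inputs"]).tolist())
                golds.extend(batch["hter"].tolist())
        r = pearsonr(preds, golds)[0]
        if r > best_pearson:
            best_pearson = r
            torch.save(model.state_dict(), "best_checkpoint.pt")
    return best_pearson
```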

4.3 Results and Analysis

Tables 2 and 3 show the Pearson correlation coefficient results of different encoders with different interactive modules on the Dev set. These models are first pretrained on the hter task. When the interactive module fuses more diverse features from the encoder, the Pearson correlation coefficient increases across different encoders compared with encoder-only models.

However, the pretraining approaches described in Sect. 3.2 have positive or negative effects depending on the encoder, as shown in Table 4, when using encoders with the multilevel interactive module. The hter task improves the performance of all models, which suggests that the amount of training data is key to superior quality estimation. The mlm task boosts the performance of most models except InfoXLM and RemBERT. A probable explanation is that InfoXLM is pretrained with a contrastive learning objective [20], so masked language modeling harms its original capability, and RemBERT shows a similar effect for a related reason [22].

Table 2. Results on En-Zh Dev set of interactive modules in Sect. 3.1
Table 3. Results on Zh-En Dev set of interactive modules in Sect. 3.1
Table 4. Results on Dev set of pretraining methods

In addition, not all models benefit from the loss defined in Eq. 20. Table 5 compares the two loss functions when using different encoders with the multilevel interactive module. The same model can show opposite effects on different language pairs, which suggests that the loss function must be carefully designed.

Table 5. Results on Dev set of loss functions

4.4 Model Ensemble

Based on the single-model experiments, we perform a grid search over combinations of models with high Pearson correlation coefficients on the Dev set. Table 6 compares the results of the two ensemble methods described in Sect. 3.3.

Table 6. Results on Dev and online Test of ensembles

We can see that allocating distinct weights to different models performs better, which implies that further adjustment of the weights may surpass the current results. Our results are competitive even on the offline Test set.

5 Conclusion

This paper describes our method for the CCMT 2023 Quality Estimation Task in both English-to-Chinese and Chinese-to-English directions. With the help of PLMs and specially designed pretraining tasks, we obtain better representations of the src text and its mt text. Pretraining on generated HTER data helps the model predict more accurate scores, while mlm pretraining harms the ability of some PLMs to make better predictions; in most cases, however, both pretraining on HTER data and the mlm task further improve model performance. Since predicting the HTER score requires taking word-level edits into account, interactions between source words and translation words need to be modeled deeply to reflect changes in semantic and grammatical consistency. Experimental results show that our models produce scores with higher Pearson correlation to the true HTER scores when deeper, multilevel interactions are built between the representations of src and its mt. When combining different levels of interactive modules on different language pairs, different PLMs show better or worse results, which suggests that it is hard to design a universal module for various language pairs, although using interactive modules always boosts performance; we leave this as future work. Moreover, the change of loss function significantly increases the Pearson correlation coefficient on the Dev set. The ensemble method based on model rankings also makes the prediction scores more correlated with the true scores, indicating that there is still potential in exploring the best combination of weights. Our models achieve competitive results in both English-to-Chinese and Chinese-to-English directions.