An Improved Multi-task Approach to Pre-trained Model Based MT Quality Estimation

Yuan, Binhuan; Li, Yueyang; Chen, Kehai; Lu, Hao; Yang, Muyun; Cao, Hailong

doi:10.1007/978-981-19-7960-6_11

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1671)

Included in the following conference series:

China Conference on Machine Translation

Abstract

Machine translation (MT) quality estimation (QE) aims to automatically predict the quality of MT outputs without any references. State-of-the-art solutions mostly fine-tune a pre-trained model in a multi-task framework (i.e., jointly training sentence-level QE and word-level QE). In this paper, we propose an alternative multi-task framework in which post-editing results are utilized for sentence-level QE over an mBART-based encoder-decoder model. We show that the post-editing sub-task is much more informative and that mBART is superior to other pre-trained models. Experiments on the WMT2021 English-German and English-Chinese QE datasets show that the proposed method achieves 1.2%–2.1% improvements over the strong sentence-level QE baseline.

1 Introduction

Machine translation (MT) quality estimation (QE) is used as an automatic evaluation for selecting the most suitable machine translation when no golden reference is available. QE is usually implemented at either the sentence level or the word level. The sentence-level QE subtask uses the HTER [3] metric to represent the quality of the MT, and the word-level QE subtask measures translation quality by generating a quality tag for each word in the MT output.

The sentence-level and word-level QE subtasks both rely on triplets of src (source sentence), mt (machine-translated sentence) and pe (post-edited sentence). Therefore, the sentence-level task is usually trained jointly with the word-level task to improve model performance. It should be noted that, for the sentence-level task, pe is only used for calculating the HTER label; it is not integrated into the training phase.
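
To make the role of pe in the sentence-level label concrete, the following minimal sketch approximates HTER as the word-level edit distance from mt to pe divided by the length of pe. It is an illustration under simplifying assumptions, not the official scoring tool: real HTER is computed with TER, which also allows block shifts, and the whitespace tokenization, the cap at 1 and the function names `edit_distance` and `approx_hter` are assumptions made for this example.

```python
def edit_distance(hyp, ref):
    """Word-level Levenshtein distance (insertions, deletions, substitutions)."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1
            curr.append(min(prev[j] + 1,          # delete h from the hypothesis
                            curr[j - 1] + 1,      # insert r into the hypothesis
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def approx_hter(mt, pe):
    """Edits needed to turn mt into pe, normalised by pe length and capped at 1.

    Assumption: block shifts, which TER counts as single edits, are ignored.
    """
    mt_tokens, pe_tokens = mt.split(), pe.split()
    if not pe_tokens:
        return 0.0 if not mt_tokens else 1.0
    return min(1.0, edit_distance(mt_tokens, pe_tokens) / len(pe_tokens))
```

For example, `approx_hter("the cat sat on mat", "the cat sat on the mat")` gives 1/6, since a single insertion turns mt into the six-token pe.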

In contrast to existing practice, we propose to integrate pe into the sentence-level QE model, which we name pe based multi-task learning QE. Following the recent adoption of pre-trained models, we build a multi-task translation QE model on mBART [4, 5]. Evaluated on the WMT2021 English-German/English-Chinese QE datasets and the CCMT2021 English-Chinese/Chinese-English QE datasets, the proposed method shows a substantial improvement in sentence-level QE compared with joint training with the word-level task. We also show that mBART achieves better performance than other pre-trained models such as BERT [1] and XLM-RoBERTa [2].

This paper is organized as follows. In Sect. 2, we introduce the related work on QE. The proposed multi-task QE method based on mBART is described in Sect. 3. We report the experiments and results in Sect. 4, and conclude the paper in Sect. 5.

2 Related Works

With the purpose of estimating machine translations without reference translations, early research on QE adopted traditional feature extraction and feature selection methods to train the models. Commonly used features included the length of the translation and the matching degree of special symbols, punctuation, and capital letters. Gaussian processes [9], heuristics [12] and principal component analysis [16] were commonly used feature selection methods.

With the development of deep learning, QE gradually shifted to neural network-based frameworks. The simplest QE network is based on a context window [6], and it can be improved with CNNs and RNNs [15]. To integrate large-scale parallel corpora into an RNN model, the model can be implemented with the Predictor-Estimator structure [7]. With the rise of the transformer, transformer-based QE models were developed for their ability to use large-scale parallel corpora and to learn lexical and syntactic information [8].

With the emergence of pre-trained models, researchers attempted to use them (e.g., XLM [13] and XLM-R [14]) for machine translation quality estimation, obtaining fairly good results compared with previous research based on the bare transformer. These works are both based on an encoder framework and treat QE as a regression task that fits HTER. However, as QE and MT are highly related, QE models can also be implemented on an encoder-decoder framework. A QE model with an encoder-decoder framework achieved state-of-the-art performance in the WMT 2017/2018 QE tasks [8], and an mBART [4] based model achieved good results on the DA (Direct Assessment) QE task [11]. It should be noted that previous methods usually neglected pe data in the sentence-level QE task. In other words, the information in pe data is unexploited. The only exception is word-level QE, which relies on pe to derive the quality label for each word.

3 PE Based Multi-task Learning for Sentence Level QE

3.1 Multi-task Learning Framework for QE

Given that QE is highly correlated with machine translation, which is implemented with an encoder-decoder architecture, we choose mBART [4] as our base model. mBART is based on a multi-layer transformer architecture and utilizes the bidirectional modeling capability of the encoder while retaining the autoregressive property of the decoder. We feed the source text (src) into the encoder and the machine translation (mt) into the decoder, and the output of the decoder is used to implement the sentence-level task and the word-level task, respectively.

The multi-task learning QE model based on mBART is shown in Fig. 1. For the sentence-level task, we take the hidden state of the last token, the special token <eos>, to calculate the sentence-level loss, as we believe it contains adequate information about the whole sentence. We use sigmoid as the activation function. The loss function for the sentence-level task is as follows:

$$L_{\text{sentence\_level}} = \text{MSE}\left(\text{HTER}, \text{sigmoid}(FC(u))\right) \tag{1}$$

where u denotes the hidden representation of the special token <eos>, MSE represents the mean squared error function, $L_{\text{sentence\_level}}$ denotes the sentence-level loss, and FC denotes a fully connected layer.
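
Eq. (1) can be illustrated with a small dependency-free sketch. The helper names, the plain-list representation of the hidden vector u, and the averaging over a batch are assumptions made for this example; in the actual model u comes from the mBART decoder and FC is a trained layer.

```python
import math


def sigmoid(z):
    """Logistic function, mapping a real score into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def fc(u, weights, bias):
    """Fully connected layer mapping the <eos> hidden vector u to one scalar."""
    return sum(w * x for w, x in zip(weights, u)) + bias


def sentence_level_loss(eos_states, hters, weights, bias):
    """Mean squared error between gold HTER and sigmoid(FC(u)) over a batch."""
    preds = [sigmoid(fc(u, weights, bias)) for u in eos_states]
    return sum((p - y) ** 2 for p, y in zip(preds, hters)) / len(hters)
```

Because the sigmoid output lies in (0, 1), the prediction is on the same scale as an HTER label capped at 1.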

For the word-level task (used as the baseline in this paper), we use the logits at each token position to generate a word-quality label. The loss function for the word-level task is as follows:

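$$L_{\text{word\_level}} = \sum_{i=1}^{k} \left( -I(\text{label}_i = \text{OK}) \log\left(\text{logit}_i[0]\right) - I(\text{label}_i = \text{BAD}) \log\left(\text{logit}_i[1]\right) \right) \tag{2}$$

where $I(\cdot)$ is the indicator function, $\text{label}_i$ is the quality tag of the $i$-th token, and $k$ is the number of tokens in mt.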

The final overall loss is the sum of the sentence-level loss and the weighted word-level loss, where $\alpha$ is a constant weight.

$$L = L_{\text{sentence\_level}} + \alpha \times L_{\text{word\_level}} \tag{3}$$
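
As a hedged illustration of Eqs. (2) and (3), the sketch below computes the word-level loss from per-token scores and combines it with a sentence-level loss. It assumes that a softmax turns the two raw scores of each token into probabilities before the logarithm is taken (Eq. (2) applies the logarithm to $\text{logit}_i[0]$ and $\text{logit}_i[1]$ directly); the function names and the string tags are chosen for this example only.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a short list of raw scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def word_level_loss(token_scores, labels):
    """Eq. (2): sum over tokens of -log p(gold tag); index 0 = OK, index 1 = BAD."""
    loss = 0.0
    for scores, label in zip(token_scores, labels):
        probs = softmax(scores)
        loss -= math.log(probs[0] if label == "OK" else probs[1])
    return loss


def joint_loss(sentence_loss, word_loss, alpha):
    """Eq. (3): L = L_sentence_level + alpha * L_word_level."""
    return sentence_loss + alpha * word_loss
```

With uninformative scores such as `[0.0, 0.0]` for every token, each token contributes log 2 to the word-level loss regardless of its tag.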

3.2 PE Based Multi-task Learning QE

Under the encoder-decoder structure of mBART, we design a translation task from src to pe as an auxiliary task for sentence-level QE. The model is shown in Fig. 2. For the translation part, we feed the right-shifted pe $x=[x_1,\ldots,x_{k+1}]$ into the decoder, which shares parameters with the sentence-level part.

The translation loss $L_{\text{translation}}$ is calculated with the cross-entropy loss function:

$$L_{\text{translation}} = \sum_{i=1}^{k} -\log\left(\text{logit}_i\left[x_{i+1}\right]\right) \tag{4}$$

where $x_{i+1}$ denotes the $(i+1)$-th token of the pe sequence, i.e., the target token at decoder step $i$.

The final overall loss is the sum of the sentence-level loss and the weighted translation loss, where $\beta$ is a constant weight.

$$L = L_{\text{sentence\_level}} + \beta \times L_{\text{translation}} \tag{5}$$
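
The shifting of pe and the losses in Eqs. (4) and (5) can be sketched as follows. This is a simplified illustration, not the training code: token ids are plain integers, `step_probs[i]` is assumed to be an already normalised distribution over the vocabulary at decoder step i, and all function names are hypothetical.

```python
import math


def shift_for_teacher_forcing(pe_ids):
    """Split x_1..x_{k+1} into decoder inputs x_1..x_k and targets x_2..x_{k+1}."""
    return pe_ids[:-1], pe_ids[1:]


def translation_loss(step_probs, targets):
    """Eq. (4): sum over decoder steps of -log p_i[x_{i+1}]."""
    return sum(-math.log(probs[t]) for probs, t in zip(step_probs, targets))


def pe_multitask_loss(sentence_loss, trans_loss, beta):
    """Eq. (5): L = L_sentence_level + beta * L_translation."""
    return sentence_loss + beta * trans_loss
```

For a pe sequence of k + 1 ids, the decoder receives the first k ids and is trained to predict the last k, so each step is scored against the token that follows its input.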

Compared with the word-level task, the translation task can evaluate not only the translation quality of each single word, but also the translation quality at the sentence level by using the context information in the pe data. Meanwhile, compared with encoder-based QE structures, mBART can utilize pe data more directly and avoids the additional labeling cost of word-level quality annotation.

3.3 Multi-model Ensemble

Since models with different initial parameters and architectures capture different information, we can combine multiple models to construct our system. Following existing practice, we further implemented three other QE models, mBERT, XLM-RoBERTa-base and XLM-RoBERTa-large, to obtain different information from the same data. We average the HTER predictions of these three models and our system to obtain stronger performance.

mBERT and XLM-RoBERTa are both encoder-based multilingual pre-trained models. The framework of the encoder-based QE model is shown in Fig. 3. src and mt are concatenated as the encoder input. The output of the encoder passes through a linear layer, which uses sigmoid as the activation function. Because CCMT does not provide word-level QE data, we did not apply multi-task learning to the encoder-based framework.
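
The averaging step described in this subsection amounts to an unweighted mean of the per-sentence predictions. A minimal sketch, assuming each model returns one HTER prediction per sentence in the same order (the function name `ensemble_hter` is hypothetical):

```python
def ensemble_hter(predictions_per_model):
    """Average per-sentence HTER predictions across models (unweighted mean).

    predictions_per_model: one list of scores per model, all of equal length.
    """
    n_models = len(predictions_per_model)
    return [sum(scores) / n_models for scores in zip(*predictions_per_model)]
```

Each inner list holds the predictions of one model, so four lists would be passed for the three encoder-based models plus the mBART-based system.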

4 Experiments

4.1 Dataset

To compare with recent public results, we use the QE data from the WMT2021 Machine Translation Quality Estimation tasks for English-German, and from the CCMT2021 Machine Translation Quality Estimation tasks for English-Chinese. Each dataset contains both sentence-level and word-level tasks. WMT2021 provides 7k training samples in both directions, and CCMT2021 provides more than ten thousand samples, slightly more than WMT2021. The dataset statistics are shown in Table 1.

Table 1. The statistics of quality estimation datasets.

4.2 Model Training and Evaluation Metric

In the training process, AdamW is selected as the optimizer. We set the batch size to 8, the learning rate to 1e−5, and the number of warmup steps to 1000. Training adopts an early stopping strategy: if the model does not improve on the validation set within 2000 steps, training stops. The proposed approach is trained on a single Nvidia 3090. In the sentence-level translation quality estimation task, three evaluation metrics are used: Spearman’s Rank Correlation Coefficient (Spearman), Mean Absolute Error (MAE), and Root Mean Squared Error (RMSE). The Spearman correlation coefficient is used as the main metric, where a higher value indicates better performance of the QE model. The mean absolute error and the root mean squared error are also provided for reference, where a lower value indicates better performance.
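
For reference, the three evaluation metrics can be written in a few lines of plain Python. This is a sketch for illustration rather than the official evaluation script; it computes Spearman as the Pearson correlation of average ranks (so ties share a rank) and assumes that neither input is constant.

```python
import math


def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) for constant input."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(pred, gold):
    """Spearman's rank correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(pred), average_ranks(gold))


def mae(pred, gold):
    """Mean absolute error."""
    return sum(abs(p - g) for p, g in zip(pred, gold)) / len(gold)


def rmse(pred, gold):
    """Root mean squared error."""
    return math.sqrt(sum((p - g) ** 2 for p, g in zip(pred, gold)) / len(gold))
```

Spearman depends only on the ordering of the predictions, whereas MAE and RMSE also penalise predictions that are ranked correctly but are off in absolute value.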

4.3 Experimental Results and Analysis

We first compare mBART with other pre-trained models on the WMT2021 dataset. We choose monolingual BERT, XLM-RoBERTa, and mBERT as baselines. As shown in Table 2, the mBART model surpasses all the other pre-trained models and achieves the highest Pearson correlation in both the EN-DE and EN-ZH tasks.

Table 2. Experiment results with different pre-trained models

The experiment results of our system on WMT2021 are shown in Table 3. They show that the multi-task learning method achieves better results than using mBART only. For sentence-level QE, joint training with the translation task obtains better performance than joint training with the word-level task. However, combining the word-level task and the translation task leads to a performance decline. We also compare the proposed QE model with the best results of WMT2021. HW-TSC [9] uses auxiliary training data obtained from a mature translation system. IST-Unbabel [10] uses the ADAPT strategy and a more complicated feature extraction classifier to enhance its performance. As a result, there is still a gap between our method and the best results.

The experiment results of our system on CCMT2021 are shown in Table 4. The proposed approach outperforms all the other pre-trained models on the CCMT2021 dataset. Joint training with the translation task boosts the performance of our mBART-based system, and the ensemble of multiple models brings further improvement in both directions.

Table 3. Experiment results with multitask on WMT2021

Table 4. Experiment results on CCMT2021

Table 5. Effect of PE translation tasks

4.4 Ablation Study

In this section, we investigate the effect of the translation task. We use pe (post-edited text) to correct the errors of mt (machine translation) in different proportions, and the corrected mt is then used as the decoder input for the translation task. The results are shown in Table 5. We observe that as the correction ratio increases, the performance of the model improves significantly. This means that when pe is introduced into the sentence-level evaluation system, the proposed approach can obtain more useful information from pe data.
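
The text does not specify how the partial correction is carried out, so the sketch below shows only one possible reading, stated as an assumption: mt and pe are aligned at the word level with Python's `difflib`, and the leftmost fraction of the differing spans is replaced by the corresponding pe spans. The function name `partially_correct`, the whitespace tokenization and the left-to-right selection of edits are all hypothetical choices.

```python
import difflib


def partially_correct(mt, pe, ratio):
    """Apply the first round(ratio * n) of the n differing spans from pe to mt.

    Assumption: edits are taken left to right; other selection schemes
    (e.g. random sampling) are equally plausible readings of the ablation.
    """
    mt_tokens, pe_tokens = mt.split(), pe.split()
    matcher = difflib.SequenceMatcher(a=mt_tokens, b=pe_tokens, autojunk=False)
    opcodes = matcher.get_opcodes()
    n_edits = sum(1 for op in opcodes if op[0] != "equal")
    n_apply = round(ratio * n_edits)

    out, applied = [], 0
    for tag, i1, i2, j1, j2 in opcodes:
        if tag == "equal":
            out.extend(mt_tokens[i1:i2])   # unchanged span
        elif applied < n_apply:
            out.extend(pe_tokens[j1:j2])   # take the post-edited span
            applied += 1
        else:
            out.extend(mt_tokens[i1:i2])   # keep the original MT span
    return " ".join(out)
```

Under this reading, a ratio of 0 leaves mt unchanged and a ratio of 1 reproduces pe, with intermediate ratios giving partially corrected decoder inputs.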

We also test the influence of the loss weight on multi-task learning, as shown in Figs. 4 and 5. Generally speaking, the translation multi-task method performs better than the word-level multi-task method.

Moreover, we test different ways of feeding the input to mBART, such as feeding mt into the encoder and src into the decoder, or feeding src and mt into the encoder together, as shown in Table 6. Compared with these alternatives, our framework achieves significant improvements in the EN-DE and EN-ZH tasks.

Table 6. Experiment results with different ways of input

5 Conclusion

In this paper, we describe our submission to the QE task, which consists of English-Chinese and Chinese-English tasks. Our system is built on mBART and multi-task QE learning strategies. We propose a sentence-level translation quality estimation model based on mBART, which achieves better results than other cross-lingual pre-trained models. We also present a training method that introduces a translation task into multi-task QE learning, which successfully integrates post-edited sentences into the sentence-level QE task and greatly improves system performance with a simple model architecture.

References

1. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)
2. Conneau, A., et al.: Unsupervised cross-lingual representation learning at scale. arXiv preprint arXiv:1911.02116 (2019)
3. Specia, L., Farzindar, A.: Estimating machine translation post-editing effort with HTER. In: Proceedings of the Second Joint EM+/CNGL Workshop: Bringing MT to the User: Research on Integrating MT in the Translation Industry, pp. 33–43 (2010)
4. Liu, Y., Gu, J., Goyal, N., et al.: Multilingual denoising pre-training for neural machine translation. Trans. Assoc. Comput. Linguist. 8, 726–742 (2020)
5. Tang, Y., Tran, C., Li, X., et al.: Multilingual translation with extensible multilingual pre-training and finetuning. arXiv preprint arXiv:2008.00401 (2020)
6. Kreutzer, J., Schamoni, S., Riezler, S.: QUality estimation from scratch (QUETCH): deep learning for word-level translation quality estimation. In: Proceedings of the Tenth Workshop on Statistical Machine Translation, pp. 316–322 (2015)
7. Kim, H., Lee, J.H.: A recurrent neural network approach for estimating the quality of machine translation output. In: Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Stroudsburg, PA, pp. 494–498. ACL (2016)
8. Fan, K., Wang, J., Li, B., et al.: “Bilingual Expert” can find translation errors. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, no. 01, pp. 6367–6374 (2019)
9. Shah, K., Cohn, T., Specia, L.: A Bayesian non-linear method for feature selection in machine translation quality estimation. Mach. Transl. 29(2), 101–125 (2015)
10. Moura, J., Vera, M., van Stigt, D., et al.: IST-Unbabel participation in the WMT20 quality estimation shared task. In: Proceedings of the Fifth Conference on Machine Translation, pp. 1029–1036 (2020)
11. Zerva, C., van Stigt, D., Rei, R., et al.: IST-Unbabel 2021 submission for the quality estimation shared task. In: Proceedings of the Sixth Conference on Machine Translation, pp. 961–972 (2021)
12. González-Rubio, J., Navarro-Cerdán, J.R., Casacuberta, F.: Dimensionality reduction methods for machine translation quality estimation. Mach. Transl. 27(3–4), 281–301 (2013)
13. Kepler, F., Trénous, J., Treviso, M., et al.: Unbabel’s participation in the WMT19 translation quality estimation shared task. arXiv preprint arXiv:1907.10352 (2019)
14. Ranasinghe, T., Orasan, C., Mitkov, R.: TransQuest: translation quality estimation with cross-lingual transformers. arXiv preprint arXiv:2011.01536 (2020)
15. Martins, A.F.T., Astudillo, R., Hokamp, C., et al.: Unbabel’s participation in the WMT16 word-level translation quality estimation shared task. In: Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers, pp. 806–811 (2016)
16. Mikolov, T., Chen, K., Corrado, G.S., et al.: Efficient estimation of word representations in vector space. Comput. Sci. (2013)

Acknowledgement

This work is partially funded by the National Key Research and Development Program of China (No. 2020AAA0108000), and by the Key Project of the National Natural Science Foundation of China (No. U1908216).

Author information

Authors and Affiliations

Department of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
Binhuan Yuan, Yueyang Li, Kehai Chen, Hao Lu, Muyun Yang & Hailong Cao

Corresponding authors

Correspondence to Binhuan Yuan, Yueyang Li or Kehai Chen.

Editor information

Editors and Affiliations

Northeastern University, Shenyang, China
Tong Xiao
Meta AI, San Francisco, CA, USA
Juan Pino

About this paper

Cite this paper

Yuan, B., Li, Y., Chen, K., Lu, H., Yang, M., Cao, H. (2022). An Improved Multi-task Approach to Pre-trained Model Based MT Quality Estimation. In: Xiao, T., Pino, J. (eds) Machine Translation. CCMT 2022. Communications in Computer and Information Science, vol 1671. Springer, Singapore. https://doi.org/10.1007/978-981-19-7960-6_11

DOI: https://doi.org/10.1007/978-981-19-7960-6_11
Published: 09 December 2022
Publisher Name: Springer, Singapore
Print ISBN: 978-981-19-7959-0
Online ISBN: 978-981-19-7960-6
eBook Packages: Computer Science, Computer Science (R0)
