
1 Introduction

Traditional automatic metrics for machine translation (MT) score MT output by comparing it with one or more reference translations. Common metrics of this kind include the word-based metrics BLEU [1] and METEOR [2], and the word embedding-based metrics BERTScore [3] and BLEURT [4]. However, reference translations are available for only a tiny fraction of possible source sentences, and non-professional translators cannot produce high-quality reference translations [5].

These problems can be avoided through reference-free MT evaluation, in which system translations are compared directly with the source texts rather than with reference translations. Recently, with the rapid progress of deep learning in multilingual language processing [6, 7], many reference-free metrics have been proposed for such evaluation. Popović et al. [8] exploited a bag-of-words translation model for quality estimation, which sums the likelihoods of aligned word pairs between the source and translation texts. Specia et al. [9] used language-agnostic linguistic features extracted from source texts and system translations to estimate quality. YiSi-2 [10] evaluates system translations by summing similarity scores over word pairs that are best-aligned mutual translations. Moreover, by introducing cross-lingual linear projection, Lo and Larkin [11] substantially improved the performance of YiSi-2. Prism-src [12] frames MT evaluation as scoring machine translation output with a sequence-to-sequence paraphraser conditioned on the source text. COMET-QE [13, 14] encodes segment-level representations of the source and translation texts as the input to a feed-forward regressor. Gekhman et al. [15] proposed a simple and effective Knowledge-Based Evaluation (KoBE) method that measures the recall of entities found in source texts and system translations. To mitigate the misalignment of cross-lingual word embedding spaces, Zhao et al. [16] proposed post-hoc re-alignment strategies that integrate a target-side GPT [17] language model. Song et al. [18] proposed SentSim, an unsupervised metric that incorporates a notion of sentence semantic similarity.

In this paper, we find that assessing system translations with only a target-side language model can achieve very promising results. Using a modified sentence-perplexity calculation for system translations, we design reference-free metrics for segment-level and system-level MT evaluation, and we test them on all 18 language pairs of the WMT19 news translation shared task [19]. The experimental results demonstrate that, built on the pretrained model XLM-R [7], our metrics are highly competitive with the current SOTA reference-free metrics that we are aware of.

2 Target-Side Language Model Metrics

A statistical language model is a probability distribution over sequences of words [20]. Given such a sequence with m words, i.e., \(\boldsymbol{s}=(w_1,\dots ,w_m)\), it assigns a probability \(P(\boldsymbol{s})\) to the whole sequence, which is defined as:

$$\begin{aligned} P(\boldsymbol{s})=P(w_1,\dots ,w_m)=\prod _{i=1}^{m}{P(w_i|w_1,\dots ,w_{i-1})}. \end{aligned}$$
(1)

To overcome the data sparsity problem in building a statistical language model, a common solution is to assume that the probability of a word depends only on the previous \(n-1\) words. This is known as the n-gram model (or the unigram model when \(n=1\)). The probability \(P(\boldsymbol{s})\) can then be approximated as:

$$\begin{aligned} P(\boldsymbol{s})=\prod _{i=1}^{m}{P(w_i|w_1,\dots ,w_{i-1})}\approx \prod _{i=1}^{m}{P(w_i|w_{i-(n-1)},\dots ,w_{i-1})}. \end{aligned}$$
(2)
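For example, a bigram model (\(n=2\)) factorizes the probability of a three-word sentence as:

$$\begin{aligned} P(w_1,w_2,w_3) \approx P(w_1)\,P(w_2|w_1)\,P(w_3|w_2). \end{aligned}$$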

With the advancements in deep learning [21], various neural language models have been proposed that use continuous representations, or embeddings, of words to make their predictions [6, 22]. Typically, a neural language model is constructed and trained as a probabilistic classifier for

$$\begin{aligned} P(w \mid context), \quad \text{for } w \in V. \end{aligned}$$
(3)

That is, given some linguistic context, the model is trained to predict a probability distribution over the vocabulary V.

In this paper, we adopt the masked language model [6] to design reference-free metrics for segment-level and system-level MT evaluation.

For segment-level evaluation where a single system translation sentence \(\boldsymbol{s}\) is provided, the metric \(SEG\_LM\) is defined as:

$$\begin{aligned} SEG\_LM(\boldsymbol{s}) = \frac{1}{m}{\sum _{i=1}^{m}{\log \frac{1}{P(w_i|\boldsymbol{s}-w_i)}}}, \end{aligned}$$
(4)

where m is the number of words in sentence \(\boldsymbol{s}\), \(w_i\) is the i-th word in \(\boldsymbol{s}\), and \(P(w_i|\boldsymbol{s}-w_i)\) is the probability of \(w_i\) predicted by the masked language model when \(w_i\) is replaced by [MASK] in \(\boldsymbol{s}\).
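As a concrete illustration, the following is a minimal sketch of how \(SEG\_LM\) could be computed with a pretrained masked language model via the HuggingFace transformers library. The checkpoint name, the loop-based masking, and the use of subword tokens in place of words are our own illustrative assumptions, not necessarily the exact setup used in the experiments.

```python
# A minimal sketch of SEG_LM (Eq. 4) with a masked language model.
# Assumptions (not from the paper): the "xlm-roberta-base" checkpoint,
# subword tokens standing in for words, and one forward pass per masked token.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForMaskedLM.from_pretrained("xlm-roberta-base")
model.eval()


def seg_lm(sentence: str) -> float:
    """Average of log(1 / P(w_i | s - w_i)) over the tokens of the sentence."""
    input_ids = tokenizer(sentence, return_tensors="pt")["input_ids"][0]
    total, count = 0.0, 0
    with torch.no_grad():
        # Skip the special tokens <s> and </s> at the two ends.
        for pos in range(1, input_ids.size(0) - 1):
            masked = input_ids.clone()
            true_id = masked[pos].item()
            masked[pos] = tokenizer.mask_token_id      # replace w_i with [MASK]
            logits = model(masked.unsqueeze(0)).logits[0, pos]
            log_prob = torch.log_softmax(logits, dim=-1)[true_id]
            total += -log_prob.item()                  # log 1/P(w_i | s - w_i)
            count += 1
    return total / max(count, 1)
```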

It should be pointed out that the metric \(SEG\_LM\) is slightly different from the log form of the sentence perplexity [20] (PPL), which is defined as:

$$\begin{aligned} \log {PPL(\boldsymbol{s})} = \log {\root m \of {\frac{1}{P(w_1,\dots ,w_m)}}}=\frac{1}{m}{\sum _{i=1}^{m}\log {\frac{1}{P(w_i|w_1,\dots ,w_{i-1})}}}. \end{aligned}$$
(5)

From the above definitions, it can be seen that the context used to predict the probability of \(w_i\) differs between the two: PPL conditions only on the preceding words \(w_1,\dots ,w_{i-1}\), whereas \(SEG\_LM\) conditions on all the other words in the sentence.

For system-level evaluation where a set of system translation sentences S is given, the metric \(SYS\_LM\) is defined as:

$$\begin{aligned} SYS\_LM(S) = \frac{1}{\left| S\right| }{\sum _{\boldsymbol{s} \in S}{SEG\_LM(\boldsymbol{s})}} , \end{aligned}$$
(6)

which is the mean of the \(SEG\_LM\) scores over all sentences in S.
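The system-level score thus follows directly from the segment-level one; a brief sketch, reusing the hypothetical seg_lm function from the previous sketch:

```python
# A minimal sketch of SYS_LM (Eq. 6): the mean SEG_LM score over all of a
# system's output sentences; seg_lm() is the illustrative function above.
def sys_lm(sentences):
    scores = [seg_lm(s) for s in sentences]
    return sum(scores) / len(scores)
```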

Although our metrics do not consider the source texts at all, the experimental results on WMT19 in Sect. 3 show that \(SEG\_LM\) and \(SYS\_LM\) are very promising for both segment-level and system-level reference-free MT evaluation.

3 Experiments

In this section, we assess the performance of our metrics \(SEG\_LM\) and \(SYS\_LM\) for reference-free MT evaluation by correlating their scores with human judgments of translation quality. The pretrained multilingual model XLM-R [7] is used as the masked language model for our metrics.

Table 1. Segment-level metric results for the into-English language pairs of WMT19
Table 2. Segment-level metric results for the from-English language pairs of WMT19
Table 3. Segment-level metric results for the non-English language pairs of WMT19

3.1 Datasets and Baselines

The source-language sentences and their system and reference translations are collected from the WMT19 news translation shared task [19], which contains the outputs of 233 translation systems across 18 language pairs. Each language pair has about 3,000 source sentences, each associated with one reference translation and with the automatic translations generated by the participating systems. All 18 language pairs in WMT19 are used for reference-free MT evaluation in this paper.

A range of reference-free metrics are chosen for comparison with ours: LASIM and LP [23], UNI and UNI+ [19], YiSi-2 [10] and YiSi-2+CLP [11], KoBE [15] and CLP-UMD [16]. To the best of our knowledge, these cover most of the current SOTA metrics for reference-free MT evaluation. The reference-based metrics BLEU and sentBLEU [24] are also included as baselines. It should be pointed out that only the results of our metrics \(SEG\_LM\) and \(SYS\_LM\) are computed in this paper; the results of the other metrics are taken from their respective papers.

3.2 Results

Evaluation Measures. Kendall’s Tau and Pearson correlations [19] are used as measures for segment-level and system-level metric evaluations respectively.
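For illustration, the sketch below computes the two measures with scipy.stats on placeholder score lists. Note that the official WMT19 segment-level evaluation uses a Kendall's-Tau-like formulation over relative-ranking (DARR) pairs, so the plain kendalltau call here is a simplification; how the sign of \(SEG\_LM\) (an average negative log-probability, so lower is better) is handled is likewise omitted.

```python
# An illustrative sketch of the evaluation measures using scipy.stats.
# The score lists are placeholders, not results from the paper.
from scipy.stats import kendalltau, pearsonr

# Segment level: metric scores vs. human judgments, one pair per segment.
seg_metric_scores = [0.42, 0.17, 0.88, 0.35]
seg_human_scores = [72.0, 95.0, 40.0, 80.0]
tau, _ = kendalltau(seg_metric_scores, seg_human_scores)

# System level: one aggregated score per MT system.
sys_metric_scores = [0.61, 0.54, 0.70]
sys_human_scores = [68.0, 75.0, 60.0]
r, _ = pearsonr(sys_metric_scores, sys_human_scores)

print(f"Kendall's Tau = {tau:.3f}, Pearson r = {r:.3f}")
```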

Table 4. System-level metric results for the into-English language pairs of WMT19
Table 5. System-level metric results for the from-English language pairs of WMT19
Table 6. System-level metric results for the non-English language pairs of WMT19

Segment-level Results. Tables 1, 2 and 3 report the results of the reference-free segment-level evaluations on the into-English, from-English and non-English language pairs of WMT19, respectively (best results excluding sentBLEU are in bold).

From Table 1, it can be seen that the scores of our metric \(SEG\_LM\) on the de-en, lt-en and ru-en language pairs are very close to the best values (a gap of only 0.001). As shown in Table 2, our metric not only obtains the best mean score on the from-English language pairs but also ranks first on 6 of the 8 language pairs. The results in Table 3 show that our metric even outperforms the reference-based metric sentBLEU on all the non-English language pairs. Therefore, \(SEG\_LM\) is very promising for segment-level MT evaluation, especially when the target language is not English.

System-level Results. Tables 4, 5 and 6 report the results of the reference-free system-level evaluations on the into-English, from-English and non-English language pairs of WMT19, respectively (best results excluding BLEU are in bold).

As shown in the into-English results in Table 4, our metric \(SYS\_LM\) again obtains scores very close to the best values on the fi-en and lt-en language pairs. The results in Table 5 show that our metric achieves the best mean score and the best score on 5 of the 8 from-English language pairs. Meanwhile, the results in Table 6 show that \(SYS\_LM\) outperforms the SOTA metric YiSi-2+CLP at the system level, although it does not do so at the segment level (Table 3), and it also achieves the best mean score on the non-English language pairs. Overall, the experimental results demonstrate that \(SYS\_LM\) is very competitive for system-level MT evaluation when compared with the current SOTA metrics that we are aware of.

3.3 Discussion

In this section, we provide an explanation of why a target-side language model works. For segment-level evaluation, where the input is a source sentence \(\boldsymbol{s}\) and a system translation sentence \(\boldsymbol{t}\), a metric should ideally estimate the true probability \(P(\boldsymbol{t}|\boldsymbol{s})\). By Bayes' theorem, we have:

$$\begin{aligned} \log P(\boldsymbol{t}|\boldsymbol{s}) = \log \frac{P(\boldsymbol{s}|\boldsymbol{t})P(\boldsymbol{t})}{P(\boldsymbol{s})} = \log P(\boldsymbol{s}|\boldsymbol{t}) + \log P(\boldsymbol{t}) - \log P(\boldsymbol{s}). \end{aligned}$$
(7)

The target-side language model mainly approximates the second term \(\log P(\boldsymbol{t})\). Since the third term \(\log P(\boldsymbol{s})\) is constant for a given source sentence, our target-side language model metric works for MT evaluation whenever the first term \(\log P(\boldsymbol{s}|\boldsymbol{t})\) does not vary much across system translations.

4 Conclusion

In this paper, reference-free metrics designed with only a target-side language model are proposed for segment-level and system-level MT evaluation. With the pretrained multilingual model XLM-R as the target-side language model, the performance of our metrics \(SEG\_LM\) and \(SYS\_LM\) is evaluated on all 18 language pairs of WMT19. The experimental results show that our metrics are very competitive (the best mean score for segment-level evaluation on the from-English language pairs, and the best mean scores for system-level evaluation on the from-English and non-English language pairs) when compared with most of the current SOTA reference-free metrics. Furthermore, the reason why the target-side language model works is discussed. Combining our metrics with metrics that estimate the first term \(\log P(\boldsymbol{s}|\boldsymbol{t})\) in Eq. 7 will be our future work.