
1 Introduction

Traditional automatic metrics for machine translation (MT) score MT output by comparing it with one or more reference translations. Common metrics of this kind include the word-based metrics BLEU [1] and METEOR [2], and the word embedding-based metrics BERTScore [3] and BLEURT [4]. However, reference translations are available for only a tiny fraction of possible source sentences, and non-professional translators cannot produce high-quality reference translations [5].

These problems can be avoided through reference-free MT evaluation, in which system translations are compared directly with the source texts rather than with reference translations. Recently, with the rapid progress of deep learning in multilingual language processing [6, 7], many reference-free metrics have been proposed for such evaluation. Popović et al. [8] exploited a bag-of-words translation model for quality estimation, which sums the likelihoods of aligned word pairs between the source and translation texts. Specia et al. [9] used language-agnostic linguistic features extracted from source texts and system translations to estimate quality. YiSi-2 [10] evaluates system translations by summing similarity scores over word pairs that are best-aligned mutual translations. Moreover, by introducing cross-lingual linear projection, Lo and Larkin [11] substantially improved the performance of YiSi-2. Prism-src [12] frames MT evaluation as scoring machine translation output with a sequence-to-sequence paraphraser conditioned on the source text. COMET-QE [13, 14] encodes segment-level representations of the source and translation texts as the input to a feed-forward regressor. Gekhman et al. [15] proposed a simple and effective Knowledge-Based Evaluation (KoBE) method that measures the recall of entities found in source texts and system translations. To mitigate the misalignment of cross-lingual word embedding spaces, Zhao et al. [16] proposed post-hoc re-alignment strategies that integrate a target-side GPT [17] language model. Song et al. [18] proposed SentSim, an unsupervised metric that incorporates a notion of sentence semantic similarity.

In this paper, we find that assessing system translations with only a target-side language model can achieve very promising results. Using a modified sentence-perplexity calculation for system translations, we design reference-free metrics for segment-level and system-level MT evaluation, and we test them on all 18 language pairs of the WMT19 news translation shared task [19]. The experimental results demonstrate that, built on the pretrained model XLM-R [7], our metrics are highly competitive with the current SOTA reference-free metrics that we are aware of.

2 Target-Side Language Model Metrics

A statistical language model is a probability distribution over sequences of words [20]. Given such a sequence with m words, i.e., \(\boldsymbol{s}=(w_1,\dots ,w_m)\), it assigns a probability \(P(\boldsymbol{s})\) to the whole sequence, which is defined as:

$$\begin{aligned} P(\boldsymbol{s})=P(w_1,\dots ,w_m)=\prod _{i=1}^{m}{P(w_i|w_1,\dots ,w_{i-1})}. \end{aligned}$$
(1)

To overcome the data sparsity problem in building a statistical language model, a common solution is to assume that the probability of a word depends only on the previous \(n-1\) words. This is known as the n-gram model (or the unigram model when \(n=1\)). The probability \(P(\boldsymbol{s})\) can then be approximated as:

$$\begin{aligned} P(\boldsymbol{s})=\prod _{i=1}^{m}{P(w_i|w_1,\dots ,w_{i-1})}\approx \prod _{i=1}^{m}{P(w_i|w_{i-(n-1)},\dots ,w_{i-1})}. \end{aligned}$$
(2)
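For example, a bigram model (\(n=2\)) factorizes the probability of a three-word sentence as:

$$\begin{aligned} P(w_1,w_2,w_3) \approx P(w_1)\,P(w_2|w_1)\,P(w_3|w_2). \end{aligned}$$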

With the advancements in deep learning [21], various neural language models have been proposed that use continuous representations, or embeddings, of words to make their predictions [6, 22]. Typically, a neural language model is constructed and trained as a probabilistic classifier for

$$\begin{aligned} P(w \mid context), \quad \text{for } w \in V. \end{aligned}$$
(3)

That is, given some linguistic context, the model is trained to predict a probability distribution over the vocabulary V.

In this paper, we adopt the masked language model [6] to design reference-free metrics for segment-level and system-level MT evaluation.

For segment-level evaluation where a single system translation sentence \(\boldsymbol{s}\) is provided, the metric \(SEG\_LM\) is defined as:

$$\begin{aligned} SEG\_LM(\boldsymbol{s}) = \frac{1}{m}{\sum _{i=1}^{m}{\log \frac{1}{P(w_i|\boldsymbol{s}-w_i)}}}, \end{aligned}$$
(4)

where m is the number of words in sentence \(\boldsymbol{s}\), \(w_i\) is the i-th word in \(\boldsymbol{s}\), and \(P(w_i|\boldsymbol{s}-w_i)\) is the probability of \(w_i\) predicted by the masked language model when \(w_i\) is replaced by [MASK] in \(\boldsymbol{s}\).
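As a concrete illustration, the following is a minimal sketch of how \(SEG\_LM\) could be computed with a pretrained masked language model via the HuggingFace transformers library. The checkpoint name, the loop-based masking, and the use of subword tokens in place of words are our own illustrative assumptions, not necessarily the exact setup used in the experiments.

```python
# A minimal sketch of SEG_LM (Eq. 4) with a masked language model.
# Assumptions (not from the paper): the "xlm-roberta-base" checkpoint,
# subword tokens standing in for words, and one forward pass per masked token.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForMaskedLM.from_pretrained("xlm-roberta-base")
model.eval()


def seg_lm(sentence: str) -> float:
    """Average of log(1 / P(w_i | s - w_i)) over the tokens of the sentence."""
    input_ids = tokenizer(sentence, return_tensors="pt")["input_ids"][0]
    total, count = 0.0, 0
    with torch.no_grad():
        # Skip the special tokens <s> and </s> at the two ends.
        for pos in range(1, input_ids.size(0) - 1):
            masked = input_ids.clone()
            true_id = masked[pos].item()
            masked[pos] = tokenizer.mask_token_id      # replace w_i with [MASK]
            logits = model(masked.unsqueeze(0)).logits[0, pos]
            log_prob = torch.log_softmax(logits, dim=-1)[true_id]
            total += -log_prob.item()                  # log 1/P(w_i | s - w_i)
            count += 1
    return total / max(count, 1)
```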

It should be pointed out that the metric \(SEG\_LM\) is slightly different from the log form of the sentence perplexity [20] (PPL), which is defined as:

$$\begin{aligned} \log {PPL(\boldsymbol{s})} = \log {\root m \of {\frac{1}{P(w_1,\dots ,w_m)}}}=\frac{1}{m}{\sum _{i=1}^{m}\log {\frac{1}{P(w_i|w_1,\dots ,w_{i-1})}}}. \end{aligned}$$
(5)

From the above definitions, it can be seen that the context used to predict the probability of \(w_i\) differs between the two: PPL conditions only on the preceding words \(w_1,\dots ,w_{i-1}\), whereas \(SEG\_LM\) conditions on all the other words in the sentence.

For system-level evaluation where a set of system translation sentences S is given, the metric \(SYS\_LM\) is defined as:

$$\begin{aligned} SYS\_LM(S) = \frac{1}{\left| S\right| }{\sum _{\boldsymbol{s} \in S}{SEG\_LM(\boldsymbol{s})}} , \end{aligned}$$
(6)

which is the mean of the \(SEG\_LM\) scores over all sentences in S.
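The system-level score thus follows directly from the segment-level one; a brief sketch, reusing the hypothetical seg_lm function from the previous sketch:

```python
# A minimal sketch of SYS_LM (Eq. 6): the mean SEG_LM score over all of a
# system's output sentences; seg_lm() is the illustrative function above.
def sys_lm(sentences):
    scores = [seg_lm(s) for s in sentences]
    return sum(scores) / len(scores)
```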

Although our metrics do not consider the source texts at all, the experimental results on WMT19 in Sect. 3 show that \(SEG\_LM\) and \(SYS\_LM\) are very promising for both segment-level and system-level reference-free MT evaluation.

3 Experiments

In this section, we assess the performance of our metrics \(SEG\_LM\) and \(SYS\_LM\) for reference-free MT evaluation by correlating their scores with human judgments of translation quality. The pretrained multilingual model XLM-R [7] is used as the masked language model for our metrics.

Table 1. Segment-level metric results for the into-English language pairs of WMT19
Table 2. Segment-level metric results for the from-English language pairs of WMT19
Table 3. Segment-level metric results for the non-English language pairs of WMT19

3.1 Datasets and Baselines

The source-language sentences and their system and reference translations are collected from the WMT19 news translation shared task [19], which contains the outputs of 233 translation systems across 18 language pairs. Each language pair has about 3,000 source sentences, each associated with one reference translation and with the automatic translations generated by the participating systems. All 18 language pairs in WMT19 are used for reference-free MT evaluation in this paper.

A range of reference-free metrics are chosen for comparison with ours: LASIM and LP [23], UNI and UNI+ [19], YiSi-2 [10] and YiSi-2+CLP [11], KoBE [15] and CLP-UMD [16]. To the best of our knowledge, these cover most of the current SOTA metrics for reference-free MT evaluation. The reference-based metrics BLEU and sentBLEU [24] are also included as baselines. It should be pointed out that only the results of our metrics \(SEG\_LM\) and \(SYS\_LM\) are computed in this paper; the results of the other metrics are taken from their respective papers.

3.2 Results

Evaluation Measures. Kendall’s Tau and Pearson correlations [19] are used as measures for segment-level and system-level metric evaluations respectively.
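For illustration, the sketch below computes the two measures with scipy.stats on placeholder score lists. Note that the official WMT19 segment-level evaluation uses a Kendall's-Tau-like formulation over relative-ranking (DARR) pairs, so the plain kendalltau call here is a simplification; how the sign of \(SEG\_LM\) (an average negative log-probability, so lower is better) is handled is likewise omitted.

```python
# An illustrative sketch of the evaluation measures using scipy.stats.
# The score lists are placeholders, not results from the paper.
from scipy.stats import kendalltau, pearsonr

# Segment level: metric scores vs. human judgments, one pair per segment.
seg_metric_scores = [0.42, 0.17, 0.88, 0.35]
seg_human_scores = [72.0, 95.0, 40.0, 80.0]
tau, _ = kendalltau(seg_metric_scores, seg_human_scores)

# System level: one aggregated score per MT system.
sys_metric_scores = [0.61, 0.54, 0.70]
sys_human_scores = [68.0, 75.0, 60.0]
r, _ = pearsonr(sys_metric_scores, sys_human_scores)

print(f"Kendall's Tau = {tau:.3f}, Pearson r = {r:.3f}")
```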

Table 4. System-level metric results for the into-English language pairs of WMT19
Table 5. System-level metric results for the from-English language pairs of WMT19
Table 6. System-level metric results for the non-English language pairs of WMT19

Segment-level Results. Tables 1, 2 and 3 report the results of the reference-free segment-level evaluations on the into-English, from-English and non-English language pairs of WMT19, respectively (best results excluding sentBLEU are in bold).

From Table 1, it can be seen that the scores of our metric \(SEG\_LM\) on the de-en, lt-en and ru-en language pairs are very close to the best values (a gap of only 0.001). As shown in Table 2, our metric not only obtains the best mean score on the from-English language pairs but also ranks first on 6 of the 8 language pairs. The results in Table 3 show that our metric even outperforms the reference-based metric sentBLEU on all the non-English language pairs. Therefore, \(SEG\_LM\) is very promising for segment-level MT evaluation, especially when the target language is not English.

System-level Results. Tables 4, 5 and 6 report the results of the reference-free system-level evaluations on the into-English, from-English and non-English language pairs of WMT19, respectively (best results excluding BLEU are in bold).

As shown in the into-English results in Table 4, our metric \(SYS\_LM\) again obtains scores very close to the best values on the fi-en and lt-en language pairs. The results in Table 5 show that our metric achieves the best mean score and the best score on 5 of the 8 from-English language pairs. Meanwhile, the results in Table 6 show that \(SYS\_LM\) outperforms the SOTA metric YiSi-2+CLP at the system level, although it does not do so at the segment level (Table 3), and it also achieves the best mean score on the non-English language pairs. Overall, the experimental results demonstrate that \(SYS\_LM\) is very competitive for system-level MT evaluation when compared with the current SOTA metrics that we are aware of.

3.3 Discussion

In this section, we provide an explanation of why a target-side language model works. For segment-level evaluation, where the input is a source sentence \(\boldsymbol{s}\) and a system translation sentence \(\boldsymbol{t}\), a metric should ideally estimate the true probability \(P(\boldsymbol{t}|\boldsymbol{s})\). By Bayes' theorem, we have:

$$\begin{aligned} \log P(\boldsymbol{t}|\boldsymbol{s}) = \log \frac{P(\boldsymbol{s}|\boldsymbol{t})P(\boldsymbol{t})}{P(\boldsymbol{s})} = \log P(\boldsymbol{s}|\boldsymbol{t}) + \log P(\boldsymbol{t}) - \log P(\boldsymbol{s}). \end{aligned}$$
(7)

The target-side language model mainly approximates the second term \(\log P(\boldsymbol{t})\). Since the third term \(\log P(\boldsymbol{s})\) is constant for a given source sentence, our target-side language model metric works for MT evaluation whenever the first term \(\log P(\boldsymbol{s}|\boldsymbol{t})\) does not vary much across system translations.

4 Conclusion

In this paper, reference-free metrics designed with only a target-side language model are proposed for segment-level and system-level MT evaluation. With the pretrained multilingual model XLM-R as the target-side language model, the performance of our metrics \(SEG\_LM\) and \(SYS\_LM\) is evaluated on all 18 language pairs of WMT19. The experimental results show that our metrics are very competitive (the best mean score for segment-level evaluation on the from-English language pairs, and the best mean scores for system-level evaluation on the from-English and non-English language pairs) when compared with most of the current SOTA reference-free metrics. Furthermore, the reason why the target-side language model works is discussed. Combining our metrics with metrics that estimate the first term \(\log P(\boldsymbol{s}|\boldsymbol{t})\) in Eq. 7 will be our future work.