
1 Introduction

In traditional machine translation (MT) evaluation (also referred to as reference-based MT evaluation), reference texts are provided and compared with system translations. The common metrics for such evaluation include the word-based metrics BLEU [1] and METEOR [2], and the word embedding-based metrics BERTScore [3] and BLEURT [4].

However, reference translations cover only a tiny fraction of possible source sentences, and non-professional translators cannot produce high-quality reference translations [5]. Recently, with the rapid progress of deep learning in multilingual language processing [6, 7], there has been growing interest in reference-free MT evaluation [8], also referred to as “quality estimation” (QE) in the MT community. In QE, evaluation metrics compare system translations directly with source sentences, and many methods have been proposed for this task. Popović et al. [9] exploited a bag-of-words translation model for quality estimation, which sums the likelihoods of aligned word pairs between source and translation texts. Specia et al. [10] used language-agnostic linguistic features extracted from source texts and system translations to estimate quality. YiSi-2 [11] evaluates system translations by summing similarity scores over word pairs that are best-aligned mutual translations. Prism-src [12] frames MT evaluation as scoring machine translation output with a sequence-to-sequence paraphraser conditioned on the source text. COMET-QE [13, 14] encodes segment-level representations of the source and translation texts as the input to a feed-forward regressor. To mitigate the misalignment of cross-lingual word embedding spaces, Zhao et al. [15] proposed post-hoc re-alignment strategies that integrate a target-side GPT [16] language model. Song et al. [17] proposed an unsupervised metric, SentSim, which incorporates a notion of sentence semantic similarity. Wan et al. [18] proposed a unified framework (UniTE) with monotonic regional attention and unified pretraining for reference-only, source-only and source-reference-combined MT evaluation.

Fig. 1. First two principal components of contextual token embeddings of mBERT, XLM-R and pmmb-v2 for 100 zh-en parallel sentences in WMT19, obtained by t-SNE (the larger the areas where the two languages do not cover each other, the worse the word embedding alignment)

In short, most of the methods mentioned above try to achieve cross-lingual alignment directly at the lexical, word embedding or sentence embedding level, which is critically important for reference-free MT evaluation. In this paper, we find that cross-lingual word embedding alignment can be achieved implicitly by multilingual knowledge distillation (MKD) [19] for sentence embedding alignment, whose training procedure maps the sentence embeddings of the source and target sentences in parallel data, obtained with a multilingual pretrained model, to the same location in vector space as the source sentence embedding obtained with a monolingual Sentence-BERT (SBERT) [20] model, using an MSE loss. To illustrate the alignment effect intuitively, the simple example in Fig. 1 compares the distilled multilingual model (paraphrase-multilingual-mpnet-base-v2, hereinafter referred to as pmmb-v2) with the classic multilingual pretrained models mBERT [6] and XLM-R [7]. In Fig. 1, each point represents a word in 100 zh-en parallel sentences from the WMT19 news translation shared task [8] and is composed of the first two principal components of the contextual word embeddings of the respective model obtained by t-SNE [21]. Because each word can be well aligned in these high-quality parallel sentences, the points representing words of the two languages should cover each other if no misalignment exists in the cross-lingual embedding space. From Fig. 1, it can be clearly seen that the misalignment areas in parts (c) and (d) for pmmb-v2 are much smaller than those in parts (a) and (b) for mBERT and XLM-R. This shows that multilingual knowledge distillation benefits cross-lingual word embedding alignment.
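
For readers who wish to reproduce this kind of visualization, a minimal sketch is given below (it is not the code used to draw Fig. 1); the example sentences are illustrative placeholders rather than the WMT19 data, and only pmmb-v2 is shown.

import numpy as np
import torch
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
from transformers import AutoModel, AutoTokenizer

NAME = "sentence-transformers/paraphrase-multilingual-mpnet-base-v2"  # pmmb-v2
tok = AutoTokenizer.from_pretrained(NAME)
model = AutoModel.from_pretrained(NAME).eval()

# Illustrative parallel sentences (placeholders, not the WMT19 zh-en test set).
zh_sents = ["这是一个测试。", "我喜欢机器翻译。"]
en_sents = ["This is a test.", "I like machine translation."]

def token_vectors(sentences):
    # Last-layer contextual token embeddings, concatenated over all sentences.
    outs = []
    for s in sentences:
        inputs = tok(s, return_tensors="pt")
        with torch.no_grad():
            outs.append(model(**inputs).last_hidden_state[0])
    return torch.cat(outs).numpy()

zh_vecs, en_vecs = token_vectors(zh_sents), token_vectors(en_sents)
points = TSNE(n_components=2, perplexity=5).fit_transform(
    np.concatenate([zh_vecs, en_vecs]))

# Points of the two languages should overlap if the embedding spaces are aligned.
n_zh = len(zh_vecs)
plt.scatter(points[:n_zh, 0], points[:n_zh, 1], label="zh tokens")
plt.scatter(points[n_zh:, 0], points[n_zh:, 1], label="en tokens")
plt.legend()
plt.show()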

In this paper, within the framework of BERTScore, we incorporate multilingual knowledge distillation into MT evaluation and propose a reference-free metric, BERTScore-MKD. We then test the performance of BERTScore-MKD on the into-English language pairs of WMT17-19 for both system-level and segment-level evaluations. The experimental results show that BERTScore-MKD is very competitive with the current SOTA reference-free metrics known to us. Furthermore, the comparison results on WMT19 show that BERTScore-MKD is also suitable for reference-based MT evaluation.

2 Method

In this section, we present the metric BERTScore-MKD after describing multilingual knowledge distillation and BERTScore.

Fig. 2. Multilingual knowledge distillation [19]

2.1 Multilingual Knowledge Distillation

The procedure of multilingual knowledge distillation (MKD) proposed by Reimers and Gurevych [19] for sentence embedding alignment is depicted in Fig. 2, where the teacher model is a monolingual SBERT [20], which achieves state-of-the-art performance on various sentence embedding tasks, and the student model is a multilingual pretrained model such as mBERT or XLM-R before distillation. As Fig. 2 shows, MKD aligns the paired sentence embeddings directly. The effectiveness of the student model’s sentence embeddings after distillation has been demonstrated for over 50 languages from various language families [19].
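
To make the distillation objective concrete, a minimal PyTorch sketch of the MKD loss in Fig. 2 is given below; it is not the original training code of [19], and teacher_embed and student_embed are assumed to be callables that map a list of sentences to a tensor of mean-pooled sentence embeddings.

import torch
import torch.nn.functional as F

def mkd_loss(teacher_embed, student_embed, src_sents, tgt_sents):
    # teacher_embed: frozen monolingual SBERT teacher (source/English side only).
    # student_embed: multilingual student model being distilled.
    # Both are assumed to return mean-pooled sentence embeddings as tensors.
    with torch.no_grad():
        t_src = teacher_embed(src_sents)   # teacher embedding of the source sentences
    s_src = student_embed(src_sents)       # student embedding, source language
    s_tgt = student_embed(tgt_sents)       # student embedding, target language
    # Both student embeddings are regressed onto the teacher's source embedding,
    # so parallel sentences end up at (approximately) the same point in vector space.
    return F.mse_loss(s_src, t_src) + F.mse_loss(s_tgt, t_src)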

2.2 BERTScore

BERTScore [3] is an effective and robust automatic evaluation metric for text generation, which computes a similarity score between each token in the candidate sentence \(\hat{\boldsymbol{x}}\) and each token in the reference sentence \(\boldsymbol{x}\) using contextual embeddings instead of exact matches. In the absence of token importance weighting, the recall R, precision P and F1 score are defined as:

$$\begin{aligned} R=\frac{1}{|\boldsymbol{x}|}\sum _{x_i\in \boldsymbol{x}}\max \limits _{\hat{x}_j\in \hat{\boldsymbol{x}}}{E(x_i\mid \boldsymbol{x})^\top }{E(\hat{x}_j\mid \hat{\boldsymbol{x}})}, \end{aligned}$$
(1)
$$\begin{aligned} P=\frac{1}{|\hat{\boldsymbol{x}}|}\sum _{\hat{x}_j\in \hat{\boldsymbol{x}}}\max \limits _{x_i\in \boldsymbol{x}}{E(\hat{x}_j\mid \hat{\boldsymbol{x}})^\top }{E(x_i\mid \boldsymbol{x})}, \end{aligned}$$
(2)
$$\begin{aligned} F1=2\cdot \frac{P\cdot R}{P+R}, \end{aligned}$$
(3)

where E is a contextual word embedding function whose outputs are pre-normalized so that the similarity computation reduces to an inner product, and \(x_i\) and \(\hat{x}_j\) denote the i-th and j-th tokens in \(\boldsymbol{x}\) and \(\hat{\boldsymbol{x}}\) respectively. For MT evaluation, BERTScore with a pretrained model is usually used as a reference-based metric and demonstrates stronger correlations with human judgments than BLEU; we will show that BERTScore using the distilled student model of Sect. 2.1 is suitable for both reference-free and reference-based MT evaluation.
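
For illustration, a minimal PyTorch sketch of Eqs. 1, 2 and 3 on pre-computed token embeddings is given below (token importance weighting omitted, as above); it is not the official BERTScore implementation.

import torch
import torch.nn.functional as F

def bertscore_f1(ref_emb, cand_emb):
    # ref_emb:  (|x|, d)     token embeddings of the reference/source sentence x
    # cand_emb: (|x_hat|, d) token embeddings of the candidate sentence x_hat
    # Embeddings are L2-normalized so the inner product equals cosine similarity.
    ref_emb = F.normalize(ref_emb, dim=-1)
    cand_emb = F.normalize(cand_emb, dim=-1)
    sim = ref_emb @ cand_emb.T                   # (|x|, |x_hat|) pairwise similarities
    recall = sim.max(dim=1).values.mean()        # Eq. (1)
    precision = sim.max(dim=0).values.mean()     # Eq. (2)
    return 2 * precision * recall / (precision + recall)   # Eq. (3)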

2.3 BERTScore-MKD

Suppose \(\boldsymbol{s}\) and \(\boldsymbol{r}\) are two parallel sentences, which could be denoted as:

$$\begin{aligned} \boldsymbol{s}=(s_1,\ldots ,s_i,\ldots ,s_m), \end{aligned}$$
(4)
$$\begin{aligned} \boldsymbol{r}=(r_1,\ldots ,r_j,\ldots ,r_n), \end{aligned}$$
(5)

where \(s_i\) and \(r_j\) denote the i-th and j-th tokens in \(\boldsymbol{s}\) and \(\boldsymbol{r}\) respectively.

According to the mean pooling strategy used in SBERT and MKD [19, 20], the sentence embedding is the average of all token embeddings in the last layer of the given model. So the sentence embeddings of \(\boldsymbol{s}\) and \(\boldsymbol{r}\) produced by the student model can be represented as:

$$\begin{aligned} SE(\boldsymbol{s}) = \frac{1}{m}\sum \limits _{i=1}^{m}{E_{LL}(s_i\mid \boldsymbol{s})}, \end{aligned}$$
(6)
$$\begin{aligned} SE(\boldsymbol{r}) =\frac{1}{n}\sum \limits _{j=1}^{n}{E_{LL}(r_j\mid \boldsymbol{r})}, \end{aligned}$$
(7)

where SE denotes the sentence embedding of the given sentence, and \(E_{LL}\) stands for the contextual word embedding function in the last layer (LL).

As illustrated in Fig. 2, after the student model is distilled with the MSE loss, we have \(SE(\boldsymbol{s})\approx SE(\boldsymbol{r})\), i.e.,

$$\begin{aligned} \frac{1}{m}\sum \limits _{i=1}^{m}{E_{LL}(s_i\mid \boldsymbol{s})} \approx \frac{1}{n}\sum \limits _{j=1}^{n}{E_{LL}(r_j\mid \boldsymbol{r})}. \end{aligned}$$
(8)

Therefore, from the above equation, it can be intuitively seen that the token embeddings in the last layer of the student model have some degree of alignment (most directly when m and n are close to 1, i.e., for very short sentences). For paired sentences of normal length, the word embedding alignment is also largely maintained, as shown in parts (c) and (d) of Fig. 1. However, it is not obvious whether part (d) (last layer) has a better alignment effect than part (c) (9th layer). We will show in Sect. 3.4 that the last layer is the best choice for cross-lingual word embedding alignment, and we denote BERTScore using the last-layer embeddings of the student model as the metric BERTScore-MKD. Nevertheless, the reason why cross-lingual word embedding alignment can be achieved by MKD still merits in-depth analysis.
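
Putting the pieces together, a minimal sketch of BERTScore-MKD with the publicly released pmmb-v2 checkpoint is given below, reusing the bertscore_f1 function sketched in Sect. 2.2; the special tokens produced by the tokenizer are kept here, so this is an approximation rather than the exact implementation used in the experiments.

import torch
from transformers import AutoModel, AutoTokenizer

MODEL = "sentence-transformers/paraphrase-multilingual-mpnet-base-v2"  # pmmb-v2
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModel.from_pretrained(MODEL).eval()

def last_layer_embeddings(sentence):
    # Contextual token embeddings E_LL(. | sentence) from the last layer of pmmb-v2.
    inputs = tok(sentence, return_tensors="pt")
    with torch.no_grad():
        return model(**inputs).last_hidden_state[0]   # (num_tokens, dim)

def bertscore_mkd(source, translation):
    # Reference-free segment-level score: greedy matching between the source
    # sentence and the system translation in the distilled embedding space,
    # using the bertscore_f1 sketch from Sect. 2.2.
    return bertscore_f1(last_layer_embeddings(source),
                        last_layer_embeddings(translation))

# Example: bertscore_mkd("这是一个测试。", "This is a test.")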

3 Experiments

In this section, we evaluate the performance of our metric BERTScore-MKD by correlating its scores with human judgments of translation quality for reference-free MT evaluations, where both segment-level and system-level evaluations are included for full comparisons and are defined as follows.

Segment-level evaluation (the input is a source sentence and a system translation sentence): The metric BERTScore-MKD chooses the outputs of the last layer in the model pmmb-v2 as the cross-lingual word embedding function, and takes the F1 score (without token importance weighting) in Eq. 3 as its value.

System-level evaluation (the input is a set of source sentences and the corresponding system translation sentences): The mean value of BERTScore-MKD over all sentence pairs is used as the system-level score.

It should be pointed out that the above definitions are for reference-free MT evaluations, and reference-based MT evaluation is implemented by just replacing source sentences with reference sentences.
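
A minimal sketch of the two settings described above is given below, reusing the bertscore_mkd function sketched at the end of Sect. 2.3; the function names and signatures are illustrative.

def system_level_score(source_sents, translation_sents):
    # System-level score: mean of the segment-level BERTScore-MKD values
    # over all (source, translation) pairs of one system.
    scores = [bertscore_mkd(s, t) for s, t in zip(source_sents, translation_sents)]
    return float(sum(scores) / len(scores))

def reference_based_score(reference_sents, translation_sents):
    # Reference-based variant: the source sentences are simply replaced
    # with the reference translations.
    return system_level_score(reference_sents, translation_sents)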

3.1 Datasets

The source language sentences, and their system and reference translations are collected from the WMT17-19 news translation shared tasks [8, 22, 23], which contain predictions of 166 translation systems across 16 language pairs in WMT17, 149 translation systems across 14 language pairs in WMT18, and 233 translation systems across 18 language pairs in WMT19. Each language pair in WMT17-19 has about 3,000 source sentences, and each is associated with one reference translation and with the automatic translations generated by participating systems. In this paper, all the into-English language pairs in WMT17-19 are chosen for reference-free MT evaluation.

3.2 Baselines

In this paper, a range of reference-free metrics are chosen for comparison with our metric BERTScore-MKD: LASIM and LP [24], UNI and UNI+ [8], YiSi-2 [11], CLP-UMD [15] and SentSim [17]. To the best of our knowledge, these metrics cover most of the current SOTA metrics for reference-free MT evaluation. In addition, BERTScore using the multilingual pretrained model XLM-R, denoted as BERTScore+XLM-R, is selected to directly compare cross-lingual word embedding alignment effects with our metric BERTScore-MKD; the reference-based baseline metrics BLEU and sentBLEU [8] are also included for reference. It should be pointed out that only the results of BERTScore-MKD and BERTScore+XLM-R are computed in this paper; the results of the other metrics are taken from their respective papers.

3.3 Results

Evaluation Measures. Pearson correlation (r) and Kendall’s Tau correlation (\(\tau \)) [8] are used as measures for metric evaluations, and are defined as follows:

$$\begin{aligned} r = \frac{\sum _{i=1}^{n}(H_i-\overline{H})(M_i-\overline{M})}{\sqrt{\sum _{i=1}^{n}(H_i-\overline{H})^2}\cdot \sqrt{\sum _{i=1}^{n}(M_i-\overline{M})^2}}, \end{aligned}$$
(9)
$$\begin{aligned} \tau = \frac{|Concordant| - |Discordant|}{|Concordant| + |Discordant|}. \end{aligned}$$
(10)
Table 1. Segment-level metric results (Pearson correlation) for the into-English language pairs of WMT17. Best results excluding sentBLEU are in bold.
Table 2. Segment-level metric results (Kendall’s Tau correlation) for the into-English language pairs of WMT19. Best results excluding sentBLEU are in bold.

In Eq. 9, \(H_i\) are human assessment scores of all systems (or sentence pairs) in a given translation direction, \(M_i\) are the corresponding scores predicted by a given metric, and \(\overline{H}\) and \(\overline{M}\) are their mean values respectively. In Eq. 10, Concordant is the set of all human comparisons for which a given metric suggests the same order, and Discordant is the set of all human comparisons with which a given metric disagrees. It should be pointed out that the measure r could be used for both system-level and segment-level evaluations, while the measure \(\tau \) is mainly for segment-level evaluation.
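
A small sketch of the two measures, written directly from Eqs. 9 and 10 rather than taken from the official WMT evaluation scripts, is given below; the pairwise-comparison input format of kendall_tau is an assumption.

import numpy as np

def pearson_r(human, metric):
    # Eq. (9): Pearson correlation between human scores and metric scores.
    h, m = np.asarray(human, float), np.asarray(metric, float)
    hc, mc = h - h.mean(), m - m.mean()
    return (hc * mc).sum() / (np.sqrt((hc ** 2).sum()) * np.sqrt((mc ** 2).sum()))

def kendall_tau(human_prefers_first, metric_prefers_first):
    # Eq. (10): each element is True/False for "the first translation of the pair
    # is better" according to the human comparison and to the metric, respectively.
    concordant = sum(h == m for h, m in zip(human_prefers_first, metric_prefers_first))
    discordant = len(human_prefers_first) - concordant
    return (concordant - discordant) / (concordant + discordant)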

Segment-Level Results. Table 1 and Table 2 show the comparison results of the metrics for the reference-free segment-level evaluations on the into-English language pairs of WMT17 and WMT19 respectively.

From the comparison results of BERTScore+XLM-R and BERTScore-MKD in Table 1 and Table 2, it can be seen that BERTScore-MKD achieves significantly better results on all the into-English language pairs of WMT17 (avg. \(0.396\rightarrow 0.563\)) and WMT19 (avg. \(0.136\rightarrow 0.188\)), which indicates that the cross-lingual word embeddings obtained by MKD are much better aligned, since the word embeddings are the only difference between the two metrics.

Table 3. System-level metric results (Pearson correlation) for the into-English language pairs of WMT17. Best results excluding BLEU are in bold.
Table 4. System-level metric results (Pearson correlation) for the into-English language pairs of WMT18. Best results excluding BLEU are in bold.
Table 5. System-level metric results (Pearson correlation) for the into-English language pairs of WMT19. Best results excluding BLEU are in bold.

When compared with the current SOTA metrics considered in this paper, our metric BERTScore-MKD obtains the best average scores and ranks first on all language pairs except zh-en of WMT19 and 3 language pairs (cs-en, ru-en and tr-en) of WMT17. Moreover, since SentSim [17] adopts the sentence embeddings of SBERT while BERTScore-MKD uses the word embeddings distilled from SBERT, it can be seen from Table 1 that using word embeddings performs better than using sentence embeddings (avg. \(0.563\;vs.\;0.556\)), which suggests that the cross-lingual word embeddings obtained by MKD are a better choice for reference-free MT evaluation.

System-Level Results. Tables 3, 4 and 5 illustrate the comparison results of the metrics for the reference-free system-level evaluations on the into-English language pairs of WMT17, WMT18 and WMT19 respectively.

From the experimental results in Tables 3, 4 and 5, it could be seen again that BERTScore-MKD has significantly better results than BERTScore+XLM-R on all the into-English language pairs of WMT17-19 (avg. \(0.629\rightarrow 0.942\), \(0.637\rightarrow 0.949\), \(0.396\rightarrow 0.806\)) except fi-en of WMT18 (\(0.952\;vs.\;0.957\)), and gets the best average scores on the into-English language pairs of WMT17-19 when the current SOTA metrics are chosen for comparison. Moreover, the reference-free metric BERTScore-MKD even gets better results than the reference-based metric BLEU on WMT17 and WMT18 (avg. \(0.942\;vs.\;0.933, 0.949\;vs. \;0.931\)).

Therefore, from the segment-level and system-level experimental results in Tables 1, 2, 3, 4 and 5, it could be seen that BERTScore-MKD is very competitive for reference-free MT evaluation when the current SOTA metrics that we know are chosen for comparison. And in Sect. 3.5 we will show that BERTScore-MKD is also suitable for reference-based MT evaluation.

3.4 Effects of Embedding Layers

Since BERTScore is sensitive to the layer of the model selected to generate the contextual token embeddings [3], we investigate which layer of the model pmmb-v2 is the best choice for BERTScore-MKD as a reference-free metric through experimental comparisons on the into-English language pairs of WMT19.

BERTScore+XLM-R is chosen for comparison, and the mean values on the into-English language pairs of WMT19 for segment-level and system-level evaluations are illustrated in Fig. 3.
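
The per-layer comparison can be obtained by swapping the last-layer extraction for embeddings from an arbitrary encoder layer, as in the sketch below, which reuses tok, model and bertscore_f1 from the sketches in Sect. 2.2 and 2.3 and assumes the standard output_hidden_states interface of the transformers library.

import torch

def token_embeddings_by_layer(sentence, layer):
    # hidden_states[0] is the embedding layer output; hidden_states[12] is the
    # last encoder layer of pmmb-v2 (a 12-layer encoder).
    inputs = tok(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden_states = model(**inputs, output_hidden_states=True).hidden_states
    return hidden_states[layer][0]

def bertscore_mkd_at_layer(source, translation, layer=12):
    # Same greedy matching as before, but on embeddings from the chosen layer.
    return bertscore_f1(token_embeddings_by_layer(source, layer),
                        token_embeddings_by_layer(translation, layer))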

Fig. 3. Mean measure values of BERTScore-MKD and BERTScore+XLM-R with different layers of word embeddings for segment-level and system-level reference-free MT evaluations on the into-English language pairs of WMT19

From Fig. 3, it can be clearly seen that the last layer is the best choice for BERTScore-MKD on both segment-level and system-level evaluations, which is consistent with our analysis. It is also interesting that the best layers of BERTScore+XLM-R for reference-free and reference-based evaluations are almost the same (the 9th). Meanwhile, our metric BERTScore-MKD outperforms BERTScore+XLM-R at every layer for both segment-level and system-level reference-free MT evaluations.

Table 6. System-level reference-based metric results (Pearson correlation) for the into-English language pairs of WMT19. Best results are in bold.
Table 7. System-level reference-based metric results (Pearson correlation) for the from-English language pairs of WMT19. Best results are in bold.

3.5 As Reference-Based Metric

In this section we investigate the performance of BERTScore-MKD as a reference-based metric, where the source sentences in the input are replaced with reference sentences. As the system translations and the reference sentences are in the same language, there is no need for cross-lingual alignment. Therefore, besides the last layer, BERTScore-MKD also uses the outputs of the 9th layer (recommended in [3]) of the model pmmb-v2 as the contextual word embedding function.

Table 6 and Table 7 report the results of BERTScore-MKD as a reference-based metric for system-level evaluations on the into-English and from-English language pairs of WMT19, with the metrics BLEU and BERTScore+XLM-R chosen for comparison.

From the comparison results in Table 6 and Table 7, it can be seen that both BERTScore+XLM-R and BERTScore-MKD are clearly better than the classical metric BLEU, and our metric BERTScore-MKD performs almost the same as the current SOTA metric BERTScore+XLM-R. Meanwhile, the 9th layer is slightly better than the last layer for BERTScore-MKD. In summary, BERTScore-MKD shows its effectiveness and robustness as a reference-based metric.

4 Conclusion

In this paper, we find that cross-lingual word embedding alignment can be achieved implicitly through multilingual knowledge distillation (MKD) for sentence embedding alignment. Within the framework of BERTScore, a reference-free metric BERTScore-MKD is proposed by incorporating MKD into MT evaluation. As shown in the performance tests of BERTScore-MKD on the into-English language pairs of WMT17-19 for both segment-level and system-level evaluations, the reference-free metric BERTScore-MKD is very competitive (best mean scores on WMT17-19 and better than BLEU on WMT17-18) with the current SOTA metrics known to us. Furthermore, the comparison results on WMT19 show the effectiveness and robustness of BERTScore-MKD as a reference-based metric. Although we have found that MKD can align cross-lingual word embeddings and that the last layer of the distilled student model is the best choice for reference-free MT evaluation, the reason why MKD achieves this alignment is still worthy of further study.