
1 Introduction

In recent years, with the development of deep learning, Machine Translation (MT) systems have made several major breakthroughs and have been widely applied. Machine translation quality estimation (Quality Estimation, QE) aims to automatically evaluate the quality of machine translation without a gold reference [1]. The quality can be measured with different metrics, such as HTER (Human-targeted Translation Edit Rate) [2] or the DA (Direct Assessment) score [3].

Previous methods treat QE as a supervised problem, and they require large amounts of in-domain translations annotated with quality labels for training [4, 5]. However, such large collections of data are only available for a small set of languages in limited domains.

Recently, Fomicheva et al. [6] were the first to perform QE in an unsupervised manner. They explore different kinds of information that can be extracted from the MT system as a by-product of translation, and use them to fit the quality estimation output. Since their methods are based on glass-box features, they can only be applied in limited situations and demand probing inside the machine translation system.

In this work, we propose, for the first time, to perform unsupervised QE in a black-box setting, without relying on human-annotated data or model-internal features. We create pseudo data based on Machine Translation Evaluation (MTE) metrics, such as BLEU, TER and BERTscore, from publicly accessible parallel translation datasets. The MTE-metric based data are then used to fine-tune several multilingual pre-trained language models to evaluate translation output.

To the best of our knowledge, this is the first work to utilize MTE methods for QE. Our method does not involve complex architecture engineering and is easy to implement. We performed experiments on two language pairs of the MLQE dataset, outperforming Fomicheva et al. [6] by a large margin. We even outperformed two supervised models from Fomicheva et al., revealing the potential of MTE-based methods for QE.

2 Background

2.1 Machine Translation Evaluation

Similar to QE, Machine Translation Evaluation (MTE) also aims to evaluate machine translation output. The difference between MTE and QE is that MTE normally requires annotated references, while QE is performed without references and relies heavily on source sentences.

Human evaluation is often the best indicator of the quality of a system. However, designing crowd sourcing experiments such as Direct Assessment (DA) [3] is an expensive and high-latency process, which does not easily fit in a daily model development pipeline.

Meanwhile, automatic metrics, such as BLEU [7] or TER [2], can automatically provide an acceptable proxy for quality based on string matching or hand-crafted rules; they have been used in various scenarios and have guided the development of machine translation. However, these metrics cannot appropriately reward semantic or syntactic variations of a given reference [8].

Recently, following the emergence of pre-trained language models, a few contextual-embedding based metrics have been proposed, such as BERTscore [8] and BLEURT [9]. These metrics compute a similarity score between the candidate sentence and the reference based on token embeddings provided by pre-trained models. By refraining from shallow string matching and incorporating lexical synonymy, BERTscore achieves higher agreement with human evaluation.

Despite the intrinsic correlation between MTE and QE, few works have leveraged MTE methods to deal with the task of QE.

2.2 Machine Translation Quality Estimation

Although the performance of machine translation systems is usually evaluated by reference-based automatic metrics, there are many scenarios where a gold reference is unavailable or hard to obtain. Besides, reference-based metrics completely ignore the source segment [10]. This has led to widespread interest in QE research.

Early methods treated QE as a machine learning problem [11]. Their models could be divided into a feature extraction module and a classification module. Heavily reliant on heuristic, hand-crafted feature design, these methods did not manage to provide reliable estimation results.

With the rise of deep learning in natural language processing, a few works aimed to integrate deep neural networks into QE systems. Kim et al. [12] first proposed to leverage massive parallel machine translation data to improve QE results, applying an RNN-based machine translation model to extract high-quality features. Fan et al. [13] replaced the RNN-based MT model with the Transformer and achieved strong performance.

After the emergence of BERT, a few works leveraged pre-trained models for the QE task [14, 15]. Language models pre-trained on large amounts of text are naturally suited to the data-scarce QE task, and have led to significant improvements without complex architecture engineering.

While most models rely on manually annotated data, there have also been a few attempts to perform QE in an unsupervised manner. The most important work is Fomicheva et al. [6], who proposed to fit human DA scores with three categories of model-related features: a set of unsupervised quality indicators that can be produced as a by-product of MT decoding; the attention distribution inside the Transformer architecture; and model uncertainty quantified by Monte Carlo dropout. Since these methods are all based on glass-box features, they can only be applied in limited scenarios where inner exploration of the MT model is possible.

3 Model Description

3.1 Pretrained Models for Quality Estimation

Our QE predictor is based on three different pre-trained models, namely BERT [16], XLM [17], and XLM-R [18], as shown in Fig. 1.

Fig. 1. Pre-trained model for quality estimation.

Given one source sentence and its translated result, our model concatenates them and feeds them into the pre-trained encoder. To leverage global contextual information when making sentence-level predictions, an extra bidirectional recurrent neural network layer is applied on top of the pre-trained model.
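A minimal sketch of such a predictor, assuming PyTorch and the HuggingFace Transformers library; the encoder checkpoint, GRU hidden size and mean-pooling strategy are illustrative assumptions rather than the exact configuration described above:

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class QEPredictor(nn.Module):
    """Pre-trained encoder + bidirectional RNN + regression head for sentence-level QE."""
    def __init__(self, encoder_name="xlm-roberta-base", rnn_hidden=256):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        hidden = self.encoder.config.hidden_size
        # Bidirectional RNN on top of the encoder to capture global context.
        self.birnn = nn.GRU(hidden, rnn_hidden, batch_first=True, bidirectional=True)
        self.regressor = nn.Linear(2 * rnn_hidden, 1)  # predicts a DA-like score

    def forward(self, input_ids, attention_mask):
        # Source and translation are already concatenated by the tokenizer.
        states = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state
        rnn_out, _ = self.birnn(states)
        # Mean-pool over non-padding positions for a sentence-level representation.
        mask = attention_mask.unsqueeze(-1).float()
        pooled = (rnn_out * mask).sum(1) / mask.sum(1)
        return self.regressor(pooled).squeeze(-1)

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
batch = tokenizer(["The cat sits on the mat."], ["Die Katze sitzt auf der Matte."],
                  padding=True, truncation=True, return_tensors="pt")
model = QEPredictor()
score = model(batch["input_ids"], batch["attention_mask"])  # shape: (batch,)
```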

Despite its shared multilingual vocabulary, BERT is originally a monolingual model [19], pre-trained with sentence pairs from one language or another. To help BERT adapt to our bilingual scenario, where the inputs are two sentences from different languages, we implement a further pre-training step.

During the further pre-training step, we combine bilingual sentence pairs from a large-scale parallel dataset, randomly mask sub-word units with a special token, and then train the BERT model to predict the masked tokens. Since our input consists of two parallel sentences, when predicting a masked word given its context and translation reference, BERT can capture the lexical alignment and semantic relevance between the two languages.
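A rough sketch of this further pre-training step, assuming the HuggingFace Transformers library; the checkpoint name and the 15% masking probability are illustrative assumptions:

```python
from transformers import (BertForMaskedLM, BertTokenizerFast,
                          DataCollatorForLanguageModeling)

tokenizer = BertTokenizerFast.from_pretrained("bert-base-multilingual-cased")
model = BertForMaskedLM.from_pretrained("bert-base-multilingual-cased")
# Masking ratio is an assumption; the paper only states that sub-words are randomly masked.
collator = DataCollatorForLanguageModeling(tokenizer, mlm=True, mlm_probability=0.15)

def mlm_step(src_sentences, tgt_sentences, optimizer):
    """One further pre-training step on concatenated bilingual sentence pairs."""
    enc = tokenizer(src_sentences, tgt_sentences, padding=True, truncation=True)
    # The collator replaces a random subset of sub-words with [MASK] and builds MLM labels.
    batch = collator([{"input_ids": ids} for ids in enc["input_ids"]])
    loss = model(input_ids=batch["input_ids"],
                 attention_mask=batch.get("attention_mask"),
                 labels=batch["labels"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```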

In contrast, XLM and XLM-R are multilingual models by nature, receiving two sentences from different languages as input during training, so a further pre-training step is unnecessary. The training strategies and data of XLM and XLM-R are designed differently and are explained in detail in their respective papers.

3.2 MTE-Based QE Data

Although sentence pairs of source and machine-translated text are readily accessible (we only need to translate the source text into the target language with an MT system), the absence of DA scores is our biggest challenge. Even in the supervised scenario, human-annotated DA scores are scarce and limited [5]. Therefore, we propose to use MTE metrics to approximate human assessment, thus creating massive pseudo data for training the QE system. Our approach can be described as follows:

First, we decode the source sentences of a parallel corpus into the target language. Second, we use automatic MTE metrics to evaluate the quality of the output sentences against the references. This step requires no human annotation or time-consuming training; the MTE-based evaluation gives a roughly accurate quality assessment and can be used as a substitute for human-annotated DA scores. Third, the pseudo DA scores, combined with the source and translated sentence pairs, are used to train our QE system (Fig. 2).

Fig. 2. MTE-metrics based QE training procedure.
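The three steps can be summarized in a short, hedged sketch; `translate` and `metric` are placeholders for any MT system and any MTE metric, and the JSONL output format is our own assumption:

```python
import json

def build_mte_qe_data(parallel_pairs, translate, metric, out_path="qe_pseudo_data.jsonl"):
    """Create MTE-based pseudo QE data from a parallel corpus.

    parallel_pairs: iterable of (source, reference) sentence pairs
    translate:      callable mapping a source sentence to an MT hypothesis
    metric:         callable mapping (hypothesis, reference) to a quality score
    """
    with open(out_path, "w", encoding="utf-8") as out:
        for source, reference in parallel_pairs:
            hypothesis = translate(source)         # step 1: decode into the target language
            score = metric(hypothesis, reference)  # step 2: reference-based MTE score as pseudo DA
            # step 3: (source, hypothesis, pseudo score) becomes one QE training example
            out.write(json.dumps({"src": source, "mt": hypothesis, "score": score},
                                 ensure_ascii=False) + "\n")
```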

We tried three different MTE metrics to fit DA evaluation, namely TER [2], BLEU [7], and BERTscore [8].

TER uses word edit distance to quantify similarity, based on the number of edit operations required to get from the candidate to the reference, and then normalizes edit distance by the number of reference words, as shown in Eq. 1.

$$ TER = \frac{\#\ \text{of edits}}{\text{average}\ \#\ \text{of reference words}} $$
(1)
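As a rough illustration of Eq. 1, the sketch below computes a word-level edit-distance approximation of TER; real TER additionally counts block shifts as single edits, which this sketch omits:

```python
def approx_ter(hypothesis: str, reference: str) -> float:
    """Word-level edit distance divided by the reference length (no shift operations)."""
    hyp, ref = hypothesis.split(), reference.split()
    # Standard dynamic-programming edit distance over words.
    dist = [[0] * (len(ref) + 1) for _ in range(len(hyp) + 1)]
    for i in range(len(hyp) + 1):
        dist[i][0] = i
    for j in range(len(ref) + 1):
        dist[0][j] = j
    for i in range(1, len(hyp) + 1):
        for j in range(1, len(ref) + 1):
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1,         # insertion
                             dist[i - 1][j - 1] + cost)  # substitution / match
    return dist[len(hyp)][len(ref)] / max(len(ref), 1)

print(approx_ter("the cat sat on mat", "the cat sat on the mat"))  # 1/6 ≈ 0.167
```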

BLEU is the most widely used metric in machine translation. It counts the n-grams of the candidate sentence that also occur in the reference sentence. Each n-gram in the reference can be matched at most once, and very short candidates are discouraged by a brevity penalty, as shown in Eq. 2.

$$ BLEU = BP \cdot \exp\left( \sum\nolimits_{n = 1}^{N} w_{n} \log p_{n} \right), \quad BP = \begin{cases} 1 & \text{if } c > r \\ e^{\left( 1 - r/c \right)} & \text{if } c \le r \end{cases} $$
(2)

where \( p_n \) denotes the modified n-gram precision, \( w_n \) denotes the positive weight for each n-gram order (typically uniform), c denotes the length of the candidate translation and r denotes the effective reference corpus length.
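A direct, unsmoothed transcription of Eq. 2 with uniform weights \( w_n = 1/N \); production toolkits such as sacrebleu add smoothing and tokenization, which are omitted here:

```python
import math
from collections import Counter

def sentence_bleu(hypothesis: str, reference: str, max_n: int = 4) -> float:
    """Sentence-level BLEU per Eq. 2; returns 0 if any n-gram order has no match."""
    hyp, ref = hypothesis.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_ngrams = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
        ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        # Clipped counts: each reference n-gram can be matched at most once per occurrence.
        matches = sum(min(c, ref_ngrams[g]) for g, c in hyp_ngrams.items())
        total = max(sum(hyp_ngrams.values()), 1)
        if matches == 0:
            return 0.0
        log_precisions.append(math.log(matches / total))
    c, r = len(hyp), len(ref)
    bp = 1.0 if c > r else math.exp(1 - r / c)   # brevity penalty
    return bp * math.exp(sum(log_precisions) / max_n)
```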

BERTscore calculates the cosine similarity between a reference token and a candidate token based on their contextual embeddings provided by a pre-trained model. The complete score matches each token in the reference to a token in the candidate to compute recall, and each token in the candidate to a token in the reference to compute precision, and then combines precision and recall into an F1 measure, as displayed in the following equations.

$$ P_{BERT} \, = \,\frac{1}{{\left| {\hat{x}} \right|}}\sum\nolimits_{{\hat{x}_{j} \in \hat{x}}} {\max_{{x_{i} \in x}} } x_{i}^{\text{T}} \hat{x}_{j} $$
(3)
$$ R_{BERT} \, = \,\frac{1}{\left| x \right|}\sum\nolimits_{{x_{i} \in x}} {\max_{{\hat{x}_{j} \in \hat{x}}} } x_{i}^{\text{T}} \hat{x}_{j} $$
(4)
$$ F_{BERT} \, = \,2\frac{{P_{BERT} \, \cdot \, R_{BERT} }}{{P_{BERT} \, + \,R_{BERT} }} $$
(5)

where x and \( \hat{x} \) denote the contextual embedding for each token in reference and candidate sentences, respectively.
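A minimal sketch of Eqs. 3-5, assuming the contextual embeddings have already been extracted from a pre-trained model; the official BERTscore implementation additionally supports idf weighting and baseline rescaling, which are omitted here:

```python
import torch

def bert_score(ref_emb: torch.Tensor, cand_emb: torch.Tensor):
    """Greedy-matching precision/recall/F1 of Eqs. 3-5.

    ref_emb:  (|x|, d)     contextual embeddings of the reference tokens
    cand_emb: (|x_hat|, d) contextual embeddings of the candidate tokens
    Embeddings are L2-normalised so the dot product equals the cosine similarity.
    """
    ref = ref_emb / ref_emb.norm(dim=-1, keepdim=True)
    cand = cand_emb / cand_emb.norm(dim=-1, keepdim=True)
    sim = ref @ cand.T                        # (|x|, |x_hat|) cosine similarity matrix
    recall = sim.max(dim=1).values.mean()     # each reference token to its best candidate token
    precision = sim.max(dim=0).values.mean()  # each candidate token to its best reference token
    f1 = 2 * precision * recall / (precision + recall)
    return precision.item(), recall.item(), f1.item()
```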

With its ability to match paraphrases and capture distant dependencies and ordering, BERTscore has been shown to correlate highly with human assessment [8].

4 Experiment

4.1 Setup

Dataset.

The dataset we use is the MLQE dataset [6], which contains training and development data for six language pairs. We perform our experiments mainly on two high-resource language pairs (English-Chinese and English-German). Since we aim to solve the problem in an unsupervised setting, we only use the 1,000 sentence pairs of development data for each direction.

To train our own MT model, we use the WMT2020 English-Chinese and English-German data, which contain roughly 10 million sentence pairs for each direction after cleaning (a large proportion is reserved to generate QE data).

Fomicheva et al. also provide the MT model that was used to generate their QE sentence pairs, so we have two different MT models to choose from. We explain the influence of different MT models in the next section.

For fine-tuning the pre-trained models, we use the reserved data from the WMT2020 English-Chinese and English-German translation task, randomly sampling 500k sentence pairs for each direction to create MTE-based QE data.

Baseline.

Since few works have addressed unsupervised QE, we mainly compare against Fomicheva et al. [6]. They proposed ten methods that can be grouped into three sets; among them, we display their top two results in each direction, namely D-Lex-Sim and D-TP for English-Chinese, and D-TP and Sent-Std for English-German.

We also compare against supervised methods, including the PredEst models with the default configurations provided by Kepler et al. [14], and the recent state-of-the-art QE system BERT-BiRNN, which augments BERT with two bidirectional RNNs [15]. These two models are trained with the provided 7,000 training pairs.

4.2 Experiment Results

As shown in Table 1 and Table 2, our approach surpasses the best-performing methods of Fomicheva et al. by a large margin in both directions, verifying the effectiveness of MTE-based QE data. We even outperform the supervised BERT-BiRNN in both directions.

Table 1. Experiment results on the English-Chinese MLQE dataset.
Table 2. Experiment results on the English-German MLQE dataset.

Although the provided supervised training data is limited and our best results are achieved by XLM rather than BERT (we explain this in the next section), the result is still encouraging.

The glass-box features, although thoroughly explored by Fomicheva et al., seem unhelpful compared with MTE-metric based methods. These features are no more than statistical cues produced by the machine translation model itself. If we rely on the same MT model to evaluate its own translations, we are constrained by its biases and unable to cope with diverse phenomena.

Moreover, we can also conclude that when fine-tuning pre-trained models for the QE task, the quantity of data matters more than its quality, as shown in Fig. 3. Although our data is generated purely by automatic metrics rather than by human annotators, we still surpass supervised systems trained only on clean data.

Fig. 3. The variation of Pearson's correlation coefficient with the number of training steps. Although the supervised model produces better results in the first few steps, the unsupervised model, as it receives more data over more steps, eventually outperforms it.
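For reference, the Pearson's correlation used here can be computed between predicted and gold DA scores, for instance with SciPy; the score lists below are illustrative only:

```python
from scipy.stats import pearsonr

# predicted: DA scores produced by the QE model; gold: human DA annotations.
predicted = [0.62, 0.71, 0.35, 0.90]
gold = [0.60, 0.80, 0.30, 0.85]
r, _ = pearsonr(predicted, gold)
print(f"Pearson's r = {r:.3f}")
```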

Among our three methods, the BERTscore-based method achieves better results than the statistical-metric based methods, which is reasonable since BERTscore has been shown to correlate better with human assessment. More accurate MTE metrics lead to more natural pseudo data and therefore enable the QE model to perform better.

5 Analysis

5.1 Is BERT Always the Best?

Despite the overwhelming results BERT has achieved on multiple datasets, our scenario demands the ability to process bilingual input, while BERT is originally a monolingual model, treating the input as being from either one language or the other.

In contrast, XLM and XLM-R are multilingual models by nature, pre-trained with bilingual inputs from different languages. Since the QE task aims to evaluate the translation based on a source sentence in another language, XLM and XLM-R should be more suitable. The experiment results in Table 3 verify our hypothesis.

Table 3. Experiment results on MLQE direct assessment data.

Even when augmented with a further pre-training step on bilingual input, as in our experiment, BERT is still not competitive in multilingual scenarios. Multilingual pre-trained models are more suitable than BERT for the QE task.

5.2 Is Black-Box Model Necessary?

While we cannot explore the internal structure of MT model in black-box setting, the input and output of the model are still available. Therefore, when creating source-translation sentence pairs, we can choose to use our own model or the provided black-box model.

Nowadays, neural (especially Transformer-based) MT architectures dominate the machine translation area [20]. Different NMT systems trained with similar data may behave similarly on the same input [21].

Therefore, even with another model trained with slightly different data, the generated translation may still have similar error distribution. Experiment results displayed in Table 4 verify our hypothesis.

Table 4. Results of different data generated by different MT models.

While the data generated by the provided model does obtain a higher correlation, the results obtained with our own model are still competitive. When creating MTE-based QE data, using the provided model is beneficial, but if it is not available, we can approximate its error distribution with a model of similar architecture trained on similar data.

5.3 Where Is the Limitation of QE?

In this section, we perform a case study based on our results on the development set. Since the distributions of our system's output and the real-world QE scores differ considerably, as shown in Fig. 4, we mainly compare the ranking of the same sentence under different methods. Namely, we rank the whole development set according to the scores provided by our system and by the gold labels, and compare the ranking discrepancy for the same sentence between the two.
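A small sketch of this ranking comparison (illustrative, not the exact analysis code used here): rank each sentence under both scorings and inspect the per-sentence rank gap.

```python
import numpy as np

def rank_discrepancy(system_scores, gold_scores):
    """Per-sentence difference between the rank under the system scores and the gold labels."""
    # Double argsort yields the rank of each element; negation makes 0 the best-scored sentence.
    system_rank = np.argsort(np.argsort(-np.asarray(system_scores)))
    gold_rank = np.argsort(np.argsort(-np.asarray(gold_scores)))
    return system_rank - gold_rank  # large |value| marks a badly mis-ranked sentence
```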

In summary, two problems impede the performance of our model.

Firstly, our model relies too heavily on syntactic consistency, while ignoring semantic adequacy, when evaluating a translation. Given a translated sentence with a syntactically consistent structure, our model assigns a very high score even when the translation is semantically erroneous (Fig. 4).

Fig. 4. Distribution of DA scores on the development set. The solid line denotes the output of our system, and the dashed line denotes the gold labels.

As shown in Table 5, although Translation 2 is much better than Translation 1, our method still assigns a higher score to Translation 1 because its syntactic structure is more consistent.

Table 5. Wrong prediction caused by syntactic inconsistency.

This problem originates from the pre-trained models themselves: pre-trained models are prone to relying on spurious statistical cues when making predictions [22], without really understanding the meaning of the sentence. In our training data, most sentence pairs with a consistent syntactic structure are assigned a higher score, a pattern that our model captures and uses as an inappropriate criterion for evaluation.

The second problem is that our system fails to detect erroneously translated words, especially when prior knowledge is needed.

As shown in Table 6, in the first sentence the provided model mistranslates the word Judah, which is a country, as a person's name. In the second sentence, the word consulship, which refers to a period of time, is mistranslated as a building. Understanding why these words are mistranslated requires relevant historical knowledge.

Table 6. Wrong prediction caused by mistranslated words.

The mistranslation of this key information makes the whole sentence incomprehensible, but since there is no grammatical error and the syntactic structure is appropriate, our model regards these outputs as good translations.

For the first problem, we believe it can be alleviated by strategically selecting training samples, with more sentence pairs that are syntactically inconsistent but semantically correct. We leave this as future work.

Since both the QE model and the MT model are based on deep learning, QE can hardly solve problems that the MT model itself cannot solve. More training data may alleviate this problem but can hardly solve it, as more training data does not introduce structured prior knowledge. We believe this is the limitation of QE.

6 Conclusion

Machine translation quality estimation (Quality Estimation, QE) aims to automatically evaluate the quality of machine translation without a reference. Although it has attracted considerable research interest in recent years, few works have addressed QE in an unsupervised manner.

In this paper, we have devised an unsupervised approach to QE that does not rely on any glass-box features. We create massive pseudo data based on automatic machine translation evaluation (MTE) metrics such as BLEU, TER and BERTscore from publicly accessible parallel machine translation datasets. We then use the MTE-based QE data to fine-tune multilingual pre-trained models to predict direct assessment (DA) scores. Our approach surpasses previous unsupervised methods by a large margin, and even surpasses supervised methods, demonstrating the effectiveness of incorporating MTE metrics into QE.

Despite the lack of human-annotated DA scores, MTE metrics can provide a highly reliable evaluation of machine-translated sentences and enable us to perform QE in an unsupervised way. We will continue to explore the application of MTE to QE models and try to push the limits of deep-learning-based QE.