
1 Introduction

With the rapid development of deep learning, neural machine translation (NMT) has attracted much attention in recent years because of its superior performance and minimal need for manual intervention [1, 3]. Korean is the official language of the Korean ethnic group in China and is also used in the Korean Peninsula, the United States, the Russian Far East and other areas where Koreans congregate. Being transnational and transregional is a key feature of Korean. As the Korean ethnic group is one of the 24 ethnic minorities in China that have their own language [20], research on Chinese-Korean neural machine translation plays an important role in promoting the language work of ethnic minorities and strengthening communication and unity among ethnic groups.

At present, most NMT models adopt the teacher forcing strategy during training [17]. Teacher forcing minimizes the difference between the predicted translation and the reference, forcing the prediction to be infinitely close to the reference. First, since no reference is available when predicting unseen sentences, this introduces exposure bias, which may affect the performance and robustness of the model [9]. Second, as languages contain a large number of synonyms and similar expressions, the translation model cannot always generate the ground truth word even with teacher forcing in training; this strategy greatly curbs the diversity of translation and leaves many reasonable translations unreachable [21]. In addition, domestic research on Chinese-Korean machine translation started late, its foundation is weak, and large-scale parallel corpora are lacking. Therefore, improving the performance of Chinese-Korean machine translation under low-resource conditions faces many challenges.

We attempt to introduce a sentence-level evaluation mechanism to guide the training of the model, so that the prediction does not converge completely to the ground truth word, thereby alleviating exposure bias and poor translation diversity. The evaluation mechanism is reference-free and is known as quality estimation (QE) in the Conference on Machine Translation (WMT). The main idea is that, after supervised training, the evaluation model can estimate the quality of an unseen translation without a reference. The guidance mechanism is implemented through reinforcement learning (RL) with policy optimization, which enables the model to optimize the target sequence at the sentence level. To alleviate the well-known instability and high-variance problems of reinforcement learning, we train MLE and RL together and adopt the baseline reward method proposed by Weaver L. [16]. Moreover, previous studies directly used the BLEU value as the reward [18, 19], which leads to serious bias in the model and exacerbates the problem of poor translation diversity. Therefore, we propose a reward function based on the QE score. Meanwhile, a monolingual corpus and Korean text preprocessing at different granularities are used in the training process to overcome data sparsity and improve the quality of machine translation for low-resource languages.

2 Related Work

Machine Translation Quality Estimation: Machine translation quality estimation differs from machine translation evaluation metrics such as BLEU [8], TER [12] and METEOR [7]: it automatically predicts the quality of machine-generated translations without relying on any reference translation. The most commonly used quality score is the human translation edit rate (HTER). QuEst, proposed by Specia L. [13], serves as the baseline model for the machine translation quality estimation task and consists of a feature extraction module and a machine learning module. Kim H. [5] first applied a machine translation model to the quality estimation task and proposed an RNN-based translation quality estimation model. Fan K. [2] replaced the RNN-based machine translation model with a Transformer on the basis of Predictor-Estimator and proposed the bilingual expert model, which improved the performance and interpretability of the evaluation.

Reinforcement Learning-based NMT: A large body of research has demonstrated the advantages of reinforcement learning in sequence generation tasks. Ranzato M. [9] proposed a novel sequence-level training algorithm that directly optimizes the metric used at test time. Wu L. [18] conducted a systematic study on how to train better NMT models using reinforcement learning on several large-scale translation tasks. Keneshloo Y. [4] considered seq2seq problems from the RL point of view and provided a formulation that combines the decision-making power of RL methods with seq2seq models capable of remembering long-term memories.

3 Methodology

In this section, we describe the construction and training of a Chinese-Korean neural machine translation model that incorporates translation quality assessment. We introduce the model architecture, sentence-level translation quality estimation methods, reward function design and training of the whole model.

3.1 Model Overview

To alleviate exposure bias and poor translation diversity, we propose a Chinese-Korean machine translation model that incorporates translation quality estimation. The model introduces a sentence-level evaluation mechanism to guide the prediction so that it does not converge completely to the ground truth word. The overall framework is shown in Fig. 1 and mainly includes two modules: machine translation and machine translation quality estimation. The translation module adopts the encoder-decoder architecture and is consistent with Transformer. The evaluation module adopts the sentence-level machine translation quality estimation model Bilingual Expert. Training is carried out with reinforcement learning.

Fig. 1. The architecture of the Chinese-Korean machine translation module

In the process of machine translation, the NMT system, acting as the agent of reinforcement learning, obtains the state of the environment at the current moment through continuous interaction with the environment. At time step t, the state consists of the source sentence x and the previously generated target prefix \({\hat{y}_{ < t}}\), from which the model predicts \(P({y_t}|x,{\hat{y}_{ < t}})\), where \({\hat{y}_{ < t}}\) denotes the target words predicted by the model before time step t. According to the current state, the agent chooses the next word, obtains the reward for this word selection, and enters the next state; the optimal translation strategy is finally found through reinforcement learning.

As shown in Fig. 2, the machine translation task is described as follows. A machine translation model \({M_\theta }\) with parameter \(\theta \) is trained on the given Chinese-Korean parallel corpus. The model \({M_\theta }\) translates a given source sequence \(\mathbf{{x}}\mathrm{{ = }}\left( {{x_1},{x_2},...,{x_n}} \right) \) into a target sequence \(\mathbf{{y}}\mathrm{{ = }}\left( {{y_1},{y_2},...,{y_m}} \right) \), where n and m are the lengths of the source and target sentences respectively. At time step t, the state \({S_t}\) is defined by the partial target sequence generated by \({M_\theta }\) up to the current time step, and the action \({a_t}\) is defined as the selection of the next word \({y_ {t \,+\,1}}\) in the current environment. A machine translation quality estimation model \({Q_\varphi }\) with parameter \(\varphi \) is trained on translation data with HTER scores. After supervised training, the quality estimation model \({Q_\varphi }\) acts as a generator of reward functions, assigning quality scores to unseen translations and guiding the machine translation model \({M_\theta }\) to interact with the environment to produce the next word \({y_ {t\, +\, 1}}\).

Fig. 2. The schematic diagram of the decision process in translation

3.2 Generate Rewards Through Sentence-Level Quality Estimation

An excellent translation is usually judged along multiple dimensions, such as fluency and fidelity, so it is difficult to abstract the machine translation task into a simple optimization problem. Therefore, instead of manually setting a single rule as the source of the reward function, we use the output of the machine translation quality estimation model as part of the reward. Through a relatively complex network structure, the model can generate a more comprehensive score that correlates better with human evaluation and is more tolerant of translation diversity.

The model \({Q_\varphi }\) in this paper uses the same network structure as Bilingual Expert. It includes a word prediction module based on a bidirectional Transformer and a regression prediction model based on Bi-LSTM. The bidirectional Transformer consists of three parts: encoder self-attention, forward and backward self-attention, and token reconstruction. It acquires hidden state features h by pre-training on a large-scale parallel corpus. The encoder part corresponds to q(h|x, y) and the decoder part corresponds to p(y|h), computed as follows:

$$\begin{aligned} q(h|x,y) = \prod \limits _t {q(\overrightarrow{{h_t}} |x,{y_{< t}})} q(\overleftarrow{{h_t}} |x,{y_{ < t}}) \end{aligned}$$
(1)
$$\begin{aligned} p(y|h) = \prod \limits _t {p({y_t}|\overrightarrow{{h_t}} ,\overleftarrow{{h_t}} )} \end{aligned}$$
(2)

The hidden state \(h\mathrm{{ = }}\left( {{h_1},...,{h_m}} \right) \) combines the forward and backward hidden states and captures the deep translation features of the sentence. The final features are as follows:

$$\begin{aligned} f = \mathrm{{Concat}}\left( {{{\overrightarrow{h}}_t},{{\overleftarrow{h}}_t},{e_{t - 1}},{e_{t + 1}},{f^{mm}}} \right) \end{aligned}$$
(3)

where \({e_{t - 1}},{e_{t + 1}}\) are the embeddings of the two neighboring tokens, and \({f^{mm}}\) denotes the mis-matching features. Finally, the features are fed into a Bi-LSTM, which is trained to predict the HTER score:

$$\begin{aligned} \mathrm{{HTER' = sigmoid}}\left( {{w^T}\left[ {\mathrm{{Bi - LSTM}}\left( f \right) } \right] } \right) \end{aligned}$$
(4)

The loss function in the training process is:

$$\begin{aligned} \mathrm{{arg min||HTER - HTER'||}}_2^2 \end{aligned}$$
(5)

The scalar value obtained in Eq. (4) is the evaluation of the generated translation by the machine translation quality estimation module. Compared with BLEU, it reflects deeper translation characteristics. Therefore, our method uses this score to guide the machine translation module, so that the prediction does not have to converge completely to the ground truth word.
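To make the regression step concrete, the following minimal PyTorch sketch passes the token-level features of Eq. (3) through a Bi-LSTM and maps them to an HTER prediction as in Eqs. (4) and (5). The mean-pooling over token positions and all layer sizes are our own illustrative assumptions rather than the exact Bilingual Expert implementation.

```python
import torch
import torch.nn as nn

class QERegressor(nn.Module):
    """Sketch of the sentence-level regressor in Eqs. (3)-(5): concatenated
    token features f are fed to a Bi-LSTM and mapped to an HTER score."""

    def __init__(self, feat_dim: int, hidden: int = 512):
        super().__init__()
        self.bilstm = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, 1)

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        # f: (batch, seq_len, feat_dim) -- the features of Eq. (3)
        h, _ = self.bilstm(f)              # (batch, seq_len, 2 * hidden)
        pooled = h.mean(dim=1)             # reduce over tokens (assumed pooling)
        return torch.sigmoid(self.proj(pooled)).squeeze(-1)  # predicted HTER, Eq. (4)

# Eq. (5): squared error against the gold HTER label
loss_fn = nn.MSELoss()
```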

3.3 Reward Computation

It is critical to set up appropriate rewards for RL training. Previous research on NMT assumes that the valid predicted value of each word in the generated target sentence is unique, that is, there is a single fixed reference for each sentence. Therefore, both the minimum risk training method [10] and reinforcement learning methods [4, 19] use the BLEU similarity between the generated sentence and the reference as the training target. However, in natural language, the same source fragment can correspond to multiple reasonable translations, so a BLEU-based reward cannot reasonably reward or punish words outside the reference. As a result, most reasonable translations are denied, which greatly limits the improvement that reinforcement learning can bring and exacerbates the problem of poor diversity of machine translations. Thus we set the reward as:

$$\begin{aligned} R\left( {{{\hat{y}}_t}} \right) = \alpha Scor{e_{BLEU}}\left( {{{\hat{y}}_t}} \right) + \frac{{1 - \alpha }}{{Scor{e_{QE}}\left( {{{\hat{y}}_t}} \right) + 1}} \end{aligned}$$
(6)

where \(Scor{e_{BLEU}}\left( {{{\hat{y}}_t}} \right) \) is the normalized BLEU between the generated translation and the ground truth, and \(Scor{e_{QE}}\left( {{{\hat{y}}_t}} \right) \) is the normalized QE score of the generated translation. The hyperparameter \(\alpha \) balances the weight between the BLEU and QE scores, so as to avoid the training instability that introducing the QE score alone might aggravate. In this way, training converges quickly while the diversity of translation is fully considered.
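As an illustration, the sentence-level reward of Eq. (6) can be computed as below; the default value of alpha is only a placeholder, since the paper treats it as a tunable hyperparameter.

```python
def reward(score_bleu: float, score_qe: float, alpha: float = 0.5) -> float:
    """Sentence-level reward of Eq. (6).

    score_bleu: normalized BLEU of the generated translation against the reference.
    score_qe:   normalized QE (HTER-like) score, where lower means better,
                hence the reciprocal term rewards low QE scores.
    alpha:      weight between the two parts (0.5 is an illustrative default).
    """
    return alpha * score_bleu + (1.0 - alpha) / (score_qe + 1.0)
```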

In the machine translation task, the agent needs to take dozens of actions to generate a complete target sentence, but only a single terminal reward is obtained after the complete sequence is generated, and a sequence-level reward cannot distinguish the contribution of each word to the total reward. Therefore, there is a reward sparsity problem during training, which leads to slow convergence or even failure to learn. Reward shaping alleviates this problem: an instant reward is imposed at each decoding step t, so the rewards correspond to the word level. The rewards are set as follows:

$$\begin{aligned} {r_t}\left( {{{\hat{y}}_t}} \right) = R\left( {{{\hat{y}}_t}} \right) - R\left( {{{\hat{y}}_{t - 1}}} \right) \end{aligned}$$
(7)

During training, a cumulative reward is calculated for the current sequence after each sampled action, and the difference between the rewards of two consecutive time steps is the word-level reward. In this way, the model receives an instant reward for the current time step after each action, which alleviates the reward sparsity problem.

$$\begin{aligned} R({\hat{y}_t}) = \sum \limits _{t = 1}^T {{r_t}\left( {{{\hat{y}}_t}} \right) } \end{aligned}$$
(8)

It has been shown that using reward shaping does not change the optimal strategy. Since the reward of the whole sequence is the sum of the word-level rewards, which is consistent with the sequence-level reward, the total reward of the sequence is not affected.
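A minimal sketch of Eqs. (7) and (8), assuming the cumulative reward R of every prefix of the sampled translation has already been computed:

```python
def shaped_rewards(prefix_rewards: list[float]) -> list[float]:
    """Word-level rewards of Eq. (7): r_t = R(y_<=t) - R(y_<=t-1), with an
    implicit R = 0 for the empty prefix. Their sum telescopes back to the
    sequence-level reward, which is the consistency property of Eq. (8)."""
    rewards, prev = [], 0.0
    for r in prefix_rewards:
        rewards.append(r - prev)
        prev = r
    return rewards

# Example: the shaped rewards of the prefix-reward sequence sum to the final reward.
assert abs(sum(shaped_rewards([0.2, 0.5, 0.4])) - 0.4) < 1e-9
```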

3.4 The Training of Reinforcement Learning

The idea of reinforcement learning is that the agent selects an action according to the current environment, the environment then transitions with a certain probability and returns a reward, and the agent repeats this process to maximize the reward. Specifically, in the translation task, the NMT model is regarded as the decision-making agent, and the stochastic policy \(\pi \left( {{a_t}|{\mathbf{{s}}_\mathbf{{t}}};\varTheta } \right) \) is adopted to select candidate words from the vocabulary as actions. During training, the agent learns better translations through the reward given by the environment after the decoder generates the target sentence.

$$\begin{aligned} \pi \left( {{a_t}|{\mathbf{{s}}_\mathbf{{t}}};\varTheta } \right) = \sigma \left( {\mathbf{{W}}*{\mathbf{{s}}_t} + \mathbf{{b}}} \right) \end{aligned}$$
(9)

where \(\pi \left( {{a_t}|{\mathbf{{s}}_\mathbf{{t}}};\varTheta } \right) \) is the probability of choosing an action, \(\sigma \) is the sigmoid function, and \(\varTheta = \{ \mathbf{{W}},\mathbf{{b}}\} \) are the parameters of the policy network. During training, actions are sampled from the conditional probability \(p({y_t}|x,{y_{ < t}})\) given the source sentence and the previously selected words, and the goal is to maximize the expected reward, i.e.:

$$\begin{aligned} \mathop {\max }\limits _\varTheta {E_{\hat{y} \sim p\left( {\hat{y}\mid x;\varTheta } \right) }}\left[ {R\left( {\hat{y}} \right) } \right] \end{aligned}$$
(10)

When the complete target sentence has been generated, the quality estimation score of the sentence is used as label information to compute the reward. The Policy Gradient method [11] of reinforcement learning is then used to maximize the expected reward, as follows:

$$\begin{aligned} \begin{array}{l} J(\varTheta ) = \mathop \sum \limits _{i = 1}^N {E_{\hat{y} \sim p\left( {\hat{y}\mid {x^i}} \right) }}R\left( {\hat{y}} \right) \\ = \sum \limits _{i = 1}^N {\sum \limits _{\hat{y} \in Y} p } \left( {\hat{y}\mid {x^i}} \right) R\left( {\hat{y}} \right) \end{array} \end{aligned}$$
(11)

where Y is the space of candidate translated sentences and \(R(\hat{y})\) is the sentence-level reward of the translation. Because the state at time step \(t+1\) is completely determined by the state and action at time step t, the probabilities \(p\left( {{\mathbf{{s}}_\mathbf{{1}}}} \right) \) and \(p\left( {{\mathbf{{S}}_{\mathbf{{t}} + \mathbf{{1}}}}|{\mathbf{{S}}_\mathbf{{t}}},{a_t}} \right) \) are equal to 1. The gradient update is as follows:

$$\begin{aligned} {\nabla _\varTheta }J(\varTheta ) = - \frac{1}{N}\sum \limits _{n = 1}^N {\sum \limits _{t = 1}^L {({R_L}} - b)} {\nabla _\varTheta }\log {\pi _\varTheta }\left( {{a_t}|{\mathbf{{s}}_\mathbf{{t}}}} \right) \end{aligned}$$
(12)

where N is the number of sampled sequences (rollouts) and \(b \approx E[{R_L}]\) is the baseline reward.
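As an illustrative PyTorch sketch (not the paper's actual implementation), the gradient of Eq. (12) corresponds to minimizing the following loss, in which the baseline b is subtracted from the sequence reward:

```python
import torch

def policy_gradient_loss(log_probs: torch.Tensor,
                         rewards: torch.Tensor,
                         baseline: float) -> torch.Tensor:
    """REINFORCE-style loss matching Eq. (12).

    log_probs: (N, L) log pi(a_t | s_t) of the sampled tokens.
    rewards:   (N,)  sequence-level rewards R_L.
    baseline:  scalar b approximating E[R_L], e.g. a running mean of rewards.
    """
    advantage = rewards - baseline                        # (N,)
    return -(advantage.unsqueeze(1) * log_probs).sum(dim=1).mean()
```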

The action space of reinforcement learning-based machine translation is large and discrete: its size is the capacity of the entire vocabulary. We therefore use beam search to sample actions, which reduces the computational cost and increases the probability of high-quality translation results in the decoding stage. The principle of beam search is shown in Fig. 3.

Fig. 3. The schematic diagram of beam search
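The following self-contained sketch illustrates the pruning principle of Fig. 3 with a beam width of 6 (as in Sect. 4.3); `step_log_probs` is a hypothetical callable standing in for the translation model, not part of the Tensor2Tensor decoder we actually use.

```python
def beam_search(step_log_probs, beam_width: int = 6, eos_id: int = 2, max_len: int = 50):
    """Keep only the `beam_width` best partial translations at every step.

    step_log_probs(prefix) is assumed to return (token_id, log_prob) pairs
    scoring possible next tokens for the given prefix (hypothetical interface)."""
    beams = [([], 0.0)]                      # (token prefix, cumulative log-prob)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for tok, lp in step_log_probs(prefix):
                candidates.append((prefix + [tok], score + lp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates[:beam_width]:
            (finished if prefix[-1] == eos_id else beams).append((prefix, score))
        if not beams:                        # all surviving hypotheses ended with EOS
            break
    return max(finished or beams, key=lambda c: c[1])
```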

In order to stabilize reinforcement training and alleviate the large variance that reinforcement learning may bring, we combine the MLE training objective with the RL objective. Specifically, the cross-entropy loss function of traditional machine translation is retained in the loss function and combined linearly with the reinforcement learning training objective. The mixed loss function is shown below:

$$\begin{aligned} {L_{\mathrm{{combine}}}}\mathrm{{ = }}\gamma \times {L_{\mathrm{{mle}}}}\mathrm{{ + }}\left( {1 - \gamma } \right) {L_{\mathrm{{rl}}}} \end{aligned}$$
(13)

where \({L_{\mathrm{{combine}}}}\) is the combined loss function, \({L_{\mathrm{{mle}}}}\) is the cross-entropy loss, \({L_{\mathrm{{rl}}}}\) is the reinforcement learning loss, and \(\gamma \) is the hyperparameter controlling the weight between \({L_{\mathrm{{mle}}}}\) and \({L_{\mathrm{{rl}}}}\). Different values of \(\gamma \) affect the performance of the final translation results.
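Eq. (13) is straightforward to express in code; the value of gamma below is only an illustrative default, not the setting used in our experiments.

```python
def combined_loss(loss_mle: float, loss_rl: float, gamma: float = 0.7) -> float:
    """Mixed objective of Eq. (13): a convex combination of the cross-entropy
    (MLE) loss and the RL loss, weighted by the hyperparameter gamma."""
    return gamma * loss_mle + (1.0 - gamma) * loss_rl
```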

4 Experiments

4.1 Datasets

The data set used in the experiments comes from the corpus constructed by our laboratory for the “China-Korea Science and Technology Information Processing Comprehensive Platform” project [14]. The original corpus includes more than 30,000 documents and more than 160,000 parallel sentence pairs, covering 13 fields such as biotechnology, marine environment and aerospace. To alleviate the problem of data sparsity, we also used an additional monolingual corpus in the experiments. The detailed data obtained after preprocessing for the tasks in this paper is shown in Table 1. The HTER score for the quality estimation task is automatically calculated by the TERCOM tool.

Table 1. Description of data in Chinese-Korean machine translation

4.2 Preprocessing

Word embeddings trained on a large-scale corpus can provide sufficient prior information for the model, accelerate convergence, and effectively improve downstream tasks. However, as a low-resource language, Korean lacks a large corpus, so the corpus contains a large number of low-frequency words, which leads to low-quality word vectors. To solve this problem, we use more flexible granularities of Korean text for word embedding to alleviate the data sparsity problem.

Korean is written with a phonetic alphabet. From a phonetic point of view, Korean consists of phonemes that form syllables according to rules, and syllables that in turn form sentences. Since the numbers of phonemes and syllables are relatively fixed (67 phonemes and 11172 syllables), a dictionary built at such granularities is very small, and low-frequency words can be significantly reduced compared with other granularities. From a semantic point of view, words may have clearer morphological and linguistic features. Therefore, we preprocess the Korean text at the phoneme, syllable and word levels. Phonemes are obtained by the open source phoneme decomposition tool hgtk, syllables are obtained by reading characters directly, and words are obtained by the word segmentation tool Kkam.
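For illustration, the sketch below derives syllable- and phoneme-level tokens directly from the Unicode Hangul composition rule; the paper itself uses the hgtk tool for phoneme decomposition, so this is only an equivalent stand-in, and word-level segmentation is omitted.

```python
# Hangul syllables U+AC00-U+D7A3 are composed as
# 0xAC00 + (lead * 21 + vowel) * 28 + tail, with 19 leads, 21 vowels,
# and 27 optional tails (19 + 21 + 27 = 67 phoneme types, as noted above).
CHO = [chr(c) for c in range(0x1100, 0x1113)]           # 19 leading consonants
JUNG = [chr(c) for c in range(0x1161, 0x1176)]          # 21 vowels
JONG = [''] + [chr(c) for c in range(0x11A8, 0x11C3)]   # empty + 27 trailing consonants

def to_syllables(text: str) -> list[str]:
    """Syllable granularity: read characters one by one."""
    return list(text)

def to_phonemes(text: str) -> list[str]:
    """Phoneme granularity: decompose each precomposed syllable into jamo."""
    out = []
    for ch in text:
        idx = ord(ch) - 0xAC00
        if 0 <= idx <= 0xD7A3 - 0xAC00:
            out += [CHO[idx // 588], JUNG[(idx % 588) // 28]]
            if idx % 28:
                out.append(JONG[idx % 28])
        else:
            out.append(ch)                  # non-Hangul characters pass through
    return out
```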

4.3 Setting

Our translation module is implemented on an encoder-decoder framework based on self-attention, and the Transformer system adopts the same model configuration as described in [15]; it is implemented using Tensor2Tensor, an open source toolkit from Google Brain. We set dropout to 0.1 and the word vector dimension to 512. MLE training uses the Adam [6] optimizer with learning rate decay. In the feature extraction part of our machine translation quality estimation module, the number of encoder and decoder layers is 2, the number of hidden units of the feed-forward sublayer is 1024, and the number of attention heads is 4. In the quality estimation part, the network structure is a single-layer Bi-LSTM with 512 hidden units, the optimizer is Adam, and the learning rate is 0.001. During reinforcement learning training, the MLE model is used for parameter initialization, the learning rate is set to 0.0001, and the beam search width is set to 6.

4.4 Main Results and Analysis

In order to test the translation performance of the model, we conducted Chinese-Korean machine translation experiments under the same hardware and corpus environment and calculated the BLEU and QE scores on the test set. The results are shown in Table 2.

Table 2. The score of translation performance

As can be seen from Table 2, our method exceeds the baseline models for both the Chinese-Korean and Korean-Chinese translation tasks. Compared with LSTM+attention, BLEU increases by 10.15 and the QE score decreases by 58.37 in the Chinese-Korean direction, while BLEU increases by 10.85 and the QE score decreases by 58.04 in the Korean-Chinese direction. Compared with Transformer, BLEU increases by 5.85 and the QE score decreases by 5.33 in the Chinese-Korean direction, while BLEU increases by 3.26 and the QE score decreases by 2.25 in the Korean-Chinese direction. Therefore, the introduction of the evaluation module effectively improves the performance of Chinese-Korean machine translation.

4.5 Performance Verification About QE

To ensure the rationality and effectiveness of our method, we verify the performance of the machine translation quality estimation module. Pearson's correlation coefficient, Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE), as used in the WMT shared tasks, are adopted as verification indexes. Pearson's correlation coefficient measures the correlation between the predicted value and the ground truth; the higher the positive correlation, the better the performance of the QE module. MAE is the mean of the absolute errors between the predicted and true values, and RMSE is the square root of the mean squared error; the smaller the value, the better. The baseline system is the open source system QuEst++ [13], the official baseline of WMT2013-2019. The specific experimental results are shown in Table 3.

Table 3. Verification results for QE module performance

As can be seen from the experimental results in Table 3, compared with the baseline system of the QE task, the Bilingual Expert model used in our experiments achieves a clear performance improvement, with the Pearson correlation coefficient increased by 0.079, MAE decreased by 0.018, and RMSE decreased by 0.007. Its predictions correlate well with manual evaluation, which proves the effectiveness of the machine translation quality estimation module in this experiment. In conclusion, it is reasonable to use the machine translation quality estimation module to optimize the translation module.
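For reference, the three verification indexes can be computed with a few lines of NumPy; this is only a convenience sketch of the standard definitions, not the official WMT evaluation script.

```python
import numpy as np

def qe_verification_metrics(pred, gold):
    """Pearson correlation, MAE and RMSE between predicted and gold HTER scores."""
    pred, gold = np.asarray(pred, dtype=float), np.asarray(gold, dtype=float)
    pearson = np.corrcoef(pred, gold)[0, 1]
    mae = np.abs(pred - gold).mean()
    rmse = np.sqrt(((pred - gold) ** 2).mean())
    return pearson, mae, rmse
```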

4.6 Example of Translation Results

Examples of translation results for different models are shown in Table 4.

Table 4. Examples of translation results

As can be seen from the translation examples in Table 4, the translations obtained by our method are more accurate in both directions, their fluency and fidelity conform to the target language specification, and their quality is clearly better than that of the other baseline models, which shows that our method can effectively improve the performance of the Chinese-Korean neural machine translation model.

5 Conclusion

In order to alleviate the exposure bias and poor translation diversity caused by teacher forcing in machine translation, we propose a Chinese-Korean neural machine translation model that incorporates machine translation quality estimation. The model introduces a sentence-level, reference-free evaluation mechanism to guide training, so that the prediction does not converge completely to the ground truth word; the guidance is implemented through reinforcement learning. The experimental results clearly show that our approach can effectively improve the performance of Chinese-Korean neural machine translation.