1 Introduction

Most neural machine translation (NMT) architectures [2, 12] are autoregressive models that generate target sequences sequentially. Among them, the best-performing model is the Transformer [13], which is based entirely on the attention mechanism [2] and parallelizes the training process, greatly reducing training latency. However, since no golden reference is available at decoding time, the model must condition on the previously generated tokens to predict each target word, which severely limits decoding efficiency. Gu et al. [4] proposed the non-autoregressive neural machine translation (NAT) model, which uses knowledge distillation [5] to assist training and parallelizes the decoding process, significantly reducing decoding delay. Subsequent work, such as [6, 10, 11] and [14], has built on this model and alleviates the over-translation and under-translation problems of non-autoregressive models to some extent without losing their fast decoding.

In this paper, inspired by Wu [15], we apply reinforcement learning (RL) to our model through the actor-critic algorithm [1] and accelerate decoding by treating the NAT architecture as the actor model, iteratively optimizing the generated translation. At the same time, for morphologically rich languages the token-level rewards of traditional reinforcement learning alone cannot capture sentence structure or the deeper semantics of tokens [3], so we propose additional affix-level rewards that work in conjunction with the original rewards to guide parameter updates. Moreover, to further improve the translation quality of the NAT model, we replace the positional encoding layer with capsule network [8] layers in the encoder and decoder, which extract deeper positional information from the source sequence. We also adopt the interpolation knowledge distillation [5] method, using the output of an autoregressive teacher model as the distilled data.

We verify the effectiveness of the proposed architecture on two machine translation tasks, NIST English-Chinese and CCMT2019 Mongolian-Chinese, and further analyze it from different aspects such as training time and the convergence of the loss function. For NIST English-Chinese, experiments show that our model achieves an average improvement of 7.45 BLEU over the baseline NAT model. More remarkably, compared to the baseline actor-critic model, decoding is nearly 4 times faster while the results remain comparable. Furthermore, to verify that the affix-level reward captures additional grammatical and semantic information, we also carry out an ablation study on CCMT2019 Mongolian-Chinese and obtain 27.98 BLEU with a decoding delay of only 179 ms.

2 Background

2.1 Non-autoregressive Neural Machine Translation

Given a source sequence \(S=(s_1,...,s_n)\) and a target sequence \(T=(t_1,...,t_L)\), the autoregressive (AT) model models the conditional distribution of T given S by maximizing the likelihood of each predicted word conditioned on the source sentence and the previously generated target words:

$$\begin{aligned} P_{AR}(T|S;\theta )=\prod _{l=1}^{L}P(t_l|t_1,...,t_{l-1},S;\theta ) \end{aligned}$$
(1)

where L is the length of the target sentence and \(\theta \) denotes the model parameters.

Such models are widely used and perform strongly in neural sequence modeling. However, decoding incurs a large delay because each step depends on the previously generated words. To improve decoding speed, Gu et al. [4] proposed a non-autoregressive neural machine translation model that breaks the sequential dependence of traditional models and decodes in parallel by generating an independent distribution for each target word simultaneously:

$$\begin{aligned} P_{NAR}(T|S;\theta )=P(L_y|S;\theta )\prod _{l=1}^{L_y}P(t_l|S,L_y;\theta ) \end{aligned}$$
(2)

where \(L_y\) represents the predicted target sentence length.
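
To make the contrast concrete, the following minimal Python sketch sets the sequential decoding loop implied by Eq. 1 next to the single parallel pass of Eq. 2. The models and shapes are placeholders for illustration only, not the architecture described in this paper.

```python
import torch

# Illustrative contrast between Eq. 1 and Eq. 2; `ar_step` and `nar_model`
# are placeholders standing in for real networks, not the paper's model.
vocab_size, L_y, hidden = 100, 6, 16
src_repr = torch.randn(1, 8, hidden)            # encoded source S

def ar_step(prefix, src_repr):
    # placeholder for P(t_l | t_1..t_{l-1}, S): logits over the vocabulary
    return torch.randn(1, vocab_size)

def nar_model(src_repr, L_y):
    # placeholder for P(t_l | S, L_y): logits for all positions at once
    return torch.randn(1, L_y, vocab_size)

# Autoregressive decoding: L_y sequential steps, each conditioned on the prefix.
prefix = [torch.tensor([1])]                    # <bos>
for _ in range(L_y):
    prefix.append(ar_step(prefix, src_repr).argmax(-1))

# Non-autoregressive decoding: a single parallel pass over all positions.
nar_tokens = nar_model(src_repr, L_y).argmax(-1)    # shape (1, L_y)
print(nar_tokens)
```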

2.2 Actor-Critic Based Neural Machine Translation

RL can be leveraged to bridge the gap between NMT training and inference by directly optimizing evaluation metrics such as the BLEU score during training. Specifically, the NMT model can be regarded as an agent that interacts with the environment: it selects an action, i.e., a candidate word from the vocabulary, according to its policy (i.e., the parameters of the agent).

The reward in NMT is defined as \(R(\hat{t},t_l)\) by comparing the generated sentence \(\hat{t}\) with the reference sentence t. It is a token-level reward, and the overall final reward is observed only once the NMT model has generated a complete target sequence. The objective function is to maximize the expected reward:

$$\begin{aligned} L_{RL}=\sum _{l=1}^{L}E_{\hat{t}\sim p(\hat{t}|s_l)}R(\hat{t},t_l)=\sum _{l=1}^{L}\sum _{\hat{t}\in T}p(\hat{t}|s_l)R(\hat{t},t_l) \end{aligned}$$
(3)

where T is the space of all candidate translations, which grows exponentially with the vocabulary size, so it is impossible to maximize \(L_{RL}\) exactly.
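
In practice the expectation in Eq. 3 is therefore estimated by sampling. The sketch below is a minimal REINFORCE-style approximation with toy logits and a toy reward, shown only to illustrate the idea; it is not the training code of [1] or [15].

```python
import torch
import torch.nn.functional as F

# Monte-Carlo estimate of the expected reward in Eq. 3; the logits and the
# reward function are toy placeholders, not the models of [1] or [15].
vocab_size, num_samples = 50, 4
logits = torch.randn(vocab_size, requires_grad=True)

def reward_fn(token_id, ref_id=7):
    # toy token-level reward R(t_hat, t_l): 1 if the sample matches the reference
    return 1.0 if token_id == ref_id else 0.0

probs = F.softmax(logits, dim=-1)
samples = torch.multinomial(probs, num_samples, replacement=True)
rewards = torch.tensor([reward_fn(int(t)) for t in samples])

# REINFORCE-style surrogate: maximize E[R] via sum_k R_k * log p(t_k)
loss = -(rewards * torch.log(probs[samples])).mean()
loss.backward()
```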

3 Approach

3.1 Actor Model

Encoder. As shown in Fig. 1, we use an improved non-autoregressive model as the actor to accelerate decoding. In a capsule network [8], the features detected by each capsule are encapsulated as a vector. We bring capsule layers into the NAT model based on the intuition that extracting richer positional characteristics from the source sentence benefits the hidden states of both the encoder and the decoder. Accordingly, we use the source word embedding \(e_i\) computed by the self-attention layer as the input to each capsule:

$$\begin{aligned} c_j=\sum _{i}\alpha _{ij}F(e_{i},w_{ij}) \end{aligned}$$
(4)

where \(\alpha _{ij}\) is the coupling coefficient, \(w_{ij}\) is the weight matrix, and \(F(\cdot )\) is the transformation function, which we implement as a feed-forward neural network. The output \(c_j\) encodes in-depth positional information as a vector.
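
A minimal sketch of Eq. 4 is given below; the way the coupling coefficients \(\alpha _{ij}\) are obtained and the tensor shapes are our assumptions, since the text only states that \(F(\cdot )\) is a feed-forward network.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChildCapsuleLayer(nn.Module):
    """Sketch of Eq. 4: c_j = sum_i alpha_ij * F(e_i, w_ij).

    The dimensions and the derivation of the coupling coefficients are
    simplifying assumptions made for this illustration.
    """
    def __init__(self, d_model, num_capsules, d_capsule):
        super().__init__()
        # one feed-forward transformation per capsule j
        self.transforms = nn.ModuleList(
            [nn.Linear(d_model, d_capsule) for _ in range(num_capsules)])

    def forward(self, e):                     # e: (batch, n, d_model)
        caps = []
        for transform in self.transforms:
            u = transform(e)                  # F(e_i, w_ij): (batch, n, d_capsule)
            # coupling coefficients alpha_ij, normalized over source positions i
            alpha = F.softmax(u.norm(dim=-1), dim=-1)      # (batch, n)
            caps.append((alpha.unsqueeze(-1) * u).sum(dim=1))
        return torch.stack(caps, dim=1)       # (batch, num_capsules, d_capsule)

emb = torch.randn(2, 8, 32)                   # toy source embeddings e_i
print(ChildCapsuleLayer(32, 6, 32)(emb).shape)
```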

Fig. 1. The architecture of the improved actor model.

Decoder Based on Capsule Network. Similar to the encoder side, we use a child capsule layer to extract the source information more deeply. To comprehensively capture the overall information of the source, we introduce a parent capsule layer that integrates the information of the M child capsules and maps it to a representation suitable for the parent capsules, as described in Eq. 5:

$$\begin{aligned} s_j=\sum _{i}^{M}F(c_{ij},w_{ij}) \end{aligned}$$
(5)

Our squashing function compresses the norm of the input vector into [0, 1). Each parent capsule updates its state based on the incoming information flow:

$$\begin{aligned} p_j=Squash(s_j)=\frac{\left\| s_j \right\| ^2}{1+\left\| s_j \right\| ^2}\cdot \frac{s_j}{\left\| s_j \right\| } \end{aligned}$$
(6)

All child capsules are then integrated to generate the parent capsule representation \(P=[p_1,p_2,...,p_N]\), which contains the positional information. The source sequence is thus encoded into N capsules, which iteratively determine what information is fed to the inter-attention sub-layer:

$$\begin{aligned} Attention(Q_p,K_p,V_p)=softmax(\frac{Q_pK_p^T}{\sqrt{d_k}})\cdot V_p \end{aligned}$$
(7)

where \(Q_p\) is the output of the parent capsule layer containing positional information, and \(K_p\), \(V_p\) are the vectors from the encoder.

Providing the decoder with a source representation that incorporates positional information supports parallel decoding and encourages the generated translation to follow a favorable word order. We combine the extracted positional relations with the source information and feed them directly to the attention layer to provide a stronger signal. \(Q_p\), \(K_p\), and \(V_p\) are all computed from Eq. 8:

$$\begin{aligned} Emb(Q_p,K_p,V_p)=(e_1+p_1,...,e_n+p_n) \end{aligned}$$
(8)

where \(e_i\) is the initial input word embedding and \(p_i\) is the positional information perceived by the capsule network.
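
The sketch below strings Eqs. 5, 6, and 8 together. Treating the child-to-parent transformation as a plain linear map and letting the number of parent capsules equal the source length are our simplifying assumptions, made only to keep the example self-contained.

```python
import torch

def squash(s):
    # Squashing of Eq. 6 (standard form from [8]); norms are mapped into [0, 1).
    norm_sq = (s ** 2).sum(dim=-1, keepdim=True)
    return (norm_sq / (1.0 + norm_sq)) * s / norm_sq.sqrt().clamp(min=1e-9)

# Toy dimensions; M child capsules, N parent capsules of size d.
M, N, d = 6, 8, 32
child = torch.randn(1, M, d)             # c_ij from the child capsule layer
W = torch.randn(N, d, d)                 # one weight matrix per parent capsule

s = torch.einsum('bmd,nde->bne', child, W)   # s_j = sum_i F(c_ij, w_ij), Eq. 5
parents = squash(s)                          # p_j, Eq. 6

# Eq. 8: add the perceived position information to the word embeddings.
e = torch.randn(1, N, d)                     # toy word embeddings e_i
position_aware = e + parents                 # input to the attention sub-layers
print(position_aware.shape)                  # torch.Size([1, 8, 32])
```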

Length Predictor. During training, there is no need to predict the target sentence length because the ground truth is available. However, to enable parallel decoding at inference time, the target length must be predicted in advance to assist the decoder. We compute it with a simple linear estimate:

$$\begin{aligned} L_y=\eta L_x+C \end{aligned}$$
(9)

where C is a bias term and \(\eta \) is the ratio between target and source length in the training set. We then search for the target sentence length within \([\eta L_x - B, \eta L_x + B]\), where B is half of the search window. In this way, we obtain multiple candidate translations of different lengths.
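
A minimal sketch of this length predictor follows; the values of \(\eta \), C, and B here are illustrative, whereas in the model \(\eta \) is estimated from the training corpus.

```python
# Sketch of the length predictor (Eq. 9) and the candidate window.
def candidate_lengths(src_len, eta=1.2, C=2, B=3):
    center = int(round(eta * src_len + C))          # L_y = eta * L_x + C
    return [L for L in range(center - B, center + B + 1) if L > 0]

print(candidate_lengths(10))   # one candidate translation is decoded per length
```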

Fig. 2. The framework of the improved actor-critic method.

3.2 Critic Model

In RL-based NMT, the model usually selects an action, i.e., a candidate word from the vocabulary, according to the previously generated target words and the source sequence. After generating the complete sequence \(\hat{T}=\left\{ \hat{t}_1,...,\hat{t}_M \right\} \), the reward is calculated by comparison with the ground truth \(T^*=\left\{ t_1^*,...,t_L^* \right\} \):

$$\begin{aligned} R(\hat{T},T^*)=\sum _{m=1}^{M}r_m(\hat{t}_m;\hat{T}_{1,...,m-1},T^*) \end{aligned}$$
(10)

where M is the length of the generated sequence \(\hat{T}\) and \(\hat{T}_{1,...,m-1}\) is the partial sequence generated so far. \(R(\hat{T},T^*)\) is a token-level reward between the generated translation and the ground truth, which is optimized by iterative updates.

Nevertheless, only a single final reward is available for the complete candidate sequence, and this reward sparsity hurts training efficiency. Bahdanau et al. [1] proposed using intermediate rewards derived from the actor network to facilitate the learning of the critic network, as shown in Eq. 11:

$$\begin{aligned} r_l(\hat{t},t_l^*)=R(\hat{t}_{1...l},t^*)-R(\hat{t}_{1...l-1},t^*) \end{aligned}$$
(11)

However, this reward is based only on the surface-form similarity between the actor output and the reference translation, which yields limited gains for low-resource languages.
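
As an illustration of the reward shaping in Eqs. 10 and 11, the sketch below splits a sentence-level score into per-step increments; the toy scoring function merely stands in for the metric learned by the critic.

```python
# Sketch of reward shaping (Eqs. 10-11): the sentence-level reward is split
# into per-step increments r_m = R(t_1..m) - R(t_1..m-1).
def sentence_score(prefix, reference):
    # toy reward: fraction of reference tokens already covered by the prefix
    return len(set(prefix) & set(reference)) / max(len(reference), 1)

def shaped_rewards(hypothesis, reference):
    rewards, prev = [], 0.0
    for m in range(1, len(hypothesis) + 1):
        score = sentence_score(hypothesis[:m], reference)
        rewards.append(score - prev)
        prev = score
    return rewards

print(shaped_rewards(["i", "like", "cats"], ["i", "like", "dogs"]))
```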

To this end, we propose an improved actor-critic method that provides additional affix-level rewards for such morphologically rich languages. As shown in Fig. 2, we jointly use the token-level rewards predicted by the critic and the additional affix similarity rewards introduced in this paper to iteratively optimize the generated translation. Specifically, because the generated translation may be in a confused state, i.e., the additional components of words may be inaccurate, we add part-of-speech (POS) tags to the output of the actor model. We then segment the words into stems and affixes and calculate affix-level rewards against the ground truth. In this way, according to the POS of a generated word, the semantic relations of the sentence in which the word appears can be further captured. In the experiments, we set a threshold when computing the affix similarity, which is described in Sect. 4.2.

The critic model takes the golden reference as encoder input and, at each time step, calculates the token-level reward \(R_c\) and the affix-level similarity reward \(R_{affix}\) between the candidate sequence generated by the actor and the corresponding reference. The critic accumulates the learned single-step rewards and outputs the total reward \(\hat{Q_c}\) to guide the optimization of the actor model:

$$\begin{aligned} \begin{aligned} \hat{Q_c}&=\lambda R_c(\hat{T},T^*)+(1-\lambda )R_{affix}(\hat{T},T^*)\\&=\lambda \sum _{l=1}^{L}r_c(\hat{t}_l;t^*_l)+(1-\lambda )\sum _{l=1}^{L}r_{affix}(\hat{t}_l;t^*_l) \end{aligned} \end{aligned}$$
(12)

where \(\lambda \) is a hyper-parameter and \(R_{affix}(\cdot )\) is the additional affix similarity reward, which we measure with cosine similarity, as shown in Eq. 13:

$$\begin{aligned} r_{affix}=\frac{\hat{t_l}\cdot t_l^*}{\left\| \hat{t_l} \right\| \left\| t_l^* \right\| } \end{aligned}$$
(13)
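
A small sketch of the combined reward in Eqs. 12 and 13 follows, assuming the affixes of each token have already been segmented and embedded; the embeddings and token-level rewards below are toy values.

```python
import numpy as np

# Sketch of Eqs. 12-13: the total reward is a weighted sum of the critic's
# token-level reward and a cosine-similarity affix reward.
def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-9))

def total_reward(token_rewards, hyp_affix_vecs, ref_affix_vecs, lam=0.5):
    affix_rewards = [cosine(h, r) for h, r in zip(hyp_affix_vecs, ref_affix_vecs)]
    return lam * sum(token_rewards) + (1 - lam) * sum(affix_rewards)

rng = np.random.default_rng(0)
hyp = [rng.standard_normal(8) for _ in range(4)]   # affix embeddings of t_hat
ref = [rng.standard_normal(8) for _ in range(4)]   # affix embeddings of t*
print(total_reward([0.2, 0.5, 0.1, 0.4], hyp, ref, lam=0.5))
```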

3.3 Training

Objective Function. During the training phase, we utilize cross-entropy to calculate the loss of the non-autoregressive model with position awareness, as shown in Eq. 14:

$$\begin{aligned} \begin{aligned} L_{NAT}(S;\theta )=-\sum _{l=1}^{L_y}\sum _{t_l}(logP_{NAT}(t_l|L_y,S) \cdot logP_{AT}(t_l|t_1,..,t_{l-1},S;\theta )) \end{aligned} \end{aligned}$$
(14)

Interpolation Knowledge Distillation. We employ sequence-level interpolation knowledge distillation (SIKD) [5] for fine-tuning, which makes the model consider both the teacher's output and the golden reference for the same source. The loss function is shown in Eq. 15:

$$\begin{aligned} \begin{aligned} L_{SIKD}=(1-\xi )L_{NLL}+\xi L_{KD}=-(1-\xi )logp(r|s)-\xi logp(\hat{r}|s) \end{aligned} \end{aligned}$$
(15)

where \(\xi \) is a hyper-parameter, r is the golden reference, and \(\hat{r}\) is the distilled data from the teacher model. We then approximate the second, intractable term:

$$\begin{aligned} \hat{t}=\; \underset{\hat{y}\in \varOmega _k}{argmax}sim(\hat{r},\hat{y})q(\hat{y}|s) \end{aligned}$$
(16)

where sim is the similarity computed from the Jaccard distance, \(\varOmega _k\) is the K-best list from beam search, which is close to \(\hat{r}\) and has high probability under the teacher model, and \(q(\cdot )\) is the probability distribution of the teacher model (i.e., the autoregressive model).
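
A minimal sketch of this selection step, using Jaccard similarity over token sets and toy teacher probabilities in place of real beam-search output:

```python
# Sketch of Eq. 16: from the teacher's K-best list, pick the hypothesis that
# balances Jaccard similarity to the reference with teacher probability.
def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / max(len(a | b), 1)

def select_distill_target(k_best, reference):
    # k_best: list of (token_list, teacher_probability) pairs
    return max(k_best, key=lambda c: jaccard(c[0], reference) * c[1])[0]

k_best = [(["we", "like", "cats"], 0.4),
          (["we", "love", "cats"], 0.3),
          (["cats", "are", "nice"], 0.3)]
print(select_distill_target(k_best, ["we", "love", "cats"]))
```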

Reinforcement Learning. The RL reward is changed to the weighted sum of the original reward and the affix-level reward, as shown in Eq. 17:

$$\begin{aligned} L_{RL}=\sum _{l=1}^{L_y}E_{\hat{t}\sim p(\hat{t}|s_l)}\left[ \lambda R_c(\hat{t},t_l^*)+(1-\lambda )R_{affix}(\hat{t},t_l^*)\right] \end{aligned}$$
(17)

Joint Training. We fine-tune the non-autoregressive model with interpolation knowledge distillation and perform secondary training on the NAT-based actor model and the critic model. The final training loss is shown in Eq. 18:

$$\begin{aligned} L _{Total}=\beta (\alpha L_{NAT}+(1-\alpha )L_{SIKD})+(1-\beta ) L_{RL} \end{aligned}$$
(18)

where \(\alpha \) and \(\beta \) are hyper-parameters.
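
For completeness, Eq. 18 reduces to the following weighted combination; the numeric loss values here are placeholders, not measured quantities.

```python
# Sketch of the joint objective (Eq. 18); the individual losses would come
# from the NAT cross-entropy, the SIKD term, and the RL term above.
def total_loss(l_nat, l_sikd, l_rl, alpha=0.6, beta=0.5):
    return beta * (alpha * l_nat + (1 - alpha) * l_sikd) + (1 - beta) * l_rl

print(total_loss(l_nat=2.1, l_sikd=1.8, l_rl=-0.4))
```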

4 Experiments

4.1 Experimental Setting

Datasets. We use two machine translation tasks in our experiments: NIST English-Chinese (En-Zh) and CCMT2019 Mongolian-Chinese (Mo-Zh). For NIST En-Zh, we use MT02 as the validation set and MT03, MT04, MT05, and MT06 as test sets. We segment all datasets into sub-word units with the byte pair encoding (BPE) algorithm [9] and use our own tools to split the CCMT2019 Mo-Zh data into stems and affixes.

Baseline. We use the synchronous decoding model proposed by Zhou et al. [16] as the teacher model to guide the training of the actor model. The following systems are selected as baselines: the Transformer [13] with state-of-the-art performance, the NAT-FT model proposed by Gu et al. [4], the NAT model with iterative refinement [6], the NAT model for retrieving sequential information [10], the NMT system with reinforcement learning [15], denoted RL-NMT, and the actor-critic sequence prediction model [1].

Model Setting. The parameter settings of our model follow the Transformer [13]. The word embedding dimension is set to 278, the number of hidden units to 507, the layer depth to 5, the number of attention heads to 2, and the number of capsules to 6. We set the hyper-parameter \(\xi \) in Eq. 15 to 0.5, \(\lambda \) in Eq. 17 to 0.5, and \(\alpha \) and \(\beta \) in Eq. 18 to 0.6 and 0.5, respectively, based on experimental validation. Latency is computed as the average per-sentence decoding time on the full test set without mini-batching, measured on two NVIDIA TITAN X GPUs.

Table 1. Experimental results of different models on the NIST English-Chinese dataset. "Multinomial" indicates that multinomial sampling is used to diversify the data, "shaping" indicates that reward shaping is employed, and "i" is the number of iterations. RF-C denotes the Reinforce-Critic based on the actor-critic algorithm, and LL denotes log-likelihood training. For our models, NAT-CN is the NAT model with the added capsule network layers, NAT-CN+AC uses the improved non-autoregressive model as the actor model, and "Total" combines all the methods described in this paper.

4.2 Results and Analysis

Main Results. We mainly conduct experiments on the NIST English-Chinese dataset; the results are shown in Table 1, where "average" denotes the average BLEU [7] score over the MT03 to MT06 test sets.

Fig. 3. (a) Loss curves for different models. (b) Training time for different models.

For our improved NAT model with the capsule network, the BLEU score increases by 4.49 on the MT06 test set compared to the baseline NAT-FT model, the average BLEU score increases by 3.05, and the decoding delay is reduced by 24 ms. When the improved NAT model is used as the actor model, the average BLEU score is 23.49 and the decoding delay is only 145 ms, which is 4.42 times faster than the baseline actor-critic model. After combining all the methods, our model achieves an average improvement of 7.45 BLEU over the NAT-FT model, with the MT06 test set improving by 8.05 BLEU. There is still a gap in BLEU compared with the Transformer, but our model is 4.34 times faster. Notably, compared to the baseline actor-critic model, decoding is about 4 times faster while achieving similar results.

Table 2. Quality evaluation on the CCMT2019 Mongolian-Chinese dataset. VR denotes the vanilla token-level reward, SIKD is short for the sequence-level interpolation knowledge distillation method, POS refers to part-of-speech tagging of tokens, and ASR is the proposed affix similarity reward.

Effect of Training Time. To further verify the effectiveness of our model, we also examine the convergence of the loss function and the time consumed by training. We compare our model with the original actor-critic model [1] and the RL-based NMT model [15]. As shown in Fig. 3(a), our model converges faster, and the curve gradually stabilizes after convergence. The training times of the different models are shown in Fig. 3(b): our model is significantly lower than the other two in both training and decoding delay. This is because we use a non-autoregressive model as the actor, which generates target tokens in parallel and greatly reduces translation latency.

Ablation Study. As shown in Table 2, we conduct an ablation study of the different methods used in this paper on the CCMT2019 Mongolian-Chinese dataset. With the vanilla token-level reward only, the BLEU score is 25.31 and the decoding delay is only 105 ms. Adding the interpolation knowledge distillation method increases BLEU by 1.17, and part-of-speech tagging brings a further improvement of 0.35 BLEU. Since Mongolian is an agglutinative language with rich word formation, adding the affix-level reward yields a 1.15 BLEU improvement, which indicates that the affix-level reward captures additional sentence-pattern and semantic information between the generated translation and the golden reference. Combining all the methods yields 27.98 BLEU, a significant improvement of 2.67 over the baseline system, while the decoding delay is only 179 ms.

Fig. 4. Training curves for affix similarity with different thresholds a.

Effect of Affix Similarity Threshold a. We set a threshold when computing the affix similarity between generated tokens and reference tokens. If the similarity exceeds the threshold, the reward is increased, indicating that the state is normal; if it falls below the threshold, the affix similarity reward is decreased, indicating a more chaotic state whose parameters still need to be updated. At a high level of affix similarity, the generated translation is generally consistent with the golden reference in sentence pattern and meaning; otherwise it is regarded as inconsistent. The effect of the threshold is shown in Fig. 4: at a = 0.75 the system achieves the most stable and fastest BLEU improvement over time.
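
A sketch of this thresholding is shown below; the boost and damping factors are illustrative assumptions, since the text only specifies the direction of the adjustment around the threshold a.

```python
# Sketch of the thresholded affix similarity reward described above.
def thresholded_affix_reward(similarity, a=0.75, boost=1.2, damp=0.8):
    return similarity * (boost if similarity >= a else damp)

for sim in (0.9, 0.75, 0.5):
    print(sim, "->", round(thresholded_affix_reward(sim), 3))
```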

5 Conclusion

We propose a novel NAT architecture that exploits the pose-transformation characteristics of the capsule network to obtain richer positional features. We use the improved NAT model as the actor, which greatly increases decoding speed. Moreover, we combine the proposed affix-level reward with the original reward as the final feedback to iteratively optimize the translations generated by the actor, thereby alleviating the reward sparsity problem of reinforcement learning, especially for low-resource languages. Experiments show that our model improves significantly in both translation quality and decoding speed.

Our future work is to apply the proposed model to more machine translation tasks, especially for low-resource languages, and to explore methods that are more suitable for the characteristics of agglutinative languages.