
1 Introduction

Most neural machine translation (NMT) [1, 2] models are autoregressive (AT) models such as RNNs and the Transformer [3], which achieve state-of-the-art performance. The Transformer trains in parallel, but during decoding it conditions on the already generated sequence to predict the current target word, which causes severe decoding delay. In recent years, the non-autoregressive neural machine translation (NAT) model [4] was proposed to effectively speed up decoding, exploiting Knowledge Distillation [5] and fine-tuning to assist training. Subsequently, several improvements have been built on the NAT model, such as regularizing the similarity of hidden layer states with two auxiliary regularization terms [6], reconstructing translations through iterative refinement [7], and Ghazvininejad et al.'s proposal to partially mask the target translation with a conditional masked language model [8].

In this work, we propose to utilize the Capsule Network [9] in the architecture, which helps extract deeper positional features and makes the generated translation more accurate in word order. Besides, we adopt a word-level error correction method to reconstruct the generated sentence, which alleviates common translation problems. Experiments show that our model is superior to previous NAT models. On the WMT14 De-En task, adding the capsule network layers increases the BLEU score by more than 6 points. More significantly, our word-level error correction method brings a 1.88 BLEU improvement. We also perform a case study on WMT14 En-De and an ablation study on IWSLT16 to verify the effectiveness of the proposed methods.

2 Background

2.1 Non-autoregressive Neural Machine Translation

Given a source sentence \(S=(s_1,...,s_K)\) and a target sentence \(T=(t_1,...,t_L)\), an autoregressive model predicts each target word sequentially, which introduces a certain degree of delay. The non-autoregressive neural machine translation (NAT) model [4] was proposed to improve decoding speed; it predicts all target words conditioned only on the source sequence and a target length \(L_y\) predicted in advance:

$$\begin{aligned} P_{NAT}(T|S;\theta )=P(L_y|S;\theta )\cdot \prod _{l=1}^{L_y}P(t_l|S;\theta ) \end{aligned}$$
(1)

where \(\theta \) is a series of model parameters.
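For concreteness, the following is a minimal sketch (PyTorch, with illustrative shapes and names, not the authors' code) of what the factorization in Eq. 1 buys at inference time: given the predicted length \(L_y\), all target positions are scored in one parallel pass and each token is chosen independently.

```python
# Minimal sketch of parallel NAT decoding under Eq. 1 (illustrative, not the paper's code).
import torch

def nat_decode(decoder_logits: torch.Tensor) -> torch.Tensor:
    """decoder_logits: [L_y, vocab] scores produced by a single parallel forward pass."""
    probs = torch.softmax(decoder_logits, dim=-1)  # P(t_l | S; theta) for each position l
    return probs.argmax(dim=-1)                    # every token is predicted independently

# toy usage: L_y = 4 target positions over a 10-word vocabulary
tokens = nat_decode(torch.randn(4, 10))
print(tokens.shape)  # torch.Size([4])
```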

2.2 Neural Machine Translation with Error Detection

For error detection in NMT, the model first represents each word in the source sentence as a word embedding vector and then feeds it to a bidirectional LSTM. At each time step, the hidden states in both directions are combined and regarded as the final output. In addition, the error correction model constructs mis-matching features: when there are wrong words in an output sequence, the pre-trained model gives the correct word prediction distribution, and there is a gap between the two probability distributions. The model makes the next prediction according to this gap feature, as shown in Eq. 2.

$$\begin{aligned} argmin\sum _{k=1}^{T}XENT\left( g_k,W\left[ \overrightarrow{h_k},\overleftarrow{h_k},\overrightarrow{h_{k+1}},\overleftarrow{h_{k+1}} \right] \right) \end{aligned}$$
(2)

where XENT stands for the cross-entropy loss, W is a weight matrix, \(\overrightarrow{h_k},\overleftarrow{h_k}\) are the forward and backward hidden states at the k-th position, and \(g_k\) is the gap label between the k-th and (k+1)-th tokens.
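As a rough illustration of this setup (assumed module and shapes; the original system's details may differ), a bidirectional LSTM encodes the sentence, adjacent forward/backward states are concatenated, and a linear layer plays the role of W in Eq. 2, trained with cross-entropy over the gap labels:

```python
# Hypothetical sketch of the gap-label detector in Eq. 2 (shapes and sizes are illustrative).
import torch
import torch.nn as nn

class GapDetector(nn.Module):
    def __init__(self, vocab: int, dim: int = 256, n_gap_labels: int = 2):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.lstm = nn.LSTM(dim, dim, batch_first=True, bidirectional=True)
        self.out = nn.Linear(4 * dim, n_gap_labels)      # plays the role of W in Eq. 2

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        h, _ = self.lstm(self.emb(tokens))               # [B, T, 2*dim]
        pair = torch.cat([h[:, :-1], h[:, 1:]], dim=-1)  # states at positions k and k+1
        return self.out(pair)                            # logits for g_k, k = 1..T-1

model = GapDetector(vocab=1000)
logits = model(torch.randint(0, 1000, (2, 7)))           # 2 sentences of 7 tokens
loss = nn.functional.cross_entropy(logits.reshape(-1, 2),
                                   torch.randint(0, 2, (2 * 6,)))  # XENT over gap labels
```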

Fig. 1. The architecture of the proposed NAT-CN model. The encoder uses a child layer to capture location information and the decoder integrates information through a parent layer, then updates the weights by the Dynamic Routing Algorithm (DRA).

3 Approach

3.1 Model Architecture

Since the NAT model ignores target words and their context information, we use the Capsule Network [9] to improve it. The model architecture is shown in Fig. 1 and is also composed of an encoder and a decoder. The hidden layer state of the encoder is computed as in Eq. 3.

$$\begin{aligned} h_j=\sum _{i}\alpha _{ij}F(e_i,w_{ij}) \end{aligned}$$
(3)

where \(e_i\) is the output of the self-attention layer, \(\alpha_{ij}\) represents the coupling coefficient of the capsule network, and \(h_j\) is the final output of this layer.

Similar to the encoder side, we use a child layer to extract source information, but on the decoder side we introduce an additional parent layer that integrates the information extracted by the previous (i.e., child) layer and maps it to another form consistent with the parent's representation:

$$\begin{aligned} s_j=\sum _{i}^{M}F(h_{ij},w_{ij}) \end{aligned}$$
(4)

where M represents the number of child capsules in the child capsule layer. The squashing function then compresses the modulus of the vector into the interval [0, 1), and each parent capsule updates its state as follows:

$$\begin{aligned} p_j=Squash(s_j)=\frac{\left\| s_j \right\| ^2}{1+\left\| s_j \right\| ^2}\frac{s_j}{\left\| s_j \right\| } \end{aligned}$$
(5)

All child capsules are integrated in the manner described above to generate the final parent capsule layer representation \(P=[p_1,p_2,...,p_N]\). After that, iterative routing determines what information from the N parent capsules is transmitted to the Multi-Head Inter-Attention sub-layer.

$$\begin{aligned} Attention(Q_p,K_p,V_p)=softmax(\frac{Q_pK_p^T}{\sqrt{d_k}})\cdot V_p \end{aligned}$$
(6)

where \(Q_p\) is the output of the parent capsule layer, \(K_p\), \(V_p\) are vectors from the encoder, and they all contain rich position information.
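The child-to-parent aggregation of Eqs. 4-6 can be summarized by the following simplified sketch (assumed shapes; the vote computation \(F(h_i, w_{ij})\) and the exact use of the coupling coefficients follow our reading of Eqs. 3-5 and the routing update in Sect. 3.2, so treat it as illustrative rather than the exact implementation):

```python
# Simplified dynamic-routing sketch for the parent capsule layer (illustrative only).
import torch

def squash(s: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Eq. 5: compress the modulus of each vector into [0, 1)."""
    norm2 = (s * s).sum(dim=-1, keepdim=True)
    return (norm2 / (1.0 + norm2)) * s / (norm2.sqrt() + eps)

def route(votes: torch.Tensor, iters: int = 3) -> torch.Tensor:
    """votes: [M, N, d], the vote F(h_i, w_ij) of child capsule i for parent capsule j."""
    M, N, _ = votes.shape
    b = torch.zeros(M, N)                             # routing logits b_ij
    for _ in range(iters):
        alpha = torch.softmax(b, dim=-1)              # coupling coefficients alpha_ij
        s = (alpha.unsqueeze(-1) * votes).sum(dim=0)  # s_j, a (weighted) form of Eq. 4
        p = squash(s)                                 # p_j, Eq. 5
        b = b + (votes * p.unsqueeze(0)).sum(dim=-1)  # b_ij += p_j . F(u_i, w_j), Sect. 3.2
    return p                                          # parent capsules P = [p_1, ..., p_N]

parents = route(torch.randn(6, 6, 64))                # M = N = 6 capsules, as in Sect. 4.1
```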

Position-Aware Strategy. Since there is no direct target sequence information on the decoder side, we combine the extracted deeper information with the source information to get the final word vector representation and feed it to the next layer:

$$\begin{aligned} Emb_p(Q_p,K_p,V_p)=(e_1+p_1,...,e_n+p_n) \end{aligned}$$
(7)

where \(e_i\) represents the original source word embedding and \(p_i\) indicates the position vector extracted by the capsule network layers. Besides, to enable parallel decoding and help the decoder infer, we calculate the ratio \(\lambda \) between target and source sentence lengths on the training set and add a bias term C. The target sentence length is estimated as \(L_y = \lambda L_x+C\), and we then predict it within \([\lambda L_x - B, \lambda L_x + B]\), where B represents half of the search window, as sketched below.
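A toy sketch of this length heuristic follows (the values of \(\lambda\), C and B are placeholders, not the tuned ones): the estimate \(\lambda L_x + C\) gives a center, and candidate lengths are taken from a window of half-width B around it.

```python
# Toy length-prediction heuristic; ratio, bias and half_window are illustrative values.
def candidate_lengths(src_len: int, ratio: float = 1.1, bias: int = 1, half_window: int = 4):
    center = round(ratio * src_len) + bias           # L_y = lambda * L_x + C
    return [L for L in range(center - half_window, center + half_window + 1) if L > 0]

print(candidate_lengths(10))  # [8, 9, 10, 11, 12, 13, 14, 15, 16]
```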

3.2 Training

Objective Function. We utilize a teacher model to guide the training of the NAT model and improve translation quality. In the capsule network layers, we update the parameters through an iterative dynamic routing algorithm: \(b_{ij}=b_{ij}+p_jF(u_i,w_j)\), where \(u_i\) is the output of the previous capsule layer, \(p_j\) is the output of the parent capsule layer, and \(F(\cdot )\) denotes the feed-forward neural network. We use cross-entropy to calculate the loss of the position-aware NAT model during the training phase, as shown in Eq. 8.

$$\begin{aligned} \begin{aligned} L_{NAT}(S;\theta )=-\sum _{l=1}^{L_y}\sum _{t_l}logP_{NAT}(t_l|L_y,S)\cdot logP_{AT}(t_l|t_1,..,t_{l-1},S;\theta ) \end{aligned} \end{aligned}$$
(8)

We utilize the Sequence-Level Interpolation Knowledge Distillation method [5] to assist training, which makes the proposed NAT-CN model generate translations by selecting the output that is closest to the gold reference r while having the highest probability under the guidance of the distilled data. The training objective is shown in Eq. 9.

$$\begin{aligned} \begin{aligned} L_{IKD}=(1-\alpha )L_{SEQ-NLL}+\alpha L_{SEQ-KD}=-(1-\alpha )logp(r|s)-\alpha logp(\hat{t}|s) \end{aligned} \end{aligned}$$
(9)

where \(\alpha \) is a hyper-parameter and \(\hat{t}\) is the output generated under the guidance of the teacher model.
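As a minimal sketch of Eq. 9 (the sentence-level log-probabilities are assumed to be computed elsewhere by the student model; \(\alpha = 0.6\) is the value selected in Sect. 4.1):

```python
# Interpolated sequence-level knowledge distillation loss (Eq. 9), minimal sketch.
def interpolated_kd_loss(logp_reference: float, logp_teacher_output: float,
                         alpha: float = 0.6) -> float:
    # L_IKD = -(1 - alpha) * log p(r|s) - alpha * log p(t_hat|s)
    return -(1.0 - alpha) * logp_reference - alpha * logp_teacher_output

print(interpolated_kd_loss(-12.3, -9.8))  # lower is better
```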

3.3 Word-Level Error Correction

Teacher Model. To address the translation problems of the NAT model, we perform word-level error correction on the generated translation using a bilingual teacher model. As shown in Fig. 1, the teacher model extracts features bidirectionally from the source sequence and generates the latent variables \(\overleftarrow{Z}\) and \(\overrightarrow{Z}\), then integrates the encoded latent variables to predict the probability distribution of candidate words as the gap feature. We use this gap to guide error correction and obtain the output of the teacher model by maximizing the expected probability.

$$\begin{aligned} \begin{aligned} p(t|z)=\prod _{l}p(t_l|\overleftarrow{z_l},\overrightarrow{z_l});&q(z|t,s)=\prod _{l}q(\overleftarrow{z_l}|s,t_{<l},\overrightarrow{z_l}|s,t_{>l}) \end{aligned} \end{aligned}$$
(10)

where z denotes the latent variable; we only need to construct the two probabilities \(p(\cdot )\) and \(q(\cdot )\) with a bidirectional Transformer to obtain the maximum expectation.
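A rough sketch of the bidirectional prediction in Eq. 10 follows (a hypothetical module, not the authors' implementation): the forward and backward latent variables at each position are concatenated and projected to a token distribution.

```python
# Hypothetical combiner for the bidirectional latent variables in Eq. 10.
import torch
import torch.nn as nn

class BidirectionalPredictor(nn.Module):
    def __init__(self, latent_dim: int, vocab: int):
        super().__init__()
        self.proj = nn.Linear(2 * latent_dim, vocab)

    def forward(self, z_fwd: torch.Tensor, z_bwd: torch.Tensor) -> torch.Tensor:
        """z_fwd, z_bwd: [L, latent_dim]; returns log p(t_l | z_l) for every position l."""
        return torch.log_softmax(self.proj(torch.cat([z_fwd, z_bwd], dim=-1)), dim=-1)

predictor = BidirectionalPredictor(latent_dim=64, vocab=1000)
log_probs = predictor(torch.randn(5, 64), torch.randn(5, 64))  # a sentence of length 5
```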

Force Decoding. After training the teacher model, we can extract three kinds of matching features: the latent variable \(z_l\), the token embedding \(E_p\) and the categorical distribution \(p(t_k|\cdot ) \sim Categorical(softmax(I_k))\). Therefore, we can construct a 4-dimensional mis-matching feature \(f_k^{mis-match}\):

$$\begin{aligned} \begin{aligned} f_k^{mis-match}=(I_{k,m_k},I_{k,i_{max}^k},I_{k,m_k}-I_{k,i_{max}^k}, \Xi _{m_k\ne i_{max}}) \end{aligned} \end{aligned}$$
(11)

where \(m_k\) represents the k-th token in the NAT-CN model output and \(i_{max}^k = argmax_iI_k\) is the gap feature. The four items respectively represent: the probability of forcibly decoding into the current output token \(m_k\); the probability of the most likely word \(i_{max}^k\) when forced decoding is not used; the difference between the first two items; and an indicator of whether the current word is consistent with the predicted word.
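The feature in Eq. 11 can be computed per position as in the following sketch (names are illustrative): from the teacher's logits \(I_k\) and the NAT-CN output token \(m_k\), we record the forced score, the free-decoding score, their difference, and the disagreement indicator.

```python
# Sketch of the 4-dimensional mis-matching feature f_k (Eq. 11); names are illustrative.
import torch

def mismatch_feature(logits_k: torch.Tensor, m_k: int) -> torch.Tensor:
    logp = torch.log_softmax(logits_k, dim=-1)
    i_max = int(logp.argmax())              # most likely word under the teacher, i^k_max
    forced = float(logp[m_k])               # score of forcing the NAT-CN output token m_k
    free = float(logp[i_max])               # score of the teacher's best token
    disagree = float(m_k != i_max)          # indicator Xi_{m_k != i_max}
    return torch.tensor([forced, free, forced - free, disagree])

f_k = mismatch_feature(torch.randn(1000), m_k=42)   # vocabulary of 1000 words
```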

Fig. 2. Use the output of the asynchronous bidirectional decoding model to perform word-level error correction on the translation of the NAT-CN model.

Then we can use \(f_k\) to forcibly decode the current token into the token with the highest probability. We modify the original NAT objective to obtain Eq. 12.

$$\begin{aligned} \begin{aligned} P_{NAT}(T|S,f_k;\theta )=P(L_y|S;\theta )\\ \cdot \prod _{l=1}^{L_y}P(t_l|S,Z,f_k;\theta ) \end{aligned} \end{aligned}$$
(12)

As shown in Fig. 2, according to this mis-matching feature, it can be decided whether the translation of the NAT-CN model is decoded normally to t or forcibly decoded to the reference translation \(t^*\), as sketched below.
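The decision itself can be sketched as follows (the margin threshold is an assumed hyper-parameter; the paper does not specify one): if the teacher disagrees strongly with a NAT-CN token, that position is force-decoded to the teacher's prediction, otherwise the original token is kept.

```python
# Illustrative forced-decoding decision based on the gap feature (assumed threshold).
import torch

def correct_tokens(nat_tokens, teacher_logits, margin: float = 1.0):
    corrected = []
    for m_k, logits_k in zip(nat_tokens, teacher_logits):
        logp = torch.log_softmax(logits_k, dim=-1)
        i_max = int(logp.argmax())
        if float(logp[m_k] - logp[i_max]) < -margin:
            corrected.append(i_max)          # force-decode to the teacher's token t*
        else:
            corrected.append(int(m_k))       # keep the NAT-CN token t
    return corrected

fixed = correct_tokens([3, 7, 7], [torch.randn(1000) for _ in range(3)])
```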

Table 1. Evaluation of translation quality on selected translation tasks, including BLEU scores, decoding latency and training speed. "NAT-CN" represents the proposed model with the capsule network and "EC" refers to the NAT-CN model combined with the word-level translation error correction method. We use "KD" to express the knowledge distillation method and "\(i_{dec}\)" stands for the number of iterations.

4 Experiments and Results

4.1 Datasets and Setting

We use the following three machine translation tasks: WMT14 En-De (4.5M pairs), WMT16 En-Ro (610k pairs) and IWSLT16 En-De (196k pairs). For WMT16, we utilize newsdev2016 as the validation set and newstest2016 as the test set. For IWSLT16, we employ test2013 as the development set. For WMT14, we utilize newstest2013 and newstest2014 as the validation set and test set, respectively. All datasets are tokenized by Moses and segmented into sub-word units by the BPE algorithm. We compare our model with strong baseline systems, including the NAT with fertility and noisy parallel decoding (NAT-FT+NPD) [4], on which our model is based, the NAT with iterative refinement (NAT-IR) [7], the NAT with discrete latent variables (NAT-LV) [11], the conditional sequence generation model with generative flow (FlowSeq) [12], the Mask-Predict model (CMLM) [8] and the NAT with auxiliary regularization (NAR-REG) [6].

On the WMT datasets, our parameter settings are the same as the Transformer [3], as described in its paper. Because IWSLT is smaller, the word vector dimension is set to 278, the number of hidden layer neurons to 507, the layer depth to 5, and the number of attention heads to 2. We conduct experimental verification on the development set and finally select 0.6 as the hyper-parameter \(\alpha \) in Eq. 9; the numbers of parent capsules N and child capsules M are both set to 6. Latency is calculated as the average per-sentence decoding time on the entire test set without mini-batching, measured on two NVIDIA TITAN X GPUs.

4.2 Analysis

Results. The experimental results are shown in Table 1. Specifically, on the WMT En\(\rightarrow \)De task, our NAT-CN model gets 24.92 BLEU, an improvement of 6.26 BLEU over the NAT-FT(+NPD) model. After combining the word-level error correction method, we get 26.12 BLEU, an improvement of 1.02 over the best baseline NAT-LV model at a similar decoding speed; the gap to the Transformer is only 1.29 BLEU while the decoding speed is improved by 6.18 times. On the En-Ro task, BLEU scores of 30.26 and 31.93 are finally obtained, and the word-level error correction method on Ro\(\rightarrow \)En also brings a 1.62 BLEU improvement.

Table 2. Translation case studies on WMT14 De\(\rightarrow \)En task. In order to compare under the same conditions, we set B to 4 in the experiment.

Case Study and Ablation Study. A translation case from WMT14 De-En is shown in Table 2. We utilize the Transformer [3] as the AT model and set B to 4. Compared with the original NAT-FT model [4], our NAT-CN model better captures global position information, and the effect of the word-level error correction method is also significant. There is a gap in word order between the NAT-FT translation and the reference, along with translation problems such as "photos photos" and "was was". Our model corrects "photos" to "were" and "was" to "null", i.e., the target word at the current position is empty, and also corrects "about" to "around". The corrected words are marked in red.

Table 3. Ablation study performance on IWSLT16 development set.

We perform an ablation study on the IWSLT16 translation task to verify the impact of different methods. As shown in Table 3, after adding the capsule network layers, the BLEU score of our model increases by about 4 and the decoding speed improves by 16.86 times, which shows the impact of the capsule network layers on the overall results. After combining the word-level error correction method, the BLEU score improves by 2.63, which also proves that this approach brings the translation close to the output of the autoregressive model.

5 Conclusion

We propose a novel NAT model architecture that extracts the positional features and context of word embeddings by adding capsule network layers to the vanilla NAT model. In addition, a word-level error correction method is used to reconstruct the translation of the NAT model, which reduces the degradation of the model while retaining the decoding speed advantage. Experiments show that our model achieves significant improvements over all non-autoregressive baseline systems.