1 Introduction

Neural machine translation (NMT) has achieved great improvements over the past few years (Bahdanau et al. 2014; Gehring et al. 2017; Vaswani et al. 2017). NMT models are generally based on the encoder-decoder framework: the encoder maps the input sentence to distributed representations, and the decoder generates the output sentence from these representations word by word, feeding each previously generated word back as input when predicting the next one. This word-by-word generation prevents the use of parallel computation during inference and leads to high translation latency, which restricts the application scenarios of NMT.

With the introduction of parallel computing in the NMT training phase, the question of how to perform parallel decoding has attracted researchers' attention. To avoid the autoregressive property and produce the outputs in parallel, Gu et al. (2017) proposed the non-autoregressive translation (NAT) model. Although a NAT model still uses the encoder-decoder architecture and draws on the parallel computing power of Transformer (Vaswani et al. 2017), it does not use the previously generated words as input, which avoids the problems inherent in autoregressive decoding. Instead, the NAT model takes other signals (transferred from the source inputs (Gu et al. 2017; Guo et al. 2019), translation results from other systems (Lee et al. 2018; Guo et al. 2019), or latent variables (Kaiser et al. 2018)) as decoder inputs, which enables the independent and simultaneous generation of the target words and reduces translation latency.

In recent years, research on NAT has mainly focused on improving the decoder (Ghazvininejad et al. 2019; Gu et al. 2019; Shu et al. 2019; Sun et al. 2019; Shao et al. 2020). While those methods improve the performance of NAT models by modifying the inputs of the decoder or the training objective, the information that the decoder can rely on still comes from the encoder. The question, then, is whether the performance of the NAT model can continue to improve with a stronger encoder. The effectiveness of an enhanced encoder has been demonstrated in NMT (Bastings et al. 2017; Imamura and Sumita 2019; Wei et al. 2019; Xiao et al. 2019), but the encoder used in NAT models is still the vanilla Transformer encoder. To address this question, in this paper we explore the effect of enhanced encoders in NAT models.

Given developments in pre-training methods (Devlin et al. 2018; Radford et al. 2019; Yang et al. 2019b), there has been some work on enhancing encoders in NMT (Clinchant et al. 2019; Yang et al. 2019a; Zhu et al. 2020), but those methods have not yet been investigated in NAT models. Drawing on the success of pre-training in NMT, in this paper we exploit a straightforward method to enhance the encoder in NAT models with pre-training.

In this paper, based on pre-training, we propose a BERT-based method to enhance the encoder in NAT models. Since the generation of each target word is independent of the previously generated words, the decoder cannot acquire information from the previous target words, so all information, including dependencies and word order, must come from the encoder. Accordingly, we propose a BERT-based encoder for NAT models to enhance their modeling capability. Considering the performance degradation (Zhu et al. 2020) observed when BERT is used directly to initialize the encoder of NMT models, we follow the BERT-fused model (Zhu et al. 2020) and design our enhanced encoder for NAT models, in which the BERT representations are combined with the representations of the vanilla encoder. When an input sentence is fed into our model, the Raw encoder and the BERT encoder simultaneously map it to distributed representations. As the two encoders use different word segmentation rules and therefore produce sequences of different lengths, their representations cannot be directly concatenated or added together. Therefore, we use an extra attention module in the decoder to fuse the BERT encoder representations with those from the Raw encoder. In addition, a gate module dynamically selects information from the BERT and Raw encoder representations. Our enhanced encoder has the following advantages over other methods:

  • It does not assume a NAT-specific model architecture and can be combined with any recent NAT model architecture.

  • It strengthens the ability to model the input sentence.

  • It provides rich information to the decoder.

To evaluate the performance of our model, we compare it with previous work (Gu et al. 2017; Lee et al. 2018; Libovickỳ and Helcl 2018; Ghazvininejad et al. 2019; Gu et al. 2019) and conduct experiments on three benchmark tasks: WMT17 EN\(\rightarrow\)ZH, WMT14 EN\(\leftrightarrow\)DE and WMT16 EN\(\leftrightarrow\)RO. In addition, we conduct further analysis on IWSLT16 EN\(\rightarrow\)DE. Experimental results and analysis show that our enhanced encoder surpasses the baseline NAT system by a significant margin in translation quality without slowing decoding. Furthermore, with distilled knowledge, our model can achieve comparable performance with an autoregressive MT baseline.

The main contributions of this paper include:

  • We are the first to investigate the influence of the encoder in NAT models.

  • We use the BERT-based model to enhance the encoder of NAT models.

  • We achieve a new state-of-the-art 27.87 BLEU on WMT’14 En\(\rightarrow\)De for single-step non-autoregressive MT.

2 Related work

Gu et al. (2017) introduced a non-autoregressive Transformer model to reduce the translation latency of NMT, but this comes at the cost of translation quality. Instead of feeding the previous target tokens into the decoder, NAT models use other signals such as latent variables (Gu et al. 2017; Shu et al. 2019) as the input to the decoder. However, there is still a gap between autoregressive and non-autoregressive models, and in recent years there has been significant work on closing it. Lee et al. (2018) introduced an iterative decoding method which significantly improved the performance of NAT models. Inspired by the masked language model of Devlin et al. (2018), Ghazvininejad et al. (2019) and Gu et al. (2019) utilized mask-based methods to improve the decoder in NAT models. Until now, research has mainly focused on the decoder in NAT models, such as Shao et al. (2020). Recently, some research has started to migrate successful experience from autoregressive NMT into non-autoregressive NMT, such as Shao et al. (2019), Shao et al. (2020) and Zhou and Keung (2020). Following this idea and considering the absence of research on enhancing the encoder in NAT models, we utilize a BERT model as an extra encoder to enhance the modeling ability of the encoder.

There is some work on how to incorporate BERT into autoregressive NMT. Imamura and Sumita (2019) directly used a BERT model as the encoder to strengthen the representations passed to the decoder. Because of the limited improvement from using BERT directly as the encoder, Yang et al. (2019a) and Zhu et al. (2020) instead utilized the BERT model as an extra encoder to strengthen the encoding process. In this paper, we also use a BERT model as an extra encoder. Note too that different tokenizations have different effects on the performance of NMT models (Bahdanau et al. 2014; Sennrich et al. 2015). In this work, because of their different tokenization methods, the BERT encoder and the Raw encoder may model different aspects of the input sentence.

3 Background

3.1 Autoregressive neural machine translation

At present, both autoregressive and non-autoregressive MT adopt the encoder-decoder framework, which has achieved great success in NMT (Bahdanau et al. 2014; Gehring et al. 2017; Vaswani et al. 2017). Compared with RNN-based models, CNN- and self-attention-based models have a highly parallelized architecture that solves the parallelization problem during training. However, during inference, autoregressive models still generate the translation word by word.

Given an input sentence \(X= \{x_1,x_2,\ldots,x_n\}\) and the target sequence \(Y=\{y_1,y_2,\ldots,y_T\}\), an autoregressive translation model models the conditional probability as in (1):

$$\begin{aligned} P(Y|X,\theta ) = \prod _{t=1}^T p(y_t|y_{<t},X,\theta ), \end{aligned}$$
(1)

where \(\theta\) denotes the parameters of the autoregressive translation model and \(y_{<t}\) denotes the previously generated words. During inference, given \(\theta\) and the previously generated words \(y_{<t}\), the autoregressive model generates the current word. During training, \(\theta\) is learned by maximizing the log-likelihood of the training data, as in (2) and (3):

$$\begin{aligned} \theta = \mathop {\mathrm {arg\,max}}_{\theta } L(\theta ) \end{aligned}$$
(2)
$$\begin{aligned} L(\theta ) = \sum _{n=1}^N \sum _{t=1}^T \log (p(y^n_t|y^n_{<t},X^n,\theta )), \end{aligned}$$
(3)

where N denotes the number of sentence pairs in the training set.
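
To make the factorization in (1)–(3) concrete, the short sketch below computes the summed log-likelihood from teacher-forced decoder outputs; the tensor shapes and the padding convention are illustrative assumptions, not part of any particular toolkit.

```python
import torch
import torch.nn.functional as F

def autoregressive_log_likelihood(logits, targets, pad_id=0):
    """Sum of log p(y_t | y_<t, X) over all non-padding target positions.

    logits:  (batch, T, vocab) decoder outputs, where position t was computed
             from the gold prefix y_<t (teacher forcing).
    targets: (batch, T) gold target token ids.
    """
    log_probs = F.log_softmax(logits, dim=-1)                   # log p(. | y_<t, X)
    token_ll = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    mask = targets.ne(pad_id).float()                           # ignore padding
    return (token_ll * mask).sum()                              # L(theta) as in (3)

# toy usage: batch of 2 sentences, length 5, vocabulary of 100 tokens
logits = torch.randn(2, 5, 100)
targets = torch.randint(1, 100, (2, 5))
print(autoregressive_log_likelihood(logits, targets))
```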

According to these formulae, the defining characteristic of autoregressive models is that they require the previously generated words during decoding. Because of this, decoding cannot be parallelized, which restricts the application of autoregressive models.

However, at present, the performance of non-autoregressive models still falls far behind that of autoregressive models. In this work, we attempt to improve the performance of non-autoregressive models using pre-trained models.

3.2 Pre-training for autoregressive NMT

Pre-training has been used in natural language processing (NLP) for years (Mikolov et al. 2013; Dai and Le 2015). At the beginning, because its improvements were not comparable with those of pre-training in computer vision, pre-training in NLP remained relatively limited in scope. Recently, with increases in both computing resources and available data, pre-training techniques have received increasing attention from NLP researchers, and pre-training approaches have refreshed state-of-the-art results on several tasks (Devlin et al. 2018; Yang et al. 2019b). There has been some research on pre-training for MT (Yang et al. 2019a; Zhu et al. 2020), but this work has been conducted only for autoregressive MT. In this work, we use a pre-trained language model as an extra encoder to model the source-language sentence for non-autoregressive MT.

3.3 Non-autoregressive NMT

The aim of non-autoregressive NMT, proposed by Gu et al. (2017), is to accelerate decoding. Compared to autoregressive models, non-autoregressive NMT can simultaneously and independently generate the words in the translation. In contrast to the conditional probability in autoregressive MT, the translation probability from X to Y in non-autoregressive MT is modeled as in (4):

$$\begin{aligned} P(Y|X) = \prod _{t=1}^T p(y_t|X,\theta ) \end{aligned}$$
(4)

Given a training set \(D = \{X^N,Y^N\}\) with N sentence pairs, the training objective of non-autoregressive MT is to maximize the log-likelihood of the training data, as in (5):

$$\begin{aligned} \theta = \mathop {\mathrm {arg\,max}}_{\theta } L(\theta ) \end{aligned}$$
(5)

in which \(L(\theta )\) is computed as in (6):

$$\begin{aligned} L(\theta ) = \sum _{n=1}^N \sum _{t=1}^T \log (p(y_t^n|X^n,\theta )) \end{aligned}$$
(6)

Eq. (4) shows that when non-autoregressive MT generates the target words, it does not need access to the previously generated words. During inference, the target words can be generated by taking the word with the maximum probability at each time step, as in (7):

$$\begin{aligned} \hat{y}_t = \mathop {\mathrm {arg\,max}}_{y_t} (p(y_t|X,\theta )) \end{aligned}$$
(7)
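
As a minimal illustration of (4) and (7), the sketch below performs single-step non-autoregressive decoding as one parallel argmax over position-wise distributions; the shapes and variable names are assumptions for illustration only.

```python
import torch

def nat_argmax_decode(position_logits):
    """Single-step non-autoregressive decoding as in (7).

    position_logits: (batch, T, vocab) scores for p(y_t | X, theta) at every
                     position t, produced in one decoder pass (no y_<t).
    Returns the argmax token id at every position; all positions are chosen
    independently, which is what permits full parallelism (and also what
    allows repeated or inconsistent words).
    """
    return position_logits.argmax(dim=-1)   # (batch, T), computed in parallel

# toy usage: 2 sentences, predicted target length 6, vocabulary of 100 tokens
logits = torch.randn(2, 6, 100)
print(nat_argmax_decode(logits))
```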

Because it no longer needs the previous target words at each time step, non-autoregressive MT can be computed in parallel in both the training and the decoding phases. However, non-autoregressive MT has some weaknesses: it still lags far behind autoregressive MT in translation quality and tends to generate repeated or incorrect words. In this work, we introduce a stronger encoder for the NAT model to improve its performance. Inspired by previous work using pre-training methods in NMT, we demonstrate how pre-training can be incorporated into NAT models.

4 Approaches

In this section, we first define the necessary notation, and then introduce our proposed enhanced encoder model.

Notation Let X and Y denote the input sentences and target sentences, respectively. We denote the raw encoder and BERT encoder as \(Enc_R\), \(Enc_B\), respectively, and we let attn be the attention module.

Since our proposed model mainly focuses on the encoder of NAT models, and does not restrict the decoder, in this section, for ease of description, we use the architecture of  Ghazvininejad et al. (2019) to describe our model. An illustration of our model is shown in Figure 1.

Fig. 1: The overall enhanced NAT encoder. The BERT encoder is an extra encoder, and the Raw encoder is the vanilla Transformer encoder. The decoder dynamically controls the information flowing from the BERT encoder and the Raw encoder. "M" denotes "MASK" in Mask-Predict

Firstly, given an input \(x\in X\), the BERT-encoder and Raw-encoder encode it into representations \(H_B= Enc_B(x)\) and \(H_R = Enc_R(x)\), respectively. \(H_B\) and \(H_R\) are the output of the last layer in the BERT-encoder and Raw-encoder, respectively.

Then, letting \(S^l\) denote the hidden state of the l-th layer in the decoder, we have (8)–(10):

$$\begin{aligned} \hat{S^l}= \mathrm {attn}_s(S^{l-1},S^{l-1},S^{l-1}) \end{aligned}$$
(8)
$$\begin{aligned} S_B^l= \mathrm {attn}_B(\hat{S^l},H_B,H_B) \end{aligned}$$
(9)
$$\begin{aligned} S_R^l= \mathrm {attn}_R(\hat{S^l},H_R,H_R) \end{aligned}$$
(10)

\(S_B^l\) and \(S_R^l\) denote the information learned from \(H_B\) and \(H_R\), respectively; \(attn_s\), \(attn_B\) and \(attn_R\) represent the self-attention module, the BERT-encoder-decoder attention module and the encoder-decoder attention module, respectively. We use a gate module so that the decoder can dynamically combine the information from the BERT and Raw encoders and control the information flowing to the next layer, as in (11) and (12).

$$\begin{aligned} g= \sigma (W[S_R^l:S_B^l]) \end{aligned}$$
(11)
$$\begin{aligned} S^l= g \times S_R^l + (1 - g) \times S_B^l \end{aligned}$$
(12)

\(S^l\) in (12) is the output of the l-th layer in the decoder. With a stacked decoder, we can derive the final hidden state S of the decoder. Finally, applying a softmax, we obtain the conditional probability in (13):

$$\begin{aligned} P(y|x) = \mathrm {softmax}(WS) \end{aligned}$$
(13)

In our proposed model, BERT is used as an auxiliary encoder and generates a different representation of the input. With their different tokenization methods, the BERT encoder and Raw encoder learn to express the input from different angles, and the decoder can dynamically exploit these different representations via the gate module.
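
To make Eqs. (8)–(12) concrete, the following is a minimal sketch of one gated decoder layer, assuming the BERT states have already been projected to the model dimension and omitting the residual connections, layer normalization, feed-forward sublayer and dropout of a full Transformer layer; it illustrates the fusion mechanism under these assumptions rather than reproducing the authors' implementation.

```python
import torch
import torch.nn as nn

class GatedFusionDecoderLayer(nn.Module):
    """One decoder layer with self-attention, two cross-attentions
    (over the Raw-encoder and BERT-encoder outputs), and a gate that
    fuses them, following Eqs. (8)-(12)."""

    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.attn_raw = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.attn_bert = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.gate = nn.Linear(2 * d_model, d_model)

    def forward(self, s_prev, h_raw, h_bert):
        # Eq. (8): self-attention over the previous decoder layer's states
        s_hat, _ = self.self_attn(s_prev, s_prev, s_prev)
        # Eqs. (9)-(10): attend separately to the BERT and Raw encoder outputs
        s_bert, _ = self.attn_bert(s_hat, h_bert, h_bert)
        s_raw, _ = self.attn_raw(s_hat, h_raw, h_raw)
        # Eqs. (11)-(12): the gate dynamically mixes the two information sources
        g = torch.sigmoid(self.gate(torch.cat([s_raw, s_bert], dim=-1)))
        return g * s_raw + (1.0 - g) * s_bert

# toy usage: batch 2, target length 7, Raw/BERT source lengths 9 and 11
layer = GatedFusionDecoderLayer()
s = torch.randn(2, 7, 512)
out = layer(s, torch.randn(2, 9, 512), torch.randn(2, 11, 512))
print(out.shape)  # torch.Size([2, 7, 512])
```

Because the two cross-attentions consume the Raw and BERT memories separately, the two encoders are free to produce sequences of different lengths, which is exactly why their representations cannot simply be concatenated or added.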

5 Experimental settings

5.1 Datasets

We use several commonly adopted benchmark datasets to evaluate the performance of our proposed methods: WMT17 EN\(\rightarrow\)ZH, WMT14 EN\(\leftrightarrow\)DE, and WMT16 EN\(\leftrightarrow\)RO. We also add experiments and analysis on IWSLT16 EN\(\rightarrow\)DE. These datasets consist of 20M, 4.5M, 610k, and 196k sentence pairs, respectively. For IWSLT16 EN\(\rightarrow\)DE, we use the test2013 dataset for validation. For WMT14 EN\(\leftrightarrow\)DE, we use newstest2013 as our validation set and newstest2014 as our test set. For WMT16 EN\(\leftrightarrow\)RO, we use newsdev2016 and newstest2016 as our development and test sets. For WMT17 EN\(\rightarrow\)ZH, we use newsdev2017 and newstest2017 as our validation and test sets, respectively. For all tasks, we use the script from Moses (Koehn et al. 2007) as our tokenization tool, and we segment each word into subword units with BPE (Sennrich et al. 2015). The vocabulary size for all tasks is 40k and is shared between source and target languages. We use BLEU (Papineni et al. 2002) as our evaluation metric.
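
The text does not specify which BLEU implementation was used; as one common choice, the snippet below scores detokenized outputs with sacrebleu, where the hypothesis and reference lists are placeholders.

```python
import sacrebleu

# placeholder system outputs and single references, one string per sentence
hypotheses = ["The cat sits on the mat .", "A test sentence ."]
references = ["The cat sat on the mat .", "A test sentence ."]

# corpus_bleu takes the hypothesis list and a list of reference streams
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU = {bleu.score:.2f}")
```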

5.2 Baselines

We use the Transformer model (Vaswani et al. 2017) as our autoregressive baseline. We choose several recently proposed NAT methods as our NAT baselines:

  • NAT (Gu et al. 2017) is the first non-autoregressive model.

  • I-NAT (Lee et al. 2018) is the first model that utilizes iterative refinement to refine the translations.

  • Mask-Predict (Ghazvininejad et al. 2019) introduces a masked language model to train the NAT model.

  • LevT (Gu et al. 2019) utilizes three decoders to determine which operation (Deletion, Insertion, or Filling) should be done.

  • CTC (Libovickỳ and Helcl 2018) utilizes a CTC model to learn the alignment between source and target sentences.

  • SMART (Ghazvininejad et al. 2020b) adapts semi-autoregressive training to improve the non-autoregressive model.

  • NAT-REG (Wang et al. 2019) introduces explicit regularization to reduce repetitive words in the NAT model.

  • BoN-NAT (Shao et al. 2020) introduces a bag-of-ngrams loss to improve the performance of the NAT model.

  • Hint-NAT (Li et al. 2019) distils the output of the attention module of an autoregressive model to improve the performance of the NAT model.

  • FlowSeq (Ma et al. 2019) models the generation flow as latent variables.

  • CRF-NAT (Sun et al. 2019) introduces an approximate CRF to model the structure of the target sentence.

  • AXE-NAT (Ghazvininejad et al. 2020a) introduces a new loss function to align target words with source words.

  • KERMIT (Chan et al. 2019) is an insertion-based generative model.

  • Imputer (Saharia et al. 2020) models the alignments as latent variables to improve the performance of the NAT model.

5.3 Model configurations

We follow the standard hyperparameters for Transformer in the base configuration (Vaswani et al. 2017): a 6-layer stack, 8 attention heads per layer, 512 model dimensions, and 2048 hidden dimensions. To effectively learn the representation of the source language, for IWSLT16 EN\(\rightarrow\)DE, WMT14 EN\(\rightarrow\)DE, WMT17 EN\(\rightarrow\)ZH, and WMT16 EN\(\rightarrow\)RO, we use bert-base-uncased to initialize our enhanced encoder. For WMT14 DE\(\rightarrow\)EN, we use bert-base-german-cased as initialization. For WMT16 RO\(\rightarrow\)EN, we initialize our enhanced encoder with bert-base-multilingual-uncased. For regularization, we use 0.3 dropout, 0.01 \(L_2\) weight decay, and label-smoothed cross-entropy loss with \(\epsilon =0.1\). We follow Ghazvininejad et al. (2019) and train our model with batches of 128k tokens using Adam (Kingma and Ba 2014). We train all the models for 300k steps and average the 5 best checkpoints to create the final model.
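
The BERT checkpoints named above can be loaded, for example, with the HuggingFace Transformers library; the loader below is our own sketch (the paper does not state which BERT implementation was used), and the resulting 768-dimensional BERT states would presumably be adapted to the 512-dimensional model space before fusion.

```python
from transformers import BertModel, BertTokenizer

# Checkpoint names taken from the text; the mapping and the use of
# HuggingFace Transformers are our own assumptions for illustration.
BERT_BY_SOURCE_LANG = {
    "en": "bert-base-uncased",               # IWSLT16/WMT14 EN->DE, WMT17 EN->ZH, WMT16 EN->RO
    "de": "bert-base-german-cased",          # WMT14 DE->EN
    "ro": "bert-base-multilingual-uncased",  # WMT16 RO->EN
}

def load_bert_encoder(src_lang):
    name = BERT_BY_SOURCE_LANG[src_lang]
    tokenizer = BertTokenizer.from_pretrained(name)
    encoder = BertModel.from_pretrained(name)
    return tokenizer, encoder

tokenizer, encoder = load_bert_encoder("en")
inputs = tokenizer("A source sentence to encode .", return_tensors="pt")
h_bert = encoder(**inputs).last_hidden_state   # H_B: (1, src_len, 768)
print(h_bert.shape)
```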

5.4 Sequence-level distillation

According to previous work (Gu et al. 2017; Zhou et al. 2019), in non-autoregressive MT, sequence-level knowledge distillation (Kim and Rush 2016) is critical for NAT models. In this work, we follow this previous work and train all our models based on the translations generated by an autoregressive model. We then discuss the influence of distillation on our model.
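
Sequence-level distillation here amounts to replacing the target side of the training corpus with the outputs of a trained autoregressive teacher (Kim and Rush 2016); the outline below shows that pipeline, with teacher_translate standing in for whatever decoding interface the teacher provides (an assumed placeholder).

```python
def build_distilled_corpus(teacher_translate, source_sentences):
    """Sequence-level knowledge distillation (Kim and Rush 2016):
    the NAT student is trained on (source, teacher-translation) pairs
    instead of the original references.

    teacher_translate: callable mapping a source sentence to the teacher's
                       translation (assumed interface, e.g. beam search).
    """
    return [(src, teacher_translate(src)) for src in source_sentences]

# toy usage with a stand-in "teacher"
fake_teacher = lambda s: s.upper()          # placeholder for a real AT model
print(build_distilled_corpus(fake_teacher, ["a small test ."]))
```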

5.5 Model architecture

Because our method enhances the encoder in NAT models and does not restrict the architecture of the decoder, it can easily be migrated to any recent method. In this paper, we implement our method on the Levenshtein Transformer (LevT) (Gu et al. 2019). That is, we use the BERT encoder together with the Raw Transformer encoder as our encoder, and we use the decoder of the Levenshtein Transformer as our decoder. To dynamically control the information flowing to the next layer, we add a gate module to the decoder.

6 Results and analysis

6.1 Single step decoding

Firstly, we evaluate the performance of our method with single-step decoding, as shown in Table 1, comparing it with other non-autoregressive single-step decoding models. Our enhanced encoder LevT achieves 27.87 BLEU on WMT14 EN\(\rightarrow\)DE, an improvement of almost 2.0 BLEU points over Imputer, the state-of-the-art model for single-step non-autoregressive MT. In addition, our method also improves NAT performance to varying degrees on WMT14 DE\(\rightarrow\)EN, WMT16 EN\(\rightarrow\)RO and WMT17 EN\(\rightarrow\)ZH.

Table 1 Performance of various single-step decoding models. Our enhanced encoder LevT is able to outperform all prior single decoding models

6.2 Iterative decoding

We now analyze the performance of enhanced encoder LevT with more decoding iterations. We compare the performance of enhanced encoder LevT with other non-autoregressive models ranging from models requiring logarithmic to a constant number of decoding iterations. The results of our method are summarized in Table 2.

Our enhanced encoder LevT achieves 27.95 BLEU on WMT14 EN\(\rightarrow\)DE with only 2 iterative decoding steps, which is comparable with the autoregressive Transformer. With 4 iterative decoding steps, our method achieves 28.35 BLEU, slightly outperforming the autoregressive Transformer score of 27.8 BLEU. On WMT14 DE\(\rightarrow\)EN, we achieve 31.1 BLEU, which is on a par with the autoregressive Transformer. On WMT16 RO\(\rightarrow\)EN, with 2 iterative decoding steps, our method also outperforms the standard LevT, and with 4 iterative decoding steps it achieves similar results to the autoregressive Transformer. On WMT17 EN\(\rightarrow\)ZH, our model and SMART achieve similar levels of performance. We think that the sequence-level distillation by the autoregressive model limits the improvement in performance.

Table 2 Performance of various autoregressive and non-autoregressive models

6.3 Impact of decoding speed

Because of the introduction of the BERT encoder and the gate module, decoding speed may be adversely affected. In this section, we compare the decoding speed of our method with the standard LevT. For both models, we decode batches of 10 sentences on 1 Nvidia 1080Ti GPU. We measure the wall time from when the model and data have been loaded until the last example has been translated, and calculate the average decoding speed to assess the speed-performance trade-off.
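
A sketch of the latency measurement described above: wall-clock time over the whole test set, measured after the model and data are loaded and divided by the number of sentences; decode_batch is a placeholder for the real decoding call.

```python
import time

def measure_latency(decode_batch, batches):
    """Average wall-clock decoding time per sentence, timed after model and
    data loading. decode_batch is a placeholder for the actual decoder."""
    n_sentences = sum(len(b) for b in batches)
    start = time.perf_counter()
    for batch in batches:                # batches of 10 sentences in the paper
        decode_batch(batch)
    elapsed = time.perf_counter() - start
    return elapsed / n_sentences         # average seconds per sentence

# toy usage with a dummy "decoder"
dummy_decode = lambda batch: [s[::-1] for s in batch]
print(measure_latency(dummy_decode, [["abc", "def"] * 5]))
```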

The speeds of our method are shown in Table  3. We can observe that there is a decline in speed of our enhanced encoder LevT. However, compared with Transformer, our method can still obtain a 3.12\(\times\) speedup.

Table 3 Translation latency on WMT14 EN\(\rightarrow\)DE

6.4 Impact of distillation

We analyze the impact of distillation on our method by comparing the performance on the original training data (original data) and training data generated by a base Transformer teacher (distilled data) on IWSLT16 EN\(\rightarrow\)DE.

Table 4 IWSLT16 EnDe BLEU comparison on the impact of distillation

From Table 4, we can observe that in all cases, the model trained on distilled data outperforms the model trained on the original data. As the number of decoding steps increases, the gap between the models trained on original and distilled data decreases, which is identical to the observation of Zhou et al. (2019). Similarly, on WMT14 with distilled data, our method obtains a result comparable to the autoregressive Transformer, with 28.98 BLEU. Note that our model outperforms the Transformer with only 2 decoding steps.

6.5 Case study

Table 5 An example of IWSLT16 EN\(\rightarrow\)DE translation

We show an example from the IWSLT16 EN\(\rightarrow\)DE validation set in Table 5. For the words “shipping containers”, LevT generates a wrong word “Schiffcontainer”. In contrast, our model with different encoders generates the correct word “Schiffscontainer”. While LevT misunderstands the meaning of the word “cafes”, our model understands it correctly.

6.6 Ablation analysis

Table 6 The comparison of different encoders on the IWSLT16 EN\(\rightarrow\)DE validation set

To evaluate the effect of different encoders, we conduct an ablation analysis on the IWSLT16 EN\(\rightarrow\)DE validation set, and give the results in Table 6. We can see that using only BERT as the encoder decreases the performance of the NAT model, which is consistent with the conclusion of Zhu et al. (2020). In contrast, using BERT as an additional encoder in our model significantly improves the performance of the NAT model.

7 Conclusion

In this paper, we utilize a BERT model as an extra encoder to strengthen the ability of the encoder in non-autoregressive MT. Unlike most previous work, which focused mainly on the decoder in NAT models, our method mainly focuses on enhancing the encoder. With the addition of a gate module, the decoder can dynamically select representations of the input sentences from the Raw and BERT encoders. Furthermore, with its simple architecture, our method can be incorporated seamlessly into recent work. Our enhanced encoder LevT achieves 27.87 BLEU with a single generation step, which is comparable with the Transformer baseline on the WMT14 EN\(\rightarrow\)DE task.