
1 Introduction

Neural machine translation (NMT) [1, 2, 3, 12, 15] has made great progress and drawn much attention recently. NMT models mainly follow the attention-based encoder-decoder framework, in which the encoder maps the source sentence into representations in a common semantic space; at each time step, the decoder first collects source information over all source words via an attention function and then generates a target word based on the collected information.

Although different attention functions exist, including additive attention and dot-product attention [15], the main mechanism is almost the same: it first computes a weight for each source representation according to its relevance to the current target-side information, and then outputs the weighted sum of source representations as the source information for the current time step. From this process, we can see that the calculation of attention at each time step only involves the current target-side information and the keys (usually the representations of the source words). It does not involve the previous attention directly, and hence the attention at different time steps is computed independently. As a result, the attention component cannot know the completion degree of each source word, which leads to over-translation or under-translation [13]. Table 1 gives examples of both. Example (1) shows a case of over-translation where “23” has been translated twice. If the model could access the translation already derived from “23”, it might not attend to it so heavily when calculating attention. Example (2) shows a case of under-translation where the source words “5 zhōunián” have not been translated. Once the model can see the translated part of “5 zhōunián”, it can adjust to give more attention to it. In conclusion, if the model maintains, for each source word, the source content already translated and the target translation generated so far, it can produce more reasonable attention.

On these grounds, in order to address the problems of over-translation and under-translation, we propose a method that involves bilingual history information in the calculation of attention. The main idea is to gather the translated source and target information for each source word at each time step, and then accumulate the bilingual history related to each source word with GRUs. In this way, we can evaluate the completion degree of each source word and provide reasonable guidance for the calculation of attention. Experiments on the Chinese-to-English and English-to-German translation tasks show that our method achieves significant improvements over strong baselines and also produces better alignment.

Table 1. Two examples of Chinese-to-English NMT.

2 Background

Our work builds on the representative attention-based NMT model [1]. The basic framework is an end-to-end system following the encoder-decoder paradigm: the encoder consists of an RNN or a bidirectional RNN that generates the representations of the source sentence as a sequence of vectors, and another RNN serves as the decoder, which learns to align and translate by reading these vectors. In addition, the framework possesses an attention module that improves the alignment. We explain the model and its sub-components in detail below.

Encoder. The encoder employs two GRUs to run through the source words bi-directionally and obtain two sequences of hidden states as follows:

$$\begin{aligned} \overrightarrow{\varvec{\mathrm {h}}}_j = \overrightarrow{\varvec{\mathrm {GRU}}}\Big (x_j, \overrightarrow{\varvec{\mathrm {h}}}_{j-1}\Big ) \end{aligned}$$
(1)
$$\begin{aligned} \overleftarrow{\varvec{\mathrm {h}}}_j = \overleftarrow{\varvec{\mathrm {GRU}}}\Big (x_j, \overleftarrow{\varvec{\mathrm {h}}}_{j+1}\Big ) \end{aligned}$$
(2)

The formal representation of each word in the source sequence is then given by concatenating the corresponding hidden states in both directions, as shown in Eq. 3:

$$\begin{aligned} \varvec{\mathrm {h}}_j = \left[ {\overrightarrow{\varvec{\mathrm {h}}}_j};{\overleftarrow{\varvec{\mathrm {h}}}_j}\right] \end{aligned}$$
(3)
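
A minimal sketch of this bidirectional encoder in PyTorch (the toolkit used for our implementation, see Sect. 5.2) is given below; the class name, batch-first layout, and dimension defaults are illustrative assumptions rather than the exact implementation.

```python
# A minimal sketch of the bidirectional GRU encoder (Eqs. 1-3).
# Vocabulary size, embedding and hidden dimensions are illustrative choices.
import torch
import torch.nn as nn

class BiGRUEncoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=512, hid_dim=512):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        # bidirectional=True runs a forward and a backward GRU (Eqs. 1 and 2)
        self.gru = nn.GRU(emb_dim, hid_dim, batch_first=True, bidirectional=True)

    def forward(self, src_tokens):
        # src_tokens: (batch, src_len)
        emb = self.embedding(src_tokens)   # (batch, src_len, emb_dim)
        outputs, _ = self.gru(emb)         # (batch, src_len, 2 * hid_dim)
        # outputs already concatenates the forward and backward states (Eq. 3)
        return outputs
```

In the formulation above the two directions are written as two separate GRUs; a single bidirectional `nn.GRU` is equivalent for this purpose.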

Attention. The design of the attention module is inspired by the intuition that a source word and its corresponding target word are highly related when a new target word is generated. Thus, the module aims to build direct connections between highly related source and target words.

Above all, we need to compute the relevance between the current target-side state \(\varvec{\mathrm {s}}_{i-1}\) and each source hidden state \(\varvec{\mathrm {h}}_j\), which can be evaluated as

$$\begin{aligned} e_{ji}=\varvec{\mathrm {v}}_a^T \tanh \left( \varvec{\mathrm {W}}_a\varvec{\mathrm {s}}_{i-1} + \varvec{\mathrm {U}}_a\varvec{\mathrm {h}}_j\right) \end{aligned}$$
(4)

For computational convenience, we use the following formula to normalize the relevance of \(\varvec{\mathrm {h}}_j\) over the source hidden state sequence at the i-th decoding step:

$$\begin{aligned} \alpha _{ji} = \frac{\exp \left( e_{ji} \right) }{\sum _{j'=1}^{l_s} \exp \left( e_{j'i} \right) } \end{aligned}$$
(5)

Finally, the attention is computed as the weighted sum of all source hidden states, using the normalized relevance obtained in the previous step:

$$\begin{aligned} \varvec{\mathrm {a}}_i = \sum \nolimits _{j=1}^{l_s}\alpha _{ji}\varvec{\mathrm {h}}_j \end{aligned}$$
(6)

where \(l_s\) is the length of the source input.

Decoder. The decoder predicts a probability distribution over all words in the target vocabulary and outputs the word with the highest probability. It uses a variant of the GRU to roll forward the target information; the details are described in [1]. The current target hidden state \(\varvec{\mathrm {s}}_i\) is given by

$$\begin{aligned} \varvec{\mathrm {s}}_i = f(\varvec{\mathrm {y}}_{i-1}, \varvec{\mathrm {s}}_{i-1}, \varvec{\mathrm {a}}_i) \end{aligned}$$
(7)

The probability distribution \(\mathcal {D}_{i}\) over the target vocabulary at the i-th step depends jointly on the previous target word \(\varvec{\mathrm {y}}_{i-1}\), the attention \(\varvec{\mathrm {a}}_i\), and the rolled target information \(\varvec{\mathrm {s}}_{i}\). The relationship can be described as

$$\begin{aligned} \varvec{\mathrm {t}}_i = g(\varvec{\mathrm {y}}_{i-1}, \varvec{\mathrm {a}}_i, \varvec{\mathrm {s}}_i) \end{aligned}$$
(8)
$$\begin{aligned} \varvec{\mathrm {o}}_i = \varvec{\mathrm {W}}_o \varvec{\mathrm {t}}_i \end{aligned}$$
(9)
$$\begin{aligned} \mathcal {D}_{i} = \mathrm {softmax}\left( \varvec{\mathrm {o}}_i\right) \end{aligned}$$
(10)

where g represents a linear transformation; \(\varvec{\mathrm {t}}_i\) is mapped to \(\varvec{\mathrm {o}}_i\) by \(\varvec{\mathrm {W}}_o\) so that each target word corresponds to exactly one dimension of \(\varvec{\mathrm {o}}_i\).

Intuitively, the weight \(\alpha _{ji}\) and the score \(e_{ji}\) jointly reflect the influence of \(\varvec{\mathrm {h}}_{j}\) in deciding the next hidden state and generating the next target word.
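
To summarize the background model, the following is a minimal PyTorch sketch of a single decoding step implementing Eqs. (4)-(10); the module and variable names, the dimension defaults, and the use of a single GRU cell in place of the GRU variant of [1] are illustrative assumptions.

```python
# A sketch of one decoding step with additive attention (Eqs. 4-10).
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionDecoderStep(nn.Module):
    def __init__(self, emb_dim=512, hid_dim=512, enc_dim=1024, vocab_size=30000):
        super().__init__()
        self.W_a = nn.Linear(hid_dim, hid_dim, bias=False)   # W_a s_{i-1}
        self.U_a = nn.Linear(enc_dim, hid_dim, bias=False)   # U_a h_j
        self.v_a = nn.Linear(hid_dim, 1, bias=False)         # v_a^T tanh(.)
        # a single GRU cell stands in for the GRU variant of [1] (f in Eq. 7)
        self.gru = nn.GRUCell(emb_dim + enc_dim, hid_dim)
        self.W_t = nn.Linear(emb_dim + enc_dim + hid_dim, hid_dim)  # g in Eq. 8
        self.W_o = nn.Linear(hid_dim, vocab_size)             # Eq. 9

    def forward(self, y_prev_emb, s_prev, enc_states):
        # y_prev_emb: (batch, emb_dim), s_prev: (batch, hid_dim)
        # enc_states: (batch, src_len, enc_dim)
        energy = self.v_a(torch.tanh(
            self.W_a(s_prev).unsqueeze(1) + self.U_a(enc_states))).squeeze(-1)   # Eq. 4
        alpha = F.softmax(energy, dim=-1)                                        # Eq. 5
        context = torch.bmm(alpha.unsqueeze(1), enc_states).squeeze(1)           # Eq. 6
        s_i = self.gru(torch.cat([y_prev_emb, context], dim=-1), s_prev)         # Eq. 7
        t_i = self.W_t(torch.cat([y_prev_emb, context, s_i], dim=-1))            # Eq. 8
        dist = F.softmax(self.W_o(t_i), dim=-1)                                  # Eqs. 9-10
        return dist, s_i, alpha
```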

3 The Proposed Method

The attention component collects source information at each time step by taking a weighted sum of the representations of all source words, and the decoder then produces a target word according to the resulting attention. In this process, there is a semantic projection between the source attention and the target information, which implies that the semantics held by the source attention and by the generated target word are equivalent. Thus we can derive the consumed source semantics and the generated target semantics related to each source word at each step. With these, we can accumulate the consumed source semantics and the generated target semantics up to each time step. This bilingual history can well indicate the completion degree of each source word and hence helps to generate more reasonable attention.

Fig. 1. The architecture of our method with bilingual history involved attention.

Figure 1 gives the architecture of our method. After the target word \(y_i\) is generated, the source information related to the source word \(x_j\) is accumulated via a GRU into \({\tilde{\varvec{\mathrm {h}}}}_{j}^{i}\), and similarly the target information related to \(x_j\) is accumulated into \({\tilde{\varvec{\mathrm {s}}}}_{j}^{i}\). Then, to generate the next target word \(y_{i+1}\), the accumulated bilingual information is involved in calculating the attention weight of \(x_j\), and the weighted sum over the source hidden states is treated as the attention and fed to the decoder.

In this paper, we attempt to add different parts of this history information as follows:

  • SA-NMT: Only involve the source information up to now in the calculation of attention;

  • TA-NMT: Only involve the target information up to now in the calculation of attention;

  • BA-NMT: Involve both the source and target information up to now in the calculation of attention.

3.1 Source History Involved Attention

At the i-th time step, assume the accumulated source information related to the source word \(x_j\) is \({\tilde{\varvec{\mathrm {h}}}}_{j}^{i-1}\). To generate the target word \(y_i\), we calculate the attention with the source history information involved:

$$\begin{aligned} e_{ji}=\varvec{\mathrm {v}}_a^T \tanh \left( \varvec{\mathrm {W}}_{a}\varvec{\mathrm {s}}_{i-1} + \varvec{\mathrm {U}}_a\varvec{\mathrm {h}}_j + \varvec{\mathrm {V}}_h\tilde{\varvec{\mathrm {h}}}_{j}^{i-1}\right) \end{aligned}$$
(11)

Then we can get the attention following Eqs. 5 and 6.

According to the attention weight \(\alpha _{ji}\) of the source word \(x_j\), we can regard the quantity of source information related to \(x_j\) that is translated at the i-th time step as

$$\begin{aligned} \varvec{\mathrm {I}}_{ji}^{S}=\alpha _{ji}*\varvec{\mathrm {h}}_{j} \end{aligned}$$
(12)

However, we cannot accumulate the source information related to each source word by simply adding these quantities, because the translated information at each time step is not normalized with respect to the source word. Here we employ a GRU to accumulate it, hoping that the learnable update and reset gates can perform this normalization dynamically. Based on the source information up to the \((i-1)\)-th time step, we update the source information related to the word \(x_j\) up to the i-th time step as

$$\begin{aligned} {\tilde{\mathbf {h}}}_j^i = \mathbf {GRU}(\mathbf {I}_{ji}^{S}, {\tilde{\mathbf {h}}}_{j}^{i-1}) \end{aligned}$$
(13)

We initialize \({\tilde{\varvec{\mathrm {h}}}}_{j}^{0}\) with 0, which means that no source words have been translated yet. Besides, the accumulated source information is also involved in the calculation of the logit shown in Eq. 8. Before being fed to the logit, a weighted sum with the attention weights is performed over the history source information related to each source word:

$$\begin{aligned} \begin{aligned}&\tilde{\mathbf {h}}^{i-1}=\sum _j{\alpha _{ji}*\tilde{\mathbf {h}}_{j}^{i-1}} \\&\mathbf {t}_i = g(\mathbf {y}_{i-1}, \mathbf {a}_i, \mathbf {s}_i,{\tilde{\mathbf {h}}^{i-1}}) \end{aligned} \end{aligned}$$
(14)
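
A minimal sketch of this source-history accumulation (Eqs. 11-14) is shown below, assuming it is plugged into a decoder step such as the one sketched in Sect. 2; names such as `V_h` and `hist_gru` are illustrative, not the exact implementation.

```python
# Source-history accumulation for SA-NMT (Eqs. 11-14).
import torch
import torch.nn as nn

class SourceHistory(nn.Module):
    def __init__(self, enc_dim=1024, hid_dim=512):
        super().__init__()
        self.V_h = nn.Linear(enc_dim, hid_dim, bias=False)  # V_h * h~_j^{i-1} in Eq. 11
        # one GRU cell shared over source positions accumulates consumed information
        self.hist_gru = nn.GRUCell(enc_dim, enc_dim)         # Eq. 13

    def energy_term(self, h_tilde_prev):
        # extra additive term inside the tanh of Eq. 11
        # h_tilde_prev: (batch, src_len, enc_dim)
        return self.V_h(h_tilde_prev)

    def update(self, alpha, enc_states, h_tilde_prev):
        # alpha: (batch, src_len); enc_states, h_tilde_prev: (batch, src_len, enc_dim)
        batch, src_len, enc_dim = enc_states.size()
        consumed = alpha.unsqueeze(-1) * enc_states           # I_{ji}^S, Eq. 12
        h_tilde = self.hist_gru(
            consumed.reshape(batch * src_len, enc_dim),
            h_tilde_prev.reshape(batch * src_len, enc_dim),   # h~_j^0 is all zeros
        ).reshape(batch, src_len, enc_dim)                    # Eq. 13
        # weighted summary fed to the logit layer (first line of Eq. 14)
        summary = (alpha.unsqueeze(-1) * h_tilde_prev).sum(dim=1)
        return h_tilde, summary
```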

3.2 Target History Involved Attention

When calculating the attention, we can consider that the source-side information contained in the current attention is equal to the information of the currently generated target word. Thus, the target information attributed to each source word at the i-th step is

$$\begin{aligned} \mathbf {I}_{ji}^{T}=\alpha _{ji}*\mathbf {s}_{i-1} \end{aligned}$$
(15)

Again, \(\mathbf {I}_{ji}^{T}\) is not normalized over the source words, and we use a GRU to accumulate it:

$$\begin{aligned} \tilde{\mathbf {s}}_{j}^{i}= \mathbf {GRU}(\mathbf {I}_{ji}^{T}, \tilde{\mathbf {s}}_{j}^{i-1}) \end{aligned}$$
(16)

where \( \tilde{\varvec{\mathrm {s}}}_{j}^{i}\) denotes the historical information accumulated on the target end. We also take this historical target information into account when calculating attention, so we rewrite the attention model in Eq. (4) as follows:

$$\begin{aligned} e_{ji}=\mathbf {v}_a^T \tanh \left( \mathbf {W}_{a}\mathbf {s}_{i-1} + \mathbf {U}_a\mathbf {h}_j + \mathbf {V}_s\tilde{\mathbf {s}}_{j}^{i-1}\right) \end{aligned}$$
(17)

Note that \(\tilde{\varvec{\mathrm {s}}}_{j}^{i}\) captures the translated target-end history associated with the j-th source hidden state. Then, we rewrite \(\varvec{\mathrm {t}}_i\) in Eq. (8) as follows:

$$\begin{aligned} \begin{aligned}&\tilde{\mathbf {s}}^{i-1}=\sum _j{\alpha _{ji}*\tilde{\varvec{\mathrm {s}}}_{j}^{i-1}} \\&\mathbf {t}_i = g(\mathbf {y}_{i-1}, \mathbf {a}_i, \mathbf {s}_i,\tilde{\mathbf {s}}^{i-1}) \end{aligned} \end{aligned}$$
(18)
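
The target-history accumulation (Eqs. 15-18) can be sketched analogously; the names below are again illustrative, and the structure mirrors the source-history module, with the decoder state \(\mathbf {s}_{i-1}\) taking the place of \(\mathbf {h}_j\).

```python
# Target-history accumulation for TA-NMT (Eqs. 15-18).
import torch
import torch.nn as nn

class TargetHistory(nn.Module):
    def __init__(self, hid_dim=512):
        super().__init__()
        self.V_s = nn.Linear(hid_dim, hid_dim, bias=False)   # V_s * s~_j^{i-1} in Eq. 17
        self.hist_gru = nn.GRUCell(hid_dim, hid_dim)          # Eq. 16

    def update(self, alpha, s_prev, s_tilde_prev):
        # alpha: (batch, src_len); s_prev: (batch, hid_dim)
        # s_tilde_prev: (batch, src_len, hid_dim)
        batch, src_len, hid_dim = s_tilde_prev.size()
        # I_{ji}^T = alpha_{ji} * s_{i-1}, Eq. 15
        attributed = alpha.unsqueeze(-1) * s_prev.unsqueeze(1)
        s_tilde = self.hist_gru(
            attributed.reshape(batch * src_len, hid_dim),
            s_tilde_prev.reshape(batch * src_len, hid_dim),
        ).reshape(batch, src_len, hid_dim)                     # Eq. 16
        # weighted summary fed to the logit layer (first line of Eq. 18)
        summary = (alpha.unsqueeze(-1) * s_tilde_prev).sum(dim=1)
        return s_tilde, summary
```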

3.3 Bilingual History Involved Attention

Figure 1 illustrates the combination pattern of the bilingual history involved attention mechanism. The bilingual history consists of, for each source word, the amount of source information that has already been translated and the amount of target information that has already been generated when calculating attention. Intuitively, we combine the two kinds of history by rewriting the attention model. Thus we have

$$\begin{aligned} e_{ji}&=\mathbf {v}_a^T \tanh (\mathbf {W}_{a} \mathbf {s}_{i-1} + \mathbf {U}_{a}\mathbf {h}_{j} \nonumber \\&\qquad +\mathbf {V}_{h}\tilde{\mathbf {h}}_{j}^{i-1} +\mathbf {V}_{s}\tilde{\mathbf {s}}_{j}^{i-1}) \end{aligned}$$
(19)
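
A sketch of this combined energy computation is given below, assuming the linear maps \(\mathbf {W}_a\), \(\mathbf {U}_a\), \(\mathbf {V}_h\), \(\mathbf {V}_s\) and the vector \(\mathbf {v}_a\) are defined as in the earlier sketches; in BA-NMT this energy replaces that of Eq. (4).

```python
# Combined attention energy for BA-NMT (Eq. 19).
import torch

def bilingual_energy(v_a, W_a, U_a, V_h, V_s,
                     s_prev, enc_states, h_tilde_prev, s_tilde_prev):
    # s_prev: (batch, hid); enc_states, h_tilde_prev: (batch, src_len, enc_dim)
    # s_tilde_prev: (batch, src_len, hid)
    pre = (W_a(s_prev).unsqueeze(1)      # W_a s_{i-1}
           + U_a(enc_states)             # U_a h_j
           + V_h(h_tilde_prev)           # V_h h~_j^{i-1}
           + V_s(s_tilde_prev))          # V_s s~_j^{i-1}
    return v_a(torch.tanh(pre)).squeeze(-1)   # e_{ji}: (batch, src_len)
```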

4 Related Work

Attention in neural machine translation [1, 7] is an essential mechanism for improving RNN-based encoder-decoder models; it is designed to assign different weights to different inputs. Several new models [13] have been proposed to improve the attention mechanism, some of which [13] integrate the previous attention history into the current attention for better alignment.

Self-attention is another popular mechanism in recent studies. The look-ahead attention proposed by [17] is able to model dependency relationships between distant target words. That model extends the attention mechanism by referring to previously generated target words, whereas previous work largely focuses on learning to align with source words. [5] further presented a variational self-attention mechanism that extracts different aspects of the sentence and partitions them into multiple vector representations.

Exploiting historical information to improve attention is another line of work. [8] proposed to introduce source-end historical information into attention, using an interactive attention that rewrites the source information during translation and keeps track of the source history through reading and writing operations. [16] proposed to introduce target-end historical information into attention, focusing on integrating the decoding history. However, the utilization of historical information was basically limited to either the source end or the target end; our work combines the bilingual history together.

5 Experiments

5.1 Data Preparation

We mainly evaluated our approach on the widely used NIST Chinese-English translation task. In addition, to show the generality of our approach, we also report results on the English-German translation task. We therefore carried out experiments on two datasets:

NIST Zh\(\rightarrow \)En: Our training data for the Chinese-English task consists of 1.25M sentence pairs. We chose the NIST 2002 test set as our development set, and the NIST 2003, 2004, 2005, and 2006 datasets as the test sets.

WMT14 En\(\rightarrow \)De: Our training data for the English-German task consists of 4.45M sentence pairs. We used newstest2013 as the validation set and newstest2014 as the test set.

In our experiments, we used case-insensitive 4-gram BLEU [10] for Zh\(\rightarrow \)En and case-sensitive BLEU for En\(\rightarrow \)De to evaluate translation performance.

Table 2. Performance comparison on Zh\(\rightarrow \)En translation. “\(\ddag \)” indicates a statistically significant improvement over RNNsearch\(^{\star }\); “\(\star \)” indicates a statistically significant improvement over NN-Coverage and IA-Model. Here \(p < 0.05\) [14].

5.2 Systems

We compared the following systems:

RNNsearch: We implemented the conventional attention-based NMT model of [1] with PyTorch.

RNNsearch\(^{\star }\): An improved version of RNNsearch.

NN-Coverage: A variant of the attention-based NMT model [13] which maintains a soft coverage on each source representation to keep track of the attention history and improve the attention mechanism.

IA-Model: An improved NMT model [8] which captures the translation status with an interactive attention that tracks the attention history.

5.3 Configuration

For the NIST Zh\(\rightarrow \)En dataset, we adopted 16K byte pair encoding (BPE) [11] merge operations on both the source and target sides. Sentence length was limited to 128 tokens on both sides. For WMT En\(\rightarrow \)De, the number of BPE merge operations was set to 32K for both source and target languages, and the maximum sentence length was also set to 128.

We used a shared configuration for all systems. All embedding sizes were set to 512, the size of all hidden units in the encoder and decoder RNNs was also set to 512, and all parameters were initialized from a uniform distribution over \(\left[ -0.1,0.1\right] \). We trained with mini-batch gradient descent, batching sentence pairs by approximate length and limiting input and output tokens to 4096 per batch. The learning rate was adjusted by the Adam optimizer [4] (\(\beta _1=0.9\), \(\beta _2=0.999\), and \(\epsilon =10^{-6}\)). Dropout was applied on the output layer with a rate of 0.2. The beam size was set to 10.
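
The settings above roughly correspond to the following sketch; the learning rate value is an assumption (it is not stated here), while the initialization range, Adam hyper-parameters, dropout rate, beam size, and token budget follow the text.

```python
# A hedged sketch of the training configuration described above.
import torch
import torch.nn as nn

def configure(model: nn.Module, lr: float = 5e-4):  # lr is an assumed value
    # uniform initialization of all parameters over [-0.1, 0.1]
    for p in model.parameters():
        nn.init.uniform_(p, -0.1, 0.1)
    # Adam with the stated betas and epsilon
    optimizer = torch.optim.Adam(
        model.parameters(), lr=lr, betas=(0.9, 0.999), eps=1e-6)
    return optimizer

output_dropout = nn.Dropout(p=0.2)   # applied on the output layer
beam_size = 10
max_tokens_per_batch = 4096
```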

5.4 Ablation Study

We employed several methods to improve the performance of our model. For instance, we keep track of the source history and feed it into the attention model, which alleviates the problem of under-translation to a certain extent. Furthermore, we model the dependency between the previously generated target words and the source words, where each source word is put in correspondence with the target information generated from it.

Table 3. Ablation study with average BLEU scores.
Table 4. Performance comparison on En\(\rightarrow \)De translation.

The translation performance, measured in BLEU, is listed in Table 3. In all cases, our proposed history involved attention model outperforms the RNNsearch\(^{\star }\) system. Specifically, we obtained a BLEU score of 43.52 when employing only the Source History Involved Attention, which indicates that feeding the translated history as context can help mitigate exposure bias; this improves over RNNsearch\(^{\star }\) by 0.68 BLEU points and proves its effectiveness. Likewise, applying only the Target History Involved Attention achieved a comparable BLEU score, improving over RNNsearch\(^{\star }\) by 0.99 BLEU points. Finally, we combined the above two attention mechanisms, expecting a more remarkable improvement.

On the En\(\rightarrow \)De dataset, as shown in Table 4, BA-NMT shows its superiority on the test set, achieving a gain of 0.8 BLEU points over the RNNsearch\(^{\star }\) system. Given the above results, we conclude that BA-NMT can indeed better utilize the historical information and improve translation performance.

5.5 Alignment Quality

As the BLEU scores have shown that our method achieves more accurate translation, we also verify this conclusion from another perspective. Since better translations are commonly believed to align better with the source sentence, we evaluate the quality of the alignments derived from the attention module of NMT using AER [9]. For this evaluation, we use the human-aligned dataset from [6], containing 900 Chinese-English sentence pairs.

In practice, for each target word we retained the alignment link with the highest probability in Eq. (5). For comparison, we report the results of both the baseline system and our system. As shown in Table 5, measured by BLEU, our BA-NMT system produces more accurate translations than RNNsearch\(^{\star }\); meanwhile, its AER score is lower, suggesting better alignments.

Table 5. Comparison of alignment quality on the Zh\(\rightarrow \)En translation task; the BLEU and AER scores are evaluated on different test sets.
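
A minimal sketch of the alignment extraction and AER computation is given below; the attention matrix layout and the sure/possible link sets are illustrative, and the AER definition follows [9].

```python
# Extract alignments from attention weights and score them with AER.
from typing import Set, Tuple

def extract_links(alpha) -> Set[Tuple[int, int]]:
    # alpha: (tgt_len, src_len) attention weights from Eq. 5;
    # for each target position, keep the source position with the highest weight
    return {(i, int(row.argmax())) for i, row in enumerate(alpha)}

def aer(hyp: Set[Tuple[int, int]],
        sure: Set[Tuple[int, int]],
        possible: Set[Tuple[int, int]]) -> float:
    # AER = 1 - (|A ∩ S| + |A ∩ P|) / (|A| + |S|); sure links are also possible
    possible = possible | sure
    return 1.0 - (len(hyp & sure) + len(hyp & possible)) / (len(hyp) + len(sure))
```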

6 Conclusion

In this work, we present a novel Bilingual History Involved Attention for attention-based NMT. Our core innovation is that the model keeps track of both the source history and the target history, which helps it better utilize the historical information and generate more accurate translations. We further explore the application of our model to NMT tasks and conduct experiments with three strategies for integrating the historical information into NMT. The empirical results are consistent with our expectations and show that the Bilingual History Involved Attention model achieves better alignment quality than the baseline model, especially in complicated cases. Moreover, the proposed model effectively alleviates the problems of over-translation and under-translation.