
1 Introduction

Neural machine translation (NMT) [5, 14] has reached quality comparable to human translation on several large-scale datasets. However, the amount of corpus "memory" that can be stored in the network parameters is limited, and the knowledge the model learns is poorly interpretable. Moreover, when new "knowledge" is encountered, the model requires large-scale parameter updates, so its scalability is limited; this is especially evident in low-resource tasks. The recently proposed kNN-MT and its variants [7, 15, 19, 20] combine the traditional NMT model with a token-level memory retrieval module. These methods decouple the model's memory ability from its parameters by storing the training data in a memory module, allowing the model to access a domain-specific datastore directly and improve translation accuracy without fine-tuning the entire model, which makes it easier to cope with discrepancies across domain distributions and improves the generality of the trained models.

Previous works usually fuse the external knowledge guidance and the NMT prediction with simple linear interpolation, using a hyperparameter to control the fusion ratio and obtain the final probability distribution. However, applying the same fusion ratio to all sentences can be problematic: our experiments show that the translation results are quite sensitive to the choice of this hyperparameter, which hurts the robustness and stability of the model. Furthermore, although kNN-MT and its related models greatly improve translation performance, they have a serious drawback in practical applications, namely slow decoding. The main reasons are that the memory module is quite large, that finding similar sentences requires similarity computation over high-dimensional vectors, and that the whole memory module must be searched for every retrieval probability during decoding.

This paper aims to improve low-resource machine translation by addressing the above problems. For the former, when retrieving from and fusing with the external memory module, we abandon traditional linear interpolation and adopt a non-parametric dynamic fusion method based on Monte Carlo, which improves the robustness and generalization of the model. For the latter, we speed up translation by reducing the retrieval frequency. Specifically, a sub-network judges the confidence of the model's predictions, retrieval is performed only when that confidence is low, and decoding efficiency is improved by filtering out unnecessary retrieval operations. Extensive experiments on the low-resource CCMT2019 and medium-high-resource CCMT2022 Mongolian-Chinese translation tasks demonstrate the effectiveness of our method.

2 Background and Related Work

2.1 Memory-Augmented NMT

Mark [2] first applies memory-augmented neural networks to machine translation, attaching word correspondences from statistical machine translation to the decoder as "memory" to increase the probability of rare words, which is particularly effective on small datasets. Akiko [4] enhances the model's translation capability by constructing a sentence-level memory bank. Zhang [18] constructs a fragment-level memory bank from which the model can obtain more information, collecting n-gram translation fragments from the target side with higher similarity and alignment scores.

Khandelwal proposes kNN-MT [7], which builds a token-level memory module on top of the traditional NMT model. The module stores key-value pairs of contextual representations of the training data and their target words, so that memory retrieval matches the current context more closely. Figure 1 illustrates how the kNN algorithm retrieves from the memory module. The key idea is, when translating the current word, to query the external memory module for the target words of neighboring contexts similar to the current one, obtaining reference and guidance from the module, and then to fuse this retrieval distribution with the NMT prediction through simple linear interpolation to obtain the final translation probability:

$$\begin{aligned} \begin{aligned} {p} \left( y_{t} \mid x, \hat{y}_{1:t-1} \right) =\lambda {p} _{NMT} \left( y_{t} \mid y_{<t}, x \right) +\left( 1-\lambda \right) {p} _{Mem} \left( y_{t} \mid y_{<t} \right) \end{aligned} \end{aligned}$$
(1)
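For concreteness, the following is a minimal sketch of the retrieval-and-interpolation step in Eq. (1), assuming the datastore keys are indexed with Faiss and the values are target-token ids; the function name, arguments, and default hyperparameters are illustrative rather than taken from the cited implementations.

```python
import numpy as np
import faiss  # similarity search library commonly used for the kNN-MT datastore


def knn_fuse_step(h_t, p_nmt, index, value_ids, vocab_size,
                  k=8, temperature=10.0, lam=0.7):
    """One decoding step of vanilla kNN-MT fusion (Eq. 1), sketched.

    h_t        -- decoder hidden state of the current step, shape (d,)
    p_nmt      -- NMT softmax distribution over the vocabulary, shape (V,)
    index      -- Faiss index built over the stored decoder states (keys)
    value_ids  -- target-token id of each datastore entry (values)
    lam        -- fixed interpolation weight lambda in Eq. (1)
    """
    # Retrieve the k nearest stored contexts and their distances.
    distances, neighbors = index.search(h_t.reshape(1, -1).astype("float32"), k)

    # Convert distances into a retrieval distribution p_Mem over target tokens.
    weights = np.exp(-distances[0] / temperature)
    weights /= weights.sum()
    p_mem = np.zeros(vocab_size, dtype=np.float64)
    for w, n in zip(weights, neighbors[0]):
        p_mem[value_ids[n]] += w

    # Fixed linear interpolation of the two distributions (Eq. 1).
    return lam * p_nmt + (1.0 - lam) * p_mem
```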

After that, many variant models have been proposed, such as Adaptive kNN-MT [19], which trains a meta-k network by artificially constructing features to generate the nearest neighbor hyper-parameter k. Fast kNN-MT [12] introduces efficient hierarchical retrieval to improve the slow translation speed. Moreover, many researchers apply this idea to other natural language processing fields, such as question answering tasks and dialogue systems.

2.2 Decoding Efficiency Optimization

Throughout the development of memory-augmented NMT, decoding has remained slow even though vector retrieval tools such as Faiss [6] are available. We summarize three mainstream decoding optimization algorithms from recent years:

  1. Dimensionality reduction algorithms such as PCA and SVD are used to compress the high-dimensional vectors of the memory module (see the sketch after this list). These algorithms are simple to apply, but some high-dimensional positional information is lost during the reduction, which has a certain negative impact on translation performance [15].

  2. Reducing the memory module capacity by merging key-value pairs [11] or by clustering high-dimensional vectors and discarding redundant entries [15], thereby narrowing the scope of retrieval and improving decoding efficiency. Experiments show that both methods can greatly reduce the capacity of the memory module, but model performance decreases significantly.

  3. Reducing the retrieval frequency by caching a certain amount of retrieval history [16] or by adjusting the retrieval granularity [10]. The former imitates caching in computer architecture, while the latter draws on the space-for-time trade-off in algorithm design, retrieving more tokens at once to reduce the number of retrievals and using heuristic rules to decide which retrieval operations can be discarded.
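As an illustration of the first strategy, the sketch below compresses the datastore keys with PCA using scikit-learn; the function name and target dimension are hypothetical, and queries must be projected with the same fitted transform at decoding time.

```python
import numpy as np
from sklearn.decomposition import PCA


def reduce_datastore_keys(keys, target_dim=64):
    """Compress high-dimensional datastore keys with PCA (strategy 1 above).

    keys -- stored decoder states, shape (num_entries, d_model)
    Returns the reduced keys plus the fitted PCA, which must also be applied
    to every query vector during decoding.
    """
    pca = PCA(n_components=target_dim)
    reduced = pca.fit_transform(keys).astype(np.float32)
    return reduced, pca


# During decoding, project the query into the same reduced space:
# query = pca.transform(h_t.reshape(1, -1)).astype(np.float32)
```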

Fig. 1.

The illustration of the proposed method. The confidence network generates a confidence estimate \(c_t\) at each decoding step; the model prediction is output directly if \(c_t\) is larger than the set threshold c, otherwise a retrieval probability \({p} _{Mem}\) representing the "guidance" of the memory module is constructed by retrieving similar contexts from it. \({p} _{Mem}\) and \({p} _{NMT}\) are then fused dynamically based on the Monte Carlo algorithm to obtain the final prediction \(y_{t}\).

3 Methodology

The overall architecture of the model is shown in Fig. 1; this section describes the methodology in detail.

3.1 Monte Carlo Non-parametric Fusion

Vanilla kNN-MT [7] uses simple linear interpolation to fuse \({p} _{NMT}\) and \({p} _{Mem}\). However, due to the long-tail effect of the dataset, some sentences have many similar sentences while others have few; applying the same fusion ratio to all sentences may provide insufficient information for some and introduce noise for others. To solve this problem, this paper proposes a non-parametric dynamic fusion method based on the Monte Carlo algorithm, which abandons the fixed fusion of linear interpolation and alleviates the problem that a fixed fusion ratio cannot adapt to all fusion scenarios. Our method mainly applies to the inference stage: the predictions retrieved from the memory module and the predictions of the NMT model are placed in a large candidate collection, the Monte Carlo algorithm is used to complete the whole sentence conditioned on \(y_{<t}\), and the candidate whose simulated sentence achieves the highest BLEU is selected as the current word and output.

Specifically, we use the generator parameters \(\theta \) of a trained Conditional Sequence Generative Adversarial Net [17] and apply Monte Carlo search under the policy \(G^{\theta }\) to sample the unknown tokens.

$$\begin{aligned} \begin{aligned} \left\{ y_{1:T_{1} }^{1},..., y_{1:T_{N} }^{N} \right\} =MC^{G^{\theta } } \left( \left( y_{1:t-1},x \right) ,N \right) \end{aligned} \end{aligned}$$
(2)

where \(T_{i}\) is the length of the sentence sampled by the i-th Monte Carlo search, \(y_{1:t-1}\) are the previously generated tokens, and \(y_{t:T_{N} }^{N}\) is sampled based on the policy \(G^{\theta }\). We calculate the BLEU of the N sampled sentences and take as the final prediction the current word \(y_{t}\) whose simulated sentence achieves the highest BLEU.
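A minimal sketch of this selection procedure follows. It assumes a rollout function implementing the policy \(G^{\theta }\) and a sentence-level scoring function (e.g., sentence BLEU when a reference is available); all names are illustrative, and how rollouts are scored in practice follows the setup described above.

```python
def monte_carlo_select(src, prefix, candidates, rollout, score_fn, n_rollouts=4):
    """Choose the next token whose Monte Carlo completions score best (Eq. 2).

    src         -- source sentence tokens
    prefix      -- previously generated tokens y_{1:t-1}
    candidates  -- candidate next tokens collected from p_NMT and p_Mem
    rollout     -- callable implementing G^theta: completes a sentence
                   given (src, partial_target)
    score_fn    -- sentence-level scorer, e.g. sentence BLEU
    """
    best_token, best_score = None, float("-inf")
    for token in candidates:
        partial = prefix + [token]
        # Average the score over N Monte Carlo completions of this prefix.
        scores = [score_fn(rollout(src, partial)) for _ in range(n_rollouts)]
        avg = sum(scores) / len(scores)
        if avg > best_score:
            best_token, best_score = token, avg
    return best_token
```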

3.2 Gating Mechanism Based on Confidence Estimation

To enhance decoding efficiency without harming translation quality, this paper proposes a decoding optimization algorithm that reduces the retrieval frequency with a confidence-based gating mechanism. If the confidence is high, the model output is used directly without retrieving from the memory module; otherwise retrieval is performed to help the model generate words with higher confidence.

Inspired by DeVries [3] and Lu's [9] studies, we interpret confidence as how many prompts the NMT model needs to make a correct prediction. During training, the model can consult the ground truth to generate translations, but each prompt comes at the cost of a certain penalty. We encourage the model to translate independently in most cases to avoid penalties, but when the model's own ability is insufficient to generate tokens with high confidence, it can turn to the reference for help to ensure that the loss function decreases. Therefore, this paper utilizes a confidence network (Fig. 3) to learn word-level confidence. It takes the hidden state of the decoder as input and outputs a single scalar between 0 and 1 as the confidence of the currently generated word, which serves as the threshold indicator for the gating mechanism: \(c_t\) close to 1 indicates the model is confident that it can translate correctly, whereas \(c_t\) close to 0 asks for more prompts:

$$\begin{aligned} \begin{aligned} c_t=\sigma (W'h_t+b') \end{aligned} \end{aligned}$$
(3)

where \(W'\) and \(b'\) are trainable parameters and \(\sigma (\cdot )\) is the sigmoid function. To supply the model with "prompts" during training, we use \(c_t\) as an interpolation ratio to fuse the one-hot encoding of the ground truth \(y_t\) with the model prediction, adjusting the original prediction probability, and the translation loss is calculated from the adjusted probabilities:

$$\begin{aligned} \begin{aligned} p_t'=c_t\cdot p_t+(1-c_t)\cdot y_t \end{aligned} \end{aligned}$$
(4)
$$\begin{aligned} \begin{aligned} \mathcal {L} _{NMT} = {\textstyle \sum _{t=1}^{T}} -y_t\log {(p_t')} \end{aligned} \end{aligned}$$
(5)

Furthermore, we add a penalty term to the loss function to prevent the model from minimizing the loss by driving \(c_t\rightarrow 0\). The final loss is the weighted sum of the translation loss and the confidence loss. Since the model is "fragile" in the early training stage and cannot be relied on for prompts at that point, the value of \(\lambda \) is dynamically controlled by the training step, where \(\lambda _0\) and \(\beta _0\) control the initial value and the decay speed of \(\lambda \):

$$\begin{aligned} \begin{aligned} \mathcal {L} _{Conf} = {\textstyle \sum _{t=1}^{T}} -\log {(c_t)} \end{aligned} \end{aligned}$$
(6)
$$\begin{aligned} \begin{aligned} \mathcal {L} = \mathcal {L}_{NMT} + \lambda \mathcal {L}_{Conf} \quad \quad \lambda (s) = \lambda _0 \cdot e^{-s/\beta _0 } \end{aligned} \end{aligned}$$
(7)
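A minimal PyTorch sketch of the confidence network and the joint loss in Eqs. (3)-(7) is given below; the module and function names are illustrative, and the small epsilon constants are added only for numerical stability.

```python
import math

import torch
import torch.nn as nn


class ConfidenceNetwork(nn.Module):
    """Maps the decoder hidden state h_t to a confidence c_t in (0, 1) (Eq. 3)."""

    def __init__(self, hidden_dim):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, 1)

    def forward(self, h_t):
        # c_t = sigmoid(W' h_t + b')
        return torch.sigmoid(self.proj(h_t)).squeeze(-1)


def joint_loss(p_t, y_onehot, c_t, step, lambda_0=30.0, beta_0=45000.0, eps=1e-9):
    """Joint translation + confidence loss of Eqs. (4)-(7), sketched.

    p_t      -- model prediction distribution, shape (batch, T, V)
    y_onehot -- one-hot ground truth, shape (batch, T, V)
    c_t      -- confidence estimates, shape (batch, T)
    step     -- current training step s, used to anneal lambda
    """
    c = c_t.unsqueeze(-1)
    # Eq. (4): interpolate the prediction with the ground truth using c_t.
    p_adj = c * p_t + (1.0 - c) * y_onehot
    # Eq. (5): cross-entropy over the adjusted probabilities, summed over t.
    nmt_loss = -(y_onehot * torch.log(p_adj + eps)).sum(dim=-1).sum(dim=-1).mean()
    # Eq. (6): penalty that discourages c_t -> 0 (always asking for prompts).
    conf_loss = -torch.log(c_t + eps).sum(dim=-1).mean()
    # Eq. (7): weighted sum, with lambda decayed over training steps.
    lam = lambda_0 * math.exp(-step / beta_0)
    return nmt_loss + lam * conf_loss
```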

The gating mechanism is a concept from psychology that refers to the screening and filtering of input information in human memory and cognitive systems. The gating mechanism proposed in this paper filters out some unnecessary retrievals so as to reduce the number of retrievals and improve decoding efficiency. The confidence network is trained synchronously with the NMT model. During decoding, it generates a confidence estimate \(c_t\ (c_t\in [0,1])\) at each step to determine whether the current retrieval operation needs to be performed; the model prediction is output directly if \(c_t\) is larger than the set threshold. We set \(\lambda _{0} =30\), \(\beta _{0} =45000\) and the threshold \(c=0.9\) in the confidence network settings.
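A minimal sketch of the decode-time gate follows; here conf_net stands for the trained confidence network and retrieve_and_fuse for the memory retrieval and fusion step (both names are illustrative placeholders).

```python
def gated_decode_step(h_t, p_nmt, conf_net, retrieve_and_fuse, threshold=0.9):
    """Skip datastore retrieval when the model is already confident.

    conf_net          -- trained confidence network (Eq. 3)
    retrieve_and_fuse -- callable that queries the memory module and fuses
                         p_Mem with p_NMT (e.g., via the Monte Carlo scheme)
    threshold         -- gate threshold c (0.9 in the settings above)
    """
    c_t = float(conf_net(h_t))
    if c_t > threshold:
        # Confident: output the NMT prediction directly, no retrieval.
        return int(p_nmt.argmax())
    # Low confidence: consult the external memory module before predicting.
    return int(retrieve_and_fuse(h_t, p_nmt).argmax())
```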

4 Experiment

4.1 Datasets, Baselines and Configurations

This paper focuses on the Mongolian-Chinese translation task. The experimental corpora come from CCMT2019 and CCMT2022, which let us explore model performance in low-resource and medium-high-resource scenarios respectively. Table 1 shows the specific sizes of the two corpora. Following previous research and experimental verification on this translation task, we use ULM preprocessing for Mongolian and word segmentation\(+\)ULM for Chinese.

We compare our method against the traditional Transformer-base [14] and several classical or leading memory-augmented NMT baselines: MANN [2], TM-augmented [1], kNN-MT [7], Adaptive kNN-MT [19], and Fast kNN-MT [12]. Because different Chinese segmentation methods can cause huge differences in BLEU scores, we use SacreBLEU [13] to evaluate the results. We adopt the Adam optimizer [8] with 2000 warm-up steps. All the above baselines and our method are implemented with fairseq.
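As an illustration of the evaluation setup, corpus-level SacreBLEU can be computed with the sacrebleu Python package as below; the file names are placeholders, and the Chinese tokenizer option is one way to make the score independent of the system's own segmentation.

```python
import sacrebleu

# File names are illustrative placeholders.
with open("hypotheses.zh.txt", encoding="utf-8") as f:
    hypotheses = [line.strip() for line in f]
with open("references.zh.txt", encoding="utf-8") as f:
    references = [line.strip() for line in f]

# The "zh" tokenizer applies SacreBLEU's own Chinese tokenization.
bleu = sacrebleu.corpus_bleu(hypotheses, [references], tokenize="zh")
print(f"SacreBLEU: {bleu.score:.2f}")
```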

Table 1. The information table of experimental corpus.

4.2 Main Results

Table 2 shows the comparative results of our method and the different baselines. MANN [2] adds a memory module on top of an RNN, but its performance still lags far behind the Transformer. TM-augmented [1] uses a monolingual corpus to build the translation memory and augments the NMT model with a learnable cross-lingual memory retriever; it performs better on low-resource datasets because the large-scale monolingual corpus can compensate for the model's undertraining in low-resource scenarios. kNN-MT [7] constructs a token-level memory module to guide generation by retrieval during decoding, but the optimal choice of k differs across datastores, leading to poor robustness and generalizability. Adaptive kNN-MT [19] trains a meta-k network on artificially constructed features to generate the nearest-neighbor hyperparameter k, and performs well in various translation tasks. Fast kNN-MT [12] introduces hierarchical retrieval to improve decoding efficiency, but at some cost to performance. Our method uses Monte Carlo non-parametric dynamic fusion to further improve robustness and introduces a confidence-based gating mechanism to accelerate decoding, so it obtains consistent improvements in all scenarios.

Table 2. Comparison experiments of different memory enhancement models.

4.3 Ablation Study

To verify the effect of different components on model performance, this paper conducts ablation experiments based on the Transformer; the results are shown in Table 3. The memory module clearly plays a critical role: on the CCMT2019 low-resource Mongolian-Chinese task it yields an improvement of up to 5 BLEU, while in the CCMT2022 medium-high-resource scenario the average improvement is less than 2 BLEU. This indicates that the gain from the memory module depends on the model's own capability: the stronger the model, the smaller the additional benefit of the memory module. Since the CCMT2019 test set consists mostly of simple, short sentences while the validation set contains more long, difficult sentences, the improvement on the validation set is not as large as on the test set, which also reflects to some extent the effectiveness of our method in complex translation scenarios. Line 4 applies Monte Carlo non-parametric fusion to ordinary kNN-MT and also shows some improvement, indicating the effectiveness of this algorithm. Introducing the confidence network is likewise shown to benefit performance (Line 3), because it calibrates the model's confidence estimates during training and mitigates the confidence bias caused by exposure bias at test time. Moreover, its effect on translation quality accumulates when combined with Monte Carlo fusion (Line 5).

Table 3. Results of the ablation study. "\(\circ \)" means the method is used and "\(\times \)" means it is not. MM, MC and CE denote Memory Module, Monte Carlo and Confidence Estimation, respectively.

4.4 Effect of Memory Module Capacity and Threshold c

Fig. 2 shows the effect of memory module capacity. Translation quality improves as the external memory module grows, but for memory modules containing tens of millions of tokens, retrieval slows down as the capacity increases. The result also demonstrates that the model does not need to be retrained when new training data arrive: simply storing the data in the memory module improves translation performance. In addition, the external memory module significantly improves translation in low-resource scenarios, whereas in medium-high-resource scenarios the improvement clearly hits a bottleneck; beyond that point, the gain from enlarging the memory module is far outweighed by the negative impact of slower retrieval. Therefore, in medium-high-resource scenarios it is necessary to balance memory module capacity against translation speed.

Fig. 2.

The effect of memory module capacity on translation quality.

Fig. 3.

Effect of different thresholds on BLEU and total translation time.

To explore whether the model can improve its translation of unfamiliar data by modifying the memory module, we design an extreme-scenario test and use the model trained on the CCMT2019 dataset to translate the CCMT2022 test set. Table 4 shows that after adding the test set to the external memory module, the model's translation of this data improves significantly, which proves that the model can be updated by storing unfamiliar data in the external memory module. Translation quality also improves markedly after adding a large amount of training data to the memory module, indicating that the performance of "small" models can be improved by increasing the memory module capacity rather than retraining on a large amount of data. We further explore model performance and decoding time under different thresholds c on CCMT2022; the results are shown in Fig. 3. Model quality does not fluctuate much across threshold settings, but the total decoding time decreases as the threshold increases. This indicates that our method optimizes decoding efficiency without noticeably affecting translation quality.

Table 4. The effect of updating memory module on translation quality.

4.5 Decoding Efficiency Verification in Different Dimensions

This paper measures decoding efficiency along three dimensions on the CCMT2022 test set: total translation time, sentences translated per second, and tokens translated per second. Experimental results are shown in Fig. 4. When the number of retrieved nearest neighbors k is small, the decoding efficiency of the proposed method is about twice that of the original method. As k increases, the improvement becomes smaller and smaller. However, since the optimal k for the experimental model is less than 24, the decoding efficiency of this method is generally better than that of the traditional method. Moreover, the method does not affect model quality while improving decoding efficiency.

Fig. 4.

Decoding efficiency comparison chart.

4.6 Domain Adaptation and Robustness Analysis

To verify the effectiveness of our approach in domain adaptation, we follow Adaptive kNN-MT and conduct experiments on four German-English domains: IT (I), Medical (M), Koran (K) and Laws (L). The main results are shown in Table 5, where the hyperparameter k is 8, 4, 8 and 4 respectively; our approach obtains consistent improvement in all domains. In the IT\(\rightarrow \)Medical (I\(\rightarrow \)M) setting, we use the IT-domain hyperparameters and memory module to translate the medical test set. kNN-MT suffers drastic performance degradation because the retrieved "neighbors" are highly noisy. In contrast, Adaptive kNN-MT can filter out noise and therefore prevents degradation as much as possible. Our method improves further over Adaptive kNN-MT through Monte Carlo non-parametric fusion and the gating mechanism.

Table 5. Domain adaptation experiments and robustness evaluation of our method.

5 Conclusion

In this paper, we propose a non-parametric method based on Monte Carlo to dynamically integrate the memory module's prediction with the NMT prediction, which improves model performance and robustness in various scenarios. In view of the slow retrieval speed of kNN-MT, we further propose a gating mechanism based on confidence estimation to filter out unnecessary retrieval operations and thereby improve decoding efficiency. Our method is effective in low-resource scenarios but shows diminishing returns in high-resource scenarios; future work will therefore further study its optimization and extension to high-resource tasks.