
1 Introduction

Over the past few years, with the development of deep learning, neural machine translation (NMT) has come a long way. To further improve translation accuracy, a growing body of work represents the training data as external knowledge rather than storing it implicitly in model parameters, an approach known as the non-parametric method. Because it relies on retrieval to obtain this external knowledge, it is also called a retrieval-based model. Representative methods include the nearest neighbor language model (kNN-LM) [1], which first introduced kNN retrieval into language modeling and achieved substantial gains; k-nearest-neighbor machine translation (kNN-MT) [2], which extends kNN-LM to translation and delivers a qualitative leap over conventional methods in bilingual translation, multilingual translation, and especially domain adaptation; Adaptive kNN-MT [3], which builds on kNN-MT by training a meta-k network on hand-crafted features to predict the number of nearest neighbors k instead of specifying it manually; and Fast kNN-MT [4], which introduces hierarchical retrieval to improve retrieval efficiency and thereby alleviate the slow translation speed of kNN-MT.

kNN-MT builds an external memory module on top of an ordinary NMT model, storing the context representation of each sentence together with the corresponding target word. When translating the current word, kNN-MT retrieves contexts similar to the current one from the memory module and uses the target words associated with those similar contexts as reference and guidance from the translation memory. This retrieved knowledge is then fused with the NMT prediction to produce the final result.

Although kNN-MT has demonstrated strong performance in high-resource languages and in domain adaptation, two problems remain. On the one hand, kNN-MT has not been studied in low-resource scenarios because of its particular reliance on the representational power of pre-trained translation models and on the quality of similar-sentence retrieval. On the other hand, when fusing the NMT model with the external memory module, the fusion ratio is controlled by a single hyperparameter \(\lambda \), which determines how much information the NMT model obtains from the external memory module. This causes problems: owing to the long-tail effect of the dataset, some sentences have many similar sentences in the datastore while others have few. Using the same fusion ratio for all data means that some sentences do not acquire enough information while others introduce noise. We illustrate this with a concrete example in Fig. 1. Moreover, experiments show that the translation results are very sensitive to the choice of \(\lambda \), which harms the robustness of the model.

Fig. 1. Example of failure of probability interpolation between \(p_{NMT}\) and \(p_{Mem}\) while translating DE-EN.

To solve this problem, we propose a dynamic fusion method based on Dempster-Shafer theory, which abandons the fixed linear-interpolation fusion and produces different fusion results for different retrieval and translation probabilities. It alleviates the problem in which the retrieval probability is confident and well-predicted, yet the final translation is biased because the fusion ratio is too low, and vice versa. Moreover, our method improves the robustness of the model in complex translation scenarios. More importantly, we explore the application of kNN-MT in low-resource translation scenarios for the first time, demonstrating the effectiveness of non-parametric methods in low-resource settings. We validate our method on multi-domain datasets, including IT, Medical, Koran, and Law, as well as on the CCMT’19 Mongolian-Chinese low-resource dataset. Our method obtains an improvement of 0.41-1.89 BLEU, and the robustness of the model is improved.

2 Background

The main approach of kNN-MT involves building a memory module and fusing external knowledge with the predictions of the NMT model. For the memory module, unlike [5] and [6], which construct sentence-level and fragment-level datastores, kNN-MT constructs a token-level datastore. Its advantages are better retrieval and higher matching accuracy, but the size of the memory module equals the total number of target-language tokens, which leads to low retrieval efficiency. For construction, kNN-MT adopts an offline approach, so a pre-trained model with strong knowledge representation capability is required. The memory module stores key-value pairs of a context vector and a target token, and it is built by feeding the training data into the model in a single forward pass. Given a bilingual corpus \(\left( x,y \right) \in \left( \mathcal {X} ,\mathcal {Y} \right) \), the decoder generates \(y_{t}\) based on the source sentence x and the previously generated words \(y_{<t}\). Let the hidden state of the pre-trained model be \(f\left( x,y_{<t} \right) \); the key of the datastore is \(f\left( x,y_{<t} \right) \) and the value is \(y_{t}\). The construction process is:

$$\begin{aligned} \left( \mathcal {K} ,\mathcal {V} \right) =\left\{ \left( f\left( x,y_{<t} \right) ,y_{t}\right) ,\forall y_{t}\in y \mid \left( x,y \right) \in \left( \mathcal {X} ,\mathcal {Y} \right) \right\} \end{aligned}$$
(1)
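As a concrete illustration, the sketch below shows how such a token-level datastore could be assembled with the faiss library; the `encode_step` helper, which returns the decoder hidden state for every target position, is a hypothetical stand-in for the actual forward pass of the pre-trained model.

```python
import numpy as np
import faiss  # pip install faiss-cpu

def build_datastore(model, corpus, dim):
    """Collect (decoder hidden state, target token) pairs for every
    target position in the training corpus, as in Eq. (1)."""
    keys, values = [], []
    for src, tgt in corpus:  # bilingual pairs (x, y)
        # hypothetical helper: one forward pass returning f(x, y_<t)
        # for every target position t, shape (len(tgt), dim)
        hidden_states = model.encode_step(src, tgt)
        for t, token_id in enumerate(tgt):
            keys.append(hidden_states[t])
            values.append(token_id)
    keys = np.stack(keys).astype("float32")
    index = faiss.IndexFlatL2(dim)  # exact L2 search over the keys
    index.add(keys)
    return index, np.array(values)
```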

Once the memory module is constructed, similar contexts can be retrieved during decoding, and the tokens corresponding to these contexts are used to derive a retrieval probability, i.e., the probability \(p_{Mem}\) given by the memory module based on historical data.

$$\begin{aligned} {p}_{Mem}\left( y_{i} \mid x,\hat{y}_{1:i-1} \right) \propto \sum _{\left( k_{j}, v_{j} \right) \in \mathcal {N} } \mathbbm {1}_{y_{i} = v_{j} } \exp \left( \frac{-d\left( k_{j} , f\left( x, \hat{y}_{1:i-1} \right) \right) }{T} \right) \end{aligned}$$
(2)
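A minimal sketch of this retrieval step, reusing the index and value array built above; the neighbor count k and the temperature T below are assumed hyperparameters, not values prescribed by the paper:

```python
import numpy as np

def retrieval_probability(index, values, query, vocab_size, k=8, temperature=10.0):
    """Turn the k nearest neighbors of a decoder hidden state into the
    distribution p_Mem of Eq. (2) via a distance-based softmax."""
    query = np.asarray(query, dtype="float32").reshape(1, -1)
    distances, neighbor_ids = index.search(query, k)  # squared L2 distances
    scores = np.exp(-distances[0] / temperature)
    p_mem = np.zeros(vocab_size)
    for score, nid in zip(scores, neighbor_ids[0]):
        p_mem[values[nid]] += score  # aggregate scores per target token
    return p_mem / p_mem.sum()       # normalize to a probability distribution
```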

The retrieval probability represents external knowledge guidance, and kNN-MT fuses this external knowledge with the model knowledge by simple linear interpolation to obtain the final probability distribution.

$$\begin{aligned} {p} \left( y_{t} \mid x, \hat{y}_{1:t-1} \right) =\lambda {p} _{NMT} \left( y_{t} \mid x, \hat{y}_{1:t-1} \right) +\left( 1-\lambda \right) {p} _{Mem} \left( y_{t} \mid x, \hat{y}_{1:t-1} \right) \end{aligned}$$
(3)
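Expressed in code, the interpolation of Eq. (3) is a single weighted sum applied with the same ratio at every decoding step; \(\lambda = 0.7\) below is only an illustrative value:

```python
def interpolate(p_nmt, p_mem, lam=0.7):
    """Fixed linear interpolation of Eq. (3): the same lambda is used
    for every sentence, which is exactly what our method replaces."""
    return lam * p_nmt + (1.0 - lam) * p_mem
```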

3 Method

In this section we introduce our proposed method, which is applied mainly at the inference stage of the model. We discard linear interpolation and use Dempster-Shafer theory (DST) to fuse \(p_{NMT}\) and \(p_{Mem}\); the overall procedure is shown in Fig. 2. In practice, \(p_{Mem}\) assigns probability only to the few words associated with the retrieved neighbors, while all other words receive probability 0. This makes the distribution of \(p_{Mem}\) very hard, and the many zero probabilities would strongly distort the DST result, so we apply label smoothing to \(p_{Mem}\) to make its distribution smoother.

Fig. 2. Schematic diagram of our approach. The retrieval process takes place at the decoder, where similar sentences are retrieved from the memory module based on the context vector. The retrieval probability is obtained by normalizing over the retrieved target tokens and is then dynamically fused with the translation probability using the DST algorithm.

3.1 Dempster-Shafer Theory

Dempster-Shafer theory [7] is a generalization of probability theory and a highly effective method for data fusion. DST extends the basic event space of probability theory to the power set of basic elements, replacing the single probability value of a basic element with a probability range. Based on the mathematical theory proposed by Dempster and Shafer, it is a more general formulation of Bayesian theory and provides a framework for representing incomplete knowledge and updating degrees of belief. If a set is defined as \(\varTheta =\left\{ \theta _{1},\theta _{2},...,\theta _{N} \right\} \) and all elements in the set are independent and mutually exclusive, \(\varTheta \) is called the frame of discernment. Under this premise, the DST combination rule is given as follows.

Let \(m_{1}\) and \(m_{2}\) be two basic probability assignment (BPA) functions on the same frame of discernment, with focal elements \(A_{i} \left( i=1,2,...,k \right) \) and \(B_{j} \left( j=1,2,...,l \right) \) respectively, and let the combined BPA function be denoted by m. The DST combination rule can then be expressed in the following form:

$$\begin{aligned} m(A) = m_{1}(A)\oplus m_{2}(A) = \left\{ \begin{matrix} 0, & A = \phi \\ \frac{1}{1-K} \sum \limits _{A_{i}\cap B_{j} = A} m_{1}(A_{i})m_{2}(B_{j}), & A \ne \phi \end{matrix}\right. \end{aligned}$$
(4)

where \(K = \sum \nolimits _{A_{i}\cap B_{j} = \phi } m_{1}(A_{i})m_{2}(B_{j})\) measures the degree of conflict between the two bodies of evidence. Dempster-Shafer theory has been widely used to deal with uncertain or imprecise problems, because its basic probability assignment framework can integrate different algorithms to improve the reliability of the results. In this paper, we use evidence theory to fuse \(p_{NMT}\) and \(p_{Mem}\), where \(m_{1}\) in Eq. 4 is \(p_{NMT}\) and \(m_{2}\) is \(p_{Mem}\).
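When both BPAs place their mass only on singleton hypotheses, i.e., on individual vocabulary tokens as in our case, the combination rule reduces to an element-wise product followed by renormalization. The sketch below implements this special case under that assumption, with `m1` playing the role of \(p_{NMT}\) and `m2` the role of the (label-smoothed) \(p_{Mem}\):

```python
def dst_combine(m1, m2, eps=1e-12):
    """Dempster's rule of combination (Eq. 4) for two BPAs over singleton
    focal elements, e.g. two probability vectors over the vocabulary.
    For singletons, A_i ∩ B_j = A only when both focal elements equal A,
    so the unnormalized combined mass of token A is m1[A] * m2[A]."""
    joint = m1 * m2                        # non-conflicting intersections
    # the conflict K is the mass falling on the empty set: K = 1 - joint.sum()
    return joint / max(joint.sum(), eps)   # renormalize by 1 - K
```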

3.2 Label Smoothing

Label smoothing [8] is a widely used regularization technique in machine translation. It penalizes the high confidence of the hard target by introducing noise into the label, turning the hard target into a soft target. The idea is simple: the token corresponding to the ground truth should not have exclusive access to all of the probability mass; other tokens should also have a chance to serve as the ground truth. In parameter estimation for complex models, it is often necessary to assign some probability to unseen or low-frequency events to ensure better generalization. Concretely, label smoothing uses an additional distribution q that is uniform over the vocabulary V, i.e., \(q_{k} =\frac{1}{\left| V \right| } \), where \(q_{k}\) denotes the kth dimension of the distribution. The final target distribution is then redefined as a linear interpolation of \(y_{j}\) and q:

$$\begin{aligned} y_{j}^{ls}=(1-\alpha ) \cdot y_{j} +\alpha \cdot q \end{aligned}$$
(5)

Here, \(\alpha \) is a coefficient that controls the weight of the distribution q, and \(y_{j}^{ls}\) denotes the learning target after label smoothing. A schematic illustration is shown in Fig. 3.
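Since we later apply the same interpolation to the sparse retrieval distribution rather than to a training label, a minimal sketch of Eq. (5) over a probability vector may be helpful; \(\alpha = 0.1\) is only an illustrative value:

```python
def label_smooth(p, alpha=0.1):
    """Eq. (5) applied to a probability vector p (a NumPy array): mix with
    the uniform distribution so that zero entries get a small positive mass."""
    vocab_size = p.shape[0]
    return (1.0 - alpha) * p + alpha / vocab_size
```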

Fig. 3. Targets with label smoothing when \(\alpha =0.1\).

Label smoothing can also be viewed as an adjustment of the loss function that introduces additional prior knowledge (i.e., the part related to q). However, this prior knowledge is not fused with the original loss function by means of linear interpolation.

The generation of the final probability can be summarized by the following procedure, where LS denotes label smoothing, DST denotes Dempster-Shafer theory, \(p_{Mem}\) denotes the retrieval probability obtained from the memory module, and \(p_{NMT}\) denotes the translation probability of the NMT model.

$$\begin{aligned} p\left( y_{t} \mid y_{<t} \right) =DST\left( p_{NMT},LS\left( p_{Mem} \right) \right) \end{aligned}$$
(6)
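Putting the pieces together, one decoding step of the fusion in Eq. (6) could look as follows; `p_nmt` and `p_mem` are the two distributions over the vocabulary, and `label_smooth` and `dst_combine` refer to the hedged sketches given in the previous subsections:

```python
def fuse_step(p_nmt, p_mem, alpha=0.1):
    """One decoding step of our method (Eq. 6): smooth the sparse retrieval
    distribution, then combine it with the NMT distribution by Dempster's
    rule instead of a fixed linear interpolation."""
    p_mem_smoothed = label_smooth(p_mem, alpha=alpha)
    return dst_combine(p_nmt, p_mem_smoothed)

# the next token can then be chosen from the fused distribution,
# e.g. greedily: y_t = int(fuse_step(p_nmt, p_mem).argmax())
```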

4 Experiment

We validate the effectiveness of our method in two translation scenarios: (1) domain adaptation; (2) Mongolian-Chinese low-resource translation.

4.1 Experimental Setup

Data. We use the following datasets for training and evaluation:

MULTI-DOMAINS: We use the multi-domains dataset [9], re-split by [10], for the domain adaptation experiments. It includes German-English parallel data with train/valid/test splits in four domains: Medical, Law, IT, and Koran. Sentence statistics for the MULTI-DOMAINS datasets are given in Table 1.

Table 1. Statistics of the datasets in different domains.

Low-resource: We use the CCMT’19 Mongolian-Chinese dataset to evaluate the performance of our method in low-resource scenarios. The bilingual parallel corpus covers a broad range of domains, including daily conversations, government documents, government work reports, and laws and regulations. Sentence statistics for the Mongolian-Chinese dataset are given in Table 2.

Table 2. Statistics of the Mongolian-Chinese dataset.

Models. For the domain adaptation experiments, we use the WMT’19 German-English news translation task winner [11], available via the FAIRSEQ library [12]. It is a Transformer encoder-decoder model [13] with 6 layers, 1,024-dimensional representations, 8,192-dimensional feedforward layers, and 8 attention heads. In addition to the WMT’19 training data, this model is trained on over 10 billion tokens of back-translation data and fine-tuned on newstest test sets from years prior to 2018.

For low-resource translation, we train a Transformer-based Mongolian-Chinese translation model. The corpus is segmented into subwords using subword-nmt [14]. We use the Adam optimizer [15] with 10,000 warmup steps, train for 30 epochs, and apply early stopping. Other settings are kept the same as transformer-base.

Our experiments use the fairseq sequence modeling toolkit to train NMT models and the faiss toolkit [16] to construct the external memory module and perform high-speed retrieval. We implement our approach on top of the open-source adaptive-knn-mt code base, which implements the original kNN-MT on fairseq and has a clean code structure.

4.2 Result and Analysis

For the domain adaptation task, the main results are shown in Table 3. Our method obtains consistent improvements in all four domains, with BLEU scores improved by 1.89, 0.51, 0.48, and 0.55 over kNN-MT. The smallest improvement is in the Koran domain and the largest in the IT domain.

Table 3. BLEU scores of the base NMT model, kNN-MT, and our method on the domain adaptation experiments, with hyperparameter k set to 8, 4, 8, and 4, respectively. The linear interpolation ratios \(\lambda \) for kNN-MT are 0.7, 0.8, 0.7, and 0.7.

For the low-resource task, the experimental results are shown in Table 4. kNN-MT also yields a large improvement in translation quality in the low-resource setting, and our method further improves over kNN-MT.

Table 4. BLEU scores of the base NMT model, kNN-MT, and our method on the Mongolian-Chinese low-resource experiments with hyperparameter k = 4.

Analysis. Compared with kNN-MT, our method is more flexible in the probability fusion stage, which is reflected in the consistent BLEU improvements. The largest improvement in the domain adaptation experiments is in the IT domain; by analyzing the translation results we speculate that this is due to the presence of more low-frequency special nouns in the IT domain. kNN-MT introduces noise during retrieval, while our method handles the translation of such low-frequency words better.

In the low-resource scenario, the Mongolian-Chinese test sets consist mostly of simple short sentences, while the valid sets contain more long and difficult sentences. Therefore, the improvement of our method on the test sets is not as large as that on the valid sets, which also reflects, to some extent, the effectiveness of our method in complex translation scenarios. Since DST produces different results for different probability distributions and exposes more information once label smoothing is applied to \(p_{Mem}\), it increases the generalization ability and robustness of the model.

4.3 Robustness

Fig. 4. Robustness experiments of kNN-MT and our method at different hyperparameters k.

To verify the robustness of our method, we test translation accuracy under different hyperparameters k. The experimental results are shown in Fig. 4. We find that the BLEU scores of kNN-MT fluctuate more when k is not optimal, indicating that kNN-MT is more sensitive to the noise introduced by k. The performance of our method is also affected as k increases, but it fluctuates less, indicating that relatively good performance can be maintained at different noise intensities.

We also evaluate the robustness of our method in a domain-mismatch setting, where a user inputs an out-of-domain sentence (e.g., from the Medical domain) to a domain-specific translation system (e.g., for the IT domain). Specifically, in the IT\(\rightarrow \)Medical setting, we use the hyperparameters and datastore of the IT domain and evaluate the model on the Medical test set. As shown in Table 5, the retrieved results are highly noisy, so kNN-MT suffers a drastic performance degradation. In contrast, our method can filter out some of the noise and therefore limits the degradation as much as possible.

Table 5. Robustness evaluation, where the test sets are from the Medical/IT domains and the datastores are from the IT/Medical domains, respectively.

4.4 Case Study

Fig. 5. Translation examples of different systems in the IT domain and Mongolian-Chinese.

As shown in Fig. 5, we present translation examples from the IT domain and from Mongolian-Chinese. We observe that kNN-MT produces mistranslations in some cases, whereas our method generates translations that are more faithful and fluent. Moreover, our method alleviates the \(\left\langle unk \right\rangle \) problem to a certain extent. In the Mongolian-Chinese example, neither the base NMT model nor kNN-MT can translate correctly when the corpus contains \(\left\langle unk \right\rangle \), which also shows that our method is more robust and has higher error tolerance.

5 Conclusion

In this paper we propose dynamic fusion for kNN-MT, using Dempster-Shafer theory instead of fixed linear interpolation to dynamically fuse the two probability distributions from the NMT model and the memory module. Through experiments on domain adaptation, we verify that our method improves over kNN-MT and is more robust. In addition, we explore the possibility of applying kNN-MT in low-resource scenarios for the first time. In the future, we will further explore the application of non-parametric methods in low-resource scenarios.