
1 Introduction

Translation Memory (TM) is conceptually regarded as a database of sentence pairs (source and target texts), which is utilized to reuse previously translated content when working on new texts. Recent works have focused on memory augmentation to improve the performance of neural machine translation (TM-augmented NMT) [12].

Technically, a typical TM-augmented NMT model performs translation in two phases, as shown in Fig. 1: i) the retrieval stage extracts candidate sentence memories from the training corpus based on a similarity measure; and ii) the generation stage integrates the candidate sentences into the translation model to produce the translation. Recent research has consequently focused on jointly learning the models of the two phases (retriever and translation models), with remarkable results [4].

Fig. 1. An example of neural machine translation using translation memory.

Despite the success of recent TM-augmented NMT models, two research issues still need to be taken into account: i) the retrieval stage mainly uses a greedy method to extract the top-r nearest sentence pairs, which yields redundant information because the top-r sentence pairs are highly similar to each other [4]; ii) most previous works use a TM of sentence pairs (source-target pairs), which cannot take advantage of abundant monolingual data [9]. In this regard, this study proposes a new method for TM-based NMT that deals with both issues. Specifically, for the retrieval phase, we adopt Maximal Marginal Relevance (MMR) [7] to enable diversity, balancing the two most important properties of the candidates: informativeness, measured by the distance between the query and the candidates, and diversity, measured by the distances among the candidates themselves. For the monolingual TM, following the work in [4], a simple dual-encoder framework is adopted to select the most relevant sentences. The main contributions of this study are twofold:

  • We present a novel end-to-end model for TM-augmented NMT that addresses two emergent issues of TM-augmented NMT: balancing relevance and diversity in the retrieval phase and using monolingual data. To the best of our knowledge, the proposed method is the first study that integrates these two issues in a unified framework.

  • We evaluate our approach on IWSLT15, a benchmark English-Vietnamese dataset [8], to demonstrate the effectiveness of the model in low-resource language pair scenarios. The reported results show that our model outperforms strong baseline models in this research field.

The rest of the paper is organized as follows: Sect. 2 presents a brief review of previous work related to this study. The proposed method is presented in Sect. 3. We report and analyze the experimental results in Sect. 4. Section 5 concludes and discusses this study.

2 Related Work

Recent work jointly trains the retrieval model and the translation model with monolingual TM and achieves impressive results [4]. The method proposed in this study is orthogonal to recent works in this research line. Specifically, we present an end-to-end monolingual TM-augmented NMT model whose retrieval stage focuses on extracting sentences that are both relevant and diverse. In this section, we briefly review the aforementioned techniques for improving the performance of the TM-augmented NMT approach.

2.1 Neural Machine Translation for Low-Resource Languages

In recent years, Neural Machine Translation (NMT) [19] has emerged as a state-of-the-art approach to machine translation, gaining widespread popularity. In particular, the Transformer architecture [25] has revolutionized the field of NMT by achieving success on multiple language pairs. However, supervised NMT requires large datasets, which are often limited for low-resource languages. To address this issue, several data augmentation techniques have emerged, including back-translation [23] and self-training [15]. Additionally, transfer learning techniques [20, 29] show promise in leveraging pre-trained models for improved performance. In cases where parallel data is not available, unsupervised NMT [2], pivot-based [10] or multi-NMT-based solutions [11] can be employed. Recent studies focus on using TM with monolingual data, as an emergent technique, to improve the translation quality of NMT.

2.2 Translation Memory-Augmented Neural Machine Translation

Augmenting NMT with TM has become an emerging research topic. There are primarily two approaches for incorporating translation memory (TM) into neural machine translation (NMT): constraining the decoding process with TM and using TM to train a more powerful NMT model.

The main idea of the first research line is to increase the generation probability of certain target words based on the TM. Zhang et al. [28] increased the generation probability of target words aligned with the TM. In [13], a bilingual dictionary is used as auxiliary information to tackle infrequent word translation. Khandelwal et al. [17] introduced kNN-MT, which retrieves TMs from dense vectors by creating a key-value datastore and interpolating the generation probability of the NMT model with similar target distributions from the datastore at each time step.

The second research line trains the translation model to learn how to deal with the retrieved TMs. Pham et al. [22] used a data augmentation approach that concatenates the retrieved TMs with the input sentences during training. Several studies have explored modifying the architecture of the NMT model to improve integration with TM. Cao et al. [6] introduced a gating mechanism to control the signal from the retrieved TM, and following this, in [5] an additional Transformer encoder is designed to incorporate the target sentence of the TM through attention. In the work of Xia et al. [26], multiple retrieved TMs are compressed into a graph structure to improve efficiency and space usage and are then integrated into the model through attention.

2.3 Retrieval for Translation Memory-Augmented Neural Machine Translation

Previous works focus on TM with bilingual sentence pairs [26, 27], using fuzzy matching to retrieve the sentences most similar to a query from the corpus. With monolingual TM, the retrieval task is more challenging due to the cross-lingual setting. To address this challenge, Cai et al. [4] proposed a simple dual-encoder framework pre-trained on two tasks: sentence-level cross-alignment and token-level cross-alignment.

Regarding diversity of the retrieval results, the authors in [9] have shown that diverse translation memories can improve the performance of NMT, making it important to ensure diversity in the retrieval stage. There are several methods to enable diversity, including MMR [7], IA-Select [1] and MaxSum Diversification [3]. We employ MMR in this study due to its straightforward implementation and, more importantly, its ease of interpretation.

3 Methodology

3.1 System Overview

Figure 2 depicts the overall structure of the proposed method. The main contribution of this study lies in the retrieval stage, which selects the most relevant and diverse sentences from a large monolingual TM in the target language. Specifically, given an input sentence x in the source language and a large monolingual TM \(M = \{m_1,m_2,..,m_n\}\), the output of the retrieval stage is a subset of M (the top k sentences) and their relevance scores \( \{f(x,m_i)\}_{i=1}^{k} \). The translation model then conditions on the input x, the retrieved set, and the scores to generate the output y.

Fig. 2. Overall structure of the proposed method.

3.2 Retrieval Model

3.2.1 Relevant Monolingual TM: The input sentence x (source sentence) and the monolingual TM M in the target language are encoded by two independent Transformers [25], formulated as follows:

$$\begin{aligned} \begin{aligned} z_x = W_1Trans_x(<bos>,x^1, x^2,..x^{|x|}) \\ z_{m_i} = W_2Trans_m(<bos>,m^1_i, m^2_i,..m^{|m_i|}_i) \end{aligned} \end{aligned}$$
(1)

where \(m_i \in M\) denotes the memory target sentence. \(W_1\) and \(W_2\) are learning parameters. In this regard, the relevance score \(f(x, m_i)\) between the source sentence and the candidate sentence can be calculated using the dot product:

$$\begin{aligned} f(x,m_i) = z_x^T z_{m_i} \end{aligned}$$
(2)

Subsequently, the top r relevant sentences are extracted using Maximum Inner Product Search (MIPS).
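
To make this step concrete, the following is a minimal sketch of the dot-product scoring in Eq. (2) and a brute-force top-r search, assuming PyTorch and that the dense representations from Eq. (1) have already been computed; the function names and the exhaustive search (shown in place of an approximate MIPS index) are illustrative choices, not the authors' implementation.

```python
import torch

def relevance_scores(z_x, z_m):
    """Dot-product relevance f(x, m_i) = z_x^T z_{m_i} (Eq. 2).

    z_x : (d,)   dense representation of the source sentence x
    z_m : (n, d) dense representations of the n monolingual TM sentences
    Returns a vector of n relevance scores.
    """
    return z_m @ z_x                                  # (n,)

def retrieve_top_r(z_x, z_m, r=100):
    """Top-r maximum inner product search over the TM representations.

    In practice this exhaustive search would be replaced by an
    approximate MIPS index (e.g. FAISS); the brute-force version is
    shown here only for clarity.
    """
    scores = relevance_scores(z_x, z_m)               # f(x, m_i) for every m_i in M
    top_scores, top_idx = torch.topk(scores, k=r)
    return top_idx.tolist(), top_scores.tolist()
```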

3.2.2 Diversity-Enabled TM: After obtaining the relevant sentences \(R= \{m_1,..,m_r\}\), a subset of size k is selected from R to increase diversity using MMR [7]. The MMR function can be formulated as follows:

$$\begin{aligned} MMR(x,R,S) = \underset{m_i \in R \setminus S}{argmax}\big [ \lambda \cdot cosine(x,m_i) - (1-\lambda ) \cdot \underset{m_j \in S}{max}\, cosine(m_i,m_j) \big ] \end{aligned}$$
(3)

where S is the current set of chosen candidates and \(R \setminus S\) is the set of unselected sentences. The hyperparameter \(\lambda \), which takes values in the range [0, 1], trades off accuracy against diversity: a high value of \(\lambda \) corresponds to high accuracy, whereas a low value corresponds to high diversity.

Algorithm 1. Diversity-enabled TM selection with MMR.

The diversity-enabled TM selection process is described in Algorithm 1. The output of this process is a set of translation memories \(S = \{m_1,..,m_k\}\) and their retrieval scores \(f(x,S) = \{f(x, m_1),..,f(x,m_k)\} \).
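
As a sketch of the greedy MMR selection in Eq. (3), the snippet below builds the set S from the top-r candidates R; it assumes NumPy, pre-computed dense vectors for the query and candidates, and cosine similarity as written in Eq. (3). Function and variable names are hypothetical, not the authors' code.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def mmr_select(z_x, candidates, k=3, lam=0.5):
    """Greedily select k diverse TM sentences from the top-r set R (Eq. 3).

    z_x        : dense vector of the source sentence x
    candidates : list of (index, vector) pairs for the sentences in R
    lam        : trade-off between relevance (lam -> 1) and diversity (lam -> 0)
    """
    selected = []                                    # the set S
    remaining = list(candidates)                     # R \ S
    while remaining and len(selected) < k:
        best, best_score = None, -np.inf
        for idx, z_m in remaining:
            relevance = cosine(z_x, z_m)
            redundancy = max((cosine(z_m, z_s) for _, z_s in selected), default=0.0)
            score = lam * relevance - (1 - lam) * redundancy
            if score > best_score:
                best, best_score = (idx, z_m), score
        selected.append(best)
        remaining.remove(best)
    return [idx for idx, _ in selected]
```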

3.3 Translation Model

For the translation stage, we follow the work in [4] for the end-to-end model, which is built on the standard encoder-decoder NMT model [25]. Specifically, given the source sentence x, the set of retrieved TM sentences \(S = \{{m_i}\}_{i=1}^{k}\), and their scores \(\{f(x,m_i)\}_{i=1}^{k}\) from the previous step, the objective of the translation model is to define the conditional probability:

$$\begin{aligned} p(y|x, m_1, f(x, m_1), . . . , m_k, f(x, m_k)) \end{aligned}$$
(4)

To incorporate the TM information through the contextualized token embeddings \(\{z_{m_i,j}\}_{j=1}^{|m_i|}\) \((1 \le i \le k)\), cross attention is computed as follows:

$$\begin{aligned} \alpha _{ij} = \frac{exp(h_t^TW_3z_{m_i,j} + \beta f(x,m_i))}{\sum _{i'=1}^{k}\sum _{l=1}^{L_{i'}}exp(h_t^TW_3z_{m_{i'},l} + \beta f(x,m_{i'}))} \end{aligned}$$
(5)

Here, \(\alpha _{ij}\) and \(L_i\) denote the attention score of the j-th token in \(z_{m_i}\) and the length of the sentence \(m_i\), respectively. \(W_3\) represents a learning parameter. \(h_t\) is the decoder’s hidden state at time step t. The weighted sum of memory information is then computed as follows:

$$\begin{aligned} c_t = W_4\sum _{i=1}^{k}\sum _{j=1}^{L_i}\alpha _{i,j}z_{m_i,j} \end{aligned}$$
(6)

where \(W_4\) denotes the learning parameter. Following this, \(h_t\) is updated with \(c_t\), i.e., \(h_t = h_t + c_t\). In this regard, the next-token probabilities can be computed as follows:

$$\begin{aligned} p(y_t|x, m_1, f(x, m_1), . . . , m_k, f(x, m_k)) = (1 - \lambda _t)P_v(y_t) + \lambda _t\sum _{i=1}^{k}\sum _{j=1}^{L_i}\alpha _{i,j}\mathbbm {1}_{z_{m_i,j} = y_t} \end{aligned}$$
(7)

where \(\lambda _t = g(h_t, c_t)\) is computed by a feed-forward network g, \(\mathbbm {1}\) is the indicator function, and the next-token probabilities \(P_v\) are obtained by applying a linear projection to the hidden state \(h_t\) followed by the softmax function, which can be formulated as follows:

$$\begin{aligned} P_v =softmax(W_vh_t + b_v) \end{aligned}$$
(8)
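
The following single-step sketch illustrates how Eqs. (5)-(8) can be combined at decoding time, assuming PyTorch; the parameter shapes, the flattening of the k memory sentences into one token sequence, and the sigmoid applied to the gate output \(g(h_t, c_t)\) are implementation assumptions rather than details given in the paper.

```python
import torch
import torch.nn.functional as F

def memory_augmented_probs(h_t, z_mem, mem_token_ids, f_scores,
                           W3, W4, W_v, b_v, gate, beta=1.0):
    """Next-token distribution with TM copy interpolation (Eqs. 5-8).

    h_t           : (d,)   decoder hidden state at time step t
    z_mem         : (T, d) contextualized embeddings of all TM tokens (k sentences flattened)
    mem_token_ids : (T,)   vocabulary id of each TM token (LongTensor)
    f_scores      : (T,)   retrieval score f(x, m_i), repeated for every token of m_i
    gate          : feed-forward network g producing the scalar gate from [h_t; c_t]
    """
    # Eq. 5: attention over all TM tokens, biased by the retrieval scores
    q = W3.t() @ h_t                                     # so that logits = h_t^T W_3 z
    logits = z_mem @ q + beta * f_scores                 # (T,)
    alpha = F.softmax(logits, dim=0)

    # Eq. 6: weighted sum of memory information, then update the hidden state
    c_t = W4 @ (alpha.unsqueeze(0) @ z_mem).squeeze(0)   # (d,)
    h_t = h_t + c_t

    # Eq. 8: vocabulary distribution from the updated hidden state
    p_v = F.softmax(W_v @ h_t + b_v, dim=0)              # (|V|,)

    # Eq. 7: interpolate the generation and copy distributions
    lam_t = torch.sigmoid(gate(torch.cat([h_t, c_t]))).squeeze()
    p_copy = torch.zeros_like(p_v).index_add_(0, mem_token_ids, alpha)
    return (1 - lam_t) * p_v + lam_t * p_copy
```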

4 Experiment

4.1 Experiment Setup

4.1.1 Dataset and Evaluation: We use English-Vietnamese as the evaluation dataset of this study (publicly available in the MT track of the IWSLT 2015 corpus [8]). This dataset comprises a collection of parallel sentences in the spoken language domain. Detailed data statistics are shown in Table 1. In all experiments, the target-language side of the training set is used as the monolingual translation memory M. Different bilingual datasets are then generated for later experiments by randomly selecting \(60 \%\), \(80 \%\), and \(100\%\) of the training data, referred to as the D60, D80, and D100 datasets, respectively. For evaluation, we use the BLEU score [21].

Table 1. Statistics of the evaluated dataset.

4.1.2 Baseline Models: We compare the proposed model with the following baselines:

  • NMT wo TM: the original NMT model without TM [25].

  • NMT + TM-BM25: source-side similarity search based on BM25, which is used in many recent TM-augmented NMT models [14, 26].

  • NMT + Monolingual TM: joint training of the retrieval and translation models using a dual-encoder architecture [4].

4.1.3 Implementation Details: Our model utilizes Transformer blocks with the same setup as Transformer Base [25], which includes 8 attention heads, a hidden state with 512 dimensions, and a feed-forward state with 2048 dimensions. We employ 3 Transformer blocks for the retrieval model, 4 blocks for the memory encoder in the translation model, and 6 blocks for the encoder-decoder architecture in the translation model. We set the MMR trade-off hyperparameter to \(\lambda = 0.5\). FAISS [16] is used for indexing the dense representations. The learning rate schedule, dropout, and label smoothing follow the default settings in [25]. We use the Adam optimizer [18] and train models for up to 30K steps in all experiments. Each batch contains 4096 tokens. A BPE [24] tokenizer is employed with a vocabulary size of 4000. To execute the BM25-based method, we use a BM25 search engine (see footnote 1) to obtain a preliminary set of TM sentences.
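
As an illustration of how the dense TM representations can be indexed for maximum inner product search, the snippet below uses FAISS with an exact inner-product index; the dimensionality, the flat index type, and the random placeholder vectors are assumptions for the example, since the paper does not specify the index configuration.

```python
import faiss
import numpy as np

# Suppose z_mem holds the dense representations of all monolingual TM
# sentences produced by the memory encoder (Eq. 1), shape (n, d).
d = 512
z_mem = np.random.randn(10000, d).astype("float32")   # placeholder TM vectors

index = faiss.IndexFlatIP(d)      # exact maximum inner product search
index.add(z_mem)                  # index the TM representations

z_x = np.random.randn(1, d).astype("float32")          # encoded source sentence
scores, ids = index.search(z_x, 100)                   # top-r candidates passed to MMR
```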

Table 2. Results (BLEU scores) with two different values of top-k retrieved TM sentences for English \(\longrightarrow \) Vietnamese. Bold text indicates the best result in each column.
Table 3. BLEU scores comparing monolingual translation memory (TM) and bilingual TM for English \(\longrightarrow \) Vietnamese.

4.2 Main Results

Table 2 shows the evaluation results on different sizes of the training dataset. The reported results are obtained with 3 and 5 retrieved sentences for the TM, respectively. From these results, we make the following observations: i) NMT using greedy retrieval (e.g., BM25) does not outperform the original NMT model, which indicates that joint training is important for the TM-augmented NMT approach; ii) our model, which improves the retrieved sentences by enabling diversity in the TM, achieves the best results compared with strong baseline models, indicating that a diverse TM improves the performance of NMT, especially in low-resource scenarios; iii) the results for k = 3 and k = 5 retrieved sentences do not differ much. However, in our opinion, the number of retrieved sentences k should be treated as a hyperparameter and tuned during training. We leave this issue as future work.

Furthermore, we also evaluate the performance of monolingual TM against bilingual TM. Accordingly, we re-implement the most recent work [9] for bilingual TM and compare it with our method in the monolingual and bilingual settings, respectively. Table 3 shows the results of the variant of our model and of Cheng et al. [9] with \(k=3\). An interesting observation is that the performance of monolingual TM is slightly better than that of bilingual TM. These results indicate that taking advantage of abundant monolingual data can improve the performance of NMT, especially in low-resource scenarios.

5 Conclusion

In this paper, we propose a new framework for TM-augmented NMT that enables diversity in a monolingual TM. To the best of our knowledge, the proposed method is the first end-to-end TM-augmented NMT study that takes both monolingual and diversity-enabled TM into account. Specifically, by adding a non-heuristic module based on the MMR algorithm, our framework enables diversity in the retrieval stage. Furthermore, instead of utilizing bilingual sentence pairs for the retrieval stage, we adopt two Transformer encoders to exploit the abundance of monolingual data. Experiments show the effectiveness of the proposed method: across training datasets of varying size, our method improves performance by 0.5 to 1 BLEU over strong baseline models in this research field. As future work, we plan to explore the size of the translation memory by integrating this hyperparameter into the learning process in order to further improve the performance of TM-augmented NMT.