1 Introduction

Statistical machine translation (smt) has been successfully applied to translation between various language pairs. Phrase-based smt is the most widely used approach, since it can learn a translation model from a sentence-aligned parallel corpus without any linguistic annotations. Although the quality of translation can be improved by using a large language model obtained from easily available monolingual corpora [1], language models capture only the fluency of the target language, so the quality of translation cannot improve much if the translation model does not provide correct translation candidates for source-language words and phrases. The quality of translation in smt is therefore bounded by the size of the parallel corpus used to train the translation model. Even if a large parallel corpus is available for the pair of languages in question, we often want to translate sentences in a domain whose vocabulary differs from that of the available parallel corpora, and this mismatch degrades the quality of translation [2, 3].

Researchers have tackled this problem and proposed methods of domain adaptation for smt that exploit a larger out-of-domain parallel corpus. They have focused on a scenario in which a small or pseudo in-domain parallel corpus is available for training [4]. In actual scenarios where users want to exploit machine translation, the target domains vary, so domain mismatches between the prepared smt system and the target documents are likely to occur. Domain adaptation is thus expected to improve the quality of translation. However, it is unrealistic for most mt users, who cannot command the target language, to prepare in-domain parallel corpora by themselves. Crowdsourcing the construction of in-domain parallel corpora is an option only for the few users who have a large number of documents to translate and are willing to pay for improving the quality of translation.

In this study, we assume domain adaptation for smt in a scenario where no sentence-aligned parallel corpus is available for the target domain and propose an instant method of domain adaptation for smt that uses a cross-lingual projection of word semantic representations [5]. Assuming that source- and target-language monolingual corpora are available, we first learn vector-based semantic representations of words in the source and target languages from those monolingual corpora. We next obtain a projection from semantic representations in the source language to those in the target language, using a seed dictionary (in the general domain) to learn a translation matrix. We then use the translation matrix to obtain translations of unseen (out-of-vocabulary, oov) words. The translation probabilities are computed from the cosine similarity between the projected semantic representation of the oov word and the semantic representations of words in the target language.

To evaluate the effectiveness of our method, we apply it to translation between English (en) and Japanese (ja) in recipe documents, using a translation model learned by phrase-based smt from Kyoto-related Wikipedia articles. Experimental results confirm that our method improves the bleu score by 0.5–1.5 and 0.1–0.2 for ja-en and en-ja translations, respectively.

The remainder of this paper is structured as follows. Section 2 explains existing approaches to domain adaptation for smt without an in-domain parallel corpus. Section 3 describes a method of translating word semantic representations. Section 4 proposes a method of adapting smt to a new domain without a sentence-aligned parallel corpus. Section 5 evaluates the effectiveness of the proposed method on domain adaptation for smt. Section 6 concludes this study and addresses future work.

2 Related Work

As mentioned in Sect. 1, most previous approaches to domain adaptation for smt assume a scenario where a small or pseudo in-domain parallel corpus is available. In this section, we briefly overview methods of domain adaptation for smt in a setting where no in-domain parallel corpus is available.

Wu et al. [6] have proposed domain adaptation for smt that exploits an in-domain bilingual dictionary. They generate a translation model from the bilingual dictionary and combine it with the translation model learned from out-of-domain parallel corpora. An issue here is how to learn the translation probabilities between words (or phrases) needed for the translation model; they resort to the probabilities of target-language words in a monolingual corpus. Although building a bilingual dictionary for the target domain is more effective than developing a parallel corpus for covering rare oov words, it is still difficult for most mt users, who cannot command the target language, to develop such a dictionary.

To cope with this problem, several researchers have recently exploited a bilingual lexicon automatically induced from in-domain corpora to generate a translation model for smt [7,8,9]. These approaches induce a bilingual lexicon from in-domain comparable corpora prior to the translation and use it to obtain an in-domain translation model.

Mathur et al. [10] exploit parallel corpora in various domains to induce a translation model for the target domain. They used 11 sets of parallel corpora from domains including TED talks, news articles, and software manuals to train a translation model for each domain and then linearly interpolated these translation models to derive a translation model for the target domain. They successfully improved the quality of translation when no parallel corpus was available for the target domain. Yamamoto and Sumita [11] assume a variety of language expressions in travel conversations and train several language and translation models from a set of parallel corpora obtained by unsupervised clustering of the entire parallel corpus of travel conversations. The language and translation models for translating a given sentence are chosen according to the similarity between the given sentence and the sentences in each split of the parallel corpus. Although this method is not intended for domain adaptation, it can be used in our setting when we have a parallel corpus for the general domain (and the domain of the target sentence is included in the general domain). These studies, however, implicitly assume that in-domain (or related-domain) parallel corpora are available, whereas we assume those resources are unavailable in order to broaden the applicability of our method.

Among these studies, our method is most closely related to domain adaptation using bilingual lexicon induction [7,8,9] but differs from these approaches in that it does not need to build a bilingual lexicon prior to translation to support the translation of oov words in a given sentence. We use a projection of the semantic representations of source-language words onto the target-language semantic space to dynamically find translation candidates for oov words, by computing the similarity between the projected representations and the semantic representations of target-language words at translation time. We also empirically show that our approach can benefit even from general-domain non-comparable monolingual corpora instead of the in-domain comparable monolingual corpora used in these studies on bilingual lexicon induction.

3 Cross-Lingual Projection of Word Semantic Representations

Our method exploits a projection of the semantic representations of oov words in the source language onto the target-language semantic space to look for translation candidates for the oov words. In this section, we first introduce semantic representations of words in a continuous vector space and then describe a method we proposed previously that learns a translation matrix for projecting vector-based representations of words across languages [5].

A vector-based semantic representation of a word, hereinafter word vector, represents the meaning of a word with a continuous vector. These representations are based on the distributional hypothesis [12, 13], which states that words that occur in similar contexts tend to have similar meanings. Word vectors can be obtained from monolingual corpora in an unsupervised manner, by either a count-based approach [14] or prediction-based approaches [15, 16].

Words that have similar meanings tend to have similar vectors [17, 18]. By mapping words into a continuous vector space, we can use cosine similarity to compute the similarity in meaning between words. However, the similarity between word vectors across languages is difficult to compute, so these word vectors are hard to utilize in cross-lingual applications such as machine translation or cross-lingual information retrieval.
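To make the similarity computation concrete, the following is a minimal sketch using NumPy; the toy vocabulary and vector values are purely illustrative and are not taken from our experiments.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two word vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy monolingual word vectors (illustrative values only).
vectors = {
    "cat":    np.array([0.9, 0.1, 0.3]),
    "dog":    np.array([0.8, 0.2, 0.4]),
    "temple": np.array([0.1, 0.9, 0.2]),
}

print(cosine(vectors["cat"], vectors["dog"]))     # high: similar meanings
print(cosine(vectors["cat"], vectors["temple"]))  # low: dissimilar meanings
```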

To solve this problem, Mikolov et al. [19] proposed a method that learns a cross-lingual projection of word vectors from one language into another. By projecting a word vector into the target-language semantic space, we can compute the semantic similarity between words in different languages. Suppose that we have training data of n examples, \(\{(\varvec{x}_1, \varvec{z}_1),(\varvec{x}_2,\varvec{z}_2),\dots (\varvec{x}_{n},\varvec{z}_{n})\}\), where \(\varvec{x}_i\) is the vector representation of a word in the source language (e.g., “gato”), and \(\varvec{z}_i\) is the word vector of its translation in the target language (e.g., “cat”). Then the translation matrix, \(\varvec{W}\), such that \(\varvec{W}\varvec{x}_i\) approximates \(\varvec{z}_i\), can be obtained by solving the following optimization problem:

$$\begin{aligned} {\varvec{W}^{\star }} = \mathop {\mathrm {argmin}}\limits _{\varvec{W}} \sum _{i=1}^n ||\varvec{W} \varvec{x}_i - \varvec{z}_i ||^2 \end{aligned}$$

Here, since word vectors are induced from monolingual corpora, vectors of oov words are easy to obtain by using in-domain or large-scale monolingual corpora.
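The optimization above is an ordinary least-squares problem and admits a closed-form solution (Mikolov et al. [19] use stochastic gradient descent). The following is a minimal sketch of the closed-form variant, assuming the seed-dictionary vector pairs are stacked row-wise into matrices X and Z; the function and variable names are ours.

```python
import numpy as np

def learn_translation_matrix(X, Z):
    """Solve argmin_W sum_i ||W x_i - z_i||^2 by linear least squares.

    X: (n, d_src) vectors of source-language seed-dictionary words.
    Z: (n, d_trg) vectors of their target-language translations.
    Returns W of shape (d_trg, d_src) such that W @ x approximates z.
    """
    # min_M ||X M - Z||_F^2 is solved column-wise by lstsq; W = M^T.
    M, *_ = np.linalg.lstsq(X, Z, rcond=None)
    return M.T

# Usage sketch: project a source-language vector into the target space.
# W = learn_translation_matrix(X_seed, Z_seed)
# z_hat = W @ x_src   # compare with target-language vectors via cosine similarity
```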

We have improved the aforementioned approach by adopting count-based word vectors and integrating prior knowledge about translatable context pairs between the dimensions of the count-based vectors [5]:

$$\begin{aligned} {\varvec{W}^{\star }} = \mathop {\mathrm {argmin}}\limits _{\varvec{W}} \sum _{i=1}^n \Vert \varvec{W} \varvec{x}_i - \varvec{z}_i \Vert ^2 + \frac{\lambda }{2}\Vert \varvec{W}\Vert ^2 -\beta _{train} \!\!\!\!\!\! \sum _{(j,k)\in \mathcal {D}_{train}} \!\!\!\!\!\! w_{jk} - \beta _{sim} \!\!\!\! \sum _{(j,k)\in \mathcal {D}_{sim}} \!\!\!\! w_{jk}. \end{aligned}$$

The second term is an \(L_2\) regularizer, while the third and fourth terms strengthen \(w_{jk}\) when the k-th dimension in the source language corresponds to the j-th dimension in the target language. \(\mathcal {D}_{train}\) and \(\mathcal {D}_{sim}\) are sets of translatable dimension pairs: \(\mathcal {D}_{train}\) is obtained from the above training data, while \(\mathcal {D}_{sim}\) is obtained by computing the surface-level similarity between dimensions. \(\lambda \), \(\beta _{train}\), and \(\beta _{sim}\) are hyperparameters that control the strength of the added terms.
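As a concrete reading of this objective, the following sketch evaluates it together with its gradient with respect to \(\varvec{W}\), assuming \(\mathcal {D}_{train}\) and \(\mathcal {D}_{sim}\) are given as lists of (j, k) index pairs; the optimizer (plain gradient descent) and all names are our own illustration, not the exact implementation of [5].

```python
import numpy as np

def objective_and_grad(W, X, Z, D_train, D_sim, lam, beta_train, beta_sim):
    """Regularized objective of Sect. 3 and its gradient w.r.t. W.

    W: (d_trg, d_src) translation matrix.
    X: (n, d_src) source vectors; Z: (n, d_trg) target vectors.
    D_train, D_sim: lists of (j, k) pairs (j: target dim, k: source dim).
    """
    R = X @ W.T - Z                            # residuals W x_i - z_i, row-wise
    obj = np.sum(R ** 2) + 0.5 * lam * np.sum(W ** 2)
    grad = 2.0 * R.T @ X + lam * W
    for (j, k) in D_train:                     # reward translatable dimension pairs
        obj -= beta_train * W[j, k]
        grad[j, k] -= beta_train
    for (j, k) in D_sim:
        obj -= beta_sim * W[j, k]
        grad[j, k] -= beta_sim
    return obj, grad

# Illustrative optimization loop (step size chosen arbitrarily):
# for _ in range(200):
#     _, grad = objective_and_grad(W, X, Z, D_train, D_sim, 0.1, 5.0, 5.0)
#     W -= 1e-4 * grad
```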

Because this method improved the accuracy of choosing translation candidates with the projected semantic representations over [19, 20], we adopt it here for finding translation candidates of oov words.

4 Method

Our method assumes that monolingual corpora are available for the source and target languages (in the target domain, if any) and first induces semantic representations of words from those corpora. It then learns a cross-lingual projection (translation matrix) using a seed dictionary in the general domain, as described in Sect. 3. Note that a seed dictionary of common words is usually available for most pairs of languages or could be constructed using English as a pivot language [21].

Given a translation matrix for projecting the semantic representations of oov words in a given sentence, our method instantly constructs a back-off translation model that enumerates translation candidates for the oov words in the following way:

  • Step 1: When the translation system accepts a sentence with an oov word, \(f_{\textsc {oov}}\), it translates the semantic representation of that word, \(\varvec{x}_{\textsc {oov}}\), into a semantic representation in the target language, \(\varvec{x}_{\textsc {oov}}'\), using the translation matrix obtained by the method described in Sect. 3.

  • Step 2: It then computes the cosine similarity between the obtained semantic representation and those of words in the target language to enumerate k translation candidatesFootnote 1 in decreasing order of cosine similarity. The cosine similarities are also used to obtain \(P_{vec}(e|f_{\textsc {oov}})\), the direct translation probabilities from the oov word in the source language, \(f_{\textsc {oov}}\), to a candidate word in the target language, e, by normalizing them to sum to 1 (Steps 1 and 2 are sketched in code after this list). Although the obtained translation candidates may include wrong translations, the language model can choose one that is more appropriate in context in the next step, unless the context is itself full of oov words.

  • Step 3: The decoder of phrase-based smt uses the above translation probabilities as a back-off translation model to perform the translation. More formally, we add a new feature function \(h_{vec}\) to the log-linear model used in the decoder, as in the following equation:

    $$\begin{aligned} \log P(\varvec{e}|\varvec{f}) = \sum _{i} \log (h_i (\varvec{e}, \varvec{f}) ) \lambda _i + \log (h_{vec} (\varvec{e}, \varvec{f}) ) \lambda _{vec} \end{aligned}$$
    (1)

    The \(h_{vec} (\varvec{e}, \varvec{f})\) in Eq. (1) is computed with \(P_{vec}(e|f_{\textsc {oov}})\), only for each oov word \(f_{\textsc {oov}}\) in the source sentence \(\varvec{f}\). An issue here is how to set the feature weight \(\lambda _{vec}\), since no in-domain training data are available for tuning. We simply set \(\lambda _{vec}\) to the same value as the weight of the direct phrase translation probability of the translation model.
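The following is a minimal sketch of Steps 1 and 2, assuming the translation matrix W and the target-language word vectors are already available; clipping negative similarities before normalization is our own safeguard, and wiring the resulting probabilities into the decoder (Step 3) depends on the smt toolkit.

```python
import numpy as np

def oov_backoff_probs(x_oov, W, trg_vocab, trg_vectors, k=10):
    """Steps 1-2: project the oov word vector into the target space and
    turn the top-k cosine similarities into translation probabilities.

    x_oov: (d_src,) vector of the oov word f_oov.
    W: (d_trg, d_src) translation matrix from Sect. 3.
    trg_vocab: list of target-language words.
    trg_vectors: (|V|, d_trg) matrix whose rows are their vectors.
    Returns {candidate word e: P_vec(e | f_oov)}.
    """
    z_hat = W @ x_oov                                      # Step 1: projection
    sims = trg_vectors @ z_hat / (
        np.linalg.norm(trg_vectors, axis=1) * np.linalg.norm(z_hat) + 1e-12
    )                                                      # cosine similarities
    top = np.argsort(-sims)[:k]                            # Step 2: top-k candidates
    weights = np.clip(sims[top], 0.0, None)                # drop negative scores
    probs = weights / weights.sum()                        # normalize to sum to 1
    return {trg_vocab[i]: float(p) for i, p in zip(top, probs)}
```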

5 Experiments

This section evaluates our method of domain adaptation for smt, using an out-of-domain parallel corpus and source-language and target-language monolingual corpora.

5.1 Settings

First, we prepared two parallel corpora in different domains to carry out an experiment on domain adaptation for smt. One is the “Japanese-English Bilingual Corpus of Wikipedia’s Kyoto Articles” (hereinafter kftt corpus), originally prepared by the National Institute of Information and Communications Technology (nict) and used as the benchmark in “The Kyoto Free Translation Task”Footnote 2 [22], a translation task that focuses on Wikipedia articles related to Kyoto. The other parallel corpus (hereinafter recipe corpus) is provided by Cookpad Inc.,Footnote 3 the largest online recipe-sharing service in Japan. The kftt corpus includes many words related to Japanese history and the temples and shrines in Kyoto, while the recipe corpus includes many words related to foods and cookware. We randomly sampled 10k sentence pairs from the recipe corpus as the test corpus for evaluating our domain adaptation method. The language models of the target languages are trained on the concatenation of the kftt corpus and the remaining portion of the recipe corpus, while the translation models are trained on the kftt corpus only. The sizes of the training data and test data are detailed in Table 1.

Table 1. Statistics of the dataset.
Table 2. Monolingual corpora used to induce semantic representations.

We conducted experiments with Moses [23],Footnote 4 with language models trained with SRILM [24]Footnote 5 and word alignments predicted by GIZA++ [25].Footnote 6 5-gram language models were trained using SRILM with the interpolate and kndiscount options. Word alignments were obtained using GIZA++ with the grow-diag-final-and heuristic. The lexical reordering model was obtained with the msd-bidirectional setting.

Next, we extracted four sets of count-based word vectors from Wikipedia dumpsFootnote 7 (general-domain monolingual corpora) and the remaining portion of the recipe corpus (in-domain monolingual corpora), for Japanese and English, respectively. We considered context windows of five words on both sides of the target word. Function words were then excluded from the extracted context words, following our previous work [5]. Since the count vectors are very high-dimensional and sparse, we selected the top-d (\(d=10,000\) for the general-domain corpora, \(d=5000\) for the in-domain corpora) most frequent words as context words (in other words, the dimensions of the word vectors). We converted the counts into positive pointwise mutual information [26] and normalized the resulting vectors to remove the bias introduced by differences in word frequency. The sizes of the monolingual datasets used for inducing semantic representations of words are detailed in Table 2.
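For concreteness, the following sketch builds count-based word vectors in the way described above (a ±5-word window, the d most frequent words as context dimensions, positive pointwise mutual information, and length normalization); tokenization and the removal of function words are assumed to have been done beforehand, and the names are ours.

```python
import numpy as np
from collections import Counter, defaultdict

def build_ppmi_vectors(sentences, d=10000, window=5):
    """Count-based word vectors from tokenized sentences (lists of tokens).

    Co-occurrence counts within +/-window words over the d most frequent
    context words, reweighted by positive PMI and L2-normalized.
    """
    freq = Counter(tok for sent in sentences for tok in sent)
    contexts = [w for w, _ in freq.most_common(d)]         # context dimensions
    ctx_index = {w: i for i, w in enumerate(contexts)}

    counts = defaultdict(lambda: np.zeros(len(contexts)))
    for sent in sentences:
        for i, w in enumerate(sent):
            lo, hi = max(0, i - window), min(len(sent), i + window + 1)
            for j in range(lo, hi):
                if i != j and sent[j] in ctx_index:
                    counts[w][ctx_index[sent[j]]] += 1.0

    words = list(counts)
    M = np.vstack([counts[w] for w in words])               # raw co-occurrence matrix
    total = M.sum()
    pw = M.sum(axis=1, keepdims=True) / total                # p(word)
    pc = M.sum(axis=0, keepdims=True) / total                # p(context)
    with np.errstate(divide="ignore", invalid="ignore"):
        ppmi = np.maximum(np.log((M / total) / (pw @ pc)), 0.0)
    ppmi[~np.isfinite(ppmi)] = 0.0                           # zero out log(0) cases
    ppmi /= np.linalg.norm(ppmi, axis=1, keepdims=True) + 1e-12  # length-normalize
    return {w: ppmi[i] for i, w in enumerate(words)}
```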

Finally, we used Open Multilingual WordNetFootnote 8 to train the translation matrices as in [5]. The hyperparameters were tuned on the development set as follows: \(\lambda = 0.1\), \(\beta _{train} = 5\), \(\beta _{sim} = 5\) for (ja-en, general-domain). \(\lambda = 1\), \(\beta _{train} = 0.1\), \(\beta _{sim} = 0.2\) for (ja-en, in-domain). \(\lambda = 0.1\), \(\beta _{train} = 5\), \(\beta _{sim} = 5\) for (en-ja, general-domain). \(\lambda = 0.5\), \(\beta _{train} = 1\), \(\beta _{sim} = 2\) for (en-ja, in-domain).

5.2 Results

We performed domain adaptation as described in Sect. 4 and evaluated the effectiveness of our method in terms of bleu score [28]. Table 3 shows the results of translating the 10k sentences in the recipe corpus between Japanese and English. All and oov in Table 3 denote the bleu scores measured on the whole test set and the scores measured only on the sentences that include oov words, respectively. Statistics of the oov words are shown in Table 4.

Table 3. bleu on recipe corpus. \(^*\) indicates statistically significant improvements in bleu over the respective baseline systems in accordance with bootstrap resampling [27] at \(p <0.05\).
Table 4. Statistics of the oov words in test data (the 10k sentences in the recipe corpus).

All four methods shown in Table 3 use translation models trained on the kftt corpus and are tested on the recipe corpus. Proposed (general) uses the word vectors extracted from the Wikipedia corpora, while Proposed (in-domain) uses the vectors extracted from the remaining portion of the recipe corpus. In both of these methods, we performed domain adaptation by automatically constructing back-off translation models for oov words. Parallel Corpus in Table 3 uses the remaining portion of the recipe corpus as a parallel corpus to learn the translation models, a resource that is assumed to be unavailable in this study; Parallel Corpus is thus an upper bound for the task. The low bleu score for en-ja translation is explained by the direction of the translation being different from the direction in which the corpus was built (ja-en) [29]. In addition, the smaller number of oov tokens in en-ja than in ja-en also explains the smaller improvement in bleu score.

Table 5. Hand-picked examples of the translations for the 10k sentences in the recipe corpus from Japanese to English. Text in bold denotes oov words in the input sentences and their translations. The subscripts of the translation of the oov words refer to a manual word alignment of the oov words.

Table 3 shows that our methods perform well on the translation task. We found that it was better to use the in-domain monolingual corpora rather than the general-domain monolingual corpora to obtain the word vectors. This conforms to our expectation, because the contextual information included in the word vectors strongly correlates with the target domain. Parallel Corpus achieves a much higher bleu score than all the other methods. This result shows that the domain adaptation task we performed is intrinsically difficult because of the significant differences between the two domains.

We show hand-picked examples of the translations in Table 5 to analyze the methods in more detail. The first two examples show that Proposed (in-domain) provides more accurate translations than Proposed (general). Although our method can improve the translations of oov words, the third and fourth examples indicate that it cannot repair Baseline translations whose syntax is already wrong. The last example shows that some oov words tend to be translated into related words, mainly because of their similarity in the semantic space.

The examples show that oov words such as “” (simmer), “” (toaster), and “” (bake) could successfully be translated by Proposed (in-domain). These words almost never appear in the kftt corpus, since they have no relation to Japanese history or the temples in Kyoto. Comparing Proposed (in-domain) and Proposed (general), we see that the latter mistakenly translated many oov words into related words (e.g., “” (toaster) into “refrigerator”, or “” (simmer) into “boil”). This result also indicates that word vectors extracted from the in-domain corpus work better than those extracted from the general-domain corpus.

6 Conclusions

A cross-lingual projection of word semantic representations has been leveraged to obtain a translation model for unseen (out-of-vocabulary, oov) words in domain adaptation for smt. Assuming monolingual corpora for the source and target languages, we induce vector-based semantic representations of words and obtain a projection (translation matrix) from source-language semantic representations into the target-language semantic space. We use this projection to find translation candidates for oov words and use the cosine similarity to induce the translation probabilities. Experimental results on domain adaptation from a Kyoto-related domain to a recipe domain confirmed that our method improved bleu by 0.5–1.5 and 0.1–0.2 for ja-en and en-ja translations, respectively.

In the future, we plan to (i) assign better translation probabilities to non-oov words that exist in the translation model learned from the out-of-domain parallel corpus, (ii) extend our method to obtain translations between phrases as in [30], and (iii) combine our method with existing approaches to domain adaptation for smt that assume no bilingual corpus in the target domain.