1 Introduction

Machine translation (MT) systems are often applied in settings where the test data might be sampled from a distribution that differs from the training data, usually due to different domains of language use. This domain mismatch between training and test data often leads to performance degradation, usually due to lexical differences between the domains. When a word in the test data is found in the training data, its most suitable translation in the test domain might be different from that in the training domain. For example, when translating from English to Russian, the most natural translation for the word ‘code’ would be ‘шифр’ (‘cipher’), ‘закон’ (‘law’) or ‘’ (‘program’) if we consider cryptography, legal and software development domains, respectively. Given parallel training data originating from one of those domains, training an MT system on the data would produce a rather suboptimal translation for the other domains.

Surprisingly, degradation in translation quality is observed even when we train an MT system on large heterogeneous corpora such as EuroParl (Koehn 2005), Common Crawl Corpus, UN Corpus, and News Commentary (Shah et al. 2012; Carpuat et al. 2014; Cuong et al. 2016b). For instance, Axelrod et al. (2011) show that when it comes to a domain-specific task, a small percentage of well-selected data can outperform the full heterogeneous dataset for training MT systems (Biçici and Yuret 2011; Poncelas et al. 2017). Shah et al. (2010) show that word-alignment training benefits from weighting sentence pairs according to their relevance to a domain-specific task.

In this paper, we provide a comprehensive survey of domain adaptation for statistical machine translation (SMT), aimed particularly at phrase-based systems (Koehn et al. 2003). A very basic question is: what constitutes a domain? There are different definitions in the literature, for example:

  • The ‘provenance’ of the training data (Foster and Kuhn 2007; Moore and Lewis 2010; Sennrich 2012);

  • The difference of words and grammars between corpora (Pecina et al. 2012);

  • The thematic content in the training data, such as topic (Hasler et al. 2014; Hu et al. 2014);

  • A particular combination of many factors: genres, topics, dialects and writing styles (Chen et al. 2013a).

We do not aim to find the best answer, as the concept of domain is still an open question and has not been well-defined in the literature (see Van Der Wees et al. (2015) for a discussion). We rather provide a systematic overview of previous approaches to domain adaptation, showing their advantages/disadvantages, as well as how they relate to and differ from each other.

The survey is organized as follows. We first introduce SMT in general, with a focus on aspects of SMT relevant to domain adaptation (Sects. 2, 3). The survey identifies components that need to be adapted when an SMT system is applied to new domains (Sect. 4). We explain what may go wrong in translation by analyzing potential sources of translation errors and providing an explanation as to why each specific type of error may happen.

Subsequently, we present a general picture of domain adaptation for SMT where we outline the main general approaches (Sect. 5). A major part focuses on the induction (Sect. 6) and combination (Sect. 7) of domain-focused phrase translation tables, lexical weights and reordering probabilities. The induction of domain-focused sparse features and word-alignment probabilities are discussed in Sects. 8.1 and 8.2.

Finally, we also cover several practical adaptation scenarios, including adapting an existing system to multiple specific domains at the same time (Sect. 8.3). Another scenario (Sect. 8.4) is embedding an SMT system into a Cross-lingual Information Retrieval (CLIR) system (i.e. automatically translating queries into different languages, so that a search engine can return search results in the corresponding languages). We also discuss how web-based translation services such as Bing Translator and Google Translate can be improved when the domain of a new request is not known in advance. Specifically, we cover cache-based adaptive models (Sect. 8.5) and rewarding domain invariance for adaptation (Sect. 8.6).

2 Statistical machine translation

In SMT, we aim to translate a source (foreign) sentence \(\mathbf f \) into a sentence in the target language \(\mathbf e \). Among the target translation hypotheses, the translation hypothesis \(\hat{\mathbf{e }}\) with the highest probability given the source sentence is selected, as in (1):

$$\begin{aligned} \hat{\mathbf{e }} = {{{\mathrm{\arg \!\max }}}}_\mathbf{e }\{{P}(\mathbf e |\ \mathbf f )\} = {{{\mathrm{\arg \!\max }}}}_\mathbf{e } \{P(\mathbf e ) P(\mathbf f |\ \mathbf e )\}. \end{aligned}$$
(1)

This approach to modeling translation is referred to as the noisy-channel framework. The architecture of the framework includes two components: the translation model (i.e. \(P(\mathbf f |\ \mathbf e )\)) and the language model (i.e. \(P(\mathbf e )\)).

A more powerful approach exploits a log-linear formulation, in which the posterior probability \(P(\mathbf e |\ \mathbf f )\) is modeled with a set of M feature functions \(\varvec{\phi }(\mathbf e ,\ \mathbf f ) = \{\phi _1(\mathbf e ,\ \mathbf f ),\ \ldots ,\ \phi _M(\mathbf e ,\ \mathbf f )\}\) with model parameters \({\mathbf {w}} = \{w_1,\ \ldots ,\ w_M\}\) as in (2):

$$\begin{aligned} {P}(\mathbf e |\ \mathbf f )&\propto \exp ({\mathbf {w}} \cdot \varvec{\phi } (\mathbf e ,\ \mathbf f )).\end{aligned}$$
(2)

Under this framework, we obtain the decision rule in (3):

$$\begin{aligned} \hat{\mathbf{e }} = {{{\mathrm{\arg \!\max }}}}_\mathbf{e } {\mathbf {w}} \cdot \varvec{\phi } (\mathbf e ,\ \mathbf f ). \end{aligned}$$
(3)

The decision rule is simple as we can safely ignore the daunting normalization factor.

The model was first proposed by Och and Ney (2002), forming the basis of phrase-based SMT systems. It is straightforward to see that this framework contains the noisy-channel framework as a special case. Its advantage lies in its flexibility, relative to the noisy-channel framework, as one can extend a basic SMT system containing translation and language models by including arbitrary feature functions of the source and the target sentences. There are many possibilities for defining feature functions that help the SMT system to improve translation, such as linguistic features, word and phrase penalties, reordering features, and rule counting. Simply adding feature functions from the target to source language also often improves translation.

Learning the model parameters \({\mathbf {w}} = \{w_1,\ \ldots ,\ w_M\}\) using a held-out development set is crucial to improving translation quality. In principle, training for log-linear models can be done using maximum likelihood or related criteria (e.g. cross-entropy, perplexity). Such an objective function is convex, and global optimization is possible. The main difficulty, however, is that we need to compute the normalization factor during learning. This is intractable, as we cannot explore the full space of all translation hypotheses for each translation input. In practice, the normalization factor is computed using an N-best list of top-N translation hypotheses or a lattice (Macherey et al. 2008).
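To make Eqs. (2) and (3) concrete, the following minimal sketch scores a handful of hypotheses with \({\mathbf {w}} \cdot \varvec{\phi }\), picks the best one, and normalizes only over the listed hypotheses, in the spirit of the N-best approximation mentioned above. The hypotheses, feature values and weights are invented for illustration; the function names are our own and do not correspond to any particular toolkit.

```python
import math

def score(weights, features):
    """Unnormalized log-linear score w . phi(e, f)."""
    return sum(w * x for w, x in zip(weights, features))

def decode(weights, hypotheses):
    """Decision rule of Eq. (3): no normalization factor is needed."""
    return max(hypotheses, key=lambda h: score(weights, h[1]))[0]

def nbest_posterior(weights, hypotheses):
    """Eq. (2), normalized over the given N-best list only (an approximation)."""
    scores = [score(weights, feats) for _, feats in hypotheses]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    z = sum(exps)
    return {e: x / z for (e, _), x in zip(hypotheses, exps)}

# Hypothetical features per hypothesis: [log P(e), log P(f | e), word penalty]
hyps = [
    ("the code is secret", [-4.0, -2.0, -4.0]),
    ("the law is secret", [-4.5, -3.5, -4.0]),
    ("the cipher is secret", [-5.0, -0.5, -4.0]),
]
# With weights (1, 1, 0) the score is log P(e) + log P(f | e),
# i.e. the noisy-channel model of Eq. (1) as a special case.
w = [1.0, 1.0, 0.0]
best = decode(w, hyps)
post = nbest_posterior(w, hyps)
```

Changing `w` changes the ranking without touching the feature values, which is exactly what the tuning methods discussed next exploit.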

Optimizing an SMT system using maximum likelihood or related criteria has a loose relation to the translation quality on unseen text (Och 2003). There is a need to directly incorporate translation accuracy on a held-out development set into the optimization, now a fundamental part of modern SMT systems. Numerous optimization methods have been proposed in the literature, such as MERT (Och 2003), MIRA (Watanabe et al. 2007; Chiang et al. 2008; Cherry and Foster 2012), and Pairwise Ranking Optimization (PRO; Hopkins and May 2011). Readers may refer to Neubig and Watanabe (2016) for a comprehensive survey of system optimization methods in general.

The log-linear framework has two notable shortcomings that make the problem of domain adaptation for SMT even more challenging:

  • First, having more translation features significantly increases the difficulty of the optimization. Specifically, having more feature dimensions requires a much larger held-out development set for system optimization, as shown in Waite and Byrne (2015). This is an issue in domain adaptation for SMT because creating such an in-domain held-out development dataset is expensive.

  • Second, log-linear models try to separate good and bad translation hypotheses using a linear hyper-plane. This is potentially problematic, as interactions between domain-specific features can be complex. It may be necessary to perform preprocessing steps over the feature space to produce a feature set that is less prone to non-linearities (Liu et al. 2013; Clark et al. 2014). However, methods tailored to such a special treatment are quite sophisticated and not widely deployed in practice.

3 Phrase-based SMT system

There are many types of translation systems that have been built in the past, for example:

  • Syntax-based translation systems (Yamada and Knight 2001),

  • Phrase-based SMT systems (Och and Ney 2002; Koehn et al. 2003),

  • Hierarchical phrase-based SMT systems (Chiang 2005, 2007),

  • Syntactic phrase-based SMT systems (Quirk et al. 2005; Quirk and Menezes 2006).

This paper focuses on phrase-based SMT systems (Och and Ney 2002; Koehn et al. 2003).

3.1 Model

A standard phrase-based SMT system has various dense feature functions (i.e. highly informative feature functions) estimated at phrase level. Three of the most important translation models are a phrase-based model \(\phi _{TM}(\mathbf e ,\ \mathbf f )\), lexical weighting \(\phi _{LW}(\mathbf e ,\ \mathbf f )\), and reordering model \(\phi _{RM}(\mathbf e ,\ \mathbf f )\). A common domain-adaptation strategy for SMT is to directly adapt these models. We thus describe them in detail below.

  • Phrase-based model At the core of a phrase-based SMT system is the phrase-based model, which aims at modeling translation of sentence pairs at phrase level. Given an input sentence \(\mathbf f \), let us assume that a sequence of target-language phrases \(\mathbf e = (\tilde{e}_1, \tilde{e}_2, \cdots , \tilde{e}_n)\) is currently hypothesized by the decoder. Let us also assume we are provided with a phrase alignment \({\mathbf {a}}= (a_1, a_2, \cdots , a_n)\) that defines a source \(\tilde{f}_{a_i}\) for each translated phrase \(\tilde{e}_i\). The model is estimated as in (4):

    $$\begin{aligned} \phi _{TM}(\mathbf e , \mathbf f )&= \log P_{TM}(\mathbf e |\ \mathbf f ) = \log \prod \nolimits _{i=1}^{n}P(\tilde{e}_i|\ \tilde{f}_{a_i})\nonumber \\&= \sum \nolimits _{i=1}^{n}\log P(\tilde{e}_i|\ \tilde{f}_{a_i}) \end{aligned}$$
    (4)
  • Lexical weighting The lexical weighting provides smoother estimates for probabilities of phrase pairs. The model is estimated as in (5):

    $$\begin{aligned} \phi _{LW}(\mathbf e , \mathbf f )&= \log P_{LW}(\mathbf e |\ \mathbf f ) = \log \prod \nolimits _{i=1}^{n}P(\tilde{e}_i|\ \tilde{f}_{a_i},\ {a_i}) \nonumber \\&= \sum \nolimits _{i=1}^{n}\log P(\tilde{e}_i|\ \tilde{f}_{a_i},\ {a_i}) \end{aligned}$$
    (5)

    Here, the distribution \(P(\tilde{e}_i|\ \tilde{f}_{a_i},\ {a_i})\) is computed based on lexical probabilities \(P(e\ |\ f)\) between words \(\langle e,\ f\rangle \) in a phrase pair \(\langle \tilde{e}_i,\ \tilde{f}_{a_i}\rangle \). Different models have a slightly different way of computing \(P(\tilde{e}_i|\ \tilde{f}_{a_i},\ {a_i})\). A typical estimate of \(P(\tilde{e}_i|\ \tilde{f}_{a_i},\ {a_i})\) (Koehn et al. 2003) is as in (6):

    $$\begin{aligned} P(\tilde{e}_i|\ \tilde{f}_{a_i},\ {a_i}) = \prod \nolimits _{k=1}^{|\tilde{e}_i|}\frac{1}{|\{j|(j,\ k)\in a_i\}|}\sum \nolimits _{(j,\ k)\in a_i}^{}P(\tilde{e}_{i}^k|\ \tilde{f}_{a_i}^j). \end{aligned}$$
    (6)

    Here,

    • \(\tilde{e}_{i}^k\): word at position k in target phrase \(\tilde{e}_{i}\),

    • \(\tilde{f}_{a_i}^j\): word at position j in source phrase \(\tilde{f}_{a_i}\),

    • \(|\tilde{e}_i|\): length of phrase \(\tilde{e}_i\),

    • \(|\{j|(j,\ k)\in a_i\}|\): the number of source words to which the target word at position k in phrase \(\tilde{e}_i\) is aligned.

  • Reordering model Such phrase-based models and lexical weighting are not meant for handling word/phrase order phenomena between languages. For state-of-the-art phrase-based SMT systems, integrating lexicalized reordering models (Tillmann 2004; Koehn et al. 2007; Galley and Manning 2008) must be considered. These models estimate the probability of a sequence of orientations \(\mathbf O = (o_1,\ o_2,\ \ldots ,\ o_n)\) as in (7):

    $$\begin{aligned} \phi _{RM}(\mathbf e , \mathbf f , \mathbf O )&= \log P_{RM}(\mathbf O |\ \mathbf e ,\ \mathbf f ) = \log \prod \nolimits _{i=1}^{n}P(o_i|\tilde{e}_i,\ \tilde{f}_{a_i},\ a_{i-1},\ a_{i-2})\nonumber \\&= \sum \nolimits _{i=1}^{n}\log P(o_i|\ \tilde{e}_i,\ \tilde{f}_{a_i},\ a_{i-1},\ a_{i-2}) \end{aligned}$$
    (7)

    Here, each orientation \(o_i\) takes one of the values \(\{M, S, D\}\), indicating whether a phrase directly follows the previous phrase (\({{\varvec{M}}}onotone\)), swaps positions with it (\({{\varvec{S}}}wap\)), or is not adjacent to it (\({{\varvec{D}}}iscontinuous\)).
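As an illustration of the lexical weighting in Eq. (6) and of the three orientation classes, the sketch below computes the weight of a single phrase pair and classifies the orientation of two consecutively translated phrases from their source spans. The lexical probabilities and the example phrase pairs are invented. Scoring unaligned target words against a `NULL` source token, and deciding orientation purely from adjacent source spans, are assumptions of this sketch.

```python
def lexical_weight(target_phrase, source_phrase, alignment, lex_prob):
    """Eq. (6): product over target words of the average P(e | f) over their links.

    alignment is a set of (j, k) pairs: source position j, target position k.
    lex_prob maps (target_word, source_word) to P(e | f).
    """
    weight = 1.0
    for k, e_word in enumerate(target_phrase):
        links = [j for (j, kk) in alignment if kk == k]
        if not links:
            # Assumption: an unaligned target word is scored against NULL.
            weight *= lex_prob.get((e_word, "NULL"), 0.0)
            continue
        total = sum(lex_prob.get((e_word, source_phrase[j]), 0.0) for j in links)
        weight *= total / len(links)
    return weight

def orientation(prev_span, cur_span):
    """Classify the current phrase as M, S or D relative to the previous one.

    Spans are inclusive (start, end) source positions of two phrases that are
    adjacent on the target side.
    """
    prev_start, prev_end = prev_span
    cur_start, cur_end = cur_span
    if cur_start == prev_end + 1:
        return "M"  # monotone: source phrase directly follows the previous one
    if cur_end == prev_start - 1:
        return "S"  # swap: source phrase directly precedes the previous one
    return "D"      # discontinuous: not adjacent on the source side

# Hypothetical lexical probabilities P(e | f) for an English-Spanish phrase pair.
lex = {("contraseña", "password"): 0.8, ("maestra", "master"): 0.5,
       ("no", "does"): 0.2, ("no", "not"): 0.6}
w_swap = lexical_weight(["contraseña", "maestra"], ["master", "password"],
                        {(1, 0), (0, 1)}, lex)
w_multi = lexical_weight(["no"], ["does", "not"], {(0, 0), (1, 0)}, lex)
```

In the second call the single target word is linked to two source words, so its contribution is the average of the two lexical probabilities.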

Besides these three types of dense translation features, there are also penalties for word, phrase and distance-based reordering. Together with the language model, these are the basic translation features that form a phrase-based SMT system.

A phrase-based SMT system can also be augmented with millions of sparse feature functions (e.g. phrase features (Chiang et al. 2009; Simianer et al. 2012), lexical features (Watanabe et al. 2007; Chiang et al. 2009), or syntax-based features (Blunsom and Osborne 2008; Marton and Resnik 2008)). It is possible to induce sparse features using a large portion of the parallel training data. However, scaling training to large data requires extensive additional effort [cf. Yu et al. (2013)]. In practice, models employing sparse features are therefore often trained using a small held-out development set.

3.2 Training

The most common approach to training a phrase-based SMT system is using relative frequency estimation. We take phrase translation scores as an example. To compute \(P(\tilde{e}|\ \tilde{f})\), we first count the number of times phrase \(\tilde{e}\) aligns to phrase \(\tilde{f}\) in the parallel training data, before normalizing into probability by dividing by the total number of possible alignments to \(\tilde{f}\), as in (8):

$$\begin{aligned} P(\tilde{e}|\ \tilde{f}) = \frac{c(\tilde{e},\ \tilde{f})}{\sum _{\tilde{e}'}^{}c(\tilde{e}',\ \tilde{f})} \end{aligned}$$
(8)

This distribution, however, does not necessarily maximize the likelihood of the parallel training data. This is similar to the Data Oriented Parsing (DOP) method (Bod et al. 2003) in parsing, which hypothesizes a distribution over many possible derivations of each training example from subtrees of varying sizes.
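The relative frequency estimate of Eq. (8) can be sketched in a few lines. The list of extracted phrase pairs below is invented for illustration, and the function name is our own.

```python
from collections import Counter, defaultdict

def relative_frequency(phrase_pairs):
    """Estimate P(e~ | f~) = c(e~, f~) / sum_e' c(e~', f~), as in Eq. (8).

    phrase_pairs is an iterable of (target_phrase, source_phrase) tuples,
    one per extracted occurrence.
    """
    counts = Counter(phrase_pairs)           # c(e~, f~)
    totals = Counter()                       # sum over e~' of c(e~', f~)
    for (_, f), c in counts.items():
        totals[f] += c
    table = defaultdict(dict)
    for (e, f), c in counts.items():
        table[f][e] = c / totals[f]
    return dict(table)

# Hypothetical extracted phrase pairs (target, source).
extracted = [("avenidas", "tracks")] * 3 + [("pistas", "tracks")]
table = relative_frequency(extracted)
```

Here the more frequent option receives probability 0.75 regardless of the domain in which each occurrence was observed, which is the kind of aggregated statistic that domain adaptation seeks to refine.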

The key to the training is extracting bilingual phrases from bilingual data. The standard way is to rely on the word-aligned training data, using a heuristic method such as grow-diag-final-and, grow-diag-final or final (Koehn et al. 2003).

3.2.1 Word alignment

We now discuss how to create word-aligned training data. Given a parallel sentence pair, we look for the most probable alignment between words, \(\hat{{\mathbf {a}}}\), as in (9):

$$\begin{aligned} \hat{{\mathbf {a}}} = \mathop {{{\mathrm{\arg \!\max }}}}\limits _{{\mathbf {a}}}P(\mathbf {f},\ {\mathbf {a}}|\ \mathbf {e}). \end{aligned}$$
(9)

The idea of word alignment can be traced back to Brown et al. (1990). The degree of difficulty of the search in Eq. (9) depends on the underlying independence assumptions. Even now, over twenty years since the IBM Models (Brown et al. 1993) and the HMM-based alignment model (Vogel et al. 1996), word alignment is still an active research topic (Simion et al. 2013; Chang et al. 2014; Tamura et al. 2014; Liu et al. 2015; Shen et al. 2015; Wang et al. 2015).

Fig. 1 HMM alignment model with observed and latent alignment layers

We now briefly review the HMM alignment model (Vogel et al. 1996), which is one of the most popular and widely used alignment models. The generative story of the model is shown in Fig. 1. The latent states rely on the target-language words and generate source-language words.

Formally, let us assume the target sentence \(\mathbf {e}\) contains I words \(\mathbf {e}\ =\ (e_1,\ \ldots ,\ e_I)\) and the source sentence \(\mathbf {f}\) contains J words \(\mathbf {f}\ =\ (f_1,\ \ldots ,\ f_J)\). For an alignment \({\mathbf {a}}\) \(=\) \((a_1,\ \ldots ,\ a_J)\) of the sentence pair \(\langle \mathbf {e},\ \mathbf {f}\rangle \), the model factors \(P(\mathbf {f},\ {\mathbf {a}}|\ \mathbf {e})\) into the word-translation and transition probabilities as in (10):

$$\begin{aligned} P(\mathbf {f},\ {\mathbf {a}}|\ \mathbf {e}) = \prod \nolimits _{j=1}^{J}P(f_j|\ e_{a_j})P(a_j|\ a_{j-1}). \end{aligned}$$
(10)

Here, \(P(f_j|\ e_{a_j})\) represents word-translation probabilities and \(P(a_j|\ a_{j-1})\) represents word-transition probabilities. In practice, \(P(a_j|\ a_{j-1})\) depends only on the distance \((a_j - a_{j-1})\). Note also that the first-order dependency model is an extension of the uniform dependency model of IBM Model 1 and the zero-order dependency model of IBM Model 2. With the HMM alignment model, the most probable alignment \(\hat{{\mathbf {a}}}\) for each sentence pair can be computed efficiently using the Viterbi algorithm.
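A minimal sketch of Viterbi search for the factorization in Eq. (10) is given below. It assumes a non-empty sentence pair, a uniform distribution for the first alignment position, distance-based transition weights that are not renormalized per sentence length, and a small floor for unseen events; all of these are simplifications of ours, and the probability tables are invented.

```python
import math

def viterbi_alignment(src, tgt, trans_prob, jump_prob, floor=1e-12):
    """Return the most probable alignment (a_1, ..., a_J) under Eq. (10).

    trans_prob maps (f, e) to P(f | e); jump_prob maps the distance
    a_j - a_{j-1} to a transition weight. Positions are 0-based.
    """
    I, J = len(tgt), len(src)

    def log_emit(j, i):
        return math.log(trans_prob.get((src[j], tgt[i]), floor))

    def log_jump(i_prev, i):
        return math.log(jump_prob.get(i - i_prev, floor))

    # delta[j][i]: best log score of aligning f_1..f_j with a_j = i.
    delta = [[log_emit(0, i) - math.log(I) for i in range(I)]]
    back = []
    for j in range(1, J):
        row, ptr = [], []
        for i in range(I):
            best_prev = max(range(I),
                            key=lambda ip: delta[-1][ip] + log_jump(ip, i))
            row.append(delta[-1][best_prev] + log_jump(best_prev, i)
                       + log_emit(j, i))
            ptr.append(best_prev)
        delta.append(row)
        back.append(ptr)

    # Trace back from the best final state.
    state = max(range(I), key=lambda i: delta[-1][i])
    alignment = [state]
    for ptr in reversed(back):
        state = ptr[state]
        alignment.append(state)
    alignment.reverse()
    return alignment

# Hypothetical parameters for a German-English toy pair.
t = {("das", "the"): 0.9, ("haus", "house"): 0.8,
     ("das", "house"): 0.05, ("haus", "the"): 0.1}
d = {0: 0.2, 1: 0.6, -1: 0.2}
monotone = viterbi_alignment(["das", "haus"], ["the", "house"], t, d)
swapped = viterbi_alignment(["haus", "das"], ["the", "house"], t, d)
```

The search takes \(O(J \cdot I^2)\) time, compared with \(I^J\) alignments in a naive enumeration.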

The HMM alignment model has two kinds of parameters: word-translation probabilities and transition probabilities. Adapting the expectation maximization (EM) algorithm (Dempster et al. 1977) for training the model is straightforward (Vogel et al. 1996). For the sake of completeness we present the algorithm in detail. We use \(c(f|\ e;\ \mathbf {f},\ \mathbf {e})\) to denote the expected count of word e aligning to word f. We also use \(c(i|\ i';\ \mathbf {f},\ \mathbf {e})\) to denote the expected count of two consecutive source positions j and \(j-1\) aligning to target positions i and \(i'\), respectively. Figure 2 presents the algorithm.

Fig. 2 Pseudocode for the training algorithm for the HMM alignment model. Note that \(P^{(c)}\) denotes current iteration estimates, \(P^{(+)}\) denotes the re-estimates and \(\delta \) denotes the Kronecker delta function. Note that \(P(\cdot |\ \cdot )\) \(=\) \(\sum _{{\mathbf {a}}}^{}P(\cdot ,\ {\mathbf {a}}|\ \cdot )\) which can be computed efficiently using dynamic programming
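The following sketch shows one way the expected counts could be collected with the forward-backward algorithm for a single sentence pair, followed by the re-estimation of the word-translation probabilities. It is not a reproduction of the pseudocode in Fig. 2: the uniform start, the distance-based transition table, the floor for unseen events and all names are assumptions made for illustration.

```python
def expected_counts(src, tgt, trans_prob, jump_prob, floor=1e-12):
    """E-step for one sentence pair: expected word and transition counts."""
    I, J = len(tgt), len(src)
    emit = [[trans_prob.get((src[j], tgt[i]), floor) for i in range(I)]
            for j in range(J)]
    jump = [[jump_prob.get(i - ip, floor) for i in range(I)]
            for ip in range(I)]  # jump[ip][i]: weight of moving from ip to i

    # Forward scores for the prefix f_1..f_j ending in a_j = i (uniform start).
    alpha = [[emit[0][i] / I for i in range(I)]]
    for j in range(1, J):
        alpha.append([
            emit[j][i] * sum(alpha[j - 1][ip] * jump[ip][i] for ip in range(I))
            for i in range(I)
        ])

    # Backward scores for the suffix f_{j+1}..f_J given a_j = i.
    beta = [[1.0] * I for _ in range(J)]
    for j in range(J - 2, -1, -1):
        beta[j] = [
            sum(jump[i][nxt] * emit[j + 1][nxt] * beta[j + 1][nxt]
                for nxt in range(I))
            for i in range(I)
        ]

    likelihood = sum(alpha[J - 1])  # sum over all alignments

    word_counts, jump_counts = {}, {}
    for j in range(J):
        for i in range(I):
            gamma = alpha[j][i] * beta[j][i] / likelihood
            key = (src[j], tgt[i])
            word_counts[key] = word_counts.get(key, 0.0) + gamma
    for j in range(1, J):
        for ip in range(I):
            for i in range(I):
                xi = (alpha[j - 1][ip] * jump[ip][i] * emit[j][i]
                      * beta[j][i] / likelihood)
                jump_counts[i - ip] = jump_counts.get(i - ip, 0.0) + xi
    return word_counts, jump_counts, likelihood

def reestimate_translation(word_counts):
    """M-step for word translation: normalize expected counts per target word."""
    totals = {}
    for (f, e), c in word_counts.items():
        totals[e] = totals.get(e, 0.0) + c
    return {(f, e): c / totals[e] for (f, e), c in word_counts.items()}

# Hypothetical current estimates for a German-English toy pair.
t = {("das", "the"): 0.9, ("haus", "house"): 0.8,
     ("das", "house"): 0.05, ("haus", "the"): 0.1}
d = {0: 0.2, 1: 0.6, -1: 0.2}
wc, jc, lik = expected_counts(["das", "haus"], ["the", "house"], t, d)
new_t = reestimate_translation(wc)
```

In a full training run the counts would be accumulated over all sentence pairs before re-estimation, and the transition counts would be normalized in the same way as the word counts.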

Does word alignment suffer from domain mismatch? A domain mismatch could have a negative impact on word-alignment accuracy, for example:

  • Word-alignment models, like any statistical models, suffer from lack of in-domain data for training (Duh et al. 2010; Shah et al. 2010; Gao et al. 2011).

  • The insensitivity of existing word-alignment models to domains often yields suboptimal results on large heterogeneous data (Gao et al. 2011; Cuong and Sima’an 2015).

In Sect. 8.2 we discuss this aspect in detail.

3.3 Decoding

Decoding for a phrase-based SMT system is a difficult problem. The search can be done by various approaches, such as beam search (Koehn 2004) or exact decoding (Chang and Collins 2011; Aziz et al. 2014). Among these competing approaches, beam search is probably the most popular decoding framework for phrase-based SMT systems. Starting from an initial hypothesis, given an input string of words, a number of phrase translations are applied to expand the current hypothesis until all words are marked as translated.

Beam search heuristically prunes the search space, and as a result, the search is inexact and search errors can occur as the best-scoring hypothesis is not necessarily optimal in terms of the given model parameters. Extensive prior work on minimum Bayes risk (MBR) objectives [cf. Kumar and Byrne (2004)] can potentially mitigate this issue. MBR methods select translations that are less ‘risky’ by taking the uncertainty in model predictions into account. Sect. 8.6 discusses a link between MBR and domain adaptation for SMT.
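The idea behind MBR selection can be illustrated on an N-best list: rather than returning the single most probable hypothesis, we return the one with the highest expected similarity to all others under the model posterior. The sketch below is a toy version; the Dice overlap over word sets merely stands in for a sentence-level metric such as BLEU, and the hypotheses and probabilities are invented.

```python
def unigram_overlap(a, b):
    """Toy similarity (an assumption of this sketch): Dice overlap of word sets."""
    x, y = set(a.split()), set(b.split())
    return 2 * len(x & y) / (len(x) + len(y))

def mbr_select(nbest, similarity):
    """Pick the hypothesis with the highest expected similarity.

    nbest is a list of (hypothesis, posterior probability) pairs.
    """
    def expected_gain(candidate):
        return sum(p * similarity(candidate, other) for other, p in nbest)
    return max((h for h, _ in nbest), key=expected_gain)

# Hypothetical N-best list with posterior probabilities.
nbest = [
    ("open the file", 0.4),
    ("opened with file", 0.3),
    ("opened with the file", 0.3),
]
map_choice = max(nbest, key=lambda hp: hp[1])[0]
mbr_choice = mbr_select(nbest, unigram_overlap)
```

In this example the most probable hypothesis is "open the file", whereas MBR prefers "opened with the file" because it agrees more with the rest of the probability mass.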

4 Translation errors when applied to new domains

Applying a phrase-based SMT system to new domains produces suboptimal translation in practice, e.g. Newswire (Foster et al. 2013), Medical (Irvine et al. 2013b), Patents (Wäschle and Riezler 2012), Transcribed Lectures (Federico et al. 2012), Web Blogs (Su et al. 2012; Foster et al. 2013), TED Talks (Duh et al. 2010; Mansour et al. 2011; Hasler et al. 2014), Subtitles (Irvine et al. 2013b), or Web Queries (Nikoulina et al. 2012). This section reviews different sources of translation errors when applied to new domains.

Table 1 Translation errors on an unseen domain

4.1 Lexical selection

Lexical selection appears to be the most common source of errors (Irvine et al. 2013a; Van Der Wees et al. 2015). We present some examples in Table 1. Here, we train a standard phrase-based SMT system for English–Spanish on a large dataset combined from multiple resources including EuroParl, Common Crawl Corpus, UN Corpus, and News Commentary. We then apply the system to a new domain of “Consumer and Industrial Electronics”. As shown in Table 1, incorrect translations are “can reproduce signs of audio” instead of “can play back audio signals”, “password teacher” instead of “master password”, “commenced with” instead of “opened with” file, and “Repeated all avenues” instead of “Repeat all tracks”.

An important question is what went wrong with lexical selection, i.e. what made the phrase-based SMT system suffer from degradation in lexical translation quality on new domains? The two main error types that cause the degradation are as follows (Irvine et al. 2013a):

  • SEEN/SENSE: an incorrect translation of a source-language word that is unobserved in the parallel training data (SEEN), or of a known source-language word whose correct target translation is unobserved in the parallel training data (SENSE).

  • SCORE: an incorrect translation that arises because the system prefers an incorrect translation path over an available correct one (i.e. incorrect ranking).

The majority of cases where degradation in lexical translation quality is seen are due to SEEN and SENSE errors. However, it is important to understand that improving coverage does not necessarily result in better translation quality. This leads to the error type SCORE, which is perhaps a much harder problem to address.

Fig. 3 Statistical translation framework

To provide a better understanding of the SCORE error, let us step back and reconsider how SMT models are estimated (Fig. 3). Statistical translation models are trained without integrating (likely hidden) domain information of the bilingual data. This results in coarse and domain-confused translation statistics that reflect translation preferences aggregated over different translation options with respect to different domains. Some translation options are more popular than others for a specific word or phrase in general. When it comes to a specific domain, however, it is likely that one of the rare translation options would be the most relevant one. A standard phrase-based SMT system is unlikely to be able to provide such a translation in this case, given that resulting domain-confused statistics are not expressive enough as they do not take domain information into account.

4.2 Reordering

Unlike lexical selection, it is not clear that reordering model adaptation improves translation. There is some evidence that it can, notably from Chen et al. (2013a) and Zhang et al. (2015). Chen et al. (2013a) give two potential reasons for an improvement in translation quality from reordering model adaptation:

  • some corpora may be better for training reordering models than others, and

  • there exist domain-dependent differences in reordering.

The first statement is intuitively plausible. Some data may contain noisy parallel sentences (e.g. comparable data), or simply too short parallel examples (e.g. Subtitles, Search Queries), which have a negative impact on parameter estimates (i.e. less accurate estimates).

Meanwhile, it is not at all obvious that reordering of phrase pairs is particularly domain-specific. Chen et al. (2013a) suggest that this is the case for Chinese–English and Arabic–English. They train lexicalized reordering models (Tillmann 2004; Koehn et al. 2007; Galley and Manning 2008) on different but high-quality parallel training data with specific genres. Their results show that the estimates of reordering parameters are significantly different between the corpora (e.g. the reordering probabilities estimated from News bilingual training data are different from those estimated from Legal bilingual data). It is, therefore, unsurprising that domain adaptation can help phrase-based SMT systems to improve reordering for Chinese–English as in Chen et al. (2013a).

However, it is unlikely that this would happen for all language pairs. Taking English–Spanish as an example, Cuong and Sima’an (2014a) train different lexicalized reordering models on a somewhat similar scenario with News parallel training data, including four sub-corpora: EuroParl, Common Crawl Corpus, UN Corpus, and News Commentary. They show that adapting reordering models for a new domain of Consumer and Industrial Electronics contributes only a minor translation improvement for this domain. Cuong et al. (2016a) show similar examples with English–Dutch.

As a side note, it is likely the case that dialect contributes to reordering behaviour, cf. Chen et al. (2013a) for Chinese, and Jeblee et al. (2014) for Egyptian Arabic. Domain adaptation with respect to this aspect (e.g. training lexicalized reordering models on different dialect bilingual training data) might, therefore, contribute reordering improvements.

4.3 Optimization

Domain mismatch between held-out development and test data is also an important source of errors. This is widely observed in many studies, e.g. Nikoulina et al. (2012), Pecina et al. (2012). In Table 2, we show a qualitative example. Specifically, we first train a phrase-based SMT system for English–German on a large dataset combined from multiple resources including EuroParl, Common Crawl Corpus and News Commentary. We then apply the system to a new domain of “Legal Service”, but with three different scenarios for system optimization:

  1. we optimize the system on an in-domain (Legal) held-out development set with 2K sentence pairs;

  2. we optimize the system on a mixed-domain held-out development set with 8K sentence pairs from a combination of different domains: the in-domain Legal held-out development set itself, plus three different held-out development sets of Software, Hardware and Professional & Business Services;

  3. we optimize the system on another mixed-domain held-out development set with 6K sentence pairs of Software, Hardware and Professional & Business Services. This is the mixed-domain held-out development set of the second setting without its in-domain part.

Note that there is no prior knowledge about the domain’s provenance of the mixed-domain held-out development set in the second and third settings.

Table 2 presents the translation performance of the phrase-based SMT system with respect to the different tuning scenarios. It can be seen quite clearly from the lower BLEU scores (Papineni et al. 2002) that moving to a new domain without having an in-domain held-out development set for system optimization can degrade the translation quality of a phrase-based SMT system. Note that our comparison may favour mixed-domain tuning scenarios: the mixed-domain held-out development sets are at least three times larger than the in-domain set, which presumably improves system optimization. In practice, the degradation in translation quality may be much more substantial, especially in a setting where the desired task is different from the held-out development set (e.g. Subtitles, Search Queries).

Table 2 Degradation in translation quality on a domain-specific translation task with different tuning scenarios
Fig. 4 Statistical translation framework with a combination of multiple K sub-models for translation

5 Domain adaptation: a general picture

A typical phrase-based SMT system contains various components, such as word alignment, language, translation and reordering models. This distinguishes SMT from most other Natural Language Processing tasks, and makes application of standard domain-adaptation methods less straightforward.

In general, the most popular approach to domain adaptation for SMT is to induce domain-focused translation statistics from seed in-domain data. Domain-focused translation statistics are typically domain-specific phrase translation probability distributions, lexical weighting and reordering probabilities. In the end, we can combine them together with the baseline ‘domain-confused’ translation features, or even replace the baseline features. This results in a statistical translation framework with a combination of multiple (sub-)models for translation. Figure 4 provides an illustration of the standard approach to domain adaptation for SMT.

Implementing such a framework, however, is non-trivial. Two main technical challenges are as follows:

  • The induction of domain-focused translation statistics: each type of prior knowledge (e.g. in-domain bilingual corpora, comparable corpora, monolingual corpora) requires a different model for inducing domain-focused translation statistics. Section 6 provides a systematic overview of previous approaches to the problem.

  • The combination of multiple (sub-)models for translation: the main objective is a combination model tailored to high-dimensional feature spaces, which is surprisingly hard to achieve. Sect. 7 reviews different combination models for adaptation.

Besides the two main research lines, previous work also considers other adaptation scenarios. This survey covers several adaptation trends (Sect. 8). We first review the induction of domain-focused sparse features and word-alignment probabilities (Sects. 8.1, 8.2). We also show how an existing system can be adapted to multiple specific domains at the same time (Sect. 8.3). Another scenario is applying an SMT system to web search queries (Sect. 8.4). We also discuss how web-based translation services can be improved when the domain of a new request is not known in advance (Sects. 8.5, 8.6).

6 Domain-specific translation induction for SMT

We start with induction using in-domain parallel data, and continue with comparable and monolingual corpora. We also discuss the induction with the domain’s provenance, which is special in that we are provided with a large corpus consisting of different domain-specific subcorpora that are not necessarily strictly related to the desired task.

6.1 Induction with in-domain parallel data

In many studies, a seed in-domain parallel corpus (\(\mathcal {C}_{IN}\)) exemplifying the target translation task is used as a form of prior knowledge for domain adaptation for SMT. The data, however, is very small compared with a mixed-domain corpus \(\mathcal {C}_{{OUT}}\). The main goal of translation induction with in-domain parallel corpora is inducing a phrase-based model from \(\mathcal {C}_{{OUT}}\) for adaptation. We now review the two most popular approaches to domain adaptation in this scenario: instance weighting and data selection.

Instance weighting

Instance weighting is perhaps the most effective approach to learning domain-focused translation statistics. To give some intuition about how instance weighting addresses the problem, in this general exposition we introduce a latent domain variable z to mark whether a phrase is in-domain (\(z_1\)) or out-of-domain (\(z_0\)). With the introduction of the latent variable, we expect to extend the translation tables in phrase-based models from domain-confused \(P(\tilde{e}|\ \tilde{f})\) to domain-focused by conditioning them on z, i.e. \(P(\tilde{e}|\ \tilde{f},\ z)\). Note how \(P(\tilde{e}|\ \tilde{f},\ z)\) contains \({P}(\tilde{e}|\ \tilde{f})\) as a special case, as in (14):

$$\begin{aligned} P(\tilde{e}|\ \tilde{f},\ z)&= \frac{{P}(\tilde{e}|\ \tilde{f}) {\textit{P}}(z|\ \tilde{e},\ \tilde{f})}{\sum _{\tilde{e}'} {P}(\tilde{e}'|\ \tilde{f}) {\textit{P}}(z|\ \tilde{e}',\ \tilde{f})}. \end{aligned}$$
(14)

Here \({\textit{P}}(z|\ \tilde{e},\ \tilde{f})\) is viewed as a latent phrase-relevance model, i.e. the probability that a phrase pair is in- (\(z_1\)) or out-of-domain (\(z_0\)). In the end, the adaptation can be performed by replacing the domain-confused tables \(P(\tilde{e}|\ \tilde{f})\) with the in-domain-focused ones \(P(\tilde{e}|\ \tilde{f},\ z_1)\), or simply by using these domain-focused models as additional features for the baseline phrase-based SMT system.
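To make Eq. (14) concrete, the following minimal sketch reweights a toy translation table by a phrase-relevance score and renormalises per source phrase. The function name, the dictionary layout and all numbers are assumptions made for illustration only.

```python
from collections import defaultdict


def domain_focused_table(p_e_given_f, p_z_given_ef):
    """Apply Eq. (14): reweight P(e|f) by P(z|e,f), then renormalise per source phrase.

    p_e_given_f:  dict mapping (e, f) -> P(e|f)
    p_z_given_ef: dict mapping (e, f) -> P(z|e,f) for one fixed value of z
    """
    unnormalised = {ef: p * p_z_given_ef[ef] for ef, p in p_e_given_f.items()}
    totals = defaultdict(float)
    for (e, f), score in unnormalised.items():
        totals[f] += score
    return {(e, f): score / totals[f]
            for (e, f), score in unnormalised.items() if totals[f] > 0}


# Hypothetical toy table for the source word 'code' with three candidate translations.
p_e_given_f = {("e_crypto", "code"): 0.5, ("e_software", "code"): 0.3, ("e_legal", "code"): 0.2}
# Hypothetical relevance scores P(z_1|e,f) for a software-development task.
p_in = {("e_crypto", "code"): 0.1, ("e_software", "code"): 0.9, ("e_legal", "code"): 0.1}
adapted = domain_focused_table(p_e_given_f, p_in)
```

When the relevance score is the same for every phrase pair, the renormalisation cancels it and the original table is recovered, which is the special case mentioned above.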

From Eq. (14), the main challenge of inducing \(P(\tilde{e}|\ \tilde{f},\ z)\) is inducing the latent phrase-relevance model \({\textit{P}}(z|\ \tilde{e},\ \tilde{f})\). Following Matsoukas et al. (2009), a fairly large body of work on domain adaptation for SMT embeds \({{P}}(z|\ \tilde{e},\ \tilde{f})\) in an asymmetric sentence-level model \(P(z|\ \mathbf e ,\ \mathbf f )\) for sentence pairs \(\langle \mathbf e , \mathbf f \rangle \). Specifically, the estimation of \(P(z|\ \tilde{e},\ \tilde{f})\) for phrases \(\tilde{e}\) and \(\tilde{f}\) can be simplified by computing \(P(z|\ \mathbf e ,\ \mathbf f )\) for sentence pairs \(\langle \mathbf e , \mathbf f \rangle \) as in (15):

$$\begin{aligned} P(z|\ \tilde{e},\ \tilde{f})&= \frac{\sum \nolimits _{\langle \mathbf {e},\ \mathbf {f}\rangle }^{}P(z|\ \mathbf e ,\ \mathbf f )\ c(\tilde{e};\ \mathbf e )\ c(\tilde{f};\ \mathbf f )}{\sum _{z' \in \{z_1,\ z_0\}}^{}\sum \nolimits _{\langle \mathbf {e},\ \mathbf {f}\rangle }^{} P(z'|\ \mathbf e ,\ \mathbf f )\ c(\tilde{e};\ \mathbf e )\ c(\tilde{f};\ \mathbf f )}. \end{aligned}$$
(15)

Here, \(c(\tilde{e};\ \mathbf e )\) and \(c(\tilde{f};\ \mathbf f )\) are the counts of phrases \(\tilde{e}\) and \(\tilde{f}\) in the sentence pair \(\langle \mathbf e ,\ \mathbf f \rangle \) of the training corpus.
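As an illustration of Eq. (15), the sketch below aggregates sentence-level relevance scores into a phrase-level score for \(z_1\). It counts raw phrase occurrences on each side and ignores word-alignment consistency, which a real phrase extractor would enforce; the function name and toy data are hypothetical.

```python
def phrase_relevance(sentence_pairs, p_in_sentence, e_phrase, f_phrase):
    """Eq. (15) for z = z_1: P(z_1 | e~, f~) from sentence-level P(z_1 | e, f).

    sentence_pairs: list of (e_tokens, f_tokens)
    p_in_sentence:  list of P(z_1 | e, f), one value per sentence pair
    e_phrase, f_phrase: tuples of tokens
    """
    def count(phrase, tokens):
        n = len(phrase)
        return sum(1 for i in range(len(tokens) - n + 1)
                   if tuple(tokens[i:i + n]) == tuple(phrase))

    num = 0.0
    den = 0.0
    for (e, f), p_in in zip(sentence_pairs, p_in_sentence):
        joint = count(e_phrase, e) * count(f_phrase, f)
        num += p_in * joint
        # The denominator sums over z_1 and z_0, and P(z_1|e,f) + P(z_0|e,f) = 1.
        den += joint
    return num / den if den > 0 else None


# Hypothetical toy corpus: one software-like and one legal-like sentence pair.
pairs = [(["the", "code", "runs"], ["le", "code", "tourne"]),
         (["the", "code", "of", "law"], ["le", "code", "de", "loi"])]
relevance = phrase_relevance(pairs, [0.9, 0.2], ("code",), ("code",))
```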

Fig. 5 The EM-based training algorithm for learning \({\textit{P}}(z|\ \tilde{e},\ \tilde{f})\) and \(P(z|\ \mathbf e ,\ \mathbf f )\) simultaneously

But how can the asymmetric sentence-level model be learned? A simple and straightforward way proposed by Cuong and Sima’an (2014a) is to devise an EM algorithm for learning (Fig. 5). At every iteration, in- or out-of-domain estimates provide full sentence pairs \({\langle {{\mathbf {e}}},\ {{\mathbf {f}}}\rangle }\) with probabilities \({P(z|\ {{\mathbf {e}}},\ {{\mathbf {f}}})}\). The latent phrase-relevance model parameters are then re-estimated using these expectations. Metaphorically, during each EM iteration the current in- or out-of-domain phrase pairs compete in inviting \(\mathcal {C}_{{OUT}}\) sentence pairs to be in- or out-of-domain, which in turn bring in new (weights for) in- and out-of-domain phrases.

Another approach is to build a logistic weighting model directly for the asymmetric sentence-level model. Specifically, a logistic weighting model maps a set of features \({\phi (\mathbf e ,\ \mathbf f )}\) with the parameter vector \(\mathbf {w}\) to a scalar weight in (0, 1). There are numerous types of sentence-level features that can be used, such as manual sub-corpus and genre membership, the number of source and target tokens, and the ratio of the number of tokens on the two sides. Interestingly, the parameter vector \(\mathbf {w}\) can be learned simultaneously with the log-linear model weight parameters so as to optimize the translation accuracy on a held-out development set. This approach was first proposed by Matsoukas et al. (2009).

An alternative approach to learning domain-focused translation statistics is directly building a discriminative model at phrase level. This approach is intuitively plausible, as a sentence itself may often contain a mixture of domains. In the work of Foster et al. (2010), the estimation of domain-focused phrase translation probabilities can be directly computed as in (16):

$$\begin{aligned} P(\tilde{e}|\ \tilde{f},\ IN) = \frac{c_{\mathbf {w}}(\tilde{e},\ \tilde{f})}{\sum _{\tilde{e}'}^{}c_{\mathbf {w}}(\tilde{e}',\ \tilde{f})}, \end{aligned}$$
(16)

where the modified count \(c_{\mathbf {w}}(\tilde{e},\ \tilde{f})\) is computed as in (17):

$$\begin{aligned} c_{\mathbf {w}}(\tilde{e},\ \tilde{f}) = \frac{1}{1+\exp (-\,\mathbf {w} \cdot \varvec{\phi }(\tilde{e}, \tilde{f}))} c(\tilde{e},\ \tilde{f}). \end{aligned}$$
(17)
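Equations (16) and (17) can be sketched in a few lines. The feature vectors, weights and counts below are hypothetical and only illustrate the mechanics; they are not the feature set of Foster et al. (2010).

```python
import math
from collections import defaultdict


def weighted_counts(counts, features, w):
    """Eq. (17): scale each raw count by a logistic function of w . phi(e, f)."""
    out = {}
    for ef, c in counts.items():
        score = sum(wi * xi for wi, xi in zip(w, features[ef]))
        out[ef] = c / (1.0 + math.exp(-score))
    return out


def in_domain_probs(counts, features, w):
    """Eq. (16): relative frequencies over the weighted counts, per source phrase."""
    cw = weighted_counts(counts, features, w)
    totals = defaultdict(float)
    for (e, f), c in cw.items():
        totals[f] += c
    return {(e, f): c / totals[f] for (e, f), c in cw.items()}


# Hypothetical toy data: two translations of source phrase 'x', two features each.
counts = {("a", "x"): 8, ("b", "x"): 2}
features = {("a", "x"): [1.0, 0.0], ("b", "x"): [1.0, 1.0]}
probs = in_domain_probs(counts, features, [0.0, 2.0])
```

With all weights set to zero every count is halved, so the plain relative frequencies are recovered; a positive weight on the second feature shifts probability mass towards the pair for which that feature fires.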

Learning the weight parameters \(\mathbf {w} = \{w_1, \ldots , w_K\}\) of K features for the logistic weighting model can be done using maximum likelihood or related criteria. More specifically, let us assume a held-out development set in which each sentence pair \(\langle \mathbf e ,\ \mathbf f \rangle \) contains a (multi-)set \(\mathcal {A}(\mathbf e ,\ \mathbf f )\) of extracted phrases \(\langle \tilde{e},\ \tilde{f}\rangle \). The objective function is the maximization of the likelihood over \(\mathcal {A}(\mathbf e ,\ \mathbf f )\) for all parallel sentences \(\langle \mathbf e ,\ \mathbf f \rangle \) in the development set with respect to \(\mathbf {w}\), as in (18):

$$\begin{aligned} {\hat{\mathbf {w}}} = \mathop {{{\mathrm{\arg \!\max }}}}\limits _{\mathbf {w}} \sum \limits _{\langle \mathbf e ,\ \mathbf f \rangle }\sum \limits _{\langle \tilde{e},\ \tilde{f}\rangle \in \mathcal {A}(\mathbf e ,\mathbf f )}\tilde{P}(\tilde{e}, \tilde{f})\log P(\tilde{e}|\ \tilde{f},\ IN). \end{aligned}$$
(18)

Here, note that \(\tilde{P}(\tilde{e},\ \tilde{f})\) is computed from all phrase pairs extracted from the held-out development set. The optimization problem can be solved using the popular L-BFGS algorithm, as shown in Foster et al. (2010). The algorithm requires computing the gradient of the log-probability, \(\frac{\partial \log P(\tilde{e}|\ \tilde{f},\ IN)}{\partial w_i}\), which is done as in (19):

$$\begin{aligned} \frac{\partial \log P(\tilde{e}|\ \tilde{f},\ IN)}{\partial w_i}&= \frac{1}{P(\tilde{e}|\ \tilde{f},\ IN)}\nonumber \\&\quad \bigg [\frac{c_{\mathbf {w}_i}(\tilde{e},\ \tilde{f})}{\sum _{\tilde{e}'}^{}c_{\mathbf {w}}(\tilde{e}',\ \tilde{f})} - \frac{c_{\mathbf {w}}(\tilde{e},\ \tilde{f})\sum _{\tilde{e}'}^{}c_{\mathbf {w}_i}(\tilde{e}',\ \tilde{f})}{\big (\sum _{\tilde{e}'}^{}c_{\mathbf {w}}(\tilde{e}',\ \tilde{f})\big )^2} \bigg ]. \end{aligned}$$
(19)

where:

$$\begin{aligned} c_{\mathbf {w}_i}(\tilde{e},\ \tilde{f}) = c_{\mathbf {w}}(\tilde{e},\ \tilde{f})\phi _i(\tilde{e},\ \tilde{f})\bigg ( \frac{\exp (-\,\mathbf {w} \cdot \varvec{\phi }(\tilde{e},\ \tilde{f}))}{1+\exp (-\,\mathbf {w} \cdot \varvec{\phi }(\tilde{e},\ \tilde{f}))}\bigg ). \end{aligned}$$
(20)

Both approaches have their own advantages and disadvantages. The EM-based approach strives for simplicity and is accordingly much easier to implement. A discriminative model for learning the relevance of sentence pairs and phrases in the parallel training data may well be more effective, but it requires feature engineering and is therefore more difficult to implement. To the best of our knowledge, however, a thorough empirical comparison of the approaches has yet to be conducted.

Note that using the same algorithm we can also adapt all other core translation components in tandem, including lexical weighting and lexicalized reordering models.

Data selection

Another approach to learning domain-focused translation statistics is selecting training data from a large corpus. Then, we can simply train a phrase-based SMT system on the selected data. The resulting translation statistics are presumably domain-focused. Data selection would naturally be less effective than instance weighting, as we strictly remove a lot of bilingual data that is (presumably) not relevant to a desired task. However, data selection has received considerable attention in recent years for two main reasons:

  1. Large bilingual training data comes with a cost: training phrase-based SMT systems on large data is extremely expensive and time-consuming;

  2. A small, well-selected subset of the data often outperforms the full dataset for training a phrase-based SMT system (Axelrod et al. 2011; Biçici and Yuret 2011; Mansour et al. 2011; Duh et al. 2013; Cuong and Sima’an 2014b; Kirchhoff and Bilmes 2014; Mansour and Ney 2014; Zhang and Chiang 2014; Poncelas et al. 2017).

Existing work can be roughly classified depending on what kind of information is used for selection. The most popular approach (Axelrod et al. 2011) selects sentence pairs using the cross-entropy difference between in- and out-of-domain language models (both source and target sides), as in (21):

$$\begin{aligned} rank(\mathbf {f},\ \mathbf {e}) = \underbrace{\bigg (H_{LM_{IN}}(\mathbf {f}) - H_{LM_{OUT}}(\mathbf {f})\bigg )}_{source\ side}+\underbrace{\bigg (H_{LM_{IN}}(\mathbf {e}) - H_{LM_{OUT}}(\mathbf {e})\bigg )}_{target\ side}. \end{aligned}$$
(21)

The cross-entropy is defined as in (22)–(23):

$$\begin{aligned} H_{LM}(\mathbf {f})&= -\,\frac{1}{m}\sum \nolimits _{i=1}^{m}\log P(f_i|\ f_{1}^{i-1}) \end{aligned}$$
(22)
$$\begin{aligned} H_{LM}(\mathbf {e})&= -\,\frac{1}{l}\sum \nolimits _{i=1}^{l}\log P(e_i|\ e_{1}^{i-1}) \end{aligned}$$
(23)

The method itself is a modification of the method proposed in Moore and Lewis (2010), which was introduced to address exactly the same problem we are discussing, but for only one side (i.e. monolingual data).
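The ranking in Eqs. (21)–(23) can be sketched with any pair of in- and out-of-domain language models. To stay self-contained, the sketch below substitutes add-one smoothed unigram models for real n-gram models, so the history \(f_{1}^{i-1}\) is ignored; the function names and toy sentences are hypothetical. Lower scores indicate sentence pairs that look more like the in-domain data.

```python
import math
from collections import Counter


def train_unigram(sentences):
    """Return (counts, total) for a unigram LM, a stand-in for a real n-gram LM."""
    counts = Counter(tok for sent in sentences for tok in sent)
    return counts, sum(counts.values())


def cross_entropy(sentence, lm, vocab_size):
    """Eqs. (22)-(23) with the history ignored and add-one smoothing."""
    counts, total = lm
    logp = sum(math.log((counts[t] + 1) / (total + vocab_size)) for t in sentence)
    return -logp / len(sentence)


def rank(f, e, lm_in_f, lm_out_f, lm_in_e, lm_out_e, vf, ve):
    """Eq. (21): source-side plus target-side cross-entropy difference."""
    return ((cross_entropy(f, lm_in_f, vf) - cross_entropy(f, lm_out_f, vf))
            + (cross_entropy(e, lm_in_e, ve) - cross_entropy(e, lm_out_e, ve)))


# Hypothetical toy corpora: in-domain is software-like, out-of-domain is legal-like.
lm_in_f = train_unigram([["compile", "the", "code"]])
lm_out_f = train_unigram([["the", "court", "ruled"]])
lm_in_e = train_unigram([["compiler", "le", "code"]])
lm_out_e = train_unigram([["le", "tribunal", "a", "statué"]])
vf, ve = 5, 6  # source and target vocabulary sizes of the toy corpora

score_a = rank(["the", "code"], ["le", "code"], lm_in_f, lm_out_f, lm_in_e, lm_out_e, vf, ve)
score_b = rank(["the", "court"], ["le", "tribunal"], lm_in_f, lm_out_f, lm_in_e, lm_out_e, vf, ve)
```

Selecting data then amounts to sorting the candidate sentence pairs by this score and keeping those with the lowest values, with the cut-off chosen empirically.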

More recent approaches (Mansour et al. 2011; Cuong and Sima’an 2014b; Mansour and Ney 2014) use translation model information. The idea is intuitively plausible: in the translation context, a source phrase often has different translations in different domains, which cannot be distinguished with monolingual language models. But how much should data selection depend on bilingual vs. monolingual factors? Cuong and Sima’an (2014b) present a comprehensive study of the contribution of these factors, showing that they actually complement each other for data selection.

One of the most difficult problems in data selection is to jointly learn translation and language models. An EM-based learning algorithm was first proposed by Cuong and Sima’an (2014b) to address the problem. However, a joint bilingual neural network model proposed by Devlin et al. (2014) might be a more powerful solution to the problem. Chen et al. (2016) were the first to deploy the joint bilingual neural network model to address the problem. In their work, promising data-selection performance is observed.

As a side note, data selection often goes hand in hand with data reduction for SMT (Eck et al. 2005; Lewis and Eetemadi 2013). Data reduction aims at reducing the size of data that is used for training, while at the same time impacting very little on quality.

6.2 Induction with comparable corpora

Creating an in-domain dataset is extremely expensive in practice. A cheaper approach to domain adaptation for SMT is mining comparable corpora (Snover et al. 2008; Daumé and Jagarlamudi 2011; Irvine et al. 2013b).

We now present two notable approaches as examples. The first approach is mining unseen words for an adaptation task (Daumé and Jagarlamudi 2011), which extends the approach described in Haghighi et al. (2008) to mining translations from comparable corpora. Learning bilingual lexicons from comparable corpora is obviously not an easy task [cf. Koehn and Knight (2002), Haghighi et al. (2008), Tamura et al. (2012)], and their mining is “bootstrapped” based on a bilingual dictionary that is created automatically from out-of-domain corpora. The output of the dictionary-mining approach is normally a list of (source and target) word pairs, with corresponding scores representing the word-translation probability. Perhaps surprisingly, a straightforward approach to incorporating the induced word pairs by having an additional feature representing dictionary-mining translation probability may not be helpful. A more effective way, as described in Daumé and Jagarlamudi (2011), is to have not only the dictionary-mining translation-probability feature, but also an additional feature to mark whether a phrase pair is seen in the source and target data or not.

The second approach, proposed by Irvine et al. (2013b), directly recovers the joint probability distribution of source and target word pairs on a new domain. Specifically, assume we have access to a joint distribution \(P_{\textit{OUT}}(f,\ e)\) over source and target word pairs \(\langle f,\ e\rangle \), estimated from an out-of-domain corpus. Let \(\tilde{P}(f)\) and \(\tilde{P}(e)\) be the empirical marginal distributions estimated from comparable corpora (i.e. we extract raw word frequencies from the corpora). Irvine et al. (2013b) cast the learning of the joint probability distribution of source and target word pairs on a new domain as a linear programming problem, as in (24):

$$\begin{aligned}&\hat{P}_{IN} = \mathop {{{\mathrm{\arg \!\min }}}}\limits _{P_{IN}}\Vert P_{IN}\ -\ P_{OUT}\Vert _1 = \mathop {{{\mathrm{\arg \!\min }}}}\limits _{P_{IN}}\sum _{\langle f,\ e\rangle }^{} \big |P_{IN}(f, e)\ -\ P_{OUT}(f, e)\big |, \end{aligned}$$
(24)

subject to:

$$\begin{aligned} \sum \limits _{\langle f,e\rangle }P_{IN}(f,e)&= 1, \sum \limits _{e}P_{IN}(f,e) = \tilde{P}(f), \\ \sum \limits _{f}P_{IN}(f,e)&= \tilde{P}(e), \text { and }P_{IN}(f,e)\ge 0. \end{aligned}$$

Here, the \(\ell _1\)-norm (\(\Vert \cdot \Vert _1\)) is used to measure the distance between two distributions. Regularization terms are usually added to Eq. (24) so that the solution is as sparse as possible. A linear programming solver can be used to learn \(P_{IN}(f,\ e)\) from Eq. (24).
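As a sketch of why a linear programming solver applies here, the absolute values in the \(\ell _1\) objective can be removed with the standard device of introducing one auxiliary variable per word pair. The variables \(t_{f,e}\) below are an assumption of this illustration and are not notation taken from Irvine et al. (2013b):

$$\begin{aligned} \min _{P_{IN},\ t}\ &\sum _{\langle f,\ e\rangle } t_{f,e} \\ \text {subject to}\ &-t_{f,e} \le P_{IN}(f,e) - P_{OUT}(f,e) \le t_{f,e} \quad \text {for all } \langle f,\ e\rangle , \end{aligned}$$

together with the marginal and non-negativity constraints on \(P_{IN}\) given above. At the optimum each \(t_{f,e}\) equals \(|P_{IN}(f,e) - P_{OUT}(f,e)|\), so both the objective and the constraints are linear in the unknowns.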

The method is perhaps one of the most elegant approaches to domain adaptation for SMT. It exploits cheap resources and shows significant improvement in translation quality on new domains.

6.3 Induction with monolingual data

Exploiting in-domain monolingual data is also an effective approach to domain adaptation for SMT. In general, synthetic bilingual data is first generated by using a phrase-based SMT system. Then, we can use the created data to induce domain-focused translation statistics (Schwenk 2008; Wu et al. 2008; Bertoldi and Federico 2009; Schwenk and Senellart 2009). Empirical results show that having in-domain monolingual data could substantially improve translation quality for a new domain, especially with in-domain monolingual data on the target side (Lambert et al. 2011).

Surprisingly, we can still derive improvements from incorporating induced domain-focused translation features into the baseline, even when the baseline is already augmented with induced domain-focused language-model features. As a side note, the adaptation of the reordering model gives consistent but modest improvements in this scenario (Schwenk 2008; Bertoldi and Federico 2009).

6.4 Induction with monolingual data and meta-information

Besides generating synthetic bilingual data, are there any other ways of adapting translation models with monolingual corpora? There has been an intensive line of research that focuses on translation-model adaptation using topic models (Gong et al. 2011; Eidelman et al. 2012; Su et al. 2012; Hewavitharana et al. 2013; Hasler et al. 2014; Hu et al. 2014). Such studies use the terms “topic” and “domain” interchangeably.

Assume we are provided with an out-of-domain parallel corpus \(\mathcal {C}_{OUT} = \{\mathcal {S}_{OUT},\ \mathcal {T}_{OUT}\}\), together with an in-domain monolingual corpus on the source side \(\mathcal {S}_{IN}\) only. Given the data, a general approach is building an adapted translation model in the following steps:

  • Step 1: Estimating topic models (e.g. Probabilistic Latent Semantic Analysis (Hofmann 1999), Latent Dirichlet Allocation (Blei et al. 2003), or Hidden Topic Markov Models (Gruber et al. 2007)) at document level in monolingual corpora;

  • Step 2: Estimating topic-specific translation models (i.e. conditioning the translation of phrase pairs on the topic information of source phrases);

  • Step 3: Estimating topic posterior distributions of phrases;

  • Step 4: Estimating phrase-translation probabilities using predefined topic-specific translation models and topic posterior distributions of phrases.

More formally, let us use \(P(z_{\mathbf{f }_{IN}}|\ \mathbf f )\) and \(P(z_{\mathbf{f }_{OUT}}|\ \mathbf f )\) to indicate how a sentence \(\mathbf{f }\) expresses a specific source-side topic in in- and out-of-domain monolingual corpora. The sentence-topic distributions are provided by topic models (Step 1).

Let us use \(P(\tilde{e}|\ \tilde{f}, z_{{\tilde{f}}_{OUT}})\) to indicate the probability of translating a phrase \(\tilde{f}\) as a phrase \(\tilde{e}\) given the source-side topic \(z_{{\tilde{f}}_{OUT}}\). The topic-specific translation models are estimated as in (25) (Step 2):

$$\begin{aligned} P(\tilde{e}|\ \tilde{f},\ z_{{\tilde{f}}_{OUT}})&= \frac{\sum \nolimits _{\langle \mathbf {e},\ \mathbf {f}\rangle \ \in \ \mathcal {C}_{OUT}}^{}P(z_{{\tilde{f}}_{OUT}}|\ \mathbf f )\ c(\tilde{f};\ \mathbf f )\ c(\tilde{e};\ \mathbf e )}{\sum _{\tilde{e}'}^{}\sum \nolimits _{\langle \mathbf {e},\ \mathbf {f}\rangle \ \in \ \mathcal {C}_{OUT}}^{}P(z_{{\tilde{f}}_{OUT}}|\ \mathbf f )\ c(\tilde{f};\ \mathbf f )\ c(\tilde{e}';\ \mathbf e )}. \end{aligned}$$
(25)

Let us use \(P(z_{{\tilde{f}}_{IN}}|\ \tilde{f})\) and \(P(z_{{\tilde{f}}_{OUT}}|\ \tilde{f})\) to denote the phrase-topic distributions. The distributions can be computed as in (26)–(27) (Step 3):Footnote 10

$$\begin{aligned} P(z_{{\tilde{f}}_{IN}}|\ \tilde{f})&= \frac{\sum \nolimits _{\mathbf {f}\ \in \ \mathcal {S}_{IN}}^{}P(z_{{\tilde{f}}_{IN}}|\ \mathbf f )\ c(\tilde{f};\ \mathbf f )}{\sum _{z_{{\tilde{f}}_{IN}}'}^{}\sum \nolimits _{\mathbf {f}\ \in \ \mathcal {S}_{IN}}^{}P(z_{{\tilde{f}}_{IN}}'|\ \mathbf f )\ c(\tilde{f};\ \mathbf f )}. \end{aligned}$$
(26)
$$\begin{aligned} P(z_{{\tilde{f}}_{OUT}}|\ \tilde{f})&= \frac{\sum \nolimits _{\mathbf {f}\ \in \ \mathcal {S}_{OUT}}^{}P(z_{{\tilde{f}}_{OUT}}|\ \mathbf f )\ c(\tilde{f};\ \mathbf f )}{\sum _{z_{{\tilde{f}}_{OUT}}'}^{}\sum \nolimits _{\mathbf {f}\ \in \ \mathcal {S}_{OUT}}^{}P(z_{{\tilde{f}}_{OUT}}'|\ \mathbf f )\ c(\tilde{f};\ \mathbf f )}. \end{aligned}$$
(27)

Finally, phrase-translation probabilities can be computed as in (28) (Step 4):

$$\begin{aligned} P(\tilde{e}|\ \tilde{f}) = \sum \nolimits _{z_{{\tilde{f}}_{IN}}}^{}\sum \nolimits _{z_{{\tilde{f}}_{OUT}}}^{}P(\tilde{e}|\ \tilde{f},\ z_{{\tilde{f}}_{OUT}})P(z_{{\tilde{f}}_{OUT}}|\ z_{{\tilde{f}}_{IN}})P(z_{{\tilde{f}}_{IN}}|\ \tilde{f}), \end{aligned}$$
(28)

where the topic-mapping probability distribution \(P(z_{{\tilde{f}}_{OUT}}|z_{{\tilde{f}}_{IN}})\) can be computed as in (29):Footnote 11

$$\begin{aligned} P(z_{{\tilde{f}}_{OUT}}|z_{{\tilde{f}}_{IN}}) = \sum \nolimits _{\tilde{f} \in \mathcal {S}_{IN}\ \cap \ \mathcal {S}_{OUT}}^{}P_{IN}(z_{{\tilde{f}}_{IN}}|\ \tilde{f})P_{OUT}(z_{{\tilde{f}}_{OUT}}|\ \tilde{f}). \end{aligned}$$
(29)

The estimate of \(P(\tilde{e}|\ \tilde{f})\) as in Eq. (28) can be used to replace the domain-confused translation probability. It can also simply serve as an additional feature to the baseline.
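Step 4 is a double marginalisation over topics, which the following sketch spells out for a single phrase pair. The list-based layout, the function name and the numbers are assumptions for illustration; in practice the three inputs would come from Eqs. (25), (29) and (26), respectively.

```python
def adapted_translation_prob(p_e_given_f_zout, p_zout_given_zin, p_zin_given_f):
    """Eq. (28) for one phrase pair: marginalise over in- and out-of-domain topics.

    p_e_given_f_zout[k]    : P(e~ | f~, z_OUT = k)
    p_zout_given_zin[j][k] : P(z_OUT = k | z_IN = j)
    p_zin_given_f[j]       : P(z_IN = j | f~)
    """
    total = 0.0
    for j, p_zin in enumerate(p_zin_given_f):
        for k, p_e in enumerate(p_e_given_f_zout):
            total += p_e * p_zout_given_zin[j][k] * p_zin
    return total


# Hypothetical values with two in-domain and two out-of-domain topics.
prob = adapted_translation_prob(
    p_e_given_f_zout=[0.9, 0.1],
    p_zout_given_zin=[[0.8, 0.2], [0.3, 0.7]],
    p_zin_given_f=[0.6, 0.4],
)
```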

In practice, it is also possible that instead of having only the source side \(\mathcal {S}_{IN}\) monolingual data, we are provided with an in-domain parallel corpus \(\mathcal {C}_{IN} = \{\mathcal {S}_{IN},\ \mathcal {T}_{IN}\}\). In that case, bilingual topic inference should be preferred to monolingual topic inference (Mimno et al. 2009; Hasler et al. 2014; Hu et al. 2014).

Using topic models for domain adaptation for SMT provides an effective way of quantifying the effect of the topical context information on translation selection. Using the same approach, we can adapt all other core translation components in tandem, including lexical weighting and lexicalized reordering models.

Meanwhile, the model has a potential drawback: most parallel corpora lack the annotation of document boundaries. Of course, a single sentence can be considered as a short pseudo-document, but it is questionable whether such a corpus with short pseudo-documents is topic-model ‘friendly’ (Tang et al. 2014).

6.5 Induction using a domain’s provenance

In practice, there are adaptation scenarios where we are provided with a large corpus consisting of different domain-specific subcorpora, where the subcorpora are manually grouped/annotated, but not necessarily strictly related to the desired task. In that scenario, it is still very useful to condition the lexical weighting features on provenance (Chiang et al. 2011). In the end, we can simply optimize the system with different types of domain-focused translation statistics on an in-domain held-out development set.

Another simple and elegant approach is to use a vector space model. Specifically, let us assume we are provided with a corpus consisting of N different domain-specific subcorpora. First, we create a vector profile for every phrase pair extracted from the training data, as in (30):

$$\begin{aligned} V_{training}(\tilde{f},\ \tilde{e}) = \bigg [w_1(\tilde{f},\ \tilde{e}), \ldots , w_N(\tilde{f},\ \tilde{e})\bigg ] \end{aligned}$$
(30)

Another vector profile is created for every phrase pair extracted from the in-domain held-out development set, as in (31):

$$\begin{aligned} V_{dev}(\tilde{f},\ \tilde{e}) = \bigg [w_1(\tilde{f},\ \tilde{e}), \ldots , w_N(\tilde{f},\ \tilde{e})\bigg ] \end{aligned}$$
(31)

In principle, each element \(w_i(\tilde{f},\ \tilde{e})\) of the vector can simply be the count of the phrase pair in the i-th subcorpus. A better approach, proposed by Chen et al. (2013b), is to adapt tf-idf statistics, a standard technique in information retrieval.

Then, we simply use the similarity score between these two types of vectors (e.g. based on the Bhattacharyya distance (Bhattacharyya 1946), the Kullback-Leibler distance (Kullback and Leibler 1951), or the cosine distance) as additional feature functions, which reward phrase pairs that are relevant to the desired task.
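Two of the similarity measures mentioned above can be sketched as follows. The vectors are hypothetical raw counts over N = 3 subcorpora, and the Bhattacharyya coefficient is used here as the overlap score underlying the Bhattacharyya distance; neither choice is claimed to match the exact configuration of Chen et al. (2013b).

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def bhattacharyya_coefficient(u, v):
    """Overlap between two non-negative vectors after normalising each to sum to 1."""
    su, sv = sum(u), sum(v)
    if su == 0 or sv == 0:
        return 0.0
    return sum(math.sqrt((a / su) * (b / sv)) for a, b in zip(u, v))


# Hypothetical profiles over three subcorpora.
v_dev = [5, 1, 0]       # profile derived from the in-domain development set
v_pair_a = [10, 2, 0]   # phrase pair distributed like the development set
v_pair_b = [0, 1, 9]    # phrase pair concentrated in an unrelated subcorpus
sim_a = cosine_similarity(v_pair_a, v_dev)
sim_b = cosine_similarity(v_pair_b, v_dev)
```

A phrase pair whose distribution over subcorpora resembles the development-set profile receives a score near 1, while one concentrated elsewhere scores near 0.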

The vector space model approach was first proposed by Chen et al. (2013b), and is a very effective adaptation technique for SMT. However, a domain’s provenance is not always available in practice. Despite the fact that topic models can automatically provide meta-information, experiments in this setting show only a modest improvement [cf. Hewavitharana et al. (2013)].

7 Model combination for adaptation

Domain-focused translation statistics, once induced, need to be combined together in an appropriate way. The main desire is to have a combination model tailored to high-dimensional feature spaces.

7.1 Log-linear mixture

Log-linear translation model mixtures (Birch et al. 2007; Koehn and Schroeder 2007) are of the form in (32):

$$\begin{aligned} \phi _{TM}(\mathbf e ,\ \mathbf f )&= \lambda \sum \nolimits _{i=1}^{n}\log P(\tilde{e}_i|\ \tilde{f}_{a_i}, {IN})\nonumber \\&\quad +(1-\lambda ) \sum \nolimits _{i=1}^{n}\log P(\tilde{e}_i|\ \tilde{f}_{a_i},\ {OUT}). \end{aligned}$$
(32)

Here, \(P(\tilde{e}_i|\ \tilde{f}_{a_i}, {IN})\) and \(P(\tilde{e}_i|\ \tilde{f}_{a_i}, {OUT})\) represent different types of domain-focused translation statistics with respect to IN and OUT. As in Eq. (32), they can be added to the baseline as additional features. There is also no further effort needed for training: the respective weights are set with any weight optimization method (e.g. MERT, MIRA, PRO).
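A minimal sketch of the feature in Eq. (32) for one derivation follows; the function name and probabilities are hypothetical. Note that a phrase pair with zero probability in either table makes the logarithm undefined, which is one reason implementations differ in how they treat pairs missing from a table.

```python
import math


def log_linear_feature(p_in, p_out, lam):
    """Eq. (32): weighted sum of log-probabilities over the n phrase pairs of a derivation.

    p_in[i], p_out[i]: in- and out-of-domain probabilities of the i-th phrase pair
    (both must be strictly positive).
    """
    return (lam * sum(math.log(p) for p in p_in)
            + (1.0 - lam) * sum(math.log(p) for p in p_out))


# Hypothetical derivation with two phrase pairs.
score = log_linear_feature([0.5, 0.2], [0.1, 0.4], lam=0.7)
```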

The implementation of a log-linear translation mixture model for adaptation can be slightly different in practice. It is common to leave the decoder as is (Razmara et al. 2012), but it is also possible to put constraints on hypotheses generated by the decoder (Birch et al. 2007; Koehn and Schroeder 2007). For instance, the decoder may only generate hypotheses that are contained in both in-domain and out-of-domain translation tables. The decoder may also generate hypotheses that are contained in each of the tables. An empirical comparison of the implementations, however, has yet to be thoroughly conducted, to the best of our knowledge.

This model has two potential drawbacks:

  1. In practice, it is common to have many sub-models, which leads to significantly longer search and potentially more search errors. This also makes system optimization even more challenging. It is not uncommon for such a log-linear mixture model to perform significantly worse than a system trained on a concatenation of all the data (Sennrich 2012; Wäschle and Riezler 2012);

  2. Having high-dimensional feature spaces requires a much larger held-out development set for system optimization (Waite and Byrne 2015). This is unrealistic in practice, as in-domain data is very expensive to annotate.

7.2 Linear mixture

Linear translation model mixtures are of the form in (33):

$$\begin{aligned} \phi _{TM}(\mathbf e , \mathbf f )&= \sum \nolimits _{i=1}^{n}\log \bigg (\lambda P(\tilde{e}_i|\ \tilde{f}_{a_i},\ {IN})+(1-\lambda ) P(\tilde{e}_i|\ \tilde{f}_{a_i},\ {OUT})\bigg ) \end{aligned}$$
(33)

An alternative form of linear combination is a maximum a posteriori (MAP) combination, as in (34):

$$\begin{aligned} \phi _{TM}(\mathbf e , \mathbf f )&= \sum \nolimits _{i=1}^{n}\log \bigg (\frac{c_{IN}(\tilde{e}_i,\ \tilde{f}_{a_i})+\lambda {P}(\tilde{e}_i|\ \tilde{f}_{a_i}, OUT) }{\sum _{\tilde{e}'}c_{IN}(\tilde{e}',\ \tilde{f}_{a_i})+\lambda } \bigg ) \end{aligned}$$
(34)
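The two forms in Eqs. (33) and (34) can be sketched side by side. The function names and argument layout are assumptions for illustration. Unlike the log-linear form, both remain well defined when a phrase pair is unseen in-domain, provided the out-of-domain probability is positive.

```python
import math


def linear_feature(p_in, p_out, lam):
    """Eq. (33): log of the interpolated probability, summed over phrase pairs."""
    return sum(math.log(lam * pi + (1.0 - lam) * po) for pi, po in zip(p_in, p_out))


def map_feature(c_in, c_in_total, p_out, lam):
    """Eq. (34): in-domain counts smoothed towards the out-of-domain model.

    c_in[i]       : c_IN(e_i, f_i)
    c_in_total[i] : sum over e' of c_IN(e', f_i)
    p_out[i]      : P(e_i | f_i, OUT)
    """
    return sum(math.log((c + lam * po) / (ct + lam))
               for c, ct, po in zip(c_in, c_in_total, p_out))


# Hypothetical phrase pair that never occurs in-domain (count 0 out of 10).
unseen_linear = linear_feature([0.0], [0.4], lam=0.5)
unseen_map = map_feature([0], [10], [0.4], lam=2.0)
```

In the MAP form, \(\lambda \) acts like a pseudo-count: with \(\lambda = 0\) the in-domain relative frequency is recovered, and as \(\lambda \) grows the estimate approaches the out-of-domain probability.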

This model was first proposed by Foster and Kuhn (2007), but training the model is not straightforward. It is desirable to optimize the weights of the baseline system \(\mathbf {w} = \{w_1, \ldots , w_M\}\) and the interpolation weight \({\lambda }\) directly for BLEU. This is possible (Foster et al. 2013; Haddow 2013), but very challenging to implement.Footnote 12 In practice, the most common approach is performing system optimization with a two-step procedure as follows:

  • First, we learn the interpolation weight by maximum likelihood or related criteria;

  • We hold the interpolation weight as constant, and optimize the log-linear weights as normal with any optimization method.

By isolating the task of learning log-linear weights, the problem of learning the interpolation weight is not hard (Foster et al. 2010; Sennrich 2012). Specifically, let us assume a held-out development set, in which each sentence pair \(\langle \mathbf e ,\ \mathbf f \rangle \) contains a (multi-)set \(\mathcal {A}(\mathbf e ,\ \mathbf f )\) of extracted phrases \(\langle \tilde{e}, \tilde{f}\rangle \). The objective function is the maximization of the likelihood over \(\mathcal {A}(\mathbf e ,\ \mathbf f )\) for all pairs \(\langle \mathbf e ,\ \mathbf f \rangle \) with respect to \(\varvec{\lambda }\), as in (35):

$$\begin{aligned} {\hat{\lambda }}= & {} \mathop {{{\mathrm{\arg \!\max }}}}\limits _{{{\lambda }}} \sum \limits _{\langle \mathbf e ,\ \mathbf f \rangle }\sum \limits _{\langle \tilde{e},\ \tilde{f}\rangle \in \mathcal {A}(\mathbf e ,\mathbf f )}\tilde{P}( \tilde{e}, \tilde{f})\nonumber \\&\quad \log \bigg (\lambda P(\tilde{e}|\ \tilde{f}, IN)+(1-\lambda ) P(\tilde{e}|\ \tilde{f}, OUT)\bigg ) \end{aligned}$$
(35)

If we are using MAP, the objective function of training is as in (36):

$$\begin{aligned} {\hat{\lambda }} = \mathop {{{\mathrm{\arg \!\max }}}}\limits _{{{\lambda }}} \sum \limits _{\langle \mathbf e ,\ \mathbf f \rangle }\sum \limits _{\langle \tilde{e},\ \tilde{f}\rangle \in \mathcal {A}(\mathbf e ,\mathbf f )}\tilde{P}( \tilde{e}, \tilde{f})\log \frac{c_{IN}(\tilde{e},\ \tilde{f})+\lambda P(\tilde{e}|\ \tilde{f}, OUT)}{\sum _{\tilde{e}'}^{}c_{IN}(\tilde{e}',\ \tilde{f}) + \lambda } \end{aligned}$$
(36)

Note that \(\tilde{P}(\tilde{e},\ \tilde{f})\) in both cases is computed from all phrase pairs extracted from the held-out development set.

Since the objective function is convex, the optimization can be done efficiently with EM (Carpuat et al. 2014) or the limited-memory BFGS algorithm (Sennrich 2012).Footnote 13 Gradient-based optimization requires computing the gradient of the objective with respect to \(\lambda \). For a single phrase pair, the gradient \(\frac{\partial }{\partial \lambda }\) of the log term is easy to compute in the first case, as in (37):

$$\begin{aligned} \frac{\partial }{\partial \lambda } = \bigg [ \frac{P(\tilde{e}|\ \tilde{f}, IN) - P(\tilde{e}|\ \tilde{f}, OUT)}{\lambda P(\tilde{e}|\ \tilde{f}, IN)+(1-\lambda ) P(\tilde{e}|\ \tilde{f}, OUT)} \bigg ] \end{aligned}$$
(37)

If we are using MAP, the gradient is slightly different, as in (38):

$$\begin{aligned} \frac{\partial }{\partial \lambda } = \frac{-\sum _{\tilde{e}'}^{}c_{IN}(\tilde{e}',\ \tilde{f})}{\big (\sum _{\tilde{e}'}^{}c_{IN}(\tilde{e}',\ \tilde{f}) + \lambda \big )^2} \bigg [ \frac{P(\tilde{e}|\ \tilde{f}, IN) - P(\tilde{e}|\ \tilde{f}, OUT)}{\frac{c_{IN}(\tilde{e},\ \tilde{f})+\lambda P(\tilde{e}|\ \tilde{f}, OUT) }{\sum _{\tilde{e}'}c_{IN}(\tilde{e}',\ \tilde{f})+\lambda } } \bigg ] \end{aligned}$$
(38)
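For the linear mixture in Eq. (35), the textbook EM update for a two-component mixture weight can be sketched as follows. This is a generic illustration under assumed toy inputs, not the specific implementation of Carpuat et al. (2014); the gradient helper simply sums the per-pair terms of Eq. (37) weighted by \(\tilde{P}(\tilde{e},\ \tilde{f})\).

```python
def em_interpolation_weight(pairs, lam=0.5, iterations=200):
    """Maximise Eq. (35) over lambda with EM.

    pairs: list of (weight, p_in, p_out), where weight plays the role of P~(e~, f~).
    """
    total = sum(w for w, _, _ in pairs)
    for _ in range(iterations):
        # E-step: posterior that each phrase pair was generated by the in-domain model.
        resp = [(w, lam * pi / (lam * pi + (1.0 - lam) * po)) for w, pi, po in pairs]
        # M-step: the new lambda is the weighted average of these posteriors.
        lam = sum(w * r for w, r in resp) / total
    return lam


def gradient(pairs, lam):
    """Weighted sum over phrase pairs of the per-pair gradient in Eq. (37)."""
    return sum(w * (pi - po) / (lam * pi + (1.0 - lam) * po) for w, pi, po in pairs)


# Hypothetical development-set statistics: (P~(e,f), P(e|f,IN), P(e|f,OUT)).
dev_pairs = [(0.5, 0.6, 0.2), (0.3, 0.1, 0.5), (0.2, 0.4, 0.4)]
lam_hat = em_interpolation_weight(dev_pairs)
```

At an interior optimum the gradient vanishes, which provides a simple sanity check on the value returned by EM.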

A linear translation model mixture is perhaps the most common combination model for adaptation. Compared with the log-linear mixture, it often works better with high-dimensional feature spaces. However, the model has two potential drawbacks:

  1. The maximum likelihood or related criteria may not correlate well with translation accuracy. It is not uncommon that assigning optimized weights underperforms compared to uniform weights;

  2. The performance would likely suffer from combining too many (e.g. more than 10) sub-models, leaving an open question of how best to design a combination model tailored to very high-dimensional feature spaces.

7.3 Fill-up

A very simple approach that performs competitively with log-linear and linear translation model mixtures is Fill-up. The idea of Fill-up was first proposed by Besling and Meier (1995) for addressing the problem of language model adaptation for speech recognition. It was first introduced in SMT by Nakov (2008), and first used in domain adaptation for SMT in the work of Bisazza et al. (2011).

Let us assume we have two translation tables \(T_{IN}\) and \(T_{OUT}\), with their corresponding phrase translation probabilities \(P(\tilde{e}|\ \tilde{f},\ IN)\) and \(P(\tilde{e}|\ \tilde{f},\ OUT)\), respectively. A Fill-up table \(T_{FILL UP}\) is defined as in (39):

$$\begin{aligned}&\forall (\tilde{f},\tilde{e})\in T_{IN}\ \cup \ T_{OUT}:\nonumber \\&T_{FILL UP}(\tilde{f},\ \tilde{e}) = {\left\{ \begin{array}{ll} \{P(\tilde{e}|\ \tilde{f},\ IN), \exp (0)\} &{} \text {if } (\tilde{f},\ \tilde{e}) \in T_{IN}\\ \{P(\tilde{e}|\ \tilde{f},\ OUT), \exp (1)\} &{} \text {otherwise.}\\ \end{array}\right. } \end{aligned}$$
(39)

Here, the entries of \(T_{FILL UP}\) correspond to the union of the two phrase tables, in which we consider \(T_{IN}\) as the more reliable source and use it whenever possible. The exponential function (i.e. \(\exp (0)\) and \(\exp (1)\)) is to mark whether a phrase pair is in-domain (\(T_{IN}\)) or out-of-domain (\(T_{OUT}\)).Footnote 14
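The definition in Eq. (39) translates almost directly into code. In the sketch below each table is a plain dictionary from phrase pairs to probabilities; the function name and the toy entries are hypothetical.

```python
import math


def fill_up(t_in, t_out):
    """Eq. (39): keep in-domain entries and back off to out-of-domain ones otherwise.

    Each input maps (f, e) -> probability.
    The result maps (f, e) -> (probability, provenance indicator).
    """
    merged = {}
    for pair in set(t_in) | set(t_out):
        if pair in t_in:
            merged[pair] = (t_in[pair], math.exp(0))
        else:
            merged[pair] = (t_out[pair], math.exp(1))
    return merged


# Hypothetical toy tables for the source word 'code'.
t_in = {("code", "e_software"): 0.7}
t_out = {("code", "e_software"): 0.2, ("code", "e_legal"): 0.5}
table = fill_up(t_in, t_out)
```

In this toy example the probabilities attached to the source word sum to 1.2, which illustrates that the merged entries need not form a proper distribution.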

Simplicity is perhaps the main advantage of Fill-up. The model, however, has two potential drawbacks:

  • It remains unclear whether the approach is able to scale to many sub-models. Such an empirical evaluation has yet to be thoroughly conducted, to the best of our knowledge.

  • Translation probabilities in \(T_{FILL UP}\) do not form a full probability distribution. This is potentially problematic: interactions between features can be complex and log-linear models may not be able to handle the interactions.

8 Other trends in domain adaptation

This survey covers several other adaptation trends. We first review the induction of domain-focused sparse features and word-alignment probabilities (Sects. 8.1, 8.2). We also show how an existing system can be adapted to multiple specific domains at the same time (Sect. 8.3). Another scenario is applying an SMT system to web search queries (Sect. 8.4). We also discuss how web-based translation services can be improved when the domain of a new request is not known in advance (Sects. 8.5, 8.6).

8.1 Adaptation with sparse features

Having in-domain sparse feature functions is particularly useful when applying a phrase-based SMT system to new domains (Bertoldi and Federico 2009; Hasler et al. 2012; Green et al. 2013, 2014). This is because sparse features allow for more flexibility than dense features, but at the risk of increasing the difficulty of the optimization. Applying cross-validation techniques (e.g. jackknife training (Hasler et al. 2012)) is often very useful to avoid overfitting.

Fig. 6 Latent domain HMM alignment model. An additional latent layer representing domains has been conditioned on by both the remaining layers

8.2 Domain adaptation for word alignment

There is some evidence to support the claim that, like any statistical model, word-alignment models suffer significantly from a lack of in-domain training data. Wu et al. (2005) train different alignment models independently on different domain-specific subcorpora, and show that an interpolation of the alignment models improves word-alignment accuracy.

Similar findings are reported in Duh et al. (2010) and Gao et al. (2011). Duh et al. suggest that training a phrase-based SMT system might benefit from the following simple trick: they first train statistical alignment models on a concatenation of the in-domain data and a much larger out-of-domain dataset, and then exclude the out-of-domain data during phrase extraction. Gao et al. show that an interpolation of domain-specific and general-domain alignment models improves translation accuracy.

As a side note, Shah et al. (2010) show that weighting sentence pairs according to their relevance to a new domain benefits word-alignment training.

Recently, Cuong and Sima’an (2015) provide an in-depth study of domain adaptation for word alignment. They focus on the insensitivity of existing word-alignment models to domain differences, which often yields suboptimal results on heterogeneous corpora (e.g. EuroParl, Common Crawl Corpus, UN Corpus, and News Commentary). A latent domain word-alignment model is proposed, which explicitly incorporates latent domain information in learning domain-focused lexical and alignment statistics. Figure 6 presents such a case with a latent domain HMM alignment model. Cuong and Sima’an (2015) train the model on a heterogeneous corpus, using a small number of seed samples from different domains. Their experiments show that the derived domain-focused statistics, once combined together, produce significant improvements both in word alignment and translation.

8.3 Multi-domain adaptation for SMT

A common scenario in practice is adapting an existing system to multiple domain-specific tasks at the same time, which is clearly a challenging problem.

8.3.1 Adaptation with multi-task learning

The main approach is to optimize an SMT system in a way that exploits commonalities shared among different tasks (Wäschle and Riezler 2012). More formally, let us use \(\{\hat{\mathbf {w}}_1,\ \ldots ,\ \hat{\mathbf {w}}_K\}\) to denote a set of model parameters with respect to K different domains. The commonalities shared among different tasks are modeled by the average parameter vector, as in (40):

$$\begin{aligned} \mathbf w _{AVG} = \frac{1}{K}\sum \nolimits _{d=1}^{K}{} \mathbf w _d. \end{aligned}$$
(40)

In the end, the goal is to learn model parameters that minimize the objective function, as in (41):

$$\begin{aligned} \{\hat{\mathbf {w}}_1,\ \ldots ,\ \hat{\mathbf {w}}_K\}&= \mathop {\arg \min }\limits _{\mathbf {w}_1, \ldots , \mathbf {w}_K}\sum \nolimits _{d=1}^{K} loss_d(\mathbf {w}_d)\nonumber \\&\quad +\lambda \sum \nolimits _{d=1}^{K}\Vert \mathbf {w}_d-\mathbf {w}_{AVG}\Vert _1. \end{aligned}$$
(41)

Here, the parameter \(\lambda \) controls the influence of the regularization, which trades off between task-specific parameter vectors and their distance to the average. Meanwhile, we use \(loss_d(\mathbf w _d)\) to represent a translation loss function on the held-out development set from task d. The optimization problem can be solved using gradient-descent optimization with \(l_1\)-regularization (Tsuruoka et al. 2009; Wäschle and Riezler 2012).
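As an illustration of (40) and (41), the sketch below evaluates the regularized multi-task objective for a given set of task-specific weight vectors. It only computes the objective value; the optimizer itself is not shown. The function names and the placeholder loss functions are our own assumptions, standing in for a translation loss measured on each task's development set.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def average_weights(weights: Sequence[Vector]) -> Vector:
    """Average parameter vector w_AVG, as in (40)."""
    k = len(weights)
    dim = len(weights[0])
    return [sum(w[j] for w in weights) / k for j in range(dim)]


def multitask_objective(weights: Sequence[Vector],
                        losses: Sequence[Callable[[Vector], float]],
                        lam: float) -> float:
    """Objective of (41): task losses plus an l1 penalty on the distance
    of each task-specific vector to the average vector."""
    w_avg = average_weights(weights)
    task_loss = sum(loss(w) for loss, w in zip(losses, weights))
    penalty = sum(abs(wj - aj) for w in weights for wj, aj in zip(w, w_avg))
    return task_loss + lam * penalty


# Two tasks with two features each; the losses are placeholders (assumption).
weights = [[1.0, 0.0], [3.0, 0.0]]
losses = [lambda w: 0.0, lambda w: 0.0]
value = multitask_objective(weights, losses, lam=0.5)
```

With larger values of `lam`, parameter vectors that drift away from the average are penalized more heavily, so the tasks are pushed towards shared weights.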

While this method is intuitively plausible, it gives only a modest translation improvement (Wäschle and Riezler 2012). Different variants are proposed in the literature (Simianer et al. 2012; Cui et al. 2013) which show a potentially more promising performance.

8.3.2 Adaptation with genre-aware decoding

Another interesting approach to multi-domain adaptation for SMT is using a genre-aware classifier (Wang et al. 2012). The core of this approach is a source-sentence genre classifier that identifies the most relevant domain for each source sentence. In this way, the MT system is configured to use the proper domain feature weights and the appropriate domain language model. Note that the system of Wang et al. (2012) uses a single translation model to serve different domains. This allows the system to scale more easily to many domains, but makes tuning and decoding more difficult. Wang et al. (2012) introduce simple genre-aware decoding and tuning techniques to address the problem. Their experiments show that the proposed system is capable of producing better domain-specific translations while simultaneously preserving the quality of general-domain translations.

8.4 Cross-lingual information retrieval

A practical real-world problem is translating web search queries into several target languages, so that a search engine can return search results in the corresponding languages. The quality of a translation component thus plays a crucial role. The problem, however, is particularly difficult for three specific reasons:

  1. Translation quality degrades substantially when a generic phrase-based SMT system is applied to a domain-specific task. This is particularly true for search queries because of their unique characteristics: they are very short (just a couple of words per query), and their word order typically differs from that of a natural-language sentence.

  2. A phrase-based SMT system is usually trained to optimize translation quality, which is not necessarily correlated with retrieval quality, especially for short queries (Kettunen 2009; Nikoulina et al. 2012). For example, word order, which is crucial for translation quality, is often ignored by IR models: retrieval systems often use bag-of-words representations in document-scoring models, and queries are rarely grammatical natural-language sentences.

  3. Only a few tiny corpora of parallel queries (e.g. CLEF tracks) can be obtained.

A very simple, yet effective approach to improving adaptation for CLIR is reranking the N-best translation candidates generated by a baseline system (Nikoulina et al. 2012). Note that a re-ranker should be optimized to maximize a retrieval metric rather than translation accuracy. Putting constraints on hypotheses generated by the decoder is another approach to improving adaptation for CLIR (Dong et al. 2014; Hieber and Riezler 2015). While the latter approach may be more efficient, such an implementation is obviously far more complicated.
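The reranking idea can be sketched as follows: each N-best candidate carries a feature vector, and a linear model orders the candidates. This is a minimal sketch under our own assumptions; the feature names (`mt_score`, `query_term_coverage`) and the weight values are hypothetical, and in practice the weights would be tuned to maximize a retrieval metric on held-out queries.

```python
from typing import Dict, List, Tuple

Hypothesis = Tuple[str, Dict[str, float]]  # (translation, feature values)


def rerank(nbest: List[Hypothesis], weights: Dict[str, float]) -> List[str]:
    """Order N-best translation candidates by a linear model score."""
    def score(features: Dict[str, float]) -> float:
        return sum(weights.get(name, 0.0) * value
                   for name, value in features.items())

    ranked = sorted(nbest, key=lambda hyp: score(hyp[1]), reverse=True)
    return [text for text, _ in ranked]


# Hypothetical 2-best list for one query.
nbest = [
    ("candidate a", {"mt_score": -1.0, "query_term_coverage": 0.2}),
    ("candidate b", {"mt_score": -2.0, "query_term_coverage": 1.0}),
]
by_translation = rerank(nbest, {"mt_score": 1.0, "query_term_coverage": 0.0})
by_retrieval = rerank(nbest, {"mt_score": 0.1, "query_term_coverage": 2.0})
```

The two weight settings produce opposite orders, which reflects the point made above: the candidate preferred by the translation model need not be the one that is most useful for retrieval.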

8.5 Cache-based adaptive models for translation adaptation

A common scenario in practice, particularly for web-based translation services such as Bing Translator and Google Translate, is that the domain of a translation request is not known in advance. A common approach is to exploit two general phenomena in natural language and translation:

  1. Repetition and recency effects of words: many words, especially content words, are repeated in close context;

  2. Consistency of translations: the translation of content words is consistent given a specific context.

The two phenomena provide us with a natural way to perform fully unsupervised domain adaptation on a new domain: a phrase-based SMT system performs the translation of a sentence by not only considering the sentence itself, but also taking the translation history of recent input sentences into account.

Accounting for these phenomena in translation is fairly simple, using a cache-based adaptive model (Kuhn and De Mori 1992). More specifically, Tiedemann (2010) develops two cache-based adaptive models, which we now describe.

Cache-based adaptive language model Tiedemann (2010) uses a dynamic cache-based adaptive language model in the form of a linear mixture as in (42):

$$\begin{aligned} P(e_n|\ e_{n-k},\ \ldots ,\ e_{n-1})&= (1\ -\ \lambda )P(e_n|\ e_{n-k},\ \ldots ,\ e_{n-1},\ OUT)\nonumber \\&\quad + \lambda P(e_n|\ e_{n-k},\ \ldots ,\ e_{n-1},\ CACHE) \end{aligned}$$
(42)

Here, the cache stores the best translation hypotheses of previous sentences. The cache is kept small (e.g. 100–5000 words). The value of the interpolation weight \(\lambda \) can be set manually, or learned automatically with the EM algorithm.

Implementing the model as a simple unigram model is a good option, but a better solution in practice would be introducing a decay factor in the estimation of cache probabilities, as in (43):

$$\begin{aligned} P(e_n|\ e_{n-k},\ \ldots ,\ e_{n-1},\ CACHE) \propto \sum \nolimits _{i=n-k}^{n-1}\delta (e_n\ =\ e_i)\exp \bigg (-\alpha (n-i)\bigg ) \end{aligned}$$
(43)

This approach was first introduced by Clarkson and Robinson (1997). Here, \(\delta \) is the Kronecker delta function. The decay rate \(\alpha \) is normally set to a very small value [e.g. 0.005 as in Clarkson and Robinson (1997)].
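The decaying cache of (43) and the linear mixture of (42) can be sketched as below. Since (43) is stated only up to proportionality, the normalization over the cache contents is our own assumption, as are the function names; `p_out` stands for the probability given by the background language model.

```python
import math
from typing import Sequence


def cache_score(word: str, history: Sequence[str], alpha: float = 0.005) -> float:
    """Unnormalized cache score of (43).

    `history` holds the cached words in order, with history[-1] being the
    word immediately before the predicted position, so that the distance
    (n - i) of the most recent word is 1.
    """
    n = len(history)
    return sum(math.exp(-alpha * (n - i))
               for i, past in enumerate(history) if past == word)


def cache_prob(word: str, history: Sequence[str], alpha: float = 0.005) -> float:
    """Cache probability, normalized over the cache contents (assumption)."""
    n = len(history)
    total = sum(math.exp(-alpha * (n - i)) for i in range(n))
    return cache_score(word, history, alpha) / total if total > 0 else 0.0


def interpolated_prob(p_out: float, p_cache: float, lam: float) -> float:
    """Linear mixture of (42)."""
    return (1.0 - lam) * p_out + lam * p_cache


history = ["the", "code", "is", "code"]
p = interpolated_prob(p_out=0.1, p_cache=cache_prob("code", history, alpha=0.0),
                      lam=0.2)
```

With `alpha=0.0` the cache reduces to a plain unigram model over its contents; a positive decay rate gives more weight to recently produced words.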

Cache-based adaptive translation model Tiedemann (2010) develops a cache-based adaptive translation model in a similar manner, using a decay factor to compute translation model scores from the cache, as in (44):

$$\begin{aligned} P(\tilde{e}_n|\ \tilde{f}_n,\ CACHE) \propto \sum \nolimits _{i=1}^{K}\delta \bigg (\langle \tilde{e}_n,\ \tilde{f}_n\rangle \ =\ \langle \tilde{e}_i,\ \tilde{f}_i\rangle \bigg )\exp \bigg (-\alpha i\bigg ) \end{aligned}$$
(44)

The cache-based adaptive models can be integrated into a phrase-based SMT system in a straightforward manner: both can be used to replace the language and translation models, or to serve as additional feature functions within a log-linear model. In the end, the decoder is forced to prefer identical translations for repeated terms.
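The translation-model score of (44) can be sketched in the same way. We assume here that the cache is an ordered list of phrase pairs in which position \(i=1\) is the most recently used pair; this ordering, the function name and the example pairs are our own assumptions for illustration.

```python
import math
from typing import Sequence, Tuple

PhrasePair = Tuple[str, str]  # (target phrase, source phrase)


def cache_tm_score(pair: PhrasePair,
                   cache: Sequence[PhrasePair],
                   alpha: float = 0.005) -> float:
    """Unnormalized cache translation score of (44).

    cache[0] is assumed to be the most recently used phrase pair (age 1).
    Pairs that do not occur in the cache receive a score of 0.
    """
    return sum(math.exp(-alpha * age)
               for age, cached in enumerate(cache, start=1)
               if cached == pair)


cache = [("код", "code"), ("закон", "law"), ("код", "code")]
recent = cache_tm_score(("код", "code"), cache, alpha=1.0)
unseen = cache_tm_score(("шифр", "code"), cache, alpha=1.0)
```

When such a score is used as an additional feature function, a phrase pair that was chosen recently scores higher than a competing pair absent from the cache, which is how the decoder comes to prefer consistent translations for repeated terms.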

While using cache-based adaptive models is an elegant approach, Tiedemann observes that the adaptation effect is rather modest. Nor is it terribly robust; it is not uncommon that an augmented SMT system produces a rather suboptimal translation. There are two potential reasons for this:

  • First, it would be risky to assume that previous translation hypotheses are good enough to be cached [cf. the risk of error propagation (Tiedemann 2010)].

  • Second, using the translation of initial sentences in the input stream may not be so beneficial.

Potential solutions to these problems are quite straightforward (Gong et al. 2011; Louis and Webber 2014). For instance, in the work of Gong et al. (2011), the cache stores sentence pairs from the bilingual training data whose target side is similar to the translation hypotheses, instead of the translation hypotheses themselves. As a side note, other types of cache can be developed to improve adaptation, e.g. a topic cache in addition to a cache of phrase pairs, as in Gong et al. (2011).

8.6 Rewarding domain invariance for adaptation

When the target domain is unknown at training time, the system could also be trained to make safer choices, preferring translations which are likely to work across different domains. As we pointed out early on, when translating from English to Russian, the most natural translation for the word ‘code’ would be highly dependent on the domain (and the corresponding word sense). Russian words ‘шифр’ (‘cipher’), ‘закон’ (‘law’) or ‘программа’ (‘program’) would perhaps be optimal choices if we consider cryptography, legal and software development domains, respectively. However, the translation ‘код’ (‘code’) is also acceptable across all these domains and, as such, would be a safer choice when the target domain is unknown. Note that such a translation may not be the most frequent overall and, consequently, might not be proposed by a standard (i.e. domain-agnostic) phrase-based translation system.

Fig. 7 The projection framework of phrases into a K-dimensional vector space of probabilistic latent subdomains

In order to encode a preference for domain-invariant translations, we can first project phrases onto a compact \((K-1)\) dimensional simplex of subdomains with vectors, as in (45)–(46):

$$\begin{aligned} \tilde{\mathbf {e}}&= \bigg [P(z=1|\ \tilde{e}),\ \ldots ,\ P(z=K|\ \tilde{e})\bigg ] \end{aligned}$$
(45)
$$\begin{aligned} \tilde{\mathbf {f}}&= \bigg [P(z=1|\ \tilde{f}),\ \ldots ,\ P(z=K|\ \tilde{f})\bigg ]. \end{aligned}$$
(46)

See Fig. 7 for an illustration of the projection framework.

Of course, the subdomains are usually not specified in the heterogeneous training data. We can treat the subdomains as latent, and induce them automatically (Cuong et al. 2016b). In the end, we can use a relevant measure to quantify how likely a phrase (or a phrase-pair) is to be “domain-invariant”, for instance:

  • Domain-specificity of phrases A rule with source and target phrases having a peaked distribution over latent subdomains is likely domain-specific. Technically speaking, entropy is a natural choice for quantifying domain specificity. Here, we opt for the Rényi entropy and define the domain specificity as in (47)–(48):

    $$\begin{aligned} D_{\alpha }(\tilde{\mathbf {e}})&= \frac{1}{1-\alpha }\log \bigg (\sum \nolimits _{i=1}^{K}P(z = i|\ \tilde{e})^\alpha \bigg ) \end{aligned}$$
    (47)
    $$\begin{aligned} D_{\alpha }(\tilde{\mathbf {f}})&= \frac{1}{1-\alpha }\log \bigg (\sum \nolimits _{i=1}^{K}P(z = i|\ \tilde{f})^\alpha \bigg ) \end{aligned}$$
    (48)

    Normally, the value of \(\alpha \) is set to 2 by default (also known as the collision entropy).

  • Source-target coherence across subdomains A translation rule with source and target phrases having two similar distributions over the latent subdomains is likely to be safer to use. We can use the Chebyshev distance to measure the similarity between two distributions. The divergence of two vectors \(\tilde{\mathbf {e}}\) and \(\tilde{\mathbf {f}}\) is defined as in (49):

    $$\begin{aligned} D(\tilde{\mathbf {e}},\ \tilde{\mathbf {f}}) = \max _{i=\{1,\ \ldots ,\ K\}}\left| P(z = i|\ \tilde{e}) - P(z = i|\ \tilde{f})\right| \end{aligned}$$
    (49)
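The two measures in (47)–(49) are straightforward to compute once the subdomain posteriors of a phrase are available. The sketch below assumes that these posteriors are given as plain lists of probabilities; the function names and the example distributions are ours, for illustration only.

```python
import math
from typing import Sequence


def renyi_entropy(p: Sequence[float], alpha: float = 2.0) -> float:
    """Rényi entropy of a subdomain distribution, as in (47)-(48).

    Low values indicate a peaked (domain-specific) distribution; the
    default alpha = 2 gives the collision entropy.
    """
    if alpha == 1.0:
        raise ValueError("alpha = 1 is the Shannon limit and is not covered here")
    return math.log(sum(pi ** alpha for pi in p)) / (1.0 - alpha)


def chebyshev_distance(p: Sequence[float], q: Sequence[float]) -> float:
    """Largest per-subdomain difference between two distributions, as in (49)."""
    return max(abs(pi - qi) for pi, qi in zip(p, q))


# Hypothetical posteriors over K = 4 latent subdomains.
uniform = [0.25, 0.25, 0.25, 0.25]   # spread evenly across subdomains
peaked = [1.0, 0.0, 0.0, 0.0]        # concentrated in one subdomain
h_uniform = renyi_entropy(uniform)
h_peaked = renyi_entropy(peaked)
gap = chebyshev_distance([0.7, 0.2, 0.1], [0.1, 0.2, 0.7])
```

In this toy example the uniform distribution attains the maximum entropy \(\log K\), while the fully peaked one has entropy 0; a small Chebyshev distance indicates that the source and target phrases are used in the same subdomains.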

Once integrated into a phrase-based SMT system as feature functions, the measures force the decoder to give higher preference to domain-invariant translations, which work well across domains, over risky domain-specific alternatives. The translation improvement is quite robust; it is obtained without tuning specifically for the target domain or using other domain-related meta-information in the training corpus (Cuong et al. 2016b).

A similar idea has been deployed in Zhang et al. (2014b), which exploits topic-insensitivity that is learned over documents for translation. There is a link between this line of work and extensive prior work on minimum Bayes risk (MBR) objectives (used either at test time (Kumar and Byrne 2004) or during training (Goodman 1998; Sima’an 2003; Smith and Eisner 2006; Pauls et al. 2009)). The goal of MBR minimization is to select translations that are less ‘risky’, but there is a degree of uncertainty in modelling such predictions, and some of this uncertainty may indeed be associated with domain-variability of translations. Still, a system trained with an MBR objective will tend to output the most frequent translation rather than the most domain-invariant one, and this, as we argued in the introduction, might not be the right decision when applying it across domains. We believe that the two classes of methods are largely complementary.

9 Conclusion

This paper contributes a comprehensive survey of domain adaptation for SMT. We first introduce preliminaries regarding SMT in general, with a focus on aspects of SMT relevant to domain adaptation. We present an in-depth discussion where we explain what may go wrong with translation when applying a phrase-based SMT system to new domains.

The question of “what constitutes a domain?” is an open one which has not been well defined in the literature. Each different view of factors contributing to defining the domain leads to a different approach to domain adaptation. We provide a general picture of domain adaptation, and show how different research lines fall into a specific part of the general picture, as well as how they relate to each other. Providing such a comprehensive survey is, to the best of our knowledge, a novel contribution.

As discussed, SMT is just one of several data-driven approaches to modeling translation. Other approaches can be deployed, e.g. example-based machine translation (Nagao 1984; Carl and Way 2003) and neural MT (Bahdanau et al. 2015). While it is fairly clear that example-based machine translation can benefit from what the domain-adaptation literature for SMT offers, it is less clear whether neural MT can. Recent studies suggest that it can: classic techniques in domain adaptation for SMT can be used to adapt neural translation models [cf. Durrani et al. (2015), Joty et al. (2015)]. More specifically, Durrani et al. (2015) show that EM-based mixture modeling and data-selection techniques also give a significant improvement in adaptation. Joty et al. (2015) reveal that regularizing the loss function towards the in-domain neural network joint model also improves translation.