1 Introduction

Worldwide demand for translation services has dramatically accelerated in the last decade, as an effect of market globalization and the growth of the Information Society. Computer-assisted translation (CAT) tools are currently the dominant technology in the translation and localization market, and those including machine translation (MT) engines are on the increase. Although MT systems are not yet able to provide output that is suitable for publication without human intervention, recent achievements in the field have raised new expectations in the translation industry. Several empirical studies (Federico et al. 2012; Green et al. 2013; Läubli et al. 2013) have recently shown that significant productivity gains are achieved when professional translators post-edit MT output instead of translating from scratch. So far, however, MT has focused on providing ready-to-use translations, rather than outputs that minimize the effort of a human translator. Our approaches focus on the latter. In fact, a very important issue for research and industry is how to effectively integrate machine translation within CAT software.

State-of-the-art statistical MT systems translate each sentence of an input text in isolation. While this reduces the complexity of translating large documents, it introduces the problem that information beyond the sentence level is lost. As mentioned in Tiedemann (2010), for instance, there are two types of important properties in natural language and translation that are often ignored in statistical models: consistency and repetitiveness. In the post-editing scenario additional information beyond the sentence level is available. After producing a system translation, user feedback in the form of a manual translation or of a user correction is received, which can be exploited to refine the next translations. From the viewpoint of professional translators, immediate refinement of the MT system in response to user post-editing is crucial in order to offer the experience of a system that learns from feedback and corrections. Online adaptation achieves this by increasing consistency of system translations with respect to the user translation of previously seen examples.

The post-editing scenario (see Footnote 1) fits well into an online learning protocol (Cesa-Bianchi and Lugosi 2006), where a stream of input data is revealed to the learner one example at a time. For each input example, the learner must make a prediction, after which the actual output is revealed, which the learner can use to refine the next prediction. Online learning can be applied to the framework of online adaptation by weighting new inputs more heavily than older ones. It is important to stress that the potential of any online adaptation technique can be effectively exploited for texts, such as technical documents, in which repetition of content words is very common.

This paper compares approaches to online adaptation of phrase-based statistical MT systems (Koehn 2010) by building on and significantly extending recent works by Wäschle et al. (2013) and Bertoldi et al. (2013), which were the first attempts to apply both generative and discriminative online adaptation methods in a post-editing setting.

First, we propose methods that augment the generative components of the MT system, translation and language model, straightforwardly by building cache-based local models of phrase pairs and \(n\)-grams from user feedback. Then, we present a discriminative method based on a structured perceptron to refine a feature-based re-ranking module applied to the \(k\)-best translations of the MT system. The generative and discriminative approaches are independent and can be straightforwardly cascaded.

An in-depth investigation and comparison of the proposed adaptation techniques has been conducted in three domains, namely information technology, legal documents, and patents, and for several language pairs, namely from English into Italian, Spanish, and German, and from German into English. In sum, the generative and discriminative approaches yield improvements of up to 10 absolute BLEU points over a non-adaptive system.

The paper is organized as follows. Section 2 overviews previous research on caching techniques, online learning, and discriminative re-ranking in machine translation. Section 3 presents the post-editing workflow from an online learning perspective. Section 4 describes the proposed generative and discriminative approaches to online adaptation of an MT system, and provides details on their implementation as well. Section 5 provides details about the translation directions and tasks we considered in our experimental evaluation; Sect. 6 reports on the experiments that we conducted, including a discussion of their outcomes. Some final summarizing comments end the paper in Sect. 7.

2 Related research

The concept of caching arose in computer science in the 1960s, when it was introduced to speed up the fetching of instructions and data and the virtual-to-physical address translation. In caching, results of computations are stored transparently so that future requests for them can be served faster. Caches exploit the locality of reference (principle of locality): the same value or related storage location is frequently accessed. This phenomenon does not only occur in computer science, but also in natural language, where short-term shifts in word-use frequencies are empirically observed; they were the rationale behind the introduction of the cache component in statistical language models by Kuhn and De Mori (1990). In this case, the motivation was not efficiency, as for computers, but improving the prediction capability of the model; caching has also been used for time savings in the concrete implementation of language models (Federico et al. 2008).

The use of caching in MT was introduced by Nepveu et al. (2004), with the goal of improving the quality of both translation and language models in the framework of interactive MT; the approach includes automatic word alignment of source and post-edited target, namely the IBM model 2 Viterbi search. Tiedemann (2010) proposed to incrementally populate the translation model cache with the translation options used by the decoder to generate the final best translation; no additional alignment step is required here. Our cache-based translation model stands in between these two approaches: The cache is filled with phrase pairs from the previous post-edit session. An explicit, possibly partial, phrase-alignment is obtained via an efficient constrained search, fed by all translation options whose source side matches the sentence to translate. We propose a further enhancement of the basic caching mechanism for rewarding cached items related to the current sentence to translate.

Our work is also related to MT adaptation in general and online learning in particular. Online learning methods in statistical MT are found in the context of stochastic methods for discriminative training (Liang et al. 2006; Chiang et al. 2008), or streaming scenarios for incremental adaptation of the core components of MT (Levenberg et al. 2010, 2012). However, the online learning protocol is applied in these approaches to training data only, i.e., parameters are updated on a per-example basis on the training set, while testing is done by re-translating the full test set using the final model. In an online adaptation framework, an important aspect is the evaluation of a dynamic system, which should consider not only the overall average performance, but also its evolution over time. Bertoldi et al. (2012) proposed the Percentage Slope as an effective measure for the system learning capability. Further related work can be found in the application of incremental learning to domain adaptation in MT. Here a local and a global model have to be combined, either in a (log)-linear combination (Koehn and Schroeder 2007; Foster and Kuhn 2007), with a fill-up method (Bisazza et al. 2011), or via ultraconservative updating (Liu et al. 2012).

Various structured learning techniques have been applied to online discriminative re-ranking in a post-editing scenario, for example, by Cesa-Bianchi et al. (2008), Martínez-Gómez et al. (2012), or López-Salcedo et al. (2012). Incremental adaptations of the generative components of MT have been presented for a related scenario, interactive machine translation, where an MT component produces hypotheses based on partial translations of a sentence (Nepveu et al. 2004; Ortiz-Martínez et al. 2010). Our online learning protocol is similar, but operates at the sentence level rather than at the word or phrase level.

Incremental adaptations have also been presented for larger batches of data (Bertoldi et al. 2012). In terms of granularity, our scenario is most similar to the work by Hardt and Elming (2010), where the phrase-based training procedure is employed to update the phrase table immediately after a reference becomes available. Our work, however, focuses on adapting both language and translation models with techniques that combine small adaptive local models with large static global models. This feature in fact nicely fits with the typical use of CAT tools, in which users use both a global shared translation memory and a local private translation memory.

In parallel to our work, Denkowski et al. (2014) developed an approach to learning from post-editing that allows independent adaptation of translation grammar, language model, and discriminative parameters of the statistical MT model. The results are not directly comparable because of the use of different corpora. However, while their approach also differs from ours in several of the proposed techniques, they achieve their best results by stacking all adaptation techniques, confirming a central finding of our work.

3 Online learning from post-editing

In the CAT workflow, source documents are split into chunks, typically corresponding to sentences, called segments, that are in general translated sequentially. When the translator opens a segment, the CAT tool proposes possible translation suggestions, originating from the translation memory and/or from a machine translation engine. Depending on the quality of the suggestions, the translator decides whether to post-edit one of them or to translate the source segment from scratch. Completed segments represent a valuable source of knowledge which can be readily stored in the translation memory for future use. The advantage of post-edited translations over reference translations, which are created independently by a human translator, is that post-edits are closer to actual MT translation hypotheses in terms of edit distance and domain relevance. This work addresses the question of how to exploit such feedback, and presents several methods which successfully improve SMT performance over time.

3.1 Online learning protocol

From a machine learning perspective, the post-editing scenario perfectly fits the online learning paradigm (Cesa-Bianchi and Lugosi 2006), which assumes that every time a prediction is made, the correct target value of the input is discovered right after and used to improve future predictions. We conveniently transpose the above concept to our post-editing scenario as depicted in Fig. 1. The learning process starts from training a global model \(M_g\) on parallel data in the range of millions of sentence pairs. Then for each document \(d\), consisting of a few tens up to a thousand segments, an empty local model \(M_d\) is initialized. For each example, first the static global model \(M_{g}\) and the current local model \(M_{d}\) are combined into a model \(M_{g+d}\) (step 1). Next the received input \(x_t\) (step 2) is translated into \(\hat{y}_t\) using the model \(M_{g+d}\) (step 3). Then the user translation \(y_t\) is received after producing \(\hat{y}_t\) (step 4). Finally the local model \(M_{d}\) is refined on the user feedback \(y_t\) (step 5).

Fig. 1 Online learning procedure for the post-editing workflow

This basic online learning protocol will be adapted to generative and discriminative learning components of phrase-based SMT, and extended in several directions, e.g. by adding an aging factor that scores more recent data more heavily than older data, thus accounting for online learning as online adaptation. In the following, we will use the terms online learning and online adaptation interchangeably. The evaluations reported in this paper take the local predictions \(\hat{y}_t\) and compare them to the user translations \(y_t\) for each document, e.g. using \(BLEU \{ (\hat{y}_t, y_t) \}_{t=1}^{|d|}\) (Papineni et al. 2002). Note that this setup differs from the standard scenario, where the whole test set is re-translated using the learned model. However, the evaluation is still fair since only feedback from previous test set examples is used to update the current model.
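The five steps of the protocol can be sketched as a simple loop. The following Python fragment is only an illustration under strong assumptions: the "models" are plain segment-level lookup dictionaries, and the names `online_post_editing` and `post_edit` are hypothetical, not part of the described system.

```python
from typing import Callable, Dict, List, Tuple


def online_post_editing(
    segments: List[str],
    post_edit: Callable[[str, str], str],
    global_model: Dict[str, str],
) -> List[Tuple[str, str]]:
    """Run steps 1-5 on one document and return the (y_hat, y) pairs."""
    local_model: Dict[str, str] = {}  # empty local model M_d for this document
    history: List[Tuple[str, str]] = []
    for x_t in segments:  # step 2: receive input x_t
        combined = {**global_model, **local_model}  # step 1: M_{g+d}, local entries win
        y_hat = combined.get(x_t, x_t)  # step 3: translate (toy lookup)
        y_t = post_edit(x_t, y_hat)  # step 4: receive user translation
        local_model[x_t] = y_t  # step 5: refine M_d on the feedback
        history.append((y_hat, y_t))
    return history


# Toy usage: the user always corrects the segment to the same translation.
references = {"Technical Offer": "Offerta Tecnica"}
pairs = online_post_editing(
    ["Technical Offer", "Technical Offer"],
    lambda x, y_hat: references[x],
    {"Technical Offer": "Technical offerta"},
)
```

The returned pairs are exactly what a document-level metric such as BLEU would be computed on: each prediction was produced before its own user translation was seen.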

3.2 Measuring the repetitiveness of a text

In Sect. 1, repetitiveness was mentioned as one of the phenomena occurring in texts that can highly affect the effectiveness of the online adaptation technique and the quality of automatic translation.

Bertoldi et al. (2013) introduced a way to measure repetitiveness inside a text, by looking at the rate of non-singleton \(n\)-gram types (\(n=1,\ldots ,4\)) it contains. As shown in Bertoldi et al. (2013), this rate decays exponentially with \(n\). For combining values with exponential decay, a reasonable scheme is to average their logarithms, or equivalently to compute their geometric mean. Furthermore, in order to make the measure comparable across differently sized documents, statistics are collected on a sliding window of 1,000 words, and properly averaged.

Formally, the repetition rate (RR) in a document can be expressed as:

$$\begin{aligned} {\textit{RR}} = \left( \ \prod _{n=1}^{4} \frac{\sum _{S}\left( \ V(n)-V(n,1) \ \right) }{\sum _{S}V(n)} \right) ^{1/4} \end{aligned}$$
(1)

where \(S\) is the sliding window, \(V(n,1)\) is the number of singleton \(n\)-gram types in \(S\), and \(V(n)\) is the total number of \(n\)-gram types in \(S\). RR ranges between 0 and 1; the extremes are reached when all \(n\)-grams observed in all text windows occur exactly once (\(\mathrm{RR}=0\)) and when they all occur more than once (\(\mathrm{RR}=1\)). It is worth noting that using a sliding window, in addition to permitting the comparison of RR across texts of different sizes, largely preserves the linguistic features of the original text, as opposed to what would happen if the sentences to be processed together were randomly sampled.
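Equation 1 can be computed directly from \(n\)-gram counts. The sketch below is a minimal illustration; the window stride (`step`, here non-overlapping by default) and the handling of texts shorter than one window are assumptions, since the text only states that a 1,000-word sliding window is used.

```python
from collections import Counter
from typing import List


def repetition_rate(
    words: List[str], window: int = 1000, step: int = 1000, max_n: int = 4
) -> float:
    """Geometric mean over n of the rate of non-singleton n-gram types (Eq. 1)."""
    non_singletons = [0] * max_n  # sum over windows of V(n) - V(n,1)
    types = [0] * max_n  # sum over windows of V(n)
    # A text shorter than one window is treated as a single window (assumption).
    for start in range(0, max(len(words) - window, 0) + 1, step):
        chunk = words[start:start + window]
        for n in range(1, max_n + 1):
            counts = Counter(
                tuple(chunk[i:i + n]) for i in range(len(chunk) - n + 1)
            )
            types[n - 1] += len(counts)
            non_singletons[n - 1] += sum(1 for c in counts.values() if c > 1)
    rate = 1.0
    for ns, v in zip(non_singletons, types):
        if v == 0:  # no n-grams of this order at all
            return 0.0
        rate *= ns / v
    return rate ** (1.0 / max_n)


rr_unique = repetition_rate("a b c d e".split())
rr_mixed = repetition_rate("a b a b a b a b c".split())
```

In the second toy text, two out of three \(n\)-gram types are repeated for every \(n\), so the geometric mean equals 2/3.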

3.3 Measuring the learning capability of a dynamic system

The standard MT metrics, such as BLEU (Papineni et al. 2002) and TER (Snover et al. 2006), provide absolute performance of the system, but they do not fit well into an online adaptation scenario, in which the system evolves dynamically over time. As shown in Bertoldi et al. (2012), adapting systems can be effectively analyzed by means of the Percentage Slope (henceforth \(Slope\)), which measures their learning capability. This metric originates in the industrial environment to evaluate the efficiency gained when an activity is repeated. Slope expresses the rate of learning on a scale of 0–100 %. A 100 % Slope represents no learning at all, whereas 0 % reflects a theoretically infinite rate of learning. In practice, human operations hardly ever achieve a rate of learning faster than 70 % as measured on this scale.

The Percentage Slope is actually a meta-metric \(Slope(metric)\), because it relies on an external metric measuring the efficiency of the activity in a range between 0 and 100, and fulfilling the constraint that “the lower, the better”. More details on its computation can be found in Bertoldi et al. (2012). From a practical point of view, as suggested by the authors, the sequence of scores is computed while the adapting system is being used; the learning curve which best matches the sequence is then found, and finally Slope is computed.

It is crucial to stress that the main assumption in the definition of the Percentage Slope metric is that the difficulty of the activity remains constant. Actually, this is not true in our scenario, because the difficulty of translating different portions of a text can vary a lot. Nevertheless, this metric is still useful to evaluate the learning capability of a dynamic system. It is sufficient to consider the performance difference between the dynamic system and the static (non-adaptive) system, taken as reference, to remove the effects of the intrinsic variable difficulty of the text. We hence computed a Percentage Slope on the difference of BLEU achieved by the dynamic and static systems, and named it \({\Delta }Slope\) to make such a difference clear. More precisely, as a metric is required which decreases when efficiency increases, we defined \({\Delta }Slope=Slope(100-(BLEU(dynamic) - BLEU(static)))\). The dynamic system has learning power if \({\Delta }Slope\) is below 100 %.
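The exact computation of Slope is given in Bertoldi et al. (2012) and is not reproduced here. Purely as an illustration, the sketch below assumes a power-law learning curve \(y = a \cdot t^{b}\) of the kind commonly used in industrial settings, for which the percentage slope is \(100 \cdot 2^{b}\); this choice of curve, the log-log least-squares fit, and the function names are all assumptions rather than the authors' procedure.

```python
import math
from typing import List


def percentage_slope(scores: List[float]) -> float:
    """Fit y = a * t^b by least squares in log-log space; return 100 * 2^b."""
    if len(scores) < 2 or any(s <= 0 for s in scores):
        raise ValueError("need at least two strictly positive scores")
    xs = [math.log(t) for t in range(1, len(scores) + 1)]
    ys = [math.log(s) for s in scores]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return 100.0 * 2.0 ** b


def delta_slope(bleu_dynamic: List[float], bleu_static: List[float]) -> float:
    """Slope of the series 100 - (BLEU(dynamic) - BLEU(static))."""
    series = [100.0 - (d - s) for d, s in zip(bleu_dynamic, bleu_static)]
    return percentage_slope(series)


# Hypothetical BLEU scores measured at four successive points of a document.
no_learning = delta_slope([30.0, 30.0, 30.0, 30.0], [30.0, 30.0, 30.0, 30.0])
learning = delta_slope([30.0, 32.0, 35.0, 38.0], [30.0, 30.0, 30.0, 30.0])
```

Under this assumed curve, a dynamic system that never departs from the static one gives a value of 100 %, while a growing BLEU gap pushes the value below 100 %.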

4 Online adaptation in SMT

We present several techniques for refining the local MT models (step 5 in Fig. 1), namely adaptations of the generative components, i.e. the translation model (TM) (Sects. 4.3.1 and 4.3.2) and the language model (LM) (Sect. 4.4), and adaptation via discriminative re-ranking (Sect. 4.5). Different refinements result in different modes of combination of global and local models (step 1). Both generative and discriminative adaptation modes deploy a constrained search technique (Sect. 4.2) to extract information relevant for system refinement from the received user feedback (step 4). Translation (step 3) employs a standard phrase-based MT engine, briefly introduced in Sect. 4.1.

Our adaptation techniques are exemplified throughout this Section by the two segments #6 and #39 of a sample document in the Information Technology domain (see Footnote 2) reported in Fig. 2. In these two segments, the phrase “Technical Offer” occurs twice, and the user chooses a common translation “Offerta Tecnica” for it. As shown in Fig. 3, the baseline non-adaptive MT system repeatedly produces an incorrect translation; the user, who relies on the MT suggestions to post-edit the segments, is forced to correct the error twice. The final goal of our adaptation approaches is to learn the right translation from the user feedback on the first segment and suggest this in the second, in order to reduce the post-editing effort of the user.

Fig. 2 Two segments occurring in different positions of a sample document containing the same fragment “Technical Offer”; the user post-edit, including its correct translation “Offerta Tecnica”, is shown in the third column. These sentences are used throughout the paper to illustrate the proposed approaches

Fig. 3 The outputs generated by the baseline system for both sample segments contain the same error “Technical offerta” for the expression “Technical Offer”

4.1 Baseline system

The MT engine is built with the open source toolkit Moses (Koehn et al. 2007). The 4-score global translation and the 6-score lexicalized reordering models are estimated on parallel training data with default settings. The global 5-gram LM is smoothed by the improved Kneser–Ney technique, and estimated on the target monolingual side of the parallel training data using the IRSTLM toolkit (Federico et al. 2008). Models are case-sensitive. Moreover, word and phrase penalties, and a distance-based distortion model are employed as features. The log-linear interpolation weights are optimized using the standard Minimum Error Rate Training (MERT) procedure (Och 2003) provided with the Moses toolkit. As suggested by Cettolo et al. (2011), weights are estimated averaging three distinct optimization runs to reduce the instability of MERT (Clark et al. 2011). The baseline system also provides a list of \(k\)-best translations, which are exploited by the online discriminative re-ranking module.

4.2 Constrained search for feedback exploitation

In order to extract information for system refinement from user feedback, source and user translation need to be aligned at the phrase level. To achieve this, we use the constrained search technique described in Cettolo et al. (2010), which optimizes the coverage of both source and target sentences given a set of translation options.

The search produces exactly one phrase segmentation and alignment, and allows gaps such that some source and target words may be uncovered. Unambiguous gaps (i.e. one on the source and one on the target side) can then be aligned. It differs in this respect from forced decoding, which produces an alignment only when the target is fully reachable with the given models.

From the phrase alignment, three types of phrase pairs can be collected: (i) new phrase pairs by aligning unambiguous gaps; (ii) known phrase pairs already present in the given model; (iii) full phrase pairs consisting of the complete source sentence and its user translation. Since preliminary experiments showed that using cached phrases consisting only of function words had a negative impact on translation quality, we restricted the phrase extraction to phrases that contain at least one content word (see Footnote 3). We found that in our experimental data cached phrases tended to be between one and three words in length, so we limited the length of new and known phrases to four words to speed up the annotation. The full phrase pair, which can have any length, is added to mimic the behaviour of a translation memory.

From the alignment shown in Fig. 4, we extract the new phrase pair Technical Offer \(\rightarrow \) Offerta Tecnica, the known phrase pairs Annex \(\rightarrow \) Allegato and to the \(\rightarrow \) all’ and the full phrase Annex to the Technical Offer \(\rightarrow \) Allegato all’ Offerta Tecnica. Then, to the \(\rightarrow \) all’ is actually discarded, because it does not contain any content word.

Fig. 4 Phrase segmentation and alignment obtained applying the constrained search algorithm to the sample segment #6
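The selection of phrase pairs after the constrained search can be illustrated with a small filter. This is only a sketch: the function-word list is a hypothetical stand-in, and testing the content-word condition on the source side and the four-word limit on both sides are assumptions, since the text does not specify them.

```python
from typing import List, Set, Tuple

# (kind, source phrase, target phrase); kind is "new", "known" or "full"
PhrasePair = Tuple[str, str, str]

# Hypothetical function-word list, for illustration only.
FUNCTION_WORDS: Set[str] = {"to", "the", "of", "a", "an", "in"}


def has_content_word(phrase: str) -> bool:
    """True if at least one token is not a function word."""
    return any(tok.lower() not in FUNCTION_WORDS for tok in phrase.split())


def select_phrase_pairs(
    pairs: List[PhrasePair], max_len: int = 4
) -> List[PhrasePair]:
    """Drop pairs made of function words only and over-long new/known pairs."""
    kept = []
    for kind, src, tgt in pairs:
        if not has_content_word(src):
            continue  # e.g. "to the" is discarded
        too_long = max(len(src.split()), len(tgt.split())) > max_len
        if kind != "full" and too_long:
            continue  # the length limit does not apply to full pairs
        kept.append((kind, src, tgt))
    return kept


extracted = [
    ("new", "Technical Offer", "Offerta Tecnica"),
    ("known", "Annex", "Allegato"),
    ("known", "to the", "all'"),
    ("full", "Annex to the Technical Offer", "Allegato all' Offerta Tecnica"),
]
selected = select_phrase_pairs(extracted)
```

On the phrase pairs of segment #6, only the pair for "to the" is removed, and the five-word full pair survives because it is exempt from the length limit.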

4.3 TM adaptation

The growing collection of source sentences and corresponding user-approved translations enables the construction of a local translation model. The goal of this local model is to reward MT translations that are consistent with previous user translations as well as to integrate new translations learned from user corrections, in order to better translate the following sentences.

We compare two alternative approaches to include the growing set of phrase pairs into the translation models: the former is based on a cache external to the decoder, and exploits the Moses xml-input option; the latter relies on a new Moses feature implementing an internal cache.

4.3.1 TM adaptation with an external cache

From each sentence pair, all phrase pairs extracted with the constrained search technique described in Sect. 4.2 are inserted into a cache; a score for each phrase pair is estimated based on the relative frequency of the target phrase given the source phrase in the cache, as described in Eq. 2.

$$\begin{aligned} score(f, e)&= \frac{ c_{all}(f,e) }{c_{all}(f)}\nonumber \\ c_{all}(f, e)&= \lambda _{new} \cdot c_{new}(f,e) + \lambda _{known} \cdot c_{known}(f,e) + \lambda _{full} \cdot c_{full}(f,e)\nonumber \\ c_{all}(f)&= \sum _{e} c_{all}(f, e) \end{aligned}$$
(2)

where \(c_{new}(f,e)\), \(c_{known}(f,e)\), and \(c_{full}(f,e)\) are the frequencies of \((f,e)\) in the corresponding groups of extracted phrase pairs; \(c_{all}(f,e)\) is their linear combination with individual weights \(\lambda \); \(c_{all}(f)\) is the marginal weighted frequency of all translation options \(e\) for a given \(f\). Cache and model are updated on a per-sentence basis as soon as source sentence and user translation become available. Details about the estimation of \(\lambda \) parameters are given in Sect. 6.1.
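As a minimal sketch of Eq. 2, the cache below keeps one counter per phrase-pair group and computes the weighted relative frequency. The class name and the uniform default weights are placeholders; the actual \(\lambda \) values are estimated as described in Sect. 6.1.

```python
from collections import Counter
from typing import Dict


class ExternalCache:
    """Weighted relative-frequency scores for cached phrase pairs (Eq. 2)."""

    def __init__(
        self, lam_new: float = 1.0, lam_known: float = 1.0, lam_full: float = 1.0
    ) -> None:
        # Placeholder weights; not the values used in the experiments.
        self.lam: Dict[str, float] = {
            "new": lam_new,
            "known": lam_known,
            "full": lam_full,
        }
        self.counts: Counter = Counter()  # keyed by (kind, f, e)

    def add(self, kind: str, f: str, e: str) -> None:
        self.counts[(kind, f, e)] += 1

    def c_all(self, f: str, e: str) -> float:
        """Weighted count of the pair (f, e) over the three groups."""
        return sum(self.lam[k] * self.counts[(k, f, e)] for k in self.lam)

    def score(self, f: str, e: str) -> float:
        """c_all(f, e) divided by the marginal c_all(f)."""
        marginal = sum(
            self.lam[kind] * count
            for (kind, src, _), count in self.counts.items()
            if src == f
        )
        return self.c_all(f, e) / marginal if marginal else 0.0


cache = ExternalCache()
for _ in range(3):
    cache.add("new", "Technical Offer", "Offerta Tecnica")
cache.add("new", "Technical Offer", "Proposta Tecnica")
score_offerta = cache.score("Technical Offer", "Offerta Tecnica")
score_proposta = cache.score("Technical Offer", "Proposta Tecnica")
```

With frequencies 3 and 1 and equal weights, the two options receive the scores 0.75 and 0.25.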

A fast way to integrate the constantly changing local model in the decoder at run-time is the Moses xml-input option (see Footnote 4). Translation options can be passed to the decoder via xml-like markup. Through this modality, the options are temporarily added to the global translation model. They become entries of the model in effect, and hence at decoding time can be accessed like those already inside. Note that the temporary options are automatically deleted after the translation of the current sentence. In the global translation model each entry is usually associated with multiple feature scores (four in our setting); hence, the \(score(f, e)\) (Eq. 2) assigned to each option passed via xml-like markup is split uniformly. Multiple translation options and their corresponding probabilities can be suggested for a specific source phrase.

Moses offers two ways to interact with this local phrase table. In inclusive mode, the given phrase translations compete with existing phrase table entries, as though they were temporarily added to the baseline phrase table. In exclusive mode, the decoder is instead forced to choose only from the given translations (see Footnote 5). During development, we found that the exclusive option is too strict in our scenario. Though most phrase pairs are correct and useful additions, for example spelling variants such as S.p.A \(\rightarrow \) SpA or domain vocabulary such as lease payment \(\rightarrow \) canone, some are restricted to a specific context, e.g. translation from singular to plural such as service \(\rightarrow \) servizi, and some are actually incorrect. In inclusive mode, the global translation and language model can reject unlikely translations.

Since the xml-input option does not support overlapping phrases, sentences are annotated in a greedy way from left to right. For each phrase in the input sentence, the cache is checked for possible translations, starting from the complete sentence down to single words. In this way, translations for larger spans are preferred over word translations. We did not explore other setups, such as preferring newly learned phrases over older options from the cache, but instead opted to keep the implementation simple.

Figure 5 illustrates the annotation to pass translation options to the decoder through the xml-like markup. The scores 0.75 and 0.25 associated with the two options are computed according to their frequency in the cache.

Fig. 5 Annotation of the sample segment #39 when TM adaptation via external cache is applied and the cache contains two translation options “Offerta Tecnica” and “Proposta Tecnica” for “Technical Offer”, with frequency 3 and 1, respectively
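The greedy left-to-right annotation can be sketched as a longest-match scan over the input. The fragment below is one possible reading of that procedure, not the authors' implementation; the tag name `np`, the input sentence, and the lack of any escaping of special characters are assumptions, while the `translation` and `prob` attributes with `||` separators follow the style of the Moses xml-input markup.

```python
from typing import Dict, List, Tuple

# source phrase -> list of (translation, score) options
Options = Dict[str, List[Tuple[str, float]]]


def annotate(sentence: str, cache: Options) -> str:
    """Mark up non-overlapping spans, preferring the longest cached span."""
    words = sentence.split()
    out: List[str] = []
    i = 0
    while i < len(words):
        for j in range(len(words), i, -1):  # longest span starting at i first
            span = " ".join(words[i:j])
            if span in cache:
                translations = "||".join(t for t, _ in cache[span])
                probs = "||".join(f"{p:g}" for _, p in cache[span])
                out.append(
                    f'<np translation="{translations}" prob="{probs}">{span}</np>'
                )
                i = j  # spans never overlap
                break
        else:
            out.append(words[i])  # no cached translation starts here
            i += 1
    return " ".join(out)


cache_options: Options = {
    "Technical Offer": [("Offerta Tecnica", 0.75), ("Proposta Tecnica", 0.25)],
    "Offer": [("Offerta", 1.0)],
}
# Hypothetical input sentence, not the actual text of segment #39.
annotated = annotate("see the Technical Offer", cache_options)
```

Because longer spans are tried first, the two-word entry is used and the single-word entry for "Offer" is never considered at that position.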

4.3.2 TM adaptation with an internal cache

In this second approach for generative adaptation, the local translation model is implemented as an additional 1-score phrase table (see Footnote 6), i.e. an additional feature providing one score. This model dynamically changes over time in two respects: (i) new phrase pairs can be inserted, and (ii) scores of all entries are modified when new pairs are added.

All entries are associated with an age, corresponding to the time they were inserted, and scored accordingly. Each new insertion causes the ageing of the existing phrase pairs and hence their re-scoring; in case of re-insertion of a phrase pair, the old value is overwritten. Each phrase pair is scored according to its actual \(age\) in the cache, so its score varies over time; the score is computed with the following function (see Footnote 7):

$$\begin{aligned} score(age) = 1 - exp\left( \frac{1}{age}\right) \end{aligned}$$
(3)

From each sentence pair, phrase pairs are extracted with the procedure used in the previously described approach, and simultaneously added to the local translation model by feeding the decoder with an annotation illustrated in Fig. 6. All options in the example are inserted simultaneously in the cache and hence they are associated with the same age.

Fig. 6 Annotation to update the local translation model when the internal cache is applied; in this example we feed the decoder with the phrase pairs extracted from segment #6

During decoding, translation alternatives are searched both in the global static phrase table and in the local cache-based (dynamic) phrase table, get a score from both tables, and compete in the creation of the best translation hypothesis.
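The ageing policy of the internal cache can be illustrated as follows. The sketch implements Eq. 3 literally as written; starting ages at 1 (to avoid a division by zero) and the class and method names are assumptions, and the way the tuned log-linear weight interacts with the score is not modelled.

```python
import math
from typing import Dict, Iterable, Optional, Tuple

PhrasePair = Tuple[str, str]  # (source phrase, target phrase)


class AgingCache:
    """Cache whose entries age by one at every insertion event."""

    def __init__(self) -> None:
        self.age: Dict[PhrasePair, int] = {}

    def insert(self, pairs: Iterable[PhrasePair]) -> None:
        """Age all existing entries, then add the new ones with the same age."""
        for key in self.age:
            self.age[key] += 1
        for pair in pairs:
            self.age[pair] = 1  # re-insertion overwrites the old age

    def score(self, pair: PhrasePair) -> Optional[float]:
        """Eq. 3 applied to the current age; None if the pair is not cached."""
        age = self.age.get(pair)
        if age is None:
            return None
        return 1.0 - math.exp(1.0 / age)


tm_cache = AgingCache()
tm_cache.insert([("Technical Offer", "Offerta Tecnica"), ("Annex", "Allegato")])
tm_cache.insert([("lease payment", "canone")])
tm_cache.insert([("Annex", "Allegato")])  # re-inserted: its age is reset
ages = dict(tm_cache.age)
```

After the three insertions, the pair for "Technical Offer" has age 3, the pair for "lease payment" has age 2, and the re-inserted pair for "Annex" is back at age 1.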

4.4 LM adaptation

Similar to the local cache-based translation model described in Sect. 4.3.2, a local cache-based language model is built to reward the \(n\)-grams found in post-edited translations. This model is implemented as an additional feature of the log-linear model, which provides a score for each \(n\)-gram; the feature relies on a cache storing target \(n\)-grams.

The same policy employed in the modification of the local cache-based translation model is applied to the local cache-based language model. The score is also computed according to Eq. 3 and depends on the actual \(age\) of the \(n\)-grams in the cache. Only the annotation slightly changes as shown in Fig. 7. As with the cache-based translation model, we found in preliminary experiments that discarding stopword phrases improved results, so for each user-approved translation \(y\), all its \(n\)-grams (\(n=1,\ldots ,4\)) containing at least one content word are extracted (for an example, see Fig. 7) and inserted in the cache.

Fig. 7 Annotation to update the local language model; in this example we feed the decoder with the \(n\)-grams extracted from segment #6

At decoding time, the target side of each translation option fetched by the search algorithm is scored with the cached model according to the same policy applied for the local cache-based translation model. If the target is not found in the cache, it receives no reward. Note that \(n\)-grams crossing over contiguous translation options are not taken into account by this model.
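The extraction of the \(n\)-grams that populate the language model cache can be sketched in a few lines. The function name and the function-word list passed to it are hypothetical; only the range \(n=1,\ldots ,4\) and the content-word condition come from the text.

```python
from typing import List, Set


def content_ngrams(
    translation: str, function_words: Set[str], max_n: int = 4
) -> List[str]:
    """All n-grams (n = 1..max_n) that contain at least one content word."""
    tokens = translation.split()
    ngrams: List[str] = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            gram = tokens[i:i + n]
            if any(tok.lower() not in function_words for tok in gram):
                ngrams.append(" ".join(gram))
    return ngrams


# User-approved translation of segment #6; "all'" is treated as a function word.
cached = content_ngrams("Allegato all' Offerta Tecnica", {"all'"})
```

For this four-word translation, nine of the ten possible \(n\)-grams are kept; only the unigram consisting of the function word alone is dropped.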

It is worth emphasizing that, despite the name, the proposed additional feature is not a conventional language model, but rather a function rewarding approved high-quality word sequences in target sides of phrase pairs.

4.5 Online discriminative re-ranking

Our discriminative re-ranking approach (see Footnote 8) is based on the structured perceptron by Collins (2002), which fits nicely into the online scenario considered here: For each source sentence the baseline system is asked to generate a \(k\)-best list of hypotheses. This list is ranked according to the current linear re-ranking model and its prediction is returned. Then, the learner receives the user translation, which is used for parameter updating. Updates occur if the prediction of the re-ranking differs from the user translation. More formally, given a feature representation \(f(x,y)\) for a source-target pair \((x,y)\), and a corresponding weight vector \(w\), the perceptron update on a training example \((x_t,y_t)\) where the prediction \(\hat{y} = \arg \max _{y} \left< w,f(x_t,y) \right>\) does not match the target \(y_t\) is defined as:

$$\begin{aligned} w = w + f(x_t,y_t) - f(x_t,\hat{y}) \end{aligned}$$
(4)

We use lexicalized sparse features defined by the following feature templates: All phrase pairs used by the decoder (for system translations) or given by the constrained search (for the user translation) are used as features; in addition, we use target-side \(n\)-grams (\(n=1,\ldots ,4\)) as features, extracted from the user translation or the system translation, respectively. All features are simple indicator functions, with feature values given by the number of source words for phrase pairs, or \(n\) for target-side \(n\)-grams. This way, more weight is put on phrases spanning longer parts of the source sentence and higher order \(n\)-grams. During development we found this to have a positive effect on BLEU results, but as shown in the experiments section, this may result in worsening of other metrics such as TER. Considering the example in Fig. 4, the following features are extracted: Annex \(\rightarrow \) Allegato with a feature value of 1, Offerta Tecnica \(\rightarrow \) Technical Offer (see Footnote 9) with feature value 2, and all \(n\)-grams found in the English translation. As for the TM adaptations, we only consider features that include at least one content word.
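The re-ranking and the update of Eq. 4 can be illustrated with sparse feature dictionaries. The sketch below is a toy instance: the feature names and the two-entry \(k\)-best list are invented for illustration, and feature extraction from the decoder output and from the constrained search is assumed to have happened already.

```python
from typing import Dict, List

Features = Dict[str, float]


def dot(w: Features, f: Features) -> float:
    """Inner product <w, f> for sparse vectors."""
    return sum(w.get(name, 0.0) * value for name, value in f.items())


def rerank(w: Features, kbest: List[Features]) -> int:
    """Index of the highest-scoring hypothesis (first one wins ties)."""
    return max(range(len(kbest)), key=lambda i: dot(w, kbest[i]))


def perceptron_update(w: Features, f_user: Features, f_pred: Features) -> None:
    """Eq. 4: w = w + f(x_t, y_t) - f(x_t, y_hat), applied in place."""
    for name, value in f_user.items():
        w[name] = w.get(name, 0.0) + value
    for name, value in f_pred.items():
        w[name] = w.get(name, 0.0) - value


# Toy 2-best list with one bigram feature each (feature value n = 2).
kbest = [{"ng:Technical offerta": 2.0}, {"ng:Offerta Tecnica": 2.0}]
f_user = {"ng:Offerta Tecnica": 2.0}  # features of the user translation

weights: Features = {}
first_choice = rerank(weights, kbest)
if kbest[first_choice] != f_user:  # prediction differs from the user translation
    perceptron_update(weights, f_user, kbest[first_choice])
second_choice = rerank(weights, kbest)
```

After a single update on the first occurrence, the hypothesis that agrees with the user translation is ranked first when the same \(k\)-best list is seen again.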

The advantages of using only the two feature templates described above in discriminative re-ranking are as follows. First, the approach is simple, as there is no need for interaction with the decoder. The decoder is only required to return a \(k\)-best list of translation hypotheses along with phrase segmentations. This way, a wide variety of decoders can be used with the re-ranking module, which may be beneficial for practical use. Second, the combination of decoder-independent features with the constrained search technique allows us to apply an update condition that can be categorized as bold in terms of Liang et al. (2006). That is, for the purpose of discriminative training, in our setup all references are effectively reachable since we can extract features from them and assign model scores.

4.6 Tuning of the dynamic systems

The described techniques for TM adaptation via internal cache and LM adaptation add one more feature each to the standard feature set of the baseline system. Optimization of the additional feature together with the standard ones is achieved by a modification of the standard MERT procedure provided by Moses. The translation of the input text is now performed sequentially instead of in parallel; in practice, the batch translation of the standard MERT procedure is replaced by the online process introduced in Sect. 3, which updates the dynamic models sentence after sentence. This enhanced MERT procedure makes it possible to reliably tune the weight of the additional feature as well as those of the standard feature set.

The overall dynamic system has meta-parameters related to the constrained search step and to the selection of the features for updating the dynamic models. Tuning of these parameters is described in Sects. 6.1 and 6.2.

5 Data benchmarks

Experimental analysis of the systems was performed on several tasks on both proprietary and publicly available data, involving the translation of documents from three domains, namely information technology, law, and patents, and for several language pairs, namely from English into Italian, Spanish, and German, and from German into English. Experiments were carried out by simulating post-editing feedback with the available reference translations, as proposed by Hardt and Elming (2010). Note that, for two tasks (English–Italian, information technology and legal domains), the reference translations used correspond to actual post-edits made by professional translators working with a non-adaptive MT system.

5.1 Training data

For the English–Italian Information Technology task, training data mostly consist of commercial data extracted from a translation memory built during a number of translation projects for several commercial companies; these data were provided by Translated srl, the industrial partner of the MateCat project, and were collected during the real use of CAT tools by professional translators. In addition, parallel texts from the OPUS corpus (Footnote 10) were also included (Tiedemann 2012).

For the other tasks, public data, which allow replicability and cross-assessment of our outcomes, were chosen in order to cover a wide range of linguistic conditions. For the English–Italian and English–Spanish Legal tasks, training data are taken from version 3.0 of the JRC-Acquis (Footnote 11) collection (Steinberger et al. 2006).

For the English–German and German–English Patents tasks, training texts consist of patent text sampled from the title, abstract and description sections of the PatTR (Footnote 12) corpus (Wäschle and Riezler 2012).

Statistics for the training corpora are reported in Table 1.

Table 1 Statistics for the training data for all tasks: number of segments, source and target running words

5.2 Development and test data

In this paper we aim at comparing the proposed adaptation techniques across different domains, language pairs, text repetitiveness, reference types, and overall performance of the baseline system. To this purpose we collected several documents for each task employed for either development or testing. Main statistics are shown in Table 2: number of segments, number of source and target running words, and source and target repetition rate (RR) computed as explained in Sect. 3.2. Figures on the source side refer to the texts the users are requested to translate; figures on the target side refer to either the translations or the actual post-edited texts; all figures refer to tokenized texts. When references consist of post-edits, they are created by translators modifying the suggestions of a static SMT system.

Table 2 Statistics of the development and test data for all tasks: number of segments, running words and repetition rate for both source and target sides

5.2.1 English–Italian IT

Six documents labeled set0–5 correspond to projects provided by Translated srl, where the references are user-approved translations. Documents set6 and set7 are taken from a software user manual. For each sentence, the actual user corrections by four different translators (A–D) were collected during a field test and used as references. We report the scores for all four translators, regarding each translator’s post-edits as an independent document.

This choice has strong motivations in the online adaptation scenario. Each translator processed the sentences in his/her preferred order and provided a different reference; hence, the original baseline system evolves differently, and possibly achieves different performance. In addition, since sentence order and references differ among documents set6A–D and set7A–D, both source and target repetition rates vary, as explained in Sect. 3.2.

Weight optimization was performed on documents set0–2 for each system independently. During the tuning procedure of the dynamic systems, their caches were cleared at the beginning of each document in order to avoid possibly uncontrolled interactions among them.
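The sequential processing used both in tuning and in testing, with the dynamic models reset at the start of each document, can be sketched as a simple loop. The callables `translate`, `update` and `reset` are hypothetical placeholders for the decoder, the model update and the cache clearing; they do not correspond to actual Moses interfaces.

```python
def run_online(documents, translate, update, reset):
    """Translate documents sentence by sentence with online adaptation.

    documents -- list of documents, each a list of (source, feedback) pairs
    translate -- callable: source -> system translation
    update    -- callable: (source, feedback) -> None, adapts the models
    reset     -- callable: () -> None, clears the dynamic models
    """
    outputs = []
    for doc in documents:
        reset()  # avoid interactions between documents
        for source, feedback in doc:
            outputs.append(translate(source))  # translate first ...
            update(source, feedback)           # ... then learn from feedback
    return outputs


# Toy usage with a dictionary acting as a sentence-level cache.
cache = {}
docs = [[("a", "A"), ("a", "A")], [("a", "A")]]
result = run_online(
    docs,
    translate=lambda s: cache.get(s, "?"),
    update=lambda s, f: cache.__setitem__(s, f),
    reset=cache.clear,
)
```

In the toy run, the second occurrence of a sentence within a document benefits from the feedback on the first, while nothing is carried over to the next document.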

5.2.2 English–Italian and English–Spanish legal

For both the English–Italian and English–Spanish tasks, three documents were selected by exploiting the labeling of JRC-Acquis documents in terms of Eurovoc (Footnote 13) subject domain classes. We chose two classes including neither too large nor too small a number of documents (around 100), and three documents were selected from each class (Footnote 14). For fairness, all other documents of those classes have been removed from the training data.

For English–Italian only, an additional document set3, taken from a recent motion for a European Parliament resolution published on the EUR-Lex platform, was translated by four different translators (A–D) during a field test, hence four independent post-edits were used as user feedback.

Document set0 was used for development, the remaining documents for testing. Note that small variations in the source side statistics of the English–Italian and English–Spanish data sets are due to minor differences in sentence alignment.

5.2.3 English–German and German–English Patents

For the Patents task, we selected 10 patent documents, each containing a title, an abstract and a description section with a total length of more than 200 sentences. All documents were sampled from the same IPC (Footnote 15) section E (‘Fixed Constructions’), which can be viewed as a technical subdomain. Data from these documents were excluded from the training corpus.

Patents set3–5 were used for development, the remaining sets for testing. The same data sets were used for both translation directions. All data is available for download from the PatTR website (Footnote 16).

6 Experiments

In this section we describe the detailed experimental comparison of the online adaptation approaches proposed in Sects. 4.3–4.5. The approaches to TM and LM adaptation and the discriminative re-ranking module are autonomous and can be applied independently of each other. Consequently, 12 systems could be built employing the TM adaptation via either external (+xmtm) or internal cache (+cbtm), the LM adaptation (+cblm), and cascading the discriminative re-ranking module (+rnk). The baseline system (bsln) does not make use of any adaptation techniques.
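One way to arrive at the figure of 12 is to count every combination of the three independent choices, including the configuration without any adaptation. The short sketch below enumerates them under that reading; the naming convention simply concatenates the labels used in the text.

```python
from itertools import product

TM_OPTIONS = ["", "+xmtm", "+cbtm"]   # no TM adaptation, external, internal cache
LM_OPTIONS = ["", "+cblm"]            # without / with LM adaptation
RNK_OPTIONS = ["", "+rnk"]            # without / with re-ranking


def enumerate_systems():
    """Return the names of all system configurations."""
    systems = []
    for tm, lm, rnk in product(TM_OPTIONS, LM_OPTIONS, RNK_OPTIONS):
        name = tm + lm + rnk
        # The empty combination is the unadapted baseline.
        systems.append(name if name else "bsln")
    return systems


systems = enumerate_systems()
```

The resulting list contains the baseline, the re-ranker applied alone (`+rnk`), and the full combination `+cbtm+cblm+rnk`, among others.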

The fine-grained tuning of the systems was conducted on the English–Italian IT task taking into account the BLEU score (Papineni et al. 2002), and only the best performing systems were tested and compared on all other tasks, namely English–Italian and English–Spanish Legal and the English–German and German–English Patents. In particular, the meta-parameters of the TM adaptation via external cache (the weights \(\lambda \) of Eq. 2) and LM adaptation (the order of \(n\)-grams) techniques were estimated in this condition.

Evaluation of systems was performed by means of BLEU and TER (Snover et al. 2006), both ranging from 0 to 100. The performance of the bsln system on all test documents and for all tasks is shown in Fig. 8. From these plots, we observe that there is a fairly high correlation (Footnote 17) between BLEU and TER scores (Footnote 18) and that performance among test documents of the same task can vary considerably.

Fig. 8

Performance in terms of BLEU (left) and TER (right) of the baseline system for all tasks and all test documents

The comparison among the systems is also performed by means of \(\Delta \)Slope, introduced in Sect. 3.3, which reflects the learning capability of the dynamic systems.

6.1 TM adaptation

The generative approaches to online adaptation rely on the set of phrase pairs extracted from the user feedback by means of the constrained search. As explained in Sect. 4.2, three types of phrase pairs (new, known, and full) are collected. We carried out preliminary experiments comparing the translation performance of all three variants. Table 3 shows the performance of the +xmtm system in terms of BLEU under different conditions on set0–5 of the English–Italian IT task; the differences from the baseline are also reported. We report the mean improvement over the baseline in small font size and give the corrected sample standard deviation of the improvements over all documents in the group in square brackets. Statistical significance is assessed using approximate randomization (Noreen 1989) with a significance level set to 0.05.
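For readers unfamiliar with approximate randomization, the test can be sketched as follows. This is a generic illustration, not the evaluation script used for Table 3: `metric` is a hypothetical callable that maps a list of per-sentence statistics to a corpus-level score, and the number of trials is an arbitrary choice.

```python
import random


def approximate_randomization(stats_a, stats_b, metric, trials=1000, seed=0):
    """Two-sided paired approximate randomization test.

    stats_a, stats_b -- per-sentence statistics of systems A and B,
                        aligned on the same test sentences
    metric           -- callable: list of statistics -> corpus-level score
    Returns an estimated p-value for the observed score difference.
    """
    rng = random.Random(seed)
    observed = abs(metric(stats_a) - metric(stats_b))
    at_least_as_large = 0
    for _ in range(trials):
        shuffled_a, shuffled_b = [], []
        for a, b in zip(stats_a, stats_b):
            # Swap the outputs of the two systems with probability 0.5.
            if rng.random() < 0.5:
                a, b = b, a
            shuffled_a.append(a)
            shuffled_b.append(b)
        if abs(metric(shuffled_a) - metric(shuffled_b)) >= observed:
            at_least_as_large += 1
    # Add-one smoothing keeps the estimate strictly positive.
    return (at_least_as_large + 1) / (trials + 1)


def mean(values):
    return sum(values) / len(values)
```

A difference would be called significant at the level used here if the returned value is below 0.05. For BLEU, the per-sentence statistics would be \(n\)-gram match counts and lengths rather than the plain numbers accepted by the toy `mean` function.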

Table 3 Performance of the +xmtm system using new, known or full phrases and combinations of the three on set0–2 and set3–5 of English–Italian IT task

Each type of phrase pair in +xmtm yields individual improvements. new phrase pairs yield smaller improvements than known and full, which is explained by the fact that only a few new phrase translations are extracted (recall that only unambiguous gaps in the alignment are considered). The individual improvements add up to an overall statistically significant improvement of 2.90 (set0–2) and 2.42 (set3–5) BLEU points over the baseline for the combination of all conditions, namely new + known + full. For the combination we attempted to optimize the \(\lambda \) parameters in Eq. 2 using a simple grid search. However, no changes in BLEU score were observed for different weight settings, so the weights were kept uniform for all following experiments. This is in accordance with the additivity of the different phrase pair types and is an indicator of the consistency of the documents: only a small number of translation options are observed for every source phrase, so the different conditions do not have to compete with each other. We use the combination of all types of phrase pairs for the second TM adaptation approach (+cbtm) as well, without individual experiments.

6.2 LM adaptation

Table 4 shows a comparison of different LM adaptation conditions applied as standalone (+cblm) and in combination with the best TM adaptation technique determined above in Sect. 6.1 (+xmtm+cblm).

Table 4 Performance of the +cblm system using 1-g, 4-g, and tm-\(n\)-g on top of the bsln and the +xmtm systems on set0–2 and set3–5 of English–Italian IT task

The LM adaptation always outperforms the baseline regardless of the type of added \(n\)-grams; although the improvements seem consistent, they are not always statistically significant. The most important observation is that the gains achieved by the TM adaptation via external cache and by the LM adaptation are largely independent and additive, and the combination of the two techniques yields significant improvements. Using \(n\)-grams up to order 4 (4-g) on top of the TM adaptation outperforms the baseline on both development sets by 5.05 and 3.76 BLEU points, corresponding to a relative improvement of around 20 %.

Rewarding only unigrams (1-g) gives the smallest improvements, indicating that more context is helpful. Using only those \(n\)-grams that are target sides of phrase pairs added during the TM adaptation (tm-\(n\)-g) yields good improvements as standalone, but in combination with TM adaptation, the 4-g LM performs best. We therefore consider this the best system configuration and keep it fixed for the remaining investigation, in the case of the TM adaptation via internal cache as well.

6.3 Comparison of generative approaches

In this section we compare the five dynamic systems generated by different configurations of the TM and LM adaptation approaches. Figure 9 shows their differences in terms of BLEU and TER from the baseline system.

Fig. 9

Performance of the generative online adaptation approaches on all English–Italian IT documents, expressed as difference in BLEU (left) and TER (right) from the baseline system

Apart from set6A–D, whose behavior is discussed later, similar observations can be drawn from performance of set3–5 and set7A–D. Each single adaptation approach is effective in improving baseline performance, even if gains vary a lot ranging from 1 to 8 BLEU points and from 2 to 7 TER points. Moreover, looking at Fig. 8, no correlation between the effectiveness of the online adaptation and the baseline performance can be observed. This is fortunate, as it shows that the proposed adaptation techniques are effective regardless of the absolute quality of the system to which they are applied.

When used alone, the LM adaptation technique +cblm seems less effective than both TM adaptation techniques +cbtm and +xmtm. Among the TM adaptation approaches, +cbtm outperforms +xmtm by more than 1 BLEU point if +cblm is not applied; otherwise the difference mostly vanishes. Indeed, +cbtm alone achieves the best performance. The gains of +xmtm and +cblm are instead partially additive. In our opinion, the larger effectiveness of the +cbtm approach with respect to the +xmtm approach is due to the larger freedom of the decoder in choosing the correct translation options. In the former case, the decoder is free to apply the suggested options (in the internal cache) to any source fragment; in the latter case the decoder can apply them only to the fragments greedily identified through the xml-like markup. We think that +cblm does not give an additive contribution to +cbtm because the latter system has already achieved the maximum possible gain. However, +cblm can help +xmtm to reach this maximum.

In summary, we find that the TM adaptation techniques outperform the LM adaptation technique, with the best results achieved by their combination. Among the TM adaptation techniques, there is a slight preference for the internal cache (+cbtm) over the external cache (+xmtm).

Concerning documents set6A–D, the LM-adapted systems clearly outperform their corresponding systems without +cblm, while both TM adaptation techniques fail. The main reason is likely related to the low Repetition Rate of those documents. Figure 10 is the scatter plot of Repetition Rate and BLEU performance gain from the baseline system achieved by the TM adaptation techniques (+cbtm and +xmtm) on all English–Italian IT documents; the trend lines – computed according to simple linear regression and having a coefficient of determination \(R^2\) of 0.79 and 0.86, respectively – are evidence of a high correlation between the two measures. We conclude that it is unlikely to get any improvement by the TM-adapted system on texts like documents set6A–D, which are scarcely repetitive.
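The trend lines and their coefficients of determination follow from ordinary least squares. The sketch below shows the computation for a single predictor; the function name is ours, and any input passed to it in an example would be toy data rather than the values plotted in Fig. 10.

```python
def linear_fit_r2(xs, ys):
    """Least-squares line y = slope * x + intercept and its R^2.

    xs -- predictor values (e.g. repetition rates)
    ys -- response values (e.g. BLEU gains over the baseline)
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R^2 = 1 - (residual sum of squares) / (total sum of squares)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return slope, intercept, 1.0 - ss_res / ss_tot
```

An \(R^2\) close to 1 means that most of the variance in the BLEU gain is explained by a linear dependence on the Repetition Rate; for a single predictor it equals the squared Pearson correlation.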

Fig. 10

English–Italian IT domain: trend lines of Repetition Rate vs. \(\Delta \)BLEU between TM-adapted (+cbtm and +xmtm) and baseline systems. Corresponding trend lines have \(R^2\) of 0.79 and 0.86, respectively

6.4 Impact of the discriminative re-ranking module

The discriminative re-ranking module described in Sect. 4.5 can be independently combined with any of the previous systems, including the baseline. For our experiments we precomputed 200-best lists for all systems and for all data sets. The features of the hypotheses could then be readily extracted from the translations and the phrase alignment. The new phrase pairs from adapted systems are also communicated to the re-ranking module to have complete feature representations for all hypotheses. The development for the re-ranking module was carried out with the baseline system on the set0–2 of the English–Italian IT data and settings were carried over to the other data sets and combination with other systems without any further optimization.

In Fig. 11 we directly compare the performance of the re-ranking module applied alone (+rnk) against the +cbtm+cblm and +xmtm+cblm systems. The comparison shows that its effectiveness is lower, but reasonably consistent. This is not surprising, as the re-ranking operates on a narrow search space and is not able to affect decoding in any way.

Fig. 11

Performance of the discriminative re-ranking module (+rnk) on all English–Italian IT documents, expressed as difference in BLEU (left) and TER (right) from the baseline system. As reference the performance of the +cbtm+cblm and +xmtm+cblm systems are also reported

Results for the stacking of the re-ranking on top of all systems in terms of BLEU and TER differences from the baseline for the English–Italian IT data sets are shown as a scatter plot in Fig. 12. Favorable results lie in the 4th quadrant, with an increase in BLEU and a decrease in TER.

Fig. 12

Test set results in terms of BLEU (x-axis) and TER (y-axis) differences from the baseline system for the re-ranking module stacked with all systems on the English–Italian IT data sets

The re-ranking module worsens BLEU scores in several cases (18 out of 66), but 17 of them refer to test documents set6A–D. We already discussed this in Sect. 6.3, stressing that they do not fit very well into an online adaptation scenario because of their low repetitiveness.

When these sets are excluded, the module increases the BLEU score in the vast majority of cases (41 out of 42); however, TER worsens in more cases (10 out of 42). This can be attributed to the fixation on \(n\)-gram matches enforced by the re-ranking step, which promotes higher-order \(n\)-gram matches through its feature values. See Fig. 13 for two examples taken from the outputs of the baseline system and the baseline stacked with the re-ranking module demonstrating this issue. In both cases, the baseline system has fewer or lower-order \(n\)-gram matches than the hypothesis of the re-ranker, but it would take fewer operations to transform the baseline translations into the references: exactly one swap in the first example, and three operations in the second example.

Fig. 13

Two examples showing a mismatch between BLEU and TER. In the first example, the baseline has fewer \(n\)-gram matches than the re-ranked hypothesis, but fewer edits are needed to reach the reference. In the second example, the number of \(n\)-gram matches is identical, but the re-ranked translation has a 4-g match, while the baseline system does not. Still, fewer edits are needed to make the baseline match the reference. The highest-order \(n\)-gram matches with the reference are in bold font

Finally, we examined positively and negatively weighted features that explain how re-ranking can help the system recover from errors by re-weighting translations. For example, our re-ranking model captures the contextual difference of translating the English “and” into the Italian “e” before a consonant or “ed” before a vowel by assigning a high positive weight to \(n\)-grams such as “DLI ed IBM” and “ed IBM” and a high negative weight to \(n\)-grams such as “DLI e IBM” and “e IBM”. Due to the frequent use of title case in the IT data, the system also learned to prefer phrase pairs with matching case (Life \(\rightarrow \) Vita, machine \(\rightarrow \) macchina) over pairs with case mismatch (Customer \(\rightarrow \) clienti).

6.5 Performance of the full-fledged system

According to the results reported above, the best dynamic system employs both TM (with internal cache) and LM adaptation techniques and the discriminative re-ranking module. We therefore apply this system combination (+cbtm+cblm+rnk) to the other tasks, namely English–Italian and English–Spanish Legal and English–German and German–English Patents. Its performance is reported in Fig. 14.

Fig. 14

Performance of the full-fledged system on all documents of all tasks, expressed as difference in BLEU (left) and TER (middle) from the baseline system, and \(\Delta \)Slope (right)

The full-fledged adapted system outperforms the baseline system across almost all documents of all tasks, except for documents set6A–D of English–Italian IT and set3A–D of the English–Italian Legal. A reasonable explanation for this bad performance was given in Sect. 6.3, and it is related to the relatively low Repetition Rates of those documents, as reported in Table 2. Therefore, we exclude these sets from any further analysis. On the other documents improvements vary a lot and range from \(+1\) to \(+10\) BLEU points and \(-1\) to \(-8\) TER points.

One goal of the proposed online adaptation approaches is the improvement of the system over time. Therefore, the considered systems are evaluated not only in terms of their average BLEU and TER performance, but also in terms of Percentage Slope, a measure of their learning capability as explained in Sect. 3.3. According to the aforementioned discussion, where we assessed the importance of the gain rather than the absolute value of the metric, in Fig. 14 we also plot the \(\Delta \)Slope difference of the full-fledged system from the baseline system.

The plots indicate that the learning capability of the full-fledged system is definitely strong only on test documents set7A–D of the English–Italian IT task, where the \(\Delta \)Slope ranges from 94 to 97 points. For the remaining test sets the gain is much lower or even slightly negative.

A reasonable explanation for this very different behavior can be inferred by considering how the baseline and the full-fledged system perform over time. Figure 15 reports the incremental BLEU achieved by the two systems on two typical sets of the English–Italian IT task, namely set4 and set7A. Apart from the first part of the document (up to segment #50), for which BLEU is not reliable enough due to the small amount of text, the performance of the compared systems on set7A is regular; the full-fledged system constantly improves over the baseline, as the diverging curves show.

Fig. 15

Performance in terms of BLEU of the baseline and full-fledged systems on increasing portions of set7A (left) and set4 (middle) of the English–Italian IT task. Plot on the right zooms in on the first 235 sentences of set4

The behavior on set4 is much less consistent; the big drop from sentence 235 is due to a substantial change in the intrinsic difficulty of the source text. Indeed, by analyzing the text we observed that up to that segment the software manual mainly contains tables of contents and indexed items, whereas a more descriptive and verbose language is used afterwards. By focusing on the first 235 segments (see Fig. 15 right), the learning capability of the full-fledged system is observable again, and it is confirmed by the \(\Delta \)Slope decreasing to 96.05 from the 99.81 computed on the whole of set4.

This analysis suggests a related additional explanation for the low \(\Delta \)Slope figures. Considering a document as a very tiny sub-domain, most of the document-specific information has been learned at some point; after that the learning curve necessarily flattens, and the assumption of an almost constant learning ratio, which the \(Slope\) metric relies on, no longer holds. Nevertheless, the reliability of the metric remains valid if computed over the first part. But to confirm this observation, a deep and specialized investigation into how the \(\Delta \)Slope varies over time should be performed.

In Fig. 16, we give some insights into the translations actually produced by the baseline and dynamic systems, in order to highlight the main pros and cons of the proposed online MT adaptation method. We give three examples from the English–Italian Information Technology task, but similar cases can easily be found in the other tasks.

Fig. 16

Some examples of the translations produced by the baseline and dynamic systems showing typical errors and improvements. Source input and post-edits are also reported

Example 1 shows that the adaptive system was able to capture from the user feedback (Sentence #35) the preferred translation of “consistency” (“di congruenza”) and to properly use it when the phrase appears again in sentence #45. In Example 2 user feedback is immediately exploited to correct not only the lexical error (“Customer” vs “Cliente”) but also the word order error. In Example 3 the term “Pre-configuration”, occurring for the first time in sentence #269, is erroneously translated because it has not been seen before in the training data. Thanks to user feedback the correct translation “Pre-configurazione” is added to the local translation model, which is able to provide the correct translation at the next occurrence in sentence #283. Nevertheless, the system +cbtm+cblm still contains word-reordering (“Pre-configurazione servizi”) and lexical (“Italy”) errors; both errors are fixed by the re-ranking module. This also indicates that system +cbtm+cblm can produce the correct translation, but this does not necessarily have the highest score.

Example 3 also reveals a limitation of our current online adaptation approaches: they strongly rely on the quality and coherence of the user feedback. The English term “Pre-configuration” is inconsistently translated with different word casing, causing an actual, though minor, error in sentence #283. In fact, we have to consider that the feedback exploited here is not a true post-edit and was produced independently of the system. We expect that in a real usage scenario the translator will be influenced by the MT suggestion and consequently will produce more coherent translations.

7 Conclusion

We presented the application of an online learning protocol that fits well with the typical post-editing workflow and achieves a tighter integration of human and machine translation. The protocol offers immediate feedback provided by the human after each translation output, and allows an MT system to learn from this feedback for future translations. Assuming coherent texts, the obvious advantages of this scenario are the possibilities to provide more consistent MT suggestions, to reduce the post-editing effort, and last but not least to enhance the user experience. Our adaptation techniques generalise in some sense the behaviour of some translation memory systems, which perform real-time updates: the MT system is adapted as soon as a segment is post-edited, so that future outputs will reflect the recent translation preferences of the user. The generalization lies in the fact that the MT system learns user preferences both at the sentence level (like a translation memory) and at the phrasal level: i.e. from single words to groups of words. The crucial steps in this learning process are the extraction of relevant parallel and target phrases from the source and post-edited segments, and the assignment of proper scores to such phrases. While we use a constrained search procedure for the first step, we investigated two approaches for assigning scores to the extracted parallel phrases and monolingual target phrases: (i) methods that augment the generative components of the MT system, translation and language models, by building local models based on internal or external caches; (ii) a discriminative method based on a structured perceptron that refines a feature-based re-ranking module applied to the \(k\)-best translations of the MT system. The proposed generative and discriminative approaches are independent and can be straightforwardly cascaded.

A deep investigation and comparison of the proposed adaptation techniques have been conducted on three domains and four translation directions. Evaluations have been carried out by using both reference and post-edited translations. We also related the Repetition Rate capturing the amount of phrase repetitiveness inside a text to effectiveness of our adaptation methods on the different domains we tested.

The main outcomes of our experiments can be summarized as follows: (i) adaptation is highly affected by the level of repetitiveness of the text; (ii) bilingual features are more effective than monolingual features; (iii) the internal cache model is the most effective adaptation method; (iv) generative and discriminative methods are to some extent additive.

Our work also raised interesting issues related to the development of MT systems that learn from user feedback, which have only been partially addressed. First, our cache-based adaptation method shows performance correlating with the repetition rate of the input. As the input text is available in advance and its repetitiveness may vary locally in a significant way, it would be interesting to refine the Repetition Rate measure in such a way as to predict which portions of a text can benefit most from the cache models. Another issue that deserves further investigation is the feature extraction step that is applied to the source and target sides of each post-edited segment. In particular, its effectiveness (precision and recall) could be improved, especially when one or more translation pairs including previously unseen words occur. Finally, work is in progress to actually integrate the discussed adaptation methods in a CAT tool and to field test them with professional translators. Relevant aspects we are focusing on are the latency of the adaptation step, which should not delay the human workflow, and the concurrent use of cache models by multiple users translating similar documents.