
1 Introduction

It is generally accepted that statistical machine translation (SMT) provides sufficiently good translation results given in-domain test data and “enough” training data. Results drop rapidly for out-of-domain test data. Therefore, much research in recent years has been directed towards domain adaptation of SMT systems, e.g. [10]. Especially for European languages, current state-of-the-art SMT engines are trained on one of the two large corpora available, JRC-Acquis or Europarl, and special techniques are applied in a second phase to ensure lexical domain adaptation. Less attention is paid to the fact that, even within one domain, corpora belong to different text genres or, at least, have different discourse structures and, therefore, different types of syntactic structures or semantic frames. These differences may have a bigger influence on the quality of an SMT system than assumed until now.

1.1 The Context for the Experiments

This aspect is of particular importance in scenarios where a machine translation engine is part of a complex architecture exposed to textual input from heterogeneous domains or text genres. This is the case for a Web Content Management System (WCMS) such as the ATLAS system.

In this system, several web services based on advanced language technology components are built for seven European languages. Among the key technologies incorporated, a central role is played by machine translation. Due to the lack of sufficient training data for all possible domains, the data-driven translation engine is trained mostly on the JRC-Acquis corpus and domain adaptation is performed afterwards. For domains for which no training model is available, the user is informed that the translation quality may lack accuracy.

As the acceptance of such a system depends to a large extent on user acceptance, we decided to also investigate the extent to which the text genre of the input can influence the translation quality.

This paper presents several SMT experiments with different test data (in-domain vs. out-of-domain vs. ‘similar’ data) using the JRC-Acquis corpus for training. The language pair considered is English-Romanian. The originality of the work lies not in the MT approach involved, but in the way the test data is chosen. SMT experiments using JRC-Acquis and Romanian-English as the language pair have been presented in [2, 6] and [9]; their results are summarized in Table 1.

Table 1. Previously reported results

SMT experiments have usually been performed and presented with in-domain data; see, for example, the experiments from [9] or [6].

An overview of how (rule-based) machine translation (MT) reacts to various text genres is given in [1], where the MT system used is SYSTRAN. The study analyzed machine-translated extracts from four text genres with respect to different linguistic errors. The best results were obtained for technical sets of instructions.

Our paper is organized as follows. After this short introduction we present the environment of the MT engine in Sect. 2, while in Sect. 3 we describe our experimental settings: the MT system and the training and test data.

In Sect. 4 we show the evaluation results, followed in Sect. 5 by a discussion of the factors which influence them. Conclusions and further work are presented in Sect. 6, and the last part of the paper contains our acknowledgments.

2 The ATLAS Content Management System

The core online service of the ATLAS platform is i-Publisher, a powerful Web-based instrument for creating, running and managing content-driven Web sites. It integrates language-based technology to improve content navigation, e.g. by interlinking documents based on extracted phrases, words and names, and by providing short summaries and suggested categorization concepts. Currently two different thematic content-driven Web sites, i-Librarian and EUDocLib, are being built on top of the ATLAS platform, using i-Publisher as the content management layer. i-Librarian is intended to be a user-oriented web site which allows visitors to maintain a personal workspace for storing, sharing and publishing various types of documents, and to have them automatically categorized into appropriate subject categories, summarized and annotated with important words, phrases and names. EUDocLib is planned as a publicly accessible repository of EU legal documents from the EUR-LEX collection with enhanced navigation and multilingual access.

The i-Publisher service:

  • is mainly targeted at small enterprises and non-profit organizations,

  • gives the ability to build, via a point-and-click user interface, content-driven Web sites which provide a wide set of pre-defined functionalities and whose textual content is automatically processed, i.e. categorized, summarized, annotated, etc.,

  • enables publishers, information designers and graphic designers to easily collaborate,

  • aims at saving authors, editors and other contributors valuable time by automatically processing textual data and allows them to work together to produce high-quality content. The last evaluation round of the service indicates that users really do see the benefit of the language technologies embedded in the system.

The i-Librarian service:

  • addresses the needs of authors, students, young researchers and readers,

  • gives the ability to easily create, organize and publish various types of documents,

  • allows users to find similar documents in different languages, to share personal works with other people, and to locate the most relevant texts from large collections of unfamiliar documents.

The EUDocLib service is a particular refinement of i-Librarian targeted at the management of documents from the European Commission.

The facilities described above are supported by intelligent language technology components such as automatic classification, named entity recognition and information extraction, automatic text summarization, machine translation and cross-lingual retrieval. These components are integrated into the system in a brick-like architecture, which means that each component builds on top of the others. The baseline brick is the language processing chains component, which ensures homogeneous linguistic processing of all documents independent of their language. A processing chain for a given language includes a number of existing tools, adjusted and/or fine-tuned to ensure their interoperability. In most respects a language processing chain does not require the development of new software modules, but rather the combination of existing tools. The basic ATLAS software is distributed as a software package under a GPL license. LT plug-ins, such as the language processing chains or the MT engine, follow a commercial licensing model. i-Librarian is available as a web service with unrestricted access.

2.1 Machine Translation in ATLAS System

Machine translation is a key component of the ATLAS system. The development of the engine is particularly challenging, as the translation must work across different domains. Additionally, the considered language pairs belong to the less-resourced group, for which bilingual training and test material is available only in limited amounts.

The machine translation engine is integrated in two distinct ways into the ATLAS platform:

  • for the i-Publisher service (the generic platform for generating websites), the MT engine serves as a translation aid for publishing multilingual content. Text is submitted to the translation engine and the result is subject to human post-processing;

  • for i-Librarian and EUDocLib (dedicated web services for collecting documents), the MT engine provides translation for assimilation: a user retrieving documents in different languages uses the engine to get the gist of the documents and to decide whether to store them. If the translation is considered acceptable, it is stored in a database.

The integration of a machine translation engine into a web-based content management system in general, and into the ATLAS system in particular, presents several challenges from the user's point of view, among which we mention two that the ATLAS system deals with:

  1. The user may retrieve documents from different domains. Domain adaptation is a major issue in machine translation, in particular in corpus-based methods. Poor lexical coverage and false disambiguation are the main issues when translating documents outside the training domain.

  2. The user may retrieve documents from various time periods. As language changes over time, language technology tools developed for modern languages do not work equally well on diachronic documents.

With currently available technology it is not possible to provide a translation system which is independent of domain and language variation and works for a number of heterogeneous language pairs. Therefore our approach envisages a system of user guidance, so that the availability and the foreseen system performance are transparent at all times.

For the MT engine of the ATLAS system we decided on a hybrid architecture combining EBMT [4] and SMT [8] at the word level (no syntactic trees are used). An original aspect of our system is the interaction of the MT engine with other modules of the system:

  • The document categorization module assigns one or more domains to each document. For each domain, the system administrator can store information regarding the availability of a corresponding specific training corpus. If no trained model exists for the respective domain, the user is given a warning that the translation may be inadequate with respect to lexical coverage.

  • The output of the summarization module is processed in such a way that ellipses and anaphora are omitted, and lexical material is adapted to the training corpus.

  • The information extraction module provides metadata about the document, including its publication age. For documents published before 1900 we do not provide translation, explaining to the user that in the absence of a training corpus the translation may be misleading.

The domain and dating restrictions can be changed at any time by the system administrator when an adequate training model is provided. The described architecture is presented in Fig. 1.

Fig. 1. System architecture for the ATLAS engine
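As a concrete illustration of this user guidance, the following minimal sketch combines the domain-availability warning and the dating restriction described above. All names, domains and messages are hypothetical; the paper publishes no code.

```python
# Hypothetical sketch of the ATLAS user-guidance policy; the domain list,
# cutoff year and messages are illustrative assumptions, not the real system.
TRAINED_DOMAINS = {"eu-law"}   # domains with a specific trained model
CUTOFF_YEAR = 1900             # no translation for older documents

def translation_status(domains, publication_year):
    """Decide whether a document may be translated and how to warn the user."""
    if publication_year < CUTOFF_YEAR:
        return "refused: no training corpus, translation may be misleading"
    if not any(d in TRAINED_DOMAINS for d in domains):
        return "warning: translation may be lexically inadequate"
    return "ok"

print(translation_status(["eu-law"], 1995))   # -> ok
print(translation_status(["poetry"], 1890))   # -> refused: ...
```

Both restrictions remain editable by the system administrator, mirroring the behaviour described above.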

3 Experiments

In this section we present the SMT system and the training and test data used in the experiments.

3.1 The SMT System

The SMT system follows the baseline architecture described for the EMNLP 2011 Sixth Workshop on SMT. The system uses Moses, an SMT toolkit that allows the user to automatically train translation models for the required language pair, provided the necessary parallel aligned corpus is available. More details about Moses can be found in [8].

While running Moses, we used SRILM [16] for building the language model (LM) and GIZA++ [11] for obtaining word alignment information. We made two changes to the specifications given for the Workshop on SMT: we left out the tuning step and we changed the order of the language model from 5 to 3.

Leaving out the tuning step was motivated by results we obtained in experiments outside the scope of this paper, in which we compared different SMT settings: not all tests of the system configuration that included tuning showed improvements in the evaluation scores. Changing the LM order was motivated by results reported in the SMART project, which concluded that 3-gram configurations provide the best results; see [13].
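As an illustration, the order-3 LM can be built with SRILM's ngram-count as in the sketch below; the file names are hypothetical and the smoothing options are common baseline defaults, not settings confirmed in this paper.

```python
# Sketch: build the 3-gram LM with SRILM (file names and smoothing assumed).
import subprocess

subprocess.run(
    [
        "ngram-count",
        "-order", "3",              # 3-gram LM instead of the workshop's 5-gram
        "-interpolate",
        "-kndiscount",              # modified Kneser-Ney smoothing (assumption)
        "-text", "train.tok.tgt",   # tokenized target-side training text
        "-lm", "train.lm.arpa",     # ARPA-format LM later passed to Moses
    ],
    check=True,
)
```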

3.2 Training Data

The training data is part of the JRC-Acquis English-Romanian corpus. JRC-Acquis is a freely available parallel corpus in 22 languages, which consists of European Union (EU) documents of a legal nature. It is based on the Acquis Communautaire (AC), the total body of EU law applicable in the EU Member States. This collection of legislative texts changes continuously and currently comprises selected texts written between the 1950s and today.

Of the two types of sentence alignments available (Vanilla and HunAlign), we used the Vanilla alignments; the same alignments were also used in [6]. The sentence alignment is done at the paragraph level, where a paragraph can be a simple or complex sentence, a sub-sentential phrase (e.g. a noun phrase) or even several sentences. In order to reduce possible errors, only one-to-one alignments have been considered for the experiments presented in this paper. More details on the JRC-Acquis corpus can be found in [15].

The corpus has not been (manually) corrected. Therefore, translation, alignment or spelling errors can have an influence on the output quality.

For the SMT experiments, of the 391324 links in 6557 documents, only the 336509 one-to-one alignment links have been considered. Due to the cleaning step of the SMT system, the number of one-to-one alignment links actually used for training the translation model (TM) was further reduced to 240219, which represents 61.38 % of the initial corpus; a sketch of such a filter is given below. The average sentence length is around 14.5 tokens. (In this paper, a token means a word, a number or a punctuation sign.)
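The exact cleaning thresholds are not reproduced here, so the usual Moses limits (1-40 tokens per side, 9:1 length-ratio cap) are an assumption in this sketch:

```python
# Approximation of the Moses corpus-cleaning step; the 1-40 token limits and
# the 9:1 length-ratio cap are assumptions, not values confirmed by the paper.
def keep_pair(src, tgt, min_len=1, max_len=40, max_ratio=9.0):
    ns, nt = len(src.split()), len(tgt.split())
    if not (min_len <= ns <= max_len and min_len <= nt <= max_len):
        return False
    return max(ns, nt) <= max_ratio * min(ns, nt)

def clean(pairs):
    """Filter the one-to-one alignment links; here 336509 links became
    240219, i.e. 240219 / 391324, the 61.38 % reported above."""
    return [(s, t) for s, t in pairs if keep_pair(s, t)]
```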

3.3 Test Data

We used test data from three different corpora:

  • JRC-Acquis itself (Case A): in-domain data;

  • Europarl (Case B): ‘similar’ data (in-domain, but of a different genre, i.e. out-of-genre data);

  • RoGER (Case C): out-of-domain data.

The first two corpora can be considered to belong to the same domain, as both refer to EU matters, but they are of different genres: JRC-Acquis contains EU regulations, while Europarl is extracted from the verbatim reports of the debates in the European Parliament. RoGER represents a totally different domain, as it contains text from the manual of an electronic device. The separation of these texts has been done by inspection and intuition.

A: JRC-Acquis The tests were run on parts of the JRC-Acquis which were not used for training: 897 sentences (three sets of 299 sentences: A: Test 1, A: Test 2, A: Test 3) were removed from the initial corpus before the training step, to be used as test data. Sentences were removed from different parts of the corpus to ensure relevant lexical, syntactic and semantic coverage. The A: Test 1+2+3 data set contains all 897 sentences.

The test data has not been cleaned, meaning that no length restriction was applied and sentences may be repeated. For example, the paragraph “Article NUMBER” repeats itself 53, 44 and 11 times in A: Test 1, A: Test 2 and A: Test 3, respectively. The data is in-domain data. The average sentence length is around 21 tokens.
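The held-out split can be sketched as follows. The paper specifies only that the three 299-sentence sets come from different parts of the corpus, so the beginning/middle/end offsets below are an assumption.

```python
# Sketch of the held-out split: three disjoint test sets of 299 sentence
# pairs; the exact offsets (beginning, middle, end) are assumptions.
def make_test_sets(pairs, set_size=299):
    n = len(pairs)
    starts = [0, (n - set_size) // 2, n - set_size]
    tests = [pairs[s:s + set_size] for s in starts]
    test_idx = {i for s in starts for i in range(s, s + set_size)}
    train = [p for i, p in enumerate(pairs) if i not in test_idx]
    return train, tests   # train feeds SMT training; tests stay unseen
```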

B: Europarl The Europarl parallel corpus [7] is extracted from the proceedings of the European Parliament (the verbatim reports of the debates), dating back to 1996; its latest version contains twenty-one languages.

From version 6 of the corpus we extracted three different test data sets, each of 299 sentences, from the English-Romanian data. As for JRC-Acquis, we extracted the data from different parts of the corpus: the beginning, the middle and the end. Small corrections were necessary, as sentences in other languages were occasionally encountered.

The test data sets from this corpus are B: Test 1, B: Test 2, B: Test 3 and B: Test 1+2+3. The average sentence length is around 13 tokens; however, for B: Test 1 and B: Test 2 it is around 7.5, and for B: Test 3 it is 24.5. The data is in-domain, but of a different genre than the training data: the structure and discourse of the text are totally different from those of JRC-Acquis, while the text refers to similar matters as the training data, namely European regulations. We consider these test data sets ‘similar’ test data.

C: RoGER In order to analyze the performance of SMT systems on a totally different type of text input, we used the RoGER corpus.

RoGER is a parallel corpus aligned at sentence level. It is domain-restricted, as the texts come from the user's manual of an electronic device. The languages included in this corpus are Romanian, English, German and Russian. The corpus was manually compiled and verified: the translations and the (sentence) alignments were manually corrected. It is not annotated, and diacritics are ignored. More about the RoGER corpus can be found in [5].

From the 2333 sentences, we extracted 300 sentences from the middle of the corpus and used them as test data (C: Test). The average sentence length is around 15 tokens. The data is entirely out-of-domain.

4 Evaluation Results

We evaluated our translations using three automatic evaluation metrics: BLEU, NIST and TER. The choice of metrics is motivated by the (linguistic) resources we had available and by the results reported in the literature. Due to the lack of data and of further translation possibilities, only one reference translation is used in these experiments.

Although criticized, BLEU (bilingual evaluation understudy) has been the most widely used score for MT evaluation in recent years. It measures the number of n-grams of different lengths from the system output that appear in a set of reference translations. More details about BLEU can be found in [12].
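For reference, the standard definition from [12] combines the modified n-gram precisions p_n (typically up to N = 4, with uniform weights w_n = 1/N) with a brevity penalty BP, where c is the total length of the system output and r the reference length:

```latex
\mathrm{BLEU} = \mathrm{BP} \cdot \exp\Bigl(\sum_{n=1}^{N} w_n \log p_n\Bigr),
\qquad
\mathrm{BP} =
\begin{cases}
1 & \text{if } c > r,\\
e^{\,1 - r/c} & \text{if } c \le r.
\end{cases}
```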

The NIST score, described in [3], is similar to the BLEU score in that it also uses n-gram co-occurrence precision. Whereas BLEU takes the geometric mean of the n-gram precisions, NIST calculates the arithmetic mean. Another difference is that the n-gram precisions are weighted by the n-gram frequencies, so that rarer, more informative n-grams receive higher weight.

TER (translation edit rate) calculates the minimum number of edits needed to transform the obtained translations into the reference translations, normalized by the average length of the references. It considers insertions, deletions and substitutions of single words, as well as an edit operation which moves sequences of words (shifts). More information about TER can be found in [14].
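In the notation of [14], with block movements (shifts) counted as single edits:

```latex
\mathrm{TER} =
\frac{\#\,\text{insertions} + \#\,\text{deletions}
      + \#\,\text{substitutions} + \#\,\text{shifts}}
     {\text{average number of reference words}}
```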

The evaluation results obtained are presented in Tables 2 and 3. The BLEU results are shown graphically in Fig. 2.

Table 2. Evaluation results (Romanian-English)
Table 3. Evaluation results (English-Romanian)
Fig. 2. BLEU results

The results for in-domain data are similar to other BLEU scores published in the literature (with the exception of the test data set A: Test 1 for Romanian-English). The out-of-domain data yields quite low results. The results for ‘similar’ data are, somewhat surprisingly, closer to those for the out-of-domain data.

A direct comparison with the results in [1] is not possible as there are several important differences, such as the MT approach and the evaluation methodology.

5 Analyzing the Results – Factors of Influence

Several aspects connected with the type of test data can influence the translation results. In this paper we analyze the number of out-of-vocabulary words (OOV words) and the number of test sentences already encountered in the training data. The tests have been run in a realistic scenario, with no human intervention on the test data (such as choosing specific average sentence lengths or testing for inclusion in the training data).

An overview of the OOV words and of the test sentences already encountered in the training data is presented in Tables 4 and 5. The OOV words are extracted by analyzing surface forms only. This means that a word may be present in the training data as a lemma while a specific word form is missing.

Table 4. Data description (Romanian-English)
Table 5. Data description (English-Romanian)
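Both statistics can be computed in a few lines; the sketch below (hypothetical input: lists of tokenized sentences) follows the surface-form criterion described above. Whether the reported percentages count tokens or types is not stated, so the token-level variant is an assumption.

```python
# Sketch of the two statistics behind Tables 4 and 5 (surface forms only).
def oov_rate(train_sents, test_sents):
    """Share of test tokens whose surface form never occurs in training."""
    vocab = {tok for sent in train_sents for tok in sent.split()}
    test_tokens = [tok for sent in test_sents for tok in sent.split()]
    return sum(tok not in vocab for tok in test_tokens) / len(test_tokens)

def seen_sentence_rate(train_sents, test_sents):
    """Share of test sentences that appear verbatim in the training data."""
    seen = set(train_sents)
    return sum(sent in seen for sent in test_sents) / len(test_sents)
```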

Comparing the OOV words of Test 1+2+3 for Europarl with those of Test 1+2+3 for JRC-Acquis, we can conclude that the two sets are (almost) completely disjoint: only three words for English-Romanian and two for Romanian-English are common to both sets.

As expected, for the translation direction Romanian-English the highest number of OOV words appears in data C (RoGER; out-of-domain data): 37.67 %. However, for English-Romanian, Test 2 from Europarl (data B; ‘similar’ data) contains the highest number of OOV words: 18.68 %. The out-of-domain data (data C) has only 14.65 % OOV words.

A deeper analysis of the OOV words in the different test data sets would be needed for a more realistic overview. For example, it is possible that in data B, due to the text genre, more declension or conjugation forms have been used than in data A. Therefore, the use of a lemmatizer in the translation process could improve the translation results. Concerning the number of test sentences already found in the training data, excluding the in-domain data, more such sentences were found for English-Romanian with ‘similar’ data. For Romanian-English the figures are similar for both out-of-domain and ‘similar’ data: under 1 %.

6 Conclusions

In this paper we presented several SMT experiments with different test data (in-domain vs. out-of-domain vs. ‘similar’ data) using the JRC-Acquis (English-Romanian) corpus for training. The results for in-domain and out-of-domain data are as expected. Somewhat surprisingly, the results for ‘similar’ data are closer to the results for out-of-domain data: the differences in discourse and vocabulary lowered the translation scores for the Europarl tests, although these texts belong to the same European framework as the training data. This shows that, when only ‘similar’ data is available for a specific domain, we cannot always expect good translation results. The conclusions of this paper are limited to the data we used and should be seen only as a starting point for further analyses. A manual analysis of the translations should provide a better overview of the automatic scores and of the sources of errors. Further experiments with various corpora and language pairs are needed before drawing a final (more general) conclusion.