6.1 Introduction

Building a statistical machine translation (SMT) system requires a large amount of parallel data for model training. Reasonably good results can be achieved when the domain of the training corpus is close to the test data.

There are only a few parallel corpora publicly available for the lesser spoken languages of Europe. Several large-scale, highly multi-lingual parallel language resources, such as the JRC-Acquis corpus (Steinberger et al. 2006), the DGT-TM (Steinberger et al. 2012) and the DCEP corpus (Najeh et al. 2014), are made available by the European Commission’s Joint Research Centre (JRC) and other European Union organisations (Steinberger et al. 2014). Further corpora are available in the OPUS collection (Tiedemann 2009, 2012). SETimes (Tyers and Alperen 2010) is a parallel corpus compiled from a multi-lingual news website, covering English and eight South-East European languages (Albanian, Bulgarian, Croatian, Greek, Macedonian, Romanian, Serbian and Turkish).

For many under-resourced languages, multi-lingual comparable resources are widely available, and data extracted from such resources can be useful for machine translation. While methods for using parallel corpora in MT are well studied, methods and techniques for exploiting comparable corpora have not been thoroughly investigated.

Research on applying comparable corpora to SMT has shown that adding aligned parallel lexical data extracted from comparable corpora (additional phrase tables and their combinations) to the training data of an SMT system improves the system’s performance in terms of untranslated word coverage (Hewavitharana and Vogel 2008; Xu et al. 2006; Zhang 2011). It has also been demonstrated that language pairs with little parallel data can benefit the most from the exploitation of comparable corpora (Lu et al. 2010).

Xu et al. (2006) exploit comparable data to extract a parallel corpus. Their approach breaks documents into segments using pre-defined anchor words and then aligns these segments. In order to avoid alignment errors, they present an advanced approach that extracts parallel sentences recursively by partitioning a bilingual document into two pairs. For Chinese–English data, this method produced translation results comparable to those of a state-of-the-art sentence aligner. A combination of the two approaches led to better translation performance.

Munteanu and Marcu (2006) achieved significant performance improvements from large comparable corpora of news feeds for English, Arabic and Chinese over a baseline MT system trained on existing available parallel data. The authors stated that the impact of comparable corpora on SMT performance is ‘comparable to that of human translated data of similar size and domain’.

Irvine and Callison-Burch (2013) used comparable corpora to improve accuracy and coverage of phrase-based MT built on small amounts of parallel data. They showed that adding translations of low-frequency words from comparable corpora improves performance beyond what is achieved by inducing translations for out-of-vocabulary words alone and that data from comparable corpora improves BLEU score (Papineni et al. 2002).

Most of the experiments are performed with widely used language pairs, such as French–English (Abdul-Rauf and Schwenk 2009, 2011), Arabic–English (Abdul-Rauf and Schwenk 2011) or English–German (Ştefănescu et al. 2012), while possible exploitation of comparable corpora for machine translation tasks is less studied for under-resourced languages (e.g. Skadiņa et al. 2012).

In this chapter, we analyse the impact of data extracted from comparable corpora on the machine translation task (both data-driven and rule-based) for under-resourced languages and narrow domains. Section 6.2 describes experiments to improve SMT systems trained on available parallel data by integration of additional data from comparable corpora for application in the general domain translation task. Section 6.3 proposes a methodology for how to assess changes in translation quality for systems enhanced with data extracted from comparable corpora and describes human evaluation results for eleven language pairs. Section 6.4 focusses on MT adaptation for a particular domain with the help of domain data extracted from comparable corpora. The last three sections deal with use cases. Section 6.5 analyses German–English MT adaptation to the automotive domain for both (rule-based and SMT) approaches. Section 6.6 analyses the role of machine translation in Web authoring, while Sect. 6.7 discusses the application of MT systems, enriched with data from comparable corpora, in computer-aided translation.

6.2 Enriching General Domain SMT Systems with Data from Comparable Corpora

In this section, we describe experiments to improve SMT systems trained on available parallel data (we call them baseline systems) by integration of additional data from comparable corpora for application in the general domain translation task.

6.2.1 Data Used for Experiments

The following publicly accessible parallel corpora were used to set up baseline SMT systems for the experiments: JRC: JRC-Acquis, DGT: DGT-TM (Steinberger et al. 2012), SETimes, Europarl, and News Commentary. Table 6.1 shows the size of the training data that was used to train the baseline systems.

Table 6.1 Size of corpora for baseline systems

We organised our experiments into three groups of translation directions. The first group uses JRC and DGT for training, and the second group uses SETimes. Although the combined JRC and DGT data is fairly large, its domain is largely limited to legislation/law. Systems based on such a data set perform poorly on general translation tasks in other, open domains, despite the high translation quality reported for in-domain tests in previous literature. Therefore, we still consider these language pairs under-resourced. The second group is the opposite: its baseline systems are based on the SETimes corpus, which covers a relatively broad range of topics and is much smaller in size than JRC or DGT. The third group includes only German–English as a control group, for which we used both Europarl and News Commentary. This dataset has a presumably open domain and a large size. This setup allows more contrastive studies of the effect of using comparable corpora, as the same German–English setup has been used for state-of-the-art systems.

As for language model (LM) training, we use the target portion of the corresponding parallel data.

To enrich the baseline SMT systems, we use data extracted from comparable corpora collected by tools described in Chap. 3. We distinguish between the data extracted from news corpora (News) and Wikipedia articles corpora (Wiki).

The ACCURAT toolkit (Pinnis et al. 2012a) was used to extract semi-parallel sentences from the aligned comparable corpora. Table 6.2 gives the statistics about the extracted data. The amount of data varies a lot between language pairs and also between the two comparable corpora.

Table 6.2 Statistics of the extracted semi-parallel data from comparable corpora

We used the News corpus to adapt the language models. The amount of data is reported in Table 6.3.

Table 6.3 Statistics of monolingual comparable corpora

We tune all models on the same development set (Table 6.4) to get comparable results. The tuning is performed using minimum error rate training (MERT; Och 2003).

Table 6.4 Statistics about development data

Additionally, we make use of the target language tuning texts to interpolate the language models as described in the next subsection.

6.2.2 Methodology

When improving SMT systems, we need to look at the two models used in translation: the translation model (TM) and the language model (LM). The comparable data can be used to adapt both models.

6.2.2.1 Mixture Translation Model

Including additional parallel corpora as training data for an SMT system usually yields an improvement to a certain extent. However, the additional texts can also introduce errors that do not exist in the original model. This is especially likely to happen when the parallel texts are not translations of each other, for example when sentences from the comparable corpora are misaligned. On the other hand, for various reasons, the added data might not be prominent enough among the other sources of training data to help the SMT system recover from errors in the baseline system. Therefore, in addition to a single translation model built from both the parallel corpora and the comparable data as a whole, we experimented with mixture models that distinguish texts from different sources.

The mixture models, introduced by Xu et al. (2007), start from individual models that are generated separately from the sets of texts from different sources. The most straightforward way is to divide the data into two subsets: the original parallel corpora versus the aligned texts extracted from the comparable corpus. Such a partition may result in a model very close to the baseline when the sizes of the two subsets differ greatly, as the mixture model would then rely mostly on the larger subset. Thus, in order to emphasise and better control the contribution of parallel and comparable data to the final translation, we choose to further divide the original parallel data into separate corpora, from each of which we generate a different translation model. This approach also allows us to understand the influence of each individual corpus (parallel or comparable) in the SMT system, which is especially important when the parallel corpora used in the baseline systems come from very different domains.

As state-of-the-art word alignment tools such as GIZA++ tend to perform poorly on limited amounts of data, we generate the word alignments for the mixture model by training over the combination of all the training data, that is, the parallel data together with the sentence pairs extracted from the comparable corpus, in order to find sufficient alignment points for constructing a translation model. After this step, the word alignments are split into segments corresponding to the individual corpora.

We construct the individual translation models from the word alignments for each corpus. The models are then sorted by the size of the corresponding training corpora, given the fact that the probabilistic estimation over a larger set of data is usually more reliable.

The other models are appended to the largest model in this sorted order such that only phrase pairs that were never seen previously are included. Lastly, we add new features (in the form of additional columns) to the phrase table of the final translation model to indicate each phrase pair’s origin. Each new column corresponds to one model, including the original model. If a phrase table entry appears in a model, its feature value in the corresponding column is 2.718; otherwise, it is 1.
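
To make the merging procedure concrete, the following Python sketch illustrates one possible implementation. It is an illustration only, not the code used in our experiments: phrase tables are assumed to be in a simplified Moses text format (‘source ||| target ||| scores’), and all file and function names are hypothetical.

```python
PRESENT = "2.718"  # e, mirroring the value of the Moses phrase penalty

def load_phrase_table(path):
    """Read a simplified Moses-style phrase table: 'src ||| tgt ||| scores' per line."""
    table = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            fields = [x.strip() for x in line.split("|||")]
            src, tgt, scores = fields[0], fields[1], fields[2]
            table[(src, tgt)] = scores
    return table

def build_mixture_table(tables, out_path):
    """Merge phrase tables, largest first, and append one origin feature per sub-model."""
    tables = sorted(tables, key=len, reverse=True)   # larger corpora give more reliable estimates
    merged = {}
    for table in tables:
        for pair, scores in table.items():
            merged.setdefault(pair, scores)          # keep only phrase pairs not seen before
    with open(out_path, "w", encoding="utf-8") as out:
        for (src, tgt), scores in merged.items():
            origin = " ".join(PRESENT if (src, tgt) in t else "1" for t in tables)
            out.write(f"{src} ||| {tgt} ||| {scores} {origin}\n")

# e.g. build_mixture_table([load_phrase_table(p) for p in
#                           ("jrc.pt", "dgt.pt", "cc_extracted.pt")], "mixture.pt")
```

During tuning (MERT), each origin column then receives its own weight, so the decoder can learn how much to trust phrase pairs from each source.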

Table 6.5 shows a few sample entries from the phrase table of a mixture model created in our experiments for English–Latvian translation. The first five columns are the probabilistic scores estimated in standard phrase-based SMT training: the inverse phrase translation probability φ(f|e), the inverse lexical weighting lex(f|e), the direct phrase translation probability φ(e|f), the direct lexical weighting lex(e|f) and the phrase penalty, which is always e^1 = 2.718. Following the scheme used for the phrase penalty, we added three additional columns to the phrase table, corresponding to the three individual models sorted by size. In this example, the first column refers to the JRC model, the second column refers to the DGT model and the last column is for the extracted comparable corpus. The values in these three columns are either 2.718 or 1, indicating whether the phrase pair exists in the respective individual model. For example, the last three columns for the phrase pair ‘economic approaches’–‘ekonomiskas metodes’ are 1, 2.718 and 1. This means that this pair originally comes from the DGT model and does not appear in the other two.

Table 6.5 Sample entries from the phrase table of a mixture model for English–Latvian

In the mixture model, segments repeated in many sources are considered more probable translations. On the other hand, pieces unique to a single source may provide valuable information, such as domain-specific terminology from the comparable corpus. The former case corresponds to phrase pairs with very high probabilities, whereas the latter are still included in the model.

6.2.2.2 Interpolating Language Models

To make the best use of the fact that our language models have been trained on different texts, we want to combine them into one and adapt the n-gram probabilities accordingly. Although, for example, our baseline JRC and DGT language models are out of domain, we do not want to completely lose the information they contain. On the other hand, these models are big enough that they can overpower the influence of the new language model that has been trained on much smaller amounts of data. Here we need to adjust the n-gram probabilities so that they mirror what we would expect from our target domain.

Combination is done by optimising the perplexity of the interpolated language model on an in-domain development text in the target language. We then obtain a lambda (interpolation weight) for each language model used; with these, we can adjust the probabilities for each n-gram. In this way, we combine the probabilities from the different language models into a single model (Schwenk and Koehn 2008).
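
As an illustration of this step, the sketch below estimates the interpolation weights by expectation maximisation over per-token probabilities on the development text. This is only a minimal illustration of the idea (standard LM toolkits offer equivalent functionality); the function name and the toy input are assumptions.

```python
def estimate_interpolation_weights(token_probs, iterations=50):
    """Estimate LM interpolation weights (lambdas) by EM on a development text.

    token_probs: one row per development-set token; each row lists the probability
    that each component language model assigns to that token in its context.
    Returns weights that (locally) minimise the perplexity of the mixture.
    """
    n_models = len(token_probs[0])
    lambdas = [1.0 / n_models] * n_models            # start from a uniform mixture
    for _ in range(iterations):
        posteriors = [0.0] * n_models
        for probs in token_probs:
            mix = sum(l * p for l, p in zip(lambdas, probs))
            if mix == 0.0:
                continue
            for k in range(n_models):
                posteriors[k] += lambdas[k] * probs[k] / mix   # E-step: expected counts
        total = sum(posteriors)
        lambdas = [c / total for c in posteriors]              # M-step: re-normalise
    return lambdas

# toy usage: two LMs scoring a three-token development text
# print(estimate_interpolation_weights([[0.02, 0.10], [0.05, 0.01], [0.01, 0.20]]))
```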

The interpolated language model will then be used for the new SMT system.

6.2.3 Experiments with Data Extracted from Comparable Corpora

In total, we worked on seventeen language pairs: English–Latvian, English–Lithuanian, English–Estonian, English–Greek, English–Croatian, Croatian–English, English–Romanian, Romanian–English, English–Slovenian, Slovenian–English, German–English, German–Romanian, Romanian–German, Greek–Romanian, Romanian–Greek, Lithuanian–Romanian, Latvian–Lithuanian. Our main concern is to translate from English, but we also investigate a few language pairs that do not involve English and for which there is very little data available.

We trained state-of-the-art phrase-based models using 7-gram phrase tables and 5-gram interpolated language models. For training, we used the data described in Table 6.1, where the parallel data was used for the translation model and the target language text was used to generate the language model. For the language pairs using DGT and JRC, as well as for German–English, we interpolated the language models built on the two baseline corpora using the target side of our development set. This is the same set on which we later optimised the SMT translation parameters using minimum error rate training (MERT); it is listed in Table 6.4.

Then, for each language pair, we trained systems using the additional data described in Table 6.2. We use the same general settings for training the enriched models as we did for training the baseline models. We trained separate models for the data extracted from the News and the Wiki data to examine the influence of the different sorts of data.

For the interpolated model, we use the target side of both the baseline parallel data and the collected comparable corpus. The translation model is trained on the extracted parallel data and the baseline corpora. We apply this approach to both the News and the Wiki extracted data. For the language model, we use the comparable News corpus for both News and Wiki experiments.

For the mixture model, we trained a phrase table on each individual corpus and then combined them into a single mixture translation model. For the language model, we used the interpolated language models.

All systems were tested on the same test set, which consists of 511 sentences from general domain text (Skadiņš et al. 2010). Table 6.6 lists the results for all experiments on interpolated language models and mixture models. Figures in bold indicate models that outperform the baseline. The best model for each language pair is denoted with an asterisk.

Table 6.6 Evaluation results (BLEU scores) for all experiments

We see that not every approach works equally well for each language direction. The largest improvement in BLEU score is observed for those language pairs whose baseline corpus was only the SETimes corpus, with fewer than 200,000 lines per language pair. The improvements are smaller for the language pairs using DGT/JRC. For some of the language pairs, we did not observe any improvement from adding the data, and we therefore further investigated the English–Lithuanian pair. We describe these experiments in the next subsection.

6.2.4 Staggered Experiments

The LEXACC tool, which is described in Chap. 5, assigns a score to each sentence pair extracted from comparable corpora, denoting how likely it is that the two sentences are parallel. As such, the LEXACC score should allow us to predict how usable a particular chunk of data is, that is, whether using this data will increase translation quality.

To test this influence of the LEXACC score, we split up the extracted data. We want to check the effect of the score both in intervals and in a cumulative fashion. The hypothesis for the former is that data with a higher LEXACC score should be more helpful than data with a lower score. In the cumulative experiments, we choose different thresholds. As the score decreases, the data becomes less parallel, and more errors are introduced into the translation model. However, the distribution of the data follows Zipf’s law: there are very few items with a very high score, and the lower the score, the more sentences LEXACC extracts. We also need to take into account how much data is available: for higher thresholds, LEXACC can only extract small amounts of data. Here we are interested in the threshold that allows the maximal increase in translation quality for the amount of data used. This threshold may vary for different corpora, an effect we also want to examine.
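
The following sketch shows how such interval and cumulative chunks could be produced from LEXACC output; it assumes the extracted sentence pairs are available as (source, target, score) tuples and is only an illustration of the data split, not the actual experiment scripts.

```python
def split_by_lexacc_score(pairs, thresholds=(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9)):
    """Split scored sentence pairs into interval and cumulative chunks.

    pairs: iterable of (source, target, score) tuples, e.g. parsed from LEXACC output.
    Returns two dicts keyed by threshold: interval[t] holds pairs with t <= score < t + 0.1,
    cumulative[t] holds all pairs with score >= t. Pairs below 0.1 (the default LEXACC
    threshold) are ignored.
    """
    interval = {t: [] for t in thresholds}
    cumulative = {t: [] for t in thresholds}
    for src, tgt, score in pairs:
        for t in thresholds:
            if score >= t:
                cumulative[t].append((src, tgt))
        for t in reversed(thresholds):               # highest threshold not exceeding the score
            if score >= t:
                interval[t].append((src, tgt))
                break
    return interval, cumulative
```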

As we could not observe an improvement in translation quality in the experiments using the full data for English–Lithuanian, we give this language pair the same treatment in these experiments. Additionally, we examine English–Latvian and English–Romanian. We saw improvements for these two pairs, but we are interested in how much each part of the data contributes. We chose these pairs because they use different baseline corpora. This allows us to see the effects of adding a small amount of data to a large out-of-domain corpus (DGT/JRC in the case of English–Latvian) and of adding similar amounts of data to a small in-domain corpus (SETimes for English–Romanian).

6.2.4.1 English–Latvian

For English–Latvian, we examined both the interpolated language models and the mixture models. The problem with using mixture models is that the probabilities associated with the entries in the phrase table become less trustworthy on such a small set of data. Tables 6.7 and 6.8 give the amount of data (in sentence pairs) in the different intervals.

Table 6.7 Statistics about interval experiments for English–Latvian
Table 6.8 Statistics about cumulative experiments for English–Latvian

We did not investigate data with a LEXACC score of less than 0.1 (the default threshold of LEXACC). We see that there is very little data with a score higher than 0.9, but more data at lower scores.

We used each chunk of the data to retrain the SMT model and evaluated it in the same way as the baseline and fully enriched models. Tables 6.9 and 6.10 give the BLEU scores for these experiments. The baseline SMT system reached a BLEU score of 12.66. Experiments that perform worse than the baseline are marked in italics; the best experiment for each approach and corpus is marked in boldface.

Table 6.9 BLEU scores for interval experiments for English–Latvian
Table 6.10 BLEU scores for cumulative experiments for English–Latvian

Figure 6.1 illustrates the effect of the LEXACC score on the BLEU score. The data in the interval [0.1,0.2] yields the worst results and does not even reach the BLEU score of the baseline (plotted for comparison purposes). As the LEXACC score increases, we also see an increase in BLEU score. With the interpolated language models, this development is rather steady. When we compare the News to the Wiki-extracted data, the interpolated language models show similar trends.

Fig. 6.1 BLEU scores for interval experiments for English–Latvian

According to the BLEU scores, the translation results using the mixture models seem less correlated with the LEXACC score, mostly because the mixture models are very sensitive to the size of the data used to construct the additional phrase tables. Higher LEXACC thresholds indicate better quality of the extracted sentence pairs; at the same time, these high thresholds also result in less extracted data. In general, a translation model constructed over a small amount of data tends to contain fewer useful phrase pair entries while having high probability estimates. When combining a small model with high scores and a much larger model with much lower scores, one cannot avoid penalising the phrase pairs from the small model in order to use entries that exist in the other models, which actually form the majority of the combined model. Thus, in general, the tuning procedure tends to assign higher weights to the feature that represents the larger model. As a result, the additional data might not have as much influence on the final translation as we hope. This also explains why, in the experiment with Wiki data, the BLEU score drops significantly at the LEXACC interval [0.4,0.5], for which there are nearly 40% fewer sentence pairs than for [0.3,0.5]. The BLEU score increases again for higher LEXACC scores, as the size difference is smaller in the other cases. In practice, the probability estimates in the sub-models should all be normalised, but this would make it more difficult to compare results for different extracted data. Therefore, we chose to retain the probability scores of the sub-models.

The results for the cumulative experiments are not quite as clear. The effect of the LEXACC score on BLEU is plotted in Fig. 6.2. Here we see a lot of fluctuation. Although the best BLEU scores are comparable for three of the four experiment runs, they occur at different thresholds. Especially interesting is the behaviour of the data with a LEXACC score of 0.7 and above. For News, this chunk leads to an improvement with the interpolated LMs, but the BLEU score drops by almost 0.6 for the mixture models, which is a significant deterioration. The Wiki data behaves similarly, except that here the BLEU score of the interpolated LM drops even below the baseline performance. However, this data performs best for the mixture models.

Fig. 6.2 BLEU scores for cumulative experiments for English–Latvian

Figure 6.2 illustrates this point. We see a lot of fluctuation, although the data at a threshold of 0.6 seems to work reliably well for both models and both corpora.

6.2.4.2 English–Romanian

The training data for English–Romanian was very small, so our hypothesis was that this language direction is very sensitive to the quality of the newly added data. Whereas the DGT/JRC corpora are big enough to smooth out mistakes in the translation probabilities, the SETimes corpus is small enough that even the relatively small amount of extracted data can counteract the probabilities estimated from the original data: the English–Latvian baseline corpus consists of 2,305,674 lines, with 112,398/116,240 lines extracted from the two comparable corpora, adding about 5% of data to the baseline corpus. For English–Romanian, we only had 171,573 lines in the baseline, so the data from News (238,320 lines) and the Wiki corpus (45,771 lines) amount to 14% and 27%, respectively. Thus, the influence of the new data will be much higher than in the previous experiments.

For this language pair, we examined only the interpolated language models, as the results for the mixture models were too unstable. Tables 6.11 and 6.12 give the amount of data in the different intervals.

Table 6.11 Statistics about interval experiments for English–Romanian
Table 6.12 Statistics about cumulative experiments for English–Romanian

The distribution of this data is especially interesting. For English–Latvian, the distribution followed Zipf’s law; that is, there was very little data with high scores, but the lower the score, the more data was extracted. For English–Romanian, however, this only holds for News. The Wiki corpus behaves differently: here we have an unusually large number of sentence pairs with a high score. This cannot simply be explained by Wiki articles being inherently more strongly comparable than news text, as this would then also have to hold for other language pairs. Manual inspection of the data suggests that many articles in the Romanian Wikipedia were originally translated from the English Wikipedia. We consider this an anomaly.

The procedure for these experiments is the same as for the previous English–Latvian experiments. For each chunk of the data, we retrain the SMT models and compare them against the baseline, which achieved a BLEU score of 17.89. Table 6.13 shows the results for the interval experiments; the best results are marked in boldface.

Table 6.13 BLEU scores for interval experiments for English–Romanian

All systems outperform the baseline, but the overall tendency of BLEU improvement is not as clear-cut as in the previous experiment (Fig. 6.3). Instead, we see that the improvement in BLEU varies a lot over the intervals. For the Wiki corpus, which adds 25% to the original data, our assumption that higher LEXACC scores predict a higher increase in BLEU still holds, but, for the News data, we find that using the maximum amount of available data results in the highest gain. Here we must take into account the amount of data in each interval: although Wiki offers 13,000 additional lines in the interval [0.8,0.9], there are only 2500 sentences in the same interval in the News corpus.

Fig. 6.3 BLEU scores for interval experiments for English–Romanian

Table 6.14 shows the results for the cumulative experiments; the best results are marked in boldface. As with the interval experiments, all models improve over the baseline.

Table 6.14 BLEU scores for cumulative experiments for English–Romanian

In Fig. 6.4, we see less variation than for English–Latvian, with rather obvious thresholds for the two corpora. As with the interval experiments, we get the best results by using all of the available additional data for the News corpus, whereas the threshold for Wiki lies at 0.3. This is consistent with the best LEXACC performance, where we reached the best F1 score at a threshold of 0.36. Although these thresholds are close, we see quite a difference between the corpora: the News corpus improves by 0.7 BLEU points when using all the data, whereas the performance of the Wiki corpus drops by 0.3 BLEU points at the same threshold. The BLEU scores at threshold 0.3 differ by almost one full BLEU point, a very significant difference. This can be explained by the amount of data (see Table 6.12): for this interval, we have almost three times as many sentences for Wiki as for News.

Fig. 6.4 BLEU scores for cumulative experiments for English–Romanian

6.2.4.3 English–Lithuanian

As shown in Table 6.6, using the full data for English–Lithuanian did not result in an improvement of BLEU score. As we have seen a lot of variation in the BLEU scores for the individual chunks of the data, we decided to give English–Lithuanian the same treatment.

The size of the original baseline corpus consisting of DGT/JRC was 2,339,905 lines. We could add 33,219 lines to this from the News corpus (+1.42%) and 179,578 lines from Wiki (+7.67%). Splitting up the data into the individual chunks results in the amount of data shown in Tables 6.15 and 6.16.

Table 6.15 Statistics about interval experiments for English–Lithuanian
Table 6.16 Statistics about cumulative experiments for English–Lithuanian

The data again follows the expected distribution. The difference in size between the News and Wiki corpora is substantial: in each section, we have about six times as much data for Wiki as for the News corpus.

The baseline produced a BLEU score of 12.66. Tables 6.17 and 6.18 present the BLEU scores for the interval and cumulative experiments, respectively; the best results are marked in boldface.

Table 6.17 BLEU scores for interval experiments for English–Lithuanian
Fig. 6.5 BLEU scores for interval experiments for English–Lithuanian

None of the interval experiments performs better than the baseline, but we can see that the Wiki data performs much better than the News data. In Fig. 6.5, we observe the general tendency that higher-scoring intervals result in better BLEU scores, but the amount of data does not seem sufficient to push the enriched system over the baseline.

Table 6.18 BLEU scores for cumulative experiments for English–Lithuanian

Using the individual intervals, especially the small amounts available for the News corpus, did not yield an improvement over the baseline system.

Most of the cumulative experiments also perform worse than the baseline (Fig. 6.6). It is interesting to note that the best-performing system, which also improves over the baseline, uses the same threshold we already identified as optimal for English–Latvian, namely 0.6. This suggests that Lithuanian generally behaves similarly to Latvian.

Fig. 6.6 BLEU scores for cumulative experiments for English–Lithuanian

It is worth noting that the upper intervals come close to the performance of the baseline, which leads us to believe that the amount of extracted data was simply too small to have a sufficient impact on the baseline corpus.

6.3 Human Evaluation of MT Output

The human evaluation experiment is designed to measure the difference between the performance of the baseline MT systems built using only parallel data and those that were enhanced with sentences and phrases extracted from comparable corpora (CC). We developed a special evaluation scenario which takes into account the properties of the evaluated data.

Traditional measures of translation quality involve metrics such as adequacy (fidelity, i.e. the amount of information preserved in MT output compared with the gold-standard human translation), fluency (the degree of naturalness or well-formedness of a sentence according to the requirements of the target language, irrespective of the original sentence) or informativeness (responses to a multiple-choice questionnaire; White et al. 1994). However, we noted that none of these standard measures can adequately quantify the changes in translation quality for systems enhanced with resources based on comparable corpora.

6.3.1 Evaluation Methodology and the Interface

For our evaluation experiment, we developed a novel evaluation methodology, which captures differences between the baseline and the modified MT systems more directly and systematically. Specifically, we were interested in the following aspects of MT evaluation.

Firstly, we need to capture a general user intuition about the translation quality of the evaluated sentences taken in context. The division between adequacy and fluency makes sense for non-translator users; however, our target audience (translation studies students and professional translators) is able to assess the relative importance of adequacy and fluency for their specific post-editing or summarisation tasks. In this respect, it makes sense to collapse both evaluation measures onto a single scale, for which we obtain professional user ratings. In our scenario, translators were asked to evaluate the overall translation quality of the sentences, presented to them in the order in which they normally appear in a text.

Secondly, we are also interested in the comparative aspect of evaluation, specifically the differences between the baseline sentences and the corresponding aligned CC-enhanced sentences. Traditional comparison-based metrics have two major shortcomings from this perspective: they do not place the compared sentences on any systematic scale, and they do not compare specific linguistic (e.g. lexical) differences within the sentences. In our case, we need to tie the differences to an interpretable scale and focus the attention of evaluators on specific changes in otherwise similar sentences. In our scenario, not all sentences differ between the baseline and the enhanced output, and, where there are differences, they are usually minimal; it can be just one or two words or different morphological forms of words. Nor can we use standard adequacy or fluency measures independently on the baseline and CC-enhanced MT output; this would miss such small differences, since the granularity of the standard 5-point scale could be insufficient for capturing the changes.

Therefore, in our evaluation scenario, we combined the question about general translation quality with a comparative evaluation task: lexical differences between the baseline and the enhanced versions were highlighted, and users were asked to rate the appropriateness of the lexical choices for each of the highlighted words. Sentences without any differences were removed (which sometimes disrupted the intra-sentential context, but the number of such omissions was small compared to the overall text size), and the order of presentation was randomised. The origin of the text was anonymised; users did not know whether a sentence came from the baseline or from the CC-enhanced MT system.

Highlighting lexical differences is intended to focus the attention of evaluators on specific linguistic issues, and the numerical scale combined with the comparative framework allows us to adequately quantify the quality level, as well as relative and absolute improvement in translation quality.

The evaluation interface presented to users had the following form (Fig. 6.7).

Fig. 6.7 Evaluation interface for professional translators

6.3.2 Experiment Set-Up

System output was generated for the baseline and CC-enhanced MT systems for the following translation directions and domains. News domain: German–English (de-en), Romanian–English (ro-en), Slovenian–English (sl-en), Croatian–English (hr-en), Romanian–German (ro-de), Latvian–Lithuanian (lv-lt), English–Latvian (en-lv), English–Croatian (en-hr), English–Greek (en-el), German–Romanian (de-ro) and Greek–Romanian (el-ro). Automotive domain: German–English (de-en) and English–Latvian (en-lv). An evaluation set of 511 sentences (circa 11,000 words) was used for all translation directions.

Evaluation packs for human evaluation were constructed using the following procedure: sentences that differed between the baseline and the CC-enhanced output were identified; words that differed between the two outputs were automatically highlighted (if several consecutive words were highlighted, they were evaluated together as a phrase); the order of presentation of the CC-enhanced versus baseline output was randomised; and the evaluation packs were presented to evaluators within a Web interface that automatically calculated the submitted evaluation results (using a CGI script).
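
The sketch below illustrates how such evaluation packs could be assembled: sentence pairs that differ are kept, consecutive differing words are grouped into phrases, and the presentation order is randomised. It is a simplified illustration (using Python’s difflib) rather than the CGI-based implementation used in the experiment, and all names are hypothetical.

```python
import difflib
import random

def make_evaluation_pack(baseline_sents, enhanced_sents, pack_size=120, seed=1):
    """Build an evaluation pack from aligned baseline and CC-enhanced output."""
    random.seed(seed)
    pack = []
    for base, enh in zip(baseline_sents, enhanced_sents):
        b_tok, e_tok = base.split(), enh.split()
        if b_tok == e_tok:
            continue                           # identical sentences are not evaluated
        diffs = []
        matcher = difflib.SequenceMatcher(a=b_tok, b=e_tok)
        for op, i1, i2, j1, j2 in matcher.get_opcodes():
            if op != "equal":                  # consecutive differing words form one phrase
                diffs.append((" ".join(b_tok[i1:i2]), " ".join(e_tok[j1:j2])))
        # randomise which system is shown first so evaluators cannot infer the origin
        pair = (base, enh) if random.random() < 0.5 else (enh, base)
        pack.append({"sentences": pair, "differences": diffs})
        if len(pack) == pack_size:
            break
    return pack
```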

A set of 120 sentences (the first 120 non-identical sentences out of the complete set of 511 sentences used for calculating BLEU scores) was typically used for the human evaluation experiment, with at least three independent judgments collected for each sentence and for each highlighted word or phrase that differed between the baseline and the CC-enhanced translation.

For each target language, we recruited at least three evaluators, most of whom had a background in translation (professional translators, translation students or linguists), and obtained at least three independent scores for each of the compared sentences and lexical differences. Evaluators were asked to rate the translation quality of the compared sentences and the quality of the translation choices for the highlighted words or phrases.

For judging overall quality, evaluators were asked to rate each pair of sentences on a scale from 1 to 5 (1 = translation is not at all good ... 5 = translation is very good). For judging lexical translation choices, the judges were asked to use the same scale to rate the translation quality of the highlighted words and phrases (1 = very bad translation choice ... 5 = very good translation choice).

6.3.3 Human Evaluation Results

Evaluation results are presented in two groups: overall evaluation results (Table 6.19) and evaluation results for lexical differences (Table 6.20). Bold values denote the best results.

Table 6.19 Evaluation results for overall translation quality
Table 6.20 Evaluation results for lexical choices: baseline vs CC-enhanced MT

It can be seen from the tables that an improvement in translation quality is not observed for all translation directions. Even though the CC-enhanced system contains all the parallel data that is also present in the baseline MT system, the addition of new comparable resources may cause degradation. The reason is that the added data does not always contain true translation equivalents, which gives rise to spurious and wrong translations. These cases are less visible to automatic metrics like BLEU but are easily identified by human translators. Therefore, the important point is not only to add more data to the system, but also to control the quality of the data coming from comparable corpora.

Overall, the results show that the baseline translation quality is very low (1 or 2 on the 5-point translation quality scale on average). The quality of lexical translation choices (where translators were asked to focus on specific words or phrases that differ between the baseline and the enhanced output) is higher. It is also here that the CC-enhanced systems achieve the greater improvement. This shows that the proposed evaluation methodology of focussing on lexical differences is better suited to the task of measuring improvements from CC-based data.

On average, across all translation directions, there is improvement in all four areas. The improvement was smallest for overall translation quality in the News domain, at 4.92% over the baseline, while the lexical improvement in the News domain was 11.1% on average.

In the automotive domain, there is a much higher and more consistent improvement for both evaluated systems and in both aspects (overall and lexical quality) than in the broad domain: the improvement was 23.14% for overall and 29.72% for lexical translation quality.

For the broader News domain, improvement or deterioration depends on the translation direction. Translation into English always improves. All cases of degradation occur for translation into morphologically more complex languages, such as Croatian. The mechanism behind this is not known and requires further investigation. The results indicate that the biggest benefit of CC-enhanced data is achieved for narrow domains and for MT into morphologically simpler languages such as English.

6.4 MT Adaptation for Under-Resourced Domains

This section focusses on a very practical aspect of statistical machine translation (SMT): how a general out-of-domain SMT system can be tailored to a particular domain using data extracted from an in-domain comparable corpus. In particular, we deal with domain-specific terminology and named entities (NEs). We extract terms and named entities from the initial parallel training data. These terms and named entities are used to collect a comparable corpus from the Web. We then extract parallel terms from the collected comparable corpus and, finally, integrate them into the SMT system. The changes in the quality of the adapted SMT system are evaluated with respect to a general out-of-domain baseline system. This section is based on the publication by Pinnis and Skadiņš (2012).

6.4.1 Initial Extraction and Alignment of Terms and Named Entities

The first step in our SMT system adaptation technique is the acquisition of in-domain term pairs. Bilingual terminology makes the SMT system term-aware and helps it find better translation candidates for narrow-domain translation tasks. To acquire the term pairs, we use bilingual comparable corpora from the Web.

In order to find important domain-specific documents on the Web, we use the small amount of available parallel data (up to two or three thousand parallel sentences) and extract seed terms and named entities for a focussed narrow-domain Web crawl. Terms and named entities are monolingually tagged in the parallel in-domain data. For terms, we use Tilde’s Wrapper System for CollTerm (TWSC) (Pinnis et al. 2012b); for named entities, we use TildeNER (Pinnis 2012) for Latvian and OpenNLP for English. In parallel, a Moses phrase table is created from the in-domain parallel data.

Then, the monolingually tagged terms and NEs (in our experiment, 542 unique English and 786 unique Latvian units in total) are bilingually aligned using the Moses phrase table. At first, we try to find all symmetric term and named entity phrases in the phrase table that have been monolingually tagged in both languages. We allow only full alignments between phrase table entries and terms or named entities; that is, a phrase is considered valid only if all of its tokens are identical to the tokens of the corresponding term or named entity. In order to also allow alignments of inflected forms, all tokens of all terms, named entities and phrases are stemmed prior to alignment. This allows finding more translation candidates in cases where some inflected forms have not been tagged as terms but others have.

Then, we also align terms and named entities that have been tagged by only one of the monolingual taggers. If a phrase is aligned in the phrase table with multiple phrases from the other language, we select the translation candidate with the highest averaged (source-to-target and target-to-source) translation probability in the phrase table. This step finds terms and NEs that were missed by one of the monolingual taggers, thus increasing the number of extracted term and named entity phrases. Applied to the in-domain parallel data, the alignment method produced 783 bilingually aligned term and NE phrases.
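
A simplified sketch of this alignment step is given below. It assumes the phrase table has been loaded into a dictionary mapping (source phrase, target phrase) pairs to an averaged translation probability and that language-specific stemmers are supplied as functions; all names are illustrative, and the actual implementation may differ.

```python
def align_terms_with_phrase_table(src_terms, tgt_terms, phrase_table, stem_src, stem_tgt):
    """Bilingually align monolingually tagged terms/NEs via a Moses phrase table.

    src_terms, tgt_terms: collections of tagged units in each language.
    phrase_table: dict {(source phrase, target phrase): averaged translation probability}.
    stem_src, stem_tgt: language-specific stemming functions (placeholders).
    """
    def norm(text, stem):
        return " ".join(stem(tok) for tok in text.lower().split())

    src_stemmed = {norm(t, stem_src): t for t in src_terms}
    tgt_stemmed = {norm(t, stem_tgt): t for t in tgt_terms}

    symmetric, one_sided = [], {}
    for (src_p, tgt_p), prob in phrase_table.items():
        s, t = norm(src_p, stem_src), norm(tgt_p, stem_tgt)
        if s in src_stemmed and t in tgt_stemmed:
            # tagged in both languages and matching a full phrase table entry
            symmetric.append((src_stemmed[s], tgt_stemmed[t]))
        elif s in src_stemmed or t in tgt_stemmed:
            # tagged by only one tagger: keep the most probable translation candidate
            key = s if s in src_stemmed else t
            if key not in one_sided or prob > one_sided[key][1]:
                one_sided[key] = ((src_p, tgt_p), prob)
    return symmetric, [pair for pair, _ in one_sided.values()]
```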

6.4.2 Comparable Corpora Collection

The second step in our SMT system adaptation technique is the collection of bilingual in-domain comparable corpora from the Web. We use the bilingual terms and NEs extracted from the parallel in-domain data as seed terms for focussed monolingual crawling of two narrow-domain Web corpora with the Focussed Monolingual Crawler (FMC), which is described in Chap. 3. By using bilingually aligned seed terms, we ensure that the crawled corpora are comparable and in the same domain for both English and Latvian. As the aligned seed terms may also contain out-of-domain or cross-domain term and NE phrases, we apply a ranking method based on reference corpus statistics; more precisely, we use the inverse document frequency (IDF) (Spärck Jones 1972) scores of words from general (broad) domain corpora (e.g. the whole of Wikipedia and current news corpora) to weight the specificity of a phrase. We rank each bilingual phrase using the following equation:

$$ R\left({p}_{\mathrm{src}},{p}_{\mathrm{trg}}\right)=\min \left(\sum \limits_{i=1}^{\left|{p}_{\mathrm{src}}\right|}{\mathrm{IDF}}_{\mathrm{src}}\left({p}_{\mathrm{src}}(i)\right),\sum \limits_{j=1}^{\left|{p}_{\mathrm{trg}}\right|}{\mathrm{IDF}}_{\mathrm{trg}}\left({p}_{\mathrm{trg}}(j)\right)\right), $$
(6.1)

where $p_{\mathrm{src}}$ and $p_{\mathrm{trg}}$ denote the phrases in the source and target languages, $p_{\mathrm{src}}(i)$ and $p_{\mathrm{trg}}(j)$ denote their individual tokens, and $\mathrm{IDF}_{\mathrm{src}}$ and $\mathrm{IDF}_{\mathrm{trg}}$ denote the respective language IDF score functions that return an IDF score for a given token. The ranking method was selected through heuristic analysis so that specific in-domain term and named entity phrases are ranked higher than broad-domain or cross-domain phrases. This technique also allows filtering out phrase pairs where a phrase has a more general meaning in one language but a specific meaning in the other. After applying a threshold on the ranks, 614 phrase pairs were kept in the seed term list for corpus collection.
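
For illustration, the ranking in Eq. (6.1) can be implemented as follows; the sketch computes IDF scores from a general-domain reference corpus and then ranks a phrase pair by the smaller of its two summed IDF scores (the function names and the exact IDF formulation are assumptions made for the example).

```python
import math

def idf_table(documents):
    """Compute IDF scores for all tokens of a general (broad) domain reference corpus."""
    n_docs = len(documents)
    doc_freq = {}
    for doc in documents:
        for token in set(doc.lower().split()):
            doc_freq[token] = doc_freq.get(token, 0) + 1
    return {tok: math.log(n_docs / df) for tok, df in doc_freq.items()}

def rank_phrase_pair(p_src, p_trg, idf_src, idf_trg):
    """Rank a bilingual phrase pair as in Eq. (6.1): the minimum, over the two languages,
    of the summed IDF scores of the phrase tokens."""
    score_src = sum(idf_src.get(tok, 0.0) for tok in p_src.lower().split())
    score_trg = sum(idf_trg.get(tok, 0.0) for tok in p_trg.lower().split())
    return min(score_src, score_trg)

# seed phrase pairs whose rank exceeds a chosen threshold are kept for crawling, e.g.:
# kept = [(s, t) for s, t in phrase_pairs if rank_phrase_pair(s, t, idf_en, idf_lv) > threshold]
```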

In addition to the seed terms, FMC requires seed URLs. In total, 55 English and 14 Latvian in-domain seed URLs were manually collected.

When the seed terms and seed URLs were acquired, a 48-hour focussed monolingual Web crawl was initiated for both languages. The collected English and Latvian corpora were filtered for duplicates, broken into sentences, and tokenised. The statistics of the collected corpora are given in Table 6.21.

Table 6.21 Monolingual automotive domain corpora statistics

Both monolingual corpora were aligned at the document level using the DictMetric (Su and Babych 2012) tool described in Chap. 2, which scores document pair comparability and aligns document pairs that exceed a specified comparability score threshold. Executing DictMetric on narrow-domain comparable corpora may cause over-generation of document pairs; that is, every document from one language can be paired with many documents from the other language. Therefore, we filtered the document alignments so that each Latvian document is paired with the top three comparable English documents and vice versa, creating 81,373 document pairs. The comparable corpus statistics after document-level alignment are given in Table 6.22.

Table 6.22 English-Latvian automotive comparable corpus statistics
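
The filtering step can be sketched as follows, assuming the DictMetric output is available as (Latvian document, English document, comparability score) triples; the function is illustrative only.

```python
def filter_top_k_alignments(scored_pairs, k=3):
    """Keep, for each document, only its k highest-scoring comparable documents
    in the other language (applied in both directions, keeping the union).

    scored_pairs: iterable of (lv_doc_id, en_doc_id, comparability_score) triples.
    """
    by_lv, by_en = {}, {}
    for lv, en, score in scored_pairs:
        by_lv.setdefault(lv, []).append((score, en))
        by_en.setdefault(en, []).append((score, lv))

    kept = set()
    for lv, candidates in by_lv.items():
        for score, en in sorted(candidates, reverse=True)[:k]:
            kept.add((lv, en))
    for en, candidates in by_en.items():
        for score, lv in sorted(candidates, reverse=True)[:k]:
            kept.add((lv, en))
    return kept
```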

6.4.3 Extraction of Term Pairs from Comparable Corpus

Once the bilingual comparable corpus has been collected, the third step is to extract translated term pairs. Both parts (the Latvian and the English documents) are monolingually tagged with TWSC, as in the first step. In this step, we only tag terms, as the precision of named entity mapping without a phrase table is well below 90% and would add unnecessary noise to the data extracted for SMT adaptation. Then, using the document alignment information of the comparable corpus, we map terms bilingually with the TerminologyAligner (TEA) (Pinnis et al. 2012b) tool, applying a translation confidence score threshold of 0.7 (which yields a precision of 90% or higher). In total, 369 in-domain term pairs were extracted from the bilingual comparable corpus.

6.4.4 Baseline System Training

We start with the creation of an English–Latvian baseline system using the following data:

  • A relatively large out-of-domain parallel corpus. We used the publicly available DGT-TM (Steinberger et al. 2012) English-Latvian parallel corpus (release of 2007). The corpus consists of 804,501 unique parallel sentence pairs and 791,144 unique Latvian sentences. The Latvian part is used for language modelling.

  • A small amount of in-domain parallel sentences (up to two or three thousand parallel sentences). In our experiments, we selected the automotive domain (more precisely, service manuals) as the target domain. The in-domain data is split into two sets: tuning and evaluation. The tuning set and the evaluation set consist of 1745 and 872 unique sentence pairs from the automotive domain, respectively. All systems were tuned with minimum error rate training (MERT; Bertoldi et al. 2009) using the in-domain tuning set and evaluated on the evaluation set.

For MT system training, we use the LetsMT! (Vasiļjevs et al. 2012) Web-based platform for SMT system creation. The LetsMT! platform is built upon the state-of-the-art Moses (Koehn et al. 2007) SMT Experiment Management System (EMS).

Evaluation results for the baseline system using different automatic evaluation methods (BLEU (Papineni et al. 2002), NIST (Doddington 2002), TER (Snover et al. 2006), and METEOR (Banerjee and Lavie 2005)) are given in Table 6.23.

Table 6.23 Baseline system evaluation results

6.4.5 SMT System Adaptation

Following domain adaptation methods suggested in earlier research (Koehn and Schroeder 2007; Lewis et al. 2010; Xu et al. 2007), we start the SMT adaptation task by adding an in-domain language model built from the Latvian monolingual comparable corpus that was collected in the second step. We built an SMT system (named Int_LM) using two language models (a general and an in-domain model), whose weights are determined during system tuning (MERT). The in-domain monolingual language model increases SMT quality to 11.3 BLEU points (a relative increase of only 3.0% over the baseline system). We also trained an SMT system (named In-domain_LM_only) using only the in-domain language model. This experiment achieved 11.16 BLEU points, an increase over the baseline system but a decrease compared with the Int_LM system. This was expected: MERT tuned the in-domain language model to be more important, while the in-domain language model may lack some general-language phrases that are present in the broad-domain corpus (thus, interpolation of the two models achieves a higher score).

We continue our experiments by adding the translated term pairs (610 in total) that were extracted from the in-domain tuning set to the parallel data corpus and their Latvian translations to the in-domain monolingual corpus from which the SMT system is trained. This simple addition of in-domain term translations (system named Int_LM+T_Terms) increased the quality to 12.93 BLEU points (a relative increase of 17.8% over the baseline system). After also adding the term pairs extracted from the comparable corpus collected from the Web (369 new pairs), the quality of the system (named Int_LM+T&CC_Terms) increased to 13.5 BLEU points (a relative increase of 23.1% over the baseline system).

Considering term banks as another source of translated terms, we extracted 6767 unique in-domain automotive term pairs from EuroTermBank (Rirdance and Vasiljevs 2006). Then, we trained an SMT system (named Int_LM+ETB_Terms) with the same parameters as the Int_LM+T_Terms system. The system achieved 11.26 BLEU points, a decrease in comparison with the Int_LM system and much worse than Int_LM+T&CC_Terms (the best-performing system thus far). The reason for the decrease is fairly simple: term banks in many cases provide multiple translation candidates for a single term. This causes ambiguities in the translation model and can result in the selection of a wrong translation hypothesis. To solve this issue (at least partially), the term pairs from the term bank would have to be semantically disambiguated with respect to the required domain so that only the correct in-domain pairs are used in SMT system training.

Recent results in MT system adaptation (Ştefănescu et al. 2012) suggest that pseudo-parallel sentence pairs extracted from in-domain comparable corpora and used for SMT system training can significantly improve SMT system quality. Using the same pseudo-parallel sentence extraction tool, LEXACC, we extracted 6718 and 678 unique sentence pairs with two parallelism confidence score thresholds of 0.51 and 0.35. These sentence pairs were then added to the available parallel data and to the in-domain monolingual corpus. The results after training the SMT systems (named Int_LM+LEXACC_0.35 and Int_LM+LEXACC_0.51) show a decrease in BLEU (10.75 and 11.08, respectively) compared with the Int_LM system. Manual analysis of the MT output of Int_LM+LEXACC_0.35 in comparison with the baseline system made it evident that translation quality decreased because of non-parallel sentence alignments among the LEXACC-extracted sentence pairs, which cause in-domain term phrase pairs to receive lower weights (translation probability scores) in the translation model. Although in-domain terms in the pseudo-parallel sentences are in many cases paired with correct translations, they are often also paired with incorrect translations, thus creating noise for the translation model. This is not to say that pseudo-parallel sentences in general do not help to improve SMT quality, but rather that, for very narrow and under-resourced domains, where it is difficult to find strongly comparable in-domain corpora on the Web, the results can lower translation quality because of incorrect term translation hypotheses.

So far in our experiments, only the in-domain language model helps to distinguish in-domain translation hypotheses from broad (general) domain hypotheses. Therefore, in the next step, we transformed the Moses phrase table of the translation model into an in-domain term-aware phrase table. We do this by adding a sixth feature to the default five features used in Moses phrase tables. The sixth feature receives the following values:

  • ‘1’ if the phrase pair does not contain a term pair from the bilingual term list on both sides (in both languages). If a phrase contains a term on one side (in one language) but not on the other, it also receives the value ‘1’, as such situations indicate possible out-of-domain (wrong) translation candidates.

  • ‘2’ if a phrase contains a term pair from the term list on both sides (in both languages).

In order to determine whether a phrase in the phrase table contains a given term, phrases and terms are stemmed prior to comparison. This allows finding inflected forms of term phrases even if they are not listed in the bilingual term list. The sixth feature identifies phrases containing in-domain term translations and allows out-of-domain (wrong) translation hypotheses to be filtered out during translation.
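
A sketch of this phrase table transformation is shown below; as before, the phrase table is assumed to be in a simplified Moses text format, stemmers are supplied as functions, and all names are illustrative rather than taken from the actual implementation.

```python
def add_term_awareness_feature(phrase_table_path, out_path, term_pairs, stem_src, stem_tgt):
    """Append a sixth feature to a (simplified) Moses phrase table: '2' if the phrase pair
    contains an in-domain term pair on both sides, '1' otherwise (including one-sided matches)."""
    def norm(text, stem):
        return " ".join(stem(tok) for tok in text.lower().split())

    stemmed_pairs = [(norm(s, stem_src), norm(t, stem_tgt)) for s, t in term_pairs]

    with open(phrase_table_path, encoding="utf-8") as table, \
         open(out_path, "w", encoding="utf-8") as out:
        for line in table:
            fields = [f.strip() for f in line.split("|||")]
            src_phrase, tgt_phrase, scores = fields[0], fields[1], fields[2]
            s, t = norm(src_phrase, stem_src), norm(tgt_phrase, stem_tgt)
            feature = "1"
            for term_s, term_t in stemmed_pairs:
                # token-boundary containment check on the stemmed phrases
                if f" {term_s} " in f" {s} " and f" {term_t} " in f" {t} ":
                    feature = "2"
                    break
            out.write(f"{src_phrase} ||| {tgt_phrase} ||| {scores} {feature}\n")
```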

With the described methodology, we transformed the phrase tables of the systems Int_LM+T_Terms (using the 610 tuning-data term pairs) and Int_LM+T&CC_Terms (additionally using the 369 term pairs from the comparable corpus) into term-aware phrase tables. After tuning with MERT, two new systems were created. The Int_LM+T_Terms+6th system achieves 13.19 BLEU points, and the Int_LM+T&CC_Terms+6th system achieves 13.61 BLEU points (a relative increase of 24.1% over the baseline system and the highest increase measured in this experiment). Although the increase in translation quality over the systems without the 6th feature is relatively small, the translations show better hypothesis selection for in-domain terminology.

Complete results of the previously described automotive domain systems are shown in Table 6.24 (‘CS’ stands for ‘Case-Sensitive’ evaluation).

Table 6.24 English–Latvian automotive domain SMT system adaptation results

To show that improvements in SMT quality are also consistent when using larger corpora, we trained a new English–Latvian baseline system (Big_Baseline) using 5,363,043 parallel sentence pairs for translation model training and 33,270,743 monolingual Latvian sentences for language model training. The system was tuned using the same tuning set and evaluated on the same evaluation set as before. The adapted systems (Big_Int_LM+T&CC_Terms and Big_Int_LM+T&CC_Terms+6th) were built exactly as the Int_LM+T&CC_Terms and Int_LM+T&CC_Terms+6th systems in the previous experiment. The results (Table 6.25) show relative BLEU increases of 8.8% and 14.9% over the baseline for the system without and with the 6th feature, respectively. As more data creates higher ambiguity, the 6th feature increases the results significantly more than in the previous experiment. This shows the potential of the method when applied to larger corpora.

Table 6.25 English–Latvian automotive domain big SMT system adaptation results

The results of the experiments show that the integration of terminology into SMT systems, even with simple techniques (adding translated term pairs to the parallel data or adding an in-domain language model), can improve SMT system quality by up to 23.1% over the baseline system. Transforming translation model phrase tables into term-aware phrase tables can boost the quality by up to 24.1% over the baseline system, mostly because wrong translation candidates are filtered out during translation.

The experiments also show that using pseudo-parallel sentence pairs extracted from weakly comparable narrow-domain corpora, or term pairs acquired from term banks without sophisticated term sense disambiguation and semantic analysis of the source text, may not increase SMT quality, due to the noise added to the in-domain translation hypotheses.

6.5 MT Adaptation to a Narrow Domain in Case of Resource-Rich Languages

The objective of this contribution is to evaluate improvements achieved by using data from comparable corpora for tuning Machine Translation systems to narrow domains for languages that are usually classified as resource-rich. The language direction chosen was German to English, and the automotive domain, in particular the sub-domain on transmission/gearbox technology, was selected as an example for a narrow domain. In order to assess the effect of domain adaptation on MT systems with different architecture, both data-driven (SMT) and knowledge-driven (RBMT) systems were evaluated.

6.5.1 Evaluation Objects: Narrow-Domain-Tuned MT Systems

The evaluation objects are two versions of an MT system: a baseline version, without domain tuning, and an adapted version, with domain tuning. Their comparison shows whether or not domain adaptation can improve MT quality.

The evaluation objects were created as follows:

  1. For the baseline systems, on the RBMT side, Linguatec’s ‘Personal Translator’ PT (V.14) was used, a rule-based MT system built on the IBM slot-filler grammar technology (Aleksić and Thurmair 2011). It was used out of the box and installed on a standard PC. On the SMT side, a baseline Moses system with standard parallel data (Europarl, JRC, etc.), which was presented in Sect. 6.2.3, and some initial comparable corpus data as collected in ACCURAT (Skadiņa et al. 2010) were used.

  2. For adaptation of the baseline systems, data was collected from the automotive domain by crawling sites of automotive companies active in the transmission field (such as ZF, BASF, Volkswagen and others). This data was strongly comparable. It was then aligned and cleaned manually. Some sentence pairs were set aside for testing, and the rest were given to the two systems for domain adaptation as training and development sets. The resulting narrow-domain automotive corpus has about 42,000 sentences for German-to-English.

For the SMT system, domain adaptation was done by adding these sentences to the training and development sets and building a new SMT system.

In the case of rule-based technology, domain adaptation is more complicated as it involves terminology creation which is the main means of adaptation. Therefore, the following steps were taken:

  • Creation of a phrase table with GIZA++ and Moses; for this, the phrase tables of the adapted SMT system were taken; phrase tables built from only in-domain data were also tried but turned out to be less effective than those built from baseline plus in-domain data.

  • Extraction of bilingual terminology candidates from these phrase tables using the P2G (Phrase-Table-to-Glossary) tool; this resulted in a list of about 25,000 term candidates.

  • Preparation of these candidates for dictionary import, including creation of part-of-speech and gender annotations, removal of already existing entries, resolution of conflicts in transfers, etc.; the final list of imported entries was about 7100 entries.

  • Creation of a special ‘automotive’ user dictionary which can be added to the system dictionary in cases where texts from the automotive domain are translated.

This procedure is described in detail by Thurmair and Aleksić (2012).

The result of these efforts was four test systems for German-to-English, tuned for the automotive domain with the same adaptation data:

  • SMT-base: DFKI-baseline system trained with only baseline data

  • SMT-adapted: DFKI-adapted system trained with baseline plus in-domain data

  • RBMT-base: PT-baseline as the out-of-the-box RBMT system

  • RBMT-adapted: PT extended with the additional ‘automotive’ user dictionary.

6.5.2 Evaluation Data

For evaluation, a set of sentence pairs was extracted from the collected strongly comparable automotive corpora. In total, about 1500 sentences were taken for tests, with one reference translation each.

The sentences represent ‘real-life’ data; they were not cleaned or corrected, just like the training data. So they contain spelling mistakes, segmentation errors and other types of noise. This fact, of course, affects the translation quality for the adapted systems.

6.5.3 Evaluation Methodology

Several methods can be applied for the evaluation of MT results. Automatic comparison (called BLEU in Fig. 6.8) is the predominant paradigm in the world of SMT. So BLEU (Papineni et al. 2002) and/or NIST (Doddington 2002) scores can be computed for different versions of MT system output.

Fig. 6.8
figure 8

Evaluation options

While such scores seem to measure inner-system quality changes with some degree of reliability, they do not seem to measure translation quality (Babych and Hartley 2008), do not agree well with the judgments of human evaluators (Hamon et al. 2006) and are biased towards SMT system architectures and against rule-based approaches. Therefore, projects like WMT no longer use them as the only measure of quality (Callison-Burch et al. 2009; Bojar et al. 2018) but also ask for human judgment.

Comparative evaluation (called COMP in Fig. 6.8) is possible between two systems as well as between two versions of the same system. It simply asks whether or not one translation is better/equal/worse than the other.

While this approach can find which of two systems has an overall better score, it cannot answer the question of what the real quality of the two systems is: ‘Equal’ can mean that both sentences are perfect or that both are unusable.

Therefore, absolute evaluation (called ABS in Fig. 6.8) is required to determine the quality of a given translation. This procedure looks at one translation of a source sentence at a time and determines its accuracy (how much content has been transported to the target language) and fluency (how correct/grammatical the produced target sentence is), following the FEMTI paradigm (King et al. 2003).

Post-editing evaluation (called POST in Fig. 6.8) reflects the task-oriented aspect of evaluation (Popescu-Belis 2008). It measures the distance of an MT output to a human (MT-post-edited) output, either in terms of time (answering the question of how productive a system can be as compared, e.g. to a human-only translation) or in terms of the keystrokes needed to produce a human-corrected translation from an MT-raw translation (HTER: Snover et al. 2006, 2009).

Post-editing evaluation adds reference translations to the evaluation process.
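As a concrete illustration of the HTER idea mentioned above, the sketch below computes a simplified HTER: the word-level edit distance between the raw MT output and its human post-edited version, divided by the length of the post-edited version. This is a deliberately simplified assumption-based sketch (the full TER metric of Snover et al. also counts block shifts), not the official implementation.

```python
# Simplified HTER: edit distance between MT output and its post-edited version,
# normalised by the post-edited length. Block shifts of the full TER are omitted.

def word_edit_distance(hyp, ref):
    h, r = hyp.split(), ref.split()
    d = [[0] * (len(r) + 1) for _ in range(len(h) + 1)]
    for i in range(len(h) + 1):
        d[i][0] = i
    for j in range(len(r) + 1):
        d[0][j] = j
    for i in range(1, len(h) + 1):
        for j in range(1, len(r) + 1):
            cost = 0 if h[i - 1] == r[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(h)][len(r)]

def hter(mt_output, post_edited):
    edits = word_edit_distance(mt_output, post_edited)
    return edits / max(1, len(post_edited.split()))
```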

In our narrow domain task, the following evaluation methods were used, cf. Fig. 6.9:

  • Automatic evaluation of the four systems (SMT-baseline and SMT-adapted, RBMT-baseline and RBMT-adapted) using BLEU and NIST scores.

  • Comparative evaluation of the pairs (SMT-baseline versus SMT-adapted and RBMT-baseline versus RBMT-adapted); this would produce the core information of how much the systems can improve.

  • Absolute evaluation of the systems (SMT-adapted and RBMT-adapted), to gain insight into the translation quality and, consequently, the potential acceptance of such systems for real-world use.

Fig. 6.9
figure 9

Evaluation in narrow domain task

Other forms of evaluation were not included in the evaluation task. However, to obtain a more complete picture, the other ABS and COMP directions were evaluated as well, but with less effort (one tester only).

6.5.4 Evaluation Tools

To perform the evaluations, a special toolset was created for the non-automatic tasks. The toolset is called ‘Sisyphos-II’ (for details see Chap. 8: Appendix) and consists of three components:

  • ‘ABS’ to support absolute evaluation, using two four-point scales. For adequacy, the options are {full content conveyed | major content conveyed | some parts conveyed | incomprehensible}. For fluency, the options are {grammatical | mainly fluent | mainly nonfluent | rubble}.

  • ‘COMP’ to support comparative evaluation of two MT outputs, using a four-point scale. Comparison options are {first translation better | both equally good | both equally bad | second translation better}.

  • ‘POST’ to support post-editing evaluation, by measuring the post-editing time from the first display of the sentence until the pressing of the [Save] button (in seconds) and allowing HTER computing.

The tools are stand-alone tools that can be given, for example, to a freelance translator. Evaluation data is presented to the users by a special GUI in random order, and evaluation results are collected in an XML file which is the basis for evaluation.

An example screenshot of the tool is shown in Fig. 6.10. Each time a 4-point scale is presented, users select one of the options in both areas.

Fig. 6.10
figure 10

Screen shot of evaluation tool (ABS)

6.5.5 Evaluation Results

Three evaluators were used to carry out the evaluations, all of them good speakers of English with some MT background. Each of them evaluated a random subset of the 1500-sentence test set, consisting of at least 500 sentences for each of the COMP evaluations (SMT-adapted versus SMT-baseline and RBMT-adapted versus RBMT-baseline) and at least 300 sentences for each ABS evaluation (SMT-adapted and RBMT-adapted). More than 5000 evaluation points were collected this way.

6.5.5.1 Automatic Evaluation

The automatic evaluation for the German–English pair was done on the basis of BLEU scores. The results are shown in Table 6.26.

Table 6.26 BLEU scores for SMT and RBMT

For both systems, there is an increase in BLEU; it is more moderate for the RBMT than for the SMT system. However, it is known that BLEU is biased towards SMT systems.

6.5.5.2 Comparative Evaluation

For the German–English pair, the three testers described above were used.

Of the 1500 test sentences, the three testers inspected randomly selected subsets, in total about 2000 evaluated sentences. As the tool does not present sentence pairs whose two translations are identical, such cases cannot be differentiated into ‘equally good’ versus ‘equally bad’. If these two categories are merged into one (‘equal’), the following results were achieved (Table 6.27).

Table 6.27 Comparative evaluation baseline versus adapted for SMT and RBMT

The data shows that the domain adaptation results in an improvement of about 5% for both types of systems. It is slightly higher (5.1%) for the SMT than for the RBMT (4.7%). The result is consistent across the testers: all of them find an improvement in the adapted versions, and all of them see a higher improvement for the SMT than for the RBMT.

It may be worth noting that, in the RBMT evaluation, a large proportion of the test sentences (nearly 60%) came out identical in both versions, and the changes were rather small (17% of the sentences). In the SMT system, nearly no sentence came out unchanged, and the variance in the comparison was between 36% and 51% (depending on the tester).

In a sideline evaluation, a comparison was made between the baseline versions of SMT and RBMT and their adapted versions (Table 6.28).

Table 6.28 Comparative Evaluation SMT / RBMT, baseline and adapted

The result shows that the RBMT quality is considered significantly better than the SMT quality. The main reason for this seems to be that the German–English SMT frequently drops verbs from sentences, for example Silber wird in der Medizin seit Jahrhunderten wegen seiner antimikrobiellen Wirkung geschätzt und eingesetzt (‘Silver has been valued and used in medicine for centuries because of its antimicrobial effect’) => silver in medicine centuries for its antimicrobial effect and. This effect has already been observed with other SMT outputs.

It should be noted, however, that the distance between the systems is smaller in the adapted versions than in the baseline versions (by 3%).

6.5.5.3 Absolute Evaluation

The absolute evaluation was done to assess how usable the resulting translation would be after the system was adapted. A total of 1100 sentences, randomly selected from the 1500-sentence test base, were inspected by three testers. Adequacy and fluency were measured for each sentence on a scale of 1–4. Table 6.29 gives the result (lower average scores mean better quality).

Table 6.29 Absolute evaluation for SMT-adapted and RBMT-adapted systems

It can be seen that the testers rate the SMT somewhere between ‘mainly’ and ‘partially’ fluent/comprehensible and the RBMT close to ‘mainly’ fluent/comprehensible. If the percentage of sentences rated 1 or 2 is considered, SMT adequacy reaches 36.6% and fluency 53.04%, while both adequacy (64.97%) and fluency (77.50%) are significantly higher for the RBMT. All testers agree in their evaluation and have similar average results. The better score for the RBMT may result from the ‘missing verb’ problem mentioned above.

It may be worth mentioning that the often-heard opinion that SMT produces more fluent output than RBMT cannot be corroborated by the evaluation data here: the RBMT output is clearly considered to be more fluent than the SMT output (1.8 vs. 2.3).

An absolute evaluation was also done for the two baseline systems, however with only one tester. The results are given in Table 6.30.

Table 6.30 Absolute evaluation of the baseline systems

The figures indicate that system adaptation improves the adequacy of the SMT (from 2.86 baseline to 2.62 adapted), while it seems to reduce the fluency of the RBMT (from 1.48 baseline to 1.80 adapted); further error analysis would be required to find out why. The other results (RBMT adequacy and SMT fluency) seem unchanged.

As far as inter-rater agreement is concerned, the test set-up made it difficult to compute: all testers used the same test set but each evaluated only a random subset of it. So there are only a few data points common to all testers (often only 20). For those, only weak agreement could be found (values below 0.4 for Cohen’s kappa, Table 6.31). However, all testers showed consistent behaviour in the evaluation and came to similar overall conclusions, as explained above.

Table 6.31 Kappa for inter-tester agreement
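For reference, Cohen’s kappa for two raters can be computed as in the sketch below. This is a generic illustration of the measure over the (small) set of items both raters judged, not the exact script used to produce Table 6.31.

```python
# Cohen's kappa for two raters over the items they both judged.
from collections import Counter

def cohens_kappa(ratings_a, ratings_b):
    assert len(ratings_a) == len(ratings_b) and ratings_a
    n = len(ratings_a)
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    freq_a, freq_b = Counter(ratings_a), Counter(ratings_b)
    labels = set(ratings_a) | set(ratings_b)
    expected = sum((freq_a[l] / n) * (freq_b[l] / n) for l in labels)
    return (observed - expected) / (1 - expected) if expected < 1 else 1.0

# Example: two testers rating the same 20 shared sentences on the 4-point ABS scale
# would be passed as two equal-length lists of scores from 1 to 4.
```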

6.5.6 Conclusion

Figure 6.11 gives all evaluation results. The main conclusion is that all evaluation methods indicate an improvement of the adapted versions over the baseline versions.

Fig. 6.11
figure 11

Evaluation summary (BLEU, COMP, ABS)

Automatic evaluation:

  • For SMT, the BLEU score increases from 17.36 to 22.21.

  • For RBMT, the BLEU score increases from 16.08 to 17.51.

Comparative evaluation:

  • For SMT, an improvement of 5.1% was found.

  • For RBMT, an improvement of 4.67% was found.

Absolute evaluation:

  • For SMT, adequacy improved from 2.86 to 2.62, and fluency improved slightly from 2.35 to 2.34.

  • For RBMT, adequacy improved from 2.05 to 2.02, while fluency decreased from 1.48 to 1.8.

The improvement is more significant for the SMT system than for the RBMT. This may be due to the fact that the RBMT baseline system has better COMP and ABS scores, though lower BLEU scores, than the SMT baseline.

For SMT improvement, Pecina et al. (2012) report relative BLEU improvements between 8.6% and 16.8% for domain adaptation. Our results are in line with these findings.

6.6 Application of Machine Translation (MT) in Web Authoring

Authoring is defined as a process of creating and editing documents, especially multimedia documents, for other end-users. Authoring systems or tools are software packages used for creating and packaging content and have been applied in multimedia (Bulterman and Hardman 2005; Scherp and Boll 2005; Deltour and Roisin 2006), e-learning (Watson et al. 2010; Capuano et al. 2009), adaptive e-learning (Bontchev and Vassileva 2009), mobile learning (Mugwanya and Marsden 2010), tutoring (Escudero and Fuentes 2010), interactive digital storytelling (Müller et al. 2010) and lately digital gaming (Mehm et al. 2012). Their purpose is to assist less technically skilled users to produce multimedia, structure the content without special expertise and speed up the process of content creation by streamlining and automating common tasks.

Authoring tools are also widely used in professional publishing for a variety of publication types such as books, documentations, reports, articles or presentations. The simplest authoring tools are simple preformatted Word document templates or templates with added scripting. More sophisticated publishing systems, platforms and desktop publishing applications enable advanced functionalities and may support different roles in the system, for example writers and editors, and even provide the ability to collaborate between these roles. Web authoring is a sub-domain of authoring and, in its broadest sense, means authoring content online.

The way people use the Web has changed considerably. Web users are provided with new means of communication (blogs, tweets, instant messaging), collaboration (wiki, forums), and sharing of multimedia content. From the late 1990s, user-oriented Web authoring started changing the publishing game. Previously, special technical knowledge was required to create and publish content on the Web, but the emergence of Web authoring tools has brought publishing closer to the wider public. Web authoring has facilitated content creation by non-technical users: all a user needs is an Internet connection and a browser, everything else is available online. Furthermore, Web authoring tools started providing simple interfaces that guide and assist at every step of content creation by enabling users to develop websites in desktop publishing (or similar) format by generating underlying HTML code for the layout based on the user’s design. Users can typically toggle between graphical design and HTML code.

The Web authoring systems that account for most user-generated content are popular blogging platforms (such as WordPress and Blogger), micro-blogging platforms (such as Twitter and Tumblr) and wiki platforms (Désilets et al. 2006) based on the MediaWiki open source project. We consider social media networks to be a part of Web authoring because users can create multimedia profile pages and post text, images, video or a combination of all of them, thereby creating content.

The availability and popularity of Web authoring tools have affected many areas, including translation of online content. The amount of data has expanded drastically, the number of languages in which online content is produced has increased, and even English content is frequently written in casual English. Use of Web authoring (and the Web in general) has shifted from producers to users who connect and exchange ideas and opinions, but the language barrier still remains. Garcia (2009) stated that the amount of content contributed by producers and users exceeds the translation industry’s capacity to cope and that the industry cannot keep pace with an environment that puts a premium on cheapness and speed.

Today, researchers working in the field of natural language processing, specifically information retrieval and machine translation (MT), are faced with the seemingly low-hanging fruit of the large amounts of online data that is provided through user-oriented media (social networking sites, blogging and microblogging services). In reality, this comes with a price, bringing new issues they have not faced before.

To illustrate the vastness of online content, we provide figures published by the most popular services: Tumblr claims to have 277 million blogs in 16 languages and 128.7 billion posts; bloggers at WordPress.com write in 120 languages, publish 41.7 million new posts and 60.5 million new comments per month, and 409 million people view more than 15.5 billion pages each month; Twitter supports more than 35 languages and has more than 320 million active users; Wikipedia has 5 million content articles in English alone, is available in 291 languages, and more than 19,000 articles are added to it every day. These numbers give us a perspective on the amount of content that users generate. Researchers have to deal with scaling-up demands and the robustness required by the need to understand casual written English, which often does not conform to the rules of spelling, grammar and punctuation (Clark and Araki 2011). Online content increasingly resembles conversation, with loose grammar rules, intentional or unintentional misspellings, acronyms and jargon, which affects the accuracy of natural language processing, information retrieval and translation.

6.6.1 The Role of Translation and MT in Web Authoring

Web authoring is a multi-language online environment. With the aid of Web authoring, we now have the means to search for information globally or locally, while social media tools help us to distribute it to various communities that speak other languages and, in this way, to broaden readership and/or the pool of customers. Social tools also allow us to build and support international communities or networks communicating in languages other than English or our mother tongue.

Yet there is still one important obstacle in the way of reaching the full potential: the lack of quality translation, especially for under-resourced languages and narrow domains.

The role and the importance of translation in Web authoring can be viewed from different perspectives, and we will focus on two that are closely related to Web authoring, namely

  • Content creator perspective, which includes companies or organisations publishing content and the general public publishing user-generated content via social media tools and platforms (blogging, microblogging, wikis, forums).

  • Content consumer perspective, which involves readers searching for information or opinions.

From the content creator perspective, we distinguish between localisation and internationalisation. In this context, localisation is considered as a translation from English (or any other major world language) into other languages, including under-resourced languages and minority languages; for internationalisation, the direction of translation is exactly the opposite: translating from a (minor) language into one of the major languages. While localisation is more typical for larger companies, internationalisation is more frequent for smaller companies or bloggers wanting a larger audience. For example large international companies are expanding their business to other countries and want to localise their Web pages, product documentation or user support. On the other hand, small (local) companies want to reach out and provide their content, especially Web sites, in one of the more frequently used languages.

From the end-user perspective, translation plays an important role in discovering new knowledge, finding information about products, people and events. Translation direction can range from English to any language, between any language pair, even from an under-resourced one to another under-resourced one.

Quality is an important factor in the translation in Web authoring but is not the only one. When readers are just interested in the broader meaning of the text (so-called gisting), the quality is less important than the speed of translation and its accessibility to the public.

The traditional translation model is carried out by professional translators or bi-lingual experts with the aid of computer translation software. Most such software has been based on translation memories and terminology databases and is therefore less suitable for the needs of translation in Web authoring. The translation model has changed with the rise of ‘software as a service’. Translation services, particularly the ones based on MT, are now more affordable, available to the general public and suitable for integration into authoring tools, more easily than ever before.

From the content consumer perspective, MT (as a service) seems to be the only good option for translating user-generated content. In general, readers do not understand more than two or three languages, cannot afford a human translator and do not want to buy expensive translation software. Content producers see MT as important for similar reasons; they want translation to be as fast and as cheap as possible.

As noted above, when readers are only interested in the gist of a text, translation speed, public availability and price (especially if the service is free) matter more than quality.

The reasons observed by Hutchins (2003) regarding why machine translation is needed are still valid today, also for the domain of Web authoring. We added the lack of support for under-resourced languages to the following list of reasons:

  • The amount of generated content is too large for human translators.

  • The demand for increase in the volume and speed of translation throughput (translation needed now, not in a few days from now) is growing.

  • Top-quality translation is not always needed, nor is human assistance/post-editing.

  • People communicate and generate content in a large number of under-resourced languages, which are usually not supported by traditional translation models or are not easily accessible.

An additional reason might be the need to integrate translation services into other tools used for research, exploration and discovery. For example corporations that use multi-lingual online collaboration environments need translation tools that integrate seamlessly into collaborative tools, such as chats and support forums, enabling more effective use of online content (removing the need to copy and paste content into one of the freely available MT services) and appealing to an even broader population of users.

While translation services are valuable standalone products, they are more valuable if they can be integrated to complement the functionalities in other tools that users work with, such as Web browsers, document editors, phone applications and Web authoring tools and platforms.

6.6.2 Characteristics and Requirements for Translation in Web Authoring

Web authoring is widespread across demographic factors (such as geographic location, age, gender, household income or level of education), mostly due to the emergence of new online tools and platforms that make Web authoring easier than ever before. Translation in Web authoring is needed, and it has to meet the requirements set by its users. Web users are very demanding—they want translation to be fast and free.

The role of human translators in Web translation has changed: the old traditional translation model of translate-edit-proofread, involving human (professional) translators, has been replaced with other, more flexible models—not collaborative models, but rather MT-assisted models (Garcia 2009).

A number of MT-assisted models are already implemented in translation services, such as Google Translate or Microsoft Translator, and are being widely used. However, before using them, we have to consider the following question: is MT really the answer to every translation problem in Web authoring? Considering the current state of MT, the answer to this question is not encouraging. Several factors have to be considered before deciding to use MT:

  • Role of the user: content consumer or content producer.

  • Volume of material: the larger the volume, the more prohibitive the cost of human translation becomes.

  • Frequency with which material changes: it may be less practical to continually use human translators for material that changes frequently.

  • Domain and purpose of content and its translation: informational, persuasive, legal, etc. The more important it is that the translation is accurate and fluent, the less likely it is that MT should play a role, at least not without post-editing.

  • Speed of translation: MT will always provide faster results.

  • Languages involved: related languages and languages that are very commonly used will translate the best. For some language pairs, it might be hard to find a human translator and be much easier to use Google Translate, even if the translation quality is not good.

Balancing these factors is an important part in making the decision of whether to apply MT or not. The translation quality of MT tools depends on the domain and the languages involved, and therefore, it is important to choose the tool that is best suited for the problem, as not all tools produce good results with all language pairs.

Current translation techniques that are applied to Web authoring depend on the type of platform and the content. Popular content management systems (CMS) allow editing of multi-lingual content in parallel at the time of writing or soon afterwards. In most cases, authors translate text themselves on the fly. For Web authoring platforms, such as WordPress, several plug-ins provide the functionality of parallel text editing of multiple languages.

The collaborative translation model is used for wiki projects such as Wikipedia and Wikitravel. They both provide content in multiple languages, and translation is performed by multiple (anonymous) bi-lingual authors/editors. However, some might not truly consider this to be multi-lingual translation, because neither Wikipedia nor Wikitravel provides (exactly) the same content in different languages. Research by Désilets et al. (2006) has refuted the commonly held assumption that Wikipedia contents are parallel; they claim that ‘[t]hese sites are in fact a collection of parallel communities that produce content about overlapping sets of topics in different languages, with little if any synergy across languages’. There is also the project translatewiki.net, a wiki localisation platform for translation communities, language communities and free and open source projects. The platform incorporates translation memory from the translate toolkit, Yandex Translate and Microsoft Translator, which assist in collaborative translation.

Crowdsourcing translation is a model similar to the collaborative one, but it is not limited to the wiki environment. Facebook crowdsourced its translation and, according to the results, the translation could not have been performed any faster or better even if the usual localisation processes had been applied (Garcia 2009). Twitter uses a similar approach by inviting its users to help with the localisation of the platform. Some of the other ‘big players’ have also used crowdsourcing to translate the parts of their content that they considered suitable for this kind of translation. For example Google used crowdsourcing to translate its interface into many minority languages. It also uses the ‘Suggest a better translation’ feature in Google Translate, through which crowds contribute to improvements of its SMT engine (Garcia 2009).

6.6.2.1 MT in Web Authoring

Traditional translation models involving professional translators and workflows built only on translation memories are less suitable for the fast-growing (in terms of the volume of publications) and expanding (in terms of new languages) Web authoring domain, especially when compared with MT systems.

Today’s widespread use of MT on the reader side of Web authoring can be credited mostly to Google Translate and Microsoft Bing Translator. They are typically used at the time when content is consumed. Readers have the option of using a Web browser with an integrated translation service available via a toolbar or of installing the tool as an extension for their favourite Web browser. When readers visit a website whose content language differs from the default language set in the browser, the content is either translated automatically or translated on demand by pressing a button on the toolbar. Some bloggers put special translation widgets directly on their blogs, so readers do not have to install toolbars and can use the translation widget instead.

The situation is similar for content creators. For example bloggers can use several translation plug-ins which are available for the most popular blogging platforms. These plug-ins usually use Google Translate or Bing Translator to translate text in the Web editor, and they put it directly back in the editor so that the author can post-edit it before publishing. The microblogging platform Twitter has integrated Bing Translator API into their Web–user interface to provide machine translation between more than 40 language pairs.

Wiki projects are a special case with regard to machine translation. Wikipedia took the initiative in the form of the Wikipedia Machine Translation Project. As Wikipedia is a multi-lingual resource, the ‘Wikipedia consensus is that an unedited machine translation, left as a Wikipedia article, is worse than nothing.’

6.6.2.2 Translating User-Generated Content

Web authoring covers online content by both professionals and amateurs. The latter is also known as user-generated content. It is usually produced in a more conversational manner; much of it is of poor or non-standard quality; it can be produced by non-native speakers; and native speakers can unintentionally introduce typos or deliberately stray from spelling norms to achieve particular linguistic goals or effects (Jiang et al. 2012).

Carrera et al. (2009) acknowledged that user-generated content is suitable for MT, yet most such content remains untranslated. Jiang et al. (2012) built a number of SMT engines for a Middle East–based social networking provider using user-generated content and identified several problems in the process.

Flournoy and Rueppel (2010) describe how MT could be used at Adobe for translating user-generated content, either for a community translation initiative, in which MT output can be presented as pre-translations to the members of the community, or for translating valuable resources such as Q&A, tutorials and product reviews. While high-quality MT is preferred in both cases, it is not required.

Evaluation of MT is a separate research field, and we will not delve far into it. In many studies including the one by Hovy et al. (2002), the following aspects of translation quality are taken into consideration: fluency (lexically and syntactically well-formed sentences), fidelity (translation does not change the meaning/semantics of the input), price, system extensibility and coverage (specialisation of the system to the domains of interest). More recent research studies about translating user-generated content were mainly interested in fidelity. Fidelity is measured on a limited scale by human judges rating how well a system’s output expresses the content of the same portion of the source text or even ideal human translations (Hovy et al. 2002). Mitchell and Roturier (2012) conducted a pilot study, based on a previous study by Roturier and Bensadoun (2011), that examined the perceived quality of MT in terms of comprehensibility among members of an online community forum and the ways users interact with the MT content. Even though the study had a low response rate, the results have shown that the MT output was comprehensible slightly more often than not.

Translation direction in user-generated content is primarily from English to other languages; otherwise, it varies and can include any language pair. Open-source MT attempts are an opportunity for minor languages, and the objective behind them is also to ‘de-minorise’ translation (Forcada 2006).

From the usage of MT in Web authoring, we can conclude that it is mostly used and useful for obtaining a general understanding of content. If it is used for content creation, then the content is post-edited, because the quality of MT is not good enough.

6.6.2.3 Defining Requirements for Using MT in Web Authoring

When defining requirements for MT in Web authoring, we have to consider both content characteristics and the factors that we mentioned at the beginning of this section. Major factors affecting the quality of MT in Web authoring are

  • Domain specificity: an MT system trained on texts from one domain performs badly in another domain; it may work well for general translations but not for specific ones, or vice versa, for example working well for EU-related documents but performing really badly on general translations.

  • Lack of resources: SMT systems rely solely on quantitative information extracted by systems trained on vast amounts of data. What if there are no vast amounts of data for systems to be trained on, as in cases of under-resourced languages or narrow domains?

  • Casual English: problems include rapidly changing out-of-dictionary slang, short-forms and acronyms, punctuation errors or omissions, phonetic spelling, misspelling for verbal effect and other intentional misspelling and recognition of out-of-dictionary named entities. Use of casual English in social media poses a problem: casual media needs pre-processing before translation, but this might not prove to be feasible for bloggers (Clark and Araki 2011).

Requirements for MT for translating cross-language social media were described by Carrera et al. (2009) in the context of social media analysis. They noted that an MT system would need to be designed for

  • Large-scale, real-time translation

  • Preservation of meaning (which should be good enough for gisting)

  • Robustness, especially in light of errors in linguistic formalisation

Flournoy and Rueppel (2010) provide additional requirements valid for translating user-generated content:

  • Low to medium translation quality is required.

  • MT has to be able to deal with various subject matters.

  • There is no need for special security (no need for non-disclosure agreements as in the case of formal documents with business secrets).

  • The most frequently occurring language pairs are EN→XX, but others can also occur, such as XX→YY.

  • Input is of varied, uncontrolled quality.

Almost all researchers agree on the biggest issue that all MT systems face: the quality of the translation output. If we ignore the fact that most MT systems prior to Google Translate were either rule-based or assisted by translation memories, one of the more important causes of poor quality is the discrepancy between the corpora that MT systems are trained on and the texts that MT is used on. For example, Google Translate works best for short subject–verb–object sentences, such as driving directions, simple instructions or simple scientific sentences. It also does quite well for gisting of websites but is unlikely to provide adequate translations for short-lived colloquialisms, new words or word plays. Using Google Translate to translate social media content directly, without post-editing, is not recommended. The same goes for legal drafts, descriptions of medical equipment, political texts, safety applications and legal documents.

6.6.3 MT Systems Enhanced with Comparable Corpora in Web Authoring: A Use Case

The quality of translation services for under-resourced languages and narrow domains still falls behind the quality for more widely used language pairs (e.g. English, German, French, Arabic, Chinese) and more general domains. MT systems enhanced with comparable corpora aim to close this gap and improve the quality of translation for these under-resourced languages and narrow domains.

Comparable corpora are easier to obtain than parallel corpora, but, for under-resourced languages, they are still far less abundant than comparable corpora for major languages. Content from narrow domains faces a similar situation: translation services trained on general texts produce poor results when used on texts from narrow domains, and the lack of parallel corpora makes it hard to train a good-quality SMT system.

The blogosphere is a good example that combines both of the above-mentioned issues, which were addressed in the ACCURAT project. We evaluated the use of MT systems enhanced with comparable corpora (henceforth CC-enhanced MT) for Web authoring in a use case involving blog posts in Slovenian, Croatian and German. CC-enhanced MT was used as an intermediate step between content written in one of the under-resourced source languages and Zemanta’s recommendation engine, which is available via a Web service. Currently, the recommendation service works only for texts in English and does not return good results for texts in other languages, so this was an opportunity to apply translation before the recommendation step in order to obtain more relevant related content.

6.6.3.1 Evaluation Process and Datasets

As a blogger writes a blog post, Zemanta’s recommendation engine analyses the text and suggests related contents that the blogger can use to enrich the blog post. Our goal was to find out whether using MT before sending text to the recommendation engine results in better suggested related articles. Although the recommendation engine returns related articles, images and keywords, we focussed on related articles only.

Evaluation of results was done in Zemanta’s internal evaluation system by two human evaluators. We collected blog posts and online news articles in Slovenian, Croatian and German, 100 texts per language (Table 6.32), and put them through the recommendation engine to obtain the 10 best suggested related articles per text. Texts in the source language were translated using two translation methods—baseline and CC-enhanced MT—and translations were sent to the recommendation engine to get suggestions again.

Table 6.32 Evaluation sets of texts

The evaluation cycle is illustrated in Fig. 6.12. After the recommendation engine returned the suggested related articles, two human evaluators assessed each suggested article from the blogger’s perspective and assigned it a score ranging from zero (will not use) to three (definitely will use). These scores were used to calculate the precision@10 metric, which considers only the ten highest-ranked suggested articles.

Fig. 6.12
figure 12

Evaluation process used in the use case
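A minimal sketch of the precision@10 computation is given below. The relevance cutoff on the 0–3 scale (treating scores of 1 or higher as relevant) is our assumption for illustration; the study’s exact mapping of graded scores to precision may differ.

```python
# precision@10: fraction of the ten suggested articles judged relevant.

def precision_at_10(scores, relevant_threshold=1):
    # 'scores' holds one evaluator's 0-3 judgments for the ranked suggestions.
    top10 = scores[:10]
    relevant = sum(1 for s in top10 if s >= relevant_threshold)
    return relevant / 10

# Example: scores assigned to the ten suggestions for one text.
# precision_at_10([3, 2, 0, 1, 0, 0, 2, 1, 0, 0])  -> 0.5
```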

The length of blog posts and online news articles can vary from a few sentences to full-length, detailed reviews. The datasets contain texts of between 200 and 300 words, which is a typical length for a blog post. This amount of text is enough for the recommendation engine to return valid suggestions and for a translation service to provide translations in a reasonable amount of time. The texts covered topics from business/economy, politics, technology, sports and living.

6.6.3.2 Results and Discussion

We calculated precision@10 for all language pairs in three different batches and summarised the average precision in the tables below. The batches are labelled Original, Baseline and CC-enhanced. In the first batch (Original), untranslated texts were fed directly into the recommendation engine; in the second batch (Baseline), texts were first translated using the baseline MT and then fed into the recommendation engine; in the third batch (CC-enhanced), texts were translated into English using the CC-enhanced SMT and then fed into the recommendation engine. The inter-rater agreement between the two human evaluators was moderate, which was expected given the high level of subjectivity in the blogosphere.

Table 6.33 shows the average precision for different language pairs. For the Slovenian–English language pair, usage of baseline MT in comparison to original texts improved precision by 11%, and usage of CC-enhanced SMT improved it by 15%.

Table 6.33 Average precision for different language pairs

We can also see that using translation for German texts shows even greater improvement: nearly 20% for baseline MT and 24% for CC-enhanced SMT in comparison to original texts.

Unfortunately, we were not able to use CC-enhanced SMT for Croatian texts, and therefore, we only have precision for baseline MT. Using this translation method improved results by 11%.

Using an unpaired t-test, we tested the hypothesis that results obtained by running the recommendation engine on MT-translated texts do not differ significantly from results obtained on untranslated (original) texts. We tested both translation methods, and the difference was significant for all translation pairs at the 95% confidence level. Table 6.34 contains the mean, SD and P-values for both translation methods. Although the CC-enhanced method improved precision for Slovenian and German, the difference between the two translation methods is not significant.

Table 6.34 Mean, SD and P-value for language pairs and translation methods
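The significance test itself is a standard unpaired (independent two-sample) t-test; a sketch using SciPy is shown below. The function and variable names are illustrative only, not the actual evaluation script.

```python
# Unpaired t-test comparing precision@10 for suggestions on original texts
# versus MT-translated texts (illustrative sketch).
from scipy import stats

def compare_batches(precision_original, precision_translated, alpha=0.05):
    t_stat, p_value = stats.ttest_ind(precision_original, precision_translated)
    return p_value, p_value < alpha  # significant at the 95% confidence level?

# Example call with per-text precision@10 lists for the two batches:
# p, significant = compare_batches(original_scores, cc_enhanced_scores)
```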

Although we were more concerned with the criterion of fidelity, we also measured the translation time for the whole datasets (Table 6.35) and the percentage of translated words for ten randomly selected translated texts per translation method and language pair (where available), as summarised in Table 6.36. While average translation times were roughly similar for Slovenian and German, the difference between the minimum and maximum times was quite large; we did not investigate this further at this point.

Table 6.35 Average, minimum and maximum translation times for baseline method and CC-enhanced method
Table 6.36 Percentage of translated words for baseline translation method and CC-enhanced method

Next, we analysed these ten randomly selected translated texts per translation method and language pair (where available) for the percentage of translated words; the results are summarised in Table 6.36.

Interestingly, the percentage of translated words when using the CC-enhanced method for Slovenian texts increased by 17%, which could indicate that using comparable corpora can improve translation for under-resourced languages. The reason we were interested in how much of the text was actually translated lies in the way the recommendation engine works. It is based on keyword search and named entity recognition, and, if keywords and named entities are not translated, the suggested results might not be relevant to the original text.

Part of the engine is also based on statistical approaches in order to recognise new trending concepts and named entities that can appear in blog posts and news overnight. The training and learning cycle of a machine translation service has to be short enough to incorporate them into the translation model so that they get properly translated. Because the CC-enhanced method also depends on news crawling, extracting parallel phrases and training translation workers on this data, the learning cycle is longer than ideal (which would be daily integration of new concepts), but it might still be fast enough to be useful for Web authoring.

6.6.4 Conclusion

After investigating the importance of translation and specifically MT for Web authoring, we came to the conclusion that translation is much needed and desired at all levels of Web authoring (professional and amateur) and from all perspectives (content creator and content consumer). The quality of MT output is, with some exceptions, still not high enough to be used without human intervention and post-editing, and this is even more true for texts in under-resourced languages and narrow domains. Users in Web authoring use MT output mostly for gisting or as a basis for post-editing.

We described characteristics of Web authoring and user-generated content as well as several requirements that have to be met before successfully applying MT to Web authoring problems.

In our use case, we have shown that MT works well as an intermediate layer between content in under-resourced languages and Web services, such as a recommendation engine for related articles, that support only the English language. The recommendation engine returned better suggestions, that is, more of the suggested articles were actually related, when MT was used to translate the texts before feeding them into the engine.

Even though there are still several obstacles on the path to full utilisation of MT in Web authoring, it already benefits users by helping them bridge the language gap when they are searching for information, participating in social media networks or enriching blog posts written in an under-resourced language.

6.7 Systems for Computer-Aided Translation

Although the quality of MT systems has been widely criticised, MT is receiving more and more interest from the localisation industry due to growing pressure for efficiency and cost reduction.

Different aspects of post-editing and machine translatability have been researched since the 1990s (a comprehensive overview is provided by O’Brien (2005)). Several productivity tests have been performed in translation and localisation industry settings at Microsoft (Schmidtke 2008), Adobe (Flournoy and Duran 2009), Autodesk (Plitt and Masselot 2010) and Tilde (Skadiņš et al. 2011). In all these tests, the authors report productivity increases. However, in many cases, they also indicate significant performance differences across the various translation tasks. An increase in the error score for translated texts is also reported.

As the localisation industry experiences this growing pressure on efficiency and performance, some developers have already integrated MT into their computer-assisted translation (CAT) products, for example SDL Trados, ESTeam TRANSLATOR and Kilgray memoQ.

In this section, we demonstrate that, for language pairs and domains where there is not enough parallel data available,

  1. In-domain comparable corpora can be used to increase translation quality.

  2. If comparable corpora are large enough and can be classified as strongly comparable, then the trained SMT systems applied in the localisation process increase the productivity of human translators.

We present our work on English–Latvian SMT system adaptation to the IT domain: building a comparable corpus, extracting semi-parallel sentences and terminological units from the comparable corpus and adapting the SMT system to the IT domain with the help of the extracted data. We describe evaluation results demonstrating that data extracted from comparable corpora can significantly increase the BLEU score over a baseline system. Results from the application of the adapted SMT system in a real-life localisation task are presented, showing that SMT usage increased the productivity of human translators by 13.6%. This section is based on the publication by Pinnis et al. (2013).

6.7.1 Collecting and Processing a Comparable Corpus

For our experiment, we used an English–Latvian comparable corpus containing texts from the IT domain: software manuals and Web crawled data (consisting of IT product information, IT news, reviews, blogs, user support texts including software manuals, etc.). The corpus was acquired in an artificial fashion in order to simulate a strongly comparable narrow domain corpus (i.e. a corpus containing overlapping content in a significant proportion).

To get more data for our experiments, we used two different approaches in the creation of the comparable corpus. Thus, the corpus consists of two parts. The first part contains documents acquired from different versions of software manuals of a productivity software suite, split into chunks of fewer than 100 paragraphs per document and aligned at document level with the DictMetric tool, which is described in Chap. 2. As a very large number of alignments was produced, we filtered the document pairs so that, for each source and each target language document, no more than the top three alignments (for both languages separately) were included.

The second part consists of an artificially created strongly comparable corpus built from parallel data that is enriched with Web-crawled non-comparable and weakly comparable data. The parallel data was split into random chunks of 40 to 70 sentences per document and randomly polluted with 0 to 210 sentences from the Web-crawled data. The Web corpus sentences were injected at random positions in the English and Latvian documents separately, thus heavily polluting the documents with non-comparable data. The Web-crawled data was collected using the Focussed Monolingual Crawler (FMC), which is described in Chap. 3. The Web corpus consists of 232,665 unique English and 96,573 unique Latvian sentences. The parallel data contained 1,257,142 sentence pairs before pollution.
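The sketch below illustrates how such an artificially comparable corpus could be assembled from parallel data and Web-crawled noise, following the chunk sizes and pollution ranges given above; the concrete implementation details (data structures, randomisation) are assumptions for illustration.

```python
# Sketch: split parallel data into chunks of 40-70 sentence pairs and pollute
# each monolingual side with 0-210 randomly placed Web-crawled sentences.
import random

def chunk_parallel(sentence_pairs, min_len=40, max_len=70):
    i = 0
    while i < len(sentence_pairs):
        size = random.randint(min_len, max_len)
        yield sentence_pairs[i:i + size]
        i += size

def pollute(sentences, web_sentences, max_noise=210):
    noisy = list(sentences)
    for _ in range(random.randint(0, max_noise)):
        pos = random.randint(0, len(noisy))      # random insertion position
        noisy.insert(pos, random.choice(web_sentences))
    return noisy

def build_comparable_documents(sentence_pairs, web_en, web_lv):
    for chunk in chunk_parallel(sentence_pairs):
        en_side = [en for en, lv in chunk]
        lv_side = [lv for en, lv in chunk]
        # Each side is polluted independently, as described in the text.
        yield pollute(en_side, web_en), pollute(lv_side, web_lv)
```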

The statistics of the English–Latvian comparable corpus are given in Table 6.37. Note that the second part of the corpus accounts for 22,498 document pairs.

Table 6.37 Comparable corpus statistics

The parallel sentence extractor LEXACC, which is described in Chap. 5, was used to extract semi-parallel sentences from the comparable corpus. Before extraction, the texts were pre-processed: split into sentences (one sentence per line) and tokenised (tokens separated by a space).

Because the two parts of our corpus differ in terms of comparable data distribution and comparability level, different confidence score thresholds were applied for extraction. The thresholds were selected by manual inspection of extracted sentences so that most (more than 90%) of the extracted sentence pairs would be strongly comparable or parallel.

Table 6.38 shows information about data extracted from both parts of the corpus using the selected thresholds.

Table 6.38 Extracted semi-parallel sentence pairs

We applied the ACCURAT Toolkit to acquire in-domain bilingual term pairs from the comparable corpus, following the process thoroughly described in Pinnis et al. (2012b). First, the comparable corpus was monolingually tagged with terms; the terms were then bilingually mapped. Term pairs with a mapping confidence score below the selected threshold were filtered out. In order to achieve a precision of about 90%, we selected a confidence score threshold of 0.7. The statistics of both the monolingually extracted terms and the mapped terms are given in Table 6.39.

Table 6.39 Term tagging and mapping statistics

The term pairs were further filtered so that, for each Latvian term, only the English terms with the highest mapping confidence scores were preserved. We used the Latvian term to filter term pairs because Latvian is the morphologically richer language, and multiple inflected forms of a Latvian word, in most cases, correspond to a single English word form (although this is a crude filter, it increases the precision of term mapping to well over 90%).
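The following sketch illustrates this filtering step (a confidence threshold plus keeping, for each Latvian term, only the highest-scoring English candidates). The input format, a list of (English term, Latvian term, confidence) tuples, is an assumption; the actual ACCURAT Toolkit output format differs.

```python
# Keep, per Latvian term, only the English candidate(s) with the top confidence.
from collections import defaultdict

def filter_term_pairs(term_pairs, threshold=0.7):
    best = defaultdict(list)
    for en, lv, score in term_pairs:
        if score < threshold:              # drop low-confidence mappings first
            continue
        if not best[lv] or score > best[lv][0][1]:
            best[lv] = [(en, score)]       # new best score for this Latvian term
        elif score == best[lv][0][1]:
            best[lv].append((en, score))   # keep ties with the same top score
    return [(en, lv, score) for lv, cands in best.items() for en, score in cands]
```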

As can be seen in Table 6.39, only a small part of the monolingual terms was mapped. However, this number of mapped terms was sufficient for SMT system adaptation, as described below. It should also be noted that, in our adaptation scenario, translated single-word terms are more important than multi-word terms, as the adaptation process for single-word terms also partially covers the multi-word pairs missed by the mapping process.

6.7.2 Building SMT Systems

We used the LetsMT! platform (Vasiļjevs et al. 2012) based on the Moses tools (Koehn et al. 2007) to build three SMT systems: the baseline SMT system (trained on publicly available parallel corpora), the intermediate adapted SMT system (additionally using data extracted from the comparable corpus) and the final adapted SMT system (with in-domain terms integrated). All SMT systems were tuned with minimum error rate training (MERT) (Bertoldi et al. 2009) using randomly selected in-domain (IT domain) tuning data containing 1837 unique sentence pairs.

For the English–Latvian baseline system, the DGT-TM parallel corpora of two releases (2007 and 2011) were used. The corpora were cleaned in order to remove corrupt sentence pairs and duplicates. As a result, for training of the baseline system, a total of 1,828,317 unique parallel sentence pairs were used for translation model training, and a total of 1,736,384 unique Latvian sentences were used for language model training.

In order to adapt the SMT system for the IT domain, the extracted in-domain semi-parallel data (both sentence pairs and term pairs) were added to the parallel corpus used for baseline SMT system training. The whole parallel corpus was then cleaned and filtered with the same techniques as for the baseline system. The statistics of the filtered corpora used in SMT training of the adapted systems (intermediate and final) are shown in Table 6.40.

Table 6.40 Training data for adapted SMT systems

Table 6.40 shows that there was some sentence pair overlap between the DGT-TM corpus and the comparable corpus content. This was expected, as DGT-TM covers a broad domain and may contain documents related to the IT domain. For language modelling, however, the sentences that overlap between the general domain and in-domain monolingual corpora were filtered out from the general domain monolingual corpus. Therefore, the DGT-TM monolingual corpus statistics for the baseline system and the adapted systems do not match.

After filtering, a translation model was trained from all available parallel data, and two separate language models were trained from the monolingual corpora:

  • Latvian sentences from the DGT-TM corpora were used to build the general domain language model.

  • The Latvian part of the extracted semi-parallel sentences from the in-domain comparable corpus was used to build the in-domain language model.
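A minimal sketch of the preparation of the two language-model corpora, including the overlap filtering mentioned above; it assumes plain-text files with one sentence per line, and the file names are illustrative.

```python
def prepare_lm_corpora(general_path, in_domain_path,
                       general_out="lm.general.lv", in_domain_out="lm.indomain.lv"):
    """Write two LM training files: the in-domain corpus as-is, and the general
    domain corpus with sentences that also occur in the in-domain corpus removed."""
    with open(in_domain_path, encoding="utf-8") as f:
        in_domain = [line.rstrip("\n") for line in f if line.strip()]
    in_domain_set = set(in_domain)

    with open(in_domain_out, "w", encoding="utf-8") as f:
        f.write("\n".join(in_domain) + "\n")

    with open(general_path, encoding="utf-8") as fin, \
         open(general_out, "w", encoding="utf-8") as fout:
        for line in fin:
            sentence = line.rstrip("\n")
            if sentence and sentence not in in_domain_set:
                fout.write(sentence + "\n")
```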

To make in-domain translation candidates distinguishable from general domain translation candidates, the phrase table of the domain-adapted SMT system was further transformed into a term-aware phrase table (Pinnis and Skadiņš 2012) by adding a sixth feature to the default five features used in Moses phrase tables. The following values were assigned to this sixth feature:

  • ‘2’, if a phrase in both languages contained a term pair from the list of extracted term pairs.

  • ‘1’, if the phrase pair did not contain any extracted term pair in either language; phrase pairs containing a term in only one language, but not in both, also received ‘1’, as this case indicates a possible out-of-domain (wrong) translation candidate.

To determine whether a phrase contained a given term, every word in the phrase and in the term was stemmed before comparison. Finally, the transformed phrase table was integrated back into the adapted SMT system.
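A minimal sketch of the phrase table transformation is given below. It assumes a Moses-style phrase table with ‘ ||| ’-separated fields, the third field holding the five default feature scores, and a list of term pairs in the source and target languages of the phrase table; the trivial suffix-stripping stemmer is only a placeholder for the actual Latvian and English stemming used in the experiment.

```python
def crude_stem(word, suffix_len=3):
    """Placeholder stemmer: lower-case and strip a fixed-length suffix.
    The real system used proper language-specific stemming."""
    word = word.lower()
    return word[:-suffix_len] if len(word) > suffix_len + 2 else word


def stems(text):
    return {crude_stem(token) for token in text.split()}


def contains_term(phrase, term):
    """True if all stemmed words of the term occur among the stemmed phrase words."""
    return stems(term) <= stems(phrase)


def make_term_aware(phrase_table_in, phrase_table_out, term_pairs):
    """Append the sixth feature ('2' or '1') to each phrase table entry.
    `term_pairs` holds (source-language term, target-language term) tuples."""
    with open(phrase_table_in, encoding="utf-8") as fin, \
         open(phrase_table_out, "w", encoding="utf-8") as fout:
        for line in fin:
            fields = line.rstrip("\n").split(" ||| ")
            src, tgt = fields[0], fields[1]
            # '2' if both sides contain a mapped term pair, '1' otherwise
            # (including the case where only one side contains a term).
            both_sides = any(contains_term(src, s_term) and contains_term(tgt, t_term)
                             for s_term, t_term in term_pairs)
            fields[2] = fields[2] + (" 2" if both_sides else " 1")
            fout.write(" ||| ".join(fields) + "\n")
```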

6.7.3 Automatic and Comparative Evaluation

The baseline and both adapted systems were evaluated on 926 unique IT domain sentence pairs with four automatic evaluation metrics: BLEU, NIST, TER and METEOR. Both case-sensitive and case-insensitive evaluations were performed. The results are given in Table 6.41.
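Purely as an illustration of how such corpus-level scores can be computed (the figures in Table 6.41 were obtained with the standard tools for each metric), case-sensitive and case-insensitive BLEU could be calculated along these lines with the sacrebleu library; the file names are hypothetical.

```python
import sacrebleu  # pip install sacrebleu

# Hypothetical one-sentence-per-line files, aligned with the 926-sentence test set.
with open("test.hyp.lv", encoding="utf-8") as f:
    hypotheses = [line.rstrip("\n") for line in f]
with open("test.ref.lv", encoding="utf-8") as f:
    references = [line.rstrip("\n") for line in f]

bleu_cs = sacrebleu.corpus_bleu(hypotheses, [references])                  # case-sensitive
bleu_ci = sacrebleu.corpus_bleu(hypotheses, [references], lowercase=True)  # case-insensitive
print(f"BLEU (case-sensitive): {bleu_cs.score:.2f}")
print(f"BLEU (case-insensitive): {bleu_ci.score:.2f}")
```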

Table 6.41 Automatic evaluation results

The automatic evaluation shows a significant performance increase of the adapted systems over the baseline system in all evaluation metrics. Comparing the two adapted systems, we can see that making the phrase table term-aware (Final adapted system) yields a further improvement over simply adding the data extracted from the comparable corpus (Intermediate adapted system). This is due to better terminology selection in the fully adapted system. As terms make up only a limited part of the texts, the improvement is correspondingly limited.

For the manual system comparison, we used the same test corpus as for automatic evaluation and compared the baseline system against the adapted system. Figure 6.13 summarises the human evaluation results obtained with the evaluation method described in Skadiņš et al. (2010). Of the 697 evaluated sentences, the output of the adapted SMT system was chosen as the better translation in 490 cases (70.30 ± 3.39%), while users preferred the translation of the baseline system in 207 cases (29.70 ± 3.39%). This allows us to conclude that, for IT domain texts, the adapted SMT system provides better translations than the baseline system.
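The reported ±3.39% margin matches a 95% normal-approximation confidence interval for a binomial proportion (the original computation method is not stated); a quick check:

```python
import math

n, wins = 697, 490                           # evaluated sentences, adapted-system preferred
p = wins / n                                 # observed preference proportion
margin = 1.96 * math.sqrt(p * (1 - p) / n)   # 95% normal-approximation half-width
print(f"{100 * p:.2f}% ± {100 * margin:.2f}%")   # prints 70.30% ± 3.39%
```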

Fig. 6.13 System comparison by total points (System 1: baseline, System 2: adapted system)

Figure 6.14 illustrates the evaluation at the sentence level: we can reliably say that the adapted SMT system provides a better translation for 35 sentences, while users preferred the translation of the baseline system for only 3 sentences. Note that this figure presents results only for those sentences for which the evaluators showed a statistically significant preference for the first or the second system.

Fig. 6.14 System comparison by count of the best sentences (System 1: baseline, System 2: adapted system)

6.7.4 Evaluation in Localisation Task

The main goal of this evaluation task was to determine whether integrating the adapted SMT system into the localisation process increases translator output compared to manual translation. We compared productivity (words translated per hour) in two real-life localisation scenarios:

  • Translation using only translation memories (TMs).

  • Translation using suggestions from TMs and from the SMT system enriched with data from the comparable corpus.

6.7.4.1 Evaluation Set-Up

For the tests, 30 documents from the IT domain were used. Each document was split into two parts of 250 to 260 adjusted words on average, resulting in two sets of documents with about 7700 words in each set.

Three translators with different levels of experience and average performance were involved in the evaluation cycle. Each of them translated 10 documents without SMT support and 10 documents with integrated SMT support. The SDL Trados translation tool was used in both cases.

The results were analysed by editors who had no information about the techniques used to assist the translators. They analysed average translation performance (translated words per hour) and calculated an error score for the translated texts. Each translator's productivity was measured and compared against his or her own productivity in the other scenario. The average productivity across all translators was calculated using formula (6.2):

$$ \mathrm{Productivity}(\mathrm{scenario}) = \frac{\sum_{\mathrm{Text}=1}^{N}\mathrm{Adjusted\ words}(\mathrm{Text},\mathrm{scenario})}{\sum_{\mathrm{Text}=1}^{N}\mathrm{Actual\ time}(\mathrm{Text},\mathrm{scenario})}. $$
(6.2)
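A direct transcription of formula (6.2), assuming each text is represented by its adjusted word count and the actual time spent on it (in hours); the data structure is illustrative.

```python
def productivity(texts):
    """Average productivity per formula (6.2): total adjusted words divided by
    total actual translation time over all texts of a scenario.
    `texts` is an iterable of (adjusted_words, actual_time_hours) pairs."""
    total_words = sum(words for words, _ in texts)
    total_hours = sum(hours for _, hours in texts)
    return total_words / total_hours
```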

Using MT suggestions in addition to TMs increased the translators' productivity on average from 503 to 572 words per hour (see Table 6.42). There were considerable differences between individual translators, ranging from a performance increase of 35.4% to a decrease of 5.9% for one of the translators. Analysing these differences requires further study, but they are most likely caused by the working patterns and skills of the individual translators.

Table 6.42 Results of productivity evaluation

Given the standard deviation of productivity in both scenarios (186.8 words per hour without MT support and 184.0 with MT support), there were no significant performance differences in the overall evaluation. Individually, however, each translator showed larger differences in translation performance when using the MT-assisted scenario.

Editors also calculated an error score for every translation task by counting identified errors and applying a weighted multiplier based on the severity of the error type:

$$ \mathrm{ErrorScore} = \frac{1000}{n}\sum_{i} w_i e_i, $$
(6.3)

where $n$ is the number of words in the translated text, $e_i$ is the number of errors of type $i$, and $w_i$ is a coefficient (weight) indicating the severity of type $i$ errors. Depending on the error score, the translation is assigned a translation quality grade (Superior, Good, Mediocre, Poor, or Very poor) (Table 6.43).
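A direct transcription of formula (6.3), assuming per-error-type counts and severity weights are given as dictionaries; the error typology itself is not reproduced here.

```python
def error_score(n_words, error_counts, severity_weights):
    """Error score per formula (6.3): severity-weighted error count per 1000 words.
    `error_counts` maps error type -> number of errors found,
    `severity_weights` maps error type -> severity coefficient."""
    weighted = sum(severity_weights[t] * count for t, count in error_counts.items())
    return 1000.0 / n_words * weighted
```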

Table 6.43 Quality grades based on error scores

6.7.4.2 Results

The overall error score (shown in Table 6.44) increased for one out of the three translators. Although the combined error score for all translators increased from 24.9 to 26.0 points, it still remained within the quality grade ‘Good’.

Table 6.44 Localisation task error score results

6.7.5 Discussion

The results of our experiment demonstrate that it is feasible to adapt SMT systems to a particular domain with the help of comparable data and to integrate such SMT systems for highly inflected under-resourced languages into the localisation process.

The use of English–Latvian domain-adapted SMT suggestions (trained on comparable data) in addition to translation memories increased translation performance by 13.6% while maintaining an acceptable (‘Good’) translation quality. However, our experiments also showed relatively large differences in how individual translators' performance changed (from −5.89% to +35.39%), which suggests that the experiment should be repeated with more participants to obtain more reliable results. It would also be useful to further analyse the correlation between a translator's regular productivity and the productivity impact of adding MT support.

Error rate analysis shows that, overall, the use of MT suggestions decreased translation quality in two error categories (language quality and terminology). At the same time, this degradation is not critical, and the result is still acceptable for production purposes.