
1 Introduction

Statistical machine translation (SMT) is a methodology based on statistical data analysis. The performance of an SMT system largely depends on the quantity and quality of the parallel data it is trained on: abundant, high-quality parallel data boosts SMT results. Even so, parallel corpora remain scarce and are not easily available, and the genre and language coverage of the data that does exist is very limited, which constrains SMT performance. In particular, languages with very few native speakers offer a limited audience and therefore attract very little research in the field. This creates a technological gap between widely spoken languages and languages with few speakers, even though the majority of existing human languages are spoken by only a small population of native speakers.

As a result, high-quality data exists only for a few language pairs in particular domains, whereas the majority of languages lack sufficient linguistic resources such as parallel data. Building a translation system that covered every possible language pair would require millions of translation directions and a huge amount of parallel corpora. Moreover, once multiple domains enter the equation, the training data requirements increase dramatically. The current study explores methods for building high-quality parallel data automatically.

Multiple studies have attempted to automatically acquire additional data to enhance SMT systems [1, 2]. Such approaches have focused on discovering existing text in both the source and target languages. Our study presents an alternative: instead of gathering parallel data, we generate it. In this virtual parallel data, as we might call it, at least one side of each sentence pair is produced automatically. We start from monolingual text and obtain the other side of each pair through an automatic translation procedure. To monitor the quality of the automatically generated parallel data and to maximize its utility for SMT, we focus on the compatibility between the outputs of diverse SMT systems.

In classification, an estimate is commonly considered reliable when multiple systems reach a consensus on it. The output of machine translation (MT), however, is human language, and it is unrealistic to demand that multiple systems produce exactly the same output every time a translation is executed. In such situations, we can adopt partial compatibility as the objective rather than complete agreement among systems. To evaluate the generated data, we use the Levenshtein distance together with a backward translation procedure: only those pairs that pass an initial compatibility check, that is, that are translated back into the source language and remain sufficiently close to the original sentences, are accepted. This concept is depicted in Fig. 1 and sketched in code after the figure.

Fig. 1. Corpora generation scheme
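
To make the scheme concrete, the following is a minimal sketch in Python. The engine functions are hypothetical stand-ins for the forward SMT systems and the reverse system, the thresholds are illustrative assumptions rather than values from this study, and the normalized Levenshtein distance is computed with the third-party python-Levenshtein package.

```python
# Minimal sketch of the generation scheme in Fig. 1, assuming several
# black-box forward MT engines and one reverse engine (all hypothetical).
from Levenshtein import distance

def norm_dist(a: str, b: str) -> float:
    """Levenshtein distance normalized by the longer string's length."""
    return distance(a, b) / max(len(a), len(b), 1)

def generate_pairs(mono_sentences, engines, back_engine,
                   agree_thr=0.3, back_thr=0.3):
    """Yield (source, target) pairs passing both compatibility filters.

    engines: source->target translation functions (black boxes).
    back_engine: target->source translation function.
    Thresholds are illustrative assumptions, not values from the paper.
    """
    for src in mono_sentences:
        candidates = [engine(src) for engine in engines]
        best = candidates[0]
        # Filter 1: the engines' outputs must be mutually compatible.
        if any(norm_dist(best, c) > agree_thr for c in candidates[1:]):
            continue
        # Filter 2: the back-translation must stay close to the original.
        if norm_dist(back_engine(best), src) > back_thr:
            continue
        yield src, best
```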

We can use this method to generate additional parallel data from the monolingual news data provided for WMT16. Retraining the translation system on the newly assessed data enhances its performance. The methodology is language independent, which makes it particularly valuable for rare but important language pairs with scarce resources. Most significantly, the virtual parallel corpus generated by the system is applicable not only to MT but also to other natural language processing (NLP) tasks.

2 State of the Art

In this study, we present an approach that generates comprehensive multilingual resources through SMT systems. Two established lines of work are relevant to MT applications: self-training and translation via bridge languages. Both differ from our approach. The first focuses on exploiting the available bilingual data, and linguistic resources from other languages are rarely applied. The second focuses on improving the alignment of existing word segments and incorporates the phrase-model concept, rather than exploring new text in context, such as translations at the word, phrase, or even sentence level, through bridge languages. The methodology of this paper lies between the self-training paradigm and translation via a bridge language. We generate parallel data instead of gathering it, and we exploit linguistic information and the relationships between languages to perform translations between source and target languages.

Callison-Burch and Osborne [3] presented a co-training method for SMT that uses the consensus of several translation systems to identify the best translation resources for training. Similarly, Ueffing et al. [4] explored model adaptation methods using monolingual data from the source language. In both lines of work, learning remained confined to the given language pair, without introducing new information from a third language.

In another approach, Mann and Yarowsky [5] presented a technique for developing a translation lexicon based on transduction models of cognate pairs through a bridge language. Edit distance is applied instead of a general MT system, which limits the range of vocabulary covered, mostly to major European languages. Kumar et al. [6] described a process for boosting word alignment quality by using multiple bridge languages. In [7] and [8], phrase translation tables are improved with phrase tables acquired in multiple ways from pivot languages. In [9], a hybrid method combining RBMT and SMT systems is introduced to fill gaps in the data for pivot translation. Cohn and Lapata [10] presented a methodology that generates more reliable translation results from small data sets by using multi-parallel data.

Contrary to the existing approaches, we treat the translation systems as black boxes. This means that virtual data can be generated broadly, with translation systems that are rule based, statistics based, or even based on human translations. The approach introduced in [11] pooled the translations of a test set produced by any of the pivot MTs for each language. However, that approach did not retrain the systems, so the novel training data was never used. Bertoldi et al. [12], among others, also conducted research on pivot languages, but they did not consider universal corpus filtering, that is, measuring compatibility to control data quality.

3 Generating Virtual Parallel Data

To generate new data, we trained three SMT systems. The Experiment Management System [13] from the open-source Moses SMT toolkit was used to carry out the experiments. A 6-gram language model was trained using the SRI Language Modeling toolkit (SRILM) [14]. Word and phrase alignment was performed using the SyMGIZA++ symmetric word alignment tool [15] instead of GIZA++. Out-of-vocabulary (OOV) words were handled using the Unsupervised Transliteration Model [16]. For the Czech (CS) and English (EN) language pair, the first SMT system was trained on TED [17], the second on the Qatar Computing Research Institute's Educational Domain Corpus (QED) [18], and the third on the News Commentary corpora provided for the WMT16 translation task. Official WMT16 test sets were used for system evaluation. Translation engine performance was measured with the BLEU metric [25] and is shown in Table 1.

Table 1. Corpora used for generation of SMT systems
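
For illustration, language model training of this kind can be scripted as follows. This is a minimal sketch, assuming SRILM's ngram-count binary is on the PATH; the file names and the discounting options are assumptions, since the paper does not specify them.

```python
import subprocess

# Train a 6-gram language model with SRILM's ngram-count.
# Kneser-Ney discounting with interpolation is a common choice; the
# exact options used for the experiments are not stated in the paper.
subprocess.run(
    [
        "ngram-count",
        "-order", "6",             # 6-gram model, as used in this study
        "-kndiscount",             # modified Kneser-Ney discounting
        "-interpolate",            # interpolate discounted estimates
        "-text", "mono.en.txt",    # tokenized training corpus (hypothetical name)
        "-lm", "news.6gram.arpa",  # output ARPA model (hypothetical name)
    ],
    check=True,
)
```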

All engines worked in accordance with Fig. 1, and the Levenshtein distance was used to measure the compatibility between translation results. The Levenshtein distance measures the difference between two strings as an edit distance and is closely linked to pairwise string alignment [26].

Mathematically, the Levenshtein distance between two strings \( a, b \) (of length \( |a| \) and \( |b| \), respectively) is given by \( \mathrm{lev}_{a,b}(|a|,|b|) \), where:

$$ \mathrm{lev}_{a,b}(i,j) = \begin{cases} \max(i,j) & \text{if } \min(i,j) = 0, \\ \min\left\{\begin{array}{l} \mathrm{lev}_{a,b}(i-1,j) + 1 \\ \mathrm{lev}_{a,b}(i,j-1) + 1 \\ \mathrm{lev}_{a,b}(i-1,j-1) + 1_{[a_i \neq b_j]} \end{array}\right. & \text{otherwise.} \end{cases} $$

In this equation, \( 1_{[a_i \neq b_j]} \) is the indicator function, equal to 0 when \( a_i = b_j \) and to 1 otherwise, and \( \mathrm{lev}_{a,b}(i,j) \) is the distance between the first \( i \) characters of \( a \) and the first \( j \) characters of \( b \).
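
For concreteness, the recurrence can be transcribed almost literally into Python (a memoized sketch suitable for short strings; an iterative version is preferable in practice to avoid recursion limits):

```python
from functools import lru_cache

def levenshtein(a: str, b: str) -> int:
    """Direct transcription of the lev recurrence above."""
    @lru_cache(maxsize=None)
    def lev(i: int, j: int) -> int:
        if min(i, j) == 0:                 # one prefix is empty
            return max(i, j)
        return min(
            lev(i - 1, j) + 1,                          # deletion
            lev(i, j - 1) + 1,                          # insertion
            lev(i - 1, j - 1) + (a[i - 1] != b[j - 1])  # substitution
        )
    return lev(len(a), len(b))

print(levenshtein("kitten", "sitting"))  # 3
```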

Using the combined methodology and monolingual data, parallel corpora were built. Statistical information on the data is provided in Table 2.

Table 2. Specification of generated corpora

The purpose of this research was to create synthetic parallel data for training a machine translation system by translating monolingual texts with multiple machine translation systems and applying several filtering steps. This objective is not new; synthetic data has been created before. The novel aspects of the present paper are the use of three MT systems with the Levenshtein distance between their outputs as a filter and, more importantly, the use of back-translation as an additional filtering step. In Table 3, we show statistical information on the corpora generated without the backward translation step.

Table 3. Specification of generated corpora without backward translation

4 Experimental Setup

The machine translation experiments we conducted involve three WMT16 tasks: news translation, information technology (IT) document translation, and biomedical text translation. Our experiments were conducted on the CS-EN pair in both directions. To obtain more accurate word alignment, we used the SyMGIZA++ tool, which trains pairs of directional alignment models that are updated in step with each other. It produces one-to-many and many-to-one alignments in both directions between the given language pair. SyMGIZA++ also exploits a pool of several processors through modern thread management, which makes it very fast. The alignment process in our case uses four distinct models during system training to achieve refined and enhanced alignment outcomes; this approach has proved fruitful [15]. OOV words are another challenge for an SMT system. To deal with them, we used the Moses toolkit together with its Unsupervised Transliteration Model (UTM), a language-independent, unsupervised approach to learning transliterations of OOV words. We utilized the post-decoding transliteration method from this toolkit, in which the UTM builds a transliteration phrase table and uses it to score possible transliterations of each OOV word [16, 19].
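
SyMGIZA++ synchronizes the two directional alignment models during training; the classic post hoc way of combining two directional alignments is symmetrization. The following is a simplified sketch of the grow-diag symmetrization heuristic (not the tool's actual algorithm), operating on toy alignment sets:

```python
# Simplified grow-diag symmetrization of two directional word alignments.
# forward:  set of (src, tgt) links from the source->target model
# backward: set of (src, tgt) links from the target->source model
NEIGHBORS = [(-1, 0), (1, 0), (0, -1), (0, 1),
             (-1, -1), (-1, 1), (1, -1), (1, 1)]

def grow_diag(forward: set, backward: set) -> set:
    alignment = forward & backward      # high-precision intersection
    union = forward | backward
    added = True
    while added:
        added = False
        for (i, j) in sorted(alignment):
            for di, dj in NEIGHBORS:
                cand = (i + di, j + dj)
                src_free = all(a != cand[0] for a, _ in alignment)
                tgt_free = all(b != cand[1] for _, b in alignment)
                # Adopt a union point that links a still-unaligned word.
                if cand in union and cand not in alignment \
                        and (src_free or tgt_free):
                    alignment.add(cand)
                    added = True
    return alignment

# Toy example: the two models disagree on the last link.
fwd = {(0, 0), (1, 1), (2, 2)}
bwd = {(0, 0), (1, 1), (2, 3)}
print(sorted(grow_diag(fwd, bwd)))  # [(0, 0), (1, 1), (2, 2), (2, 3)]
```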

The KenLM tool was applied to language model training. This library addresses typical language modeling problems, reducing execution time and memory usage. For phrase reordering, we used a lexicalized reordering model, in which reordering probabilities are conditioned on the lexical values of the phrases. Three orientation types are distinguished with respect to the adjacent target phrase: monotone (M), swap (S), and discontinuous (D). All three orientations were used in a hierarchical model, and the bidirectional reordering model examines the phrase arrangement probabilities in both translation directions [20,21,22].
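
As a toy illustration of the three orientations, the sketch below classifies the orientation of a phrase pair relative to the previously translated phrase. This is a simplified word-span view; actual extraction in Moses estimates these orientations from alignment statistics.

```python
# Toy classification of reordering orientation between two consecutive
# target phrases, given the source spans they cover. This only
# illustrates the M/S/D distinction used by lexicalized reordering.

def orientation(prev_src_span, cur_src_span):
    """Spans are (start, end) source-word indices, end exclusive."""
    if cur_src_span[0] == prev_src_span[1]:
        return "M"   # monotone: current phrase directly follows previous
    if cur_src_span[1] == prev_src_span[0]:
        return "S"   # swap: current phrase directly precedes previous
    return "D"       # discontinuous: anything else

assert orientation((0, 2), (2, 4)) == "M"
assert orientation((2, 4), (0, 2)) == "S"
assert orientation((0, 2), (5, 7)) == "D"
```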

The quality of domain adaptation largely depends on the training data used to build the language and translation models. Acquiring domain-centric data helps greatly in this regard [23]. A parallel general-domain corpus and a monolingual in-domain corpus were used in this process, following Wang et al. [24]. First, the sentence pairs of the parallel data were weighted according to their relevance to the target domain. Second, the pairs were ranked and the best ones selected. The models were then trained on the selected sentence pairs for the target domain [24].

For similarity measurement, we used three approaches: word overlap analysis, the cosine term frequency-inverse document frequency (tf-idf) criterion, and perplexity measurement. The third approach, which incorporates the strengths of the first two, is the strictest. Wang et al. [24] observed that a combination of these approaches provides the best domain adaptation results for Chinese-English corpora. Inspired by their work, we likewise combined the three measurements for domain adaptation. Wang et al. found that this process retains around 20 percent of the data as domain-analogous [24].
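
As an example of one of these criteria, the sketch below scores general-domain sentences against a small in-domain sample with tf-idf cosine similarity using scikit-learn. The corpus snippets and the suggested cutoff are made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical data: a small in-domain sample and a general-domain pool.
in_domain = ["the patient received a dose of aspirin",
             "clinical trials showed reduced symptoms"]
pool = ["the weather was sunny all week",
        "patients in the trial reported fewer symptoms",
        "stock prices fell sharply on monday"]

vectorizer = TfidfVectorizer()
vectorizer.fit(in_domain + pool)            # shared vocabulary
domain_vec = vectorizer.transform([" ".join(in_domain)])
pool_vecs = vectorizer.transform(pool)

# Rank pool sentences by cosine similarity to the in-domain profile.
scores = cosine_similarity(pool_vecs, domain_vec).ravel()
for sent, score in sorted(zip(pool, scores), key=lambda x: -x[1]):
    print(f"{score:.3f}  {sent}")
# Sentences above a chosen cutoff (e.g., 0.1) would be kept.
```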

5 MT Results and Conclusions

Numerous human languages are used around the world, and covering all possible language pairs would require millions of translation directions. Translation systems for most of these pairs struggle largely because of the limited availability of language resources such as parallel data.

We have attempted to supplement these limited resources. Additional parallel corpora can be used to improve the quality and performance of linguistic resources as well as of individual NLP systems. In the MT application, our data generation approach increased translation performance. Although the results appear very promising, there is still much room for improvement; for example, performance gains could be attained by applying more sophisticated algorithms to quantify the agreement among different MT engines. In Table 4, we present the baseline (BASE) results of the MT systems for three diverse domains (news, IT, and biomedical), using official WMT16 test sets, alongside the results after generating a virtual corpus and adapting it to each domain (FINAL). The generated corpora demonstrate improvements in SMT quality and in utility as NLP resources. From Table 2, it can be concluded that a generated virtual corpus is morphologically rich, which makes it acceptable as a linguistic resource. In addition, by retraining the SMT system with the virtual corpus and repeating all the steps, it is possible to obtain more virtual data of higher quality.

Table 4. Evaluation of generated corpora

Lastly, in Table 5 we replicate the same quality experiment using data generated without the backward translation step. As shown in Table 3, more data can be obtained in this manner. However, the SMT results are not as good as those obtained with the backward translation step, which indicates that the additional data is noisy and most likely contains incomplete sentences of the kind that the backward translation step removes.

Table 5. Evaluation of corpora generated without backward translation step