
1 Introduction

Statistical machine translation (SMT) is a methodology based on statistical data analysis. The performance of SMT systems depends largely on the quantity and quality of the parallel data they are trained on: the more abundant and the cleaner the parallel data, the better the SMT results. Even so, good-quality parallel corpora, free of noise and errors, remain scarce and are not easily available [1]. Moreover, to increase SMT performance, the genre and language coverage of the data should be restricted to a specific text domain, e.g., legal or medical texts. In particular, little research has been conducted on languages with few native speakers and thus a limited audience, even though most existing human languages are spoken by only a small population of native speakers, as shown in Table 1.

Table 1. Top languages by population: asterisks mark the 2010 estimates for the top dozen languages

Despite the enormous number of people with technological knowledge and access, many are excluded because they cannot communicate globally due to language divides. According to Anderson et al. [2], over 6,000 languages are used globally; there is no universal spoken language for communication. English is only the third most widely spoken language (used by only 5.52% of the global population); Spanish (5.85%) and Mandarin (14.1%) are more common [3]. Moreover, fewer than 40% of citizens of the European Union (not including developing or Eastern European countries) know English [4], which makes communication a problem even within the EU [5].

This has created a technical gap between widely spoken languages and languages with few speakers. It has also led to a large gap in the quality and quantity of parallel corpora available for less common language pairs, which slows the progress of natural language processing research for those languages.

As a result, high-quality data exist for only a few language pairs in particular domains (e.g., Czech-English legal texts), whereas the majority of languages lack sufficient linguistic resources, such as parallel data, for good-quality research or natural language processing tasks. Building a translation system that could handle all possible language pairs would require millions of translation directions and a huge volume of parallel data. Moreover, once multiple domains enter the equation, the corpus requirements for machine translation training grow dramatically. Thus, the current study explored methods to build a corpus of high-quality parallel data, using Czech-English as the language pair.

Multiple studies have been performed to automatically acquire additional data for enhancing SMT systems in the long term [6]. All such approaches have focused on discovering authentic text from real-world sources for both the source and target languages. However, our study presents an alternative approach for building this parallel data. In creating virtual parallel data, as we might call it, at least one side of the parallel data is generated, for which purpose we use monolingual text (news internet crawl in Czech, in this case). For the other side of the parallel data, we use an automated procedure to obtain a translation of the text. In other words, our approach generates rather than gathers parallel data. To monitor the performance and quality of the automatically generated parallel data and to maximize its utility for SMT, we focus on compatibility between the diverse layers of an SMT system.

It is recommended that an estimate be considered reliable when multiple systems show a consensus on it. However, since the output of machine translation (MT) is human language, it is much too complicated to seek unanimity from multiple systems to generate the same output each time we execute the translation process. In such situations, we can choose partial compatibility as an objective rather than complete agreement between multiple systems. To evaluate the generated data, we can use the Levenshtein distance as well as implementing a back-translation procedure. Using this approach, only those pairs that pass an initial compatibility check, when translated back into the native language and compared to the original sentences, will be accepted. This concept is depicted in Fig. 1.
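The acceptance criterion described above can be sketched as follows. This is a minimal illustration, not the actual pipeline: the engine callables and thresholds are placeholders, and `difflib`'s similarity ratio stands in for the Levenshtein-based compatibility score.

```python
import difflib

def agreement(a: str, b: str) -> float:
    """Similarity in [0, 1]; difflib's ratio is a lightweight stand-in
    for the Levenshtein-based compatibility score described in the text."""
    return difflib.SequenceMatcher(None, a, b).ratio()

def accept_pair(src, engines, back_translate, fwd_thresh=0.8, back_thresh=0.8):
    """Keep a generated pair only if (1) the translation engines roughly
    agree on the output and (2) the back-translation stays close to the
    original source sentence."""
    candidates = [engine(src) for engine in engines]
    best = candidates[0]
    if any(agreement(best, c) < fwd_thresh for c in candidates[1:]):
        return None                      # engines disagree: discard
    if agreement(back_translate(best), src) < back_thresh:
        return None                      # back-translation drifted: discard
    return (src, best)
```

In practice the thresholds would be tuned on held-out data, and the engines would be the three SMT systems described in Sect. 2.1.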

Fig. 1. Generation of artificial data

We can use this method to easily generate additional parallel data from the monolingual news data provided for WMT16. Retraining the system on the newly assessed data enhances translation performance, and rare linguistic resource pairs can be enriched. The methodology is not limited to particular languages and is especially valuable for rare but important language pairs. Most significantly, the virtual parallel corpus generated by the system is applicable to MT as well as other natural language processing (NLP) tasks.

2 State of the Art

In this study, we present an approach based on generating comprehensive multilingual resources through SMT systems. Two established approaches are relevant to such MT applications: self-training and translation via bridge languages (also called “pivot languages”). These approaches differ from ours. Self-training focuses on exploiting the available bilingual data, and the linguistic resources of a third language are rarely applied. Translation via bridge languages focuses more on correcting the alignment of existing word segments; it incorporates the phrase-model concept, examining translations at the word, phrase, or even sentence level through bridge languages, rather than exploring new text in context. The methodology of this paper lies between the self-training paradigm and translation via a bridge language: we generate data instead of gathering it, while also applying linguistic information and inter-language relationships to produce translations between the source and target languages.

Callison-Burch and Osborne [7] presented a cooperative training method for SMT that comprises the consensus of several translation systems to identify the best translation resource for training. Similarly, Ueffing et al. [8] explored model adaptation methods to use monolingual data from a source language. Furthermore, as the learning progressed, the application of that learned material was constrained by a multi-linguistic approach without introducing new information from a third language.

In another approach, Mann and Yarowsky [9] presented a technique to develop a translation lexicon based on transduction models of cognate pairs through a bridge language. In this case, the edit distance rate was applied to the process rather than the general MT system of limiting the vocabulary range for majority European languages. Kumar et al. [10] described the process of boosting word alignment quality using multiple bridge languages. In Wu and Wang [11], Habash and Hu [12], phrase translation tables were improved using phrase tables acquired in multiple ways from pivot languages. In Eisele et al. [13], a hybrid method was combined with RBMT (Rule-Based Machine Translation) and SMT systems. This methodology was introduced to fill gaps in the data for pivot translation. Cohn and Lapata [14] presented another methodology to generate more reliable results of translations by generating information from small sets of data using multi-parallel data.

Contrary to the existing approaches, in this study, we returned to the black-box translation system. This means that virtual data could be widely generated for translation systems, including rule-based, statistics-based, and human-based translations. The approach introduced in Leusch et al. [15] pooled the results of translations of a test set created by any of the pivot MTs per unique language. However, this approach was not found to enhance the systems, and hence the novel training data were not used. Amongst others, Bertoldi et al. [16] also conducted research on pivot languages, but did not consider applying universal corpus filtering, which is the measurement of compatibility to control data quality.

2.1 Generating Virtual Parallel Data

To generate new data, we trained three SMT systems based on TED, QED and News Commentary corpora. The Experiment Management System [17] from the open source Moses SMT toolkit was utilized to carry out the experimentation. A 6-gram language model was trained using the SRI Language Modeling toolkit (SRILM) [18]. Word and phrase alignment was performed using the SyMGIZA++ symmetric word alignment tool [19] instead of GIZA++. Out-of-vocabulary (OOV) words were monitored using the Unsupervised Transliteration Model [20]. Working with the Czech (CS) and English (EN) language pair, the first SMT system was trained on TED [21], the second on the Qatar Computing Research Institute’s Educational Domain Corpus (QED) [22], and the third using the News Commentary corpora provided for the WMT16 translation task. Official WMT16 test sets were used for system evaluation. Translation engine performance was measured by the BLEU metric [23]. The performance of the engines is shown in Table 2.

Table 2. Corpora used for generation of SMT systems

All engines operated in accordance with Fig. 1, and the Levenshtein distance was used to measure the compatibility between translation results. The Levenshtein distance measures the dissimilarity between two strings; it is an edit distance and is closely linked to pairwise string alignment [24].

Mathematically, the Levenshtein distance between two strings a and b (of length |a| and |b|, respectively) is given by \( \mathrm{lev}_{a,b}(|a|, |b|) \), where:

$$ \mathrm{lev}_{a,b}(i,j) = \begin{cases} \max(i,j) & \text{if } \min(i,j) = 0, \\ \min \begin{cases} \mathrm{lev}_{a,b}(i-1,j) + 1 \\ \mathrm{lev}_{a,b}(i,j-1) + 1 \\ \mathrm{lev}_{a,b}(i-1,j-1) + 1_{[a_i \ne b_j]} \end{cases} & \text{otherwise.} \end{cases} $$

In this equation, \( 1_{[a_i \ne b_j]} \) is the indicator function, equal to 0 when \( a_i = b_j \) and to 1 otherwise, and \( \mathrm{lev}_{a,b}(i,j) \) is the distance between the first i characters of a and the first j characters of b.
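The recurrence can be implemented directly with bottom-up dynamic programming; the following is a straightforward sketch, not the implementation used in the experiments:

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance lev_{a,b}(|a|, |b|), computed bottom-up over the
    recurrence above (insertion, deletion, substitution all cost 1)."""
    m, n = len(a), len(b)
    # Base cases lev(i, 0) = i and lev(0, j) = j, i.e. the max(i, j) branch.
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(
                d[i - 1][j] + 1,                           # deletion
                d[i][j - 1] + 1,                           # insertion
                d[i - 1][j - 1] + (a[i - 1] != b[j - 1]),  # substitution
            )
    return d[m][n]
```

For example, `levenshtein("kitten", "sitting")` yields 3 (two substitutions and one insertion).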

Using the combined methodology and monolingual data, parallel corpora were built. Statistical information on the data is provided in Table 3.

Table 3. Specification of generated corpora

The purpose of this research was to create synthetic parallel data to train a machine translation system by translating monolingual texts with multiple machine translation systems and various filtering steps. This objective is not new; synthetic data have been created in the past. However, the novel aspect of the present paper is its use of three MT systems, application of the Levenshtein distance between their outputs as a filter, and—much more importantly—its use of back-translation as an additional filtering step. In Table 4, we show statistical information on the corpora used without the back-translation step.

Table 4. Specification of generated corpora without back-translation

2.2 Semantically-Enhanced Generated Corpora

The artificially generated corpora presented in Table 3 were obtained using statistical translation models, which rely purely on how frequently word sequences occur and not on what they mean; such models have no real understanding of what was translated. In this research, these data were therefore extended with semantic information so as to improve the quality and domain coverage of the data. Word relationships were integrated into the generated data using the WordNet database.

The way in which WordNet was used to obtain a probability estimator was shown in Cao et al. [25]. In particular, we wanted to obtain P(wi|w), where wi and w are assumed to have a relationship in WordNet. The formula is as follows:

$$ P(w_i \mid w) = \frac{c(w_i, w \mid W, L)}{\sum_{w_j} c(w_j, w \mid W, L)} $$

where W is the window size and c(w_i, w|W, L) is the count of w_i and w appearing together within a W-word window. This can be obtained simply by counting co-occurrences in a given corpus. To smooth the model, we applied interpolated Kneser-Ney smoothing [26].
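A minimal sketch of this count-based estimator is shown below. Only the window-based co-occurrence counting is implustrated; the Kneser-Ney smoothing and the WordNet relation filter are omitted, and all names are illustrative:

```python
from collections import Counter

def cooccurrence_probs(tokens, target, window=5):
    """Estimate P(w_i | w) as c(w_i, w | W) / sum_j c(w_j, w | W), where the
    counts are co-occurrences of each word with the target word inside a
    window of W tokens on either side (smoothing omitted for brevity)."""
    counts = Counter()
    for pos, tok in enumerate(tokens):
        if tok != target:
            continue
        lo, hi = max(0, pos - window), min(len(tokens), pos + window + 1)
        for other in tokens[lo:pos] + tokens[pos + 1:hi]:
            counts[other] += 1
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()} if total else {}
```

The resulting distribution sums to 1 over all words observed near the target, which is exactly the normalization in the formula above.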

The following relationships were considered: synonym, hypernym, hyponym, and hierarchical distance between words.

In Table 5, we show statistical information on the semantically enhanced corpora produced previously and shown in Table 3.

Table 5. Specification of semantically generated corpora without back-translation

Another common approach to semantic analysis that is also used within this research is latent semantic analysis (LSA). LSA has already been shown to be very helpful in automatic speech recognition (ASR) [27] and many other applications, which was the reason for incorporating it within the scope of this research. The high-level idea of LSA is to convert words into concept representations and to assume that if the occurrence of word patterns in documents is similar, then the words are also similar. The mathematical model can be defined as follows:

To build the LSA model, a co-occurrence matrix W is first built, where w_ij is a weighted count of word w_i in document d_j:

$$ w_{ij} = G_i \, L_{ij} \, C_{ij} $$

where C_ij is the count of w_i in document d_j, L_ij is a local weight, and G_i is a global weight. Typically, L_ij and G_i are based on TF-IDF.

Then, singular value decomposition (SVD) is applied to W:

$$ W = U S V^{T} $$

where W is an M × N matrix (M is the vocabulary size, N the number of documents); U is M × R, S is R × R, and V^T is R × N. R is usually a predefined number of dimensions between 100 and 500.

After that, each word w_i can be represented by a new vector U_i = u_i S. Based on this new vector, the similarity between two words is defined as:

$$ K(U_i, U_j) = \frac{u_i \, S^2 \, u_j^{T}}{|u_i S| \cdot |u_j S|} $$

Therefore, clustering can be performed to organize words into K clusters, C1, C2, …., CK.
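The steps above (truncated SVD, the projection U_i = u_i S, and the cosine-style similarity) can be sketched with NumPy. This is an illustrative sketch with our own variable names; clustering can then be delegated to any standard K-means routine:

```python
import numpy as np

def lsa_word_vectors(W, r):
    """SVD of the term-document matrix W (M x N): W = U S V^T, truncated
    to rank r. Word i is then represented by the vector U_i = u_i * S."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    U, s = U[:, :r], s[:r]
    return U * s            # row i is the r-dimensional word vector

def word_similarity(vecs, i, j):
    """K(U_i, U_j): cosine similarity of the projected word vectors,
    equivalent to u_i S^2 u_j^T / (|u_i S| |u_j S|)."""
    a, b = vecs[i], vecs[j]
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```

Words that appear in the same documents end up with near-parallel vectors (similarity close to 1), while words with disjoint document distributions score near 0.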

If \( {\text{H}}_{{{\text{q}} - 1}} \) is the history for word Wq, then it is possible to obtain the probability of Wq given \( {\text{H}}_{{{\text{q}} - 1}} \) using the following formula:

$$ \begin{aligned} P(W_q \mid H_{q-1}) &= P(W_q \mid W_{q-1}, W_{q-2}, \ldots, W_{q-n+1}, d_{q-1}) \\ &= P(W_q \mid W_{q-1}, W_{q-2}, \ldots, W_{q-n+1}) \cdot P(W_q \mid d_{q-1}) \end{aligned} $$

where \( P(W_q \mid W_{q-1}, W_{q-2}, \ldots, W_{q-n+1}) \) is the N-gram model and \( P(W_q \mid d_{q-1}) \) is the LSA model.

Additionally,

$$ P(W_q \mid d_{q-1}) = \frac{K(U_q, V_{q-1})}{Z(U, V)}, \qquad K(U_q, V_{q-1}) = \frac{U_q \, S \, V_{q-1}^{T}}{|U_q S^{1/2}| \cdot |V_{q-1} S^{1/2}|}, $$

where Z(U, V) is a normalization factor.

It is also possible to apply word smoothing to the model based on K-clustering, as follows:

$$ P(W_q \mid d_{q-1}) = \sum_{k=1}^{K} P(W_q \mid C_k) \, P(C_k \mid d_{q-1}) $$

where \( P(W_q \mid C_k) \) and \( P(C_k \mid d_{q-1}) \) can be computed using the similarity measure given above together with a normalization factor.

In this way, the N-gram and LSA models are combined into a single language model that can be used for word comparison and text generation. Python code for such an LSA analysis is provided in Thomo [28].

In Table 6, we show statistical information on the semantically enhanced corpora produced previously and shown in Table 3.

Table 6. Specification of semantically generated corpora using LSA

2.3 Experimental Setup

The machine translation experiments we conducted involved three WMT16 tasks: news translation, information technology (IT) document translation, and biomedical text translation. Our experiments were conducted on the CS-EN pair in both directions. To obtain more accurate word alignment, we used the SyMGiza++ tool, which computes symmetric word alignment models. This tool develops alignment models that support many-to-one and one-to-many alignments in both directions between the given language pair. SyMGiza++ also exploits a pool of processors with up-to-date threading management, which makes it very fast. The alignment process uses four unique models during system training to achieve refined and enhanced alignment outcomes; this approach proved fruitful in previous research [19]. OOV words are another challenge for an SMT system; to deal with them, we used the Moses toolkit together with the Unsupervised Transliteration Model (UTM). The UTM is a language-independent approach that learns OOV words in an unsupervised manner. We also utilized the post-decoding transliteration method from this toolkit. UTM makes use of a transliteration phrase translation table to access probable solutions; it was used to score several possible transliterations and to find a translation table [20, 29].

The KenLM tool was applied to language model training. This library addresses typical language-model problems, reducing both execution time and memory usage. The lexical values of the sentences were used to reorder the phrase probabilities, and we also used KenLM for lexical reordering. Three directional types are defined with respect to each target phrase: swap (S), monotone (M), and discontinuous (D); all three were used in a hierarchical model. A bidirectional reordering model was used to examine the phrase arrangement probabilities [30,31,32].

The quality of domain adaptation largely depends on the training data used to build the language and translation models. The acquisition of domain-centric data helps greatly in this regard [33]. A parallel generalized-domain corpus and a monolingual corpus were used in this process, following Wang et al. [34]. First, sentence pairs of the parallel data were weighted according to their relevance to the target domain. Second, reranking was conducted to obtain the best sentence pairs. The selected sentence pairs were then used to train models for the target domain [34].

For similarity measurement, we used three approaches: word overlap analysis, the cosine term frequency-inverse document frequency (tf-idf) criterion, and perplexity measurement. The third approach, which incorporates the best of the first two, is the strictest. Moreover, Wang et al. observed that a combination of these approaches provides the best solution for domain adaptation for Chinese-English corpora [34]. Inspired by this, we utilized a combination of these models, with the three measurements combined for domain adaptation. Wang et al. found that this process retains approximately 20% of the corpus as domain-analogous data.
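As an illustration of the tf-idf criterion, a sentence can be scored against domain text by cosine similarity of tf-idf vectors. This is a minimal sketch with an illustrative idf form; Wang et al.'s exact weighting may differ:

```python
import math
from collections import Counter

def tfidf_vector(tokens, df, n_docs):
    """Sparse tf-idf vector for a token list; `df` maps a word to its
    document frequency in the reference collection (illustrative idf)."""
    tf = Counter(tokens)
    return {w: (c / len(tokens)) * math.log((1 + n_docs) / (1 + df.get(w, 0)))
            for w, c in tf.items()}

def cosine(u, v):
    """Cosine similarity between two sparse vectors (dicts)."""
    dot = sum(u[w] * v.get(w, 0.0) for w in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0
```

Sentences whose tf-idf vector lies close to the centroid of the in-domain corpus would be kept; the other two measures (word overlap and perplexity) can be combined with this score.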

2.4 Evaluation

To make progress in machine translation (MT), the quality of its results must be evaluated. It has been recognized for quite some time that using humans to evaluate MT approaches is very expensive and time-consuming [35]. As a result, human evaluation cannot keep up with the growing and continual need for MT evaluation, leading to the recognition that the development of automated MT evaluation techniques is critical. Evaluation is particularly crucial for translation between languages from different families (e.g., Germanic and Slavic), such as Polish and English [35, 36].

Vanni and Reeder [36] compiled an initial list of SMT evaluation metrics. Further research has led to the development of newer metrics. Prominent metrics include Bilingual Evaluation Understudy (BLEU), the National Institute of Standards and Technology (NIST) metric, Translation Error Rate (TER), and the Metric for Evaluation of Translation with Explicit Ordering (METEOR). These metrics were used in this research for evaluation.

In this research, we used the most popular metric, BLEU, which was developed on a premise similar to that used for speech recognition, described by Papineni et al. [23] as “The closer a machine translation is to a professional human translation, the better it is.” Thus, the BLEU metric is designed to measure how close SMT output is to a human reference translation. It is important to note that translations, be they SMT or human, may differ significantly in word usage, word order, and phrase length [23].
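The intuition behind BLEU can be illustrated with a single-sentence, single-reference sketch: unsmoothed modified n-gram precision combined with a brevity penalty. Real evaluations use corpus-level BLEU with multiple references and smoothing, e.g., via standard scoring tools:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def sentence_bleu(candidate, reference, max_n=4):
    """Modified n-gram precision with brevity penalty (single reference,
    uniform weights, no smoothing: any empty precision zeroes the score)."""
    cand, ref = candidate.split(), reference.split()
    log_prec = 0.0
    for n in range(1, max_n + 1):
        c_ngrams, r_ngrams = ngrams(cand, n), ngrams(ref, n)
        # Clip candidate counts by reference counts ("modified" precision).
        clipped = sum(min(c, r_ngrams[g]) for g, c in c_ngrams.items())
        total = max(sum(c_ngrams.values()), 1)
        if clipped == 0:
            return 0.0
        log_prec += math.log(clipped / total) / max_n
    # Brevity penalty: punish candidates shorter than the reference.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * math.exp(log_prec)
```

A perfect match scores 1.0, while a candidate sharing no n-grams with the reference scores 0.0.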

2.4.1 Statistical Significance Tests

In cases where the differences in metrics described above do not deviate greatly from each other, a statistical significance test can be performed. The Wilcoxon test [37] (also known as the signed-rank or matched-pairs test) is one of the most popular alternatives to the Student’s t-test for dependent samples. It belongs to the group of non-parametric tests and is used to compare two (and only two) dependent groups that involve two measurement variables.

The Wilcoxon test is used when the assumptions of the Student’s t-test for dependent samples do not hold, which is why it is considered an alternative to that test. It is also used when variables are measured on an ordinal scale (the t-test requires a quantitative scale). The only requirement is that the differences between the first and second measurement can be ranked; this is possible on an ordinal scale, so the test can be applied to variables measured on such a scale. For quantitative scales, the test is used when the variable distributions are far from normal.
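The test statistic can be sketched as follows. This is a minimal pure-Python illustration: zero differences are dropped and tied absolute differences receive average ranks; production work would use a statistics package that also supplies the p-value.

```python
def wilcoxon_signed_rank(x, y):
    """Wilcoxon matched-pairs statistic: rank |x_i - y_i| (zeros dropped,
    ties given average ranks) and return W = min(W+, W-). A small W
    suggests a systematic difference between the paired samples."""
    diffs = [a - b for a, b in zip(x, y) if a != b]
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg = (i + j) / 2 + 1          # average rank for the tied block
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    w_plus = sum(r for d, r in zip(diffs, ranks) if d > 0)
    w_minus = sum(r for d, r in zip(diffs, ranks) if d < 0)
    return min(w_plus, w_minus)
```

In our setting, x and y would be per-sentence (or per-test-set) metric scores from the two systems being compared.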

3 Results and Discussion

Numerous human languages are used around the world, and covering all possible language pairs would require millions of translation directions. Translation systems for these pairs struggle to achieve high-quality performance, largely due to the limited availability of language resources such as parallel data.

In this study, we have attempted to supplement these limited resources. Additional parallel corpora can be utilized to improve the quality and performance of linguistic resources, as well as individual NLP systems. In the MT application (Table 4), our data generation approach increased translation performance. Although the results appear very promising, there remains a great deal of room for improvement; performance gains could be attained by applying more sophisticated algorithms to quantify the comparison among different MT engines. In Table 6, we present the baseline (BASE) outcomes for the MT systems obtained for three diverse domains (news, IT, and biomedical, using official WMT16 test sets). Second, we generated a virtual corpus and adapted it to the domain (FINAL). The generated corpora demonstrate improvements in SMT quality and utility as NLP resources. From Table 3, it can be concluded that the generated virtual corpus is morphologically rich, which makes it acceptable as a linguistic resource. In addition, by retraining the SMT system with the virtual corpus and repeating all the steps, it is possible to obtain more virtual data of higher quality. Results that are statistically significant according to the Wilcoxon test are marked with * and very significant results with ** (Table 7).

Table 7. Evaluation of generated corpora

Next, in Table 8, we replicate the same quality experiment but using data generated without the back-translation step. As shown in Table 4, more data can be obtained in this manner. However, the SMT results are not as good as those obtained with back-translation. This suggests that the data generated without back-translation are noisy and most likely contain incomplete sentences, which the back-translation step removes.

Table 8. Evaluation of corpora generated without the back-translation step

Next, in Table 9, we replicate the same quality experiment but using the data generated in Table 5. As shown in Table 9, augmenting virtual corpora with semantic information has a positive impact not only on data volume but also on data quality. Semantic relations improve MT quality even further.

Table 9. Evaluation of semantically generated corpora without the back-translation step

Finally, in Table 10, we replicate the same quality experiment but using the data generated in Table 6 (LSA). As shown in Table 10, augmenting virtual corpora with semantic information by means of LSA has an even more positive impact on data quality; LSA-based semantic relations improve MT quality further still. It is worth noting that LSA provided less data, but we believe the data were more accurate and more domain-specific than those generated using WordNet.

Table 10. Evaluation of semantically generated corpora using LSA

4 Conclusions

Summing up, in this study we successfully built parallel corpora of satisfactory quality from monolingual resources. The method is very time- and cost-effective and can be applied to any language pair; it might prove especially useful for rare and under-resourced languages. However, there is still room for improvement, for example by using better alignment models, neural machine translation, or additional machine translation engines in our methodology. Moreover, using FrameNet, which provides semantic roles for words and shows restrictions on word usage, in that only certain kinds of words can follow a given word, might be of interest for future research [38].