
1 Introduction

Statistical machine translation (SMT) is a methodology based on statistical data analysis. The performance of SMT systems depends largely on the quantity and quality of the parallel data they are trained on: the more abundant and the cleaner the parallel data, the better the SMT results. Even so, good-quality parallel corpora, free of noise and errors, remain scarce and are not easily available [1]. Moreover, to increase SMT performance, the genre and language coverage of the data should be restricted to a specific text domain, e.g., legal or medical texts. In particular, little research has been conducted on languages with few native speakers and thus a limited audience, even though most existing human languages are spoken by only a small population of native speakers, as shown in Table 1.

Table 1. Top languages by population: asterisks mark the 2010 estimates for the top dozen languages

Despite the enormous number of people with technological knowledge and access, many are excluded because they cannot communicate globally due to language divides. According to Anderson et al. [2], over 6,000 languages are used globally; there is no universal spoken language for communication. English is only the third most widely spoken language (used by only 5.52% of the global population); Spanish (5.85%) and Mandarin (14.1%) are more common [3]. Moreover, fewer than 40% of citizens of the European Union (not including developing or Eastern European countries) know English [4], which makes communication a problem even within the EU [5].

This has created a technical gap between widely spoken languages and languages with few speakers. It has also led to a large gap in the quality and quantity of parallel corpora available for less common language pairs, which slows the progress of natural language processing research for those languages.

As a result, high-quality data exist for only a few language pairs in particular domains (e.g., Czech-English legal texts), whereas the majority of languages lack sufficient linguistic resources, such as parallel data, for good-quality research or natural language processing tasks. Building a translation system that could handle all possible language pairs would require millions of translation directions and a huge volume of parallel data. Moreover, once multiple domains enter the equation, the corpus requirements for machine translation training grow dramatically. Thus, the current study explored methods to build a corpus of high-quality parallel data, using Czech-English as the language pair.

Multiple studies have been performed to automatically acquire additional data for enhancing SMT systems in the long term [6]. All such approaches have focused on discovering authentic text from real-world sources for both the source and target languages. However, our study presents an alternative approach for building this parallel data. In creating virtual parallel data, as we might call it, at least one side of the parallel data is generated, for which purpose we use monolingual text (news internet crawl in Czech, in this case). For the other side of the parallel data, we use an automated procedure to obtain a translation of the text. In other words, our approach generates rather than gathers parallel data. To monitor the performance and quality of the automatically generated parallel data and to maximize its utility for SMT, we focus on compatibility between the diverse layers of an SMT system.

It is recommended that an estimate be considered reliable when multiple systems show a consensus on it. However, since the output of machine translation (MT) is human language, it is much too complicated to seek unanimity from multiple systems to generate the same output each time we execute the translation process. In such situations, we can choose partial compatibility as an objective rather than complete agreement between multiple systems. To evaluate the generated data, we can use the Levenshtein distance as well as implementing a back-translation procedure. Using this approach, only those pairs that pass an initial compatibility check, when translated back into the native language and compared to the original sentences, will be accepted. This concept is depicted in Fig. 1.
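The acceptance criterion described above can be sketched as follows. This is a minimal illustration, not the actual pipeline: the engine callables and thresholds are placeholders, and `difflib`'s similarity ratio stands in for the Levenshtein-based compatibility score.

```python
import difflib

def agreement(a: str, b: str) -> float:
    """Similarity in [0, 1]; difflib's ratio is a lightweight stand-in
    for the Levenshtein-based compatibility score described in the text."""
    return difflib.SequenceMatcher(None, a, b).ratio()

def accept_pair(src, engines, back_translate, fwd_thresh=0.8, back_thresh=0.8):
    """Keep a generated pair only if (1) the translation engines roughly
    agree on the output and (2) the back-translation stays close to the
    original source sentence."""
    candidates = [engine(src) for engine in engines]
    best = candidates[0]
    if any(agreement(best, c) < fwd_thresh for c in candidates[1:]):
        return None                      # engines disagree: discard
    if agreement(back_translate(best), src) < back_thresh:
        return None                      # back-translation drifted: discard
    return (src, best)
```

In practice the thresholds would be tuned on held-out data, and the engines would be the three SMT systems described in Sect. 2.1.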

Fig. 1. Generation of artificial data

We can use this method to easily generate additional parallel data from the monolingual news data provided for WMT16. Retraining the system on the newly assessed data enhances translation performance, and rare linguistic resource pairs can be enriched. The methodology is not limited to particular languages and is especially valuable for rare but important language pairs. Most significantly, the virtual parallel corpus generated by the system is applicable to MT as well as other natural language processing (NLP) tasks.

2 State of the Art

In this study, we present an approach based on generating comprehensive multilingual resources through SMT systems. Two established approaches are relevant to such MT applications: self-training and translation via bridge languages (also called “pivot languages”). These approaches differ from ours. Self-training focuses on exploiting the available bilingual data, and the linguistic resources of a third language are rarely applied. Translation via bridge languages focuses more on correcting the alignment of existing word segments; it incorporates the phrase-model concept, examining translations at the word, phrase, or even sentence level through bridge languages, rather than exploring new text in context. The methodology of this paper lies between the self-training paradigm and translation via a bridge language: we generate data instead of gathering it, while also applying linguistic information and inter-language relationships to produce translations between the source and target languages.

Callison-Burch and Osborne [7] presented a cooperative training method for SMT that comprises the consensus of several translation systems to identify the best translation resource for training. Similarly, Ueffing et al. [8] explored model adaptation methods to use monolingual data from a source language. Furthermore, as the learning progressed, the application of that learned material was constrained by a multi-linguistic approach without introducing new information from a third language.

In another approach, Mann and Yarowsky [9] presented a technique to develop a translation lexicon based on transduction models of cognate pairs through a bridge language. In this case, the edit distance rate was applied to the process rather than the general MT system of limiting the vocabulary range for majority European languages. Kumar et al. [10] described the process of boosting word alignment quality using multiple bridge languages. In Wu and Wang [11], Habash and Hu [12], phrase translation tables were improved using phrase tables acquired in multiple ways from pivot languages. In Eisele et al. [13], a hybrid method was combined with RBMT (Rule-Based Machine Translation) and SMT systems. This methodology was introduced to fill gaps in the data for pivot translation. Cohn and Lapata [14] presented another methodology to generate more reliable results of translations by generating information from small sets of data using multi-parallel data.

Contrary to the existing approaches, in this study, we returned to the black-box translation system. This means that virtual data could be widely generated for translation systems, including rule-based, statistics-based, and human-based translations. The approach introduced in Leusch et al. [15] pooled the results of translations of a test set created by any of the pivot MTs per unique language. However, this approach was not found to enhance the systems, and hence the novel training data were not used. Amongst others, Bertoldi et al. [16] also conducted research on pivot languages, but did not consider applying universal corpus filtering, which is the measurement of compatibility to control data quality.

2.1 Generating Virtual Parallel Data

To generate new data, we trained three SMT systems based on TED, QED and News Commentary corpora. The Experiment Management System [17] from the open source Moses SMT toolkit was utilized to carry out the experimentation. A 6-gram language model was trained using the SRI Language Modeling toolkit (SRILM) [18]. Word and phrase alignment was performed using the SyMGIZA++ symmetric word alignment tool [19] instead of GIZA++. Out-of-vocabulary (OOV) words were monitored using the Unsupervised Transliteration Model [20]. Working with the Czech (CS) and English (EN) language pair, the first SMT system was trained on TED [21], the second on the Qatar Computing Research Institute’s Educational Domain Corpus (QED) [22], and the third using the News Commentary corpora provided for the WMT16 translation task. Official WMT16 test sets were used for system evaluation. Translation engine performance was measured by the BLEU metric [23]. The performance of the engines is shown in Table 2.

Table 2. Corpora used for generation of SMT systems

All engines operated in accordance with Fig. 1, and the Levenshtein distance was used to measure the compatibility between translation results. The Levenshtein distance measures the dissimilarity between two strings; it is an edit distance and is closely linked to pairwise string alignment [24].

Mathematically, the Levenshtein distance between two strings a and b (of length |a| and |b|, respectively) is given by \( \mathrm{lev}_{a,b}(|a|, |b|) \), where:

$$ \mathrm{lev}_{a,b}(i,j) = \begin{cases} \max(i,j) & \text{if } \min(i,j) = 0, \\ \min \begin{cases} \mathrm{lev}_{a,b}(i-1,j) + 1 \\ \mathrm{lev}_{a,b}(i,j-1) + 1 \\ \mathrm{lev}_{a,b}(i-1,j-1) + 1_{[a_i \ne b_j]} \end{cases} & \text{otherwise.} \end{cases} $$

In this equation, \( 1_{[a_i \ne b_j]} \) is the indicator function, equal to 0 when \( a_i = b_j \) and to 1 otherwise, and \( \mathrm{lev}_{a,b}(i,j) \) is the distance between the first i characters of a and the first j characters of b.
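The recurrence can be implemented directly with bottom-up dynamic programming; the following is a straightforward sketch, not the implementation used in the experiments:

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance lev_{a,b}(|a|, |b|), computed bottom-up over the
    recurrence above (insertion, deletion, substitution all cost 1)."""
    m, n = len(a), len(b)
    # Base cases lev(i, 0) = i and lev(0, j) = j, i.e. the max(i, j) branch.
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(
                d[i - 1][j] + 1,                           # deletion
                d[i][j - 1] + 1,                           # insertion
                d[i - 1][j - 1] + (a[i - 1] != b[j - 1]),  # substitution
            )
    return d[m][n]
```

For example, `levenshtein("kitten", "sitting")` yields 3 (two substitutions and one insertion).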

Using the combined methodology and monolingual data, parallel corpora were built. Statistical information on the data is provided in Table 3.

Table 3. Specification of generated corpora

The purpose of this research was to create synthetic parallel data to train a machine translation system by translating monolingual texts with multiple machine translation systems and various filtering steps. This objective is not new; synthetic data have been created in the past. However, the novel aspect of the present paper is its use of three MT systems, application of the Levenshtein distance between their outputs as a filter, and—much more importantly—its use of back-translation as an additional filtering step. In Table 4, we show statistical information on the corpora used without the back-translation step.

Table 4. Specification of generated corpora without back-translation

2.2 Semantically-Enhanced Generated Corpora

The artificially generated corpora presented in Table 3 were obtained using statistical translation models, which rely purely on how frequently word sequences occur and not on what they mean; such models have no real understanding of what was translated. In this research, these data were therefore extended with semantic information so as to improve the quality and domain coverage of the data. Word relationships were integrated into the generated data using the WordNet database.

The way in which WordNet was used to obtain a probability estimator was shown in Cao et al. [25]. In particular, we wanted to obtain P(wi|w), where wi and w are assumed to have a relationship in WordNet. The formula is as follows:

$$ P(w_i \mid w) = \frac{c(w_i, w \mid W, L)}{\sum_{w_j} c(w_j, w \mid W, L)} $$

where W is the window size and c(w_i, w|W, L) is the count of w_i and w appearing together within a W-word window. This can be obtained simply by counting co-occurrences in a given corpus. To smooth the model, we applied interpolated Kneser-Ney smoothing [26].
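A minimal sketch of this count-based estimator is shown below. Only the window-based co-occurrence counting is implustrated; the Kneser-Ney smoothing and the WordNet relation filter are omitted, and all names are illustrative:

```python
from collections import Counter

def cooccurrence_probs(tokens, target, window=5):
    """Estimate P(w_i | w) as c(w_i, w | W) / sum_j c(w_j, w | W), where the
    counts are co-occurrences of each word with the target word inside a
    window of W tokens on either side (smoothing omitted for brevity)."""
    counts = Counter()
    for pos, tok in enumerate(tokens):
        if tok != target:
            continue
        lo, hi = max(0, pos - window), min(len(tokens), pos + window + 1)
        for other in tokens[lo:pos] + tokens[pos + 1:hi]:
            counts[other] += 1
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()} if total else {}
```

The resulting distribution sums to 1 over all words observed near the target, which is exactly the normalization in the formula above.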

The following relationships were considered: synonym, hypernym, hyponym, and hierarchical distance between words.

In Table 5, we show statistical information on the semantically enhanced corpora produced previously and shown in Table 3.

Table 5. Specification of semantically generated corpora without back-translation

Another common approach to semantic analysis that is also used within this research is latent semantic analysis (LSA). LSA has already been shown to be very helpful in automatic speech recognition (ASR) [27] and many other applications, which was the reason for incorporating it within the scope of this research. The high-level idea of LSA is to convert words into concept representations and to assume that if the occurrence of word patterns in documents is similar, then the words are also similar. The mathematical model can be defined as follows:

To build the LSA model, a co-occurrence matrix W is first built, where w_ij is a weighted count of word w_i in document d_j:

$$ w_{ij} = G_i \, L_{ij} \, C_{ij} $$

where C_ij is the count of w_i in document d_j, L_ij is a local weight, and G_i is a global weight. Typically, L_ij and G_i are based on TF-IDF.

Then, singular value decomposition (SVD) is applied to W:

$$ W = U S V^{T} $$

where W is an M × N matrix (M is the vocabulary size, N the number of documents); U is M × R, S is R × R, and V^T is R × N. R is usually a predefined number of dimensions between 100 and 500.

After that, each word w_i can be represented by a new vector U_i = u_i S. Based on this new vector, the similarity between two words is defined as:

$$ K(U_i, U_j) = \frac{u_i \, S^2 \, u_j^{T}}{|u_i S| \cdot |u_j S|} $$

Therefore, clustering can be performed to organize words into K clusters, C1, C2, …., CK.
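The steps above (truncated SVD, the projection U_i = u_i S, and the cosine-style similarity) can be sketched with NumPy. This is an illustrative sketch with our own variable names; clustering can then be delegated to any standard K-means routine:

```python
import numpy as np

def lsa_word_vectors(W, r):
    """SVD of the term-document matrix W (M x N): W = U S V^T, truncated
    to rank r. Word i is then represented by the vector U_i = u_i * S."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    U, s = U[:, :r], s[:r]
    return U * s            # row i is the r-dimensional word vector

def word_similarity(vecs, i, j):
    """K(U_i, U_j): cosine similarity of the projected word vectors,
    equivalent to u_i S^2 u_j^T / (|u_i S| |u_j S|)."""
    a, b = vecs[i], vecs[j]
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```

Words that appear in the same documents end up with near-parallel vectors (similarity close to 1), while words with disjoint document distributions score near 0.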

If \( {\text{H}}_{{{\text{q}} - 1}} \) is the history for word Wq, then it is possible to obtain the probability of Wq given \( {\text{H}}_{{{\text{q}} - 1}} \) using the following formula:

$$ \begin{aligned} P(W_q \mid H_{q-1}) &= P(W_q \mid W_{q-1}, W_{q-2}, \ldots, W_{q-n+1}, d_{q-1}) \\ &= P(W_q \mid W_{q-1}, W_{q-2}, \ldots, W_{q-n+1}) \cdot P(W_q \mid d_{q-1}) \end{aligned} $$

where \( P(W_q \mid W_{q-1}, W_{q-2}, \ldots, W_{q-n+1}) \) is the N-gram model and \( P(W_q \mid d_{q-1}) \) is the LSA model.

Additionally,

$$ P(W_q \mid d_{q-1}) = \frac{K(U_q, V_{q-1})}{Z(U, V)}, \qquad K(U_q, V_{q-1}) = \frac{U_q \, S \, V_{q-1}^{T}}{|U_q S^{1/2}| \cdot |V_{q-1} S^{1/2}|}, $$

where Z(U, V) is a normalization factor.

It is also possible to apply word smoothing to the model based on K-clustering, as follows:

$$ P(W_q \mid d_{q-1}) = \sum_{k=1}^{K} P(W_q \mid C_k) \, P(C_k \mid d_{q-1}) $$

where \( P(W_q \mid C_k) \) and \( P(C_k \mid d_{q-1}) \) can be computed using the similarity measure given above together with a normalization factor.

In this way, the N-gram and LSA models are combined into a single language model that can be used for word comparison and text generation. Python code for such an LSA analysis is provided in Thomo [28].

In Table 6, we show statistical information on the semantically enhanced corpora produced previously and shown in Table 3.

Table 6. Specification of semantically generated corpora using LSA

2.3 Experimental Setup

The machine translation experiments we conducted involved three WMT16 tasks: news translation, information technology (IT) document translation, and biomedical text translation. Our experiments were conducted on the CS-EN pair in both directions. To obtain more accurate word alignment, we used the SyMGiza++ tool, which computes symmetric word alignment models. This tool develops alignment models that support many-to-one and one-to-many alignments in both directions between the given language pair. SyMGiza++ also exploits a pool of processors with up-to-date threading management, which makes it very fast. The alignment process uses four unique models during system training to achieve refined and enhanced alignment outcomes; this approach proved fruitful in previous research [19]. OOV words are another challenge for an SMT system; to deal with them, we used the Moses toolkit together with the Unsupervised Transliteration Model (UTM). The UTM is a language-independent approach that learns OOV words in an unsupervised manner. We also utilized the post-decoding transliteration method from this toolkit. UTM makes use of a transliteration phrase translation table to access probable solutions; it was used to score several possible transliterations and to find a translation table [20, 29].

The KenLM tool was applied to language model training. This library addresses typical language-model problems, reducing both execution time and memory usage. The lexical values of the sentences were used to reorder the phrase probabilities, and we also used KenLM for lexical reordering. Three directional types are defined with respect to each target phrase: swap (S), monotone (M), and discontinuous (D); all three were used in a hierarchical model. A bidirectional reordering model was used to examine the phrase arrangement probabilities [30,31,32].

The quality of domain adaptation largely depends on the training data used to build the language and translation models. The acquisition of domain-centric data helps greatly in this regard [33]. A parallel generalized-domain corpus and a monolingual corpus were used in this process, following Wang et al. [34]. First, sentence pairs of the parallel data were weighted according to their relevance to the target domain. Second, reranking was conducted to obtain the best sentence pairs. The selected sentence pairs were then used to train models for the target domain [34].

For similarity measurement, we used three approaches: word overlap analysis, the cosine term frequency-inverse document frequency (tf-idf) criterion, and perplexity measurement. The third approach, which incorporates the best of the first two, is the strictest. Moreover, Wang et al. observed that a combination of these approaches provides the best solution for domain adaptation for Chinese-English corpora [34]. Inspired by this, we utilized a combination of these models, with the three measurements combined for domain adaptation. Wang et al. found that this process retains approximately 20% of the corpus as domain-analogous data.
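As an illustration of the tf-idf criterion, a sentence can be scored against domain text by cosine similarity of tf-idf vectors. This is a minimal sketch with an illustrative idf form; Wang et al.'s exact weighting may differ:

```python
import math
from collections import Counter

def tfidf_vector(tokens, df, n_docs):
    """Sparse tf-idf vector for a token list; `df` maps a word to its
    document frequency in the reference collection (illustrative idf)."""
    tf = Counter(tokens)
    return {w: (c / len(tokens)) * math.log((1 + n_docs) / (1 + df.get(w, 0)))
            for w, c in tf.items()}

def cosine(u, v):
    """Cosine similarity between two sparse vectors (dicts)."""
    dot = sum(u[w] * v.get(w, 0.0) for w in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0
```

Sentences whose tf-idf vector lies close to the centroid of the in-domain corpus would be kept; the other two measures (word overlap and perplexity) can be combined with this score.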

2.4 Evaluation

To make progress in machine translation (MT), the quality of its results must be evaluated. It has been recognized for quite some time that using humans to evaluate MT approaches is very expensive and time-consuming [35]. As a result, human evaluation cannot keep up with the growing and continual need for MT evaluation, leading to the recognition that the development of automated MT evaluation techniques is critical. Evaluation is particularly crucial for translation between languages from different families (e.g., Germanic and Slavic), such as Polish and English [35, 36].

Vanni and Reeder [36] compiled an initial list of SMT evaluation metrics. Further research has led to the development of newer metrics. Prominent metrics include Bilingual Evaluation Understudy (BLEU), the National Institute of Standards and Technology (NIST) metric, Translation Error Rate (TER), and the Metric for Evaluation of Translation with Explicit Ordering (METEOR). These metrics were used in this research for evaluation.

In this research, we used the most popular metric, BLEU, which was developed on a premise similar to that used for speech recognition, described by Papineni et al. [23] as “The closer a machine translation is to a professional human translation, the better it is.” Thus, the BLEU metric is designed to measure how close SMT output is to a human reference translation. It is important to note that translations, be they SMT or human, may differ significantly in word usage, word order, and phrase length [23].
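The intuition behind BLEU can be illustrated with a single-sentence, single-reference sketch: unsmoothed modified n-gram precision combined with a brevity penalty. Real evaluations use corpus-level BLEU with multiple references and smoothing, e.g., via standard scoring tools:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def sentence_bleu(candidate, reference, max_n=4):
    """Modified n-gram precision with brevity penalty (single reference,
    uniform weights, no smoothing: any empty precision zeroes the score)."""
    cand, ref = candidate.split(), reference.split()
    log_prec = 0.0
    for n in range(1, max_n + 1):
        c_ngrams, r_ngrams = ngrams(cand, n), ngrams(ref, n)
        # Clip candidate counts by reference counts ("modified" precision).
        clipped = sum(min(c, r_ngrams[g]) for g, c in c_ngrams.items())
        total = max(sum(c_ngrams.values()), 1)
        if clipped == 0:
            return 0.0
        log_prec += math.log(clipped / total) / max_n
    # Brevity penalty: punish candidates shorter than the reference.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * math.exp(log_prec)
```

A perfect match scores 1.0, while a candidate sharing no n-grams with the reference scores 0.0.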

2.4.1 Statistical Significance Tests

In cases where the differences in metrics described above do not deviate greatly from each other, a statistical significance test can be performed. The Wilcoxon test [37] (also known as the signed-rank or matched-pairs test) is one of the most popular alternatives to the Student’s t-test for dependent samples. It belongs to the group of non-parametric tests and is used to compare two (and only two) dependent groups that involve two measurement variables.

The Wilcoxon test is used when the assumptions of the Student’s t-test for dependent samples do not hold, which is why it is considered an alternative to that test. It is also used when variables are measured on an ordinal scale (the t-test requires a quantitative scale). The only requirement is that the differences between the first and second measurement can be ranked; this is possible on an ordinal scale, so the test can be applied to variables measured on such a scale. For quantitative scales, the test is used when the variable distributions are far from normal.
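The test statistic can be sketched as follows. This is a minimal pure-Python illustration: zero differences are dropped and tied absolute differences receive average ranks; production work would use a statistics package that also supplies the p-value.

```python
def wilcoxon_signed_rank(x, y):
    """Wilcoxon matched-pairs statistic: rank |x_i - y_i| (zeros dropped,
    ties given average ranks) and return W = min(W+, W-). A small W
    suggests a systematic difference between the paired samples."""
    diffs = [a - b for a, b in zip(x, y) if a != b]
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg = (i + j) / 2 + 1          # average rank for the tied block
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    w_plus = sum(r for d, r in zip(diffs, ranks) if d > 0)
    w_minus = sum(r for d, r in zip(diffs, ranks) if d < 0)
    return min(w_plus, w_minus)
```

In our setting, x and y would be per-sentence (or per-test-set) metric scores from the two systems being compared.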

3 Results and Discussion

Numerous human languages are used around the world, and covering all possible language pairs would require millions of translation directions. Translation systems for these pairs struggle to achieve high-quality performance, largely due to the limited availability of language resources such as parallel data.

In this study, we have attempted to supplement these limited resources. Additional parallel corpora can be utilized to improve the quality and performance of linguistic resources, as well as individual NLP systems. In the MT application (Table 4), our data generation approach increased translation performance. Although the results appear very promising, there remains a great deal of room for improvement; performance gains could be attained by applying more sophisticated algorithms to quantify the comparison among different MT engines. In Table 6, we present the baseline (BASE) outcomes for the MT systems obtained for three diverse domains (news, IT, and biomedical, using official WMT16 test sets). Second, we generated a virtual corpus and adapted it to the domain (FINAL). The generated corpora demonstrate improvements in SMT quality and utility as NLP resources. From Table 3, it can be concluded that the generated virtual corpus is morphologically rich, which makes it acceptable as a linguistic resource. In addition, by retraining the SMT system with the virtual corpus and repeating all the steps, it is possible to obtain more virtual data of higher quality. Results that are statistically significant according to the Wilcoxon test are marked with * and very significant results with ** (Table 7).

Table 7. Evaluation of generated corpora

Next, in Table 8, we replicate the same quality experiment but using data generated without the back-translation step. As shown in Table 4, more data can be obtained in this manner. However, the SMT results are not as good as those obtained with back-translation. This suggests that the data generated without back-translation are noisy and most likely contain incomplete sentences, which the back-translation step removes.

Table 8. Evaluation of corpora generated without the back-translation step

Next, in Table 9, we replicate the same quality experiment but using the data generated in Table 5. As shown in Table 9, augmenting virtual corpora with semantic information has a positive impact not only on data volume but also on data quality. Semantic relations improve MT quality even further.

Table 9. Evaluation of semantically generated corpora without the back-translation step

Finally, in Table 10, we replicate the same quality experiment but using the data generated in Table 6 (LSA). As shown in Table 10, augmenting virtual corpora with semantic information by means of LSA has an even more positive impact on data quality; LSA-based semantic relations improve MT quality further still. It is worth noting that LSA provided less data, but we believe the data were more accurate and more domain-specific than those generated using WordNet.

Table 10. Evaluation of semantically generated corpora using LSA

4 Conclusions

Summing up, in this study we successfully built parallel corpora of satisfactory quality from monolingual resources. The method is very time- and cost-effective and can be applied to any language pair; it might prove especially useful for rare and under-resourced languages. However, there is still room for improvement, for example by using better alignment models, neural machine translation, or additional machine translation engines in our methodology. Moreover, using FrameNet, which provides semantic roles for words and shows restrictions on word usage, in that only certain kinds of words can follow a given word, might be of interest for future research [38].