
1 Introduction

The growing demand for machine-translation applications has led to the creation of complex systems performing similar or even identical functions in the real world. These systems, owing to the disparate nature of their creation, have limited functionality because of mismatches in application purposes. However, integration of these systems is desirable for finding information in business information systems. In recent years, considerable research effort has been directed at evaluating the relationship between word alignment and machine translation performance, aiming to obtain a certain degree of coordination between various kinds of language pairs by automatically detecting correspondences between the elements of these alignments. However, there is no theoretical support, in the sense of a formulation describing the relationship between word alignment and machine translation performance.

We examine the Kazakh language, the majority language of the Republic of Kazakhstan. Kazakh belongs to the Kipchak branch of the Turkic language family, part of the larger Ural-Altay group, and in comparison with languages like English it is very rich in morphology.

Kazakh is an agglutinative language: its words are generated by adding affixes to a root form. We can derive a new word by adding an affix to the root, then make another word by adding a further affix to this new word, and so on. This iterative process may continue for several levels. Thus, a single word in an agglutinative language may correspond to a phrase made up of several words in a non-agglutinative language [1] (Table 1).

Table 1 An example of Kazakh agglutination

Although word alignment has been studied considerably, through many challenges taken up by El-Kahlout [2] and Bisazza and Federico [3], these contributions mainly concern the use of morphology, as well as the probability distribution within phrase pairs and the resulting alignments. Research on word alignment has also addressed the precision of the matching process [4].

Several applications of word alignment can be found, such as adaptation of context-semantic disclosure, which serves our objective of building high-quality machine translation systems [5]. Word alignment processing is more convenient for obtaining m-to-n alignments, where several source words are aligned to several target words, than merely segmenting the strings before the matching process. The common approaches to word alignment training are the IBM Models [6] and the hidden Markov model (HMM) [7], which in practice use the expectation-maximization (EM) algorithm [8]. The EM algorithm finds the parameters that increase the likelihood of the observed data given the latent alignments. In practice, EM can overfit the parameters, so that some rare words align to many words in the opposite sentence of a pair.
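To make the EM training loop concrete, the following is a minimal sketch of IBM Model 1 estimation in Python; the toy corpus, the uniform initialization, and the iteration count are our own illustrative assumptions (the actual experiments use GIZA++).

```python
from collections import defaultdict

# Toy parallel corpus of (source, target) token lists; "NULL" is the
# empty source word of Model 1. The sentences are illustrative only.
corpus = [
    (["NULL", "kitap"], ["the", "book"]),
    (["NULL", "kitap", "oqidy"], ["he", "reads", "the", "book"]),
]

t = defaultdict(lambda: 0.25)            # t(e|f), uniform initialization

for _ in range(10):                      # EM iterations
    count = defaultdict(float)           # expected counts c(e, f)
    total = defaultdict(float)           # normalizers c(f)
    for f_sent, e_sent in corpus:
        for e in e_sent:                 # E-step: fractional link counts
            z = sum(t[(e, f)] for f in f_sent)
            for f in f_sent:
                p = t[(e, f)] / z        # posterior that f generated e
                count[(e, f)] += p
                total[f] += p
    for (e, f), c in count.items():      # M-step: renormalize per source word
        t[(e, f)] = c / total[f]

print(round(t[("book", "kitap")], 3))    # this probability grows across iterations
```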

Since generating segments and modeling the relative features of phrases rely on a similarity measure within the parallel corpora, such a system is too general to be applied to another kind of language pair with different morphotactics. However, our approach can be applied in several potential areas, including improvements in machine translation, machine learning methods, and information retrieval. Still, many concepts and definitions remain rather vague and need to be dealt with within the word alignment process.

Using the Morfessor tool [9], we can find grammatical features of a word and retrieve the syntactic structure of an input sentence. This clearly demonstrates a benefit over rule-based morphological analyzers [10], which require deep language expertise and an exhaustive system development process. Unsupervised approaches use practically unlimited supplies of text and have been widely studied for a number of languages [11]. For a comprehensive survey of rule-based morphological analysis, we refer the reader to the research by Altenbek [12] and Kairakbay [13].
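A minimal sketch of unsupervised segmentation, assuming the Morfessor 2.0 Python API; the corpus file name and the example word are hypothetical:

```python
import morfessor

io = morfessor.MorfessorIO()
# Hypothetical training corpus: one Kazakh sentence per line.
train_data = list(io.read_corpus_file("kk_corpus.txt"))

model = morfessor.BaselineModel()
model.load_data(train_data)
model.train_batch()                      # minimize the model's coding cost

# Segment an unseen word into morphs with the trained model.
morphs, cost = model.viterbi_segment("kitaptarymyzda")
print(morphs)                            # e.g. ['kitap', 'tar', 'ymyz', 'da']
```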

The article is structured as follows. Section 2 discusses the proposed model and describes the different segmentation techniques we study. Section 3 presents our evaluation results.

2 Description of Our Method

Hybrid methods comprise two major groups of approaches: those that use morpheme analysis, and those that rely on a probability distribution combined with machine-learning techniques to compare the similarity of stems and to resolve their synonymy and ambiguity problems. We understand the alignment process as establishing relations between the elements of a parallel language pair, which results in an alignment between equivalent phrases. Different alignment techniques that enhance the quality of machine translation for the Kazakh-English language pair have been introduced in past years to resolve different types of morphological segmentation of Kazakh words, relying on methods from machine learning and linguistics. For these purposes we used Morfessor, an unsupervised analyzer, and the Helsinki Finite-State Toolkit (HFST) [14] for rule-based analysis; finally, we used the GIZA++ [15] tool to produce IBM Model 4 word alignments. Our morpheme analysis approach is concerned with word segmentation and, as a result, compares groups of morphemes to one another and detects the relations that exist between them.

Our studies investigate the impact of a pruning technique on overall translation quality by reducing the number of sparse phrases, which leads to higher BLEU scores [16]. We do not use a manually annotated gold-standard word alignment set, since the similarity it would impose on new sets of alignments reflects only a personal opinion about translation similarity between the instances.

2.1 Word Alignment

We suppose a phrase pair is denoted by \((F, E)\) with an alignment \(A\), if any words \(f_{j}\) in \(F\) have a correspondence in \(A\) with the words \(e_{i}\) in \(E\). The formal definition can be stated as follows: \(\forall e_{i}\in E : \left( e_{i},f_{j} \right) \in A \Rightarrow f_{j} \in F\) and \(\forall f_{j}\in F : \left( e_{i},f_{j} \right) \in A \Rightarrow e_{i} \in E\); moreover, \(\exists e_{i}\in E, f_{j}\in F:\left( e_{i},f_{j} \right) \in A\).
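The consistency condition can be stated directly in code; below is a small illustration of our own that checks whether a candidate phrase pair is consistent with a word alignment:

```python
def is_consistent(phrase_f, phrase_e, alignment):
    """Check phrase-pair consistency with an alignment.

    phrase_f, phrase_e: sets of source/target word positions.
    alignment: set of (i, j) pairs, i a target and j a source position.
    """
    covered = False
    for (i, j) in alignment:
        inside_e, inside_f = i in phrase_e, j in phrase_f
        if inside_e != inside_f:
            return False            # a link crosses the phrase boundary
        if inside_e:
            covered = True          # at least one link inside the pair
    return covered
```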

Generally, phrase-based models are generative models that translate sequences of words in f into sequences of words in e, in contrast to word-based models, which translate single words in isolation. The translation probability is obtained by marginalizing over the alignments:

$$\begin{aligned} P\left( e_{1}^{I}\mid f_{1}^{J} \right) = \sum _{a}P\left( e_{1}^{I}, a\mid f_{1}^{J} \right) \end{aligned}$$
(1)

Improving translation performance directly would require training the system and decoding for each segmentation hypothesis, which is computationally impracticable. We therefore made various conditional independence assumptions using a generative model and decomposed the posterior probability. In this notation, \(e_{1}^{I}\) and \(f_{1}^{J}\) denote the two sides of a parallel sentence pair, and \(a_{1}^{I}\) is the alignment hypothesized for it. If \(a\mid e\) is uniformly distributed over the \(J+1\) source positions, then

$$\begin{aligned} P\left( e_{1}^{I}, a_{1}^{I}\mid f_{1}^{J} \right) = \frac{\epsilon }{\left( J+1 \right) ^{I}}\prod _{i=1}^{I}p\left( e_{i}\mid f_{a_{i}} \right) \end{aligned}$$
(2)

We extend the alignment modeling process of Brown et al. in the following way. We assume the alignment of the target sentence e to the source sentence f is a. Let c be the tag sequence of f obtained from the segmented morphemes. Each tag carries information about the word and represents its lexeme after the segmentation process. This assumption is used to link the multiple tag sequences as hidden processes, such that a tagger generates a context sequence \(c_{j}\) for a word sequence \(f_{j}\) (3).

$$\begin{aligned} P\left( e_{1}^{I},a_{1}^{I}\mid f_{1}^{J} \right) = P\left( e_{1}^{I},a_{1}^{I}\mid c_{1}^{J},f_{1}^{J} \right) \end{aligned}$$
(3)

Then we can express Model 1 as (4):

$$\begin{aligned} P\left( e_{1}^{I},a_{1}^{I}\mid f_{1}^{J},c_{1}^{J} \right) = \frac{1}{\left( J+1 \right) ^{I}}\prod _{i=1}^{I}p\left( e_{i}\mid f_{a_{i}},c_{a_{i}} \right) \end{aligned}$$
(4)
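Relative to the plain Model 1 sketch in Sect. 1, the only change implied by (4) is that the lexical table is keyed on the pair (f, c) rather than on f alone. A sketch of the modified E-step, again our own illustration:

```python
from collections import defaultdict

t = defaultdict(lambda: 1e-3)   # t(e | f, c), small uniform initialization

def estep_counts(e_sent, f_sent, c_tags, count, total):
    """Collect expected counts for one sentence pair; c_tags[k] is the
    morphological tag of source word f_sent[k]."""
    for e in e_sent:
        z = sum(t[(e, f, c)] for f, c in zip(f_sent, c_tags))
        for f, c in zip(f_sent, c_tags):
            p = t[(e, f, c)] / z            # posterior link probability
            count[(e, f, c)] += p
            total[(f, c)] += p
```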

We applied the EM algorithm to estimate the phrase pairs that are consistent with the word alignments, and then assigned probabilities to the obtained phrase pairs. The probability \(p_{k}\) of a word w belonging to context k is:

$$\begin{aligned} p_{k}\left( w \right) = \frac{p_{k}f_{k}\left( w\mid \phi _{k} \right) }{\sum _{i}p_{i}f_{i}\left( w\mid \phi _{i} \right) } \end{aligned}$$
(5)

where \(\phi _{k}\) is the covariance matrix and the \(f_{k}\) are the component density functions, each evaluated at the given sequence. Consecutive word subsequences in the sentence pair are no longer than w words. Afterwards, we use association measures to filter out infrequently occurring phrase pairs by log-likelihood ratio estimation [17].
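The log-likelihood ratio in [17] is typically computed from a 2x2 contingency table of phrase co-occurrence counts; the following is a sketch under that standard formulation (the threshold in the final comment is the chi-square critical value at p < 0.001):

```python
import math

def llr(c_fe, c_f, c_e, n):
    """Dunning's log-likelihood ratio for a phrase pair (f, e).

    c_fe: co-occurrence count; c_f, c_e: marginal counts;
    n: total number of sentence pairs.
    """
    k = [[c_fe,        c_f - c_fe],
         [c_e - c_fe,  n - c_f - c_e + c_fe]]
    row = [k[0][0] + k[0][1], k[1][0] + k[1][1]]
    col = [k[0][0] + k[1][0], k[0][1] + k[1][1]]
    g2 = 0.0
    for i in range(2):
        for j in range(2):
            if k[i][j] > 0:       # zero cells contribute nothing in the limit
                g2 += 2.0 * k[i][j] * math.log(k[i][j] * n / (row[i] * col[j]))
    return g2

# e.g. keep a phrase pair only if llr(...) > 10.83 (chi-square, p < 0.001)
```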

Our algorithm, as a middle-tier component, processes the input alignment files in a single pass. The current implementation reuses the code from https://github.com/akartbayev/clir that conducts the extraction of phrase pairs and filters out low-frequency items. After the alignment processing, all valid phrases are stored in the phrase table and passed further down the pipeline.
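For illustration, a brute-force sketch of the extraction step, reusing is_consistent from above; this is our own simplification, not the code in the linked repository, and production extractors grow phrases incrementally instead of enumerating all spans:

```python
def extract_phrases(alignment, len_e, len_f, max_len=7):
    """Enumerate all phrase pairs consistent with the alignment.

    Returns pairs of (source span, target span) index ranges.
    """
    pairs = []
    for i1 in range(len_e):
        for i2 in range(i1, min(i1 + max_len, len_e)):
            for j1 in range(len_f):
                for j2 in range(j1, min(j1 + max_len, len_f)):
                    phrase_e = set(range(i1, i2 + 1))
                    phrase_f = set(range(j1, j2 + 1))
                    if is_consistent(phrase_f, phrase_e, alignment):
                        pairs.append(((j1, j2), (i1, i2)))
    return pairs
```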

2.2 Morphological Segmentation

Kazakh is a morphologically complex language with many differences from English. We describe here the main grammatical features of Kazakh that are relevant to translation into English, where they are mostly expressed by separate words in a different order. Case suffixes attached to a noun in Kazakh often correspond to a preposition in English, and the word order is quite challenging in the context of translation into English. Kazakh noun phrases whose English correspondences span many words may lead to phrases that exceed the maximum phrase length of a phrase table.

Our pipeline usually starts with word segmentation, which applies morphological tools to each entry of the phrase pair. In the first step, word segmentation aims to extract suffixes and roots from the word. We therefore take the surface forms of the words and generate all of their possible lexical forms. We also use the vocabulary to label the initial states as root words by parts of speech such as noun, verb, etc. The final states represent a lexeme created by affixing morphemes at each further state.
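The following toy sketch, entirely our own with a hypothetical root vocabulary and suffix list, illustrates this state-by-state generation of lexical forms by recursive suffix stripping; the real system relies on HFST and Morfessor:

```python
SUFFIXES = ["da", "ymyz", "tar"]         # hypothetical Kazakh suffixes
ROOTS = {"kitap"}                        # hypothetical root vocabulary

def segmentations(word, suffixes=()):
    """Yield (root, suffix chain) splits whose root is in the vocabulary."""
    if word in ROOTS:
        yield word, suffixes
    for s in SUFFIXES:
        if word.endswith(s) and len(word) > len(s):
            yield from segmentations(word[:-len(s)], (s,) + suffixes)

print(list(segmentations("kitaptarymyzda")))
# [('kitap', ('tar', 'ymyz', 'da'))]
```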

The schemes presented below are different combinations of outputs determining the removal of affixes from the analyzed words. The baseline approach is not perfect, since its scheme leaves several suffixes incorrectly segmented. We therefore focused on devising a few techniques for the segmentation of such word forms. In order to find an effective rule set, we tested several segmentation schemes, named S[1-8], some of which are described in Table 2.

Table 2 The segmentation schemes

A large number of verbs exhibit ambiguity during segmentation: they do not take personal endings, but follow conjugated main verbs. During the process, it was hard to determine the border between stems and inflectional affixes, especially when the stem together with the suffix matches an entire other word in the language. In fact, for lack of syntactic information, we cannot easily distinguish among such similar cases.

In order to solve the problems described above, we split Kazakh words into morphemes and tags that represent the morphological information expressed by the suffixation. By splitting Kazakh words in this way, we expect to reduce the sparseness produced by the agglutinative nature of Kazakh and the scarcity of training data. The segmentation model takes into account several segmentation options on both sides of the parallel corpus while looking for the optimal segmentation. As we discovered, words with the same part-of-speech (POS) tag often correspond to each other in the word alignment, which may help to handle out-of-vocabulary (OOV) words efficiently by incorporating linguistic information, but can also make the training data more sparse [18]. We also suppose that the discovery of word-context relations could lead to better word alignment scores, and we apply this idea using a heuristic algorithm in every training scenario.

To define the most convenient segmentation for our Kazakh-English system, we checked most of the segmentation options and measured their impact on translation quality. This application of morphological processing aims to find the best splitting options, such that each Kazakh phrase ideally corresponds to one English phrase; hence a deeper analysis is desirable.

3 Evaluation

To evaluate the system, three samples of text data of 50k sentences each were processed, used both in raw form and with special segmentation. The expert judgments of segmentation quality were provided by undergraduate students of our university. The data samples were split randomly into a training set and a test set, with one sample per run of the phrase-based Moses [19] system. Once most of the samples were processed correctly, meaning that the experts accepted the system's interpretation of the data as the intended one, we considered the system well trained.

Our corpora consist of legal documents from http://adilet.zan.kz, the content of http://akorda.kz, and multilingual Bible texts; the target-side language models were trained on the MultiUN [20] corpus. We conducted all experiments on a single PC running the 64-bit version of Ubuntu 14.10 Server edition on a 4-core Intel i7 processor with 32 GB of RAM. All experiment files were processed on a locally mounted hard disk. We expect more significant benefits from a larger training corpus, which we are currently constructing.

We did not have a gold standard for phrase alignments, so we had to refine the obtained phrase alignments down to word alignments in order to compare them with our word alignment techniques. We measure the accuracy of the alignment using precision, recall, and F-measure, as given in the equations below; here, A represents the reference alignment; T, the output alignment; and their intersection, the correct alignments (Table 3).

$$\begin{aligned} pr = \frac{\left| A\cap T \right| }{\left| T \right| },\quad re = \frac{\left| A\cap T \right| }{\left| A \right| },\quad F\text {-}measure = \frac{2\times pr \times re}{pr+re} \end{aligned}$$
(6)
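Computed over alignments represented as sets of index pairs, the metrics in (6) are straightforward; below is a small helper of our own, which also reports the alignment error rate (AER) discussed below, in the single-reference form where AER reduces to 1 - F:

```python
def alignment_metrics(reference, output):
    """Precision, recall, F-measure over alignments as sets of (i, j) pairs."""
    correct = len(reference & output)
    pr = correct / len(output)
    re = correct / len(reference)
    f = 2 * pr * re / (pr + re)
    # With one reference set (no sure/possible split), AER = 1 - F.
    return pr, re, f, 1.0 - f

ref = {(0, 0), (1, 2), (2, 1)}
out = {(0, 0), (1, 2), (2, 2)}
print(alignment_metrics(ref, out))   # approx (0.667, 0.667, 0.667, 0.333)
```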
Table 3 The performance of word alignment on 50 K

The alignment error rate (AER) values for the trained system show distinct tendencies that were consistent across iterations with different training parameters. The rates are consistently higher for raw lexemes than for segmented ones, suggesting that segmented text is better suited to the alignment task. Another tendency is that differences in context have a smaller impact than the precision of segmentation. This was not obvious, since removal or normalization causes a change in word structure. A difficulty in interpreting these training results lies in the scaling of the morpheme probability, which can vary and needs to be appropriate to the text domain and segmentation schemes. We assume that phrase alignment connects word classes rather than words. Consequently, the phrase translation table has to be learned directly from phrase alignment models, with the estimation of the phrase distribution probability an internal part of the process (Table 4).

Table 4 Best performance scores

The system parameters were optimized with the minimum error rate training (MERT) algorithm [21] and evaluated on the out-of-domain and in-domain test sets. All 5-gram language models were trained with the IRSTLM toolkit [22] and then converted to binary form using KenLM for faster execution [23]. The translation performance scores were computed using MultEval [24]: BLEU, TER [25], and METEOR [26]. We ran Moses several times per experiment setting and report the best BLEU/AER combinations obtained. Our survey shows that translation quality measured by the BLEU metric is not strictly related to lower AER.

4 Conclusion and Future Work

In this paper, we studied the effect of morphological processing on SMT by making the source and target languages more similar than they usually are. The methods we use to solve the most common problems are implemented as a pre-processing script and a middle-tier component for word alignment processing. As far as we know, dealing with nominal agglutination alone does not considerably change the BLEU score of the baseline translation. However, we expected the combination of morphological analysis and phrase table refinement to have a positive effect on translation quality. As a result, our experiments produced not only more closely matching phrases, but also new alignments that were not produced from the training data before. Taking a closer look, we found that morphological features extracted from the source language are a valuable resource for alignment prediction. Our evaluation shows that morphological processing leads to better translations even where the improvement cannot be measured by the BLEU score. The improved model performs at roughly the same speed as the previous one and gives an increase of about 3 BLEU points over the baseline translation. We believe this demonstrates the potential of word alignments for SMT quality, and we plan to investigate more sophisticated methods in future research, possibly adding new alignment features to the model.