Statistical machine translation of Indian languages: a survey

Khan Jadoon, Nadeem; Anwar, Waqas; Bajwa, Usama Ijaz; Ahmad, Farooq

doi:10.1007/s00521-017-3206-2

Original Article
Published: 17 November 2017

Volume 31, pages 2455–2467, (2019)

Neural Computing and Applications

Abstract

In this study, a performance analysis of a state-of-the-art phrase-based statistical machine translation (SMT) system is presented for eight Indian languages. The state of the art in SMT from different Indian languages into English is also discussed briefly. The motivation of this study is to promote the development of SMT and linguistic resources for these Indian language pairs, as current systems are in their infancy owing to sparse data resources. The EMILLE and crowdsourced parallel corpora are used for the experiments. The study concludes by presenting the performance of a baseline SMT system for translating Indian languages (Bengali, Gujarati, Hindi, Malayalam, Punjabi, Tamil, Telugu and Urdu) into English, with an average accuracy of 10–20 % across the language pairs. As a result of this study, both of these annotated parallel corpora and the SMT system will serve as benchmarks for future approaches to SMT for Hindi → English, Urdu → English, Punjabi → English, Telugu → English, Tamil → English, Gujarati → English, Bengali → English and Malayalam → English.

1 Introduction

This section gives a brief background on machine translation (MT) and an overview of MT approaches, including the SMT approach used in this work. The Indian languages selected for this work are also discussed briefly.

1.1 Machine translation

Machine translation (MT) can be defined as an automated system that analyses text in a source language (SL), applies some computation to that input, and produces equivalent text in a required target language (TL), ideally without any human intervention.

It is one of the most interesting and hardest problems in the field of NLP [1]. The two challenges in machine translation are adequacy and fluency. The former is to develop a system that adequately conveys the ideas expressed in the source language in the target language. The latter is to express those ideas grammatically. The common approaches to machine translation are the rule-based approach and the corpus-based approach.

In the rule-based approach, the text in the source language is analyzed using tools such as a morphological parser and analyzer and is then transformed into an intermediate representation. A set of rules is used to generate the target-language text from this intermediate representation. A large number of rules is necessary to capture the phenomena of natural language. These rules transfer the grammatical structure of the source language into the target language. As the number of rules increases, the system becomes more complicated [2] and slower to translate. Formulating a large number of rules is a tedious process that requires years of effort and linguistic analysis.

In the corpus-based approach, large parallel and monolingual corpora are used as the source of knowledge. This approach can be further divided into the statistical approach and the example-based approach. In the statistical approach, target text is generated and scored by a statistical model trained on a parallel corpus. Here, MT is treated as a decision problem in which the best target-language phrase is chosen for the given source-language input. Bayes' rule and statistical decision theory are used to solve this decision problem so as to minimize decision errors. SMT [1] gives better results when additional training data are available.

SMT is superior to rule-based and example-based systems in that it does not require human intervention and can build a translation system in an unsupervised manner directly from the training data. With the rapid proliferation of the internet and the increasing availability of data, SMT is currently the most popular and prevalent paradigm. SMT can be represented by different models, and the basic architecture of a simple SMT system is shown in Fig. 1. The arrow from the translation model to the language model shows that the language model contains the target-side corpora as well. The arrow from the language model to the translated text shows that the fluency of the translation depends on the quality of the language model. In this study, we use the phrase-based SMT model, and an overview of this model is given in the next section.

1.1.1 Phrase-based model

In this work, phrase-based SMT models [3, 4] are used and their performance is evaluated on morphologically rich Indian languages. Phrase-based models translate phrases of one or more words as atomic units [1]. These models divide the input sentence into phrases, produce the corresponding target phrases, and finally reorder these phrases. Phrase-based models capture local dependencies such as short reorderings, idiomatic collocations, insertions and deletions.
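The segment, translate and reorder steps described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the system used in this study: the phrase table, its romanized Hindi entries, `MAX_PHRASE_LEN` and the greedy longest-match segmentation are all invented for the example, whereas a real decoder scores many alternative segmentations. The sketch translates monotonically, so the verb stays in sentence-final position, which is the case the reordering step has to handle.

```python
# Hypothetical toy phrase table (romanized Hindi -> English); entries are invented.
PHRASE_TABLE = {
    ("mera", "naam"): "my name",
    ("hai",): "is",
    ("ravi",): "Ravi",
}
MAX_PHRASE_LEN = 3  # assumed limit on source phrase length


def segment(tokens):
    """Greedy longest-match segmentation of the source tokens into phrases."""
    phrases, i = [], 0
    while i < len(tokens):
        for n in range(min(MAX_PHRASE_LEN, len(tokens) - i), 0, -1):
            candidate = tuple(tokens[i:i + n])
            # Fall back to a single word when no known phrase matches.
            if candidate in PHRASE_TABLE or n == 1:
                phrases.append(candidate)
                i += n
                break
    return phrases


def translate_monotone(sentence):
    """Translate phrase by phrase, keeping the source phrase order."""
    output = []
    for phrase in segment(sentence.split()):
        # Unknown phrases are passed through unchanged.
        output.append(PHRASE_TABLE.get(phrase, " ".join(phrase)))
    return " ".join(output)


print(translate_monotone("mera naam ravi hai"))
```

The monotone output keeps the SOV order of the source; a reordering model would be expected to move the verb before the object to obtain fluent English.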

Phrase-based models build on the noisy channel model introduced in information theory [5]. Given a source sentence F, the objective is to find a target sentence E that maximizes the product of two components: the translation (or adequacy) model and the language (or fluency) model.

Every source sentence F is a sequence of words written as $f_{1}^{J} = f_{1} \ldots f_{j} \ldots f_{J}$ and is decoded into a target-language sentence E, written as $e_{1}^{I} = e_{1} \ldots e_{i} \ldots e_{I}$. The objective is to find a target sentence that maximizes the model:

$$\hat{e}_{1}^{I} = \hbox{argmax} \, P\left( {e_{1}^{I} |f_{1}^{J} } \right)$$

(1)

To decode sentence $f_{1}^{J}$ into sentence $e_{1}^{I}$, we need to compute $P\left( {e_{1}^{I} | f_{1}^{J} } \right)$, the probability of the translation given the source sentence. Using Bayes' theorem, this probability can be decomposed as:

$$P\left( {e_{1}^{I} | f_{1}^{J} } \right) = \frac{{P \left( {f_{1}^{J} | e_{1}^{I} } \right) \cdot P \left( {e_{1}^{I} } \right) }}{{P \left( {f_{1}^{J} } \right)}}$$

(2)

The goal is then to find the most probable translation hypothesis for the given source sentence $f_{1}^{J}$. Equation 2 would have to be evaluated for every candidate sentence in language E, but $P\left( {f_{1}^{J} } \right)$ is the same for every translation hypothesis. Therefore, the denominator $P\left( {f_{1}^{J} } \right)$ can be dropped from Eq. 2:

$$\hat{e}_{1}^{I} = \hbox{argmax} P\left( {f_{1}^{J} |e_{1}^{I} } \right) \cdot P \left( {e_{1}^{I} } \right)$$

(3)

The distribution of the first term in Eq. 3, $P\left( {f_{1}^{J} |e_{1}^{I} } \right)$, the probability of the translation pair (f, e), is called the translation model, and the distribution $P \left( {e_{1}^{I} } \right)$ is called the language model.
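The decision rule in Eq. 3 can be illustrated with a small Python sketch. The candidate sentences and all probability values below are made up for the example and are not taken from the experiments. The candidate set is also fixed in advance, whereas a real decoder searches a very large hypothesis space. Scores are summed in log space, which is equivalent to multiplying the two probabilities.

```python
import math

# Toy, invented values of P(f | e) for one fixed source sentence f.
translation_model = {
    "he goes home": 0.20,
    "he home goes": 0.25,
    "home he goes": 0.05,
}
# Toy, invented values of P(e), the fluency of each candidate.
language_model = {
    "he goes home": 0.010,
    "he home goes": 0.001,
    "home he goes": 0.002,
}


def noisy_channel_score(candidate):
    """Return log P(f|e) + log P(e) for one candidate translation e."""
    return math.log(translation_model[candidate]) + math.log(language_model[candidate])


def decode(candidates):
    """Pick the candidate that maximizes P(f|e) * P(e), as in Eq. 3."""
    return max(candidates, key=noisy_channel_score)


best = decode(list(translation_model))
print(best)
```

In this toy setting the word-for-word candidate has the highest translation model score, but the language model outweighs it, so the fluent candidate is selected.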

1.2 Language selection

In this work, eight commonly spoken languages of the subcontinent are selected. Parallel corpora are available for all of these languages for the experiments.

Bengali (Bangla) Bengali is the national language of Bangladesh and one of the official languages of India. More than 210 million people speak Bengali as their first or second language [6]. There are roughly 100 million native speakers of Bengali in Bangladesh and around 85 million in India, in states such as West Bengal, Assam and Tripura. Bengali, also known as Bangla, belongs to the Indo-Iranian family. Like most languages, it is written from left to right. Unlike English, its sentence structure is subject–object–verb (SOV). All letters are written in the same case, and there are no capital letters. Its punctuation derives from nineteenth-century English.

Gujarati It is a member of the Indo-Aryan branch of languages. Forty-six million people in the Indian state of Gujarat speak Gujarati [7]. The Gujarati language evolved in the twelfth century. Gujarati declension is considerably complicated. It has three genders (masculine, feminine and neuter) and two numbers (singular and plural). Nouns have three cases (nominative, oblique and agentive-locative). It is written from left to right, and its word order is SOV.

Hindi It is the national and official language of India. Four hundred twenty-five million people speak Hindi as their first language and more than 12 million people as their second language [8]. Outside India, some communities in South Africa, Mauritius, Bangladesh, Yemen and Uganda also communicate in Hindi. Hindi is a member of the Indo-Aryan group within the Indo-Iranian branch of the Indo-European language family. As in Persian, Hindi adjectives do not change with the number of the noun. Its prepositions are similar to those of English. Unlike other Sanskrit-based languages such as Gujarati, it has only two genders, masculine and feminine. Case marking in Hindi is simple owing to Persian influence, which reduced it to a direct form and an oblique form. Case relations are shown by postpositions. Like many languages, it is written from left to right, and its word order is SOV. Modern standard Hindi evolved through interaction with Muslims from Afghanistan, Iran, Turkey, Central Asia and elsewhere.

Owing to Persian influence, Hindi borrowed part of its vocabulary from Persian, in areas such as dress [e.g., پاجامہ, pajama (trousers); چادر, chadar (sheet)], cuisine [e.g., قورمہ, korma; کباب, kebab], cosmetics [e.g., صابن, sabun (soap); حنا, hina (henna)], furniture [e.g., کرسی, kursi (chair); میز, maiz (table)] and construction [e.g., دیوار (wall)].

A large number of adjectives and their nominal derivatives (e.g., abad "inhabited" and abadi "population") and a wide range of other items and concepts are so much a part of the Hindi language that purists of the post-independence period have been unsuccessful in purging them. While borrowing Persian and Arabic words, Hindi also borrowed phonemes, such as /f/ and /z/, though these were sometimes replaced by /ph/ and /j/. For instance, Hindi renders the word for force as either zor or jor and the word for sight as nazar or najar. In most cases the sounds /g/ and /x/ were replaced by /k/ and /kh/, respectively. Contact with the English language has also enriched Hindi. Many English words, such as button, pencil, petrol and college, are fully assimilated into the Hindi lexicon.

Malayalam Malayalam is also a widely spoken language in India, mainly in the state of Kerala, where it is an official language. A few communities in Tamil Nadu and Karnataka also communicate in Malayalam. It belongs to the South Dravidian subgroup of the Dravidian language family. Around 35 million people speak this language [9]. Dialects differ along social and caste lines, which gives rise to diglossia, i.e., a difference between the formal, literary and colloquial forms of speech. Like other Dravidian languages, it has a series of retroflex consonants (/ḍ/, /ṇ/ and /ṭ/) pronounced by touching the tip of the tongue to the roof of the mouth. Its word order is SOV, and it has a nominative-accusative case marking pattern. It has three genders: masculine, feminine and neuter. Inflection is generally marked via suffixation. Unlike other Dravidian languages, Malayalam inflects its finite verb only for tense, not for person, number or gender.

Punjabi (Panjabi) It is a member of the Indo-Aryan subgroup of the Indo-European language family. More than 10 million people speak this language [10] in the region that was divided between Pakistan and India at partition. The language is officially recognized in the Indian constitution. Some small communities in the UAE, UK, USA, Canada, South Africa and Malaysia speak Punjabi. It has two varieties: the western, known as Lahnda, and the eastern, known as Gurmukhi. There are two ways to write Punjabi: in the Perso-Arabic script and in the Gurmukhi alphabet, which was devised by the Sikh Guru Angad (1539–52) for scriptural use. Its word order is SOV, and it is written from left to right (Gurmukhi) or right to left (Perso-Arabic).

Tamil Tamil is a member of the Dravidian language family and is the official language of the state of Tamil Nadu. It is also an official language in Sri Lanka and Singapore and is spoken by many people in Malaysia, Mauritius, Fiji and South Africa. In 2004, it was declared a classical language of India, which means that it met three criteria: its origins are ancient; it has an independent tradition; and it possesses a considerable body of ancient literature. Around 66 million people speak Tamil [11].

The grammatical and lexical form of the language changed across three periods: Old Tamil (from about 450 BCE to 700 CE), Middle Tamil (700 CE to 1600 CE) and Modern Tamil (from 1600 CE onwards). Its writing system developed from the Brahmi script. Over time its letters changed shape until the sixteenth century CE, when printing was introduced and the shapes stabilized. The major addition to the alphabet was the incorporation of Grantha letters to write unassimilated Sanskrit words, although a few letters with irregular shapes were standardized during the modern period. A script known as Vatteluttu (round script) is also in common use. The spoken language has also changed over time, and the Tamil spoken in India differs from that spoken in Sri Lanka. Its word order is SOV, and within Tamil Nadu there are phonological differences between the northern, western and southern varieties.

Telugu Telugu is one of the most widely spoken languages of the Dravidian language family. It is spoken in the southeastern part of India and is the official language of Andhra Pradesh. Worldwide, 75 million people speak Telugu [12]. The oldest material in this language dates from 575 CE. Telugu is written in the Telugu script, which is derived from that of the Calukya dynasty. Its word order is SOV, and it is written from left to right. Visually, it differs from many of the North Indian scripts in that the letters have a rounded base.

Urdu Urdu is also a member of the Indo-Aryan group within the Indo-European family of languages. Urdu is the national language of Pakistan and is also an officially recognized language in the Indian constitution. More than 100 million people [13] within Pakistan and India speak Urdu. Apart from these two nations, Urdu is also spoken by immigrants and small communities in the UK, USA and UAE. Urdu and Hindi are mutually intelligible. Urdu developed in the Indian subcontinent and is therefore similar to Hindi. Owing to their similarity in phonology and grammar, they seem like one language, but their sources are different. Urdu draws on Arabic and Persian, while Hindi draws on Sanskrit, which is why they are treated as independent languages. There is a large difference in their writing systems: the Urdu script is a modified form of the Perso-Arabic script, while the Hindi script is a modified form of the Devanagari script.

Urdu and Hindi sound similar except for a few variations in short vowel allophones. Urdu retains a full set of aspirated stops, a characteristic of Indo-Aryan, as well as retroflex stops. Urdu does not retain the complete range of Perso-Arabic consonants, despite its heavy borrowing from that tradition. The largest number of sounds retained is among the spirants, a group of sounds uttered with a friction of breath against some part of the oral passage, in this case /f/, /z/, /zh/, /x/ and /g/. One sound in the stops category, the glottal /q/, has also been retained from Perso-Arabic.

Grammatically, Hindi and Urdu are the same. The major difference between the two is that Urdu is written from right to left, while Hindi is written from left to right. Urdu word order is SOV, and the language exhibits split-ergative behavior. Urdu uses more Perso-Arabic prefixes and suffixes than Hindi does. Examples include the prefixes dar- “in,” ba-/baa- “with,” be-/bila-/la- “without” and bad- “ill, mis-” and the suffixes -dar “holder,” -saz “maker” (as in zinsaz “harness maker”), -khor “eater” (as in muftkhor “free eater”) and -posh “cover” (as in mez posh “table cover”).

1.3 Related work

Initial research has been done on translating Indian languages, mostly focusing on Hindi and Bengali. However, most of this work is still rule based because of the unavailability of parallel data for building SMT systems for these languages.

Dasgupta et al. [14] proposed an approach for English to Bangla MT that uses syntactic transfer of English sentences to Bangla with optimal time complexity. In the phrase generation stage, they used a dictionary to identify the subject, the object and other features such as person and number, and to generate target sentences. Naskar and Bandyopadhyay [15] presented an example-based machine translation system for English to Bangla. Their work identifies the phrases in the input through a shallow analysis, retrieves the target phrases using the example-based approach and finally combines the target phrases using heuristics based on the phrase reordering rules of Bangla. The authors also discussed some syntactic issues between English and Bangla.

Anwar et al. [16] proposed a method that syntactically analyzes a Bangla sentence using context-sensitive grammar rules, which accept almost all types of Bangla sentences including simple, complex and compound sentences, and then translates the input Bangla sentence into English using an NLP conversion unit. The grammar rules employed in the system allow parsing of five categories of sentences according to Bangla intonation. The system analyzes an input sentence and converts it into a structural representation (SR). Once an SR has been created for a particular sentence, it is converted into the corresponding English sentence by the NLP conversion unit, which makes use of the corpus.

Islam et al. [2] proposed a phrase-based statistical machine translation (SMT) system that translates English sentences into Bengali. They added a transliteration module to handle out-of-vocabulary (OOV) words. A preposition handling module is also incorporated to deal with systematic grammatical differences between English and Bangla. To measure the performance of their system, they used BLEU, NIST and TER scores. Durrani et al. [17] also made use of transliteration to aid translation between Hindi and Urdu, which are closely related languages. Roy [18] applied three reordering techniques, namely lexicalized, manual and automatic reordering, to the source language in a Bangla–English SMT system. Singh et al. [19] presented a phrase-based approach to English–Hindi translation. In their work they discussed a simple implementation of the default phrase-based SMT model for English to Hindi and also gave an overview of different machine translation applications currently in use.

Sharma et al. [20] presented an English to Hindi SMT system using the phrase-based approach. They used human evaluation as their evaluation measure, which is costlier than the automatic evaluation metrics already available. Yamada and Knight [21] used methods based on tree-to-string mappings, in which source language sentences are first parsed and operations are then applied to each node. Eisner [22] discussed the issues of working with isomorphic trees and presented a new non-isomorphic tree-to-tree mapping translation model using synchronous tree substitution grammar (STSG). Liu et al. [23] first proposed using a maximum entropy model based on source language parse trees to obtain n-best syntactic reorderings of each sentence, an idea that was later extended to the use of lattices.

Bisazza and Federico [24] further explored lattice-based reordering techniques for Arabic–English. They used shallow syntax chunking of the source language to move clause-initial verbs by up to six chunks, where each verb placement is encoded as a separate path in the lattice and each path is associated with a feature weight used by the decoder.

Jawaid et al. [25] presented a complete study of English to Urdu MT using factored MT. In their work they discussed the divergence between the two languages, including the vocabulary differences between Urdu and English. The authors showed the importance of factored models when morphological information is available for both the source and the target language.

Khan et al. [26] presented a baseline SMT system for English to Urdu translation using the hierarchical model of Chiang [27]. They also compared the default phrase-based model with the hierarchical model and showed that, for a local language such as Urdu, the simple phrase-based model performs much better than the hierarchical phrase-based approach to SMT.

Singh [28] presented a Punjabi to Hindi machine translation system. The proposed system was implemented with various techniques based on a direct MT architecture and a language corpus. The output was evaluated to assess the suitability of the system for the Punjabi–Hindi language pair. Extensive research on neural network technology in MT can also be found in the literature, and researchers now recommend it as a good approach. Neural machine translation is a newly proposed approach in MT. Its main drawback is that it requires a relatively large training corpus compared with SMT. Khalilov et al. [29] estimated a continuous space language model with a neural network in an Italian to English MT system. Bahdanau et al. [30] presented a neural machine translation model that jointly learns to align and translate.

In this work, phrase-based SMT models are used and their performance is evaluated on morphologically rich Indian languages. These languages are low-resource languages in terms of the availability of MT systems (and NLP tools in general), yet together they represent nearly half a billion native speakers. Their speakers are well educated, with many of them speaking English either natively or as a second language. An important phenomenon in these languages is a high degree of morphological complexity relative to English. Indian languages can also be highly agglutinative, which means that words are formed by concatenating morphological affixes that convey information such as tense, person, number, gender, mood and voice. Morphological complexity is a considerable hurdle at all stages of the MT pipeline, particularly alignment, where inflectional variations mask patterns from alignment tools that treat words as atomic units. Another important factor in these languages is head-finalness, exhibited most obviously in a subject–object–verb (SOV) pattern of sentence structure, in contrast to the general SVO ordering of English sentences.

2 Evaluation

In this section, we describe the two datasets used in the experiments, followed by a discussion of the training, tuning and testing of the different model components.

2.1 Dataset

2.1.1 EMILLE corpus

For this work, parallel corpora from diverse domains were collected for all the selected languages. The first corpus selected is the Enabling Minority Language Engineering (EMILLE) corpus. EMILLE is a 63 million word corpus of Indic languages [31], which is distributed by the European Language Resources Association (ELRA). EMILLE contains data from six different categories: consumer, education, health, housing, legal and social documents. These data are based on the information leaflets provided by the UK government and various local authorities. There are 72 parallel files in total for the five source languages, with each filename consisting of the language code, text type (written or spoken), genre and subcategory, joined by hyphens. The data are encoded in full 2-byte Unicode format and marked up in SGML. The parallel corpus consists of 200,000 words of text in English and its accompanying translations in Hindi, Bengali, Punjabi, Gujarati and Urdu. Its bilingual resources consist of approximately 13,000 sentences for all the available languages, from which we were able to sentence-align and extract over 8000 sentences for all languages paired with English using the sentence alignment algorithm given by Moore [32]. Details about the number of parallel sentences extracted for each pair are given in Tables 1 and 2.
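The filename convention described above (language code, text type, genre and subcategory joined by hyphens) lends itself to a small helper for pairing files across languages. The sketch below is only an illustration: the example filenames, the `.txt` extension, the language codes and the `w`/`s` text-type codes are assumptions made here and may not match the actual EMILLE distribution.

```python
from collections import defaultdict
from typing import NamedTuple


class CorpusFile(NamedTuple):
    language: str      # language code, e.g. "hin" (hypothetical)
    text_type: str     # "w" for written or "s" for spoken (assumed codes)
    genre: str
    subcategory: str


def parse_filename(filename):
    """Split a hyphen-joined filename into its four fields."""
    stem = filename.rsplit(".", 1)[0]   # drop an optional extension
    parts = stem.split("-", 3)          # at most four fields
    if len(parts) != 4 or not all(parts):
        raise ValueError(f"unexpected filename: {filename!r}")
    return CorpusFile(*parts)


def group_parallel(filenames):
    """Group files that share text type, genre and subcategory across languages."""
    groups = defaultdict(dict)
    for name in filenames:
        info = parse_filename(name)
        groups[info[1:]][info.language] = name
    return dict(groups)


# Hypothetical filenames used only to exercise the helpers.
files = [
    "hin-w-health-cancer.txt",
    "urd-w-health-cancer.txt",
    "hin-w-legal-tenancy.txt",
]
groups = group_parallel(files)
print(groups[("w", "health", "cancer")])
```

Grouping by everything except the language code gives, for each document, the set of translations that can then be passed to a sentence aligner.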

Table 1 Training and evaluation data for EMILLE

No.	Features	Description
1	Maximum sentence length of 80
2	GDFA symmetrization of GIZA++ alignments [36]	GIZA++ [36] and the “grow-diag-final-and” heuristic are used to generate a word-aligned corpus, from which bilingual phrases with maximum length 80 are extracted
3	Interpolated Kneser–Ney smoothed 5-gram language model with SRILM [37] used at runtime	The SRILM toolkit [37] is used to train a 5-gram language model
4	5-gram OSM [38]
5	msd-bidirectional-fe lexicalized reordering model	The msd-fe reordering model has three features, which represent the probabilities of bilingual phrases in three orientations: monotone, swap or discontinuous. If an msd-bidirectional-fe model is used, then the number of features doubles: one set for each direction
6	Sparse lexical and domain features [39]
7	Distortion limit of 6
8	100-Best translation options
9	MBR decoding [40]
10	Cube pruning [41] with a stack size of 1000 during tuning and 5000 during test
11	No reordering over punctuation heuristic
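The three orientations named in row 5 of the table (monotone, swap and discontinuous) can be made concrete with a short sketch. This is a simplified, phrase-level illustration written for this section, not the toolkit's implementation, which typically derives orientations from word alignments during training. The function names and the example spans are assumptions.

```python
def orientation(prev_span, cur_span):
    """Classify the current phrase relative to the previously translated one.

    Spans are inclusive (start, end) positions of source words.
    """
    prev_start, prev_end = prev_span
    cur_start, cur_end = cur_span
    if cur_start == prev_end + 1:
        return "monotone"        # continues directly after the previous phrase
    if cur_end == prev_start - 1:
        return "swap"            # sits directly before the previous phrase
    return "discontinuous"       # anything else


def label_orientations(source_spans):
    """Label each phrase, given source spans listed in target-side order."""
    labels = []
    prev = (-1, -1)  # sentinel so a phrase starting at position 0 counts as monotone
    for span in source_spans:
        labels.append(orientation(prev, span))
        prev = span
    return labels


# Source spans in the order their translations appear on the target side.
spans = [(0, 1), (3, 3), (2, 2)]
print(label_orientations(spans))
```

A bidirectional model, as described in the table, would compute the same three labels a second time with respect to the following phrase, which is why the number of features doubles.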
