1 Introduction

This is an introduction to machine translation (MT) involving the Thai language. The focus is on translation between Thai and English, as historically translation from English has opened greater access to information for people in Thailand; in addition, English is the primary language of research publication, MT tools and techniques, and online data. The need for language translation has increased in Thailand with membership of multinational organizations such as the Association of Southeast Asian Nations (ASEAN), and with the growing use of media and technology providing multilingual content via the Internet. Internationally, research focus is currently shifting from the dominant approach of the last decade, phrase-based statistical machine translation (PBSMT), to translation using neural networks, called neural machine translation (NMT). Asian languages, which have often suffered when techniques developed for Western languages are transferred to them, have found this new approach more portable. The aim of this paper is to provide a valuable resource of relevant information for both current researchers in the field and non-experts.

Recent research has produced encouraging results not only for languages that have no explicit markers between words, but also for the segmentation process itself. In addition, the ability to translate multiple languages within the same process improves translation of low-resource languages. Given that these areas have historically been hurdles for machine translation research in Thailand, there is great potential in emerging MT research for the translation of the Thai language. To achieve reasonable quality, translation systems require a substantial quantity of parallel text for training that is currently only available to industry-based providers of Thai MT, resulting in limited published detail due to commercial sensitivities and a lack of incentive for academic researchers. Nevertheless, the level of global interest and the potential suitability of the Thai language require that interested researchers have access to relevant information.

MT is a difficult but important area of language processing and, at present, to the author’s knowledge, there is no overview of Thai MT research. This paper aims to inform the reader of the significant MT research in Thailand, both past and present, and to provide information on the application of neural networks to translation of the Thai language. This is an introductory survey of relevant research; a deeper analysis of Thai MT systems is left to researchers with greater development involvement.

The next section describes the background to Thai MT, including details of the Thai language and its significant research and history. Section 3 covers the details relevant to translation of the Thai language, including approaches, systems and the evaluation of those systems, and also deals with NMT. This is followed in Sect. 4 by a discussion of the important issues for Thai MT, including word segmentation, sentence segmentation, alignment, linguistic concerns and evaluation. The conclusions are reported in Sect. 5.

2 Background

To understand the issues involved in translating the Thai language, it is necessary to explain the relevant approaches to Thai MT and the characteristics of the Thai language that make it a challenging field for natural language processing (NLP) and MT research.

2.1 The Thai language

Thai is an analytic language that relies on particles and word order rather than inflection (Supnithi and Boonkwan 2008), and words are modified or added to indicate gender, noun number, or the tense or mood of a verb (Chimsuk and Auwatanamongkol 2009). In addition, the meaning of words can change with their order and sentence position. The Thai language, like English, uses Subject–Verb–Object order but differs greatly in having no definite or indefinite articles, no verb conjugation, no noun declension, and no specific object pronouns. The Thai language is described as an “uninflected, primarily monosyllabic, tonal language” that has changed little since the script was introduced by King Ramkhamhaeng in 1283,Footnote 1 and there are many loan words transliterated into the Thai script. Thai uses five tones and spelling is based on how words are spoken. Thai is one of many languages and dialects belonging to the Tai sub-family of the Tai-Kadai family, which has traditionally been classified as a member of the Sino-Tibetan family of languages. Thai is the official language of Thailand, although there are four major dialects, corresponding to the Southern, Northern (“Yuan” or Lanna), Northeastern (close to the Lao language), and Central (Bangkok Thai) regions of the country. Tai languages share basic vocabulary such as animal names and body parts. Unless otherwise stated, the ‘Thai’ we refer to in this paper is the official language of Thailand.

Thai words consist of consonants with adjacent vowels and tone marks, so the characters form an abugida rather than an alphabet. There are 44 consonants, 15 basic vowel graphemes that are also combined into multi-character vowel forms, and four diacritic tone markers. Thai words are formed from Thai syllables consisting of an initial consonant, a vowel, and an optional ending consonant. The initial and final consonants can be consonant clusters, the vowel is either long or short in duration, and not all consonants can appear as the final consonant. Vowels are written before, after, above or below the initial consonant although spoken after it. In Fig. 1, the vowels are grouped as preceding, following, below, and above, with an additional category of ‘extra’ vowels for other symbols used in Thai text, including the vowel shortening mark ‘◌็’ (mai taikhu). The tone markers are also written above the consonants (and vowels), making parsing and optical character recognition of Thai script difficult. There are syllables without a written vowel, such as in the word กบ (frog), where the vowel represented by the IPA symbol ‘O’ is implied.Footnote 2 The Thai consonant symbols ‘ย’, ‘ว’, and ‘อ’ can also appear as part of compound vowel sounds called diphthongs. For example, the Thai word เที่ยงวัน (midday) consists of the two syllables ‘เที่ยง’ and ‘วัน’ and can be shortened to the first syllable, which consists of the initial consonant ‘ท’, the vowel diphthong ‘เ◌ีย’ (sara ia), the tone mark ‘◌่’ (mai ek), and the final consonant ‘ง’. The tone is determined by several factors and is influenced by the tone marks rather than being dependent on them. Although Arabic numerals are used in Thailand, the Thai language has its own numbering system based on the Hindu-Arabic numeral system.

Fig. 1 The Thai Abugida (alphabet)

MT is difficult for Asian languages such as Chinese, Japanese and Thai because there is no explicit boundary between words. Spaces are often used to indicate the end of a sentence but are also used between clauses, names and dates. This has made segmentation a predominant issue for Thai MT. To identify the boundaries between Thai words, we must first consider what is classified as a word. An entry in a dictionary is an example of a word. There are several formats that such an entry can take: a simple word (e.g. molecule), a complex word (e.g. intramolecule), a compound word (e.g. photosynthesis), a multiword (e.g. sewing machine), a phrasal unit (e.g. bridges with pin-joined members), or a set phrase (e.g. night and day) (Aroonmanakun 2007). In English, words such as ‘ice cream’ have a space or a hyphen between two orthographic words but are still considered a single linguistic unit. This unit is considered a word and has at least one meaning.

The Thai sentence “เธอมีหนังสือสองเล่มและปากกา” translates to the sentence “She has two books and a pen”, becoming “|เธอ|มี|หนังสือ|สอง|เล่ม|และ|ปากกา|” after segmentation (see Fig. 2 below). A direct translation using a Thai-to-English dictionary would result in the sentence ‘She have book two and pen’. Articles do not have a Thai equivalent, so the token ‘a’ is aligned to the following noun ‘pen’ in this example. The use of classifiers is common in Thai although not always with a direct translation, as seen in the phrase ‘two books’ that is translated into หนังสือ (book), สอง (two) and เล่ม (classifier for books). The word ‘has’ is translated as the word มี (to have), whilst the word เป็น (is) can also be translated as ‘has’, such as in the phrase ‘เธอเป็นไข้’ (she has a fever). To illustrate the ambiguity of the segmentation process, the Thai words หนังสือ and ปากกา consist of two parts. These characters can also be used to form the four words of หนัง (movie), สือ (media), ปาก (mouth), and กา (to make a check mark) seen in the incorrectly segmented sentence in Fig. 2(c).
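To make the segmentation ambiguity above concrete, the greedy longest-matching strategy used by many dictionary-based Thai segmenters can be sketched in a few lines. This is a simplified illustration, not any specific published tool, and the dictionary holds only the words needed for the example sentence, including the shorter ambiguous words:

```python
# Greedy longest-matching dictionary-based word segmentation (a sketch).
# The dictionary contains both the correct words and the shorter words
# (หนัง, สือ, ปาก, กา) that make segmentation ambiguous.
DICTIONARY = {"เธอ", "มี", "หนังสือ", "สอง", "เล่ม", "และ", "ปากกา",
              "หนัง", "สือ", "ปาก", "กา"}

def segment(text, dictionary=DICTIONARY, max_word_len=10):
    words, i = [], 0
    while i < len(text):
        # Prefer the longest dictionary word starting at position i.
        for j in range(min(len(text), i + max_word_len), i, -1):
            if text[i:j] in dictionary:
                words.append(text[i:j])
                i = j
                break
        else:
            words.append(text[i])  # unknown character: emit as a singleton
            i += 1
    return words
```

Greedy matching resolves the หนังสือ/หนัง ambiguity here only because the longer match is preferred; real segmenters combine dictionaries with statistical or neural disambiguation to avoid outputs like Fig. 2(c).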

Fig. 2 Thai word segmentation

A collection of characters that forms a unit comparable to a syllable, usually containing a vowel, is referred to as a Thai character cluster (TCC). These clusters form simple or compound words, leading to ambiguity when segmenting words. Compound words differ in that some have the same meaning as their parts, some do not, and some are related to the meaning of a subset of their parts. For example, แม่น้ำ (river) is composed of แม่ (mother) + น้ำ (water), ยินดี (glad) is composed of ยิน (hear) + ดี (good), and หายใจ (breathe) is composed of หาย (lost) + ใจ (heart). Aroonmanakun (2002) asks whether หม้อหุงข้าว (rice cooker) should be analyzed as one compound word, or as the three simple words หม้อ (pot), หุง (cook), and ข้าว (rice). If we analyze หม้อหุงข้าว as a single word by assuming it denotes a single referent, should we also consider หม้อหุงข้าวไฟฟ้า (electric rice cooker) a single word? Loan words can be more complicated, such as ไมโครซอฟต์ (Microsoft), composed of ‘ไม’, โค (ox), ‘ร’, ซอ (fiddle), and ‘ฟต์’, where only two parts are known words (Charoenpornsawat and Schultz 2008).

Word order differs in the Thai language; for example, an adjective follows its noun rather than preceding it as in English, so the phrase ‘big dog’ in Thai is หมาใหญ่ (dog-big). Word order can also create ambiguity, as seen with the two words ปี (year) and ครึ่ง (half), which can be ordered to mean either half a year (ครึ่งปี) or a year and a half (ปีครึ่ง). In the Thai language, an adjective is used to express plurality of a noun, the preposition ‘ของ’ (of) indicates possession, there is no pronoun or verb inflection, and pronouns are often omitted, especially in speech. The word กว่า (more) indicates the comparative degree and the word ที่สุด (the most) indicates the superlative degree. Derivation occurs in Thai not through suffixes but through additional words such as การ (changes a verb to a noun) and ความ (changes an adjective to a noun). Meaning can also depend on word order, as seen in the word กำลัง, which is a noun meaning power or capacity but functions as a grammatical marker when it precedes terms such as ทำงาน (to work) (Nomponkrang and Sanrach 2016).

Issues for Thai MT that arise can be seen in Fig. 3 in the example of the translation of the Thai text meaning “the body of an Australian man” from research comparing five commercial MT systems (Lyons 2016b). The candidate translations of each term are as follows: ร่าง (body, figure, shape, draft, to draft/sketch, grid), ของ (to belong to), ชาย (male, man, men), ชาว (people, folk), ออสเตรเลีย (Australia, Australian), คน (person, people, man, prefix to nationality) and หนึ่ง (one, only one), or grouped as “man’s body” (ร่าง|ของ|ชาย) or “Australian” (ชาว|ออสเตรเลีย|คน|หนึ่ง). The MT system has difficulty segmenting the text correctly in translation #5, and fails to identify which words should be grouped to ascertain the correct meaning in example #4. Ambiguity in word meaning is seen in translation #2 where the Thai word ‘ร่าง’ has several meanings including ‘body’ and ‘draft’, and there are issues with word order seen in both translations #2 and #5. Finally, the Thai text does not indicate singular or plural resulting in translations #1 and #5 pluralizing ‘man’ to ‘people’.

Fig. 3 Example of the output of MT systems for a simple clause

In a 2010 report on the difficulties of English-to-Thai translation, several “linguistic differences” were cited as problematic (Nathalang et al. 2010). Ambiguous meaning of both Thai and English words, affixation in English (e.g. in-conceive-able), compound words in Thai, and lexical meaning all caused problems for English-to-Thai SMT. Lexical meaning included propositional and expressive meaning (e.g. ‘use your head’) and presupposed meaning, arising either from selection, as in ‘handsome boy’ versus ‘pretty boy’, or from collocational restrictions such as verb use, e.g. the difference between languages expressing the action of brushing teeth: ‘brush teeth’ (English), ‘polish teeth’ (German), ‘wash teeth’ (Polish) and ‘clean teeth’ (Russian). Words such as ‘access’ were often incorrectly translated, as were derivations of words, and some words and concepts in the English language do not exist in Thai. In conclusion, the Thai language exhibits several grammatical differences from other languages and so challenges the application of MT models, techniques and tools built for other languages. A substantial amount of Thai MT research has focused on segmentation, and the issues common to other language pairs, such as unknown words, word order and long-distance dependencies, also exist.

2.2 Machine translation

Machine Translation (MT) translates text from one language (the source language) to another (the target language) without human assistance. The translation should have the same meaning and be syntactically and grammatically correct. From the mid-2000s until recently, PBSMT systems, based on the architecture seen in Fig. 4, topped performance evaluations and formed the approach used by large-scale translation providers. In the preface of his 2009 book, Koehn stated that there were about a thousand academic papers on SMT, with over half of them published in the preceding three years (Koehn 2009). Before this time there were several linguistically-motivated approaches classified as rule-based due to their reliance on manually created rules to address the common difficulties in translation. A comprehensive guide to the history of MT before the corpus-based approaches can be found in Hutchins (1995). The success of freely available SMT tools has led to their widespread use, especially for low-resource languages, allowing researchers to focus on language-specific issues. These tools are based around the Moses decoder (Koehn et al. 2007), with accompanying tools such as the alignment tool GIZA++ (Och and Ney 2003) and the language modelling toolkit SRILM (Stolcke 2002). SMT research involving the Thai language has almost exclusively reported use of the Moses SMT Toolkit.

Fig. 4 Phrase-based statistical machine translation (PBSMT)

Luekhong et al. (2016) provided a comprehensive guide to a Thai–English SMT system detailing the phrase extraction stage where the word translation probabilities are calculated using IBM Model 4 (Brown et al. 1993), words are aligned in forward and backward directions (source-to-target and target-to-source), and additional heuristics can be used to assist alignment before the phrase pairs are calculated. This forms the phrase-based model using the phrase-level conditional probabilities with lexical weight scores for phrase pairs based on word alignment probabilities. The decoder provides the best translation based on the source text, phrase translation model and the target language model. Starting from an initial hypothesis the decoder uses a search algorithm, such as beam search, to expand each subsequent phrase segmentation of the source sentence and marks a path with translation alternatives to cover all of the words. Scores are then calculated on the path options and the sentence with the highest score is selected.Footnote 3
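The beam-search decoding described above can be sketched as follows. This is a heavily simplified, monotone (no reordering) decoder with an invented toy phrase table and no language model; the phrases and probabilities are illustrative only, not values from any cited system:

```python
import math
from heapq import nlargest

# Illustrative phrase table: Thai phrase -> (English phrase, log-probability).
PHRASE_TABLE = {
    ("เธอ",): [("she", math.log(0.9))],
    ("มี",): [("has", math.log(0.6)), ("have", math.log(0.4))],
    ("หนังสือ", "สอง", "เล่ม"): [("two books", math.log(0.7))],
}

def decode(source, beam_size=4):
    """Monotone phrase-based beam search: hypotheses are grouped in stacks
    by how many source words they cover, and each stack is pruned."""
    stacks = {0: [(0.0, [])]}            # words covered -> hypotheses
    for covered in range(len(source)):
        for score, words in stacks.get(covered, []):
            for span in range(1, len(source) - covered + 1):
                phrase = tuple(source[covered:covered + span])
                for target, logp in PHRASE_TABLE.get(phrase, []):
                    stacks.setdefault(covered + span, []).append(
                        (score + logp, words + [target]))
        for k in stacks:                 # beam pruning
            stacks[k] = nlargest(beam_size, stacks[k])
    complete = stacks.get(len(source), [])
    return " ".join(max(complete)[1]) if complete else ""
```

A real decoder additionally scores each hypothesis with a language model and reordering penalties, and allows phrases to be applied out of source order.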

In the last five years the focus of MT research has changed dramatically due to the increased use of neural networks. Success in using neural networks for language modelling led to their broader adoption, ultimately for the entire translation process, called NMT. The use of neural networks for MT is not new, as it had been proposed some time ago (Castano and Casacuberta 1997; Forcada and Ñeco 1997; Ñeco and Forcada 1997), although the technology and resources required for reasonable performance did not exist at that time. The appearance of NMT in publications during 2014 was followed by the entry of NMT systems at the NIST Open Machine Translation (OpenMT) EvaluationFootnote 4 and the Workshop on Statistical Machine Translation (WMT) in 2015, and by the next year the best-performing system in the WMT’16 workshop involved neural networks.Footnote 5 Leading universities such as Edinburgh and Stanford, and large organizations such as Google, Microsoft, Amazon and Facebook, are now either deploying NMT systems or offering neural translation toolkits.

Several research teams proposed the use of neural networks for MT (e.g. Kalchbrenner and Blunsom 2013), whilst Sutskever et al. (2014) and Cho et al. (2014) described the encoder-decoder framework that is still widely used (see Fig. 5). NMT systems based on an encoder and decoder are multi-layered neural networks, often with the input transformed using ‘word embeddings’ (e.g. Mikolov et al. 2013), which assign each word a vector of values based on the probability of the word occurring in the context of its surrounding words. To retain previous information in a sequence, early models commonly used Recurrent Neural Networks (RNNs), which allow information to persist from previous steps, together with Long Short-Term Memory (LSTM) units that aid this persistence.
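A minimal sketch of the encoder half of this framework, using a plain recurrent network over word embeddings (the LSTM gating that aids persistence is omitted for brevity, and the weight matrices are assumed to come from training):

```python
import numpy as np

def rnn_encode(embedded_words, W_in, W_rec, bias):
    """Encode a sentence with a simple (Elman) RNN: the hidden state is
    re-used at every time step, so information about earlier words
    persists. Returns the final state as a fixed-size sentence encoding."""
    h = np.zeros(W_rec.shape[0])
    for x in embedded_words:                   # one embedding per source word
        h = np.tanh(W_in @ x + W_rec @ h + bias)
    return h
```

The decoder then conditions on this vector (and, in attention-based models, on all intermediate hidden states) to emit target words one at a time.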

Fig. 5 A neural machine translation (NMT) system

In ‘attention-based’ systems there is an attention mechanism between the encoder and decoder that allows the decoder to give greater importance to the nodes with the highest scores in a weighting system. For example, if a word with the same meaning in a direct translation appeared much earlier in a sentence, it could be given a higher weight and thus greater attention. Some issues are not solved by attention alone, such as the inability to process the input in parallel. If each word can be processed at the same time, as with Convolutional Neural Networks (CNNs), then parallelization can reduce the time overhead. The problem of dependencies when translating sentences still remains, and the Transformer model (Vaswani et al. 2017) addresses it using attention alone, dispensing with recurrence and convolution. Transformers allow the decoder to focus on the relevant parts of the input sentence but also use a form of attention called self-attention, which uses query, key and value vectors for each word to create scores that measure the relationship between that word and the other words in the sentence. Finally, the decoder chooses from a large selection of target-language words using a ‘softmax’ function. Systems can use a combination of words and sub-words to significantly reduce the problem caused by the large number of output options.
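The query-key-value computation can be sketched for a single attention head as follows. This is a minimal NumPy illustration; real Transformers use multiple heads, positional information, masking, and learned projection matrices (assumed given here):

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """Scaled dot-product self-attention for one sentence (one head).
    X holds one embedding per word; the projections produce the query,
    key and value vectors for each word."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # word-to-word relevance
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V   # each word: attention-weighted mix of all values
```

Each output row is a mixture of the value vectors of the whole sentence, weighted by how strongly that word attends to every other word.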

3 Machine translation for the Thai language

The historical details of state-of-the-art Thai language processing research relevant to Thai MT are recorded in a select few publications. In 1988, the first national-level research project in Thailand focused on MT (Kawtrakul and Praneetpolgrang 2014). It was a collaboration of Thai universities and two research institutions: the Center of the International Cooperation for Computerization of Japan (CICC) and the National Electronics and Computer Technology Center (NECTEC), Thailand. The universities included Chulalongkorn University, Kasetsart University, and the King Mongkut’s Institutes of Technology (Ladkrabang and North Bangkok). In 2000, there was one annotated corpus, called Orchid, and one publicly available Thai MT system, called Parsit, from NECTEC, and text segmentation suffered from the problems of manually created dictionaries and unknown words (Sornlertlamvanich et al. 2000b). In 2002, a second corpus, the NAiST corpus, was in use, and again the issues of unknown words and boundary ambiguity in segmentation were prominent (Kawtrakul et al. 2002). Language-specific issues, segmentation and resource availability are common problems for both Thai speech processing and MT, and were included in a comprehensive speech processing review by NECTEC (Wutiwiwatchai and Furui 2007). Finally, in 2009 the advancements in segmentation research were seen in the BEST and InterBEST word segmentation competitions, which used a five-million-word corpus provided by NECTEC (Kosawat et al. 2009).

Both the Thai and English languages use the same SVO (subject-verb-object) order and share the same basic structure of simple sentences. These structures were represented in phrase structure grammar (PSG) as patterns using a lexical transfer approach in Chancharoen et al. (1999). English words were morphologically analyzed to identify the root word and affixes, the sentences were then represented as PSG phrase representations, and the patterns were mapped to their Thai equivalents with the Thai words reordered accordingly. Word translation was dictionary-based, although the system handled ambiguity statistically by determining the probabilities of word meanings from previous examples. Sentence patterns were also used recently in Thai-to-Khmer MT (Prasomsuk and Mol 2017). Parsit, the first publicly-available MT system (English to Thai), used an Interlingua representation (cf. Goodman and Nirenburg 1991) based on syntactic and semantic analysis using parts-of-speech (POS), the verb-arguments relationship to determine grammar, and semantic case relations. Thai language production used both syntactic and semantic generation modules, before any remaining unknown words were transliterated. The system supported both simple and compound sentences but not complex sentences. The C4.5 classification technique (Quinlan 1993) was added to improve performance by creating a rule set forming a decision tree to learn the correct meanings of ambiguous words (Sornlertlamvanich et al. 2000a). The Parsit system suffered from two main problems: incorrect word meanings (81.74%) and word order (18.26%) (Modhiran et al. 2005). For a detailed breakdown of the linguistic problems found in Parsit translations, please consult Phaholphinyo et al. (2005).

In 2002 three approaches to Thai translation were defined as structural transfer, semantic transfer (Parsit) and lexical transfer (the pattern-based approach) (Boonkwan and Kawtrakul 2002). Plaesarn, an Internet-based translation assistant, used structural transfer dependent on syntactic analysis. It was an English-to-Thai translation tool that used parse trees and transfer rules with probability values attached to each transfer rule. Manually separated sentences were parsed into a syntactic parse tree before a classifier matching algorithm used the head noun to parse the tree into Thai and add the appropriate classifier. In 2005 the method of creating a Thai–English PBSMT model was reported, although no bilingual corpus existed that was large enough to train a statistical MT system (Netjinda et al. 2009). Kritsuthikul et al. (2006) introduced the first English-to-Thai example-based MT using an n-gram model. They prepared a bilingual corpus and used n-grams to locate ‘patterns’ of partial sentences. The system continued with a two-stage analysis and generation approach using the word segmentation tool SWATH (Smart Word Analysis for THai)Footnote 6 and a monolingual corpus to assist in choosing alternative examples. In Tongchim et al. (2008), an example-based MT system (cf. Nagao 1984; Carl and Way 2003) for Thai and Japanese translation using dependency structure involved research to develop a syntactically annotated corpus and a parser, using Support Vector Machines (SVMs) on word and POS features. They stated that the lack of suitable corpora contributed to the lack of research in Thai MT. A speech-to-speech (S2S) translation project included an effort to further develop the Parsit MT system for Thai-to-English translation. It adopted a translation memory (TM) module in which translation results corrected by users were stored and reused. In addition to the rule-based approach, SMT was also explored.
A collection of 200,000 pairs of Thai–English sample sentences was taken from dictionaries to be used for a general evaluation of Thai–English SMT and to develop a TM engine (Wutiwiwatchai et al. 2008). By 2009 the first English-to-Thai speech translation service had been developed by NECTEC (Wutiwiwatchai et al. 2009), using the Moses tool to build the English–Thai SMT engine. NECTEC continued MT research as part of speech-to-speech translation projects, both within Thailand as part of a bilingual Thai/English TTS (text-to-speech) system (Wutiwiwatchai et al. 2017), and internationally in a multilingual speech translation mobile application developed under an international collaboration (Wutiwiwatchai 2015) and for ASEAN (Wutiwiwatchai et al. 2013). The use of the Moses SMT Toolkit is cited in several other publications involving Thai MT (e.g. Labutsri et al. 2009; Mai et al. 2014; Wutiwiwatchai 2015).

The lack of suitable corpora is cited as a major hurdle to obtaining a reasonable performance level for Thai MT. The BEST corpus was used in the work of Supnithi et al. (2010) to address the problem of long sentences. They commented that previous research had access to only one resource, the Orchid corpus (Sornlertlamvanich et al. 1997). Orchid contained approximately 43,000 sentences covering 568,316 words from Thai junior encyclopedias and NECTEC technical papers, with several annotations including part-of-speech (POS), word and sentence boundaries, and pronunciation. Kawtrakul et al. (2002) listed resources such as dictionaries and corpora including the NAiST text corpus (Kawtrakul et al. 1995), which consisted of 60,511,974 words with word and sentence boundary tags and was created with the primary aim of collecting magazine documents for training and evaluating a writing assistance system. The availability of the BEST corpus has had a significant impact on Thai word segmentation, but a similar resource does not exist for Thai MT. Recent research has reported using either the parallel text from the TED talks used in the IWSLT 2015 MT evaluation,Footnote 7 or the 20,000 sentence-pair corpus on the travel domain available from the ASEAN-MT website. The IWSLT corpus was provided for evaluation of Thai–English MT systems but there were no submissions, and the ASEAN-MT corpus is domain-specific. There are proprietary corpora, data used by NECTEC, and a few other reported uses of parallel text for Thai MT (see Table 1). For example, researchers at Chiang Mai University and NECTEC reported using Thai–English parallel sentences developed from the Basic Travel Expression Corpus (BTEC), a multilingual speech corpus containing tourism-related sentences, and the HIT London Olympic Corpus, cited as a Chinese–English–Japanese trilingual corpus (Luekhong et al. 2017).
Although it is not possible to determine the completeness of the list in Table 1, it does cover an extensive number of the publications listed in this paper and shows the extent of corpora use by researchers for Thai–English MT. There remains a web interface to the Sealang library bitext corpus,Footnote 8 although the application of this corpus specifically to MT is not reported.

Table 1 Parallel text corpora used for Thai-to-English or English-to-Thai MT

The BLEU score (Papineni et al. 2002) is the predominant reported metric for MT research. Denkowski and Lavie (2010) state that BLEU scores above 30 generally reflect understandable translations (Seljan et al. 2012), and it is suggested that MT systems suffer from poorer performance on Asian languages (Kit and Wong 2008). There are few reliable BLEU scores for Thai-based translation, and it is difficult to gain a reasonable assessment of the quality of MT systems involving Thai and English with reported BLEU scores ranging from 2.6 (Kritsuthikul et al. 2006) to 57.45 (Labutsri et al. 2009). Table 2 includes the BLEU scores of research focusing on the translation of the language pair of Thai and English. There are several other publications that report evaluation for Thai–English MT although some researchers use the process to illustrate the advantages of associated research topics such as word segmentation and state the scores should not be used for comparison with other research. In 2010, research on Thai-to-English translation reported BLEU scores of 23.3 and 21.3 (Th–En) and 19.4 and 18.9 (En–Th) (Slayden et al. 2010a, b), stating they were not aware of any previously published BLEU results for either direction of Thai–English translation. The research was developed to be used for Microsoft’s Bing multilingual SMT system based on a hybrid generative/discriminative model. Other research reports scores closer to the 13.0 level, including a BLEU score of 12.9 (Th–En) (Mai et al. 2014), 13.0 (En–Th) (Nathalang et al. 2010), and 13.5 (En–Th) (Porkaew et al. 2008). In 2016, five large-scale MT systems providing Thai-to-English translation were evaluated using several methods with BLEU scores ranging from 13.2 to 20.9 (Lyons 2016b). Significantly higher scores are seen in Luekhong et al. (2016) during tenfold cross validation experiments comparing phrase-based and HPBT (Hierarchical phrase-based translation, cf. 
Chiang 2005) with the best performance achieved using the HPBT approach. These results are comparable to the BLEU score of 40.1 (En–Th) also during tenfold cross validation tests achieved by the state-of-the-art PBSMT system reported in Wutiwiwatchai et al. (2009) that used the Moses toolkit trained on parallel text corpora containing more than 100,000 sample sentences from Thai–English dictionaries and the English–Thai BTEC corpus. BLEU scores of 38.6 (En–Th) and 35.45 (Th–En) were reported in Pa et al. (2016) using the ASEAN parallel corpus, that improve on the baseline metrics for HPBT, 25.8 (En–Th) and 25.3 (Th–En) that are stated on the ASEAN MT website,Footnote 9 although these scores are influenced by the domain-specific nature of the corpus. The results for PBSMT were also above the website’s standard scores with 37.3 (En–Th) and 36.98 (Th–En) improving on the baseline scores of 27.9 (En–Th) and 24.4 (Th–En). Other language pairs involving Thai (e.g. Thai-Lao, Thai-Vietnamese) have received attention in recent research although these publications are not freely available. The BLEU score of 21.83 for Chinese-to-Thai translation and 15.0 for Thai-to-Chinese were reported in Luekhong et al. (2012). Finally, there were BLEU scores reported using NMT in comparison to HPMT at the end of 2019 in a study of the effects of using neural networks in the Thai–English MT process (Luekhong et al. 2019). The paper states tenfold cross validation was applied for word segmentation and the translation used an 80–10-10 ratio for training, tuning and testing.
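For reference, the BLEU metric behind these scores combines clipped n-gram precisions with a brevity penalty (Papineni et al. 2002). A minimal single-reference sketch, without the smoothing that evaluation toolkits typically add:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(hypothesis, reference, max_n=4):
    """Sentence-level BLEU: geometric mean of clipped (modified) n-gram
    precisions, scaled by a brevity penalty. Single reference, no smoothing,
    so any missing n-gram order zeroes the whole score."""
    log_precision = 0.0
    for n in range(1, max_n + 1):
        hyp, ref = ngrams(hypothesis, n), ngrams(reference, n)
        clipped = sum(min(count, ref[g]) for g, count in hyp.items())
        total = max(sum(hyp.values()), 1)
        if clipped == 0:
            return 0.0
        log_precision += math.log(clipped / total) / max_n
    bp = min(1.0, math.exp(1 - len(reference) / len(hypothesis)))
    return bp * math.exp(log_precision)
```

Reported scores are computed over a whole test corpus rather than per sentence, which is one reason single-sentence BLEU values are so volatile.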

Table 2 Reported BLEU scores for Thai–English machine translation

It is difficult to make comparisons using the reported scores due to a range of factors including the use of domain-specific text, corpus size, and differences in evaluation techniques such as manual amendments. Discounting the experimental approaches, there are two levels of performance, around BLEU scores of 13 and 20. Mai et al. (2014) used a 400,000 (400 K) sentence corpus and produced a score of 12.9, similar to the scores of 13.5 and 13.0 using the NECTEC corpus of between 160 and 200 K sentences. BLEU scores above 20 were achieved by global organizations with substantial resources, such as Microsoft in 2010 using a 725,000-sentence corpus, and the primary commercial MT systems reported in Lyons (2016b). The experimental results are also not comparable, such as the research using the same corpus reporting scores falling from 41.3 and 40.5 to 28.5 and 29.5 with the same HPBT approach during tenfold cross validation tests. Thai NMT research provides just one publication reporting BLEU scores for NMT, ranging between 10 and 16 depending on the application of additional techniques, so it would be inadvisable to attempt to compare these scores with BLEU scores achieved using SMT. In conclusion, to compare research, either between SMT and NMT or on the impact of corpus size on Thai MT performance, a greater amount of published research is required, preferably using freely available corpora and supervised evaluation such as participation in workshop evaluation campaigns.

The evaluation of MT systems is a research area in itself (cf. Way 2018), with most researchers using the BLEU metric for its simplicity and widespread use, to such an extent that SMT systems are often fine-tuned to achieve the highest BLEU score. Few Thai MT research publications report alternative evaluation metrics, although several approaches to MT evaluation to determine the quality of Thai-to-English translation were reported in Lyons (2016b). These included the BLEU metric, error classification, and the human-based evaluation approaches of reading comprehension and the analysis of a professional translator. The translator’s analysis, whilst insightful, would not be considered a viable alternative evaluation technique for MT research due to logistical difficulties and ambiguity, whereas error classification indicates translation errors rather than performance (cf. Lommel 2018). Whilst reading comprehension did reflect the quality of the translations, the approach has issues of subjectivity and lacks widespread use with standardized procedures. There are no significant reviews of alternative evaluation of Thai–English MT outside of this publication.
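To make the metric concrete, the following is a minimal sketch of sentence-level BLEU: the geometric mean of modified n-gram precisions combined with a brevity penalty. This is an unsmoothed, simplified version for illustration only; published research would normally use a standard implementation (e.g. the sacreBLEU or NLTK packages), and the function names here are our own.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(hypothesis, reference, max_n=4):
    """Sentence-level BLEU: geometric mean of modified n-gram
    precisions (n = 1..max_n) times a brevity penalty."""
    precisions = []
    for n in range(1, max_n + 1):
        hyp, ref = ngrams(hypothesis, n), ngrams(reference, n)
        # clip each hypothesis n-gram count by its count in the reference
        overlap = sum(min(count, ref[g]) for g, count in hyp.items())
        total = sum(hyp.values())
        precisions.append(overlap / total if total else 0.0)
    if min(precisions) == 0.0:   # unsmoothed: any zero precision gives 0
        return 0.0
    brevity = 1.0 if len(hypothesis) >= len(reference) else \
        math.exp(1 - len(reference) / len(hypothesis))
    return brevity * math.exp(sum(math.log(p) for p in precisions) / max_n)
```

The brevity penalty is what prevents a system from gaming the precision terms with very short outputs, one reason BLEU rewards fluent word order as discussed above.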

3.1 Neural machine translation

Despite the suitability of and global interest in NMT, it is not widely described in academic publications in Thailand due to the lack of parallel corpora required for training. There is research on NMT systems for the Thai language from companies with adequate resources, although they are not inclined to disclose details for commercial reasons. Published academic research includes a description of a compression technique called knowledge distillation that reported results on Thai-to-English translation using an NMT approach (Kim and Rush 2016). The system was trained on the Thai–English IWSLT 2015 datasetFootnote 10 using the attention-based architecture described in Luong et al. (2015a). The knowledge was distilled from a larger, more complex teacher network to a smaller student network that approximated the function learnt by the teacher. The baseline BLEU score of 10.6 for the student increased to 14.4 using various techniques, including an increased beam size, knowledge distillation, and sentence-level interpolation, where the training data is integrated back into the process. In language processing tasks such as MT, the prediction required is a sequence, so systems use a probability distribution to determine the best possible output. Given the exponential nature of the task, a heuristic to limit the search space is required, such as a beam search, where only the most likely partial outputs within the defined beam are explored. The teacher produced a BLEU score of 15.7, rising to 16.0 with fine-tuning using sentence-level interpolation. The student score of 10.6 is consistent with the performance seen in experiments using NMT published on a Thai language processing website (Tanruangporn 2017). In experiments using the TED talks dataset of about 80,000 parallel sentences (Cettolo et al. 2012), several approaches based on the input format of the Thai text were applied using an NMT system.
These included byte-pair encoding (BPE), word-level and character-level input, and Thai character clusters (TCCs). The system, based on Lee et al. (2017) and the attention-based model (Luong et al. 2015a), used gated recurrent units (GRUs) instead of LSTM units, and fewer of them. The BLEU score of the word-based approach was 10.7, the TCC approach 10.3, BPE 9.88, and the character-based approach 7.7. In recent research, a comparison of SMT and NMT used a bi-text corpus of 149,000 Thai–English sentences, a bidirectional neural network (BNN) for word segmentation post-edited by linguists, and the OpenNMT toolkit (Klein et al. 2017) with some language-specific adjustments (Luekhong et al. 2019). The reported BLEU scores for NMT were 0.421 (Th–En) and 0.442 (En–Th), in comparison to the HPBT scores of 0.285 (Th–En) and 0.294 (En–Th).
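The beam-search decoding mentioned above can be sketched as follows. This is a toy illustration: `step_probs` is a stand-in for a trained decoder’s next-token distribution, not the model of any cited system.

```python
import math

def beam_search(step_probs, beam_size=3, max_len=5, eos="</s>"):
    """Keep only the beam_size highest-scoring partial sequences at each
    step; step_probs(prefix) -> {token: probability of the next token}."""
    beams = [([], 0.0)]  # (token sequence, cumulative log-probability)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq and seq[-1] == eos:      # finished hypotheses carry over
                candidates.append((seq, score))
                continue
            for tok, p in step_probs(seq).items():
                candidates.append((seq + [tok], score + math.log(p)))
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:beam_size]
    return beams[0][0]

# Toy next-token distribution standing in for a trained decoder.
def toy_probs(prefix):
    table = {
        (): {"a": 0.6, "b": 0.4},
        ("a",): {"c": 0.9, "</s>": 0.1},
        ("a", "c"): {"</s>": 1.0},
        ("b",): {"</s>": 1.0},
    }
    return table.get(tuple(prefix), {"</s>": 1.0})
```

A beam size of 1 reduces this to greedy decoding; increasing the beam, as in Kim and Rush (2016), widens the explored search space at a linear cost in computation.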

Given the need to obtain large corpora for the best performance of NMT systems, it is not surprising that these are achieved by the large-scale systems from the international providers of Thai translation. Local research found that students had access to many translation applications and services, but these produced just five unique translations that originated from Google, Microsoft (Bing), Baidu, Naver (Line) and one other source (Lyons 2016a). These providers have published research detailing the deployment of NMT and list Thai–English as a supported language pair, such as for Google,Footnote 11 Microsoft,Footnote 12 Baidu,Footnote 13 and NAVER.Footnote 14 Although publications associated with these systems do not detail Thai MT specifically, the research does inform us of the approaches used in the NMT systems deployed to translate the Thai language. The field of NMT is rapidly evolving and a comprehensive study is outside the scope of this paper, so the following details relate to the MT systems known to provide Thai–English translation. The NAVER MT system is described in Lee et al. (2015). It applied a mixture of tree-to-string syntax-based SMT to English–Japanese translation, PBSMT to Korean–Japanese translation, NMT applied to re-ranking these SMT systems, and an NMT system using the bidirectional RNN encoder/decoder architecture with attention mechanism. The system took ten days to train, used an approach taken from Luong et al. (2015a, b) and Jean et al. (2015), and employed a character vocabulary in preference to a rare or unknown word symbol.

Issues caused by morphological forms such as the use of affixes, and unknown and rare words, have led research to broaden from word input and output to the use of sub-words as well as character-level approaches (Chung et al. 2016). This has advantages not only for multilingual translation but also for languages that require word segmentation, such as Thai. There are character-based NMT models that use character embedding only for the source language, character-level decoders that use character embedding for the target language, and full character-level NMT systems (Kazimi 2017). NMT using the attention model has been reported to produce improved performance with the character-based approach (Lee et al. 2017; Sennrich et al. 2016). At the time of writing, sub-word NMT is the state-of-the-art for large-scale industry systems such as Google’s GNMT system (Wu et al. 2016). GNMT divides words into a limited set of common sub-word units and consists of an LSTM network with 8 encoder and 8 decoder layers connected by an attention mechanism. The training time is also accelerated by special hardware (Google’s Tensor Processing Unit). This model had a BLEU score of 38.95 on the English–French task at WMT 2014, whilst a Baidu system later reported a BLEU score of 37.7, rising to 40.4 after handling unknown words (Zhou et al. 2016a, b).
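The sub-word idea can be illustrated with the merge-learning procedure of BPE (Sennrich et al. 2016): starting from characters, the most frequent adjacent symbol pair is repeatedly merged into a new symbol. The sketch below follows that published procedure in spirit; the toy vocabulary and end-of-word marker `</w>` are illustrative.

```python
import re
from collections import Counter

def learn_bpe(vocab, num_merges):
    """vocab maps space-separated symbol sequences (one word each, with an
    end-of-word marker) to corpus frequencies; returns the learned merges."""
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, freq in vocab.items():
            symbols = word.split()
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)   # most frequent adjacent pair
        merges.append(best)
        # merge the pair wherever it occurs as whole adjacent symbols
        pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(best)) + r"(?!\S)")
        vocab = {pattern.sub("".join(best), word): freq
                 for word, freq in vocab.items()}
    return merges, vocab

# Toy vocabulary: with these frequencies the first learned merge is ("e", "r").
vocab = {"l o w </w>": 5, "l o w e s t </w>": 2,
         "n e w e r </w>": 6, "w i d e r </w>": 3}
```

Rare words are then segmented into these learned units rather than mapped to an unknown-word symbol, which is why the approach suits morphologically problematic input.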

The length of time required to train a large-scale system and the number of language pairs have added to the interest in using several languages in the translation process. In research using a multilingual approach, the NMT system learns weights from other languages before being applied to the target language, which is especially useful for low-resource languages. The multi-language approach was explained in detail in Firat et al. (2016).Footnote 15 Other research includes a multi-task learning framework from Baidu (Dong et al. 2015) based on a sequence model from one source language to multiple target languages, sharing both the encoder and attention (Niu et al. 2018), and Google’s zero-shot system that supported many-to-many translation directions by simply attaching a token stating the target language (Johnson et al. 2017). Baidu also introduced the use of minimum risk training (MRT) (Shen et al. 2015), shared word alignment matrices on the same training data in Cheng et al. (2015), proposed models to incorporate word reordering knowledge in NMT (Zhang et al. 2017), and used a Multi-Channel Encoder (MCE) able to focus both on a word and its surrounding context when appropriate (Xiong et al. 2018). MRT uses evaluation metrics to minimize the expected loss on the training data, whilst MCE looks at different levels of composition of sequences (e.g. entities, idioms) for the encoder and the attention mechanism. The research field of NMT is evolving so rapidly that the RNN-based approaches included here have reportedly been outperformed by both the convolutional sequence-to-sequence model and the Transformer model (Chen et al. 2018). Despite the difficulty of reporting relevant and up-to-date research, the use of NMT is state-of-the-art for large-scale industry translation systems, and these provide Thai–English translation.

4 Discussion

A discussion of MT in Thailand is not complete without reference to the associated research topics that have had a constant influence on research. Historically, MT systems have used approaches based on words and sentences, so a significant quantity of Thai MT research has focused on word and sentence segmentation. The primary method to increase the performance of Thai MT using current techniques is the availability of training data in the form of parallel text. To create suitable corpora, sentences must be aligned, incorporating alignment techniques, sentence segmentation, and potentially word segmentation and unknown-word issues during the sentence-matching process. We started to see the impact of the research in these areas during the SMT era, when evaluation began using the BLEU score. The recent rise in research involving the application of neural networks has affected Thai research specifically in segmentation, more generally in language processing, and to a lesser extent in Thai NMT. We can follow the progression of these areas relevant to Thai–English MT from the early years, through the statistical era, to the ongoing research that exists today.

Early researchers in Thai MT highlighted several areas of concern that created difficulties, including word fertility, word order, unknown words, word segmentation, and alignment due to sentence segmentation issues (Chancharoen et al. 1999). The impact of these research areas on the performance of early Thai MT systems is not clear, as no evaluation scores were reported with which to compare approaches, systems, performance or improvements. It was stated that there was no test set suitable for evaluation as there were no existing Thai MT systems to evaluate (Modhiran et al. 2005). The early published research recognized the importance of word segmentation and used dictionary-based techniques that required it, yet it was only in 2005 that the Parsit MT system was reported to use the SWATH word segmentation tool. SWATH offers the three traditional methods of word segmentation: matching the longest words, called longest matching (Pooworawan 1986); matching the fewest words in a sentence, called maximal matching (Sornlertlamvanich 1993); and POS n-grams, with Parsit preferring the longest matching technique. The SWATH tool was originally used to segment the BEST corpus, although the “laborious” manual corrections led to the development of the segmentation verification tool (Klaithin et al. 2011), and it has remained popular due to its availability and ease of use.
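The longest matching technique can be sketched as a greedy dictionary scan. This is a minimal illustration with a toy dictionary, not the SWATH implementation; the single-character fallback is one simple way to keep scanning past out-of-dictionary text.

```python
def longest_matching(text, dictionary):
    """Greedy left-to-right segmentation: at each position take the
    longest dictionary word, falling back to a single character."""
    words, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in dictionary or j == i + 1:
                words.append(text[i:j])
                i = j
                break
    return words

# Toy dictionary: ไป (go), โรง (building), เรียน (study), โรงเรียน (school)
dictionary = {"ไป", "โรง", "เรียน", "โรงเรียน"}
```

Because the scan is greedy, "ไปโรงเรียน" segments as ["ไป", "โรงเรียน"] rather than splitting the compound, which hints at why dictionary coverage and ambiguity dominate the performance of these early systems.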

Rare and unknown words and word boundary ambiguity are the major hurdles for Thai word segmentation systems. There are several ways that a text can be split whilst forming valid Thai words, and these ambiguities result in systems either missing word boundaries or inserting incorrect word separators. Thai text includes foreign words either kept as loan words or transliterated into the Thai script, compound words, and rare words that are additionally problematic when part of noun phrases. These are all examples of words omitted from dictionaries, known as ‘unknown words’ or ‘out-of-vocabulary’ items. The segmentation task is not straightforward even for human annotators, as seen in the inconsistencies when Thai text is manually segmented. Coverage of these problems can be seen in Meknavin et al. (1997) and Aroonmanakun (2002). Early dictionary-based segmentation systems based on either maximal or longest matching would suffer considerable performance loss with a higher percentage of unknown words, so feature-based methods and the use of decision trees based on TCCs were attempted (Theeramunkong and Usanavasin 2001). Corpus-based methods could detect an unknown word by observing its co-occurrence frequency, but other substrings required manually created rules to determine unknown-word boundaries. Kampanya et al. (2002) addressed the unknown word problem using word frequency, word location and a K-vec algorithm to identify candidate word pairs. They used a pre-processing stage with both word and sentence segmentation for Thai, and inflection issues resolved for English; word frequency distribution and location then indicated a relationship between words in both texts. The sentences in the texts were aligned by matching the known words within the sentences and the position of the sentence, followed by the alignment of the unknown words. The alignment of phrases was added later (Kawtrakul and Boonkwan 2004).
This approach was used in Plaesarn (Boonkwan and Kawtrakul 2002) whilst Parsit transliterated unknown words from Thai-to-English (Modhiran et al. 2005). Later researchers created a framework for constructing a Thai unknown-word open dictionary from the web (Haruechaiyasak et al. 2006).

Conventionally in Thai writing, a space is placed at the end of a sentence, but a space does not always indicate a sentence boundary. In Thai sentence segmentation, a binary classification determines whether a character is a sentence break or a non-sentence break using the context of the text around the character’s position. Mittrapiyanuruk and Sornlertlamvanich (2000) used a corpus annotated with POS tags to train a trigram model. An extension of the algorithm was proposed in which collocations of surrounding words and different n-gram lengths of surrounding tokens were used as features; these features were extracted automatically using the Winnow algorithm to improve the effectiveness of segmentation (Charoenpornsawat and Sornlertlamvanich 2001). Later research did not assume that the sentence boundary was always a space, and used a word-labelling approach that treated the space character as a normal word and detected whether the break between words was a sentence break (Zhou et al. 2016a, b). Other sentence segmentation research includes using a modified transition network with Prolog’s definite clause grammar rules in Netisopakul and Keawwan (2007), whilst Aroonmanakun (2007) suggested it might be more practical to segment text into discourse segments composed of clauses rather than sentences.
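The binary classification described above depends on features drawn from the context around each candidate break. A minimal sketch of such feature extraction follows; the window size, feature names and padding symbol are illustrative, not those of the cited systems.

```python
def break_features(tokens, i, window=2):
    """Context features for classifying whether the boundary before
    tokens[i] is a sentence break: surrounding tokens within a window."""
    feats = {}
    for k in range(1, window + 1):
        # tokens to the left and right of the candidate boundary
        feats[f"left{k}"] = tokens[i - k] if i - k >= 0 else "<pad>"
        feats[f"right{k}"] = tokens[i + k - 1] if i + k - 1 < len(tokens) else "<pad>"
    return feats
```

Feature dictionaries of this form would then feed any standard classifier (the cited work used trigram models and the Winnow algorithm) to label each candidate boundary as break or non-break.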

4.1 The statistical era

The first BLEU scores appeared in publications between 2008 and 2010, when Thai MT had adopted the statistical approach and several events led to its advancement. The availability of the Moses SMT toolkit and compatible tools facilitated research in languages with fewer resources, including Thai. Several national and international projects led to Thai–English SMT systems with reported evaluation scores, such as in speech-to-speech translation research. This era also included the most significant event for Thai word segmentation research, which produced approaches that were subsequently used in Thai MT: in 2009 NECTEC invited researchers to segment a 5-million-word corpus in the BEST Thai word segmentation challenges (Kosawat et al. 2009). These systems, and recent systems based on neural networks, represent the two prominent approaches currently used for Thai word segmentation.

The most successful approaches to Thai word segmentation have used Conditional Random Fields (CRFs), a machine learning approach that treats segmentation as a sequential supervised learning problem. Thai segmentation research using CRFs was first published in Kruengkrai et al. (2006), based on previous work (Sornlertlamvanich 1993) with the addition of POS tagging. The approach stored all the possibilities of words and their POS to generate a ‘lattice’ of words and POS tags, then applied a technique to identify the optimal path through this lattice. In Haruechaiyasak et al. (2008), the word segmentation task was treated as a binary classification problem: a corpus of Thai text had each character tagged as either a word beginning (B) or an intra-word character (I). This research used character types, seen in Fig. 6, based on their linguistic properties and placement, in preference to POS tags. The approach was not reliant on a dictionary and so dealt better with unknown words and ambiguity, although it was dependent on the quality, size and domain coverage of the corpus used for training. In this research, the CRF algorithm also outperformed other machine learning algorithms, including Naïve Bayes (NB), Support Vector Machines (LIBSVM) and decision trees (J48). The following year this approach was improved to create the Thai Lexeme Analyser (TLex), with an additional ‘combination’ feature in which the character and its type were combined (Haruechaiyasak and Kongyoung 2009). Suesatpanit et al. (2009) also used character types, but with four labels, beginning (B), ending (E), inner (I) or single (S) character, and merged the characters with their character function. The top performance at the BEST segmentation challenges was based on a word and character-cluster hybrid model (Kruengkrai et al. 2009). The previous lattice of words and POS became a lattice of word- and character-cluster nodes, with word-level nodes dealing with known-word ambiguities and the character-cluster nodes dealing with unknown words.
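The B/I character-tagging scheme of Haruechaiyasak et al. (2008) can be illustrated by the encoding and decoding between segmented words and per-character labels; the sequence classifier itself (a CRF over character and character-type features) is omitted here, and the helper names are our own.

```python
def words_to_tags(words):
    """Encode a word-segmented sentence as per-character B/I labels."""
    chars, tags = [], []
    for word in words:
        for idx, ch in enumerate(word):
            chars.append(ch)
            tags.append("B" if idx == 0 else "I")  # B starts a word
    return chars, tags

def tags_to_words(chars, tags):
    """Decode predicted B/I labels back into words."""
    words = []
    for ch, tag in zip(chars, tags):
        if tag == "B" or not words:
            words.append(ch)       # start a new word
        else:
            words[-1] += ch        # extend the current word
    return words
```

Under this encoding, segmentation reduces to predicting one label per character, which is exactly the sequential supervised learning problem the CRF solves.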

Fig. 6 Character types used for word segmentation. Adapted from Haruechaiyasak et al. (2008)

The approaches to word segmentation seen in the BEST challenges were subsequently used in Thai MT research. Although word segmentation was not described in the paper, the use of a Thai word segmentation toolkit called WordSeg,Footnote 16 using the lattice described above, was cited in Porkaew et al. (2008). Thus, the word segmentation approach can be linked to one of the first reported evaluation scores, seen when the research improved on a baseline SMT system for English to Thai with an increased BLEU score of 13.50 from 13.11 using a training corpus of 160,000 sentence pairs. This was part of the speech-to-speech MT research at NECTEC also reported in Wutiwiwatchai et al. (2009) and Nathalang et al. (2010). A BLEU score of 40.1 on tenfold cross-validation using a 160 K sentence corpus was quoted in the first, whilst Nathalang et al. reported a BLEU score of 13 using an increased corpus size of 200,000 sentences. A comparable score of 12.93 was later presented in research using Moses (Mai et al. 2014). In a comparison of automatic speech recognition systems, the research stated that the improvement to word segmentation increased the BLEU score by 5–6% for a Thai–English SMT system (Charoenpornsawat and Schultz 2008). BLEU scores ranged from 43.64 to 47.76 using an SMT toolkit on a corpus of 300,000 sentence pairs, with the SWATH segmentation system taking the maximal matching approach and Pharaoh (Koehn 2004), a predecessor of the Moses SMT toolkit, for phrase extraction.

During 2014, researchers compared the performance of six Thai word segmentation systems, with CRF remaining the highest-performing approach and TLex the top system (Noyunsan et al. 2014). Advancement in the standard segmentation methods since the BEST challenges was seen when research merged machine learning and dictionaries in TLex+, a hybrid system based on TLex (Kongyoung et al. 2015). The system used three dictionaries: one in a pre-processing stage to deal with long expressions and long named entities (NEs), one in post-processing to identify unknown words and correct segmentation errors, and an NE dictionary to merge multi-word entities into one word. The performance level was given as an F-measure of 97.5%. In recent research, unknown words were identified using a set of 28 rules based on Thai language principles from Thonglor (1972), and a feature-based approach for ambiguity (Mahatthanachai et al. 2016). Finally, the CRF approach was improved by splitting suitable words to create a greater number of POS-tagged words, and by a dictionary-based post-processing stage that located compound words (Nararatwong et al. 2018). The significant contribution to Thai language processing from Wirote Aroonmanakun at Chulalongkorn University includes a Thai Language Toolkit called TLTK, in Python, that uses a maximum collocation approach for word segmentation,Footnote 17 whilst alternative methods have included the use of a lexical semantic approach (Khankasikam and Muansuwan 2005), the application of a generalized LR parsing technique to merge Thai character clusters (Limcharoen et al. 2009), and the employment of a Hidden Markov Model (HMM) to merge syllables and decision trees to identify non-dictionary words (Bheganan et al. 2009).

The difficulty of English-to-Thai SMT was discussed in Nathalang et al. (2010). They investigated problems in translation described in Porkaew et al. (2008), reporting the use of an English-to-Thai corpus of 1.3 million words and producing a BLEU score of 13. The study found two-thirds (67%) of the translations were both inaccurate and unintelligible, and categorized the problems as lexical meaning (40% of instances), word-meaning mismatch (33%), morpho-syntactic features (10%), and others (40%). These included problems such as words not existing in both cultures, the use of affixes, sentence structure differences, word order and incorrectly added words. The study concluded that using both statistical and rule-based MT may enable MT systems to translate more effectively. Meechoonuk and Rakchonlatee (2001) provided a list of linguistic problems that included omission of words, additional words, mismatched concepts and inappropriate literal translation. The category of ‘mismatched concept’ was the most common linguistic problem, found in 34% of cases (Supnithi et al. 2002). In addition to grammatical errors and misplaced modifiers, there were also issues with different semantic segmentation between the source and target languages. A lack of information such as “insufficient definitions of idioms, two-word verbs, and phrasal verbs” and “insufficient dictionary definitions” was also cited. Other difficulties reported later included different meanings due to segmentation errors or final-word exclusion, and problems with compound words, plurals, order, false negation and pronoun omission (Lyons 2016b).

The ordering of Thai words differs from English (see Sect. 2.1) and word meaning can depend on position in a sentence. Therefore, word order is problematic for Thai MT in two ways: locally, between adjacent words, and globally, over the whole sentence. The first issue is resolved to some extent using phrases in SMT and the attention mechanism in NMT. SMT systems can often be adjusted to produce higher BLEU scores, as this metric rewards better word order. A preprocessing step using reordering rules for an English–Thai PBSMT system was proposed in Labutsri et al. (2009), and a set of reordering rules specifically for noun phrases in Wutiwiwatchai et al. (2009), which also proposed the use of Categorial Grammar (CG: Steedman 1987) in a syntactic parser. The syntactic category of a word can change depending on the word order, which is detrimental to POS-tagging, so alternative forms of syntactic analysis of Thai include representations such as CG that are useful for resolving word order issues in translation (Supnithi et al. 2010). Because Thai is analytic and uses auxiliary particles and word order as opposed to inflection, the attributes of function words are necessary to derive a parse tree for Thai, such as in rule-based MT (Ruangrajitpakorn et al. 2007). Chimsuk and Auwatanamongkol (2009) proposed the use of Lexical Functional Grammar (LFG) for MT (cf. Kaplan et al. 1989) using an Interlingua approach: Thai is translated into an LFG tree as an Interlingua, transformed into an English LFG tree by pattern matching and node transformation, which is then used to form the English translation, with a C-structure used for the word order.
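A toy illustration of the kind of reordering rule used in such preprocessing: English places adjectives before nouns whereas Thai places modifiers after the noun, so a rule over POS-tagged input can swap each pair. This specific adjective-noun rule is a simplified invention for illustration, not one of the published rule sets.

```python
def reorder_np(tagged):
    """Swap each ADJ NOUN pair so the English NP order matches
    Thai order (noun before modifier)."""
    out = list(tagged)
    i = 0
    while i < len(out) - 1:
        if out[i][1] == "ADJ" and out[i + 1][1] == "NOUN":
            out[i], out[i + 1] = out[i + 1], out[i]
            i += 2   # skip past the swapped pair
        else:
            i += 1
    return out
```

Applying such rules before training or decoding brings the source word order closer to the target's, reducing the long-distance reordering that phrase-based models handle poorly.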

The impact on MT can be seen in the reported improvement in BLEU score from 13.11 to 13.50 by reordering noun phrases (Porkaew et al. 2008), and in the use of the preprocessing step with reordering rules resulting in a performance increase from 40.05 to 57.45 reported in Labutsri et al. (2009) for English–Thai translation using Moses. Later, in 2014, an English–Thai PBSMT system using Moses reported an increase of 8.1% in BLEU score resulting from word segmentation improvements (Sutantayawalee et al. 2014). This used character grouping instead of focusing on word separation, 650,000 sentence pairs (633 K for training) and a 20,000-sentence travel-domain (ASEAN) corpus. BLEU scores rising to 49.56 were based on character clusters rather than words, so the authors suggest the scores cannot be compared with other systems. A further increase in BLEU score from 37.12 to 40.13 was achieved on a test involving the 650 K corpus using ‘bilingually-guided alignment information’.

A bidirectional system for Thai and English translation using a hybrid generative/discriminative model for SMT achieved reasonable BLEU scores of 23.3 (Th–En) and 19.4 (En–Th) in international research (Slayden et al. 2010a). It used a maximum entropy classifier with features described as a model with a four-token window of Thai lemmas plus ‘categorical’ features. Given the availability of parallel document pairs, Thai sentence boundaries were located using a probabilistic approach that identified the segments of Thai text aligned to each of the English sentences (Slayden and Luqman 2010). In addition, the identification of person names and dates assisted the identification of non-sentence-breaking spaces. Word segmentation included manually-selected heuristics applied to the output of Microsoft’s Uniscribe service, and a word-dependent HMM for alignment. Other approaches included a pre-translation “Thai character sequence normalization” module to correct errors in typed characters, and a post-translation re-spacing module for Thai target text output. The system incorporated linguistic information and analysis for the English text but did not use Thai linguistic information. Although the resources included a total of 125 million word tokens in training, 9.6 million Thai sentences and a parallel text of 725 K sentence pairs compiled from publicly available, purchased and web-crawled content, the research concluded that it remained necessary to “find or create true Thai–English corpora”.

In two unrelated works that compared SMT approaches, the researchers reported different findings whilst using the Moses toolkit for Thai translation. PBSMT, HPBT, string-to-tree, tree-to-string, and the operational sequence model were compared for low-resource languages including Lao, Myanmar and Thai in Pa et al. (2016), whilst the comparison of PBSMT and HPBT was seen for Thai–Chinese translation in Luekhong et al. (2012) and later for Thai–English MT (Luekhong et al. 2016). In the low-resource language research, the HPBT system gained the highest BLEU score for English-to-Thai (En–Th), whilst the phrase-based approach had the highest BLEU score for Thai-to-English (Th–En). When using an alternative evaluation metric called RIBES (Rank-based Intuitive Bilingual Evaluation Score) (Isozaki et al. 2010), the phrase-based approach remained the best performer, although the syntactic tree approach outperformed the others for En–Th translation. In the experiments in Luekhong et al. (2016), the best performance was achieved by the HPBT approach for both directions of Thai and English translation. In these experiments, different n-gram orders were used for the language model: in the tenfold cross-validation results, the hierarchical approach outperformed the phrase-based approach in both translation directions, with the best scores for HPBT using 5-grams and the PBSMT approach preferring 4-grams. The improvement of HPBT over the phrase-based approach was also seen in the translation of Thai and Chinese (Luekhong et al. 2012).

Despite its importance to MT, the performance of Thai–English alignment is considered poor compared with other language pairs when using SMT tools such as GIZA++ for word alignment. Alignment research consists of techniques that are not successful when transferred to Thai, or which rely on resources that are not available for the Thai language. Relevant research has included the ‘Pooja’ system, which improved alignment via dictionary lookup to reduce ambiguity in a lexical pair using a similarity score from a bilingual dictionary (Luekhong et al. 2013), and the identification of grammatical attributes of related Thai function words and English content words (e.g. tense) in a pre-process to GIZA (Phodong and Kongkachandra 2016). The latter used grammatical attributes such as tense, aspect and modality to identify Thai words used for function rather than content, and used this linguistic knowledge to align them to the function expressed by English inflection. For example, the tag ‘past’ can be attributed to an English verb in the past tense and to the function word ‘แล้ว’ (already, in the past), used to mark the past tense in Thai. The alignment using the Pooja system increased the BLEU score of a Thai–English PBSMT system from 21.05 to 21.6 on a corpus of 110,000 sentence pairs. In related research, improvements in alignment were reported to cause an increase of 2.09 BLEU points, improving a baseline score of 32.18 to 34.27, using 149,000 sentence pairs for the Thai–English SMT system (Luekhong et al. 2017).

4.2 Ongoing research

The application of neural networks to language processing tasks such as MT has changed the research landscape. Both SMT and NMT are corpus-based approaches, so the availability of training corpora is the primary requirement for better performance. Research using techniques that are not reliant on corpora exists, such as rule-based Thai–English MT using deep linguistic analysis proposed for a ‘Deep Thai–English MT’,Footnote 18 but it is not extensive. The potential for NMT has attracted growing interest within Thailand, although there are few publications representing this work to date. There is evidence of the successful application of neural networks to the word segmentation task, as well as to other partially-related language processing tasks such as classifying documents into categories during translation (Oupatcha and Thammakoranonta 2014), whilst we can also highlight published work that does include Thai NMT.

There are several systems that use neural networks for Thai word segmentation that are both freely available online and report a similar performance level to that of state-of-the-art CRF-based systems. Although these scores are not reported in academic publications, the systems that use neural networks have achieved a similar level of performance in comparison tests (Lyons 2020). DeepcutFootnote 19 is a CNN-based tokenizer trained from 90% of the BEST corpus and is reported as performing at a similar level to the best CRF system developed at NECTEC (Tanruangporn 2017). Another Thai word segmentation systemFootnote 20 used a bidirectional RNN and was trained by matching a sequence of characters in a sentence with a sequence of manually labelled word boundaries. It used approximately 150,000 sentences from the BEST corpus and is reported as having a comparable performance level.Footnote 21 An F1 score of 0.992 for this system was stated in Letpiya et al. (2018) although the level of performance for word segmentation was reduced to 0.882 when applied on user-generated web content. There are other neural network-based segmenters with similar performance levels listed on a Thai NLP resource website.Footnote 22 Cutkum, a Thai word segmentation system using an RNN, was trained on the BEST corpus using approximately 600,000 words and achieved better performance at character level than at word level.Footnote 23 SynThaiFootnote 24 performs Thai word segmentation and POS-tagging with an RNN using LSTM nodes. The use of an LSTM bidirectional neural network achieved “substantial improvement” on Thai–English and English–Thai Machine Transliteration (Finch et al. 2016). They also used pairs of LSTM RNN sequence-to-sequence transducers that first encode the input sequence into a fixed length vector, and then decode from this to produce the target output. 
Another joint word segmentation and POS tagging system, from research at NECTEC, addresses the problem of rare and unknown words (Boonkwan and Supnithi 2017). It combines analysis of character-level context with morphological information to determine affixes, addressing these issues for both Chinese and Thai translation.
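The character-to-boundary-label formulation used by the RNN segmenters described above can be illustrated with a minimal, framework-free sketch: segmented training text is converted into per-character binary labels (1 marks the first character of a word), which is the supervision format a character-level tagger learns to predict, and the predicted labels are then inverted to cut the unsegmented text. The example sentence and its segmentation are illustrative, not taken from the cited systems.

```python
def to_boundary_labels(words):
    """Convert a segmented sentence (list of words) into per-character
    labels: 1 marks the first character of a word, 0 a continuation."""
    labels = []
    for word in words:
        labels.append(1)
        labels.extend([0] * (len(word) - 1))
    return "".join(words), labels

def from_boundary_labels(text, labels):
    """Invert the encoding: cut the text wherever a label is 1."""
    words, start = [], 0
    for i, label in enumerate(labels):
        if label == 1 and i > start:
            words.append(text[start:i])
            start = i
    words.append(text[start:])
    return words

# "แมวกินปลา" ("the cat eats fish") segmented as แมว|กิน|ปลา
text, labels = to_boundary_labels(["แมว", "กิน", "ปลา"])
assert labels == [1, 0, 0, 1, 0, 0, 1, 0, 0]
assert from_boundary_labels(text, labels) == ["แมว", "กิน", "ปลา"]
```

A trained segmenter replaces the gold labels with per-character predictions from the network, but the round-trip between words and labels is the same.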

The highest F-measure scores for state-of-the-art Thai word segmentation are reported in the region of 97–98%, although these systems are trained and tested on the same dataset. In an independent comparison of six Thai word segmentation programs, three had F-measure scores under 60%, two reached 63%, and one system scored 75.26% (Noyunsan et al. 2014). Research comparing five Thai word segmentation systems, including the three neural network-based systems mentioned above, reported F-measure scores of 91.7, 89.4, 88.6, 88.2 and 87.9 (Lyons 2020). The benefit of an approach that leaves longer terms such as compound words unsegmented, which specifically benefits MT, is not reflected in these evaluations. In recent work on Thai sentence segmentation, researchers adopted the concept of word sequence tagging and reduced the relative error by 7.4% and 10.5% compared with the state-of-the-art methods, which use a CRF-based model with n-gram embedding or the Bi-LSTM-CRF model, currently the preferred deep learning approach for sequence tagging (Saetia et al. 2019). Their deep learning model used n-gram embedding to capture the context of words near sentence boundaries, self-attention modules to represent distant words from dependent clauses, and unlabeled data by adapting Cross-View Training as a semi-supervised learning technique.
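The F-measure figures quoted above are typically computed by matching the character spans of predicted words against gold-standard words; the exact matching criteria vary between the cited evaluations, but a minimal span-based sketch is:

```python
def words_to_spans(words):
    """Map a word sequence to its set of (start, end) character spans."""
    spans, pos = set(), 0
    for w in words:
        spans.add((pos, pos + len(w)))
        pos += len(w)
    return spans

def segmentation_f1(gold_words, pred_words):
    """Span-level F-measure for word segmentation: a predicted word
    counts as correct only if its character span exactly matches a
    gold word's span."""
    gold = words_to_spans(gold_words)
    pred = words_to_spans(pred_words)
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Gold: แมว|กิน|ปลา; the prediction under-segments the last two words
print(round(segmentation_f1(["แมว", "กิน", "ปลา"], ["แมว", "กินปลา"]), 3))
# prints 0.4
```

This also makes the compound-word caveat concrete: a system that deliberately keeps a compound as one unit is penalised whenever the gold standard splits it, even if the longer unit is preferable for MT.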

To improve Thai MT performance it is necessary to use a large amount of parallel data, but such data are difficult to create because the texts must be sentence-aligned, and segmentation problems have a negative impact on this process. Interesting new work in this regard involves replicating sentence pairs with different word segmentation methods on Thai using BPE, for use as NMT training data (Poncelas et al. 2020); their experiments show that combining these datasets improves NMT performance. Sentence alignment also uses techniques that are statistical, linguistic or a combination of both; these are often ineffective for Thai text, involve word segmentation, and again suffer from the ambiguity of translation when determining which texts should be aligned. Either deeper linguistic analysis is required to pair terms, or systems suffer from unknown words. A bootstrapping approach used classification models to select the most probable unknown word from multiple candidates (TeCho et al. 2009), and SMT systems have used bilingual data to assist in unknown word identification (Sutantayawalee et al. 2014). Some SMT systems did not suffer significant performance loss from the incorrect segmentation of unknown words (Charoenpornsawat and Schultz 2008), although in NMT word embedding is adversely affected. The use of bilingual data in the alignment process improved BLEU scores for English-to-Thai SMT to 36.3%, compared with two baseline alignment methods, Gale-Church (22.7%) and Bleualign (15.4%) (Coughlin et al. 2018). There are online resources for English and Thai subtitles, such as the TED talks,Footnote 25 OPUS,Footnote 26 and open subtitlesFootnote 27 websites. This research used a large number of parallel subtitle lines (2.8 million) to initially train the SMT model with Moses, which was then used in a secondary alignment process (with 268 K sentences) incorporating explicit user feedback to correct the alignment, after which the SMT model was retrained.
That work also used an RNN-based word segmentation system (see fn. 21), and concluded that translation could be improved by simplifying Thai pronouns and by using larger and more varied parallel corpora.
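BPE, used by Poncelas et al. to produce alternative sub-word segmentations of Thai text, learns its vocabulary by repeatedly fusing the most frequent adjacent symbol pair in a corpus. The following is a minimal sketch of that learning loop on a toy English corpus (real implementations add an end-of-word marker and apply the learned merges to new text); the corpus and merge count here are arbitrary illustrations, not the cited setup.

```python
from collections import Counter

def merge_pair(symbols, a, b):
    """Fuse every adjacent occurrence of the pair (a, b) in a symbol list."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == (a, b):
            out.append(a + b)
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out

def learn_bpe(corpus_words, num_merges):
    """Learn BPE merge operations: start from single characters and
    repeatedly merge the most frequent adjacent symbol pair."""
    vocab = Counter(tuple(word) for word in corpus_words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        a, b = max(pairs, key=pairs.get)
        merges.append((a, b))
        vocab = Counter({tuple(merge_pair(list(s), a, b)): f
                         for s, f in vocab.items()})
    return merges

print(learn_bpe(["lower", "lowest", "low", "low"], 2))
```

Because the merges are learned purely from symbol co-occurrence statistics, the same procedure applies to unsegmented Thai character sequences, which is what makes BPE attractive as an alternative to explicit word segmentation.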

In 2019, NECTEC opened a service called “AI for Thai” that includes the availability of language resources and consists of services for MT, sentiment analysis, speech-to-text and text-to-speech, character and object recognition, and basic Thai NLP (Tapsai et al. 2019). This publication included details of previous work by the authors, such as improvements to word segmentation covering segmentation efficiency, misspelled words, multiple spelling patterns of names and foreign-language vocabulary, and compound words. There are many other systems for Thai language processing available online and through code repositories, although many of these are not associated with academic publications. Beyond the research detailed in Sect. 3.1, there are few opportunities to disseminate publications on Thai NMT. At the end of 2019 it was stated that “few researches [sic] towards NMT of Thai translation have been conducted but are yet to be published”, and “… little-to-none has been tested and publicly reported” (Luekhong et al. 2019). Research topics such as word and sentence segmentation, alignment, word order and unknown words all have an impact on Thai MT, but the lack of regular evaluation, publicly available corpora, and publications detailing the evaluation of Thai MT systems using these resources makes it difficult to make comparisons and determine their impact. As the standard of Thai NMT and researchers' motivation to publish are linked to the availability of large amounts of training data, ongoing research will hopefully benefit from the increased use and availability of suitable resources.

5 Conclusion

Historically, MT in Thailand has suffered largely because of the linguistic differences between Thai and the languages that have attracted the most research attention, as techniques developed for those languages are not as successful when applied to Thai. Despite this, extensive work, especially on issues such as word segmentation, has produced reported improvements, aided by events such as the BEST word segmentation workshops. Recent success with sub-word and character-level NMT, in conjunction with the potential of multilingual translation, is of interest for research on languages with limited resources or that require segmentation. These recent advancements and the increased availability of resources, such as parallel text, have the potential to improve the performance of Thai–English MT systems.