Abstract
Morphologically rich languages pose a challenge for statistical machine translation (SMT). This challenge is magnified when translating into a morphologically rich language. In this work we address this challenge in the framework of a broad-coverage English-to-Arabic phrase-based statistical machine translation (PBSMT) system. We explore the largest-to-date set of Arabic segmentation schemes, ranging from full word forms to fully segmented forms, and examine their effects on system performance. Our results show a difference of 2.31 BLEU points, averaged over all test sets, between the best and worst segmentation schemes, indicating that the choice of segmentation scheme has a significant effect on the performance of an English-to-Arabic PBSMT system in a large data scenario. We show that a simple segmentation scheme can perform as well as the best, more complicated segmentation scheme. An in-depth analysis of the effect of segmentation choices on the components of a PBSMT system reveals that text fragmentation has a negative effect on the perplexity of the language models, and that aggressive segmentation can significantly increase both the size of the phrase table and the uncertainty in choosing candidate translation phrases during decoding. An investigation of the output of the different systems reveals that the outputs are complementary and that there is great potential in combining them.
Cite this article
Al-Haj, H., Lavie, A. The impact of Arabic morphological segmentation on broad-coverage English-to-Arabic statistical machine translation. Machine Translation 26, 3–24 (2012). https://doi.org/10.1007/s10590-011-9101-1
DOI: https://doi.org/10.1007/s10590-011-9101-1