Abstract
This paper presents a new technique for tokenization and part-of-speech (POS) tagging of Arabic text. The technique uses an Arabic morphological analyzer to extract additional features that improve both stemming and POS tagging. Under standard evaluation metrics, the proposed tokenizer achieves an F-score (β = 1) of 99.99, and the POS tagger achieves an accuracy of 98.05%.
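For readers unfamiliar with the two metrics named above, the following is a minimal sketch of how an F-score with β = 1 and a tagging accuracy are conventionally computed. It is not the paper's evaluation code: the function names `f_beta` and `accuracy`, the count-based interface, and the toy inputs are assumptions made for illustration, and the abstract does not specify how true and false positives were counted for tokenization.

```python
def f_beta(true_positives, false_positives, false_negatives, beta=1.0):
    """F-score from raw counts (hypothetical helper, not from the paper).

    With beta = 1, precision and recall are weighted equally.
    """
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    if predicted == 0 or actual == 0:
        return 0.0
    precision = true_positives / predicted
    recall = true_positives / actual
    if precision + recall == 0:
        return 0.0
    beta_sq = beta * beta
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)


def accuracy(gold_tags, predicted_tags):
    """Fraction of positions where the predicted tag equals the gold tag."""
    if len(gold_tags) != len(predicted_tags):
        raise ValueError("gold and predicted sequences must have equal length")
    if not gold_tags:
        return 0.0
    correct = sum(g == p for g, p in zip(gold_tags, predicted_tags))
    return correct / len(gold_tags)


if __name__ == "__main__":
    # Toy counts, chosen only to exercise the functions.
    print(f_beta(true_positives=8, false_positives=2, false_negatives=2))
    print(accuracy(["NN", "VB", "DT"], ["NN", "VB", "NN"]))
```

Working from raw counts keeps the sketch independent of any particular token-alignment scheme, which the abstract leaves unspecified.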
Copyright information
© 2014 Springer International Publishing Switzerland
About this paper
Cite this paper
Nawar, M.N. (2014). Improving Arabic Tokenization and POS Tagging Using Morphological Analyzer. In: Hassanien, A.E., Tolba, M.F., Taher Azar, A. (eds) Advanced Machine Learning Technologies and Applications. AMLTA 2014. Communications in Computer and Information Science, vol 488. Springer, Cham. https://doi.org/10.1007/978-3-319-13461-1_6
DOI: https://doi.org/10.1007/978-3-319-13461-1_6
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-13460-4
Online ISBN: 978-3-319-13461-1
eBook Packages: Computer Science (R0)