Abstract
Over the past decade, social networks have become a common place for people to share information and express opinions. The massive volume of data generated offers an opportunity to extract valuable knowledge about people’s needs and behaviours. Sentiment Analysis techniques are widely used for this purpose. They give accurate results for widely used languages such as English, Spanish or French, but they are still under development for Modern Standard Arabic (MSA) and its derived dialects. For the Moroccan Dialect used in Social Media, the first challenge is the phenomenon of Code Switching, in which two or more languages (Arabic, Tamazight, French, English or Spanish) appear in the same sentence. The second is Arabizi, the writing of words in Latin script combined with numbers instead of Arabic characters. As a consequence, preprocessing has become one of the important steps of data analysis. This paper proposes a new method based on Natural Language Processing (NLP) to address the challenges of preprocessing text that contains Arabizi and Code Switching forms. We aim to build a multilingual corpus that includes linguistic features and reflects the structure of text written in Social Media Moroccan Dialect.
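To make the two challenges concrete, the following is a minimal Python sketch of a script-level token tagger. It is not the method proposed in the paper, whose details the abstract does not give. The category names, the regular expressions and the functions `tag_token` and `tag_sentence` are hypothetical, and the rule "Latin letters mixed with digits suggests Arabizi" is an illustrative assumption.

```python
import re
from typing import List, Tuple

# Illustrative character classes (assumptions, not taken from the paper).
ARABIC_RE = re.compile(r"[\u0600-\u06FF]")  # characters in the Arabic Unicode block
LATIN_RE = re.compile(r"[A-Za-z]")          # basic Latin letters only
DIGIT_RE = re.compile(r"[0-9]")             # ASCII digits


def tag_token(token: str) -> str:
    """Assign a coarse, hypothetical script label to one token."""
    has_arabic = bool(ARABIC_RE.search(token))
    has_latin = bool(LATIN_RE.search(token))
    has_digit = bool(DIGIT_RE.search(token))
    if has_arabic:
        return "arabic_script"
    if has_latin and has_digit:
        # Latin letters combined with digits: a possible Arabizi form.
        return "arabizi_candidate"
    if has_latin:
        return "latin_script"
    if has_digit:
        return "number"
    return "other"


def tag_sentence(sentence: str) -> List[Tuple[str, str]]:
    """Split on whitespace and tag every token."""
    return [(tok, tag_token(tok)) for tok in sentence.split()]


if __name__ == "__main__":
    for token, label in tag_sentence("3lach merci شكرا 2021"):
        print(token, label)
```

A sentence whose tokens receive more than one script label is a candidate for Code Switching. The sketch cannot separate Arabizi written without digits from French, English or Spanish words, since both are tagged `latin_script`; that distinction would need a lexicon or a language identification step.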
Notes
- 1. Arabic transliteration according to the Buckwalter system. Retrieved June 07, 2021, from http://www.qamus.org/transliteration.htm.
- 2. Alexa Ranking: Top social media sites in Morocco, http://www.alexa.com/topsites/countries/MA, visited May 07, 2021.
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Hajbi, S., Chihab, Y., Ed-Dali, R., Korchiyne, R. (2022). Natural Language Processing Based Approach to Overcome Arabizi and Code Switching in Social Media Moroccan Dialect. In: Maleh, Y., Alazab, M., Gherabi, N., Tawalbeh, L., Abd El-Latif, A.A. (eds) Advances in Information, Communication and Cybersecurity. ICI2C 2021. Lecture Notes in Networks and Systems, vol 357. Springer, Cham. https://doi.org/10.1007/978-3-030-91738-8_6
DOI: https://doi.org/10.1007/978-3-030-91738-8_6
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-91737-1
Online ISBN: 978-3-030-91738-8
eBook Packages: Intelligent Technologies and Robotics (R0)