Abstract
Arabic is one of the most widely spoken languages in the world; it is the official language of many countries and the fourth most used language on the internet. Arabic texts are often written without diacritic marks, even though those marks are important for clarifying the sense and meaning of words. Automatic diacritization is the process of assigning diacritics to letters, and it is an important field in Arabic Natural Language Processing (ANLP). In this work, we study the effect of enlarging the training dataset on the diacritization error rate (DER) by building a new dataset and concatenating it with the Tashkeela dataset. We trained a deep learning model based on bidirectional long short-term memory (BLSTM) that takes an undiacritized sequence of Arabic letters and produces a fully diacritized output sequence of the same length. Our model shows significant results on the new dataset in terms of DER and validation loss.
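To make the DER metric concrete, the following is a minimal sketch of one common way to compute it, not the evaluation code used in this work. It assumes that the diacritics are the Unicode code points U+064B to U+0652, that every non-diacritic character (including spaces and punctuation) counts as one position, and that the order of stacked marks such as shadda plus a vowel does not matter. The function names `split_diacritics` and `diacritization_error_rate` are hypothetical.

```python
# Assumption: the diacritic set is U+064B..U+0652 (tanween, short vowels,
# shadda, sukun). Published DER definitions differ in what they count.
DIACRITICS = {chr(code) for code in range(0x064B, 0x0653)}


def split_diacritics(text):
    """Split text into base characters and the diacritics attached to each.

    Returns (letters, labels), where labels[i] is the (possibly empty)
    string of diacritics that follows letters[i] in the input.
    """
    letters, labels = [], []
    for ch in text:
        if ch in DIACRITICS:
            if letters:  # ignore a stray diacritic with no base character
                labels[-1] += ch
        else:
            letters.append(ch)
            labels.append("")
    return letters, labels


def diacritization_error_rate(reference, hypothesis):
    """Fraction of base characters whose diacritics differ from the reference."""
    ref_letters, ref_labels = split_diacritics(reference)
    hyp_letters, hyp_labels = split_diacritics(hypothesis)
    if ref_letters != hyp_letters:
        raise ValueError("reference and hypothesis must share the same letters")
    if not ref_letters:
        return 0.0
    # Compare sorted marks so that shadda+vowel order does not count as an error.
    errors = sum(sorted(r) != sorted(h) for r, h in zip(ref_labels, hyp_labels))
    return errors / len(ref_letters)
```

For example, comparing a three-letter reference word with a hypothesis that gets the diacritic of one letter wrong yields a DER of one third under these assumptions. Because the output sequence has the same length as the input, `split_diacritics` can also be used to derive undiacritized inputs and per-letter labels from diacritized text.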
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Iman, Z., Adnan, S., Eddine, E.M.B. (2023). Neural Network for Arabic Text Diacritization on a New Dataset. In: Lazaar, M., En-Naimi, E.M., Zouhair, A., Al Achhab, M., Mahboub, O. (eds) Proceedings of the 6th International Conference on Big Data and Internet of Things. BDIoT 2022. Lecture Notes in Networks and Systems, vol 625. Springer, Cham. https://doi.org/10.1007/978-3-031-28387-1_17
DOI: https://doi.org/10.1007/978-3-031-28387-1_17
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-28386-4
Online ISBN: 978-3-031-28387-1
eBook Packages: Intelligent Technologies and Robotics (R0)