Language resources for Maghrebi Arabic dialects’ NLP: a survey

Younes, Jihene; Souissi, Emna; Achour, Hadhemi; Ferchichi, Ahmed

doi:10.1007/s10579-020-09490-9

Survey
Published: 25 April 2020

Volume 54, pages 1079–1142, (2020)

Language Resources and Evaluation

Jihene Younes¹,
Emna Souissi²,
Hadhemi Achour¹ &
…
Ahmed Ferchichi¹

Abstract

Diglossia is one of the main characteristics of the Arabic language. In Arab countries, three forms of Arabic co-exist: Classical Arabic (CA), which is mainly used in the Quran and in several classical literary texts; Modern Standard Arabic (MSA), which descends from CA and is used as the official language; and various regional colloquial varieties of Arabic that are usually referred to as Arabic dialects (AD). Deemed to be amongst low-resource languages, these dialects have aroused increased interest among the NLP community in recent years. Indeed, the various Arabic dialects are increasingly used on the social web and may be transcribed in both the Arabic and the Latin script. The latter form is known as Arabizi and seems to be more frequently used for some of them. AD NLP raises many challenges and requires the availability of large and appropriate language resources. In this study, we focus, in particular, on the Maghrebi Arabic dialects (MADs). We propose a thorough review of the language resources (LRs) that have been generated by the various works carried out on MAD language processing. A survey of the MAD NLP dedicated LRs currently available online is also compiled and discussed. The LRs investigated in this work are essentially data resources such as primary and annotated corpora, lexica, dictionaries, ontologies, etc.

1 Introduction

Constructing and evaluating spoken or written Natural Language Processing (NLP) algorithms and systems requires the availability of different kinds of resources related to the treated languages, usually referred to as Language Resources (LRs). According to the European Language Resources Association,^{Footnote 1} the term LR refers to a set of speech or language data and descriptions that are accessible in an electronic form and useful for developing or evaluating natural language or speech algorithms and systems. We should, however, note that the definition assigned to the term LR can vary among the scholars using it. It ranges from a broad definition encompassing various kinds of data (such as corpora, lexica, thesauri, ontologies, etc.) and tools generating new data and descriptions (such as morphological analysers, taggers, parsers, etc.) (Witt et al. 2009), to a narrower definition where the term LR designates data-only resources (Cunningham et al. 2009). When it comes to data-only LRs, a common classification is to divide these resources into primary and secondary (or derived) resources (Rosner 2009). Primary resources refer to raw data obtained from different textual sources, while secondary or derived resources refer to data that have been annotated with additional information such as different levels of linguistic description.

Having such different kinds of LRs is crucial for any work aiming at the study or analysis of a language (El-Haj et al. 2014). Their construction, representation, maintenance and evaluation are important issues in the NLP field, especially for those working on statistical and machine learning methods that require large-scale resources. While substantial work has been carried out in this area for English and some other European languages, leading to significant technological advances in language and speech processing, efforts are still required for under-resourced languages in order to build the various kinds of LRs that allow their automatic processing.

The concept of under-resourced language may have different designations (such as low-resource language, poorly endowed language, etc.) and various definitions. According to Besacier et al. (2013), an under-resourced language refers to “a language with some of (if not all) the following aspects: lack of a unique writing system or stable orthography, limited presence on the web, lack of linguistic expertise, lack of electronic resources for speech and language processing, such as monolingual corpora, bilingual electronic dictionaries, transcribed speech data, pronunciation dictionaries, vocabulary lists, etc.”. Other definitions are more centred on Human Language Technologies (HLT), such as that provided by the LORELEI^{Footnote 2} project, which considers a low-resource language to be a language “for which no automated human language technology capability exists”. Duong (2017) ties the definition to a particular NLP task: a language is considered low-resource for a given task if no existing solution using available data performs this task adequately.

It follows from the above definitions that a given language may be considered under-resourced when it lacks the LRs required to develop one or more NLP tasks involving that language. In this sense, Arabic dialects in general are deemed to be amongst low-resource languages, in view of the current availability of NLP data resources and tools associated with these dialects (Hamdi et al. 2015; Novotney et al. 2016; Harrat and Meftouh 2017a). It should, however, be noted that in recent years Arabic dialects have aroused increased interest among the NLP community. This interest may be explained by, among other reasons, the growing use of these dialects by Arab Internet users, especially on the social web (Zaidan and Callison-Burch 2014; Younes et al. 2015; Samih and Maier 2016a; Alshutayri and Atwell 2017).

Arabic dialects are spoken by more than 440 million people^{Footnote 3} in a region covering Arabia (Arabian Peninsula), North Africa and the Middle East. The classification to which several researchers subscribe (Embarki 2008) is that of Versteegh (1997), which divides modern Arabic dialects into five major dialectal areas, from East to West: (1) Arabian Peninsula dialects include the dialects of Kuwait, Saudi Arabia, Bahrain, Qatar, United Arab Emirates, Oman and Yemen; (2) Mesopotamian dialects include the dialects of Iraq; (3) Levantine dialects include the dialects of Lebanon, Syria, Jordan and Palestine; (4) Egyptian dialects cover the dialects of the Nile valley: Egypt and Sudan; (5) Maghrebi dialects cover the dialects of Mauritania, Morocco, Algeria, Tunisia and Libya.

This study focuses on the Maghrebi Arabic dialects (MADs) in particular. Indeed, MADs share the same sociolinguistic variation and are characterized by many common linguistic specificities: the coexistence of several languages (MSA, dialectal Arabic, Berber and French), the influence of the French language in written and oral use, their sharply increasing use on social media, the use of Latin letters in writing, etc. All these features have generated common difficulties and challenges to be overcome when trying to process these dialects, which is why many research works treat these dialects together. Although most of the work carried out on Arabic dialects has focused on Mashriqi^{Footnote 4} and mainly on Egyptian Arabic (Shoufan and Alameri 2015; Assiri et al. 2015; Harrat and Meftouh 2017b), MADs have been attracting increasing interest from NLP researchers in the past few years. As regards surveys, to our knowledge only one study, proposed by Harrat and Meftouh (2017c), was specifically dedicated to reviewing work on the automatic processing of Maghrebi dialects. Although it covers a variety of works, that survey is a short and non-exhaustive review dealing with only three dialects: Tunisian, Algerian and Moroccan. The authors presented a linguistic overview of the three dialects, then reviewed 39 works (until 2017) according to 7 categories: corpora and lexicons (12 works), identification (5 works), orthography (4 works), morphological analysis (6 works), sentiment analysis (3 works), machine translation (5 works) and other works including sentence boundary detection and diacritics restoration (4 works). Harrat and Meftouh (2017c) provided some information about the NLP approaches followed and the sizes of the corpora used. They did not, however, specify the scripts used or the annotation level of these corpora, and did not provide any information about the availability of MAD language resources.

In the present survey, we aim for a more comprehensive study and a thorough review of the language resources that have been generated by the various work^{Footnote 5} carried out on MAD language processing. A survey of the MAD NLP dedicated LRs currently available online is also compiled and discussed. The LRs investigated in this work are essentially data resources such as primary and annotated corpora, lexica, dictionaries, ontologies, etc. Our main goals are to provide a clear picture of the progress that has been made towards constructing LRs for Maghrebi dialect NLP and of their availability to researchers wishing to work in this field.

The remainder of this paper is organized as follows: Sect. 2 is devoted to a general presentation of the MADs. We review their major linguistic specificities on different levels. We also provide some pertinent indications on their current use in the social web and set forth the main difficulties and challenges related to MAD NLP. In Sect. 3, we identify the constructed MAD primary LRs, mainly raw text and speech corpora. Section 4 reviews the construction of various annotated corpora (derived LRs). Section 5 is dedicated to identifying other kinds of MAD data LRs such as lexica, dictionaries and ontologies. We devote Sect. 6 to works that examined the normalization and the codification of the MADs. In Sect. 7, we make a census of the LRs dedicated to MAD NLP that are currently available online, and in Sect. 8 we propose a discussion of the identified resources, supported by numbers and charts illustrating the evolution of works on Maghrebi dialects, the typology of LRs, their breakdown by MAD, etc. Finally, Sect. 9 is dedicated to the conclusion and future perspectives.

2 Maghrebi Arabic dialects (MADs)

2.1 Linguistic specificities

Maghrebi Arabic (or Maghrebi Darija) is one of the mother tongues of the people of the Maghreb area in North Africa, which covers five countries: Mauritania, Morocco, Algeria, Tunisia and Libya. The Maghreb has more than 100 million inhabitants:^{Footnote 6} 4.3 million in Mauritania, 36 million in Morocco, 43 million in Algeria, 11.5 million in Tunisia and 6.2 million in Libya. In these countries, at least two mother tongues coexist according to the origin of the inhabitants (Pereira 2005): Berber^{Footnote 7} and Maghrebi Arabic. In fact, Berber is the mother tongue of nearly 30% of the Maghreb population. Berber speakers represent about 45 to 50% of the population in Morocco, 25 to 30% in Algeria, 1% in Tunisia and between 5 and 10% in Libya and Mauritania. On the other hand, Maghrebi Arabic is used by Arabic speakers and Berber speakers alike, and also among Berber speakers who do not understand each other because they speak different Berber varieties.
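The figures quoted above can be cross-checked with a few lines of arithmetic. The sketch below is only a consistency check, not a computation taken from the survey: it assumes that the per-country percentages apply to the national population totals given in the text, that the "between 5 and 10%" range applies to Libya and to Mauritania separately, and that Tunisia's 1% is a point value. The names `POPULATION_M`, `BERBER_RANGE` and `berber_share_bounds` are hypothetical.

```python
# Population in millions, as quoted in the text.
POPULATION_M = {
    "Mauritania": 4.3,
    "Morocco": 36.0,
    "Algeria": 43.0,
    "Tunisia": 11.5,
    "Libya": 6.2,
}

# (low, high) share of Berber speakers per country, as quoted in the text.
# Assumption: the 5-10% range is applied to Libya and Mauritania separately.
BERBER_RANGE = {
    "Mauritania": (0.05, 0.10),
    "Morocco": (0.45, 0.50),
    "Algeria": (0.25, 0.30),
    "Tunisia": (0.01, 0.01),
    "Libya": (0.05, 0.10),
}


def berber_share_bounds():
    """Return (total population in millions, low share, high share)."""
    total = sum(POPULATION_M.values())
    low = sum(POPULATION_M[c] * BERBER_RANGE[c][0] for c in POPULATION_M)
    high = sum(POPULATION_M[c] * BERBER_RANGE[c][1] for c in POPULATION_M)
    return total, low / total, high / total


if __name__ == "__main__":
    total, low, high = berber_share_bounds()
    print(f"total population: {total:.1f} million")
    print(f"Berber-speaker share: {low:.1%} to {high:.1%}")
```

Under these assumptions the country totals add up to about 101 million, and the implied Berber-speaker share falls between roughly 27% and 32%, which is compatible with the "more than 100 million" and "nearly 30%" statements.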

Speakers of Maghrebi Arabic call their language Darija (in Arabic, “دارجة”, which is the feminine form of “دارج” meaning “familiar, common, popular, used”^{Footnote 8}). Darija alludes to colloquial spoken Arabic rather than Modern Standard Arabic (MSA), but it is also common to refer to the Maghrebi Arabic varieties directly as languages. For instance, Moroccan Arabic would be referred to as Maghrebi (Moroccan), Algerian Arabic as Dzayri (Algerian), Tunisian Arabic as Tounsi (Tunisian), Libyan Arabic as Libi (Libyan) and Mauritanian Arabic as Hassaniya (derived from the name of the Arab tribes called “Beni Hassan”).

Maghrebi Arabic is a spontaneous oral language. It is the language in which speakers communicate with each other, but it is also sometimes a literary language and a written language (Pereira 2005). Indeed, it is a literary language in which proverbs, nursery rhymes, tales, riddles and poems are recited. It can be a written language when song lyrics and theatre plays are written in Maghrebi Arabic. Today, it is increasingly used on radio, on television and in advertising. In its written form, Maghrebi Arabic has long been written mainly in the Arabic script (AS), but it may also be written using the Latin script (LS); content written in LS is known as “Arabizi”. It should be noted that the introduction of new modes of communication (SMS, e-mails, Facebook, Twitter, etc.), widely used in Arab countries, has strengthened dialectal writing, especially in Latin script.

According to Mohand (1999) and Embarki (2008), the Maghrebi Arabic dialects undoubtedly possess common phonetic, morphological, syntactic and lexical features, giving them a particular linguistic character and thereby, clearly differentiating them from Eastern Arabic. The main common characteristics of the MADs include:

Mutual intelligibility: The varieties of Maghrebi Arabic have a significant degree of mutual intelligibility, especially between geographically adjacent ones (such as local dialects spoken in Eastern Morocco and Western Algeria, or Eastern Algeria and North Tunisia, or South Tunisia and Western Libya), but hardly between Moroccan and Tunisian Darija. On the other hand, it is well known that Maghrebians understand almost all other Arabic dialects.
Mixture of many languages: Maghrebi Arabic varieties generally cannot be understood by Eastern Arabic speakers (from Egypt, Sudan, the Levant, Iraq and the Arabian Peninsula), as they derive from different substrata and a mixture of many languages (Mohand 1999; Elimam 2009, 2012; Sayahi 2014; Mzoughi 2015): Punic, Berber, Arabic, Turkish, French, Spanish, Italian and Niger-Congo languages. Sayahi (2014) considers that Berber in particular has exercised an important influence on the way the Maghrebi dialects formed, distinguishing them from the Eastern Arabic dialects at many levels. For these reasons, some linguists, like Elimam (2004, 2012), tend to consider Maghrebi Arabic as an independent language.
Continual evolution: Maghrebi Arabic continues to evolve by integrating new words. Speakers frequently borrow words from French (in Morocco, Algeria and Tunisia), Spanish (in Morocco) and Italian (in Libya and Tunisia) and conjugate them according to the rules of Arabic, with some exceptions (the passive voice, for example) (Sayahi 2014; Elimam 2009). Some examples showing the dynamicity of the Tunisian dialect linguistic system are given in Table 1. These examples are also valid for the Algerian and Moroccan dialects.
Table 1 Words borrowed from French
Code-switching: Defined as the “alternate use of two or more languages in the same utterance of conversation” (Fishman 1999), code-switching is a result of multilingualism. In the Maghreb, code-switching is frequent and is a feature of the local way of speaking. In text, code-switching generally occurs between MSA and Maghrebi Arabic when the text is written in Arabic script, and between French/Spanish/English and Maghrebi Arabic when it is written in the Latin alphabet. More rarely, written texts (especially on the social web and in advertising spots) use both Arabic and Latin scripts. Table 2 shows some examples of such messages. This kind of code-switching adds another difficulty, since the part written in Latin letters must be read from left to right and the Arabic part in the opposite direction.
Table 2 Examples of written MAD using both AS and LS
Phonological aspects: In Maghrebi dialects, some non-Arabic phonemes may be used, such as /g/ ڤ, /p/ پ and /v/ ڥ. Long vowels are usually shortened, and the three short vowels are often reduced to two. In many cases, CvCC is changed to CCvC (e.g., سَقف /saqf/ (roof) in Oriental Arabic is said سقَف /sqaf/ in Maghrebi Arabic). Also, preference is for final-syllable stress, especially with the reduction of non-stressed short vowels (e.g., كِتَاب /kitaab/ (book) in Oriental Arabic is said كتَاب /ktaab/ in Maghrebi Arabic).
Morphological aspects: Maghrebi dialects have a specific conjugation distinguishing them from Mashriqi dialects and Modern Standard Arabic. Table 3 shows some examples.
Table 3 Verb conjugation specificities of MAD
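The mixed-script messages described under code-switching above suggest a simple preprocessing step: splitting a message into runs of Arabic-script (AS) and Latin-script (LS) tokens before any further processing. The following is a minimal sketch of that idea, not a method taken from the surveyed works. It assumes whitespace tokenization and relies on Unicode character names from Python's standard `unicodedata` module; the function names and the sample string `"merci برشا"` are illustrative and do not come from Table 2.

```python
import unicodedata


def char_script(ch):
    """Return 'AS' for an Arabic-script letter, 'LS' for a Latin-script letter, else None."""
    if not ch.isalpha():
        return None  # digits, punctuation and combining diacritics are script-neutral here
    name = unicodedata.name(ch, "")
    if name.startswith("ARABIC"):
        return "AS"
    if name.startswith("LATIN"):
        return "LS"
    return None


def token_script(token):
    """Label a token as 'AS', 'LS', 'MIXED' (both scripts) or 'OTHER' (no letters of either)."""
    scripts = {s for s in map(char_script, token) if s is not None}
    if not scripts:
        return "OTHER"
    if len(scripts) > 1:
        return "MIXED"
    return scripts.pop()


def segment_by_script(text):
    """Group consecutive whitespace-separated tokens that share the same script label."""
    runs = []
    for tok in text.split():
        label = token_script(tok)
        if runs and runs[-1][0] == label:
            runs[-1][1].append(tok)
        else:
            runs.append((label, [tok]))
    return [(label, " ".join(toks)) for label, toks in runs]


if __name__ == "__main__":
    # Illustrative mixed-script message: a French word followed by an Arabic-script word.
    for label, run in segment_by_script("merci برشا"):
        print(label, run)
```

Each run carries its own reading direction (left to right for LS, right to left for AS), so a downstream component can treat the two parts separately. Digits are left script-neutral in this sketch, so tokens that consist only of digits are labelled `OTHER`.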

Although Maghrebi dialects share many common linguistic characteristics, each one of them has its own linguistic specificities. Table 4 shows some examples of specific characteristics that differentiate them from each other. In this paper, we use MD for Moroccan Dialect, AD for Algerian Dialect, TD for Tunisian Dialect, LD for Libyan Dialect, HD for Mauritanian Dialect, ArD for Arabic dialects, OAD for other Arabic dialects and OL for other languages.

Table 4 Examples of MAD distinctive linguistic specificities (Pereira 2005, 2011; Sayahi 2014)
