1 Introduction

Constructing and evaluating spoken or written Natural Language Processing (NLP) algorithms and systems requires the availability of different kinds of resources related to the treated languages, usually referred to as Language Resources (LRs). According to the European Language Resources Association (Footnote 1), the term LR refers to a set of speech or language data and descriptions that are accessible in an electronic form and useful for developing or evaluating natural language or speech algorithms and systems. We should, however, note that the definition assigned to the term LR can vary among the scholars using it. It ranges from a broad definition encompassing various kinds of data (such as corpora, lexica, thesauri, ontologies, etc.) and tools generating new data and descriptions (such as morphological analysers, taggers, parsers, etc.) (Witt et al. 2009), to a narrower definition where the term LR designates data-only resources (Cunningham et al. 2009). When it comes to data-only LRs, a common classification divides these resources into primary and secondary (or derived) resources (Rosner 2009). Primary resources refer to raw data obtained from different textual sources, while secondary or derived resources refer to data that have been annotated with additional information such as different levels of linguistic description.

Having such different kinds of LRs is crucial for any work aiming at a language study or analysis (El-Haj et al. 2014). Their construction, representation, maintenance and evaluation are important issues in the NLP field, especially for those working on statistical and machine learning methods that require large-scale resources. While substantial work has been carried out in this area for languages such as English and some other European languages, leading to significant technological advances in language and speech processing, efforts are still required for under-resourced languages in order to build the various kinds of LRs that allow their automatic processing.

The concept of under-resourced language may have different designations (such as low-resource language, poorly endowed language, etc.) and various definitions. According to Besacier et al. (2013), an under-resourced language refers to “a language with some of (if not all) the following aspects: lack of a unique writing system or stable orthography, limited presence on the web, lack of linguistic expertise, lack of electronic resources for speech and language processing, such as monolingual corpora, bilingual electronic dictionaries, transcribed speech data, pronunciation dictionaries, vocabulary lists, etc.”. Other definitions are more centred on Human Language Technologies (HLT), such as that provided by the LORELEI (Footnote 2) project, which considers a low-resource language to be a language “for which no automated human language technology capability exists”. Duong (2017) relates the definition to a particular NLP task: a language is considered low-resource for a given task if no existing solution using available data performs this task adequately.

It follows from the above definitions that a given language may be considered under-resourced when it lacks the LRs required to develop one or more NLP tasks involving that language. In this sense, Arabic dialects in general are deemed to be among the low-resource languages, in view of the current availability of NLP data resources and tools associated with these dialects (Hamdi et al. 2015; Novotney et al. 2016; Harrat and Meftouh 2017a). It should, however, be noted that in recent years Arabic dialects have aroused increased interest among the NLP community. This interest may be explained by, among other reasons, the growing use of these dialects by Arab Internet users, especially on the social web (Zaidan and Callison-Burch 2014; Younes et al. 2015; Samih and Maier 2016a; Alshutayri and Atwell 2017).

Arabic dialects are spoken by more than 440 million people (Footnote 3) in a region covering Arabia (Arabian Peninsula), North Africa and the Middle East. The classification to which several researchers subscribe (Embarki 2008) is that of Versteegh (1997), which divides modern Arabic dialects into five major dialectal areas, from East to West: (1) Arabian Peninsula dialects include the dialects of Kuwait, Saudi Arabia, Bahrain, Qatar, United Arab Emirates, Oman and Yemen; (2) Mesopotamian dialects include the dialects of Iraq; (3) Levantine dialects include the dialects of Lebanon, Syria, Jordan and Palestine; (4) Egyptian dialects cover the dialects of the Nile valley: Egypt and Sudan; (5) Maghrebi dialects cover the dialects of Mauritania, Morocco, Algeria, Tunisia and Libya.

This study focuses on the Maghrebi Arabic dialects (MADs) in particular. Indeed, MADs share the same sociolinguistic variation and are characterized by many common linguistic specificities: the coexistence of several languages (MSA, dialectal Arabic, Berber and French), the influence of the French language in written and oral use, the sharp increase in their use on social media, writing with Latin letters, etc. All these features have generated common difficulties and challenges to be overcome when trying to process these dialects, which is why many research works treat these dialects together. Although most of the work carried out on Arabic dialects has focused on Mashriqi (Footnote 4) and mainly on Egyptian Arabic (Shoufan and Alameri 2015; Assiri et al. 2015; Harrat and Meftouh 2017b), MADs have been attracting increasing interest from NLP researchers in the past few years. As regards surveys, and to our knowledge, only one study, proposed by Harrat and Meftouh (2017c), was specifically dedicated to reviewing work on the automatic processing of Maghrebi dialects. Although this survey covers a variety of works, it is a short and non-exhaustive review dealing with only three dialects: Tunisian, Algerian and Moroccan. The authors presented a linguistic overview of the three dialects, then reviewed 39 works (until 2017) according to 7 categories: corpora and lexicons (12 works), identification (5 works), orthography (4 works), morphological analysis (6 works), sentiment analysis (3 works), machine translation (5 works) and other works including sentence boundary detection and diacritics restoration (4 works). Harrat and Meftouh (2017c) provided some information about the NLP approaches followed and the sizes of the corpora used. They did not, however, specify the scripts used or the annotation level of these corpora, and did not provide any information about the availability of MAD language resources.

In the present survey, we aim for a more comprehensive study and a thorough review of the language resources that have been generated by the various works (Footnote 5) carried out on MAD language processing. A survey of the LRs dedicated to MAD NLP that are currently available online is also compiled and discussed. The LRs investigated in this work are essentially data resources such as primary and annotated corpora, lexica, dictionaries, ontologies, etc. Our main goals are to provide a clear picture of the progress made towards constructing LRs for the NLP of the Maghrebi dialects and of their availability to researchers wishing to work in this field.

The remainder of this paper is organized as follows: Sect. 2 is devoted to a general presentation of the MADs. We review their major linguistic specificities on different levels. We also provide some pertinent indications on their current use in the social web and set forth the main difficulties and challenges related to MAD NLP. In Sect. 3, we identify the constructed MAD primary LRs, mainly raw text and speech corpora. Section 4 reviews the construction of various annotated corpora (derived LRs). Section 5 is dedicated to identifying other kinds of MAD data LRs such as lexica, dictionaries and ontologies. We devote Sect. 6 to works that examined the normalization and the codification of the MADs. In Sect. 7, we make a census of the LRs dedicated to MAD NLP that are currently available online, and in Sect. 8 we discuss the identified resources, supported by numbers and charts illustrating the evolution of works on Maghrebi dialects, the typology of LRs, their breakdown by MAD, etc. Finally, Sect. 9 is dedicated to the conclusion and future perspectives.

2 Maghrebi Arabic dialects (MADs)

2.1 Linguistic specificities

Maghrebi Arabic (or Maghrebi Darija) is one of the mother tongues of the people of the Maghreb area in North Africa, which covers five countries: Mauritania, Morocco, Algeria, Tunisia and Libya. The Maghreb has more than 100 million inhabitants (Footnote 6): 4.3 million in Mauritania, 36 million in Morocco, 43 million in Algeria, 11.5 million in Tunisia and 6.2 million in Libya. In these countries, at least two mother tongues coexist according to the origin of the inhabitants (Pereira 2005): Berber (Footnote 7) and Maghrebi Arabic. In fact, Berber is the mother tongue of nearly 30% of the Maghreb population. Berber speakers represent about 45 to 50% of the population in Morocco, 25 to 30% in Algeria, 1% in Tunisia and between 5 and 10% in Libya and Mauritania. Maghrebi Arabic, on the other hand, is used between Arabic speakers and Berber speakers, and also between Berber speakers who do not understand each other because they speak different Berber varieties.

Speakers of Maghrebi Arabic call their language Darija (in Arabic, “دارجة”, which is the feminine form of “دارج”, meaning “familiar, common, popular, used” (Footnote 8)). Darija alludes to colloquial spoken Arabic rather than Modern Standard Arabic (MSA), but it is also common to refer to the Maghrebi Arabic varieties directly as languages. For instance, Moroccan Arabic is referred to as Maghrebi (Moroccan), Algerian Arabic as Dzayri (Algerian), Tunisian Arabic as Tounsi (Tunisian), Libyan Arabic as Libi (Libyan) and Mauritanian Arabic as Hassaniya (derived from the name of the Arab tribes called “Beni Hassan”).

Maghrebi Arabic is a spontaneous oral language. It is the language in which speakers communicate with each other, but it is also sometimes a literary and a written language (Pereira 2005). Indeed, it is a literary language in which proverbs, nursery rhymes, tales, riddles and poems are told. It is a written language when song lyrics and theatre plays are written in Maghrebi Arabic. Today, it is increasingly used on radio, on television and in advertising. In its written form, Maghrebi Arabic has long been written mainly in the Arabic script (AS), but it may also be written using the Latin script (LS); content written in LS is known as “Arabizi”. It should be noted that the introduction of new modes of communication (SMS, e-mails, Facebook, Twitter, etc.), widely used in Arab countries, has strengthened dialectal writing, especially in Latin script.

According to Mohand (1999) and Embarki (2008), the Maghrebi Arabic dialects undoubtedly possess common phonetic, morphological, syntactic and lexical features, giving them a particular linguistic character and thereby, clearly differentiating them from Eastern Arabic. The main common characteristics of the MADs include:

  • Mutual intelligibility. The varieties of Maghrebi Arabic have a significant degree of mutual intelligibility, especially between geographically adjacent ones (such as local dialects spoken in Eastern Morocco and Western Algeria or Eastern Algeria and North Tunisia or South Tunisia and Western Libya), but hardly between Moroccan and Tunisian Darija. On the other hand, it is well-known that Maghrebians understand almost all other Arabic dialects.

  • Mixture of many languages. Maghrebi Arabic dialects generally cannot be understood by Eastern Arabic speakers (from Egypt, Sudan, the Levant, Iraq and the Arabian Peninsula), as they derive from different substrata and a mixture of many languages (Mohand 1999; Elimam 2009, 2012; Sayahi 2014; Mzoughi 2015): Punic, Berber, Arabic, Turkish, French, Spanish, Italian and Niger-Congo languages. Sayahi (2014) considers that Berber in particular has exercised an important influence on the way the Maghrebi dialects formed, distinguishing them from the eastern Arabic dialects at many levels. For these reasons, some linguists, like Elimam (2004, 2012), tend to consider Maghrebi Arabic as an independent language.

  • Continual evolution. Maghrebi Arabic continues to evolve by integrating new words. Speakers frequently borrow words from French (in Morocco, Algeria and Tunisia), Spanish (in Morocco) and Italian (in Libya and Tunisia) and conjugate them according to the rules of Arabic, with some exceptions (the passive voice, for example) (Sayahi 2014; Elimam 2009). Some examples showing the dynamism of the linguistic system of the Tunisian dialect are given in Table 1. These examples are also valid for the Algerian and Moroccan dialects.

    Table 1 Words borrowed from French
  • Code-switching. Defined as the “alternate use of two or more languages in the same utterance of conversation” (Fishman 1999), code-switching is a result of multilingualism. In the Maghreb, code-switching is frequent and is a feature of the local way of speaking. In text, code-switching generally occurs between MSA and Maghrebi Arabic when the text is written in Arabic script, and between French/Spanish/English and Maghrebi Arabic when it is written in the Latin script. More rarely, there are also written texts (especially on the social web and in advertising spots) that use both Arabic and Latin scripts. Table 2 below shows some examples of these messages. This kind of code-switching adds another difficulty, since the part written in Latin letters must be read from left to right and the Arabic part in the opposite direction.

    Table 2 Examples of written MAD using both AS and LS
  • Phonological aspects. In Maghrebi dialects, some non-Arabic phonemes may be used, such as /g/ ڤ, /p/ پ and /v/ ڥ. Long vowels are usually shortened, and the three short vowels are often reduced to two. In many cases, CvCC is changed to CCvC (e.g., سَقف/saqf (roof) in Oriental Arabic is said سقَف/sqaf in Maghrebi Arabic). Also, preference is for final-syllable stress, especially with the reduction of non-stressed short vowels (e.g., كِتَاب/kitaab (book) in Oriental Arabic is said كتَاب/ktaab in Maghrebi Arabic).

  • Morphological aspects. Maghrebi dialects have a specific conjugation distinguishing them from Mashriqi dialects and Modern Standard Arabic. Table 3 below shows some examples.

    Table 3 Verb conjugation specificities of MAD

Although Maghrebi dialects share many common linguistic characteristics, each one of them has its own linguistic specificities. Table 4 shows some examples of specific characteristics that differentiate them from each other. In this paper, we use MD for Moroccan Dialect, AD for Algerian Dialect, TD for Tunisian Dialect, LD for Libyan Dialect, HD for Mauritanian Dialect, ArD for Arabic dialects, OAD for other Arabic dialects and OL for other languages.

Table 4 Examples of MAD distinctive linguistic specificities (Pereira 2005, 2011; Sayahi 2014)

2.2 MAD presence on social networks

As in the whole Arab world, the use of social media has greatly increased in the Maghreb countries in the past few years. Taking as an example Facebook, which remains the most popular social media platform in the Maghreb region, and according to Mourtada and Salem (2014) and Salem (2017), the total number of its users in the Maghreb almost doubled in 3 years, growing from 20.5 million in 2014 to 38.3 million in 2017. On average, Facebook is used by about 38% of the total population of the Maghreb region.
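The two figures quoted above can be checked with a few lines of arithmetic. The sketch below is only a consistency check, assuming that the country populations listed in Sect. 2.1 can be used as the denominator (they may not refer to exactly the same year as the Facebook figures); the variable names are illustrative.

```python
# Facebook users in the Maghreb, in millions (Mourtada and Salem 2014; Salem 2017).
users_2014 = 20.5
users_2017 = 38.3

# Country populations in millions, as listed in Sect. 2.1 (assumed denominator).
population = {
    "Mauritania": 4.3,
    "Morocco": 36.0,
    "Algeria": 43.0,
    "Tunisia": 11.5,
    "Libya": 6.2,
}

growth_factor = users_2017 / users_2014          # "almost doubled"
total_population = sum(population.values())      # a little over 100 million
penetration = users_2017 / total_population      # share of the population on Facebook

print(f"growth factor: {growth_factor:.2f}")
print(f"penetration:   {penetration:.1%}")
```

Under these assumptions the growth factor is a little under 2 and the penetration rounds to 38%, which agrees with the figures in the text.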

As for the languages used on the social web, it is important to note that Maghrebi user-generated content is highly multilingual, mainly including three languages: Arabic, French and English. Table 5 gives, for each country, a breakdown of users according to the languages they use on Facebook. We can clearly notice the strong growth in the use of the Arabic language in all the Maghreb countries. On average, the percentage of Facebook users in the Maghreb region who use Arabic grew from 43.2% in 2014 to 73.7% in 2016, even though French also remains among the preferences of Maghrebi Facebook users, especially in Tunisia, Algeria and Morocco.

Table 5 Percentage of Facebook users by language used (Mourtada and Salem 2014; Salem 2017)

The Arabic language used on the social web is not limited to MSA: textual content expressed in the various Arabic dialects spoken by Maghrebi populations is also proliferating on social platforms (Younes et al. 2015; Samih et al. 2016; Abidi and Smaili 2017). Moreover, the dialectal productions may be written using either the Arabic or the Latin alphabet. In this regard, Table 6 gives some key figures on the use of dialectal Arabic and its transcriptions in the Algerian and Tunisian social web according to Abidi and Smaili (2017) and Younes et al. (2015), respectively. This table shows the significant use of colloquial Arabic on the web for both Tunisia and Algeria. In the case of Algeria, 74% of the total studied content is in dialectal Arabic, against 58% in the Tunisian study. Table 6 also shows the strong use of the Latin script for transcribing dialectal Arabic: more than 81% of the Tunisian dialect productions on the social web, and more than 62% of the studied Algerian dialect productions, are transcribed in the Latin alphabet. Several factors mentioned in (Younes et al. 2015) can explain this trend of using the Latin alphabet for dialectal writing, such as the scarcity of Arabic keyboards in the early years of the web and mobile devices, multilingualism, the influence of colonization, migration, and neo-cultures.
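To make the notion of script share concrete, the following minimal sketch labels each message by its dominant script and computes the share of each label in a sample. It is an illustrative heuristic based on the basic Unicode Arabic block (U+0600 to U+06FF) and ASCII letters, not the procedure followed in the cited studies; the function names are hypothetical, and the sample reuses spellings quoted in Sect. 2.3.

```python
from collections import Counter


def script_of(message: str) -> str:
    """Label a message by its dominant script (crude character-count heuristic)."""
    # Characters of the basic Arabic block, which includes letters and diacritics.
    arabic = sum(1 for ch in message if "\u0600" <= ch <= "\u06FF")
    # ASCII letters only; Arabizi digits such as 7 are not counted.
    latin = sum(1 for ch in message if ch.isascii() and ch.isalpha())
    if arabic == 0 and latin == 0:
        return "other"
    if arabic > latin:
        return "arabic"
    if latin > arabic:
        return "latin"
    return "mixed"


def script_shares(messages: list[str]) -> dict[str, float]:
    """Return the fraction of messages carrying each script label."""
    counts = Counter(script_of(m) for m in messages)
    return {label: n / len(messages) for label, n in counts.items()}


sample = ["ma ye7ebbech", "maye7ibbish", "مَا يحِبِّش", ":-) 123"]
print(script_shares(sample))
```

A real study would also have to separate dialectal content from MSA, French or English, which a script count alone cannot do.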

Table 6 Use of dialects and their transcriptions in the social web: cases of Algeria and Tunisia

Samih and Maier (2016a), for their part, reported that the construction of a Moroccan Arabic code-switched corpus from textual productions written in Arabic script and extracted from Moroccan discussion forums and blogs resulted in a corpus comprising 35% of Darija (Moroccan Arabic) words and 49% of MSA words. The rest of the corpus consists of words from other languages (French, English, Spanish or Berber), some ambiguous words, named entities, numbers and punctuation marks.

It can therefore be seen that the various Maghrebi dialects are widely used on the social web, in both the Latin and the Arabic scripts. Their automatic processing is gradually attracting growing interest among the NLP community. MAD NLP, however, faces many difficulties, and challenges remain to be met.

2.3 MAD NLP: difficulties and challenges

The automatic processing of the Maghrebi dialects faces several difficulties and challenges, which are mainly due to:

  • The lack of spelling conventions and orthographic standards. For example, the word meaning “he does not want” has the following potential Romanised writings: “mayehibbish”, “maye7ibbish”, “ma yhibbish”, “ma ye7ebbech”, etc. and several Arabic-script writings such as “مَيْحبِّش”, “مَا يحِبِّش”, etc. More examples are shown in Tables 1 and 22.

  • Language continual evolution with the borrowing of a variety of new words. Examples are shown in Table 1.

  • MAD variety. The Maghrebi dialects vary from one country to another and present dissimilarities on different linguistic levels. They also present significant dissimilarities with MSA, which makes it difficult to directly use the available MSA NLP tools, since these would yield lower performance on dialectal input.

  • Language ambiguity. The diglossia, multilingualism and code-switching that characterize the Maghrebi dialects raise the issue of MAD identification, which becomes an intricate task because of language ambiguity. Ambiguous words can be both MSA and MAD words when they are written in Arabic script. When they are Romanized, they may share a common writing with French, English and dialectal words. Some examples of ambiguous words are presented in Table 7.

    Table 7 Examples of ambiguous MAD words
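The spelling variation described in the first item of the list above can be illustrated with a small sketch. It reduces each Romanised spelling to a crude consonant skeleton so that variants of the same word share one key. The rules (mapping the digit 7 to h, treating "ch" and "sh" as equivalent, dropping vowels and collapsing repeated letters) are assumptions chosen for this example only and are not a normalization convention taken from the surveyed works.

```python
import re

VOWELS = set("aeiou")


def skeleton(word: str) -> str:
    """Reduce a Romanised spelling to a rough consonant skeleton (illustrative only)."""
    s = word.lower().replace(" ", "")   # ignore case and word-internal spaces
    s = s.replace("ch", "sh")           # French-style "ch" vs English-style "sh"
    s = s.replace("7", "h")             # Arabizi digit 7 written as "h"
    s = "".join(ch for ch in s if ch not in VOWELS)  # drop unstable vowels
    return re.sub(r"(.)\1+", r"\1", s)  # collapse repeated letters


variants = ["mayehibbish", "maye7ibbish", "ma yhibbish", "ma ye7ebbech"]
print({v: skeleton(v) for v in variants})
```

All four spellings quoted above collapse to the same key. The heuristic is deliberately lossy: unrelated words can collide on the same skeleton, so it is suited to grouping candidate variants rather than to deciding that two spellings are the same word.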

As we can see, MAD NLP raises important issues to be dealt with, and its development requires the availability of large and appropriate language resources. This is why, in the remainder of this paper, we propose a comprehensive review of the different kinds of constructed MAD LRs, based on the study of existing work carried out on these dialects, and make a census of the LRs currently available online for their NLP.

3 MAD raw corpora

3.1 Speech corpora

Maghrebi dialect speech processing has attracted the interest of several researchers in the last few years, who resorted to oral recordings to construct new corpora. Some of them recorded the speech of different speakers (Pellegrino and Barkat 1999; Barkat 1999; Barkat and Vasilescu 2001; Barkat et al. 2003, 2004; Bezoui et al. 2019) or real conversations between people (Bougrine et al. 2016; Hassine et al. 2016, 2018). Djellab et al. (2017) used recordings including, in particular, digital radio calls, phone interception systems, voicemails, etc. Others collected speech extracted from TV shows and radio programs (Bougrine et al. 2015; Lachachi and Adla 2015, 2016a, b; Amazouz et al. 2017), from broadcast news (Ali et al. 2016), and from web-streamed local radio and TV channels and YouTube channels (Bougrine et al. 2017). In (Bougrine et al. 2016), a speech corpus for Algerian Arabic subdialects (named “ALG-DARIDJAH”) was collected using spontaneous speech, translated MSA speech and image narration. Table 8 shows the constructed MAD speech corpora.

Table 8 MAD speech corpora

3.2 Speech transcription

Transcribing oral conversations was one of the first approaches followed to overcome the lack of written MAD LRs. This was the case of Iskra et al. (2004), who participated in the OrienTel project, which aimed to develop speech databases and phonetic standards across Northern Africa, the Middle East and the Arabian Gulf. Two speech corpora were thus created for TD and MD and were subsequently transcribed. A similar work was carried out by Belgacem (2009), who participated in the project “Oréodule: a system board real-time recognition, translation and speech synthesis Arabic”, in which he established Arabic multi-dialect corpora including Moroccan, Algerian and Tunisian along with other Arabic dialects. He used the Transcriber tool for the transcription process. Other speech corpus transcription works have been carried out, mainly on the Tunisian dialect. These works include those of (Graja et al. 2010; Masmoudi et al. 2014a, b; Boujelbane et al. 2015). For Algerian, we cite the work of Meftouh et al. (2012) and Amazouz et al. (2017, b, 2018a, b).

Wray and Ali (2015) studied the use of crowdsourcing for the speech transcription of dialectal Arabic, including Maghrebi. They collected dialectal speech from debate and news programs available on the Aljazeera website and resorted to CrowdFlower (CF) to group the speech utterances according to dialect. The speech was transcribed automatically using QATS (Ali et al. 2014) and manually via CF, with quality control parameters. Table 9 summarizes the identified transcribed corpora.

Table 9 MAD Transcribed Corpora

3.3 Web and social media corpora

Given the increasing use of the Internet in the Arab world and the proliferation on the web of user-generated content in the Arabic language and its dialects, several researchers have turned to the web and social media to construct various Arabic and Maghrebi dialect LRs.

Several corpora have thus been extracted from the web, and from the social web in particular. Some of them cover Arabic dialects in general, including some Maghrebi dialects (Callan et al. 2009; Almeman and Lee 2013; Suwaileh et al. 2016). Others relate exclusively to the MADs, covering either one particular dialect, such as the Tunisian dialect (McNeil and Faiza 2011; McNeil 2015; Younes and Souissi 2014; Younes et al. 2015; Bouchlaghem et al. 2014; Masmoudi et al. 2017; Torjmen and Haddar 2018a, b), the Algerian dialect (Abidi and Menacer 2017; Abidi and Smaili 2017; Guellil et al. 2018a, b, c; Soumeur et al. 2018) and the Libyan dialect (Alhammi and Alfards 2018), or a subset of MADs (Adouane et al. 2016a). In (Guellil et al. 2018a, b, c), two raw corpora were built from Algerian Facebook pages. The first includes 8,673,285 messages, 3,668,575 of which are written in Arabic letters. The second includes 15,407,910 messages, 7,926,504 of which are written in Arabic and 3,976,700 in Arabizi.

Characteristics of these LRs as well as their creation context and use are summarized in Table 10.

Table 10 Web and social media MAD corpora

4 MAD annotated corpora

4.1 Language identification

The identification of the Maghrebi dialects at the token level has been the main interest of several researchers, since it is a crucial step for LR construction, especially from the web and social media. In this context, several annotated corpora have been generated, in which tags mainly correspond to the language used in each considered unit. Regarding resources dedicated to the identification of a specific MAD, we mention those of (Tratz et al. 2014; Voss et al. 2014; Samih and Maier 2016a, b) and (Tachicart et al. 2017) for Moroccan Arabic, those of (Cotterell et al. 2014; Guellil and Azouaou 2016a; Adouane and Dobnik 2017) and (Lichouri et al. 2018) for Algerian Arabic and that of (Aridhi et al. 2017) for Tunisian Arabic. The corpora built by (Salama et al. 2014; Cotterell and Callison-Burch 2014; Mubarak and Darwish 2014; Zaidan and Callison-Burch 2014; Sadat et al. 2014a, b; Harrat et al. 2015; Adouane et al. 2016a; Alshutayri and Atwell 2017, 2018a, b; Saadane et al. 2017, 2018; Zaghouani and Charfi 2018; Alsarsour et al. 2018; El-Haj et al. 2018) and (Altamimi et al. 2018) cover several Arabic dialects including some MADs.

When the identification task is performed at a sentence level, the works usually concern more than one dialect. The annotation type involves the level of dialect that the sentence includes and the topic or the gender. Regarding the word-level identification, the works rather deal with code switching textual contents and the annotations concern the type of dialect and other information like punctuation marks.

Various corpora dedicated to the MAD identification are described in Table 11.

Table 11 MAD identification corpora

4.2 Morphosyntactic annotation

The morphosyntactic analysis of a language is a crucial step towards the study of its word formation and structure. With regard to Maghrebi dialects, the literature shows that several works have been carried out on their morphosyntactic analysis. Most of this work, however, focused on the Tunisian dialect in particular, including word segmentation (McNeil 2012), sentence segmentation (Zribi et al. 2016), morphological analysis (Graja et al. 2013; Zribi et al. 2013a, 2016, 2017), POS-tagging (Boujelbane et al. 2014; Hamdi et al. 2014; Zribi et al. 2017) and syntactic analysis (Mekki et al. 2017). Other works focused on the morpho-grammatical analysis of the Algerian dialect (Harrat et al. 2014, 2016; Guellil and Azouaou 2016b), while Almeman and Lee (2012), Al-Shargi et al. (2016), Eldesouki et al. (2017), Samih et al. (2017) and Darwish et al. (2018a) considered various Arabic dialects including some Maghrebi dialects. Two works were dedicated to diacritics restoration, one for the Algerian dialect (Harrat et al. 2013) and one for the Moroccan and Tunisian dialects (Darwish et al. 2018b).

Most of the researchers who worked on the morphosyntactic annotation of MAD resorted to existing MSA resources and tools, such as Zribi et al. (2013a), Boujelbane et al. (2014) and Harrat et al. (2014). These works adapted the existing resources by modifying suffixes, prefixes and morpheme order and by creating new rules for the studied Maghrebi dialect.

Others like Guellil and Azouaou (2016b) resorted to social media to extract the textual content and observed the characteristics of the Algerian dialect in order to syntactically analyse it.

These works resulted in the construction of a number of annotated resources that are described in the following Table 12. The column “Annotation” indicates which level of analysis is used for annotation.

Table 12 MAD morphosyntactic analysis corpora

4.3 Translation

Several works have been carried out on translation and have mainly focused on translating MADs into MSA. Some of these works tackled the translation into MSA of various Arabic dialects including some MADs (Bouamor et al. 2014, 2018; Meftouh et al. 2015, 2018; Harrat et al. 2015, 2017a; Mubarak 2018). Others focused on translating MADs in particular (Meftouh et al. 2012; Tachicart and Bouzoubaa 2014a, b; Sadat et al. 2014c). All of them considered dialects as source languages and MSA as the target language. The parallel corpora resulting from these works are described in Table 13.

Table 13 Translated MAD corpora

4.4 Transliteration and code-switching

Transliteration consists in transforming a word from one writing system to another while preserving its pronunciation. As shown in Sect. 2.2, MADs can be written in both Arabic and Latin scripts, especially on social networks, blogs and forums. This has led several researchers to focus on the transliteration task, which can be particularly useful for the process of dialectal LR construction and for many other applications such as machine translation (treatment of proper nouns), information retrieval, etc. Works on the transliteration of Maghrebi dialects have led to the construction of various corpora. They tackled Latin to Arabic script transliteration using pre-established rules (Saadane et al. 2013; Masmoudi et al. 2015; Guellil et al. 2018a), weighted finite state transducers (Ben Moussa et al. 2019), machine learning based methods (Younes et al. 2016; Guellil et al. 2017a) or a deep learning based sequence-to-sequence approach (Younes et al. 2018).

The majority of the works on the transliteration task is based on the Arabic-Latin correspondences between characters. Some of these correspondences were based on rules which were created following an observation of the written content in both Latin and Arabic scripts (Saadane et al. 2013; Guellil et al. 2018a). The work of Masmoudi et al. (2015) was, however, based on CODA which is an orthographic convention for the Tunisian dialect that follows supposedly the same orthographic rules as MSA with some exceptions and extensions.
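To illustrate what a character-correspondence approach looks like, here is a minimal sketch of a rule-based Latin to Arabic transliterator using greedy longest-match lookup. The correspondence table is a small illustrative assumption (common Arabizi conventions, with single short vowels left unwritten) and does not reproduce the rules of Saadane et al. (2013), Guellil et al. (2018a) or the CODA convention; the names `RULES` and `transliterate` are hypothetical.

```python
# Illustrative Latin-to-Arabic correspondences (assumed, far from complete).
RULES = {
    "aa": "\u0627",  # long a -> alif (ا)
    "ch": "\u0634",  # shin (ش)
    "a": "",         # short vowel, usually not written
    "b": "\u0628",   # ب
    "t": "\u062a",   # ت
    "s": "\u0633",   # س
    "f": "\u0641",   # ف
    "q": "\u0642",   # ق
    "k": "\u0643",   # ك
    "7": "\u062d",   # Arabizi digit 7 -> ح
    "3": "\u0639",   # Arabizi digit 3 -> ع
}


def transliterate(word: str, rules: dict[str, str] = RULES) -> str:
    """Rewrite a Romanised word in Arabic script, trying longer keys first."""
    keys = sorted(rules, key=len, reverse=True)
    word = word.lower()
    out = []
    i = 0
    while i < len(word):
        for key in keys:
            if word.startswith(key, i):
                out.append(rules[key])
                i += len(key)
                break
        else:
            out.append(word[i])  # no rule applies: keep the character as is
            i += 1
    return "".join(out)


print(transliterate("ktaab"), transliterate("sqaf"))
```

With this table, the forms "ktaab" and "sqaf" from Sect. 2.1 come out as the undiacritized Arabic spellings of "book" and "roof". A purely character-based mapping of this kind cannot resolve one-to-many correspondences, which is one motivation for the learned approaches cited above.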

We summarize LRs resulting from these works in the following Table 14.

Table 14 Latin to Arabic transliterated MAD corpora

Only a few corpora of Arabic code-switching have been created, by Cotterell et al. (2014), Amazouz et al. (2017) and Samih and Maier (2016a, b).

4.5 Sentiment analysis

Sentiment analysis (also called opinion mining) is used to determine the emotional tone of a user’s language productions. It is mainly used to better grasp the perception and the opinions expressed in a user statement, and it provides an overview of users’ opinions about a given topic, especially in social media. With the increasing use of Maghrebi dialects in social media, several works have been initiated on MAD sentiment analysis in the last few years. Works that focused on a particular MAD include (Elkhlifi et al. 2014; Zarra et al. 2017; Oussous et al. 2018), which focused on the Moroccan dialect, (Mataoui et al. 2016; Rahab et al. 2017, 2018; Guellil et al. 2017b, 2018b; Soumeur et al. 2018), which dealt with the Algerian dialect, and (Sayadi et al. 2016; Ameur et al. 2016; Mdhaffar et al. 2017), which worked on the Tunisian dialect. Al-kabi et al. (2016) considered several Arabic dialects including MADs, although Maghrebi dialectal content represents only 0.3% of their corpus.

Table 15 shows MAD corpora dedicated to sentiment analysis. The column “Annotation” indicates the tags (positive, negative, …) used to annotate each unit (word or sentence).

Table 15 Annotated MAD corpora for sentiment analysis

4.6 Several annotations

A few multi-annotated corpora have been created. The corpus of Diab et al. (2010), which deals with Moroccan and other Arabic dialects, comprises morphosyntactic analysis and identification annotations. The corpus of Baly et al. (2017) deals with three MADs and includes identification and sentiment analysis annotations. The recent Algerian corpus of Abainia (2019) includes several annotations: code-switching, transliteration, translation, dialect and sub-dialect identification, gender identification, sentiment analysis, abuse detection, named entity recognition, etc.

Baly et al. (2017) first built from Twitter a corpus called MD-ArSenTD (Multi-Dialect Arabic Sentiment Twitter Dataset), a multi-dialectal dataset composed of tweets collected from 12 Arab countries including Algeria, Morocco and Tunisia. These tweets are annotated with dialect and sentiment labels, the latter combining polarity and intensity on a 5-point scale (very negative, negative, neutral, positive and very positive). Baly et al. (2017) used only the Egyptian and Emirati tweets to discover distinctive features that could facilitate the sentiment analysis task, and carried out a comparative evaluation of different sentiment models (SVM and an LSTM deep learning model) on these tweets. Table 16 shows the corresponding constructed LRs.

Table 16 Several annotated MAD corpora

5 Lexica, dictionaries and ontologies

As part of the work carried out on the automatic processing of Maghrebi dialects, a set of data LRs has been constructed, including various lexica, dictionaries and ontologies. Such resources are essential material for language studies and NLP tool development, especially for the translation task. In this category, most of the constructed MAD resources are bilingual lexica involving MADs and MSA (Meftouh et al. 2012; Boujelbane et al. 2013a, b, 2015; Hamdi et al. 2014; Tachicart et al. 2014a, b; Elmarakshy and Ismail 2015; El Abdouli et al. 2019), MADs and other languages such as English (Graff and Maamouri 2012) and French (Guellil and Azouaou 2017; Azouaou and Guellil 2017), or bilingual lexica of transliterations (Younes et al. 2015; Abidi and Smaili 2018). Multilingual lexica covering MADs and other languages were also built (Saadane et al. 2018; Bouamor et al. 2018). The main monolingual MAD lexica include a Tunisian dialect phonetic dictionary (TunDPDic) constructed by Masmoudi et al. (2014c), a lexicon for the Al-Khalil analyser (Boudlal et al. 2011) constructed by Zribi et al. (2013b), the lexicon for the morphological analyser of Torjmen and Haddar (2018a, b), the Lexical Database Adaptation for the Comp-Dial system by Neifar et al. (2014) and the Tunisian Arabic lexical dictionary of Ben Moussa et al. (2016). Lexica for sentiment analysis were also built (Mataoui et al. 2016; Ameur et al. 2016; Guellil et al. 2017b, 2018b, c).

As for ontologies involving MADs, they are very scarce and still at a preliminary stage. In reviewing the literature, we mainly identified domain ontologies related to the Tunisian dialect. These include domain ontologies built as part of works on the processing of the TD in dialogue systems (in railway stations) (Graja et al. 2011a, b; Karoui et al. 2013a, b; Graja et al. 2015). They also include the Tunisian dialect wordnet “TunDiaWN” proposed by Bouchlaghem et al. (2014), who preserved the content of AWN (the Arabic Wordnet of Elkateb et al. 2006), and the “aebWordnet” proposed by Ben Moussa et al. (2014, 2015) and Ben Moussa and Alimi (2016), which was modeled from a bilingual (Tunisian-English) dictionary. Mrini and Bond (2017) built the Moroccan Darija Wordnet (MDW) using the bilingual Moroccan-English dictionary of Harrell (1963). These various MAD resources are described in Table 17.

Table 17 MAD Lexica, dictionaries and ontologies

6 Codification and normalization

The informal nature of the dialects makes their automatic processing a challenging task. Dialects do not conform to fixed linguistic or orthographic rules and absorb new terms every day. Several researchers have therefore resorted to an intermediate step that may ease the processing of such languages, namely orthographic conventions. This was the idea behind the work of Zribi et al. (2013b), who proposed OTTA, an orthographic convention for the transcription of spoken Tunisian Arabic. They relied on MSA orthographic rules and defined new ones for the TD specificities observed in the corpus TuDiCoI (Graja et al. 2010). Another convention was developed by Zribi et al. (2014), inspired by the CODA project (a Conventional Orthography for Dialectal Arabic) (Habash et al. 2012), to create a CODA specific to the Tunisian dialect. Zribi et al. summarized the CODA goals in five points: (1) every word has a single orthographic interpretation; (2) it is created for computational purposes; (3) it uses the Arabic script; (4) it is intended as a framework for writing all the Arabic dialects; (5) it aims to strike a balance between establishing conventions based on the MSA-ArD similarities and maintaining a level of dialectal uniqueness. Under this convention, the TD is intended to follow the same orthographic rules as MSA, with some phonological, phono-lexical, morphological and lexical exceptions and extensions. Saadane and Habash (2015) and Turki et al. (2016) adapted the same convention and developed, respectively, an Algerian Arabic CODA and a Maghrebi Arabic CODA. Boujelbane et al. (2016) proposed an automatic process for normalizing spontaneously spelled Tunisian Arabic (TA) into the conventional orthography CODA (Zribi et al. 2014). This process is named COTA orthography (COnventionalized Tunisian Arabic orthography). Their approach is close to that of Eskander and Habash (2013). They showed that the rule-based and the statistical methods can reduce transcription errors. Habash et al. (2018) presented a unified set of guidelines and resources for the conventional orthography of dialectal Arabic, applied to 28 Arab city dialects including 7 Maghrebi cities (Rabat, Fes, Algiers, Tunis, Sfax, Tripoli and Benghazi). The resources of this new CODA* (Footnote 9) are all available online.

Table 18 gives a recap of the various codification and normalization conventions used for MADs.

Table 18 Codification and normalization conventions for MADs

7 Online available MAD LRs

By examining language resource management platforms such as those of the Linguistic Data Consortium (LDC) and the European Language Resources Association (ELRA), we could see that very few corpora are currently available for Maghrebi dialects compared with other languages, and also compared with MSA and the Levantine and Egyptian dialects. Simple queries (Footnote 10) for macro-Arabic (Footnote 11) resources in the LDC catalog return 164 resources, only 14 of which involve MADs. With the ELRA search engine, we found 108 resources, only 6 of which involve MADs. Table 19 shows the details.

Table 19 Available LRs in the LDC and ELRA platforms
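As a small worked computation, the shares implied by the catalog counts quoted in this section can be derived as follows. The names `CATALOG_COUNTS` and `mad_share` are introduced here for illustration only; the counts themselves are those reported in the text.

```python
# (resources involving MADs, macro-Arabic resources found), as quoted in the text.
CATALOG_COUNTS = {"LDC": (14, 164), "ELRA": (6, 108)}


def mad_share(mad: int, total: int) -> float:
    """Percentage of the catalog's Arabic resources that involve MADs."""
    return round(100 * mad / total, 1)


if __name__ == "__main__":
    for name, (mad, total) in CATALOG_COUNTS.items():
        print(f"{name}: {mad}/{total} = {mad_share(mad, total)}%")
```

MADs thus account for roughly 8.5% of the Arabic resources found in the LDC catalog and 5.6% of those found through the ELRA search engine.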

Of the various LRs generated by the existing work on MADs that we examined for this survey, only 23% are currently available online. We list these available resources in Table 20 and indicate whether they are freely downloadable.

Table 20 MAD LRs available online

Finally, in order to collect data and construct MAD LRs, some of the works examined in this study have resorted to freely available resources, which we list in Table 21.

Table 21 Other MAD available LRs

8 Discussion

8.1 Work on MAD LRs evolution

It is clear from this study that Maghrebi dialects are attracting increasing interest from NLP researchers. We should note, however, that the total number of works leading to the construction of MAD LRs that we could identify up to May 2019 is on the order of a hundred (exactly 148 works, which led to the construction of 158 different LRs, 143 of which are written LRs). This remains a very limited number in comparison with other languages.

In this discussion, we distinguish between works that focused specifically on a single dialect (AD: Algerian dialect, HD: dialect of Mauritania, LD: Libyan dialect, MD: Moroccan dialect, TD: Tunisian dialect), works which dealt with a subset of specified (named) Maghrebi dialects (MAD) and those which focused on a subset of Maghrebi dialects without specifying their nature (NS MAD: non-specified MAD).

As we can see in Fig. 1, the number of research works grew from 2010 to 2018. The figure also shows that this evolution varies from one dialect to another and that the Tunisian dialect has received the most published work over these nine years. This can also be seen in Fig. 2, which gives the breakdown of published works by target dialect. Out of the total number of identified works involving the construction of MAD LRs, 63% focused specifically on a single dialect, mainly the TD (33%), the AD (20%) and the MD (9%). On the other hand, only one recent work has resulted in a constructed LD LR (Alhammi and Alfard 2018) (1%) and, to the best of our knowledge, no work has been carried out specifically on the HD to this day. The HD was, however, treated once in the work of Sadat et al. (2014a, b), which dealt with a set of MADs including the HD. The Libyan and Mauritanian dialects are thus currently non-resourced languages. As for multi-dialectal Arabic resources including various MADs, we identified 55 related works, representing 37% of the total, 30 of which (20% of the total) do not specify the type of the considered MAD.

Fig. 1 Evolution of the works on MAD LRs’ construction

Fig. 2 Distribution of total work according to the target dialect

8.2 MAD LRs Scripts

Despite the significant presence of MAD textual productions using the Latin alphabet, especially on the social web, most of the works presented in this study dealt with Maghrebi dialects written in the Arabic script. Very few focused on the Latin script and code-switching, such as Cotterell et al. (2014), Samih and Maier (2016b), Guellil et al. (2017a, 2018b, c), and Younes et al. (2015). This is shown in Fig. 3: 67% of the related work led to the construction of MAD LRs written exclusively in the Arabic script, only 23% considered both scripts, and 10% were concerned only with MAD resources written in the Latin script.

Fig. 3 Distribution of total work according to MADs’ script

One of the main reasons for the strong focus on the Arabic script is the idea of exploiting its closeness to MSA and, accordingly, resorting to existing MSA resources and tools to construct LRs useful for processing the dialects. This approach has been adopted by several researchers, including Almeman and Lee (2012), Boujelbane et al. (2013a, b), Hamdi et al. (2013a, b) and Zribi et al. (2013a). While very useful, MAD resources in the Arabic script are far from covering the LRs needed for Maghrebi dialect NLP, especially when dealing with user-generated content on the social web. Indeed, as shown in Table 6 (Sect. 2.2), the Latin script is widely used in Maghrebi countries, even more than the Arabic script in some countries such as Tunisia, Algeria and Morocco. The Latin Maghrebi dialect, also referred to as “Romanized” or “Arabizi”, is generally vowelized, since users include letters such as “a”, “e”, “o” and “i” to denote the Arabic vowels. It is also marked by several other phenomena, such as the use of digits to replace Arabic letters with no equivalent in the Latin alphabet, and the use of abbreviations and acronyms. Processing it therefore differs from processing the Arabic-script form, especially for the morphosyntactic and identification tasks. This form of the dialect still lacks large, consistent corpora, especially for tasks that need the user’s original input, such as opinion mining. The two forms of MAD can be linked through transliteration, which can benefit from the Latin script as it includes vowels, unlike the Arabic script. Latin MAD to Arabic MAD transliteration was, however, dealt with in very few works (only in Saadane et al. 2013; Masmoudi et al. 2015; Younes et al. 2016; Guellil et al. 2017a, 2018a), although it may be a crucial task for automatically generating Arabic MAD LRs from the Latin content widely produced on social media, and helpful for many other kinds of MAD NLP applications (information retrieval, sentiment analysis, machine translation, etc.). In our opinion, Latin to Arabic transliteration of MADs is still an open task that needs to be further explored.

8.3 MAD LRs types

Works studied in this paper have all resulted in the production of various MAD data-LRs including corpora (raw or annotated), lexicons, dictionaries and ontologies. Most of them were built in a given context with the objective of a specific task or application such as translation, segmentation, morphosyntactic analysis, sentiment analysis, etc.

As regards MAD corpora, we identified 36 works that have focused on creating raw corpora which are crucial to study and process Maghrebi dialects. As shown in Fig. 4, 45% of them are speech corpora and 55% of them are written corpora, mainly collected from the web and social media or transcribed from speech.

Fig. 4 Distribution of the works on raw corpora by type

Based on the study of existing work, researchers have been interested in constructing written linguistic resources for Maghrebi dialects since 2010. One of the first approaches they followed was based on speech transcription, usually performed manually. Among these works we can cite Graja et al. (2010) and Masmoudi et al. (2014a), which were mainly carried out on the TD and some subsets of MADs (Fig. 5). Although transcription is one way to address the lack of written LRs, it is an expensive and time-consuming task that leads to relatively small corpora. Transcription is also usually carried out by a native speaker of the considered language or following a transcription convention (Masmoudi et al. 2014a) and, in both cases, the procedure results in a unique transcription of each word. The use of this kind of LR can therefore be very limited, since it may not be appropriate for dialectal content produced on social media, where both Latin and Arabic scripts are used and writing does not conform to any spelling rules or conventions. Abidi et al. (2017) illustrate this phenomenon with the word “يرحمك” (meaning “Bless you”), which can be written in 66 different ways in the Latin script. Table 22 gives other examples of MAD words written in both Latin and Arabic scripts along with their potential transcriptions.

Fig. 5 Distribution of the works on raw corpora according to the target dialect

Table 22 Examples of MAD word transcriptions

With the growth of social media use in Maghreb countries and the proliferation of written dialects on the web, several researchers have turned to the web and social media to collect MAD corpora. According to Fig. 4, their work represents 39% of all works constructing MAD raw corpora and 64% of those constructing written MAD raw corpora.

Nevertheless, to our knowledge, the intellectual property issues and legal implications that may arise from the use of social media have not been discussed by the various works that used such data sources. Legal issues may indeed arise across all stages of LR acquisition, from creation to exploitation:

  • During access to the data, when it comes to collecting the consent of the users;

  • During speech recording, where they relate to respect for private life, the will of the users and the resulting choice of what may be shown;

  • When the content is anonymized by removing all identity information;

  • When it is stored;

  • When it is exploited.

Currently, the copyright and ethical rules governing data collection from social media, and the applicable jurisdictions, are not always clear and may differ from one country to another. The most appropriate approach would be to reach an agreement with the owners of these sites as part of a partnership. Some sites are more permissive than others: it is easier to work with tweets (using the public API), whereas Facebook is somewhat more complicated even though its extraction API is public as well. We believe that ethical issues could be avoided as long as:

  • The data is extracted via a public API, with no hacking or intrusion into sites to which we are not entitled;

  • The data is extracted from public pages and not from closed groups;

  • User anonymity is respected and no information concerning their identity is extracted, revealed or kept;

  • The extracted data is exclusively intended for research in an academic setting with no commercial purpose;

  • The data is not kept by the researcher alone but is made available to the research community.
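The anonymity criterion listed above can be sketched in Python as follows. This is only an illustration under stated assumptions: the record fields (`text`, `source`, `user_id`, `user_name`), the placeholder tokens and the function `anonymize` are hypothetical and do not correspond to the output format of any particular platform API. The sketch keeps only an explicit whitelist of fields and masks URLs, e-mail addresses and user mentions inside the text.

```python
import re

# Patterns for identifying strings that may appear inside a comment.
URL = re.compile(r"https?://\S+")
EMAIL = re.compile(r"\S+@\S+")
MENTION = re.compile(r"@\w+")

# Only these (hypothetical) fields are retained; author fields are dropped.
KEPT_FIELDS = ("text", "source")


def anonymize(record: dict) -> dict:
    """Return a copy of the record without identity information."""
    text = record.get("text", "")
    text = URL.sub("<URL>", text)        # mask links first
    text = EMAIL.sub("<EMAIL>", text)    # then e-mail addresses
    text = MENTION.sub("<USER>", text)   # then @mentions
    cleaned = {key: record[key] for key in KEPT_FIELDS if key in record}
    cleaned["text"] = text
    return cleaned


if __name__ == "__main__":
    raw = {"user_id": "u1", "user_name": "x", "source": "public_page",
           "text": "salam @ali chouf https://example.org/a"}
    print(anonymize(raw))
```

Whitelisting the retained fields, rather than deleting known identity fields, is a deliberate choice in this sketch: any new field added by the data source is dropped by default. Pattern-based masking remains approximate, since names written in plain text are not detected.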

A significant proportion (39%) of the constructed raw corpora are not specific to one particular dialect but include a subset of various MADs. These resources are often constructed as part of works dealing with Arabic dialects in general and including some MADs. Some of them (22% of the total work on raw corpora) do not specify the nature (country) of the considered MADs (Mubarak and Darwish 2014; Alshutayri and Atwell 2017, etc.). The works that focused on a specific mono-dialectal corpus represent 61% of the total and, according to Fig. 5, mainly concern the TD and AD (thirteen works for each one), followed by the MD and LD (one work). The number of such raw corpora is still very limited and does not cover all Maghrebi dialects. More effort should therefore be put into collecting significant amounts of data for building large corpora, and the web and social media seem today to be the richest source of dialectal data for this purpose. These corpora are crucial for studying MADs: they allow both qualitative and quantitative analysis, the building of language models and the learning of morphology. Moreover, it is from these primary resources that we will be able to produce various annotated resources useful for specific MAD NLP applications.

By examining the literature that has resulted in annotated MAD corpora, we identified six types of annotation: identification, morphosyntactic analysis, transliteration, translation, sentiment analysis and multi-annotations. Figures 6 and 7 illustrate their distribution.

Fig. 6 Distribution of works on annotated corpora by annotation type

Fig. 7 Distribution of works on annotated corpora by annotation type and dialect

It can be observed that the majority of the studied works (61%) focused on the identification of dialects and their morphosyntactic analysis. The identification problem is crucial and arises in the automatic extraction of dialectal content from social media, which are highly multilingual. It mainly consists in distinguishing between MADs and MSA when the considered script is Arabic (Samih and Maier 2016a, b; Tachicart et al. 2017; Saadane et al. 2017, 2018), and between MADs and other languages such as French and English when the considered script is Latin (Tratz et al. 2014; Adouane et al. 2016a; Cotterell et al. 2014). Morphosyntactic analysis is one of the first essential steps in the linguistic analysis of these dialects, and it was the subject of a substantial part (26%) of the works on MADs that led to the creation of annotated corpora. We note, however, as shown in Fig. 7, that these works mainly concerned the Tunisian dialect, followed by the Algerian and Moroccan dialects, while no work on morphosyntactic analysis was identified for other Maghrebi dialects. We can also note from Figs. 6 and 7 that 37% of the works leading to annotated corpora concern the following three applications: sentiment analysis, translation and transliteration. The works on sentiment analysis are relatively recent and led to the construction of annotated corpora whose numbers and sizes are still relatively small.
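Because the identification problem described above depends on the script of the input, a first routing step can be sketched in Python using the Unicode character names provided by the standard `unicodedata` module. This is a hedged illustration, not a method taken from the cited works: the function name `script_of` and the four output labels are assumptions made for this sketch.

```python
import unicodedata


def script_of(text: str) -> str:
    """Classify a string as 'arabic', 'latin', 'mixed' or 'other'.

    A character counts as Arabic or Latin when its Unicode name starts
    with that word, so accented Latin letters such as 'é' are covered.
    Digits and punctuation are ignored.
    """
    arabic = latin = 0
    for ch in text:
        name = unicodedata.name(ch, "")
        if name.startswith("ARABIC"):
            arabic += 1
        elif name.startswith("LATIN"):
            latin += 1
    if arabic and latin:
        return "mixed"
    if arabic:
        return "arabic"
    if latin:
        return "latin"
    return "other"


if __name__ == "__main__":
    for sample in ("ماتيليشرجيتش", "matéléchargitech"):
        print(sample, script_of(sample))
```

In such a setup, Arabic-script input would be passed to a MAD versus MSA classifier and Latin-script input to a MAD versus French or English classifier, while "mixed" input signals possible code-switching within the same message.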

All translation works involved the translation of dialects into MSA, often motivated by the idea of applying MSA NLP tools to dialectal processing. As for transliteration, interest in this task is relatively recent (since 2013). Only four works were identified, and they all concerned the Latin to Arabic direction. They mainly focused on the Tunisian and Algerian dialects, given the massive use of the Latin alphabet on social networks in these two countries. Finally, in terms of the number of works that led to annotated corpora, the TD has been the most processed, while no annotated corpora were identified for the Libyan and Mauritanian dialects.

Note that only two works concern multi-annotated corpora, namely those of Baly et al. (2017) and Abainia (2019).

The multi-annotated corpora presented in this paper are still relatively small. This is due to the difficulty of building such corpora: only the team of Guellil et al. (2018b, c) proposed a system for the automatic construction of annotated sentiment analysis corpora for the AD, and only the team of Alshutayri and Atwell (2018a, b) developed an online game for Arabic dialect annotation of AD identification.

Apart from the raw corpora (oral and written), and annotated corpora, other MAD data resources were constructed as part of various recent works on Maghrebi dialects. These are resources such as lexica, dictionaries and various ontologies. Figures 8 and 9 below illustrate the distribution of these works by type of resource and concerned dialect.

Fig. 8 Distribution of works on lexica, dictionaries and ontologies by type

Fig. 9 Distribution of works on lexica, dictionaries and ontologies by type and dialect

As shown in Fig. 9, ontologies were built only for the TD. These are essentially the domain ontology TuDiCoI (Graja et al. 2011a, b; Karoui et al. 2013a, b) and the wordnets TunDiaWN (Bouchlaghem et al. 2014) and aebWordnet (Ben Moussa et al. 2014, 2015; Ben Moussa and Alimi 2016). On the other hand, monolingual and multilingual lexica and dictionaries exist for different dialects such as the TD, AD, MD and sets of MADs combined, 33 in total. Guellil et al. (2017b, 2018b, c) and Mataoui et al. (2016) for the AD, and Ameur et al. (2016) for the TD, constructed various lexica for sentiment analysis. Apart from the two bilingual dictionaries MD ↔ EN of Graff and Maamouri (2012) and AD ↔ FR of Guellil and Azouaou (2017), all identified bilingual lexica essentially present a MAD ↔ MSA correspondence. Again, efforts to translate MADs into MSA can be explained by the desire to use MSA as an intermediate language, in order to apply existing MSA NLP tools to the processing of MADs or to their translation into other languages such as French or English.

Finally, some works focused on the establishment of rules and orthographic conventions. They were mainly carried out for TD (three works), AD (one work) and MAD (one work).

8.4 MAD LRs availability

Regarding the current availability of MAD LRs on the web, we can see from Table 20 (Sect. 7) that, despite the efforts made in the various works dealing with the processing of MADs, the LRs available online (corpora, lexica, dictionaries, etc.) are still relatively limited in size and number (23% of the total resources). For example, the size of the freely available lexica varies between 1 and 18 K words. Annotated corpora do not exceed 370 K words for some, and 10 K sentences, 17 K Facebook comments or 16 K tweets for others. As for the number of freely available LRs on the web, Table 20 shows that the TD is the best served, with ten LRs available online, followed by the AD and the MD (two LRs each). Only two multilingual LRs that include the LD, and that can be browsed online, have been identified. For the HD, no available LRs have been found.

We can also notice from the studied works that some constructed resources do not have wide coverage and are limited to a particular domain. For example, the corpus built by Graja et al. (2010), Karoui et al. (2013a, b) and Neifar et al. (2014) included interactions between staff and clients in railway stations. The content was limited to the vocabulary of booking, tickets, times and prices. Another work focused on a limited vocabulary (Hassine et al. 2016) and targeted the pronunciation of the digits from 0 to 9. Guellil and Azouaou (2016b) also crawled the content of a single Facebook page, namely that of an Algerian phone operator.

Building resources of significant size and coverage and making them available therefore remains a priority for the study and processing of the various Maghrebi dialects, especially since the web (the social web in particular) is now an important source of dialectal data.

8.5 Main used approaches in works on MADs

Several approaches were followed for corpus and lexicon creation. For speech corpora, researchers recorded speech and real conversations between people, and extracted speech from web-streamed local radio channels, TV channels and YouTube channels. Regarding textual corpora, some works were based on speech transcription. Others focused on data collection from social media (Facebook, Twitter, user comments on YouTube videos, blogs, forums, etc.) and online newspaper websites. Given the wide use of social media in Maghreb countries and the richness of these platforms in dialectal content, they remain the most used source for data collection. They also present an opportunity to build sizable language resources.

To construct lexicons and dictionaries, the majority of researchers started from the corpora in order to cover real and authentic vocabulary.

Regarding annotation, most of the work was performed manually. Only a few works followed an automatic approach, given the difficulty of the tasks and the informality of the MADs, which require human intervention in all cases.

The automatic processing of Maghrebi dialects has relied on MSA corpora and tools in several works. We can cite Boujelbane et al. (2013a), who exploited an existing MSA lexicon to create TD LRs; Almeman and Lee (2012), who adapted an MSA morphological analyser to deal with Arabic dialects; and Harrat et al. (2014, 2016), who built an Algerian dialect dictionary by exploiting MSA characteristics. Indeed, processing the Maghrebi dialects is not a trivial task, as they do not conform to specific syntactic rules or orthographic conventions. Researchers have therefore resorted to MSA to exploit its closeness to the dialects, since they originally derive from it. However, several problems may arise with this approach. The main issue relates to the etymology of dialectal words. As mentioned in Sect. 2, Maghrebi dialects are characterized by linguistic interference between MSA and the neighbouring languages, and by the influence of colonization (e.g. French), migration (e.g. Spanish, Turkish) and neo-cultures (e.g. Italian). Thus, many words used in Maghrebi dialects are not derived from MSA. As the examples in Table 23 show, many user-generated words in the Maghrebi dialects have no relation to MSA and derive from a completely foreign language.

Table 23 Examples of MAD words and their origins

Moreover, as multilingualism in the Arab world keeps growing, the Maghrebi dialects are undergoing remarkable changes. Many Maghrebi dialect words are built on foreign roots to which new prefixes and suffixes are added. Some of these foreign roots also undergo morphological change and result in new dialectal words that can be conjugated. For example, the word “ماتيليشرجيتش” (“matéléchargitech”), meaning “I didn’t download it”, originates from the French verb “télécharger” (to download), to which are added the prefix “ما” (“ma”) and the suffix “يتش” (“itech”). Relying entirely on the analogy with MSA is therefore not appropriate, since the changes that the Maghrebi dialects undergo with the emergence of new terms from foreign languages cannot be bounded. Almeman and Lee (2012) show that the MSA morphological analyser could analyse only 32% of the dialectal words. Building resources and tools specifically dedicated to dialect processing thus seems essential.
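The decomposition of the example word above can be reproduced with a minimal Python sketch. It only handles the single prefix and suffix pair given in the example; the function name `split_affixes` and its default arguments are hypothetical, and a real morphological analyser for MADs would need a full affix inventory and disambiguation.

```python
from typing import Optional, Tuple


def split_affixes(word: str, prefix: str = "ma",
                  suffix: str = "itech") -> Optional[Tuple[str, str, str]]:
    """Split a Latin-script word into (prefix, stem, suffix).

    Returns None when the word does not carry both affixes or when
    nothing would remain as a stem.
    """
    if not (word.startswith(prefix) and word.endswith(suffix)):
        return None
    stem = word[len(prefix):len(word) - len(suffix)]
    if not stem:
        return None
    return prefix, stem, suffix


if __name__ == "__main__":
    print(split_affixes("matéléchargitech"))
```

The recovered stem, "télécharg", is the root of the French verb "télécharger", which an analyser built on MSA lexica has no entry for; this is the kind of form that motivates dialect-specific resources.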

Some NLP tasks tackled in the studied works, such as dialect identification, sentiment analysis, labeling and transliteration, used various machine learning methods (relying notably on NB, SVM, CRF, etc.). On the other hand, we found that approaches based on deep learning, currently popular in the NLP field, are still very little used in work on MADs. This could be explained, among other reasons, by the unavailability of the large resources these methods require to be effective.
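As an illustration of the kind of machine learning method mentioned above, the following Python sketch implements a multinomial Naive Bayes (NB) classifier over character trigrams with add-one smoothing, applied to language identification in Latin script. It is not the implementation of any cited work: the class `NaiveBayes`, the helper `char_ngrams` and the four training sentences in `TOY_TEXTS` are made-up toy material used only to show the mechanics, and no conclusion about accuracy should be drawn from them.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text: str, n: int = 3) -> list:
    """Character n-grams of the lower-cased text, padded with spaces."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NaiveBayes:
    """Multinomial Naive Bayes over character trigrams (add-one smoothing)."""

    def fit(self, texts, labels):
        self.label_counts = Counter(labels)
        self.ngram_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.ngram_counts[label].update(char_ngrams(text))
        self.vocab = {g for c in self.ngram_counts.values() for g in c}
        return self

    def _log_score(self, text, label):
        n_docs = sum(self.label_counts.values())
        denom = sum(self.ngram_counts[label].values()) + len(self.vocab)
        score = math.log(self.label_counts[label] / n_docs)  # log prior
        for gram in char_ngrams(text):
            score += math.log((self.ngram_counts[label][gram] + 1) / denom)
        return score

    def predict(self, text):
        return max(self.label_counts, key=lambda lab: self._log_score(text, lab))


# Made-up toy examples (hypothetical, for illustration only).
TOY_TEXTS = ["wach rak khouya", "ma3andich flous",
             "comment allez vous", "je ne sais pas"]
TOY_LABELS = ["MAD", "MAD", "FR", "FR"]

if __name__ == "__main__":
    clf = NaiveBayes().fit(TOY_TEXTS, TOY_LABELS)
    print(clf.predict("je ne sais pas"))
```

Character n-grams are used here rather than words because spelling in Latin-script dialectal content is highly variable, so sub-word units are more likely to recur across spelling variants.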

9 Conclusion

In this study, we proposed a detailed review of the different kinds of LRs that have been generated as part of various work carried out on the automatic processing of Maghrebi dialects. Our objective was to provide a clear picture of what progress has been made towards constructing LRs for the MADs’ NLP and their availability to researchers wishing to work in this field.

From this study, we can conclude that the Maghrebi dialects are receiving increasing interest from the NLP community and that the TD remains the most studied dialect, with 33% of the presented works. Indeed, the first works on Maghrebi dialects were carried out on the TD (e.g. Iskra et al. 2004), which led to the early availability of LRs for this dialect compared with the other MADs and, in turn, encouraged researchers to study the TD further.

Another important conclusion concerns the writing systems of the constructed LRs. Most of the presented studies were carried out on the Arabic script, which was targeted in 67% of the works, against 10% for the Latin script and 23% for both Arabic and Latin. The significant focus on the Arabic script is due to the exploitation of available MSA resources, given the lack of dialectal content that would allow further studies on MADs. We believe that the MADs are still under-resourced when it comes to the Latin script and that future work should focus more on Arabizi.

Regarding the methods used for LR construction, we noticed that researchers are increasingly turning to social media (39% of the MAD raw corpora), since they encompass rich and varied dialectal content.

In terms of availability, only 23% of the constructed resources are accessible online, which is a low proportion of the total.

Indeed, this survey clearly demonstrated that in recent years, MAD language processing has been generating an increasing interest among NLP researchers. However, despite these recent efforts, large and freely available MAD NLP dedicated-LRs are still lacking and Maghrebi dialects in general are still among low-resource languages. Some of them, such as the Libyan and Mauritanian dialects are non-resourced languages.

Currently, and given the wide use of social networks in Maghreb countries, the social web seems to be a rich source of dialectal content, that could be utilized to collect data and build large linguistic resources. The efforts to be made in constructing the required resources must, however, take into account the linguistic specificities of these dialects and their informality especially when they are spontaneously produced on the web.

It would be insightful to better consider the written form of these dialects in the Latin script, which is increasingly present on the web. This form generally including vowels, unlike the form transcribed in Arabic, could, thanks to transliteration, contribute significantly to the automatic generation of dialectal Arabic resources and be beneficial for many tasks (such as morphological and syntactic analysis) as well as applications, such as machine translation and sentiment analysis.

It is also essential that the constructed LRs be made available to researchers in order to make significant progress in the study and processing of these dialects, in particular by enabling the use of deep learning techniques, which have so far been very little explored in MAD NLP.