
1 Introduction

Diacritizing written Arabic text is crucial for many NLP tasks; translation is one of a long list of applications that benefit substantially from automatic diacritization [13]. Arabic diacritics are superscript and subscript marks (vocalization or voweling), defined as the full or partial representation of short vowels, shadda (gemination), nunation, and hamza [4]. Diacritization helps the reader disambiguate the text or simply articulate it correctly, because Arabic is a language in which the intended pronunciation of a written word cannot be completely determined by its standard orthographic representation; it depends instead on a set of special diacritics. The absence of these diacritics in Arabic text increases lexical and morphological ambiguity, because one written form can have several vocalizations, and each vocalization may have a different meaning or meanings [5, 6]. However, these diacritics are generally left out in most genres of written Arabic, which results in widespread ambiguity in vocalization and meaning.

Although native speakers can disambiguate the intended meaning and pronunciation from the surrounding context with minimal difficulty, this is not the case for automatic processing of Arabic, which is often hampered by the lack of diacritics. Several applications can benefit considerably from automatic diacritization, such as text-to-speech (TTS), part-of-speech (POS) tagging, word sense disambiguation (WSD), and machine translation [6].

Much work has been done on Arabic diacritization. The implemented systems can be divided into two categories [7]: systems implemented by individuals as part of their academic activities, and systems implemented by commercial organizations for market applications. An advantage of the first type is that such systems present some good ideas as well as some formalization; their weak point is that they are mostly partial demo systems [7]. Examples of these systems are Vergyri and [8–11]. For the second category, the most representative commercial Arabic morphological processors are Sakhr, Xerox, and RDI [7]. Other available systems include the Mishkal Arabic diacritizer (Footnote 1) and the Harakat Arabic diacritizer (Footnote 2), both free Arabic diacritizers available online. Finally, in March, Google launched a new Google Labs Arabic tool called Tashkeel, which adds the missing diacritics to Arabic text. Unfortunately, the tool is no longer available.

Another system [12] integrates three techniques, each with its own strengths and weaknesses: lexicon retrieval, a diacritized bigram model, and an SVM-based statistical diacritizer. Most of the approaches cited above use sequence modeling techniques that draw on varying degrees of knowledge, from shallow letter and word forms to deeper morphological information. None of the previous systems makes use of syntax, with the exception of [13], which integrates syntactic analysis; however, it is not rule based. In this paper, Alserag, an Arabic diacritizer, is proposed. Alserag is based on several steps: retrieval of unambiguous lexicon entries; a morphological analysis step that disambiguates between the stored candidate solutions for each word to determine its internal diacritization (the system tokenizes a text, provides a solution for each token, and restores the appropriate internal diacritics from the dictionary); a syntactic processing step, based on shallow parsing, that is responsible for case ending detection; and finally a morpho-phonological step that fulfills the requirements of vowel harmony and assimilation. Section 2 presents the system architecture. Section 3 explains the modules applied to fully diacritize texts. Section 4 evaluates the output and discusses the results and the benchmarking process. Finally, Sect. 5 concludes the paper.

2 System Architecture

In this system, a rule-based approach was adopted. This section describes the processes that convert plain text into fully diacritized text. Figure 1 presents the system’s overall architecture, in which diacritization is achieved through seven main phases: (i) preprocessing, which auto-corrects the raw text and segments the Arabic text into sentences; (ii) tokenization, which splits the natural language input into lexical items; (iii) disambiguation, which chooses the right internal diacritization for each word from the dictionary; (iv) named entity recognition (named entities are stored in the dictionary and were obtained from the UNLarium (Footnote 3) [14]); (v) syntactic shallow parsing, which analyzes a sentence by identifying its constituents (NPs, JPs, etc.); (vi) the case ending module, which predicts the arguments of the predicate and assigns the diacritical marks attached to the ends of words to indicate their grammatical function; and (vii) the morpho-phonological module, a series of rules that handle the sound changes that take place in morphemes (minimal meaningful units) when they are combined to form words.

Fig. 1. Architecture of the Alserag system.

Two engines are used in Alserag. The first is the Interactive ANalyzer (IAN) (Footnote 4), which is used in the analysis process and includes a grammar for natural language analysis; syntactic processing is done automatically through this grammar. The second is the dEep-to-sUrface natural language GENErator engine (EUGENE) (Footnote 5), which is used in the generation process: it receives the analyzed input and provides a diacritized output without any human intervention. For more details, see [15].

3 Development of the System Resources

Alserag depends on two resources: the Arabic diacritized dictionary and a set of linguistic rules. Each is described in detail in the following subsections.

3.1 Dictionary

The Arabic diacritized dictionary contains Arabic words with their diacritics, along with the corresponding linguistic features that describe each word morphologically, syntactically and semantically. For example, the entry for the Arabic word ‘ktb’ ‘write’ includes its diacritized form ‘kataba’ and a list of linguistic features such as part of speech, tense, transitivity, person, gender, number, etc. The words in the Arabic diacritized dictionary are extracted from the Arabic dictionary in UNLarium. The process of diacritizing the entries depends mainly on two resources: BAMA (the Buckwalter Arabic Morphological Analyzer) and the Alkhalil Arabic Morphological Analyzer.

The diacritizing process begins with Buckwalter’s analysis. Some words have only one solution, others have more than one, and some cannot be analyzed by Buckwalter. The latter are analyzed by Alkhalil, which also suggests several solutions for some words. These words are then verified manually to select their correct diacritization. Not all entries of the Arabic diacritized dictionary are fully diacritized. Nouns, adjectives, and subjunctive and indicative verbs are partially diacritized, since their case endings depend on the context. By default, a present tense verb is marked by a short /o/ ; in this case it is called indicative . However, if a present tense verb is preceded by certain particles, the verb is marked by a short /a/ , and if the verb ends with one of the three suffixes , the final is deleted; in this case it is called subjunctive . In contrast, imperative and past verb forms are fully diacritized, because their endings are not affected by the context. Some enhancements had to be made to the Buckwalter solutions. For example, some solutions have a missing vocalization before as in ‘EAlim’ ‘scientist’ and ‘makotabAt’ ‘libraries’, so these missing vocalizations were added manually.

3.2 The Linguistic Rules

Alserag depends on three modules to provide fully diacritized Arabic words: the morphological analysis module, the syntactic analysis module and the morpho-phonological processing module.

Morphological analysis: this module is responsible for the morphological analysis of Arabic words and for assigning the correct POS and internal diacritization, which is achieved through two processes: tokenization and disambiguation. Before tokenization begins, however, a preprocessing phase runs over the string stream to fix the most common spelling mistakes, if needed. First, the tokenization algorithm is based on the entries of the dictionary. It proceeds from left to right, trying to match the longest possible string with dictionary entries. The process starts with preventing joined lexical items. Then, it identifies the different suffixes and prefixes that could be attached to each lexical category. Disambiguation rules apply over the natural language list structure to constrain word selection and to correctly disambiguate the POS. They have the following format: (node 1)(node 2)(…)(node n) = P; where (node 1), (node 2) and (node n) are nodes, and P is an integer expressing the possibility of occurrence. The engine is able to tokenize some words correctly and assign them the correct POS automatically, based on the dictionary. However, the larger the number of entries in the dictionary, the greater the ambiguity during tokenization. For example, the sequence ‘by the floods’ would be automatically segmented as ‘bAl’ (worn) + ‘fayaDAnAt’ (floods) according to the longest match algorithm, given that the dictionary includes [ART ] ‘Al’ (the), [ADJ ] ‘bAl’ (worn), ‘bi’ (by) and [N ] ‘fayaDAnAt’ (floods).

The rule in (1) states that an adjective can only be followed by a blank space (BLK) or a suffix (SFX), or occur at the end of the sentence (STAIL), where (^) means not. So, will be chosen as the appropriate combination.

  (1)

    (ADJ)(^SFX,^BLK,^STAIL) = 0;
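The interaction between longest-match tokenization and rule (1) can be illustrated with a minimal Python sketch. It is not the IAN engine: the toy lexicon, its undiacritized surface keys (e.g. `fyDAnAt`), and the function names `allowed` and `tokenize` are assumptions made for illustration only.

```python
# Toy lexicon (hypothetical): undiacritized surface form -> (POS, diacritized form).
LEXICON = {
    "bAl": ("ADJ", "bAl"),
    "b": ("P", "bi"),
    "Al": ("ART", "Al"),
    "fyDAnAt": ("N", "fayaDAnAt"),
}


def allowed(prev_pos, next_pos):
    """Rule (1): an ADJ may only be followed by SFX, BLK or STAIL."""
    if prev_pos == "ADJ" and next_pos not in ("SFX", "BLK", "STAIL"):
        return False
    return True


def tokenize(word, prev_pos=None):
    """Return the first valid segmentation, trying the longest match first.

    Returns None when no segmentation satisfies the rule.
    """
    if not word:
        # End of the string is treated as STAIL in this sketch.
        return [] if allowed(prev_pos, "STAIL") else None
    for end in range(len(word), 0, -1):
        entry = LEXICON.get(word[:end])
        if entry is None:
            continue
        pos, diacritized = entry
        if not allowed(prev_pos, pos):
            continue  # the rule gives this adjacency a possibility of 0
        rest = tokenize(word[end:], pos)
        if rest is not None:
            return [(diacritized, pos)] + rest
    return None  # backtrack: no valid continuation from here


print(tokenize("bAlfyDAnAt"))
```

In this sketch the longest match ‘bAl’ (ADJ) is tried first, rejected because a noun would follow it, and the tokenizer backtracks to ‘bi’ + ‘Al’ + ‘fayaDAnAt’.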

If words have spelling mistakes or undergo morpho-syntactic changes, rules investigate the morphological form of those words. For example, the most common mistake in Arabic writing is a wrong /Hamza/ in the initial position, as in “>anotaziE”. Rules investigate the morphological pattern of the misspelled word using regular expressions. For example, if a five-letter word begins with the sequence , the wrongly written /Hamza/ is changed to the correct , according to Arabic grammar, by the rule in (2), as in the pattern //. The correct diacritized form “>inotazaE” is then retrieved from the dictionary.

  (2)

    ,^Hamza_modified)(%y,PUT): = (“1 >”, Hamza_modified)(%y);

Second, disambiguation is concerned with preventing wrong automatic lexical choices and obtaining the correct internally diacritized words. Two kinds of linguistic indicators can help resolve lexical ambiguity: morphological indicators and adjacency indicators.

Morphological indicators: affixation plays an important role as the first level of part-of-speech disambiguation, since prefixes and suffixes are the smallest processing units that rules can begin with. Prefixes can serve as indicators for determining correct lexical choices. For example, in the word ‘liDafoE’, the noun ‘DafoE’ (push) is chosen instead of the verb (V) ‘DafaE’ (to push) by the rule in (3), since it is preceded by the preposition ‘li’ (to).

  (3)

    (P)(V) = 0;

Adjacency indicators: after the POS has been disambiguated at the word level, the adjacent word comes into play as the second level of disambiguation. At this level, part-of-speech disambiguation can be controlled by several qualifiers.

Number and gender qualifiers: as in (and they call). According to the longest match algorithm, the engine automatically chooses the noun ‘wahom’ (illusion). However, because it is followed by a plural verb ‘yusam~uwna’ (they call), and subject and verb should agree in number and gender in Arabic, this tokenization is rejected and the word is retokenized as ‘wa’ (and) + ‘hum’ (they) by the rule in (4).

  (4)

    (SHEAD) (,%x) (BLK) (V, ^NUM = %x) = 0;

Functional word qualifier: particles can be used as indicators for disambiguating the part of speech, as some particles occur with verbs and others with nouns. For example, the particle ‘>ay~’ (any) is a noun particle. Therefore, in the combination (any condition), the rule in (5) rejects the word if it is chosen as the verb ‘$ar~aTa’ (slit), since it is preceded by the particle (PTC) ‘>ay~’ (any). The engine then backtracks to the noun ‘$aroT’ (condition).

  (5)

    (PTC, ) (BLK) (V) = 0;

The co-occurrence of specific words with words that carry specific semantic features is also used as an indicator. The word has different internal diacritizations depending on its meaning, such as ‘tuqoliE’ ‘take off’, with the semantic feature motion (MOT), and ‘taqolaE’ ‘strip’, with the semantic feature contact (CTC). If the verb ‘taqolaE’ ‘strip’ is followed by a noun such as ‘TA}irap’ ‘airplane’, which has the semantic feature artifact (ARF; nouns denoting man-made objects), the rule in (6) rejects ‘strip’ in favor of ‘tuqoliE’ ‘take off’.

  (6)

    (V, SEM = CTC) (BLK)(ART)(N, SEM = ARF) = 0;
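A disambiguation rule of the form (node 1)(node 2)(…)(node n) = 0 can be read as a pattern over adjacent nodes that vetoes a candidate reading. The following Python sketch shows one possible reading of rule (6). The feature dictionaries, the `POS`/`SEM`/`form` keys, and the helper names `matches`, `rejected` and `choose` are illustrative assumptions, not the IAN rule syntax or API.

```python
def matches(node, condition):
    """A node satisfies a condition when every required feature has the required value."""
    return all(node.get(feature) == value for feature, value in condition.items())


def rejected(nodes, rule):
    """True when the rule pattern matches some contiguous window of nodes."""
    size = len(rule)
    return any(
        all(matches(node, cond) for node, cond in zip(nodes[start:start + size], rule))
        for start in range(len(nodes) - size + 1)
    )


def choose(candidates, context, rules):
    """Return the first candidate reading that no zero-possibility rule rejects."""
    for candidate in candidates:
        if not any(rejected([candidate] + context, rule) for rule in rules):
            return candidate
    return None


# Rule (6): (V, SEM = CTC)(BLK)(ART)(N, SEM = ARF) = 0;
RULE_6 = [
    {"POS": "V", "SEM": "CTC"},
    {"POS": "BLK"},
    {"POS": "ART"},
    {"POS": "N", "SEM": "ARF"},
]

# Two stored readings of the same undiacritized verb, in dictionary order (assumed).
candidates = [
    {"form": "taqolaE", "POS": "V", "SEM": "CTC"},  # 'strip'
    {"form": "tuqoliE", "POS": "V", "SEM": "MOT"},  # 'take off'
]
# Following context: blank, definite article, 'airplane'.
context = [
    {"POS": "BLK"},
    {"form": "Al", "POS": "ART"},
    {"form": "TA}irap", "POS": "N", "SEM": "ARF"},
]

print(choose(candidates, context, [RULE_6])["form"])
```

The same matcher covers purely categorial rules such as (3), which in this representation would be `[{"POS": "P"}, {"POS": "V"}]`.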

Syntactic analysis: shallow parsing is necessary for case ending assignment. Transformation rules have been developed to group words under the different phrasal categories. The rules follow the very general formalism α: = β; where the left side α is a condition statement and the right side β is an action to be performed over α. Phrasal grouping is necessary for identifying the sentence components and linking them through the predicate. The functions of the sentence components can then be identified and assigned the suitable case ending. This process is illustrated below. Rules were developed to syntactically mark the phrasal units of the partially diacritized sentence in (7).

  (7)

‘wali*`lika lam taboEavo Ald~irAsap Almadorasiy~ap litAriyx AlfarAEinap ayo $awoq bay~na AlT~alobap >aw Alxir~iyjiyna lilAisotizAdap’

‘Therefore, the school study for the Pharaohs history did not provoke any urge between the students or the graduates to increase.’

In sentence (7), different NP structures are established. The first is established by the rule in (8a), which combines the definite article ‘the’ and the following noun to project a noun phrase (NP): ‘Ald~irAsap’ ‘the school study’, ‘AlfarAEinap’ ‘the Pharaohs’, ‘AlT~alobap’ ‘the students’ and ‘Alxir~iyjiyna’ ‘the graduates’. The second NP structure is formed by the rule in (8b), which combines the indefinite noun ‘tAriyx’ ‘history’ and the NP ‘the Pharaohs’; the composed NP is automatically assigned the features of its head, such as gender, number, animacy and semantic class, that are necessary to describe the NP. The third NP structure consists of two coordinated elements and a conjunction, ‘AlT~alobap >aw Alxir~iyjiyna’ ‘the students or the graduates’, and is formed by the rule in (8c). Moreover, an adverbial phrase (AP) consisting of the adverb ‘bay~na’ ‘between’ and the coordinated NP ‘AlT~alobap >aw Alxir~iyjiyna’ is established by the rule in (8d). The AP ‘between the students or the graduates’ is, however, considered an optional argument in sentence (7). Next, prepositional phrases (PPs) are established: the two previously composed NPs ‘>alisotizAdap’ and ‘tAriyx AlfarAEinap’ are combined with the preceding preposition ‘li’ (to) to form the PPs ‘to increase’ and ‘for the Pharaohs history’ by the rule in (8e).

  (8)

    (a) (ART, %a)(%y, N): = ((%a)(%y), NP, ANI = %y, GEN = %y, NUM = %y, SEM = %y);

    (b) (^ART, %a)(%y, N)(NP, %x): = (%a)((%y, np)(%x), NP, ANI = %y, GEN = %y, NUM = %y, SEM = %y);

    (c) (NP, %a)(%y, COO)(NP, %x): = ((%a, np)(%y)(%x), NP, ANI = %a, GEN = %a, NUM = %a);

    (d) (ADV, %a)(NP, %x)(%j): = ((%a)(%x), AP)(%j);

    (e) (%x, P, ^pp)(%n, NP): = ((%x, pp)(%n), PP);
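The grouping performed by rules such as (8a), (8b) and (8e) can be pictured as repeated rewriting of adjacent nodes until no rule applies. The Python sketch below is a simplified illustration under stated assumptions: it handles only these three patterns, it does not percolate head features (ANI, GEN, NUM, SEM), and the function name `chunk` and the (label, text) tuple representation are hypothetical.

```python
def chunk(tokens):
    """Group (label, text) pairs into phrases by repeated leftmost rewriting."""
    nodes = list(tokens)
    merged = True
    while merged:
        merged = False
        for i in range(len(nodes) - 1):
            (left, left_text), (right, right_text) = nodes[i], nodes[i + 1]
            previous = nodes[i - 1][0] if i > 0 else None
            if left == "ART" and right == "N":
                label = "NP"  # cf. (8a): article + noun
            elif left == "N" and right == "NP" and previous != "ART":
                label = "NP"  # cf. (8b): indefinite noun + NP
            elif left == "P" and right == "NP":
                label = "PP"  # cf. (8e): preposition + NP
            else:
                continue
            nodes[i:i + 2] = [(label, left_text + " " + right_text)]
            merged = True
            break  # restart the scan after every rewrite
    return nodes


tokens = [("P", "li"), ("N", "tAriyx"), ("ART", "Al"), ("N", "farAEinap")]
print(chunk(tokens))
```

On the tokens of ‘litAriyx AlfarAEinap’ the sketch first builds the NP ‘Al farAEinap’, then the NP ‘tAriyx Al farAEinap’, and finally a single PP.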

The syntactic functions of the predicate arguments must be identified after the shallow parsing stage in order to assign case endings. In (7), the arguments of the verb are identified as described below.

Verbs and their Arguments Diacritization: The sentence in (7) contains a verb, which is considered the core of the sentence, since it is the verb that answers the three most important elements of any message: the what, the who and the when. The verb is also important in the diacritization process because it decides the case endings of the sentence elements. In sentence (7), the verb ‘taboEavo’ ‘provoke’ is a transitive verb that requires two arguments, one functioning as the subject, ‘Ald~irAsap’ ‘study’, and the other as the object, ‘ayo’ ‘any’. After the phrasal constructions are identified, grammar rules assign the function and the case ending of the verb arguments, as in the rule in (9a). The rule states that if a verb is followed by two noun phrases (in (9a), separated by a prepositional phrase) and there is gender agreement between the verb and the first noun phrase (NP, GEN = %v), this noun phrase is considered the subject of the verb (SBJ) and the second is considered the object (OBJ). Once the functions of the arguments have been determined, a case ending is assigned to each noun phrase: the nominative case (NOM) is assigned to the subject and the accusative case (ACC) to the object.

  (9)

    (a) (V, TSTD, %v)(NP, GEN = %v, ^CAS, %n)(PP, %a)(NP, ^CAS, %n2): = (%v)(SBJ, CAS = NOM, %n)(%a)(OBJ, CAS = ACC, %n2);

    (b) (, %a)(%x, PRS, ^MOO): = (%a)(MOO = JUS, %x);
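As a rough illustration of how a rule like (9a) assigns functions and cases, the Python sketch below scans a list of chunks for the pattern V, NP, PP, NP. It is a simplification under stated assumptions: the TSTD feature of the original rule is omitted, the gender values `FEM` and `MCL` are hypothetical labels chosen for the example, and the dictionary representation and the function name `assign_verb_arguments` are not part of the system described here.

```python
def assign_verb_arguments(chunks):
    """Apply a simplified version of rule (9a) to a list of chunk dicts, in place."""
    for i in range(len(chunks) - 3):
        verb, first, pp, second = chunks[i:i + 4]
        if (
            verb["label"] == "V"
            and first["label"] == "NP"
            and pp["label"] == "PP"
            and second["label"] == "NP"
            and first.get("GEN") == verb.get("GEN")  # gender agreement (GEN = %v)
            and "CAS" not in first                   # ^CAS: no case assigned yet
            and "CAS" not in second
        ):
            first.update(FUN="SBJ", CAS="NOM")   # subject takes the nominative
            second.update(FUN="OBJ", CAS="ACC")  # object takes the accusative
    return chunks


# Chunks of sentence (7) after shallow parsing; gender values are assumed.
sentence = [
    {"label": "V", "text": "taboEavo", "GEN": "FEM"},
    {"label": "NP", "text": "Ald~irAsap Almadorasiy~ap", "GEN": "FEM"},
    {"label": "PP", "text": "litAriyx AlfarAEinap"},
    {"label": "NP", "text": "ayo $awoq", "GEN": "MCL"},
]
assign_verb_arguments(sentence)
for item in sentence:
    print(item["text"], item.get("FUN"), item.get("CAS"))
```

The `"CAS" not in ...` checks mirror the ^CAS conditions, so a phrase that has already received a case is not reassigned.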

In the rule in (9a), the nominative and the accusative cases are assigned to the heads of the two composed NPs; the words (CAS = NOM) and (CAS = ACC). The rule in (9b) assigns the jussive (JUS) mood to the verb , because it is preceded by a jussive particle, as illustrated in (10).

  (10)

The modifiers are diacritized accordingly. Genitives such as “muDAf >ilayhi” and the constituents after prepositions do not depend on the case ending of the preceding elements. In sentence (7), the genitive case (GNT) is assigned to genitives, as , and , and .

Adjectives, coordinated elements and nouns in apposition are assigned the same case ending as the preceding element. The adjective ‘Almadorasiy~ap’ is assigned the same case ending as the preceding noun ‘Ald~irAsap’, so it is assigned the nominative case . As for coordinated elements, as in , the case of the NP , which is genitive, is assigned to the NP . However, in Arabic, a masculine plural noun ending with the suffix does not permit the genitive case marker (kasra); its genitive case is marked by fatha ‘a’. The final diacritization of sentence (7) is given in (11).

  (11)

Nominal sentences Diacritization: The nominative case is directly assigned to the topic of the sentence (the noun or noun phrase at the beginning of the sentence), because it is considered ‘mobtadaa’. Since Arabic is a free word-order language, the comment may precede the topic in nominal sentences, as in the sentence in (12).

  (12)

    ‘A house in the garden’

The case of the topic ‘A house’ can be detected by the system when the prepositional phrase ‘in the garden’ precedes it, by the rule in (13).

  (13)

    (SHEAD, %x)(%c, PP)(NP, ^CAS, %y): = (%x)(%c)(%y, mobtadaa, CAS = NOM);

The rule in (13) states that if a prepositional phrase at the beginning of the sentence is followed by a noun phrase , where (SHEAD) means beginning of the sentence, this noun phrase is assigned the nominative case (NOM).

However, the case of these nominal phrases changes if Anna or one of her sisters precedes them. In the example ‘<in~a fiy AlHadiyqap bayotAF’, the NP ‘bayot’ ‘house’ becomes accusative.

Morpho-phonological process: Many morpho-phonological alternations occur in Arabic; because of the concatenative nature of Arabic morphology, interaction between morphological and phonological processes is usual. There are two cases where a morpho-phonological change is necessary: vowel harmony and assimilation. Vowel harmony takes place in the diacritization process (i.e. it is phonological). For example, a morpho-phonological rule is necessary for ‘lahu’ ‘for him’, which consists of the two morphemes ‘li’ + ‘hu’, to change the vowel ‘i’ in to ‘a’ so that it is more harmonious with the vowel ‘u’ on the suffix . Moreover, the Arabic phonological system does not permit the “moon letters” to assimilate with the /l/ of the definite article ‘Al’ ‘the’, as they are not close to it in place of articulation, but the /l/ can assimilate with the other Arabic letters, which are called “sun letters”. When the definite article is followed by a sun letter, the /l/ of the definite article assimilates to the initial consonant of the following noun, resulting in a (phonologically) doubled consonant, which is expressed orthographically by putting a shaddah on the consonant after . For example, the word ‘the morning’ is diacritized as ‘AlsabAH’ before the morpho-phonological rule applies. A rule then adds a shaddah before the vowel if the diacritical mark is on a sun letter, so the word is diacritized as ‘Als~abAH’.
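The sun-letter rule can be sketched over Buckwalter transliteration, where ‘~’ stands for shaddah. The Python snippet below is only an illustration and not the system’s morpho-phonological grammar: the function name `assimilate_article` is hypothetical, and the sketch assumes that the tokenizer has already established that the initial ‘Al’ is the definite article.

```python
# The fourteen sun letters in Buckwalter transliteration.
SUN_LETTERS = set("tvd*rzs$SDTZln")


def assimilate_article(word):
    """Insert a shaddah (~) after a sun letter that follows the article 'Al'.

    Assumes `word` has already been analyzed as article + noun.
    """
    if word.startswith("Al") and len(word) > 2 and word[2] in SUN_LETTERS:
        if not word[3:].startswith("~"):  # do not double an existing shaddah
            return word[:3] + "~" + word[3:]
    return word


print(assimilate_article("AlsabAH"))
print(assimilate_article("Alqamar"))
```

Words beginning with a moon letter, such as the illustrative ‘Alqamar’, are returned unchanged.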

4 Evaluation and Benchmarking

The corpus was selected from the International Corpus of Arabic (ICA). The selected corpus contains 400,000 Modern Standard Arabic words, divided into 300,000 words of tuning data and 100,000 words of testing data. The selected texts come from different sources (newspapers, net articles and books) and represent the following genres: politics: 148,211, miscellaneous: 100,253, child stories: 57,174, economy: 34,930, society: 32,955 and sports: 26,477.

The results were evaluated automatically for accuracy against a reference, a text fully diacritized by an Arabic linguist, using the following two metrics: diacritization error rate (DER), the proportion of characters with incorrectly restored diacritics, and word error rate (WER), the percentage of incorrectly diacritized white-space-delimited words; a word is counted as incorrect if at least one of its letters has a diacritization error.

These two metrics were calculated under the following conventions: (1) all words are counted, excluding numbers and punctuation marks; (2) each letter in a word is a potential host for a set of diacritics; and (3) all diacritics on a single letter are counted as a single binary (True or False) choice. Moreover, a target letter that is not diacritized is also taken into consideration when the output is compared to the reference.
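The two metrics can be computed along the following lines. This Python sketch is not the evaluation tool used here; it assumes Buckwalter-transliterated input, a hypothetical diacritic inventory `DIACRITICS`, word lists that are already aligned and stripped of numbers and punctuation, and identical base letters in reference and output.

```python
# Assumed set of Buckwalter diacritic symbols (short vowels, sukun, tanween,
# shaddah, dagger alif); the exact inventory is an assumption of this sketch.
DIACRITICS = set("aiuoFNK~`")


def split_letters(word):
    """Split a Buckwalter word into [letter, attached diacritics] pairs."""
    pairs = []
    for char in word:
        if char in DIACRITICS and pairs:
            pairs[-1][1] += char
        else:
            pairs.append([char, ""])
    return pairs


def error_rates(reference, hypothesis):
    """Return (DER, WER) as fractions for two aligned lists of words."""
    letters = letter_errors = word_errors = 0
    for ref_word, hyp_word in zip(reference, hypothesis):
        ref_pairs = split_letters(ref_word)
        hyp_pairs = split_letters(hyp_word)
        # All diacritics on one letter count as a single right/wrong decision;
        # undiacritized letters are counted as well.
        wrong = sum(ref != hyp for ref, hyp in zip(ref_pairs, hyp_pairs))
        letters += len(ref_pairs)
        letter_errors += wrong
        word_errors += wrong > 0
    return letter_errors / letters, word_errors / len(reference)


reference = ["kataba", "Ald~arosa"]
hypothesis = ["kutiba", "Ald~arosa"]
print(error_rates(reference, hypothesis))
```

In the example, two of the eight letters carry wrong diacritics and one of the two words is affected, giving a DER of 0.25 and a WER of 0.5.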

In addition to calculating DER and WER, the evaluation system scores internal diacritics and case endings separately. The Alserag results were compared with the output of three other known diacritization systems, Harakat, Mishkal, and Aldoaly, as they are the only available systems. The outputs of these three systems were evaluated on the same data. Table 1 shows the benchmarking of Alserag against the other three systems on the whole data.

Table 1. Benchmarking of Alserag against the other three systems on the whole data.

According to the benchmarking results, our system scored the lowest error rate, followed by Harakat and Mishkal, and finally Aldoaly, which scored an error rate of over 80 %. Future work will focus on improving the parsing phase, as it is the main source of the errors that raised DER and WER. In addition, we plan to evaluate and benchmark Alserag on the same LDC dataset (Arabic Treebank) used in the evaluation of more robust systems such as Sakhr, RDI and the Microsoft system, so that we can at least compare against the results published for those systems.

5 Conclusion

The paper presents Alserag, an automatic diacritization system developed using a rule-based approach, which is our contribution to the subject of automatic diacritization. All of the other available systems that were mentioned are statistically based. The results of the system were evaluated against the reference: the DER was 8.68 % and the WER was 18.63 %.