1 Introduction

Arabic belongs to the Semitic group of languages. Its verbs and nouns are usually derived from roots of three or more letters by applying various patterns. The root carries the basic conceptual meaning, while the pattern contributes lexical, syntactic and semantic information. Arabic textual words are often compound structures that should be regarded syntactically as phrases rather than single words. Arabic words are divided into three types: noun, verb and particle. Arabic is a relatively free word order language: while the primary word order is verb-subject-object (VSO), Arabic also allows subject-verb-object (SVO), object-verb-subject (OVS), etc.

The printed words we are reading now are the perceptible cornerstones of an otherwise invisible grammatical edifice that is automatically reconstructed in our mind. According to many psycholinguists [19], comprehending spoken or written sentences involves building grammatical structures. This activity, called syntactic analysis or sentence parsing, includes assigning a word class (part of speech) to individual words, combining them into word groups or ‘phrases’, and establishing syntactic relationships between word groups. All these parsing processes must be consistent with the grammatical rules. Developing a working model of sentence parsing is impossible without adopting a grammatical formalism and the structure-building operations specified in it.

Syntactic analysis of Arabic is still in its early stages; most research on Arabic Natural Language Processing (NLP) systems has concentrated on morphological analysis. Most Arabic parsing systems have adopted the statistical or hybrid approach [4,5,6,7], which does not give the desired results. Other works have used the rule-based approach [1,2,3], implementing separate systems that parse Arabic sentences in a well-defined order. Such systems may be difficult to extend to the many NLP levels needed to build NLP applications, because the analysis levels must communicate and share results, and these results must be unified to make this communication possible.

The aim of this paper is to implement a syntactic parser for simple Arabic verbal sentences. In the first step of our work, we present a syntactic and semantic classification of Arabic words. In the second step, we study the grammatical structure of the simple Arabic sentence and of the simple Arabic verbal sentence. The parser is based on a dependency grammar, relying on the concept of attribution in the Arabic sentence, and this grammar is implemented in the NooJ platform. The lack of diacritics in written Arabic texts produces ambiguity at the morphological level, which is why morphological ambiguity resolution is required in this work before the sentence parsing step.

The rest of this paper is organized as follows: Sect. 2 is dedicated to the syntactic and semantic classification of the Arabic lexicon. Sect. 3 presents simple Arabic verbal sentence parsing in four subsections: the study of the simple Arabic sentence, the grammatical structure of the simple Arabic verbal sentence, morphological disambiguation, and syntactic parsing. In Sect. 4, we present the tests and results of our parser. Section 5 discusses related work. Finally, Sect. 6 presents the conclusion and future work.

2 Lexicon Syntactic Classification

The NooJ platform integrates an Arabic dictionary named El-DicAr, which classifies Arabic words into many classes. With respect to transitivity, it classifies verbs into three sub-classes: direct transitive (one direct object), indirect transitive (one indirect object) and intransitive. However, Arabic transitive verbs are sub-classified according to the number of accusative forms the verb requires to convey the sentence meaning: they can select one, two or three accusative forms (direct and indirect object(s) in the sentence). Consequently, we cannot use this dictionary to parse Arabic verbal sentences. In this section, we present a syntactic classification of Arabic words, and we add new properties to our dictionary, implemented in [8] using the NooJ platform. This classification includes the features required for the syntactic analysis of the Arabic sentence.

2.1 Arabic Lexicon Syntactic Classification

Arabic language linguists classify Arabic words into three main classes: nouns, verbs, and particles [13, 14, 16,17,18, 20,21,22,23,24,25]. Each class is in turn divided into sub-classes.

A noun () in Arabic is a word that describes a place, person, thing, idea, etc. It conveys a lexical meaning and gives no indication of time. The noun class in Arabic is divided into many sub-classes: complete noun () like (, lion and , stripe), incomplete noun () like (), verbal noun () like (, reading, , knowledge), adjective () such as (, reader, , written, , great), etc. The sub-class “adjective” also has sub-classes: resembling adjective (), “active participle” (), “passive participle” (), etc.

A verb in Arabic is a word with two features: action and time. A verb refers to an action carried out in the past, the present or the future. The “verbs” class is divided into two main sub-classes: “complete verbs” (, al-fi’l attaam) and “incomplete verbs” (, al-fi’l annaqis). Verbs can also be classified according to many other features. One of them is transitivity (), an important syntactic property of verbs. It is used in the syntactic analysis of sentence constructions to determine the number of object elements (arguments) that the verb selects, in addition to the subject, to achieve the sentence meaning. Hence, a verb can be (): it takes only a subject, but this subject functions semantically as a complement so that the verb conveys the sentence meaning. An intransitive verb () can achieve the sentence meaning with its subject alone. A transitive verb () requires other syntactic position(s) in addition to the subject to convey the sentence meaning; transitive verbs in Arabic take from one to three accusative forms. In addition to the syntactic classification, we classify verbs according to semantic features such as rationality.

The particle () category comprises function words that can be considered neither verbs nor nouns and that convey no lexical meaning on their own; a particle must be linked with another word (noun or verb) to convey meaning. Particles are used to connect words (nouns and verbs) to make phrases or sentences. They are divided into three categories according to the type of word they can affect: a noun, a verb, or both. The particle class includes prepositions, conjunctions, interrogative particles, exception particles, interjections, etc.

2.2 Classification Implementation

We have already implemented an Arabic dictionary based on root and pattern properties [8]. Its entries are classified according to inflectional and derivational models, number of root letters, morphological features, etc. This dictionary includes all Arabic particles and 160,000 verbs, nouns and forms (which are also obtained from inflectional and derivational graphs). The implemented dictionary lacks a fine-grained syntactic and semantic classification of its entries. Therefore, we must enrich our dictionary with the properties that encode the syntactic and semantic classification of the lexicon discussed in Sect. 2.1. Table 1 summarizes these properties.

Table 1. Lexicon syntactic classification
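
The properties added to the dictionary can be pictured as feature tags attached to each entry. The following Python sketch is only an illustration and does not reproduce the NooJ dictionary format: the class `VerbEntry`, the English glosses standing in for Arabic lemmas, and the tags `INC`, `INTR`, `TR2` and `TR3` are assumptions. Only the tag sequence V + CMP + TR1 is taken from this paper (Sect. 3.4).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VerbEntry:
    """Hypothetical dictionary entry carrying the syntactic features of Sect. 2.1."""
    gloss: str        # English gloss standing in for the Arabic lemma
    complete: bool    # complete verb (True) or incomplete verb (False)
    accusatives: int  # number of accusative forms the verb requires (0 to 3)

    def tag(self) -> str:
        # Build a tag string in the style "V+CMP+TR1".
        # "INC" and "INTR" are assumed labels, not taken from the paper.
        parts = ["V", "CMP" if self.complete else "INC"]
        parts.append(f"TR{self.accusatives}" if self.accusatives > 0 else "INTR")
        return "+".join(parts)


entries = [
    VerbEntry("sleep", complete=True, accusatives=0),
    VerbEntry("explain", complete=True, accusatives=1),
    VerbEntry("give", complete=True, accusatives=2),
]
for entry in entries:
    print(entry.gloss, entry.tag())
```

Encoding the number of required accusative forms directly in the entry is what later lets a grammar select the sentence structure from the verb alone.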

3 Simple Arabic Verbal Sentence Parsing

To parse an Arabic sentence, we must study its grammatical structure. In the first sub-section, we present the general structure of the simple Arabic sentence, and in the second sub-section, we study the grammatical structure of the simple Arabic verbal sentence. The third sub-section is devoted to morphological disambiguation. Finally, the last sub-section deals with the parser implementation in the NooJ platform.

3.1 Simple Arabic Sentence Study

The simple sentence, in every language, is formed from two required elements: the predicate (, al-musnad) and the subject (, al-musnad ’ilayh) [16,17,18, 23,24,25]. In Arabic, the subject can be dropped (pro-drop). The relationship holding between the predicate and the subject is called the attribution (al-’isnād, ). The simple Arabic sentence includes one event (ḥadaṯ wāḥid, ), that is, a single attribution. The predicate and the subject constitute a predicative kernel that builds the sentence meaning. In Arabic, the predicate can be a complete verb, in which case the sentence is verbal, or a noun, in which case the sentence is nominal; the subject, however, is always a noun phrase (). The predicate can be governed by particles affecting the verb, such as negation particles, or affecting the noun, such as annulling particles. The governing particle is called the head of the sentence. Incomplete verbs are regarded as the head in both the verbal and the nominal sentence. Direct object(s) and indirect object(s) are regarded as the complement (, al-faḏlah). Like the head, this component is optional. Arabic grammarians have established the following formula, describing the general structure of the simple sentence:

The sentence, al-ǧumlah = [the head, al-ṣṣadr] (the predicate, al-musnad and, wa the subject, al-musnad ’ilayh) [the complement, al-faḏlah]    (1)

The simple Arabic verbal sentence cannot exceed four main components. The first one, the head, can be filled by certain particle sub-classes, such as negation or interrogation particles, or by incomplete verbs. The sentence kernel, which includes the predicate and the subject, selects zero, one or several accusative forms (the complement) to achieve the sentence meaning. These four components are freely ordered in the simple Arabic sentence (both verbal and nominal). Figure 1 shows an example of a negative simple Arabic verbal sentence:

Fig. 1. An example of simple verbal sentence structure
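
Formula (1) can also be read as a record with two optional fields around a mandatory kernel. The Python sketch below is a minimal illustration under that reading; the class `SimpleSentence`, its field names, the `NEG` label and the English glosses are hypothetical, and the example is not a transcription of Fig. 1.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SimpleSentence:
    """Hypothetical container mirroring formula (1): [head] (predicate, subject) [complement]."""
    predicate: str                      # required
    subject: Optional[str] = None       # may be dropped (pro-drop)
    head: Optional[str] = None          # optional governing particle or incomplete verb
    complement: List[str] = field(default_factory=list)  # optional accusative forms

    def bracketed(self) -> str:
        # Render the components in the bracket notation of formula (1).
        kernel = self.predicate if self.subject is None else f"{self.predicate}, {self.subject}"
        parts = []
        if self.head is not None:
            parts.append(f"[{self.head}]")
        parts.append(f"({kernel})")
        if self.complement:
            parts.append("[" + ", ".join(self.complement) + "]")
        return " ".join(parts)


negative = SimpleSentence(predicate="understood", subject="the student",
                          head="NEG", complement=["the lesson"])
print(negative.bracketed())  # [NEG] (understood, the student) [the lesson]
```

The rendering fixes one display order only for readability; as stated above, the four components themselves are freely ordered in the sentence.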

3.2 Grammatical Structure of Simple Arabic Verbal Sentence

The simple Arabic verbal sentence contains one predicative kernel, and its predicate is a complete verb whatever its position in the sentence. In Arabic, the order of sentence components is free; we can match various combinations: SPC (Subject-Predicate-Complement) (, the student understood the lesson), PSC (), PCS (), etc.

With respect to transitivity, Arabic verbs are classified into five main subclasses (see Sect. 2.1). All verbs of a given subclass share the same syntactic structure, so we must handle five different syntactic structures of the simple Arabic verbal sentence. The first one is based on () verbs; in this structure only the predicate and its subject (the object in meaning) are required, such as in (, the apple fell, and , the glass was broken). The second structure is based on intransitive verbs; only the predicate and the subject are mandatory, such as in (, the child slept). The third one is built on transitive verbs to one object; this structure has three required syntactic positions: predicate, subject and a required complement, such as in (, the teacher explains the lesson and , it is one o’clock). The fourth structure is based on transitive verbs to two objects; in addition to the predicate and the subject, a complement including two objects is required, such as in (, Ali gave a letter to his brother and , the king granted a medal to the minister). The last structure of the simple Arabic verbal sentence is obtained using transitive verbs to three objects; the complement must include three accusative forms, such as in (, the brigadier reported to the soldiers that the enemy is coming).

An optional part of the complement, consisting of one or more prepositional/locative phrases, can be added to the required components in all five syntactic structures of the simple Arabic verbal sentence; it gives a particular meaning to the sentence, such as in (, the child slept on the bed at home, , the professor explains the lesson in the class).
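
The five structures differ only in how many accusative forms the complement must contain, which can be summarized in a small lookup table. The sketch below is an illustration, not the NooJ grammar: the labels `CLASS1`, `INTR`, `TR2` and `TR3` and the function `has_required_positions` are assumed names (the Arabic term for the first subclass is not reproduced here), and optional prepositional/locative phrases are deliberately ignored.

```python
# Required number of accusative forms for each verb subclass (labels are assumptions).
REQUIRED_ACCUSATIVES = {
    "CLASS1": 0,  # subject is the object in meaning ("the apple fell")
    "INTR": 0,    # intransitive ("the child slept")
    "TR1": 1,     # transitive to one object ("the teacher explains the lesson")
    "TR2": 2,     # transitive to two objects ("Ali gave a letter to his brother")
    "TR3": 3,     # transitive to three objects
}


def has_required_positions(verb_class: str, has_subject: bool, n_accusatives: int) -> bool:
    """Check that a clause fills exactly the positions its verb subclass requires.

    A pro-dropped subject is expected to be passed as has_subject=True,
    since it is implicit rather than absent.
    """
    return has_subject and n_accusatives == REQUIRED_ACCUSATIVES[verb_class]


print(has_required_positions("TR1", True, 1))  # True
print(has_required_positions("TR2", True, 1))  # False: one object is missing
```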

3.3 Morphological Disambiguation

Arabic is ambiguous at the morphological level; the same undiacritized word can have many meanings. In the sentence “”, the word without diacritization can be the verb (to benefit), the verb (to make somebody benefit), the nouns , , the verbal noun (benefit). The word can be the noun (science), the noun (banner), the verb , (to know), the verb (to learn) or the verb (to notify). In this sentence, we have another ambiguous word; the word can be the noun (solution), or the verb (to resolve). Ambiguity in the first word can be resolved using two constraints. The first one is “after a particle affecting the verb, we must have a verb”; thus after (indeed), which affects the verb, we must have a verb. The second constraint is “we cannot have two consecutive complete verbs”; hence is a noun, because it is preceded by a complete verb. Ambiguity in the last ambiguous word can be resolved using the constraint “after a particle affecting the noun, we must have a noun”. Thus, (in) is a preposition affecting only nouns, so the word must be a noun. Arabic morphological ambiguity can therefore be resolved using Arabic lexical and syntactic rules.
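
The three constraints quoted above can be sketched as a left-to-right filter over the candidate readings of each word. This Python sketch is only an approximation of the idea and is not the NooJ local grammars of this paper: the function `disambiguate`, the coarse tags `V` (complete verb), `N`, `PART_V` and `PART_N`, and the reduction of each word to a set of tags are all assumptions.

```python
from typing import List, Set


def disambiguate(readings: List[Set[str]]) -> List[Set[str]]:
    """Filter candidate tags word by word using the three local constraints.

    Tags (assumed): "V" complete verb, "N" noun,
    "PART_V" particle affecting verbs, "PART_N" particle affecting nouns.
    """
    result = [set(r) for r in readings]
    for i in range(1, len(result)):
        prev, cur = result[i - 1], result[i]
        if prev == {"PART_V"}:      # after a particle affecting the verb: a verb
            keep = cur & {"V"}
        elif prev == {"PART_N"}:    # after a particle affecting the noun: a noun
            keep = cur & {"N"}
        elif prev == {"V"}:         # no two consecutive complete verbs
            keep = cur - {"V"}
        else:
            continue
        if keep:                    # never remove every reading of a word
            result[i] = keep
    return result


# Glosses of the example: "indeed", "benefit", "science / to know", "in", "solution / to resolve"
sentence = [{"PART_V"}, {"V", "N"}, {"V", "N"}, {"PART_N"}, {"V", "N"}]
print(disambiguate(sentence))
```

Because the pass runs left to right, a word that has just been reduced to a single reading immediately constrains its right neighbour, which is how the second word of the example is forced to be a noun.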

Our disambiguation method is applied after the morphological analysis and before the syntactic analysis. The morphological analyzer produces all possible interpretations of each textual Arabic word. Ambiguity is then resolved by applying lexical and syntactic constraints defined as local grammar rules [9,10,11,12, 16] (see Figs. 2, 3 and 4). By filtering out the impossible interpretations, these rules lead to a correct parse.

Fig. 2. Morphological disambiguation schema

Fig. 3. Disambiguation rule 1

Fig. 4. Disambiguation rule 2

We then implement a set of disambiguation rules that model Arabic lexical and syntactic constraints. They are implemented as local grammars using the NooJ platform. Each analyzed sentence is matched against these grammars sequentially in order to discard useless morphological annotations. The local grammars in Figs. 3 and 4 illustrate these disambiguation rules.

Fig. 5. First level of our grammar

3.4 Parser Implementation

Arabic is an agglutinative language; its syntactic parsing requires morphological analysis as a first step. In a previous work [15], we implemented a set of morpho-syntactic graphs that process agglutination in the NooJ platform.

Arabic verbs can produce five different grammatical structures according to the transitivity feature presented above. We have implemented five sub-grammars covering the grammar of the simple Arabic verbal sentence. Arabic particles affecting the verb are classified into 12 sub-classes; we have thus implemented (12 + 1) * 5 + 5 = 70 simple Arabic verbal sentence types, covering all possible structures and types (affirmative, negative, interrogative, etc.) of the simple Arabic verbal sentence in the active voice. Our grammar consists of forty syntactic graphs implemented in the NooJ platform. Figure 5 describes the first level of our grammar.
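
The count of 70 sentence types can be checked by enumerating the combinations. The sketch below rests on an assumed reading of the formula: the "+ 1" is taken to be the sentence without a head particle, and the final "+ 5" is kept as a bare constant because the text does not break it down. The names `PARTICLE_CLASSES`, `HEAD_OPTIONS`, `STRUCTURES` and `EXTRA_TYPES` are placeholders.

```python
from itertools import product

# 12 sub-classes of particles affecting the verb (placeholder names).
PARTICLE_CLASSES = [f"P{i}" for i in range(1, 13)]
# Assumption: the "+ 1" in the formula stands for a sentence with no head particle.
HEAD_OPTIONS = [None] + PARTICLE_CLASSES
# The five structures determined by verb transitivity (placeholder names).
STRUCTURES = [f"S{i}" for i in range(1, 6)]
# The final "+ 5" of the formula, not detailed in the text.
EXTRA_TYPES = 5

combinations = list(product(HEAD_OPTIONS, STRUCTURES))
total = len(combinations) + EXTRA_TYPES
print(len(combinations), total)  # 65 70
```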

The first level of every sub-grammar handles and annotates the main components of the sentence. We reduce syntactic ambiguity immediately after generating the possible structures of the input sentence, using verb-subject agreement constraints and some semantic features such as rationality. In the rest of this section, we present the sub-grammar of the simple Arabic verbal sentence based on the transitive verb to one object, shown in Fig. 6:

Fig. 6. Sub-grammar based on the transitive verb to one object

The predicate of this kind of sentence is a transitive verb to one object (V + CMP + TR1). The head, in the verbal sentence, is a particle governing the verb or an incomplete verb (). Figure 7 shows the syntactic sub-classes that can fill the head position in the Arabic verbal sentence.

Fig. 7. Particle classes replacing the head in the verbal sentence

The subject is always a noun phrase (). The complement, in this type of Arabic sentence, must include a noun phrase (Direct Object Complement), or at least a prepositional/locative phrase (Indirect Object Complement). The complement can be extended by one or many prepositional/locative phrase(s) such as in:

, the student wrote the lesson [with a pen on paper]. In this case, the complement gives a particular meaning to the sentence.

4 Parser Tests

To test the syntactic parser on Arabic corpora, the corpus text must first be segmented into sentences. This module requires a specific study that is beyond the scope of this paper. We have therefore manually created a text of one thousand simple verbal sentences representing all possible structures of this type of Arabic sentence. Our morphological disambiguation grammar and syntactic parser were then applied to this text. The number of totally disambiguated sentences is six hundred and seventy (67%). Some sentences remain ambiguous owing to the inherent ambiguity of Arabic: the number of partially disambiguated sentences is one hundred and fifty (15%). The overall disambiguation rate is therefore eighty-two percent. The remaining cases can be explained by constraints that are not yet implemented; this issue could be addressed once we implement a semantic analyzer alongside the syntactic analyzer. Regarding the parsing task, the number of successfully parsed simple verbal sentences is nine hundred and twenty, a success rate of ninety-two percent. The failures are expected, since some grammars are not yet implemented. Figure 8 presents the tagging of a sentence just before the disambiguation step. Figure 9 presents the tagging of the same sentence just after the disambiguation step and before the parsing step. Figures 10 and 11 show the syntactic analysis results for two variants of this sentence. The analysis succeeds in both cases even though the order of the sentence components differs.
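
The rates reported above follow directly from the sentence counts. As a quick check, the short Python sketch below recomputes them; the variable names and the helper `rate` are illustrative only.

```python
TOTAL_SENTENCES = 1000       # manually created test sentences
FULLY_DISAMBIGUATED = 670
PARTIALLY_DISAMBIGUATED = 150
SUCCESSFULLY_PARSED = 920


def rate(count: int, total: int) -> float:
    """Return count as a percentage of total."""
    return 100.0 * count / total


full_rate = rate(FULLY_DISAMBIGUATED, TOTAL_SENTENCES)
overall_rate = rate(FULLY_DISAMBIGUATED + PARTIALLY_DISAMBIGUATED, TOTAL_SENTENCES)
parsing_rate = rate(SUCCESSFULLY_PARSED, TOTAL_SENTENCES)
print(full_rate, overall_rate, parsing_rate)  # 67.0 82.0 92.0
```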

Fig. 8. NooJ TAS before disambiguation

Fig. 9. NooJ TAS after disambiguation

Fig. 10. NooJ TAS after parsing

Fig. 11. NooJ TAS after parsing

4.1 Morphological Disambiguation Test

In the example cited in Sect. 4, the sentence is ambiguous at the morphological level. We show in Fig. 8 the NooJ Text Annotation Structure (TAS) before applying our disambiguation local grammar on this sentence.

Figure 9 shows the NooJ TAS after the disambiguation step; all impossible annotations are filtered out.

4.2 Parsing Test

After the morphological disambiguation step, we obtain a disambiguated sentence which is the input of our parser. The parser has to match the parse tree(s) to the input sentence. Figures 10 and 11 show the NooJ TAS after the parsing step applied to two sentences. These sentences are similar but the order of their components is different. The parser returns the same syntactic annotations of the sentence components in the two proposed cases.
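
The order-independence observed here can be illustrated with a toy role assigner that looks only at the features of each chunk, never at its position. This is a sketch under strong assumptions and not the NooJ grammar: it supposes that each chunk already carries a category and a case feature after disambiguation, whereas the parser described in this paper relies on agreement constraints and semantic features. The names `assign_roles` and `Chunk` and the English glosses are hypothetical.

```python
from typing import Dict, List, Tuple

# (text, category, case); case is "" for verbs. All values are illustrative.
Chunk = Tuple[str, str, str]


def assign_roles(chunks: List[Chunk]) -> Dict[str, str]:
    """Assign a syntactic role to each chunk from its features, ignoring word order."""
    roles: Dict[str, str] = {}
    for text, category, case in chunks:
        if category == "V":
            roles[text] = "PREDICATE"
        elif category == "NP" and case == "nom":
            roles[text] = "SUBJECT"
        elif category == "NP" and case == "acc":
            roles[text] = "COMPLEMENT"
    return roles


# Same sentence in predicate-subject-complement and subject-predicate-complement order.
psc = [("understood", "V", ""), ("the student", "NP", "nom"), ("the lesson", "NP", "acc")]
spc = [("the student", "NP", "nom"), ("understood", "V", ""), ("the lesson", "NP", "acc")]
print(assign_roles(psc) == assign_roles(spc))  # True
```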

5 Related Work

In the literature, many approaches have been applied to implement syntactic analyzers for parsing Arabic sentences. There are three main approaches: linguistic, statistical, and hybrid. The linguistic methods are based on a lexicon and grammar rules, as in [1,2,3]. This approach suffers from a lack of resources (dictionaries, grammars, etc.); for instance, the available Arabic grammars do not cover all sentence types. The works based on the statistical approach use annotated corpora such as a Treebank (TB) and approximate grammatical rules from the corpus parse trees using automated quantitative methods [4, 5]. The shortcoming of the statistical methods is that they rely on reference corpora: if the reference corpora do not cover all possible sentence structures, the results are not reliable. The hybrid approach combines linguistic rules and corpus-based statistics; examples of this approach are [6, 7].

In [1], Ouersighni et al. implemented a parser for unrestricted Arabic sentences using the AGFL (Affix Grammars over a Finite Lattice) system. AGFL grammars are a restricted form of Context-Free Grammars (CFG): context-free production rules are extended with features (affixes) for implementing agreement in the sentence, and these features are passed as parameters to the grammar rules. Disambiguation is not addressed in that work. The parser handles both nominal and verbal sentences.

In [2], Abuawad et al. developed an Arabic parser based on an analysis of Arabic grammar with respect to gender and number. The authors formalized the grammar rules using the Context-Free Grammar (CFG) formalism and implemented a top-down parsing algorithm with a recursive transition network.

In [3], Al-Taani et al. presented a top-down chart parser for parsing simple Arabic sentences using the Context-Free Grammar (CFG) formalism. The authors tested the proposed parser on 70 sentences extracted from real-world Arabic documents.

In [4], Shahrour et al. presented a method that uses models with access to additional information from syntactic analysis and rules to provide an improved prediction of case and state. The predicted case and state values are then used to re-tag the output of the Arabic morphological tagger MADAMIRA by choosing the best match among its graded morphological analyses. What the models learn is thus how to correct MADAMIRA’s baseline choice, as opposed to a generative model of case and state. The authors also re-applied the model to its own output, mainly to repair propagated agreement errors.

In [5], Khoufi et al. suggested an approach for parsing Arabic sentences based on supervised machine learning using Support Vector Machines (SVMs). The system selects the syntactic labels of the sentence. The proposed method has two steps: a learning step and a prediction step. The first step is based on a training corpus, feature extraction, and a set of rules obtained from the training corpus. The second step uses the results of the learning step to carry out parsing.

In [6], Ibrahim M. et al. proposed a hybrid system combining the statistical and linguistic approaches for Arabic grammar analysis, parsing and resolving word-final case endings. The system showed adequate accuracy and is easy to implement. However, it requires deep knowledge of Arabic despite the availability of its learning components. The authors evaluated the system on a set of 600 Arabic sentences.

In [7], Khoufi et al. implemented an Arabic sentence parser based on Probabilistic Context-Free Grammar (PCFG). The authors proposed a method that consists of two phases: in the first, a PCFG is induced from Arabic Treebank parse trees, and in the second, the Viterbi parsing algorithm is applied using the grammar induced in the first phase. The authors tested the parser on 1650 sentences extracted from the same Treebank.

The linguistic analysis of Arabic text involves different levels (morphological, syntactic, semantic and pragmatic); every level uses the output of the previous one. The processing is applied sequentially to the text (transduction on text automata) in order to resolve ambiguity. The results must therefore be unified across the linguistic analysis levels so that the levels can communicate easily. NooJ guarantees a high degree of integration of all levels of natural language description thanks to compatible notations and a unified representation for all linguistic analysis results, which enables analyzers at different linguistic levels to communicate with one another [9, 11]. By using this platform, we can implement a set of analysis tools applied in cascade to the input text.

6 Conclusion and Perspectives

In this paper, we have presented our methodology for parsing the simple Arabic verbal sentence. This methodology consists of classifying the entries of an Arabic electronic dictionary into syntactic and semantic classes, implementing morphological disambiguation rules, and creating syntactic grammars. The implementation is performed using the NooJ platform. Although NooJ can handle all stages of analysis (morphological, syntactic and semantic), our work focuses on the stage of syntactic analysis, with a preliminary stage of disambiguation.

Thus, we have implemented many transducers (morphological disambiguation rules) modeling a set of lexical and syntactic constraints of the Arabic language. These transducers are applied sequentially. After that, with our structural grammars, we have analyzed several simple Arabic verbal sentences, disambiguated them automatically and generated their annotated parse tree(s).

Our method will not be limited to the simple verbal sentence; we will extend it to other types of Arabic sentences, which will then allow us to analyze different texts and corpora syntactically.

Once the Arabic analyzer is complete, it could support many applications, such as automatic diacritization, correction of Arabic sentences, and accurate translation. Further disambiguation rules could also be implemented once semantic analysis is available.