
1 Introduction

Ontologies hold great importance for modern knowledge-based systems. They serve as explicit, conceptual knowledge models to share a common understanding of information in a domain and make that knowledge available to information systems [1]. However, the manual construction of ontologies is an expensive and time-consuming task because of the difficulty in capturing knowledge, an issue also known as the “knowledge acquisition bottleneck.” A solution for this issue is providing automatic or at least semi-automatic support for ontology construction. This operation is usually referred to as Ontology Learning (OL) [2].

Cimiano [3] compares the tasks involved in OL to the layers of a cake, composed, in ascending order, of term acquisition, synonym acquisition, concept formation, taxonomy definition, relation definition, and finally, axiom definition (see Fig. 1). Several ontology learning tools have been proposed in the literature for accomplishing these tasks [4,5,6]. They differ in their input data types (format and language), their output formats, and, above all, the methods used to extract the ontological structures. Unfortunately, these tools still do not support the Arabic language, even though it is one of the most widely spoken languages worldwide.

In this paper, we deal with ontology learning from Arabic legal texts. We use the NooJ linguistic platform to semi-automatically carry out the identified steps: corpus study, term acquisition, and conceptualization. We then use the Arabic WordNet (AWN) project to accomplish the ontology enrichment. Section 2 presents the overall process of ontology learning from text: inputs, outputs, existing approaches, and prominent ontology learning tools. Section 3 discusses related work in the legal domain. In Sect. 4, we describe the proposed learning process and its implementation in NooJ. Section 5 comments on the learning process and the obtained results. Finally, in Sect. 6, we present our conclusions and plans for future work.

2 Ontology Learning

The term ontology learning refers to the automatic or semi-automatic support for the construction of an ontology [7]. It aims at extracting ontological elements (conceptual knowledge) from a given input text with limited human effort. Techniques from established fields, such as NLP, data mining, and information retrieval, have been fundamental in developing ontology learning methods [8]. This section presents the inputs used to learn ontologies, the ontology learning tasks and their outputs, existing approaches, and the most prominent ontology learning tools.

2.1 Input

There are three different kinds of ontology learning input data [9]: structured (such as databases), semi-structured (e.g., XML), and unstructured (natural language text documents). Unstructured data is the most widely available format and provides the most common source for ontology extraction [10]. However, processing unstructured data is a tedious task; human language is largely implicit and allows different people to conceptualize the same content in different ways [11]. The legal domain is strictly dependent on its linguistic expression and therefore inherits all the challenging problems that this implies. As McCarty stated, “one of the main obstacles to progress in the field of artificial intelligence and law is the natural language barrier” [12].

2.2 Tasks and Outputs

Ontology learning is primarily concerned with deriving concepts, relations, and (optionally) axioms from texts. Although there is no standard for this development process, Cimiano [3] describes the tasks involved in ontology learning as forming a layer cake (see Fig. 1). These tasks aim at returning six main outputs: terms, synonyms, concepts, taxonomic relations, non-taxonomic relations, and, finally, axioms.

Fig. 1. Ontology Learning “Layer Cake” from [25].

Terms are the most basic building blocks of ontology learning [13]. They can be simple (i.e., single-word) or complex (i.e., multi-word), and are considered linguistic realizations of domain-specific concepts. There are many term extraction methods in the literature. Most of them are based on terminology and NLP research [14,15,16]; others, on information retrieval methods for term indexing [17].

Synonym discovery consists of finding words that denote the same concept [18]. The synonym layer addresses the acquisition of semantic term variants within and between languages. Approaches rely either on existing synonym sets, such as WordNet synsets [19] (after word-sense disambiguation), on clustering techniques [20,21,22,23], or on other methods, including Web-based knowledge acquisition.

Concepts can be abstract or concrete, real or fictitious. The consensus in this field, however, is that a concept should include the following components (a minimal data-structure sketch follows the list):

  • Intension: a formal definition of the set of objects that this concept describes;

  • Extension: a set of objects that the definition of this concept describes;

  • Lexical realizations: a set of linguistic realizations, i.e., (multilingual) terms for this concept.
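As an illustration only, the following Java record sketches how these three components could be represented in code; the type and field names are our own invention and do not belong to any existing ontology learning API.

```java
import java.util.List;
import java.util.Set;

// A minimal, hypothetical representation of a learned concept with the
// three components discussed above. Field names are illustrative.
public record Concept(
        String intension,              // formal definition of the described objects
        Set<String> extension,         // the objects (instances) the definition covers
        List<String> lexicalizations   // (multilingual) terms realizing the concept
) {}
```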

Most research in concept extraction addresses the question from a clustering perspective, regarding concepts as clusters of related terms [3]. This approach overlaps almost entirely with term and synonym extraction [24]; an example can be found in [25].

Concept hierarchies (generalization and specialization), or taxonomies, are crucial for any knowledge-based system [24]. There are three main paradigms for inducing concept hierarchies from texts (the first is illustrated in the sketch after this list):

  • Lexico-syntactic patterns, as proposed in [26],

  • Harris’s distributional analysis using clustering algorithms [27],

  • The document-based notion of term subsumption, as proposed in [28].
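To make the first paradigm concrete, the following self-contained Java sketch matches a classic lexico-syntactic pattern of the form “NP such as NP, NP”. It is only a toy: real systems match part-of-speech-tagged noun phrases rather than the bare word tokens assumed here, and the example sentence is invented.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HearstPatterns {
    // "X such as Y (, Z)*": X is a hypernym candidate, Y and Z are
    // hyponym candidates. Bare \w+ tokens stand in for noun phrases.
    private static final Pattern SUCH_AS =
            Pattern.compile("(\\w+)\\s+such as\\s+(\\w+(?:\\s*,\\s*\\w+)*)");

    public static void main(String[] args) {
        String text = "relatives such as daughters, wives";
        Matcher m = SUCH_AS.matcher(text);
        while (m.find()) {
            String hypernym = m.group(1);
            for (String hyponym : m.group(2).split("\\s*,\\s*")) {
                System.out.println(hyponym + " isA " + hypernym);
            }
        }
    }
}
```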

Relations here refer to any relationship between concepts except taxonomic ones. This includes specific conceptual relationships, such as synonymy, possession, attribute-of, and causality, as well as more general relationships, i.e., any labeled link between a source concept and a target concept. In the literature, few approaches have addressed relation extraction from texts; examples include the use of an association-rule extraction algorithm [29] (sketched below) and the use of syntactic dependencies [30].
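As a rough illustration of the association-rule idea of [29], the toy Java sketch below counts concept co-occurrences per text unit and keeps directed pairs whose confidence exceeds a threshold. The data, the threshold, and the choice of text unit are all illustrative assumptions, not details of [29].

```java
import java.util.*;

public class AssociationRules {
    public static void main(String[] args) {
        // Each set stands for the concepts occurring in one text unit.
        List<Set<String>> units = List.of(
                Set.of("marriage", "spouse", "contract"),
                Set.of("marriage", "contract"),
                Set.of("divorce", "spouse"));
        Map<String, Integer> single = new HashMap<>();
        Map<String, Integer> pair = new HashMap<>();
        for (Set<String> u : units)
            for (String a : u) {
                single.merge(a, 1, Integer::sum);
                for (String b : u)
                    if (!a.equals(b)) pair.merge(a + "->" + b, 1, Integer::sum);
            }
        double minConfidence = 0.8;  // illustrative threshold
        pair.forEach((rule, count) -> {
            String a = rule.split("->")[0];
            double confidence = (double) count / single.get(a);
            if (confidence >= minConfidence)
                System.out.printf("%s (confidence %.2f)%n", rule, confidence);
        });
    }
}
```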

Lastly, axioms are propositions that are always taken to be true. They act as a starting point for deducing other truths and for verifying the correctness of existing ontological elements. Axiom extraction from text is still at an early stage [31]. Initial blueprints of this task can be found in [32], which proposes an unsupervised method based on an extended version of Harris's distributional hypothesis to discover inference rules.

2.3 Approaches

Several approaches to ontology learning from textual resources have been proposed in the literature. We briefly discuss those most relevant to our concerns. Aussenac-Gilles [33] proposed an ontology learning approach based on knowledge elicitation from technical documents. It enables the creation of a domain model by analyzing a given corpus with natural language processing (NLP) tools and linguistic techniques, and includes four main activities: corpus constitution, linguistic study, normalization, and formalization. Sabou [34] proposed an NLP approach that uses syntactic patterns to discover dependency relations between words; its main steps are term extraction, conceptualization, and enrichment. Mazari [35] proposed an automatic construction approach that uses statistical techniques to extract ontology elements from Arabic texts, carried out in three steps: preparing the corpus, extracting concepts, and discovering relations. In the legal domain, ontology learning experiments mainly focus on concept extraction as the primary step in the ontology development process [36].

2.4 Tools

Ontology learning tools aim to reduce both the time and the cost of the ontology development process. They differ in their input data types, output formats, and, above all, the methods and algorithms used to extract the ontological structures. In this subsection, we present the most relevant ontology learning systems that operate on unstructured textual resources.

TERMINAE [6] is a tool based on a methodology elaborated from practical experiments in ontology building. Its originality lies in integrating linguistic and knowledge engineering tools. The linguistic engineering part supports term acquisition from textual resources; the knowledge engineering part provides knowledge-base management with an editor and browser for the ontology. The tool helps to represent a notion as a concept, called a terminological concept.

Text2Onto [7] is a framework for learning ontologies from textual resources. It represents the learned knowledge in a meta-level model called the Probabilistic Ontology Model (POM), which stores the learned primitives independently of any specific Knowledge Representation (KR) language. It computes confidence values for the correctness of the ontology elements and updates the learned knowledge whenever the corpus changes, avoiding reprocessing from scratch.

Text-to-Knowledge (T2K) [8] is a generic computer platform for data and text mining. T2K extracts domain-specific information from texts by combining linguistic technologies and statistical techniques in three main phases: text preprocessing and term extraction, concept formation, and relation extraction/knowledge organization (Table 1).

Table 1. A summary of ontology learning tools.

Unfortunately, most existing ontology learning tools do not support Arabic language processing, and the few that address it offer only limited support.

3 Related Work

Our proposed approach aims to use NLP techniques and tools to build a domain-specific ontology from Arabic textual resources. The works most closely related to ours in the legal domain are Francesconi [37] and El Ghosh [10]. Francesconi [37] performed the term extraction task with two different acquisition tools: GATE for English texts and T2K for Italian ones.

The other tasks, such as evaluating terms, linking them to concepts, and defining relations, were carried out under the supervision of ontology engineers and domain experts. In El Ghosh [10], the ontology extraction process uses Text2Onto and comprises two main phases: linguistic preprocessing and extraction of modeling primitives (concepts, instances, taxonomies, general relations, and disjoint axioms). The resulting ontology is considered inexpressive and needs to be re-engineered.

Our work differs from previous work in the following aspects. First, we are processing Arabic, one of the most challenging natural languages in the NLP field. Second, we use the NooJ platform to implement the linguistic resources needed for term acquisition and conceptualization. Finally, we are developing a Java module to enrich the ontology vocabulary from the AWN project.

4 Our Work

After a comprehensive literature review, we observe that most approaches proposed for learning ontologies from text strongly depend on their specific environment: language, input, domain, and application. Thus, there is no standard ontology learning process and no guarantee that a (semi-)automatically generated ontology is sufficiently correct and precise to characterize the domain of interest [10].

For this reason, domain expert intervention throughout the learning process is necessary to control, complete, and validate the extracted elements. From this perspective, we defined a semi-automatic learning process that involves a legal expert and comprises four main tasks: corpus study, term acquisition, conceptualization, and enrichment. This section presents the corpus and the platform used to learn the ontology, introduces each learning task, and discusses the obtained results.

4.1 Corpus Definition

We compiled the corpus from the Moroccan family code (Fig. 2), which consists of Arabic natural language texts and includes seven main books comprising 400 articles of law, about 2,700 text units, and 18,000 distinct tokens.

Fig. 2. Moroccan family code corpus excerpt.

4.2 Tool Selection

Arabic is a Semitic language with a very complex morphology [38]: it is highly inflected and agglutinative and, because of this complexity, requires a set of preprocessing routines before texts can be manipulated.

In the current project, we used NooJ [39] as a natural language processing tool to formalize inflectional and derivational morphology, the lexicon, regular grammars, and context-free grammars. NooJ relies on an annotation mechanism (stored in each Text Annotation Structure, or TAS) that integrates every piece of linguistic information, making it possible to combine morphological constraints with syntactic rules. NooJ is also a powerful corpus processor that supports sophisticated operations such as information extraction, concordances, and statistical analyses.

4.3 Ontology Learning Process

Corpus Study.

This step consists of a lexico-syntactic analysis of Moroccan legal texts. First, we built a legal domain-specific dictionary based on the family code dictionary available on the ADALA Morocco legal and judicial portal [40]. The resulting dictionary comprises more than 1,000 entries: simple terms (nouns and adjectives), compound nouns, pronouns, prepositions, adverbs, and conjunctions. Furthermore, we added the required inflectional and derivational forms to the simple terms. Some examples of the dictionary's entries are shown below (Table 2):

Table 2. Excerpt of dictionary entries.

Second, inspired by Mesfar [41], we modeled a set of morphological grammars that recognize the component morphemes of agglutinative forms. For instance, the morphological grammar in Fig. 3 identifies agglutinative words built from various prefixes {[definite article (the, ال)], [prepositions (for, ل), (by, ب)], [conjunction (and, و)]} and suffixes such as [pronoun (her, ها)], e.g., (her husband, زوجها), (by its expiration, بانقضائها).

Fig. 3. A morphological grammar for tokenization.

Finally, to resolve multi-word unit ambiguities, we modeled local grammars using the “+UNAMB” feature. The local grammar in Fig. 4 recognizes as nouns both (son, ابن) and (son of son, ابن الإبن). The corpus was annotated with a lexical coverage rate of 81.83%, which we consider a very satisfactory result.

Fig. 4. A syntactic grammar for kinship relationships.

Term Acquisition.

After preparing the corpus, we proceeded to extract the ontology elements. With the legal expert's help, we manually identified 13 patterns of nominal composition that signal potential candidate terms (see Table 3). We modeled these patterns as NooJ local grammars and applied them to extract the corresponding sequences from the corpus. Finally, to keep only the relevant terms, we applied the TF-IDF measure of NooJ's statistical module (sketched below). As a result, we acquired 398 single-word and multi-word candidate terms.

Table 3. Patterns of the potential candidate terms.
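For reference, the following Java sketch shows the TF-IDF measure itself; in our process the computation is performed internally by NooJ's statistical module, so the tokenization and data below are purely illustrative.

```java
import java.util.*;

public class TfIdf {
    // tf-idf(t, d) = (freq of t in d / |d|) * log(N / (1 + docs containing t))
    static double tfIdf(String term, List<String> doc, List<List<String>> corpus) {
        double tf = Collections.frequency(doc, term) / (double) doc.size();
        long docsWithTerm = corpus.stream().filter(d -> d.contains(term)).count();
        double idf = Math.log((double) corpus.size() / (1 + docsWithTerm));
        return tf * idf;
    }

    public static void main(String[] args) {
        // Toy "articles" reduced to token lists.
        List<List<String>> corpus = List.of(
                List.of("marriage", "contract", "spouse"),
                List.of("divorce", "contract"),
                List.of("custody", "child"));
        System.out.println(tfIdf("marriage", corpus.get(0), corpus));
    }
}
```

Candidate terms whose score falls below a chosen threshold are discarded, keeping only terms that are characteristic of specific articles rather than spread uniformly across the code.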

Conceptualization.

In this step, concepts and their relations are derived from the extracted terms. We elaborated a cascade of local grammars that identify candidate terms sharing a large number of syntactic contexts, for instance, those sharing the same head or the same expansion (see Table 4 and the grouping sketch below it).

The legal expert used the obtained clusters to define the concepts, their properties, and the semantic relationships between them, for instance, hyponymy, hypernymy, and synonymy. For example, the lexical units (daughter, بنت), (wife, زوجة), and (father, أب) share the same syntactic context [(expense, نفقة), noun] and specialize the concept (Close relative, قريب). The lexical units (divorce, طلاق) and (marriage, زواج) share several syntactic contexts, [noun, (types, أنواع)], [noun, (date, تاريخ)], and [prepNoun, (registration of, تسجيل)], and specialize the concept (Situation, حالة). In total, 230 single-word and multi-word concepts and 10 semantic relations were identified.

Table 4. Excerpt of the clustered terms.
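The Java sketch below illustrates, on invented English glosses, the grouping principle behind Table 4: candidate terms sharing the same head fall into one cluster that the legal expert can promote to a concept. In our actual process this grouping is performed by the cascade of NooJ local grammars, not by this code.

```java
import java.util.*;

public class ContextClusters {
    public static void main(String[] args) {
        // Invented glosses of multi-word candidate terms.
        List<String> terms = List.of(
                "expense of daughter", "expense of wife", "expense of father",
                "date of marriage", "date of divorce");
        Map<String, List<String>> clusters = new LinkedHashMap<>();
        for (String t : terms) {
            String head = t.split(" of ")[0];   // shared head = cluster key
            clusters.computeIfAbsent(head, k -> new ArrayList<>()).add(t);
        }
        // Each cluster suggests one candidate concept.
        clusters.forEach((head, members) ->
                System.out.println(head + " -> " + members));
    }
}
```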

An excerpt can be seen in Fig. 5, below.

Fig. 5. Excerpt of the identified taxonomy.

Enrichment.

At the end of the previous step, we added to the NooJ dictionary the semantic properties referring to the concepts and their reference hypernym trees. In the current task, we identify the concept synonym sets from the AWN project [42]. AWN is a lexical database for the Arabic language that groups words into sets of synonyms, called synsets, linked by semantic relationships. Based on the JAWS API [43], we developed a Java module that locates, for each single-word concept, the corresponding synsets in AWN. If a concept has multiple senses, the module constructs an AWN hypernym tree for each sense and computes its semantic similarity to the reference hypernym tree. Finally, the module adds the synonyms of the most similar sense to the concept as a semantic property in the NooJ dictionary. The lexicon entries follow this structure:

Entry,GrammaticalCategory+Concept+HypernymTree=listOfString+Synonyms=listOfString

Example:

زَوْج,N+Concept+HypernymTree=زَوْج|قَرِيب|شَخْص+Synonyms=…
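The control flow of the enrichment module can be sketched as follows. Here AwnDatabase and Synset are hypothetical interfaces standing in for the JAWS-based AWN access layer, and the overlap score between hypernym trees is one plausible similarity measure, not necessarily the exact one implemented in our module.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class Enrichment {
    // Hypothetical wrappers around the JAWS-based AWN access layer.
    interface Synset {
        List<String> hypernymTree();  // e.g. [زَوْج, قَرِيب, شَخْص]
        List<String> synonyms();
    }
    interface AwnDatabase {
        List<Synset> getSynsets(String word);
    }

    // Pick the sense whose hypernym tree is most similar to the
    // reference tree stored in the NooJ dictionary entry.
    static Synset bestSense(AwnDatabase awn, String concept, List<String> referenceTree) {
        Synset best = null;
        double bestScore = -1;
        for (Synset sense : awn.getSynsets(concept)) {
            double score = overlap(sense.hypernymTree(), referenceTree);
            if (score > bestScore) { bestScore = score; best = sense; }
        }
        return best;  // its synonyms() are written back as +Synonyms
    }

    // Fraction of shared nodes between the two hypernym trees.
    static double overlap(List<String> a, List<String> b) {
        Set<String> shared = new HashSet<>(a);
        shared.retainAll(b);
        return (double) shared.size() / Math.max(a.size(), b.size());
    }
}
```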

5 Discussion

This section briefly highlights the main issues and remarks identified throughout the learning process from Arabic legal texts. First, the complexity of the Arabic language and the lack of an Arabic-aware ontology learning tool make learning from Arabic texts more complicated and challenging than learning from Romance languages. Second, the pieces of information acquired through lexical analysis and term extraction are essential but inexpressive; they need to be revised by a domain expert and re-engineered into the following ontological elements: concepts, concept properties, and relations. Third, analyzing a legal domain-specific corpus can identify relevant concepts and relationships in a regulated domain, which provides significant guidance for building a legal domain ontology. Last, the NooJ platform offers all the linguistic tools required to implement the ontology learning methods proposed in the literature; regrettably, it does not provide knowledge engineering tools for formalizing the ontology.

6 Conclusion

In this article, we have presented an overview of ontology learning from text and proposed a bottom-up approach to building a legal domain-specific ontology from unstructured Arabic text. We identified the learning process, used the NooJ linguistic platform as an NLP tool to extract the ontology elements (concepts and relations), and used the AWN project to enrich the ontology vocabulary. The obtained results were validated and completed manually by the legal expert. Future work will focus on the formalization and implementation of the designed ontology. We will also develop our LIRS (legal information retrieval system) to exploit the information available in the ontology. We expect that using the ontology will yield results that are more semantically related to the query than those of related works.