1 Introduction

Question-answering systems (QASs) provide precise answers to questions expressed in natural language. Such a system lets a user ask a question in natural language and receive a precise answer to the request, rather than a set of documents deemed relevant, as a search engine would return.

The first process in a QAS is to extract information from users’ questions expressed in natural language. One of the crucial steps in extracting information from text is the recognition of named entities. The term “named entity” appeared during the MUC-6 conference (Message Understanding Conference) [1]. Named entities are entities that have a fixed designator (e.g. “EDF”, “Jules Verne”). They include proper names and expressions such as species names (e.g. “Bengal tiger”), diseases, or chemicals. The definition has also been extended to temporal expressions such as dates and times, and to numeric values (e.g. 2.3 g/l).

By legal entities, we mean named entities specific to the legal field, such as acts and facts. Detecting such entities requires resources that describe the domain vocabulary and/or a training corpus from which the characteristics common to these entities can be learned.

Our goal in this article is to build a legal dictionary that will be used for the automatic analysis of the users’ questions expressed in natural language in order to extract the information that is needed to formulate SPARQL queries equivalent to users’ questions.

The rest of this document is organized as follows. Sect. 2 presents related work on extracting terms from texts. Sect. 3 presents the legal field and its complexity. Sect. 4 describes the methodology used for the construction of the legal dictionary. Finally, Sect. 5 reports the results of an experiment on legal entity recognition using our legal dictionary, and Sect. 6 concludes.

2 Extracting Terms from Texts

A term is an expression with a unique meaning in a particular domain [2]. In the legal field, for example, the words “tax service” form a term because they have a unique meaning in that field.

Term extraction consists of identifying potential terms in a specific text or a set of texts (corpus) as well as the relevant information related to the use of these terms or to the concepts to which they refer (definition, context, etc.).

Extracting terms is an important step in building a dictionary from a corpus. Terms are words or expressions that have a precise meaning in a given context and serve as the linguistic supports of concepts. The problem of building resources is at the heart of terminological activity. Although the notion of “term”, which appeals to that of concept and often rests on a particular act of reference, does not seem to lend itself to computer processing, a number of tools that aim to extract the terms of a corpus have emerged [3].

The definition of a term given above places strong constraints on the form and behavior of terminological units. These constraints are the operating principles of the terminology extraction software developed in recent years. The objective of such software is to automatically provide a more or less structured lexicon of the domain.

We can distinguish three types of approaches for the automatic term extraction: (i) linguistic approaches that use lists of named entities and manually written recognition patterns [4, 5], (ii) statistical approaches based on learning techniques from annotated texts [6, 7] and (iii) hybrid approaches which integrate the first two methods [8, 9]. Table 1 gives a brief description of each approach for the automatic term extraction.

Table 1. Approaches for the automatic term extraction

3 The Legal Field

The legal field is complex because of its terms, which can be:

  • Terms with only a legal meaning;

  • Terms with at least one legal meaning and one non-legal meaning;

  • Terms designated by their synonyms in different texts;

  • Terms appearing in different morphological forms;

  • Non-synonymous terms with the same legal meaning.

In addition, there are different lexical forms that legal terms can take. Table 2 gives some examples of legal terms with their lexical form.

Table 2. Examples of legal terms with their lexical form

These examples show the diversity of the lexical forms that legal terms can take, which is in principle unbounded. We find terms of the form “Noun”, “Noun-Adjective”, “Noun-Preposition-Noun”, etc. This lexical diversity makes it impossible to automatically extract legal terms based on lexical grammars.
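To make the notion of lexical form concrete, the following minimal sketch maps hand-tagged terms to a few part-of-speech patterns. The tag names (N, ADJ, PREP, DET), the pattern table, and the example terms are assumptions chosen for illustration; they are not taken from Table 2, and a real pipeline would obtain tags from a tagger.

```python
# Hypothetical pattern table: tag sequence -> name of the lexical form.
PATTERNS = {
    ("N",): "Noun",
    ("N", "ADJ"): "Noun-Adjective",
    ("N", "PREP", "N"): "Noun-Preposition-Noun",
}


def lexical_form(tagged_term):
    """Return the lexical form of a list of (word, tag) pairs, or None if unlisted."""
    tags = tuple(tag for _, tag in tagged_term)
    return PATTERNS.get(tags)


# Hand-tagged examples (tags are assumptions, not tagger output).
legal_term = [("personne", "N"), ("imposable", "ADJ")]
everyday_phrase = [("voiture", "N"), ("rouge", "ADJ")]
longer_term = [("impôt", "N"), ("sur", "PREP"), ("les", "DET"), ("sociétés", "N")]

print(lexical_form(legal_term))       # Noun-Adjective
print(lexical_form(everyday_phrase))  # Noun-Adjective
print(lexical_form(longer_term))      # None
```

In this toy setting, a legal term and an everyday phrase share the same form, while another legal term falls outside the pattern table until a new pattern is added. This is one way to read the difficulty described above: the list of forms keeps growing, and form alone does not single out legal terms.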

To our knowledge, no lexical resource describing legal terms has been developed for this field. We therefore decided to build a NooJ legal dictionary that describes legal terms and their categorization. It will be used for the automatic analysis of users’ questions expressed in natural language, using the natural language processing platform NooJ [15]. NooJ makes it possible to build, test, and manage wide-coverage formal descriptions of natural languages in the form of electronic dictionaries and grammars.

4 The Legal Dictionary

The description of natural languages is formalized in the form of electronic dictionaries and grammars represented by organized sets of graphs. NooJ dictionaries are used to represent, describe and recognize simple and compound words. Dictionaries are .nod files compiled from editable .dic source files.

Our goal is to build an electronic dictionary of legal terms for NooJ. A term is simple if it contains one word and compound if it contains more than one. A compound word is built from simple words. Silberztein [16] defines a compound noun as a consecutive sequence of at least two simple forms and blocks of separators. A simple form is a consecutive nonempty sequence of characters of the alphabet appearing between two separators. A simple word is a simple form that constitutes a dictionary entry.
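The definitions above can be sketched in a few lines of Python. This is an illustration only: it assumes that a "character of the alphabet" is any Unicode letter and that everything else (spaces, apostrophes, hyphens, digits) acts as a separator, which may differ from NooJ's own tokenization. The function names are hypothetical.

```python
import re

# Assumption: a simple form is a maximal run of Unicode letters;
# any other character is treated as a separator.
SIMPLE_FORM = re.compile(r"[^\W\d_]+")


def simple_forms(text):
    """Return the simple forms found in text, in order."""
    return SIMPLE_FORM.findall(text)


def kind(term):
    """Classify a term as 'simple' (one simple form) or 'compound' (two or more)."""
    count = len(simple_forms(term))
    if count == 0:
        raise ValueError("term contains no simple form")
    return "simple" if count == 1 else "compound"


print(simple_forms("acte d’acquisition définitif"))
print(kind("société"))
print(kind("acte d’acquisition définitif"))
```

Under these assumptions, “acte d’acquisition définitif” yields four simple forms (the elided “d” counts as one) and is therefore classified as compound, while “société” is simple.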

The legal dictionary that we propose to build from laws and decrees will bring together the terminological material necessary for the automatic processing of legal texts, in particular during the stage that transforms users’ natural language questions into SPARQL queries in our question-answering system. We adopted a six-step methodological framework for the construction of the legal dictionary (see Fig. 1).

Fig. 1. Construction stages of the legal dictionary

4.1 The Constitution of the Legal Corpus

In this step we have built up a legal corpus from laws and decrees. We focused our study initially on the general tax code of Morocco. The general tax code has 3 books (see Fig. 2).

Fig. 2. The general tax code

The first book deals with the tax and recovery rules, and has 9 titles and 209 articles. Book 2 deals with the tax procedures and has 3 titles and 39 articles. Book 3 deals with other duties and taxes and has 5 titles and 40 articles.

We started with the first title of the first book of the general tax code, on “corporation tax” (see Fig. 3).

Fig. 3. The first title of the first book of the general tax code

4.2 Extracting the Legal Entities

In this step we have manually analyzed the corpus and extracted the legal entities. We identified 679 legal entities.

4.3 Lemmatization of Legal Entities

Then, we lemmatized the extracted legal entities by reducing words bearing inflection marks (plural, conjugated form of a verb, etc.) to their reference forms (lemma or canonical form).

For example, the legal entity “Personnes imposables” (Taxable persons) becomes “Personne imposable” (Taxable person).
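The example above can be reproduced with a minimal lookup-based sketch. The text does not say which tool or procedure performed the lemmatization, so the table `LEMMAS` and the function `lemmatize` below are hypothetical; a real system would draw the form-to-lemma mapping from a morphological lexicon rather than a hand-written table.

```python
# Hypothetical form-to-lemma table covering only the example in the text.
LEMMAS = {
    "personnes": "personne",
    "imposables": "imposable",
}


def lemmatize(entity):
    """Replace each word by its lemma when known; keep unknown words unchanged."""
    words = entity.split()
    lemmas = [LEMMAS.get(word.lower(), word.lower()) for word in words]
    result = " ".join(lemmas)
    # Preserve the capitalization of the first letter of the entity.
    if entity[:1].isupper():
        result = result[:1].upper() + result[1:]
    return result


print(lemmatize("Personnes imposables"))  # Personne imposable
```

A lookup table is used here instead of suffix stripping because blindly removing a final “s” would also alter words that belong to the canonical form of a term, such as “sociétés” in “impôt sur les sociétés”.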

4.4 Inflectional and Derivational Morphology

In this step, we established the inflected and derived forms of the legal entities using NooJ grammars. An extract is given in Fig. 4.

Fig. 4. An extract of the inflectional grammar

For example, the inflectional model “ACHAT” is defined by “ACHAT =  <E>/m+s | <PW> s/m+p;” and means that the legal term that uses this inflectional model has two forms:

  • The term as it is: masculine singular

  • The term with an “s” at the end of the first word: masculine plural.
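The effect of this model can be simulated with a small interpreter. The sketch below is not NooJ's implementation: it covers only the two operators that appear in the “ACHAT” model, and it interprets them as the text describes, with `<E>` leaving the term unchanged and `<PW>` moving to the end of the first word. The function name `inflect` is hypothetical.

```python
import re


def inflect(lemma, model):
    """Apply a tiny subset of an inflectional model: <E>, <PW>, and literal text.

    Returns a list of (inflected form, feature list) pairs, one per alternative.
    """
    forms = []
    for alternative in model.rstrip(";").split("|"):
        commands, features = alternative.strip().split("/")
        words = lemma.split(" ")
        target = len(words) - 1  # by default, edit the end of the whole term
        for token in re.findall(r"<\w+>|[^<>\s]+", commands):
            if token == "<E>":
                continue  # empty string: the term stays as it is
            elif token == "<PW>":
                target = 0  # as described in the text: end of the first word
            elif token.startswith("<"):
                raise ValueError(f"operator {token} is not covered by this sketch")
            else:
                words[target] += token  # literal characters are appended
        forms.append((" ".join(words), features.split("+")))
    return forms


# Right-hand side of the model definition quoted in the text.
ACHAT = "<E>/m+s | <PW> s/m+p;"

for form, features in inflect("acte d’acquisition définitif", ACHAT):
    print(form, features)
```

For the lemma “acte d’acquisition définitif”, the first alternative returns the term unchanged with features m and s, and the second returns “actes d’acquisition définitif” with features m and p.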

4.5 Conceptualization

After establishing the list of legal entities, we grouped these entities into semantic classes by establishing a list of concepts. We established 42 concepts. Table 3 gives some legal concepts with their descriptions and example entities.

Table 3. Examples of legal concepts

4.6 The Construction of the Legal Dictionary

Finally, we structured the legal terms by building an electronic dictionary of legal terms. The electronic dictionary was developed with NooJ [17, 18, 19] and has 679 entries. An extract is given in Fig. 5.

Fig. 5. An extract of the legal dictionary

For example, for the dictionary entry “acte d’acquisition définitif”:

acte d’acquisition définitif, NC+TJ+ACTE+FLX = ACHAT.

  • acte d’acquisition définitif: the legal entity

  • NC+TJ: the categories, compound noun (NC) and legal term (TJ)

  • +ACTE: the semantic class “ACTE”

  • +FLX = ACHAT: the inflectional model “ACHAT”

The inflectional model “ACHAT” is defined by “ACHAT =  <E>/m+s | <PW>s/m+p;” which means that the legal term has two inflected forms:

  • acte d’acquisition définitif: masculine singular

  • actes d’acquisition définitif: masculine plural.
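The structure of such an entry can be made explicit with a short parser. This is a sketch based only on the entry shown above (a lemma, a comma, then `+`-separated codes, one of which has the form `FLX = MODEL`); the `Entry` class and `parse_entry` function are hypothetical and do not handle other features of the NooJ dictionary format.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    lemma: str        # the legal entity
    tags: list        # category and semantic codes, e.g. NC, TJ, ACTE
    properties: dict  # key = value codes, e.g. FLX = ACHAT


def parse_entry(line):
    """Split 'lemma, CODE+CODE+...+KEY = VALUE' into its parts."""
    lemma, separator, information = line.partition(",")
    if not separator:
        raise ValueError("expected 'lemma, information'")
    tags, properties = [], {}
    for item in information.split("+"):
        key, equals, value = item.partition("=")
        if equals:
            properties[key.strip()] = value.strip()
        else:
            tags.append(item.strip())
    return Entry(lemma.strip(), tags, properties)


entry = parse_entry("acte d’acquisition définitif, NC+TJ+ACTE+FLX = ACHAT")
print(entry.lemma)
print(entry.tags)
print(entry.properties)
```

Keeping the inflectional model as a named property rather than a plain tag makes it straightforward to look up the model definition separately and generate the inflected forms from the lemma.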

5 Experimentation

The NooJ legal dictionary that we have developed can annotate and recognize legal entities in natural language text. With it, legal terms in natural language questions can be analyzed and recognized automatically on the NooJ platform.

Figure 6 shows the result of annotating, with the NooJ legal dictionary that we built, the French question “Quelles sont les sociétés qui sont passibles de l'impôt sur les sociétés?” (Which companies are liable to corporation tax?). The annotation shows that the term “société” (company) was identified as a noun and legal term, masculine plural, of semantic class “COMPANY”, and that the term “passibles de l’impôt sur les sociétés” (liable to corporation tax) was identified as a noun and legal term, masculine plural, of semantic class “STATE”.

Fig. 6. The result of the annotation with the NooJ legal dictionary
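The kind of annotation described for this question can be approximated by a dictionary lookup over token sequences. The sketch below is a simplification and not NooJ's annotation engine: it uses a hypothetical two-entry lexicon of inflected forms built from the example above, and it keeps only the longest match at each position, whereas a full annotator may also report overlapping or nested matches.

```python
import re

# Hypothetical lexicon: inflected form -> (lemma, semantic class).
LEXICON = {
    "sociétés": ("société", "COMPANY"),
    "passibles de l'impôt sur les sociétés": (
        "passible de l'impôt sur les sociétés",
        "STATE",
    ),
}


def tokenize(text):
    """Lowercase the text and split it into runs of letters."""
    return re.findall(r"[^\W\d_]+", text.lower())


# Index the lexicon by token tuple so multi-word terms can be matched.
INDEX = {tuple(tokenize(form)): info for form, info in LEXICON.items()}
MAX_LENGTH = max(len(key) for key in INDEX)


def annotate(question):
    """Return (start, end, lemma, semantic class) for each longest match."""
    tokens = tokenize(question)
    annotations, i = [], 0
    while i < len(tokens):
        for n in range(min(MAX_LENGTH, len(tokens) - i), 0, -1):
            info = INDEX.get(tuple(tokens[i:i + n]))
            if info:
                annotations.append((i, i + n, *info))
                i += n
                break
        else:
            i += 1  # no term starts at this token
    return annotations


question = "Quelles sont les sociétés qui sont passibles de l'impôt sur les sociétés?"
for annotation in annotate(question):
    print(annotation)
```

With this toy lexicon, the first occurrence of “sociétés” is annotated with the class “COMPANY” and the span starting at “passibles” with the class “STATE”; the second “sociétés” is absorbed into the longer term.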

6 Conclusion

In this work we have developed an electronic NooJ dictionary that annotates and recognizes legal terms in natural language texts. We adopted a six-step methodological framework for its construction: (1) we constituted a legal corpus of laws and decrees, focusing on the first title of the first book of the general tax code, on “corporation tax”; (2) we manually analyzed the corpus and extracted 679 legal entities; (3) we lemmatized the extracted legal entities by reducing words bearing inflection marks (plural, conjugated form of a verb, etc.) to their reference forms; (4) we built grammars describing the inflectional and derivational morphology of the legal entities; (5) we grouped the legal entities into semantic classes by establishing 42 concepts; (6) we structured the legal entities by building a NooJ electronic legal dictionary capable of annotating and identifying legal terms in natural language texts.

As future work, we will integrate the legal dictionary into our question-answering system, using it in the automatic processing of users’ natural language questions, the objective being to extract the information needed to formulate SPARQL queries equivalent to those questions.