1 Introduction

Globalization has amplified the problem of internationalizing and localizing software systems to ensure that they comply with both international and local regulations. Different international policies, laws, and regulations, written in a wide range of languages even within one jurisdictional entity—such as the European Union—together with privacy and security requirements, pose serious challenges for software system developers worldwide. In particular, IT professionals have to face the so-called regulation compliance problem, whereby companies and developers are required to ensure that their software systems comply with relevant regulations, either by design or through re-engineering [3].

Legal documents are written in a jargon colloquially referred to as legalese, a specialized language that makes the elicitation of requirements from regulations a difficult task for developers who lack proper legal training. Misinterpretation—for example, by overlooking an exception in a regulatory rule—can have legal consequences, as it may lead to incorrect assignment of rights or obligations to stakeholders. Elicitation of requirements is generally achieved in a manual and often haphazard way. Breaux et al. [13] propose a systematic manual process for extracting rights and obligations from legal text. Our work seeks to augment this manual process with tool support, thereby improving productivity, as well as quality and consistency of the output.

To accomplish this complex task, we use semantic annotation (SA) techniques, where legal text is annotated on the basis of concepts that have been defined in terms of a conceptual schema (aka ontology). Such techniques are used widely in the field of the semantic Web to generate semistructured data amenable to automated processing from unstructured Web data. Such techniques have also been fruitfully applied in requirements engineering to support partial automation of specific steps in the elicitation process [29].

This paper presents the details of an SA-based tool for requirements extraction from legal documents. We discuss related issues and propose an engineering approach founded on an existing framework for SA called Cerno [32]. This framework has been adapted and extended to deal with many of the complexities of legal documents, and the result is a new tool named GaiusT. The main objective of this paper is to present GaiusT along with its evaluation on regulatory documents written in two different languages: the U.S. Health Insurance Portability and Accountability Act [56] (HIPAA, in English) and the Italian accessibility law for information technology instruments (known as the Stanca Act, in Italian) [26]. The novelty of our contributions rests in identifying a combination of techniques adopted from research on semantic annotations, legal document processing, and software reverse engineering, which together demonstrably improve productivity in extracting requirements from legal text.

These contributions expand on an earlier paper [30], where we first outlined our preliminary research plan, along with a first application of GaiusT. In this work, we focus on the architecture and implementation of GaiusT and expand on the experimental results reported in [30] based on two case studies.

The rest of the paper is organized as follows. Section 2 describes the challenges of analyzing prescriptive documents. In Sect. 3, we present the research baseline of this work, while Sect. 4 highlights requirements for analyzing legal documents. In Sect. 5, we propose a design for GaiusT to address these requirements and elaborate on its architecture and application from a user-centered perspective in Sect. 6. In Sect. 7, we evaluate our proposal by applying the tool to two case studies and report the results of experiments comparing the tool’s analysis of these laws with that of human experts. Finally, we provide an overview and comparison with other methods and projects in the area in Sect. 8 and draw conclusions in Sect. 9.

2 Complexities of prescriptive documents

Analysis of legal documents poses a number of challenges different from those found in other documents, largely because of the special traditions and conventions of law. In this section, we report on some of the most distinctive characteristics of legal documents, focusing in particular on laws and regulations. The handling of these characteristics forms a part of the requirements for the GaiusT system.

The most distinctive characteristic of legal documents is that they are prescriptive [37] and are written accordingly. This means that they express the permission of desirable and/or the prohibition of undesirable behaviors and situations. The norms that they convey are requirements that control behaviors or possible states of affairs in the world. Consequently, each norm has a core part called the norm-kernel, which carries the essential information conveyed by the norm. This information includes a legal modality, a subject, an object, conditions of applicability, and exceptions. The legal modality determines the function of a norm, which can be either an obligation or a right [60]. A right is an action that a stakeholder is conditionally permitted to perform, and an obligation is an action that a stakeholder is conditionally required to perform. In contrast, an anti-right/anti-obligation states that a right/obligation does not exist. The subject of a norm is the person or institution to whom the norm is directed. The object of a norm is the act regulated by the norm, considering the modality and other properties of the action. The conditions of applicability establish the circumstances under which a norm is applicable. These are clauses that suspend, or in some way modify, the principal right or obligation. In addition, each law contains general rules that prescribe a norm for a large group of individuals, actions, or objects, followed by a rule that excludes specific members or subsets of these groups. Often, conditions of applicability are modified by exceptions, both for subjects and for objects [44]. Exceptions can be introduced explicitly through words such as except or contrary to, or implicitly, when they have to be inferred. For example, in "A medical savings account is exempt from taxation under this subtitle unless such account has ceased to be a medical savings account," the use of the term unless indicates an exception. One or several exceptions related to a right can be formulated directly in the section mentioning this right, indirectly in another part of the law, or even in another law via cross-references. To support the markup of exceptions, we have identified a set of terms that are commonly used to express exceptions (see Table 3).
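
To make the norm-kernel concrete, the following sketch models it as a simple data structure; the type and member names, and the example norm, are our own illustration and are not part of GaiusT.

    using System;
    using System.Collections.Generic;

    enum LegalModality { Right, Obligation, AntiRight, AntiObligation }

    class NormKernel
    {
        public LegalModality Modality;                        // function of the norm
        public string Subject;                                // person or institution the norm is directed to
        public string Object;                                 // the act regulated by the norm
        public List<string> Conditions = new List<string>();  // conditions of applicability
        public List<string> Exceptions = new List<string>();  // explicit or implicit exceptions
    }

    class Example
    {
        static void Main()
        {
            // Hypothetical norm, loosely following the examples discussed in the text.
            var norm = new NormKernel
            {
                Modality = LegalModality.Obligation,
                Subject = "organization",
                Object = "send a request to the user"
            };
            norm.Conditions.Add("when the user files a complaint");
            norm.Exceptions.Add("unless the complaint has been withdrawn");
            Console.WriteLine(norm.Modality + ": " + norm.Subject + " / " + norm.Object);
        }
    }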

Table 1 Document structure classification

Cross-references create links between parts of the same document (internal) or with parts of other documents (external) and are essential to the interpretation of laws. To unambiguously identify the parts of a document and to allow for explicit references to related parts, legal documents are organized using a deep hierarchical structure.

To facilitate the understanding of a law, every legal document contains a declarative part—the definition of legal terms—that defines legal concepts used in the document. This constitutes the glossary section of a legal document and plays a fundamental role in its interpretation. The lack of such a declarative part can result in ambiguity and misinterpretations.

All of these features pose serious challenges for semantic analysis. In fact, although the vocabulary and grammar used in such documents are often restricted, inconsistent use of terms is pervasive, mainly due to the length and complexity of the lifetime of legal documents. Moreover, it is often the case that different parts of a legal document are written by different people, either because of the sheer size of the document, or as a result of subsequent modifications.

Another problem dimension arises from the multilinguality of laws applicable in a given jurisdiction (such as the European Union). Analysis of regulations written in different languages entails handling the different structural, syntactic, and semantic issues of each particular language.

Furthermore, the requirements obtained from several regulations must be integrated, aligned, and ranked according to the priority of the regulations they were extracted from.

Finally, inaccuracy in the analysis of legal documents must be minimal because, in the eyes of the law, there is no such thing as "near-compliance". For example, by missing an exception or a condition, one could grant or deny an important right or obligation to the wrong party.

3 Research baseline: Cerno

In light of the complexity of legal documents outlined above, we propose a tool-assisted process for their annotation, which includes the following main activities: (1) structural analysis, to handle the identification of structural elements of a document, basic text constructs, and cross-references; (2) semantic analysis, to identify normative elements of a document, such as rights, obligations, anti-rights, and anti-obligations.

In order to realize this process, we have adopted and extended the semantic annotation framework of Cerno [31] and used it as a foundation for our work. Cerno is based on a lightweight text pattern-matching approach that takes advantage of the structural transformation system TXL [16]. The types of problems that Cerno can address are similar to those faced in the analysis of legal documents, such as the need to understand the document’s structure and the need to identify concept instances present in the text. Moreover, experimentation with Cerno has shown it to be readily adaptable to different types of documents [31, 33, 63].

In Cerno, textual annotations for concepts of interest are inferred based on a conceptual model of the application domain. Annotation rules can be automatically or manually constructed from this model. Such an approach factors out domain-dependent knowledge, thus allowing for the development of a wide range of semantic annotation tools to support different domains, document types, and annotation tasks.

Cerno’s annotation process has three phases: (1) Parse, where the structure of an input document is processed, (2) Markup, where annotations are generated based on annotation rules associated with the concepts of the conceptual model, and (3) Mapping, which results in a relational database containing instances of concepts, as identified by annotations.

The goal of the Parse step is to break down an input document into its structural constituents and identify interesting word-equivalent objects, such as e-mail addresses, phone numbers, and other basic units. Similarly to Cerno’s Parse step, structural analysis of legal texts identifies both structural elements of the document and word-equivalent objects. These are basic legal concepts, such as legal actors, roles, procedures, and other entities normally defined in the Definitions section of the law.

Next, the Markup phase conducts semantic analysis. An annotation schema specifies the rules for identifying domain concepts by listing concept identifiers with their syntactic indicators. These indicators can be single words and phrases, or names of previously parsed basic entities. The annotation schema can be populated with such syntactic indicators either through (semi-) automatic techniques that use keywords of the conceptual model—if available—or hand-crafted from a set of examples.

Lastly, the Mapping phase of Cerno is optional and depends on how one proposes to use generated annotations. In the analysis of legal documents, we assume that a human analyst may want to correct or enrich tool-generated results. In this setting, annotations must remain embedded in the documents so that the user can easily revise them using textual context. After this revision, the annotations can be indexed and stored in a database.

Although the basic steps in Cerno clearly apply well to our problem, the annotation process we envision for legal documents required a significant adaptation effort at the design and implementation level in order to make the framework not only applicable in the legal domain, but also usable by requirements engineers and possibly legal consultants. To this end, we designed and developed the GaiusT system as an extension and specialization of Cerno that includes new components specific to the annotation of legal documents in different languages. To enhance the usability of the system, a GUI was added to better support human interaction.

The main extensions to Cerno are described next. Firstly, although Cerno was well adapted to the sequence-of-paragraphs form of English text, such as newspaper articles, Web pages, and advertisements, it could not handle the specialized semiformal structure of legal texts. We therefore extended the TXL-based Cerno parser so that it can process the nested structural form of legal documents, including numbering and nesting of sections, paragraphs, and clauses.

The labels of legal document sections are also context-dependent. For example, "clause (b) above" implicitly refers to the one in the current paragraph and section. To resolve this problem, the parser was enhanced with an additional TXL source transformation step that contextualizes section, paragraph, and clause names to make them globally unique. For example, the label of clause "(b)" in paragraph 2 of section 5 becomes "5.2(b)". A subsequent reference resolution transformation was added to use these globally unique labels to link local and global references to the clauses, paragraphs, and sections they refer to. For example, the reference "except as in (b) above" in paragraph 2 of section 5 is resolved to "except as in 5.2(b) above". The set of Cerno word-equivalent patterns, which recognize items such as phone numbers, email addresses, Web URLs, and the like, was extended to recognize legal items such as regulation and law names (e.g., "42 U.S.C.—405(a) (2006)").
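
GaiusT implements this contextualization as TXL source transformations; the following C# fragment is only a rough sketch of the underlying idea, with hypothetical method names and a deliberately simplified reference pattern.

    using System;
    using System.Text.RegularExpressions;

    static class ReferenceResolution
    {
        // Make a local clause label globally unique by prefixing section and paragraph
        // numbers, e.g. clause "(b)" in paragraph 2 of section 5 becomes "5.2(b)".
        static string Contextualize(string clauseLabel, int section, int paragraph) =>
            $"{section}.{paragraph}{clauseLabel}";

        // Resolve local references such as "except as in (b) above" within the current
        // section and paragraph, e.g. to "except as in 5.2(b) above".
        static string ResolveLocalReferences(string text, int section, int paragraph) =>
            Regex.Replace(text, @"\(([a-z])\) above",
                m => $"{section}.{paragraph}({m.Groups[1].Value}) above");

        static void Main()
        {
            Console.WriteLine(Contextualize("(b)", 5, 2));                          // 5.2(b)
            Console.WriteLine(ResolveLocalReferences("except as in (b) above", 5, 2));
        }
    }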

The set of concept patterns was also extended and enhanced to include terms and language for legal obligations, rights, and other concepts, and patterns were added to recognize temporal clauses such as “within a month”.

Finally, Cerno’s word pattern specification notation was enhanced to process Italian regular and irregular verb forms in order to enable processing of Italian legal documents.

In the following section, we describe the design of the GaiusT tool, considering both the structural and semantic analysis steps of the extended Cerno process.

4 Semantic annotation of legal documents

4.1 Structural analysis

A fundamental feature of legal documents is their deep hierarchical structure. Identifying document structure is important to semantic annotation for two main reasons: (1) recognition of potential annotation units, thus allowing for different annotation granularity, and (2) flattening the document’s hierarchical representation, consisting of sections, subsections, etc. Annotation granularity can be defined as the fragment of text that refers to a concept instance, or an amount of context that must be annotated together with the concept. Different granularities can be used to address different objectives. In legal documents, for instance, single-word granularity is convenient for annotating basic concepts, such as stakeholders or cross-references.

From a semantic viewpoint, the semiformal structure of legal documents supports the identification of complex deontic concepts, by inferring relationships between identified concepts from their relative positions in the structure. The identification of structural elements in the process of semantic annotation also improves the accuracy of the results, e.g., by treating a list as a single object for a given statement.

In general, any document has two levels of structure, a rhetorical structure and an abstract structure, both of which contribute to meaning [46]. According to rhetorical structure theory [35], a document can be represented as a hierarchy where some text units are salient, while others constitute support. Abstract document structure, on the other hand, relies on graphical visualization of the text and includes syntactic elements used for text representation [20]. Both rhetorical and abstract document structures are important for the design of a structure analysis component.

Due to the variety of existing document formats and contexts, it is impossible to define a common structure valid for all documents, even though several efforts have been devoted to the problem [54]. There are several possible strategies to automatically capture the structure of documents [23, 61]. The most widely used approach is based on feature-based boundary detection of text fragments. This approach exploits patterns, such as punctuation marks, to identify text units. These patterns are typically composed of regular expressions and rely on features such as non-printable characters (e.g., carriage returns for new-line identification), list symbols or indentation for text blocks, and the graphical position of certain elements such as titles. To capture the deep hierarchical structure of legal documents, we have defined a classification of document text units, using a hierarchy where a text unit at a given level is composed of one or more units at the next level down (Table 1). This classification facilitates the management of a large class of heterogeneously structured documents, with the only assumption that they are written in languages of the European family. Moreover, this classification can be specialized according to the type of application and granularity level needed for a particular semantic analysis.
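
The following fragment sketches what such feature-based boundary-detection patterns might look like; GaiusT itself expresses them as TXL grammar rules, so these C# regular expressions are purely illustrative and cover only a few unit types.

    using System;
    using System.Text.RegularExpressions;

    // Illustrative boundary-detection patterns in the spirit of the feature-based
    // approach described above; not GaiusT's actual grammar.
    static class TextUnitPatterns
    {
        // A section header such as "Sec. 164.526" or "§ 164.526" at the start of a line.
        public static readonly Regex Section =
            new Regex(@"^(Sec\.|§)\s*\d+(\.\d+)*", RegexOptions.Multiline);

        // A numbered paragraph such as "(2)" at the start of a line.
        public static readonly Regex Paragraph =
            new Regex(@"^\(\d+\)", RegexOptions.Multiline);

        // A lettered clause such as "(b)" or a roman-numeral sub-clause such as "(ii)".
        public static readonly Regex Clause =
            new Regex(@"^\(([a-z]|[ivx]+)\)", RegexOptions.Multiline);

        // Sentence boundaries approximated by terminal punctuation followed by whitespace.
        public static readonly Regex Sentence = new Regex(@"[.;:]\s+");

        static void Main()
        {
            Console.WriteLine(Section.IsMatch("Sec. 164.526 Amendment of protected health information."));  // True
            Console.WriteLine(Clause.IsMatch("(b) Implementation specifications."));                        // True
        }
    }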

In GaiusT, we have defined grammatical patterns, using TXL’s ambiguous context-free grammar notation, to capture text units at different levels of this document hierarchy.

To capture the complex hierarchical structure and cross-references of legal documents, we adopted "The Guide to Legal Writing" [55], which presents styles for the proper writing and structuring of legal documents. Based on the hierarchy reported in the guide, we specialized our grammatical patterns (see, for example, Tables 5, 10). For cross-references, since loops in cross-reference chains are intractable (see, for example, HIPAA paragraph 164.528(a)(2)(i)), our approach consists of recognizing and annotating cross-references, so that the annotator can take them into account and decide on a case-by-case basis. Special parsing rules catch both external and internal cross-references, and a TXL transformation renames all local labels and references to be globally unique. For example, the transformation changes the list element label "(b)" of paragraph 2 of section 5 to "5.2(b)".

4.2 Semantic analysis of legal documents

The semantic analysis of a document requires the development of a conceptual model that represents domain information, which for legal documents means deontic concepts. To design the conceptual model, we used an iterative process. An initial list of important concepts in the domain of legal texts was derived from "The Guide to Legal Writing" [55] and from norm-kernel concepts [60]. We then applied part of the semantic parameterization methodology developed by Breaux and Antón [9], which defines a set of guidelines (textual patterns) for extracting legal requirements from regulations. Our model was then discussed with legal experts and finalized based on their recommendations. The experts were colleagues from the Faculty of Law of the University of Trento, and they helped with the analysis and refinement of legal concepts/terms and their interrelations.

The resulting model, represented in Fig. 1, includes as classes only the core deontic concepts. It leaves out others, such as penalties, that, according to legal experts, can be considered properties of the core deontic concepts. The conceptual model was used as the basis for the definition of an annotation schema, as discussed in Sect. 6. Concepts of the conceptual model are used to populate the annotation schema, and for each concept, a set of syntactic indicators and rules or patterns is defined. For the identification of prescribed behaviors—rights, obligations, and their antitheses—the annotation schema employs a set of manually constructed patterns, based on modal verbs (see Table 2). For identifying basic concepts—such as actors, resources, and actions—we reuse the Definitions section of a legal document, where every term used is defined precisely and a reference synonym is assigned [18, 55].
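
A minimal sketch of this kind of indicator-based matching is shown below, in the spirit of Table 2; the indicator lists and names are illustrative examples, not the complete annotation schema used by GaiusT.

    using System;
    using System.Collections.Generic;
    using System.Linq;

    static class DeonticMatcher
    {
        // Example modal-verb indicators per deontic concept; illustrative only.
        static readonly Dictionary<string, string[]> Indicators = new Dictionary<string, string[]>
        {
            ["Right"]          = new[] { "may", "can", "has a right to" },
            ["Obligation"]     = new[] { "must", "shall", "is required to" },
            ["AntiRight"]      = new[] { "may not", "cannot" },
            ["AntiObligation"] = new[] { "is not required to", "need not" }
        };

        // Return the concept whose longest matching indicator occurs in the sentence,
        // so that "may not" wins over "may"; null when nothing matches.
        static string Match(string sentence)
        {
            var s = " " + sentence.ToLowerInvariant() + " ";
            var best = Indicators
                .SelectMany(e => e.Value.Select(ind => (Concept: e.Key, Indicator: ind)))
                .Where(p => s.Contains(" " + p.Indicator + " "))
                .OrderByDescending(p => p.Indicator.Length)
                .FirstOrDefault();
            return best.Concept;
        }

        static void Main()
        {
            // prints: Obligation
            Console.WriteLine(Match("A covered entity must permit an individual to request restrictions.") ?? "no match");
        }
    }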

Fig. 1

The conceptual model of legal concepts

Table 2 Syntactic indicators for deontic concepts

Conditions of applicability are represented in the conceptual model as the concept constraint, which is specialized into two subconcepts, exception and temporal condition. A list of syntactic indicators can be generated using a combined approach that analyzes the content of the legal document with the help of lexical databases such as WordNet or a thesaurus. In our case studies, we manually built a list of syntactic indicators for temporal conditions and exceptions and integrated it with lists derived from lexical databases (Table 3).

Table 3 Syntactic indicators for Exception

4.2.1 Multilingual semantic annotation of legal documents

Language diversity in the semantic annotation process is treated at three distinct levels: (1) Structural: a text available in multiple language translations uses different character sets, word splitting, orders of reading, layout, text units, and so on; (2) Syntactic: the annotation patterns and rules used to identify concepts need to be expressed in the target language. For example, the syntactic indicators used for the concept of obligation (see Sect. 6) must be revised for each new target language; (3) Semantic: the semantic model of legal concepts should have a language-independent representation; thus, the syntactic indicators for each concept in different languages must be carefully defined.

The structural level has been investigated in the information retrieval (IR) area and basically amounts to segmenting text into words in different languages, using word stemming and morphology, and identifying phrase boundaries, using punctuation conventions. There are many tools for handling these issues. However, software developers often underestimate the importance of these aspects, whereas a successful text analysis tool must always cater to them. Issues related to this level are managed by the document structure analysis described in Sect. 5.

The syntactic level relates to the fact that people of different nations express the same concept in different ways. For example, working with Italian legal documents, we discovered that statements expressing an obligation normally use the present active or present passive tense, e.g., "the organization sends a request" or "the request is sent to the user". This form differs from the English construction of the same concept, where obligations are usually expressed with the modal verbs "must" or "should". However, syntactic differences do not affect the conceptual model built for the domain of legal documents, since the concepts represented are valid for languages other than English as well (Fig. 1). In our case studies, we used the WFL Generator and Lexical Database Manager components of GaiusT to identify types of syntactic indicators for the concepts of the annotation schema (see Sects. 7.1.1 and 7.2.1). These components perform statistical analysis of the documents and use lexical database sources of the specific language to integrate the results.

The most problematic issue for annotation across languages is the semantic level, because it has to account for alternative worldviews and cultural backgrounds. The existence of semantic vagueness, semantic mismatch, and ambiguities between languages poses several challenges [58]. The main problems at this level are caused by proper names; temporal and spatial conventions; language patterns, such as specific or idiomatic ways to express certain concepts; and the representation of co-references, time references, place references, causal relationships, and associative relationships. For example, consider the concept "week": in many European countries, a week starts on Sunday and ends on Saturday, whereas in some Eastern-European countries, it starts on Monday and ends on Sunday. Thus, the expression "by the end of the week" refers to different time frames depending on the country. The best approach to handling semantic differences consists in developing parallel alignments with the same methodology and the same conceptual model [50]. When using GaiusT with English and Italian documents, such semantic differences were not essential and did not have to be captured in the conceptual model. This section has considered some of the issues that arise in annotating multilingual legal documents, and GaiusT components were adopted to deal with them. However, for other aspects, such as contextual information and implicit assumptions (i.e., ambiguities used to embrace multiple situations), a deeper semantic analysis is required (see, for example, [17]).

5 System architecture

This section presents the system architecture of the GaiusT tool, including its components and the technologies used. The overall architecture is shown in Fig. 2.

Fig. 2

Architecture of GaiusT

GaiusT integrates and extends existing modules of Cerno (shown in green in Fig. 2) by adding modules to automate text preprocessing, annotation, and annotation schema generation. The extraction process is supported by several components that start from input documents, annotate them, and map the results into a database for later analysis. The annotation process used by Cerno is based on a hand-crafted conceptual model specifying domain-dependent information, enriched with lexical information to support the identification of semantic categories. The enriched model is the basis for defining an annotation schema. Taking into account that the annotation schema can change according to user interests and goals, and that the schema has a great impact on the annotation process, we developed a series of modules that support the generation of an annotation schema.

GaiusT has been developed on a Microsoft Windows platform using the .NET framework (in C#). Choices about external tools and libraries, such as the PDF text extraction library or the part-of-speech tagger, have been driven mainly by platform compatibility, modularity, integration, and multilingual requirements.

The architecture of GaiusT is composed of four main layers: (a) the annotation schema layer, (b) the annotation layer, (c) the database layer, and (d) the graphical user interface (GUI) layer, as shown in Fig. 2. The entire system is about 50 k lines of code and more than 130 MB in size. Physically, the system includes eight main components: (1) annotation schema generator, (2) preprocessing component, (3) TXL-rule generator, (4) document structure analyzer, (5) annotation generator, (6) database mapper, (7) evaluation component, and (8) GUI.

The components of the architecture are presented in the following subsections, taking into account the input and output data, as well as the functionality of each component.

5.1 The annotation schema layer

The annotation schema layer sets up the core elements for the semantic annotation (SA) process and consists of the Annotation Schema Generator component. Its main input is a conceptual model that represents the domain of the document, defined taking into account also the goals of the annotation. Its main output is the preliminary annotation schema (PAS). The component is implemented using a combination of C# and TXL programs. The generator uses four modules to generate the preliminary annotation schema:

  • The Conceptual Model Parser extracts concepts, properties, relationships, and constraints from an XML Metadata Interchange (XMI), RDF, or OWL source file and produces a list of items organized in a plain text file. This module takes as input a conceptual model generated using a CASE tool that supports the XMI UML interchange language, such as UMLSpeed, MagicDraw, Visual Paradigm, or Protégé, and returns the skeleton of a PAS. The module is a C# library.

  • The Word Frequency List (WFL) Generator provides an inverse frequency list of the words in one or more input documents (a minimal sketch of this step appears after this list). First, the module tokenizes the input and removes stop words according to the language. The WFL Generator produces two separate lists: one with the frequencies of the words in the source file, and another with the lemmas of the words and their frequencies. The lemmatization process, which reduces inflectional forms of a word to its lemma, is carried out by the lemmatizer module of the Lucene open source search engine, which also supports multiple languages.

  • The Lexical Database Manager is an interface to lexical resources such as WordNet and a thesaurus and, for multilingual definitions, Wikipedia. The input to the module is the set of elements extracted from the conceptual model, while the output is a list of candidate syntactic indicators for each element of the PAS. The interface provides facilities to query lexical resources both online and standalone and to filter the output, for example using the options provided by the WordNet command-line program.

  • The Part-of-Speech (POS) Manager is a module that can be used to refine the syntactic indicators of the PAS. It uses a part-of-speech tagger developed by the Institute for Computational Linguistics of the University of Stuttgart [51] to generate a table with the token elements of the document and the probability of their syntactic roles. The tokens can be used to strengthen or enrich the list of linguistic indicators. The POS Manager supports different languages.
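
As anticipated in the description of the WFL Generator, the following fragment sketches the word-frequency step under simplifying assumptions (a plain-text input, a small hard-coded stop-word list, and no lemmatization); the actual module relies on Lucene and produces a second, lemmatized list.

    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Text.RegularExpressions;

    static class WordFrequency
    {
        // Tiny illustrative stop-word list; the real module uses language-specific lists.
        static readonly HashSet<string> StopWords =
            new HashSet<string> { "the", "a", "an", "of", "to", "and", "or", "in", "is", "by", "that" };

        // Tokenize, filter stop words, count occurrences, and sort by descending frequency.
        static IEnumerable<KeyValuePair<string, int>> InverseFrequencyList(string text)
        {
            return Regex.Matches(text.ToLowerInvariant(), @"[a-zàèéìòù]+")
                .Cast<Match>()
                .Select(m => m.Value)
                .Where(w => !StopWords.Contains(w))
                .GroupBy(w => w)
                .Select(g => new KeyValuePair<string, int>(g.Key, g.Count()))
                .OrderByDescending(kv => kv.Value);
        }

        static void Main()
        {
            var sample = "A covered entity must permit an individual to request that the covered entity restrict uses.";
            foreach (var kv in InverseFrequencyList(sample))
                Console.WriteLine(kv.Key + "\t" + kv.Value);   // "covered 2", "entity 2", ...
        }
    }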

5.2 The annotation layer

The annotation layer includes four components: the preprocessing component, the TXL-rule generator, the document structure analyzer, and the annotation generator. It is the core layer of the tool and it preprocesses input documents, delimits the text unit boundaries, generates the grammar and rules for the parser, and performs the annotation.

  • The preprocessing component extracts plain text from input documents and cleans up unprintable characters. The Text Extractor uses open source libraries to extract plain text from Microsoft Word (DOC), Rich Text Format (RTF), and Adobe Acrobat (PDF) documents, while for Web pages of different formats (HTML, XHTML, PHP, ASP, and XML-structured documents) it uses a TXL program that extracts plain text and, whenever possible, preserves structural information useful for document structure identification. The input of the component is a source file in one of the above formats and the output is plain text. The Normalizer, based on an existing TXL module of Cerno [28], normalizes the plain text document generated by the Text Extractor by removing leading characters, unprintable characters, and trailing spaces, producing as output a text document where each line represents a phrase. The module supports the different character sets of European languages.

  • The TXL-rule generator component consists of two distinct domain-independent modules that generate TXL grammar rules and patterns. The Grammar Generator is a TXL program that, given a PAS, generates a TXL grammar file used in the annotation phase. This module generates expansions of the lemmas of the syntactic indicators defined in the preliminary annotation schema (a sketch of this expansion appears after this list). It supports English and Italian, expanding regular verbs and nouns with the proper forms (such as adding "s" for plural nouns and the third-person form of English verbs, or conjugating the present and past forms of regular Italian verbs with "-are", "-ere", and "-ire" suffixes). An example of such a grammar file is shown in Fig. 3. The Semantic Rule Generator generates a TXL program file with as many rules as there are patterns and constraints defined in the PAS.

    Fig. 3

    An example of grammar file generated by the annotation generator component

  • The document structure analyzer annotates text units of a given document with XML tags delimiting their boundaries. The document structure analyzer is composed of a TXL program with a set of rules, grammars, and patterns used to identify text units. The rules and grammars are ambiguous and use island parsing techniques to identify relevant parts [15]. Grammar overrides are used to enable robust parsing and to avoid errors or exceptions due to input not explained by the grammar. The input to the component consists of the grammar for the document structure and the normalized plain text document. The output is a document annotated with XML tags that identify structural elements.

  • The annotation generator is the core component of GaiusT. It embeds an extended version of the existing Cerno annotation prototype developed in [28]. The component runs the TXL interpreter with the annotation program and the grammar files created by the Grammar Generator and Semantic Rule Generator modules. The input of the component is semistructured text, i.e., the result of preprocessing and structural analysis, while the output is the text annotated in XML with the concepts of the conceptual model used for the annotation. The XML output document can be converted to OWL format using JXML2OWL.
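
As anticipated in the description of the Grammar Generator, the fragment below sketches the expansion of a lemma into surface forms, assuming regular morphology only; the actual module emits these expansions as TXL grammar alternatives rather than C# collections.

    using System;
    using System.Collections.Generic;

    static class LemmaExpansion
    {
        static IEnumerable<string> ExpandEnglishVerb(string lemma)
        {
            yield return lemma;                        // require
            yield return lemma + "s";                  // requires
            yield return lemma.TrimEnd('e') + "ed";    // required
            yield return lemma.TrimEnd('e') + "ing";   // requiring
        }

        static IEnumerable<string> ExpandItalianVerb(string lemma)
        {
            if (!lemma.EndsWith("are")) yield break;   // regular -are verbs only in this sketch
            var stem = lemma.Substring(0, lemma.Length - 3);
            foreach (var suffix in new[] { "a", "ano", "ato", "ata", "ati", "ate" })
                yield return stem + suffix;            // presenta, presentano, presentato, ...
        }

        static void Main()
        {
            Console.WriteLine(string.Join(", ", ExpandEnglishVerb("require")));
            Console.WriteLine(string.Join(", ", ExpandItalianVerb("presentare")));
        }
    }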

5.3 The database layer

The database layer includes the components that support evaluation and the mapping of the annotation data to a relational database.

  • The database mapper accepts the annotated XML document and uploads it into a database using two modules: the SQL Generator and the Bulk Database Loader. The SQL Generator, developed as part of a Master's thesis at Queen's University [52], uses a TXL program to parse an input annotated document and produces an output file with SQL data definition language statements along with SQL data manipulation language statements for a relational database. The input to the module is an XML document and the set of tags used for the annotation. The Bulk Database Loader then executes the SQL statements produced by the SQL Generator against the target relational database. The input is the SQL statements and the output is a populated relational database in a database system such as MySQL, Microsoft SQL Server, or Oracle.

  • The evaluation component supports the assessment of the quality of annotation results. It facilitates the evaluation process with the following features: (1) Text preparation: the annotated text is organized in a table where the columns are the set of tags used for the annotation and the rows are the annotated units. For each tag, the number of occurrences in the text units is reported. This representation provides input for statistical measures of the annotated text. (2) Comparison with the reference annotation: according to the evaluation framework provided in [28], the results of the GaiusT annotation can be compared with a manually annotated document. The component automatically compares the documents annotated by the tool with their reference annotation, i.e., a manually annotated gold standard if one exists, by calculating recall, precision, fallout, accuracy, error, and F-measure [62] (the standard definitions are sketched after this list). The component, realized in C#, accepts as input one or more documents with embedded annotations produced by GaiusT and the list of tags used for the annotation, and produces as output a database in which each evaluated document is represented as a table with the evaluation measures. The output can be exported as a spreadsheet.
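
As anticipated in the description of the evaluation component, the measures are the standard ones computed from true/false positives and negatives; the fragment below is a minimal sketch of their definitions, not GaiusT's actual implementation, and the counts in the example are hypothetical.

    using System;

    static class AnnotationMetrics
    {
        // tp/fp/fn/tn: true positives, false positives, false negatives, true negatives.
        static double Recall(int tp, int fn)    => (double)tp / (tp + fn);
        static double Precision(int tp, int fp) => (double)tp / (tp + fp);
        static double Fallout(int fp, int tn)   => (double)fp / (fp + tn);
        static double Accuracy(int tp, int tn, int fp, int fn) => (double)(tp + tn) / (tp + tn + fp + fn);
        static double Error(int tp, int tn, int fp, int fn)    => (double)(fp + fn) / (tp + tn + fp + fn);
        static double FMeasure(double precision, double recall) => 2 * precision * recall / (precision + recall);

        static void Main()
        {
            int tp = 40, fp = 5, fn = 8, tn = 100;   // hypothetical counts for one tag
            double p = Precision(tp, fp), r = Recall(tp, fn);
            Console.WriteLine($"precision={p:F2} recall={r:F2} F-measure={FMeasure(p, r):F2}");
            Console.WriteLine($"accuracy={Accuracy(tp, tn, fp, fn):F2} error={Error(tp, tn, fp, fn):F2} fallout={Fallout(fp, tn):F2}");
        }
    }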

5.4 The graphical user interface

The last layer is the graphical user interface (GUI), which supports the management of the entire process of semantic annotation. In particular, the GUI supports the annotation process by orchestrating all components and modules (see Fig. 4). It provides control over all input and output operations and parameters of every GaiusT component. The GUI is realized using Windows Forms of the Microsoft .NET platform.

Fig. 4

The GaiusT GUI

The panels of the GUI are as follows:

  • Project: the panel allows the control of the Annotation Project. An Annotation Project is an XML database file that stores all references to resources used for the annotation, the settings of the system, the input and output files of each component, and the parameters used for each component (see Fig. 4).

  • Settings: the panel allows the user to control the settings of external resources, such as the TXL programs, the database repository for storing annotations, Web resources, and third-party applications such as WordNet.

  • Annotation Schema: this panel allows the user to control all the components that contribute to the creation of the annotation schema. It is subdivided into different sections related to the different modules of the Annotation Schema Generator component.

  • Grammar: the panel allows managing the collection of grammar files used by the TXL programs. Each item of a grammar file can be managed through a checkbox window that permits adding, editing, or removing grammar items.

  • Preprocess: the Text Extractor and Normalizer modules are controlled through this panel. The panel also manages the Document Structure Analyzer component.

  • Process: provides visualization and control of the steps of the SA process so that each operation can be executed and managed independently of the others, allowing for flexibility and debugging.

  • Database: the database panel provides an interface to manage the SQL Generator and the Database Bulk Loader modules.

  • Browser: this component displays the annotated document, visualizing tags in different colors. The browser also offers other features, such as annotation editing and syntax highlighting. It takes as input plain text, rich text (RTF), or Web pages (HTML, XML, PHP, ASP, etc.) and displays them, also enabling hyperlink navigation. The visualization of the annotated document with the identified concepts is produced by an XSLT program that transforms the XML markup into an HTML document and presents the results in table format. The browser can also support the requirements engineer during manual annotation by facilitating tag insertion.

  • Quantitative evaluation: this panel maps the annotated text into a table where each column represents a tag used and each row an annotated unit. For each tag, the partial count per text unit and the overall number of tags in the document are calculated.

  • Qualitative evaluation: this panel facilitates comparison of annotated files against reference annotation (if available) and estimation of human disagreement.

Table 4 lists the components of GaiusT, showing their input and output data types.

Table 4 Components of GaiusT: input and output data type

6 The GaiusT semantic annotation process

On the basis of the extensions to the Cerno components described in Sect. 3, we present the generic process of semantic annotation used by GaiusT to annotate legal documents. Each step of the process can be carried out manually or automatically. Figure 6 shows the entire annotation process of GaiusT, including the input and output of the components.

6.1 Definition of a conceptual model

This phase is necessary if there is no preexisting conceptual model. The process for creating a conceptual model has been studied extensively in the literature and is supported by a variety of tools. The construction of the conceptual model of legal concepts was presented in Sect. 4.2.

6.2 Text extraction

Each input document is converted into plain text and normalized. This is achieved through two steps: (a) text extraction, performed by the Text Extractor module. This phase is critical, especially for the PDF format, given that there are no format tags as in RTF or HTML documents. In PDF documents, the information related to the format of the document is completely lost or confused, and manual intervention on the text extracted by the module is required to normalize the document structure; (b) normalization, performed by the Normalizer module. The output is plain text where each line represents a sentence.

6.3 Construction of the annotation schema

This step creates the annotation schema elements for the Annotation component. The process is divided into three subphases:

  (a)

    Construction of the preliminary annotation schema, which includes a set of concepts and other elements extracted from the conceptual model created using the XMI, RDF, or OWL languages. This preliminary schema is automatically created using the Conceptual Model Parser module. The PAS is structured as plain text in which the concepts and the other elements, such as relationships, properties, and constraints, are listed. The relationships are extracted from the XMI source, and for each relationship a pattern is generated.

  (b)

    Population of the annotation schema: the PAS is now revised by adding/filtering out relevant indicators to generate the final version of the annotation schema. In particular, each concept and property in the PAS needs to be enriched with syntactic indicators, i.e., keywords related to the concept such as synonyms or hyponyms (Fig. 5). The syntactic indicators are generated by using: (1) the WFL module, to produce an inverse frequency list of keywords of the input document; and (2) the Lexical Database module, to produce the list of hyponyms and synonyms for each concept and property. The two lists are then merged into one list, removing redundant elements. The final output is a table with two columns: the list of concepts and the related keywords retrieved.

    Fig. 5

    Structure of annotation schema

  (c)

    Generation of TXL grammar and rules for the annotation schema: during this step, the annotation schema is parsed by the Grammar Generator and the Semantic Rule Generator, and two files are produced: a grammar in which the syntactic indicators for each concept are defined, and a program file with all the semantic rules.

6.4 Annotation

Once the annotation schema is generated, the annotation process can be executed. It is divided into two subphases: (a) Document structure analysis. In this step, the input document is annotated with tags that identify the structure of the document and tags for basic word-equivalent objects. This step is important because it establishes the granularity of the annotation. The granularity can be controlled by the requirements engineer through the visual aid provided by the grammar panel. The annotation is generated by the Document Structure Analyzer component, which runs a TXL program with the grammar items chosen in the grammar panel. (b) Annotation. The Annotation Generator annotates the relevant items according to the annotation schema and the generated grammars and rules. Finally, it removes the structural tags used to identify the boundaries of text units (Fig. 6).

Fig. 6

Annotation process of GaiusT

6.5 Database population (optional)

The last stage of annotation is the population of a database with annotated documents. Two modules perform this operation: the SQL Generator module is used to generate a relational database from the XML annotated documents, while the Bulk Database Loader module loads the data into the target database. The target database can be any relational database that supports the SQL92 standard.
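
The following fragment sketches the kind of mapping performed at this stage, turning embedded XML annotations into SQL statements; the table and column names, and the XML fragment, are hypothetical, and the real SQL Generator is a TXL program with a richer schema.

    using System;
    using System.Text;
    using System.Xml.Linq;

    static class SqlMapping
    {
        static string ToSql(string annotatedXml)
        {
            var doc = XElement.Parse(annotatedXml);
            var sql = new StringBuilder("CREATE TABLE annotation (tag VARCHAR(32), content TEXT);\n");
            foreach (var element in doc.Descendants())
            {
                var content = element.Value.Replace("'", "''");   // escape single quotes for SQL
                sql.AppendLine($"INSERT INTO annotation (tag, content) VALUES ('{element.Name}', '{content}');");
            }
            return sql.ToString();
        }

        static void Main()
        {
            var fragment = "<Paragraph><Obligation>A covered entity must act on a request</Obligation>" +
                           "<Actor>covered entity</Actor></Paragraph>";
            Console.WriteLine(ToSql(fragment));
        }
    }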

7 Evaluation

To evaluate the efficacy and efficiency of the GaiusT tool, two experiments have been conducted, one on the HIPAA Privacy Rule (USA, in English) and the other on the Stanca law (Italy, in both English and Italian). Our experiments with the tool suggest that it is scalable, in the sense that it requires milliseconds to process a HIPAA section and just a few seconds to process and annotate all of HIPAA (approximately 147 K words). In all our experiments, evaluation was accomplished by comparison against human performance, consistent with the evaluation framework of [31]. Moreover, the comparison does not consider the overhead of setting up GaiusT resources, such as the construction of the conceptual model and the time used for tuning the syntactic elements of the annotation schema. This is a one-time overhead for a given language and can be at least partially offset against the overhead of training people so that they are comfortable with legal concepts and the annotation process for legal documents. We have not attempted a comparison of GaiusT with other tools, for pragmatic reasons: none of the tools we reviewed has features or scope comparable to GaiusT.

7.1 The HIPAA privacy rule

7.1.1 Case study setup

The U.S. Health Insurance Portability and Accountability Act (HIPAA) is a part of the Social Security Act. The HIPAA Privacy Rule defines standards for the protection of personal medical records and other personal health information. To perform the structural analysis of American laws, the Document Drafting Handbook, which contains the style guide for writing regulations [55], was consulted. Patterns extracted for the law hierarchy are reported in Table 5. The structure of American law contains 12 levels, so it was necessary to extend the hierarchical grammar of Table 1 for the analysis of the specific text units. The annotation schema used in this application focused on extracting a set of objects of concern: right, anti-right, obligation, anti-obligation, exception [13], and some types of conditions of applicability. The list of syntactic indicators for HIPAA was constructed following the guidelines of the semantic parameterization methodology and the process described in Sect. 6; a fragment is shown in Table 6. We used the Annotation Schema Generator component to parse the input conceptual model and build the preliminary annotation schema, populating the list of syntactic indicators with the WFL Generator, Lexical Database Manager, and POS Manager.

Table 5 The document structure of American Law
Table 6 Examples of syntactic indicators for legal requirements in HIPAA

Some of the indicators are complex patterns that combine literal phrases and basic concepts. Thus, the identification of indicators requires a preliminary recognition of cross-references, policies, and actors. As regards cross-references, internal ones can be consistently identified by GaiusT using the small set of patterns shown in Fig. 7. External ones were not included in the annotation schema, as there is no reliable indication of the formats they may take.

Fig. 7

The grammar for cross-reference objects

The introductory section 160.103 ("Definitions") of HIPAA, where terms are strictly defined and reference synonyms are assigned, has been used; a fragment is shown in Fig. 8. The reference word is used consistently over the entire document, denoting all related terms. For instance, the HIPAA text uses terms such as "policy", "business associate", "act", "covered entity", and others. In the analyzed sections of HIPAA, other terms that could be generalized into a common, abstract type were found, including event, date, and information. Thus, on the basis of the definitions section, a list of syntactic indicators for all the basic concepts has been derived. The final list of basic concepts includes actor, policy, event, date, and information. GaiusT has been applied to the parts numbered 160 "General Administrative Requirements" and 164 "Security and Privacy" of the HIPAA Privacy Rule, containing a total of 33,788 words [56]. Automatic annotation of this text by GaiusT required only 3.07 s on an Intel Pentium 4 personal computer with a 2.60 GHz processor and 512 MB of RAM running the Windows XP operating system. As a result, about 1,900 basic entities and 140 rights and obligations were identified. The manual evaluation of the quality of the automatic results was carried out for the following sections contained in the analyzed parts: §164.520: Notice of privacy practices for protected health information; §164.522: Rights to request privacy protection for protected health information; §164.524: Access of individuals to protected health information; and §164.526: Amendment of protected health information. These sections were chosen because the results acquired by GaiusT can be compared to the results of the manual analysis reported in [13]. The manual analysis by an expert requirements engineer of the chosen fragments, containing a total of 5,978 words or 17.8 % of the analyzed parts, took an average of 2.5 h per section. Figure 9 illustrates an excerpt of annotated text from §164.526(a)(2) of HIPAA resulting from GaiusT's application. Each embedded XML annotation is a candidate "object of concern". For example, in Fig. 9, the "Index" annotation denotes the sub-paragraph index "(i)" and the Actor annotation denotes the "covered entity"; the latter appears twice in this excerpt.

Fig. 8

Some of the indicators for basic entities according to the information in the definitions section

Fig. 9

A fragment of the result generated by GaiusT for HIPAA Sect.164.526

7.1.2 Analysis of results

Following the evaluation framework presented in [31], in a first stage the tool and four human expert annotators marked up sections 164.524 and 164.526. The annotated text of this fragment of HIPAA consisted of 77 sentences. Our human annotators were experienced in both computer science and law. The annotations performed by the experts were evaluated pairwise: each annotation was matched against all others and the results of the comparisons were aligned adopting the following criteria: (1) first, annotated concepts were considered correct when they overlapped on the most meaningful words, so that minor divergences in text fragment boundaries were ignored; (2) then, concepts identified by GaiusT and missed by a human annotator were manually revisited and validated. We used these annotations to calibrate the system annotations, as we consider them an upper bound on what automatic annotation can do. Experts reported an overall recall of 91.6 % and a precision of 82.3 %. Adopting the expert annotations as a gold standard [6, 25] and comparing them with the tool annotation, the tool shows good performance, with an F-measure of 84.8 % and a good accuracy of 77.9 % (Table 7).

Table 7 Evaluation rates of experts and GaiusT annotating HIPAA

As for the annotation of basic concepts, calculation of evaluation metrics was not possible, as the human annotator was asked to identify only high-level deontic concepts. By manually reviewing the annotated results, we found that the tool correctly identified all instances of the concepts actor, policy, event, information, and date with a precision of 91 %. GaiusT also correctly recognized section and subsection boundaries, titles, and annotated paragraph indices with a precision of 96 %. These structural annotations were used to disambiguate and manage cross-references.

On the basis of the experimental study, we can conclude that GaiusT was able to identify legal requirements with high precision (from 93 to 100 %). Good recall rates were also demonstrated for most concepts (70–100 %), apart from anti-obligation (33 %), where the tool’s performance must be improved.

Overall, the GaiusT tool largely reduces human effort in identifying legal requirements by doing some of the work and by giving hints to human annotators on how to proceed. For example, the tool could identify constraints and subject candidates that the human annotator could then link to identified concept instances.

7.1.3 Productivity evaluation

The goal of this application was to test the usefulness of the tool for non-experts in regulatory text who may have to analyze such documents to generate requirements specifications for a software system. In addition to the expert evaluation, an empirical validation of the proposed tool with inexperienced requirements engineers has been carried out. The issue here is that requirements engineers are not always supported by lawyers when designing new software. For this purpose, sections 164.524 and 164.526 of HIPAA were selected for annotation by subjects who do not work with rules and regulations directly. These parts were selected so that they have an approximately equal number of statements, comprising 1,205 and 1,057 words, respectively.

The experiment involved eighteen master-level computer science students taking a software engineering course. Participants in this experiment were not familiar with legal terms and needed some training in semantically annotating legal text. A detailed explanation of the annotation process and examples of the concepts to be identified were made available. Moreover, the participants were provided with a user-friendly interface to facilitate the insertion and modification of tags in input documents.

The setup of the experiment consisted in dividing the participants into two groups and asking each group to annotate sections 164.524 and 164.526 (containing a total of 2,262 words) in two different time frames. In the first time frame, group one worked on section 164.524 without annotations, while group two worked on section 164.526 with annotations previously generated by GaiusT. In the second time frame, the sections were switched.

The annotators were asked to incrementally identify rule statements and their components in each of the two texts, inserting markups on the original page for the unannotated text and modifying GaiusT's output in the annotated text. We recorded the time spent on the annotation of both texts by each novice annotator, as well as the number of different entities identified during the annotation process. Table 8 summarizes the average time spent by participants to fully annotate the two sections and the average time spent to fix sections already annotated by GaiusT. The average time saved by using the output of the tool was 33.52 %, a notable saving considering that the sections are only a small fragment of the entire law. The entire HIPAA is 147 K words and, assuming that annotation time depends linearly on the size of the law, it would require almost three days for complete annotation.

Table 8 Average time spent to annotate HIPAA sections by novice annotators

Concerning the performance of novices relative to the gold standard, novices who performed a full annotation without assistance achieved a recall of 52 % and an accuracy of 50 %. By comparison, novices who simply improved the automated GaiusT annotations reported a much higher rate of precision and recall (95 and 90 %, respectively) with an accuracy of 90 % (see Table 9).

Table 9 Comparing novice results versus novice assisted by the tool

This confirms that the tool is a great help for novices both in terms of productivity and in terms of quality of the results. Annotators generally expressed satisfaction with the GaiusT tool and the annotations it provides, finding them easy to read and helpful in interpreting a legal document. However, they also pointed to some areas where the tool can be improved.

7.2 The Italian accessibility law

7.2.1 Case study setup

The Italian accessibility law (the Stanca law) contains a set of provisions to promote access by disabled people to information technology instruments [26]. This law defines a set of guidelines covering the different accessibility levels and technical requirements, the technical methodologies used to verify the accessibility of Web sites, and the assisted evaluation programs that can be used for this purpose. The Stanca law was chosen because it is available in both English and Italian. The availability of both language versions allows us to perform a finer-grained analysis of the different style and syntax used in the two documents.

To identify the hierarchical structure of the Italian law, we used the Italian drafting guidelines for lawmakers, Guida alla redazione dei testi normativi [47]. The guidelines provide suggestions about style, structure, and the type of verbs to be used to express legal concepts. The structure of Italian law provides for 10 levels instead of the 12 of American law, and the naming convention used for the text units is different; thus, the hierarchy grammar that had been developed for HIPAA had to be adapted accordingly. The analysis of the text units that characterize Italian law is reported in Table 10, and a simplified illustration of how such units can be recognized is sketched after the table.

Table 10 The document structure of Italian Law
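As an illustration only, the following Python sketch shows how a few structural units of an Italian legal text might be recognized with surface patterns. The actual GaiusT hierarchy grammar is written in TXL and covers all the levels listed in Table 10, so the unit names and regular expressions here are simplifying assumptions.

import re

# Illustrative recognition of a few structural units of an Italian legal text.
UNIT_PATTERNS = [
    ("titolo",   re.compile(r"^\s*Titolo\s+[IVXLC]+", re.IGNORECASE)),
    ("capo",     re.compile(r"^\s*Capo\s+[IVXLC]+", re.IGNORECASE)),
    ("articolo", re.compile(r"^\s*Art(?:\.|icolo)\s+\d+", re.IGNORECASE)),
    ("comma",    re.compile(r"^\s*\d+\.\s")),      # numbered paragraph (comma)
    ("lettera",  re.compile(r"^\s*[a-z]\)\s")),    # lettered item (lettera)
]

def classify_line(line):
    # Return the structural unit a line opens, or None for plain text.
    for name, pattern in UNIT_PATTERNS:
        if pattern.match(line):
            return name
    return None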

The grammar rules supporting the interpretation of cross-references in an Italian law were derived from these structural elements. The TXL programs were also extended to deal with the Italian character set, including accented vowels and the apostrophes used in articles and prepositions. The annotation schema for the SA of Italian legal documents was designed keeping in mind that Italian law uses verbs in the present tense to express deontic concepts, and that the use of modal verbs is left to the lawmaker [47]. The syntactic indicators used for identifying deontic concepts in American law are partially valid but needed to be adapted. Two techniques were used to address this issue: (1) translation of the modal verbs used for American law; (2) manual analysis of the text to identify candidate verbs expressing the concepts of right and obligation and their negative counterparts (anti-right and anti-obligation). For the second technique, the POS Manager module was used to retrieve all present-tense verbs of the Stanca law, and the candidate verbs for each concept were then annotated manually. For the identification of actor instances in the Italian law, we adopted two solutions: (1) some instances were mined manually from the definition section “Definizioni”; (2) to catch the actors not mentioned in the definitions, we exploited the results provided by the POS Manager, i.e., all proper nouns were marked as actors. For resource instances, we followed only the first solution, reusing the terms stated in the definition section.
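A minimal sketch of the two POS-based heuristics just described is given below, assuming a simple (token, tag) representation of the POS Manager output. The tag names and the example tokens are illustrative assumptions, not the module’s actual interface.

# Candidate extraction from part-of-speech output, mirroring the two heuristics above:
# proper nouns as actor candidates and present-tense verbs as deontic-verb candidates.
def candidate_actors(tagged_tokens):
    return {tok for tok, tag in tagged_tokens if tag == "PROPN"}

def candidate_deontic_verbs(tagged_tokens):
    return {tok for tok, tag in tagged_tokens if tag == "V_PRES"}

tagged = [("utenti", "NOUN"), ("presentano", "V_PRES"), ("CNIPA", "PROPN")]
print(candidate_actors(tagged))         # {'CNIPA'}
print(candidate_deontic_verbs(tagged))  # {'presentano'}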

To identify action verbs, we adopted the following heuristic: annotate all verbs in the present active, present passive, and impersonal passive forms. Verbs in these forms also express obligations, in accordance with the guidelines for writing Italian legal documents, so the corresponding heuristic rule was adapted for identifying obligations. Rights, obligations, and their negative counterparts are more difficult to identify in Italian. Unlike English, which mainly uses modal verbs to state prescriptions, as in “the users must present their request”, Italian regulations use the present active (“gli utenti presentano la domanda”), the present passive (“la domanda è presentata”), and the impersonal passive (“la domanda si presenta”) to describe an obligation. The choice of style is highly author-dependent, and each of these styles is equally recommended by the guidelines [47]. Therefore, in identifying these concepts, our strategy included (1) translating the indicators identified by Breaux and Antón (see the list of equivalent indicators in Table 11) and (2) annotating as instances of obligation those sentences containing verbs in the tenses that intrinsically express obligations (a simplified sketch of this tense-based heuristic is given after Table 11). Figure 10 shows the list of syntactic indicators identified for the deontic concepts, together with a subset of the grammar for syntactic indicators integrated as a domain-dependent component of Cerno. The annotation by GaiusT of the Stanca law, containing a total of 6,185 words on 280 lines, takes only 61 milliseconds on an Intel Pentium 4 personal computer with a 3 GHz processor and 2 GB of memory running Windows Server 2003. A fragment of the annotated Stanca document is shown in Fig. 11.

Fig. 10 A fragment of entities used for identifying categories

Fig. 11 A fragment of the annotated Stanca law

Table 11 Syntactic indicators for Italian laws
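The following rough Python sketch illustrates the tense-based heuristic for Italian obligations described above. The surface patterns are simplified assumptions covering only the three example styles; the actual indicators are encoded in Cerno’s domain-dependent grammar (Fig. 10, Table 11).

import re

# Simplified surface patterns for the three Italian obligation styles exemplified above:
# present active ("presentano"), present passive ("è presentata"), impersonal ("si presenta").
OBLIGATION_PATTERNS = [
    re.compile(r"\b\w+ano\b"),           # present active, third-person plural
    re.compile(r"\bè\s+\w+at[aoei]\b"),  # present passive participle
    re.compile(r"\bsi\s+\w+a\b"),        # impersonal "si" construction
]

def looks_like_obligation(sentence):
    # Flag a sentence as a candidate obligation if any pattern matches.
    return any(p.search(sentence) for p in OBLIGATION_PATTERNS)

for s in ["gli utenti presentano la domanda",
          "la domanda è presentata",
          "la domanda si presenta"]:
    print(s, "->", looks_like_obligation(s))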

7.2.2 Analysis of results

To evaluate the annotation results, we designed an empirical study involving annotation of the Stanca law and comparing the performance of the tool with manual identification of instances of rights, obligations, and associated constraints. Table 12 presents the results of the evaluation. As in the HIPAA case study, the comparison was made against expert human annotators: their annotations were used as the gold standard against which GaiusT’s performance was calibrated. We have not yet measured the productivity of GaiusT for the Italian language, as the focus of the Italian experiment was to evaluate the quality, rather than the productivity, of the tool when changing language. The tool demonstrates good precision for all of the concepts, ranging from a low of 67 % for actor to 100 % for right, anti-obligation, and constraint. Recall was not as high as precision, with the lowest rates being 33 % for right and 35 % for constraint. This is caused by the lack of reliable syntactic indicators for these concepts in Italian legal language. Instances of the anti-right concept were not identified in the document by either the human annotators or the tool.

Table 12 Evaluation rates for legal concepts found in the accessibility law

Our plan for future work includes further investigation of how the identified drawbacks can be resolved. Particular difficulties with Italian text emerged for both the human annotators and the tool. For instance, in Italian the subject is frequently omitted, as in passive verb forms, or hidden behind impersonal expressions, making it difficult to classify the regulatory fragment correctly and to find the holder of a right or obligation. Surprisingly, the official English translation of the Stanca law in most cases states this information explicitly. Consider the verb phrases used to state the obligation in the Italian and English versions of the same fragment, below:

  • Italian statement: “Nelle procedure svolte dai soggetti di cui all’articolo 3, comma 1, per l’acquisto di beni e per la fornitura di servizi informatici, i requisiti di accessibilità stabiliti con il decreto di cui all’articolo 11 costituiscono motivo di preferenza a parità di ogni altra condizione nella valutazione dell’offerta tecnica, tenuto conto della destinazione del bene o del servizio”.

  • English translation: “The subjects mentioned in article 3, when carrying out procedures to buy goods and to deliver services, are obliged, in the event that they are adjudicating bidders which all have submitted similar offers, to give preference to the bidder which offers the best compliance with the accessibility requirements provided for by the decree mentioned in article 11”.

As the fragment shows, the translator of the English version made explicit information that is left implicit in the Italian text. Overall, the annotation results suggest that the GaiusT process for regulation analysis is applicable to documents written in different languages, here English and Italian. The effort required to adapt the framework to the new language was relatively small. The experiment also revealed several language differences that we were able to quantify. In future work, we plan to conduct a more extensive analysis that may separate language effects from legislator effects.

In general, the results of annotation with GaiusT provide a useful starting point for software engineers looking for requirements contained in regulations, relieving them from having to start from scratch.

8 Related work

Automatic analysis of legal documents is a long-standing research field in artificial intelligence [1, 7], but due to the complexity of such documents, the goal has still not been achieved [38]. In particular, there is an intense debate on the limits of representing law in deontic logic [24, 45]. The idea of using contextual patterns or keywords to identify relevant information in prescriptive documents is not new, and a number of methodologies based on similar techniques have been developed. However, tools that realize and synthesize these methods under a single framework are lacking. In [14], the authors propose an algorithm for the detection and classification of non-functional requirements (NFRs). In a pilot experiment, indicator terms were mined from catalogs of operationalization methods for security and performance softgoal interdependency graphs. These indicators were then used to identify NFRs in fifteen requirements specifications, with satisfactory recall and precision for the security and performance keywords. Antón proposed the goal-based requirements acquisition methodology (GBRAM) to manually extract goals from natural language documents [2]. GBRAM has been applied to financial and health care privacy policies [4]. Additional analysis of the extracted goals led to new semantics for modeling goals [8, 9], which distinguish rights and obligations, and to new heuristics for extracting these artefacts from regulations [10, 13]. Manually marked sections of HIPAA [56] are used to automatically fill in frameworks defined according to their conceptual model of laws [11].

Ghanavati et al. [21] report a large systematic review of goal-oriented approaches to modeling legal documents, reviewing 88 papers, comparing results, and summarizing the contributions and research questions in this area. Another survey of research on analyzing legal documents for compliance requirements can be found in [41]. Among other issues, the authors underline the need to develop a comprehensive system to support requirements engineers in the regulatory compliance task. A set of nine high-level functions has been identified as relevant for such a system; annotation of legal statements is the one addressed by GaiusT [40]. Other requirements are partially fulfilled by GaiusT, among them “Semiautomated Navigation and Searching”, including cross-reference identification (see [27]) and prioritization of regulations and exceptions. To facilitate reasoning with regulations, Antoniou et al. [5] introduce a regulation analysis method based on defeasible logic [39], in which the facts manually found in the regulation document are represented as defeasible rules.

Analysis of requirements texts has been automated in several ways. LIDA [42] is an iterative model development method based on linguistic techniques. It uses part-of-speech tagging to derive instances of UML abstractions. First, the tool proposes a list of nouns as class candidates to the requirements engineer, who removes the irrelevant ones; then a list of adjectives is given for choosing relevant attributes; and finally, a list of verbs is provided for identifying relevant methods and roles. Along similar lines, the NL-OOPS tool was devised to support requirements analysis by generating object-oriented models from natural language requirements documents [36]. As its core technology, the tool exploits the LOLITA system for domain-independent natural language processing [19]. In [59], the authors address the problem of correctly identifying requirements, providing a detailed analysis of NASA requirements documents and recommendations on writing clear specifications. As part of this work, the authors found that good requirements specifications use imperative verbs, such as shall, must, and must not, in statements that explicitly point to requirements, constraints, or capabilities. They also introduced the notion of continuances, i.e., phrases that introduce the specification of requirements at a lower level; such phrases usually appear after words like below, as follows, and listed. This analysis relates to our application to some extent: a set of heuristic rules to detect requirements was derived from regularities in the language of regulatory documents, and the notion of continuances was used to identify complete lists. GaiusT is able to identify such lists correctly and relate them to the upper-level concept to which they belong. In aspect-oriented requirements engineering, several methods approach the task of identifying and separating concerns using text analysis methods. For instance, the EA-Miner [49] tool supports separation of aspectual Footnote 24 and non-aspectual concerns and their relationships by applying natural language processing techniques to requirements documents.

One of the early methods intended to facilitate the transition from informal to formal requirements specifications is the Requirements Apprentice [48]. To avoid dealing with the full complexity of natural language, this tool needs a mediator, a skilled requirements engineer who enters the information obtained from the end user in a command-line style using a high-level language. The system then generates a formal internal representation of a requirement from a set of statements and uses a variety of techniques to identify ambiguities and contradictions. To address the complexity of full natural language in general, and of legal texts in particular, it is often necessary to rewrite requirements in structured natural language, which reduces the advantages of applying automatic tools. Not rarely, this manual preprocessing comes together with vocabulary restrictions that limit the scope of the system to a specific knowledge domain; an example of such approaches is [57]. Many tools have been developed to support the visualization and storage of manually created annotations. At different levels of complexity, such functionality is offered by tools like Annozilla, Footnote 25 SHOE, Footnote 26 Yawas, Footnote 27 and SMORE. Footnote 28 However, none of them supports automatic annotation, as GaiusT’s annotation module does; moreover, most of them are limited in the input documents they can deal with (for example, accepting only HTML documents) or in usability (for example, requiring knowledge of a specific markup language). In addition, GaiusT allows the annotation process to be tracked in several ways, satisfying another requirement for an integrated platform to support all the activities related to compliance with HIPAA and other data privacy and security requirements [12, 34].

9 Conclusions and future work

We have proposed the GaiusT tool to facilitate requirements elicitation in the domain of legal documents. The tool supports successive analysis phases, each producing a new level of annotation. The main contribution of this work rests on the integration of a number of techniques from software engineering, the semantic Web and natural language processing to facilitate the elicitation of requirements from legal documents. Our contribution has been evaluated through a series of experiments involving human subjects, with positive results.

There are many technical aspects of the work reported in this paper that need to be improved in future research. In particular, the population of the annotation schema with the Annotation Schema Generator can be improved by using clustering algorithms [43, 53] or independent component analysis [22].

Concerning the semantic annotation of legal documents, the syntactic indicators and patterns used for the identification of types of constraints should be updated through a revision of the annotation schema and indicators, and the set of patterns used to identify the subjects of conjunctions and disjunctions (and, or) needs to be completed; this task is problematic even for full-fledged linguistic analysis tools. Experimentation with other kinds of regulatory documents is also planned. In particular, contract documents will be considered in order to verify the generality of the proposed method and improve its effectiveness.

Finally, the multilingual aspect of SA needs to be further investigated, by considering other languages and other types of multilingual documents, such as Web pages and news stories.