1 Introduction

Globalization has amplified the problem of internationalizing and localizing software systems to ensure that they comply with both international and local regulations. Different international policies, laws, and regulations, written in a wide range of languages even within one jurisdictional entity—such as the European Union—together with privacy and security requirements, pose serious challenges for software system developers worldwide. In particular, IT professionals have to face the so-called regulation compliance problem, whereby companies and developers are required to ensure that their software systems comply with relevant regulations, either by design or through re-engineering [3].

Legal documents are written in a jargon colloquially referred to as legalese, a specialized language that makes the elicitation of requirements from regulations a difficult task for developers who lack proper legal training. Misinterpretation—for example, by overlooking an exception in a regulatory rule—can have legal consequences, as it may lead to incorrect assignment of rights or obligations to stakeholders. Elicitation of requirements is generally achieved in a manual and often haphazard way. Breaux et al. [13] propose a systematic manual process for extracting rights and obligations from legal text. Our work seeks to augment this manual process with tool support, thereby improving productivity, as well as quality and consistency of the output.

To accomplish this complex task, we use semantic annotation (SA) techniques, where legal text is annotated on the basis of concepts that have been defined in terms of a conceptual schema (aka ontology). Such techniques are used widely in the field of the semantic Web to generate semistructured data amenable to automated processing from unstructured Web data. Such techniques have also been fruitfully applied in requirements engineering to support partial automation of specific steps in the elicitation process [29].

This paper presents the details of an SA-based tool for requirements extraction from legal documents. We discuss related issues and propose an engineering approach founded on an existing framework for SA called Cerno [32]. This framework has been adapted and extended to deal with many of the complexities of legal documents, and the result is a new tool named GaiusT. The main objective of this paper is to present GaiusT along with its evaluation on regulatory documents written in two different languages: the U.S. Health Insurance Portability and Accountability Act [56] (HIPAA, in English) and the Italian accessibility law for information technology instruments (known as the Stanca Act, in Italian) [26]. The novelty of our contributions rests in identifying a combination of techniques adopted from research on semantic annotations, legal document processing, and software reverse engineering, which together demonstrably improve productivity in extracting requirements from legal text.

These contributions expand on an earlier paper [30], where we first outlined our preliminary research plan, along with a first application of GaiusT. In this work, we focus on the architecture and implementation of GaiusT and expand on the experimental results reported in [30] based on two case studies.

The rest of the paper is organized as follows. Section 2 describes the challenges of analyzing prescriptive documents. In Sect. 3, we present the research baseline of this work, while Sect. 4 highlights requirements for analyzing legal documents. In Sect. 5, we propose a design for GaiusT to address these requirements and elaborate on its architecture and application from a user-centered perspective in Sect. 6. In Sect. 7, we evaluate our proposal by applying the tool to two case studies and report the results of experiments comparing the tool’s analysis of these laws with that of human experts. Finally, we provide an overview and comparison with other methods and projects in the area in Sect. 8 and draw conclusions in Sect. 9.

2 Complexities of prescriptive documents

Analysis of legal documents poses a number of challenges different from those found in other documents, largely because of the special traditions and conventions of law. In this section, we report on some of the most distinctive characteristics of legal documents, focusing in particular on laws and regulations. The handling of these characteristics forms a part of the requirements for the GaiusT system.

The most distinctive characteristic of legal documents is that they are prescriptive [37] and are written accordingly. This means that they express the permission of desirable and/or the prohibition of undesirable behaviors and situations. The norms that they convey are requirements that control behaviors or possible states of affairs in the world. Consequently, each norm has a core part called the norm-kernel, which carries the essential information conveyed by the norm. This information includes a legal modality, a subject, an object, conditions of applicability, and exceptions. The legal modality determines the function of a norm, which can be either an obligation or a right [60]. A right is an action that a stakeholder is conditionally permitted to perform, and an obligation is an action that a stakeholder is conditionally required to perform. In contrast, an anti-right/anti-obligation states that a right/obligation does not exist. The subject of a norm is the person or institution to whom the norm is directed. The object of a norm is the act regulated by the norm, considering the modality and other properties of the action. The conditions of applicability establish the circumstances under which a norm is applicable. These are clauses that suspend, or in some way modify, the principal right or obligation. In addition, each law contains general rules that prescribe a norm for a large group of individuals, actions, or objects, followed by a rule that excludes specific members or subsets of these groups. Often, conditions of applicability are modified by exceptions, both for subjects and for objects [44]. Exceptions can be introduced explicitly through words such as except or contrary to, or implicitly, when they have to be inferred. For example, in "A medical savings account is exempt from taxation under this subtitle unless such account has ceased to be a medical savings account," the use of the term unless indicates an exception. One or several exceptions related to a right can be formulated directly in the section mentioning this right, indirectly in another part of the law, or even in another law via cross-references. To support the markup of exceptions, we have identified a set of terms that are commonly used to express exceptions (see Table 3).
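
To make the norm-kernel concrete, the following sketch models it as a simple data structure; the type and member names, and the example norm, are our own illustration and are not part of GaiusT.

    using System;
    using System.Collections.Generic;

    enum LegalModality { Right, Obligation, AntiRight, AntiObligation }

    class NormKernel
    {
        public LegalModality Modality;                        // function of the norm
        public string Subject;                                // person or institution the norm is directed to
        public string Object;                                 // the act regulated by the norm
        public List<string> Conditions = new List<string>();  // conditions of applicability
        public List<string> Exceptions = new List<string>();  // explicit or implicit exceptions
    }

    class Example
    {
        static void Main()
        {
            // Hypothetical norm, loosely following the examples discussed in the text.
            var norm = new NormKernel
            {
                Modality = LegalModality.Obligation,
                Subject = "organization",
                Object = "send a request to the user"
            };
            norm.Conditions.Add("when the user files a complaint");
            norm.Exceptions.Add("unless the complaint has been withdrawn");
            Console.WriteLine(norm.Modality + ": " + norm.Subject + " / " + norm.Object);
        }
    }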

Table 1 Document structure classification

Cross-references create links between parts of the same document (internal) or with parts of other documents (external) and are essential to the interpretation of laws. To unambiguously identify the parts of a document and to allow for explicit references to related parts, legal documents are organized using a deep hierarchical structure.

To facilitate the understanding of a law, every legal document contains a declarative part—the definition of legal terms—that defines legal concepts used in the document. This constitutes the glossary section of a legal document and plays a fundamental role in its interpretation. The lack of such a declarative part can result in ambiguity and misinterpretations.

All of these features pose serious challenges for semantic analysis. In fact, although the vocabulary and grammar used in such documents are often restricted, inconsistent use of terms is pervasive, mainly due to the length and complexity of the lifetime of legal documents. Moreover, it is often the case that different parts of a legal document are written by different people, either because of the sheer size of the document, or as a result of subsequent modifications.

Another problem dimension arises from the multilinguality of laws applicable in a given jurisdiction (such as the European Union). Analysis of regulations written in different languages entails handling the different structural, syntactic, and semantic issues of each particular language.

Furthermore, the requirements obtained from several regulations must be integrated, aligned, and ranked according to the priority of the regulations they were extracted from.

Finally, inaccuracy in the analysis of legal documents must be minimal because, in the eyes of the law, there is no such thing as "near-compliance". For example, by missing an exception or a condition, one could grant or deny an important right or obligation to the wrong party.

3 Research baseline: Cerno

In light of the complexity of legal documents outlined above, we propose a tool-assisted process for their annotation, which includes the following main activities: (1) structural analysis, to handle the identification of structural elements of a document, basic text constructs, and cross-references; (2) semantic analysis, to identify normative elements of a document, such as rights, obligations, anti-rights, and anti-obligations.

In order to realize this process, we have adopted and extended the semantic annotation framework of Cerno [31] and used it as a foundation for our work. Cerno is based on a lightweight text pattern-matching approach that takes advantage of the structural transformation system TXL [16]. The types of problems that Cerno can address are similar to those faced in the analysis of legal documents, such as the need to understand the document’s structure and the need to identify concept instances present in the text. Moreover, experimentation with Cerno has shown it to be readily adaptable to different types of documents [31, 33, 63].

In Cerno, textual annotations for concepts of interest are inferred based on a conceptual model of the application domain. Annotation rules can be automatically or manually constructed from this model. Such an approach factors out domain-dependent knowledge, thus allowing for the development of a wide range of semantic annotation tools to support different domains, document types, and annotation tasks.

Cerno’s annotation process has three phases: (1) Parse, where the structure of an input document is processed, (2) Markup, where annotations are generated based on annotation rules associated with the concepts of the conceptual model, and (3) Mapping, which results in a relational database containing instances of concepts, as identified by annotations.

The goal of the Parse step is to break down an input document into its structural constituents and identify interesting word-equivalent objects, such as e-mail addresses, phone numbers, and other basic units. Similarly to Cerno’s Parse step, structural analysis of legal texts identifies both structural elements of the document and word-equivalent objects. These are basic legal concepts, such as legal actors, roles, procedures, and other entities normally defined in the Definitions section of the law.

Next, the Markup phase conducts semantic analysis. An annotation schema specifies the rules for identifying domain concepts by listing concept identifiers with their syntactic indicators. These indicators can be single words and phrases, or names of previously parsed basic entities. The annotation schema can be populated with such syntactic indicators either through (semi-) automatic techniques that use keywords of the conceptual model—if available—or hand-crafted from a set of examples.

Lastly, the Mapping phase of Cerno is optional and depends on how one proposes to use generated annotations. In the analysis of legal documents, we assume that a human analyst may want to correct or enrich tool-generated results. In this setting, annotations must remain embedded in the documents so that the user can easily revise them using textual context. After this revision, the annotations can be indexed and stored in a database.

Although the basic steps in Cerno clearly apply well to our problem, the annotation process we envision for legal documents required a significant adaptation effort at the design and implementation level in order to make the framework not only applicable in the legal domain, but also usable by requirements engineers and possibly legal consultants. To this end, we designed and developed the GaiusT system as an extension and specialization of Cerno that includes new components specific to the annotation of legal documents in different languages. To enhance the usability of the system, a GUI was added to better support human interaction.

The main extensions to Cerno are described next. Firstly, although Cerno was well adapted to the sequence-of-paragraphs form of English text, such as newspaper articles, Web pages, and advertisements, it could not handle the specialized semiformal structure of legal texts. We therefore extended the TXL-based Cerno parser so that it can process the nested structural form of legal documents, including numbering and nesting of sections, paragraphs, and clauses.

The labels of legal document sections are also context-dependent. For example, "clause (b) above" implicitly refers to the one in the current paragraph and section. To resolve this problem, the parser was enhanced with an additional TXL source transformation step that contextualizes section, paragraph, and clause names to make them globally unique. For example, the label of clause "(b)" in paragraph 2 of section 5 becomes "5.2(b)". A subsequent reference resolution transformation was added to use these globally unique labels to link local and global references to the clauses, paragraphs, and sections they refer to. For example, the reference "except as in (b) above" in paragraph 2 of section 5 is resolved to "except as in 5.2(b) above". The set of Cerno word-equivalent patterns, which recognize items such as phone numbers, email addresses, Web URLs, and the like, was extended to recognize legal items such as regulation and law names (e.g., "42 U.S.C.—405(a) (2006)").
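
GaiusT implements this contextualization as TXL source transformations; the following C# fragment is only a rough sketch of the underlying idea, with hypothetical method names and a deliberately simplified reference pattern.

    using System;
    using System.Text.RegularExpressions;

    static class ReferenceResolution
    {
        // Make a local clause label globally unique by prefixing section and paragraph
        // numbers, e.g. clause "(b)" in paragraph 2 of section 5 becomes "5.2(b)".
        static string Contextualize(string clauseLabel, int section, int paragraph) =>
            $"{section}.{paragraph}{clauseLabel}";

        // Resolve local references such as "except as in (b) above" within the current
        // section and paragraph, e.g. to "except as in 5.2(b) above".
        static string ResolveLocalReferences(string text, int section, int paragraph) =>
            Regex.Replace(text, @"\(([a-z])\) above",
                m => $"{section}.{paragraph}({m.Groups[1].Value}) above");

        static void Main()
        {
            Console.WriteLine(Contextualize("(b)", 5, 2));                          // 5.2(b)
            Console.WriteLine(ResolveLocalReferences("except as in (b) above", 5, 2));
        }
    }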

The set of concept patterns was also extended and enhanced to include terms and language for legal obligations, rights, and other concepts, and patterns were added to recognize temporal clauses such as “within a month”.

Finally, Cerno’s word pattern specification notation was enhanced to process Italian regular and irregular verb forms in order to enable processing of Italian legal documents.

In the following section, we describe the design of the GaiusT tool, considering both the structural and semantic analysis steps of the extended Cerno process.

4 Semantic annotation of legal documents

4.1 Structural analysis

A fundamental feature of legal documents is their deep hierarchical structure. Identifying document structure is important to semantic annotation for two main reasons: (1) recognition of potential annotation units, thus allowing for different annotation granularity, and (2) flattening the document’s hierarchical representation, consisting of sections, subsections, etc. Annotation granularity can be defined as the fragment of text that refers to a concept instance, or an amount of context that must be annotated together with the concept. Different granularities can be used to address different objectives. In legal documents, for instance, single-word granularity is convenient for annotating basic concepts, such as stakeholders or cross-references.

From a semantic viewpoint, the semiformal structure of legal documents supports the identification of complex deontic concepts, by inferring relationships between identified concepts from their relative positions in the structure. The identification of structural elements in the process of semantic annotation also improves the accuracy of the results, e.g., by treating a list as a single object for a given statement.

In general, any document has two levels of structure, a rhetorical structure and an abstract structure, both of which contribute to meaning [46]. According to rhetorical structure theory [35], a document can be represented as a hierarchy where some text units are salient, while others constitute support. Abstract document structure, on the other hand, relies on graphical visualization of the text and includes syntactic elements used for text representation [20]. Both rhetorical and abstract document structures are important for the design of a structure analysis component.

Due to the variety of existing document formats and contexts, it is impossible to define a common structure valid for all documents, even though several efforts have been devoted to the problem [54]. There are several possible strategies to automatically capture the structure of documents [23, 61]. The most widely used approach is based on feature-based boundary detection of text fragments. This approach exploits patterns, such as punctuation marks, to identify text units. These patterns are typically composed of regular expressions and rely on features such as non-printable characters (e.g., carriage returns for new-line identification), list symbols or indentation for text blocks, and the graphical position of certain elements such as titles. To capture the deep hierarchical structure of legal documents, we have defined a classification of document text units, using a hierarchy where a text unit at a given level is composed of one or more units at the next level down (Table 1). This classification facilitates the management of a large class of heterogeneously structured documents, with the only assumption that they are written in languages of the European family. Moreover, this classification can be specialized according to the type of application and granularity level needed for a particular semantic analysis.
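
The following fragment sketches what such feature-based boundary-detection patterns might look like; GaiusT itself expresses them as TXL grammar rules, so these C# regular expressions are purely illustrative and cover only a few unit types.

    using System;
    using System.Text.RegularExpressions;

    // Illustrative boundary-detection patterns in the spirit of the feature-based
    // approach described above; not GaiusT's actual grammar.
    static class TextUnitPatterns
    {
        // A section header such as "Sec. 164.526" or "§ 164.526" at the start of a line.
        public static readonly Regex Section =
            new Regex(@"^(Sec\.|§)\s*\d+(\.\d+)*", RegexOptions.Multiline);

        // A numbered paragraph such as "(2)" at the start of a line.
        public static readonly Regex Paragraph =
            new Regex(@"^\(\d+\)", RegexOptions.Multiline);

        // A lettered clause such as "(b)" or a roman-numeral sub-clause such as "(ii)".
        public static readonly Regex Clause =
            new Regex(@"^\(([a-z]|[ivx]+)\)", RegexOptions.Multiline);

        // Sentence boundaries approximated by terminal punctuation followed by whitespace.
        public static readonly Regex Sentence = new Regex(@"[.;:]\s+");

        static void Main()
        {
            Console.WriteLine(Section.IsMatch("Sec. 164.526 Amendment of protected health information."));  // True
            Console.WriteLine(Clause.IsMatch("(b) Implementation specifications."));                        // True
        }
    }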

In GaiusT, we have defined grammatical patterns, using TXL’s ambiguous context-free grammar notation, to capture text units at different levels of this document hierarchy.

To capture the complex hierarchical structure and cross-references of legal documents, we adopted "The Guide to Legal Writing" [55], which presents styles for the proper writing and structuring of legal documents. Based on the hierarchy reported in the guide, we specialized our grammatical patterns (see, for example, Tables 5, 10). For cross-references, since loops in cross-reference chains are intractable (see, for example, HIPAA paragraph 164.528(a)(2)(i)), our approach consists of recognizing and annotating cross-references, so that the annotator can take them into account and decide on a case-by-case basis. Special parsing rules catch both external and internal cross-references, and a TXL transformation renames all local labels and references to be globally unique. For example, the transformation changes the list element label "(b)" of paragraph 2 of section 5 to "5.2(b)".

4.2 Semantic analysis of legal documents

The semantic analysis of a document requires the development of a conceptual model that represents domain information, which for legal documents means deontic concepts. To design the conceptual model, we used an iterative process. An initial list of important concepts in the domain of legal texts was derived from "The Guide to Legal Writing" [55] and from norm-kernel concepts [60]. We then applied part of the semantic parameterization methodology developed by Breaux and Antón [9], which defines a set of guidelines (textual patterns) for extracting legal requirements from regulations. Our model was then discussed with legal experts and finalized based on their recommendations. The experts were colleagues from the Faculty of Law of the University of Trento, and they helped with the analysis and refinement of legal concepts/terms and their interrelations.

The resulting model, represented in Fig. 1, includes as classes only the core deontic concepts. It leaves out others, such as penalties, that, according to legal experts, can be considered properties of the core deontic concepts. The conceptual model was used as the basis for the definition of an annotation schema, as discussed in Sect. 6. Concepts of the conceptual model are used to populate the annotation schema, and for each concept, a set of syntactic indicators and rules or patterns is defined. For the identification of prescribed behaviors—rights, obligations, and their antitheses—the annotation schema employs a set of manually constructed patterns, based on modal verbs (see Table 2). For identifying basic concepts—such as actors, resources, and actions—we reuse the Definitions section of a legal document, where every term used is defined precisely and a reference synonym is assigned [18, 55].
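
A minimal sketch of this kind of indicator-based matching is shown below, in the spirit of Table 2; the indicator lists and names are illustrative examples, not the complete annotation schema used by GaiusT.

    using System;
    using System.Collections.Generic;
    using System.Linq;

    static class DeonticMatcher
    {
        // Example modal-verb indicators per deontic concept; illustrative only.
        static readonly Dictionary<string, string[]> Indicators = new Dictionary<string, string[]>
        {
            ["Right"]          = new[] { "may", "can", "has a right to" },
            ["Obligation"]     = new[] { "must", "shall", "is required to" },
            ["AntiRight"]      = new[] { "may not", "cannot" },
            ["AntiObligation"] = new[] { "is not required to", "need not" }
        };

        // Return the concept whose longest matching indicator occurs in the sentence,
        // so that "may not" wins over "may"; null when nothing matches.
        static string Match(string sentence)
        {
            var s = " " + sentence.ToLowerInvariant() + " ";
            var best = Indicators
                .SelectMany(e => e.Value.Select(ind => (Concept: e.Key, Indicator: ind)))
                .Where(p => s.Contains(" " + p.Indicator + " "))
                .OrderByDescending(p => p.Indicator.Length)
                .FirstOrDefault();
            return best.Concept;
        }

        static void Main()
        {
            // prints: Obligation
            Console.WriteLine(Match("A covered entity must permit an individual to request restrictions.") ?? "no match");
        }
    }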

Fig. 1

The conceptual model of legal concepts

Table 2 Syntactic indicators for deontic concepts

Conditions of applicability are represented in the conceptual model as the concept constraint, which is specialized into two subconcepts, exception and temporal condition. A list of syntactic indicators can be generated using a combined approach that analyzes the content of the legal document with the help of lexical databases such as WordNet or a thesaurus. In our case studies, we manually built a list of syntactic indicators for temporal conditions and exceptions and integrated it with lists derived from lexical databases (Table 3).

Table 3 Syntactic indicators for Exception

4.2.1 Multilingual semantic annotation of legal documents

Language diversity in the semantic annotation process is treated at three distinct levels: (1) Structural: a text available in multiple language translations uses different character sets, word splitting, orders of reading, layout, text units, and so on; (2) Syntactic: the annotation patterns and rules used to identify concepts need to be expressed in the target language. For example, the syntactic indicators used for the concept of obligation (see Sect. 6) must be revised for each new target language; (3) Semantic: the semantic model of legal concepts should have a language-independent representation; thus, the syntactic indicators for each concept in different languages must be carefully defined.

The structural level has been investigated in the information retrieval (IR) area and basically amounts to segmenting text into words in different languages, using word stemming and morphology, and identifying phrase boundaries, using punctuation conventions. There are many tools for handling these issues. However, software developers often underestimate the importance of these aspects, whereas a successful text analysis tool must always cater to them. Issues related to this level are managed by the document structure analysis described in Sect. 5.

The syntactic level relates to the fact that people of different nations express the same concept in different ways. For example, working with Italian legal documents, we discovered that statements expressing an obligation normally use the present active or present passive tense, e.g., "the organization sends a request" or "the request is sent to the user". This form differs from the English construction of the same concept, where obligations are usually expressed with the modal verbs "must" or "should". However, syntactic differences do not affect the conceptual model built for the domain of legal documents, since the concepts represented are valid for languages other than English as well (Fig. 1). In our case studies, we used the WFL Generator and Lexical Database Manager components of GaiusT to identify types of syntactic indicators for the concepts of the annotation schema (see Sects. 7.1.1 and 7.2.1). These components perform statistical analysis of the documents and use lexical database sources of the specific language to integrate the results.

The most problematic issue for annotation across languages is the semantic level, because it has to account for alternative worldviews and cultural backgrounds. The existence of semantic vagueness, semantic mismatch, and ambiguities between languages poses several challenges [58]. The main problems at this level are caused by proper names; temporal and spatial conventions; language patterns, such as specific or idiomatic ways to express certain concepts; and the representation of co-references, time references, place references, causal relationships, and associative relationships. For example, consider the concept "week": in many European countries, a week starts on Sunday and ends on Saturday, whereas in some Eastern-European countries, it starts on Monday and ends on Sunday. Thus, the expression "by the end of the week" refers to different time frames depending on the country. The best approach to handling semantic differences consists in developing parallel alignments with the same methodology and the same conceptual model [50]. When using GaiusT with English and Italian documents, such semantic differences were not essential and did not have to be captured in the conceptual model. This section has considered some of the issues that arise in annotating multilingual legal documents, and GaiusT components were adopted to deal with them. However, for other aspects, such as contextual information and implicit assumptions (i.e., ambiguities used to embrace multiple situations), a deeper semantic analysis is required (see, for example, [17]).

5 System architecture

This section presents the system architecture of the GaiusT tool, including its components and the technologies used. The overall architecture is shown in Fig. 2.

Fig. 2

Architecture of GaiusT

GaiusT integrates and extends existing modules of Cerno (shown in green in Fig. 2) by adding modules to automate text preprocessing, annotation, and annotation schema generation. The extraction process is supported by several components that start from input documents, annotate them, and map the results into a database for later analysis. The annotation process used by Cerno is based on a hand-crafted conceptual model specifying domain-dependent information, enriched with lexical information to support the identification of semantic categories. The enriched model is the basis for defining an annotation schema. Taking into account that the annotation schema can change according to user interests and goals, and that the schema has a great impact on the annotation process, we developed a series of modules that support the generation of an annotation schema.

GaiusT has been developed on a Microsoft Windows platform using the .NET framework (in C#). Choices about external tools and libraries, such as the PDF text extraction library or the part-of-speech tagger, have been driven mainly by platform compatibility, modularity, integration, and multilingual requirements.

The architecture of GaiusT is composed of four main layers: (a) the annotation schema layer, (b) the annotation layer, (c) the database layer, and (d) the graphical user interface (GUI) layer, as shown in Fig. 2. The entire system is about 50 k lines of code and more than 130 MB in size. Physically, the system includes eight main components: (1) annotation schema generator, (2) preprocessing component, (3) TXL-rule generator, (4) document structure analyzer, (5) annotation generator, (6) database mapper, (7) evaluation component, and (8) GUI.

The components of the architecture are presented in the following subsections, taking into account the input and output data, as well as the functionality of each component.

5.1 The annotation schema layer

The annotation schema layer sets up the core elements for the semantic annotation (SA) process and consists of the Annotation Schema Generator component. Its main input is a conceptual model that represents the domain of the document, defined taking into account also the goals of the annotation. Its main output is the preliminary annotation schema (PAS). The component is implemented using a combination of C# and TXL programs. The generator uses four modules to generate the preliminary annotation schema:

  • The Conceptual Model Parser extracts concepts, properties, relationships, and constraints from an XML Metadata Interchange (XMI), RDF, or OWL source file and produces a list of items organized in a plain text file. This module takes as input a conceptual model generated using a CASE tool that supports the XMI UML interchange language, such as UMLSpeed, MagicDraw, Visual Paradigm, or Protégé, and returns the skeleton of a PAS. The module is a C# library.

  • The Word Frequency List (WFL) Generator provides an inverse frequency list of the words in one or more input documents (a minimal sketch of this step appears after this list). First, the module tokenizes the input and removes stop words according to the language. The WFL Generator produces two separate lists: one with the frequencies of the words in the source file, and another with the lemmas of the words and their frequencies. The lemmatization process, which reduces inflectional forms of a word to its lemma, is carried out by the lemmatizer module of the Lucene open source search engine, which also supports multiple languages.

  • The Lexical Database Manager is an interface to lexical resources such as WordNet and a thesaurus and, for multilingual definitions, Wikipedia. The input to the module is the set of elements extracted from the conceptual model, while the output is a list of candidate syntactic indicators for each element of the PAS. The interface provides facilities to query lexical resources both online and standalone and to filter the output, for example using the options provided by the WordNet command-line program.

  • The Part-of-Speech (POS) Manager is a module that can be used to refine the syntactic indicators of the PAS. It uses a part-of-speech tagger developed by the Institute for Computational Linguistics of the University of Stuttgart [51] to generate a table with the token elements of the document and the probability of their syntactic roles. The tokens can be used to strengthen or enrich the list of linguistic indicators. The POS Manager supports different languages.
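
As anticipated in the description of the WFL Generator, the following fragment sketches the word-frequency step under simplifying assumptions (a plain-text input, a small hard-coded stop-word list, and no lemmatization); the actual module relies on Lucene and produces a second, lemmatized list.

    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Text.RegularExpressions;

    static class WordFrequency
    {
        // Tiny illustrative stop-word list; the real module uses language-specific lists.
        static readonly HashSet<string> StopWords =
            new HashSet<string> { "the", "a", "an", "of", "to", "and", "or", "in", "is", "by", "that" };

        // Tokenize, filter stop words, count occurrences, and sort by descending frequency.
        static IEnumerable<KeyValuePair<string, int>> InverseFrequencyList(string text)
        {
            return Regex.Matches(text.ToLowerInvariant(), @"[a-zàèéìòù]+")
                .Cast<Match>()
                .Select(m => m.Value)
                .Where(w => !StopWords.Contains(w))
                .GroupBy(w => w)
                .Select(g => new KeyValuePair<string, int>(g.Key, g.Count()))
                .OrderByDescending(kv => kv.Value);
        }

        static void Main()
        {
            var sample = "A covered entity must permit an individual to request that the covered entity restrict uses.";
            foreach (var kv in InverseFrequencyList(sample))
                Console.WriteLine(kv.Key + "\t" + kv.Value);   // "covered 2", "entity 2", ...
        }
    }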

5.2 The annotation layer

The annotation layer includes four components: the preprocessing component, the TXL-rule generator, the document structure analyzer, and the annotation generator. It is the core layer of the tool and it preprocesses input documents, delimits the text unit boundaries, generates the grammar and rules for the parser, and performs the annotation.

  • The preprocessing component extracts plain text from input documents and cleans up unprintable characters. The Text Extractor uses open source libraries to extract plain text from Microsoft Word (DOC), Rich Text Format (RTF), and Adobe Acrobat (PDF) documents, while for Web pages of different formats (HTML, XHTML, PHP, ASP, and XML-structured documents) it uses a TXL program that extracts plain text and, whenever possible, preserves structural information useful for document structure identification. The input of the component is a source file in one of the above formats and the output is plain text. The Normalizer, based on an existing TXL module of Cerno [28], normalizes the plain text document generated by the Text Extractor by removing leading characters, unprintable characters, and trailing spaces, producing as output a text document where each line represents a phrase. The module supports the different character sets of European languages.

  • The TXL-rule generator component consists of two distinct domain-independent modules that generate TXL grammar rules and patterns. The Grammar Generator is a TXL program that, given a PAS, generates a TXL grammar file used in the annotation phase. This module generates expansions of the lemmas of the syntactic indicators defined in the preliminary annotation schema (a sketch of this expansion appears after this list). It supports English and Italian, expanding regular verbs and nouns with the proper forms (such as adding "s" for plural nouns and the third-person form of English verbs, or conjugating the present and past forms of regular Italian verbs with "-are", "-ere", and "-ire" suffixes). An example of such a grammar file is shown in Fig. 3. The Semantic Rule Generator generates a TXL program file with as many rules as there are patterns and constraints defined in the PAS.

    Fig. 3

    An example of grammar file generated by the annotation generator component

  • The document structure analyzer annotates text units of a given document with XML tags delimiting their boundaries. The document structure analyzer is composed of a TXL program with a set of rules, grammars, and patterns used to identify text units. The rules and grammars are ambiguous and use island parsing techniques to identify relevant parts [15]. Grammar overrides are used to enable robust parsing and to avoid errors or exceptions due to input not explained by the grammar. The input to the component consists of the grammar for the document structure and the normalized plain text document. The output is a document annotated with XML tags that identify structural elements.

  • The annotation generator is the core component of GaiusT. It embeds an extended version of the existing Cerno annotation prototype developed in [28]. The component runs the TXL interpreter with the annotation program and the grammar files created by the Grammar Generator and Semantic Rule Generator modules. The input of the component is semistructured text, i.e., the result of preprocessing and structural analysis, while the output is the text annotated in XML with the concepts of the conceptual model used for the annotation. The XML output document can be converted to OWL format using JXML2OWL.
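
As anticipated in the description of the Grammar Generator, the fragment below sketches the expansion of a lemma into surface forms, assuming regular morphology only; the actual module emits these expansions as TXL grammar alternatives rather than C# collections.

    using System;
    using System.Collections.Generic;

    static class LemmaExpansion
    {
        static IEnumerable<string> ExpandEnglishVerb(string lemma)
        {
            yield return lemma;                        // require
            yield return lemma + "s";                  // requires
            yield return lemma.TrimEnd('e') + "ed";    // required
            yield return lemma.TrimEnd('e') + "ing";   // requiring
        }

        static IEnumerable<string> ExpandItalianVerb(string lemma)
        {
            if (!lemma.EndsWith("are")) yield break;   // regular -are verbs only in this sketch
            var stem = lemma.Substring(0, lemma.Length - 3);
            foreach (var suffix in new[] { "a", "ano", "ato", "ata", "ati", "ate" })
                yield return stem + suffix;            // presenta, presentano, presentato, ...
        }

        static void Main()
        {
            Console.WriteLine(string.Join(", ", ExpandEnglishVerb("require")));
            Console.WriteLine(string.Join(", ", ExpandItalianVerb("presentare")));
        }
    }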

5.3 The database layer

The database layer includes the components that support evaluation and the mapping of the annotation data to a relational database.

  • The database mapper accepts the annotated XML document and uploads it into a database using two modules: the SQL Generator and the Bulk Database Loader. The SQL Generator, developed as part of a Master's thesis at Queen's University [52], uses a TXL program to parse an input annotated document and produces an output file with SQL data definition language statements along with SQL data manipulation language statements for a relational database. The input to the module is an XML document and the set of tags used for the annotation. The Bulk Database Loader then executes the SQL statements produced by the SQL Generator against the target relational database. The input is the SQL statements and the output is a populated relational database in a database system such as MySQL, Microsoft SQL Server, or Oracle.

  • The evaluation component supports the assessment of the quality of annotation results. It facilitates the evaluation process with the following features: (1) Text preparation: the annotated text is organized in a table where the columns are the set of tags used for the annotation and the rows are the annotated units. For each tag, the number of occurrences in the text units is reported. This representation provides input for statistical measures of the annotated text. (2) Comparison with the reference annotation: according to the evaluation framework provided in [28], the results of the GaiusT annotation can be compared with a manually annotated document. The component automatically compares the documents annotated by the tool with their reference annotation, i.e., a manually annotated gold standard if one exists, by calculating recall, precision, fallout, accuracy, error, and F-measure [62] (the standard definitions are sketched after this list). The component, realized in C#, accepts as input one or more documents with embedded annotations produced by GaiusT and the list of tags used for the annotation, and produces as output a database in which each evaluated document is represented as a table with the evaluation measures. The output can be exported as a spreadsheet.
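
As anticipated in the description of the evaluation component, the measures are the standard ones computed from true/false positives and negatives; the fragment below is a minimal sketch of their definitions, not GaiusT's actual implementation, and the counts in the example are hypothetical.

    using System;

    static class AnnotationMetrics
    {
        // tp/fp/fn/tn: true positives, false positives, false negatives, true negatives.
        static double Recall(int tp, int fn)    => (double)tp / (tp + fn);
        static double Precision(int tp, int fp) => (double)tp / (tp + fp);
        static double Fallout(int fp, int tn)   => (double)fp / (fp + tn);
        static double Accuracy(int tp, int tn, int fp, int fn) => (double)(tp + tn) / (tp + tn + fp + fn);
        static double Error(int tp, int tn, int fp, int fn)    => (double)(fp + fn) / (tp + tn + fp + fn);
        static double FMeasure(double precision, double recall) => 2 * precision * recall / (precision + recall);

        static void Main()
        {
            int tp = 40, fp = 5, fn = 8, tn = 100;   // hypothetical counts for one tag
            double p = Precision(tp, fp), r = Recall(tp, fn);
            Console.WriteLine($"precision={p:F2} recall={r:F2} F-measure={FMeasure(p, r):F2}");
            Console.WriteLine($"accuracy={Accuracy(tp, tn, fp, fn):F2} error={Error(tp, tn, fp, fn):F2} fallout={Fallout(fp, tn):F2}");
        }
    }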

5.4 The graphical user interface

The last layer is the graphical user interface (GUI), which supports the management of the entire process of semantic annotation. In particular, the GUI supports the annotation process by orchestrating all components and modules (see Fig. 4). It provides control over all input and output operations and parameters of every GaiusT component. The GUI is realized using Windows Forms of the Microsoft .NET platform.

Fig. 4

The GaiusT GUI

The panels of the GUI are as follows:

  • Project: the panel allows the control of the Annotation Project. An Annotation Project is an XML database file that stores all references to resources used for the annotation, the settings of the system, the input and output files of each component, and the parameters used for each component (see Fig. 4).

  • Settings: the panel allows the user to control the settings of external resources, such as the TXL programs, the database repository for storing annotations, Web resources, and third-party applications such as WordNet.

  • Annotation Schema: this panel allows the user to control all the components that contribute to the creation of the annotation schema. It is subdivided into different sections related to the different modules of the Annotation Schema Generator component.

  • Grammar: the panel allows managing the collection of grammar files used by the TXL programs. Each item of a grammar file can be managed through a checkbox window that permits adding, editing, or removing grammar items.

  • Preprocess: the Text Extractor and Normalizer modules are controlled through this panel. The panel also manages the Document Structure Analyzer component.

  • Process: provides visualization and control of the steps of the SA process so that each operation can be executed and managed independently of the others, allowing for flexibility and debugging.

  • Database: the database panel provides an interface to manage the SQL Generator and the Database Bulk Loader modules.

  • Browser: this component displays the annotated document, visualizing tags in different colors. The browser also offers other features, such as annotation editing and syntax highlighting. It takes as input plain text, rich text (RTF), or Web pages (HTML, XML, PHP, ASP, etc.) and displays them, also enabling hyperlink navigation. The visualization of the annotated document with the identified concepts is produced by an XSLT program that transforms the XML markup into an HTML document and presents the results in table format. The browser can also support the requirements engineer during manual annotation by facilitating tag insertion.

  • Quantitative evaluation: this panel maps the annotated text into a table where each column represents a tag used and each row an annotated unit. For each tag, the partial count per text unit and the overall number of tags in the document are calculated.

  • Qualitative evaluation: this panel facilitates comparison of annotated files against reference annotation (if available) and estimation of human disagreement.

Table 4 lists the components of GaiusT, showing their input and output data types.

Table 4 Components of GaiusT: input and output data type

6 The GaiusT semantic annotation process

On the basis of the extensions to the Cerno components described in Sect. 3, we present the generic process of semantic annotation used by GaiusT to annotate legal documents. Each step of the process can be carried out manually or automatically. Figure 6 shows the entire annotation process of GaiusT, including the input and output of the components.

6.1 Definition of a conceptual model

This phase is necessary if there is no preexisting conceptual model. The process for creating a conceptual model has been studied extensively in the literature and is supported by a variety of tools. The construction of the conceptual model of legal concepts was presented in Sect. 4.2.

6.2 Text extraction

Each input document is converted into plain text and normalized. This is achieved through two steps: (a) text extraction, performed by the Text Extractor module. This phase is critical, especially for the PDF format, given that there are no format tags as in RTF or HTML documents. In PDF documents, the information related to the format of the document is completely lost or confused, and manual intervention on the text extracted by the module is required to normalize the document structure; (b) normalization, performed by the Normalizer module. The output is plain text where each line represents a sentence.

6.3 Construction of the annotation schema

This step creates the annotation schema elements for the Annotation component. The process is divided into three subphases:

  (a)

    Construction of the preliminary annotation schema, which includes a set of concepts and other elements extracted from the conceptual model created using the XMI, RDF, or OWL languages. This preliminary schema is automatically created using the Conceptual Model Parser module. The PAS is structured as plain text in which the concepts and the other elements, such as relationships, properties, and constraints, are listed. The relationships are extracted from the XMI source, and for each relationship a pattern is generated.

  (b)

    Population of the annotation schema: the PAS is now revised by adding/filtering out relevant indicators to generate the final version of the annotation schema. In particular, each concept and property in the PAS needs to be enriched with syntactic indicators, i.e., keywords related to the concept such as synonyms or hyponyms (Fig. 5). The syntactic indicators are generated by using: (1) the WFL module, to produce an inverse frequency list of keywords of the input document; and (2) the Lexical Database module, to produce the list of hyponyms and synonyms for each concept and property. The two lists are then merged into one list, removing redundant elements. The final output is a table with two columns: the list of concepts and the related keywords retrieved.

    Fig. 5

    Structure of annotation schema

  (c)

    Generation of TXL grammar and rules for the annotation schema: during this step, the annotation schema is parsed by the Grammar Generator and the Semantic Rule Generator, and two files are produced: a grammar in which the syntactic indicators for each concept are defined, and a program file with all the semantic rules.

6.4 Annotation

Once the annotation schema is generated, the annotation process can be executed. It is divided into two subphases: (a) Document structure analysis. In this step, the input document is annotated with tags that identify the structure of the document and tags for basic word-equivalent objects. This step is important because it establishes the granularity of the annotation. The granularity can be controlled by the requirements engineer through the visual aid provided by the grammar panel. The annotation is generated by the Document Structure Analyzer component, which runs a TXL program with the grammar items chosen in the grammar panel. (b) Annotation. The Annotation Generator annotates the relevant items according to the annotation schema and the generated grammars and rules. Finally, it removes the structural tags used to identify the boundaries of text units (Fig. 6).

Fig. 6

Annotation process of GaiusT

6.5 Database population (optional)

The last stage of annotation is the population of a database with annotated documents. Two modules perform this operation: the SQL Generator module is used to generate a relational database from the XML annotated documents, while the Bulk Database Loader module loads the data into the target database. The target database can be any relational database that supports the SQL92 standard.
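
The following fragment sketches the kind of mapping performed at this stage, turning embedded XML annotations into SQL statements; the table and column names, and the XML fragment, are hypothetical, and the real SQL Generator is a TXL program with a richer schema.

    using System;
    using System.Text;
    using System.Xml.Linq;

    static class SqlMapping
    {
        static string ToSql(string annotatedXml)
        {
            var doc = XElement.Parse(annotatedXml);
            var sql = new StringBuilder("CREATE TABLE annotation (tag VARCHAR(32), content TEXT);\n");
            foreach (var element in doc.Descendants())
            {
                var content = element.Value.Replace("'", "''");   // escape single quotes for SQL
                sql.AppendLine($"INSERT INTO annotation (tag, content) VALUES ('{element.Name}', '{content}');");
            }
            return sql.ToString();
        }

        static void Main()
        {
            var fragment = "<Paragraph><Obligation>A covered entity must act on a request</Obligation>" +
                           "<Actor>covered entity</Actor></Paragraph>";
            Console.WriteLine(ToSql(fragment));
        }
    }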

7 Evaluation

To evaluate the efficacy and efficiency of the GaiusT tool, two experiments have been conducted, one on the HIPAA Privacy Rule (USA, in English) and the other on the Stanca law (Italy, in both English and Italian). Our experiments with the tool suggest that it is scalable, in the sense that it requires milliseconds to process a HIPAA section and just a few seconds to process and annotate all of HIPAA (approximately 147 K words). In all our experiments, evaluation was accomplished by comparison against human performance, consistent with the evaluation framework of [31]. Moreover, the comparison does not consider the overhead of setting up GaiusT resources, such as the construction of the conceptual model and the time used for tuning the syntactic elements of the annotation schema. This is a one-time overhead for a given language and can be at least partially offset against the overhead of training people so that they are comfortable with legal concepts and the annotation process for legal documents. We have not attempted a comparison of GaiusT with other tools, for pragmatic reasons: none of the tools we reviewed has features or scope comparable to GaiusT.

7.1 The HIPAA privacy rule

7.1.1 Case study setup

The U.S. Health Insurance Portability and Accountability Act (HIPAA) is a part of the Social Security Act. The HIPAA Privacy Rule defines standards for the protection of personal medical records and other personal health information. To perform the structural analysis of American laws, the Document Drafting Handbook, which contains the style guide for writing regulations [55], was consulted. Patterns extracted for the law hierarchy are reported in Table 5. The structure of American law contains 12 levels, so it was necessary to extend the hierarchical grammar of Table 1 for the analysis of the specific text units. The annotation schema used in this application focused on extracting a set of objects of concern: right, anti-right, obligation, anti-obligation, exception [13], and some types of conditions of applicability. The list of syntactic indicators for HIPAA was constructed following the guidelines of the semantic parameterization methodology and the process described in Sect. 6; a fragment is shown in Table 6. We used the Annotation Schema Generator component to parse the input conceptual model and build the preliminary annotation schema, populating the list of syntactic indicators with the WFL Generator, Lexical Database Manager, and POS Manager.

Table 5 The document structure of American Law
Table 6 Examples of syntactic indicators for legal requirements in HIPAA

Some of the indicators are complex patterns that combine literal phrases and basic concepts. Thus, the identification of indicators requires a preliminary recognition of cross-references, policies, and actors. As regards cross-references, internal ones can be consistently identified by GaiusT using the small set of patterns shown in Fig. 7. External ones were not included in the annotation schema, as there is no reliable indication of the formats they may take.

Fig. 7

The grammar for cross-reference objects

The introductory section 160.103 ("Definitions") of HIPAA, where terms are strictly defined and reference synonyms are assigned, has been used; a fragment is shown in Fig. 8. The reference word is used consistently over the entire document, denoting all related terms. For instance, the HIPAA text uses terms such as "policy", "business associate", "act", "covered entity", and others. In the analyzed sections of HIPAA, other terms that could be generalized into a common, abstract type were found, including event, date, and information. Thus, on the basis of the definitions section, a list of syntactic indicators for all the basic concepts has been derived. The final list of basic concepts includes actor, policy, event, date, and information. GaiusT has been applied to the parts numbered 160 "General Administrative Requirements" and 164 "Security and Privacy" of the HIPAA Privacy Rule, containing a total of 33,788 words [56]. Automatic annotation of this text by GaiusT required only 3.07 s on an Intel Pentium 4 personal computer with a 2.60 GHz processor and 512 MB of RAM running the Windows XP operating system. As a result, about 1,900 basic entities and 140 rights and obligations were identified. The manual evaluation of the quality of the automatic results was carried out for the following sections contained in the analyzed parts: §164.520: Notice of privacy practices for protected health information; §164.522: Rights to request privacy protection for protected health information; §164.524: Access of individuals to protected health information; and §164.526: Amendment of protected health information. These sections were chosen because the results acquired by GaiusT can be compared to the results of the manual analysis reported in [13]. The manual analysis by an expert requirements engineer of the chosen fragments, containing a total of 5,978 words or 17.8 % of the analyzed parts, took an average of 2.5 h per section. Figure 9 illustrates an excerpt of annotated text from §164.526(a)(2) of HIPAA resulting from GaiusT's application. Each embedded XML annotation is a candidate "object of concern". For example, in Fig. 9, the "Index" annotation denotes the sub-paragraph index "(i)" and the Actor annotation denotes the "covered entity"; the latter appears twice in this excerpt.

Fig. 8

Some of the indicators for basic entities according to the information in the definitions section

Fig. 9

A fragment of the result generated by GaiusT for HIPAA Sect.164.526

7.1.2 Analysis of results

Following the evaluation framework presented in [31], in a first stage the tool and four human expert annotators marked up sections 164.524 and 164.526. The annotated text of this fragment of HIPAA consisted of 77 sentences. Our human annotators were experienced in both computer science and law. The annotations performed by the experts were evaluated pairwise: each annotation was matched against all others and the results of the comparisons were aligned adopting the following criteria: (1) first, annotated concepts were considered correct when they overlapped on the most meaningful words, so that minor divergences in text fragment boundaries were ignored; (2) then, concepts identified by GaiusT and missed by a human annotator were manually revisited and validated. We used these annotations to calibrate the system annotations, as we consider them an upper bound on what automatic annotation can do. Experts reported an overall recall of 91.6 % and a precision of 82.3 %. Adopting the expert annotations as a gold standard [6, 25] and comparing them with the tool annotation, the tool shows good performance, with an F-measure of 84.8 % and a good accuracy of 77.9 % (Table 7).

Table 7 Evaluation rates of experts and GaiusT annotating HIPAA

As for the annotation of basic concepts, calculation of evaluation metrics was not possible, as the human annotator was asked to identify only high-level deontic concepts. By manually reviewing the annotated results, we found that the tool correctly identified all instances of the concepts actor, policy, event, information, and date with a precision of 91 %. GaiusT also correctly recognized section and subsection boundaries, titles, and annotated paragraph indices with a precision of 96 %. These structural annotations were used to disambiguate and manage cross-references.

On the basis of the experimental study, we can conclude that GaiusT was able to identify legal requirements with high precision (from 93 to 100 %). Good recall rates were also demonstrated for most concepts (70–100 %), apart from anti-obligation (33 %), where the tool’s performance must be improved.

Overall, the GaiusT tool largely reduces human effort in identifying legal requirements by doing some of the work and by giving hints to human annotators on how to proceed. For example, the tool could identify constraints and subject candidates that the human annotator could then link to identified concept instances.

7.1.3 Productivity evaluation

The goal of this application was to test the usefulness of the tool for non-experts in regulatory text who may have to analyze such documents to generate requirements specifications for a software system. In addition to the expert evaluation, an empirical validation of the proposed tool with inexperienced requirements engineers has been carried out. The issue here is that requirements engineers are not always supported by lawyers when designing new software. For this purpose, sections 164.524 and 164.526 of HIPAA were selected for annotation by subjects who do not work with rules and regulations directly. These parts were selected so that they have an approximately equal number of statements, comprising 1,205 and 1,057 words, respectively.

The experiment involved eighteen master-level computer science students taking a software engineering course. Participants in this experiment were not familiar with legal terms and needed some training in semantically annotating legal text. A detailed explanation of the annotation process and examples of the concepts to be identified were made available. Moreover, the participants were provided with a user-friendly interface to facilitate the insertion and modification of tags in input documents.

The setup of the experiment consisted in dividing the participants into two groups and asking each group to annotate sections 164.524 and 164.526 (containing a total of 2,262 words) in two different time frames. In the first time frame, group one worked on section 164.524 without annotations, while group two worked on section 164.526 with annotations previously generated by GaiusT. In the second time frame, the sections were switched.

The annotators were asked to incrementally identify rule statements and their components in each of the two texts, inserting markups on the original page for the unannotated text and modifying GaiusT's output in the annotated text. We recorded the time spent on the annotation of both texts by each novice annotator, as well as the number of different entities identified during the annotation process. Table 8 summarizes the average time spent by participants to fully annotate the two sections and the average time spent to fix sections already annotated by GaiusT. The average time saved by using the output of the tool was 33.52 %, a notable saving considering that the sections are only a small fragment of the entire law. The entire HIPAA is 147 K words and, assuming that annotation time depends linearly on the size of the law, it would require almost three days for complete annotation.

Table 8 Average time spent to annotate HIPAA sections by novice annotators

Concerning the performance of novices relative to the gold standard, novices who performed a full annotation without assistance achieved a recall of 52 % and an accuracy of 50 %. By comparison, novices who simply improved the automated GaiusT annotations reported a much higher rate of precision and recall (95 and 90 %, respectively) with an accuracy of 90 % (see Table 9).

Table 9 Comparing novice results versus novice assisted by the tool

This confirms that the tool is a great help for novices both in terms of productivity and in terms of quality of the results. Annotators generally expressed satisfaction with the GaiusT tool and the annotations it provides, finding them easy to read and helpful in interpreting a legal document. However, they also pointed to some areas where the tool can be improved.

7.2 The Italian accessibility law

7.2.1 Case study setup

The Italian accessibility law (the Stanca law) contains a set of provisions to promote access by disabled people to information technology instruments [26]. This law defines a set of guidelines covering the different accessibility levels and technical requirements, the technical methodologies used to verify the accessibility of Web sites, and the assisted evaluation programs that can be used for this purpose. The Stanca law was chosen because it is available in both English and Italian. The availability of both language versions allows us to perform a finer-grained analysis of the different style and syntax used in the two documents.

To identify the hierarchical structure of the Italian law, we used the Italian drafting guidelines for lawmakers, Guida alla redazione dei testi normativi [47]. The guidelines provide suggestions about style, structure, and the type of verbs to be used to express legal concepts. The structure of Italian law provides for 10 levels instead of the 12 of American law, and the naming convention used for the text units is different; thus, the hierarchy grammar that had been developed for HIPAA had to be adapted accordingly. The analysis of the text units that characterize Italian law is reported in Table 10, and a simplified illustration of how such units can be recognized is sketched after the table.

Table 10 The document structure of Italian Law
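As an illustration only, the following Python sketch shows how a few structural units of an Italian legal text might be recognized with surface patterns. The actual GaiusT hierarchy grammar is written in TXL and covers all the levels listed in Table 10, so the unit names and regular expressions here are simplifying assumptions.

import re

# Illustrative recognition of a few structural units of an Italian legal text.
UNIT_PATTERNS = [
    ("titolo",   re.compile(r"^\s*Titolo\s+[IVXLC]+", re.IGNORECASE)),
    ("capo",     re.compile(r"^\s*Capo\s+[IVXLC]+", re.IGNORECASE)),
    ("articolo", re.compile(r"^\s*Art(?:\.|icolo)\s+\d+", re.IGNORECASE)),
    ("comma",    re.compile(r"^\s*\d+\.\s")),      # numbered paragraph (comma)
    ("lettera",  re.compile(r"^\s*[a-z]\)\s")),    # lettered item (lettera)
]

def classify_line(line):
    # Return the structural unit a line opens, or None for plain text.
    for name, pattern in UNIT_PATTERNS:
        if pattern.match(line):
            return name
    return None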

The grammar rules supporting the interpretation of cross-references in an Italian law were derived from these structural elements. The TXL programs were also extended to deal with the Italian character set, including accented vowels and the apostrophes used in articles and prepositions. The annotation schema for the SA of Italian legal documents was designed keeping in mind that Italian law uses verbs in the present tense to express deontic concepts, and that the use of modal verbs is left to the lawmaker [47]. The syntactic indicators used for identifying deontic concepts in American law are partially valid but needed to be adapted. Two techniques were used to address this issue: (1) translation of the modal verbs used for American law; (2) manual analysis of the text to identify candidate verbs expressing the concepts of right and obligation and their negative counterparts (anti-right and anti-obligation). For the second technique, the POS Manager module was used to retrieve all present-tense verbs of the Stanca law, and the candidate verbs for each concept were then annotated manually. For the identification of actor instances in the Italian law, we adopted two solutions: (1) some instances were mined manually from the definition section “Definizioni”; (2) to catch the actors not mentioned in the definitions, we exploited the results provided by the POS Manager, i.e., all proper nouns were marked as actors. For resource instances, we followed only the first solution, reusing the terms stated in the definition section.
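A minimal sketch of the two POS-based heuristics just described is given below, assuming a simple (token, tag) representation of the POS Manager output. The tag names and the example tokens are illustrative assumptions, not the module’s actual interface.

# Candidate extraction from part-of-speech output, mirroring the two heuristics above:
# proper nouns as actor candidates and present-tense verbs as deontic-verb candidates.
def candidate_actors(tagged_tokens):
    return {tok for tok, tag in tagged_tokens if tag == "PROPN"}

def candidate_deontic_verbs(tagged_tokens):
    return {tok for tok, tag in tagged_tokens if tag == "V_PRES"}

tagged = [("utenti", "NOUN"), ("presentano", "V_PRES"), ("CNIPA", "PROPN")]
print(candidate_actors(tagged))         # {'CNIPA'}
print(candidate_deontic_verbs(tagged))  # {'presentano'}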

To identify action verbs, we adopted the following heuristic: annotate all verbs in the present active, present passive, and impersonal passive forms. Verbs in these forms also express obligations, in accordance with the guidelines for writing Italian legal documents, so the corresponding heuristic rule was adapted for identifying obligations. Rights, obligations, and their negative counterparts are more difficult to identify in Italian. Unlike English, which mainly uses modal verbs to state prescriptions, as in “the users must present their request”, Italian regulations use the present active (“gli utenti presentano la domanda”), the present passive (“la domanda è presentata”), and the impersonal passive (“la domanda si presenta”) to describe an obligation. The choice of style is highly author-dependent, and each of these styles is equally recommended by the guidelines [47]. Therefore, in identifying these concepts, our strategy included (1) translating the indicators identified by Breaux and Antón (see the list of equivalent indicators in Table 11) and (2) annotating as instances of obligation those sentences containing verbs in the tenses that intrinsically express obligations (a simplified sketch of this tense-based heuristic is given after Table 11). Figure 10 shows the list of syntactic indicators identified for the deontic concepts, together with a subset of the grammar for syntactic indicators integrated as a domain-dependent component of Cerno. The annotation by GaiusT of the Stanca law, containing a total of 6,185 words on 280 lines, takes only 61 milliseconds on an Intel Pentium 4 personal computer with a 3 GHz processor and 2 GB of memory running Windows Server 2003. A fragment of the annotated Stanca document is shown in Fig. 11.

Fig. 10 A fragment of entities used for identifying categories

Fig. 11 A fragment of the annotated Stanca law

Table 11 Syntactic indicators for Italian laws
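The following rough Python sketch illustrates the tense-based heuristic for Italian obligations described above. The surface patterns are simplified assumptions covering only the three example styles; the actual indicators are encoded in Cerno’s domain-dependent grammar (Fig. 10, Table 11).

import re

# Simplified surface patterns for the three Italian obligation styles exemplified above:
# present active ("presentano"), present passive ("è presentata"), impersonal ("si presenta").
OBLIGATION_PATTERNS = [
    re.compile(r"\b\w+ano\b"),           # present active, third-person plural
    re.compile(r"\bè\s+\w+at[aoei]\b"),  # present passive participle
    re.compile(r"\bsi\s+\w+a\b"),        # impersonal "si" construction
]

def looks_like_obligation(sentence):
    # Flag a sentence as a candidate obligation if any pattern matches.
    return any(p.search(sentence) for p in OBLIGATION_PATTERNS)

for s in ["gli utenti presentano la domanda",
          "la domanda è presentata",
          "la domanda si presenta"]:
    print(s, "->", looks_like_obligation(s))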

7.2.2 Analysis of results

To evaluate the annotation results, we designed an empirical study involving annotation of the Stanca law and comparing the performance of the tool with manual identification of instances of rights, obligations, and associated constraints. Table 12 presents the results of the evaluation. As in the HIPAA case study, the comparison was made against expert human annotators: their annotations were used as the gold standard against which GaiusT’s performance was calibrated. We have not yet measured the productivity of GaiusT for the Italian language, as the focus of the Italian experiment was to evaluate the quality, rather than the productivity, of the tool when changing language. The tool demonstrates good precision for all of the concepts, ranging from a low of 67 % for actor to 100 % for right, anti-obligation, and constraint. Recall was not as high as precision, with the lowest rates being 33 % for right and 35 % for constraint. This is caused by the lack of reliable syntactic indicators for these concepts in Italian legal language. Instances of the anti-right concept were not identified in the document by either the human annotators or the tool.

Table 12 Evaluation rates for legal concepts found in the accessibility law

Our plan for future work includes further investigation of how the identified drawbacks can be resolved. Particular difficulties with Italian text emerged for both the human annotators and the tool. For instance, in Italian the subject is frequently omitted, as in passive verb forms, or hidden behind impersonal expressions, making it difficult to classify the regulatory fragment correctly and to find the holder of a right or obligation. Surprisingly, the official English translation of the Stanca law in most cases states this information explicitly. Consider the verb phrases used to state the obligation in the Italian and English versions of the same fragment, below:

  • Italian statement: “Nelle procedure svolte dai soggetti di cui all’articolo 3, comma 1, per l’acquisto di beni e per la fornitura di servizi informatici, i requisiti di accessibilità stabiliti con il decreto di cui all’articolo 11 costituiscono motivo di preferenza a parità di ogni altra condizione nella valutazione dell’offerta tecnica, tenuto conto della destinazione del bene o del servizio”.

  • English translation: “The subjects mentioned in article 3, when carrying out procedures to buy goods and to deliver services, are obliged, in the event that they are adjudicating bidders which all have submitted similar offers, to give preference to the bidder which offers the best compliance with the accessibility requirements provided for by the decree mentioned in article 11”.

As the fragment shows, the translator of the English version made explicit information that is left implicit in the Italian text. Overall, the annotation results suggest that the GaiusT process for regulation analysis is applicable to documents written in different languages, here English and Italian. The effort required to adapt the framework to the new language was relatively small. The experiment also revealed several language differences that we were able to quantify. In future work, we plan to conduct a more extensive analysis that may separate language effects from legislator effects.

In general, the results of annotation with GaiusT provide a useful starting point for software engineers looking for requirements contained in regulations, relieving them from having to start from scratch.

8 Related work

Automatic analysis of legal documents is a long-standing research field in artificial intelligence [1, 7], but due to the complexity of such documents, the goal has still not been achieved [38]. In particular, there is an intense debate on the limits of representing law in deontic logic [24, 45]. The idea of using contextual patterns or keywords to identify relevant information in prescriptive documents is not new, and a number of methodologies based on similar techniques have been developed. However, tools that realize and synthesize these methods under a single framework are lacking. In [14], the authors propose an algorithm for the detection and classification of non-functional requirements (NFRs). In a pilot experiment, indicator terms were mined from catalogs of operationalization methods for security and performance softgoal interdependency graphs. These indicators were then used to identify NFRs in fifteen requirements specifications, with satisfactory recall and precision for the security and performance keywords. Antón proposed the goal-based requirements acquisition methodology (GBRAM) to manually extract goals from natural language documents [2]. GBRAM has been applied to financial and health care privacy policies [4]. Additional analysis of the extracted goals led to new semantics for modeling goals [8, 9], which distinguish rights and obligations, and to new heuristics for extracting these artefacts from regulations [10, 13]. Manually marked sections of HIPAA [56] are used to automatically fill in frameworks defined according to their conceptual model of laws [11].

Ghanavati et al. [21] report a large systematic review of goal-oriented approaches to modeling legal documents, reviewing 88 papers, comparing results, and summarizing the contributions and research questions in this area. Another survey of research on analyzing legal documents for compliance requirements can be found in [41]. Among other issues, the authors underline the need to develop a comprehensive system to support requirements engineers in the regulatory compliance task. A set of nine high-level functions has been identified as relevant for such a system; annotation of legal statements is the one addressed by GaiusT [40]. Other requirements are partially fulfilled by GaiusT, among them “Semiautomated Navigation and Searching”, including cross-reference identification (see [27]) and prioritization of regulations and exceptions. To facilitate reasoning with regulations, Antoniou et al. [5] introduce a regulation analysis method based on defeasible logic [39], in which the facts manually found in the regulation document are represented as defeasible rules.

Analysis of requirements texts has been automated in several ways. LIDA [42] is an iterative model development method based on linguistic techniques. It uses part-of-speech tagging to derive instances of UML abstractions. First, the tool proposes a list of nouns as class candidates to the requirements engineer, who removes the irrelevant ones; then a list of adjectives is given for choosing relevant attributes; and finally, a list of verbs is provided for identifying relevant methods and roles. Along similar lines, the NL-OOPS tool was devised to support requirements analysis by generating object-oriented models from natural language requirements documents [36]. As its core technology, the tool exploits the LOLITA system for domain-independent natural language processing [19]. In [59], the authors address the problem of correctly identifying requirements, providing a detailed analysis of NASA requirements documents and recommendations on writing clear specifications. As part of this work, the authors found that good requirements specifications use imperative verbs, such as shall, must, and must not, in statements that explicitly point to requirements, constraints, or capabilities. They also introduced the notion of continuances, i.e., phrases that introduce the specification of requirements at a lower level; such phrases usually appear after words like below, as follows, and listed. This analysis relates to our application to some extent: a set of heuristic rules to detect requirements was derived from regularities in the language of regulatory documents, and the notion of continuances was used to identify complete lists. GaiusT is able to identify such lists correctly and relate them to the upper-level concept to which they belong. In aspect-oriented requirements engineering, several methods approach the task of identifying and separating concerns using text analysis methods. For instance, the EA-Miner [49] tool supports separation of aspectual Footnote 24 and non-aspectual concerns and their relationships by applying natural language processing techniques to requirements documents.

One of the early methods intended to facilitate the transition from informal to formal requirements specifications is the Requirements Apprentice [48]. To avoid dealing with the full complexity of natural language, this tool needs a mediator, a skilled requirements engineer who enters the information obtained from the end user in a command-line style using a high-level language. The system then generates a formal internal representation of a requirement from a set of statements and uses a variety of techniques to identify ambiguities and contradictions. To address the complexity of full natural language in general, and of legal texts in particular, it is often necessary to rewrite requirements in structured natural language, which reduces the advantages of applying automatic tools. Not rarely, this manual preprocessing comes together with vocabulary restrictions that limit the scope of the system to a specific knowledge domain; an example of such approaches is [57]. Many tools have been developed to support the visualization and storage of manually created annotations. At different levels of complexity, such functionality is offered by tools like Annozilla, Footnote 25 SHOE, Footnote 26 Yawas, Footnote 27 and SMORE. Footnote 28 However, none of them supports automatic annotation, as GaiusT’s annotation module does; moreover, most of them are limited in the input documents they can deal with (for example, accepting only HTML documents) or in usability (for example, requiring knowledge of a specific markup language). In addition, GaiusT allows the annotation process to be tracked in several ways, satisfying another requirement for an integrated platform to support all the activities related to compliance with HIPAA and other data privacy and security requirements [12, 34].

9 Conclusions and future work

We have proposed the GaiusT tool to facilitate requirements elicitation in the domain of legal documents. The tool supports successive analysis phases, each producing a new level of annotation. The main contribution of this work rests on the integration of a number of techniques from software engineering, the semantic Web and natural language processing to facilitate the elicitation of requirements from legal documents. Our contribution has been evaluated through a series of experiments involving human subjects, with positive results.

There are many technical aspects of the work reported in this paper that need to be improved in future research. In particular, the population of the annotation schema with the Annotation Schema Generator can be improved by using clustering algorithms [43, 53] or independent component analysis [22].

Concerning the semantic annotation of legal documents, the syntactic indicators and patterns used for the identification of types of constraints should be updated through a revision of the annotation schema and indicators, and the set of patterns used to identify the subjects of conjunctions and disjunctions (and, or) needs to be completed; this task is problematic even for full-fledged linguistic analysis tools. Experimentation with other kinds of regulatory documents is also planned. In particular, contract documents will be considered in order to verify the generality of the proposed method and improve its effectiveness.

Finally, the multilingual aspect of SA needs to be further investigated, by considering other languages and other types of multilingual documents, such as Web pages and news stories.