
1 Introduction

Making machines understand commercial contracts is a challenging and multidisciplinary task. Natural Language Processing (NLP) techniques are required to analyze different documents and extract relevant information, which can be expressed in significantly different formats and records. Once obtained, this information needs to be properly represented via some knowledge representation system. In addition, reasoning methods are required to derive new knowledge by evaluating the collected information.

Focusing on the legal domain, there exist different proposals for formalizing legal information in the form of logical predicates. Among them we find PROLEG [26], a legal reasoning system able to represent and reason about contract status and derive information such as its validity or the right to, or grounds for, a rescission. Nevertheless, even though all the contract law logic needed is already encoded, how to automatically transform the input text into logical facts remains an open and important problem. Currently, the translation of texts describing contract events must be coded manually, which is very inefficient in terms of cost and time, and any later change would also require manual curation. A bridge between NLP and this logical system is therefore needed to automatically retrieve all relevant facts from text and populate the PROLEG fact knowledge base.

In this paper, we propose a framework, called ContractFrames, able to translate natural language texts referring to the different statuses of a purchase contract into PROLEG clauses. These texts are neither normative texts nor everyday texts (both of which have been extensively studied in previous literature), but natural language texts at a midpoint between everyday language and pure legal language; an example of one of these texts can be found in Fig. 1, along with its translation into PROLEG. In order to express these texts in a fully logical legal language, we have developed different frames and rules for representing and extracting the relevant information that will feed the PROLEG reasoner. These resources are integrated into a natural language processing pipeline that takes a natural language text as input and returns its PROLEG version. An ontology, called the Contract Workflow Ontology, is also proposed for representing the extracted information in a standard way.

‘person A’ bought this_real_estate from ‘person B’ at the price of 200000 dollars by contract0 on 1/January/2018. But ‘person A’ rescinded contract0 because ‘person A’ is a minor on 1/March/2018. However, this rescission was made because ‘person B’ threatened ‘person A’ on 1/February/2018. It is because ‘person B’ would like to sell this_real_estate to ‘person C’ in the higher price. So, ‘person A’ rescinded rescission of contract0 on 1/April/2018.

Fig. 1. Example of an input text and its expected output.

The rest of this paper is organized as follows. Section 2 presents related work. Section 3 introduces the problem and the reasoning system PROLEG, describing the clauses into which the natural language text will be translated. Section 4 analyses the main challenges. Section 5 presents how our framework tackles these problems, outlining the different steps and their main functionalities. Section 6 presents the outputs of our framework. Finally, Sect. 7 summarizes the main points of our work along with some conclusions and next steps.

2 Related Work

Although the problem of extracting rules in the legal domain has been extensively tackled in the literature [8, 9, 21, 28], most efforts focus on regulations and normative text, not on semi-formal documents dealing with binding agreements.

The work by Biagioli et al. [3] includes, for instance, the idea of representing different types of provisions in normative texts as logical structures or frames; nevertheless, these frames are output as XML files, not logical clauses, where each provision has some metadata arguments independent of other provisions. On the other hand, Araujo et al. [1] consider a series of legal events in Brazilian Portuguese. Using domain and linguistic knowledge, as well as ontologies, they develop rules for detecting these events via an OWL reasoner. Similarly, Wyner et al. [28] use different NLP tools and resources such as VerbNet [27] to extract events and some related roles from regulations in English. Other approaches also use ontologies and resources such as WordNet [18] to obtain semantic information about concepts of interest, such as obligation, permission and prohibition [8]. Although semantics is a common approach [1, 8, 15, 28], it must be noted that not all proposals rely on NLP or semantics, such as the work by Moulin et al. [21]. Finally, the work by Nakamura et al. [22], later extended to deal with references [14], presents a methodology to translate natural language Japanese law texts into logical forms following the Davidsonian formalization.

Contracts have been represented digitally in many different forms for different purposes, but none of them matches the representation of PROLEG clauses well enough. The contract itself has been represented in different XML forms, from the well-structured OASIS eContracts format [16] to the practical ebXML agreements or RuleML-based business rules [11]. Most efforts to represent the contract formally lean towards defining deontic logic systems, such as Governatori's Business Contract Language intended to address contract violations [10], Daskalopulu's approach to tackle subjective views on the contract [6] or Prisacariu's effort to consider temporal aspects [24]. However, not many efforts focus on the contract within a workflow, the most interesting being Kabilan's ontologies [13] and Molina's [20]. Kabilan identifies at least three perspectives under which a contract can be analyzed: the legal one, the business one, and the information systems one. From these perspectives, it is not trivial to abstract a common model for the representation of contracts and contract workflows: commercial law differs from jurisdiction to jurisdiction, each organization has its own in-house business policies with respect to contract management, and information systems are simply too diverse.

When it comes to representing contracts with the purpose of publishing and linking contract information, contract formats are even scarcer. LKIF [12] devotes a class to Contract, but provides no support for contract workflows. FrameNet [2] could be used to represent contracts to some extent (using elements such as Documents or Being_obligated), but these options do not reflect the information needed for the related PROLEG clauses. Similarly, the Commerce_buy frame provides a lot of information on the context of the purchase, but does not consider the contract itself, which is the focus of our research. The representation of contracts as RDF is not supported by any massively adopted ontology, and there are not many standards or public ontology-based specifications to choose from. One of the possible choices is the Media Contract Ontology [25], an ISO standard that supports the representation of contracts as RDF but is nonetheless domain-specific.

Finally, the analysis of event processing and representation in the legal domain by Navas-Loro et al. [23] provides an overview of the different systems and representation options in previous literature.

3 The Problem

In this section, we present the problem and introduce the frames developed for representing it, as well as the reasoning system PROLEG. Let us start by analyzing the example shown in Fig. 1.

Fig. 2. The three different frames in the framework (Purchase, Rescission and Duress) and how they interact. An action can be a contract or a rescission, therefore a rescission can be of a contract or of another rescission. A duress is also necessarily attached to a rescission.

Fig. 3. Rulebase of PROLEG.

In the input text, we find different events related to the status of the contract. First, the purchase is established via a contract; then, a rescission is claimed, adducing the fact that one of the parties was a minor. In the third sentence, a fact of duress (a threat) against one of the parties is stated. Additionally, the underlying cause of the ending of the contract is expressed in the following sentence. Finally, in the last sentence the former rescission is rescinded, which makes the contract valid again. It must be noted that we are not actually interested in the information in the fourth sentence: we just want information about the contract status, so the 'real reasons' behind the rescission are not relevant, only the fact that it was rescinded at some point in time.

For modeling the relevant situations that can involve a contract, we developed three main frames, depicted in Fig. 2. The framework is also able to extract other events relevant to the system, such as whether any of the parties involved in the contract is a minor, but most events are related to these three frames. Some examples of events involving these frames are the agreement of a purchase contract, the manifestation of a rescission or the expression of a duress.

The expected PROLEG representation of the facts relevant to the contract status expressed in the example can be found in Fig. 1. With these facts and the contract law information encoded in its rule base (see Fig. 3), the PROLEG system is able to derive the legal consequences of each of the facts, leading to new conclusions such as whether the buyer has the right to handle the purchased goods at some concrete point in time, or whether a contract or a rescission becomes invalid for some reason, such as the existence of duress or a legal incompatibility like one of the parties involved being a minor. The reasoning process is represented in Fig. 4.
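To make the frame structure of Fig. 2 more concrete, the following is a minimal sketch of how the three frames could be modelled in Java. The field names are illustrative assumptions based on the roles mentioned in this paper (buyer, seller, manifester, manifestee, threatener, threatened), not the exact classes used in ContractFrames.

```java
// Illustrative sketch only: the real ContractFrames classes may differ.
import java.time.LocalDate;

abstract class Action {                  // an action is either a contract or a rescission (Fig. 2)
    String id;
    LocalDate date;
}

class Purchase extends Action {          // establishment of a purchase contract
    String buyer, seller, item;
    int price;
}

class Rescission extends Action {        // rescission of a contract or of another rescission
    String manifester, manifestee;
    Action rescindedAction;              // the contract or rescission being rescinded
}

class Duress {                           // a duress is necessarily attached to a rescission
    String threatener, threatened;
    LocalDate date;
    Rescission relatedRescission;
}
```

With this structure, a rescission of a rescission (as in the last sentence of Fig. 1) would simply be a Rescission whose rescindedAction is another Rescission.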

4 Analysis of Challenges

Before explaining our framework, in this section we describe the different difficulties found during its development. Each one is assigned a letter so that it can be referred to in later sections.

Fig. 4. Visualization of the reasoning made by PROLEG from the facts extracted by our framework.

[A] Style of the text. Legal texts usually use patterns such as "A sells L to B by C", "Part A established a contract with Part B", or "'person B' threatened 'person A'". In these examples, each of the letters is a named entity that can mislead general NLP tools, which may for instance consider 'A' a determiner, changing the whole grammatical structure of the sentence. In our case, a preprocessing step was implemented in order to distinguish between real determiners and 'A'-style party names, and also to eliminate misleading characters (such as ') and blank spaces; a sketch of this kind of replacement is shown below.
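As a rough illustration of this preprocessing (under the assumption that party mentions follow the 'person X' pattern; the actual replacements in ContractFrames may differ), a normalization step could look like the following:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PartyNormalizer {
    // matches quoted party mentions such as ‘person A’ or 'person B' (assumed pattern)
    private static final Pattern PARTY = Pattern.compile("[‘']person\\s+([A-Z])[’']");

    public static String normalize(String text) {
        Matcher m = PARTY.matcher(text);
        // collapse the quoted mention into a single token the POS tagger reads as a proper noun
        return m.replaceAll("Person$1");
    }

    public static void main(String[] args) {
        System.out.println(normalize("‘person A’ bought this_real_estate from ‘person B’."));
        // prints: PersonA bought this_real_estate from PersonB.
    }
}
```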

[B] Relevance. Unlike other proposals, our aim is not to translate every sentence of the original text, but to extract only the facts relevant to the PROLEG system. Therefore, not all the sentences in the text are relevant; in fact, some of them can actually be misleading.

[C] Factuality. Besides relevance, some of the sentences in the processed text do not refer to actual facts, but to possibilities, intentions or preferences (e.g., 'A would like to sell a land to B', 'A preferred to sell it to part D'). In these cases, some screening should be done in order to prevent these events from entering our fact base.

[D] Paraphrasing. There are a number of different ways to express the information, from both the syntactic and the semantic point of view. While in other domains many semantic resources are available to mitigate this phenomenon, the legal domain presents very specific terminology that is hard to deal with.

[E] Complexity. As already reported in previous literature [7], texts in the legal domain tend to be more complex than texts in other domains. They have deeper parse trees, more words per sentence and a different POS distribution. These particularities imply an extra difficulty when extracting information from them, beyond the required preprocessing previously mentioned.

[F] Coreferences and nesting. In a natural language text, a single sentence does not necessarily contain all the information of one event. Coreferences are also difficult to handle, especially when there are several manifestations of a type of event (such as a rescission). In addition, some information is not mentioned at all and must be inferred using domain knowledge. This is the case, for instance, of the rescission of a contract: if we know that a contract C involves party A and party B, and we know there is a rescission of this contract with party A as manifester but it is not explicitly mentioned who the manifestee is, we can assume this role must be party B. Similarly, the duress manifestation of a rescission must be coherent with the information we have about this rescission and the contract it applies to.

[G] Different information. Each of the different events processed requires different information. This is an important point that implies some ordering in the rules and preprocessing of the text. An example is a sentence like "PersonB rescinded contract because personA was a minor on 1 March 2018." In PROLEG clauses, the predicate minor() has arity one, so there is no date attached. Therefore, even though NLP tools tend to assume that the date mentioned refers to the verb be and not to rescind (which is in fact linguistically correct and could also be the inference of a human, since the sentence is ambiguous), our framework should be able to note the difference.

[H] Matching. Since some frames are dependent (e.g., a duress must be related to a rescission), they must be correctly tracked and matched, so the task is not just about extraction but also about merging.

5 ContractFrames

Our framework ContractFrames makes use of the NLP tool Stanford CoreNLP [17]. We use it for sentence tokenization, lemmatization, Part-of-Speech (POS) tagging, Named Entity Recognition (NER) and dependency parsing. We also make use of TokensRegex [4], an annotator that allows setting rules that produce customized annotations. Unlike other options, such as plain regular expressions, rules in the TokensRegex format build on the output of previous annotators, such as POS or NER, and are therefore more powerful from a semantic point of view. A minimal sketch of how such a pipeline can be configured is shown below; the steps in our framework are then explained, along with the problems exposed in the previous section that they target.
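For illustration, a minimal configuration of this pipeline could look like the sketch below. The annotator list follows the one described above; the name of the TokensRegex rule file (contract_events.rules) is a placeholder, not the actual file shipped with ContractFrames.

```java
import edu.stanford.nlp.pipeline.CoreDocument;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import java.util.Properties;

public class PipelineSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        // tokenization, sentence splitting, POS tagging, lemmatization,
        // NER (which includes SUTime) and dependency parsing
        props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner,depparse,tokensregex");
        // custom event rules applied by the TokensRegex annotator (placeholder file name)
        props.setProperty("tokensregex.rules", "contract_events.rules");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        CoreDocument doc = new CoreDocument(
                "PartA sold landL to PartB by contractC for 20000 dollars on October 13, 2017.");
        pipeline.annotate(doc);
        // print the dependency graph of each sentence, on which the frame rules operate
        doc.sentences().forEach(s -> System.out.println(s.dependencyParse()));
    }
}
```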

1. Preprocessing the input text: the first step in our framework deals with problems related to the style of the text and the different information required (problems [A] and [G]). The objective of this step is to output a version of the text that is easier for the CoreNLP pipeline to understand. To this aim, the following functionalities have been implemented, among others:

   1. An algorithm to replace 'A', 'person B', and similar expressions that mislead the POS tagger and the parser.

   2. Replacement of relevant references such as 'this_real_estate', which often appear in input texts, with standard strings such as Item1. These replacements are reversed before producing the final output.

   3. Standardization of dates, which might come as 'dd/Month/yyyy', a format that the Stanford CoreNLP temporal tagger (SUTime) is not able to detect.

   4. NLP rules to detect the arity-one clause minor(agent), which, when found, is added to the list of clauses and removed from the text to avoid misleading parses like the one explained in the previous section.

2. Annotation with the CoreNLP pipeline: the annotations include tokenization, sentence splitting, POS tagging, lemmatization, NER (which includes SUTime), parsing and the application of our event rules via TokensRegex. The rules developed allow our framework to detect different kinds of events (establishment of contracts, purchases, sales, rescissions, duress...) both in verbal forms (buy, sell, rescind) and as noun events (purchase, sale, rescission). Each type of event is annotated accordingly with an event annotation, so that we find the relevant events (problem [B] in the previous section).

3. Parse sentence by sentence: we analyze each sentence separately, assigning one frame of each of the possible types. Then we analyze the annotations of each token separately:

   (a) If the token has an event annotation, we check whether it is negated in the sentence (and is therefore not a fact, so we should not transform it into logic) or whether it is an intention or a possibility (in which case we should not consider it either), targeting problem [C]. Once we have verified that it is a fact, we check its type and then apply different rules to find its arguments (if available) and express the information as the corresponding frame of the sentence. These rules are mainly applied on the dependency parse of the sentence, and take into consideration not just the type of the event but also its form (whether it is a noun or a verb, active or passive). We therefore cover the different paraphrasings (problem [D]) that can express the information relevant for each frame. Let us analyze for instance the following sentences:

       (1) landL was sold by PartA to PartB via contractC for 20000 dollars on 13/October/2017.

       (2) On 10/13/2017, PartB established a purchase contractC with PartA to buy landL at the price of 20000 dollars.

       (3) PartA sold landL to PartB by contractC for 20000 dollars on October 13, 2017.

       All of them provide the same information, despite having different words and syntax; the analysis of the dependencies of each sentence is depicted in Fig. 5. Therefore, all three input texts imply the same frame (the establishment of a purchase contract) and their PROLEG output should be the following:

       [figure a: the corresponding PROLEG clauses]
   (b) If the token has any other relevant annotation (namely, it has been tagged as DATE or MONEY by the Stanford CoreNLP annotators, or as a CONTRACT mention by our rules), we store it as relevant information for the sentence.

   Once each token has been analyzed, we check whether there is missing information in the frames that were initialized because of some found event. If so, we complete it with the relevant data we stored (problems [E] and [F]).

4. Once the whole sentence has been processed, we check the information stored in the frames. If any information is missing, we look for it explicitly (for instance, in the values for DATE and MONEY stored in the previous step for the current sentence), and we also check whether new information can complete previous frames. An example of this is the case of a duress: an expression of duress must be linked to a previous rescission, so we check previously mentioned rescissions and link it to the most suitable one (problem [H]).

5. Finally, once all the sentences have been processed, we complete the information in the final frames using some common sense. Let us imagine for instance that we have a rescission with Manifester A and Manifestee B, together with a duress where B is the Manifester but where the text gives no information about the threatened party. Since we know the two parties in the contract, we can derive that the Manifestee must be A.

6. The last step simply transforms the information in our frames into PROLEG clauses, including reversing the replacements made during the preprocessing step. Since all the relevant information has been stored in our standard frames, the translation into PROLEG clauses is straightforward (each frame yields one or two clauses whose arguments are among the data in the frame); a minimal sketch of this translation is shown after this list.
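As a minimal sketch of this last step, assuming the hypothetical Purchase frame class sketched in Sect. 3, the translation amounts to serializing the frame fields into a clause. The predicate name and argument order below are placeholders, not the actual clauses expected by the PROLEG rulebase of Fig. 3.

```java
// Illustrative only: predicate name and argument order are assumptions.
class PrologWriter {
    static String toClause(Purchase p) {
        return String.format("purchase_contract(%s, %s, %s, %s, %d, %s).",
                p.id, p.buyer, p.seller, p.item, p.price, p.date);
    }
}
```

In the same spirit, each frame type would have its own such serializer, producing the one or two clauses that the frame contributes.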

Fig. 5. CoreNLP online demo's (http://corenlp.run/) output of the enhanced dependencies of the three example sentences, where the challenges of paraphrasing become evident. We can see for instance how the buyer (PartB) and the seller (PartA) play different roles in the sentence depending on the voice (active/passive) or the semantics of the verb expressing the purchase frame (buy/sell/establish a contract). Other information, such as the date, can also be expressed differently.

Fig. 6. The pipeline of our framework.

It must be noted that our framework's NLP pipeline does not include the coreference annotator from CoreNLP. Although it was included in a first version of our framework, we found that it presented some limitations when dealing with the same event referred to both by nouns and by verbs. While it could detect, for instance, that in "The rescission of the contract was done on 1 February, 2018. This rescission was cancelled later" there was a coreference, it did not succeed in cases such as "A rescinded the contract with B. This rescission was cancelled later". We therefore developed our own algorithm to detect previous, potentially similar events and merge them, executed in step 4. Figure 6 depicts the pipeline of our framework. An illustrative sketch of this kind of merging is shown below.
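A minimal sketch of this merging step, assuming the hypothetical Rescission class from Sect. 3 (the real algorithm may use richer compatibility criteria), could look like this:

```java
import java.util.List;

class EventMerger {
    // Link a new mention (e.g., "This rescission") to the most recent compatible rescission.
    // Illustrative only: ContractFrames may check more attributes than the manifester.
    static Rescission findAntecedent(List<Rescission> previous, String manifesterHint) {
        for (int i = previous.size() - 1; i >= 0; i--) {      // most recent mention first
            Rescission r = previous.get(i);
            if (manifesterHint == null || manifesterHint.equals(r.manifester)) {
                return r;                                     // first compatible rescission wins
            }
        }
        return null;                                          // nothing to merge with
    }
}
```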

6 Results

The code of ContractFrames is publicly available in a GitHub repository. A dataset with several different texts and their expected output is included in the repository. These texts were generated by legal researchers, some of whom took no part in the development of the framework. The texts include different types of paraphrasing, both semantic and syntactic, such as the examples depicted in Fig. 5. Different levels of nesting and events are also represented in the dataset, which includes texts of different lengths describing the workflow of a contract (the establishment of the contract, a rescission or a duress), and even surrounding facts not exploited by the system.

Fig. 7. The Contract Workflow Ontology.

Fig. 8. Visualization of our custom annotations in XML.

The code, written in Java, provides two main classes. The first allows the user to input any text and returns it in the form of PROLEG clauses. The second processes all the files in the dataset, which is also provided and can be extended by the user simply by adding new files.

Besides the logical clause output format, an ontology able to express contracts has been developed. This ontology, called the Contract Workflow Ontology, is capable of representing the different types of events processed, such as agreements and rescissions, as well as others in the workflow of a general contract, such as negotiation. A method for generating output in the form of triples is provided; an illustrative sketch is shown below. Figure 7 shows the Contract Workflow Ontology.
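For illustration, such triple output could be generated with an RDF library such as Apache Jena; the namespace and term names below are placeholders, not the actual IRIs of the Contract Workflow Ontology, and Jena itself is used here only as an example library.

```java
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.vocabulary.RDF;

public class TripleWriter {
    // placeholder namespace and terms, not the actual Contract Workflow Ontology IRIs
    static final String CWO = "http://example.org/cwo#";

    public static void main(String[] args) {
        Model model = ModelFactory.createDefaultModel();
        // express the purchase contract of Fig. 1 as triples
        model.createResource(CWO + "contract0")
                .addProperty(RDF.type, model.createResource(CWO + "PurchaseContract"))
                .addProperty(model.createProperty(CWO, "buyer"), model.createResource(CWO + "personA"))
                .addProperty(model.createProperty(CWO, "seller"), model.createResource(CWO + "personB"))
                .addLiteral(model.createProperty(CWO, "price"), 200000);
        model.write(System.out, "TURTLE");   // serialize the triples
    }
}
```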

Additionally, the system generates an XML output that allows the visualization of the inner custom annotations in the text, namely events and named entities such as contracts. An example of this visualization (using the tool GATE [5]) can be found in Fig. 8.

7 Conclusions

In this paper we have presented ContractFrames, a framework that processes input text written in natural language, including technical legal terminology, and produces as output its relevant information in the form of PROLEG logical clauses. The framework can recognize different kinds of events, analyze whether they are actual facts and extract the important related information in the form of a frame, deriving omitted information to some extent. An ontology has also been created as a data model to store the relevant information of the case and compose the whole contract workflow. In this way, any other logical system might also benefit from the processing and the information extracted, and would be able to generate a custom output from this knowledge representation.

Finally, although our framework has been able to successfully process all the example texts produced by legal researchers not involved in its coding, it still has some limitations that will be handled as future work. We therefore have several research lines for improving ContractFrames. For now, the framework is not able to recognize appositions or naming statements such as "the contract, entitled from now 'contract0', (...)". Although these expressions are not very common, we consider that exploring new rules for detecting these kinds of alternative paraphrasing, to be eventually added to our framework, will be useful. More frames able to represent other legal situations, and rule sets for populating them, will also be developed, as well as common-sense techniques to derive non-explicit information.