Keywords

1 Introduction

With the maturation and growing power of interactive proof systems, the body of formalized mathematics and engineering is dramatically increasing. The Isabelle Archive of Formal Proof (AFP) [6], created in 2004, counted in 2015 a total of 215 articles, whereas the count stood at 413 only three years later. An in-depth empirical analysis shows that both complexity and size increased accordingly [11]. Together with the AFP, there is also a growing body on articles concerned with formal software engineering issues such as standardized language definitions (e. g., [15, 21]), data-structures (e. g., [14, 24]), hardware-models (e. g., [20]), security-related specifications (e. g., [13, 26]), or operating systems (e. g., [22, 27]).

This development raises interest in at least two ways: First, there is a substantial potential of retrieve and reuse of formal developments, and second, formal techniques allow a deeper checking of documents containing formal specifications, proofs and tests. This paves the way for collaborative, continuously machine-checked developments of certification documents involving both formal as well of informal content evolution.

We are focusing in this paper on the latter aspect. Certification documents have to follow a structure which is relatively strictly defined in certification standards such as [16, 17]. In practice, large groups of developers have to produce a substantial set of documents where the consistency is notoriously difficult to maintain. In particular, certifications are centered around the traceability of requirements throughout the entire set of documents. While technical solutions for the traceability problem exists (most notably: DOORS [7]), they are weak in the treatment of formal entities (such as formulas and their logical contexts).

Enforcing a document structure is done by annotations with meta-information; the language in which the latter is defined is widely called a document ontology (an equivalent term is vocabulary) in the semantic web community [3], i. e., a machine-readable form of the structure of a document and the document discourse. Let us consider a set of text elements available in a given corpus. These elements may be sentences or paragraphs, figures, tables, definitions or lemmas, code, and, for example, the results of test-executions. By annotation, we make links explicit that may exist between an ontology concept and a document element of the considered corpus. While ontologies as such can be used for a variety of applications, this paper is concerned with the representation of a mixture formal and semi-formal content (as it is, e. g., very common in documents within a software development process). Therefore, we also discuss the mapping to a concrete target document format (e. g., PDF) that, e. g., might be used within a traditional certification process.

In this paper, we present the concepts of our Document Ontology Framework (DOF) designed for building scalable and user-friendly tools on top of interactive theorem provers, and an implementation of DOF called Isabelle/DOF. Isabelle/DOF supports both defining ontologies and documents that conform to one or more ontologies. An example-driven introduction into Isabelle/DOF also presenting details of the user-interaction in the IDE can be found elsewhere [12]. In this paper, we are focusing on the fundamental concepts of its ontology definition language ODL and the more technical issues of its implementation. In particular, we present novel concepts such as meta-types-as-types, class-invariants, monitors, inner-syntax antiquotations as well as their interaction.

The rest of the paper is structured as follows: after explicating the underlying assumptions in a generic document model, we present the design of DOF as a language in Sect. 3. It follows a presentation of the implementation of Isabelle/DOF (Sect. 4) and a discussion on related and future work (Sect. 5).

2 Background: The Document Model

In this section, we introduce the assumed document model underlying DOF in general; in particular the concepts integrated document, sub-document, text-element and semantic macros occurring inside text-elements. Furthermore, we assume two different levels of parsers (for outer and inner syntax) where the inner-syntax is basically a typed \(\lambda \)-calculus and some Higher-order Logic (HOL).

Fig. 1.
figure 1

A theory-graph in the document model.

We assume a hierarchical document model, i. e., an integrated document consist of a hierarchy sub-documents (files) that can depend acyclically on each other. Sub-documents can have different document types in order to capture documentations consisting of documentation, models, proofs, code of various forms and other technical artifacts. We call the main sub-document type, for historical reasons, theory-files. A theory file consists of a header, a context definition, and a body consisting of a sequence of commands (Fig. 1). Even the header consists of a sequence of commands used for introductory text elements not depending on any context. The context-definition contains an and a section, for example:

figure c

where is the abstract name of the text-file, Main refers to an imported theory (recall that the import relation must be acyclic) and are used to separate commands from each other.

We distinguish fundamentally two different syntactic levels:

  1. 1.

    the outer-syntax (i. e., the syntax for commands) is processed by a lexer-library and parser combinators built on top, and

  2. 2.

    the inner-syntax (i. e., the syntax for \(\lambda \)-terms in HOL) with its own parametric polymorphism type checking.

On the semantic level, we assume a validation process for an integrated document, where the semantics of a command is a transformation for some system state . This document model can be instantiated with outer-syntax commands for common text elements, e. g., or . Thus, users can add informal text to a sub-document using a text command:

figure j

This will type-set the corresponding text in, for example, a PDF document. However, this translation is not necessarily one-to-one: text elements can be enriched by formal, i. e., machine-checked content via semantic macros, called antiquotations:

figure k

which is represented in the final document (e. g., a PDF) by:

figure l

Semantic macros are partial functions of type ; since they can use the system state, they can perform all sorts of specific checks or evaluations (type-checks, executions of code-elements, references to text-elements or proven theorems such as , which is the reference to the axiom of reflexivity).

Semantic macros establish formal content inside informal content; they can be type-checked before being displayed and can be used for calculations before being typeset. They represent the device for linking the formal with the informal.

Implementability of the Assumed Document Model. Batch-mode checkers for DOF can be implemented in all systems of the LCF-style prover family, i. e., systems with a type-checked , and abstract -type for theorems (protected by a kernel). This includes, e. g., ProofPower, HOL4, HOL-light, Isabelle, as well as Coq and its derivatives. DOF is, however, designed for fast interaction in an IDE. If a user wants to benefit from this experience, only Isabelle and Coq have the necessary infrastructure of asynchronous proof-processing and support by an IDE [10, 18, 28, 29]. For our implementation of DOF, called Isabelle/DOF, we are using the Isabelle platform [25]. Figure 2 shows a screen-shot of an introductory paper on Isabelle/DOF [12] presenting a number of application scenarios and user-interface aspects. On the left, we represented the Isabelle/DOF IDE, while on the right, the generated presentation in PDF is shown.

Fig. 2.
figure 2

The Isabelle/DOF IDE (left) and the corresponding PDF (right).

Isabelle provides, beyond the features required for DOF, a lot of additional benefits. For example, it also allows the asynchronous evaluation and checking of the document content [10, 28, 29] and is dynamically extensible. Its IDE provides a continuous build, continuous check functionality, syntax highlighting, and IntelliSense-like auto-completion. It also provides infrastructure for displaying meta-information (e. g., binding and type annotation) as pop-ups, while hovering over sub-expressions. A fine-grained dependency analysis allows the processing of individual parts of theory files asynchronously, allowing Isabelle to interactively process large (hundreds of theory files) documents. Isabelle can group sub-documents into sessions, i. e., sub-graphs of the document-structure that can be “pre-compiled” and loaded instantaneously, i. e., without re-processing.

3 The DOF Design

DOF consists basically of two parts: 1. the declaration of new keywords and new commands allowing for the specification of ontological concepts in our Ontology Definition Language (ODL), and 2. the definition of text-elements that are “ontology-aware,” i. e., perform the necessary checks to ensure compliance to an imported ontology. This represents a partial instantiation of the underlying generic document model. The document language can be extended (recall the -section) dynamically, i. e., new user-defined can be introduced at run-time. This is similar to the definition of new functions in an interpreter.

We illustrate the design of DOF by modeling a small ontology that can be used for writing formal specifications that, e. g., could build the basis for an ontology for certification documents used in processes such as Common Criteria [17] or CENELEC 50128 [16].Footnote 1 Moreover, in examples of certification documents, we refer to a controller of a steam boiler that is inspired by the famous steam boiler formalization challenge [9].

3.1 Ontology Modeling in ODL

Conceptually, ontologies specified in ODL consist of:

  • document classes (syntactically marked by the keyword) that describe concepts;

  • an optional document base class expressing single inheritance extensions;

  • attributes specific to document classes, where

    • attributes are typed;

    • attributes of instances of document elements are mutable;

    • attributes can refer to other document classes, thus, document classes must also be HOL-types (such attributes are called links);

  • a special link, the reference to a super-class, establishes an is-a relation between classes;

  • classes may refer to other classes via a regular expression in a where clause (classes with a where clauses are called monitor classes);

  • attributes may have default values in order to facilitate notation.

A major design decision of ODL is to denote attribute values by HOL-terms and HOL-types. Consequently, ODL can refer to any predefined type defined in the HOL library, e. g., or as well as parameterized types, e. g., , , , or products . As a consequence of the document model, ODL definitions may be arbitrarily intertwined with standard HOL type definitions. Finally, document class definitions result in themselves in a HOL-types in order to allow links to and between ontological concepts.

figure y

Listing 1.1 shows an example ontology for mathematical papers (an extended version of this ontology was used for writing [12], also recall Fig. 2). The commands (modeling fixed enumerations) and (defining type synonyms) are standard mechanisms in HOL systems. Since ODL is an add-on, we have to quote sometimes constant symbols (e. g., ) to avoid confusion with predefined keywords. ODL admits overriding (such as in the document class ), where it is set to another library constant, but no overloading. All elements have an optional attribute, which will be used in the output generation for the decision if the context is a section header and its level (e. g., chapter, section, subsection). While within an inheritance hierarchy overloading is prohibited, attributes may be re-declared freely in independent parts (as is the case for ).

3.2 Meta-Types as Types

To express the dependencies between text elements to the formal entities, e. g., (\(\lambda \)-term), , or , we represent the types of the implementation language inside the HOL type system. We do, however, not reflect the data of these types. They are just declared abstract types, “inhabited” by special constant symbols carrying strings, for example of the format . When HOL expressions were used to denote values of instance attributes, this requires additional checks after conventional type-checking that this string represents actually a defined entity in the context of the system state . For example, the attribute in the previous section is the power of the ODL: here, we model a relation between claims and results which may be a formal, machine-check theorem of type denoted by, for example: in a system context where this theorem is established. Similarly, attribute values like require that the HOL-string is again type-checked and represents indeed a formula in \(\theta \). Another instance of this process, which we call second-level type-checking, are term-constants generated from the ontology such as . For the latter, the argument string must be checked that it represents a reference to a text-element having the type according to the ontology in Listing 1.1.

3.3 Annotating with Ontology Meta-Data: Outer Syntax

DOF introduces its own family of text-commands, which allows having side effects of the global context and thus to store and manage own meta-information. Among others, DOF provides the commands , , or . Here, the argument is a syntax for declaring instance, class and attributes for this text element, following the scheme

figure ba

The can be omitted, which represents the implicit superclass , where must be declared attributes in the class and where the HOL must have the corresponding HOL type. Attributes from a class definition may be left undefined; definitions of attribute values override default values or values of super-classes. Overloading of attributes is not permitted in DOF.

We can now annotate a text as follows. First, we have to place a particular document into the context of our conceptual example ontology (Listing 1.1):

figure bf

This opens a new document (theory), called Steam_Boiler that imports our conceptual example ontology “tiny_cert” (stored in a file tiny_cert.thy).Footnote 2

Now we can continue to annotate our text as follows:

figure bg

Where is a predefined macro for (similarly ). The macro assumes a class-id referring to a class that has a attribute. We continue our example text:

figure bm

The first text element in this example fragment defines the text entity and also references the formerly defined text element (which will be represented in the PDF output, for example, by a text anchor “Section 1” and a hyperlink to its beginning). The antiquotation , which is automatically generated from the ontology, is immediately validated (the link to is defined) and type-checked (it is indeed a link to an text-element). Moreover, the IDE automatically provides editing and development support such as auto-completion or the possibility to “jump” to its definition by clicking on the antiquotation. The consistency checking ensures, among others, that the final document will not contain any “dangling references” or references to entities of another type.

DOF as such does not require a particular evaluation strategy; however, if the underlying implementation is based on a declaration-before-use strategy, a mechanism for forward declarations of references is necessary:

figure bs

This command declares the existence of a text-element and allows for referencing it, although the actual text-element will occur later in the document.

3.4 Editing Documents with Ontology Meta-Data: Inner Syntax

We continue our running example as follows:

figure bt

As mentioned earlier, instances of document classes are mutable. We use this feature to modify meta-data of these text-elements and “assign” them to the property-list afterwards and add results from Isabelle definitions and proofs. The notation stands for . This mechanism can also be used to define the required relation between claims and results required in the -relation required in a .

3.5 ODL Class Invariants

Ontological classes as described so far are too liberal in many situations. For example, one would like to express that any instance of a class finally has a non-empty property list, if its is , or that the relation between and is surjective.

In a high-level syntax, this type of constraints could be expressed, e. g., by:

figure ce

where , , and are the set of all possible instances of these document classes. All specified constraints are already checked in the IDE of DOF while editing; it is however possible to delay a final error message till the closing of a monitor (see next section). The third constraint enforces that the user sets the set, otherwise an error will be reported.

3.6 ODL Monitors

We call a document class with an accept-clause a monitor. Syntactically, an accept-clause contains a regular expression over class identifiers. We can extend our ontology with the following example:

figure ck

Semantically, monitors introduce a behavioral element into ODL:

figure cl

Inside the scope of a monitor, all instances of classes mentioned in its accept-clause (the accept-set) have to appear in the order specified by the regular expression; instances not covered by an accept-set may freely occur. Monitors may additionally contain a reject-clause with a list of class-ids (the reject-list). This allows specifying ranges of admissible instances along the class hierarchy:

  • a superclass in the reject-list and a subclass in the accept-expression forbids instances superior to the subclass, and

  • a subclass S in the reject-list and a superclass T in the accept-list allows instances of superclasses of T to occur freely, instances of T to occur in the specified order and forbids instances of S.

Monitored document sections can be nested and overlap; thus, it is possible to combine the effect of different monitors. For example, it would be possible to refine the section by its own monitor and enforce a particular structure in the presentation of examples.

Monitors manage an implicit attribute containing the list of “observed” text element instances belonging to the accept-set. Together with the concept of ODL class invariants, it is possible to specify properties of a sequence of instances occurring in the document section. For example, it is possible to express that in the sub-list of -elements, the first has an element with a strictly smaller than the others. Thus, an introduction is forced to have a header delimiting the borders of its representation. Class invariants on monitors allow for specifying structural properties on document sections.

3.7 Document Representation

Up to now, we discussed the support of ontological concepts in the context of an IDE, i. e., a rather dynamic environment that, e. g., allows for interactive querying and displaying of information. Certification processes often require “static” documents, e. g., in a format such as PDF/A that are designed for archiving and long-term preservation of electronic documents, are required.

While many concepts of ODL can easily be mapped to such static formats, more dynamic features (e. g., references) requires additional considerations such as ensuring that references point to text elements that have a unique identifier that is visible in the actual document representation. Currently, the definition of a static document representation is not part of DOF itself and, thus, depends on the underlying implementation. We refer the reader to Sect. 4.6 for details.

4 The Isabelle/DOF Implementation

In this section, we describe the basic implementation aspects of Isabelle/DOF, which is based on the following design-decisions:

  • the entire Isabelle/DOF is a “pure add-on,” i. e., we deliberately resign on the possibility to modify Isabelle itself.

  • we made a small exception to this rule: the Isabelle/DOF package modifies in its installation about 10 lines in the generator thy_output.ML which greatly simplifies the architecture.Footnote 3

  • we decided to make the markup-generation by itself to adapt it as well as possible to the needs of tracking the linking in documents.

  • Isabelle/DOF is deeply integrated into the Isabelle’s IDE (PIDE) to give immediate feedback during editing and other forms of document evolution.

Semantic macros, as required by our document model, are called document antiquotations in the Isabelle literature [30]. While Isabelle’s code-antiquotations are an old concept going back to Lisp and having found via SML and OCaml their ways into modern proof systems, special annotation syntax inside documentation comments have their roots in documentation generators such as Javadoc. Their use, however, as a mechanism to embed machine-checked formal content is usually very limited and also lacks IDE support.

4.1 Writing Isabelle/DOF as User-Defined Plugin in Isabelle/Isar

A plugin in Isabelle starts with defining the local data and registering it in the framework. As mentioned before, contexts are structures with independent cells/compartments having three primitives , and . Technically this is done by instantiating a functor , and the following fairly typical code-fragment is drawn from Isabelle/DOF:

figure cy

where the table manages document classes and the environment for class definitions (inducing the inheritance relation). Other tables capture, e. g., the class invariants, inner-syntax antiquotations.

All the text samples shown here have to be in the context of an SML file or in an command inside a theory file.

Operations follow the model-view-controller paradigm, where Isabelle/Isar provides the controller part. A typical model operation has the type:

figure dc

representing a transformation on system contexts. For example, the operation of declaring a local reference in the context is presented as follows:

figure dd

where is the update function resulting from the instantiation of the functor . This code fragment uses operations from a library structure that were used to update the appropriate table for document objects in the plugin-local state. Possible exceptions to the update operation were mapped to a system-global error reporting function.

Finally, the view-aspects were handled by an API for parsing-combinators. The library structure provides the operators:

figure di

for alternative, sequence, and piping, as well as combinators for option and repeat. Parsing combinators have the advantage that they can be smoothlessly integrated into standard programs, and they enable the dynamic extension of the grammar. There is a more high-level structure providing specific combinators for the command-language Isar:

figure dk

The “model” and “new” parts were combined via the piping operator and registered in the Isar toplevel:

figure dn

Altogether, this gives the extension of Isabelle/HOL with Isar syntax and semantics for the new command:

figure do

The construction also generates implicitly some markup information; for example, when hovering over the command in the IDE, a popup window with the text: “declare document reference” will appear.

4.2 Programming Antiquotations

The definition and registration of text antiquotations and ML-antiquotations is similar in principle: based on a number of combinators, new user-defined antiquotation syntax and semantics can be added to the system that works on the internal plugin-data freely. For example, in

figure dq

the text antiquotation is declared and bounded to a parser for the argument syntax and the overall semantics. This code defines a generic antiquotation to be used in text elements such as

figure ds

The subsequent registration binds code to a ML-antiquotation usable in an ML context for user-defined extensions; it permits the access to the current “value” of document element, i. e.; a term with the entire update history.

It is possible to generate antiquotations dynamically, as a consequence of a class definition in ODL. The processing of the ODL class also generates a text antiquotation , which works similar to except for an additional type-check that assures that is a reference to a definition. These type-checks support the subclass hierarchy.

4.3 Implementing Second-Level Type-Checking

On expressions for attribute values, for which we chose to use HOL syntax to avoid that users need to learn another syntax, we implemented an own pass over type-checked terms. Stored in the late-binding table , we register for each inner-syntax-annotation (ISA’s), a function of type

figure dz

Executed in a second pass of term parsing, ISA’s may just return . This is adequate for ISA’s just performing some checking in the logical context ; ISA’s of this kind report errors by exceptions. In contrast, transforming ISA’s will yield a term; this is adequate, for example, by replacing a string-reference to some term denoted by it. This late-binding table is also used to generate standard inner-syntax-antiquotations from a .

4.4 Programming Class Invariants

For the moment, there is no high-level syntax for the definition of class invariants. A formulation, in SML, of the first class-invariant in Sect. 3.5 is straight-forward:

figure ed

The -command (last line) registers the function into the Isabelle/DOF kernel, which activates any creation or modification of an instance of . We cannot replace by the corresponding antiquotation , since is bound to a variable here and can therefore not be statically expanded.

Isabelle’s code generator can in principle generate class invariant code from a high-level syntax. Since class-invariant checking can result in an efficiency problem—they are checked on any edit—and since invariant programming involves a deeper understanding of ontology modeling and the Isabelle/DOF implementation, we backed off from using this technique so far.

4.5 Implementing Monitors

Since monitor-clauses have a regular expression syntax, it is natural to implement them as deterministic automata. These are stored in the for monitor-objects in the Isabelle/DOF component. We implemented the functions:

figure el

where is basically a map between internal automaton states and class-id’s ( ’s). An automaton is said to be enabled for a class-id, iff it either occurs in its accept-set or its reject-set (see Sect. 3.6). During top-down document validation, whenever a text-element is encountered, it is checked if a monitor is enabled for this class; in this case, the -operation is executed. The transformed automaton recognizing the rest-language is stored in if possible; otherwise, if fails, an error is reported. The automata implementation is, in large parts, generated from a formalization of functional automata [23].

4.6 Document Representation

Isabelle/DOF can generate PDF documents, using a -backend (for end users there is no need to edit -code manually). For PDF documents, a specific representation, including a specific layout or formatting of certain text types (e. g., title, abstract, theorems, examples) is required: for each ontological concept (using the -command), a representation for the PDF output needs to be defined. The -setup of Isabelle/DOF provides the -command and an inheritance-based dispatcher, i. e., if for a concept no -representation is defined, the representation of its super-concept is used.

Recall the document class from our example ontology (Listing 1.1). The following -code (defined in a file tiny_cert.sty) defines the representation for abstracts, re-using the the standard -environment:

figure fa

The takes the name of the concept as first argument, followed by a list of parameters that is the same as the parameters used in defining the concept with . Within the definition section of the command, the main argument (written in the actual document within ) is accessed using . The parameters can be accessed using the -command. In our example, we print the abstract within -environment of . Moreover, we test if the parameters and are non-empty and, if yes, print them as part of the abstract.

5 Conclusion and Related Work

5.1 Related Work

Our work shares similarities with existing ontology editors such as Protégé [5], Fluent Editor [1], NeOn [2], or OWLGrEd [4]. These editors allow for defining ontologies and also provide certain editing features such as auto-completion. In contrast, Isabelle/DOF does not only allow for defining ontologies, directly after defining an ontological concept, they can also be instantiated and their correct use is checked immediately. The document model of Jupyter Notebooks [8] comes probably closest to our ideal of a “living document.”

Finally, the that is generated as intermediate step in our PDF generation is conceptually very close to SALT [19], with the difference that instead of writing manually it is automatically generated and its consistency is guaranteed by the document checking of Isabelle/DOF.

5.2 Conclusion

We presented the design of DOF, an ontology framework designed for formal documents developed by interactive proof systems. It foresees a number of specific features—such as monitors, meta-types as-types or semantic macros generated from a typed ontology specified in ODL—that support the specifics of such documents linking formal and informal content. As validation of these concepts, we present Isabelle/DOF, an implementation of DOF based on Isabelle/HOL. Isabelle/DOF is unique in at least one aspect: it is an integrated environment that allows both defining ontologies and writing documents that conform to a set of ontological rules, and both are supported by editing and query features that one expects from a modern IDE.

While the batch-mode part of DOF can, in principle, be re-implemented in any LCF-style prover, Isabelle/DOF is designed for fast interaction in an IDE. It is this feature that allows for a seamless development of ontologies together with validation tests checking the impact of ontology changes on document instances. We expect this to be a valuable tool for communities that still have to develop their domain specific ontologies, be it in mathematical papers, formal theories, formal certifications or other documents where the consistency of formal and informal content has to be maintained under document evolution. Today, in some areas such as medicine and biology, ontologies play a vital role for the retrieval of scientific information; we believe that leveraging these ontology-based techniques to the field of formal software engineering can represent a game changer.

Availability. The implementation of the framework is available at https://git.logicalhacking.com/Isabelle_DOF/Isabelle_DOF/. Isabelle/DOF is licensed under a 2-clause BSD license (SPDX-License-Identifier: BSD-2-Clause).