Keywords

These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

1 Introduction

Algebraic data types (structures and variants) and associative arrays are fundamental building blocks when representing, grouping and handling complex data efficiently. However, operations manipulating them are rarely concerned with the entire compound input data structure. Most frequently, they depend only on a limited subset of their input. A complete specification of such an operation will not only stipulate that the output possesses a certain property but will also include its framing requirements, i.e. the part of the input that it operates on. Specifying and proving the preservation of logical properties for the unmodified part is a particular manifestation of the more general frame problem [8] – a notoriously cumbersome task in formal software verification, imposing unnecessary manual effort [9].

The verification of a given property can be simplified if we can determine the input fragments on which the property depends. This is the purpose of the dependency analysis presented in this paper. Our analysis targets a functional language that handles immutable algebraic data types and arrays. Furthermore, it is designed to be used on programs as well as on specifications. In contrast to the vast majority of static analyses that are mostly used only on actual code and in an essentially purely automatic setting, our analysis is thought of as a companion tool to be exploited in the middle of interactive program verification.

1.1 A Motivating Example

The work reported in this paper is motivated by the formal verification of operating systems. To illustrate the problem that we are addressing, consider an abstract process manager and the data structures for its fundamental components: process and thread, shown in Fig. 1a. A process is an executing instance of an application that can consist of multiple threads that share the same address space. A thread is a path of execution within a process and it is modeled as a structure having fields such as the thread’s identifier and the memory region for its stack. The current state of a thread is defined as a variant having three alternatives: READY, BLOCKED, RUNNING. Similarly, a process is a structure including an identifier for the currently running thread and an array of possibly inactive threads associated with it. Whether a thread in the thread array is active or has terminated is indicated by a variant of type \({{\texttt {option\_thread = | Some(thread\,\,\,th) | None}}}\).

Fig. 1.
figure 1

Example - data structures and functions of an abstract process manager

The signature of a Boolean function disjoint_stacks, written in a modeling language that we will present in Sect. 2, is shown in Fig. 1b. It verifies a fundamental property of a valid process state, namely that the stack regions of all active threads associated with the input pr are disjoint. Its result depends only on the array threads of the input pr and for each active thread element only on the field stack. All other input sub-elements are irrelevant to the result.

Another function run_thread has two possible execution scenarios: true and invalid_id. It stops the currently running thread and starts the one having the identifier given as an input. If it is valid, then a new process new_pr is returned for which current_thread is set to new_id. In the array threads, the state of the thread identified by new_id is set to RUNNING. The function’s precondition stipulates that the disjoint_stack property holds for the input pr and that the input thread is READY. Proving the property’s preservation is intuitively easy once the function’s effects and the input subset on which disjoint_stack depends are known. Automatically proving the preservation of invariants concerning only fields or elements that have not been altered by a transition in the system would considerably diminish the number of proof obligations.

This is precisely the issue that we are addressing: the delimitation of the input subset on which the output depends, given an operation with a compound input. We define dependency as the observed part of a structured domain and strive to obtain type-sensitive results, distinguishing between the sub-elements of arrays and algebraic data types and capturing the dependency specific to each. The targeted results mirror – in terms of dependency – the layered structure of compound data types.

Generally, our dependency analysis targets complex transition systems. These are characterized by states defined by compound data structures and transitions, i.e. state changes, that map an input state to an output state. In particular, we are applying it to an abstract model of an operating system, stemming from ProvenCore [6], an ongoing project revolving around a fully secure micro-kernel.

ProvenCore, inspired by MINIX 3.1, is a general-purpose micro-kernel that ensures isolation. Its proof is based on multiple refinements between successive models, from the most abstract, on which the isolation property is defined and proved, to the most concrete, i.e. the actual model used for code generation.

The global states of the abstract layers are complex structures with multiple compound fields. Commands such as fork, exec, exit can be executed. Each of these receives as an input the global state before executing the command and returns the state of the system after execution. Most supported commands affect only a handful of invariants, leading to a much more complex, but fundamentally similar version of the situation depicted by our introductory example.

Outline. The rest of this paper is structured as follows: in Sect. 2 we underline the specificities of our modeling language. The defined abstract domain of dependencies is described in Sect. 3. It is followed in Sect. 4 by an in-depth presentation of our analysis at an intraprocedural level and in Sect. 5 by a summary of it at an interprocedural level. In Sect. 6 we discuss the results obtained on two abstract layers of ProvenCore. Finally, in Sect. 7 we review related work.

2 The Modeling Language

In this section we present the unified programming and specification language that we will be analyzing. It is an idealized version of a language developed at Prove & Run Footnote 1 and designed to facilitate proofs and to allow users to write both the implementation and the specification of programs. It is purely functional, side-effect free and strongly-typed. The basic building blocks of programs written in our language are predicates, the equivalent of functions in common programming languages. In addition to the common built-in primitive types traditionally available, structures and variants are also provided. The language is designed to write code that will subsequently be proven, so it allows the definition of various types of logical specifications, ranging from pre- and postconditions, local assertions and loop invariants to inductive predicates.

2.1 Types and Statements

For defining the language we are working on, we let \(\mathbb {T}\) be the universe of type identifiers and \(T_0 \subset \mathbb {T}\) the set of base types identifiers. Furthermore, let \(\mathscr {F}\) be the set of structure field identifiers and \(\mathscr {C}\) the set of variant constructors.

\(\begin{array}{llll} t := &{} \arrowvert \;\tau \in T_0 &{} &{}\text {base types} \\ &{} \arrowvert \;\varvec{structure}\{f_1 : t,\ldots , f_n : t\}&{} f_i \in \mathscr {F}, 1 \le i \le n &{}\text {structures} \\ &{} \arrowvert \;\varvec{variant} [C_1 (t \; e_1) \; \arrowvert \; \ldots \; \arrowvert \; C_m (t \; e_m)]&{} C_i \in \mathscr {C} , 1 \le i \le m &{} \text {variants} \\ &{} \arrowvert \;\varvec{array}^t\langle t \rangle &{} &{}\text {arrays}\\ \end{array}\)

A structure is a data type grouping elements of different types called fields and represents the Cartesian product of its fields’ types. A variant is the disjoint union of different types. It represents data that may take on multiple forms, where each form is marked with a specific tag called the constructor. Arrays group a collection of data of the same type (given in angle brackets) into a single entity; each element is selected by an index whose type is included (as denoted by the superscript) in the array’s definition.

A program in our language is a collection of predicates. A predicate has input and output parameters and a body of statements of the form shown in Table 1.

Table 1. Supported statements

The first statement represents a generic predicate call and is described later in Sect. 2.2. All other statements could be seen as special cases of it, representing calls to built-in predicates. Statement \(\text {(2)}\) is a call to the “=” predicate, that checks whether its two inputs are equal. Similarly, \(\text {(3)}\) is a call to the assignment predicate “:=”. Both are generic and can be applied on any supported type of the language.

The statements (4)–(8) are structure-related. \(\text {(4)}\) creates a new structure s with \(e_1; \ldots ; e_n\) as field values. \(\text {(5)}\) returns the values of all the fields of s in the output parameters \(o_1; \ldots ; o_n\). Statement \(\text {(6)}\) returns the value of the \(f_i\) field. As previously mentioned, we are focusing on a purely functional language and consider immutable algebraic data structures and arrays. Therefore, setting the value of a structure’s field, shown in \(\text {(7)}\), returns a new structure where all fields have the same value as in s, except \(f_i\) which is set to \(e_i\). Statement \(\text {(8)}\) verifies if the values of the given subset of fields of two structures \(s'\) and \(s''\) are equal.

Statements \(\text {(9)}-\text {(11)}\) are variant-related. \(\text {(9)}\) creates a new variant v using the constructor \(C_p\) with \(e_p\) as an argument. Statement \(\text {(10)}\) is used for matching on the different constructors of an input variant v. Statement \(\text {(11)}\) verifies if the input variant v was created with one of the constructors in \(\{C_1, \ldots , C_k \}\). This could be obtained with a variant switch, but for practical considerations it has been provided as a built-in predicate.

The last two statements are array-related. \(\text {(12)}\) returns the value of the i-th cell of the input array a. Similarly to \(\text {(7)}\), updating the i-th cell of an array – shown in \(\text {(13)}\) – has a functional nature. It returns a new array where all cells have the same value as in a, except the i-th cell which is set to e.

2.2 Exit Labels

Besides input and output parameters, the declaration of a predicate also includes a non-empty set of exit labels. When called, a predicate exits with one of the specified exit labels, thus summarizing and returning to its callers further information regarding its execution.

Table 2. Statements and their exit labels

Exit labels constitute the main specificity of the language. They can denote different exceptional execution scenarios and act as exit codes, similarly to exceptions and exit status return values in other programming languages. For example, the predicate run_thread(process pr, int new_id) introduced in Sect. 1 has two exit labels: true, corresponding to a successful execution and invalid_id, indicating that the given identifier is invalid. Labels also offer a convenient way to model a Boolean result. Frequently, a Boolean output value can be replaced by declaring two possible exit labels: true for a successful execution of the predicate and false for its opposite. This is illustrated by the previously defined property disjoint_stacks(process pr) (in Fig. 1b).

Exit labels play an important role with respect to control flow management. Complex control flow is expressed and directed by catching and transforming labels. Furthermore, they condition the existence of output parameters, as these are associated to the exit labels of a predicate. Whenever a predicate exits with an exit label \(\lambda \), all the outputs associated to it are effectively produced, whereas all other outputs are discarded. If no output is associated to an exit label, it means that no output is generated when the predicate exits with this particular label. We can now explain the generic predicate call statement (1) from Table 1: the predicate p is called with inputs \(e_1, \ldots , e_n\) and yields one of the declared exit labels \(\lambda _1, \ldots , \lambda _n\), each having its own set of associated output variables \(\bar{o}\).

As shown in Table 2, statement (10) will have an exit label corresponding to each constructor of the given input variant. Statements (2), (8) and (11) are bi-labeled, using true and false as logical values. Statements (12) and (13) are bi-labeled as well. However, unlike the previously mentioned statements, they use the label false as an “out of bounds” exception and generate an output only for the label true.

2.3 The Control Flow Graph

In the following we will work with a control flow graph representation of the predicates’ bodies. The nodes represent program states, and the edges are defined by statements with a particular exit label \(\lambda \).

The control flow graph \(G_p = (N,E)\) of a predicate p has a node \(n_i \in N\) for each program point. For each statement s at program point \(n_i\) that can execute and reach program point \(n_j\) with exit label \(\lambda _k\), an edge \((n_i, n_j)\) is added to \(G_p\) and labeled with s and \(\lambda _k\). \(G_p\) has a single entry node \(n_{in} \in N\) corresponding to the program point associated to the first statement of p. The set of exit nodes \(n_{out} \subset N\) consists of the nodes associated to each possible exit label \(\lambda _k\) of the predicate.

In practice, all the outgoing edges of a node \(n_i \in N \) bear the different cases of the same statement s found at program point \(n_i\). Thus, the edges are labeled with the same statement s and there is an edge labeled \(s, \lambda _k\) for each possible exit label \(\lambda _k\) of s. However, the analysis does not depend on this special case.

The subfigures in Fig. 2 show the control flow graph of the following predicate:

figure b

which receives a process p and an index i as inputs and returns the i-th active thread of the input process (the process and thread types are defined in Fig. 1a). If the i-th thread is inactive, it exits with the exit label None. In the case of an “out of bounds” exception, the exit label oob is returned. For better readability, Fig. 2b gives the control flow of the same predicate where we have labeled the nodes with statements of the predicate and the edges with their exit labels.

Fig. 2.
figure 2

Example – control flow graph of predicate thread

3 Abstract Domain of Dependencies

The goal of our analysis is to detect the input subset on which the outputs of a predicate may depend. More precisely, the analysis makes a conservative approximation and must guarantee that what is marked as not needed is definitely not needed.

The first step towards such results is the definition of an abstract domain of dependencies \(\mathbb {D}\), shown below. The domain \(\delta \in \mathbb {D} \) is defined inductively from the three atomic cases \(\top \), \(\oslash \) and \(\bot \), mimicking the structure of the concrete types:

\( \begin{array}{lllr} \delta := &{}\arrowvert \;\top \;\arrowvert \;\oslash \;\arrowvert \;\bot &{}\; \text {atomic cases} \\ &{} \arrowvert \;\{f_{1} \mapsto \delta _{1}; \ldots ; f_{n} \mapsto \delta _{n}\} &{} \; f_1, \ldots , f_n \text { fields} &{} \text {(i)}\\ &{} \arrowvert \; [C_{1} \mapsto \delta _{1}; \ldots ; C_{m} \mapsto \delta _{m}] &{} \; C_1, \ldots , C_m \text { constructors} &{} \text {(ii)}\\ &{} \arrowvert \;\langle \delta \rangle &{} &{} \text {(iii)} \\ &{} \arrowvert \;\langle \delta _{ default } \triangleright i \;: \; \delta _{ exc } \rangle &{} \; i \; \text {array index} &{} \text {(iv)} \end{array} \)

For atomic types the dependency is expressed in terms of the domain’s atomic cases: \(\top \) (least precise), denoting that everything is needed and \(\oslash \), denoting that nothing is needed. The third atomic case \(\bot \), denoting impossible, is explained below. The dependency of a structure (i) describes the dependency on each of its fields. For arrays we distinguish between two cases, namely arrays with a general dependency applying to all of the cells (iii) and arrays with a general dependency applying to all but one exceptional cell, for which a specific dependency is known (iv). For variants (ii), the dependency is expressed in terms of the dependencies of its constructors, expressed in terms of their arguments’ dependencies. Thus, a constructor having a dependency mapped to \(\oslash \) is one for which nothing but the tag has been read, i.e. its arguments, if any, are irrelevant for the execution. For variants, we also include the information that certain constructors cannot occur. The third atomic case – \(\bot \) – is introduced for this purpose. We perform a “possible-constructors” analysis simultaneously, which computes for each execution scenario, the subset of possible constructors for a given value, at a given program point. All constructors that cannot occur are marked as being \(\bot \). This atomic value is the lower bound of our domain and hence, the most precise value. The final abstract domain is a closure of all these combined recursively.

Table 3. \(\sqsubseteq \) – comparison of two domains

The partial order relation \(\begin{array}{lll} \sqsubseteq\subseteq & {} \mathbb {D}\; \times \mathbb {D} \end{array}\) used to compare dependency domains is detailed in Table 3. The greatest element is \(\top \) (Top) and \(\bot \) is the least (Bot). Instances of identical structure and variant types are compared pointwise (Str, Var). For arrays without known exceptional dependencies we compare the default dependencies applying to all array cells (ADef). If exceptional dependencies are known for the same cell, these are additionally compared (AI). For arrays with known exceptional dependencies for different cells, we compare each dependency on the left-hand side with each one on the right-hand side (AIJ). The comparison of \(\oslash \) with structures (\(\oslash \) Str), variants (\(\oslash \) Var) and arrays (\(\oslash \) A, \(\oslash \) AI) is a pointwise comparison between \(\oslash \) and the dependency of each sub-element.

Table 4. Join operation

The defined join operation \(\begin{array}{lll} \vee &{} : &{}\mathbb {D}\; \times \mathbb {D} \; \rightarrow \mathbb {D} \\ \end{array}\) is detailed in Table 4. It is a commutative operation for which the undisplayed cases are defined with respect to their symmetrical counterparts. The operation is total: joining incompatible domains such as a structure and a variant or two structures having different field identifiers, results in \(\top \). Join is applied pointwise on each sub-element; \(\bot \) is its identity element and \(\top \) is its absorbing element. Joining \(\oslash \) and the dependency of a structure, variant or array is applied pointwise. The value obtained by joining \( \delta \) and \( \delta '\) is an upper bound of the two:

$$\begin{aligned} \delta \sqsubseteq \delta \vee \delta ' \;\;\wedge \;\; \delta ' \sqsubseteq \delta \vee \delta ',\;\; \forall \; \delta , \delta ' \in \mathbb {D}. \end{aligned}$$

It is not a least upper bound as a consequence of the non-monotonic approximations made for arrays (rule AIJ).

Besides join, a reduction operator \( \begin{array}{lll} \oplus &{} : &{}\mathbb {D}\; \times \mathbb {D} \; \rightarrow \mathbb {D}\\ \end{array} \) has been defined as well. This is a recursive, commutative, pointwise operation. The need for such an operator is a consequence of the possible-constructors analysis that we perform simultaneously. Following the same execution path, the same constructors must be possible. Thus, the reduction operator is used in order to combine dependencies on the same execution path and consists in performing the intersection of constructors in the case of variants and the union of dependencies for all other types. Its identity element is \(\oslash \) and its absorbing element is \(\bot \). The reduce operator between \(\top \), and the dependency of a structure, variant or array is applied pointwise. Two instances of identical variant types are pointwise reduced.

Finally, the projections summarized in Table 5, have been defined on a dependency domain \( \delta \) and are used to express the data-flow equations of Sect. 4:

\(\begin{array}{lllll} \small {\text {.f}} &{} : &{}\mathbb {D} \; \rightarrow \mathbb {D} &{} \small {\text {projection of a field's dependency}} \\ \small {\text {@C}} &{} : &{}\mathbb {D} \; \rightarrow \mathbb {D} &{} \small {\text {projection of a constructor's dependency}} \\ \small {\langle {i}\rangle }&{} : &{}\mathbb {D} \; \rightarrow \mathbb {D} &{} \small {\text {projection of a cell's dependency}} \\ \small {\langle * \setminus {i}\rangle } &{} : &{}\mathbb {D} \; \rightarrow \mathbb {D} &{} \small {\text {projection of an array's dependency outside cell }}\small {i}\\ \small {\langle * \rangle } &{} : &{}\mathbb {D} \; \rightarrow \mathbb {D} &{} \small {\text {projection of an array's general dependency}} \end{array}\)

Table 5. Dependency projections

They are partial functions, and can only be applied on domains of the corresponding kind. For instance, the field projection .f only makes sense for atomic domains or structured domains with a field named f, which should be the case if the domain represents a variable of a structured type with some field f. For any of the atomic domains \( \delta _a\), applying any of the defined projections yields \( \delta _a\).

4 Intraprocedural Analysis and Data-Flow Equations

Intraprocedural Domains. Dependency information has to be kept at each point of the control flow graph, for each variable of the environment \(\varGamma \), that maps input, output and local variables to their types. An intraprocedural domain \(\varDelta : \mathcal {V} \rightarrow \delta \) is thus a mapping from variables to dependencies, and is associated to every node of the control flow graph, representing the dependencies at the node’s entry point. A special case is the mapping which maps all variables to \(\bot \), which we call Unreachable. In particular it is associated to nodes that cannot be reached during the analysis. Also, if any of the variables of \(\varDelta \) is marked as \(\bot \), the entire node collapses, becoming Unreachable.

For any node of the control flow graph associated to an intraprocedural domain \(\varDelta \), \(\varDelta (x)\) retrieves the dependency associated to the variable x. If a mapping for x does not currently exist, \(\varDelta (x)\) retrieves \(\oslash \). Forgetting a variable x from a reachable intraprocedural domain, \(\varDelta \; \setminus \; x \), removes its mapping. The \(\vee \), \(\sqsubseteq \) and \(\oplus \) operations are extended pointwise to an intraprocedural domain, for each variable and its associated dependency domain \( \delta _v\). In particular, Unreachable is the bottom of this intraprocedural lattice.

Table 6. Statements – representations and data-flow equations

Data-Flow Equations. Our dependency analysis is a backward data-flow analysis. For each exit label, it traverses the control flow graph starting with its corresponding exit node and marking all other exit points as Unreachable. The intraprocedural domain for the currently analyzed label is initialized with its associated output variables mapped to \(\top \). The analysis traverses the control flow graph and gradually refines the dependencies until a fixed point is reached. Table 6 summarizes the representation and general equation of the statements. For each statement, the presented data-flow equation operates on the intraprocedural domains of the statement’s successor nodes. The intraprocedural domain at the entry point of the node is obtained by joining the contributions of each outgoing edge. The contribution of an edge \((n_i, n_j)\) labeled with s and \(\lambda \) is given by \(\llbracket {s}\rrbracket _{\lambda }{(\varDelta _{n_j})}\) where \(\llbracket {s}\rrbracket _{\lambda }{(.)}\) is the transfer function of the edge labeled \(s, \lambda \).

Tables 78910 define the transfer functions for each built-in statement of our language, whereas the general case of a predicate call and its corresponding equation will be detailed in the following section.

Table 7 presents the transfer functions for statements which are not type-specific. For equality tests (1) both of the inputs \(e_1\), \(e_2\) are completely read, whether the test returns true or false. The transfer functions therefore, reduce the domain of the corresponding successor node with a domain consisting of \(e_1\) and \(e_2\) both mapped to \(\top \). In the case of assignment (2), the dependency of the written output variable o is forgotten from the successor’s intraprocedural domain, thus being mapped to \(\oslash \) and forwarded to the input variable e.

Table 7. Generic statements – data-flow equations

The data-flow equations given in Table 8 correspond to structure-related statements. For the Eqs. (3), (4), (5) and (6) we assume that the variable s is of type \(\varvec{structure}\{f_1 : t,\ldots , f_n : t\}\) for some fields \(f_i,\; 1 \le i \le n\). The equation (3) refers to the creation of a structure: each input \(e_i\) is read as much as the corresponding field \(f_i\) of the structure is read. The destructuring of a structure is handled in (4): each field \(f_i\) is needed as much as the corresponding variable \(o_i\) is. When accessing the i-th field of a structure s (5), only the field \(f_i\) is read, and only as much as the access’ result o itself. The equation (6) treats field updates: the variable \(e_i\) is read as much as the field \(f_i\) is. The structure s is read as much as all the fields other than \(f_i\) are read in \(s'\). Finally, the equations given in (7) handle partial structure equality tests, and the transfer functions are the same for the labels true or false: for both compared structures \(s'\) and \(s''\), all the fields in the given set \(f_1, \ldots , f_k\) are completely read, and only those.

Table 8. Structure-related statements – data-flow equations

The data-flow equations given in Table 9 correspond to variant-related statements. They follow the same principles as those used for structure-related statements above. Note that the transfer functions for the switch (9) and possible constructor test (10) introduce \(\bot \) dependencies for constructors which are known to be impossible on the considered edge. In particular, since \(\bot \) is an absorbing element for \(\oplus \), these transfer functions erase, for every constructor which is known to be locally impossible, all the dependency information possibly attached to said constructor in the successor nodes. This is the actual raison d’être for the reduction operator, since using \(\vee \) to combine a successor domain and a local contribution would lose this information.

Table 9. Variant-related statements – data-flow equations

Finally, the equations for array-related statements are given in Table 10. We assume for both that the context is fixed and that \(\mathcal {I}\) is the distinguished set of input variables for the analyzed predicate. This set is used to make sure that exceptions in array dependencies are only registered to variables in \(\mathcal {I}\) and not local or output variables. The reason for such a constraint is a pragmatic one: input variables are not assignable in our language, and therefore they always represent the same value intraprocedurally. Otherwise, each time a variable is written by a statement, we would need to traverse all the dependencies in the domain to erase or reinterpret the occurrences where this variable appears as an exception. Only recording exceptions for input variables makes this kind of costly traversal useless, and since only exceptions about input variables make sense at the interprocedural level (see Sect. 5), we do not lose much precision by doing so. The transfer functions for (11) and (12) thus take care of making adequate approximations when exceptions cannot be introduced. As for the cases when the array access exits with the false label, note that the contribution to the array a is \(\langle \oslash \rangle \), which is strictly less precise than \(\oslash \). The operation makes implicit bounds checking and this can thus be seen as accounting for the fact that no cell in a has been read, but the “length” or “support” of a has been read, hence it would not be true to claim that the result of the statement did not depend on a at all. Similarly, a variant dependency \([ C_1 \mapsto \oslash , \ldots , C_n \mapsto \oslash ]\) mapping all constructors to nothing has not read any value in any of the constructors, but may still depend on the variant’s constructor itself.

Table 10. Array-related statements – data-flow equations

5 Interprocedural Dependencies

Exit labels, presented in Sect. 2.2, constitute an increased source of expressivity, as they indicate the scenario that was observed while executing a predicate. We incorporate this expressivity in our dependency results, by computing specific dependencies for each possible execution scenario. Therefore, our analysis is performed label by label and interprocedural dependency domains associate an intraprocedural domain to each exit label of the analyzed predicate. The variable key-set of each associated intraprocedural domain comprises the inputs of the analyzed predicate. A label that cannot be returned is mapped to an Unreachable intraprocedural domain. This is a form of path-sensitivity [10]. However, we favor the term label-sensitivity for this characteristic, as it seems to be a more natural choice applied to our case and the language we are working on.

An interprocedural domain of a predicate p is thus defined as follows:

$$\begin{aligned} \mathscr {D}_p: \varLambda _p \rightarrow \varDelta , \;\quad \text {where } \varLambda _p \text { the set of output labels of predicate }p \end{aligned}$$

For each analyzed label of a predicate, the analysis starts by initializing the intraprocedural domain mapped to it, with the output variables associated to the exit label. To avoid making any false supposition, these are initially mapped to the most general dependency, namely \(\top \). Subsequently, as described in Sect. 4, the dependency information is gradually refined until a fixed point is reached. The execution scenarios denoted by the exit labels of a predicate are mutually exclusive. Therefore, during the analysis of a particular exit label, all other exit labels of the predicate are mapped to Unreachable. After reaching a fixed point, the intraprocedural domain is filtered so that only input variables appear in the variable set. As explained in Sect. 4, the intraprocedural domains are built such that only input variables may appear as exception indices in dependencies computed for arrays. This invariant is preserved throughout the analysis.

A substitution must be performed on interprocedural domains. This consists in substituting all occurrences of formal input parameters of a predicate by the corresponding effective input parameters. The substitution operation is denoted by \(\blacktriangleleft (\sigma )\) where \(\sigma \) is a substitution from formal to effective parameters.

We proceed by detailing the equation corresponding to a call to a predicate:

$$ p(e_1, \ldots , e_n) [\lambda _1 : \bar{o}_1 \; \arrowvert \; \ldots \;\arrowvert \; \lambda _m : \bar{o}_m] $$

having the following signature:

$$ p(\epsilon _1, \ldots , \epsilon _n) [\lambda _1 : \bar{\omega }_1 \; \arrowvert \; \ldots \;\arrowvert \; \lambda _m : \bar{\omega }_m] $$

The general equation form applies:

$$ \displaystyle \varDelta _n = \bigvee _{n \xrightarrow {s, \lambda _i} n_i } \llbracket {p(e_1, \ldots , e_n) \; [\lambda _1: \bar{o}_1 \; \arrowvert \ldots \arrowvert \; \lambda _m: \bar{o}_m]}\rrbracket _{\lambda _i}{(\varDelta _{n_i})} $$

The transfer functions for the predicate call statement are deduced from the predicate’s interprocedural domain in the following fashion:

$$ \begin{array}{l} \displaystyle \llbracket {p(e_1, \ldots , e_n) \; [\lambda _1: \bar{o}_1 \; \arrowvert \ldots \arrowvert \; \lambda _m: \bar{o}_m]}\rrbracket _{\lambda _i}{(\varDelta )} = (\varDelta \setminus \bar{o_i}) \bigoplus _{j \in \{ 1,\ldots , n\}} {e_j \mapsto dep_j^i} \\ \text {where} \qquad dep_j^i = \mathscr {D}_p(\lambda _i)(\epsilon _j) \blacktriangleleft (\bar{\epsilon } \mapsto \bar{e}) \\ \end{array} $$

Namely, the mappings for the outputs \(\bar{o}\) associated to a label \(\lambda _i\) are removed, and the contribution of a call to each input \(e_j\) stems from the contribution of the interprocedural domain for label \(\lambda _i\) and formal input \(\epsilon _j\). In these, all the formal input parameters \(\bar{\epsilon }\) in array dependency domains are substituted by the corresponding effective input parameters from \(\bar{e}\).

Semantics. We conclude this section by briefly presenting the two possible interpretations of the results of our analysis. Considering an intraprocedural result \(\varDelta ^\lambda _p\) for a predicate p and label \(\lambda \), a first interpretation of our dependency analysis is an equivalence relation on tuples of values \(\approx _{\varDelta ^\lambda _p}\) which relates values that only differ in places on which \(p, \lambda \) does not depend. It can be used for applying congruence modulo reasoning to predicate calls. Namely, if we write \(p(\bar{v}) \xrightarrow {\lambda :\bar{w}}\) to denote that applying p to the values \(\bar{v}\) yields the exit label \(\lambda \) with outputs \(\bar{w}\), then if p is applied in turn to two input data structures \(\bar{u}\) and \(\bar{v}\) that are congruent w.r.t. \(\approx _{\varDelta ^\lambda _p}\), the predicate will exercise the same execution scenario:

figure k

Furthermore, identical outputs will be obtained:

$$ \bar{u} \approx _{\varDelta ^\lambda _p}\bar{v} \implies p(\bar{u}) \xrightarrow {\lambda :\bar{w}} \implies p(\bar{v}) \xrightarrow {\lambda :\bar{z}} \implies \bar{w} = \bar{z} $$

whereas this first interpretation focuses on the dependency part of the analysis, it is also possible to focus on the possible constructors part of the analysis. This additional interpretation is a characteristic function \(\mathbbm {1}_{\varDelta ^\lambda _p}\) on input values which constrains the space of inputs that can make p exit with label \(\lambda \). It denotes the necessary conditions on inputs according to the observed execution scenario and can be used as an inversion lemma when reasoning on calls to a predicate:

$$ \begin{array}{l} p(\bar{u}) \xrightarrow {\lambda :\bar{w}} \implies \mathbbm {1}_{\varDelta ^\lambda _p}(\bar{u}) \end{array} $$

A detailed presentation of these semantics is out of the scope of this paper but in order to give a good intuition of the adequation between the interpretation and the lattice operations described in Sect. 3, we can give some fundamental properties relating the domain operations and these interpretations:

$$\begin{aligned} \bar{v} \approx _\top \bar{w}&\iff \bar{v} = \bar{w}&\bar{v} \approx _\varnothing \bar{w}&\wedge \bar{v} \approx _\bot \bar{w}\quad \forall \bar{v}, \bar{w}&\mathbbm {1}_\top&= \mathbbm {1}_\varnothing = \_ \mapsto 1 \\ \mathbbm {1}_\bot&= \_ \mapsto 0&\varDelta \sqsubseteq \varDelta '&\implies \approx _\varDelta \;\supseteq \;\approx _{\varDelta '}&\qquad \varDelta \sqsubseteq \varDelta '&\implies \mathbbm {1}_\varDelta \subseteq \mathbbm {1}_{\varDelta '} \\ \approx _{\varDelta \oplus \varDelta '}&\;\subseteq \; \approx _\varDelta \cap \approx _\varDelta '&\qquad \mathbbm {1}_\varDelta \wedge \mathbbm {1}_\varDelta '&\implies \mathbbm {1}_{\varDelta \oplus \varDelta '}&\end{aligned}$$

The soundness of the second interpretation as well as the well-formedness of our dependency domains have been proven in CoqFootnote 2.

6 Preliminary Results and Experiments

Our analysis has currently been applied on two abstract layers of ProvenCore, described in Sect. 1. These are the Refined Security Model (RSM), an abstract layer situated just underneath the top-most layer of the refinement chain and the Functional Specifications (FSP) layer, a model closely resembling the most concrete layer (Target of Evaluation Design – TDS) but using data structures and algorithms that facilitate reasoning. Each layer is characterized by a global state with numerous fields and different transitions, i.e. supported commands. Various invariants and properties characterize their states. For example, the FSP’s state contains 14 fields; it is characterized by approximately 50 invariants. In the TDS, these figures are doubled. Each invariant is concerned with a different subset of the global state’s fields. Some of the invariants concern all the processes held in the process store. However, most transitions affect only a few of these fields. We have applied our analysis at a medium-scale on the RSM and FSP layers. The results for over 660 predicates having approximately 11000 lines of code have been computed in 1.5 s.

One of the analyzed predicates from the FSP layer is:

figure l

verifying a fundamental property that has to hold for all processes in the process store of proc. It refers to the relevance of memory permissions and states that every process has permissions covering a valid range of memory addresses inside its virtual space. The process type is a structure with 26 fields, of which 11 are compound data types themselves. Among these, 2 fields are arrays, 3 fields are variants and 6 others are structures with a number of fields ranging from 3 to 17 fields.

The dependency results computed by our analysis for this predicate are shown below. The analysis detects that for each of the possible execution scenarios, the outcome depends only on 2 out of the 26 fields, namely the stackframe and the memory permissions. The dependency on the stackframe is confined to only one of the 3 fields: the data and stack segment. The memory permissions are given by a variant with 3 constructors, denoting reading and writing permissions or the absence of any permission. Furthermore, besides pinning down the outcome’s dependency on 2 out of the 26 fields of the process structure, the analysis also detects that the absence of any memory permission, indicated by the constructor NONE of the mem_auth variant, is \(\bot \) for the false execution scenario. In other words, unused permissions cannot threaten the property proc_mem_auth_ok.

figure m

The relevant memory permissions property is thus only threatened by transitions that add memory permissions or change a process’ virtual space layout. Only 3 transitions out of the 25 belong to this category: exec which resets the process’ segments, do_auth_read and do_auth_write which add permissions. In particular, transitions deleting memory permissions do not impact the property since the absence of permissions, as shown by the dependency of the constructor NONE for the false label, is an impossible case when the property does not hold. This is one of the practical advantages of tracking constructor possibilities simultaneously.

Space constraints prevent us from discussing other examples here. However, various other examples are provided on the webpageFootnote 3 dedicated to our analysis.

7 Related Work

The frame problem and its manifestations in the software verification process – detecting program properties that remain unchanged under a certain operation – are notorious. First described in 1969 [8], the frame problem is still a target for full automation in the software verification realm. A complete specification of a program will necessarily include frame properties. However, though necessary, frame properties are tedious and repetitive. Two prominent solutions to the frame problem come from separation logic [4] and ownership types [1]. However, it is argued that the problem itself should not impose such annotation-heavy solutions. Simpler, automatic solutions for their specification and verification would allow programmers to concentrate on the truly challenging part [9].

The dependency results computed by our analysis are similar to primitive read and write effects used in ownership type systems [1]. Write effects in our case are implicit and include strictly the output variables associated to an exit label. Read effects can only refer to input variables of a predicate. Also, read effects comprise the whole execution of a method even if they are irrelevant for the method’s results. We however, ignore read effects on which the output does not depend, reflecting only those which contribute to the observed result. A technique for declaring and verifying read effects in an ownership type system is presented in [1]. We use static analysis to automatically detect them.

Our dependencies are similar to the influence sets presented by Leino and Müller [5]. Influence sets are represented as sets of heap locations and they are used to specify the parts of the program state that are allowed to impact the return values. Reasoning about heap locations is beyond the scope of our analysis. We treat mappings between variables and values, analyze their evolution in a side-effect free environment and express dependencies as input-output relations.

Static dependence or liveness analyses are typically used for code optimization, dead code elimination [7] and compile time garbage collection, but only seldom for program verification. One case we are aware of comes from Frama-C [2], where it is used in a purely automatic setting and unlike our analysis it does not handle unions and arrays. A plug-in based on the available value analysis [3] computes lists of input and output locations for each function, distinguishing between operational, functional and imperative inputs.

8 Conclusion

We have presented a flow-sensitive, path-sensitive, interprocedural dependency analysis that handles arrays, structures and variants. For the latter it simultaneously computes a subset of possible constructors. We have defined our own abstract dependency domain and obtain dependency information that mirrors the layered structure of compound data types.

The main original traits of our contribution stem from its design as an analysis meant to be used as a companion tool during interactive program verification, in an unified manner on programs as well as on specifications.

An obvious first challenge is to address the issue of context-sensitivity. We plan to introduce lazy components in our interprocedural dependency summaries and to inject in them the current intraprocedural context on an as-needed basis. Early experiments show much promise in terms of improved precision, with only a marginal decrease in performance.

Our long-term goal is to combine the dependency analysis with a correlation analysis, meant to detect relations between inputs and outputs. By uncovering relations (preorders and equivalences) between inputs and outputs, after having detected that a property only depends on unmodified parts and unifying the results, the preservation of invariants for the unmodified parts could be inferred.