
1 Introduction

This paper investigates a new approach to relational analysis. Our aim is to develop a method that scales to very large code bases, yet maintains a reasonable degree of precision, even for programs that use non-linear numeric operations.

Abstract interpretation is a well-established theoretical framework for sound reasoning about program properties. It provides means for comparing program analyses, especially with respect to the granularity of information (precision) that analyses allow us to statically extract from programs. On the whole, it reduces such questions to questions about abstract domains. An abstract domain, essentially, specifies the (limited) language of judgements we are able to use when reasoning statically about a program’s runtime behaviour.

A class of abstract domains that has received particular attention is that of the numeric domains—those supporting reasoning about variables of numeric (often integer or rational) type. Numeric domains are important because of the numerous applications in termination and safety analyses, such as overflow detection and out-of-bounds array analysis. The polyhedral abstract domain [9] allows us to express linear arithmetic constraints (equalities and inequalities) over program state spaces of arbitrary finite dimension k. But high expressiveness comes at a cost; analysis using the polyhedral domain does not scale well to large code bases. For this reason, a number of abstract domains have been proposed, seeking to strike a better balance between cost and expressiveness.

Language Restriction. The primary way of doing this is to limit expressiveness, that is, to restrict the language of allowed judgements. Most commonly this is done by expressing only 1- or 2-dimensional projections of the program’s (abstract) state space, often banning all but a limited set of coefficients in linear constraints. Examples of this kind of restriction to polyhedral analysis abound, including zones [19], TVPI [22, 23], octagons [20], pentagons [18], and logahedra [14]. These avoid the exponential behaviour of polyhedra, instead offering polynomial (typically quadratic or cubic) decision and normalization procedures. Still, they have been observed to be too expensive in practice for industrial code-bases [18, 24]. Hence other “restrictive” techniques have been proposed which are sometimes integral to an analysis, sometimes orthogonal.

Dimensionality Restriction. These methods aim to lower the dimension k of the program (abstract) state space, by replacing the full space with several lower-dimension subspaces. Variables are separated into “buckets” or packs according to some criterion. Usually the packs are disjoint, and relations can be explored only amongst variables in the same pack (relaxations of this have also been proposed [4]). The criterion for pack membership may be syntactic [8] or determined dynamically [24]. A variant is to only permit relations between sets; in the Gauge domain [25], relations are only maintained between program variables and introduced loop counters, not between sets of program variables.

Closure Restriction. Some methods abandon the systematic transitive closure of relations (and therefore lack a normal form for constraints). Constraints that follow by transitive closure may be discovered lazily, or not at all. Closure restriction was used successfully with the pentagon domain; a tolerable loss of precision was compensated for by a significant cost reduction [18].

All of the work discussed up to this point has, in some sense, started from an ideal (polyhedral) analysis and applied restrictions to the degree of “relationality.” A different line of work starts from very basic analyses and adds mechanisms to capture relational information. These approaches do not focus on restrictions, but rather on how to compensate for limited precision using “symbolic” reasoning. Such symbolic methods maintain selected syntactic information about computations and use this to enhance precision. The primary examples are Miné’s linearization method [21], based on “symbolic constant propagation”, and Chang and Leino’s congruence closure extension [5].

Fig. 1. Two example programs

Fig. 2. (a) The reachable states at point A in Fig. 1(a); (b) the result of polyhedral analysis at point A, projected onto the y-z plane, assuming the analysis performs a case split on the sign of y (the convex hull forming a lozenge); (c) the result of polyhedral analysis at point B. Dashed lines show octagon invariants.

Polyhedral analysis and its restrictions tend to fall back on overly coarse approximation when faced with non-linear operations such as multiplication, modulus, or bitwise operations. Higher precision is desirable, assuming the associated cost is limited. Consider the example shown in Fig. 1(a). Figure 2(a) shows the possible program states when execution reaches point A. With octagons, the strongest claim that can be made at that point is

$$ 0 \le x \le 10, -10 \le y \le 10, y - z \le 90, z - y \le 90, x + z \ge -90, z - x \le 90 $$

Fig. 2(b) shows the projection onto the y-z plane. Almost all interaction between y and z has been lost and, as a result, we fail to detect that z is non-negative at point B. The best possible polyhedral approximation adds

$$ z \ge -10x, z \ge 10x + 10y - 100, z \le 10x, z \le 100 - 10x + 10y $$

While this expresses more of the relationship between x, y and z, we can still only infer \(z \ge -50\) at point B.

In practice, weaker results may well be produced. A commonly used octagon library yields \(y \in [-10,10], z \in [-100, 100]\), rather than the dashed projections shown in Fig. 2(b) and (c). For polyhedral analysis, multiplication is often handled by projection and case-splitting. The two grey triangles in Fig. 2(b) show the result, at point A, of case analysis according to the sign of y, as projected onto the y-z plane; the lozenge is the convex hull. This explains how a commonly used library infers \(\{z \ge 5y - 50, z \le 5y + 50\}\) at point A. The pen-nib shaped area in Fig. 2(c) shows the result, at point B, of polyhedral analysis. Note that the triangle below the y axis is in fact infeasible.

Contribution. The proposal presented in this paper differs from all of the above. It combines closure restriction and a novel symbolic approach. We extract and utilise shared expression information to improve the precision of cheap non-relational analyses (for example, interval analysis), at a small added cost. The idea is to treat the arithmetic operators as uninterpreted function symbols. This allows us to replace expensive convex hull operations by a combination of constraint propagation and term anti-unification. The resulting subterm domain \(\mathcal {ST}\) is an abstract domain of syntactic equivalences. It can be used to augment non-relational domains with relational information, and to improve precision of (possibly relational) domains in the presence of complex operations. The improvement is not restricted to non-linear operations; it can equally well support weakly relational domains that are unable to handle large coefficients.

Table 1 summarises the analysis results for the two programs in Fig. 1, compared with the results of (ideal) octagon and polyhedral analysis. Note how the subterm domain obtains a tight lower bound on z as well as a tight upper bound on u.

Table 1. States inferred for Fig. 1’s programs, points B (left) and C (right)

The method has been implemented, and the experiments described later in this paper suggest the combination strikes a happy balance between precision and cost. After Sect. 2’s preliminaries, Sect. 3 provides algorithms for operations on systems of terms, and Sect. 4 shows how this can be used to enhance a numeric domain. Section 5 provides comparison with the closest related work. Section 6 reports on experimental results and Sect. 7 concludes.

2 Preliminaries

Abstract Interpretation. In standard abstract interpretation, a concrete domain \(C^\natural \) and its abstraction \(C^\#\) are related by a Galois connection \((\alpha , \gamma )\), consisting of an abstraction function \(\alpha : C^\natural \rightarrow C^\#\) and concretization function \(\gamma : C^\# \rightarrow C^\natural \). The best approximation of a function \(f^\natural \) on \(C^\natural \) is \(f^\#(\varphi ) = \alpha (f^\natural (\gamma (\varphi )))\). When analysing imperative programs, \(C^\natural \) is typically the power-set of program states, and the corresponding lattice operations are \((\subseteq , \cup , \cap )\).

In a non-relational (or independent attribute) domain, the abstract state is either the bottom value \(\bot _{D}\) (denoting an infeasible state), or a separate non-\(\bot \) abstraction \(x^\#\) for each variable x in some domain \(D_{V}\) (where each variable admits some feasible value). That is, \(D = \{\bot _{D}\} \cup (D_{V} \setminus \{\bot _{D}\})^{\vert V\vert }\).
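To make this concrete, here is a minimal sketch of an independent attribute domain, classical intervals, in Python. The dictionary-per-variable layout and the names meet and join are illustrative assumptions of ours, not crab's API.

```python
# A sketch of a non-relational (independent attribute) domain:
# classical intervals. None encodes the bottom (empty) interval.
def meet(a, b):
    # Greatest lower bound: intersection of two intervals.
    lo, hi = max(a[0], b[0]), min(a[1], b[1])
    return (lo, hi) if lo <= hi else None

def join(a, b):
    # Least upper bound: smallest interval containing both.
    return (min(a[0], b[0]), max(a[1], b[1]))

# A non-bottom abstract state keeps one abstraction per variable:
state = {"x": (0, 10), "y": (-10, 10)}
```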

Sometimes backwards reasoning is required, to infer the set of states which may/must give rise to some property. The pre-image transformer yields \(\varphi _{pre}\) such that \((F_{\mathcal {D}}([\![\mathtt {S}]\!])(\varphi ') = \varphi ) \Rightarrow (\varphi ' {{\mathrm{\sqsubseteq }}}\varphi _{pre})\). Finding the minimal pre-image of a complex (non-linear) operation can be quite expensive, so pre-image transformers provided by numeric domains are usually coarse approximations.

We shall sometimes need to rename abstract values. Given a binary relation \(\pi \subseteq V \times V'\) and an element \(\varphi \) of an independent attribute domain over V, the renaming \(\pi (\varphi )\) maps each target variable to the meet of the abstractions of its sources:

$$ \pi (\varphi )(x') = {{\mathrm{\sqcap }}}\,\{\varphi (x) \mid (x, x') \in \pi \} $$

The corresponding operation is more involved for relational domains. Assuming \(\mathcal {D}\) is closed under existential quantification, can maintain systems of equalities, and V and \(V'\) are disjoint, we have \( rename _{\pi }(\varphi ) = \exists V.~(\varphi {{\mathrm{\sqcap }}}\{ x = x' \mid (x, x') \in \pi \})\).
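As a sketch of the independent-attribute case (reusing the illustrative interval encoding above, and assuming the meet-based reading of renaming given above):

```python
# A sketch of renaming under a relation pi, given as (old, new) pairs.
TOP = (float("-inf"), float("inf"))

def meet(a, b):  # interval meet, as in the previous sketch
    lo, hi = max(a[0], b[0]), min(a[1], b[1])
    return (lo, hi) if lo <= hi else None

def rename(pi, state):
    out = {}
    for old, new in pi:
        # A target related to several sources must satisfy all of them.
        out[new] = meet(out.get(new, TOP), state[old])
    return out

print(rename({("x", "u"), ("y", "u")}, {"x": (0, 10), "y": (-5, 5)}))
# {'u': (0, 5)}
```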

Term Equations. The set \(\mathcal{T}\) of terms is defined recursively: every term is either a variable \(v \in TVar \) or a construction \(F(t_1,\ldots ,t_n)\), where \(F \in Fun\) has arity \(n \ge 0\) and \(t_1,\ldots ,t_n\) are terms. A substitution is an almost-identity mapping \(\theta \in TVar \rightarrow \mathcal{T}\), naturally extended to \(\mathcal{T} \rightarrow \mathcal{T}\). We use standard notation for substitutions; for example, \(\{x \mapsto t\}\) is the substitution \(\theta \) such that \(\theta (x) = t\) and \(\theta (v) = v\) for all \(v \ne x\). Any term \(\theta (t)\) is an instance of term t.

If we define \(t \sqsubseteq t'\) iff \(t = \theta (t')\) for some substitution \(\theta \) then \(\sqsubseteq \) is a preorder. Define \(t \equiv t'\) iff \(t \sqsubseteq t' \wedge t' \sqsubseteq t\). The set \(\mathcal{T}_{/\equiv } \cup \{\bot \}\), that is, \(\mathcal{T}\) partitioned into equivalence classes by \(\equiv \) plus \(\{\bot \}\), is known to form a complete lattice, the so-called term lattice. A unifier of \(t, t' \in \mathcal{T}\) is an idempotent substitution \(\theta \) such that \(\theta (t) = \theta (t')\). A unifier \(\theta \) of t and \(t'\) is a most general unifier of t and \(t'\) iff \(\theta ' = \theta ' \circ \theta \) for every unifier \(\theta '\) of t and \(t'\).

If we can calculate most general unifiers then we can find meets in the term lattice: if \(\theta \) is a most general unifier of t and \(t'\) then \(\theta (t)\) is the most general term that simultaneously is an instance of t and an instance of \(t'\), so \(\theta (t)\) is the meet of t and \(t'\). Similarly, the join of t and \(t'\) is the most specific generalization; algorithms are available that calculate most specific generalizations [15].

Given a set of terms \(S \subseteq \mathcal{T}\) and equivalences \(E \subseteq (S \times S)\), we can partition S into equivalent terms. Terms t and s are equivalent (\(t \equiv s\)) if they are identical constants, are deemed equal, or \(t = f(t_1, \ldots , t_m)\) and \(s = f(s_1, \ldots , s_m)\) such that for all i, \(t_i \equiv s_i\). Finding this partitioning is the well-studied congruence closure problem, of complexity \(\text{ O }(\vert S\vert \log \vert S\vert )\) [10]. Of relevance is the case \(\vert E\vert = 1\) (introduction of a single equivalence), which can be handled in \(\text{ O }(\vert S\vert )\) time.
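The following is a naive quadratic sketch of congruence closure (for clarity only; [10] gives the \(\text{ O }(\vert S\vert \log \vert S\vert )\) algorithm). The term encoding, strings for variables and constants, tuples \((f, t_1, \ldots , t_n)\) for constructions, is ours.

```python
from itertools import combinations

def congruence_classes(terms, equalities):
    # Union-find with path halving over a finite, subterm-closed set.
    parent = {t: t for t in terms}
    def find(t):
        while parent[t] != t:
            parent[t] = parent[parent[t]]
            t = parent[t]
        return t
    def union(s, t):
        parent[find(s)] = find(t)
    for s, t in equalities:          # the given equivalences E
        union(s, t)
    changed = True
    while changed:                   # propagate congruence upwards:
        changed = False              # f(s..) ~ f(t..) if children match
        apps = [t for t in terms if isinstance(t, tuple)]
        for s, t in combinations(apps, 2):
            if (find(s) != find(t) and s[0] == t[0] and len(s) == len(t)
                    and all(find(a) == find(b) for a, b in zip(s[1:], t[1:]))):
                union(s, t)
                changed = True
    classes = {}
    for t in terms:
        classes.setdefault(find(t), []).append(t)
    return list(classes.values())

# Deeming a = b merges {a, b} and then, by congruence, {f(a), f(b)}:
ts = {"a", "b", ("f", "a"), ("f", "b")}
print(congruence_classes(ts, [("a", "b")]))
```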

In the following, it will be necessary to distinguish a term as an object from the syntactic expression it represents. We shall use \(\mathbf {id}(t)\) to denote the name of a term, and \(\mathbf {def}(t)\) to denote the expression.

3 The Subterm Domain \(\mathcal {ST}\)

An element of the subterm domain consists of a mapping \(\eta : V \rightarrow \mathcal {T}\) of program variables to terms. While the domain structure derives from uninterpreted functions, we must reason about the corresponding concrete computations. We accordingly assume each function symbol F has been given a semantic function \(\mathbb {S}(F) : \mathcal {S}^n \rightarrow \mathcal {S}\). Given some assignment \(\theta : TVar \rightarrow \mathcal {S}\) of term variables to scalar values, we can then recursively define the evaluation \(\mathbb {E}(t,\theta )\) of a term under \(\theta \).

$$ \begin{array}{c} \mathbb {E}(x,\theta ) = \theta (x) \\ \mathbb {E}(f(t_1, \ldots , t_n),\theta ) = \mathbb {S}(f)(\mathbb {E}(t_1,\theta ), \ldots , \mathbb {E}(t_n,\theta )) \end{array} $$

We say a concrete state \(\{x_1 \mapsto v_1, \ldots , x_n \mapsto v_n\}\) satisfies mapping \(\eta \) iff there is an assignment \(\theta \) of values to term variables such that for all \(x_i\), \(\mathbb {E}(\eta (x_i),\theta ) = v_i\). The concretization \(\gamma (\eta )\) is the set of concrete states which satisfy \(\eta \).
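For illustration (the encoding and names are ours), the evaluation \(\mathbb {E}\) can be realised directly from these equations:

```python
# A sketch of term evaluation E(t, theta): strings are term variables,
# tuples are constructions; constants are 0-ary function symbols.
S = {"+": lambda a, b: a + b,    # semantic functions S(F)
     "*": lambda a, b: a * b,
     "7": lambda: 7}

def evaluate(t, theta):
    if isinstance(t, str):       # a term variable: look up theta
        return theta[t]
    f, args = t[0], t[1:]
    return S[f](*(evaluate(a, theta) for a in args))

# The state {x -> +(a1, 7), y -> a1} of Example 1 is satisfied by any
# concrete state reachable via some theta, e.g. theta = {a1: 3}:
eta = {"x": ("+", "a1", ("7",)), "y": "a1"}
theta = {"a1": 3}
assert evaluate(eta["x"], theta) == 10 and evaluate(eta["y"], theta) == 3
```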

However, the syntactic nature of our domain gives us difficulties. While we can safely conclude that two (sub-)terms are equivalent, we have no way to conclude that two terms differ. No Galois connection exists for this domain; multiple sets of definitions could correspond to a given concrete state. Even if states \(\eta _1\) and \(\eta _2\) are both valid approximations of the concrete state, the same does not necessarily hold for \(\eta _1 {{\mathrm{\sqcap }}}\eta _2\).

Example 1

Consider two abstract states:

$$ \begin{array}{ccc} \{x \mapsto +(a_1, 7), y \mapsto a_1, z \mapsto a_2 \}&\qquad&\{x \mapsto +(3, b_1), y \mapsto b_2, z \mapsto b_1 \} \end{array} $$

These correspond to the sets of states satisfying \(x = y+7\) and \(x = 3+z\) respectively. Many concrete states satisfy both approximations; one is \((x, y, z) = (7, 0, 4)\). However, a naive application of unification would attempt to unify \(+(y, 7)\) with \(+(3, z)\), which would result in unifying y with 3, and z with 7.

Cousot and Cousot [7] discuss the consequences of a missing best approximation, and propose several approaches for repair: strengthening or weakening the domain, or nominating a best approximation through a widening/narrowing. However, these are of limited value in our application. Strengthening or weakening the domain enough that a best approximation is restored would greatly affect the performance or precision, and explicitly reasoning over the set of equivalent states is impractical. Using a widening/narrowing is sound advice, but offers minimal practical guidance.

3.1 Operations on \(\mathcal {ST}\)

We must now specify several operations: state transformers for program statements, join, meet, and widening. Assignment, join and widening all behave nicely under \(\mathcal {ST}\); meet is discussed in Sect. 3.2.

Figure 3 shows assignment and join operations on \({\mathcal {ST}}\). Calls to generalize are cached, so repeated calls to generalize(s,t) all return the same term variable. In the case of \(\mathcal {ST}\), the lattice join is safe: as \(\eta _1 {{\mathrm{\sqsubseteq }}}\eta _2 \Rightarrow \gamma (\eta _1) {{\mathrm{\sqsubseteq }}}_{\mathcal {C}^\natural } \gamma (\eta _2)\) and \({{\mathrm{\sqcup }}}\) and \({{\mathrm{\sqcup }}}_{\mathcal {C}^\natural }\) are least upper bounds on their respective domains, we have \(\gamma (\eta _1) {{\mathrm{\sqsubseteq }}}_{\mathcal {C}^\natural } \gamma (\eta _1 {{\mathrm{\sqcup }}}\eta _2)\) and \(\gamma (\eta _2) {{\mathrm{\sqsubseteq }}}_{\mathcal {C}^\natural } \gamma (\eta _1 {{\mathrm{\sqcup }}}\eta _2)\), so \(\gamma (\eta _1) {{\mathrm{\sqcup }}}_{\mathcal {C}^\natural } \gamma (\eta _2) {{\mathrm{\sqsubseteq }}}_{\mathcal {C}^\natural } \gamma (\eta _1 {{\mathrm{\sqcup }}}\eta _2)\). The worst-case complexity of the join is \(\text{ O }(\vert \eta _1\vert \vert \eta _2\vert )\), but typical behaviour is expected to be closer to linear, as most shared terms are either shared in both operands (so considered only once) or trivially distinct (so replaced by a variable). This is borne out by the experiments of Sect. 6. As \(\mathcal {ST}\) has no infinite ascending chains, \({{\mathrm{\sqcup }}}_{\mathcal {ST}}\) also serves as a widening.

Fig. 3. Definitions of variable assignment and \({{\mathrm{\sqcup }}}\) in \(\mathcal {ST}\).
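As a rough sketch of the join (our simplified rendering, not Fig. 3 verbatim): terms with matching outermost symbols are generalized recursively, and any mismatch is replaced by a fresh variable, cached per pair of subterms.

```python
# A simplified sketch of the ST join: pairwise anti-unification with a
# cache, so each pair of subterms (s, t) yields one result term.
def st_join(eta1, eta2):
    cache = {}
    fresh = ("c%d" % i for i in range(1000))
    def generalize(s, t):
        if (s, t) not in cache:
            if (isinstance(s, tuple) and isinstance(t, tuple)
                    and s[0] == t[0] and len(s) == len(t)):
                # Same constructor: recurse on the children.
                cache[(s, t)] = (s[0],) + tuple(
                    generalize(a, b) for a, b in zip(s[1:], t[1:]))
            else:
                # Structurally distinct: a (cached) fresh variable.
                cache[(s, t)] = next(fresh)
        return cache[(s, t)]
    return {x: generalize(eta1[x], eta2[x]) for x in eta1}

# A fragment of Example 2: u generalizes to c0, which is then reused
# inside the generalization of t.
eta1 = {"u": "a0", "t": ("+", "a0", "a1")}
eta2 = {"u": "b0", "t": ("+", "b0", "b2")}
print(st_join(eta1, eta2))  # {'u': 'c0', 't': ('+', 'c0', 'c1')}
```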

Every term in \(\eta _1 {{\mathrm{\sqcup }}}\eta _2\) corresponds to some specialization in \(\eta _1\) and \(\eta _2\). We shall use \(\pi ^{\eta _1 \mapsto \eta _1 {{\mathrm{\sqcup }}}\eta _2}\) to denote the relation that maps terms in \(\eta _1\) to corresponding terms in \(\eta _1 {{\mathrm{\sqcup }}}\eta _2\).

Fig. 4. State at the end of the first (a) then and (b) else branches in Fig. 1(b), and (c) the join of the two states.

Example 2

Consider again Fig. 1(b). At the exit of the first if-then-else, we get term-graphs \(\eta _1\) and \(\eta _2\) shown in Fig. 4(a) and (b). For \(\eta _1 {{\mathrm{\sqcup }}}\eta _2\), we first compute the generalization of \(\eta _1(u) = a_0\) with \(\eta _2(u) = b_0\), obtaining a fresh variable \(c_0\). Now, \(\eta _1(t)\) and \(\eta _2(t)\) are both \((+_{2})\) terms, so we recurse on the children; the generalization of \((a_0, b_0)\) has already been computed, so we re-use the existing variable; but we must allocate a fresh variable for \((a_1, b_2)\), resulting in t being mapped to \((+) (c_0, c_1)\). We repeat this process for v and w, yielding the state shown in Fig. 4(c). Note that the result captures the fact that in both branches, t is computed by adding some value to u.

3.2 The Quasi-meet \({{\mathrm{\widetilde{\sqcap }}}}\)

We require our quasi-meet \({{\mathrm{\widetilde{\sqcap }}}}_{\mathcal {ST}}\) to be a sound approximation of the concrete meet, that is, \(\gamma (\eta _1) {{\mathrm{\sqcap }}}_{\mathcal {C}^\natural } \gamma (\eta _2) {{\mathrm{\sqsubseteq }}}_{\mathcal {C}^\natural } \gamma (\eta _1 {{\mathrm{\widetilde{\sqcap }}}}_{\mathcal {ST}} \eta _2)\). Ideally, we would like to preserve several other properties enjoyed by lattice operations:

  • Minimality: If \(\eta _1 {{\mathrm{\sqsubseteq }}}_{\mathcal {ST}} \eta _2\), then \((\eta _1 {{\mathrm{\widetilde{\sqcap }}}}_{\mathcal {ST}} \eta _2) = \eta _1\)

  • Monotonicity: If \(\eta _1 {{\mathrm{\sqsubseteq }}}_{\mathcal {ST}} \eta _1'\), then \((\eta _1 {{\mathrm{\widetilde{\sqcap }}}}_{\mathcal {ST}} \eta _2) {{\mathrm{\sqsubseteq }}}_{\mathcal {ST}} (\eta _1' {{\mathrm{\widetilde{\sqcap }}}}_{\mathcal {ST}} \eta _2)\)

These are important for precision and termination respectively. However, in the absence of a unique greatest lower bound these properties are mutually exclusive, so the quasi-meet must be handled carefully to avoid non-termination [12].

A simple quasi-meet (denoted by \({{\mathrm{\widetilde{\sqcap }}}}\), as distinct from a ‘true’ meet \({{\mathrm{\sqcap }}}\)) is to adopt the approach of [21], deterministically selecting the term for each variable from either \(\eta _1\) or \(\eta _2\). Minimality can be achieved by selecting the more precise term (according to \({{\mathrm{\sqsubseteq }}}_{\mathcal {ST}}\)) when several choices exist. However, this discards a great deal of information present in the conjunction. Of particular concern is the loss of variable equivalences which are implied by \(\eta _1 \wedge \eta _2\) (the logical conjunction of \(\eta _1\) and \(\eta _2\)), but not by \(\eta _1\) and \(\eta _2\) individually.

We can infer all sub-term (and variable) equivalences of \(\eta _1 \wedge \eta _2\) using the congruence closure algorithm. Unfortunately, not only may this yield multiple incompatible definitions for a variable, the resulting definitions may be cyclic.

Fig. 5. Abstract states \(\eta _1\) and \(\eta _2\), whose conjunction \(\eta _1 \wedge \eta _2\) (c) cannot be represented in \(\mathcal {ST}\); it has an infinite descending chain of approximations (d).

Example 3

Consider the abstract states \(\eta _1\), \(\eta _2\) shown in Fig. 5(a) and (b). Computing \(\eta _1 \wedge \eta _2\), we start with constraints \(\{\eta _1(v) = \eta _2(v) \mid v \in \{w, x, y, z\}\}\):

$$ \{t = (+)(a_0, a_1)\} \cup \{s = (+)(b_0, b_1)\} \cup \{a_0 = b_0, a_0 = s, t = b_0, a_1 = b_1\} $$

After congruence closure, the terms are split into two equivalence classes:

$$\begin{aligned} E_1 = \{a_0, b_0, s, t\}, E_2 = \{a_1, b_1\} \end{aligned}$$

We then wish to extract an element of \(\mathcal {ST}\) which preserves as much of this information as possible. This conjunction, shown in Fig. 5(c), cannot be precisely represented in \(\mathcal {ST}\) – Fig. 5(d) gives an infinite descending chain of approximations. Note that we could obtain incomparable elements of \(\mathcal {ST}\) by pointing each of \(\{w, x, y\}\) at different \((+)\) nodes in Fig. 5(d).

We therefore need a strategy for choosing a finite approximation of \(\eta _1 \wedge \eta _2\) in \(\mathcal {ST}\). There are two elements to this decision: how a representative for each equivalence class is chosen, and how cycles are broken. We wish to preserve as many equivalences as possible, particularly between variables.

Fig. 6. Algorithm to compute \({{\mathrm{\widetilde{\sqcap }}}}_{\mathcal {ST}}\). \( Eq \), \( stack \), \( repr \), \( tvar \) and \( indegree \) are global.

The algorithm for computing \(\eta _1 {{\mathrm{\widetilde{\sqcap }}}}_{\mathcal {ST}} \eta _2\) is given in Fig. 6. We first partition the terms in \(\eta _1 \cup \eta _2\) into equivalence classes using the congruence closure algorithm, then count the external references to each class. These counts, recorded in \( indegree \), give us an indication of how valuable each class is, to discriminate between candidate representatives. \( Eq (t)\) returns the equivalence class containing term t, and \( mem ( eq )\) denotes the set of non-variable terms in class \( eq \).

We then progressively construct the resulting system of terms, starting from the mapping of each variable. Each equivalence class \( eq \) corresponds to at most two terms in the meet: the main representative \( repr ( eq )\), and a term variable \( tvar ( eq )\). Instantiating a term \(f(s_1, \ldots , s_m)\), we look up the corresponding equivalence class \( eq _i = Eq (s_i)\), and check whether expanding its definition \( repr ( eq _i)\) (which may not yet be fully instantiated) would introduce a cycle. We then replace \(s_i\) with either the recursively constructed representative of \( eq _i\) (if the resulting system is acyclic), or the free variable \( tvar ( eq _i)\).
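The sketch below (ours; it elides the indegree-based scoring and simply expands the first candidate definition) illustrates this extraction phase. Non-variable members reference their children by class id, and a class already on the stack is replaced by its term variable.

```python
# A sketch of the extraction phase of the quasi-meet. var_class maps
# each program variable to its class id; members maps a class id to
# its member terms, a non-variable member being (f, c1, ..., ck) with
# the ci the class ids of the children.
def extract(var_class, members):
    repr_, tvar, on_stack = {}, {}, set()
    fresh = ("c%d" % i for i in range(1000))
    def term_var(q):
        if q not in tvar:
            tvar[q] = next(fresh)
        return tvar[q]
    def build(q):
        if q in repr_:
            return repr_[q]        # representative already constructed
        if q in on_stack:
            return term_var(q)     # back-edge: break the cycle
        on_stack.add(q)
        defs = [t for t in members[q] if isinstance(t, tuple)]
        if defs:                   # heuristic: expand first definition
            r = (defs[0][0],) + tuple(build(c) for c in defs[0][1:])
        else:
            r = term_var(q)        # class holds only variables
        on_stack.discard(q)
        repr_[q] = r
        return r
    return {x: build(q) for x, q in var_class.items()}

# Example 4: q1 = {a0, b0, (+)(q1, q2)}, q2 = {a1, b1}.
print(extract({"w": "q1", "x": "q1", "y": "q1", "z": "q2"},
              {"q1": ["a0", "b0", ("+", "q1", "q2")],
               "q2": ["a1", "b1"]}))
# w, x, y -> ('+', 'c0', 'c1'); z -> 'c1', matching Fig. 7.
```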

Example 4

Consider the abstract states \(\eta _1\), \(\eta _2\) shown in Fig. 5. Congruence closure yields two equivalence classes: \(q_1 = \{ a_0, (+)(a_0, a_1), b_0, (+)(b_0, b_1) \}\), and \(q_2 = \{ a_1, b_1 \}\). The construction of \(\eta _1 {{\mathrm{\widetilde{\sqcap }}}}\eta _2\) starts with \( Eq (w)\). We first mark \(q_1\) as being on the stack to avoid cycles, then choose an appropriate definition to expand. The non-variable members of \(q_1\) are \(\{t_1 = (+)(a_0, a_1), t_2 = (+)(b_0, b_1)\}\). Both \(t_1\) and \(t_2\) have a single non-cycle incoming edge (\( Eq (a_0) = Eq (b_0) = q_1\), which is already on the stack), so we arbitrarily choose \(t_1\).

We must then expand the sub-terms of \(t_1\). \( Eq (a_0)\) is already on the stack, so cannot be expanded; this occurrence of \(a_0\) is replaced with a fresh variable \(c_0\). Now \(a_1\) has no non-variable definitions, so a fresh variable \(c_1\) is introduced. The stack then collapses, yielding \(w \mapsto (+)(c_0, c_1)\).

The algorithm next considers x. A representative for \(q_1\) has already been constructed, so x is mapped to \((+)(c_0, c_1)\), as is y. Finally, \( Eq (z) = q_2\); this also has an existing representative, so \(c_1\) is returned. The resulting abstract state is shown in Fig. 7. \(\Box \)

Fig. 7. \(\eta _1 {{\mathrm{\widetilde{\sqcap }}}}\eta _2\)

The algorithm given in Fig. 6 runs in \(\text{ O }(n\log {}n)\) time, where \(n = \vert \eta _1\vert + \vert \eta _2\vert \). The congruence closure step is run once, in \(\text{ O }(n \log {}n)\) time. The main body of the construction is run at most once per equivalence class. Computing and scoring the set of candidates is linear in \(\vert eq\vert \), and happens once per equivalence class. We detect back-edges in constant time, by marking those equivalence classes which remain on the call stack – any edge to a marked class is a back-edge. So the reconstruction of \(\eta \) takes O(n) time in the worst case, and the overall algorithm takes \(\text{ O }(n \log {}n)\).

Note that \(\eta _1 {{\mathrm{\widetilde{\sqcap }}}}_{\mathcal {ST}} \eta _2\) is sensitive to variable ordering, as this determines which sub-term occurrence is considered a back-edge, and thus not expanded.

As for \({{\mathrm{\sqcup }}}_{\mathcal {ST}}\), each term in \(\eta _1 {{\mathrm{\widetilde{\sqcap }}}}_{\mathcal {ST}} \eta _2\) corresponds to some set of terms in \(\eta _1\) or \(\eta _2\). As before, \(\pi ^{\eta _1 \mapsto \eta _1 {{\mathrm{\widetilde{\sqcap }}}}_{\mathcal {ST}} \eta _2}\) denotes the mapping between terms in each operand and the result.

3.3 Logical Assertions

Finally consider assertions \([\![\mathtt {x {{\mathrm{\bowtie }}}y}]\!]\), where \({{\mathrm{\bowtie }}}\in \{=, \not =, <, \le \}\). The abstract transformer for \([\![\mathtt {x < y}]\!]\) and \([\![\mathtt {x \le y}]\!]\) is the identity function, as \(\mathcal {ST}\) has no notion of inequalities. \(\mathcal {ST}\) can infer information from a disequality \([\![\mathtt {x \ne y}]\!]\), but only where \(\eta \) has already inferred equality between x and y:

$$ F [\![\mathtt {x \ne y}]\!]\,\eta = \left\{ \begin{array}{ll} \bot & \text {if } \eta (x) = \eta (y) \\ \eta & \text {otherwise} \end{array} \right. $$

In the case of an equality \([\![\mathtt {x = y}]\!]\), we are left in a similar situation as for \(\eta _1 {{\mathrm{\sqcap }}}\eta _2\); we must reconcile the defining terms for x and y, plus any other inferred equivalences. This is done in the same way, by first computing equivalence classes, then extracting an acyclic system of terms. As we introduce only a single additional equivalence, we can use the specialized linear-time algorithm described in Sect. 3.4 of [10], then extract the resulting term system as for the meet.
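A small sketch of these transformers (names are ours; merge_and_extract stands in for the merge-then-extract machinery just described):

```python
# Assertion transformers on ST; None encodes bottom.
def assert_leq(eta, x, y):
    return eta        # identity: ST has no notion of inequalities

def assert_neq(eta, x, y):
    # Only refutes states where x and y already have identical terms.
    return None if eta[x] == eta[y] else eta

def assert_eq(eta, x, y):
    # Merge the classes of eta[x] and eta[y] (a single new equivalence,
    # linear time by Sect. 3.4 of [10]), then extract an acyclic term
    # system exactly as for the quasi-meet.
    return merge_and_extract(eta, x, y)   # placeholder
```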

4 \(\mathcal {ST}\) as a Functor Domain

Assume we have some abstract domain \(\mathcal {D}\) with the usual operations \({{\mathrm{\sqcap }}}\), \({{\mathrm{\sqcup }}}\) and \(F_{\mathcal {D}}\), together with pre-image transformers, as described in Sect. 2. In the following, we assume \(\mathcal {D}\) is not relational, so may only express independent properties of variables.

We would like to use \(\mathcal {ST}\) to enhance the precision of analysis under \(\mathcal {D}\). Essentially, we want a functor domain where \(\mathcal {ST}\) is the functor instantiated with \(\mathcal {D}\). While this is a simple formulation, it provides no path toward an efficient implementation. Where normally we use \(\mathcal {D}\) to approximate the values of (or relationships between) variables in V, we can instead approximate the values of terms occurring in the program. An element of our lifted domain \(\mathcal {ST}(\mathcal {D})\) is a pair \(\langle \eta , \rho \rangle \) where \(\eta \) is a mapping of program variables to terms, and \(\rho \in \mathcal {D}\) approximates the set of satisfying term assignments.

4.1 Operations over \(\mathcal {ST}(\mathcal {D})\)

An assignment in the lifted domain may be evaluated using \(F_{\mathcal {D}}\) and \(F_{\mathcal {ST}}\). We construct the updated definition of x in \(\eta \), then assign the corresponding ‘variable’ in \(\mathcal {D}\) to the result of the computation.
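A sketch of this composition (term_of, name_of and assign_D are placeholders of ours: the term construction via \(F_{\mathcal {ST}}\), the name \(\mathbf {id}(t)\), and \(\mathcal {D}\)'s assignment transformer, respectively):

```python
# Assignment x := rhs in the lifted domain ST(D).
def assign(state, x, rhs):
    eta, rho = state
    t = term_of(rhs, eta)                # e.g. y + z -> (+)(eta[y], eta[z])
    eta2 = dict(eta)
    eta2[x] = t                          # update the term mapping
    rho2 = assign_D(rho, name_of(t), t)  # constrain t's 'variable' in D
    return (eta2, rho2)
```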

Formulating \({{\mathrm{\sqcup }}}_{\mathcal {ST}(\mathcal {D})}\), \({{\mathrm{\triangledown }}}_{\mathcal {ST}(\mathcal {D})}\) and \({{\mathrm{\widetilde{\sqcap }}}}_{\mathcal {ST}(\mathcal {D})}\) is only slightly more involved, assuming the presence of a renaming operator over \(\mathcal {D}\). We first determine the term structure \(\eta '\) of the result, then map \(\rho _1\) and \(\rho _2\) onto the terms in \(\eta '\) before applying the appropriate operator over \(\mathcal {D}\).

4.2 Inferring Properties from Subterms

While this allows us to maintain approximations of subterms, it does not by itself yield tighter approximations of the program variables.

However, upon encountering a branch which restricts x, we can then infer properties on any other terms involving x. For now, we shall restrict ourselves to ancestors of x. If the approximation of x has changed, and p is an immediate parent of x, we can simply recompute p from its definition:

$$ \rho ' = \rho {{\mathrm{\sqcap }}}F_{\mathcal {ST}} [\![\mathtt {\mathbf {id}(\eta (p)) := \mathbf {def}(\eta (p))}]\!]\rho $$

We can then propagate this information upwards.

Fig. 8. If E is reached, y must be positive.

We can also infer information about a term from its parents and siblings. Assume the program fragment in Fig. 8 is being analysed using the (term-lifted) domain of intervals. At point D we know only that x is non-negative; this is not enough to infer bounds on z. However, when point E is reached we know \(z > 0\). As we already know \(x \ge 0\), this can only occur if \(x > 0\) and \(y > 0\).

This requires us to reason about the values from which a given computation could have resulted; this is exactly the pre-image discussed in Sect. 2. We can then augment the algorithm to propagate information in both directions, evaluating \(F_{\mathcal {D}}\) and its pre-image on each term until a fixpoint is reached. Unfortunately, attempts to fully reduce an abstract state run into difficulties.
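For intervals, one definition \(t = a + b\) can be propagated in both directions as follows (a sketch; the backward steps are the interval pre-images, and bottom handling is omitted for brevity):

```python
def meet(a, b):  # interval meet, as in Sect. 2's sketch
    lo, hi = max(a[0], b[0]), min(a[1], b[1])
    return (lo, hi) if lo <= hi else None

def propagate_add(rho, t, a, b):
    lo = lambda v: rho[v][0]
    hi = lambda v: rho[v][1]
    rho[t] = meet(rho[t], (lo(a) + lo(b), hi(a) + hi(b)))  # forward
    rho[a] = meet(rho[a], (lo(t) - hi(b), hi(t) - lo(b)))  # pre-image: a = t - b
    rho[b] = meet(rho[b], (lo(t) - hi(a), hi(t) - lo(a)))  # pre-image: b = t - a
    return rho

# t = a + b with t = [0,0] and a, b in [0,10] forces a = b = [0,0]:
print(propagate_add({"t": (0, 0), "a": (0, 10), "b": (0, 10)},
                    "t", "a", "b"))
```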

Fig. 9. A system of terms with no solution, encoding \(x = x+1\). Each evaluation of \(t_1\) or \(t_3\) eliminates only two values from the corresponding bounds.

Example 5

Consider the system of terms shown in Fig. 9, augmenting the domain of intervals. Disregarding interval information, it encodes the constraint \(y = x-z, z = x+1\). In the context of \(y=0\) (the interval bounds for y), this is clearly unsatisfiable.

Propagating the consequences of these terms, we first apply the definition \(t_3 = t_2 + 1\). Doing so, we trim 0 from the domain of \(t_3\) (or z), and \(10^6\) from the domain of \(t_2\) (or x). We then evaluate the definition \(t_1 = t_2 - t_3\), thus removing 0 and \(10^6\) from \(t_2\) and \(t_3\) respectively. We can then evaluate the definitions of \(t_3\) and \(t_1\) again, each evaluation again eliminating just two values from the corresponding bounds. This process eventually determines unsatisfiability, but it takes \(10^6+1\) steps to do so.

This rather undermines our objective of efficiently combining \(\mathcal {ST}\) with \(\mathcal {D}\). If \(\mathcal {D}\) is not finite, the process may not terminate at all. Consider the case where \(\mathcal {D}(t_2) = \mathcal {D}(t_3) = [0, \infty ]\) – the resulting iterates form an infinite descending chain, where the lower bounds are tightened by one at each iteration step.

The existence of an efficient, general algorithm for normalizing \(\langle \eta , \rho \rangle \) seems doubtful. Even for the specific case of finite intervals, computing the fixpoint of such a system of constraints is NP-complete [3] (in the weak sense – the standard Kleene iteration runs in pseudo-polynomial time). Nevertheless, we can apply the system of terms to \(\rho \) some bounded number of times in an attempt to improve precision; a naive iterative approach is given in Fig. 10.

Fig. 10. Applying a system of terms \(\eta \) to tighten a numeric approximation \(\rho \).
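In the spirit of Fig. 10, a bounded tightening pass might read as follows (a sketch; propagate is an assumed generalization of propagate_add above to arbitrary definitions):

```python
# Bounded, naive tightening: re-apply every definition a fixed number
# of times rather than iterating to a possibly distant fixpoint.
def tighten(rho, defs, max_rounds=3):
    for _ in range(max_rounds):
        old = dict(rho)
        for t, (op, args) in defs.items():
            rho = propagate(rho, op, t, args)   # forward + pre-image
        if rho == old:        # no change: an actual fixpoint
            break
    return rho
```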

In practice, this iteration is wasteful. In an independent attribute domain, applying \([\![\mathtt {t = f(c_1, \ldots , c_k)}]\!]\) cannot directly affect terms not in \(\{t, c_1, \ldots , c_k\}\), and we can easily detect which of these have changed. So we adopt a worklist approach, updating terms with changed abstractions only. The tightening still progresses level by level, to collect the tightest abstraction of each term before re-applying the definitions. The algorithm is outlined in Fig. 11.

Fig. 11. An incremental approach for applying a single iteration of tighten-step.
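A compact sketch of the worklist variant follows (illustrative names; forward is an assumed forward evaluator for a single definition, meet is as before, and the level-by-level scheduling is elided). The comment marks the point \(\dagger \) discussed below.

```python
from collections import deque

# Only definitions whose inputs changed are revisited. parents maps
# each term to the definitions it feeds into.
def tighten_worklist(rho, defs, parents, changed):
    work = deque(changed)
    while work:
        c = work.popleft()
        for t in parents.get(c, ()):
            op, args = defs[t]
            old = rho[t]
            # (dagger) new facts about c are applied to t only,
            # never to c's sibling arguments.
            rho[t] = meet(rho[t], forward(op, [rho[a] for a in args]))
            if rho[t] != old:
                work.append(t)
    return rho
```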

tighten-worklist incrementally applies a single pass of tighten-step, where only terms in X have changed. Given the discussion above, the algorithm obviously misses opportunities for propagation; this loss occurs at the point marked \(\dagger \). Given some definition \([\![\mathtt {t = f(c_1, c_2)}]\!]\) and new information about \(c_1\), we could potentially tighten the abstraction of both t and \(c_2\); however, tighten-worklist only applies this information to t.

It is sound to apply the same algorithm when \(\mathcal {D}\) is relational; however, it may miss further potential tightenings, as additional constraints on some term can be reflected in other, apparently unrelated terms.

Care must be taken when combining normalization with widening. As observed with octagons, applying closure after widening does not in general preserve termination. A useful exception is the standard widening on intervals, which preserves termination when tightening is applied upwards.

5 Other Syntactic Approaches

As mentioned, the closest relatives to the term domain are the symbolic constant domain of Miné [21] and the congruence closure (or alien expression) domain of Chang and Leino [5]. Both domains record a mapping between program variables and terms, with the objective of enriching existing numeric domains.

The term domain can be viewed as a generalization of the symbolic constant domain. Both domains arise from the observation that abstract domains, be they relational or otherwise, exhibit coarse handling of expressions outside their native language – particularly non-linear expressions. And both store a mapping from variables to defining expressions. The primary difference is in the join. Faced with non-equal definitions, the symbolic constant domain discards both entirely. The term domain instead attempts to preserve whatever parts of the computation are shared between the abstract states, which it can then use to improve precision in the underlying domain.

The congruence closure domain [5] arises from a different application – coordinating a heterogeneous set of base abstract domains, each supporting only a subset of expressions appearing in the program. Functions which are alien to a domain are replaced with a fresh variable; equivalences are inferred from the syntactic terms, and added to the base abstract domains. The congruence closure domain assumes the base domains are relational, maintaining a system of equivalences and supported relations. As a result, it assumes the base domain will take care of maintaining relationships between interpreted expressions and the corresponding subterms. Hence it will not help with the examples from Fig. 1.

While the underlying techniques are similar, the objectives (and thus tradeoffs) are quite different. Congruence closure maintains an arbitrary (though finite) system of uninterpreted function equations, allowing multiple – possibly cyclic – definitions for subterms. This potentially preserves more equivalence information than the acyclic system of the subterm domain, but increases the cost and complexity of various operations (notably the join). As far as we know, no experimental evaluation of the congruence-closure domain has been published.

6 Experimental Evaluation

The subterm domain has been implemented in crab, a language-agnostic C++ library of abstract domains and fixpoint algorithms. It is available, with the rest of crab, at https://github.com/seahorn/crab. One purpose of crab is to enhance verification tools by supplying them with inductive invariants that can be expressed in some abstract domain chosen by the client tool. For our experiments we used SeaHorn [13], one of the participants in SV-COMP 2015 [1].

We selected 2304 SV-COMP 2015 programs, in the categories best supported by SeaHorn: ControlFlowInteger, Loops, Sequentialized, DeviceDrivers64, and ProductLines (CFI, Loops, DD64, Seq, PL in Table 3). We first evaluated the performance of the subterm domain by measuring only the time to generate the invariants without running SeaHorn. We compared the subterm domain enhancing intervals \({\mathcal {ST}}(\mathsf {Intv})\) with three other numeric abstract domains: classical intervals \(\mathsf {Intv}\)  [6] (our baseline abstract domain since it was the one used by SeaHorn in SV-COMP 2015), the symbolic constant propagation \({\mathcal {SC}}(\mathsf {Intv})\)  [21], and an optimized implementation of difference-bound matrices using variable packing \({\mathcal {VP}}({\mathsf {DBM}})\)  [24]. Second, we measured the precision gains using \({\mathcal {ST}}(\mathsf {Intv})\) as an invariant supplier for SeaHorn and compared again with \(\mathsf {Intv}\), \({\mathcal {SC}}(\mathsf {Intv})\), and \({\mathcal {VP}}({\mathsf {DBM}})\). All experiments were carried out on an AMD Opteron 6172 processor with 12 cores running at 2.1 GHz, with 32 GB of memory.

Table 2. Performance of several abstract domains on SV-COMP’15 programs

Performance. Table 2(a) shows three scatter plots of analysis times comparing \({\mathcal {ST}}(\mathsf {Intv})\) with \(\mathsf {Intv}\) (left), with \({\mathcal {SC}}(\mathsf {Intv})\) (middle), and with \({\mathcal {VP}}({\mathsf {DBM}})\) (right). Table 2(b) shows additional statistics about the analysis of the 2304 programs. For this experiment, we set a limit of 30 s and 4 GB per program.

crab using \({\mathcal {ST}}(\mathsf {Intv})\), \(\mathsf {Intv}\), and \({\mathcal {SC}}(\mathsf {Intv})\) inferred invariants successfully for all programs without any timeout (column TO in Table 2(b)). The total time (denoted \(\mathsf {T_{total}}\)) indicates that \(\mathsf {Intv}\) was the fastest with 175 s and \({\mathcal {ST}}(\mathsf {Intv})\) the slowest with 456 s. The columns \(\mathsf {T_\mu }\) and \(\mathsf {T_\sigma }\) give the average and standard deviation of analysis time per program, and the column \(\mathsf {T_{max}}\) is the time for the program that took the longest. All domains displayed similar memory usage. Again, \(\mathsf {Intv}\) was the most efficient with an average memory usage per program of 31 MB and a maximum of 1.34 GB, whereas \({\mathcal {ST}}(\mathsf {Intv})\) was the least efficient with an average of 37 MB and a maximum of 1.52 GB.

It is not surprising that \(\mathsf {Intv}\) and \({\mathcal {SC}}(\mathsf {Intv})\) are faster than \({\mathcal {ST}}(\mathsf {Intv})\); interestingly, the evaluation suggests that in practice \({\mathcal {ST}}(\mathsf {Intv})\) incurs only a modest constant-factor overhead of around 2.5. \({\mathcal {VP}}({\mathsf {DBM}})\) was faster than \({\mathcal {ST}}(\mathsf {Intv})\) in many cases but was more volatile, reaching the timeout in 3 cases. This is due to the size of variable packs inferred by \({\mathcal {VP}}({\mathsf {DBM}})\)  [24]. If few interactions are discovered, the packs remain of constant size and the analysis collapses down to \(\mathsf {Intv}\). Conversely, if many variables are found to interact, the analysis degenerates into a single DBM with cubic runtime.

Table 3. SeaHorn results on SV-COMP 2015 enhanced with abstract domains

Precision. Table 3 shows the results obtained running SeaHorn with crab using the four abstract domains. We ran SeaHorn on each verification task and counted the number of tasks solved (i.e., SeaHorn reports “safe” or “unsafe”), shown in the columns labelled #S. The T columns show the total time in seconds for solving all tasks. The top row gives, in parentheses, the number of programs per category. The row labelled Sea+\(\mathsf {Intv}\) shows the number of tasks solved by SeaHorn using the interval domain (our baseline domain) as invariant supplier, while the rows labelled Sea+\({\mathcal {SC}}(\mathsf {Intv})\), Sea+\({\mathcal {ST}}(\mathsf {Intv})\) and Sea+\({\mathcal {VP}}({\mathsf {DBM}})\) are similar but using \({\mathcal {SC}}(\mathsf {Intv})\), \({\mathcal {ST}}(\mathsf {Intv})\) and \({\mathcal {VP}}({\mathsf {DBM}})\), respectively. We set resource limits of 200 s and 4 GB for each task. In all configurations, we ran SeaHorn with Spacer [16] as the back-end solver.

The results in Table 3 demonstrate that the subterm domain can produce significant gains in some categories (e.g., Loops and PL) and stay competitive in all. We observe that \({\mathcal {SC}}(\mathsf {Intv})\) rarely improves upon the results of Sea+\(\mathsf {Intv}\). Two factors appear to contribute to this: the join operation on \({\mathcal {SC}}(\mathsf {Intv})\) maintains only the definitions that are constant on all code paths; and SeaHorn’s frontend (based on LLVM [17]) applies linear constant propagation, subsuming many of the opportunities available to \({\mathcal {SC}}(\mathsf {Intv})\). Our evaluation also shows that the subterm domain helps SeaHorn solve more tasks than \({\mathcal {VP}}({\mathsf {DBM}})\) in several categories. One reason could be that \({\mathcal {VP}}({\mathsf {DBM}})\) does not perform propagation across different packs and so it is less precise than classical DBMs [19] and indeed incomparable with the subterm domain. Another reason might be the more precise modelling of non-linear operations by the subterm domain. Nevertheless, we observed that sometimes \({\mathcal {ST}}(\mathsf {Intv})\) can solve tasks that \({\mathcal {VP}}({\mathsf {DBM}})\) cannot, and vice versa. For PL, for example, Sea+\({\mathcal {ST}}(\mathsf {Intv})\) solved 9 tasks for which Sea+\({\mathcal {VP}}({\mathsf {DBM}})\) reached a timeout but Sea+\({\mathcal {VP}}({\mathsf {DBM}})\) solved 3 tasks that Sea+\({\mathcal {ST}}(\mathsf {Intv})\) missed. This is relevant for tools such as SeaHorn since it motivates the idea of running SeaHorn with a portfolio of abstract domains.

7 Conclusion and Future Work

We have introduced the subterm abstract domain \(\mathcal {ST}\), and outlined its application as a functor domain to improve precision of existing analyses. Experiments on software verification benchmarks have demonstrated that \(\mathcal {ST}\), when used to enrich an interval analysis, can substantially improve generated invariants while only incurring a modest constant factor performance penalty.

The performance of \(\mathcal {ST}\) is obtained by disregarding algebraic properties of operations. Extending \(\mathcal {ST}\) to exploit these properties while preserving performance poses an interesting future challenge.