1 Introduction

Pointer analysis is a key enabling technology underlying many program analysis, software engineering, and compiler optimization tasks. Given a pointer variable p, pointer analysis statically determines the set of all heap objects that p may point to. The result of such an analysis can be used to resolve important program analysis questions, such as whether two pointers can be aliases or whether a heap location may be referenced in a given piece of code.

While existing pointer analysis algorithms differ along many dimensions, a key feature that determines the precision of an algorithm is context-sensitivity. In particular, a context-sensitive analysis respects the call/return semantics of procedure calls and does not yield spurious points-to facts that arise from interprocedurally unrealizable paths. Furthermore, a context-sensitive analysis distinguishes heap objects that are allocated at the same program location, but due to different invocations of the same method. While more precise than context-insensitive ones, context-sensitive algorithms are much harder to scale to real programs, and many existing techniques use approximations of full context-sensitivity. For instance, object-sensitive analyses [14, 15, 23] only distinguish callsites where the receiver objects are different, and k-CFA analyses [12, 22] differentiate contexts by tracking callstrings up to some fixed length k.

Context-sensitivity can be achieved by performing either a top-down or a bottom-up interprocedural analysis. Top-down analyses start at the root methods of a program and analyze callers before callees. In contrast, bottom-up analyses start at leaf methods of the callgraph and analyze callees before callers. Since top-down algorithms analyze every method in a known calling context, they are simpler to design and implement, but they need to re-analyze the same method multiple times under different contexts. In contrast, bottom-up analyses generate a polymorphic method summary that may be used in any calling context to obtain context-sensitive results. While generating a polymorphic points-to summary is trickier than determining points-to information at a particular call site, bottom-up analyses do not need to re-analyze the same method several times and therefore have the potential to scale better. In addition, the results of a bottom-up pointer analysis are reusable: for instance, we can analyze a library just once and reuse its summary for many different clients.

Fig. 1. Code example to illustrate our approach

In this paper, we present a bottom-up context- and field-sensitive pointer analysis algorithm for Java. A key novel feature of our approach is the constraint-based treatment of virtual method calls. Similar to many other approaches, our method starts with an imprecise callgraph and refines the callgraph as points-to facts are discovered. However, we construct the callgraph in a purely bottom-up fashion by predicating points-to facts on the possible dynamic types of the receiver object. As method summaries are propagated up the call chain, these dynamic types are resolved, thereby allowing the refutation of infeasible call targets and spurious points-to facts in a context-sensitive manner.

Another salient feature of our approach is that it can generate polymorphic method summaries without performing expensive case splits on possible aliasing patterns at call sites. In particular, a key challenge in bottom-up pointer analysis is how to generate method summaries that soundly capture the aggregate effect of a call to method m under any possible aliasing relation at m’s call sites. Most previous techniques deal with this difficulty either by performing case splits on all possible aliasing patterns [3, 6] (which can cause exponential blow-up) or by using unification-based methods [10, 13, 27, 28] (which are imprecise compared to subset-based methods). A main advantage of our technique is that it is as precise as subset-based methods despite modeling the unknown state of the heap in a simple and uniform way. In particular, since our technique does not perform strong updates to heap locations, it can soundly account for the callee’s side effects by performing a fixed-point computation during summary instantiation.

Fig. 2. Method summaries computed by our algorithm

We have implemented our algorithm in a tool called Scuba, and we compare its scalability and precision with top-down pointer analysis algorithms implemented in Chord [16]. Our experimental results on programs from the DaCapo and Ashes benchmark suites indicate that Scuba achieves precision that is better than or comparable to k-CFA and k-object-sensitive algorithms at a fraction of the cost.

To summarize, this paper makes the following contributions:

  • We present a bottom-up, subset-based, and context-sensitive pointer analysis for Java. A key novelty of our approach is the handling of virtual method calls using constraint-based techniques.

  • We describe a new method for summarizing and instantiating points-to facts. Unlike previous techniques, our approach does not case-split on aliasing patterns and guarantees soundness by performing a fixed-point computation during summary instantiation.

  • We describe an implementation of our algorithm and compare it with k-CFA and k-obj algorithms on the DaCapo and Ashes benchmarks.

2 Example

This section illustrates our approach on an example that showcases virtual method calls and the need for context-sensitivity. Consider the code shown in Fig. 1, which defines classes X, Y, and A. Here, Y is a subclass of X and overrides X’s bar method. Class A has two instance variables x and y of type X. For concreteness, suppose we want to know whether x.f and y.g can be aliases at the end of a1 and a2.
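
Since the original figure is not included in this text, the following Java sketch reconstructs a program consistent with the description in this section; the field declarations, parameter names, and the body of a2 are assumptions and may differ from the paper's actual Fig. 1.

```java
class Z {}

class X {
  Z f, g;                                  // fields queried for aliasing (assumed here)
  void bar(Z z) { this.f = z; }            // X::bar links field f to its first parameter
}

class Y extends X {
  @Override
  void bar(Z z) { this.g = z; }            // Y::bar overrides bar and writes field g
}

class A {
  X x, y;
  void foo(Z a) {
    x.bar(a);                              // may dispatch to X::bar or Y::bar
    y.bar(a);
  }
  void a1() {
    x = new Y();                           // alloc(Y)@a1:1
    y = x;                                 // x and y share the same object
    Z z = new Z();                         // alloc(Z)@a1:3
    foo(z);
  }
  void a2() { /* analogous to a1 with its own allocations (assumed) */ }
}
```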

Our algorithm starts by analyzing X::bar and Y::bar, which are leaf procedures in the callgraph. The summaries for X::bar and Y::bar are shown in Fig. 2(a) and (b). We depict both method summaries as well as local points-to facts in the form of a graph, where nodes correspond to abstract heap objects and directed edges denote may-point-to relations. Our method summaries only include points-to edges that may be added due to an invocation of the summarized method. In particular, method summaries do not include points-to relations that already exist on method entry, and since our analysis does not apply strong updates, no points-to edges can be removed as a result of analyzing a method. Hence, the summary for a method m can be thought of as a bag of (symbolic) points-to edges that are introduced due to an invocation of m.

Fig. 3. Dashed lines indicate points-to edges added after the foo call.

Now, consider the summary for X::bar shown in Fig. 2(a). We use \(\mathrm{arg}_0\) to denote the this pointer, and \(\mathrm{arg}_1\), \(\mathrm{arg}_2\) to denote the first and second formal parameters. For parameter i, we use \(\mathrm{arg}_i.\epsilon \) to denote the heap object pointed to by \(\mathrm{arg}_i\) on method entry. Hence, according to the summary in Fig. 2(a), X::bar adds a points-to edge via field f from the object pointed to by the this pointer to the object pointed to by X::bar’s first parameter. Note that, at a call site of X::bar, parameter z may have multiple points-to targets; hence, the single edge in X::bar’s summary may introduce multiple points-to edges at a call site. The summary for Y::bar, shown in Fig. 2(b), is very similar.

Now consider method A::foo, whose summary is shown in Fig. 2(c). Since x has type X, the call x.bar(a) could either invoke X::bar or Y::bar. Hence, to analyze the method call, we instantiate the summaries of both X::bar and Y::bar, but the points-to edges induced by instantiating method T::bar are qualified by a constraint that stipulates that the dynamic type of x is T.

As an example, consider the potential target X::bar of the call x.bar(a). Here, the only points-to edge in the summary for X::bar is from the f field of \(\mathrm{arg}_0.\epsilon \) to \(\mathrm{arg}_1.\epsilon \). Since \(\mathrm{arg}_0\) of X::bar corresponds to this.x at the call site, \(\mathrm{arg}_0.\epsilon \) instantiates to \(\mathrm{arg}_0.\epsilon .x\), which denotes the memory location pointed to by this.x. On the other hand, \(\mathrm{arg}_1\) of X::bar corresponds to foo’s own first parameter at the call site; hence \(\mathrm{arg}_1.\epsilon \) translates to a location with the same name, i.e., \(\mathrm{arg}_1.\epsilon \). Therefore, as shown in Fig. 2(c), instantiating the summary of X::bar for this call site induces an edge from \(\mathrm{arg}_0.\epsilon .x\) to \(\mathrm{arg}_1.\epsilon \), but this edge is qualified by the constraint \(\mathrm{type}(\mathrm{arg}_0.\epsilon .x) = \mathrm{X}\).

Continuing with the method invocation y.bar(a) in the second line of A::foo, we again instantiate the summaries of X::bar and Y::bar, since the static type of y is X. However, this time, the location named \(\mathrm{arg}_0.\epsilon \) used in the summaries of X::bar and Y::bar corresponds to the location pointed to by this.y, which is denoted by \(\mathrm{arg}_0.\epsilon .y\) in method foo. Hence, as shown in Fig. 2(c), the summary for foo includes points-to edges from the f and g fields of \(\mathrm{arg}_0.\epsilon .y\) to \(\mathrm{arg}_1.\epsilon \), again qualified by the appropriate type constraints. Observe that our approach is context-sensitive: the two different invocations of bar induce different points-to relations. Also, even though this.x and this.y may alias at a call site of foo, we represent their points-to targets on method entry using two separate locations called \(\mathrm{arg}_0.\epsilon .x\) and \(\mathrm{arg}_0.\epsilon .y\). As we explain shortly, this approach is sound as long as we do not perform strong updates.

Now consider method a1 defined in class A. The solid black edges in Fig. 3(a) denote points-to facts that hold right before the call to foo. In particular, the location named \(\mathrm{arg}_0.\epsilon \) denotes the object pointed to by the this pointer, and both the x and y fields of \(\mathrm{arg}_0.\epsilon \) point to a location called \(\mathrm{alloc(Y)@a1:1}\), which corresponds to the memory allocated at the first line of method a1. Throughout the paper, we use the notation \(\mathrm{alloc(T)@Ctx}\) to denote heap objects of type T that are allocated in context \(\mathrm{Ctx}\).

We now turn to the method invocation foo(z) in a1. Here, \(\mathrm{arg}_1.\epsilon \) in foo’s summary corresponds to the location pointed to by z in a1, which is \(\mathrm{alloc(Z)@a1:3}\). On the other hand, the locations named \(\mathrm{arg}_0.\epsilon .x\) and \(\mathrm{arg}_0.\epsilon .y\) in foo represent the locations pointed to by this.x and this.y in a1, respectively. Following the chain of points-to edges in Fig. 3(a), we see that \(\mathrm{arg}_0.\epsilon .x\) and \(\mathrm{arg}_0.\epsilon .y\) both correspond to the location \(\mathrm{alloc(Y)@a1:1}\). Since this allocation is tagged with type Y, the constraints \(\mathrm{type}(\mathrm{arg}_0.\epsilon .x) = \mathrm{X}\) and \(\mathrm{type}(\mathrm{arg}_0.\epsilon .y) = \mathrm{X}\) evaluate to false, while \(\mathrm{type}(\mathrm{arg}_0.\epsilon .x) = \mathrm{Y}\) and \(\mathrm{type}(\mathrm{arg}_0.\epsilon .y) = \mathrm{Y}\) evaluate to true. Hence, instantiating foo’s summary induces a single points-to edge from the g field of \(\mathrm{alloc(Y)@a1:1}\) to \(\mathrm{alloc(Z)@a1:3}\), shown as the dotted edge in Fig. 3(a). A similar chain of reasoning yields the points-to facts shown in Fig. 3(b) for method a2.

As we can see from Fig. 3, the analysis determines that x.f and y.g are not aliases in either a1 or a2. Observe that an analysis that is either context-insensitive or based on an imprecise callgraph would conclude otherwise. Also, even though there are two different calls to foo and four different calls to bar, observe that our algorithm analyzes each method only once.

3 Conceptual Foundations

Before describing our analysis in detail, we first describe a conceptual framework that lays the foundations of our algorithm. We describe the main ideas using a may-points-to graph, which we refer to as an abstract heap:

Definition 1

(Abstract heap). An abstract heap H is a graph \((N, E)\), where N is a set of nodes corresponding to abstract memory locations, and E is a set of directed edges between nodes labeled with field names or \(\epsilon \). An edge \((o_1, o_2, f)\) indicates that the f field of \(o_1\) may point to \(o_2\).

Here, an abstract memory location represents either the stack location of a variable or a set of heap objects. The edge label \(\epsilon \) is used to model points-to relations from stack locations to heap objects. The root nodes of an abstract heap denote locations of variables, and we write \(root(H)\) for the set of root nodes of H. Given two abstract heaps \(H_1\) and \(H_2\), \(H_1 \cup H_2\) denotes the abstract heap containing the nodes and edges of both \(H_1\) and \(H_2\). Given a heap H and a set of edges E, we write \(H \backslash E\) for the heap that contains all nodes and edges of H except the edges in E.
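
As a concrete illustration of Definition 1 and the operations above, the sketch below encodes an abstract heap as a set of labeled edge triples. This representation (String node names, with "eps" standing for \(\epsilon \)) is our own illustration, not the encoding used by the implementation.

```java
import java.util.*;

// One possible encoding of Definition 1: nodes are abstract locations and
// each edge (src, dst, field) says "the field of src may point to dst";
// the label "eps" models edges from variables to heap objects.
final class AbstractHeap {
  record Edge(String src, String dst, String field) {}

  final Set<String> nodes = new HashSet<>();
  final Set<Edge> edges = new HashSet<>();
  final Set<String> roots = new HashSet<>();     // variable (stack) locations

  void addEdge(String src, String dst, String field) {
    nodes.add(src);
    nodes.add(dst);
    edges.add(new Edge(src, dst, field));        // may-point-to: edges only accumulate
  }

  // H1 ∪ H2: union of nodes, edges, and roots
  AbstractHeap union(AbstractHeap other) {
    AbstractHeap h = new AbstractHeap();
    h.nodes.addAll(nodes);  h.nodes.addAll(other.nodes);
    h.edges.addAll(edges);  h.edges.addAll(other.edges);
    h.roots.addAll(roots);  h.roots.addAll(other.roots);
    return h;
  }

  // H \ E: remove the given edges but keep all nodes
  AbstractHeap minus(Set<Edge> remove) {
    AbstractHeap h = union(new AbstractHeap());  // copy
    h.edges.removeAll(remove);
    return h;
  }
}
```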

3.1 Normalization of Abstract Heaps

Given an abstract heap H, we define a normalization operation \(\mathbb {N}(H)\), which yields a normalized heap \(H^*\) and a mapping \(\zeta \) from nodes in H to nodes in \(H^*\).

Definition 2

(Normal form). Given heap \(H = (N, E)\), \(\mathbb {N}(H)\) yields normalized heap \(H^* = (N^*, E^*)\) and mapping \(\zeta : N \rightarrow 2^{N^*}\) such that:

  1. If \(x \in root (H)\), then \(x \in N^*\) and \(\zeta (x) = \{ x \}\).

  2. If \((o, o', f) \in E\) and \(o^* \in \zeta (o)\), then \(o^*.f \in N^*\), \(o^*.f \in \zeta (o')\), and \((o^*, o^*.f, f) \in E^*\).

We use the notation \(\mathbb {N}(H) = H^*\) to indicate that \(H^*\) is in normal form, and we write \(Map (H, H^*) = \zeta \) to indicate that \(\zeta \) maps nodes of H to nodes in \(H^*\).

Fig. 4. An abstract heap and its normal form

Example 1

Consider the abstract heap H shown in Fig. 4(a) and its normal form \(H^*\) in Fig. 4(b). Here, \(Map (H, H^*)\) yields the following mapping \(\zeta \):

$$\begin{aligned} \small \begin{array}{lll} \zeta (x) = \{ x\} & \zeta (y) = \{ y \} & \zeta (A) = \{ x.\epsilon \} \\ \zeta (B) = \{ x.\epsilon , y.\epsilon \} & \zeta (C) = \{ y.\epsilon \} & \zeta (D) = \{ x.\epsilon .f \} \\ \zeta (E) = \{ x.\epsilon .f, y.\epsilon .f \} & & \zeta (F) = \{ x.\epsilon .f, y.\epsilon .f \} \end{array} \end{aligned}$$

We use heap normal forms to model the heap on entry to a method. In particular, \(H^*\) corresponds to a “generic” heap that represents the unknown points-to target of object o’s f field by the location o.f. While the mapping \(\zeta \) from abstract heap H to its normal form \(H^*\) differs for each call site, the normalized heap for a method m is the same irrespective of calling context. Observe that no two abstract memory locations alias each other in a normalized heap, and every location has exactly one points-to target for a given field.

Given an abstract heap \(H = (N, E)\) and its normal form \(H^* = (N^*, E^*)\) such that \(Map (H, H^*) = \zeta \), we define \(\zeta ^{-1}\) to be a mapping from \(N^*\) to \(2^N\) such that \(n \in \zeta ^{-1}(n^*)\) iff \(n^* \in \zeta (n)\). In general, there is a many-to-many relationship between the nodes of an abstract heap and its corresponding normal form.

Example 2

Consider the heap from Example 1. We have:

$$\begin{aligned} \small \begin{array}{lll} \zeta ^{-1}(x) = \{ x\} & \zeta ^{-1}(y) = \{ y \} & \zeta ^{-1}(x.\epsilon ) = \{ A, B \} \\ \zeta ^{-1}(y.\epsilon ) = \{ B, C \} & \zeta ^{-1}(x.\epsilon .f) = \{ D, E, F\} & \zeta ^{-1}(y.\epsilon .f) = \{ E, F \} \\ \end{array} \end{aligned}$$

The mapping \(\zeta ^{-1}\) is important in summary-based analysis because it allows us to instantiate a method summary to a particular abstract heap at a call site. We use the notation \(\overline{\zeta ^{-1}}\) to denote the extension of \(\zeta ^{-1}\) that maps any element that is not in the domain of \(\zeta ^{-1}\) to itself. Given heap \(H^* = (N^*, E^*)\), we also write \(\zeta ^{-1}(H^*)\) to denote a heap \(H = (N, E)\) where \((o_1, o_2, f) \in E\) iff there exists an edge \((o^*_1, o^*_2, f) \in E^*\) such that \(o_1 \in \zeta ^{-1}(o^*_1)\) and \(o_2 \in \zeta ^{-1}(o^*_2)\).
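
Read operationally, Definition 2 is a worklist closure, and \(\zeta ^{-1}\) is its inversion. The sketch below (our own illustration over String node names) computes both; a depth bound is needed because the normal form of a cyclic heap has infinitely many nodes, a problem the full analysis instead avoids by collapsing recursive fields (Sect. 4.2).

```java
import java.util.*;

// zeta: maps each node of H to its names in the normal form H* (Definition 2).
// Nodes are Strings; normal-form names are dotted paths such as "x.eps.f".
final class Normalizer {
  record Edge(String src, String dst, String field) {}

  static Map<String, Set<String>> zeta(Set<String> roots, Set<Edge> edges, int maxDepth) {
    Map<String, Set<String>> zeta = new HashMap<>();
    for (String r : roots)
      zeta.computeIfAbsent(r, k -> new HashSet<>()).add(r);     // clause 1
    boolean changed = true;
    while (changed) {                                           // close under clause 2
      changed = false;
      for (Edge e : edges)
        for (String srcStar : List.copyOf(zeta.getOrDefault(e.src(), Set.of()))) {
          String dstStar = srcStar + "." + e.field();           // the node o*.f
          if (dstStar.split("\\.").length > maxDepth) continue; // cut off cyclic heaps
          changed |= zeta.computeIfAbsent(e.dst(), k -> new HashSet<>()).add(dstStar);
        }
    }
    return zeta;
  }

  // zeta^{-1}: n is in zeta^{-1}(n*) iff n* is in zeta(n)
  static Map<String, Set<String>> invert(Map<String, Set<String>> zeta) {
    Map<String, Set<String>> inv = new HashMap<>();
    zeta.forEach((n, stars) -> stars.forEach(
        s -> inv.computeIfAbsent(s, k -> new HashSet<>()).add(n)));
    return inv;
  }
}
```

Running zeta on the heap of Example 1 (roots x and y, with the edges of Fig. 4(a)) reproduces the mapping \(\zeta \) shown above, and invert reproduces \(\zeta ^{-1}\) of Example 2.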

Definition 3

(Default Edge). We say an edge \((n, n', f)\) is a default edge of an abstract heap if \(n' = n.f\).

Given a heap H, we write \(default (H)\) to denote the set of default edges in H.

Fig. 5. Basic structure of summary-based analysis

3.2 Summary-Based Pointer Analysis

We now explain the basic idea underlying our summary-based analysis, assuming a family of pointer analyses that are sound and weakly-updating. Given code snippet S, an abstract heap H, and pointer analysis A, we write \(H' = Analyze (H, S, A)\) to indicate that, if statement S is executed in an environment that satisfies abstract heap H, then analyzing code S using pointer analysis A yields a heap \(H'\) which conservatively models the concrete heap after S.

Fig. 6. Core language used for our formalization

The basic structure of our summary-based pointer analysis is shown in Fig. 5. The algorithm SumAnalyze takes as input an abstract heap H, a code snippet S, and a weakly-updating pointer analysis A, and works as follows. Line (1) constructs the normalized heap \(H^*\) representing the unknown state of the heap before executing S, and line (2) analyzes S without making any assumptions about points-to facts that hold before S. Line (3) generates a polymorphic points-to summary \(\varDelta \) which characterizes side effects of code S. Lines (4)–(11) instantiate the summary \(\varDelta \) by performing a fixed-point computation. Finally, \(H'\) at line (12) models the state of the heap after S when S is executed in an environment satisfying H.

Before discussing details, let us first understand in what way this algorithm is “summary-based”. Since \(H^*\) can be constructed in a context-independent manner, we can analyze S in isolation and compute its side effects without knowing the points-to facts that hold before S. Hence, lines (1)–(3) in Fig. 5 correspond to summary generation. On the other hand, lines (4)–(11) perform summary instantiation by computing the context-specific mapping \(\zeta _0\) and by adding all points-to edges that represent S’s side effects.

The most involved part of the above algorithm is the fixed-point computation at lines (4)–(11). Intuitively, the algorithm maps each edge in the summary to a set of edges at the call site using the mapping \(\overline{\zeta ^{-1}}\) and adds these edges to the initial abstract heap H. However, as new edges are added to H, the mapping \(\zeta \) must be recomputed, since locations used in the summary may map to additional locations after the summary has been applied. Hence, \(H'\) is recomputed until the mapping \(\zeta \) from \(H'\) to \(H^*\) stabilizes. It is easy to see that SumAnalyze is sound because (i) the underlying pointer analysis does not apply strong updates to memory locations, and (ii) the summary is applied until a fixed point is reached. In particular, observe that \(H'\) overapproximates H and \(\{ H' \} S \{H'\}\) is a valid Hoare triple [8].
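
Since Fig. 5 is not reproduced here, the following sketch reconstructs SumAnalyze from the line-by-line description above, reusing the Normalizer sketch from Sect. 3.1. The signatures, the `defaults` parameter (the default edges of \(H^*\)), and `analyzeS` (standing for Analyze applied to S) are our own assumptions, not the paper's exact pseudocode.

```java
import java.util.*;
import java.util.function.UnaryOperator;

// Reconstruction of the SumAnalyze loop of Fig. 5 over edge-set heaps.
final class SumAnalyzeSketch {
  static Set<Normalizer.Edge> sumAnalyze(Set<String> roots,
                                         Set<Normalizer.Edge> H,
                                         Set<Normalizer.Edge> defaults,       // H*, line (1)
                                         UnaryOperator<Set<Normalizer.Edge>> analyzeS,
                                         int depthBound) {
    Set<Normalizer.Edge> delta = new HashSet<>(analyzeS.apply(defaults));     // line (2)
    delta.removeAll(defaults);                   // line (3): summary = new edges only

    Set<Normalizer.Edge> Hp = new HashSet<>(H);  // lines (4)-(11): instantiate delta
    Map<String, Set<String>> prev = null;
    while (true) {
      Map<String, Set<String>> zetaInv =
          Normalizer.invert(Normalizer.zeta(roots, Hp, depthBound));
      if (zetaInv.equals(prev)) break;           // the mapping zeta has stabilized
      prev = zetaInv;
      for (Normalizer.Edge e : delta)            // one summary edge -> many caller edges
        for (String s : zetaInv.getOrDefault(e.src(), Set.of(e.src())))
          for (String d : zetaInv.getOrDefault(e.dst(), Set.of(e.dst())))
            Hp.add(new Normalizer.Edge(s, d, e.field()));   // weak update: add only
    }
    return Hp;                                   // line (12): heap after S under H
  }
}
```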

The reader may wonder why it is necessary to re-apply the summary until the mapping \(\zeta \) reaches a fixed point in Fig. 5. This is necessary because our summary \(\varDelta \) encodes all possible side effects of code snippet S, but not the order in which they happen. Hence, while the fixed point computation is required for soundness, an immediate corollary is that the procedure SumAnalyze can be less precise than Analyze if the underlying pointer analysis A is flow-sensitive.

4 Formalization of Algorithm

While the previous section describes the core ideas of the analysis, it omits many important details. In this section, we describe our full algorithm using the core object-oriented language of Fig. 6. Here, a program consists of one or more class declarations C, each of which defines a class \(T_1\) with an optional superclass \(T_2\). Instructions include assignments, loads, stores, heap allocations (marked with a unique program point \(\rho \)), non-deterministic conditionals, sequences, static method calls \(m^\rho @T(...)\) (also marked with program point \(\rho \)), and virtual method calls \(v_0. m^\rho (\ldots )\). We assume that the first argument of a method is always the this pointer, and that if a class T inherits a method m from its superclass \(T'\), then T also contains a definition of m with the same implementation.

4.1 Abstract Domains

Figure 7 shows the abstract domains that we need for describing our algorithm. Since our analysis is bottom-up, we differentiate between two kinds of heap objects o: Access paths of the form \(a_i.\eta \) represent (caller-allocated) unknown heap objects reachable through the i’th argument, whereas objects named \(\mathrm{alloc}(T)@\mathbf {\rho }\) represent heap objects of type T that are allocated either in the currently analyzed method or in a transitive callee. Specifically, \(a_i.f_1...f_n\) denotes the unknown locations reachable on method entry through a series of field accesses \(f_1...f_n\) from the i’th argument. We use the notation \(a_i.(f_1...f_n)^\star \) to denote all unknown locations reachable from \(a_i\) through any combination of field selectors \(f_1, ..., f_n\). For instance, \(a_0.(f.g)^\star \) represents the infinite set of access paths \(a_0.f\), \(a_0.g\), \(a_0.f.f\), \(a_0.g.f\) and so on. As we will see in Sect. 4.3, access paths in our analysis correspond to node labels of the normalized heap from Sect. 3.

Fig. 7. Abstract domains used in our analysis

Abstract memory locations \(\pi \) are either heap objects o or stack locations. In particular, \(a_i\) denotes the stack location of the i’th argument, and \(v_i@\mathbf {\rho }\) denotes the location of local variable \(v_i\) under context \(\mathbf {\rho }\). We represent calling contexts using a sequence of program points \(\rho _1,\ldots , \rho _n\), where each \(\rho _i\) corresponds to some call or allocation site. For instance, a memory location named \(\mathrm{alloc}(T)@\rho _1\rho _2\) corresponds to a heap object allocated at program point \(\rho _2\) of some method m which is invoked at call site \(\rho _1\). Similarly, \(v@\rho _1\) denotes the local variable v of some method m when m is invoked at callsite \(\rho _1\). Since our analysis builds contexts in a bottom-up way, local variables declared in the currently analyzed method m do not have any context information; hence, we abbreviate the locations of locals in the current method as \(v_i\).

Definition 4

(Argument-Derived Location). We say location \(\pi \) is derived from an argument, written \(arg (\pi )\), if it is either (i) \(a_i\) representing the location of the i’th argument or (ii) a heap object represented with an access path \(a_i.\eta \).

An abstract heap \(\varGamma \) maps each field f of location \(\pi \) to a points-to set. A (guarded) points-to set \(\theta \) is a set of pairs \((o, \phi )\) where o is a heap object and \(\phi \) is a constraint. As discussed in Sect. 2, we use constraints to predicate points-to facts on dynamic types of receivers. Constraints \(\phi \) belong to the theory of equality with uninterpreted functions, defined according to the following grammar:

$$\begin{aligned} \small \begin{array}{lll} \mathrm{Function} \ f &:=& \mathrm{pts} \ | \ \mathrm{alloc} \ | \ \varsigma _i \\ \mathrm{Term} \ t &:=& c \ | \ v \ | \ f(\mathbf {t}) \\ \mathrm{Formula} \ \phi &:=& \top \ | \ \bot \ | \ \mathrm{type}(t) = T \ | \ \phi _1 \wedge \phi _2 \ | \ \phi _1 \vee \phi _2 \end{array} \end{aligned}$$

Here, terms include constants c, variables v, and function applications \(f(\mathbf {t})\), where f is one of the binary functions \(\mathrm{pts}\) and \(\mathrm{alloc}\) or an n-ary function drawn from \(\varsigma _1, \ldots , \varsigma _k\). Formulas are composed of \(\top \) (true), \(\bot \) (false), and conjunctions and disjunctions of equality constraints of the form \(\mathrm{type}(t) = T\), where T is a type constant. In addition to the usual function and equality axioms, the \(\mathrm{alloc}\) and \(\mathrm{type}\) functions obey the additional axiom \(\mathrm{type}(\mathrm{alloc}(T, \mathbf {\rho })) = T\) for every type T and context \(\mathbf {\rho }\), which states that the type of an allocation of type T is T.

Since we will convert heap objects to terms, we define an operation called lift(\(\pi \)), abbreviated \(\overline{\pi }\), as follows:

$$\begin{aligned} \small \overline{a_i} = a_i \ \ \ \overline{\mathrm{alloc}(T)@\mathbf {\rho }} = \mathrm{alloc}(T, \mathbf {\rho }) \ \ \ \overline{\pi .f} = \mathrm{pts}(\overline{\pi }, f) \ \ \ \overline{\pi .(\mathbf {f})^\star } = \varsigma _i(\overline{\pi }, \mathbf {f}) \ (\varsigma _i \ \mathrm{fresh}) \end{aligned}$$

We also assume an operation \(\mathrm{lift}^{-1}(t)\), the inverse of \(\mathrm{lift}(\pi )\): given a term t, \(\mathrm{lift}^{-1}(t)\) yields the abstract memory location represented by that term.

Example 3

The constraint \(\mathrm{type}(\overline{a.\epsilon .f}) = A \wedge \mathrm{type}(\overline{a.\epsilon .f}) = B\) is unsatisfiable since the dynamic type of \(a.\epsilon .f\) cannot simultaneously be A and B. But the constraint \(\mathrm{type}(\overline{a.\epsilon .g^\star }) = A \wedge \mathrm{type}(\overline{a.\epsilon .g^\star }) = B\), which translates to \(\mathrm{type}(\varsigma _1(\mathrm{pts}(a, \epsilon ), g)) = A \wedge \mathrm{type}(\varsigma _2(\mathrm{pts}(a, \epsilon ), g)) = B\), is satisfiable because two distinct occurrences of \(a.\epsilon .g^\star \) may correspond to different objects (e.g., \(a.\epsilon .g\) and \(a.\epsilon .g.g\)).
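Scuba checks such constraints with the Z3 SMT solver (Sect. 5). As a rough illustration only (the paper does not show its actual encoding), Example 3's two checks can be phrased with Z3's Java bindings as follows; the sort and symbol names here are made up for the example.

```java
import com.microsoft.z3.*;

public class TypeConstraintDemo {
  public static void main(String[] args) {
    Context ctx = new Context();
    Sort loc = ctx.mkUninterpretedSort("Loc");
    Sort ty  = ctx.mkUninterpretedSort("Type");
    FuncDecl typeOf = ctx.mkFuncDecl("type", loc, ty);
    Expr a = ctx.mkConst("A", ty), b = ctx.mkConst("B", ty);

    // type(a.eps.f) = A  /\  type(a.eps.f) = B : one fixed location, two types
    Expr aEpsF = ctx.mkConst("a.eps.f", loc);
    Solver s1 = ctx.mkSolver();
    s1.add(ctx.mkDistinct(a, b));                      // A and B are distinct types
    s1.add(ctx.mkEq(typeOf.apply(aEpsF), a));
    s1.add(ctx.mkEq(typeOf.apply(aEpsF), b));
    System.out.println(s1.check());                    // UNSATISFIABLE

    // With a starred path, each occurrence becomes a fresh term (sigma_1, sigma_2)
    Expr o1 = ctx.mkConst("sigma_1", loc), o2 = ctx.mkConst("sigma_2", loc);
    Solver s2 = ctx.mkSolver();
    s2.add(ctx.mkDistinct(a, b));
    s2.add(ctx.mkEq(typeOf.apply(o1), a));
    s2.add(ctx.mkEq(typeOf.apply(o2), b));
    System.out.println(s2.check());                    // SATISFIABLE
  }
}
```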

We now define a function has_type\((\theta , T)\) which generates a constraint that evaluates to true if some element in points-to set \(\theta \) can have dynamic type T:

Definition 5

(has_type\((\theta , T)\)). Given a points-to set \(\theta \), the function has_type\((\theta , T)\) yields the following constraint:

$$\begin{aligned} \small \bigvee _{ (\pi _i, \phi _i) \in \theta } ((\mathrm{type}(\overline{\pi _i}) = T) \wedge \phi _i) \end{aligned}$$

Now, going back to Fig. 7, an environment \(\varUpsilon \) maps each method M in class T to its corresponding summary, which is an abstract heap \(\varGamma \) summarizing M’s side effects. Applying the summary at a call site allows us to determine points-to relations of the caller without having to reanalyze the callee. In addition, our method summaries include points-to information for locals in the summarized method. In particular, this design choice allows us to determine points-to sets for all program variables without employing a separate top-down pass.

4.2 Operations on Abstract Domains

In this section, we describe some operations on abstract domains that simplify the description of our algorithm. Since our algorithm constructs the initial heap on method entry in a demand-driven way, we first define default targets for argument-derived locations:

Definition 6

(Default Target). Given an argument-derived location \(\pi \) and a field f, the default target of the f field of \(\pi \), written \(\mathrm{def}(\pi , f)\), is given as follows:

$$\begin{aligned} \small \mathrm{def}(\pi , f) = \left\{ \begin{array}{lll} \pi & \ \mathrm{if}\ \pi = \pi '.(\mathbf {f})^\star \ \mathrm{and} \ f \in \mathbf {f} & (1) \\ \pi '.(f.\mathbf {g})^\star & \ \mathrm{if}\ \pi = \pi '.f.\mathbf {g} & (2) \\ \pi .f & \ \mathrm{otherwise} & (3) \end{array} \right. \end{aligned}$$

In other words, if field f is not part of a recursive field cycle (case 3), then the default target for field f of an argument-derived location \(\pi \) is \(\pi .f\), just like in the normal-form heaps from Sect. 3. However, if f is part of a recursive field cycle, then our analysis collapses this cycle into a single abstract memory location (cases 1–2). For example, \(\mathrm{def}(a.\mathrm{next}, \mathrm{next})= a.\mathrm{next}^\star \) (case 2), and \(\mathrm{def}(a.\mathrm{next}^\star , \mathrm{next})= a.\mathrm{next}^\star \) (case 1). The summarization of recursive field cycles into access paths of the form \(a.\mathbf {f}^\star \) is needed to ensure termination of the fixed-point computation performed by our algorithm.
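
To make Definition 6 concrete, the sketch below implements def\((\pi , f)\) over a simplified access-path representation with at most one trailing cycle set; both the representation and the coarsening in the second branch are our own simplifications, not the tool's.

```java
import java.util.*;

// Simplified access paths: base argument (e.g. "a0"), a list of plain field
// selectors, and an optional trailing cycle set (f1...fn)*.
final class AccessPath {
  final String base;
  final List<String> fields;
  final Set<String> cycle;          // empty set = no trailing (..)*

  AccessPath(String base, List<String> fields, Set<String> cycle) {
    this.base = base;
    this.fields = List.copyOf(fields);
    this.cycle = Set.copyOf(cycle);
  }

  // def(pi, f) following the three cases of Definition 6
  AccessPath def(String f) {
    if (cycle.contains(f))
      return this;                                        // case (1): stay in the cycle
    if (!cycle.isEmpty()) {                               // pi.(..)*.f not expressible
      Set<String> c = new HashSet<>(cycle);               // here: fold f into the cycle,
      c.add(f);                                           // a coarser but sound choice
      return new AccessPath(base, fields, c);
    }
    int i = fields.indexOf(f);
    if (i >= 0) {                                         // case (2): f recurs, collapse
      Set<String> c = new HashSet<>(fields.subList(i, fields.size()));
      c.add(f);
      return new AccessPath(base, fields.subList(0, i), c);
    }
    List<String> ext = new ArrayList<>(fields);           // case (3): plain pi.f
    ext.add(f);
    return new AccessPath(base, ext, cycle);
  }

  @Override public String toString() {
    StringBuilder sb = new StringBuilder(base);
    for (String g : fields) sb.append('.').append(g);
    if (!cycle.isEmpty()) sb.append(".(").append(String.join(".", cycle)).append(")*");
    return sb.toString();
  }
}
```

For instance, starting from the path a0.next, calling def("next") yields a0.(next)*, and calling def("next") again leaves it unchanged, mirroring the two examples above.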

Next, we define a field lookup operation on abstract heaps \(\varGamma \):

Definition 7

(Field Look-up). Given heap \(\varGamma \), field f, and location \(\pi \), the field lookup operation \(\varGamma [\pi , f]\) retrieves the points-to target for \(\pi \)’s f field:

$$\begin{aligned} \small \varGamma [\pi , f] = \left\{ \begin{array}{ll} \varGamma (\pi , f) \cup \{ (\mathrm{def}(\pi , f), \top )\} & \mathrm{if} \ arg (\pi ) \\ \varGamma (\pi , f) & \mathrm{otherwise} \end{array} \right. \end{aligned}$$

Since our algorithm does not explicitly add default edges to the abstract heap, \(\varGamma [\pi , f]\) always yields \(\mathrm{def}(\pi , f)\) as part of the points-to set of \(\pi .f\) if \(\pi \) is an argument-derived location. Now, since our analysis performs weak updates, we need to merge points-to sets using the following join operator:

Definition 8

(Join \(\sqcup \) of points-to sets \(\theta _1, \theta _2\) )

$$\begin{aligned} \small (\theta _1 \sqcup \theta _2)(o) = \left\{ \begin{array}{l} \theta _1(o) \vee \theta _2(o) \ \ \mathrm{if} \ o \in dom(\theta _1) \cap dom(\theta _2) \\ \theta _1(o) \ \ \mathrm{if} \ o \in dom(\theta _1) \ \mathrm{and} \ o \not \in dom(\theta _2) \\ \theta _2(o) \ \ \mathrm{if} \ o \in dom(\theta _2) \ \mathrm{and} \ o \not \in dom(\theta _1) \end{array} \right. \end{aligned}$$

Observe that if an object o is in both points-to sets \(\theta _1\) and \(\theta _2\), we take the disjunction of the constraints associated with o. We also extend this join operator to abstract heaps in the expected way: for a location \(\pi \) and field f, \((\varGamma _1 \sqcup \varGamma _2)(\pi , f)\) yields \(\varGamma _1(\pi , f) \sqcup \varGamma _2(\pi , f)\). In our analysis, we sometimes need to predicate points-to information on constraints. For this purpose, the operation \(\theta \downarrow \phi \) conjoins \(\phi \) with every constraint in \(\theta \):

Definition 9

(Projection of \(\theta \) on \(\phi \) ). \( \theta \downarrow \phi = \{ (\pi _i, \phi _i \wedge \phi ) \ | \ (\pi _i, \phi _i) \in \theta \} \)

Finally, we extend the field lookup operation on points-to sets as follows:

Definition 10

(Field lookup for pts-to set). \( \small \varGamma [\theta , f] = \bigsqcup _{(\pi _i, \phi _i) \in \theta } \varGamma [\pi _i, f] \downarrow \phi _i \)

That is, \(\varGamma [\theta , f]\) includes the points-to target of every element in \(\theta \) under the appropriate constraints.
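
Operationally, Definitions 8–10 are map merges over guarded points-to sets. The sketch below represents a points-to set as a map from object names to guards and keeps the guards abstract; all names and types are illustrative rather than the implementation's.

```java
import java.util.*;
import java.util.function.BiFunction;

// Guarded points-to sets as maps from object names to guards (Defs. 8-10).
final class GuardedPts {
  interface Phi {}                                  // abstract constraint
  record Or(Phi l, Phi r) implements Phi {}
  record And(Phi l, Phi r) implements Phi {}

  // Definition 8: join; guards of shared objects are disjoined
  static Map<String, Phi> join(Map<String, Phi> t1, Map<String, Phi> t2) {
    Map<String, Phi> out = new HashMap<>(t1);
    t2.forEach((o, phi) -> out.merge(o, phi, Or::new));
    return out;
  }

  // Definition 9: projection; phi is conjoined onto every guard
  static Map<String, Phi> project(Map<String, Phi> theta, Phi phi) {
    Map<String, Phi> out = new HashMap<>();
    theta.forEach((o, g) -> out.put(o, new And(g, phi)));
    return out;
  }

  // Definition 10: field lookup lifted to a guarded set; lookupLoc stands for
  // the per-location lookup Gamma[pi, f] of Definition 7.
  static Map<String, Phi> lookup(Map<String, Phi> theta, String f,
                                 BiFunction<String, String, Map<String, Phi>> lookupLoc) {
    Map<String, Phi> out = new HashMap<>();
    for (Map.Entry<String, Phi> e : theta.entrySet())
      out = join(out, project(lookupLoc.apply(e.getKey(), f), e.getValue()));
    return out;
  }
}
```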

4.3 Intraprocedural Analysis

Figure 8 describes the intraprocedural analysis using judgements of the form \( \varUpsilon , \varGamma \vdash I: \varGamma ' \), which indicate that, if statement I is executed in an environment that satisfies summary environment \(\varUpsilon \) and abstract heap \(\varGamma \), we obtain a new heap \(\varGamma '\). Since the analysis is (partially) flow-sensitive, we distinguish between the heaps \(\varGamma \) and \(\varGamma '\) before and after executing I.

Fig. 8. Rules for intraprocedural analysis

Rule (1) in Fig. 8 describes the analysis of assignments. Although our analysis only performs weak updates to heap objects, it does apply strong updates to variables. Hence, rule (1) updates the points-to set for \((v_1, \epsilon )\) to be \(\varGamma [v_2, \epsilon ]\), where the lookup operation is defined in Sect. 4.2. Rule (2) for memory allocations \(v = \mathrm{new}^\rho \ T\) introduces a new abstract location named \(\mathrm{alloc}(T)@\rho \) and updates v’s points-to set to this singleton.

Rule (3) concerns loads of the form \(v_1 = v_2.f\). Here, we first look up the points-to set \(\theta \) of \(v_2\) and then use \(\varGamma [\theta , f]\) to retrieve the targets of memory locations in \(\theta \). Finally, since our analysis applies strong updates to variables, we override \(v_1\)’s existing targets and change its points-to set to \(\varGamma [\theta , f]\).

Rule (4) analyzes stores \(v_1.f = v_2\). First, we look up the points-to sets \(\theta _1\) and \(\theta _2\) of \(v_1\) and \(v_2\). Now, the store operation will update every location \(o_i\) such that \((o_i, \phi _i) \in \theta _1\). However, since we apply only weak updates to heap objects, we preserve the existing points-to targets \(\varGamma (o_i, f)\) for each \(o_i\). Furthermore, since \(v_1\) points to \(o_i\) under constraint \(\phi _i\), \(o_i\) points to elements in \(\theta _2\) only when \(\phi _i\) holds. Hence, the new points-to set for \(o_i\) is given by \((\theta _2 \downarrow \phi _i) \sqcup \varGamma (o_i, f)\), where the \(\downarrow \) operation is given by Definition 9. Since rules (5) and (6) for if statements and sequencing are fairly standard, we do not describe them in detail.
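
Rules (1)–(4) can be read as the transfer functions sketched below, over a heap that maps a location and field to a guarded points-to set. For readability the guards are plain booleans here rather than the symbolic constraints of Sect. 4.1, and all names are illustrative.

```java
import java.util.*;

// Transfer functions for rules (1)-(4) of Fig. 8. Gamma maps "loc|field" keys
// to points-to sets (object -> guard); guards are booleans for brevity.
final class Transfer {
  final Map<String, Map<String, Boolean>> gamma = new HashMap<>();

  Map<String, Boolean> get(String loc, String f) {
    return gamma.getOrDefault(loc + "|" + f, Map.of());
  }

  // Rule (1): v1 = v2 -- strong update on the variable v1
  void assign(String v1, String v2) {
    gamma.put(v1 + "|eps", new HashMap<>(get(v2, "eps")));
  }

  // Rule (2): v = new^rho T -- v points to the singleton alloc(T)@rho
  void alloc(String v, String T, String rho) {
    gamma.put(v + "|eps", new HashMap<>(Map.of("alloc(" + T + ")@" + rho, true)));
  }

  // Rule (3): v1 = v2.f -- dereference v2, then strongly update v1
  void load(String v1, String v2, String f) {
    Map<String, Boolean> targets = new HashMap<>();
    for (String o : get(v2, "eps").keySet())
      targets.putAll(get(o, f));
    gamma.put(v1 + "|eps", targets);
  }

  // Rule (4): v1.f = v2 -- weak update: existing targets of o.f are preserved
  void store(String v1, String f, String v2) {
    for (String o : get(v1, "eps").keySet())
      gamma.computeIfAbsent(o + "|" + f, k -> new HashMap<>())
           .putAll(get(v2, "eps"));
  }
}
```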

Fig. 9. Rules for instantiating memory locations

4.4 Interprocedural Analysis

We now describe the instantiation of summaries at call sites. Since a key part of summary instantiation is constructing the mapping from locations in the summary to those at the call site, we first start with the rules in Fig. 9 which describe the instantiation of memory locations. Informally, the rules of Fig. 9 construct the mapping \(\zeta ^{-1}\) from Sect. 3. More formally, they produce judgements of the form \( \mathcal {M}, \varGamma , \rho \vdash inst\_loc (\pi ): \theta \) where \(\mathcal {M}\) maps formals to actuals, and \(\varGamma \) and \(\rho \) are the abstract heap and program point associated with a call site respectively. The meaning of the judgement is that, under \(\mathcal {M}, \varGamma , \rho \), location \(\pi \) used in the summary maps to (guarded) location set \(\theta \).

The first rule of Fig. 9 maps the formal parameter \(a_i\) to the actual \(\mathcal {M}(a_i)\). The second rule instantiates argument-derived locations of the form \(\pi .f\). For this purpose, we first instantiate the prefix \(\pi \) to location set \(\theta \) and then retrieve the points-to targets of the f field of the locations in \(\theta \). The third rule instantiates access paths of the form \(\pi .(f_1 \ldots f_n)^\star \). As in the previous rule, we first instantiate the prefix \(\pi \), which yields \(\theta _0\). Now, recall that the access path \(\pi .(f_1 \ldots f_n)^\star \) describes the infinite set of access paths given by the regular expression \(\pi .(f_1 + \ldots + f_n)^*\). Hence, to instantiate \(\pi .(f_1 \ldots f_n)^\star \), we need to compute all locations that are reachable from \(\theta _0\) using any combination of the field selectors \(f_1, \ldots , f_n\). The resulting set \(\bigsqcup _{i\ge 0} \theta _i\) is the reflexive transitive closure of \(\theta _0\) with respect to fields \(f_1, \ldots , f_n\).

The last two rules in Fig. 9 describe the instantiation of allocations and local variables. Both rules use a helper \(\mathrm{new\_ctx}\) method defined as follows:

$$\begin{aligned} \small \mathrm{new\_ctx}(\rho , \mathbf {\rho }) = \left\{ \begin{array}{ll} \rho , \mathbf {\rho } & \mathrm{if} \ |\mathbf {\rho }| \le k \\ \mathbf {\rho } & \mathrm{otherwise} \end{array} \right. \end{aligned}$$

In other words, new_ctx prepends call site \(\rho \) to context \(\mathbf {\rho }\) if the length of \(\mathbf {\rho }\) does not exceed a pre-determined threshold k. Hence, our analysis uses a k-CFA-style context-sensitive heap abstraction where the value of k is configurable.
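
As a one-method illustration of new_ctx (with contexts represented as lists of program points; this representation is our own):

```java
import java.util.*;

final class Ctx {
  // new_ctx(rho, ctx): prepend the call site while |ctx| <= k, else keep ctx
  static List<String> newCtx(String rho, List<String> ctx, int k) {
    if (ctx.size() > k) return ctx;
    List<String> out = new ArrayList<>(ctx.size() + 1);
    out.add(rho);
    out.addAll(ctx);
    return out;
  }
}
```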

Fig. 10. Rules for instantiating constraints

We now turn to the instantiation of constraints, summarized in Fig. 10. To translate a constraint \(\mathrm{type}(t) = T\), we map t to its corresponding location set \(\theta \) using inst_loc. The function has_type\((\theta , T)\) then yields the condition under which some element in \(\theta \) has dynamic type T (recall Definition 5).

Fig. 11. Rules for instantiating summaries

Using these ingredients, Fig. 11 shows how to instantiate an abstract heap \(\varDelta \). Given location \(\pi _i\) and field \(f_j\) from the callee heap, inst_partial_heap instantiates all points-to edges from \(\pi _i\) labeled with \(f_j\) and yields instantiated partial heap \(\varDelta _{ij}\). The instantiation of \(\varDelta \) is obtained by taking the join over all \(\varDelta _{ij}\)’s.

Finally, Fig. 12 describes the analysis of method calls. First, consider a static call to m with corresponding summary \(\varDelta \) (rule (1)). To analyze it, we construct the formal-to-actual mapping \(\mathcal {M}\) and perform a least fixed-point computation that instantiates the summarized heap \(\varDelta \) until we obtain an overapproximation of the set \(\varDelta '\) of m’s side effects. The abstract heap after the method call is obtained by taking the union of the existing heap \(\varGamma \) and the new “edges” \(\varDelta '\).

The second rule of Fig. 12 describes the analysis of virtual calls. Here, we first overapproximate the call’s targets and then use the previous rule for analyzing static method calls to obtain heap \(\varGamma _i\) assuming the called method is \(T_i::m\). Now, since the target of the virtual call is \(T_i::m\) under the assumption that \(v_0\) has dynamic type \(T_i\), we generate the constraint \(\phi _i = \mathrm{has\_type}(\varGamma (v_0), T_i)\). Then, the final abstract heap after the call is obtained as \( \sqcup _i (\varGamma _i \downarrow \phi _i) \).
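
A compact sketch of this second rule is shown below, with guards rendered as strings purely for illustration (the actual analysis manipulates the symbolic constraints of Sect. 4.1 and discharges them with a solver):

```java
import java.util.*;

// Sketch of Fig. 12, rule (2): instantiate each candidate target Ti::m via the
// static-call rule, then guard its side effects with has_type(Gamma(v0), Ti).
final class VirtualCallSketch {
  record GuardedEdge(String src, String field, String dst, String guard) {}

  interface Target {
    String typeName();                       // the candidate class Ti
    Set<GuardedEdge> instantiate();          // rule (1) applied to Ti::m's summary
  }

  static Set<GuardedEdge> analyzeVirtualCall(String receiver, List<Target> targets) {
    Set<GuardedEdge> result = new HashSet<>();
    for (Target t : targets) {
      String phi = "has_type(pts(" + receiver + "), " + t.typeName() + ")";
      for (GuardedEdge e : t.instantiate())  // Gamma_i projected on phi_i
        result.add(new GuardedEdge(e.src(), e.field(), e.dst(),
                                   "(" + e.guard() + ") && " + phi));
    }
    return result;                           // join over all candidate targets
  }
}
```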

Fig. 12. Analysis of method calls

5 Implementation and Extensions

We implemented the proposed algorithm in a tool called Scuba (http://www.cs.utexas.edu/~yufeng/scuba.html) which is built on top of Chord [16]. Scuba performs analysis on the Quad representation of Joeq [25] and obtains an initial callgraph by running the context-insensitive pointer analysis implemented in Chord. It also uses the Z3 SMT solver [5] for checking satisfiability of constraints.

Our implementation performs several optimizations over the core algorithm described here. One optimization is memoizing instantiation results. For example, consider an access path \(a_0.\epsilon .f.g.h\) used in m’s summary. Since this access path may be instantiated many times when analyzing a call site of m, our analysis maintains a cache per callsite that records instantiation results. A second optimization concerns constraint generation for virtual method calls. Consider a call v.m(...) where the static type of v is \(T_0\). Further, suppose \(T_0\) has a large number of subclasses \(T_1, \ldots , T_n\) all of which inherit \(T_0\)’s m method except \(T_n\). Assuming v points to heap object o, we need to introduce constraints of the form \(\bigvee _{0 \le i < n} \mathrm{type}(o) = T_i\). Since such constraints can be very large, our implementation allows subtyping constraints and translates them to linear inequalities by assigning integer identifiers to types in reverse topological order.
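
One standard way to realize this translation, shown below as our own illustration (the tool's exact scheme is not given in the text), is an interval encoding: numbering classes in post-order (a reverse topological order of the single-inheritance hierarchy) makes each class's subtypes occupy a contiguous id range, so a large disjunction of type equalities collapses into two linear inequalities.

```java
import java.util.*;

// Interval encoding of subtyping: ids are assigned in post-order, so the
// subclasses of T occupy the contiguous range [lo(T), id(T)]. The disjunction
// "type(o) = T0 \/ ... \/ type(o) = T_{n-1}" then becomes two inequalities.
final class TypeIntervals {
  private final Map<String, List<String>> subclasses = new HashMap<>();
  private final Map<String, Integer> id = new HashMap<>(), lo = new HashMap<>();
  private int next = 0;

  void addSubclass(String sup, String sub) {
    subclasses.computeIfAbsent(sup, k -> new ArrayList<>()).add(sub);
  }

  void number(String t) {                      // call once on the hierarchy root
    int first = next;
    for (String c : subclasses.getOrDefault(t, List.of()))
      number(c);
    id.put(t, next++);
    lo.put(t, first);
  }

  boolean isSubtype(String s, String t) {      // lo(t) <= id(s) <= id(t)
    return lo.get(t) <= id.get(s) && id.get(s) <= id.get(t);
  }
}
```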

6 Evaluation

We evaluated Scuba on ten large Java applications from the DaCapo and Ashes benchmark suites [1, 2]. These applications range between 92,615 and 227,507 statements in the Quad IR and contain between 4634 and 9653 reachable methods. To evaluate our algorithm, we compared Scuba against the k-CFA and k-object-sensitive algorithms implemented in Chord [16]. All analyses are Andersen-style flow-insensitive pointer analyses that allow customizing the value of k. Chord also allows customizing the context-sensitivity associated with heap objects using a value h. For example, a 2-obj-1-h analysis uses the abstract allocation site of the receiver as a context up to depth 2, and it differentiates heap allocations with contexts up to depth 1.

Table 1. Analysis time in seconds. Runs exceeding the time limit of 3600 s are labeled T/O.

Before describing the results, we first explain how our algorithm relates to k-CFA and k-obj-sensitive analyses. Similar to k-CFA, Scuba uses call sites rather than receiver objects as contexts. However, unlike k-CFA, we do not impose a fixed value of k, since our algorithm instantiates method summaries separately for each call site. On the other hand, we can customize the context-sensitivity associated with heap objects by varying the parameter k used in the \(\mathrm{new\_ctx}\) function from Sect. 4.4. Hence, for a given value of k, Scuba is roughly comparable to an \(\infty \)-CFA-k-h analysis. In what follows, we write Scuba-k to refer to configurations of Scuba with different values of this parameter.

Table 2. May-alias results. Higher numbers indicate better precision.

Table 1 compares the running times of Scuba-k (for \(2\le k\le 4\)) against the context-insensitive (CIPA), 2-CFA, and 2-obj-sensitive analyses, using an h value of 1. While we also tried comparing Scuba against the k-CFA and k-obj-sensitive analyses for \(k = 3\) and \(k=4\), these analyses did not complete within an hour for most of the benchmarks; hence, we do not include these results in Table 1. We also note that the running times shown in Table 1 include the analysis time for libraries (e.g., JDK, Swing, Sun Security) as well as the application code (i.e., we did not use manually provided stub methods for analyzing libraries). As the results in Table 1 show, Scuba-k is significantly faster than the k-CFA and k-obj analyses.

To compare the precision of Scuba against the k-CFA and k-obj analyses, we used two typical pointer analysis clients, namely may-alias and downcast analyses. Table 2 compares the precision of the different analysis configurations in the context of the may-alias client. The column labeled “Alias pairs” shows the number of variable pairs that are queried by the may-alias client. To generate these pairs, we first ran a context-insensitive pointer analysis to identify potential may-alias pairs in the application code. From these, we further filtered out those pairs that are “obviously” aliases (e.g., due to a direct assignment). Columns 3–11 in Table 2 show the number of pairs proven not to be aliases by each analysis configuration. Hence, a higher number indicates better precision. Observe that every configuration of Scuba-k yields better precision on average than the 2-CFA and 2-obj analyses.

Table 3. Downcast results. Bigger numbers indicate higher precision.

Table 3 shows the precision of each analysis in the context of the downcast client. Here, the second column, labeled “# downcasts”, shows the total number of downcasts in the application, and the subsequent columns show the number of downcasts that can be proven safe. The numbers shown in Table 3 only include downcasts performed in the application code rather than in external libraries. According to the results shown in Table 3, Scuba-k has better precision on average than both 2-CFA and 2-obj.

7 Related Work

Top-Down Pointer Analysis. Most existing context-sensitive pointer analysis algorithms are top-down [7, 12, 14, 24, 26]. Generally speaking, top-down context-sensitivity comes in two flavors: call-site sensitivity [22] (k-CFA) and k-object-sensitivity [14]. Specifically, CFA-based algorithms use method call sites as the context, while object-sensitive approaches use the receiver’s abstract allocation site. The recent work described in [9] has proposed selectively combining k-CFA and k-object sensitivity to achieve better precision. Several papers have used BDD-based methods for top-down context-sensitive pointer analysis [11, 26, 31]. The use of BDDs exposes commonalities between different contexts and allows the technique to scale better. The Chord framework [16] used in our experimental evaluation also uses BDDs to exploit equivalences between calling contexts.

Bottom-Up Pointer Analysis. While not as widely studied as top-down algorithms, several papers propose bottom-up pointer analyses. However, many of these approaches are unification-based [10, 13, 18]. The algorithm proposed in [28] is also bottom-up but uses a combined equality- and subset-based approach. By contrast, our algorithm is purely subset-based and therefore more precise.

The work described in [4] presents a subset-based, partially bottom-up pointer analysis for C. Unlike our approach, it incorporates both top-down and bottom-up phases, where the top-down phase is used for precise handling of function pointers. Also unlike our approach, it tracks alias pairs as opposed to an explicit heap model and targets C rather than Java. Another subset-based pointer analysis for C that combines top-down and bottom-up phases is based on the observation that context-insensitivity does not result in a loss of precision if function side effects are accounted for [17]. In contrast to our approach, that technique handles SCCs in a context-insensitive way and employs a top-down phase that removes callee side effects. The algorithms described in [3, 6] perform bottom-up pointer analysis for C++ programs. Unlike the method presented in this paper, they perform case splits on possible aliasing patterns. Since the approach of [6] performs strong updates, it is more precise but less scalable than our technique.

Another related work is [27], which describes a compositional pointer and escape analysis for Java. While this analysis is flow-sensitive and applies strong updates, it assumes that parameters do not alias and generates a summary that is valid under this assumption. If this assumption is violated at a call site, the analysis corrects the summary through a complex mechanism that involves merging memory locations. Another difference is that [27] does not precisely handle virtual method calls.

Summarization. Many papers describe general frameworks for interprocedural analysis [19–21]. The techniques described in [29, 30] compute polymorphic summaries for dataflow problems, but both rely on global points-to sets.

8 Conclusion

We described a new bottom-up, summary-based pointer analysis for Java. Our experimental evaluation demonstrates that the algorithm runs significantly faster than top-down pointer analyses while achieving comparable precision. We believe that Scuba scales better because the cost of instantiating a method summary is small compared to the cost of re-analyzing the method.