Keywords

These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

1 Introduction

Points-to analysis discovers information about indirect accesses in a program and its precision influences the precision and scalability of other program analyses significantly. Computationally intensive analyses such as model checking are ineffective on programs containing pointers partly because of imprecision of pointer analyses [1, 8].

We focus on exhaustive (as against demand driven [2, 7, 22]) points-to analysis with full flow- and context-sensitivity for precision. A top-down context sensitive analysis propagates the information from callers to callees [28] thereby analyzing a procedure each time a new data flow value reaches its call(s). Some approaches in this category are: call strings method [21], its value-based variants [10, 17] and the tabulation based functional method [18, 21]. By contrast, bottom-up approaches avoid analyzing callees multiple times by constructing summary flow functions which are used at call sites to incorporate the effect of procedure calls [3, 6, 13, 19, 21, 2328].

It is prudent to distinguish between three kinds of summaries (see [4] for examples) that can be created for a procedure: (a) a bottom-up parameterized summary flow function which is context independent, (b) a top-down enumeration of summary flow function in the form of input-output pairs for the input values reaching a procedure, and (c) a bottom-up parameterless (and hence context-insensitive) summary information. Context independence (in (a) above), achieves context-sensitivity through parameterization and should not be confused with context-insensitivity (in (c) above).

We focus on summaries of the first kind. Their construction requires composing statement-level flow functions to represent a sequence of statements, and merging the composed functions to represent multiple control flow paths reaching a program point. These summaries should be compact and their size should be independent of the number of statements. This seems hard because of the presence of indirect pointees. The composition of the flow functions for a sequence of statements \(x = *y; \; z = *x\) cannot be reduced to a flow function of the basic pointer assignments for 3-address code (\( x = \& y\), \(x = y\), \(x = *y\), and \(*x = y\)).

Our Key Idea and Approach. We generalize the concept of points-to relations by using the counts of indirection levels leaving the unknown locations implicit. This allows us to create summary flow functions in the form of generalized points-to graphs (GPGs) whose size is linearly bounded by the number of variables (Sect. 2). By design, GPGs can represent both memory (in terms of classical points-to facts) and memory transformers (in terms of generalized points-to facts).

Example 1

Consider procedure g of Fig. 1 whose GPG is shown in Fig. 2(c). The edges in GPGs track indirection levels: indirection level 1 in the label \((1,\!0)\) indicates that the source is assigned the address (indicated by indirection level 0) of the target. Edge \({a}\!\xrightarrow {1,0}\!{e}\) is created for line 8. The indirection level 2 in edge \({x}\!\xrightarrow {2,1}\!{z}\) for line 10 indicates that the pointees of x are being defined; since z is read, its indirection level is 1. The combined effect of lines 13 (edge \({y}\!\xrightarrow {1,0}\!{b}\)) and 17 (edge \({y}\!\xrightarrow {2,0}\!{d}\)) results in the edge \({b}\!\xrightarrow {1,0}\!{d}\). However edge \({y}\!\xrightarrow {2,0}\!{d}\) is also retained because there is no information about the pointee of y along the other path reaching line 17.    \(\square \)

The generalized points-to facts are composed to create new generalized points-to facts with smaller indirection levels (Sect. 3) whenever possible thereby converting them progressively to classical points-to facts. This is performed in two phases: construction of GPGs, and use of GPGs to compute points-to information. GPGs are constructed flow-sensitively by processing pointer assignments along the control flow of a procedure and collecting generalized points-to facts (Sect. 4).

Function calls are handled context-sensitively by incorporating the effect of the GPG of a callee into the GPG of the caller (Sect. 5). Loops and recursion are handled using a fixed point computation. GPGs also distinguish between \({\small {{\textsf {\textit{may}}}}}\) and \({\small {{\textsf {\textit{must}}}}}\) pointer updates thereby facilitating strong updates.

Section 6 shows how GPGs are used for computing classical points-to facts. Section 7 presents the empirical measurements. Section 8 describes the related work. Section 9 concludes the paper. A detailed technical report [4] describes how we handle advanced issues (e.g. structures, heap memory, function pointers, arrays, pointer arithmetic) and also provides soundness proofs.

Fig. 1.
figure 1

A program fragment used as a running example through the paper. All variables are global.

The Advantages of GPGs Over Conventional Summaries. Indirect accesses of unknown locations have been commonly modelled using placeholders (called extended parameters in [25] and external variables in [13]).

The partial transfer function (PTF) based method [25] uses placeholders to construct a collection of PTFs for a procedure for different aliasing patterns involving formal parameters and global variables accessed in the procedure.

Example 2

For procedure g of the program in Fig. 1, three placeholders \(\phi _1, \phi _2\), and \(\phi _3\) have been used in the PTFs shown in Figs. 2(a) and (b). The possibility that x and y may or may not be aliased gives rise to two PTFs.    \(\square \)

The number of PTFs could be combinatorial in the number of dereferences of globals and parameters. PTFs that do not correspond to actual aliasing patterns can be excluded by combining a top-down analysis for discovering aliasing patterns with the bottom-up construction of PTFs [25, 28]. Yet, the number of PTFs could remain large.

An alternative approach makes no assumption about aliases in the calling context and constructs a single summary flow function for a procedure. In a degenerate case, it may require a separate placeholder for the same variable in different statements and the size of the summary flow functions may be proportional to the number of statements.

figure a

Example 3

For the code snippet on the right, we need two different placeholders for y in statements \(s_1\) and \(s_3\) because statement \(s_2\) could change the pointee of y depending upon whether \(*z\) is aliased to y.   \(\square \)

Separate placeholders for different occurrences of a variable can be avoided if points-to information is not killed by the summary flow functions [13, 23, 24]. Another alternative is to use flow-insensitive summary flow functions [3]. However, both these cases introduces imprecision.

Fig. 2.
figure 2

PTFs/GPG for proc. g of Fig. 1. Edges deleted due to flow-sensitivity are struck off.

A fundamental problem with placeholders is that they explicate unknown locations by naming them, resulting in either a large number of placeholders (e.g., a GPG edge \({\cdot }\!\xrightarrow {i,j}\!{\cdot }\) would require \(i+j-1\) placeholders) or multiple summary flow functions for different aliasing patterns that exist in the calling contexts.

Since we use edges to track indirection levels leaving unknown locations implicit: (a) placeholders are not needed (unlike [13, 2325, 28]), (b) aliasing patterns from calling contexts are not needed and a single summary per procedure is created (unlike [25, 28]), (c) the size of summary is linearly bounded by the number of variables regardless of the number of statements (unlike [13, 23, 24]), and (d) updates can be performed in the calling contexts (unlike [3, 13, 23, 24]). This facilitates the scalability of fully flow- and context-sensitive exhaustive points-to analysis.

2 Generalized Points-To Graphs (GPGs)

We define the basic concepts assuming scalars and pointers in the stack and static memory; see [4] for extensions to handle structures, heap, function pointers, etc.

2.1 Memory and Memory Transformer

We assume a control flow graph representation containing 3-address code statements. Program points \({\varvec{\mathsf{{ t}}}}\), \({\varvec{\mathsf{{ u}}}}\), \({\varvec{\mathsf{{ v}}}}\) represent the points just before the execution of statements. The successors and predecessors of a program point are denoted by succ and pred; \({\small {{\textsf {\textit{succ}}}}} ^{*}\) \({{{\textsf {\textit{pred}}}}} ^{*}\) denote their reflexive transitive closures. A control flow path is a finite sequence of (possibly repeating) program points \(q_0, q_1, \ldots , q_m\) such that \(q_{i+1} \in {\small {{\textsf {\textit{succ}}}}} (q_i)\).

Let L and \({{{\textsf {\textit{P}}}}} \subseteq {{{\textsf {\textit{L}}}}} \) denote the sets of locations and pointers respectively. Every location has a content and an address. The memory at a program point is a relation \({M} \subseteq {{{\textsf {\textit{P}}}}} \times ({{{\textsf {\textit{L}}}}} \cup \{?\})\) where “?” denotes an undefined location. We view M as a graph with \({{{\textsf {\textit{L}}}}} \cup \{?\}\) as the set of nodes. An edge \(x \rightarrow y\) in M indicates that \(x \in {{{\textsf {\textit{P}}}}} \) contains the address of location \(y \in {{{\textsf {\textit{L}}}}} \). The memory associated with a program point \({\varvec{\mathsf{{ u}}}}\) is denoted by \({M} _{\!{\varvec{\mathsf{{ u}}}}} \); since \({\varvec{\mathsf{{ u}}}}\) could appear in multiple control flow paths and could also repeat in a control flow path, \({M} _{\!{\varvec{\mathsf{{ u}}}}} \) denotes the memory associated with all occurrences of \({\varvec{\mathsf{{ u}}}}\).

The pointees of a set of pointers \(X\subseteq {{{\textsf {\textit{P}}}}} \) in M are computed by the application \({M} \; X = \{ y \mid (x,y) \in {M}, x\in X \}\). A composition of degree i, \({M} ^i \{x\}\) discovers the \(i^{th}\) pointees of x which involves i transitive reads from \(x\,\): first \(i-1\) addresses are read followed by the content of the last address. For composability of M, we extend its domain to \({{{\textsf {\textit{L}}}}} \cup \{?\}\) by inclusion map. By definition, \({M} ^0 \{x\} = \{ x \}\).

For adjacent program points \({\varvec{\mathsf{{ u}}}}\) and \({\varvec{\mathsf{{ v}}}}\), \({M} _{\!{\varvec{\mathsf{{ v}}}}}\) is computed from \({M} _{\!{\varvec{\mathsf{{ u}}}}}\) by incorporating the effect of the statement between \({\varvec{\mathsf{{ u}}}}\) and \({\varvec{\mathsf{{ v}}}}\), where \(\delta ({\varvec{\mathsf{{ u}}}}, {\varvec{\mathsf{{ v}}}})\) is a statement-level flow function representing a memory transformer. For \({\varvec{\mathsf{{ v}}}} \in {\small {{\textsf {\textit{succ}}}}} ^*({\varvec{\mathsf{{ u}}}})\), the effect of the statements appearing in all control flow paths from \({\varvec{\mathsf{{ u}}}}\) to \({\varvec{\mathsf{{ v}}}}\) is computed by where the memory transformer \(\varDelta ({\varvec{\mathsf{{ u}}}},\!{\varvec{\mathsf{{ v}}}})\) is a summary flow function mapping the memory at \({\varvec{\mathsf{{ u}}}}\) to the memory at \({\varvec{\mathsf{{ v}}}}\). Definition 1 provides an equation to compute \(\varDelta \) without specifying a representation for it. Since control flow

figure b

paths may contain cycles, \(\varDelta \) is the maximum fixed point of the equation where (a) the composition of \(\varDelta \) s is denoted by \(\circ \) such that \(\left( g \circ \! f \right) (\cdot ) = g\left( f\left( \cdot \right) \right) \), (b) \(\varDelta \) s are merged using \(\sqcap \), (c) B captures the base case, and (d) \(\varDelta _{id}\) is the identity flow function. Henceforth, we use the term memory transformer for a summary flow function \(\varDelta \). The rest of the paper proposes GPG as a compact representation for \(\varDelta \). Section 2.2 defines GPG and Sect. 2.3 defines its lattice.

2.2 Generalized Points-To Graphs for Representing Memory Transformers

The classical memory transformers explicate the unknown locations using placeholders. Effectively, they use a low level abstraction which is close to the memory defined in terms of classical points-to facts: Given locations \(x,y \in {{{\textsf {\textit{L}}}}} \), a classical points-to fact \({x}\!\xrightarrow {}\!{y}\) in memory \(M\) asserts that x holds the address of y. We propose a higher level abstraction of the memory without explicating the unknown locations.

figure c
Fig. 3.
figure 3

GPG edges for basic pointer assignments in C.

Figure 3 illustrates the generalized points-to facts corresponding to the basic pointer assignments in C. Observe that a classical points-to fact \({x}\!\xrightarrow {}\!{y}\) is a special case of the generalized points-to fact \({x}\!\xrightarrow {i,j}\!{y}\) with \(i = 1\) and \(j = 0\); the case \(i=0\) does not arise.

The generalized points-to facts are more expressive than the classical points-to facts because they can be composed to create new facts as shown by the example below. Section 3 explains the process of composing the generalized points-to facts through edge composition along with the conditions when the facts can and ought to be composed.

figure d

Example 4

Statements \(s_1\) and \(s_2\) to the right are represented by GPG edges \({x}\!\xrightarrow {1,0}\!{y}\) and \({z}\!\xrightarrow {1,1}\!{x}\) respectively. We can compose the two edges by creating a new edge \({z}\!\xrightarrow {1,0}\!{y}\) indicating that z points-to y. Effectively, this converts the generalized points-to fact for \(s_2\) into a classical points-to fact.    \(\square \)

Imposing an ordering on the set of GPG edges allows us to view it as a sequence to represent a flow-sensitive memory transformer. A reverse post order traversal over the control flow graph of a procedure dictates this sequence. It is required only at the interprocedural level when the effect of a callee is incorporated in its caller. Since a sequence is totally ordered but control flow is partially ordered, the GPG operations (Sect. 5) internally relax the total order to ensure that the edges appearing on different control flow paths do not affect each other. While the visual presentation of GPGs as graphs is intuitively appealing, it loses the edge-ordering; we annotate edges with their ordering explicitly when it matters.

A GPG is a uniform representation for a memory transformer as well as (an abstraction of) memory. This is analogous to a matrix which can be seen both as a transformer (for a linear translation) and also as an absolute value. A points-to analysis using GPGs begins with generalized points-to facts \({\cdot }\!\xrightarrow {i,j}\!{\cdot }\) representing memory transformers which are composed to create new generalized points-to facts with smaller indlev  s thereby progressively reducing them to classical points-to facts \({\cdot }\!\xrightarrow {1,0}\!{\cdot }\) representing memory.

2.3 The Lattice of GPGs

Definition 3 describes the meet semi-lattice of GPGs. For reasons described later in Sect. 5, we need to introduce an artificial \(\top \) element denoted \(\varDelta _\top \) in the lattice. It is used as the initial value in the data flow equations for computing GPGs (Definition 5 which instantiates Definition 1 for GPGs).

figure e

The sequencing of edges is maintained externally and is explicated where required. This allows us to treat a GPG (other than \(\varDelta _{\top }\)) as a pair of a set of nodes and a set of edges. The partial order is a point-wise superset relation applied to the pairs. Similarly, the meet operation is a point-wise union of the pairs. It is easy to see that the lattice is finite because the number of locations L is finite (being restricted to static and stack slots). When we extend GPGs to handle heap memory [4], explicit summarization is required to ensure finiteness. The finiteness of the lattice and the monotonicity of GPG operations guarantee the convergence of GPG computations on a fixed point; starting from \(\varDelta _\top \), we compute the maximum fixed point.

For convenience, we treat a GPG as a set of edges leaving the set of nodes implicit; the GPG nodes can always be inferred from the GPG edges.

Fig. 4.
figure 4

A hierarchy of operations for points-to analysis using GPGs. Each operation is defined in terms of the layers below it. \(E\) denotes the set of GPG edges. By abuse of notation, we use M and \(\varDelta \) also as types to indicate the signatures of the operations. The operators “\(\circ \)” and “\({[\![}\; {]\!]}\)” are overloaded and can be disambiguated using the types of the operands.

2.4 A Hierarchy of GPG Operations

Figure 4 lists the GPG operations based on the concept of the generalized points-to facts. They are presented in two separate columns according to the two phases of our analysis and each layer is defined in terms of the layers below it. The operations are defined in the sections listed against them in Fig. 4.

  • Constructing GPGs. An edge composition \(e_1\!\circ e_2\) computes a new edge \(e_3\) equivalent to \(e_1\) using the points-to information in \(e_2\) such that the indlev of \(e_3\) is smaller than that of \(e_1\). An edge reduction \(e_1\circ \varDelta \) computes a set of edges X by composing \(e_1\) with the edges in \(\varDelta \). A GPG update \(\varDelta _1\left[ X\right] \) incorporates the effect of the set of edges X in \(\varDelta _1\) to compute a new GPG \(\varDelta _2\). A GPG composition \(\varDelta _1\circ \varDelta _2\) composes a callee’s GPG \(\varDelta _2\) with GPG \(\varDelta _1\) at a call point to compute a new GPG \(\varDelta _3\).

  • Using GPGs for computing points-to information. An edge application \({[\![}e {]\!]}{M} \) computes a new memory \({M} '\) by incorporating the effect of the GPG edge e in memory M. A GPG application \({[\![}\varDelta {]\!]}{M} \) applies the GPG \(\varDelta \) to M and computes a new memory \({M} '\) using edge application iteratively.

These operations allow us to build the theme of a GPG being a uniform representation for both memory and memory transformers.

3 Edge Composition

This section defines edge composition as a fundamental operation which is used in Sect. 4 for constructing GPGs. Some considerations in edge composition (explained in this section) are governed by the goal of including the resulting edges in a GPG \(\varDelta \).

Let a statement-level flow function \(\delta \) be represented by an edge \({\varvec{\mathsf{{ n}}}}\) (“new” edge) and consider an existing edge \({\varvec{\mathsf{{ p}}}}\) \(\in \varDelta \) (“processed” edge). Edges \({\varvec{\mathsf{{ n}}}}\) and \({\varvec{\mathsf{{ p}}}}\) can be composed (denoted \({\varvec{\mathsf{{ n}}}} \circ {\varvec{\mathsf{{ p}}}} \)) provided they have a common node called the pivot of composition (since a pivot can be the source or target of either of the edges, there are four possibilities as explained later). The goal is to reduce (i.e., simplify) \({\varvec{\mathsf{{ n}}}}\) by using the points-to information from \({\varvec{\mathsf{{ p}}}}\). This is achieved by using the pivot as a bridge to join the remaining two nodes resulting in a reduced edge \({\varvec{\mathsf{{ r}}}}\). This requires the indlev  s of the pivot in both edges to be made the same. For example, given edges \({\varvec{\mathsf{{ n}}}}\) \(\equiv {z}\!\xrightarrow {i,j}\!{x}\) and \({\varvec{\mathsf{{ p}}}}\) \(\equiv {x}\!\xrightarrow {k,l}\!{y}\) with a pivot x, if \(j > k\), then the difference \(j-k\) can be added to the indlev  s of nodes in \({\varvec{\mathsf{{ p}}}}\), to view \({\varvec{\mathsf{{ p}}}}\) as \({x}\!\xrightarrow {j,(l+j-k)}\!{y}\). This balances the indlev  s of x in the two edges allowing us to create a reduced edge \({\varvec{\mathsf{{ r}}}} \equiv {z}\!\xrightarrow {i,(l+j-k)}\!{y}\). Although this computes the transitive effect of edges, in general, it cannot be modelled using multiplication of matrices representing graphs as explained in our technical report [4].

Example 5

In the first example in Fig. 5, the indlev  s of pivot x in both \({\varvec{\mathsf{{ p}}}}\) and \({\varvec{\mathsf{{ n}}}}\) is the same allowing us to join z and y through an edge \({z}\!\xrightarrow {1,0}\!{y}\). In the second example, the difference (\(2\!-\!1\)) in the indlev  s of x can be added to the indlev  s of nodes in \({\varvec{\mathsf{{ p}}}}\) viewing it as \({x}\!\xrightarrow {2,1}\!{y}\). This allows us to join y and z creating the edge \({y}\!\xrightarrow {1,1}\!{z}\).   \(\square \)

Fig. 5.
figure 5

Examples of edge compositions for points-to analysis.

Let an edge \({\varvec{\mathsf{{ n}}}}\) be represented by the triple \(\left( {{\textsf {\textit{S}}}} _{{\varvec{\mathsf{{ n}}}}},\!\left( {{{\textsf {\textit{S}}}}} _{{\varvec{\mathsf{{ n}}}}}^{\,{{\textsf {\textit{c}}}}},\!{{{\textsf {\textit{T}}}}} _{{\varvec{\mathsf{{ n}}}}}^{\,{{\textsf {\textit{c}}}}} \right) \!,\!{{\textsf {\textit{T}}}} _{\!{\varvec{\mathsf{{ n}}}}} \right) \) where \({{\textsf {\textit{S}}}} _{{\varvec{\mathsf{{ n}}}}}\) and \({{\textsf {\textit{T}}}} _{\!{\varvec{\mathsf{{ n}}}}}\) are the source and the target of \({\varvec{\mathsf{{ n}}}}\) and \(({{{\textsf {\textit{S}}}}} _{{\varvec{\mathsf{{ n}}}}}^{\,{{\textsf {\textit{c}}}}},\!{{{\textsf {\textit{T}}}}} _{{\varvec{\mathsf{{ n}}}}}^{\,{{\textsf {\textit{c}}}}})\) is the indlev. Similarly, \({\varvec{\mathsf{{ p}}}}\) is represented by \(\left( {{\textsf {\textit{S}}}} _{{\varvec{\mathsf{{ p}}}}},\!\left( {{{\textsf {\textit{S}}}}} _{{\varvec{\mathsf{{ p}}}}}^{\,{{\textsf {\textit{c}}}}},\!{{{\textsf {\textit{T}}}}} _{{\varvec{\mathsf{{ p}}}}}^{\,{{\textsf {\textit{c}}}}} \right) \!,\!{{\textsf {\textit{T}}}} _{\!{\varvec{\mathsf{{ p}}}}} \right) \) and the reduced edge \({\varvec{\mathsf{{ r}}}} = {\varvec{\mathsf{{ n}}}} \circ {\varvec{\mathsf{{ p}}}} \) by \(\left( {{\textsf {\textit{S}}}} _{{\varvec{\mathsf{{ r}}}}},\!\left( {{{\textsf {\textit{S}}}}} _{{\varvec{\mathsf{{ r}}}}}^{\,{{\textsf {\textit{c}}}}},\!{{{\textsf {\textit{T}}}}} _{{\varvec{\mathsf{{ r}}}}}^{\,{{\textsf {\textit{c}}}}} \right) \!,\!{{\textsf {\textit{T}}}} _{\!{\varvec{\mathsf{{ r}}}}} \right) \); \(({{{\textsf {\textit{S}}}}} _{{\varvec{\mathsf{{ r}}}}}^{\,{{\textsf {\textit{c}}}}},\!{{{\textsf {\textit{T}}}}} _{{\varvec{\mathsf{{ r}}}}}^{\,{{\textsf {\textit{c}}}}})\) is obtained by balancing the indlev of the pivot in \({\varvec{\mathsf{{ p}}}}\) and \({\varvec{\mathsf{{ n}}}}\). The pivot of a composition, denoted \(\mathbb {P}\), may be the source or the target of \({\varvec{\mathsf{{ n}}}}\) and \({\varvec{\mathsf{{ p}}}}\). This leads to four combinations of \({\varvec{\mathsf{{ n}}}} \circ {\varvec{\mathsf{{ p}}}} \): SS, TS, ST, TT. Our implementation currently uses TS and SS compositions illustrated in Fig. 6; ST and TT compositions are described in the technical report [4].

  • TS composition. In this case, \({{\textsf {\textit{T}}}} _{\!{\varvec{\mathsf{{ n}}}}}\) = \({{\textsf {\textit{S}}}} _{{\varvec{\mathsf{{ p}}}}}\) i.e., the pivot is the target of \({\varvec{\mathsf{{ n}}}}\) and the source of \({\varvec{\mathsf{{ p}}}}\). Node \({{\textsf {\textit{S}}}} _{{\varvec{\mathsf{{ n}}}}}\) becomes the source and \({{\textsf {\textit{T}}}} _{\!{\varvec{\mathsf{{ p}}}}}\) becomes the target of the reduced edge \({\varvec{\mathsf{{ r}}}}\).

  • SS composition. In this case, \({{\textsf {\textit{S}}}} _{{\varvec{\mathsf{{ n}}}}}\) = \({{\textsf {\textit{S}}}} _{{\varvec{\mathsf{{ p}}}}}\) i.e., the pivot is the source of both \({\varvec{\mathsf{{ n}}}}\) and \({\varvec{\mathsf{{ p}}}}\). Node \({{\textsf {\textit{T}}}} _{\!{\varvec{\mathsf{{ p}}}}}\) becomes the source and \({{\textsf {\textit{T}}}} _{\!{\varvec{\mathsf{{ n}}}}}\) becomes the target of the reduced edge \({\varvec{\mathsf{{ r}}}}\).

Consider an edge composition \({\varvec{\mathsf{{ r}}}} = {\varvec{\mathsf{{ n}}}} \circ {\varvec{\mathsf{{ p}}}} \), \({\varvec{\mathsf{{ p}}}} \in \varDelta \). For constructing a new \(\varDelta \), we wish to include \({\varvec{\mathsf{{ r}}}}\) rather than \({\varvec{\mathsf{{ n}}}}\): Including both of them is sound but may lead to imprecision; including only \({\varvec{\mathsf{{ n}}}}\) is also sound but may lead to inefficiency because it forsakes summarization. An edge composition is desirable if and only if it is relevant, useful, and conclusive. We define these properties below and explain them in the rest of the section.

  1. (a)

    A composition \({\varvec{\mathsf{{ n}}}} \circ {\varvec{\mathsf{{ p}}}} \) is relevant only if it preserves flow-sensitivity.

  2. (b)

    A composition \({\varvec{\mathsf{{ n}}}} \circ {\varvec{\mathsf{{ p}}}} \) is useful only if the indlev of the resulting edge does not exceed the indlev of \({\varvec{\mathsf{{ n}}}}\).

  3. (c)

    A composition \({\varvec{\mathsf{{ n}}}} \circ {\varvec{\mathsf{{ p}}}} \) is conclusive only when the information supplied by \({\varvec{\mathsf{{ p}}}}\) used for reducing \({\varvec{\mathsf{{ n}}}}\) is not likely to be invalidated by the intervening statements.

When the edge composition is desirable, we include \({\varvec{\mathsf{{ r}}}}\) in the \(\varDelta \) being constructed, otherwise we include \({\varvec{\mathsf{{ n}}}}\). In order to explain the desirable compositions, we use the following notation: Let \(\ell _{{\varvec{\mathsf{{ p}}}}}\) denote a \(({\small \mathbb {P}} _{{\varvec{\mathsf{{ p}}}}}^{\,{{\textsf {\textit{c}}}}})^{th}\) pointee of pivot \(\mathbb {P}\) accessed by \({\varvec{\mathsf{{ p}}}}\) and \(\ell _{{\varvec{\mathsf{{ n}}}}}\) denote a \(({\small \mathbb {P}} _{{\varvec{\mathsf{{ n}}}}}^{\,{{\textsf {\textit{c}}}}})^{th}\) pointee of \(\mathbb {P}\) accessed by \({\varvec{\mathsf{{ n}}}}\).

Fig. 6.
figure 6

Illustrating all exhaustive possibilities of SS and TS compositions (the pivot is x). Dashed edges are killed. Unmarked compositions are relevant and useful (Sect. 3); since the statements are consecutive, they are also conclusive (Sect. 3) and hence desirable.

Relevant Edge Composition. An edge composition is relevant if it preserves flow-sensitivity. This requires the indirection levels in \({\varvec{\mathsf{{ n}}}}\) to be reduced by using the points-to information in \({\varvec{\mathsf{{ p}}}}\) (where \({\varvec{\mathsf{{ p}}}}\) appears before \({\varvec{\mathsf{{ n}}}}\) along a control flow path) but not vice-versa. The presence of a points-to path in memory (which is the transitive closure of the points-to edges) between \(\ell _{{\varvec{\mathsf{{ p}}}}}\) and \(\ell _{{\varvec{\mathsf{{ n}}}}}\) (denoted by \(\ell _{{\varvec{\mathsf{{ p}}}}} \twoheadrightarrow \ell _{{\varvec{\mathsf{{ n}}}}}\) or \(\ell _{{\varvec{\mathsf{{ n}}}}} \twoheadrightarrow \ell _{{\varvec{\mathsf{{ p}}}}}\)) indicates that \({\varvec{\mathsf{{ p}}}}\) can be used to resolve the indirection levels in \({\varvec{\mathsf{{ n}}}}\).

Example 6

For \({{{\textsf {\textit{S}}}}} _{{\varvec{\mathsf{{ n}}}}}^{\,{{\textsf {\textit{c}}}}} < {{{\textsf {\textit{S}}}}} _{{\varvec{\mathsf{{ p}}}}}^{\,{{\textsf {\textit{c}}}}} \) in Fig. 6 (Ex. ss1), edge \({\varvec{\mathsf{{ p}}}}\) updates the pointee of x and edge \({\varvec{\mathsf{{ n}}}}\) redefines x. As shown in the memory graph, there is no path between \(\ell _{{\varvec{\mathsf{{ p}}}}}\) and \(\ell _{{\varvec{\mathsf{{ n}}}}}\) and hence y and z are unrelated rendering this composition irrelevant. Similarly, edge composition is irrelevant for \({{{\textsf {\textit{S}}}}} _{{\varvec{\mathsf{{ n}}}}}^{\,{{\textsf {\textit{c}}}}} = {{{\textsf {\textit{S}}}}} _{{\varvec{\mathsf{{ p}}}}}^{\,{{\textsf {\textit{c}}}}} \) (Ex. ss3).

For \({{{\textsf {\textit{S}}}}} _{{\varvec{\mathsf{{ n}}}}}^{\,{{\textsf {\textit{c}}}}} > {{{\textsf {\textit{S}}}}} _{{\varvec{\mathsf{{ p}}}}}^{\,{{\textsf {\textit{c}}}}} \) (Ex. ss2), \(\ell _{{\varvec{\mathsf{{ p}}}}} \twoheadrightarrow \ell _{{\varvec{\mathsf{{ n}}}}}\) holds in the memory graph and hence this composition is relevant. For Ex. ts1, \(\ell _{{\varvec{\mathsf{{ n}}}}} \twoheadrightarrow \ell _{{\varvec{\mathsf{{ p}}}}}\) holds; for ts2, \(\ell _{{\varvec{\mathsf{{ p}}}}} \twoheadrightarrow \ell _{{\varvec{\mathsf{{ n}}}}}\) holds; for ts3 both paths hold. Hence, all three compositions are relevant.   \(\square \)

Useful Edge Composition. The usefulness of edge composition characterizes progress in conversion of the generalized points-to facts to the classical points-to facts. This requires the indlev \(({{{\textsf {\textit{S}}}}} _{{\varvec{\mathsf{{ r}}}}}^{\,{{\textsf {\textit{c}}}}}, {{{\textsf {\textit{T}}}}} _{{\varvec{\mathsf{{ r}}}}}^{\,{{\textsf {\textit{c}}}}})\) of the reduced edge \({\varvec{\mathsf{{ r}}}}\) to satisfy:

$$\begin{aligned} {{{\textsf {\textit{S}}}}} _{{\varvec{\mathsf{{ r}}}}}^{\,{{\textsf {\textit{c}}}}} \le {{{\textsf {\textit{S}}}}} _{{\varvec{\mathsf{{ n}}}}}^{\,{{\textsf {\textit{c}}}}} \; \wedge \; {{{\textsf {\textit{T}}}}} _{{\varvec{\mathsf{{ r}}}}}^{\,{{\textsf {\textit{c}}}}} \le {{{\textsf {\textit{T}}}}} _{{\varvec{\mathsf{{ n}}}}}^{\,{{\textsf {\textit{c}}}}} \end{aligned}$$
(1)

Intuitively, this ensures that the indlev of the new source and the new target does not exceed the corresponding indlev in the original edge \({\varvec{\mathsf{{ n}}}}\).

Example 7

Consider Ex. ts1 of Fig. 6, for \({{{\textsf {\textit{T}}}}} _{{\varvec{\mathsf{{ n}}}}}^{\,{{\textsf {\textit{c}}}}} < {{{\textsf {\textit{S}}}}} _{{\varvec{\mathsf{{ p}}}}}^{\,{{\textsf {\textit{c}}}}} \), \(\ell _{{\varvec{\mathsf{{ n}}}}} \twoheadrightarrow \ell _{{\varvec{\mathsf{{ p}}}}}\) holds in the memory graph. Although this composition is relevant, it is not useful because the indlev of \({\varvec{\mathsf{{ r}}}}\) exceeds the indlev of \({\varvec{\mathsf{{ n}}}}\). For this example, a TS composition will create an edge \({z}\!\xrightarrow {2,0}\!{y}\) whose indlev is higher than that of \({\varvec{\mathsf{{ n}}}}\) (\({z}\!\xrightarrow {1,1}\!{x}\)).   \(\square \)

Thus, we need \(\ell _{{\varvec{\mathsf{{ p}}}}} \twoheadrightarrow \ell _{{\varvec{\mathsf{{ n}}}}}\), and not \(\ell _{{\varvec{\mathsf{{ n}}}}} \twoheadrightarrow \ell _{{\varvec{\mathsf{{ p}}}}}\), to hold in the memory graph for a useful edge composition. We can relate this with the usefulness criteria (Inequality 1). The presence of path \(\ell _{{\varvec{\mathsf{{ p}}}}} \twoheadrightarrow \ell _{{\varvec{\mathsf{{ n}}}}}\) ensures that the indlev of edge \({\varvec{\mathsf{{ r}}}}\) does not exceed that of \({\varvec{\mathsf{{ n}}}}\). The usefulness criteria (Inequality 1) reduces to \({{{\textsf {\textit{T}}}}} _{{\varvec{\mathsf{{ p}}}}}^{\,{{\textsf {\textit{c}}}}} \le {{{\textsf {\textit{S}}}}} _{{\varvec{\mathsf{{ p}}}}}^{\,{{\textsf {\textit{c}}}}} < {{{\textsf {\textit{S}}}}} _{{\varvec{\mathsf{{ n}}}}}^{\,{{\textsf {\textit{c}}}}} \) for SS composition and \({{{\textsf {\textit{T}}}}} _{{\varvec{\mathsf{{ p}}}}}^{\,{{\textsf {\textit{c}}}}} \le {{{\textsf {\textit{S}}}}} _{{\varvec{\mathsf{{ p}}}}}^{\,{{\textsf {\textit{c}}}}} \le {{{\textsf {\textit{T}}}}} _{{\varvec{\mathsf{{ n}}}}}^{\,{{\textsf {\textit{c}}}}} \) for TS composition.

From Fig. 6, we conclude that an edge composition is relevant and useful only if there exists a path \(\ell _{{\varvec{\mathsf{{ p}}}}} \twoheadrightarrow \ell _{{\varvec{\mathsf{{ n}}}}}\) rather than \(\ell _{{\varvec{\mathsf{{ n}}}}} \twoheadrightarrow \ell _{{\varvec{\mathsf{{ p}}}}}\). Intuitively, such a path guarantees that the updates made by \({\varvec{\mathsf{{ n}}}}\) do not invalidate the generalized points-to fact represented by \({\varvec{\mathsf{{ p}}}}\). Hence, the two generalized points-to facts can be composed by using the pivot as a bridge to create a new generalized points-to fact represented by \({\varvec{\mathsf{{ r}}}}\).

Conclusive Edge Composition. Recall that \({\varvec{\mathsf{{ r}}}} = {\varvec{\mathsf{{ n}}}} \circ {\varvec{\mathsf{{ p}}}} \) is relevant and useful if we expect a path \(\ell _{{\varvec{\mathsf{{ p}}}}} \twoheadrightarrow \ell _{{\varvec{\mathsf{{ n}}}}}\) in the memory. This composition is conclusive when location \(\ell _{{\varvec{\mathsf{{ p}}}}}\) remains accessible from the pivot \(\mathbb {P}\) in \({\varvec{\mathsf{{ p}}}}\) when \({\varvec{\mathsf{{ n}}}}\) is composed with \({\varvec{\mathsf{{ p}}}}\). Location \(\ell _{{\varvec{\mathsf{{ p}}}}}\) may become inaccessible from \(\mathbb {P}\) because of a combined effect of the statements in a calling context and the statements in the procedure being processed. Hence, the composition is undesirable and may lead to unsoundness if \({\varvec{\mathsf{{ r}}}}\) is included in \(\varDelta \) instead of \({\varvec{\mathsf{{ n}}}}\).

figure f

Example 8

Line 6 in the code on the right indirectly defines a (because of the assignment on line 2) whereas line 7 directly defines a overwriting the value. Thus, x points to b and not c after line 8. When the GPG for procedure q is constructed, the relationship between y and a is not known. Thus, the composition of \({\varvec{\mathsf{{ n}}}} \equiv {x}\!\xrightarrow {1,2}\!{y}\) with \({\varvec{\mathsf{{ p}}}} \equiv {y}\!\xrightarrow {2,0}\!{c}\) results in \({\varvec{\mathsf{{ r}}}} \equiv {x}\!\xrightarrow {1,0}\!{c}\). Here \(\ell _{{\varvec{\mathsf{{ p}}}}}\) is c, however it is not reachable from y anymore as the pointee of y is redefined by line 7.   \(\square \)

Since the calling context is not available during GPG construction, we are forced to retain edge \({\varvec{\mathsf{{ n}}}}\) in the GPG, thereby missing an opportunity of reducing the indlev of \({\varvec{\mathsf{{ n}}}}\). Hence we propose the following condition for conclusiveness: The statements corresponding to \({\varvec{\mathsf{{ p}}}}\) and \({\varvec{\mathsf{{ n}}}}\) should be consecutive on every control flow (a) the intervening statements should not have an indirect assignment (e.g., \(*x = \ldots \)), and (b) the pointee of pivot \(\mathbb {P}\) in edge \({\varvec{\mathsf{{ p}}}}\) should have been found i.e., \({\small \mathbb {P}} _{{\varvec{\mathsf{{ p}}}}}^{\,{{\textsf {\textit{c}}}}} = 1\).

In the example above, condition (b) is violated and hence we add \({\varvec{\mathsf{{ n}}}} \equiv {x}\!\xrightarrow {1,2}\!{y}\) to the GPG of procedure q instead of \({\varvec{\mathsf{{ r}}}} \equiv {x}\!\xrightarrow {1,0}\!{c}\). This avoids a greedy reduction of \({\varvec{\mathsf{{ n}}}}\) when the available information is inconclusive.

4 Constructing GPGs at the Intraprocedural Level

In this section we define edge reduction, and GPG update; GPG composition is described in Sect. 5 which shows how procedure calls are handled.

4.1 Edge Reduction \({\varvec{\mathsf{{ n}}}} \circ \varDelta \)

figure g

Edge reduction \({\varvec{\mathsf{{ n}}}} \circ \varDelta \) uses the edges in \(\varDelta \) to compute a set of edges whose indlev  s do not exceed that of \({\varvec{\mathsf{{ n}}}}\) (Definition 4). The results of SS and TS compositions are denoted by \({{\textsf {\textit{SS}}}}^{{{\textsf {\textit{n}}}}}_{\varDelta }\) and \({{\textsf {\textit{TS}}}}^{{{\textsf {\textit{n}}}}}_{\varDelta }\) which compute relevant and useful edge compositions; the inconclusive edge compositions are filtered out independently. The edge ordering is not required at the intraprocedural level; a reverse post order traversal over the control flow graph suffices.

A single-level composition (slc) combines \({{\textsf {\textit{SS}}}}^{{{\textsf {\textit{n}}}}}_{\varDelta }\) with \({{\textsf {\textit{TS}}}}^{{{\textsf {\textit{n}}}}}_{\varDelta }\). When both TS and SS compositions are possible (first case in slc), the join operator \(\bowtie \) combines their effects by creating new edges by joining the sources from \({{\textsf {\textit{SS}}}}^{{{\textsf {\textit{n}}}}}_{\varDelta }\) with the targets from \({{\textsf {\textit{TS}}}}^{{{\textsf {\textit{n}}}}}_{\varDelta }\). If neither of TS and SS compositions is possible (second case in slc), edge \({\varvec{\mathsf{{ n}}}}\) is considered as the reduced edge. If only one of them is possible, its result becomes the result of slc (third case). Since the reduced edges computed by slc may compose with other edges in \(\varDelta \), we extend slc to multi-level composition (\({\small {{\textsf {\textit{mlc}}}}}\)) which recursively composes edges in X with edges in \(\varDelta \) through function \({\small {{\textsf {\textit{slces}}}}}\) which extends slc to a set of edges.

Example 9

When \({\varvec{\mathsf{{ n}}}}\) represents a statement \(x=*y\), we need multi-level compositions: The first-level composition identifies pointees of y while the second-level composition identifies the pointees of pointees of y. This is facilitated by function \({\small {{\textsf {\textit{mlc}}}}}\). Consider the code snippet on the right. \(\varDelta = \{{y}\!\xrightarrow {1,0}\!{a}, {a}\!\xrightarrow {1,0}\!{b}\}\) for \({\varvec{\mathsf{{ n}}}} \equiv {x}\!\xrightarrow {1,2}\!{y}\) (statement \(s_3\)).

figure h

This involves two consecutive TS compositions. The first composition involves \({y}\!\xrightarrow {1,0}\!{a}\) as \({\varvec{\mathsf{{ p}}}}\) resulting in \({{\textsf {\textit{TS}}}}^{{{\textsf {\textit{n}}}}}_{\varDelta } = \{{x}\!\xrightarrow {1,1}\!{a}\}\) and \({{\textsf {\textit{SS}}}}^{{{\textsf {\textit{n}}}}}_{\varDelta } = \emptyset \). This satisfies the third case of slc. Then, \({\small {{\textsf {\textit{slces}}}}}\) is called with \(X = \{{x}\!\xrightarrow {1,1}\!{a}\}\). The second TS composition between \({x}\!\xrightarrow {1,1}\!{a}\) (as a new \({\varvec{\mathsf{{ n}}}}\)) and \({a}\!\xrightarrow {1,0}\!{b}\) (as \({\varvec{\mathsf{{ p}}}}\)) results in a reduced edge \({x}\!\xrightarrow {1,0}\!{b}\). \({\small {{\textsf {\textit{slces}}}}}\) is called again with \(X = \{{x}\!\xrightarrow {1,0}\!{b}\}\) which returns X, satisfying the base condition of \({\small {{\textsf {\textit{mlc}}}}}\).   \(\square \)

Example 10

Single-level compositions are combined using \(\bowtie \) when \({\varvec{\mathsf{{ n}}}}\) represents \(*x = y\).

figure i

For the code snippet on the right, \({{\textsf {\textit{SS}}}}^{{{\textsf {\textit{n}}}}}_{\varDelta }\) returns {\({a}\!\xrightarrow {1,1}\!{y}\)} and \({{\textsf {\textit{TS}}}}^{{{\textsf {\textit{n}}}}}_{\varDelta }\) returns {\({x}\!\xrightarrow {2,0}\!{b}\)} when \({\varvec{\mathsf{{ n}}}}\) is \({x}\!\xrightarrow {2,1}\!{y}\) (for statement \(s_3\)). The join operator \(\bowtie \) combines the effect of TS and SS compositions by combining the sources from \({{\textsf {\textit{SS}}}}^{{{\textsf {\textit{n}}}}}_{\varDelta }\) and the targets from \({{\textsf {\textit{TS}}}}^{{{\textsf {\textit{n}}}}}_{\varDelta }\) resulting in a reduced edge \({\varvec{\mathsf{{ r}}}} \equiv {a}\!\xrightarrow {1,0}\!{b}\).   \(\square \)

4.2 Constructing GPGs \(\varDelta ({\varvec{\mathsf{{ u}}}},{\varvec{\mathsf{{ v}}}})\)

For simplicity, we consider \(\varDelta \) only as a collection of edges, leaving the nodes implicit. Further, the edge ordering does not matter at the intraprocedural level and hence we treat \(\varDelta \) as a set of edges. The construction of \(\varDelta \) assigns sequence numbers in the order of inclusion of edges; these sequence numbers are maintained externally.

By default, the GPGs record the \({\small {{\textsf {\textit{may}}}}}\) information but a simple extension in the form of boundary definitions (described in the later part of this section) allows them to record the \({\small {{\textsf {\textit{must}}}}}\) information. This supports distinguishing between strong and weak updates and yet allows a simple set union to combine the information.

figure j

Definition 5 is an adaptation of Definition 1 for GPGs. Since \(\varDelta \) is viewed as a set of edges, the identity function \(\varDelta _{id}\) is \(\emptyset \), meet operation is \(\cup \), and \(\varDelta ({\varvec{\mathsf{{ u}}}},{\varvec{\mathsf{{ v}}}})\) is the least fixed point of the equation in Definition 5. The composition of a statement-level flow function (\({\varvec{\mathsf{{ n}}}}\)) with a summary flow function (\(\varDelta ({\varvec{\mathsf{{ u}}}},{\varvec{\mathsf{{ t}}}})\)) is performed by GPG update which includes all edges computed by edge reduction \({\varvec{\mathsf{{ n}}}} \!\circ \!\varDelta ({\varvec{\mathsf{{ u}}}},{\varvec{\mathsf{{ t}}}})\); the edges to be removed are under-approximated when a strong update cannot be performed (described in the rest of the section). When a strong update is performed, we exclude those edges of \(\varDelta \) whose source and indlev match that of the shared source of the reduced edges (identified by ). For a weak update, and X contains reduced edges. For an inconclusive edge composition, and \(X = \{ {\varvec{\mathsf{{ n}}}} \}\).

Fig. 7.
figure 7

Aggregate edge for handling strong and weak updates. For this example, \({{\textsf {\textit{s}}}} = \{ a, b, c, \ldots \}\).

Extending \(\varDelta \) to Support Strong Updates. Conventionally, points-to information is killed based on the following criteria: An assignment \(x=\ldots \) removes all points-to facts \({x}\!\xrightarrow {}\!{\cdot }\) whereas an assignment \(*x=\ldots \) removes all points-to facts \({y}\!\xrightarrow {}\!{\cdot }\) where x \({\small {{\textsf {\textit{must}}}}}\)-points-to y; the latter represents a strong update. When x \({\small {{\textsf {\textit{may}}}}}\)-points-to y, no points-to facts can be removed representing a weak update.

Observe that the use of points-to information for strong updates is inherently captured by edge reduction. In particular, the use of edge reduction allows us to model both of the above criteria for edge removal uniformly as follows: the reduced edges should define the same pointer (or the same pointee of a given pointer) along every control flow path reaching the statement represented by \({\varvec{\mathsf{{ n}}}}\). This is captured by the requirement \(|{{\textsf {\textit{def}}}} (X)|=1\).

When \(|{{\textsf {\textit{def}}}} (X)| > 1\), the reduced edges define multiple pointers (or different pointees of the same pointer) leading to a weak update resulting in no removal of edges from \(\varDelta \). When \(|{{\textsf {\textit{def}}}} (X)| = 1\), all reduced edges define the same pointer (or the same pointee of a given pointer). However, this is necessary but not sufficient for a strong update because the pointer may not be defined along all the paths—there may be a path which does not contribute to \({{\textsf {\textit{def}}}} (X)\). We refer to such paths as definition-free paths for that particular pointer (or some pointee of a pointer). The possibility of such a path makes it difficult to distinguish between strong and weak updates.

Since a pointer x or its transitive pointees may be defined along some control flow path from \({\varvec{\mathsf{{ u}}}}\) to \({\varvec{\mathsf{{ v}}}}\), we eliminate the possibility of definition-free paths from \({\varvec{\mathsf{{ u}}}}\) to \({\varvec{\mathsf{{ v}}}}\) by introducing boundary definitions of the following two kinds at \({\varvec{\mathsf{{ u}}}}\): (a) a pointer assignment \(x = x'\) where \(x'\) is a symbolic representation of the initial value of x at \({\varvec{\mathsf{{ u}}}}\) (called the upwards exposed version of x), and (b) a set of assignments representing the relation between \(x'\) and its transitive pointees. They are represented by special GPG edges—the first, by a copy edge \({x}\!\xrightarrow {1,1}\!{x'}\) and the others, by an aggregate edge \({x'}\!\xrightarrow {\mathbb {N},0}\!{{{\textsf {\textit{s}}}}}\) where \(\mathbb {N}\) is the set of all possible indlev  s and s is the summary node representing all possible pointees. As illustrated in Fig. 7, \({x'}\!\xrightarrow {\mathbb {N},0}\!{{{\textsf {\textit{s}}}}}\) is a collection of GPG edges (Fig. 7(b)) representing the relation between x with it transitive pointees at \({\varvec{\mathsf{{ u}}}}\) (Fig. 7(a)).

A reduced edge \({x}\!\xrightarrow {1,j}\!{y}\) along any path from \({\varvec{\mathsf{{ u}}}}\) to \({\varvec{\mathsf{{ v}}}}\) removes the copy edge \({x}\!\xrightarrow {1,1}\!{x'}\) indicating that x is redefined. A reduced edge \({x}\!\xrightarrow {i,j}\!{y}\), \(i\!>\!1\) modifies the aggregate edge \({x'}\!\xrightarrow {\mathbb {N}, 0}\!{{{\textsf {\textit{s}}}}}\) to \({x'}\!\xrightarrow {(\mathbb {N}-\{i\}),0}\!{{{\textsf {\textit{s}}}}}\) indicating that \((i\!-\!1)^{th}\) pointees of x are redefined.

The inclusion of aggregate and copy edges guarantees that \(|{{\textsf {\textit{def}}}} (X)| = 1\) only when the source is defined along every path. This leads to a necessary and sufficient condition for strong updates. Note that the copy and aggregate edges improve the precision of analysis and are not required for its soundness.

Fig. 8.
figure 8

\(\varDelta \) for procedures f and g of Figure 1.

Example 11

Consider the construction of \(\varDelta _g\) as illustrated in Fig. 8(c). Edge \(g_1\) created for line 8 of the program, kills edge \({a}\!\xrightarrow {1,1}\!{a'}\) because \(|{{\textsf {\textit{def}}}} (\{g_1\})| = 1\). For line 10, since the pointees of x and z are not available in g, edge \(g_2\) is created from \(x'\) to \(z'\); this involves composition of \({x}\!\xrightarrow {2,1}\!{z}\) with the edges \({x}\!\xrightarrow {1,1}\!{x'}\) and \({z}\!\xrightarrow {1,1}\!{z'}\). Edges \(g_3\), \(g_4\), \(g_5\) and \(g_6\) correspond to lines 11, 13, 14, and 16 respectively.

The \({z}\!\xrightarrow {1,1}\!{z'}\) edge is killed along both paths (lines 11 and 14) and hence is struck off in \(\varDelta _g\), indicating z is \({\small {{\textsf {\textit{must}}}}}\)-defined. On the other hand, \({y}\!\xrightarrow {1,1}\!{y'}\) is killed only along one of the two paths and hence is retained by the control flow merge just before line 16. Similarly \({x'}\!\xrightarrow {2,0}\!{{{\textsf {\textit{s}}}}}\) in the aggregate edge is retained indicating that pointee of x is not defined along all paths. Edge \(g_6\) kills \({x}\!\xrightarrow {1,1}\!{x'}\). Line 17 creates edges \(g_7\) and \(g_8\); this is a weak update because y has multiple pointees (\(|{{\textsf {\textit{def}}}} (\{g_7, g_8\})| \ne 1\)). Hence \({b}\!\xrightarrow {1,1}\!{b'}\) is not removed. Similarly, \({y'}\!\xrightarrow {2,0}\!{{{\textsf {\textit{s}}}}}\) in the aggregate edge \({y'}\!\xrightarrow {\mathbb {N},0}\!{{{\textsf {\textit{s}}}}}\) is not removed.    \(\square \)

5 Constructing GPGs at the Interprocedural Level

Definition 6 shows the construction of GPGs at the interprocedural level by handling procedure calls. Consider a procedure f containing a call to g between two consecutive program points \({\varvec{\mathsf{{ u}}}}\) and \({\varvec{\mathsf{{ v}}}}\). Let \({\small {{{\textsf {\textit{Start}}}}_{g}}}\) and \({\small {{{\textsf {\textit{End}}}}_{g}}}\) denote the start and the end points of g. \(\varDelta \) representing the control flow paths from \({\small {{{\textsf {\textit{Start}}}}_{f}}}\) to \({\varvec{\mathsf{{ u}}}}\) (i.e., just before the call to g) is \(\varDelta ({\small {{{\textsf {\textit{Start}}}}_{f}}},{\varvec{\mathsf{{ u}}}})\); we denote it by \(\varDelta _f\) for brevity. \(\varDelta \) for the body of procedure g is \(\varDelta ({\small {{{\textsf {\textit{Start}}}}_{g}}},{\small {{{\textsf {\textit{End}}}}_{g}}})\); we denote it by \(\varDelta _g\).

Since GPGs are sequences of edges, \(\varDelta _g \circ \varDelta _f\) involves selecting an edge e in order from \(\varDelta _g\) and performing an update \(\varDelta _f\left[ e \, \circ \,\varDelta _f\right] \). We then update the resulting \(\varDelta \) with the next edge from \(\varDelta _g\). This is repeated until all edges of \(\varDelta _g\) are exhausted. The update of \(\varDelta _f\) with an edge e from \(\varDelta _g\) involves the following: (a) substituting the callee’s upwards exposed variable \(x'\) occurring in \(\varDelta _g\) by the caller’s original variable x in \(\varDelta _f\), (b) including the reduced edges \(e \circ \varDelta _f\), and (c) performing a strong or weak update.

A copy edge \({x}\!\xrightarrow {1,1}\!{x'}\in \varDelta \) implies that x has not been defined along some path. Similarly, an aggregate edge \({x'}\!\xrightarrow {\mathbb {N},0}\!{{{\textsf {\textit{s}}}}}\in \varDelta \) implies that some \((i-1)^{th}\) pointees of x, \(i\!>\! 1\) have not been defined along some path. We use these to define \({\small {{\textsf {\textit{mustdef}}}}\,({{x}\!\xrightarrow {i,j}\!{y}},{\varDelta })}\) which asserts that the \((i\!-\!1)^{th}\) pointees of x, \(i\!>\! 1\) are defined along every control flow path. We combine it with to define callsup for identifying strong updates. Note that we need mustdef only at the interprocedural level and not at the intraprocedural level. This is because, when we use \(\varDelta _g\) to compute \(\varDelta _f\), performing a strong update requires knowing whether the source of an edge in \(\varDelta _g\) has been defined along every control flow path in g. However, we do not have the control flow information of g when we analyze f. When a strong update is performed, we delete all edges in \(\varDelta _f\) that match \(e\circ \varDelta _f\). These edges are discovered by taking a union of , .

figure k

The total order imposed by the sequence of GPG edges is interpreted as a partial order as follows: Since the edges from \(\varDelta _g\) are added one by one, if the edge to be added involves an upwards exposed variable \(x'\), it should be composed with an original edge in \(\varDelta _f\) rather than a reduced edge included in \(\varDelta _f\) created by \(e_1 \circ \varDelta _f\) for some \(e_1 \in \varDelta _g\). Further, it is possible that an edge \(e_2\) may kill an already added edge \(e_1\) that coexisted with it in \(\varDelta _g\). However, this should be prohibited because their coexistence in \(\varDelta _g\) indicates that they are \({\small {{\textsf {\textit{may}}}}}\) edges. This is ensured by checking the presence of multiple edges with the same source in \(\varDelta _g\). For example, edge \(f_7\) of Fig. 8(d) does not kill \(f_5\) as they coexist in \(\varDelta _g\).

Example 12

Consider the construction of \(\varDelta _f\) as illustrated in Fig. 8(d). Edges \(f_1\) and \(f_2\) correspond to lines 2 and 3. The call on line 4 causes the composition of \(\varDelta _f = \{ f_1, f_2 \}\) with \(\varDelta _g\) selecting edges in the order \(g_1, g_2, \ldots , g_8\). The edges from \(\varDelta _g\) with their corresponding names in \(\varDelta _f\) (denoted name-in-g/name-in-f) are: \(g_1/f_3\), \(g_3/f_5\), \(g_4/f_6\), \(g_5/f_7\), \(g_6/f_8\), \(g_7/f_9\), and \(g_8/f_{10}\). Edge \(f_4\) is created by SS and TS compositions of \(g_2\) with \(f_1\) and \(f_2\). Although x has a single pointee (along edge \(f_1\)), the resulting update is a weak update because the source of \(g_2\) is \({\small {{\textsf {\textit{may}}}}}\)-defined indicated by the presence of \({x'}\!\xrightarrow {2,0}\!{{{\textsf {\textit{s}}}}}\) in the aggregate edge \({x'}\!\xrightarrow {\mathbb {N},0}\!{{{\textsf {\textit{s}}}}}\).

Edges \(g_3/f_5\) and \(g_5/f_7\) together kill \(f_2\). Note that the inclusion of \(f_7\) does not kill \(f_5\) because they both are from \(\varDelta _g\). Finally, the edge for line 5 (\({x}\!\xrightarrow {2,1}\!{z}\)) undergoes an SS composition (with \(f_8\)) and TS compositions (with \(f_5\) and \(f_7\)). This creates edges \(f_{11}\) and \(f_{12}\). Since \({x}\!\xrightarrow {2,1}\!{z}\) is accompanied by the aggregate edge \({x'}\!\xrightarrow {\mathbb {N}-\{2\},0}\!{{{\textsf {\textit{s}}}}}\) indicating that the pointee of x is \({\small {{\textsf {\textit{must}}}}}\)-defined, and x has a single pointee (edge \(f_8\)), this is a strong update killing edge \(f_{10}\). Observe that all edges in \(\varDelta _f\) represent classical points-to facts except \(f_9\). We need the pointees of y from the callers of f to reduce \(f_9\).    \(\square \)

For recursive calls, the \(\varDelta \) for a callee may not have been computed because of a cycle in the call graph. This is handled in the usual manner [9, 21] by over-approximating initial \(\varDelta \) that computes \(\top \) for \({\small {{\textsf {\textit{may}}}}}\) points-to analysis (which is \(\emptyset \)). Such an initial GPG, denoted \(\varDelta _\top \) (Definition 3), kills all points-to relations and generates none. \(\varDelta _\top \) is not expressible as a GPG and is not a natural \(\top \) element of the meet semi-lattice [9] of GPGs. The identity GPG \(\varDelta _{id}\) represents an empty set of edges because it does not generate or kill points-to information. For more details, please see [4].

6 Computing Points-To Information Using GPGs

Recall that the points-to information is represented by a memory M. We define two operations to compute a new memory \({M} '\) using a GPG or a GPG edge from a given memory M.

  • An edge application \({[\![}e {]\!]}{M} \) computes memory \({M} '\) by incorporating the effect of GPG edge \(e \equiv {x}\!\xrightarrow {i,j}\!{y}\) in memory M. This involves inclusion of edges described by the set \(\left\{ {w}\!\xrightarrow {1,0}\!{z} \mid w \in M ^{i-1}\{x\},\; z \in M ^{j}\{y\}\right\} \) in \({M} '\) and removal of edges by distinguishing between a strong and a weak update. The edges to be removed are characterized much along the lines of callkill.

  • A GPG application \({[\![}\varDelta {]\!]}{M} \) applies the GPG \(\varDelta \) to M and computes the resulting memory \({M} '\) using edge application iteratively.

Let \({{\textsf {\textit{PT}}}} _{\!{\varvec{\mathsf{{ v}}}}}\) denote the points-to information at program point \({\varvec{\mathsf{{ v}}}}\) in procedure f. Then, \({{\textsf {\textit{PT}}}} _{\!{\varvec{\mathsf{{ v}}}}}\) can be computed by (a) computing boundary information of f (denoted \({{\textsf {\textit{BI}}}} _f\)) associated with \({\small {{{\textsf {\textit{Start}}}}_{f}}}\), and (b) computing the points-to information at \({\varvec{\mathsf{{ v}}}}\) from \({{\textsf {\textit{BI}}}} _f\) by incorporating the effect of all paths from \({\small {{{\textsf {\textit{Start}}}}_{f}}}\) to \({\varvec{\mathsf{{ v}}}}\).

\({{\textsf {\textit{BI}}}} _f\) is computed as the union of the points-to information reaching f from all of its call points. For the main function, BI is computed from static initializations. In the presence of recursion, a fixed point computation is required for computing BI.

If \({\varvec{\mathsf{{ v}}}}\) is \({\small {{{\textsf {\textit{Start}}}}_{f}}}\), then \({{\textsf {\textit{PT}}}} _{\!{\varvec{\mathsf{{ v}}}}} = {{\textsf {\textit{BI}}}} _f\). For other program points, \({{\textsf {\textit{PT}}}} _{\!{\varvec{\mathsf{{ v}}}}}\) can be computed from \({{\textsf {\textit{BI}}}} _f\) in the following ways; both of them compute identical \({{\textsf {\textit{PT}}}} _{\!{\varvec{\mathsf{{ v}}}}}\).

  1. (a)

    Using statement-level flow function (Stmt-ff): Let \({{{\textsf {\textit{stmt}}}}} ({\varvec{\mathsf{{ u}}}},{\varvec{\mathsf{{ v}}}})\) denote the statement between \({\varvec{\mathsf{{ u}}}}\) and \({\varvec{\mathsf{{ v}}}}\). If it is a non-call statement, let its flow function \(\delta ({\varvec{\mathsf{{ u}}}},{\varvec{\mathsf{{ v}}}})\) be represented by the GPG edge \({\varvec{\mathsf{{ n}}}}\). Then \({{\textsf {\textit{PT}}}} _{\!{\varvec{\mathsf{{ v}}}}}\) is computed as the least fixed point of the following data flow equations.

    $$\begin{aligned} {{\textsf {\textit{In}}}}_{{\varvec{\mathsf{{ u}}}},{\varvec{\mathsf{{ v}}}}}&= {\left\{ \begin{array}{ll} {[\![}\varDelta ({\small {{{\textsf {\textit{Start}}}}_{q}}},{\small {{{\textsf {\textit{End}}}}_{q}}}){]\!]}{{\textsf {\textit{PT}}}} _{\!{\varvec{\mathsf{{ u}}}}} &{} {{{\textsf {\textit{stmt}}}}} ({\varvec{\mathsf{{ u}}}}, {\varvec{\mathsf{{ v}}}}) = \text {call q} \\ {[\![}{\varvec{\mathsf{{ n}}}} {]\!]}{{\textsf {\textit{PT}}}} _{\!{\varvec{\mathsf{{ u}}}}} &{} \text {otherwise} \end{array}\right. } \\ {{\textsf {\textit{PT}}}} _{\!{\varvec{\mathsf{{ u}}}}}&= \displaystyle \bigcup \limits _{{\varvec{\mathsf{{ u}}}} \, \in \, {{{\textsf {\textit{pred}}}}} ({\varvec{\mathsf{{ v}}}})} {{\textsf {\textit{In}}}}_{{\varvec{\mathsf{{ u}}}},{\varvec{\mathsf{{ v}}}}} \end{aligned}$$
  2. (b)

    Using GPGs: \({{\textsf {\textit{PT}}}} _{\!{\varvec{\mathsf{{ v}}}}}\) is computed using GPG application \({[\![}\varDelta ({\small {{{\textsf {\textit{Start}}}}_{f}}},{\varvec{\mathsf{{ v}}}}){]\!]}{{\textsf {\textit{BI}}}} _f\). This approach of \({{\textsf {\textit{PT}}}} _{\!{\varvec{\mathsf{{ v}}}}}\) computation is oblivious to intraprocedural control flow and does not involve fixed point computation for loops.

Our measurements show that the Stmt-ff approach takes much less time than using GPGs for \({{\textsf {\textit{PT}}}} _{\!{\varvec{\mathsf{{ v}}}}}\) computation. This may appear surprising because the Stmt-ff approach requires an additional fixed point computation for handling loops which is not required in case of GPGs. However, using GPGs requires more time because the GPG at \({\varvec{\mathsf{{ v}}}}\) represents a cumulative effect of the statement-level flow functions from \({\small {{{\textsf {\textit{Start}}}}_{f}}}\) to \({\varvec{\mathsf{{ v}}}}\). Hence the GPGs tend to become larger with the length of a control flow path. Thus computing \({{\textsf {\textit{PT}}}} _{\!{\varvec{\mathsf{{ v}}}}}\) using GPGs for multiple consecutive statements involves redundant computations.

Bypassing of BI . Our measurements show that using the entire BI of a procedure may be expensive because many points-to pairs reaching a call may not be accessed by the callee procedure. Thus the efficiency of analysis can be enhanced significantly by filtering out the points-to information which is irrelevant to a procedure but merely passes through it unchanged. This concept of bypassing has been successfully used for data flow values of scalars [15, 16]. GPGs support this naturally for pointers with the help of upwards exposed versions of variables. An upwards exposed version in a GPG indicates that there is a use of a variable in the procedure which requires pointee information from the callers. Thus, the points-to information of such a variable is relevant and should be a part of BI. For variables that do not have their corresponding upwards exposed versions in a GPG, their points-to information is irrelevant and can be discarded from the BI of the procedure, effectively bypassing its calls.

7 Implementation and Measurements

We have implemented GPG based points-to analysis in GCC 4.7.2 using the LTO framework and have carried out measurements on SPEC CPU2006 benchmarks on a machine with 16 GB RAM with 8 64-bit Intel i7-4770 CPUs running at 3.40 GHz. Figure 9 provides the empirical data.

Our method eliminates local variables using the SSA form and GPGs are computed only for global variables. Eventually, the points-to information for local variables is computed from that of global variables and parameters. Our implementation over-approximates an array by treating it as a single variable and maintains its information flow-insensitively. Heap memory is approximated by maintaining indirection lists of field dereferences of length 2 (see [4]). Unlike the conventional approaches [25, 27, 28], our summary flow functions do not depend on aliasing at the call points. The actually observed number of aliasing patterns (column S in Fig. 9) suggests that it is undesirable to indiscriminately construct multiple PTFs for a procedure.

Columns A, B, P, and Q present the details of the benchmarks. Column C provides the time required for the first phase of our analysis i.e., computing GPGs. The computation of points-to information at each program point has four variants (using GPGs or Stmt-ff with or without bypassing). Their time measurements are provided in columns D, E, F, and G. Our data indicates that the most efficient method for computing points-to information is to use statement-level flow functions and bypassing (column G).

Fig. 9.
figure 9

Time, precision, size, and effectiveness measurements for GPG Based Points-to Analysis. Byp (Bypassing), NoByp (No Bypassing), Stmt-ff (Statement-level flow functions), G (Global pointers), L (Local pointers), Arr (Array pointers).

Our analysis computes points-to information flow-sensitively for globals. The following points-to information is stored flow-insensitively: locals (because they are in the SSA form) and arrays (because their updates are conservative). Hence, we have separate columns for globals (columns H and I) and locals+arrays (column J) for GPGs. GCC-PTA computes points-to information flow-insensitively (column K) whereas LFCPA computes it flow-sensitively (column L).

The second table provides measurements about the effectiveness of summary flow functions in terms of (a) compactness of GPGs, (b) percentage of context independent information, and (c) reusability. Column U shows that GPGs are empty for a large number of procedures. Besides, in six out of nine benchmarks, most procedures with non-empty GPGs have a significantly high percentage of context independent information (column V). Thus a top-down approach may involve redundant computations on multiple visits to a procedure whereas a bottom-up approach may not need much work for incorporating the effect of a callee’s GPG into that of its callers. Further, many procedures are called multiple times indicating a high reuse of GPGs (column R).

The effectiveness of bypassing is evident from the time measurements (columns E and G) as well as a reduction in the average number of points-to pairs (column I).

We have compared our analysis with GCC-PTA and LFCPA [11]. The number of points-to pairs per function for GCC-PTA (column K) is large because it is partially flow-sensitive (because of the SSA form) and context-insensitive. The number of points-to pairs per statements is much smaller for LFCPA (column L) because it is liveness-based. However LFCPA which in our opinion represents the state of the art in fully flow- and context-sensitive exhaustive points-to analysis, does not seem to scale beyond 35 kLoC. We have computed the average number of pointees of dereferenced variables which is maximum for GCC-PTA (column N) and minimum for LFCPA (column O) because it is liveness driven. The points-to information computed by these methods is incomparable because they employ radically dissimilar features of points-to information such as flow- and context-sensitivity, liveness, and bypassing.

8 Related Work

Section 1 introduced two broad categories of constructing summary flow functions for pointer analysis. Some methods using placeholders require aliasing information in the calling contexts and construct multiple summary flow functions per procedure [25, 28]. Other methods do not make any assumptions about the calling contexts [12, 13, 20, 23, 24] but they construct larger summary flow functions causing inefficiency in fixed point computation at the intraprocedural level thereby prohibiting flow-sensitivity for scalability. Also, these methods cannot perform strong updates thereby losing precision.

Among the general frameworks for constructing procedure summaries, the formalism proposed by Sharir and Pnueli [21] is limited to finite lattices of data flow values. It was implemented using graph reachability in [14, 18, 19]. A general technique for constructing procedure summaries [5] has been applied to unary uninterpreted functions and linear arithmetic. However, the program model does not include pointers.

Symbolic procedure summaries [25, 27] involve computing preconditions and corresponding postconditions (in terms of aliases). A calling context is matched against a precondition and the corresponding postcondition gives the result. However, the number of calling contexts in a program could be unbounded hence constructing summaries for all calling contexts could lose scalability. This method requires statement-level transformers to be closed under composition; a requirement which is not satisfied by pointer analysis (as mentioned in Sect. 1). We overcome this problem using generalized points-to facts. Saturn [6] also creates summaries that are sound but may not be precise across applications because they depend on context information.

Some approaches use customized summaries and combine the top-down and bottom-up analyses to construct summaries for only those calling contexts that occur in a given program [28]. This choice is controlled by the number of times a procedure is called. If this number exceeds a fixed threshold, a summary is constructed using the information of the calling contexts that have been recorded for that procedure. A new calling context may lead to generating a new precondition and hence a new summary.

9 Conclusions and Future Work

Constructing bounded summary flow functions for flow and context-sensitive points-to analysis seems hard because it requires modelling unknown locations accessed indirectly through pointers—a callee procedure’s summary flow function is created without looking at the statements in the caller procedures. Conventionally, they have been modelled using placeholders. However, a fundamental problem with the placeholders is that they explicate the unknown locations by naming them. This results in either (a) a large number of placeholders, or (b) multiple summary flow functions for different aliasing patterns in the calling contexts. We propose the concept of generalized points-to graph (GPG) whose edges track indirection levels and represent generalized points-to facts. A simple arithmetic on indirection levels allows composing generalized points-to facts to create new generalized points-to facts with smaller indirection levels; this reduces them progressively to classical points-to facts. Since unknown locations are left implicit, no information about aliasing patterns in the calling contexts is required allowing us to construct a single GPG per procedure. GPGs are linearly bounded by the number of variables, are flow-sensitive, and are able to perform strong updates within calling contexts. Further, GPGs inherently support bypassing of irrelevant points-to information thereby aiding scalability significantly.

Our measurements on SPEC benchmarks show that GPGs are small enough to scale fully flow and context-sensitive exhaustive points-to analysis to programs as large as 158 kLoC (as compared to 35 kLoC of LFCPA [11]). We expect to scale the method to still larger programs by (a) using memoisation, and (b) constructing and applying GPGs incrementally thereby eliminating redundancies within fixed point computations.

Observe that a GPG edge \({x}\!\xrightarrow {i,j}\!{y}\) in M also asserts an alias relation between \({M} ^{i} \{x\}\) and \({M} ^{j} \{y\}\) and hence GPGs generalize both points-to and alias relations.

The concept of GPG provides a useful abstraction of memory involving pointers. The way matrices represent values as well as transformations, GPGs represent memory as well as memory transformers defined in terms of loading, storing, and copying memory addresses. Any analysis that is influenced by these operations may be able to use GPGs by combining them with the original abstractions of the analysis. We plan to explore this direction in the future.