
1 Introduction

Consider a function F which, given \(x, y_1,\dots ,y_m\), computes \(f(x,y_i)\) for each \(y_i\):

figure a
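Figure a is not reproduced in this text, so the following is a minimal OCaml sketch of such an F; the name big_f and the assumed type of f ('a -> 'b -> 'c) are ours, not the paper's:

    (* Apply f to the shared first argument x and to each y_i in turn. *)
    let big_f (f : 'a -> 'b -> 'c) (x : 'a) (ys : 'b list) : 'c list =
      List.map (fun y -> f x y) ys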

Observe that this implementation would be wasteful if \(f(x,y_i)\) does a significant amount of work that does not depend on \(y_i\). In such a case, it would be advantageous to find the computations in f that depend only on x and then stage execution of f so they are performed only once at the beginning of F.

Fig. 1. Quickselect: traditional and staged.

Jørring and Scherlis classify automatic program staging transformations, such as the one described above, as forms of frequency reduction or precomputation [15]. To perform frequency reduction, one identifies and hoists computations that are performed multiple times, in order to compute them only once. To perform precomputation, one identifies computations that can be performed in advance and does so—for example, at compile time if the relevant inputs are statically known.

One common precomputation technique is partial evaluation [9, 13], which relies on dynamic compilation to specialize functions to known argument values. Going back to our example, if f specializes to a particular v, written \(f_v\), such that \(f(v,y) = f_v(y)\), then F can be specialized to v as

figure b

This eliminates the need to compute m times those parts of \(f(v,-)\) which do not depend on the second argument.

Closely related to partial evaluation is metaprogramming, where known values represent program code to be executed in a later stage  [4, 5, 20, 28]. Metaprogramming enables fine-grained control over specialization by requiring explicit staging annotations that mark the stage of each expression.

While simple forms of frequency reduction include standard compiler optimizations such as loop hoisting and common subexpression elimination, Jørring and Scherlis proposed the more general transformation of splitting a program into multiple subfunctions (called pass separation in  [15]). In our example, splitting transforms the function f into two others \(f_1\) and \(f_2\) such that \(f(x,y_i) = f_2(f_1(x),y_i)\). Then we evaluate F by evaluating \(f_1\) on x, and using the result z to evaluate the second function on each \(y_i\).

figure c
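In the same OCaml style, a sketch of \(F_\text {multipass}\), assuming \(f_1\) and \(f_2\) satisfy \(f(x,y) = f_2(f_1(x),y)\):

    (* Split evaluation: the first stage runs once, the second once per y_i. *)
    let f_multipass (f1 : 'a -> 'z) (f2 : 'z -> 'b -> 'c)
        (x : 'a) (ys : 'b list) : 'c list =
      let z = f1 x in                  (* first-stage work, performed once *)
      List.map (fun y -> f2 z y) ys    (* second-stage work, once per y_i *)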

The key difference between splitting and partial evaluation (or metaprogramming) is that the former can be performed without access to the first argument x; \(F_\text {multipass}\) works for any x, while \(F_v\) is defined only for \(x=v\). Therefore, unlike partial evaluation, splitting is a static program transformation (“metastatic” in partial evaluation terminology) and does not require dynamic code generation.

Prior work on partial evaluation and metaprogramming has demonstrated automatic application of these techniques to higher-order functional languages. In contrast, automatic splitting transformations have been limited to simpler languages [8, 11, 16, 22].

In this paper, we present a splitting algorithm for , a two-staged typed lambda calculus in the style of Davies [4], with support for recursion and first-class functions. Like Davies, uses a \(\bigcirc \) modality to denote computation in the second stage, but to aid splitting we also add a \(\nabla \) modality to denote purely first-stage computations. The dynamic semantics (Sect. 3) are those of Davies, modified to provide an eager behavior between stages, which we believe is more intuitive in the context of splitting. We then prove the correctness of our splitting algorithm (Sect. 4) with respect to these semantics. Finally, we discuss our implementation of this splitting algorithm (Sect. 5) and demonstrate its power and behavior on a number of staged programs, ranging from straight-line arithmetic operations to recursive and higher-order functions (Sect. 6).

We also demonstrate that splitting a recursive mixed-stage f yields an \(f_1\) which computes a recursive data structure and an \(f_2\) which traverses that structure in light of information available at the second stage. In the case of the quickselect algorithm, which we discuss next, the split code executes asymptotically faster than an unstaged evaluation of f.

2 Overview

Suppose that we wish to perform a series of order statistics queries on a list l. To this end, we can use the quickselect algorithm [12], which, given a list l and an integer k, returns the element of l with rank k (i.e., the kth-largest element). As implemented in an ML-like language in Fig. 1(a), qs partitions l using the first element as a pivot and then recurs on one of the two resulting sides, depending on the relationship of k to the size i of the first half, in order to find the desired element. If the index is out of range, a default value of 0 is returned. Assuming that the input list, with size n, is uniformly randomly ordered (which can be achieved by pre-permuting it), qs runs in expected \(\varTheta (n)\) time. Using qs, we can perform m different order statistics queries with ranks \(\mathtt {k1},\dots ,\mathtt {km}\) as follows:

figure d

Unfortunately, this approach requires \(\varTheta (n \cdot m)\) time.
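Since Fig. 1(a) is not reproduced in this text, here is a rough OCaml rendering of qs and the query loop; we assume zero-indexed ranks and send elements equal to the pivot to the right, as the exact conventions live in the figure:

    (* Select the rank-k element by partitioning on the head; 0 if k is out of range. *)
    let rec qs (l : int list) (k : int) : int =
      match l with
      | [] -> 0
      | h :: t ->
        let left, right = List.partition (fun x -> x < h) t in
        let i = List.length left in
        if k < i then qs left k
        else if k = i then h
        else qs right (k - i - 1)

    (* m independent queries, each costing expected Theta(n). *)
    let query_all (l : int list) (ks : int list) : int list =
      List.map (qs l) ks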

We can attempt to improve on this algorithm by factoring out computations shared between these calls to qs. In particular, we can construct a binary search tree out of l, at cost \(\varTheta (n\log {n})\), and then simply look for the kth leftmost element of that tree. Doing lookups efficiently, in \(\varTheta (\log {n})\) time, requires one more innovation—augmenting the tree by storing at each node the size of its left subtree. Thus in total this approach has expected runtime \(\varTheta (n\log {n} + m\log {n})\).
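A sketch of this precomputed approach in OCaml; the tree representation and the out-of-range default are our assumptions:

    type tree = Leaf | Node of tree * int * int * tree
      (* Node (left, pivot, size of left subtree, right) *)

    let rec build (l : int list) : tree =
      match l with
      | [] -> Leaf
      | h :: t ->
        let left, right = List.partition (fun x -> x < h) t in
        Node (build left, h, List.length left, build right)

    (* Find the kth leftmost element; expected Theta(log n) on a random list. *)
    let rec lookup (t : tree) (k : int) : int =
      match t with
      | Leaf -> 0
      | Node (lt, h, i, rt) ->
        if k < i then lookup lt k
        else if k = i then h
        else lookup rt (k - i - 1)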

Is this method, wherein we precompute a data structure, better than the direct \(\varTheta (mn)\) method? The answer depends on the relationship between the number of lookups m and the size of the list n. If m is constant, then the direct method is superior for sufficiently large n. This is because the precomputed method does unnecessary work sorting parts of the list where there are no query points. However, if the number of queries grows with the size of the list, specifically in \(\omega (\log n)\), then the precomputed method will be asymptotically faster.

Rewriting algorithms in this way—to precompute some intermediate results that depend on constant (or infrequently-varying) inputs—is non-trivial, as it requires implementing more complex data structures and algorithms. In this paper, we present a splitting algorithm which does much of this task automatically.

2.1 Staging

The idea behind staged programming is to use staging annotations—in our case, guided by types—to indicate the stage of each subterm.

In Fig. 1(b) we show a staged version of qs, called qss, where first-stage code is colored red, and second-stage blue. In qss, we regard the input list l as arriving in the first stage (with type \(\nabla \mathrm {list}\), a list “now”), the input rank k as arriving in the second stage (with type \(\bigcirc \mathrm{int}\), an integer in the “future”), and the result as being produced in the second stage (with type \(\bigcirc \mathrm{int}\)).

qss is obtained from qs by wrapping certain computations with prev and next, signaling transitions between first- and second-stage code. Additionally, \(\texttt {gr}\) (ground) annotations mark certain first-stage components as being purely first-stage, rather than mixed-stage. We also use a function, hold, to promote first-stage integers to second-stage integers. Our type system ensures that the staging annotations in qss are consistent, in the sense that computations marked as first-stage cannot depend on ones marked as second-stage.

The process of automatically adding staging annotations to unstaged code, called binding time analysis, has been the subject of extensive research (Sect. 7). In this paper, we do not consider this problem, instead assuming that the annotations already exist. In the case of qss, we have specifically chosen annotations which maximize the work performed in the first stage.

Fig. 2. Two-pass implementation of quickselect.

2.2 Splitting Staged Programs

In the rest of this section, we present a high-level overview of the main ideas behind our splitting algorithm, applied to qss. Splitting qss yields a two-part program that creates a probabilistically balanced, augmented binary search tree as an intermediate data structure. In particular, its first part (qs1) constructs such a binary search tree and its second part (qs2) traverses the tree, using the embedded size information to find the element of the desired rank. The code for qs1 and qs2 is in Fig. 2.

Our splitting algorithm scans qss for first-stage computations, gathering them into qs1. Given l, this function performs these computations and places the information needed by the subsequent function into a boundary data structure. In particular, qs1 performs all recursive calls and evaluates all instances of part (since it depends only on l). It produces a boundary data structure that collects the results from these recursive calls, tagged by the branch (LT, EQ, or GT) in which that call occurred. Since the recursive calls occur in two different branches (LT and GT), the boundary structure is a binary tree. Lastly, it records i (the size of the left subtree) and #1 ht (the pivot/head of the list) in the boundary structure, because those computations are held for use in the second stage. The final result is a binary search tree, augmented with size information, whose keys are the pivots.

Our splitting algorithm simultaneously scans qss for second-stage computations, gathering them into qs2. This function is given the boundary data structure and the rank k, and it finishes the computation. Now that k is known, the conditional on compare k i can be evaluated, choosing which recursive call of qss is actually relevant for this k. Since the boundary data structure contains the first-stage data for all of the recursive calls, performing these comparisons essentially walks the tree, using the rank along with the size data i to look up the kth leftmost node in the tree.

Fig. 3. Abstract syntax. Subscript-w rules apply at all worlds.

3 Statics and Dynamics

We express two-stage programs as terms in , a typed, modal lambda calculus. Although describes computations that occur in two stages, we find it helpful for the specification of splitting to codify terms using one of three worlds. A world is essentially a slightly finer classification than a stage. Whereas there is only one world, , for second-stage computations, there are two worlds corresponding to the first stage: for mixed first-stage computations, which may contain second-stage subterms within next blocks, and for ground first-stage computations, which may not. The distinction between these two first-stage worlds is necessary for the splitting algorithm to produce efficient outputs and will be discussed in Sect. 4.6.

The abstract syntax of is presented as a grammar in Fig. 3. To simplify the upcoming translation in Sect. 4, we have chosen to statically distinguish between values and general computations, using an underline constructor (“\({\underline{v}}\)”) which explicitly notes the parts of a computation that have been reduced to a value. Moreover, the value/computation distinction interacts non-trivially with the three possible world classifications. As a result, we end up with six classes of term in the grammar.

The key feature of all three forms of value is that they have no remaining work in the first stage. That is, all of the first-stage portions of a value are fully reduced (except the bodies of first-stage functions). In the case of values at , which we call ground values, this collapses to just the standard notion of values in a monostage language. In the case of values at , which we call residuals, this collapses to just standard monostage terms. Lastly, values at end up with more of a mixed character, so we call them partial values.

Fig. 4. Statics, split into standard rules and staged rules.

3.1 Statics

The typing judgment \( \Gamma {} \vdash {e}:\!{A}~{\text {@}}~w\), defined in Fig. 4, means that e has type A at world w, in the context \(\Gamma \).

All three worlds contain unit, product, function, sum and recursive types defined in the usual fashion. These “standard” features can only be constructed from subterms of the same world, and variables can only be used at the same world where they were introduced. Thus differing worlds (and hence, differing stages of computation) only interact by means of the \(\bigcirc \) and \(\nabla \) type formers. These modalities are internalizations of worlds and , respectively, as types at world .

At the term level, next blocks can be used to form future computations: given a term e of type A at world , \(\mathtt{next}({e})\) has type \(\bigcirc A\) at . This essentially encapsulates e as a computation that will be evaluated in the future, and it provides a handle (of type \(\bigcirc A\)) now to that eventual value. Computations at can shuffle this handle around as a value, but the future result it refers to cannot be accessed. This is because the only way to eliminate a \(\bigcirc \) wrapper is by using a prev, which yields an A at . This feature was adapted from linear temporal logic, via [4], and ensures that there can be no flow of information from the second stage to the first.

\(\nabla A\) is a type in world which classifies purely-first-stage computations of type A. Given a world term e of type A, \(\mathtt{gr}({e})\) has type \(\nabla A\) at world . (e is guaranteed not to contain second-stage computations because \(\bigcirc \) types are not available in world .) An e of type \(\nabla A\) at can be unwrapped as an A at using the \({\texttt {letg(}}{e}{\texttt {;}}{x}.{e'}{\texttt {)}}\) construct, which binds

figure e

in a term \(e'\). This allows us to compute under \(\nabla \)—for example, given a

figure f

the term

figure g

computes its first projection, of type \(\nabla A\). This elimination form, in contrast to that of \(\bigcirc \), does not permit world subterms within any world

figure h

term.

These features are sufficient to ensure that mixed code does not leak into ground code; however, they also prevent information from ever escaping a \(\nabla \) wrapper. So to allow the latter behavior but not the former, we introduce the \({\texttt {caseg(}}{e}{\texttt {;}}{x.e_1}{\texttt {;}}{x.e_2}{\texttt {)}}\) construct, whose predicate is of type \(\nabla (A+B)\) and whose branches are world terms open on \(\nabla {A}\) and \(\nabla {B}\) respectively. This essentially allows code at to inspect an injection tag within a \(\nabla \).

Although products and functions are restricted to types at the same world, \(\bigcirc \) allows construction of “mixed-stage” products and functions. For example, qss is a function at world which takes a \(\nabla \mathrm{list} \times \bigcirc \mathrm{int}\) (a purely-first-stage list and a second-stage integer) to a \(\bigcirc \mathrm{int}\) (a second-stage computation of an integer).

The example code in this paper uses an extension of the formalized . In particular, it makes liberal use of \(\mathrm{int}\)s and various functions on these, as well as a function \(\mathtt{hold()}\) which takes a \(\nabla \mathrm{int}\) to a \(\bigcirc \mathrm{int}\).

Fig. 5. First-pass evaluation dynamics. The judgement (figure i) indicates that e takes a step to \(e'\) in the first pass; (figure j) and \(\mathcal {S} \ \mathsf {vc}_{w}\) are helper judgements.

3.2 Dynamics

The central tenet of a staged language is that first-stage code should be evaluated entirely before second-stage code. Accordingly, our dynamics operates in two passes. The first pass takes an input top-level program and reduces all of its first-stage (worlds and ) subterms in place, eventually resulting in a residual \({\underline{q}}\). The second pass further reduces this residual. Since q is monostage by definition, this second pass is standard unstaged evaluation and is not described in further detail in this paper. Moreover, for the purposes of these dynamics, we consider a top-level program to always be typed at world .

Since may be constructed out of terms at other worlds, our dynamics requires notions of values and steps that are specialized to each world. The rules for all parts of first-pass evaluation are given in Fig. 5. In this and later figures, we extensively use an \(\mathcal {S}[\text {-}]\) construction to indicate a shallow evaluation context which looks a single level deep.

World . Steps at world are given by the judgment

figure k

. Since first pass evaluation should not reduce stage two terms, this judgment does nothing but traverse e to find prev blocks, under which it performs in-place reductions. A world term is done evaluating when it has the form \({\underline{q}}\), where q is a residual. To be a residual, a term must have no first-stage subterms (equivalently, no prevs), even within the body of a function or branches of a case. This implies that

figure l

must proceed underneath second-stage binders.

World . Since the ground fragment of the language is not dependent on other worlds, the semantics of ground is just that of a monostage language. Thus, traverses into subterms to find the left-most unevaluated code where it performs a reduction. A ground value u comprises only units, injections, tuples, and functions, where the body of the function may be any ground term.

World . Like its ground counterpart, the step judgment, , finds the left-most unevaluated subterm and performs a reduction. It also descends into gr and next blocks, using one of the other two step judgements ( or

figure m

) there. The value form for , called a partial value, comprises units, tuples, functions, gr blocks of ground values, and next blocks containing only a stage two variable. This strong requirement ensures that second-stage computations are not duplicated when partial values are substituted for a variable. This is a departure from the staged semantics of [4, 27]. Whereas those semantics interpret values of type \(\bigcirc A\) to mean “code of type A that can be executed in the future,” ours interprets \(\bigcirc A\) to mean “a reference to a value that will be accessible in the future.” This contrast stems from differing goals: metaprogramming explicitly intends to model code manipulating code, whereas our applications feel more natural with an eager interpretation of next. One consequence of the stronger requirement on partial values is that a new kind of step is necessary to put terms into that form. To illustrate, consider:

figure n

We could reduce this to \(\mathtt{prev(}[\mathtt{next(}e\mathtt{)}/x]e'\mathtt{)}\), but this may potentially duplicate an expensive computation e, depending on how many times x appears in \(e'\). Instead, we choose to hoist e outside, binding it to a temporary variable y, and substituting that variable instead:

figure o

This behavior is implemented by the \({e}\nearrow \llbracket q/y \rrbracket {e'}\) judgment, called a hoisting step. We read this as saying that somewhere within e there was a subterm q which needs to be hoisted out, yielding the new term \(e'\) which has a new variable y where q used to be. These steps occur when a next block has contents that are a residual but (to prevent loops) not when those contents are already a variable. In essence, the rules for hoisting steps operate by “bubbling up” a substitution to the innermost containing prev, where it is reified into a let statement.

Consider the following example, where P has type \(\nabla (\mathrm{unit}+ \mathrm{unit})\),

figure p

Depending on what P evaluates to in the first stage, the whole term will step to either next{0} or next{1}. In this sense, we can see case (at world ) and caseg as the constructs that facilitate all cross-stage communication.

3.3 Type Safety

The statics and dynamics of are related by the type safety theorems below, again annotated by world. In all cases, \(\Gamma \) may be any list of variable bindings at world , representing the second-stage binders under which we are evaluating. Note how the progress theorem for world states that every well-typed term that is not a value must take either a standard step or a hoisting step.

Theorem 1

(Progress)

  • If , then either e has the form \({\underline{v}}\), or , or \({e}\nearrow \llbracket q/y \rrbracket {e'}\).

  • If , then either e has the form \({\underline{u}}\), or .

  • If , then either e has the form \({\underline{q}}\), or .

Theorem 2

(Preservation)

  • If and \({e}\nearrow \llbracket q/y \rrbracket {e'}\),

    then and .

  • If and , then .

3.4 Evaluating Staged Programs

Multistage functions, such as qss from Sect. 2, can be represented as terms with a type fitting the pattern \(A\rightarrow \bigcirc (B \rightarrow C)\) at . To apply such a function f to arguments and , simply evaluate the program:

figure q

Moreover, the reuse of first-stage computations across multiple second-stage computations can even be encoded within . The following program runs many order statistics queries \(\texttt {k1},\dots ,\texttt {km}\) on the same list:

figure r

Observe that this code evaluates qss only once and substitutes its result into the body of the function lookup, which is then called many times in the second stage.
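In split form this reuse is direct; as an illustration, using build and lookup from the Sect. 2 sketch as stand-ins for the outputs qs1 and qs2:

    (* The boundary tree is built once and shared by every query. *)
    let run_queries (l : int list) (ks : int list) : int list =
      let b = build l in                  (* first stage, evaluated once (cf. qs1) *)
      List.map (fun k -> lookup b k) ks   (* second stage, once per query (cf. qs2) *)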

4 Splitting Algorithm

The goal of a stage splitting translation is to send a program \(\mathcal {P}\) in a multistage language to an equivalent form \(\mathcal {P'}\) where the stages are separated at the top level. More specifically, \(\mathcal {P}\) and \(\mathcal {P'}\) should produce the same answer under their respective semantics.

Since has three classes of multistage term, our splitting algorithm has three forms: for -terms, for -terms, and for partial values. In each form the output has two parts, corresponding to the first-stage (p, c, and i) and second-stage (l.r and q) content of the input. Note that there is no need to provide forms of splitting for ground terms or residuals, since those classes of term are already monostage by construction.

The rules of the three splitting judgements are given in Figs. 6 and 7. Since the rules are simply recursive on the structure of the term, the splitting algorithm runs in time linear in the size of the input program. Splitting is defined for all well-typed inputs (Theorem 3), and it produces unique results (Theorem 4). That is, each splitting judgement defines a total function.

Theorem 3

(Splitting Totality)

  • For term e, if , then .

  • For term e, if , then .

  • For partial value v, if , then .

Theorem 4

(Splitting Uniqueness)

  • If and , then \(p = p'\), and \(l.r = l'.r'\).

  • If and , then \(c = c'\), and \(l.r = l'.r'\).

  • If and , then \(i = i'\) and \(q = q'\).

We prove these theorems by straightforward induction on the typing derivation and simultaneous induction on the splitting derivations, respectively.

4.1 Outputs of Splitting

Splitting a top-level program , via , yields \({\mathtt {\{}}{p} {\mathtt {|}}{l}{\mathtt {.}}{r}{\mathtt {\}}}\). Like e, this output is evaluated in two passes. The first pass reduces p to the value b and plugs this result in for l to produce \([b/l]r\); the second pass evaluates \([b/l]r\). The relationship between the two stages in this case is thus like a pipeline, which is why we write them with a ‘|’ in between. Since execution of the first pass serves to generate input for the second pass, we say p is a precomputation that produces a boundary value (b) for the resumer (l.r).

Splitting a partial value v, via , yields \({\texttt {\{}}{i}{\texttt {;}}{q}{\texttt {\}}}\). Since partial values, by definition, have no remaining work in the first pass, and since transfer of information between the stages occurs in the first pass of evaluation, we know that the second-stage components of v can no longer depend on its first-stage components. Analogously, this must also hold for the output of splitting, which is why q—unlike the resumer of world term splitting—is not open on a variable. Thus, i and q are operationally independent, but they represent the complementary portions of v that are relevant to each stage. We call i the immediate value and q the residual.

For any , splitting e via yields the pair of monostage terms \({\texttt {\{}}{c}{\texttt {|}}{l}{\texttt {.}}{r}{\texttt {\}}}\). This output form is essentially a hybrid of the previous two. Because e is a term, c needs to produce a boundary value b to be passed to the resumer (l.r). And since e is typed at world , it has an eventual result at the first stage as well as the second, so it must also produce an immediate value i. The term c meets both of these responsibilities by reducing to the tuple \({\texttt {<}} i,b {\texttt {>}}\), and so we call it a combined term.
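One way to picture these output shapes is as OCaml records; this rendering is our own illustration, not the paper's formal syntax:

    (* Splitting a top-level program: a precomputation p producing a boundary
       value, and a resumer r open on that value. *)
    type ('b, 'r) split = { p : unit -> 'b; r : 'b -> 'r }

    (* Evaluate-and-plug: run the first pass, feed the boundary to the resumer. *)
    let run (s : ('b, 'r) split) : 'r = s.r (s.p ())

    (* A combined term additionally yields an immediate value i in its first pass. *)
    type ('i, 'b, 'r) combined = { c : unit -> 'i * 'b; res : 'b -> 'r }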

4.2 World Term Splitting

The dynamic correctness of the splitting translation requires that the simple evaluate-and-plug semantics on the output produces the same answer as the staged semantics of the previous section. That is, \([b/l]r\) (the applied resumer) should be equivalent to the residual q produced by direct evaluation. This condition is stated more precisely as Theorem 5, where “\({e} \Downarrow {v}\)” indicates standard monostage reduction of the term e to the value v, and “\(\equiv \)” indicates a monostage equivalence, which is defined in Fig. 8.

Theorem 5

(End-to-End Correctness). If , and , then \({p} \Downarrow {b}\) and \([b/l]r \equiv q\).

We prove Theorem 5 by induction on the steps of evaluation. In the base case, where e is already a residual of the form \({\underline{q}}\), we know , so by uniqueness of splitting, \(p = {\texttt {<}} {\texttt {>}}\) and \(r = q\). From here, we can directly derive \({{\underline{{\texttt {<}} {\texttt {>}}}}} \Downarrow {{\texttt {<}} {\texttt {>}}}\) and \(q \equiv q\).

In the recursive case, where the evaluation takes at least one step, we have as well as and . By preservation and totality of splitting, we know and . From here, the inductive hypothesis yields \({p'} \Downarrow {b'}\) and \([b'/l']r' \equiv q\). All that we now require is \({p} \Downarrow {b}\) and \([b/l]r \equiv [b'/l']r'\). To close this gap, we introduce Lemma 1, which essentially states that any single step is correct, and whose proof will concern the rest of this section.

Lemma 1

(Single Step Correctness)

  • If , and \({p'} \Downarrow {b'}\), then \({p} \Downarrow {b}\) and \([b/l]r \equiv [b'/l']r'\).

  • If , and \({c'} \Downarrow {{\texttt {<}} i,b' {\texttt {>}}}\), then \({c} \Downarrow {{\texttt {<}} i,b {\texttt {>}}}\) and \([b/l]r \equiv [b'/l']r'\).

After invocation of that new lemma, we can derive \([b/l]r \equiv [b'/l']r' \equiv q\) directly. In order to prove Lemma 1, we will need to state an analogous version for steps at , since the various kinds of multistage term in are mutually dependent. Thus, this section proceeds by covering the definition of splitting at , starting with the value form at that world.

4.3 Partial Value Splitting

To provide intuition about the behavior of partial value splitting, consider the following partial value:

figure s

To construct the value i representing its first-stage components, splitting first redacts all second-stage (blue) parts, along with the surrounding next annotations. The resulting “holes” in the term are replaced with unit values.

figure t

Finally, partial value splitting drops gr annotations, yielding:

figure u

To construct the residual q (corresponding to second-stage computations) partial value splitting redacts all gr blocks (replacing them with unit) and next annotations:

figure v

A precise definition of the partial value splitting relation is given in Fig. 6. In some regards, the formulation of partial value splitting is arbitrary. For instance, we chose to replace “holes” with unit values, but in fact we could have used any value there and it would make no difference in the end. There are, however, at least some parts of the definition that are not arbitrary. Importantly, partial value splitting must not lose any meaningful information, such as injection tags.

Fig. 6. Partial value splitting rules.

4.4 World Term Splitting

The correctness of term splitting with respect to a step is given in Lemma 1. It is very similar to the world version, in that the reduction of the first-stage part of \(e'\) should imply reduction of the first-stage part of e, and that the resulting applied resumers should be equivalent. But split forms at have one more piece of output than those at , namely the immediate value i. Lemma 1 accounts for this by requiring that the immediate value be exactly identical on both sides of the step.

The correctness of term splitting with respect to hoisting steps is given in Lemma 2. Because hoisting steps are nothing but rearrangement of second-stage code, this lemma can use the strong requirement of identical combined terms.

Lemma 2

(Hoisting Step Correctness). If , and , then \(c = c'\), and \(l.r \equiv l'.{\texttt {let(}}{q}{\texttt {;}}{y}.{r'}{\texttt {)}}\).

Fig. 7. Splitting rules for terms at and .

Fig. 8. Monostage equivalence relation, including reduction, congruence, and let-transposition rules. Since the fragment of is monostage, we simply use to mean any standard reduction from e to \(e'\).

4.5 Example Cases

In this section, we consider a few exemplar cases from the proof of Lemma 1.

Reduction of pi1. Define \(C = {\texttt {let(}}{{\underline{{\texttt {<}} {\texttt {<}} i_1,i_2 {\texttt {>}},{\texttt {<}} {\texttt {>}} {\texttt {>}}}}}{\texttt {;}}{{\texttt {<}} y,z {\texttt {>}}}.{{\texttt {<}} {\texttt {pi1(}}{\underline{y}}{\texttt {)}},{\underline{z}} {\texttt {>}}}{\texttt {)}}\) and \(E = {\texttt {pi1(}}{\underline{{\texttt {<}} v_1,v_2 {\texttt {>}}}}{\texttt {)}}\). We are given , , and , and we need to show \({C} \Downarrow {{\texttt {<}} i_1,{\texttt {<}} {\texttt {>}} {\texttt {>}}}\) and \({\texttt {pi1(}}{\underline{{\texttt {<}} q_1,q_2 {\texttt {>}}}}{\texttt {)}} \equiv q_1\). Both of these can be derived directly. This pattern, where the outputs can be directly derived, extends to the pi2 and prev reduction rules and all value promotion rules.

Reduction of Application. Let

figure wfigure w

We are given , , , and \({[I,i_1/f,x]c} \Downarrow {{\texttt {<}} i,b {\texttt {>}}}\). From this, we can derive \({C} \Downarrow {{\texttt {<}} i,{\texttt {<}} {\texttt {<}} {\texttt {>}},{\texttt {<}} {\texttt {>}},{\texttt {roll}}{\texttt {(}}b{\texttt {)}} {\texttt {>}} {\texttt {>}}}\) and \({\texttt {app}}{\texttt {(}}{{\underline{Q}}}{\texttt {;}}{{\underline{{\texttt {<}} q_1,{\texttt {roll}}{\texttt {(}}b{\texttt {)}} {\texttt {>}}}}}{\texttt {)}} \equiv [Q,q_1,b/f,x,l] r\) directly. This pattern applies to all of the other reduction rules involving substitution, namely those for caseg and letg.

Compatibility of pi1 at . By the case, we are given , , , and \({p'} \Downarrow {b'}\). Inversion of the first three yields , , and . Using Lemma 1 inductively gives \({p} \Downarrow {b}\) and \([b/l]r \equiv [b'/l']r'\). From this, \([b/l]{\texttt {pi1(}}r{\texttt {)}} \equiv [b'/l']{\texttt {pi1(}}r'{\texttt {)}}\) can be derived directly. This pattern generalizes to all world compatibility rules.

Compatibility of pi1 at . By the case, we are given , , , and . Inversion of the first three yields , , and , and inversion of the reduction yields, for some \(i_2\), \({c'} \Downarrow {{\texttt {<}} {\texttt {<}} i,i_2 {\texttt {>}},b' {\texttt {>}}}\). Using Lemma 1 inductively gives \({c} \Downarrow {{\texttt {<}} {\texttt {<}} i,i_2 {\texttt {>}},b {\texttt {>}}}\) and \([b/l]r \equiv [b'/l']r'\). From this, we can derive \({{\texttt {let(}}{c}{\texttt {;}}{{\texttt {<}} y,z {\texttt {>}}}.{{\texttt {<}} {\texttt {pi1(}}{\underline{y}}{\texttt {)}},{\underline{z}} {\texttt {>}}}{\texttt {)}}} \Downarrow {{\texttt {<}} i,b {\texttt {>}}}\) and \([b/l]{\texttt {pi1(}}r{\texttt {)}} \equiv [b'/l']{\texttt {pi1(}}r'{\texttt {)}}\) directly. This pattern generalizes to all world compatibility rules.

Compatibility of gr. We are given , , , and \({{\texttt {<}} e',{\underline{{\texttt {<}} {\texttt {>}}}} {\texttt {>}}} \Downarrow {{\texttt {<}} i,{\texttt {<}} {\texttt {>}} {\texttt {>}}}\). Inversion of the step and reduction yield and \({e'} \Downarrow {i}\). Since ground stepping is an ordinary monostage reduction, we know \({e} \Downarrow {i}\). From here, we can derive \({{\texttt {<}} e,{\underline{{\texttt {<}} {\texttt {>}}}} {\texttt {>}}} \Downarrow {{\texttt {<}} i,{\texttt {<}} {\texttt {>}} {\texttt {>}}}\) and \({\underline{{\texttt {<}} {\texttt {>}}}} \equiv {\underline{{\texttt {<}} {\texttt {>}}}}\) directly.

4.6 Role of World

The splitting algorithm described in the previous subsections operates purely on the local structure of terms. One artifact of this design is that splitting terms may generate resumers containing unnecessary logic. For example, the rule for splitting caseg terms inserts the tag from the caseg argument into the boundary value, then decodes this tag in the resumer. This logic occurs regardless of whether the terms forming the branches of the caseg contain second-stage computations. Worse, if this caseg appeared in the body of a recursive function with no other second-stage computations, splitting would generate a resumer with (useless) recursive calls.

An illustrative example is the part function in the quickselect example of Sect. 6.3. If part were defined at , then (like qs) it would split into two functions part1 and part2, the latter of which recursively computes the (trivial) second-stage component of part’s result. Moreover, qs2 would call part2, just as qs1 calls part1:

figure x

Rather than attempt global optimization of the outputs of splitting, we instead leverage the type system to indicate when a term contains no second-stage computations by adding a third world whose terms are purely first-stage. Defining part in this world is tantamount to proving it has no second-stage computations, allowing splitting to avoid generating the resumer part2 and calling it from qs2.

Since direct staged term evaluation (Sect. 3) reduces all first-stage terms to value forms without any remaining stage-two work, the distinction between and is unnecessary. In contrast, when performing program transformations before the first-stage inputs are known, it is valuable to draw a clear distinction between ground and mixed terms. This was similarly observed in prior work seeking to implement self-applicable partial evaluators [17, 18]. While this paper assumes that ground annotations already exist in the input, it may be possible to use binding-time analysis techniques to insert them automatically. Ground is also similar to the validity mechanism in ML5 [19].

4.7 Typing the Boundary Data Structure

One of the central features of our splitting algorithm is that it encodes the control flow behavior of the original staged program into the boundary data structure. For instance, the case and caseg splitting rules put injection tags on the boundary based on which branch was taken, and the fn rule adds a roll tag to the boundary. As a result, the boundary value passed between the two output programs has a (potentially recursive) structure like a tree or list.

This structure can be described with a type; for instance, the staged quickselect yields a binary search tree. Indeed, for most of our examples in Sect. 6, inferring this type is straightforward. However, we do not yet have a formal characterization of these boundary types that is defined for all programs, though we plan to pursue this in future work.

5 Implementation

We have encoded the type system, dynamic semantics, splitting algorithm, and output semantics in Twelf, as well as all of the theorems of Sects. 3 and 4, and their proofs. We also have a Standard ML implementation of the language, a staged interpreter, the splitting algorithm, and an interpreter for split programs. This implementation extends the language described in Sect. 3 with ints, bools, let statements, n-ary sums and products, datatype and function declarations, and deep pattern matching (including \(\mathtt{next()}\) and \(\mathtt{gr}()\) patterns). The code snippets in this paper are written in our concrete syntax, using these additional features when convenient. Our expanded syntax allows staging annotations around declarations. For example,

figure y

declares the list datatype and part function at world , by elaborating into:

figure z

In the splitting algorithm, we perform a number of optimizations which drastically improve the readability of split programs. For example, we split patterns directly, instead of first translating them into lower-level constructs. We also take advantage of many local simplifications, most notably, not pairing precomputations when one is known to be \({\texttt {<}} {\texttt {>}}\).

6 Examples of Splitting

Now we investigate the behavior of our splitting algorithm on several examples. The split code that follows is the output of our splitting implementation; for clarity, we have performed some minor optimizations and manually added type annotations (including datatype declarations and constructor names), as our algorithm does not type its output.

6.1 Dot Product

Our first example, dot, appears in [16]. dot is a first-order, non-recursive function—precisely the sort of code studied in prior work on pass separation for imperative languages. dot takes the dot product of two three-dimensional vectors, where the first two coordinates are first-stage, and the last coordinate is second-stage.

figure aa

The body of dot is an int term at world containing an int computation of (x1*x2) + (y1*y2) which is promoted from world to world . We would expect the first stage of the split program to take the first two coordinates of each vector and perform that first-stage computation; and the second stage to take the final coordinates and the result of the first stage, then multiply and add. Our algorithm splits dot into the two functions:

figure ab

As expected, dot1 returns (x1*x2)+(y1*y2) as the precomputation, and dot2 adds that precomputation to the product of the final coordinates. This is exactly what is done in [16], though they write the precomputation into a mutable cache.
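In OCaml, a sketch of this split pair might read as follows; the argument groupings and int types are our assumptions:

    (* First stage: combine the coordinates known in the first stage. *)
    let dot1 ((x1, y1) : int * int) ((x2, y2) : int * int) : int =
      (x1 * x2) + (y1 * y2)

    (* Second stage: add the boundary value to the product of the last coordinates. *)
    let dot2 (pre : int) (z1 : int) (z2 : int) : int =
      pre + (z1 * z2)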

6.2 Exponentiation by Squaring

Our next example, exp, is a mainstay of the partial evaluation literature (for example, in [13]). exp recursively computes \(b^e\) using exponentiation by squaring, where e is known at the first stage, and b is known at the second stage.

figure ac

Because exp is a recursive function whose conditionals test the parity of the exponent argument, the sequence of branches taken corresponds exactly to the binary representation of e. Partially evaluating exp with e eliminates all of the conditionals, selecting and expanding the appropriate branch in each case.

Our algorithm, on the other hand, produces:

figure ad

exp1 recursively performs parity tests on e, but unlike exp, it simply computes a data structure (a binnat) recording which branches were taken. exp2 takes b and a binnat l, and uses l to determine how to compute with b.
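A sketch of this split pair in OCaml; the constructor names are illustrative:

    type binnat = End | Even of binnat | Odd of binnat

    (* First stage: record which branch exp takes at each exponent value. *)
    let rec exp1 (e : int) : binnat =
      if e = 0 then End
      else if e mod 2 = 0 then Even (exp1 (e / 2))
      else Odd (exp1 (e - 1))

    (* Second stage: replay the recorded branches against the base b. *)
    let rec exp2 (b : int) (l : binnat) : int =
      match l with
      | End -> 1
      | Even l' -> let r = exp2 b l' in r * r
      | Odd l' -> b * exp2 b l'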

Of course, the binnat computed by exp1 is precisely the binary representation of e! While partial evaluation realizes exp’s control-flow dependency on a fixed e by recursively expanding its branches in place, we explicitly record this dependency generically over all e by creating a boundary data structure. This occurs in the rule for case, which emits a tag corresponding to the taken branch in the precomputation, and cases on it (as \(l_b\)) in the residual.

Because splitting exp does not eliminate its conditionals, partial evaluation is more useful in this case. (Notice, however, that partially evaluating exp2 on a binnat is essentially the same as partially evaluating exp on the corresponding int.) Nevertheless, splitting exp still demonstrates how our algorithm finds interesting data structures latent in the structure of recursive functions.

6.3 Quickselect

Let us return to the quickselect algorithm, which we discussed at length in Sect. 2. (The code is in Fig. 1(b).) qss finds the kth largest element of a list l by recursively partitioning the list by its first element, then recurring on the side containing the kth largest element. l is first-stage and k is second-stage.

Stage-splitting qss produces:

figure ae

This is nearly identical to the cleaned-up code we presented in Fig. 2, except we do not suppress the trivial inputs and outputs of qs1 and qs2.

The function qs1 partitions l, but since the comparison with k (to determine which half of l to recur on) is at the second stage, it simply recurs on both halves, pairing up the results along with h (the head of l) and n (the size of the left half). qs2 takes k and this tree p, and uses k to determine how to traverse p.

How does our splitting algorithm generate binary search trees and a traversal algorithm? The rule for case tuples up the precomputations for its branches, and in the residual, selects the residual corresponding to the appropriate branch. The tree is implicit in the structure of the code; ordinarily, the quickselect algorithm only explores a single branch, but the staging annotations force the entire tree to be built.

This is an instance where stage-splitting is more practical than partial evaluation; if l is large, partially evaluating quickselect requires runtime generation of a huge amount of code simultaneously encoding the tree and traversal algorithm. (Avoiding the code blowup, by not expanding some calls to part, would result in duplicating first-stage computations.)

Note that the recursive part function is defined within a gr annotation. As explained in Sect. 4.6, defining part at would cause it to split in a way that incurs extra cost at the second stage. In this case, that cost would be \(\varTheta (n)\) in the size of the input list, enough to overpower the asymptotic speedup gained elsewhere. With gr annotations, however, this can be prevented.

As discussed in Sect. 2, qs1 performs \(\varTheta (n \log n)\) expected work per call, whereas qs2 performs \(\varTheta (\log n)\) expected work. This results in a net speedup over standard quickselect if we perform many (specifically, \(\omega (\log n)\)) queries on the same list—precisely the topic of our next example.

6.4 Mixed-Stage Map Combinator

As a lambda calculus, one of the strengths of is that it can express combinators as higher-order functions. In this example, we consider just such a combinator: tmap, which turns a function of type \(\nabla \mathtt {list} * \bigcirc \mathtt {int} \rightarrow \bigcirc \mathtt {int}\) into one of type \(\nabla \mathtt {list} * \bigcirc \mathtt {list2} \rightarrow \bigcirc \mathtt {list2}\), by mapping over the second argument.

figure af

Importantly, tmap f performs the first-stage part of f once and the second-stage part of f many times. This was discussed in the context of partial evaluation in Sect. 3.4, but it is especially clear when we look at the output of splitting:

figure ag

Indeed, observe that the argument f (for which qs1 is later substituted) of tmap1 is called only once, whereas the corresponding argument f (for which qs2 is later substituted) in tmap2 is evaluated once per element in the query list q.
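A sketch of this split pair in OCaml, writing 'b for the boundary type produced by f's own first stage (a search tree when f is qss); the signatures are our assumptions:

    (* First stage: run the first-stage part of f exactly once. *)
    let tmap1 (f1 : int list -> 'b) (l : int list) : 'b = f1 l

    (* Second stage: run the second-stage part of f once per query. *)
    let tmap2 (f2 : 'b -> int -> int) (b : 'b) (ks : int list) : int list =
      List.map (f2 b) ks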

6.5 Composing Graphics Pipeline Programs

Composable staged programs are particularly important in the domain of real-time graphics. This need arises because modern graphics architectures require that graphics computations be structured as a pipeline of stages which perform increasingly fine-grained computations (e.g., per-object, per-screen region, per-pixel), where computations in later stages use the results of an earlier stage multiple times [26].

The standard way to program these graphics pipelines is to define one program (usually called a shader) per stage. In other words, the programmer is expected to write their multi-stage programs in an already split form. This requirement results in complex code where invariants must hold across different stages and local changes to the logic of one stage may require changes to that of upstream stages. This harms composition and modularity. Graphics researchers have therefore suggested using mechanisms like our stages [8, 11, 22] to express graphics program logic, including representing the entire pipeline as a single multi-stage function [8].

As a language, is well suited for the specification of such functions, and we give a simple example below. In it, we consider a graphics pipeline program to be a staged function that takes an object definition in the first stage (object) and a pixel coordinate in the second stage (\(\bigcirc \) coord), and emits the color of the object at the specified pixel (\(\bigcirc \) color). Given two such multistage functions, we then define a combinator that multiplies their results pointwise in the second stage.

figure ah
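A sketch of such a combinator in OCaml, modelling a pipeline program as a function from an object to a per-pixel function coord -> color; all type definitions here are our assumptions:

    type coord = int * int
    type color = { r : float; g : float; b : float }

    let mul (c1 : color) (c2 : color) : color =
      { r = c1.r *. c2.r; g = c1.g *. c2.g; b = c1.b *. c2.b }

    (* Per-object work of each pipeline runs once; per-pixel work runs per coordinate. *)
    let compose (p1 : 'obj -> coord -> color) (p2 : 'obj -> coord -> color)
        (o : 'obj) : coord -> color =
      let s1 = p1 o in
      let s2 = p2 o in
      fun c -> mul (s1 c) (s2 c)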

Prior work [8] lacks the ability to define such combinators, because it does not support higher order functions.

7 Related Work

Frequency reduction and precomputation are common techniques for both designing algorithms and performing compiler optimizations [15]. The idea behind precomputation is to identify computations that can be performed earlier (e.g., at compile time) if their inputs are available statically and perform them at that earlier time. Dynamic algorithms, partial evaluation, and incremental computation are all examples of precomputation techniques. The idea behind frequency reduction is to identify computations that are performed multiple times and pull them ahead so that they can be performed once and used later as needed. Dynamic programming, loop hoisting, and splitting (presented here) are all examples of frequency reduction techniques.

Precomputation techniques. Perhaps one of the most studied examples of precomputation is partial evaluation, which distinguishes between static (compile-time) and dynamic (runtime) stages. Given a program and values for all static inputs, partial evaluation generates a specialized program by performing computations that depend only on the static inputs [13]. We refer the reader to the book by Jones, Gomard, and Sestoft for a comprehensive discussion of partial evaluation work until the early 90’s [14].

Early approaches to partial evaluation can be viewed as operating in two stages: binding time analysis and program specialization. For a multivariate program with clearly marked static and dynamic arguments, binding-time analysis identifies all the parts of the program that can be computed with knowledge of the static arguments alone. Using this information and the values of the static arguments, program specialization specializes the original program to a partially-evaluated one that operates on many different dynamic arguments. This approach has been applied to construct partial evaluators for a number of languages such as Scheme [2, 3].

Researchers have explored other staging transformations that, like splitting, partition an input two-stage program into two components, one corresponding to each stage. In particular, binding time separation [18] (also called program bifurcation [6]) has been used as a preprocessor step in partial evaluators, allowing efficient specialization of programs with mixed-stage data structures without changes to the specializer itself. Notably, the grammar-based binding-time specifications used in binding time separation are capable of describing data structures with purely-static, purely-dynamic, and mixed-stage content, much like the type system of (though this correspondence is less clear without our addition of and \(\nabla \)).

However, unlike splitting, where the goal is to emit code where first-stage results are computed once and then reused in multiple invocations of second-stage execution, the second (dynamic) function produced by binding time separation does not use the results of the first; instead, it has access to the first-stage inputs and recomputes all required first-stage computations. As noted by Knoblock and Ruf [16], it may be possible to modify the program bifurcation algorithm to cache and reuse the intermediate results, but this was never attempted. Alternatively, our algorithm could potentially be used as the basis of a general bifurcation algorithm in a partial evaluator.

Experience with binding time analysis showed that it can be difficult to control, leading to programs whose performance was difficult to predict. This led to investigations based on type systems for making the stage of computations explicit in programs [10, 21] and writing “metaprograms” that, when evaluated at a stage, yield a new program to be evaluated at the next stage. Davies [4] presented a logical construction of binding-time type systems by deriving a type system via the Curry-Howard correspondence applied to temporal logic. Davies and Pfenning proposed a new type system for staged computation based on a particular fragment of modal logic [5]. The work on MetaML extended type-based techniques to a full-scale language by developing a statically typed programming language based on ML that enables the programmer to express programs with multiple stages [27, 28]. MetaML’s type system is similar to that of Davies [4] but extends it in several important ways. Nanevski and Pfenning further extended these techniques by allowing free variables to occur within staged computations [20].

The type system of is closely related to this later line of work on metaprogramming and staged computation. The specific extension to the typed lambda calculus that we use here is based on the \(\bigcirc \) modality of Davies [4]. Our types differ in the restriction to two stages and the addition of \(\nabla \); however, the key difference between our work and this prior work is that we instead focus on the problem of splitting.

Another class of precomputation techniques is incremental computation, where a program can efficiently respond to small changes in its inputs by only recomputing parts of the computation affected by the changes [1, 7, 23, 24]. However, unlike splitting, incremental computation does not require fixing any of its inputs and, in the general case, allows for all program inputs to change. Thus, the benefits of incremental computation depend on what changes to inputs are made. For example, while it is possible to apply incremental computation to the quickselect example in Sect. 2, techniques would unfold the quickselect function based on the demanded ranks, potentially incurring linear time cost at each step of the algorithm (as opposed to the logarithmic result produced by splitting). Moreover, incremental computation techniques must also maintain sophisticated data structures dynamically at run time to track what computations must be performed.

Stage Splitting. Algorithms for stage splitting have appeared in the literature under the names pass separation [15] and data specialization [16]. Perhaps the closest work to ours is the algorithm for data specialization given by Knoblock and Ruf [16], which also seeks to statically split an explicitly staged program into two stages. However, they only consider a simple first-order language and straight-line programs. Their work also treats all computations guarded by second-stage conditionals as second-stage computations, which would prevent optimization (via splitting) of programs such as quickselect.

As noted in Sect. 6.5, splitting algorithms have also been a topic of interest in programming systems for computer graphics, where, to achieve high performance, programs are manually separated into components by frequency of execution corresponding to graphics hardware pipeline stages. The software engineering challenges of modifying multiple per-stage programs have led to suggestions of writing graphics programs in an explicitly-staged programming language [8, 11, 22] and deferring the task of pass separation to the compiler. However, all prior splitting efforts in computer graphics, like that of Knoblock and Ruf [16], have been limited to simple, imperative languages.

Defunctionalization. Defunctionalization [25] is a program transformation that eliminates higher-order functions, replacing them with first-order functions that manipulate a data structure encoding the original control flow. This operation is similar to our splitting transformation, which eliminates staging in a program by encoding control flow in a data structure passed between the stage one and stage two outputs. It would be interesting to explore the possibility of a semantics-preserving transformation to convert a staged program into a higher-order form, and then applying defunctionalization to obtain our results.
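For a flavor of the correspondence, a textbook defunctionalization sketch in OCaml (illustrative only): each lambda becomes a constructor, and an apply function encodes the control flow the closures used to carry:

    (* The function space int -> int, restricted to the lambdas we actually use. *)
    type arrow = Inc | Add of int

    let apply (f : arrow) (x : int) : int =
      match f with
      | Inc -> x + 1
      | Add n -> x + n

    (* A first-order map over defunctionalized arrows. *)
    let map_arrow (f : arrow) (xs : int list) : int list =
      List.map (apply f) xs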

8 Conclusion

This paper presents a splitting algorithm for , a two-staged typed lambda calculus with support for recursion and first-class functions. The type system of uses three worlds (figure ai) to classify code by its stage, with the modalities \(\bigcirc \) and \(\nabla \) providing internalizations of second-stage code and ground first-stage code into the mixed world. We present a dynamic semantics that evaluates terms in two passes and provides an eager interpretation of next via hoisting steps. We prove the correctness of our splitting algorithm against the dynamic semantics, and implement this proof in Twelf. Finally, we discuss our implementation of and its splitting algorithm and analyze their behavior on a number of staged programs, including those with higher-order and recursive functions.

Looking forward, we are interested in investigating the behavior of splitting in the presence of richer language features such as mutation, polymorphism, parallel constructs, and more than two stages. Also, while we have proven the dynamic correctness of our splitting algorithm, we do not yet have a characterization of the types of its outputs or of the cost behavior of output terms.