Keywords

These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

1 Introduction

This paper explores the problem of automatically repairing programs written as a set of mutually recursive functions in a purely functional subset of Scala. We consider a function to be subject to repair if it does not satisfy its specification, expressed in the form of pre- and postcondition. The task of repair consists of automatically generating an alternative implementation that meets the specification. The repair problem has been studied in the past for reactive and pushdown systems [8, 10, 11, 19, 20, 26]. We view repair as generalizing, for example, the choose construct of complete functional synthesis [15], sketching [21, 22], and program templates [23], because the exact location and nature of expressions to be synthesized is left to the algorithm. Repair is thus related to localization of error causes [12, 14, 27]. To speed up our repair approach, we do use coarse-grained error localization based on derived test inputs. However, a more precise nature of the fault is in fact the outcome of our tool, because the repair identifies a particular change that makes the program correct. Using tests alone as a criterion for correctness is appealing for performance reasons [7, 17, 18], but this can lead to erroneous repairs. We therefore leverage prior work [13] on verifying and synthesizing recursive functional programs with unbounded data-types (trees, lists, integers) to provide strong correctness guarantees, while at the same time optimizing our technique to use automatically derived tests. By phrasing the problem of repair as one of synthesis and introducing tailored deduction rules that use the original implementation as guide, we allow the repair-oriented synthesis procedure to automatically find correct fixes, in the worst case resorting to re-synthesizing the desired function from scratch. To make the repair approach practical, we found it beneficial to extend the power and generality of the synthesis engine itself, as well as to introduce explicit support for symbolic tests in the specification language and the repair algorithm.

Contributions. The overall contribution of this paper is a new repair algorithm and its implementation inside a deductive synthesis framework for recursive functional programs. The specific new techniques we contribute are the following.

  • Exploration of similar expressions. We present an algorithm for expression repair based on a grammar for generating expressions similar to a given expression (according to an error model we propose). We use such grammars within our new generic symbolic term exploration routine, which leverages test inputs as well as an SMT solver, and efficiently explores the space of expressions that contain recursive calls whose evaluation depends on the expression being synthesized.

  • Fault localization. To narrow down repair to a program fragment, we localize the error by doing dynamic analysis using test inputs generated automatically from specifications. We combine two automatic sources of inputs: enumeration techniques and SMT-based techniques. We collect traces leading to erroneous executions and compute common prefixes of branching decisions. We show that this localization is in practice sufficiently precise to repair sizeable functions efficiently.

  • Symbolic examples. We propose an intuitive way of specifying possibly symbolic input-output examples using pattern matching of Scala. This allows the user to partially specify a function without necessarily having to provide full inputs and outputs. Additionally, it enables the developer to easily describe properties of generic (polymorphic) functions. We present an algorithm for deriving new examples from existing ones, which improves the usefulness of example sets for fault localization and repair.

    In our experience, the combination of formal specification and symbolic examples gives the user significant flexibility when specifying functions, and increases success rates when discovering and repairing program faults.

  • Integration into a deductive synthesis and verification framework. Our repair system is part of a deductive verification system, so it can automatically produce new inputs from specification, prove correctness of code for all inputs ranging over an unbounded domain, and synthesize program fragments using deductive synthesis rules that include common recursion schemas.

The repair approach offers significant improvements compared with synthesis from scratch. Synthesis alone scales poorly when the expression to synthesize is large. Fault localization focuses synthesis on the smaller, invalid portions of the program and thus results in big performance gains. The source code of our tool and additional details are available from http://leon.epfl.ch as well as http://lara.epfl.ch/w/leon-repair.

Fig. 1.
figure 1

The syntax tree translation in function

figure a
has a strong
figure b
clause, requiring semantic equivalence of transformed and the original tree, as defined by several recursive evaluation functions.
figure c
contains an error. Our system finds it, repairs the function, and proves the resulting program correct.

Example. Consider the following functionality inspired by a part of a compiler. We wish to transform (desugar) an abstract syntax-tree of a typed expression language into a simpler untyped language, simplifying some of the constructs and changing the representation of some of the types, while preserving the semantics of the transformed expression. In Fig. 1, the original syntax trees are represented by the class

figure d

and its subclasses, whereas the resulting untyped language trees are given by

figure e

. A syntax tree of

figure f

either evaluates to an integer, to a boolean, or to no value if it is not well typed. We capture this by defining a type-checking function

figure g

, along with two separate semantic functions,

figure h

and

figure i

.

figure j

, on the other hand, always evaluates to an integer, as defined by the

figure k

function. For brevity, most subclass definitions are omitted.

The

figure l

function translates a syntax tree of

figure m

into one of

figure n

. We expect the function to ensure that the transformation preserves the semantics of the tree: originally integer-valued trees evaluate to the same value, boolean-valued trees now evaluate to 0 and 1, representing

figure o

and

figure p

, respectively, and mistyped trees are left unconstrained. This is expressed in the postcondition of

figure q

.

The implementation in Fig. 1 contains a bug: the

figure r

and

figure s

branches of the

figure t

case have been accidentally switched. Using tests automatically generated using generic enumeration of small values, as well as from a verification attempt of

figure u

, our tool is able to find a coarse-grained location of the bug, as the body of the relevant case of the match statement. During repair, one of the rules performs a semantic exploration of expressions similar to the invalid one. It discovers that using the expression

figure v

instead of the invalid one makes the discovered tests pass. The system can then formally verify that the repaired program meets the specification for all inputs. If we try to introduce similar bugs in the correct

figure w

function, or to replace the entire body of a case with a dummy value, the system successfully recovers the intended case of the transformation. In some cases our system can repair multiple simultaneous errors; the mechanism behind that is explained in Sect. 2.2. Note that the developer communicates with our system only by writing code and specifications, both of which are functions in an existing functional programming language. This illustrates the potential of repair as a scalable and developer-friendly deployment of synthesis in software development.

2 Deductive Guided Repair

We next describe our deductive repair framework. The framework currently works under several assumptions, which we consider reasonable given the state of the art in repair of infinite-state programs. We consider the specifications of functions as correct; the code is assumed wrong if it cannot be proven correct with respect to this specification for all of the infinitely many inputs. If the specification includes input-output tests, it follows that the repaired function must have the same behavior on these tests. We do not guarantee that the output of the function is the same as the original one on tests not covered by the specification, though the repair algorithm tends to preserve some of the existing behaviors due to the local nature of repair. It is the responsibility of the developer to sufficiently specify the function being repaired. Although under-specified benchmarks may produce unexpected expressions as repair solutions, we found that even partial specifications often yield the desired repairs. A particularly effective specification style in our experience is to give a partial specification that depends on all components of the structure (for example, describes property of the set of stored elements), and then additionally provide a finite number of symbolic input-output tests. We assume that only one function of the program is invalid; the implementation of all other functions is considered valid as far as the repair of interest is concerned. Finally, we assume that all functions of the program, even the invalid one, terminate.

Stages of the Repair Algorithm. The function being repaired passes through the following stages, which we describe in the rest of the paper:

  • Test generation and verification. We combine enumeration- and SMT-based techniques to either verify the validity of the function, or, if it is not valid, discover counterexamples (examples of misbehaviors).

  • Fault localization. Our localization algorithm then selects the smallest expression executed in all failing tests, modulo recursion.

  • Synthesis of similar expressions. This erroneous expression is replaced by a “program hole”. The now-incomplete function is sent to synthesis, with the previous expression used as a synthesis hint. (Neither the notion of holes nor the notion of synthesis hints has been introduced in prior work on deductive synthesis [13]).

  • Verification of the solution. Lastly, the system attempts to prove the validity of the discovered solution. Our results in Sect. 5, Fig. 4 indicate in which cases the synthesized function passed the verification.

Repair Framework. Our starting point is the deductive synthesis framework first introduced in [13]. We show how this framework can be applied to program repair by introducing dedicated rules, as well as special predicates. We reuse the notation for synthesis tasks \( \left[ \!\left[ \bar{a} \ \left\langle \varPi \rhd \phi \right\rangle \ \bar{x}\right] \!\right] \): \(\bar{a}\) denotes the set of input variables, \(\bar{x}\) denotes the set of output variables, \(\phi \) is the synthesis predicate, and \(\varPi \) is the path condition to the synthesis problem. The framework relies on deduction rules that take such an input synthesis problem and either (1) solve it immediately by returning the tuple \(\langle {P} \mid {T} \rangle \) where \(P\) corresponds to the precondition under which the term T is a solution, or (2) decompose it into sub-problems, and define a way to compute the overall solution from sub-solutions. We illustrate these rules as well as their notation with a rule for splitting a problem containing a top-level or:

$$ \frac{{\left[ \!\left[ \bar{a} \ \left\langle \varPi \rhd \phi _1 \right\rangle \ \bar{x}\right] \!\right] \vdash \langle {P_1} \mid {T_1} \rangle \qquad \left[ \!\left[ \bar{a} \ \left\langle \varPi \rhd \phi _2 \right\rangle \ \bar{x}\right] \!\right] \vdash \langle {P_2} \mid {T_2} \rangle }}{{\left[ \!\left[ \bar{a} \ \left\langle \varPi \rhd \phi _1 \vee \phi _2 \right\rangle \ \bar{x}\right] \!\right] \vdash \langle {P_1 \vee P_2} \mid {{\mathsf{if( }}P_1{\mathsf{) \{ }}T_1{\mathsf{\} \mathsf{ else }\{}}T_2{\mathsf{\} }}} \rangle }} $$

This rule should be interpreted as follows: from an input synthesis problem \(\left[ \!\left[ \bar{a} \ \left\langle \varPi \rhd \phi _1 \vee \phi _2 \right\rangle \ \bar{x}\right] \!\right] \), the rule decomposes it in two subproblems: \(\left[ \!\left[ \bar{a} \ \left\langle \varPi \rhd \phi _1 \right\rangle \ \bar{x}\right] \!\right] \) and \(\left[ \!\left[ \bar{a} \ \left\langle \varPi \rhd \phi _2 \right\rangle \ \bar{x}\right] \!\right] \). Given corresponding solutions \(\langle {P_1} \mid {T_1} \rangle \) and \(\langle {P_2} \mid {T_2} \rangle \), the rule solves the input problem with \(\langle {P_1 \vee P_2} \mid {{\mathsf{if( }}P_1{\mathsf{) \{ }}T_1{\mathsf{\} \mathsf{ else }\{}}T_2{\mathsf{\} }}} \rangle \).

To track the original (incorrect) implementation along instantiations of our deductive synthesis rules, we introduce a guiding predicate into the path condition of the synthesis problem. We refer to this guiding predicate as \(\odot \left[ {\mathsf{expr }}\right] \), where

figure x

represents the original expression. This predicate does not have any logical meaning in the path-condition (it is equivalent to

figure y

), but it provides syntactic information that can be used by repair-dedicated rules. These rules are covered in detail in Sects. 2.1, 2.2 and 3.

2.1 Fault Localization

A contribution of our system is the ability to focus the repair problem to a small sub-part of the function’s body that is responsible for its erroneous behavior. The underlying hypothesis is that most of the original implementation is correct. This technique allows us to reuse as much of the original implementation as possible and minimizes the size of the expression given to subsequent more expensive techniques. Focusing also has the profitable side-effect of making repair more predictable, even in the presence of weak specifications: repaired implementation tends to produce programs that preserve some of the existing branches, and thus have the same behavior on the executions that use only these preserved branches. We rely on the list of examples that fail the function specification to lead us to the source of the problem: if all failing examples only use one branch of some branching expression in the program, then we assume that the error is contained in that branch. We define \(\mathcal {F}\) as the set of all inputs of collected failing tests (see Sect. 4). We describe focusing using the following rules.

If-Focus. Given the input problem \(\left[ \!\left[ \bar{a} \ \left\langle \odot \left[ {{\mathsf{if( }}c{\mathsf{) \{ }}t{\mathsf{\} \mathsf{ else }\{}}e{\mathsf{\} }}}\right] \rhd \phi \right\rangle \ \bar{x}\right] \!\right] \) we first check if there is an alternative condition expression such that all failing tests succeed:

Instead of solving this higher-order hypothesis, we execute the function and non-deterministically consider both branches of the

figure z

(and do so within recursive invocations as well). If a valid execution exists for each failing test, the formula is considered satisfiable enabling us to focus on the condition. Otherwise, we check whether

figure aa

evaluates to either

figure ab

or

figure ac

for all failing inputs, allowing us to focus on the corresponding branch:

$$\begin{aligned}&\textsc {If-Focus-Then:}\\&\frac{\left[ \!\left[ \bar{a} \ \left\langle \odot \left[ {t}\right] \wedge c \wedge \varPi \rhd \phi \right\rangle \ \bar{x}\right] \!\right] \vdash \langle {P} \mid {T} \rangle \qquad \forall \bar{i} \in \mathcal {F}. c[\bar{a}\mapsto \bar{i}]}{\left[ \!\left[ \bar{a} \ \left\langle \odot \left[ {{\mathsf{if( }}c{\mathsf{) \{ }}t{\mathsf{\} \mathsf{ else }\{}}e{\mathsf{\} }}}\right] \wedge \varPi \rhd \phi \right\rangle \ \bar{x}\right] \!\right] \vdash \langle {P} \mid {{\mathsf{if( }}c{\mathsf{) \{ }}T{\mathsf{\} \mathsf{ else }\{}}e{\mathsf{\} }}} \rangle } \end{aligned}$$
$$\begin{aligned}&\textsc {If-Focus-Else:}\\&\frac{\left[ \!\left[ \bar{a} \ \left\langle \odot \left[ {e}\right] \wedge \lnot c \wedge \varPi \rhd \phi \right\rangle \ \bar{x}\right] \!\right] \vdash \langle {P} \mid {T} \rangle \qquad \forall \bar{i} \in \mathcal {F}. \lnot c[\bar{a}\mapsto \bar{i}]}{\left[ \!\left[ \bar{a} \ \left\langle \odot \left[ {{\mathsf{if( }}c{\mathsf{) \{ }}t{\mathsf{\} \mathsf{ else }\{}}e{\mathsf{\} }}}\right] \wedge \varPi \rhd \phi \right\rangle \ \bar{x}\right] \!\right] \vdash \langle {P} \mid {{\mathsf{if( }}c{\mathsf{) \{ }}t{\mathsf{\} \mathsf{ else }\{}}T{\mathsf{\} }}} \rangle } \end{aligned}$$

We use analogous rules to repair

figure ad

expressions, which are ubiquitous in our programs. Here, if all failing tests lead to one particular branch of the

figure ae

, we focus on that particular branch.

The above rules use tests to locally approximate the validity of branches. They are sound only if \(\mathcal {F}\) is sufficiently large. Our system therefore performs an end-to-end verification for the complete solution, ensuring the overall soundness.

2.2 Guided Decompositions

In case focusing rules fail to identify a single branch of an

figure af

- or

figure ag

-expression as responsible, we might still benefit from reusing most of the expression. In the case of

figure ah

, reuse is limited to the

figure ai

-condition, but for a

figure aj

-expression, this may extend to multiple valid cases. To this end, we introduce rules analogous to focus, that do decompositions based on the guide.

$$\begin{aligned}&\textsc {If-Split:}\\&\frac{\left[ \!\left[ \bar{a} \ \left\langle \odot \left[ {t}\right] \wedge c \wedge \varPi \rhd \phi \right\rangle \ \bar{x}\right] \!\right] \vdash \langle {P_1} \mid {T_1} \rangle \qquad \left[ \!\left[ \bar{a} \ \left\langle \odot \left[ {e}\right] \wedge \lnot c \wedge \varPi \rhd \phi \right\rangle \ \bar{x}\right] \!\right] \vdash \langle {P_2} \mid {T_2} \rangle }{ \left[ \!\left[ \bar{a} \ \left\langle \odot \left[ {{\mathsf{if( }}c{\mathsf{) \{ }}t{\mathsf{\} \mathsf{ else }\{}}e{\mathsf{\} }}}\right] \wedge \varPi \rhd \phi \right\rangle \ \bar{x}\right] \!\right] \vdash \langle {(c {\wedge } P_1) \vee (\lnot c {\wedge } P_2)} \mid {{\mathsf{if( }}c{\mathsf{) \{ }}T_1{\mathsf{\} \mathsf{ else }\{}}T_2{\mathsf{\} }}} \rangle } \end{aligned}$$

To reuse the valid branches of an

figure ak

or a

figure al

-expression on which focus failed, we introduce a rule that solves the problem if the guiding expression satisfies the specification.

$$\begin{aligned}&\textsc {Guided-Verify:} = \frac{\varPi \models \phi [\bar{x}\mapsto \mathsf{term }]}{\left[ \!\left[ \bar{a} \ \left\langle \odot \left[ {term}\right] \wedge \varPi \rhd \phi \right\rangle \ \bar{x}\right] \!\right] \vdash \langle {\mathsf {true}} \mid \mathsf{term } \rangle } \end{aligned}$$

2.3 Generating Recursive Calls

Our purely functional language often requires us to synthesize recursive implementations. Consequently, the synthesizer must be able to generate calls to the function currently getting synthesized. However, we must take special care to avoid introducing calls resulting in a non-terminating implementation. (Such an erroneous implementation would be conceived as valid if it trivially satisfies the specification due to inductive hypothesis over a non-well-founded relation.)

Our technique consists of recording the arguments

figure am

at the entry point of the function,

figure an

, and keeping track of these arguments through the decompositions. We represent this information with a syntactic predicate \(\Downarrow \left[ \mathsf{f(a) }\right] \), similar to the guiding predicate from the previous sections. We then heuristically assume that reducing the arguments

figure ao

will not introduce non-terminating calls.

We illustrate this mechanism by considering the

figure ap

function shown in Fig. 1. We start by injecting the entry call information as

$$\begin{aligned} \left[ \!\left[ e \ \left\langle \Downarrow \left[ \mathsf{desugar(e) }\right] \wedge ... \rhd \phi \right\rangle \ x\right] \!\right] \end{aligned}$$

This synthesis problem will then be decomposed by the various deduction rules available in the framework. An interesting case to consider is a decomposition by pattern-matching on

figure aq

which specializes the problem to known variants of

figure ar

. The specialized problem for the

figure as

variant will look as follows:

$$\begin{aligned} \left[ \!\left[ e1 \mathop {,}e2 \ \left\langle \Downarrow \left[ \mathsf{desugar(Plus(e1, e2)) }\right] \wedge ... \rhd \phi \right\rangle \ x\right] \!\right] \end{aligned}$$

As a result, we assume that the calls

figure at

and

figure au

are likely to terminate, so they are considered as candidate expressions when symbolically exploring terms, as explained in Sect. 3.

This relatively simple technique allows us to introduce recursive calls while filtering trivially non-terminating calls. In the case where it still introduces infinite recursion, we can discard the solution using a more expensive termination checker, though we found that this is seldom needed in practice.

2.4 Synthesis Within Repair

The repair-specific rules described earlier aim at solving repair problems according to the error model. Thanks to integration into the Leon synthesis framework, general synthesis rules also apply, which enables the repair of more intricate errors. This achieves an appealing combination between fast repairs for predictable errors and expressive, albeit slower, repairs for more complicated errors.

3 Counterexample-Guided Similar-Term Exploration

After following the overall structure of the original problem, it is often the case that the remaining erroneous branches can be fixed by applying small changes to their implementations. For instance, an expression calling a function might be wrong only in one of its arguments or have two of its arguments swapped. We exploit this assumption by considering different variations to the original expression. Due to the lack of a large code base in pure Scala subset that Leon handles, we cannot use statistically informed techniques such as [9], so we define an error model following our intuition and experience from previous work.

We use the notation \(G(\mathsf{expr })\) to denote the space of variations of

figure av

and define it in the form of a grammar as

$$\begin{aligned} G(\mathsf{expr }) {:}{:}{=}G_{swap}(\mathsf{expr }) ~|~ G_{arg}(\mathsf{expr }) ~|~ G_{*^2}(\mathsf{expr }) \end{aligned}$$

with the following forms of variations.

Swapping Arguments. We consider here all the variants of swapping two arguments that are compatible type-wise. For instance, for an operation with three operands of the same type:

$$\begin{aligned} G_{swap}(\mathsf{op(a,b,c) }) {:}{:}{=}\mathsf{op(b,a,c) } ~|~ \mathsf{op(a,c,b) } ~|~ \mathsf{op(c,b,a) } \end{aligned}$$

Generalizing One Argument. This variation corresponds to making a mistake in only one argument of the operation we generalize:

$$\begin{aligned} G_{arg}(\mathsf{op(a,b,c) }) {:}{:}{=}\mathsf{op( }G(\mathsf{a })\mathsf{,b,c) } ~|~ \mathsf{op(a, }G(\mathsf{b })\mathsf{,c) } ~|~ \mathsf{op(a,b, }G(\mathsf{c })\mathsf{) } \end{aligned}$$

Bounded Arbitrary Expression. We consider a grammar of interesting expressions of the given type and of limited depth. This grammar considers all operations in scope as well as all input variables. It also considers safe recursive calls discovered in Sect. 2.3. Finally, it includes the guiding expression as a terminal, which corresponds to possibly wrapping the source expression in an operation. For example, given a predicate \(\Downarrow \left[ \mathsf{listSum(Cons(h,t)) }\right] \) and a mod function \(\mathsf{Int } \times \mathsf{Int } \rightarrow \mathsf{Int }\) in scope, an integer operation op(a,b,c) is generalized as:

figure aw

Our grammars cover a range of variations corresponding to common errors. During synthesis, the system generates a specific grammar for each invocation of this repair rule, and explores symbolically the space of all expressions in the grammar. We rely on a CEGIS-loop bootstrapped with our test inputs to explore these expressions. This can be abstractly represented by the following rule:

$$\begin{aligned} \textsc {CEGIS-Gen:} = \frac{\exists \mathsf{T } \in \mathcal {L}(G(\mathsf{term }))~ \forall \bar{a}. \varPi \implies \phi [\bar{x}\mapsto \mathsf T ]}{\left[ \!\left[ \bar{a} \ \left\langle \odot \left[ {term}\right] \wedge \varPi \rhd \phi \right\rangle \ \bar{x}\right] \!\right] \vdash \langle {\mathsf {true}} \mid \mathsf{T } \rangle } \end{aligned}$$

Even though this rule is inherently incomplete, it is able to fix common errors efficiently. Our deductive approach allows us to introduce such tailored rules without loss of generality: errors that go beyond this model may be repaired using more general, albeit slower synthesis rules.

Precise Handling of Recursive Calls in CEGIS. Our system uses a symbolic approach to avoid enumerating expressions explicitly [13]. When considering recursive calls among possible expressions within CEGIS, the interpretation of such calls needs to refer back to this same expression. Our previous approach [13] treats recursive invocations of the function under synthesis as satisfying only the postcondition, leading to spurious counter-examples. Our new solution first constructs a parametrized program explicitly representing the search space: given a grammar G at a certain unfolding level, we construct a function

figure ax

in which we describe non-terminals as values with each production guarded by a distinct entry of the

figure ay

array, as in the following repair a case of the

figure az

function.

figure ba

In this new program, the function under repair is defined using the partial solution corresponding to the current deduction tree, in which we call

figure bb

at the point of the CEGIS invocation. Other unsolved branches of the deduction tree become synthesis holes. We augment transitive callers with this additional

figure bc

argument, passing it accordingly. This ensures that a specific valuation of

figure bd

corresponds exactly to a program where the point of CEGIS invocation is replaced by the corresponding expression. We rely on tests collected in Sect. 4 to test individual valuations of

figure be

, removing failing expression from the search space. Finally, we perform CEGIS using symbolic term exploration with the SMT solver to find candidate expressions [13].

4 Generating and Using Tests for Repair

Tests play an essential role in our framework, allowing us to gather information about the valid and invalid parts of the function. In this section we elaborate on how we select, generate, and filter examples of inputs and possibly outputs. Several components of our system then make use of these examples. We distinguish two kinds of tests: input tests and input-output tests. Namely, input tests provide valid inputs for the function according to its precondition, while input-output tests also specify the exact output corresponding to each input.

Extraction and Generation of Tests. Our system relies on three main sources for tests that are used to make the repair process more efficient.

(1) User-provided symbolic input-output tests. It is often interesting for the user to specify how a function behaves by listing a few examples providing inputs and corresponding outputs. However, having to provide full inputs and outputs can be tedious and impractical. To make specifying families of tests convenient, we define a

figure bf

construct to express input-output examples, relying on pattern matching in our language to symbolically describe sets of inputs and their corresponding outputs. This gives us an expressive way of specifying classes of input-output examples. Not only may the pattern match more than one input, but the corresponding outputs are given by an expression which may depend on the pattern’s variables. Wildcard patterns are particularly useful when the function does not depend on all aspects of its inputs. For instance, a function computing the size of a generic list does not inspect the values of individual list elements. Similarly, the sum of a list of integers could be specified concisely for all lists of sizes up to 2. Both examples are illustrated by Fig. 2.

Fig. 2.
figure 2

Partial specifications using the passes construct, allowing to match more than one inputs and providing the expected output as an expression.

Having partially symbolic input-output examples strikes a good balance between literal examples and full-functional specifications. From the symbolic tests, we generate concrete input-output examples by instantiating each pattern several times using enumeration techniques, and executing the output expression to yield an output value. For instance, from case Cons(a, Cons(b, Nil())) \(\Rightarrow \) a + b we will generate the following tests resulting from replacing ab with all combinations of values from a finite set, including, for example, test with input Cons(1, Cons(2, Nil())) and output 3. We generate up to 5 distinct tests per pattern, when possible. These symbolic specifications are the only forms of tests provided by the developer; any other tests that our system uses are derived automatically.

(2) Generated input tests. We rely on the same enumeration technique to generate inputs satisfying the precondition of the function. Using a generate and test approach, we gather up to 400 valid input tests in the first 1000 enumerated.

(3) Solver-generated Tests. Lastly, we rely on the underlying solvers for recursive functions of Leon [25] to generate counter-examples. Given that the function is invalid and that it terminates, the solver (which is complete for counter-examples) is guaranteed to eventually provide us with at least one failing test.

Classifying and Minimizing Traces. We partition the set of collected tests into passing and failing sets. A test is considered as failing if it violates a precondition, a postcondition, or emits one of various other kinds of runtime errors when the function to repair is executed on it. In the presence of recursive functions, a given test may fail within one of its recursive invocations. It is interesting in such scenarios to consider the arguments of this specific sub-invocation: they are typically smaller than the original and are better representatives of the failure. To clarify this, consider the example in Fig. 3 (based on the program in Fig. 1):

Fig. 3.
figure 3

Code and invocation graph for

figure bg
. Solid borderlines stand for passing tests, dashed ones for failing ones. Type constructors for literals have been omitted.

Assume the tests collected are

figure bh

,

figure bi

and

figure bj

. When executed with these tests, the function produces the graph of

figure bk

invocations shown on the right of Fig. 3. A trivial classification tactic would label all three tests as faulty, even though it is obvious that all errors can be explained by the bug in

figure bl

, due to the dependencies between tests. More generally, a failing test should also be blamed for the failure of all other tests that invoke it transitively. Our framework deploys this smarter classification. Thus, in our example, it would only label

figure bm

as a failing example, which would lead to correct localization of the problem on the faulty branch. Note that this process will discover new failing tests not present in the original test set, if they occur as recursive sub-invocations.

Our experience with incorporating tests into the Leon system indicate that they are proving time and again to be extremely important for the tool’s efficiency, even though our system is in its spirit based on verification as opposed to testing alone. In addition to allowing us to detect errors sooner and filter out wrong synthesis candidates, tests also allow us to quickly find the approximate error location.

Fig. 4.
figure 4

Automatically repaired functions using our system. We provide for each operation: a small description of the kind of error introduced, the overall program size, the size of the invalid function, the size of the erroneous expression we locate and the size of the repaired version. We then provide the times our tool took to: gather and classify tests, and repair the erroneous expression. Finally, we mention if the resulting expression verifies. The source of all benchmarks can be found on http://lara.epfl.ch/w/leon-repair (see also http://leon.epfl.ch)

5 Evaluation

We evaluate our implementation on a set of benchmarks in which we manually injected errors (Fig. 4). The programs mainly focus on data structure implementations and syntax tree operations. Each benchmark is comprised of algebraic data-type definitions and recursive functions that manipulate them, specified using strong yet still partial preconditions and postconditions. We manually introduced errors of different types in each copy of the benchmarks. We ran our tool unassisted until completion to obtain a repair, providing it only with the name of the file and the name of the function to repair (typically the choice of the function could also have been localized automatically by running the verification on the entire file). The experiments were run on an Intel(R) Core(TM) i7-2600K CPU @ 3.40GHz with 16GB RAM, with 2GB given to the Java Virtual Machine. While the deductive reasoning supports parallelism in principle, our implementation is currently single-threaded.

For each benchmark of Fig. 4 we provide: (1) the name of the benchmark and the broken operation; (2) a short classification of the kind of error introduced. The error kinds include: a small variation of the original program, a completely faulty

figure bn

-case, a missing

figure bo

-case, a missing necessary

figure bp

-split, a missing function call, and finally, two separate variations in the same function. We describe the relevant sizes (counted in abstract syntax tree nodes) of: (3) the overall benchmark, (4) the erroneous function, (5) the localized error, and (6) the repaired expression. The full size of the program is relevant because our repair algorithm may introduce calls to any function defined in the benchmark, and also because the verification of a function depends on other functions in the file (recall Fig. 1). We also include the time, in seconds, our tool took to: (7) collect and classify tests and (8) repair the broken expression. Finally, we report (9) if the system could formally (and automatically) prove the validity of the repaired implementation. Our examples are challenging to verify, let alone repair. They contain both functional and test-based specifications to capture the intended behavior. Many rely on unfolding procedure of [24, 25] to handle contracts that contain other auxiliary recursive functions. The fast exponentiation algorithm of

figure bq

relies on non-linear reasoning of the Z3 SMT solver [4].

An immediate observation is that fault localization is often able to focus the repair to a small subset of the body. Combined with the symbolic term exploration, this translates to a fast repair if the error fell within the error model. Among the hardest benchmarks are the ones labeled as having “2 variations”. For example,

figure br

is similar to one in Fig. 1 but contains two errors. In those cases, localization returns the entire

figure bs

as the invalid expression. Our guided repair uses the existing

figure bt

as the guide and successfully resynthesizes code that repairs both erroneous branches. Another challenging example is

figure bu

, for which the more elaborate If-Focus-Condition rule of Sect. 2.1 kicks in to resynthesize the condition of the

figure bv

expression.

The repairs listed in evaluation are not only valid according to their specification, but were also manually validated by us to match the intended behavior. A failing proof thus does not indicate a wrong repair, but rather that our system was not able to automatically derive a proof of its correctness, often due to insufficient inductive invariants. We identify three scenarios under which repair itself may not succeed: if the assumptions mentioned in Sect. 2 are violated, when the necessary repair is either too big or outside of the scope of general synthesis, or if test collection does not yield sufficiently many interesting failing tests to locate the error.

6 Further Related Work

Much of the prior work focused on imperative programming, without native support for algebraic data types, making it typically infeasible to even automatically verify data structure properties of the kind that our benchmarks contain. Syntax-guided synthesis format [1, 2] does not support algebraic data types, or specific notion of repair (it could be used to specify some of the sub-problems that our system generates, such those of Sect. 3).

GenProg [7] and SemFix [17] accept as input a C program along with user-provided sets of passing and failing test cases, but no formal specification. Our technique for fault localization is not applicable to a sequential program with side-effects, and these tools employ statistical fault localization techniques, based on program executions. GenProg applies no code synthesis, but tries to repair the program by iteratively deleting, swapping, or duplicating program statements, according to a genetic algorithm. SemFix, on the other hand, uses synthesis, but does not take into account the faulty expression while synthesizing. AutoFix-E/E2 [18] operates on Eiffel programs equipped with formal contracts. Formal contracts are used to automatically generate a set of passing and failing test cases, but not to verify candidate solutions. AutoFix-E uses an elaborate mechanism for fault localization, which combines syntactic, control flow and statistical dynamic analysis. It follows a synthesis approach with repair schemas, which reuse the faulty statement (e.g. as a branch of a conditional). Samanta et al. [20] propose abstracting a C program with a boolean constraint, repairing this constraint so that all assertions in the program are satisfied by repeatedly applying to it update schemas according to a cost model, then concretize the boolean constraint back to a repaired C program. Their approach needs developer intervention to define the cost model for each program, as well as at the concretization step. Logozzo et al. [16] present a repair suggestion framework based on static analysis provided by the CodeContracts static checker [5]; the properties checked are typically simpler than those in our case. In [6], Gopinath et al. repair data structure operations by picking an input which exposes a suspicious statement, then using a SAT-solver to discover a corresponding concrete output that satisfies the specification. This concrete output is then abstracted to various possible expressions to yield candidate repairs, which are filtered with bounded verification. In their approach, Chandra et al. [3] consider an expression as a candidate for repair if substituting it with some concrete value fixes a failing test.

Repair has also been studied in the context of reactive and pushdown systems with otherwise finite control [8, 10, 11, 19, 20, 26]. In [26], the authors generate repairs that preserve explicitly subsets of traces of the original program, in a way strengthening the specification automatically. We deal with the case of functions from inputs to outputs equipped with contracts. In case of a weak contract we provide only heuristic guarantees that the existing behaviors are preserved, arising from the tendency of our algorithm to reuse existing parts of the program.

7 Conclusions

We have presented an approach to program repair of mutually recursive functional programs, building on top of a deductive synthesis framework. The starting point gives it the ability to verify functions, find counterexamples, and synthesize small fragments of code. When doing repair, it has proven fruitful to first localize the error and then perform synthesis on a small fragment. Tests proved very useful in performing such localization, as well as for generally speeding up synthesis and repair. In addition to deriving tests by enumeration and verification, we have introduced a specification construct that uses pattern matching to describe symbolic tests, from which we efficiently derive concrete tests without invoking full-fledged verification. In case of tests for recursive functions, we perform dependency analysis and introduce new ones to better localize the cause of the error. While localization of errors within conditional control flow can be done by analyzing test runs, the challenge remains to localize change inside large expressions with nested function calls. We have introduced the notion of guided synthesis that uses the previous version of the code as a guide when searching for a small change to an existing large expression. The use of a guide is very flexible, and also allows us to repair multiple errors in some cases.

Our experiments with benchmarks of thousands of syntax tree nodes in size, including tree transformations and data structure operations confirm that repair is more tractable than synthesis for functional programs. The existing (incorrect) expression provides a hint on useful code fragments from which to build a correct solution. Compared to unguided synthesis, the common case of repair remains more predictable and scalable. At the same time, the developer need not learn a notation for specifying holes or templates. We thus believe that repair is a practical way to deploy synthesis in software development.