1 Introduction

The discovery and repair of errors are major challenges in software engineering that take developers considerable effort. This is why software engineering research has intensively worked on improving techniques to assess program correctness and to repair program errors.

Formal methods have become a popular approach to finding errors in hardware and software systems. Formal methods offer great promise in that they can provide guarantees regarding the presence or absence of errors in the system. Model Checking (MC) [7], for instance, exhaustively explores the possible executions of the system (or a model of the system) in search of errors – typically expressed as violations of a given temporal property. Formal methods contrasts with other approaches more prominently used in software development, such as testing techniques whose offered guarantees are limited to the behavior that (manually or generated) test cases cover.

Formal methods were initially restricted to the verification of system models (e.g. transition systems) that would directly be provided by the user. More recent tools enable the verification of source code directly. These tools implement a variety of analysis approaches, including: heuristics [29]; model checking [7] and variations [17]; symbolic verification [10], and runtime analysis [12]. Unfortunately, all those approaches either suffer from the state-space explosion problem (i.e. the program state space is too large to be explored exhaustively) or from difficulties in handling the memory model of the system under verification. These two limitations hinders the practical applicability of formal methods to real-world programs.

Statistical Model Checking (SMC) [14, 20, 33] resolves these complexity limitations, albeit at the cost of some statistical uncertainty in the results. Instead of exhaustive exploration, SMC simulates multiple executions of a system and monitors these executions in regards to expected properties. Then, SMC uses statistical algorithms to extrapolate its conclusions to the system globally. While SMC can efficiently find errors and avoid the aforementioned complexity issues, SMC provides only a statistical likelihood of the absence of errors. So far, SMC has been mostly applied to verify safety and liveness properties on (stochastic) models of systems. This includes for example, properties of biological systems [9, 16], robotics [26], security [30], or of railway systems [3]. On the other hand, with the exception of [18, 24], SMC has not been used to verify properties on program code. Program code analysis applications are generally limited to specific reachability properties and restricted fragments of programming languages. In particular, the verification of security properties that depend on space and time memory failures (e.g. out of bounds, overflows, read-only memory access, etc.) has not yet been explored, whereas such failures are prominent in popular languages such as C. The verification of such properties is critical because their violation leads to serious errors that conventional testing techniques (that lean on a test suite) cannot unveil [23].

The main challenge in applying SMC and other formal methods to program code is to accurately and efficiently model the possible values within the program and its complex internal information. The true possible values of a program execution can be difficult to determine because of the large domains of values to consider (e.g. all 64-bit Integers) and the difficulty of tracking complex runtime relationships between successive values. Similarly, building an accurate but abstract model of memory organization is challenging, since the compiler and operating system have specific features that are opaque to the verification engine. All these challenges ultimately lead to inaccuracies in the verification results and, in turn, to an increasing risk of overlooking pernicious security vulnerabilities.

In [5], we have proposed the C Statistical Model Checker (C-SMC) approach and tool in order to address these challenges and enable the discovery a wide range of security errors. C-SMC combines the inherent strengths of SMC with a runtime engine to connect the program under verification with a real executions. This allows us to reason on contextual information such as memory, stack, registers, warning flags, etc. The runtime engine feeds the SMC engine with information over the program executions without having to derive a formal model of the entire hardware and operating system’s behavior. We, more specifically, use a debugging engine – viz. GDB [1] – to inform the analysis on the true instantiations of memory and internal system hardware to determine concrete values. As C-SMC builds on SMC, this approach can handle many independent executions of the program and effectively combine the analysis results of these executions in order to determine the likelihood to satisfy given properties.

Automated Program Repair (APR) [11] is a popular area of research that aims to automate code modifications in order to fix existing errors without programmer intervention. APR is typically used once the presence of an error has been detected in the program, e.g. via formal methods or a tool such as C-SMC.

The process of most APR approaches fall in the subarea of “generate and validate”. During the generation part the APR uses test case results in order to locate suspicious code statements possibly responsible for the error and, then, alter these statements in order to produce candidate patches. During the validation part the APR methods execute the program test suite on the patched code and discard all patches that make some tests fail.

Contribution. The present paper is a proof of concept of how to enrich C-SMC with a program repair module. The tool is used both to generate error trace as well as to evaluate potential patches. Concretely, C-SMC can discover errors and produce test cases that can then be the input to APR, and further C-SMC can validate the patches generated as part of the APR itself. We show how an APR approach can benefit from the statistical nature of SMC and the information that C-SMC produces regarding the errors discovered.

Conceptually, we propose to use the set of traces that C-SMC produce as inputs for APR. The traces are divided into positive traces (that raise no error) and negative traces (that exhibit the error). Hence, we lean on the capability of SMC to provide multiple traces that reveal the error – and to do so without guidance from developer-written or automatically generated tests. Based on the traces, we determine the set of suspicious statements through reverse data-flow analysis from the statement where the error was revealed.

We, then, propose a neural agent-based method that, given a set of suspicious statements, looks for the sequence of actions (statement alterations) that maximizes the likelihood to repair the error. Candidate sequences of actions produce in turn potential patches whose correctness is validated via C-SMC to speed up the process.

We evaluate our repair approach on the benchmark from [5]. Our repair approach fixes 8 out of the 9 errors and – compared to a random baseline – does so while producing 23% to 91% fewer candidates patches in all cases but one. Put together, SMC and APR effectively automate the detection and repair of memory usage-related errors to an extent that previous approaches could not achieve. Our approach complements the state of the art on software testing and repair, which has mostly focused on test-based approaches. Here we find a productive union of formal methods – viz. SMC – and APR that yield a complete package for error detection and repair.

To summarize, our main contributions are as follows. (1) Propose the first approach for automated error detection and program repair that builds on statistical model checking. Our approach can detect and repair pernicious security errors that conventional test cases are not likely to trigger. (2) Propose a novel genetic algorithmic searching approach to find an effective program repair efficiently. (3) Demonstrate an implementation of our approach on top of established tools and technologies. (4) Evaluate our approach using a benchmark of eight memory usage-related errors and one arithmetic error. Our evaluation results suggest that (a) our detection method could find all of the nine errors, and (b) our repair method could fix eight of these errors and do so with fewer generated patches for seven of these eight errors.

Structure. The structure of the paper is as follows. Section 2 provides background and related work that are necessary to understand the paper. Section 3 presents an illustrative example used in this work. Section 4 describes the novel genetic algorithmic repair agent. Section 5 presents our implementation and experimental results. Section 6 concludes.

Note on Content. As said above, this paper is a proof of concept paper that is written to generate discussions with the community. Consequently, many formal details are left out of this presentation.

2 Background and Related Work

We consider systems/programs where executions are represented by traces, i.e., sequences of states. Each state represents the status of the system under verification at a given moment of time. States and traces can be abstracted as mathematical objects such as (sequences of) Boolean variables [7]. When considering languages such as C, a state represents the current location in the program (line of code), its memory, and the set of instructions/statements that are available from this state. A state can also contain information about the execution environment. A trace moves from on state to another state by following a transition. When considering languages such as C, such transition represents the execution of an instruction.

2.1 Essentials of Statistical Model Checking

Statistical Model Checking (SMC) [14, 20, 33] is a variety of probabilistic model checking (see e.g. [2]) that exploits execution traces to avoid an explicit representation of the state space. The evaluation of properties is instead calculated from an empirical distribution of the executions of the system. Given a number of statistically independent traces of a stochastic model, and the capabilities to decide whether a trace satisfies a property, it is possible to estimate the probability that the model will satisfy the property.

Note that unlike model checking [7], SMC provides results that are not complete and may only provide an approximation of the true properties of the system. Thus, an estimate of the behavior may be obtained with a specified confidence provided by, e.g. the Okamoto bound [14, 25]. Further, it is possible to efficiently evaluate the truth of a hypothesis without needing to calculate the actual probability using, e.g. the sequential probability ratio test [31, 33]. SMC can also be used to compute the probability of rare errors [21].

To apply SMC requires building a model of the system and expressing the properties to be checked. Typically the model can be defined as a labeled transition system (LTS) and the properties expressed in a logic such as bounded linear temporal logic (BLTL) that can express complex behavioral properties with nested temporal causality.

2.2 Trace Execution Properties

Expressing properties formally is key in software verification. Based on definitions above, properties about a program can be defined in terms of properties over states, over individual traces, and over sets of traces. We first focus on the first two types of properties. The last one, which corresponds to the verification of the whole program, will be handled in the next (sub-)section.

A propositional logic can define properties about each state individually. To be able to consider all the states of an execution trace, this propositional logic needs to be extended. One popular extension is Linear Temporal Logic LTL [27]. LTL allows us to make hypotheses of unbounded traces via temporal operators. As SMC restricts to finite trace executions, we consider a bounded LTL, where each temporal operator is bounded for the number of states to which it applies.

Bounded Linear Temporal Logic (BLTL) is an enhancement of LTL that adds bounds expressed in step or time units. The syntax of BLTL is as follows:

$$\begin{aligned} \phi ,\psi \quad {:}{:}{=}\quad p \mid \phi \vee \psi \mid \lnot \phi \mid \phi \mathop {U_{\le t}} \psi \mid \mathop {X_{\le t}} \phi \; . \end{aligned}$$
(1)

The p is propositional variables, disjunction \(\phi \vee \psi \) and negation \(\lnot \phi \) are all as in LTL. The formula \(X\phi \) is true if \(\phi \) is true in the next state from the current state. The formula \(\phi \mathop {U_{\le t}} \psi \) is true if both: \(\psi \) becomes true before t in the sequence from the current state; and \(\phi \) remains true in every state before the state where \(\psi \) becomes true. For a formal definition of BLTL semantics, see [35]. A BLTL formula is expressed with respect to a trace. It is also helpful to have conjunction (\(\phi \wedge \psi \)) and implication (\(\phi \Rightarrow \psi \)) that are defined in the usual manner. Similarly the always (G) and eventually (F) operators can be defined using the BLTL syntax above as follows. Eventually is defined as \(F_{\le t} \phi = \texttt{true} \mathop {U_{\le t}} \phi \) and means that the formula \(\phi \) should become true before t. Always is defined as \(G_{\le t} \phi = \lnot F_{\le t } \lnot \phi \) and means that \(\phi \) must always hold for the next t.

Let \(w = s_0, s_1, ..., s_L,...\) be an execution trace, and denote by \(w^j = s_j, ..., s_L, ...\) the portion of the trace starting from j (included). The truthfulness of the formulas can be decided using the rules described in Table 1.

Table 1. BLTL rules.

BLTL allows us to express reachability properties such as “the software should eventually reach a state where variable x is equal to 1”. The logic can also be used to express more elaborated causalities such as “always, if the software reaches a state where x is equal to 1, it will eventually reach a state where y is equal to 1”. This expressive power allows us to express a wide range of safety and security properties. BLTL properties can be verified with monitoring procedures [13]. Such procedures, inspect successive states of an execution trace until it can decide whether the property is satisfied. BLTL properties, which are a fragment of safety properties, can be monitored over finite traces using well-established techniques such as those presented in [13].

2.3 Probabilistic Verification

We now turn to properties defined over sets of trace executions, that is properties defined on the whole system. We are interested in solving the probabilistic BLTL problem, that is to compute the probability for the system to satisfy a BLTL property \(\phi \). Such probability being defined as the probability that a random trace of the system satisfies \(\phi \).

Statistical model checking [19] has been proposed as an efficient approach to solve such problem. SMC statistically measures the truthfulness of properties over a smaller number of traces. This is done by performing a fixed number of simulations of the system and using an algorithm to estimate the probability that the system satisfies the property. As the number of observed executions is finite, this answer comes together with a confidence interval [19].

There is a wide range of statistics algorithms that can be used to estimate the probability to satisfy a given property. This includes, e.g., importance splitting and sampling [19] that can also be used to efficiently exhibit traces that do not satisfy the property. As the study of those algorithms is not the topic of this paper, we propose to work with the most simple one that is based on the Monte Carlo estimator. The estimator relies on the following proportion:

$$\begin{aligned} \bar{\gamma } = \frac{\sum _{i=1}^{N} {\textbf {1}}(w_i \models \phi ) }{N} \text { where } {\textbf {1}}(x) = {\left\{ \begin{array}{ll} 1 &{} \text {if } x \text { is true}\\ 0 &{} \text {otherwise.} \end{array}\right. } \end{aligned}$$
(2)

where N is the number of simulation being performed. Let \(\gamma \) be the true probability to satisfy \(\phi \) and P be a probability evaluation. Let \(\epsilon \) (precision) and \(\delta \) (confidence )be small values. The Chernoff-Hoeffding bound [15] guarantees that \(P(|\bar{\gamma }-\gamma | \ge \epsilon ) \le \delta \) is given by \(N = \lceil \frac{ln 2 - ln \delta }{2\epsilon ^2}\rceil \). Thus, by controlling the number of trace executions verified, the user entirely controls the preciseness of the \(\bar{\gamma }\) estimator.

2.4 SMC Tools

As seen in the previous section, SMC mainly depends on sub-parts that include the type of execution trace property that has to be monitored and the statistical algorithm used to compute the stochastic guarantee. Implicitly, SMC also depends on the type of system under verification and on the capacity to generate an arbitrary number of execution traces from this system.

In [34], the authors proposed YMER, a tool that can be used to verify BLTL properties of Markov Chains with an hypothesis testing procedure. That is, the tool decides between two hypothesis rather than computing an exact probability. In [8] Monte Carlo is applied to verify properties of Metric Temporal Logic over stochastic timed automata. Those tools have been shown to be very efficient on various problems. However, they are rather static in the sense that they do not exploit the intrinsic modularity of SMC. As an example, YMER does not permit replacing the hypothesis testing algorithm by a Monte Carlo one. None of those tools allows replacing classical BLTL with an algorithm that would consider debugger expression. On the top of this, none of those tools consider C code.

In [4, 22], Boyer et al. introduced Plasma a Statistical Model Checking (SMC) [19] tool that can provide the ability to create custom statistical model checkers. Especially, Plasma works with open source plugins that can (1) specify the property under verification (this includes BLTL as well as many other logics such as those introduced in [6]), (2) monitor traces, (3) generate traces in an efficient manner, i.e., to minimize the number of traces needed to compute the probability to satisfy a given property. Those plugins can be customized to adapt to a wide range of properties and systems.

In a recent work [5], we have introduced C-SMC, a new open source plugin that allows us to monitor security properties of C program. C-SMC monitors BLTL properties via and extension that exploits states of the GDB debugger as input instead of Boolean abstraction. Using this, BLTL properties can refer to variables, registers, memory, locations and next instructions of a given state of the C program. In addition properties can also refer to special flags of the GDB debugger. C-SMC has been able to detect security and safety flaws such as buffer and integer overflows. It can also be used to detect overflow that depends on contextual information coming e.g. from global variables. The later one shall be showed in our motivating example.

3 Illustrative Example

Consider the illustrative binary search example given in Listing 1.1. Variable r should be initialized to size-1 but is initialized to size. This error will lead the search to occasionally go out of the bounds of the array being inspected. Consider an array of size 10 (indexed from 0 to 9) containing numbers 1 to 10. If the search function is called with a value that is greater than the greatest value of the array (10 in this example) then it will go out of bounds looking for the value located at arr[size] (10 in this case).

figure a

This behavior will often lead to no consequences (because even if the access is out of bounds it is still checking a valid address on the stack) and thus be undetected. However, when looking for a value not contained in the array, 11 for instance, it is possible that the memory address located just after the end of the array contains the searched for value. In this case, the output of the search function will be invalid. This error is very hard to reproduce and to express. Indeed, the search function itself does not contain the information about the input values (i.e. arr and size and whether or not size is correct) and so most analysis engines can only warn about a potential out of bounds access for all instances of arr[m] or report no errors.

Pernicious security errors like the above are difficult to reveal through conventional testing. This is due to the fact that testing focuses on comparing outputs for a given set of inputs. As an example, we show in Listing 1.2 a set of 14 test cases that thoroughly check whether the search function works correctly when invoked on (1) all elements that the array contains and (2) close elements that the array does not contain. For conciseness, we have concatenated all these cases into a single piece of code. In this code, the \(\texttt {report\_fail}\) function is used to report when some test yields an incorrect result.

figure b

Observe that all these test cases will usually pass. Thus, although developers can strenuously test the search function and find other potential errors, it is likely that the out of bounds access error will never occur during the test execution. This error would actually occur in the unlikely event where the searched value is not in the array but resides in memory exactly at the memory address located right after the end of the array. In such a case, the search function would return an index value equal to the size of the array, while the special value -1 is expected. This example already demonstrates that test cases are inappropriate to detect errors related to memory usage because the error is independent of the particular inputs on which the function is tested. Even if every possible int value was covered in the test suite, the error would likely remain undetected. To detect this error with C-SMC instead of checking the value of the output for a given input, we monitor the value of the variable m using the formula \( m = 0 | (0<= m \& m < size\). With this formula we can check that at any moment, m does not go out of bounds. This allows us to detect an error that is very hard to reproduce or detect with unit tests or analysis engines.

3.1 SMC-Based Validation

Consider again the binary search example shown in Listing 1.1, where the upper bound of the binary search r is initialized beyond the end of the array arr. A natural check for array data structure is to ensure there is no access (read or write) beyond the limit of the array. This can be formulated as a BLTL property:

$$\begin{aligned} \lnot (\texttt {arr[x]}\wedge \texttt {x}\ge |\texttt {arr}|), \end{aligned}$$
(3)

where arr[x] indicates an access to the array arr at offset x and \(\texttt {x}\ge |\texttt {arr}|\) indicates that x is greater than or equal to the size of arr.

In the example code, there are two accesses to an array, one at lines 7 and one at line 9. However, the value m that records the accessed index is modified at each loop iteration, based on the values of size, elem, and the values contained in arr. Hence, the above property can be reformulated as:

$$\begin{aligned} \text {G}_{\le 1000} (line =7 \vee line = 9) => m \ge 0 \wedge m < size, \end{aligned}$$
(4)

where line indicates the line number in C code, m and size are variables m and size in C code of Listing 1.1.

Assume that C-SMC tested with the following input values for search to generate tests (and in particular this limits the scope of our search).

figure c

After running 4 tests with C-SMC we may discover that the program works properly for v \(\in {\lbrace 6, 2, 3 \rbrace }\) while 9 causes a problem. Observe that applying SMC to search could yield an infinite number of traces that exhibit both good and bad behavior depending on the input arguments given to the search function. As an example, consider the inputs given in Listing 1.2. They will generate traces that satisfy Property 4 when v is assigned to a value located between index 0 and 12 of the input array (input[0:12]) given in line 2 of Listing 1.2. They will generate traces that do not satisfy Property 4 when v is assigned to input[13]. In fact, all searches for a value \(v \le 8\) will provide a GOOD trace, and all searches for values \(v>8\) will provide a BAD trace. Note that the capabilities to identify GOOD and BAD traces depends on the randomized algorithm used to generate inputs. There exists various strategies to reach this objective. Including, e.g., importance splitting and sampling.

BLTL can also be used to describe other classes of properties that are beyond the scope of classical testing. As an example, it is worth checking that the program terminates after a bounded number of steps. The latter can be expressed by:

$$\begin{aligned} \text {F}_{\le 1000} (line= 15 \vee line = 16) \end{aligned}$$
(5)

In addition, we can also use BLTL to specify expected (input, output) pairs just like in typical test cases. Such BLTL formulae would take the form:

$$\begin{aligned} \text {G}_{\le 1000} (line= 16 => eax = gt) \end{aligned}$$
(6)

where eax is a reserved symbol to indicate the return value of the function and gt is a specific, user-defined target output.

3.2 Using Traces for Automated Program Repair

We now turn to illustrate how to repair the program. Observe that the trace monitoring process of SMC has the potential to locate error statements/lines. Those can be exploited to produce a set of possible changes, i.e., actions, for fixing the program according to the execution traces. The key principle is to start from the statement where SMC detected the violation of a property, and apply a reverse data-flow analysis to determine all previously executed statements that share a (direct or indirect) data dependency with the statement where the violation was located. It then examines the abstract syntax tree of the program in order to determine the constituents of these statements and, in turn, determines the set of actions (statement alterations) that the repair agent will consider. For example, a BAD trace given by C-SMC contains lines 2, 3, 5, 6, 7, 9, 10 and 12, and this trace is ended at line 7. Applying the reverse data-flow analysis from line 7, we locates error statements at lines 6, 10, 12, 3 and 2. We can then generate possible changes to each of these lines. This approach is similar to fault-spectrum localisation as presented in [32].

We want to use a repair agent that takes as input the set of possible actions that the trace analysis component identified based on the error trace. It then produces different sequences of actions (based on the taken set) and applies these sequences to generate a set of independent patches that are themselves verified with C-SMC. This process continues until the faulty program is fixed or until the computational budget (expressed in number of produced patches) is reached.

For instance, the agent has fixed our binary search error example by modifying the line where variable r is initialized, as shown in Listing 1.4.

In the rest of this paper, we describe a trace analysis framework that can be used in combination with a genetic algorithm based repair agent. We then report on case studies that implement this proof of concept.

4 Designing a Repair Agent

figure d

Our objective is to develop a repair agent that exploits the information contained in execution traces generated by the SMC engine. We first show how to reason on a trace. Especially, we give intuition on how a trace can be corrected. Recall that each trace contains lines (i.e. location in program’s code) which are reached by the execution and a property field which specifies the property violated by the program. Intuitively, the error statements/lines appear to BAD traces. The last line in a BAD trace indicates where the property has been violated and the error occurs. Let \(\mathcal {L}\) be a set of lines which may cause the error. Let \(\mathcal {T}\) be a set of BAD traces from C-SMC. \(\mathcal {L}\) is computed as follows. Initially, \(\mathcal {L}=\{\ell _i | \ell _i \text { is the last line in the trace} t_i \in \mathcal {T}\}\). Then, \(\ell \in t\) is iteratively added to \(\mathcal {L}\) if there exist a \(\ell ' \in \mathcal {L}\), and \((\ell ',\ell )\) are data dependent (this can be done via classical taint analysis techniques).

Table 2 shows lines appearing in execution traces of the illustrative example (Listing 1.1) computed by C-SMC on the three properties given in Eqs. 45 and 6. The error lines, which are marked with a X, are given by \(\mathcal {L} = \{2,3,6,7,10,12\}\). In order to fix the faulty program, we propose to apply an action to the error lines. An action is implemented by either replacing or inserting a new statement at a given error line. An action \(a = (\ell ,c)\) means that a change c is implemented at line \(\ell \). The change c is generated by three ways as follows. First, one can change operators of an expression. For example, if \(\texttt {<name> <operator> <name>}\) is an expression, then \(\texttt {<operator>}\) can be changed by taking an operator in either the arithmetic operators or the comparison operators. Second, one can replace the current patterns of an expression by existing patterns in the program. The replacement is applied to an expression which matches to an expression pattern in the library. The existing expression matches to a pattern if its abstract syntax tree can be transformed to the pattern’s abstract syntax tree, or if they contain the same numbers of \(\texttt {<name}\) and \(\texttt {<operator>}\). For example, if \((\texttt {<name> <operator> <name>})\texttt {<operator> <name>}\) is an expression and it matches to the pattern \(\texttt {<name> <operator>} (\texttt {<name> <operator> <name>})\), a change is made by the \(\texttt {<name>}\) in the pattern by the \(\texttt {<name>}\) in the expression. Third, a change can be created by inserting a new statement. Consider, e.g., the insertion of the control statements if else, or the addition of special statements such as break, continue, exit, return.

Table 2. Error lines for the example in Listing 1.1.

4.1 A SMC-Based Program Repair Algorithm

Our objective is to fix the faulty program by finding a subset of actions (called patch) from actions given by the trace analysis. By fixing the program, we mean that all properties under verification are satisfied with probability one. We here propose the use of a genetic algorithm and heuristics to generate the patch in an efficient manner.

Let \(\overline{gp}\) denote a faulty program. Let \(\mathcal {A}_{\overline{gp}}\) denotes a finite set of actions given by the trace analysis part to fix the faulty program \(\overline{gp}\). Let \(\varPi _{\overline{gp}}\) be a power set of \(\mathcal {A}_{\overline{gp}}\) (the set of patches). A patch \(\rho = \{a \in \mathcal {A}_{\overline{gp}}\} \in \varPi _{\overline{gp}}\) is applied to \(\overline{gp}\) to transform \(\overline{gp}\) to gp, i.e., \(\overline{gp} \xrightarrow {\rho } gp\). Let \(\mathcal {G}\) be a finite set of programs, i.e., \(\mathcal {G} = \{gp | \exists \rho \in \varPi _{\overline{gp}}: \overline{gp}\xrightarrow {\rho } gp\}\). Our task is to find a patch \(\rho * \in \varPi _{\overline{gp}}\) such that \(\overline{gp}\xrightarrow {\rho *} gp*\) and \(gp*\) is a fixed program. Note that the faulty program \(\overline{gp}\) is fixed by a patch \(\rho *\), i.e., \(\overline{gp}\xrightarrow {\rho *} gp*\), if all the properties given in the environment are satisfied by \(gp*\).

Let f be an objective function \(f:\mathcal {G} \rightarrow [0.0,1.0]^m\) where \([0,1]^m\) denotes the output of m properties of a program \(gp \in \mathcal {G}\), e.g., \(f(gp) = [0.0, 1.0, 0.2]\) indicates that the program gp has \(0\%\) chance for holding the property 1, \(100\%\) chance for holding the property 2 and \(20\%\) chance for holding the property 3. The program \(gp*\), i.e., \(\overline{gp}\xrightarrow {\rho *} gp*\), satisfies all properties such as \(f(gp*) = [1.0]^m\). Let \(\textbf{s} = f(gp)\) be a validation-state of the program gp, i.e., a vector whose components represent the probability to satisfy the properties under verification. Consider \(\textbf{s}_0=f(\overline{gp})\) to be the initial validation-state, i.e. the initial probability value for each property that the program must eventually satisfy.

Our genetic algorithm works by evaluating a population of patches. In case, those patches do not lead to \(s_*\), mutation and recombination operators shall be called. Given a validation-state, our objective is to derive actions that will lead to better patches and eventually to a better validation-state. For doing so, we propose to use a Q-function [28] to specify the relation of states and actions, as follows. For every observed patch \(\rho \in \varPi _{\overline{gp}}\):

$$\begin{aligned} Q(\textbf{s},a) = r + \gamma \max _{a'}Q(\textbf{s}', a'), \forall a \in \rho . , \end{aligned}$$
(7)

where \(\gamma \) is a coefficient in \([0,1 ]\), r and \(\textbf{s}'\) are reward and validation-state respectively. We assume the reward to be 0 when the system is not repaired and 1 if it is repaired. In practice, we will stick to those two values. In theory, a reward of 0, 5 would mean that the program is half fixed. Initially, we have \(Q(\textbf{s}, a) = 0\) for every state \(\textbf{s}\) and every action \(a\in \mathcal {A}_{\overline{gp}}\). The reward r gets 1.0 if the faulty program is fixed. Otherwise, it gets \(-1.0\).

Intuitively, \(Q(\textbf{s}, a)\) indicates the probability to choose action a at validation-state \(\textbf{s}\). The Q-function is updated for every patch and its values later are used to generate a new patch. For instance, if there is a patch \(\rho =\{a_1, a_2\}\) and the state transitions \(\textbf{s}_0 \xrightarrow {a_1} (\textbf{s}_1,r_1) \xrightarrow {a_2} (\textbf{s}_2,r_2)\), then the Q-function is updated as follows: \(Q(\textbf{s}_0, a_1) = r_1 + max_{a'}Q(\mathbf {s_1},a')\) and \(Q(\textbf{s}_1, a_2) = r_2 + max_{a'}Q(\mathbf {s_2},a')\). In this work, we implement a \(\epsilon \) greedy selection algorithm to choose an action for the patch generation, such that, at a state \(\textbf{s}\) an action is either chosen by \(\text {arg}\max _{a\in \mathcal {A}_{\overline{gp}}}(Q(\textbf{s},a))\) with a given probability \(\epsilon \) or chosen any action \(a\in \mathcal {A}_{\overline{gp}}\) with probability \(1-\epsilon \).

figure e

To estimate the values of Q-function, we implement an estimator neural network \(\text {NN}(\textbf{s},\textbf{s}')\). The network takes the source validation-state \(\textbf{s}\) and the destination validation-state \(\textbf{s}'\) as inputs. It outputs the value of the Q-function at validation-state \(\textbf{s}\). Output values are used to compute a patch for the transition \(\textbf{s} \rightarrow \textbf{s}'\). Hence, if there is a transition \(\textbf{s} \rightarrow \textbf{s}'\), the values of Q-function at the state \(\textbf{s}\), i.e., \(Q'(\textbf{s}) \approx Q(\textbf{s},a)_{a\in \mathcal {A}_{\overline{gp}} }\), which are used to compute a patch \(\rho \): \(\textbf{s} \xrightarrow {\rho } \textbf{s}'\), are computed as follows.

$$\begin{aligned} Q'(\textbf{s}) = \text {NN}(\textbf{s},\textbf{s}') \end{aligned}$$
(8)

For every patch, the neural network is updated by the actual values of actions in this patch with the following Mean Squared Error loss function.

$$\begin{aligned} \text {MSE}= \frac{1}{n}\sum ^n_{i=1}(Y_i - \bar{Y}_i)^2, \end{aligned}$$
(9)

where \(Y_i = Q'(\textbf{s},a_i)\) and \(\bar{Y}_i = Q(\textbf{s},a_i)\) are predicted/actual values of the action \(a_i\) at the state \(\textbf{s}\), respectively. n is the number of actions, i.e., \(n= |\rho |\). Note that if \(\textbf{s} \xrightarrow {\rho } \textbf{s}'\), \(Q(\textbf{s},a_i)_{a_i\in \rho }\) is computed by Eq. 7 and \(Q'(\textbf{s},a_i)_{a_i\in \rho }\) is a value of \(Q'(\textbf{s})\), which is predicted by NN, i.e., \(Q'(\textbf{s}) =\text {NN}(\textbf{s}, \textbf{s}')\).

figure f

We are now ready to present the genetic algorithm for patch generation as shown in Algorithm 2. Intuitively, patches are created from one action in the set of actions given by the trace analysis part (line 1). Then, each patch is evaluated by SMC (line 8). The validation-states obtained from this evaluation are used to update the neural network NN (line 9). If there exists a patch which can fix the faulty program, i.e., the state of the current patch is the final state \(\textbf{s}*\), the algorithm terminates (line 11–14). Otherwise, it creates a parent set from the current population, i.e., P, and generates a new population from this set (lines 17–18). If the optimal patch is not found after exceeding the limit number, i.e., MaxPopulation, the threshold \(\tau \) is used to choose actions for creating an “optimal” patch (line 21–22). The threshold \(\tau \) is used to select actions in a patch. We use \(\tau \) to vary the number of actions in a patch. Although the patch chosen by \(\tau \) may not be a good one for fixing the faulty program, it can be a recommendation for the developers to fix their code. Algorithm 3 presents the parent selection algorithm. This algorithm conserves n selected parents by exploiting an Euclidean distance between the validation state obtained for each patch and the expected final validation-state. Algorithm 4 presents the recombination and mutation operations. Initially, the new population \(P'\) is created from the parent population starting with a patch of one action that is selected by Function GreedySelection. Then, \(P'\) is grown up by two operators (lines 1–3). The first operator is the mutation operator. It implements two operations: (1) to add an action with the highest Q-value into \(\rho \), (2) to remove an action \(a \in \rho \) with the lowest Q-value (lines 4–11). The second operator called recombination replace a pair \((\rho _1, \rho _2)\) by a pair \((\rho '_1, \rho '_2)\) that is created by exchanging actions which are associated with the highest Q-values in \(\rho _1\) and \(\rho _2\) (lines 12–21).

figure g
figure h

5 Implementation and Experimental Results

We implemented the APR process described in Sect. 4 with three modules: the environment module, trace analysis module, and agent module.

The environment module has two roles: C program evaluation and program repair evaluation. The C program evaluation is done by interfacing with C-SMC and inputting the C program and BLTL properties. If any property of the C program is not satisfied then repair must be triggered (otherwise the program is error free and no further action taken). The traces from C-SMC are gathered along with their result (GOOD or BAD) and these are sent to the trace analysis module. The program repair role takes a patch from the agent module and produces the state of the C program after applying the patch.

The trace analysis module takes a set of traces from the environment module and the C program as inputs, and produces a set of actions to fix the C program. For the motivating example, it analyses the BAD traces of Eq. 4 and generates actions to fix the C program at lines 2, 3, 5, 6, 7, 8, 9, 10, 12. Then, these actions are sent to the agent module for patch generation.

The agent module takes the lines from the trace analysis module and implements the algorithms detailed in Sect. 4 to generate a patch to apply to the C program. The agent module then sends this to the environment module to evaluate the efficacy of the patch and provide this as data for further patch generation using the genetic techniques described in Sect. 4.

5.1 Experimental Results

In this section, we evaluate the performance of the C-SMC based APR tool with Q-function on the motivating example in Listing 1.1 and in three variants. Our objective is to evaluate the performance of a Q-function with respect to a random strategy to select actions to mutate and recombine in the genetic algorithm.

In this experiment, we choose the maximum population to be 3000 patches, i.e., \(MaxPopulation = 3000\) and we applied the randomized strategy for action selection. The results are reported in Table 3. The motivating example in Listing 1.1 starts in validation-state \(s_0=[0.9, 1.0, 1.0]\). It is fixed by the APR tool after 50 patches out of a population of 112 patches by the random strategy. Variant 1 is obtained by replacing \(l=m+1\) with \(l=m-1\) at line 12. The initial validation-state is given by \(s_0=[1.0, 1.0, 0.6]\). It is fixed by our tool after 72 patches while it takes a population of 210 patches to fix it with a random strategy. Variant 2 is obtained by replacing \(r=size\) with \(r=size-1\) at line 3, \(m=(l+r)/2\) with \(m=(r-l)/2\) at line 6 and \(l=m+1\) with \(l=m-1\) at line 12. Its initial validation state is given by \(s_0=[1.0, 1.0, 0.4]\). It is fixed after 290 patches by our tool while the random strategy takes 2259 patches. Variant 3 is obtained by replacing \(r=m-1\) with \(r=m+1\) at line 10 and \(l=m+1\) with \(l=m-1\) at line 12. Its initial validation-state is given by \(s_0=[1.0, 1.0, 0.6]\). It is fixed after 290 patches by our tool. A random strategy could not fix it after a population of 3000 patches. The results show that the use of an optimized Q-function drastically improve the selection of patches.

Table 3. Comparison of the APR tool with Q-function and the random strategy.
Table 4. Evaluation of the APR tool on other C code.

The C-SMC based APR tool with Q-function was also applied to the nine benchmark error programs from [5] that are available on GitHubFootnote 1. The results are shown in Table 4. Observe that in all cases (except for “read-only memory”) the combination of C-SMC with APR was able to not only find the error, but also successfully repair the error. The failure to repair the “read-only memory” example is due to there being no reasonable way for the patch generation algorithm to create a suitable repair. While this example was relevant to demonstrate the capabilities to detect such errors, it would be difficult to fix given the patch algorithms possible changes, and also to preserve the program’s behavior. Further, in all cases (except for “divide by zero”) the APR tool with Q-function proved to be more effective than the random strategy.

Due to space limitation, a detailed analysis of the benchmarks is given in the aforementioned GitHub. These results demonstrate that the approach is effective both in concept and in practice, and further that the APR Q-function strategy is an improvement over the default (random) genetic algorithm.

6 Conclusions and Future Work

The challenge of finding and repairing errors in software development is complex and ongoing. Formal methods such as SMC provide a strong basis for finding and proving the existence of errors, including a trace to demonstrate the error. The statistical nature of SMC allows the generation of multiple correct and erroneous traces of the program. These traces form a natural set of test cases that can be the input for APR approaches. Combining SMC with APR is an effective solution to find errors and provide the necessary information to repair them.

This paper presents a proof of concept showing that this approach is effective in discovering, learning, repairing, and validating the repair in a single tool. Further, the proposed Q-function for improving the genetic algorithmic search for a patch improves the efficacy of the patch generation. This is demonstrated by the implementation here exploiting C-SMC and demonstrating effective discovery and repair of errors.

Future Work. We would like to extend our approach to guided simulations approaches such as importance sampling and splitting. This shall help us to find error traces in a more efficient manner and hence improve the efficiency of the repair agent. We also would like to apply the work to a wider class of systems. Especially, we would be interested in applying the approach to real-time version of C. This could be done by exploiting recent work on stochastic timed automata implemented in UPPAAL-SMC [18]. The limitations of the “read-only memory” benchmark also motive the use of more complex repair tools in the agent and other methods of patch generation. Finally, observe that this paper is a proof of concept. Comparison with other tools shall be investiguated in near future.