1 Introduction

A far-fetched goal of artificial intelligence research is to build a system that writes computer programs for humans. To achieve this goal, researchers take two distinct approaches: deductive program synthesis and inductive program synthesis. Both approaches attempt to produce programs requested by human users. The difference lies how they produce programs: deductive synthesis tries to deduce programs that satisfy specifications, while inductive program synthesis tries to induce programs from examples.

While such inductive synthesis alleviates the burden of implementation by guessing programs from given input-output examples, in inductive synthesis resulting programs are not trustworthy. Deductive synthesis overcomes this limitation with formal specifications: it allows users to formalise what they want as specifications, whereas inductive synthesis tools guess how programs should behave from examples provided by users. Thus, in deductive synthesis providing formal specifications remains as users’ responsibility. The upside of deductive synthesis is, however, users can obtain correct programs upon success.

SuSLik [19], for example, is one of such deductive synthesis tools. It takes a specification provided by humans and attempts to produce heap-manipulating programs satisfying the specification in a language that resembles the C language. Internally, this derivation process is formulated as proof search: SuSLik composes a heap-manipulating program by conducting a best-first search for a proof goal presented as specification. The drawback is that the search algorithm often fails to find a proof within a realistic timeout. That is, even we pass a specification to SuSLik, SuSLik may not produce a program satisfying the specification. According to Itzhaky et al. [5], different synthesis tasks benefit from different search parameters, and that we might need a mechanism to tune SuSLik ’s search strategy for a given synthesis task.

2 SuSLik’s Search Strategy

SuSLik synthesises a program by searching for a corresponding proof. We can see SuSLik’s proof search as an exploration of an OR-tree, nodes of the tree represent (intermediate) synthesis goals, while edges of the tree represent rule applications. The shape of such search tree is not known in advance, and the task of SuSLik is to identify a solved node, in which a proof is complete.

Fig. 1.
figure 1

Mutation and Elitist Selection

Since such OR-trees can be too large to find proofs within a realistic timeout, SuSLik narrows the search space using a proof strategy. Essentially, proof strategy in SuSLik is a function that takes a synthesis goal and returns an ordered list of rules to apply next. Itzhaky et al. developed the default strategy by manually encoding human expertise. For example, the default strategy precludes the application of a rule called CALL when another rule CLOSE has been applied before reaching the current node. This way, the SuSLik rules are grouped into 10 ordered lists, and the order of these rules define how SuSLik explores the corresponding OR-tree.

Another decision SuSLik has to make for an effective search is to select the next node to expand. The current version of SuSLik make this decision using a cost function, manually developed and tuned by Itzhaky et al. [5].

Both the weights of the cost function and orders of derivation rules are manually tuned for the benchmark used in their evaluation [5]; however, as we show in Sect. 4, our evolutionary framework finds better strategies through evolution.

3 Evolutionary Computation for SuSLik

The aim of our evolutionary computation is to optimise the order of each group of derivation rules and the weights of the cost function, which is used to implement best-first search.

Algorithm 1 summarises the genetic algorithm we used in our framework to improve the search strategy of SuSLik. Firstly, the algorithm takes a set of training problems an inputs, using which we evolve SuSLik instances over 40 generations. Line 1 defines the initial population. Each individual in a population is evaluated according to the fitness function described in Sect. 3.

For each generation, we copy individuals from the previous iteration (Line 6), mutate them (Line 7), evaluate individuals (Line 8). Then, we sort all individuals in the current generation based on their performance (Line 9–10). And we continue to the next generation using the best 20 individuals from the current pool. In the following, we explain the mutation algorithm, the fitness function, and our selection algorithm.

Mutation. As we explained in Sect. 2, by default a search strategy of SuSLik is defined by two factors: the order of rule application and weights of each node in the search tree. To determine an effective way to apply genetic algorithms to program synthesis in SuSLik, we implemented the following three different mutation algorithms:

  • Order-only mutation changes only the order of rule application for each node.

  • General rule-weight mutation changes the weights of each node based on what rules have been applied to reach that node.

  • Goal-specific rule-weight mutation allows SuSLik to choose a weight for each rule based on properties of a node during a search.

figure a

Fitness. The fitness function measures the performance of SuSLik instances. More specifically, it measures how many derivation problems each SuSLik instance solves within the timeout of 2.5 s for each problem. When multiple SuSLik instances solve the same number of derivation problems, the fitness function uses the numbers of rules fired by the instances as a tie-breaker: it considers that the instance that solves a certain number of problems with a smaller number of rule applications is better than another instance that solves the same number of problems with a larger number of rule applications.

Selection. We adopt a version of elitist selection as our selection method: we pass individuals from the current generation to the next generation. By copying them and mutating them if they show better performance in the current generation. Figure 1 provides the schematic view of our elitist selection. Unlike the standard elitist selection algorithm, ours prioritizes the best individual in each generation to speed up the evolution: the best individual in each generation, called champion, is entitled with three children, one original copy without mutation and two mutated children, whereas each of other 19 winners has one original copy and only one mutated child in the next generation.

Note that each individual has two kinds of properties to mutate: the order of derivation rules, and weights used in the cost function. While we represent the weights as floating point numbers, we adopt permutation encoding for the orders of derivation rules.

For each permutation encoding, each individual has the probability of 0.1 to be moved, while we change weights by multiplying a random number between 0.8 and 1.2. In our framework, we do not apply crossover to permutation encoding: since our sequences denoting rule orders tend to be short, we are not sure if crossovers would result in a better performance of evolution.

Our evolutionary computation for program synthesis differs from genetic programming [9] or evolutionary programming [1]: we did not directly apply simulated evolution to programs, but our framework improves the search mechanism for deriving correct programs through evolution. We take this approach to take the best of both worlds: the correctness of resulting programs guaranteed by the deductive synthesis and its certification tool, and the search heuristics enhanced through evolutionary computation.

4 Evaluation

We conducted cross-validations to evaluate what improvements our evolutionary computation framework brought to SuSLik. We measured how many synthesis problems SuSLik failed to solve with in 2.5 s of timeout. For this evaluation, we used a consumer laptop running Ubuntu 20.04.3 LTS on a machine with 16 CPUs of AMD Ryzen 7 4,800H with Radeon Graphics and 15,854 MB of main memory.

As SuSLik is a new tool, we have only 65 problems available in our benchmark: problems from a preceding work on SuSLik [5] and new problems prepared for this project. These problems include tasks on various data-structures such as integers, singly linked lists, sorted lists, doubly linked lists, lists of lists, binary trees, and packed trees.

Firstly, we randomly split our benchmarks into two groups: the validation dataset and training dataset. Then, using the training dataset we apply our evolutionary computation described in Algorithm 1 to evolve SuSLik’s search strategy. As explained in Sect. 3, the output of our evolutionary computation is just one search strategy produced after 40 generations. However, in this experiment we conducted cross-validations using the best individual from the training set for each generation to see how our framework produces transferable improvement over generations.

To reduce the influence from a specific random split, we conducted this experiment four times, and the result of each experiment is illustrated from Fig. 2 to Fig. 5. In these figures, the horizontal axes represent the number of generations, while the vertical axes represent the number of synthesis problems SuSLik did not solve within the timeout.

These figures show that when adopting the general rule-weight mutation, our evolutionary framework managed to improve SuSLik’s capability to find solutions in validation sets, even though evolution is based on training sets. That is, somewhat contrarily to the prediction by Itzhaky introduced in Sect. 1, we found that there are strategies that tend to perform better for unforeseen problems, and we can find such strategies using evolutionary computation.

On the other hand, the order-only mutation and goal-specific rule-weight mutation resulted in less promising results. In particular, the goal-specific rule-weight mutation over-fitted to training data in Fig. 2 and Fig. 5, probably due to its capability to fine tune the strategy for our small dataset.

Fig. 2.
figure 2

Cross-validation 1

Fig. 3.
figure 3

Cross-validation 2

Fig. 4.
figure 4

Cross-validation 3

Fig. 5.
figure 5

Cross-validation 4

5 Discussion

The limited size of available dataset is the main challenge we faced in this project. This problem is partially unavoidable since program synthesis itself is still an emerging field in Computer Science. Other AI projects for interactive theorem provers take advantage of large existing proof corpora for training. For example, Nagashima built a tactic prediction tool, PaMpeR [16], for Isabelle/HOL by extracting 425,334 data points [13] from the Archive of Formal Proofs (AFP) [8]. Li et al. also mined the AFP and produced 820K training examples for conjecturing. For Coq, Yang et al. constructed a dataset containing 71K proofs from 123 projects [21], whereas Huang et al. [4] extracted a dataset consisting of 1,602 lemmas from the Feit-Thompson formalization. For HOL Light [3], The HOLStep [6] used 1,013,046 training examples and 196,030 testing examples extracted from 11,400 proofs, while the HOList project presented a benchmark based on 2,199 definitions and 29,462 theorems and lemmas. These projects managed to gather large data sets since their underlying theorem provers, Isabelle/HOL, Coq, and HOL Light, have a larger user base than SuSLik [19] does.

For the moment, our framework improves static parameters for SuSLik. That is, the resulting weights and rule orders are fixed for all intermediate synthesis problems. Our evaluation has shown that our static parameter optimisation (general rule-weight mutation) using evolutionary computation generalises well: a SuSLik instance that performs well for a training dataset tends to perform well for an evaluation dataset. We expected that we could achieve even better performance by producing dynamic parameters (goal-specific rule-weight) for SuSLik: functions that inspect a node at hand and decide on a promising rule order and weights for that node. Our efforts in this direction are, unfortunately, unsuccessful so far. We hope that a larger training dataset would allow for such optimisation in the future.

6 Related Work

Even though there was an attempt to use reinforcement learning [20] for a connection-style proof search [7]; we mindfully chose evolutionary computation over reinforcement learning: since we do not have a changing environment in our setting, it is unclear if we gain any benefits from having two metrics, reward function for the long term goal and value function for the short term benefit. Instead, we improved SuSLik’s default search strategy for randomly chosen fixed training problem sets and measured how the improvement generalizes to validation sets.

When implementing our framework for evolutionary computation, we took the advantage of a Python framework for evolutionary computation called DEAP [2], even though SuSLik itself is implemented in Scala.

Previously, we attempted to improve proof strategies [17] for Isabelle/HOL using evolutionary computing [11]. However, the focus of that project shifted to the prediction of induction arguments [14, 15] using meta-languages [10, 12].

Nawaz et al. used a genetic algorithm to evolve random proof sequences to target proofs. The drawback of their approach is that the fitness function used in the genetic algorithm relies on the existence of a proof for a given problem. Therefore, this framework is not applicable to open conjectures without existing proofs [18].