Keywords

1 Introduction

Genetic Programming has shown potential to solve a range of general program synthesis problems. In contrast to other problem domains like regression where an approximation of the solution might be acceptable, a partially correct solution is usually of no use in program synthesis. But for GP to be successful in program synthesis, the ability to find a correct solution should be high, as practitioners should not have to be required to run GP multiple times while researchers only do multiple runs for statistical tests. At the same time, it is essential that GP can solve a wide range of program synthesis problems rather than special cases.

To this end, a range of difficult or unsolved problems is identified in the general program synthesis benchmark suite [8], that has been used recently to test GP on program synthesis, especially with G3P [3] and PushGP [17]. While G3P was able to achieve a higher percentage of successful solutions found in cases it found solutions, PushGP was able to solve more problems at least once in general.

The focus of this paper lies in identifying differences between the function set of G3P and PushGP, extending the grammars according to those differences as well as the identified difficult problems from the benchmark suite and extending the grammars accordingly. At the same time, the grammars shall stay as general as possible to be able to use them outside of the context of benchmark problems and should not be trimmed to “cheat” on any particular problem within the benchmark suite. As the benchmark suite that has been used so far, proposes to have an explicit char data type which is currently missing in G3P [3] the possibility of adding it is further investigated. Therefore, the functionality available in the grammars is not allowed to be extended further than the function set available to PushGP.

The rest of the paper is structured in the following way. Section 2 summaries related work on program synthesis. Section 3 describes the benchmark suite used in the GP community for program synthesis and what problems have been difficult for GP and particularly for G3P. Afterwards, Sect. 4 describes in what ways grammars can be extended to overcome the previous shortcomings. The experimental setup used to tackle the benchmark suite is described in Sect. 5 and the results are compared to previous approaches in Sect. 6. Finally, conclusion and future work are discussed in Sect. 7.

2 Related Work

Program synthesis problems have been tackled even before GP was used and many different approaches exist [11]. Nevertheless, GP systems have proven to be very flexible and successful at doing this. Therefore this paper will focus on GP systems.

2.1 Grammar-Guided Genetic Programming

Grammar-Guided Genetic Programming [12] is a GP variant that uses grammars to define the search space. This makes it easy to use and flexible as a grammar can be defined outside of the GP system instead of restricting GP to a certain function set. Additionally, it is quite powerful, because any program that can be generated with the grammar can be found by GP. Grammars also provide the possibility of adding bias, if necessary. The most famous variants are CFG-GP by Whigham [19] and grammatical evolution [14].

Forstenlechner et al. [3] proposed a grammar design for GP to tackle general program synthesis problems, as mainly bespoken grammars have been used before to solve program synthesis [13], which can not be reused to solve other problems. The idea of the grammar design is to have multiple smaller grammars and every grammar contains only the functionality for a single data type. Additionally, one general grammar exists which contains the structure of the program. The benefit of this design is that it is not limited to a single programming language and depending on the problem at hand a subset of the data types required to solve the problem can be chosen. Therefore, the design is capable of solving general purpose program synthesis problems, while the search space can be kept small by not including unnecessary data types. Functions that require multiple data types of which some are not available, will be removed from the grammar automatically when combining the grammars for a chosen problem.

3 General Program Synthesis Benchmark Suite Remarks

A general program synthesis benchmark suite was introduced by Helmuth and Spector [8]. It provides a variety of problems from introductory computer science courses. It consists of a total of 29 problems with a description, training and test set, fitness function and general parameter settings, mainly for PushGP [17], for every problem. Additionally, every problem requires specific data types to be available to be solved. A more detailed description is available in form of a technical report [18], which also contains information about how to generate the training and test data as well as the instructions available for PushGP.

The two GP systems that have been tested on the benchmark suite are a G3P by Forstenlechner [3] and PushGP [17]. PushGP is a GP system that evolves programs in the language Push, which was solely designed for evolutionary algorithms. Push uses stacks to store data instead of using variables. It has a stack for every data type as well as for the code that is executed, which makes it possible to manipulate the code during runtime.

An additional comparison of systems outside of the GP community, namely Flash Fill [5] and MagicHaskeller [9], was done on the benchmark suite in [15]. The comparison showed that GP systems are more flexible and more successful on this benchmark suite, although it should be mentioned that these systems have been created with other use cases in mind like Flash Fill is used in Microsoft Excel for string manipulation tasks.

In the initial introduction of the grammar design for program synthesis problems [3], the functionality was kept to the basics of Python without including more than was available in PushGP. For example, adding the built-in sum function from Python would make solving the problem Vector Average fairly easy.

Table 1 shows the results achieved with G3P on the general program synthesis benchmark suite. The results have been taken from [3]. The datasets of Checksum and Vector Average have been changed since the benchmark suite has been introduced and a simpler version of Super Anagrams has been used in [3]. The table indicates that G3P with the current grammars has difficulty to solve problems that require char as a data type. At the moment it only uses string, most likely because the initial grammars are based on Python which treats char as string. While a programmer has no difficulty to understand how or when to use a single character string, it is definitely more complicated for GP to find out how or when to use it. Adding a char data type could yield better results. Additionally, PushGP was able to solve more problems from the benchmark suite, although in many cases with a low success rate. Nevertheless, adding further functionality could help improve the results of G3P.

Table 1. Results of G3P on the general program synthesis benchmark suite sorted by successfully found solutions. String and Char column indicate if these data types have to be used when solving the problem. A * indicates if the data set has been changed, since the results have been acquired.

4 Extending Program Synthesis Grammars

This section describes how the program synthesis grammars from [3] have been extended to include an additional char data type as well as additional functionality to have a fairer comparison to PushGP. Extending the grammar also means increasing the size of the search space as more programs can be generated from the grammar. Therefore, the extension of the grammars can also have a negative effect on the search performance.

4.1 Data Type Char

As shown in Sect. 3, G3P does poorly on problems that require a data type char. G3P only used string as it mainly relied on Python even though the concepts can be applied to other languages as well and because a char can be interpreted as a string of length one. As many problems in the general program synthesis benchmark suite require to check or manipulate single characters, G3P not using a char grammar could explain why it currently fails at solving such problems. While programmers have the intrinsic knowledge that a string consists of characters and a string of length one can be treated similar to a char, GP either has to discover this knowledge or has to be told a priori. The currently available grammar data types are bool, integer, float and string, as well as a list version grammar of each of these data types, plus the new char grammar. A list of char grammar is currently not included as the benchmark suite does not require it and strings can be viewed as a list of char. As G3P adds variables of the data types of every used grammar to the evolved program, including the char grammar makes it very likely that chars are used as opposed to before where G3P had to find that a string of length one is required.

4.2 Recursion

Recursion is a method of programming where a program calls itself to solve a smaller instance of the same problem first and uses that solution to solve the initial problem. Recursion is not an uncommon strategy to tackle problems in GP [1, 20]. In many cases, a recursive solution can be significantly shorter in terms of code than an iterative program, which might make it easier for GP to find. PushGP is capable of evolving recursive programs and for a fair comparison should be part of the grammars for G3P as well.

To allow recursion, a program needs to be able to call itself and a way to stop the recursion, usually an if condition called guard. As the grammars in G3P are automatically merged together depending on the required data types, and the number of input/output variables, as well as there types, a rule for a recursive call can be generated and added to the grammar. The following is an example where outputX is replaced with the correct type variable non-terminal (e.g. <bool_var>) and inputX with the correct type (e.g. <bool>):

figure a

In a similar way, a return statement can be generated:

figure b

The grammar used to define the control flow (structure.bnf) already contains if statements, but it is very likely that it might not be used and the program gets stuck in an infinite recursion and at some point will throw an error due to a stack overflow. A problem that occurs with infinite loops as well and was handled by adding a guard to avoid any additional iterations if a certain limit is reached. A similar guard is used to avoid infinite recursion. The benefit of using this mechanism is that evolved programs will not throw an error and return a value. Therefore, the program will be given a fitness value based on what it returns instead of a default worst case fitness due to an error.

4.3 List Operations

When the grammars for program synthesis were introduced grammars for lists of all data types were included but kept to the essential functionality. Items could be added at the end, inserted or replaced at a specific index or removed. Lists could be iterated, compared, checked if they are empty and their length could be determined as well as slicing of lists was possible. Any additional functionality the algorithm had to find. PushGP offers more functionality out of the box that can be used, which has been added to the grammars for G3P, like reversing a list, counting the occurrences of an item, replacing or removing items if a condition is met etc. All of this functionality could be discovered as well, but as for example O’Neill et al. [13] showed that GP has difficulties finding a solution to the integer sorting problem, but by adding a swap function the problem was easily solvable. As stated before no further functionality has been added, that was not already available for PushGP as well. At the same time, it should be noted that adding additional functionality also increases the search space, which can make it more difficult to find a correct solution. Even though the additional functionality can make it easier to solve one problem, it can make it more difficult to solve another. Therefore a decrease of successful solutions found on some problems is to be expected.

4.4 Additional Methods

Similar to the list operations in the previous section, additional methods were added to other data types that in general could have been discovered by G3P. One example that is also often not included for boolean problems is XOR, as it can be constructed with AND, OR and NOT and can make certain problems like multiplexer too easy [10]. To be able to have a better comparison between G3P and PushGP, such methods have been added as well. As there are too many to mention every single one of them, the reader is referred to the grammars themselves that are provided online [2] as well as [18]. Again, it should be noted that the extended grammars do not exceed the functionality that is provided by PushGP.

5 Experimental Setup

For the experiments, the extended grammars, which are described in the previous section, are used with the same G3P system as in [3], which is available online [2] including the extended grammars. The experiments are run on the problems from the general program synthesis benchmark suite [8]. The parameter settings are summarized in Table 2. The number of generations is set to 300Footnote 1. As soon as a successful solution is found, the run is stopped as GP cannot improve it anymore. Lexicase selection [6] is used, as it has shown to be the most successful selection operator with GP on program synthesis problems. Instead of using a single fitness value for selection, lexicase operates on the fitness values of every single training cases. It randomly selects a fitness case and selects the best individual based on that case. In case of a tie, lexicase selection continues with a subset of individuals that were in this tie and continue to select other training cases until a single individual is left or until no fitness case is left, in which case an individual is selected randomly.

Table 2. Experimental parameter settings

6 Results

First the overall performance of G3P with the extended grammars and also to PushGP. Afterwards, the effect of the extended grammars on the search is analysed in more detail.

6.1 Successful Solutions

Table 3 shows the solutions found for each problem with G3P with extended grammars for training and test with 100 runs. The results are compared to the previously achieved successful solutions of G3P from [3]. Of the eight problems that require a char data type and have not been solved with G3P before, three have been solved with the extended grammars, namely Pig Latin, Replace Space with Newline and Syllables. Pig Latin is one that has not been solved with PushGP either. Additionally, Mirror Image has been solved as well, probably due to the additional list operations, which was not solved with the G3P with previous grammars. Table 3 also includes the p-value for the Wilcoxon Rank sum test on best test fitness of the two grammar approaches and shows a significant difference for nearly all of the problems. This is not surprising as the grammar has a massive influence on the search, as a function set has on normal GP.

Table 3. Successful solutions found with G3P with extended grammars on training and test with 100 runs as well as increase and decrease to the previous grammars in brackets. The p-value shows if there is a significant difference in the best test performance between the two different grammars with 0.05 as level of significance. A significant difference is highlighted in bold. Finally, the results of PushGP on the benchmark suite from [8] and the difference to G3P with extended grammars in brackets are compared.

The results also show that due to the increased search space, which is caused by the additional functions added to the grammar, the number of successful solutions decreases for some problems. Three problems, Compare String Lengths, Even Squares and Vector Average, could not be solved anymore, but the success rate of the first two was rather small before as well. Especially, Compare String Lengths is highly overfit as 96 successful solutions were found on test, but none generalizes on test. This is a problem that occurs on multiple problem instances and has been noticed before [7].

Even though on the final experiments some problems, even those which require char as data type are still not solved, in preliminary experiments Checksum and Double Letters have been solved with G3P with extended grammars as well. Even then the success rate was rather small, but theoretically, it has been found that they can be solved with the extended grammars as well.

Finally, Table 3 shows the results of PushGP taken from [8] compared to G3P with extended grammars. According to [7], PushGP is able to solve Checksum after the original dataset has been changed. The comparison shows that both approaches have problems where one method is more capable to find solutions than the other, but there does not seem to be a clear advantage over one or the other. Some problems have been solved with PushGP that have currently not been solved with G3P, but again the success rates of these problems are very small, below 10, in most cases, which makes a comparison difficult. The low success rate is an issue that needs to be addressed by both approaches.

6.2 Char Analysis

The grammar for the char data type is used by 10 problems. The grammar contains a rule <char> with productions for char variables, char constants and all functions that return a char value. Therefore, checking the percentage of nodes in individuals shows if GP is making use of the additional data type. Figure 1 depicts this usage.

Fig. 1.
figure 1

Percentage of <char> nodes in individuals averaged over 100 runs over generations.

In the initial generation, the percentage of nodes being <char> is nearly identical for some problems, which is expected as these problems require the same data types, which means the grammars are nearly identical, except maybe input and output variables. Therefore the grammar has the same structure and the same number of possible nodes, which leads to this effect. The percentage of <char> nodes used may seem small being between 0.5% and 1.5%, but considering the number of productions available in the grammar, it is rather high. In case of almost all problems, the usage of <char> nodes is either constant or increases over time, after a few generations. The only problem that seems to slowly decrease the usage of <char> nodes is Digits. This can be explained by how G3P is tackling the problems. While PushGP prints every integer for Digits, G3P has to return a list of integers as it does not use print statements and therefore does not necessarily need a char data type.

For some problems, especially Replace Space with Newline, Syllables, Super Anagrams and Pig Latin, the lines are not as stable as for the other problems. The reason is that solutions that solve the problem at least for training have been found and runs are stopped as soon as this happens. Hence, the average percentage might drop or increase. In most cases, a sudden drop has been found, which shows that runs that use <char> nodes more often seem to be able to find a successful solution earlier. This indicates that the char grammar improves the search for successful solutions.

6.3 Recursion Analysis

The percentage of recursion used can be checked in a similar way as in the previous section for char. Figure 2 depicts the percentage of recursion nodes used over generations. The initial percentage is lower than with <char>, because there is only one recursion production rule in the grammar, whereas <char> is used by multiple functions. Afterwards, it drops even lower for all problems and is barely used overall. As explained in Sect. 4.2, to use recursion, a method needs to be able to call itself and a stopping criterion. At the moment the GP system can evolve a method to call itself, but at the same time has to evolve a stopping criterion, which seems to make it too complicated to be used. Without the stopping criterion, the evolved program runs into an infinite loop, which leads to a stack overflow or a timeout by the G3P system. A way to improve this might be to adapt the grammar that a stopping criterion is added to the same production rule as the recursion to always have both added at the same time. This could increase the chance to make G3P use recursion to solve problems.

Fig. 2.
figure 2

Percentage of recursion nodes in individuals averaged over 100 runs over generations.

7 Conclusion and Future Work

The difficulties of solving multiple problems of the general program synthesis benchmark suite with a grammar design approach [3] have been discussed. As some of this problems have been solved with another approach before, the functionality of the grammars has been extended in various ways to be closer to previous approaches, without “cheating” by adding functionality not used before. An important enhancement of the grammars is that an explicit char grammar has been added as many problems operate on single characters instead of strings. Programmers are able to identify such characteristics of a problem easily, while GP would have to discover such knowledge. As the benchmark suite proposes to use char as its own data type, this additional information does not give G3P an unfair advantage when comparing to other systems.

Afterwards, the extended grammars are used to tackle the program synthesis benchmark suite and the results are compared to the grammar design of [3]. The results show significant differences for nearly all problems and successful solutions have been found for previously unsolved problems with G3P. One problem, Pig Latin, has been successfully solved that was not solved by any other approach before. Additionally, a comparison with PushGP has been made, as the extended grammars are closer in functionality to PushGP as in [3].

Due to the increased search space created by the extended grammars, a decrease of successful solutions found on previously solved problems was expected. A way to dynamically adjust the functionality of grammars during runs could help avoid this problem [16]. Even the success rates of newly solved problems were rather low. This is a problem not only of G3P, but also of other approaches, and should be addressed in the future to make program synthesis with GP more usable outside of the research community as well. Current approaches include smarter operators [4] or post-run simplifications [7], but further research is required to increase success rates.