1 Introduction

The ultimate goal of this paper is to attain an advanced understanding of the issues involved with program initialisation in genetic programming (GP) through the analysis of different aspects of program initialisation. Although we present four different program initialisation algorithms with varying levels of success, our focus is primarily on testing theories rather than presenting algorithms for practical use.

The ideal initialisation of random programs in GP presents a unique challenge when contrasted with population initialisation for other evolutionary algorithms. Unlike other forms of evolutionary computation, GP relies on the execution (or interpreted execution) of programs in order to attain fitness values (although some work has included program structure as a factor [13]). In terms of creating random programs to seed a GP run, the fact that fitness is based on the execution of the program means that we should investigate a semantically diverse starting population, rather than one that is syntactically diverse. It seems reasonable that this would increase the search power of GP.

In order to test the theory that increased semantic diversity affects GP results, we present an entirely semantically driven initialisation (SDI) algorithm (built on ideas developed from our analysis of the traditional ramped half and half (RHH) technique [4]), which produces 100% effective code [5] at initialisation. We present not only the performance results generated when using this algorithm, but also the size and shape metrics of the programs produced by SDI and compare them to the RHH technique over a range of benchmark problems. We hybridise our SDI algorithm with the existing FULL [4] initialisation technique and present further results which, when compared to those of the RHH and SDI techniques, demonstrate that full semantic diversity is not the only major influence on program initialisation for GP.

Finally, in order to examine our theory that the shape of the program tree at initialisation will influence performance, we conduct an experiment designed to test whether different shapes of trees with the same semantics will influence the performance of GP using our semantic models.

Section 2 reviews related literature in the field. Section 3 presents the algorithms we use for program initialisation. In Sect. 4 we present our results, which include unique behaviour analysis, program bias analysis, program metrics analysis, GP performance results and evolvable shape experiments over a range of benchmark problems. In Sect. 5 we present a discussion of the results and in Sect. 6 we present conclusions. In Sect. 7 we discuss several avenues for future related work.

2 Review of related literature

In this section we review a number of issues concerned with program initialisation in the context of this work. We review existing issues and techniques within program initialisation, the concept of program diversity and issues relating to program structure (or tree shape).

2.1 Existing initialisation techniques

The standard method for population initialisation, which we analyse in this paper, is the RHH method. This was introduced in 1992 by Koza [4], who set out three methods for creating a diverse population: GROW, FULL and RHH. Koza chose to use the RHH technique for the majority of his experiments after conducting several experiments comparing FULL and GROW to the RHH because “...the ramped half-and-half method creates trees having a wide variety of sizes and shapes.” [4, p. 93]. Koza also recognized that whilst the RHH method created a variety of programs, there was the possibility that programs could be duplicated. Therefore, he added syntactic duplicate checking of the programs to ensure the syntactic diversity of the starting population.

A language bias (defined by Whigham as “bias is the set of all factors that influence the form of each program” [6]) is present when there is a bias in the choice of items from the function or terminal sets. In 1995, Whigham [6–8] analysed such a bias and its effects on grammatically based genetic programming. Whigham conducted experiments that explicitly added segment(s) of code to a program in the population. For example, in one experiment he biased if statements such that the condition could only be a specific terminal. This narrowed the exploration of the search space and increased the probability of finding an ideal solution by artificially including segments of a known perfect solution. This demonstrated that specific language bias could be beneficial but could also severely compromise the ability of GP to find a solution to a problem with the highest fitness.

In 1996, Iba [9] devised an approximately uniform tree generation method (RAND_TREE) to combat the idea of language bias. In parallel to Iba’s efforts, Bohm and Geyer-Schulz [10] independently devised a method called Exact Uniform initialisation based on statistical theory with the same objective in mind. Both of these papers report slight improvements in the success of GP runs compared to RHH. In these papers, however, only a small number of examples were studied, which left their results open to issues of problem-dependence (as noted by Bohm and Geyer-Schulz). This issue was addressed in 2001 when Luke and Panait [11] conducted a more comprehensive survey and comparison of population initialisation methods. This survey concluded that there was no significant statistical difference in performance between the RHH and the uniform initialisation methods.

In 2000, Langdon [12] developed ramped uniform initialisation as part of experiments to control code bloat through changing the bias in the distribution of syntax. The algorithm is similar to Iba’s and focuses upon a method of distributing syntax, rather than controlling the semantics of programs. Another potential approach to resolving this type of question would be to ask whether the RHH and uniform creation methods create a similar behavioural bias. This approach could also explain why a theoretically grounded uniform initialisation method cannot improve results when compared to an ad hoc method such as RHH.

Despite the risk of imposing bias in a starting population, it is still a requirement that the GP practitioner has some control over the size of the programs produced (measured in depth or length) in order to prevent excessively large programs being produced. Producing a starting population should take a relatively short amount of time allowing for the fact that GP runs need to be executed many times to provide some degree of consistency in the results. To address this issue, Chellapilla [13] devised an algorithm called RANDOMBRANCH which utilized a specified length rather than depth and produced approximately uniform programs. One problem with this algorithm (highlighted by Luke and Panait [11]) is that because the RANDOMBRANCH algorithm divides up the branch depths evenly, there are many trees that this initialisation method cannot produce. This would result in a language bias in the starting population; however, the effects on behavioural bias remain unstudied.

A later effort by Luke in 2000 [14] addressed the related issue of control over program initialisation. This resulted in two Probabilistic Tree Creation algorithms known as PTC1 and PTC2. These algorithms differed from those previously mentioned in that they allowed more user control. PTC1 allowed the user to provide the probability of appearance of individual functions as well as to define an average size of the initial programs. However, this method does not give the user any control of the variance of these programs. PTC2 addresses this issue by allowing the user to set a probability distribution of tree sizes which gives control over the variance in tree sizes. In comparison to the uniform based algorithms, PTC1 and PTC2 are simpler to implement and provide the user with much more control over the size and variation of programs in the initial population. In a similar way to Whigham’s work, PTC1 and PTC2 give the user the kind of bias that could focus the starting population more towards where a global optimum will occur. The danger with this is that if the wrong kind of bias is used, then the exploration could be steered away from the global optimum.

In 2007, Looks [15] proposed a heuristic to increase behavioural diversity during program initialisation, which was demonstrated to outperform the RHH technique in terms of both success rate and overall computational effort. This shows that behavioural diversity does have an important role to play in population creation. In this paper we present further analysis, using a different technique to that of Looks [15], in which we not only enumerate unique and duplicated programs, but we also examine in detail the types of behaviours frequently produced by the RHH initialisation method. In addition, we conduct a comparison of the size and shape of the programs generated by the SDI, HSDI and RHH techniques.

2.2 Program diversity

When considering diversity in starting populations, it is important to understand two distinct types of diversity. The first type is syntactic diversity, that is, programs in the population being syntactically different. Koza [4] argues that this is important both as a method of generating programs with different behaviours, and as a means of providing a pool of material from which programs can be evolved. The second type is behavioural diversity, that is, diversity of the input–output behaviour. It is easy to find examples of sets of programs that are all syntactically distinct, yet which have identical behaviours.

Whilst Looks [15] introduced semantic diversity to program initialisation, studies of semantic diversity are not new to GP. Gustafson et al. [16–18] conducted multiple analyses of behavioural diversity in GP. In 2004, Gustafson et al. [17] conducted an analysis comparing behavioural diversity measures with fitness. These behavioural measures were based on two edit distances and this analysis concluded that edit distance showed a strong correlation with fitness difference. Gustafson et al. [16] present three different methods for measuring the behaviours of the programs they study. However, the authors mention that even these mechanisms do not provide an exhaustive presentation of the behaviour of the programs. One of the limitations of Gustafson’s work [16, 17] was that a behaviourally canonical representation was not used to check for isomorphism; the use of ROBDDs (Reduced Ordered Binary Decision Diagrams) and the ant behaviour reduction algorithm in our work gives us that very ability.

In earlier works, Poli and Langdon [19] and O’Reilly and Oppacher [20] discuss program diversity through analysing how different crossover and search operators perform on different problems, with varying results. In this paper, we analyse these issues using novel semantic analysis methods.

2.3 Program structure

Previous work on the analysis of changes to program structure has demonstrated that tree shape does have an influence on the success or failure of GP. Punch et al. [1] and Gustafson et al. [3] present artificial problem domains in the form of royal trees and the tree-string problem. These artificial problem domains are tunable and designed to evaluate the relationship between program structure and the ability of GP search to navigate through the search space.

Whilst Langdon et al. [21] provide evidence that programs evolve towards particular shapes, Daida et al. [2, 22] suggest not only that program structure has a role to play in evolution, but that predictions can be made as to which shapes are the most evolvable. Our analysis (in Sect. 4) tests two extremes of different program shapes at the point of program initialisation. By introducing this kind of structure analysis at initialisation, we can separate the effects of shape at initialisation from those effects which are due to evolution.

A further well documented aspect of program structure is the intron phenomenon [5, 23, 24]. Introns form either unreachable or redundant code within program trees. There is some disagreement as to the link with program bloat: some authors [25, 26] suggest a positive link between introns and code bloat, whilst others [23] disagree. In a structural sense, authors have suggested that introns are required [27] in order to protect areas of valuable fitness within a program tree. Additionally, authors have provided evidence that fitness neutral evolution is valuable to GP [28]. In relation to this work, we present algorithms capable of producing starting populations which are intron free, and we examine these effects in our discussion.

3 Methods and algorithms

3.1 A general framework

The aim of our work is to demonstrate the effect of moving the initialisation of programs in GP away from random combinations of syntax and towards semantic building blocks of code. This goal is harder to implement than randomly combining syntax. Furthermore, the semantic initialisation process needs to be comparable in execution speed to the traditional ramped half and half initialisation.

The major issues within SDI are increasing semantic variety and producing fully effective starting programs. We present a behavioural analysis in Sect. 4; however, we first set out the abstraction techniques and algorithms we use to enable semantically driven initialisation.

In order to control semantics we need to have a representation of the behaviour of programs. These behavioural representations are dependent on specific problem domains. In our experiments we use six different benchmark problems from the Boolean domain and one from the artificial ant domain.

The first step in the framework is to seed behaviours in the abstract representation. Once this has taken place, we then combine the behavioural representation at the root rather than changing other areas of the representation. This gives us the ability to build more complex behaviours quickly. New behaviours are added to the population if and only if they are unique. Once we have attained the new behaviours, we translate them into syntax as specified for the GP problem.

3.2 The problem domains

In our experiments we use seven test problems, which are described in this section.

The objective of the 6 bit multiplexer problem is to interpret two control bits {A0, A1} as a binary number and choose the correct output bit {D0, D1, D2, D3} based on the number. The fitness is the number of correct choices over all possible 64 combinations of inputs for the 6 Boolean bits. The function set is {IF, AND, OR, NOT} and the terminal set is {A0, A1, D0, D1, D2, D3}. The 11 bit multiplexer is similar to the 6 bit multiplexer. However, there are three control bits which, represented as a binary number, can select one of 8 outputs. The 11 bit multiplexer is substantially more complex compared to the 6 bit multiplexer as the size of the search space increases from \(2^{2^{6}}\) (6 bit multiplexer) to \(2^{2^{11}}\) (11 bit multiplexer). The function set of the 11 bit multiplexer is the same as the 6 bit multiplexer and the terminals are {A0, A1, A2, D0, D1, D2, D3, D4, D5, D6, D7}.
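The fitness computation described above can be sketched in Python (our own minimal illustration, not the implementation used in the paper; the candidate is assumed to be a callable over a dictionary of terminal values, and treating A1 as the high-order control bit is our assumption):

```python
from itertools import product

TERMINALS = ["A0", "A1", "D0", "D1", "D2", "D3"]

def mux6_fitness(program):
    """Score a candidate over all 64 input combinations; the target
    output is the data bit selected by the two control bits (we take
    A1 as the high-order bit, which is an assumption)."""
    correct = 0
    for bits in product([False, True], repeat=6):
        env = dict(zip(TERMINALS, bits))
        target = env["D%d" % ((env["A1"] << 1) | env["A0"])]
        if program(env) == target:
            correct += 1
    return correct          # a perfect program scores 64

# a hand-written perfect solution, as a Python callable
perfect = lambda e: e["D%d" % ((e["A1"] << 1) | e["A0"])]
```

A program that ignores the control bits (for example, one that always returns D0) scores well below the maximum, which illustrates why all 64 fitness cases are needed.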

The objective of the even 4 parity problem is to return true if and only if an even number of the inputs are true. The function set is the same as for the multiplexers and the terminal set is {D0, D1, D2, D3}. The 7 parity problem is an extension of the 4 parity problem with the same objective and function set. The terminal set is {D0, D1, D2, D3, D4, D5, D6}.

The objective of the 5 majority problem is to return true if and only if the majority of the inputs are true. The function set is the same as the multiplexers and the terminal set is {D0, D1, D2, D3, D4}. The 9 majority problem is an extension of the 5 majority experiment with the same function set and the terminal set {D0, D1, D2, D3, D4, D5, D6, D7, D8}.

The artificial ant domain models an ant operating over a trail of food pellets on a grid. The ant must collect all the food pellets in order to achieve full score. We use the benchmark santa fe trail [29] which represents 89 food pellets in a broken trail on a 32 × 32 toroidal grid. The function set for the ant problem is {IF-FOOD-AHEAD, PROGN2, PROGN3} and the terminal set is {MOVE, TURN-LEFT, TURN-RIGHT}. The function IF-FOOD-AHEAD is an if-then-else structure with the condition representing whether the ant has a food pellet in the grid square directly in front of it. PROGN2 and PROGN3 execute the instructions they hold in sequence. The only difference between them is that PROGN2 has an arity of two and PROGN3 has an arity of three.
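The ant semantics described above can be sketched with a minimal interpreter (an illustrative Python sketch, not the implementation used in the experiments; programs are assumed to be nested tuples such as ("PROGN2", ("MOVE",), ("TURN-LEFT",)), and the 600-time-step accounting is omitted):

```python
# headings map to (dx, dy); the grid is 32 x 32 and toroidal
DIRS = {"N": (0, -1), "E": (1, 0), "S": (0, 1), "W": (-1, 0)}
RIGHT = {"N": "E", "E": "S", "S": "W", "W": "N"}
LEFT = {v: k for k, v in RIGHT.items()}

class Ant:
    def __init__(self, food):
        self.x = self.y = 0
        self.heading = "E"
        self.food = set(food)       # remaining pellet coordinates
        self.eaten = 0

    def ahead(self):
        dx, dy = DIRS[self.heading]
        return ((self.x + dx) % 32, (self.y + dy) % 32)

    def run(self, node):
        op = node[0]
        if op == "MOVE":
            self.x, self.y = self.ahead()
            if (self.x, self.y) in self.food:
                self.food.remove((self.x, self.y))
                self.eaten += 1
        elif op == "TURN-LEFT":
            self.heading = LEFT[self.heading]
        elif op == "TURN-RIGHT":
            self.heading = RIGHT[self.heading]
        elif op == "IF-FOOD-AHEAD":
            self.run(node[1] if self.ahead() in self.food else node[2])
        else:                       # PROGN2 / PROGN3: run children in order
            for child in node[1:]:
                self.run(child)
```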

3.3 Boolean domains

To enable us to analyse semantic characteristics of Boolean programs we use a Java implementation of GP [30], linked to the Colorado University Decision Diagram Package (CUDD—[31]) using the JavaBDD [32] interface.

The important functionality that this provides is the ability to reduce program representation by removing redundant and unreachable arguments. We can obtain canonical representations known as ROBDDs [33] of the behaviour of the Boolean programs: this allows us to compare programs for semantic equivalence. Any two programs that reduce to the same ROBDD are semantically equivalent, and vice versa.

An ROBDD is a node tree where each node represents a Boolean decision variable. These nodes are linked by true and false branches to either other nodes or the final output of the diagram (true or false). Bryant [33] describes a method (and algorithm) to reduce the binary trees to a canonical form. An example of an ROBDD can be found in Fig. 1.

Fig. 1

This example ROBDD is a canonical representation of behaviour. In the diagram, circles represent variables (terminals in the GP context); solid arrows represent true paths; dotted arrows represent false paths. The squares marked 1 and 0 represent output of true and false respectively. This behaviour could be represented by many different parse trees. Two examples of parse trees that would result in this behaviour are IF A0 D0 D1 and IF (NOT A0) D1 D0
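For readers without access to a BDD package, the equivalence test that the ROBDD provides can be imitated for small terminal sets with another canonical form: the complete truth table, packed into an integer. The following Python sketch is our illustration only (the nested-tuple tree encoding is an assumption); it shows that the two parse trees from the caption share one behaviour:

```python
from itertools import product

TERMINALS = ["A0", "D0", "D1"]   # the variables used in Fig. 1

def evaluate(tree, env):
    """Evaluate a parse tree given as nested tuples, e.g. ("IF", c, t, f)."""
    if isinstance(tree, str):
        return env[tree]
    op = tree[0]
    if op == "NOT":
        return not evaluate(tree[1], env)
    if op == "AND":
        return evaluate(tree[1], env) and evaluate(tree[2], env)
    if op == "OR":
        return evaluate(tree[1], env) or evaluate(tree[2], env)
    if op == "IF":
        return evaluate(tree[2] if evaluate(tree[1], env) else tree[3], env)
    raise ValueError(op)

def signature(tree):
    """Pack the output for every input combination into one integer:
    signatures are equal exactly when programs are semantically equivalent."""
    sig = 0
    for i, bits in enumerate(product([False, True], repeat=len(TERMINALS))):
        if evaluate(tree, dict(zip(TERMINALS, bits))):
            sig |= 1 << i
    return sig

# the two parse trees from the caption share one behaviour
t1 = ("IF", "A0", "D0", "D1")
t2 = ("IF", ("NOT", "A0"), "D1", "D0")
```

Unlike a truth table, an ROBDD stays compact for many functions of practical interest, which is why the paper uses BDDs rather than exhaustive tables.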

Two important measurements we use in the analysis of ROBDDs are SatCount and NodeCount. SatCount is a value between 0 and 1 that represents the number of input combinations resolving to true in the ROBDD, divided by the total number of input combinations possible. NodeCount will return the number of variables present in the ROBDD (in the GP context the number of terminals). This function can be used in conjunction with SatCount to classify behaviours. For example, a SatCount of 0.25 with a NodeCount of 2 would represent a function such as AND A0 A1, given that there are four input combinations in total of which only one results in true. In a second example, a SatCount of 0.75 with a NodeCount of 2 would indicate a function such as OR A0 A1.

A tautology is a program which produces the output true regardless of input; in this case the value of SatCount is 1. A contradiction is a program which produces the output false regardless of input; for a contradiction, the value of SatCount is 0. Unlike in program parse trees, functions are not represented explicitly as nodes in an ROBDD: they are encoded by the true and false links between the variables. Due to the reduction mechanism [33], it is possible to reduce some ROBDDs to just true or false (i.e. a tautology or contradiction). If one considers the example AND A1 (NOT A1), this program will always result in false and its ROBDD will reduce to false. In the GP context this is very undesirable because it indicates that the result is not dependent on any of the variables and always returns the same answer (true or false).

If we consider the tautology and contradiction, the NodeCount will be 0. If the NodeCount is 1 and the SatCount is 0.5 we know the ROBDD of the program parse tree will reduce to just one variable (terminal) and possibly a function such as NOT.
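The SatCount and NodeCount classifications above can be checked with a small truth-table sketch (illustrative Python only; we use the paper's reading of NodeCount as the number of variables the behaviour actually depends on, approximated here by a sensitivity test rather than a real BDD):

```python
from itertools import product

def table(fn, names):
    """Truth table of fn over all assignments to the named inputs."""
    return [fn(dict(zip(names, bits)))
            for bits in product([False, True], repeat=len(names))]

def sat_count(fn, names):
    """Fraction of input combinations resolving to true."""
    t = table(fn, names)
    return sum(t) / len(t)

def depends_on(fn, names, var):
    """True if flipping var ever changes the output, i.e. var would
    appear as a decision variable in the reduced diagram."""
    for bits in product([False, True], repeat=len(names)):
        env = dict(zip(names, bits))
        if fn(env) != fn(dict(env, **{var: not env[var]})):
            return True
    return False

def node_count(fn, names):
    return sum(depends_on(fn, names, v) for v in names)

names = ["A0", "A1"]
AND2 = lambda e: e["A0"] and e["A1"]        # SatCount 0.25, NodeCount 2
OR2 = lambda e: e["A0"] or e["A1"]          # SatCount 0.75, NodeCount 2
TAUT = lambda e: e["A0"] or not e["A0"]     # SatCount 1, NodeCount 0
```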

3.3.1 SDI for Boolean domains

The SDI algorithm is designed to create a population consisting of semantically distinct programs, by using the ROBDD method described above as a way of checking whether newly generated programs are semantically equivalent to programs that have already been placed in the starting population. Pseudo code for the algorithm is presented below:

Phase 1:

for each Terminal in List Of Terminals

   create ROBDD representation of Terminal

   add to ROBDD_Store

end for each

Phase 2:

while ROBDD_Store size < population size

   choose a random function from the function set

   choose 1, 2 or 3 (dependent on function) ROBDDs from the ROBDD_Store at random (uniform probability)

   apply the function to the ROBDDs at tree root

   if resulting ROBDD is a new behaviour and not a tautology or contradiction

      add resulting ROBDD to ROBDD_Store

   end if

end while

Phase 3:

for each ROBDD in ROBDD_Store

   translate ROBDD to program

   save program

end for each

In phase 1 we represent the terminals as ROBDDs and add them to the ROBDD_Store. This is required as we need at least one example of each input variable to be present as building blocks for phase 2. Without any ROBDDs in the ROBDD_Store from phase 1, phase 2 would fail as it would have nothing to combine using the functions. Phase 2 starts to combine the terminals (or single variable ROBDDs) using functions: when a semantically unique behaviour is produced, it is saved in the ROBDD_Store. As phase 2 continues, the algorithm is able to take advantage of all of the representations held in the ROBDD_Store and this encourages more complex behaviour to be generated as the algorithm continues. Phase 3 translates the ROBDDs back to Boolean parse trees. One other important factor is that this algorithm will not produce tautologies or contradictions.
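The three phases can be sketched in Python for a small Boolean terminal set, with packed truth tables standing in for ROBDDs as the behaviour store (our illustration only: a parse tree is kept alongside each signature, so phase 3 becomes trivial here, unlike the paper's real translation mechanism):

```python
import random
from itertools import product

TERMINALS = ["A0", "A1", "D0"]
ROWS = 2 ** len(TERMINALS)          # rows in the truth table
FULL_SIG = (1 << ROWS) - 1          # signature of the tautology
ARITY = {"NOT": 1, "AND": 2, "OR": 2, "IF": 3}

def terminal_sig(index):
    """Phase 1: the truth-table signature of a single terminal."""
    sig = 0
    for row, bits in enumerate(product([False, True], repeat=len(TERMINALS))):
        if bits[index]:
            sig |= 1 << row
    return sig

def sdi(pop_size, rng=None):
    rng = rng or random.Random(1)
    # Phase 1: seed the store with every terminal's behaviour.
    store = {terminal_sig(i): name for i, name in enumerate(TERMINALS)}
    # Phase 2: combine stored behaviours at the root until full.
    while len(store) < pop_size:
        op = rng.choice(list(ARITY))
        args = [rng.choice(list(store)) for _ in range(ARITY[op])]
        if op == "NOT":
            sig = FULL_SIG ^ args[0]
        elif op == "AND":
            sig = args[0] & args[1]
        elif op == "OR":
            sig = args[0] | args[1]
        else:                       # IF c t f == (c AND t) OR ((NOT c) AND f)
            c, t, f = args
            sig = (c & t) | ((FULL_SIG ^ c) & f)
        # keep only new behaviours, never a tautology or contradiction
        if sig not in store and sig not in (0, FULL_SIG):
            store[sig] = (op,) + tuple(store[a] for a in args)
    # Phase 3: the stored parse trees form the starting population.
    return list(store.values())
```

Because new behaviours are combined at the root, each accepted program is 100% effective by construction: every signature in the store is distinct, so no subtree is semantically redundant with the whole.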

The behavioural model works in the scenarios we present; however, it is not without limitations. Bryant [33] noted that large graphs (in excess of 100,000 vertices) were possible once 10 variables had been exceeded (in the GP context, terminals). This limitation is manageable with the combination of Bryant’s reduction mechanism and increased power in modern PCs. As such, the generation of a starting population for the 11 bit multiplexer took a matter of seconds. It should be noted that the barriers (with more than 11 terminals) we faced were not due to the creation of ROBDDs, but to the translation mechanism which transforms the behaviours into syntax.

Close observers of the SDI algorithm will notice that there is no syntactic mechanism that guarantees termination. Termination in this case occurs because the behavioural search space is larger than the number of programs being initialised. In GP, it would be very unlikely that a practitioner would be trying to evolve a solution to a problem small enough that every possible behaviour could be generated in a reasonable amount of time; doing so would defeat the point of using GP at all. Theoretically, for small problems (such as a 3 bit multiplexer), if the SDI was asked to generate a population of more than 254 behaviours (256 in total minus the tautology and contradiction), the algorithm would not terminate.

3.4 Ant domain

In order to represent the behaviour of ants, we consider a behavioural model as the sequence of moves and orientations that represents the path the ant travels during a single execution of the ant control program (the GP candidate solution). When the artificial ant is simulated in GP, we execute the candidate solution repeatedly until the ant has used a set number of time steps (600 in this case) [29]; in this model, however, we are only interested in a single execution of the ant control code. In addition, we execute the ant code on a toroidal grid (32 × 32) that contains no food pellets, and we calculate the path of both the true and false branches of the IF-FOOD-AHEAD (if-then-else) function.

An example program in this domain is as follows:

PROGN2 (PROGN3 (MOVE, IF-FOOD-AHEAD (PROGN2 (MOVE, TURN-RIGHT), MOVE), MOVE), TURN-LEFT)

An example of the behavioural representation, equivalent to the above program, is as follows:

$$ \hbox{Ant representation} = \langle \hbox{M}, \langle \hbox{M}, \hbox{S} \rangle, \langle \hbox{M} \rangle, \hbox{M}, \hbox{N} \rangle $$

The character M represents one move and the characters N, S, E, W represent the orientations north, south, east and west respectively. The subsequences within the set indicate when a branch of an IF-FOOD-AHEAD statement is being accessed, and the symbols within those brackets indicate the path travelled during each branch of the condition. Because we are only concerned with modelling the shape of the trail (consider the picture of the trail as we look down on the grid), it is unimportant whether the branches of an if-block end with different orientations for the trail to continue upon. Therefore, at the end of the conditions we reset the current orientation to the orientation before the ant entered the if branches. The reason for this reset is that we are only interested in the relative meaning for modelling change of position (or the picture of the trail).

More formally, we can describe the structure of a representation in Backus–Naur Form:

$$ rep ::= \langle expr \rangle $$
(1)
$$ expr ::= M \mid N \mid S \mid E \mid W \mid bracketExpr \mid expr , expr $$
(2)
$$ bracketExpr ::= \langle expr , expr \rangle $$
(3)

In addition to this model structure, we add three checks which condense the abstract representation. The first check removes duplicate sub-branches of the same if statement and incorporates their path into the fixed path the ant was on before the if statement. The second check searches for sequences of orientations and reduces them to the last orientation in the sequence; this has the effect of removing redundant turns from the ant abstract model. The final check moves through the representation, remembering the current orientation and removing any duplicate turn to the current orientation. This serves to remove redundant turn instructions.
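The second and third checks can be sketched for a flat sequence of moves (M) and orientations (N, S, E, W). This is an illustrative Python sketch only: nested if-branches and the first check are omitted, and the initial heading of east is an assumption:

```python
DIRS = {"N", "S", "E", "W"}

def collapse_runs(seq):
    """Second check: a run of consecutive orientations reduces to
    the last orientation in the run."""
    out = []
    for sym in seq:
        if sym in DIRS and out and out[-1] in DIRS:
            out[-1] = sym           # overwrite the earlier, redundant turn
        else:
            out.append(sym)
    return out

def drop_redundant_turns(seq, heading="E"):
    """Third check: remove any turn that re-selects the heading the
    ant already has; otherwise update the remembered heading."""
    out = []
    for sym in seq:
        if sym in DIRS:
            if sym == heading:
                continue
            heading = sym
        out.append(sym)
    return out

# N,S collapses to S; the E,E run collapses to one E (kept: heading is S)
path = ["M", "N", "S", "M", "E", "E", "M"]
reduced = drop_redundant_turns(collapse_runs(path))
```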

There are underlying differences between the Boolean domain, which is finite, and the ant domain, which is toroidal, and therefore potentially infinite. When a domain is infinite, it is necessary to constrain the size. We have done this by constraining the behaviour, whereas in previous work [4] this has been done by constraining the syntax. Furthermore, whilst programs in the Boolean domain would be run only once, in the ant domain the program is executed repeatedly up to a limit of 600 time steps. As a result of both the toroidal nature and the repeated executions, a behavioural size limit of 10 moves (chosen as an arbitrary reasonable value) has been applied to enforce a syntactic size limit. This limit was set so that if the function PROGN3 is used, it allows enough moves to traverse the grid in one execution.

3.4.1 SDI for ant domain

Pseudo code for the algorithm is presented below.

Phase 1:

Generate the 4 core behaviours (see discussion below)

Add core behaviours to Behaviour_Store

Phase 2:

while Behaviour_Store size < population_size

   choose a random function from the function set

   choose 2 or 3 (dependent on function) Behaviours from the Behaviour_Store at random (uniform probability)a

   apply the function to the Behaviour_Store

   if resulting Behaviour is a new behaviour and has at least one move

      add resulting Behaviour to Behaviour_Store

   end if

end while

Phase 3:

for each Behaviour in Behaviour_Store

   translate Behaviour to program

   save program

end for each

  a The algorithm will only choose programs with 10 moves or fewer

The core behaviours of phase 1 are the most basic operations the ant can perform: a representation of one move in each of the four orientations. As with the Boolean domain, phase 1 is needed to seed phase 2, and it provides the ant with at least one instance of each possible move. Phase 2 builds more complex programs from these building blocks; as more behaviours are generated and saved, they can in turn be used by the algorithm to build yet more complex behaviour. Phase 3 translates the behaviours back to a syntactic representation for the GP. This algorithm will not produce behaviours that contain no moves.

As with the Boolean domain, there is no syntax to enforce termination. In this case termination again depends on the behavioural search space being larger than the starting population.

3.5 Hybridised SDI

In addition to the SDI algorithm, we developed a hybridised version of the algorithm (HSDI). At one end of the scale we have the existing RHH algorithm which we will show produces repeated simplistic behaviours. At the other end of the scale we have the SDI which can produce complex behaviours with no regard for program size. A hybridised version of the algorithm would combine behaviours both in the simplistic and complex areas of the search space, aiding a wider search.

In the hybridised version of the SDI we alter phase 1 of the initialisation algorithm. Instead of using terminals or core behaviours as the initial seed to build on, we use the existing FULL [4] algorithm to create the first round of behaviours. We use FULL because of the increased semantic diversity it provides (shown in Fig. 3) and because, as we convert the programs into behaviours, we are not concerned with the shape of the trees produced at this stage. Upon creation of each FULL program we store its behaviour if and only if it is unique and is not a tautology or contradiction (or, in the ant domain, contains no moves). Section 4.2 presents results demonstrating the level of unique behaviours within different starting populations using the RHH and FULL techniques. Once this seed is complete, the HSDI algorithm continues in the same way as the SDI algorithm and combines behaviours at the root, thereby encouraging more complex behaviour. In phase 3, the behaviours are translated back to 100% effective syntax.

In the hybridised version for the artificial ant problem, phase 3 can become problematic because the abstract representation can reduce a branch of an IF-FOOD-AHEAD statement to nothing (for example, if it contained a turn-left followed by a turn-right instruction). We therefore made an addition to the ant syntax, a SKIP operation, which can occur only during back-translation. This has no effect on the ant apart from costing it one move. The move cost is required because some IF structures result in the ant always falling into the SKIP function when executed.

3.6 Evolvable shape analysis algorithm

In order to evaluate how the shape of programs in starting populations affects evolution during GP runs we present a separate set of experiments. In these experiments, we compare how starting populations made up of FULL node trees with all branches reaching the same maximum depth perform against deeper thinner node trees created from our behavioural representations.

In order to analyse the effects of different tree shapes at the point of program initialisation, we developed two experimental algorithms. These algorithms are intended as analysis tools only and are not designed for general use. The first is a modified version of the FULL algorithm, in which tautologies and contradictions are prevented and the maximum depth is 4. This allows us to analyse the performance of fully-branched (“fat”) trees. The algorithm for this modified version of FULL (MODFULL) is as follows:

while first_Gen size < pop_Size {

   generate FULL_Program

   generate FULL_Program_Representation

   if FULL_Program_Representation is NOT a tautology or contradiction

     {

         add FULL_Program to first_Gen

     }

}

The second algorithm is designed to create trees in which programs are represented using minimalistic (“thin”) representations. This is achieved by creating programs using the MODFULL algorithm and then applying a “washing” process that removes redundant and unreachable code. The washing process works by translating the code into a canonical representation, then translating that back to the tree representation. Pseudo-code for this washing step is as follows; the complete WASHED algorithm consists of taking the outputs of the MODFULL loop above and then applying this process.

for each program in first_Gen {
   translate program to representation
   back-translate representation to reduced_program
   replace program with reduced_program
}
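The washing step can be sketched as follows. This is only a local-rule approximation of the full round trip through the canonical representation: it removes double negation and identical-operand redundancy, using the same hypothetical tuple encoding as above:

```python
def wash(tree):
    # Simplified, bottom-up wash. The paper's WASHED algorithm achieves
    # full reduction via a round trip through the canonical behavioural
    # representation; these local rules merely illustrate the effect.
    op = tree[0]
    if op == 'VAR':
        return tree
    if op == 'NOT':
        inner = wash(tree[1])
        # NOT(NOT(x)) -> x
        return inner[1] if inner[0] == 'NOT' else ('NOT', inner)
    left, right = wash(tree[1]), wash(tree[2])
    # AND(x, x) -> x and OR(x, x) -> x
    if left == right:
        return left
    return (op, left, right)

def washed(first_gen):
    # Replace every program with its reduced form, as in the pseudo-code.
    return [wash(program) for program in first_gen]
```

Because each rule preserves the truth table of the subtree it rewrites, the washed program retains the semantics of the original while taking a thinner shape.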

4 Results

In this section we present results for six sets of experiments, all of which analyse some characteristic of the GP run by comparing different initialisation methods. These experiments are: a speed comparison, analysis of unique behaviours, bias analysis, size and shape analysis, measurement of overall GP performance, and analysis of evolvable shape.

4.1 Speed comparison of initialisation methods

To address initial fears that these new algorithms might require impractical amounts of computation time, we present a comparison of initialisation speed, showing the time taken to initialise a starting population using the SDI, HSDI and traditional RHH techniques.

Whilst Table 1 shows that the RHH technique takes less time to generate programs, this experiment gives no indication of the comparative quality of the resulting programs. Considering that the results are quoted in milliseconds, it is reasonable to use semantically driven initialisation in order to attain a semantically diverse starting population, as even the slowest initialisation takes only 6 s.

Table 1 We initialise 100 populations of size 500 using SDI, HSDI and the RHH techniques

A second aspect of this experiment that is not directly comparable is that RHH has a built-in depth limit and so cannot build programs beyond that depth. Because of these depth limits, we know that the average depth of programs will fall in the range of two to six, limiting the size of the programs generated. The SDI algorithm places no limit on the size of the programs (see Footnote 1); it simply represents the behaviour in the form of effective code, so it is reasonable to expect the process to take longer. Arguably, the SDI has to do more work in terms of code generation, and the size and shape results in Sect. 4.4 support this. Overall, we do not consider the difference in generation speed to be a significant factor in choosing between these methods.

4.2 Behaviour in starting populations

4.2.1 Analysis of unique behaviours

We use the behavioural representations to analyse GP starting populations. Given a starting population, we convert each member into behavioural model form and count the number of unique behaviours by testing for model equivalence. In addition, we record the number of programs associated with each behaviour in order to analyse any bias towards specific behaviours. In these experiments we initialise 100 populations at each population size, and all results reported are averages over these 100 initialisations.
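As a sketch of this procedure, assuming Boolean programs encoded as nested tuples (a hypothetical encoding, not the paper's), model equivalence can be tested by comparing truth-table fingerprints:

```python
from itertools import product

# Hypothetical tree encoding: terminals ('VAR', i); functions
# ('NOT', x), ('AND', l, r), ('OR', l, r).
def evaluate(tree, bits):
    op = tree[0]
    if op == 'VAR':
        return bits[tree[1]]
    if op == 'NOT':
        return 1 - evaluate(tree[1], bits)
    a, b = evaluate(tree[1], bits), evaluate(tree[2], bits)
    return a & b if op == 'AND' else a | b

def behaviour(tree, n_vars):
    # The full truth table is a canonical fingerprint of the program's
    # semantics: two programs are behaviourally equivalent exactly when
    # their fingerprints match.
    return tuple(evaluate(tree, bits)
                 for bits in product((0, 1), repeat=n_vars))

def unique_behaviours(population, n_vars):
    return len({behaviour(p, n_vars) for p in population})
```

Placing the fingerprints in a set makes the equivalence test implicit: syntactically different programs with identical truth tables collapse to one entry.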

4.2.2 Unique behaviours

Figure 2 shows that in every model there is a notable percentage of duplicated behaviours. The even 4 parity model represents the worst performing result in terms of the number of unique behaviours: once the population size increases past 2500, fewer than 40% of the programs produced by RHH are unique. The 11 bit multiplexer demonstrated the least duplication of behaviours, with a maximum of 15%. One feature common to all models is that the percentage of unique programs decreases as the population size increases, although the rate of decrease varies between models. This could indicate a bias such that some behaviours are favoured and repeatedly produced by the RHH. The final feature of note is that as the number of terminals increases in the Boolean domain, there is less duplication of behaviour.

Fig. 2
figure 2

Enumeration of unique behaviours present in starting populations expressed as a percentage of the total population size. This analysis is repeated for population sizes ranging from 500 to 5000 for each experiment. All results quoted are an average of 100 initialisations

Figure 3 shows a sample of two of our test problems, comparing the semantic diversity obtained when using the FULL (depth 4) and RHH techniques. In both cases the FULL technique generated more semantic diversity, and paired T-tests revealed that this difference is significant at the 95% and 99% confidence levels (P-value of 0.000 for majority 5 and P-value of 0.001 for the 6 bit multiplexer). This is an interesting result, as previous authors [4] have reported that FULL does not perform as well as RHH in terms of GP performance; it lends weight to the theory that there is an ideal evolvable shape of program tree. In practical terms, if we require a semantically diverse seeding mechanism, then the shape of the programs is not important because they will be modelled as behaviour; this makes FULL an ideal method to seed the HSDI.

Fig. 3
figure 3

The two graphs show the percentage of semantically unique programs generated by the FULL (depth 4) and RHH techniques against population size for the majority 5 and 6 bit multiplexer problems

4.3 Bias analysis

In order to examine bias more precisely, we run a second experiment which initialises 100 populations of size 1000. We record every behaviour produced by the 100,000 initialised programs and if behaviours are produced multiple times, we keep a record of the frequency.
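This bookkeeping can be sketched with a frequency counter over behaviour fingerprints (the names and the per-initialisation averaging are our own illustrative choices, mirroring the frequency columns in Tables 2 and 3):

```python
from collections import Counter

def bias_table(initialisations, top_k=10):
    # `initialisations` is a list of populations, each already converted
    # to behaviour fingerprints. Pool every fingerprint, then report the
    # top_k behaviours ranked by mean occurrences per initialisation.
    counts = Counter()
    for population in initialisations:
        counts.update(population)
    n = len(initialisations)
    return [(beh, total / n) for beh, total in counts.most_common(top_k)]
```

A rank-1 entry with a mean frequency of 40.94, for instance, means that behaviour appeared about 41 times in an average population of 1000.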

Table 2 shows that, for population initialisation in the even 4 parity model, the RHH favours simplistic behaviours, with the entries at ranks one and two being the contradiction and the tautology. As explained in Sect. 3.3, tautologies and contradictions are a special case, as their result does not depend on the input value of any of the variables (or terminals, in the GP context). In every initialisation of population size 1000, an average of 40.94 tautologies and 42.25 contradictions occur. As a result, 8.32% of the programs generated in each initialisation (tautologies and contradictions combined) do not depend on any of the input variables.

Table 2 Bias results for the parity and multiplexer models

Moreover, none of the 10 most frequent behaviours contains all the terminals, and they therefore result in partially blind candidate programs. Behaviours at ranks three to six represent single terminals, with or without a NOT function, and the behaviours at ranks seven to 10 represent simple AND or OR functionality. The 46th ranked most frequent record (with a frequency of 2.92) is the first with a node count of four. This indicates that behaviours using four inputs (and possibly all the terminals) are created infrequently compared to the simplistic structures we see in Table 2.

In Table 2, the even 7 parity results are similar to those of the even 4 parity model. We see the tautology and contradiction states at ranks one and two, single terminals at positions three to nine, and simplistic OR functionality at position ten. Whilst the behavioural structures are similar to the even 4 parity model, the frequencies are slightly reduced, such that tautologies represent only 5.78% of the behaviours generated in an average initialisation. The introduction of more terminals adds only a little more diversity to the creation of behaviours using the RHH technique. It is not until the 1515th ranked most frequent behaviour, with a frequency of 0.5 occurrences per initialisation, that we see a node count of seven for the first time. Again, this indicates a bias towards simplistic behaviours being generated by the RHH technique.

In Table 2, the 6 and 11 bit multiplexer experiments show similar results to that of the even parity experiments. Again, the tautology and contradiction states feature as the two most frequently constructed behaviours. These are followed by single terminals, and then simple two terminal structures. As the number of terminals in the problem increases, the chances of constructing a behaviour with all terminals present becomes even worse.

The first occurrence of a behaviour with a node count of six is ranked 559th, with a frequency of 0.13 occurrence per initialisation. This information is worth considering as the 6 bit multiplexer model frequently has its population size parameter set at 500. Therefore, if we consider an average initialisation, the population is unlikely to contain one candidate program which has all terminals present in the behaviour.

Table 2 shows that the bias results for the 11 bit multiplexer have similar characteristics to the other Boolean models: the increase in terminals results in a decrease in the frequency of behaviours which use all inputs. In our analysis, it was not until the 345th ranked most frequent program (with a frequency of 0.16) that a behaviour with a node count of three appeared.

In keeping with the other Boolean domain experiments, Table 3 shows that the 5 and 9 majority experiments suffer a similar bias. In the 5 majority experiment, a node count of 5 was first achieved by the 150th most common behaviour, with a frequency of 0.67. In the 9 majority experiment, it was not until the 2614th most frequent behaviour that a node count of 9 was achieved.

Table 3 Bias results for the artificial ant and majority models

The artificial ant results in Table 3 exhibit similarly simplistic behaviours, albeit not in the same way as the problems in the Boolean domain. The most frequent structures assembled by the RHH technique are simple one move or turn structures. It is not until the 6th most frequent reading that a behaviour contains two operations. It is not crucial to have behaviours with large numbers of moves (because the ant control code is re-executed), but in order to achieve a full score the ant must have a behaviour containing several moves and turns. To put this into perspective, it is not until the 197th most frequent reading (0.32 frequency) that five moves are first accomplished.

4.3.1 Discussion

Both the analysis of unique behaviours in starting populations and the more in-depth bias analysis for each model have revealed several biases in the output of the RHH initialisation technique.

Despite preventing syntactically identical code from being produced, there are still notable quantities of duplicated behaviours in the starting population. Further bias analysis of all the problem domains considered revealed that the RHH technique favours simplistic behaviours and tautology and contradiction states. Tautologies and contradictions in the Boolean domain demonstrate both the RHH’s inability to create building blocks dependent on terminal values and its tendency to produce ineffective code, since it is not possible to construct true or false directly with the syntax available. The fact that they are the most frequent behaviours in the Boolean problems indicates that this is the main weakness of the RHH technique as an initialisation method.

If we consider the ant domain, the analogue of a tautology or contradiction would be an ant that performs no moves. An ant could perform as many turns as it liked without changing position, but this would not allow it to achieve its ultimate goal of collecting food pellets. Unfortunately, such behaviours are the second to fifth most frequently produced by the RHH technique in the artificial ant domain.

A final aspect of concern is the apparent inability of the RHH to construct more complex behaviour. In the Boolean domain, we noted the low rank of the first occurrences of candidate behaviours with node counts capable of representing all the inputs. With increasing numbers of terminals, it became harder for the RHH to generate behaviours that used all the inputs. This effect is not limited to the Boolean domain: in the ant domain, it was not until the 197th most frequent behaviour (with a frequency of 0.1) that the RHH produced an ant capable of moving five positions.

An advantage of the SDI algorithm is that it knows when behaviours reduce to the tautology, contradiction or no-move state, and these behaviours can therefore be removed. In addition, the SDI prevents bias in behaviours because it can enforce semantic uniqueness of the programs in the population.

4.4 Size and shape analysis

4.4.1 Analysis of size and shape of programs

As previously mentioned, there is no size or depth limit on programs produced using SDI. This section presents an analysis and comparison of the sizes and shapes of programs produced by both RHH and SDI. We initialised populations of size 1000 for each problem, performed 100 initialisations and averaged the results. The metrics we used are program depth, program length (total number of nodes in the tree), number of functions, number of terminals, and number of distinct terminals.
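These metrics can be computed in a single recursive pass over a tree. The sketch below assumes a hypothetical tuple encoding (terminals `('VAR', name)`, functions `(op, child, ...)`) and counts depth as edges from the root:

```python
def metrics(tree):
    # Terminals are ('VAR', name); functions are (op, child, ...).
    if tree[0] == 'VAR':
        return {'depth': 0, 'length': 1, 'functions': 0,
                'terminals': 1, 'distinct_terminals': {tree[1]}}
    child_metrics = [metrics(c) for c in tree[1:]]
    return {
        'depth': 1 + max(c['depth'] for c in child_metrics),
        'length': 1 + sum(c['length'] for c in child_metrics),
        'functions': 1 + sum(c['functions'] for c in child_metrics),
        'terminals': sum(c['terminals'] for c in child_metrics),
        'distinct_terminals':
            set().union(*(c['distinct_terminals'] for c in child_metrics)),
    }
```

Averaging these dictionaries over 100 initialisations of 1000 programs gives figures of the kind reported in Tables 4 and 5.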

4.4.2 Results

Table 4 shows size and shape analysis from the parity and multiplexer experiments. Paired T-tests revealed that all readings are significantly different at the 99% confidence level (P-value of 0.000). The two most notable points that apply to all but the 11 bit multiplexer are that the SDI produces deeper, thinner trees, compared to the RHH technique; and that the SDI increases the numbers of distinct terminals in all cases. The HSDI falls somewhere between the SDI and RHH extremes. In problems such as the 11 bit multiplexer this is useful as it creates smaller programs that are less likely to grow beyond the crossover depth cap [4] (in our case 17) during evolution as a result of the bloat phenomenon [27, 34, 35].

Table 4 Size and shape comparisons

Table 5 shows the results for the majority and ant experiments. Paired T-tests revealed that all readings are significantly different at the 99% confidence level (P-value of 0.000). The majority experiments reflect the multiplexer and parity results in that the SDI produces deeper, thinner trees, except for the larger 9 majority problem and the 11 bit multiplexer. The 11 bit multiplexer and 9 majority results show a large increase in the size (all metrics) of the programs produced. This relates to the earlier discussion in Sect. 3.3 regarding the limitations of generating syntax directly from behaviour. As there are no size checks in the SDI, it will produce syntax to represent the behavioural complexity of the problem faced. In the majority experiments (Table 5) the SDI and HSDI consistently produce more distinct terminals during the initialisation which indicates a better ability for programs to deal with all possible inputs presented. Similar characteristics of size and shape are shown in the ant domain. Deeper thinner programs are produced by the SDI and HSDI compared to the RHH technique and both the SDI and HSDI produce a statistically higher level of distinct terminals.

Table 5 Size and shape comparisons

One final point to mention is the intron-washing effect the HSDI has on the populations it produces. It is seeded from the FULL method but generated from the behavioural representation of that code, so it produces effective code; this is supported by the consistently low length and depth values in Tables 4 and 5 (with the exception of the 11 bit multiplexer).

4.4.3 Discussion

One global feature of the size and shape experiments is that, regardless of the problem domain, the RHH always produces roughly similar sized programs, whereas the SDI produces programs whose size depends on the problem. It is reasonable to assume this happens because different problem domains have different behavioural requirements, which result in different size and shape characteristics.

In addition to this, as the number of terminals increases, the potential combinations of behaviour will increase, causing deeper programs that require more functions and terminals to attain specific behaviours. This is borne out in our results for the Boolean domain, where the increase in program sizes and depths is in line with the increase in the numbers of terminals present in our experiments.

The second point to be made when studying these results is the level of importance that should be given to the measurement of the depth of the tree. All of the SDI and HSDI results produced trees of greater depth than the RHH (but generally shorter in terms of length) and this would be consistent with using composite functions to model specific behaviours. The length metric may be a better measure to use, in terms of flexibility, as it gives programs the opportunity to develop complex behaviour as well as controlling the overall size of the program.

The third observation is that in all experiments the SDI produced significantly more distinct terminals. Statistically, there are fewer possible behaviours in the Boolean domain when not all the terminals are used. Once these are generated, the SDI has to generate more complex behaviour to retain the semantic variety it promises. As a result, the SDI automatically tends towards producing programs with all the terminals present, especially in larger populations.

4.5 GP performance results

In the experiments presented in this section, we compare the performance of GP runs that are initialised using the three methods, SDI, HSDI and RHH. We keep all parameters the same except for the population generation method.

We used the following GP parameters: 10% elitist reproduction; 100 runs of 50 generations; maximum depth 17; RHH depth 2–6; SDI or HSDI initialisation; crossover point selection with a 90% bias towards functions and 10% towards terminals; crossover probability 0.9; mutation probability 0 (to remove additional variables from the experiment); and tournament selection with 7 competitors. A population of 500 was used for the 6 bit multiplexer, even 4 parity, 5 majority and artificial ant (Santa Fe) experiments. A population of 4000 was used for the 11 bit multiplexer, even 7 parity and 9 majority problems.

Table 6 shows that the performance of the RHH, SDI and HSDI techniques vary depending on the problem being analysed. In this case, in overall terms we see the HSDI performing best in three experiments, the SDI in two and the RHH in two experiments (at the 95% confidence level). This raises two complex questions about how the dynamics of creating the starting population can impact on the performance of GP runs.

Table 6 Performance of the RHH, SDI and HSDI techniques

The first issue concerns the way in which we understand the distribution of initialised programs in relation to the search space for a particular problem. The SDI performs well on the parity and 6 bit multiplexer experiments, but poorly on the 11 bit multiplexer experiment, when compared to the RHH. A simplistic theory is that the SDI produces too many complex behaviours in the wrong region of the search space when we may be looking for simplistic programs in another region. This might explain why, for example, the SDI performs well on the 6 bit multiplexer but poorly in the 5 majority experiment. It might also explain how the HSDI is able to perform well on both multiplexers, as it is seeded with simplistic behaviour but can build more complex behaviour on top of this.

The second issue is the effect of initialising 100% effective code compared to code with redundant and unreachable statements. We have already shown that the introduction of the SKIP operation was necessary in the HSDI applied to the artificial ant problem, to alleviate the problem whereby RHH and FULL produce dead branches for the IF-FOOD-AHEAD statement. Given the performance results shown in Table 6, this SKIP statement is clearly required.

The obvious comparison to this work is that of Looks [15], who presented results for a selection of multiplexer and parity experiments. Whilst Looks examines similar problems, he uses a different function set, different parameters and a different semantic sampling initialisation. Despite these differences, for the 6 bit multiplexer and 4 parity experiments we still see a similar relative change in performance based on semantic-style sampling. Looks uses a 5 parity and we use a 7 parity; given that our results support the value of semantic initialisation for parity problems, our results are in line with his. Our 11 bit multiplexer results differ from those of Looks; we can speculate that this is because our experiments place no depth limits on the SDI, and as a result (supported by the size and shape results in Table 4) vastly larger programs are created, changing the distribution of programs in the search space. By contrast, Looks does control the size of the programs he produces with the semantic sampling algorithm. As stated in our introduction, our algorithms were designed to test theories in order to better understand the important issues in producing good quality starting populations.

Finally, there is a difference in program sizes when comparing the SDI, HSDI and RHH initialisation algorithms. Our size and shape results (Sect. 4.4) show that the SDI and HSDI algorithms produce programs of varying sizes depending on the problem: some are smaller than the RHH output and some are larger. There is an argument for trying to make the programs the same size for comparison; however, as the only control mechanism on the RHH algorithm is depth, and we know that the HSDI and SDI tend to produce deep thin trees, making similar sized starting programs would be very complex. We have therefore used the traditional 2–6 depth range of the RHH algorithm.

4.6 Evolvable shape

In order to examine evolvable shape we use the MODFULL and WASHED algorithms, as set out in Sect. 3.6. The parameters are the same as in Sect. 4.5; however, we use the MODFULL and WASHED algorithms to examine the effect that changing the shape of programs, without changing their semantics, has on GP performance. We use MODFULL and WASHED because Tables 4 and 5 show that the semantically reduced code provides a dramatic contrast in tree shape whilst retaining the same behaviour as the FULL code.

Table 7 demonstrates that changing the shape of the initialised program trees can have a dramatic effect on the performance of GP runs. The multiplexer, parity and majority experiments all show a statistically significant difference between the MODFULL and WASHED algorithms overall, and all but one of the Boolean problems shows a significant difference at the 95% confidence level at generation 50. However, whether MODFULL or WASHED gives the superior performance appears to be problem dependent.

Table 7 Effect of program shape on performance of GP runs

The other factor of note, when considering the success rates of the experiments that do find ideal solutions, is the size of the difference in success. The biggest difference is 37%, in the case of the 6-Mux. This shows the importance of program shape to evolvability in a GP run.

The artificial ant statistically favours MODFULL overall, and is statistically similar to WASHED at generation 50. This is not surprising because, despite the size and shape results in Table 5, the back translation mechanism still constructs full trees when it reassembles the ant movement instructions. The only difference is that redundant and unreachable code has been removed in the reduced form. The fact that performance is statistically similar for the artificial ant indicates that simply removing unreachable and redundant introns does not alter performance for this particular experiment.

5 Discussion

One of the main points to draw from this analysis, as well as the work of other authors (for example, [11, 15]) in the field, is that the choice of initialisation method may result in statistically significant variation in the performance of GP runs. In the context of this investigation, and the results we present in Tables 6 and 7, it is clear that changing different aspects of program initialisation can have an impact on performance far beyond that required for statistical significance.

5.1 Distribution of behaviours in the search space

Our early results in Sects. 4.2 and 4.3 clearly demonstrate how the existing RHH technique has bias, frequently duplicating simplistic behaviours, tautologies and contradictions (or no move ants). With these results in mind, we set out to counter these effects by producing the SDI algorithm in order to achieve complete behavioural diversity and build more complexity into the starting populations.

Preliminary analysis of the size and shape output (Tables 4 and 5) of the SDI algorithm appeared positive for two reasons. The first is the ability of the SDI to create starting populations that vary their size depending on the problem. This gives the impression that the SDI is actually modelling behaviours specific to the search space of each problem rather than the “one size fits all” solution in the RHH. The second is that depth appears not to be the best way to constrain programs. Our size and shape analysis (Tables 4, 5) showed that the SDI produced deeper but thinner trees in order to specifically model more complex behaviour.

Whilst the SDI might be theoretically superior in terms of distributing behaviours in the search space, Table 6 shows that it only outperforms the RHH technique in the 6 bit multiplexer and parity experiments. The parity problems are known to be deceptive [29], requiring more intricate and complex programs to solve; this may be why the SDI performed well on them. The multiplexer results are less clear, as it is intriguing that the SDI does not outperform RHH on the 11 bit multiplexer problem. A possible explanation is that, in terms of the overall search space, the program required to secure 100% fitness is relatively simplistic in comparison to all the behaviours in the 11 bit multiplexer search space. If one considers the exponential increase in search space size from \(2^{2^{6}}\) behaviours for the 6 bit multiplexer to \(2^{2^{11}}\) for the 11 bit multiplexer, the SDI has to model far more complexity to distribute programs through the breadth of the search space. It is possible that, as a result, the RHH is more biased towards the correct area of the search space and therefore achieves a higher success rate.
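The figures quoted follow from counting Boolean functions: with \(n\) inputs there are \(2^{n}\) input rows, each of which can map to true or false independently, giving \(2^{2^{n}}\) distinct behaviours. A one-line sketch:

```python
def n_behaviours(n_inputs):
    # A Boolean problem with n inputs has 2**n input rows, each of which
    # can map to 0 or 1 independently: 2**(2**n) distinct behaviours.
    return 2 ** (2 ** n_inputs)
```

Moving from 6 to 11 inputs therefore multiplies the behaviour space by a factor of \(2^{2^{11}-2^{6}} = 2^{1984}\), which gives a sense of how much more complexity the SDI must model.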

The SDI failed on the majority experiments, which are arguably the most simple of the Boolean experiments. This would suggest that the RHH was better able to produce programs in the most successful search areas of the majority search space whereas the SDI’s complexity worked against it.

Based on these results, the hybrid algorithm (HSDI) was created to take advantage of both the traditional FULL generation and the increased complexity aspect of the SDI algorithm. The HSDI proved to be the best overall performer in three experiments (at the 95% confidence level). At generation 50 it statistically equalled the best result in 4 out of 7 experiments, and was the best performing algorithm on the 6 bit multiplexer at the 95% confidence level. Nevertheless, the performance of the HSDI was disappointing considering that it combined the different features of the SDI and RHH algorithms. This is a testament to the difficulty of creating a problem independent population generation algorithm.

5.2 Evolvable shape

Our results in Table 7 clearly show that the shape of the trees can have a significant effect on GP performance, which is in agreement with Daida et al. [2, 22] and Langdon et al. [21]. In the Boolean domain, all of the experiments presented not only statistically significant differences (at the 95% confidence level), but also some dramatic changes in the performance of the GP runs. Given the variation in these results, it would appear that the ideal evolvable shape for a program is problem dependent, and it would therefore be difficult to predict ideal program structures for specific problems. This suggests that further work comparing these results with those of Daida and Hilss [22] would be valuable.

The similarity of the artificial ant results is unsurprising given that the translation mechanism between abstract meaning and ant syntax builds full trees (without redundant code); as a result, the experiment compares larger full trees with smaller, more effective, full trees. A noteworthy point is that the SDI contains no introns for the ant domain, whereas the HSDI has simulated introns (in the form of SKIP), and this increases the performance of the algorithm in the ant domain. A possible hypothesis is that being able to use the SKIP operator as one of the branches of the IF-FOOD-AHEAD operator statistically creates more opportunities for better performing solutions. This calls into question how exactly we should go about choosing the function set; in the case of the artificial ant, there may well be better choices.

Another algorithm that gives the user the ability to influence tree shape is Luke’s Probabilistic Tree Creation [11, 14]. This is because the user-controlled bias in the appearance of functions and terminals would have an overall effect on tree shape. Luke’s success using this algorithm could, at least partially, be a result of evolvable shape. With reference to the results we present in Table 7, it would be interesting to see if performance would increase if we change the probabilities of selection of the terminals and functions from experiment to experiment, for example, to cater for the difference between the parity and majority problems.

Having shown that there is a preferred evolvable shape for specific problems (Table 7), we can argue that this lends support to GP schema theory [29, 36, 37]. In our interpretation, the schemata are in a more abstract form, consisting of a tree of a (problem specific) particular shape containing “don’t care” nodes. Alternatively, one could follow a strategy such as Salustowicz and Schmidhuber [38] enforcing a structure and using node weightings to perform the learning. This would be an interesting direction for future work.

The issue of problem dependence in the context of generating starting populations is a complex one. Table 6 shows that the three algorithms we present seem to favour particular problems and the results in Table 7 merely complicate the matter further. Based on this data, further research is required in the field of program initialisation in order to be in a position to recommend specific code generation algorithms for specific problems.

6 Conclusion

Our analysis of program initialisation in GP has shown that the initialisation method chosen can have a dramatic impact on the performance of GP runs. However, it appears that this impact is problem specific, and we cannot conclude that one of the algorithms that we have presented is best for every problem. We have shown some limitations of the RHH algorithm, and tested theories using our algorithms, but we have failed to find one clear solution to program initialisation that incorporates measures to deal with the behavioural distribution of programs and to control the shape of the syntax produced.

We present clear evidence that both the distribution of programs in the search space and the shape of the tree can have dramatic effects on the performance of GP. Both of these variables are strongly dependent on the problem being tackled, which therefore makes the challenge of constructing an initialisation algorithm a highly complex one.

As a result of this work, there is evidence to support the need to create initialisation algorithms that can explicitly exercise control over both the behavioural distribution of programs and the shape of the programs they produce.

Based on the evidence we present, future work in the area of program initialisation needs to address both behavioural diversity and program structure, and the interactions between the two. One possibility would be to merge our algorithm with Luke’s PTC algorithms in order to gain more granular control over program structure whilst retaining behavioural diversity.

7 Future work

The first avenue to pursue in order to expand this work is to develop behavioural models for other domains, for example, symbolic regression. Further results using different problem domains may reveal different properties or create a clearer understanding of the relationship between semantic diversity and evolvable shape.

The second major area of research is to apply semantic control to other areas of GP. Semantic analyses of the crossover operation are starting to appear [39, 40]. A similar analysis could be used to study different mutation techniques, and also to compare the behavioural changes of mutation and crossover. This would provide data to help resolve the already contentious debate as to whether crossover is another type of mutation operator [41].

There is also the potential to create a type of semantic fitness function to measure the difference between two behaviours. This has the potential to be very quick to execute, in comparison to an input/output fitness function, when the traditional fitness function uses a large amount of input data.
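One plausible form for such a function, assuming behaviours are stored as truth-table fingerprints (an assumption of ours, not a design from the paper), is the Hamming distance between models:

```python
def semantic_distance(beh_a, beh_b):
    # Hamming distance between two behaviour fingerprints (e.g. truth
    # tables): a candidate semantic fitness that compares models directly
    # instead of re-executing a program on every fitness case.
    if len(beh_a) != len(beh_b):
        raise ValueError('fingerprints must cover the same input space')
    return sum(a != b for a, b in zip(beh_a, beh_b))
```

Once the behavioural model of a candidate has been built, this comparison is a single pass over two fixed-length vectors, independent of the cost of executing the program on the fitness cases.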

Finally, with the use of semantic representation, we could construct a form of semantic pruning in order to attempt to cut bloat or test the performance of intron free GP compared to traditional GP.