I would like to thank Leonardo Vanneschi and Leonardo Trujillo for the opportunity to lead their peer commentary on the thirtieth anniversary of John R. Koza’s book “Genetic Programming: On the programming of computers by means of natural selection” [1] and the colleagues who took the time to read my initial article [2] and kindly comment upon it. They raise important points which I should like to reply to.

In their wittily titled “Veni, Vidi, Evolvi” (I came, I saw, I evolved),Footnote 1 Giovanni Squillero and Alberto Tonda [3] point to GP’s success at producing better-than-human results and give pointers to a number of GP tools. Although they suggest GP could be used to a greater extent in the real world, particularly in future highly automated industry, they list many areas where GP is competitive. Examples include the design of ensembles of other artificial intelligence (AI) generated models [4], such as deep neural networks [5], and recent successes in our own software industry. Indeed, search-based software engineering [6] is often GP based and has led to successes such as automatic bug fixing [7, 8] and genetic improvement of software [9,10,11], including industrial use [12,13,14]. Interestingly, Squillero and Tonda identify areas where they feel that current large language models (built on deep neural networks) will never be able to compete with GP [15]. Indeed, their reasoning for this, based on the availability of training data, may apply to many other special circumstances. They also suggest that GP will sometimes hybridise well with other AI approaches.

Mauro Castelli [16] looks forward 30 years and stresses GP’s ability to support multidisciplinary research, particularly with biology. Although there is some theoretical underpinning for genetic programming [17], he points out that both GP and other forms of AI, such as artificial neural networks, are largely empirical, with considerable progress being made by skilled researchers following what works in practice. Castelli also makes a plea for more thoughtful consideration of what makes a solution human-interpretable, i.e. meaningful to GP’s customers. If computer-based systems are understood, perhaps this can help companies convince their employees and users that they are fair [18]. There is considerable interest in ways to measure and improve how comprehensible software is [19, 20]. Possibly GP researchers can find insight in software engineering’s readability metrics. Indeed, in future, perhaps genetic improvement could use automatic comprehensibility measures to make software easier to maintain: after all, if you can measure it, you can evolve it. May I also add to Castelli’s plea and suggest that, where possible, especially in presentations, we put the (simplified) evolved model on a slide using the customer’s language, e.g. “glucose” rather than “D1” (Fig. 1).Footnote 2
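To illustrate the point, here is a minimal Python sketch of translating an evolved model into the customer’s language before putting it on a slide. The raw expression and the second feature name (“insulin”) are purely hypothetical; only the “glucose” versus “D1” example comes from the text above.

```python
import sympy as sp

# Hypothetical evolved expression using raw dataset column names (D1, D2).
raw = sp.sympify("(D1 + D1 - D2)/2")

# Map the raw column names onto the customer's vocabulary (names assumed).
friendly = {sp.Symbol("D1"): sp.Symbol("glucose"),
            sp.Symbol("D2"): sp.Symbol("insulin")}

# Simplify, then rename, before presenting: prints  glucose - insulin/2
print(sp.simplify(raw).subs(friendly))
```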

Fig. 1 Interpretation of an evolved 3 node GP tree which performs approximately as well as sophisticated machine learning techniques [21, Fig. 1]

Malcolm Heywood [22] follows up two points made by John Koza on his own work at his GECCO 2022 lecture: parallel GP and co-evolution. As Koza predicted, the thirty-plus years since the first GP book [1] have been dominated by the exponential increase in available compute power [23]. It looks likely that this will continue into the near future. However, CPU clock speeds may not rise much above the 3.6 GHz common today; instead, silicon chip designers will spend the extra transistors available on more CPU cores and on more on-chip RAM (e.g. for cache memory). This continues today’s trend towards increasing parallelism and towards architectures where compute power is plentiful but distributed, and the true costs lie in getting data to each CPU fast enough to keep them all busy [24]. As Heywood points out, GP is “embarrassingly parallel”: the algorithm can readily be split into independent work units which can proceed on independent processing units with only limited need for communication or synchronisation between them [25, 26].Footnote 3 In the case of GP we can imagine the traditional workload as being divided into a computational cube (Fig. 2) with three dimensions: across the individuals in the population \(\times \) across the test cases \(\times \) across the opcodes that form each GP program. The computational cube metaphor stresses that the work can be split up in many ways on parallel hardware. Even a modest GP experiment with a population of 1000, 10 fitness cases, and programs of 10 instructions gives us 100 000 items to compute per generation, which in five years’ time (2028) might map well onto a field programmable gate array (FPGA) or graphics card with 100 000 processing units. (Fig. 2 shows a much more modest population of 4, with 5 test cases and programs containing up to 12 instructions.) Fukunaga et al. [27] showed GP could be run without an interpreter, whereas Juille showed an imaginative way of running a GP interpreter on a highly parallel computer [28], which inspired more recent work on running GP on computer gaming or graphics cards (GPUs) [29] (see also iCUBE’s EASEA platform, which supports GPU computing [30]). Heywood also mentions exploiting parallel hardware in the form of FPGAs [31, 32]. Another area, which everyone hopes will become available soon, is quantum computing [33]. Although quantum computing is at present limited in terms of the number of qubits, evolution has already been shown to be able to help improve the reliability of existing quantum algorithms [34].

Fig. 2 Evaluating a GP population of four individuals, each on the same five fitness cases. There are up to \(4\times 5 \times 12\) GP operations to be performed by, in principle, 240 GPU threads. Each cube needs the opcode to be interpreted, the fitness test case (program inputs) and the previous state of the program (i.e. the stack). Taken from [35, Fig. 19]
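To make the computational cube of Fig. 2 concrete, here is a minimal Python sketch at the same 4 \(\times \) 5 \(\times \) 12 scale (the tiny stack-based instruction set and the random regression problem are my own assumptions, not taken from any of the cited systems). The outer comprehension runs across the population, numpy applies each opcode to all five fitness cases at once, and the inner loop walks the opcodes, so all three axes of the cube are visible and any of them could be handed to parallel hardware:

```python
import numpy as np

rng = np.random.default_rng(0)
POP, CASES, LEN = 4, 5, 12             # the population x cases x opcodes cube of Fig. 2
X = rng.uniform(-1.0, 1.0, CASES)      # one input value per fitness case
target = X * X + X                     # hypothetical symbolic regression target

OPS = ("ADD", "SUB", "MUL", "X", "1")  # toy stack-based instruction set (assumed)

def interpret(prog, x):
    """Run one linear GP program on a whole vector of fitness cases at once."""
    stack = []
    for op in prog:                    # third cube axis: the opcodes
        if op == "X":
            stack.append(x.copy())
        elif op == "1":
            stack.append(np.ones_like(x))
        elif len(stack) >= 2:          # binary opcodes; skipped if too few operands
            b, a = stack.pop(), stack.pop()
            stack.append(a + b if op == "ADD" else a - b if op == "SUB" else a * b)
    return stack[-1] if stack else np.zeros_like(x)

population = [[OPS[rng.integers(len(OPS))] for _ in range(LEN)] for _ in range(POP)]
fitness = [np.mean(np.abs(interpret(p, X) - target)) for p in population]
print(fitness)
```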

Heywood [22] also talks of the many cases, since Koza’s first book [1], where GP has contributed to the exciting area of coevolution. These include both competitive “red queen” coevolution [36] and cooperative coevolution (such as the evolution of multiple-tree individuals [37] and ADFs [38]). He gives competitive coevolution “arms race” [39] examples, such as simultaneously evolving a program and its test suite [40], where the programs are evolved to pass the tests whilst the tests are evolved to find bugs in the evolving programs. Heywood also describes cooperative coevolution, which covers many approaches, such as evolving separate programs so that they work as an ensemble [41, 42], as a team [43], as part of a complete solution [44] or as members of a multi-agent simulation [45, 46].Footnote 4 He also points out that, compared to current large language models created by deep learning artificial neural networks, GP is not slow.
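As a concrete, if toy, illustration of such an arms race, the following Python sketch co-evolves a population of candidate programs against a population of tests: programs score by how many tests they pass, tests score by how many programs they fail. The representation (programs as (slope, intercept) pairs approximating a hidden linear function, tests as single input values) is my own assumption, chosen only to keep the two-population loop visible; it is not the system of [40].

```python
import random

random.seed(1)

def hidden(x):
    return 2 * x + 1                   # behaviour the programs should reproduce (assumed)

def passes(prog, x, tol=0.5):
    slope, intercept = prog
    return abs(slope * x + intercept - hidden(x)) < tol

def tournament(pop, fit):
    a, b = random.randrange(len(pop)), random.randrange(len(pop))
    return pop[a] if fit[a] >= fit[b] else pop[b]

programs = [(random.uniform(-3, 3), random.uniform(-3, 3)) for _ in range(20)]
tests = [random.uniform(-5, 5) for _ in range(20)]

for gen in range(50):
    prog_fit = [sum(passes(p, x) for x in tests) for p in programs]      # pass the tests
    test_fit = [sum(not passes(p, x) for p in programs) for x in tests]  # fail the programs
    programs = [tuple(g + random.gauss(0, 0.2) for g in tournament(programs, prog_fit))
                for _ in programs]
    tests = [tournament(tests, test_fit) + random.gauss(0, 0.5) for _ in tests]

prog_fit = [sum(passes(p, x) for x in tests) for p in programs]
print("best evolved (slope, intercept):", max(zip(prog_fit, programs))[1])
```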

Alberto Bartoli, Luca Manzoni and Eric Medvet [50] caution us against accepting too rosy a picture of GP. They are right to point to the diffuse evidence of industrial GP take-up and the lack of a dominant GP package.Footnote 5 Whilst a few GP tools, such as Eureqa [51,52,53] and HeuristicLab [54], are now firmly in the industrial domain, they are right to point to the diversity of available GP tools. It is unfair to pick out examples from the many available; nevertheless a few come to mind: DEAP [55], EASEA [30, 56], ECJ [57], GeneticEngine [58], GenProg [59], Gin [60, 61], GPLAB [62], gplearn, GPTIPS [63], Inspyred [64], Magpie [65], PonyGE2 [66], PushGP [67] and TensorGP [68]. I have already pointed to a few examples of GP take-up in the software industry; more can be found in the water industry [69] and civil engineering [70]. Indeed, Bartoli et al. celebrate the success of TPOT [71, 72] in bioinformatics. Generally, bioinformatics and medical research have embraced open science and are more than happy to cite GP tools, such as TPOT, or tools enhanced by evolution [11], when they use them [73]. However, as David Andre pointed out in his invited keynote at GPTP-2021 [74], companies in competitive industries (particularly in finance) are generally very wary of talking openly about any tool or technique that gives them an edge. Bartoli et al. point to GP’s continued success in symbolic regression, citing Bill La Cava and his team’s work, which was published at the top neural networks conference [75], and their tool, SRBench, which is available on GitHub. Bartoli et al. make important points and suggest ways the GP community should do better. They point to the recent success of deep neural networks, but in some ways perhaps we should take heart from this. In the popular press deep networks are “AI”, and yet they, like GP, are empirical rather than theory-driven, and both are firmly based on learning. The idea that AI requires someone to patiently code all human knowledge into a rule base is nowhere to be seen. Deep neural networks demand huge compute resources, that is, they are not efficient, and they are far from error-free. Perhaps we should be happy to let our GP systems consume resources and tolerate some errors. As Stephanie Forrest asked [76], what could we do if we allowed our evolutionary systems the same resources that the mega corporations have spent?
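By way of example of how low the barrier to entry now is, here is a minimal symbolic regression run using gplearn, one of the packages listed above. The target function and all parameter settings are illustrative assumptions, not taken from any cited benchmark such as SRBench:

```python
import numpy as np
from gplearn.genetic import SymbolicRegressor   # pip install gplearn

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, (200, 2))
y = X[:, 0] ** 2 - 0.5 * X[:, 1] + 1.0           # hidden relationship to rediscover (assumed)

est = SymbolicRegressor(population_size=1000,
                        generations=20,
                        function_set=('add', 'sub', 'mul'),
                        parsimony_coefficient=0.001,     # mild bloat control
                        random_state=0)
est.fit(X, y)
print(est._program)                               # the evolved expression, in prefix form
```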

Jason Moore [77] points to the impact of Koza’s first GP book [1] in artificial intelligence (AI), artificial life, machine learning, art, biology, economics and engineering, but questions whether Darwin’s fitness-driven evolution is helpful. (You may remember the full title is “Genetic Programming: On the programming of computers by means of natural selection”, which was inspired by that of Charles Darwin’s 1859 revolutionary book “On the Origin of Species by Means of Natural Selection” [78].) Instead, Moore focuses on the importance of the representation used to define the programs and the variation operators used to modify them, and suggests that perhaps GP needs a name change. Certainly the last 30+ years have seen an expansion in GP representations from interpretable Lisp trees (including ADFs [38]) to linear GP [79], grammatical evolution [80], and graph-based GP, such as Cartesian genetic programming [81]. Also, genetic improvement [82, 83] has reinforced the idea that existing computer programs are not fragile [84] and can themselves be evolved, with genetic representations as diverse as lines of C++ code [85], Java [86], XML-based abstract syntax trees [65],Footnote 6 SQL [88, 89], Java byte code [90,91,92], assembler [93], Clang intermediate code [94] and even binary machine code [95]. Moore is correct in saying that not only are trees not the only representation, but also that genetic search is not the only game in town. Already people have been successful (in addition to genetic algorithms [1]) with hill climbing [96], simulated annealing [97], novelty search [98] and Monte Carlo Tree Search [99]. As he points out, we need to be cautious about renaming: for example, often people do not like “random”, but the equivalent “stochastic” sounds more scientific. But in the end, he points out, whatever we call GP, our goal must be to continue to help people and help society.
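As a small illustration of non-genetic search over programs, the following Python sketch hill climbs by repeatedly applying a point mutation to a toy linear program and keeping the change whenever it is no worse. The representation and the target are my own assumptions, in the spirit of, but not taken from, [96]:

```python
import random

random.seed(0)
OPS = ['+', '-', '*']
cases = [(x, x * x + x) for x in range(-5, 6)]     # hypothetical target: x*x + x

def run(prog, x):
    """Evaluate a linear program: a list of operators applied in turn, starting from x."""
    acc = x
    for op in prog:
        acc = acc + x if op == '+' else acc - x if op == '-' else acc * x
    return acc

def error(prog):
    return sum(abs(run(prog, x) - y) for x, y in cases)

prog = [random.choice(OPS) for _ in range(6)]
best = error(prog)
for step in range(1000):
    neighbour = list(prog)
    neighbour[random.randrange(len(neighbour))] = random.choice(OPS)   # point mutation
    e = error(neighbour)
    if e <= best:                    # accept equal or better: hill climbing, no population
        prog, best = neighbour, e
print(prog, best)
```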

Colin Johnson [100] points to the “unreasonable effectiveness” of GP fitness functions but nevertheless suggests several ways to improve them, including information theory [101], and raises the possibility of a universal fitness function, perhaps derived from existing examples, e.g. using his Learned Guidance Functions (LGFs) [102]. Similarly, we can regard GP as providing a universal representation [103]. He also mentions current work on using deep neural network based large language models (LLMs) in natural language processing (NLP) text generation applications. Perhaps LGFs with LLMs could act as surrogate fitness functions [104] and maybe give multitudes of fitness test points? He also suggests we abandon static fitness functions and instead use dynamic fitness functions, whose role and target change during the run as the GP population evolves. (In some ways dynamic fitness functions might emulate the often hoped-for role of coevolution in continuously stretching the populations by adapting the direction of fitness selection, thus preventing any species in the ecosystem from stagnating near a local optimum. See also Heywood [22], page 3 above.) Johnson highlights work by Krzysztof Krawiec [105], which perhaps has already taken a step in the direction advocated by Jason Moore [77] (see previous paragraph), where, instead of fitness being applied blindly, “black box”, to the whole organism (i.e. the whole program), “search drivers” consider components within the program [106] and try to improve the whole by improving its parts. This is very much in the mode of recent “white box, blind no more” work by Darrell Whitley [107], where Whitley uses variable interaction graphs (VIGs) to find the natural components of combinatorial problems and uses them with crossover to search vast spaces effectively. See also Zaidi’s Value State Flow Graphs (VSFGs) for describing data flows inside programs [108]. Perhaps VIGs or VSFGs could be used in GP? Perhaps with a degree of fuzziness to eliminate potential weak connections between program components? Perhaps approximate VIGs could form part of the inherited genotype? Perhaps they could themselves be subject to mutation or other genetic operations, with new programs (epigenomes) being stochastically generated from VIGs?
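To show one possible reading of a dynamic fitness function, the Python sketch below re-weights the fitness cases each generation so that cases most of the current population already solve count for less, keeping selection pressure moving towards whatever the population currently finds hard. The toy linear-model representation and the weighting rule are my own assumptions, not Johnson’s proposal:

```python
import random

random.seed(0)
cases = [(x, 3 * x + 2) for x in range(10)]       # hypothetical target behaviour
pop = [(random.uniform(-5, 5), random.uniform(-5, 5)) for _ in range(30)]  # (slope, intercept)
weights = [1.0] * len(cases)

def solved(ind, case, tol=0.5):
    x, y = case
    return abs(ind[0] * x + ind[1] - y) < tol

def fitness(ind):
    # Weighted count of solved cases; the weights change every generation.
    return sum(w for w, c in zip(weights, cases) if solved(ind, c))

for gen in range(40):
    # Dynamic part: down-weight cases that most of the current population already solves.
    for i, c in enumerate(cases):
        solvers = sum(solved(ind, c) for ind in pop)
        weights[i] = 1.0 / (1 + solvers)
    parents = sorted(pop, key=fitness, reverse=True)[:10]        # truncation selection
    pop = [(p[0] + random.gauss(0, 0.3), p[1] + random.gauss(0, 0.3))
           for p in random.choices(parents, k=len(pop))]

print("best (slope, intercept):", max(pop, key=fitness))
```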

I would like again to thank the contributors to this peer commentary and especially the two editorial “Lions” for assembling such a diverse set of skilled peer commentators with such good ideas about how to pull GP forward for the next 30 years.