
1 Introduction

This work is motivated by two issues. First, due to the innate constraints and biases of human thought, it is likely that manual design of optimisers explores only a subspace of optimiser designs. It is unlikely that this subspace contains optimal optimisers for all optimisation problems. Second, recent attempts to create novel optimisers from models of natural systems have been largely unsuccessful in broadening the scope of optimiser designs, instead tending only to generate variants of existing metaheuristic frameworks [9, 16]. This work attempts to address both of these issues by using Genetic Programming (GP) to explore the broader space of optimisation algorithms, with the aim of discovering novel optimisation behaviours that differ from those used by existing algorithms. In order to make the optimiser search space as broad as possible, a Turing-complete language, Push, is used to represent the optimisers, and the Push GP system is used to optimise them [17]. In [8], this approach was used to evolve local optimisers that can solve continuous-valued problems. In this work, this approach is extended to the population-based case, using Push GP to automatically design both local and population-based optimisers from primitive instructions.

The paper is organised as follows. Section 2 reviews existing work on the automated design of optimisers. Section 3.1 gives a brief overview of the Push language and the Push GP system, Sect. 3.2 describes how Push GP has been modified to support the evolution of population-based optimisers, and Sect. 3.3 outlines how the optimisers are evaluated. Section 4 presents results and analysis. Section 5 concludes.

2 Related Work

There is a significant history of using GP to optimise optimisers. This can be divided into two areas: using GP to optimise GP, and using GP to optimise other kinds of optimiser. The former approaches use a GP system to optimise the solution generation operators of a GP framework [4, 6, 17]. Autoconstructive evolution [17] is a particularly open-ended approach to doing this, in which programs contain code that generates their own offspring; also notable is that, like our work, it uses the Push language.

However, more relevant is previous work on using GP to optimise non-GP optimisers. Much of this work has taken place within the context of hyperheuristics, which involves specialising existing optimisation frameworks so that they are better suited to solving particular problem classes. In this context, GP has been used to re-design components of evolutionary algorithms, such as their mutation [23], recombination [5] and selection operators [13], with the aim of making them better adapted to particular solution landscapes. Other hyperheuristic approaches have used GP to generate new optimisation algorithms by recombining the high-level building blocks of existing metaheuristic frameworks [3, 10, 12, 15]. Recently, this kind of approach has also been used to explore the design space of swarm algorithms, using grammatical evolution to combine high-level building blocks derived from existing metaheuristics [3]. Our approach differs from this, and from previous work in hyperheuristics, in that it focuses on designing optimisers largely from scratch. By not reusing or building upon components of existing optimisers, the intention is to reduce the amount of bias in the exploration of optimiser design space, potentially allowing previously unexplored areas of that space to be reached.

Another recent development, which has some similarities to our work, is the use of deep learning to optimise optimisers [1, 11, 21]. So far these approaches have focused on improving the training algorithms used by deep learners, i.e. they are somewhat akin to using GP to optimise GP, though it is plausible that deep learning could be applied to the task of designing optimisers for non-neural domains. However, this is arguably an area to which GP is better suited than deep learning, since the optimisers produced by GP are likely to be far more efficient (in terms of runtime) than those produced by deep learning. Runtime efficiency is an important consideration for optimisers, since the same code is typically called over and over again during the course of an optimisation trajectory. Another advantage of GP is the relative interpretability of its solutions when compared to deep learning, and the potential for more general insights into the design of optimisers to be gained by studying the code of evolved solutions.

3 Methods

3.1 Push and Push GP

In this work, optimisation behaviours are expressed using the Push language. Push was designed for use within a GP context, and has a number of features that promote evolvability [17,18,19]. These include the use of stacks, a mechanism that enables evolving programs to maintain state with less fragility than using conventional indexed memory instructions [7]. It is also Turing-complete, meaning that it is more expressive than many languages used within GP systems. Another notable strength is its type system, which is designed so that all randomly generated programs are syntactically valid, meaning that (unlike type systems introduced to more conventional forms of GP) there is no need to complicate the variation operators or penalise/repair invalid solutions. This is implemented by means of multiple stacks; each stack contains values of a particular type, and all instructions are typed, executing only when values are present on their corresponding type stacks. There are stacks for primitive data types (booleans, floats, integers), and each of these has both special-purpose instructions (e.g. arithmetic instructions for the integer and float stacks, logic operators for the boolean stack) and general-purpose stack instructions (push, pop, swap, duplicate, rot, etc.) associated with it. Another important stack is the execution stack. At the start of execution, the instructions in a Push program are placed onto this stack and can be manipulated by special instructions; this allows behaviours like looping and conditional execution to be carried out. Finally, there is an input stack, which remains fixed during execution. This provides a way of passing non-volatile constants to a Push program; when popped from the input stack, the corresponding values get pushed to the appropriate type stacks. Push programs are evolved using the Push GP system. Since a Push program is basically a list of instructions, it can be represented as a linear array and manipulated using genetic algorithm-like mutation and crossover operators.
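As an illustration of this typed-stack mechanism, the following minimal Python sketch (our own illustration, not the Psh implementation) shows how a typed instruction executes only when its operand stack holds enough values, so that an arbitrary instruction sequence never raises an error:

```python
# Minimal illustration (not the Psh implementation): typed stacks and an
# instruction that is silently skipped when its operands are missing, which
# is how Push keeps arbitrary instruction sequences executable.
class Stacks:
    def __init__(self):
        self.integer, self.float, self.boolean, self.exec = [], [], [], []

def integer_add(stacks):
    # "integer.+" pops two integers and pushes their sum; with fewer than
    # two values on the integer stack it is a no-op rather than an error.
    if len(stacks.integer) >= 2:
        a, b = stacks.integer.pop(), stacks.integer.pop()
        stacks.integer.append(a + b)

s = Stacks()
s.integer.extend([2, 3])
integer_add(s)   # integer stack is now [5]
integer_add(s)   # only one value present, so nothing happens
```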

Table 1. Psh parameter settings
Table 2. Vector stack instructions

3.2 Evolving Population-Based Optimisers

In order to evolve population-based optimisers, this work uses a modified version of Psh (http://spiderland.org/Psh/), a Java implementation of Push GP. To allow programs to store and manipulate search points, an extra vector type has been added to the Push language. This represents search points as fixed-length floating point vectors, and these can be manipulated using the special-purpose vector instructions shown in Table 2; see [8] for more details about these instructions. Evolutionary parameters are shown in Table 1.

Algorithm 1 outlines the procedure for evaluating evolved Push optimisers. To reduce evolutionary effort, a Push program is only required to carry out a single move, or optimisation step, each time it is called. In order to generate an optimisation trajectory within a given search space, the Push program is then called multiple times by an outer loop until a specified evaluation budget has been reached. After each call, the value at the top of the Push program’s vector stack is popped and the corresponding search point is evaluated. The objective value of this search point, as well as information about whether it was an improving move and whether it moved outside the problem’s search bounds, are then passed back to the Push program via the relevant type stacks. Since the contents of a program’s stacks are preserved between calls, Push programs have the capacity to build up their own internal state during the course of an optimisation run, and consequently the potential to carry out different types of moves as search progresses.

Algorithm 1. (Pseudocode for the optimiser evaluation procedure described in the text.)
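The following Python sketch paraphrases this evaluation loop for a single optimiser (a minimal sketch under our own assumptions; `PushInterpreter` and its methods are hypothetical names used for illustration, not part of Psh):

```python
# Hedged sketch of the evaluation loop described above (a paraphrase of
# Algorithm 1, not the authors' code; PushInterpreter and its methods are
# hypothetical names used for illustration).
def out_of_bounds(point, bounds):
    lo, hi = bounds
    return any(x < lo or x > hi for x in point)

def run_optimiser(push_program, objective, budget, bounds, dim):
    interpreter = PushInterpreter(push_program, dim)  # hypothetical class
    best = float("inf")
    for _ in range(budget):
        interpreter.step()                      # one call = one move
        point = interpreter.pop_vector()        # top of the vector stack
        value = objective(point)
        improved = value < best
        best = min(best, value)
        # Results are fed back via the relevant type stacks; because stack
        # contents persist between calls, the program can accumulate state.
        interpreter.push_float(value)
        interpreter.push_boolean(improved)
        interpreter.push_boolean(out_of_bounds(point, bounds))
    return best
```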

To handle population-based optimisation, multiple copies of the Push program are run in parallel, one for each population member. Each copy of the program has its own stacks, so population members are able to build up their internal state independently. Population members are persistent, meaning there is no explicit mechanism to create or destroy them during the course of an optimisation run. To allow coordination between population members, two extra instructions are provided, vector.current and vector.best. These both look up information about another population member’s search state, returning either its current or its best seen point of search. The target population member is determined by the value at the top of the integer stack (taken modulo the population size to ensure a valid index); if this stack is empty, or contains a negative value, the current or best search point of the calling population member is returned. This sharing mechanism, combined with the use of persistent search processes, means that the evolved optimisers resemble swarm algorithms in their general mechanics. However, there is no selective pressure to use these mechanisms in any particular way, so evolved optimisers are not constrained by the design space of existing swarm optimisers.
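For example, the target lookup used by vector.current and vector.best could be realised as in the following sketch (illustrative Python, not the actual Psh implementation):

```python
# Sketch of how vector.best / vector.current might resolve their target:
# the index comes from the top of the integer stack, taken modulo the
# population size; an empty stack or a negative value refers to the
# calling population member itself.
def resolve_target(integer_stack, self_index, pop_size):
    if not integer_stack or integer_stack[-1] < 0:
        return self_index
    return integer_stack[-1] % pop_size
```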

3.3 Evaluation

Evolved optimisers are evaluated using a selection of functions taken from the widely used CEC 2005 real-valued parameter optimisation benchmarks [20]. These are all minimisation problems, meaning that the aim is to find the input vector (i.e. the search point) that generates the lowest value when passed as an argument to the function. The functions used during fitness evaluation, which were selected to provide a diverse range of optimisation landscapes, are:

  • \(F_1\), the sphere function, a separable unimodal bowl-shaped function. It is the simplest of the benchmarks, and can be solved by gradient descent (its base definition, together with that of Rastrigin’s function, is given after this list).

  • \(F_9\), Rastrigin’s function, has a large number of regularly spaced local optima whose magnitudes curve towards a bowl where the global minimum is found. The difficulty of this function lies in avoiding the many local optima on the path to the global optimum, though it is made easier by the regular spacing, since the distance between local optima basins can in principle be learnt.

  • \(F_{12}\), Schwefel’s problem number 2.13, is multimodal and has a small number of peaks that can be followed down to a shared valley region. Gradient descent can be used to find the valley, but the difficulty lies in finding the global minimum, since the valley contains multiple irregularly-spaced local optima.

  • \(F_{13}\) is a composition of Griewank’s and Rosenbrock’s functions. This composition leads to a complex surface that is highly multimodal and irregular, and hence challenging for optimisers to navigate.

  • \(F_{14}\), a version of Schaffer’s \(F_6\) Function, comprises concentric elliptical ridges. In the centre is a region of greater complexity where the global optimum lies. It is challenging due to the lack of useful gradient information in most places, and the large number of local optima.
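For reference, the standard (unshifted, unrotated) base forms of the first two of these functions are given below; the CEC 2005 versions [20] apply shifts and other transformations to such base forms:

\[ F_1^{\mathrm{base}}(\mathbf{x}) = \sum_{i=1}^{D} x_i^2, \qquad F_9^{\mathrm{base}}(\mathbf{x}) = \sum_{i=1}^{D} \bigl( x_i^2 - 10\cos(2\pi x_i) + 10 \bigr) \]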

To discourage overfitting to a particular problem instance, random transformations are applied to each dimension of these functions when they are used to measure fitness during the course of an evolutionary run. Random translations (of up to ±50% for each axis) prevent the evolving optimisers from learning the location of the optimum, random scalings (50–200% for each axis) prevent them from learning the distance between features of the landscape, and random axis flips (with 50% probability per axis) prevent directional biases, e.g. learning which corner of the landscape contains the global optimum. Fitness is the mean of 10 optimisation runs, each with random initial locations and random transformations. The 10-dimensional versions of the problems are used for training, with an evaluation budget of 1E+3 fitness evaluations (FEs). For the results tables and figures shown in the following section, the best-of-run optimisers are reevaluated over the CEC 2005 benchmark standard of 25 optimisation runs, and random transformations are not applied.
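A minimal Python sketch of how such per-axis transformations could be realised is given below (our own paraphrase of the ranges stated above; the composition order and other implementation details are assumptions):

```python
import random

# Sketch of the per-axis training transformations (translation up to +/-50%
# of the axis range, scaling 50-200%, 50% chance of flipping each axis).
# One plausible realisation is to transform a candidate point before it is
# passed to the underlying benchmark function; the composition order here
# is an assumption made for illustration.
def make_random_transform(dim, axis_range):
    shift = [random.uniform(-0.5, 0.5) * axis_range for _ in range(dim)]
    scale = [random.uniform(0.5, 2.0) for _ in range(dim)]
    flip = [random.choice([1.0, -1.0]) for _ in range(dim)]

    def transform(point):
        return [f * s * x + sh
                for x, sh, s, f in zip(point, shift, scale, flip)]

    return transform

# Fitness of a point under the transformed landscape:
#   fitness = benchmark(make_random_transform(dim, axis_range)(point))
```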

4 Results

For a population-based optimiser, the 1E+3 evaluation budget can be split between the population size and the number of iterations/generations in different ways. In these experiments, splits of (population size \(\times \) iterations) 50 \(\times \) 20, 25 \(\times \) 40, 5 \(\times \) 200 and 1 \(\times \) 1000 are used. The latter is included to give a comparison against local search, i.e. optimisers which only use a single point of search. Figure 1 shows the fitness distributions over 50 evolutionary runs for each of these configurations, where fitness is the mean error when the best-of-run optimisers are reevaluated over 25 optimisation runs. To give an idea of how these error rates compare to established general purpose optimisers, Fig. 1 also reproduces the mean errors achieved by two algorithms from the original CEC 2005 competition. G-CMA-ES [2] is a variant of the Covariance Matrix Adaptation Evolution Strategy (CMA-ES) with the addition of restarts and an increasing population size at each restart; it is a relatively complex algorithm and is generally regarded as the overall winner of the CEC 2005 competition. Differential Evolution (DE) [14], although less successful than G-CMA-ES in the competition, is another example of a well-regarded population-based optimiser.

Fig. 1. Fitness distributions of 50 runs for each problem and configuration. The value shown for each run is the mean fitness of the best solution over 25 reevaluations. Published results for CMA-ES (blue) and DE (green) are also shown. (Color figure online)

Figure 1 compares the ability of Push GP to find optimisers with different trade-offs between population size and number of iterations. The distributions show that this trade-off is more important for some problems than others. For \(F_1\), better optimisers are generally found with smaller population sizes, with the 1 \(\times \) 1000 distribution having the lowest mean error. This makes sense, because the unimodal \(F_1\) landscape favours intensification over diversification. For \(F_{12}\), the sweet spot appears to be 5 \(\times \) 200, possibly reflecting the number of peaks in the landscape, i.e. 5. For the other problems, the differences appear relatively minor, and effective optimisers could be evolved for all configurations. In most cases, the best optimiser for a particular problem is an outlier within the distributions, so may not reflect any intrinsic benefit of one configuration over another. That said, four of the five best-in-problem optimisers used small populations (2 with 1 \(\times \) 1000 and 2 with 5 \(\times \) 200), suggesting that it may be easier to find effective optimisers that use small populations than ones that use larger populations.

Perhaps more importantly, Fig. 1 shows that for each problem the Push GP runs found at least one optimiser that performed better, on average, than CMA-ES and DE. For the simplest problem, \(F_1\), there was only one evolved optimiser that beat the general purpose optimisers. For the other problems, many optimisers were found that performed better. This reflects the results in [8], and is perhaps unsurprising given that the capacity to overfit problems is a central motivation for existing work on hyperheuristics. However, an important difference in this paper is the use of random problem transformations during training, since these force the evolved optimisers to handle a more general class of problem instances, preventing them from over-learning specific features of the landscape. The results suggest that this does not affect the ability of evolved optimisers to out-perform general purpose optimisers.

Table 3. Generality of evolved optimisers. For each optimiser, mean errors are shown for 25 optimisation runs on 10D and 30D problems. The mean rank including (and excluding) the problem the optimiser was trained on is also shown, and the best result for each combination of problem dimensionality (D) and fitness evaluation budget (FEs) is underlined for each problem number and ranking.

This ability to out-perform general purpose optimisers on the problem on which they were trained is arguably not that important. Of more interest is how they generalise to larger and different problems. Table 3 gives an insight into this, showing how well the best evolved optimiser for each training problem generalises to larger instances of the same problem and to the other four problems. Mean error rates are shown both for the 10-dimensional problems with the 1E+3 evaluation budget used in training, and for 30-dimensional versions of the same problems with a 1E+4 evaluation budget. First of all, these figures show that the evolved optimisers do not stop progressing after the 1E+3 solution evaluations on which they were trained, since they make significantly more progress on the same problem when given a budget of 1E+4 solution evaluations. Also, it is evident that most of the optimisers generalise well to 30-dimensional versions of the same problem. The best optimisers evolved on the 10D \(F_{12}\), \(F_{13}\) and \(F_{14}\) problems do particularly well in this regard, outperforming CMA-ES and DE on both the 10D and 30D versions of the problems. The \(F_1\) optimiser is the only one which generalises relatively poorly, being beaten by CMA-ES, DE and several of the other optimisers on the 30D version.

The most interesting insight from Table 3 is that many of the optimisers also generalise to other problems. For the 10D, 1E+3 evaluations case, all of the optimisers do better than DE when their average rank is taken across all five problems. More surprisingly, the \(F_{12}\) optimiser does as well as CMA-ES across all problems, despite only having been trained on one of them. Its average rank does drop slightly when \(F_{12}\) is excluded from the calculation, suggesting it does not generalise quite as well as CMA-ES on the 10D problems. However, the figures for the 30D case are even more surprising, with the \(F_{12}\) optimiser doing better across the five problems (even with \(F_{12}\) discounted) than CMA-ES. Also notable is that the \(F_{13}\) optimiser comes first in three out of the five 30D problems, though this is balanced by coming last in the other two. CMA-ES does do slightly better than the \(F_{12}\) optimiser when given a budget of 1E+4 solution evaluations, but the difference is slight, and the best mean error rates for the four most difficult problems are achieved by the evolved optimisers.

Table 4. Evolved Push expressions of best-in-problem optimisers
Fig. 2. Example trajectories of the best-in-problem optimisers (\(F_1\), \(F_9\) & \(F_{12}\) top, \(F_{13}\) & \(F_{14}\) bottom) on the 2D versions of the benchmark problems they were trained on. The global minimum is shown as a black circle. The best point reached by the optimiser is shown as a black cross. Each population member’s trajectory is shown in a separate colour, with each search point shown as a point. Initial search points are surrounded by small coloured circles. The search landscape is shown in the background as a contour plot. (Color figure online)

Table 4 shows the evolved Push expression used by each best-in-problem optimiser, in each case slightly simplified by removing instructions that have no effect on its fitness. Whilst it is difficult to understand their behaviour by looking at these expressions alone, it is usually possible to gain more insight by observing the interpreter’s stack states as they run, and by observing their trajectories on 2D versions of the problems on which they were trained. Figure 2 shows examples of the latter; in almost all cases, the optimisers generalise well to these easier 2D problems, and it can be seen in each case that the global optimum is found. It can also be seen from the trajectories that the behaviours of the five optimisers are quite diverse, and this is reflected in their program-level behaviours:

  • Each particle in the \(F_1\) optimiser looks up the population best and then adds a random vector to this to generate a new search point. Notably, the size of this random vector is determined using a trigonometric expression based on the components of the particle’s current and best search points, meaning that the move size carried out by each particle in the population is different.

  • The \(F_9\) optimiser (which uses only one point of search) continually switches between searching around the best-seen search point and evaluating a random search point. When searching around the best point, at each iteration it adds the sine of the move number to a single dimension, moving along two dimensions each time; in essence, this causes it to systematically explore the nearby search space, building up the space-filling pattern seen in Fig. 2.

  • The \(F_{12}\) optimiser is the most complex, and its behaviour at the instruction level is hard to understand. However, it does use the particle’s index and the index (but not the vector) of the population best, and both the improvement and out-of-bounds Boolean signals to determine each move. By observing its search trajectories, it is evident that it builds up a geometric pattern that causes it to explore moves with a power series distribution—in essence, a novel form of variable neighbourhood search.

  • The \(F_{13}\) optimiser, by comparison, has the simplest program. Each iteration, it adds a random value to one of the dimensions of the best-seen search point, cycling through the dimensions on each subsequent move (which is why it generates a cross-shaped trajectory; a rough paraphrase of this behaviour is sketched after this list). The size of the move (the upper bound of the random value) is determined by both the sine of the objective value of the current point and the sine of the maximum dimension size, the former causing it to vary cyclically as search progresses, and the latter allowing it to adapt the move size to the search area.

  • The \(F_{14}\) optimiser is the only one which uses both a larger population and the vector.between instruction. Each iteration, it uses this to generate a new population of search points half-way between the population best and one of each particle’s previous positions. Interestingly, which previous position is used for a particular particle is determined by its index; the first particle uses its current position, higher numbered particles go back further in time. This may allow backtracking, which could be useful for landscapes that are deceptive and have limited gradient information (such as \(F_{14}\)). A small random vector is added to each half-way point, presumably to inject further diversity.
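As an illustration of how simple some of these behaviours are at the move level, the following Python sketch gives a rough paraphrase of the \(F_{13}\) optimiser's behaviour as described above (our reconstruction of the description, not the evolved Push expression in Table 4; in particular, how the two sine terms are combined is an assumption):

```python
import math
import random

# Rough paraphrase of the described F13 behaviour: perturb one dimension of
# the best-seen point per move, cycling through the dimensions; the move size
# is driven by the sine of the current objective value and the sine of the
# maximum dimension size (combined here as a product for illustration).
def f13_like_move(best_point, current_value, max_dim_size, move_number):
    d = move_number % len(best_point)           # cycle through dimensions
    bound = math.sin(current_value) * math.sin(max_dim_size)
    candidate = list(best_point)
    candidate[d] += random.uniform(0.0, bound)  # random step in one dimension
    return candidate
```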

Fig. 3. Examples of the best evolved optimisers for each problem (top to bottom: \(F_1\), \(F_9\), \(F_{12}\), \(F_{13}\), \(F_{14}\)) applied to each of the other problems (left to right: \(F_1\), \(F_9\), \(F_{12}\), \(F_{13}\), \(F_{14}\)). See caption of Fig. 2 for more information.

Figure 3 shows examples of trajectories when each of these optimisers is applied to 2D versions of the other four problems. These suggest that optimisers may fail to generalise not because of intrinsic assumptions about properties of landscapes, but because they make assumptions about the extent of the search area. For example, the \(F_{9}\) and \(F_{13}\) optimisers appear to fail on the \(F_{14}\) landscape because they make moves, or sample regions, that are only appropriate for a landscape with a much smaller extent. Using a larger range of random scalings during training might help with this.

Fig. 4. Trajectories of other evolved optimisers. One example is shown for each combination of problem (top to bottom: \(F_1\), \(F_9\), \(F_{12}\), \(F_{13}\), \(F_{14}\)) and population size (left to right: 1, 5, 25, 50). See caption of Fig. 2 for more information.

However, these optimisers were not evolved for generality, so the fact that most of them generalise to other problems is a fortunate by-product. Furthermore, it is likely that the optimisers that do best on one problem are not the best in terms of generality. Hence, in practice there is likely to be a benefit to looking at the best optimisers from the other 245 runs depicted in Fig. 1. Figure 4 gives a snapshot of these, showing one example for each combination of training problem and optimiser population size. These illustrate some of the broad diversity seen amongst the solutions. Many of these trajectories look nothing like those of conventional optimisers, so it is likely that interesting ideas about how to do optimisation could be gained by looking at them more closely. Another interesting direction for future work would be to consider ensembles of optimisers. There are many potential ways of doing this; for example, early results suggest that it may be advantageous, in terms of generality, to form a heterogeneous population-based optimiser by combining the best programs from multiple runs.

5 Conclusions

In recent years, there has been a lot of criticism of the ad hoc design of new optimisers through mimicry of natural phenomena. Despite early success with evolutionary algorithms and particle swarm optimisation, this trend has increasingly resulted in optimisers that are technically novel, but which differ in minor and often arbitrary ways from existing optimisers. If we are to create new optimisation algorithms (and the no free lunch theorem [22] suggests a need for diverse optimisers), then perhaps it is better to do this in a more systematic, objective and automated manner. This paper contributes towards this direction of research by investigating the utility of Push GP for exploring the space of optimisers. The results show that Push GP can both discover and express optimisation behaviours that are effective, complex and diverse. Encouragingly, the evolved optimisers scale to problems they did not see during training, and often out-perform general purpose optimisers on these previously unseen problems. The behavioural analysis shows that the evolved optimisers use a diverse range of metaheuristic strategies to explore optimisation landscapes, using behaviours that differ significantly from existing local and population-based optimisers. Furthermore, these are only the tip of the iceberg; the evolved optimiser populations appear to contain broad behavioural diversity, and there are many potential ways of combining diverse optimisers to create ensembles.