Prior to introducing the particular algorithms in Sect. 2.2, the more general foundations of evolution strategies are introduced in Sect. 2.1. To start with, the definition of an optimization task as used throughout this book is given in Sect. 2.1.1. Following [58], Sect. 2.1.2 presents a discussion of evolution strategy metaheuristics as a special case of evolutionary algorithms. In particular, the components of such a metaheuristic—namely recombination, mutation, evaluation and selection—are described in a general way. Due to the particular importance of the mutation operator for evolution strategies (in \({\mathbb{R}}^{n}\)), it is discussed in quite some detail in Sect. 2.1.3.

2.1 Introduction

2.1.1 Optimization

Evolution strategies are particularly well suited (and developed) for nonlinear optimization tasks, which are defined as follows (see e.g. [17], Sect. 18.2.1.1):

$$\displaystyle{ f(\mathbf{x}) =\min !\mbox{ for }\mathbf{x} \in {\mathbb{R}}^{n}\mbox{ where} }$$
(2.1)
$$\displaystyle{ g_{i}(\mathbf{x}) \leq 0,i \in I = \{1,\ldots,m\},h_{j}(\mathbf{x}) = 0,j \in J = \{1,\ldots,r\}, }$$
(2.2)

and the set

$$\displaystyle{ M = \{\mathbf{x} \in {\mathbb{R}}^{n}: g_{ i}(\mathbf{x}) \leq 0,\forall i \in I,h_{j}(\mathbf{x}) = 0,\forall j \in J\} }$$
(2.3)

is called the set of feasible points and it defines the search space of the optimization problem. A point \({\mathbf{x}}^{{\ast}}\in {\mathbb{R}}^{n}\) is called a global minimum, if

$$\displaystyle{ {f}^{{\ast}} = f({\mathbf{x}}^{{\ast}}) \leq f(\mathbf{x})\mbox{ for all }\mathbf{x} \in M }$$
(2.4)

Conversely, it is called a local minimum if the above inequality holds only for all \(\mathbf{x}\) within an ε-neighborhood \(U_{\epsilon }({\mathbf{x}}^{{\ast}}) \subseteq M\) of \({\mathbf{x}}^{{\ast}}\).

Formulating an optimization problem as a minimization task is equivalent to searching for a maximum or for a given target value, since maximization of f can be replaced by minimization of − f, and a target value \(\bar{f}\) can be attained by minimizing \(\rho (\bar{f},f)\) with an arbitrary distance measure ρ.
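Both transformations are one-liners in code; the following sketch (the function names are ours) wraps a given f accordingly:

# Minimal sketch: reduce maximization and target-value search to minimization.
def as_minimization_of_max(f):
    return lambda x: -f(x)                     # maximizing f equals minimizing -f

def as_minimization_of_target(f, f_target):
    return lambda x: abs(f(x) - f_target)      # abs() as one possible distance measure rho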

In this definition of an optimization task it is sufficient for the codomain to be totally ordered, so that the inequality in Eq. 2.4 can be applied. Throughout this book, we will always deal with the codomain \(\mathbb{R}\) only. Moreover, we will not explicitly deal with the handling of constraints (e.g., as defined by Eq. 2.2), and refer the interested reader to Sect. 2.3, where literature references point to state-of-the-art techniques in constraint handling. A special case of constraints are so-called box constraints, as defined below:

$$\displaystyle\begin{array}{rcl} g_{1}(\mathbf{x})& =& \mathbf{l} -\mathbf{x} \leq \mathbf{0}\mbox{ where }\mathbf{l} = {(l_{1},\ldots,l_{n})}^{T} \in {\mathbb{R}}^{n} \\ g_{2}(\mathbf{x})& =& \mathbf{x} -\mathbf{u} \leq \mathbf{0}\mbox{ where }\mathbf{u} = {(u_{1},\ldots,u_{n})}^{T} \in {\mathbb{R}}^{n}{}\end{array}$$
(2.5)

Vectors l and u are called lower and upper bounds, respectively. Box constraints restrict the search space to the hyperrectangle \([l_{1},u_{1}] \times \ldots \times [l_{n},u_{n}]\) and are taken into account for the implementation of algorithms described in this book.
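In an implementation, box constraints are typically enforced by repairing candidate solutions; clipping to the hyperrectangle is one common choice (a minimal numpy sketch; the repair-by-clipping strategy is our illustration, not a prescription of this book):

import numpy as np

def clip_to_box(x, l, u):
    # Repair x so that l <= x <= u holds component-wise.
    return np.minimum(np.maximum(x, l), u)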

In the field of evolutionary algorithms, the vector x is often called the decision vector (and its parameters decision parameters), and its objective function value f(x) is also called the fitness value.

2.1.2 Evolution Strategies as a Specialization of Evolutionary Algorithms

Following [8] and [58], evolution strategies are described here as a specialization of evolutionary algorithms. The general framework of an evolutionary algorithm is presented in Algorithm 2.1.

Algorithm 2.1 General outline of an evolutionary algorithm

Initialization

repeat

    Recombination

    Mutation

    Evaluation

    Selection

until Termination criterion fulfilled

During initialization, the first generation, consisting of one or more individuals, is created, and the fitness of its individuals is evaluated. After initialization, the so-called evolution loop is entered, which consists of the operators recombination, mutation, evaluation and selection. Recombination creates new individuals, also called offspring, from the parent population. Two major types of recombination, dominant and intermediate recombination, are typically distinguished: In dominant recombination, each property of the offspring is inherited from a single parent individual, i.e., this parent's property dominates the corresponding properties of the other parents. In intermediate recombination, the properties of all parents are taken into account, such that, e.g., in the simplest case, their mean value is used.

The mutation operator provides the main source of variation of offspring in an evolution strategy. Based on sampling random variables, properties of individuals are modified. The newly created individuals are then evaluated, i.e., their fitness values are calculated. Based on these fitness values, selection identifies a subset of individuals which form the new population which is used in the next iteration of the evolution loop. The loop is terminated based on a termination criterion set by the user, such as reaching a maximum number of evaluations, reaching a target fitness value, or stagnation of the search process.

According to [58], evolution strategies as a specific instantiation of evolutionary algorithms are characterized by the following four properties:

  • Selection of individuals for recombination is unbiased.

  • Selection is a deterministic process.

  • Mutation operators are parameterized and therefore they can change their properties during optimization.

  • Individuals consist of decision parameters as well as strategy parameters.

The generic framework of an evolutionary algorithm then specializes into a \((\mu /\rho,\kappa,\lambda )\)-ES, as described in detail in Algorithm 2.2. Recombination and mutation are summarized here under the term variation. In addition to the description given in [58] (Algorithm 3), the variation operator of a (μ∕ρ,κ,λ)-ES is defined here by means of a parameter set \(\Psi _{V }\), and the evaluation operator is explicitly mentioned. The population at generation t ≥ 0 is denoted \({P}^{(t)}\) and is a set of individuals. An individual \(p \in {P}^{(t)}\) is a tuple \((\mathbf{x},\Psi )\) with \(\mathbf{x} \in M \subseteq {\mathbb{R}}^{n}\), where M is defined as in Eq. 2.3. The sets \(\Psi \) and \(\Psi _{V }\) are arbitrary finite sets representing the strategy parameters. Since these parameters are modified internally during execution of the algorithm, they are called endogenous strategy parameters. The number of parent individuals is denoted by μ, the number of offspring individuals by λ, and ρ denotes the number of parents taken into account for generating a single offspring by means of recombination. For these parameters, \(\mu,\rho,\lambda \in \mathbb{N}\) and ρ ≤ μ hold.

\(\kappa \in \mathbb{N} \cup \{\infty \}\) represents the largest age which can be reached by any individual in the population. In contrast to the endogenous parameters, μ, ρ, λ and κ are set by the user of the algorithm, which is why they are called exogenous strategy parameters.

The setting of κ has a direct impact on the selection operator. Usually, either κ = 1 (one generation maximum lifetime) or κ = ∞ (infinite maximum lifetime) is used. The former case is also called comma-selection, the latter plus-selection. Using the standard notation of evolution strategies, this is expressed as \((\mu /\rho,\lambda )\)-ES and (μ∕ρ + λ)-ES, so that κ is not explicitly stated any more. Using κ < ∞ requires the condition λ ≥ μ to hold.

Algorithm 2.2 (μ∕ρ,κ,λ)-ES

Initialization of P (0) with μ individuals

\(\forall p \in {P}^{(0)}: p.\Psi.Age \leftarrow 1\), \(p.f \leftarrow f(p.\mathbf{x})\)

t ← 0

repeat

    \({Q}^{(t)} \leftarrow \varnothing \)

    for i = 1 → λ do

      Sample ρ parents \(p_{1},\ldots,p_{\rho } \in {P}^{(t)}\) uniformly at random

      \(q \leftarrow \mbox{ Variation}(p_{1},\ldots,p_{\rho },\Psi _{V })\)

      \(q.\Psi.Age \leftarrow 0\), \(q.f \leftarrow f(q.\mathbf{x})\)

      \({Q}^{(t)} \leftarrow {Q}^{(t)} \cup \{q\}\)

    end for

    \({P}^{(t+1)} \leftarrow \) Selection of the μ best individuals from \({Q}^{(t)} \cup \{p \in {P}^{(t)}: p.\Psi.Age < \kappa \}\)

    Update \(\Psi _{V }\)

    \(\forall p \in {P}^{(t+1)}: p.\Psi.Age \leftarrow p.\Psi.Age + 1\)

    tt + 1

until Termination criterion fulfilled
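Algorithm 2.2 translates almost line by line into code. The following Python sketch uses a placeholder variation operator (intermediate recombination plus isotropic mutation with a fixed σ) purely for illustration; a concrete strategy would substitute the operators of Sects. 2.2.1 and 2.2.2, including the update of \(\Psi _{V }\), which is omitted here:

import numpy as np

def mu_rho_kappa_lambda_es(f, x0, mu=15, rho=2, kappa=np.inf, lam=105,
                           sigma=1.0, max_evals=10000):
    n = len(x0)
    pop = [{'x': x0 + sigma * np.random.randn(n)} for _ in range(mu)]
    for p in pop:
        p['f'], p['age'] = f(p['x']), 1
    evals = mu
    while evals < max_evals:
        offspring = []
        for _ in range(lam):
            parents = [pop[i] for i in np.random.randint(mu, size=rho)]  # unbiased parent sampling
            x = np.mean([p['x'] for p in parents], axis=0)               # intermediate recombination
            x = x + sigma * np.random.randn(n)                           # placeholder mutation
            offspring.append({'x': x, 'f': f(x), 'age': 0})
            evals += 1
        survivors = offspring + [p for p in pop if p['age'] < kappa]     # kappa selects comma/plus
        pop = sorted(survivors, key=lambda p: p['f'])[:mu]               # deterministic selection
        for p in pop:
            p['age'] += 1
    return min(pop, key=lambda p: p['f'])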

2.1.3 Mutation in \({\mathbb{R}}^{n}\)

2.1.3.1 The Multivariate Normal Distribution

In [58], three guiding principles for the design of mutation operators are introduced, namely:

  • Any point of the search space needs to be attainable with probability strictly larger than zero by means of a finite number of applications of mutation.

  • Mutation should be unbiased, which can be achieved by using a maximum entropy distribution.

  • The operator is parameterized, such that the extent of variation can be controlled.

In \({\mathbb{R}}^{n}\), these requirements are fulfilled by a multivariate normal distribution. An n-dimensional random vector X is multivariate normally distributed with expectation \(\bar{\mathbf{x}} \in {\mathbb{R}}^{n}\) and positive definite covariance matrix \(\mathbf{C} \in {\mathbb{R}}^{n\times n}\) if its probability density function is defined according to:

$$\displaystyle{ f_{\mathbf{X}}(\mathbf{x}) = \frac{1} {{(2\pi )}^{\frac{n} {2} }{(\det \mathbf{C})}^{\frac{1} {2} }} \exp \left (-\frac{1} {2}{(\mathbf{x} -\bar{\mathbf{x}})}^{T}{\mathbf{C}}^{-1}(\mathbf{x} -\bar{\mathbf{x}})\right ) }$$
(2.6)

(see p. 86 in [28]). In short notation, this is typically written as \(\mathbf{X} \sim N(\bar{\mathbf{x}},\mathbf{C})\), where \(N(\bar{\mathbf{x}},\mathbf{C})\) denotes the multivariate normal distribution in its general form. In mathematical equations, \(N(\bar{\mathbf{x}},\mathbf{C})\) is sometimes used like a vector, meaning a vector which is actually sampled according to the distribution given. In other words, instead of writing x′ =x +X where XN(0,C), it is also possible to simply write x′ =x + N(0,C).

Due to the positive definiteness of the covariance matrix C, the following eigendecomposition exists (Theorem 1a in [58]):

$$\displaystyle{ \mathbf{C} = \mathbf{B}{\mathbf{D}}^{2}{\mathbf{B}}^{T} }$$
(2.7)

Here, B denotes an orthogonal matrix, the columns of which are the eigenvectors of C. In [29], \(N(\bar{\mathbf{x}},\mathbf{C})\) is reduced to the distribution N(0,I) by means of the eigendecomposition given in Eq. 2.7, according to:

$$\displaystyle{ N(\bar{\mathbf{x}},\mathbf{C}) \sim \bar{\mathbf{x}} + \mathbf{B}\mathbf{D}N(\mathbf{0},\mathbf{I}) }$$
(2.8)
Fig. 2.1 Mutation ellipsoids representing N(0,I), \(N(\mathbf{0},\mbox{ diag}(\boldsymbol{{\delta }}^{2}))\) and N(0,C) (from left to right)

In the field of evolution strategies, the three special cases N(0,I), \(N(\mathbf{0},\mbox{ diag}(\boldsymbol{{\delta }}^{2}))\) and N(0,C) are used for the definition of the most common algorithms. Figure 2.1 provides a sketch of the corresponding mutation ellipsoids, i.e., isolines of the probability density functions, embedded in a hypothetical two-dimensional fitness function.
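Equation 2.8 is also the standard recipe for sampling from \(N(\bar{\mathbf{x}},\mathbf{C})\) in an implementation; a minimal numpy sketch:

import numpy as np

def sample_mvn(x_mean, C, rng=None):
    rng = rng if rng is not None else np.random.default_rng()
    # Eigendecomposition C = B D^2 B^T (Eq. 2.7); eigh returns the eigenvalues D^2.
    eigvals, B = np.linalg.eigh(C)
    D = np.sqrt(eigvals)
    # Eq. 2.8: N(x_mean, C) ~ x_mean + B D N(0, I)
    return x_mean + B @ (D * rng.standard_normal(len(x_mean)))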

The simplest case of generating the mutation x′ from x is based on setting B = I and D = δ ⋅I, with a global step size \(\delta \in {\mathbb{R}}^{+}\), for the matrices B and D used in Eq. 2.8:

$$\displaystyle{ \mathbf{x}\prime = \mathbf{x} + \delta \cdot N(\mathbf{0},\mathbf{I}) }$$
(2.9)

This corresponds to spheres whose radius is determined by δ, as indicated in the left part of Fig. 2.1. This case of an offspring distribution is called isotropic.

To turn the spheres into anisotropic ellipsoids with main axes parallel to the coordinate axes, as shown in the middle of Fig. 2.1, matrix D in Eq. 2.8 must be turned into a diagonal matrix \(\mathbf{D} = \mbox{ diag}(\boldsymbol{\delta })\) with a vector \(\boldsymbol{\delta } = {(\delta _{1},\ldots,\delta _{n})}^{T} \in {\mathbb{R}}^{n}\) of different entries on the main diagonal. As in the previous case, B is the identity matrix:

$$\displaystyle\begin{array}{rcl} \mathbf{x}\prime& =& \mathbf{x} + \mathbf{I}\mbox{ diag}(\boldsymbol{\delta })N(\mathbf{0},\mathbf{I}) \\ & =& \mathbf{x} + N(\mathbf{0},\mbox{ diag}(\boldsymbol{{\delta }}^{2})){}\end{array}$$
(2.10)

The length ratios of the main axes of the mutation ellipsoids depend on the ratios between corresponding components of the vector δ. A rotation of the mutation hyperellipsoids with respect to the coordinate axes, as shown in the rightmost part of Fig. 2.1, is achieved by using a covariance matrix C with off-diagonal entries different from zero. This case is denoted by the term correlated mutation. In contrast to the two previous cases, the matrix B is no longer the identity matrix:

$$\displaystyle\begin{array}{rcl} \mathbf{x}\prime& =& \mathbf{x} + \mathbf{B}\mbox{ diag}(\delta )N(\mathbf{0},\mathbf{I}) \\ & =& \mathbf{x} + \mathbf{B}N(\mathbf{0},\mbox{ diag}(\boldsymbol{{\delta }}^{2})) \\ & =& \mathbf{x} + N(\mathbf{0},\mathbf{C}) {}\end{array}$$
(2.11)

The choice of one of the three cases explained above has a direct impact on the complexity of the endogenous parameters controlling the multivariate normal distribution. In general, if n denotes the dimensionality of the search space, the number of endogenous strategy parameters in the case of Eq. 2.9 is O(1), i.e., constant. In the case of Eq. 2.10, a vector of O(n) endogenous parameters is required, and adaptation of an arbitrary covariance matrix, i.e., a symmetric n × n matrix, according to Eq. 2.11 requires \(O({n}^{2})\) endogenous parameters.

For defining algorithm DR3 in Sect. 2.2.1 and for all algorithms based on the CMA-ES, the so-called line distribution [31] is of special interest: For \(\mathbf{u} \in {\mathbb{R}}^{n}\), the distribution \(N(\mathbf{0},\mathbf{u}{\mathbf{u}}^{T})\) is a multivariate normal distribution with the variance \(\|{\mathbf{u}\|}^{2}\) in the direction of the vector u. It is the normal distribution with highest probability of generating u.
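Sampling from the line distribution requires no matrix at all: a sample is the vector u scaled by a scalar standard normal variate, as in this brief sketch:

import numpy as np

u = np.array([2.0, -1.0, 0.5])
z = np.random.standard_normal()   # scalar N(0, 1)
sample = z * u                    # distributed as N(0, u u^T): all mass on the line through u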

2.1.3.2 Relationship Between Covariance Matrix and Hessian

In the previous section, using a multivariate normal distribution was motivated by certain requirements which should hold for the mutation operator. In this section, we will clarify why it is useful to use an arbitrary covariance matrix, as in Eq. 2.11, for adaptation.

Any twice differentiable function \(f: {\mathbb{R}}^{n} \rightarrow \mathbb{R}\) can be approximated by a Taylor series expansion in the vicinity of a position \(\mathbf{\tilde{x}} \in {\mathbb{R}}^{n}\). Cutting off the Taylor series after the quadratic term, the following approximation is obtained:

$$\displaystyle{ f(\mathbf{x}) \approx f(\mathbf{\tilde{x}}) + {(\mathbf{x} -\mathbf{\tilde{x}})}^{T}\nabla f(\mathbf{\tilde{x}}) + \frac{1} {2}{(\mathbf{x} -\mathbf{\tilde{x}})}^{T}{\nabla }^{2}f(\mathbf{\tilde{x}})(\mathbf{x} -\mathbf{\tilde{x}}) }$$
(2.12)

Here, \(\nabla f(\mathbf{\tilde{x}})\) denotes the gradient, and \({\nabla }^{2}f(\mathbf{\tilde{x}})\) is the symmetric, positive definite Hessian, denoted by H in the following. For a quadratic function f, the Taylor series expansion is exact, and H contains information about the shape of the isolines of f. In general, these are ellipsoids, as shown in the rightmost part of Fig. 2.1. Hansen describes the relationship between the Hessian H and the covariance matrix C of a distribution N(0,C) informally [29]. It is argued that using \(\mathbf{C} ={ \mathbf{H}}^{-1}\) for optimizing a quadratic function is equivalent to using C =I for optimizing an isotropic function, such as the sphere function \(f(\mathbf{x}) = \frac{1} {2}{\mathbf{x}}^{T}\mathbf{x}\).

In other words: Adapting an arbitrary covariance matrix simplifies the optimization by transforming the objective function into an isotropic function. A more formal description of this topic can be found in Rudolph’s work, e.g., in the section Advanced Adaptation Techniques in \({\mathbb{R}}^{n}\) in [58], and also in [55].
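The equivalence can be checked numerically: with \(\mathbf{C} ={ \mathbf{H}}^{-1}\), a sampled step y satisfies \(\frac{1} {2}{\mathbf{y}}^{T}\mathbf{H}\mathbf{y} = \frac{1} {2}{\mathbf{z}}^{T}\mathbf{z}\) for the underlying isotropic sample z. A small sketch (the concrete H is an arbitrary example of ours):

import numpy as np

H = np.array([[4.0, 1.0], [1.0, 2.0]])    # example Hessian (symmetric, positive definite)
f = lambda x: 0.5 * x @ H @ x

A = np.linalg.cholesky(np.linalg.inv(H))  # C = H^{-1} = A A^T
z = np.random.standard_normal(2)
y = A @ z                                 # y ~ N(0, H^{-1})
print(f(y), 0.5 * z @ z)                  # identical: f behaves like the sphere function on z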

2.2 Algorithms

This section contains descriptions of the key variants of evolution strategies in chronological order of their publication. On a high level, we differentiate between the two main Sects. 2.2.1 and 2.2.2, the first covering the time frame from 1964 until 1996.

This first Sect. 2.2.1 describes five main algorithms, namely, the (1+1)-ES as the historically first version of an evolution strategy and the (μ,λ)-MSC-ES (in [58] also called CORR-ES) as the first evolution strategy which adapts an arbitrary covariance matrix (see Sect. 2.1.3 for an explanation). The first derandomized algorithm variants, DR1, DR2, and DR3, complete this selection of older variants of evolution strategies. Their choice is motivated by the fact that they are derandomization steps towards the CMA-ES (see also [63]).

The second main Sect. 2.2.2 describes modern evolution strategies, a term which is used in this book to denote the CMA-ES and algorithms based on it. This distinction might seem somewhat arbitrary, but in fact the development of the CMA-ES defined a turning point in the history of evolution strategies, for two main reasons: First, the CMA-ES is the first algorithm which adapts a covariance matrix in a completely derandomized way. Second, the CMA-ES is seen by many authors as the state of the art in evolution strategies (e.g., [6, 13, 15, 26, 35, 58, 63], and [66]).

2.2.1 From the (1+1)-ES to the CMA-ES

2.2.1.1 (1+1)-ES

The foundation of the first evolution strategy was laid in the 1960s at the Technical University of Berlin by three students, namely Hans-Paul Schwefel, Ingo Rechenberg, and Peter Bienert. As described in [8] or [58], standard methods for solving black-box optimization problems, such as gradient-based methods (see [44]), were not able to deliver satisfactory solution quality for certain optimization problems in engineering applications. Inspired by lectures about biological evolution, they aimed at developing a solution method based on principles of variation and selection. In its first version, a very simple evolution loop without any endogenous parameters was used [59]. This algorithm generates a single offspring \(\mathbf{x}\prime = \mathbf{x} + {(N_{1}(0,\sigma ),\ldots,N_{n}(0,\sigma ))}^{T} = \mathbf{x} + \sigma \cdot N(\mathbf{0},\mathbf{I})\) from a single parent individual \(\mathbf{x} \in {\mathbb{R}}^{n}\). If the offspring performs better than its parent (in terms of fitness), it becomes the new parent. Otherwise, the parent remains. The standard deviation σ of the normal distribution was a fixed scalar value.

According to [53], by pure luck the value of σ was chosen in a way that made this first approach towards a (1+1)-ES successful. Only later on was the necessary step size adaptation added to the algorithm [52]. Based on two fitness functions, the so-called corridor model and the so-called sphere model, a theoretical result was derived for introducing step size adaptation: Maximum convergence velocity (i.e., speed of progress of the optimization) is achieved when about 1/5 of all mutations are successful, i.e., improvements over their parent. This insight led to the development of the so-called 1/5-success rule for step size adaptation. If about 1/5 of all mutations are successful, the step size is optimal and no adaptation is required. If the success rate falls below 1/5, the step size needs to be reduced; if it grows above 1/5, the step size needs to be increased. To obtain the new step size \(\sigma \prime = \sigma \cdot {c}^{\pm 1}\), the previous σ is decreased or increased, respectively, by multiplication with or division by a constant c with 0.817 ≤ c ≤ 1. The recommended value of c = 0.817 was derived by Schwefel from theoretical arguments about the speed of step size adaptation [61]. The step size adaptation according to the above rule is applied every n iterations of the algorithm, and the success rate \(p_{S}\) is measured over a sliding window of the last 10 ⋅ n mutations [8]. The pseudocode of the (1+1)-ES according to [8] is shown in Algorithm 2.3.

Algorithm 2.3 (1+1)-ES

\(P_{0} \leftarrow \{\mathbf{x}\}\)

\(\phi \leftarrow f(\mathbf{x})\)

\(p_{S} \leftarrow 0\)

initialize archive A for storing successful mutations

\(t \leftarrow 0\)

repeat

    \(t \leftarrow t + 1\)

    \(\mathbf{x}\prime \leftarrow \mathbf{x} + \sigma \cdot \mathbf{N}(\mathbf{0},\mathbf{I})\)

    \(\phi \prime \leftarrow f(\mathbf{x}\prime)\)

    if  \(\phi \prime < \phi \)  then

      \(\mathbf{x} \leftarrow \mathbf{x}\prime\)

      \(\phi \leftarrow \phi \prime\)

      store success in A

    else

      store failure in A

    end if

    \(P_{t} \leftarrow \{\mathbf{x}\}\)

    if \(t\mod n = 0\) then

      get \(\#\mathit{successes}\ and\ \#\mathit{failures}\) from at most 10n entries in A

      \(p_{S} = \frac{\#\mathit{successes}} {\#\mathit{successes}+\#\mathit{failures}}\)

      \(\sigma \leftarrow \left \{\begin{array}{l} \sigma \cdot c\mbox{ if }p_{S} < 1/5 \\ \sigma /c\mbox{ if }p_{S} > 1/5 \\ \sigma \mbox{ if }p_{S} = 1/5\end{array} \right.\)

    end if

until termination criterion fulfilled
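A compact Python rendition of Algorithm 2.3 follows; the initial step size and the evaluation budget are implementation choices of ours:

import numpy as np
from collections import deque

def one_plus_one_es(f, x, sigma=1.0, c=0.817, max_evals=10000):
    n = len(x)
    phi = f(x)
    A = deque(maxlen=10 * n)                 # sliding window of success flags
    for t in range(1, max_evals):
        x_new = x + sigma * np.random.standard_normal(n)
        phi_new = f(x_new)
        success = phi_new < phi
        if success:
            x, phi = x_new, phi_new
        A.append(success)
        if t % n == 0:                       # apply the 1/5-success rule every n iterations
            p_s = sum(A) / len(A)
            if p_s < 1 / 5:
                sigma *= c                   # too few successes: shrink the steps
            elif p_s > 1 / 5:
                sigma /= c                   # too many successes: enlarge the steps
    return x, phi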

2.2.1.2 (μ,λ)-MSC-ES

The (μ,λ)-MSC-ES was the very first evolution strategy capable of adapting an arbitrary covariance matrix. The algorithm was developed by Schwefel [62] and is also called (μ,λ)-CORR-ES [58]. In this strategy, the covariance matrix is obtained as a product of n(n − 1)∕2 rotation matrices, where a single rotation matrix R ij for a rotation angle \(\alpha _{\mathit{ij}}\) between axis i and axis j, with \(i,j \in \{1,\ldots,n\}\) and i ≠ j, is given by an identity matrix extended by the entries \(R(i,i) = R(j,j) =\cos \alpha _{\mathit{ij}}\) and \(R(i,j) = -R(j,i) = -\sin \alpha _{\mathit{ij}}\).
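The product of these elementary rotations is easy to form explicitly; a minimal numpy sketch of the matrix used in the mutation step of Algorithm 2.4:

import numpy as np

def rotation_product(alpha):
    # alpha: (n, n) array holding the angles alpha[i, j] for i < j.
    n = alpha.shape[0]
    T = np.eye(n)
    for i in range(n - 1):
        for j in range(i + 1, n):
            R = np.eye(n)
            R[i, i] = R[j, j] = np.cos(alpha[i, j])
            R[i, j] = -np.sin(alpha[i, j])   # R(i,j) = -R(j,i) = -sin(alpha_ij)
            R[j, i] = np.sin(alpha[i, j])
            T = T @ R
    return T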

Indeed, this method is able to generate arbitrary correlated mutations, as proven by Rudolph [55]. In the framework of the (μ,λ)-MSC-ES, endogenous strategy parameters are modified by means of the so-called self-adaptation principle. For self-adaptation, an individual consists not only of the decision parameters x, but also contains an additional vector \(\sigma \in \mathbb{R}_{+}^{n}\) of step sizes and a vector \(\alpha \in (-\pi,\pi {]}^{n(n-1)/2}\) of rotation angles. The underlying idea of mutative step size adaptation is based on the assumption that individuals with good settings of the strategy parameters tend to generate good offspring, such that the good strategy parameters survive selection. Recombination of decision parameters and endogenous strategy parameters is performed through global intermediary recombination, i.e., by averaging over all of the μ parents. Concerning the exogenous strategy parameters, the local and global learning rates τ and τ′ need to be set. Following [8], after Schwefel [61], the settings \(\tau = \frac{1} {\sqrt{2\sqrt{n}}}\) and \(\tau \prime = \frac{1} {2\sqrt{n}}\) are recommended, depending only on the problem dimensionality n (in Algorithm 2.4, β denotes the fixed mutation parameter for the rotation angles). Pseudocode of the (μ,λ)-MSC-ES is provided in Algorithm 2.4. Concerning the population sizes, we are using μ = 15 and \(\lambda = 7 \cdot \mu = 105\) throughout this book, close to the recommendations in [63].

Algorithm 2.4 (μ,λ)-MSC-ES

initialize population

\({P}^{(0)} \leftarrow \{(\mathbf{x}_{1},\sigma _{1},\alpha _{1}),\ldots,(\mathbf{x}_{\mu },\sigma _{\mu },\alpha _{\mu })\}\)

\(t \leftarrow 0\)

repeat

    \(t \leftarrow t + 1\)

    // recombination

    \(\bar{x} \leftarrow \frac{1} {\mu }\sum _{i=1}^{\mu }\mathbf{x}_{ i}\)

    \(\bar{\sigma } \leftarrow \frac{1} {\mu }\sum _{i=1}^{\mu }\sigma _{ i}\)

    \(\bar{\alpha } \leftarrow \frac{1} {\mu }\sum _{i=1}^{\mu }\alpha _{ i}\)

    for \(i = 1 \rightarrow \lambda \) do

      // mutation

      \(\eta \leftarrow \tau \prime \cdot N(0,1)\)

      \(\sigma _{i} \leftarrow \bar{ \sigma } \cdot \exp \left (\eta + \tau \cdot N(\mathbf{0},\mathbf{I})\right )\)

      \(\alpha _{i} \leftarrow \bar{ \alpha } + \beta \cdot N(\mathbf{0},\mathbf{I})\)

      \(\mathbf{C} \leftarrow \prod _{k=1}^{n-1}\prod _{l=k+1}^{n}R_{\mathit{kl}}\)

      \(\mathbf{x}_{i} \leftarrow \mathbf{\bar{x}} + \mathbf{C} \cdot \sigma _{i} \cdot N(\mathbf{0},\mathbf{I})\)

      // evaluation

      \(\phi _{i} \leftarrow f(\mathbf{x}_{i})\)

    end for

    // selection

    P (t) are the μ best \((\mathbf{x}_{i},\sigma _{i},\alpha _{i})\) from 1 ≤ i ≤ λ

until termination criterion fulfilled

2.2.1.3 DR1

The (μ,λ)-MSC-ES as described in the previous section is based on mutative self-adaptation of the individual step sizes. However, as Ostermeier et al. [47] claim, self-adaptation of individual step sizes is not possible in the case of small population sizes, and they identify two key reasons: First, a successful mutation of the decision parameters is not necessarily caused by a good step size, but can also be due to an advantageous instantiation of the normally distributed random vector (i.e., a lucky sample). Second, there is a conflict between the goals of maintaining a large variance of step sizes within one generation and avoiding too large fluctuations of step sizes between successive generations. The first derandomized evolution strategy, abbreviated DR1, solves the first problem by using the length of the most successful mutation step within one generation (i.e., the one that yielded the best offspring) for controlling step size adaptation [47]. The second problem is solved by using a factor \(\xi \in \{\frac{5} {7}, \frac{7} {5}\}\) to provide sufficient variance of step sizes within one generation, and by damping this factor through an exponent β with 0 < β < 1 for step size adaptation, to reduce undesired fluctuations [47]. An offspring x′ of a parent x is then generated as follows:

$$\displaystyle{ \mathbf{x}\prime = \mathbf{x} + \xi \cdot \delta \otimes \mathbf{z}\mbox{ where }\mathbf{z} = N(\mathbf{0},\mathbf{I}) }$$

Adaptation of step sizes δ is based on the most successful z (i.e., the normally distributed vector sample which generated the best offspring during this generation), which is first transformed as follows:

$$\displaystyle{ \xi _{\mathbf{z}} ={ \left (\exp \left (\vert z_{1}\vert -\sqrt{2/\pi }\right ),\ldots,\exp \left (\vert z_{n}\vert -\sqrt{2/\pi }\right )\right )}^{T} }$$

Combined with the exponents β and \(\beta _{\mathit{scal}} \in \mathbb{R}\) for damping the adaptation, as well as ξ and \(\xi _{\mathbf{z}}\) of the best mutation, the new step sizes δ′ are obtained as follows:

$$\displaystyle{ \delta \prime ={ \left (\xi \right )}^{\beta } \cdot {\left (\xi _{\mathbf{ z}}\right )}^{\beta _{\mathit{scal}} } \otimes \delta }$$

Pseudocode of the DR1 evolution strategy is given in Algorithm 2.5. Concerning the offspring population size λ, a constant setting of λ = 10, independent of the dimensionality n, was used in [47]. The DR1 algorithm is based on a single parent individual (μ = 1) and is therefore sometimes also denoted as (1,10)-DR1-ES. Ostermeier et al. [47] recommend the following values for the exponents β and \(\beta _{\mathit{scal}}\):

$$\displaystyle\begin{array}{rcl} \beta & =& \sqrt{1/n} {}\\ \beta _{\mathit{scal}}& =& 1/n {}\\ \end{array}$$

Algorithm 2.5 DR1

initialize x, \(\boldsymbol{\delta } \leftarrow {(1,\ldots,1)}^{T}\)

t ← 0

repeat

    tt + 1

    for i = 1 → λ do

      \(\mathbf{z}_{i} \leftarrow N(\mathbf{0},\mathbf{I})\)

      \(\mathbf{x}_{i} \leftarrow \mathbf{x} + \xi _{i} \cdot \boldsymbol{ \delta } \otimes \mathbf{z}_{i}\) where \(P(\xi _{i} = \frac{5} {7}) = P(\xi _{i} = \frac{7} {5}) = \frac{1} {2}\)

      \(\phi _{i} \leftarrow f(\mathbf{x}_{i})\)

    end for

    \(\mathit{sel} \leftarrow i\) with best value of ϕ i

    \(\mathbf{x} \leftarrow \mathbf{x}_{\mathit{sel}}\)

    \(\xi _{\mathbf{z}_{\mathit{sel}}} ={ \left (\exp \left (\vert z_{\mathit{sel}_{1}}\vert -\sqrt{2/\pi }\right ),\ldots,\exp \left (\vert z_{\mathit{sel}_{n}}\vert -\sqrt{2/\pi }\right )\right )}^{T}\)

    \(\delta \leftarrow {\left (\xi _{\mathit{sel}}\right )}^{\beta }{\left (\xi _{\mathbf{z}_{\mathit{sel}}}\right )}^{\beta _{\mathit{scal}}} \otimes \delta \)

until termination criterion fulfilled
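A direct Python transcription of Algorithm 2.5 (initialization of x and the generation budget are left to the caller):

import numpy as np

def dr1(f, x, lam=10, max_gens=1000):
    n = len(x)
    beta, beta_scal = np.sqrt(1 / n), 1 / n
    delta = np.ones(n)
    for _ in range(max_gens):
        xi = np.random.choice([5 / 7, 7 / 5], size=lam)
        z = np.random.standard_normal((lam, n))
        X = x + xi[:, None] * delta * z                    # component-wise mutation
        phi = np.array([f(row) for row in X])
        sel = np.argmin(phi)
        x = X[sel]
        xi_z = np.exp(np.abs(z[sel]) - np.sqrt(2 / np.pi))
        delta = xi[sel] ** beta * xi_z ** beta_scal * delta
    return x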

2.2.1.4 DR2

The DR2 evolution strategy represents the next step of derandomization for evolution strategies [48]. The creation of an offspring by mutation is parameterized by a global step size δ and local step sizes \(\boldsymbol{\delta }_{\mathit{scal}} \in {\mathbb{R}}^{n}\):

$$\displaystyle{ \mathbf{x}\prime = \mathbf{x} + \delta \cdot \boldsymbol{ \delta }_{\mathit{scal}} \otimes \mathbf{z}\mbox{ where }\mathbf{z} = N(\mathbf{0},\mathbf{I}) }$$

As in DR1, adaptation of step sizes is based on the most successful z. However, in addition to information about the most successful mutation of the current generation, the most successful mutation steps of previous generations are also taken into account, thereby accumulating information over generations. The accumulation takes place in a vector \(\boldsymbol{\zeta } \in {\mathbb{R}}^{n}\), using a factor c ∈ (0,1] to control the weight of previous generations in contrast to the current one:

$$\displaystyle{ \boldsymbol{\zeta }\prime = (1 - c) \cdot \boldsymbol{ \zeta } + c \cdot \mathbf{z}_{\mathit{sel}} }$$
(2.13)

Adaptation of step sizes δ and \(\boldsymbol{\delta }_{\mathit{scal}}\) is then based on the updated mutation path \(\boldsymbol{\zeta }\prime\):

$$\displaystyle\begin{array}{rcl} \delta \prime& =& \delta \cdot {\left (\exp \left ( \frac{\|\boldsymbol{\zeta }\prime\|} {\sqrt{n}\sqrt{ \frac{c} {2-c}}} - 1 + \frac{1} {5n}\right )\right )}^{\beta } {}\\ \boldsymbol{\delta }_{\mathit{scal}_{i}}\prime& =& \boldsymbol{\delta }_{\mathit{scal}_{i}} \cdot {\left ( \frac{\vert \boldsymbol{\zeta }_{i}\prime\vert } {\sqrt{ \frac{c} {2-c}}} + \frac{7} {20}\right )}^{\beta _{\mathit{scal}} }\forall i \in \{1,\ldots,n\} {}\\ \end{array}$$

Standard settings for the exponents β and \(\beta _{\mathit{scal}}\) as well as the parameter c are as follows:

$$\displaystyle\begin{array}{rcl} \beta & =& \sqrt{1/n} {}\\ \beta _{\mathit{scal}}& =& 1/n {}\\ c& =& \sqrt{1/n} {}\\ \end{array}$$

The pseudocode of the DR2 evolution strategy is given in Algorithm 2.6.

Algorithm 2.6 DR2

initialize x, \(\boldsymbol{\zeta } \leftarrow \mathbf{0}\), \(\delta \leftarrow 1\), \(\boldsymbol{\delta }_{\mathit{scal}} \leftarrow {(1,\ldots,1)}^{T}\)

\(t \leftarrow 0\)

repeat

    \(t \leftarrow t + 1\)

    for i = 1 → λ do

      \(\mathbf{z}_{i} \leftarrow N(\mathbf{0},\mathbf{I})\)

      \(\mathbf{x}_{i} \leftarrow \mathbf{x} + \delta \cdot \boldsymbol{ \delta }_{\mathit{scal}} \otimes \mathbf{z}_{i}\)

      \(\phi _{i} \leftarrow f(\mathbf{x}_{i})\)

    end for

    \(\mathit{sel} \leftarrow i\) with best value of ϕ i

    \(\boldsymbol{\zeta }\prime \leftarrow (1 - c) \cdot \boldsymbol{ \zeta } + c \cdot \mathbf{z}_{\mathit{sel}}\)

    \(\delta \prime \leftarrow \delta \cdot {\left (\exp \left ( \frac{\|\boldsymbol{\zeta }\prime\|} {\sqrt{n}\cdot \sqrt{ \frac{c} {2-c}}} - 1 + \frac{1} {5n}\right )\right )}^{\beta }\)

    \(\boldsymbol{\delta }\prime_{\mathit{scal}} \leftarrow \boldsymbol{ \delta }_{\mathit{scal}} \otimes {\left ( \frac{\vert \boldsymbol{\zeta }\prime_{i}\vert } {\sqrt{ \frac{c} {2-c}}} + \frac{7} {20}\right )}^{\beta _{\mathit{scal}}}\)

    \(\mathbf{x} \leftarrow \mathbf{x}_{\mathit{sel}}\)

    \(\boldsymbol{\zeta } \leftarrow \boldsymbol{ \zeta }\prime\)

    \(\delta \leftarrow \delta \prime\)

    \(\boldsymbol{\delta }_{\mathit{scal}} \leftarrow \boldsymbol{ \delta }\prime_{\mathit{scal}}\)

until termination criterion fulfilled
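The corresponding Python sketch of Algorithm 2.6, with the accumulated mutation path ζ at its core:

import numpy as np

def dr2(f, x, lam=10, max_gens=1000):
    n = len(x)
    beta, beta_scal, c = np.sqrt(1 / n), 1 / n, np.sqrt(1 / n)
    zeta = np.zeros(n)
    delta, delta_scal = 1.0, np.ones(n)
    norm = np.sqrt(c / (2 - c))                     # expected scale of the accumulated path
    for _ in range(max_gens):
        z = np.random.standard_normal((lam, n))
        X = x + delta * delta_scal * z
        phi = np.array([f(row) for row in X])
        sel = np.argmin(phi)
        zeta = (1 - c) * zeta + c * z[sel]          # Eq. 2.13
        delta *= np.exp(np.linalg.norm(zeta) / (np.sqrt(n) * norm) - 1 + 1 / (5 * n)) ** beta
        delta_scal *= (np.abs(zeta) / norm + 7 / 20) ** beta_scal
        x = X[sel]
    return x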

2.2.1.5 DR3

The DR3 evolution strategy [33], also called (1,λ)-GSA-ES (generating set adaptation), is able to generate mutations according to an arbitrary multivariate normal distribution, corresponding to the adaptation of an arbitrary covariance matrix according to Eq. 2.11. This process is not based on explicitly using a covariance matrix, but on transforming an isotropic random vector \(\mathbf{z} = N(\mathbf{0},\mathbf{I})\) into a correlated random vector y by multiplication with a matrix \(\mathbf{B} = \left (\mathbf{b}_{1},\ldots,\mathbf{b}_{m}\right ) \in {\mathbb{R}}^{n\times m}\).

As described in Sect. 2.1.3, this can be interpreted as a superposition of multiple line distributions. For the number m of column vectors, \({n}^{2} \leq m \leq 2{n}^{2}\) holds, with a smaller value of m providing a faster and a larger value of m a more accurate adaptation. As in DR1, a factor \(\xi \in \{\frac{2} {3}, \frac{3} {2}\}\) with \(P(\xi = 2/3) = P(\xi = 3/2) = 1/2\) is used for the variation of the global step size \(\delta \in \mathbb{R}\). To guarantee an approximately constant length of the column vectors in B, y is scaled by a factor \(c_{m}\). Based on its parent x, an offspring is then created as follows:

$$\displaystyle{ \mathbf{x}\prime = \mathbf{x} + \delta \cdot \xi \cdot \mathbf{y}\mbox{ where }\mathbf{y} = c_{m} \cdot \mathbf{B}N(\mathbf{0},\mathbf{I}) }$$

The adaptation of endogenous strategy parameters is based on the selected y sel and ξ sel . The column vectors of matrix B are updated according to:

$$\displaystyle\begin{array}{rcl} \mathbf{b}_{1}\prime& =& (1 - c) \cdot \mathbf{b}_{1} + c \cdot (c_{u}\xi _{\mathit{sel}}\mathbf{y}_{\mathit{sel}}) {}\\ \mathbf{b}_{i+1}\prime& =& \mathbf{b}_{i}\mbox{ }\forall i \in \{1,\ldots,m - 1\} {}\\ \end{array}$$

Like with the previous versions of derandomized evolution strategies, the global step size δ is adapted based on the selected ξ sel , by using a damping exponent β:

$$\displaystyle{ \delta \prime = \delta \cdot {\left (\xi _{\mathit{sel}}\right )}^{\beta } }$$

For the exogenous parameters, the standard settings are given in [33] as follows:

$$\displaystyle\begin{array}{rcl} c& =& \sqrt{1/n} {}\\ \beta & =& \sqrt{1/n} {}\\ m& =& \frac{3} {2}{n}^{2} {}\\ c_{m}& =& (1/\sqrt{m})(1 + 1/m) {}\\ c_{u}& =& \sqrt{(2 - c)/c} {}\\ \lambda & =& 10 {}\\ \end{array}$$

The corresponding pseudocode of the DR3 evolution strategy is provided in Algorithm 2.7.

Algorithm 2.7 DR3

initialize x, δ, \(\mathbf{B} \leftarrow \left (\mathbf{0},N\left (\mathbf{0},(1/n)\mathbf{I}\right )\right ) \in {\mathbb{R}}^{n\times m}\)

t ← 0

repeat

    \(t \leftarrow t + 1\)

    for i = 1 → λ do

      \(\mathbf{z}_{i} \leftarrow N(\mathbf{0},\mathbf{I})\mbox{ where }\mathbf{z}_{i} \in {\mathbb{R}}^{m}\)

      \(\mathbf{y}_{i} \leftarrow c_{m} \cdot \mathbf{B}\mathbf{z}_{i}\)

      \(\mathbf{x}_{i} \leftarrow \mathbf{x} + \delta \cdot \xi _{i} \cdot \mathbf{y}_{i}\mbox{ where }P(\xi _{i} = 2/3) = P(\xi _{i} = 3/2) = 1/2\)

      \(\phi _{i} \leftarrow f(\mathbf{x}_{i})\)

    end for

    \(\mathit{sel} \leftarrow i\) with best value of ϕ i

    \(\mathbf{b} \leftarrow (1 - c) \cdot \mathbf{b}_{1} + c \cdot (c_{u}\xi _{\mathit{sel}}\mathbf{y}_{\mathit{sel}})\)

    \(\delta \prime \leftarrow \delta \cdot {\left (\xi _{\mathit{sel}}\right )}^{\beta }\)

    \(\mathbf{B}\prime \leftarrow (\mathbf{b},\mathbf{b}_{1},\ldots,\mathbf{b}_{m-1})\)

    \(\mathbf{x} \leftarrow \mathbf{x}_{\mathit{sel}}\), \(\delta \leftarrow \delta \prime\) and \(\mathbf{B} \leftarrow \mathbf{B}\prime\)

until termination criterion fulfilled
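The core of one DR3 generation in numpy; B is stored as an n × m matrix whose columns are shifted by the update:

import numpy as np

def dr3_step(f, x, delta, B, c, c_m, c_u, beta, lam=10):
    n, m = B.shape
    Z = np.random.standard_normal((lam, m))
    xi = np.random.choice([2 / 3, 3 / 2], size=lam)
    Y = (c_m * (B @ Z.T)).T                          # correlated mutation directions y_i
    X = x + delta * xi[:, None] * Y
    phi = np.array([f(row) for row in X])
    sel = np.argmin(phi)
    b_new = (1 - c) * B[:, 0] + c * (c_u * xi[sel] * Y[sel])
    B = np.column_stack([b_new, B[:, :-1]])          # shift: b'_{i+1} = b_i
    return X[sel], delta * xi[sel] ** beta, B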

2.2.2 Modern Evolution Strategies

2.2.2.1 (μ W ,λ)-CMA-ES

Algorithms DR1, DR2 and DR3, as described in Sect. 2.2.1, are derandomized evolution strategies in the sense of adapting endogenous strategy parameters depending on the selected mutation vector. This has also been called the first level of derandomization [63]. In addition, the second level of derandomization aims at the following goals [63]:

  • Increase the probability of generating the same mutation step again.

  • Provide a direct control mechanism for the rate of change of strategy parameters.

  • Keep the strategy parameters unchanged in case of random selection.

The so-called CMA-ES, as introduced in [31], meets these goals by means of two techniques, namely covariance matrix adaptation (CMA) and cumulative step size adaptation (CSA), the latter for adapting a global step size. The description of the CMA-ES as provided in [31] focuses on explaining these two techniques, and recombination in the case of μ > 1 is not discussed at all. Therefore, we will discuss the CMA-ES in this section as a (μ W ,λ)-CMA-ES with weighted intermediary recombination, as described in [29] and [32]. Using the notation for evolution strategies as introduced in Sect. 2.1.2, the algorithm ought to be denoted more precisely as (μ∕μ W ,λ)-CMA-ES, with the index W denoting the weighted recombination. However, the simplified notation is motivated by arguing that the notation μ∕μ W suggests two different numbers (μ and μ W ), although it is μ in both cases. Here, we adopt the simplified notation and denote the CMA-ES with weighted recombination as (μ W ,λ)-CMA-ES.

Based on a parent x, an offspring x′ is then generated as follows:

$$\displaystyle{ \mathbf{x}\prime = \mathbf{x} + \sigma \mathbf{B}\mathbf{D}\mathbf{z}\mbox{ where }\mathbf{z} = N(\mathbf{0},\mathbf{I}) }$$

Matrices B and D result from an eigendecomposition of the covariance matrix C according to Eq. 2.7, and \(\sigma \in \mathbb{R}\) denotes the global step size. After generating and evaluating an offspring population of size λ according to this mutation operator, the μ best individuals of the offspring population are selected and undergo weighted intermediary recombination.

Weighted intermediary recombination is a generalization of classical global intermediary recombination. It is based on using μ weights \(w_{1} \geq w_{2} \geq \ldots \geq w_{\mu }\) with \(\sum _{i=1}^{\mu }w_{i} = 1\) for generating the new parent \(\langle \mathbf{x}\rangle\) and the best mutation step \(\langle \mathbf{y}\rangle\) as weighted averages:

$$\displaystyle\begin{array}{rcl} \langle \mathbf{x}\rangle & =& \sum _{i=1}^{\mu }w_{ i}\mathbf{x}_{i:\lambda } {}\\ \langle \mathbf{y}\rangle & =& \sum _{i=1}^{\mu }w_{ i}\mathbf{B}\mathbf{D}\mathbf{z}_{i:\lambda } {}\\ \end{array}$$

For adapting the strategy parameters, the so-called variance effective selection mass μ eff is required:

$$\displaystyle{ \mu _{\mathit{eff }} ={ \left (\sum _{i=1}^{\mu }w_{ i}^{2}\right )}^{-1} }$$

According to [29], \(1 \leq \mu _{\mathit{eff }} \leq \mu \) holds, and for identical weights \(w_{i} = \frac{1} {\mu }\) (\(\forall i \in \{1,\ldots,\mu \}\)): μ eff = μ. In analogy with Eq. 2.13 for DR2, the strategy parameter adaptation techniques, CMA and CSA, use so-called evolution paths for accumulating strategy parameter information across several generations. The (μ W ,λ)-CMA-ES uses two evolution paths, p c for the adaptation of the covariance matrix and p σ for global step size adaptation. The evolution paths are updated as follows:

$$\displaystyle\begin{array}{rcl} \mathbf{p}_{c}\prime& =& (1 - c_{c}) \cdot \mathbf{p}_{c} + h_{\sigma }\sqrt{c_{c } (2 - c_{c } )\mu _{\mathit{eff }}}\langle \mathbf{y}\rangle {}\\ \mathbf{p}_{\sigma }\prime& =& (1 - c_{\sigma }) \cdot \mathbf{p}_{\sigma } + \sqrt{c_{\sigma } (2 - c_{\sigma } )\mu _{\mathit{eff }}}\mathbf{B}{\mathbf{D}}^{-1}{\mathbf{B}}^{T}\langle \mathbf{y}\rangle {}\\ \end{array}$$

For updating p c , the function h σ is used, which is defined according to:

$$\displaystyle{ h_{\sigma } = \left \{\begin{array}{@{}l@{\quad }l@{}} 1\quad &\mbox{ if } \frac{\|\mathbf{p}_{\sigma }\|} {\sqrt{1-{(1-c_{\sigma } )}^{2(t+1)}}} < \left (\frac{7} {5} + \frac{2} {n+1}\right )E(\|N(\mathbf{0},\mathbf{I})\|) \\ 0\quad &\mbox{ otherwise } \end{array} \right. }$$

The purpose of h σ is to stall the update of p c , i.e., to prevent it from taking the information of the current generation t into account, when \(\|\mathbf{p}_{\sigma }\|\) becomes too large. The expectation \(E(\|N(\mathbf{0},\mathbf{I})\|)\) of the length of a multivariate normally distributed vector of dimensionality n can be approximated (based on the gamma function) as follows:

$$\displaystyle{ E(\|N(\mathbf{0},\mathbf{I})\|) = \sqrt{2}\Gamma (\frac{n + 1} {2} )/\Gamma (\frac{n} {2} ) \approx \sqrt{n}\left (1 - \frac{1} {4n} + \frac{1} {21{n}^{2}}\right ) }$$
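The quality of this approximation is easy to check with the gamma function from the standard library:

import math

def chi_mean_exact(n):
    return math.sqrt(2) * math.gamma((n + 1) / 2) / math.gamma(n / 2)

def chi_mean_approx(n):
    return math.sqrt(n) * (1 - 1 / (4 * n) + 1 / (21 * n ** 2))

print(chi_mean_exact(10), chi_mean_approx(10))   # agree to within about 10^-3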

The covariance matrix adaptation is performed according to the equation below:

$$\displaystyle{ \mathbf{C}\prime = (1 - c_{1} - c_{\mu })\mathbf{C} + c_{1}(\mathbf{p}_{c}\mathbf{p}_{c}^{T} + \delta (h_{ \sigma })\mathbf{C}) + c_{\mu }\sum _{i=1}^{\mu }w_{ i}\mathbf{y}_{i:\lambda }\mathbf{y}_{i:\lambda }^{T} }$$
(2.14)

The first term in the summation represents the contribution of the previous covariance matrix. The second term is called the rank-one update and takes the information accumulated in the evolution path p c into account. The third term, the so-called rank-μ update, was introduced with the extension of the CMA-ES to population sizes with μ > 1 [46]. The global step size σ is updated according to:

$$\displaystyle{ \sigma \prime = \sigma \cdot \exp \left (\frac{c_{\sigma }} {d_{\sigma }}\left ( \frac{\|\mathbf{p}_{\sigma }\|} {E(\|N(\mathbf{0},\mathbf{I})\|)} - 1\right )\right ) }$$

For the exogenous strategy parameters of the (μ W ,λ)-CMA-ES, the following standard settings are defined in [29]:

$$\displaystyle\begin{array}{rcl} \lambda & =& 4 + \lfloor 3\ln n\rfloor {}\\ \mu & =& \lfloor \frac{\lambda } {2}\rfloor {}\\ w_{i}& =& \frac{\ln (\frac{\lambda +1} {2} ) -\ln i} {\sum _{j=1}^{\mu }\left (\ln (\frac{\lambda +1} {2} ) -\ln j\right )}\mbox{ for }i \in \{1,\ldots,\mu \} {}\\ c_{\sigma }& =& \frac{\mu _{\mathit{eff }} + 2} {n + \mu _{\mathit{eff }} + 5} {}\\ d_{\sigma }& =& 1 + 2\max \left (0,\sqrt{\frac{\mu _{\mathit{eff } } - 1} {n + 1}} \right ) + c_{\sigma } {}\\ c_{c}& =& \frac{4 + \mu _{\mathit{eff }}/n} {n + 4 + 2\mu _{\mathit{eff }}/n} {}\\ c_{1}& =& \frac{2} {{\left (n + \frac{13} {10}\right )}^{2} + \mu _{\mathit{eff }}} {}\\ c_{\mu }& =& \min \left (1 - c_{1},\alpha _{\mu } \frac{\mu _{\mathit{eff }} - 2 + 1/\mu _{\mathit{eff }}} {{(n + 2)}^{2} + \alpha _{\mu }\mu _{\mathit{eff }}/2}\right )\mbox{ with }\alpha _{\mu } = 2 {}\\ \end{array}$$

Putting it all together, the pseudocode of the (μ W ,λ)-CMA-ES is given in Algorithm 2.8.

Algorithm 2.8 (μ W ,λ)-CMA-ES

initialize \(\langle \mathbf{x}\rangle\)

\(\mathbf{p}_{c} \leftarrow \mathbf{0}\)

\(\mathbf{p}_{\sigma } \leftarrow \mathbf{0}\)

\(\mathbf{C} \leftarrow \mathbf{I}\)

t ← 0

repeat

    \(t \leftarrow t + 1\)

    B and \(\mathbf{D} \leftarrow \) eigendecomposition of C

    for i = 1 → λ do

      \(\mathbf{z}_{i} \leftarrow N(\mathbf{0},\mathbf{I})\)

      \(\mathbf{y}_{i} \leftarrow \mathbf{B}\mathbf{D}\mathbf{z}_{i}\)

      \(\mathbf{x}_{i} \leftarrow \langle \mathbf{x}\rangle + \sigma \mathbf{y}_{i}\)

      \(f_{i} \leftarrow f(\mathbf{x}_{i})\)

    end for

    \(\langle \mathbf{y}\rangle \leftarrow \sum _{i=1}^{\mu }w_{i}\mathbf{y}_{i:\lambda }\)

    \(\langle \mathbf{x}\rangle \leftarrow \langle \mathbf{x}\rangle + \sigma \langle \mathbf{y}\rangle =\sum _{ i=1}^{\mu }w_{i}\mathbf{x}_{i:\lambda }\)

    \(\mathbf{p}_{\sigma } \leftarrow (1 - c_{\sigma })\mathbf{p}_{\sigma } + \sqrt{c_{\sigma } (2 - c_{\sigma } )\mu _{\mathit{eff }}}\mathbf{B}{\mathbf{D}}^{-1}{\mathbf{B}}^{T}\langle \mathbf{y}\rangle\)

    \(\sigma \leftarrow \sigma \cdot \exp \left (\frac{c_{\sigma }} {d_{\sigma }}\left ( \frac{\|\mathbf{p}_{\sigma }\|} {E\|N(\mathbf{0},\mathbf{I})\|} - 1\right )\right )\)

    \(\mathbf{p}_{c} \leftarrow (1 - c_{c})\mathbf{p}_{c} + h_{\sigma }\sqrt{c_{c } (2 - c_{c } )\mu _{\mathit{eff }}}\langle \mathbf{y}\rangle\)

    \(\mathbf{C} \leftarrow (1 - c_{1} - c_{\mu })\mathbf{C} + c_{1}(\mathbf{p}_{c}\mathbf{p}_{c}^{T} + \delta (h_{\sigma })\mathbf{C}) + c_{\mu }\sum _{i=1}^{\mu }w_{i}\mathbf{y}_{i:\lambda }\mathbf{y}_{i:\lambda }^{T}\)

until termination criterion fulfilled
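A compact numpy transcription of Algorithm 2.8 with the standard settings above; the initial σ and the generation budget are arguments, and \(\delta (h_{\sigma }) = (1 - h_{\sigma })c_{c}(2 - c_{c})\) follows the usual definition in [29]:

import numpy as np

def cma_es(f, x, sigma, max_gens=500):
    n = len(x)
    lam = 4 + int(3 * np.log(n))
    mu = lam // 2
    w = np.log((lam + 1) / 2) - np.log(np.arange(1, mu + 1))
    w /= w.sum()
    mu_eff = 1 / np.sum(w ** 2)
    c_sigma = (mu_eff + 2) / (n + mu_eff + 5)
    d_sigma = 1 + 2 * max(0, np.sqrt((mu_eff - 1) / (n + 1))) + c_sigma
    c_c = (4 + mu_eff / n) / (n + 4 + 2 * mu_eff / n)
    c_1 = 2 / ((n + 1.3) ** 2 + mu_eff)
    c_mu = min(1 - c_1, 2 * (mu_eff - 2 + 1 / mu_eff) / ((n + 2) ** 2 + mu_eff))
    chi_n = np.sqrt(n) * (1 - 1 / (4 * n) + 1 / (21 * n ** 2))   # E||N(0,I)||
    p_c, p_sigma, C = np.zeros(n), np.zeros(n), np.eye(n)
    for t in range(1, max_gens + 1):
        eigvals, B = np.linalg.eigh(C)                           # C = B D^2 B^T
        D = np.sqrt(eigvals)
        Z = np.random.standard_normal((lam, n))
        Y = (Z * D) @ B.T                                        # rows y_i = B D z_i
        X = x + sigma * Y
        phi = np.array([f(row) for row in X])
        order = np.argsort(phi)[:mu]
        y_w = w @ Y[order]                                       # <y>
        x = x + sigma * y_w                                      # <x>
        C_inv_sqrt = B @ np.diag(1 / D) @ B.T                    # B D^{-1} B^T
        p_sigma = ((1 - c_sigma) * p_sigma
                   + np.sqrt(c_sigma * (2 - c_sigma) * mu_eff) * C_inv_sqrt @ y_w)
        h_sigma = (np.linalg.norm(p_sigma) / np.sqrt(1 - (1 - c_sigma) ** (2 * t))
                   < (1.4 + 2 / (n + 1)) * chi_n)
        sigma *= np.exp(c_sigma / d_sigma * (np.linalg.norm(p_sigma) / chi_n - 1))
        p_c = (1 - c_c) * p_c + h_sigma * np.sqrt(c_c * (2 - c_c) * mu_eff) * y_w
        delta_h = (1 - h_sigma) * c_c * (2 - c_c)                # delta(h_sigma)
        C = ((1 - c_1 - c_mu) * C
             + c_1 * (np.outer(p_c, p_c) + delta_h * C)
             + c_mu * (Y[order].T * w) @ Y[order])
    return x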

2.2.2.2 LS-CMA-ES

The LS-CMA-ES [6] is a (1,λ)-ES implementing the idea of adapting the covariance matrix C based on the inverse Hessian H −1. The Hessian itself is estimated by solving an appropriate least squares estimation problem. Based on Theorem 5 in [55], it is known that this requires at least \(m \geq \frac{1} {2}\left ({n}^{2} + 3n + 4\right )\) tuples \(\left (\mathbf{x},f(\mathbf{x})\right )\). To achieve this, the algorithm saves all tuples \(\left (\mathbf{x},f(\mathbf{x})\right )\) in an archive A. Based on the Taylor series expansion (Eq. 2.12), the least squares estimation problem is defined through the following minimization task:

$$\displaystyle\begin{array}{rcl} \min _{\mathbf{g}\in {\mathbb{R}}^{n},\mathbf{H}\in {\mathbb{R}}^{n\times n}}\sum _{k=1}^{m}{\left (f(\mathbf{x}_{ k}) - f(\mathbf{x}_{0}) - {(\mathbf{x}_{k} -\mathbf{x}_{0})}^{T}\mathbf{g} -\frac{1} {2}{(\mathbf{x}_{k} -\mathbf{x}_{0})}^{T}\mathbf{H}(\mathbf{x}_{ k} -\mathbf{x}_{0})\right )}^{2}& &{}\end{array}$$
(2.15)

The result of minimizing Eq. 2.15 provides estimators \(\hat{\mathbf{g}}\) for the gradient and \(\hat{\mathbf{H}}\) for the Hessian.
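Solving Eq. 2.15 reduces to an ordinary linear least squares problem in the entries of g and the \(n(n + 1)/2\) free entries of the symmetric H; a numpy sketch of ours (f(x 0 ) is assumed known, which removes one unknown):

import numpy as np

def fit_quadratic_model(X, F, x0, f0):
    # X: (m, n) sample points, F: (m,) fitness values; returns (g_hat, H_hat).
    n = X.shape[1]
    D = X - x0
    iu = np.triu_indices(n)
    quad = D[:, iu[0]] * D[:, iu[1]]      # products d_i * d_j for i <= j
    quad[:, iu[0] != iu[1]] *= 2          # off-diagonal pairs occur twice in d^T H d
    design = np.hstack([D, 0.5 * quad])
    theta, *_ = np.linalg.lstsq(design, F - f0, rcond=None)
    g_hat = theta[:n]
    H_hat = np.zeros((n, n))
    H_hat[iu] = theta[n:]
    H_hat = H_hat + H_hat.T - np.diag(np.diag(H_hat))
    return g_hat, H_hat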

Since the Taylor series expansion up to the quadratic term provides only an approximation of the true fitness landscape at x 0, we are also interested in obtaining an error measure \(Q(\hat{g},\hat{\mathbf{H}})\) of the estimate for deciding whether \(\hat{{\mathbf{H}}}^{-1}\) can be used for covariance matrix adaptation. The following error measure is used for this purpose:

$$\displaystyle{ Q(\hat{g},\hat{\mathbf{H}}) = \frac{1} {m}\sum _{k=1}^{m}{\left (\frac{f(\mathbf{x}_{k}) - f(\mathbf{x}_{0}) - {(\mathbf{x}_{k} -\mathbf{x}_{0})}^{T}\hat{\mathbf{g}} -\frac{1} {2}{(\mathbf{x}_{k} -\mathbf{x}_{0})}^{T}\hat{\mathbf{H}}(\mathbf{x}_{ k} -\mathbf{x}_{0})} {f(\mathbf{x}_{k}) - f(\mathbf{x}_{0}) - {(\mathbf{x}_{k} -\mathbf{x}_{0})}^{T}\hat{\mathbf{g}}} \right )}^{2} }$$
(2.16)

Unfortunately, solving Eq. 2.15 and inverting \(\hat{\mathbf{H}}\) by means of numerical methods requires algorithms with time complexity \(O({n}^{6})\), so that, especially for large n, an execution of these steps in each generation is not affordable. To solve this problem, the LS-CMA-ES provides two different working modes, denoted LS and CMA, for adapting the covariance matrix.

In mode LS, an approximation of H is computed only every \(n_{\mathit{upd}}\) generations. If the error Q falls below a required threshold Q t , the covariance matrix \(\mathbf{C} = \frac{1} {2}\hat{{\mathbf{H}}}^{-1}\) is used by the algorithm and remains unchanged until a new update is performed after another \(n_{\mathit{upd}}\) generations.

If Q is larger than the threshold value Q t , the LS-CMA-ES switches into mode CMA. Before explaining this mode, the creation of an offspring x′ from the parent \(\langle \mathbf{x}\rangle\) is defined below:

$$\displaystyle{ \mathbf{x}\prime =\langle \mathbf{x}\rangle + \sigma dN(\mathbf{0},\mathbf{C})\mbox{ where }d =\exp (\tau N(0,1)) }$$

In addition to the covariance matrix C, a global step size σ is used, which is updated by mutative step size adaptation. If b denotes the index of the best offspring, the global step size is changed according to \(\sigma \prime = \sigma \cdot d_{b}\). Adapting the covariance matrix C is based on a rank-one update (i.e., the second term in Eq. 2.14) by using an evolution path p c :

$$\displaystyle\begin{array}{rcl} \mathbf{p}_{c}\prime& =& (1 - c_{c}) \cdot \mathbf{p}_{c} + \frac{\sqrt{(c_{c } (2 - c_{c } ))}} {\sigma } (\mathbf{x}_{b} -\langle \mathbf{x}\rangle ) {}\\ \mathbf{C}\prime& =& (1 - c_{\mathit{cov}}) \cdot \mathbf{C} + c_{\mathit{cov}}\mathbf{p}_{c}{(\mathbf{p}_{c})}^{T} {}\\ \end{array}$$

The evolution path p c is also updated when operating in mode LS, to make sure C is updated based on up-to-date information when the algorithm switches into mode CMA.

The pseudocode of the LS-CMA-ES is given in Algorithm 2.9, and the exogenous strategy parameters are set as follows:

$$\displaystyle\begin{array}{rcl} \lambda & =& 10 {}\\ \tau & =& \frac{1} {\sqrt{n}} {}\\ n_{\mathit{upd}}& =& 100 {}\\ Q_{t}& =& 1{0}^{-3} {}\\ c_{c}& =& \frac{4} {n + 4} {}\\ c_{\mathit{cov}}& =& \frac{2} {{(n + \sqrt{2})}^{2}} {}\\ \end{array}$$

Algorithm 2.9 LS-CMA-ES

initialize \(\langle \mathbf{x}\rangle\), σ

\(\mathbf{C} \leftarrow \mathbf{I}\)

Archive \(A \leftarrow \varnothing \)

\(\mathbf{p}_{c} \leftarrow \mathbf{0}\)

mode \(\leftarrow \) LS

\(t \leftarrow 0\)

repeat

    \(t \leftarrow t + 1\)

    B and D ← eigendecomposition of C

    for i = 1 → λ do

      \(d_{i} \leftarrow \exp \left (\tau N(0,1)\right )\)

      \(\mathbf{x}_{i} \leftarrow \langle \mathbf{x}\rangle + \sigma \cdot d_{i}\mathbf{B}\mathbf{D}N(\mathbf{0},\mathbf{I})\)

      \(f_{i} \leftarrow f(\mathbf{x}_{i})\)

      \(A \leftarrow A \cup \{(\mathbf{x}_{i},f_{i})\}\)

    end for

    \(b \leftarrow \) index of best offspring

    \(\sigma \leftarrow \sigma \cdot d_{b}\)

    \(\mathbf{p}_{c} \leftarrow (1 - c_{c})\mathbf{p}_{c} + \frac{\sqrt{c_{c } (2-c_{c } )}} {\sigma } (\mathbf{x}_{b} -\langle \mathbf{x}\rangle )\)

    if mode = LS then

      C unchanged

    else if mode = CMA then

      \(\mathbf{C} \leftarrow (1 - c_{\mathit{cov}})\mathbf{C} + c_{\mathit{cov}}\mathbf{p}_{c}\mathbf{p}_{c}^{T}\)

    end if

    if \(t\mod n_{\mathit{upd}} = 0\) then

      Obtain \(\hat{\mathbf{g}}\) and \(\hat{\mathbf{H}}\) based on the last \({n}^{2}\) tuples of A by solving Equation 2.15 where \(\mathbf{x}_{0} =\langle \mathbf{x}\rangle\).

      Obtain \(Q(\hat{\mathbf{g}},\hat{\mathbf{H}})\) from Equation 2.16

      if \(Q(\hat{\mathbf{g}},\hat{\mathbf{H}}) < Q_{t}\) then

        mode ← LS

        \(\mathbf{C} \leftarrow {\left (\frac{1} {2}\hat{\mathbf{H}}\right )}^{-1}\)

      else

        mode ← CMA

      end if

    end if

    \(\langle \mathbf{x}\rangle \leftarrow \mathbf{x}_{b}\)

until termination criterion fulfilled

2.2.2.3 LR-CMA-ES

The LR-CMA-ES (local restart) extends the (μ W ,λ)-CMA-ES by introducing restarts [4]. The strategy introduces five criteria for identifying stagnation of the optimization process and, in case of stagnation, starts a new run of the (μ W ,λ)-CMA-ES. Each run of the (μ W ,λ)-CMA-ES initializes the starting point of the search and the strategy parameters anew, so that the runs are independent of each other. For defining the termination criteria, the tolerance values \(T_{x} = \sigma \cdot 1{0}^{-12}\) and \(T_{f} = 1{0}^{-12}\) are used. All other exogenous parameters are the same as in the \((\mu _{W},\lambda )\)-CMA-ES.

The first termination criterion, called equalfunvalhist, is satisfied if either the best fitness values \(f(\mathbf{x}_{1:\lambda })\) of the last \(\lceil 10 + 30n/\lambda \rceil \) generations are identical or the difference between their maximum and minimum values is smaller than \(T_{f}\).

The second criterion, TolX, is satisfied if the components of the vector \(\mathbf{v} = \sigma \mathbf{p}_{c}\) are all smaller than T x , i.e., v i < T x \(\forall i \in \{1,\ldots,n\}\).

The third criterion, noeffectaxis, takes changes with respect to the main coordinate axes induced by C into account. These are given by the eigenvectors \(\mathbf{u}_{i}\) and eigenvalues γ i , \(i \in \{1,\ldots,n\}\), of C; the normalized eigenvectors form the columns of matrix B, and the square roots \(\sqrt{\gamma _{i}}\) of the eigenvalues form the main diagonal elements of D. The termination criterion does not check all main axes at once, but in generation t it takes only the axis i = t mod n into account. It is satisfied when \(\frac{\sigma } {10}\sqrt{\gamma _{i}}\mathbf{u}_{i} \approx 0\), i.e., when a mutation step of this size no longer has any numerical effect.

The fourth criterion, noeffectcoord, analyzes changes with respect to the coordinate axes. It is satisfied if \(\frac{\sigma } {5} C_{i,i} \approx 0\) \(\forall i \in \{1,\ldots,n\}\).

Finally, the criterion conditioncov checks whether the condition number of the matrix C, \(\mbox{ cond}(\mathbf{C}) = \frac{\max (\{\gamma _{1},\ldots,\gamma _{n}\})} {\min (\{\gamma _{1},\ldots,\gamma _{n}\})}\), exceeds \(1{0}^{14}\).
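Two of the five criteria in code form, as an illustration (the names follow the text; γ i are the eigenvalues of C):

import numpy as np

def tol_x(sigma, p_c, T_x):
    # TolX: all components of sigma * p_c are below the tolerance.
    return np.all(np.abs(sigma * p_c) < T_x)

def condition_cov(C, limit=1e14):
    # conditioncov: the condition number of C exceeds the limit.
    gamma = np.linalg.eigvalsh(C)
    return gamma.max() / gamma.min() > limit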

The pseudocode of the LR-CMA-ES, as shown in Algorithm 2.10, consists of a simple outer loop managing the restarts of the (μ W ,λ)-CMA-ES. The local termination criteria are exactly the five criteria introduced above for discovering stagnation. In contrast, the global termination criterion is the same as used in previous sections, see Sect. 2.1.2.

Algorithm 2.10 LR-CMA-ES

repeat

    execute (μ W ,λ)-CMA-ES (Algorithm 2.8) using the local termination criteria

until global termination criterion satisfied

2.2.2.4 IPOP-CMA-ES

The IPOP-CMA-ES [5] is an extension of the LR-CMA-ES as described in the previous section. Whenever a run of the (μ W ,λ)-CMA-ES is terminated due to a local termination criterion (as introduced for the LR-CMA-ES), the population size is increased by a factor η for the next run of the (μ W ,λ)-CMA-ES. This strategy is motivated by empirical investigations of the behavior of the (μ W ,λ)-CMA-ES with different population sizes on multimodal test functions [30]. As these investigations showed, the global convergence properties of the algorithm improve with increasing population size. The corresponding pseudocode is given in Algorithm 2.11. When using non-integer values for η, the new numbers of parents μ and offspring λ are obtained by rounding. For η, the interval \(\left [\frac{3} {2},5\right ]\) is identified as a reasonable range, and the default value η = 2 is recommended.

Algorithm 2.11 IPOP-CMA-ES

repeat

    execute (μ W ,λ)-CMA-ES (Algorithm 2.8) using the local termination criteria

    \(\mu \leftarrow \eta \cdot \mu \)

    \(\lambda \leftarrow \eta \cdot \lambda \)

until global termination criterion satisfied

2.2.2.5 (1+1)-Cholesky-CMA-ES

The (1+1)-Cholesky-CMA-ES [38] introduces a method for adapting the covariance matrix C implicitly, without using an eigendecomposition of C. Consequently, the approach reduces the computational complexity within each generation from \(O({n}^{3})\) to \(O({n}^{2})\).

The algorithm is based on the so-called Cholesky decomposition of the covariance matrix, \(\mathbf{C} = \mathbf{A}{\mathbf{A}}^{T}\). As proven in [38], an update of the Cholesky factor A is possible without explicit knowledge of the covariance matrix C. The corresponding lemma and theorem are stated here without proof. The lemma states that, for any vector \(\mathbf{v} \in {\mathbb{R}}^{n}\) and \(\varsigma = \frac{1} {\|{\mathbf{v}\|}^{2}} \left (\sqrt{1 +\|{ \mathbf{v} \|}^{2}} - 1\right )\), the following equation holds:

$$\displaystyle{ \mathbf{I} + \mathbf{v}{\mathbf{v}}^{T} = \left (\mathbf{I} + \varsigma \mathbf{v}{\mathbf{v}}^{T}\right )\left (\mathbf{I} + \varsigma \mathbf{v}{\mathbf{v}}^{T}\right ) }$$

This lemma is required for the proof of the following theorem:

Theorem 2.2.1.

Let \(\mathbf{C} \in {\mathbb{R}}^{n\times n}\) be a symmetric, positive definite matrix with Cholesky decomposition \(\mathbf{C} = \mathbf{A}{\mathbf{A}}^{T}\) . Let \(\mathbf{C}\prime = \alpha \mathbf{C} + \beta \mathbf{v}{\mathbf{v}}^{T}\) be an update of \(\mathbf{C}\) with \(\mathbf{v},\mathbf{z} \in {\mathbb{R}}^{n}\), \(\mathbf{v} = \mathbf{A}\mathbf{z}\) and \(\alpha,\beta \in {\mathbb{R}}^{+}\) . The updated Cholesky factor A ′ of C ′ is then given by \(\mathbf{A}\prime = \sqrt{\alpha }\mathbf{A} + \frac{\sqrt{\alpha }} {\|{\mathbf{z}\|}^{2}} \left (\sqrt{1 + \frac{\beta } {\alpha }\|{\mathbf{z}\|}^{2}} - 1\right )\left (\mathbf{A}\mathbf{z}\right ){\mathbf{z}}^{T}\) .
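Theorem 2.2.1 can be verified numerically by comparing the updated factor with the updated matrix; α, β, and the test matrix below are arbitrary choices of ours:

import numpy as np

rng = np.random.default_rng(0)
n = 4
M = rng.standard_normal((n, n))
C = M @ M.T + n * np.eye(n)                  # symmetric positive definite test matrix
A = np.linalg.cholesky(C)                    # C = A A^T
z = rng.standard_normal(n)
v = A @ z
alpha, beta = 0.9, 0.3

norm2 = z @ z
A_new = (np.sqrt(alpha) * A                  # updated factor according to Theorem 2.2.1
         + np.sqrt(alpha) / norm2 * (np.sqrt(1 + beta / alpha * norm2) - 1) * np.outer(A @ z, z))
C_new = alpha * C + beta * np.outer(v, v)
print(np.allclose(A_new @ A_new.T, C_new))   # True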

Based on a parent individual x, an offspring x′ is then created according to:

$$\displaystyle{ \mathbf{x}\prime = \mathbf{x} + \sigma \mathbf{A}\mathbf{z}\mbox{ with }\mathbf{z} = N(\mathbf{0},\mathbf{I}) }$$

Using Theorem 2.2.1, the Cholesky factor A is adapted as follows:

$$\displaystyle{ \mathbf{A}\prime = c_{a}\mathbf{A} + \frac{c_{a}} {\|{\mathbf{z}\|}^{2}}\left (\sqrt{1 + \frac{(1 - c_{a }^{2 })\|{\mathbf{z} \|}^{2 } } {c_{a}^{2}}} - 1\right )\mathbf{A}\mathbf{z}{\mathbf{z}}^{T}, }$$

with a constant exogenous strategy parameter c a . The adaptation above is applied if the value of a measure \(\bar{p}_{s}\) (explained in the following) is smaller than a threshold value p t .

The adaptation of the global step size σ is in some ways similar to the 1/5-success rule of the (1+1)-ES (see Sect. 2.2.1). If the offspring is better than the parent, λ s = 1 in the equation below; otherwise, λ s = 0. These success indicators are accumulated across generations by using a learning rate c p , resulting in an accumulated success rate \(\bar{p}_{s}\):

$$\displaystyle{ \bar{p}_{s} = (1 - c_{p})\bar{p}_{s} + c_{p}\lambda _{s} }$$

Using this measure and the target value \(p_{s}^{t}\) for the success rate, the global step size σ is updated as follows:

$$\displaystyle{ \sigma \prime = \sigma \cdot \exp \left (\frac{1} {d}\left (\bar{p}_{s} - \frac{p_{s}^{t}} {1 - p_{s}^{t}}(1 -\bar{ p}_{s})\right )\right ) }$$

The pseudocode is given in Algorithm 2.12, and the default settings of the exogenous strategy parameters are:

$$\displaystyle\begin{array}{rcl} p_{s}^{t}& =& \frac{2} {11} {}\\ p_{t}& =& \frac{11} {25} {}\\ c_{a}& =& \sqrt{1 - \frac{2} {{n}^{2} + 6}} {}\\ c_{p}& =& \frac{1} {12} {}\\ d& =& 1 + \frac{n} {2} {}\\ \end{array}$$

Algorithm 2.12 (1+1)-Cholesky-CMA-ES

initialize x, σ

\(\mathbf{A} \leftarrow \mathbf{I}\)

\(\bar{p}_{s} \leftarrow p_{s}^{t}\)

repeat

    \(\mathbf{z} \leftarrow N(\mathbf{0},\mathbf{I})\)

    \(\mathbf{x}\prime \leftarrow \mathbf{x} + \sigma \mathbf{A}\mathbf{z}\)

    if \(f(\mathbf{x}\prime) \leq f(\mathbf{x})\) then

      \(\lambda _{s} \leftarrow 1\)

    else

      \(\lambda _{s} \leftarrow 0\)

    end if

    \(\bar{p}_{s} \leftarrow (1 - c_{p})\bar{p}_{s} + c_{p}\lambda _{s}\)

    \(\sigma \leftarrow \sigma \cdot \exp \left (\frac{1} {d}\left (\bar{p}_{s} - \frac{p_{s}^{t}} {1-p_{s}^{t}}(1 -\bar{ p}_{s})\right )\right )\)

    if \(f(\mathbf{x}\prime) \leq f(\mathbf{x})\) then

      \(\mathbf{x} \leftarrow \mathbf{x}\prime\)

      if \(\bar{p}_{s} \leq p_{t}\) then

        \(\mathbf{A} \leftarrow c_{a}\mathbf{A} + \frac{c_{a}} {\|{\mathbf{z}\|}^{2}} \left (\sqrt{1 + \frac{\left (1-c_{a }^{2 } \right ) \|{\mathbf{z} \|}^{2 } } {c_{a}^{2}}} - 1\right )\mathbf{A}\mathbf{z}{\mathbf{z}}^{T}\)

      end if

    end if

until termination criterion satisfied
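
A compact, runnable transcription of Algorithm 2.12 could look as follows; this is a sketch assuming NumPy, with an illustrative evaluation budget and objective.

```python
import numpy as np

def one_plus_one_cholesky_cma(f, x0, sigma0=1.0, max_evals=2000, seed=1):
    """Minimal sketch of the (1+1)-Cholesky-CMA-ES (Algorithm 2.12)."""
    rng = np.random.default_rng(seed)
    n = len(x0)
    # default exogenous strategy parameters from the text
    p_target, p_thresh = 2 / 11, 11 / 25
    c_a = np.sqrt(1 - 2 / (n ** 2 + 6))
    c_p, d = 1 / 12, 1 + n / 2
    x = np.asarray(x0, dtype=float)
    fx, sigma, A, p_s = f(x), sigma0, np.eye(n), p_target
    for _ in range(max_evals):
        z = rng.standard_normal(n)
        y = x + sigma * (A @ z)
        fy = f(y)
        success = 1.0 if fy <= fx else 0.0
        p_s = (1 - c_p) * p_s + c_p * success
        sigma *= np.exp((p_s - p_target / (1 - p_target) * (1 - p_s)) / d)
        if success:
            x, fx = y, fy
            if p_s <= p_thresh:          # adapt the Cholesky factor
                z2 = z @ z
                A = (c_a * A + c_a / z2
                     * (np.sqrt(1 + (1 - c_a ** 2) * z2 / c_a ** 2) - 1)
                     * np.outer(A @ z, z))
    return x, fx

# example: minimize the sphere function in ten dimensions
x_best, f_best = one_plus_one_cholesky_cma(lambda v: float(v @ v), np.ones(10))
```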

2.2.2.6 Active-CMA-ES

The (μ W ,λ)-CMA-ES uses weighted recombination of the μ best offspring to generate a new point in the search space. As shown by Rudolph [57], the convergence velocity of an evolution strategy can be further increased by also taking the worst offspring into account for recombination, albeit with negative weights. The Active-CMA-ES [40] is based on this idea;Footnote 21 however, the negative weighting is not used during recombination,Footnote 22 but exclusively for adapting the covariance matrix. The corresponding extension of the (μ W ,λ)-CMA-ES therefore mainly consists of a modified covariance matrix adaptation, which replaces Eq. 2.14 of the (μ W ,λ)-CMA-ES by:

$$\displaystyle\begin{array}{rcl} \mathbf{C}\prime& =& (1 - c_{c})\mathbf{C} + c_{c}\mathbf{p}_{c}\mathbf{p}_{c}^{T} + \beta \mathbf{Z}\mbox{ where } {}\\ \mathbf{Z}& =& \mathbf{B}\mathbf{D}\left ( \frac{1} {\mu }\sum _{k=1}^{\mu }\mathbf{z}_{ k:\lambda }\mathbf{z}_{k:\lambda }^{T} - \frac{1} {\mu }\sum _{k=\lambda -\mu +1}^{\lambda }\mathbf{z}_{ k:\lambda }\mathbf{z}_{k:\lambda }^{T}\right ){\left (\mathbf{B}\mathbf{D}\right )}^{T} {}\\ \end{array}$$

In addition, the exogenous parameter c c is now modified to \(c_{c} = \frac{2} {{(n+\sqrt{2})}^{2}}\). The parameter β has been tuned by means of an empirical investigation, which is described in detail in [39]. Its setting of \(\beta = \frac{4\mu -2} {{(n+12)}^{2}+4\mu }\) reflects a compromise between the conflicting goals of achieving a large convergence velocity on the one hand and, on the other hand, ensuring that C remains positive definite, which keeps the evolution strategy in a robust working regime. The pseudocode is provided in Algorithm 2.13, and the default settings of the exogenous strategy parameters are, except for c c and β, identical to those used in the (μ W ,λ)-CMA-ES.

Algorithm 2.13 Active-CMA-ES

initialize \(\langle \mathbf{x}\rangle\)

\(\mathbf{p}_{c} \leftarrow \mathbf{0}\)

\(\mathbf{p}_{\sigma } \leftarrow \mathbf{0}\)

\(\mathbf{C} \leftarrow \mathbf{I}\)

t ← 0

repeat

    tt + 1

    B and D ← from eigendecomposition of C

    for i = 1 → λ do

      \(\mathbf{z}_{i} \leftarrow N(\mathbf{0},\mathbf{I})\)

      \(\mathbf{y}_{i} \leftarrow \mathbf{B}\mathbf{D}\mathbf{z}_{i}\)

\(\mathbf{x}_{i} \leftarrow \langle \mathbf{x}\rangle + \sigma \mathbf{y}_{i}\)

      \(f_{i} \leftarrow f(\mathbf{x}_{i})\)

    end for

    \(\langle \mathbf{y}\rangle \leftarrow \sum _{i=1}^{\mu }w_{i}\mathbf{y}_{i:\lambda }\)

    \(\langle \mathbf{x}\rangle \leftarrow \langle \mathbf{x}\rangle + \sigma \langle \mathbf{y}\rangle =\sum _{ i=1}^{\mu }w_{i}\mathbf{x}_{i:\lambda }\)

    \(\mathbf{p}_{\sigma } \leftarrow (1 - c_{\sigma })\mathbf{p}_{\sigma } + \sqrt{c_{\sigma } (2 - c_{\sigma } )\mu _{\mathit{eff }}}\mathbf{B}{\mathbf{D}}^{-1}{\mathbf{B}}^{T}\langle \mathbf{y}\rangle\)

    \(\sigma \leftarrow \sigma \cdot \exp \left (\frac{c_{\sigma }} {d_{\sigma }}\left ( \frac{\|\mathbf{p}_{\sigma }\|} {E\|N(\mathbf{0},\mathbf{I})\|} - 1\right )\right )\)

    \(\mathbf{p}_{c} \leftarrow (1 - c_{c})\mathbf{p}_{c} + h_{\sigma }\sqrt{c_{c } (2 - c_{c } )\mu _{\mathit{eff }}}\langle \mathbf{y}\rangle\)

    \(\mathbf{Z} \leftarrow \mathbf{B}\mathbf{D}\left (\frac{1} {\mu }\sum _{k=1}^{\mu }\mathbf{z}_{ k:\lambda }\mathbf{z}_{k:\lambda }^{T} - \frac{1} {\mu }\sum _{k=\lambda -\mu +1}^{\lambda }\mathbf{z}_{ k:\lambda }\mathbf{z}_{k:\lambda }^{T}\right ){\left (\mathbf{B}\mathbf{D}\right )}^{T}\)

    \(\mathbf{C} \leftarrow (1 - c_{c})\mathbf{C} + c_{c}\mathbf{p}_{c}\mathbf{p}_{c}^{T} + \beta \mathbf{Z}\)

until termination criterion satisfied
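
The distinguishing step of this variant, the covariance update with the negative rank-μ term Z, can be sketched as follows (assuming NumPy; the function name and argument layout are illustrative, and the mutation vectors are assumed to be sorted by fitness, best first).

```python
import numpy as np

def active_covariance_update(C, p_c, B, D, z_sorted, mu, c_c, beta):
    """Sketch of the Active-CMA-ES covariance update: the best mu mutation
    vectors contribute positively, the worst mu negatively."""
    pos = sum(np.outer(z, z) for z in z_sorted[:mu]) / mu   # best mu
    neg = sum(np.outer(z, z) for z in z_sorted[-mu:]) / mu  # worst mu
    BD = B @ D
    Z = BD @ (pos - neg) @ BD.T
    return (1 - c_c) * C + c_c * np.outer(p_c, p_c) + beta * Z
```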

2.2.2.7 (μ,λ)-CMSA-ES

The (μ,λ)-CMSA-ES [13], more precisely denoted the (μ∕μ I ,λ)-CMA-σ-SA-ES, reintroduces self-adaptation of the global step size σ into the algorithm, just as in the (μ,λ)-MSC-ES. This approach is motivated by the fact that self-adaptation decreases the number of exogenous strategy parameters to two,Footnote 23 consequently providing a simplification of the (μ W ,λ)-CMA-ES, which requires five exogenous strategy parameters. Offspring individuals x i and their step sizes σ i , \(i \in \{1,\ldots,\lambda \}\), are created based on the parent x, the global step size σ, and the matrices B and D (from an eigendecomposition of the covariance matrix C), as follows:

$$\displaystyle\begin{array}{rcl} \sigma _{i}& =& \sigma \cdot \exp (\tau N(0,1)) {}\\ \mathbf{s}_{i}& =& \mathbf{B}\mathbf{D}N(\mathbf{0},\mathbf{I}) {}\\ \mathbf{z}_{i}& =& \sigma _{i} \cdot \mathbf{s}_{i} {}\\ \mathbf{x}_{i}& =& \mathbf{x} + \mathbf{z}_{i} {}\\ \end{array}$$

Recombination is based on identical weights 1∕μ, i.e., on averaging over the μ best offspring. It is applied to the vectors \(\mathbf{z}_{i:\lambda }\), the outer products \(\mathbf{s}_{i:\lambda }\mathbf{s}_{i:\lambda }^{T}\), and the step sizes \(\sigma _{i:\lambda }\), for \(i \in \{1,\ldots,\mu \}\), resulting in \(\langle \mathbf{z}\rangle\), \(\langle \mathbf{s}{\mathbf{s}}^{T}\rangle\) and the new global step size \(\langle \sigma \rangle\). The new parent x′ is then obtained as \(\mathbf{x}\prime = \mathbf{x} +\langle \mathbf{z}\rangle\). The matrix \(\langle \mathbf{s}{\mathbf{s}}^{T}\rangle\) is required for adapting the covariance matrix C, whose update uses the learning rate τ C as follows:

$$\displaystyle{ \mathbf{C}\prime = \left (1 - \frac{1} {\tau _{C}}\right )\mathbf{C} + \frac{1} {\tau _{C}}\langle \mathbf{s}{\mathbf{s}}^{T}\rangle }$$
(2.17)

The default settings of the exogenous strategy parameters are:

$$\displaystyle\begin{array}{rcl} \mu & =& \max \left (\left \lfloor \frac{n} {10}\right \rfloor,2\right ) {}\\ \lambda & =& 4\mu {}\\ \tau & =& \frac{1} {\sqrt{2n}} {}\\ \tau _{C}& =& 1 + \frac{n(n + 1)} {2\mu } {}\\ \end{array}$$

The pseudocode of the corresponding (μ,λ)-CMSA-ES is given in Algorithm 2.14.

Algorithm 2.14 (μ,λ)-CMSA-ES

initialize x, σ

\(\mathbf{C} \leftarrow \mathbf{I}\)

\(\langle \sigma \rangle \leftarrow \sigma \)

repeat

    B and D ← from eigendecomposition of C

    for i = 1 → λ do

\(\sigma _{i} \leftarrow \langle \sigma \rangle \exp \left (\tau N(0,1)\right )\)

      \(\mathbf{s}_{i} \leftarrow \mathbf{B}\mathbf{D}N(\mathbf{0},\mathbf{I})\)

      \(\mathbf{z}_{i} \leftarrow \sigma _{i} \cdot \mathbf{s}_{i}\)

      \(\mathbf{y}_{i} \leftarrow \mathbf{x} + \mathbf{z}_{i}\)

      \(f_{i} \leftarrow f(\mathbf{y}_{i})\)

    end for

    \(\langle \mathbf{z}\rangle \leftarrow \) average of the best μ \(\mathbf{z}_{i},i \in \{1,\ldots,\lambda \}\)

    \(\langle \mathbf{s}\rangle \leftarrow \) average of the best μ \(\mathbf{s}_{i},i \in \{1,\ldots,\lambda \}\)

    \(\langle \sigma \rangle \leftarrow \) average of the best μ \(\sigma _{i},i \in \{1,\ldots,\lambda \}\)

    \(\mathbf{x} \leftarrow \mathbf{x} +\langle \mathbf{z}\rangle\)

    \(\mathbf{C} \leftarrow \left (1 - \frac{1} {\tau _{C}}\right )\mathbf{C} + \frac{1} {\tau _{C}}\langle \mathbf{s}{\mathbf{s}}^{T}\rangle\)

until termination criterion satisfied
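
One generation of the strategy can be sketched as follows (assuming NumPy; names and the argument layout are illustrative).

```python
import numpy as np

def cmsa_step(f, x, sigma, C, mu, lam, tau, tau_c, rng):
    """One generation of the (mu,lambda)-CMSA-ES (Algorithm 2.14)."""
    n = len(x)
    eigvals, B = np.linalg.eigh(C)                    # C = B D^2 B^T
    D = np.diag(np.sqrt(np.maximum(eigvals, 0)))
    sigmas = sigma * np.exp(tau * rng.standard_normal(lam))  # self-adaptation
    S = (B @ D @ rng.standard_normal((n, lam))).T     # rows are s_i
    Z = sigmas[:, None] * S                           # rows are z_i
    fitness = np.array([f(x + z) for z in Z])
    best = np.argsort(fitness)[:mu]                   # indices of the mu best
    ss_mean = sum(np.outer(s, s) for s in S[best]) / mu   # <s s^T>
    x_new = x + Z[best].mean(axis=0)
    sigma_new = sigmas[best].mean()
    C_new = (1 - 1 / tau_c) * C + ss_mean / tau_c
    return x_new, sigma_new, C_new
```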

2.2.2.8 sep-CMA-ES

The sep-CMA-ES [54] is a variation of the (μ W ,λ)-CMA-ES which reduces space and time complexity to O(n), i.e., linear in n. This is achieved by using, instead of an arbitrary covariance matrix, just a diagonal matrix D as in Eq. 2.10. Consequently, this kind of evolution strategy is no longer able to generate correlated mutations, in return for the advantage of saving the computationally intensive eigendecomposition of the covariance matrix C. D can then be obtained from C by taking the square roots of the main diagonal elements of C. The covariance matrix is adapted according to the following update rule:

$$\displaystyle{ \mathbf{C}\prime = (1 - c_{\mathit{cov}})\mathbf{C} + \frac{1} {\mu _{\mathit{eff }}}c_{\mathit{cov}}\mathbf{p}_{c}{(\mathbf{p}_{c})}^{T} + c_{\mathit{ cov}}\left (1 - \frac{1} {\mu _{\mathit{eff }}}\right )\sum _{i=1}^{\mu }w_{ i}\mathbf{D}\mathbf{z}_{i:\lambda }{(\mathbf{D}\mathbf{z}_{i:\lambda })}^{T} }$$

Due to the reduced complexity of the covariance matrix, the learning rate c cov can be increased, which accelerates the adaptation process. It is set as follows:

$$\displaystyle{ c_{\mathit{cov}} = \frac{n + 2} {3} \left ( \frac{1} {\mu _{\mathit{eff }}} \frac{2} {{(n + \sqrt{2})}^{2}} + (1 - \frac{1} {\mu _{\mathit{eff }}})\min \left (1, \frac{2\mu _{\mathit{eff }} - 1} {{(n + 2)}^{2} + \mu _{\mathit{eff }}}\right )\right ) }$$

All other settings of the sep-CMA-ES are identical to those used within the (μ W ,λ)-CMA-ES. The resulting pseudocode of the sep-CMA-ES is shown in Algorithm 2.15.

Algorithm 2.15 sep-CMA-ES

initialize \(\langle \mathbf{x}\rangle\)

\(\mathbf{C} \leftarrow \mathbf{I}\)

\(\mathbf{D} \leftarrow \mathbf{I}\)

\(\mathbf{p}_{\sigma } \leftarrow \mathbf{0}\)

\(\mathbf{p}_{c} \leftarrow \mathbf{0}\)

t ← 0

repeat

    tt + 1

    for i = 1 → λ do

      \(\mathbf{z}_{i} \leftarrow N(\mathbf{0},\mathbf{I})\)

      \(\mathbf{x}_{i} \leftarrow \langle \mathbf{x}\rangle + \sigma \mathbf{D}\mathbf{z}_{i}\)

    end for

    \(\langle \mathbf{x}\rangle \leftarrow \sum _{i=1}^{\mu }w_{i}\mathbf{x}_{i:\lambda }\)

    \(\langle \mathbf{z}\rangle \leftarrow \sum _{i=1}^{\mu }w_{i}\mathbf{z}_{i:\lambda }\)

    \(\mathbf{p}_{\sigma } \leftarrow (1 - c_{\sigma })\mathbf{p}_{\sigma } + \sqrt{c_{\sigma } (2 - c_{\sigma } )}\sqrt{\mu _{\mathit{eff }}}\langle \mathbf{z}\rangle\)

    if \(\frac{\|\mathbf{p}_{\sigma }\|} {\sqrt{1-{(1-c_{\sigma } )}^{2t}}} < \left (\frac{7} {5} + \frac{2} {n+1}\right )E(\|N(\mathbf{0},\mathbf{I})\|)\) then

      \(H_{\sigma } \leftarrow 1\)

    else

      \(H_{\sigma } \leftarrow 0\)

    end if

    \(\mathbf{p}_{c} \leftarrow (1 - c_{c})\mathbf{p}_{c} + H_{\sigma }\sqrt{c_{c } (2 - c_{c } )}\sqrt{\mu _{\mathit{eff }}}\mathbf{D}\langle \mathbf{z}\rangle\)

\(\mathbf{C} \leftarrow (1 - c_{\mathit{cov}})\mathbf{C} + \frac{c_{\mathit{cov}}} {\mu _{\mathit{eff }}} \mathbf{p}_{c}\mathbf{p}_{c}^{T} + c_{\mathit{cov}}\left (1 - \frac{1} {\mu _{\mathit{eff }}} \right )\sum _{i=1}^{\mu }w_{i}\mathbf{D}\mathbf{z}_{i:\lambda }{\left (\mathbf{D}\mathbf{z}_{i:\lambda }\right )}^{T}\)

\(\sigma \leftarrow \sigma \exp \left (\frac{c_{\sigma }} {d_{\sigma }}\left ( \frac{\|\mathbf{p}_{\sigma }\|} {E(\|N(\mathbf{0},\mathbf{I})\|)} - 1\right )\right )\)

\(\mathbf{D} \leftarrow \mbox{ diag}\left (\sqrt{C_{1,1}},\ldots,\sqrt{C_{n,n}}\right )\)

until termination criterion satisfied
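
The computational benefit becomes apparent in code: with a diagonal D, both sampling and the covariance update reduce to elementwise operations of cost O(n) per vector. A small sketch with illustrative values (assuming NumPy; all state variables below are stand-ins, not defaults):

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma, mu_eff, c_cov = 10, 0.5, 3.0, 0.05   # illustrative values
c_diag = np.ones(n)                  # main diagonal of C
d = np.sqrt(c_diag)                  # D = diag(sqrt(C_11), ..., sqrt(C_nn))

# sampling an offspring is O(n): an elementwise product instead of B D z
z = rng.standard_normal(n)
x = np.zeros(n) + sigma * d * z      # parent at the origin for illustration

# the covariance update keeps only the main diagonal of each term
w = np.full(3, 1 / 3)                # recombination weights (illustrative)
z_sel = rng.standard_normal((3, n))  # selected mutation vectors z_{i:lambda}
p_c = rng.standard_normal(n)         # evolution path (illustrative state)
rank_mu = (w[:, None] * (d * z_sel) ** 2).sum(axis=0)
c_diag = ((1 - c_cov) * c_diag + (c_cov / mu_eff) * p_c ** 2
          + c_cov * (1 - 1 / mu_eff) * rank_mu)
d = np.sqrt(c_diag)
```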

2.2.2.9 \((1{ + \atop,} \lambda _{m}^{s})\)-ES

The \((1{ + \atop,} \lambda _{m}^{s})\)-ES [16] introduces the two new concepts of mirrored sampling and sequential selection. These two mutually independent concepts modify how offspring are created and selected; they do not by themselves establish a complete evolution strategy. The concept of mirrored sampling can be used within a (1 + λ)-ES as well as a (1,λ)-ES. The application of sequential selection is only possible in the case of a plus-strategy, which also explains the use of the notation \({ + \atop,}\). Furthermore, the indices s and m of λ represent the algorithmic concepts of sequential selection (s) and mirrored sampling (m), respectively.

The idea of mirrored sampling is to generate part of the offspring in a derandomized way: for a mutation vector z, not only the offspring x + z is created, but also the additional offspring x − z. These two offspring are obviously symmetricFootnote 24 with respect to x. As a potential application, mentioned in [3], mirrored sampling can increase the robustness of the Evolutionary Gradient Search algorithm and the convergence velocity on the sphere model. Theoretical convergence rates for variants of the \((1{ + \atop,} \lambda _{m}^{s})\)-ES have been derived; see [16] for the corresponding results.

Sequential selection can be used to reduce the number of function evaluations. Within a (1 + λ)-ES, mutation and evaluation are executed sequentially for individual offspring, rather than generating all λ offspring first and then evaluating their fitness. As soon as an offspring has a better fitness than the parent, it replaces the parent, and no further offspring need to be generated and evaluated. In this way, up to λ − 1 function evaluations can be saved per generation.

The two concepts can be used independently of each other, or in combination. As explained before, the \((1{ + \atop,} \lambda _{m}^{s})\)-ES does not constitute a complete evolution strategy, but rather a method for generating the parent \(\langle \mathbf{x}\rangle \prime\) for the next generation based on the previous parent \(\langle \mathbf{x}\rangle\) and a method mutationOffset, which generates a mutation step and is determined by the underlying evolution strategy. The approach is summarized in pseudocode in Algorithm 2.16.

Algorithm 2.16 (\(1{ + \atop,} \lambda _{m}^{s}\))-ES

Input: search point \(\langle \mathbf{x}\rangle\) and a method mutationOffset. Output: new search point \(\langle \mathbf{x}\rangle \prime\)

i ← 0

j ← 0

while i < λ do

    ii + 1

    jj + 1

    if  (mirrored sampling) \(\wedge \) (j modulo 2 = 0)  then

\(\mathbf{x}_{i} \leftarrow \langle \mathbf{x}\rangle -\mathbf{z}_{i-1}\)

    else

      \(\mathbf{z}_{i} \leftarrow \) mutationOffset()

      \(\mathbf{x}_{i} \leftarrow \langle \mathbf{x}\rangle + \mathbf{z}_{i}\)

    end if

    if  (sequential selection) ∧ (\(f(\mathbf{x}_{i}) < f(\langle \mathbf{x}\rangle )\)then

      j ← 0

      break

    end if

end while

\(\langle \mathbf{x}\rangle \prime \leftarrow \mbox{ argmin}\left (\{f(\mathbf{x}_{1}),\ldots,f(\mathbf{x}_{i})\}\right )\)
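
A sketch of this generation step (assuming NumPy; mutation_offset is a placeholder for the mutation operator of the underlying evolution strategy, and the helper name is illustrative):

```python
import numpy as np

def mirrored_sequential_step(f, x, fx, lam, mutation_offset,
                             mirrored=True, sequential=True):
    """Sketch of Algorithm 2.16: create up to lam offspring, mirroring
    every second sample and stopping early on the first improvement."""
    evaluated, z = [], None
    for i in range(lam):
        if mirrored and i % 2 == 1:
            xi = x - z                   # mirror the previous mutation step
        else:
            z = mutation_offset()        # fresh mutation step
            xi = x + z
        fi = f(xi)
        evaluated.append((fi, xi))
        if sequential and fi < fx:       # sequential selection: stop early
            break
    return min(evaluated, key=lambda t: t[0])

# example: isotropic mutations of strength 0.3 on the sphere function
rng = np.random.default_rng(0)
f = lambda v: float(v @ v)
x = np.ones(5)
f_new, x_new = mirrored_sequential_step(
    f, x, f(x), lam=8, mutation_offset=lambda: 0.3 * rng.standard_normal(5))
```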

2.2.2.10 xNES

The xNES algorithm (exponential natural evolution strategies) [26] is a (1,λ)-ES which adapts its endogenous strategy parameters by using the so-called natural gradient (see [1]). The idea was implemented for the first time in the context of NES (natural evolution strategies) [71] and was then developed further by introducing the eNES (efficient natural evolution strategies)Footnote 25 [66].

In the following, the underlying ideas of the xNES are briefly summarized, without detailed descriptions of the underlying concepts such as the Fisher information matrix. These fundamentals can be found in the original work of Glasmachers et al. and the corresponding references, see [26].

This family of evolution strategy algorithms also relies on the multivariate normal distribution \(N(\langle \mathbf{x}\rangle,\mathbf{C})\) for generating correlated mutations of the current search point \(\langle \mathbf{x}\rangle\). Similar to the (1 + 1)-Cholesky-CMA-ES (see Sect. 2.2.2.5), rather than working with the covariance matrix C explicitly, a Cholesky factor A with \(\mathbf{C} = \mathbf{A}{\mathbf{A}}^{T}\) is used. The current search point and the covariance matrix are combined to form the tuple \(\theta = \left (\langle \mathbf{x}\rangle,\mathbf{C}\right )\), representing the quantities subject to adaptation within an xNES. Rewritten as a function of the current search point \(\langle \mathbf{x}\rangle\) and the Cholesky factor A, the probability density of \(N(\langle \mathbf{x}\rangle,\mathbf{C})\) becomes:

$$\displaystyle{ p\left (\mathbf{x}\vert \theta \right ) = \frac{1} {{\left (\sqrt{2\pi }\right )}^{n}\det \mathbf{A}} \cdot \exp \left (-\frac{1} {2}{\left \|{\mathbf{A}}^{-1} \cdot (\mathbf{x} -\langle \mathbf{x}\rangle )\right \|}^{2}\right ) }$$

Given the distribution described by θ, the expectation J(θ) of the fitness becomes:

$$\displaystyle{ J(\theta ) = E(f(\mathbf{x})\vert \theta ) =\int f(\mathbf{x})p(\mathbf{x}\vert \theta )d\mathbf{x} }$$

The gradient of the expectation J(θ), ∇θ J(θ), can be calculated by using the so-called log-likelihood trick according to

$$\displaystyle{ \nabla _{\theta }J(\theta ) =\int \left (f(\mathbf{x})\nabla \log (p(\mathbf{x}\vert \theta ))\right )p(\mathbf{x}\vert \theta )d\mathbf{x}, }$$

which can be approximated by Monte Carlo estimation based on the offspring individuals \(\mathbf{x}_{i}\), \(i \in \{1,\ldots,\lambda \}\):

$$\displaystyle{ \nabla _{\theta }J(\theta ) \approx \frac{1} {\lambda }\sum _{i=1}^{\lambda }f(\mathbf{x}_{ i})\nabla \log (p(\mathbf{x}_{i}\vert \theta )). }$$

For calculating the term \(\nabla \log (p(\mathbf{x}\vert \theta ))\), we refer to [67]. Combining this with the Fisher information matrix (FIM) \(\mathbf{F} \in {\mathbb{R}}^{N\times N}\), where N = n + n(n + 1)∕2, the natural gradient G is obtained as:

$$\displaystyle{ G ={ \mathbf{F}}^{-1}\nabla _{ \theta }J(\theta ) }$$

Use of G is motivated by the fact that it is invariant with respect to linear transformations, so that the gradient-based adaptation behaves in a correlated search space essentially as it does in an isotropic one.

The NES suffers from the disadvantage of an impractical computational complexity of O(n 6), caused by the explicit calculation of the FIM and its inversion. In contrast, the xNES no longer requires an explicit calculation of the FIM. Based on a so-called exponential parameterization (see Sect. 4.1 in [26]), a transformation of θ into natural coordinates (see Sect. 4.2 in [26]) is applied. Using the step size σ and the Cholesky factor B, an offspring x is then generated from the parent \(\langle \mathbf{x}\rangle\) according to:

$$\displaystyle{ \mathbf{x} =\langle \mathbf{x}\rangle + \sigma \mathbf{B}\mathbf{z}\mbox{ where }\mathbf{z} = N(\mathbf{0},\mathbf{I}) }$$
(2.18)

Similar to weighted recombination, the xNES uses so-called utility values u i . This approach is also called fitness shaping in the context of an xNES. Using the rank i given by the fitness values, utility values are calculated as follows:

$$\displaystyle{ u_{i} = \frac{\max \left (0,\log \left (\frac{\lambda } {2} + 1\right ) -\log (i)\right )} {\sum _{j=1}^{\lambda }\max \left (0,\log \left (\frac{\lambda } {2} + 1\right ) -\log (j)\right )} - \frac{1} {\lambda } }$$
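
For a given λ, the utility vector can be computed as in the following small sketch (assuming NumPy); note that the utilities sum to zero.

```python
import numpy as np

def utilities(lam):
    """Fitness-shaping utilities u_1 >= ... >= u_lam; they sum to zero."""
    ranks = np.arange(1, lam + 1)
    raw = np.maximum(0.0, np.log(lam / 2 + 1) - np.log(ranks))
    return raw / raw.sum() - 1 / lam
```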

Using the mutation vectors z i from Eq. 2.18, the gradients G M for the covariance matrix and G δ for the current search point are defined by:

$$\displaystyle\begin{array}{rcl} \mathbf{G}_{M}& =& \sum _{i=1}^{\lambda }u_{ i}\left (\mathbf{z}_{i}\mathbf{z}_{i}^{T} -\mathbf{I}\right ) {}\\ \mathbf{G}_{\delta }& =& \sum _{i=1}^{\lambda }u_{ i}\mathbf{z}_{i} {}\\ \end{array}$$

For calculating the gradients, all λ offspring individuals are taken into account, i.e., a selection in the classical sense is not applied. Using these gradients and the learning rates η x , ησ and η B , the new search point \(\langle \mathbf{x}\rangle \prime\), the new step size σ′, and the new Cholesky factor B′ are calculated:

$$\displaystyle\begin{array}{rcl} \langle \mathbf{x}\rangle \prime& =& \langle \mathbf{x}\rangle + \eta _{x} \cdot \sigma \mathbf{B}\mathbf{G}_{\delta } {}\\ \sigma \prime& =& \sigma \cdot \exp \left (\frac{\eta _{\sigma }} {2n} \cdot \mbox{ tr}\left (\sum _{i=1}^{\lambda }u_{ i} \cdot \left (\mathbf{z}_{i}\mathbf{z}_{i}^{T} -\mathbf{I}\right )\right )\right ) {}\\ \mathbf{B}\prime& =& \mathbf{B} \cdot \exp \left (\frac{\eta _{B}} {2} \cdot \mathbf{G}_{M}\right ) {}\\ \end{array}$$

Here, the exponential function of a matrix A is defined by \(\exp (\mathbf{A}) =\sum _{ k=0}^{\infty }\frac{{\mathbf{A}}^{k}} {k!}\), see [26].

The resulting pseudocode of the xNES is given in Algorithm 2.17. The default settings of the exogenous strategy parameters are as follows:

$$\displaystyle\begin{array}{rcl} \lambda & =& 4 + \lfloor 3\log (n)\rfloor {}\\ \eta _{x}& =& 1 {}\\ \eta _{\sigma }& =& \frac{3} {5} \cdot \frac{3 +\log (n)} {n\sqrt{n}} {}\\ \eta _{B}& =& \eta _{\sigma } {}\\ \end{array}$$

Algorithm 2.17 xNES

initialize \(\langle \mathbf{x}\rangle\)

\(\mathbf{B} \leftarrow \mathbf{I}\)

\(\sigma \leftarrow \root{n}\of{\vert \det \mathbf{B}\vert }\)

for i = 1 → λ do

\(u_{i} \leftarrow \frac{\max \left (0,\log \left (\frac{\lambda } {2} +1\right )-\log (i)\right )} {\sum _{j=1}^{\lambda }\max \left (0,\log \left (\frac{\lambda } {2} +1\right )-\log (j)\right )} -\frac{1} {\lambda }\)

end for

repeat

    for i = 1 → λ do

      \(\mathbf{z}_{i} \leftarrow N(\mathbf{0},\mathbf{I})\)

      \(\mathbf{x}_{i} \leftarrow \langle \mathbf{x}\rangle + \sigma \mathbf{B}\mathbf{z}_{i}\)

    end for

    sort \(\{(\mathbf{z}_{i},\mathbf{x}_{i})\}\) by \(f(\mathbf{x}_{i})\)

    \(\mathbf{G}_{\delta } \leftarrow \sum _{i=1}^{\lambda }u_{i} \cdot \mathbf{z}_{i}\)

    \(\mathbf{G}_{M} \leftarrow \sum _{i=1}^{\lambda }u_{i} \cdot \left (\mathbf{z}_{i}\mathbf{z}_{i}^{T} -\mathbf{I}\right )\)

    \(G_{\sigma } \leftarrow \mbox{ tr}(\mathbf{G}_{M})/n\)

    \(\mathbf{G}_{B} \leftarrow \mathbf{G}_{M} - G_{\sigma } \cdot \mathbf{I}\)

    \(\langle \mathbf{x}\rangle \leftarrow \langle \mathbf{x}\rangle + \eta _{x} \cdot \sigma \mathbf{B} \cdot \mathbf{G}_{\delta }\)

    \(\sigma \leftarrow \sigma \cdot \exp \left (G_{\sigma } \cdot \frac{\eta _{\sigma }} {2} \right )\)

    \(\mathbf{B} \leftarrow \mathbf{B} \cdot \exp \left (\mathbf{G}_{B} \cdot \frac{\eta _{B}} {2} \right )\)

until termination criterion satisfied
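
Putting the pieces together, a compact sketch of Algorithm 2.17 might read as follows (assuming NumPy and SciPy for the matrix exponential; the evaluation budget is illustrative).

```python
import numpy as np
from scipy.linalg import expm            # matrix exponential exp(A)

def xnes(f, x0, sigma0=1.0, max_gens=500, seed=0):
    """Minimal xNES sketch (Algorithm 2.17) for minimizing f."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    n = len(x)
    lam = 4 + int(np.floor(3 * np.log(n)))
    eta_x = 1.0
    eta_sigma = eta_b = 0.6 * (3 + np.log(n)) / (n * np.sqrt(n))
    ranks = np.arange(1, lam + 1)        # fitness-shaping utilities
    raw = np.maximum(0.0, np.log(lam / 2 + 1) - np.log(ranks))
    u = raw / raw.sum() - 1 / lam
    sigma, B, I = sigma0, np.eye(n), np.eye(n)
    for _ in range(max_gens):
        Z = rng.standard_normal((lam, n))
        X = x + sigma * Z @ B.T          # x_i = <x> + sigma * B z_i
        Z = Z[np.argsort([f(xi) for xi in X])]   # sort z_i by fitness
        G_delta = u @ Z
        G_M = sum(ui * (np.outer(z, z) - I) for ui, z in zip(u, Z))
        G_sigma = np.trace(G_M) / n
        G_B = G_M - G_sigma * I
        x = x + eta_x * sigma * (B @ G_delta)
        sigma *= np.exp(0.5 * eta_sigma * G_sigma)
        B = B @ expm(0.5 * eta_b * G_B)
    return x
```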

2.2.2.11 (1+1)-Active-CMA-ES

Extending the (1+1)-Cholesky-CMA-ES with the idea of the Active-CMA-ES, i.e., taking information from unsuccessful offspring into account for covariance matrix adaptation, leads to a hybrid, the (1+1)-Active-CMA-ES [2]. Instead of using an explicit covariance matrix \(\mathbf{C} = \mathbf{A}{\mathbf{A}}^{T}\), the (1+1)-Active-CMA-ES works directly with the Cholesky factor A and its inverse \({\mathbf{A}}^{-1}\). The update of A has been defined previously, based on Theorem 2.2.1. In order to also update \({\mathbf{A}}^{-1}\), an extended version of this theorem is required, which we state below (without proof, see [2]):

Theorem 2.2.2.

Let \(\mathbf{C} \in {\mathbb{R}}^{n\times n}\) be a symmetric, positive definite matrix with Cholesky decomposition \(\mathbf{C} = \mathbf{A}{\mathbf{A}}^{T}\) , and let \(\mathbf{C}\prime = \alpha \mathbf{C} + \beta \mathbf{v}{\mathbf{v}}^{T}\) be an update transformation of C where \(\mathbf{v} \in {\mathbb{R}}^{n}\setminus \{\mathbf{0}\}\), \(\alpha \in {\mathbb{R}}^{+}\) and \(\beta \in \mathbb{R}\) . Let \(\mathbf{w} ={ \mathbf{A}}^{-1}\mathbf{v}\) with \(\alpha + \beta \|{\mathbf{w}\|}^{2} > 0\) and let \(\mathbf{C}\prime = \mathbf{A}\prime{\mathbf{A}\prime}^{T}\) be the Cholesky decomposition of the updated matrix C ′. Then, the Cholesky factor A ′ and its inverse \({\mathbf{A}\prime}^{-1}\) are obtained as follows: \(\mathbf{A}\prime = \sqrt{\alpha }\mathbf{A} + \frac{\sqrt{\alpha }} {\|{\mathbf{w}\|}^{2}} \left (\sqrt{1 + \frac{\beta } {\alpha }\|{\mathbf{w}\|}^{2}} - 1\right )\mathbf{A}\mathbf{w}{\mathbf{w}}^{T}\) and \({\mathbf{A}\prime}^{-1} = \frac{1} {\sqrt{\alpha }}{\mathbf{A}}^{-1} - \frac{1} {\sqrt{\alpha }\|{\mathbf{w}\|}^{2}} \left (1 - \frac{1} {\sqrt{1+\beta \|{\mathbf{w} \|}^{2 } /\alpha }}\right )\mathbf{w}{\mathbf{w}}^{T}{\mathbf{A}}^{-1}\) .
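
Both updates can be maintained jointly in a single helper; the following sketch (assuming NumPy) implements Theorem 2.2.2 and checks its two claims numerically.

```python
import numpy as np

def rank_one_update(A, A_inv, alpha, beta, v):
    """Joint update of a Cholesky factor and its inverse (Theorem 2.2.2)."""
    w = A_inv @ v
    w2 = w @ w
    ra = np.sqrt(alpha)
    A_new = (ra * A + ra / w2 * (np.sqrt(1 + beta / alpha * w2) - 1)
             * np.outer(A @ w, w))
    A_inv_new = (A_inv / ra - (1 - 1 / np.sqrt(1 + beta * w2 / alpha))
                 / (ra * w2) * np.outer(w, w) @ A_inv)
    return A_new, A_inv_new

# numerical self-check on a random invertible factor
rng = np.random.default_rng(2)
A = np.tril(rng.standard_normal((4, 4))) + 4 * np.eye(4)
v = rng.standard_normal(4)
A2, A2_inv = rank_one_update(A, np.linalg.inv(A), 0.8, 0.3, v)
assert np.allclose(A2 @ A2_inv, np.eye(4))
assert np.allclose(A2 @ A2.T, 0.8 * A @ A.T + 0.3 * np.outer(v, v))
```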

The offspring x′ is generated from its parent x according to:

$$\displaystyle{ \mathbf{x}\prime = \mathbf{x} + \sigma \mathbf{A}\mathbf{z}\mbox{ where }\mathbf{z} = N(\mathbf{0},\mathbf{I}) }$$

As for the (1+1)-Cholesky-CMA-ES, the success rate p s , i.e., the fraction of successful mutations, is updated by taking the learning rate c p into account:

$$\displaystyle{ p_{s}\prime = \left \{\begin{array}{@{}l@{\quad }l@{}} (1 - c_{p})p_{s} + c_{p}\quad &\mbox{ if }f(\mathbf{x}\prime) \leq f(\mathbf{x}) \\ (1 - c_{p})p_{s} \quad &\mbox{ if }f(\mathbf{x}\prime) > f(\mathbf{x}) \end{array} \right. }$$

Based on the success rate p s , a damping parameter \(d \in {\mathbb{R}}^{+}\) and the target success rate p t , the global step size σ is updated as follows:

$$\displaystyle{ \sigma \prime = \sigma \cdot \exp \left (\frac{1} {d} \frac{p_{s} - p_{t}} {1 - p_{t}} \right ) }$$

The algorithm uses \(p_{t} = \frac{2} {11}\), which makes the update similar to the 1/5-success rule update mechanism of the (1+1)-ES.

If the offspring performs better than its parent, a positive Cholesky update is applied. In contrast to the (1+1)-Cholesky-CMA-ES, which uses the mutation step z for this update, the (1+1)-Active-CMA-ES relies on a search path s, accumulating successful mutation steps with a learning rate c and updating s as follows:

$$\displaystyle{ \mathbf{s}\prime = (1 - c)\mathbf{s} + \sqrt{c(2 - c)}\mathbf{A}\mathbf{z} }$$

With a constant c c + > 0 and the vector \(\mathbf{w} ={ \mathbf{A}}^{-1}\mathbf{s}\), the positive update of matrices A and A −1 can now be defined according to Theorem 2.2.2:

$$\displaystyle\begin{array}{rcl} \mathbf{A}\prime& =& a\mathbf{A} + b(\mathbf{A}\mathbf{w}){\mathbf{w}}^{T}\mbox{ and }{}\end{array}$$
(2.19)
$$\displaystyle\begin{array}{rcl}{ \mathbf{A}\prime}^{-1}& =& \frac{1} {a}{\mathbf{A}}^{-1} - \frac{b} {{a}^{2} + \mathit{ab}\|{\mathbf{w}\|}^{2}}\mathbf{w}({\mathbf{w}}^{T}{\mathbf{A}}^{-1})\mbox{ where } \\ a& =& \sqrt{1 - c_{c }^{+}}\mbox{ and } \\ b& =& \frac{\sqrt{1 - c_{c }^{+}}} {\|{\mathbf{w}\|}^{2}} \left (\sqrt{1 + \frac{c_{c }^{+ }} {1 - c_{c}^{+}}\|{\mathbf{w}\|}^{2}} - 1\right ) {}\end{array}$$
(2.20)

In the case of the Active-CMA-ES, the μ worst individuals are used for the negative update of the covariance matrix; they can be regarded as the “especially bad” individuals. In the case of the corresponding (1+1)-strategy, as introduced here, this definition is not applicable. Instead, the (1+1)-Active-CMA-ES stores past function evaluations and defines an offspring to be “especially bad” if its fitness value is worse than the fitness of its k-th predecessor. For an “especially bad” offspring, a negative update according to Eqs. 2.19 and 2.20 is performed, using modified values of the coefficients a and b. In contrast to the positive update, the vector z is used for the negative update rather than the transformed search path \(\mathbf{w} ={ \mathbf{A}}^{-1}\mathbf{s}\):

$$\displaystyle\begin{array}{rcl} a& =& \sqrt{1 + c_{c }^{-}} {}\\ b& =& \frac{\sqrt{1 + c_{c }^{-}}} {\|{\mathbf{z}\|}^{2}} \left (\sqrt{1 - \frac{c_{c }^{- }} {1 + c_{c}^{-}}\|{\mathbf{z}\|}^{2}} - 1\right ) {}\\ \end{array}$$

To ensure a positive definite covariance matrix, \(1 - \frac{c_{c}^{-}} {1+c_{c}^{-}}\|{\mathbf{z}\|}^{2} > 0\) needs to hold for the constant \(c_{c}^{-}\). Moreover, the convergence behavior of the algorithm can become unstable if the value of \(1 - \frac{c_{c}^{-}} {1+c_{c}^{-}}\|{\mathbf{z}\|}^{2}\) is very close to zero. As a countermeasure, in case of \(1 - \frac{c_{c}^{-}} {1+c_{c}^{-}}\|{\mathbf{z}\|}^{2} < 1/2\), the value of \(c_{c}^{-}\) is capped at the upper bound \(1/(2\|{\mathbf{z}\|}^{2})\).

The default settings of the exogenous parameters are:

$$\displaystyle\begin{array}{rcl} d& =& 1 + n/2 {}\\ c& =& 2/(n + 2) {}\\ c_{p}& =& 1/12 {}\\ p_{t}& =& 2/11 {}\\ c_{c}^{+}& =& \frac{2} {{n}^{2} + 6} {}\\ c_{c}^{-}& =& \frac{2} {5({n}^{8/5} + 1)} {}\\ \end{array}$$

The pseudocode of the (1+1)-Active-CMA-ES is given in Algorithm 2.18.

Algorithm 2.18 (1+1)-Active-CMA-ES

initialize x, σ, \(\mathbf{A} \leftarrow \mathbf{I}\), \({\mathbf{A}}^{-1} \leftarrow \mathbf{I}\), \(\mathbf{s} \leftarrow \mathbf{0}\), \(p_{s} \leftarrow p_{t}\), \(\mathbf{h} \leftarrow \mathbf{0} \in {\mathbb{R}}^{k}\)

\(t \leftarrow 0\)

repeat

    tt + 1

    \(\mathbf{z} \leftarrow N(\mathbf{0},\mathbf{I})\)

    \(\mathbf{y} \leftarrow \mathbf{x} + \sigma \mathbf{A}\mathbf{z}\)

    if t > k then

      \(h_{i} \leftarrow h_{i+1}\) \(\forall i \in \{1,\ldots,k - 1\}\)

      \(h_{k} \leftarrow f(\mathbf{y})\)

    else

      \(h_{t} \leftarrow f(\mathbf{y})\)

    end if

    if \(f(\mathbf{y}) \leq f(\mathbf{x})\) then

      \(\mathbf{x} \leftarrow \mathbf{y}\)

      \(p_{s} \leftarrow (1 - c_{p})p_{s} + c_{p}\)

      \(\mathbf{s} \leftarrow (1 - c)\mathbf{s} + \sqrt{c(2 - c)}\mathbf{A}\mathbf{z}\)

      \(\mathbf{w} \leftarrow {\mathbf{A}}^{-1}\mathbf{s}\)

      \(a \leftarrow \sqrt{1 - c_{c }^{+}}\)

      \(b \leftarrow \frac{\sqrt{1-c_{c }^{+}}} {\|{\mathbf{w}\|}^{2}} \left (\sqrt{1 + \frac{c_{c }^{+ }} {1-c_{c}^{+}} \|{\mathbf{w}\|}^{2}} - 1\right )\)

      \(\mathbf{A} \leftarrow a\mathbf{A} + b\left (\mathbf{A}\mathbf{w}\right ){\mathbf{w}}^{T}\)

\({\mathbf{A}}^{-1} \leftarrow \frac{1} {a}{\mathbf{A}}^{-1} - \frac{b} {{a}^{2}+\mathit{ab}\|{\mathbf{w}\|}^{2}} \mathbf{w}\left ({\mathbf{w}}^{T}{\mathbf{A}}^{-1}\right )\)

    else

      \(p_{s} \leftarrow (1 - c_{p})p_{s}\)

if \(h_{1} < f(\mathbf{y})\) then

        \(a \leftarrow \sqrt{1 + c_{c }^{-}}\)

        \(b \leftarrow \frac{a} {\|{\mathbf{z}\|}^{2}} \left (\sqrt{1 - \frac{c_{c }^{- }} {1+c_{c}^{-}}\|{\mathbf{z}\|}^{2}} - 1\right )\)

\(\mathbf{A} \leftarrow a\mathbf{A} + b\left (\mathbf{A}\mathbf{z}\right ){\mathbf{z}}^{T}\)

\({\mathbf{A}}^{-1} \leftarrow \frac{1} {a}{\mathbf{A}}^{-1} - \frac{b} {{a}^{2}+\mathit{ab}\|{\mathbf{z}\|}^{2}} \mathbf{z}\left ({\mathbf{z}}^{T}{\mathbf{A}}^{-1}\right )\)

      end if

    end if

    \(\sigma \leftarrow \sigma \exp \left (\frac{1} {d} \frac{p_{s}-p_{t}} {1-p_{t}} \right )\)

until termination criterion satisfied

2.2.2.12 (\(\mu /\mu _{W},\lambda _{\mathit{iid}} + \lambda _{m}\))-ES

The (\(\mu /\mu _{W},\lambda _{\mathit{iid}} + \lambda _{m}\))-ES [7] is based on extending the idea of mirrored sampling, as described in Sect. 2.2.2.9 for a \((1{ + \atop,} \lambda _{m}^{s})\)-ES, to the case μ > 1. The offspring population size is given by the number of samples λ iid (independent, identically distributed samples from the mutation distribution) and the number of offspring λ m (\(\lambda _{m} \leq \lambda _{\mathit{iid}}\)) which are additionally subject to mirroring. Using mirrored sampling in combination with weighted recombination and cumulative step size adaptation (see Sect. 2.2.2.1) introduces a bias with respect to the step size, i.e., the step size is reduced more than desired, potentially causing premature stagnation of the algorithm. To avoid this issue, the concept of pairwise selection is introduced: recombination never involves an offspring individual and its mirrored version at the same time, but only one of the two.

The (\(\mu /\mu _{W},\lambda _{\mathit{iid}} + \lambda _{m}\))-ES introduces two different versions of mirroring, namely random mirroring and selective mirroring. In the case of random mirroring, denoted by (\(\mu /\mu _{W},\lambda _{\mathit{iid}} + \lambda _{m}^{rand}\))-ES, the λ m offspring subject to mirroring are randomly selected out of the total number of offspring, \(\lambda _{\mathit{iid}}\). In the case of selective mirroring, denoted by (\(\mu /\mu _{W},\lambda _{\mathit{iid}} + \lambda _{m}^{\mathit{sel}}\))-ES, the \(\lambda _{\mathit{iid}}\) offspring are first sorted by fitness and the λ m worst individuals undergo mirroring. This approach is motivated by considering that, in a convex objective function topology, mirroring the best offspring cannot yield any further improvement, such that it will be advantageous to mirror the worst individuals. Moreover, since bad offspring in the case of a \((\mu _{W},\lambda )\)-ES are often generated by applying too-large mutation steps, selective mirroring itself will also favor large mutation steps [7]. To counteract this undesired bias, the resample length approach changes the length of the mirrored mutation step by additionally using a second, newly sampled mutation vector z′. The mirrored version \(\mathbf{x}_{m}\) of the offspring \(\mathbf{x} =\langle \mathbf{x}\rangle + \sigma \mathbf{z}\) is then created according to \(\mathbf{x}_{m} =\langle \mathbf{x}\rangle - \sigma \frac{\|\mathbf{z}\prime\|} {\|\mathbf{z}\|}\mathbf{z}\).

Like for the \((1{ + \atop,} \lambda _{m}^{s})\)-ES, theoretical results for the convergence velocity on the sphere model have been derived, see [7]. In particular, it has been shown that, for the sphere model, maximum convergence velocity is achieved for a setting of \(r = \lambda _{m}/\lambda _{\mathit{iid}} \approx 0.1886\), which can serve as a guideline for the fraction of offspring individuals which should be mirrored.

The pseudocode as given in Algorithm 2.19 is based on using a method updateStepSize Footnote 26 to update the step size σ, and weights w i \(\forall i \in \{1,\ldots,\mu \}\), such that \(\sum _{i=1}^{\mu }w_{i} = 1\).

Algorithm 2.19 (\(\mu /\mu _{W},\lambda _{\mathit{iid}} + \lambda _{m}\))-ES

initialize \(\langle \mathbf{x}\rangle\), σ

r ← 0

repeat

    i ← 0

    while \(i < \lambda _{\mathit{iid}}\) do

      rr + 1

      ii + 1

      \(\mathbf{x}_{i} \leftarrow \langle \mathbf{x}\rangle + \sigma N(\mathbf{0},\mathbf{I})\)

    end while

    if  selective mirroring then

      \(\mathbf{x}_{1},\ldots,\mathbf{x}_{\lambda _{\mathit{iid}}} = \mbox{ argsort}\left (f(\mathbf{x}_{1}),\ldots,f(\mathbf{x}_{\lambda _{\mathit{iid}}})\right )\)

    end if

    while \(i < \lambda _{\mathit{iid}} + \lambda _{m}\) do

      ii + 1

      if resample length then

        rr + 1

\(\mathbf{x}_{i} \leftarrow \langle \mathbf{x}\rangle - \frac{\sigma \|N(\mathbf{0},\mathbf{I})\|} {\|\mathbf{x}_{i-\lambda _{m}}-\langle \mathbf{x}\rangle \|}\left (\mathbf{x}_{i-\lambda _{m}} -\langle \mathbf{x}\rangle \right )\)

      else

        \(\mathbf{x}_{i} \leftarrow \langle \mathbf{x}\rangle -\left (\mathbf{x}_{i-\lambda _{m}} -\langle \mathbf{x}\rangle \right )\)

      end if

    end while

\(\mathbf{x}_{1},\ldots,\mathbf{x}_{\lambda _{\mathit{iid}}} = \mbox{ argsort}(f(\mathbf{x}_{1}),\ldots,f(\mathbf{x}_{\lambda _{\mathit{iid}}-\lambda _{m}}),\min \{f(\mathbf{x}_{\lambda _{\mathit{iid}}-\lambda _{m}+1}),f(\mathbf{x}_{\lambda _{\mathit{iid}}+1})\},\ldots,\min \{f(\mathbf{x}_{\lambda _{\mathit{iid}}}),f(\mathbf{x}_{\lambda _{\mathit{iid}}+\lambda _{m}})\})\)

    σ ← updateStepSize\((\sigma,\mathbf{x}_{1},\ldots,\mathbf{x}_{\lambda _{\mathit{iid}}},\langle \mathbf{x}\rangle )\)

    \(\langle \mathbf{x}\rangle \leftarrow \langle \mathbf{x}\rangle +\sum _{ i=1}^{\mu }w_{i}(\mathbf{x}_{i} -\langle \mathbf{x}\rangle )\)

until termination criterion satisfied
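
The pairwise selection at the end of the loop, letting each mirrored pair contribute only its better member to the ranking, can be sketched in plain Python as follows (argument names are illustrative).

```python
def pairwise_reduce(f_vals, lam_iid, lam_m):
    """f_vals contains lam_iid + lam_m fitnesses: the original offspring
    first, then the mirrored versions of the last lam_m originals. Each
    mirrored pair contributes only its better member to the ranking."""
    head = list(f_vals[:lam_iid - lam_m])       # offspring without mirrors
    pairs = [min(f_vals[lam_iid - lam_m + j], f_vals[lam_iid + j])
             for j in range(lam_m)]
    return head + pairs                         # lam_iid values to be sorted
```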

2.2.2.13 SPO-CMA-ES

The SPO-CMA-ES [70] is essentially a restart-version of the \((\mu _{W},\lambda )\)-CMA-ES. It is based on using sequential parameter optimization (SPO) [11] to optimize the exogenous parameters of an evolution strategy. SPO uses methods of design of experiments (DoE) and design and analysis of computer experiments (DACE).Footnote 27

The exogenous parameters subject to sequential parameter optimization are the number of offspring individualsFootnote 28 \(\lambda \in \{\lambda _{\mathit{def }},\ldots,1000\}\), the initial step size \(\sigma _{\mathit{init}} \in [1,5]\), and the so-called selection pressure λ∕μ ∈ [1.5,2.5].

The pseudocode of the SPO-CMA-ES is provided in Algorithm 2.20, and the approach is explained in the following by discussing the various methods used in the algorithm. To begin with, an initial design of experiments for the exogenous parameters is created using Latin hypercube sampling (LHS) [68]. In the next step (runDesign), independent runs of the (μ W ,λ)-CMA-ES are executed, using the parameter sets of the DoE plan. The result of each run, i.e., the best evaluated individual together with its fitness value, is collected in the set Y. This initial phase of the algorithm is called the exploration phase.

The next phase, called the exploitation phase, is repeated until the predefined budget of function evaluations is reached. Using a function aggregateRuns, a performance measure y is calculated for every run configuration in Y. Based on these performance measures as outputs and the corresponding parameter sets of the experimental plan as inputs, a Kriging modelFootnote 29 \(\mathcal{M}\) is trained in the method fitModel. This Kriging model \(\mathcal{M}\) is then used by the method modelOptimization to identify a new design point, e.g., by running an optimization on the Kriging model and using the resulting point. The new design point d is then added to the experimental plan D, and the loop is executed again. Default settings are given neither for the size of the initial experimental plan, N init , nor for the split of the number of function evaluations between the two phases of the algorithm [70]. Rather, the user of the algorithm can fix them, depending on the optimization task at hand. In the case of noisy objective functions, the method runDesign can execute more than one run, in order to use, e.g., averages as an estimate of the true fitness value.

Algorithm 2.20 SPO-CMA-ES

Input: box constraints \(\mathbf{l},\mathbf{u} \in {\mathbb{R}}^{n}\) and size N init of the initial design. Output: final model \(\mathcal{M}\) and best design point \({d}^{{\ast}}\)

\(i \leftarrow 0\), \(D \leftarrow \varnothing \)

\(d_{i} \leftarrow \mbox{ LHS}(\mathbf{l},\mathbf{u},N_{\mathit{init}})\)

\(Y \leftarrow \mbox{ runDesign}(d_{i})\)

\(D \leftarrow D \cup d_{i}\)

while function evaluation budget not exhausted do

    \(i \leftarrow i + 1\)

    \(y \leftarrow \mbox{ aggregateRuns}(Y )\)

    \(\mathcal{M}\leftarrow \) fitModel(D,y)

    \(d_{i} \leftarrow \mbox{ modelOptimization}(\mathcal{M})\)

    \(Y \leftarrow Y \cup \mbox{ runDesign}(d_{i})\)

    \(D \leftarrow D \cup d_{i}\)

end while

\({d}^{{\ast}}\leftarrow d_{k}\) with the best \(y_{k} \in \{y_{0},\ldots,y_{i}\}\)
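
As an illustration, Latin hypercube sampling and the surrounding SPO loop can be sketched as follows (assuming NumPy; run_design, fit_model and model_optimization remain placeholders as in Algorithm 2.20, and for simplicity the budget is counted in design evaluations rather than function evaluations).

```python
import numpy as np

def latin_hypercube(l, u, n_init, rng):
    """n_init points in the box [l, u], one per axis-aligned stratum."""
    n = len(l)
    strata = rng.permuted(np.tile(np.arange(n_init), (n, 1)), axis=1)
    grid = strata.T + rng.random((n_init, n))
    return l + (u - l) * grid / n_init

def spo_loop(run_design, fit_model, model_optimization, l, u, n_init,
             budget, rng):
    """Skeleton of Algorithm 2.20 built on hypothetical helper callables:
    run_design executes CMA-ES runs for one design point and returns an
    aggregated performance value, fit_model trains a surrogate (e.g., a
    Kriging model), and model_optimization proposes a new design point."""
    D = list(latin_hypercube(l, u, n_init, rng))      # exploration phase
    y = [run_design(d) for d in D]
    while len(y) < budget:                            # exploitation phase
        model = fit_model(D, y)
        d_new = model_optimization(model)
        D.append(d_new)
        y.append(run_design(d_new))
    return fit_model(D, y), D[int(np.argmin(y))]
```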

2.3 Further Aspects of ES

So far, we have described the ES algorithms as single-criterion optimizers with \({\mathbb{R}}^{n}\) as search domain and without handling of constraints. The next three sections give brief overviews of, and literature references for, further aspects of ES, namely constraint handling, binary and integer search spaces, and multiobjective optimization.

2.3.1 Constraint Handling

In Sect. 2.1.1 we defined the optimization problem used throughout this book with equality and inequality constraints as in Eq. 2.2. There are many techniques for handling constraints, ranging from simple penalty methods to more complex ones such as hybrid methods involving Lagrangian multipliers. Coello gives an overview [18] of constraint-handling techniques for evolutionary algorithms in general, and many of these methods can be applied to ES as well. A review by Kramer [42] specializes in constraint-handling methods dedicated to ES and presents four techniques: penalty methods, a multiobjective bioinspired approach, coordinate alignment techniques, and metamodeling of constraints.

2.3.2 Beyond Real-Valued Search Spaces

There are many optimization problems where the search domain is not constrained to the real domain. Especially decision problemsFootnote 30 use categorical search spaces, in most cases binary search spaces, i.e., \(\mathbf{x} \in \{0,1{\}}^{n}\), as the simplest categorical case. Another search space of practical use is the integer search space, representable as a subset of \(\mathbb{Z}\). Originally, Genetic Algorithms (see [27] or [25] for a comprehensive introduction) were designed to handle binary search spaces, but there are approaches to incorporate such search spaces into ES. In Sect. 2.1.3 we named three guidelines for choosing a mutation distribution. Rudolph [56] introduces a mutation operator for integer search spaces using the difference of two geometrically distributed random variables. For categorical search spaces, each discrete variable is assigned a mutation probability, and the new value of a mutating variable is drawn uniformly from all possible values. The MI-ES (mixed-integer evolution strategies) [43] solve optimization problems whose search domain is mixed, i.e., a composition of real, integer and categorical search spaces; they use the aforementioned mutation operators together with self-adaptation of the endogenous parameters (see the sketch below for the integer case). An overview of other approaches for handling mixed search spaces is given by Li [43].
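
A small sketch of such an integer mutation (assuming NumPy; the mutation probability and the geometric parameter are illustrative choices, not the settings from [56]):

```python
import numpy as np

rng = np.random.default_rng(0)

def integer_mutation(x, p_mut=0.3, p_geo=0.5):
    """Mutate an integer vector: each variable mutates with probability
    p_mut by adding the difference of two geometric random variables,
    which yields a symmetric step distribution over the integers."""
    x = np.array(x, dtype=int)
    mask = rng.random(x.shape) < p_mut
    steps = rng.geometric(p_geo, x.shape) - rng.geometric(p_geo, x.shape)
    x[mask] += steps[mask]
    return x
```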

2.3.3 Multiobjective Optimization

In single-objective optimization, fitness values can be totally ordered, so it is always possible to decide whether one solution is better than another. In multiobjective optimization, where fitness values are vectors, such a strict ordering no longer exists. Solutions are only partially ordered: based on the partial order, a solution either dominates another or is non-dominated with respect to it. Hence there is not a single optimum to be found, but a set of trade-off solutions, called the Pareto set; its image in objective space is the Pareto front. For a detailed description of these concepts see [20]. Algorithms for multiobjective optimization have to measure how well a Pareto front is approximated. The most common measures for this task are the crowding distance and the hypervolume contribution; the former is used, for example, by NSGA-II [21], the latter by SMS-EMOA [12].