Prior to introducing the particular algorithms in Sect. 2.2, the more general foundations of evolution strategies are introduced in Sect. 2.1. To start with, the definition of an optimization task as used throughout this book is given in Sect. 2.1.1. Following [58], Sect. 2.1.2 presents a discussion of evolution strategy metaheuristics as a special case of evolutionary algorithms. In particular, the components of such a metaheuristic—namely recombination, mutation, evaluation and selection—are described in a general way. Due to the particular importance of the mutation operator for evolution strategies (in \({\mathbb{R}}^{n}\)), it is discussed in quite some detail in Sect. 2.1.3.

2.1 Introduction

2.1.1 Optimization

Evolution strategies are particularly well suited (and developed) for nonlinear optimization tasks, which are defined as follows (see e.g. [17], Sect. 18.2.1.1):

$$\displaystyle{ f(\mathbf{x}) =\min !\mbox{ for }\mathbf{x} \in {\mathbb{R}}^{n}\mbox{ where} }$$
(2.1)
$$\displaystyle{ g_{i}(\mathbf{x}) \leq 0,i \in I = \{1,\ldots,m\},h_{j}(\mathbf{x}) = 0,j \in J = \{1,\ldots,r\}, }$$
(2.2)

and the set

$$\displaystyle{ M = \{\mathbf{x} \in {\mathbb{R}}^{n}: g_{ i}(\mathbf{x}) \leq 0,\forall i \in I,h_{j}(\mathbf{x}) = 0,\forall j \in J\} }$$
(2.3)

is called the set of feasible points and it defines the search space of the optimization problem. A point \({\mathbf{x}}^{{\ast}}\in {\mathbb{R}}^{n}\) is called a global minimum, if

$$\displaystyle{ {f}^{{\ast}} = f({\mathbf{x}}^{{\ast}}) \leq f(\mathbf{x})\mbox{ for all }\mathbf{x} \in M }$$
(2.4)

Conversely, it is called a local minimum if the above inequality holds only for all \(\mathbf{x}\) within an ε-neighborhood \(U_{\epsilon }({\mathbf{x}}^{{\ast}}) \subseteq M\) of \({\mathbf{x}}^{{\ast}}\).

Formulating an optimization problem as a minimization task is equivalent to searching for a maximum or for a given target value, since maximization of f can be replaced by minimization of − f, and a target value \(\bar{f}\) can be attained by minimizing \(\rho (\bar{f},f)\) with an arbitrary distance measure ρ.
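Both transformations are one-liners in code; the following sketch (the function names are ours) wraps a given f accordingly:

# Minimal sketch: reduce maximization and target-value search to minimization.
def as_minimization_of_max(f):
    return lambda x: -f(x)                     # maximizing f equals minimizing -f

def as_minimization_of_target(f, f_target):
    return lambda x: abs(f(x) - f_target)      # abs() as one possible distance measure rho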

In this definition of an optimization task it is sufficient for the codomain to be totally ordered, so that the inequality in Eq. 2.4 can be applied. Throughout this book, we will always deal with the codomain \(\mathbb{R}\) only. Moreover, we will not explicitly deal with the handling of constraints (e.g., as defined by Eq. 2.2), and refer the interested reader to Sect. 2.3, where literature references point to state-of-the-art techniques in constraint handling. A special case of constraints are so-called box constraints, as defined below:

$$\displaystyle\begin{array}{rcl} g_{1}(\mathbf{x})& =& \mathbf{l} -\mathbf{x} \leq \mathbf{0}\mbox{ where }\mathbf{l} = {(l_{1},\ldots,l_{n})}^{T} \in {\mathbb{R}}^{n} \\ g_{2}(\mathbf{x})& =& \mathbf{x} -\mathbf{u} \leq \mathbf{0}\mbox{ where }\mathbf{u} = {(u_{1},\ldots,u_{n})}^{T} \in {\mathbb{R}}^{n}{}\end{array}$$
(2.5)

Vectors l and u are called lower and upper bounds, respectively. Box constraints restrict the search space to the hyperrectangle \([l_{1},u_{1}] \times \ldots \times [l_{n},u_{n}]\) and are taken into account for the implementation of algorithms described in this book.
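In an implementation, box constraints are typically enforced by repairing candidate solutions; clipping to the hyperrectangle is one common choice (a minimal numpy sketch; the repair-by-clipping strategy is our illustration, not a prescription of this book):

import numpy as np

def clip_to_box(x, l, u):
    # Repair x so that l <= x <= u holds component-wise.
    return np.minimum(np.maximum(x, l), u)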

In the field of evolutionary algorithms, the vector x is often called the decision vector (and its parameters decision parameters), and its objective function value f(x) is also called the fitness value.

2.1.2 Evolution Strategies as a Specialization of Evolutionary Algorithms

Following [8] and [58], evolution strategies are described here as a specialization of evolutionary algorithms. The general framework of an evolutionary algorithm is presented in Algorithm 2.1.

Algorithm 2.1 General outline of an evolutionary algorithm

Initialization

repeat

    Recombination

    Mutation

    Evaluation

    Selection

until Termination criterion fulfilled

During initialization, the first generation, consisting of one or more individuals, is created, and the fitness of its individuals is evaluated. After initialization, the so-called evolution loop is entered, which consists of the operators recombination, mutation, evaluation and selection. Recombination creates new individuals, also called offspring, from the parent population. Two major types of recombination, dominant and intermediate recombination, are typically distinguished: In dominant recombination, each property of the offspring is inherited from a single parent individual, i.e., this parent's property dominates the corresponding properties of the other parents. In intermediate recombination, the properties of all parents are taken into account, such that, e.g., in the simplest case, their mean value is used.

The mutation operator provides the main source of variation of offspring in an evolution strategy. Based on sampling random variables, properties of individuals are modified. The newly created individuals are then evaluated, i.e., their fitness values are calculated. Based on these fitness values, selection identifies a subset of individuals which form the new population which is used in the next iteration of the evolution loop. The loop is terminated based on a termination criterion set by the user, such as reaching a maximum number of evaluations, reaching a target fitness value, or stagnation of the search process.

According to [58], evolution strategies as a specific instantiation of evolutionary algorithms are characterized by the following four properties:

  • Selection of individuals for recombination is unbiased.

  • Selection is a deterministic process.

  • Mutation operators are parameterized and therefore they can change their properties during optimization.

  • Individuals consist of decision parameters as well as strategy parameters.

The generic framework of an evolutionary algorithm then specializes into a \((\mu /\rho,\kappa,\lambda )\)-ES, as described in detail in Algorithm 2.2. Recombination and mutation are summarized here under the term variation. In addition to the description given in [58] (Algorithm 3), the variation operator of a (μ∕ρ,κ,λ)-ES is defined here by means of a parameter set \(\Psi _{V }\), and the evaluation operator is explicitly mentioned. The population at generation t ≥ 0 is denoted \({P}^{(t)}\) and is a set of individuals. An individual \(p \in {P}^{(t)}\) is a tuple \((\mathbf{x},\Psi )\) with \(\mathbf{x} \in M \subseteq {\mathbb{R}}^{n}\), where M is defined as in Eq. 2.3. The sets \(\Psi \) and \(\Psi _{V }\) are arbitrary finite sets representing the strategy parameters. Since these parameters are modified internally during execution of the algorithm, they are called endogenous strategy parameters. The number of parent individuals is denoted by μ, the number of offspring individuals by λ, and ρ denotes the number of parents taken into account for generating a single offspring by means of recombination. For these parameters, \(\mu,\rho,\lambda \in \mathbb{N}\) and ρ ≤ μ hold.

\(\kappa \in \mathbb{N} \cup \{\infty \}\) represents the largest age which can be reached by any individual in the population. In contrast to the endogenous parameters, μ, ρ, λ and κ are set by the user of the algorithm, which is why they are called exogenous strategy parameters.

The setting of κ has a direct impact on the selection operator. Usually, either κ = 1 (one generation maximum lifetime) or κ = ∞ (infinite maximum lifetime) is used. The former case is also called comma-selection, the latter plus-selection. Using the standard notation of evolution strategies, this is expressed as \((\mu /\rho,\lambda )\)-ES and (μ∕ρ + λ)-ES, so that κ is not explicitly stated any more. Using κ < ∞ requires the condition λ ≥ μ to hold.

Algorithm 2.2 (μ∕ρ,κ,λ)-ES

Initialization of P (0) with μ individuals

\(\forall p \in {P}^{(0)}: p.\Psi.Age \leftarrow 1\), \(p.f \leftarrow f(p.\mathbf{x})\)

t ← 0

repeat

    \({Q}^{(t)} \leftarrow \varnothing \)

    for i = 1 → λ do

      Sample ρ parents \(p_{1},\ldots,p_{\rho } \in {P}^{(t)}\) uniformly at random

      \(q \leftarrow \mbox{ Variation}(p_{1},\ldots,p_{\rho },\Psi _{V })\)

      \(q.\Psi.Age \leftarrow 0\), \(q.f \leftarrow f(q.\mathbf{x})\)

      \({Q}^{(t)} \leftarrow {Q}^{(t)} \cup \{q\}\)

    end for

    \({P}^{(t+1)} \leftarrow \) Selection of the μ best individuals from \({Q}^{(t)} \cup \{p \in {P}^{(t)}: p.\Psi.Age < \kappa \}\)

    Update \(\Psi _{V }\)

    \(\forall p \in {P}^{(t+1)}: p.\Psi.Age \leftarrow p.\Psi.Age + 1\)

    tt + 1

until Termination criterion fulfilled
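Algorithm 2.2 translates almost line by line into code. The following Python sketch uses a placeholder variation operator (intermediate recombination plus isotropic mutation with a fixed σ) purely for illustration; a concrete strategy would substitute the operators of Sects. 2.2.1 and 2.2.2, including the update of \(\Psi _{V }\), which is omitted here:

import numpy as np

def mu_rho_kappa_lambda_es(f, x0, mu=15, rho=2, kappa=np.inf, lam=105,
                           sigma=1.0, max_evals=10000):
    n = len(x0)
    pop = [{'x': x0 + sigma * np.random.randn(n)} for _ in range(mu)]
    for p in pop:
        p['f'], p['age'] = f(p['x']), 1
    evals = mu
    while evals < max_evals:
        offspring = []
        for _ in range(lam):
            parents = [pop[i] for i in np.random.randint(mu, size=rho)]  # unbiased parent sampling
            x = np.mean([p['x'] for p in parents], axis=0)               # intermediate recombination
            x = x + sigma * np.random.randn(n)                           # placeholder mutation
            offspring.append({'x': x, 'f': f(x), 'age': 0})
            evals += 1
        survivors = offspring + [p for p in pop if p['age'] < kappa]     # kappa selects comma/plus
        pop = sorted(survivors, key=lambda p: p['f'])[:mu]               # deterministic selection
        for p in pop:
            p['age'] += 1
    return min(pop, key=lambda p: p['f'])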

2.1.3 Mutation in \({\mathbb{R}}^{n}\)

2.1.3.1 The Multivariate Normal Distribution

In [58], three guiding principles for the design of mutation operators are introduced, namely:

  • Any point of the search space needs to be attainable with probability strictly larger than zero by means of a finite number of applications of mutation.

  • Mutation should be unbiased, which can be achieved by using a maximum entropy distribution.

  • The operator is parameterized, such that the extent of variation can be controlled.

In \({\mathbb{R}}^{n}\), these requirements are fulfilled by a multivariate normal distribution. An n-dimensional random vector X is multivariate normally distributed with expectation \(\bar{\mathbf{x}} \in {\mathbb{R}}^{n}\) and positive definite covariance matrix \(\mathbf{C} \in {\mathbb{R}}^{n\times n}\) if its probability density function is defined according to:

$$\displaystyle{ f_{\mathbf{X}}(\mathbf{x}) = \frac{1} {{(2\pi )}^{\frac{n} {2} }{(\det \mathbf{C})}^{\frac{1} {2} }} \exp \left (-\frac{1} {2}{(\mathbf{x} -\bar{\mathbf{x}})}^{T}{\mathbf{C}}^{-1}(\mathbf{x} -\bar{\mathbf{x}})\right ) }$$
(2.6)

(see p. 86 in [28]). In short notation, this is typically written as \(\mathbf{X} \sim N(\bar{\mathbf{x}},\mathbf{C})\), where \(N(\bar{\mathbf{x}},\mathbf{C})\) denotes the multivariate normal distribution in its general form. In mathematical equations, \(N(\bar{\mathbf{x}},\mathbf{C})\) is sometimes used like a vector, meaning a vector which is actually sampled according to the distribution given. In other words, instead of writing x′ =x +X where XN(0,C), it is also possible to simply write x′ =x + N(0,C).

Due to the positive definiteness of the covariance matrix C, the following eigendecomposition exists (Theorem 1a in [58]):

$$\displaystyle{ \mathbf{C} = \mathbf{B}{\mathbf{D}}^{2}{\mathbf{B}}^{T} }$$
(2.7)

Here, B denotes an orthogonal matrix, the columns of which are the eigenvectors of C. In [29], \(N(\bar{\mathbf{x}},\mathbf{C})\) is reduced to the distribution N(0,I) by means of the eigendecomposition given in Eq. 2.7, according to:

$$\displaystyle{ N(\bar{\mathbf{x}},\mathbf{C}) \sim \bar{\mathbf{x}} + \mathbf{B}\mathbf{D}N(\mathbf{0},\mathbf{I}) }$$
(2.8)
Fig. 2.1 Mutation ellipsoids representing N(0,I), \(N(\mathbf{0},\mbox{ diag}(\boldsymbol{{\delta }}^{2}))\) and N(0,C) (from left to right)

In the field of evolution strategies, the three special cases N(0,I), \(N(\mathbf{0},\mbox{ diag}(\boldsymbol{{\delta }}^{2}))\) and N(0,C) are used for the definition of the most common algorithms. Figure 2.1 provides a sketch of the corresponding mutation ellipsoids, i.e., isolines of the probability density functions, embedded in a hypothetical two-dimensional fitness function.
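Equation 2.8 is also the standard recipe for sampling from \(N(\bar{\mathbf{x}},\mathbf{C})\) in an implementation; a minimal numpy sketch:

import numpy as np

def sample_mvn(x_mean, C, rng=None):
    rng = rng if rng is not None else np.random.default_rng()
    # Eigendecomposition C = B D^2 B^T (Eq. 2.7); eigh returns the eigenvalues D^2.
    eigvals, B = np.linalg.eigh(C)
    D = np.sqrt(eigvals)
    # Eq. 2.8: N(x_mean, C) ~ x_mean + B D N(0, I)
    return x_mean + B @ (D * rng.standard_normal(len(x_mean)))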

The simplest case of generating the mutation x′ from x is based on setting B = I and D = δ ⋅I, with a global step size \(\delta \in {\mathbb{R}}^{+}\), for the matrices B and D used in Eq. 2.8:

$$\displaystyle{ \mathbf{x}\prime = \mathbf{x} + \delta \cdot N(\mathbf{0},\mathbf{I}) }$$
(2.9)

This corresponds to spheres whose radius is determined by δ, as indicated in the left part of Fig. 2.1. This case of an offspring distribution is called isotropic.

To turn the spheres into anisotropic ellipsoids with main axes parallel to the coordinate axes, as shown in the middle of Fig. 2.1, matrix D in Eq. 2.8 must be turned into a diagonal matrix \(\mathbf{D} = \mbox{ diag}(\boldsymbol{\delta })\) with a vector \(\boldsymbol{\delta } = {(\delta _{1},\ldots,\delta _{n})}^{T} \in {\mathbb{R}}^{n}\) of different entries on the main diagonal. As in the previous case, B is the identity matrix:

$$\displaystyle\begin{array}{rcl} \mathbf{x}\prime& =& \mathbf{x} + \mathbf{I}\mbox{ diag}(\boldsymbol{\delta })N(\mathbf{0},\mathbf{I}) \\ & =& \mathbf{x} + N(\mathbf{0},\mbox{ diag}(\boldsymbol{{\delta }}^{2})){}\end{array}$$
(2.10)

The length ratios of the main axes of the mutation ellipsoids depend on the ratios between corresponding components of the vector δ. A rotation of the mutation hyperellipsoids with respect to the coordinate axes, as shown in the rightmost part of Fig. 2.1, is achieved by using a covariance matrix C with off-diagonal entries different from zero. This case is denoted by the term correlated mutation. In contrast to the two previous cases, the matrix B is no longer the identity matrix:

$$\displaystyle\begin{array}{rcl} \mathbf{x}\prime& =& \mathbf{x} + \mathbf{B}\mbox{ diag}(\delta )N(\mathbf{0},\mathbf{I}) \\ & =& \mathbf{x} + \mathbf{B}N(\mathbf{0},\mbox{ diag}(\boldsymbol{{\delta }}^{2})) \\ & =& \mathbf{x} + N(\mathbf{0},\mathbf{C}) {}\end{array}$$
(2.11)

The choice of one of the three cases explained above has a direct impact on the complexity of the endogenous parameters controlling the multivariate normal distribution. In general, if n denotes the dimensionality of the search space, the number of endogenous strategy parameters in the case of Eq. 2.9 is O(1), i.e., constant. In the case of Eq. 2.10, a vector of O(n) endogenous parameters is required, and adaptation of an arbitrary covariance matrix, i.e., a symmetric n × n matrix, according to Eq. 2.11 requires \(O({n}^{2})\) endogenous parameters.

For defining algorithm DR3 in Sect. 2.2.1 and for all algorithms based on the CMA-ES, the so-called line distribution [31] is of special interest: For \(\mathbf{u} \in {\mathbb{R}}^{n}\), the distribution \(N(\mathbf{0},\mathbf{u}{\mathbf{u}}^{T})\) is a multivariate normal distribution with the variance \(\|{\mathbf{u}\|}^{2}\) in the direction of the vector u. It is the normal distribution with highest probability of generating u.
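Sampling from the line distribution requires no matrix at all: a sample is the vector u scaled by a scalar standard normal variate, as in this brief sketch:

import numpy as np

u = np.array([2.0, -1.0, 0.5])
z = np.random.standard_normal()   # scalar N(0, 1)
sample = z * u                    # distributed as N(0, u u^T): all mass on the line through u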

2.1.3.2 Relationship Between Covariance Matrix and Hessian

In the previous section, using a multivariate normal distribution was motivated by certain requirements which should hold for the mutation operator. In this section, we will clarify why it is useful to use an arbitrary covariance matrix, as in Eq. 2.11, for adaptation.

Any twice differentiable function \(f: {\mathbb{R}}^{n} \rightarrow \mathbb{R}\) can be approximated by a Taylor series expansion in the vicinity of a position \(\mathbf{\tilde{x}} \in {\mathbb{R}}^{n}\). Cutting off the Taylor series after the quadratic term, the following approximation is obtained:

$$\displaystyle{ f(\mathbf{x}) \approx f(\mathbf{\tilde{x}}) + {(\mathbf{x} -\mathbf{\tilde{x}})}^{T}\nabla f(\mathbf{\tilde{x}}) + \frac{1} {2}{(\mathbf{x} -\mathbf{\tilde{x}})}^{T}{\nabla }^{2}f(\mathbf{\tilde{x}})(\mathbf{x} -\mathbf{\tilde{x}}) }$$
(2.12)

Here, \(\nabla f(\mathbf{\tilde{x}})\) denotes the gradient, and \({\nabla }^{2}f(\mathbf{\tilde{x}})\) is the symmetric, positive definite Hessian, denoted by H in the following. For a quadratic function f, the Taylor series expansion is exact, and H contains information about the shape of the isolines of f. In general, these are ellipsoids, as shown in the rightmost part of Fig. 2.1. Hansen describes the relationship between the Hessian H and the covariance matrix C of a distribution N(0,C) informally [29]. It is argued that using \(\mathbf{C} ={ \mathbf{H}}^{-1}\) for optimizing a quadratic function is equivalent to using C =I for optimizing an isotropic function, such as the sphere function \(f(\mathbf{x}) = \frac{1} {2}{\mathbf{x}}^{T}\mathbf{x}\).

In other words: Adapting an arbitrary covariance matrix simplifies the optimization by transforming the objective function into an isotropic function. A more formal description of this topic can be found in Rudolph’s work, e.g., in the section Advanced Adaptation Techniques in \({\mathbb{R}}^{n}\) in [58], and also in [55].
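The equivalence can be checked numerically: with \(\mathbf{C} ={ \mathbf{H}}^{-1}\), a sampled step y satisfies \(\frac{1} {2}{\mathbf{y}}^{T}\mathbf{H}\mathbf{y} = \frac{1} {2}{\mathbf{z}}^{T}\mathbf{z}\) for the underlying isotropic sample z. A small sketch (the concrete H is an arbitrary example of ours):

import numpy as np

H = np.array([[4.0, 1.0], [1.0, 2.0]])    # example Hessian (symmetric, positive definite)
f = lambda x: 0.5 * x @ H @ x

A = np.linalg.cholesky(np.linalg.inv(H))  # C = H^{-1} = A A^T
z = np.random.standard_normal(2)
y = A @ z                                 # y ~ N(0, H^{-1})
print(f(y), 0.5 * z @ z)                  # identical: f behaves like the sphere function on z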

2.2 Algorithms

This section contains descriptions of the key variants of evolution strategies in chronological order of their publication. On a high level, we differentiate between the two main Sects. 2.2.1 and 2.2.2, the first covering the time frame from 1964 until 1996.

This first Sect. 2.2.1 describes five main algorithms, namely, the (1+1)-ES as the historically first version of an evolution strategy and the (μ,λ)-MSC-ES (in [58] also called CORR-ES) as the first evolution strategy which adapts an arbitrary covariance matrix (see Sect. 2.1.3 for an explanation). The first derandomized algorithm variants, DR1, DR2, and DR3, complete this selection of older variants of evolution strategies. Their choice is motivated by the fact that they are derandomization steps towards the CMA-ES (see also [63]).

The second main Sect. 2.2.2 describes modern evolution strategies, a term which is used in this book to denote the CMA-ES and algorithms based on it. This distinction might seem somewhat arbitrary, but in fact the development of the CMA-ES defined a turning point in the history of evolution strategies, for two main reasons: First, the CMA-ES is the first algorithm which adapts a covariance matrix in a completely derandomized way. Second, the CMA-ES is seen by many authors as the state of the art in evolution strategies (e.g., [6, 13, 15, 26, 35, 58, 63], and [66]).

2.2.1 From the (1+1)-ES to the CMA-ES

2.2.1.1 (1+1)-ES

The foundation of the first evolution strategy was laid in the 1960s at the Technical University of Berlin by three students, namely Hans-Paul Schwefel, Ingo Rechenberg, and Peter Bienert. As described in [8] or [58], standard methods for solving black-box optimization problems, such as gradient-based methods (see [44]), were not able to deliver satisfactory solution quality for certain optimization problems in engineering applications. Inspired by lectures about biological evolution, they aimed at developing a solution method based on principles of variation and selection. In its first version, a very simple evolution loop without any endogenous parameters was used [59]. This algorithm generates a single offspring \(\mathbf{x}\prime = \mathbf{x} + {(N_{1}(0,\sigma ),\ldots,N_{n}(0,\sigma ))}^{T} = \mathbf{x} + \sigma \cdot N(\mathbf{0},\mathbf{I})\) from a single parent individual \(\mathbf{x} \in {\mathbb{R}}^{n}\). If the offspring performs better than its parent (in terms of fitness), it becomes the new parent. Otherwise, the parent remains. The standard deviation σ of the normal distribution was a fixed scalar value.

According to [53], by pure luck the value of σ was chosen in a way that made this first approach towards a (1+1)-ES successful. Only later on was the necessary step size adaptation added to the algorithm [52]. Based on two fitness functions, the so-called corridor model and the so-called sphere model, a theoretical result was derived for introducing step size adaptation: Maximum convergence velocity (i.e., speed of progress of the optimization) is achieved when about 1/5 of all mutations are successful, i.e., improvements over their parent. This insight led to the development of the so-called 1/5-success rule for step size adaptation. If about 1/5 of all mutations are successful, the step size is optimal and no adaptation is required. If the success rate falls below 1/5, the step size needs to be reduced; if it grows above 1/5, the step size needs to be increased. To obtain the new step size \(\sigma \prime = \sigma \cdot {c}^{\pm 1}\), the previous σ is decreased or increased, respectively, by multiplication with or division by a constant c with 0.817 ≤ c ≤ 1. The recommended value of c = 0.817 was derived by Schwefel from theoretical arguments about the speed of step size adaptation [61]. The step size adaptation according to the above rule is applied every n iterations of the algorithm, and the success rate \(p_{S}\) is measured over a sliding window of the last 10 ⋅ n mutations [8]. The pseudocode of the (1+1)-ES according to [8] is shown in Algorithm 2.3.

Algorithm 2.3 (1+1)-ES

\(P_{0} \leftarrow \{\mathbf{x}\}\)

\(\phi \leftarrow f(\mathbf{x})\)

\(p_{S} \leftarrow 0\)

initialize archive A for storing successful mutations

\(t \leftarrow 0\)

repeat

    \(t \leftarrow t + 1\)

    \(\mathbf{x}\prime \leftarrow \mathbf{x} + \sigma \cdot \mathbf{N}(\mathbf{0},\mathbf{I})\)

    \(\phi \prime \leftarrow f(\mathbf{x}\prime)\)

    if  \(\phi \prime < \phi \)  then

      \(\mathbf{x} \leftarrow \mathbf{x}\prime\)

      \(\phi \leftarrow \phi \prime\)

      store success in A

    else

      store failure in A

    end if

    \(P_{t} \leftarrow \{\mathbf{x}\}\)

    if \(t\mod n = 0\) then

      get \(\#\mathit{successes}\ and\ \#\mathit{failures}\) from at most 10n entries in A

      \(p_{S} = \frac{\#\mathit{successes}} {\#\mathit{successes}+\#\mathit{failures}}\)

      \(\sigma \leftarrow \left \{\begin{array}{l} \sigma \cdot c\mbox{ if }p_{S} < 1/5 \\ \sigma /c\mbox{ if }p_{S} > 1/5 \\ \sigma \mbox{ if }p_{S} = 1/5\end{array} \right.\)

    end if

until termination criterion fulfilled
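A compact Python rendition of Algorithm 2.3 follows; the initial step size and the evaluation budget are implementation choices of ours:

import numpy as np
from collections import deque

def one_plus_one_es(f, x, sigma=1.0, c=0.817, max_evals=10000):
    n = len(x)
    phi = f(x)
    A = deque(maxlen=10 * n)                 # sliding window of success flags
    for t in range(1, max_evals):
        x_new = x + sigma * np.random.standard_normal(n)
        phi_new = f(x_new)
        success = phi_new < phi
        if success:
            x, phi = x_new, phi_new
        A.append(success)
        if t % n == 0:                       # apply the 1/5-success rule every n iterations
            p_s = sum(A) / len(A)
            if p_s < 1 / 5:
                sigma *= c                   # too few successes: shrink the steps
            elif p_s > 1 / 5:
                sigma /= c                   # too many successes: enlarge the steps
    return x, phi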

2.2.1.2 (μ,λ)-MSC-ES

The (μ,λ)-MSC-ES was the very first evolution strategy capable of adapting an arbitrary covariance matrix. The algorithm was developed by Schwefel [62] and is also called (μ,λ)-CORR-ES [58]. In this strategy, the covariance matrix is obtained as a product of n(n − 1)∕2 rotation matrices, where a single rotation matrix R ij for a rotation angle \(\alpha _{\mathit{ij}}\) between axis i and axis j, with \(i,j \in \{1,\ldots,n\}\) and i ≠ j, is given by an identity matrix extended by the entries \(R(i,i) = R(j,j) =\cos \alpha _{\mathit{ij}}\) and \(R(i,j) = -R(j,i) = -\sin \alpha _{\mathit{ij}}\).
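The product of these elementary rotations is easy to form explicitly; a minimal numpy sketch of the matrix used in the mutation step of Algorithm 2.4:

import numpy as np

def rotation_product(alpha):
    # alpha: (n, n) array holding the angles alpha[i, j] for i < j.
    n = alpha.shape[0]
    T = np.eye(n)
    for i in range(n - 1):
        for j in range(i + 1, n):
            R = np.eye(n)
            R[i, i] = R[j, j] = np.cos(alpha[i, j])
            R[i, j] = -np.sin(alpha[i, j])   # R(i,j) = -R(j,i) = -sin(alpha_ij)
            R[j, i] = np.sin(alpha[i, j])
            T = T @ R
    return T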

Indeed, this method is able to generate arbitrary correlated mutations, as proven by Rudolph [55]. In the framework of the (μ,λ)-MSC-ES, endogenous strategy parameters are modified by means of the so-called self-adaptation principle. For self-adaptation, an individual consists not only of the decision parameters x, but also contains an additional vector \(\sigma \in \mathbb{R}_{+}^{n}\) of step sizes and a vector \(\alpha \in (-\pi,\pi {]}^{n(n-1)/2}\) of rotation angles. The underlying idea of mutative step size adaptation is based on the assumption that individuals with good settings of the strategy parameters tend to generate good offspring, such that the good strategy parameters survive selection. Recombination of decision parameters and endogenous strategy parameters is performed through global intermediary recombination, i.e., by averaging over all of the μ parents. Concerning the exogenous strategy parameters, the local and global learning rates τ and τ′ need to be set. Following [8], after Schwefel [61], the settings \(\tau = \frac{1} {\sqrt{2\sqrt{n}}}\) and \(\tau \prime = \frac{1} {2\sqrt{n}}\) are recommended, depending only on the problem dimensionality n (in Algorithm 2.4, β denotes the fixed mutation parameter for the rotation angles). Pseudocode of the (μ,λ)-MSC-ES is provided in Algorithm 2.4. Concerning the population sizes, we are using μ = 15 and \(\lambda = 7 \cdot \mu = 105\) throughout this book, close to the recommendations in [63].

Algorithm 2.4 (μ,λ)-MSC-ES

initialize population

\({P}^{(0)} \leftarrow \{(\mathbf{x}_{1},\sigma _{1},\alpha _{1}),\ldots,(\mathbf{x}_{\mu },\sigma _{\mu },\alpha _{\mu })\}\)

\(t \leftarrow 0\)

repeat

    \(t \leftarrow t + 1\)

    // recombination

    \(\bar{x} \leftarrow \frac{1} {\mu }\sum _{i=1}^{\mu }\mathbf{x}_{ i}\)

    \(\bar{\sigma } \leftarrow \frac{1} {\mu }\sum _{i=1}^{\mu }\sigma _{ i}\)

    \(\bar{\alpha } \leftarrow \frac{1} {\mu }\sum _{i=1}^{\mu }\alpha _{ i}\)

    for \(i = 1 \rightarrow \lambda \) do

      // mutation

      \(\eta \leftarrow \tau \prime \cdot N(0,1)\)

      \(\sigma _{i} \leftarrow \bar{ \sigma } \cdot \exp \left (\eta + \tau \cdot N(\mathbf{0},\mathbf{I})\right )\)

      \(\alpha _{i} \leftarrow \bar{ \alpha } + \beta \cdot N(\mathbf{0},\mathbf{I})\)

      \(\mathbf{C} \leftarrow \prod _{k=1}^{n-1}\prod _{l=k+1}^{n}R_{\mathit{kl}}\)

      \(\mathbf{x}_{i} \leftarrow \mathbf{\bar{x}} + \mathbf{C} \cdot \sigma _{i} \cdot N(\mathbf{0},\mathbf{I})\)

      // evaluation

      \(\phi _{i} \leftarrow f(\mathbf{x}_{i})\)

    end for

    // selection

    P (t) are the μ best \((\mathbf{x}_{i},\sigma _{i},\alpha _{i})\) from 1 ≤ i ≤ λ

until termination criterion fulfilled

2.2.1.3 DR1

The (μ,λ)-MSC-ES as described in the previous section is based on mutative self-adaptation of the individual step sizes. However, as Ostermeier et al. [47] claim, self-adaptation of individual step sizes is not possible in the case of small population sizes, and they identify two key reasons: First, a successful mutation of the decision parameters is not necessarily caused by a good step size, but can also be due to an advantageous instantiation of the normally distributed random vector (i.e., a lucky sample). Second, there is a conflict between the goals of maintaining a large variance of step sizes within one generation and avoiding too large fluctuations of step sizes between successive generations. The first derandomized evolution strategy, abbreviated DR1, solves the first problem by using the length of the most successful mutation step within one generation (i.e., the one that yielded the best offspring) for controlling step size adaptation [47]. The second problem is solved by using a factor \(\xi \in \{\frac{5} {7}, \frac{7} {5}\}\) to provide sufficient variance of step sizes within one generation, and by damping this factor through an exponent β with 0 < β < 1 for step size adaptation, to reduce undesired fluctuations [47]. An offspring x′ of a parent x is then generated as follows:

$$\displaystyle{ \mathbf{x}\prime = \mathbf{x} + \xi \cdot \delta \otimes \mathbf{z}\mbox{ where }\mathbf{z} = N(\mathbf{0},\mathbf{I}) }$$

Adaptation of step sizes δ is based on the most successful z (i.e., the normally distributed vector sample which generated the best offspring during this generation), which is first transformed as follows:

$$\displaystyle{ \xi _{\mathbf{z}} ={ \left (\exp \left (\vert z_{1}\vert -\sqrt{2/\pi }\right ),\ldots,\exp \left (\vert z_{n}\vert -\sqrt{2/\pi }\right )\right )}^{T} }$$

Combined with the exponents β and \(\beta _{\mathit{scal}} \in \mathbb{R}\) for damping the adaptation, as well as ξ and \(\xi _{\mathbf{z}}\) of the best mutation, the new step sizes δ′ are obtained as follows:

$$\displaystyle{ \delta \prime ={ \left (\xi \right )}^{\beta } \cdot {\left (\xi _{\mathbf{ z}}\right )}^{\beta _{\mathit{scal}} } \otimes \delta }$$

Pseudocode of the DR1 evolution strategy is given in Algorithm 2.5. Concerning the offspring population size λ, a constant setting of λ = 10, independent of the dimensionality n, was used in [47]. The DR1 algorithm is based on a single parent individual (μ = 1) and is therefore sometimes also denoted as (1,10)-DR1-ES. Ostermeier et al. [47] recommend the following values for the exponents β and \(\beta _{\mathit{scal}}\):

$$\displaystyle\begin{array}{rcl} \beta & =& \sqrt{1/n} {}\\ \beta _{\mathit{scal}}& =& 1/n {}\\ \end{array}$$

Algorithm 2.5 DR1

initialize x, \(\boldsymbol{\delta } \leftarrow {(1,\ldots,1)}^{T}\)

t ← 0

repeat

    tt + 1

    for i = 1 → λ do

      \(\mathbf{z}_{i} \leftarrow N(\mathbf{0},\mathbf{I})\)

      \(\mathbf{x}_{i} \leftarrow \mathbf{x} + \xi _{i} \cdot \boldsymbol{ \delta } \otimes \mathbf{z}_{i}\) where \(P(\xi _{i} = \frac{5} {7}) = P(\xi _{i} = \frac{7} {5}) = \frac{1} {2}\)

      \(\phi _{i} \leftarrow f(\mathbf{x}_{i})\)

    end for

    \(\mathit{sel} \leftarrow i\) with best value of ϕ i

    \(\mathbf{x} \leftarrow \mathbf{x}_{\mathit{sel}}\)

    \(\xi _{\mathbf{z}_{\mathit{sel}}} ={ \left (\exp \left (\vert z_{\mathit{sel}_{1}}\vert -\sqrt{2/\pi }\right ),\ldots,\exp \left (\vert z_{\mathit{sel}_{n}}\vert -\sqrt{2/\pi }\right )\right )}^{T}\)

    \(\delta \leftarrow {\left (\xi _{\mathit{sel}}\right )}^{\beta }{\left (\xi _{\mathbf{z}_{\mathit{sel}}}\right )}^{\beta _{\mathit{scal}}} \otimes \delta \)

until termination criterion fulfilled
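A direct Python transcription of Algorithm 2.5 (initialization of x and the generation budget are left to the caller):

import numpy as np

def dr1(f, x, lam=10, max_gens=1000):
    n = len(x)
    beta, beta_scal = np.sqrt(1 / n), 1 / n
    delta = np.ones(n)
    for _ in range(max_gens):
        xi = np.random.choice([5 / 7, 7 / 5], size=lam)
        z = np.random.standard_normal((lam, n))
        X = x + xi[:, None] * delta * z                    # component-wise mutation
        phi = np.array([f(row) for row in X])
        sel = np.argmin(phi)
        x = X[sel]
        xi_z = np.exp(np.abs(z[sel]) - np.sqrt(2 / np.pi))
        delta = xi[sel] ** beta * xi_z ** beta_scal * delta
    return x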

2.2.1.4 DR2

The DR2 evolution strategy represents the next step of derandomization for evolution strategies [48]. The creation of an offspring by mutation is parameterized by a global step size δ and local step sizes \(\boldsymbol{\delta }_{\mathit{scal}} \in {\mathbb{R}}^{n}\):

$$\displaystyle{ \mathbf{x}\prime = \mathbf{x} + \delta \cdot \boldsymbol{ \delta }_{\mathit{scal}} \otimes \mathbf{z}\mbox{ where }\mathbf{z} = N(\mathbf{0},\mathbf{I}) }$$

As in DR1, adaptation of step sizes is based on the most successful z. However, in addition to information about the most successful mutation of the current generation, the most successful mutation steps of previous generations are also taken into account, thereby accumulating information over generations. The accumulation takes place in a vector \(\boldsymbol{\zeta } \in {\mathbb{R}}^{n}\), using a factor c ∈ (0,1] to control the weight of previous generations in contrast to the current one:

$$\displaystyle{ \boldsymbol{\zeta }\prime = (1 - c) \cdot \boldsymbol{ \zeta } + c \cdot \mathbf{z}_{\mathit{sel}} }$$
(2.13)

Adaptation of step sizes δ and \(\boldsymbol{\delta }_{\mathit{scal}}\) is then based on the updated mutation path \(\boldsymbol{\zeta }\prime\):

$$\displaystyle\begin{array}{rcl} \delta \prime& =& \delta \cdot {\left (\exp \left ( \frac{\|\boldsymbol{\zeta }\prime\|} {\sqrt{n}\sqrt{ \frac{c} {2-c}}} - 1 + \frac{1} {5n}\right )\right )}^{\beta } {}\\ \boldsymbol{\delta }_{\mathit{scal}_{i}}\prime& =& \boldsymbol{\delta }_{\mathit{scal}_{i}} \cdot {\left ( \frac{\vert \boldsymbol{\zeta }_{i}\prime\vert } {\sqrt{ \frac{c} {2-c}}} + \frac{7} {20}\right )}^{\beta _{\mathit{scal}} }\forall i \in \{1,\ldots,n\} {}\\ \end{array}$$

Standard settings for the exponents β and \(\beta _{\mathit{scal}}\) as well as the parameter c are as follows:

$$\displaystyle\begin{array}{rcl} \beta & =& \sqrt{1/n} {}\\ \beta _{\mathit{scal}}& =& 1/n {}\\ c& =& \sqrt{1/n} {}\\ \end{array}$$

The pseudocode of the DR2 evolution strategy is given in Algorithm 2.6.

Algorithm 2.6 DR2

initialize x, \(\boldsymbol{\zeta } \leftarrow \mathbf{0}\), \(\delta \leftarrow 1\), \(\boldsymbol{\delta }_{\mathit{scal}} \leftarrow {(1,\ldots,1)}^{T}\)

\(t \leftarrow 0\)

repeat

    \(t \leftarrow t + 1\)

    for i = 1 → λ do

      \(\mathbf{z}_{i} \leftarrow N(\mathbf{0},\mathbf{I})\)

      \(\mathbf{x}_{i} \leftarrow \mathbf{x} + \delta \cdot \boldsymbol{ \delta }_{\mathit{scal}} \otimes \mathbf{z}_{i}\)

      \(\phi _{i} \leftarrow f(\mathbf{x}_{i})\)

    end for

    \(\mathit{sel} \leftarrow i\) with best value of ϕ i

    \(\boldsymbol{\zeta }\prime \leftarrow (1 - c) \cdot \boldsymbol{ \zeta } + c \cdot \mathbf{z}_{\mathit{sel}}\)

    \(\delta \prime \leftarrow \delta \cdot {\left (\exp \left ( \frac{\|\boldsymbol{\zeta }\prime\|} {\sqrt{n}\cdot \sqrt{ \frac{c} {2-c}}} - 1 + \frac{1} {5n}\right )\right )}^{\beta }\)

    \(\boldsymbol{\delta }\prime_{\mathit{scal}} \leftarrow \boldsymbol{ \delta }_{\mathit{scal}} \otimes {\left ( \frac{\vert \boldsymbol{\zeta }\prime_{i}\vert } {\sqrt{ \frac{c} {2-c}}} + \frac{7} {20}\right )}^{\beta _{\mathit{scal}}}\)

    \(\mathbf{x} \leftarrow \mathbf{x}_{\mathit{sel}}\)

    \(\boldsymbol{\zeta } \leftarrow \boldsymbol{ \zeta }\prime\)

    \(\delta \leftarrow \delta \prime\)

    \(\boldsymbol{\delta }_{\mathit{scal}} \leftarrow \boldsymbol{ \delta }\prime_{\mathit{scal}}\)

until termination criterion fulfilled
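The corresponding Python sketch of Algorithm 2.6, with the accumulated mutation path ζ at its core:

import numpy as np

def dr2(f, x, lam=10, max_gens=1000):
    n = len(x)
    beta, beta_scal, c = np.sqrt(1 / n), 1 / n, np.sqrt(1 / n)
    zeta = np.zeros(n)
    delta, delta_scal = 1.0, np.ones(n)
    norm = np.sqrt(c / (2 - c))                     # expected scale of the accumulated path
    for _ in range(max_gens):
        z = np.random.standard_normal((lam, n))
        X = x + delta * delta_scal * z
        phi = np.array([f(row) for row in X])
        sel = np.argmin(phi)
        zeta = (1 - c) * zeta + c * z[sel]          # Eq. 2.13
        delta *= np.exp(np.linalg.norm(zeta) / (np.sqrt(n) * norm) - 1 + 1 / (5 * n)) ** beta
        delta_scal *= (np.abs(zeta) / norm + 7 / 20) ** beta_scal
        x = X[sel]
    return x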

2.2.1.5 DR3

The DR3 evolution strategy [33], also called (1,λ)-GSA-ES (generating set adaptation), is able to generate mutations according to an arbitrary multivariate normal distribution, corresponding to the adaptation of an arbitrary covariance matrix according to Eq. 2.11. This process is not based on explicitly using a covariance matrix, but on transforming an isotropic random vector \(\mathbf{z} = N(\mathbf{0},\mathbf{I})\) into a correlated random vector y by multiplication with a matrix \(\mathbf{B} = \left (\mathbf{b}_{1},\ldots,\mathbf{b}_{m}\right ) \in {\mathbb{R}}^{n\times m}\).

As described in Sect. 2.1.3, this can be interpreted as a superposition of multiple line distributions. For the number m of column vectors, \({n}^{2} \leq m \leq 2{n}^{2}\) holds, with a smaller value of m providing a faster and a larger value of m a more accurate adaptation. As in DR1, a factor \(\xi \in \{\frac{2} {3}, \frac{3} {2}\}\) with \(P(\xi = 2/3) = P(\xi = 3/2) = 1/2\) is used for the variation of the global step size \(\delta \in \mathbb{R}\). To guarantee an approximately constant length of the column vectors in B, y is scaled by a factor \(c_{m}\). Based on its parent x, an offspring is then created as follows:

$$\displaystyle{ \mathbf{x}\prime = \mathbf{x} + \delta \cdot \xi \cdot \mathbf{y}\mbox{ where }\mathbf{y} = c_{m} \cdot \mathbf{B}N(\mathbf{0},\mathbf{I}) }$$

The adaptation of endogenous strategy parameters is based on the selected y sel and ξ sel . The column vectors of matrix B are updated according to:

$$\displaystyle\begin{array}{rcl} \mathbf{b}_{1}\prime& =& (1 - c) \cdot \mathbf{b}_{1} + c \cdot (c_{u}\xi _{\mathit{sel}}\mathbf{y}_{\mathit{sel}}) {}\\ \mathbf{b}_{i+1}\prime& =& \mathbf{b}_{i}\mbox{ }\forall i \in \{1,\ldots,m - 1\} {}\\ \end{array}$$

Like with the previous versions of derandomized evolution strategies, the global step size δ is adapted based on the selected ξ sel , by using a damping exponent β:

$$\displaystyle{ \delta \prime = \delta \cdot {\left (\xi _{\mathit{sel}}\right )}^{\beta } }$$

For the exogenous parameters, the standard settings are given in [33] as follows:

$$\displaystyle\begin{array}{rcl} c& =& \sqrt{1/n} {}\\ \beta & =& \sqrt{1/n} {}\\ m& =& \frac{3} {2}{n}^{2} {}\\ c_{m}& =& (1/\sqrt{m})(1 + 1/m) {}\\ c_{u}& =& \sqrt{(2 - c)/c} {}\\ \lambda & =& 10 {}\\ \end{array}$$

The corresponding pseudocode of the DR3 evolution strategy is provided in Algorithm 2.7.

Algorithm 2.7 DR3

initialize x, δ, \(\mathbf{B} \leftarrow \left (\mathbf{0},N\left (\mathbf{0},(1/n)\mathbf{I}\right )\right ) \in {\mathbb{R}}^{n\times m}\)

t ← 0

repeat

    \(t \leftarrow t + 1\)

    for i = 1 → λ do

      \(\mathbf{z}_{i} \leftarrow N(\mathbf{0},\mathbf{I})\mbox{ where }\mathbf{z}_{i} \in {\mathbb{R}}^{m}\)

      \(\mathbf{y}_{i} \leftarrow c_{m} \cdot \mathbf{B}\mathbf{z}_{i}\)

      \(\mathbf{x}_{i} \leftarrow \mathbf{x} + \delta \cdot \xi _{i} \cdot \mathbf{y}_{i}\mbox{ where }P(\xi _{i} = 2/3) = P(\xi _{i} = 3/2) = 1/2\)

      \(\phi _{i} \leftarrow f(\mathbf{x}_{i})\)

    end for

    \(\mathit{sel} \leftarrow i\) with best value of ϕ i

    \(\mathbf{b} \leftarrow (1 - c) \cdot \mathbf{b}_{1} + c \cdot (c_{u}\xi _{\mathit{sel}}\mathbf{y}_{\mathit{sel}})\)

    \(\delta \prime \leftarrow \delta \cdot {\left (\xi _{\mathit{sel}}\right )}^{\beta }\)

    \(\mathbf{B}\prime \leftarrow (\mathbf{b},\mathbf{b}_{1},\ldots,\mathbf{b}_{m-1})\)

    \(\mathbf{x} \leftarrow \mathbf{x}_{\mathit{sel}}\), \(\delta \leftarrow \delta \prime\) and \(\mathbf{B} \leftarrow \mathbf{B}\prime\)

until termination criterion fulfilled
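The core of one DR3 generation in numpy; B is stored as an n × m matrix whose columns are shifted by the update:

import numpy as np

def dr3_step(f, x, delta, B, c, c_m, c_u, beta, lam=10):
    n, m = B.shape
    Z = np.random.standard_normal((lam, m))
    xi = np.random.choice([2 / 3, 3 / 2], size=lam)
    Y = (c_m * (B @ Z.T)).T                          # correlated mutation directions y_i
    X = x + delta * xi[:, None] * Y
    phi = np.array([f(row) for row in X])
    sel = np.argmin(phi)
    b_new = (1 - c) * B[:, 0] + c * (c_u * xi[sel] * Y[sel])
    B = np.column_stack([b_new, B[:, :-1]])          # shift: b'_{i+1} = b_i
    return X[sel], delta * xi[sel] ** beta, B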

2.2.2 Modern Evolution Strategies

2.2.2.1 (μ W ,λ)-CMA-ES

Algorithms DR1, DR2 and DR3, as described in Sect. 2.2.1, are derandomized evolution strategies in the sense of adapting endogenous strategy parameters depending on the selected mutation vector. This has also been called the first level of derandomization [63]. In addition, the second level of derandomization aims at the following goals [63]:

  • Increase the probability of generating the same mutation step again.

  • Provide a direct control mechanism for the rate of change of strategy parameters.

  • Keep the strategy parameters unchanged in case of random selection.

The so-called CMA-ES, as introduced in [31], meets these goals by means of two techniques, namely covariance matrix adaptation (CMA) and cumulative step size adaptation (CSA), the latter for adapting a global step size. The description of the CMA-ES as provided in [31] focuses on explaining these two techniques, and recombination in the case of μ > 1 is not discussed at all. Therefore, we will discuss the CMA-ES in this section as a (μ W ,λ)-CMA-ES with weighted intermediary recombination, as described in [29] and [32]. Using the notation for evolution strategies as introduced in Sect. 2.1.2, the algorithm ought to be denoted more precisely as (μ∕μ W ,λ)-CMA-ES, with the index W denoting the weighted recombination. However, the simplified notation is motivated by arguing that the notation μ∕μ W suggests two different numbers (μ and μ W ), although it is μ in both cases. Here, we adopt the simplified notation and denote the CMA-ES with weighted recombination as (μ W ,λ)-CMA-ES.

Based on a parent x, an offspring x′ is then generated as follows:

$$\displaystyle{ \mathbf{x}\prime = \mathbf{x} + \sigma \mathbf{B}\mathbf{D}\mathbf{z}\mbox{ where }\mathbf{z} = N(\mathbf{0},\mathbf{I}) }$$

Matrices B and D result from an eigendecomposition of the covariance matrix C according to Eq. 2.7, and \(\sigma \in \mathbb{R}\) denotes the global step size. After generating and evaluating an offspring population of size λ according to this mutation operator, the μ best individuals of the offspring population are selected and undergo weighted intermediary recombination.

Weighted intermediary recombination is a generalization of classical global intermediary recombination. It is based on using μ weights \(w_{1} \geq w_{2} \geq \ldots \geq w_{\mu }\) with \(\sum _{i=1}^{\mu }w_{i} = 1\) for generating the new parent \(\langle \mathbf{x}\rangle\) and the best mutation step \(\langle \mathbf{y}\rangle\) as weighted averages:

$$\displaystyle\begin{array}{rcl} \langle \mathbf{x}\rangle & =& \sum _{i=1}^{\mu }w_{ i}\mathbf{x}_{i:\lambda } {}\\ \langle \mathbf{y}\rangle & =& \sum _{i=1}^{\mu }w_{ i}\mathbf{B}\mathbf{D}\mathbf{z}_{i:\lambda } {}\\ \end{array}$$

For adapting the strategy parameters, the so-called variance effective selection mass μ eff is required:

$$\displaystyle{ \mu _{\mathit{eff }} ={ \left (\sum _{i=1}^{\mu }w_{ i}^{2}\right )}^{-1} }$$

According to [29], \(1 \leq \mu _{\mathit{eff }} \leq \mu \) holds, and for identical weights \(w_{i} = \frac{1} {\mu }\) (\(\forall i \in \{1,\ldots,\mu \}\)): μ eff = μ. In analogy with Eq. 2.13 for DR2, the strategy parameter adaptation techniques, CMA and CSA, use so-called evolution paths for accumulating strategy parameter information across several generations. The (μ W ,λ)-CMA-ES uses two evolution paths, p c for the adaptation of the covariance matrix and p σ for global step size adaptation. The evolution paths are updated as follows:

$$\displaystyle\begin{array}{rcl} \mathbf{p}_{c}\prime& =& (1 - c_{c}) \cdot \mathbf{p}_{c} + h_{\sigma }\sqrt{c_{c } (2 - c_{c } )\mu _{\mathit{eff }}}\langle \mathbf{y}\rangle {}\\ \mathbf{p}_{\sigma }\prime& =& (1 - c_{\sigma }) \cdot \mathbf{p}_{\sigma } + \sqrt{c_{\sigma } (2 - c_{\sigma } )\mu _{\mathit{eff }}}\mathbf{B}{\mathbf{D}}^{-1}{\mathbf{B}}^{T}\langle \mathbf{y}\rangle {}\\ \end{array}$$

For updating p c , the function h σ is used, which is defined according to:

$$\displaystyle{ h_{\sigma } = \left \{\begin{array}{@{}l@{\quad }l@{}} 1\quad &\mbox{ if } \frac{\|\mathbf{p}_{\sigma }\|} {\sqrt{1-{(1-c_{\sigma } )}^{2(t+1)}}} < \left (\frac{7} {5} + \frac{2} {n+1}\right )E(\|N(\mathbf{0},\mathbf{I})\|) \\ 0\quad &\mbox{ otherwise } \end{array} \right. }$$

The purpose of h σ is to stall the update of p c , i.e., to prevent it from taking the information of the current generation t into account, when \(\|\mathbf{p}_{\sigma }\|\) becomes too large. The expectation \(E(\|N(\mathbf{0},\mathbf{I})\|)\) of the length of a multivariate normally distributed vector of dimensionality n can be approximated (based on the gamma function) as follows:

$$\displaystyle{ E(\|N(\mathbf{0},\mathbf{I})\|) = \sqrt{2}\Gamma (\frac{n + 1} {2} )/\Gamma (\frac{n} {2} ) \approx \sqrt{n}\left (1 - \frac{1} {4n} + \frac{1} {21{n}^{2}}\right ) }$$
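The quality of this approximation is easy to check with the gamma function from the standard library:

import math

def chi_mean_exact(n):
    return math.sqrt(2) * math.gamma((n + 1) / 2) / math.gamma(n / 2)

def chi_mean_approx(n):
    return math.sqrt(n) * (1 - 1 / (4 * n) + 1 / (21 * n ** 2))

print(chi_mean_exact(10), chi_mean_approx(10))   # agree to within about 10^-3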

The covariance matrix adaptation is performed according to the equation below:

$$\displaystyle{ \mathbf{C}\prime = (1 - c_{1} - c_{\mu })\mathbf{C} + c_{1}(\mathbf{p}_{c}\mathbf{p}_{c}^{T} + \delta (h_{ \sigma })\mathbf{C}) + c_{\mu }\sum _{i=1}^{\mu }w_{ i}\mathbf{y}_{i:\lambda }\mathbf{y}_{i:\lambda }^{T} }$$
(2.14)

The first term in the summation represents the contribution of the previous covariance matrix. The second term is called the rank-one update and takes the information accumulated in the evolution path p c into account. The third term, the so-called rank-μ update, was introduced with the extension of the CMA-ES to population sizes with μ > 1 [46]. The global step size σ is updated according to:

$$\displaystyle{ \sigma \prime = \sigma \cdot \exp \left (\frac{c_{\sigma }} {d_{\sigma }}\left ( \frac{\|\mathbf{p}_{\sigma }\|} {E(\|N(\mathbf{0},\mathbf{I})\|)} - 1\right )\right ) }$$

For the exogenous strategy parameters of the (μ W ,λ)-CMA-ES, the following standard settings are defined in [29]:

$$\displaystyle\begin{array}{rcl} \lambda & =& 4 + \lfloor 3\ln n\rfloor {}\\ \mu & =& \lfloor \frac{\lambda } {2}\rfloor {}\\ w_{i}& =& \frac{\ln (\frac{\lambda +1} {2} ) -\ln i} {\sum _{j=1}^{\mu }\left (\ln (\frac{\lambda +1} {2} ) -\ln j\right )}\mbox{ for }i \in \{1,\ldots,\mu \} {}\\ c_{\sigma }& =& \frac{\mu _{\mathit{eff }} + 2} {n + \mu _{\mathit{eff }} + 5} {}\\ d_{\sigma }& =& 1 + 2\max \left (0,\sqrt{\frac{\mu _{\mathit{eff } } - 1} {n + 1}} \right ) + c_{\sigma } {}\\ c_{c}& =& \frac{4 + \mu _{\mathit{eff }}/n} {n + 4 + 2\mu _{\mathit{eff }}/n} {}\\ c_{1}& =& \frac{2} {{\left (n + \frac{13} {10}\right )}^{2} + \mu _{\mathit{eff }}} {}\\ c_{\mu }& =& \min \left (1 - c_{1},\alpha _{\mu } \frac{\mu _{\mathit{eff }} - 2 + 1/\mu _{\mathit{eff }}} {{(n + 2)}^{2} + \alpha _{\mu }\mu _{\mathit{eff }}/2}\right )\mbox{ with }\alpha _{\mu } = 2 {}\\ \end{array}$$

Putting it all together, the pseudocode of the (μ W ,λ)-CMA-ES is given in Algorithm 2.8.

Algorithm 2.8 (μ W ,λ)-CMA-ES

initialize \(\langle \mathbf{x}\rangle\)

\(\mathbf{p}_{c} \leftarrow \mathbf{0}\)

\(\mathbf{p}_{\sigma } \leftarrow \mathbf{0}\)

\(\mathbf{C} \leftarrow \mathbf{I}\)

t ← 0

repeat

    \(t \leftarrow t + 1\)

    B and \(\mathbf{D} \leftarrow \) eigendecomposition of C

    for i = 1 → λ do

      \(\mathbf{z}_{i} \leftarrow N(\mathbf{0},\mathbf{I})\)

      \(\mathbf{y}_{i} \leftarrow \mathbf{B}\mathbf{D}\mathbf{z}_{i}\)

      \(\mathbf{x}_{i} \leftarrow \langle \mathbf{x}\rangle + \sigma \mathbf{y}_{i}\)

      \(f_{i} \leftarrow f(\mathbf{x}_{i})\)

    end for

    \(\langle \mathbf{y}\rangle \leftarrow \sum _{i=1}^{\mu }w_{i}\mathbf{y}_{i:\lambda }\)

    \(\langle \mathbf{x}\rangle \leftarrow \langle \mathbf{x}\rangle + \sigma \langle \mathbf{y}\rangle =\sum _{ i=1}^{\mu }w_{i}\mathbf{x}_{i:\lambda }\)

    \(\mathbf{p}_{\sigma } \leftarrow (1 - c_{\sigma })\mathbf{p}_{\sigma } + \sqrt{c_{\sigma } (2 - c_{\sigma } )\mu _{\mathit{eff }}}\mathbf{B}{\mathbf{D}}^{-1}{\mathbf{B}}^{T}\langle \mathbf{y}\rangle\)

    \(\sigma \leftarrow \sigma \cdot \exp \left (\frac{c_{\sigma }} {d_{\sigma }}\left ( \frac{\|\mathbf{p}_{\sigma }\|} {E\|N(\mathbf{0},\mathbf{I})\|} - 1\right )\right )\)

    \(\mathbf{p}_{c} \leftarrow (1 - c_{c})\mathbf{p}_{c} + h_{\sigma }\sqrt{c_{c } (2 - c_{c } )\mu _{\mathit{eff }}}\langle \mathbf{y}\rangle\)

    \(\mathbf{C} \leftarrow (1 - c_{1} - c_{\mu })\mathbf{C} + c_{1}(\mathbf{p}_{c}\mathbf{p}_{c}^{T} + \delta (h_{\sigma })\mathbf{C}) + c_{\mu }\sum _{i=1}^{\mu }w_{i}\mathbf{y}_{i:\lambda }\mathbf{y}_{i:\lambda }^{T}\)

until termination criterion fulfilled
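A compact numpy transcription of Algorithm 2.8 with the standard settings above; the initial σ and the generation budget are arguments, and \(\delta (h_{\sigma }) = (1 - h_{\sigma })c_{c}(2 - c_{c})\) follows the usual definition in [29]:

import numpy as np

def cma_es(f, x, sigma, max_gens=500):
    n = len(x)
    lam = 4 + int(3 * np.log(n))
    mu = lam // 2
    w = np.log((lam + 1) / 2) - np.log(np.arange(1, mu + 1))
    w /= w.sum()
    mu_eff = 1 / np.sum(w ** 2)
    c_sigma = (mu_eff + 2) / (n + mu_eff + 5)
    d_sigma = 1 + 2 * max(0, np.sqrt((mu_eff - 1) / (n + 1))) + c_sigma
    c_c = (4 + mu_eff / n) / (n + 4 + 2 * mu_eff / n)
    c_1 = 2 / ((n + 1.3) ** 2 + mu_eff)
    c_mu = min(1 - c_1, 2 * (mu_eff - 2 + 1 / mu_eff) / ((n + 2) ** 2 + mu_eff))
    chi_n = np.sqrt(n) * (1 - 1 / (4 * n) + 1 / (21 * n ** 2))   # E||N(0,I)||
    p_c, p_sigma, C = np.zeros(n), np.zeros(n), np.eye(n)
    for t in range(1, max_gens + 1):
        eigvals, B = np.linalg.eigh(C)                           # C = B D^2 B^T
        D = np.sqrt(eigvals)
        Z = np.random.standard_normal((lam, n))
        Y = (Z * D) @ B.T                                        # rows y_i = B D z_i
        X = x + sigma * Y
        phi = np.array([f(row) for row in X])
        order = np.argsort(phi)[:mu]
        y_w = w @ Y[order]                                       # <y>
        x = x + sigma * y_w                                      # <x>
        C_inv_sqrt = B @ np.diag(1 / D) @ B.T                    # B D^{-1} B^T
        p_sigma = ((1 - c_sigma) * p_sigma
                   + np.sqrt(c_sigma * (2 - c_sigma) * mu_eff) * C_inv_sqrt @ y_w)
        h_sigma = (np.linalg.norm(p_sigma) / np.sqrt(1 - (1 - c_sigma) ** (2 * t))
                   < (1.4 + 2 / (n + 1)) * chi_n)
        sigma *= np.exp(c_sigma / d_sigma * (np.linalg.norm(p_sigma) / chi_n - 1))
        p_c = (1 - c_c) * p_c + h_sigma * np.sqrt(c_c * (2 - c_c) * mu_eff) * y_w
        delta_h = (1 - h_sigma) * c_c * (2 - c_c)                # delta(h_sigma)
        C = ((1 - c_1 - c_mu) * C
             + c_1 * (np.outer(p_c, p_c) + delta_h * C)
             + c_mu * (Y[order].T * w) @ Y[order])
    return x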

2.2.2.2 LS-CMA-ES

The LS-CMA-ES [6] is a (1,λ)-ES implementing the idea of adapting the covariance matrix C based on the inverse Hessian H −1. The Hessian itself is estimated by solving an appropriate least squares estimation problem. Based on Theorem 5 in [55], it is known that this requires at least \(m \geq \frac{1} {2}\left ({n}^{2} + 3n + 4\right )\) tuples \(\left (\mathbf{x},f(\mathbf{x})\right )\). To achieve this, the algorithm saves all tuples \(\left (\mathbf{x},f(\mathbf{x})\right )\) in an archive A. Based on the Taylor series expansion (Eq. 2.12), the least squares estimation problem is defined through the following minimization task:

$$\displaystyle\begin{array}{rcl} \min _{\mathbf{g}\in {\mathbb{R}}^{n},\mathbf{H}\in {\mathbb{R}}^{n\times n}}\sum _{k=1}^{m}{\left (f(\mathbf{x}_{ k}) - f(\mathbf{x}_{0}) - {(\mathbf{x}_{k} -\mathbf{x}_{0})}^{T}\mathbf{g} -\frac{1} {2}{(\mathbf{x}_{k} -\mathbf{x}_{0})}^{T}\mathbf{H}(\mathbf{x}_{ k} -\mathbf{x}_{0})\right )}^{2}& &{}\end{array}$$
(2.15)

The result of minimizing Eq. 2.15 provides estimators \(\hat{\mathbf{g}}\) for the gradient and \(\hat{\mathbf{H}}\) for the Hessian.
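Solving Eq. 2.15 reduces to an ordinary linear least squares problem in the entries of g and the \(n(n + 1)/2\) free entries of the symmetric H; a numpy sketch of ours (f(x 0 ) is assumed known, which removes one unknown):

import numpy as np

def fit_quadratic_model(X, F, x0, f0):
    # X: (m, n) sample points, F: (m,) fitness values; returns (g_hat, H_hat).
    n = X.shape[1]
    D = X - x0
    iu = np.triu_indices(n)
    quad = D[:, iu[0]] * D[:, iu[1]]      # products d_i * d_j for i <= j
    quad[:, iu[0] != iu[1]] *= 2          # off-diagonal pairs occur twice in d^T H d
    design = np.hstack([D, 0.5 * quad])
    theta, *_ = np.linalg.lstsq(design, F - f0, rcond=None)
    g_hat = theta[:n]
    H_hat = np.zeros((n, n))
    H_hat[iu] = theta[n:]
    H_hat = H_hat + H_hat.T - np.diag(np.diag(H_hat))
    return g_hat, H_hat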

Since the Taylor series expansion up to the quadratic term provides only an approximation of the true fitness landscape at x 0, we are also interested in obtaining an error measure \(Q(\hat{g},\hat{\mathbf{H}})\) of the estimate for deciding whether \(\hat{{\mathbf{H}}}^{-1}\) can be used for covariance matrix adaptation. The following error measure is used for this purpose:

$$\displaystyle{ Q(\hat{g},\hat{\mathbf{H}}) = \frac{1} {m}\sum _{k=1}^{m}{\left (\frac{f(\mathbf{x}_{k}) - f(\mathbf{x}_{0}) - {(\mathbf{x}_{k} -\mathbf{x}_{0})}^{T}\hat{\mathbf{g}} -\frac{1} {2}{(\mathbf{x}_{k} -\mathbf{x}_{0})}^{T}\hat{\mathbf{H}}(\mathbf{x}_{ k} -\mathbf{x}_{0})} {f(\mathbf{x}_{k}) - f(\mathbf{x}_{0}) - {(\mathbf{x}_{k} -\mathbf{x}_{0})}^{T}\hat{\mathbf{g}}} \right )}^{2} }$$
(2.16)

Unfortunately, solving Eq. 2.15 and inverting \(\hat{\mathbf{H}}\) by means of numerical methods requires algorithms with time complexity \(O({n}^{6})\), so that, especially for large n, an execution of these steps in each generation is not affordable. To solve this problem, the LS-CMA-ES provides two different working modes, denoted LS and CMA, for adapting the covariance matrix.

In mode LS, an approximation of H is computed only every \(n_{\mathit{upd}}\) generations. If the error Q falls below a required threshold Q t , the covariance matrix \(\mathbf{C} = \frac{1} {2}\hat{{\mathbf{H}}}^{-1}\) is used by the algorithm and remains unchanged until a new update is performed after another \(n_{\mathit{upd}}\) generations.

If Q is larger than the threshold value Q t , the LS-CMA-ES switches into mode CMA. Before explaining this mode, the creation of an offspring x′ from the parent \(\langle \mathbf{x}\rangle\) is defined below:

$$\displaystyle{ \mathbf{x}\prime =\langle \mathbf{x}\rangle + \sigma dN(\mathbf{0},\mathbf{C})\mbox{ where }d =\exp (\tau N(0,1)) }$$

In addition to the covariance matrix C, a global step size σ is used, which is updated by mutative step size adaptation. If b denotes the index of the best offspring, the global step size is changed according to \(\sigma \prime = \sigma \cdot d_{b}\). Adapting the covariance matrix C is based on a rank-one update (i.e., the second term in Eq. 2.14) by using an evolution path p c :

$$\displaystyle\begin{array}{rcl} \mathbf{p}_{c}\prime& =& (1 - c_{c}) \cdot \mathbf{p}_{c} + \frac{\sqrt{(c_{c } (2 - c_{c } ))}} {\sigma } (\mathbf{x}_{b} -\langle \mathbf{x}\rangle ) {}\\ \mathbf{C}\prime& =& (1 - c_{\mathit{cov}}) \cdot \mathbf{C} + c_{\mathit{cov}}\mathbf{p}_{c}{(\mathbf{p}_{c})}^{T} {}\\ \end{array}$$

The evolution path p c is also updated when operating in mode LS, to make sure C is updated based on up-to-date information when the algorithm switches into mode CMA.

The pseudocode of the LS-CMA-ES is given in Algorithm 2.9, and the exogenous strategy parameters are set as follows:

$$\displaystyle\begin{array}{rcl} \lambda & =& 10 {}\\ \tau & =& \frac{1} {\sqrt{n}} {}\\ n_{\mathit{upd}}& =& 100 {}\\ Q_{t}& =& 1{0}^{-3} {}\\ c_{c}& =& \frac{4} {n + 4} {}\\ c_{\mathit{cov}}& =& \frac{2} {{(n + \sqrt{2})}^{2}} {}\\ \end{array}$$

Algorithm 2.9 LS-CMA-ES

initialize \(\langle \mathbf{x}\rangle\), σ

\(\mathbf{C} \leftarrow \mathbf{I}\)

Archive \(A \leftarrow \varnothing \)

\(\mathbf{p}_{c} \leftarrow \mathbf{0}\)

mode \(\leftarrow \) LS

\(t \leftarrow 0\)

repeat

    \(t \leftarrow t + 1\)

    B and D ← eigendecomposition of C

    for i = 1 → λ do

      \(d_{i} \leftarrow \exp \left (\tau N(0,1)\right )\)

      \(\mathbf{x}_{i} \leftarrow \langle \mathbf{x}\rangle + \sigma \cdot d_{i}\mathbf{B}\mathbf{D}N(\mathbf{0},\mathbf{I})\)

      \(f_{i} \leftarrow f(\mathbf{x}_{i})\)

      \(A \leftarrow A \cup \{(\mathbf{x}_{i},f_{i})\}\)

    end for

    \(b \leftarrow \) index of best offspring

    \(\sigma \leftarrow \sigma \cdot d_{b}\)

    \(\mathbf{p}_{c} \leftarrow (1 - c_{c})\mathbf{p}_{c} + \frac{\sqrt{c_{c } (2-c_{c } )}} {\sigma } (\mathbf{x}_{b} -\langle \mathbf{x}\rangle )\)

    if mode = LS then

      C unchanged

    else if mode = CMA then

      \(\mathbf{C} \leftarrow (1 - c_{\mathit{cov}})\mathbf{C} + c_{\mathit{cov}}\mathbf{p}_{c}\mathbf{p}_{c}^{T}\)

    end if

    if \(t\mod n_{\mathit{upd}} = 0\) then

      Obtain \(\hat{\mathbf{g}}\) and \(\hat{\mathbf{H}}\) based on the last \({n}^{2}\) tuples of A by solving Equation 2.15 where \(\mathbf{x}_{0} =\langle \mathbf{x}\rangle\).

      Obtain \(Q(\hat{\mathbf{g}},\hat{\mathbf{H}})\) from Equation 2.16

      if \(Q(\hat{\mathbf{g}},\hat{\mathbf{H}}) < Q_{t}\) then

        mode ← LS

        \(\mathbf{C} \leftarrow {\left (\frac{1} {2}\hat{\mathbf{H}}\right )}^{-1}\)

      else

        mode ← CMA

      end if

    end if

    \(\langle \mathbf{x}\rangle \leftarrow \mathbf{x}_{b}\)

until termination criterion fulfilled

2.2.2.3 LR-CMA-ES

The LR-CMA-ES (local restart) extends the (μ W ,λ)-CMA-ES by introducing restarts [4]. The strategy introduces five criteria for identifying stagnation of the optimization process and, in case of stagnation, starts a new run of the (μ W ,λ)-CMA-ES. Each run of the (μ W ,λ)-CMA-ES initializes the starting point of the search and the strategy parameters anew, so that the runs are independent of each other. For defining the termination criteria, the tolerance values \(T_{x} = \sigma \cdot 1{0}^{-12}\) and \(T_{f} = 1{0}^{-12}\) are used. All other exogenous parameters are the same as in the \((\mu _{W},\lambda )\)-CMA-ES.

The first termination criterion, called equalfunvalhist, is satisfied if either the best fitness values \(f(\mathbf{x}_{1:\lambda })\) of the last \(\lceil 10 + 30n/\lambda \rceil \) generations are identical or the difference between their maximum and minimum values is smaller than \(T_{f}\).

The second criterion, TolX, is satisfied if the components of the vector \(\mathbf{v} = \sigma \mathbf{p}_{c}\) are all smaller than T x , i.e., v i < T x \(\forall i \in \{1,\ldots,n\}\).

The third criterion, noeffectaxis, takes changes with respect to the main coordinate axes induced by C into account. These are given by the eigenvectors \(\mathbf{u}_{i}\) and eigenvalues γ i , \(i \in \{1,\ldots,n\}\), of C; the normalized eigenvectors form the columns of matrix B, and the square roots \(\sqrt{\gamma _{i}}\) of the eigenvalues form the main diagonal elements of D. The termination criterion does not check all main axes at once, but in generation t it takes only the axis i = t mod n into account. It is satisfied when \(\frac{\sigma } {10}\sqrt{\gamma _{i}}\mathbf{u}_{i} \approx 0\), i.e., when a mutation step of this size no longer has any numerical effect.

The fourth criterion, noeffectcoord, analyzes changes with respect to the coordinate axes. It is satisfied if \(\frac{\sigma } {5} C_{i,i} \approx 0\) \(\forall i \in \{1,\ldots,n\}\).

Finally, the criterion conditioncov checks whether the condition number of the matrix C, \(\mbox{ cond}(\mathbf{C}) = \frac{\max (\{\gamma _{1},\ldots,\gamma _{n}\})} {\min (\{\gamma _{1},\ldots,\gamma _{n}\})}\), exceeds \(1{0}^{14}\).
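Two of the five criteria in code form, as an illustration (the names follow the text; γ i are the eigenvalues of C):

import numpy as np

def tol_x(sigma, p_c, T_x):
    # TolX: all components of sigma * p_c are below the tolerance.
    return np.all(np.abs(sigma * p_c) < T_x)

def condition_cov(C, limit=1e14):
    # conditioncov: the condition number of C exceeds the limit.
    gamma = np.linalg.eigvalsh(C)
    return gamma.max() / gamma.min() > limit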

The pseudocode of the LR-CMA-ES, as shown in Algorithm 2.10, consists of a simple outer loop managing the restarts of the (μ W ,λ)-CMA-ES. The local termination criteria are exactly the five criteria introduced above for discovering stagnation. In contrast, the global termination criterion is the same as used in previous sections, see Sect. 2.1.2.

Algorithm 2.10 LR-CMA-ES

repeat

    execute (μ W ,λ)-CMA-ES (Algorithm 2.8) using the local termination criteria

until global termination criterion satisfied

2.2.2.4 IPOP-CMA-ES

The IPOP-CMA-ES [5] is an extension of the LR-CMA-ES as described in the previous section. Whenever a run of the (μ W ,λ)-CMA-ES is terminated due to a local termination criterion (as introduced for the LR-CMA-ES), the population size is increased by a factor η for the next run of the (μ W ,λ)-CMA-ES. This strategy is motivated by empirical investigations of the behavior of the (μ W ,λ)-CMA-ES with different population sizes on multimodal test functions [30]. As these investigations showed, the global convergence properties of the algorithm improve with increasing population size. The corresponding pseudocode is given in Algorithm 2.11. When using non-integer values for η, the new numbers of parents μ and offspring λ are obtained by rounding. For η, the interval \(\left [\frac{3} {2},5\right ]\) is identified as a reasonable range, and the default value η = 2 is recommended.

Algorithm 2.11 IPOP-CMA-ES

repeat

    execute (μ W ,λ)-CMA-ES (Algorithm 2.8) using the local termination criteria

    \(\mu \leftarrow \eta \cdot \mu \)

    \(\lambda \leftarrow \eta \cdot \lambda \)

until global termination criterion satisfied

2.2.2.5 (1+1)-Cholesky-CMA-ES

The (1+1)-Cholesky-CMA-ES [38] introduces a method for adapting the covariance matrix C implicitly, without using an eigendecomposition of C. Consequently, the approach reduces the computational complexity within each generation from \(O({n}^{3})\) to \(O({n}^{2})\).

The algorithm is based on the so-called Cholesky decomposition of the covariance matrix, \(\mathbf{C} = \mathbf{A}{\mathbf{A}}^{T}\). As proven in [38], an update of the Cholesky factor A is possible without explicit knowledge of the covariance matrix C. The corresponding lemma and theorem are stated here without proof. The lemma states that, for any vector \(\mathbf{v} \in {\mathbb{R}}^{n}\) and \(\varsigma = \frac{1} {\|{\mathbf{v}\|}^{2}} \left (\sqrt{1 +\|{ \mathbf{v} \|}^{2}} - 1\right )\), the following equation holds:

$$\displaystyle{ \mathbf{I} + \mathbf{v}{\mathbf{v}}^{T} = \left (\mathbf{I} + \varsigma \mathbf{v}{\mathbf{v}}^{T}\right )\left (\mathbf{I} + \varsigma \mathbf{v}{\mathbf{v}}^{T}\right ) }$$

This lemma is required for the proof of the following theorem:

Theorem 2.2.1.

Let \(\mathbf{C} \in {\mathbb{R}}^{n\times n}\) be a symmetric, positive definite matrix with Cholesky decomposition \(\mathbf{C} = \mathbf{A}{\mathbf{A}}^{T}\) . Let \(\mathbf{C}\prime = \alpha \mathbf{C} + \beta \mathbf{v}{\mathbf{v}}^{T}\) be an update of \(\mathbf{C}\) with \(\mathbf{v},\mathbf{z} \in {\mathbb{R}}^{n}\), \(\mathbf{v} = \mathbf{A}\mathbf{z}\) and \(\alpha,\beta \in {\mathbb{R}}^{+}\) . The updated Cholesky factor A ′ of C ′ is then given by \(\mathbf{A}\prime = \sqrt{\alpha }\mathbf{A} + \frac{\sqrt{\alpha }} {\|{\mathbf{z}\|}^{2}} \left (\sqrt{1 + \frac{\beta } {\alpha }\|{\mathbf{z}\|}^{2}} - 1\right )\left (\mathbf{A}\mathbf{z}\right ){\mathbf{z}}^{T}\) .
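Theorem 2.2.1 can be verified numerically by comparing the updated factor with the updated matrix; α, β, and the test matrix below are arbitrary choices of ours:

import numpy as np

rng = np.random.default_rng(0)
n = 4
M = rng.standard_normal((n, n))
C = M @ M.T + n * np.eye(n)                  # symmetric positive definite test matrix
A = np.linalg.cholesky(C)                    # C = A A^T
z = rng.standard_normal(n)
v = A @ z
alpha, beta = 0.9, 0.3

norm2 = z @ z
A_new = (np.sqrt(alpha) * A                  # updated factor according to Theorem 2.2.1
         + np.sqrt(alpha) / norm2 * (np.sqrt(1 + beta / alpha * norm2) - 1) * np.outer(A @ z, z))
C_new = alpha * C + beta * np.outer(v, v)
print(np.allclose(A_new @ A_new.T, C_new))   # True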

Based on a parent individual x, an offspring x′ is then created according to:

$$\displaystyle{ \mathbf{x}\prime = \mathbf{x} + \sigma \mathbf{A}\mathbf{z}\mbox{ with }\mathbf{z} = N(\mathbf{0},\mathbf{I}) }$$

Using Theorem 2.2.1, the Cholesky factor A is adapted as follows:

$$\displaystyle{ \mathbf{A}\prime = c_{a}\mathbf{A} + \frac{c_{a}} {\|{\mathbf{z}\|}^{2}}\left (\sqrt{1 + \frac{(1 - c_{a }^{2 })\|{\mathbf{z} \|}^{2 } } {c_{a}^{2}}} - 1\right )\mathbf{A}\mathbf{z}{\mathbf{z}}^{T}, }$$

with a constant exogenous strategy parameter c a . The adaptation above is applied if the value of a measure \(\bar{p}_{s}\) (explained in the following) is smaller than a threshold value p t .

The adaptation of the global step size σ is in some ways similar to the 1/5-success rule of the (1+1)-ES (see Sect. 2.2.1). If the offspring is better than the parent, λ s = 1 in the equation below; otherwise, λ s = 0. These success indicators are accumulated across generations by using a learning rate c p , resulting in an accumulated success rate \(\bar{p}_{s}\):

$$\displaystyle{ \bar{p}_{s} = (1 - c_{p})\bar{p}_{s} + c_{p}\lambda _{s} }$$

Using this measure and the target value \(p_{s}^{t}\) for the success rate, the global step size σ is updated as follows:

$$\displaystyle{ \sigma \prime = \sigma \cdot \exp \left (\frac{1} {d}\left (\bar{p}_{s} - \frac{p_{s}^{t}} {1 - p_{s}^{t}}(1 -\bar{ p}_{s})\right )\right ) }$$

The pseudocode is given in Algorithm 2.12, and the default settings of the exogenous strategy parameters are:

$$\displaystyle\begin{array}{rcl} p_{s}^{t}& =& \frac{2} {11} {}\\ p_{t}& =& \frac{11} {25} {}\\ c_{a}& =& \sqrt{1 - \frac{2} {{n}^{2} + 6}} {}\\ c_{p}& =& \frac{1} {12} {}\\ d& =& 1 + \frac{n} {2} {}\\ \end{array}$$

Algorithm 2.12 (1+1)-Cholesky-CMA-ES

initialize x, σ

\(\mathbf{A} \leftarrow \mathbf{I}\)

\(\bar{p}_{s} \leftarrow p_{s}^{t}\)

repeat

    \(\mathbf{z} \leftarrow N(\mathbf{0},\mathbf{I})\)

    \(\mathbf{x}\prime \leftarrow \mathbf{x} + \sigma \mathbf{A}\mathbf{z}\)

    if \(f(\mathbf{x}\prime) \leq f(\mathbf{x})\) then

      \(\lambda _{s} \leftarrow 1\)

    else

      \(\lambda _{s} \leftarrow 0\)

    end if

    \(\bar{p}_{s} \leftarrow (1 - c_{p})\bar{p}_{s} + c_{p}\lambda _{s}\)

    \(\sigma \leftarrow \sigma \cdot \exp \left (\frac{1} {d}\left (\bar{p}_{s} - \frac{p_{s}^{t}} {1-p_{s}^{t}}(1 -\bar{ p}_{s})\right )\right )\)

    if \(f(\mathbf{x}\prime) \leq f(\mathbf{x})\) then

      \(\mathbf{x} \leftarrow \mathbf{x}\prime\)

      if \(\bar{p}_{s} \leq p_{t}\) then

        \(\mathbf{A} \leftarrow c_{a}\mathbf{A} + \frac{c_{a}} {\|{\mathbf{z}\|}^{2}} \left (\sqrt{1 + \frac{\left (1-c_{a }^{2 } \right ) \|{\mathbf{z} \|}^{2 } } {c_{a}^{2}}} - 1\right )\mathbf{A}\mathbf{z}{\mathbf{z}}^{T}\)

      end if

    end if

until termination criterion satisfied
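
A compact, runnable transcription of Algorithm 2.12 could look as follows; this is a sketch assuming NumPy, with an illustrative evaluation budget and objective.

```python
import numpy as np

def one_plus_one_cholesky_cma(f, x0, sigma0=1.0, max_evals=2000, seed=1):
    """Minimal sketch of the (1+1)-Cholesky-CMA-ES (Algorithm 2.12)."""
    rng = np.random.default_rng(seed)
    n = len(x0)
    # default exogenous strategy parameters from the text
    p_target, p_thresh = 2 / 11, 11 / 25
    c_a = np.sqrt(1 - 2 / (n ** 2 + 6))
    c_p, d = 1 / 12, 1 + n / 2
    x = np.asarray(x0, dtype=float)
    fx, sigma, A, p_s = f(x), sigma0, np.eye(n), p_target
    for _ in range(max_evals):
        z = rng.standard_normal(n)
        y = x + sigma * (A @ z)
        fy = f(y)
        success = 1.0 if fy <= fx else 0.0
        p_s = (1 - c_p) * p_s + c_p * success
        sigma *= np.exp((p_s - p_target / (1 - p_target) * (1 - p_s)) / d)
        if success:
            x, fx = y, fy
            if p_s <= p_thresh:          # adapt the Cholesky factor
                z2 = z @ z
                A = (c_a * A + c_a / z2
                     * (np.sqrt(1 + (1 - c_a ** 2) * z2 / c_a ** 2) - 1)
                     * np.outer(A @ z, z))
    return x, fx

# example: minimize the sphere function in ten dimensions
x_best, f_best = one_plus_one_cholesky_cma(lambda v: float(v @ v), np.ones(10))
```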

2.2.2.6 Active-CMA-ES

The (μ W ,λ)-CMA-ES uses weighted recombination of the μ best offspring to generate a new point in the search space. As shown by Rudolph [57], the convergence velocity of an evolution strategy can be further increased by also taking the worst offspring into account for recombination, albeit with negative weights. The Active-CMA-ES [40] is based on this idea;Footnote 21 however, the negative weighting is not used during recombination,Footnote 22 but exclusively for adapting the covariance matrix. The corresponding extension of the (μ W ,λ)-CMA-ES therefore mainly consists of a modified covariance matrix adaptation, which replaces Eq. 2.14 of the (μ W ,λ)-CMA-ES by:

$$\displaystyle\begin{array}{rcl} \mathbf{C}\prime& =& (1 - c_{c})\mathbf{C} + c_{c}\mathbf{p}_{c}\mathbf{p}_{c}^{T} + \beta \mathbf{Z}\mbox{ where } {}\\ \mathbf{Z}& =& \mathbf{B}\mathbf{D}\left ( \frac{1} {\mu }\sum _{k=1}^{\mu }\mathbf{z}_{ k:\lambda }\mathbf{z}_{k:\lambda }^{T} - \frac{1} {\mu }\sum _{k=\lambda -\mu +1}^{\lambda }\mathbf{z}_{ k:\lambda }\mathbf{z}_{k:\lambda }^{T}\right ){\left (\mathbf{B}\mathbf{D}\right )}^{T} {}\\ \end{array}$$

In addition, the exogenous parameter c c is now modified to \(c_{c} = \frac{2} {{(n+\sqrt{2})}^{2}}\). The parameter β has been tuned by means of an empirical investigation, which is described in detail in [39]. Its setting of \(\beta = \frac{4\mu -2} {{(n+12)}^{2}+4\mu }\) reflects a compromise between the conflicting goals of achieving a large convergence velocity on the one hand and, on the other hand, ensuring that C remains positive definite, which keeps the evolution strategy in a robust working regime. The pseudocode is provided in Algorithm 2.13, and the default settings of the exogenous strategy parameters are, except for c c and β, identical to those used in the (μ W ,λ)-CMA-ES.

Algorithm 2.13 Active-CMA-ES

initialize \(\langle \mathbf{x}\rangle\)

\(\mathbf{p}_{c} \leftarrow \mathbf{0}\)

\(\mathbf{p}_{\sigma } \leftarrow \mathbf{0}\)

\(\mathbf{C} \leftarrow \mathbf{I}\)

t ← 0

repeat

    tt + 1

    B and D ← from eigendecomposition of C

    for i = 1 → λ do

      \(\mathbf{z}_{i} \leftarrow N(\mathbf{0},\mathbf{I})\)

      \(\mathbf{y}_{i} \leftarrow \mathbf{B}\mathbf{D}\mathbf{z}_{i}\)

\(\mathbf{x}_{i} \leftarrow \langle \mathbf{x}\rangle + \sigma \mathbf{y}_{i}\)

      \(f_{i} \leftarrow f(\mathbf{x}_{i})\)

    end for

    \(\langle \mathbf{y}\rangle \leftarrow \sum _{i=1}^{\mu }w_{i}\mathbf{y}_{i:\lambda }\)

    \(\langle \mathbf{x}\rangle \leftarrow \langle \mathbf{x}\rangle + \sigma \langle \mathbf{y}\rangle =\sum _{ i=1}^{\mu }w_{i}\mathbf{x}_{i:\lambda }\)

    \(\mathbf{p}_{\sigma } \leftarrow (1 - c_{\sigma })\mathbf{p}_{\sigma } + \sqrt{c_{\sigma } (2 - c_{\sigma } )\mu _{\mathit{eff }}}\mathbf{B}{\mathbf{D}}^{-1}{\mathbf{B}}^{T}\langle \mathbf{y}\rangle\)

    \(\sigma \leftarrow \sigma \cdot \exp \left (\frac{c_{\sigma }} {d_{\sigma }}\left ( \frac{\|\mathbf{p}_{\sigma }\|} {E\|N(\mathbf{0},\mathbf{I})\|} - 1\right )\right )\)

    \(\mathbf{p}_{c} \leftarrow (1 - c_{c})\mathbf{p}_{c} + h_{\sigma }\sqrt{c_{c } (2 - c_{c } )\mu _{\mathit{eff }}}\langle \mathbf{y}\rangle\)

    \(\mathbf{Z} \leftarrow \mathbf{B}\mathbf{D}\left (\frac{1} {\mu }\sum _{k=1}^{\mu }\mathbf{z}_{ k:\lambda }\mathbf{z}_{k:\lambda }^{T} - \frac{1} {\mu }\sum _{k=\lambda -\mu +1}^{\lambda }\mathbf{z}_{ k:\lambda }\mathbf{z}_{k:\lambda }^{T}\right ){\left (\mathbf{B}\mathbf{D}\right )}^{T}\)

    \(\mathbf{C} \leftarrow (1 - c_{c})\mathbf{C} + c_{c}\mathbf{p}_{c}\mathbf{p}_{c}^{T} + \beta \mathbf{Z}\)

until termination criterion satisfied
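
The distinguishing step of this variant, the covariance update with the negative rank-μ term Z, can be sketched as follows (assuming NumPy; the function name and argument layout are illustrative, and the mutation vectors are assumed to be sorted by fitness, best first).

```python
import numpy as np

def active_covariance_update(C, p_c, B, D, z_sorted, mu, c_c, beta):
    """Sketch of the Active-CMA-ES covariance update: the best mu mutation
    vectors contribute positively, the worst mu negatively."""
    pos = sum(np.outer(z, z) for z in z_sorted[:mu]) / mu   # best mu
    neg = sum(np.outer(z, z) for z in z_sorted[-mu:]) / mu  # worst mu
    BD = B @ D
    Z = BD @ (pos - neg) @ BD.T
    return (1 - c_c) * C + c_c * np.outer(p_c, p_c) + beta * Z
```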

2.2.2.7 (μ,λ)-CMSA-ES

The (μ,λ)-CMSA-ES [13], more precisely denoted the (μ∕μ I ,λ)-CMA-σ-SA-ES, reintroduces self-adaptation of the global step size σ into the algorithm, just as in the (μ,λ)-MSC-ES. This approach is motivated by the fact that self-adaptation decreases the number of exogenous strategy parameters to two,Footnote 23 consequently providing a simplification of the (μ W ,λ)-CMA-ES, which requires five exogenous strategy parameters. Offspring individuals x i and their step sizes σ i , \(i \in \{1,\ldots,\lambda \}\), are created based on the parent x, the global step size σ, and the matrices B and D (from an eigendecomposition of the covariance matrix C), as follows:

$$\displaystyle\begin{array}{rcl} \sigma _{i}& =& \sigma \cdot \exp (\tau N(0,1)) {}\\ \mathbf{s}_{i}& =& \mathbf{B}\mathbf{D}N(\mathbf{0},\mathbf{I}) {}\\ \mathbf{z}_{i}& =& \sigma _{i} \cdot \mathbf{s}_{i} {}\\ \mathbf{x}_{i}& =& \mathbf{x} + \mathbf{z}_{i} {}\\ \end{array}$$

Recombination is based on identical weights 1∕μ, i.e., on averaging over the μ best offspring. It is applied to the vectors \(\mathbf{z}_{i:\lambda }\), the outer products \(\mathbf{s}_{i:\lambda }\mathbf{s}_{i:\lambda }^{T}\), and the step sizes \(\sigma _{i:\lambda }\), for \(i \in \{1,\ldots,\mu \}\), resulting in \(\langle \mathbf{z}\rangle\), \(\langle \mathbf{s}{\mathbf{s}}^{T}\rangle\) and the new global step size \(\langle \sigma \rangle\). The new parent x′ is then obtained as \(\mathbf{x}\prime = \mathbf{x} +\langle \mathbf{z}\rangle\). The matrix \(\langle \mathbf{s}{\mathbf{s}}^{T}\rangle\) is required for adapting the covariance matrix C, whose update uses the learning rate τ C as follows:

$$\displaystyle{ \mathbf{C}\prime = \left (1 - \frac{1} {\tau _{C}}\right )\mathbf{C} + \frac{1} {\tau _{C}}\langle \mathbf{s}{\mathbf{s}}^{T}\rangle }$$
(2.17)

The default settings of the exogenous strategy parameters are:

$$\displaystyle\begin{array}{rcl} \mu & =& \max \left (\left \lfloor \frac{n} {10}\right \rfloor,2\right ) {}\\ \lambda & =& 4\mu {}\\ \tau & =& \frac{1} {\sqrt{2n}} {}\\ \tau _{C}& =& 1 + \frac{n(n + 1)} {2\mu } {}\\ \end{array}$$

The pseudocode of the corresponding (μ,λ)-CMSA-ES is given in Algorithm 2.14.

Algorithm 2.14 (μ,λ)-CMSA-ES

initialize x, σ

\(\mathbf{C} \leftarrow \mathbf{I}\)

\(\langle \sigma \rangle \leftarrow \sigma \)

repeat

    B and D ← from eigendecomposition of C

    for i = 1 → λ do

\(\sigma _{i} \leftarrow \langle \sigma \rangle \exp \left (\tau N(0,1)\right )\)

      \(\mathbf{s}_{i} \leftarrow \mathbf{B}\mathbf{D}N(\mathbf{0},\mathbf{I})\)

      \(\mathbf{z}_{i} \leftarrow \sigma _{i} \cdot \mathbf{s}_{i}\)

      \(\mathbf{y}_{i} \leftarrow \mathbf{x} + \mathbf{z}_{i}\)

      \(f_{i} \leftarrow f(\mathbf{y}_{i})\)

    end for

    \(\langle \mathbf{z}\rangle \leftarrow \) average of the best μ \(\mathbf{z}_{i},i \in \{1,\ldots,\lambda \}\)

    \(\langle \mathbf{s}\rangle \leftarrow \) average of the best μ \(\mathbf{s}_{i},i \in \{1,\ldots,\lambda \}\)

    \(\langle \sigma \rangle \leftarrow \) average of the best μ \(\sigma _{i},i \in \{1,\ldots,\lambda \}\)

    \(\mathbf{x} \leftarrow \mathbf{x} +\langle \mathbf{z}\rangle\)

    \(\mathbf{C} \leftarrow \left (1 - \frac{1} {\tau _{C}}\right )\mathbf{C} + \frac{1} {\tau _{C}}\langle \mathbf{s}{\mathbf{s}}^{T}\rangle\)

until termination criterion satisfied
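
One generation of the strategy can be sketched as follows (assuming NumPy; names and the argument layout are illustrative).

```python
import numpy as np

def cmsa_step(f, x, sigma, C, mu, lam, tau, tau_c, rng):
    """One generation of the (mu,lambda)-CMSA-ES (Algorithm 2.14)."""
    n = len(x)
    eigvals, B = np.linalg.eigh(C)                    # C = B D^2 B^T
    D = np.diag(np.sqrt(np.maximum(eigvals, 0)))
    sigmas = sigma * np.exp(tau * rng.standard_normal(lam))  # self-adaptation
    S = (B @ D @ rng.standard_normal((n, lam))).T     # rows are s_i
    Z = sigmas[:, None] * S                           # rows are z_i
    fitness = np.array([f(x + z) for z in Z])
    best = np.argsort(fitness)[:mu]                   # indices of the mu best
    ss_mean = sum(np.outer(s, s) for s in S[best]) / mu   # <s s^T>
    x_new = x + Z[best].mean(axis=0)
    sigma_new = sigmas[best].mean()
    C_new = (1 - 1 / tau_c) * C + ss_mean / tau_c
    return x_new, sigma_new, C_new
```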

2.2.2.8 sep-CMA-ES

The sep-CMA-ES [54] is a variation of the (μ W ,λ)-CMA-ES which reduces space and time complexity to O(n), i.e., linear in n. This is achieved by using, instead of an arbitrary covariance matrix, just a diagonal matrix D as in Eq. 2.10. Consequently, this kind of evolution strategy is no longer able to generate correlated mutations, in return for the advantage of saving the computationally intensive eigendecomposition of the covariance matrix C. D can then be obtained from C by taking the square roots of the main diagonal elements of C. The covariance matrix is adapted according to the following update rule:

$$\displaystyle{ \mathbf{C}\prime = (1 - c_{\mathit{cov}})\mathbf{C} + \frac{1} {\mu _{\mathit{eff }}}c_{\mathit{cov}}\mathbf{p}_{c}{(\mathbf{p}_{c})}^{T} + c_{\mathit{ cov}}\left (1 - \frac{1} {\mu _{\mathit{eff }}}\right )\sum _{i=1}^{\mu }w_{ i}\mathbf{D}\mathbf{z}_{i:\lambda }{(\mathbf{D}\mathbf{z}_{i:\lambda })}^{T} }$$

Due to the reduced complexity of the covariance matrix, the learning rate c cov can be increased, which accelerates the adaptation process. It is set as follows:

$$\displaystyle{ c_{\mathit{cov}} = \frac{n + 2} {3} \left ( \frac{1} {\mu _{\mathit{eff }}} \frac{2} {{(n + \sqrt{2})}^{2}} + (1 - \frac{1} {\mu _{\mathit{eff }}})\min \left (1, \frac{2\mu _{\mathit{eff }} - 1} {{(n + 2)}^{2} + \mu _{\mathit{eff }}}\right )\right ) }$$

All other settings of the sep-CMA-ES are identical to those used within the (μ W ,λ)-CMA-ES. The resulting pseudocode of the sep-CMA-ES is shown in Algorithm 2.15.

Algorithm 2.15 sep-CMA-ES

initialize \(\langle \mathbf{x}\rangle\)

\(\mathbf{C} \leftarrow \mathbf{I}\)

\(\mathbf{D} \leftarrow \mathbf{I}\)

\(\mathbf{p}_{\sigma } \leftarrow \mathbf{0}\)

\(\mathbf{p}_{c} \leftarrow \mathbf{0}\)

t ← 0

repeat

    tt + 1

    for i = 1 → λ do

      \(\mathbf{z}_{i} \leftarrow N(\mathbf{0},\mathbf{I})\)

      \(\mathbf{x}_{i} \leftarrow \langle \mathbf{x}\rangle + \sigma \mathbf{D}\mathbf{z}_{i}\)

    end for

    \(\langle \mathbf{x}\rangle \leftarrow \sum _{i=1}^{\mu }w_{i}\mathbf{x}_{i:\lambda }\)

    \(\langle \mathbf{z}\rangle \leftarrow \sum _{i=1}^{\mu }w_{i}\mathbf{z}_{i:\lambda }\)

    \(\mathbf{p}_{\sigma } \leftarrow (1 - c_{\sigma })\mathbf{p}_{\sigma } + \sqrt{c_{\sigma } (2 - c_{\sigma } )}\sqrt{\mu _{\mathit{eff }}}\langle \mathbf{z}\rangle\)

    if \(\frac{\|\mathbf{p}_{\sigma }\|} {\sqrt{1-{(1-c_{\sigma } )}^{2t}}} < \left (\frac{7} {5} + \frac{2} {n+1}\right )E(\|N(\mathbf{0},\mathbf{I})\|)\) then

      \(H_{\sigma } \leftarrow 1\)

    else

      \(H_{\sigma } \leftarrow 0\)

    end if

    \(\mathbf{p}_{c} \leftarrow (1 - c_{c})\mathbf{p}_{c} + H_{\sigma }\sqrt{c_{c } (2 - c_{c } )}\sqrt{\mu _{\mathit{eff }}}\mathbf{D}\langle \mathbf{z}\rangle\)

\(\mathbf{C} \leftarrow (1 - c_{\mathit{cov}})\mathbf{C} + \frac{c_{\mathit{cov}}} {\mu _{\mathit{eff }}} \mathbf{p}_{c}\mathbf{p}_{c}^{T} + c_{\mathit{cov}}\left (1 - \frac{1} {\mu _{\mathit{eff }}} \right )\sum _{i=1}^{\mu }w_{i}\mathbf{D}\mathbf{z}_{i:\lambda }{\left (\mathbf{D}\mathbf{z}_{i:\lambda }\right )}^{T}\)

\(\sigma \leftarrow \sigma \exp \left (\frac{c_{\sigma }} {d_{\sigma }}\left ( \frac{\|\mathbf{p}_{\sigma }\|} {E(\|N(\mathbf{0},\mathbf{I})\|)} - 1\right )\right )\)

\(\mathbf{D} \leftarrow \mbox{ diag}\left (\sqrt{C_{1,1}},\ldots,\sqrt{C_{n,n}}\right )\)

until termination criterion satisfied
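
The computational benefit becomes apparent in code: with a diagonal D, both sampling and the covariance update reduce to elementwise operations of cost O(n) per vector. A small sketch with illustrative values (assuming NumPy; all state variables below are stand-ins, not defaults):

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma, mu_eff, c_cov = 10, 0.5, 3.0, 0.05   # illustrative values
c_diag = np.ones(n)                  # main diagonal of C
d = np.sqrt(c_diag)                  # D = diag(sqrt(C_11), ..., sqrt(C_nn))

# sampling an offspring is O(n): an elementwise product instead of B D z
z = rng.standard_normal(n)
x = np.zeros(n) + sigma * d * z      # parent at the origin for illustration

# the covariance update keeps only the main diagonal of each term
w = np.full(3, 1 / 3)                # recombination weights (illustrative)
z_sel = rng.standard_normal((3, n))  # selected mutation vectors z_{i:lambda}
p_c = rng.standard_normal(n)         # evolution path (illustrative state)
rank_mu = (w[:, None] * (d * z_sel) ** 2).sum(axis=0)
c_diag = ((1 - c_cov) * c_diag + (c_cov / mu_eff) * p_c ** 2
          + c_cov * (1 - 1 / mu_eff) * rank_mu)
d = np.sqrt(c_diag)
```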

2.2.2.9 \((1{ + \atop,} \lambda _{m}^{s})\)-ES

The \((1{ + \atop,} \lambda _{m}^{s})\)-ES [16] introduces the two new concepts of mirrored sampling and sequential selection. These two mutually independent concepts modify how offspring are created and selected; they do not by themselves establish a complete evolution strategy. The concept of mirrored sampling can be used within a (1 + λ)-ES as well as a (1,λ)-ES. The application of sequential selection is only possible in the case of a plus-strategy, which also explains the use of the notation \({ + \atop,}\). Furthermore, the indices s and m of λ represent the algorithmic concepts of sequential selection (s) and mirrored sampling (m), respectively.

The idea of mirrored sampling is to generate part of the offspring in a derandomized way: for a mutation vector z, not only the offspring x + z is created, but also the additional offspring x − z. These two offspring are obviously symmetricFootnote 24 with respect to x. As a potential application, mentioned in [3], mirrored sampling can increase the robustness of the Evolutionary Gradient Search algorithm and the convergence velocity on the sphere model. Theoretical convergence rates for variants of the \((1{ + \atop,} \lambda _{m}^{s})\)-ES have been derived; see [16] for the corresponding results.

Sequential selection can be used to reduce the number of function evaluations. Within a (1 + λ)-ES, mutation and evaluation are executed sequentially for individual offspring, rather than generating all λ offspring first and then evaluating their fitness. As soon as an offspring has a better fitness than the parent, it replaces the parent, and no further offspring need to be generated and evaluated. In this way, up to λ − 1 function evaluations can be saved per generation.

The two concepts can be used independently of each other, or in combination. As explained before, the \((1{ + \atop,} \lambda _{m}^{s})\)-ES does not constitute a complete evolution strategy, but rather a method for generating the parent \(\langle \mathbf{x}\rangle \prime\) for the next generation based on the previous parent \(\langle \mathbf{x}\rangle\) and a method mutationOffset, which generates a mutation step and is determined by the underlying evolution strategy. The approach is summarized in pseudocode in Algorithm 2.16.

Algorithm 2.16 (\(1{ + \atop,} \lambda _{m}^{s}\))-ES

Input: search point \(\langle \mathbf{x}\rangle\) and a method mutationOffset. Output: new search point \(\langle \mathbf{x}\rangle \prime\)

i ← 0

j ← 0

while i < λ do

    ii + 1

    jj + 1

    if  (mirrored sampling) \(\wedge \) (j modulo 2 = 0)  then

\(\mathbf{x}_{i} \leftarrow \langle \mathbf{x}\rangle -\mathbf{z}_{i-1}\)

    else

      \(\mathbf{z}_{i} \leftarrow \) mutationOffset()

      \(\mathbf{x}_{i} \leftarrow \langle \mathbf{x}\rangle + \mathbf{z}_{i}\)

    end if

    if  (sequential selection) ∧ (\(f(\mathbf{x}_{i}) < f(\langle \mathbf{x}\rangle )\)then

      j ← 0

      break

    end if

end while

\(\langle \mathbf{x}\rangle \prime \leftarrow \mbox{ argmin}\left (\{f(\mathbf{x}_{1}),\ldots,f(\mathbf{x}_{i})\}\right )\)
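
A sketch of this generation step (assuming NumPy; mutation_offset is a placeholder for the mutation operator of the underlying evolution strategy, and the helper name is illustrative):

```python
import numpy as np

def mirrored_sequential_step(f, x, fx, lam, mutation_offset,
                             mirrored=True, sequential=True):
    """Sketch of Algorithm 2.16: create up to lam offspring, mirroring
    every second sample and stopping early on the first improvement."""
    evaluated, z = [], None
    for i in range(lam):
        if mirrored and i % 2 == 1:
            xi = x - z                   # mirror the previous mutation step
        else:
            z = mutation_offset()        # fresh mutation step
            xi = x + z
        fi = f(xi)
        evaluated.append((fi, xi))
        if sequential and fi < fx:       # sequential selection: stop early
            break
    return min(evaluated, key=lambda t: t[0])

# example: isotropic mutations of strength 0.3 on the sphere function
rng = np.random.default_rng(0)
f = lambda v: float(v @ v)
x = np.ones(5)
f_new, x_new = mirrored_sequential_step(
    f, x, f(x), lam=8, mutation_offset=lambda: 0.3 * rng.standard_normal(5))
```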

2.2.2.10 xNES

The xNES algorithm (exponential natural evolution strategies) [26] is a (1,λ)-ES which adapts its endogenous strategy parameters by using the so-called natural gradient (see [1]). The idea was implemented for the first time in the context of NES (natural evolution strategies) [71] and was then developed further by introducing the eNES (efficient natural evolution strategies)Footnote 25 [66].

In the following, the underlying ideas of the xNES are briefly summarized, without detailed descriptions of the underlying concepts such as the Fisher information matrix. These fundamentals can be found in the original work of Glasmachers et al. and the corresponding references, see [26].

This family of evolution strategy algorithms also relies on the multivariate normal distribution \(N(\langle \mathbf{x}\rangle,\mathbf{C})\) for generating correlated mutations of the current search point \(\langle \mathbf{x}\rangle\). Similar to the (1 + 1)-Cholesky-CMA-ES (see Sect. 2.2.2.5), rather than working with the covariance matrix C explicitly, a Cholesky factor A with \(\mathbf{C} = \mathbf{A}{\mathbf{A}}^{T}\) is used. The current search point and the covariance matrix are combined to form the tuple \(\theta = \left (\langle \mathbf{x}\rangle,\mathbf{C}\right )\), representing the quantities subject to adaptation within an xNES. Rewritten as a function of the current search point \(\langle \mathbf{x}\rangle\) and the Cholesky factor A, the probability density of \(N(\langle \mathbf{x}\rangle,\mathbf{C})\) becomes:

$$\displaystyle{ p\left (\mathbf{x}\vert \theta \right ) = \frac{1} {{\left (\sqrt{2\pi }\right )}^{n}\det \mathbf{A}} \cdot \exp \left (-\frac{1} {2}{\left \|{\mathbf{A}}^{-1} \cdot (\mathbf{x} -\langle \mathbf{x}\rangle )\right \|}^{2}\right ) }$$

Given the distribution described by θ, the expectation J(θ) of the fitness becomes:

$$\displaystyle{ J(\theta ) = E(f(\mathbf{x})\vert \theta ) =\int f(\mathbf{x})p(\mathbf{x}\vert \theta )d\mathbf{x} }$$

The gradient of the expectation J(θ), ∇θ J(θ), can be calculated by using the so-called log-likelihood trick according to

$$\displaystyle{ \nabla _{\theta }J(\theta ) =\int \left (f(\mathbf{x})\nabla \log (p(\mathbf{x}\vert \theta ))\right )p(\mathbf{x}\vert \theta )d\mathbf{x}, }$$

which can be approximated by Monte Carlo estimation based on the offspring individuals \(\mathbf{x}_{i}\), \(i \in \{1,\ldots,\lambda \}\):

$$\displaystyle{ \nabla _{\theta }J(\theta ) \approx \frac{1} {\lambda }\sum _{i=1}^{\lambda }f(\mathbf{x}_{ i})\nabla \log (p(\mathbf{x}_{i}\vert \theta )). }$$

For calculating the term \(\nabla \log (p(\mathbf{x}\vert \theta ))\), we refer to [67]. Combining this with the Fisher information matrix (FIM) \(\mathbf{F} \in {\mathbb{R}}^{N\times N}\), where N = n + n(n + 1)∕2, the natural gradient G is obtained as:

$$\displaystyle{ G ={ \mathbf{F}}^{-1}\nabla _{ \theta }J(\theta ) }$$

Use of G is motivated by the fact that it is invariant with respect to linear transformations, so that the gradient-based adaptation behaves in a correlated search space essentially as it does in an isotropic one.

The NES suffers from the disadvantage of an impractical computational complexity of O(n 6), caused by the explicit calculation of the FIM and its inversion. In contrast, the xNES no longer requires an explicit calculation of the FIM. Based on a so-called exponential parameterization (see Sect. 4.1 in [26]), a transformation of θ into natural coordinates (see Sect. 4.2 in [26]) is applied. Using the step size σ and the Cholesky factor B, an offspring x is then generated from the parent \(\langle \mathbf{x}\rangle\) according to:

$$\displaystyle{ \mathbf{x} =\langle \mathbf{x}\rangle + \sigma \mathbf{B}\mathbf{z}\mbox{ where }\mathbf{z} = N(\mathbf{0},\mathbf{I}) }$$
(2.18)

Similar to weighted recombination, the xNES uses so-called utility values u i . This approach is also called fitness shaping in the context of an xNES. Using the rank i given by the fitness values, utility values are calculated as follows:

$$\displaystyle{ u_{i} = \frac{\max \left (0,\log \left (\frac{\lambda } {2} + 1\right ) -\log (i)\right )} {\sum _{j=1}^{\lambda }\max \left (0,\log \left (\frac{\lambda } {2} + 1\right ) -\log (j)\right )} - \frac{1} {\lambda } }$$
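
For a given λ, the utility vector can be computed as in the following small sketch (assuming NumPy); note that the utilities sum to zero.

```python
import numpy as np

def utilities(lam):
    """Fitness-shaping utilities u_1 >= ... >= u_lam; they sum to zero."""
    ranks = np.arange(1, lam + 1)
    raw = np.maximum(0.0, np.log(lam / 2 + 1) - np.log(ranks))
    return raw / raw.sum() - 1 / lam
```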

Using the mutation vectors z i from Eq. 2.18, the gradients G M for the covariance matrix and G δ for the current search point are defined by:

$$\displaystyle\begin{array}{rcl} \mathbf{G}_{M}& =& \sum _{i=1}^{\lambda }u_{ i}\left (\mathbf{z}_{i}\mathbf{z}_{i}^{T} -\mathbf{I}\right ) {}\\ \mathbf{G}_{\delta }& =& \sum _{i=1}^{\lambda }u_{ i}\mathbf{z}_{i} {}\\ \end{array}$$

For calculating the gradients, all λ offspring individuals are taken into account, i.e., a selection in the classical sense is not applied. Using these gradients and the learning rates η x , ησ and η B , the new search point \(\langle \mathbf{x}\rangle \prime\), the new step size σ′, and the new Cholesky factor B′ are calculated:

$$\displaystyle\begin{array}{rcl} \langle \mathbf{x}\rangle \prime& =& \langle \mathbf{x}\rangle + \eta _{x} \cdot \sigma \mathbf{B}\mathbf{G}_{\delta } {}\\ \sigma \prime& =& \sigma \cdot \exp \left (\frac{\eta _{\sigma }} {2n} \cdot \mbox{ tr}\left (\sum _{i=1}^{\lambda }u_{ i} \cdot \left (\mathbf{z}_{i}\mathbf{z}_{i}^{T} -\mathbf{I}\right )\right )\right ) {}\\ \mathbf{B}\prime& =& \mathbf{B} \cdot \exp \left (\frac{\eta _{B}} {2} \cdot \mathbf{G}_{M}\right ) {}\\ \end{array}$$

Here, the exponential function of a matrix A is defined by \(\exp (\mathbf{A}) =\sum _{ k=0}^{\infty }\frac{{\mathbf{A}}^{k}} {k!}\), see [26].

The resulting pseudocode of the xNES is given in Algorithm 2.17. The default settings of the exogenous strategy parameters are as follows:

$$\displaystyle\begin{array}{rcl} \lambda & =& 4 + \lfloor 3\log (n)\rfloor {}\\ \eta _{x}& =& 1 {}\\ \eta _{\sigma }& =& \frac{3} {5} \cdot \frac{3 +\log (n)} {n\sqrt{n}} {}\\ \eta _{B}& =& \eta _{\sigma } {}\\ \end{array}$$

Algorithm 2.17 xNES

initialize \(\langle \mathbf{x}\rangle\)

\(\mathbf{B} \leftarrow \mathbf{I}\)

\(\sigma \leftarrow \root{n}\of{\vert \det \mathbf{B}\vert }\)

for i = 1 → λ do

\(u_{i} \leftarrow \frac{\max \left (0,\log \left (\frac{\lambda } {2} +1\right )-\log (i)\right )} {\sum _{j=1}^{\lambda }\max \left (0,\log \left (\frac{\lambda } {2} +1\right )-\log (j)\right )} -\frac{1} {\lambda }\)

end for

repeat

    for i = 1 → λ do

      \(\mathbf{z}_{i} \leftarrow N(\mathbf{0},\mathbf{I})\)

      \(\mathbf{x}_{i} \leftarrow \langle \mathbf{x}\rangle + \sigma \mathbf{B}\mathbf{z}_{i}\)

    end for

    sort \(\{(\mathbf{z}_{i},\mathbf{x}_{i})\}\) by \(f(\mathbf{x}_{i})\)

    \(\mathbf{G}_{\delta } \leftarrow \sum _{i=1}^{\lambda }u_{i} \cdot \mathbf{z}_{i}\)

    \(\mathbf{G}_{M} \leftarrow \sum _{i=1}^{\lambda }u_{i} \cdot \left (\mathbf{z}_{i}\mathbf{z}_{i}^{T} -\mathbf{I}\right )\)

    \(G_{\sigma } \leftarrow \mbox{ tr}(\mathbf{G}_{M})/n\)

    \(\mathbf{G}_{B} \leftarrow \mathbf{G}_{M} - G_{\sigma } \cdot \mathbf{I}\)

    \(\langle \mathbf{x}\rangle \leftarrow \langle \mathbf{x}\rangle + \eta _{x} \cdot \sigma \mathbf{B} \cdot \mathbf{G}_{\delta }\)

    \(\sigma \leftarrow \sigma \cdot \exp \left (G_{\sigma } \cdot \frac{\eta _{\sigma }} {2} \right )\)

    \(\mathbf{B} \leftarrow \mathbf{B} \cdot \exp \left (\mathbf{G}_{B} \cdot \frac{\eta _{B}} {2} \right )\)

until termination criterion satisfied
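
Putting the pieces together, a compact sketch of Algorithm 2.17 might read as follows (assuming NumPy and SciPy for the matrix exponential; the evaluation budget is illustrative).

```python
import numpy as np
from scipy.linalg import expm            # matrix exponential exp(A)

def xnes(f, x0, sigma0=1.0, max_gens=500, seed=0):
    """Minimal xNES sketch (Algorithm 2.17) for minimizing f."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    n = len(x)
    lam = 4 + int(np.floor(3 * np.log(n)))
    eta_x = 1.0
    eta_sigma = eta_b = 0.6 * (3 + np.log(n)) / (n * np.sqrt(n))
    ranks = np.arange(1, lam + 1)        # fitness-shaping utilities
    raw = np.maximum(0.0, np.log(lam / 2 + 1) - np.log(ranks))
    u = raw / raw.sum() - 1 / lam
    sigma, B, I = sigma0, np.eye(n), np.eye(n)
    for _ in range(max_gens):
        Z = rng.standard_normal((lam, n))
        X = x + sigma * Z @ B.T          # x_i = <x> + sigma * B z_i
        Z = Z[np.argsort([f(xi) for xi in X])]   # sort z_i by fitness
        G_delta = u @ Z
        G_M = sum(ui * (np.outer(z, z) - I) for ui, z in zip(u, Z))
        G_sigma = np.trace(G_M) / n
        G_B = G_M - G_sigma * I
        x = x + eta_x * sigma * (B @ G_delta)
        sigma *= np.exp(0.5 * eta_sigma * G_sigma)
        B = B @ expm(0.5 * eta_b * G_B)
    return x
```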

2.2.2.11 (1+1)-Active-CMA-ES

Extending the (1+1)-Cholesky-CMA-ES with the idea of the Active-CMA-ES, i.e., taking information from unsuccessful offspring into account for covariance matrix adaptation, leads to a hybrid, the (1+1)-Active-CMA-ES [2]. Instead of using an explicit covariance matrix \(\mathbf{C} = \mathbf{A}{\mathbf{A}}^{T}\), the (1+1)-Active-CMA-ES works directly with the Cholesky factor A and its inverse \({\mathbf{A}}^{-1}\). The update of A has been defined previously, based on Theorem 2.2.1. In order to also update \({\mathbf{A}}^{-1}\), an extended version of this theorem is required, which we state below (without proof, see [2]):

Theorem 2.2.2.

Let \(\mathbf{C} \in {\mathbb{R}}^{n\times n}\) be a symmetric, positive definite matrix with Cholesky decomposition \(\mathbf{C} = \mathbf{A}{\mathbf{A}}^{T}\) , and let \(\mathbf{C}\prime = \alpha \mathbf{C} + \beta \mathbf{v}{\mathbf{v}}^{T}\) be an update transformation of C where \(\mathbf{v} \in {\mathbb{R}}^{n}\setminus \{\mathbf{0}\}\), \(\alpha \in {\mathbb{R}}^{+}\) and \(\beta \in \mathbb{R}\) . Let \(\mathbf{w} ={ \mathbf{A}}^{-1}\mathbf{v}\) with \(\alpha + \beta \|{\mathbf{w}\|}^{2} > 0\) and let \(\mathbf{C}\prime = \mathbf{A}\prime{\mathbf{A}\prime}^{T}\) be the Cholesky decomposition of the updated matrix C ′. Then, the Cholesky factor A ′ and its inverse \({\mathbf{A}\prime}^{-1}\) are obtained as follows: \(\mathbf{A}\prime = \sqrt{\alpha }\mathbf{A} + \frac{\sqrt{\alpha }} {\|{\mathbf{w}\|}^{2}} \left (\sqrt{1 + \frac{\beta } {\alpha }\|{\mathbf{w}\|}^{2}} - 1\right )\mathbf{A}\mathbf{w}{\mathbf{w}}^{T}\) and \({\mathbf{A}\prime}^{-1} = \frac{1} {\sqrt{\alpha }}{\mathbf{A}}^{-1} - \frac{1} {\sqrt{\alpha }\|{\mathbf{w}\|}^{2}} \left (1 - \frac{1} {\sqrt{1+\beta \|{\mathbf{w} \|}^{2 } /\alpha }}\right )\mathbf{w}{\mathbf{w}}^{T}{\mathbf{A}}^{-1}\) .
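
Both updates can be maintained jointly in a single helper; the following sketch (assuming NumPy) implements Theorem 2.2.2 and checks its two claims numerically.

```python
import numpy as np

def rank_one_update(A, A_inv, alpha, beta, v):
    """Joint update of a Cholesky factor and its inverse (Theorem 2.2.2)."""
    w = A_inv @ v
    w2 = w @ w
    ra = np.sqrt(alpha)
    A_new = (ra * A + ra / w2 * (np.sqrt(1 + beta / alpha * w2) - 1)
             * np.outer(A @ w, w))
    A_inv_new = (A_inv / ra - (1 - 1 / np.sqrt(1 + beta * w2 / alpha))
                 / (ra * w2) * np.outer(w, w) @ A_inv)
    return A_new, A_inv_new

# numerical self-check on a random invertible factor
rng = np.random.default_rng(2)
A = np.tril(rng.standard_normal((4, 4))) + 4 * np.eye(4)
v = rng.standard_normal(4)
A2, A2_inv = rank_one_update(A, np.linalg.inv(A), 0.8, 0.3, v)
assert np.allclose(A2 @ A2_inv, np.eye(4))
assert np.allclose(A2 @ A2.T, 0.8 * A @ A.T + 0.3 * np.outer(v, v))
```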

The offspring x′ is generated from its parent x according to:

$$\displaystyle{ \mathbf{x}\prime = \mathbf{x} + \sigma \mathbf{A}\mathbf{z}\mbox{ where }\mathbf{z} = N(\mathbf{0},\mathbf{I}) }$$

As for the (1+1)-Cholesky-CMA-ES, the success rate p s , i.e., the fraction of successful mutations, is updated by taking the learning rate c p into account:

$$\displaystyle{ p_{s}\prime = \left \{\begin{array}{@{}l@{\quad }l@{}} (1 - c_{p})p_{s} + c_{p}\quad &\mbox{ if }f(\mathbf{x}\prime) \leq f(\mathbf{x}) \\ (1 - c_{p})p_{s} \quad &\mbox{ if }f(\mathbf{x}\prime) > f(\mathbf{x}) \end{array} \right. }$$

Based on the success rate p s , a damping parameter \(d \in {\mathbb{R}}^{+}\) and the target success rate p t , the global step size σ is updated as follows:

$$\displaystyle{ \sigma \prime = \sigma \cdot \exp \left (\frac{1} {d} \frac{p_{s} - p_{t}} {1 - p_{t}} \right ) }$$

The algorithm uses \(p_{t} = \frac{2} {11}\), which makes the update similar to the 1/5-success rule update mechanism of the (1+1)-ES.

If the offspring performs better than its parent, a positive Cholesky update is applied. In contrast to the (1+1)-Cholesky-CMA-ES, which uses the mutation step z for this update, the (1+1)-Active-CMA-ES relies on a search path s, accumulating successful mutation steps with a learning rate c and updating s as follows:

$$\displaystyle{ \mathbf{s}\prime = (1 - c)\mathbf{s} + \sqrt{c(2 - c)}\mathbf{A}\mathbf{z} }$$

With a constant c c + > 0 and the vector \(\mathbf{w} ={ \mathbf{A}}^{-1}\mathbf{s}\), the positive update of matrices A and A −1 can now be defined according to Theorem 2.2.2:

$$\displaystyle\begin{array}{rcl} \mathbf{A}\prime& =& a\mathbf{A} + b(\mathbf{A}\mathbf{w}){\mathbf{w}}^{T}\mbox{ and }{}\end{array}$$
(2.19)
$$\displaystyle\begin{array}{rcl}{ \mathbf{A}\prime}^{-1}& =& \frac{1} {a}{\mathbf{A}}^{-1} - \frac{b} {{a}^{2} + \mathit{ab}\|{\mathbf{w}\|}^{2}}\mathbf{w}({\mathbf{w}}^{T}{\mathbf{A}}^{-1})\mbox{ where } \\ a& =& \sqrt{1 - c_{c }^{+}}\mbox{ and } \\ b& =& \frac{\sqrt{1 - c_{c }^{+}}} {\|{\mathbf{w}\|}^{2}} \left (\sqrt{1 + \frac{c_{c }^{+ }} {1 - c_{c}^{+}}\|{\mathbf{w}\|}^{2}} - 1\right ) {}\end{array}$$
(2.20)

In the case of the Active-CMA-ES, the μ worst individuals are used for the negative update of the covariance matrix; they can be regarded as the “especially bad” individuals. In the case of the corresponding (1+1)-strategy, as introduced here, this definition is not applicable. Instead, the (1+1)-Active-CMA-ES stores past function evaluations and defines an offspring to be “especially bad” if its fitness value is worse than the fitness of its k-th predecessor. For an “especially bad” offspring, a negative update according to Eqs. 2.19 and 2.20 is performed, using modified values of the coefficients a and b. In contrast to the positive update, the vector z is used for the negative update rather than the transformed search path \(\mathbf{w} ={ \mathbf{A}}^{-1}\mathbf{s}\):

$$\displaystyle\begin{array}{rcl} a& =& \sqrt{1 + c_{c }^{-}} {}\\ b& =& \frac{\sqrt{1 + c_{c }^{-}}} {\|{\mathbf{z}\|}^{2}} \left (\sqrt{1 - \frac{c_{c }^{- }} {1 + c_{c}^{-}}\|{\mathbf{z}\|}^{2}} - 1\right ) {}\\ \end{array}$$

To ensure a positive definite covariance matrix, \(1 - \frac{c_{c}^{-}} {1+c_{c}^{-}}\|{\mathbf{z}\|}^{2} > 0\) needs to hold for the constant \(c_{c}^{-}\). Moreover, the convergence behavior of the algorithm can become unstable if the value of \(1 - \frac{c_{c}^{-}} {1+c_{c}^{-}}\|{\mathbf{z}\|}^{2}\) is very close to zero. As a countermeasure, in case of \(1 - \frac{c_{c}^{-}} {1+c_{c}^{-}}\|{\mathbf{z}\|}^{2} < 1/2\), the value of \(c_{c}^{-}\) is capped at the upper bound \(1/(2\|{\mathbf{z}\|}^{2})\).

The default settings of the exogenous parameters are:

$$\displaystyle\begin{array}{rcl} d& =& 1 + n/2 {}\\ c& =& 2/(n + 2) {}\\ c_{p}& =& 1/12 {}\\ p_{t}& =& 2/11 {}\\ c_{c}^{+}& =& \frac{2} {{n}^{2} + 6} {}\\ c_{c}^{-}& =& \frac{2} {5({n}^{8/5} + 1)} {}\\ \end{array}$$

The pseudocode of the (1+1)-Active-CMA-ES is given in Algorithm 2.18.

Algorithm 2.18 (1+1)-Active-CMA-ES

initialize x, σ, \(\mathbf{A} \leftarrow \mathbf{I}\), \({\mathbf{A}}^{-1} \leftarrow \mathbf{I}\), \(\mathbf{s} \leftarrow \mathbf{0}\), \(p_{s} \leftarrow p_{t}\), \(\mathbf{h} \leftarrow \mathbf{0} \in {\mathbb{R}}^{k}\)

\(t \leftarrow 0\)

repeat

    tt + 1

    \(\mathbf{z} \leftarrow N(\mathbf{0},\mathbf{I})\)

    \(\mathbf{y} \leftarrow \mathbf{x} + \sigma \mathbf{A}\mathbf{z}\)

    if t > k then

      \(h_{i} \leftarrow h_{i+1}\) \(\forall i \in \{1,\ldots,k - 1\}\)

      \(h_{k} \leftarrow f(\mathbf{y})\)

    else

      \(h_{t} \leftarrow f(\mathbf{y})\)

    end if

    if \(f(\mathbf{y}) \leq f(\mathbf{x})\) then

      \(\mathbf{x} \leftarrow \mathbf{y}\)

      \(p_{s} \leftarrow (1 - c_{p})p_{s} + c_{p}\)

      \(\mathbf{s} \leftarrow (1 - c)\mathbf{s} + \sqrt{c(2 - c)}\mathbf{A}\mathbf{z}\)

      \(\mathbf{w} \leftarrow {\mathbf{A}}^{-1}\mathbf{s}\)

      \(a \leftarrow \sqrt{1 - c_{c }^{+}}\)

      \(b \leftarrow \frac{\sqrt{1-c_{c }^{+}}} {\|{\mathbf{w}\|}^{2}} \left (\sqrt{1 + \frac{c_{c }^{+ }} {1-c_{c}^{+}} \|{\mathbf{w}\|}^{2}} - 1\right )\)

      \(\mathbf{A} \leftarrow a\mathbf{A} + b\left (\mathbf{A}\mathbf{w}\right ){\mathbf{w}}^{T}\)

\({\mathbf{A}}^{-1} \leftarrow \frac{1} {a}{\mathbf{A}}^{-1} - \frac{b} {{a}^{2}+\mathit{ab}\|{\mathbf{w}\|}^{2}} \mathbf{w}\left ({\mathbf{w}}^{T}{\mathbf{A}}^{-1}\right )\)

    else

      \(p_{s} \leftarrow (1 - c_{p})p_{s}\)

if \(h_{1} < f(\mathbf{y})\) then

        \(a \leftarrow \sqrt{1 + c_{c }^{-}}\)

        \(b \leftarrow \frac{a} {\|{\mathbf{z}\|}^{2}} \left (\sqrt{1 - \frac{c_{c }^{- }} {1+c_{c}^{-}}\|{\mathbf{z}\|}^{2}} - 1\right )\)

\(\mathbf{A} \leftarrow a\mathbf{A} + b\left (\mathbf{A}\mathbf{z}\right ){\mathbf{z}}^{T}\)

\({\mathbf{A}}^{-1} \leftarrow \frac{1} {a}{\mathbf{A}}^{-1} - \frac{b} {{a}^{2}+\mathit{ab}\|{\mathbf{z}\|}^{2}} \mathbf{z}\left ({\mathbf{z}}^{T}{\mathbf{A}}^{-1}\right )\)

      end if

    end if

    \(\sigma \leftarrow \sigma \exp \left (\frac{1} {d} \frac{p_{s}-p_{t}} {1-p_{t}} \right )\)

until termination criterion satisfied

2.2.2.12 (\(\mu /\mu _{W},\lambda _{\mathit{iid}} + \lambda _{m}\))-ES

The (\(\mu /\mu _{W},\lambda _{\mathit{iid}} + \lambda _{m}\))-ES [7] is based on extending the idea of mirrored sampling, as described in Sect. 2.2.2.9 for a \((1{ + \atop,} \lambda _{m}^{s})\)-ES, to the case μ > 1. The offspring population size is given by the number of samples λ iid (independent, identically distributed samples from the mutation distribution) and the number of offspring λ m (\(\lambda _{m} \leq \lambda _{\mathit{iid}}\)) which are additionally subject to mirroring. Using mirrored sampling in combination with weighted recombination and cumulative step size adaptation (see Sect. 2.2.2.1) introduces a bias with respect to the step size, i.e., the step size is reduced more than desired, potentially causing premature stagnation of the algorithm. To avoid this issue, the concept of pairwise selection is introduced: recombination never involves an offspring individual and its mirrored version at the same time, but only one of the two.

The (\(\mu /\mu _{W},\lambda _{\mathit{iid}} + \lambda _{m}\))-ES introduces two different versions of mirroring, namely random mirroring and selective mirroring. In the case of random mirroring, denoted by (\(\mu /\mu _{W},\lambda _{\mathit{iid}} + \lambda _{m}^{rand}\))-ES, the λ m offspring subject to mirroring are randomly selected out of the total number of offspring, \(\lambda _{\mathit{iid}}\). In the case of selective mirroring, denoted by (\(\mu /\mu _{W},\lambda _{\mathit{iid}} + \lambda _{m}^{\mathit{sel}}\))-ES, the \(\lambda _{\mathit{iid}}\) offspring are first sorted by fitness and the λ m worst individuals undergo mirroring. This approach is motivated by considering that, in a convex objective function topology, mirroring the best offspring cannot yield any further improvement, such that it will be advantageous to mirror the worst individuals. Moreover, since bad offspring in the case of a \((\mu _{W},\lambda )\)-ES are often generated by applying too-large mutation steps, selective mirroring itself will also favor large mutation steps [7]. To counteract this undesired bias, the resample length approach changes the length of the mirrored mutation step by additionally using a second, newly sampled mutation vector z′. The mirrored version \(\mathbf{x}_{m}\) of the offspring \(\mathbf{x} =\langle \mathbf{x}\rangle + \sigma \mathbf{z}\) is then created according to \(\mathbf{x}_{m} =\langle \mathbf{x}\rangle - \sigma \frac{\|\mathbf{z}\prime\|} {\|\mathbf{z}\|}\mathbf{z}\).

Like for the \((1{ + \atop,} \lambda _{m}^{s})\)-ES, theoretical results for the convergence velocity on the sphere model have been derived, see [7]. In particular, it has been shown that, for the sphere model, maximum convergence velocity is achieved for a setting of \(r = \lambda _{m}/\lambda _{\mathit{iid}} \approx 0.1886\), which can serve as a guideline for the fraction of offspring individuals which should be mirrored.

The pseudocode as given in Algorithm 2.19 is based on using a method updateStepSize Footnote 26 to update the step size σ, and weights w i \(\forall i \in \{1,\ldots,\mu \}\), such that \(\sum _{i=1}^{\mu }w_{i} = 1\).

Algorithm 2.19 (\(\mu /\mu _{W},\lambda _{\mathit{iid}} + \lambda _{m}\))-ES

initialize \(\langle \mathbf{x}\rangle\), σ

r ← 0

repeat

    i ← 0

    while \(i < \lambda _{\mathit{iid}}\) do

      rr + 1

      ii + 1

      \(\mathbf{x}_{i} \leftarrow \langle \mathbf{x}\rangle + \sigma N(\mathbf{0},\mathbf{I})\)

    end while

    if  selective mirroring then

      \(\mathbf{x}_{1},\ldots,\mathbf{x}_{\lambda _{\mathit{iid}}} = \mbox{ argsort}\left (f(\mathbf{x}_{1}),\ldots,f(\mathbf{x}_{\lambda _{\mathit{iid}}})\right )\)

    end if

    while \(i < \lambda _{\mathit{iid}} + \lambda _{m}\) do

      ii + 1

      if resample length then

        rr + 1

\(\mathbf{x}_{i} \leftarrow \langle \mathbf{x}\rangle - \frac{\sigma \|N(\mathbf{0},\mathbf{I})\|} {\|\mathbf{x}_{i-\lambda _{m}}-\langle \mathbf{x}\rangle \|}\left (\mathbf{x}_{i-\lambda _{m}} -\langle \mathbf{x}\rangle \right )\)

      else

        \(\mathbf{x}_{i} \leftarrow \langle \mathbf{x}\rangle -\left (\mathbf{x}_{i-\lambda _{m}} -\langle \mathbf{x}\rangle \right )\)

      end if

    end while

\(\mathbf{x}_{1},\ldots,\mathbf{x}_{\lambda _{\mathit{iid}}} = \mbox{ argsort}(f(\mathbf{x}_{1}),\ldots,f(\mathbf{x}_{\lambda _{\mathit{iid}}-\lambda _{m}}),\min \{f(\mathbf{x}_{\lambda _{\mathit{iid}}-\lambda _{m}+1}),f(\mathbf{x}_{\lambda _{\mathit{iid}}+1})\},\ldots,\min \{f(\mathbf{x}_{\lambda _{\mathit{iid}}}),f(\mathbf{x}_{\lambda _{\mathit{iid}}+\lambda _{m}})\})\)

    σ ← updateStepSize\((\sigma,\mathbf{x}_{1},\ldots,\mathbf{x}_{\lambda _{\mathit{iid}}},\langle \mathbf{x}\rangle )\)

    \(\langle \mathbf{x}\rangle \leftarrow \langle \mathbf{x}\rangle +\sum _{ i=1}^{\mu }w_{i}(\mathbf{x}_{i} -\langle \mathbf{x}\rangle )\)

until termination criterion satisfied
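
The pairwise selection at the end of the loop, letting each mirrored pair contribute only its better member to the ranking, can be sketched in plain Python as follows (argument names are illustrative).

```python
def pairwise_reduce(f_vals, lam_iid, lam_m):
    """f_vals contains lam_iid + lam_m fitnesses: the original offspring
    first, then the mirrored versions of the last lam_m originals. Each
    mirrored pair contributes only its better member to the ranking."""
    head = list(f_vals[:lam_iid - lam_m])       # offspring without mirrors
    pairs = [min(f_vals[lam_iid - lam_m + j], f_vals[lam_iid + j])
             for j in range(lam_m)]
    return head + pairs                         # lam_iid values to be sorted
```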

2.2.2.13 SPO-CMA-ES

The SPO-CMA-ES [70] is essentially a restart-version of the \((\mu _{W},\lambda )\)-CMA-ES. It is based on using sequential parameter optimization (SPO) [11] to optimize the exogenous parameters of an evolution strategy. SPO uses methods of design of experiments (DoE) and design and analysis of computer experiments (DACE).Footnote 27

The exogenous parameters subject to sequential parameter optimization are the number of offspring individualsFootnote 28 \(\lambda \in \{\lambda _{\mathit{def }},\ldots,1000\}\), the initial step size \(\sigma _{\mathit{init}} \in [1,5]\), and the so-called selection pressure λ∕μ ∈ [1.5,2.5].

The pseudocode of the SPO-CMA-ES is provided in Algorithm 2.20, and the approach is explained in the following by discussing the various methods used in the algorithm. To begin with, an initial design of experiments for the exogenous parameters is created using Latin hypercube sampling (LHS) [68]. In the next step (runDesign), independent runs of the (μ W ,λ)-CMA-ES are executed, using the parameter sets of the DoE plan. The result of each run, i.e., the best evaluated individual together with its fitness value, is collected in the set Y. This initial phase of the algorithm is called the exploration phase.

The next phase, called the exploitation phase, is repeated until the predefined budget of function evaluations is reached. Using a function aggregateRuns, a performance measure y is calculated for every run configuration in Y. Based on these performance measures as outputs and the corresponding parameter sets of the experimental plan as inputs, a Kriging modelFootnote 29 \(\mathcal{M}\) is trained in the method fitModel. This Kriging model \(\mathcal{M}\) is then used by the method modelOptimization to identify a new design point, e.g., by running an optimization on the Kriging model and using the resulting point. The new design point d is then added to the experimental plan D, and the loop is executed again. Default settings are given neither for the size of the initial experimental plan, N init , nor for the split of the number of function evaluations between the two phases of the algorithm [70]. Rather, the user of the algorithm can fix them, depending on the optimization task at hand. In the case of noisy objective functions, the method runDesign can execute more than one run, in order to use, e.g., averages as an estimate of the true fitness value.

Algorithm 2.20 SPO-CMA-ES

Input: box constraints \(\mathbf{l},\mathbf{u} \in {\mathbb{R}}^{n}\) and size N init of the initial design. Output: final model \(\mathcal{M}\) and best design point \({d}^{{\ast}}\)

\(i \leftarrow 0\), \(D \leftarrow \varnothing \)

\(d_{i} \leftarrow \mbox{ LHS}(\mathbf{l},\mathbf{u},N_{\mathit{init}})\)

\(Y \leftarrow \mbox{ runDesign}(d_{i})\)

\(D \leftarrow D \cup d_{i}\)

while function evaluation budget not exhausted do

    \(i \leftarrow i + 1\)

    \(y \leftarrow \mbox{ aggregateRuns}(Y )\)

    \(\mathcal{M}\leftarrow \) fitModel(D,y)

    \(d_{i} \leftarrow \mbox{ modelOptimization}(\mathcal{M})\)

    \(Y \leftarrow Y \cup \mbox{ runDesign}(d_{i})\)

    \(D \leftarrow D \cup d_{i}\)

end while

\({d}^{{\ast}}\leftarrow d_{k}\) with the best \(y_{k} \in \{y_{0},\ldots,y_{i}\}\)
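
As an illustration, Latin hypercube sampling and the surrounding SPO loop can be sketched as follows (assuming NumPy; run_design, fit_model and model_optimization remain placeholders as in Algorithm 2.20, and for simplicity the budget is counted in design evaluations rather than function evaluations).

```python
import numpy as np

def latin_hypercube(l, u, n_init, rng):
    """n_init points in the box [l, u], one per axis-aligned stratum."""
    n = len(l)
    strata = rng.permuted(np.tile(np.arange(n_init), (n, 1)), axis=1)
    grid = strata.T + rng.random((n_init, n))
    return l + (u - l) * grid / n_init

def spo_loop(run_design, fit_model, model_optimization, l, u, n_init,
             budget, rng):
    """Skeleton of Algorithm 2.20 built on hypothetical helper callables:
    run_design executes CMA-ES runs for one design point and returns an
    aggregated performance value, fit_model trains a surrogate (e.g., a
    Kriging model), and model_optimization proposes a new design point."""
    D = list(latin_hypercube(l, u, n_init, rng))      # exploration phase
    y = [run_design(d) for d in D]
    while len(y) < budget:                            # exploitation phase
        model = fit_model(D, y)
        d_new = model_optimization(model)
        D.append(d_new)
        y.append(run_design(d_new))
    return fit_model(D, y), D[int(np.argmin(y))]
```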

2.3 Further Aspects of ES

So far, we have described the ES algorithms as single-criterion optimizers with \({\mathbb{R}}^{n}\) as search domain and without handling of constraints. The next three sections give brief overviews of, and literature references for, further aspects of ES, namely constraint handling, binary and integer search spaces, and multiobjective optimization.

2.3.1 Constraint Handling

In Sect. 2.1.1 we defined the optimization problem used throughout this book with equality and inequality constraints as in Eq. 2.2. There are many techniques for handling constraints, ranging from simple penalty methods to more complex ones such as hybrid methods involving Lagrangian multipliers. Coello gives an overview [18] of constraint-handling techniques for evolutionary algorithms in general, and many of these methods can be applied to ES as well. A review by Kramer [42] specializes in constraint-handling methods dedicated to ES and presents four techniques: penalty methods, a multiobjective bioinspired approach, coordinate alignment techniques, and metamodeling of constraints.

2.3.2 Beyond Real-Valued Search Spaces

There are many optimization problems where the search domain is not constrained to the real domain. Especially decision problemsFootnote 30 use categorical search spaces, in most cases binary search spaces, i.e., \(\mathbf{x} \in \{0,1{\}}^{n}\), as the simplest categorical case. Another search space of practical use is the integer search space, representable as a subset of \(\mathbb{Z}\). Originally, Genetic Algorithms (see [27] or [25] for a comprehensive introduction) were designed to handle binary search spaces, but there are approaches to incorporate such search spaces into ES. In Sect. 2.1.3 we named three guidelines for choosing a mutation distribution. Rudolph [56] introduces a mutation operator for integer search spaces using the difference of two geometrically distributed random variables. For categorical search spaces, each discrete variable is assigned a mutation probability, and the new value of a mutating variable is drawn uniformly from all possible values. The MI-ES (mixed-integer evolution strategies) [43] solve optimization problems whose search domain is mixed, i.e., a composition of real, integer and categorical search spaces; they use the aforementioned mutation operators together with self-adaptation of the endogenous parameters (see the sketch below for the integer case). An overview of other approaches for handling mixed search spaces is given by Li [43].
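
A small sketch of such an integer mutation (assuming NumPy; the mutation probability and the geometric parameter are illustrative choices, not the settings from [56]):

```python
import numpy as np

rng = np.random.default_rng(0)

def integer_mutation(x, p_mut=0.3, p_geo=0.5):
    """Mutate an integer vector: each variable mutates with probability
    p_mut by adding the difference of two geometric random variables,
    which yields a symmetric step distribution over the integers."""
    x = np.array(x, dtype=int)
    mask = rng.random(x.shape) < p_mut
    steps = rng.geometric(p_geo, x.shape) - rng.geometric(p_geo, x.shape)
    x[mask] += steps[mask]
    return x
```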

2.3.3 Multiobjective Optimization

In single-objective optimization, fitness values can be totally ordered, so it is always possible to decide whether one solution is better than another. In multiobjective optimization, where fitness values are vectors, such a strict ordering no longer exists. Solutions are only partially ordered: based on the partial order, a solution either dominates another or is non-dominated with respect to it. Hence there is not a single optimum to be found, but a set of trade-off solutions, called the Pareto set; its image in objective space is the Pareto front. For a detailed description of these concepts see [20]. Algorithms for multiobjective optimization have to measure how well a Pareto front is approximated. The most common measures for this task are the crowding distance and the hypervolume contribution; the former is used, for example, by NSGA-II [21], the latter by SMS-EMOA [12].