1 Introduction

Evolutionary algorithms (EAs) are metaheuristics based on the principles of natural evolution. As such, they use recombination and mutation to create new search points and perform selection in order to determine which points may pass on their characteristics to succeeding generations. Research in EAs has a long tradition: the first evolutionary algorithms were introduced in the 1960s. Today, they represent one of the major classes in natural computation. Five main groups exist: genetic algorithms, genetic programming, evolutionary programming, differential evolution, and evolution strategies. The present article focuses on the last class, evolution strategies (ESs), which date back to the 1970s and are – at least today – mainly applied to continuous optimization. In this area, they have been established as well-performing black-box optimization methods, see e.g. [15]. Their main search operator is mutation, in contrast to genetic algorithms, which favor recombination.

Evolution strategies, see for example [1, 3, 32] for an introduction, operate with a multivariate normal distribution to generate new search points. The main parameters of the search distribution, the mean \(\mathbf m\) and the covariance matrix \(\sigma ^2\mathbf{C}\), must be adapted during the run of an evolution strategy. If they remain constant, the ES may either exhibit only slow convergence behavior or it may fail in its optimization task entirely. In order to ensure that the mutation control parameters are suited to the current fitness landscape of the function to be optimized, the scale as well as the directions of the mutations must be adapted. For this reason, research in evolution strategies often centers on control techniques for the mutation process with a strong focus on the covariance matrix. Nearly all approaches estimate the current covariance based on the sample. Relying on the sample covariance matrix leads to an ill-posed estimation task in the case of evolution strategies, however: for reasons of efficiency, the sample size is typically small with respect to the search space dimension. This results in a well-known problem in statistics: the estimate may differ considerably from the underlying true covariance [35, 36]. This may be the reason why nearly all current techniques introduce additional correction or regularization terms, for example by falling back to the previous covariance matrix and/or by strengthening certain promising directions. Interestingly, these procedures are reminiscent of shrinkage estimation in statistics, which represents one technique to cope with a poor sample covariance estimate. These similarities lead to the research question of the present paper: if evolution strategies perform a kind of implicit shrinkage, can they profit from the introduction of explicit shrinkage operators?

So far, shrinkage and other covariance matrix estimators have been applied remarkably seldom in the area of evolutionary computation. A literature review resulted in only two papers aside from our previous approaches: the first, by Dong and Yao, explored an application in the case of Gaussian estimation of distribution algorithms [8]. They faced the problem that learning the covariance matrix during the run led to non-positive-definite matrices. For this reason, a shrinkage procedure was applied to “repair” the covariance matrix. The approach was similar to the one in [20] with the exception of an adaptable shrinkage intensity. More recently, Kramer considered a Ledoit-Wolf estimator based on [19] for an evolution strategy which does not follow a population-based approach but uses a variant of the single-point elitist (1\(\,+\,\)1)-ES. For this reason, the covariance matrix adaptation has to consider past search points and corrects the estimate with shrinkage [17].

The current analysis extends the work carried out in [27, 28] and augments the investigation conducted in [25, 29] for the case of thresholding estimators. [27, 28] presented the first approaches to apply Ledoit-Wolf shrinkage estimators in evolution strategies. In a proof of concept, the shrinkage estimators were combined with an approach stemming from a maximum entropy covariance selection principle. Here, the work is extended by considering several mixture matrices, targets, and choices for the shrinkage intensity which are compared to the original version of the underlying ES variant, the CMSA-ES [5], for noise-free and for noisy optimization.

The paper is structured as follows: First, a brief introduction into evolution strategies with covariance matrix adaptation is provided. Afterwards, we focus on the problem of estimating high-dimensional covariance matrices. Several shrinkage targets are introduced and their integration into evolution strategies is described. The strategies are assessed and compared to the original ES version in the experimental sections. The paper ends with the conclusions and an outlook regarding open research points.

2 Evolution Strategies: A Short Introduction

Let \(f:\mathbb {R}^N\rightarrow \mathbb {R}\) be a continuous function that allows only the evaluation of the function itself but not the derivation of higher-order information. This is the area of black-box optimization, where metaheuristics such as evolution strategies and similar approaches are often applied. Evolution strategies (ESs) are stochastic optimization methods that usually operate with a sample or population of search points, also called candidate solutions. They distinguish between a population of \(\mu \) parents and \(\lambda \) offspring. In many applications in continuous search spaces, the parent population is discarded after the offspring have been created; therefore, \(\lambda >\mu \) is required. Evolution strategies use a multivariate normal distribution with mean \(\mathbf{m}^{(g)}\) and covariance matrix \(\left( {\sigma }^{(g)}\right) ^2\mathbf{C}^{(g)}\) to create new search points. The mean represents the recombination process and is obtained as the (weighted) centroid of the parent population. The covariance matrix is updated by following one of the established techniques [5, 12]. Sampling \(\lambda \) times from the normal distribution results in the offspring population

$$\begin{aligned} \mathbf{x}_l = \mathbf{m}^{(g)}+\sigma ^{(g)}\mathcal {N}(0,\mathbf{C}^{(g)}),\;l=1,\ldots ,\lambda . \end{aligned}$$
(1)

Afterwards, the new search points are evaluated using the function f to be optimized. Evolution strategies then apply deterministic selection in order to determine the next parent population, choosing the \(\mu \) best of the \(\lambda \) offspring.
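
To make the sampling and selection step concrete, the following Python sketch implements Eq. (1) together with truncation selection; the function and variable names are illustrative and not taken from the original implementation.

```python
import numpy as np

def sample_and_select(f, m, sigma, C, lam, mu, rng):
    # Draw lambda offspring according to Eq. (1): x_l = m + sigma * N(0, C)
    offspring = rng.multivariate_normal(mean=m, cov=sigma**2 * C, size=lam)
    # Evaluate the objective function (minimization assumed)
    fitness = np.array([f(x) for x in offspring])
    # Deterministic truncation selection: keep the mu best offspring
    order = np.argsort(fitness)[:mu]
    return offspring[order], fitness[order]

# Example usage on the sphere function
rng = np.random.default_rng(0)
N = 10
parents, parent_fitness = sample_and_select(
    f=lambda x: float(np.sum(x**2)), m=np.zeros(N), sigma=1.0,
    C=np.eye(N), lam=16, mu=4, rng=rng)
```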

As stated earlier, the parameters of the normal distribution must be adapted in order to allow progress towards the optimal point. Here, the covariance matrix is of great importance and adaptation techniques have received a lot of attention in work on evolution strategies (see [12, 24]). The investigation in this paper centers on the covariance matrix self-adaptation evolution strategy (CMSA-ES) [5]. The CMSA-ES divides the adaptation of the covariance \((\sigma ^{(g)})^2\mathbf{C}^{(g)}\) into two main procedures: the covariance matrix update for \(\mathbf{C}^{(g)}\) and the adaptation of the scale factor \(\sigma ^{(g)}\). The scale factor is also often called step-size or mutation strength. Following established practice in evolution strategies, the matrix \(\mathbf{C}^{(g)}\) will be referred to as the covariance matrix in the remainder of the paper.

2.1 Adaptation I: Covariance Matrix Update

As is the case for most techniques, the covariance matrix update of the CMSA-ES is based on the sample estimate, that is, on the \(\mu \) best offspring. Considering only the superior candidate solutions introduces a bias towards good search regions. Let \(\mathbf{x}_{m:\lambda }\) denote the mth best of the \(\lambda \) offspring w.r.t. the fitness and let

$$\begin{aligned} \mathbf{z}_{m:\lambda }^{(g+1)}:=\frac{1}{\sigma ^{(g)}}\Big (\mathbf{x}_{m:\lambda }^{(g+1)}-\mathbf{m}^{(g)}\Big ), \end{aligned}$$
(2)

stand for its normalization. The estimate then reads

$$\begin{aligned} \mathbf{C}^{(g+1)}_\mu :=\sum _{m=1}^\mu w_m \mathbf{z}_{m:\lambda }^{(g+1)}(\mathbf{z}_{m:\lambda }^{(g+1)})^{\mathrm {T}} \end{aligned}$$
(3)

with the weights usually set to \(w_m=1/\mu \) for the CMSA-ES [5]. Please note that the mean \(\mathbf{m}^{(g)}\) is known, thus, the number of degrees of freedom remains equal to \(\mu \). The sample covariance is then combined with the old covariance, resulting in

$$\begin{aligned} \mathbf{C}^{(g+1)} := (1-\frac{1}{c_\tau })\mathbf{C}^{(g)}+ \frac{1}{c_\tau }\mathbf{C}^{(g+1)}_\mu .\end{aligned}$$
(4)

The parameter \(c_\tau > 1\),

$$\begin{aligned} c_\tau =1+\frac{N(N+1)}{2\mu }, \end{aligned}$$
(5)

see [5], increases with the search space dimension and decreases with the population size.
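
A minimal sketch of the update (2)–(5), assuming equal weights \(w_m=1/\mu \) and illustrative names, reads as follows.

```python
import numpy as np

def cmsa_covariance_update(C_old, selected, m, sigma):
    """CMSA-ES covariance update, Eqs. (2)-(5); `selected` holds the mu best offspring as rows."""
    mu, N = selected.shape
    # Normalized mutation vectors z_{m:lambda}, Eq. (2)
    Z = (selected - m) / sigma
    # Sample covariance with equal weights w_m = 1/mu, Eq. (3)
    C_mu = (Z.T @ Z) / mu
    # Learning parameter, Eq. (5)
    c_tau = 1.0 + N * (N + 1) / (2.0 * mu)
    # Combination of old and sample covariance, Eq. (4)
    return (1.0 - 1.0 / c_tau) * C_old + (1.0 / c_tau) * C_mu
```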

2.2 Adaptation II: Self-adaptation

In addition to the covariance matrix update, the CMSA-ES applies self-adaptation to control the mutation strength \(\sigma ^{(g)}\). Self-adaptation was developed by Rechenberg [31] and Schwefel [33]. It takes place at the level of the individuals, meaning that each population member operates with a distinct set of strategy parameters. The strategy parameters are adapted by using similar evolutionary principles as for the main evolutionary algorithm. In other words, they also undergo recombination and mutation processes. Afterwards, they are used in the mutation of the search space position. The influence on the selection is indirect: self-adaptation is based on the assumption of a stochastic linkage between good objective values and appropriately tuned strategy parameters.

In the case of the CMSA-ES, the mutation process of the mutation strength is realized with the help of the log-normal distribution following

$$\begin{aligned} \sigma _l^{(g)} = \sigma ^{(g)}\exp (\tau \mathcal {N}(0,1)). \end{aligned}$$
(6)

The parameter \(\tau \), the learning rate, should scale with \(1/\sqrt{2N}\) [23]. Self-adaptation with recombination has been shown to be robust against noise [2] and is therefore considered in this paper. In this case, the variable \(\sigma ^{(g)}\) in (6) is the result of the recombination of the mutation strengths. Here, the same recombination type as before may be used, that is, \(\sigma ^{(g+1)}=\sum w_m \sigma _{m:\lambda }\) with \(\sigma _{m:\lambda }\) standing for the mutation strength associated with the mth best individual. Figure 1 summarizes the main steps of the covariance matrix self-adaptation evolution strategy.
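
A sketch of the self-adaptation step, Eq. (6), together with the intermediate recombination of the selected mutation strengths may look as follows (illustrative names, assuming \(\tau =1/\sqrt{2N}\)).

```python
import numpy as np

def mutate_sigma(sigma_recombined, N, lam, rng):
    # Log-normal mutation of the mutation strength, Eq. (6)
    tau = 1.0 / np.sqrt(2.0 * N)   # learning rate
    return sigma_recombined * np.exp(tau * rng.standard_normal(lam))

def recombine_sigma(sigma_selected, weights):
    # sigma^(g+1) = sum_m w_m * sigma_{m:lambda}
    return float(np.sum(weights * sigma_selected))
```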

Fig. 1. The main steps of a CMSA-ES. Normally, the weights \(w_m\) are set to \(w_m=1/\mu \) for \(m=1,\ldots ,\mu \).

3 Covariance Matrix Estimation: Introducing Shrinkage

Using the population covariance matrix necessitates an appropriate sample size with \(\mu \gg N\) for obtaining a high-quality estimator. If this is not the case, estimate and “true” covariance may not agree well, especially in the case of high-dimensional search spaces. Among others, the eigenstructure may be significantly distorted, see e.g. [20]. However, typical recommendations for population sizing in evolution strategies state to use an offspring population size \(\lambda \) of either \(\lambda =\mathcal {O}(\log (N))\) or \(\lambda =\mathcal {O}(N)\) and to set \(\mu =\lceil c\lambda \rceil \) with \(c\in (0,0.5)\). Thus, either \(\mu /N\rightarrow c\) or even \(\mu /N\rightarrow 0\) for \(N\rightarrow \infty \) holds, which disagrees with the requirement.

As stated above, the estimation of covariance matrices has received a lot of attention in statistics, see e.g. [6, 30, 38], and several techniques have been introduced. This paper focuses on linear shrinkage estimators that can be computed comparatively efficiently. Other classes, such as thresholding operators for sparse covariance matrix estimation, are currently considered in separate analyses [25, 29]. Following [20, 35], linear shrinkage approaches are based on

$$\begin{aligned} \mathbf{S}_{\mathrm {est}}(\rho ) = \rho \mathbf{F}+(1-\rho )\mathbf{C}_\mu \end{aligned}$$
(7)

with \(\mathbf{F}\) the target to correct the estimate provided by the sample covariance \(\mathbf{C}_\mu \). The parameter \(\rho \in (0,1)\) is called the shrinkage intensity. Equation (7) is used to shrink the eigenvalues of \(\mathbf{C}_\mu \) towards the eigenvalues of \(\mathbf{F}\). The intensity \(\rho \) should be chosen to minimize the expected error

$$\begin{aligned} \mathrm {E}\Big ( \Vert \mathbf{S}_{\mathrm {est}}(\rho )-\varSigma \Vert _F^2\Big ) \end{aligned}$$
(8)

with \(\Vert \cdot \Vert _F^2\) denoting the squared Frobenius norm given by

$$\begin{aligned} \Vert \mathbf{A}\Vert _F^2 = \frac{1}{N}\mathrm {Tr}\Big [\mathbf{A}\mathbf{A}^{\mathrm {T}}\Big ], \end{aligned}$$
(9)

see [20]. Note that the factor \(1/N\) is additionally introduced in [20] to normalize the norm w.r.t. the dimension.

Based on (8) and taking into account that the true covariance is unknown in practice, Ledoit and Wolf were able to obtain an optimal shrinkage intensity for the target \(\mathbf{F}=\mathrm {Tr}(\mathbf{C}_\mu )/N\, \mathbf{I}\) for general probability distributions.
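
As an illustration, Eq. (7) with the scaled-identity target can be written as the following sketch; the data-driven intensity of [20] is not reproduced here, so \(\rho \) is assumed to be supplied by the caller.

```python
import numpy as np

def linear_shrinkage_identity(C_mu, rho):
    # Shrinkage towards the target F = Tr(C_mu)/N * I, Eq. (7)
    N = C_mu.shape[0]
    F = np.trace(C_mu) / N * np.eye(N)
    return rho * F + (1.0 - rho) * C_mu
```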

Several other approaches can be identified in the literature. On the one hand, different targets can be considered, see e.g. [11, 18, 34, 39]. Schäfer and Strimmer analyzed, among others, diagonal matrices with equal and unequal variance as well as special correlation models [34]. Fisher and Sun also allowed for several targets [11], assuming a multivariate normal distribution. Touloumis relaxed the normality assumption, considered several targets, and provided a new non-parametric family of shrinkage estimators [39]. Other authors introduced different estimators, see e.g. [7] or [6]. Recently, Ledoit and Wolf extended their work to include non-linear shrinkage estimators [21, 22].

A problem arises concerning the complexity of the approaches. The associated optimization problem has to be solved, which may be a task of its own, especially in the case of non-linear estimates. Since the estimation has to be performed in every generation of the ES, only computationally simple approaches can be taken into account. Therefore, the paper focuses on linear shrinkage with shrinkage targets and intensities taken from [11, 19, 20, 39].

Furthermore, transferring shrinkage estimators to ESs needs to consider the situation in which the estimation occurs since it differs from the assumptions in statistical literature:

  • First, the covariance matrix \(\varSigma =\mathbf{C}^{(g)}\) that was used to create the offspring is known.

  • Second, the sample is based on truncation selection. Therefore, the variables cannot be assumed to be independent and identically distributed (iid). However, this is one of the main assumptions for deriving alternative estimators for the covariance. In the case of evolution strategies, the sample \(\mathbf{x}_{1:\lambda },\ldots ,\mathbf{x}_{\mu :\lambda }\) would only represent normally distributed random variables if there were no selection pressure.

In this context, it is interesting to note that in the discussion of the setting of the CMA-ES control parameters [12] it is argued to choose the values so that the distribution of the random variables would remain the same if there were no selection effects. This paper uses a similar argument to justify the usage of the shrinkage intensities obtained under the assumption of iid or even normally distributed random variables. Since we are aware of the fact that the situation may differ considerably from the prerequisites in the statistical literature, other settings are also taken into account.

Before continuing, a closer look at the covariance matrix update (4) may be interesting. Equation (4) of the ES algorithm represents a special case of shrinkage with the old covariance matrix as the target. The shrinkage intensity is determined with \(c_{\tau }=1+[N(N+1)]/[2\mu ]\), Eq. (5), as \(\rho =1-1/c_\tau \). If \(\mu \gg N\), \(\rho \approx 0\) holds and the sample covariance is only slightly corrected. Otherwise, if \(N\gg \mu \), the old covariance gains importance. To illustrate the main effects of (4), the eigenspace of \(\mathbf{C}^{(g)}\) is considered. Since the covariance matrix is positive definite, a spectral decomposition of \(\mathbf{C}^{(g)}\) with \(\mathbf{C}^{(g)}=\mathbf{M}^{\scriptscriptstyle \mathrm {T}}\varLambda \mathbf{M}\) can be conducted. The modal matrix \(\mathbf{M}=(\mathbf{v}_1,\ldots ,\mathbf{v}_N)\) contains the eigenvectors \(\mathbf{v}_1,\ldots ,\mathbf{v}_N\) of \(\mathbf{C}^{(g)}\), whereas \(\varLambda =\mathrm {diag}(\lambda _1,\ldots ,\lambda _N)\) represents a diagonal matrix with the corresponding eigenvalues \(\lambda _1,\ldots ,\lambda _N\). The representation \(\mathbf{C}_C^{(g+1)}\) of \(\mathbf{C}^{(g+1)}\) in the eigenspace of \(\mathbf{C}^{(g)}\) then reads

$$\begin{aligned} \mathbf{C}_C^{(g+1)}&= \rho \varLambda +(1-\rho )\mathbf{C}_\mu ^C \nonumber \\&= \mathrm {diag}(\mathbf{C}_\mu ^C)+\rho \Big (\varLambda -\mathrm {diag}(\mathbf{C}_\mu ^C)\Big )+(1-\rho ) \Big (\mathbf{L}_\mu ^C+\mathbf{U}_\mu ^C\Big ) \end{aligned}$$
(10)

with \(\mathbf{C}_\mu ^C=\mathbf{M}^{\scriptscriptstyle \mathrm {T}}\mathbf{C}_\mu \mathbf{M}\). The matrix \(\mathbf{L}^C_\mu \) denotes the matrix with the entries of \(\mathbf{C}_\mu ^C\) below the diagonal, whereas \(\mathbf{U}^C_\mu \) comprises the elements above. As Eq. (10) shows, the covariance matrix update (4) decreases the off-diagonal elements of the population covariance in the eigenspace. In the case of the diagonal entries, two cases may appear: if \(c^C_{\mu _{ii}}<\lambda _i\), the new entry lies in the interval \([c^C_{\mu _{ii}},\lambda _i]\) and thus the estimate is increased towards \(\lambda _i\); otherwise, it is shrunk towards \(\lambda _i\). Thus, in the eigenspace of \(\mathbf{C}^{(g)}\), Eq. (10) behaves similarly to shrinkage with a diagonal matrix as target matrix. Therefore, in the original space, it shrinks the eigenvalues of the population matrix towards those of the target. In contrast to shrinkage, the target matrix is the old covariance (which is not obtainable in the general case). If the sample were drawn from independent and identically distributed variables, the update would “correct” the distortion due to the small sample size with the previous – and in that case also the true – covariance. Considering that a shrinkage procedure is already present in the original CMSA-ES, the question naturally arises whether the strategy may benefit from additional corrections of the sample covariance.
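
The following short numerical check illustrates this interpretation: the update (4), rotated into the eigenspace of \(\mathbf{C}^{(g)}\), coincides with shrinkage towards \(\varLambda \) with intensity \(\rho =1-1/c_\tau \). All matrices below are randomly generated for illustration, and the columns of M are taken as the eigenvectors (the usual numpy convention).

```python
import numpy as np

rng = np.random.default_rng(1)
N, mu = 5, 3
A = rng.standard_normal((N, N))
C_old = A @ A.T + N * np.eye(N)            # previous covariance C^(g)
Z = rng.standard_normal((mu, N))
C_mu = (Z.T @ Z) / mu                      # sample covariance C_mu

c_tau = 1.0 + N * (N + 1) / (2.0 * mu)
rho = 1.0 - 1.0 / c_tau

# Update (4) in the original space, then expressed in the eigenspace of C^(g)
eigvals, M = np.linalg.eigh(C_old)
C_new = (1.0 - 1.0 / c_tau) * C_old + (1.0 / c_tau) * C_mu
lhs = M.T @ C_new @ M

# Right-hand side of Eq. (10): shrinkage towards Lambda with intensity rho
C_mu_C = M.T @ C_mu @ M
rhs = rho * np.diag(eigvals) + (1.0 - rho) * C_mu_C

print(np.allclose(lhs, rhs))   # True
```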

Applying shrinkage requires, among others, the choice of an appropriate target. Most approaches consider regular structures such as the scaled unity matrix, diagonal matrices, or matrices with constant correlations. However, shrinkage towards a regular structure w.r.t. the coordinate system of the original search space does not appear to be a good choice when optimizing arbitrary functions.

4 Evolution Strategies and Shrinkage Estimators

Since we cannot assume that the covariance matrix adaptation would profit from correcting the estimate towards regular structures in every application case, shrinkage in the original search space is not taken into account. Instead, appropriate space transformations are investigated. The resulting ES algorithms will follow the same general principle:

  1. A suitable transformation of the coordinate system is conducted.

  2. A shrinkage estimation is performed in the transformed space.

  3. After re-transformation to the original space, the result is used for the covariance matrix update.

As illustrated in the previous section, the original CMSA-update (4) shrinks the sample covariance towards a diagonal matrix in the eigenspace of the previous covariance matrix \(\mathbf{C}^{(g)}\). Therefore, this eigenspace will also be taken into account for the current investigation.

Space transformations have been considered in ESs before. For example, Hansen argued in [14] that changing the coordinate system may improve the performance. For this reason, he introduced an adaptive encoding for the CMA-ES. It is based on a spectral decomposition of the covariance matrix. New search points are created in the eigenspace of the covariance matrix. Similar to [14], we assume that the ES may benefit from a change of the coordinate system. However, the covariance matrix adaptation and estimation, which in [14] occur in the original space, will be carried out in the transformed space.

Furthermore, other spaces may also be beneficial. For example, [37] introduced an additional potential transformation. In [37], the authors were faced with the task of obtaining a reliable covariance matrix. To this end, a sample covariance matrix \(\mathbf{S}_i\) was combined with a pooled variance matrix \(\mathbf{C}_p\) – similar to (4)

$$\begin{aligned} \mathbf{S}_{mix}(\xi )=\xi \mathbf{C}_p+(1-\xi )\mathbf{S}_i \end{aligned}$$
(11)

with the parameter \(\xi \) to be determined. To continue, the authors switched to the eigenspace of the non-weighted mixture matrix where they followed a maximal entropy approach to determine an improved estimate of the covariance matrix. Based on [14, 37], this paper considers the following choices for the transformation matrix which arise as combinations of the population covariance matrix \(\mathbf{C}_\mu \) and \(\mathbf{C}^{(g)}\)

$$\begin{aligned} \mathbf{S}_{mix} = \mathbf{C}^{(g)}+\mathbf{C}_{\mu }, \end{aligned}$$
(12)
$$\begin{aligned} \mathbf{S}_{(g+1)} = (1-c_\tau )\mathbf{C}^{(g)}+c_\tau \mathbf{C}_{\mu }, \end{aligned}$$
(13)
$$\begin{aligned} \mathbf{S}_{(g)} = \mathbf{C}^{(g)}. \end{aligned}$$
(14)

The variants (12)–(14) are based on different assumptions: The first, (12), follows [37]. The influence of the old covariance and the population covariance are balanced. Structural changes caused by \(\mathbf{C}_{\mu }\) will be dampened but will influence the result more strongly than in the case of (13) and (14). The second, (13), uses the covariance mixture that appears in the original CMSA-ES. Depending on the size of \(c_\tau \), which is in turn a function of \(\mu \) and N, see (5), the influence of the population covariance matrix may be stronger or weaker. The third, (14), considers the eigenspace of the old covariance matrix and therefore reduces the influence of the new estimate. Equations (12)–(14) are used to change the coordinate system. The representations of the covariance matrices in the eigenspace are given as \(\mathbf{C}^S_\mu :=\mathbf{M}_S^\mathrm {T}\mathbf{C}_{\mu }\mathbf{M}_S\) and \(\mathbf{C}_S:=\mathbf{M}_S^\mathrm {T}\mathbf{C}^{(g)}\mathbf{M}_S\) with S standing for one of the variants (12)–(14). Assuming that, for example, (14) is used, the following steps are performed

  • spectral decomposition: \(\mathbf{M},\mathbf{D}\leftarrow \mathrm {spectral}(\mathbf{S}_{(g)})\),

  • determination of \(\mathbf{C}^S_\mu :=\mathbf{M}_S^\mathrm {T}\mathbf{C}_{\mu }\mathbf{M}_S\) and \(\mathbf{C}_S:=\mathbf{M}_S^\mathrm {T}\mathbf{C}^{(g)}\mathbf{M}_S\),

  • shrinkage estimation resulting in \(\hat{\mathbf{C}}_{\mathrm {shr}}\),

  • retransformation \(\mathbf{C}_\mu =\mathbf{M}^{\scriptscriptstyle \mathrm {T}}\hat{\mathbf{C}}_{\mathrm {shr}} \mathbf{M}\),

  • covariance adaptation

    $$\begin{aligned} \mathbf{C}^{(g+1)}=(1-\frac{1}{c_\tau })\mathbf{C}^{(g)}+ \frac{1}{c_\tau }\mathbf{C}_\mu . \end{aligned}$$

They substitute Line 13 of the CMSA-ES algorithm in Fig. 1. Once the space transformation has been conducted, different targets can be taken into account. The analysis considers the matrices

$$\begin{aligned} \mathbf{F}_{\mathrm {u}} = v \mathbf{I}, \end{aligned}$$
(15)

with \(v=\mathrm {Tr}(\mathbf{C}^S_\mu )/N\) [20],

$$\begin{aligned} \mathbf{F}_{\mathrm {d}} = \mathrm {diag}(\mathbf{C}^S_\mu ) \end{aligned}$$
(16)

the diagonal entries of \(\mathbf{C}^S_\mu \) [11, 39], and the constant correlation model with matrix \(\mathbf{F}_{\mathrm {c}}\), the entries of which read

$$\begin{aligned} f_{ij} = \left\{ \begin{array}{ll} s_{ii} & \text{ if } i=j\\ \bar{r}\sqrt{s_{ii}s_{jj}} & \text{ if } i\ne j \end{array}\right. \end{aligned}$$
(17)

and \(\bar{r}=2/((N-1)N)\sum _{i=1}^{N-1}\sum _{j=i+1}^N s_{ij}/\sqrt{s_{ii}s_{jj}}\) [19]. The shrinkage intensities are taken from the corresponding publications. For (15) the parameter is based on [20], for (16) it follows [11, 39], while it is taken from [19] in the case of (17). A sketch of the resulting update step is given below.
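
The following Python sketch combines the steps listed above: transformation with (14), shrinkage towards one of the targets (15)–(17), re-transformation, and the update (4). It is a sketch under assumptions: the intensity \(\rho \) is supplied by one of the rules of [11, 19, 20, 39], all names are illustrative, and the columns of M are used as eigenvectors (the numpy convention).

```python
import numpy as np

def shrinkage_cmsa_update(C_old, C_mu, mu, rho, target="diag"):
    N = C_old.shape[0]
    # 1. Spectral decomposition of the transformation matrix S_(g) = C^(g), Eq. (14)
    _, M = np.linalg.eigh(C_old)
    # 2. Representation of the sample covariance in the transformed space
    C_mu_S = M.T @ C_mu @ M
    # Shrinkage target in the transformed space
    if target == "unity":                      # Eq. (15): scaled identity
        F = np.trace(C_mu_S) / N * np.eye(N)
    elif target == "diag":                     # Eq. (16): diagonal entries
        F = np.diag(np.diag(C_mu_S))
    else:                                      # Eq. (17): constant correlation model
        s = np.sqrt(np.diag(C_mu_S))
        R = C_mu_S / np.outer(s, s)
        r_bar = (np.sum(R) - N) / (N * (N - 1))   # mean off-diagonal correlation
        F = r_bar * np.outer(s, s)
        np.fill_diagonal(F, np.diag(C_mu_S))
    # 3. Shrinkage estimation in the transformed space, Eq. (7)
    C_shr = rho * F + (1.0 - rho) * C_mu_S
    # 4. Re-transformation to the original space
    C_mu_corr = M @ C_shr @ M.T
    # 5. Covariance adaptation as in Eq. (4)
    c_tau = 1.0 + N * (N + 1) / (2.0 * mu)
    return (1.0 - 1.0 / c_tau) * C_old + (1.0 / c_tau) * C_mu_corr
```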

Since the new ES uses an explicit shrinkage, the question arises whether the additional term consisting of the old covariance matrix in (4) remains necessary or whether the ES may operate solely with shrinkage. Preliminary investigations indicate that the latter strategies perform worse than the original CMSA-ES. Therefore, the current analysis only considers a combination of both. It should also be noted that similar to the original update, the previous covariance could be used to determine the target instead of the sample. Both cases will be investigated more closely in future research.

5 Comparing Shrinkage Approaches: An Experimental Analysis

Experiments were conducted to investigate the shrinkage estimators introduced above. First, the question of finding a suitable transformation was addressed. To this end, a comparison of the effects of (12)–(14) was carried out for a combination of (16) and the shrinkage intensity taken from [39]. Preliminary experiments showed that (14) led to good results. Therefore, the remaining discussion in this paper is restricted to ESs using a transformation with the old covariance matrix. However, the increased variability provided by (12) and (13) should be considered together with (17) or (15) in further experiments.

The analysis considers ES-algorithms which apply shrinkage estimators as defined in (15) to (17). Aside from the CMSA-ES, we denote the strategies as follows

  1. CI-ES: a CMSA-ES using (15) as shrinkage target,

  2. CC-ES: the CMSA-ES with the constant correlation model (17),

  3. FS-ES: the CMSA-ES which uses (16) and follows [11] to determine the shrinkage intensity,

  4. Tou-ES: a CMSA-ES based on (16) which uses the shrinkage intensity of [39].

The approaches were coded in MATLAB. In the case of the CI-ES and the CC-ES, we used the estimation source code provided by the authors on their webpage. The implementation of the Tou-ES closely follows the corresponding R package. In the analysis presented here, we did not use the oracle shrinkage intensity for the FS-ES as in [26] but applied the optimal estimated value, Eq. (11) in [11, p. 1913]. Therefore, the results may differ from [26]. This ES also operates with a maximal number of fitness evaluations of \(\mathrm {FE}_{\max }=3\times 10^{5}N\).

5.1 Experimental Set-Up

The parameters for the experiments read as follows. Each experiment uses 15 repeats. The initial population is drawn uniformly from \([-4,4]^N\), whereas the mutation strength is chosen from [0.25, 1]. The search space dimensions were set to 10 and 20. The maximal number of fitness evaluations is given by \(\mathrm {FE}_{\max }=2\times 10^{5}N\). All evolution strategies use \(\lambda =\lfloor \log (3N)+8\rfloor \) offspring and \(\mu =\lceil \lambda /4\rceil \) parents. A run terminates prematurely if the difference between the best value so far and the optimal fitness value, \(|f_{\mathrm {best}}-f_{\mathrm {opt}}|\), is below a predefined precision set to \(10^{-8}\). Furthermore, we introduce a restart mechanism into the ESs so that the search is re-initialized when it has stagnated for \(10+\lceil 30N/\lambda \rceil \) generations. Stagnation is determined by measuring the best function values in a generation. If the difference between the minimal and maximal values of this sample lies below \(10^{-8}\) for the given time interval, the ES no longer makes significant movements and the search is started anew.
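
For clarity, the population sizing and the restart criterion translate into the following short sketch (assuming the natural logarithm in the formula for \(\lambda \)).

```python
import numpy as np

N = 10                                                 # search space dimension
lam = int(np.floor(np.log(3 * N) + 8))                 # offspring population size lambda
mu = int(np.ceil(lam / 4))                             # parental population size
max_evals = 2 * 10**5 * N                              # maximal number of fitness evaluations
stagnation_limit = 10 + int(np.ceil(30 * N / lam))     # generations without progress before a restart
```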

The experiments are conducted with the help of the black-box optimization benchmarking (BBOB) software framework and the test suite, see [13]. The framework allows the analysis of algorithms and provides means to generate tables and figures of the results together with the corresponding legends.

This paper considers 24 noiseless functions [9]. They consist of four subgroups: separable functions (function ids 1–5), functions with low/moderate conditioning (ids 6–9), functions with high conditioning (ids 10–14), and two groups of multimodal functions (ids 15–24).

The experiments use the expected running time (ERT) as performance measure. The ERT is defined as the expected value of the function evaluations (f-evaluations) the algorithm needs to reach the target value with the required precision for the first time, see [13]. In this paper, the estimate

$$\begin{aligned} \mathrm {ERT}=\frac{\#(FEs(f_{\mathrm {best}}\ge f_{\mathrm {target}}))}{\#succ} \end{aligned}$$
(18)

is used, that is, the fitness evaluations \(FEs(f_{\mathrm {best}}\ge f_{\mathrm {target}})\) of each run – counted until the fitness of the best individual falls below the target value – are summed up and divided by the number of successful runs.
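
In terms of code, the estimate (18) can be computed as in the following sketch (illustrative names).

```python
import numpy as np

def expected_running_time(evals_per_run, success_flags):
    # Eq. (18): evaluations spent in each run before the target was hit (or the
    # run ended), summed over all runs and divided by the number of successes.
    n_succ = int(np.sum(success_flags))
    return np.inf if n_succ == 0 else float(np.sum(evals_per_run)) / n_succ

# Example: three runs, two of which reached the target
print(expected_running_time([1200, 5000, 900], [True, False, True]))   # 3550.0
```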

Fig. 2. The development of the ratio of the largest to the smallest eigenvalue of the covariance on the sphere for the CMSA-ES, the CC-ES, and the FS-ES. Shown are the results from 15 runs per dimensionality.

Fig. 3. The development of the ratio of the largest to the smallest eigenvalue of the covariance on the discus. Shown are the results from 15 runs per dimensionality.

5.2 Results and Discussion

First of all, the behavior of the strategies is investigated for two exemplary functions, the sphere, \(f(\mathbf{x})=\Vert \mathbf{x}\Vert ^2\), and the discus, \(f(\mathbf{x})=10^6x_1^2+\sum _{i=2}^N x_i^2\). The functions were selected since they represent very different optimization tasks. Figures 2 and 3 show the ratio of the largest to the smallest eigenvalue of the covariance matrix for the CMSA-ES and for two of the shrinkage approaches, the CC-ES and the FS-ES. An ES optimizing the sphere should keep the ratio close to one, whereas the ratio should increase for the discus. In the case of the sphere, the figures illustrate that the largest and smallest eigenvalues develop differently and diverge for \(N=10\). Shrinkage causes this problem to be less pronounced.

In the case of the discus, eigenvalues of different magnitude are expected. All strategies achieve this; the ESs with shrinkage operators again show a lower rate of increase. This may be a hint that the adaptation process of the covariance matrix is decelerated by the additional shrinkage. Figure 3 shows an even slower increase of the ratio for the FS-ES than for the CC-ES. Whether this lowers the performance is investigated in the second series of experiments. Figure 4 shows the empirical cumulative distribution functions for \(N=10\) and \(N=20\). Whether introducing shrinkage terms improves the performance of the ES depends on the function class. In the case of ill-conditioned functions, shrinkage targets consisting of the diagonal elements of the transformed sample covariance may offer benefits.

Fig. 4. Bootstrapped empirical cumulative distribution of the number of objective function evaluations divided by dimension (FEvals/DIM) for 50 targets in \(10^{[-8..2]}\) for 10-D and 20-D.

The results of the experiments can be examined more closely with the help of Table 1 (\(N=10\)) and Table 2 (\(N=20\)). They provide the estimate of the expected running time (ERT) for several precision targets ranging from \(10^1\) to \(10^{-7}\). Also shown is the number of successful runs.

Several functions represent challenges for the ESs considered. These comprise the Rastrigin functions (ids 3, 4, 15, and 24) of the test suite, the step ellipsoidal function with a condition number of 100 (id 7), and a multi-modal function with a weak global structure based on Schwefel (id 20). In these cases, all strategies are unable to progress further than the first intermediate precision of \(10^1\). Additionally, for the multi-modal functions 19 (composite Griewank-Rosenbrock function) and 23 (Katsuura), no strategy could reach \(10^{-1}\). The functions in question are therefore removed from the tables.

To analyze the remaining functions, the four groups of the test suite are taken into account. The first class comprises the separable functions with ids 1 to 5. The three remaining functions, the sphere (f1), the separable ellipsoidal function (f2), and the linear slope (f5), differ in the degree of difficulty they pose for the strategies. None of the strategies shows any problems on the sphere or on the slope. Here, several shrinkage variants perform similarly to or sometimes even surpass the original version. However, the differences are not statistically significant.

The more difficult ellipsoidal function, however, cannot be solved by the CC-ES, the CI-ES, and, for \(N=20\), the Tou-ES. Interestingly, the FS-ES achieves successful runs for all search space dimensionalities. Its performance is significantly worse than that of the original CMSA-ES, however. Two effects may play a role. The ellipsoidal function, defined by \(f(\mathbf{x})=\sum _{i=1}^N 10^{6(i-1)/(N-1)} x_i^2\), is not solved well by ESs with covariance matrices treating all directions with the same weight. The matrix used in the transformation may not be sufficient to provide the variability required when combined with restrictive target structures. The target matrices supplied by the CI-ES may be too regular for the ES to be able to adapt with the necessary velocity. Why the CC-ES also exhibits problems, although it represents a more general model, merits further investigation. The failure of the Tou-ES to achieve the final precision target may be due to the shrinkage intensity since this is the only point where it differs from the FS-ES. For this reason, Sect. 6.2 provides a first analysis of the impact of this parameter on the performance of the ES. In future research, the interaction with the parameter \(c_\tau \) will be investigated more closely. Since this parameter approaches infinity for the typical \(\mu /N\) ratios and increasing dimensionalities, the influence of the sample covariance lessens. Regularizing the covariance matrix may therefore be more important for small to medium search space dimensionalities. Concerning the question whether shrinkage improves the performance, no clear answer can be provided for the group of separable functions since the ellipsoidal function apparently requires a faster adaptation than the current versions supply.

The second group of functions consists of the attractive sector function (id 6), the step ellipsoidal function (id 7, results not shown), the original Rosenbrock function (id 8), and a rotated Rosenbrock function (id 9). These functions have low to moderate conditioning. The sector function is difficult to solve for all strategies. For \(N=20\), successful runs were recorded for the CMSA-ES and the CI-ES, but only in a few cases, i.e., two or three. Therefore, the question arises whether initialization effects may have played a role, in other words, whether the ES in these cases started in the vicinity of the optimal point. For \(N=10\) and \(N=20\), the FS-ES reaches the final target precision of \(10^{-8}\) in 15 of 15 runs, a result not mirrored by the other strategies. Concerning the Rosenbrock functions (f8 and f9), the CMSA-ES and the FS-ES perform best, with the CMSA-ES being superior.

Functions with high conditioning constitute the next group. For the ellipsoidal function (f10), the discus (f11), the bent cigar (f12), the sharp ridge (f13), and the different powers function (f14), the CMSA-ES appears to be the best performing strategy, followed by the FS-ES and the Tou-ES. More experiments are required since the bent cigar apparently represents a stronger challenge for the FS-ES for \(N=10\) than for \(N=20\).

In the case of the multi-modal functions (ids 15–24), all ESs encounter problems. Only for the two Gallagher problems (ids 21 and 22) are successes recorded. Here, the CMSA-ES and the versions that use the diagonal elements of the sample covariance as shrinkage target show the best results. The FS-ES appears to be a good choice for \(N=10\) on function 21, whereas the CMSA-ES and the Tou-ES require fewer function evaluations for function 22. For \(N=20\), only a few runs of all ESs reach the final target precision. Therefore, a comparison for the higher-dimensional search spaces is difficult and is not carried out in this paper.

Table 1. Expected running time (ERT in number of function evaluations) divided by the respective best ERT measured during BBOB 2009 in dimension 10. The ERT and, in braces as dispersion measure, the half difference between the 90 and 10%-tile of bootstrapped run lengths appear for each algorithm and target; the corresponding best ERT is given in the first row. The different target \(\varDelta {f}\)-values are shown in the top row. #succ is the number of trials that reached the (final) target \(f_{\text {opt}}+10^{-8}\). The median number of conducted function evaluations is additionally given in italics if the target in the last column was never reached. Entries succeeded by a star are statistically significantly better (according to the rank-sum test) when compared to all other algorithms of the table, with p = 0.05 or p = \(10^{-k}\) when the number k following the star is larger than 1, with Bonferroni correction by the number of instances.
Table 2. Expected running time (ERT in number of function evaluations) divided by the respective best ERT measured during BBOB 2009 in dimension 20. The ERT and, in braces as dispersion measure, the half difference between the 90 and 10%-tile of bootstrapped run lengths appear for each algorithm and target; the corresponding best ERT is given in the first row. The different target \(\varDelta {f}\)-values are shown in the top row. #succ is the number of trials that reached the (final) target \(f_{\text {opt}}+10^{-8}\). The median number of conducted function evaluations is additionally given in italics if the target in the last column was never reached. Entries succeeded by a star are statistically significantly better (according to the rank-sum test) when compared to all other algorithms of the table, with p = 0.05 or p = \(10^{-k}\) when the number k following the star is larger than 1, with Bonferroni correction by the number of instances.

6 Further Experimental Analyses

The experiments revealed that, at least in the eigenspace of the previous covariance matrix and for the intensities considered, the ES does not benefit at all from shrinkage towards structures more complex than a diagonal matrix. This is surprising, since the model with constant correlations would allow more degrees of freedom for the evolutionary process. Therefore, further investigations – concerning especially the choice of the shrinkage intensity – will be carried out. In addition, ESs with the scaled unity matrix as target do not perform as well as strategies with different diagonal elements. This can possibly be traced back to the fact that this target is the most restrictive one and impairs the general adaptability of the ES.

Therefore, the further discussion in this paper is limited to the shrinkage estimators that use the diagonal entries of the sample covariance as correction term. The two variants taken into account, the Fisher-Sun estimator (FS-ES) and the Touloumis technique (Tou-ES), operate with the same target, but the performance of the associated evolution strategies differs. Since the only difference between these two estimators lies in the shrinkage intensity, this section investigates the dependency of the performance of the ES on the choice of this factor.

So far, the experiments were restricted to the noise-free case assuming the possibility of exact function evaluations. The effects of shrinkage estimators concerning noisy optimization remain to be taken into account. Since this represents an important research and application area of evolution strategies, exemplary experiments are carried out.

6.1 A Brief Investigation Concerning Noisy Optimization

This subsection addresses the question whether shrinkage estimators may be useful in the case of noisy optimization. Noise, or more generally uncertainty, is a common and important problem in practical optimization. Following [4, 16], uncertainty comprises noise, robustness, dynamical changes, and function approximations. Many causes exist: physical measurements may be necessary during the optimization, which are usually imprecise to a certain degree, or situational changes may occur – for instance, when trying to find the fastest route in a traffic network. For a closer explanation, let us reconsider the function \(f:\mathbb {R}^N\rightarrow \mathbb {R}\) that is to be optimized. Noise means that the function evaluations are not exact and that disturbances, i.e., measurement errors or similar, need to be taken into account. These can be modelled by a random variable \(\epsilon \). Instead of the exact f-value at \(\mathbf x\), only the noisy \(\tilde{f}(\mathbf{x})=g(f,\mathbf{x},{\epsilon })\) can be observed. Following the test suite [10], the multiplicative noise model

$$\begin{aligned} \tilde{f}(\mathbf{x})=f(\mathbf{x})\epsilon \end{aligned}$$
(19)

is considered in this paper. In the experiments conducted for the sphere and the Rosenbrock function, the noise term \(\epsilon \) is represented by a log-normally distributed random variable \(\exp (\beta \mathcal {N}(0,1))\) [10]. Two variants are taken into account: moderate noise with \(\beta =0.01\) (id 101 (sphere), id 104 (Rosenbrock)) and severe noise with \(\beta =1\) (id 107 (sphere) and id 110 (Rosenbrock)).
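
A minimal sketch of the noisy evaluation used here reads as follows.

```python
import numpy as np

def noisy_eval(f, x, beta, rng):
    # Multiplicative log-normal noise, Eq. (19) with epsilon = exp(beta * N(0,1));
    # beta = 0.01 corresponds to the moderate and beta = 1 to the severe noise case.
    eps = np.exp(beta * rng.standard_normal())
    return f(x) * eps
```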

For the comparison, the FS-ES is chosen as the representative of the ESs with additional shrinkage. Table 3 shows a promising finding: while both strategies behave comparably on the sphere if the noise does not have a strong effect, the FS-ES appears superior if the fitness evaluations are severely disturbed. Here, the additional shrinkage may serve to stabilize the covariance matrix update and may thus improve the performance of evolution strategies in the case of noisy optimization. A similar behavior can be observed for the Rosenbrock function. Here, the FS-ES performs better even for low noise levels.

Table 3. ERT and half-interquantile range (90%–10%) divided by the best \(\mathrm {ERT}\) measured during BBOB 2009 for different \(\varDelta f\) values in 10-D.

6.2 The Role of the Shrinkage Intensity: A Closer Look at the Fisher-Sun Estimate

The preceding sections showed that the shrinkage intensity may have an influence on the performance of the ES, considering that the results of the FS-ES and the Tou-ES differ. Therefore, this section takes a closer look at the Fisher-Sun estimator and analyzes different settings of the intensity. We start from the original equation in [11, p. 1913]

$$\begin{aligned} \hat{\lambda }=\frac{\hat{\beta }^2_D+\hat{\gamma }^2_D}{\hat{\delta }^2_D}. \end{aligned}$$
(20)

The parameters in (20) are obtained as follows: Let \(\hat{a}_1=\mathrm {Tr}(\mathbf{C}^S_\mu )/N\) and

$$\begin{aligned} \hat{a}_2=\frac{\mu ^2}{N(\mu -1)(\mu +2)}\left( \mathrm {Tr}({\mathbf{C}_\mu ^S}^2)-\frac{1}{\mu }(\mathrm {Tr}(\mathbf{C}_\mu ^S))^2\right) .\end{aligned}$$
(21)

Denote the matrix with the diagonal elements of \(\mathbf{C}_\mu ^S\) as \(\mathbf{D}_C\). Then with

$$\begin{aligned} \hat{a}^*_1&=\mathrm {Tr}(\mathbf{D}_C)/N \text{ and } \end{aligned}$$
(22)
$$\begin{aligned} \hat{a}^*_2&=\frac{\mu }{N(\mu +2)}\mathrm {Tr}(\mathbf{D}_C^2) \end{aligned}$$
(23)

the parameters read

$$\begin{aligned} \hat{\beta }_D^2&=\frac{1}{\mu }\left( \hat{a}_2+N\hat{a}_1^2\right) ,\end{aligned}$$
(24)
$$\begin{aligned} \hat{\gamma }_D^2&=-\frac{2}{\mu }\hat{a}^*_2,\end{aligned}$$
(25)
$$\begin{aligned} \hat{\delta }^2_D&=\frac{\mu +1}{\mu }\hat{a}_2+\frac{N}{\mu }\hat{a}^2_1-\frac{\mu +2}{\mu }\hat{a}^*_2. \end{aligned}$$
(26)

Starting from (20), the shrinkage intensity \(\rho =c\hat{\lambda }\) is varied with factors \(c=0.01\) (\(---\)), 0.1 (\(--\)), 0.5 (−), 1.5 (+), 2.0 (++), and \(c=10\) (+++). In the case that the resulting parameter exceeds one, it is reset to \(\rho =1\). The investigation considered two separable functions with ids 1 (sphere) and 2 (ellipsoidal), the two Rosenbrock functions (ids 8, 9) as representatives of lowly and moderately conditioned functions, and, aside from the sharp ridge (id 13), the set of functions with high conditioning (ids 10–14).
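
The estimate (20) together with the scaling factor c translates into the following sketch (illustrative names; C denotes the transformed sample covariance \(\mathbf{C}_\mu ^S\); Eq. (22) is not needed for (24)–(26) and is therefore omitted).

```python
import numpy as np

def fisher_sun_intensity(C, mu, c=1.0):
    N = C.shape[0]
    D = np.diag(np.diag(C))                       # diagonal part of C
    a1 = np.trace(C) / N                          # a_1
    a2 = mu**2 / (N * (mu - 1) * (mu + 2)) * (np.trace(C @ C) - np.trace(C)**2 / mu)  # Eq. (21)
    a2_star = mu / (N * (mu + 2)) * np.trace(D @ D)                                   # Eq. (23)
    beta2 = (a2 + N * a1**2) / mu                                                     # Eq. (24)
    gamma2 = -2.0 * a2_star / mu                                                      # Eq. (25)
    delta2 = ((mu + 1) * a2 + N * a1**2 - (mu + 2) * a2_star) / mu                    # Eq. (26)
    lam_hat = (beta2 + gamma2) / delta2                                               # Eq. (20)
    return min(c * lam_hat, 1.0)                  # rho = c * lambda_hat, capped at 1
```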

As can be inferred from Table 4, the effects of the scaling factor differ with the function that is to be optimized. For nearly all functions considered, a decrease – and therefore a smaller influence of the diagonal shrinkage target – appears beneficial. Comparing the results with those in Table 1 shows that the performance depends very strongly on the size of the factor and can be improved decisively. However, it appears difficult to obtain a general recommendation. In some cases, e.g. the discus (id 11), a slight decrease may be preferable, whereas the ES in the case of the original Rosenbrock function (id 8) apparently operates better with only a small correction by the target. For the set of functions considered, a decrease by half or by the factor 0.1 appears to improve the results. However, further investigations – especially into potential adaptation mechanisms for the shrinkage intensity – appear necessary. This is underlined by the findings in the case of the ellipsoidal function with id 2, which differ strongly from the rest of the functions. Here, an increase of the intensity improves the performance, albeit not significantly. This may be a hint that a strong influence of the target is preferable. As can be seen, the “optimal” shrinkage intensity from the literature may not represent the best choice for the purposes of optimization. This was to be expected since the situation differs from the iid or even normally distributed case considered in the respective papers.

Table 4. Expected running time (ERT in number of function evaluations) divided by the respective best ERT measured during BBOB 2009 in dimension 10. The ERT and, in braces as dispersion measure, the half difference between the 90 and 10%-tile of bootstrapped run lengths appear for each algorithm and target; the corresponding best ERT is given in the first row. The different target \(\varDelta {f}\)-values are shown in the top row. #succ is the number of trials that reached the (final) target \(f_{\text {opt}}+10^{-8}\). The median number of conducted function evaluations is additionally given in italics if the target in the last column was never reached. Entries succeeded by a star are statistically significantly better (according to the rank-sum test) when compared to all other algorithms of the table, with p = 0.05 or p = \(10^{-k}\) when the number k following the star is larger than 1, with Bonferroni correction by the number of instances.

7 Conclusions

Evolution strategies are population-based evolutionary algorithms for continuous optimization. They use a multivariate normal distribution to generate new search points. The parameters of this distribution must be adapted in order to achieve a well-performing optimization method. For this reason, step-size and covariance matrix adaptation techniques are important topics in research on evolution strategies. This paper considered the covariance matrix adaptation. Current techniques use the sample covariance, an estimator which may be of poor quality if the sample size is small. Typically, the estimate is corrected with the help of additional terms. The resulting effect is remarkably similar to shrinkage estimation, a method stemming from statistics, where shrinkage operators have been introduced in order to improve the quality of the sample covariance by correcting the estimate with the help of a target matrix.

The realization that evolution strategies already perform an implicit shrinkage leads to the research question of the present paper: would an additional inclusion of shrinkage operators improve the performance of evolution strategies? Applying explicit shrinkage in evolution strategies requires several new tasks to be solved: the choice of the target and of the shrinkage intensity, the combination weight of target and sample covariance, is crucial. Since an evolution strategy is used to optimize arbitrary functions with various structures, the approach must remain sufficiently adaptable. To achieve this, we considered a transformation of the original search space in which to conduct the shrinkage.

The experimental analysis took several shrinkage targets into account, using the intensity settings of the original publications. Pending further experiments that shall provide more information regarding the shrinkage intensity, which may have interfered with the findings, shrinkage targets in the transformed space that use a diagonal matrix consisting of the diagonal entries of the transformed sample covariance appear to be the best choices. While introducing additional shrinkage does not always improve the performance in the noise-less case and at times even impairs it, it may offer a means to lessen the impact of noise.

As the experiments showed, the choice of the shrinkage intensity may have a strong influence on the performance. Since the original covariance matrix adaptation performs a further type of shrinkage which lessens the influence of the sample covariance when the search space dimensionality increases, future research will focus on the shrinkage intensity and its interaction with the covariance matrix adaptation.