
1 Introduction

Evolution strategies (ESs) are a variant of evolutionary algorithms often used for continuous black-box optimization. They differ from many other evolutionary algorithms in the role of mutation: while mutation is only a background operator in genetic algorithms, it represents the main search operator here. Evolution strategies operate with a multivariate normal distribution which is used to generate a population of new search points. Its parameters, the mean and the covariance matrix, must be updated during a run in order for the strategy to reach the vicinity of optimal points quickly and reliably. The adaptation of the parameters takes the search history and the present population into account. Due to its importance, research has focused and still focuses on the covariance matrix. The main techniques introduced are based on the sample covariance matrix [1]. The use of this estimator, however, carries a potential weakness: evolution strategies typically operate with small population or sample sizes, and the size of the population usually does not exceed the search space dimensionality. Estimating an \(N\times N\) covariance matrix with a sample size of \(\mu <N\) or \(\mu \approx N\) leads to unreliable estimates. All adaptation techniques introduced so far therefore consider correction terms. However, the question remains whether an ES may benefit from estimation techniques developed for and tailored to this task.

Literature on combining evolutionary algorithms or related approaches with statistical estimation methods for high-dimensional covariance matrices is scarce. So far, we were only able to identify two approaches aside from our own research: In the first [6], the authors investigated estimation of distribution algorithms (EDAs) for continuous spaces. The EDA applied a Gaussian search distribution similar to evolution strategies. The estimation of the covariance matrix, however, resulted in matrices that were not positive definite. To circumvent the problem, a shrinkage procedure was introduced, see e.g. [13]. Very recently, a shrinkage estimator was integrated into an evolution strategy variant with a single search point [12].

The research presented here is part of an ongoing investigation into alternative estimation techniques for high-dimensional covariances [15, 16]. In [15, 16], Ledoit-Wolf shrinkage estimators were analyzed. While the results were promising, finding the appropriate shrinkage intensity represented a challenge. Therefore, another computationally simple estimation method was introduced in [14]: thresholding. Here, the work begun in [14] is continued by addressing two of the remaining open problems: the first concerns the choice of the thresholding function, the second the influence of an important parameter of the thresholding.

The paper is structured as follows. First, the evolution strategy variant considered in this paper is introduced. Afterwards, we argue why high-dimensional estimation techniques might improve the performance. The next section introduces the sparse covariance estimation evolution strategy developed. An experimental analysis of the approaches follows, before the paper closes with an outlook on potential future research.

2 Evolution Strategies

Evolution strategies (ESs) [18, 19] are used for continuous black-box optimization \(f:\mathbb {R}^N\rightarrow \mathbb {R}\). Several variants have been introduced (see e.g. [1, 3]). In many cases, a population of \(\mu \) parents is used to create a set of \(\lambda \) offspring, with \(\mu \le \lambda \). Like all evolutionary algorithms, evolution strategies operate in a sequence of generations. In each generation, the same cycle of processes is carried out: in general, parent selection, recombination, mutation, and survivor selection. In the following, the processes are described for the ES variant considered here. All \(\mu \) parents contribute to creating the offspring. First, recombination is performed, that is, the centroid of the parents is computed [3]. All offspring therefore share the same origin and differ only in their mutation vector, a normally distributed random variable with zero mean and covariance matrix \(\sigma ^2\mathbf{C}\) which is added to the mean. After the \(\lambda \) offspring \(\mathbf{y}_1,\ldots ,\mathbf{y}_\lambda \) have been created, the individuals are evaluated. In most cases, the function to be optimized is used directly; it is then also called the fitness. Selection in evolution strategies often takes only the \(\lambda \) offspring into account, of which the \(\mu \) best \(\mathbf{y}_{1:\lambda },\ldots ,\mathbf{y}_{\mu :\lambda }\) are chosen.
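To make the cycle concrete, the following minimal sketch (Python/NumPy, not taken from the original MATLAB implementation; the fitness function f and the strategy parameters are assumed to be given) performs one generation of the variant described above. The adaptation of \(\sigma \) and \(\mathbf{C}\) is discussed in the following sections.

```python
import numpy as np

def es_generation(parents, f, sigma, C, lam):
    """One generation of the ES variant sketched above (minimization).

    parents: (mu, N) array of parental search points
    f:       fitness function (assumed, not part of the paper)
    sigma:   mutation strength
    C:       covariance matrix of the mutation, shape (N, N)
    lam:     number of offspring (lambda)
    """
    mu, N = parents.shape
    m = parents.mean(axis=0)                         # recombination: centroid of all mu parents
    # mutation: add N(0, sigma^2 C) distributed vectors to the common centroid
    mutations = np.random.multivariate_normal(np.zeros(N), sigma**2 * C, size=lam)
    offspring = m + mutations
    fitness = np.array([f(y) for y in offspring])    # evaluation
    best = np.argsort(fitness)[:mu]                  # survivor selection: the mu best of lambda
    return offspring[best]
```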

The most important factor concerning the mutation is the covariance matrix. It must be adapted during the run and fitted to the landscape. Otherwise, the performance may be low. Therefore, research on controlling the mutation has a long tradition in ESs. First approaches were already considered in [18]. The next section describes the variant considered in this paper.

2.1 Covariance Matrix Adaptation: The Population Covariance

To our knowledge, covariance matrix adaptation comprises two main classes: one applied in the covariance matrix adaptation evolution strategy (CMA-ES) [11] and an alternative used in the covariance matrix self-adaptation evolution strategy (CMSA-ES) [4]. Both are based on a variant of the sample covariance, correcting the estimate with information from the search history. The present paper focuses on the CMSA-ES, leaving the CMA-ES for future research. One of the reasons is that the CMSA-ES includes only one additional correction term, making it easier to assess the effects of the thresholding operator. The CMSA-ES considers the covariance matrix \((\sigma ^{(g)})^2\mathbf{C}^{(g)}\), with \(\sigma ^{(g)}\) denoting a general scaling factor (also called step-size or mutation strength) and \(\mathbf{C}^{(g)}\) a rotation matrix. Following the usual practice in the literature on evolution strategies, the latter matrix \(\mathbf{C}^{(g)}\) is referred to as the covariance matrix in the remainder of the paper. The CMSA-ES uses covariance matrix adaptation for \(\mathbf{C}^{(g)}\) and self-adaptation for the mutation strength.

The covariance matrix update is based upon the common estimate of the covariance matrix using the newly created population. However, the sample consists of the selected parents and not of the complete set of offspring. Restricting the sample in this way is intended to induce a bias towards promising parts of the search space. Since the adaptation of the mutation strength happens separately, the sample is normalized with \(\mathbf{z}_{m:\lambda }^{(g+1)}:=\big (\mathbf{x}_{m:\lambda }^{(g+1)}-\mathbf{m}^{(g)}\big )/\sigma ^{(g)}\) before estimating the covariance, see also [11]. Since the centroid used for the mutation is known, the covariance matrix estimation does not need to re-estimate the mean. The rank-\(\mu \) update then obtains the covariance matrix as

$$\begin{aligned} \mathbf{C}_\mu ^{(g+1)} := \sum _{m=1}^\mu w_m\,\mathbf{z}_{m:\lambda }^{(g+1)}\big ({\mathbf{z}_{m:\lambda }^{(g+1)}}\big )^\mathrm {T} \end{aligned}$$
(1)

which is in general only positive semi-definite (and not positive definite) since \(\mu \ll N\). The weights \(w_m\) should fulfill \(w_1\ge w_2\ge \ldots \ge w_\mu \) with \(\sum _{m=1}^\mu w_m=1\). While it is possible to consider unequal weights, the CMSA-ES usually operates with \(w_m=1/\mu \). Deriving reliable estimates would require larger population sizes, which would lower the algorithm's speed. Therefore, past covariance matrices are taken into account via a convex combination of the sample covariance (1) and the old covariance matrix

$$\begin{aligned} \mathbf{C}^{(g+1)} := \Big (1-\frac{1}{c_\tau }\Big )\mathbf{C}^{(g)}+ \frac{1}{c_\tau }\mathbf{C}^{(g+1)}_\mu \end{aligned}$$
(2)

with the weights usually set to \(w_m=1/\mu \) and, following [4], the learning parameter

$$\begin{aligned} c_{\tau }=1+\frac{N(N+1)}{2\mu } .\end{aligned}$$
(3)
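As a minimal sketch (Python/NumPy, assuming the normalized vectors \(\mathbf{z}_{m:\lambda }^{(g+1)}\) of the \(\mu \) selected offspring are available as the rows of a matrix), the update (1)-(3) can be written as:

```python
import numpy as np

def cmsa_covariance_update(C_old, Z):
    """Rank-mu update (1) followed by the convex combination (2) with c_tau from (3).

    C_old: previous covariance matrix C^(g), shape (N, N)
    Z:     normalized mutation vectors of the mu selected offspring, shape (mu, N)
    """
    mu, N = Z.shape
    w = np.full(mu, 1.0 / mu)                    # equal weights w_m = 1/mu
    C_mu = (Z.T * w) @ Z                         # Eq. (1): sum_m w_m z_m z_m^T
    c_tau = 1.0 + N * (N + 1) / (2.0 * mu)       # Eq. (3)
    return (1.0 - 1.0 / c_tau) * C_old + (1.0 / c_tau) * C_mu   # Eq. (2)
```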

2.2 Step-Size Adaptation

The CMSA-ES adapts the step-size using self-adaptation, first introduced in [18] and developed further in [19]. Here, evolution itself is used to tune the strategy parameters of the mutation process. In other words, these parameters undergo recombination, mutation, and indirect selection. The working principle is based on an indirect stochastic linkage between good individuals and appropriate parameters: well-adapted parameters should result in better offspring more often than values that are too large or too small or directions that are misleading. Although self-adaptation has been developed to adapt the whole covariance matrix, it is today mainly applied to adapt the step-size or a diagonal covariance matrix. In the case of the mutation strength, usually a log-normal distribution \( \sigma _l^{(g)}=\sigma ^{(g)}\mathrm {exp}(\tau \mathscr {N}(0,1))\) is used for the mutation of the mutation strength. The parameter \(\tau \), the learning rate, should scale with \(1/\sqrt{2N}\). The CMSA-ES often uses recombination. Among other benefits, self-adaptation with recombination improves the performance in the presence of noise [2]. While the recombination of the mutation strength could be realized in several ways, it normally follows the recombination of the object variables in computing the mean of the mutation strengths of the parents. The newly created mutation strength \(\sigma _l^{(g)}\) is then used for mutating the object variables of the offspring. If the resulting offspring is sufficiently good, the scale factor is passed to the next generation.
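The following sketch (Python/NumPy, illustrative names) shows the self-adaptive part: the parental mutation strengths are recombined by averaging, and each offspring mutates the result with a log-normal factor, with the learning rate \(\tau = 1/\sqrt{2N}\) assumed.

```python
import numpy as np

def self_adapted_sigma(parent_sigmas, N):
    """Recombine the parental mutation strengths and apply the log-normal mutation."""
    tau = 1.0 / np.sqrt(2.0 * N)                 # learning rate scaling with 1/sqrt(2N)
    sigma_rec = np.mean(parent_sigmas)           # recombination: mean of the parental sigmas
    return sigma_rec * np.exp(tau * np.random.randn())   # sigma_l = sigma * exp(tau * N(0,1))
```

Each offspring draws its own \(\sigma _l\) and uses it to scale its mutation vector; if the offspring is selected, its \(\sigma _l\) enters the recombination of the next generation.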

Fig. 1. The development of the ratio of the largest to the smallest eigenvalue of the covariance for the CMSA-ES on the sphere and the discus. Shown are the results from 15 runs for each dimensionality. a Sphere, N = 10. b Sphere, N = 40. c Discus, N = 10. d Discus, N = 40

3 A Sparse Covariance Matrix Adaptation

This section introduces the new covariance adaptation technique which uses thresholding to transform the population covariance matrix. The decision for thresholding is based upon the comparatively low computational cost of the approach.

The sample covariance (1) has a strong influence on the adaptation. However, the good properties of the maximum likelihood estimator only hold for \(\mu \gg N\) and \(\mu \gg 1\). In evolution strategies, the sample size seldom exceeds the search space dimension, i.e., typically \(\mu < N\). For example, [9] recommends using \(\lambda =\lfloor \log (3N)\rfloor +4\) offspring and setting the size of the parent population to \(\mu =\lfloor \lambda /2\rfloor \). Thus, a potential problem arises in high-dimensional settings: for \(N\rightarrow \infty \), we have \(\mu /N\rightarrow 0\), contradicting the assumptions on which the estimator is based.

In order to assess the problem in evolution strategies, we take a closer look at the eigenvalues of the covariance matrix for some selected functions. Figure 1 shows the development of the ratio of the largest to the smallest eigenvalue of the covariance matrix on the sphere \(f(\mathbf{x})=\Vert \mathbf{x}\Vert ^2\) and on the discus \(f(\mathbf{x})=10^6x_1^2+\sum _{i=2}^N x_i^2\). In the latter case it can be argued that the observed behavior is beneficial. For the sphere, the figures hint at a potential problem: the gap between the largest and the smallest eigenvalue widens in all runs, and the problem is more pronounced for the smaller search space dimensionalities. Furthermore, the extremely small sample size for \(N=10\) causes a large variation between the runs. Interestingly, this is less distinct for the higher-dimensional search spaces. This is probably an effect of the parameter \(c_\tau \), for which \(\lim _{N\rightarrow \infty } c_\tau (N)=\infty \) as long as \(\mu \propto \log (N)\) or \(\mu \propto N\); thus, the influence of the population covariance lessens. In statistics, the problem is well known [20, 21], and there is a long research tradition of approaches that counteract the problematic properties, see e.g. [17] for an overview. Among others, it has been shown that the eigenstructures of the estimate and of the true covariance do not agree well.

3.1 Space Transformation

Several types of estimators assume a sparse structure of the covariance matrix. Briefly stated, these approaches work well if many entries are small or even zero; then, computationally simple estimation techniques can be applied. In the case of evolution strategies, a sparseness assumption may not hold in every situation. The form of the covariance matrix depends strongly on the function landscape and may vary widely in practice. Furthermore, there may not be any information available concerning the type of the function itself. Therefore, the covariance matrix is not considered in the original space but in the eigenspace of the previous covariance matrix \(\mathbf{C}^{(g)}\).

Let the covariance matrix \(\mathbf{C}^{(g)}\) be a symmetric, positive definite \(N\times N\) matrix. The condition holds for the original adaptation since (2) combines a positive definite with a positive semi-definite matrix. As we will see below, in the case of thresholding the assumption may not always be fulfilled. Let \(\mathbf{v}_1,\ldots ,\mathbf{v}_N\) denote the N eigenvectors with the eigenvalues \(\lambda _1,\ldots ,\lambda _N\), \(\lambda _j>0\). The definiteness of \(\mathbf{C}^{(g)}\) guarantees their existence. The eigenvectors form an orthonormal basis of \(\mathbb {R}^N\), i.e., \(\mathbf{v}^\mathrm {T}_i\mathbf{v}_i=1\) and \(\mathbf{v}_i^\mathrm {T}\mathbf{v}_j=0\) if \(i\ne j\). Define \(\mathbf{V}:=(\mathbf{v}_1,\ldots ,\mathbf{v}_N)\). It then holds that \(\mathbf{V}^{-1}=\mathbf{V}^\mathrm {T}\). Switching to the eigenspace of \(\mathbf{C}^{(g)}\) results in the representation of the covariance matrix \(\varLambda ^{(g)}=\mathbf{V}^\mathrm {T}\mathbf{C}^{(g)}\mathbf{V}\) with \(\varLambda ^{(g)}\) a diagonal matrix containing the eigenvalues. Diagonal matrices are sparse, thus for the estimation of the covariance matrix the more efficient procedures for sparse structures could be used. However, the goal is not to re-estimate \(\mathbf{C}^{(g)}\) but to estimate the true covariance matrix of the distribution indicated by the sample \(\mathbf{z}_{1:\lambda },\ldots ,\mathbf{z}_{\mu :\lambda }\).

Before continuing, it should be noted that several definitions of sparseness have been introduced. For instance, the number of non-zero elements in a row may not exceed a predefined limit \(s_0(N)>0\), i.e., \( \max _i\sum _{j=1}^N \delta (|a_{ij}|>0)\le s_0(N)\), which should grow only slowly with N. This definition can, however, be relaxed to a more general notion of sparseness, also referred to as approximate sparseness [5], on which the adaptive thresholding considered here is based. Applying thresholding in our case requires that the true covariance matrix of the selected set has an approximately sparse structure in the eigenspace of \(\mathbf{C}^{(g)}\). Assuming the validity of this assumption, we change the coordinate system in order to perform the covariance matrix estimation. Reconsider the mutation vectors \(\mathbf{z}_{1:\lambda },\ldots ,\mathbf{z}_{\mu :\lambda }\) (normalized except for the covariance matrix) that were associated with the \(\mu \) best offspring. Denoting their representation in the eigenspace by \(\hat{\mathbf{z}}_{m:\lambda }=\mathbf{V}^\mathrm {T}\mathbf{z}_{m:\lambda }\;\mathrm {for}\;m=1,\ldots ,\mu \) leads to the new population covariance

$$\begin{aligned} \hat{\mathbf{C}}_\mu =\sum _{m=1}^\mu w_m\hat{\mathbf{z}}_{m:\lambda }\hat{\mathbf{z}}_{m:\lambda }^\mathrm {T} \end{aligned}$$
(4)

which is used to derive the final estimate. In the next section, potential techniques for sparse covariance matrices are discussed.
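A sketch of the coordinate change and of the population covariance (4) in the eigenspace of \(\mathbf{C}^{(g)}\) (Python/NumPy; the normalized vectors of the selected offspring are assumed to be the rows of Z):

```python
import numpy as np

def eigenspace_population_covariance(C_old, Z):
    """Transform the selected mutation vectors into the eigenspace of C^(g)
    and compute the population covariance (4) there.

    C_old: previous covariance matrix C^(g), symmetric positive definite, shape (N, N)
    Z:     normalized mutation vectors of the mu selected offspring, shape (mu, N)
    """
    _, V = np.linalg.eigh(C_old)          # columns of V are the eigenvectors v_1, ..., v_N
    Z_hat = Z @ V                         # rows are z_hat_m = V^T z_m
    mu = Z.shape[0]
    w = np.full(mu, 1.0 / mu)
    C_hat_mu = (Z_hat.T * w) @ Z_hat      # Eq. (4)
    return C_hat_mu, Z_hat, V
```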

3.2 Sparse Covariance Matrix Estimation

Several methods have been developed for estimating sparse covariance matrices: among others, banding, tapering, and thresholding can be applied, see e.g. [17]. While all three are based on the assumption that many entries of the true covariance are zero, banding and tapering assume an ordering of the variables which is not present in the case of evolution strategies.

Therefore, only thresholding remains. Thresholding discards entries whose absolute values are smaller than a given threshold \(\varepsilon >0\). For a matrix \(\mathbf{A}\in \mathbb {R}^{N\times N}\), the thresholding operator \(T_\varepsilon (\mathbf{A})\) is defined as

$$\begin{aligned} T_\varepsilon (\mathbf{A}):=(a_{ij}\delta (|a_{ij}|\ge \varepsilon ))_{N\times N} \end{aligned}$$
(5)

with \(\delta (\cdot )=1\) if the condition is fulfilled and zero otherwise. The choice of the threshold is critical for the quality of the resulting estimate. Equation (5) represents an example of universal thresholding with a hard thresholding function. Soft thresholding is also common; examples of this class of functions include

$$\begin{aligned} s_\lambda (x) = \mathrm {sign}(x)(|x|-\lambda )_+ \quad \text{(soft-thresholding)} \end{aligned}$$
(6)
$$\begin{aligned} s_\lambda (x) = |x|\Big (1-\Big |\frac{\lambda }{x}\Big |^\eta \Big )_+ \quad \text{(Lasso)} \end{aligned}$$
(7)

with \((x)_+:=\max (0,x)\). Adaptive thresholding, which determines the thresholds \(\lambda _{ij}\) from the current data, appears more appropriate for evolution strategies than using constant thresholds. Following [5], we use

$$\begin{aligned} \lambda _{ij}:=\lambda _{ij}(\delta )=\delta \sqrt{\frac{\hat{\theta }_{ij}\log N}{\mu }} \end{aligned}$$
(8)

where \(\delta >0\) can either be chosen as a constant or be obtained using cross-validation. The variable \(\hat{\theta }_{ij}\) in (8) is determined as \(\hat{\theta }_{ij}=\frac{1}{\mu }\sum _{m=1}^\mu [(\hat{z}_{mi}-\overline{Z^i})(\hat{z}_{mj}-\overline{Z^j})-\hat{c}_{ij}^\mu ]^2\) with \(\hat{c}_{ij}^\mu \) denoting the (i, j)-entry of \({\hat{\mathbf{C}}}_\mu ^{(g+1)}\), \(\hat{z}_{mi}\) the ith component of \(\hat{\mathbf{z}}_{m:\lambda }\), and \(\overline{Z^i}:=(1/\mu )\sum _{m=1}^\mu \hat{z}_{mi}\).
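A sketch of the adaptive thresholds (8) and of the entry-wise application of the thresholding functions (6) and (7) (Python/NumPy; Z_hat holds the transformed vectors \(\hat{\mathbf{z}}_{m:\lambda }\) as rows):

```python
import numpy as np

def adaptive_thresholds(Z_hat, C_hat_mu, delta):
    """Entry-wise thresholds lambda_ij of Eq. (8)."""
    mu, N = Z_hat.shape
    centered = Z_hat - Z_hat.mean(axis=0)                   # z_hat_mi minus the sample mean
    # theta_ij = (1/mu) sum_m [(z_mi - mean_i)(z_mj - mean_j) - c_ij]^2
    products = centered[:, :, None] * centered[:, None, :]  # shape (mu, N, N)
    theta = np.mean((products - C_hat_mu) ** 2, axis=0)
    return delta * np.sqrt(theta * np.log(N) / mu)

def soft_threshold(A, Lam):
    """Soft-thresholding function (6), applied entry-wise."""
    return np.sign(A) * np.maximum(np.abs(A) - Lam, 0.0)

def lasso_threshold(A, Lam, eta=4.0):
    """Lasso thresholding function (7) as given in the text, applied entry-wise."""
    with np.errstate(divide='ignore', invalid='ignore'):
        keep = np.maximum(1.0 - np.abs(Lam / A) ** eta, 0.0)
    return np.abs(A) * np.nan_to_num(keep)    # entries with |a_ij| <= lambda_ij become zero
```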

While thresholding respects symmetry and non-negativeness properties, it results only in asymptotically positive definite matrices. Thus, for finite sample sizes, it neither preserves nor induces positive definiteness in general. Due to this potential problem, future research will investigate repair mechanisms as well as alternative thresholding functions, see e.g. [7]. Here, the soft-thresholding function (6) and the Lasso thresholding function (7) are considered. While it is common to exclude the diagonal entries of the covariance from thresholding, this may not always be appropriate for optimization since the nature of the functions may vary widely. Our previous experiments did not show a clear advantage for either choice; therefore, both versions are taken into account. In combination with the thresholding functions, the following four ES variants are investigated: (1) CMSA-Thres-ES (abbreviated to Thres), an evolution strategy with CMSA which applies soft-thresholding in the eigenspace of the covariance; (2) CMSA-ThresL-ES (abbreviated to ThresL), the same as above but using the Lasso thresholding; (3) CMSA-Diag-ES (abbreviated to Diag), an ES with covariance matrix adaptation that applies soft-thresholding in the eigenspace of the covariance while preserving the diagonal elements; (4) CMSA-DiagL-ES (abbreviated to DiagL), the corresponding variant with the Lasso function. A sketch combining these options is given below.
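Putting the pieces from the previous sketches together, the four variants can be expressed with two flags (Python/NumPy; the function names and flags are illustrative and not taken from the original implementation):

```python
import numpy as np

def sparse_covariance_update(C_old, Z, delta, use_lasso=False, keep_diagonal=False):
    """Sparse covariance adaptation sketch: threshold the eigenspace estimate (4),
    transform back, and combine with the old covariance as in (2).

    use_lasso=False, keep_diagonal=False -> CMSA-Thres-ES
    use_lasso=True,  keep_diagonal=False -> CMSA-ThresL-ES
    use_lasso=False, keep_diagonal=True  -> CMSA-Diag-ES
    use_lasso=True,  keep_diagonal=True  -> CMSA-DiagL-ES
    """
    mu, N = Z.shape
    C_hat_mu, Z_hat, V = eigenspace_population_covariance(C_old, Z)   # Sect. 3.1, Eq. (4)
    Lam = adaptive_thresholds(Z_hat, C_hat_mu, delta)                 # Eq. (8)
    threshold = lasso_threshold if use_lasso else soft_threshold      # Eq. (7) or Eq. (6)
    C_sparse = threshold(C_hat_mu, Lam)
    if keep_diagonal:                                  # exclude the diagonal entries from thresholding
        np.fill_diagonal(C_sparse, np.diag(C_hat_mu))
    C_mu = V @ C_sparse @ V.T                          # back to the original coordinate system
    c_tau = 1.0 + N * (N + 1) / (2.0 * mu)             # Eq. (3)
    return (1.0 - 1.0 / c_tau) * C_old + (1.0 / c_tau) * C_mu         # Eq. (2)
```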

4 Experiments

Two series of experiments were conducted: the first with the aim to gain more insight regarding the choice of the parameter \(\delta \). Our first approach was to make this parameter data dependent by setting it to \(\delta =2\max (\hat{\mathbf{C}}_\mu )\). Since [5] recommends either using \(\delta =2\) or conducting cross-validation, we performed a short experimental analysis and took a closer look at the development of the eigenvalues on the sphere and on the discus. We considered \(\delta =2, 3,\) and 4 for the CMSA-ThresL-ES with the search space dimensionalities set to \(N=10, 20, 40\), and 100.

The second series of experiments compares the different thresholding variants with the original CMSA-ES. Two thresholding operators, soft thresholding and Lasso thresholding (with \(\eta =4\)), are taken into account. The comparison is based on the search space dimensions \(N=10\) and 20. The second series of experiments uses a maximal number of fitness evaluations of \(\mathrm {FE}_{\max }= 2\times 10^{5}\,N\). While the experiments revealed that longer runs are necessary in order to derive meaningful findings for the difficult multimodal functions, this task was left for future research because of the computing time required.

All strategies start from randomly chosen positions, sampled uniformly from the interval \([-4,4]^N\). The ESs used \(\lambda =\lfloor \log (3N)+8\rfloor \) offspring and \(\mu =\lceil \lambda /4\rceil \) parents. Equal weights were used, \(w_m=1/\mu \). A run terminates before reaching the maximal number of evaluations if the difference between the best value obtained so far and the optimal fitness value, \(|f_{\mathrm {best}}-f_{\mathrm {opt}}|\), falls below a predefined target precision of \(10^{-8}\). For each fitness function and dimension, 15 runs are conducted. In order not to waste resources, a run is restarted when search stagnation is observed, characterized by changes of the best value below \(10^{-8}\) over the last \(10+\lceil 30 N/\lambda \rceil \) generations.

4.1 Test Suite and Performance Measure

The algorithms were implemented in MATLAB. The paper uses the black-box optimization benchmarking (BBOB) software framework and test suite, see [10]. The framework can be used to benchmark and compare continuous black-box optimizers and provides easy means to generate tables and figures. This paper considers the 24 noiseless functions of the test suite [8]. In order to lower the possibility that an algorithm benefits from initialisation effects, the position of the optimum is changed from run to run. The test suite comprises four function classes which differ in the degree of difficulty they pose for the optimization: separable functions (function ids 1–5), functions with low/moderate conditioning (ids 6–9), functions with high conditioning (ids 10–14), and two groups of multimodal functions (ids 15–24), with the last comprising functions with a weak global structure.

Following [10], the expected running time (ERT) is used as the performance measure. It is defined as the expected number of function evaluations (f-evaluations) required to reach the target value with the required precision for the first time, see [10]. In this paper, \( \mathrm {ERT}=\frac{\#(FEs(f_{\mathrm {best}}\ge f_{\mathrm {target}}))}{\#succ}\) is used as an estimate. It is obtained by summing up the evaluations \(FEs(f_{\mathrm {best}}\ge f_{\mathrm {target}})\) of each run until the fitness of the best individual falls below the target value, and dividing by the number of successful runs.
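A small sketch of this estimate (Python; the per-run bookkeeping is assumed here and is not part of the BBOB framework itself):

```python
def expected_running_time(evals_until_target, successes):
    """ERT estimate: evaluations summed over all runs while f_best >= f_target,
    divided by the number of successful runs.

    evals_until_target: per-run number of evaluations spent before the target was hit
                        (or the total number of evaluations for unsuccessful runs)
    successes:          per-run flags, True if the run reached the target precision
    """
    n_succ = sum(successes)
    if n_succ == 0:
        return float('inf')    # no successful run: the ERT estimate is infinite
    return sum(evals_until_target) / n_succ
```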

4.2 Results and Discussion

First, we describe the results from the parameter dependency experiments. The thresholding should, on the one hand, "stabilize" the covariance matrix in the sense that the eigenvalues do not diverge unless, of course, optimizing the function requires it. On the other hand, it should not delay or prohibit the adaptation of the covariance matrix to the function landscape. Summarizing the effects of operating with a data-independent \(\delta \) from \(\{2,3,4\}\): this detrimental behavior can indeed be observed. Thus, Fig. 2 only shows the development of the ratio of the largest and the smallest eigenvalue for the sphere and the discus for two exemplary search space dimensions using the data-dependent \(\delta \). Comparing Fig. 2 to Fig. 1 reveals that for the sphere the variation between the runs is reduced even for the smaller search space. In the case of the discus, the increase of the ratio is slower, which could result in slower convergence.

Fig. 2. Development of the ratio of the largest and the smallest eigenvalue of the covariance matrix of the CMSA-ThresL-ES on the sphere and the discus. Shown are the results from 15 repeats for each dimensionality. a Sphere, N = 10. b Sphere, N = 40. c Discus, N = 10. d Discus, N = 40

The findings for the BBOB test suite indicate advantages for thresholding in many cases. The outcome of the comparison depends on the function class. In the case of the separable functions with ids 1–5, the strategies behave on the whole very similarly for 10D and 20D. Concerning the particular functions, differences are revealed, as Tables 1 and 2 show for the expected running time (ERT) provided for several precision targets. In the case of the sphere (function with id 1) and the separable ellipsoid (id 2), all strategies reach the final precision goal in all runs. For both functions, ESs with thresholding are the fastest. In the case of the sphere, preserving the diagonal elements appears slightly advantageous; however, all variants are close together. For the ellipsoid, the gap widens. Interestingly, two variants remain close together: the CMSA-Thres-ES and the CMSA-DiagL-ES, which differ in the thresholding function as well as in the decision whether to subject the diagonal entries to thresholding. No strategy reaches the required target precision in the case of the separable Rastrigin (id 3) and the separable Rastrigin-Bueche (id 4). Since all strategies only achieve the lowest target precision of \(10^1\), a comparison is not performed, and the data are removed from the tables due to page restrictions. In the case of the linear slope (id 5), all strategies are successful. While the thresholding variants perform better for the smaller search space, probably due to the more stable behavior of the covariance adaptation, the advantage is lost for \(N=20\), as Table 2 shows.

Table 1 Expected running time (ERT in number of function evaluations) divided by the respective best ERT measured during BBOB-2009 in dimension 10

 

Table 2 Expected running time (ERT in number of function evaluations) divided by the respective best ERT measured during BBOB-2009 in dimension 20

In the case of the functions with low to moderate conditioning (ids 6–9), the step ellipsoid with id 7 is the most difficult function for the ESs to optimize. Experiments with a larger number of maximal function evaluations will be performed in future research. In the case of the remaining functions, we see a separation between the attractive sector (id 6) and the Rosenbrock variants (ids 8 and 9). In the case of the attractive sector and \(N=10\), see Table 1, the original CMSA-ES could only reach the required target precision in eight of the 15 runs, whereas the thresholding variants resulted in only one or two unsuccessful runs. Increasing the search space dimensionality causes all runs of the CMSA-ES to be unsuccessful, while the thresholding variants still achieve two or three successful runs. On the original Rosenbrock function (id 8), the CMSA-Thres-ES with the soft-thresholding function is the worst performing strategy. For \(N=20\), Table 2, the CMSA-DiagL-ES, which uses the Lasso, also exhibits slower convergence. Here, the CMSA-ES is marked as the best strategy for many intermediate precision targets. However, the CMSA-Diag-ES and the CMSA-ThresL-ES achieve very similar results. The interaction of thresholding target and thresholding function is interesting and will be investigated more closely in future work. In the case of the rotated Rosenbrock (id 9), the CMSA-ThresL-ES shows the best results.

For the ill-conditioned functions (ids 10–14), the findings are mixed. On some functions, especially on the ellipsoid (id 10) and the bent cigar (id 12), the original CMSA-ES has the lowest ERT values for the precision targets. For \(N=10\), Table 1, all strategies are successful on the ellipsoid (id 10), the discus (id 11), the bent cigar (id 12), and the sum of different powers (id 14). For the higher-dimensional search space, the bent cigar leads to problems for the CMSA-Thres-ES and the CMSA-DiagL-ES. Again, there appears to be an interaction between thresholding target and thresholding function. Only the CMSA-ES and the CMSA-Diag-ES are able to reach the final precision target on the sharp ridge (id 13) for \(N=10\). Since this occurs just once in both cases, more experiments are clearly necessary. Interestingly, differences can be observed between the group consisting of the CMSA-ES, the CMSA-ThresL-ES, and the CMSA-Diag-ES and the remaining strategies. The latter group is unable to achieve comparable performance on f11, f12, and f13, with more unsuccessful runs and larger expected numbers of function evaluations, especially for the lower targets.

The group of multi-modal functions represents a challenge for all ESs under consideration: the functions Rastrigin (id 15), Weierstrass (id 16), Schaffer F7 with condition number 10 (id 17), Schaffer F7 with condition number 1000 (id 18), and Griewank-Rosenbrock F8F2 (id 19) cannot be solved with the required final target precision. Partly, this may be due to the maximal number of fitness evaluations: even the best performing methods of the 2009 BBOB workshop required more evaluations than we allowed in total. Thus, longer experiments should be conducted in future research. Concerning the preliminary targets with lower precision, the thresholding variants often achieve the best results; however, more experiments are required. In the case of \(N=20\), the number of function evaluations the best algorithms of 2009 needed to reach even the lower precision target of \(10^{-1}\) exceeds our total budget. Therefore, no analysis is attempted and the results are not shown in Table 2.

The last group, the multi-modal functions with weak global structure, is also difficult to solve and was struck from Table 2. Only for function 21, Gallagher 101 peaks, and function 22, Gallagher 21 peaks, are successful runs observed for \(N=10\), see Table 1. In the case of the former, the CMSA-Thres-ES achieves the best results, whereas the original CMSA-ES is the best strategy for function 22.

To summarize the findings, thresholding appears to be a means to improve the performance. However, we observe an interaction between thresholding function and thresholding target that should be analyzed further.

5 Conclusions and Outlook

The focus of the paper lay on the covariance matrix adaptation in evolution strategies. In many cases, the sample covariance is used, which gives cause for concern since its quality may be poor when the estimate is based on only a small sample. Alternative approaches have been developed in the field of statistics. Evolution strategies require, however, methods that do not increase the computational effort considerably. Therefore, the paper investigated and compared several thresholding techniques which originate from estimation theory for high-dimensional spaces. The performance of the resulting new evolution strategies was compared to the original variant on the black-box optimization benchmarking test suite. The results were promising, with the new variants performing better for several function classes. Concerning the variants of thresholding, more experiments and analyses are required in order to identify the best solution and to shed more light on the interaction between thresholding function and thresholding target.