
1 Introduction

In many real-world optimization problems, the evaluation of candidate solutions is affected by noise. Possible sources of noise include physical measurement limitations or the stochastic components employed in simulations. Similarly, in machine learning, the diversity of the data used to train and test models adds a layer of uncertainty to the problem: different models are usually compared using cross-validation approaches, but such comparisons are not guaranteed to be correct. In noisy scenarios, since the true value of the objective function is distorted, making correct comparisons between candidate solutions is not straightforward. If the noise is too high with respect to the difference between the true values of two candidates (the signal), so that the signal-to-noise ratio is too low, comparisons based on a single evaluation per solution might be wrong.

In order to deal with noise, various heuristics have been proposed and studied [2, 4, 5, 17, 27]. In particular, many studies employ variants of evolutionary algorithms, which maintain a set of candidate solutions (a population) subject to local perturbative search and stronger diversification mechanisms, often described with terms derived from genetics. Since these algorithms iteratively employ a population to explore the search space and propose new solutions, they are considered robust to the presence of noise [2, 21]. Multiple works have compared heuristic algorithms with evolutionary strategies, showing that population-based approaches are a good choice for optimizing noisy functions [2, 4, 27]. However, these studies mostly compare evolutionary strategies with algorithms which propose candidate solutions deterministically. Yet, according to the results of [4], when information about gradients is not available and the objective function is noisy, randomized algorithms might be an effective choice.

Apart from the search policy, which defines how the search space is explored and new solutions are proposed, other building blocks are necessary to build effective noisy optimization strategies. In fact, to be efficient in this context, algorithms must deal with the presence of noise. In general, this can be achieved in two ways: by increasing the strength of the signal, or by reducing the effect of noise. In randomized algorithms, the signal can be improved by adapting the search region according to the signal-to-noise ratio. Multiple variants of this strategy have been studied in the field of evolutionary algorithms [2, 5], and it has been shown that the way in which the search region is adapted during the optimization has a relevant impact. On the other hand, the effect of noise can be reduced by evaluating each solution multiple times [8, 9, 13, 28, 29]. Also, if a heuristic is population-based, the impact of noise can be decreased by increasing the number of candidates (the size) of the population [3, 5, 15]. This effect, called implicit averaging, has been studied using normally distributed noise with zero mean and different values of constant standard deviation [16, 17, 22]. Moreover, [17] shows that indiscriminately increasing the population size can be counterproductive, but without providing an explanation for this behavior. As will be shown empirically, the effect of implicit averaging depends on the type and the amount of noise in the objective function. It is also worth noticing that, in order to deal with noise, randomized algorithms can be extended with statistical analysis techniques. An example is given by simulated annealing [32], which has been extended by adapting the number of samples per solution based on statistical analysis [1, 7]. However, studies which compare diverse randomized algorithms extended with statistical methods are missing from the literature, and this work is a first step in this direction.

This study aims at comparing the efficiency of different heuristic search strategies in the presence of noise, and at investigating the effects that different components of these strategies have on performance. Differently from previous studies, all the heuristic search strategies employed here are randomized, and algorithms not based on populations are extended using the reactive sample size scheme proposed by [14]. The rest of the paper is structured as follows. Section 2 states the noisy optimization problem more formally and gives an overview of the reactive sample size scheme. Section 3 outlines the heuristic search algorithms used in the experiments, and comments on the components of the algorithms which are analyzed empirically. Section 4 defines the experiments and analyzes the results.

2 Noisy Optimization

Let F be a stochastic function that models a real-world problem. The output of F depends on some decision variables x and on a random vector \(\xi \) that represents the stochasticity of the problem. The expectation of F is defined as

$$\begin{aligned} f(x) = \mathbb {E}[F(x,\xi )] \end{aligned}$$
(1)

and it can be estimated by using a sample \(\xi _1, ..., \xi _n\) of independent identically distributed (i.i.d.) realizations of the random vector \(\xi \), in order to compute the Sample Average Approximation (SAA) of (1) as

$$\begin{aligned} \hat{f}_n(x) = \frac{1}{n} \sum _{i=1}^{n} F(x,\xi _i). \end{aligned}$$
(2)

For any n, \(\hat{f}_n(x)\) is an unbiased estimator of f(x); moreover, if the sample \(\xi _1, ..., \xi _n\) is i.i.d., by the Law of Large Numbers \(\hat{f}_n(x)\) converges to f(x) as n approaches infinity. If, in addition, the variance of F is finite, by the Central Limit Theorem \(\hat{f}_n(x)\) asymptotically follows a normal distribution with mean f(x) and variance \(\sigma ^2/n\), where \(\sigma ^2\) is the variance of F. As a consequence, the accuracy of the estimation increases with the sample size n, but so does the computational burden (see also [26, 31]). The optimization problem can then be defined over the feasible region \(\varTheta \), in which x can assume values, as

$$\begin{aligned} \min _{x \in \varTheta } f(x). \end{aligned}$$
(3)

The SAA defined in (2) can be used as the objective function by heuristic optimization techniques, in order to optimize f(x). The presence of noise might require large samples to obtain sufficiently accurate estimates, so comparing the performance of different configurations is not straightforward. Given any configuration x, \(f(x)-\hat{f}_n(x)\) defines an error \(\epsilon _n(x)\) that goes to 0 only in the limit of n going to infinity. As a consequence, when comparing two configurations \(x_1\) and \(x_2\), the difference \(\hat{f}_n(x_1) - \hat{f}_n(x_2)\) is not sufficient to decide which configuration has the better mean. If the signal \(|f(x_1) - f(x_2)|\) is lower than the noise \(|\epsilon _n(x_1) - \epsilon _n(x_2)|\), the signal-to-noise ratio is too low and the comparison might not be significant.
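
To make the issue concrete, the following minimal Python sketch (using a toy noisy function F of our own choosing, not one from the experiments) estimates f via the SAA of (2) and shows that the sign of \(\hat{f}_n(x_1) - \hat{f}_n(x_2)\) becomes reliable only as n grows:

```python
import numpy as np

def saa(F, x, n, rng):
    """Sample Average Approximation of f(x) = E[F(x, xi)], Eq. (2)."""
    xis = rng.standard_normal(n)              # i.i.d. realizations of xi
    return np.mean([F(x, xi) for xi in xis])

# Toy noisy function: f(x1) - f(x2) = 0.01 (the signal), noise std = 1
F = lambda x, xi: x**2 + xi
rng = np.random.default_rng(0)
x1, x2 = 0.1, 0.0
for n in (1, 100, 10000):
    print(n, saa(F, x1, n, rng) - saa(F, x2, n, rng))
# With n = 1 the sign of the difference is essentially random; only with
# large n does it reliably match the sign of f(x1) - f(x2) = 0.01.
```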

2.1 A Reactive Sample Size Algorithm

In this work, in order to deal with the presence of noise, heuristic techniques which do not use a population are extended with a reactive sample size algorithm [14] based on paired t-tests and indifference-zone (IZ) selection. IZ selection is a concept commonly used in ranking and selection (R&S) algorithms, which aim at selecting, in a statistically significant manner, the best solution \(x^*\) among a finite set of k possibilities. In R&S methods based on IZ selection, the target is to select the best configuration \(x^*\) among a finite set of k configurations, where \(x^*\) is better than all other configurations in the set by at least \(\delta \) and the probability of correct selection (PCS) is \(1-\alpha > 0\), where \(\alpha \) is the probability of making a type I error. \(\delta \) is called the IZ parameter, and it defines the minimum difference in means considered worth detecting. More information about R&S can be found in [10, 23, 25].

The algorithm works as follows. Given a pair of configurations \(\{x_1,x_2\}\) to be compared, paired evaluations are obtained by evaluating \(x_{1}\) and \(x_{2}\) using the same \(\xi _i\). These evaluations are then used to compute the paired t-test statistic. In fact, as observed by [24], using the same realizations helps to reduce the effect of noise: the correlation among pairs of evaluations reduces the variance with respect to an unpaired statistic. The scheme assumes that F is normally distributed and that its variance is finite.

The algorithm reactively decides, in an online manner, the sample size to be used for each comparison done during the optimization. Also, all the evaluations of solutions previously visited during the search are kept in memory, to avoid wasting computational budget if a configuration has to be compared multiple times. Significant differences are detected by considering the relationship between the probabilities \(\alpha \) and \(\beta \) of making type I and type II errors. To remind the reader, given a null hypothesis \(H_0\) and an alternative hypothesis \(H_1\), \(\alpha \) is the probability of rejecting \(H_0\) when \(H_0\) is true, and \(\beta \) is the probability of failing to reject \(H_0\) when \(H_0\) is false. To compute \(\beta \), a significant difference in means for which \(H_0\) is assumed to be false and \(H_1\) to be true has to be defined. The value for which \(H_1\) is assumed to be true is \(\delta _{observed}\), which corresponds to \(\hat{f}_n(x_{1}) - \hat{f}_n(x_{2})\). So, given a paired sample, the minimum sample size n that should be used to test a one-tailed hypothesis with error probabilities \(\alpha \) and \(\beta \) can be computed. See [14] for more details.
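
The exact bookkeeping of the reactive scheme is described in [14]; as an illustration only, the core power computation could be sketched as follows, assuming the statsmodels package is available (the standardized effect size is \(\delta _{observed}\) divided by the standard deviation of the paired differences):

```python
import numpy as np
from statsmodels.stats.power import TTestPower

def min_paired_sample_size(diffs, alpha=0.1, beta=0.4):
    """Minimum n for a one-tailed paired t-test with error rates alpha, beta.

    diffs[i] = F(x1, xi_i) - F(x2, xi_i) are the paired differences;
    delta_observed is their mean, as in the text above.
    """
    delta_observed = np.mean(diffs)
    s_d = np.std(diffs, ddof=1)                 # std of paired differences
    effect = abs(delta_observed) / s_d          # standardized effect size
    n = TTestPower().solve_power(effect_size=effect, alpha=alpha,
                                 power=1 - beta, alternative='larger')
    return int(np.ceil(n))

rng = np.random.default_rng(1)
print(min_paired_sample_size(rng.normal(0.05, 1.0, size=30)))
```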

Also, in real-world problems, one might not be interested in correctly detecting very small differences between means. If \(x_{1}\) and \(x_{2}\) have very similar performance and \(\delta _{observed}\) is smaller than a certain user-defined \(\delta \), the comparison is done heuristically by considering only the values of \(\hat{f}_n(x_{1})\) and \(\hat{f}_n(x_{2})\). The value of \(\delta \) is expressed as a percentage of the current best solution, because in many cases the user does not know a priori the best possible result which can be obtained by the optimization.

3 Optimization Algorithms

The optimization algorithms employed in the experiments are Random Search (RS), the Reactive Affine Shaker (RAS) [11] and the Covariance Matrix Adaptation Evolutionary Strategy (CMA-ES) [19, 20]. RS is a simple stochastic local search algorithm which is often used as baseline for comparisons, while RAS and CMA-ES are more advanced stochastic schemes which adapt step size and direction of the search during the optimization.

In RS, a new candidate solution \(x_{new}\) is sampled uniformly from a neighborhood of the current best solution \(x_{current}\). A step size \(\sigma \) defines the boundaries of the local search region around \(x_{current}\), as a percentage of the intervals which define \(\varTheta \) along each dimension. Consequently, different step sizes correspond to search policies with different levels of locality: a step size of 1 would make the search global, and the optimization would correspond to pure random search.
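
A minimal sketch of the RS proposal step follows; clipping the box to \(\varTheta \) is our assumption about boundary handling:

```python
import numpy as np

def rs_propose(x_current, theta_low, theta_high, sigma, rng):
    """Sample x_new uniformly in a box around x_current whose half-width
    along each dimension is sigma times the width of Theta there."""
    half = sigma * (theta_high - theta_low)
    low = np.maximum(x_current - half, theta_low)    # stay inside Theta
    high = np.minimum(x_current + half, theta_high)
    return rng.uniform(low, high)

rng = np.random.default_rng(0)
bounds = np.full(2, -5.12), np.full(2, 5.12)
print(rs_propose(np.zeros(2), *bounds, sigma=0.1, rng=rng))
```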

In RAS, the local search region is adapted by an affine transformation. The aim is to scout for local minima in the attraction basin where the initial point falls. The step size \(\sigma \) and the direction of the search region are adapted in order to heuristically maintain the largest possible movement per function evaluation. The search generates points stochastically, with uniform probability in the search region, following a double-shot strategy: a single displacement \(\varDelta \) is generated, and the two mirrored points \(x_{current} + \varDelta \) and \(x_{current} - \varDelta \) are considered for evaluation. An evaluation is successful if the objective function value of at least one of the two candidates is better than \(\hat{f}(x_{current})\). The search region is modified according to the outcome of the comparisons: it is compressed if both comparisons are unsuccessful, and expanded otherwise. In both cases, the search region is modified according to an expansion factor \(\rho > 1\).
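
As a simplified sketch of one double-shot step, on an axis-aligned box rather than the full affine transformation adapted by RAS proper [11]:

```python
import numpy as np

def ras_double_shot(x_current, f_hat, box, rho, rng):
    """One double-shot step. `box` holds the half-widths of the (here
    axis-aligned) search region; RAS proper adapts an affine transform."""
    delta = rng.uniform(-box, box)
    cand = [x_current + delta, x_current - delta]
    vals = [f_hat(c) for c in cand]
    if min(vals) < f_hat(x_current):          # success: move and expand
        return cand[int(np.argmin(vals))], rho * box
    return x_current, box / rho               # double failure: compress
```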

CMA-ES [19, 20] is an evolutionary optimization paradigm in which configurations are sampled from a multivariate normal distribution \(N(\mu _t,M_t)\), where t denotes the iteration of the algorithm. At each iteration, the mean \(\mu _t\) defines the center of the distribution, the covariance matrix \(M_t\) determines the shape and orientation of the ellipsoid corresponding to \(N(\mu _t,M_t)\), and a step size \(\sigma _t\) controls the spread of the distribution as a percentage of the intervals which define each dimension of \(\varTheta \). The ellipsoid is the local search region used by the algorithm to explore the search space and propose candidate solutions. Iteratively, CMA-ES follows four steps. First, it samples a fixed number \(\lambda \) of new configurations from \(N(\mu _t,M_t)\), creating a population. Second, candidates are evaluated and ranked according to the quality of the evaluations. Third, the best \({\lfloor }\frac{\lambda }{2}{\rfloor }\) results are used to update \(N(\mu _t,M_t)\), in order to move the search towards the most promising search direction. Fourth, \(\sigma _t\) is increased or decreased according to the length of the so-called evolution paths, in order to maximize the expected improvement of the optimization. This last step is explained in more detail in the following subsection. Also, CMA-ES has been extended with an uncertainty handling (UH) method, to deal with possible noise in the objective function [18]. In this version of CMA-ES, referred to as UH-CMA-ES, the uncertainty is measured by rank changes among members of the population. Once each solution of the population has been evaluated and ranked, a few additional evaluations are taken and the population is ranked again. By doing so the algorithm tries to estimate the amount of noise in the evaluations, in order to increase \(\sigma _t\) and prevent the signal-to-noise ratio from becoming too low.
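
The following ask/tell loop, written against the widely used `cma` Python package (which may differ from the library used in our experiments, see Sect. 4.2), illustrates the four steps on the Sphere function:

```python
import cma  # pip install cma

def sphere(x):
    return sum(xi ** 2 for xi in x)

es = cma.CMAEvolutionStrategy([2.0, -1.5], 1.0,
                              {'bounds': [-5.12, 5.12], 'popsize': 6})
while not es.stop():
    X = es.ask()                          # step 1: sample lambda candidates
    es.tell(X, [sphere(x) for x in X])    # steps 2-4: rank, update N(mu, M),
                                          # adapt sigma via evolution paths
print(es.result.xbest)
```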

3.1 A Note on CMA-ES Step-Size Adaptation

The adaptation of \(\sigma _t\), also called cumulative step-size adaptation, is based on the evolution paths mentioned in Sect. 3. An evolution path is a weighted vector sum of the last steps successively taken by the algorithm. It provides information about the correlation between consecutive steps, and it can be used to detect the direction of consecutive steps taken by the optimization. If consecutive steps go in the same direction (scalar product greater than zero), the same distance could be covered by fewer but longer steps, and the current path is too long. If consecutive steps do not go in the same direction (scalar product lower than zero), single steps tend to cancel each other out, and the current path is too short. Therefore, to make successive steps more efficient, \(\sigma _t\) is changed accordingly. The step size determines the signal strength used by CMA-ES to estimate the direction of the gradient: if the steps of the algorithm are very small, the signal is also likely low, and the signal-to-noise ratio becomes small as well. Also, it has been shown that in noisy optimization the cumulative step-size adaptation may result in premature convergence [6].
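
Omitting the covariance normalization and recombination weights of the full update in [19, 20], and with illustrative constants, the adaptation could be sketched as:

```python
import numpy as np

def csa_update(sigma, p_sigma, mean_step, c_sigma=0.3, d_sigma=1.0):
    """Simplified cumulative step-size adaptation sketch.

    mean_step is (mu_{t+1} - mu_t) / sigma_t. Correlated consecutive
    steps make the path longer than that of a random walk, so sigma
    grows; anti-correlated steps shorten the path, so sigma shrinks.
    """
    n = len(p_sigma)
    p_sigma = (1 - c_sigma) * p_sigma \
        + np.sqrt(c_sigma * (2 - c_sigma)) * mean_step
    expected = np.sqrt(n) * (1 - 1 / (4 * n) + 1 / (21 * n ** 2))
    sigma *= np.exp((c_sigma / d_sigma)
                    * (np.linalg.norm(p_sigma) / expected - 1))
    return sigma, p_sigma
```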

4 Experiments

Before showing the results about the performance of different randomized algorithms in various noisy scenarios and higher dimensions, a preliminary study on CMA-ES is presented. In order to investigate the effects of implicit averaging, the algorithm is tested using populations larger than the standard size proposed by its authors.

4.1 Benchmarking Functions and Noise Models

In order to test diverse heuristic strategies in the presence of noise, the Sphere and Rastrigin functions have been extended with multiple types and levels of noise. As in other works in the literature, both functions are optimized in \([-5.12,5.12]^d\), where d is the number of dimensions. To evaluate the impact of noise on the optimization, a standard practice in the literature is to extend deterministic functions by introducing multiplicative or additive noise. In the case of multiplicative noise, a percentage \(\epsilon \) of f(x) is added to f(x), with a displacement generated from a standard normal distribution:

$$\begin{aligned} f(x,\epsilon ) = f(x) + f(x) \cdot \epsilon \cdot N(0,1). \end{aligned}$$
(4)

This kind of noise is typical of devices which take physical measurements, like speed cameras, where values are guaranteed to be accurate up to a certain percentage of the measured quantity. However, as the optimization proceeds towards lower values, the noise decreases. This means that, as the optimization approaches the global optimum \(x^*\), it is easier to move in the right direction and, if \(f(x^*) = 0\), there is almost no noise in the proximity of the global optimum.
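
In code, the multiplicative model of (4) is essentially a one-liner; a sketch:

```python
import numpy as np

def multiplicative_noise(f, x, eps, rng):
    """Noisy evaluation under Eq. (4): the perturbation scales with f(x)."""
    fx = f(x)
    return fx + fx * eps * rng.standard_normal()
```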

Although such a situation holds in many real-world scenarios, there exist other problems where the noise does not go to zero as the global optimum (if any) is approached. As examples, consider the optimization of simulation models, or the tuning of the hyperparameters of machine learning algorithms. In this case, the noise can be simulated by adopting additive noise. However, determining the amount of noise to add is up to the practitioner. In fact, additive noise is usually normally distributed with zero mean and constant standard deviation \(\sigma _{\epsilon }\). Since this kind of perturbation does not depend on the signal, the signal-to-noise ratio causes problems only when approaching the minimum, and its effects are going to differ considerably from function to function.

To avoid these drawbacks, a possibility is to define additive noise as normally distributed with zero mean and dynamic standard deviation. Since the step size used by randomized algorithms impacts the strength of the signal, in order to test harder noise scenarios it makes sense to set the noise level according to the step size. So, given a point x in \(\varTheta \) and a percentage \(\epsilon \), the dynamic standard deviation \(\sigma _i\) along each dimension i is computed as follows. Compute lower bound \(l_i = x_i - \epsilon \cdot \varTheta _i\) and upper bound \(u_i = x_i + \epsilon \cdot \varTheta _i\), where \(\varTheta _i\) is the interval in which the function is defined along dimension i. Then, find the minimum m and the maximum M among \(\{f(l_i),f(x_i),f(u_i)\}\). Finally, \(\sigma _i = |m- M|\) and the additive noise model is defined as

$$\begin{aligned} f(x,\epsilon ) = f(x) + \sum _{i=1}^{N} \mathcal {N}(0, \frac{\sigma _i}{k} ), \end{aligned}$$
(5)

where N is the dimensionality of the function, \(\mathcal {N}\) denotes the normal distribution, and k is a constant used to control the amount of noise. In the experiments, \(\epsilon = 0.1\) and \(k \in \{1, 2, 3, 6\}\). Therefore, while using this model of noise, the distortion of the signal is set according to the maximum signal which can be detected by an algorithm which adopts a fixed step size (like RS). With \(k=6\), \(99.7\%\) of the noise is generated within the intervals which define the local search region. With \(k=3\), \(k=2\) and \(k=1\) the same is true for \(81.86\%\), \(68.27\%\) and \(34.14\%\) of the noise, respectively.
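
A sketch of the dynamic additive model follows; how the bounds \(l_i\) and \(u_i\) enter f is our interpretation (the full point x with only coordinate i displaced):

```python
import numpy as np

def dynamic_additive_noise(f, x, theta_low, theta_high, eps=0.1, k=3, rng=None):
    """Noisy evaluation under Eq. (5) with dynamic standard deviations."""
    if rng is None:
        rng = np.random.default_rng()
    x = np.asarray(x, dtype=float)
    width = theta_high - theta_low               # Theta_i per dimension
    noise = 0.0
    for i in range(len(x)):
        probes = []
        for shift in (-eps * width[i], 0.0, eps * width[i]):
            p = x.copy()
            p[i] += shift                        # l_i, x_i, u_i
            probes.append(f(p))
        sigma_i = max(probes) - min(probes)      # |m - M|
        noise += rng.normal(0.0, sigma_i / k)
    return f(x) + noise

sphere = lambda z: float(np.sum(np.asarray(z) ** 2))
low, high = np.full(2, -5.12), np.full(2, 5.12)
print(dynamic_additive_noise(sphere, [1.0, -2.0], low, high,
                             rng=np.random.default_rng(0)))
```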

4.2 Setup

Each experiment is based on 100 macroreplications, where each optimization process has a budget (number of function evaluations) of 5000. Initial solutions are generated according to a uniform distribution defined on the interval of each dimension of \(\varTheta \). In each experiment, algorithms start the optimization from a randomly generated solution \(x_0\) and consider a local search region of the same size defined around \(x_0\). Then, the local search region is iteratively modified according to the algorithm. Apart from CMA-ES and UH-CMA-ES, the algorithms are employed in two versions: with a naive scheme which uses 1 evaluation for each solution, and with the reactive scheme proposed by [14]. The acronym of each algorithm is preceded by N in the former case and by R in the latter. As suggested in [14], the parameter values are set as \(\alpha _{req} = 0.1\), \(\beta _{req} = 0.4\) and \(\delta = 0.01\).

In the first set of experiments on CMA-ES, restarts are not considered. In fact, the standard implementation of CMA-ES includes various stopping criteria and restart policies [20], but they have been deactivated in order to improve the analysis of the different components of the algorithm. When activated, all algorithms use global restarts based on a single stopping criterion: if the current best solution does not improve by at least \(10 \%\) in \(k = 500\) function evaluations, a restart is done. An exception is given by RAS, which may also need to be restarted because of its double-shot strategy. In fact, if \(x_0\) is generated near the boundaries of the search space, the double-shot strategy might be unable to generate a valid configuration.

RS uses \(\sigma \in \{0.1,0.2\}\), while RAS employs \(\sigma \in \{0.1,0.2\}\) and \(\rho = 2\). CMA-ES and UH-CMA-ES adopt \(\sigma \in \{1.0,2.0\}\), because the library used to implement the algorithm defines \(\sigma \in (0, 10]\). Also, \(\lambda = 4 + {\lfloor }3 \log d{\rfloor }\), where d is the dimensionality of the objective function. The values of \(\rho \) and \(\lambda \) are the ones suggested by the authors of [11] and [20], respectively.

4.3 Average Loss Signal per Iteration

In a noisy scenario, in order to understand how the algorithm behaves with populations of different sizes, a possible way to proceed is to measure the magnitude of the error made when estimating the gradient. To do so, at each iteration, the population of candidate solutions \(p = \{x_1, ..., x_m\}\) is ranked in two ways: first, according to the noisy ranking \(\hat{r}\) based on \(\hat{f}_1(x_1), ..., \hat{f}_1(x_m)\); second, according to the noiseless ranking r defined by \(f(x_1), ..., f(x_m)\). Then, the signal loss L is defined as the difference between the signals of these two rankings, where the signal of a ranking is the sum of the absolute differences among the ordered values used for ranking. More formally, in the case of \(\hat{r}\) and r,

$$\begin{aligned} \hat{s}(\hat{r},p) = \sum _{i = 1}^{m-1} |\hat{f}_1(x_i) - \hat{f}_1(x_{i+1})| \end{aligned}$$
(6)

and

$$\begin{aligned} s(r,p) = \sum _{i = 1}^{m-1} |f(x_i) - f(x_{i+1})|, \end{aligned}$$
(7)

where the candidates \(x_1, ..., x_m\) are ordered according to \(\hat{r}\) and r, respectively. Therefore,

$$\begin{aligned} L(\hat{r},r,p) = \hat{s}(\hat{r},p) - s(r,p) \end{aligned}$$
(8)

and the average signal loss per iteration is defined as

$$\begin{aligned} \mathbb {E}(L) = \frac{1}{N} \sum _{i = 1}^N L(\hat{r}_i,r_i,p_i), \end{aligned}$$
(9)

where N is the number of iterations of the algorithm. Since the optimization estimates the gradient according to \(\hat{r}\), this procedure measures by how much the optimization moves in the wrong direction. Also, by comparing \(\hat{r}\) and r, the number of misrankings among each population's candidates can be computed, as well as the average percentage of misrankings \(\mathbb {E}(M)\) over the whole optimization.
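
A sketch of both quantities for a single population follows; since the text does not fully specify how misrankings are counted, the sketch counts discordant pairs, which is one possible choice:

```python
import numpy as np

def signal_loss(noisy_vals, true_vals):
    """L of Eq. (8): each ranking sorts the population by its own values,
    and the signal (Eqs. (6)-(7)) is the sum of consecutive absolute gaps."""
    s_hat = np.sum(np.abs(np.diff(np.sort(noisy_vals))))   # Eq. (6)
    s = np.sum(np.abs(np.diff(np.sort(true_vals))))        # Eq. (7)
    return s_hat - s

def misranking_pct(noisy_vals, true_vals):
    """Percentage of candidate pairs ordered differently by r_hat and r."""
    m = len(true_vals)
    discordant = sum(
        (noisy_vals[i] - noisy_vals[j]) * (true_vals[i] - true_vals[j]) < 0
        for i in range(m) for j in range(i + 1, m))
    return 100.0 * discordant / (m * (m - 1) / 2)
```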

Table 1. Performance of CMA-ES with \(\sigma = 1.0\) and \(\lambda \in \{6,36,60\}\) on the Sphere function in 2 dimensions. Macrocolumns show respectively the results obtained when using multiple levels of constant additive noise, multiplicative noise and dynamic additive noise. In each macrocolumn, the average best results \(f(x^*)\) are in bold and all values based on f(x) are normalized by the number of dimensions.

4.4 Results When Using Larger Populations in CMA-ES

This set of experiments is based on the bidimensional Sphere function. As Table 1 shows, \(\mathbb {E}(L)\) keeps increasing as the population grows. Larger populations could potentially detect more signal, but are unable to do so. In fact, as Figs. 1a, 1b and 1c show, larger populations contribute to maintaining a larger step size, and so to making wider steps. However, as the amount of noise increases, misrankings happen first among similarly ranked candidates and then also between solutions ranked farther away from each other. Also, the magnitude of the errors increases with the noise. Consequently, larger populations tend to lose more signal and to guide the optimization farther away from the true direction of the gradient.

Fig. 1. The two rows show respectively the average variation of the CMA-ES step size (1a-1c) and of the CMA-ES covariance matrix volume (1d-1f), with \(\sigma = 1.0\) and \(\lambda \in \{6,36,60\}\). Each column refers to one of the types of noise used in Table 1: constant additive noise with \(\sigma =1.0\) (first column), multiplicative noise with \(\epsilon = 0.2\) (second column) and dynamic additive noise with \(k = 3\) (third column).

The effect of the misrankings depends on the amount of noise. Even if \(\mathbb {E}(M)\) increases with the population size, this does not imply that the performance of the optimization deteriorates. For example, Table 1 shows that in the case of constant additive noise, larger populations obtain better results. This happens because the signal-to-noise ratio is sufficiently high to avoid misrankings which would guide the optimization towards a significantly wrong direction. The amount of noise starts creating problems only when approaching the minimum, as shown in Fig. 2. With this amount of noise, even N-RS is able to perform comparably well. In contrast, in the case of dynamic additive noise, the signal-to-noise ratio is approximately the same throughout the search space and results are expected to deteriorate much more, as confirmed by the results in Table 1. Even in the case of multiplicative noise, larger populations possibly worsen the performance of the optimization. Therefore, when the signal-to-noise ratio is low and misrankings happen between positions farther apart, increasing the population size or the number of parents is not going to significantly improve the robustness of the optimization. In this case, it is preferable to increase the sample size used to estimate each candidate solution.

It is also worth noticing the premature convergence of the covariance matrix. Figures 1d, 1e, 1f show that the volume of the covariance matrix goes to zero in the first part of the optimization. After that, CMA-ES is no longer able to propose significantly different solutions.

Fig. 2. Average convergence of CMA-ES with \(\sigma = 1.0\) and \(\lambda = 60\), and of N-RS with \(\sigma = 0.1\), on the bidimensional Sphere function with multiple levels of constant additive noise. Lines represent the mean noiseless value of the best configuration found during the optimization (which is not known by the optimizers). All evaluations are normalized by the number of dimensions.

Fig. 3. Average convergence of the optimizers on the Sphere function, with \(d=10\) and \(k=3\). The results obtained by using a very small step size are also shown.

4.5 Results When Using Larger Sample Size

These experiments are based on both the Sphere and Rastrigin functions, with \(d \in \{2,10\}\). Results in Tables 2 and 3 show that, with high levels of noise, a simple optimization algorithm such as R-RS performs better than more complex algorithms like RAS, CMA-ES or UH-CMA-ES. Without increasing the sample size of the estimators, using a population of solutions cannot compete with single-point algorithms which adapt the sample size of estimators according to empirical evidence. Furthermore, in this context, UH-CMA-ES might even worsen the performance. As shown in Fig. 3d, increasing the step size according to observed misrankings provides better results when the initial step size is very low with respect to the distortion caused by the noise.

Figure 3a shows that a larger step size can improve the efficiency of the optimization. On average, compared solutions correspond to more different estimates, and a lower sample size is required to make statistically significant comparisons. However, since a larger step size also implies a coarser sampling granularity, the optimization emphasizes the global search phase and the convergence speed tends to decrease. For the same reason, as shown in Fig. 3b, the effectiveness of the double-shot strategy in a noisy environment is questionable. Although such an approach can be a good strategy for deterministic optimization, at each iteration very similar configurations are compared, so the signal-to-noise ratio tends to be low and the sample size required to make statistically significant comparisons increases. Furthermore, as the search region is compressed, this effect is further enhanced.

In noisy scenarios, step-size adaptation mechanisms and adaptations of the search space are potentially counterproductive. In deterministic functions, compressions of the search region usually lead to a better exploitation of the local structure of the objective function. However, because of the presence of noise, decisions to compress the search region might be wrong and therefore the optimization might prematurely converge to false local optima.

In larger dimensions, the situation changes in the case of the Rastrigin function with lower levels of noise. However, a population-based algorithm is expected to perform better than single-point algorithms on a multimodal function like Rastrigin. Combining a set of candidate solutions at each iteration gives the algorithm the ability to adapt to the local topology of the objective function, reducing the risk of getting stuck in local optima.

5 Conclusions

This paper investigated different components of diverse heuristic strategies in the context of noisy optimization. A preliminary study on the bidimensional Sphere function showed how the implicit averaging effect of population-based algorithms does not always improve the optimization as the size of the population is increased, and analyzed how different amounts of noise change the impact of this effect on the optimization. Randomized algorithms not based on populations have been extended with a statistical analysis technique [14] to deal with the presence of noise, and they have been compared with CMA-ES and UH-CMA-ES.

Table 2. Average best objective function value found by different optimizers, with different levels of dynamic additive noise added to the Sphere function. In each column, the best results are in bold and all values are normalized by the number of dimensions.
Table 3. Average best objective function value found by different optimizers, with different levels of dynamic additive noise added to the Rastrigin function. In each column, the best results are in bold and all values are normalized by the number of dimensions.

Results in Sect. 4.4 confirm the findings of [17]. The analysis provides an explanation of why larger populations do not always improve the optimization, and of why a higher sample size should be preferred when the signal-to-noise ratio is too low. Furthermore, these results also agree with [2]: in the presence of noise, step length control mechanisms are crucial to the performance of the optimization. If optimization methods are extended with statistical analysis techniques such as [14], the resolution at which the search space is explored matters significantly. With lower step sizes, solutions correspond to more similar function values, and the sample size required to statistically detect a difference increases.

Future work aims at extending current results in order to consider other derivative-free optimization strategies and more benchmarks. Also, it would be worth investigating the performance of more global search policies, which iteratively compare configurations located farther away in \(\varTheta \) and search more locally in different parts of \(\varTheta \) only if sufficient empirical evidence to do so is observed. Approaches like CoRSO [12] or Bayesian Optimization [30] could be good choices.