One goal of this book is to empirically answer the question of how efficient ES are in a setting of few function evaluations with a focus on modern ES from Sect. 2.2.2. This chapter addresses the experiments conducted and is organized as follows. Section 4.1 introduces two measures to evaluate the efficiency of ES, the fixed cost error (FCE) and the expected run time (ERT). Section 4.2 describes how the experiments were conducted technically and how they are examined. The results are presented and discussed in Sect. 4.3.
4.1 Measuring Efficiency
An ES is considered efficient in this book if it approaches the optimum f∗ (see Eq. 2.4) quickly, i.e., by using as few function evaluations as possible. In order to compare different ES, a measure of efficiency for the convergence properties of an ES is needed. Figure 4.1 shows a sample convergence plot for five optimization runs (see Note 1) of an ES. The x-axis of the convergence plot represents the number of function evaluations. The y-axis shows the base-ten logarithm of the difference between the currently best function value and the optimum f∗. This difference will be called Δf∗ in the following. For a plus-strategy the graph is monotonically decreasing. To achieve monotonicity for a comma-strategy as well, one uses the best evaluated individual found so far for the calculation of Δf∗ instead of the best individual of the current generation.
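The best-so-far bookkeeping described above can be sketched as follows. This is a minimal Python illustration (the function name is ours, not from the book's Octave code); it assumes the best fitness stays strictly above f∗ so the logarithm is defined.

```python
import math

def best_so_far_delta_f(history, f_star):
    """Turn a sequence of per-evaluation fitness values into the
    monotonically decreasing log10(delta f*) curve used in the
    convergence plots. Assumes every best value stays above f_star."""
    curve, best = [], math.inf
    for f in history:
        best = min(best, f)  # best evaluated individual so far
        curve.append(math.log10(best - f_star))
    return curve
```

For a plus-strategy the raw fitness history is already monotone, so the running minimum changes nothing; for a comma-strategy it supplies the monotonicity described in the text.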
Appendix D.3 in [34] describes two opposing approaches for deriving an efficiency measure from these convergence plots: on the one hand the fixed-cost view, on the other the fixed-target view. The fixed-cost view operates with a fixed number of function evaluations and yields the fixed-cost error (FCE) covered in detail in Sect. 4.1.1. The fixed-target view leads to the expected run time (ERT) measure, which is used in the BBOB benchmarking framework and is described in Sect. 4.1.2.
4.1.1 The FCE Measure
FCE measures Δf∗ given a fixed number of function evaluations. Considering the convergence plot in Fig. 4.1, this approach is implemented graphically by drawing a vertical line. The FCE values are represented by the intersections between the convergence graphs and the vertical line. FCE is relevant for industrial applications demanding a maximal run-time and hence a fixed number of function evaluations. In [34] the relevance of FCE is acknowledged, but its use is rejected because it does not allow for a quantitative interpretation: the ratio between two Δf∗ values from two algorithms cannot quantify how much better one algorithm is than the other. Nevertheless, a qualitative interpretation is possible. On the basis of FCE one can answer the question of which algorithm yields a smaller FCE and is therefore better. Since optimization runs with ES are influenced by random effects, both during initialization and during the run, the FCE of an algorithm has to be measured over many independent optimization runs. Then the FCE values of different algorithms can be analyzed with statistical techniques to find significant differences in quality. This analysis is described in Sect. 4.2.3.
4.1.2 The ERT Measure
BBOB uses ERT as the measure for benchmarking algorithms. It was introduced in [49] as the expected number of function evaluations per success and further analyzed under the name success performance (SP) [4]. ERT is the expected number of function evaluations needed to reach a given Δf∗. Graphically, ERT corresponds to the intersection between a convergence graph as shown in Fig. 4.1 and a horizontal line representing a fixed Δf∗. With this approach there might be optimization runs which do not reach the given Δf∗ within a finite number of function evaluations. These runs are considered unsuccessful and are rated with the run-time r_us. A successful run is rated with the number of function evaluations needed to reach Δf∗, i.e., the run-time r_s. The ratio of successful runs to all runs yields the success rate p_s. Let R_s and R_us be the mean values of the r_s and r_us over different runs; then for p_s > 0 ERT is defined as:

ERT = R_s + ((1 − p_s) / p_s) · R_us.
If there are unsuccessful runs, i.e., p_s < 1, then ERT strongly depends on the termination criteria of the algorithm and the value of r_us. Considering optimization runs with very few function evaluations easily leads to p_s = 0 when using common values for Δf∗. So, to use ERT in this scenario, appropriate values for Δf∗ have to be found first.
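A sketch of the ERT computation from raw run data, using the standard BBOB definition ERT = R_s + ((1 − p_s)/p_s) · R_us. The names and data layout are our own illustration:

```python
def expected_run_time(run_times, successes):
    """ERT from per-run evaluation counts.

    run_times[i] is r_s for a successful run (evaluations until the
    target delta f* was reached) or r_us for an unsuccessful run
    (evaluations until termination); successes[i] marks which is which.
    Returns None when p_s = 0, where ERT is undefined.
    """
    n_s = sum(successes)
    if n_s == 0:
        return None
    p_s = n_s / len(run_times)
    r_s = [t for t, ok in zip(run_times, successes) if ok]
    r_us = [t for t, ok in zip(run_times, successes) if not ok]
    R_s = sum(r_s) / len(r_s)
    R_us = sum(r_us) / len(r_us) if r_us else 0.0
    return R_s + (1 - p_s) / p_s * R_us
```

For p_s = 1 the second term vanishes and ERT is simply the mean successful run-time; as p_s drops, unsuccessful run-times are amortized over the successes, which is why ERT depends so strongly on r_us.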
4.2 Experiments
4.2.1 Selection of Algorithms
Not all modern ES algorithms covered in Sect. 2.2.2 were subject to an empirical analysis. Since the focus of this book is on optimization runs with very few function evaluations, the restart heuristics were omitted. They can be better analyzed by long-running optimizations. Interesting results from such runs conducted by the authors of the algorithms are summarized in Sect. 3.2.2. In addition, five algorithms developed before 1996, described in Sect. 2.2.1, are included in the experiments. The complete list of algorithms which were used in the experiments is shown in Table 4.3 in Sect. 4.2.2.2.
4.2.2 Technical Aspects
4.2.2.1 Framework
The experiments are performed using the framework BBOB 10.2 [34]. It provides standardized test functions and interfaces to the programming languages C, C++, Java and Matlab/Octave. Having an implementation of an algorithm in one of these languages allows us to conduct experiments with minimal organizational effort on a set of test functions F, a set of function instances FI and a set of dimensions D for the input space. The set FI controls the number of independent runs for a test function. Let R = F × D × FI; then |R| runs are conducted in total. A run, i.e., an element of R, yields a table indexed by the number of function evaluations used, containing information regarding the optimization run. This information comprises the difference between the current noise-free fitness and the optimum, and the difference between the best measured (see Note 2) noise-free fitness and the optimum. For small dimensionality the input values x yielding the current fitness are recorded as well. BBOB provides Python scripts for post-processing these tables.
Runs are conducted on all 24 noise-free test functions. A detailed description of the test functions can be found on the BBOB web page (see Note 3). The global optima of all test functions are located inside the hyperbox [−5, 5]^n. The test functions can be classified by different features. A test function can be uni- or multimodal, i.e., having only one or multiple (local) optima. Multimodal functions allow the global optimization capabilities of an algorithm to be benchmarked. Furthermore, a test function can be symmetric, i.e., invariant under rotations of the coordinate system. The condition of a function can be interpreted as a reciprocal measure of its symmetry and depends on the condition of an optimal covariance matrix (see Sect. 2.2.2.2). A more vivid description is that a function with a high condition has a fitness landscape with very steep valleys. Table 4.1 provides a summary of all 24 test functions with their commonly used names and some of their features.
Based on these features, test functions can be classified; the distinctions between separable and non-separable as well as between unimodal and multimodal functions are of special interest. Table 4.2 shows the distribution of the test functions across these four classes.
Unimodal test functions have a unique optimum, which makes them suitable for testing the convergence properties of an algorithm without interference stemming from stagnation in local optima. Multimodal test functions are especially useful for testing restart heuristics or algorithms designed to escape a local optimum. Since real fitness functions are usually multimodal, multimodal test functions comply better with real-world optimization scenarios than unimodal ones.
Separable functions allow the optimization run to be split into n independent one-dimensional optimizations. In contrast to this, non-separable functions cannot be optimized this way and for them it is advantageous to apply an ES using correlated mutations. In general, non-separable multimodal functions are far more difficult to solve and hence serve better as an idealization of real-world problems.
According to [34], 15 runs are sufficient to observe significant differences when comparing two algorithms. For the analysis based on the FCE measure a best-of-n approach (described in Sect. 4.2.3) is used. In order to observe significant differences with this approach as well, the number of runs per test function, defined by the function instances in BBOB, is increased to 100. BBOB recommends a maximum number of function evaluations of 10^6 ⋅ n. Since the focus is on optimization tasks allowing only very few function evaluations, a drastically decreased maximum number of function evaluations of 500 ⋅ n is chosen. The experiments are conducted with dimensions n ∈ {2, 5, 10, 20, 40, 100}. For the dimensions n = 40 and n = 100 the maximum number of function evaluations is reduced to 10^4. The initial search point is drawn uniformly from the hyperbox [−5, 5]^n.
4.2.2.2 Software for ES Algorithms
The BBOB framework is used with its interface to the Matlab/Octave programming language. Where publicly available implementations (see Note 4) by the authors of the ES algorithms exist, they are used. For most of the ES an original implementation was created. Table 4.3 provides an overview of the implementations used.
All original implementations (see Note 5) realize the pseudocode of the algorithms, as listed in Chap. 2, in Octave [23]. Furthermore, these implementations are capable of constraining the search space to a hyperbox (see Eq. 2.5). For this purpose a transformation as described in [43] is applied individually to the coordinates of a search point. The transformed value x′ of x ∈ ℝ subject to the lower bound l and the upper bound u is calculated as follows: with y = (x − l) mod 2(u − l), set x′ = l + y if y ≤ u − l, and x′ = l + 2(u − l) − y otherwise.
Simply speaking, the transformation performs a reflection at the bounds. An optimization run is terminated if either the maximum number of function evaluations is reached or the fitness falls below a given target value. These two values can be configured by parameters in all the original implementations. The exogenous parameters of the different ES algorithms are configured with their default settings as described in Sect. 2.2.
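The reflection at the bounds can be sketched per coordinate as follows. This is a plausible reading of the transformation described in the text; the exact formulation in [43] may differ in details, and the function name is ours:

```python
def reflect_into_box(x, lower, upper):
    """Map an unconstrained coordinate into [lower, upper] by
    reflecting it back and forth at the bounds, i.e., a triangle
    wave with period 2 * (upper - lower)."""
    width = upper - lower
    y = (x - lower) % (2.0 * width)  # position within one full period
    return lower + y if y <= width else lower + 2.0 * width - y
```

For example, with bounds [−5, 5] a coordinate of 6 (one unit beyond the upper bound) is reflected back to 4, while points already inside the box are left unchanged.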
4.2.3 Analysis
In the following, the procedure for evaluating the empirical test results is outlined.
4.2.3.1 Calculating FCE from Empirical Results
The basis for the calculation of FCE is the tables described in Sect. 4.2.2.1, which are called BBOB data in the following. The BBOB data contains tuples (#fe, Δf∗), which consist of #fe, the number of function evaluations performed, and Δf∗, the so-far best (see Note 6) difference from the optimum f∗. There is not necessarily a tuple for every #fe ∈ {1, …, #fe_max} in the BBOB data. Let I ⊂ {1, …, #fe_max} be the subset of existing #fe values in the BBOB data and C_t the target costs; then FCE is calculated as follows:

FCE(C_t) = Δf∗(max{#fe ∈ I | #fe ≤ C_t}).
Simply speaking, the FCE for a specific C_t is based on the closest #fe available in the BBOB data which is smaller than or equal to C_t. In this way, the performance of an algorithm might be underestimated but is never overestimated.
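The lookup described above might be implemented like this (a sketch; the actual BBOB post-processing scripts are also in Python but organized differently):

```python
def fce(bbob_data, target_costs):
    """Fixed-cost error: the best-so-far delta f* at the largest
    recorded evaluation count not exceeding the target costs.

    bbob_data is a list of (num_evals, delta_f_star) tuples; not
    every evaluation count has to be present."""
    eligible = [(fe, df) for fe, df in bbob_data if fe <= target_costs]
    if not eligible:
        return None  # no record at or below the target costs
    return max(eligible)[1]  # delta f* of the largest num_evals
```

Because the lookup rounds the evaluation count downwards, the returned Δf∗ is at least as large as the true value at C_t, which is exactly the conservative behaviour described above.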
4.2.3.2 Calculating Rankings
Conducting m = 100 runs for each test function f and each dimension n yields a set E(f, n, C_t) containing m FCE(C_t) values. For each algorithm a the sets E(f, n, C_t)_a can be analyzed pairwise with statistical tests [36] to find significant differences in their FCE. We use unpaired Welch t-tests [69] to decide whether one algorithm is better than another (see Note 7). The difference between the mean of E(f, n, C_t)_1 and the mean of E(f, n, C_t)_2 is considered significant for a p-value < 0.05; the algorithm with the better mean FCE is declared the winner and receives a point. Doing this pairwise for all algorithms, the algorithms are ranked according to the number of points obtained.
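The pairwise ranking can be sketched as follows. In practice one would use a statistics package such as R [50] (as the authors did, see Note 7) or `scipy.stats.ttest_ind(equal_var=False)`; the sketch below computes the Welch statistic directly and uses a normal approximation for the p-value, which is adequate for samples of size m = 100. All names are our own illustration:

```python
import math
from statistics import mean, variance

def welch_p_value(a, b):
    """Two-sided p-value of Welch's unequal-variance t-test,
    approximating the t distribution by a standard normal (fine
    for large samples such as m = 100)."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    return math.erfc(abs(t) / math.sqrt(2.0))

def rank_by_wins(fce_sets):
    """fce_sets maps algorithm name -> list of FCE values (lower is
    better). Returns the algorithms sorted by significant wins."""
    points = {name: 0 for name in fce_sets}
    names = list(fce_sets)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if welch_p_value(fce_sets[a], fce_sets[b]) < 0.05:
                winner = a if mean(fce_sets[a]) < mean(fce_sets[b]) else b
                points[winner] += 1  # smaller mean FCE wins the pair
    return sorted(points, key=points.get, reverse=True)
```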
In [24] two relevant optimization scenarios are described. In the first one the user has the opportunity to choose the best run out of several runs. For this purpose an algorithm with a good peak performance is appropriate, i.e., an algorithm which sometimes performs very well but whose general performance is highly variable. In the second scenario only one optimization run is done. This requires an algorithm to perform well without much variation. This kind of performance is called the average performance of an algorithm. To reflect these two scenarios in our analysis, we use a best-of-k approach: instead of using all m runs to create the set E(f, n, C_t), only the best out of every k runs is used. This reduces the cardinality of E(f, n, C_t) to ⌊m/k⌋. The analysis regarding the average performance of an algorithm is done with k = 1. For the peak performance an appropriate k has to be chosen: the resulting set E(f, n, C_t) must not be too small in order to apply statistical tests for significant differences. We choose a best-of-k approach with k = 5 to rank the algorithms regarding their peak performance.
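The best-of-k reduction might look like this. The grouping of the m runs into consecutive blocks of k is our assumption about how the reduction is organized:

```python
def best_of_k(fce_values, k):
    """Reduce m FCE values to floor(m/k) values by grouping the runs
    into consecutive blocks of k and keeping the best (smallest) FCE
    of each block. k = 1 keeps every run (average performance)."""
    m = len(fce_values)
    return [min(fce_values[i:i + k]) for i in range(0, m - m % k, k)]
```

With m = 100 and k = 5 this leaves 20 values per set, still enough for the statistical tests described above.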
4.2.3.3 Selection of Test Functions
Until now the sets E were dependent on one test function. In order to calculate a rank aggregation for a set of test functions, the points won by an algorithm for each test function within the set are accumulated before determining the aggregated ranking. Aggregated rankings are calculated for the classes of test functions as assigned in Table 4.2.
4.2.3.4 Choice of Target Costs C_t
Following the motivation of this work, small values for the target costs C_t are chosen. C_t should depend on the dimension n to facilitate the interpretation of results across different dimensions. BBOB recommends 10^6 ⋅ n for long runs, thus establishing a linear dependency. We choose to analyze results for C_t = α ⋅ n with α ∈ {25, 50, 100} instead, i.e., our focus is on much smaller values for C_t.
4.3 Results
4.3.1 Ranks by FCE
The following figures show rankings aggregated for the four function classes as described in Sect. 4.2.2.1. Each ranking is displayed for all dimensions. Instead of using the rank, the number of significant wins over other algorithms divided by the number of test functions per class is shown on the y-axis. This kind of normalization allows the plots for different function classes to be compared. With 14 algorithms tested an algorithm can achieve at most 13 significant wins. This representation also has the advantage of showing how clearly an algorithm wins or loses against others. The aggregated ranking over all 24 test functions is given in Table 4.4.
4.3.2 Discussion of Results
Based on the results we are able to answer two questions regarding optimization scenarios with very few function evaluations. The first one is: are there significant differences in the convergence properties of Evolution Strategies with few function evaluations? In general this question can be answered positively. Even 25 ⋅ n function evaluations are sufficient to observe significant differences. As can be seen in Figs. 4.2–4.7, there are hardly any significant differences in algorithm performance for non-separable, multimodal test functions with dimension n = 2. An explanation for this behaviour is that the variance of the Euclidean distance between the initial search point and the global optimum decreases with the dimensionality (see Note 8). That means for n = 2 the variance of the distances is relatively high, and the initialization of the search point impacts the results too much for us to be able to see more significant differences in the convergence behaviour of the algorithms tested. According to the ranking aggregated over all 24 test functions shown in Table 4.4, the Active-CMA-ES is clearly the best evolution strategy for optimization scenarios with few function evaluations, followed by the (μ_W, λ)-CMA-ES in second place. This result holds regardless of whether we analyze the peak or the average performance. The sep-CMA-ES clearly ranked last in these experiments.
The second question, whether there are Evolution Strategies which are better given many function evaluations but are beaten given few function evaluations, can also be answered positively in some cases. For target costs C_t = 100 ⋅ n the Active-CMA-ES or the (μ_W, λ)-CMA-ES usually rank best. Decreasing the target costs to 25 ⋅ n or 50 ⋅ n results in several (1 + 1)-strategies (see Note 9) achieving good rankings, especially for unimodal functions. With the peak performance approach the (1 + 1)-Cholesky-CMA-ES and the (1 + 1)-ES rank first, sometimes even for multimodal functions. The CMA-ES variants catch up with more function evaluations, which can be explained by the time needed to successfully adapt the covariance matrix.
Despite using only anisotropic mutations with local step sizes the DR2 algorithm performs quite well. It often ranks directly behind the successful CMA-ES variants. Thus, it offers a better alternative to the sep-CMA-ES when the runtime of the algorithm cannot be neglected w.r.t. the time for a function evaluation, which might be the case for very high dimensional search spaces.
4.4 Further Analysis for n = 100
As the last section illustrates, several (1 + 1)-ES algorithms outperform CMA-ES variants considered state of the art when it comes to very few function evaluations. In industrial optimization scenarios, where function evaluations are extremely time consuming, we are interested in quick progress rather than finding the exact global optimum, or even converging to a local optimum.
A more thorough analysis for search space dimension n = 100, reflecting these scenarios, was also conducted. The experiments described in the last section used the performance measure FCE, based on the distance to the global optimum Δf∗, to quantify the progress of an algorithm. In order to reflect the scenario of quick progress we chose to measure the progress made w.r.t. the initial search point instead of using Δf∗ directly. So the Δf∗_init of the function evaluation of the initial search point is used to normalize the Δf∗ of later iterations, yielding monotonically decreasing progress values (see Note 10). Based on these values we can state by which order of magnitude an algorithm decreases the initial fitness value for a given test function after a given number of function evaluations. In order to decrease the influence of the initial search point the number of runs is increased from 100 used in the previous section to 1,000 for each of the 14 algorithms and each of the 24 test functions. As an example, Fig. 4.8 shows the resulting convergence plot for test function f_1.
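The normalized progress values can be sketched as follows (our illustration; it combines the best-so-far bookkeeping of Sect. 4.1 with the normalization by the Δf∗ of the initial search point):

```python
import math

def normalized_progress(delta_f_history, delta_f_init):
    """Progress relative to the initial search point: best-so-far
    delta f* divided by delta f* of the initial point, on a log10
    scale. A value of -2 means the initial error was reduced by
    two orders of magnitude."""
    curve, best = [], delta_f_init
    for df in delta_f_history:
        best = min(best, df)  # best so far keeps the curve monotone
        curve.append(math.log10(best / delta_f_init))
    return curve
```

The curve starts at 0 by construction, so runs with very different initial search points become directly comparable.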
As in the analysis in Sect. 4.3.1, we used the Welch t-test to find significant differences between the algorithms tested. According to these significant differences we are able to rank the algorithms for the four test function classes shown in Table 4.2.
The results of this additional test are summarized in Tables 4.5–4.8 for the four different classes of objective functions. In addition, the corresponding convergence plots for all objective functions are provided in Figs. 4.8–4.31. The following observations can be made when analyzing the results:
-
As the rankings clarify, the (1 + 1)-Active-CMA-ES most often ranks first, regardless of the function class (with the exception of separable multimodal functions and large values of C_t, for which DR2 is the best algorithm). In general the (1 + 1)-algorithms, even including the simple (1 + 1)-ES, perform quite well. It seems that, in the beginning, adapting the endogenous search parameters more frequently with less information is better than adapting them less frequently with more information, as is the case in population-based strategies.
-
On non-separable, multimodal test functions, the (1 + 1)-Active-CMA-ES is the clear winner, followed by the (1 + 1)-ES. Similar performance can be observed for the other function classes.
-
The convergence plots for the different functions indicate that, for the more complicated functions (e.g., f_21, f_22), progress in the beginning is very slow and accelerates later on. In contrast to this, on easier unimodal functions such as f_1 the algorithms generally converge much faster (up to three orders of magnitude improvement) after 1,000 function evaluations, and the progress rate is already high during the first 100 evaluations.
In conclusion, the (1 + 1)-Active-CMA-ES is a good recommendation for a small function evaluation budget (i.e., up to 10 ⋅ n) and high-dimensional problems in general. Especially for non-separable, multimodal test functions it consistently shows the best performance, and for the unimodal functions it fails to win in only two cases, for C_t = 100. The very simple (1 + 1)-ES performs surprisingly well, especially on unimodal functions. On multimodal test functions the simple DR2 strategy also performs reasonably well, but not on unimodal test functions. Overall, the (1 + 1)-Active-CMA-ES is clearly recommendable due to its consistent performance across all functions tested.
Notes
- 1.
Actually, these runs were five independent runs of the (μ, λ)-MSC-ES on the 10-dimensional sphere function (f_1 in BBOB).
- 2.
For a plus-strategy these two values are the same.
- 3.
- 4.
N. Hansen and A. Auger’s CMA-ES is available at https://www.lri.fr/~hansen/cmaes_inmatlab.html; Y. Sun’s xNES is available at http://www.idsia.ch/~tom/code/xnes.m.
- 5.
The Octave source code is available for non-commercial use at the web site of divis intelligent solutions GmbH (http://www.divis-gmbh.com/es-software.html), see Sect. 1.4.
- 6.
This allows for comparing comma and plus strategies.
- 7.
We used the free statistics software R [50] for this purpose.
- 8.
The Euclidean distance between two points drawn uniformly from a hyperbox in ℝ^n is approximately distributed according to the normal distribution N(√n, 1/√2) (see e.g. [53]). Hence, with increasing n the variance decreases relative to the mean.
- 9.
In detail these are the (1 + 1)-ES, the (1 + 1)-Cholesky-CMA-ES and the (1 + 1)-Active-CMA-ES.
- 10.
Monotonicity for comma-strategies can be guaranteed by using the so-far best Δf∗ instead of the Δf∗ of the current iteration.
Bibliography
A. Auger, N. Hansen, Performance evaluation of an advanced local search evolutionary algorithm, in Proceedings of the IEEE Congress on Evolutionary Computation (CEC’05), Edinburgh, vol. 2, ed. by B. McKay et al. (IEEE, Piscataway, 2005), pp. 1777–1784
J.W. Eaton, GNU Octave Manual (Network Theory Limited, Godalming, 2002)
A.E. Eiben, M. Jelasity, A critical note on experimental research methodology in EC, in Proceedings of the 2002 Congress on Evolutionary Computation (CEC’02), Honolulu, ed. by R. Eberhart et al. (IEEE, Piscataway, 2002), pp. 582–587
N. Hansen, A. Auger, S. Finck, R. Ros, Real-parameter black-box optimization benchmarking 2010: experimental setup. Research report RR-7215, INRIA, 2010
J. Hartung, B. Elpelt, K.H. Klösener, Statistik, 14th edn. (Oldenbourg, München, 2005)
R. Li, Mixed-integer evolution strategies for parameter optimization and their applications to medical image analysis. PhD thesis, Leiden Institute of Advanced Computer Science (LIACS), Faculty of Science, Leiden University, 2009
K.V. Price, Differential evolution vs. the functions of the second ICEO, in Proceedings of the IEEE International Congress on Evolutionary Computation, Indianapolis, ed. by B. Porto et al. (IEEE, Piscataway, 1997), pp. 153–157
R Development Core Team, R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, 2011. ISBN:3-900051-07-0
I. Rechenberg, Evolutionsstrategie’94 (Frommann-Holzboog, Stuttgart, 1994)
B.L. Welch, The generalization of “Student’s” problem when several different population variances are involved. Biometrika 34, 28–35 (1947)
© 2013 Springer-Verlag Berlin Heidelberg
Bäck, T., Foussette, C., Krause, P. (2013). Empirical Analysis. In: Contemporary Evolution Strategies. Natural Computing Series. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-40137-4_4
Print ISBN: 978-3-642-40136-7
Online ISBN: 978-3-642-40137-4