Abstract
This chapter discusses a generalization of the expected improvement used in Bayesian global optimization to the multicriteria optimization domain, where the goal is to find an approximation to the Pareto front. The expected hypervolume improvement (EHVI) measures improvement as the gain in dominated hypervolume relative to a given approximation to the Pareto front. We review known properties of the EHVI and its applications in practice, and propose a new exact algorithm for computing the EHVI. The new algorithm has asymptotically optimal time complexity \(O(n\log n)\). This improves existing computation schemes by a factor of \(n/\log n\). It shows that this measure, at least for a small number of objective functions, can be computed as fast as other, simpler measures of multicriteria expected improvement that were considered in recent years.
Introduction
In the 1970s several new ideas for global optimization were proposed. Among these, the idea of Bayesian Global Optimization (BGO) was proposed by the Lithuanian research group of Jonas Mockus and Antanas Žilinskas [14, 15, 21, 25]. It had a lasting impact on the development of both deterministic and stochastic global optimization techniques. Today variations of this idea are known under various names, such as Efficient Global Optimization [9] or Expected Improvement Algorithm [22]. In these techniques the goal is to find the extremum of a function \(f: \mathcal{X} \rightarrow \mathbb{R}\) where \(\mathcal{X}\) is a compact subset of \(\mathbb{R}^{d}\). BGO assumes that the objective function is the realization of a Gaussian random field. This random field can be conditioned on the knowledge of \(f(\mathbf{x}^{(i)})\) at some points \(\mathbf{x}^{(i)} \in \mathcal{X},i = 1,\ldots,n\). Under this assumption, measures such as the expected improvement of a new design point are well defined, and can be used to guide the search towards the global optimum.
In this chapter we describe a generalization of this approach to multicriteria optimization. It iteratively evaluates points from \(\mathcal{X}\) and finds a well distributed subset of the Pareto front of a multicriteria optimization problem. The algorithm is based on a generalization of the expected improvement, which is based on the hypervolume indicator, the so-called Expected Hypervolume Improvement (EHVI) [3]. It has attractive theoretical properties [23], but so far its computation time was considered to be expensive. In this chapter it is shown that, for bicriteria optimization, a fast algorithm exists for computing EHVI that has only linear time complexity in the size of the intermediate approximation to the Pareto front, given that the Pareto front is given as a sorted set. It is shown that this algorithm has asymptotically optimal time complexity.
This chapter is organized as follows: Section “Bayesian Global Optimization” introduces the framework of BGO. Section “Multicriteria Optimization” shows how this framework can be generalized to multicriteria optimization. Section “Expected Hypervolume Improvement” defines the EHVI, discusses some of its theoretical properties, and reviews recent applications of it. Section “Efficient Exact Computation” outlines the new, asymptotically efficient algorithm for its exact computation and proves that it has an asymptotically optimal time complexity for bicriteria problems. A numerical example is discussed in section “Numerical Example”. Section “Application Notes and Further Reading” points to some recent applications and related work. Finally, section “Summary and Outlook” concludes with a summary and discusses open questions.
Bayesian Global Optimization
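In BGO the goal is to solve d-dimensional global optimization problems of the type: Find \(\mathbf{x}^{{\ast}}\) with
\[ \mathbf{x}^{{\ast}} =\mathop{\mathrm{arg\,min}}_{\mathbf{x}\in \mathcal{X}}f(\mathbf{x}). \]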
(Without loss of generality we consider minimization only.)
In order to do so, a sequence \(\{\mathbf{x}^{(t)}\}_{t=1,2,\ldots }\) of points is computed such that
\[ \mathbf{x}^{(t)} =\mathop{\mathrm{arg\,max}}_{\mathbf{x}\in \mathcal{X}}\mathrm{E}(I(\mathbf{x})\vert (\mathbf{x}^{(1)},f(\mathbf{x}^{(1)})),\ldots,(\mathbf{x}^{(t-1)},f(\mathbf{x}^{(t-1)}))). \]
Here \(\mathrm{E}(I(\mathbf{x})\vert (\mathbf{x}^{(1)},f(\mathbf{x}^{(1)})),\ldots,(\mathbf{x}^{(t-1)},f(\mathbf{x}^{(t-1)})))\) denotes the expected improvement, which measures how promising the new point \(\mathbf{x}\) is, given the \(t - 1\) previous evaluations of f at \(\mathbf{x}^{(1)},\ldots,\mathbf{x}^{(t-1)}\). This expected improvement is the expected value of a random variable, here called \(I(\mathbf{x})\), that requires further explanation.
In BGO one makes the assumption that the function f is the realization of a Gaussian random field \(\mathbf{F}\). A Gaussian random field is an infinite set of random variables. Each random variable in \(\mathbf{F}\) is identified by its spatial index \(\mathbf{x} \in \mathbb{R}^{d}\). We will denote it with \(\mathbf{F}_{\mathbf{x}}\). It is assumed that the random variables share the same global mean value \(\beta\) and global variance \(s^{2}\). Moreover, a correlation \(\rho (\mathbf{F}_{\mathbf{u}},\mathbf{F}_{\mathbf{v}})\) is defined for every pair of indices \(\mathbf{u} \in \mathbb{R}^{d}\) and \(\mathbf{v} \in \mathbb{R}^{d}\). This correlation depends on the relation between \(\mathbf{u}\) and \(\mathbf{v}\). A typical family of correlation functions is
\[ \rho (\mathbf{F}_{\mathbf{u}},\mathbf{F}_{\mathbf{v}}) =\exp \left (-\sum _{i=1}^{d}\theta _{i}\vert u_{i} - v_{i}\vert ^{q_{i}}\right ). \]
It is important that this correlation function is positive definite. It attains the value 1 if \(\mathbf{v} = \mathbf{u}\) and decreases with increasing distance between \(\mathbf{v}\) and \(\mathbf{u}\). The parameters \(q_{i}\) and \(\theta _{i}\) are either set by the user or obtained from data fitting. The parameters \(\theta _{i}\) are positive.
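As a small illustration of such a correlation family, the following sketch evaluates a correlation of the power-exponential type, \(\exp (-\sum _{i}\theta _{i}\vert u_{i} - v_{i}\vert ^{q_{i}})\). The function name and the parameter values are chosen for illustration only and are not taken from the chapter; exponents are assumed to satisfy \(0 < q_{i} \leq 2\), the range in which this family is known to be positive definite.

```python
import math

def correlation(u, v, theta, q):
    """Power-exponential correlation between the field at indices u and v.

    theta: positive scale parameters, one per coordinate.
    q:     exponents, one per coordinate (assumed to lie in (0, 2]).
    """
    return math.exp(-sum(t * abs(a - b) ** p
                         for a, b, t, p in zip(u, v, theta, q)))

# Illustrative values only.
u, v = (0.0, 0.0), (1.0, 2.0)
theta, q = (0.5, 0.25), (2.0, 2.0)
print(correlation(u, u, theta, q))  # correlation of a variable with itself
print(correlation(u, v, theta, q))  # decays with the distance between u and v
```

With all \(q_{i} = 2\) and a single shared \(\theta\) this reduces to a Gaussian correlation function in the squared Euclidean distance.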
The Gaussian random field can be viewed as a multivariate Gaussian distribution of infinite dimension. We can use well-known expressions for the marginal distributions of the multivariate distribution to find the conditional distribution, given that some of the realizations of one dimensional random variables are known. That is, given the prior information \(\mathbf{F}_{\mathbf{ x}^{(1)}} = f(\mathbf{x}^{(1)}),\ldots,\mathbf{F}_{\mathbf{ x}^{(t-1)}} = f(\mathbf{x}^{(t-1)})\) we can compute the parameters \(\mu\) (conditional mean) and \(\sigma ^{2}\) (conditional variance) of the conditioned random variable:
\[ \mathbf{F}_{\mathbf{x}}\ \vert \ \mathbf{F}_{\mathbf{ x}^{(1)}} = f(\mathbf{x}^{(1)}),\ldots,\mathbf{F}_{\mathbf{ x}^{(t-1)}} = f(\mathbf{x}^{(t-1)})\ \sim \ \mathcal{N}(\mu,\sigma ^{2}). \]
As a shortcut we will denote this random variable with \(\mathbf{F}_{\mathbf{x}}\vert \mathbf{X},f(\mathbf{X})\), where \(\mathbf{X} = (\mathbf{x}^{(1)},\ldots,\mathbf{x}^{(t-1)})\) denotes the indices for which we know realizations, and the values of the corresponding realizations are abbreviated with
\[ f(\mathbf{X}) = (f(\mathbf{x}^{(1)}),\ldots,f(\mathbf{x}^{(t-1)})). \]
The estimation of the hyperparameters \(\theta _{i}\) and \(q_{i}\), \(i = 1,\ldots,d\), of the correlation function, as well as of the global variance and mean, can be accomplished by maximum likelihood methods. For details on the computation of the parameters of the conditional distribution we refer to the specialized literature [19].
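Now, the expected improvement can be defined: The improvement of a function value \(y \in \mathbb{R}\) is defined as
\[ I(y) =\max \{ 0,y_{min} - y\}, \]
where \(y_{min} =\min \{ f(\mathbf{x}^{(1)}),\ldots,f(\mathbf{x}^{(t-1)})\}\). Then the expected improvement is defined as
\[ \mathrm{E}(I(\mathbf{x})\vert \mathbf{X},f(\mathbf{X})) =\int _{-\infty }^{\infty }I(y) \cdot \mathrm{PDF}_{\mathbf{x}\vert \mathbf{X},f(\mathbf{X})}(y)\,dy. \]
Here \(\mathrm{PDF}_{\mathbf{x}\vert \mathbf{X},f(\mathbf{X})}\) is the probability density function of \(\mathbf{F}_{\mathbf{x}}\vert \mathbf{X},f(\mathbf{X})\).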
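For a Gaussian predictive distribution with conditional mean \(\mu\) and standard deviation \(\sigma\), the expectation of \(\max \{0,y_{min} - y\}\) has the well-known closed form \((y_{min}-\mu )\varPhi (t) +\sigma \phi (t)\) with \(t = (y_{min}-\mu )/\sigma\). The sketch below evaluates it; the function name and the handling of \(\sigma = 0\) are illustrative choices, not part of the chapter.

```python
import math

def expected_improvement(mu, sigma, y_min):
    """Expected value of max(0, y_min - Y) for Y ~ N(mu, sigma^2) (minimization)."""
    if sigma <= 0.0:
        # Degenerate prediction: the improvement is deterministic.
        return max(0.0, y_min - mu)
    t = (y_min - mu) / sigma
    pdf = math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)  # standard normal PDF
    cdf = 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))         # standard normal CDF
    return (y_min - mu) * cdf + sigma * pdf

# A point predicted exactly at the current best value still has positive
# expected improvement as long as the prediction is uncertain.
print(expected_improvement(mu=1.0, sigma=1.0, y_min=1.0))
print(expected_improvement(mu=1.0, sigma=0.0, y_min=1.0))
```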
Multicriteria Optimization
A continuous m-dimensional multicriteria optimization problem is a problem where multiple objective functions, say \(f_{1}: \mathcal{X} \rightarrow \mathbb{R},\ldots,f_{m}: \mathcal{X} \rightarrow \mathbb{R}\), are to be minimized simultaneously, \(\mathcal{X} \subseteq \mathbb{R}^{d}\).
In the a posteriori approach to multicriteria optimization an approximation to the Pareto front of the problem is computed first. Based on this, the trade-off is analyzed and a solution is selected by the decision maker. To define a Pareto front, we introduce the Pareto dominance order ≺ on \(\mathbb{R}^{m}\), with \(\forall \mathbf{y},\mathbf{z} \in \mathbb{R}^{m}: \mathbf{y} \prec \mathbf{z} \Leftrightarrow (\forall i \in \{ 1,\ldots,m\}: y_{i} \leq z_{i})\mbox{ and }\mathbf{y}\neq \mathbf{z}\). The non-dominated subset of a multiset of vectors \(\mathrm{Y} =\{ \mathbf{y}^{(1)},\ldots,\mathbf{y}^{(n)}\}\) is defined as \(\mathrm{nd}(\mathrm{Y}) =\{ \mathbf{y} \in \mathrm{Y}\ \vert \ \nexists \mathbf{z} \in \mathrm{Y}: \mathbf{z} \prec \mathbf{y}\}\). Given a multicriteria optimization problem with objective vector \(\mathbf{f} = (f_{1},\ldots,f_{m})\), the image of \(\mathcal{X}\) is defined as \(\mathcal{Y} =\{ \mathbf{f}(\mathbf{x})\ \vert \ \mathbf{x} \in \mathcal{X}\}\). The Pareto front of a multicriteria optimization problem is defined as \(\mathcal{Y}_{\mathrm{nd}}:=\mathrm{ nd}(\mathcal{Y})\). An important special case is bicriteria optimization, where m = 2.
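The dominance relation and the non-dominated subset translate directly into code. The following is a minimal sketch for finite lists of objective vectors; the function names are illustrative, and the quadratic-time filter is chosen for clarity rather than speed.

```python
def dominates(y, z):
    """True if y Pareto-dominates z (minimization): y <= z in every
    coordinate and y differs from z."""
    return all(a <= b for a, b in zip(y, z)) and tuple(y) != tuple(z)

def nondominated(points):
    """Non-dominated subset nd(Y) of a finite list of objective vectors."""
    return [y for y in points if not any(dominates(z, y) for z in points)]

points = [(3, 1), (2, 1.5), (1, 2.5), (2.5, 2.0), (3, 3)]
print(nondominated(points))  # (2.5, 2.0) and (3, 3) are dominated by (2, 1.5)
```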
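One way to generalize the BGO algorithm is to compute the expected improvement of the hypervolume indicator. The hypervolume indicator is the m-dimensional Lebesgue measure \(\lambda _{m}\) of the dominated subspace limited from above by some reference point \(\mathbf{r} \in \mathbb{R}^{m}\). More precisely the hypervolume indicator is defined as
\[ \mathrm{HI}(\mathrm{Y}) =\lambda _{m}\left (\bigcup _{\mathbf{y}\in \mathrm{Y}}[\mathbf{y},\mathbf{r}]\right ), \]
where \([\mathbf{y},\mathbf{r}]\) denotes the axis-parallel box with lower corner \(\mathbf{y}\) and upper corner \(\mathbf{r}\).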
In Fig. 1 the hypervolume indicator is illustrated for a Pareto front approximation with nine points and two objective functions (m = 2). Given a problem with a Pareto front bounded above by the reference point, sets that maximize the hypervolume indicator are well distributed subsets of the Pareto front [1]. This is why finding the Pareto front is sometimes recast as the problem of maximizing the hypervolume indicator over the set of all subsets of \(\mathcal{X}\). We will call this problem hypervolume maximization.
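For m = 2 the hypervolume indicator can be computed by a single sweep over the points sorted by the first objective. The sketch below is one simple way to do this; the function name is illustrative, and the three-point set with reference point (4, 4) is used only as sample data.

```python
def hypervolume_2d(points, r):
    """Area dominated by `points` and bounded above by the reference point r
    (two objectives, minimization)."""
    # Points that do not strictly dominate r contribute nothing.
    pts = sorted(p for p in points if p[0] < r[0] and p[1] < r[1])
    area, best_y2 = 0.0, r[1]
    for y1, y2 in pts:                 # ascending in the first objective
        if y2 < best_y2:               # not dominated by an earlier point
            area += (r[0] - y1) * (best_y2 - y2)
            best_y2 = y2
    return area

front = [(3.0, 1.0), (2.0, 1.5), (1.0, 2.5)]
print(hypervolume_2d(front, (4.0, 4.0)))  # 4.5 + 2.0 + 0.5 = 7.0
```

Dominated points in the input are skipped by the sweep, so the function can also be applied to sets that are not mutually non-dominated.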
Expected Hypervolume Improvement
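For hypervolume maximization problems the generalization of the improvement function is straightforward. We generalize the best solution found up to iteration t − 1, namely \(y_{min,t-1}\), to
\[ \mathrm{Y}_{t-1}:=\mathrm{ nd}(\{\mathbf{f}(\mathbf{x}^{(1)}),\ldots,\mathbf{f}(\mathbf{x}^{(t-1)})\}). \]
The improvement function is generalized by the following definition of an m-dimensional improvement:
\[ I_{m}(\mathbf{y}) =\mathrm{ HI}(\mathrm{Y}_{t-1} \cup \{\mathbf{y}\}) -\mathrm{ HI}(\mathrm{Y}_{t-1}), \]
where \(\mathrm{HI}\) denotes the hypervolume indicator with respect to the reference point \(\mathbf{r}\).
It is easy to show that this \(I_{m}\) specializes to the improvement function in one dimension, if we choose \(r_{1}\) to be sufficiently large.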
In order to compute the expected improvement in the multicriteria case, we also need to generalize the assumption on the Gaussian random field. For this, we consider one Gaussian random field per objective function and assume that there is no correlation between random variables from different random fields. For every point \(\mathbf{x} \in \mathcal{X}\) we obtain an m-dimensional random variable conditioned on previous information, that is given by \(\mathbf{X}\) and \(f(\mathbf{X}) = (f(\mathbf{x}^{(1)}),\ldots,f(\mathbf{x}^{(t-1)}))\).
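The resulting EHVI can be denoted with
\[ \mathrm{EHVI}(\mathbf{x}) =\int _{\mathbb{R}^{m}}I_{m}(\mathbf{y}) \cdot \mathrm{PDF}_{\mathbf{x}\vert \mathbf{X},f(\mathbf{X})}(\mathbf{y})\,d\mathbf{y} \tag{9} \]
and it is a generalization of the single objective expected improvement, if we consider \(y_{min,0} = r_{1}\).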
Efficient Exact Computation
In this section the problem of computing the EHVI is studied and a new, efficient algorithm for bicriteria optimization is derived. Fast algorithms for computing the EHVI are important, because in BGO a large number of evaluations of the EHVI are performed in each iteration when searching for its maximizer. Although the BGO algorithm is typically used in the context of expensive function evaluations, the optimization of the EHVI can significantly contribute to the total running time of the algorithm. For instance, this was recently reported as a major drawback of using EHVI in [12], even when considering only the two dimensional case.
A simplified notation will be used in the following. It focuses only on the elements that are relevant for the EHVI computation.
Symbol | Type | Description |
---|---|---|
\(\boldsymbol{\mu }\) | \(\mathbb{R}^{m}\) | Mean values of predictive distribution |
\(\boldsymbol{\sigma }\) | \((\mathbb{R}_{0}^{+})^{m}\) | Standard deviations of predictive distribution |
Y | \((\mathbb{R}^{m})^{n}\) | Sequence of mutually non-dominated points (Pareto front approximation in t − 1) |
\(\mathbf{y}^{(1)},\ldots,\mathbf{y}^{(n)}\) | \(\mathbb{R}^{m}\) | The vectors in Y |
r | \(\mathbb{R}^{m}\) | Reference point |
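For computing integrals of the expected improvement it is useful to define the function Δ. For a given vector of objective function values \(\mathbf{y} \in \mathbb{R}^{m}\), \(\varDelta (\mathbf{y},\mathrm{Y},\mathbf{r})\) is the subset of the vectors in \(\mathbb{R}^{m}\) which are exclusively dominated by the vector \(\mathbf{y}\) and not by elements in Y and that dominate the reference point, in symbols
\[ \varDelta (\mathbf{y},\mathrm{Y},\mathbf{r}) =\{ \mathbf{z} \in \mathbb{R}^{m}\ \vert \ \mathbf{y} \prec \mathbf{z}\mbox{ and }\mathbf{z} \prec \mathbf{r}\mbox{ and }\nexists \mathbf{q} \in \mathrm{Y}: \mathbf{q} \prec \mathbf{z}\}. \]
In order to simplify notation, we will write \(\varDelta (\mathbf{y})\) whenever Y, \(\mathbf{r}\) are given by the context.
Based on this, we can now concisely (re-)define the EHVI function as
\[ \mathrm{EHVI}(\boldsymbol{\mu },\boldsymbol{\sigma },\mathrm{Y},\mathbf{r}) =\int _{\mathbb{R}^{m}}\lambda _{m}[\varDelta (\mathbf{y},\mathrm{Y},\mathbf{r})] \cdot \mathrm{PDF}_{\boldsymbol{\mu },\boldsymbol{\sigma }}(\mathbf{y})\,d\mathbf{y}, \]
where \(\mathrm{PDF}_{\boldsymbol{\mu },\boldsymbol{\sigma }}\) is the density of the predictive distribution with independent Gaussian components.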
Example 1.
An illustration of the EHVI is displayed in Fig. 2. The light gray area is the dominated subspace of \(\mathrm{Y} =\{ \mathbf{y}^{(1)} = (3,1)^{\top },\mathbf{y}^{(2)} = (2,1.5)^{\top },\mathbf{y}^{(3)} = (1,2.5)^{\top }\}\) cut by the reference point \(\mathbf{r} = (4,4)^{\top }\). The bivariate Gaussian distribution has the parameters \(\mu _{1} = 2\), \(\mu _{2} = 1.5\), \(\sigma _{1} = 0.7\), and \(\sigma _{2} = 0.6\). The PDF of the bivariate Gaussian distribution is indicated as a 3-D plot. Here \(\mathbf{y}\) is a sample from this distribution, and the area of improvement relative to Y is indicated by the dark shaded area. The variable \(y_{1}\) stands for the \(f_{1}\) value and \(y_{2}\) for the \(f_{2}\) value.
State of the Art
To compute the EHVI (9) Monte Carlo integration is suggested in [3, 4]. Exact algorithms for computing EHVI for m = 2 are derived in [5] and for m > 2 in [2]. A different algorithm is described in [7].
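The Monte Carlo approach can be sketched in a few lines for m = 2: sample from the predictive distribution and average the hypervolume gain of each sample. This is an illustrative sketch, not the implementation used in [3, 4]; the function names, sample size and seed are assumptions, and the data are those of Example 1.

```python
import random

def hypervolume_2d(points, r):
    """Dominated area of `points` below the reference point r (minimization)."""
    pts = sorted(p for p in points if p[0] < r[0] and p[1] < r[1])
    area, best_y2 = 0.0, r[1]
    for y1, y2 in pts:
        if y2 < best_y2:
            area += (r[0] - y1) * (best_y2 - y2)
            best_y2 = y2
    return area

def ehvi_monte_carlo(front, r, mu, sigma, samples=20000, seed=0):
    """Estimate the EHVI by averaging the hypervolume improvement of samples
    drawn from independent Gaussians N(mu[k], sigma[k]^2), k = 0, 1."""
    rng = random.Random(seed)
    base = hypervolume_2d(front, r)
    total = 0.0
    for _ in range(samples):
        y = (rng.gauss(mu[0], sigma[0]), rng.gauss(mu[1], sigma[1]))
        total += hypervolume_2d(list(front) + [y], r) - base
    return total / samples

front = [(3.0, 1.0), (2.0, 1.5), (1.0, 2.5)]
print(ehvi_monte_carlo(front, (4.0, 4.0), mu=(2.0, 1.5), sigma=(0.7, 0.6)))
```

Each sample costs a full hypervolume computation here, and the estimate is only accurate up to sampling error, which is what motivates exact algorithms.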
Fast algorithms have been proposed in [2] and even faster algorithms for m = 2, 3 in [8]. So far the best known bounds for the time complexity of exact computations are \(O(n^{2})\) for m = 2, and \(O(n^{3})\) for m = 3. It is notable that the number of transcendental function evaluations scales only linearly in n in the algorithm presented in [8]. A lower bound of \(\varOmega (n\log n)\) is provided for unsorted Y. However, it makes sense to assume that Y is sorted in the first coordinate. In that case, as will be shown, a lower bound of \(\varOmega (n)\) still holds. None of the algorithms found so far for EHVI reach these lower bounds. In this chapter we will present an algorithm for m = 2 that does so.
Next an algorithm is outlined whose running time matches the lower bound of \(\varOmega (n\log n)\). We thereby prove that the time complexity of computing the EHVI is \(\varTheta (n\log n)\). However, this complexity stems from the cost that is inherent to sorting Y by the first coordinate.
Keeping Y sorted in the first coordinate requires an effort of amortized time complexity \(O(\log n)\) per iteration. It therefore makes sense to assume a sorted Y. For this case we can show that the time complexity is \(\varTheta (n)\). To do so, we will first establish a lower bound of \(\varOmega (n)\) for this case:
Lemma 1.
The computational time complexity of computing the EHVI for a set Y that is sorted by the first coordinate is bounded from below by Ω(n).
Proof.
An adversary argument can be used to prove this statement. The algorithm has to “look at” all n points. If one point is not used, it could be moved by an adversary and this move will not be noticed by the algorithm; a move of any single point can, in general, change the EHVI. □
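The assumption of a sorted Y is cheap to maintain between iterations. The sketch below inserts a new point into a front stored in ascending order of the first objective and discards the points it dominates. It is an illustration only: with a plain Python list the binary search takes \(O(\log n)\) but the list update itself is linear, so the amortized logarithmic bound would need a balanced search tree instead. The function name is hypothetical, and reversing the list gives the descending order used in Example 1.

```python
import bisect

def insert_sorted(front, y):
    """Insert y = (y1, y2) into a list of mutually non-dominated points that is
    sorted ascending in y1 (hence descending in y2). Returns the updated list;
    the input list is returned unchanged if y is dominated or a duplicate."""
    i = bisect.bisect_left(front, y)
    # The predecessor has a smaller-or-equal y1; it dominates y if its y2 is not larger.
    if i > 0 and front[i - 1][1] <= y[1]:
        return front
    # A successor with identical y1 and smaller-or-equal y2 can only be a duplicate.
    if i < len(front) and front[i][0] == y[0] and front[i][1] <= y[1]:
        return front
    # Points dominated by y form a contiguous run starting at position i.
    j = i
    while j < len(front) and front[j][1] >= y[1]:
        j += 1
    return front[:i] + [y] + front[j:]

front = [(1.0, 2.5), (2.0, 1.5), (3.0, 1.0)]
print(insert_sorted(front, (1.5, 1.2)))  # replaces the dominated point (2.0, 1.5)
print(insert_sorted(front, (2.5, 2.0)))  # dominated by (2.0, 1.5): unchanged
```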
Efficient Algorithm
For m = 2 the expected improvement can be computed in linear time, given that Y is already sorted by the first coordinate. Next, a formula will be derived that consists of n + 1 integrals, each of which can be solved in constant time.
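The starting point of the derivation is to partition the objective space into n + 1 disjoint rectangular stripes \(S_{1},\ldots,S_{n+1}\), as indicated in Fig. 3 (left). In order to define the stripes formally, augment Y with two sentinels: \(\mathbf{y}^{(0)} = (r_{1},-\infty )\) and \(\mathbf{y}^{(n+1)} = (-\infty,r_{2})\). With the points indexed in descending order of the first coordinate, as in Example 1, the stripes are now defined by
\[ S_{i}:= [y_{1}^{(i)},y_{1}^{(i-1)}] \times (-\infty,y_{2}^{(i)}],\quad i = 1,\ldots,n + 1. \]
We can now express the improvement of a point \(\mathbf{y} \in \mathbb{R}^{2}\) by
\[ \lambda _{2}[\varDelta (\mathbf{y})] =\sum _{i=1}^{n+1}\lambda _{2}[S_{i} \cap \varDelta (\mathbf{y})]. \]
This gives rise to the compact integral for the original EHVI, \(\mathbf{y} = (y_{1},y_{2})\):
\[ \mathrm{EHVI}(\boldsymbol{\mu },\boldsymbol{\sigma },\mathrm{Y},\mathbf{r}) =\sum _{i=1}^{n+1}\int _{-\infty }^{\infty }\int _{-\infty }^{\infty }\lambda _{2}[S_{i} \cap \varDelta (\mathbf{y})] \cdot \mathrm{PDF}_{\boldsymbol{\mu },\boldsymbol{\sigma }}(\mathbf{y})\,dy_{2}\,dy_{1}. \tag{14} \]
It is observed that the intersection of \(S_{i}\) with \(\varDelta (y_{1},y_{2})\) is non-empty if and only if \(\mathbf{y} = (y_{1},y_{2})\) dominates the upper right corner of \(S_{i}\). In other words, if and only if \(\mathbf{y}\) is located in the rectangle with lower left corner \((-\infty,-\infty )\) and upper right corner \((y_{1}^{(i-1)},y_{2}^{(i)})\). See Fig. 3 (right) for an illustration. For \(y_{1} \leq y_{1}^{(i)}\) the intersection spans the full width of the stripe, whereas for \(y_{1}^{(i)} < y_{1} \leq y_{1}^{(i-1)}\) it only extends from \(y_{1}\) to \(y_{1}^{(i-1)}\). Therefore
\[ \mathrm{EHVI}(\boldsymbol{\mu },\boldsymbol{\sigma },\mathrm{Y},\mathbf{r}) =\sum _{i=1}^{n+1}\int _{-\infty }^{y_{1}^{(i-1)}}\int _{-\infty }^{y_{2}^{(i)}}\lambda _{2}[S_{i} \cap \varDelta (\mathbf{y})] \cdot \mathrm{PDF}_{\boldsymbol{\mu },\boldsymbol{\sigma }}(\mathbf{y})\,dy_{2}\,dy_{1} \]
\[ =\sum _{i=1}^{n+1}\int _{-\infty }^{y_{1}^{(i)}}\int _{-\infty }^{y_{2}^{(i)}}(y_{1}^{(i-1)} - y_{1}^{(i)})(y_{2}^{(i)} - y_{2}) \cdot \mathrm{PDF}_{\boldsymbol{\mu },\boldsymbol{\sigma }}(\mathbf{y})\,dy_{2}\,dy_{1} +\sum _{i=1}^{n+1}\int _{y_{1}^{(i)}}^{y_{1}^{(i-1)}}\int _{-\infty }^{y_{2}^{(i)}}(y_{1}^{(i-1)} - y_{1})(y_{2}^{(i)} - y_{2}) \cdot \mathrm{PDF}_{\boldsymbol{\mu },\boldsymbol{\sigma }}(\mathbf{y})\,dy_{2}\,dy_{1}. \tag{16} \]
In (14) the summation is also done after integration. This is allowed, because integration is a linear mapping.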
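The stripe decomposition can be checked on a single sample point before any integration is involved. The sketch below sums the areas of the intersections of the stripes with the improvement region of a point \(\mathbf{y}\); it assumes that the front is indexed in descending order of the first objective (as in Example 1) and that all its points dominate the reference point. The function name is illustrative.

```python
import math

def improvement_by_stripes(y, front, r):
    """Hypervolume improvement of y, accumulated stripe by stripe.

    Stripe i spans [y1(i), y1(i-1)] in the first objective and
    (-inf, y2(i)] in the second, using the sentinels y(0) = (r1, -inf)
    and y(n+1) = (-inf, r2)."""
    pts = [(r[0], -math.inf)] + list(front) + [(-math.inf, r[1])]
    total = 0.0
    for i in range(1, len(pts)):
        right = pts[i - 1][0]          # right edge of the stripe
        left, top = pts[i]             # left edge and upper edge of the stripe
        # The stripe contributes only if y dominates its upper right corner.
        if y[0] < right and y[1] < top:
            total += (right - max(y[0], left)) * (top - y[1])
    return total

front = [(3.0, 1.0), (2.0, 1.5), (1.0, 2.5)]
print(improvement_by_stripes((1.5, 1.2), front, (4.0, 4.0)))  # 0.3 + 0.65
print(improvement_by_stripes((2.5, 2.0), front, (4.0, 4.0)))  # dominated point
```

The `max(y[0], left)` term is where the two cases of the derivation meet: it equals the left stripe edge when \(\mathbf{y}\) lies to the left of the stripe and \(y_{1}\) when \(\mathbf{y}\) lies inside it.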
Details of the Constant Time Integration
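Recall the definition of the standard Gaussian PDF and CDF: \(\phi (x) = \dfrac{1} {\sqrt{2\pi }}\mathrm{exp}(-x^{2}/2),\quad \varPhi (x) = \dfrac{1} {2}\left (1 +\mathrm{ erf}\left (\dfrac{x} {\sqrt{2}}\right )\right )\), and a function Ψ that was defined in [8] as follows:
\[ \varPsi (a,b,\mu,\sigma ):=\int _{-\infty }^{b}(a - z)\dfrac{1} {\sigma } \phi \left (\dfrac{z-\mu } {\sigma } \right )dz. \]
Moreover it can be shown that
\[ \varPsi (a,b,\mu,\sigma ) =\sigma \phi \left (\dfrac{b-\mu } {\sigma } \right ) + (a-\mu )\varPhi \left (\dfrac{b-\mu } {\sigma } \right ). \]
Because the two components of the predictive distribution are independent, the double integrals factorize. Then the first summand of (16) can be written as follows:
\[ \sum _{i=1}^{n+1}(y_{1}^{(i-1)} - y_{1}^{(i)}) \cdot \varPhi \left (\dfrac{y_{1}^{(i)} -\mu _{1}} {\sigma _{1}} \right ) \cdot \varPsi (y_{2}^{(i)},y_{2}^{(i)},\mu _{2},\sigma _{2}). \]
And the second summand of (16) can be written as follows:
\[ \sum _{i=1}^{n+1}\left (\varPsi (y_{1}^{(i-1)},y_{1}^{(i-1)},\mu _{1},\sigma _{1}) -\varPsi (y_{1}^{(i-1)},y_{1}^{(i)},\mu _{1},\sigma _{1})\right ) \cdot \varPsi (y_{2}^{(i)},y_{2}^{(i)},\mu _{2},\sigma _{2}). \]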
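Putting the pieces together, the linear-time computation can be sketched as follows. This is an independent illustration, not the authors' C++ or MATLAB code. It assumes the closed form \(\varPsi (a,b,\mu,\sigma ) =\sigma \phi ((b-\mu )/\sigma ) + (a-\mu )\varPhi ((b-\mu )/\sigma )\) for the partial expectation \(\int _{-\infty }^{b}(a - z)\,\mathrm{PDF}(z)\,dz\), a front indexed in descending order of the first objective whose points all dominate the reference point, and strictly positive standard deviations. All function names are illustrative.

```python
import math

def phi(t):
    """Standard normal PDF."""
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)

def Phi(t):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

def psi(a, b, mu, sigma):
    """Integral of (a - z) * N(z; mu, sigma^2) over z in (-inf, b]."""
    if b == -math.inf:
        return 0.0
    t = (b - mu) / sigma
    return sigma * phi(t) + (a - mu) * Phi(t)

def ehvi_2d(front, r, mu, sigma):
    """Exact 2-D EHVI in O(n) for a front sorted by descending first objective."""
    pts = [(r[0], -math.inf)] + list(front) + [(-math.inf, r[1])]
    total = 0.0
    for i in range(1, len(pts)):
        right = pts[i - 1][0]                   # y1 of the previous point
        left, top = pts[i]                      # y1 and y2 of the current point
        vertical = psi(top, top, mu[1], sigma[1])
        if left > -math.inf:
            # Samples left of the stripe: full stripe width, weighted by Phi.
            total += (right - left) * Phi((left - mu[0]) / sigma[0]) * vertical
        # Samples inside the stripe: width from y1 up to the right edge.
        horizontal = psi(right, right, mu[0], sigma[0]) - psi(right, left, mu[0], sigma[0])
        total += horizontal * vertical
    return total

front = [(3.0, 1.0), (2.0, 1.5), (1.0, 2.5)]    # data of Example 1
print(ehvi_2d(front, (4.0, 4.0), mu=(2.0, 1.5), sigma=(0.7, 0.6)))
```

Each stripe needs a constant number of calls to `phi` and `Phi`, so the loop is linear in n. Two simple sanity checks are that the value approaches the plain hypervolume improvement of \(\boldsymbol{\mu }\) as the standard deviations shrink, and that for an empty front it equals the product of two one-dimensional expected improvements relative to \(r_{1}\) and \(r_{2}\).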
The C++ and MATLAB source code for computing the EHVI is available at http://moda.liacs.nl or on request from the authors. The code has been compared to results of Monte Carlo integration and to earlier implementations of the exact EHVI.
Numerical Example
The behavior of the BGO based on the EHVI will be illustrated by a single numerical experiment.
The numerical example is visualized in the plots of Fig. 4. The bicriteria optimization problem is: \(f_{1}(\mathbf{x}) =\Vert \mathbf{x} -\mathbf{1}\Vert \rightarrow \min\), \(f_{2}(\mathbf{x}) =\Vert \mathbf{x} + \mathbf{1}\Vert \rightarrow \min\), and \(\mathbf{x} \in [-2,2] \times [-2,2] \subset \mathbb{R}^{2}\), where \(\mathbf{1} = (1,1)^{\top }\). The Pareto front is the line segment from \((0,2 \cdot \sqrt{2})\) to \((2 \cdot \sqrt{2},0)\), and the efficient set is the line segment that connects (−1, −1) and (1, 1). The metamodel used is a Gaussian random field model with Gaussian correlation function \(\exp (-\theta \Vert \mathbf{x}^{(1)} -\mathbf{x}^{(2)}\Vert ^{2})\), for \(\mathbf{x}^{(1)} \in \mathbb{R}^{m}\) and \(\mathbf{x}^{(2)} \in \mathbb{R}^{m}\). We set θ = 0.0001, which was estimated by the maximum likelihood method on the initial sample. An initial set of 10 points was evaluated, indicated by the dark blue squares. From this starting set 15 new points were generated using the EHVI. The maximizer of the expected improvement was found using a uniform grid. In total each objective function was evaluated 25 times.
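For readers who want to reproduce the setting, the two objective functions and a uniform candidate grid can be written down in a few lines. This sketch assumes the Euclidean norm and \(\mathbf{1} = (1,1)^{\top }\), which is consistent with the stated efficient set and Pareto front; the grid resolution and the helper names are arbitrary illustrative choices.

```python
import math

def f1(x):
    """Distance of x to the point (1, 1)."""
    return math.hypot(x[0] - 1.0, x[1] - 1.0)

def f2(x):
    """Distance of x to the point (-1, -1)."""
    return math.hypot(x[0] + 1.0, x[1] + 1.0)

def uniform_grid(lo=-2.0, hi=2.0, steps=41):
    """Uniform grid over [lo, hi] x [lo, hi] used as candidate set."""
    ticks = [lo + (hi - lo) * k / (steps - 1) for k in range(steps)]
    return [(a, b) for a in ticks for b in ticks]

# On the efficient set (the segment between (-1, -1) and (1, 1)) the two
# objective values always add up to 2 * sqrt(2).
for t in (-1.0, 0.0, 0.5, 1.0):
    x = (t, t)
    print(x, f1(x), f2(x), f1(x) + f2(x))
print(len(uniform_grid()))
```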
The results of the experiment are depicted in plots. In all pictures, points that have been evaluated are indicated by triangles. The points from the initial set are additionally marked by squares. Efficient points are surrounded by circles. The top row depicts the mean value of the Gaussian random field model at x ∈ [−2,2] × [−2,2] for f 1 and f 2, resp. Likewise, the middle row depicts the variance of the Gaussian random field model at x ∈ [−2,2] × [−2,2] for f 1 and f 2, resp. On the left-hand side of the bottom row the hypervolume-based expected improvement values after 25 iterations are shown. The final set of points in the objective space and the Pareto front approximation is seen in the plot in the lower right corner. Using only 25 evaluations of the original objective functions, the algorithm finds a good approximation to the Pareto front.
Application Notes and Further Reading
In addition to this experiment, other applications of the EHVI have been recently reported. It was first used as a selection criterion in evolutionary optimization [3] and in the context of airfoil design [4] and quantum control [18]. To our knowledge, it was used for the first time in BGO in the context of airfoil optimization in [13]. It has been compared conceptually to other multicriteria infill criteria, including the proposals made in [11], [10], and [23]. Other applications are robotics [20], biogas plant controllers [6], event detection in water quality management [24], structural design optimization [17], and tuning of machine learning tools [12]. An empirical comparison with other infill criteria is found in [16].
Summary and Outlook
This chapter described the EHVI as a multicriteria generalization of the expected improvement used in BGO. This generalization is based on the hypervolume indicator, which is a quality indicator for Pareto front approximations. It has recently served as an infill criterion in a number of BGO case studies, but was criticized for its high computational complexity. In this chapter, the time complexity of the 2-D EHVI was shown to be only \(\varTheta (n)\). The linear time algorithm presented in this chapter improves upon previously proposed algorithms, which required quadratic time. It assumes a sorted Pareto front approximation (otherwise its complexity is \(O(n\log n)\)), which is typically available in BGO. During a single iteration of BGO a large number of EHVI evaluations need to be performed in order to find the maximizer of the EHVI under the Gaussian random field model. Therefore the fast algorithm will be of great benefit for reducing the running time of multicriteria BGO based on EHVI.
Future research will investigate in more depth the theoretical properties of the EHVI. For the first results in this direction refer to [5], where it was shown that the 2-D EHVI is monotonic in the mean values and variance. Also it will be interesting to analyze the time complexity of EHVI for more than two objective functions.
References
1. Auger, A., Bader, J., Brockhoff, D., Zitzler, E.: Theory of the hypervolume indicator: optimal μ-distributions and the choice of the reference point. In: Proceedings of the Tenth ACM SIGEVO Workshop on Foundations of Genetic Algorithms, pp. 87–102. ACM, Chicago (2009)
2. Couckuyt, I., Deschrijver, D., Dhaene, T.: Fast calculation of multiobjective probability of improvement and expected improvement criteria for Pareto optimization. J. Global Optim. 60 (3), 575–594 (2014)
3. Emmerich, M.: Single-and multi-objective evolutionary design optimization assisted by Gaussian random field metamodels. Ph.D. thesis, Fachbereich Informatik, Chair of Systems Analysis, University of Dortmund (2005)
4. Emmerich, M., Giannakoglou, K.C., Naujoks, B.: Single-and multiobjective evolutionary optimization assisted by Gaussian random field metamodels. IEEE Trans. Evol. Comput. 10 (4), 421–439 (2006)
5. Emmerich, M., Deutz, A.H., Klinkenberg, J.W.: Hypervolume-based expected improvement: monotonicity properties and exact computation. In: 2011 IEEE Congress on Evolutionary Computation (CEC), pp. 2147–2154. IEEE, New Jersey (2011)
6. Gaida, D.: Dynamic real-time substrate feed optimization of anaerobic co-digestion plants. Ph.D. thesis, Leiden Institute of Advanced Computer Science (LIACS), Faculty of Science, Leiden University (2014)
7. Hupkens, I., Emmerich, M., Deutz, A.: Faster computation of expected hypervolume improvement. arXiv preprint arXiv:1408.7114 (2014)
8. Hupkens, I., Deutz, A., Yang, K., Emmerich, M.: Faster exact algorithms for computing expected hypervolume improvement. In: Evolutionary Multi-Criterion Optimization, pp. 65–79. Springer, Berlin, Heidelberg (2015)
9. Jones, D.R., Schonlau, M., Welch, W.J.: Efficient global optimization of expensive black-box functions. J. Global Optim. 13 (4), 455–492 (1998)
10. Keane, A.J.: Statistical improvement criteria for use in multiobjective design optimization. AIAA J. 44 (4), 879–891 (2006)
11. Knowles, J.: ParEGO: a hybrid algorithm with on-line landscape approximation for expensive multiobjective optimization problems. IEEE Trans. Evol. Comput. 10 (1), 50–66 (2006)
12. Koch, P., Wagner, T., Emmerich, M.T., Bäck, T., Konen, W.: Efficient multi-criteria optimization on noisy machine learning problems. Appl. Soft Comput. 29, 357–370, New Jersey (2015)
13. Łaniewski-Wołłk, Ł., Obayashi, S., Jeong, S.: Development of expected improvement for multi-objective problem. In: Proceedings of 42nd Fluid Dynamics Conference/Aerospace Numerical Simulation Symposium (2010)
14. Mockus, J.: Bayesian Approach to Global Optimization: Theory and Applications, vol. 37. Springer Science & Business Media, New York (2012)
15. Mockus, J., Tiesis, V., Žilinskas, A.: The application of Bayesian methods for seeking the extremum. In: Towards Global Optimization, vol. 2, pp. 117–129. North-Holland, Amsterdam (1978)
16. Shimoyama, K., Sato, K., Jeong, S., Obayashi, S.: Comparison of the criteria for updating Kriging response surface models in multi-objective optimization. In: 2012 IEEE Congress on Evolutionary Computation (CEC), pp. 1–8. IEEE, New Jersey (2012)
17. Shimoyama, K., Sato, K., Jeong, S., Obayashi, S.: Updating Kriging surrogate models based on the hypervolume indicator in multi-objective optimization. J. Mech. Des. 135 (9), 094503 (2013)
18. Shir, O.M., Emmerich, M., Bäck, T., Vrakking, M.J.: The application of evolutionary multi-criteria optimization to dynamic molecular alignment. In: IEEE Congress on Evolutionary Computation, 2007, CEC 2007, pp. 4108–4115. IEEE, New Jersey (2007)
19. Stein, M.L.: Interpolation of Spatial Data: Some Theory for Kriging. Springer Science & Business Media, New York (2012)
20. Tesch, M., Schneider, J., Choset, H.: Adapting control policies for expensive systems to changing environments. In: 2011 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 357–364. IEEE, New Jersey (2011)
21. Törn, A., Žilinskas, A.: Global Optimization. Springer, New York (1989)
22. Vazquez, E., Bect, J.: Convergence properties of the expected improvement algorithm with fixed mean and covariance functions. J. Stat. Plan. Inference 140 (11), 3088–3095 (2010)
23. Wagner, T., Emmerich, M., Deutz, A., Ponweiser, W.: On expected-improvement criteria for model-based multi-objective optimization. In: Parallel Problem Solving from Nature. PPSN XI, pp. 718–727. Springer, Berlin, Heidelberg (2010)
24. Zaefferer, M., Bartz-Beielstein, T., Naujoks, B., Wagner, T., Emmerich, M.: A case study on multi-criteria optimization of an event detection software under limited budgets. In: Evolutionary Multi-Criterion Optimization, pp. 756–770. Springer, Berlin, Heidelberg (2013)
25. Žilinskas, A., Mockus, J.: On one Bayesian method of search of the minimum. Avtomatika i Vychislitel’naya Teknika 4, 42–44 (1972)
Acknowledgements
Hao Wang gratefully acknowledges support by the Netherlands Organisation for Scientific Research, NWO ICT PPP Project Grant “Process mining for multi-objective online control (PROMIMOOC)”. Kaifeng Yang acknowledges financial support from China Scholarship Council (CSC), CSC No. 201306370037.
Copyright information
© 2016 Springer International Publishing Switzerland
About this chapter
Cite this chapter
Emmerich, M., Yang, K., Deutz, A., Wang, H., Fonseca, C.M. (2016). A Multicriteria Generalization of Bayesian Global Optimization. In: Pardalos, P., Zhigljavsky, A., Žilinskas, J. (eds) Advances in Stochastic and Deterministic Global Optimization. Springer Optimization and Its Applications, vol 107. Springer, Cham. https://doi.org/10.1007/978-3-319-29975-4_12
DOI: https://doi.org/10.1007/978-3-319-29975-4_12
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-29973-0
Online ISBN: 978-3-319-29975-4
eBook Packages: Mathematics and Statistics (R0)