
Introduction

In the 1970s several new ideas for global optimization were proposed. Among these, the idea of Bayesian Global Optimization (BGO) was put forward by the Lithuanian research group of Jonas Mockus and Antanas Žilinskas [14, 15, 21, 25]. It had a lasting impact on the development of both deterministic and stochastic global optimization techniques. Today variations of this idea are known under various names, such as Efficient Global Optimization [9] or the Expected Improvement Algorithm [22]. In these techniques the goal is to find the extremum of a function \(f: \mathcal{X} \rightarrow \mathbb{R}\), where \(\mathcal{X}\) is a compact subspace of \(\mathbb{R}^{d}\). BGO assumes that the objective function is the realization of a Gaussian random field. This random field can be conditioned on the knowledge of \(f(\mathbf{x}^{(i)})\) at some points \(\mathbf{x}^{(i)} \in \mathcal{X},i = 1,\ldots,n\). Under this assumption, measures such as the expected improvement of a new design point are well defined and can be used to guide the search towards the global optimum.

In this chapter we describe a generalization of this approach to multicriteria optimization. It iteratively evaluates points from \(\mathcal{X}\) and finds a well distributed subset of the Pareto front of a multicriteria optimization problem. The algorithm is based on a generalization of the expected improvement that builds on the hypervolume indicator, the so-called Expected Hypervolume Improvement (EHVI) [3]. It has attractive theoretical properties [23], but so far its computation was considered to be expensive. In this chapter it is shown that, for bicriteria optimization, there is a fast algorithm for computing the EHVI whose time complexity is only linear in the size of the intermediate approximation to the Pareto front, provided that approximation is given as a sorted set. It is shown that this algorithm has asymptotically optimal time complexity.

This chapter is organized as follows: section “Bayesian Global Optimization” introduces the framework of BGO. Section “Multicriteria Optimization” shows how this framework can be generalized to multicriteria optimization. Section “Expected Hypervolume Improvement” defines the EHVI, discusses some of its theoretical properties, and reviews recent applications. Section “Efficient Exact Computation” outlines the new, asymptotically efficient algorithm for its exact computation and proves that it has asymptotically optimal time complexity for bicriteria problems. A numerical example is discussed in section “Numerical Example”. Section “Application Notes and Further Reading” points to some recent applications and related work. Finally, section “Summary and Outlook” concludes with a summary and discusses open questions.

Bayesian Global Optimization

In BGO the goal is to solve d-dimensional global optimization problems of the type: find \(\mathbf{x}^{\ast}\) with

$$\displaystyle{ \mathbf{x}^{{\ast}} \in \arg \min _{\mathbf{ x}\in \mathcal{X}}f(\mathbf{x}),\mathcal{X} = [\mathbf{x}_{min},\mathbf{x}_{max}] \subset \mathbb{R}^{d} }$$
(1)

(Without loss of generality we consider minimization only.)

In order to do so, a sequence \(\{\mathbf{x}^{(t)}\}_{t=1,2,\ldots }\) of points is computed such that

$$\displaystyle{ \mathbf{x}^{(t)} \in \arg \max _{\mathbf{x}\in \mathcal{X}}\mathrm{E}\left (I(\mathbf{x})\ \vert \ (\mathbf{x}^{(1)},f(\mathbf{x}^{(1)})),\ldots,(\mathbf{x}^{(t-1)},f(\mathbf{x}^{(t-1)}))\right ) }$$
(2)

Here \(\mathrm{E}\left (I(\mathbf{x})\ \vert \ (\mathbf{x}^{(1)},f(\mathbf{x}^{(1)})),\ldots,(\mathbf{x}^{(t-1)},f(\mathbf{x}^{(t-1)}))\right )\) denotes the expected improvement, which measures how promising the new point \(\mathbf{x}\) is, given the t − 1 previous evaluations of f at \(\mathbf{x}^{(1)},\ldots,\mathbf{x}^{(t-1)}\). This expected improvement is the expected value of a random variable, here called \(I(\mathbf{x})\), that requires further explanation.

In BGO one makes the assumption that the function f is the realization of a Gaussian random field F. A Gaussian random field is an infinite collection of random variables, each identified by its spatial index \(\mathbf{x} \in \mathbb{R}^{d}\); we denote the random variable with index \(\mathbf{x}\) by \(\mathbf{F}_{\mathbf{x}}\). It is assumed that the random variables share the same global mean value β and global variance \(s^{2}\). Moreover, a correlation \(\rho(\mathbf{F}_{\mathbf{u}},\mathbf{F}_{\mathbf{v}})\) is defined for every pair of indices \(\mathbf{u} \in \mathbb{R}^{d}\) and \(\mathbf{v} \in \mathbb{R}^{d}\). This correlation depends on the relation between u and v. A typical family of correlation functions is

$$\displaystyle{\rho (\mathbf{F}_{\mathbf{u}},\mathbf{F}_{\mathbf{v}}) =\exp \left (-\sum _{i=1}^{d}\theta _{ i}\vert u_{i} - v_{i}\vert ^{q_{i} }\right )}$$

It is important that this correlation function is positive definite. It attains the value 1 if \(\mathbf{v} = \mathbf{u}\) and decreases with increasing distance between \(\mathbf{u}\) and \(\mathbf{v}\). The parameters \(q_{i}\) and \(\theta_{i}\) are either set by the user or obtained from data fitting; the parameters \(\theta_{i}\) are positive.

The Gaussian random field can be viewed as a multivariate Gaussian distribution of infinite dimension. We can use well-known expressions for the marginal and conditional distributions of a multivariate Gaussian distribution to find the conditional distribution, given that the realizations of some of the one-dimensional random variables are known. That is, given the prior information \(\mathbf{F}_{\mathbf{ x}^{(1)}} = f(\mathbf{x}^{(1)}),\ldots,\mathbf{F}_{\mathbf{ x}^{(t-1)}} = f(\mathbf{x}^{(t-1)})\), we can compute the parameters μ (conditional mean) and \(\sigma^{2}\) (conditional variance) of the conditioned random variable:

$$\displaystyle{ \mathbf{F}_{\mathbf{x}}\ \vert \ \mathbf{F}_{\mathbf{x}^{(1)}} = f(\mathbf{x}^{(1)}),\ldots,\mathbf{F}_{\mathbf{x}^{(t-1)}} = f(\mathbf{x}^{(t-1)}) }$$
(3)

As a shortcut we will denote this random variable by \(\mathbf{F}_{\mathbf{x}}\ \vert \ \mathbf{X},f(\mathbf{X})\), where \(\mathbf{X} = (\mathbf{x}^{(1)},\ldots,\mathbf{x}^{(t-1)})\) denotes the indices for which realizations are known, and the values of the corresponding realizations are abbreviated by

$$\displaystyle{f(\mathbf{X}) = \left (f(\mathbf{x}^{(1)}),\ldots,f(\mathbf{x}^{(t-1)})\right )}$$

The estimation of the hyperparameters \(\theta_{i}\) and \(q_{i}\), \(i = 1,\ldots,d\), of the correlation function, as well as of the global mean and variance, can be accomplished by maximum likelihood methods. For details on the computation of the parameters of the conditional distribution we refer to the specialized literature [19].
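To make the conditioning step concrete, the following is a minimal sketch (in Python/numpy, our own illustration and not the authors' implementation) of the simple-kriging predictor under the correlation model above, assuming that β, \(s^{2}\), \(\theta_{i}\), and \(q_{i}\) have already been estimated.

```python
import numpy as np

def corr(u, v, theta, q):
    """Correlation rho(F_u, F_v) = exp(-sum_i theta_i * |u_i - v_i|^q_i)."""
    return np.exp(-np.sum(theta * np.abs(u - v) ** q))

def condition_grf(x, X, fX, beta, s2, theta, q):
    """Conditional mean and variance of F_x given F_{x^(1)} = f(x^(1)), ...
    X: array of shape (t-1, d) with the evaluated points, fX: their f-values."""
    C = np.array([[corr(u, v, theta, q) for v in X] for u in X])  # correlation matrix of observations
    c = np.array([corr(x, v, theta, q) for v in X])               # correlations between x and observations
    w = np.linalg.solve(C, c)                                     # C^{-1} c
    mu = beta + w @ (fX - beta)                                   # conditional mean
    sigma2 = s2 * (1.0 - c @ w)                                   # conditional variance
    return mu, max(float(sigma2), 0.0)                            # clamp tiny negative round-off
```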

Now, the expected improvement can be defined: The improvement of a function value \(y \in \mathbb{R}\) is defined as

$$\displaystyle{ \mathrm{I}(y) =\max \{ 0,\,y_{min} - y\} }$$
(4)

where \(y_{min} =\min \{ f(\mathbf{x}^{(1)}),\ldots,f(\mathbf{x}^{(t-1)})\}\). Then the expected improvement is defined as

$$\displaystyle{ \mathrm{E}\left (\mathrm{I}(\mathbf{F}_{\mathbf{x}}\ \vert \ \mathbf{X},f(\mathbf{X}))\right ) =\int _{ y=-\infty }^{y_{min}}\mathrm{I}(y)\,\mathrm{PDF}_{\mathbf{x}\vert \mathbf{X},f(\mathbf{X})}(y)\,\mathrm{d}y }$$
(5)

Here \(\mathrm{PDF}_{\mathbf{x}\vert \mathbf{X},f(\mathbf{X})}\) is the probability density function of \(\mathbf{F}_{\mathbf{x}}\ \vert \ \mathbf{X},f(\mathbf{X})\).
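This one-dimensional integral has the well-known closed form \((y_{min}-\mu)\,\varPhi\!\left(\frac{y_{min}-\mu}{\sigma}\right) + \sigma\,\phi\!\left(\frac{y_{min}-\mu}{\sigma}\right)\), where μ and σ are the conditional mean and standard deviation at x and φ, Φ denote the standard Gaussian PDF and CDF. A short sketch (our own helper, using scipy):

```python
from scipy.stats import norm

def expected_improvement(mu, sigma, y_min):
    """Closed form of (5) for F_x | X, f(X) ~ N(mu, sigma^2)."""
    if sigma <= 0.0:
        return max(y_min - mu, 0.0)      # degenerate (deterministic) prediction
    u = (y_min - mu) / sigma
    return (y_min - mu) * norm.cdf(u) + sigma * norm.pdf(u)
```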

Multicriteria Optimization

A continuous multicriteria optimization problem with m objectives is a problem where multiple objective functions, say \(f_{1}: \mathcal{X} \rightarrow \mathbb{R},\ldots,f_{m}: \mathcal{X} \rightarrow \mathbb{R}\), are to be minimized simultaneously, with \(\mathcal{X} \subseteq \mathbb{R}^{d}\).

In the a posteriori approach to multicriteria optimization, an approximation to the Pareto front of the problem is computed first. Based on this, the trade-off is analyzed and a solution is selected by the decision maker. To define a Pareto front, we introduce the Pareto dominance order ≺ on \(\mathbb{R}^{m}\), with \(\forall \mathbf{y},\mathbf{z} \in \mathbb{R}^{m}: \mathbf{y} \prec \mathbf{z} \Leftrightarrow (\forall i \in \{ 1,\ldots,m\}: y_{i} \leq z_{i})\mbox{ and }\mathbf{y}\neq \mathbf{z}\). The non-dominated subset of a finite multiset of vectors \(\mathrm{Y} =\{ \mathbf{y}^{(1)},\ldots,\mathbf{y}^{(n)}\}\) is defined as nd(Y) = {y ∈ Y | ∄ z ∈ Y: z ≺ y}. Given a multicriteria optimization problem, the image of \(\mathcal{X}\) is defined as \(\mathcal{Y} =\{ \mathbf{f}(\mathbf{x})\ \vert \ \mathbf{x} \in \mathcal{X}\}\). The Pareto front of a multicriteria optimization problem is defined as \(\mathcal{Y}_{\mathrm{nd}}:=\mathrm{ nd}(\mathcal{Y})\). An important special case is bicriteria optimization, where m = 2.
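As a concrete illustration of the dominance order and of nd(Y), a minimal Python helper (our own code, assuming minimization):

```python
def nd(Y):
    """Non-dominated subset nd(Y) of a multiset Y of objective vectors (minimization)."""
    return [y for y in Y
            if not any(z != y and all(zi <= yi for zi, yi in zip(z, y)) for z in Y)]
```

For example, nd([(3, 1), (2, 1.5), (2, 2), (1, 2.5)]) returns [(3, 1), (2, 1.5), (1, 2.5)], since (2, 2) is dominated by (2, 1.5).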

One way to generalize the BGO algorithm is to compute the expected improvement of the hypervolume indicator. The hypervolume indicator is the m-dimensional Lebesgue measure \(\lambda_{m}\) of the dominated subspace, bounded from above by a reference point \(\mathbf{r} \in \mathbb{R}^{m}\). More precisely, the hypervolume indicator is defined as

$$\displaystyle{ \mathrm{hv}(\mathrm{Y}) =\lambda _{m}\left (\{\mathbf{y} \in \mathbb{R}^{m}\ \vert \ \exists \mathbf{z} \in \mathrm{ Y}: \mathbf{z} \prec \mathbf{y} \wedge \mathbf{y} \prec \mathbf{r}\}\right ) =\lambda _{ m}\left (\bigcup _{\mathbf{y}\in \mathrm{Y}}[\mathbf{y},\mathbf{r}]\right ) }$$
(6)

In Fig. 1 the hypervolume indicator is illustrated for a Pareto front approximation with nine points and two objective functions (m = 2). Given a problem with a Pareto front bounded above by the reference point, sets that maximize the hypervolume indicator are well distributed subsets of the Pareto front [1]. This is why finding the Pareto front is sometimes recast as the problem of maximizing the hypervolume indicator over the set of all subsets of \(\mathcal{X}\). We will call this problem hypervolume maximization.

Fig. 1: Hypervolume indicator of a Pareto front approximation
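For m = 2 the hypervolume indicator of a finite set can be computed by a single sweep over the points sorted by the first objective. The following sketch (our own code, reusing the nd helper above) implements (6) for minimization:

```python
def hv2d(Y, r):
    """Hypervolume indicator (6) for m = 2 with reference point r (minimization)."""
    pts = sorted(y for y in nd(Y) if y[0] < r[0] and y[1] < r[1])  # ascending in y1
    hv = 0.0
    for i, (y1, y2) in enumerate(pts):
        next_y1 = pts[i + 1][0] if i + 1 < len(pts) else r[0]      # right edge of the slab
        hv += (next_y1 - y1) * (r[1] - y2)                         # slab between y1 and next_y1
    return hv
```

For the front used later in Example 1, hv2d([(3, 1), (2, 1.5), (1, 2.5)], (4, 4)) returns 7.0.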

Expected Hypervolume Improvement

For hypervolume maximization problems the generalization of the improvement function is straightforward. We generalize the best solution found up to iteration t − 1, namely \(y_{min,t-1}\), to

$$\displaystyle{ \mathrm{Y}_{\mathrm{nd},t-1} =\mathrm{ nd}\left (\{\mathbf{y}^{(1)},\ldots,\mathbf{y}^{(t-1)}\}\right ) }$$
(7)

The improvement function is generalized by the following definition of an m-dimensional improvement:

$$\displaystyle{ \mathrm{I}_{m}(\mathbf{y},\mathrm{Y}_{\mathrm{nd},t-1},\mathbf{r}):=\mathrm{ hv}\left (\mathrm{Y}_{\mathrm{nd},t-1} \cup \{\mathbf{y}\}\right ) -\mathrm{ hv}(\mathrm{Y}_{\mathrm{nd},t-1}) }$$
(8)

It is easy to show that this \(\mathrm{I}_{m}\) specializes to the improvement function of the single-objective case, if we choose \(r_{1}\) to be sufficiently large.
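In code, the improvement (8) for m = 2 is simply a difference of two hypervolume values (reusing the hv2d sketch above):

```python
def hv_improvement(y, Y_nd, r):
    """I_2(y, Y_nd, r) = hv(Y_nd ∪ {y}) - hv(Y_nd), cf. (8)."""
    return hv2d(list(Y_nd) + [y], r) - hv2d(list(Y_nd), r)
```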

In order to compute the expected improvement in the multicriteria case, we also need to generalize the assumption on the Gaussian random field. For this, we consider one Gaussian random field per objective function and assume that there is no correlation between random variables from different random fields. For every point \(\mathbf{x} \in \mathcal{X}\) we obtain an m-dimensional random variable conditioned on the previous information, which is given by \(\mathbf{X}\) and \(\mathbf{f}(\mathbf{X}) = (\mathbf{f}(\mathbf{x}^{(1)}),\ldots,\mathbf{f}(\mathbf{x}^{(t-1)}))\).

The resulting EHVI can be denoted with

$$\displaystyle\begin{array}{rcl} & & \mathrm{E}\left (\mathrm{I}_{m}\left ((F_{1}(\mathbf{x}),\ldots,F_{m}(\mathbf{x}))\right )\ \vert \ \mathbf{X},\mathbf{f}(\mathbf{X}),\mathrm{Y}_{\mathrm{nd},t-1},\mathbf{r}\right ) = \\ & & \int _{\mathbf{y}\in \mathbb{R}^{m}}\mathrm{I}_{m}(\mathbf{y},\mathrm{Y}_{\mathrm{nd},t-1},\mathbf{r})\,\mathrm{PDF}_{\mathbf{x}\vert \mathbf{X},\mathbf{f}(\mathbf{X})}(\mathbf{y})\,\mathrm{d}\mathbf{y} {}\end{array}$$
(9)

and it is a generalization of the single-objective expected improvement, if we set \(y_{min,0} = r_{1}\).

Efficient Exact Computation

In this section the problem of computing the EHVI is studied and a new, efficient algorithm for bicriteria optimization is derived. Fast algorithms for computing the EHVI are important because, in BGO, a large number of EHVI evaluations is performed in each iteration when searching for its maximizer. Although the BGO algorithm is typically used in the context of expensive function evaluations, the optimization of the EHVI can contribute significantly to the total running time of the algorithm. For instance, this was recently reported as a major drawback of using the EHVI in [12], even when considering only the two-dimensional case.

A simplified notation will be used in the following. It focuses only on the elements that are relevant for the EHVI computation.

\(\boldsymbol{\mu }\in \mathbb{R}^{m}\): mean values of the predictive distribution
\(\boldsymbol{\sigma }\in (\mathbb{R}_{0}^{+})^{m}\): standard deviations of the predictive distribution
\(\mathrm{Y} \in (\mathbb{R}^{m})^{n}\): sequence of mutually non-dominated points (Pareto front approximation at iteration t − 1)
\(\mathbf{y}^{(1)},\ldots,\mathbf{y}^{(n)} \in \mathbb{R}^{m}\): the vectors in Y
\(\mathbf{r} \in \mathbb{R}^{m}\): reference point

For computing integrals of the expected improvement it is useful to define the function Δ. For a given vector of objective function values \(\mathbf{y} \in \mathbb{R}^{m}\), Δ(y, Y, r) is the set of vectors in \(\mathbb{R}^{m}\) which are dominated by the vector y but not by any element of Y, and that dominate the reference point; in symbols

$$\displaystyle{ \varDelta (\mathbf{y},\mathrm{Y},\mathbf{r}) =\{ \mathbf{z} \in \mathbb{R}^{m}\ \vert \ \mathbf{y} \prec \mathbf{z}\mbox{ and }\mathbf{z} \prec \mathbf{r}\mbox{ and }\nexists \mathbf{q} \in \mathrm{ Y}: \mathbf{q} \prec \mathbf{z}\} }$$
(10)

In order to simplify notation, we will write Δ(y) whenever Y and r are clear from the context.

Based on this, we can now concisely (re-)define the EHVI function as

$$\displaystyle{ \mathrm{EHVI}(\boldsymbol{\mu },\boldsymbol{\sigma },\mathrm{Y},\mathbf{r}) =\int _{ y_{1}=-\infty }^{\infty }\cdots \int _{ y_{m}=-\infty }^{\infty }\lambda _{ m}(\varDelta (\mathbf{y}))\mathrm{PDF}_{\boldsymbol{\mu },\boldsymbol{\sigma }}(\mathbf{y})\mathrm{d}y_{1}\ldots \mathrm{d}y_{m} }$$
(11)

Example 1.

An illustration of the EHVI is displayed in Fig. 2. The light gray area is the dominated subspace of \(\mathrm{Y} = \{\mathbf{y}^{(1)} = (3,1), \mathbf{y}^{(2)} = (2,1.5), \mathbf{y}^{(3)} = (1,2.5)\}\) cut by the reference point \(\mathbf{r} = (4,4)\). The bivariate Gaussian distribution has the parameters \(\mu_{1} = 2\), \(\mu_{2} = 1.5\), \(\sigma_{1} = 0.7\), and \(\sigma_{2} = 0.6\). The PDF of the bivariate Gaussian distribution is indicated as a 3-D plot. Here \(\mathbf{y}\) is a sample from this distribution, and the area of improvement relative to Y is indicated by the dark shaded area. The variable \(y_{1}\) stands for the \(f_{1}\) value and \(y_{2}\) for the \(f_{2}\) value.

Fig. 2: Expected hypervolume improvement in 2-D (cf. Example 1)

State of the Art

To compute the EHVI (9), Monte Carlo integration was suggested in [3, 4]. Exact algorithms for computing the EHVI were derived in [5] for m = 2 and in [2] for m > 2. A different algorithm is described in [7].
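For reference, a Monte Carlo estimator of (9) takes only a few lines when the sketches from the previous sections are reused; it is restricted to m = 2 here because it relies on the 2-D hypervolume helper, and the sampling scheme and defaults are our own choices:

```python
import numpy as np

def ehvi_mc(mu, sigma, Y_nd, r, n_samples=10_000, seed=0):
    """Monte Carlo estimate of the EHVI (9) for m = 2: average hypervolume
    improvement of samples y ~ N(mu, diag(sigma^2)) from the independent
    predictive distributions of the two objectives."""
    rng = np.random.default_rng(seed)
    samples = rng.normal(mu, sigma, size=(n_samples, 2))
    return float(np.mean([hv_improvement(tuple(y), Y_nd, r) for y in samples]))
```

Such an estimator is easy to state for any m, but it converges slowly, which motivates the exact algorithms discussed next.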

Fast algorithms have been proposed in [2] and, for m = 2, 3, even faster algorithms in [8]. So far the best known bounds on the time complexity of exact computation are \(O(n^{2})\) for m = 2 and \(O(n^{3})\) for m = 3. It is notable that the number of transcendental function evaluations scales only linearly in n in the algorithm presented in [8]. A lower bound of Ω(n log n) is provided for unsorted Y. However, it makes sense to assume that Y is sorted by the first coordinate; in that case, as will be shown, a lower bound of Ω(n) still holds. None of the EHVI algorithms known so far reach these lower bounds. In this chapter we present an algorithm for m = 2 that does.

Next, an algorithm is outlined that attains the Ω(n log n) lower bound, thereby proving that the time complexity of computing the EHVI for unsorted Y is Θ(n log n). However, this complexity stems from the effort inherent in sorting Y by the first coordinate.

Keeping Y sorted by the first coordinate requires only an amortized effort of O(log n) per iteration. It therefore makes sense to assume a sorted Y. For this case we can show that the time complexity is Θ(n). To do so, we first establish a lower bound of Ω(n) for this case:

Lemma 1.

The computational time complexity of computing the EHVI for a set Y that is sorted by the first coordinate is bounded from below by Ω(n).

Proof.

An adversary argument can be used to prove this statement. The algorithm has to “look at” all n points: if some point were not inspected, an adversary could move it without the algorithm noticing, and a move of any single point can, in general, change the EHVI. □

Efficient Algorithm

For m = 2 the expected improvement can be computed in linear time, given that Y is already sorted by the first coordinate. Next, a formula will be derived that consists of n + 1 integrals, each of which can be solved in constant time.

The starting point of the derivation is to partition the objective space into n + 1 disjoint rectangular stripes \(S_{1},\ldots,S_{n+1}\), as indicated in Fig. 3 (left). In order to define the stripes formally, assume that Y is sorted in descending order of the first coordinate and augment it with two sentinels: \(\mathbf{y}^{(0)} = (r_{1},-\infty )\) and \(\mathbf{y}^{(n+1)} = (-\infty,r_{2})\). The stripes are now defined by

$$\displaystyle{S_{i} = \left (\left (\begin{array}{c} y_{1}^{(i)} \\ -\infty \end{array} \right ),\left (\begin{array}{c} y_{1}^{(i-1)} \\ y_{2}^{(i)} \end{array} \right )\right ),i = 1,\ldots,n+1}$$
Fig. 3: Left: partitioning of the integration region into stripes. Right: new partitioning of the reduced integration region after the first iteration of the algorithm

We can now express the improvement of a point \(\mathbf{y} \in \mathbb{R}^{2}\) by

$$\displaystyle{ \mathrm{I}_{2}(\mathbf{y},\mathrm{Y},\mathbf{r}) =\sum _{ i=1}^{n+1}\lambda _{ 2}[S_{i} \cap \varDelta (\mathbf{y})] }$$
(12)

This gives rise to the following compact integral for the original EHVI, where \(\mathbf{y} = (y_{1},y_{2})\):

$$\displaystyle{ \mathrm{EHVI}(\boldsymbol{\mu },\boldsymbol{\sigma },\mathrm{Y},\mathbf{r}) =\int _{ y_{1}=-\infty }^{\infty }\int _{ y_{2}=-\infty }^{\infty }\sum _{ i=1}^{n+1}\lambda _{ 2}[S_{i} \cap \varDelta (y_{1},y_{2})] \cdot PDF_{\boldsymbol{\mu },\boldsymbol{\sigma }}(\mathbf{y})d\mathbf{y} }$$
(13)

It is observed that the intersection of \(S_{i}\) with \(\varDelta(y_{1},y_{2})\) is non-empty if and only if \(\mathbf{y} = (y_{1},y_{2})\) dominates the upper right corner of \(S_{i}\), in other words, if and only if \(\mathbf{y}\) is located in the rectangle with lower left corner \((-\infty,-\infty)\) and upper right corner \((y_{1}^{(i-1)},y_{2}^{(i)})\). See Fig. 3 (right) for an illustration. Therefore

$$\displaystyle{ \mathrm{EHVI}(\boldsymbol{\mu },\boldsymbol{\sigma },\mathrm{Y},\mathbf{r}) =\sum _{ i=1}^{n+1}\int _{ y_{1}=-\infty }^{y_{1}^{(i-1)} }\int _{y_{2}=-\infty }^{y_{2}^{(i)} }\lambda _{2}[S_{i} \cap \varDelta (y_{1},y_{2})] \cdot PDF_{\boldsymbol{\mu },\boldsymbol{\sigma }}(\mathbf{y})d\mathbf{y} }$$
(14)

In (14) the summation and integration have also been interchanged, which is allowed because integration is a linear mapping.

Details of the Constant Time Integration

Each of the n + 1 summands in (14) can be evaluated in constant time. To see this, split the integration over \(y_{1}\) at \(y_{1} = y_{1}^{(i)}\):

$$\displaystyle{ \mathrm{EHVI}(\boldsymbol{\mu },\boldsymbol{\sigma },\mathrm{Y},\mathbf{r}) =\sum _{ i=1}^{n+1}\int _{ y_{1}=-\infty }^{y_{1}^{(i-1)} }\int _{y_{2}=-\infty }^{y_{2}^{(i)} }\lambda _{2}[S_{i} \cap \varDelta (y_{1},y_{2})] \cdot PDF_{\boldsymbol{\mu },\boldsymbol{\sigma }}(\mathbf{y})d\mathbf{y} }$$
(15)
$$\displaystyle\begin{array}{rcl} & & \phantom{\mathrm{EHVI}(\boldsymbol{\mu },\boldsymbol{\sigma },\mathrm{Y},\mathbf{r})} = \quad \sum _{i=1}^{n+1}\int _{ y_{1}=-\infty }^{y_{1}^{(i)} }\int _{y_{2}=-\infty }^{y_{2}^{(i)} }\lambda _{2}[S_{i} \cap \varDelta (y_{1},y_{2})] \cdot PDF_{\boldsymbol{\mu },\boldsymbol{\sigma }}(\mathbf{y})d\mathbf{y} + \\ & & \phantom{\mathrm{EHVI}(\boldsymbol{\mu },\boldsymbol{\sigma },\mathrm{Y},\mathbf{r})}\qquad \sum _{i=1}^{n+1}\int _{ y_{1}=y_{1}^{(i)}}^{y_{1}^{(i-1)} }\int _{y_{2}=-\infty }^{y_{2}^{(i)} }\lambda _{2}[S_{i} \cap \varDelta (y_{1},y_{2})] \cdot PDF_{\boldsymbol{\mu },\boldsymbol{\sigma }}(\mathbf{y})d\mathbf{y}. {}\end{array}$$
(16)

Recall the definitions of the standard Gaussian PDF and CDF, \(\phi (x) = \dfrac{1} {\sqrt{2\pi }}\mathrm{exp}(-x^{2}/2)\) and \(\varPhi (x) = \dfrac{1} {2}\left(1 +\mathrm{ erf}\left(x/\sqrt{2}\right)\right)\), and a function Ψ that was defined in [8] as follows:

$$\displaystyle{\varPsi (a,b,\mu,\sigma ) =\int _{ -\infty }^{b}(a - z)\dfrac{1} {\sigma } \phi \left (\dfrac{z-\mu } {\sigma } \right )dz.}$$

Moreover it can be shown that

$$\displaystyle{\varPsi (a,b,\mu,\sigma ) =\int _{ -\infty }^{b}(a - z)\dfrac{1} {\sigma } \phi \left (\dfrac{z-\mu } {\sigma } \right )dz =\sigma \phi \left (\dfrac{b-\mu } {\sigma } \right ) + (a-\mu )\varPhi \left (\dfrac{b-\mu } {\sigma } \right ).}$$
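The closed form of Ψ is straightforward to implement and can be checked numerically against direct quadrature; the parameter values in the check below are arbitrary choices of our own:

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

def psi(a, b, mu, sigma):
    """Psi(a, b, mu, sigma) = sigma*phi((b-mu)/sigma) + (a-mu)*Phi((b-mu)/sigma)."""
    u = (b - mu) / sigma
    return sigma * norm.pdf(u) + (a - mu) * norm.cdf(u)

# Check the identity against quadrature for some arbitrary parameter values.
a, b, mu, sigma = 1.0, 2.0, 1.5, 0.6
integral, _ = quad(lambda z: (a - z) * norm.pdf(z, loc=mu, scale=sigma), -np.inf, b)
assert abs(psi(a, b, mu, sigma) - integral) < 1e-6
```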

Then the first summand of (16) can be written as follows:

$$\displaystyle\begin{array}{rcl} & & =\sum _{ i=1}^{n+1}\int _{ y_{1}=-\infty }^{y_{1}^{(i)} }\int _{y_{2}=-\infty }^{y_{2}^{(i)} }\lambda _{2}[S_{i} \cap \varDelta (y_{1},y_{2})] \cdot PDF_{\boldsymbol{\mu },\boldsymbol{\sigma }}(\mathbf{y})d\mathbf{y}, \\ & & =\sum _{ i=1}^{n+1}\int _{ y_{1}=-\infty }^{y_{1}^{(i)} }(y_{1}^{(i-1)} - y_{ 1}^{(i)}) \cdot PDF_{\mu _{ 1},\sigma _{1}}(y_{1})dy_{1}\int _{y_{2}=-\infty }^{y_{2}^{(i)} }(y_{2}^{(i)} - y_{ 2}) \cdot PDF_{\mu _{2},\sigma _{2}}(y_{2})dy_{2}, \\ & & =\sum _{ i=1}^{n+1}(y_{ 1}^{(i-1)} - y_{ 1}^{(i)})\int _{ y_{1}=-\infty }^{y_{1}^{(i)} }PDF_{\mu _{1},\sigma _{1}}(y_{1})dy_{1}\int _{y_{2}=-\infty }^{y_{2}^{(i)} }(y_{2}^{(i)} - y_{ 2}) \cdot PDF_{\mu _{2},\sigma _{2}}(y_{2})dy_{2}, \\ & & =\sum _{ i=1}^{n+1}(y_{ 1}^{(i-1)} - y_{ 1}^{(i)}) \cdot \varPhi \left (\dfrac{y_{1}^{(i)} -\mu _{ 1}} {\sigma _{1}} \right ) \cdot \varPsi (y_{2}^{(i)},y_{ 2}^{(i)},\mu _{ 2},\sigma _{2}). {}\end{array}$$
(17)

And the second summand of (16) can be written as follows:

$$\displaystyle\begin{array}{rcl} & & =\sum _{ i=1}^{n+1}\int _{ y_{1}=y_{1}^{(i)}}^{y_{1}^{(i-1)} }\int _{y_{2}=-\infty }^{y_{2}^{(i)} }\lambda _{2}[S_{i} \cap \varDelta (y_{1},y_{2})] \cdot PDF_{\boldsymbol{\mu },\boldsymbol{\sigma }}(\mathbf{y})d\mathbf{y}, \\ & & =\sum _{ i=1}^{n+1}\int _{ y_{1}=y_{1}^{(i)}}^{y_{1}^{(i-1)} }(y_{1}^{(i-1)} - y_{1}) \cdot PDF_{\mu _{1},\sigma _{1}}(y_{1})dy_{1} \cdot \int _{y_{2}=-\infty }^{y_{2}^{(i)} }(y_{2}^{(i)} - y_{2}) \cdot PDF_{\mu _{2},\sigma _{2}}(y_{2})dy_{2}, \\ & & =\sum _{ i=1}^{n+1}\left (\varPsi (y_{1}^{(i-1)},y_{1}^{(i-1)},\mu _{1},\sigma _{1}) -\varPsi (y_{1}^{(i-1)},y_{1}^{(i)},\mu _{1},\sigma _{1})\right ) \cdot \varPsi (y_{2}^{(i)},y_{2}^{(i)},\mu _{2},\sigma _{2}). {}\end{array}$$
(18)
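Putting (17) and (18) together yields the linear-time procedure for m = 2. The following sketch (our own Python code, reusing the psi function above; it is not the authors' C++/MATLAB implementation mentioned below) processes the stripes in one pass over Y sorted in descending order of the first objective:

```python
import math
from scipy.stats import norm

def ehvi_2d(mu, sigma, Y, r):
    """Exact EHVI for m = 2 via (17) and (18); O(n) for sorted input.
    Y: mutually non-dominated objective vectors, mu/sigma/r: length-2 sequences."""
    pts = sorted(Y, key=lambda y: y[0], reverse=True)        # descending in y1
    pts = [(r[0], -math.inf)] + pts + [(-math.inf, r[1])]    # sentinels y^(0), y^(n+1)
    total = 0.0
    for i in range(1, len(pts)):                             # stripes S_1, ..., S_{n+1}
        y1_prev, (y1_i, y2_i) = pts[i - 1][0], pts[i]
        psi2 = psi(y2_i, y2_i, mu[1], sigma[1])              # Psi(y2^(i), y2^(i), mu2, sigma2)
        # First summand, cf. (17); the term vanishes for the last stripe, where y1^(i) = -inf.
        if math.isfinite(y1_i):
            total += (y1_prev - y1_i) * norm.cdf((y1_i - mu[0]) / sigma[0]) * psi2
        # Second summand, cf. (18).
        total += (psi(y1_prev, y1_prev, mu[0], sigma[0])
                  - psi(y1_prev, y1_i, mu[0], sigma[0])) * psi2
    return total
```

For the setting of Example 1, the value of ehvi_2d((2, 1.5), (0.7, 0.6), [(3, 1), (2, 1.5), (1, 2.5)], (4, 4)) can be cross-checked against the Monte Carlo sketch ehvi_mc from section “State of the Art”.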

The C++ and MATLAB source code for computing the EHVI is made available at http://moda.liacs.nl or on request from the authors. The code has been compared against results of Monte Carlo integration and earlier implementations of the exact EHVI.

Numerical Example

The behavior of BGO based on the EHVI is illustrated by a single numerical experiment.

The numerical example is visualized in the plots of Fig. 4. The bicriteria optimization problem is: \(f_{1}(\mathbf{x}) = \|\mathbf{x} -\mathbf{1}\| \rightarrow \min\), \(f_{2}(\mathbf{x}) = \|\mathbf{x} +\mathbf{1}\| \rightarrow \min\), with \(\mathbf{1} = (1,1)^{\top}\) and \(\mathbf{x} \in [-2,2] \times [-2,2] \subset \mathbb{R}^{2}\). The Pareto front is the line segment from \((0,2\sqrt{2})\) to \((2\sqrt{2},0)\); the efficient set is the line segment that connects (−1, −1) and (1, 1). The metamodel used is a Gaussian random field model with Gaussian correlation function \(\exp(-\theta \|\mathbf{x}^{(1)} -\mathbf{x}^{(2)}\|^{2})\), for \(\mathbf{x}^{(1)} \in \mathbb{R}^{2}\) and \(\mathbf{x}^{(2)} \in \mathbb{R}^{2}\). We set θ = 0.0001, which was estimated by the maximum likelihood method for the initial sample. An initial set of 10 points was evaluated, indicated by the dark blue squares. From this starting set 15 new points were generated using the EHVI. The maximizer of the expected improvement was found using a uniform grid. In total, each objective function was evaluated 25 times.

Fig. 4: Example run of multicriteria Bayesian global optimization

The results of the experiment are depicted in the plots of Fig. 4. In all plots, points that have been evaluated are indicated by triangles. The points from the initial set are additionally marked by squares. Efficient points are surrounded by circles. The top row depicts the mean value of the Gaussian random field model at x ∈ [−2,2] × [−2,2] for f 1 and f 2, respectively. Likewise, the middle row depicts the variance of the Gaussian random field model at x ∈ [−2,2] × [−2,2] for f 1 and f 2, respectively. On the left-hand side of the bottom row the hypervolume-based expected improvement values after 25 iterations are shown. The final set of points in the objective space and the Pareto front approximation are shown in the plot in the lower right corner. Using only 25 evaluations of the original objective functions, the algorithm finds a good approximation to the Pareto front.
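For illustration, one EHVI-maximization step of such an experiment can be sketched by combining the helpers from the previous sections. The random initial design, the reference point, and θ = 0.5 (chosen larger than in the experiment above so that the toy correlation matrix stays well conditioned) are our own assumptions, and β and \(s^{2}\) are replaced by the sample mean and variance of the observations:

```python
import numpy as np

def objectives(x):
    # Our reading of the test problem: distances to (1, 1) and (-1, -1).
    return (float(np.linalg.norm(x - 1.0)), float(np.linalg.norm(x + 1.0)))

rng = np.random.default_rng(1)
X = rng.uniform(-2.0, 2.0, size=(10, 2))              # hypothetical initial design
Y = [objectives(x) for x in X]
front, r = nd(Y), (6.0, 6.0)                          # reference point: our choice
theta, q = np.array([0.5, 0.5]), np.array([2.0, 2.0])
f_obs = [np.array([y[j] for y in Y]) for j in (0, 1)]

best_val, best_x = -1.0, None
for a in np.linspace(-2.0, 2.0, 41):                  # uniform grid as in the text
    for b in np.linspace(-2.0, 2.0, 41):
        x = np.array([a, b])
        pred = [condition_grf(x, X, f, f.mean(), f.var(), theta, q) for f in f_obs]
        mu = [p[0] for p in pred]
        sd = [float(np.sqrt(p[1])) + 1e-12 for p in pred]   # avoid sigma = 0 at design points
        val = ehvi_2d(mu, sd, front, r)
        if val > best_val:
            best_val, best_x = val, x
print("next evaluation point:", best_x, "EHVI:", best_val)
```

This is only a toy re-enactment of the grid search described above; it does not reproduce the exact figures of the experiment.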

Application Notes and Further Reading

In addition to this experiment, other applications of the EHVI have been reported recently. It was first used as a selection criterion in evolutionary optimization [3] and in the context of airfoil design [4] and quantum control [18]. To our knowledge, it was used for the first time in BGO in the context of airfoil optimization in [13], and it was conceptually compared to other multicriteria infill criteria, including the proposals made in [11], [10], and [23]. Other applications are robotics [20], biogas plant controllers [6], event detection in water quality management [24], structural design optimization [17], and the tuning of machine learning tools [12]. An empirical comparison with other infill criteria is found in [16].

Summary and Outlook

This chapter described the EHVI as a multicriteria generalization of the expected improvement used in BGO. This generalization is based on the hypervolume indicator, which is a quality indicator for Pareto front approximations. The EHVI has recently served as an infill criterion in a number of BGO case studies, but was criticized for its high computational cost. In this chapter, the time complexity of the 2-D EHVI computation was shown to be only Θ(n). The linear-time algorithm presented in this chapter improves upon previously proposed algorithms, which required quadratic time. It assumes a Pareto front approximation sorted by the first coordinate (otherwise its complexity is O(n log n)), which is typically available in BGO. During a single iteration of BGO a large number of EHVI evaluations needs to be performed in order to find its maximizer on the Gaussian random field model. Therefore the fast algorithm will be of great benefit for reducing the running time of multicriteria BGO based on the EHVI.

Future research will investigate the theoretical properties of the EHVI in more depth. For first results in this direction we refer to [5], where it was shown that the 2-D EHVI is monotonic in the mean values and the variance. It will also be interesting to analyze the time complexity of the EHVI computation for more than two objective functions.