
1 Introduction

In recent decades, metamodeling (or surrogate modeling) has been increasingly used for problems involving costly computer codes (or “black-box simulators”). Practitioners typically have a very limited evaluation budget and aim to select evaluation points cautiously when attempting to solve a given problem.

In global optimization, the focus is usually put on a real-valued function \(f\) with \(d\)-dimensional source space. In this setting, Jones et al. [1] proposed the now-famous Efficient Global Optimization (EGO) algorithm, relying on a kriging metamodel [2] and on the Expected Improvement (EI) criterion [3]. In EGO, the optimization is done by sequentially evaluating \(f\) at points maximizing EI. A crucial advantage of this criterion is its fast computation (besides, the analytical gradient of EI is implemented in [4]), so that the hard optimization problem is replaced by a series of much simpler ones.

Coming back to the decision-theoretic roots of EI [5], a Multi-points Expected Improvement (also called “\(q\)-EI”) criterion for batch-sequential optimization was defined in [6] and further developed in [7, 8]. Maximizing this criterion enables choosing batches of \(q > 1\) points at which to evaluate \(f\) in parallel, and is of particular interest in the frequent case where several CPUs are simultaneously available. Even though an analytical formula was derived for the \(2\)-EI in [7], the Monte Carlo (MC) approach of [8] for computing \(q\)-EI when \(q \ge 3\) makes the criterion itself expensive-to-evaluate, and particularly hard to optimize.

Much effort has recently been devoted to addressing this problem. The pragmatic approach proposed by Ginsbourger and Le Riche [8] consists in circumventing a direct \(q\)-EI maximization and replacing it by simpler strategies where batches are obtained using an offline \(q\)-points EGO. In such strategies, the model updates are done using dummy response values such as the kriging mean prediction (Kriging Believer) or a constant (Constant Liar), and the covariance parameters are re-estimated only when real data is assimilated. In [9] and [10], \(q\)-EI optimization strategies were proposed relying on the MC approach, where the number of MC samples is tuned online to discriminate between candidate designs. Finally, Frazier [11] proposed a \(q\)-EI optimization strategy involving a stochastic gradient, with the crucial advantage of not requiring the evaluation of \(q\)-EI itself.

In this article we derive a formula allowing a fast and accurate approximate evaluation of \(q\)-EI. This formula may significantly speed up strategies relying on \(q\)-EI. The main result, relying on Tallis’ formula, is given in Sect. 2. The usability of the proposed formula is then illustrated in Sect. 3 through benchmark experiments, where a brute-force maximization of \(q\)-EI is compared to three variants of the Constant Liar strategy. In particular, a new variant (CL-mix) is introduced and shown to offer very good performance at a competitive computational cost. For self-containedness, a slightly revisited proof of Tallis’ formula is given in the Appendix.

2 Multi-Points Expected Improvement Explicit Formulas

In this section we give an explicit formula allowing a fast and accurate deterministic approximation of \(q\)-EI. Let us first make the mathematical setting precise. Throughout the paper, \(f\) is assumed to be one realisation of a Gaussian Process (GP) with known covariance kernel and with mean known up to some linear trend coefficients, so that the conditional distribution of a vector of values of the GP, conditional on past observations, is still Gaussian (an improper uniform prior is put on the trend coefficients when applicable). This being said, most forthcoming derivations boil down to calculations on Gaussian vectors. Let \(\mathbf {Y} := (Y_{1},\ldots ,Y_{q})\) be a Gaussian vector with mean \(\mathbf {m}\in \mathbb {R}^{q}\) and covariance matrix \(\varSigma \). Our aim in this paper is to explicitly calculate expressions of the following kind:

$$\begin{aligned} \mathbb {E}\left[ \left( \max _{i\in \{1,\ldots ,q\}} Y_i - T \right) _+ \right] \end{aligned}$$
(1)

where \((.)_+ := \max ( . , 0)\). In Bayesian optimization (say maximization), expectations and probabilities are taken conditional on response values at a given set of \(n\) points \((\mathbf {x}_{1},\ldots ,\mathbf {x}_{n})\in \mathbb {X}^n\), where \(\mathbb {X}\) is the input set of \(f\) (often a compact subset of \(\mathbb {R}^d\), \(d\ge 1\)), the threshold \(T\in \mathbb {R}\) is usually the maximum of those \(n\) available response values, and \(\mathbf {Y}\) is the vector of unknown responses at a given batch of \(q\) points, \(\mathbf {X}^{q}:=(\mathbf {x}_{n+1},\ldots ,\mathbf {x}_{n+q}) \in \mathbb {X}^{q}\). In such a framework, the vector \(\mathbf {m}\) and the matrix \(\varSigma \) are the so-called “Kriging mean” and “Kriging covariance” at \(\mathbf {X}^{q}\) and can be calculated relying on classical Kriging equations (see, e.g., [12]).
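As an illustration, the kriging mean and covariance at a batch follow from standard GP conditioning. The Python sketch below implements the zero-mean simple-kriging case only; the paper's setting with unknown linear trend coefficients (universal kriging, see [12]) would add trend-related correction terms. Function and variable names here are illustrative, not from any referenced package.

```python
import numpy as np

def kriging_batch(Xobs, y, Xnew, kern):
    """Kriging mean m and covariance Sigma of a zero-mean GP at a batch Xnew,
    given observations (Xobs, y). `kern(A, B)` is any covariance function
    returning the matrix of covariances between point sets A and B."""
    K = kern(Xobs, Xobs) + 1e-10 * np.eye(len(Xobs))  # small jitter for stability
    Ks = kern(Xobs, Xnew)                # cross-covariances, shape (n, q)
    sol = np.linalg.solve(K, Ks)
    m = sol.T @ y                        # kriging mean at the q batch points
    Sigma = kern(Xnew, Xnew) - Ks.T @ sol  # kriging covariance at the batch
    return m, Sigma
```

At an already-observed point, the kriging mean interpolates the data and the kriging variance vanishes (up to the jitter), which provides a quick sanity check of such an implementation.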

In order to obtain a tractable analytical formula for Expression (1), not requiring any Monte-Carlo simulation, let us first give a useful formula obtained by [13], and recently used in [14] for GP modeling with inequality constraints:

Proposition 1

(Tallis’ formulas). Let \(\mathbf {Z}:=(Z_1,\ldots ,Z_q)\) be a Gaussian Vector with mean \(\mathbf {m}\in \mathbb {R}^q\) and covariance matrix \(\varSigma \in \mathbb {R}^{q\times q}\). Let \(\mathbf {b}= (b_1,\ldots ,b_q)\in \mathbb {R}^q\). The expectation of any coordinate \(Z_k\) under the linear constraint \((\forall j\in \{1,\ldots ,q\},\ Z_j\le b_j)\) denoted by \(\mathbf {Z}\le \mathbf {b}\) can be expanded as follows:

$$\begin{aligned} \mathbb {E}(Z_k | \mathbf {Z}\le \mathbf {b}) = m_k -\frac{1}{p} \sum _{i=1}^q \varSigma _{ik}\ \varphi _{{m_i,\varSigma _{ii}}}(b_i)\ \varPhi _{q-1}\left( \mathbf {c}_{.i},\varSigma _{.i}\right) \end{aligned}$$
(2)

where:

  • \(p:= \mathbb {P}(\mathbf {Z}\le \mathbf {b}) = \varPhi _q(\mathbf {b}-\mathbf {m},\varSigma )\)

  • \(\varPhi _q(\mathbf {u},\varSigma )\) \((\mathbf {u}\in \mathbb {R}^q, \varSigma \in \mathbb {R}^{q\times q}, q\ge 1)\) is the c.d.f. of the centered multivariate Gaussian distribution with covariance matrix \(\varSigma \).

  • \(\varphi _{m,\sigma ^2}(.)\) is the p.d.f. of the univariate Gaussian distribution with mean \(m\) and variance \(\sigma ^2\)

  • \(\mathbf {c}_{.i}\) is the vector of \(\mathbb {R}^{q-1}\) with general term \((b_j - m_j) - (b_i - m_i)\frac{\varSigma _{ij}}{\varSigma _{ii}}\), \(j \ne i\)

  • \(\varSigma _{.i}\) is a \((q-1) \times (q-1)\) matrix obtained by computing \(\varSigma _{uv} - \frac{\varSigma _{iu}\varSigma _{iv}}{\varSigma _{ii}}\) for \(u\ne i\) and \(v\ne i\). This matrix corresponds to the conditional covariance matrix of the random vector \(\mathbf {Z}_{-i}:=(Z_1,\ldots ,Z_{i-1},Z_{i+1},\ldots ,Z_q)\) knowing \(Z_i\).

For the sake of brevity, the proof of this Proposition is deferred to the Appendix. A crucial point for the practical use of this result is that very fast procedures exist to compute the c.d.f. of the multivariate Gaussian distribution. For example, the work of Genz [15, 16] has been used in many R packages (see, e.g., [17, 18]). Formula (2) above is an important tool to efficiently compute Expression (1), as shown in the following Proposition:
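For illustration, Tallis' formula (2) can be sketched in a few lines of Python, with the \(\varPhi_q\) calls delegated to scipy's multivariate Gaussian c.d.f. (which builds on Genz's algorithms). This is a didactic sketch under the notations of Proposition 1, not the packaged implementation referenced above; all names are ours.

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

def tallis_mean(m, Sigma, b):
    """Conditional means E[Z_k | Z <= b] for all k, via Tallis' formula (2).

    m : (q,) mean vector, Sigma : (q, q) covariance, b : (q,) thresholds.
    Returns the (q,) array of conditional expectations."""
    q = len(m)
    # p = P(Z <= b) = Phi_q(b - m, Sigma)
    p = multivariate_normal.cdf(b, mean=m, cov=Sigma, allow_singular=True)
    terms = np.empty(q)
    for i in range(q):
        # phi_{m_i, Sigma_ii}(b_i)
        phi_i = norm.pdf(b[i], loc=m[i], scale=np.sqrt(Sigma[i, i]))
        mask = np.arange(q) != i
        # c_.i : (b_j - m_j) - (b_i - m_i) * Sigma_ij / Sigma_ii, j != i
        c = (b[mask] - m[mask]) - (b[i] - m[i]) * Sigma[i, mask] / Sigma[i, i]
        # Sigma_.i : conditional covariance of Z_{-i} knowing Z_i
        S = (Sigma[np.ix_(mask, mask)]
             - np.outer(Sigma[i, mask], Sigma[i, mask]) / Sigma[i, i])
        Phi = (multivariate_normal.cdf(c, mean=np.zeros(q - 1), cov=S,
                                       allow_singular=True) if q > 1 else 1.0)
        terms[i] = phi_i * Phi
    return np.array([m[k] - (Sigma[:, k] @ terms) / p for k in range(q)])
```

For \(q=1\), \(m=0\), \(\varSigma=1\), \(b=0\), this recovers the classical truncated-Gaussian mean \(-\varphi(0)/\varPhi(0)\approx -0.798\).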

Proposition 2

Let \(\mathbf {Y}:= (Y_1,\ldots ,Y_q)\) be a Gaussian Vector with mean \(\mathbf {m}\in \mathbb {R}^q\) and covariance matrix \(\varSigma \). For \(k\in \{1,\ldots ,q\}\) consider the Gaussian vectors \(\mathbf {Z}^{(k)} := (Z^{(k)}_1,\ldots ,Z^{(k)}_q)\) defined as follows:

$$\begin{aligned} Z^{(k)}_j :=&\ Y_j - Y_k \ , \ j\ne k \\ Z^{(k)}_k :=&\ -Y_k \end{aligned}$$

Denoting by \(\mathbf {m}^{(k)}\) and \(\varSigma ^{(k)}\) the mean and covariance matrix of \(\mathbf {Z}^{(k)}\), and defining the vector \(\mathbf {b}^{(k)}\in \mathbb {R}^q\) by \(b_k^{(k)} = -T\) and \(b_j^{(k)} = 0\) if \(j\ne k\), the EI of \(\mathbf {X}^{q}\) writes:

$$\begin{aligned} \boxed{ EI(\mathbf {X}^{q}) = \sum _{k=1}^q \left( (m_k - T)p_k + \sum _{i=1}^q \varSigma _{ik}^{(k)}\ \varphi _{m_i^{(k)},\varSigma _{ii}^{(k)}}(b_i^{(k)})\ \varPhi _{q-1}\left( \mathbf {c}_{.i}^{(k)},\varSigma _{.i}^{(k)}\right) \right) } \end{aligned}$$
(3)

where:

  • \(p_k := \mathbb {P}(\mathbf {Z}^{(k)} \le \mathbf {b}^{(k)}) = \varPhi _q(\mathbf {b}^{(k)} - \mathbf {m}^{(k)},\varSigma ^{(k)})\).

    \(p_k\) is actually the probability that \(Y_k\) exceeds \(T\) and \(Y_k = \max _{j=1,\ldots ,q} Y_j\).

  • \(\varPhi _q(.,\varSigma )\) and \(\varphi _{m,\sigma ^2}(.)\) are defined in Proposition 1

  • \(\mathbf {c}_{.i}^{(k)}\) is the vector of \(\mathbb {R}^{q-1}\) constructed like in Proposition 1, by computing \((b_j^{(k)} - m_j^{(k)}) - (b_i^{(k)} - m_i^{(k)}) \frac{\varSigma _{ij}^{(k)}}{\varSigma _{ii}^{(k)}}\), with \(j \ne i\)

  • \(\varSigma _{.i}^{(k)}\) is a \((q-1) \times (q-1)\) matrix constructed from \(\varSigma ^{(k)}\) like in Proposition 1. It corresponds to the conditional covariance matrix of the random vector \(\mathbf {Z}^{(k)}_{-i}:= (Z^{(k)}_1,\ldots ,Z^{(k)}_{i-1},Z^{(k)}_{i+1},\ldots ,Z^{(k)}_q)\) knowing \(Z^{(k)}_i\).

Proof 1

Using that \(\mathbb {1}_{\{\max _{i\in \{1,\ldots ,q\}}Y_i \ge T\}}=\sum _{k=1}^q \mathbb {1}_{\{ Y_k \ge T, \ Y_j \le Y_k \ \forall j \ne k\}}\), we get

$$\begin{aligned} EI(\mathbf {X}^{q}) =&\ \mathbb {E}\left[ \left( \max _{i\in \{1,\ldots ,q\}}Y_i - T \right) \sum _{k=1}^q \mathbb {1}_{\{ Y_k \ge T, \ Y_j \le Y_k \ \forall j\ne k\}} \right] \\ =&\sum _{k=1}^q \mathbb {E}\left( (Y_k - T) \mathbb {1}_{\{ Y_k \ge T, \ Y_j \le Y_k \ \forall j\ne k\}} \right) \\ =&\sum _{k=1}^q \mathbb {E}\left( Y_k - T \Big | Y_k \ge T, \ Y_j \le Y_k \ \forall j\ne k \right) \mathbb {P}\left( Y_k \ge T, Y_j \le Y_k \ \forall j\ne k \right) \\ =&\sum _{k=1}^q \left( -T - \mathbb {E}\left( Z^{(k)}_k \Big | \mathbf {Z}^{(k)} \le \mathbf {b}^{(k)} \right) \right) \mathbb {P}\left( \mathbf {Z}^{(k)} \le \mathbf {b}^{(k)} \right) \end{aligned}$$

Now the computation of \(p_k := \mathbb {P}\left( \mathbf {Z}^{(k)} \le \mathbf {b}^{(k)} \right) \) simply requires one call to the \(\varPhi _q\) function, and the proof can be completed by applying Tallis’ formula (2) to the random vectors \(\mathbf {Z}^{(k)}\) \((1\le k\le q)\).
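Proposition 2 translates directly into code. The illustrative Python sketch below builds, for each \(k\), the vector \(\mathbf{Z}^{(k)}\) as a linear map of \(\mathbf{Y}\) and evaluates Eq. (3), delegating the Gaussian c.d.f. calls to scipy. It is a didactic sketch rather than a reference implementation; all names are ours.

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

def qEI(m, Sigma, T):
    """Closed-form q-EI of a batch via Eq. (3): m, Sigma are the kriging
    mean (q,) and covariance (q, q) at the batch, T the current maximum."""
    m, Sigma = np.asarray(m, float), np.asarray(Sigma, float)
    q = len(m)
    total = 0.0
    for k in range(q):
        # Linear map Y -> Z^(k): Z_j = Y_j - Y_k for j != k, Z_k = -Y_k
        A = np.eye(q)
        A[:, k] = -1.0
        mk, Sk = A @ m, A @ Sigma @ A.T
        b = np.zeros(q)
        b[k] = -T
        # p_k = P(Z^(k) <= b^(k)): one call to Phi_q
        pk = multivariate_normal.cdf(b - mk, mean=np.zeros(q), cov=Sk,
                                     allow_singular=True)
        acc = (m[k] - T) * pk
        for i in range(q):  # q calls to Phi_{q-1} for each k
            phi = norm.pdf(b[i], loc=mk[i], scale=np.sqrt(Sk[i, i]))
            mask = np.arange(q) != i
            c = (b[mask] - mk[mask]) - (b[i] - mk[i]) * Sk[i, mask] / Sk[i, i]
            S = (Sk[np.ix_(mask, mask)]
                 - np.outer(Sk[i, mask], Sk[i, mask]) / Sk[i, i])
            Phi = (multivariate_normal.cdf(c, mean=np.zeros(q - 1), cov=S,
                                           allow_singular=True) if q > 1 else 1.0)
            acc += Sk[i, k] * phi * Phi
        total += acc
    return total
```

Consistently with Remark 2, for \(q=1\), \(m=T=0\) and unit variance this returns the classical EI value \(\varphi(0)\approx 0.399\).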

Remark 1

From Propositions 1 and 2, it appears that computing \(q\)-EI requires a total of \(q\) calls to \(\varPhi _q\) and \(q^2\) calls to \(\varPhi _{q-1}\). The proposed approach thus performs well when \(q\) is moderate (typically lower than \(10\)). For higher values of \(q\), estimating \(q\)-EI by Monte Carlo might remain competitive. Note that, when \(q\) is larger (say, \(q = 50\)) and \(q\) CPUs are available, one can always distribute the \(q^2\) calls to \(\varPhi _{q-1}\) over these \(q\) CPUs.

Remark 2

In the particular case \(q=1\) and with the convention \(\varPhi _0(.,\varSigma )=1\), Eq. (3) corresponds to the classical EI formula proven in [1, 5].

Remark 3

The Multi-points EI can be used in a batch-sequential strategy to optimize a given expensive-to-evaluate function \(f\), as detailed in the next Section. Moreover, a similar criterion can be used to perform optimization based on a Kriging model with linear constraints, such as the one developed by Da Veiga and Marrel [14]. For example, expressions like \(\mathbb {E}\left[ \left( \max _{i\in \{1,\ldots ,q\}} Y_i - T \right) _+ | \mathbf {Y}\le \mathbf {a}\right] \), \(\mathbf {a}\in \mathbb {R}^q\), can be computed using Tallis’ formula and the same proof.

3 Batch Sequential Optimization Using Multi-Points EI

Let us first illustrate Proposition 2 and show that the proposed \(q\)-EI calculation based on Tallis’ formula is actually consistent with a Monte Carlo estimation. From a kriging model based on \(12\) observations of the Branin-Hoo function [1], we generated a \(4\)-point batch (Fig. 1, left plot) and calculated its \(q\)-EI value (middle plot, dotted line). The MC estimates converge to a value close to the latter, with a relative error of less than \(10^{-5}\) after \(5\times 10^{9}\) runs. \(4\)-point batches generated by the three strategies detailed below are drawn on the right plot.

Fig. 1.

Convergence (middle) of MC estimates to the q-EI value calculated with Proposition 2 in the case of a batch of four points (shown on the left plot). Right: candidate batches obtained by q-EI stepwise maximisation (squares), and the CL-min (circles) and CL-max (triangles) strategies.
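Such an MC sanity check is easy to reproduce: draw Gaussian samples with the kriging mean and covariance of the batch, and average the positive improvements. The sketch below is illustrative (function name, sample size, and seed are arbitrary choices on our part):

```python
import numpy as np

def qEI_mc(m, Sigma, T, n=200_000, seed=0):
    """Monte Carlo estimate of q-EI: average of (max_i Y_i - T)_+ over
    draws Y ~ N(m, Sigma). Intended only as a check of the closed-form
    value of Proposition 2, not as a production estimator."""
    rng = np.random.default_rng(seed)
    Y = rng.multivariate_normal(np.asarray(m, float), np.asarray(Sigma, float),
                                size=n)
    return np.maximum(Y.max(axis=1) - T, 0.0).mean()
```

For two independent standard Gaussians and \(T=0\), the estimate should approach \(\mathbb{E}[(\max(Y_1,Y_2))_+]\approx 0.681\), which matches the closed-form \(q\)-EI in that toy case.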

We now compare a few kriging-based batch-sequential optimization methods on two different functions: the function \(x\mapsto -\log (-\text {Hartman6}(x))\) (see, e.g., [1]), defined on \([0,1]^6\), and the Rastrigin function [19, 20] in dimension two, restricted to the domain \([0,2.5]^2\). The first function, in dimension 6, is unimodal, while the second has many local optima (see Fig. 2). The Rastrigin function is one of the \(24\) noiseless test functions of the Black-Box Optimization Benchmark (BBOB) [19].

Fig. 2.

Contour lines of the Rastrigin function (grayscale) and location of the global optimizer (black triangle)

For each run, we start with a random initial Latin hypercube design (LHS) of \(n_0 = 10\) (Rastrigin) or \(50\) (Hartman6) points and estimate the covariance parameters by Maximum Likelihood (here a Matérn kernel with \(\nu =3/2\) is chosen). For both functions and all strategies, batches of \(q=6\) points are added at each iteration, and the covariance parameters are re-estimated after each batch assimilation. Since the tests are run from several initial designs of experiments, we represent, along the runs, the relative mean squared error:

$$\begin{aligned} \text {rMSE} = \frac{1}{M} \sum _{i=1}^{M} \left( \frac{y^{(i)}_\text {min} - y_\text {opt}}{y_\text {opt}} \right) ^2 \end{aligned}$$
(4)

where \(y^{(i)}_\text {min}\) is the current observed minimum in run number \(i\) and \(y_\text {opt}\) is the true (unknown to the algorithms) optimum. The total number \(M\) of different initial designs of experiments is fixed to \(50\). The tested strategies are:

  • (1) \(q\)-EI stepwise maximization: \(q\) sequential \(d\)-dimensional optimizations are performed. We start with the maximization of the \(1\)-point EI and add this point to the new batch. We then maximize the \(2\)-point EI (keeping the first point obtained as first argument), add the maximizer to the batch, and iterate until \(q\) points are selected.

  • (2) Constant Liar min (CL-min): We start with the maximization of the \(1\)-point EI and add this point to the new batch. We then assume a dummy response (a“lie”) at this point, and update the Kriging metamodel with this point and the lie. We then maximize the \(1\)-point EI obtained with the updated kriging metamodel, get a second point, and iterate the same process until a batch of \(q\) points is selected. The dummy response has the same value over the \(q-1\) lies, and is here fixed to the minimum of the current observations.

  • (3) Constant Liar max (CL-max): The lie in this Constant Liar strategy is fixed to the maximum of the current observations.

  • (4) Constant Liar mix (CL-mix): At each iteration, two batches are generated with the CL-min and CL-max strategies. From these two “candidate” batches, we choose the batch with the best actual \(q\)-EI value, calculated based on Proposition 2.

  • (5) Random sampling.
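As an illustration of strategy (2), one Constant Liar iteration can be sketched as follows, here with a toy one-dimensional simple-kriging model and a grid-based EI maximization standing in for DiceKriging/DiceOptim and rgenoud. The kernel, its hyperparameters, and all names are illustrative assumptions, not the settings used in the experiments above.

```python
import numpy as np
from scipy.stats import norm

def krig(X, y, x, ls=0.3, var=1.0, nug=1e-9):
    """Toy 1-D simple kriging (zero prior mean, Gaussian kernel): returns
    the kriging mean and standard deviation at the points x."""
    k = lambda A, B: var * np.exp(-((A[:, None] - B[None, :]) ** 2) / (2 * ls**2))
    K = k(X, X) + nug * np.eye(len(X))
    kx = k(x, X)
    mu = kx @ np.linalg.solve(K, y)
    s2 = var - np.einsum('ij,ij->i', kx, np.linalg.solve(K, kx.T).T)
    return mu, np.sqrt(np.maximum(s2, 0.0))

def constant_liar(X, y, lie, q, grid):
    """One CL iteration (minimization): greedily pick q points maximizing
    the 1-point EI over `grid`, updating the model after each pick with
    the constant dummy response `lie`."""
    X, y = X.copy(), y.copy()
    batch = []
    for _ in range(q):
        mu, s = krig(X, y, grid)
        u = (y.min() - mu) / np.where(s > 0, s, 1.0)
        # classical 1-point EI for minimization
        ei = np.where(s > 0, (y.min() - mu) * norm.cdf(u) + s * norm.pdf(u), 0.0)
        xstar = grid[np.argmax(ei)]
        batch.append(xstar)
        X, y = np.append(X, xstar), np.append(y, lie)  # assimilate the lie
    return np.array(batch)
```

Calling `constant_liar` with `lie=y.min()` gives CL-min and with `lie=y.max()` CL-max; the CL-mix heuristic of strategy (4) would then score both candidate batches with the closed-form \(q\)-EI of Proposition 2 and keep the better one.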

Note that CL-min tends to explore the function near the current minimizer (as the lie is a low value and we are minimizing \(f\)) while CL-max is more exploratory. Thus, CL-min is expected to perform well on unimodal functions. On the contrary, CL-max may perform better on multimodal functions. For all the tests we use the DiceKriging and DiceOptim packages [4]. The optimizations of the different criteria rely on a genetic algorithm using derivatives, available in the rgenoud package [21]. Figure 3 represents the compared performances of these strategies.

Fig. 3.

Compared performances of the five considered batch-sequential optimization strategies, on two test functions.

From these plots we draw the following conclusions: first, the \(q\)-EI stepwise maximization strategy outperforms the strategies based on constant lies, CL-min and CL-max. However, the left graph of Fig. 3 points out that the CL-min strategy seems particularly well-adapted to the Hartman6 function. Since running a CL strategy is computationally much cheaper than a brute-force optimization of \(q\)-EI, it is tempting to recommend CL-min for Hartman6. However, it is not straightforward to know in advance which of CL-min or CL-max will perform better on a given test case: for example, CL-max outperforms CL-min on the Rastrigin function.

Now, we observe that using \(q\)-EI in the CL-mix heuristic enables very good performances in both cases without having to select one of the two lie values in advance. For the Hartman6 function, CL-mix even outperforms both CL-min and CL-max and has roughly the same performance as a brute force \(q\)-EI maximization. This suggests that a good heuristic might be to generate, at each iteration, candidate batches obtained with different strategies (e.g. CL with different lies) and to discriminate those batches using \(q\)-EI.

4 Conclusion

In this article we give a closed-form expression enabling a fast computation of the Multi-points Expected Improvement criterion for batch-sequential Bayesian global optimization. This formula is consistent with the classical Expected Improvement formula, and its computation does not require Monte Carlo simulations. Optimization strategies based on this criterion are now ready to be used on real test cases, and a brute-force maximization of this criterion shows promising results. In addition, we show that good performance can be achieved by generating candidate batches with a cheap-to-compute criterion and discriminating among those batches with \(q\)-EI. Such heuristics may be particularly interesting when the time needed to generate batches becomes a computational bottleneck, e.g. when \(q \ge 10\) and calls to the Gaussian c.d.f. become expensive.

A perspective, currently under study, is to improve the maximization of \(q\)-EI itself, e.g. through a more adapted choice of the algorithm and/or an analytical calculation of \(q\)-EI’s gradient.