
1 Introduction

In recent decades, metamodeling (or surrogate modeling) has been increasingly used for problems involving costly computer codes (or “black-box simulators”). Practitioners typically have a very limited evaluation budget and aim to select evaluation points cautiously when attempting to solve a given problem.

In global optimization, the focus is usually put on a real-valued function \(f\) with \(d\)-dimensional source space. In this setting, Jones et al. [1] proposed the now-famous Efficient Global Optimization (EGO) algorithm, relying on a kriging metamodel [2] and on the Expected Improvement (EI) criterion [3]. In EGO, the optimization is done by sequentially evaluating \(f\) at points maximizing EI. A crucial advantage of this criterion is its fast computation (besides, the analytical gradient of EI is implemented in [4]), so that the hard optimization problem is replaced by a series of much simpler ones.

Coming back to the decision-theoretic roots of EI [5], a Multi-points Expected Improvement (also called “\(q\)-EI”) criterion for batch-sequential optimization was defined in [6] and further developed in [7, 8]. Maximizing this criterion enables choosing batches of \(q > 1\) points at which to evaluate \(f\) in parallel, and is of particular interest in the frequent case where several CPUs are simultaneously available. Even though an analytical formula was derived for the \(2\)-EI in [7], the Monte Carlo (MC) approach of [8] for computing \(q\)-EI when \(q \ge 3\) makes the criterion itself expensive-to-evaluate, and particularly hard to optimize.

Much effort has recently been devoted to addressing this problem. The pragmatic approach proposed by Ginsbourger and Le Riche [8] consists in circumventing a direct \(q\)-EI maximization and replacing it by simpler strategies where batches are obtained using an offline \(q\)-points EGO. In such strategies, the model updates are done using dummy response values such as the kriging mean prediction (Kriging Believer) or a constant (Constant Liar), and the covariance parameters are re-estimated only when real data is assimilated. In [9] and [10], \(q\)-EI optimization strategies were proposed relying on the MC approach, where the number of MC samples is tuned online to discriminate between candidate designs. Finally, Frazier [11] proposed a \(q\)-EI optimization strategy involving a stochastic gradient, with the crucial advantage of not requiring the evaluation of \(q\)-EI itself.

In this article we derive a formula allowing a fast and accurate approximate evaluation of \(q\)-EI. This formula may significantly speed up strategies relying on \(q\)-EI. The main result, relying on Tallis’ formula, is given in Sect. 2. The usability of the proposed formula is then illustrated in Sect. 3 through benchmark experiments, where a brute-force maximization of \(q\)-EI is compared to three variants of the Constant Liar strategy. In particular, a new variant (CL-mix) is introduced and shown to offer very good performance at a competitive computational cost. For self-containedness, a slightly revisited proof of Tallis’ formula is given in the Appendix.

2 Multi-Points Expected Improvement Explicit Formulas

In this section we give an explicit formula allowing a fast and accurate deterministic approximation of \(q\)-EI. Let us first make the mathematical setting precise. Throughout the paper, \(f\) is assumed to be one realisation of a Gaussian Process (GP) with known covariance kernel and with mean known up to some linear trend coefficients, so that the conditional distribution of a vector of values of the GP, conditional on past observations, is still Gaussian (an improper uniform prior is put on the trend coefficients when applicable). This being said, most forthcoming derivations boil down to calculations on Gaussian vectors. Let \(\mathbf {Y} := (Y_{1},\ldots ,Y_{q})\) be a Gaussian vector with mean \(\mathbf {m}\in \mathbb {R}^{q}\) and covariance matrix \(\varSigma \). Our aim in this paper is to explicitly calculate expressions of the following kind:

$$\begin{aligned} \mathbb {E}\left[ \left( \max _{i\in \{1,\ldots ,q\}} Y_i - T \right) _+ \right] \end{aligned}$$
(1)

where \((.)_+ := \max ( . , 0)\). In Bayesian optimization (say maximization), expectations and probabilities are taken conditional on response values at a given set of \(n\) points \((\mathbf {x}_{1},\ldots ,\mathbf {x}_{n})\in \mathbb {X}^n\), where \(\mathbb {X}\) is the input set of \(f\) (often a compact subset of \(\mathbb {R}^d\), \(d\ge 1\)), the threshold \(T\in \mathbb {R}\) is usually the maximum of those \(n\) available response values, and \(\mathbf {Y}\) is the vector of unknown responses at a given batch of \(q\) points, \(\mathbf {X}^{q}:=(\mathbf {x}_{n+1},\ldots ,\mathbf {x}_{n+q}) \in \mathbb {X}^{q}\). In such a framework, the vector \(\mathbf {m}\) and the matrix \(\varSigma \) are the so-called “Kriging mean” and “Kriging covariance” at \(\mathbf {X}^{q}\) and can be calculated relying on classical Kriging equations (see, e.g., [12]).
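As an illustration, the kriging mean and covariance at a batch follow from standard GP conditioning. The Python sketch below implements the zero-mean simple-kriging case only; the paper's setting with unknown linear trend coefficients (universal kriging, see [12]) would add trend-related correction terms. Function and variable names here are illustrative, not from any referenced package.

```python
import numpy as np

def kriging_batch(Xobs, y, Xnew, kern):
    """Kriging mean m and covariance Sigma of a zero-mean GP at a batch Xnew,
    given observations (Xobs, y). `kern(A, B)` is any covariance function
    returning the matrix of covariances between point sets A and B."""
    K = kern(Xobs, Xobs) + 1e-10 * np.eye(len(Xobs))  # small jitter for stability
    Ks = kern(Xobs, Xnew)                # cross-covariances, shape (n, q)
    sol = np.linalg.solve(K, Ks)
    m = sol.T @ y                        # kriging mean at the q batch points
    Sigma = kern(Xnew, Xnew) - Ks.T @ sol  # kriging covariance at the batch
    return m, Sigma
```

At an already-observed point, the kriging mean interpolates the data and the kriging variance vanishes (up to the jitter), which provides a quick sanity check of such an implementation.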

In order to obtain a tractable analytical formula for Expression (1), not requiring any Monte-Carlo simulation, let us first give a useful formula obtained by [13], and recently used in [14] for GP modeling with inequality constraints:

Proposition 1

(Tallis’ formulas). Let \(\mathbf {Z}:=(Z_1,\ldots ,Z_q)\) be a Gaussian Vector with mean \(\mathbf {m}\in \mathbb {R}^q\) and covariance matrix \(\varSigma \in \mathbb {R}^{q\times q}\). Let \(\mathbf {b}= (b_1,\ldots ,b_q)\in \mathbb {R}^q\). The expectation of any coordinate \(Z_k\) under the linear constraint \((\forall j\in \{1,\ldots ,q\},\ Z_j\le b_j)\) denoted by \(\mathbf {Z}\le \mathbf {b}\) can be expanded as follows:

$$\begin{aligned} \mathbb {E}(Z_k | \mathbf {Z}\le \mathbf {b}) = m_k -\frac{1}{p} \sum _{i=1}^q \varSigma _{ik}\ \varphi _{{m_i,\varSigma _{ii}}}(b_i)\ \varPhi _{q-1}\left( \mathbf {c}_{.i},\varSigma _{.i}\right) \end{aligned}$$
(2)

where:

  • \(p:= \mathbb {P}(\mathbf {Z}\le \mathbf {b}) = \varPhi _q(\mathbf {b}-\mathbf {m},\varSigma )\)

  • \(\varPhi _q(\mathbf {u},\varSigma )\) \((\mathbf {u}\in \mathbb {R}^q, \varSigma \in \mathbb {R}^{q\times q}, q\ge 1)\) is the c.d.f. of the centered multivariate Gaussian distribution with covariance matrix \(\varSigma \).

  • \(\varphi _{m,\sigma ^2}(.)\) is the p.d.f. of the univariate Gaussian distribution with mean \(m\) and variance \(\sigma ^2\)

  • \(\mathbf {c}_{.i}\) is the vector of \(\mathbb {R}^{q-1}\) with general term \((b_j - m_j) - (b_i - m_i)\frac{\varSigma _{ij}}{\varSigma _{ii}}\), \(j \ne i\)

  • \(\varSigma _{.i}\) is a \((q-1) \times (q-1)\) matrix obtained by computing \(\varSigma _{uv} - \frac{\varSigma _{iu}\varSigma _{iv}}{\varSigma _{ii}}\) for \(u\ne i\) and \(v\ne i\). This matrix corresponds to the conditional covariance matrix of the random vector \(\mathbf {Z}_{-i}:=(Z_1,\ldots ,Z_{i-1},Z_{i+1},\ldots ,Z_q)\) knowing \(Z_i\).

For the sake of brevity, the proof of this Proposition is deferred to the Appendix. A crucial point for the practical use of this result is that very fast procedures exist to compute the c.d.f. of the multivariate Gaussian distribution. For example, the work of Genz [15, 16] has been used in many R packages (see, e.g., [17, 18]). Formula (2) above is an important tool to efficiently compute Expression (1), as shown in the following Proposition:
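For illustration, Tallis' formula (2) can be sketched in a few lines of Python, with the \(\varPhi_q\) calls delegated to scipy's multivariate Gaussian c.d.f. (which builds on Genz's algorithms). This is a didactic sketch under the notations of Proposition 1, not the packaged implementation referenced above; all names are ours.

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

def tallis_mean(m, Sigma, b):
    """Conditional means E[Z_k | Z <= b] for all k, via Tallis' formula (2).

    m : (q,) mean vector, Sigma : (q, q) covariance, b : (q,) thresholds.
    Returns the (q,) array of conditional expectations."""
    q = len(m)
    # p = P(Z <= b) = Phi_q(b - m, Sigma)
    p = multivariate_normal.cdf(b, mean=m, cov=Sigma, allow_singular=True)
    terms = np.empty(q)
    for i in range(q):
        # phi_{m_i, Sigma_ii}(b_i)
        phi_i = norm.pdf(b[i], loc=m[i], scale=np.sqrt(Sigma[i, i]))
        mask = np.arange(q) != i
        # c_.i : (b_j - m_j) - (b_i - m_i) * Sigma_ij / Sigma_ii, j != i
        c = (b[mask] - m[mask]) - (b[i] - m[i]) * Sigma[i, mask] / Sigma[i, i]
        # Sigma_.i : conditional covariance of Z_{-i} knowing Z_i
        S = (Sigma[np.ix_(mask, mask)]
             - np.outer(Sigma[i, mask], Sigma[i, mask]) / Sigma[i, i])
        Phi = (multivariate_normal.cdf(c, mean=np.zeros(q - 1), cov=S,
                                       allow_singular=True) if q > 1 else 1.0)
        terms[i] = phi_i * Phi
    return np.array([m[k] - (Sigma[:, k] @ terms) / p for k in range(q)])
```

For \(q=1\), \(m=0\), \(\varSigma=1\), \(b=0\), this recovers the classical truncated-Gaussian mean \(-\varphi(0)/\varPhi(0)\approx -0.798\).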

Proposition 2

Let \(\mathbf {Y}:= (Y_1,\ldots ,Y_q)\) be a Gaussian Vector with mean \(\mathbf {m}\in \mathbb {R}^q\) and covariance matrix \(\varSigma \). For \(k\in \{1,\ldots ,q\}\) consider the Gaussian vectors \(\mathbf {Z}^{(k)} := (Z^{(k)}_1,\ldots ,Z^{(k)}_q)\) defined as follows:

$$\begin{aligned} Z^{(k)}_j :=&\ Y_j - Y_k \ , \ j\ne k \\ Z^{(k)}_k :=&\ -Y_k \end{aligned}$$

Denoting by \(\mathbf {m}^{(k)}\) and \(\varSigma ^{(k)}\) the mean and covariance matrix of \(\mathbf {Z}^{(k)}\), and defining the vector \(\mathbf {b}^{(k)}\in \mathbb {R}^q\) by \(b_k^{(k)} = -T\) and \(b_j^{(k)} = 0\) if \(j\ne k\), the EI of \(\mathbf {X}^{q}\) writes:

$$\begin{aligned} \boxed{ EI(\mathbf {X}^{q}) = \sum _{k=1}^q \left( (m_k - T)p_k + \sum _{i=1}^q \varSigma _{ik}^{(k)}\ \varphi _{m_i^{(k)},\varSigma _{ii}^{(k)}}(b_i^{(k)})\ \varPhi _{q-1}\left( \mathbf {c}_{.i}^{(k)},\varSigma _{.i}^{(k)}\right) \right) } \end{aligned}$$
(3)

where:

  • \(p_k := \mathbb {P}(\mathbf {Z}^{(k)} \le \mathbf {b}^{(k)}) = \varPhi _q(\mathbf {b}^{(k)} - \mathbf {m}^{(k)},\varSigma ^{(k)})\).

    \(p_k\) is actually the probability that \(Y_k\) exceeds \(T\) and \(Y_k = \max _{j=1,\ldots ,q} Y_j\).

  • \(\varPhi _q(.,\varSigma )\) and \(\varphi _{m,\sigma ^2}(.)\) are defined in Proposition 1

  • \(\mathbf {c}_{.i}^{(k)}\) is the vector of \(\mathbb {R}^{q-1}\) constructed like in Proposition 1, by computing \((b_j^{(k)} - m_j^{(k)}) - (b_i^{(k)} - m_i^{(k)}) \frac{\varSigma _{ij}^{(k)}}{\varSigma _{ii}^{(k)}}\), with \(j \ne i\)

  • \(\varSigma _{.i}^{(k)}\) is a \((q-1) \times (q-1)\) matrix constructed from \(\varSigma ^{(k)}\) like in Proposition 1. It corresponds to the conditional covariance matrix of the random vector \(\mathbf {Z}^{(k)}_{-i}:= (Z^{(k)}_1,\ldots ,Z^{(k)}_{i-1},Z^{(k)}_{i+1},\ldots ,Z^{(k)}_q)\) knowing \(Z^{(k)}_i\).

Proof 1

Using that \(\mathbb {1}_{\{\max _{i\in \{1,\ldots ,q\}}Y_i \ge T\}}=\sum _{k=1}^q \mathbb {1}_{\{ Y_k \ge T, \ Y_j \le Y_k \ \forall j \ne k\}}\), we get

$$\begin{aligned} EI(\mathbf {X}^{q}) =&\ \mathbb {E}\left[ \left( \max _{i\in \{1,\ldots ,q\}}Y_i - T \right) \sum _{k=1}^q \mathbb {1}_{\{ Y_k \ge T, \ Y_j \le Y_k \ \forall j\ne k\}} \right] \\ =&\sum _{k=1}^q \mathbb {E}\left( (Y_k - T) \mathbb {1}_{\{ Y_k \ge T, \ Y_j \le Y_k \ \forall j\ne k\}} \right) \\ =&\sum _{k=1}^q \mathbb {E}\left( Y_k - T \Big | Y_k \ge T, \ Y_j \le Y_k \ \forall j\ne k \right) \mathbb {P}\left( Y_k \ge T, Y_j \le Y_k \ \forall j\ne k \right) \\ =&\sum _{k=1}^q \left( -T - \mathbb {E}\left( Z^{(k)}_k \Big | \mathbf {Z}^{(k)} \le \mathbf {b}^{(k)} \right) \right) \mathbb {P}\left( \mathbf {Z}^{(k)} \le \mathbf {b}^{(k)} \right) \end{aligned}$$

Now the computation of \(p_k := \mathbb {P}\left( \mathbf {Z}^{(k)} \le \mathbf {b}^{(k)} \right) \) simply requires one call to the \(\varPhi _q\) function, and the proof can be completed by applying Tallis’ formula (2) to the random vectors \(\mathbf {Z}^{(k)}\) \((1\le k\le q)\).
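Proposition 2 translates directly into code. The illustrative Python sketch below builds, for each \(k\), the vector \(\mathbf{Z}^{(k)}\) as a linear map of \(\mathbf{Y}\) and evaluates Eq. (3), delegating the Gaussian c.d.f. calls to scipy. It is a didactic sketch rather than a reference implementation; all names are ours.

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

def qEI(m, Sigma, T):
    """Closed-form q-EI of a batch via Eq. (3): m, Sigma are the kriging
    mean (q,) and covariance (q, q) at the batch, T the current maximum."""
    m, Sigma = np.asarray(m, float), np.asarray(Sigma, float)
    q = len(m)
    total = 0.0
    for k in range(q):
        # Linear map Y -> Z^(k): Z_j = Y_j - Y_k for j != k, Z_k = -Y_k
        A = np.eye(q)
        A[:, k] = -1.0
        mk, Sk = A @ m, A @ Sigma @ A.T
        b = np.zeros(q)
        b[k] = -T
        # p_k = P(Z^(k) <= b^(k)): one call to Phi_q
        pk = multivariate_normal.cdf(b - mk, mean=np.zeros(q), cov=Sk,
                                     allow_singular=True)
        acc = (m[k] - T) * pk
        for i in range(q):  # q calls to Phi_{q-1} for each k
            phi = norm.pdf(b[i], loc=mk[i], scale=np.sqrt(Sk[i, i]))
            mask = np.arange(q) != i
            c = (b[mask] - mk[mask]) - (b[i] - mk[i]) * Sk[i, mask] / Sk[i, i]
            S = (Sk[np.ix_(mask, mask)]
                 - np.outer(Sk[i, mask], Sk[i, mask]) / Sk[i, i])
            Phi = (multivariate_normal.cdf(c, mean=np.zeros(q - 1), cov=S,
                                           allow_singular=True) if q > 1 else 1.0)
            acc += Sk[i, k] * phi * Phi
        total += acc
    return total
```

Consistently with Remark 2, for \(q=1\), \(m=T=0\) and unit variance this returns the classical EI value \(\varphi(0)\approx 0.399\).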

Remark 1

From Propositions 1 and 2, it appears that computing \(q\)-EI requires a total of \(q\) calls to \(\varPhi _q\) and \(q^2\) calls to \(\varPhi _{q-1}\). The proposed approach thus performs well when \(q\) is moderate (typically lower than \(10\)). For higher values of \(q\), estimating \(q\)-EI by Monte Carlo might remain competitive. Note that, when \(q\) is larger (say, \(q = 50\)) and \(q\) CPUs are available, one can always distribute the \(q^2\) calls to \(\varPhi _{q-1}\) over these \(q\) CPUs.

Remark 2

In the particular case \(q=1\) and with the convention \(\varPhi _0(.,\varSigma )=1\), Eq. (3) corresponds to the classical EI formula proven in [1, 5].

Remark 3

The Multi-points EI can be used in a batch-sequential strategy to optimize a given expensive-to-evaluate function \(f\), as detailed in the next Section. Moreover, a similar criterion can be used to perform optimization based on a Kriging model with linear constraints, such as the one developed by Da Veiga and Marrel [14]. For example, expressions like \(\mathbb {E}\left[ \left( \max _{i\in \{1,\ldots ,q\}} Y_i - T \right) _+ | \mathbf {Y}\le \mathbf {a}\right] \), \(\mathbf {a}\in \mathbb {R}^q\), can be computed using Tallis’ formula and the same proof.

3 Batch Sequential Optimization Using Multi-Points EI

Let us first illustrate Proposition 2 and show that the proposed \(q\)-EI calculation based on Tallis’ formula is actually consistent with a Monte Carlo estimation. From a kriging model based on \(12\) observations of the Branin-Hoo function [1], we generated a \(4\)-point batch (Fig. 1, left plot) and calculated its \(q\)-EI value (middle plot, dotted line). The MC estimates converge to a value close to the latter, with a relative error of less than \(10^{-5}\) after \(5\times 10^{9}\) runs. \(4\)-point batches generated by the three strategies detailed below are drawn on the right plot.

Fig. 1.

Convergence (middle) of MC estimates to the q-EI value calculated with Proposition 2 in the case of a batch of four points (shown on the left plot). Right: candidate batches obtained by q-EI stepwise maximisation (squares), and the CL-min (circles) and CL-max (triangles) strategies.
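Such an MC sanity check is easy to reproduce: draw Gaussian samples with the kriging mean and covariance of the batch, and average the positive improvements. The sketch below is illustrative (function name, sample size, and seed are arbitrary choices on our part):

```python
import numpy as np

def qEI_mc(m, Sigma, T, n=200_000, seed=0):
    """Monte Carlo estimate of q-EI: average of (max_i Y_i - T)_+ over
    draws Y ~ N(m, Sigma). Intended only as a check of the closed-form
    value of Proposition 2, not as a production estimator."""
    rng = np.random.default_rng(seed)
    Y = rng.multivariate_normal(np.asarray(m, float), np.asarray(Sigma, float),
                                size=n)
    return np.maximum(Y.max(axis=1) - T, 0.0).mean()
```

For two independent standard Gaussians and \(T=0\), the estimate should approach \(\mathbb{E}[(\max(Y_1,Y_2))_+]\approx 0.681\), which matches the closed-form \(q\)-EI in that toy case.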

We now compare a few kriging-based batch-sequential optimization methods on two different functions: the function \(x\mapsto -\log (-\text {Hartman6}(x))\) (see, e.g., [1]), defined on \([0,1]^6\), and the Rastrigin function [19, 20] in dimension two, restricted to the domain \([0,2.5]^2\). The first function, in dimension 6, is unimodal, while the second has many local optima (see Fig. 2). The Rastrigin function is one of the \(24\) noiseless test functions of the Black-Box Optimization Benchmark (BBOB) [19].

Fig. 2.

Contour lines of the Rastrigin function (grayscale) and location of the global optimizer (black triangle)

For each run, we start with a random initial Latin hypercube design (LHS) of \(n_0 = 10\) (Rastrigin) or \(50\) (Hartman6) points and estimate the covariance parameters by Maximum Likelihood (here a Matérn kernel with \(\nu =3/2\) is chosen). For both functions and all strategies, batches of \(q=6\) points are added at each iteration, and the covariance parameters are re-estimated after each batch assimilation. Since the tests are run from several initial designs of experiments, we represent, along the runs, the relative mean squared error:

$$\begin{aligned} \text {rMSE} = \frac{1}{M} \sum _{i=1}^{M} \left( \frac{y^{(i)}_\text {min} - y_\text {opt}}{y_\text {opt}} \right) ^2 \end{aligned}$$
(4)

where \(y^{(i)}_\text {min}\) is the current observed minimum in run number \(i\) and \(y_\text {opt}\) is the true (unknown to the algorithms) optimum. The total number \(M\) of different initial designs of experiments is fixed to \(50\). The tested strategies are:

  • (1) \(q\)-EI stepwise maximization: \(q\) sequential \(d\)-dimensional optimizations are performed. We start with the maximization of the \(1\)-point EI and add this point to the new batch. We then maximize the \(2\)-point EI (keeping the first point obtained as first argument), add the maximizer to the batch, and iterate until \(q\) points are selected.

  • (2) Constant Liar min (CL-min): We start with the maximization of the \(1\)-point EI and add this point to the new batch. We then assume a dummy response (a“lie”) at this point, and update the Kriging metamodel with this point and the lie. We then maximize the \(1\)-point EI obtained with the updated kriging metamodel, get a second point, and iterate the same process until a batch of \(q\) points is selected. The dummy response has the same value over the \(q-1\) lies, and is here fixed to the minimum of the current observations.

  • (3) Constant Liar max (CL-max): The lie in this Constant Liar strategy is fixed to the maximum of the current observations.

  • (4) Constant Liar mix (CL-mix): At each iteration, two batches are generated with the CL-min and CL-max strategies. From these two “candidate” batches, we choose the batch with the best actual \(q\)-EI value, calculated based on Proposition 2.

  • (5) Random sampling.
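As an illustration of strategy (2), one Constant Liar iteration can be sketched as follows, here with a toy one-dimensional simple-kriging model and a grid-based EI maximization standing in for DiceKriging/DiceOptim and rgenoud. The kernel, its hyperparameters, and all names are illustrative assumptions, not the settings used in the experiments above.

```python
import numpy as np
from scipy.stats import norm

def krig(X, y, x, ls=0.3, var=1.0, nug=1e-9):
    """Toy 1-D simple kriging (zero prior mean, Gaussian kernel): returns
    the kriging mean and standard deviation at the points x."""
    k = lambda A, B: var * np.exp(-((A[:, None] - B[None, :]) ** 2) / (2 * ls**2))
    K = k(X, X) + nug * np.eye(len(X))
    kx = k(x, X)
    mu = kx @ np.linalg.solve(K, y)
    s2 = var - np.einsum('ij,ij->i', kx, np.linalg.solve(K, kx.T).T)
    return mu, np.sqrt(np.maximum(s2, 0.0))

def constant_liar(X, y, lie, q, grid):
    """One CL iteration (minimization): greedily pick q points maximizing
    the 1-point EI over `grid`, updating the model after each pick with
    the constant dummy response `lie`."""
    X, y = X.copy(), y.copy()
    batch = []
    for _ in range(q):
        mu, s = krig(X, y, grid)
        u = (y.min() - mu) / np.where(s > 0, s, 1.0)
        # classical 1-point EI for minimization
        ei = np.where(s > 0, (y.min() - mu) * norm.cdf(u) + s * norm.pdf(u), 0.0)
        xstar = grid[np.argmax(ei)]
        batch.append(xstar)
        X, y = np.append(X, xstar), np.append(y, lie)  # assimilate the lie
    return np.array(batch)
```

Calling `constant_liar` with `lie=y.min()` gives CL-min and with `lie=y.max()` CL-max; the CL-mix heuristic of strategy (4) would then score both candidate batches with the closed-form \(q\)-EI of Proposition 2 and keep the better one.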

Note that CL-min tends to explore the function near the current minimizer (as the lie is a low value and we are minimizing \(f\)) while CL-max is more exploratory. Thus, CL-min is expected to perform well on unimodal functions. On the contrary, CL-max may perform better on multimodal functions. For all the tests we use the DiceKriging and DiceOptim packages [4]. The optimizations of the different criteria rely on a genetic algorithm using derivatives, available in the rgenoud package [21]. Figure 3 represents the compared performances of these strategies.

Fig. 3.

Compared performances of the five considered batch-sequential optimization strategies, on two test functions.

From these plots we draw the following conclusions: first, the \(q\)-EI stepwise maximization strategy outperforms the strategies based on constant lies, CL-min and CL-max. However, the left graph of Fig. 3 points out that the CL-min strategy seems particularly well-adapted to the Hartman6 function. Since running a CL strategy is computationally much cheaper than a brute-force optimization of \(q\)-EI, it is tempting to recommend CL-min for Hartman6. However, it is not straightforward to know in advance which of CL-min or CL-max will perform better on a given test case: for example, CL-max outperforms CL-min on the Rastrigin function.

Now, we observe that using \(q\)-EI in the CL-mix heuristic enables very good performances in both cases without having to select one of the two lie values in advance. For the Hartman6 function, CL-mix even outperforms both CL-min and CL-max and has roughly the same performance as a brute force \(q\)-EI maximization. This suggests that a good heuristic might be to generate, at each iteration, candidate batches obtained with different strategies (e.g. CL with different lies) and to discriminate those batches using \(q\)-EI.

4 Conclusion

In this article we give a closed-form expression enabling a fast computation of the Multi-points Expected Improvement criterion for batch-sequential Bayesian global optimization. This formula is consistent with the classical Expected Improvement formula, and its computation does not require Monte Carlo simulations. Optimization strategies based on this criterion are now ready to be used on real test cases, and a brute-force maximization of this criterion shows promising results. In addition, we show that good performance can be achieved by generating candidate batches with a cheap-to-compute criterion and discriminating among those batches with \(q\)-EI. Such heuristics may be particularly interesting when the time needed to generate batches becomes a computational bottleneck, e.g. when \(q \ge 10\) and calls to the Gaussian c.d.f. become expensive.

A perspective, currently under study, is to improve the maximization of \(q\)-EI itself, e.g. through a more adapted choice of the algorithm and/or an analytical calculation of \(q\)-EI’s gradient.