1 Introduction

In materials design and discovery, we face the problem of choosing the chemical structure, composition, or processing conditions of a material to meet design criteria. The traditional approach is to use iterative trial and error, in which we (1) choose some material design that we think will work well based on intuition, past experience, or theoretical knowledge; (2) synthesize and test the material in physical experiments; and (3) use what we learn from these experiments in choosing the material design to try next. This iterative process is repeated until some combination of success and exhaustion is achieved.

While trial and error has been extremely successful, we believe that mathematics and computation together promise to accelerate the pace of materials discovery, not by changing the fundamental iterative nature of materials design, but by improving the choices that we make about which material designs to test, and by improving our ability to learn from previous experimental results.

In this chapter, we describe a collection of mathematical techniques, based on Bayesian statistics and decision theory, for augmenting and enhancing the trial and error process. We focus on one class of techniques, called Bayesian optimization (BO), or Bayesian global optimization (BGO), which use machine learning to build a predictive model of the underlying relationship between the design parameters of a material and its properties, and then use decision theory to suggest which design or designs would be most valuable to try next. The most well-developed Bayesian optimization methods assume that (1) the material is described by a vector of continuous variables, as is the case, e.g., when choosing ratios of constituent compounds, or choosing a combination of temperature and pressure to use during manufacture; (2) we have a single measure of quality that we wish to make as large as possible; and (3) the constraints on feasible materials designs are all known, so that any unknown constraints are incorporated into the quality measure. There is also a smaller body of work on problems that go beyond these assumptions, by considering discrete design decisions (such as small molecule design), by considering multiple competing objectives, or by explicitly allowing unknown constraints.

Bayesian optimization was pioneered by [1], with early development through the 1970s and 1980s by Mockus and Zilinskas [2, 3]. Development in the 1990s was marked by the popularization of Bayesian optimization by Jones, Schonlau, and Welch, who, building on previous work by Mockus, introduced the Efficient Global Optimization (EGO) method [4]. This method became quite popular and well-known in engineering, where it has been adopted for design applications involving time-consuming computer experiments, within a broader set of methods designed for optimization of expensive functions [5]. In the 2000s, development of Bayesian optimization continued in statistics and engineering, and the 2010s have seen additional development from the machine learning community, where Bayesian optimization is used for tuning hyperparameters of computationally expensive machine learning models [6]. Other introductions to Bayesian optimization may be found in the tutorial article [7] and textbooks [8, 9], and an overview of the history of the field may be found in [10].

We begin in Sect. 3.2 by introducing the precise problem considered by Bayesian optimization. We then describe in Sect. 3.3 the predictive technique used by Bayesian optimization, which is called Gaussian process (GP) regression. We then show, in Sect. 3.4, how Bayesian optimization recommends which experiments to perform. In Sect. 3.5 we provide an overview of software packages, both freely available and commercial, that implement the Bayesian optimization methods described in this chapter. We offer closing remarks in Sect. 3.6.

2 Bayesian Optimization

Bayesian optimization considers materials designs parameterized by a d-dimensional vector x. We suppose that the space of materials designs in which x takes values is a known set \(A \subseteq \mathbb {R}^d\).

For example, \(x=(x(1),\ldots ,x(d))\) could give the ratio of each of d different constituents mixed together to create some aggregate material. In this case, we would choose A to be the set \(A=\{x : \sum _{i=1}^d x(i) = 1,\ x(i) \ge 0 \}\). As another example, setting \(d=2\), \(x=(x(1),x(2))\) could give the temperature (x(1)) and pressure (x(2)) used in material processing. In this case, we would choose A to be the rectangle bounded by the experimental setup’s minimum and maximum achievable temperature on one axis, \(T_{\mathrm {min}}\) and \(T_{\mathrm {max}}\), and the minimum and maximum achievable pressure on the other. As a final example, we could let \(x=(x(1),\ldots ,x(d))\) be the temperatures used in some annealing schedule, assumed to be decreasing over time. In this case, we would set A to be the set \(\left\{ x : T_{\mathrm {max}} \ge x(1) \ge \cdots \ge x(d) \ge T_{\mathrm {min}} \right\} \).

Let f(x) be the quality of the material with design parameter x. The function f is unknown, and observing f(x) requires synthesizing material design x and observing its quality in a physical experiment. We would like to find a design x for which f(x) is large. That is, we would like to solve

$$\begin{aligned} \max _{x\in A} f(x). \end{aligned}$$
(3.1)

This is challenging because evaluating f(x) is typically expensive and time-consuming. While the time and expense depends on the setting, synthesizing and testing a new material design could easily take days or weeks of effort and thousands of dollars of materials.

In Bayesian optimization, we use mathematics to build a predictive model for the function f based on observations of previous materials designs, and then use this predictive model to recommend a materials design that would be most valuable to test next. We first describe this predictive model in Sect. 3.3; it is built using a machine learning technique called Gaussian process regression. We then describe, in Sect. 3.4, how this predictive model is used to recommend which design to test next.

3 Gaussian Process Regression

The predictive piece of Bayesian optimization is based on a machine learning technique called Gaussian process regression. This technique is a Bayesian version of a frequentist technique called kriging, introduced in the geostatistics literature by the South African mining engineer Daniel Krige [11], and popularized later by Matheron and colleagues [12], as described in [13]. A modern monograph on Gaussian process regression is [14], and a list of software implementing Gaussian process regression may be found at [15].

In Gaussian process regression, we seek to predict f(x) based on observations at previously evaluated points, call them \(x_1,\ldots ,x_n\). We first treat the case where f(x) can be observed exactly, without noise, and then later treat noise in Sect. 3.3.5. In this noise-free case, our observations are \(y_i = f(x_i)\) for \(i=1,\ldots ,n\).

Gaussian process regression is a Bayesian statistical method, and in Bayesian statistics we perform inference by placing a so-called prior probability distribution on unknown quantities of interest. The prior probability distribution is often called, more simply, the prior distribution or, even more simply, the prior. This prior distribution is meant to encode our intuition or domain expertise regarding which values for the unknown quantity of interest are most likely. We then use Bayes rule, together with any data observed, to calculate a posterior probability distribution on these unknowns. For a broader introduction to Bayesian statistics, see the textbook [16] or the research monograph [17].

In Gaussian process regression, if we wish to predict the value of f at a single candidate point \(x^*\), it is sufficient to consider our unknowns to be the values of f at the previously evaluated points, \(x_1,\ldots ,x_n\), and the new point \(x^*\) at which we wish to predict. That is, we take our unknown quantity of interest to be the vector \((f(x_1),\ldots ,f(x_n),f(x^*))\). We then take our data, which is \(f(x_1),\ldots ,f(x_n)\), and use Bayes rule to calculate a posterior probability distribution on the full vector of interest, \((f(x_1),\ldots ,f(x_n),f(x^*))\), or, more simply, just on \(f(x^*)\).

To calculate the posterior, we must first specify the prior, which Gaussian process regression assumes to be multivariate normal. It calculates the mean vector of this multivariate normal prior distribution using a function, called the mean function and written here as \(\mu _0(\cdot )\), which takes a single x as an argument. It applies this mean function to each of the points \(x_1,\ldots ,x_n,x^*\) to create an \((n+1)\)-dimensional column vector. Gaussian process regression creates the covariance matrix of the multivariate normal prior distribution using another function, called the covariance function or covariance kernel and written here as \(\varSigma _0(\cdot ,\cdot )\), which takes a pair of points \(x,x'\) as arguments. It applies this covariance function to every pair of points in \(x_1,\ldots ,x_n,x^*\) to create an \((n+1)\times (n+1)\) matrix.

Thus, Gaussian process regression sets the prior probability distribution to,

$$\begin{aligned} \begin{bmatrix} f(x_1) \\ \vdots \\ f(x_n) \\ f(x^*) \end{bmatrix} \sim \mathrm {Normal}\left( \begin{bmatrix} \mu _0(x_1) \\ \vdots \\ \mu _0(x_n) \\ \mu _0(x^*) \end{bmatrix}, \begin{bmatrix} \varSigma _0(x_1,x_1)&\cdots&\varSigma _0(x_1,x_n)&\varSigma _0(x_1,x^*) \\ \vdots&\ddots&\vdots&\vdots \\ \varSigma _0(x_n,x_1)&\cdots&\varSigma _0(x_n,x_n)&\varSigma _0(x_n,x^*) \\ \varSigma _0(x^*,x_1)&\cdots&\varSigma _0(x^*,x_n)&\varSigma _0(x^*,x^*) \end{bmatrix}\right) \end{aligned}$$
(3.2)

The subscript “0” in \(\mu _0\) and \(\varSigma _0\) indicates that these functions are relevant to the prior distribution, before any data has been collected.
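To make this construction concrete, the following sketch (in Python with NumPy; the names mean_fn and cov_fn are placeholders for the mean function and covariance kernel discussed in Sects. 3.3.1 and 3.3.2) assembles the mean vector and covariance matrix appearing in (3.2) for a set of points:

```python
import numpy as np

def gp_prior(mean_fn, cov_fn, X):
    """Assemble the multivariate normal prior of (3.2).

    mean_fn : callable, mean_fn(x) -> float, the prior mean function mu_0
    cov_fn  : callable, cov_fn(x, xp) -> float, the covariance kernel Sigma_0
    X       : array of shape (m, d) whose rows are x_1, ..., x_n, x^*
    Returns the m-vector of prior means and the m-by-m prior covariance matrix.
    """
    m = X.shape[0]
    mu = np.array([mean_fn(X[i]) for i in range(m)])
    Sigma = np.array([[cov_fn(X[i], X[j]) for j in range(m)] for i in range(m)])
    return mu, Sigma
```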

We now discuss how the mean and covariance functions are chosen, focusing on the covariance function first because it tends to be more important in getting good results from Gaussian process regression.

3.1 Choice of Covariance Function

In choosing the covariance function \(\varSigma _0(\cdot ,\cdot )\), we wish to satisfy two requirements.

The first is that it should encode the belief that points x and \(x'\) near each other tend to have more similar values for f(x) and \(f(x')\). To accomplish this, we want the covariance matrix in (3.2) to have entries that are larger for pairs of points that are closer together, and closer to 0 for pairs of points that are further apart.

The second is that the covariance function should always produce positive semidefinite covariance matrices in the multivariate normal prior. That is, if \(\varSigma \) is the covariance matrix in (3.2), then we require that \(a^T \varSigma a \ge 0\) for all column vectors a (where a is assumed to have the appropriate length, \(n+1\)). This requirement is necessary to ensure that the multivariate normal prior distribution is a well-defined probability distribution, because if \(\theta \) is multivariate normal with mean vector \(\mu \) and covariance matrix \(\varSigma \), then the variance of \(a\cdot \theta \) is \(a^T \varSigma a\), and we require variances to be non-negative.

Several covariance functions satisfy these two requirements. The most commonly used is called the squared exponential, or Gaussian, kernel, and is given by,

$$\begin{aligned} \varSigma _0(x,x') = \alpha \exp \left( -\sum _{i=1}^d \beta _i (x_i-x'_i)^2\right) . \end{aligned}$$
(3.3)

This kernel is parameterized by \(d+1\) parameters: \(\alpha \), and \(\beta _1,\ldots ,\beta _d\).

The parameter \(\alpha >0\) controls how much overall variability there is in the function f. We observe that under the prior, the variance of f(x) is \(\mathrm {Var}(f(x)) = \mathrm {Cov}(f(x),f(x)) = \alpha \). Thus, when \(\alpha \) is large, we are encoding in our prior distribution that f(x) is likely to take a larger range of values.

The parameters \(\beta _i > 0\) control how quickly the function f varies with x. For example, consider the relationship between some point x and another point \(x'=x+[1,0,\ldots ,0]\). When \(\beta _1\) is small (close to 0), the covariance between f(x) and \(f(x')\) is \(\alpha \exp (-\beta _1) \approx \alpha \), giving a correlation between f(x) and \(f(x')\) of nearly 1. This reflects a belief that f(x) and \(f(x')\) are likely to be very similar, and that learning the value of f(x) will also teach us a great deal about \(f(x')\). In contrast, when \(\beta _1\) is large, the covariance between f(x) and \(f(x')\) is nearly 0, giving a correlation between f(x) and \(f(x')\) that is also nearly 0, reflecting a belief that f(x) and \(f(x')\) are unrelated to each other, and learning something about f(x) will teach us little about \(f(x')\).
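As a concrete illustration, the squared exponential kernel (3.3) can be written in a few lines; this is a minimal sketch in Python/NumPy, not the implementation used by any particular package:

```python
import numpy as np

def squared_exponential(x, xp, alpha, beta):
    """Squared exponential (Gaussian) kernel of (3.3).

    x, xp : length-d arrays, the two design vectors
    alpha : overall variance parameter, alpha > 0
    beta  : length-d array of the parameters beta_i > 0
    """
    x, xp, beta = np.asarray(x), np.asarray(xp), np.asarray(beta)
    return alpha * np.exp(-np.sum(beta * (x - xp) ** 2))
```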

Going beyond the squared exponential kernel

There are several other possibilities for the covariance kernel beyond the squared exponential kernel , which encode different assumptions about the underlying behavior of the function f. One particularly useful generalization of the squared exponential covariance kernel is the Matérn covariance kernel, which allows more flexibility in modeling the smoothness of f.

Before describing this kernel, let \(r = \sqrt{\sum _i \left( \frac{x_i - x'_i}{\beta _i}\right) ^2}\) be the Euclidean distance between x and \(x'\), where the length scale in each dimension has been adjusted by some strictly positive parameter \(\beta _i\). (Note that the \(\beta _i\) here act as length scales, and so correspond to the reciprocal square roots of the \(\beta _i\) parameters in (3.3).) With this notation, the squared exponential covariance kernel can be written as \(\varSigma _0(x,x') = \alpha \exp \left( -r^2\right) \).

With this notation, the Matérn covariance kernel is,

$$\begin{aligned} \varSigma _0(x,x') = \alpha \frac{2^{1-\nu }}{\varGamma (\nu )}\left( \sqrt{2\nu }r\right) ^\nu K_\nu \left( \sqrt{2\nu } r\right) , \end{aligned}$$

where \(K_\nu \) is the modified Bessel function of the second kind. If we take the limit as \(\nu \rightarrow \infty \), we obtain the squared exponential kernel ([14], Sect. 4.2 p. 85).

The Matérn covariance kernel is useful because it allows modeling the smoothness of f in a more flexible way, as compared with the squared exponential kernel. Under the squared exponential covariance kernel, the function f is infinitely mean-square differentiable, which may not be an appropriate assumption in many applications. In contrast, under the Matérn covariance kernel, f is k-times mean-square differentiable if and only if \(\nu >k\). Thus, we can model a function that is twice differentiable but no more by choosing \(\nu =5/2\), and a function that is once differentiable but no more by choosing \(\nu =3/2\).
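The Matérn kernel can likewise be evaluated directly from its definition. The sketch below uses SciPy's modified Bessel function of the second kind and the distance r defined above, and returns the limiting value \(\alpha \) when r = 0; it is an illustration rather than a production implementation:

```python
import numpy as np
from scipy.special import gamma, kv

def matern(x, xp, alpha, beta, nu):
    """Matern covariance kernel with smoothness parameter nu.

    beta is the length-d array of length scales used in the definition of r above.
    """
    x, xp, beta = np.asarray(x), np.asarray(xp), np.asarray(beta)
    r = np.sqrt(np.sum(((x - xp) / beta) ** 2))
    if r == 0.0:
        return alpha                      # limiting value of the expression as r -> 0
    z = np.sqrt(2.0 * nu) * r
    return alpha * (2.0 ** (1.0 - nu) / gamma(nu)) * z ** nu * kv(nu, z)
```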

While the squared exponential and Matérn covariance kernels allow modeling a wide range of behaviors, and together represent a toolkit that will handle a wide variety of applications, there are other covariance kernels. For a thorough discussion of these, see Chap. 4 of [14].

Both the Matérn and squared exponential covariance kernels require choosing parameters. While it certainly is possible to choose the parameters \(\alpha \) and \(\beta _i\) (and \(\nu \) in the case of the Matérn) based on one’s intuition about f, and what kinds of variability f is likely to have in a particular application, it is more common to choose these parameters (especially \(\alpha \) and \(\beta _i\)) adaptively, so as to best fit previously observed points. We discuss this further below in Sect. 3.3.6. First, however, we discuss the choice of the mean function.

3.2 Choice of Mean Function

We now discuss choosing the mean function \(\mu _0(\cdot )\). Perhaps the most common choice is to simply set the mean function equal to a constant, \(\mu \). This constant must be estimated, along with parameters of the covariance kernel such as \(\alpha \) and \(\beta _i\), and is discussed in Sect. 3.3.6.

Beyond this simple choice, if one believes that there will be trends in f that can be described in a parametric way, then it is useful to include trend terms into the mean function. This is accomplished by choosing

$$\begin{aligned} \mu _0(x) = \mu + \sum _{j=1}^J \gamma _j \varPsi _j(x), \end{aligned}$$

where \(\varPsi _j(\cdot )\) are known functions, and \(\gamma _j \in \mathbb {R}\), along with \(\mu \in \mathbb {R}\), are parameters that must be estimated.

A common choice for the \(\varPsi _j\), if one chooses to include them, is polynomials in x up to some small order. For example, if \(d=2\), so x is two-dimensional, then one might include all polynomials up to second order, \(\varPsi _1(x) = x_1\), \(\varPsi _2(x) = x_2\), \(\varPsi _3(x) = (x_1)^2\), \(\varPsi _4(x) = (x_2)^2\), \(\varPsi _5(x) = x_1 x_2\), setting \(J=5\). One recovers the constant mean function by setting \(J=0\).
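For instance, a mean function with trend terms might be coded as follows (a minimal sketch; the basis functions \(\varPsi _j\) and the coefficients are whatever the modeler chooses and estimates):

```python
def mean_with_trend(x, mu, gammas, basis_fns):
    """Mean function mu_0(x) = mu + sum_j gamma_j * Psi_j(x)."""
    return mu + sum(g * psi(x) for g, psi in zip(gammas, basis_fns))

# Example for d = 2 with all polynomials up to second order (J = 5):
basis_fns = [lambda x: x[0], lambda x: x[1],
             lambda x: x[0] ** 2, lambda x: x[1] ** 2,
             lambda x: x[0] * x[1]]
```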

3.3 Inference

Given the prior distribution (3.2) on \(f(x_1),\ldots ,f(x_n),f(x^*)\), and given (noise-free) observations of \(f(x_1),\ldots ,f(x_n)\), the critical step in Gaussian process regression is calculating the posterior distribution on \(f(x^*)\). We rely on the following general result about conditional probabilities and multivariate normal distributions. Its proof, which may be found in the Derivations and Proofs section, relies on Bayes rule and algebraic manipulation of the probability density of the multivariate normal distribution.

Proposition 1

Let \(\theta \) be a k-dimensional multivariate normal random column vector, with mean vector \(\mu \) and covariance matrix \(\varSigma \). Let \(k_1\ge 1, k_2\ge 1\) be two integers summing to k. Decompose \(\theta \), \(\mu \) and \(\varSigma \) as

$$\begin{aligned} \theta = \begin{bmatrix}\theta _{[1]} \\ \theta _{[2]}\end{bmatrix}, \qquad \mu = \begin{bmatrix}\mu _{[1]} \\ \mu _{[2]}\end{bmatrix}, \qquad \varSigma = \begin{bmatrix}\varSigma _{[1,1]}&\varSigma _{[1,2]} \\ \varSigma _{[2,1]}&\varSigma _{[2,2]}\end{bmatrix}, \end{aligned}$$

so that \(\theta _{[i]}\) and \(\mu _{[i]}\) are \(k_i\)-column vectors, and \(\varSigma _{[i,j]}\) is a \(k_i\times k_j\) matrix, for each \(i,j=1,2\).

If \(\varSigma _{[1,1]}\) and \(\varSigma _{[2,2]}\) are invertible, then, for any \(u\in \mathbb {R}^{k_1}\), the conditional distribution of \(\theta _{[2]}\) given that \(\theta _{[1]}=u\) is multivariate normal with mean

$$\begin{aligned} \mu _{[2]} + \varSigma _{[2,1]}\varSigma _{[1,1]}^{-1}(u-\mu _{[1]}) \end{aligned}$$

and covariance matrix

$$\begin{aligned} \varSigma _{[2,2]} - \varSigma _{[2,1]}\varSigma _{[1,1]}^{-1}\varSigma _{[1,2]}. \end{aligned}$$
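As a concrete illustration, Proposition 1 translates directly into a few lines of linear algebra; the following sketch (Python/NumPy) computes the conditional mean and covariance for a given partition:

```python
import numpy as np

def conditional_mvn(mu, Sigma, k1, u):
    """Conditional distribution of theta_[2] given theta_[1] = u (Proposition 1).

    mu, Sigma : mean vector and covariance matrix of the joint normal
    k1        : size of the first block
    u         : observed value of the first block
    """
    mu1, mu2 = mu[:k1], mu[k1:]
    S11, S12 = Sigma[:k1, :k1], Sigma[:k1, k1:]
    S21, S22 = Sigma[k1:, :k1], Sigma[k1:, k1:]
    cond_mean = mu2 + S21 @ np.linalg.solve(S11, u - mu1)
    cond_cov = S22 - S21 @ np.linalg.solve(S11, S12)
    return cond_mean, cond_cov
```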

We use this proposition to calculate the posterior distribution on \(f(x^*)\), given \(f(x_1),\ldots ,f(x_n)\).

Before doing so, however, we first introduce some additional notation. We let \(y_{1:n}\) indicate the column vector \([y_1,\ldots ,y_n]^T\), and we let \(x_{1:n}\) indicate the sequence of vectors \((x_1,\ldots ,x_n)\). We let \(f(x_{1:n}) = [f(x_1),\ldots ,f(x_n)]^T\), and similarly for other functions of x, such as \(\mu _0(\cdot )\). We introduce similar additional notation for functions that take pairs of points \(x,x'\), so that \(\varSigma _0(x_{1:n},x_{1:n})\) is the matrix \(\left[ {\begin{matrix} \varSigma _0(x_1,x_1) &{} \cdots &{} \varSigma _0(x_1,x_n) \\ \vdots &{} \ddots &{} \vdots \\ \varSigma _0(x_n,x_1) &{} \cdots &{} \varSigma _0(x_n,x_n) \end{matrix}} \right] \), \(\varSigma _0(x^*,x_{1:n})\) is the row vector \([\varSigma _0(x^*,x_1),\ldots ,\varSigma _0(x^*,x_n)]\), and \(\varSigma _0(x_{1:n},x^*)\) is the column vector \([\varSigma _0(x_1,x^*),\ldots ,\varSigma _0(x_n,x^*)]^T\).

This notation allows us to rewrite (3.2) as

$$\begin{aligned} \begin{bmatrix} y_{1:n}\\ f(x^*) \end{bmatrix} \sim \mathrm {Normal}\left( \begin{bmatrix} \mu _0(x_{1:n})\\ \mu _0(x^*) \end{bmatrix}, \begin{bmatrix} \varSigma _0(x_{1:n},x_{1:n})&\varSigma _0(x_{1:n},x^*) \\ \varSigma _0(x^*,x_{1:n})&\varSigma _0(x^*,x^*) \end{bmatrix} \right) . \end{aligned}$$
(3.4)

We now examine this expression in the context of Proposition 1. We set \(\theta _{[1]} = f(x_{1:n})\), \(\theta _{[2]} = f(x^*)\), \(\mu _{[1]} = \mu _0(x_{1:n})\), \(\mu _{[2]} = \mu _0(x^*)\), \(\varSigma _{[1,1]} = \varSigma _0(x_{1:n},x_{1:n})\), \(\varSigma _{[1,2]} = \varSigma _0(x_{1:n},x^*)\), \(\varSigma _{[2,1]} = \varSigma _0(x^*,x_{1:n})\), and \(\varSigma _{[2,2]} = \varSigma _0(x^*,x^*)\).

Then, applying Proposition 1, we see that the posterior distribution on \(f(x^*)\) given observations \(y_i = f(x_i), i=1,\ldots ,n\) is normal, with a mean \(\mu _n(x^*)\) and variance \(\sigma ^2_n(x^*)\) given by,

$$\begin{aligned} \mu _n(x^*)&= \mu _0(x^*) + \varSigma _0(x^*,x_{1:n})\varSigma _0(x_{1:n},x_{1:n})^{-1}(f(x_{1:n})-\mu _0(x_{1:n})), \end{aligned}$$
(3.5)
$$\begin{aligned} \sigma ^2_n(x^*)&= \varSigma _0(x^*,x^*) - \varSigma _0(x^*,x_{1:n})\varSigma _0(x_{1:n},x_{1:n})^{-1}\varSigma _0(x_{1:n},x^*). \end{aligned}$$
(3.6)

The invertibility of \(\varSigma _0(x_{1:n},x_{1:n})\) (and also \(\varSigma _0(x^*,x^*)\)) required by Proposition 1 depends on the covariance kernel and its parameters (typically called hyperparameters), but this invertibility typically holds as long as these hyperparameters satisfy mild non-degeneracy conditions, and the \(x_{1:n}\) are distinct, i.e., we have not measured the same point more than once. For example, under the squared exponential covariance kernel, invertibility holds as long as \(\alpha >0\) and the \(x_{1:n}\) are distinct. If we have measured a point multiple times then, since observations here are noise-free, we can safely drop all but one of the measurements. Below, we treat the case where observations are noisy, and in this case including multiple measurements of the same point is perfectly reasonable and does not cause issues.
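A direct (if numerically naive) sketch of (3.5) and (3.6), assuming hypothetical mean_fn and cov_fn callables for the prior mean function and covariance kernel, is the following; a numerically preferable version based on a Cholesky decomposition is discussed in Sect. 3.3.9:

```python
import numpy as np

def gp_posterior_noise_free(x_star, X, y, mean_fn, cov_fn):
    """Posterior mean (3.5) and variance (3.6) at x_star, given noise-free
    observations y_i = f(x_i) at the rows of X."""
    n = X.shape[0]
    K = np.array([[cov_fn(X[i], X[j]) for j in range(n)] for i in range(n)])
    k_star = np.array([cov_fn(x_star, X[i]) for i in range(n)])
    m = np.array([mean_fn(X[i]) for i in range(n)])
    mu_n = mean_fn(x_star) + k_star @ np.linalg.solve(K, y - m)
    var_n = cov_fn(x_star, x_star) - k_star @ np.linalg.solve(K, k_star)
    return mu_n, var_n
```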

Figure 3.1 shows the output from Gaussian process regression. In the figure, circles show points \((x_i,f(x_i))\), the solid line shows \(\mu _n(x^*)\) as a function of \(x^*\), and the dashed lines are positioned at \(\mu _n(x^*) \pm 1.96\sigma _n(x^*)\), forming a 95 % Bayesian credible interval for \(f(x^*)\), i.e., an interval in which \(f(x^*)\) lies with posterior probability 95 %. (A credible interval is the Bayesian version of a frequentist confidence interval.) Because observations are noise-free, the posterior mean \(\mu _n(x^*)\) interpolates the observations \(f(x_i)\).

Fig. 3.1
figure 1

Illustration of Gaussian process regression with noise-free evaluations. The circles show previously evaluated points, \((x_i,f(x_i))\). The solid line shows the posterior mean, \(\mu _n(x)\), as a function of x, which is an estimate of f(x), and the dashed lines show a Bayesian credible interval for each f(x), calculated as \(\mu _n(x) \pm 1.96\sigma _n(x)\). Although this example shows f taking a scalar input, Gaussian process regression can be used for functions with vector inputs

3.4 Inference with Just One Observation

The expressions (3.5) and (3.6) are complex, and perhaps initially difficult to assimilate. To give more intuition about them, and also to support some additional analysis below in Sect. 3.4, it is useful to consider the simplest case, when we have just a single measurement, \(n=1\).

In this case, all matrices in (3.5) and (3.6) are scalars, \(\varSigma _0(x^*,x_1) = \varSigma _0(x_1,x^*)\), and the expressions (3.5) and (3.6) can be rewritten as,

$$\begin{aligned} \mu _1(x^*)&= \mu _0(x^*) + \frac{\varSigma _0(x^*,x_1)}{\varSigma _0(x_1,x_1)}(f(x_1)-\mu _0(x_1)), \end{aligned}$$
(3.7)
$$\begin{aligned} \sigma ^2_1(x^*)&= \varSigma _0(x^*,x^*) - \frac{\varSigma _0(x^*,x_1)^2}{\varSigma _0(x_1,x_1)}. \end{aligned}$$
(3.8)

Intuition about the expression for the posterior mean

We first examine (3.7). We see that the posterior mean of \(f(x^*)\), \(\mu _1(x^*)\), which we can think of as our estimate of \(f(x^*)\) after observing \(f(x_1)\), is obtained by taking our original estimate of \(f(x^*)\), \(\mu _0(x^*)\), and adding to it a correction term. This correction term is itself the product of two quantities: the error \(f(x_1)-\mu _0(x_1)\) in our original estimate of \(f(x_1)\), and the quantity \(\frac{\varSigma _0(x^*,x_1)}{\varSigma _0(x_1,x_1)}\). Typically, \(\varSigma _0(x^*,x_1)\) will be positive, and hence also \(\frac{\varSigma _0(x^*,x_1)}{\varSigma _0(x_1,x_1)}\). (Recall, \(\varSigma _0(x_1,x_1)\) is a variance, so is never negative.) Thus, if \(f(x_1)\) is bigger than expected, \(f(x_1) - \mu _0(x_1)\) will be positive, and our posterior mean \(\mu _1(x^*)\) will be larger than our prior mean \(\mu _0(x^*)\). In contrast, if \(f(x_1)\) is smaller than expected, \(f(x_1) - \mu _0(x_1)\) will be negative, and our posterior mean \(\mu _1(x^*)\) will be smaller than our prior mean \(\mu _0(x^*)\).

We can examine the quantity \(\frac{\varSigma _0(x^*,x_1)}{\varSigma _0(x_1,x_1)}\) to understand the effect of the position of \(x^*\) relative to \(x_1\) on the magnitude of the correction to the posterior mean. Notice that \(x^*\) only enters this expression through the numerator. If \(x^*\) is close to \(x_1\), then \(\varSigma _0(x^*,x_1)\) will be large under the squared exponential and most other covariance kernels, and positive values for \(f(x_1)-\mu _0(x_1)\) will also cause a strong positive change in \(\mu _1(x^*)\) relative to \(\mu _0(x^*)\). If \(x^*\) is far from \(x_1\), then \(\varSigma _0(x^*,x_1)\) will be close to 0, and \(f(x_1)-\mu _0(x_1)\) will have little effect on \(\mu _1(x^*)\).

Intuition about the expression for the posterior variance

Now we examine (3.8). We see that the variance of our belief on \(f(x^*)\) under the posterior, \(\sigma ^2_1(x^*)\), is smaller than its value under the prior, \(\varSigma _0(x^*,x^*)\). Moreover, when \(x^*\) is close to \(x_1\), \(\varSigma _0(x^*,x_1)\) will be large, and the reduction in variance from prior to posterior will also be large.

Conversely, when \(x^*\) is far from \(x_1\), \(\varSigma _0(x^*,x_1)\) will be close to 0, and the variance under the posterior will be similar to its value under the prior.

As a final remark, we can also rewrite the expression (3.8) in terms of the squared correlation under the prior, \(\mathrm {Corr}(f(x^*),f(x_1))^2 = \varSigma _0(x^*,x_1)^2/(\varSigma _0(x^*,x^*)\varSigma _0(x_1,x_1)) \in [0,1]\), as

$$\begin{aligned} \sigma ^2_1(x^*) = \varSigma _0(x^*,x^*)\left( 1 - \mathrm {Corr}(f(x^*),f(x_1))^2\right) . \end{aligned}$$

We thus see that the reduction in variance of the posterior distribution depends on the squared correlation under the prior, with larger squared correlation implying a larger reduction.

3.5 Inference with Noisy Observations

The previous section assumed that \(f(x^*)\) can be observed exactly, without any error. When \(f(x^*)\) is the outcome of a physical experiment, however, our observations are obscured by noise. Indeed, if we were to synthesize and test the same material design \(x^*\) multiple times, we might observe different results.

To model this situation, Gaussian process regression can be extended to allow observations of the form,

$$\begin{aligned} y(x_i) = f(x_i) + \varepsilon _i, \end{aligned}$$

where we assume that the \(\varepsilon _i\) are normally distributed with mean 0 and constant variance, \(\lambda ^2\), with independence across i. In general, the variance \(\lambda ^2\) is unknown; we treat it as another parameter of our model, and estimate it along with all the other parameters, as discussed below in Sect. 3.3.6.

These assumptions of constant variance (called homoscedasticity ) and independence make the analysis significantly easier, although they are often violated in practice. Experimental conditions that tend to violate these assumptions are discussed below, as are versions of GP regression that can be used when they are violated.

Analysis of independent homoscedastic noise

To perform inference under independent homoscedastic noise, and calculate a posterior distribution on the value of the function \(f(x^*)\) at a given point \(x^*\), our first step is to write down the joint distribution of our observations \(y_1,\ldots ,y_n\) and the quantity we wish to predict, \(f(x^*)\), under the prior. That is, we write down the distribution of the vector \([y_1,\ldots ,y_n,f(x^*)]\).

We first observe that \([y_1,\ldots ,y_n,f(x^*)]\) is the sum of \([f(x_1),\ldots ,f(x_n),f(x^*)]\) and another vector, \([\varepsilon _1,\ldots ,\varepsilon _n,0]\). The first vector has a multivariate normal distribution given by (3.4). The second vector is independent of the first and is also multivariate normal, with a mean vector that is identically 0, and a covariance matrix \(\mathrm {diag}(\lambda ^2,\ldots ,\lambda ^2,0)\). The sum of two independent multivariate normal random vectors is itself multivariate normal, with a mean vector and covariance matrix given, respectively, by the sums of the mean vectors and covariance matrices of the summands. This gives the distribution of \([y_1,\ldots ,y_n,f(x^*)]\) as

$$\begin{aligned} \begin{bmatrix} y_{1:n}\\ f(x^*) \end{bmatrix} \sim \mathrm {Normal}\left( \begin{bmatrix} \mu _0(x_{1:n})\\ \mu _0(x^*) \end{bmatrix}, \begin{bmatrix} \varSigma _0(x_{1:n},x_{1:n})+\lambda ^2 I_n&\varSigma _0(x_{1:n},x^*) \\ \varSigma _0(x^*,x_{1:n})&\varSigma _0(x^*,x^*) \end{bmatrix} \right) , \end{aligned}$$
(3.9)

where \(I_n\) is the n-dimensional identity matrix.

As we did in Sect. 3.3.3, we can use Proposition 1 with the above expression to compute the posterior on \(f(x^*)\) given \(y_{1:n}\). We obtain,

$$\begin{aligned} \mu _n(x^*)&= \mu _0(x^*) + \varSigma _0(x^*,x_{1:n})\left[ \varSigma _0(x_{1:n},x_{1:n})+\lambda ^2 I_n\right] ^{-1}(y_{1:n}-\mu _0(x_{1:n})) \end{aligned}$$
(3.10)
$$\begin{aligned} \sigma ^2_n(x^*)&= \varSigma _0(x^*,x^*) - \varSigma _0(x^*,x_{1:n})\left[ \varSigma _0(x_{1:n},x_{1:n})+\lambda ^2 I_n\right] ^{-1}\varSigma _0(x_{1:n},x^*). \end{aligned}$$
(3.11)

If we set \(\lambda ^2=0\), so there is no noise, then we recover (3.5) and (3.6).

Figure 3.2 shows an example of a posterior distribution calculated with Gaussian process regression with noisy observations. Notice that the posterior mean no longer interpolates the observations, and the credible interval has a strictly positive width at points where we have measured. Noise prevents us from observing function values exactly, and so we remain uncertain about the function value at points we have measured.

Fig. 3.2
figure 2

Illustration of Gaussian process regression with noisy evaluations. As in Fig. 3.1, the circles show previously evaluated points, \((x_i,y_i)\), where \(y_i\) is \(f(x_i)\) perturbed by constant-variance independent noise. The solid line shows the posterior mean, \(\mu _n(x)\), as a function of x, which is an estimate of the underlying function f, and the dashed lines show a Bayesian credible interval for f, calculated as \(\mu _n(x) \pm 1.96\sigma _n(x)\)

Going beyond homoscedastic independent noise

Constant variance is violated if the experimental noise differs across materials designs, which occurs most frequently when noise arises during the synthesis of the material itself, rather than during the evaluation of a material that has already been created. Some work has been done to extend Gaussian process regression to flexibly model heteroscedastic noise (i.e., noise whose variance changes) [18–21]. The main idea in much of this work is to use a second Gaussian process to model the changing variance across the input domain. Much of this work assumes that the noise is independent and Gaussian, though [21] considers non-Gaussian noise.

Independence is most typically violated, in the context of physical experiments, when the synthesis and evaluation of multiple materials designs is done together, and the variation in some shared component simultaneously influences these designs, e.g., through variation in the temperature while the designs are annealing together, or through variation in the quality of some constituent used in synthesis. We are aware of relatively little work modeling dependent noise in the context of Gaussian process regression and Bayesian optimization , with one exception being [22].

3.6 Parameter Estimation

The mean and covariance functions contain several parameters. For example, if we use the squared exponential kernel, a constant mean function, and observations have independent homoscedastic noise, then we must choose or estimate the parameters \(\mu ,\alpha ,\beta _1,\ldots ,\beta _d, \lambda \). These parameters are typically called hyperparameters because they are parameters of the prior distribution. (\(\lambda ^2\) is actually a parameter of the likelihood function, but it is convenient to treat it together with the parameters of the prior.) While one may simply choose these hyperparameters directly, based on intuition about the problem, a more common approach is to choose them adaptively, based on data.

To accomplish this, we write down an expression for the probability of the observed data \(y_{1:n}\) in terms of the hyperparameters, marginalizing over the uncertainty on \(f(x_{1:n})\). Then, we optimize this expression over the hyperparameters to find settings that make the observed data as likely as possible. This approach to setting hyperparameters is often called empirical Bayes , and it can be seen as an approximation to full Bayesian inference.

We detail this approach for the squared exponential kernel with a constant mean function. Estimation for other kernels and mean functions is similar. Using the probability distribution of \(y_{1:n}\) from (3.9), and neglecting constants, the natural logarithm of this probability, \(\log p(y_{1:n} \mid x_{1:n})\) (called the “log marginal likelihood”), can be calculated as

$$\begin{aligned} -\frac{1}{2} (y_{1:n} - \mu )^T \left( \varSigma _0(x_{1:n}, x_{1:n}) + \lambda ^2 I_n\right) ^{-1} (y_{1:n} - \mu ) - \frac{1}{2}\log |\varSigma _0(x_{1:n}, x_{1:n}) + \lambda ^2 I_n|, \end{aligned}$$

where \(|\cdot |\) applied to a matrix indicates the determinant.

To find the hyperparameters that maximize this log marginal likelihood (the neglected constant does not affect the location of the maximizer), we take partial derivatives with respect to each hyperparameter. We first find the maximizing values of \(\mu \) and \(\sigma ^2 := \alpha + \lambda ^2\) analytically, and then use gradient-based numerical optimization over the remaining hyperparameters.

Taking a partial derivative with respect to \(\mu \), setting it to zero, and solving for \(\mu \), we get that the value of \(\mu \) that maximizes the marginal likelihood is

$$\begin{aligned} \hat{\mu } = \frac{\sum _{i=1}^n \left( (\varSigma _0(x_{1:n}, x_{1:n}) + \lambda ^2 I_n)^{-1} y_{1:n}\right) _i}{\sum _{i,j=1}^n (\varSigma _0(x_{1:n}, x_{1:n}) + \lambda ^2 I_n)^{-1}_{ij}}. \end{aligned}$$

Define R as the matrix with components

$$\begin{aligned} R_{ij} = {\left\{ \begin{array}{ll} 1 &{} i=j,\\ g \exp \left( -\sum \limits _{l=1}^d \beta _l \left( x_i(l)-x_j(l)\right) ^2\right) &{} i \ne j, \end{array}\right. } \end{aligned}$$

where \(g = \frac{\alpha }{\sigma ^2}\) and \(x_i(l)\) denotes the l-th component of \(x_i\). Then \(\varSigma _0(x_{1:n}, x_{1:n}) + \lambda ^2 I_n = \sigma ^2 R\) and \(\hat{\mu }\) can be written in terms of R as \(\hat{\mu } = \frac{\sum _{i=1}^n \left( R^{-1} y_{1:n}\right) _i}{\sum _{i,j=1}^n R^{-1}_{ij}}\). The log marginal likelihood (still neglecting constants) becomes

$$\begin{aligned} \log p(y_{1:n} \mid x_{1:n}) \sim -\frac{1}{2}(y_{1:n}-\hat{\mu })^T(\sigma ^2 R)^{-1}(y_{1:n}-\hat{\mu })-\frac{1}{2}\log |\sigma ^2 R|. \end{aligned}$$

Taking the partial derivative with respect to \(\sigma ^2\), and noting that \(\hat{\mu }\) does not depend on \(\sigma ^2\), we solve for \(\sigma ^2\) and obtain

$$\begin{aligned} \widehat{\sigma ^2} = \frac{1}{n}(y_{1:n}-\hat{\mu })^T R^{-1} (y_{1:n} - \hat{\mu }). \end{aligned}$$

Substituting this estimate, the log marginal likelihood becomes

$$\begin{aligned} \log p(y_{1:n}\mid x_{1:n}) \sim -\log \left( \frac{1}{n} |R|^{\frac{1}{n}} (y_{1:n}-\hat{\mu })^TR^{-1}(y_{1:n}-\hat{\mu }) \right) . \end{aligned}$$
(3.12)

The expression (3.12) cannot in general be optimized analytically. Instead, one typically optimizes it numerically using a first- or second-order optimization algorithm, such as Newton’s method or gradient descent , obtaining estimates for \(\beta _1, \ldots , \beta _d\) and g. These estimates are in turn substituted to provide an estimate of R, from which estimates \(\hat{\mu }\) and \(\widehat{\sigma ^2}\) may be computed. Finally, using \(\widehat{\sigma ^2}\) and the estimated value of g, we may estimate \(\alpha \) and \(\lambda \).
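As an illustration of this procedure, the sketch below (Python, using scipy.optimize; a simplified example rather than a robust implementation, which would use a Cholesky factorization and multiple random restarts) maximizes the concentrated log marginal likelihood (3.12) over \(\beta _1,\ldots ,\beta _d\) and g, and then recovers estimates of \(\mu \), \(\sigma ^2\), \(\alpha \), and \(\lambda ^2\):

```python
import numpy as np
from scipy.optimize import minimize

def fit_hyperparameters(X, y):
    """Estimate mu, alpha, beta_1..beta_d, and lambda^2 by maximizing (3.12)."""
    n, d = X.shape

    def build_R(beta, g):
        # R_ij = 1 on the diagonal and g * exp(-sum_l beta_l (x_i(l)-x_j(l))^2) off it.
        diff2 = (X[:, None, :] - X[None, :, :]) ** 2         # shape (n, n, d)
        R = g * np.exp(-np.tensordot(diff2, beta, axes=([2], [0])))
        np.fill_diagonal(R, 1.0)
        return R

    def neg_profile_loglik(params):
        beta, g = np.exp(params[:d]), np.exp(params[d])       # optimize in log space
        R = build_R(beta, g)
        Rinv = np.linalg.inv(R)
        mu_hat = np.sum(Rinv @ y) / np.sum(Rinv)
        resid = y - mu_hat
        _, logdet = np.linalg.slogdet(R)
        return np.log(resid @ Rinv @ resid / n) + logdet / n  # negative of (3.12)

    # log(g) <= 0 enforces g = alpha / sigma^2 <= 1, so that lambda^2 >= 0.
    bounds = [(None, None)] * d + [(None, 0.0)]
    res = minimize(neg_profile_loglik, x0=np.zeros(d + 1), method="L-BFGS-B", bounds=bounds)

    beta, g = np.exp(res.x[:d]), np.exp(res.x[d])
    R = build_R(beta, g)
    Rinv = np.linalg.inv(R)
    mu_hat = np.sum(Rinv @ y) / np.sum(Rinv)
    sigma2_hat = (y - mu_hat) @ Rinv @ (y - mu_hat) / n
    return dict(mu=mu_hat, beta=beta, alpha=g * sigma2_hat, lambda2=(1.0 - g) * sigma2_hat)
```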

3.7 Diagnostics

When using Gaussian process regression, or any other machine learning technique, it is advisable to check the quality of the predictions, and to assess whether the assumptions made by the method are met. One way to do this is illustrated by Fig. 3.3, which comes from a simulation of blood flow near the heart, based on [23], for which we get exact (not noisy) observations of f(x).

Fig. 3.3
figure 3

Diagnostic plot for Gaussian process regression, created with leave-one-out cross validation. For each point in our dataset, we hold that point \((x_i,y_i)\) out, train on the remaining points, calculate a 95 % credible interval for \(y_i\), and plot this credible interval as an error bar whose x-coordinate is the actual value \(y_i\). If Gaussian process regression is working well, 95 % of the error bars will intersect the diagonal line Predicted = Actual

This plot is created with a technique called leave-one-out cross validation . In this technique, we iterate through the datapoints \(x_{1:n}\), \(y_{1:n}\), and for each \(i\in \{1,\ldots ,n\}\), we train a Gaussian process regression model on all of the data except \(x_i,y_i\), and then use it, together with \(x_i\), to predict what the value \(y_i\) should be. We obtain from this a posterior mean (the prediction), call it \(\mu _{-i}(x_i)\), and also a posterior standard deviation, call it \(\sigma _{-i}(x_i)\). When calculating these estimates, it is best to separately re-estimate the hyperparameters each time, leaving out the data \((x_i,y_i)\). We then calculate a 95 % credible interval \(\mu _{-i}(x_i) \pm 2\sigma _{-i}(x_i)\), and create Fig. 3.3 by plotting “Predicted” versus “Actual”, where the “Actual” coordinate (on the x-axis) is \(y_i\), and the “Predicted” value (on the y-axis) is pictured as an error bar centered at \(\mu _{-i}(x_i)\) with half-width \(2\sigma _{-i}(x_i)\).

If the uncertainty estimates outputted by Gaussian process regression are behaving as anticipated, then approximately 95 % of the credible intervals will intersect the diagonal line Predicted = Actual. Moreover, if Gaussian process regression’s predictive accuracy is high, then the credible intervals will be short, and their centers will be close to this same line Predicted=Actual.

This idea may be extended to noisy function evaluations, under the assumption of independent homoscedastic noise. To handle the fact that the same point may be sampled multiple times, let \(m(x_i)\) be the number of times that the point \(x_i\) was sampled, and let \(\overline{y}(x_i)\) be the average of the observed values at this point. By holding out all \(m(x_i)\) samples of \(x_i\) and training Gaussian process regression on the remaining data, we obtain a normal posterior distribution on \(f(x_i)\) that has mean \(\mu _{-i}(x_i)\) and standard deviation \(\sigma _{-i}(x_i)\).

Since \(\overline{y}(x_i)\) is then the sum of \(f(x_i)\) and some normally distributed noise with mean 0 and variance \(\lambda ^2/m(x_i)\), the resulting distribution of \(\overline{y}(x_i)\) is normal with mean \(\mu _{-i}(x_i)\) and standard deviation \(\sqrt{\sigma _{-i}^2(x_i) + \lambda ^2/m(x_i)}\).

From this, a 95 % credible interval for \(\overline{y}(x_i)\) is then \(\mu _{-i}(x_i) \pm 2 \sqrt{\sigma _{-i}^2(x_i) + \lambda ^2/m(x_i)}\). We would plot Predicted versus Observed by putting this credible interval along the y-axis at x-coordinate \(\overline{y}(x_i)\). If Gaussian process regression is working well, then approximately 95 % of these credible intervals will intersect the line Predicted = Observed.
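A sketch of this diagnostic is given below, assuming a hypothetical fit_and_predict(X_train, y_train, x_new) routine that refits the Gaussian process (including re-estimating its hyperparameters) on the held-in data and returns the posterior mean and standard deviation at x_new. For simplicity it assumes each point was measured once (m(x_i) = 1); setting lam2 = 0 recovers the noise-free case. It reports the fraction of 95 % credible intervals that cover the held-out observations, which should be roughly 0.95 if the model is well calibrated:

```python
import numpy as np

def loo_coverage(X, y, fit_and_predict, lam2=0.0):
    """Leave-one-out calibration check for Gaussian process regression."""
    n = len(y)
    covered = 0
    for i in range(n):
        keep = np.arange(n) != i
        mu_i, sigma_i = fit_and_predict(X[keep], y[keep], X[i])
        half_width = 2.0 * np.sqrt(sigma_i ** 2 + lam2)   # 95% credible half-width
        covered += abs(y[i] - mu_i) <= half_width
    return covered / n
```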

For Gaussian process regression to best support Bayesian optimization, it is typically most important to have good uncertainty estimates, and relatively less important to have high predictive accuracy. This is because Bayesian optimization uses Gaussian process regression as a guide for deciding where to sample, and so if Gaussian process regression reports that there is a great deal of uncertainty at a particular location and thus low predictive accuracy, Bayesian optimization can choose to sample at this location to improve accuracy. Thus, Bayesian optimization has a recourse for dealing with low predictive accuracy, as long as the uncertainty is accurately reported. In contrast, if Gaussian process regression estimates poor performance at a location that actually has near-optimal performance, and also provides an inappropriately low error estimate, then Bayesian optimization may not sample there within a reasonable timeframe, and thus may never correct the error.

If either the uncertainty is incorrectly estimated, or the predictive accuracy is unsatisfactorily low, then the most common “fixes” employed are to adopt a different covariance kernel, or to transform the objective function f. If the objective function is known to be non-negative, then the transformations \(\log (f)\) and \(\sqrt{f}\) are convenient for optimization because they are both strictly increasing, and so do not change the set of maximizers (or minimizers). If f is not non-negative, but is bounded below by some other known quantity a, then one may first subtract a from f to make it non-negative.

3.8 Predicting at More Than One Point

Below, to support the development of the knowledge-gradient method in Sects. 3.4.2 and 3.6, it will be useful to predict the value of f at multiple points, \(x^*_1,\ldots ,x^*_k\), with noise. To do so, we could certainly apply (3.10) and (3.11) separately for each \(x^*_1,\ldots ,x^*_k\), and this would provide us with both an estimate (the posterior mean) and a measure of the size of the error in this estimate (the posterior variance) associated with each \(f(x^*_i)\). It would not, however, quantify the relationship between the errors at several different locations. For this, we must perform the estimation jointly.

As we did in Sect. 3.3.5, we begin with our prior on \([y_{1:n},f(x^*_{1:k})]\), which is,

$$\begin{aligned} \begin{bmatrix} y_{1:n}\\ f(x^*_{1:k}) \end{bmatrix} \sim \mathrm {Normal}\left( \begin{bmatrix} \mu _0(x_{1:n})\\ \mu _0(x^*_{1:k}) \end{bmatrix}, \begin{bmatrix} \varSigma _0(x_{1:n},x_{1:n})+\lambda ^2 I_n&\varSigma _0(x_{1:n},x^*_{1:k}) \\ \varSigma _0(x^*_{1:k},x_{1:n})&\varSigma _0(x^*_{1:k},x^*_{1:k}) \end{bmatrix} \right) . \end{aligned}$$

We then use Proposition 1 to compute the posterior on \(f(x^*_{1:k})\) given \(y_{1:n}\), which is multivariate normal with mean vector \(\mu _n(x^*_{1:k})\) and covariance matrix \(\varSigma _n(x^*_{1:k},x^*_{1:k})\) given by,

$$\begin{aligned} \mu _n(x^*_{1:k})&=\mu _0(x^*_{1:k})+\varSigma _0(x^*_{1:k},x_{1:n})\left[ \varSigma _0(x_{1:n},x_{1:n})+\lambda ^2 I_n\right] ^{-1}(y_{1:n}-\mu _0(x_{1:n})), \end{aligned}$$
(3.13)
$$\begin{aligned} \varSigma _n(x^*_{1:k},x^*_{1:k})&=\varSigma _0(x^*_{1:k},x^*_{1:k})-\varSigma _0(x^*_{1:k},x_{1:n})\left[ \varSigma _0(x_{1:n},x_{1:n})+\lambda ^2 I_n\right] ^{-1}\varSigma _0(x_{1:n},x^*_{1:k}). \end{aligned}$$
(3.14)

We see that setting \(k=1\) provides the expressions (3.10) and (3.11) from Sect. 3.3.5.

3.9 Avoiding Matrix Inversion

The expressions (3.10) and (3.11) for the posterior mean and variance in the noisy case, and also (3.7) and (3.8) in the noise-free case, include a matrix inversion term. Calculating this matrix inversion is slow and can be hard to accomplish accurately in practice, due to the finite precision of floating point implementations. Accuracy is especially an issue when \(\varSigma \) is nearly singular, i.e., has eigenvalues close to 0, which arises when data points are close together.

In practice, rather than calculating a matrix inverse directly, it is typically faster and more accurate to use a mathematically equivalent algorithm, which performs a Cholesky decomposition and then solves a linear system. This algorithm is described below, and is adapted from Algorithm 2.1 in Sect. 2.3 of [14]. This algorithm also computes the log marginal likelihood required for estimating hyperparameters in Sect. 3.3.6.

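A minimal sketch of this approach, in Python with NumPy and SciPy and assuming hypothetical mean_fn and cov_fn callables for the prior mean function and covariance kernel, is the following; it follows the structure of Algorithm 2.1 of [14], returning the posterior mean (3.10), the posterior variance (3.11), and the log marginal likelihood used in Sect. 3.3.6:

```python
import numpy as np
from scipy.linalg import solve_triangular

def gp_predict_cholesky(x_star, X, y, mean_fn, cov_fn, lam2):
    """GP posterior mean, variance, and log marginal likelihood via Cholesky."""
    n = X.shape[0]
    K = np.array([[cov_fn(X[i], X[j]) for j in range(n)] for i in range(n)])
    K += lam2 * np.eye(n)                                    # Sigma_0 + lambda^2 I_n
    m = np.array([mean_fn(X[i]) for i in range(n)])
    k_star = np.array([cov_fn(x_star, X[i]) for i in range(n)])

    L = np.linalg.cholesky(K)                                # K = L L^T
    alpha = solve_triangular(L.T, solve_triangular(L, y - m, lower=True), lower=False)

    mu_n = mean_fn(x_star) + k_star @ alpha                  # posterior mean (3.10)
    v = solve_triangular(L, k_star, lower=True)
    var_n = cov_fn(x_star, x_star) - v @ v                   # posterior variance (3.11)

    loglik = (-0.5 * (y - m) @ alpha                         # log marginal likelihood
              - np.sum(np.log(np.diag(L)))
              - 0.5 * n * np.log(2.0 * np.pi))
    return mu_n, var_n, loglik
```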

4 Choosing Where to Sample

Being able to infer the value of the objective function f(x) at unevaluated points based on past data \(x_{1:n}\),\(y_{1:n}\) is only one part of finding good designs. The other part is using this ability to make good decisions about where to direct future sampling.

Bayesian optimization methods address this by using a measure of the value of the information that would be gained by sampling at a point. Bayesian optimization methods then choose the point to sample next for which this value is largest. A number of different ways of measuring the value of information have been proposed. Here, we describe two in detail, expected improvement [2, 4] and the knowledge gradient [24, 25], and then survey a broader collection of design criteria.

4.1 Expected Improvement

Expected improvement, as it was first proposed, considered only the case where measurements are free from noise. In this setting, suppose we have taken n measurements at locations \(x_{1:n}\) and observed \(y_{1:n}\). Then

$$\begin{aligned} f^*_n = \max _{i=1,\ldots ,n} f(x_i) \end{aligned}$$

is the best value observed so far. Suppose we are considering evaluating f at a new point x. After this evaluation, the best value observed will be

$$\begin{aligned} f^*_{n+1} = \max ( f(x), f^*_n), \end{aligned}$$

and the difference between these values, which is the improvement due to sampling, is

$$\begin{aligned} f^*_{n+1} - f^*_n = \max (f(x) - f^*_n, 0) = (f(x)-f^*_n)^+, \end{aligned}$$

where \(a^+ = \max (a,0)\) indicates the positive part function.

Ideally, we would choose x to make this improvement as large as possible. Before actually evaluating f(x), however, we do not know what this improvement will be, so we cannot implement this strategy. However, we do have a probability distribution on f(x), from Gaussian process regression. The expected improvement , indicated \(\mathrm {EI}(x)\), is obtained by taking the expectation of this improvement with respect to the posterior distribution on f(x) given \(x_{1:n},y_{1:n}\).

$$\begin{aligned} \mathrm {EI}(x) = E_n[ (f(x)-f^*_n)^+], \end{aligned}$$
(3.15)

where \(E_n[\ \cdot \ ] = E[\ \cdot \ | x_{1:n}, y_{1:n}]\) indicates the expectation with respect to the posterior distribution.

The expectation in (3.15) can be computed more explicitly, in terms of the normal cumulative distribution function (cdf) \(\varPhi (\cdot )\), and the normal probability density function (pdf) \(\varphi (\cdot )\). Recalling from Sect. 3.3.3 that \(f(x) \sim \mathrm {Normal}(\mu _n(x),\sigma ^2_n(x))\), where \(\mu _n(x)\) and \(\sigma ^2_n(x)\) are given by (3.5) and (3.6), and integrating with respect to the normal distribution (a derivation may be found in the Derivations and Proofs section), we obtain,

$$\begin{aligned} \mathrm {EI}(x) = (\mu _n(x) - f^*_n)\varPhi \left( \frac{\mu _n(x) - f^*_n}{\sigma _n(x)}\right) + \sigma _n(x) \varphi \left( \frac{\mu _n(x) - f^*_n}{\sigma _n(x)}\right) . \end{aligned}$$
(3.16)
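For reference, (3.16) is straightforward to evaluate numerically; a minimal sketch using SciPy's normal cdf and pdf is below (the zero-variance case is handled separately, since at an already-evaluated point \(\sigma _n(x) = 0\) and the improvement is deterministic):

```python
from scipy.stats import norm

def expected_improvement(mu_n, sigma_n, f_star):
    """Expected improvement (3.16), given the posterior mean mu_n, the posterior
    standard deviation sigma_n, and the best value observed so far f_star."""
    if sigma_n <= 0.0:
        return max(mu_n - f_star, 0.0)
    z = (mu_n - f_star) / sigma_n
    return (mu_n - f_star) * norm.cdf(z) + sigma_n * norm.pdf(z)
```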

Figure 3.4 plots this expected improvement for a problem with a one-dimensional input space. We can see from this plot that the expected improvement is largest at locations where both the posterior mean \(\mu _n(x)\) is large, and also the posterior standard deviation \(\sigma _n(x)\) is large. This is reasonable because those points that are most likely to provide large gains are those points that have a high predicted value, but that also have significant uncertainty. Indeed, at points where we have already observed, and thus have no uncertainty, the expected improvement is 0. This is consistent with the idea that, in a problem without noise, there is no value to repeating an evaluation that has already been performed.

Fig. 3.4
figure 4

Upper panel shows the posterior distribution in a problem with no noise and a one-dimensional input space, where the circles are previously measured points, the solid line is the posterior mean \(\mu _n(x)\), and the dashed lines are at \(\mu _n(x) \pm 2\sigma _n(x)\). Lower panel shows the expected improvement \(\mathrm {EI}(x)\) computed from this posterior distribution. An “x” is marked at the point with the largest expected improvement, which is where we would evaluate next

This idea of favoring points that, on the one hand, have a large predicted value, but, on the other hand, have a significant amount of uncertainty, is called the exploration versus exploitation tradeoff, and appears in areas beyond Bayesian optimization, especially in reinforcement learning [26, 27] and multi-armed bandit problems [28, 29]. In these problems, we are taking actions repeatedly over time whose payoffs are uncertain, and wish to simultaneously get good immediate rewards, while learning the reward distributions for all actions to allow us to get better rewards in the future. We emphasize, however, that the correct balance between exploration and exploitation is different in Bayesian optimization as compared with multi-armed bandits, and should favor exploration more: in optimization, the advantage of measuring where the predicted value is high is that these areas tend to give more useful information about where the optimum lies; in contrast, in problems where we must “learn while doing”, like multi-armed bandits, evaluating an action with high predicted reward is good primarily because it tends to give a high immediate reward.

We can also see the exploration versus exploitation tradeoff implicit in the expected improvement function in the contour plot, Fig. 3.5. This plot shows the contours of \(\mathrm {EI}(x)\) as a function of the posterior mean, expressed as a difference from the previous best, \(\varDelta _n(x) := \mu _n(x) - f^*_n\), and the posterior standard deviation \(\sigma _n(x)\).

Fig. 3.5
figure 5

Contour plot of the expected improvement, as a function of the difference in means \(\varDelta _n(x) := \mu _n(x) - f^*_n\) and the posterior standard deviation \(\sigma _n(x)\). The expected improvement is larger when the difference in means is larger, and when the standard deviation is larger

Given the expression (3.16), Bayesian optimization algorithms based on expected improvement, such as the Efficient Global Optimization (EGO) algorithm proposed by [4], and the earlier algorithms of Mockus (see, e.g., the monograph [2]), then recommend sampling at the point with the largest expected improvement. That is,

$$\begin{aligned} x_{n+1} \in \mathop {\mathrm {argmax}}_x \mathrm {EI}(x). \end{aligned}$$
(3.17)

Finding the point with the largest expected improvement is itself a global optimization problem, like the original problem (3.1) that we wished to solve. Unlike (3.1), however, \(\mathrm {EI}(x)\) can be computed quickly, and its first and second derivatives can also be computed quickly. Thus, we can expect to be able to solve (3.17) relatively well using an off-the-shelf optimization method for continuous global optimization. A common approach is to use a local solver for continuous optimization, such as gradient ascent, in a multistart framework, where we start the local solver from many starting points chosen at random, and then select the best local solution discovered. In Sect. 3.5 we describe several codes that implement expected improvement methods, and each makes its own choice about how to solve (3.17).
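A multistart strategy of this kind might be sketched as follows (Python, using scipy.optimize; ei_fn is a placeholder for a routine computing \(\mathrm {EI}(x)\), and the design space A is assumed to be a box; the packages in Sect. 3.5 use their own, more refined, solvers):

```python
import numpy as np
from scipy.optimize import minimize

def maximize_ei(ei_fn, bounds, num_starts=20, seed=None):
    """Maximize EI(x) over a box by multistart local optimization, as in (3.17)."""
    rng = np.random.default_rng(seed)
    bounds = [(float(lo), float(hi)) for lo, hi in bounds]
    lows = np.array([lo for lo, _ in bounds])
    highs = np.array([hi for _, hi in bounds])
    best_x, best_val = None, -np.inf
    for _ in range(num_starts):
        x0 = rng.uniform(lows, highs)                       # random starting point in A
        res = minimize(lambda x: -ei_fn(x), x0, method="L-BFGS-B", bounds=bounds)
        if -res.fun > best_val:
            best_x, best_val = res.x, -res.fun
    return best_x, best_val
```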

The algorithm given by (3.17) is optimal under three assumptions: (1) that we will take only a single sample; (2) there is no noise in our samples; and (3) that the x we will report as our final solution (i.e., the one that we will implement) must be among those previously sampled.

In practice, assumption (1) is violated, as Bayesian optimization methods like (3.17) are applied iteratively, and is made simply because it simplifies the analysis. Being able to handle violations of assumption (1) in a more principled way is of great interest to researchers working on Bayesian optimization methodology, and some partial progress in that direction is discussed in Sect. 3.4.3. Assumption (2) is also often violated in a broad class of applications, especially those involving physical experiments or stochastic simulations. In the next section, we present an algorithm, the knowledge-gradient algorithm [24, 25], which relaxes assumption (2), and also allows relaxing assumption (3) if this is desired.

4.2 Knowledge Gradient

When we have noise in our samples, the derivation of expected improvement meets with difficulty. In particular, if we have noise, then \(f^*_n = \max _{i=1,\ldots ,n} f(x_i)\) is not precisely known, preventing us from using the expression (3.16).

One may simply take a quantity like \(\max _{i=1,\ldots ,n} y_i\) that is similar in spirit to \(f^*_n = \max _{i=1,\ldots ,n} f(x_i)\), and replace \(f^*_n\) in (3.16) with this quantity, but the resulting algorithm is no longer justified by an optimality analysis. Indeed, for problems with a great deal of noise, \(\max _{i=1,\ldots ,n} y_i\) tends to be significantly larger than the true underlying value of the best point previously sampled, and so the resulting algorithm may be led to make a poor tradeoff between exploration and exploitation, and exhibit poor performance in such situations.

Instead, the knowledge-gradient algorithm [24, 25] takes a more principled approach, and starts where the derivation of expected improvement began, but fully accounts for the introduction of noise (assumption 2 in Sect. 3.4.1), and the possibility that we wish to search over a class of solutions broader than just those that have been previously evaluated when recommending the final solution (assumption 3 in Sect. 3.4.1).

We first introduce a set \(A_n\), which is the set of points from which we would choose the final solution, if we were asked to recommend a final solution at time n, based on \(x_{1:n}\), \(y_{1:n}\). For tractability, we suppose \(A_n\) is finite. For example, if A is finite, as it often is in discrete optimization via simulation problems, we could take \(A_n=A\), allowing the whole space of feasible solutions. This choice was considered in [24]. Alternatively, one could take \(A_n=\{x_1,\ldots ,x_n\}\), stating that one is willing to consider only those points that have been previously evaluated. This choice is consistent with the expected improvement algorithm. Indeed, we will see that when one makes this choice, and measurements are free from noise, then the knowledge-gradient algorithm is identical to the expected improvement algorithm. Thus, the knowledge-gradient algorithm generalizes the expected improvement algorithm.

If we were to stop sampling at time n, then the expected value of a point \(x\in A_n\) based on the information available would be \(E_n[f(x)] = \mu _n(x)\). In the special case when evaluations are free from noise, this is equal to f(x), but when there is noise, these two quantities may differ. If we needed to report a final solution, we would then choose the point in \(A_n\) for which this quantity is the largest, i.e., we would choose \(\mathop {\mathrm {argmax}}_{x\in A_n} \mu _n(x)\). Moreover, the expected value of this solution would be

$$\begin{aligned} \mu ^*_n = \max _{x\in A_n} \mu _n(x). \end{aligned}$$

If evaluations are free from noise and \(A_n = \{x_{1:n}\}\), then \(\mu ^*_n\) is equal to \(f^*_n\), but in general these quantities may differ.

If we take one additional sample, then the expected value of the solution we would report based on this additional information is

$$\begin{aligned} \mu ^*_{n+1} = \max _{x\in A_{n+1}} \mu _{n+1}(x), \end{aligned}$$

where as before, \(A_{n+1}\) is some finite set of points we would be willing to consider when choosing a final solution. Observe in this expression that \(\mu _{n+1}(x)\) is not necessarily the same as \(\mu _{n}(x)\), even for points \(x \in \{x_{1:n}\}\) that we had previously evaluated, but that \(\mu _{n+1}(x)\) can be computed from the history of observations \(x_{1:n+1}\), \(y_{1:n+1}\).

The improvement in our expected solution value is then the difference between these two quantities, \(\mu ^*_{n+1} - \mu ^*_n\). This improvement is random at time n, even fixing \(x_{n+1}\), through its dependence on \(y_{n+1}\), but we can take its expectation. The resulting quantity is called the knowledge-gradient (KG) factor, and is written,

$$\begin{aligned} \mathrm {KG}_n(x) = E_n\left[ \mu ^*_{n+1} - \mu ^*_n \mid x_{n+1}=x \right] . \end{aligned}$$
(3.18)

Calculating this expectation is more involved than calculating the expected improvement, but nevertheless can also be done analytically in terms of the normal pdf and normal cdf. This is described in more detail in the Derivations and Proofs section.
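The analytic expression is what one would use in practice; purely as an illustration of what (3.18) computes, the expectation can also be approximated by simple Monte Carlo: simulate the noisy observation \(y_{n+1}\) at the candidate point from the posterior predictive distribution, update the posterior, and average the resulting improvements. Below is a minimal sketch along these lines using scikit-learn's Gaussian process regressor with fixed kernel hyperparameters and homoscedastic noise variance noise_var; this implementation choice is an assumption of the example, not the implementation used in the cited references.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def kg_factor_mc(x, X_n, y_n, A_next, noise_var=0.1, n_samples=200, seed=0):
    """Monte Carlo estimate of KG_n(x) at a candidate point x.

    X_n, y_n : previously sampled points (n x d array) and their noisy values.
    A_next   : finite candidate set, used here for both A_n and A_{n+1}.
    """
    rng = np.random.default_rng(seed)
    kernel = RBF(length_scale=1.0)           # fixed hyperparameters for clarity
    gp_n = GaussianProcessRegressor(kernel=kernel, alpha=noise_var,
                                    optimizer=None).fit(X_n, y_n)
    mu_star_n = np.max(gp_n.predict(A_next))           # mu*_n over A_n
    # Posterior predictive distribution of the noisy observation y_{n+1} at x.
    mu_x, sd_x = gp_n.predict(np.atleast_2d(x), return_std=True)
    pred_sd = np.sqrt(sd_x[0] ** 2 + noise_var)
    improvements = []
    for _ in range(n_samples):
        y_new = mu_x[0] + pred_sd * rng.standard_normal()
        gp_next = GaussianProcessRegressor(kernel=kernel, alpha=noise_var,
                                           optimizer=None).fit(
            np.vstack([X_n, np.atleast_2d(x)]), np.append(y_n, y_new))
        improvements.append(np.max(gp_next.predict(A_next)) - mu_star_n)
    return float(np.mean(improvements))      # estimate of E_n[mu*_{n+1} - mu*_n]
```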

The knowledge-gradient algorithm is then the one that chooses the point to sample next that maximizes the KG factor,

$$\begin{aligned} \mathop {\mathrm {argmax}}_x \mathrm {KG}_n(x). \end{aligned}$$
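In terms of the Monte Carlo sketch above (kg_factor_mc and the candidate grid are the hypothetical helpers introduced there), the selection rule is a plain argmax over a finite set of candidates, given the previously collected data X_n, y_n:

```python
import numpy as np

# One-dimensional example: a grid of 300 candidate points.
A = np.linspace(0.0, 1.0, 300).reshape(-1, 1)
kg_values = np.array([kg_factor_mc(x, X_n, y_n, A_next=A) for x in A])
x_next = A[np.argmax(kg_values)]    # the point the KG algorithm samples next
```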

The KG factor for a one-dimensional optimization problem with noise is pictured in Fig. 3.6. We see a similar tradeoff between exploration and exploitation, where the KG factor favors measuring points with a large \(\mu _n(x)\) and a large \(\sigma _n(x)\). We also see local minima of the KG factor at points where we previously evaluated, just as with the expected improvement, but because there is noise in our samples, the value at these points is not 0—indeed, when there is noise, it may be useful to sample repeatedly at a point.

Fig. 3.6

Upper panel shows the posterior distribution in a problem with independent normal homoscedastic noise and a one-dimensional input space, where the circles are previously measured points, the solid line is the posterior mean \(\mu _n(x)\), and the dashed lines are at \(\mu _n(x) \pm 2\sigma _n(x)\). Lower panel shows the natural logarithm of the knowledge-gradient factor \(\mathrm {KG}(x)\) computed from this posterior distribution, where \(A_n=A_{n+1}\) are the discrete grid \(\{1,\ldots ,300\}\). An “x” is marked at the point with the largest KG factor, which is where the KG algorithm would evaluate next

Choice of \(A_n\) and \(A_{n+1}\)

Recall that the KG factor depends on the choice of the sets \(A_n\) and \(A_{n+1}\), through the dependence of \(\mu ^*_n\) and \(\mu ^*_{n+1}\) on these sets. Typically, if we choose these sets to contain more elements, then we allow \(\mu ^*_n\) and \(\mu ^*_{n+1}\) to range over a larger portion of the space, and we allow the KG factor calculation to more accurately approximate the value that would result if we allowed ourselves to implement the best option. However, as we increase the size of these sets, computing the KG factor becomes slower, making the KG method more computationally intensive to implement.

For applications with a finite A, [24] proposed setting \(A_{n+1}=A_n=A\), which was seen to require fewer function evaluations to find points with large f than expected improvement on noise-free problems, and than another Bayesian optimization method, sequential kriging optimization (SKO) [30], on noisy problems. However, the computation and memory required grow rapidly with the size of A, and this choice is typically not feasible when A contains more than 10,000 points.

For large-scale applications, [25] proposed setting \(A_{n+1} = A_n = \{x_{1:n+1}\}\) in (3.18), and called the resulting quantity the approximate knowledge gradient (AKG), observing that this choice maintains computational tractability as A grows, but also offers good performance. This algorithm is implemented in the DiceKriging package [31].

Finally, in noise-free problems (but not in problems with noise), setting \(A_{n+1} = \{x_{1:n+1}\}\) and \(A_n = \{x_{1:n}\}\) recovers expected improvement.
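In terms of the Monte Carlo sketch above, these choices simply correspond to different candidate sets passed to the hypothetical kg_factor_mc helper:

```python
import numpy as np

# Choice of [24]: A_n = A_{n+1} = A, feasible when the grid A is small.
kg_full = kg_factor_mc(x, X_n, y_n, A_next=A)

# AKG choice of [25]: A_n = A_{n+1} = {x_1, ..., x_n, x_{n+1}},
# which keeps the cost manageable as A grows.
kg_akg = kg_factor_mc(x, X_n, y_n, A_next=np.vstack([X_n, np.atleast_2d(x)]))
```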

4.3 Going Beyond One-Step Analyses, and Other Methods

Both expected improvement and the knowledge-gradient method are designed to be optimal, in the special case where we will take just one more function evaluation and then choose a final solution. They are not, however, known to be optimal for the more general case in which we will take multiple measurements, which is the way they are used in practice.

The optimal algorithm for this more general setting is understood to be the solution to a partially observable Markov decision process , but actually computing the optimal solution using this understanding is intractable using current methods [32]. Some work has been done toward the goal of developing such an optimal algorithm [33], but computing the optimal algorithm remains out of reach. Optimal strategies have been computed for other closely related problems in optimization of expensive noisy functions, including stochastic root-finding [34], multiple comparisons with a standard [35], and small instances of discrete noisy optimization with normally distributed noise (also called “ranking and selection” ) [36].

Expected improvement and the knowledge gradient are both special cases of the more general concept of value of information, or expected value of sample information (EVSI) [37]. Both calculate the expected reward of a final implementation decision as a function of the posterior distribution resulting from some information, subtract from this the expected reward that would result from not having the information, and then take the expectation of this difference with respect to the information itself.

Many other Bayesian optimization methods have been proposed. A few of these methods optimize the value of information, but calculate it under assumptions different from those used to derive expected improvement or the knowledge gradient. A larger number of these methods optimize quantities that do not correspond to a value of information, but are derived using analyses that are similar in spirit. These include methods that optimize the probability of improvement [1, 38, 39], the entropy of the posterior distribution on the location of the maximum [40], and other composite measures involving the mean and the standard deviation of the posterior [30].

Other Bayesian optimization methods are designed for problem settings that do not match the assumptions made in this tutorial. These include [41–43], which consider multiple objectives; [6, 44–46], which consider multiple simultaneous function evaluations; [47–49], which consider objective functions that can be evaluated with multiple fidelities and costs; [50], which considers Bernoulli outcomes, rather than normally distributed ones; [51], which considers expensive-to-evaluate inequality constraints; and [52], which considers optimization over the space of small molecules.

5 Software

There are a number of excellent software packages, both freely available and commercial, that implement the methods described in this chapter, and other similar methods.

A list of software packages focused on Gaussian process regression (but not Bayesian optimization) may be found at http://www.gaussianprocess.org/.

6 Conclusion

We have presented Bayesian optimization, including Gaussian process regression, the expected improvement method, and the knowledge-gradient method. In making this presentation, we wish to emphasize that this approach to materials design acknowledges the inherent uncertainty in statistical prediction and seeks to guide experimentation in a way that is robust to this uncertainty. It is inherently iterative, and does not seek to circumvent the fundamental trial-and-error process.

This is in contrast with another approach to informatics in materials design, which holds the hope that predictive methods can short-circuit the iterative loop entirely. In this alternative view of the world, one hopes to create extremely accurate prediction techniques, either through physically-motivated ab initio calculations, or using data-driven machine learning approaches, that are so accurate that one can rely on the predictions alone rather than on physical experiments. If this can be achieved, then we can search over materials designs in silico, find those designs that are predicted to perform best, and test those designs alone in physical experiments.

For this approach to be successful, one must have extremely accurate predictions, which limits its applicability to settings where this is possible. We argue that, in contrast, predictive techniques can be extremely powerful even if they are not perfectly accurate, as long as they are used in a way that acknowledges inaccuracy, builds in robustness, and reduces this inaccuracy through an iterative dialog with physical reality mediated by physical experiments. Moreover, we argue that mathematical techniques like Bayesian optimization, Bayesian experimental design, and optimal learning provide us with the mathematical framework for accomplishing this goal in a principled manner, and for using our power to predict as effectively as possible.