
1 Introduction

Gaussian process (GP) regression and interpolation (see, e.g., Rasmussen and Williams 2006), also known as kriging (see, e.g., Stein 1999), has gained significant popularity in statistics and machine learning as a non-parametric Bayesian approach for the prediction of unknown functions. The need for function prediction arises not only in supervised learning tasks, but also when building fast surrogates of time-consuming computations, e.g., in the assessment of the performance of a learning algorithm as a function of tuning parameters or, more generally, in the design and analysis of computer experiments (Santner et al. 2003). Interest in GPs has also risen considerably due to the development of Bayesian optimization (Mockus 1975; Jones et al. 1998; Emmerich et al. 2006; Srinivas et al. 2010...).

This context has fostered the development of a fairly large number of open-source packages to facilitate the use of GPs. Some of the popular choices are the Python modules scikit-learn (Pedregosa et al. 2011), GPy (Sheffield machine learning group 2012–2020), GPflow (Matthews et al. 2017), GPyTorch (Gardner et al. 2018), OpenTURNS (Baudin et al. 2017); the R package DiceKriging (Roustant et al. 2012); and the Matlab/GNU Octave toolboxes GPML (Rasmussen and Nickisch 2010), STK (Bect et al. 2011–2021) and GPstuff (Vanhatalo et al. 2012).

In practice, all implementations require the user to specify the mean and covariance functions of a Gaussian process prior in parameterized form. Among the various methods available to estimate the model parameters, we can safely say that maximum likelihood estimation (MLE) is the most popular approach. However, a simple numerical experiment consisting of interpolating a function (see Table 1), as is usually done in Bayesian optimization, shows that different MLE implementations from different Python packages produce widely dispersed numerical results when the default settings of each implementation are used. These significant differences were also noticed by Erickson et al. (2018), but the causes and possible mitigations were not investigated. Note that each package uses its own default algorithm for the optimization of the likelihood: GPyTorch uses ADAM (Kingma and Ba 2015), OpenTURNS uses a truncated Newton method (Nash 1984), and the others generally use L-BFGS-B (Byrd et al. 1995). It turns out that none of the default results in Table 1 is really satisfactory compared to the result obtained using the recommendations in this study (see Footnote 1).

Table 1. Inconsistencies in the results across different Python packages. The results were obtained by fitting a GP model, with constant mean and a Matérn kernel (\(\nu =5/2\)), to the Branin function, using the default settings for each package. We used 50 training points and 500 test points sampled from a uniform distribution on \([-5, 10] \times [0, 15]\). The table reports the estimated values for the variance and length scale parameters of the kernel, the empirical root mean squared prediction error (ERMSPE) and the minimized negative log likelihood (NLL). The last row shows the improvement using the recommendations in this study.
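For concreteness, the following sketch reproduces the spirit of the Table 1 experiment for a single package (GPy). It is only an illustration of the assumed setup: the sampling of the training and test points, the handling of the constant mean through GPy.mappings.Constant, and the noise-free setting are our assumptions, and the exact configuration used for Table 1 may differ.

```python
# Sketch of the kind of experiment reported in Table 1 (assumed setup, GPy only):
# fit a GP with constant mean and a Matern 5/2 kernel to the Branin function,
# then report the NLL on the training data and the ERMSPE on the test data.
import numpy as np
import GPy

def branin(x):
    """Branin function on [-5, 10] x [0, 15]."""
    x1, x2 = x[:, 0], x[:, 1]
    b, c, t = 5.1 / (4 * np.pi**2), 5.0 / np.pi, 1.0 / (8 * np.pi)
    return (x2 - b * x1**2 + c * x1 - 6.0)**2 + 10.0 * (1.0 - t) * np.cos(x1) + 10.0

rng = np.random.default_rng(0)
lower, upper = np.array([-5.0, 0.0]), np.array([10.0, 15.0])
X_train = lower + (upper - lower) * rng.uniform(size=(50, 2))
X_test = lower + (upper - lower) * rng.uniform(size=(500, 2))
z_train = branin(X_train)[:, None]

kernel = GPy.kern.Matern52(input_dim=2, ARD=True)
mean = GPy.mappings.Constant(input_dim=2, output_dim=1)   # constant mean, estimated by MLE
model = GPy.models.GPRegression(X_train, z_train, kernel, mean_function=mean)
model.Gaussian_noise.variance.fix(0.0)                    # interpolation: no observation noise
model.optimize()                                          # default settings (L-BFGS-B)

z_pred, _ = model.predict(X_test)
ermspe = np.sqrt(np.mean((z_pred.ravel() - branin(X_test))**2))
print("NLL:", -model.log_likelihood(), "ERMSPE:", ermspe)
print(model)   # shows the estimated variance and length scale parameters
```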

Focusing on the case of GP interpolation (with Bayesian optimization as the main motivation), the first contribution of this article is to understand the origin of the inconsistencies across available implementations. The second contribution is to investigate simple but effective strategies for improving these implementations, using the well-established GPy package as a case study. We shall propose recommendations concerning several optimization settings: initialization and restart strategies, parameterization of the covariance, etc. Anticipating our numerical results, the reader is invited to refer to Fig. 1 and Table 2, which show that significant improvements in terms of estimated parameter values and prediction errors can be obtained over the default settings using better optimization schemes.

Fig. 1.

Improved (cf. Sect. 6) vs default setups in GPy on the Borehole function with \(n = 20d = 160\) random training points. We remove one point at a time to obtain (a) the distribution of the differences of negative log-likelihood (NLL) values between the two setups; (b) the empirical CDFs of the prediction error at the removed points; (c) pairs of box-plots for the estimated range parameters (for each dimension, indexed from 1 to 8 on the x-axis, the box-plot for the improved setup is on the left and the box-plot for the default setup is on the right; horizontal red lines correspond to the estimated values using the whole data set, without leave-one-out). Notice that the parameter distributions of the default setup are more spread out.

Even though this work targets a seemingly prosaic issue, and advocates admittedly simple solutions, we feel that the contribution is nonetheless of significant value considering the widespread use of GP modeling. Indeed, a host of studies, particularly in the literature on Bayesian optimization, rely on off-the-shelf GP implementations: for their conclusions to be reliable and reproducible, robust implementations are critical.

The article is organized as follows. Section 2 provides a brief review of GP modeling and MLE. Section 3 describes some numerical aspects of the evaluation and optimization of the likelihood function, with a focus on GPy’s implementation. Section 4 provides an analysis of factors influencing the accuracy of numerical MLE procedures. Finally, Sect. 5 assesses the effectiveness of our solutions through numerical experiments and Sect. 6 concludes the article.

Table 2. Improved (cf. Sect. 6) vs default setups in GPy for the interpolation of the Borehole function (input space dimension is \(d=8\)) with \(n\in \{3d,\, 5d\}\) random data points (see Sect. 5.3 for details). The experiment is repeated 50 times. The columns report the leave-one-out mean squared error (LOO-MSE) values (empirical mean over the repetitions, together with the standard deviation and the average proportion of the LOO-MSE to the total standard deviation of the data in parentheses).

2 Background

2.1 Gaussian Processes

Let \(Z \sim {\mathrm {GP}}(m,\, k)\) be a Gaussian process indexed by \(\mathbb {R}^d\), \(d\ge 1\), specified by a mean function \(m:\mathbb {R}^d\rightarrow \mathbb {R}\) and a covariance function \(k:\mathbb {R}^d\times \mathbb {R}^d\rightarrow \mathbb {R}\).

The objective is to predict Z(x) at a given location \(x\in \mathbb {R}^d\), given a data set \(D = \{(x_i,\,z_i)\in \mathbb {R}^d\times \mathbb {R},\, 1 \le i \le n\}\), where the observations \(z_i\) are assumed to be the outcome of an additive-noise model: \(Z_i = Z(x_i) + \varepsilon _i\), \(1 \le i \le n\). In most applications, it is assumed that the \(\varepsilon _i\) are i.i.d. zero-mean Gaussian random variables with variance \(\sigma _{\varepsilon }^2\ge 0\), independent of Z. (In rarer cases, heteroscedasticity is assumed.)

Knowing m and k, recall (see, e.g., Rasmussen and Williams 2006) that the posterior distribution of Z is such that \(Z \mid Z_1,\,\ldots ,\, Z_n,\, m,\, k \sim {\mathrm {GP}}(\hat{Z}_n,\, k_n)\), where \(\hat{Z}_n\) and \(k_n\) stand respectively for the posterior mean and covariance functions:

$$\begin{aligned} \hat{Z}_n(x) &= m(x) + \sum _{i=1}^n w_i(x;\underline{\text {x}}_n)\, (z_i - m(x_i))\,,\\ k_n(x, y) &= k(x,y) - \mathrm {w}(y;\underline{\text {x}}_n)^{\mathsf {T}}\mathrm {K}(\underline{\text {x}}_n, x)\,, \end{aligned}$$

where \(\underline{\text {x}}_n\) denotes the set of observation points \(\left( x_1,\, \ldots ,\, x_n\right) \), and the weights \(w_i(x;\underline{\text {x}}_n)\) are the solutions of the linear system:

$$\begin{aligned} (\mathrm {K}(\underline{\text {x}}_n,\underline{\text {x}}_n) + \sigma _{\varepsilon }^2 \mathrm {I}_{n})\, \mathrm {w}(x;\underline{\text {x}}_n) = \mathrm {K}(\underline{\text {x}}_n, x)\,, \end{aligned}$$
(1)

with \(\mathrm {K}(\underline{\text {x}}_n,\,\underline{\text {x}}_n)\) the \(n\times n\) covariance matrix with entries \(k(x_i,\,x_j)\), \(\mathrm {I}_n\) the identity matrix of size n, and \(\mathrm {w}(x;\underline{\text {x}}_n)\) (resp. \(\mathrm {K}(\underline{\text {x}}_n, x)\)) the column vector with entries \(w_i(x;\underline{\text {x}}_n)\) (resp. \(k(x_i, x)\)), \(1 \le i \le n\).
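In practice, the weights in (1) and the prediction equations above are computed from a Cholesky factorization of \(\mathrm {K}(\underline{\text {x}}_n,\underline{\text {x}}_n) + \sigma _{\varepsilon }^2 \mathrm {I}_{n}\). The following is a minimal NumPy sketch, assuming a generic kernel function k(A, B) that returns the matrix of pairwise covariances (for instance the Matérn 5/2 kernel sketched after Eq. (2)):

```python
# Minimal sketch of the GP prediction equations, assuming a kernel function
# k(A, B) that returns the matrix of pairwise covariances k(a_i, b_j).
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def gp_posterior(x, X, z, k, m=lambda u: 0.0, noise_var=0.0):
    """Posterior mean and variance of Z at a single point x, given data (X, z)."""
    K = k(X, X) + noise_var * np.eye(len(X))          # K(x_n, x_n) + sigma_eps^2 I_n
    kx = k(X, x[None, :]).ravel()                     # vector with entries k(x_i, x)
    c, low = cho_factor(K)
    w = cho_solve((c, low), kx)                       # weights w(x; x_n), Eq. (1)
    prior_mean = np.array([m(xi) for xi in X])
    mean = m(x) + w @ (z - prior_mean)                # posterior mean
    var = k(x[None, :], x[None, :]).item() - w @ kx   # posterior variance k_n(x, x)
    return mean, var
```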

It is common practice to assume a zero mean function \(m = 0\)—a reasonable choice if the user has taken care to center data—but most GP implementations also provide an option for setting a constant mean function \(m(\,\cdot \,) = \mu \in \mathbb {R}\). In this article, we will include such a constant in our models, and treat it as an additional parameter to be estimated by MLE along with the others. (Alternatively, \(\mu \) could be endowed with a Gaussian or improper-uniform prior, and then integrated out; see, e.g., O’Hagan (1978).)

The covariance function, also known as the covariance kernel, models similarity between data points and reflects the user’s prior belief about the function to be learned. Most GP implementations provide a selection of stationary covariance functions taken from the literature (e.g., Wendland 2004; Rasmussen and Williams 2006). The squared exponential, the rational quadratic and the Matérn covariance functions are popular choices (see Table 3). These covariance functions involve a number of parameters: a variance parameter \(\sigma ^2>0\) corresponding to the variance of Z, and a set of range (or length scale) parameters \(\rho _1,\,\ldots ,\, \rho _d\), such that

$$\begin{aligned} k(x,y) = \sigma ^{2}r(h) , \end{aligned}$$
(2)

with \(h^2 = \sum _{i=1}^d (x_{[i]} - y_{[i]})^2/\rho _i^2\), where \(x_{[i]}\) and \(y_{[i]}\) denote the ith coordinates of x and y. The function \(r:\mathbb {R}\rightarrow \mathbb {R}\) in (2) is the stationary correlation function of Z. From now on, the vector of model parameters will be denoted by \(\theta = (\sigma ^2,\,\rho _1,\,\ldots ,\, \rho _d,\ldots , \sigma _{\varepsilon }^2)^{\mathsf {T}}\in \varTheta \subset \mathbb {R}^p\), and the corresponding covariance matrix \(\mathrm {K}(\underline{\text {x}}_n,\underline{\text {x}}_n) + \sigma _{\varepsilon }^2 \mathrm {I}_{n}\) by \(\mathrm {K}_{\theta }\).
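As an illustration of the parameterization (2), here is a sketch of an anisotropic Matérn 5/2 kernel, using the standard form of its correlation function (see Table 3 for the expressions available in GPy):

```python
# Sketch of an anisotropic kernel of the form (2), here with the Matern 5/2
# correlation function r(h) = (1 + sqrt(5) h + 5 h^2 / 3) exp(-sqrt(5) h).
import numpy as np

def matern52_kernel(X, Y, sigma2, rho):
    """k(x, y) = sigma2 * r(h), with h^2 = sum_k (x_[k] - y_[k])^2 / rho_k^2."""
    D = (X[:, None, :] - Y[None, :, :]) / rho     # pairwise scaled differences
    h = np.sqrt(np.sum(D**2, axis=-1))            # matrix of scaled distances h
    r = (1.0 + np.sqrt(5.0) * h + 5.0 * h**2 / 3.0) * np.exp(-np.sqrt(5.0) * h)
    return sigma2 * r
```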

Table 3. Some kernel functions available in GPy. The Matérn kernel is recommended by Stein (1999). \(\varGamma \) denotes the gamma function, \(\mathscr {K}_{\nu }\) is the modified Bessel function of the second kind.

2.2 Maximum Likelihood Estimation

In this article, we focus on GP implementations where the parameters \((\theta , \mu )\in \varTheta \times \mathbb {R}\) of the process Z are estimated by maximizing the likelihood \(\mathcal {L}(\underline{\text {Z}} _n|\theta , \mu )\) of \(\underline{\text {Z}} _n = (Z_1,\ldots , Z_n)^{\mathsf {T}}\), or equivalently, by minimizing the negative log-likelihood (NLL)

$$\begin{aligned} - \log (\mathcal {L}(\underline{\text {Z}} _n|\theta , \mu )) \,=\; \frac{1}{2}(\underline{\text {Z}} _n-\mu \mathbbm {1}_n)^{\top } \mathrm {K}_{\theta }^{-1}(\underline{\text {Z}} _n-\mu \mathbbm {1}_n) +\frac{1}{2}\log |\mathrm {K}_{\theta }| + \text {constant}. \end{aligned}$$
(3)

This optimization is typically performed by gradient-based methods, although local maxima can be a significant concern since the likelihood is often non-convex. Computing the likelihood and its gradient with respect to \((\theta , \mu )\) has an \(O(n^3 + dn^2)\) computational cost (Rasmussen and Williams 2006; Petit et al. 2020).
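The NLL (3) is usually evaluated through a Cholesky factorization of \(\mathrm {K}_{\theta }\), which accounts for the \(O(n^3)\) part of the cost. A minimal sketch, assuming a function cov_matrix that returns \(\mathrm {K}_{\theta }\):

```python
# Sketch of the evaluation of the NLL (3), up to the additive constant (n/2) log(2 pi).
# cov_matrix(X, theta) is assumed to return K_theta = K(x_n, x_n) + sigma_eps^2 I_n.
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def negative_log_likelihood(theta, mu, X, z, cov_matrix):
    K = cov_matrix(X, theta)
    c, low = cho_factor(K)                              # O(n^3) Cholesky factorization
    resid = z - mu
    quad = 0.5 * resid @ cho_solve((c, low), resid)     # (1/2) (z - mu 1)^T K^{-1} (z - mu 1)
    half_logdet = np.sum(np.log(np.diag(c)))            # (1/2) log |K_theta|
    return quad + half_logdet
```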

3 Numerical Noise

The evaluation of the NLL, as well as its gradient, is subject to numerical noise, which can prevent proper convergence of the optimization algorithms. Figure 2 shows a typical situation where the gradient-based optimization algorithm stops before converging to an actual minimum. In this section, we provide an analysis of the numerical noise on the NLL using the concept of local condition numbers. We also show that the popular solution of adding jitter cannot be considered a fully satisfactory answer to the problem of numerical noise.

Numerical noise stems from both terms of the NLL, namely \(\frac{1}{2}\underline{\text {Z}} _n^{\top } \mathrm {K}_{\theta }^{-1}\underline{\text {Z}} _n\) and \(\frac{1}{2}\log |\mathrm {K}_{\theta }|\). (For simplicity, we assume \(\mu =0\) in this section.)

First, recall that the condition number \(\kappa (\mathrm {K}_\theta )\) of \(\mathrm {K}_\theta \), defined as the ratio \(|\lambda _{\max }/\lambda _{\min }|\) of the largest to the smallest eigenvalue (Press et al. 1992), is the key quantity for analyzing the numerical noise on \(\mathrm {K}_{\theta }^{-1}\underline{\text {Z}} _n\). In double-precision floating-point arithmetic, \(\underline{\text {Z}} _n\) is corrupted by an error \({\epsilon }\) whose magnitude is such that \(\Vert \epsilon \Vert /\Vert \underline{\text {Z}} _n\Vert \simeq 10^{-16}\). A worst-case alignment of \(\underline{\text {Z}} _n\) and \(\epsilon \) with the eigenvectors of \(\mathrm {K}_{\theta }\) gives

$$\begin{aligned} \frac{\Vert \mathrm {K}_{\theta }^{-1} \epsilon \Vert }{\Vert \mathrm {K}_{\theta }^{-1} \underline{\text {Z}} _n\Vert } \simeq \kappa (\mathrm {K}_{\theta }) \times 10^{-16} , \end{aligned}$$
(4)

which shows how the numerical noise is amplified when \(\mathrm {K}_{\theta }\) becomes ill-conditioned.

The term \(\log |\mathrm {K}_{\theta }|\) is nonlinear in \(\mathrm {K}_{\theta }\), but observe, using the identity \(\mathrm {d}\log |\mathrm {K}_{\theta }|/ \mathrm {d}\mathrm {K}_{\theta } = \mathrm {K}_{\theta }^{-1}\), that the differential of \(\log |\,\cdot \,|\) at \(\mathrm {K}_{\theta }\) is given by \(H \mapsto \mathrm {Trace}(\mathrm {K}_{\theta }^{-1} H)\). Thus, the induced operator norm with respect to the Frobenius norm \(\Vert \,\cdot \,\Vert _F\) is \(\Vert \mathrm {K}_{\theta }^{-1}\Vert _F\). We can then apply results from Trefethen and Bau (1997) to get a local condition number of the mapping \(\mathrm {A} \mapsto \log |\mathrm {A}|\) at \(\mathrm {K}_{\theta }\):

$$\begin{aligned} \kappa ( \log |\,\cdot \,|,\, \mathrm {K}_{\theta }) \triangleq \lim _{\epsilon \rightarrow 0} \sup _{\Vert \delta _{\mathrm {A}}\Vert _F \le \epsilon } \frac{\bigl | \log |\mathrm {K}_{\theta } + \delta _{\mathrm {A}}| - \log |\mathrm {K}_{\theta }| \bigr | }{\bigl |\log |\mathrm {K}_{\theta }|\bigr |} \frac{\Vert \mathrm {K}_{\theta }\Vert _F}{\Vert \delta _{\mathrm {A}}\Vert _F} = \frac{\sqrt{\sum _{i=1}^n \frac{1}{\lambda _i^2}} \sqrt{\sum _{i=1}^n \lambda _i^2}}{|\sum _{i=1}^n \log (\lambda _i)|} \end{aligned}$$
(5)

where \(\lambda _1, \cdots , \lambda _n\) are the (positive) eigenvalues of \(\mathrm {K}_{\theta }\). Then, we have

$$\begin{aligned} \frac{\kappa (\mathrm {K}_{\theta })}{| \sum _{i=1}^n \log (\lambda _i)| } \le \kappa (\log |\,\cdot \,|,\, \mathrm {K}_{\theta }) \le \frac{n \kappa (\mathrm {K}_{\theta })}{|\sum _{i=1}^n \log (\lambda _i)| }, \end{aligned}$$
(6)

which shows that numerical noise on \(\log |\mathrm {K}_{\theta }|\) is linked to the condition number of \(\mathrm {K}_{\theta }\).

The local condition number of the quadratic form \(\frac{1}{2}\underline{\text {Z}} _n^{\mathsf {T}}\mathrm {K}_{\theta }^{-1}\underline{\text {Z}} _n\) as a function of \(\underline{\text {Z}} _n\) can also be computed analytically. Some straightforward calculations show that it is bounded by \(\kappa (\mathrm {K}_{\theta })\).
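Both condition numbers can be computed directly from the spectrum of \(\mathrm {K}_{\theta }\); the following sketch implements the ratio \(|\lambda _{\max }/\lambda _{\min }|\) and formula (5):

```python
# Condition numbers used in the noise analysis: kappa(K_theta) and the local
# condition number (5) of the log-determinant at K_theta.
import numpy as np

def condition_numbers(K):
    lam = np.linalg.eigvalsh(K)                         # eigenvalues of the symmetric matrix K_theta
    kappa = lam.max() / lam.min()                       # |lambda_max / lambda_min|
    kappa_logdet = (np.sqrt(np.sum(1.0 / lam**2)) * np.sqrt(np.sum(lam**2))
                    / abs(np.sum(np.log(lam))))         # Eq. (5)
    return kappa, kappa_logdet
```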

(When the optimization algorithm stops in the example of Fig. 2, we have \(\kappa (\mathrm {K}_{\theta }) \simeq 10^{11}\) and \(\kappa (\log |\,\cdot \,|,\, \mathrm {K}_{\theta }) \simeq 10^{9.5}\). The empirical numerical fluctuations are measured as the residuals of a local second-order polynomial best fit, giving noise levels \(10^{-7}\), \(10^{-8}\) and \(10^{-7.5}\) for \(\mathrm {K}_{\theta }^{-1}\underline{\text {Z}} _n\), \(\frac{1}{2}\underline{\text {Z}} _n^{\mathsf {T}}\mathrm {K}_{\theta }^{-1}\underline{\text {Z}} _n\) and \(\log |\mathrm {K}_{\theta }|\) respectively. These values are consistent with the above first-order analysis.)

Thus, when \(\kappa (\mathrm {K}_{\theta })\) becomes large in the course of the optimization procedure, numerical noise on the likelihood and its gradient may trigger an early stopping of the optimization algorithm (presumably when the algorithm is unable to find a proper direction of improvement). It is well known that \(\kappa (\mathrm {K}_{\theta })\) becomes large when \(\sigma _{\varepsilon }^2 = 0\) and one of the following conditions occurs: 1) the data points are close to each other, 2) the covariance function is very smooth (as, for instance, the squared exponential covariance), or 3) the range parameters \(\rho _i\) are large. These conditions arise more often than not. Therefore, numerical noise in the evaluation of the likelihood and its gradient is a problem that should not be neglected in GP implementations.

Fig. 2.

Noisy NLL profile along a particular direction in the parameter space, with a best linear fit (orange line). This example was obtained with GPy while estimating the parameters of a Matérn 5/2 covariance, using 20 data points sampled from the Branin function, and setting \(\sigma _{\varepsilon }^2 = 0\). The red vertical line indicates the location where the optimization of the likelihood stalled. (Color figure online)

The most classical approach to deal with ill-conditioned covariance matrices is to add a small positive number, called jitter, on the diagonal of the covariance matrix, which is equivalent to assuming a small observation noise with variance \(\sigma _{\varepsilon }^2>0\). In GPy for instance, the strategy consists in always setting a minimal jitter of \(10^{-8}\), which is automatically increased by an amount ranging from \(10^{-6} \sigma ^2\) to \(10^{-1} \sigma ^2\) whenever the Cholesky factorization of the covariance matrix fails (due to numerical non-positive-definiteness). The smallest jitter making \(\mathrm {K}_{\theta }\) numerically invertible is kept, and an error is thrown if no jitter allows for a successful factorization. However, note that large values of the jitter may yield smooth, non-interpolating approximations, with possibly unintuitive and undesirable effects (see Andrianakis and Challenor 2012), and may cause convergence problems in Bayesian optimization.
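A simplified sketch of such an adaptive-jitter mechanism is given below; it mimics the strategy described above, but it is not GPy's actual implementation:

```python
# Simplified sketch of an adaptive-jitter strategy (not GPy's actual code): try a
# Cholesky factorization with increasing jitter, relative to the kernel variance sigma2.
import numpy as np

def cholesky_with_jitter(K, sigma2, factors=(0.0, 1e-6, 1e-5, 1e-4, 1e-3, 1e-2, 1e-1)):
    for f in factors:
        try:
            L = np.linalg.cholesky(K + f * sigma2 * np.eye(K.shape[0]))
            return L, f * sigma2         # smallest jitter that makes K factorizable
        except np.linalg.LinAlgError:
            continue
    raise np.linalg.LinAlgError("K is numerically non-positive-definite, even with jitter")
```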

Table 4 illustrates the behaviour of GP interpolation when \(\sigma _{\varepsilon }^2\) is increased. It appears that finding a satisfactory trade-off between good interpolation properties and a low numerical-noise level can be difficult. Table 4 also supports the connection, established in (4) and (6), between the noise levels and \(\kappa (\mathrm {K}_{\theta })\). In view of the results of Fig. 1 (which are based on the default settings of GPy) and of Table 4, we believe that the adaptive jitter cannot be considered a do-it-all solution.

Table 4. Influence of the jitter on the GP model (same setting as in Fig. 2). The table reports the condition numbers \(\kappa (\mathrm {K}_{\theta })\) and \(\kappa (\log |\,\cdot \,|,\, \mathrm {K}_{\theta })\), and the impact on the relative empirical standard deviations \(\delta _\mathrm{quad}\) and \(\delta _\mathrm{logdet}\) of the numerical noise on \(\underline{\text {Z}} _n^T \mathrm {K}_{\theta }^{-1} \underline{\text {Z}} _n\) and \(\log |\mathrm {K}_{\theta }|\) respectively (measured using second-order polynomial regressions). As \(\sigma _{\varepsilon }\) increases, \(\delta _\mathrm{quad}\) and \(\delta _\mathrm{logdet}\) decrease but the interpolation error \(\sqrt{\mathrm {SSR}/\mathrm {SST}} = \sqrt{\frac{1}{n} \sum _{j = 1}^n (Z_j - \hat{Z}_n(x_j))^2} / \mathrm {std}(Z_1, ..., Z_n)\) and the NLL increase. Reducing numerical noise while keeping good interpolation properties requires careful attention in practice.

4 Strategies for Improving Likelihood Maximization

In this section we investigate simple but hopefully effective strategies for improving available implementations of MLE for GP interpolation, beyond the control of the numerical noise on the likelihood using jitter. We mainly focus on 1) initialization methods for the optimization procedure, 2) stopping criteria, 3) the effect of “restart” strategies, and 4) the effect of the parameterization of the covariance.

4.1 Initialization Strategies

Most GP implementations maximize the likelihood with a gradient-based local optimization algorithm, which requires the specification of starting (initial) values for the parameters. In the following, we consider different initialization strategies.

Moment-Based Initialization. A first strategy consists in setting the parameters using empirical moments of the data. More precisely, assuming a constant mean \(m = \mu \), and a stationary covariance k with variance \(\sigma ^2\) and range parameters \(\rho _1,\,\ldots ,\,\rho _d\), set

$$\begin{aligned} \mu _{\mathrm {init}} = \mathop {\mathrm {mean}}\, (Z_1,\, \ldots ,\,Z_n), \end{aligned}$$
(7)
$$\begin{aligned} \sigma ^2_{\mathrm {init}} = \mathop {\mathrm {var}}\, (Z_1,\, \ldots ,\,Z_n), \end{aligned}$$
(8)
$$\begin{aligned} \rho _{k,\, \mathrm {init}} = \mathop {\mathrm {std}}\, (x_{1,\,[k]},\,\ldots ,\,x_{n,\,[k]}),\quad k=1,\, \ldots ,\,d, \end{aligned}$$
(9)

where \(\mathrm {mean}\), \(\mathrm {var}\) and \(\mathrm {std}\) stand for the empirical mean, variance and standard deviation, and \(x_{i,\,[k]}\) denotes the kth coordinate of \(x_i\in \mathbb {R}^d\). The rationale behind (9) (following, e.g., Rasmussen and Williams 2006) is that the range parameters can be thought of as the distance one has to move in the input space for the function value to change significantly, and we assume, a priori, that this distance is linked to the dispersion of the data points.
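A direct implementation of (7)–(9) is straightforward; a minimal sketch:

```python
# Moment-based initialization (7)-(9).
import numpy as np

def moment_based_init(X, z):
    mu_init = np.mean(z)                 # Eq. (7)
    sigma2_init = np.var(z)              # Eq. (8)
    rho_init = np.std(X, axis=0)         # Eq. (9): one range parameter per input dimension
    return mu_init, sigma2_init, rho_init
```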

In GPy for instance, the default initialization consists in setting \(\mu = 0\), \(\sigma ^2 = 1\) and \(\rho _k = 1\) for all k. This is equivalent to the moment-based initialization scheme when the data (both inputs and outputs) are centered and standardized. The practice of standardizing the input domain into a unit length hypercube has been proposed (see, e.g., Snoek et al. 2012) to deal with numerical issues that arise due to large length scale values.

Profiled Initialization. Assume the range parameters \(\rho _1,\,\ldots ,\, \rho _d\) (and more generally, all parameters different from \(\sigma ^2\), \(\sigma _{\varepsilon }^2\) and \(\mu \)) are fixed, and set \(\sigma _{\varepsilon }^2 = \alpha \sigma ^2\), with a prescribed multiplicative factor \(\alpha \ge 0\). In this case, the NLL can be optimized analytically w.r.t. \(\mu \) and \(\sigma ^2\). Optimal values turn out to be the generalized least squares solutions

$$\begin{aligned} \mu _{\mathrm {GLS}}&= (\mathbbm {1}_n^{\mathsf {T}}\mathrm {K}_{\tilde{\theta }}^{-1} \mathbbm {1}_n)^{-1}\mathbbm {1}_n^{\mathsf {T}}\mathrm {K}_{\tilde{\theta }}^{-1}\underline{\text {Z}} _n\,, \end{aligned}$$
(10)
$$\begin{aligned} \sigma _{\mathrm {GLS}}^2&= \frac{1}{n} (\underline{\text {Z}} _n - \mu _{\mathrm {GLS}}\mathbbm {1}_n)^{\mathsf {T}}\mathrm {K}_{\tilde{\theta }}^{-1} (\underline{\text {Z}} _n - \mu _{\mathrm {GLS}}\mathbbm {1}_n)\,, \end{aligned}$$
(11)

where \(\tilde{\theta } = (\sigma ^2,\,\rho _1,\,\ldots ,\, \rho _d,\ldots ,\, \sigma _{\varepsilon }^2)^{\mathsf {T}}\in \varTheta \), with \(\sigma ^2=1\) and \(\sigma _{\varepsilon }^2= \alpha \). Under the profiled initialization scheme, \(\rho _{1}, \ldots , \rho _{d}\) are set using (9), \(\alpha \) is prescribed according to the user’s preference, and \(\mu \) and \(\sigma ^2\) are initialized using (10) and (11).
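The generalized least-squares formulas (10)–(11) can be implemented as follows; this is a sketch assuming a function corr_matrix that returns the correlation matrix of the observations, i.e. \(\mathrm {K}_{\tilde{\theta }}\) for \(\sigma ^2 = 1\) and \(\sigma _{\varepsilon }^2 = 0\), to which \(\alpha \) is added on the diagonal:

```python
# Profiled initialization: GLS estimates (10)-(11) of mu and sigma^2, for fixed
# range parameters rho and sigma_eps^2 = alpha * sigma^2.
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def profiled_init(X, z, rho, corr_matrix, alpha=0.0):
    n = len(z)
    R = corr_matrix(X, rho) + alpha * np.eye(n)      # K_theta_tilde
    c, low = cho_factor(R)
    ones = np.ones(n)
    mu_gls = (ones @ cho_solve((c, low), z)) / (ones @ cho_solve((c, low), ones))    # Eq. (10)
    resid = z - mu_gls
    sigma2_gls = (resid @ cho_solve((c, low), resid)) / n                            # Eq. (11)
    return mu_gls, sigma2_gls
```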

Grid-Search Initialization. Grid-search initialization is a profiled initialization with the addition of a grid-search optimization for the range parameters.

Define a nominal range vector \(\rho _0\) such that

$$\begin{aligned} \rho _{0,[k]} \;=\; \sqrt{d}\, \left( \max _{1 \le i \le n} x_{i,[k]} - \min _{1 \le i \le n} x_{i,[k]} \right) ,\quad 1 \le k \le d. \end{aligned}$$

Then, define a one-dimensional grid of size L (e.g., \(L=5\)) by taking range vectors proportional to \(\rho _0\): \(\{\alpha _1 \rho _0,\, \ldots ,\, \alpha _L \rho _0\}\), where the \(\alpha _i\) range, on a logarithmic scale, from a “small” value (e.g., \(\alpha _1 = 1/50\)) to a “large” value (e.g., \(\alpha _L = 2\)). For each point of the grid, the likelihood is optimized with respect to \(\mu \) and \(\sigma ^2\) using (10) and (11). The range vector with the best likelihood value is selected. (This initialization procedure is the default in the Matlab/GNU Octave toolbox STK.)
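The grid-search initialization can then be sketched as follows, with the profiled NLL evaluated (up to an additive constant) for each candidate range vector; corr_matrix is the assumed correlation-matrix function of the previous sketch:

```python
# Grid-search initialization: profiled NLL on a logarithmic grid of range vectors
# proportional to the nominal range rho_0.
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def grid_search_init(X, z, corr_matrix, alpha=0.0, L=5, a_min=1.0 / 50, a_max=2.0):
    n, d = X.shape
    rho0 = np.sqrt(d) * (X.max(axis=0) - X.min(axis=0))    # nominal range vector
    ones, best = np.ones(n), None
    for a in np.logspace(np.log10(a_min), np.log10(a_max), L):
        rho = a * rho0
        R = corr_matrix(X, rho) + alpha * np.eye(n)
        c, low = cho_factor(R)
        mu = (ones @ cho_solve((c, low), z)) / (ones @ cho_solve((c, low), ones))   # Eq. (10)
        resid = z - mu
        sigma2 = (resid @ cho_solve((c, low), resid)) / n                           # Eq. (11)
        # Profiled NLL up to an additive constant, with K_theta = sigma2 * R
        nll = 0.5 * n * np.log(sigma2) + np.sum(np.log(np.diag(c))) + 0.5 * n
        if best is None or nll < best[0]:
            best = (nll, mu, sigma2, rho)
    return best[1], best[2], best[3]      # mu_init, sigma2_init, rho_init
```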

4.2 Stopping Condition

Most GP implementations rely on well-tested gradient-based optimization algorithms. For instance, a popular choice in Python implementations is to use the limited-memory BFGS algorithm with box constraints (L-BFGS-B; see Byrd et al. 1995) of the SciPy ecosystem. (Other popular optimization algorithms include the ordinary BFGS, truncated Newton constrained, SQP, etc.; see, e.g., Nocedal and Wright (2006).) The L-BFGS-B algorithm, which belongs to the class of quasi-Newton algorithms, uses limited-memory Hessian approximations and shows good performance on non-smooth functions (Curtis and Que 2015).

Regardless of which optimization algorithm is chosen, the user usually has the possibility to tune the behavior of the optimizer, and in particular to set the stopping condition. Generally, the stopping condition is met when a maximum number of iterations is reached or when a norm of the steps and/or the gradient becomes smaller than a threshold.

By increasing the strictness of the stopping condition during the optimization of the likelihood, one would expect better parameter estimates, provided the numerical noise on the likelihood does not interfere too much.

4.3 Restart and Multi-start Strategies

Due to numerical noise and possible non-convexity of the likelihood with respect to the parameters, gradient-based optimization algorithms may stall far from the global optimum. A common approach to circumvent the issue is to carry out several optimization runs with different initialization points. Two simple strategies can be compared.

Table 5. Two popular reparameterization mappings \(\tau \), as implemented, for example, in GPy and STK respectively. For invsoftplus, notice parameter \(s>0\), which is introduced when input standardization is considered (see Sect. 5).

Restart. In view of Fig. 2, a first simple strategy is to restart the optimization algorithm to clear its memory (Hessian approximation, step sizes...), hopefully allowing it to escape a possibly problematic location, using the last best parameters as initial values for the next optimization run. The optimization can be restarted several times, until a budget \(N_{\mathrm {opt}}\) of restarts is spent or the best likelihood value no longer improves.

Multi-start. Given an initialization point \((\theta _{\mathrm {init}}, \mu _{\mathrm {init}})\in \varTheta \times \mathbb {R}\), a multi-start strategy consists in running \(N_{\mathrm {opt}} > 1\) optimizations with different initialization points corresponding to perturbations of the initial point \((\theta _{\mathrm {init}}, \mu _{\mathrm {init}})\). In practice, we suggest the following rule for building the perturbations: first, perturb the range parameters around \((\rho _{1,\,\mathrm {init}},\, \ldots ,\, \rho _{d,\,\mathrm {init}})^{\mathsf {T}}\) (refer to Sect. 5 for an implementation); then, propagate the perturbations to \(\mu \) and \(\sigma ^2\) using (10) and (11). The parameters with the best likelihood value over all optimization runs are selected.
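A simplified multi-start sketch for a GPy model is given below. It only perturbs the range parameters of an ARD kernel, whereas the full procedure described above (and used in Sect. 5) also propagates the perturbations to \(\mu \) and \(\sigma ^2\) through (10) and (11), and restores all parameters of the best run:

```python
# Simplified multi-start sketch for a GPy regression model with an ARD kernel.
import numpy as np

def multistart_mle(model, rho_init, n_opt=5, sigma_eta=np.log10(5) / 1.96, seed=0):
    rng = np.random.default_rng(seed)
    best_nll, best_rho = np.inf, None
    for i in range(n_opt):
        # log10-scale multiplicative perturbation of the initial range parameters
        eta = rng.normal(0.0, sigma_eta, size=len(rho_init)) if i > 0 else np.zeros(len(rho_init))
        model.kern.lengthscale[:] = rho_init * 10.0 ** eta
        model.optimize()                           # local gradient-based run from this start
        nll = -model.log_likelihood()
        if nll < best_nll:
            best_nll, best_rho = nll, np.array(model.kern.lengthscale)
    model.kern.lengthscale[:] = best_rho           # keep the best range parameters found
    return best_nll
```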

4.4 Parameterization of the Covariance Function

The parameters of the covariance functions are generally positive real numbers (\(\sigma ^2\), \(\rho _1, \rho _2\ldots \)) and are related to scaling effects that act “multiplicatively” on the predictive distributions. Most GP implementations introduce a reparameterization using a monotonic one-to-one mapping \(\tau :\mathbb {R}_+^{\star } \rightarrow \mathbb {R}\), acting component-wise on the positive parameters of \(\theta \), resulting in a mapping \(\tau :\varTheta \rightarrow \varTheta ^{\prime }\). Thus, for carrying out MLE, the actual criterion J that is optimized in most implementations may then be written as

$$\begin{aligned} J: \theta ^{\prime }\in \varTheta ^{\prime } \mapsto - \log (\mathcal {L}(\underline{\text {Z}} _n|\tau ^{-1}(\theta ^{\prime }), \mu )) . \end{aligned}$$
(12)

Table 5 lists two popular reparameterization mappings \(\tau \).

The effect of reparameterization is to “reshape” the likelihood. Typical likelihood profiles using the log and the so-called invsoftplus reparameterizations are shown in Fig. 3. Notice that the NLL may be almost flat in some regions, depending on the reparameterization. Changing the shape of the optimization criterion, combined with numerical noise, may or may not facilitate the convergence of the optimization.
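For reference, one common convention for the two mappings of Table 5 is sketched below; the exact form used by GPy and STK, and in particular the role of the scaling factor s, may differ in detail:

```python
# Sketch of the log and invsoftplus reparameterizations (one possible convention).
import numpy as np

def log_reparam(theta):
    return np.log(theta)                         # theta' = log(theta)

def log_reparam_inv(theta_prime):
    return np.exp(theta_prime)

def invsoftplus(theta, s=1.0):
    return np.log(np.expm1(s * theta))           # theta' = log(exp(s * theta) - 1)

def invsoftplus_inv(theta_prime, s=1.0):
    return np.log1p(np.exp(theta_prime)) / s     # softplus, mapped back to the original scale
```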

Fig. 3.

Profiles of the NLL along a linear path t through the profiled initialization point (at zero, blue vertical line) and the optimum (at one, black vertical line). Orange (resp. blue) line corresponds to the log (resp. invsoftplus) reparameterization. (Color figure online)

5 Numerical Study

5.1 Methodology

The main metric used in this numerical study is based on empirical cumulative distribution functions (ECDFs) of differences in NLL values.

More precisely, consider \(N + 1\) optimization schemes \(S_0, S_1,\ldots , S_N\), where \(S_0\) stands for a “brute-force” optimization scheme based on a very large number of multi-starts, which is assumed to provide a robust MLE, and \(S_1,\ldots , S_N\) are optimization schemes to be compared. Each optimization scheme is run on M data sets \(D_j\), \(1 \le j \le M\), and we denote by \(e_{i,\,j}\) the difference

$$\begin{aligned} e_{i,j} = \mathrm {NLL}_{i,\,j} - \mathrm {NLL}_{0,\, j} , \quad 1 \le i \le N, \quad 1 \le j \le M, \end{aligned}$$

where \(\mathrm {NLL}_{i,j}\) denotes the NLL value obtained by optimization scheme \(S_i\) on data set \(D_j\).

A good scheme \(S_i\) should concentrate the empirical distribution of the sample \(E_i = \{e_{i,j}, j=1,\ldots ,\, M\}\) around zero—in other words, its ECDF should be close to the ideal CDF \(e \mapsto \mathbbm {1}_{[0,\infty [}(e)\). Using ECDFs also provides a convenient way to compare performances: a strategy with a “steeper” ECDF, or a larger area under the ECDF, is better.
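The following sketch computes the ECDF of the NLL differences for one scheme, together with the area under the ECDF used as a scalar summary in Sect. 5.4 (here with a simple trapezoidal approximation):

```python
# ECDF of the NLL differences e_{i, .} for one optimization scheme, and the area
# under the ECDF between 0 and nll_max (100 for a perfect scheme, cf. Sect. 5.4).
import numpy as np

def ecdf_and_area(e, nll_max=100.0):
    e = np.sort(np.clip(e, 0.0, nll_max))        # differences clipped to [0, nll_max]
    grid = np.linspace(0.0, nll_max, 1001)
    ecdf = np.searchsorted(e, grid, side="right") / len(e)
    area = np.trapz(ecdf, grid)                  # trapezoidal approximation of the area
    return grid, ecdf, area
```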

5.2 Optimization Schemes

All experiments are performed using GPy version 1.9.9, with the default L-BFGS-B algorithm. We use a common setup and vary the configurations of the optimization levers as detailed below.

Common Setup. All experiments use an estimated constant mean function, an anisotropic Matérn covariance function with regularity \(\nu = 5/2\), and we assume no observation noise (the adaptive jitter of GPy, ranging from \(10^{-6} \sigma ^2\) to \(10^{2} \sigma ^2\), is used, however).

Initialization Schemes. Three initialization procedures from Sect. 4.1 are considered.

Stopping Criteria. We consider two settings for the stopping condition of the L-BFGS-B algorithm, called soft (the default setting: maxiter \(=1000\), factr \(=10^{7}\), pgtol \(=10^{-5}\)) and strict (maxiter \(=1000\), factr \(=10\), pgtol \(=10^{-20}\)).
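These soft and strict settings correspond to the factr, pgtol and maxiter arguments of SciPy's L-BFGS-B routine, on which GPy's default optimizer relies. A sketch of a direct call, assuming functions nll and nll_grad for the (reparameterized) criterion (12) and its gradient:

```python
# Soft (default) vs strict stopping settings for SciPy's L-BFGS-B routine.
from scipy.optimize import fmin_l_bfgs_b

SOFT = dict(maxiter=1000, factr=1e7, pgtol=1e-5)
STRICT = dict(maxiter=1000, factr=10.0, pgtol=1e-20)

def minimize_nll(nll, nll_grad, theta0, settings=SOFT):
    """Run L-BFGS-B on the reparameterized NLL with the given stopping settings."""
    theta_opt, nll_opt, info = fmin_l_bfgs_b(nll, theta0, fprime=nll_grad, **settings)
    return theta_opt, nll_opt, info
```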

Restart and Multi-start. The two strategies of Sect. 4.3 are implemented using a log reparameterization and initialization points \((\theta _{\mathrm {init}}, \mu _{\mathrm {init}})\) determined using a grid-search strategy. For the multi-start strategy, the initial range parameters are perturbed according to the rule \(\rho \leftarrow \rho _{\mathrm {init}} \cdot 10^{\eta }\), where \(\eta \) is drawn from a \(\mathcal {N}(0, \sigma _{\eta }^2)\) distribution. We take \(\sigma _{\eta } = \log _{10}(5)/1.96 \approx 0.35\), to ensure that about 95% of the distribution of \(\rho \) lies in the interval \(\left[ 1/5 \cdot \rho _{\mathrm {init}},~5\cdot \rho _{\mathrm {init}}\right] \).

Reparameterization. We study the log reparameterization and two variants of the invsoftplus. The first version, called no-input-standardization, simply corresponds to taking \(s=1\) for each range parameter. The second version, called input-standardization, consists in scaling the inputs to unit standard deviation on each dimension (by taking the corresponding value for s).

5.3 Data Sets

The data sets are generated from six well-known test functions in the literature of Bayesian optimization: the Branin function (\(d=2\); see, e.g. Surjanovic and Bingham 2013), the Borehole function (\(d=8\); see, e.g. Worley 1987), the Welded Beam Design function (\(d=4\); see Chafekar et al. 2003), the g10 function (\(d=8\); see Ahmed 2004, p. 128), along with two modified versions, g10mod and g10modmod (see Feliot 2017).

Each function is evaluated on Latin hypercube samples with a multi-dimensional uniformity criterion (LHS-MDU; Deutsch and Deutsch 2012), with varying sample size \(n\in \{3d,\, 5d,\, 10d,\,20d\}\), resulting in a total of \(6\times 4 = 24\) data sets.

5.4 Results and Findings

Figure 4 shows the effect of the reparameterization and of the initialization method. Observe that the log reparameterization performs significantly better than the invsoftplus reparameterizations. For the log reparameterization, the grid-search strategy brings a moderate but non-negligible gain with respect to the two other initialization strategies, which behave similarly.

Fig. 4.

Initialization and reparameterization methods. (a) ECDFs corresponding to the best initialization method for each of the three reparameterizations—red line: log reparam. with grid-search init.; green line: invsoftplus with input-standardization reparam. and grid-search init; blue line: invsoftplus with no-input-standardization reparam. and moment-based init. (b) ECDFs for different initialization methods for the \(\log \) reparameterization. (Color figure online)

Fig. 5.

Area under the ECDF against run time: (a) restart strategy; (b) multi-start strategy. The maximum areas obtained are respectively 86.538 and 88.504.

Next, we study the effect of the different restart strategies and of the stopping conditions, in the case of the log reparameterization and grid-search initialization. The metric used for the comparison is the area under the ECDFs of the differences of NLLs, computed by integrating the ECDF between 0 and \(\mathrm {NLL}_{\text {max}} = 100\). Thus, a perfect optimization strategy would achieve an area under the ECDF equal to 100. Since the multi-start strategy is stochastic, results are averaged over 50 repetitions of the optimization procedures (for each \(N_{\mathrm {opt}}\) value, the optimization strategy is repeated 50 times). The areas are plotted against the computational run time. Run times are averaged over the repetitions in the case of the multi-start strategy.

Figure 5 shows that the soft stopping condition seems uniformly better. The restart strategy yields small improvements at a moderate computational overhead. The multi-start strategy achieves the best results, at the price of higher computational costs.

6 Conclusions and Recommendations

Our numerical study has shown that the parameterization of the covariance function has the most significant impact on the accuracy of MLE in GPy. Using restart/multi-start strategies is also very beneficial to mitigate the effect of the numerical noise on the likelihood. The two other levers have a second-order but nonetheless measurable influence.

These observations make it possible to devise a recommended combination of improvement levers—for GPy at least, but hopefully transferable to other software packages as well. When computation time matters, an improved optimization procedure for MLE consists in combining the log reparameterization, the grid-search initialization, the soft (GPy’s default) stopping condition, and a small number, say \(N_{\mathrm {opt}}=5\), of restarts.

Figure 1 and Table 2 are based on the above optimization procedure, which results in significantly better likelihood values and smaller prediction errors. The multi-start strategy can be used when accurate results are sought.

In conclusion, our recommendations are not intended to be universal, but we hope that they will encourage researchers and users to develop and use more reliable and more robust GP implementations, in Bayesian optimization and elsewhere.