
1 Introduction

Gaussian process (GP) regression and interpolation (see, e.g., Rasmussen and Williams 2006), also known as kriging (see, e.g., Stein 1999), has gained significant popularity in statistics and machine learning as a non-parametric Bayesian approach for the prediction of unknown functions. The need for function prediction arises not only in supervised learning tasks, but also when building fast surrogates of time-consuming computations, e.g., in the assessment of the performance of a learning algorithm as a function of tuning parameters or, more generally, in the design and analysis of computer experiments (Santner et al. 2003). Interest in GPs has also risen considerably due to the development of Bayesian optimization (Mockus 1975; Jones et al. 1998; Emmerich et al. 2006; Srinivas et al. 2010...).

This context has fostered the development of a fairly large number of open-source packages to facilitate the use of GPs. Some of the popular choices are the Python modules scikit-learn (Pedregosa et al. 2011), GPy (Sheffield machine learning group 2012–2020), GPflow (Matthews et al. 2017), GPyTorch (Gardner et al. 2018), OpenTURNS (Baudin et al. 2017); the R package DiceKriging (Roustant et al. 2012); and the Matlab/GNU Octave toolboxes GPML (Rasmussen and Nickisch 2010), STK (Bect et al. 2011–2021) and GPstuff (Vanhatalo et al. 2012).

In practice, all implementations require the user to specify the mean and covariance functions of a Gaussian process prior in parameterized form. Among the various methods available to estimate the model parameters, we can safely say that maximum likelihood estimation (MLE) is the most popular approach. However, a simple numerical experiment consisting of interpolating a function (see Table 1), as is usually done in Bayesian optimization, shows that different MLE implementations from different Python packages produce widely dispersed numerical results when the default settings of each implementation are used. These significant differences were also noticed by Erickson et al. (2018), but the causes and possible mitigations were not investigated. Note that each package uses its own default algorithm for the optimization of the likelihood: GPyTorch uses ADAM (Kingma and Ba 2015), OpenTURNS uses a truncated Newton method (Nash 1984), and the others generally use L-BFGS-B (Byrd et al. 1995). It turns out that none of the default results in Table 1 is really satisfactory compared to the result obtained using the recommendations in this study (see Footnote 1).

Table 1. Inconsistencies in the results across different Python packages. The results were obtained by fitting a GP model, with constant mean and a Matérn kernel (\(\nu =5/2\)), to the Branin function, using the default settings for each package. We used 50 training points and 500 test points sampled from a uniform distribution on \([-5, 10] \times [0, 15]\). The table reports the estimated values for the variance and length scale parameters of the kernel, the empirical root mean squared prediction error (ERMSPE) and the minimized negative log likelihood (NLL). The last row shows the improvement using the recommendations in this study.
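For concreteness, the following sketch reproduces the spirit of the Table 1 experiment for a single package (GPy). It is only an illustration of the assumed setup: the sampling of the training and test points, the handling of the constant mean through GPy.mappings.Constant, and the noise-free setting are our assumptions, and the exact configuration used for Table 1 may differ.

```python
# Sketch of the kind of experiment reported in Table 1 (assumed setup, GPy only):
# fit a GP with constant mean and a Matern 5/2 kernel to the Branin function,
# then report the NLL on the training data and the ERMSPE on the test data.
import numpy as np
import GPy

def branin(x):
    """Branin function on [-5, 10] x [0, 15]."""
    x1, x2 = x[:, 0], x[:, 1]
    b, c, t = 5.1 / (4 * np.pi**2), 5.0 / np.pi, 1.0 / (8 * np.pi)
    return (x2 - b * x1**2 + c * x1 - 6.0)**2 + 10.0 * (1.0 - t) * np.cos(x1) + 10.0

rng = np.random.default_rng(0)
lower, upper = np.array([-5.0, 0.0]), np.array([10.0, 15.0])
X_train = lower + (upper - lower) * rng.uniform(size=(50, 2))
X_test = lower + (upper - lower) * rng.uniform(size=(500, 2))
z_train = branin(X_train)[:, None]

kernel = GPy.kern.Matern52(input_dim=2, ARD=True)
mean = GPy.mappings.Constant(input_dim=2, output_dim=1)   # constant mean, estimated by MLE
model = GPy.models.GPRegression(X_train, z_train, kernel, mean_function=mean)
model.Gaussian_noise.variance.fix(0.0)                    # interpolation: no observation noise
model.optimize()                                          # default settings (L-BFGS-B)

z_pred, _ = model.predict(X_test)
ermspe = np.sqrt(np.mean((z_pred.ravel() - branin(X_test))**2))
print("NLL:", -model.log_likelihood(), "ERMSPE:", ermspe)
print(model)   # shows the estimated variance and length scale parameters
```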

Focusing on the case of GP interpolation (with Bayesian optimization as the main motivation), the first contribution of this article is to understand the origin of the inconsistencies across available implementations. The second contribution is to investigate simple but effective strategies for improving these implementations, using the well-established GPy package as a case study. We shall propose recommendations concerning several optimization settings: initialization and restart strategies, parameterization of the covariance, etc. Anticipating our numerical results, the reader is invited to refer to Fig. 1 and Table 2, which show that significant improvements in terms of estimated parameter values and prediction errors can be obtained over the default settings using better optimization schemes.

Fig. 1.

Improved (cf. Sect. 6) vs default setups in GPy on the Borehole function with \(n = 20d = 160\) random training points. We remove one point at a time to obtain (a) the distribution of the differences of negative log-likelihood (NLL) values between the two setups; (b) the empirical CDFs of the prediction error at the removed points; (c) pairs of box-plots for the estimated range parameters (for each dimension, indexed from 1 to 8 on the x-axis, the box-plot for the improved setup is on the left and the box-plot for the default setup is on the right; horizontal red lines correspond to the estimated values using the whole data set, without leave-one-out). Notice that the parameter distributions of the default setup are more spread out.

Even though this work targets a seemingly prosaic issue, and advocates admittedly simple solutions, we feel that the contribution is nonetheless of significant value considering the widespread use of GP modeling. Indeed, a host of studies, particularly in the literature on Bayesian optimization, rely on off-the-shelf GP implementations: for their conclusions to be reliable and reproducible, robust implementations are critical.

The article is organized as follows. Section 2 provides a brief review of GP modeling and MLE. Section 3 describes some numerical aspects of the evaluation and optimization of the likelihood function, with a focus on GPy’s implementation. Section 4 provides an analysis of factors influencing the accuracy of numerical MLE procedures. Finally, Sect. 5 assesses the effectiveness of our solutions through numerical experiments and Sect. 6 concludes the article.

Table 2. Improved (cf. Sect. 6) vs default setups in GPy for the interpolation of the Borehole function (input space dimension is \(d=8\)) with \(n\in \{3d,\, 5d\}\) random data points (see Sect. 5.3 for details). The experiment is repeated 50 times. The columns report the leave-one-out mean squared error (LOO-MSE) values (empirical mean over the repetitions, together with the standard deviation and the average proportion of the LOO-MSE to the total standard deviation of the data in parentheses).

2 Background

2.1 Gaussian Processes

Let \(Z \sim {\mathrm {GP}}(m,\, k)\) be a Gaussian process indexed by \(\mathbb {R}^d\), \(d\ge 1\), specified by a mean function \(m:\mathbb {R}^d\rightarrow \mathbb {R}\) and a covariance function \(k:\mathbb {R}^d\times \mathbb {R}^d\rightarrow \mathbb {R}\).

The objective is to predict Z(x) at a given location \(x\in \mathbb {R}^d\), given a data set \(D = \{(x_i,\,z_i)\in \mathbb {R}^d\times \mathbb {R},\, 1 \le i \le n\}\), where the observations \(z_i\) are assumed to be the outcome of an additive-noise model: \(Z_i = Z(x_i) + \varepsilon _i\), \(1 \le i \le n\). In most applications, it is assumed that the \(\varepsilon _i\) are i.i.d. zero-mean Gaussian random variables with variance \(\sigma _{\varepsilon }^2\ge 0\), independent of Z. (In rarer cases, heteroscedasticity is assumed.)

Knowing m and k, recall (see, e.g., Rasmussen and Williams 2006) that the posterior distribution of Z is such that \(Z \mid Z_1,\,\ldots ,\, Z_n,\, m,\, k \sim {\mathrm {GP}}(\hat{Z}_n,\, k_n)\), where \(\hat{Z}_n\) and \(k_n\) stand respectively for the posterior mean and covariance functions:

$$\begin{aligned} \hat{Z}_n(x) &= m(x) + \sum _{i=1}^n w_i(x;\underline{\text {x}}_n)\, (z_i - m(x_i))\,,\\ k_n(x, y) &= k(x,y) - \mathrm {w}(y;\underline{\text {x}}_n)^{\mathsf {T}}\mathrm {K}(\underline{\text {x}}_n, x)\,, \end{aligned}$$

where \(\underline{\text {x}}_n\) denotes the set of observation points \(\left( x_1,\, \ldots ,\, x_n\right) \), and the weights \(w_i(x;\underline{\text {x}}_n)\) are the solutions of the linear system:

$$\begin{aligned} (\mathrm {K}(\underline{\text {x}}_n,\underline{\text {x}}_n) + \sigma _{\varepsilon }^2 \mathrm {I}_{n})\, \mathrm {w}(x;\underline{\text {x}}_n) = \mathrm {K}(\underline{\text {x}}_n, x)\,, \end{aligned}$$
(1)

with \(\mathrm {K}(\underline{\text {x}}_n,\,\underline{\text {x}}_n)\) the \(n\times n\) covariance matrix with entries \(k(x_i,\,x_j)\), \(\mathrm {I}_n\) the identity matrix of size n, and \(\mathrm {w}(x;\underline{\text {x}}_n)\) (resp. \(\mathrm {K}(\underline{\text {x}}_n, x)\)) the column vector with entries \(w_i(x;\underline{\text {x}}_n)\) (resp. \(k(x_i, x)\)), \(1 \le i \le n\).
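In practice, the weights in (1) and the prediction equations above are computed from a Cholesky factorization of \(\mathrm {K}(\underline{\text {x}}_n,\underline{\text {x}}_n) + \sigma _{\varepsilon }^2 \mathrm {I}_{n}\). The following is a minimal NumPy sketch, assuming a generic kernel function k(A, B) that returns the matrix of pairwise covariances (for instance the Matérn 5/2 kernel sketched after Eq. (2)):

```python
# Minimal sketch of the GP prediction equations, assuming a kernel function
# k(A, B) that returns the matrix of pairwise covariances k(a_i, b_j).
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def gp_posterior(x, X, z, k, m=lambda u: 0.0, noise_var=0.0):
    """Posterior mean and variance of Z at a single point x, given data (X, z)."""
    K = k(X, X) + noise_var * np.eye(len(X))          # K(x_n, x_n) + sigma_eps^2 I_n
    kx = k(X, x[None, :]).ravel()                     # vector with entries k(x_i, x)
    c, low = cho_factor(K)
    w = cho_solve((c, low), kx)                       # weights w(x; x_n), Eq. (1)
    prior_mean = np.array([m(xi) for xi in X])
    mean = m(x) + w @ (z - prior_mean)                # posterior mean
    var = k(x[None, :], x[None, :]).item() - w @ kx   # posterior variance k_n(x, x)
    return mean, var
```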

It is common practice to assume a zero mean function \(m = 0\)—a reasonable choice if the user has taken care to center data—but most GP implementations also provide an option for setting a constant mean function \(m(\,\cdot \,) = \mu \in \mathbb {R}\). In this article, we will include such a constant in our models, and treat it as an additional parameter to be estimated by MLE along with the others. (Alternatively, \(\mu \) could be endowed with a Gaussian or improper-uniform prior, and then integrated out; see, e.g., O’Hagan (1978).)

The covariance function, also known as the covariance kernel, models similarity between data points and reflects the user’s prior belief about the function to be learned. Most GP implementations provide a selection of stationary covariance functions taken from the literature (e.g., Wendland 2004; Rasmussen and Williams 2006). The squared exponential, the rational quadratic and the Matérn covariance functions are popular choices (see Table 3). These covariance functions involve a number of parameters: a variance parameter \(\sigma ^2>0\) corresponding to the variance of Z, and a set of range (or length scale) parameters \(\rho _1,\,\ldots ,\, \rho _d\), such that

$$\begin{aligned} k(x,y) = \sigma ^{2}r(h) , \end{aligned}$$
(2)

with \(h^2 = \sum _{i=1}^d (x_{[i]} - y_{[i]})^2/\rho _i^2\), where \(x_{[i]}\) and \(y_{[i]}\) denote the ith coordinates of x and y. The function \(r:\mathbb {R}\rightarrow \mathbb {R}\) in (2) is the stationary correlation function of Z. From now on, the vector of model parameters will be denoted by \(\theta = (\sigma ^2,\,\rho _1,\,\ldots ,\, \rho _d,\ldots , \sigma _{\varepsilon }^2)^{\mathsf {T}}\in \varTheta \subset \mathbb {R}^p\), and the corresponding covariance matrix \(\mathrm {K}(\underline{\text {x}}_n,\underline{\text {x}}_n) + \sigma _{\varepsilon }^2 \mathrm {I}_{n}\) by \(\mathrm {K}_{\theta }\).
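As an illustration of the parameterization (2), here is a sketch of an anisotropic Matérn 5/2 kernel, using the standard form of its correlation function (see Table 3 for the expressions available in GPy):

```python
# Sketch of an anisotropic kernel of the form (2), here with the Matern 5/2
# correlation function r(h) = (1 + sqrt(5) h + 5 h^2 / 3) exp(-sqrt(5) h).
import numpy as np

def matern52_kernel(X, Y, sigma2, rho):
    """k(x, y) = sigma2 * r(h), with h^2 = sum_k (x_[k] - y_[k])^2 / rho_k^2."""
    D = (X[:, None, :] - Y[None, :, :]) / rho     # pairwise scaled differences
    h = np.sqrt(np.sum(D**2, axis=-1))            # matrix of scaled distances h
    r = (1.0 + np.sqrt(5.0) * h + 5.0 * h**2 / 3.0) * np.exp(-np.sqrt(5.0) * h)
    return sigma2 * r
```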

Table 3. Some kernel functions available in GPy. The Matérn kernel is recommended by Stein (1999). \(\varGamma \) denotes the gamma function, \(\mathscr {K}_{\nu }\) is the modified Bessel function of the second kind.

2.2 Maximum Likelihood Estimation

In this article, we focus on GP implementations where the parameters \((\theta , \mu )\in \varTheta \times \mathbb {R}\) of the process Z are estimated by maximizing the likelihood \(\mathcal {L}(\underline{\text {Z}} _n|\theta , \mu )\) of \(\underline{\text {Z}} _n = (Z_1,\ldots , Z_n)^{\mathsf {T}}\), or equivalently, by minimizing the negative log-likelihood (NLL)

$$\begin{aligned} - \log (\mathcal {L}(\underline{\text {Z}} _n|\theta , \mu )) \,=\; \frac{1}{2}(\underline{\text {Z}} _n-\mu \mathbbm {1}_n)^{\top } \mathrm {K}_{\theta }^{-1}(\underline{\text {Z}} _n-\mu \mathbbm {1}_n) +\frac{1}{2}\log |\mathrm {K}_{\theta }| + \text {constant}. \end{aligned}$$
(3)

This optimization is typically performed by gradient-based methods, although local maxima can be a significant concern since the likelihood is often non-convex. Computing the likelihood and its gradient with respect to \((\theta , \mu )\) has an \(O(n^3 + dn^2)\) computational cost (Rasmussen and Williams 2006; Petit et al. 2020).
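The NLL (3) is usually evaluated through a Cholesky factorization of \(\mathrm {K}_{\theta }\), which accounts for the \(O(n^3)\) part of the cost. A minimal sketch, assuming a function cov_matrix that returns \(\mathrm {K}_{\theta }\):

```python
# Sketch of the evaluation of the NLL (3), up to the additive constant (n/2) log(2 pi).
# cov_matrix(X, theta) is assumed to return K_theta = K(x_n, x_n) + sigma_eps^2 I_n.
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def negative_log_likelihood(theta, mu, X, z, cov_matrix):
    K = cov_matrix(X, theta)
    c, low = cho_factor(K)                              # O(n^3) Cholesky factorization
    resid = z - mu
    quad = 0.5 * resid @ cho_solve((c, low), resid)     # (1/2) (z - mu 1)^T K^{-1} (z - mu 1)
    half_logdet = np.sum(np.log(np.diag(c)))            # (1/2) log |K_theta|
    return quad + half_logdet
```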

3 Numerical Noise

The evaluation of the NLL, as well as its gradient, is subject to numerical noise, which can prevent proper convergence of the optimization algorithms. Figure 2 shows a typical situation where the gradient-based optimization algorithm stops before converging to an actual minimum. In this section, we provide an analysis of the numerical noise on the NLL using the concept of local condition numbers. We also show that the popular solution of adding jitter cannot be considered a fully satisfactory answer to the problem of numerical noise.

Numerical noise stems from both terms of the NLL, namely \(\frac{1}{2}\underline{\text {Z}} _n^{\top } \mathrm {K}_{\theta }^{-1}\underline{\text {Z}} _n\) and \(\frac{1}{2}\log |\mathrm {K}_{\theta }|\). (For simplicity, we assume \(\mu =0\) in this section.)

First, recall that the condition number \(\kappa (\mathrm {K}_\theta )\) of \(\mathrm {K}_\theta \), defined as the ratio \(|\lambda _{\max }/\lambda _{\min }|\) of the largest to the smallest eigenvalue (Press et al. 1992), is the key quantity for analyzing the numerical noise on \(\mathrm {K}_{\theta }^{-1}\underline{\text {Z}} _n\). In double-precision floating-point arithmetic, \(\underline{\text {Z}} _n\) is corrupted by an error \({\epsilon }\) whose magnitude is such that \(\Vert \epsilon \Vert /\Vert \underline{\text {Z}} _n\Vert \simeq 10^{-16}\). A worst-case alignment of \(\underline{\text {Z}} _n\) and \(\epsilon \) with the eigenvectors of \(\mathrm {K}_{\theta }\) gives

$$\begin{aligned} \frac{\Vert \mathrm {K}_{\theta }^{-1} \epsilon \Vert }{\Vert \mathrm {K}_{\theta }^{-1} \underline{\text {Z}} _n\Vert } \simeq \kappa (\mathrm {K}_{\theta }) \times 10^{-16} , \end{aligned}$$
(4)

which shows how the numerical noise is amplified when \(\mathrm {K}_{\theta }\) becomes ill-conditioned.

The term \(\log |\mathrm {K}_{\theta }|\) is nonlinear in \(\mathrm {K}_{\theta }\), but observe, using the identity \(\mathrm {d}\log |\mathrm {K}_{\theta }|/ \mathrm {d}\mathrm {K}_{\theta } = \mathrm {K}_{\theta }^{-1}\), that the differential of \(\log |\,\cdot \,|\) at \(\mathrm {K}_{\theta }\) is given by \(H \mapsto \mathrm {Trace}(\mathrm {K}_{\theta }^{-1} H)\). Thus, the induced operator norm with respect to the Frobenius norm \(\Vert \,\cdot \,\Vert _F\) is \(\Vert \mathrm {K}_{\theta }^{-1}\Vert _F\). We can then apply results from Trefethen and Bau (1997) to get a local condition number of the mapping \(\mathrm {A} \mapsto \log |\mathrm {A}|\) at \(\mathrm {K}_{\theta }\):

$$\begin{aligned} \kappa ( \log |\,\cdot \,|,\, \mathrm {K}_{\theta }) \triangleq \lim _{\epsilon \rightarrow 0} \sup _{\Vert \delta _{\mathrm {A}}\Vert _F \le \epsilon } \frac{\bigl | \log |\mathrm {K}_{\theta } + \delta _{\mathrm {A}}| - \log |\mathrm {K}_{\theta }| \bigr | }{\bigl |\log |\mathrm {K}_{\theta }|\bigr |} \frac{\Vert \mathrm {K}_{\theta }\Vert _F}{\Vert \delta _{\mathrm {A}}\Vert _F} = \frac{\sqrt{\sum _{i=1}^n \frac{1}{\lambda _i^2}} \sqrt{\sum _{i=1}^n \lambda _i^2}}{|\sum _{i=1}^n \log (\lambda _i)|} \end{aligned}$$
(5)

where \(\lambda _1, \cdots , \lambda _n\) are the (positive) eigenvalues of \(\mathrm {K}_{\theta }\). Then, we have

$$\begin{aligned} \frac{\kappa (\mathrm {K}_{\theta })}{| \sum _{i=1}^n \log (\lambda _i)| } \le \kappa (\log |\,\cdot \,|,\, \mathrm {K}_{\theta }) \le \frac{n \kappa (\mathrm {K}_{\theta })}{|\sum _{i=1}^n \log (\lambda _i)| }, \end{aligned}$$
(6)

which shows that numerical noise on \(\log |\mathrm {K}_{\theta }|\) is linked to the condition number of \(\mathrm {K}_{\theta }\).

The local condition number of the quadratic form \(\frac{1}{2}\underline{\text {Z}} _n^{\mathsf {T}}\mathrm {K}_{\theta }^{-1}\underline{\text {Z}} _n\) as a function of \(\underline{\text {Z}} _n\) can also be computed analytically. Some straightforward calculations show that it is bounded by \(\kappa (\mathrm {K}_{\theta })\).
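Both condition numbers can be computed directly from the spectrum of \(\mathrm {K}_{\theta }\); the following sketch implements the ratio \(|\lambda _{\max }/\lambda _{\min }|\) and formula (5):

```python
# Condition numbers used in the noise analysis: kappa(K_theta) and the local
# condition number (5) of the log-determinant at K_theta.
import numpy as np

def condition_numbers(K):
    lam = np.linalg.eigvalsh(K)                         # eigenvalues of the symmetric matrix K_theta
    kappa = lam.max() / lam.min()                       # |lambda_max / lambda_min|
    kappa_logdet = (np.sqrt(np.sum(1.0 / lam**2)) * np.sqrt(np.sum(lam**2))
                    / abs(np.sum(np.log(lam))))         # Eq. (5)
    return kappa, kappa_logdet
```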

(When the optimization algorithm stops in the example of Fig. 2, we have \(\kappa (\mathrm {K}_{\theta }) \simeq 10^{11}\) and \(\kappa (\log |\,\cdot \,|,\, \mathrm {K}_{\theta }) \simeq 10^{9.5}\). The empirical numerical fluctuations are measured as the residuals of a local second-order polynomial best fit, giving noise levels \(10^{-7}\), \(10^{-8}\) and \(10^{-7.5}\) for \(\mathrm {K}_{\theta }^{-1}\underline{\text {Z}} _n\), \(\frac{1}{2}\underline{\text {Z}} _n^{\mathsf {T}}\mathrm {K}_{\theta }^{-1}\underline{\text {Z}} _n\) and \(\log |\mathrm {K}_{\theta }|\) respectively. These values are consistent with the above first-order analysis.)

Thus, when \(\kappa (\mathrm {K}_{\theta })\) becomes large in the course of the optimization procedure, numerical noise on the likelihood and its gradient may trigger an early stopping of the optimization algorithm (presumably when the algorithm is unable to find a proper direction of improvement). It is well known that \(\kappa (\mathrm {K}_{\theta })\) becomes large when \(\sigma _{\varepsilon }^2 = 0\) and one of the following conditions occurs: 1) the data points are close to each other, 2) the covariance function is very smooth (as, for instance, the squared exponential covariance), or 3) the range parameters \(\rho _i\) are large. These conditions arise more often than not. Therefore, numerical noise in the evaluation of the likelihood and its gradient is a problem that should not be neglected in GP implementations.

Fig. 2.

Noisy NLL profile along a particular direction in the parameter space, with a best linear fit (orange line). This example was obtained with GPy while estimating the parameters of a Matérn 5/2 covariance, using 20 data points sampled from the Branin function, and setting \(\sigma _{\varepsilon }^2 = 0\). The red vertical line indicates the location where the optimization of the likelihood stalled. (Color figure online)

The most classical approach to deal with ill-conditioned covariance matrices is to add a small positive number, called jitter, on the diagonal of the covariance matrix, which is equivalent to assuming a small observation noise with variance \(\sigma _{\varepsilon }^2>0\). In GPy for instance, the strategy consists in always setting a minimal jitter of \(10^{-8}\), which is automatically increased by an amount ranging from \(10^{-6} \sigma ^2\) to \(10^{-1} \sigma ^2\) whenever the Cholesky factorization of the covariance matrix fails (due to numerical non-positive-definiteness). The smallest jitter making \(\mathrm {K}_{\theta }\) numerically invertible is kept, and an error is thrown if no jitter allows for a successful factorization. However, note that large values of the jitter may yield smooth, non-interpolating approximations, with possibly unintuitive and undesirable effects (see Andrianakis and Challenor 2012), and may cause convergence problems in Bayesian optimization.
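A simplified sketch of such an adaptive-jitter mechanism is given below; it mimics the strategy described above, but it is not GPy's actual implementation:

```python
# Simplified sketch of an adaptive-jitter strategy (not GPy's actual code): try a
# Cholesky factorization with increasing jitter, relative to the kernel variance sigma2.
import numpy as np

def cholesky_with_jitter(K, sigma2, factors=(0.0, 1e-6, 1e-5, 1e-4, 1e-3, 1e-2, 1e-1)):
    for f in factors:
        try:
            L = np.linalg.cholesky(K + f * sigma2 * np.eye(K.shape[0]))
            return L, f * sigma2         # smallest jitter that makes K factorizable
        except np.linalg.LinAlgError:
            continue
    raise np.linalg.LinAlgError("K is numerically non-positive-definite, even with jitter")
```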

Table 4 illustrates the behaviour of GP interpolation when \(\sigma _{\varepsilon }^2\) is increased. It appears that finding a satisfactory trade-off between good interpolation properties and a low numerical-noise level can be difficult. Table 4 also supports the connection, established in (4) and (6), between the noise levels and \(\kappa (\mathrm {K}_{\theta })\). In view of the results of Fig. 1 (which are based on the default settings of GPy) and of Table 4, we believe that the adaptive jitter cannot be considered a do-it-all solution.

Table 4. Influence of the jitter on the GP model (same setting as in Fig. 2). The table reports the condition numbers \(\kappa (\mathrm {K}_{\theta })\) and \(\kappa (\log |\,\cdot \,|,\, \mathrm {K}_{\theta })\), and the impact on the relative empirical standard deviations \(\delta _\mathrm{quad}\) and \(\delta _\mathrm{logdet}\) of the numerical noise on \(\underline{\text {Z}} _n^T \mathrm {K}_{\theta }^{-1} \underline{\text {Z}} _n\) and \(\log |\mathrm {K}_{\theta }|\) respectively (measured using second-order polynomial regressions). As \(\sigma _{\varepsilon }\) increases, \(\delta _\mathrm{quad}\) and \(\delta _\mathrm{logdet}\) decrease but the interpolation error \(\sqrt{\mathrm {SSR}/\mathrm {SST}} = \sqrt{\frac{1}{n} \sum _{j = 1}^n (Z_j - \hat{Z}_n(x_j))^2} / \mathrm {std}(Z_1, ..., Z_n)\) and the NLL increase. Reducing numerical noise while keeping good interpolation properties requires careful attention in practice.

4 Strategies for Improving Likelihood Maximization

In this section we investigate simple but hopefully effective strategies for improving available implementations of MLE for GP interpolation, beyond the control of the numerical noise on the likelihood using jitter. We mainly focus on 1) initialization methods for the optimization procedure, 2) stopping criteria, 3) the effect of “restart” strategies, and 4) the effect of the parameterization of the covariance.

4.1 Initialization Strategies

Most GP implementations maximize the likelihood with a gradient-based local optimization algorithm, which requires the specification of starting (initial) values for the parameters. In the following, we consider different initialization strategies.

Moment-Based Initialization. A first strategy consists in setting the parameters using empirical moments of the data. More precisely, assuming a constant mean \(m = \mu \), and a stationary covariance k with variance \(\sigma ^2\) and range parameters \(\rho _1,\,\ldots ,\,\rho _d\), set

$$\begin{aligned} \mu _{\mathrm {init}} = \mathop {\mathrm {mean}}\, (Z_1,\, \ldots ,\,Z_n), \end{aligned}$$
(7)
$$\begin{aligned} \sigma ^2_{\mathrm {init}} = \mathop {\mathrm {var}}\, (Z_1,\, \ldots ,\,Z_n), \end{aligned}$$
(8)
$$\begin{aligned} \rho _{k,\, \mathrm {init}} = \mathop {\mathrm {std}}\, (x_{1,\,[k]},\,\ldots ,\,x_{n,\,[k]}),\quad k=1,\, \ldots ,\,d, \end{aligned}$$
(9)

where \(\mathrm {mean}\), \(\mathrm {var}\) and \(\mathrm {std}\) stand for the empirical mean, variance and standard deviation, and \(x_{i,\,[k]}\) denotes the kth coordinate of \(x_i\in \mathbb {R}^d\). The rationale behind (9) (following, e.g., Rasmussen and Williams 2006) is that the range parameters can be thought of as the distance one has to move in the input space for the function value to change significantly, and we assume, a priori, that this distance is linked to the dispersion of the data points.
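A direct implementation of (7)–(9) is straightforward; a minimal sketch:

```python
# Moment-based initialization (7)-(9).
import numpy as np

def moment_based_init(X, z):
    mu_init = np.mean(z)                 # Eq. (7)
    sigma2_init = np.var(z)              # Eq. (8)
    rho_init = np.std(X, axis=0)         # Eq. (9): one range parameter per input dimension
    return mu_init, sigma2_init, rho_init
```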

In GPy for instance, the default initialization consists in setting \(\mu = 0\), \(\sigma ^2 = 1\) and \(\rho _k = 1\) for all k. This is equivalent to the moment-based initialization scheme when the data (both inputs and outputs) are centered and standardized. The practice of standardizing the input domain into a unit length hypercube has been proposed (see, e.g., Snoek et al. 2012) to deal with numerical issues that arise due to large length scale values.

Profiled Initialization. Assume the range parameters \(\rho _1,\,\ldots ,\, \rho _d\) (and more generally, all parameters different from \(\sigma ^2\), \(\sigma _{\varepsilon }^2\) and \(\mu \)) are fixed, and set \(\sigma _{\varepsilon }^2 = \alpha \sigma ^2\), with a prescribed multiplicative factor \(\alpha \ge 0\). In this case, the NLL can be optimized analytically w.r.t. \(\mu \) and \(\sigma ^2\). Optimal values turn out to be the generalized least squares solutions

$$\begin{aligned} \mu _{\mathrm {GLS}}&= (\mathbbm {1}_n^{\mathsf {T}}\mathrm {K}_{\tilde{\theta }}^{-1} \mathbbm {1}_n)^{-1}\mathbbm {1}_n^{\mathsf {T}}\mathrm {K}_{\tilde{\theta }}^{-1}\underline{\text {Z}} _n\,, \end{aligned}$$
(10)
$$\begin{aligned} \sigma _{\mathrm {GLS}}^2&= \frac{1}{n} (\underline{\text {Z}} _n - \mu _{\mathrm {GLS}}\mathbbm {1}_n)^{\mathsf {T}}\mathrm {K}_{\tilde{\theta }}^{-1} (\underline{\text {Z}} _n - \mu _{\mathrm {GLS}}\mathbbm {1}_n)\,, \end{aligned}$$
(11)

where \(\tilde{\theta } = (\sigma ^2,\,\rho _1,\,\ldots ,\, \rho _d,\ldots ,\, \sigma _{\varepsilon }^2)^{\mathsf {T}}\in \varTheta \), with \(\sigma ^2=1\) and \(\sigma _{\varepsilon }^2= \alpha \). Under the profiled initialization scheme, \(\rho _{1}, \ldots , \rho _{d}\) are set using (9), \(\alpha \) is prescribed according to the user’s preference, and \(\mu \) and \(\sigma ^2\) are initialized using (10) and (11).
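The generalized least-squares formulas (10)–(11) can be implemented as follows; this is a sketch assuming a function corr_matrix that returns the correlation matrix of the observations, i.e. \(\mathrm {K}_{\tilde{\theta }}\) for \(\sigma ^2 = 1\) and \(\sigma _{\varepsilon }^2 = 0\), to which \(\alpha \) is added on the diagonal:

```python
# Profiled initialization: GLS estimates (10)-(11) of mu and sigma^2, for fixed
# range parameters rho and sigma_eps^2 = alpha * sigma^2.
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def profiled_init(X, z, rho, corr_matrix, alpha=0.0):
    n = len(z)
    R = corr_matrix(X, rho) + alpha * np.eye(n)      # K_theta_tilde
    c, low = cho_factor(R)
    ones = np.ones(n)
    mu_gls = (ones @ cho_solve((c, low), z)) / (ones @ cho_solve((c, low), ones))    # Eq. (10)
    resid = z - mu_gls
    sigma2_gls = (resid @ cho_solve((c, low), resid)) / n                            # Eq. (11)
    return mu_gls, sigma2_gls
```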

Grid-Search Initialization. Grid-search initialization is a profiled initialization with the addition of a grid-search optimization for the range parameters.

Define a nominal range vector \(\rho _0\) such that

$$\begin{aligned} \rho _{0,[k]} \;=\; \sqrt{d}\, \left( \max _{1 \le i \le n} x_{i,[k]} - \min _{1 \le i \le n} x_{i,[k]} \right) ,\quad 1 \le k \le d. \end{aligned}$$

Then, define a one-dimensional grid of size L (e.g., \(L=5\)) by taking range vectors proportional to \(\rho _0\): \(\{\alpha _1 \rho _0,\, \ldots ,\, \alpha _L \rho _0\}\), where the \(\alpha _i\) range, on a logarithmic scale, from a “small” value (e.g., \(\alpha _1 = 1/50\)) to a “large” value (e.g., \(\alpha _L = 2\)). For each point of the grid, the likelihood is optimized with respect to \(\mu \) and \(\sigma ^2\) using (10) and (11). The range vector with the best likelihood value is selected. (This initialization procedure is the default in the Matlab/GNU Octave toolbox STK.)
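The grid-search initialization can then be sketched as follows, with the profiled NLL evaluated (up to an additive constant) for each candidate range vector; corr_matrix is the assumed correlation-matrix function of the previous sketch:

```python
# Grid-search initialization: profiled NLL on a logarithmic grid of range vectors
# proportional to the nominal range rho_0.
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def grid_search_init(X, z, corr_matrix, alpha=0.0, L=5, a_min=1.0 / 50, a_max=2.0):
    n, d = X.shape
    rho0 = np.sqrt(d) * (X.max(axis=0) - X.min(axis=0))    # nominal range vector
    ones, best = np.ones(n), None
    for a in np.logspace(np.log10(a_min), np.log10(a_max), L):
        rho = a * rho0
        R = corr_matrix(X, rho) + alpha * np.eye(n)
        c, low = cho_factor(R)
        mu = (ones @ cho_solve((c, low), z)) / (ones @ cho_solve((c, low), ones))   # Eq. (10)
        resid = z - mu
        sigma2 = (resid @ cho_solve((c, low), resid)) / n                           # Eq. (11)
        # Profiled NLL up to an additive constant, with K_theta = sigma2 * R
        nll = 0.5 * n * np.log(sigma2) + np.sum(np.log(np.diag(c))) + 0.5 * n
        if best is None or nll < best[0]:
            best = (nll, mu, sigma2, rho)
    return best[1], best[2], best[3]      # mu_init, sigma2_init, rho_init
```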

4.2 Stopping Condition

Most GP implementations rely on well-tested gradient-based optimization algorithms. For instance, a popular choice in Python implementations is to use the limited-memory BFGS algorithm with box constraints (L-BFGS-B; see Byrd et al. 1995) of the SciPy ecosystem. (Other popular optimization algorithms include the ordinary BFGS, truncated Newton constrained, SQP, etc.; see, e.g., Nocedal and Wright (2006).) The L-BFGS-B algorithm, which belongs to the class of quasi-Newton algorithms, uses limited-memory Hessian approximations and shows good performance on non-smooth functions (Curtis and Que 2015).

Regardless of which optimization algorithm is chosen, the user usually has the possibility to tune the behavior of the optimizer, and in particular to set the stopping condition. Generally, the stopping condition is met when a maximum number of iterations is reached or when a norm of the steps and/or the gradient becomes smaller than a threshold.

By increasing the strictness of the stopping condition during the optimization of the likelihood, one would expect better parameter estimates, provided the numerical noise on the likelihood does not interfere too much.

4.3 Restart and Multi-start Strategies

Due to numerical noise and possible non-convexity of the likelihood with respect to the parameters, gradient-based optimization algorithms may stall far from the global optimum. A common approach to circumvent the issue is to carry out several optimization runs with different initialization points. Two simple strategies can be compared.

Table 5. Two popular reparameterization mappings \(\tau \), as implemented, for example, in GPy and STK respectively. For invsoftplus, notice parameter \(s>0\), which is introduced when input standardization is considered (see Sect. 5).

Restart. In view of Fig. 2, a first simple strategy is to restart the optimization algorithm to clear its memory (Hessian approximation, step sizes...), hopefully allowing it to escape a possibly problematic location, using the last best parameters as initial values for the next optimization run. The optimization can be restarted several times, until a budget \(N_{\mathrm {opt}}\) of restarts is spent or the best likelihood value no longer improves.

Multi-start. Given an initialization point \((\theta _{\mathrm {init}}, \mu _{\mathrm {init}})\in \varTheta \times \mathbb {R}\), a multi-start strategy consists in running \(N_{\mathrm {opt}} > 1\) optimizations with different initialization points corresponding to perturbations of the initial point \((\theta _{\mathrm {init}}, \mu _{\mathrm {init}})\). In practice, we suggest the following rule for building the perturbations: first, perturb the range parameters around \((\rho _{1,\,\mathrm {init}},\, \ldots ,\, \rho _{d,\,\mathrm {init}})^{\mathsf {T}}\) (refer to Sect. 5 for an implementation); then, propagate the perturbations to \(\mu \) and \(\sigma ^2\) using (10) and (11). The parameters with the best likelihood value over all optimization runs are selected.
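A simplified multi-start sketch for a GPy model is given below. It only perturbs the range parameters of an ARD kernel, whereas the full procedure described above (and used in Sect. 5) also propagates the perturbations to \(\mu \) and \(\sigma ^2\) through (10) and (11), and restores all parameters of the best run:

```python
# Simplified multi-start sketch for a GPy regression model with an ARD kernel.
import numpy as np

def multistart_mle(model, rho_init, n_opt=5, sigma_eta=np.log10(5) / 1.96, seed=0):
    rng = np.random.default_rng(seed)
    best_nll, best_rho = np.inf, None
    for i in range(n_opt):
        # log10-scale multiplicative perturbation of the initial range parameters
        eta = rng.normal(0.0, sigma_eta, size=len(rho_init)) if i > 0 else np.zeros(len(rho_init))
        model.kern.lengthscale[:] = rho_init * 10.0 ** eta
        model.optimize()                           # local gradient-based run from this start
        nll = -model.log_likelihood()
        if nll < best_nll:
            best_nll, best_rho = nll, np.array(model.kern.lengthscale)
    model.kern.lengthscale[:] = best_rho           # keep the best range parameters found
    return best_nll
```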

4.4 Parameterization of the Covariance Function

The parameters of the covariance functions are generally positive real numbers (\(\sigma ^2\), \(\rho _1, \rho _2\ldots \)) and are related to scaling effects that act “multiplicatively” on the predictive distributions. Most GP implementations introduce a reparameterization using a monotonic one-to-one mapping \(\tau :\mathbb {R}_+^{\star } \rightarrow \mathbb {R}\), acting component-wise on the positive parameters of \(\theta \), resulting in a mapping \(\tau :\varTheta \rightarrow \varTheta ^{\prime }\). Thus, for carrying out MLE, the actual criterion J that is optimized in most implementations may then be written as

$$\begin{aligned} J: \theta ^{\prime }\in \varTheta ^{\prime } \mapsto - \log (\mathcal {L}(\underline{\text {Z}} _n|\tau ^{-1}(\theta ^{\prime }), \mu )) . \end{aligned}$$
(12)

Table 5 lists two popular reparameterization mappings \(\tau \).

The effect of reparameterization is to “reshape” the likelihood. Typical likelihood profiles using the log and the so-called invsoftplus reparameterizations are shown in Fig. 3. Notice that the NLL may be almost flat in some regions, depending on the reparameterization. Changing the shape of the optimization criterion, combined with numerical noise, may or may not facilitate the convergence of the optimization.
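For reference, one common convention for the two mappings of Table 5 is sketched below; the exact form used by GPy and STK, and in particular the role of the scaling factor s, may differ in detail:

```python
# Sketch of the log and invsoftplus reparameterizations (one possible convention).
import numpy as np

def log_reparam(theta):
    return np.log(theta)                         # theta' = log(theta)

def log_reparam_inv(theta_prime):
    return np.exp(theta_prime)

def invsoftplus(theta, s=1.0):
    return np.log(np.expm1(s * theta))           # theta' = log(exp(s * theta) - 1)

def invsoftplus_inv(theta_prime, s=1.0):
    return np.log1p(np.exp(theta_prime)) / s     # softplus, mapped back to the original scale
```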

Fig. 3.

Profiles of the NLL along a linear path t through the profiled initialization point (at zero, blue vertical line) and the optimum (at one, black vertical line). Orange (resp. blue) line corresponds to the log (resp. invsoftplus) reparameterization. (Color figure online)

5 Numerical Study

5.1 Methodology

The main metric used in this numerical study is based on empirical cumulative distribution functions (ECDFs) of differences in NLL values.

More precisely, consider \(N + 1\) optimization schemes \(S_0, S_1,\ldots , S_N\), where \(S_0\) stands for a “brute-force” optimization scheme based on a very large number of multi-starts, which is assumed to provide a robust MLE, and \(S_1,\ldots , S_N\) are optimization schemes to be compared. Each optimization scheme is run on M data sets \(D_j\), \(1 \le j \le M\), and we denote by \(e_{i,\,j}\) the difference

$$\begin{aligned} e_{i,j} = \mathrm {NLL}_{i,\,j} - \mathrm {NLL}_{0,\, j} , \quad 1 \le i \le N, \quad 1 \le j \le M, \end{aligned}$$

where \(\mathrm {NLL}_{i,j}\) denotes the NLL value obtained by optimization scheme \(S_i\) on data set \(D_j\).

A good scheme \(S_i\) should concentrate the empirical distribution of the sample \(E_i = \{e_{i,j}, j=1,\ldots ,\, M\}\) around zero—in other words, its ECDF should be close to the ideal CDF \(e \mapsto \mathbbm {1}_{[0,\infty [}(e)\). Using ECDFs also provides a convenient way to compare performances: a strategy with a “steeper” ECDF, or a larger area under the ECDF, is better.
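The following sketch computes the ECDF of the NLL differences for one scheme, together with the area under the ECDF used as a scalar summary in Sect. 5.4 (here with a simple trapezoidal approximation):

```python
# ECDF of the NLL differences e_{i, .} for one optimization scheme, and the area
# under the ECDF between 0 and nll_max (100 for a perfect scheme, cf. Sect. 5.4).
import numpy as np

def ecdf_and_area(e, nll_max=100.0):
    e = np.sort(np.clip(e, 0.0, nll_max))        # differences clipped to [0, nll_max]
    grid = np.linspace(0.0, nll_max, 1001)
    ecdf = np.searchsorted(e, grid, side="right") / len(e)
    area = np.trapz(ecdf, grid)                  # trapezoidal approximation of the area
    return grid, ecdf, area
```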

5.2 Optimization Schemes

All experiments are performed using GPy version 1.9.9, with the default L-BFGS-B algorithm. We use a common setup and vary the configurations of the optimization levers as detailed below.

Common Setup. All experiments use an estimated constant mean function, an anisotropic Matérn covariance function with regularity \(\nu = 5/2\), and we assume no observation noise (the adaptive jitter of GPy, ranging from \(10^{-6} \sigma ^2\) to \(10^{2} \sigma ^2\), is used, however).

Initialization Schemes. Three initialization procedures from Sect. 4.1 are considered.

Stopping Criteria. We consider two settings for the stopping condition of the L-BFGS-B algorithm, called soft (the default setting: maxiter \(=1000\), factr \(=10^{7}\), pgtol \(=10^{-5}\)) and strict (maxiter \(=1000\), factr \(=10\), pgtol \(=10^{-20}\)).
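These soft and strict settings correspond to the factr, pgtol and maxiter arguments of SciPy's L-BFGS-B routine, on which GPy's default optimizer relies. A sketch of a direct call, assuming functions nll and nll_grad for the (reparameterized) criterion (12) and its gradient:

```python
# Soft (default) vs strict stopping settings for SciPy's L-BFGS-B routine.
from scipy.optimize import fmin_l_bfgs_b

SOFT = dict(maxiter=1000, factr=1e7, pgtol=1e-5)
STRICT = dict(maxiter=1000, factr=10.0, pgtol=1e-20)

def minimize_nll(nll, nll_grad, theta0, settings=SOFT):
    """Run L-BFGS-B on the reparameterized NLL with the given stopping settings."""
    theta_opt, nll_opt, info = fmin_l_bfgs_b(nll, theta0, fprime=nll_grad, **settings)
    return theta_opt, nll_opt, info
```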

Restart and Multi-start. The two strategies of Sect. 4.3 are implemented using a log reparameterization and initialization points \((\theta _{\mathrm {init}}, \mu _{\mathrm {init}})\) determined using a grid-search strategy. For the multi-start strategy, the initial range parameters are perturbed according to the rule \(\rho \leftarrow \rho _{\mathrm {init}} \cdot 10^{\eta }\), where \(\eta \) is drawn from a \(\mathcal {N}(0, \sigma _{\eta }^2)\) distribution. We take \(\sigma _{\eta } = \log _{10}(5)/1.96 \approx 0.35\), to ensure that about 95% of the distribution of \(\rho \) lies in the interval \(\left[ 1/5 \cdot \rho _{\mathrm {init}},~5\cdot \rho _{\mathrm {init}}\right] \).

Reparameterization. We study the log reparameterization and two variants of the invsoftplus. The first version, called no-input-standardization, simply corresponds to taking \(s=1\) for each range parameter. The second version, called input-standardization, consists in scaling the inputs to unit standard deviation on each dimension (by taking the corresponding value for s).

5.3 Data Sets

The data sets are generated from six well-known test functions in the literature of Bayesian optimization: the Branin function (\(d=2\); see, e.g. Surjanovic and Bingham 2013), the Borehole function (\(d=8\); see, e.g. Worley 1987), the Welded Beam Design function (\(d=4\); see Chafekar et al. 2003), the g10 function (\(d=8\); see Ahmed 2004, p. 128), along with two modified versions, g10mod and g10modmod (see Feliot 2017).

Each function is evaluated on Latin hypercube samples with a multi-dimensional uniformity criterion (LHS-MDU; Deutsch and Deutsch 2012), with varying sample size \(n\in \{3d,\, 5d,\, 10d,\,20d\}\), resulting in a total of \(6\times 4 = 24\) data sets.

5.4 Results and Findings

Figure 4 shows the effect of the reparameterization and of the initialization method. Observe that the log reparameterization performs significantly better than the invsoftplus reparameterizations. For the log reparameterization, the grid-search strategy brings a moderate but non-negligible gain with respect to the two other initialization strategies, which behave similarly.

Fig. 4.

Initialization and reparameterization methods. (a) ECDFs corresponding to the best initialization method for each of the three reparameterizations—red line: log reparam. with grid-search init.; green line: invsoftplus with input-standardization reparam. and grid-search init; blue line: invsoftplus with no-input-standardization reparam. and moment-based init. (b) ECDFs for different initialization methods for the \(\log \) reparameterization. (Color figure online)

Fig. 5.

Area under the ECDF against run time: (a) restart strategy; (b) multi-start strategy. The maximum areas obtained are respectively 86.538 and 88.504.

Next, we study the effect of the different restart strategies and of the stopping conditions, in the case of the log reparameterization and grid-search initialization. The metric used for the comparison is the area under the ECDFs of the differences of NLLs, computed by integrating the ECDF between 0 and \(\mathrm {NLL}_{\text {max}} = 100\). Thus, a perfect optimization strategy would achieve an area under the ECDF equal to 100. Since the multi-start strategy is stochastic, results are averaged over 50 repetitions of the optimization procedures (for each \(N_{\mathrm {opt}}\) value, the optimization strategy is repeated 50 times). The areas are plotted against the computational run time. Run times are averaged over the repetitions in the case of the multi-start strategy.

Figure 5 shows that the soft stopping condition seems uniformly better. The restart strategy yields small improvements at a moderate computational overhead. The multi-start strategy achieves the best results, at the price of higher computational costs.

6 Conclusions and Recommendations

Our numerical study has shown that the parameterization of the covariance function has the most significant impact on the accuracy of MLE in GPy. Using restart/multi-start strategies is also very beneficial to mitigate the effect of the numerical noise on the likelihood. The two other levers have a second-order but nonetheless measurable influence.

These observations make it possible to devise a recommended combination of improvement levers—for GPy at least, but hopefully transferable to other software packages as well. When computation time matters, an improved optimization procedure for MLE consists in combining the log reparameterization, the grid-search initialization, the soft (GPy’s default) stopping condition, and a small number, say \(N_{\mathrm {opt}}=5\), of restarts.

Figure 1 and Table 2 are based on the above optimization procedure, which results in significantly better likelihood values and smaller prediction errors. The multi-start strategy can be used when accurate results are sought.

In conclusion, our recommendations are not intended to be universal, but we hope that they will encourage researchers and users to develop and use more reliable and more robust GP implementations, in Bayesian optimization and elsewhere.