1 Introduction

Kriging models (Cressie 1993; Stein 1999) are non-parametric statistical models which have been used in many different fields to infer the output of a function y from a few observations. Applications include geostatistics (Krige 1951; Matheron 1963), the approximation of numerical experiments (Sacks et al 1989; Santner et al 2003), and machine learning, where the method is known as Gaussian process (GP) regression (Rasmussen and Williams 2006).

One of the main drawbacks of the Kriging method is that it scales poorly to large problems: it suffers from the curse of dimensionality (Bellman 1966) when the dimension of the input is large. This issue is especially prevalent in engineering design optimization (Sobester et al 2008), as industrial designs are commonly parametrized by more than 50 shape parameters (Shan and Wang 2010; Gaudrie et al 2020). In this context, Kriging surrogate models are used to approximate the response of a computationally expensive numerical experiment based on a limited number of observations. The sample plan is usually built with a sequential strategy in which an initial design of experiments is completed with new samples obtained, at each iteration, by maximizing an acquisition criterion on the surrogate (Jones et al 1998). It is therefore important that the surrogate is accurate even with few observations, as in the first iterations, since this directly impacts the number of additional samples needed for the optimization to converge (and thus the convergence speed).

The main challenge in building high-dimensional Kriging models resides in the hyperparameter optimization. Most Kriging models consider one length-scale hyperparameter per dimension, all of which need to be optimized simultaneously, and this multidimensional optimization problem can be difficult to solve. Typically, the optimization is performed either by maximizing the likelihood of the model (Jones 2001) or by minimizing the leave-one-out cross-validation (LOOCV) error (Bachoc 2013). However, both methods involve the inversion of the covariance matrix, with a cost in \(O(n^3)\) (where \(n\) is the number of sample points). In design optimization, the number of samples is usually limited due to the cost of obtaining each of them, so a single inversion is manageable. Yet, the optimization requires many of these inversions, especially in high dimension, as the size of the search space grows exponentially with the number of hyperparameters. As such, due to the large number of iterations needed to converge, the hyperparameter optimization can be prohibitively expensive even for a limited number of samples. One way to reduce the cost of the optimization is to use an approximation of the covariance matrix inverse, such as those developed for Kriging models with a large number of observations where the cost of an inversion is prohibitive (see Liu et al (2020) for a review). For example, in Quinonero-Candela and Rasmussen (2005), Titsias (2009) and Hensman et al (2013), the authors use a low-rank approximation of the covariance matrix to reduce the computational cost of the inversion. However, most of those methods are designed only for a large number of samples.

Besides the cost of the hyperparameter optimization, in high dimension the training data is often sparse in the input space, since the volume of the design space grows exponentially with the dimension. This, along with the large number of hyperparameters, can cause the usual optimization criteria to over-fit the training data, leading to a poor estimation of the hyperparameters even when the optimization has converged (Ginsbourger et al 2009; Mohammed and Cawley 2017). Reducing the dimension of the problem is a way to address these issues (see Binois and Wycoff (2021) for a review), but, because y is computationally expensive, a classical sensitivity analysis (Saltelli et al 2008) cannot be performed beforehand for variable selection. Some methods reduce the dimension by embedding the design space into a lower-dimensional space (Constantine 2015; Bouhlel et al 2016). Additive Kriging (Durrande et al 2012) is another approach, where y is decomposed into a sum of one-dimensional components, enabling a sequential optimization of the length-scale hyperparameters.

In this paper, we propose a new method to tackle the challenging hyperparameter optimization for high-dimensional problems. Our approach avoids this optimization by combining Kriging sub-models with random length-scales. It replaces the challenging inner optimization of the length-scales with an optimization of the combination weights, which is much simpler and whose solution can be obtained in closed form. It also avoids reducing the dimension of the design space and preserves the correlation between all the input variables. This article starts by briefly recalling the main concepts of Kriging and introducing the employed notations in Sect. 2. Our combined Kriging method is detailed in Sect. 3. Finally, results of our method on numerical test problems are presented and discussed in Sect. 4.

2 Kriging model

2.1 Kriging predictions

This section briefly recalls the Kriging method and introduces the notations used throughout this paper. We denote by \(y: \varvec{x}\in {\mathcal {X}} \subset {\mathbb {R}}^d\rightarrow y(\varvec{x}) \in {\mathbb {R}}\) the \(d\)-dimensional black-box function that we want to approximate. We suppose y is known at a set of \(n\) sample points \(\varvec{X}= \left( \varvec{x}_1,\dots ,\varvec{x}_n\right) ^T\) and we denote by \(\varvec{Y}= \left( y(\varvec{x}_1),\dots ,y(\varvec{x}_n) \right) ^T\) the observed values at these locations. The Kriging method approximates y as the realization of a Gaussian process on \({\mathcal {X}}\):

$$\begin{aligned} Y(.) \sim GP \left( \mu ,\sigma ^2k_{\varvec{\theta }}(.,.) \right) . \end{aligned}$$

Without loss of generality, we can assume that the GP is centered (\(\mu =0\)). \(k_{\varvec{\theta }}: {\mathcal {X}} \times {\mathcal {X}} \rightarrow [-1,1]\) is the positive definite correlation function indexed by the hyperparameters \(\varvec{\theta }\in {\mathbb {R}}^d\), the correlation length-scales vector (also called range or scale parameters) with one length-scale value per dimension of the input space. Finally, \(\sigma ^2k_{\varvec{\theta }}\) is the covariance function (also called kernel) with \(\sigma ^2 \in {\mathbb {R}}^+\) the variance of the GP. A stationary GP with a Matérn-class covariance function is often recommended (Stein 1999; Rasmussen and Williams 2006). Throughout this paper, we use the radial Matérn 5/2 correlation defined as:

$$\begin{aligned} k_{\varvec{\theta }}(\varvec{x},\varvec{x}') :=\left( 1+\sqrt{5} \left\| \frac{\varvec{x}-\varvec{x}'}{\varvec{\theta }}\right\| + \frac{5}{3}\left\| \frac{\varvec{x}-\varvec{x}'}{\varvec{\theta }}\right\| ^2 \right) \exp \left( - \sqrt{5} \left\| \frac{\varvec{x}-\varvec{x}'}{\varvec{\theta }}\right\| \right) , \end{aligned}$$
(1)

where \(\left\| \frac{\varvec{x}-\varvec{x}'}{\varvec{\theta }}\right\| \) is the scaled distance between two points \(\varvec{x}, \varvec{x}'\in {\mathcal {X}}\) using component-wise division: \(\left\| \frac{\varvec{x}-\varvec{x}'}{\varvec{\theta }}\right\| ^2 :=\sum _{i=1}^{d} \left( \frac{{x}^{(i)}-{x'}^{(i)}}{\theta ^{(i)}}\right) ^2.\) This is a typical choice for design optimization (Roustant et al 2012), and even when the covariance is misspecified, a proper estimation of the hyperparameters can still yield a model with good predictive capacities (Bachoc 2013). Other covariances could be used if a priori knowledge about the unknown function is available.
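
As an illustration, the correlation of Eq. (1) can be evaluated in a few lines. The snippet below is a minimal sketch in Python/NumPy; the helper name `matern52` is our own choice, not from the paper.

```python
import numpy as np

def matern52(x, x_prime, theta):
    """Anisotropic Matérn 5/2 correlation of Eq. (1): the distance between
    the two points is scaled component-wise by the length-scales theta."""
    h = np.linalg.norm((np.asarray(x) - np.asarray(x_prime)) / np.asarray(theta))
    return (1.0 + np.sqrt(5.0) * h + 5.0 / 3.0 * h ** 2) * np.exp(-np.sqrt(5.0) * h)

# Example in d = 3: a larger length-scale along a dimension reduces the
# contribution of that dimension to the scaled distance.
print(matern52([0.0, 0.0, 0.0], [0.5, 0.2, 0.1], theta=[1.0, 1.0, 1.0]))
```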

The Simple Kriging (SK) predictor is a linear combination of the observations which is obtained by conditioning the Gaussian process Y over \({\mathcal {D}}=(\varvec{X},\varvec{Y})\):

$$\begin{aligned} M(\varvec{x}) :=E(Y(\varvec{x}) \vert {\mathcal {D}}) = k_{\varvec{\theta }}(\varvec{x},\varvec{X})k_{\varvec{\theta }}(\varvec{X},\varvec{X})^{-1}\varvec{Y}, \end{aligned}$$
(2)

where \(k_{\varvec{\theta }}(\varvec{x},\varvec{X})\) is the vector of correlations between the prediction point \(\varvec{x}\) and the sample points \(\varvec{X}\), and \(k_{\varvec{\theta }}(\varvec{X},\varvec{X})\) is the correlation matrix of the model, i.e. the \(n\times n\) matrix of correlations between the sample points. Note that this predictor does not depend on \(\sigma ^2\). The predictive variance of the model can also be obtained as:

$$\begin{aligned} \hat{s}^2(\varvec{x}) :=Var(Y(\varvec{x})\vert {\mathcal {D}}) = \sigma ^2 \left( k_{\varvec{\theta }}(\varvec{x},\varvec{x}) - k_{\varvec{\theta }}(\varvec{x},\varvec{X}) k_{\varvec{\theta }}(\varvec{X},\varvec{X})^{-1} k_{\varvec{\theta }}(\varvec{X},\varvec{x})\right) . \end{aligned}$$
(3)

In the following, we will simply denote the correlation matrix as \(\varvec{K}_{\varvec{\theta }} :=k_{\varvec{\theta }}(\varvec{X},\varvec{X})\).
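
A minimal sketch of the predictor (2) and variance (3) is given below, assuming a helper `corr(A, B, theta)` that returns the matrix of pairwise \(k_{\varvec{\theta }}\) correlations (rebuilt here with SciPy's `cdist`; the function names are ours). In practice a Cholesky factorization of \(\varvec{K}_{\varvec{\theta }}\) would be computed once and reused; the sketch favors readability.

```python
import numpy as np
from scipy.spatial.distance import cdist

def corr(A, B, theta):
    """Matrix of Matérn 5/2 correlations between the rows of A and B."""
    H = cdist(A / theta, B / theta)          # scaled pairwise distances
    return (1 + np.sqrt(5) * H + 5 / 3 * H ** 2) * np.exp(-np.sqrt(5) * H)

def sk_predict(x_new, X, Y, theta, sigma2=1.0):
    """Simple Kriging mean (Eq. 2) and variance (Eq. 3) at a single point."""
    K = corr(X, X, theta)                            # K_theta, n x n
    k_x = corr(np.atleast_2d(x_new), X, theta)       # 1 x n cross-correlations
    mean = k_x @ np.linalg.solve(K, Y)               # Eq. (2)
    var = sigma2 * (1.0 - k_x @ np.linalg.solve(K, k_x.T))  # Eq. (3), k(x, x) = 1
    return float(mean), float(var)
```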

2.2 Hyperparameter estimation

The covariance hyperparameters must be chosen appropriately to obtain an accurate model. Usually, they are set by maximum likelihood estimation (MLE) (Jones 2001), which consists in maximizing the marginal likelihood of the model:

$$\begin{aligned} {\mathcal {L}}(\sigma ,\varvec{\theta }) :=\frac{1}{(2\pi )^{n/2}(\sigma ^2)^{n/2} \ \textrm{det}(\varvec{K}_{\varvec{\theta }})^{1/2}} \ \textrm{exp}\left( -\frac{1}{2\sigma ^2}\varvec{Y}^T \varvec{K}^{-1}_{\varvec{\theta }} \varvec{Y}\right) . \end{aligned}$$
(4)

This is equivalent to minimizing \(-\textrm{log}({\mathcal {L}}(\sigma ,\varvec{\theta }))\). For a fixed \(\varvec{\theta }\), the MLE estimator for \(\sigma ^2\) is:

$$\begin{aligned} {\hat{\sigma }}^2_{MLE} = \frac{\varvec{Y}^T {\varvec{K}_{\varvec{\theta }}}^{-1}\varvec{Y}}{n}. \end{aligned}$$
(5)

After substituting (5) into the log-likelihood, we obtain the length-scales \(\varvec{\theta }\) by solving:

$$\begin{aligned} \varvec{\theta }_{MLE} = \mathop {\mathrm {arg\,min}}\limits _{\varvec{\theta }} \; \frac{n}{2} \textrm{log} \left( {\hat{\sigma }}^2_{MLE} \right) +\frac{1}{2} \textrm{log} \left( \textrm{det}(\varvec{K}_{\varvec{\theta }}) \right) . \end{aligned}$$
(6)
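
For reference, the concentrated objective of (6) can be evaluated as sketched below (up to additive constants, for a pre-computed correlation matrix \(\varvec{K}_{\varvec{\theta }}\); each evaluation costs \(O(n^3)\)). Minimizing (6) then means rebuilding \(\varvec{K}_{\varvec{\theta }}\) and re-evaluating this function at every candidate \(\varvec{\theta }\).

```python
import numpy as np

def neg_concentrated_loglik(K, Y):
    """Concentrated negative log-likelihood of Eq. (6), up to constants,
    for a correlation matrix K = K_theta and observations Y."""
    n = len(Y)
    _, logdet = np.linalg.slogdet(K)                    # stable log det(K)
    sigma2_hat = float(Y @ np.linalg.solve(K, Y)) / n   # Eq. (5)
    return 0.5 * n * np.log(sigma2_hat) + 0.5 * logdet
```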

An alternative to MLE is to minimize the leave-one-out cross-validation (LOOCV) error (Bachoc 2013) of the model:

$$\begin{aligned} e_{LOOCV}(\varvec{\theta }) :=\frac{1}{n} \sum _{k=1}^n\left( {M_{\varvec{\theta }}}_{-k}(\varvec{x}_k)-y(\varvec{x}_k) \right) ^2, \end{aligned}$$
(7)

where \({M_{\varvec{\theta }}}_{-k}\) is the simple Kriging model built by removing the \(k\)th sample point \(\varvec{x}_k\). For Kriging models, the LOOCV error can be computed easily (Ginsbourger and Schärer 2021) without having to build \(n\) models using the formula:

$$\begin{aligned} e_{LOOCV}(\varvec{\theta }) = \frac{1}{n} \sum _{k=1}^n\left( \frac{[\varvec{K}_{\varvec{\theta }}^{-1}\varvec{Y}]_k}{[\varvec{K}_{\varvec{\theta }}^{-1}]_{k,k}}\right) ^2. \end{aligned}$$
(8)

Finally, the LOOCV estimation of the length-scales is obtained as:

$$\begin{aligned} {\hat{\varvec{\theta }}}_{LOOCV} = \mathop {\mathrm {arg\,min}}\limits _{\varvec{\theta }} \; e_{LOOCV}(\varvec{\theta }). \end{aligned}$$
(9)
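
The shortcut (8) makes the LOOCV criterion computable from a single inversion of the correlation matrix; a minimal sketch (the function name is ours):

```python
import numpy as np

def loocv_error(K, Y):
    """LOOCV error of a simple Kriging model via Eq. (8):
    no need to actually refit n leave-one-out models."""
    K_inv = np.linalg.inv(K)                      # one O(n^3) inversion
    residuals = (K_inv @ Y) / np.diag(K_inv)      # [K^{-1}Y]_k / [K^{-1}]_{kk}
    return float(np.mean(residuals ** 2))
```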

In practice, both optimization problems (6) and (9) can be difficult to solve numerically due to their multi-modality, to flat areas of the objectives, and to the fact that the objective evaluations can be expensive (cost in \(O(n^3)\) for both objectives and their gradients). This is particularly true for high-dimensional problems since \(\varvec{\theta }\) has dimension \(d\). Equations (6) and (9) are typically solved using gradient-based methods (e.g. BFGS) with multi-start, or using evolutionary algorithms (Roustant et al 2012). However, as we will show in Sect. 4, these methods can fail to produce suitable values of the hyperparameters in high-dimensional problems, when the data is relatively sparse.

In the next section, we propose an alternative method for building a Kriging-based surrogate model which avoids this challenging optimization of the length-scale hyperparameters.

3 Combined Kriging with fixed length-scales

3.1 Description of the method

Combining different surrogate models using weights has been explored by many authors over the past years. The proposed methods differ in the purpose of the combination, in the type of surrogate models employed, and in the way the weights are computed. For example, Bayesian model averaging (Gelman et al 1995; Hoeting et al 1999; Burnham et al 2011) combines models with different parameters to perform multimodel inference while accounting for the uncertainty in the choice of the model. In Goel et al (2007), Acar and Rais-Rohani (2009) and Viana et al (2009), different metamodels built on the same data set are combined to obtain an ensemble of surrogates whose accuracy is better than that of the best individual metamodel. To circumvent the difficulties of Kriging metamodels in the presence of large datasets, several methods combining local Kriging sub-models optimized on subsets of points have also been proposed, with different weighting schemes (Rasmussen and Ghahramani 2001; Cao and Fleet 2014; Deisenroth and Ng 2015; Rullière et al 2018). In the context of Bayesian optimization, Ginsbourger et al (2008) present a method to combine Kriging sub-models with various covariance functions, or with different hyperparameter optimization criteria. The combination of Kriging sub-models for selecting the covariance function is further explored in Palar and Shimoyama (2018), and Pronzato and Rendas (2017) combine several local Kriging sub-models with different covariance functions in a fully Bayesian manner to build a non-stationary model.

Contrary to the combinations of Kriging sub-models presented above, in the method we propose the length-scale hyperparameters are not optimized but randomly chosen. The expensive and difficult optimization of these hyperparameters in high dimension is thus avoided: the appropriate random length-scales are instead emphasized through their weights in the combination, which are obtained in closed form.

The combined model writes as:

$$\begin{aligned} M_{tot}(\varvec{x}) :=\sum _{i=1}^{p} w_i(\varvec{x})M_i(\varvec{x}), \end{aligned}$$
(10)

where \(M_{tot}\) is the combined model, \(p\) is the number of sub-models, and \(w_i\), \(i=1,\dots ,p,\) are the weights of each sub-model. The sub-models \(M_i\) are simple Kriging models with random length-scales, hence:

$$\begin{aligned} M_i(\varvec{x}) :=E(Y_{\varvec{\theta }_i}(\varvec{x})\vert {\mathcal {D}}_i) = k_{\varvec{\theta }_i}(\varvec{x},\varvec{X}_i)k_{\varvec{\theta }_i}(\varvec{X}_i,\varvec{X}_i)^{-1}\varvec{Y}_i, \end{aligned}$$
(11)

where \(\varvec{\theta }_i\) is the random length-scale vector and \({\mathcal {D}}_i=(\varvec{X}_i,\varvec{Y}_i)\) the training data set of the \(i\)th sub-model. We also have access to the variance of each sub-model:

$$\begin{aligned} \hat{s}_i^2(\varvec{x}) :=\sigma ^2 \left( k_{\varvec{\theta }_i}(\varvec{x},\varvec{x}) - k_{\varvec{\theta }_i}(\varvec{x},\varvec{X}_i)k_{\varvec{\theta }_i}(\varvec{X}_i,\varvec{X}_i)^{-1} k_{\varvec{\theta }_i}(\varvec{X}_i,\varvec{x})\right) . \end{aligned}$$
(12)
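
The following sketch ties Eqs. (10)–(11) together: p simple Kriging sub-models share the same data but use different (for instance, random) length-scales, and their predictions are combined with the weights of Sect. 3.3. The helper names are ours, and constant weights are assumed here for simplicity.

```python
import numpy as np
from scipy.spatial.distance import cdist

def corr(A, B, theta):
    """Matrix of Matérn 5/2 correlations between the rows of A and B."""
    H = cdist(A / theta, B / theta)
    return (1 + np.sqrt(5) * H + 5 / 3 * H ** 2) * np.exp(-np.sqrt(5) * H)

def combined_predict(x_new, X, Y, thetas, weights):
    """Combined prediction of Eq. (10) at a single point x_new, from p
    sub-models defined by the rows of `thetas` (one length-scale vector each)."""
    preds = []
    for theta in thetas:                          # p inversions in total: O(p n^3)
        K = corr(X, X, theta)                     # K_{theta_i}
        k_x = corr(np.atleast_2d(x_new), X, theta)
        preds.append(float(k_x @ np.linalg.solve(K, Y)))   # Eq. (11)
    return float(np.dot(weights, preds))          # weighted sum, Eq. (10)
```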

The proposed method enables the construction of a Kriging model for high-dimensional problems without reducing the dimension. It both preserves the correlation between all input variables and avoids a loss of information due to a truncated design space. Moreover, this method is very flexible since each sub-model can, for instance, be constructed on a different subset of points, take into account different design variables, or have a different covariance function. Sub-models with very different behaviors, sweeping through a wide range of length-scales, can therefore be combined. This paper focuses on high-dimensional problems with typical dimension \(d> 20\). The number of sub-models is limited to \(p< d\ll n\), and we will show empirically in Sect. 4 that for our test problem with dimension \(d=50\), a small number of sub-models is sufficient, as adding more no longer improves the combinations. Finally, we consider a moderate number of samples (at most a few thousand) so that, albeit slightly expensive, the inverse of the covariance matrix can be computed for the \(p\) sub-models. Thus, the complexity of the combination is \(O(pn^3)\), which is generally less than the cost of an ordinary Kriging model in \(O(\alpha _{iter}n^3)\), where \(\alpha _{iter}\) is the number of matrix inversions (i.e. the number of iterations, typically of the order of 100) in the optimization of the \(d\) hyperparameters. To fully define the combination, the first step is to define the sub-models, which is detailed in Sect. 3.2. Then, the choice of the weights for the combination is discussed in Sect. 3.3.

3.2 Choice of the sub-models

In this paper, all Kriging sub-models are constructed with all sample points and all design variables: \({\mathcal {D}}_i={\mathcal {D}}=(\varvec{X},\varvec{Y})\), \(i=1,\dots ,p\), so that the length-scales are the only source of difference between the \(M_i\)’s. An appropriate choice of the length-scales is essential to obtain a combined Kriging model with good accuracy. In particular, variety among the sub-models is crucial so that the combined model can select the most well-suited behaviors through the weights in the combination. Since no prior knowledge is available for the length-scales, we choose them randomly in a bounded interval:

$$\begin{aligned} \theta ^{(\ell )}\in \left[ \, \theta _{min}^{(\ell )}, \, \theta _{max}^{(\ell )}\, \right] , \quad \ell =1,\dots ,d, \end{aligned}$$
(13)

where \(\theta _{min}^{(\ell )}\) and \(\theta _{max}^{(\ell )}\) are lower and upper bounds for the \(\ell \)th component \(\theta ^{(\ell )}\) of \(\varvec{\theta }\). To the best of our knowledge, only a few works in the literature deal with length-scale bounds. Mohammadi et al (2016, 2018) deal with these bounds in an optimization context, but it is common practice to assume pre-specified bounds. In the DiceKriging R package (Roustant et al 2012), by default, \(\theta ^{(\ell )}_{min}=10^{-10}\) and \(\theta ^{(\ell )}_{max}\) is twice the observed range in the \(\ell \)th dimension. Obrezanova et al (2007) fix the length-scales based on the standard deviation of the data. Issues related to flat likelihood landscapes may occur for too small length-scales, and a suitable lower bound for maximum likelihood estimation is proposed in Richet (2018).

Intuitively, the choice of the length-scale range depends both on the design and on the covariance function family. We study both factors separately hereafter, although it would be possible to study them jointly.

3.2.1 Design impact

If the length-scales are large compared to most observed pairwise distances, then the correlations will tend to one. If they are smaller than most distances, trajectories with higher frequencies than observed in the given samples are implicitly considered. Therefore, length-scales should be of the order of most of the observed pairwise distances.

Let us investigate the distribution of observed distances between design points. Assume that design points are distributed as a random vector \({\textbf{X}}=(X^{(1)}, \ldots , X^{(d)})\), with respective standard deviations \(\sigma ^{(1)}, \ldots , \sigma ^{(d)}\). As we do not consider here the cross influence of joint length-scales, we investigate the impact of length-scales variations along the curve:

$$\begin{aligned} {\mathcal {C}} :=\left\{ \varvec{\theta }= \lambda (\sigma ^{(1)}, \ldots , \sigma ^{(d)}), \, \lambda \in {\mathbb {R}}^+ \right\} \,. \end{aligned}$$

Now denote by \(\left\| \frac{\textbf{R}}{\varvec{\theta }}\right\| \) the scaled random distance between two distinct independent points \({\textbf{X}}\) and \(\mathbf {X'}\) of the design, using component-wise division. When \(\varvec{\theta }\in {\mathcal {C}}\), this distance can be expressed as a function of \(\theta ^{(\ell )}\):

$$\begin{aligned} \left\| \frac{\textbf{R}}{\varvec{\theta }}\right\| ^2 :=\sum _{i=1}^{d} \left( \frac{{X}^{(i)}-{X'}^{(i)}}{\theta ^{(i)}}\right) ^2 = \left( \frac{\sigma ^{(\ell )}}{\theta ^{(\ell )}} \right) ^2 \sum _{i=1}^{d} \left( \frac{{X}^{(i)}-{X'}^{(i)}}{\sigma ^{(i)}}\right) ^2 \,. \end{aligned}$$

Assuming the finiteness of the first four moments, when all components of \({\textbf{X}}\) and \(\mathbf {X'}\) are mutually independent with common kurtosis \(\kappa \),

$$\begin{aligned} E\left( \left\| \frac{\textbf{R}}{\varvec{\theta }}\right\| ^2 \right) = 2 d\left( \frac{\sigma ^{(\ell )}}{\theta ^{(\ell )}}\right) ^2 \quad \text{ and } \quad Var\left( \left\| \frac{\textbf{R}}{\varvec{\theta }}\right\| ^2 \right) = 2 d\left( \frac{\sigma ^{(\ell )}}{\theta ^{(\ell )}}\right) ^4 (\kappa +1) \,. \end{aligned}$$

Along the direction \(\ell \), using a simplified model, for \(d\) large enough, typical values of the unscaled distance \(\Vert \theta ^{(\ell )}\frac{\textbf{R}}{\varvec{\theta }}\Vert \) are given by the square roots of the endpoints of a Gaussian \(95\%\) confidence interval (for a Gaussian design, one should use the confidence interval of a \(\chi \) distribution):

$$\begin{aligned} \sigma ^{(\ell )}[r_{min}, r_{max}] = \sigma ^{(\ell )}\left[ \, \sqrt{2d- 1.96 \sqrt{2(\kappa +1)d}},\; \sqrt{2d+ 1.96 \sqrt{2(\kappa +1)d}}\, \right] . \end{aligned}$$

This interval corresponds to typically observed unscaled distances in the dimension \(\ell \). Notice that it grows as \(\sqrt{d}\) and that it is built around an average distance \(\sigma ^{(\ell )}\sqrt{2d}\) along this axis. For uniform random variables, the kurtosis is \(\kappa = 9/5\), for Gaussian ones, it is \(\kappa = 3\).
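
As a quick sanity check, the interval \([r_{min}, r_{max}]\) only requires the dimension and the design kurtosis; a minimal sketch (valid for \(d\) large enough that the lower endpoint stays positive):

```python
import numpy as np

def distance_interval(d, kappa):
    """Typical scaled-distance interval [r_min, r_max] derived above
    (kappa = 9/5 for uniform designs, kappa = 3 for Gaussian ones)."""
    half_width = 1.96 * np.sqrt(2.0 * (kappa + 1.0) * d)
    lower = np.sqrt(max(2.0 * d - half_width, 0.0))
    upper = np.sqrt(2.0 * d + half_width)
    return lower, upper

print(distance_interval(d=50, kappa=9 / 5))   # both endpoints grow as sqrt(2d)
```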

3.2.2 Covariance family impact

The impact of a change in the length-scales depends on the covariance function: for instance, the covariance varies slowly at short distances for Gaussian kernels, whereas it varies rapidly for exponential ones. This has to be taken into account when choosing length-scales bounds.

Let \(k(\left\| \frac{\textbf{r}}{\varvec{\theta }}\right\| )\) be the covariance between two design points \({\textbf{x}}\) and \(\mathbf {x'}\), where \(\left\| \frac{\textbf{r}}{\varvec{\theta }}\right\| \) is the scaled distance between the points and k(.) is the covariance function of an isotropic stationary Gaussian Process. When \(\varvec{\theta }\in {\mathcal {C}}\), \(\left\| \frac{\textbf{r}}{\varvec{\theta }}\right\| \) can be expressed using only \(\theta ^{(\ell )}\). The influence of \(\theta ^{(\ell )}\) on the covariance can be measured by the following normalized derivative:

$$\begin{aligned} I^{(\ell )}\left( \left\| \frac{\textbf{r}}{\varvec{\theta }}\right\| , \theta ^{(\ell )}\right) :=\Bigg |\frac{\frac{\partial }{\partial \theta ^{(\ell )}} k(\left\| \frac{\textbf{r}}{\varvec{\theta }}\right\| )}{\max \limits _{\theta ^{(\ell )}, \varvec{\theta }\in {\mathcal {C}}}\frac{\partial }{\partial \theta ^{(\ell )}} k(\left\| \frac{\textbf{r}}{\varvec{\theta }}\right\| )} \Bigg |\,. \end{aligned}$$
(14)

The derivative with respect to the length-scale can be obtained easily by direct calculation for the usual covariance functions. Along the axis \(\ell \), at a scaled distance \(\left\| \frac{\textbf{r}}{\varvec{\theta }}\right\| =\frac{r}{\theta ^{(\ell )}}\), a length-scale \(\theta ^{(\ell )}\) is considered influential enough if it belongs to:

$$\begin{aligned} \Theta ^{(\ell )}_{adm}(r) :=\left\{ \theta : \; \, I^{(\ell )}\left( \frac{r}{\theta }, \theta \right) \ge \delta \right\} \,, \end{aligned}$$

where \(\delta \in (0,1)\) is a user-defined threshold that we set to \(\delta =1/10\) in the following.

For \(r \in \left[ r_{min}^{(\ell )}, r_{max}^{(\ell )}\right] :=\sigma ^{(\ell )}[r_{min}, r_{max}]\), length-scales bounds are chosen as:

$$\begin{aligned} \theta _{min}^{(\ell )}:=\inf \bigcup \limits _{r \in \left[ r_{min}^{(\ell )}, r_{max}^{(\ell )}\right] } \Theta ^{(\ell )}_{adm}(r) \quad \text {and} \quad \theta _{max}^{(\ell )}:=\sup \bigcup \limits _{r \in \left[ r_{min}^{(\ell )}, r_{max}^{(\ell )}\right] } \Theta ^{(\ell )}_{adm}(r) \,. \end{aligned}$$

Note that multiplying distances by a scale factor \(\alpha >0\) changes the set of admissible length-scales by the same factor, \(\Theta ^{(\ell )}_{adm}(\alpha r)= \alpha \Theta ^{(\ell )}_{adm}(r)\). Therefore, one only has to solve for \(r=1\) in \(\theta \):

$$\begin{aligned} I^{(\ell )}\left( \frac{1}{\theta }, \theta \right) = \delta \,. \end{aligned}$$
(15)

We denote respectively as \(\theta ^{-}(k)\) and \(\theta ^{+}(k)\) the smallest and largest roots of (15), which depend only on the chosen covariance function k(.). The influence index and its roots are illustrated in Fig. 1 for the exponential and Gaussian kernels. The roots do not depend on the component number \(\ell \), and we finally get:

$$\begin{aligned} \theta _{min}^{(\ell )}= \sigma ^{(\ell )}\, r_{min} \, \theta ^{-}(k) \quad \text {and} \quad \theta _{max}^{(\ell )}= \sigma ^{(\ell )}\, r_{max} \, \theta ^{+}(k)\,. \end{aligned}$$
(16)

Notice that \(r_{min}\), \(r_{max}\) depend only on the design kurtosis \(\kappa \) and on the dimension \(d\). Examples of obtained bounds for uniformly sampled designs are given in Table 1. When d tends to infinity, the length-scale range becomes equivalent to \(\sigma ^{(\ell )}\sqrt{2d} [\theta ^{-}(k), \theta ^{+}(k)]\), and only depends on the design distribution through \(\sigma ^{(\ell )}\). The surprising result of distance concentration in high dimension (\(r_{min}\) and \(r_{max}\) both equivalent to \(\sqrt{2d}\) as d increases, see last lines in Table 1) is also discussed in the literature, see e.g. Aggarwal et al (2001).
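
In practice, \(\theta ^{-}(k)\) and \(\theta ^{+}(k)\) can be obtained numerically by evaluating the influence index (14) at \(r=1\) on a fine grid of length-scales, keeping the admissible range, and rescaling as in (16). The sketch below is our own simplification (grid search, numerical derivative, Matérn 5/2 kernel); the example values of \(r_{min}\) and \(r_{max}\) correspond to \(d=50\), \(\kappa =9/5\).

```python
import numpy as np

def matern52(h):
    """Matérn 5/2 correlation as a function of the scaled distance h."""
    return (1 + np.sqrt(5) * h + 5 / 3 * h ** 2) * np.exp(-np.sqrt(5) * h)

def influence_roots(k=matern52, delta=0.1, grid=np.logspace(-3, 3, 20001)):
    """Smallest and largest theta with normalized influence >= delta (Eq. 15)."""
    dk = np.gradient(k(1.0 / grid), grid)          # d/dtheta k(r/theta) at r = 1
    influence = np.abs(dk) / np.max(np.abs(dk))    # Eq. (14), restricted to r = 1
    admissible = grid[influence >= delta]
    return admissible.min(), admissible.max()      # theta^-(k), theta^+(k)

def lengthscale_bounds(sigma_l, r_min, r_max, delta=0.1):
    """Bounds of Eq. (16) for one input dimension with standard deviation sigma_l."""
    t_minus, t_plus = influence_roots(delta=delta)
    return sigma_l * r_min * t_minus, sigma_l * r_max * t_plus

# Example in the setting of Table 1: uniform design in [0, 1]^50,
# with r_min, r_max taken from the design-impact interval for d = 50, kappa = 9/5.
print(lengthscale_bounds(sigma_l=1 / np.sqrt(12), r_min=8.2, r_max=11.5))
```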

Fig. 1: Influence index \(I^{(\ell )}(r/\theta , \theta )\) as a function of \(\theta \), for Gaussian (red) and exponential (blue) covariance functions, \(r=1\). Threshold \(\delta =1/10\) (black horizontal line). Large length-scales have more impact with an exponential kernel than with a Gaussian one

Table 1: Values of the different terms in Eq. (16) for usual kernels, for a uniform design plan (\(\kappa =9/5\)), and for a standard deviation \(\sigma ^{(\ell )}=1/\sqrt{12}\) corresponding to a uniform design in \([0,1]^d\). The chosen kernel influence threshold is \(\delta =1/10\)

3.2.3 Sampled bounds

The length-scales of the sub-models are sampled randomly in their corresponding intervals. Different sampling strategies could be considered (e.g. space-filling designs, or sample plans biased towards the center of the length-scale space). In this paper we use a uniform sampling scheme: \(\theta ^{(\ell )}\thicksim {\mathcal {U}}(\theta _{min}^{(\ell )},\theta _{max}^{(\ell )}), \quad \ell =1,\dots ,d\,.\)
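
A sketch of this sampling step, producing one length-scale vector per sub-model (the numerical bounds below are placeholders; in practice they come from Eq. (16)):

```python
import numpy as np

rng = np.random.default_rng(seed=0)
p, d = 20, 50                                  # number of sub-models, dimension
theta_min = np.full(d, 0.5)                    # placeholder lower bounds, Eq. (16)
theta_max = np.full(d, 5.0)                    # placeholder upper bounds, Eq. (16)
thetas = rng.uniform(theta_min, theta_max, size=(p, d))   # one row per sub-model
```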

3.3 Choice of the weighting method

As detailed in Sect. 3.1, the literature on model combination is vast and many weighting methods have been developed. Five of them are investigated in this paper to compute the weights of the sub-models in (10). This section briefly describes the resulting five weighting schemes; more details on each method can be found in Appendix A. Their performances are then compared on numerical experiments in Sect. 4.

3.3.1 PoE approach

The first approach to obtain the weights is based on Product of Experts (PoE) (Hinton 2002). The PoE weights are given by:

$$\begin{aligned} w_{PoE_i}(\varvec{x}) = \frac{\hat{s}^{-2}_i(\varvec{x})}{\sum _{j=1}^{p} \hat{s}^{-2}_j(\varvec{x})}, \end{aligned}$$
(17)

where \(\hat{s}^{2}_i(\varvec{x})\) is the variance of the \(i\)th sub-model given in Eq. (12). Note that these weights depend only on the position of the sample points and not on the observed values. When the Kriging sub-models are all built with the same sample points, this method will emphasize the sub-models with large length-scales, because these are the ones with the smallest predicted variance. Thus, we expect this method to favor large length-scales and to fail in selecting the correct sub-models.
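
At a given prediction point, the PoE weights are simply the normalized precisions of the sub-models; a minimal sketch (the function name is ours):

```python
import numpy as np

def poe_weights(variances):
    """PoE weights of Eq. (17) at one prediction point x, from the
    sub-model predictive variances s_i^2(x) (array of shape (p,))."""
    precisions = 1.0 / np.asarray(variances)
    return precisions / precisions.sum()
```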

3.3.2 gPoE approach

The second approach is based on generalized Product of Experts (gPoE) (Cao and Fleet 2014; Deisenroth and Ng 2015). The gPoE weights are:

$$\begin{aligned} w_{gPoE_i}(\varvec{x}) = \frac{\beta _i^* \hat{s}^{-2}_i(\varvec{x})}{\sum _{j=1}^{p} \beta _j^* \hat{s}^{-2}_j(\varvec{x})}. \end{aligned}$$
(18)

In this paper, we use the gPoE approach to adjust the PoE weights in order to account for the observed values at the sample points. To this aim, the internal weights \(\varvec{\beta }^*\) are optimized numerically to minimize the LOOCV error of the combined model given by Eq. (7). However, a closed-form expression of the weights is no longer available because of this inner optimization.

3.3.3 LOOCV and LOOCV diag approaches

The third approach is to directly minimize the LOOCV error of the combination in Eq. (7) (Viana et al 2009) giving the LOOCV weights:

$$\begin{aligned} \varvec{w}_{LOOCV} = \frac{\varvec{C}^{-1} \varvec{1}}{\varvec{1}^T \varvec{C}^{-1} \varvec{1}}, \end{aligned}$$
(19)

where the components of the matrix \(\varvec{C} \in {\mathbb {R}}^{p\times p}\) are \(c_{ij} = \frac{1}{n} \varvec{e}_i^T \varvec{e}_j, \quad i=1,\dots ,p, \quad j=1,\dots ,p,\) with \(\varvec{e}_i\) the LOOCV residual vector of the \(i\)th sub-model: \(\varvec{e}_i=(\varvec{e}_i^{(1)},\dots ,\varvec{e}_i^{(n)})\). Using (8), these elements can be expressed easily as: \(\varvec{e}_i^{(k)} = [\varvec{K}_{\varvec{\theta }_i}^{-1} \varvec{Y}]_k/ [\varvec{K}_{\varvec{\theta }_i}^{-1}]_{k,k}\). Contrary to the two previous approaches, these weights are constant and do not depend on \(\varvec{x}\). We also note that this method may lead to weights that are negative or greater than one. As we will discuss in Sect. 4, negative weights can raise some issues for the combination. Thus, following the suggestion of Viana et al (2009), we propose a fourth weight definition enforcing \(w_i\in [0,1]\) by keeping only the diagonal elements of the matrix \(\varvec{C}\) in Eq. (19):

$$\begin{aligned} \varvec{w}_{LOOCV_{diag}} = \frac{\varvec{C}_{diag}^{-1} \varvec{1}}{\varvec{1}^T \varvec{C}_{diag}^{-1} \varvec{1}} \Longleftrightarrow w_{{LOOCV_{diag}}_i} = \frac{e_{LOOCV}(M_i)^{-1}}{\sum _{j=1}^{p} e_{LOOCV}(M_j)^{-1}}. \end{aligned}$$
(20)
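
Both (19) and (20) follow directly from the matrix of cross leave-one-out residuals; a minimal sketch (names are ours):

```python
import numpy as np

def loocv_weights(E, diag_only=False):
    """Constant weights of Eq. (19), or of Eq. (20) if diag_only is True.
    E is the (p, n) array whose i-th row is the LOO residual vector e_i
    of the i-th sub-model, computed with the shortcut of Eq. (8)."""
    C = (E @ E.T) / E.shape[1]               # c_ij = e_i^T e_j / n
    if diag_only:                            # Eq. (20): keep only the c_ii
        C = np.diag(np.diag(C))
    ones = np.ones(C.shape[0])
    w = np.linalg.solve(C, ones)             # C^{-1} 1
    return w / w.sum()                       # normalize by 1^T C^{-1} 1
```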

3.3.4 MoE approach

The fifth approach based on Mixture of Experts (MoE) (Yuksel et al 2012) gives the MoE weights:

$$\begin{aligned} w_{{MoE}_i} = \frac{{\mathcal {L}}(\varvec{\theta }_i)}{\sum _{j=1}^{p} {\mathcal {L}}(\varvec{\theta }_j)}. \end{aligned}$$
(21)

Here, \({\mathcal {L}}(\varvec{\theta }_i)\) is the marginal likelihood of the \(i\)th sub-model. One drawback of MoE is that the likelihoods of different sub-models can vary by several orders of magnitude. Thus, this method may put almost all the weight on the single sub-model with the best likelihood instead of combining different sub-models.
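
Since the likelihoods can differ by orders of magnitude (and underflow in high dimension), the weights (21) are best computed from log-likelihoods; a minimal sketch:

```python
import numpy as np

def moe_weights(log_likelihoods):
    """MoE weights of Eq. (21), computed stably in log space:
    subtracting the maximum before exponentiating avoids underflow."""
    log_l = np.asarray(log_likelihoods)
    w = np.exp(log_l - log_l.max())
    return w / w.sum()
```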

4 Numerical results

4.1 Experiment setup

We compare the performances of the different combined models described in Sect. 3 with the simple Kriging method on simulated data and on a real-world application. To build the sub-models and the simple Kriging model, \(n_{train}=500\) random training points \(\varvec{x}_1,\dots ,\varvec{x}_{n_{train}} \in [0,1]^d\) are uniformly sampled. We use the Matérn 5/2 covariance function defined in Eq. (1). For the random sub-models, we follow the methodology detailed in Sect. 3.2, with a threshold \(\delta =1/10\), a kurtosis corresponding to a uniform distribution \(\kappa =9/5\), and the empirical standard deviation of the design along each direction. The weights of the combinations are computed according to Eqs. (17), (18), (19), (20) and (21). The performances of the five combined models are compared with the accuracy of a simple Kriging model with hyperparameters optimized by MLE (the optimization is performed using the DiceKriging package in the R language (Roustant et al 2012) with a maximum of 300 iterations). The experiments are repeated for 10 different random seeds, with different random length-scales for the sub-models as well.

4.2 Simulated data

For the simulated data, the functions to approximate are random samples of a high-dimensional (\(d=50\)) centered Gaussian process Y with a Matérn 5/2 covariance function and known isotropic length-scales \(\theta _{true}=2\). Since the true length-scales are known in this case, we also compare the combinations and the simple Kriging model to a Kriging model with the true hyperparameters \(\varvec{\theta }_{true}\) as a reference.

In a first experiment, we consider a combination of \(p=10\) sub-models where, only this time, the length-scales are fixed to isotropic values ranging from 1 to 10. The purpose of this experiment is to observe how the different weights behave in a case where we know how close each sub-model is to the true function. In a second experiment, we build a set of \(p=40\) Kriging sub-models \({\mathcal {M}} = \{M_1,\dots ,M_{p}\}\), this time with random length-scales. The combined models are then constructed by aggregating a gradually increasing number of these sub-models (from 5 to all 40), using the 5 different methods for computing the weights. Additionally, to investigate the robustness of the combinations to “wrong” sub-models, we also add 5 sub-models with fixed, large isotropic length-scales \(\theta =10\). The quality of each prediction \(M_{tot}\) is assessed by the mean-square error (MSE) computed on a test set of \(n_{test}\) = 5000 random test points \(\varvec{x}_1^{(t)},\dots ,\varvec{x}_{n_{test}}^{(t)} \in [0,1]^d\):

$$\begin{aligned} MSE(M_{tot}) :=\frac{1}{n_{test}} \sum _{k=1}^{n_{test}} \left( M_{tot}(\varvec{x}_k^{(t)}) - Y(\varvec{x}_k^{(t)}) \right) ^2. \end{aligned}$$

In order to further interpret the results, the LOOCV error of each model is also computed:

$$\begin{aligned} e_{LOOCV}(M_{tot}) :=\frac{1}{n_{train}} \sum _{k=1}^{n_{train}} \left( M_{{tot}_{-k}}(\varvec{x}_k) - Y(\varvec{x}_k) \right) ^2. \end{aligned}$$

The results of the first experiment are presented in Figs. 2 and 3. The first result to note is the weak performance of the SK model with estimated hyperparameters (KrgMLE). Since the model is well specified (the underlying function we try to approximate is a GP sample with the same covariance structure), we would expect the MLE method to recover the true hyperparameters (see Bachoc (2013)). Moreover, as the high-dimensional optimization can be difficult, we use multi-start along with a large number of iterations (300) to ensure convergence. However, the maximum likelihood optimization still results in a wrong estimation of the length-scales \(\varvec{\theta }_{MLE}\), far from the truth \(\varvec{\theta }_{true}\). This is because, with the small number of observations available, the maximum likelihood criterion over-fits the training data, as highlighted in Fig. 2b by the LOOCV error of the model with estimated hyperparameters, which is much smaller than that of the reference model. Because of the poor estimation of the length-scales, the MSE of this model is also much worse than the MSE of the model with the true hyperparameters, as shown in Fig. 2a. The PoE method clearly performs the worst among the combined models because it gives almost all the weight to the sub-models with large length-scales. The gPoE method avoids this issue thanks to its internal weights, as shown in Fig. 3a, and performs similarly to the three other methods. The different weighting strategies of each method are shown in Fig. 3. For the LOOCV method, the weights are fluctuating and hard to interpret, since their values are not in the [0,1] interval. The LOOCV diag method gives weights which are distributed quite uniformly among all sub-models, though sub-models with \(\theta _{i}\approx \theta _{true}\) are highlighted, while the gPoE and MoE methods focus on the two most accurate sub-models.

Fig. 2: MSE and LOOCV error (the lower the better) for the combinations of isotropic sub-models for the approximation of an isotropic Gaussian process sample. The five first boxes (blue) correspond to the 5 weighting methods, the second to last box (red) to the simple Kriging model with hyperparameters estimated by MLE, and the last box (green) to a simple Kriging model with \(\theta =\theta _{true}\) (colour figure online)

Fig. 3: Weights of the isotropic sub-models for the first experiment. The x-axis values represent the isotropic length-scale \(\theta \) of each sub-model. For the gPoE method, the weights shown are the internal weights \(\varvec{\beta }\) of Eq. (18); for the 3 other methods, they are the constant weights used for the combination

The results of the second experiment are given in Fig. 4. First, we note that the SK model with estimated hyperparameters still overfits the data, which results in a high MSE. Contrary to the first experiment, the PoE method performs well, as seen in Fig. 4a and f. This is because, in this experiment, the sub-models are no longer isotropic and are all composed of both small and large length-scales. As such, the PoE method, which discriminates against small length-scales, leads to a good MSE: small length-scales often result in a Kriging model with large variations, and thus potentially large MSE values, while large length-scales give flatter models with moderate MSEs. However, for the same reason, PoE is not robust to the addition of “wrong” sub-models with large length-scales. Figure 4c shows that the accuracy of the combined model using the LOOCV method steadily decreases when too many sub-models are aggregated (more than 10). This is in contrast to Fig. 4h, where the LOOCV error of this method keeps decreasing as more sub-models are added, which is to be expected since the weights in this method are designed to minimize this very error. This, again, can be explained by the fact that this combination starts to overfit the data with too many sub-models. However, this issue does not occur for the LOOCV diag method in Fig. 4d and i, where the MSE always decreases with \(p\) and converges to a plateau at about 15 sub-models in the combination. Figure 4e and j show that in this experiment the MoE method produces poor results. This is because, in this 50-dimensional example, the likelihoods of the sub-models are very small and differ by several orders of magnitude. This results in an MoE weight of almost one for the sub-model with the best likelihood, so the method is almost equivalent to choosing only the best sub-model, and thus to using a single pre-specified covariance. We also note that, for the methods which give the best accuracy (PoE, gPoE and LOOCV diag), the combined model is generally better than the best sub-model with as few as five sub-models in the combination. This shows that the combination strategy is more effective than simply choosing the best sub-model among the random samples.

Fig. 4: Results of the second experiment for an initial GP sample with isotropic length-scales \(\theta = 2\). The top row shows the MSE results for the 5 combination methods, and the bottom row the LOOCV error results (the lower the better). In each boxplot, the first box gives the accuracy of the best sub-model Sub (yellow), the next 8 boxes (blue) give the accuracy of the combined model with an increasing number of sub-models (from 5 to 40), the third to last box +Bad (purple) gives the accuracy when the combination is perturbed by the addition of 5 bad sub-models with large length-scales (\(\theta =10\)), the second to last box MLE (red) gives the performance of a simple Kriging model with hyperparameters estimated by MLE, and the last box True (green) gives the precision of a simple Kriging model using the same length-scale as the initial GP sample (colour figure online)

Table 2 gives a summary of the different properties observed for the 5 weighting methods in these numerical experiments. The only method without a closed-form expression is gPoE, because of the inner weights optimization (Eq. (A3)). Only PoE is not robust to “wrong” sub-models, as seen in Fig. 4b. Figure 4c shows that LOOCV overfits when there are too many sub-models. Finally, both experiments (Figs. 2a and 4e) show that MoE does not suitably balance the weight between all sub-models.

Table 2: Empirical properties of the five weighting methods

4.3 Real-world application

To validate the method on a more realistic problem, we study a real-world application corresponding to the design of an electrical machine. The shape of the machine is parameterized by \(d=37\) design variables. These variables represent the position and size of air holes and magnets, as well as the radius of the machine. The layout of the machine is illustrated in Fig. 5. We are interested in the performance measured by two objectives to minimize, the consumption and the cost of the machine, subject to ten constraints which characterize the dynamics of the car (maximum speed, acceleration, ...), the sizing of the reducer, and the dynamics of the machine (oscillations, ...). These two objectives and ten constraints are obtained via numerical simulation. Thus, we build 12 surrogates (one for each objective and constraint).

Fig. 5: Layout of the electrical machine to be optimized. The 37 design parameters are the size and position of the air holes (in white) and of the magnets (in black), as well as the radius of the machine (colour figure online)

For a fixed number of \(p=20\) random sub-models, we compare the accuracy of the combinations with that of simple Kriging. To measure the accuracy, as the scales and units of the objectives and constraints cannot be compared, we use, instead of the MSE and LOOCV error, the \(Q^2\) coefficient computed on a test set of \(n_{test}\) = 4500 random test points \(\varvec{x}_1^{(t)},\dots ,\varvec{x}_{n_{test}}^{(t)} \in [0,1]^d\):

$$\begin{aligned} Q^2 :=1 - \frac{\sum _{k=1}^{n_{test}} \left( M_{tot}(\varvec{x}_k^{(t)}) - Y(\varvec{x}_k^{(t)}) \right) ^2}{\sum _{k=1}^{n_{test}} \left( Y(\varvec{x}_k^{(t)}) - \frac{1}{n_{test}}\sum _{l=1}^{n_{test}} Y(\varvec{x}_l^{(t)}) \right) ^2}. \end{aligned}$$

The results for the 2 objectives and 10 constraints of the electrical machine are summarized in Fig. 6. Note that here the boxplots represent the results over the 12 surrogates (averaged over 10 different random seeds).

Fig. 6: Results of the real-world application. The boxplots represent the \(Q^2\) (the higher the better) over the 12 objectives and constraints. The leftmost box gives the accuracy of the best sub-model Best sub (yellow), the next 5 boxes (blue) give the accuracy of the 5 methods for the combination, and the last box Krg MLE (red) gives the performance of a Kriging model with hyperparameters estimated by MLE (colour figure online)

The results confirm those obtained on the simulated data. The accuracy of a combination of random sub-models is better than that of Kriging with hyperparameters optimized via maximum likelihood. Among the 5 weighting methods, gPoE and LOOCV diag are to be preferred, as the conclusions drawn from the simulated data still apply here.

5 Conclusion

In this paper, we have proposed a new method to construct a surrogate model as a combination of Kriging sub-models, which avoids the cumbersome optimization of the length-scale hyperparameters. The length-scales of the sub-models are pre-specified (for instance, sampled randomly), and the combined model emphasizes the most relevant ones through the weights. We also provided a recipe for the choice of the length-scale bounds, as well as a comparison of different methods for weighting the sub-models.

Compared to other approaches, our method provides a novel way to build a Kriging-based surrogate model for high-dimensional problems without employing dimension reduction techniques. The accuracy of our surrogate model is improved in comparison to simple Kriging models whose length-scales are optimized by MLE, which perform poorly in high dimension, especially when the number of observations is limited. Moreover, the computational cost of the model is reduced since only \(p\) matrix inversions are needed to build the \(p\) sub-models, which, for a reasonable number of sub-models, is less expensive than the standard length-scale optimization that requires iterative covariance matrix inversions.

The numerical results for the 50-dimensional test problem and for the real-world application show that our method performs significantly better than simple Kriging with hyperparameters optimized by MLE for this type of problem. In particular, both gPoE and LOOCV diag stand out as the best approaches to combine the sub-models and give an accuracy close to that of the reference model with as few as 15 sub-models.

Several aspects still need to be explored in future research. First, instead of considering identical sub-models sharing the same points and design variables, we could combine different kinds of sub-models to further diversify them, for example for problems where the design variables can naturally be separated into different groups, or by varying the covariance function. We could also consider sub-models built on subsets of samples in order to handle cases where, in addition to the high dimension, the number of observations is large enough that the cost of the covariance matrix inversion becomes prohibitive. Second, we could try to induce sparsity in the weights in order to improve the interpretability of the combination. Finally, the variance of the aggregated model, which is required to apply our method within the EGO framework for Bayesian optimization, is currently available only for the MoE weighting. Extending the current method to obtain variance estimates for the other weighting approaches and applying it to Bayesian optimization constitutes an interesting research direction.