An Overview of Gradient-Enhanced Metamodels with Applications

Laurent, Luc; Le Riche, Rodolphe; Soulier, Bruno; Boucard, Pierre-Alain

doi:10.1007/s11831-017-9226-3

An Overview of Gradient-Enhanced Metamodels with Applications

Original Paper
Published: 17 July 2017

Volume 26, pages 61–106, (2019)
Cite this article

Download PDF

Access provided by Autonomous University of Puebla

Archives of Computational Methods in Engineering Aims and scope Submit manuscript

An Overview of Gradient-Enhanced Metamodels with Applications

Download PDF

Luc Laurent ORCID: orcid.org/0000-0002-8935-5929¹,
Rodolphe Le Riche²,
Bruno Soulier³ &
…
Pierre-Alain Boucard³

1929 Accesses
47 Citations
9 Altmetric
1 Mention
Explore all metrics

Abstract

Metamodeling, the science of modeling functions observed at a finite number of points, benefits from all auxiliary information it can account for. Function gradients are a common auxiliary information and are useful for predicting functions with locally changing behaviors. This article is a review of the main metamodels that use function gradients in addition to function values. The goal of the article is to give the reader both an overview of the principles involved in gradient-enhanced metamodels while also providing insightful formulations. The following metamodels have gradient-enhanced versions in the literature and are reviewed here: classical, weighted and moving least squares, Shepard weighting functions, and the kernel-based methods that are radial basis functions, kriging and support vector machines. The methods are set in a common framework of linear combinations between a priori chosen functions and coefficients that depend on the observations. The characteristics common to all kernel-based approaches are underlined. A new $\nu $-GSVR metamodel which uses gradients is given. Numerical comparisons of the metamodels are carried out for approximating analytical test functions. The experiments are replicable, as they are performed with an opensource available toolbox. The results indicate that there is a trade-off between the better computing time of least squares methods and the larger versatility of kernel-based approaches.

Kernel Methods

Collinear Gradients Method for Minimizing Smooth Functions

Article 14 March 2023

Linear, Logistic, and Kernel Regression

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

1 Introduction

Despite continuous progress in the accuracy of experimental measurements and numerical simulations of the physics of a considered system, the need for metamodels keeps increasing. Metamodels are statistical or functional models of input–ouput data that are obtained either from experimental measures or from the numerical simulation of the associated physical phenomena. Metamodels are sometimes called surrogates, proxies, regression functions, approximating functions, supervised machine learners or are referred to with specific names such as the ones described later in this article. Although not directly linked to the physics, metamodels have proven to be necessary for creating simple, computationally efficient associations between the input and output of the considered phenomena. For example, in materials sciences input may be material properties or boundary conditions and outputs are displacements, forces, temperatures, concentrations or other quantities at specific locations; in design, inputs may be the parameters describing a shape or a material and outputs specific measures of performance such as mass, strength, stiffness; in geophysics inputs may be parameterized descriptions of the underground (permeabilities, faults, reservoir shapes) and the outputs quantities observed at the surface (flow rates, displacements, accelerations, gravity). Typically, actual or numerical experiments are costly in terms of time or other resources, in which case metamodels are a key technology to perform optimization, parameter identification and uncertainty propagation.

Infering nonlinear relationships requires large amount of data particularly when the number of input parameters grows (the “curse of dimensionality” [1]) so that it is important to use all available additional information. Gradients, i.e., derivatives of the outputs with respect to the inputs, are one of the most common and most useful side knowledge to be accounted for when building the metamodels: many finite elements codes have implemented adjoint methods to calculate gradients [2,3,4]; automatic differentiation is another solution for computing gradients [5,6,7]; there are responses such as volumes for which analytical gradient calculation is accessible.

Accounting for gradients when building metamodels often allows to decrease the necessary number of data points to achieve a given metamodel accuracy, or equivalently, it allows to increase the metamodel accuracy at a given number of data points. When guessing a function with a locally changing behavior (a non stationary process in the probability terminology) from a sparse set of observations, the traditional regression techniques relying only on the function values will tend to damp out local fluctuations. This is because useful regression methods comprise regularization strategies that make them robust to small perturbations in the data. Accounting for gradients is a way to recover some of the meaningful local fluctuations. The need for gradient information has been acknowledged in geophysics for reconstructing a gravity field from stations measurements when the underground is subject to local changes [8, 9]. Further illustrations of the interest of gradients will be given in Sect. 10.

The purpose of this article is to review the various approaches that have been proposed to create metamodels with zero order and gradient information. A global view is first developed. Section 2.1 is a general introduction to metamodeling which may be skipped by readers familiar with the concept. After a Section presenting the main notations, Sect. 2.3 synthetizes into a unique framework the different techniques which will be covered in the review. A generic idea, which can be applied to any surrogate, for indirectly using gradients is summarized in Sect. 3.

The article then details, in turn, each gradient-enhanced method: the large family of least squares approaches are the focus of Sect. 4; Shepard weighting functions are summarized in Sect. 5. All the methods covered later are based on kernels. After summarizing the main concepts behind kernels in Sect. 6, we provide details about gradient-enhanced radial basis functions (Sect. 7), kriging (Sect. 8) and support vector regression (Sect. 9). Note that the formulation of the gradient-enhanced support vector regression ($\nu $-GSVR) proposed in part 9.4 is a new contribution. Multivariate cubic Hermite splines [10, 11] are not discussed in this review as they seem to date to remain a topic of mathematical research.

Finally, the different methods are applied and compared on analytical test functions in Sect. 10. The ensuing analysis of results and presentation of related softwares should help in choosing specific gradient-enhanced techniques. All methods described in this article have been implemented and tested with the opensource matlab/octave GRENAT Toolbox [12].

2 Build, Validate and Exploit a Surrogate Model

2.1 Surrogates and Their Building in a Nutshell

In many contexts, the observation of the response of a parametrized system can be done only for a few parameters instances, also designated as points in the design space. A solution to getting an approximate response at non-sampled parameter instances is to use a metamodel (or surrogate). A metamodel is a doubly parameterized function, one set of parameters being the same as that of the studied system (i.e., the coordinates of the points), the other set of parameters allowing further control of the metamodel response to give it general representation abilities. For simplicity’s sake, parameters of the second set will be designated as internal parameters. The building of the metamodel involves tuning its internal parameters in order to match, in a sense to be defined, the observations at the points.

The simplest metamodels are polynomials tuned by regression, which are part of the response surface methodology (RSM [13]) for analyzing the results of experiments. For dealing with an increase in nonlinearity of the function, rising the degree of the polynomial could seem to be a solution. However, oscillations appear and the number of polynomial coefficients, $n_t,$ to be set grows rapidly, as

$$\begin{aligned} n_t ~=~\left( {\begin{array}{c}n_p + d^\circ \\ d^\circ \end{array}}\right) ~=~\frac{(n_p + d^\circ )!}{n_p! ~d^\circ !} \end{aligned}$$

(1)

where $n_p$ is the number of parameters and $d^\circ $ is the degree of the polynomial. This is why other techniques for approximating functions such as parametric kernel-based metamodels have received much attention.

The literature is already rich in contributions presenting and detailing surrogates models [14,15,16,17,18,19,20,21,22,23]. Hereafter, the basic steps in building and using surrogate models are summarized (see also Fig. 1):

Initial data generation Sampling strategies generate points in the design space (using, for instance, Latin Hypercube Sampling [24]). The responses of the actual function are calculated at each instance of the parameters. In many cases, this step is computationally intensive because the actual function involves a call to, typically, solvers of partial differential equations. Details on sampling techniques can be found in [25,26,27].
Build the metamodel Because data is sparse, parametric surrogate models (which are reviewed in the rest of this paper) are used. This step mainly means determining the model internal parameters.
In many situations and in particular for optimization, enrichment (or infill) strategies are used for adding points to the initial set of sample points. Enrichment strategies post-process the current surrogate. An example of infill method for optimization is the Expected Improvement [28, 29].
Finally the quality of the surrogate model is measured using dedicated criteria (such as $R^2$ or $\mathtt {Q}_{3},$ cf. Sect. 10).

At the end the building process and during the infill steps the surrogate can provide inexpensive approximate responses and gradients of the actual function. For a large number of sample points and/or a large number of parameters, the building of a surrogate can be (computer) time consuming but it is typically less expensive than a nonlinear finite elements analysis.

In the context of optimization, metamodels are often used for approximating objective or constraint functions and the approximation contributes to localizing the potential areas of the optimum. For efficiency in optimization, metamodels are not made accurate in the whole design space but only in potentially good regions. Such family of approaches is designated as surrogate-based analysis and optimization (SBAO) [18]. It is composed of optimization algorithms that rely on a metamodel, a classical optimization algorithm and an infill strategy. SBAO presents some similarities with Trust-Region methods [30] in the use of metamodels. However Trust-Region methods focus on proved rapid local convergence whereas SBAO targets globally optimal points. In SBAO the infill strategy looks for global optima by sequentially optimizing a criterion that is calculated directly with the metamodel (saving calls to the true functions) and that is a compromise between exploration and search intensification: exploration means adding points at badly known areas of the design space, intensification (also referred to as exploitation) means adding points in regions where one expects good performance. Among the many existing criteria [31, 32], Expected Improvement and related criteria [28, 29] are the most common and have led to the efficient global optimization algorithm (EGO) [33]. Thus, the use of surrogates in optimization is iterative: each step of SBAO algorithms includes (i) building a surrogate followed by (ii) optimizing an infill criterion based on the surrogate and then (iii) calling the actual simulation at the point output by the infill subproblem.

We now turn to the focus of this review that is gradient-enhanced metamodels also designated as gradient-assisted or gradient-based metamodels. The next sections present our notations and a global framework for gradient-enhanced metamodels.

2.2 Main Notations

Let us consider an experiment parameterized by $n_{p}$ continuous values grouped in the vector ${\bf x}^{(i)}.$$n_{p}$ if often known as the dimension of the (approximation or optimization) problem. The scalar output, or response, of the experiment is the function ${y}(\cdot).$ The notation ${\bf x}^{(i)}$ designates both sample points ($i\in {\llbracket 1,n_{s} \rrbracket })$ and any non sampled point ($i=0).$ The vectors of responses (also sometimes called observations) and gradients calculated at all sample points are denoted ${\bf y}_{g}$ and are assembled according to Eqs. (2)–(5).

$$\begin{aligned} {\bf y}_{g}=\begin{bmatrix}{\bf y}_{s}^\top&{\bf y}_{gs}^\top \end{bmatrix}^\top , \end{aligned}$$

(2)

with

$$\begin{aligned} {\bf y}_{s}&=\begin{bmatrix} {y}_1&{y}_2&\dots&{y}_{n_{s}}\end{bmatrix}^\top ,\end{aligned}$$

(3)

$$\begin{aligned} {\bf y}_{gs}&=\begin{bmatrix} {y}_{1,1}&{y}_{1,2}&\dots&{y}_{1,n_{p}}&{y}_{2,1}&\dots&{y}_{n_{s},n_{p}}\end{bmatrix}^\top , \end{aligned}$$

(4)

where

$$\begin{aligned} \forall&(i,k)\in \llbracket {0,n_{s}}\rrbracket \times \llbracket {1,n_{p}}\rrbracket ,\nonumber \\ {y}_{i}&={y}{\left( {{\bf x}^{(i)}}\right) }, \qquad {y}_{i,k}=\frac{{\partial {y}}}{\partial x_k}{\left( {\bf x}^{(i)}\right) } . \end{aligned}$$

(5)

More generally, a function ${y}$ and its derivatives is written using the following index notations: ${y}_{,i},$${y}_{,ij}.$.. where i and j take values in ${{\llbracket 0,n_{p}\rrbracket }}$ such as

$$\begin{aligned} {y}_{,i}({\bf x})&={\left\{ \begin{array}{ll} {y}({\bf x}) &{}\text { if } i=0,\\ \frac{{\partial {y}}}{\partial x_i}({\bf x}) &{}\text { if } i\in \llbracket {1,n_{p}}\rrbracket; \end{array}\right. }\end{aligned}$$

(6)

$$\begin{aligned} {y}_{,ij}({\bf x})&={\left\{ \begin{array}{ll} {y}({\bf x}) &{}\text { if } i=j=0\\ \frac{{\partial {y}}}{\partial x_i}({\bf x}) &{}\text { if } i\in \llbracket {1,n_{p}}\rrbracket \text { and } j=0,\\ \frac{{\partial {y}}}{\partial x_j}({\bf x}) &{}\text { if } j\in \llbracket {1,n_{p}}\rrbracket \text { and } i=0,\\ \frac{{\partial ^{2}{y}}}{\partial x_{i}\partial x_{j}}({\bf x}) &{}\text { if } (i,j)\in {\llbracket {1,n_{p}}\rrbracket }^{2}. \end{array}\right. } \end{aligned}$$

(7)

Finally, the notation ${\tilde{\bullet }}$ designates the approximation of the quantity of interest ${\bullet }$ provided by the metamodel. Bold fonts mean vectors and matrices. ${\Vert {\bullet} \Vert }$ denotes the Euclidian distance.

2.3 Introduction to Gradient-Enhanced Metamodels

This review article focuses on metamodels that, in addition to using and describing the responses, also use and model the gradient of the response with respect to the ${{\bf x}}$ parameters. Henceforth, for each sample point the value of the function and the gradients are supposed to have been observed. The following approaches will be covered: indirect approaches, least squares techniques (LS), weighted least squares (WLS), moving least squares (MLS), the Shepard weighting function (IDW), radial basis functions (GRBF), cokriging (GKRG) and support vector regression (GSVR). In these last acronyms, G stands for gradient-enhanced. A condensed view of the main references on which the next sections are based is given in Table 1.

Table 1 Summary of the main references on gradient-based metamodels

Full size table

Before precisely introducing each gradient-based surrogate, we give a common description of all the techniques (that can also be used for describing non-gradient-based models). It is noteworthy that all the surrogates discussed in this paper are obtained by linear combination of “coefficients” and “functions” that we will define soon. The approximation $\tilde{{y}}$ of an actual function ${y}$ can be calculated as follows:

$$\begin{aligned} \forall {\bf x}^{(0)}\in \mathcal {D},\,\, \tilde{{y}}{\left( {{\bf x}^{(0)}}\right) }=A\left( {\bf x}^{(0)};\,{\bf y}_{g},\varvec{\ell }\right) +\sum _{j=0}^M\sum _{i=1}^N B_{ij}\left( {\bf x}^{(0)};\,\varvec{\ell }\right) \times C_{ij}\left( {\bf x}^{(0)};\,{\bf y}_{g},\varvec{\ell }\right) . \end{aligned}$$

(8)

${\varvec{\ell }}$ designates the internal parameters of the surrogate model. The terms A(), B(), and C() are specific to each kind of surrogate model but share common defining properties. A() is the trend term whose goal is to represent the main (average, large scale) features of the function ${{y}()}.$ The B()’s are “functions” and the C()’s are “weighting coefficients”. The B() functions are chosen a priori in the sense that, assuming ${\varvec{\ell }}$ is fixed, they do not depend on the observed responses, ${y{\left( {{\bf x}^{(i)}}\right) }}$ and their derivatives, $\frac{{\partial {y}}}{\partial x_{j}}{\left( {{\bf x}^{(i)}}\right) }.$ However, the B() functions can depend on the locations of the sample points, ${\bf x}^{(i)},$${i=1,\dots ,n_{s}}.$ The B() functions are typically user inputs to the methods.

In contrast, the C() coefficients are calculated from the observations ${y{\left( {{\bf x}^{(i)}}\right) }}$ and ${\frac{{\partial {y}}}{\partial {x}_{j}}\left( {\bf x}^{(i)}\right) },$ so that their linear combination with the B() functions, eventually added to the trend, makes an approximation to ${{y}()},$ as stated in Eq. (8). The coefficients are the weights in the linear combination of the B() functions. For example, if one expects that the response (for $n_{p}=1)$ is proportional to 1/x plus a quadratic term one could a priori choose $B_1(x)=1/x$ and $B_2(x)=x^2$ and create a simple approximation with constant coefficients ${y{\left( {{\bf x}^{(0)}}\right) } ~\approx ~ \sum _{i=1}^{n_t} C_i B_i({\bf x}^{(0)})}.$ The $C_i$’s are then calculated from the observations, which in our context include both the response function and its derivatives at the sampled points, for example so that the approximation fits the observations in a least squares sense. When there is no a priori on the B() functions, a generic choice is made: basis functions (e.g., polynomials) for LS, $\arctan $ for neural networks, kernels evaluated at a given ${\bf x}^{(i)}$ in kernel methods (GRBF, GKRG, GSVR here).

More generally, surrogates can be created by looking, at each ${{\bf x}^{(0)}},$ for the “best” (in a certain sense) linear combination of the B()’s, in which case the coefficients depend on ${{\bf x}^{(0)}}.$ The simplest template of such a surrogate would be ${y{\left( {{\bf x}^{(0)}}\right) } ~\approx ~ }$${\sum _{i=1}^{n_{s}} \text {similarity}\left( {\bf x}^{(i)},{\bf x}^{(0)}\right) y{\left( {{\bf x}^{(i)}}\right) }}$ where ${B_i\left( {\bf x}^{(0)}\right) }$ is a measure of similarity between ${{\bf x}^{(0)}}$ and ${\bf x}^{(i)}$ (not detailed here) and the coefficients $C_i()$ are ${y{\left( {{\bf x}^{(i)}}\right) }}.$ IDW and the kernel methods (GRBF, GKRG and GSVR) are refined examples of such surrogates.

Although mathematically equivalent to a single summation, the double summation in Eq. (8) emphasizes the specific structure of gradient-enhanced surrogates: in all kernel-based surrogates, the index i describes the sample point considered (therefore $N=n_{s} )$ while j represents the variable with respect to which the derivatives are taken, $j=0$ standing for the response without differentiation, (and $M=n_{p}).$

Table 2 summarizes the expressions of the trend, the functions and the coefficients such as they will appear later in the text. Note that all metamodels but LS have internal parameters, ${\varvec{\ell }},$ that, as with non-gradient-enhanced metamodels, are computed from the known points ${({\bf x}^{(i)},{y}({\bf x}^{(i)}))},$ and ${\frac{{\partial {y}}}{\partial {x}_{j}}\left( {\bf x}^{(i)}\right) }$ here, $i=1,n_{s} ,$ in a manner which is specific to each surrogate. For the sake of clarity, the difference between the functions B() and the coefficients C() is made assuming that ${\varvec{\ell }}$ is fixed, otherwise there is no clear general mathematical difference between them.

The methods that will be presented are organized in two groups, the kernel-based methods from Sect. 6 onward, and the rest (before). They can be distinguished in the same way as the two above examples. Kernel-based methods are built from the specification of a kernel, i.e., a function with two inputs that quantifies the similarity between what happens at these two inputs. The other approximations, which in this review are mainly variants of least squares, are made from a priori chosen single input functions that are linearly combined. Despite fundamental differences in the way they are constructed, many equivalences can be found between the methods: generalized least squares also have a kernel which is given at the end of Sect. 4.2; vice versa, the kernels of the GSVR are implicitely products of functions and the GSVR approximation is a linear combination of them like that of least squares; the approximate responses of GRBF and GKRG have the same form [Eqs. (72) and (89) without trend are equivalent]. These connections are further detailed in the paper. As a last common feature of the methods presented, it is striking that all the approaches but GSVR approximate the response by a linear combination of the observations ${\bf y}_{g},$ provided the internal parameters ${\varvec{\ell }}$ are fixed.

Table 2 Global framework for gradient-enhanced surrogates: definition of trends, a priori functions and coefficients as in Eq. (8)

Full size table

The calculation of the approximate gradients will be achieved by deriving Eq. (8), i.e., calculating ${\tilde{{y}}_{,k}(\mathbf {x})},$ which is possible if A, $B_{ij}$ and $C_{ij}$ are differentiable functions. One could think of other ways to build ${\tilde{{y}}_{,i}(\mathbf {x})},$ like learning them directly from the gradient data ${\bf y}_{gs}$ independently from the response ${\bf y}_{s},$ but such techniques are instances of the usual metamodel building (just applied to the gradients) and are out of the scope of this review.

In practice, it is common to only have access to some of the components of the gradient of the response. ${\tilde{{y}}_{,k}(\mathbf {x})}$ would be known while ${\tilde{{y}}_{,l}(\mathbf {x})},$$k \ne l,$ would not. All the techniques reviewed in this article apply only to the components where the derivatives are known. However, to keep notations simple, the derivations will always be carried out for all of the variables, as if all components of the gradient were accessible. Further remarks about missing data and higher order derivatives are given in Sect. 10.5.

3 Indirect Gradient-Based Metamodels

For taking into account the derivative information of the response in the making of a metamodel, the most basic idea is to use a first order Taylor’s series at sample point ${{\bf x}^{\left( {j}\right) }}$ to generate additional data points. For each sample point, for each of the $n_{p}$ parameters, a neighboring point is created,

$$\begin{aligned} \forall (i,k)\in {\llbracket 1,n_{s} \rrbracket }\times {\llbracket 1,n_{p}\rrbracket },\,\, {\bf x}^{(i)}+\Delta x_{k}{\bf {e}}_k, \end{aligned}$$

(9)

where ${{\bf {e}}_k}$ is an orthonormal basis vector of the design space. Under the assumption that ${\Delta x_{k}}$ remains small (${|\Delta x_{k}|\ll 1}),$ the Taylor’s serie provides the extrapolated value of the function ${y}$ at the neighboring point:

$$\begin{aligned} {y{\left( {{\bf x}^{(i)} + \Delta x_{k}{\bf {e}}_k}\right) } \approx y{\left( {{\bf x}^{(i)}}\right) }+\frac{{\partial {y}}}{\partial {x}_{k}}\left( {\bf x}^{(i)}\right) \Delta x_{k}.} \end{aligned}$$

(10)

Finally the non-gradient based metamodel can be built with $n_{s} \times (n_{p}+1)$ sample points and associated responses.

This approach has been used with kriging approximation [43, 48, 81] and has been called “Indirect Cokriging”. The main drawback of this method is that it requires a good choice of the ${\Delta x_{k}}$ parameters: if the value is too small, the kriging correlation matrix can be ill-conditioned and too large a value leads to a degraded extrapolation by Taylor’s expansion. In both cases, the metamodel provides an incorrect approximation. In previous works, Liu [81] used maximum likelihood estimation for estimating the value of each parameter ${\Delta x_{k}}$ and in [48], the ${\Delta x_{k}}$ are chosen equal to $10^{-4}.$

The indirect gradient-based approach does not scale well with dimension as the number of sample points is multiplied by $n_{p}+1$ when compared to a direct approach. Moreover, because the $n_{p}$ new sample points are very close to the initial sample point, numerical issues (such as the bad conditioning of covariance matrices in KRG) occur that complicate the determination of the internal parameters. Regularization or filtering techniques should be brought in. Therefore, indirect gradient-based approaches should only be used in low dimension and for problems where dedicated techniques for determining ${\Delta x_{k}}$ and the internal parameters exist. In other cases, it is better not to use gradients or to opt for a direct gradient-based approach. Examples of indirect gradient-based Kriging and RBF are proposed in Figs. 6 and 9. In these figures, the derivatives of RBF and KRG are determined analytically by deriving their predictors. In such low dimension, the indirect gradient-based approaches, InRBF and InOK, seem to perform as well as the direct gradient-based approaches, GRBF and OCK, in terms of approximating the true response derivative, dy/dx(x). However, as will be seen in Sect. 10 (Figs. 23 and 25), such indirect strategies are not competitive in higher dimensions.

4 Least Squares Approaches

4.1 Non Weighted Least Squares (LS and GradLS)

Least squares regression is the most common technique for approximating functions. Mainly applied in the context of response surface methodology (RSM [13]), the classical regression [82] can be extended for taking into account gradient information [75]. In this text, the acronym LS designates least squares regression without the use of gradients and GradLS is the gradient-enhanced version of it. Notice that this acronym differs from GLS that will designate generalized least squares.

The linear model used for gradient-based formulations remains the same as for non-gradient-based versions:

$${\bf y}_{g}={\bf F}{\varvec{\beta }}+{\varvec{\varepsilon}},$$

(11)

but this time the vector ${\bf y}_{g}$ contains $n_{s} \times (n_{p}+1)$ terms (responses and gradients), the matrix ${{{\bf F}}}$ contains evaluations of the $n_t$ a priori chosen functions $f_j$ and their derivatives at each sample points ${\bf x}^{(i)},$ the vector ${\varvec{\beta }}$ contains $n_t$ polynomial regression coefficients ${\beta _{j}},$ and the vector $ {\varvec{\varepsilon }} $ is made of the $n_{s} \times (n_{p}+1)$ errors of the model.

For gradient-based least squares models, at each point ${\left\{ {\bf x}^{(i)},y{\left( {{\bf x}^{(i)}}\right) },\frac{\text{d}{{y}}}{{\text{d}{\bf x}}}({\bf x}^{(i)})\right\} },$$n_{p}+1$ errors can be written:

$$\begin{aligned} \forall (i,k)\in {\llbracket 1,n_{s} \rrbracket }\times {\llbracket 1,n_{p}\rrbracket },\,&\forall {\bf x}^{(i)}\in \mathcal {D},\,\varepsilon _i=y{\left( {{\bf x}^{(i)}}\right) }-\tilde{{y}}{\left( {{\bf x}^{(i)}}\right) },\end{aligned}$$

(12)

$$\begin{aligned}&\varepsilon _{ik}=\frac{{\partial {y}}}{\partial x_k}\left( {\bf x}^{(i)}\right) -\frac{\partial {\tilde{{y}}}}{\partial x_k}\left( {\bf x}^{(i)}\right) . \end{aligned}$$

(13)

The matrices and vectors of Eq. (11) are now further defined:

$$\begin{aligned} {{\bf F}}&=\begin{bmatrix} {{\bf F}}_s^\top&{{\bf F}}_{gs}^\top \end{bmatrix}^\top ,\end{aligned}$$

(14)

$$ {\varvec{\beta }} = \left[ {\beta _{1} \;\beta _{2} \; \ldots \;\beta _{{n_{t} }} } \right]^{{ \top }} \quad {\varvec{\varepsilon }} = \left[ {\varepsilon _{1} \; \ldots \;\varepsilon _{{n_{s} }} \;\varepsilon _{{11}} \;\varepsilon _{{12}} \; \ldots \;\varepsilon _{{1n_{p} }} \;\varepsilon _{{21}} \;\ldots\; \varepsilon _{{n_{s} n_{p} }} } \right]^{{ \top }} $$

(15)

where

$$\begin{aligned} {{\bf F}}_s&=\begin{bmatrix} f_1\left( {\bf x}^{\left( {1}\right) }\right)&\dots&f_{n_t}\left( {\bf x}^{\left( {1}\right) }\right) \\ \vdots&\ddots&\vdots \\ f_1\left( {\bf x}^{\left( {n_{s}}\right) }\right)&\dots&f_{n_t}\left( {\bf x}^{\left( {n_{s}}\right) }\right) \\ \end{bmatrix}, \end{aligned}$$

(16)

$$\begin{aligned} {{\bf F}}_{gs}&=\begin{bmatrix} \frac{\partial {f_1}}{\partial {x_1}}\left( {\bf x}^{\left( {1}\right) }\right)&\dots&\frac{\partial {{f_{n_t}}}}{\partial x_1}\left( {\bf x}^{\left( {1}\right) }\right) \\ \vdots&\ddots&\vdots \\ \frac{\partial {f_1}}{\partial x_{n_{p}}}\left( {\bf x}^{\left( {1}\right) }\right)&\dots&\frac{\partial {f_{n_t}}}{\partial x_{n_{p}}} \left( {\bf x}^{\left( {1}\right) }\right) \\ \frac{\partial {f_1}}{\partial x_1}\left( {\bf x}^{\left( {2}\right) }\right)&\dots&\frac{\partial {f_{n_t}}}{\partial x_1}\left( {\bf x}^{\left( {2}\right) }\right) \\ \vdots&\ddots&\vdots \\ \frac{\partial {f_1}}{\partial x_{n_{p}}}\left( {\bf x}^{\left( {n_{s}}\right) }\right)&\dots&\frac{\partial {f_{n_t}}}{\partial x_{n_{p}}}\left( {\bf x}^{\left( {n_{s}}\right) }\right) \end{bmatrix}. \end{aligned}$$

(17)

The sizes of the previous matrices ${{\bf F}}_s$ and ${{\bf F}}_{gs}$ are $n_{s} \times n_t$ and $n_{p}n_{s} \times n_t,$ respectively.

The metamodel is built by determining the vector ${\hat{\varvec{\beta }}}$ which minimizes the following mean squares error:

$$\begin{aligned} {\text {MSE}(\varvec{\beta })=\sum _{i=1}^{n_{s}}\left[ \varepsilon _i^2+\sum _{k=1}^{n_{p}}\varepsilon _{ik}^2\right] =\left\| {{\bf F}}\varvec{\beta }-{\bf y}_{g}\right\| ^2_2.} \end{aligned}$$

(18)

Minimizing $\text {MSE}$ over ${\varvec{\beta }}$ yields the function approximation,

$$\begin{aligned} \forall {\bf x}^{(0)}\in \mathcal {D},\,\tilde{{y}}{\left( {{\bf x}^{(0)}}\right) }={\bf f}{\left( {{\bf x}^{(0)}}\right) }\hat{\varvec{\beta }}, \end{aligned}$$

(19)

where

$$\begin{aligned} {\bf f}{\left( {{\bf x}^{(0)}}\right) }&=\begin{bmatrix} f_1\left( {\bf x}^{(0)}\right)&\dots&f_{n_t}\left( {\bf x}^{(0)}\right) \end{bmatrix},\end{aligned}$$

(20)

$$\begin{aligned} \hat{\varvec{\beta }}&=\left( {{\bf F}}^\top {{\bf F}}\right) ^{-1}{{\bf F}}^\top {\bf y}_{g}. \end{aligned}$$

(21)

This metamodel leads to the familiar expression of regression approximation. Notice however that here, the gradients affect the coefficients ${\hat{\varvec{\beta }}}.$

The derivation of the GradLS model by minimization of the quadratic norm $\text {MSE}$ can be interpreted as making the orthogonal projection of the vector of observed responses and gradients ${\bf y}_{g}$ onto the space spanned by the columns of ${{\bf F}}.$ The result of the orthogonal projection is ${{{\bf F}}\hat{\varvec{\beta }}}.$ The projection is itself defined by the inner product it relies on. In least squares without derivatives, the inner product is the usual dot product between vectors in an Euclidean space. Accounting for derivatives extends this inner product to an inner product in a Sobolev space [83]:

$$\begin{aligned} \forall ~(\varvec{\mathbf {u}} _g ,\varvec{\mathbf {v}} _g),~ \langle \varvec{\mathbf {u}} _g , \varvec{\mathbf {v}} _g \rangle = \langle \varvec{\mathbf {u}} _s , \varvec{\mathbf {v}} _s \rangle + \langle \varvec{\mathbf {u}} _{gs} , \varvec{\mathbf {v}} _{gs} \rangle = \sum _{i=1}^{n_{s}} u_i v_i + \sum _{i=1}^{n_{s}}\sum _{k=1}^{n_{p}} u_{i,k} v_{i,k}~, \end{aligned}$$

(22)

where the g, s and gs subscripts have the same meaning as above with ${{\bf F}}.$ While both inner products account for the euclidean distance between the response and the metamodel, the Sobolev inner product further accounts for the difference in response and metamodel regularities through their gradients. In other words, the usual geometrical interpretations of least squares generalize to the least squares with derivatives formulation of Eq. (18) by moving from the Euclidean inner product to a product with additional derivative terms.

The derivatives of the GradLS approximation are directly obtained by deriving Eq. (19),

$$\begin{aligned} \forall k\in {\llbracket 1,n_{p}\rrbracket },\,\forall {\bf x}^{(0)}\in \mathcal {D},\,\frac{\partial {\tilde{{y}}}}{\partial {x_{k}}}\left( {\bf x}^{(0)}\right) =\frac{\partial {{\bf f}}}{\partial {x_{k}}}\left( {\bf x}^{(0)}\right) \hat{\varvec{\beta }}~, \end{aligned}$$

(23)

As required for building the gradient-enhanced least squares model, the functions $f_j$ must be differentiable at least once.

Although the empirical mean square error Eq. (18) can be reduced by increasing the degree of the polynomial basis, ${\tilde{{y}}()}$ will increasingly oscillate between the $n_{s} $ data points, which degrades the prediction quality. This oscillatory phenomenon, known as Runge’s phenomenon [84], is illustrated in Figs. 2a and 3d in 1 and 2 dimensions, respectively. Runge’s oscillations are mitigated when the actual function is polynomial, the number of sample points $n_{s}$ increases, when gradients are accounted for like here, or when a regularization strategy is added to the MSE minimization. For example, when approximating a 4th degree polynomial function using sufficiently many sample points in a dimension low enough so that Eq. (19) can be computed, a 4th degree least squares approach is exact. Regarding the effect of gradients, observe in Figs. 2b and 3g how gradient-enhanced least squares have a more stable response than LS which only uses function values.

4.2 Generalized Least Squares (GLS)

Initially introduced for addressing the uncertainties and correlations in measured responses, generalized least squares (GLS) follow the same logic as the previous LS and GradLS models except that weights are introduced in the error, $\text {MSE}.$ The generalized least squares error which incorporates gradient information now reads [75, 78,79,80],

$$\begin{aligned} \text {E}(\varvec{\beta })&=\left( {\bf y}_{s}-{\tilde{{\bf y}}}_{s}\right) ^\top \varvec{\mathbf {W}} _s\left( {\bf y}_{s}-{\tilde{{\bf y}}}_{s}\right) +\left( {\bf y}_{gs}-{\tilde{{\bf y}}}_{gs}\right) ^\top \varvec{\mathbf {W}} _{gs}\left( {\bf y}_{gs}-{\tilde{{\bf y}}}_{gs}\right) \end{aligned}$$

(24)

$$\begin{aligned}\text {E}(\varvec{\beta })=\left( {\bf y}_{g}-{\tilde{{\bf y}}}_{g}\right) ^\top \varvec{\mathbf {W}} _g\left( {\bf y}_{g}-{\tilde{{\bf y}}}_{g}\right) , \end{aligned}$$

(25)

where $\varvec{\mathbf {W}} _s$ and $\varvec{\mathbf {W}} _{gs}$ are positive definite weight matrices. The minimization of the error leads to the regression coefficients,

$$\begin{aligned} \hat{\varvec{\beta }}=\left( {{\bf F}}^\top \varvec{\mathbf {W}} _g {{\bf F}}\right) ^{-1}{{\bf F}}^\top \varvec{\mathbf {W}} _{g}{\bf y}_{g} , \end{aligned}$$

(26)

where ${{\bf F}}$ and ${\bf y}_{g}$ are the same as in the GradLS approach (see above) and ${\varvec{\mathbf {W}} _g=\text {diag}\begin{bmatrix}\varvec{\mathbf {W}} _s&\varvec{\mathbf {W}} _{gs}\end{bmatrix}}.$ The weighted least squares (WLS) [82] approach is a special case of the generalized least squares (GLS) where $\varvec{\mathbf {W}} _g$ is a diagonal matrix. Note that Eq. (26) encompasses traditional GLS without gradients by setting $\varvec{\mathbf {W}} _{gs} = 0.$

In traditional GLS (models without gradients), the definition of the weight matrix $\varvec{\mathbf {W}} _s$ depends on the context of the study:

If no a priori information on the covariance structure is available, the weights can come from a chosen weighting function R(): ${\varvec{\mathbf {W}} _s=\left[ R({\bf x}^{(i)}-{\bf x}^{\left( {j}\right) })\right] _{1\le i,j\le n_{s}}}.$R() must be such that $\varvec{\mathbf {W}} _s$ is positive definite, a condition shared with kernels and further discussed in Sect. 6.
If a covariance structure is known: $\varvec{\mathbf {W}} _s=\varvec{\mathbf {C}} ^{-1}$ where ${\varvec{\mathbf {C}} =\left[ {\text {cov}}\left[ Y{\left( {{\bf x}^{(i)}}\right) },Y{\left( {{\bf x}^{\left( {j}\right) }}\right) }\right] \right] _{1\le i,j\le n_{s}}}.$ In the case of uncorrelated errors, $\varvec{\mathbf {C}} $ is reduced to the diagonal matrix $\text {diag}[\begin{array}{llll}\sigma _1&\sigma _2&\dots&\sigma _{n_{s}}\end{array}]$ where ${\sigma _i=\text {Var}{[ {\varepsilon _i}]}}$, and GLS degenerates into WLS.

The geometrical interpretation of gradient-enhanced GLS is similar to that of GradLS made in the previous Section, the only difference being that the projection of the vector of observations onto the space spanned by the regression functions is no longer orthogonal but oblique, following the null space of the projection matrix ${{{\bf F}}\hat{\varvec{\beta }}},$${\hat{\varvec{\beta }}}$ given by Eq. (26).

In [78], normalization methods are proposed for calculating the weight matrices of gradient-enhanced GLS:

A standard normalization of responses and gradients where
$$\begin{aligned} \varvec{\mathbf {W}} _s&=\text {diag}\begin{bmatrix}\dfrac{\mu _1}{{y}_1^2}&\dots&\dfrac{\mu _{n_{s}}}{{y}_{n_{s}}^2}\end{bmatrix}, \end{aligned}$$
(27)
$$\begin{aligned} \varvec{\mathbf {W}} _{gs}&=\text {diag}\begin{bmatrix} w_1&\dots&w_1&w_2&\dots&\dots&w_{n_{s}} \end{bmatrix} \text { with } w_i=\dfrac{\mu _i\lambda _i}{\eta _i}. \end{aligned}$$
(28)
The coefficients $\lambda _i$ and $\mu _i$ are meant to balance the influence of the derivatives and responses at each sample point, respectively. $\eta _i$ are normalization coefficients calculated as
$$\begin{aligned} \eta _i=\sum _{k=1}^{n_{p}}\frac{\partial y{\left( {{\bf x}^{(i)}}\right) }}{\partial x_{k}}. \end{aligned}$$
(29)
In this case, $\varvec{\mathbf {W}} _s$ contains $n_{s} $ non-null terms and the diagonal of $\varvec{\mathbf {W}} _{gs}$ contains $n_{s} $ blocks of $n_{p}$ terms, $\dfrac{\mu _i\lambda _i}{\eta _i}.$
A normalization using logarithmic derivatives where $\varvec{\mathbf {W}} _s$ is like that of the standard normalization above and
$$\begin{aligned} {\varvec{\mathbf {W}} _{gs}=\text {diag}\begin{bmatrix} w_1&\dots&w_1&w_2&\dots&\dots&w_{n_{s}} \end{bmatrix} \text { with } w_i=\dfrac{\mu _i\lambda _i\delta _i^2}{{y}_i^2}.} \end{aligned}$$
(30)
The $\delta _k$ coefficients, which are further described in [78], are based on the logarithmic derivatives introduced in [85]. $\mu _i$ and $\lambda _i$ have the same expressions as in the standard normalization.

To close the presentation on gradient-enhanced generalized least squares, following [86], we show how the approach can be looked at as a kernel-based method. This comment uses explanations given in Sect. 8 so that readers not familiar with kernels as covariances of Gaussian processes may wish to read that Section first. The kernel is the covariance between two responses at different locations when the responses are considered as random processes,

$$\begin{aligned} \begin{array}{ll} {\text {cov}}\left[ \hat{Y}({\bf x}),\hat{Y}({\bf x}')\right] &{} = {\text {cov}}\left[ {\bf f}{\left( {{\bf x}}\right) }\hat{\varvec{\beta }},{\bf f}{\left( {{\bf x}'}\right) }\hat{\varvec{\beta }}\right] = \mathbb {E}{\left[ {{\bf f}{\left( {{\bf x}}\right) }\hat{\varvec{\beta }}\hat{\varvec{\beta }}^\top {\bf f}{\left( {{\bf x}'}\right) }^\top }\right] } \\ &{} = {\bf f}{\left( {{\bf x}}\right) } \left( {{\bf F}}^\top \varvec{\mathbf {W}} _g {{\bf F}}\right) ^{-1} {\bf f}{\left( {{\bf x}'}\right) }^\top. \end{array} \end{aligned}$$

(31)

This relation is the expression of the kernel associated to the gradient-enhanced GLS. It is obtained assuming that the responses are centered (i.e., ${ \mathbb {E}{\left[ {Y{\left( {{\bf x}}\right) }}\right] }=0 })$ and the weight matrix is the inverse covariance of the responses and their derivatives, ${\varvec{\mathbf {W}} _g^{-1} = \mathbb {E}{\left[ {\mathbf {Y}_g \mathbf {Y}_g^T}\right] }}.$ It can then be checked that by using this kernel in the general prediction equation of kriging [with a null trend, see Table 6, or equivalently the GRBF prediction formula Eq. (72)],

$$\begin{aligned} \tilde{{y}}\left( {\bf x}^{(0)}\right) = \left[ {\text {cov}}\left[ Y{\left( {{\bf x}^{(0)}}\right) },Y{\left( {{\bf x}^{\left( {1}\right) }}\right) }\right] ,\dots ,{\text {cov}}\left[ Y{\left( {{\bf x}^{(0)}}\right) },Y{\left( {{\bf x}^{\left( {n_{s}}\right) }}\right) }\right] \right] \varvec{\mathbf {C}} ^{-1} {\bf y}_{g} ~, \end{aligned}$$

(32)

one gets back the GLS prediction formula, ${\tilde{{y}}{\left( {{\bf x}^{(0)}}\right) }={\bf f}{\left( {{\bf x}^{(0)}}\right) }\hat{\varvec{\beta }}}$ with ${\hat{\varvec{\beta }}}$ given by Eq. (26).

4.3 Moving Least Squares (MLS)

Classical response surface methods like LS, GradLS or GLS approximate functions by combining once and for all a priori functions, $f_i()~,~i=1,\dots ,n_t,$ that are globally defined throughout the design space. When it is not possible to decide beforehand which functions to combine, as it is the case when the function is expected to have local variations, it can be useful to proceed with local approximations. For example, it was proposed in [87] to apply the classical RSM in neighborhoods of the points of interest. moving least squares (MLS) [88] is a generalization of GLS that builds a different metamodel at each ${{\bf x}^{(0)}}:$

$$\begin{aligned} \forall {\bf x}^{(0)}\in \mathcal {D},\,\tilde{{y}}{\left( {v^{(0)}}\right) }={\bf f}{\left( {{\bf x}^{(0)}}\right) }\hat{\varvec{\beta }}\left( {\bf x}^{(0)}\right) . \end{aligned}$$

(33)

The difference with previous approximations lies in the non constant regression coefficients ${\hat{\varvec{\beta }}\left( {\bf x}^{(0)}\right) }$ [compare Eqs. (19) and (33)]. Like other least squares techniques, the gradient-based MLS (also designated as Hermite version of MLS) [76, 77] is built by minimizing an error function, which here is

$$\begin{aligned} \text {E}\left( \varvec{\beta };\,{\bf x}^{(0)}\right)&=\alpha \sum _{i,j=1}^{n_{s}}w_{ij}\left( {\bf x}^{(0)}\right) \varepsilon _i\varepsilon _j+(1-\alpha )\sum _{i,j=1}^{n_{s}}\sum _{k,l=1}^{n_{p}} w_{ijkl}\left( {\bf x}^{(0)}\right) \varepsilon _{ik}\varepsilon _{jl}\nonumber \\&=\left( {\bf y}_{s}-{\tilde{{\bf y}}}_{s}\right) ^\top \varvec{\mathbf {W}} _{Ms}\left( {\bf x}^{(0)}\right) \left( {\bf y}_{s}-{\tilde{{\bf y}}}_{s}\right) \\ &\qquad+\left( {\bf y}_{gs}-{\tilde{{\bf y}}}_{gs}\right) ^\top \varvec{\mathbf {W}} _{Mgs}\left( {\bf x}^{(0)}\right) \left( {\bf y}_{gs}-{\tilde{{\bf y}}}_{gs}\right) \end{aligned}$$

(34)

$$\begin{aligned}\text {E}\left( \varvec{\beta };\,{\bf x}^{(0)}\right)=\left( {\bf y}_{g}-{\tilde{{\bf y}}}_{g}\right) ^\top \varvec{\mathbf {W}} _{M}\left( {\bf x}^{(0)}\right) \left( {\bf y}_{g}-{\tilde{{\bf y}}}_{g}\right) . \end{aligned}$$

(35)

The weights ${w_{ij}\left( {\bf x}^{(0)}\right) }$ and ${w_{ijlk}\left( {\bf x}^{(0)}\right) }$ depend of the location of ${{\bf x}^{(0)}}.$ These coefficients have the following properties:

$$\begin{aligned} \forall {\bf x}^{(0)}\in \mathcal {D},\,\forall (i,j,k,l)&\in {\llbracket 1,n_{s} \rrbracket }^2\times {\llbracket 1,n_{p}\rrbracket }^2,\,\nonumber \\ w_{ij}\left( {\bf x}^{(0)}\right)&={\left\{ \begin{array}{ll} h(\Vert {\bf x}^{(i)}-{\bf x}^{(0)}\Vert )&{}\text {if }i=j,\\ 0&{}\text {if } i\ne j, \end{array}\right. }\end{aligned}$$

(36)

$$\begin{aligned} w_{ijkl}\left( {\bf x}^{(0)}\right)&={\left\{ \begin{array}{ll} h_{kl}(\Vert {\bf x}^{(i)}-{\bf x}^{(0)}\Vert )&{}\text {if }i=j\text { and }k=l,\\ 0&{}\text {if } i\ne j\text { or } k\ne l , \end{array}\right. } \end{aligned}$$

(37)

where h() and $h_{kl}()$ are weight functions. Although different weight functions could be chosen for the responses and gradients, the simplest solution is to take ${\forall (k,l)\in {\llbracket 1,n_{p}\rrbracket }^2,\, h_{kl}(r)=h(r)}$ (see [76]). $\alpha $ is a coefficient for managing the influence of the derivatives. $\alpha =1$ leads to a MLS approximation without gradients.

The matrix ${\mathbf{W}}_{{\mathbf{M}}} \left( {{\mathbf{x}}^{{(0)}} } \right)$ is diagonal, $ {\mathbf{W}}_{{\mathbf{M}}} \left( {{\mathbf{x}}^{{(0)}} } \right) = {\text{diag}}\left[ {{\mathbf{W}}_{{Ms}} \left( {{\mathbf{x}}^{{(0)}} } \right){\mathbf{W}}_{{Mgs}} \left( {{\mathbf{x}}^{{(0)}} } \right)} \right] ,$ where ${\varvec{\mathbf {W}} _{Ms}\left( {\bf x}^{(0)}\right) }$ and ${\varvec{\mathbf {W}} _{Mgs}\left( {\bf x}^{(0)}\right) }$ are $n_{s} \times n_{s} $ and $n_{s} n_{p}\times n_{s} n_{p}$ matrices, respectively:

$$\begin{aligned}&\forall \in {\llbracket 1,n_{s} \rrbracket },\nonumber \\&\varvec{\mathbf {W}} _{Ms}\left( {\bf x}^{(0)}\right) =\alpha ~\text {diag}\begin{bmatrix}w_{11}\left( {\bf x}^{(0)}\right)&w_{22}\left( {\bf x}^{(0)}\right)&\dots&w_{n_{s} n_{s}}\left( {\bf x}^{(0)}\right) \end{bmatrix},\end{aligned}$$

(38)

$$\begin{aligned}&\varvec{\mathbf {W}} _{Mgs}\left( {\bf x}^{(0)}\right) =\alpha ~\text {diag}\begin{bmatrix}w_{1111}\left( {\bf x}^{(0)}\right)&w_{1122}\left( {\bf x}^{(0)}\right)&\dots&w_{11n_{p}n_{p}}\left( {\bf x}^{(0)}\right) \end{bmatrix}. \end{aligned}$$

(39)

The weight functions are non-negative piecewise functions chosen among the non-exhaustive list provided in Table 3.

Table 3 Examples of weighting functions, h(), for MLS approximation

Full size table

Finally, the MLS surrogate value at a non-sampled point ${{\bf x}^{(0)}}$ is given by Eq. (33) where the coefficients ${\hat{\varvec{\beta }}\left( {\bf x}^{(0)}\right) }$ are obtained by minimizing the weighted mean squares error of Eq. (35). Because the computation of these coefficients has to be done at each requested new point, MLS are computationally more expensive than other least squares techniques.

5 Shepard Weighting Function (IDW)

Also designated as inverse distance weighting method (IDW), the Shepard weighting method was introduced in [91]. The gradient-enhanced version of [46] is based on the modified Shepard Weighting method of [74]. The IDW approximation to the function is written as local linear combinations of local approximations to the true function around point ${\bf x}^{(i)},$$Q_i().$ Initially chosen as a quadratic function in [74], $Q_i()$ are taken here as the first order Taylor approximation at the sampled point ${\bf x}^{(i)}$ for the gradient-enhanced version of IDW [43, 46].

The IDW metamodel is formulated as,

$$\begin{aligned} {\forall {\bf x}^{(0)}\in \mathcal {D},\, \tilde{{y}}{\left( {{\bf x}^{(0)}}\right) }=\displaystyle \sum _{j=1}^{n_{s}}\overline{W_j}\left( {\bf x}^{(0)}\right) Q_j\left( {\bf x}^{(0)}\right).} \end{aligned}$$

(40)

The relative weights,

$$ \overline{W_{j}} ({\mathbf{x}}^{(0)} ) = \frac{W_{j} \left( {\mathbf{x}}^{(0)} \right)}{\sum\limits_{k = 1}^{n_{s}} W_{k} \left( {\mathbf{x}}^{(0)} \right)}, $$

(41)

are made of the inverse distance functions,

$$\begin{aligned} {W_j({\bf x})=\left[ \dfrac{\left( R_w-\left\| {\bf x}-{\bf x}^{\left( {j}\right) }\right\| \right) _{+}}{R_w\left\| {\bf x}-{\bf x}^{\left( {j}\right) }\right\| }\right] ^2~,} \end{aligned}$$

(42)

where ${\forall d\in \mathbb {R},\,\, \left( d\right) _{+}= \max (0,d)},$ and $R_w$ is a radius of influence around ${{\bf x}^{\left( {j}\right) }}.$ The weight functions $W_j$ are such that ${Q_j({\bf x})}$ has an influence on the approximation only in a (hyper)sphere of center ${{\bf x}^{\left( {j}\right) }}$ and radius $R_w.$$R_w$ is set so that the hypershere includes $N_w$ sample points. A discussion on $R_w$ and $N_w$ can be found in [74].

The weight functions of Eqs. (41) and (42) have the following properties:

$$\begin{aligned}&\forall (i,j)\in {\llbracket 0,n_{s} \rrbracket }\times {\llbracket 1,n_{s} \rrbracket },\,\forall {\bf x}^{(i)}\in \mathcal {D},\nonumber \\&\overline{W_j}\left( {\bf x}^{(i)}\right) =\delta _{ij}={\left\{ \begin{array}{ll} 0&{}\text {if}\,\,j\ne i ~,\\ 1&{}\text {if}\,\,j=i. \end{array}\right. } \end{aligned}$$

(43)

The function ${Q_j({\bf x})}$ is a first order Taylor approximation of ${y}$ at ${{\bf x}^{\left( {j}\right) }},$

$$ \forall {\mathbf{x}} \in {\mathcal{D}},Q_{j} ({\mathbf{x}}) = y\left( {{\mathbf{x}}^{{\left( j \right)}} } \right) + \sum\limits_{{k = 1}}^{{n_{p} }} {\frac{{\partial y\left( {{\mathbf{x}}^{{\left( j \right)}} } \right)}}{{\partial x_{k} }}\left( {x_{k} - x_{k}^{{(j)}} } \right).} $$

(44)

The IDW approximation interpolates responses and gradients of the actual function at the sample points. To prove it, the IDW prediction and its derivatives are now calculated at the sample points:

$$\begin{aligned}&\forall i\in {\llbracket 1,n_{s} \rrbracket },\,\forall {\bf x}^{(i)}\in \mathcal {D},\nonumber \\ \tilde{{y}}{\left( {{\bf x}^{(i)}}\right) }&=\displaystyle \sum _{j=1}^{n_{s}}\overline{W_j}\left( {\bf x}^{(i)}\right) Q_j\left( {\bf x}^{(i)}\right) =Q_i\left( {\bf x}^{(i)}\right) \nonumber \\&=y{\left( {{\bf x}^{(i)}}\right) } ~; \end{aligned}$$

(45)

$$\begin{aligned}&\forall (i,l)\in {\llbracket 1,n_{s} \rrbracket }\times {\llbracket 1,n_{p}\rrbracket },\,\forall {\bf x}^{(i)}\in \mathcal {D},\nonumber \\ \frac{\partial \tilde{{y}}{\left( {{\bf x}^{(i)}}\right) }}{\partial x_{l}}&=\displaystyle \sum _{j=1}^{n_{s}}\left[ \frac{\partial \overline{W_j}\left( {\bf x}^{(i)}\right) }{\partial x_{l}}Q_j\left( {\bf x}^{(i)}\right) +\overline{W_j}\left( {\bf x}^{(i)}\right) \frac{\partial Q_j\left( {\bf x}^{(i)}\right) }{\partial x_{l}}\right] \nonumber \\&=\frac{\partial Q_i\left( {\bf x}^{(i)}\right) }{\partial x_{l}}=\frac{\partial {y}\left( {\bf x}^{(i)}\right) }{\partial x_{l}} ~, \end{aligned}$$

(46)

because,

$$\begin{aligned}&\forall (i,j,l)\in {\llbracket 1,n_{s} \rrbracket }^2\times {\llbracket 1,n_{p}\rrbracket },\forall {\bf x}^{(i)}\in \mathcal {D},\nonumber \\&\frac{\partial \overline{W_j}\left( {\bf x}^{(i)}\right) }{\partial x_{l}}=0. \end{aligned}$$

(47)

The IDW metamodel bears similarities to the kernel methods of Sects. 6, 7, 8, 9: ${\overline{W_j}\left( {\bf x}\right) }$ is a double input function that grows with proximity between ${{\bf x}}$ and ${{\bf x}^{\left( {j}\right) }};$ In IDW, ${\overline{W_j}\left( {\bf x}\right) }$ is multiplied with response estimates (the $Q_j()$’s) in a way that is reminiscent of kriging, cf. GKRG in Table 2. Note also that, when compared to the other metamodels reviewed in this paper, IDW is the only approach that neither requires the inversion of large ($n_{s} (n_{p}+1)$ by $n_{s} (n_{p}+1))$ systems of linear equations nor the resolution of optimization problems as GSVR will. For this reason, IDW is computationally inexpensive. We now turn to the already mentioned kernel methods.

6 Kernel Functions for Gradient-Enhanced Kernel-Based Metamodels

Most kernel-based metamodels have been developped in the field of machine learning. While support vector machines are arguably the most well-known, other approximation techniques belong to kernel-based techniques. In this article, we will focus on radial basis functions (see Sect. 7), Kriging (see Sect. 8) and support vector regression (see Sect. 9). These three surrogate models, like all kernel-based metamodels, require choosing a kernel function or kernel, ${\varPsi },$ which measures a similarity, ${\varPsi {\left( {{\bf x}^{(i)},{\bf x}^{\left( {j}\right) }}\right) }},$ between any two points ${\bf x}^{(i)}$ and ${{\bf x}^{\left( {j}\right) }},$ and is therefore a double input function. Kernel functions are examples of the functions B() of the general metamodel framework, Eq. (8).

As will be done in Sect. 8 about kriging, one can look at the responses at each point ${{\bf x}}$ as a random process, ${Y{\left( {{\bf x}}\right) }}.$ With this point of view, since a kernel is a similarity measure, it is natural to define a kernel as the correlation between the responses at different locations, ${\varPsi {\left( {{\bf x}^{(i)},{\bf x}^{\left( {j}\right) }}\right) } = {\text {corr}}\left[ Y{\left( {{\bf x}^{(i)}}\right) },Y{\left( {{\bf x}^{\left( {j}\right) }}\right) }\right] }.$

Kernels must satisfy Mercer’s conditions [92] which means that they must be continuous, symmetric and positive definite, a necessary condition for correlation functions. This is most easily done by taking the kernel function in a list of known Mercer’s kernels [86, 92, 93].

In the case of gradient-enhanced approximations, a great simplification comes from the fact that the kernels involving gradients are deduced from the kernel involving only the responses: the correlation functions between a response and a gradient is the derivative of the kernel and the correlation between two gradients is the second derivative of the correlation, cf. Eq. (92).

An additional condition on the kernel functions has then to be satisfied: the kernels used in gradient-enhanced metamodels must be twice differentiable.

Multidimensional kernel functions ${\varPsi }$ are usually built from unidimensional kernels ${h}$ by taking the product,

$$\begin{aligned} \forall ({\bf x}^{(i)},{\bf x}^{\left( {j}\right) },\varvec{\ell })\in \left( \mathbb {R} ^{n_{p}}\right) ^3,&\nonumber \\ \varPsi {\left( {{{\bf x}^{(i)},{\bf x}^{\left( {j}\right) }};\varvec{\ell }}\right) }&=\prod _{k=1}^{n_{p}}h(x_{k}^{(i)}-x_{k}^{(j)};{\ell }_{k}), \end{aligned}$$

(48)

where ${\varvec{\ell }}$ is the vector of the kernel internal parameters. In the above formula, we have introduced the stationarity assumption that the similarity between two points depends only on the vector separating them and not on where they are located, ${h(x_{k}^{(i)},x_{k}^{(j)}) = h(x_{k}^{(i)}-x_{k}^{(j)}) = h(r)}.$ The sign of r is kept to simplify the calculations of the kernel derivatives. For gradient-enhanced metamodels, common twice differentiable kernel functions are summarized in Table 4 (see for example [62, 63, 94, 95]).

Table 4 Examples of kernel functions, ${r = x_{k}^{(i)}-x_{k}^{(j)}}$

Full size table

Introduced by Stein [96] in the context of approximation, the Matérn class [97] of kernels have parameters that make them highly adjustable. Matérn kernels use a modified Bessel function of the second kind $K_\nu $ normalized by a Gamma function $\varGamma (\nu ).$ Thanks to the parameter $\nu ,$ the smoothness of the kernel function can be accurately controlled. Matérn functions and their derivatives for 3 values of $\nu $ and ${\ell =0.8}$ are plotted in Fig. 4. In practice, two specific values of $\nu $ leads to the most often used Matérn 3/2 and Matérn 5/2 ($\nu =3/2$ or 5/2) functions. Figures 5b, c show these functions and their derivatives for 3 values of the parameter ${\ell }.$ They can be compared with the squared exponential kernel presented on Fig. 5a. The Matérn function is $\lceil \nu \rceil $ times differentiable [96] (where $\lceil \bullet \rceil $ denotes the ceiling function). A stronger result is that the second derivative is continuous in 0 and its asymptotic value is [98]

$$\begin{aligned} \frac{\text{d}^{2}h(r)}{\text{d}{r^2}}\underset{r\rightarrow 0}{\sim }-\dfrac{\nu }{\ell ^2}\dfrac{\varGamma (\nu -1)}{\varGamma (\nu )}~, \end{aligned}$$

(49)

because the k-th derivative of the metamodel exists if the $k+1$-th derivative of the kernel at 0 exists and is finite ([86] for Gaussian Processes), the Matérn function with $\nu>1$ can be used for building gradient-based ($k=1)$ metamodels. This assessment confirms the validity of the choice $\nu \ge 3/2$ proposed in [62, 94, 95, 99]. The squared exponential kernel has a very simple expression and is often encountered in practice. It should be noted that it yields extremely smooth metamodels: it is infinitely differentiable at $r=0$ and so are the associated surrogates. Such smoothness is often not representative of the true function and, worse, it causes ill-conditioning of matrices in radial basis functions and kriging (cf. Sects. 7 and 8). This is the reason why Matérn kernels should generally be preferred.

The implementation of multidimensionnal kernel functions and their first and second derivatives can lead to a complicated and time consuming code. In order to improve both aspects [99] has proposed the following formulation:

$$\begin{aligned} L_m&=\prod _{k=1}^{m}h(x_{k}^{(i)}-x_{k}^{(j)};{\ell }_{k}); \end{aligned}$$

(50)

$$\begin{aligned} U_m&=\prod _{k>m}^{n_{p}}h(x_{k}^{(i)}-x_{k}^{(j)};{\ell }_{k}); \end{aligned}$$

(51)

$$\begin{aligned} M_{m,n}&=\prod _{m<k<n}^{n_{p}}h(x_{k}^{(i)}-x_{k}^{(j)};\,{\ell }_{k})\;\text {with }m<n. \end{aligned}$$

(52)

The derivatives can then be computed as shown below where only the derivatives of the unidimensional correlation function are needed,

$$\begin{aligned} \frac{\partial \varPsi }{\partial x_{m}^{(i)}}\left( {\bf x}^{(i)},{\bf x}^{\left( {j}\right) };\,\varvec{\ell }\right) =L_mU_m\frac{\text {d}h}{\text {d}x_{m}^{(i)}}\left( x_{m}^{(i)}-x_{m}^{(j)};\,{\ell }_{m}\right) ; \end{aligned}$$

(53)

$$\begin{aligned}&\frac{\partial ^2\varPsi }{{\partial x_{m}^{(i)}}{\partial x_{n}^{(j)}}}\left( {\bf x}^{(i)},{\bf x}^{\left( {j}\right) };\,\varvec{\ell }\right) =\nonumber \\&{\left\{ \begin{array}{ll} L_mM_{m,n}U_n\frac{{\text {d}}h}{{\text {d}}x_{m}^{(i)}}\left( x_{m}^{(i)}-x_{m}^{(j)};\,{\ell }_{m}\right) \frac{{\text {d}}h}{{\text {d}}x_{n}^{(j)}}\left( x_{n}^{(i)}-x_{n}^{(j)};\,{\ell }_{n}\right) &{} \text { if } m\ne n,\\ L_mU_m\frac{\partial ^{2}h}{{\partial x_{m}^{(i)}}{\partial x_{m}^{(j)}}}\left( x_{m}^{(i)}-x_{m}^{(j)};\,{\ell }_{m}\right) &{}\text { if } m=n.\\ \end{array}\right. } \end{aligned}$$

(54)

7 Gradient-Enhanced Radial Basis Function (GRBF)

Gradient-enhanced radial basis function (GRBF) has also been designated as Hermite–Birkhoff or Hermite interpolation [64]. This method was introduced in the more global context of artificial neural networks [65, 66] and it was used for dealing with optimization problems involving expensive solvers in the context of computational fluid dynamics [46, 67] and assembly design [56,57,58,59].

7.1 Building Process

The principle of GRBF is similar to that of classical RBF approach [100,101,102] with an extended basis of functions. The added functions are chosen as the derivatives of the radial basis functions ${\varPsi }.$ Thus, the GRBF approximation reads,

$$\begin{aligned} \forall {\bf x}^{(0)}\in \mathcal {D},&\nonumber \\ \tilde{{y}}{\left( {{\bf x}^{(0)}}\right) }&=\sum _{i=1}^{n_{s}}{w}_{i}\varPsi {\left( {{\bf x}^{(0)},{\bf x}^{(i)}}\right) }+\sum _{j=1}^{n_{p}}\sum _{i=1}^{n_{s}}{w}_{ij}\frac{\partial \varPsi }{\partial x_{j}^{(0)}}\left( {\bf x}^{(0)},{\bf x}^{(i)}\right) \nonumber \\&=\sum _{j=0}^{n_{p}}\sum _{i=1}^{n_{s}}{w}_{ij}\varPsi _{0i,j}, \end{aligned}$$

(55)

where

$$\begin{aligned} \forall&{\bf x}^{(0)}\in \mathcal {D},\,\forall (i,j,k)\in {\llbracket 0,n_{s} \rrbracket }^2\times {\llbracket 1,n_{p}\rrbracket },\nonumber \\ {w}_{ij}&={\left\{ \begin{array}{ll}{w}_{i0}={w}_{i}&{}\text { if }j=0,\\ {w}_{ij}&{}\text { otherwise};\end{array}\right. } \end{aligned}$$

(56)

$$\begin{aligned} \varPsi _{ij,k}&={\left\{ \begin{array}{ll}\varPsi _{ij,0}=\varPsi _{ij}=\varPsi {\left( {{\bf x}^{(i)},{\bf x}^{\left( {j}\right) }}\right) }&{}\text {if }k=0,\\ \varPsi _{ij,k}=\frac{\partial \varPsi _{ij}}{\partial x_{k}^{(i)}}=\frac{\partial \varPsi }{\partial x_{k}^{(i)}}\left( {\bf x}^{(i)},{\bf x}^{\left( {j}\right) }\right) &{}\text {otherwise}.\end{array}\right. } \end{aligned}$$

(57)

Only one half of the first derivatives needs to be calculated because they are odd functions:

$$\begin{aligned} \forall (i,\,j,\,k)&\in {\llbracket 0,n_{s} \rrbracket }^2\times {\llbracket 1,n_{p}\rrbracket },\nonumber \\&\frac{\partial \varPsi }{\partial x_{k}^{(i)}}\left( {\bf x}^{(i)},{\bf x}^{\left( {j}\right) }\right) =-\frac{\partial \varPsi }{\partial x_{k}^{(j)}}\left( {\bf x}^{(i)},{\bf x}^{\left( {j}\right) }\right) . \end{aligned}$$

(58)

The second derivatives of the radial basis functions will be denoted

$$\begin{aligned} \forall (i,\,j,\,k,\,l)\in&{\llbracket 0,n_{s} \rrbracket }^2\times {\llbracket 0,n_{p}\rrbracket }^2,\,\forall ({\bf x}^{(i)},{\bf x}^{\left( {j}\right) })\in \mathcal {D},\nonumber \\ \varPsi _{ij,kl}&=\frac{\partial ^{2}\varPsi }{{\partial x_{k}}{\partial x_{l}}}\left( {\bf x}^{(i)},{\bf x}^{\left( {j}\right) }\right) . \end{aligned}$$

(59)

The GRBF building process consists in the determination of the ${w_{ij}}$’s coefficients by ensuring that the GRBF approximation interpolates the responses and gradients of the actual function at the sample points:

$$\begin{aligned} \forall (k,l)\in {\llbracket 1,n_{s} \rrbracket }\times&{\llbracket 1,n_{p}\rrbracket },\forall {\bf x}^{\left( {k}\right) }\in \mathcal {D},\nonumber \\ \tilde{{y}}{\left( {{\bf x}^{\left( {k}\right) }}\right) }=\tilde{{y}}_k&={y}_k=y{\left( {{\bf x}^{\left( {k}\right) }}\right) },\end{aligned}$$

(60)

$$\begin{aligned} \frac{\partial \tilde{{y}}}{\partial x_{l}}\left( {\bf x}^{\left( {k}\right) }\right) =\tilde{{y}}_{k,l}&={y}_{k,l}=\frac{{\partial {y}}}{\partial {x}_{l}}\left( {\bf x}^{\left( {k}\right) }\right) . \end{aligned}$$

(61)

Equations (60) and (61) lead to the following matrix formulation:

$$\begin{aligned} \varvec{\Psi }_{g}{\bf w}_{g}={\bf y}_{g}. \end{aligned}$$

(62)

The vectors ${{\bf w}_g}$ and ${{\bf y}_{g}}$ contain the RBF coefficients and the responses and gradients of the actual function, respectively. The matrix ${\varvec{\Psi }_g}$ is built from the classical RBF matrix ${\varvec{\Psi }}$ and the first and second derivatives of the radial basis functions matrices, denoted ${\varvec{\Psi }_d}$ and ${\varvec{\Psi }_{dd}},$ respectively:

$$\begin{aligned} \varvec{\Psi }_g&=\begin{bmatrix}\varvec{\Psi }&-\varvec{\Psi }_{d}\\ \varvec{\Psi }_{d}^\top&\varvec{\Psi }_{dd}\end{bmatrix}; \end{aligned}$$

(63)

$$\begin{aligned} \varvec{\Psi }&=\begin{bmatrix} \varPsi _{11}&\varPsi _{12}&\dots&\varPsi _{1n_{s}}\\ \varPsi _{21}&\varPsi _{22}&\dots&\varPsi _{2n_{s}}\\ \vdots&&\ddots&\vdots \\ \varPsi _{n_{s} 1}&\varPsi _{n_{s} 2}&\dots&\varPsi _{n_{s} n_{s}} \end{bmatrix}; \end{aligned}$$

(64)

$$\begin{aligned} \varvec{\Psi }_{d}&=\begin{bmatrix} \varPsi _{11,1}&\varPsi _{11,2}&\dots&\varPsi _{11,n_{p}}&\varPsi _{12,1}&\dots&\varPsi _{1n_{s},n_{p}}\\ \varPsi _{21,1}&\varPsi _{21,2}&\dots&\varPsi _{21,n_{p}}&\varPsi _{22,1}&\dots&\varPsi _{2n_{s},n_{p}}\\ \vdots&&\ddots&\vdots&&\ddots&\\ \varPsi _{n_{s} 1,1}&\varPsi _{n_{s} 1,2}&\dots&\varPsi _{n_{s} 1,n_{p}}&\varPsi _{n_{s} 2,1}&\dots&\varPsi _{n_{s} n_{s},n_{p}} \end{bmatrix}; \end{aligned}$$

(65)

$$\begin{aligned} \varvec{\Psi }_{dd}&=\begin{bmatrix} \varPsi _{11,11}&\varPsi _{11,12}&\dots&\varPsi _{11,1n_{p}}&\varPsi _{12,11}&\dots&\varPsi _{1n_{s},1n_{p}}\\ \varPsi _{11,21}&\varPsi _{11,22}&\dots&\varPsi _{11,2n_{p}}&\varPsi _{12,21}&\dots&\varPsi _{1n_{s},2n_{p}}\\ \vdots&&\ddots&\vdots&&\ddots&\\ \varPsi _{11,n_{p}1}&\varPsi _{11,n_{p}2}&\dots&\varPsi _{11,n_{p}n_{p}}&\varPsi _{12,n_{p}1}&\dots&\varPsi _{1n_{s},n_{p}n_{p}}\\ \varPsi _{21,11}&\varPsi _{21,12}&\dots&\varPsi _{21,1n_{p}}&\varPsi _{22,11}&\dots&\varPsi _{2n_{s},1n_{p}}\\ \vdots&&\ddots&\vdots&&\ddots&\\ \varPsi _{n_{s} 1,n_{p}1}&\varPsi _{n_{s} 1,n_{p}2}&\dots&\varPsi _{n_{s} 1,n_{p}n_{p}}&\varPsi _{n_{s} 2,n_{p}1}&\dots&\varPsi _{n_{s} n_{s},n_{p}n_{p}}\\ \end{bmatrix}. \end{aligned}$$

(66)

The sizes of the $\varvec{\Psi },$$\varvec{\Psi }_d$ and $\varvec{\Psi }_{dd}$ matrices are $n_{s} \times n_{s}, $$n_{s} n_{p}\times n_{s} $ and $n_{s} n_{p}\times n_{s} n_{p},$ respectively. So, matrix $\varvec{\Psi }_g$ contains ${n_{s} (1+n_{p})\times n_{s} (1+n_{p})}$ terms. The other terms in Eq. (62) are

$$\begin{aligned} {\bf w}_g&=\begin{bmatrix}{w}_{1}&\dots&{w}_{n_{s}}&{w}_{11}&{w}_{12}&\dots&{w}_{1 n_{p}}&{w}_{21}&\dots&{w}_{n_{s} n_{p}} \end{bmatrix}^\top ;\end{aligned}$$

(67)

$$\begin{aligned} {\bf y}_{g}&=\begin{bmatrix}{y}_1&\dots&{y}_{n_{s}}&{y}_{11}&{y}_{12}&\dots&{y}_{1 n_{p}}&{y}_{21}&\dots&{y}_{n_{s} n_{p}} \end{bmatrix}^\top . \end{aligned}$$

(68)

The determination of the ${w_{ij}}$’s finally consists in the inversion of the $\varvec{\Psi }_g$ matrix. This square symmetrical matrix is larger than the $\varvec{\Psi }$ of the classical RBF approach.

In order to reduce the computation time, LU or Cholesky factorisation of the $\varvec{\Psi }_g$ matrix can be used.

Finally, the derivatives of the GRBF can be easily calculated by deriving Eq. (55).

Figures 6, 7 and 8 provide illustrations on analytical functions. In the figures, the points have been generated by an Improved Hypercube Sampling technique, IHS [103]. An indirect version of the gradient-enhanced RBF is proposed in 1D. This approach works in unidimensional problems but it becomes unstable as the number of sample points and the number of parameters of the problem increase.

7.2 RBF Kernels and Conditioning

Many radial basis functions have been proposed in the literature (see for example [104]) and they can be completed by the kernels presented in Sect. 6. In the case of gradient-based RBF, the kernels must be at least twice differentiable to comply with the expressions of Eqs. (60) and (61). Thus, Matérn or squared exponential kernels can be used in GRBF. The matrix made of 0, 1 and 2nd order derivatives $\varvec{\Psi }_g$ is guaranteed to be positive definitite as will be explained in Sect. 8.3 about the $\varvec{\mathbf {C}} _c$ matrix which has the same form. However, the conditioning of the matrix may be bad. As already discussed in Sect. 6, the squared exponential kernel is likely to yield an ill-conditioned $\varvec{\Psi }_g$ matrix, an issue that can be addressed through any of the following techniques: use more distant sampled points or equivalently decrease the value of the internal parameters ${\varvec{\ell }};$ replace the squared exponential kernel with a Matérn kernel. Another solution can be to add a very small value (with an order fo magnitude of the machine accuracy) to the diagonal of the $\varvec{\Psi }_g$ matrix. In this case the GRBF approximation will not interpolate the responses and gradients at the sample point.

7.3 Estimation of Parameters

The internal parameters of the RBF metamodel can be determined by minimizing the leave-one-out error (LOO) with respect to ${\varvec{\ell }=({\ell }_{i})_{1\le i\le n_{p}}}$ (and $\nu $ in the case of the Matérn kernel). Based on the principle of Cross-validation [105, 106], the classical LOO error is detailed hereafter where ${\tilde{{y}}_{-i}({\bf x}^{(i)})}$ is the RBF approximation at point ${\bf x}^{(i)}$ without taking into account the response and the gradient of the actual function at that sample point ${\bf x}^{(i)}:$

$$\begin{aligned} \text {LOO}(\varvec{\ell })=\dfrac{1}{n_{s}}\sum _{i=1}^{n_{s}}\left( \tilde{{y}}_{-i}({\bf x}^{(i)})-y{\left( {{\bf x}^{(i)}}\right) }\right) ^2. \end{aligned}$$

(69)

Bompard et al. [107] propose an extended LOO criterion by adding the derivatives information:

$$\begin{aligned} \text {LOO}(\varvec{\ell })=\dfrac{1}{n_{s} (n_{p}+1)}\sum _{i=1}^{n_{s}}\left[ \left( \tilde{{y}}_{-i}({\bf x}^{(i)})-y{\left( {{\bf x}^{(i)}}\right) }\right) ^2\right. \nonumber \\ +\left. \sum _{k=1}^{n_{p}}\left( \frac{\partial \tilde{{y}}_{-i,k}}{\partial x_{k}}({\bf x}^{(i)})-\frac{{\partial {y}}}{\partial {x}_{k}}({\bf x}^{(i)})\right) ^2\right] , \end{aligned}$$

(70)

where ${\frac{\partial \tilde{{y}}_{-i,k}}{\partial x_{k}}}$ is the approximation of the derivative provided by the metamodel built without taking into account the true k-th component of the gradient at point ${\bf x}^{(i)}.$ The approximations ${\tilde{{y}}_{-i}}$ and ${\frac{\partial \tilde{{y}}_{-i,k}}{\partial x_{k}}}$ are therefore obtained through Eq. (60) and partial use of (61) when a gradient is omitted from the LOO error. In order to avoid the building of numerous metamodels associated to each value of the internal parameters, an efficient way for computing the LOO was proposed in [108] and extended to GRBF in [107]. It implies estimating the LOO criterion by inverting the kernel matrix once and for all and calculating a vector product instead of completely building the metamodel each time a data is removed. Finally, due to the multimodality of the LOO, a global optimizer has to be used such as a stochastic optimizer (e.g., an evolution strategy or a particle swarm algorithm).

7.4 Variance of a Stochastic Process Obtained with GRBF

As an extension to an idea given in [29], Bompard [54] proposes to look at the deterministic response ${y}$ as an instance of a stationary Gaussian stochastic process ${Y}$ whose correlation is given by the GRBF kernel and whose constant variance is ${\sigma _{Y}^2 = \text {Var}{\left[ { Y{\left( {{\bf x}}\right) }}\right] }}.$ This allows to describe the mean and variance of the GRBF prediction. Let ${\varvec{\Psi }{\left( {{\bf x}^{(0)}}\right) }}$ be the vector containing the evaluations and first derivatives of the RBF kernels ${\varPsi _{0i,j}}$ [from Eq. (55)] at the new point ${{\bf x}^{(0)}}.$ By solving Eq. (62) for the weights, the GRBF estimation is expressed as a linear combination of the true responses and their derivatives,

$$\begin{aligned} \widetilde{Y}({{\bf x}^{(0)}}) = \varvec{\Psi }{\left( {{\bf x}^{(0)}}\right) }^\top \varvec{\Psi }_g^{-1} {\bf Y}_g. \end{aligned}$$

(71)

A mean and variance expressions are then calculated in a manner similar to kriging:

$$\begin{aligned} \forall {\bf x}^{(0)}\in \mathcal {D},&\nonumber \\ \tilde{{y}}{\left( {{\bf x}^{(0)}}\right) }&=\mathbb {E}{\left[ {\widetilde{Y}({{\bf x}^{(0)}})}\right] } = \varvec{\Psi }{\left( {{\bf x}^{(0)}}\right) }^\top \varvec{\Psi }_g^{-1} \mathbb {E}{\left[ {{\bf Y}_g}\right] } = {\bf y}_{g}^\top \varvec{\Psi }_g^{-1}\varvec{\Psi }{\left( {{\bf x}^{(0)}}\right)}, \end{aligned}$$

(72)

$$\begin{aligned} s_{GRBF}^2&= \mathbb {E}{\left[ {\widetilde{Y}({{\bf x}^{(0)}}) - Y({{\bf x}^{(0)}})}\right] }^2 \nonumber \\&= \sigma _{Y}^2 \left( 1-\varvec{\Psi }{\left( {{\bf x}^{(0)}}\right) }^\top \varvec{\Psi }_g^{-1}\varvec{\Psi }{\left( {{\bf x}^{(0)}}\right) } \right).\end{aligned}$$

(73)

The expression for the mean makes use of the further specification of the observations at the sample points, ${ \mathbb {E}{\left[ {{\bf Y}_g}\right] } = {\bf y}_{g} },$ i.e., one considers the conditional process ${Y}$ knowing the observations at the sample points.

This variance could be used in infill criteria such as the expected improvement [33]. Unfortunately, as was said earlier, this variance calculation will often fail due to loss of positiveness of the GRBF matrix $\varvec{\Psi }_g$ unless specific measures are undertaken.

8 Gradient-Enhanced Cokriging (GKRG)

Kriging, an alternative name for conditional Gaussian Processes, is today one of the main techniques for approximating functions and optimizing expensive to calculate functions. Cokriging is an extension of kriging for dealing with several correlated functions. Initially introduced for geostatistics [34, 35], many works focus on the assumptions, principles and formulations of cokriging [36,37,38, 42]. Gradient-based cokriging was introduced by Morris et al. [39] as a way to account for gradient information in kriging, and has since then been applied to many fields. Table 5 summarizes the references and the kind of applications that concern gradient-enhanced cokriging. Because the concepts underlying gradient-enhanced cokriging have received various names, the last column of the Table lists the original keywords employed by the cited authors. It is seen that gradient-enhanced cokriging has been largely used in the context of fluid problems. The efforts made to calculate gradients in fluid simulations explain this observation.

Table 5 Summary of works using gradient-enhanced cokriging

Full size table

8.1 Formulation of Gradient-Enhanced Cokriging

Gradient-enhanced cokriging is very similar to the classical kriging approach. Random processes associated with the deterministic objective function and its gradients are first defined through the primary response, ${Y},$ and the $n_{p}$ auxiliary responses, ${{W}^{i}}$[35]:

$$\begin{aligned} \forall i\in {\llbracket 1,n_{p}\rrbracket },\,&\forall {\bf x}^{(0)}\in \mathcal {D},\nonumber \\ Y{\left( {{\bf x}^{(0)}}\right) }&={\mu }_{0}\left( {\bf x}^{(0)}\right) +{Z}_{0}\left( {\bf x}^{(0)}\right) , \end{aligned}$$

(74)

$$\begin{aligned} {W}^{i}\left( {\bf x}^{(0)}\right)&={\mu }_{i}\left( {\bf x}^{(0)}\right) +{Z}_{i}\left( {\bf x}^{(0)}\right) . \end{aligned}$$

(75)

In the particular case of gradient-enhanced cokriging, the auxiliary responses ${{W}^{i}}$ correspond to the components of the gradient:

$$\begin{aligned} \forall i\in {\llbracket 1,n_{p}\rrbracket },\,\forall {\bf x}^{(0)}\in \mathcal {D},\,\,{W}^{i}\left( {\bf x}^{(0)}\right) =\frac{\partial Y}{\partial x_{i}}({\bf x}^{(0)}). \end{aligned}$$

(76)

As in regular kriging, ${{\mu }_{i}}$ and ${{Z}_{i}}$ represent, for each response, the deterministic trends and the fluctuations around the trends:

$$\begin{aligned} \forall {\bf x}^{(0)}\in \mathcal {D},&\nonumber \\ \mathbb {E}{\left[ {Y{\left( {{\bf x}^{(0)}}\right) }}\right] }&=\mathbb {E}{\left[ {{\mu }_{0}\left( {\bf x}^{(0)}\right) +{Z}_{0}\left( {\bf x}^{(0)}\right) }\right] }={\mu }_{0}\left( {\bf x}^{(0)}\right) , \end{aligned}$$

(77)

$$\begin{aligned} \sigma _{Y}^2=\text {Var}{\left[ {Y{\left( {{\bf x}^{(0)}}\right) }}\right] }&=\text {Var}{\left[ {{\mu }_{0}\left( {\bf x}^{(0)}\right) +{Z}_{0}\left( {\bf x}^{(0)}\right) }\right] }=\text {Var}{\left[ {{Z}_{0}\left( {\bf x}^{(0)}\right) }\right] }=\sigma _{Z_0}^2; \end{aligned}$$

(78)

$$\begin{aligned} \forall i\in {\llbracket 1,n_{p}\rrbracket },\,&\forall {\bf x}^{(0)}\in \mathcal {D},\nonumber \\ \mathbb {E}{\left[ {Z_i({\bf x}^{(0)})}\right] }&=0,\,\,\mathbb {E}{\left[ {{W}^{i}\left( {\bf x}^{(0)}\right) }\right] }={\mu }_{i}({\bf x}^{(0)}), \end{aligned}$$

(79)

$$\begin{aligned} \sigma _{{W}^{i}}^2=\text {Var}{\left[ {{W}^{i}\left( {\bf x}^{(0)}\right) }\right] }&=\text {Var}{\left[ {{\mu }_{i}\left( {\bf x}^{(0)}\right) +{Z}_{i}\left( {\bf x}^{(0)}\right) }\right] }=\text {Var}{\left[ {{Z}_{i}\left( {\bf x}^{(0)}\right) }\right] }=\sigma _{Z_i}^2. \end{aligned}$$

(80)

All ${{Z}_{i}}$’s are centered stationary Gaussian Processes. As in usual kriging, the covariance of ${{Z}_{0}}$ is a function of a generalized distance among the sample points. Some other cross-covariance relations have to be introduced for the auxiliary variables. These covariances and cross-covariances are defined in Sects. 8.3 and 8.5.

The trend models ${{\mu }_{i}}$ can be chosen independently of one another [111] and this choice leads to different kinds of (co)kriging (simple when ${{\mu }_{i}}$ is a known constant, ordinary when ${{\mu }_{i}}$ is an unknown constant and universal in the general case that it is both unknown and a function of ${{\bf x}}).$ In this paper, the universal cokriging model where the trend is built using polynomial regression will be detailed, see Eq. (81). In order to limit the amount of required inputs, the trend models of the auxiliary responses, ${{\mu }_{i}},$${i\in {\llbracket 1,n_{p}\rrbracket }},$ will be obtained by differentiation of the primary response trend, ${{\mu }_{0}}$ [see [39] and Eq. (82)].

$$\begin{aligned} \forall {\bf x}^{(0)}\in \mathcal {D},&\nonumber \\ {\mu }_{0}\left( {\bf x}^{(0)}\right)&=\sum _{j=1}^{n_t}\varvec{\beta }_{j}f_{j}\left( {\bf x}^{(0)}\right) ={\bf f}_{0}^\top \varvec{\beta }, \end{aligned}$$

(81)

$$\begin{aligned} \forall i\in {\llbracket 1,n_{p}\rrbracket },\,\,{\mu }_{i}\left( {\bf x}^{(0)}\right)&=\frac{\partial {\mu }_{0}}{\partial x_{i}}\left( {\bf x}^{(0)}\right) =\sum _{j=1}^{n_t}\varvec{\beta }_{j}\frac{\partial f_{j}}{\partial x_{i}}({\bf x}^{(0)})={\bf f}_{0}^{iT}\varvec{\beta }, \end{aligned}$$

(82)

where

$$\begin{aligned} \varvec{\beta }&=\begin{bmatrix}\varvec{\beta }_{1}&\varvec{\beta }_{2}&\dots&\varvec{\beta }_{n_t}\end{bmatrix}^\top ;\\ {\bf f}_{0}&=\begin{bmatrix}f_{1}\left( {{\bf x}^{(0)}}\right)&f_{2}\left( {{\bf x}^{(0)}}\right)&\dots&f_{n_t}\left( {{\bf x}^{(0)}}\right) \end{bmatrix}^\top ;\\ \forall i\in {\llbracket 1,n_{p}\rrbracket },\,\,{\bf f}_{0}^i&=\begin{bmatrix}\frac{\partial f_{1}}{\partial x_{i}}({\bf x}^{(0)})&\frac{\partial f_{2}}{\partial x_{i}}({\bf x}^{(0)})&\dots&\frac{\partial f_{n_t}}{\partial x_{i}}({\bf x}^{(0)})\end{bmatrix}^\top. \end{aligned}$$

The best linear unbiased predictor (BLUP) of the response using both primary and auxiliary responses makes the cokriging model [35]. This predictor is a linear combination of deterministic functions called ${\lambda ()}$’s and the evaluations of primary and auxiliary responses at the sample points:

$$\begin{aligned} \forall {\bf x}^{(0)}\in \mathcal {D},\,\hat{Y}({\bf x}^{(0)})=\sum _{i=1}^{n_{s}}\lambda _{i}^{0}\left( {\bf x}^{(0)}\right) Y{\left( {{\bf x}^{(i)}}\right) }+\sum _{i=1}^{n_{s}}\sum _{j=1}^{n_{p}}\lambda _{i}^{j}\left( {\bf x}^{(0)}\right) {W}^{j}\left( {{\bf x}^{(i)}}\right) . \end{aligned}$$

(83)

The functions ${\lambda ()}$ are evaluated by minimizing the variance of the estimation error $\varepsilon \left( {\bf x}^{(0)}\right) =\hat{Y}({\bf x}^{(0)})-Y{\left( {{\bf x}^{(0)}}\right) }$ while accounting for the unbiasedness condition. Finally, the expressions of the cokriging prediction and variance are obtained. These steps are further explained in the next sections.

8.2 No Bias Condition

The condition for the cokriging estimator to be unbiased is

$$\begin{array} {ll}\forall {\bf x}^{(0)}\in \mathcal {D},\\ \mathbb {E}{\left[ {\hat{Y}({\bf x}^{(0)})-Y{\left( {{\bf x}^{(0)}}\right) }}\right] }&=0\nonumber \\ \mathbb {E}{\left[ {\sum\limits_{i=1}^{n_{s}}\lambda _{i}^{0}\left( {\bf x}^{(0)}\right) Y{\left( {{\bf x}^{(i)}}\right) }+\sum\limits_{i=1}^{n_{p}}\sum\limits_{j=1}^{n_{p}}\lambda _{i}^{j}\left( {\bf x}^{(0)}\right) {W}^{j}\left( {\bf x}^{(0)}\right) -Y{\left( {{\bf x}^{(0)}}\right) }}\right] }&=0\nonumber \\ \sum\limits_{i=1}^{n_{s}}\lambda _{i}^{0}\left( {\bf x}^{(0)}\right) \mathbb {E}{\left[ {Y{\left( {{\bf x}^{(i)}}\right) }}\right] }+\sum\limits_{i=1}^{n_{s}}\sum\limits_{j=1}^{n_{p}}\lambda _{i}^{j}\left( {\bf x}^{(0)}\right) \mathbb {E}{\left[ {{W}^{j}\left( {\bf x}^{(0)}\right) }\right] }-\mathbb {E}{\left[ {Y{\left( {{\bf x}^{(0)}}\right) }}\right] }&=0\nonumber \\ \sum\limits_{i=1}^{n_{s}}\lambda _{i}^{0}\left( {\bf x}^{(0)}\right) {\mu }_{0}\left( {{\bf x}^{(i)}}\right) +\sum\limits_{i=1}^{n_{s}}\sum\limits_{j=1}^{n_{p}}\lambda _{i}^{j}\left( {\bf x}^{(0)}\right) {\mu }_{j}\left( {{\bf x}^{(i)}}\right) -{\mu }_{0}\left( {\bf x}^{(0)}\right)&=0. \end{array}$$

(84)

Inserting the expression of the trend [Eqs. (81) and (82)] leads to

$$\begin{aligned} \sum _{i=1}^{n_{s}}\lambda _{i}^{0}\left( {\bf x}^{(0)}\right) \sum _{k=1}^{n_t}\varvec{\beta }_{k}f_{k}\left( {{\bf x}^{(i)}}\right)+\sum _{i=1}^{n_{s}}\sum _{j=1}^{n_{p}}\lambda _{i}^{j}\left( {\bf x}^{(0)}\right) \sum _{k=1}^{n_t}\varvec{\beta }_{k}\frac{\partial f_{k}}{\partial x_{j}}({\bf x}^{(i)}) -\sum _{k=1}^{n_t}\varvec{\beta }_{k}f_{k}\left( {{\bf x}^{(0)}}\right)&=0~\text {, or,\ }\nonumber \\ \varvec{\lambda }_0^\top {\bf F}\varvec{\beta }+\varvec{\lambda }_{W}^\top {\bf F}_{W}\varvec{\beta }-{\bf f}_{0}^\top \varvec{\beta }&=0. \end{aligned}$$

(85)

with

$$\begin{aligned} \varvec{\lambda }_0&=\begin{bmatrix}\lambda _{1}^{0}&\lambda _{2}^{0}&\dots&\lambda _{n_{s}}^{0}\end{bmatrix}^\top&\text { size }1\times n_{s};\\ \varvec{\lambda }_{W}&=\begin{bmatrix}\lambda _{1}^{1}&\lambda _{1}^{2}&\dots&\lambda _{1}^{n_{p}}&\lambda _{2}^{1}&\dots&\lambda _{n_{s}}^{n_{p}}\end{bmatrix}^\top&\text { size }1\times n_{s} n_{p};\\ {\bf F}&=\begin{bmatrix} f_{1}\left( {{\bf x}^{\left( {1}\right) }}\right)&f_{2}\left( {{\bf x}^{\left( {1}\right) }}\right)&\dots&f_{n_t}\left( {{\bf x}^{\left( {1}\right) }}\right) \\ f_{1}\left( {{\bf x}^{\left( {2}\right) }}\right)&\ddots&\\ \vdots&\\ f_{1}\left( {{\bf x}^{\left( {n_{s}}\right) }}\right)&\dots&\dots&f_{n_t}\left( {{\bf x}^{\left( {n_{s}}\right) }}\right) \end{bmatrix}&\text { size }n_t\times n_{s};\\ {\bf F}_{W}&=\begin{bmatrix}\frac{\partial f_{1}}{\partial x_{1}}\left( {\bf x}^{\left( {1}\right) }\right)&\frac{\partial f_{2}}{\partial x_{1}}\left( {\bf x}^{\left( {1}\right) }\right)&\dots&\frac{\partial f_{n_t}}{\partial x_{1}}\left( {\bf x}^{\left( {1}\right) }\right) \\ \vdots&\vdots \\ \frac{\partial f_{1}}{\partial x_{n_{p}}}\left( {\bf x}^{\left( {1}\right) }\right)&\frac{\partial f_{2}}{\partial x_{n_{p}}}\left( {\bf x}^{\left( {1}\right) }\right)&\dots&\frac{\partial f_{n_t}}{\partial x_{n_{p}}}\left( {\bf x}^{\left( {1}\right) }\right) \\ \frac{\partial f_{1}}{\partial x_{1}}\left( {\bf x}^{\left( {2}\right) }\right)&\frac{\partial f_{2}}{\partial x_{1}}\left( {\bf x}^{\left( {2}\right) }\right)&\dots&\frac{\partial f_{n_t}}{\partial x_{1}}\left( {\bf x}^{\left( {2}\right) }\right) \\ \vdots&\vdots \\ \frac{\partial f_{1}}{\partial x_{n_{p}}}\left( {\bf x}^{\left( {n_{s}}\right) }\right)&\frac{\partial f_{2}}{\partial x_{n_{p}}}\left( {\bf x}^{\left( {n_{s}}\right) }\right)&\dots&\frac{\partial f_{n_t}}{\partial x_{n_{p}}}\left( {\bf x}^{\left( {n_{s}}\right) }\right) \end{bmatrix}&\text { size }n_t\times n_{s} n_{p}. \end{aligned}$$

Equation (85) can be further condensed after a simplification with respect to $\varvec{\beta }:$

$$\begin{aligned} \varvec{\lambda }_c {\bf F}_c={\bf f}_{0}^\top , \end{aligned}$$

(86)

where the vector $\varvec{\lambda }_c=\begin{bmatrix}\varvec{\lambda }_{0}^\top&\varvec{\lambda }_{W}^\top \end{bmatrix}^\top $ includes $(n_{p}+1)\times n_{s} $ cokriging coefficients and ${\bf F}_c=\begin{bmatrix}{\bf F}^\top&{\bf F}_{W}^\top \end{bmatrix}^\top $ is a ${n_t}\times (n_{s} +1)n_{s} $ matrix. It should be remembered that ${\varvec{\lambda }_c}$ and ${{\bf f}_{0}}$ depend on the non-sampled point ${{\bf x}^{(0)}}.$ For simplicity’s sake the functions ${\lambda ()}$ have and will been written without specifying that they are defined only at the non-sampled point ${{\bf x}^{(0)}}.$

8.3 Formulation of the Variance

The variance of the cokriging error estimate is

$$\begin{aligned} \forall&{\bf x}^{(0)}\in \mathcal {D},\,\,\\ s_{UCK}^2({\bf x}^{(0)})&=\text {Var}{\left[ {\hat{Y}({\bf x}^{(0)})-Y{\left( {{\bf x}^{(0)}}\right) }}\right] }\\&=\text {Var}{\left[ {\hat{Y}({\bf x}^{(0)})}\right] }+\text {Var}{\left[ {Y{\left( {{\bf x}^{(0)}}\right) }}\right] }-2{\text {cov}}\left[ \hat{Y}({\bf x}^{(0)}),Y{\left( {{\bf x}^{(0)}}\right) }\right] \\&= \text {Var}{\left[ {\sum _{i=1}^{n_{s}}\lambda _{i}^{0}({\bf x}^{(0)}){Z}_{0}\left( {{\bf x}^{(i)}}\right) +\sum _{i=1}^{n_{s}}\sum _{j=1}^{n_{p}}\lambda _{i}^{j}({\bf x}^{(0)}){Z}_{j}\left( {{\bf x}^{(i)}}\right) }\right] }\\&\quad+\text {Var}{\left[ {{Z}_{0}\left( {\bf x}^{(0)}\right) }\right] }\\&\quad-2{\text {cov}}\left[ \sum _{i=1}^{n_{s}}\lambda _{i}^{0}({\bf x}^{(0)}){Z}_{0}\left( {{\bf x}^{(i)}}\right) +\sum _{i=1}^{n_{s}}\sum _{j=1}^{n_{p}}\lambda _{i}^{j}({\bf x}^{(0)}){Z}_{j}\left( {{\bf x}^{(i)}}\right) ,{Z}_{0}\left( {\bf x}^{(0)}\right) \right] \\&=\sigma _{Z_0}^2+\sum _{i=1}^{n_{s}}\sum _{j=1}^{n_{p}}\lambda _{i}^{0}\lambda _{j}^{0}{\text {cov}}\left[ {Z}_{0}\left( {{\bf x}^{(i)}}\right) ,{Z}_{0}\left( {{\bf x}^{\left( {j}\right) }}\right) \right] \\&\quad+\sum _{i=1}^{n_{s}}\sum _{k=1}^{n_{s}}\sum _{j=1}^{n_{p}}\sum _{l=1}^{n_{p}}\lambda _{i}^{j}\lambda _{j}^{l}{\text {cov}}\left[ {Z}_{j}\left( {{\bf x}^{(i)}}\right) ,{Z}_{l}\left( {{\bf x}^{\left( {k}\right) }}\right) \right] \\&\quad+2\sum _{i=1}^{n_{s}}\sum _{k=1}^{n_{s}}\sum _{j=1}^{n_{p}}\lambda _{i}^{0}\lambda _{k}^{j}{\text {cov}}\left[ {Z}_{0}\left( {{\bf x}^{(i)}}\right) ,{Z}_{j}\left( {{\bf x}^{\left( {k}\right) }}\right) \right] \\&\quad-2\sum _{i=1}^{n_{s}}\sum _{j=1}^{n_{p}}\lambda _{i}^{j}{\text {cov}}\left[ {Z}_{j}\left( {{\bf x}^{(i)}}\right) ,{Z}_{0}\left( {{\bf x}^{(0)}}\right) \right] \\&\quad-2\sum _{i=1}^{n_{s}}\lambda _{i}^{0}{\text {cov}}\left[ {Z}_{0}\left( {{\bf x}^{(i)}}\right) ,{Z}_{0}\left( {\bf x}^{(0)}\right) \right] . \end{aligned}$$

The following notations are introduced for simplifying the covariances:

$$\begin{aligned} \forall (i,j,k,l)&\in {\llbracket 0,n_{s} \rrbracket }^2\times {\llbracket 1,n_{p}\rrbracket }^2,\nonumber \\ {\text {cov}}\left[ {Z}_{0}\left( {{\bf x}^{(i)}}\right) ,{Z}_{0}\left( {{\bf x}^{\left( {j}\right) }}\right) \right]&={\text {cov}}\left[ Y{\left( {{\bf x}^{(i)}}\right) },Y{\left( {{\bf x}^{\left( {j}\right) }}\right) }\right] =c_{ij},\nonumber \\ {\text {cov}}\left[ {Z}_{0}\left( {{\bf x}^{(i)}}\right) ,{Z}_{k}\left( {{\bf x}^{\left( {j}\right) }}\right) \right]&= {\text {cov}}\left[ Y{\left( {{\bf x}^{(i)}}\right) },{W}^{k}\left( {{\bf x}^{\left( {j}\right) }}\right) \right] =c_{ij,k},\nonumber \\ {\text {cov}}\left[ {Z}_{k}\left( {{\bf x}^{(i)}}\right) ,{Z}_{l}\left( {{\bf x}^{\left( {j}\right) }}\right) \right]&={\text {cov}}\left[ {W}^{k}\left( {{\bf x}^{(i)}}\right) ,{W}^{l}\left( {{\bf x}^{\left( {j}\right) }}\right) \right] =c_{ij,kl}. \end{aligned}$$

(87)

Now the variance of the cokriging error estimation can be written in matrix notation,

$$\begin{aligned} \forall {\bf x}^{(0)}\in \mathcal {D},\,\,s_{UCK}^2({\bf x}^{(0)})=\sigma _{Z_0}^2+\varvec{\lambda }_c^\top \varvec{\mathbf {C}} _c\varvec{\lambda }_c-2\varvec{\lambda }_c^\top {{\bf c}_{0}}_c, \end{aligned}$$

(88)

where $\varvec{\mathbf {C}} _c=\begin{bmatrix}\varvec{\mathbf {C}}&\varvec{\mathbf {C}} _{WY}\\\varvec{\mathbf {C}} _{WY}^\top&\varvec{\mathbf {C}} _{WW}\end{bmatrix}$ is the cokriging covariance/cross-covariance matrix. It is composed of the classical kriging covariance matrix ${\varvec{\mathbf {C}}},$ the cross-covariance matrix ${\varvec{\mathbf {C}} _{WY}}$ made of covariances between primary and auxiliary responses and the cross-covariance matrix ${\varvec{\mathbf {C}} _{WW}}$ between the auxiliary responses. Using the notations introduced in Eq. (87), these matrices are defined as:

$$\begin{aligned} (\varvec{\mathbf {C}})_{ij}&=c_{ij};\\ \varvec{\mathbf {C}} _{WY}&=\begin{bmatrix} c_{11,1}&c_{11,2}&\cdots&c_{11,n_{p}}&c_{12,1}&\cdots&c_{1n_{s},n_{p}}\\ c_{21,1}&c_{21,2}&\cdots&c_{21,n_{p}}&c_{22,1}&\cdots&c_{2n_{s},n_{p}}\\ c_{31,1}&c_{31,2}&\cdots&\\ \vdots&&\ddots&&&&\vdots \\ c_{n_{s} 1,1}&\cdots&&&&&c_{n_{s} n_{s},n_{p}} \end{bmatrix}&\text { size } n_{s}\,\times\,n_{s} n_{p};\\ \varvec{\mathbf {C}} _{WW}&=\begin{bmatrix} \varvec{\mathbf {C}} _{WW}^{11}&\varvec{\mathbf {C}} _{WW}^{12}&\cdots&\varvec{\mathbf {C}} _{WW}^{1n_{s}}\\ \varvec{\mathbf {C}} _{WW}^{21}&\varvec{\mathbf {C}} _{WW}^{22}&\cdots&\vdots \\ \vdots&&\ddots&\vdots \\ \varvec{\mathbf {C}} _{WW}^{n_{s} 1}&\varvec{\mathbf {C}} _{WW}^{n_{s} 2}&\cdots&\varvec{\mathbf {C}} _{WW}^{n_{s} n_{s}} \end{bmatrix}&\text { size } n_{s} n_{p}\,\times\, n_{s} n_{p}, \end{aligned}$$

$$\begin{aligned} \text { with } \forall (k,l)\in {\llbracket 1,n_{s} \rrbracket }^2,\,\,\varvec{\mathbf {C}} _{WW}^{kl}=\begin{bmatrix} c_{kl,11}&c_{kl,12}&\cdots&c_{kl,1n_{p}} \\ c_{kl,21}&c_{kl,22}&\cdots&\vdots \\ \vdots&&\ddots&\vdots \\ c_{kl,n_{p}1}&c_{kl,n_{p}2}&\cdots&c_{ kl,n_{p}n_{p}} \\ \end{bmatrix}. \end{aligned}$$

The global cokriging covariance matrix $\varvec{\mathbf {C}} _c$ obtained is symmetric and contains $n_{s} (n_{p}+1)$ rows and columns. ${{{\bf c}_{0}}_c}$ is the vector of covariances and cross-covariances between the sampled and any non-sampled points and it is expressed as ${{{\bf c}_{0}}_c=\begin{bmatrix}c_{10}&\dots&c_{n_{s} 0}&c_{10,1}&c_{10,2}&\dots&c_{20,1}&\dots&c_{n_{s} 0,n_{p}}\end{bmatrix}^\top }$ (size $1\times n_{s} n_{p}).$ The matrix ${\varvec{\mathbf {C}} _c}$ is positive definite. The proof is the following: ${\forall {\bf {v}} \in \mathbb {R} ^{n_{s} (n_{p}+1)}},$${{\bf {v}}^\top \varvec{\mathbf {C}} _c {\bf {v}} = }$${\text {Var}{\left[ {{\bf {v}}^\top \left( \begin{array}{c} Y\\ W\end{array}\right) }\right] } \ge 0}$ since a variance is always positive. The matrix $\varvec{\Psi }_g$ of GRBF [see Eq. (63)] is also positive definite because it has the same structure and is made of the same kernels. Above, positive definitness is not strict so that bad conditionning and even non invertibility may happen (e.g., when two sample points are identical).

8.4 Constrained Optimization Problem for Cokriging Building

Using the notations introduced in Eqs. (86) and (88), a cokriging model is built by solving the following constrained optimization problem.

Problem 1

(Universal cokriging) Find ${\varvec{\lambda }_c\in \mathbb {R} ^{n_{s} (n_{p}+1)}}$ that minimizes

$$\begin{aligned}&\varvec{\lambda }_c^\top \varvec{\mathbf {C}} _c\varvec{\lambda }_c-2\varvec{\lambda }_c^\top {{\bf c}_{0}}_c+\sigma _{Z_0}^2, \\ \text {subject to } ~&{\bf F}_c^\top \varvec{\lambda }_c={\bf f}_{0} \end{aligned}$$

Universal kriging and universal cokriging lead to the same constrained optimization problem. In the case of cokriging however, additional cross-covariances are taken into account. This constrained optimization problem is solved by the lagrangian technique which yields the following expressions for cokriging prediction and variance:

$$\begin{aligned} \forall&{\bf x}^{(0)}\in \mathcal {D},\nonumber \\&\tilde{{y}}_{UCK}\left( {\bf x}^{(0)}\right) =\left[ {{\bf c}_{0}}_c+{\bf F}_c\left( {\bf F}_c^\top \varvec{\mathbf {C}} _c^{-1}{\bf F}_c\right) ^{-1}\left( {\bf f}_{0}-{\bf F}_c^\top \varvec{\mathbf {C}} _c^{-1}{{\bf c}_{0}}_c\right) \right] ^\top \varvec{\mathbf {C}} _c^{-1}{\bf y}_{g}, \end{aligned}$$

(89)

with ${{\bf y}_{g}=\begin{bmatrix}{y}_1&\dots&{y}_{n_{s}}&\frac{\partial {y}_1}{\partial x_{1}}&\frac{\partial {y}_1}{\partial x_{2}}&\dots&\frac{\partial {y}_{n_{s}}}{\partial x_{n_{p}}}\end{bmatrix}^\top },$ and

$$\begin{aligned} \forall {\bf x}^{(0)}\in \mathcal {D},\,\,s^2_{UCK}\left( {\bf x}^{(0)}\right)&=\sigma _{Z_0}^2-{{\bf c}_{0}}_c^\top \varvec{\mathbf {C}} _c^{-1}{{\bf c}_{0}}_c+{\bf {u}}_0^\top \left( {\bf F}_c^\top \varvec{\mathbf {C}} _c^{-1}{\bf F}_c\right) ^{-1}{\bf {u}}_0,\\ \text { with }&{\bf {u}}_0={\bf {u}}\left( {\bf x}^{(0)}\right) ={\bf F}_c^\top \varvec{\mathbf {C}} _c^{-1}{{\bf c}_{0}}_c-{\bf f}_{0} . \nonumber \end{aligned}$$

(90)

Like usual kriging, cokriging interpolates the responses at the data points by having the prediction equal to the response and the variance null there. The proof of this property is based on ${{{\bf c}_{0}}_c}$ being equal to the ith column of ${\varvec{\mathbf {C}} _c}$ at ${{\bf x}^{(0)}={\bf x}^{(i)}}.$

Simple and ordinary cokriging can be easily deduced from the previous equations by considering ${{\mu }_{0}\left( {\bf x}\right) =m}$ where $m$ is a known real or ${{\mu }_{0}\left( {\bf x}\right) =\beta }$ where ${\beta }$ is an unknow real. In both cases and according to Eq. (82), ${\forall i\in {\llbracket 1,n_{p}\rrbracket },\,{\mu }_{i}\left( {\bf x}\right) =0}.$ So, ${{\bf F}_c=\begin{bmatrix}\varvec{1}_{n_{s}}^\top&\varvec{0}_{n_{s} \times n_{p}}^\top \end{bmatrix}^\top }$ where ${\varvec{1}_{n_{s}}}$ and ${\varvec{0}_{n_{s} \times n_{p}}}$ are matrices containing $n_{s}$ 1’s and $n_{s} n_{p}\ 0$’s, respectively.

8.5 Covariance Structure

The most critical choice when creating a cokriging model is that of the covariance functions. In applications such as geostatistics (see for instance [35]), this choice can be governed by expert information. In the more general context of computer experiments, there is a wide range of covariance functions to choose from. However, noting that covariance functions are kernel functions such as introduced in Sect. 6, multidimensional kernels can be formed by multiplying unidimensional kernels. Continuing this strategy for gradient-enhanced cokriging, Morris et al. [39] have proposed a general form for the cross-covariance relations:

$$\begin{aligned} \forall k\in {\llbracket 1,n_{p}\rrbracket },\,\forall (a_k,b_k)\in \mathbb {N}^2,\,\forall ({\bf x}^{(i)},{\bf x}^{\left( {j}\right) })\in \mathcal {D},\,{\text {cov}}\left[ Y^{(a_1,a_2,\dots ,a_{n_{p}})}\left( {\bf x}^{(i)}\right) ,Y^{(b_1,b_2,\dots ,b_{n_{p}})}\left( {\bf x}^{\left( {j}\right) }\right) \right] =\sigma _{Y}^2(-1)^{\sum b_j}\prod _{k=1}^{n_{p}}\left[ h^{(a_k+b_k)}(x_{k}^{(i)}-x_{k}^{(j)};{\ell }_{k})\right] , \end{aligned}$$

(91)

where

$$\begin{aligned} Y^{(a_1,a_2,\dots ,a_{n_{p}})}\left( {\bf x}^{(i)}\right) =\dfrac{\partial ^{a_1+a_2\dots +a_{n_{p}}}Y}{\partial x_{1}^{a_1}\partial x_{2}^{a_2}\dots \partial x_{n_{p}}^{a_{n_{p}}}}\left( {\bf x}^{(i)}\right) , \end{aligned}$$

and ${h(r;\ell )}$ is a unidimensional correlation function depending on the real r and the correlation length ${\ell }$ and ${h^{(k)}}$ is its k-th derivative.

Readers can note that kernels are even functions but their first derivative are odd, cf. for example Fig. 5. Therefore, referring to the covariance notation of Eq. (87), the following relation is found: ${\forall (i,j,k)\in {\llbracket 1,n_{s} \rrbracket }^2\times {\llbracket 1,n_{p}\rrbracket },\,\,c_{ij,k}=-c_{ji,k}}.$ More generally, in the case of gradient-enhanced cokriging, the covariances satisfy,

$$\begin{aligned} \forall (i,j,k,l)&\in {\llbracket 0,n_{s} \rrbracket }^2\times {\llbracket 1,n_{p}\rrbracket }^2,\nonumber \\&{\text {cov}}\left[ Y{\left( {{\bf x}^{(i)}}\right) },Y{\left( {{\bf x}^{\left( {j}\right) }}\right) }\right] =c_{ij}=c_{ji}=\sigma _{Y}^2\varPsi {\left( {{{\bf x}^{(i)},{\bf x}^{\left( {j}\right) }};\varvec{\ell }}\right) },\nonumber \\&{\text {cov}}\left[ Y{\left( {{\bf x}^{(i)}}\right) },\frac{\partial Y}{\partial x_{k}}\left( {\bf x}^{\left( {j}\right) }\right) \right] =c_{ij,k}=-\sigma _{Y}^2\frac{\partial \varPsi }{\partial {r_k}}\left( {\bf x}^{(i)},{\bf x}^{\left( {j}\right) };\varvec{\ell }\right) ,\nonumber \\&{\text {cov}}\left[ \frac{\partial Y}{\partial x_{k}}\left( {\bf x}^{(i)}\right) ,Y{\left( {{\bf x}^{\left( {j}\right) }}\right) }\right] =c_{ji,k}=\sigma _{Y}^2\frac{\partial \varPsi }{\partial {r_k}}\left( {\bf x}^{(i)},{\bf x}^{\left( {j}\right) };\varvec{\ell }\right) ,\nonumber \\&{\text {cov}}\left[ \frac{\partial Y}{\partial x_{k}}\left( {\bf x}^{(i)}\right) ,\frac{\partial Y}{\partial x_{l}}\left( {\bf x}^{\left( {j}\right) }\right) \right] =c_{ij,kl}=-\sigma _{Y}^2\frac{\partial ^2\varPsi }{\partial {r_k}\partial {r_l}}\left( {\bf x}^{(i)},{\bf x}^{\left( {j}\right) };\varvec{\ell }\right) , \end{aligned}$$

(92)

where ${r_k = x_{k}^{(i)} - x_{k}^{(j)} },$ and ${\varPsi {\left( {{\bf x}^{(i)},{\bf x}^{\left( {j}\right) };\varvec{\ell }}\right) }=\prod\nolimits _{k=1}^{n_{p}}h(x_{k}^{(i)}-x_{k}^{(j)};{\ell }_{k}) = \prod\nolimits _{k=1}^{n_{p}}h(r_k;{\ell }_{k}) }$ is the multidimensional correlation function.

In the literature, mainly squared exponential functions have been used for building kriging and cokriging approximations. Recently, many works [62, 86, 95, 96] have focused on Matérn [97] covariances, in particular Matérn $\frac{3}{2}$ and $\frac{5}{2}$ [62, 95]. Similarly to RBF and GRBF approximations, Matérn kernels improve the condition number of the covariance matrix, therefore improving the stability of the method.

With the product covariances introduced [see Eq. (92)], the process variance can be factored out of the different covariances in Eq. (89):

$$\begin{aligned} {\varvec{\mathbf {C}}}_c&=\sigma _{Y}^2{{\bf K}}_c; \end{aligned}$$

(93)

$$\begin{aligned} {{\bf c}_{0}}_c&=\sigma _{Y}^2{{\bf r}_{0}}_c. \end{aligned}$$

(94)

8.6 Summary of Cokriging Formulations and First Illustrations

Table 6 summarizes the different cokriging formulations which look similar to kriging formulations with an extended definition of the correlation matrix and vector. If we only consider the metamodel predictions and not its variance or internal parameter learning, the functional forms of simple cokriging without trend and gradient-enhanced radial basis functions are identical [compare Eq. (71) with SCK where $m=0$ in Table 6].

Table 6 Cokriging predictions and variances (SCK simple cokriging, OCK ordinary cokriging, UCK universal cokriging) with ${{\bf F}_{{\bf {10}}}=[\varvec{1}_{n_{s}}^\top \varvec{0}_{n_{s} \times n_{p}}^\top ]}$

Full size table

The Figs. 9, 10 and 11 illustrate how kriging, indirect kriging (the principle of indirect gradient-enhanced metamodels is presented in Sect. 3) and ordinary cokriging approximate one and two-dimensional functions. The indirect version of the gradient-enhanced cokriging is only proposed in 1D, in Fig. 9, where it can be seen that it yields very accurate results (the line cannot be visually separated from the true function on the plot). The Figs. 9, 10 and 11 show that, like for RBF approximation, the use of the gradient information improves the approximation of the analytical function, in particular for multimodal functions such as the Six-hump Camel Back in Fig. 11.

Figure 12 shows confidence intervals calculated with the predictions and the variances of ordinary kriging and cokriging. Remark that the use of the gradients reduces the approximation uncertainty.

When compared to GRBF, GKRG has the same covariance structure: $\varvec{\mathbf {C}} _c$ is the same matrix as $\varvec{\Psi }_g.$ Without trend and when the kernels are the same, the GRBF approximation of Eq. (72) is the same as that of GKRG (cf. Table 6). Differences arise because of the trend and the way the internal parameters are tuned. As a result, as will be observed in Sect. 10, gradient-enhanced cokriging and GRBF have very similar performances with a slight advantage for the cokriging.

8.7 Derivatives of the Cokriging Approximation

Derivatives of the cokriging approximation to the response, ${\frac{\partial \hat{Y}}{\partial x_{i}}\left( {\bf x}^{(0)}\right) },$$i=1,\dots, n_{p},$ can be obtained in two equivalent ways.

Firstly, Eq. (83) can be differentiated with respect to ${x_{i}}$ which means taking the derivatives of the functions ${\lambda ()}.$ Substituting the expression of the ${\lambda ()}$’s amounts to differentiating the correlation vectors ${{{\bf r}_{0}}_c}$ (and the trend functions ${{\bf f}_{0}}$ for universal cokriging) in the expressions for the approximation ${\tilde{{y}}()}$ given in Table 6. To do so, the second derivatives of the kernel functions, which appear in the derivatives of ${{{\bf r}_{0}}_c},$ are needed. The choice of the kernel must be adapted to this goal: squared exponential or Matérn (with $\nu>1)$ kernels are appropriate. It is remarkable that the second derivatives of the kernel functions were already required in the making of the cross-covariance matrix ${\varvec{\mathbf {C}} _{WW}},$ so approximating the derivative does not add requirements to the kernels.

Secondly and in an equivalent manner, the cokriging equations for predicting the response derivatives, ${\frac{\partial \hat{Y}}{\partial x_{i}}\left( {\bf x}^{(0)}\right) },$ can be obtained following the same path as that followed for the response: the cokriging estimate to the derivative is written as a linear combination of both the responses and their derivatives at sample points like in Eq. (83); the no bias condition of Eq. (84) is replaced by a no bias on the derivatives, ${\mathbb {E}{\left[ {\frac{\partial \hat{Y}}{\partial x_{i}}\left( {\bf x}^{(0)}\right) -\frac{\partial Y}{\partial x_{i}}\left( {\bf x}^{(0)}\right) }\right] }=0},$ and results in a relation like Eq. (86) with ${\frac{\partial {\bf f}_{0}}{\partial x_{i}}}$ instead of ${{\bf f}_{0}};$ similarly, the variance minimized is that of the error between derivatives, ${\text {Var}{\left[ {\frac{\partial \hat{Y}}{\partial x_{i}}\left( {\bf x}^{(0)}\right) -\frac{\partial Y}{\partial x_{i}}\left( {\bf x}^{(0)}\right) }\right] }},$ leading to an equation identical to Eq. (88) but ${{{\bf c}_{0}}_c}$ is replaced by the vector ${\frac{{\partial {\bf c}_{0}}_c}{\partial x_{i}}}.$ Therefore, the cokriging models summarized in Table 6 provide models for the derivatives by just differentiating the trend and the correlation vectors. As a result in these (differentiated) models, the kriging interpolation property also applies to the derivatives. A notable feature of such gradient-enhanced cokriging is that the uncertainty of the estimated response derivative is also calculated [99]. This property was not used in previous works but it should turn out to be useful in the context of uncertainty quantification or reliability-based optimization.

8.8 Estimation of the Cokriging Parameters

As in the case of kriging, the estimation of the cokriging parameters, ${{\ell }_{i}},$${\sigma _{Y}}$ and ${\varvec{\beta }}$ (and $\nu $ for the general Matérn kernel), can be achieved using Leave-One-Out or Maximum likelihood techniques. Leave-one-out (LOO) was already introduced for GRBF in Sect. 7.3 and has also been applied to Gradient-Based cokriging (see for example [54]). The Maximum likelihood approach [112] is made possible by the probabilistic interpretation of cokriging and more common than LOO.

Maximum likelihood estimation operates by maximizing the following likelihood function (or minimizing the opposite of its log):

$$\begin{aligned}&L(\varvec{\beta },\sigma _{Y}^2,\varvec{\ell })=\left( 2\pi \sigma _{Y}^2\right) ^{-n_{s} (n_{p}+1)/2}|{{\bf K}}_c(\varvec{\ell })|^{-1/2}\times \exp \left[ -\dfrac{1}{2\sigma _{Y}^2}\left( {\bf y}_{g}-{\bf F}\varvec{\beta }\right) ^\top {{\bf K}}_c(\varvec{\ell })^{-1}\left( {\bf y}_{g}-{\bf F}\varvec{\beta }\right) \right]. \end{aligned}$$

(95)

At a given ${\varvec{\ell }},$$L()$ can be analytically maximized over ${\varvec{\beta }}$ and ${\sigma _{Y}^2}$ which yields the expression of their estimates:

$$\begin{aligned} \hat{\varvec{\beta }}(\varvec{\ell })&= \left( {\bf F}^\top {\bf K}_c(\varvec{\ell })^{-1}{\bf F}\right) ^{-1}{\bf F}^\top {\bf K}_c(\varvec{\ell })^{-1}{\bf y}_{g}; \end{aligned}$$

(96)

$$\begin{aligned} \hat{\sigma }_{Y}^2(\varvec{\ell })&=\dfrac{1}{n_{s}}\left( {\bf y}_{g}-{\bf F}\varvec{\beta }\right) ^\top {\bf K}_c(\varvec{\ell })^{-1}\left( {\bf y}_{g}-{\bf F}\varvec{\beta }\right). \end{aligned}$$

(97)

The correlation lengths ${{\ell }_{i}}$ are obtained by a numerically minimizing the following expression which is the relevant part of minus the log-likelihood where ${\hat{\varvec{\beta }}}$ and ${\hat{\sigma }_{Y}^2}$ have been input:

$$\begin{aligned} \hat{\varvec{\ell }}=\underset{\varvec{\ell }\in \mathbb {L}}{\arg \min }\,\, \psi (\varvec{\ell }) \text { where } \psi (\varvec{\ell })=\hat{\sigma }_{Y}^2(\varvec{\ell })|{\bf K}_c(\varvec{\ell })|^{1/n_{s} {(n_{p}+1)}}. \end{aligned}$$

(98)

Because $\psi ()$ is multimodal, it is essential to perform the minimization with a global optimization algorithm [113]: for example, a stochastic optimizer such as the Particle Swarm Optimizer can be employed [114]. In order to reduce the number of optimization iterations, the gradient of the likelihood function is sometimes calculated and accounted for in the optimization [62, 115]. During the numerical optimization for finding the correlation lengths, ${\varvec{\ell }},$ the correlation matrix ${\bf K}_c$ has to be rebuilt, factorized and inverted at each iteration, which goes along with a noticeable computational cost. However, in most practical situations where metamodels are called in, the objective function relies on numerical simulation such as nonlinear finite elements and remains much more costly than the metamodel. Furthermore, the gain in accuracy of the gradient-based approximations allows in many cases to contain the computational time by reducing the necessary number of sample points [94, 95, 116].

9 Gradient-Enhanced Support Vector Regression (GSVR)

Support vector regression (SVR) is a nonlinear regression method that appeared within the framework of statistical learning theory. It is an extension of the support vector machines (SVMs) originally designed for nonlinear classification [117] and pattern recognition [118].

The literature on SVR is already rich and general introductions may be found in [119,120,121]. Initially built for learning from function responses at sample points, many extensions of SVR to additionally account for derivatives have been proposed. In compliance with the rest of the text, we shall call them gradient-enhanced SVR or GSVR. Initially introduced in [68] with an iteratively re-weighted least squares procedure, GSVR has then been revisited, again with a least squares approach in [70], with regularized least squares in [71], and by the Twin SVR technique in [69, 73]. A general framework for incorporating prior knowledge in SVR which has been applied to function derivatives was also put forward in [72]. More recently, GSVR has been applied to shape optimization in CFD problems [54].

9.1 Building Procedure

We now present the method introduced by [68] and applied in [54]. The approximation is built from a linear combination of the basis functions ${\phi _{i}()}$ and their derivatives (all of which are independent of the observations) added to a constant trend term ${\mu }.$ The ${\vartheta }$’s are the coefficients of the combination, and will be adjusted using the observed responses ${\bf y}_{g}:$

$$\begin{aligned} \forall {\bf x}^{(0)}\in \mathcal {D},&\nonumber \\ \tilde{{y}}{\left( {{\bf x}^{(0)}}\right) }&=\mu +\sum _{i=1}^{n_{s}}{\vartheta }_{i}\phi {\left( {{\bf x}^{(0)},{\bf x}^{(i)}}\right) }+\sum _{j=1}^{n_{p}}\sum _{i=1}^{n_{s}}{\vartheta }_{ij}\frac{\partial \phi }{\partial x_{j}}\left( {\bf x}^{(0)},{\bf x}^{(i)}\right) \nonumber \\&=\mu +\sum _{j=0}^{n_{p}}\sum _{i=1}^{n_{s}}{\vartheta }_{ij}\phi _{0i,j} \end{aligned}$$

(99)

$$\begin{aligned}&=\mu +\varvec{\vartheta }^\top \varvec{\phi }_{g}\left( {\bf x}^{(0)}\right) , \end{aligned}$$

(100)

where ${\varvec{\vartheta }}$ and ${\varvec{\phi }_{g}\left( {\bf x}^{(0)}\right) }$ contain the $n_{s} \times (n_{p}+1)$ coefficients and evaluations of the basis function and its derivatives at the sample points, respectively.

At this point, the expression of the approximation ${\tilde{{y}}{\left( {{\bf x}^{(0)}}\right) }}$ is the same as that in any least squares approach, cf. Eq. 19 for example with ${\varvec{\vartheta }}$ and ${\hat{\varvec{\beta }}},$ and ${\varvec{\phi }_{g}\left( {\bf x}^{(0)}\right) }$ and ${{\bf f}{\left( {{\bf x}^{(0)}}\right) }}$ playing the same roles, respectively. However, the coefficients ${\varvec{\vartheta }}$ will be calculated through a different approach and the a priori functions ${\phi _{i,j}({\bf x})}$ will never be used as such but will always occur in products and hence they will be indirectly specified through a kernel and its derivatives, cf. Sect. 9.2.

Support vector regression seeks to approximate the function responses, ${y{\left( {{\bf x}^{(i)}}\right) }},$ within a $\varepsilon _0$ accuracy and, additionally, GSVR requires the derivatives, ${\frac{{\partial {y}}}{\partial {x}_{k}}\left( {\bf x}^{(i)}\right) },$ to be approximated within $\varepsilon _k$ accuracy. The SVR approximation is made more stable to changes in data by minimizing the vector norm ${\Vert \vartheta \Vert ^2}$ (cf. [117, 120] for explanations on how reducing ${\Vert \vartheta \Vert ^2}$ makes the approximation less flexible, therefore more stable). These considerations lead the constrained convex quadratic optimization Problem 2 where ${\varvec{\xi }^{+},\varvec{\xi }^{-},\varvec{\tau }^{+}}$ and ${\varvec{\tau }^{-}}$ are slack variables on the accuracies for avoiding problems with no feasible solution:

Problem 2

(GSVR as a minimization problem) Find ${\left( \varvec{\vartheta },\mu ,\varvec{\xi }^{+},\varvec{\xi }^{-},\varvec{\tau }^{+},\varvec{\tau }^{-}\right) }$^{Footnote 1} that minimize

$$\begin{aligned} \dfrac{1}{2}\Vert \varvec{\vartheta }\Vert ^2+\dfrac{\varGamma _{0}}{n_{s}}\sum _{i=1}^{n_{s}}\left( \xi ^{+(i)}+\xi ^{-(i)}\right) +\sum _{k=1}^{n_{p}}\dfrac{\varGamma _{k}}{n_{s}}\sum _{i=1}^{n_{s}}\left( \tau _{k}^{+(i)}+\tau _{k}^{-(i)}\right) , \end{aligned}$$

subject to

$$\begin{aligned} \forall (i,k)\in {\llbracket 1,n_{s} \rrbracket }\times {\llbracket 1,n_{p}\rrbracket },\quad {\left\{ \begin{array}{ll} y{\left( {{\bf x}^{(i)}}\right) }-\varvec{\vartheta }^\top \varvec{\phi }{\left( {{\bf x}^{(i)}}\right) }-\mu &{}\le \varepsilon _0+\xi ^{+(i)},\\ \varvec{\vartheta }^\top \varvec{\phi }{\left( {{\bf x}^{(i)}}\right) }+\mu -y{\left( {{\bf x}^{(i)}}\right) }&{}\le \varepsilon _0+\xi ^{-(i)},\\ \frac{{\partial {y}}}{\partial {x}_{k}}\left( {\bf x}^{(i)}\right) -\varvec{\vartheta }^\top \frac{\partial \varvec{\phi }}{\partial x_{k}}\left( {\bf x}^{(i)}\right) &{}\le \varepsilon _k+\tau _{k}^{+(i)},\\ \varvec{\vartheta }^\top \frac{\partial \varvec{\phi }}{\partial x_{k}}\left( {\bf x}^{(i)}\right) -\frac{{\partial {y}}}{\partial {x}_{k}}\left( {\bf x}^{(i)}\right) &{}\le \varepsilon _k+\tau _{k}^{-(i)},\\ \xi ^{+(i)},\xi ^{-(i)},\tau _{k}^{+(i)},\tau _{k}^{-(i)}&{}\ge 0. \end{array}\right. } \end{aligned}$$

The hyper-parameters of the method, ${\varGamma _{k},~k=0,\dots ,n_{p}},$ are user-defined penalty parameters that control the trade-off between approximation regularity (low ${\Vert \varvec{\vartheta }\Vert ^2})$ and approximation accuracy in response and derivatives of the response. Geometrically, the constraints on accuracy are tubes of half-width $\varepsilon _i$ in the space of responses and derivatives outside of which the GSVR criterion is subject to a linear loss at a rate determined by the hyper-parameters ${\varGamma _{k}}.$

Problem 2 can be rewritten as a saddle-point problem involving a Lagrangian and positive Lagrange multipliers ${\alpha ^{\pm (i)}},$${\lambda _{k}^{\pm (i)}},$${\eta ^{\pm (i)}}$ and ${\theta _{k}^{\pm (i)}}$ (a.k.a., dual variables):

Problem 3

(GSVR as a saddle-point problem) Find ${\left( \varvec{\vartheta },\mu ,\varvec{\xi }^{+},\varvec{\xi }^{-},\varvec{\tau }^{+},\varvec{\tau }^{-}\right) }$ and ${\left( \varvec{{\alpha }}^{+},\varvec{{\alpha }}^{-},\varvec{{\lambda }}^{+},\varvec{{\lambda }}^{-},\varvec{\eta }^{+},\varvec{\eta }^{-},\varvec{{\theta }^{+}},\varvec{{\theta }^{-}}\right) }$ that, respectively, minimize and maximize the Lagrangian

$$\begin{aligned} L=&\dfrac{1}{2}\Vert \varvec{\vartheta }\Vert ^2+\dfrac{\varGamma _{0}}{n_{s}}\sum _{i=1}^{n_{s}}\left( \xi ^{+(i)}+\xi ^{-(i)}\right) +\sum _{k=1}^{n_{p}}\dfrac{\varGamma _{k}}{n_{s}}\sum _{i=1}^{n_{s}}\left( \tau _{k}^{+(i)}+\tau _{k}^{-(i)}\right) \\&-\sum _{i=1}^{n_{s}}\alpha ^{+(i)}\left[ \varepsilon _0+\xi ^{+(i)}-y{\left( {{\bf x}^{(i)}}\right) }+\varvec{\vartheta }^\top \varvec{\phi }{\left( {{\bf x}^{(i)}}\right) }+\mu \right] \\&-\sum _{i=1}^{n_{s}}\alpha ^{-(i)}\left[ \varepsilon _0+\xi ^{-(i)}+y{\left( {{\bf x}^{(i)}}\right) }-\varvec{\vartheta }^\top \varvec{\phi }{\left( {{\bf x}^{(i)}}\right) }-\mu \right] \\&-\sum _{k=1}^{n_{p}}\sum _{i=1}^{n_{s}}\lambda _{k}^{+(i)}\left[ \varepsilon _k+\tau _{k}^{+(i)}-\frac{{\partial {y}}}{\partial {x}_{k}}\left( {\bf x}^{(i)}\right) +\varvec{\vartheta }^\top \frac{\partial \varvec{\phi }}{\partial x_{k}}\left( {\bf x}^{(i)}\right) \right] \\&-\sum _{k=1}^{n_{p}}\sum _{i=1}^{n_{s}}\lambda _{k}^{-(i)}\left[ \varepsilon _k+\tau _{k}^{-(i)}+\frac{{\partial {y}}}{\partial {x}_{k}}\left( {\bf x}^{(i)}\right) -\varvec{\vartheta }^\top \frac{\partial \varvec{\phi }}{\partial x_{k}}\left( {\bf x}^{(i)}\right) \right] \\&-\sum _{i=1}^{n_{s}}\left( \eta ^{+(i)}\xi ^{+(i)}+\eta ^{-(i)}\xi ^{-(i)}\right) \\&-\sum _{k=1}^{n_{p}}\sum _{i=1}^{n_{s}}\left( \theta _{k}^{+(i)}\tau _{k}^{+(i)}+\theta _{k}^{-(i)}\tau _{k}^{-(i)}\right). \end{aligned}$$

At a solution, the partial derivatives of the Lagrangian with respect to the primal variables have to vanish:

$$\begin{aligned} \frac{{\partial }L}{\partial \varvec{\vartheta }}&=\varvec{\vartheta }-\sum _{i=1}^{n_{s}}\left( \alpha ^{+(i)}-\alpha ^{-(i)}\right) \varvec{\phi }{\left( {{\bf x}^{(i)}}\right) }\nonumber \\&\qquad -\sum _{k=1}^{n_{p}}\sum _{i=1}^{n_{s}}\left( \lambda _{i}^{+(k)}-\lambda _{i}^{-(k)}\right) \frac{\partial \varvec{\phi }}{\partial x_{k}}\left( {\bf x}^{(i)}\right) =\varvec{0}; \end{aligned}$$

(101)

$$\begin{aligned} \frac{{\partial }L}{\partial \mu }&=-\sum _{i=1}^{n_{s}}\left( \alpha ^{+(i)}-\alpha ^{-(i)}\right) =0; \end{aligned}$$

(102)

$$\begin{aligned} \frac{{\partial }L}{\partial \xi ^{+(i)}}&=\dfrac{\varGamma _{0}}{n_{s}}-\alpha ^{+(i)}-\eta ^{+(i)}=0 \qquad \forall i\in {\llbracket 1,n_{s} \rrbracket }; \end{aligned}$$

(103)

$$\begin{aligned} \frac{{\partial }L}{\partial \xi ^{-(i)}}&=\dfrac{\varGamma _{0}}{n_{s}}-\alpha ^{-(i)}-\eta ^{-(i)}=0 \qquad \forall i\in {\llbracket 1,n_{s} \rrbracket }; \end{aligned}$$

(104)

$$\begin{aligned} \frac{{\partial }L}{\partial \tau _{k}^{+(i)}}&=\dfrac{\varGamma _{k}}{n_{s}}-\lambda _{k}^{+(i)}-\theta _{k}^{+(i)}=0 \qquad \forall (i,k)\in {\llbracket 1,n_{s} \rrbracket }\times {\llbracket 1,n_{p}\rrbracket }; \end{aligned}$$

(105)

$$\begin{aligned} \frac{{\partial }L}{\partial \tau _{k}^{-(i)}}&=\dfrac{\varGamma _{k }}{n_{s}}-\lambda _{k}^{-(i)}-\theta _{k}^{-(i)}=0 \qquad \forall (i,k)\in {\llbracket 1,n_{s} \rrbracket }\times {\llbracket 1,n_{p}\rrbracket }. \end{aligned}$$

(106)

$\varvec{\vartheta },$$\varvec{\eta} ^{\pm }$ and $\varvec{\theta} ^{\pm }$ can readily be solved for from Eq. (101) and Eqs. (103)–(106) which leads to the constrained convex quadratic optimization Problem 4. Notice that the positive sign of the Lagrangian multipliers ${\alpha ^{\pm (i)}},$${\lambda _{k}^{\pm (i)}},$${\eta ^{\pm (i)}}$ and ${\theta _{k}^{\pm (i)}}$ produces the inequality constraints.

Problem 4

(GSVR as a Convex Constrained Quadratic Optimization) Find ${\left( \varvec{{\alpha }}^{+},\varvec{{\alpha }}^{-},\varvec{{\lambda }}^{+},\varvec{{\lambda }}^{-}\right) }$ that minimize

$$\begin{aligned} \dfrac{1}{2} \begin{bmatrix} \varvec{{\alpha }}^{+}\\ \varvec{{\alpha }}^{-}\\ \varvec{{\lambda }}^{+}\\ \varvec{{\lambda }}^{-}\end{bmatrix} ^\top \begin{bmatrix} \varvec{\Psi }_{r}&-\varvec{\Psi }_{r}&\varvec{\Psi }_{rd}&-\varvec{\Psi }_{rd}\\ -\varvec{\Psi }_{r}&\varvec{\Psi }_{r}&-\varvec{\Psi }_{rd}&\varvec{\Psi }_{rd}\\ \varvec{\Psi }_{rd}^\top&-\varvec{\Psi }_{rd}^\top&\varvec{\Psi }_{dd}&-\varvec{\Psi }_{dd}\\ -\varvec{\Psi }_{rd}^\top&\varvec{\Psi }_{rd}^\top&-\varvec{\Psi }_{dd}&\varvec{\Psi }_{dd} \end{bmatrix} \begin{bmatrix} \varvec{{\alpha }}^{+}\\ \varvec{{\alpha }}^{-}\\ \varvec{{\lambda }}^{+}\\ \varvec{{\lambda }}^{-}\end{bmatrix} +\begin{bmatrix} -{\bf y}_{s}\\ {\bf y}_{s}\\ -{\bf y}_{gs}\\ {\bf y}_{gs} \end{bmatrix} ^\top \begin{bmatrix} \varvec{{\alpha }}^{+}\\ \varvec{{\alpha }}^{-}\\ \varvec{{\lambda }}^{+}\\ \varvec{{\lambda }}^{-}\end{bmatrix}+\begin{bmatrix} \varepsilon _0\varvec{1}\\ \varepsilon _0\varvec{1}\\ {\varvec {\varepsilon }}\\ {\varvec {\varepsilon }} \end{bmatrix} ^\top \begin{bmatrix} \varvec{{\alpha }}^{+}\\ \varvec{{\alpha }}^{-}\\ \varvec{{\lambda }}^{+}\\ \varvec{{\lambda }}^{-}\end{bmatrix}, \end{aligned}$$

(107)

subject to

$$\begin{aligned} {\left\{ \begin{array}{ll} &{}\begin{bmatrix} \bf{1}\\ -\bf{1}\end{bmatrix}^\top \begin{bmatrix}\varvec{{\alpha }}^{+}\\ \varvec{{\alpha }}^{-}\end{bmatrix}=0,\\ &{}\begin{bmatrix}\bf{0}\\ \bf{0}\\ \bf{0}\\ \bf{0}\end{bmatrix} \le \begin{bmatrix}\varvec{{\alpha }}^{+}\\ \varvec{{\alpha }}^{-}\\ \varvec{{\lambda }}^{+}\\ \varvec{{\lambda }}^{-}\end{bmatrix}\le \begin{bmatrix}\varGamma _{0}/n_{s} \bf{1}\\ \varGamma _{0}/n_{s} \bf{1}\\ {\varvec {\Gamma }}\\ {\varvec {\Gamma }}\end{bmatrix}. \end{array}\right. } \end{aligned}$$

The vectors ${\bf y}_{s}$ and ${\bf y}_{gs}$ contain the responses and the derivatives of the actual function at the sample points, respectively. ${\varvec {\varepsilon }}$ is made of the $\varepsilon _k$ values (${\forall k\in {\llbracket 1,n_{p}\rrbracket }}).$ The matrices ${\varvec{\Psi }_{r}},$${\varvec{\Psi }_{rd}}$ and ${\varvec{\Psi }_{dd}}$ consist of the evaluations and derivatives of the kernel function (see Sect. 9.2), and ${{\varvec {\Gamma }}}$ designates the vector of ${\varGamma _{k}/n_{s}}.$

Responses or derivatives that are inside their $\varepsilon _k$ tube, that is, responses and derivatives for which the accuracy constraints are satisfied, do not impact the solution of any of the above problems and could be removed altogether from the metamodel building without changing the result. This is because the Lagrange multipliers associated to these points are both (upper and lower limit) null. On the opposite, responses or derivatives at points that have one non-zero Lagrange multiplier influence the metamodel and are called support vectors. The dual variables $\varvec{\alpha} ^{\pm }$ and $\varvec{\lambda} ^{\pm }$ are determined by solving the Constrained Quadratic Optimization Problem 4. Classical Quadratic Programming algorithms [120] such as the Interior Point algorithm can be applied. For dealing with large number of sample points and parameters, dedicated algorithms, such as sequential minimal optimization [122], are preferable. In order to reduce the computational cost of GSVR, a few works introduce new formulations: Lázaro et al. propose the IRWLS algorithm [68]; Jayadeva et al. have devised a regularized least squares approach [71]; Khemchandani et al. have come up with the Twin SVR [73].

9.2 Kernel Functions

Problem 4 has the variables ${\bf x}^{(i)}$ involved only through products of ${\varvec{\phi }()}$ and their derivatives. In SVR also, a kernel is defined as the inner product ${{\varPsi }\left( {\bf x},{\bf x}'\right) =\varvec{\phi }\left( {\bf x}\right) ^\top \varvec{\phi }\left( {\bf x}'\right) }.$ The “kernel trick” consists in not explicitely giving ${\phi \left( \right) }$ but directly working with the kernel ${{\varPsi }\left( ,\right) }.$ As was already stated in Sect. 6, any function with two inputs cannot be a kernel, it has to satisfy the Mercer’s conditions (see [120]) in order to be continuous, symmetric and positive definite. In the case of GSVR, the basis functions intervene in the following products:

$$\begin{aligned} \forall (i,j,k,l)\in {\llbracket 1,n_{s} \rrbracket }^2\times {\llbracket 1,n_{p}\rrbracket }^2,&\nonumber \\ \varvec{\phi }\left( {\bf x}^{(i)}\right) ^\top \varvec{\phi }\left( {\bf x}^{\left( {j}\right) }\right)&={\varPsi }\left( {{\bf x}^{(i)},{\bf x}^{\left( {j}\right) }}\right) ={\varPsi }_{ij}, \end{aligned}$$

(108)

$$\begin{aligned} \frac{\partial \varvec{\phi }}{\partial x_{k}}\left( {\bf x}^{(i)}\right) ^\top \varvec{\phi }\left( {\bf x}^{\left( {j}\right) }\right)&=\frac{\partial {\varPsi }}{\partial x_{k}^{(i)}}\left( {{\bf x}^{(i)},{\bf x}^{\left( {j}\right) }}\right) ={\varPsi }_{ij,k0}, \end{aligned}$$

(109)

$$\begin{aligned} \varvec{\phi }\left( {\bf x}^{(i)}\right) ^\top \frac{\partial \varvec{\phi }}{\partial x_{k}}\left( {\bf x}^{\left( {j}\right) }\right)&=\frac{\partial {\varPsi }}{\partial x_{k}^{(j)}}\left( {{\bf x}^{(i)},{\bf x}^{\left( {j}\right) }}\right) ={\varPsi }_{ij,0k}, \end{aligned}$$

(110)

$$\begin{aligned} \frac{\partial \varvec{\phi }}{\partial x_{k}}\left( {\bf x}^{(i)}\right) ^\top \frac{\partial \varvec{\phi }}{\partial x_{l}}\left( {\bf x}^{\left( {j}\right) }\right)&=\frac{\partial ^{2}{\varPsi }}{{\partial x_{k}^{(i)}}{\partial x_{l}^{(j)}}}\left( {{\bf x}^{(i)},{\bf x}^{\left( {j}\right) }}\right) ={\varPsi }_{ij,kl}. \end{aligned}$$

(111)

Therefore, in addition to the Mercer’s conditions, a kernel used in GSVR must be twice differentiable. Again, like in GRBF and GKRG, squared exponential or Matérn ($\nu>1)$ functions can be used as kernels for GSVR (a list of kernels has been given in Sect. 6). With the notations introduced in Eqs. (108)–(111), the matrices present in Problem 4 can now be detailed:

$$\begin{aligned} \left( \varvec{\Psi }_{r}\right) _{ij}&={\varPsi }_{ij}; \end{aligned}$$

(112)

$$\begin{aligned} \varvec{\Psi }_{rd}&=\begin{bmatrix} {\varPsi }_{11,10}&{\varPsi }_{11,20}&\dots&{\varPsi }_{11,n_{p}0}&{\varPsi }_{12,10}&\dots&{\varPsi }_{1n_{s},0n_{p}}\\ {\varPsi }_{21,10}&{\varPsi }_{21,20}&\dots&{\varPsi }_{21,n_{p}0}&{\varPsi }_{22,10}&\dots&{\varPsi }_{2n_{s},0n_{p}}\\ \vdots&&\ddots&\vdots&\vdots&\ddots&\vdots \\ {\varPsi }_{n_{s} 1,10}&{\varPsi }_{n_{s} 1,20}&\dots&{\varPsi }_{n_{s} 1,n_{p}0}&{\varPsi }_{n_{s} 2,10}&\dots&{\varPsi }_{n_{s} n_{s},0n_{p}}\\ \end{bmatrix}; \end{aligned}$$

(113)

$$\begin{aligned} \varvec{\Psi }_{dd}&=\begin{bmatrix} {\varPsi }_{11,11}&{\varPsi }_{11,12}&\dots&{\varPsi }_{11,1n_{p}}&{\varPsi }_{12,11}&\dots&{\varPsi }_{1n_{s},1n_{p}}\\ {\varPsi }_{11,21}&{\varPsi }_{11,22}&\dots&{\varPsi }_{11,2n_{p}}&{\varPsi }_{12,21}&\dots&{\varPsi }_{1n_{s},2n_{p}}\\ \vdots&&\ddots&\vdots&\vdots&\ddots&\vdots \\ {\varPsi }_{11,n_{p}1}&{\varPsi }_{11,n_{p}2}&\dots&{\varPsi }_{11,n_{p}n_{p}}&{\varPsi }_{12,n_{p}1}&\dots&{\varPsi }_{1n_{s},n_{p}n_{p}}\\ {\varPsi }_{21,11}&{\varPsi }_{21,12}&\dots&{\varPsi }_{21,1n_{p}}&{\varPsi }_{22,11}&\dots&{\varPsi }_{2n_{s},1n_{p}}\\ \vdots&&\ddots&\vdots&\vdots&\ddots&\vdots \\ {\varPsi }_{n_{s} 1,n_{p}1}&{\varPsi }_{n_{s} 1,n_{p}2}&\dots&{\varPsi }_{n_{s} 1,n_{p}n_{p}}&{\varPsi }_{n_{s} 2,n_{p}1}&\dots&{\varPsi }_{n_{s} n_{s},n_{p}n_{p}}\\ \end{bmatrix}. \end{aligned}$$

(114)

The sizes of matrices ${\varvec{\Psi }_{r}},$${\varvec{\Psi }_{rd}}$ and ${\varvec{\Psi }_{dd}}$ are $n_{s} \times n_{s} ,$${n_{s} \times n_{s} n_{p}}$ and ${n_{s} n_{p}\times n_{s} n_{p}},$ respectively. The full matrix of kernel functions and their derivatives at the sample points in Problem 4 is square and contains $2n_{s} (1+n_{p})$ rows.

9.3 Evaluating the GSVR Metamodel

Solving the Convex Constrained Quadratic problem for ${\varvec{\alpha} ^{\pm }}$ and ${\varvec{\lambda} ^{\pm }}$ allows to calculate the ${\vartheta }$’s from Eq. (101). The GSVR response estimate at a new point ${{\bf x}^{(0)}}$ is then given by:

$$\begin{aligned}&\forall {\bf {x}}^{(0)}\in {\mathcal {D}},\nonumber \\ {\tilde{{y}}}{\left( {{\bf {x}}^{(0)}}\right) }&=\mu +\sum _{i=1}^{n_{s}}\left( \alpha ^{+(i)}-\alpha ^{-(i)}\right) {\varPsi }\left( {{\bf {x}}^{(i)},{\bf {x}}^{\left( {0}\right) }}\right) \nonumber \\&\qquad +\sum _{k=1}^{n_{p}}\sum _{i=1}^{n_{s}}\left( \lambda _{k}^{+(i)}-\lambda _{k}^{-(i)}\right) \frac{\partial {\varPsi }}{\partial x_{k}^{(i)}}\left( {{\bf {x}}^{(i)},{\bf {x}}^{\left( {0}\right) }}\right) \nonumber \\&=\mu +\begin{bmatrix}{\varvec{{\alpha }}}^{+}\\ -{\varvec{{\alpha }}}^{-}\\ {\varvec{{\lambda }}}^{+}\\ -{\varvec{{\lambda }}}^{-}\end{bmatrix}^\top \begin{bmatrix}{{\varvec{\Psi }}}\left( {\bf {x}}^{(0)}\right) \\{{\varvec{\Psi }}}\left( {\bf{ x}}^{(0)}\right) \\ {{\varvec{\Psi }}}_d\left( {\bf {x}}^{(0)}\right) \\ {{\varvec{\Psi}}}_d\left( {\bf {x}}^{(0)}\right) \end{bmatrix}, \end{aligned}$$

(115)

where ${{{\varvec{\Psi }}}\left( {\bf x}^{(0)}\right) }$ and ${{{\varvec{\Psi }}}_d\left( {\bf x}^{(0)}\right) }$ are the vectors of kernels functions ${\left( {{\bf x}^{(i)},{\bf x}^{\left( {0}\right) }}\right) }$ and their derivatives, respectively. The derivative of the approximation given by the GSVR metamodel is obtained by differentiating Eq. (115). To be able to do this, the kernel function ${{\varPsi }}$ must be at least twice differentiable. The trend term, ${\mu },$ has not been calculated yet. Its value stems from the Karush–Kuhn–Tucker conditions for the Convex Constrained Problem 4: at a solution, the products between the dual variables and the associated constraints vanish:

$$\begin{aligned}&\forall (i,k)\in {\llbracket 1,n_{s} \rrbracket }\times {\llbracket 1,n_{p}\rrbracket },\nonumber \\&\alpha ^{+(i)}\left( \varepsilon _0+\xi ^{+(i)}-y{\left( {{\bf x}^{(i)}}\right) }+\varvec{\vartheta }^\top \varvec{\phi }{\left( {{\bf x}^{(i)}}\right) }+\mu \right) =0, \end{aligned}$$

(116)

$$\begin{aligned}&\alpha ^{-(i)}\left( \varepsilon _0+\xi ^{-(i)}+y{\left( {{\bf x}^{(i)}}\right) }-\varvec{\vartheta }^\top \varvec{\phi }{\left( {{\bf x}^{(i)}}\right) }-\mu \right) =0, \end{aligned}$$

(117)

$$\begin{aligned}&\lambda _{k}^{+(i)}\left( \varepsilon _k+\tau _{k}^{+(i)}+\varvec{\vartheta }^\top \frac{\partial \varvec{\phi }{\left( {{\bf x}^{(i)}}\right) }}{\partial x_{k}}-\frac{{\partial {y}}}{\partial {x}_{k}}\left( {\bf x}^{(i)}\right) \right) =0, \end{aligned}$$

(118)

$$\begin{aligned}&\lambda _{k}^{-(i)}\left( \varepsilon _k+\tau _{k}^{-(i)}-\varvec{\vartheta }^\top \frac{\partial \varvec{\phi }}{\partial x_{k}}\left( {\bf x}^{(i)}\right) +\frac{{\partial {y}}}{\partial {x}_{k}}\left( {\bf x}^{(i)}\right) \right) =0, \end{aligned}$$

(119)

$$\begin{aligned}&\xi ^{+(i)}\left( \dfrac{\varGamma _{0}}{n_{s}}-\alpha ^{+(i)}\right) =0, \end{aligned}$$

(120)

$$\begin{aligned}&\xi ^{-(i)}\left( \dfrac{\varGamma _{0}}{n_{s}}-\alpha ^{-(i)}\right) =0, \end{aligned}$$

(121)

$$\begin{aligned}&\tau _{k}^{+(i)}\left( \dfrac{\varGamma _{k}}{n_{s}}-\lambda _{k}^{+(i)}\right) =0, \end{aligned}$$

(122)

$$\begin{aligned}&\tau _{k}^{-(i)}\left( \dfrac{\varGamma _{k}}{n_{s}}-\lambda _{k}^{-(i)}\right) =0. \end{aligned}$$

(123)

Equations (116)–(117) and (120)–(121) are the same as those for the classical (non-gradient based) SVR for which the following conclusions hold:

from Eqs. (120) and (121), either ${\left( \varGamma _{0}/n_{s}-\alpha ^{\pm (i)}\right) =0}$ and ${\xi ^{\pm (i)}>0},$ or ${\xi ^{\pm (i)}=0}$ and ${\varGamma _{0}/n_{s}>\alpha ^{\pm (i)}}.$
from Eq. (101) for $\varvec{\vartheta },$ Eqs. (116) and (117), and because ${\tilde{{y}}{\left( {{\bf x}^{(i)}}\right) }}$ cannot be below and above ${y{\left( {{\bf x}^{(i)}}\right) }}$ at the same time so that, in terms of the dual variables ${\alpha ^{+(i)}\alpha ^{-(i)}=0},$${\mu }$ can be calculated:
$$\begin{aligned} \mu&=y{\left( {{\bf x}^{(i)}}\right) }-\varvec{\vartheta }^\top \varvec{\phi }{\left( {{\bf x}^{(i)}}\right) }+\varepsilon _0 \quad \text { if }\quad \alpha ^{+(i)}=0 \text { and } \alpha ^{-(i)}\in ]0,\varGamma _{0}/n_{s} [,\end{aligned}$$
(124)
$$\begin{aligned}&\text {or}\nonumber \\ \mu&=y{\left( {{\bf x}^{(i)}}\right) }-\varvec{\vartheta }^\top \varvec{\phi }{\left( {{\bf x}^{(i)}}\right) }-\varepsilon _0 \quad \text { if }\quad \alpha ^{-(i)}=0 \text { and } \alpha ^{+(i)}\in ]0,\varGamma _{0}/n_{s} [. \end{aligned}$$
(125)
The above bounds on ${\alpha ^{\pm (i)}}$ are enforced as constraints of the quadratic optimization problem.

9.4 Gradient-Enhanced $\nu $-SVR

The classical GSVR discussed so far requires choosing ${\varGamma _{k}}$ and $\varepsilon _k$ (${k\in {\llbracket 0,n_{p}\rrbracket }}),$ and is sometimes called $\varepsilon _k$-GSVR. $\varepsilon _k$ is typically taken as the standard deviation of the noise of the response data and its derivatives. Often though, there is no prior knowledge on an eventual noise on the response. Furthermore, if the $\varepsilon _k$’s are taken small, there will be many non-zero Lagrange multipliers (i.e., among $\varvec{{\alpha }}^{+},\varvec{{\alpha }}^{-},\varvec{{\lambda }}^{+},\varvec{{\lambda }}^{-}),$ in other terms there will be many support vectors, and the SVR model will lose some of its “sparsity” in the sense that the ability to drop some of the terms when evaluating the metamodel [Eq. (115)] will decrease.

$\nu $-GSVR is an alternative support vector regression model where the $\varepsilon _k$’s are no longer given but calculated. $\nu $-GSVR uses new scalars, $\nu _k\in [0,1]^{n_{p}+1},$ which act as upperbounds on the proportion of points that will be support vectors. This approach is inherited from $\nu $-SVM (support vector machines [123, 124]), and has been compared with classical $\varepsilon $-SVR in [125]. The larger the $\nu _k$’s, the more the approximation is required to approach the data points and made flexible (or “complex” in an information theory sense). A $\nu $-GSVR model solves the following optimization problem (compare with Problem 2):

Problem 5

(GSVR as optimization problem) Find ${(\varvec{\vartheta },\mu ,\varvec{\xi }^{+},\varvec{\xi }^{-},\varvec{\tau }^{+},\varvec{\tau }^{-},\varvec{\varepsilon }}$^{Footnote 2} ) that minimize

$$\begin{aligned} \dfrac{1}{2}\Vert \varvec{\vartheta }\Vert ^2&+\dfrac{\varGamma _{0}}{n_{s}}\left[ \nu _0\varepsilon _0+\sum _{i=1}^{n_{s}}\left( \xi ^{+(i)}+\xi ^{-(i)}\right) \right] \\ +&\sum _{k=1}^{n_{p}}\dfrac{\varGamma _{k}}{n_{s}}\left[ \nu _k\varepsilon _k+\sum _{i=1}^{n_{s}}\left( \tau _{k}^{+(i)}+\tau _{k}^{-(i)}\right) \right] , \end{aligned}$$

subject to

$$\begin{aligned} \forall (i,k)\in {\llbracket 1,n_{s} \rrbracket }\times {\llbracket 1,n_{p}\rrbracket },\quad {\left\{ \begin{array}{ll} y{\left( {{\bf x}^{(i)}}\right) }-\varvec{\vartheta }^\top \varvec{\phi }{\left( {{\bf x}^{(i)}}\right) }-\mu &{}\le \varepsilon _0+\xi ^{+(i)},\\ \varvec{\vartheta }^\top \varvec{\phi }{\left( {{\bf x}^{(i)}}\right) }+\mu -y{\left( {{\bf x}^{(i)}}\right) }&{}\le \varepsilon _0+\xi ^{-(i)},\\ \frac{{\partial {y}}}{\partial {x}_{k}}\left( {\bf x}^{(i)}\right) -\varvec{\vartheta }^\top \frac{\partial \varvec{\phi }}{\partial x_{k}}\left( {\bf x}^{(i)}\right) &{}\le \varepsilon _k+\tau _{k}^{+(i)},\\ \varvec{\vartheta }^\top \frac{\partial \varvec{\phi }}{\partial x_{k}}\left( {\bf x}^{(i)}\right) -\frac{{\partial {y}}}{\partial {x}_{k}}\left( {\bf x}^{(i)}\right) &{}\le \varepsilon _k+\tau _{k}^{-(i)},\\ \xi ^{+(i)},\xi ^{-(i)},\tau _{k}^{+(i)},\tau _{k}^{-(i)}&{}\ge 0,\\ \varepsilon _0,\varepsilon _k&{}\ge 0. \end{array}\right. } \end{aligned}$$

This problem is solved with a Lagrangian approach in a manner similar to that of $\varepsilon _k$-GSVR [54]. The $n_{p}+1$ new constraints on the positivity of the $\varepsilon _k$’s induce $n_{p}+1$ new Lagrange multipliers. The resulting quadratic dual optimization problem is similar to that given as Problem 4 with the additional Lagrange multipliers. The $\nu $-GSVR developped here has been implemented in the GRENAT Toolbox [12].

Figures 13 shows approximations of an analytical unidimensional function using $\nu $-SVR, $\nu $-GSVR and their derivatives [obtained by differentiating Eq. (115)]. The filled areas correspond to the $\varepsilon $-tube of both approximations. Just as for the previous GRBF and GKRG metamodels, the gradient-enhanced $\nu $-GSVR exhibits more accurate approximations than $\nu $-SVR does.

9.5 Tuning GSVR Parameters

The GSVR model involves the same parameters are the version without gradients, that is the ${\varepsilon _i}$ and ${\varGamma _{i}}$ internal parameters, plus the parameters of the kernels ${\varvec{\ell }}.$ Several works, summarized in [126], discuss how to tune these parameters for non-gradient SVR, either in the form of empirical choices or of methodologies. Both $\nu $-SVR and $\nu $-GSVR (see Sect. 9.4) help in choosing the $\varepsilon _i$ by replacing them by $\nu _i,$ a targeted proportion of points that are support vectors.

Algorithms have been proposed for determining the values of the aforementioned internal parameters using leave-one-out bounds [Eq. (69)] for support vector regression. Introduced in [127], these bounds have been completed with the Span concept [128] that currently stands as the most accurate bound. A method for the minimization of this bound is described and studied in [129]. The gradient of the leave-one-out bound for SVR with respect to the internal parameters has been calculated in [130] and used for tuning the parameters.

Recently, an extension of the Span bound to gradient-based SVR has been proposed in [54]: because the evaluation of the Span bound is computationally expensive, the authors have proposed to estimate the internal parameters of the kernel function as those of a gradient-enhanced RBF.

10 Applications and Discussion

10.1 Procedure for Comparing Performances

Comparisons of response-only and gradient-enhanced metamodels will be carried out for modeling the 5 and 10 dimensional ($n_{p}$=5 or 10) Rosenbrock and Schwefel functions which are respectively defined as,

$$\begin{aligned} \forall {\bf x}\in [-2,2]^{n_{p}},\quad&y{\left( {{\bf x}}\right) }=\sum _{i=1}^{n_{p}-1}\left[ 100\left( x_{i}^2-x_{i+1}\right) ^2+\left( x_{i}-1\right) ^2\right] ;\end{aligned}$$

(126)

$$\begin{aligned} \forall {\bf x}\in [-500,500]^{n_{p}},\quad&y{\left( {{\bf x}}\right) }=418.9829+\sum _{i=1}^{n_{p}}x_{i}\sin \left( \sqrt{|x_{i}|}\right). \end{aligned}$$

(127)

Rosenbrock’s function has only one basin of attraction but it is located in a long curved valley, that is the variables significantly interact with each other. Schwelfel’s function is highly multimodal and, worse, the function is challenging for many surrogates in that the frequency and the amplitude of the $\sin ()$ function that composes it changes through the design space, making the function non stationary. Schwefel’s difficulty is nevertheless limited because it is an additively decomposable and smooth function.

Between $n_{s} =5$ and $n_{s} =140$ points are generated by Improved Hypercube Sampling [103]. Each sampling and metamodel building is repeated 50 times. The global approximation quality of the metamodel is measured by computing the mean value and the standard deviation of the $\text {R}^2$ and $\mathtt {Q}_{3}$ criteria for the 50 metamodels at $n_v=1000$ validation points which are different from the sample points.

These metamodel quality criteria are now defined.

$$\begin{aligned} \text {R}^2&= \left( \dfrac{\sigma _{xy}}{\sigma _x\sigma _y}\right) ^2 \quad \text { with } \quad \sigma _{xy}=\dfrac{1}{n_v}\displaystyle \sum _{i=1}^{n_v}\left( y{\left( {{\bf x}^{(i)}}\right) }-\overline{{y}}\right) \left( \tilde{{y}}{\left( {{\bf x}^{(i)}}\right) }-\overline{\tilde{{y}}}\right) ~,\nonumber \\&\sigma _x=\sqrt{\dfrac{1}{n_v}\displaystyle \sum _{i=1}^{n_v}\left( y{\left( {{\bf x}^{(i)}}\right) }-\overline{{y}}\right) ^2} \text { , } \sigma _y=\sqrt{\dfrac{1}{n_v}\displaystyle \sum _{i=1}^{n_v}\left( \tilde{{y}}{\left( {{\bf x}^{(i)}}\right) }-\overline{\tilde{{y}}}\right) ^2}. \end{aligned}$$

(128)

As usual, the $\overline{\phantom {e}}$ symbol denotes the average. $\text {R}^2$ Pearson’s correlation coefficient, measures how well the surrogate predictions are correlated to the true responses. The closer $\text {R}^2$ is to 1, the better. $\mathtt {Q}_{3}$ is a normalized leave-one-out criterion [cf. Eq. (69)]:

$$\begin{aligned} \mathtt {Q}_{3} = \dfrac{1}{n_v}\displaystyle \sum _{i=1}^{n_v}e_i \text { with } e_i=\dfrac{\displaystyle \left( \tilde{{y}}_{-i}\left( {\bf x}^{(i)}\right) -y{\left( {{\bf x}^{(i)}}\right) }\right) ^2}{\displaystyle \max _{i\in {\llbracket 1,n_v\rrbracket }} y{\left( {{\bf x}^{(i)}}\right) }^2} \end{aligned}$$

(129)

The closer $\mathtt {Q}_{3}$ is to 0, the better the prediction accuracy of the surrogate.

The results presented next have been obtained on a computer equipped with an Intel^® Xeon^® E5-2680 v2 processor (20 cores at 2.8 GHz) and 128 Gb of volatile memory (DDR3-1866). The execution times given are CPU times which correspond to the time required for running the computer code on a single processor loaded at 100%.

10.2 Comparison of LS and GradLS Models

The response-only and gradient-enhanced least squares metamodels, LS and GradLS, are compared in details when approximating the 3 and 5 dimensional Rosenbrock’s functions. The least squares fit are carried out with polynomials of degrees $d^\circ =1$ to 10. Figures 15 and 16 show the results in terms of $\text {R}^2$ and $\mathtt {Q}_{3}$ (mean and standard deviation) for the 3 dimensional function and Figs. 17 and 18 show the results for the 5 dimensional function. The approximation quality improves as the mean of $\text {R}^2$ increases and its standard deviation simultaneously decreases or, as the mean of $\mathtt {Q}_{3}$ decreases and its standard deviation simultaneously decreases. In order to help understanding the outcome of the experiments, Fig. 14 summarizes both, (i) the number of terms in the polynomials, $n_t,$ which is a function of their degree as seen in Eq. (1) and, (ii) the number of equations in the least squares approximations [Eq. (11)] which is equal to $n_{s}$ and $n_{s} (n_{p}+1)$ for the LS and GradLS models, respectively. The number of polynomial terms are plotted, for each $n_{p}$ separately, with the continuous lines. The number of equations are plotted as marks, blue marks for the LS, and black marks with a dependency on $n_{p}$ for GradLS: the dotted lines give the number of equations in GradLS as a function of $n_{p}$ for each different $n_{s}.$ The GradLS formulation uses more equations than LS does thanks to the gradients. As long as there are more independent equations than terms in the polynomial, the solution (21) to the Mean Squares Error exists and is unique. In this case in our implementation a QR factorization of the ${{{\bf F}}^\top {{\bf F}}}$ is performed. On the contrary, if the degree of the polynomial is such that there are more polynomial terms than equations, the problem is ill-posed and the matrix ${{{\bf F}}^\top {{\bf F}}}$ in Eq. (21) is no longer invertible. In our implementation of least squares, solution unicity is then recovered by using the Moore–Penrose pseudo-inverse^{Footnote 3} of ${{\bf F}},$ written ${{{\bf F}}^+},$ i.e., by solving ${\hat{\varvec{\beta }}= {{\bf F}}^+ {\bf y}_{g}}.$ The portions of the solid lines that are below the marks associated to each $n_{s}$ indicate the polynomial degrees for which there are sufficiently many equations to solve the original least squares problem, in other terms, the polynomials which are fully defined by the data points.

General trends are visible in Figs. 15, 16, 17 and 18 concerning Rosenbrock’s function and in Figs. 19 and 20 concerning Schwefel’s function. They are particularly clear on the mean of $\mathtt {Q}_{3},$ and they are confirmed by $\text {R}^2.$ Not surprisingly, the quality of the approximations increases (i.e., the mean of $\mathtt {Q}_{3}$ diminishes) with the number of sample points $n_{s};$ at a given $n_{s}$ (larger than 5), the approximations improve from polynomial degree $d^\circ =1$ up to 4, and then degrade as the degrees of the polynomials go to 10. The explanation is that Rosenbrock’s function is a polynomial of degree 4. Below $d^\circ =4,$ the true function cannot be represented by the approximations. Beyond $d^\circ =4$ it can, but higher order terms of the polynomial need to be cancelled, which requires sufficiently many sample points to be accurately done. An estimate of the limit on the number of sample points is $n_{s}$ such that there are as many equations as polynomial terms $n_t.$ This limit is seen in Fig. 14: for LS in dimension $n_{p}=3,$ the lower bound on $n_{s}$ is 35, 84 and 120 for $d^\circ =4,$ 6, 7; a ridge where $\mathtt {Q}_{3}$ suddenly degrades crosses the upper right plot of Fig. 15 and this ridge closely follows these limits on $n_{s}.$ The same estimate (number of polynomial terms equal to number of equations) applied to GradLS yields $n_{s}$ limits $n_{p}+1$ smaller than those of LS: in Fig. 16, $n_{p}+1=4,$ and the $\mathtt {Q}_{3}$ degradation ridge follows ($d^\circ =4,$$ n_{s} =35/4\approx 9),$ ($d^\circ =6,$$n_{s} =84/4 = 21),$ ($d^\circ =7,$$n_{s} =120/4=30).$ In dimension 5, the $\mathtt {Q}_{3}$ degradation ridges in Figs. 17 and 18 can be interpreted in the same way. In all of the figures, the less regular variation of the quality indicators with $d^\circ $ for $n_{s}=5$ is because it is too small a sample size.

Schwefel’s function is not easily approximated with a polynomial. This is noticed in Figs. 19 and 20 which gather approximation performance indicators for the 3 dimensional version of the function and where the levels of $\text {R}^2$ and $\mathtt {Q}_{3}$ are respectively smaller and larger than those of the 3 dimensional Rosenbrock function. By comparing Figs. 19 and 20, it is also seen that the gradient-enhanced GradLS outperforms LS.

Two conclusions can already be drawn from these tests. Firstly, least squares approximations have a high performance domain characterized by a number of equation larger than the number of polynomial terms. This shows that the regularization performed by the pseudo-inverse does not produce as good least squares approximations as additional data points do. Secondly, because the gradient-enhanced least squares GradLS require sample sizes $n_{p}+1$ smaller than those of classical LS, they have a much larger high performance domain.

In addition, even outside of the high performance domain, it is observed in Figs. 15, 16, 17, 18, 19 and 20 that, at a given $n_{s}$ and $d^\circ ,$ GradLS consistently outperforms LS when approximating Rosenbrock’s function.

Figure 21 shows the CPU time needed to calculate the LS and GradLS metamodels as a function of the number of sample points $n_{s}$ for degrees of the polynomial approximation $d^\circ $ ranging from 2 to 10 and in dimensions $n_{p}$=3, 5 and 10. In 10 dimension, the polynomials are limited to the degrees $\{2,3,4,5,6\}$ to keep calculation times reasonable. In both LS and GradLS models, it is observed that the CPU time grows with $n_{s},$$n_{p}$ and $d^\circ $ and that the main factors for CPU time increase are $n_{p}$ and $d^\circ $ (a log scale is applied to the CPU axis). The effect of $n_{p}$ and $d^\circ $ is comparable in both metamodels so that the CPU times are close to each other. The CPU time of the gradient-enhanced GradLS grows slightly faster than that of LS with $n_{s}$ and in a manner independent of $n_{p}$ and $d^\circ ,$ which is mainly noticeable at low CPU times. It therefore turns out that enhancing least squares through the gradients is not as costly as one could fear. The reason is that, for high degree polynomials in high dimensions, the main numerical cost comes from the factorization of the $n_t \times n_t$ matrix ${{{\bf F}}^\top {{\bf F}}},$ which has $\mathcal O(n_t^3)$ complexity, and $n_t$ is a function of $n_{p}$ and $d^\circ $ but not of $n_{s}$ [cf. Eq. (1)]. The additional computation time of GradLS comes from the storage of the $( n_{s} (n_{p}+1)\times n_t)$ matrix ${{\bf F}}$ and its product with other matrices.

Note also in Fig. 21 that the geometric evolution of the CPU time sometimes exhibits a discontinuity. For example, for LS, $n_{p}=3$ and $d^\circ =6,$ there is a CPU time step at $n_{s} =84.$ These discontinuities correspond to the change in least squares algorithms when there are or not sufficient equations for determining the $n_t$ polynomial terms: when the problem is well-posed, a QR factorization of ${{{\bf F}}^\top {{\bf F}}}$ is performed, when the problem is ill-posed, it is the pseudo-inverse method that is called. For high degree polynomials the CPU evolution curves are continuous because the problem is always ill-posed and the pseudo-inverse is the only active algorithm.

10.3 Comparison of Kernel-Based Models

We now compare kernel-based metamodels that use or do not use gradients. These are variants of the RBF and KRG approaches. The SVR metamodels were not tested because of the large computational cost induced by tuning their internal parameters which does not allow repeated runs. The tested approaches are ordinary kriging (OK), radial basis functions (RBF), both of which do not utilize gradients, Indirect gradient-based ordinary kriging (InOK), Indirect gradient-based RBF (InRBF), gradient-enhanced ordinary cokriging (OCK), and gradient-enhanced RBF (GRBF). All these examples take Matérn 3/2 as kernel function.

Figures 22, 23, 24 and 25 show the values of the $\text {R}^2$ and $\mathtt {Q}_{3}$ criteria for the Rosenbrock and Schwefel test functions in 5 and 10 dimensions. It can be seen in Figs. 22 and 23 that all the surrogates tested provide a good approximation to the Rosenbrock function in 5 and 10 dimensions, as measured by $\text {R}^2$ tending to 1 and $\mathtt {Q}_{3} $ to 0 with $n_{s},$ which is likely because the function is smooth and unimodal. Nevertheless, the methods that directly or indirectly utilize gradients have a visible advantage, that is, OK and RBF converge much slower to good values of $\mathtt {Q}_{3}$ and $\text {R}^2$ as $n_{s}$ increases.

Schwefel’s function is the approximation target in Figs. 24 and 25. For modeling such a multimodal non stationary function, it is observed that directly accounting for gradients is a determining asset: the surrogates relying only on the response, OK and RBF, cannot approximate well the function, even when the number of sampled points $n_{s}$ is large (equal to 140); surrogates directly using gradients, OCK and GRBF, manage to represent well Schwefel’s function in both 5 and 10 dimensions. The performances of OCK and GRBF, are similar for the two test functions. The only noticeable difference is with Schwefel’s function in 10 dimensions where OCK converges slightly faster than GRBF.

On the average of Figs. 22, 23, 24 and 25, the indirect gradient-enhanced metamodels, InOK and InRBF, approximate the test functions with an accuracy that is between that of response-only and direct gradient-enhanced metamodels. A closer comparison of Figs. 22 and 24, and Figs. 23 and 25, suggests that InRBF performs better than InOK for the multimodal Schwefel function and vice versa for the smooth Rosenbrock’s function. Once again, the main difference between InRBF and InOK is that InRBF tunes its internal parameters by cross-validation when InOK tunes them by maximum likelihood. Cross-validation shows a better ability to deal with multimodality than maximum likelihood does.

To sum up, these results illustrate the advantage of direct gradient-enhanced metamodels in approximating non stationary, multimodal functions. They confirm other experiments carried out in the more complete study [57].

Figure 26 shows the CPU time taken for building the kernel-based metamodels for varying number of sample points and in dimensions 3, 5 and 10. SVR and GSVR are omitted in 10 dimensions because they take too much CPU time.

The building process includes the tuning of the metamodels’ internal parameters which is performed here with an Particle Swarm Optimizer.

The typical CPU times of the kernel methods in Fig. 26 are larger than those of the least squares methods reported in Fig. 21. Furthermore, kernel methods show a higher sensitivity to the number of sample points $n_{s}$ and a larger CPU penalty for including gradients in the model than least squares do. The main reason for the larger CPU time of kernel methods is the tuning of their internal parameters, which least squares do not do. The higher sensitivity of kernel methods to $n_{s}$ and, consequently, to the presence of gradients, comes from the inversion of the $n_{s} \times n_{s} $ or $n_{s} (n_{p}+1)\,\times\,n_{s} (n_{p}+1)$ (in the gradient-enhanced version) matrices $\varvec{\Psi }_g$ and ${\mathbf {C}} _c.$ Similarly, SVR and GSVR have a number of constraints that scales with $n_{s}$ and $n_{s}$ ($n_{p}+1),$ respectively. The advantage in terms of CPU time of the gradient-enhanced least squares methods should be assessed against an inferior approximation capacity as exemplified by the poor performance of GradLS on the Schwefel’s function.

Among kernel methods, SVR and GSVR are the most time consuming techniques while KRG and GKRG (here OK/OCK, i.e. ordinary kriging/cokriging) are faster to calculate.

10.4 Available Softwares for Gradient-Enhanced Metamodels

Despite the wide use and availability of metamodels that exclusively use simulation responses, ${{y}()},$ the more recent gradient-enhanced metamodels are only proposed in a few codes.

GRENAT [12], which stands for gradient enhanced approximation toolbox, is the toolbox that was used for generating all results and plots proposed in this review and in [59, 94, 95, 131, 132]. GRENAT is written in Matlab/Octave and follows the object oriented Matlab’s syntax. It allows building and exploiting response-only, and indirect and direct gradient-enhanced kriging, radial basis functions and support vector regression. It can be linked to the MultiDOE toolbox [133] which compiles many sampling techniques.

The ooDACE Toolbox [134, 135], also developed in object-oriented Matlab, has an implementation of cokriging that could accomodate accounting for gradients.

In addition to mathematical descriptions, Forrester et al. [21] also propose code samples for building gradient-enhanced metamodels.

10.5 Remarks About Missing Data and Higher Order Derivatives

In keeping with previous works [21, 136], the RBF, kriging and SVR metamodels and their gradient-enhanced versions that have been described in the review can readily be adapted for dealing with missing data: hybrid version of these metamodels can be considered by removing responses, components of gradient or full gradient at certain sample points. Components of the vector ${\bf y}_{g}$ are deleted and the corresponding terms in the linear combinations making the approximations (in generalized least squares, GRBF, GKRG) are removed from the equations. In the case of GSVR, the constraints in the model defining optimization problem for which there is no longer an observation are removed. With IDW, terms of the first order Taylor approximations ${Q_j({\bf x})}$ can be dropped, at the cost of loosing the corresponding gradient interpolation properties.

Deleting observations can even be a choice for minimizing the computational cost needed to build the metamodels and evaluate the responses and/or gradients. For example, when dealing with a function with known flat behavior in a part of the design space and a multimodal behavior in another part, accounting for gradient information is useful only in the latter. On the basis of relations like Eq. (26) for least squares, or Eq. (62) for radial basis functions, or Eq. (89) for cokriging, which all involve the inversion of a $n_{s} (n_{p}+1)$ by $n_{s} (n_{p}+1)$ covariance matrix, the computational complexity of each observation is at least cubic: assuming the number of operations required to invert a square matrix is slightly less than cubic, it will be at least cubic when multiplied by the number of repetitions of the inversion required for tuning internal parameters; then, accounting for all the gradients multiplies the complexity of the metamodels by at least $(n_{p}+1)^3.$ This metamodel complexity, although non negligible, will typically remain orders of magnitude smaller than the complexity of calculating the true response.

The formulations of gradient-enhanced RBF, cokriging and SVR can be also extended for taking into account higher order derivatives of the objectif function. Examples of formulations of Hessian-enhanced cokriging can be found in [21, 53]. In the case of SVR, an Hessian formulation may be based on the development proposed in [72]: the “prior knowledge”, which is added to the classical SVR formulation (Problem 2 in Sect. 9), consists in terms of the Hessian which are accounted for through new constraints.

11 Conclusions

We have reviewed the main surrogates (or metamodels) for approximating functions observed at a finite number of points that not only use function values but also their gradients. These surrogates are variations around the least squares methods, the Shepard weighting function, radial basis functions, kriging and support vector machines. Indirect methods, where the knowledge of the gradients produces new points to learn from, have also been covered. An effort was made to detail the logic and the formulations that led to these models. To the authors’ knowledge, the $\nu $-SVR formulation with gradients was given here for the first time. Another goal was to compare the metamodels. It was first done theoretically, in particular by casting all metamodels as linear combinations of functions chosen a priori and coefficients that depend on the observations. The comparison between metamodels was then substantiated by simple examples.

These examples, confirming other studies [9, 57, 132], show that exploiting gradient information is a determining advantage for approximating functions with locally changing behaviors. Including gradients in least squares methods comes at a negligible additional numerical cost. The more versatile kernel-based surrogates pay a numerical cost for also approximating gradients: all methods but Shepard weighting function have a complexity that scales at least with the cube of the number of observations, and each gradient at a point in a space of dimension $n_{p}$ adds $n_{p}$ observations.

The litterature on gradient-enhanced metamodels is recent but already rich. Today, many perspectives should be considered.

From a methodological point of view, there is a need for more robust, numerically less complex approaches that can account for large numbers (say, millions) of data points with their gradients, in higher dimensions (say, thousands). The current kernel methods can approximate a larger family of functions than least squares do, but they would not allow data sets beyond of the order of 1000 points because of bad conditionning issues and because of the rapidly growing number of operations. Beyond 10,000 points, computer memory would be an additional limitation. Recent works on Gaussian Processes have introduced strategies to deal with large number of points [137, 138] and high-dimensional problems [139, 140]. However these approaches remain currently limited to response-only data.

On the applicative side, surrogates that learn and predict gradients should contribute to progress in local and global sensitivity analysis, uncertainty propagation, local trust region and global surrogate-based optimization methods.

Notes

Recall that bold notations designate vectors. For example, ${\varvec{\xi }^{+}=\begin{bmatrix}\xi ^{+(1)}&\xi ^{+(2)}&\dots&\xi ^{+(n_{s})}\end{bmatrix}^\top}.$
As usual the boldface notation denotes the vector, ${ \varvec{\varepsilon }=[\varepsilon _0,\dots ,\varepsilon _{n_{p}}]^\top },$ which should not be mistaken with the errors in Sect. 4.
In the case where there are not enough equations, pseudo-inverse recovers solution unicity by choosing, out of the infinite number of solutions to ${{{\bf F}}\varvec{\beta }= y_{g}},$ the ${\varvec{\beta }}$ of minimal euclidean norm.

Abbreviations

$\nu $-GSVR:: $\nu $-Version of the gradient-enhanced support vector regression (surrogate model)
$\varepsilon $-SVR:: $\varepsilon $-Version of the support vector regression (surrogate model)
$\nu $-SVR:: $\nu $-Version of the support vector regression (surrogate model)
$\varepsilon _k$-GSVR:: $\varepsilon _k$-Version of the gradient-enhanced support vector regression (surrogate model)
BLUP:: Best linear unbiased predictor
EGO:: Efficient global optimization [33]
GBK:: Gradient-based kriging (surrogate model)
GEK:: Gradient-enhanced kriging (surrogate model)
GEUK:: Gradient-enhanced universal kriging (surrogate model)
GKRG:: Gradient-enhanced cokriging (surrogate model, same as GBK, GEK and GEUK)
GLS:: Generalized least square regression (surrogate model)
GradLS:: Gradient-enhanced least square regression (surrogate model)
GRBF:: Gradient-enhanced radial basis function (surrogate model)
GRENAT:: Gradient-enhanced approximation toolbox (Matlab/Octave’s toolbox [12])
GSVR:: Gradient-enhanced support vector regression (surrogate model)
IDW:: Inverse distance weighting method also called Shepard weighting method (surrogate model)
IHS:: Improved hypercube sampling (sampling technique [103])
InOK:: Indirect gradient-enhanced ordinary kriging (surrogate model)
InRBF:: Indirect gradient-enhanced radial basis function (surrogate model)
KRG:: Kriging (surrogate model)
LOO:: Leave-one-out
LS:: Least square regression (surrogate model)
MLS:: Moving least square regression (surrogate model)
MSE:: Mean square error (quality criterion)
MultiDOE:: Multiple design of experiments (Matlab/Octave’s toolbox [133])
OK:: Ordinary kriging (surrogate model)
OCK:: Gradient-enhanced ordinary cokriging (surrogate model)
RBF:: Radial basis function (surrogate model)
RSM:: Response surface methodology
SBAO:: Surrogate-based analysis and optimization [18]
SVM:: Support vector machine
SVR:: Support vector regression (surrogate model)
WLS:: Weigthed least square regression (surrogate model)
MARS:: Multivariate adaptive regression splines (surrogate model)
ANN:: Artificial neural network (surrogate model)
LHS:: Latin hypercube sampling (sampling technique)
OA:: Orthogonal array (sampling technique)
UD:: Uniform design (sampling technique)
RAAE:: Relative average absolute error (quality criterion)
RMSE:: Root mean square error (quality criterion)
RMAE:: Relative maximal absolute error (quality criterion)
LOOCV:: Leave-one-out cross-validation

References

Hughes GF (1968) On the mean accuracy of statistical pattern recognizers. IEEE Trans Inf Theory 14(1):55–63
Article Google Scholar
Lions J-L (1971) Optimal control of systems governed by partial differential equations. Springer, New York
Book MATH Google Scholar
Cacuci DG (1981) Sensitivity theory for nonlinear systems. I. Nonlinear functional analysis approach. J Math Phys 22(12):2794–2802
Article MathSciNet Google Scholar
Jameson A (1988) Aerodynamic design via control theory. J Sci Comput 3(3):233–260
Article MATH MathSciNet Google Scholar
Beda L, Korolev L, Sukkikh N, Frolova T (1959) Programs for automatic differentiation for the machine. Technical Report, BESM, Institute for Precise Mechanics and Computation Techniques, Academy of Science, Moscow
Wengert RE (1964) A simple automatic derivative evaluation program. Commun ACM 7:463–464
Article MATH Google Scholar
Griewank A (1989) On automatic differentiation. In: Mathematical programming: recent developments and applications. Kluwer, Amsterdam, pp 83–108
Paoletti V, Fedi M, Italiano F, Florio G, Ialongo S (2016) Inversion of gravity gradient tensor data: does it provide better resolution? Geophys J Int 205(1):192–202
Article Google Scholar
Qin P, Huang D, Yuan Y, Geng M, Liu J (2016) Integrated gravity and gravity gradient 3D inversion using the non-linear conjugate gradient. J Appl Geophys 126:52–73
Article Google Scholar
Lorentz R (2000) Multivariate hermite interpolation by algebraic polynomials: a survey. J Comput Appl Math 122(12):167–201. Numerical analysis in the 20th century, vol. II: interpolation and extrapolation
Lai M-J (2007) Multivariate splines for data fitting and approximation. In: Approximation theory XII: San Antonio, pp 210–228
Laurent L (2016) GRENAT (Matlab/Octave Toolbox). https://bitbucket.org/luclaurent/grenat
Box GEP, Wilson K (1951) On the experimental attainment of optimum conditions. J R Stat Soc B 13(1):1–45
MathSciNet MATH Google Scholar
Simpson TW, Mistree F, Korte JJ, Mauery TM (1998) Comparison of response surface and kriging models for multidisciplinary design optimization. In: AIAA paper 98-4758. 7th AIAA/USAF/NASA/ISSMO symposium on multidisciplinary analysis and optimization, no 98–4755
Giunta AA, Watson LT (1998) A comparison of approximation modeling techniques—polynomial versus interpolating models. In: 7th AIAA/USAF/NASA/ISSMO symposium on multidisciplinary analysis and optimization, no AIAA-98-4758, American Institute of Aeronautics and Astronautics
Jin R, Chen W, Simpson T (2000) Comparative studies of metamodeling techniques under multiple modeling criteria. In: 8th symposium on multidisciplinary analysis and optimization, no AIAA-2000-4801, American Institute of Aeronautics and Astronautics
Varadarajan S, Chen W, Pelka CJ (2000) Robust concept exploration of propulsion systems with enhanced model approximation capabilities. Eng Optim 32(3):309–334
Article Google Scholar
Queipo NV, Haftka RT, Shyy W, Goel T, Vaidyanathan R, Tucker PK (2005) Surrogate-based analysis and optimization. Prog Aerosp Sci 41(1):1–28
Article Google Scholar
Peter J, Marcelet M, Burguburu S, Pediroda V (2007) Comparison of surrogate models for the actual global optimization of a 2d turbomachinery flow. In: Proceedings of the 7th WSEAS international conference on simulation, modelling and optimization, pp 46–51. World Scientific and Engineering Academy and Society (WSEAS)
Marcelet M (2008) Etude et mise en oeuvre d’une méthode d’optimisation de forme couplant simulation numérique en aérodynamique et en calcul de structure. PhD thesis, École Nationale Supérieure d’Arts et Métiers
Forrester AIJ, Sóbester A, Keane AJ (2008) Engineering design via surrogate modelling: a practical guide, 1st edn. Wiley, Chichester
Book Google Scholar
Kim B-S, Lee Y-B, Choi D-H (2009) Comparison study on the accuracy of metamodeling technique for non-convex functions. J Mech Sci Technol 23:1175–1181
Article Google Scholar
Zhao D, Xue D (2010) A comparative study of metamodeling methods considering sample quality merits. Struct Multidiscip Optim 42(6):923–938
Article Google Scholar
McKay MD, Beckman RJ, Conover WJ (1979) Comparison of three methods for selecting values of input variables in the analysis of output from a computer code. Technometrics 21(2):239–245
MathSciNet MATH Google Scholar
Santner TJ, Williams BJ, Notz WI (2003) The design and analysis of computer experiments. Springer, New York
Book MATH Google Scholar
Fang KT, Li R, Sudjianto A (2005) Design and modeling for computer experiments. Chapman & Hall, Boca Raton
Book MATH Google Scholar
Franco J (2008) Planification d’expériences numériques en phase exploratoire pour la simulation des phénoménes complexes. PhD thesis, École Nationale Supérieure des Mines de Saint-étienne
Schonlau M (1997) Computer experiments and global optimization. PhD thesis, University of Waterloo
Sóbester A, Leary SJ, Keane AJ (2005) On the design of optimization strategies based on global response surface approximation models. J Glob Optim 33(1):31–59
Article MathSciNet MATH Google Scholar
Alexandrov NM, Dennis JE, Lewis RM, Torczon V (1998) A trust-region framework for managing the use of approximation models in optimization. Struct Optim 15(1):16–23
Article Google Scholar
Sasena MJ (2002) Flexibility and efficiency enhancements for constrained global design optimization with kriging approximations. PhD thesis, University of Michigan
Watson AG, Barnes RJ (1995) Infill sampling criteria to locate extremes. Math Geol 27(5):589–608
Article Google Scholar
Jones DR, Schonlau M, Welch WJ (1998) Efficient global optimization of expensive black-box functions. J Glob Optim 13(4):455–492
Article MathSciNet MATH Google Scholar
Matheron G (1970) La théorie des variables régionalisées et ses applications. Les Cahiers du Centre de Morphologie Mathématique de Fontainebleau, école Nationale des Mines de Paris, vol Fascicule 5
Wackernagel H (1995) Multivariate geostatistics: an introduction with applications. Springer, Berlin
Book MATH Google Scholar
Isaaks EH, Srivastava RM (1989) An introduction to applied geostatistics. Oxford University Press, New York
Google Scholar
Hoef JM, Cressie NAC (1993) Multivariable spatial prediction. Math Geol 25(2):219–240
Article MathSciNet MATH Google Scholar
Goovaerts P (1997) Geostatistics for natural resources evaluation. Oxford University Press, New York
Google Scholar
Morris MD, Mitchell TJ, Ylvisaker D (1993) Bayesian design and analysis of computer experiments: use of derivatives in surface prediction. Technometrics 35(3):243–255
Article MathSciNet MATH Google Scholar
Koehler JR, Owen AB (1996) Computer experiments. Handb Stat 13:261–308
Article MathSciNet MATH Google Scholar
Lewis RM (1998) Using sensitivity information in the construction of kriging models for design optimization. In: Proceedings of the 7th AIAA/USAF/NASA/ISSMO multidisciplinary analysis & optimization symposium, Saint Louis, Missouri
Arnaud M, Emery X (2000) Estimation et interpolation spatiale. Hermes Science Publications, Paris
Google Scholar
Chung H-S, Alonso JJ (2002) Using gradients to construct cokriging approximation models for high-dimensional design optimization problems. In: 40th AIAA aerospace sciences meeting & exhibit, no AIAA-2002-0317, American Institute of Aeronautics and Astronautics
Chung H-S, Alonso JJ (2002) Design of a low-boom supersonic business jet using cokriging approximation models. In: 9th AIAA/ISSMO symposium on multidisciplinary analysis and optimization, no AIAA-2002-5598, American Institute of Aeronautics and Astronautics
Leary SJ, Bhaskar A, Keane AJ (2004) A derivative based surrogate model for approximating and optimizing the output of an expensive computer simulation. J Glob Optim 30(1):39–58
Article MathSciNet MATH Google Scholar
Leary SJ, Bhaskar A, Keane AJ (2004) Global approximation and optimization using adjoint computational fluid dynamics codes. AIAA J 42(3):631–641
Article Google Scholar
Laurenceau J, Sagaut P (2008) Building efficient response surfaces of aerodynamic functions with kriging and cokriging. AIAA J 46(2):498–507
Article Google Scholar
Laurenceau J, Meaux M (2008) Comparison of gradient and response surface based optimization frameworks using adjoint method. In: 49th AIAA/ASME/ASCE/AHS/ASC structures, structural dynamics, and materials conference, 16th AIAA/ASME/AHS adaptive structures conference, 10th AIAA non-deterministic approaches conference, 9th AIAA Gossamer Spacecraft Forum, 4th AIAA multidisciplinary design optimization specialists conference, no AIAA-2008-1889, American Institute of Aeronautics and Astronautics
Dwight R, Han Z-H (2009) Efficient uncertainty quantification using gradient-enhanced kriging. In: 50th AIAA/ASME/ASCE/AHS/ASC structures, structural dynamics, and materials conference, no AIAA-2009-2276, American Institute of Aeronautics and Astronautics
Xuan Y, Xiang J, Zhang W, Zhang Y (2009) Gradient-based kriging approximate model and its application research to optimization design. Sci China E 52(4):1117–1124
Article MATH Google Scholar
Lockwood BA, Mavriplis DJ (2010) Parameter sensitivity analysis for hypersonic viscous flow using a discrete adjoint approach. AIAA Paper 447
March A, Willcox K, Wang Q (2010) Gradient-based multifidelity optimisation for aircraft design using Bayesian model calibration. Aeronaut J 115(1174):729–738
Article Google Scholar
Yamazaki W, Rumpfkeil M, Mavriplis D (2010) Design optimization utilizing gradient/hessian enhanced surrogate model. In: 28th AIAA applied aerodynamics conference, no AIAA-2010-4363, American Institute of Aeronautics and Astronautics
Bompard M (2011) Modéles de substitution pour l’optimisation globale de forme en aérodynamique et méthode locale sans paramétrisation. PhD thesis, Université de Nice Sophia Antipolis
Rumpfkeil MP, Yamazaki W, Mavriplis DJ (2011) A dynamic sampling method for kriging and cokriging surrogate models. In: 49th AIAA Aerospace Sciences Meeting including the New Horizons Forum and Aerospace Exposition, no AIAA-2011-883, American Institute of Aeronautics and Astronautics
Laurent L, Boucard P-A, Soulier B (2012) Gradient-enhanced metamodels and multiparametric strategies for designing structural assemblies. In: Topping BHV (ed) Proceedings of the eleventh international conference on computational structures technology, 4–7 September, no Paper 230, Civil-Comp Press, Stirlingshire, UK
Laurent L (2013) Stratégie multiparamétrique et métamodèles pour l’optimisation multiniveaux de structures. PhD thesis, École Normale Supérieure de Cachan, 61, avenue du Président Wilson, 94230 Cachan
Laurent L, Boucard P-A, Soulier B (2013) Combining multiparametric strategy and gradient-based surrogate model for optimizing structure assemblies. In: ISSMO (ed)10th world congress on structural and multidisciplinary optimization, Orlando, Florida, USA, 19–24 May
Laurent L (2014) Global optimisation on assembly problems using gradient-based surrogate model and multiparametric strategy. In: PhD Olympiad ECCOMAS, 11th World Congress on Computational Mechanics, Barcelona, Spain, 20–25 July 2014
Han Z-H, Görtz S, Zimmermann R (2013) Improving variable-fidelity surrogate modeling via gradient-enhanced kriging and a generalized hybrid bridge function. Aerosp Sci Technol 25(1):177–189
Article Google Scholar
Zimmermann R (2013) On the maximum likelihood training of gradient-enhanced spatial gaussian processes. SIAM J Sci Comput 35(6):A2554–A2574
Article MathSciNet MATH Google Scholar
Ulaganathan S, Couckuyt I, Dhaene T, Degroote J, Laermans E (2016) Performance study of gradient-enhanced kriging. Eng Comput 32(1):15–34
Article Google Scholar
Ulaganathan S, Couckuyt I, Ferranti F, Laermans E, Dhaene T (2015) Performance study of multi-fidelity gradient enhanced kriging. Struct Multidiscip Optim 51(5):1017–1033
Article Google Scholar
Zongmin W (1992) Hermite-Birkhoff interpolation of scattered data by radial basis functions. Approx Theory Appl 8(2):1–10
MathSciNet MATH Google Scholar
Kampolis IC, Karangelos EI, Giannakoglou KC (2004) Gradient-assisted radial basis function networks: theory and applications. Appl Math Model 28(2):197–209
Article MATH Google Scholar
Giannakoglou KC, Papadimitriou DI, Kampolis IC (2006) Aerodynamic shape design using evolutionary algorithms and new gradient-assisted metamodels. Comput Methods Appl Mech Eng 195(44–47):6312–6329
Article MATH Google Scholar
Ong Y-S, Lum K, Nair PB (2008) Hybrid evolutionary algorithm with hermite radial basis function interpolants for computationally expensive adjoint solvers. Comput Optim Appl 39(1):97–119
Article MathSciNet MATH Google Scholar
Lázaro M, Santamaría I, Pérez-Cruz F, Artés-Rodríguez A (2005) Support vector regression for the simultaneous learning of a multivariate function and its derivatives. Neurocomputing 69(1–3):42–61
Article Google Scholar
Jayadeva, Khemchandani R, Chandra S (2006) Regularized least squares twin svr for the simultaneous learning of a function and its derivative. In: Neural networks, 2006. IJCNN’06. International joint conference onIJCNN ’06
Bloch G, Lauer F, Colin G, Chamaillard Y (2008) Support vector regression from simulation data and few experimental samples. Inf Sci 178(20):3813–3827. Special issue on industrial applications of neural networks—10th engineering applications of neural networks 2007
Jayadeva, Khemchandani R, Chandra S (2008) Regularized least squares support vector regression for the simultaneous learning of a function and its derivatives. Inf Sci 178(17):3402–3414
Article MATH Google Scholar
Lauer F, Bloch G (2008) Incorporating prior knowledge in support vector regression. Mach Learn 70(1):89–118
Article Google Scholar
Khemchandani R, Karpatne A, Chandra S (2013) Twin support vector regression for the simultaneous learning of a function and its derivatives. Int J Mach Learn Cybern 4(1):51–63
Article Google Scholar
Renka RJ (1988) Multivariate interpolation of large sets of scattered data. ACM Trans Math Softw (TOMS) 14(2):139–148
Article MathSciNet MATH Google Scholar
Lauridsen S, Vitali R, van KeulenF, Haftka RT, Madsen JI (2002) Response surface approximation using gradient information, vol 4, p 5. In: Cheng et al (ed) Proceedings of 4th world congress on structural and multidisciplinary optimization, Dalian China, 4–8 June 2001
Kim C, Wang S, Choi KK (2005) Efficient response surface modeling by using moving least-squares method and sensitivity. AIAA J 43(11):2404–2411
Article Google Scholar
Breitkopf P, Naceur H, Rassineux A, Villon P (2005) Moving least squares response surface approximation: formulation and metal forming applications. Comput Struct 83(17–18):1411–1428. Advances in meshfree methods
van Keulen F, Liu B, Haftka RT (2000) Noise and discontinuity issues in response surfaces based on functions and derivatives. In: 41st structures, structural dynamics, and materials conference and exhibit, no AIAA-00-1363, American Institute of Aeronautics and Astronautics
Vervenne K, van Keulen F (2002) An alternative approach to response surface building using gradient information. In 43rd AIAA/ASME/ASCE/AHS/ASC structures, structural dynamics, and materials conference, no AIAA-2002-1584, American Institute of Aeronautics and Astronautics
van Keulen F, Vervenne K (2004) Gradient-enhanced response surface building. Struct Multidiscip Optim 27(5):337–351
Google Scholar
Liu W (2003) Development of gradient-enhanced kriging approximations for multidisciplinary design optimization. PhD thesis, University of Notre Dame, Indiana
Myers RH, Montgomery DC (1995) Response surface methodology: process and product optimization using designed experiments. Wiley, New York
MATH Google Scholar
Mazja V (1985) Sobolev spaces. Springer, New York
Book Google Scholar
Runge C (1901) Über empirische Funktionen und die Interpolation zwischen äquidistanten Ordinaten. Zeitschrift für Mathematik und Physik 46:224–243
MATH Google Scholar
Haftka RT (1993) Semi-analytical static nonlinear structural sensitivity analysis. AIAA J 31(7):1307–1312
Article MathSciNet MATH Google Scholar
Rasmussen CE (2006) Gaussian processes for machine learning. Adaptive computation and machine learning. MIT Press, Cambridge
MATH Google Scholar
Stander N (2001) The successive response surface method applied to sheet-metal forming. In: Proceedings of the first MIT conference on computational fluid and solid mechanics, pp 481–485, 12–15 June 2001
Lancaster P, Salkauskas K (1981) Surfaces generated by moving least squares methods. Math Comput 37(155):141–158
Article MathSciNet MATH Google Scholar
Zhou L, Zheng WX (2006) Moving least square ritz method for vibration analysis of plates. J Sound Vib 290(3–5):968–990
Article MATH Google Scholar
Häussler-Combe U, Korn C (1998) An adaptive approach with the element-free-Galerkin method. Comput Methods Appl Mech Eng 162(1–4):203–222
Article MathSciNet MATH Google Scholar
Shepard D (1968) A two-dimensional interpolation function for irregularly-spaced data. In: Proceedings of the 1968 23rd ACM national conference, pp 517–524. ACM
Mercer J (1909) Functions of positive and negative type, and their connection with the theory of integral equations. Philos Trans R Soc Lond A 209(441–458):415–446
Article MATH Google Scholar
Genton MG (2001) Classes of kernels for machine learning: a statistics perspective. J Mach Learn Res 2:299–312
MathSciNet MATH Google Scholar
Laurent L, Boucard P-A, Soulier B (2013) Generation of a cokriging metamodel using a multiparametric strategy. Comput Mech 51:151–169
Article MathSciNet MATH Google Scholar
Laurent L, Boucard P-A, Soulier B (2013) A dedicated multiparametric strategy for the fast construction of a cokriging metamodel. Comput Struct 124:61–73. Special Issue: KRETA
Stein ML (1999) Interpolation of spatial data: some theory for kriging. Springer, New York
Book MATH Google Scholar
Matérn B (1960) Spatial Variation (Lecture NotesStatist. 36). Springer, Berlin
Google Scholar
Abramowitz M, Stegun I (1964) Handbook of mathematical functions with formulas, graphs, and mathematical tables. In: Applied mathematics series, vol 55, no 1972. U.S. Government Printing Office
Lockwood BA, Anitescu M (2012) Gradient-enhanced universal kriging for uncertainty propagation. Nucl Sci Eng 170:168–195
Article Google Scholar
Hardy RL (1971) Multiquadric equations of topography and other irregular surfaces. J Geophys Res 76(8):1905–1915
Article Google Scholar
Powell MJ (1981) Approximation theory and methods. Cambridge University Press, Cambridge
Book MATH Google Scholar
Broomhead D, Lowe D (1988) Multivariable functional interpolation and adaptive networks. Complex Syst 2:321–355
MathSciNet MATH Google Scholar
Beachkofski B, Grandhi R (2002) Improved distributed hypercube sampling. In: 43rd AIAA/ASME/ASCE/AHS/ASC structures, structural dynamics, and materials conference, no AIAA 2002–1274
Schaback R (2007) A practical guide to radial basis functions. Book for teaching (Preprint)
Stone M (1974) Cross-validatory choice and assessment of statistical predictions. J R Stat Soc B 36:111–147
MathSciNet MATH Google Scholar
Geisser S (1975) The predictive sample reuse method with applications. J Am Stat Assoc 70(350):320–328
Article MATH Google Scholar
Bompard M, Peter J, Desideri J et al (2010) Surrogate models based on function and derivative values for aerodynamic global optimization. In: Proceedings of the V European conference on computational fluid dynamics ECCOMAS CFD 2010
Rippa S (1999) An algorithm for selecting a good value for the parameter c in radial basis function interpolation. Adv Comput Math 11:193–210
Article MathSciNet MATH Google Scholar
Soulier B, Courrier N, Laurent L, Boucard P-A (2015) Métamodèles à gradients et multiniveaux de fidélité pour l’optimisation d’assemblages. In: 12ème Coloque National en Calcul des Structures, Giens, France, 12-22 mai, CSMA
Ulaganathan S, Couckuyt I, Dhaene T, Degroote J, Laermans E (2016) High dimensional kriging metamodelling utilising gradient information. Appl Math Model 40:5256–5270
Article Google Scholar
Chauvet P (1999) Aide mémoire de la géostatistique linéaire. Cahiers de Géostatistique, école Nationale Supérieur des Mines de Paris, Centre de Géostatistique, Fontainebleau, vol Fascicule 2
Mardia K, Marshall R (1984) Maximum likelihood estimation of models for residual covariance in spatial regression. Biometrika 71(1):135
Article MathSciNet MATH Google Scholar
Warnes J, Ripley B (1987) Problems with likelihood estimation of covariance functions of spatial gaussian processes. Biometrika 74(3):640
Article MathSciNet MATH Google Scholar
Toal DJ, Bressloff N, Keane A, Holden C (2011) The development of a hybridized particle swarm for kriging hyperparameter tuning. Eng Optim 43:675–699
Article Google Scholar
Toal DJJ, Forrester AIJ, Bressloff NW, Keane AJ, Holden C (2009) An adjoint for likelihood maximization. Proc R Soc Lond A 465(2111):3267–3287
Article MathSciNet MATH Google Scholar
Laurent L (2013) Multilevel optimisation of structures using a multiparametric strategy and metamodels. PhD thesis, École Normale Supérieure de Cachan, 61, avenue du Président Wilson, 94230 Cachan
Vapnik VN (1995) The nature of statistical learning theory. Springer, New York
Book MATH Google Scholar
Vapnik VN, Chervonenkis AY (1974) Theory of pattern recognition. Nauka, Moscow (in Russian)
MATH Google Scholar
Smola AJ, Murata N, Schölkopf B, Müller K-R (1998) Asymptotically optimal choice of ε-loss for support vector machines. In: Niklasson, L.F.: Proceedings of the 8th international conference on artificial neural networks 1998. Vol 1: Skövde, Sweden, 2–4 September 1998, ICANN 98, pp 105–110. Springer
Smola AJ, Schölkopf B (2004) A tutorial on support vector regression. Stat Comput 14(3):199–222
Article MathSciNet Google Scholar
Cristianini N, Shawe-Taylor J (2000) An introduction to support vector machines: and other kernel-based learning methods. Cambridge University Press, New York
Book MATH Google Scholar
Platt J (1998) Sequential minimal optimization: a fast algorithm for training support vector machines. Technical Report MSR-TR-98-14, Microsoft Research
Schölkopf B, Bartlett PL, Smola AJ, Williamson R (1999) Shrinking the tube: a new support vector regression algorithm. In: Advances in neural information processing systems, vol 11, (Cambridge, MA, USA), pp. 330–336, Max-Planck-Gesellschaft, MIT Press
Schölkopf B, Smola AJ, Williamson RC, Bartlett PL (2000) New support vector algorithms. Neural Comput 12:1207–1245
Article Google Scholar
Chang C-C, Lin C-J (2002) Training v-support vector regression: theory and algorithms. Neural Comput 14:1959–1977
Article MATH Google Scholar
Cherkassky V, Ma Y (2002) Artificial neural networks—ICANN 2002: international conference Madrid, Spain, August 28–30, 2002 proceedings, chapter. Selection of meta-parameters for support vector regression, pp 687–693. Springer, Berlin Heidelberg
Vapnik VN (1998) Statistical learning theory. Wiley, New York
MATH Google Scholar
Vapnik VN, Chapelle O (2000) Bounds on error expectation for support vector machines. Neural Comput 12:2013–2036
Article Google Scholar
Chapelle O, Vapnik VN, Bousquet O, Mukherjee S (2002) Choosing multiple parameters for support vector machines. Mach Learn 46(1–3):131–159
Article MATH Google Scholar
Chang M-W, Lin C-J (2005) Leave-one-out bounds for support vector regression model selection. Neural Comput 17:1188–1222
Article MATH Google Scholar
Laurent L, Boucard P-A, Soulier B (2011) Fast multilevel optimization using a multiparametric strategy and a cokriging metamodel. In: Tsompanakis Y, Tolpping BHV (eds) Proceedings of the second international conference on soft computing technology in civil, structural and environmental engineering, 6–9 September, no Paper 50, Civil-Comp Press, Stirlingshire, UK
Laurent L, Soulier B, Le Riche R, Boucard P-A (2016) On the use of gradient-enhanced metamodels for global approximation and global optimization. In: VII European congress on computational methods in applied sciences and engineering, Hersonissos, Crete, Greece, June 5-10
Laurent L (2016) MultiDOE (Matlab/Octave Toolbox). https://bitbucket.org/luclaurent/multidoe
Couckuyt I, Dhaene T, Demeester P (2014) ooDACE toolbox: a flexible object-oriented kriging implementation. J Mach Learn Res 15(1):3183–3186
MATH Google Scholar
Ulaganathan S, Couckuyt I, Deschrijver D, Laermans E, Dhaene T (2015) A matlab toolbox for kriging metamodelling. Procedia Comput Sci 51:2708–2713
Article Google Scholar
Forrester AIJ, Sóbester A, Keane AJ (2006) Optimization with missing data. Proc R Soc A 462(2067):935–945
Article MATH Google Scholar
Fritz J, Neuweiler I, Nowak W (2009) Application of fft-based algorithms for large-scale universal kriging problems. Math Geo 41(5):509–533
Article MATH Google Scholar
Hensman J, Durrande N, Solin A (2016) Variational Fourier features for Gaussian processes. ArXiv e-prints
Bouhlel MA, Bartoli N, Otsmane A, Morlier J (2016) Improving kriging surrogates of high-dimensional design models by partial least squares dimension reduction. Struct Multidiscip Optim 53(5):935–952
Article MathSciNet MATH Google Scholar
Bouhlel MA, Bartoli N, Otsmane A, Morlier J (2016) An improved approach for estimating the hyperparameters of the kriging model for high-dimensional problems through the partial least squares method. Math Prob Eng 2016:11
Article MathSciNet MATH Google Scholar

Download references

Author information

Authors and Affiliations

Laboratoire de Mécanique des Structures et des Systèmes Couplés, Conservatoire National des Arts et Métiers, Case courrier 2D6R10, 2 rue Conté, 75003, Paris, France
Luc Laurent
École Nationale Supérieure des Mines de Saint-Étienne, 158, cours Fauriel, 42023, Saint-Étienne, France
Rodolphe Le Riche
LMT Cachan (ENS Cachan/CNRS/Université Paris-Saclay), 61, avenue du Président Wilson, 94235, Cachan, France
Bruno Soulier & Pierre-Alain Boucard

Authors

Luc Laurent
View author publications
You can also search for this author in PubMed Google Scholar
Rodolphe Le Riche
View author publications
You can also search for this author in PubMed Google Scholar
Bruno Soulier
View author publications
You can also search for this author in PubMed Google Scholar
Pierre-Alain Boucard
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Luc Laurent.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Laurent, L., Le Riche, R., Soulier, B. et al. An Overview of Gradient-Enhanced Metamodels with Applications. Arch Computat Methods Eng 26, 61–106 (2019). https://doi.org/10.1007/s11831-017-9226-3

Download citation

Received: 08 April 2017
Accepted: 22 April 2017
Published: 17 July 2017
Issue Date: 15 January 2019
DOI: https://doi.org/10.1007/s11831-017-9226-3

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

An Overview of Gradient-Enhanced Metamodels with Applications

Abstract

Similar content being viewed by others

Kernel Methods

Collinear Gradients Method for Minimizing Smooth Functions

Linear, Logistic, and Kernel Regression

1 Introduction

2 Build, Validate and Exploit a Surrogate Model

2.1 Surrogates and Their Building in a Nutshell

2.2 Main Notations

2.3 Introduction to Gradient-Enhanced Metamodels

3 Indirect Gradient-Based Metamodels

4 Least Squares Approaches

4.1 Non Weighted Least Squares (LS and GradLS)

4.2 Generalized Least Squares (GLS)

4.3 Moving Least Squares (MLS)

5 Shepard Weighting Function (IDW)

6 Kernel Functions for Gradient-Enhanced Kernel-Based Metamodels

7 Gradient-Enhanced Radial Basis Function (GRBF)

7.1 Building Process

7.2 RBF Kernels and Conditioning

7.3 Estimation of Parameters

7.4 Variance of a Stochastic Process Obtained with GRBF

8 Gradient-Enhanced Cokriging (GKRG)

8.1 Formulation of Gradient-Enhanced Cokriging

8.2 No Bias Condition

8.3 Formulation of the Variance

8.4 Constrained Optimization Problem for Cokriging Building

Problem 1

8.5 Covariance Structure

8.6 Summary of Cokriging Formulations and First Illustrations

8.7 Derivatives of the Cokriging Approximation

8.8 Estimation of the Cokriging Parameters

9 Gradient-Enhanced Support Vector Regression (GSVR)

9.1 Building Procedure

Problem 2

Problem 3

Problem 4

9.2 Kernel Functions

9.3 Evaluating the GSVR Metamodel

9.4 Gradient-Enhanced \(\nu \)-SVR

Problem 5

9.5 Tuning GSVR Parameters

10 Applications and Discussion

10.1 Procedure for Comparing Performances

10.2 Comparison of LS and GradLS Models

10.3 Comparison of Kernel-Based Models

10.4 Available Softwares for Gradient-Enhanced Metamodels

10.5 Remarks About Missing Data and Higher Order Derivatives

11 Conclusions

Notes

Abbreviations

References

Author information

Authors and Affiliations

Corresponding author

Ethics declarations

Conflict of interest

Rights and permissions

About this article

Cite this article

Share this article

Search

Navigation