Chapter 12 considers a spectral approach to UQ, namely Galerkin expansion, that is mathematically very attractive in that it is a natural extension of the Galerkin methods that are commonly used for deterministic PDEs and (up to a constant) minimizes the stochastic residual, but has the severe disadvantage that the stochastic modes of the solution are coupled together by a large system such as (12.15). Hence, the Galerkin formalism is not suitable for situations in which deterministic solutions are slow and expensive to obtain, and the deterministic solution method cannot be modified. Many so-called legacy codes are not amenable to such intrusive methods of UQ.

In contrast, this chapter considers non-intrusive spectral methods for UQ. These are characterized by the feature that the solution \(U(\theta )\) of the deterministic problem is a ‘black box’ that does not need to be modified for use in the spectral method, beyond being able to be evaluated at any desired point \(\theta\) of the probability space \((\varTheta,\mathcal{F},\mu )\). Indeed, sometimes, it is necessary to go one step further than this and consider the case of legacy data, i.e. an archive or data set of past input-output pairs \(\{(\theta _{n},U(\theta _{n}))\mid n = 1,\ldots,N\}\), sampled according to a possibly unknown or sub-optimal strategy, that is provided ‘as is’ and that cannot be modified or extended at all: the reasons for such restrictions may range from financial or practical difficulties to legal and ethical concerns.

There is a substantial overlap between non-intrusive methods for UQ and deterministic methods for interpolation and approximation as discussed in Chapter 8. However, this chapter additionally considers the method of Gaussian process regression (also known as kriging), which produces a probabilistic prediction of \(U(\theta )\) away from the data set, including a variance-based measure of uncertainty in that prediction.

13.1 Non-Intrusive Spectral Methods

One class of non-intrusive UQ methods is the family of non-intrusive spectral methods, namely the determination of approximate spectral coefficients, e.g. polynomial chaos coefficients, of an uncertain quantity U. The distinguishing feature here, in contrast to the approximate spectral coefficients calculated in Chapter 12, is that realizations of U are used directly. A good mental model is that the realizations of U will be used as evaluations in a quadrature rule, to determine an approximate orthogonal projection onto a finite-dimensional subspace of the stochastic solution space. For this reason, these methods are sometimes called non-intrusive spectral projection (NISP).

Consider a square-integrable stochastic process \(U: \varTheta \rightarrow \mathcal{U}\) taking values in a separable Hilbert space \(\mathcal{U}\), with a spectral expansion

$$\displaystyle{U =\sum _{k\in \mathbb{N}_{0}}u_{k}\varPsi _{k}}$$

of \(U \in L^{2}(\varTheta,\mu;\mathcal{U})\mathop{\cong}\mathcal{U}\otimes L^{2}(\varTheta,\mu; \mathbb{R})\) in terms of coefficients (stochastic modes) \(u_{k} \in \mathcal{U}\) and an orthogonal basis \(\{\varPsi _{k}\mid k \in \mathbb{N}_{0}\}\) of \(L^{2}(\varTheta,\mu; \mathbb{R})\). As usual, the stochastic modes are given by

$$\displaystyle{ u_{k} = \frac{\langle U\varPsi _{k}\rangle } {\langle \varPsi _{k}^{2}\rangle } = \frac{1} {\gamma _{k}}\int _{\varTheta }U(\theta )\varPsi _{k}(\theta )\,\mathrm{d}\mu (\theta ). }$$
(13.1)

If the normalization constants \(\gamma _{k}:=\langle \varPsi _{ k}^{2}\rangle \equiv \|\varPsi _{k}\|_{L^{2}(\mu )}^{2}\) are known ahead of time, then it remains only to approximate the integral with respect to μ of the product of U with each basis function \(\varPsi _{k}\); in some cases, the normalization constants must also be approximated. In any case, the aim is to use realizations of U to determine approximate stochastic modes \(\tilde{u}_{k} \in \mathcal{U}\), with \(\tilde{u}_{k} \approx u_{k}\), and hence an approximation

$$\displaystyle{\widetilde{U}:=\sum _{k\in \mathbb{N}_{0}}\tilde{u}_{k}\varPsi _{k} \approx U.}$$

Such a stochastic process \(\widetilde{U}\) is sometimes called a surrogate or emulator for the original process U.

Deterministic Quadrature. If the dimension of \(\varTheta\) is low and \(U(\theta )\) is relatively smooth as a function of \(\theta\), then an appealing approach to the estimation of \(\langle U\varPsi _{k}\rangle\) is deterministic quadrature. For optimal polynomial accuracy, Gaussian quadrature (i.e. nodes at the roots of μ-orthogonal polynomials) may be used. In practice, nested quadrature rules such as Clenshaw–Curtis may be preferable since one does not wish to have to discard past solutions of U upon passing to a more accurate quadrature rule. For multi-dimensional domains of integration \(\varTheta\), sparse quadrature rules may be used to partially alleviate the curse of dimension.

Note that, if the basis elements \(\varPsi _{k}\) are polynomials, then the normalization constant \(\gamma _{k}:=\langle \varPsi _{k}^{2}\rangle\) can be evaluated numerically but with zero quadrature error by Gaussian quadrature with at least k + 1 nodes, since \(\varPsi _{k}^{2}\) is a polynomial of degree 2k.
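
The following short sketch shows how such a deterministic quadrature can be used to compute approximate PC coefficients. It is an illustrative sketch only: the function name is made up, and a one-dimensional uniform germ on [−1, 1] with the Legendre basis is assumed purely for concreteness.

```python
import numpy as np
from numpy.polynomial.legendre import leggauss, Legendre

def nisp_gauss_legendre(u, K, n_nodes):
    """Approximate Legendre PC coefficients of u(xi), xi ~ Unif([-1, 1]),
    via Gauss-Legendre quadrature (illustrative sketch)."""
    nodes, weights = leggauss(n_nodes)        # rule for Lebesgue measure on [-1, 1]
    weights = weights / 2.0                   # renormalize to the probability measure Unif([-1, 1])
    U = np.array([u(x) for x in nodes])       # realizations of u at the quadrature nodes
    u_tilde = np.empty(K + 1)
    for k in range(K + 1):
        Psi_k = Legendre.basis(k)(nodes)      # Legendre polynomial P_k at the nodes
        gamma_k = np.sum(weights * Psi_k**2)  # <Psi_k^2>, exact provided n_nodes >= k + 1
        u_tilde[k] = np.sum(weights * Psi_k * U) / gamma_k
    return u_tilde

# Example: u(xi) = exp(xi); the coefficients decay rapidly since u is smooth.
coeffs = nisp_gauss_legendre(np.exp, K=5, n_nodes=10)
```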

Monte Carlo and Quasi-Monte Carlo Integration. If the dimension of \(\varTheta\) is high, or \(U(\theta )\) is a non-smooth function of \(\theta\), then it is tempting to resort to Monte Carlo approximation of \(\langle U\varPsi _{k}\rangle\). This approach is also appealing because the calculation of the stochastic modes \(u_{k}\) can be written as a straightforward (but often large) matrix-matrix multiplication. The problem with Monte Carlo methods, as ever, is the slow convergence rate of \(\sim (\mbox{ number of samples})^{-1/2}\); quasi-Monte Carlo quadrature may be used to improve the convergence rate for smoother integrands.
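
As an illustration of the matrix formulation just mentioned, the following sketch assembles the vanilla Monte Carlo estimates of all the stochastic modes of a vector-valued quantity with a single matrix-matrix product; the uniform germ on [−1, 1], the Legendre basis, and the toy outputs are assumptions made purely for illustration.

```python
import numpy as np
from numpy.polynomial.legendre import legvander

# Monte Carlo NISP as a matrix product (sketch): rows of Y are samples of a
# vector-valued quantity U(theta), and Psi[n, k] = Psi_k(theta_n).
rng = np.random.default_rng(0)
N, K = 10_000, 5
xi = rng.uniform(-1.0, 1.0, size=N)
Psi = legvander(xi, K)                         # N-by-(K+1) basis evaluations
Y = np.column_stack([np.exp(xi), np.sin(xi)])  # two scalar outputs of the 'black box'
gamma = 1.0 / (2 * np.arange(K + 1) + 1)       # exact <P_k^2> for Unif([-1, 1])
u_tilde = (Psi.T @ Y) / (N * gamma[:, None])   # (K+1)-by-2 matrix of approximate modes
```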

Connection with Linear Least Squares. There is a close connection between least-squares minimization and the determination of approximate spectral coefficients via quadrature (be it deterministic or stochastic). Let basis functions \(\varPsi _{0},\ldots,\varPsi _{K}\) and nodes \(\theta _{1},\ldots,\theta _{N}\) be given, and let

$$\displaystyle{ V:= \left [\begin{array}{*{10}c} \varPsi _{0}(\theta _{1}) &\cdots & \varPsi _{K}(\theta _{1})\\ \vdots & \ddots & \vdots \\ \varPsi _{0}(\theta _{N})&\cdots &\varPsi _{K}(\theta _{N}) \end{array} \right ] \in \mathbb{R}^{N\times (K+1)} }$$
(13.2)

be the associated Vandermonde-like matrix. Also, let \(Q(f):=\sum _{ n=1}^{N}w_{n}f(\theta _{n})\) be an N-point quadrature rule using the nodes \(\theta _{1},\ldots,\theta _{N}\), and let \(W:=\mathop{ \mathrm{diag}}\nolimits (w_{1},\ldots,w_{N}) \in \mathbb{R}^{N\times N}\). For example, if the \(\theta _{n}\) are i.i.d. draws from the measure μ on \(\varTheta\), then

$$\displaystyle{w_{1} = \cdots = w_{N} = \frac{1} {N}}$$

corresponds to the ‘vanilla’ Monte Carlo quadrature rule Q.

Theorem 13.1.

Given observed data \(y_{n}:= U(\theta _{n})\) for \(n = 1,\ldots,N\) , and \(\boldsymbol{y} = [y_{1},\ldots,y_{N}]\) , the following statements about approximate spectral coefficients \(\tilde{\boldsymbol{u}} = (\tilde{u}_{0},\ldots,\tilde{u}_{K})\) for \(\widetilde{U}:=\sum _{ k=0}^{K}\tilde{u}_{k}\varPsi _{k}\) are equivalent:

  1. (a)

    \(\widetilde{U}\) minimizes the weighted sum of squared residuals

    $$\displaystyle{R^{2}:=\sum _{ n=1}^{N}w_{ n}{\bigl |\widetilde{U}(\theta _{n}) - y_{n}\bigr |}^{2};}$$
  2. (b)

    \(\tilde{\boldsymbol{u}}\) satisfies

    $$\displaystyle{ V ^{\mathsf{T}}WV \tilde{\boldsymbol{u}} = V ^{\mathsf{T}}W\boldsymbol{y}^{\mathsf{T}}; }$$
    (13.3)
  3. (c)

    \(\widetilde{U} = U\) in the weak sense, tested against \(\varPsi _{0},\ldots,\varPsi _{K}\) using the quadrature rule Q, i.e., for \(k = 0,\ldots,K\) ,

    $$\displaystyle{Q{\bigl (\varPsi _{k}\widetilde{U}\bigr )} = Q{\bigl (\varPsi _{k}U\bigr )}.}$$

Proof.

Since

$$\displaystyle{V \tilde{\boldsymbol{u}} = \left [\begin{array}{*{10}c} \widetilde{U}(\theta _{1})\\ \vdots \\ \widetilde{U}(\theta _{N}) \end{array} \right ],}$$

the weighted sum of squared residuals \(\sum _{n=1}^{N}w_{n}{\bigl |\widetilde{U}(\theta _{n}) - y_{n}\bigr |}^{2}\) for approximate model \(\widetilde{U}\) equals \(\|V \tilde{\boldsymbol{u}} -\boldsymbol{ y}^{\mathsf{T}}\|_{W}^{2}\). By Theorem 4.28, this function of \(\tilde{\boldsymbol{u}}\) is minimized if and only if \(\tilde{\boldsymbol{u}}\) satisfies the normal equations (13.3), which shows that (a)\(\;\Longleftrightarrow\;\)(b). Explicit calculation of the left- and right-hand sides of (13.3) yields

$$\displaystyle{\sum _{n=1}^{N}w_{ n}\left [\begin{array}{*{10}c} \varPsi _{0}(\theta _{n})\widetilde{U}(\theta _{n})\\ \vdots \\ \varPsi _{K}(\theta _{n})\widetilde{U}(\theta _{n}) \end{array} \right ] =\sum _{ n=1}^{N}w_{ n}\left [\begin{array}{*{10}c} \varPsi _{0}(\theta _{n})y_{n}\\ \vdots \\ \varPsi _{K}(\theta _{n})y_{n} \end{array} \right ],}$$

which shows that (b)\(\;\Longleftrightarrow\;\)(c). □ 

Note that the matrix \(V ^{\mathsf{T}}WV\) on the left-hand side of (13.3) is

$$\displaystyle{V ^{\mathsf{T}}WV = \left [\begin{array}{*{10}c} Q(\varPsi _{0}\varPsi _{0}) &\cdots & Q(\varPsi _{0}\varPsi _{K})\\ \vdots & \ddots & \vdots \\ Q(\varPsi _{K}\varPsi _{0})&\cdots &Q(\varPsi _{K}\varPsi _{K}) \end{array} \right ] \in \mathbb{R}^{(K+1)\times (K+1)},}$$

i.e. is the Gram matrix of the basis functions \(\varPsi _{0},\ldots,\varPsi _{K}\) with respect to the quadrature rule Q’s associated inner product. Therefore, if the quadrature rule Q is one associated to μ (e.g. a Gaussian quadrature formula for μ, or a Monte Carlo quadrature with i.i.d. \(\theta _{n} \sim \mu\)), then \(V ^{\mathsf{T}}WV\) will be an approximation to the Gram matrix of the basis functions \(\varPsi _{0},\ldots,\varPsi _{K}\) in the L 2(μ) inner product. In particular, dependent upon the accuracy of the quadrature rule Q, we will have \(V ^{\mathsf{T}}WV \approx \mathop{\mathrm{diag}}\nolimits (\gamma _{0},\ldots,\gamma _{K})\), and then

$$\displaystyle{\tilde{u}_{k} \approx \frac{Q(\varPsi _{k}U)} {\gamma _{k}},}$$

i.e. \(\tilde{u}_{k}\) approximately satisfies the orthogonal projection condition (13.1) satisfied by \(u_{k}\).

In practice, when given nodes \(\{\theta _{n}\}_{n=1}^{N}\) that are not necessarily associated with some quadrature rule for μ, along with corresponding output values \(\{y_{n}:= U(\theta _{n})\}_{n=1}^{N}\), it is common to construct approximate stochastic modes, and hence an approximate spectral expansion \(\widetilde{U}\), by choosing \(\tilde{u}_{0},\ldots,\tilde{u}_{K}\) to minimize some weighted sum of squared residuals, i.e. according to (13.3).
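
A minimal sketch of this weighted least-squares construction is the following; it again assumes, for illustration only, a one-dimensional germ on [−1, 1] with the Legendre basis, and the function name and default weights are illustrative.

```python
import numpy as np
from numpy.polynomial.legendre import legvander

def nisp_least_squares(theta, y, K, weights=None):
    """Approximate Legendre PC modes by weighted least squares, i.e. by solving
    the normal equations (13.3).  Sketch only: assumes a 1D germ on [-1, 1]."""
    theta = np.asarray(theta, dtype=float)
    y = np.asarray(y, dtype=float)
    w = np.full(theta.size, 1.0 / theta.size) if weights is None else np.asarray(weights, dtype=float)
    V = legvander(theta, K)                    # Vandermonde-like matrix (13.2)
    sw = np.sqrt(w)[:, None]
    # Minimizing ||sqrt(W)(V u - y)||_2 is equivalent to V^T W V u = V^T W y.
    u_tilde, *_ = np.linalg.lstsq(sw * V, sw.ravel() * y, rcond=None)
    return u_tilde
```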

Conversely, one can engage in the design of experiments — i.e. the selection of \(\{\theta _{n}\}_{n=1}^{N}\) — to optimize some quantity derived from the matrix V; common choices (a short evaluation sketch is given after this list) include

  • A-optimality, in which the trace of \((V ^{\mathsf{T}}V )^{-1}\) is minimized;

  • D-optimality, in which the determinant of \(V ^{\mathsf{T}}V\) is maximized;

  • E-optimality, in which the least singular value of \(V ^{\mathsf{T}}V\) is maximized; and

  • G-optimality, in which the largest diagonal term in the orthogonal projection \(V (V ^{\mathsf{T}}V )^{-1}V ^{\mathsf{T}} \in \mathbb{R}^{N\times N}\) is minimized.
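
The following sketch computes these four criteria for a candidate Vandermonde-like matrix V. It is intended only as an illustration of the definitions above, not as a design-optimization algorithm; the function name is made up.

```python
import numpy as np

def design_criteria(V):
    """Alphabetic optimality criteria for a candidate Vandermonde-like matrix V
    (sketch).  Smaller A and G, and larger D and E, indicate 'better' designs."""
    M = V.T @ V                                   # information matrix V^T V
    Minv = np.linalg.inv(M)
    A = np.trace(Minv)                            # A-optimality: minimize trace((V^T V)^{-1})
    D = np.linalg.det(M)                          # D-optimality: maximize det(V^T V)
    E = np.linalg.svd(M, compute_uv=False).min()  # E-optimality: maximize least singular value
    H = V @ Minv @ V.T                            # orthogonal projection ('hat') matrix
    G = np.max(np.diag(H))                        # G-optimality: minimize largest diagonal entry
    return {"A": A, "D": D, "E": E, "G": G}
```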

Remark 13.2.

The Vandermonde-like matrix V from (13.2) is often ill-conditioned, i.e. has singular values of hugely different magnitudes. Often, this is a consequence of the behaviour of the normalization constants of the basis functions \(\{\varPsi _{k}\}_{k=0}^{K}\). As can be seen from Table 8.2, many of the standard families of orthogonal polynomials have normalization constants \(\|\psi _{k}\|_{L^{2}}\) that tend to 0 or to \(\infty \) as \(k \rightarrow \infty \). A tensor product system \(\{\psi _{\alpha }\}_{\alpha \in \mathbb{N}_{0}^{d}}\) of multivariate orthogonal polynomials, as in Theorem 8.25, might well have

$$\displaystyle{\liminf _{\vert \alpha \vert \rightarrow \infty }\|\psi _{\alpha }\|_{L^{2}} = 0\quad \mbox{ and }\quad \limsup _{\vert \alpha \vert \rightarrow \infty }\|\psi _{\alpha }\|_{L^{2}} = \infty;}$$

this phenomenon arises in, for example, the products of the Legendre and Hermite, or the Legendre and Charlier, bases. Working with orthonormal bases, or using preconditioners, alleviates the difficulties caused by such ill-conditioned matrices V.

Remark 13.3.

In practice, the following sources of error arise when computing non-intrusive approximate spectral expansions in the fashion outlined in this section:

  1. (a)

    discretization error comes about through the approximation of \(\mathcal{U}\) by a finite-dimensional subspace \(\mathcal{U}_{M}\), i.e. the approximation of the stochastic modes \(u_{k}\) by a finite sum \(u_{k} \approx \sum _{m=1}^{M}u_{km}\phi _{m}\), where \(\{\phi _{m}\mid m \in \mathbb{N}\}\) is some basis for \(\mathcal{U}\);

  2. (b)

    truncation error comes about through the truncation of the spectral expansion for U after finitely many terms, i.e. \(U \approx \sum _{k=0}^{K}u_{k}\varPsi _{k}\);

  3. (c)

    quadrature error comes about through the approximate nature of the numerical integration scheme used to find the stochastic modes; classical statistical concerns about the unbiasedness of estimators for expected values fall into this category. The choice of integration nodes contributes greatly to this source of error.

A complete quantification of the uncertainty associated with predictions of U made using a truncated non-intrusively constructed spectral stochastic model \(\widetilde{U}:=\sum _{ k=0}^{K}\tilde{u}_{k}\varPsi _{k}\) requires an understanding of all three of these sources of error, and there is necessarily some tradeoff among them when trying to give ‘optimal’ predictions for a given level of computational and experimental cost.

Remark 13.4.

It often happens in practice that the process U is not initially defined on the same probability space as the gPC basis functions, in which case some appropriate changes of variables must be used. In particular, this situation can arise if we are given an archive of legacy data values of U without the corresponding inputs. See Exercise 13.5 for a discussion of these issues in the example setting of Gaussian mixtures.

Example 13.5.

Consider again the simple harmonic oscillator

$$\displaystyle{\ddot{U}(t) = -\varOmega ^{2}U(t)}$$

with the initial conditions U(0) = 1, \(\dot{U}(0) = 0\). Suppose that \(\varOmega \sim \text{Unif}([0.8,1.2])\), so that \(\varOmega = 1.0 + 0.2\varXi\), where \(\varXi \sim \text{Unif}([-1,1])\) is the stochastic germ, with its associated Legendre basis polynomials. Figure 13.1 shows the evolution of the approximate stochastic modes for U, calculated using N = 1000 i.i.d. samples of \(\varXi\) and the least squares approach of Theorem 13.1. As in previous examples of this type, the forward solution of the ODE is performed using a symplectic integrator with time step 0.01.

Fig. 13.1 The degree-10 Legendre PC NISP solution to the simple harmonic oscillator equation of Example 13.5 with \(\varOmega \sim \text{Unif}([0.8,1.2])\).

Note that many standard computational linear algebra routines, such as Python's numpy.linalg.lstsq, will solve all the least squares problems of finding \(\{\tilde{u}_{k}(t_{i})\}_{k=0}^{K}\) for all time points \(t_{i}\) in a vectorized manner. That is, it is not necessary to call numpy.linalg.lstsq with matrix V and data \(\{U(t_{0},\omega _{n})\}_{n=1}^{N}\) to obtain \(\{\tilde{u}_{k}(t_{0})\}_{k=0}^{K}\), and then do the same for \(t_{1}\), etc. Instead, all the data \(\{U(t_{i},\omega _{n})\mid n = 1,\ldots,N;i \in \mathbb{N}_{0}\}\) can be supplied at once as a matrix, yielding a matrix \(\{\tilde{u}_{k}(t_{i})\mid k = 0,\ldots,K;i \in \mathbb{N}_{0}\}\).
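
The following sketch illustrates this vectorized use of numpy.linalg.lstsq for the setting of Example 13.5. For brevity, the exact solution \(\cos (\omega t)\) of the simple harmonic oscillator stands in for the numerical (e.g. symplectic) forward solver; in practice each row of U would be produced by the deterministic black-box solver.

```python
import numpy as np
from numpy.polynomial.legendre import legvander

# Vectorized least-squares NISP for Example 13.5 (sketch).
rng = np.random.default_rng(0)
N, K = 1000, 10
xi = rng.uniform(-1.0, 1.0, size=N)            # samples of the germ Xi
omega = 1.0 + 0.2 * xi                         # Omega ~ Unif([0.8, 1.2])

t = np.arange(0.0, 20.0, 0.01)                 # time grid
U = np.array([np.cos(w * t) for w in omega])   # exact SHO solution stands in for the ODE solver

V = legvander(xi, K)                           # N-by-(K+1) matrix (13.2)
# A single call recovers u_tilde[k, i], the k-th mode at time t_i, for all i at once.
u_tilde, *_ = np.linalg.lstsq(V, U, rcond=None)
```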

13.2 Stochastic Collocation

Collocation methods for ordinary and partial differential equations are another form of interpolation. The idea is to find a low-dimensional object — usually a polynomial — that approximates the true solution to the differential equation by exactly satisfying the differential equation at a selected set of points, called collocation points or collocation nodes. An important feature of the collocation approach is that the approximation is not constructed on a pre-defined stochastic subspace; rather, it is constructed by interpolation, and hence both the approximation and the approximation space are implicitly prescribed by the collocation nodes. As the number of collocation nodes increases, the space in which the solution is sought becomes correspondingly larger.

Example 13.6 (Collocation for an ODE).

Consider, for example, the initial value problem

$$\displaystyle\begin{array}{rcl} \dot{u}(t)& =& f(t,u(t)),\qquad \qquad \qquad \mbox{ for $t \in [a,b]$} {}\\ u(a)& =& u_{a}, {}\\ \end{array}$$

to be solved on an interval of time [a, b]. Choose n points

$$\displaystyle{a \leq t_{1} < t_{2} <\ldots < t_{n} \leq b,}$$

called collocation nodes. Now find a polynomial \(p(t) \in \mathbb{R}_{\leq n}[t]\) so that the ODE

$$\displaystyle{\dot{p}(t_{k}) = f(t_{k},p(t_{k}))}$$

is satisfied for \(k = 1,\ldots,n\), as is the initial condition \(p(a) = u_{a}\). For example, if n = 2, \(t_{1} = a\) and \(t_{2} = b\), then the coefficients \(c_{2},c_{1},c_{0} \in \mathbb{R}\) of the polynomial approximation

$$\displaystyle{p(t) =\sum _{ k=0}^{2}c_{ k}(t - a)^{k},}$$

which has derivative \(\dot{p}(t) = 2c_{2}(t - a) + c_{1}\), are required to satisfy

$$\displaystyle\begin{array}{rcl} \dot{p}(a) = c_{1}& =& f(a,p(a)) {}\\ \dot{p}(b) = 2c_{2}(b - a) + c_{1}& =& f(b,p(b)) {}\\ p(a) = c_{0}& =& u_{a} {}\\ \end{array}$$

i.e.

$$\displaystyle{p(t) = \frac{f(b,p(b)) - f(a,u_{a})} {2(b - a)} (t - a)^{2} + f(a,u_{ a})(t - a) + u_{a}.}$$

The above equation implicitly defines the final value p(b) of the collocation solution. This method is also known as the trapezoidal rule for ODEs, since the same solution is obtained by rewriting the differential equation as

$$\displaystyle{u(t) = u(a) +\int _{ a}^{t}f(s,u(s))\,\mathrm{d}s}$$

and approximating the integral on the right-hand side by the trapezoidal quadrature rule for integrals.
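
As a concrete illustration of Example 13.6, the following sketch solves the implicit relation \(p(b) = u_{a} + \tfrac{b-a}{2}{\bigl (f(a,u_{a}) + f(b,p(b))\bigr )}\) for a single step of the trapezoidal collocation scheme; the function name is made up, and a SciPy root-finder is assumed to be available.

```python
import numpy as np
from scipy.optimize import fsolve

def trapezoidal_collocation_step(f, a, b, u_a):
    """One step of the two-node collocation scheme of Example 13.6 (sketch):
    find p(b) satisfying p(b) = u_a + (b - a)/2 * (f(a, u_a) + f(b, p(b)))."""
    g = lambda pb: u_a + 0.5 * (b - a) * (f(a, u_a) + f(b, pb)) - pb
    return fsolve(g, u_a)[0]

# Example: du/dt = -u, u(0) = 1, one step of size 0.1 (exact value exp(-0.1) ~ 0.9048).
approx = trapezoidal_collocation_step(lambda s, u: -u, 0.0, 0.1, 1.0)
```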

It should be made clear at the outset that there is nothing stochastic about ‘stochastic collocation’, just as there is nothing chaotic about ‘polynomial chaos’. The meaning of the term ‘stochastic’ in this case is that the collocation principle is being applied across the ‘stochastic space’ (i.e. the probability space) of a stochastic process, rather than the space/time/space-time domain. That is, for a stochastic process U with known values \(U(\theta _{n})\) at known collocation points \(\theta _{1},\ldots,\theta _{N} \in \varTheta\), we seek an approximation \(\widetilde{U}\) such that

$$\displaystyle{\widetilde{U}(\theta _{n}) = U(\theta _{n})\quad \mbox{ for $n = 1,\ldots,N$.}}$$

There is, however, some flexibility in how to approximate \(U(\theta )\) for \(\theta \neq \theta _{1},\ldots,\theta _{N}\).

Example 13.7.

Consider, for example, the random PDE

$$\displaystyle\begin{array}{rcl} \mathcal{L}_{\theta }[U(x,\theta )]& =& 0\qquad \qquad \qquad \quad \mbox{ for $x \in \mathcal{X}$, $\theta \in \varTheta $,} {}\\ \mathcal{B}_{\theta }[U(x,\theta )]& =& 0\qquad \qquad \qquad \quad \mbox{ for $x \in \partial \mathcal{X}$, $\theta \in \varTheta $,} {}\\ \end{array}$$

where, for μ-a.e. \(\theta\) in some probability space \((\varTheta,\mathcal{F},\mu )\), the differential operator \(\mathcal{L}_{\theta }\) and boundary operator \(\mathcal{B}_{\theta }\) are well defined and the PDE admits a unique solution \(U(\cdot,\theta ): \mathcal{X} \rightarrow \mathbb{R}\). The solution \(U: \mathcal{X}\times \varTheta \rightarrow \mathbb{R}\) is then a stochastic process. We now let \(\varTheta _{M}:=\{\theta _{1},\ldots,\theta _{M}\} \subseteq \varTheta\) be a finite set of prescribed collocation nodes. The collocation problem is to find a collocation solution \(\widetilde{U}\), an approximation to the exact solution U, that satisfies

$$\displaystyle\begin{array}{rcl} \mathcal{L}_{\theta _{m}}{\bigl [\widetilde{U}{\bigl (x,\theta _{m}\bigr )}\bigr ]}& =& 0\qquad \qquad \qquad \quad \mbox{ for $x \in \mathcal{X}$,} {}\\ \mathcal{B}_{\theta _{m}}{\bigl [\widetilde{U}{\bigl (x,\theta _{m}\bigr )}\bigr ]}& =& 0\qquad \qquad \qquad \quad \mbox{ for $x \in \partial \mathcal{X}$,} {}\\ \end{array}$$

for \(m = 1,\ldots,M\).

Interpolation Approach. An obvious first approach is to use interpolating polynomials when they are available. This is easiest when the stochastic space \(\varTheta\) is one-dimensional, in which case the Lagrange basis polynomials of a given nodal set are an attractive choice of interpolation basis. As always, though, care must be taken to use nodal sets that will not lead to Runge oscillations; if there is very little a priori information about the process U, then constructing a ‘good’ nodal set may be a matter of trial and error. In general, the choice of collocation nodes is a significant contributor to the error and uncertainty in the resulting predictions.

Given the values \(U(\theta _{1}),\ldots,U(\theta _{N})\) of U at nodes \(\theta _{1},\ldots,\theta _{N}\) in a one-dimensional space \(\varTheta\), the (Lagrange-form polynomial interpolation) collocation approximation \(\widetilde{U}\) to U is given by

$$\displaystyle{\widetilde{U}(\theta ) =\sum _{ n=1}^{N}U(\theta _{ n})\ell_{n}(\theta ) =\sum _{ n=1}^{N}U(\theta _{ n})\prod _{\begin{array}{c}1\leq k\leq N \\ k\neq n \end{array}} \frac{\theta -\theta _{k}} {\theta _{n} -\theta _{k}}.}$$
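
A direct implementation of this Lagrange-form collocation surrogate might look as follows; this is an illustrative sketch (for large N a barycentric formula would be numerically preferable), and the function name is made up.

```python
import numpy as np

def lagrange_collocation(theta_nodes, U_nodes, theta):
    """Evaluate the Lagrange-form collocation surrogate at the points theta (sketch).
    theta_nodes: the N collocation nodes; U_nodes: the values U(theta_n)."""
    theta_nodes = np.asarray(theta_nodes, dtype=float)
    theta = np.atleast_1d(np.asarray(theta, dtype=float))
    result = np.zeros_like(theta)
    for n, (t_n, u_n) in enumerate(zip(theta_nodes, U_nodes)):
        ell_n = np.ones_like(theta)               # Lagrange basis polynomial ell_n(theta)
        for k, t_k in enumerate(theta_nodes):
            if k != n:
                ell_n *= (theta - t_k) / (t_n - t_k)
        result += u_n * ell_n
    return result
```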

Example 13.8.

Figure 13.2 shows the results of the interpolation-collocation approach for the simple harmonic oscillator equation considered earlier, again for ω ∈ [0.8, 1.2]. Two nodal sets \(\omega _{1},\ldots,\omega _{N} \in \mathbb{R}\) are considered: uniform nodes, and Chebyshev nodes. In order to make the differences between the two solutions more easily visible, only N = 4 nodes are used.

The collocation solution \(\widetilde{U}(\cdot,\omega _{n})\) at each of the collocation nodes \(\omega _{n}\) is the solution of the deterministic problem

$$\displaystyle\begin{array}{rcl} \frac{\mathrm{d}^{2}} {\mathrm{d}t^{2}}\widetilde{U}(t,\omega _{n})& =& -\omega _{n}^{2}\widetilde{U}(t,\omega _{ n}), {}\\ \widetilde{U}(0,\omega _{n})& =& 1, {}\\ \frac{\mathrm{d}} {\mathrm{d}t}\widetilde{U}(0,\omega _{n})& =& 0. {}\\ \end{array}$$

Away from the collocation nodes, \(\widetilde{U}\) is defined by polynomial interpolation: for each t, \(\widetilde{U}(t,\omega )\) is a polynomial in ω of degree at most N − 1 with prescribed values at the collocation nodes. Writing this interpolation in terms of Lagrange basis polynomials

$$\displaystyle{\ell_{n}(\omega;\omega _{1},\ldots,\omega _{N}):=\prod _{\begin{array}{c}1\leq k\leq N \\ k\neq n \end{array}} \frac{\omega -\omega _{k}} {\omega _{n} -\omega _{k}}}$$

yields

$$\displaystyle{\widetilde{U}(t,\omega ) =\sum _{ n=1}^{N}U(t,\omega _{ n})\ell_{n}(\omega ).}$$

As can be seen in Figure 13.2(c–d), both nodal sets have the undesirable property that \({\bigl |\widetilde{U}(t,\omega )\bigr |} >{\bigl |\widetilde{ U}(0,\omega )\bigr |} = 1\) for some t > 0 and ω ∈ [0.8, 1.2]. Therefore, for general ω, \(\widetilde{U}(t,\omega )\) is not a solution of the original ODE. However, as the discussion around Runge’s phenomenon in Section 8.5 would lead us to expect, the regions in (t, ω)-space where such unphysical values are attained are smaller with the Chebyshev nodes than with the uniformly distributed ones.

Fig. 13.2 Interpolation solutions for a simple harmonic oscillator with uncertain natural frequency ω, U(0, ω) = 1, \(\dot{U}(0,\omega ) = 0\). Both cases use four interpolation nodes. Note that the Chebyshev nodes produce smaller regions in (t, ω)-space with unphysical values \({\bigl |\widetilde{U}(t,\omega )\bigr |} > 1\).

The extension of one-dimensional interpolation methods to the multi-dimensional case can be handled in a theoretically straightforward manner using tensor product grids, similar to the constructions used in quadrature. In tensor product constructions, both the grid of interpolation points and the interpolation polynomials are products of the associated one-dimensional objects. Thus, in a product space \(\varTheta =\varTheta _{1} \times \ldots \times \varTheta _{d}\), we take nodes

$$\displaystyle\begin{array}{rcl} \theta _{1}^{1},\ldots,\theta _{ N_{1}}^{1}& \in & \varTheta _{ 1} {}\\ & \vdots & {}\\ \theta _{1}^{d},\ldots,\theta _{ N_{d}}^{d}& \in & \varTheta _{ d} {}\\ \end{array}$$

and construct a product grid of nodes \(\theta _{\boldsymbol{n}}:= (\theta _{n_{1}}^{1},\ldots,\theta _{n_{d}}^{d}) \in \varTheta\), where the multi-index \(\boldsymbol{n} = (n_{1},\ldots,n_{d})\) runs over \(\{1,\ldots,N_{1}\} \times \ldots \times \{1,\ldots,N_{d}\}\). The corresponding interpolation formula, in terms of Lagrange basis polynomials, is then

$$\displaystyle{\widetilde{U}(\theta ) =\sum _{ \boldsymbol{n}=(1,\ldots,1)}^{(N_{1},\ldots,N_{d})}U(\theta _{\boldsymbol{ n}})\prod _{i=1}^{d}\ell_{ n_{i}}{\bigl (\theta ^{i};\theta _{ 1}^{i},\ldots,\theta _{ N_{i}}^{i}\bigr )}.}$$
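
The following sketch evaluates such a tensor-product Lagrange interpolant at a single point \(\theta = (\theta ^{1},\ldots,\theta ^{d})\). It is an illustration of the formula above; the function name and the small two-dimensional example data are made up.

```python
import numpy as np

def tensor_lagrange(nodes_per_dim, values, theta):
    """Tensor-product Lagrange interpolation at a single point theta (sketch).
    nodes_per_dim: list of 1D node arrays [nodes_1, ..., nodes_d];
    values: array of shape (N_1, ..., N_d) with values[n] = U(theta_n)."""
    def lagrange_basis(nodes, x):
        L = np.ones(len(nodes))                   # L[n] = ell_n(x) for the 1D nodal set
        for n, t_n in enumerate(nodes):
            for k, t_k in enumerate(nodes):
                if k != n:
                    L[n] *= (x - t_k) / (t_n - t_k)
        return L
    result = np.asarray(values, dtype=float)
    for i, nodes in enumerate(nodes_per_dim):
        # contract the i-th dimension against the 1D Lagrange basis at theta[i]
        result = np.tensordot(lagrange_basis(np.asarray(nodes, dtype=float), theta[i]),
                              result, axes=(0, 0))
    return float(result)

# Made-up 2D example: a 3-by-2 grid of nodes and values.
nodes = [np.array([0.0, 0.5, 1.0]), np.array([0.0, 1.0])]
vals = np.array([[1.0, 2.0], [1.5, 2.5], [3.0, 4.0]])
u_interp = tensor_lagrange(nodes, vals, (0.25, 0.7))
```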

The problem with tensor product grids for interpolative collocation is the same as for tensor product quadrature: the curse of dimension, i.e. the large number of nodes needed to adequately resolve features of functions on high-dimensional spaces. The curse of dimension can be partially circumvented by using interpolation through sparse grids, e.g. those of Smolyak type.

Collocation for arbitrary unstructured sets of nodes — such as those that arise when inheriting an archive of ‘legacy’ data that cannot be modified or extended for whatever reason — is a notably tricky subject, essentially because it boils down to polynomial interpolation through an unstructured set of nodes. Even the existence of interpolating polynomials such as analogues of the Lagrange basis polynomials is not, in general, guaranteed.

Other Approximation Strategies. There are many other strategies for the construction of collocation solutions, especially in high dimension, besides polynomial bases. Common choices include splines and radial basis functions; see the bibliographic notes at the end of the chapter for references. Another popular method is Gaussian process regression, which is the topic of the next section.

13.3 Gaussian Process Regression

The interpolation approaches of the previous section are all deterministic in two senses: they assume that the values \(U(\theta _{n})\) are observed exactly, without error and with perfect reproducibility; they also assume that the correct form for an interpolated value \(\widetilde{U}(\theta )\) away from the nodal set is a deterministic function of the nodes and observed values. In many situations in the natural sciences and commerce, these assumptions are not appropriate. Instead, it is appropriate to incorporate an estimate of the observational uncertainties, and to produce probabilistic predictions; this is another area in which the Bayesian perspective is quite natural.

This section surveys one such method of stochastic interpolation, known as Gaussian process regression or kriging; as ever, the quite rigid properties of Gaussian measures hugely simplify the presentation. The essential idea is that we will model U as a Gaussian random field; the prior information on U consists of a mean field and a covariance operator, the latter often being given in practice by a correlation length; the observations of U at discrete points are then used to condition the prior Gaussian using Schur complementation, and thereby produce a posterior Gaussian prediction for the value of U at any other point.

Noise-Free Observations. Suppose for simplicity that we observe the values \(y_{n}:= U(\theta _{n})\) exactly, without any observational error. We wish to use the data \(\{(\theta _{n},y_{n})\mid n = 1,\ldots,N\}\) to make a prediction for the values of U at other points in the domain \(\varTheta\). To save space, we will refer to \(\theta ^{\text{o}} = (\theta _{1},\ldots,\theta _{N})\) as the observed points and \(y^{\text{o}} = (y_{1},\ldots,y_{N})\) as the observed values; together, \((\theta ^{\text{o}},y^{\text{o}})\) constitute the observed data or training set. By way of contrast, we wish to predict the value(s) \(y^{\text{p}}\) of U at point(s) \(\theta ^{\text{p}}\), referred to as the prediction points or test points. We will abuse notation and write \(m(\theta ^{\text{o}})\) for \((m(\theta _{1}),\ldots,m(\theta _{N}))\), and so on.

Under the prior assumption that U is a Gaussian random field with known mean \(m: \varTheta \rightarrow \mathbb{R}\) and known covariance function \(C: \varTheta \times \varTheta \rightarrow \mathbb{R}\), the random vector \((y^{\text{o}},y^{\text{p}})\) is a draw from a multivariate Gaussian distribution with mean \((m(\theta ^{\text{o}}),m(\theta ^{\text{p}}))\) and covariance matrix

$$\displaystyle{\left [\begin{array}{*{10}c} C(\theta ^{\text{o}},\theta ^{\text{o}}) &C(\theta ^{\text{o}},\theta ^{\text{p}}) \\ C(\theta ^{\text{p}},\theta ^{\text{o}})& C(\theta ^{\text{p}},\theta ^{\text{p}}) \end{array} \right ]}$$

(Note that in the case of N observed data points and one new value to be predicted, \(C(\theta ^{\text{o}},\theta ^{\text{o}})\) is an N × N block, \(C(\theta ^{\text{p}},\theta ^{\text{p}})\) is 1 × 1, and \(C(\theta ^{\text{p}},\theta ^{\text{o}}) = C(\theta ^{\text{o}},\theta ^{\text{p}})^{\mathsf{T}}\) is a 1 × N ‘row vector’.) By Theorem 2.54, the conditional distribution of \(U(\theta ^{\text{p}})\) given the observations \(U(\theta ^{\text{o}}) = y^{\text{o}}\) is Gaussian, with its mean and variance given in terms of the Schur complement

$$\displaystyle{S:= C(\theta ^{\text{p}},\theta ^{\text{p}})-C(\theta ^{\text{p}},\theta ^{\text{o}})C(\theta ^{\text{o}},\theta ^{\text{o}})^{-1}C(\theta ^{\text{o}},\theta ^{\text{p}})}$$

by

$$\displaystyle{U(\theta ^{\text{p}})\vert \theta ^{\text{o}},y^{\text{o}} \sim \mathcal{N}{\bigl (m(\theta ^{\text{p}}) + C(\theta ^{\text{p}},\theta ^{\text{o}})C(\theta ^{\text{o}},\theta ^{\text{o}})^{-1}(y^{\text{o}} - m(\theta ^{\text{o}})),S\bigr )}.}$$

This means that, in practice, a draw \(\widetilde{U}(\theta ^{\text{p}})\) from this conditioned Gaussian measure would be used as a proxy/prediction for the value \(U(\theta ^{\text{p}})\). Note that S depends only upon the locations of the interpolation nodes \(\theta ^{\text{o}}\) and \(\theta ^{\text{p}}\). Thus, if variance is to be used as a measure of the precision of the estimate \(\widetilde{U}(\theta ^{\text{p}})\), then it will be independent of the observed data y o.
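
A minimal sketch of this conditioning step is the following. The function name is made up, the prior mean m and covariance kernel C are assumed to be supplied as vectorized callables, and the optional noise_cov argument anticipates the noisy-observation case discussed next.

```python
import numpy as np

def gp_regress(theta_obs, y_obs, theta_pred, m, C, noise_cov=None):
    """Gaussian process regression (sketch).  m and C are the prior mean function
    and covariance kernel (vectorized callables); noise_cov, if given, is the
    observational noise covariance Gamma added to C(theta_o, theta_o)."""
    theta_obs = np.asarray(theta_obs, dtype=float)
    theta_pred = np.asarray(theta_pred, dtype=float)
    y_obs = np.asarray(y_obs, dtype=float)
    C_oo = C(theta_obs[:, None], theta_obs[None, :])    # C(theta_o, theta_o), N x N
    C_po = C(theta_pred[:, None], theta_obs[None, :])   # C(theta_p, theta_o), P x N
    C_pp = C(theta_pred[:, None], theta_pred[None, :])  # C(theta_p, theta_p), P x P
    if noise_cov is not None:
        C_oo = C_oo + noise_cov
    alpha = np.linalg.solve(C_oo, y_obs - m(theta_obs))
    post_mean = m(theta_pred) + C_po @ alpha
    post_cov = C_pp - C_po @ np.linalg.solve(C_oo, C_po.T)   # Schur complement S
    return post_mean, post_cov
```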

Noisy Observations. The above derivation is very easily adapted to the case of noisy observations, i.e. \(y^{\text{o}} = U(\theta ^{\text{o}})+\eta\), where η is some random noise vector. As usual, the Gaussian case is the simplest, and if \(\eta \sim \mathcal{N}(0,\varGamma )\), then the net effect is to replace each occurrence of “\(C(\theta ^{\text{o}},\theta ^{\text{o}})\)” above by “\(\varGamma +C(\theta ^{\text{o}},\theta ^{\text{o}})\)”. In terms of regularization, this is nothing other than quadratic regularization using the norm \(\|\cdot \|_{\varGamma ^{1/2}} =\|\varGamma ^{-1/2} \cdot \|\) on \(\mathbb{R}^{N}\).

One advantage of regularization, as ever, is that it sacrifices the interpolation property (exactly fitting the data) for better-conditioned solutions and even the ability to assimilate ‘contradictory’ observed data, i.e. \(\theta _{n} =\theta _{m}\) but \(y_{n}\neq y_{m}\). See Figure 13.3 for simple examples.

Example 13.9.

Consider \(\varTheta = [0,1]\), and suppose that the prior description of U is as a zero-mean Gaussian process with Gaussian covariance kernel

$$\displaystyle{C(\theta,\theta '):=\exp \left (-\frac{\vert \theta -\theta '\vert ^{2}} {2\ell^{2}} \right );}$$

\(\ell > 0\) is the correlation length of the process, and the numerical results illustrated in Figure 13.3 use \(\ell= \tfrac{1} {4}\).

  1. (a)

    Suppose that values \(y^{\text{o}} = 0.1\), 0.8 and 0.5 are observed for U at \(\theta ^{\text{o}} = 0.1\), 0.5, 0.9 respectively. In this case, the matrix \(C(\theta ^{\text{o}},\theta ^{\text{o}})\) and its inverse are approximately

    $$\displaystyle\begin{array}{rcl} C(\theta ^{\text{o}},\theta ^{\text{o}})& =& \left [\begin{array}{*{10}c} 1.000&0.278&0.006\\ 0.278 &1.000 &0.278 \\ 0.006&0.278&1.000\end{array} \right ] {}\\ C(\theta ^{\text{o}},\theta ^{\text{o}})^{-1}& =& \left [\begin{array}{*{10}c} 1.090 &-0.327& 0.084\\ -0.327 & 1.182 &-0.327 \\ 0.084 &-0.327& 1.090\end{array} \right ].{}\\ \end{array}$$

    Figure 13.3(a) shows the posterior mean field and posterior variance: note that the posterior mean interpolates the given data.

    Fig. 13.3 A simple example of Gaussian process regression/kriging in one dimension. The dots show the observed data points, the black curve the posterior mean of the Gaussian process \(\widetilde{U}\), and the shaded region the posterior mean ± one posterior standard deviation.

  2. (b)

    Now suppose that values \(y^{\text{o}} = 0.1\), 0.8, 0.9, and 0.5 are observed for U at \(\theta ^{\text{o}} = 0.1\), 0.5, 0.5, 0.9 respectively. In this case, because there are two contradictory values for U at \(\theta = 0.5\), we do not expect the posterior mean to be a function that interpolates the data. Indeed, the matrix \(C(\theta ^{\text{o}},\theta ^{\text{o}})\) has a repeated row and column:

    $$\displaystyle{C(\theta ^{\text{o}},\theta ^{\text{o}}) = \left [\begin{array}{*{10}c} 1.000&0.278&0.278&0.006\\ 0.278 &1.000 &1.000 &0.278 \\ 0.278&1.000&1.000&0.278\\ 0.006 &0.278 &0.278 &1.000\end{array} \right ],}$$

    and hence \(C(\theta ^{\text{o}},\theta ^{\text{o}})\) is not invertible. However, assuming that \(y^{\text{o}} = U(\theta ^{\text{o}}) + \mathcal{N}(0,\eta ^{2})\), with η > 0, restores well-posedness to the problem. Figure 13.3(b) shows the posterior mean and covariance field with the regularization η = 0.1. A code sketch reproducing both cases of this example is given after this list.
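
Assuming the gp_regress sketch from earlier in this section, the two cases of Example 13.9 might be reproduced as follows; the numerical values are those given above.

```python
import numpy as np

ell = 0.25
C = lambda a, b: np.exp(-np.abs(a - b) ** 2 / (2 * ell ** 2))  # Gaussian covariance kernel
m = lambda t: np.zeros_like(t)                                 # zero prior mean
theta_grid = np.linspace(0.0, 1.0, 201)

# (a) Noise-free observations: the posterior mean interpolates the data.
mean_a, cov_a = gp_regress([0.1, 0.5, 0.9], [0.1, 0.8, 0.5], theta_grid, m, C)

# (b) Contradictory observations at theta = 0.5: C(theta_o, theta_o) is singular,
#     so a noise term with eta = 0.1 is added to restore well-posedness.
eta = 0.1
Gamma = eta ** 2 * np.eye(4)
mean_b, cov_b = gp_regress([0.1, 0.5, 0.5, 0.9], [0.1, 0.8, 0.9, 0.5],
                           theta_grid, m, C, noise_cov=Gamma)
```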

Variations. There are many ‘flavours’ of the kriging method, essentially determined by the choice of the prior, and in particular the choice of the prior mean. For example, simple kriging assumes a known spatially constant mean field, i.e. \(\mathbb{E}[U(\theta )] = m\) for all \(\theta\).

A mild generalization is ordinary kriging, in which it is again assumed that \(\mathbb{E}[U(\theta )] = m\) for all \(\theta\), but m is not assumed to be known. This underdetermined situation can be rendered tractable by including additional assumptions on the form of \(\widetilde{U}(\theta ^{\text{p}})\) as a function of the data \((\theta ^{\text{o}},y^{\text{o}})\): one simple assumption of this type is a linear model of the form \(\widetilde{U}(\theta ^{\text{p}}) =\sum _{ n=1}^{N}w_{n}y_{n}\) for some weights \(w = (w_{1},\mathop{\ldots },w_{N}) \in \mathbb{R}^{N}\) — note well that this is not the same as linearly interpolating the observed data.

In this situation, as in the Gauss–Markov theorem (Theorem 6.2), the natural criteria of zero mean error (unbiasedness) and minimal squared error are used to determine the estimate of \(U(\theta ^{\text{p}})\): writing \(\widetilde{U}(\theta ^{\text{p}}) =\sum _{ n=1}^{N}w_{n}y_{n}\), the unbiasedness requirement that \(\mathbb{E}{\bigl [\widetilde{U}(\theta ^{\text{p}}) - U(\theta ^{\text{p}})\bigr ]} = 0\) implies that the weights w n sum to 1, and minimizing \(\mathbb{E}{\bigl [{\bigl (\widetilde{U}(\theta ^{\text{p}}) - U(\theta ^{\text{p}})\bigr )}^{2}\bigr ]}\) becomes the constrained optimization problem

$$\displaystyle\begin{array}{rcl} & & \ \mbox{ minimize: }C(\theta ^{\text{p}},\theta ^{\text{p}}) - 2w^{\mathsf{T}}C(\theta ^{\text{p}},\theta ^{\text{o}}) + w^{\mathsf{T}}C(\theta ^{\text{o}},\theta ^{\text{o}})w {}\\ & & \qquad \mbox{ among: }w \in \mathbb{R}^{N} {}\\ & & \mbox{ subject to: }\sum _{n=1}^{N}w_{ n} = 1. {}\\ \end{array}$$

By the method of Lagrange multipliers, the weight vector w and the Lagrange multiplier \(\lambda \in \mathbb{R}\) are given jointly as the solutions of

$$\displaystyle{ \left [\begin{array}{*{10}c} C(\theta ^{\text{o}},\theta ^{\text{o}})&1 \\ 1 &0 \end{array} \right ]\left [\begin{array}{*{10}c} w\\ \lambda \end{array} \right ] = \left [\begin{array}{*{10}c} C(\theta ^{\text{p}},\theta ^{\text{o}}) \\ 1 \end{array} \right ]. }$$
(13.4)

Even when \(C(\theta ^{\text{o}},\theta ^{\text{o}})\) is positive-definite, the matrix on the left-hand side of (13.4) is indefinite rather than positive-definite; it is, however, invertible in that case, and so it is possible to solve for \((w,\lambda )\). If \(C(\theta ^{\text{o}},\theta ^{\text{o}})\) is only positive semi-definite but the column vector on the right-hand side lies in the range of the matrix on the left-hand side, then \((w,\lambda )\) can still be obtained, e.g. as a least-squares solution.
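
The following sketch assembles and solves the system (13.4) for a single prediction point; the function name is made up, and a least-squares solve is used so that the sketch also tolerates a singular \(C(\theta ^{\text{o}},\theta ^{\text{o}})\) (e.g. repeated nodes).

```python
import numpy as np

def ordinary_kriging_weights(C_oo, C_po):
    """Solve the ordinary kriging system (13.4) for the weights w and the Lagrange
    multiplier lambda (sketch).  C_oo is C(theta_o, theta_o), N x N; C_po is the
    length-N vector C(theta_p, theta_o) for a single prediction point."""
    N = C_oo.shape[0]
    A = np.zeros((N + 1, N + 1))
    A[:N, :N] = C_oo
    A[:N, N] = 1.0                               # column of ones (constraint sum(w) = 1)
    A[N, :N] = 1.0
    b = np.append(C_po, 1.0)
    sol = np.linalg.lstsq(A, b, rcond=None)[0]   # least-squares solve tolerates a singular A
    w, lam = sol[:N], sol[N]
    return w, lam                                # the prediction is then w @ y_obs
```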

13.4 Bibliography

Non-intrusive methods for UQ, including non-intrusive spectral projection and stochastic collocation, are covered by Le Maître and Knio (2010, Chapter 3) and Xiu (2010, Chapter 7). A classic paper on interpolation using sparse grids is that of Barthelmann et al. (2000), and applications to UQ for PDEs with random input data have been explored by, e.g., Nobile et al. (2008a,b). Narayan and Xiu (2012) give a method for stochastic collocation on arbitrary sets of nodes using the framework of least orthogonal interpolation, following an earlier Gaussian construction of de Boor and Ron (1990). Yan et al. (2012) consider stochastic collocation algorithms with sparsity-promoting \(\ell^{1}\) regularizations. Buhmann (2003) provides a general introduction to the theory and practical usage of radial basis functions. A comprehensive introduction to splines is the book of de Boor (2001); for a more statistical interpretation, see, e.g., Smith (1979).

Kriging was introduced by Krige (1951) and popularized in geostatistics by Matheron (1963). See, e.g., Conti et al. (2009) for applications to the interpolation of results from slow or expensive computational methods. Rasmussen and Williams (2006) cover the theory and application of Gaussian processes to machine learning; their text also gives a good overview of the relationships between Gaussian processes and other modelling perspectives, including regularization, reproducing kernel Hilbert spaces, and support vector machines.

13.5 Exercises

Exercise 13.1.

Choose distinct nodes \(\theta _{1},\ldots,\theta _{N} \in \varTheta = [0,1]\) and corresponding values \(y_{1},\ldots,y_{N} \in \mathbb{R}\). Interpolate these data points in all the ways discussed so far in the text. In particular, interpolate the data using piecewise linear interpolation, using a polynomial of degree N − 1, and using Gaussian processes with various choices of covariance kernel. Plot the interpolants on the same axes to get an idea of their qualitative features.

Exercise 13.2.

Extend the analysis of the simple harmonic oscillator from Examples 13.5 and 13.8 to incorporate uncertainty in the initial condition, and calculate sensitivity indices with respect to the various uncertainties. Perform the same analyses with an alternative uncertainty model, e.g. the log-normal model of Example 12.6.

Exercise 13.3.

Perform the analogue of Exercise 13.2 for the Van der Pol oscillator

$$\displaystyle{\ddot{u}(t) -\mu (1 - u(t)^{2})\dot{u}(t) +\omega ^{2}u(t) = 0.}$$

Compare your results with those of the active subspace method (Example 10.20 and Figure 10.1).

Exercise 13.4.

Extend the analysis of Exercises 13.2 and 13.3 by treating the time step h > 0 of the numerical ODE solver as an additional source of uncertainty and error. Suppose that the numerical integration scheme for the ODE has a global truncation error of at most \(Ch^{r}\) for some C, r > 0, and so model the exact solution to the ODE as the computed solution plus a draw from \(\text{Unif}(-Ch^{r},Ch^{r})\). Using this randomly perturbed observational data, calculate approximate spectral coefficients for the process using the NISP scheme. (For more sophisticated randomized numerical schemes for ODEs and PDEs, see, e.g., Schober et al. (2014) and the works listed as part of the Probabilistic Numerics project http://www.probabilistic-numerics.org.)

Exercise 13.5.

It often happens that the process U is not initially defined on the same probability space as the gPC basis functions: in particular, this situation can arise if we are given an archive of legacy data values of U without corresponding inputs. In this situation, it is necessary to transform both sets of random variables to a common probability space. This exercise concerns an example implementation of this procedure in the case that U is a real-valued Gaussian mixture: for some weights \(w_{1},\ldots,w_{J} \geq 0\) summing to 1, means \(m_{1},\ldots,m_{J} \in \mathbb{R}\), and variances \(\sigma _{1}^{2},\ldots,\sigma _{J}^{2} > 0\), the Lebesgue probability density \(f_{U}: \mathbb{R} \rightarrow [0,\infty )\) of U is given as the following convex combination of Gaussian densities:

$$\displaystyle{ f_{U}(x):=\sum _{ j=1}^{J} \frac{w_{j}} {\sqrt{2\pi \sigma _{j }^{2}}}\exp \left (-\frac{(x - m_{j})^{2}} {2\sigma _{j}^{2}} \right ). }$$
(13.5)

Suppose that we wish to perform a Hermite expansion of U, i.e. to write \(U =\sum _{k\in \mathbb{N}_{0}}u_{k}\mathrm{He}_{k}(Z)\), where \(Z \sim \gamma = \mathcal{N}(0,1)\). The immediate problem is that U is defined as a function of \(\theta\) in some abstract probability space \((\varTheta,\mathcal{F},\mu )\), not as a function of z in the concrete probability space \((\mathbb{R},\mathcal{B}(\mathbb{R}),\gamma )\).

  1. (a)

    Let \(\varTheta =\{ 1,\ldots,J\} \times \mathbb{R}\), and define a probability measure μ on \(\varTheta\) by

    $$\displaystyle{\mu:=\sum _{ j=1}^{J}w_{ j}\delta _{j} \otimes \mathcal{N}(m_{j},\sigma _{j}^{2}).}$$

    (In terms of sampling, this means that draws (j, y) from μ are performed by first choosing \(j \in \{ 1,\ldots,J\}\) at random according to the weighting \(w_{1},\ldots,w_{J}\), and then drawing a Gaussian sample \(y \sim \mathcal{N}(m_{j},\sigma _{j}^{2})\).) Let \(P: \varTheta \rightarrow \mathbb{R}\) denote projection onto the second component, i.e. \(P(j,y):= y\). Show that the push-forward measure \(P_{{\ast}}\mu\) on \(\mathbb{R}\) is the Gaussian mixture (13.5).

  2. (b)

    Let \(F_{U}: \mathbb{R} \rightarrow [0,1]\) denote the cumulative distribution function (CDF) of U, i.e.

    $$\displaystyle{F_{U}(x):= \mathbb{P}_{\mu }[U \leq x] =\int _{ -\infty }^{x}f_{ U}(s)\,\mathrm{d}s.}$$

    Show that \(F_{U}\) is invertible, and that if \(V \sim \text{Unif}([0,1])\), then \(F_{U}^{-1}(V )\) has the same distribution as U.

  3. (c)

    Let Φ denote the CDF of the standard normal distribution γ. Show, by change of integration variables, that

    $$\displaystyle{ \langle U,\mathrm{He}_{k}\rangle _{L^{2}(\gamma )} =\int _{ 0}^{1}F_{ U}^{-1}(v)\mathrm{He}_{ k}(\varPhi ^{-1}(v))\,\mathrm{d}v. }$$
    (13.6)
  4. (d)

    Use your favourite quadrature rule for uniform measure on [0, 1] to approximately evaluate (13.6), and hence calculate approximate Hermite PC coefficients \(\tilde{u}_{k}\) for U.

  5. (e)

    Choose some m j and \(\sigma _{j}^{2}\), and generate N i.i.d. sample realizations \(y_{1},\ldots,y_{N}\) of U using the observation of part (a). Approximate F U by the empirical CDF of the data, i.e.

    $$\displaystyle{F_{U}(x) \approx \widehat{ F}_{\boldsymbol{y}}(x):= \frac{\vert \{1 \leq n \leq N\mid y_{n} \leq x\}\vert } {N}.}$$

    Use this approximation and your favourite quadrature rule for uniform measure on [0, 1] to approximately evaluate (13.6), and hence calculate approximate Hermite PC coefficients \(\tilde{u}_{k}\) for U. (This procedure, using the empirical CDF, is essentially the one that we must use if we are given only the data \(\boldsymbol{y}\) and no functional relationship of the form \(y_{n} = U(\theta _{n})\).)

  6. (f)

    Compare the results of parts (d) and (e).

Exercise 13.6.

Choose nodes in the square [0, 1]2 and corresponding data values, and interpolate them using Gaussian process regression with a radial covariance function such as \(C(x,x') =\exp (-\|x - x'\|^{2}/r^{2})\), with r > 0 being a correlation length parameter. Produce accompanying plots of the posterior variance field.