1 Introduction

Multilevel Monte Carlo sampling was first introduced in the context of parametric integration by Heinrich [18, 19]. Later, to address the weak approximation of stochastic differential equations (SDEs) in mathematical finance, Kebaier [24] introduced a two-level Monte Carlo technique in which a coarse-grid numerical approximation of an SDE was used as a control variate for a fine-grid numerical approximation, thus reducing the number of samples needed on the fine grid and decreasing the total computational burden. This idea was extended to a multilevel Monte Carlo (MLMC) method by Giles [12], who introduced a full hierarchy of discretizations with geometrically decreasing grid sizes. By optimally choosing the number of samples on each level, this MLMC method decreases the computational burden not merely by a constant factor, as standard control variate techniques do, but by improving the rate of the computational complexity: the cost to compute a solution with error tolerance \(\mathrm{{TOL}}> 0\) drops from \({\mathcal {O}(\mathrm{{TOL}}^{-3})} \) for the standard Euler-Maruyama Monte Carlo method to \({\mathcal {O}(\log {(\mathrm{{TOL}})}^2\mathrm{{TOL}}^{-2})} \), assuming that the work to generate a single realization is \({\mathcal {O}(\mathrm{{TOL}}^{-1})} \). For one-dimensional SDEs, the computational complexity of MLMC was further reduced to \({\mathcal {O}(\mathrm{{TOL}}^{-2})} \) by using the Milstein scheme [11]. Moreover, the same computational complexity can be achieved by using antithetic control variates with MLMC for multi-dimensional SDEs with smooth and piecewise smooth payoffs [16].

This standard MLMC method has since been extended and applied in a wide variety of contexts, including jump diffusions [31] and Partial Differential Equations (PDEs) with random coefficients [57, 13, 30]. It is shown in [30, Theorem 2.3] that the optimal convergence rate is similar to the previously mentioned complexity rates, but depends on the relation between the rate of variance convergence of the discretization method of the underlying equation and the work complexity of generating a single sample of the quantity of interest. In fact, in certain cases, the computational complexity attains the optimal rate, namely \({\mathcal {O}(\mathrm{{TOL}}^{-2})} \).

To achieve the optimal MLMC complexity rate and to obtain an estimate of the statistical error, sufficiently accurate estimates of the variance on each level must be obtained. Moreover, finding the optimal number of levels requires a sufficiently accurate estimate of the bias. As such, an algorithm is needed that finds these estimates without incurring a significant overhead on the estimation of the desired quantity of interest. In [12], Giles proposed an algorithm, henceforth referred to as Standard MLMC or SMLMC, that works by iteratively increasing the number of levels and using sample variance estimates across levels. Moreover, SMLMC uses an arbitrary, fixed accuracy splitting between the bias and the statistical error contributions. Other works [7, 14, 15, 29] present similar versions of this algorithm. We outline this algorithm in Sect. 3.

In Sect. 4, we propose a novel continuation type of MLMC algorithm that uses models for variance and weak convergence and for the average computational work per sample. We refer to this algorithm as Continuation MLMC or CMLMC. The CMLMC algorithm solves the given problem for a sequence of decreasing tolerances, which plays the role of a continuation parameter; the algorithm ends when the required error tolerance is satisfied. Solving this sequence of problems allows CMLMC to find increasingly accurate estimates of the bias and of the variances on each level, in addition to the quantity of interest, which is the goal of the computation. In each case, given the current estimates of the problem parameters, the optimal number of levels of the MLMC hierarchy is found. Moreover, we use a Bayesian inference approach to robustly estimate the various problem parameters. The CMLMC algorithm is able to relax the statistical error bound given the bias estimate, to achieve the optimal splitting between the two. These techniques improve the computational complexity of the CMLMC algorithm and decrease the variability of its running time.

The outline of this work is as follows: We start in Sect.  2 by recalling the MLMC method and the assumed models on work, and on weak and variance convergence. After introducing the algorithms in Sects. 3 and 4, Sect.  5 presents numerical examples, which include three-dimensional PDEs with random inputs and an Itô SDE. Finally, we finish by offering conclusions and suggesting directions for future work in Sect. 6.

2 Multilevel Monte Carlo

2.1 Problem setting

Let \(g(u)\) denote a real-valued functional of the solution, \(u\), of an underlying stochastic model. We assume that \(g\) is either a bounded linear functional or Lipschitz with respect to \(u\). Our goal is to approximate the expected value, \({\hbox {E}\left[ g(u)\right] }\), to a given accuracy \(\mathrm{{TOL}}\) and a given confidence level. We assume that individual outcomes of the underlying solution \(u\) and the evaluation of the functional \(g(u)\) are approximated by a discretization-based numerical scheme characterized by a mesh size, \(h\). The value of \(h\) will govern the weak and strong errors in the approximation of \(g(u)\), as we will see below. To motivate this setting, we now give two examples and identify the numerical discretizations, the discretization parameter, \(h\), and the corresponding rates of approximation. The first example is common in engineering applications like heat conduction and groundwater flow. The second example is a simple one-dimensional geometric Brownian motion with a European call option.

Example 2.1

Let \((\varOmega ,\mathcal {F},\mathbb {P})\) be a complete probability space and \(\mathcal {D}\) be a bounded convex polygonal domain in \(\mathbb {R}^d\). Find \(u: \mathcal {D} \times \varOmega \rightarrow \mathbb {R}\) that solves almost surely (a.s.) the following equation:

$$\begin{aligned} \begin{array}{ll} -\nabla \cdot \left( a(\mathbf {x}; \omega ) \nabla u(\mathbf {x}; \omega ) \right) = f(\mathbf {x}; \omega ) &{}\text { for } \mathbf {x} \in \mathcal {D}, \\ u(\mathbf {x}; \omega ) = 0 &{} \text { for } \mathbf {x} \in \partial \mathcal {D}, \end{array} \end{aligned}$$

where \(\omega \in \varOmega \) and the value of the diffusion coefficient and the forcing are represented by random fields, yielding a random solution. We wish to compute \({\hbox {E}\left[ g(u)\right] }\) for some deterministic functional \(g\) which is globally Lipschitz satisfying \(|g(u) - g(v)| \le G \Vert u-v\Vert _{H^1(\mathcal {D})}\) for some constant \(G>0\) and all \(u,v\in H^1(\mathcal {D})\). Following [30], we also make the following assumptions

  • \(a_{\min }(\omega ) = \min _{\mathbf {x} \in \mathcal D}a(\mathbf {x}; \omega ) > 0\) a.s. and \(1/a_{\min } \in L^p_\mathbb {P}(\varOmega )\), for all \(p \in (0, \infty )\).

  • \(a \in L^p_{\mathbb {P}}(\varOmega , C^1(\overline{\mathcal {D}}))\), for all \(p \in (0, \infty )\).

  • \(f \in L^{p^*}_{\mathbb {P}}(\varOmega , L^2(\mathcal {D}))\) for some \(p^* > 2\).

Here, \(L^p_{\mathbb {P}}(\varOmega , \mathcal {B})\) is the space of \(\mathcal {B}\)-valued random fields with a finite \(p\)’th moment of their \(\mathcal {B}\)-norm, where the \(p\)’th moment is with respect to the measure \(\mathbb {P}\). On the other hand, \(C^1(\overline{\mathcal {D}})\) is the space of continuously differentiable functions with the usual norm [6]. Note that with these assumptions, and since \(\mathcal {D}\) is bounded, one can show that \(\max _{\mathbf {x} \in \mathcal {D}}a(\mathbf {x}; \omega ) < \infty \) a.s. A standard approach to approximate the solution of the previous problem is to use Finite Elements on a regular triangulation. In such a setting, the parameter \(h>0\) refers to either the maximum element diameter or another characteristic length, and the corresponding approximate solution is denoted by \(u_h(\omega )\). For piecewise linear or piecewise \(d\)-multilinear continuous finite element approximations, and with the previous assumptions, it can be shown [30, Corollary 3.1] that asymptotically as \(h \rightarrow 0\):

  • \(\left| {\hbox {E}\left[ g(u)-g(u_h)\right] } \right| \lesssim Q_W\, h^2\) for a constant \(Q_W>0\).

  • \({\mathrm{{Var}}\left[ g(u)-g(u_h)\right] } \lesssim Q_S\, h^4\) for a constant \(Q_S > 0\).

Example 2.2

Here we study the weak approximation of Itô stochastic differential equations (SDEs),

$$\begin{aligned} du(t) = a(t,u(t))dt + b(t,u(t))dW(t), \qquad 0<t<T, \end{aligned}$$
(2.1)

where \(u(t;\omega )\) is a stochastic process in \(\mathbb {R}^d\), with randomness generated by a \(k\)-dimensional Wiener process with independent components, \(W(t;\omega )\), cf. [23, 26], and \(a(t,u)\in \mathbb {R}^d\) and \(b(t,u)\in \mathbb {R}^{d\times k}\) are the drift and diffusion fluxes, respectively. For any given sufficiently well-behaved function, \(g: \mathbb {R}^d\rightarrow \mathbb {R}\), our goal is to approximate the expected value, \({\hbox {E}\left[ g(u(T))\right] }\). A typical application is to compute option prices in mathematical finance, cf. [17, 22], and other related models based on stochastic dynamics. When one uses a standard Euler-Maruyama (Forward Euler) method based on uniform time steps of size \(h\) to approximate (2.1), the following rates of approximation hold: \(|{\hbox {E}\left[ g(u(T))-g(u_h(T))\right] }| = Q_W \,h + {o (h)} \) and \({\hbox {E}\left[ (g(u(T))-g(u_h(T)))^2\right] } = Q_S \,h + {o (h)} ,\) for some constants, \(0<Q_W,Q_S<\infty \), different from the constants of the previous example. For suitable assumptions on the functions \(a\), \(b\) and \(g\), we refer to [25, 28].
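To make the discretization concrete, the following sketch (in Python, purely illustrative; all names are ours) generates one Euler-Maruyama sample of \(g(u_h(T))\) for the geometric Brownian motion used later in Ex.3, i.e. with drift \(a(t,u)=0.05u\), diffusion \(b(t,u)=0.2u\), \(u(0)=1\), \(T=1\) and payoff \(g(u)=10\max (u(1)-1,0)\):

```python
import numpy as np

def euler_maruyama_payoff(h, rng, u0=1.0, T=1.0, drift=0.05, vol=0.2):
    """One Euler-Maruyama path of du = 0.05*u dt + 0.2*u dW, returning g(u_h(T))."""
    n_steps = int(round(T / h))
    u = u0
    for _ in range(n_steps):
        dW = rng.normal(0.0, np.sqrt(h))    # Wiener increment over one step of size h
        u += drift * u * h + vol * u * dW   # Forward Euler update
    return 10.0 * max(u - 1.0, 0.0)         # payoff g(u) = 10 max(u(1) - 1, 0) from Ex.3

rng = np.random.default_rng(0)
samples = [euler_maruyama_payoff(h=2.0**-5, rng=rng) for _ in range(10**4)]
print(np.mean(samples))  # crude single-level Monte Carlo estimate of E[g(u_h(T))]
```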

To avoid cluttering the notation, we omit the reference to the underlying solution from now on, simply denoting the quantity of interest by \(g\). Following the standard MLMC approach, we assume, for any given non-negative integer \(L \in \mathbb {N}\), that we have a hierarchy of \(L+1\) meshes defined by a decreasing sequence of mesh sizes \(\{h_\ell \}_{\ell =0}^L\) where \({h_\ell =h_0 \beta ^{-\ell }}\) for some \(h_0>0\) and a constant integer \(\beta > 1\). We denote the resulting approximation of \(g\) using mesh size \(h_\ell \) by \(g_\ell \), or by \(g_\ell (\omega )\) when we want to stress the dependence on an outcome of the underlying random model. Using the following notation:

$$\begin{aligned} G_\ell (\omega )&= {\left\{ \begin{array}{ll} g_0(\omega ) &{} \text {if } \ell =0, \\ g_\ell (\omega ) - g_{\ell -1}(\omega ) &{} \text {if } \ell >0, \end{array}\right. } \end{aligned}$$

the expected value of the finest approximation, \(g_L\), can be expressed as

$$\begin{aligned} {\hbox {E}\left[ g_L\right] }&= \sum _{\ell =0}^L {\hbox {E}\left[ G_{\ell }\right] }, \end{aligned}$$

where the MLMC estimator is obtained by replacing the expected values in the telescoping sum by sample averages. We denote the sample average of \(G_\ell \) by \(\mathcal {A}_\ell \), defined as

$$\begin{aligned} \mathcal {A}_\ell = \frac{1}{M_\ell } \sum _{m=1}^{M_\ell } G_{\ell }(\omega _{\ell ,m}). \end{aligned}$$

Each sample average, \(\mathcal {A}_\ell \), is computed using \(M_\ell \in \mathbb {Z}_+\) independent identically distributed (i.i.d.) outcomes, \(\{\omega _{\ell ,m}\}_{m=1}^{M_\ell }\), of the underlying, mesh-independent, stochastic model; i.e. \(\omega _{\ell ,m} \in \varOmega \) for all \(\ell \) and \(m\). The MLMC estimator can then be written as

$$\begin{aligned} \mathcal {A} = \sum _{\ell =0}^{L} \mathcal {A}_\ell = \sum _{\ell =0}^{L} \frac{1}{M_\ell } \sum _{m=1}^{M_\ell } G_{\ell }(\omega _{\ell ,m}). \end{aligned}$$
(2.2)

Note that the outcomes are also assumed to be independent among the different sample averages, \(\mathcal {A}_\ell \).
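A minimal sketch of the estimator (2.2), assuming a user-supplied routine sample_G(ell, rng) that returns one outcome of \(G_\ell \) (i.e. \(g_\ell \) and \(g_{\ell -1}\) evaluated on the same underlying outcome \(\omega \)); the routine name is hypothetical:

```python
import numpy as np

def mlmc_estimator(sample_G, M, rng):
    """MLMC estimator (2.2): sum over levels of the sample averages A_ell.

    sample_G(ell, rng) -> one outcome of G_ell (hypothetical user routine that
    evaluates g_ell and g_{ell-1} on the same underlying outcome omega);
    M is the list of sample sizes M_0, ..., M_L."""
    A = 0.0
    for ell, M_ell in enumerate(M):
        G = np.array([sample_G(ell, rng) for _ in range(M_ell)])
        A += G.mean()  # sample average A_ell on level ell
    return A
```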

We use the following model for the expected value of the cost associated with generating one sample of \(G_{\ell }\), including generating all the underlying random variables:

$$\begin{aligned} W_\ell \propto h_\ell ^{-\gamma } = h_0^{-\gamma }\beta ^{\ell \gamma } \end{aligned}$$

for a given \(\gamma \). Note that the cost of generating a sample of \(G_\ell \) might differ between realizations, for example due to a different number of iterations in an iterative method or due to adaptivity of the numerical method used. The parameter \(\gamma \) depends on the number of dimensions of the underlying problem and on the numerical method used. For example, \(\gamma =1\) for the one-dimensional SDE in Example 2.2. For the PDE in Example 2.1, if the number of dimensions is \(d=3\) then \(\gamma =3 \tilde{\gamma }\), where \(\tilde{\gamma }\) depends on the solver used to solve the resulting linear system. In that example, iterative methods may have a smaller value of \(\tilde{\gamma }\) than direct methods. The theoretical best-case scenario for iterative methods would be \(\tilde{\gamma }= 1\), attained by multigrid methods. On the other hand, we would have \(\tilde{\gamma }= 3\) if one used a direct method based on naive Gaussian elimination on dense matrices. The total work of the estimator (2.2) is

$$\begin{aligned} W&= \sum _{\ell =0}^{L} M_\ell W_\ell . \end{aligned}$$

We want our estimator to satisfy a tolerance with prescribed failure probability \(0 < \alpha < 1\), i.e.,

$$\begin{aligned} {{\hbox {P}}\left[ \left| {\hbox {E}\left[ g\right] }-\mathcal {A}\right| > \mathrm{{TOL}}\right] }&\le \alpha , \end{aligned}$$
(2.3)

while minimizing the work, \(W\). Here, we split the total error into bias and statistical error,

$$\begin{aligned} \left| {\hbox {E}\left[ g\right] }-\mathcal {A}\right| \le \underbrace{\left| {\hbox {E}\left[ g-\mathcal {A}\right] }\right| }_{\text {Bias}} + \underbrace{\left| {\hbox {E}\left[ \mathcal {A}\right] }-\mathcal {A}\right| }_{\text {Statistical error}}, \end{aligned}$$

and use a splitting parameter, \(\theta \in (0,1)\), such that

$$\begin{aligned} \mathrm{{TOL}}= \underbrace{(1-\theta ) \mathrm{{TOL}}}_{\text {Bias tolerance}} + \underbrace{\theta \mathrm{{TOL}}}_{\text {Statistical error tolerance}}. \end{aligned}$$

The MLMC algorithm should bound the bias, \(B = \left| {\hbox {E}\left[ g-\mathcal {A}\right] }\right| \), and the statistical error as follows:

$$\begin{aligned} B = \left| {\hbox {E}\left[ g-\mathcal {A}\right] }\right|&\le (1-\theta )\mathrm{{TOL}}, \end{aligned}$$
(2.4a)
$$\begin{aligned} \left| {\hbox {E}\left[ \mathcal {A}\right] }-\mathcal {A}\right|&\le \theta \mathrm{{TOL}}, \end{aligned}$$
(2.4b)

where the latter bound should hold with probability \(1-\alpha \). Note that \(\theta \) does not have to be a constant; indeed, it can depend on \(\mathrm{{TOL}}\), as we shall see in Sect. 4. In the literature, some authors (e.g. [12]) have controlled the mean square error (MSE),

$$\begin{aligned} \text {MSE} = {\left| {\hbox {E}\left[ g-\mathcal {A}\right] }\right| }^2 + {\hbox {E}\left[ \left| {\hbox {E}\left[ \mathcal {A}\right] }-\mathcal {A}\right| ^2\right] }, \end{aligned}$$

rather than working with (2.3). We prefer to work with (2.3) since it allows us to prescribe both the accuracy \(\mathrm{{TOL}}\) and the confidence level, \(1-\alpha \), in our results. The bound (2.4b) leads us to require

$$\begin{aligned} {\mathrm{{Var}}\left[ \mathcal {A}\right] }&\le \left( \frac{\theta \mathrm{{TOL}}}{C_\alpha } \right) ^2, \end{aligned}$$
(2.5)

for some given confidence parameter, \(C_\alpha \), such that \({\varPhi (C_\alpha ) = 1-\frac{\alpha }{2}}\); here, \(\varPhi \) is the cumulative distribution function of a standard normal random variable. The bound (2.5) is motivated by the Lindeberg Central Limit Theorem in the limit \({\mathrm{{TOL}}\rightarrow 0}\), cf. Lemma 7 in the Appendix.
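For instance, with \(\alpha =0.05\) the confidence parameter is \(C_\alpha = \varPhi ^{-1}(1-\alpha /2) \approx 1.96\); a quick check of this value, using SciPy purely for illustration:

```python
from scipy.stats import norm

alpha = 0.05
C_alpha = norm.ppf(1.0 - alpha / 2.0)  # Phi(C_alpha) = 1 - alpha/2
print(C_alpha)                         # ~1.96; the tests in Sect. 5 use C_alpha = 2
```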

By construction of the MLMC estimator, \({\hbox {E}\left[ \mathcal {A}\right] }={\hbox {E}\left[ g_L\right] }\), and denoting \( V_\ell = {\mathrm{{Var}}\left[ G_\ell \right] }\), then by independence, we have \({\mathrm{{Var}}\left[ \mathcal {A}\right] } = \sum _{\ell =0}^L V_\ell M_\ell ^{-1},\) and the total error estimate can be written as

$$\begin{aligned} \text {Total error estimate} = B + C_\alpha \sqrt{{\mathrm{{Var}}\left[ \mathcal {A}\right] }}. \end{aligned}$$
(2.6)

Given \(L\) and \(0 < \theta < 1\), minimizing \(W\) subject to the statistical constraint (2.5) over \(\left\{ M_\ell \right\} _{\ell =0}^L \in \mathbb {R}^{L+1}\) gives the following optimal number of samples per level \(\ell \):

$$\begin{aligned} M_\ell = \left( \frac{C_\alpha }{\theta \mathrm{{TOL}}} \right) ^2 \sqrt{\frac{V_\ell }{W_\ell }} \left( \sum _{\ell =0}^L \sqrt{V_\ell W_\ell } \right) . \end{aligned}$$
(2.7)

Substituting the optimal number of samples on all levels, the optimal work can be written in terms of \(L\) as follows:

$$\begin{aligned} W(\mathrm{{TOL}}, L) = \left( \frac{C_\alpha }{\theta \mathrm{{TOL}}} \right) ^2 \left( \sum _{\ell =0}^L \sqrt{V_\ell W_\ell }\right) ^2. \end{aligned}$$
(2.8)

Of course, the number of samples on each level is a positive integer. To obtain an approximate value of the optimal integer number of samples, we take the ceiling of the real-valued optimal values in (2.7).
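As a small illustration, (2.7) and (2.8) translate directly into code once estimates of \(\{V_\ell \}\) and \(\{W_\ell \}\) are available (a sketch; the function names are ours):

```python
import numpy as np

def optimal_samples(V, W, TOL, theta, C_alpha):
    """Real-valued optimal sample sizes (2.7), rounded up to integers."""
    V, W = np.asarray(V, float), np.asarray(W, float)
    coeff = (C_alpha / (theta * TOL))**2 * np.sum(np.sqrt(V * W))
    return np.ceil(coeff * np.sqrt(V / W)).astype(int)

def optimal_work(V, W, TOL, theta, C_alpha):
    """Optimal total work (2.8) for the hierarchy of levels 0..L."""
    V, W = np.asarray(V, float), np.asarray(W, float)
    return (C_alpha / (theta * TOL))**2 * np.sum(np.sqrt(V * W))**2
```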

In this work, we assume the following models on the weak error and variance:

$$\begin{aligned} {\hbox {E}\left[ g-g_\ell \right] }&\approx Q_W h_\ell ^{q_1},\end{aligned}$$
(2.9a)
$$\begin{aligned} {\mathrm{{Var}}\left[ g_\ell -g_{\ell -1}\right] }&\approx Q_S h_{\ell -1}^{q_2}, \end{aligned}$$
(2.9b)

for some constants \(Q_W\ne 0,Q_S>0,q_1>0\) and \(0 < q_2 \le 2 q_1\). Note that the condition usually assumed for MLMC is \(\min (q_2,d\gamma ) \le 2q_1\), cf. [30, Theorem 2.3], to ensure that the cost of MLMC is not dominated by the cost of a single sample on each level. We assume instead the slightly more restrictive condition \(q_2 \le 2 q_1\). As an example, recall that the PDE in Example 2.1 has \(q_2=2 q_1\); in Sect. 5, that PDE is solved using a finite element method with a standard trilinear basis, for which \(q_1=2\). On the other hand, for the SDE in Example 2.2 with Euler discretization, \(q_1=q_2=1\). Collectively, we refer to the parameters \(q_1, q_2, Q_S, Q_W\) and \(\{V_\ell \}_{\ell =0}^L\) as problem parameters. Based on these models, we can write for \(\ell >0\)

$$\begin{aligned} {\hbox {E}\left[ G_\ell \right] }&\approx Q_W h_0^{q_1} \beta ^{-\ell q_1} \left( \beta ^{q_1} -1\right) ,\end{aligned}$$
(2.10a)
$$\begin{aligned} {\mathrm{{Var}}\left[ G_\ell \right] } = V_\ell&\approx Q_S h_0^{q_2} \beta ^{-(\ell -1) q_2}. \end{aligned}$$
(2.10b)

Specifically, as a consequence of (2.9a), the bias model is

$$\begin{aligned} B \approx |Q_W| h_0^{q_1} \beta ^{-L q_1}. \end{aligned}$$
(2.11)

Finally, we note that the algorithms presented in this work are iterative. We therefore denote by \(\overline{M}_\ell , \overline{G}_\ell \) and \(\overline{V}_\ell \) the total number of samples of \(G_\ell \) generated in all iterations, and their sample average and sample variance, respectively. Explicitly, we write

$$\begin{aligned} \overline{G}_\ell&= \frac{1}{\overline{M}_\ell } \sum _{m=1}^{\overline{M}_\ell } G_{\ell }(\omega _{\ell ,m}),\end{aligned}$$
(2.12a)
$$\begin{aligned} \overline{V}_\ell&= \frac{1}{\overline{M}_\ell } \sum _{m=1}^{\overline{M}_\ell } \left( G_{\ell }(\omega _{\ell ,m}) - \overline{G}_\ell \right) ^2. \end{aligned}$$
(2.12b)

3 Standard MLMC

3.1 Overview

While minor variations exist among the MLMC algorithms listed in [12, 15, 16], we believe that there is sufficient commonality among them for us to outline here the overarching idea and refer to this collection of methods as the Standard MLMC algorithm, or simply SMLMC. SMLMC solves the problem by iteratively increasing the number of levels of the MLMC hierarchy. In order to find the optimal number of samples on each level \(\ell \), an estimate of the variance \(V_\ell \) is needed. If samples were generated for level \(\ell \) in previous iterations, the sample variance \(\overline{V}_\ell \) is used. Otherwise, an initial fixed number of samples, \(\widetilde{M}\), is generated. Moreover, in most works, the splitting between bias and statistical error, \(\theta \), is chosen to be 0.5.

After running the hierarchy, an estimate of the total error is computed. To this end, the work [12] approximates the absolute value of the constant, \(Q_W\), using a similar expression to the following:

$$\begin{aligned} |Q_W| \approx \max \left( \frac{\left| \overline{G}_{L-1}\right| }{h_0^{q_1} \beta ^{-(L-1) q_1} \left( \beta ^{q_1} -1\right) }, \; \frac{\left| \overline{G}_{L}\right| }{h_0^{q_1} \beta ^{-L q_1} \left( \beta ^{q_1} -1\right) } \right) . \end{aligned}$$

In other words, the absolute value of the constant \(Q_W\) is estimated using the samples generated on the last two levels. Thus, this estimate is only defined for \(L\ge 2\). Next, the variance of the estimator, \({\mathrm{{Var}}\left[ \mathcal A\right] }\), is approximated by

$$\begin{aligned} {\mathrm{{Var}}\left[ \mathcal {A}\right] } \approx \sum _{\ell =0}^{L} \frac{\overline{V}_\ell }{M_\ell }. \end{aligned}$$

Finally, a total error estimate can be computed as outlined by (2.6),

$$\begin{aligned} \text {Total error estimate} = |Q_W|\, h_0^{q_1} \beta ^{-L q_1} + C_\alpha \sqrt{\sum _{\ell =0}^{L} \frac{\overline{V}_\ell }{M_\ell }}. \end{aligned}$$
(3.1)

The complete algorithm is outlined in Algorithm 1.

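As a rough sketch of the loop just described (our illustrative paraphrase, with hypothetical helper routines; not the exact Algorithm 1): start from a small number of levels, generate \(\widetilde{M}\) initial samples on any new level, compute the optimal \(M_\ell \) with \(\theta =0.5\), and add levels until the total error estimate (3.1) meets the tolerance.

```python
import numpy as np

def smlmc(sample_G, work, TOL, C_alpha, M_tilde, h0, beta, q1, L_max=20):
    """Rough sketch of the SMLMC loop of Sect. 3.1 (not the exact Algorithm 1).

    sample_G(ell, n) -> array of n outcomes of G_ell (hypothetical user routine);
    work(ell) -> cost model W_ell; theta is fixed at 0.5."""
    theta = 0.5
    samples = {}
    w = lambda l: h0**q1 * beta**(-l * q1) * (beta**q1 - 1.0)
    for L in range(2, L_max + 1):
        for ell in range(L + 1):                  # M_tilde initial samples on new levels
            if ell not in samples:
                samples[ell] = sample_G(ell, M_tilde)
        V = np.array([samples[ell].var() for ell in range(L + 1)])
        W = np.array([work(ell) for ell in range(L + 1)])
        M_opt = np.ceil((C_alpha / (theta * TOL))**2
                        * np.sqrt(V / W) * np.sum(np.sqrt(V * W))).astype(int)
        for ell in range(L + 1):                  # top up to the optimal sample sizes
            extra = M_opt[ell] - samples[ell].size
            if extra > 0:
                samples[ell] = np.concatenate([samples[ell], sample_G(ell, extra)])
        # bias from the last two levels, statistical error from the sample variances
        QW = max(abs(samples[L].mean()) / w(L), abs(samples[L - 1].mean()) / w(L - 1))
        bias = QW * h0**q1 * beta**(-L * q1)
        stat = C_alpha * np.sqrt(sum(samples[ell].var() / samples[ell].size
                                     for ell in range(L + 1)))
        if bias + stat <= TOL:                    # total error estimate (3.1)
            break
    return sum(samples[ell].mean() for ell in range(L + 1))
```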

Usually, all samples from previous iterations are used when running the hierarchy in step 7 of Algorithm 1 to calculate the required quantity of interest. However, the analysis of the bias and the statistical error of the resulting estimator is difficult and, to the best of our knowledge, has not been done before.

3.2 Accuracy of the parameter estimates

In the standard algorithm, \(Q_W\) and the variances \(\{V_\ell \}_{\ell =0}^L\) are needed and estimated. In this section, we look at the accuracy of the estimators for these problem parameters.

We examine the accuracy of the sample variance by computing its squared relative error for \(\ell >0\):

$$\begin{aligned} \frac{{\mathrm{{Var}}\left[ \ \overline{V}_\ell \right] }}{V_\ell ^2}&= \frac{\left( \overline{M}_\ell -1\right) ^2}{\overline{M}_\ell ^3 V_\ell ^2} \left( {\hbox {E}\left[ \left( G_\ell -{\hbox {E}\left[ G_\ell \right] }\right) ^4\right] } - \frac{V_\ell ^2 (\overline{M}_\ell - 3)}{\overline{M}_\ell -1}\right) \\&= \frac{\left( \overline{M}_\ell -1\right) ^2}{\overline{M}_\ell ^3} \left( {\hbox {E}\left[ \left( G_\ell -{\hbox {E}\left[ G_\ell \right] }\right) ^4\right] } V_\ell ^{-2} - \frac{\overline{M}_\ell - 3}{\overline{M}_\ell -1}\right) \\&\approx \frac{\left( \overline{M}_\ell -1\right) ^2}{\overline{M}_\ell ^3} \left( {\hbox {E}\left[ \left( G_\ell -{\hbox {E}\left[ G_\ell \right] }\right) ^4\right] } Q_S^{-2} h_\ell ^{-2q_2} - \frac{\overline{M}_\ell - 3}{\overline{M}_\ell -1}\right) . \end{aligned}$$

Unless \({\hbox {E}\left[ (G_\ell -{\hbox {E}\left[ G_\ell \right] })^4\right] } \le C h_\ell ^{2q_2},\) for some constant \(C>0\), or \(M_\ell \) increases sufficiently fast, the relative error in the estimator \(\overline{V}_\ell \) can become unbounded as \(\ell \rightarrow \infty \). Similarly, the relative error of the sample variance at level \(\ell =0\) can be shown to be bounded for instance by assuming that the second and fourth central moments of \(G_0\) are bounded.

Next, for simplicity, we look at the squared relative error estimate of \(Q_W\) by assuming that it is estimated using samples on a single level, \(L\), only.

$$\begin{aligned} \frac{{\mathrm{{Var}}\left[ \left| \frac{\overline{G}_L}{h_0^{q_1} \beta ^{-L q_1} ( \beta ^{q_1}-1)} \right| \right] } }{Q_W^2}&= \frac{V_L}{Q_W^2 M_L h_0^{2 q_1} \beta ^{-2 L q_1}(\beta ^{q_1}-1)^2} \\&= \frac{Q_S\, h_0^{q_2} \beta ^{-q_2 L}}{Q_W^2 M_L h_0^{2 q_1} \beta ^{-2 L q_1}( \beta ^{q_1}-1)^2} \\&= \frac{Q_S h_0^{q_2 -2 q_1}}{Q_W^2(\beta ^{q_1}-1)^2} \left( \frac{\beta ^{L(2q_1-q_2)}}{M_L} \right) . \end{aligned}$$

Observe now that if \(q_2<2q_1\) (as in Example 2.2), then, for the previous relative error estimate to be \({o (1)} \), we must have \({M_L \propto \beta ^{L(2q_1 -q_2)} \rightarrow \infty }\) as \(L \rightarrow \infty \). This analysis shows that in some cases \(M_L\) will have to grow to provide an accurate estimate of \(Q_W\), regardless of the optimal choice of the number of samples outlined in (2.7).

4 Continuation MLMC (CMLMC)

In this section we discuss the main contribution of this work, a continuation MLMC (CMLMC) algorithm that approximates the value \({\hbox {E}\left[ g(u)\right] }\). We begin in the next subsection by giving an overview of the general idea of the algorithm. Subsequent subsections discuss how to estimate all the problem parameters that are necessary for running the algorithm. CMLMC is listed in Algorithm 2.


4.1 Overview

The main idea of CMLMC is to solve for \({\hbox {E}\left[ g(u)\right] }\) with a sequence of decreasing tolerances. By doing this, CMLMC is able to increasingly improve estimates of several problem-dependent parameters while solving relatively inexpensive problems corresponding to large tolerances. These parameter estimates are crucial to optimally distribute the computational effort when solving for the last tolerance, which is the desired tolerance, \(\mathrm{{TOL}}\), or smaller. Moreover, the sequence is built such that the total work of the algorithm is close to the work of MLMC when solving for the desired tolerance, \(\mathrm{{TOL}}\), assuming all the necessary parameters are known a priori. To this end, we make the following choice for the sequence of decreasing tolerances \(\mathrm{{TOL}}_i\) for \(i=0,1,\ldots \)

$$\begin{aligned} \mathrm{{TOL}}_i = {\left\{ \begin{array}{ll} r_1^{i_E-i} r_2^{-1} \mathrm{{TOL}}&{} i < i_E, \\ r_2^{i_E-i} r_2^{-1} \mathrm{{TOL}}&{} i \ge i_E,\\ \end{array}\right. } \end{aligned}$$

where \(r_1 \ge r_2 > 1\). By imposing \(\mathrm{{TOL}}_0 =\mathrm{{TOL}}_{\max }\) for some maximum tolerance, we have

$$\begin{aligned} i_E = \left\lfloor \frac{-\log (\mathrm{{TOL}}) + \log (r_2) + \log (\mathrm{{TOL}}_{\max })}{\log (r_1)} \right\rfloor . \end{aligned}$$

Iterations for which \(i \le i_E\) are meant to obtain increasingly accurate estimates of the problem parameters. The iteration \(i_E\) solves the problem for the tolerance \({r_2^{-1}\mathrm{{TOL}}}\). Notice that the problem is solved for a slightly smaller tolerance than the required tolerance, \(\mathrm{{TOL}}\). This tolerance reduction prevents extra unnecessary iterations caused by slight variations in the estimates of the problem parameters, and it improves the overall average running time of the algorithm. Similarly, iterations \(i > i_E\) have even smaller tolerances to account for cases in which the estimates of the problem parameters are unstable. The parameters \(r_1\) and \(r_2\) are chosen such that the total work of the algorithm is not significantly more than the work of the final hierarchy that solves the problem with the required tolerance, \(\mathrm{{TOL}}\). For example, if the work of the MLMC estimator is \({\mathcal {O}(\mathrm{{TOL}}^{-2})}\), we choose \(r_1=2\) to ensure that the work of iteration \(i\) is roughly four times the work of iteration \(i-1\) for iterations with \({\mathrm{{TOL}}_i \ge \mathrm{{TOL}}}\). The choice \(r_2=1.1\), on the other hand, ensures that for iterations with \({\mathrm{{TOL}}_i < \mathrm{{TOL}}}\), the work of iteration \(i\) is roughly 1.2 times the work of iteration \(i-1\).
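The schedule above is straightforward to compute; a minimal sketch (function and variable names are ours and purely illustrative):

```python
import math

def tolerance_sequence(TOL, TOL_max, r1=2.0, r2=1.1, n_extra=3):
    """Continuation tolerances TOL_0, TOL_1, ..., with TOL_0 close to TOL_max.

    In the actual algorithm the loop stops once the error estimate meets TOL;
    n_extra only caps how many of the smaller tolerances are listed here."""
    iE = math.floor((math.log(TOL_max) + math.log(r2) - math.log(TOL)) / math.log(r1))
    tols = []
    for i in range(iE + 1 + n_extra):
        r = r1 if i < iE else r2
        tols.append(r**(iE - i) * TOL / r2)
    return tols

print(tolerance_sequence(TOL=0.01, TOL_max=0.5))
```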

Consider now the \(i\)-th iteration of CMLMC and assume that estimates for \({\mathbf {Q} := \{q_1, q_2, Q_W, Q_S\}}\) and \(\{V_\ell \}_{\ell =0}^L\) are available from previous iterations; we will discuss how to obtain these estimates in Sect. 4.2. The \(i\)-th iteration begins by selecting the optimal number of levels \(L[i]\) that solves the problem for the given tolerance, \(\mathrm{{TOL}}_i\), as follows

$$\begin{aligned} L[i] = \text {argmin}_{L_{\min }[i] \le L \le L_{\max }[i]} W(\mathrm{{TOL}}_i, L), \end{aligned}$$
(4.1)

where \(W(\mathrm{{TOL}}_i,L)\) is defined by (2.8) and depends on all the parameters \(\mathbf Q\) and \(\{V_\ell \}_{\ell =0}^L\) and \(\theta = \theta (L)\) given by

$$\begin{aligned} \theta = 1 - \frac{| Q_W | h_L^{q_1}}{\mathrm{{TOL}}_i} = 1 - \frac{| Q_W | h_0^{q_1} \beta ^{-L q_1}}{\mathrm{{TOL}}_i}, \end{aligned}$$
(4.2)

which comes from enforcing that the bias model (2.11) equals \((1-\theta ) \mathrm{{TOL}}_i\). Moreover, \(L_{\min }\) should satisfy \({|Q_W| h_{L_{\min }}^{q_1} = \mathrm{{TOL}}_i}\) or, since we have \(h_\ell = h_0 \,\beta ^{-\ell },\)

$$\begin{aligned} L_{\min }[i] = \max \left( L[i-1], \frac{q_1 \log (h_0) - \log \left( \frac{\mathrm{{TOL}}_i}{|Q_W|}\right) }{q_1 \log {\beta }} \right) , \end{aligned}$$

where \(L[i-1]\) is the number of levels from the previous iteration. This ensures that \(L\) does not decrease from one iteration to the next, which agrees with our intuition that \(L\) increases with \(\log \left( \mathrm{{TOL}}_i^{-1}\right) \). On the other hand, \(L_{\max }\) is given by other considerations. For instance, it could be related to the minimum mesh size imposed by memory or computational restrictions. More practically, to ensure robustness, \(L_{\max }\) can be chosen to be \(L_{\min } + L_{\text {inc}}\), for a given fixed integer \(L_{\text {inc}}\), so that \(L\) has limited increments from one iteration to the next. Since only a few values of \(L\) are considered in the optimization (4.1), it is easy to find the optimal \(L\) by exhaustive search; see the sketch below. The choice (4.2) implies that the statistical constraint (2.5) is relaxed (or tightened) depending on the estimated bias of each hierarchy. The iteration then continues by running the resulting hierarchy with the optimal number of samples \(\{M_\ell \}_{\ell =0}^L\) according to (2.7). Finally, the iteration ends by improving the estimates of the problem parameters \(\mathbf Q\) and \(\{V_\ell \}_{\ell =0}^L\), as well as the quantity of interest, based on the newly available samples, as described in Sect. 4.2.
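As an illustration of the selection step (4.1)-(4.2), the following sketch performs the exhaustive search over \(L\) given the current parameter estimates; work_level stands for the cost model \(W_\ell \), and all names are ours:

```python
import math

def select_L(TOL_i, QW, V, work_level, h0, beta, q1, C_alpha, L_prev, L_inc=2):
    """Exhaustive search (4.1) over L_min..L_max with the splitting theta from (4.2).

    V[ell] are variance estimates (extrapolated via the model (4.3b) for levels
    not yet sampled); work_level(ell) is the cost model W_ell."""
    L_min = max(L_prev,
                math.ceil((q1 * math.log(h0) - math.log(TOL_i / abs(QW)))
                          / (q1 * math.log(beta))))
    best_L, best_W = None, float('inf')
    for L in range(L_min, L_min + L_inc + 1):
        theta = 1.0 - abs(QW) * h0**q1 * beta**(-L * q1) / TOL_i   # splitting (4.2)
        if theta <= 0.0:      # the bias alone would exceed TOL_i; L is too small
            continue
        W = (C_alpha / (theta * TOL_i))**2 * sum(
                math.sqrt(V[ell] * work_level(ell)) for ell in range(L + 1))**2
        if W < best_W:
            best_L, best_W = L, W
    return best_L
```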

To start CMLMC, we compute with an initial, relatively inexpensive hierarchy. The purpose of using this initial hierarchy is to obtain rough estimates of the problem parameters. Such a hierarchy cannot depend on estimates of the problem parameters and should have at least three levels to allow estimating \(\mathbf Q\); these three levels are needed to be able to extrapolate (or interpolate) the weak error and variance estimates on all MLMC levels. The algorithm stops when the total error estimate is below the required tolerance \(\mathrm{{TOL}}\).

4.2 Parameters estimation

In this section, we discuss how to improve estimates of the parameters \(\mathbf Q\) as well as the variances \(V_\ell \) based on the generated samples in all iterations and all levels. For easier presentation, we will also use the following notation

$$\begin{aligned} w_\ell (q_1)&= h_0^{q_1} \beta ^{-\ell q_1} (\beta ^{q_1}-1) ,\\ s_\ell (q_2)&= h_0^{-q_2} \beta ^{\ell q_2}. \end{aligned}$$

Thus, using the notation above, (2.10) becomes

$$\begin{aligned} {\hbox {E}\left[ G_\ell \right] }&\approx Q_W w_\ell (q_1),\end{aligned}$$
(4.3a)
$$\begin{aligned} {\mathrm{{Var}}\left[ G_\ell \right] } = V_\ell&\approx Q_S s_\ell ^{-1}(q_2). \end{aligned}$$
(4.3b)

4.2.1 Estimating variances \(V_\ell \)

We first assume that we have estimates of \(q_1\), \(q_2\), \(Q_W\) and \(Q_S\) and discuss estimating the variances, \(\{V_\ell \}_{\ell =0}^L\), and the total statistical error after computing with a given hierarchy. Estimating \(q_1, q_2, Q_W\) and \(Q_S\) is discussed in the next subsection.

Usually, the variances \(\{V_\ell \}_{\ell =0}^L\) are estimated by the sample variance estimator (2.12b) and used to estimate the statistical error as well as the optimal number of samples \(\{M_\ell \}_{\ell =0}^L\). However, sometimes there are too few samples on a given level to give an accurate variance estimate. This is especially acute on the deepest levels and, unlike the standard MLMC algorithm, we do not impose a minimum number of samples across levels to obtain a stable estimate of the sample variance. Recalling that we have the variance model (4.3b) at our disposal, we can use this model to estimate the variance at all levels \(\ell > 0\). However, the model (4.3b) is only accurate asymptotically. We can use the generated samples on each level to locally improve the accuracy of the \(V_\ell \) estimates. To this end, we use a Bayesian setting [27].

We assume that \(G_\ell \) follows a normal distribution with mean \(\mu _\ell \) and precision \(\lambda _\ell \) (the precision is simply the inverse of the variance). To simplify the computation, we choose a normal-gamma prior on \((\mu _\ell , \lambda _\ell )\), the conjugate prior of the normal likelihood. The resulting posterior probability density function (pdf) is then also a normal-gamma distribution. We choose the parameters \((\widehat{\mu }_\ell ,\kappa _0, 0.5 + \widehat{\lambda }_{\ell }\kappa _1, \kappa _1)\) for the normal-gamma prior, such that it is maximized at \(\widehat{\mu }_{\ell }\) and \(\widehat{\lambda }_{\ell }\). The parameters \(\widehat{\mu }_\ell \) and \(\widehat{\lambda }_\ell \) serve as initial guesses for \(\mu _\ell \) and \(\lambda _\ell \), respectively. Moreover, \(\kappa _0\) and \(\kappa _1\) are positive constants that model our certainty in those respective guesses. We use the assumed models of the weak error and variance (4.3) to give the initial guesses

$$\begin{aligned} \widehat{\mu }_{\ell }&= Q_W w_\ell (q_1),\end{aligned}$$
(4.4a)
$$\begin{aligned} \widehat{\lambda }_{\ell }&= Q_S^{-1} s_\ell (q_2). \end{aligned}$$
(4.4b)

As mentioned, the posterior pdf is also a normal-gamma with parameters \((\varUpsilon _{1,\ell },\varUpsilon _{2,\ell },\varUpsilon _{3,\ell },\varUpsilon _{4,\ell })\) and it is maximized at \((\varUpsilon _{1,\ell }, \frac{\varUpsilon _{3,\ell }-0.5}{\varUpsilon _{4,\ell }})\). Specifically

$$\begin{aligned} \varUpsilon _{3,\ell }&= 0.5 + \kappa _1 \widehat{\lambda }_\ell + \frac{\overline{M}_\ell }{2},\\ \varUpsilon _{4,\ell }&= \kappa _1 + \frac{1}{2} \left( \sum _{m=1}^{\overline{M}_\ell } \left( G_{\ell , m} -\overline{G}_\ell \right) ^2 \right) + \frac{\kappa _0 \overline{M}_\ell (\overline{G}_\ell -\widehat{\mu }_\ell )^2}{2 (\kappa _0 + \overline{M}_\ell )}. \end{aligned}$$

As such, we use the following estimate of the variance \(V_\ell \) for \(\ell >0\)

$$\begin{aligned} V_\ell \approx \frac{\varUpsilon _{4,\ell }}{\varUpsilon _{3,\ell }-0.5}. \end{aligned}$$
(4.5)

Estimating the variance at the coarsest mesh, \(V_0\), can be done using the sample variance. The number of samples on the coarsest level, \(M_0\), is usually large enough to produce a stable and accurate estimate. Using these estimates and the bias estimate (2.11), the total error can be estimated as (2.6).
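The corresponding update is cheap to implement; a sketch of the estimate (4.5) for a single level (our own naming; \(\kappa _0\), \(\kappa _1\) and the guesses (4.4) are supplied by the caller):

```python
def bayesian_variance(M, G_bar, sum_sq_dev, mu_hat, lam_hat, kappa0, kappa1):
    """Posterior-mode variance estimate (4.5) for one level ell > 0.

    M           -- total number of samples \bar{M}_ell
    G_bar       -- sample average \bar{G}_ell
    sum_sq_dev  -- sum_m (G_{ell,m} - \bar{G}_ell)^2
    mu_hat      -- model-based guess (4.4a) for E[G_ell]
    lam_hat     -- model-based guess (4.4b) for the precision 1/V_ell"""
    ups3 = 0.5 + kappa1 * lam_hat + 0.5 * M                             # Upsilon_{3,ell}
    ups4 = (kappa1 + 0.5 * sum_sq_dev
            + kappa0 * M * (G_bar - mu_hat)**2 / (2.0 * (kappa0 + M)))  # Upsilon_{4,ell}
    return ups4 / (ups3 - 0.5)
```

Note that for a small \(\overline{M}_\ell \) the estimate is pulled toward the model-based guess \(1/\widehat{\lambda }_\ell \), while for a large \(\overline{M}_\ell \) it approaches the sample variance (2.12b), which is precisely the stabilizing behavior sought here.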

4.2.2 Estimating \(\mathbf {Q}\)

To incorporate prior knowledge on \(q_1\) and \(q_2\) including initial guesses and the relation \(q_2 \le 2q_1\), we again follow a Bayesian setting to estimate these parameters and assume that \(G_\ell \) follows a Gaussian distribution with mean \({Q_W w_\ell (q_1)}\) and variance \({Q_S s_\ell ^{-1}(q_2)}\). In what follows, \(\ell _0\) is a non-negative integer. With these assumptions, the corresponding likelihood is

$$\begin{aligned} \mathcal {L}&= \left( \prod _{\ell =\ell _0}^L \left( 2 \pi Q_S s_\ell ^{-1}(q_2)\right) ^\frac{-\overline{M}_\ell }{2} \right) \nonumber \\&\quad \times \exp \left( -\frac{1}{2 Q_S} \sum _{\ell =\ell _0}^L s_\ell (q_2) \sum _{m=1}^{\overline{M}_\ell } \left( G_{\ell ,m} - Q_W w_\ell (q_1) \right) ^2 \right) . \end{aligned}$$
(4.6)

Assuming an improper prior on \(Q_W\) and \(Q_S\) and maximizing the resulting posterior pdf with respect to \(Q_W\) and \(Q_S\) gives the following weighted least-squares solution:

$$\begin{aligned} Q_W^*&= \left( \sum _{\ell =\ell _0}^L \overline{M}_\ell w_\ell ^2(q_1) s_\ell (q_2) \right) ^{-1} \sum _{\ell =\ell _0}^L w_\ell (q_1) s_\ell (q_2) \overline{M}_\ell \overline{G}_\ell ,\end{aligned}$$
(4.7a)
$$\begin{aligned} Q_S^*&= \left( \sum _{\ell =\ell _0}^L \overline{M}_\ell \right) ^{-1} \sum _{\ell =\ell _0}^L s_\ell (q_2) \sum _{m=1}^{\overline{M}_\ell } \left( G_{\ell ,m} - Q_W w_\ell (q_1) \right) ^2. \end{aligned}$$
(4.7b)

We can substitute the previous expressions for \(Q_W\) and \(Q_S\) in (4.6) to obtain a likelihood in terms of \(q_1\) and \(q_2\). Denoting \(\overline{M} = \sum _{\ell =\ell _0}^L \overline{M}_\ell \), we write

$$\begin{aligned} \mathcal {L}(q_1, q_2)&= \exp \left( -\frac{\overline{M}}{2}\right) \left( \sum _{\ell =\ell _0}^L \sum _{m=1}^{\overline{M}_\ell } s_\ell (q_2) G^2_{\ell ,m}\right. \\&\quad -\left. \frac{\left( \sum _{\ell =\ell _0}^L s_\ell (q_2) w_\ell (q_1) \overline{M}_\ell \overline{G}_{\ell }\right) ^2}{\sum _{\ell =\ell _0}^L \overline{M}_\ell w_\ell (q_1)^2 s_\ell (q_2) } \right) ^{-\frac{\overline{M}}{2}}. \end{aligned}$$

We can then assume a prior on \(q_1\) and \(q_2\). However, remember that \(q_2 \le 2 q_1\), and \(q_1 > 0\). As such, we introduce the unconstrained parameters \({x_0(q_1) = \log (q_1) \in \mathbb {R}}\) and \({x_1(q_1,q_2) = \log (2 q_1 - q_2) \in \mathbb {R}}\) and assume a Gaussian prior on them

$$\begin{aligned} \rho _{\text {prior}}(q_1,q_2) = \frac{1}{2 \pi \sqrt{\sigma _0^2 \sigma _1^2}} \exp \left( -\frac{(x_0(q_1) - \widehat{x}_0)^2}{2 \sigma _0^2} - \frac{(x_1(q_1,q_2) - \widehat{x}_1)^2}{2 \sigma _1^2}\right) . \end{aligned}$$

Here, \(\widehat{x}_0\) and \(\widehat{x}_1\) represent our initial guesses of \(x_0\) and \(x_1\), respectively, which we can obtain from a rough analysis of the problem. Moreover, \(\sigma _0\) and \(\sigma _1\) model our confidence in those guesses. The more accurate our initial guesses are, the faster the algorithm converges. Finally, we numerically maximize the log of the posterior pdf with respect to \((x_0,x_1) \in \mathbb {R}^2\) using a suitable numerical optimization algorithm. For robustness, we choose \(\ell _0=1\) to estimate \(q_1\) and \(q_2\); in other words, we include samples from all levels \(\ell >0\) in this estimation.

Given estimates of \(q_1\) and \(q_2\), we can use the least-squares estimates \(Q_W^*\) and \(Q_S^*\) in (4.7) as estimates of \(Q_W\) and \(Q_S\), respectively. However, usually not all levels follow the assumed asymptotic models (2.10), and as such special care must be taken when choosing \(\ell _0\) in these estimates. The parameter \(Q_W\) must be accurate on deeper levels since it is used to compute the bias (2.11). Similarly, \(Q_S\) must be accurate on deeper levels, where not many samples are available and the variance estimate (4.5) is mainly determined by the initial guess (4.4b). For these reasons, when computing \(Q_W^*\) and \(Q_S^*\), we choose \(\ell _0=\max (1,L-\mathfrak {L})\) in (4.7) for some positive integer \(\mathfrak {L}\) that denotes the maximum number of levels used to compute the estimates. Finally, since \(Q_W\) has an improper prior, its posterior is also a Gaussian with mean \(Q_W^*\) and variance

$$\begin{aligned} V_W := \frac{Q_S}{\sum _{\ell =\ell _0}^L \overline{M}_\ell \, w_\ell ^2(q_1)\, s_\ell (q_2)}. \end{aligned}$$

Motivated by the accuracy analysis of the \(Q_W\) estimate in Sect. 3.2, we use a worst-case estimate of \(Q_W\) instead of simply using the estimate \(Q_W^*\) in (4.7a). The worst-case estimate is produced by adding the maximum sampling error with \(1-\alpha \) confidence, namely \(C_\alpha \sqrt{V_W}\), multiplied by the sign of \(Q_W^*\). In other words, our estimate of \(Q_W\) is \(Q_W^* + \text {sign}(Q_W^*)\, C_\alpha \sqrt{V_W}\).
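For completeness, a sketch of the fit (4.7) together with the worst-case correction just described, in terms of per-level running sums (all names are ours and illustrative; the posterior variance uses the expression for \(V_W\) above):

```python
import math

def w_l(ell, q1, h0, beta):   # w_ell(q1) as defined in Sect. 4.2
    return h0**q1 * beta**(-ell * q1) * (beta**q1 - 1.0)

def s_l(ell, q2, h0, beta):   # s_ell(q2) as defined in Sect. 4.2
    return h0**(-q2) * beta**(ell * q2)

def fit_QW_QS(stats, ell0, L, q1, q2, h0, beta, C_alpha):
    """Weighted least-squares fit (4.7) plus the worst-case Q_W correction.

    stats[ell] = (M, sum1, sum2), where M = \bar{M}_ell, sum1 = sum_m G_{ell,m}
    and sum2 = sum_m G_{ell,m}^2 (illustrative bookkeeping, not the authors' code)."""
    num = den = 0.0
    for ell in range(ell0, L + 1):
        M, sum1, _ = stats[ell]
        w, s = w_l(ell, q1, h0, beta), s_l(ell, q2, h0, beta)
        num += w * s * sum1                     # = w_ell s_ell M_ell \bar{G}_ell
        den += M * w * w * s
    QW = num / den                              # Q_W^* of (4.7a)
    QS, Mtot = 0.0, 0
    for ell in range(ell0, L + 1):
        M, sum1, sum2 = stats[ell]
        w, s = w_l(ell, q1, h0, beta), s_l(ell, q2, h0, beta)
        QS += s * (sum2 - 2.0 * QW * w * sum1 + M * (QW * w)**2)
        Mtot += M
    QS /= Mtot                                  # Q_S^* of (4.7b)
    VW = QS / den                               # posterior variance of Q_W (flat prior)
    QW_worst = QW + math.copysign(C_alpha * math.sqrt(VW), QW)
    return QW_worst, QS
```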

4.3 Algorithm parameters

Table 1 summarizes the parameters that control the CMLMC algorithm. Some of these parameters need to be suitably chosen for the specific problem. However, while there might be optimal values for these parameters that minimize the average running time, it is our experience that reasonable values of these parameters are enough to get average running times that are near-optimal. In fact, results similar to those that we show in Sect. 5 were obtained with variations of \(\kappa _0\) and \(\kappa _1\); namely \(\kappa _0 = \kappa _1 \in \{0.05, 0.1, 0.2\}\).

Table 1 Summary of parameters in CMLMC

5 Numerical tests

In this section, we first introduce the test problems. We then describe several implementation details and finish by presenting the actual numerical results.

5.1 Test problems

We look at three test problems: the first two are based on PDEs with random inputs and the last one is based on an Itô SDE.

5.1.1 Ex.1

This problem is based on Example 2.1 in Sect. 2.1 with some particular choices that satisfy the assumptions therein. First, we choose \({\mathcal {D} = [0,1]^3}\) and assume that the forcing is

$$\begin{aligned} f(\mathbf {x};\omega ) = f_0 + \widehat{f} \sum _{i=0}^K \sum _{j=0}^K \sum _{k=0}^K \varPhi _{ijk}(\mathbf {x}) Z_{ijk}, \end{aligned}$$

where

$$\begin{aligned} \varPhi _{ijk}(\mathbf {x}) = \sqrt{\lambda _i \lambda _j \lambda _k} \phi _i(x_1)\phi _j(x_2)\phi _k(x_3), \end{aligned}$$

and

$$\begin{aligned} \phi _i(x)&= {\left\{ \begin{array}{ll} \cos \left( \frac{5 \Lambda i}{2} \pi x \right) &{} i \text { is even}, \\ \sin \left( \frac{5 \Lambda (i+1)}{2} \pi x \right) &{} i \text { is odd}, \end{array}\right. }, \\ \lambda _i&= \left( 2 \pi \right) ^\frac{7}{6}\Lambda ^\frac{11}{6} {\left\{ \begin{array}{ll} \frac{1}{2} &{} i=0, \\ \exp \left( - 2\left( \frac{\Lambda i}{4} \pi \right) ^2 \right) &{} i \text { is even}, \\ \exp \left( - 2\left( \frac{\Lambda (i+1)}{4} \pi \right) ^2 \right) &{} i \text { is odd}, \end{array}\right. } \end{aligned}$$

for a given \(\Lambda > 0\), a positive integer \(K\), and a set \(\mathbf Z = \lbrace Z_{ijk} \rbrace \) of \((K+1)^3\) i.i.d. standard normal random variables. Moreover, we choose the diffusion coefficient to be a function of two random variables as follows:

$$\begin{aligned} a(\mathbf {x}; \omega )&= a_0 + \exp \Big (4 Y_1 \varPhi _{121}(\mathbf {x}) + 40 Y_2 \varPhi _{877}(\mathbf {x})\Big ). \end{aligned}$$

Here, \(\mathbf Y = \lbrace Y_1, Y_2 \rbrace \) is a set of i.i.d. standard normal random variables, also independent of \(\mathbf Z\). Finally, we make the following choice for the quantity of interest, \(g\):

$$\begin{aligned} g = \left( 2 \pi \sigma ^2 \right) ^\frac{-3}{2} \int _\mathcal {D} \exp \left( - \frac{ \Vert \mathbf {x} - \mathbf {x}_0 \Vert ^2_2}{2 \sigma ^2} \right) u(\mathbf {x}) d\mathbf {x}, \end{aligned}$$

and select the parameters \(a_0=0.01, f_0=50, \widehat{f}=10, \Lambda = \frac{0.2}{\sqrt{2}}, K=10, \sigma ^2=0.02622863\) and \({\mathbf {x}_0 = \left[ 0.5026695,0.26042876,0.62141498\right] }\). Since the diffusion coefficient, \(a\), is independent of the forcing, \(f\), a reference solution can be calculated to sufficient accuracy by scaling and taking expectation of the weak form with respect to \(\mathbf Z\) to obtain a formula with constant forcing for the conditional expectation with respect to Y. We then use stochastic collocation [3] with 11 Hermite quadrature points for each variable in Y (thus totaling 121 points) and a Finite Difference method with centered differences and 128 equally spaced points in each dimension to produce the reference value \({\hbox {E}\left[ g\right] }\). Using this method, the reference value \(1.6026\) was computed with an error estimate of \(10^{-4}\).

5.1.2 Ex.2

The second example is a slight variation of the first. First, we choose the following diffusion coefficient instead:

$$\begin{aligned} a(\mathbf {x}; \omega )&= a_0 + \exp \Big (Y_1 \varPhi _{121}(\mathbf {x}) + Y_2 \varPhi _{877}(\mathbf {x})\Big ). \end{aligned}$$

Moreover, in this example \(\mathbf Y\) is a set of two i.i.d. uniform random variables in the range \([-1,1]\), again independent of \(\mathbf Z\). We also make the following choice for the quantity of interest \(g\)

$$\begin{aligned} g = 100 \left( 2 \pi \sigma ^2 \right) ^\frac{-3}{2} \int _\mathcal {D} \exp \left( - \frac{ \Vert \mathbf {x} - \mathbf {x}_0 \Vert ^2_2}{2 \sigma ^2} \right) u(\mathbf {x}) d\mathbf {x}, \end{aligned}$$

and select the parameters \(a_0=1, f_0=1, \widehat{f}=1,\Lambda = 0.2, K=10, \sigma ^2=0.01194691\) and \({\mathbf {x}_0 = \left[ 0.62482261,0.45530923,0.49862328\right] }\). We use the same method as in Ex.1 to compute the reference solution, except that we use Legendre quadrature in the stochastic collocation method instead of Hermite quadrature. The computed reference solution \({\hbox {E}\left[ g\right] }\) in this case is \(2.3627\) with an error estimate of \(10^{-4}\).

5.1.3 Ex.3

The third example is a one-dimensional geometric Brownian motion based on Example 2.2. We make the following choices:

$$\begin{aligned} u(0)&=1, T =1, \\ a(t,u)&= 0.05 u,\\ b(t,u)&= 0.2 u, \\ g(u)&= 10 \max (u(1)-1,0). \end{aligned}$$

The exact solution can be computed using a standard change of variables and Itô’s formula. For the selected parameters, the solution is \({\hbox {E}\left[ g\right] } = 1.04505835721856\).

5.2 Implementation and Runs

All the algorithms mentioned in this work were implemented using the C programming language, with the goal of making the software as efficient as possible while maintaining generality.

For implementing the solver for the PDE test problems (Ex.1 and Ex.2), we use PetIGA [8, 9]. While the primary intent of this framework is to provide high-performance B-spline-based finite element discretizations, it is also useful for applications where the domain is topologically square and subject to uniform refinements. As its name suggests, PetIGA is designed to tightly couple to PETSc [4]. The framework can be thought of as an extension of the PETSc library, which provides methods for assembling matrices and vectors related to the discretization of integral equations.

In our PDE numerical tests (Ex.1 and Ex.2), we use a standard trilinear basis to discretize the weak form of the model problem, integrating with eight quadrature points. We also generate results for two linear solvers that PETSc provides an interface to. The first solver is an iterative GMRES solver that solves a linear system in almost linear time with respect to the number of degrees of freedom for the mesh sizes of interest; in other words, \(\tilde{\gamma }=1\) in this case. The second solver we tried is a direct one, called MUMPS [1, 2]. For the mesh sizes of interest, the running time of MUMPS varies from quadratic to linear in the total number of degrees of freedom. The best fit turns out to be \(\tilde{\gamma }=1.5\) in this case.

From [30, Theorem 2.3], the complexity rate for all the examples is expected to be \({{\mathcal {O}({\mathrm{{TOL}}}^{-s_1} \log (\mathrm{{TOL}})^{s_2})} }\), where \({s_1}\) and \({s_2}\) depend on \(q_1, q_2\) and \(\gamma = d\tilde{\gamma }\). These and other problem parameters are summarized in Table 2 for the different examples.

Table 2 Summary of problem parameters

We run each algorithm \(100\) times for each tolerance and show, in the plots in the next section, the medians with vertical bars spanning from the \(5\)th to the \(95\)th percentile. Finally, all results were generated on the same machine with \(52\) gigabytes of memory to ensure that no overhead is introduced due to hard disk access during swapping, which could occur when solving the three-dimensional PDEs with a fine mesh.

In order to compare CMLMC to SMLMC, and since the latter does not include a step to fit \(q_1\) and \(q_2\), we assume that these parameters are both known as discussed in Examples 2.1 and 2.2. Moreover, we use the parameters listed in Table 3.

Table 3 Summary of parameter values used in the numerical tests

5.3 Results

Figure 1 shows that the running time of CMLMC follows the expected complexity rates \({{\mathcal {O}(\mathrm{{TOL}}^{-s_1} \log (\mathrm{{TOL}})^{s_2})} }\) summarized in Table 2. Notice that the running time in this and all figures that we present in this work includes the time necessary to sample the underlying stochastic solution and the time to do the necessary computations to estimate the problem parameters. However, the computational complexity of calculating the problem parameters is largely dominated by the computational complexity of sampling the approximate solution to the differential equations at hand. Indeed, the computations described in Sect. 4.2 are inexpensive post-processing calculations of these samples. Moreover, our results show that the algorithm has the same complexity as the theoretical work (cf. [30, Theorem 2.3]) of the last iteration, where we effectively solve the problem with the required tolerance. Next, Fig. 2 shows the number of levels, \(L\), in the last iteration of CMLMC for different tolerances. As expected, even though \(L\) depends on the particular realization, it is well approximated by a linear function of \(\log (\mathrm{{TOL}}^{-1}).\)

Fig. 1

From top: Ex.1, Ex.2, Ex.3. These plots show the total running time of CMLMC. The reference dashed lines are \({\mathcal {O}(\mathrm{{TOL}}^{-s_1}\log (\mathrm{{TOL}})^{s_2})} \) as summarized in Table 2. Notice that, asymptotically, the total running times seem to follow the expected rates. This shows that the algorithm, in our examples, has the same complexity as the theoretical work (cf. [30, Theorem 2.3]) of the last iteration, where we effectively solve the problem with the required tolerance

Fig. 2

From top Ex.1, Ex.2, Ex.3. These plots show the number of levels, \(L\), for different tolerances, as produced in the last iteration of CMLMC. Here, it is clear that \(L\) depends on the particular realization. However, the relation between \(L\) and \(\log (\mathrm{{TOL}}^{-1})\) looks reasonably linear, as expected. Note that in Ex.2, \(L\) does not exhibit significant variations. This is because, for the tolerances considered, \(L=3\) already satisfies the bias constraint

Next, Fig. 3 shows the computational errors of CMLMC that were computed using the reference solutions as listed in Sect. 5.1. This indicates that the imposed accuracy is achieved with the required confidence of 95 %—since \(C_\alpha =2\). Compare this figure to Fig. 4 which shows the computational errors of SMLMC. One can see that, in certain cases, SMLMC solves the problem for a smaller tolerance than the imposed \(\mathrm{{TOL}}\). This is because \(\theta \) is fixed and the statistical error is not relaxed when the bias is small. This can be especially seen in Ex.2 where the choice \(h_0=1/8\) produces a bias much smaller than \(0.5\mathrm{{TOL}}\) for the shown tolerances. On the other hand, Fig. 5 is a QQ-plot showing that the empirical cumulative distribution function (CDF) of the MLMC error estimates is well approximated by the standard normal CDF, even for finite tolerances.

Fig. 3

From top Ex.1, Ex.2, Ex.3. Actual computational errors based on the reference solutions when using CMLMC. The numbers above the dashed line show the percentage of runs that had errors larger than the required tolerance. We observe that these results are consistent with the imposed error constraints with a 95 % confidence

Fig. 4

From top Ex.1, Ex.2, Ex.3. Actual computational errors based on the reference solutions when using SMLMC. The numbers above the dashed line show the percentage of runs that had errors larger than the required tolerance. We observe that these results are consistent with the imposed error constraints with a 95 % confidence. However, for particular tolerances, the error is smaller than \(\mathrm{{TOL}}\) because the statistical error is not relaxed when the bias is small, since the tolerance splitting parameter, \(\theta \), is kept fixed for all tolerances

Fig. 5

From top Ex.1, Ex.2, Ex.3. Normalized empirical cumulative distribution function (CDF) of MLMC error estimates for different tolerances versus the standard normal CDF. Notice that, even for finite tolerances, the standard normal CDF is a good approximation of the CDF of the error estimates

Figure 6 shows a comparison of the running time of CMLMC and SMLMC. Notice that a good value of \(\widetilde{M}\) in SMLMC is not known a priori and the computational time varies considerably for different values of \(\widetilde{M}\), especially for smaller tolerances in Ex.1 and Ex.2. Specifically, a larger \(\widetilde{M}\) in SMLMC increases the computational time of the algorithm, but also decreases its variability. A smaller \(\widetilde{M}\) gives a smaller computational time at the expense of increased variation. This variation of the running time is caused by inaccurate estimates of \(V_\ell \) resulting from the smaller number of initial samples. On the other hand, the running time of CMLMC varies less, which reflects the stability of its estimates of \(V_\ell \). The computational savings of CMLMC over SMLMC are an aggregate effect of the different improvements. These include (1) more stable variance and bias estimates, as already discussed, and (2) a better splitting of bias and statistical tolerances. This second point can be seen in Fig. 7, which shows the tolerance splitting parameter, \(\theta \), used in CMLMC as computed by (4.2). We can clearly see here that \(\theta \) is not trivial and changes with the tolerance. Looking closely, one can notice sudden jumps in the values of \(\theta \) due to changes in the discrete number of levels, \(L\). Between jumps, \(\theta \) changes continuously due to inaccuracies in the estimation of the weak error constant, \(Q_W\). Specifically, notice that for \(\mathrm{{TOL}}\approx 0.015\) in Ex.1 when using the direct solver, the splitting parameter \(\theta \) used in CMLMC is very close to \(0.5\), which explains why, for this case, the computational time of SMLMC is very close to the computational time of CMLMC as shown in Fig. 6.

Fig. 6

From top Ex.1, Ex.2, Ex.3. The running time of CMLMC and SMLMC for different \(\widetilde{M}\) and \(\theta \), normalized by the median running time of CMLMC. This plot shows that a larger \(\widetilde{M}\) increases the median running time of SMLMC but also decreases its variability. One sees that CMLMC outperforms SMLMC even for a small \(\widetilde{M}\) in all numerical examples. Note that for Ex.1 using the direct method and for \(\mathrm{{TOL}}\approx 0.015\), all algorithms perform similarly. This is because, for this example, the optimal error splitting parameter, \(\theta \), according to (4.2) is approximately 0.5

Fig. 7

From top Ex.1, Ex.2, Ex.3. The error splitting, \(\theta \), as computed in (4.2) and used in CMLMC, versus \(\mathrm{{TOL}}\). Observe that the behavior of \(\theta \) is non-trivial and can be far from \(\frac{1}{2}\)

Finally, the bias of the MLMC estimator when reusing samples generated in previous iterations to compute the quantity of interest is not well understood. Using CMLMC, generating new samples at each iteration, instead of reusing samples from previous iterations, does not add a significant overhead to the total running time of the algorithm. Figure 8 illustrates this point by comparing the running times with and without sample reuse, for both CMLMC and SMLMC. This figure shows that CMLMC yields computational savings over SMLMC whether or not samples are reused in the former, mainly due to the better splitting of the tolerance between bias and statistical errors. Moreover, it shows that reusing samples in CMLMC does not offer computational savings significant enough to justify the increased complexity in the analysis of the resulting estimator.

Fig. 8

From top Ex.1, Ex.2, Ex.3. Running time of CMLMC versus SMLMC when reusing samples in both. Also included is CMLMC without reusing samples from previous iterations. All running times are normalized by the median of the running time of CMLMC when reusing samples. Notice that reusing samples in CMLMC does not add a significant advantage. Moreover, CMLMC still produces savings over SMLMC, even when reusing samples in the latter

6 Conclusions

We have proposed a novel Continuation Multilevel Monte Carlo (CMLMC) algorithm for weak approximation of stochastic models. Our algorithm uses discretization hierarchies that are defined a priori for each level and are geometrically refined across levels. These hierarchies are either uniform at each level or obtained by regular subdivision of a non-uniform mesh.

The actual choice of computational work across levels uses the optimal number of samples per level given the variance and the work contribution from each level. Accurate computation of these relevant quantities is based on parametric models.

These parameters are calibrated using approximate samples, produced either before running CMLMC or during the actual runs. We also propose a novel Bayesian estimation of the variance and weak convergence model parameters, taking particular notice of the deepest levels of the discretization hierarchy, where only a few realizations are available to produce the required estimates. The idea is to use results from coarser levels, where more samples are available, to stabilize the estimates in the deeper levels. The resulting MLMC estimator exhibits a non-trivial splitting between bias and statistical contributions. Indeed, the actual split depends on the given accuracy and other problem parameters. In fact, as the numerical examples show, there are cases where most of the accuracy budget is devoted to the statistical error. Finally, using the Lindeberg–Feller theorem, we also show the asymptotic normality of the statistical error in the MLMC estimator and justify in this way our error estimate, which allows prescribing both the required accuracy and the confidence in the final result.

We presented three numerical examples to substantiate the above results, exhibiting the robustness of the new CMLMC algorithm and demonstrating its computational savings. The examples are described in terms of differential equations either driven by random measures or with random coefficients.

Other aspects of MLMC estimators can also be explored, such as the optimality of geometric hierarchies compared to non-geometric ones. This will be the subject of a forthcoming work, where extensions of the CMLMC to that setting will be considered.