1 Introduction

Ordinary differential equations (ODEs) are arguably the most commonly used mathematical tool for describing physical and biological processes in the real world. Popular examples include the Lotka-Volterra equation (Alligood et al. 1997), the susceptible-infected-recovered (SIR) model (Kermack and McKendrick 1927) and the continuously stirred tank reactor (CSTR) model (Schmidt 2005). The Lotka-Volterra equation describes the dynamics of predator-prey systems. The SIR model is an ODE model for disease epidemics describing the relation among the numbers of susceptible, infected and recovered individuals in a closed population. The CSTR model describes how the temperature of an object changes at a rate proportional to the difference between its temperature and that of its surroundings. These are just a few examples of ODEs.

An ODE model is a nonlinear regression model whose regression function is expressed as the solution of an ODE. It captures the statistical situation of most applications, in which the parameters of an ODE need to be estimated from noisy data. Statistical inference for ODE models, however, poses computational challenges, mainly due to the lack of analytical solutions for most ODEs.

Bard (1974) suggested minimizing an objective function, a suitable measure of lack of fit, in which the solution of the ODE is approximated by numerical integration. The minimization is carried out by a gradient-based method, but the resulting estimates often diverge, get stuck at a local minimizer, or are sensitive to initial values (Cao et al. 2011).

Varah (1982) proposed an estimation method with the following two steps: in the first step, the regression function is expressed by cubic splines with fixed knots and estimated from the data by the least squares method; in the second step, the parameters of the ODE are estimated by minimizing a distance measure between the ODE and the regression function estimated in the first step. Ramsay and Silverman (2005) introduced a two-step iteration method in which the first step of Varah is modified to a penalized least squares method, where a roughness penalty term is introduced to measure the difference between the ODE and the estimated mean function.

The parameter cascading method was proposed in Ramsay et al. (2007). In the parameter cascading approach, parameters are grouped into (1) regularization parameters, (2) structural parameters in the ODE and (3) regression coefficients in the basis expansion of the regression function. Parameters in each of the three groups are estimated in sequence. First, the regression coefficients are estimated given the structural and regularization parameters, then the structural parameters are estimated given the regularization parameters, and finally the regularization parameters are estimated by minimizing a penalized least squares criterion.

Gelman et al. (1996) proposed a Bayesian computational method for the inference of pharmacokinetic models. Huang et al. (2006) suggested a hierarchical Bayesian procedure for the estimation of parameters in a longitudinal HIV dynamic system. As it turns out, Bayesian computational schemes for ODE models using Markov chain Monte Carlo type procedures, as in these two papers, result in even bigger challenges. Each time the parameters are sampled from a candidate distribution, numerical integration of the ODE needs to be invoked to evaluate the full likelihood. Campbell (2007) adopted the collocation method to obtain an approximation to the regression function expressed by a differential equation as in Ramsay et al. (2007). The collocation method was subsequently combined with parallel tempering (Geyer 1992) to overcome instability of the posterior surface. Incorporating tempering overcomes the instabilities but slows down computation significantly. Recently, Gaussian processes (GP) have been used to avoid the heavy computation of numerical integration. Dondelinger et al. (2013) introduced the adaptive gradient matching (AGM) approach, which has a link to numerical integration but without the corresponding high computational cost. Wang and Barber (2014) introduced the Gaussian process-ODE (GP-ODE) approach, which provides a generative model and a simpler graphical model than the AGM approach. As Macdonald et al. (2015) pointed out, however, the GP-ODE approach relies on an approximation to obtain a generative and proper graphical model.

In this paper, to speed up the Bayesian computations, we propose a Laplace approximated procedure (LAP) for posterior inference in differential equation models. The marginal posterior density of the ODE parameter is computed by the Laplace approximation (LA), in which the regression function is approximated by a one-step numerical solver of ordinary differential equations. We use the Euler and the fourth order Runge-Kutta procedures for illustrations. Finally, posterior inference is carried out by grid sampling or griddy Gibbs sampling from the marginal posterior of the ODE parameters depending on its dimension.

The proposed method has the following advantages. First, for an ODE model with parameter dimension less than or equal to four, the posterior computations utilize a Monte Carlo method (not a Markov chain Monte Carlo method) based on independent sampling; thus, its posterior sampling is significantly faster than methods requiring full Bayesian computations. Even for moderate parameter dimensions, the LAP runs and produces results within an acceptable computational time frame.

The second advantage is that the LAP produces more accurate parameter estimates than the other existing methods. In a simulation study, we compared the LAP with the parameter cascading method (Ramsay et al. 2007), the delayed rejection adaptive Metropolis algorithm (Haario et al. 2006), the GP-ODE approach (Wang and Barber 2014) and the AGM approach (Dondelinger et al. 2013). In the FitzHugh-Nagumo model, where the regression function changes more rapidly, the LAP estimator outperforms the delayed rejection adaptive Metropolis (DRAM), GP-ODE and AGM approaches in terms of the root mean squared error (rmse) and the log-likelihood at the parameter estimates. The performance of the LAP is comparable to that of the parameter cascading (PC) method in the same sense. The latter criterion judges whether the chosen procedure achieves a parameter estimate close to the maximum likelihood estimate by ascertaining the corresponding log-likelihood value.

Third, inference based on the LAP is numerically stable. Frequentist methods need to maximize a log-likelihood surface which has many ripples, so, depending on the starting points, optimization algorithms can be trapped in local maxima. However, in many examples, the ripples of the log-likelihood surface occur at the periphery of the parameter space and disappear from the likelihood surface when the sample size n becomes large.

The rest of the paper is organized as follows. In Sect. 2, we lay out the inference framework of the differential equation models and the priors considered in this paper. The proposed posterior computations are described in Sect. 3. In Sect. 4, we prove that the approximated posterior converges to the true posterior under certain regularity conditions. In Sect. 5, using simulated data sets from three models, we examine the quality of the LA based posterior. In the examples we considered, inference based on the LAP generates stable and accurate approximations of the true posterior. We apply the LAP to a real data set, the U.S. Census data, in Sect. 6. Discussions are presented in Sect. 7, whereas details of computations and technical results are relegated to the Appendix.

2 Regression model defined by ODE

We consider the regression model

$$\begin{aligned} y(t)= & {} x(t) + \varepsilon (t), \end{aligned}$$

where y(t) is a p-dimensional vector of observations at time \(t \in [T_0,T_1], 0 \le T_0 < T_1 < \infty \) and \(\varepsilon (t)\) represents an error term assumed to arise from \(N_p(0, \sigma ^2 I_p)\) with \(\sigma ^2 > 0\), where \(N_p(\mu , \varSigma )\) denotes the p-dimensional normal distribution with mean \(\mu \) and covariance matrix \(\varSigma \). The regression function, x(t), of the regression model is defined as the solution of a differential equation

$$\begin{aligned} {\dot{x}}(t) = f(x, u, t ; \theta ), \,\, t \in [T_0, T_1], \end{aligned}$$
(1)

where f is a p-dimensional smooth function of x(t), known input function u(t), time t, and the unknown parameter \(\theta \in \varTheta \subseteq R^{q}\) with \(q \ge 1\); \({\dot{x}}(t)\) denotes the first derivative of x(t) with respect to time t. The function x is determined by the initial value of x, \(x(T_0)\), \(\theta \) and the function \(u(\cdot )\). The unknown parameter \(\theta \) needs to be estimated from observed data on y(t)s and u(t)s which are given at certain pre-specified time points.

We assume that observed data is collected at the time points \(T_0 \le t_1 < t_2 < \ldots < t_n \le T_1\). Letting \(y_i = y(t_i)\), \(x_i = x(t_i)\) and \(\varepsilon _i = \varepsilon (t_i)\), we have the following regression model

$$\begin{aligned} y_i = x_i + \varepsilon _i, \,\, i=1,2,\ldots , n, \end{aligned}$$
(2)

where \(\varepsilon _i\) are drawn independently from \(N_p(0, \sigma ^2 I_p)\).

The value of each \(x_i\), \(i=1,2,\ldots ,n\), is determined by the initial value \(x_1\), \(\theta \) and \(u(\cdot )\) based on the differential equation model (1). When we need to emphasize this dependence, we will denote \(x_i\) by \(x_i \equiv x_i(\theta , x_1, u)\) or \(x_i(\theta , x_1)\) if x is not dependent on u. For simplicity of exposition, the input function u(t) is not considered further in the rest of the paper, but analysis based on a known input function can be easily accommodated into our inference framework.

The differential equation (1) involves only the first order derivatives, but can be used to describe those with higher order derivatives. For example, consider a second order equation \({\ddot{x}}(t) = f({\dot{x}}, x , t ; \theta )\). By introducing \(z(t) = {\dot{x}}(t)\), the differential equation model can be expressed as

$$\begin{aligned} {\dot{X}}(t) \equiv \left( {\begin{array}{c}{\dot{x}}(t)\\ {\dot{z}}(t)\end{array}}\right) = \left( {\begin{array}{c}z(t)\\ f(z,x,t ; \theta )\end{array}}\right) \equiv F(X,t ; \theta ) \end{aligned}$$

where \(X(t) \equiv (x(t)^T, z(t)^T)^T\) is now a vector with an added component for the dynamics of z(t). Since any higher order differential equation model can be converted into a first order differential equation model by adding extra dynamical systems and variables, without loss of generality, we consider only first order differential equation models for developing our inference procedures in the remainder of this paper.
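For instance (a purely illustrative case, not one of the models analyzed later), the damped oscillator \({\ddot{x}}(t) = -\theta _1 {\dot{x}}(t) - \theta _2 x(t)\) becomes, with \(z(t) = {\dot{x}}(t)\),

$$\begin{aligned} {\dot{X}}(t) = \left( {\begin{array}{c}{\dot{x}}(t)\\ {\dot{z}}(t)\end{array}}\right) = \left( {\begin{array}{c}z(t)\\ -\theta _1 z(t) - \theta _2 x(t)\end{array}}\right) , \end{aligned}$$

a first order system with \(p=2\).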

In the model (1) and (2), there are three unknowns, \(x_1\), \(\theta \) and \(\sigma ^2\), whose priors are denoted by \(\pi (x_1 \mid \sigma ^2)\), \(\pi (\theta )\) and \(\pi (\sigma ^2)\) (or \(\pi (\tau ^2)\) with \(\tau ^2 = 1/\sigma ^2\)), respectively. In the following, we will take the following specific priors for \(\tau ^2\) and \(x_1\):

$$\begin{aligned}&\tau ^2 \sim Gamma(a,b)\end{aligned}$$
(3)
$$\begin{aligned}&x_1 \mid \tau ^2 \sim N_p(\mu _{x_1}, c \tau ^{-2} I_p), \end{aligned}$$
(4)

where \( c > 0\) and Gamma(a, b) is the gamma distribution with parameters \(a, b >0\) and mean a / b. The prior selection for \((\tau ^{2},x_1)\) is guided by conjugacy considerations which enable components of the posterior to be integrated in closed form. One may select other types of priors for \((\tau ^{2},x_1)\). However, for large sample sizes, like the ones considered in this paper, the impact of these priors will be minimal since most of the inference will be driven by the likelihood component of the posterior.

3 Posterior computation

3.1 Posterior of \(\theta \), \(\tau ^2\) and \(x_1\)

The full joint posterior of \(\theta \), \(x_1\) and \(\tau ^2\) given the observations \(\mathbf{y}_n = (y_1, y_2, \ldots , y_n)^T\) has the expression \(\pi (\theta , \tau ^2, x_1 \mid \mathbf{y}_n)\)

$$\begin{aligned}&\propto p(\mathbf{y}_n \mid \theta , \tau ^2, x_1) \pi (x_1 \mid \tau ^2)\pi (\tau ^2) \pi (\theta )\\&\quad = \left[ \prod _{i=1}^n det(\tau ^{-2} 2\pi I_p)^{-1/2} e^{-\frac{\tau ^2}{2} \Vert y_i - x_i(\theta , x_1) \Vert ^2}\right] \\&\qquad \times \,det(2\pi c \tau ^{-2} I_p)^{-1/2} e^{-\frac{\tau ^2}{2c} \Vert x_1 - \mu _{x_1}\Vert ^2}\\&\qquad \times \,\frac{b^a}{\varGamma (a)} (\tau ^{2})^{a-1} e^{-b\tau ^2} \times \pi (\theta )\\&\propto (\tau ^2)^{\frac{1}{2} (n+1)p +a -1} \times e^{-\frac{\tau ^2}{2} \Big (ng_n(x_1,\theta )+ \frac{\Vert x_1 - \mu _{x_1}\Vert ^2}{c} + 2b\Big )} \pi (\theta ), \end{aligned}$$

where \(g_n(x_1) = g_n(x_1,\theta ) = \sum _{i=1}^n \Vert y_i - x_i(\theta ,x_1)\Vert ^2 /n\) and \(\Vert x\Vert \) denotes the Euclidean norm of the vector x, and \(\pi (\theta )\) is any prior on \(\theta \). The choice of \(\pi (\theta )\) can be arbitrary as it does not affect the inference on \(\theta \) for large sample sizes n as is well known.

In most cases, \(\theta \) and \(\tau ^2\) are the parameters of primary interest whereas \(x_1\) is the nuisance parameter. The details of obtaining the posterior distributions of \(\theta \) and \(\tau ^{2}\) are outlined as follows: In the first step, the posterior of \(\theta \) and \(\tau ^2\), \(\pi (\theta , \tau ^2 \mid \mathbf{y}_n)\), is obtained by marginalizing (i.e., integrating out) \(x_1\). In this marginalization step, two approximations are implemented: (i) a one-step numerical method for calculating each \(x_i\), \(i=1,2,\ldots ,n\), and (ii) the Laplace method for integrating out \(x_1\). In the next step, as a consequence of conjugacy, it can be shown that the posterior of \(\tau ^2\) given \(\theta \) and \(\mathbf{y}_n\) follows a gamma distribution, which has the advantage that it can be easily and directly sampled from. In the third step, after marginalizing \(\tau ^{2}\), \(\theta \) is sampled from its posterior distribution, \(\pi (\theta \mid \mathbf{y}_n)\), using either grid sampling or griddy Gibbs sampling depending on its dimension q. By eliminating \(x_1\) and \(\tau ^{2}\) from the full posterior in the first two stages above, we reduce the dimension of the posterior from \(p+q+1\) to q, making it easier for thorough exploration of its surface using grid based sampling as in the third stage.

3.2 Marginalization of \(x_1\): joint posterior of \(\theta \) and \(\tau ^{2}\)

In the marginalization of \(x_1\), we use two approximations. In the first approximation, \(x_i(\theta , x_1)\) is successively approximated by a numerical procedure:

$$\begin{aligned} x_i \approx x_{i-1} + (t_i - t_{i-1})\phi (x_{i-1}, t_{i-1} ; \theta ),~~ i=2,\ldots , n, \end{aligned}$$

where different forms of \(\phi \) represent different numerical solvers of differential equation. For example, the Euler method is represented by

$$\begin{aligned} \phi (x_{i-1}, t_{i-1} ; \theta ) = f(x_{i-1}, t_{i-1} ; \theta ) ; \end{aligned}$$

while the 4th order Runge-Kutta is represented by

$$\begin{aligned} \phi (x_{i-1}, t_{i-1} ; \theta )= & {} {1 \over 6}(k_{i-1,1} + 2k_{i-1,2} + 2k_{i-1,3}\nonumber \\&+ \, k_{i-1,4}), \end{aligned}$$
(5)

where

$$\begin{aligned} k_{i-1,1}= & {} f(x_{i-1}, t_{i-1} ; \theta ),\\ k_{i-1,2}= & {} f\left( x_{i-1} +{1 \over 2}(t_i - t_{i-1})k_{i-1,1} , t_{i-1} +{1\over 2}(t_i - t_{i-1}) ; \theta \right) ,\\ k_{i-1,3}= & {} f\left( x_{i-1} +{1\over 2}(t_i - t_{i-1})k_{i-1,2},t_{i-1} +{1\over 2}(t_i - t_{i-1}) ; \theta \right) ,\\ k_{i-1,4}= & {} f(x_{i-1} + (t_i - t_{i-1})k_{i-1,3} , t_{i} ; \theta ). \end{aligned}$$
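A minimal sketch of this one-step recursion (illustrative only; the function and variable names are ours, not from any existing package; f(x, t, theta) is the user-supplied right-hand side of (1)):

```python
import numpy as np

def phi_euler(f, x, t, h, theta):
    """Euler increment: phi = f(x, t; theta)."""
    return f(x, t, theta)

def phi_rk4(f, x, t, h, theta):
    """4th order Runge-Kutta increment, as in (5)."""
    k1 = f(x, t, theta)
    k2 = f(x + 0.5 * h * k1, t + 0.5 * h, theta)
    k3 = f(x + 0.5 * h * k2, t + 0.5 * h, theta)
    k4 = f(x + h * k3, t + h, theta)
    return (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0

def solve_states(f, x1, times, theta, phi=phi_rk4, m=1):
    """Approximate x(t_1), ..., x(t_n) from the initial state x1 by the
    recursion x_i = x_{i-1} + (t_i - t_{i-1}) * phi(x_{i-1}, t_{i-1}; theta),
    splitting each interval into m equal sub-steps (see Sect. 3.4)."""
    x = np.asarray(x1, dtype=float)
    states = [x]
    for i in range(1, len(times)):
        h = (times[i] - times[i - 1]) / m
        t = times[i - 1]
        for _ in range(m):
            x = x + h * phi(f, x, t, h, theta)
            t = t + h
        states.append(x)
    return np.array(states)
```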

Let \(h = \max _{2 \le i \le n} (t_i - t_{i-1})\) and \(x^h\) be the approximation of x. The global error of the numerical method is defined by

$$\begin{aligned} \sup _{t\in [T_0,T_1]} \Vert x(t) - x^h(t)\Vert . \end{aligned}$$

If the global error is \(O(h^K)\) for some integer K, we call K the order of the numerical method. Under some smoothness conditions, the order of the 4th order Runge-Kutta procedure (given in (5)) is \(K=4\) (Mathews and Fink 2004; Süli 2014).

In the second approximation, we integrate out \(x_1\) based on its prior \(\pi (\,x_1\,|\,\tau ^{2})\) defined in (4) and the full likelihood, using the Laplace approximation for the corresponding integral. Using results from Tierney and Kadane (1986) and Azevedo-Filho and Shachter (1994), the marginal likelihood of \(\theta \) and \(\tau ^2\) can be approximated by

$$\begin{aligned} L(\theta , \tau ^2)= & {} \int \pi (x_1 \mid \tau ^2) L(\theta , \tau ^2, x_1) dx_1\\&\propto \int (\tau ^2)^{(n+1)p/2} e^{- \frac{\tau ^2}{2} \Big (\,n g_n(x_1) + \frac{\Vert x_1 - \mu _{x_1} \Vert ^2}{c}\,\Big )} dx_1\\&\propto (\tau ^2)^{(n+1)p/2}\, e^{-\frac{\tau ^2}{2} \,u(\theta )}\,\, det\Big (\,n {\ddot{g}}_n({\hat{x}}_1) + \frac{2}{c} I_p \Big )^{-1/2}\\&\times \,(\tau ^2)^{-p/2}\Big (1+O(n^{-3/2})\Big )\\= & {} (\tau ^2)^{np/2}\, e^{-\frac{\tau ^2}{2} \,u(\theta ) -\frac{1}{2}\, v(\theta )} \Big (1+O(n^{-3/2})\Big ), \end{aligned}$$

where

$$\begin{aligned}&{\ddot{g}}_n(x_1) = \frac{\partial ^2\,g_n(x_1,\theta )}{\partial x_1^2},\,\,{\hat{x}}_1 \equiv {\hat{x}}_1(\theta )\,\text{ is } \text{ given } \text{ by }\\&{\hat{x}}_1(\theta ) = \underset{x_1}{{{\mathrm{argmin}}}} \Big (ng_n(x_1, \theta ) + \frac{\Vert x_1 - \mu _{x_1} \Vert ^2}{c} \Big ),\\&u(\theta ) = n g_n({\hat{x}}_1) + \frac{\Vert {\hat{x}}_1 - \mu _{x_1}\Vert ^2}{c}, \text{ and }\\&v(\theta ) = \log \, det\Big (n {\ddot{g}}_n({\hat{x}}_1) + \frac{2}{c}I_p\Big ). \end{aligned}$$

It follows from the last expression for \(L(\theta ,\tau ^{2})\) that the approximate posterior of \(\theta \) and \(\tau ^2\) given \(\mathbf{y}_n\), based on independent priors \(\pi (\theta )\) and Gamma(a, b) on \(\theta \) and \(\tau ^{2}\), respectively, is given by

$$\begin{aligned} \pi (\theta , \tau ^2 \mid \mathbf{y}_n)\propto & {} \pi (\theta ) \times (\tau ^2)^{\frac{np}{2} +a - 1} e^{-\tau ^2 \Big (\frac{1}{2}\, u(\theta )+b\Big )}\nonumber \\&\times \,\, det\Big (\, n {\ddot{g}}_n({\hat{x}}_1) + \frac{2}{c}I_p\,\Big )^{-1/2}. \end{aligned}$$
(6)

Details for the computation of \({\ddot{g}}_n(x_1)\) are given in the Appendix. We used the gradient descent and Newton-Raphson procedures for obtaining the minimizer \({\hat{x}}_1(\theta )\) in our examples.
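A minimal sketch of this step, building on the solve_states() sketch above (the names are ours, and a quasi-Newton optimizer from SciPy stands in for the gradient descent / Newton-Raphson iterations; a sketch, not the authors' implementation):

```python
import numpy as np
from scipy.optimize import minimize

def u_and_x1hat(theta, f, y, times, mu_x1, c, m=1):
    """Return (u(theta), x1_hat(theta)) by numerically minimizing
    n*g_n(x1, theta) + ||x1 - mu_x1||^2 / c over the initial state x1.
    y is an n x p array of observations."""
    def objective(x1):
        x = solve_states(f, x1, times, theta, m=m)   # x_i(theta, x1), i = 1, ..., n
        return np.sum((y - x) ** 2) + np.sum((x1 - mu_x1) ** 2) / c

    res = minimize(objective, x0=np.asarray(mu_x1, dtype=float), method="BFGS")
    return res.fun, res.x                            # res.fun equals u(theta)
```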

3.3 Marginalization of \(\tau ^{2}\): posterior of \(\theta \)

We note from equation (6) that the posterior of \(\tau ^2\) given \(\theta \) and \(\mathbf{y}_{n}\) is proportional to

$$\begin{aligned} \pi (\tau ^2 \mid \theta ,\mathbf{y}_n) \propto (\tau ^2)^{\frac{np}{2} + a - 1} e^{-\tau ^2 \Big (\frac{1}{2}\, u(\theta )+b\,\Big )}; \end{aligned}$$

thus, the conditional posterior distribution of \(\tau ^{2}\) given \(\theta \) and \(\mathbf{y}_{n}\) is given by

$$\begin{aligned} \tau ^2 \mid \theta ,\mathbf{y}_n \sim Gamma(a^*, b^*), \end{aligned}$$

where \(a^* = np/2 +a\) and \(b^* = u(\theta )/2 +b\). Now, by integrating out \(\tau ^2\) from the product of \(L(\theta , \tau ^2)\) and the prior of \(\tau ^2\), we get the marginal likelihood of \(\theta \) given by

$$\begin{aligned} L(\theta )\propto & {} \int (\tau ^2)^{\frac{np}{2} +a - 1} \, e^{-\tau ^2 \Big (\frac{1}{2} u(\theta )+b\Big )} d\tau ^2\\&\times \,\, det\,\Big (n {\ddot{g}}_n({\hat{x}}_1) + \frac{2}{c}I_p\,\Big )^{-1/2}\\= & {} \frac{\varGamma (\frac{np}{2} +a)}{\Big (\frac{1}{2} u(\theta )+b\Big )^{\frac{np}{2} +a}} \,det\,\Big (n {\ddot{g}}_n({\hat{x}}_1) + \frac{2}{c}I_p\,\Big )^{-1/2}. \end{aligned}$$

Consequently, the posterior of \(\theta \) is

$$\begin{aligned} \pi (\theta \mid \mathbf{y}_n) ~\propto ~ \frac{\pi (\theta )}{\Big (\frac{1}{2}\, u(\theta )+b\Big )^{\frac{np}{2} + a} \,det\,\Big (n {\ddot{g}}_n({\hat{x}}_1) + \frac{2}{c}I_p\Big )^{1/2}}.\nonumber \\ \end{aligned}$$
(7)
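A sketch of how (7) can be evaluated on the log scale, again building on the solve_states() and u_and_x1hat() sketches above; here the Hessian \({\ddot{g}}_n({\hat{x}}_1)\) is approximated by finite differences instead of the analytic expressions in the Appendix (all names are ours):

```python
import numpy as np

def log_post_theta(theta, f, y, times, mu_x1, c, a, b, log_prior, m=1, eps=1e-5):
    """Unnormalized log marginal posterior of theta, Eq. (7); y is an n x p array."""
    n, p = y.shape
    u, x1_hat = u_and_x1hat(theta, f, y, times, mu_x1, c, m=m)

    def ngn(x1):                                     # n * g_n(x1, theta)
        x = solve_states(f, x1, times, theta, m=m)
        return np.sum((y - x) ** 2)

    # central finite-difference approximation of n * g_n'' at x1_hat
    H = np.zeros((p, p))
    for j in range(p):
        for k in range(p):
            ej, ek = eps * np.eye(p)[j], eps * np.eye(p)[k]
            H[j, k] = (ngn(x1_hat + ej + ek) - ngn(x1_hat + ej - ek)
                       - ngn(x1_hat - ej + ek) + ngn(x1_hat - ej - ek)) / (4.0 * eps ** 2)
    _, logdet = np.linalg.slogdet(H + (2.0 / c) * np.eye(p))
    return log_prior(theta) - (n * p / 2.0 + a) * np.log(u / 2.0 + b) - 0.5 * logdet
```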

To numerically approximate \(\pi (\theta \mid \mathbf{y}_n)\), we propose grid sampling or griddy Gibbs sampling depending on the dimension of \(\theta \). When the dimension q of \(\theta \) is not large (say, \(q \le 4\)), the grid sampling is conceptually simple and numerically fast. When q is relatively large, we recommend the griddy Gibbs sampling.

3.4 Posterior sampling of \(\theta \)

When \(q \le 4\), we recommend the grid sampling to sample \(\theta \) from the marginal posterior \(\pi (\theta \mid \mathbf{y}_n)\) of \(\theta \) in (7). Let \({\mathscr {G}}_{\varTheta } \subset \varTheta \) be a grid set that covers \(\varTheta \) and let \(\pi ^d(\theta \mid \mathbf{y}_n)\) be the discrete distribution with support \({\mathscr {G}}_{\varTheta }\) whose value at \(\theta \in {\mathscr {G}}_{\varTheta }\) is proportional to \(\pi (\theta \mid \mathbf{y}_n)\). We will sample \(\theta \) from \(\pi ^d(\theta \mid \mathbf{y}_n)\).

In practice, the choice of the grid set \({\mathscr {G}}_{\varTheta }\) can be a nontrivial task (Joshi and Wilson 2011). To choose a grid set, we adopt the reparametrization technique used by Rue et al. (2009). Let \(\theta ^{0}\) be the initial guess for the center of the grid set, and let \({\hat{\varSigma }} = H^{-1}\) where H is the negative Hessian matrix of \(\log \pi (\theta \mid \mathbf{y}_n)\) at \(\theta ^{0}\). If H is not a positive definite matrix, we replace the negative eigenvalues of H with its minimum positive eigenvalue. We express \(\theta \) with a standardized variable z by

$$\begin{aligned} \theta (z) = \theta ^0 + U D^{1/2} z \end{aligned}$$

where \({\hat{\varSigma }}\) is diagonalized as \({\hat{\varSigma }} = UD U^T\), \(U = (u_1,\ldots , u_q)\) and \(D = diag(\lambda _j)\); \(\lambda _j\) is the jth eigenvalue of \({\hat{\varSigma }}\), and \(u_j\) is the corresponding eigenvector, \(j=1,2, \ldots , q\). The grid points are selected in the parametrization of z. We recommend a two step approach for choosing the range of the grid points. In the first step, the grid points for the ith coordinate \(z_i\) are chosen by dividing \([-4, 4]\) into \(2M_1\) intervals of equal length, resulting in \(2M_1 + 1\) points. Note that \([-4, 4]\) comes from a rough normal approximation. At each of the \((2M_1+1)^q\) grid points, we evaluate \(\pi (\theta (z_i) | \mathbf{y}_n)\), \(i=1,2, \ldots , (2M_1+1)^q\). With these values, we determine the range \([A_i, B_i]\) of each coordinate \(z_i\), \(i=1,2, \ldots , q\). \(A_i\) and \(B_i\) are defined as the minimum and maximum of \(z_i\) with \(\pi (\theta (z_1, \ldots , z_i, \ldots , z_q) | \mathbf{y}_n)\) \(> \eta \), where \(\eta \) is a small number close to 0. In our examples, we used \(\eta = 10^{-5}\). If the interval \([-4, 4]\) is not big enough to contain the mass of the posterior, so that \(A_i\) and \(B_i\) cannot be selected, we perform the first step one more time with a larger interval than \([-4, 4]\). The larger interval can be obtained by approximating the marginal posterior with a normal density with a larger standard deviation.

After \([A_i, B_i]\) are chosen, we move to the second step and determine the grid points for accurate computation. The purpose of the first step is to determine the grid set, and \(M_1\) is chosen as a small positive integer such that \((2M_1 + 1)^q\) is not overwhelmingly large computationally. In our examples, we used \(M_1=5\).

In the second step, \([A_i, B_i]\) is divided into \(2M_2\) intervals of equal length. The discrete approximation of the posterior is constructed by evaluating the posterior at the \((2M_2 +1)^q\) grid points. Grid sampling is done by first sampling \(\theta ^{(i)}\) from the discrete approximation and then, conditionally on \(\theta ^{(i)}\), sampling \(\tau ^2\) from \(Gamma(np/2 +a, u(\theta ^{(i)})/2 + b)\). In our examples, we used \(M_2 = 15\) or 25. Note that the samples from this algorithm are independent samples. When q is not very large, the algorithm is very fast.

We summarize the algorithm below; a minimal code sketch follows the list.

1. Step 1 Reparameterization step. Compute the initial guesses of the center \(\theta ^0\) and of the posterior covariance \({\hat{\varSigma }}\). Reparametrize \(\theta \) using the standardized variable z by

   $$\begin{aligned} \theta (z) = \theta ^0 + U D^{1/2} z \end{aligned}$$

   where \({\hat{\varSigma }} = U D U^T\).

2. Step 2 Finding ranges of \(z_i\). For each \(z_i\), divide the interval \([-4, 4]\) into \(2M_1\) intervals of equal length. Let

   $$\begin{aligned} A_i= & {} \min \{ z_i : \pi (z_i \mid \mathbf{y}_n) \ge \eta \}\\ B_i= & {} \max \{ z_i : \pi (z_i \mid \mathbf{y}_n) \ge \eta \}. \end{aligned}$$

3. Step 3 Grid sampling. Divide the intervals \([A_i, B_i]\) into \(2M_2\) intervals of equal length and construct grid points.

   1. For each \(\theta \in {\mathscr {G}}_{\varTheta }\), calculate \(\pi (\theta |\mathbf{y}_n)\) using (7) and construct \(\pi ^d(\theta \mid \mathbf{y}_n)\).

   2. Sample \(\theta ^{(1)}, \theta ^{(2)}, \ldots , \theta ^{(N)} \mathop {\sim }\limits ^{iid} \pi ^d(\theta \mid \mathbf{y}_n)\).

   3. For each \(i=1,2,\ldots , N\), sample \({\tau ^2}^{(i)} \sim Gamma(np/2 +a, u(\theta ^{(i)})/2 + b)\).
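A minimal sketch of Steps 1 and 3 in Python (names, defaults and the random seed are ours; log_post denotes the unnormalized log of (7), e.g. the log_post_theta() sketch above):

```python
import numpy as np

def grid_sample(log_post, theta0, Sigma_hat, N, M2=15, A=None, B=None, seed=0):
    """Reparametrize theta = theta0 + U D^{1/2} z, evaluate the posterior on a
    grid over z, and draw N independent samples from the discretized posterior."""
    theta0 = np.asarray(theta0, dtype=float)
    q = len(theta0)
    lam, U = np.linalg.eigh(Sigma_hat)
    lam = np.where(lam > 0, lam, lam[lam > 0].min())   # repair a non-positive-definite Sigma_hat
    L = U @ np.diag(np.sqrt(lam))                      # theta(z) = theta0 + L z
    A = np.full(q, -4.0) if A is None else np.asarray(A)
    B = np.full(q, 4.0) if B is None else np.asarray(B)
    axes = [np.linspace(A[j], B[j], 2 * M2 + 1) for j in range(q)]
    Z = np.array(np.meshgrid(*axes, indexing="ij")).reshape(q, -1).T
    thetas = theta0 + Z @ L.T
    logp = np.array([log_post(th) for th in thetas])
    w = np.exp(logp - logp.max())
    w = w / w.sum()
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(thetas), size=N, replace=True, p=w)
    return thetas[idx]                                 # iid draws from pi^d(theta | y_n)
```

Given each draw \(\theta ^{(i)}\), \({\tau ^2}^{(i)}\) can then be sampled directly from \(Gamma(np/2 + a, u(\theta ^{(i)})/2 + b)\).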

When q is large (\(q \ge 5)\), the construction of the discrete approximation \(\pi ^d\) by evaluating the posterior at all the grid points can be computationally prohibitive. In this case, we recommend replacing the grid sampling with griddy Gibbs sampling in the above algorithm. In griddy Gibbs sampling, each coordinate of z is sampled from its conditional posterior, which does not require the evaluation of the posterior at all grid points.
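A sketch of the griddy Gibbs variant, in the same notation as the grid sampling sketch above (with \(L = U D^{1/2}\); names and the seed are ours, a sketch rather than the authors' implementation):

```python
import numpy as np

def griddy_gibbs(log_post, theta0, L, A, B, n_iter, M2=15, seed=0):
    """Griddy Gibbs sampling in the standardized coordinates theta = theta0 + L z.
    Each coordinate z_j is drawn from its discretized full conditional, so only
    (2*M2 + 1) posterior evaluations per coordinate are needed per sweep."""
    rng = np.random.default_rng(seed)
    theta0 = np.asarray(theta0, dtype=float)
    q = len(theta0)
    grids = [np.linspace(A[j], B[j], 2 * M2 + 1) for j in range(q)]
    z = np.zeros(q)
    draws = []
    for _ in range(n_iter):
        for j in range(q):
            logp = np.empty(2 * M2 + 1)
            for k, zj in enumerate(grids[j]):
                z_try = z.copy()
                z_try[j] = zj
                logp[k] = log_post(theta0 + L @ z_try)
            w = np.exp(logp - logp.max())
            z[j] = rng.choice(grids[j], p=w / w.sum())
        draws.append(theta0 + L @ z)
    return np.array(draws)
```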

To improve the accuracy of the numerical solution of the differential equation, we divided each interval \([t_{i-1}, t_{i}]\) into m subintervals and added intermediate time points in computing x. If the differential equation is smooth enough, \(m = 4\) for the Euler method and \(m=1\) for the 4th order Runge–Kutta method usually suffice to keep the numerical error from inflating the error rate of the Laplace approximation; see Theorem 2. In practice, however, larger values of m are sometimes required. We apply larger values of m in turn, and stop when the change in the posterior mean is less than \(0.1\%\). In our examples, we used the sequence of m values \(1, 2, 4, 8, 14, 20, 30,\ldots \).

4 Convergence of the approximated posterior

4.1 Convergence of the approximated posterior as m increases

In this section, we show that the posterior obtained with the Laplace approximation and the numerical method, \(\pi ^{LP}_m\), converges pointwise to the true posterior with a relative error of \(O(n^{-3/2})\) as \(m \rightarrow \infty \), under some regularity conditions.

For convenience, let \(\pi _m \equiv \pi ^{LP}_m\). We assume \(h \equiv t_{i} - t_{i-1}\) for all \(i=2,3,\ldots , n\) and that each \([t_{i-1}, t_i]\) is divided into m segments; thus, the length of one segment is h / m. Let \(x^m\) be the approximation of x by the numerical method with m segments and \(x^m(t_1) = x(t_1)\) for all m.

The theorem requires the following assumptions.

  A1. \(\{ x(t) : t \in [T_0, T_1] \}\) is a compact subset of \({\mathbb {R}}^p\);

  A2. \(\{ y(t) : t \in [T_0,T_1] \}\) is a bounded subset of \({\mathbb {R}}^p\);

  A3. the Kth order derivative of \(f(x,t ;\theta )\) with respect to t exists and is continuous in x and t, where K is the order of the numerical method \(\phi \); and

  A4. the function \(ng_n(x_1) + \Vert x_1 - \mu _{x_1}\Vert ^2/c\) has a unique minimizer \({\hat{x}}_1\).

Theorem 1

Suppose that \(f(x,t ; \theta )\) is Lipschitz continuous in x, and A1 – A4 hold. Then, for sufficiently large n,

$$\begin{aligned} \lim _{m\rightarrow \infty } \pi _m(\theta ,\sigma ^2 \mid \mathbf{y}_n) = \pi (\theta ,\sigma ^2 \mid \mathbf{y}_n) \times (1+ O(n^{-3/2})), \end{aligned}$$

for all \(\theta \) and \(\sigma ^2\).

The proof of the theorem is given in the Appendix.

4.2 Suitable rate of step size with respect to sample size

In this section, we analyze the relation between the step size h / m and the approximation error rate of the posterior, motivated by Xue et al. (2010). We assume that the number of observations goes to infinity and \(h/m=O(n^{-\alpha })\). A large sample size and a small step size give accurate inference, but they may cause heavy computation. We are interested in a reasonable choice of the step size h / m when the sample size n is growing. Here, a reasonable choice means one that does not inflate the relative error beyond the rate \(O(n^{-3/2})\) caused by the Laplace approximation.

Let K be the order of the numerical method \(\phi \). If we divide intervals \([t_{i-1},t_i]\) into m segments,

\(\max _{1 \le i \le n} \Vert x_i - x^m_i \Vert = O((h/m)^K) = O(n^{-K\alpha })\).

Theorem 2

Suppose that \(f(x,t; \theta )\) is Lipschitz continuous in x, and \(A1-A3\) hold. Let K be the order of the numerical method \(\phi \) and \(h/m = O(n^{-\alpha })\). If \(\alpha \ge 5/(2K)\), then, for sufficiently large n,

$$\begin{aligned} \pi _m(\theta ,\tau ^2 \mid \mathbf{y}_n) = \pi (\theta ,\tau ^2\mid \mathbf{y}_n) \times (1+ O(n^{-3/2})), \end{aligned}$$

for all \(\theta \) and \(\tau ^2\).

Theorem 2 says that if we set \(h/m = O(n^{-5/(2K)})\), the numerical approximation does not raise the order of the relative error caused by the Laplace approximation. Moreover, even if we take \(h/m \ll n^{-5/(2K)}\), it does not reduce the error rate \(O(n^{-3/2})\) and only raises the computational cost.
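To make the rate concrete (our reading of Theorem 2, assuming the observation times are equally spaced over a fixed interval so that \(h = O(n^{-1})\)): for the 4th order Runge–Kutta method (\(K=4\)) the condition \(\alpha \ge 5/(2K) = 5/8\) already holds with \(m=1\), since then \(h/m = O(n^{-1})\); for the Euler method (\(K=1\)) it requires \(h/m = O(n^{-5/2})\), that is, m growing at the rate \(n^{3/2}\). This is consistent with the small values of m that suffice for the Runge–Kutta method and the larger values needed for the Euler method in Sect. 5.1.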

5 Simulated data examples

In this section, we test our LAP inference with data sets simulated from three ODE models. The data are generated with predetermined parameter value \(\theta \), the initial value \(x_1\) and error variance \(\sigma ^2\).

For the example in Sect. 5.1, we use both the Euler and the 4th order Runge–Kutta method to approximate ODE solutions. For the examples in Sects. 5.2 and 5.3, we use the 4th order Runge–Kutta method. The LAP inference can be extended to other numerical methods by changing the function \(\phi (x, t ; \theta )\).

5.1 Newton’s law of cooling

5.1.1 Model description and data generation

English physicist Isaac Newton believed that the rate of temperature change of an object is proportional to the temperature difference between the object and its surroundings. This intuition is captured by Newton's law of cooling, which is an ODE given by

$$\begin{aligned} {\dot{x}}(t)= & {} \theta _1(x(t) - \theta _2), \end{aligned}$$
(8)

where x(t) is the temperature of the object in Celsius at time t, \(\theta _1\) is a negative proportionality constant and \(\theta _2\) is the temperature of the environment. See Incropera (2006) for the details. The solution of the ODE (8) is known and is

$$\begin{aligned} x(t) = \theta _2 - (\theta _2 - x_1)e^{\theta _1 t} \end{aligned}$$

where \(x_1 \equiv x(0)\). Since the analytic form of the solution is known, it is not necessary to resort to the proposed approximate posterior computation method to fit the ODE model with (8). We have chosen this example as a testbed for the proposed method. We compare the true posterior without approximation with the approximate posterior obtained by the proposed method.

The model parameters were fixed at \(x_1 = 20, \theta = (-0.5, 80)^T\) and \(\sigma ^2 = 25\), and \(y(t_i)\) were generated at \(t_i = h (i-1)\) for \(i=1,2,\ldots , n\). We generated 4 data sets with sample sizes \(n=20, 50, 100, 150\), which have step sizes \(h=0.75, 0.3, 0.15, 0.1\), respectively. The effect of sample size on the approximation is investigated below. The data set with sample size \(n=20\) and the true mean function are shown in Figure 1.
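A minimal sketch of this data-generating step, using the known analytic solution (function names and the random seed are ours):

```python
import numpy as np

def cooling_solution(t, x1, theta):
    """Analytic solution of Newton's law of cooling: x(t) = theta2 - (theta2 - x1) * exp(theta1 * t)."""
    theta1, theta2 = theta
    return theta2 - (theta2 - x1) * np.exp(theta1 * t)

def simulate_cooling_data(n, h, x1=20.0, theta=(-0.5, 80.0), sigma2=25.0, seed=1):
    """Simulate y(t_i) = x(t_i) + eps_i at t_i = h * (i - 1), i = 1, ..., n."""
    rng = np.random.default_rng(seed)
    t = h * np.arange(n)
    x = cooling_solution(t, x1, theta)
    return t, x + rng.normal(scale=np.sqrt(sigma2), size=n)

t_obs, y_obs = simulate_cooling_data(n=20, h=0.75)   # a data set analogous to the n = 20 case in Fig. 1
```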

Fig. 1 The solid line is the true temperature as a function of time from the Newton's law of cooling model with \(x_1=20\), \(\theta =(-0.5, 80)^T\) and \(n=20\). The scatter plot of the generated data of temperatures and times is also drawn

The priors were set by

$$\begin{aligned}&x_1 \mid \tau ^2 \sim N(\mu _{x_1} = y_1, 100/\tau ^2)\nonumber \\&\tau ^2 \sim Gamma(a, b)\nonumber \\&\theta =(\theta _1, \theta _2) \sim Uniform(-200, 0)\nonumber \\&\quad \times Uniform(-200, 500). \end{aligned}$$
(9)

where \(a=0.1, b=0.01\) and \(y_1=15.515\).

The true posterior of \(\theta \) and \(\tau ^2\) can be obtained as follows:

$$\begin{aligned}&\tau ^2 \mid \theta ,\mathbf{y}_n \sim Gamma\Big (\frac{np}{2}+a, \frac{1}{2}{\tilde{u}}(\theta ) + b\Big )\\&\pi (\theta \mid \mathbf{y}_n) \propto \frac{1}{\Big (\frac{1}{2}{\tilde{u}}(\theta )+b\Big )^{\frac{np}{2}+a}} I(-200 < \theta _1 < 0)\\&\quad \times I(-200<\theta _2 < 500), \end{aligned}$$

where

$$\begin{aligned}&{\tilde{u}}(\theta ) = \mu _{x_1}^2/100 + \sum _{i=1}^n z_i^2 - \Big (1/100 + \sum _{i=1}^n e^{2\theta _1(i-1)h}\Big )^{-1}\\&\quad \times \Big (\mu _{x_1}/100 + \sum _{i=1}^n z_ie^{\theta _1(i-1)h}\Big )^2,\\&z_i = z_i(\theta ) = y_i - \theta _2 + \theta _2 e^{\theta _1(i-1)h}. \end{aligned}$$
Fig. 2 The true posterior densities and the approximate posterior densities with Laplace approximation are shown. The data are generated from the Newton's law of cooling model with sample size \(n = 20\). The red lines represent the true values of the parameters, \(\theta = (-0.5, 80)^T\) and \(\sigma ^2 = 25\)

Since the dimension of \(\theta \) is only 2, grid sampling is deemed adequate for sampling \(\theta \). For this example, we ended up setting \(M = 25\) and \(h_0 = (1, 1)^T\), where \(h_0\) is the vector of step sizes for the grid matrix. The center of the grid matrix was chosen as \(\theta ^0 = (-0.547, 80.933)^T\) by the parameter cascading method. In total, we have 2,601 grid points. In the rest of the paper, we obtained 10,000 posterior samples in each example.

5.1.2 Assessment of the performance of the approximate posteriors

The LAP inference has two approximations: Laplace approximation for the marginal posterior of \(\theta \) and \(\tau ^2\) and numerical approximation method for the regression function x. With this example, we investigate the quality of these two approximations. In particular, we examine (1) the effect of sample size on the Laplace approximation and (2) that of the numerical approximation. For the numerical approximation part, we compare the performance of the Euler method and the 4th order Runge-Kutta method.

To see the effect of sample size on the Laplace approximation, the true posterior \(\pi (\theta , \tau ^2 \mid \mathbf{y}_n)\) was compared with the posterior with only the Laplace approximation, \(\pi ^{LP}(\theta , \tau ^2 \mid \mathbf{y}_n)\). Figure 2 shows the true posterior densities and the Laplace approximated posterior densities of each parameter when the sample size is \(n= 20\). Even when the sample size is as small as \(n=20\), the Laplace approximated posterior densities are almost indistinguishable from the true posterior. Although not shown here, we made the same comparison plots for samples with sizes as small as 5 and 10 and concluded that the approximation is still good. Table 1 tells a similar story; that is, the summary statistics of the Laplace approximated posterior are quite close to those of the true posterior. Table 1 shows only the summary statistics of \(\theta _1\), but the same conclusion has been reached for \(\theta _2\) and \(\tau ^2\).

Table 1 Posterior summary statistics of \(\theta _1\) from the true posterior, Laplace approximated posterior, posterior with Laplace approximation and numerical approximation method with varying values of the number of steps m and sample sizes n in Newton’s law of cooling model
Fig. 3 The true posterior density and the approximate posteriors with Laplace approximation and numerical methods for the Newton's law of cooling model are drawn when \(n = 20\). The true parameter values are \(\theta = (-0.5, 80)^T\) and \(\sigma ^2 = 25\). As m grows, the approximate posterior gets closer to the true posterior

To see the effect of the approximation due to the numerical methods, the true posterior \(\pi (\theta , \tau ^2 \mid \mathbf{y}_n)\) was compared with the approximate posteriors obtained by applying the Laplace approximation and the numerical methods. In this example, we used the Euler method and the 4th order Runge-Kutta method for the numerical method. The intervals between observations \([t_{i-1}, t_i]\) were divided into m segments with \(m=1,2, 4, 8, 14, 20, 30 , \ldots \). The approximate posteriors are denoted by \(\pi ^{LP,E}_m (\theta , \tau ^2 \mid \mathbf{y}_n)\) and \(\pi ^{LP,RK}_m (\theta , \tau ^2 \mid \mathbf{y}_n)\), where E and RK stand for the Euler and Runge–Kutta methods, and m is the number of segments. Figure 3 shows the posterior densities with different m and the true posterior density when the sample size is \(n = 20\). The approximate posteriors \(\pi ^{LP,E}_m\) and \(\pi ^{LP,RK}_m\) are shown in the first row and the second row, respectively. The approximate posteriors \(\pi ^{LP,RK}_m\) are generally close to the true posterior even for \(m=1\), but \(\pi ^{LP,E}_m\) show different behavior. For \(\theta _2\) and \(\tau ^2\), \(\pi ^{LP,E}_m\) are close to the true posterior even for \(m=1\), but the marginal posterior of \(\theta _1\) under \(\pi ^{LP,E}_m\) deviates from the true posterior. The deviation disappears as m gets larger. These results are also confirmed in Table 1, which includes the posterior summary statistics of \(\theta _1\) with different values of m and n. We report the results for the Euler method with \(m=1,20,50,60\) and the Runge-Kutta method with \(m=1,2\). Based on these observations, we recommend the Euler method with \(m=50\) and the Runge-Kutta method with \(m=1\). The theorem in Sect. 4 provides a theoretical basis for this observation: the Runge-Kutta method with \(m=1\) does not inflate the error rate attained by the Laplace method, while the Euler method with \(m=1\) does, so larger values of m are needed for the Euler method. The computation times for the numerical methods with various values of m and n in this example are shown in Table 2.

Table 2 The computation times (s) for numerical methods with varying values of step size m and sample sizes in numerical approximation method for Newton’s law of cooling model
Fig. 4 Approximate marginal posterior densities for each parameter with varying values of m for the FitzHugh-Nagumo model. The red lines represent the true values of the parameters, \(\theta = (0.2,0.2,3)^T\) and \(\sigma ^2 = 0.25\). As m grows, the approximate posterior appears to stabilize

5.2 FitzHugh–Nagumo model

5.2.1 Model description and data generation

The action of spike potential in the giant axon of squid neurons is modeled by Hodgkin and Huxley (1952). FitzHugh (1961) and Nagumo et al. (1962) simplified this model with two variables. The reduced model with no external stimulus is given below:

$$\begin{aligned} {\dot{x_1}}(t)= & {} \theta _3 \Big (x_1(t) - \frac{1}{3}x_1^3(t) + x_2(t)\Big ),\nonumber \\ {\dot{x_2}}(t)= & {} -\frac{1}{\theta _3}(x_1(t) - \theta _1 + \theta _2 x_2(t)), \end{aligned}$$

where \(-0.8 < \theta _1, \theta _2 < 0.8, 0 < \theta _3 < 8\), and \(x_1(t)\) and \(x_2(t)\) are the voltage across the membrane and the outward current at time t, called the voltage and recovery variables, respectively. We use this parameter space to ensure stable cyclical behavior of the system (Campbell 2007). With this example, we show that the Laplace approximated posterior inference works well with an appropriate choice of m.

Table 3 Posterior summary statistics with varying values of step size m in 4th order Runge–Kutta method for FitzHugh–Nagumo model

We generated a simulated data set with model parameters \(\theta = (0.2, 0.2, 3)^T, x_1 = x(t_1) = (-1, 1)^T\) and \(\sigma ^2 = 0.25\). The time interval was fixed at \(t_i - t_{i-1} = 0.2\) for \(i=2, 3, \ldots , n\) with \(n=100\). We divided \([t_{i-1}, t_i]\) into 100 segments and applied the 4th order Runge-Kutta method to get the true mean function for the simulated data.
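A minimal sketch of this data-generating step (variable names are ours; SciPy's general-purpose integrator solve_ivp is used here as a stand-in for the finely subdivided Runge-Kutta recursion described above):

```python
import numpy as np
from scipy.integrate import solve_ivp

def fhn(t, x, theta):
    """Right-hand side of the FitzHugh-Nagumo system."""
    th1, th2, th3 = theta
    v, r = x
    return [th3 * (v - v ** 3 / 3.0 + r), -(v - th1 + th2 * r) / th3]

theta_true = (0.2, 0.2, 3.0)
times = 0.2 * np.arange(100)                       # t_i - t_{i-1} = 0.2, n = 100
sol = solve_ivp(fhn, (times[0], times[-1]), [-1.0, 1.0],
                t_eval=times, args=(theta_true,), rtol=1e-8)
x_true = sol.y.T                                   # n x 2 matrix of true states
rng = np.random.default_rng(2)
y = x_true + rng.normal(scale=np.sqrt(0.25), size=x_true.shape)
```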

For the prior, we had \(x_1 \mid \tau ^2 \sim N_2(\mu _{x_1} = y_1, 100/\tau ^2 I_2)\), \(\tau ^2 \sim Gamma(a, b)\) and \(\theta \sim Unif(A)\) where \(a=0.1, b=0.01\), \(y_1 = (-1.449, 1.092)^T\) and \(A = \{(\theta _1, \theta _2, \theta _3) : -0.8 < \theta _1,\theta _2 < 0.8, 0 < \theta _3 < 8 \}\).

5.2.2 Assessment of the performance of the approximate posteriors

We applied the procedure in Sect. 3 to choose the range and the center of the grid matrix. For the final analysis, we set \(M = 15\) and \(h_0 = (7, 7, 4)^T\), so we have 29,791 grid points. The center of the grid matrix \(\theta ^0\) was \((0.199, 0.131, 3.056)^T\). Figure 4 shows the posterior densities with different m. In this example, different values of m show only slight changes in the posterior approximations. Table 3 shows the posterior summary statistics with varying values of m. We applied the procedure for choosing m described in Sect. 3, and in the final analysis \(m=2\) was used; Table 4 contains the computation times for \(m=1,2\) and \(n=100, 200\).

Figure 5 contains the scatter plots of the observations, the true mean functions, \(90\,\%\) credible lines for the mean functions, and the posterior mean functions as well as prediction values at 10 future time points when \(m =2\).

Table 4 The computation times (s) with varying values of step size m and sample size in numerical approximation method for the FitzHugh–Nagumo model
Fig. 5 Scatter plot of the observations generated from the FitzHugh–Nagumo model, and plots of \(90\,\%\) credible set lines and true states, drawn when \(m=2\). Predictions at 10 time points ahead are also drawn. The upper, lower and middle dotted lines are the 95 and 5 % quantiles and the mean of the posterior, respectively. The solid line in the middle is the true value of the state x(t), and the star-shaped points are the observations

5.2.3 Comparison with existing methods

We compare the performance of the LAP inference with the other existing methods: the parameter cascading method (Ramsay et al. 2007), the delayed rejection adaptive Metropolis (DRAM) algorithm (Haario et al. 2006) with numerical integration, the Gaussian Process-ODE (GP-ODE) approach (Wang and Barber 2014) and the adaptive gradient matching (AGM) approach (Dondelinger et al. 2013). We generate 100 simulated data sets as above and compute the absolute bias, the standard deviation, the root mean squared error (rmse) and the log-likelihood as measures of performance. However, for the GP-ODE approach (Wang and Barber 2014) and the AGM approach (Dondelinger et al. 2013), only 20 data sets were used because of their long computation times, which are mainly due to the fact that the implementations of the two approaches are based on pure MATLAB code. For the data generation, \(\theta =(0.2, 0.2, 3)^T, x_1=(-1,1)^T,\) \(\sigma ^2=0.25, h=0.2\) and \(n=30\) were used.

We are using the method of Ramsay et al. (2007) based on parameter cascading (PC). PC is also called generalized profiling. We give the details below.

To represent the state of the ODE, x(t), the PC methodology uses the collocation method: the collocation method uses a basis function expansion to represent the p-dimensional vector x(t), that is,

$$\begin{aligned} x(t) =(x_1(t),x_2(t),\ldots ,x_{p}(t)) = \varPhi (t){\mathbf {C}} \end{aligned}$$
(10)

where \(\varPhi (t) = (\varPhi _1(t), \ldots , \varPhi _K(t))\) is a set of K bases evaluated at time t, and the \(K \times p\) matrix \({\mathbf {C}}\) contains the coefficients of the basis functions of each variable in its columns. In other words, expanding (10), the i-th component of x(t) at time t has the basis function expansion

$$\begin{aligned} x_i(t) = \varPhi (t)c_i = \sum _{k=1}^{K}\,c_{ik}\varPhi _{k}(t), \end{aligned}$$
(11)

where \(c_i\) is a column vector of coefficients \(c_{ik}\) of length K, for \(i=1,2,\ldots ,p\). The ODE model whose parameters need to be estimated is given by

$$\begin{aligned} {\dot{x}}_{i}(t)=f_{i}(x(t),{\theta }) \end{aligned}$$
(12)

for \(i=1,2,\ldots ,p\).

PC involves a penalized likelihood criterion \(J\equiv J({\mathbf {C}},\theta ,{\varvec{\lambda }})\) which is based on the coefficients of the basis expansions \({\mathbf {C}}\), the unknown parameter \(\theta \) to be estimated and \({\varvec{\lambda }}\equiv (\lambda _1, \ldots , \lambda _p)\), the penalty (or smoothing) parameters. The criterion J reflects two competing goals through two competing terms. The first term in \(J({\mathbf {C}},\theta ,{\varvec{\lambda }})\) measures how well the state function values fit the data, whereas the second term measures how closely each of the state functions satisfies the corresponding differential equation (12). The smoothing parameters measure the weight of each competing term; when the \(\lambda _i\)s, \(i=1,2,\ldots ,p\), are large, more and more emphasis is put on having the \(x_i(t)\)s in (11) satisfy the differential equation in (12), as opposed to fitting the data, and vice versa when the \(\lambda _i\)s tend to zero.
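In its simplest form (our rendering, up to observation weights; see Ramsay et al. 2007 for the exact criterion), J can be written as

$$\begin{aligned} J({\mathbf {C}},\theta ,{\varvec{\lambda }}) = \sum _{i=1}^{n} \Vert y_i - x(t_i)\Vert ^2 + \sum _{j=1}^{p} \lambda _j \int _{T_0}^{T_1} \big \{ {\dot{x}}_j(t) - f_j(x(t),\theta ) \big \}^2 \, dt, \end{aligned}$$

where \(x(t) = \varPhi (t) {\mathbf {C}}\) as in (10); the first term measures fit to the data and the second measures fidelity to the ODE (12).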

PC optimization is based on two levels: an inner optimization step nested within an outer optimization. In the inner optimization, the \({\theta }\) and \({\varvec{\lambda }}\) components are held fixed, and an inner optimization criterion is optimized with respect to the coefficients in the matrix \({\mathbf {C}}\) only. In effect, this makes \(J = J({\mathbf {C}}(\theta , \lambda ),\theta ,{\varvec{\lambda }})\) a function of \({\theta }\) and \({\varvec{\lambda }}\) only. In the outer optimization step, J is optimized with respect to \(\theta \) keeping \({\varvec{\lambda }}\) fixed. This essentially makes \(J \equiv J({\mathbf {C}}(\theta ({\varvec{\lambda }}),\lambda ),\theta ({\varvec{\lambda }}),{\varvec{\lambda }})\) a function of \({\varvec{\lambda }}\) only. The smoothing parameters \({\varvec{\lambda }}\) and the number of basis functions K are finally chosen based on numerical stability of the parameter estimates. This is the key idea underlying the generalized profiling or parameter cascade algorithms in Ramsay (2007) and Cao and Ramsay (2009).

The implementation of the PC method was carried out using the CollocInfer package (Hooker et al. 2014) in R. This package uses B-spline basis functions for \(\varPhi _{k}(t), k = 1, \ldots , K\). B-spline basis functions are constructed by joining polynomial segments end-to-end at junctions specified by knots. Since our method used \(m=2\), we set \(2n-1\) equally spaced knots on \([t_1, t_n]\) to get roughly twice as many knots as data points. Finer knots gave negligible improvement in the parameter estimates while slowing down the computation. We chose third-order B-spline basis functions, as used in Ramsay et al. (2007) for the same model. For the choice of the tuning parameter \(\lambda \), we used both a manual procedure and an automatic procedure. We adopted the procedure of Ramsay et al. (2007), which tries increasingly large values of \(\lambda \) and manually chooses a \(\lambda \) that gives stable results. The quartiles of the parameter estimates for the 100 simulation data sets were obtained as \(\lambda \) was varied from \(10^{-2}\) to \(10^6\). After that, \(\lambda \) was set at \(10^5\). For the automatic procedure, the forward prediction error (FPE) in Hooker et al. (2010) was used. We divided each data set into ten parts, from \(t_1\) to \(t_{10}\), \(t_{11}\) to \(t_{20}\), and so on. For each data set, the averaged FPE was obtained as \(\lambda \) was varied from \(10^{-2}\) to \(10^6\), and the optimal \(\lambda \) minimizing the FPE was chosen.

The DRAM algorithm (Haario et al. 2006) is a variant of the standard Metropolis-Hastings algorithm (Metropolis et al. 1953; Hastings 1970). We chose the DRAM algorithm with numerical integration to compare its computation time with our LAP inference. To implement it, we used the modMCMC function from the FME package (Soetaert and Petzoldt 2010) in R. The maximal number of tries for the delayed rejection was fixed at 1, so in effect we used the adaptive Metropolis algorithm (Haario et al. 2001). The initial values were set by the modFit function, which finds the best fit parameters using optimization approaches. The variance of the proposal distribution was set to the sample covariance of the parameters \((x_1, \theta )\) scaled with \(2.4^4/(p+q)\), updated every 100 iterations. We drew 10,000 posterior samples from the DRAM algorithm. The DRAM was used here as a benchmark method for obtaining exact results based on Markov chain Monte Carlo procedures, but at the expense of computational time.

Gaussian processes (GP) have been used to avoid the heavy computation of numerical integration. AGM (Dondelinger et al. 2013) and GP-ODE (Wang and Barber 2014) are two state-of-the-art paradigms for modelling differential equation models using GPs. The gradient matching (GM) approach (Calderhead et al. 2009) was developed to infer differential equation models based on GPs. However, the GM approach has the disadvantages that the posterior of the GP hyperparameters does not depend on the differential equation system and that it is not a generative model. Dondelinger et al. (2013) tried to remedy the former problem by substituting the directed edges between the hyperparameters and the GP with undirected edges. This modification improved the performance of the inference, but it is still not a generative model. The GP-ODE approach was developed by Wang and Barber (2014) to construct a simple generative model. They developed a paradigm different from the gradient matching approaches and argued that the GP-ODE approach performs at least as well as the AGM. However, GP-ODE has been shown to be conceptually problematic. Recently, Macdonald et al. (2015) pointed out that the GP-ODE approach makes an undesirable approximation: GP-ODE eliminates the edge between the true state variable x(t) and the latent variable \({\tilde{x}}(t)\), which should be equal to the true state variable. Macdonald et al. (2015) showed that AGM achieves better results than GP-ODE for a simple ODE model with missing values and comparable results for the FitzHugh-Nagumo system.

To compare our LAP inference with the GP based approaches, we report the results from both the GP-ODE and AGM approaches. The MATLAB code for GP-ODE is available from GitHub, and Macdonald et al. (2015) provided the MATLAB code for the AGM approach. All parameters were sampled by griddy Gibbs sampling. The range for each parameter component \(\theta _i\) was chosen as \([{\widehat{\theta }}_i^R \pm 4{\widehat{sd}}({\widehat{\theta }}_i^R)]\), where \({\widehat{\theta }}_i^R\) is the estimate from the parameter cascading method (Ramsay et al. 2007). We divided it into 31 intervals of equal length to use the same number of grid points for each parameter. For the covariance function of the GP, we chose the squared exponential function \(c_{\phi _j}(t,t') = \sigma _j^x \exp (- l_j(t-t')^2)\) and discretized the parameters \(\sigma _j^x, l_j\) over the ranges [0.1, 1] and [5, 50] with spacings 0.1 and 5, respectively. We drew 10,000 posterior samples from the GP-ODE and AGM approaches.

Table 5 The means of the absolute biases, the standard deviations (sd), the root mean squared errors (rmse) for \({\hat{\theta }}\), the log-likelihoods at the estimated parameters and the computation times (s) in the FitzHugh-Nagumo model

As concluded in the simulation above, we used the 4th order Runge-Kutta method with \(m=2\) for the LAP inference and drew 10,000 posterior samples from each simulated data set. The same grid set as for GP-ODE was chosen for a fair comparison.

Table 5 reports the means of the absolute biases, the standard deviations, the root mean squared errors (rmse) for \({\hat{\theta }}\), the log-likelihoods at the estimated parameters and the computation times. The absolute bias term is calculated by

$$\begin{aligned} | \text {Bias}^s(\theta _i)| = |\theta _i -{\widehat{\theta }}^s_i | , ~~ i = 1, 2,3 \end{aligned}$$

where \(\text {Bias}^s\) is the bias in the s-th simulation, \({\widehat{\theta }}^s\) is the estimate of \(\theta \) in the s-th simulation and \(\theta = (0.2, 0.2, 3)^T\). For the Bayesian procedures, we use the posterior mean as the estimate of the parameter.

Table 5 shows that the LAP inference has better performance than the other methods in terms of rmse. The LAP has lower rmse and higher log-likelihood than those of the DRAM method, while taking \(25\%\) less computational time compared to DRAM. The PC method with \(\lambda =10^5\) has the fastest computation time and has a slightly higher log-likelihood value than that of LAP, while the automatic choice of \(\lambda \) (PC FPE) has a comparable rmse and a relatively lower log-likelihood value as shown in Table 5. The GP-ODE and AGM do not perform well in terms of the computational speed and accuracy (as determined by rmse).

We have also checked the values of the log-likelihood at the parameter estimates for each method, shown in Table 5. Note that LAP consistently achieves a higher log-likelihood value at its parameter estimates than GP-ODE and AGM, and a value comparable to those of PC and DRAM. Note that if another method produced a parameter estimate different from the LAP estimate but comparable in explaining the data, the log-likelihood at that parameter value should be close to the log-likelihood value corresponding to the LAP estimate. Table 5 shows that this is not the case; GP-ODE and AGM yield significantly lower log-likelihood values, suggesting suboptimal parameter estimates.

To understand these suboptimal parameter estimates, we note that the FitzHugh-Nagumo ODE model has large regions of the likelihood corresponding to unidentifiable parameter values. However, such unidentifiability does not arise for parameter values close to the maximum likelihood estimate (MLE) when the sample size n is large. Our method based on the Laplace approximation finds this maximum likelihood estimate, and hence its bias is of order \(O(n^{-1/2})\). As seen from the log-likelihood values in Table 5, GP-ODE and AGM give parameter estimates away from the MLE, which may therefore belong to these regions of unidentifiability. In other words, parameter estimates from GP-ODE and AGM genuinely deviate from the true parameter value. Over repeated simulation experiments, these genuine deviations translate into the large overall biases and standard deviations given in Table 5 (especially for the component \(\theta _3\)).

The coverage probabilities of the 95 % credible intervals from the LAP inference are comparable to those of the confidence intervals obtained by the other methods. The coverage probabilities for \(\theta _1,\theta _2, \theta _3\) of the LAP inference are 0.94, 0.96, 0.94, while those of the PC method and the DRAM algorithm are 0.84, 0.91, 0.81 and 0.96, 0.93, 0.97, respectively.

5.3 Predator-prey system

Fussmann et al. (2000) suggested a mathematical model for a predator-prey food chain between two microbial species. The following system of equations describes the predator-prey oscillation between Brachionus calyciflorus and Chlorella vulgaris:

$$\begin{aligned} {\dot{x}}_1(t)= & {} \delta (N^* - x_1(t)) - \frac{\theta _1 x_1(t) x_2(t)}{\theta _2 + x_1(t)}\\ {\dot{x}}_2(t)= & {} \frac{\theta _1 x_1(t) x_2(t)}{\theta _2 + x_1(t)} - \frac{\theta _3 x_2(t)x_4(t)}{\theta _4 + x_2(t)}\cdot \frac{1}{\theta _5} - \delta x_2(t)\\ {\dot{x}}_3(t)= & {} \frac{\theta _3 x_2(t) x_3(t)}{\theta _4 + x_2(t)} - (\delta + \theta _6 + \theta _7) x_3(t)\\ {\dot{x}}_4(t)= & {} \frac{\theta _3 x_2(t) x_3(t)}{\theta _4 + x_2(t)} - (\delta + \theta _6) x_4(t). \end{aligned}$$

In the above model, \(x_1, x_2, x_3, x_4\) represent the concentrations of nitrogen, Chlorella, reproducing Brachionus and total Brachionus, respectively. The unit of Chlorella and Brachionus is \(\mu mol L^{-1}\). \(N^*\) is the inflow concentration of nitrogen, and \(\delta \) is dilution rate. We have seven positive parameters, \(\theta = (\theta _1, \ldots , \theta _7)^T\). \(\theta _1\) and \(\theta _2\) are the maximum birth rate and the half-saturation constant of Chlorella. \(\theta _3\) and \(\theta _4\) represent the maximum birth rate and the half-saturation constant of Brachionus. \(\theta _5, \theta _6\), and \(\theta _7\) are the assimilation efficiency, the mortality and the decay of fecundity of Brachionus.

The dimension of the parameter is 7, which is too large for grid sampling. Instead, we applied the griddy Gibbs sampling method. We generated a simulated data set with model parameters \(\theta = (3.3, 0.43, 2.25, 1.5, 2.5, 0.055, 0.4)^T\), \(x_1 = (1, 3, 5, 5)^T\), \(\sigma ^2 = 1\) and \(N^* = 8, \delta = 0.68\). We used the absolute value of the data because the concentrations should be positive. The parameter settings come from Cao et al. (2008); we only modified the scales of \(x_1, x_2, \theta _2, \theta _4, \theta _5\) and \(N^*\) to control the scale of \(x_1\) and \(x_2\). The time interval was fixed at \(t_i - t_{i-1} = 0.1\) for \(i = 2, 3, \ldots , n\) where \(n=100\). We applied the 4th order Runge-Kutta method with \(m=1\) to get the true mean function for the simulated data.

For the prior, we used \(x_1 \mid \tau ^2 \sim N_4(\mu _{x_1} = y_1, (100/\tau ^2) I_4)\), \(\tau ^2 \sim \textit{Gamma}(a,b)\) and \(\theta \sim \textit{Unif}(A)\), where \(a=0.1\), \(b=0.01\), \(y_1 = (0.103, 3.185, 6.298, 5.137)^T\) and \(A = \{ (\theta _1,\ldots , \theta _7) : 0< \theta _1, \theta _3, \theta _4, \theta _5< 70,\ 0< \theta _2, \theta _6, \theta _7 < 10 \}\).

For this example, we set \(M=15\) and \(h_0 = (0.35, 0.40, 0.15, 0.17, 0.40, 0.07, 0.06)^T\), so that there are 31 grid points for each \(\theta _j\), \(j=1,\ldots , 7\). The center of the grid matrix, \(\theta ^0\), was chosen as \((3.295, 1.444, 2.225, 1.393, 3.883, 0.248, 0.397)^T\).

Fig. 6 Approximate marginal posterior densities for \(\theta _1\) (top left), \(\theta _2\) (top right), \(\theta _3\) (bottom left) and \(\sigma ^2\) (bottom right) for the predator-prey model. The step size \(m=1\) in the 4th order Runge–Kutta method is used. The densities for the remaining parameters are omitted, and the red lines represent the true values of the parameters

A total of 50,000 posterior samples were drawn by the griddy Gibbs sampling, and every 5th draw was retained, giving 10,000 samples for the posterior inference. This simulation took 19.454 hours. Figure 6 shows the approximate marginal posterior densities for some of the parameters. The summary statistics for the posterior are given in Table 6, together with the true values of the parameters \(\theta \) and \(\sigma ^2\).
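For concreteness, a single coordinate update in a generic griddy Gibbs sweep can be sketched as follows. Here `log_post` is a placeholder for the (Laplace-approximated) unnormalized log posterior and all names are ours, so this is only a schematic of the sampler, not the implementation used in our experiments.

```python
import numpy as np

def griddy_gibbs_update(theta, j, grid_j, log_post, rng):
    """Redraw coordinate j of theta from its full conditional discretized on grid_j."""
    logp = np.empty(len(grid_j))
    for k, g in enumerate(grid_j):
        cand = theta.copy()
        cand[j] = g
        logp[k] = log_post(cand)            # unnormalized log posterior at the grid point
    prob = np.exp(logp - logp.max())        # subtract the max for numerical stability
    prob /= prob.sum()
    new_theta = theta.copy()
    new_theta[j] = rng.choice(grid_j, p=prob)
    return new_theta
```

One full sweep applies this update to \(j=1,\ldots ,7\) in turn; repeating the sweep and keeping every 5th draw corresponds to the thinning scheme described above.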

Figure 7 shows the scatter plots of the observations for \(x_1, x_2, x_3, x_4\), the true mean functions, the 90 % credible lines for the mean functions, and the posterior mean functions, as well as predicted values at 10 future time points.

As suggested by one of the referees, we adjusted the amount of error in our simulations to make the signal-to-noise ratio (SNR) close to 10, so as to resemble real-life situations. For the predator-prey model, the scales of some parameters were chosen to control the variance of the signal. As a result, for Newton's law of cooling model the SNRs for the different data set sizes were as follows: \(\hbox {SNR}=10.493\) when \(n=20\), \(\hbox {SNR}=8.313\) when \(n=50\), \(\hbox {SNR}=7.660\) when \(n=100\), and \(\hbox {SNR}=7.450\) when \(n=150\). For the FitzHugh-Nagumo system, the SNR is 8.598 on species 1 and 1.928 on species 2. For the predator-prey system, the SNRs on \(x_1,x_2,x_3,x_4\) are 5.712, 6.112, 5.696 and 8.369, respectively.
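As a small illustration, if the SNR is taken to be the sample variance of the noise-free mean function divided by the noise variance (one common convention; the exact definition used for the numbers above may differ), it can be computed as follows.

```python
import numpy as np

def snr(mean_curve, sigma2):
    """Signal-to-noise ratio: variance of the noise-free signal over the noise variance."""
    return np.var(np.asarray(mean_curve), ddof=1) / sigma2
```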

6 U.S. census data: logistic equation

A simple logistic equation describing the evolution of an animal population over time is

$$\begin{aligned} {\dot{x}}(t) = \frac{\theta _1}{\theta _2} x(t) ({\theta _2} - x(t)), \end{aligned}$$
(13)

where x(t) is the population size at time t, \(\theta _1\) is the maximum population growth rate, and \(\theta _2\) is the maximum sustainable population, sometimes called the carrying capacity (Bacaër 2011). The solution to (13) has a known analytic form; see Law et al. (2003) for the details. In this example, however, we use only the differential equation (13) to fit the model.
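For completeness, the closed-form solution of (13) with initial value \(x(0)\) is the familiar sigmoid curve, which we state for reference only and do not use in the fit:

$$\begin{aligned} x(t) = \frac{\theta _2\, x(0)}{x(0) + \left( \theta _2 - x(0)\right) e^{-\theta _1 t}}. \end{aligned}$$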

Table 6 Posterior summary statistics with step size \(m=1\) in the 4th order Runge–Kutta method for the predator-prey model

The U.S. takes a census of its population every 10 years, as mandated by the U.S. Constitution. The census has important ramifications in many respects; for instance, its results are used to determine government program funding, congressional seats and electoral votes. This data set contains the U.S. population from 1790 to 2010, measured in millions.

Since the census has been conducted every 10 years from 1790 to 2010, we have a total of \(n=23\) observations, \((y_1,\ldots , y_{23})\), with \(h=t_i - t_{i-1} = 10\) for \(i=2,\ldots ,n\).

We set the prior as \(x_1 \mid \tau ^2 \sim N(\mu _{x_1}=y_1, 100/\tau ^2)\), \(\tau ^2 \sim \textit{Gamma}(a, b)\) and \(\theta \sim \textit{Uniform}(0,1) \times \textit{Uniform}(300, 1000)\), where \(a=0.1\), \(b=0.01\) and \(y_1 = 3.929\). The lower limit of \(\theta _2\) was set to 300, which is slightly lower than the population in the year 2010, \(y_{23} =308.746\).

To apply the grid sampling method, we used the parameter cascading estimate as the center of the initial grid set. For the final analysis, we set \(M=35\), \(h_0=(0.12, 0.4)^T\) and \(\theta ^0 = (0.020, 532.125)^T\).

We tried several step sizes m and found that the posterior was sufficiently stable with \(m=1\). For the numerical approximation, we used the 4th order Runge-Kutta method. The marginal posterior densities of \(\theta _1\), \(\theta _2\) and \(\sigma ^2\) when \(m=1\) are given in Fig. 8. Figure 9 shows the scatter plot of the observations, the \(90\%\) credible interval lines and the posterior mean, as well as predicted populations at 10 future time points. Table 7 shows the summary statistics for the posterior.
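To illustrate how the step parameter enters the numerical solution, the sketch below solves the logistic equation (13) on the census time grid, reading \(m\) as the number of Runge-Kutta sub-steps per 10-year interval. This reading of \(m\), the function names, and the placeholder parameter values (taken from the grid center \(\theta ^0\) above, not from the posterior summaries) are our own assumptions.

```python
import numpy as np

def logistic_rhs(x, theta1, theta2):
    # Right-hand side of the logistic equation (13)
    return (theta1 / theta2) * x * (theta2 - x)

def solve_logistic(x0, theta1, theta2, n=23, h=10.0, m=1):
    """4th order Runge-Kutta solution on the census grid, with m sub-steps per interval."""
    x, path, dt = x0, [x0], h / m
    for _ in range(n - 1):
        for _ in range(m):
            k1 = logistic_rhs(x, theta1, theta2)
            k2 = logistic_rhs(x + 0.5 * dt * k1, theta1, theta2)
            k3 = logistic_rhs(x + 0.5 * dt * k2, theta1, theta2)
            k4 = logistic_rhs(x + dt * k3, theta1, theta2)
            x = x + (dt / 6.0) * (k1 + 2.0 * k2 + 2.0 * k3 + k4)
        path.append(x)
    return np.array(path)

# Example: trajectory from 1790 at the grid-center values used here as placeholders.
trajectory = solve_logistic(3.929, 0.020, 532.125, m=1)
```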

Fig. 7 Scatter plots of the observations generated from the predator-prey model, together with the \(90\,\%\) credible set lines and the true x(t) values when \(m=1\). Predictions 10 time points ahead are also drawn. The upper and lower dotted lines are the 95 and 5 % posterior quantiles, and the middle dotted line is the posterior mean. The solid line is the true value of the state x(t), and the star-shaped points are the observations

Fig. 8 The marginal posterior densities of \(\theta _1\), \(\theta _2\) and \(\sigma ^2\) in the logistic model with U.S. census data when the step parameter \(m=1\)

7 Discussion

In this paper, we proposed a posterior computation method, the LAP, based on the Laplace method and numerical approximation of the ODE. The proposed method has three advantages. First, when the dimension of the parameter is small, the computation is fast; the main limitation of the proposed method is its computation time when the dimension of \(\theta \) is high.

Second, the proposed method produces accurate estimators whose performance is comparable to or better than that of the other methods: the PC method, the DRAM, the GP-ODE and the AGM. Although it is not entirely clear, we suspect that the spline approximation to x(t) in the PC method and the GP approximation in the GP-based approaches may cause a loss of efficiency. This issue also needs further investigation.

Third, the proposed method is numerically stable. Frequentist methods need to maximize a log-likelihood surface that has many ripples. In many examples, however, the ripples of the log-likelihood surface occur at the periphery of the parameter space and disappear from the likelihood surface as the sample size n increases.

Fig. 9 Scatter plot of the U.S. census data, the \(90\%\) credible interval lines and the posterior mean when the step parameter \(m=1\). Predicted population values at 10 future time points are also drawn

Referees pointed out that lattice rules or sparse grid constructions could potentially control the computational cost of the proposed method. This is an attractive way to reduce the computation time of the LAP when q is large, but several challenges need to be overcome. For the lattice rule, it is not clear how best to transform the integration domain to optimize its performance for ODE models; the transformation must be chosen carefully, because a poor choice makes the evaluations of density ratios under the lattice rule quite unstable. For the sparse grid, the existence of negative weights prevents computing the posterior probability at each grid point: only the posterior moments can be computed. To obtain the posterior probability at each grid point, the weights would have to be positive everywhere. Furthermore, in our experiment, the estimates from the sparse grid depended heavily on the range of the grid set and on the accuracy level of the sparse grid. We applied the sparse grid construction to the proposed method for the predator-prey system, using the Gauss-Legendre quadrature rule on [0, 1] with accuracy level 10; the integration domain was mapped by a linear transformation onto the same domain as in Sect. 5.3. In this setting, we obtained the mean value \((2.868, 1.281, 1.969, 1.225, 2.696, 0.116, 0.362)^T\) for \(\theta \). The estimated mean and other moments were quite unstable with respect to the choice of the domain and the accuracy level.
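To make the "moments only" point concrete, the following one-dimensional sketch estimates a posterior mean by Gauss-Legendre quadrature after a linear transformation of the domain. It is a product-rule illustration under our own naming, not the Smolyak sparse-grid construction used in the experiment, and `log_post` is assumed to accept a vector of parameter values.

```python
import numpy as np

def quad_posterior_mean(log_post, lo, hi, num_nodes=10):
    """Posterior mean of a scalar parameter by Gauss-Legendre quadrature on [lo, hi]."""
    z, w = np.polynomial.legendre.leggauss(num_nodes)   # nodes and weights on [-1, 1]
    t = lo + (hi - lo) * (z + 1.0) / 2.0                 # linear map of the nodes to [lo, hi]
    w = w * (hi - lo) / 2.0                              # rescale the weights accordingly
    lp = log_post(t)
    p = np.exp(lp - lp.max())                            # unnormalized posterior values
    return np.sum(w * t * p) / np.sum(w * p)             # ratio of two quadrature sums
```

With Gauss-Legendre weights all terms \(w_k p_k\) are nonnegative and can be normalized, whereas with sparse-grid weights some of them are negative, which is why only moments, and not per-node posterior probabilities, are available.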

Table 7 Posterior summaries with \(m=1\) for U.S. census data

Although we concluded that these problems are not easy to get around, the lattice rule and the sparse grid are interesting ideas for enhancing the practical use of our LAP inference. We therefore decided that applying the lattice rule or the sparse grid to LAP inference deserves a separate investigation and publication.