
1 Introduction

Pharmacokinetic/pharmacodynamic (PK/PD) modeling is often performed using nonlinear mixed effects models based on deterministic ordinary differential equations (ODEs) [24]. The ODE models the dynamics of the system as

$$\begin{aligned} \frac{dX}{dt} &= f(X\left( t\right) ,t) \\ y_k &= X\left( t_k\right) + e_k , \end{aligned}$$

where X(t) is the state of the system, \(f(\cdot )\) the model, \(y_k\) the discrete observations, and \(e_k\) the measurement errors, which are assumed independent and identically distributed (iid) Gaussian. Given an initial value, the solution X(t) to the ODE is a perfect prediction of all future values. The ODE model is an input–output model, where the residuals are the differences between the solution to the ODE and the observations. In the population setup, this implies that the total variation in data for a population of individuals is split into inter- and intraindividual variation. However, due to the ODE framework, the intraindividual variation can only come from the iid residuals, i.e., there must be no autocorrelation in the residuals.

The ODE framework is built on the assumption that future values of the states X(t) can be predicted exactly and that the residual error is independent of the prediction horizon. This is often too simplistic and implies that the uncertainty about future values of the states and observations is not adequately described. This, in turn, has consequences for the design of model-based controllers and for the planning of medical treatments in general.

The ODE-based model class has a restricted residual error structure, as it assumes serially uncorrelated prediction residuals. There are several reasons why this assumption may be violated: (1) misspecification or approximation of the structural model due to the complexity of the biological system, (2) unrecognized inputs, and (3) unpredictable random behavior of the process due to measurement errors in the input variables (e.g., the specification of meals or physical exercise; both factors are known to influence future values of the blood glucose). In addition to these issues, the intraindividual (residual) variability also accounts for various environmental errors such as those associated with assay, dosing, and sampling. Since most of these errors cannot be considered uncorrelated measurement errors, the description of the total individual error should preferably be separated into components (see also [7, 13]). Furthermore, [8] describe three types of residual error models for population PK/PD data analysis to account for more complicated residual error structures.

Neglecting the correlated residuals in the model description not only leads to serious issues when the model is used for forecasting and control, as mentioned above, but also rules out the use of proper methods for statistical model validation, parameter testing, and model identification (see e.g., [16], pp. 46–47).

In this chapter, stochastic differential equations (SDEs) are introduced to address these issues. SDEs make it possible to split the intraindividual error into two fundamentally different types: (1) serially uncorrelated measurement error, typically caused by assay error, and (2) system error, caused by model and input misspecifications. The concept will first be studied for a single subject and later in a mixed effects setup with a population of individuals.

The use of SDEs opens up new tools for model development, as it quantifies the amount of system and measurement noise. Specifically, the approach allows unknown inputs and parameters to be tracked over time by modeling them as random walk processes. These principles lead to efficient methods for pinpointing model deficiencies and, subsequently, for identifying model improvements. The SDE approach also provides methods for model validation. This modeling framework is often called gray box modeling [25].

In this study, we will use maximum likelihood techniques for parameter estimation and model identification, both for a single subject and in the population setting. It is known that parameter estimation in nonlinear mixed effects models with SDEs is most effectively carried out by considering an approximation to the population likelihood function. The population likelihood function is approximated using the first-order conditional estimation (FOCE) method, which is based on a second-order Taylor expansion of each individual likelihood function at its optimum; see [17]. As in [12], the extended Kalman filter is used for evaluating the single subject likelihood function.

This algorithm introduces a two-level numerical optimization: not only must the population likelihood function be maximized, but for each evaluation of the population likelihood all the individual likelihood functions must be maximized as well. This makes estimation computationally demanding, but the algorithm can be parallelized in several places to reduce the estimation time. The method is implemented in the R package CTSM-R (continuous time stochastic modeling in R) [2], which is used in the DIACON project [3], focusing on technologies for semi- and fully automatic insulin administration for the treatment of type 1 diabetes. The project takes advantage of the fact that the SDE approach provides probabilistic forecasts of future values of the system states, which is crucial for reliable semi- and fully automatic (closed-loop) insulin administration using model predictive control.

Section 2 describes various scenarios for data (single subject, repeated experiments, and populations of subjects), and how the likelihood function is formulated for each of these scenarios. Section 3 describes the approach used for population data from an experiment conducted in DIACON. Some practical issues related to SDE-based modeling are discussed in Sect. 4, and finally Sect. 5 summarizes. Both simulated and real-life experimental data are used throughout the chapter for illustrating the modeling and prediction framework.

2 Data and Modeling

Experiments can be conducted in various ways, and the appropriate modeling approach depends on this. The basic case is a single experiment (solid ellipse in Fig. 1), which results in a series of data points \(\mathscr {Y}\) sampled, possibly irregularly, at times \(t_1<t_2<\dots <t_N\). This single time series and how it is modeled are described in Sect. 2.1. Repeating the same experiment multiple times (dashed ellipse in Fig. 1) may be modeled as independent data series, assuming no random effects between the runs. This is described in Sect. 2.2. When an experiment is carried out on several subjects (dotted ellipse in Fig. 1), it is normal to include random effects between them. This is the so-called population extension, which is described in Sect. 2.3. In addition to the structure of the data, prior information may be available or used as a modeling technique. This is described in Sect. 2.4.

Fig. 1 A scenario of experiments in a study. The solid ellipse is a single time series trial. The dashed ellipse is a collection of three possibly independent repeated trials of subject 1. The dotted ellipse is a collection of subjects with random variation from a population

2.1 Single Data Series

This section begins by introducing a fundamental framework describing how to model physical phenomena. The aim is to provide a probabilistic model for a discrete time series \(\mathscr {Y}_N = \{Y_1,Y_2,\dots ,Y_N\}\). The formulation in this section is a general framework which is useful for all types of correlated time series data, not just physiological data.

The natural extension to the ODE framework is SDEs. We begin by introducing the stochastic process \(\mathbf {x}_t\), which satisfies the Itô SDE

$$\begin{aligned} d\mathbf {x}_t = \mathbf {f}\left( \mathbf {x}_t, \mathbf {u}_t, t, \mathbf {\theta } \right) dt + \mathbf {\sigma }\left( \mathbf {x}_t, \mathbf {u}_t, t, \mathbf {\theta } \right) d\mathbf {\omega }_t\ , \end{aligned}$$
(1)

where \(\mathbf {x}_t\) is the state, \(\mathbf {u}_t\) an exogenous input, and \(\mathbf {\theta }\) the parameters of the model. \(\mathbf {f}(\cdot )\) and \(\mathbf {\sigma }(\cdot )\) are possibly nonlinear functions called the drift and diffusion terms, and \(\mathbf {\omega }_t\) is the Wiener process driving the stochastic part of the process. Equation (1) describes the dynamics and is called the system equation. Note that the ODE model is recovered from the SDE by removing the diffusion term \(\mathbf {\sigma }\left( \mathbf {x}_t, \mathbf {u}_t, t, \mathbf {\theta } \right) d\mathbf {\omega }_t\).

The solution to the SDE (1) is in general unknown, except for linear and a few other SDEs. Many methods for solving SDEs have been proposed, e.g., Hermite expansions, simulation-based methods, and Kalman filtering; see [4]. This chapter focuses on the Kalman filter as used in CTSM-R. The Kalman filter restricts the diffusion to be independent of the states, because the approximations required to integrate an SDE with state-dependent diffusion give undesirable results or poor performance. However, some SDEs with state-dependent diffusion can be transformed to an SDE with unit diffusion by the Lamperti transform (see Sect. 4.1).

The stochastic process is observed discretely and possibly partially with independent noise via the measurement equation

$$\begin{aligned} \mathbf {y}_k = \mathbf {h}\left( \mathbf {x}_k, \mathbf {u}_k, t_k, \mathbf {\theta }, \mathbf {e}_k \right) , \end{aligned}$$
(2)

where \(\mathbf {h}(\cdot )\) is a possibly nonlinear function of the states and inputs, and \(\mathbf {e}_k\) is an independent noise term attributed to the imperfect measurements. Because CTSM-R relies on the Kalman filter, the measurement model is restricted to additive noise

$$\begin{aligned} \mathbf {y}_k = \mathbf {h}\left( \mathbf {x}_k, \mathbf {u}_k, t_k, \mathbf {\theta } \right) + \mathbf {e}_k, \end{aligned}$$
(3)

where \(\mathbf {e}_k\) is Gaussian, \(\mathbf {e}_k \sim \mathscr {N} (0,\mathbf {S}(\mathbf {u}_k,t_k))\).

The combination of (1) and (3) is the state space model formulation used in this chapter to understand data. It is a gray box model, as it bridges the gap between data-driven black box models and purely physical white box models.

Example 1

As an example to illustrate the methods, we will use a simulation example (see Fig. 2). A linear three-compartment transport model [15], similar to the real-data modeling example presented in Sect. 3, is used. We can think of the response (y) as the venous glucose concentration in the blood of a patient, and the input (u) as exogenous glucagon.

The data are simulated according to the model

$$\begin{aligned} d\mathbf {x}_t &= \left( \left[ \begin{matrix} u_t\\ 0 \\ 0 \end{matrix}\right] + \left[ \begin{matrix} -k_a & 0 & 0 \\ k_a & -k_a & 0 \\ 0 & k_a & -k_e \end{matrix}\right] \mathbf {x}_t \right) dt + \left[ \begin{matrix} \sigma _{1} & 0 & 0 \\ 0 & \sigma _{2} & 0 \\ 0 & 0 & \sigma _{3} \end{matrix}\right] d\mathbf {\omega }_t \end{aligned}$$
(4)
$$\begin{aligned} y_k &= \left[ \begin{matrix} 0&0&1\end{matrix}\right] \mathbf {x}_{t_k} + e_k , \end{aligned}$$
(5)

where \(\mathbf {x}\in \mathbb {R}^3\), \(e_k\sim \mathscr {N}(0,s^2)\), \(t_k=\{1,11,21,\ldots \}\), and the specific parameters (\(\mathbf {\theta }\)) used for simulation are given in Table 1 (first column).

Fig. 2 Simulated data for the example (Eqs. (4), (5), and Table 1)
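For concreteness, data of this kind can be generated in outline with a simple Euler–Maruyama scheme. The sketch below is illustrative only: the parameter values, input profile, and initial state are placeholders, not the true values from Table 1.

```r
set.seed(1)
## Illustrative placeholder values (not the parameters from Table 1)
ka <- 0.05; ke <- 0.05              # rate constants [1/min]
sigma <- c(0.1, 0.1, 0.1)           # diffusion standard deviations
s  <- 2                             # observation noise standard deviation
dt <- 0.1                           # Euler-Maruyama time step [min]
tt <- seq(0, 400, by = dt)
u  <- ifelse(tt >= 30 & tt < 90, 10, 0)   # hypothetical input profile

A <- matrix(c(-ka,   0,   0,
               ka, -ka,   0,
                0,  ka, -ke), nrow = 3, byrow = TRUE)
x <- matrix(0, nrow = length(tt), ncol = 3)
x[1, ] <- c(0, 0, 100)              # illustrative initial state
for (i in 2:length(tt)) {
  drift  <- c(u[i - 1], 0, 0) + A %*% x[i - 1, ]       # cf. (4)
  x[i, ] <- x[i - 1, ] + drift * dt + sigma * rnorm(3, sd = sqrt(dt))
}
## Observe the third state every 10 min with additive noise, cf. (5)
idx <- seq(1, length(tt), by = 10 / dt)
y   <- x[idx, 3] + rnorm(length(idx), sd = s)
```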

The structure of the model (4) will of course usually be hidden, and we have to identify the structure based on the measurements shown in Fig. 2. As a general principle, simple models are preferred over more complex ones, and therefore a first hypothesis could be (Model 1)

$$\begin{aligned} dx_t &= \left( u_t - k_e x_t \right) dt + \sigma _{3}\, d\omega _t \end{aligned}$$
(6)
$$\begin{aligned} y_k &= x_{t_k} + e_k. \end{aligned}$$
(7)
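Before turning to estimation, it is worth seeing how such a model is specified in practice. The following is a minimal CTSM-R sketch of Model 1, assuming the reference-class interface of the ctsmr package (ctsm(), addSystem, addObs, setVariance, addInput, setParameter, and estimate, as in the package user guide); all initial values and bounds are illustrative assumptions.

```r
library(ctsmr)

model <- ctsm()
## System equation (6); the diffusion and variance parameters are wrapped in
## exp() so that they are estimated in the log domain
model$addSystem(dx ~ (u - ke * x) * dt + exp(sigma3) * dw1)
## Observation equation (7) with additive Gaussian noise
model$addObs(y ~ x)
model$setVariance(y ~ exp(s))
model$addInput(u)
## Illustrative initial values and bounds (placeholders)
model$setParameter(x      = c(init = 100,  lb = 0,    ub = 500))
model$setParameter(ke     = c(init = 0.05, lb = 1e-4, ub = 1))
model$setParameter(sigma3 = c(init = 0,    lb = -10,  ub = 10))
model$setParameter(s      = c(init = 0,    lb = -10,  ub = 10))

## 'data' is assumed to be a data.frame with columns t, y, and u
fit <- model$estimate(data)
summary(fit)
```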

In this approach, the estimation is based on the likelihood function as defined in the following section.

2.1.1 Likelihood

Given a sequence of measurements

$$\begin{aligned} \mathscr {Y}_N = [\mathbf {y}_0, \mathbf {y}_1, \dots , \mathbf {y}_k, \dots , \mathbf {y}_N] , \end{aligned}$$
(8)

the likelihood of the unknown parameters \(\theta \) given the model formulated as (1)–(3) is the joint probability density function (pdf)

$$\begin{aligned} L(\mathbf {\theta }, \mathscr {Y}_N) = {\mathrm {p}} ( \mathscr {Y}_N \vert \mathbf {\theta } ), \end{aligned}$$
(9)

where the likelihood L is the joint probability density function of the data considered as a function of \(\mathbf {\theta }\). By successive conditioning, the joint probability density function is partitioned into a product of one-step conditional probability densities

$$\begin{aligned} L(\mathbf {\theta }, \mathscr {Y}_N) = \left( \prod _{k=1}^{N} {\mathrm {p}} \left( \mathbf {y}_k \vert \mathscr {Y}_{k-1}, \mathbf {\theta } \right) \right) {\mathrm {p}} (\mathbf {y}_0 \vert \mathbf {\theta }). \end{aligned}$$
(10)

The solution to a linear SDE driven by a Brownian motion is a Gaussian process. Nonlinear SDEs do not result in a Gaussian process, and thus the marginal probability is not Gaussian. However, if the system is sampled fast enough relative to the nonlinearities, it is reasonable to assume that the conditional densities are Gaussian.

The Gaussian density is fully described by the first- and second-order moments

$$\begin{aligned} \hat{\mathbf {y}}_{k \vert k-1} &= E \left[ \mathbf {y}_k \vert \mathscr {Y}_{k-1}, \mathbf {\theta } \right] \end{aligned}$$
(11)
$$\begin{aligned} \Sigma _{k \vert k-1} &= V \left[ \mathbf {y}_k \vert \mathscr {Y}_{k-1}, \mathbf {\theta } \right] . \end{aligned}$$
(12)

Introducing the innovation error

$$\begin{aligned} \mathbf { \varepsilon }_k = \mathbf {y}_k - \hat{\mathbf {y}}_{k \vert k-1} , \end{aligned}$$
(13)

the likelihood (10) becomes

$$\begin{aligned} L(\mathbf {\theta }, \mathscr {Y}_N) = \left( \prod _{k=1}^{N} \frac{ {\mathrm {exp}} \left( -\frac{1}{2} \varepsilon _k^T \Sigma _{k \vert k-1}^{-1} \varepsilon _k \right) }{\sqrt{ \vert \Sigma _{k \vert k-1} \vert } \sqrt{2 \pi }^l} \right) {\mathrm {p}} (\mathbf {y}_0 \vert \mathbf {\theta }). \end{aligned}$$
(14)

The probability density of the initial observation \(p(y_0 \vert \mathbf {\theta })\) is parameterized through the probability density of the initial state \(p(x_0 \vert \mathbf {\theta })\). The mean \(\hat{\mathbf {y}}_{k \vert k-1}\) and covariance \(\Sigma _{k \vert k-1}\) are computed recursively using the extended Kalman filter, see Appendix A for a brief description, or [10] for a detailed description.

The unknown parameters are estimated by maximizing the likelihood function using an optimization algorithm. The likelihood (14) is a product of many small probability densities, which can cause numerical underflow. Taking the logarithm of the likelihood (14) turns the product into a sum and cancels the exponentials, thus stabilizing the computation. The parameters are then found by maximizing the log-likelihood or, by convention, by minimizing the negative log-likelihood

$$\begin{aligned} \hat{\mathbf {\theta }} = \arg \min _{\mathbf {\theta } \in \varTheta } \left( -{\mathrm {ln}} \left( L(\mathbf {\theta }, \mathscr {Y}_N) \right) \right) . \end{aligned}$$
(15)
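For a scalar observation series, the negative log-likelihood in (14)–(15) reduces to a sum over the one-step innovations and their variances. A minimal sketch, assuming the innovations and variances have already been produced by a (possibly extended) Kalman filter for a given \(\theta \):

```r
## Negative log-likelihood from one-step predictions (scalar observations), cf. (14)-(15)
## eps: innovations y_k - yhat_{k|k-1};  Sig: one-step prediction variances
nll <- function(eps, Sig) 0.5 * sum(log(2 * pi * Sig) + eps^2 / Sig)

## In practice eps and Sig depend on theta through the (extended) Kalman filter,
## and the estimate is found numerically, e.g. (nll_of_theta is hypothetical):
## fit <- optim(theta0, function(theta) nll_of_theta(theta, data), hessian = TRUE)
```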

The uncertainty of the maximum likelihood parameter estimate \(\hat{\mathbf {\theta }}\) is related to the curvature of the likelihood function. An estimate of the asymptotic covariance of \(\hat{\mathbf {\theta }}\) is the inverse of the observed Fisher information matrix

$$\begin{aligned} V\left[ \hat{\mathbf {\theta }} \right] = \left[ \mathbf {I} \left( \hat{\mathbf {\theta }} \right) \right] ^{-1}, \end{aligned}$$
(16)

where \(\mathbf {I} \left( \hat{\mathbf {\theta }} \right) \) is the observed Fisher information matrix, that is, the negative Hessian matrix (curvature) of the log-likelihood function evaluated at the maximum likelihood estimate [17, 21].

Example 2

We continue with the simulated data from Example 1. As noted above, a first approach to modeling the data could be a one-state model (Eqs. (6)–(7)). The result of the estimation (\(\hat{\mathbf {\theta }}_1\)) is given in Table 1. The initial value of the state (\(x_{30}\)) and the time constant (\(1/k_e\)) are both captured quite well, while the uncertainty parameters are far off: the diffusion is too large, and the observation variance is too small (with extremely large uncertainty).

The parameters in the model are all assumed to be greater than zero, and it is therefore advisable to estimate the parameters in the \(\log \)-domain and transform back to the original domain before presenting the estimates. The log-domain estimation also explains the nonsymmetric confidence intervals in Table 1: the confidence intervals are all based on the Hessian of the likelihood at the optimal parameter values, i.e., Wald confidence intervals computed in the transformed (log) domain [21]. Such intervals could be refined using profile likelihood-based confidence intervals [21] (see also Sect. 4.4).
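A sketch of the corresponding computation, assuming the negative log-likelihood has been minimized with optim(..., hessian = TRUE) over log-transformed parameters:

```r
## 'fit' is assumed to come from optim(..., hessian = TRUE) on the negative
## log-likelihood, with the parameters estimated as log(theta)
V  <- solve(fit$hessian)          # asymptotic covariance of log(theta), cf. (16)
se <- sqrt(diag(V))               # standard errors in the log domain
ci <- exp(cbind(lower    = fit$par - 1.96 * se,
                estimate = fit$par,
                upper    = fit$par + 1.96 * se))  # nonsymmetric in the natural domain
```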

Table 1 Parameter estimates from the simulation example; confidence intervals for the individual parameters are given in parentheses below the estimates

In order to validate the model and suggest further development, we should inspect the innovation error. When the model is not time homogeneous, the standard error of the prediction will not be constant and the innovation error should be standardized

$$\begin{aligned} r_k = \frac{\varepsilon _k}{\sqrt{\Sigma _{k|k-1}}}, \end{aligned}$$
(17)

where the innovation error (\(\varepsilon _k\)) is given in (13). All numbers needed to calculate the standardized residuals can be obtained directly from CTSM-R using the function predict, as sketched below. Both the autocorrelation and partial autocorrelation (Fig. 3) are significant at lags 1 and 2. This suggests a two-state model for the innovation error, and hence that a three-state model should be used. Consequently, we can go directly from the one-state model to the true structure (a three-state model).
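A minimal sketch of this residual analysis in R, assuming the innovations eps and one-step prediction variances Sig have been extracted from the output of predict (the exact element names depend on the fitted model):

```r
## Standardized one-step residuals, cf. (17)
r <- eps / sqrt(Sig)
acf(r)    # autocorrelation, cf. Fig. 3
pacf(r)   # partial autocorrelation
```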

Fig. 3 Autocorrelation and partial autocorrelation from the simple (one-state) model

Fig. 4 Autocorrelation and partial autocorrelation from the three-state model (i.e., the correct model)

So far we have assumed that a number of the parameters are actually zero; in a real-life situation, we might test these assumptions using likelihood ratio tests, or indeed identify the parameters through engineering principles. The parameter estimates are given in Table 1 (\(\hat{\theta }_3\)). In this case, the diffusion parameter (\(\sigma _3\)) has an extremely wide confidence interval, and it could be checked whether this parameter should indeed be zero (again using a likelihood ratio test), but for now we proceed with the residual analysis, which is an important part of model validation (see e.g., [16]). The autocorrelation and partial autocorrelation for the three-state model are shown in Fig. 4. No values fall outside the 95 % confidence interval, so there is no evidence against the hypothesis of white noise residuals, i.e., the model sufficiently describes the data.

Fig. 5 Simulation with models 2 and 3. Dashed gray line: expectation of model 2; black line: expectation of model 3; light gray area: 95 % prediction interval for model 2; dark gray area: 95 % prediction interval for model 3; black dots: observations

Autocorrelations and partial autocorrelations are based on short-term predictions (in this case 10 min) and hence check the local behavior of the model. Depending on the application, we might also be interested in the longer-term behavior of the model. Predictions can be made on any horizon using CTSM-R. In particular, we can perform a deterministic simulation in CTSM-R (meaning conditioning only on the initial value of the states). Such a simulation plot is shown in Fig. 5, where we compare a two-state model (see Table 1) with the true three-state model. It is quite evident that model 2 is not suited for simulation, its global structure being completely off, while “simulation” with the three-state model (the true structure, but estimated parameters) gives narrow and reasonable simulation intervals. For linear SDE models with linear observation equations, this “simulation” is exact, but for nonlinear models it is recommended to use real simulations, e.g., using an Euler scheme.

The step from a two-state model (had we initialized our model development with one) to the three-state model is not at all trivial. However, Fig. 5 shows that the simulation with model 2 does not contain the observations, and thus model 2 is not well suited for simulation. The likelihood ratio test (or AIC/BIC) also supports that model 3 is far better than model 2. Furthermore, it would be reasonable to fix \(\sigma _3\) at zero (in practice, a very small number).

2.2 Independent Data Series

An experiment may be repeated several times without expecting variation in the underlying parameters. Given S sequences of possibly varying length

$$\begin{aligned} \mathbf {Y} = \left[ \mathscr {Y}_{N_1}^1, \mathscr {Y}_{N_2}^2, \dots , \mathscr {Y}_{N_i}^i, \dots , \mathscr {Y}_{N_S}^S \right] , \end{aligned}$$
(18)

the likelihood is the product of the likelihood (10) for each sequence

$$\begin{aligned} L(\mathbf {\theta }, \mathbf {Y}) = \prod _{i=1}^{S} \left( \left( \prod _{k=1}^{N_i} \frac{ {\mathrm {exp}} \left( -\frac{1}{2} \varepsilon _k^T \Sigma _{k \vert k-1}^{-1} \varepsilon _k \right) }{\sqrt{ \vert \Sigma _{k \vert k-1} \vert } \sqrt{2 \pi }^l} \right) {\mathrm {p}} (\mathbf {y}_{0,i} \vert \mathbf {\theta }) \right) . \end{aligned}$$
(19)

The unknown parameters are again estimated by minimizing the negative log-likelihood

$$\begin{aligned} \hat{\mathbf {\theta }} = \arg \min _{\mathbf {\theta } \in \varTheta } \left( -{\mathrm {ln}} \left( L(\mathbf {\theta }, \mathbf {Y}) \right) \right) . \end{aligned}$$
(20)

If the independence assumption is violated, or the parameters vary between the time series, model performance suffers, as the parameter estimates become a compromise across the series. The natural extension is then to include a population effect.
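In code, the extension from one series to S independent series is just a sum of negative log-likelihoods, cf. (19)–(20). A sketch, where nll_single is a hypothetical function returning the negative log-likelihood of one series (e.g., via the Kalman filter):

```r
## Joint negative log-likelihood for S independent series, cf. (19)-(20)
nll_joint <- function(theta, series_list) {
  sum(vapply(series_list, function(s) nll_single(theta, s), numeric(1)))
}
## fit <- optim(theta0, nll_joint, series_list = series_list, hessian = TRUE)
```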

2.3 Population Extension

The gray box model can be extended with a hierarchical structure to model variation occurring between data series, where each series has its own parameter set. This is useful for describing data from a number of individuals belonging to a population. Hierarchical modeling is also called mixed effects modeling or, in pharmaceutical science, the population approach. Nonlinear mixed effects modeling has long been used in pharmacokinetic/pharmacodynamic studies to account for variation arising from natural groupings: multiple centers, multiple days, age and BMI of subjects, etc. Mixed effects modeling combines fixed and random effects [17]. The fixed effect is the average of an effect over the entire population, while the random effect allows for variation around that average.

Consider N subjects in a clinical study. This is a single level grouping. The model for the ith subject is

$$\begin{aligned} d\mathbf {x}_{i,t} &= \mathbf {f}\left( \mathbf {x}_{i,t}, \mathbf {u}_{i,t}, t, \mathbf {\theta }_i \right) dt + \mathbf {\sigma }\left( \mathbf {u}_{i,t}, t, \mathbf {\theta }_i \right) d\mathbf {\omega }_t \end{aligned}$$
(21)
$$\begin{aligned} \mathbf {y}_{i,k} &= \mathbf {h}\left( \mathbf {x}_{i,k}, \mathbf {u}_{i,k}, t_{i,k}, \mathbf {\theta }_i \right) + \mathbf {e}_{i,k}, \end{aligned}$$
(22)

which is the general model extended with subscript i. The individual parameters \(\mathbf {\theta }_i\) are

$$\begin{aligned} \mathbf {\theta }_i = z(\mathbf {\theta }_f, \mathbf {Z}_i, \mathbf {\eta }_i), \end{aligned}$$
(23)

where z maps the subject covariates \(\mathbf {Z}_i\) (e.g., BMI and age), the fixed effects parameters \(\theta _f\), and the random effects \(\eta _i \in \mathbb {R}^k\), \(\eta _i \sim \mathscr {N}(0,\varOmega )\), to the subject parameters. The subject parameters are typically modeled as either normally or log-normally distributed by combining the fixed effect parameters and the random effects in either an additive form \(\theta _i = \theta _f + \eta _i\) or an exponential transform \(\theta _i = \theta _f \mathrm e^{\eta _i}\).

The likelihood of the fixed effects is the product of the marginal probability densities for each subject

$$\begin{aligned} L(\mathbf {\theta }_f,\varOmega ) = \prod _{i=1}^N p(\mathscr {Y}_i \vert \mathbf {\theta }_f,\varOmega ), \end{aligned}$$
(24)

where the marginal density is found by integrating over the random effects \(\mathbf {\eta }_i\)

$$\begin{aligned} p(\mathscr {Y}_i \vert \mathbf {\theta }_f,\varOmega ) = \int p_1(\mathscr {Y}_i \vert \mathbf {\theta }_f, \mathbf {\eta }_i ) p_2(\mathbf {\eta }_i \vert \varOmega ) d\mathbf {\eta }_i. \end{aligned}$$
(25)

\(p_1(\mathscr {Y}_i \vert \mathbf {\theta }_f, \mathbf {\eta }_i)\) is the likelihood of the data for the individual subject, given by (10). \(p_2(\mathbf {\eta }_i \vert \varOmega )\) is the density of the second-stage model, where the random effects describe the interindividual variation.

2.3.1 Approximation of the Marginal Density

The integral in (25) rarely has a closed-form solution and must therefore be approximated in a computationally feasible way. This can be done in two ways: approximating (a) the integrand by the Laplace approximation or (b) the entire integral by Gaussian quadrature.

Gaussian quadrature approximates the integral by a weighted sum of the integrand evaluated at specific nodes. The accuracy of Gaussian quadrature increases as the order (number of nodes) increases. With adaptive Gaussian quadrature, the accuracy can be improved even further at a higher cost. However, the computational complexity of Gaussian quadrature suffers from the curse of dimensionality and becomes infeasible beyond a few dimensions.

Now consider the Laplace approximation, which is a widely used approximation of such integrals [17]. Observe that the integrand in (25) is nonnegative, such that

$$\begin{aligned} p_1(\mathscr {Y}_i \vert \mathbf {\theta }_f, \mathbf {\eta }_i ) p_2(\mathbf {\eta }_i \vert \varOmega ) &= e^{\log \left( p_1(\mathscr {Y}_i \vert \mathbf {\theta }_f, \mathbf {\eta }_i ) p_2(\mathbf {\eta }_i \vert \varOmega ) \right) } \nonumber \\ &= e^{g_i \left( \mathbf {\eta }_i \right) } , \end{aligned}$$
(26)

where \(g_i \left( \mathbf {\eta }_i \right) \) is the log-posterior distribution for the ith subject. Now consider the second-order Taylor expansion of \(g_i \left( \mathbf {\eta }_i \right) \) around its mode \(\hat{\mathbf {\eta }}_i\)

$$\begin{aligned} g_i \left( \mathbf {\eta }_i \right) \approx g_i \left( \hat{\mathbf {\eta }}_i \right) + \frac{1}{2} \left( \mathbf {\eta }_i - \hat{\mathbf {\eta }}_i \right) ^T \varDelta g_i \left( \hat{\mathbf {\eta }}_i \right) \left( \mathbf {\eta }_i - \hat{\mathbf {\eta }}_i \right) , \end{aligned}$$
(27)

since \(\nabla g_i \left( \hat{\mathbf {\eta }}_i \right) = 0\) at the mode. By inserting (27) and (26) in (25), the Laplace approximation of the marginal probability density is obtained as

$$\begin{aligned} p(\mathscr {Y}_i \vert \mathbf {\theta }_f,\varOmega ) &\approx \int e ^ { g_i \left( \hat{\mathbf {\eta }}_i \right) + \frac{1}{2} \left( \mathbf {\eta }_i - \hat{\mathbf {\eta }}_i \right) ^T \varDelta g_i \left( \hat{\mathbf {\eta }}_i \right) \left( \mathbf {\eta }_i - \hat{\mathbf {\eta }}_i \right) } d\mathbf {\eta }_i \nonumber \\ &= e^{g_i \left( \hat{\mathbf {\eta }}_i \right) } \int e ^ { \frac{1}{2} \left( \mathbf {\eta }_i - \hat{\mathbf {\eta }}_i \right) ^T \varDelta g_i \left( \hat{\mathbf {\eta }}_i \right) \left( \mathbf {\eta }_i - \hat{\mathbf {\eta }}_i \right) } d\mathbf {\eta }_i, \end{aligned}$$
(28)

where the integral is recognized as a scaled integral of a multivariate Gaussian density with covariance \(\Sigma = (-\varDelta g(\mathbf {\eta }_i))^{-1}\). The marginal density becomes

$$\begin{aligned} p(\mathscr {Y}_i \vert \mathbf {\theta }_f,\varOmega ) \approx e^{g_i \left( \hat{\mathbf {\eta }}_i \right) } \sqrt{\frac{(2\pi )^k}{\vert {-}\varDelta g\left( \hat{\mathbf {\eta }}_i\right) \vert }}. \end{aligned}$$
(29)

Inserting (29) in (24), the likelihood becomes

$$\begin{aligned} L(\theta _f,\varOmega ) \approx \prod _{i=1}^{N} e^{g_i \left( \hat{\mathbf {\eta }}_i \right) } \sqrt{\frac{(2\pi )^k}{\vert {-}\varDelta g\left( \hat{\mathbf {\eta }}_i\right) \vert }}. \end{aligned}$$
(30)
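The quality of the Laplace approximation (29) is easy to inspect in one dimension, where the exact integral can be computed numerically. In the sketch below, the two densities are hypothetical stand-ins for \(p_1\) and \(p_2\); since both happen to be Gaussian, the approximation is exact in this particular case.

```r
## One-dimensional check of the Laplace approximation (29).
## p1: density of one "observation" given eta; p2: random-effect density (both hypothetical)
g <- function(eta) dnorm(1.2, mean = eta, sd = 0.5, log = TRUE) +
                   dnorm(eta, mean = 0, sd = 1, log = TRUE)
exact <- integrate(function(e) exp(g(e)), -Inf, Inf)$value   # exact marginal

opt     <- optimize(function(e) -g(e), interval = c(-10, 10))
eta_hat <- opt$minimum                                        # mode of g
h    <- 1e-4
curv <- (g(eta_hat + h) - 2 * g(eta_hat) + g(eta_hat - h)) / h^2  # second derivative (< 0)
laplace <- exp(g(eta_hat)) * sqrt(2 * pi / (-curv))           # cf. (29) with k = 1
c(exact = exact, laplace = laplace)  # agree here, since both densities are Gaussian
```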

The Hessian \(\varDelta g(\hat{\mathbf {\eta }}_i)\) is found by analytically differentiating the expression for the log-posterior \(g_i(\mathbf {\eta }_i)\). After some derivation, the Hessian is

$$\begin{aligned} \varDelta g(\mathbf {\eta }_i)&= \sum _{k=1}^N \left[ \frac{\partial ^2 \mathbf {y}^T}{\partial \mathbf {\eta }_i \partial \mathbf {\eta }_i} \Sigma ^{-1}_{k \vert k-1}\left( \mathbf {y}_k - \hat{\mathbf {y}}_{k\vert k-1} \right) + 2 \frac{\partial \hat{\mathbf {y}}_{k\vert k-1}}{\partial \mathbf {\eta }_i}\frac{\partial \left[ \Sigma ^{-1}_{k \vert k-1} \right] }{\partial \mathbf {\eta }_i} \left( \mathbf {y} - \hat{\mathbf {y}}_{k\vert k-1} \right) \right. \nonumber \\&\quad -\; \frac{\partial \hat{\mathbf {y}}_{k\vert k-1}}{\partial \mathbf {\eta }_i} \Sigma ^{-1}_{k \vert k-1} \frac{\partial \hat{\mathbf {y}}_{k\vert k-1}}{\partial \mathbf {\eta }_i} - \frac{1}{2} \left( \mathbf {y} - \hat{\mathbf {y}}_{k\vert k-1} \right) \frac{\partial ^2 \left[ \Sigma ^{-1}_{k \vert k-1} \right] }{\partial \mathbf {\eta }_i \partial \mathbf {\eta }_i} \left( \mathbf {y} - \hat{\mathbf {y}}_{k \vert k-1} \right) \nonumber \\&\quad \left. +\; {\mathrm {tr}}\left( \frac{\partial \left[ \Sigma ^{-1}_{k \vert k-1} \right] }{\partial \mathbf {\eta }_i}\frac{\partial \Sigma _{k \vert k-1}}{\partial \mathbf {\eta }_i}+\Sigma ^{-1}_{k \vert k-1}\frac{\partial \Sigma _{k \vert k-1}}{\partial \mathbf {\eta }_i\partial \mathbf {\eta }_i} \right) \right] - \varOmega ^{-1}, \end{aligned}$$
(31)

where \({\mathrm {tr}}\) is the trace of a matrix. The second-derivative terms are generally complicated or inconvenient to compute. At the mode \(\hat{\mathbf {\eta }}_i\), the contribution of the second-derivative terms is usually negligible and thus an approximation for the Hessian is

$$\begin{aligned} \varDelta g(\hat{\mathbf {\eta }}_i) \approx -\sum _{k=1}^{N} \left( \left. \frac{\partial \hat{\mathbf {y}}_{k\vert k-1}}{\partial \mathbf {\eta }_i} \right| _{\mathbf {\eta }_i=\hat{\mathbf {\eta }}_i} \Sigma _{k \vert k-1}^{-1} \left. \frac{\partial \hat{\mathbf {y}}_{k\vert k-1}}{\partial \mathbf {\eta }_i}\right| _{\mathbf {\eta }_i=\hat{\mathbf {\eta }}_i} \right) - \varOmega ^{-1}. \end{aligned}$$
(32)

This approximation is similar to the Gauss–Newton approximation and to NONMEM's first-order conditional estimation (FOCE) approximation of the Hessian, where only first-order partial derivatives are included [9, 17].

The parameters are estimated by a nested optimization of the first- and second-stage models. For a trial set of fixed effect parameters, an inner optimization of \(g_i\) must be performed for every subject. When all \(\hat{\mathbf {\eta }}_i\) have been found, the Laplace or FOCE approximation can be computed to obtain the population likelihood, which is then optimized at the outer level, as sketched below.
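A schematic sketch of the nested optimization, with hypothetical function names (g_i is assumed to return the log-posterior of subject i's random effects); a production implementation such as CTSM-R adds gradients, parallelization, and careful initialization:

```r
## Nested optimization of the approximate population likelihood, cf. (30).
## k is the dimension of the random effects eta_i.
neg_pop_loglik <- function(theta_f, Omega, data_list, k) {
  nll <- 0
  for (data_i in data_list) {
    ## Inner level: mode and curvature of the random effects for this subject
    inner <- optim(rep(0, k), function(eta) -g_i(eta, theta_f, Omega, data_i),
                   method = "BFGS", hessian = TRUE)
    ## Laplace approximation (29); inner$hessian is the Hessian of -g, i.e., -Delta g
    logdet <- determinant(inner$hessian, logarithm = TRUE)$modulus
    nll <- nll - (-inner$value + 0.5 * (k * log(2 * pi) - logdet))
  }
  nll
}
## Outer level: minimize neg_pop_loglik over theta_f (and Omega) with optim()
```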

2.4 Prior Information

Bayesian analysis combines the likelihood of the data with already known information, called a prior. When the prior probability density function is updated with data, it becomes the posterior probability density function. In true Bayesian analysis, the prior may be any distribution, although conjugate priors are used in practice to simplify the computations.

In CTSM-R, priors are mainly used (a) as empirical priors or (b) for regularizing the estimation.

An empirical prior is a result from a previous estimation. Imagine an experiment has been analyzed and is then rerun. These two data series are stochastically independent and should be analyzed as in Sect. 2.2. However, using the results from the first analysis as a prior, only the new data series has to be analyzed. If the quadratic Wald approximation holds, this prior is Gaussian.

Regularizing one or more parameters is sometimes required to achieve a feasible estimation of the parameters. The state equations describe a physical phenomenon, and as such the modeler often has knowledge (possibly partly subjective) about the parameters from, e.g., another study. The reported values are often a mean and a standard deviation; thus, a Gaussian prior is reasonable.

Updating the prior probability density function \(p(\mathbf {\theta })\) forms the posterior probability density function through Bayes’ rule

$$\begin{aligned} p(\mathbf {\theta } \vert \mathscr {Y}_N) = \frac{p(\mathscr {Y}_N \vert \mathbf {\theta }) p(\mathbf {\theta }) }{p(\mathscr {Y}_N)} \propto p(\mathscr {Y}_N \vert \mathbf {\theta }) p(\mathbf {\theta }), \end{aligned}$$
(33)

where the probability density \(p(\mathscr {Y}_N \vert \mathbf {\theta })\) is proportional to the likelihood of a single data series given in (10). The case of no prior information corresponds to a diffuse prior, which is uniform over the entire domain; the posterior then reduces to the likelihood of the data.

Let the prior be described by a Gaussian distribution \(\mathscr {N} \left( \mathbf {\mu }_{\mathbf {\theta }}, \Sigma _{\mathbf {\theta }} \right) \) where

$$\begin{aligned} \mathbf {\mu }_{\mathbf {\theta }} &= E \left[ \mathbf {\theta } \right] \end{aligned}$$
(34)
$$\begin{aligned} \Sigma _{\mathbf {\theta }} &= V \left[ \mathbf {\theta } \right] , \end{aligned}$$
(35)

and let

$$\begin{aligned} \mathbf {\varepsilon }_{\mathbf {\theta }} = \mathbf {\theta } - \mathbf {\mu }_{\mathbf {\theta }}, \end{aligned}$$
(36)

then the posterior probability density function is

$$\begin{aligned} p(\mathbf {\theta } \vert \mathscr {Y}_N) \propto \left( \left( \prod _{k=1}^{N} \frac{ {\mathrm {exp}} \left( -\frac{1}{2} \varepsilon _k^T \Sigma _{k \vert k-1}^{-1} \varepsilon _k \right) }{\sqrt{ \vert \Sigma _{k \vert k-1} \vert } \sqrt{2 \pi }^l} \right) {\mathrm {p}} (\mathbf {y}_0 \vert \mathbf {\theta }) \right) \times \frac{\exp \left( -\frac{1}{2} \mathbf {\varepsilon }_{\mathbf {\theta }}^T \Sigma _{\mathbf {\theta }}^{-1} \mathbf {\varepsilon }_{\mathbf {\theta }} \right) }{ \sqrt{ \vert \Sigma _{\mathbf {\theta }} \vert } \sqrt{2 \pi }^p }. \end{aligned}$$
(37)

The parameters are estimated by maximizing the posterior density function (37), i.e., maximum a posteriori (MAP) estimation. The MAP parameter estimate is found by minimizing the negative logarithm of (37)

$$\begin{aligned} \hat{\mathbf {\theta }} = \arg \min _{\mathbf {\theta } \in \varTheta } \left( -{\mathrm {ln}} \left( p(\mathbf {\theta } \vert \mathscr {Y}_N, \mathbf {y}_0) \right) \right) . \end{aligned}$$
(38)

When there is no prior, the MAP estimate reduces to the ML estimate.
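Computationally, MAP estimation with a Gaussian prior amounts to adding a quadratic penalty to the negative log-likelihood, cf. (37)–(38). A sketch, where nll is a hypothetical negative log-likelihood function and mu, Sigma describe the prior:

```r
## MAP estimation: Gaussian prior as a quadratic penalty, cf. (37)-(38)
neg_log_posterior <- function(theta, data, mu, Sigma) {
  eps <- theta - mu
  nll(theta, data) + 0.5 * sum(eps * solve(Sigma, eps))  # eps' Sigma^{-1} eps
}
## map <- optim(theta0, neg_log_posterior, data = data, mu = mu, Sigma = Sigma)
```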

3 Example: Modeling the Effect of Exercise on Insulin Pharmacokinetics in “Continuous Subcutaneous Insulin Infusion” Treated Type 1 Diabetes Patients

The artificial pancreas is believed to substantially ease the burden of the constant management of type 1 diabetes for patients. An important aspect of artificial pancreas development is the mathematical models used for control, prediction, and simulation. A major challenge to the realization of the artificial pancreas is the effect of exercise on the insulin and plasma glucose dynamics. This example is a first step towards a population model of exercise effects in type 1 diabetes. The focus is on the effect on the insulin pharmacokinetics in continuous subcutaneous insulin infusion (CSII)-treated patients, modeling the absorption rate as a function of exercise. The example is described in detail in [5].

3.1 Data

The insulin data for this study originate from a clinical study on 12 subjects with type 1 diabetes treated with continuous subcutaneous insulin infusion (CSII). Each subject completed two study days separated by at least three weeks. Insulin was observed by drawing blood samples nonequidistantly over the course of each trial. A detailed description of the data is found in [23].

Natural consideration for the subjects limits how frequently insulin can be sampled. This limits the number of observations per time series, and careful nonequidistant sampling often becomes necessary. Both issues make parameter estimation more difficult. However, using all the subjects collectively increases the amount of data and improves estimation. The repeated trials per subject are considered independent trials, i.e., with no random variation on the parameters between trials. The subjects are assumed to have interindividual variation in several of the parameters.

Fig. 6 Illustration of a three-compartment model describing the pharmacokinetics of insulin delivered continuously from an insulin pump. Lightning bolts indicate diffusion terms

3.2 The Gray Box Insulin Model

A linear three-compartment ODE model is used as the basis for describing the pharmacokinetics of subcutaneously infused insulin in a single subject, as suggested by [26]. The model is illustrated in Fig. 6.

The absorption is characterized by the rate parameter \(k_a\), which is shared between all three compartments. The two compartments \(Isc_2\) and \(Ip\) are modeled with diffusion terms. Only the third state, \(Ip\), is observed.

The compartment model is formulated as the following SDE

$$\begin{aligned} d \begin{bmatrix} Isc_1\\ Isc_2\\ Ip \end{bmatrix} = \left( \begin{bmatrix} -k_a&0&0 \\ k_a&-k_a&0 \\ 0&\frac{k_a}{V_I}&-k_e \end{bmatrix} \begin{bmatrix} Isc_1\\ Isc_2\\ Ip \end{bmatrix} + \begin{bmatrix} 1\\ 0\\ 0 \end{bmatrix}I_{pump} \right) dt + \begin{bmatrix} 0&0&0 \\ 0&\sigma _{Isc}&0 \\ 0&0&\sigma _{Ip} \end{bmatrix} d\omega _t, \end{aligned}$$
(39)

where \(Isc_1\) [mU] and \(Isc_2\) [mU] represent the subcutaneous layer and deeper tissues, respectively, and \(Ip\) [mU/L] represents the plasma insulin concentration. \(I_{pump}\) is the input from the pump [mU/min]. \(k_a\) [\(\min ^{-1}\)] is the absorption rate, and \(k_e\) [\(\min ^{-1}\)] is the clearance rate of insulin from plasma. \(V_I\) is the volume of distribution [L]. \(\sigma _{Isc}\) and \(\sigma _{Ip}\) are the standard deviations of the diffusion processes.

The observation equation is formulated through a transformation of the third state, \(Ip\). The log transformation used here is a natural choice, since \(Ip\) is a concentration and therefore nonnegative. Transformations are discussed in Sect. 4.1. The observation equation is

$$\begin{aligned} \log (y_{k}) = \log (Ip_{k}) + e_{k} , \end{aligned}$$
(40)

where \(y_{k}\) is the observed plasma insulin concentration and \(e_{k} \sim N(0,\xi )\) is the measurement noise. The variance is further modeled as \(\xi = S_{\mathrm {min}} + S\), where \(S_{\mathrm {min}}\) is a known hardware-specific measurement error variance of the equipment [5]. Note that the measurement error is multiplicative in the natural domain of \(y_k\); this works as an approximation of a proportional error model.
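To see why, exponentiating (40) and using a first-order expansion of the exponential gives

$$\begin{aligned} y_k = Ip_k \, e^{e_k} \approx Ip_k \left( 1 + e_k \right) \quad \text {for small } e_k, \end{aligned}$$

so the standard deviation of the error scales with the level of \(Ip_k\), as in a proportional error model.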

The full gray box model is the SDE system equation (39) and the observation equation (40).

Population Parameters

The individual parameters are modeled as a combination of fixed population effects and random individual effects

$$\begin{aligned} \theta _i = h(\theta _{pop},Z_i) \cdot e^{\eta _i}, \end{aligned}$$
(41)

where \(\theta _i\) is the parameter value for individual i, \(h(\cdot )\) is a possibly nonlinear function, \(\theta _{pop}\) is the overall population parameter (fixed effect), \(Z_i\) are covariates (age, weight, gender, etc.), and \(\eta _i \sim N(0,\varOmega )\) is the individual random effect.

For this model, four parameters were modeled with a random effect. The initial values of the two subcutaneous layer states are assumed to be affected by the same variation from the population mean

$$\begin{aligned} Isc_{1_{0},i} = Isc_{1_0} \cdot e^{\eta _{i,1}} \quad Isc_{2_{0},i} = Isc_{2_0} \cdot e^{\eta _{i,1}}. \end{aligned}$$

The absorption rate \(k_a\) and the clearance rate \(k_e\) have separate random effects

$$\begin{aligned} k_{a,i} = k_a \cdot e^{\eta _{i,2}} \quad k_{e,i} = k_e \cdot e^{\eta _{i,3}}. \end{aligned}$$

The volume of distribution \(V_I\) is scaled by the weight (kg) of the subject. The weight is a covariate

$$\begin{aligned} V_{I_i} = V_I \cdot {\mathrm {weight}}_i. \end{aligned}$$

The random effects are assumed Gaussian with

$$\begin{aligned} \eta _i = \left[ \eta _{i 1}, \eta _{i 2}, \eta _{i 3}\right] \sim \mathscr {N} \left( 0, {\text {diag}}\left( \omega _{Isc} , \omega _{k_a} , \omega _{k_e} \right) \right) . \end{aligned}$$

3.3 Exercise Effects

The model is further extended by making the absorption rate \(k_a\) dependent on exercise. Two extensions are investigated.

Model A

The first extension specifies \(k_a\) as

$$\begin{aligned} k_a =\bar{k}_a + \alpha \cdot {\mathrm {Ex}}, \end{aligned}$$
(42)

where \(\bar{k}_a\) is the basal rate and \(\alpha \) is the effect of exercise. \({\mathrm {Ex}}\) is a binary input which is 1 when the subject is exercising and otherwise 0.

Model B

The subjects were exercising at two intensities and this extends (42) to

$$\begin{aligned} k_a =\bar{k}_a + \alpha _{\mathrm {mild}} \cdot {\mathrm {Ex_{mild}}} + \alpha _{\mathrm {moderate}} \cdot {\mathrm {Ex_{moderate}}}, \end{aligned}$$
(43)

where \(\bar{k}_a\) is the basal rate, and \(\alpha _{\mathrm {mild}}\) and \(\alpha _{\mathrm {moderate}}\) are the effects of mild and moderate exercise. \({\mathrm {Ex_{mild}}}\) and \({\mathrm {Ex_{moderate}}}\) are binary inputs which are 1 during mild and moderate exercise, respectively.

3.4 Model Comparison

The best model is selected by comparing the ML estimates using the likelihood ratio test, AIC, and BIC in Table 2. The base model is nested in both models A and B, and model A is nested in B. The nested models can therefore be compared with the likelihood ratio test. Both models A and B explain significantly more of the variability in the data than the base model. Model A is the preferred model based on the likelihood ratio test: the additional improvement in the likelihood with model B is not enough to justify the extra parameter. The differences in AIC and BIC between models A and B are relatively small but indicate that model B is to be preferred. The relative likelihood between models A and B is \(\exp \left( 0.5 \cdot \left( 1815 - 1817 \right) \right) = 0.37\), suggesting that model A is 37 % as probable as model B [1].

The parameter estimates for all three models are seen in Table 3. For model B, the moderate intensity exercise results in a larger absorption rate than mild exercise.

Table 2 Model comparison using the likelihood ratio test, AIC, and BIC
Table 3 Parameter estimates from the three models: base, A, and B
Fig. 7 Top: one-step predictions from model A (blue line); the observations are represented by dots; the gray area indicates the 95 % prediction interval. Middle and bottom: insulin and exercise inputs

3.5 Predictions

Of the three models tried here, model A, with a single exercise effect on the absorption rate, explains the data best. One-step predictions from model A for a single trial of one subject are shown in Fig. 7. In general, the predictions are acceptable, and the model does seem to capture the increase related to exercise. In particular, in Fig. 7 the agreement between the predictions and the observations is good. The width of the prediction interval is, however, large in this case. k-step predictions can also easily be calculated using CTSM-R and the predict function. A more detailed account of the exercise dependence analysis using population modeling is found in [5].

4 Other Topics

4.1 Transformations

In general, transformations should be applied whenever appropriate; since all inference with CTSM-R assumes Gaussian (conditional) distributions, this should be ensured by transformations. Transformations can be applied at three different levels: (1) transformations of the states, (2) transformations of the observations, and (3) transformations of the parameters. We will briefly discuss each of these types and refer the interested reader to the appropriate literature.

If there are natural restrictions of the state space, e.g., the natural state space is the positive real axis, or some interval, then these restrictions should be included in the SDE description. This implies a formulation of the form

$$\begin{aligned} dx_t=f(x_t,u_t)dt+\sigma (x_t)dw_t. \end{aligned}$$
(44)

However, the Kalman filter requires the diffusion term to be independent of the state and therefore we should apply the Lamperti transform;

$$\begin{aligned} z_t=\int \frac{d\xi }{\sigma (\xi )}\bigg |_{\xi =x_t} \end{aligned}$$
(45)

and use Itô's lemma to obtain a description where the diffusion term is independent of the state (see [18], Paper D, for a tutorial on the Lamperti transform, and [19] for a nontrivial application).
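As a standard illustration (not tied to a particular model in this chapter), consider a state-proportional diffusion \(\sigma (x_t) = \sigma x_t\). Then (45) gives \(z_t = \log x_t\), and Itô's lemma yields

$$\begin{aligned} dz_t = \left( \frac{f(x_t,u_t)}{x_t} - \frac{\sigma ^2}{2} \right) dt + \sigma \, dw_t , \qquad x_t = e^{z_t}, \end{aligned}$$

so the transformed state \(z_t\) has state-independent diffusion, and the formulation automatically respects the restriction \(x_t > 0\).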

The usual comments on transformation of the observations also apply to SDE models: the standardized residuals should have constant variance. This should be checked, and if the residuals do not have constant variance, the observations should be transformed (e.g., using a log transformation).

As already discussed in the examples in this chapter, the parameters should be estimated on the real axis (implying, e.g., a log transformation of positive parameters).

4.2 Identification

We have already seen that the autocorrelation function and the partial autocorrelation function can be used for identification. If data are not equidistantly sampled, one might instead fit linear SDE models to the residuals to identify the model order (number of states).

For nonlinear models, the usual autocorrelation function is still relevant: nonlinear dependence in the residuals will almost always include a linear component, which will appear in the autocorrelation function. It can, however, be shown that some nonlinear dependencies have no linear component, in which case the autocorrelation function will fail. Generalizations in the form of lag-dependent and partial lag-dependent functions might then be used instead [20].

Finally, identification can be based on random walk modeling, where a parameter is formulated as a random walk process and the reconstructed (smoothed) parameter is compared with state estimates and/or inputs to identify possible model extensions (see also [18], Paper F, [19], and [11]).

4.3 Simulation/Prediction Models

As we have already seen in the simulation example, misspecification of a model can lead to very poor simulation (long-term prediction) performance. A way to ensure reasonable performance in long-term predictions is to force the diffusion parameters to be small. This is done by fixing the diffusion parameters; see [14] for a discussion of simulation and multistep predictions in SDE models.

4.4 Testing and Confidence Intervals

Often, in particular in data-rich situations, the standard Wald confidence intervals, as reported directly by CTSM-R, are good approximations of the "true" confidence intervals. They are, however, approximations, and conclusions regarding individual parameters should be based on likelihood ratio tests rather than confidence intervals. In cases where models are not nested, it is recommended to use likelihood-based information criteria (AIC or BIC) for model selection.

Still, confidence intervals provide useful information that should always be reported, even when parameters are significant. But as we saw in the simulation examples, the Wald confidence interval might fail completely (e.g., for \(\sigma _3\) in Models 2 and 3). The problem is that the Wald standard error uses the local curvature of the likelihood (the Hessian) to approximate the uncertainty; if the curvature is close to zero (see Fig. 8), the variance of the parameter estimate becomes infinite (as we saw in the examples).

As an alternative, we can calculate profile likelihood confidence intervals (see Fig. 8). We will not go into detail with the calculation of such intervals, but note that the profile likelihood confidence interval is based on the same statistical properties of the likelihood ratio as the likelihood ratio test; a minimal sketch of the computation is given below. In the case of Model 3 of the simulation example, the profile likelihood confidence interval for \(\sigma _3\) is [0, 0.12], which seems much more reasonable than the values obtained from the Wald approximation. For further reading, see [17, 21].
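The sketch below fixes parameter j on a grid, re-optimizes the remaining parameters, and keeps the grid values not rejected by the likelihood ratio criterion. The function names and interface are illustrative; nll is a hypothetical negative log-likelihood function.

```r
## Profile likelihood confidence interval for parameter j (illustrative sketch)
profile_ci <- function(nll, theta_hat, j, grid, data, level = 0.95) {
  nll_min <- nll(theta_hat, data)
  prof <- sapply(grid, function(val) {
    ## Re-optimize the remaining parameters with theta[j] fixed at 'val'
    fixed_nll <- function(th_free) {
      th <- theta_hat
      th[-j] <- th_free
      th[j]  <- val
      nll(th, data)
    }
    optim(theta_hat[-j], fixed_nll)$value
  })
  ## The interval is the range of grid values passing the LR criterion
  range(grid[2 * (prof - nll_min) <= qchisq(level, df = 1)])
}
```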

Fig. 8 Profile likelihood for \(\sigma _3\) in the three-state simulation model of Examples 1–2

5 Summary

A general framework for modeling physical dynamical systems using stochastic differential equations has been demonstrated. CTSM-R is an efficient and parallelized implementation in the statistical language R. R facilitates easy data handling, visualization, and statistical tests, which are essential for any modeling task. CTSM-R uses maximum likelihood, and thus well-known techniques for model identification and selection can also be used within this framework, as demonstrated.

This chapter has demonstrated the principles using linear models with transformations. CTSM-R has also been used for a number of nonlinear problems; see e.g., [18, 22].

CTSM-R has been extended to include hierarchical modeling. A study of exercise dependence in insulin absorption was modeled with a random effect between the subjects. This is an example of the population modeling commonly used in PK/PD.

A detailed user guide and additional examples are available from http://ctsm.info.