1 Introduction

The continuous-discrete state space model is a convenient specification for the dynamic modeling of quantitative variables in continuous time, which are subject to random disturbances both in the dynamics and in the discrete process of observation (Jazwinski, 1970). In order to meet requirements of empirical data analysis, the dynamics of the state vector is given in continuous time (system equation), whereas the measurements are assumed to be given only at discrete, possibly unequally spaced time points (measurement model). Moreover, analogously to structural equation models (SEM) or factor analysis, the state is only incompletely observable and subject to measurement error.

In the linear case with gaussian random errors, the system can be estimated efficiently by maximum likelihood (ML) using the Kalman filter algorithm (cf. Jones, 1984, Harvey and Stock, 1985, Jones and Tryon, 1987, Zadrozny, 1988, Jones and Ackerson, 1990, Singer, 1993, 1995), but in nonlinear systems complicated equations for the transition density arise, which must be solved numerically. One approach is Monte Carlo simulation, where many sample trajectories are simulated and unknown probability densities and integrals are estimated from these data. In order to stay close to the linear Kalman filter, a sequence of time and measurement updates (continuous-discrete filter) is utilized, and the resulting integral expressions (expectation values) are approximated by statistical averages. To reduce the simulation error of the likelihood function, importance sampling and other variance reduction techniques (such as antithetical sampling) are used.

Simulation-based filtering methods in discrete time have been proposed in the literature, among them Markov chain Monte Carlo (MCMC; Carlin et al., 1992, Kim et al., 1998), rejection sampling using density estimators (Tanizaki, 1996, Tanizaki and Mariano, 1995, Hürzeler and Künsch, 1998), importance sampling and antithetic variables (Durbin and Koopman, 1997, 2000) and recursive bootstrap resampling (Gordon et al., 1993, Kitagawa, 1996).

In this paper the time update is generalized to the continuous time case by using the Chapman-Kolmogorov equation and importance sampling is implemented by approximate smoothing in order to reduce the variance of the simulated likelihood function. For this purpose the Gaussian sum filter of Alspach and Sorenson (1972) is used. In linear systems, the smoothing is exact and the simulation error of the likelihood estimate is zero, given the data (cf. section 7.1).

Section 2 defines the continuous-discrete state space model and section 3 presents the recursive computation of the filter densities and the likelihood. Sections 4 and 5 derive the variance reduction and its implementation by smoothing, whereas sections 6 and 7 discuss practical issues and present three examples.

2 Nonlinear continuous-discrete state space models

We discuss the nonlinear continuous-discrete state space model (Jazwinski, 1970)

$$dy(t)\;\; = \;\;f(y(t),t,\psi )dt + g(y(t),t,\psi )dW(t)$$
((1))
$${z_i}\;\; = \;\;h(y({t_i}),{t_i},\psi ) + {\epsilon_i},$$
((2))

where discrete time measurements zi are taken at times {t0, t1, …, tT}, t0 ≤ t ≤ tT. In state equation (1), the process error W(t) is an r-dimensional Wiener process and the state is described by the p-dimensional state vector y(t). It fulfils a system of stochastic differential equations (SDE) in the sense of Itô (cf. Arnold, 1974) with random initial condition y(t0) ∼ P0(y, t0) (prior distribution). The functions f : ℝp × ℝ × ℝu → ℝp and g : ℝp × ℝ × ℝu → ℝp × ℝr are called drift and diffusion coefficients, respectively. In measurement equation (2), ϵi ∼ N(0, R(ti, ψ)) is a k-dimensional discrete time white noise process (measurement error) and h : ℝp × ℝ × ℝu → ℝk is the output function. It is assumed that the error processes dW(t), ϵi and the initial state y(t0) are mutually independent. Parametric estimation is based on the u-dimensional parameter vector ψ. The key quantity for the computation of the likelihood function is the transition probability p(y, t|x, s) between states y and x at times t and s, respectively, which is a solution of the Fokker-Planck equation

$$\begin{array}{*{20}{c}} {\frac{{\partial p(y,t|x,s)}}{{\partial t}}}& = &{ - \sum\limits_i {\frac{\partial }{{\partial {y_i}}}[{f_i}(y,t,\psi )p(y,t|x,s)]\;\;\;\;\;\;\;\;\;} } \\ \;&\;&{ + \tfrac{1}{2}\sum\limits_{ij} {\frac{{{\partial ^2}}}{{\partial {y_i}\partial {y_j}}}[{\Omega _{ij}}(y,t,\psi )p(y,t|x,s)]} } \\ \;&{: = }&{F(y,t,\psi )p(y,t|x,s)\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;} \end{array}$$
((3))

subject to the initial condition p(y, s|x, s) = δ(y − x) (Dirac delta function). The symbol F(y, t, ψ) denotes the Fokker-Planck operator. The diffusion matrix is given by Ω = gg′ : ℝp × ℝ × ℝu → ℝp × ℝp. Under certain technical conditions the solution of (3) is the conditional density of y(t) given y(s) = x (see, e.g. Wong and Hajek, 1985, ch. 4).

In order to model exogenous influences, f, g, h and R are assumed to depend on deterministic regressor variables x(t): ℝ → ℝq, i.e. f(y, t, ψ) = f(y, t, x(t), ψ) etc. For notational simplicity, the dependence on x(t) and on ψ will be suppressed.

It may be noted that state space model (1, 2) allows the modeling of ARIMA systems, since unobserved higher order derivatives can be accommodated in an extended state vector \(\eta = {\rm{\{ }}y,\dot y,\ddot y, \ldots {\rm{\} }}\) (see the example below).
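For instance, a linear second order equation driven by formal white noise ζ(t) = dW(t)/dt (this is the construction underlying the AR(2) example of section 7.1, eq. (55)) is cast in form (1) by stacking y and ẏ:

$$\ddot y + a\dot y + by = \sigma \zeta (t)\quad \Leftrightarrow \quad d\left[ {\begin{array}{*{20}{c}} y \\ {\dot y} \end{array}} \right] = \left[ {\begin{array}{*{20}{c}} 0&1 \\ { - b}&{ - a} \end{array}} \right]\left[ {\begin{array}{*{20}{c}} y \\ {\dot y} \end{array}} \right]dt + \left[ {\begin{array}{*{20}{c}} 0 \\ \sigma \end{array}} \right]dW(t).$$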

Furthermore, the functions f, g and h, R may depend on earlier measurements Zt = {z(tj); tj ≤ t} and \({Z^{{t_{i - 1}}}}\; = \;{\rm{\{ }}z({t_j});{t_j}\; \le \;{t_{i - 1}}{\rm{\} }}\), respectively, which allows the modeling of (G)ARCH effects ((generalized) autoregressive conditional heteroskedasticity). For example, the diffusion matrix g(y, t) may depend on earlier innovations νj = zj − E[zj|zj−1, …, z0]; tj ≤ t, and if the functions are linear in the state y, the state space model is conditionally gaussian (cf. Liptser and Shiryayev, 1978, vol. II, ch. 13). Again, for notational simplicity, the dependence on Zt will be suppressed.

3 Computation of the likelihood function

The exact time and measurement updates of the continuous-discrete filter are given by the recursive scheme (Jazwinski, 1970) for the a priori and a posteriori densities pi+1|i, pi|i:

time update:

$$\begin{array}{*{20}{c}} {\frac{{\partial p(y,t|{Z^i})}}{{\partial t}}}& = &{F(y,t)p(y,t|{Z^i});t \in [{t_i},{t_{i + 1}}]} \\ {p(y,{t_i}|{Z^i})}&{: = }&{p({y_i}|{Z^i}): = {p_{i|i}}\;\;\;\;\;\;\;\;\;\;\;\;\;} \\ {p(y,{t_{i + 1}}|{Z^i})}&{: = }&{p({y_{i + 1}}|{Z^i}): = {p_{i + 1|i}}\;\;\;\;\;\;\;\;\;} \end{array}$$
((4))

measurement update:

$$\begin{array}{lll} {p({y_{i + 1}}{\rm{\vert}}{Z^{i + 1}})} & = & {{{p({z_{i + 1}}{\rm{\vert}}{y_{i + 1}},{Z^i})p({y_{i + 1}}{\rm{\vert}}{Z^i})} \over {p({z_{i + 1}}{\rm{\vert}}{Z^i})}}}\\{} & {: = } & {{p_{i + 1{\rm{\vert}}i + 1}}}\end{array}$$
((5))
$$p\left( {{z_{i + 1}}\left| {{Z^i}} \right.} \right) = \int {p\left( {{z_{i + 1}}\left| {{y_{i + 1}},{Z^i}} \right.} \right)p\left( {{y_{i + 1}}\left| {{Z^i}} \right.} \right)d{y_{i + 1}},} $$
((6))

i = 0, …, T − 1, where F is the Fokker-Planck operator, Zi = {z(tj)|tj ≤ ti} are the observations up to time ti and Li+1 := p(zi+1|Zi) is the likelihood function of observation zi+1. The time update describes the time evolution of the conditional density p(y, t|Zi) given information up to the last measurement, and the measurement update is a discontinuous change due to new information zi+1 using the Bayes formula. Thus the likelihood of the complete observation ZT = {zT, …, z0} can be computed sequentially, and new observations zT+1 can be processed with only one more update step.
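Schematically, the recursion reads as follows. This is only a structural sketch in Python, not the implementation of section 6.3; time_update and measurement_update are hypothetical placeholders for the density recursions (4) and (5, 6), and the density representation itself (e.g. a gaussian sum or a set of trajectories) is left abstract.

```python
import numpy as np

def filter_loglik(z, t, p_prior, time_update, measurement_update):
    """Continuous-discrete filter skeleton: alternate time updates (4)
    and measurement updates (5), accumulating the likelihood (6).
    time_update(p, t0, t1)      -> a priori density p_{i+1|i}
    measurement_update(p, z_i)  -> (a posteriori density, factor L_i)"""
    p_post, L0 = measurement_update(p_prior, z[0])   # first datum z_0
    loglik = np.log(L0)
    for i in range(len(z) - 1):
        p_pred = time_update(p_post, t[i], t[i + 1])      # p_{i+1|i}
        p_post, L = measurement_update(p_pred, z[i + 1])  # p_{i+1|i+1}
        loglik += np.log(L)     # log-likelihood accumulates sequentially
    return loglik
```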

Some remarks may be in order:

  1.

    In the linear case, the conditional densities are gaussian and the recursive steps can be implemented for the conditional moments y(t|ti) = E[y(t)|Zi] and P(t|ti) = Var[y(t)|Zi]. Instead of the Fokker-Planck equation, only linear ordinary differential equations must be solved. Furthermore, the measurement update can be computed analytically since all involved quantities are jointly (conditionally) gaussian. This is the celebrated Kalman filter algorithm extensively used in engineering, control theory, statistics, economics and the social sciences (cf. Jazwinski, 1970, Gelb, 1974, Liptser and Shiryayev, 1977, 1978, Harvey, 1989, Fahrmeir and Kaufmann, 1991, Singer, 1993).

  2.

    In the general nonlinear case, the time and measurement updates require the solution of partial differential equations and integrals (likelihood function) which can be obtained only numerically by several approximation methods. Linearizing the system one obtains the extended Kalman filter (EKF), but more elaborate methods such as the second order nonlinear filter (SNF; Jazwinski, 1970), the gaussian sum filter (Alspach and Sorenson, 1972), numerical integration (Kitagawa, 1987), and simulation methods have been used (Kitagawa, 1996, Tanizaki, 1996, Singer, 1997, Kim et al. 1998, Hürzeler and Künsch, 1998). Usually, the filters are formulated in discrete time, however.

In order to compute the solution of Fokker-Planck equation (4), a Monte Carlo approach is utilized (Wagner, 1988, Kloeden and Platen, 1992). We use an integral representation based on the Chapman-Kolmogorov equation for Markov processes

$$p({y_{i + 1}}{\rm{\vert}}{y_i})\;\; = \;\;\int {p({y_{i + 1}}{\rm{\vert}}\eta )p(\eta {\rm{\vert}}{y_i})d\eta } ,$$
((7))

which can be iterated to express

$$p({y_{i + 1}}{\rm{\vert}}{y_i})\;\; = \;\;\int {p({y_{i + 1}}{\rm{\vert}}{\eta _{{J_i} - 1}})p({\eta _{{J_i} - 1}}{\rm{\vert}}{\eta _{{J_i} - 2}}) \ldots p({\eta _1}{\rm{\vert}}{y_i})d{\eta _{{J_i} - 1}} \ldots d{\eta _1}} $$
((8))

as a product of transition densities inserted into the interval [ti, ti+1].

The auxiliary variables on the grid are defined as \({\eta _j} = y({\tau _j});\;{\tau _j} = {t_i} + j\delta t;\;j = 0, \ldots ,{J_i};\;{J_i} = \Delta {t_i}{\rm{/}}\delta t;\;{y_i} = {\eta _0}, \ldots ,{\eta _{{J_i}}} = {y_{i + 1}}\), so that δt = Δti/Ji can be chosen so small that

$$p({\eta _{j + 1}}{\rm{\vert}}{\eta _j})\;\; \approx \;\;\phi ({\eta _{j + 1}};{\eta _j} + f({\eta _j},{\tau _j})\delta t,\Omega ({\eta _j},{\tau _j})\delta t),$$
((9))

Ω := gg′, i.e. the short-time transition density can be approximated by a Gaussian density φ (cf. equation (1)). The Ji-fold product of Gaussian densities is called an Euler density (cf. Kloeden and Platen, 1992, ch. 16.3). In the limit Ji → ∞ the so-called path integral (functional integral) representation

$$\begin{array}{*{20}{c}} {p({y_{i + 1}}|{y_i})}& = &{\mathop {\lim }\limits_{{J_i} \to \infty } \int {\exp \left[ { - \tfrac{1}{2}\sum\limits_{j = 0}^{{J_i} - 1} {({\eta _{j + 1}} - {\eta _j} - {f_j}\delta t)'{{({\Omega _j}\delta t)}^{ - 1}} \times } } \right.} } \\ \;&\;&{\left. { \times ({\eta _{j + 1}} - {\eta _j} - {f_j}\delta t)} \right]\prod\limits_{j = 0}^{{J_i} - 1} {{{\left| {2\pi {\Omega _j}\delta t} \right|}^{ - 1/2}}d{\eta _{{J_i} - 1}}...d{\eta _1}} } \\ \;&{: = }&{\int {\exp ( - \tfrac{1}{2}O[y])Dy(t)\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;} } \end{array}$$
((10))

of the transition density is obtained (Haken, 1977, ch. 6.6, Risken, 1989, ch. 4.4.2). The exponent

$$O[y]\;\; = \;\;\int_{{t_i}}^{{t_{i + 1}}} {[\dot y(t) - f(y,t)]\prime \Omega {{(y,t)}^{ - 1}}[\dot y(t) - f(y,t)]dt} $$
((11))

is the Onsager-Machlup functional. The expression is only formal since y(t) is not differentiable. Analogous expressions are obtained when computing likelihood functionals, which can be transformed to formally existing limits (likelihood ratios) on dividing by a reference density and using Itô or Stratonovich integrals (cf. Wong and Hajek, 1985, ch. 6, p. 216 and the remark in ch. 7, p. 257; Stratonovich, 1989).

In numerical computations one does not go to the limit, but uses a δt small enough (a so-called ϵ-version in the sense of Stratonovich). The resulting (Ji − 1)-dimensional integral (8) can be estimated by the mean value

$$\hat p({y_{i + 1}}{\rm{\vert}}{y_i})\;\; = \;\;{N^{ - 1}}\sum\limits_{n = 1}^N {p({y_{i + 1}}{\rm{\vert}}{\eta _{n,{J_i} - 1}})} ,$$
((12))

where N is the Monte Carlo sample size. Here \({\eta _n} = {\rm{\{ }}{\eta _{n,{J_i} - 1}}, \ldots ,{\eta _{n,1}},{\eta _0}{\rm{\} }}\) are replications of the vector {y(ti + (Ji − 1)δt), …, y(ti + δt),y(ti)}, which represents the path of the Itô process y(t) on the time grid (conditioned on the initial value yi = η0). The approximation errors are controlled by the parameters δt and N, where the first corresponds to the approximation of the SDE and the second reflects the accuracy of the Monte Carlo integration.

The SDE is simulated by using the Euler-Maruyama scheme

$${\eta _{j + 1}}\;\; = \;\;{\eta _j} + f({\eta _j},{\tau _j})\delta t + g({\eta _j},{\tau _j})\sqrt {\delta t} {z_j};j = 0, \ldots ,{J_i} - 2$$
((13))
$${z_j}\;\; \sim \;\;N(0,I),i.i.d.$$
((14))

(cf. Kloeden and Platen, 1992, ch. 9 and 14).
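As an illustration, scheme (13, 14) can be sketched in Python (not the paper's Mathematica/C implementation; the drift and diffusion functions below are example choices):

```python
import numpy as np

def euler_maruyama(f, g, y0, t0, dt, J, rng):
    """Simulate one trajectory of dy = f(y,t)dt + g(y,t)dW(t) on the
    grid tau_j = t0 + j*dt, j = 0, ..., J-1 (scheme (13), (14))."""
    path = np.empty((J, len(y0)))
    path[0] = y0
    for j in range(J - 1):
        tau = t0 + j * dt
        G = g(path[j], tau)                    # p x r diffusion matrix
        zj = rng.standard_normal(G.shape[1])   # z_j ~ N(0, I), i.i.d.
        path[j + 1] = path[j] + f(path[j], tau) * dt + G @ zj * np.sqrt(dt)
    return path

# usage with the Ginzburg-Landau drift of eq. (56), alpha=-1, beta=0.1, sigma=2
rng = np.random.default_rng(0)
f = lambda y, t: -(-1.0 * y + 0.1 * y ** 3)
g = lambda y, t: np.array([[2.0]])
path = euler_maruyama(f, g, np.array([0.0]), 0.0, 0.1, 20, rng)
```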

In the desired extrapolation integral (time update)

$$\begin{array}{*{20}{c}} {{p_{i + 1|i}}: = p({y_{i + 1}}|{Z^i})}& = &{\int {p({y_{i + 1}}|{y_i})p({y_i}|{Z^i})d{y_i}} } \\ \;& = &{\int {p({\eta _{{J_i}}}|{\eta _{{J_i} - 1}})p({\eta _{{J_i} - 1}}|{\eta _{{J_i} - 2}}) \ldots p({\eta _1}|{\eta _0}) \times } } \\ \;&\;&{ \times {p_{i|i}}({\eta _0})d{\eta _{{J_i} - 1}} \ldots d{\eta _1}d{\eta _0}} \end{array}$$
((15))

an additional integration over the initial condition η0 = yi is required, which can be simulated by drawing yn,i = ηn,0 ∼ pi|i. The result is an estimator of delta type, similar to a kernel density estimator with variable band widths (variances) \(\Omega ({\eta _{n,{J_i} - 1}},{\tau _{{J_i} - 1}})\delta t\) (cf. Silverman, 1986):

$$\begin{array}{lll} {{{\hat p}_{i + 1{\rm{\vert}}i}}} & = & {{N^{ - 1}}\sum {p({y_{i + 1}}{\rm{\vert}}{\eta _{n,{J_i} - 1}})} }\\{} & \approx & {{N^{ - 1}}\sum {\phi ({y_{i + 1}};{y_{n,i + 1{\rm{\vert}}i}},{P_{n,i + 1{\rm{\vert}}i}})} ,}\end{array}$$
((16))

where \({y_{n,i + 1{\rm{\vert}}i}}: = {\eta _{n,{J_i} - 1}} + f({\eta _{n,{J_i} - 1}},{\tau _{{J_i} - 1}})\delta t\) and \({P_{n,i + 1{\rm{\vert}}i}}: = \Omega ({\eta _{n,{J_i} - 1}},{\tau _{{J_i} - 1}})\delta t\). The gaussian form arises only if the process errors dW are gaussian, however. In contrast, a kernel density estimator can be chosen of gaussian form even if nongaussian error terms (e.g. Poisson processes) are used, as in some finance applications (cf. Lo, 1988). See Singer (1997) for a comparison of several filter algorithms.
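For a scalar state, the estimate (16) can be sketched as follows (illustrative Python under the gaussian Euler approximation (9); eta_last stands for the simulated endpoints η_{n,Ji−1} of scheme (13)):

```python
import numpy as np

def phi(y, mean, var):
    # univariate gaussian density
    return np.exp(-0.5 * (y - mean) ** 2 / var) / np.sqrt(2 * np.pi * var)

def p_pred_hat(y, eta_last, f, Omega, tau, dt):
    """Delta-type estimate (16) of the a priori density p_{i+1|i}(y):
    a gaussian mixture over the one-step predictions of the N simulated
    endpoints, with band widths (variances) Omega * dt."""
    means = eta_last + f(eta_last, tau) * dt   # y_{n,i+1|i}, cf. (26)
    variances = Omega(eta_last, tau) * dt      # P_{n,i+1|i}, cf. (27)
    return np.mean(phi(y, means, variances))
```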

4 Importance sampling

The integral representation (15) can be rewritten to reduce the variance of the estimate (16). In general, the integral

$$\begin{array}{lll} {{E_1}[g]} & = & {\int {g(y){p_1}(y)dy} = \int {g(y){{{p_1}(y)} \over {{p_2}(y)}}{p_2}(y)dy} }\\{} & = & {{E_2}\left[ {g{{{p_1}} \over {{p_2}}}} \right]}\end{array}$$
((17))

can be approximated by a variance reduced unbiased estimate

$$\widehat{{E_1}[g]}\;\; = \;\;{N^{ - 1}}\sum\limits_{n = 1}^N {g({y_n}){{{p_1}({y_n})} \over {{p_2}({y_n})}}} ,$$
((18))

yn ∼ p2, if the density p2 (importance density) is chosen appropriately. One can show that the optimal density is given by

$${p_{2,opt}}\;\; = \;\;{{{\rm{\vert}}g(y){\rm{\vert}}{p_1}(y)} \over {{E_1}{\rm{\vert}}g(y){\rm{\vert}}}}$$
((19))

and the variance of (18) is zero if g is positive (cf. Kloeden and Platen, 1992, ch. 16.3). Unfortunately, the definition involves the desired quantity E1[|g(y)|], so p2,opt must itself be approximated (see below). Setting \(g({\eta _{{J_i} - 1}}) = p({y_{i + 1}}{\rm{\vert}}{\eta _{{J_i} - 1}})\) and \({p_1} = p({\eta _{{J_i} - 1}}{\rm{\vert}}{\eta _{{J_i} - 2}}) \ldots p({\eta _1}{\rm{\vert}}{\eta _0})p({\eta _0}{\rm{\vert}}{Z^i})\) leads to the optimal importance density

$${p_{2,opt}}\;\; = \;\;p({\eta _{{J_i} - 1}}{\rm{\vert}}{\eta _{{J_i} - 2}},{y_{i + 1}}) \ldots p({\eta _1}{\rm{\vert}}{\eta _0},{y_{i + 1}})p({\eta _0}{\rm{\vert}}{Z^i},{y_{i + 1}}),$$
((20))

where the transition densities are conditioned on future states yi+1, which are not observed. Replacing yi+1 by the next measurement zi+1 yields a modified density \({\tilde p_{2,opt}}\) and an estimate

$$\begin{array}{*{20}{c}} {\tilde p({y_{i + 1}}|{Z^i})}& = &{{N^{ - 1}}\sum\limits_n {p({y_{i + 1}}|{\eta _{n,{J_i} - 1}}) \times } } \\ \;&\;&{ \times \frac{{p({\eta _{n,{J_i} - 1}}|{\eta _{n,{J_i} - 2}}) \ldots p({\eta _{n,1}}|{\eta _{n,0}})}}{{p({\eta _{n,{J_i} - 1}}|{\eta _{n,{J_i} - 2}},{z_{i + 1}}) \ldots p({\eta _{n,1}}|{\eta _{n,0}},{z_{i + 1}})}}} \\ \;&\;&{ \times \frac{{p({\eta _{n,0}}|{Z^i})}}{{p({\eta _{n,0}}|{Z^i},{z_{i + 1}})}}} \\ \;&{: = }&{\sum\limits_n {p({y_{i + 1}}|{\eta _{n,{J_i} - 1}})\,{\alpha _{n,i + 1|i}}} } \\ \;& \approx &{\sum\limits_n {\phi ({y_{i + 1}};{y_{n,i + 1|i}},{P_{n,i + 1|i}})\,{\alpha _{n,i + 1|i}},} } \end{array}$$
((21))

which can be shown to imply zero variance (given the data Zi) for the unbiased likelihood estimate

$$\begin{array}{*{20}{c}} {\tilde p({z_{i + 1}}|{Z^i})}& = &{\int {p({z_{i + 1}}|{y_{i + 1}})\tilde p({y_{i + 1}}|{Z^i})d{y_{i + 1}}} } \\ \;& = &{\sum\limits_n {p({z_{i + 1}}|{\eta _{n,{J_i} - 1}}){\alpha _{n,i + 1|i}}.\;\;\;\;\;\;\;} } \end{array}$$
((22))

Furthermore the expressions (22) and (25) coincide for δt → 0 (see appendix).
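The mechanism can be checked on a toy problem where everything is linear-gaussian, so that the optimal density (19) is available in closed form and the weighted estimate (18) is exactly constant, mirroring the zero-variance result above (illustrative Python, not the paper's code):

```python
import numpy as np

def phi(y, m, v):
    return np.exp(-0.5 * (y - m) ** 2 / v) / np.sqrt(2 * np.pi * v)

rng = np.random.default_rng(1)
N, z, R = 10_000, 3.0, 0.1
g = lambda y: phi(z, y, R)       # likelihood factor p(z|y), g > 0

# crude Monte Carlo over p_1 = N(0,1): most draws contribute almost nothing
y1 = rng.standard_normal(N)
crude = g(y1)

# optimal importance density (19): g*p_1 is gaussian here, so p_2,opt is
# the posterior N(m, v) of y given z, and g*p_1/p_2 is constant in y
v, m = R / (1 + R), z / (1 + R)
y2 = m + np.sqrt(v) * rng.standard_normal(N)
weighted = g(y2) * phi(y2, 0.0, 1.0) / phi(y2, m, v)   # estimate (18)

print(crude.mean(), crude.std(ddof=1) / np.sqrt(N))        # noisy
print(weighted.mean(), weighted.std(ddof=1) / np.sqrt(N))  # zero variance
```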

Since the time update p(yi+1|Zi) (15) can be approximated by a sum of gaussian densities (21) of small band width ∝ δt, the usual measurement updates of the extended Kalman filter (EKF) can be applied to each element in the superposition (21) (cf. Anderson and Moore, 1979, theorem 2.1). One obtains the estimated a posteriori density

$$\tilde p({y_{i + 1}}{\rm{\vert}}{Z^{i + 1}})\;\;: = \;\;\sum\limits_n {\phi ({y_{i + 1}};{y_{n,i + 1{\rm{\vert}}i + 1}},{P_{n,i + 1{\rm{\vert}}i + 1}}){\alpha _{n,i + 1{\rm{\vert}}i + 1}}} $$
((23))
$${\alpha _{n,i + 1{\rm{\vert}}i + 1}}\;\;: = \;\;{\alpha _{n,i + 1{\rm{\vert}}i}}{{{{\tilde L}_{n,i + 1}}} \over {{{\tilde L}_{i + 1}}}}$$
((24))
$${{\tilde L}_{i + 1}}\;\; = \;\;\sum\limits_n {{{\tilde L}_{n,i + 1}}{\alpha _{n,i + 1{\rm{\vert}}i}}} ,$$
((25))

where the EKF updates are given by

$${y_{n,i + 1{\rm{\vert}}i}}\;\; = \;\;{\eta _{n,{J_i} - 1}} + f({\eta _{n,{J_i} - 1}},{\tau _{{J_i} - 1}})\delta t$$
((26))
$${P_{n,i + 1{\rm{\vert}}i}}\;\; = \;\;\Omega ({\eta _{n,{J_i} - 1}},{\tau _{{J_i} - 1}})\delta t$$
((27))
$${y_{n,i + 1{\rm{\vert}}i + 1}}\;\; = \;\;{y_{n,i + 1{\rm{\vert}}i}} + {K_{n,i + 1{\rm{\vert}}i}}{\nu _{n,i + 1}}$$
((28))
$${P_{n,i + 1{\rm{\vert}}i + 1}}\;\; = \;\;(I - {K_{n,i + 1{\rm{\vert}}i}}{H_{n,i + 1}}){P_{n,i + 1{\rm{\vert}}i}}$$
((29))
$${K_{n,i + 1{\rm{\vert}}i}}\;\; = \;\;{P_{n,i + 1{\rm{\vert}}i}}H_{n,i + 1}^\prime \Gamma _{n,i + 1{\rm{\vert}}i}^{ - 1}\;({\rm{Kalman}}\;{\rm{gain}})$$
((30))
$${\nu _{n,i + 1}}\;\; = \;\;{z_{i + 1}} - h({y_{n,i + 1{\rm{\vert}}i}},{t_{i + 1}})\;({\rm{innovation}})$$
((31))
$${\Gamma _{n,i + 1{\rm{\vert}}i}}\;\; = \;\;{H_{n,i + 1}}{P_{n,i + 1{\rm{\vert}}i}}H_{n,i + 1}^\prime + {R_{i + 1}}\;({\rm{innovation}}\;{\rm{covariance}})$$
((32))
$${{\tilde L}_{n,i + 1}}\;\; = \;\;{(\det 2\pi {\Gamma _{n,i + 1{\rm{\vert}}i}})^{ - {1 \over 2}}}\exp [ - {1 \over 2}\nu _{n,i + 1}^\prime \Gamma _{n,i + 1{\rm{\vert}}i}^{ - 1}{\nu _{n,i + 1}}]$$
((33))
$$ = \;\;\phi ({z_{i + 1}};h({y_{n,i + 1{\rm{\vert}}i}},{t_{i + 1}}),{H_{n,i + 1}}{P_{n,i + 1{\rm{\vert}}i}}H_{n,i + 1}^\prime + {R_{i + 1}})$$
((34))
$${H_{n,i + 1}}\;\; = \;\;{h_y}({y_{n,i + 1{\rm{\vert}}i}},{t_{i + 1}})\;({\rm{Jacobian}})$$
((35))

The sequence of estimated a priori and a posteriori densities \(\tilde p({y_{i + 1}}{\rm{\vert}}{Z^i})\), \(\tilde p({y_{i + 1}}{\rm{\vert}}{Z^{i + 1}})\) (21, 23) yields a numerical implementation of the continuous-discrete filter (4, 5) and permits the variance reduced computation of the likelihood function \(\tilde L = {L_0}\prod\nolimits_{i = 0}^{T - 1} {{{\tilde L}_{i + 1}}} \), where L0 is the likelihood of the first observation z0. Since a functional integral representation of the density p(yi+1|Zi) is utilized, the algorithm will be called functional integral filter (FIF).
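In code, the component-wise measurement update (26-35) together with the weight update (24, 25) might look as follows (a sketch; a time-invariant output function h with Jacobian h_jac is assumed for brevity):

```python
import numpy as np

def ekf_update(y_prior, P_prior, z, h, h_jac, R):
    """EKF measurement update (28)-(35) for one mixture component n;
    returns the a posteriori moments and the likelihood factor (33)."""
    H = h_jac(y_prior)                                  # Jacobian (35)
    nu = z - h(y_prior)                                 # innovation (31)
    Gamma = H @ P_prior @ H.T + R                       # innovation cov. (32)
    K = P_prior @ H.T @ np.linalg.inv(Gamma)            # Kalman gain (30)
    y_post = y_prior + K @ nu                           # (28)
    P_post = (np.eye(len(y_prior)) - K @ H) @ P_prior   # (29)
    L = np.exp(-0.5 * nu @ np.linalg.solve(Gamma, nu)) \
        / np.sqrt(np.linalg.det(2 * np.pi * Gamma))     # (33)
    return y_post, P_post, L

def weight_update(alpha_prior, L_n):
    """Weight update (24) and likelihood estimate (25)."""
    L_total = np.sum(L_n * alpha_prior)                 # (25)
    return alpha_prior * L_n / L_total, L_total         # (24)
```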

5 Implementation of the importance density by smoothing

In general the optimal weights (likelihood ratio)

$$\begin{array}{*{20}{c}} {{\alpha _{n,i + 1|i}}}&{: = }&{\frac{1}{N}\prod\limits_{j = 0}^{{J_i} - 2} {\frac{{p({\eta _{n,j + 1}}|{\eta _{n,j}})}}{{p({\eta _{n,j + 1}}|{\eta _{n,j}},{z_{i + 1}})}}} \frac{{p({\eta _{n,0}}|{Z^i})}}{{p({\eta _{n,0}}|{Z^i},{z_{i + 1}})}}} \\ \;& = &{\frac{1}{N}\frac{{{p_1}}}{{{p_{2,opt}}}}({\eta _n}|{Z^i},{z_{i + 1}}),} \end{array}$$
((36))

i = 0, …, T − 1, determined by the importance density p2,opt, are difficult to compute, since they involve the unknown conditional densities p(ηj+1|ηj, zi+1) and p(yi|Zi, zi+1); for the linear system, however, an exact result is available which can be generalized to the nonlinear case.

In linear systems the likelihood estimate (22, 25) is thus dispersion free and the mean is exact with one trajectory. In nonlinear systems, the approximate importance density p2 leads to a suboptimal estimate \({{\tilde L}_{i + 1}}\) with positive variance, but still achieves a variance reduction.

Since in the linear case all densities are gaussian, one obtains the conditional mean and variance (ηj= y(τj); τj = ti + jδt; j = 0, …, Ji − 1)

$$\begin{array}{*{20}{c}} {E[{y_{i + 1}}|{\eta _j},{z_{i + 1}}]}& = &{E[{y_{i + 1}}|{\eta _j}] + {K_{i + 1}}({z_{i + 1}} - h(E[{y_{i + 1}}|{\eta _j}],{t_{i + 1}}))} \\ {Var[{y_{i + 1}}|{\eta _j},{z_{i + 1}}]}& = &{(I - {K_{i + 1}}{H_{i + 1}})Var[{y_{i + 1}}|{\eta _j}]} \\ {{K_{i + 1}}}&{: = }&{Var({y_{i + 1}}|{\eta _j}){{H'}_{i + 1}}{{({H_{i + 1}}Var({y_{i + 1}}|{\eta _j}){{H'}_{i + 1}} + {R_{i + 1}})}^{ - 1}}} \\ {E[{\eta _{j + 1}}|{\eta _j},{z_{i + 1}}]}& = &{E[{\eta _{j + 1}}|{\eta _j}] + {F_j}(E[{y_{i + 1}}|{\eta _j},{z_{i + 1}}] - E[{y_{i + 1}}|{\eta _j}])} \\ {Var({\eta _{j + 1}}|{\eta _j},{z_{i + 1}})}& = &{Var({\eta _{j + 1}}|{\eta _j}) + {F_j}(Var({y_{i + 1}}|{\eta _j},{z_{i + 1}}) - Var({y_{i + 1}}|{\eta _j})){{F'}_j}} \\ {{F_j}}&{: = }&{Var({\eta _{j + 1}}|{\eta _j})\Phi '({t_{i + 1}},{\tau _{j + 1}})Var{{({y_{i + 1}}|{\eta _j})}^{ - 1}},} \end{array}$$

which characterize the density p(ηj+1|ηj, zi+1) in p2,opt.

The smoother gain Fj and the update formulas are similar to the fixed interval smoother (cf. Anderson and Moore, 1979). The quantities in the update formulas can be obtained by solving the differential equations (τjtti+1)

$$\dot y(t{\rm{\vert}}{\tau _j})\;\; = \;\;f(y(t{\rm{\vert}}{\tau _j}),t);\;y({\tau _j}{\rm{\vert}}{\tau _j}) = {\eta _j}$$
((37))
$$\dot P(t{\rm{\vert}}{\tau _j})\;\; = \;\;A(t)P(t{\rm{\vert}}{\tau _j}) + P(t{\rm{\vert}}{\tau _j})A\prime (t) + \Omega (t);\;P({\tau _j}{\rm{\vert}}{\tau _j}) = 0$$
((38))
$$\dot \Phi (t{\rm{\vert}}{\tau _{j + 1}})\;\; = \;\;A(t)\Phi (t{\rm{\vert}}{\tau _{j + 1}});\;\Phi ({\tau _{j + 1}}{\rm{\vert}}{\tau _{j + 1}}) = I;{\tau _{j + 1}} \le t \le {t_{i + 1}}$$
((39))
$$f(y,t)\;\;: = \;\;A(t)y + b(t)\;({\rm{linear}}\;{\rm{system}})$$
((40))
$$h(y,t)\;\;: = \;\;H(t)y + d(t)$$
((41))

and setting

$$E[{\eta _{j + 1}}{\rm{\vert}}{\eta _j}]\;\; = \;\;y({\tau _{j + 1}}{\rm{\vert}}{\tau _j})$$
((42))
$$E[{y_{i + 1}}{\rm{\vert}}{\eta _j}]\;\; = \;\;y({t_{i + 1}}{\rm{\vert}}{\tau _j})$$
((43))
$${\rm{Var}}[{\eta _{j + 1}}{\rm{\vert}}{\eta _j}]\;\; = \;\;P({\tau _{j + 1}}{\rm{\vert}}{\tau _j})$$
((44))
$${\rm{Var}}[{y_{i + 1}}{\rm{\vert}}{\eta _j}]\;\; = \;\;P({t_{i + 1}}{\rm{\vert}}{\tau _j}).$$
((45))

For the density p(yi|Zi, zi+1) in p2,opt one obtains the moments:

$$\begin{array}{*{20}{c}} {E[{y_{i + 1}}|{Z^{i + 1}}]}& = &{E[{y_{i + 1}}|{Z^i}] + {K_{i + 1}}({z_{i + 1}} - h(E[{y_{i + 1}}|{Z^i}],{t_{i + 1}}))} \\ {Var({y_{i + 1}}|{Z^{i + 1}})}& = &{(I - {K_{i + 1}}{H_{i + 1}})Var({y_{i + 1}}|{Z^i})} \\ {{K_{i + 1}}}& = &{Var({y_{i + 1}}|{Z^i}){{H'}_{i + 1}}{{({H_{i + 1}}Var({y_{i + 1}}|{Z^i}){{H'}_{i + 1}} + {R_{i + 1}})}^{ - 1}}} \\ {E[{y_i}|{Z^i},{z_{i + 1}}]}& = &{E[{y_i}|{Z^i}] + {F_i}(E[{y_{i + 1}}|{Z^{i + 1}}] - E[{y_{i + 1}}|{Z^i}])} \\ {Var({y_i}|{Z^i},{z_{i + 1}})}& = &{Var({y_i}|{Z^i}) + {F_i}(Var({y_{i + 1}}|{Z^{i + 1}}) - Var({y_{i + 1}}|{Z^i})){{F'}_i}} \\ {{F_i}}&{: = }&{Var({y_i}|{Z^i})\Phi '({t_{i + 1}},{t_i})Var{{({y_{i + 1}}|{Z^i})}^{ - 1}}.} \end{array}$$

Again the quantities in the update formulas can be obtained by solving the differential equations (titti+1), but with different initial conditions:

$$\dot y(t{\rm{\vert}}{t_i})\;\; = \;\;f(y(t{\rm{\vert}}{t_i}),t);\;y({t_i}{\rm{\vert}}{t_i}) = E[{y_i}{\rm{\vert}}{Z^i}]$$
((46))
$$\dot P(t{\rm{\vert}}{t_i})\;\; = \;\;A(t)P(t{\rm{\vert}}{t_i}) + P(t{\rm{\vert}}{t_i})A\prime (t) + \Omega (t);\;P({t_i}{\rm{\vert}}{t_i}) = {\rm{Var}}[{y_i}{\rm{\vert}}{Z^i}]$$
((47))
$$\dot \Phi (t{\rm{\vert}}{t_i})\;\; = \;\;A(t)\Phi (t{\rm{\vert}}{t_i});\Phi ({t_i}{\rm{\vert}}{t_i}) = I$$
((48))

and setting

$$E[{y_{i + 1}}{\rm{\vert}}{Z^i}]\;\;\; = \;\;y({t_{i + 1}}{\rm{\vert}}{t_i})$$
((49))
$${\rm{Var}}({y_{i + 1}}{\rm{\vert}}{Z^i})\;\; = \;\;P({t_{i + 1}}{\rm{\vert}}{t_i}).$$
((50))

In the limit of small δt one can write

$$\begin{array}{*{20}{c}} {E[{\eta _{j + 1}}|{\eta _j},{z_{i + 1}}]}& = &{{\eta _j} + f({\eta _j},{\tau _j})\delta t + {F_j}(E[{y_{i + 1}}|{\eta _j},{z_{i + 1}}] - E[{y_{i + 1}}|{\eta _j}])} \\ \;&{: = }&{{\eta _j} + {f_2}({\eta _j},{\tau _j})\delta t} \\ {Var({\eta _{j + 1}}|{\eta _j},{z_{i + 1}})}& = &{\Omega ({\tau _j})\delta t + {F_j}(Var({y_{i + 1}}|{\eta _j},{z_{i + 1}}) - Var({y_{i + 1}}|{\eta _j})){{F'}_j}} \\ \;&{: = }&{{\Omega _2}({\tau _j})\delta t,} \end{array}$$

which may be interpreted as a correction to the drift f and the diffusion matrix Ω. Therefore the optimal density p2,opt and trajectories ηn = {ηn, Ji−1, ⋯ ηn,0} drawn from it can be obtained by using a modified drift f2 and diffusion coefficient Ω2. More precisely, a stochastic Euler-Maruyama scheme (13) with drift f2 and diffusion coefficient Ω2 is used to simulate ηn ∼ p2,opt.

Since the moment equations (37-41, 46-48) can be generalized to nonlinear systems by replacing A(t) → fy(y, t), Ω(t) → Ω(y, t), and H(t) → hy(y, t), one obtains a sampling scheme where the (sub)optimal density p2 is implemented by means of the EKF updates and trajectories \({{\rm{\{ }}{\eta _{{J_i} - 1}}, \ldots ,{\eta _0}{\rm{\} }}_n}\sim{p_2}\) can be simulated using f2 and Ω2.
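For a scalar linear system dy = ay dt + σdW with direct observation h(y) = y, the required moments are available in closed form, and one conditioned Euler step under p2 can be sketched as follows (a minimal illustration under these assumptions, with a ≠ 0; the general case solves (37-39) numerically with the substitutions above):

```python
import numpy as np

def conditioned_euler_step(eta, tau, t_next, z, a, sig2, R, dt, rng):
    """One Euler step with modified drift f_2 and diffusion Omega_2
    (section 5): a draw from p(eta_{j+1} | eta_j, z_{i+1}) for the scalar
    linear system dy = a*y dt + sqrt(sig2) dW, z = y(t_next) + eps."""
    s = t_next - tau                               # time to next measurement
    Ey = np.exp(a * s) * eta                       # E[y_{i+1} | eta_j]
    Py = sig2 * (np.exp(2 * a * s) - 1) / (2 * a)  # Var[y_{i+1} | eta_j]
    K = Py / (Py + R)                              # gain for z_{i+1}
    Fj = sig2 * dt * np.exp(a * (s - dt)) / Py     # smoother gain F_j
    f2_dt = a * eta * dt + Fj * K * (z - Ey)       # corrected drift f_2 * dt
    Om2_dt = sig2 * dt - Fj ** 2 * K * Py          # Omega_2 * dt, > 0 for small dt
    return eta + f2_dt + np.sqrt(Om2_dt) * rng.standard_normal()
```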

6 Practical implementation

6.1 Smoothing with Gaussian Sums

If the density p0|−1 = p0(y0, t0) (initial condition) is represented by a gaussian mixture distribution of N populations

$${p_{0{\rm{\vert}} - 1}}\;\;: = \;\;\sum\limits_{n = 1}^N {\phi ({y_0};{\mu _n},{\Sigma _n}){\alpha _{n,0{\rm{\vert}} - 1}}} $$
((51))

using appropriate weights αn,0|−1, all updates preserve the structure of a gaussian sum and the computation of the importance density proceeds by an N-fold solution of the smoother equations and related trajectories drawn from the density p2. More precisely, the measurement update

$${p_{0{\rm{\vert}}0}} = p({y_0}{\rm{\vert}}{Z^0})\;\;: = \;\;\sum\limits_n {\phi ({y_0};{y_{n,0{\rm{\vert}}0}},{P_{n,0{\rm{\vert}}0}}){\alpha _{n,0{\rm{\vert}}0}}} $$
((52))

is again a gaussian sum and the smoothed density p(y0|Z0, z1) = p0|1 required in p2 may be represented by the moments E[yn,0|Z0, z1], Var(yn,0|Z0, z1) of population n and an updated weight \({\alpha _{n,0{\rm{\vert}}1}} = {\alpha _{n,0{\rm{\vert}}0}}{{{L_{n,1}}} \over {{L_1}}},{L_1} = \sum {{\alpha _{n,0{\rm{\vert}}0}}{L_{n,1}}} \), i.e.

$${p_{0{\rm{\vert}}1}} = p({y_0}{\rm{\vert}}{Z^0},{z_1})\;\;: = \;\;\sum\limits_n {\phi ({y_0};{y_{n,0{\rm{\vert}}1}},{P_{n,0{\rm{\vert}}1}}){\alpha _{n,0{\rm{\vert}}1}}} .$$
((53))

Therefore, the EKF’s and smoothers computing the optimal weights and the importance density run on a gaussian sum, and the resulting algorithm is called a gaussian sum filter (Anderson and Moore, 1979, chapter 8). This filter does not involve any simulation and computes an approximate time update via deterministic moment equations using N EKF’s. Whereas these are only valid for small sampling intervals Δti, the stochastic simulation of trajectories via (13) leads to an estimate of the a priori density valid for arbitrary sampling intervals.

From the density p0|1, N random initial conditions ηn,0|1 can be drawn and used to simulate the trajectories \({\rm{\{ }}{\eta _{n,{J_0} - 1}}, \ldots ,{\eta _{n,0}}{\rm{\} }} \sim {p_2}\) using f2 and Ω2 in (13). From these, the a priori density \({{\tilde p}_{1{\rm{\vert}}0}} = \tilde p({y_1}{\rm{\vert}}{Z^0})\) (21) can be estimated, and the a posteriori density \({{\tilde p}_{1{\rm{\vert}}1}} = \tilde p({y_1}{\rm{\vert}}{Z^1})\) and the likelihood estimate \({{\tilde L}_1}\) in (23) are obtained. \({{\tilde p}_{1{\rm{\vert}}1}}\) is again a gaussian sum and the update \({{\tilde p}_{1{\rm{\vert}}2}} = \tilde p({y_1}{\rm{\vert}}{Z^2})\) may be computed as before, etc. The algorithm runs recursively for i = 0, …, T and yields a sequence of likelihood contributions \({{\tilde L}_i}\).

6.2 Resampling Strategies and Antithetical Sampling

Drawing yj, j = 1, …, N from the mixture distribution (a posteriori density)

$$\tilde p(y)\;\;: = \;\;\sum\limits_{n = 1}^N {\phi (y;{y_n},{P_n}){\alpha _n}} $$
((54))

can be accomplished by drawing a population n with probability αn and then setting \({y_j} = {y_n} + P_n^{1/2}{z_j}\), where zj ∼ N(0, I) and \(P_n^{1/2}\) is a Cholesky root (or another matrix square root). The drawing of n may be implemented by drawing uj ∼ U[0, 1] (uniform distribution) and solving \(\sum\nolimits_{m = 1}^{n - 1} {{\alpha _m}} < {u_j} \le \sum\nolimits_{m = 1}^n {{\alpha _m}} \). Alternatively, the deterministic values uj = (j − c)/N; c ∈ (0, 1), or stratified values uj = (j + Uj − 0.5 − c)/N; Uj ∼ U[0, 1], c ∈ (0, 1), could be used. According to Kitagawa (1996, appendix), when using deterministic or stratified drawing, it is preferable to sort the mixture in order of magnitude in advance, i.e. \(\left\Vert {{{\tilde y}_n}} \right\Vert < \left\Vert {{{\tilde y}_{n + 1}}} \right\Vert\), and to draw from the sorted \({{\tilde y}_n},{{\tilde P}_n}\) and \({{\tilde \alpha }_n}\). In my experience, sorting improves the smoothness of the simulated likelihood surface as a function of the parameter vector ψ (cf. example 7.3, figs. (10–15)).
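A sketch of this sampling scheme (illustrative Python; means, chols and alphas denote the mixture means, Cholesky roots and weights of (54)):

```python
import numpy as np

def draw_from_mixture(means, chols, alphas, N, rng, deterministic=False, c=0.5):
    """Draw N variates from the gaussian sum (54): sort components by
    the magnitude of their means (Kitagawa 1996), select components via
    stratified (or deterministic) uniforms, then add gaussian noise."""
    order = np.argsort(np.linalg.norm(means, axis=1))
    means, chols, alphas = means[order], chols[order], alphas[order]
    cum = np.cumsum(alphas)
    j = np.arange(1, N + 1)
    U = np.full(N, 0.5) if deterministic else rng.uniform(size=N)
    u = (j + U - 0.5 - c) / N      # reduces to u_j = (j - c)/N if deterministic
    # component n solving sum_{m<n} alpha_m < u_j <= sum_{m<=n} alpha_m
    idx = np.minimum(np.searchsorted(cum, u), len(alphas) - 1)
    zs = rng.standard_normal((N, means.shape[1]))
    return means[idx] + np.einsum('nij,nj->ni', chols[idx], zs)
```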

Another device for reducing sampling error is antithetical sampling (Hammersley and Handscomb, 1964, p. 60). Instead of simulating zj ∼ N(0, I); j = 1, …, N, pairs {zj, −zj}; j = 1, …, N/2, are drawn. The negatively correlated sample leads to estimators with smaller variance. When simulating the Euler scheme (13), the i.i.d. sequence −zj; j = 0, …, Ji − 2 can be used to simulate a trajectory ηj(−z) which is anticorrelated with ηj(+z).
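A sketch of the antithetic draw (assuming N even):

```python
import numpy as np

def antithetic_normals(N, p, rng):
    """N standard normal p-vectors in pairs {z_j, -z_j}: the sample mean
    is exactly zero, and trajectories driven by z and -z in scheme (13)
    are anticorrelated."""
    half = rng.standard_normal((N // 2, p))
    return np.concatenate([half, -half])
```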

6.3 Implementation Details

The algorithm was programmed with Mathematica (Wolfram Research, 1992) and the MPW C compiler (Apple Computer, 2001) using the Mathlink communication library and run on Apple Power PC 604e and G3 computers. The Mathlink routines allow the calling of C programs from within Mathematica. For numerical computations (Cholesky roots, random numbers, sorting), the C algorithms in Numerical Recipes in C (Press et al., 1992) have been used.

7 Examples

7.1 AR(2) process

In order to test the performance of the importance sampling algorithm, a linear AR(2) model was simulated. The sampler should give a variance-free estimate of the likelihood, which must coincide with the exact result obtained by the Kalman filter. I used the state space model (equivalent to a second order differential equation)

$$\begin{array}{*{20}{c}} {d\left[ {\begin{array}{*{20}{c}} {{y_1}(t)} \\ {{y_2}(t)} \end{array}} \right]}&{: = }&{\left[ {\begin{array}{*{20}{c}} 0&1 \\ { - 16}&{ - 4} \end{array}} \right]\left[ {\begin{array}{*{20}{c}} {{y_1}(t)} \\ {{y_2}(t)} \end{array}} \right]dt + \left[ {\begin{array}{*{20}{c}} 0&0 \\ 0&2 \end{array}} \right]d\left[ {\begin{array}{*{20}{c}} {{W_1}(t)} \\ {{W_2}(t)} \end{array}} \right]} \\ {{z_i}}&{: = }&{[\begin{array}{*{20}{c}} 1&0 \end{array}]\left[ {\begin{array}{*{20}{c}} {{y_1}({t_i})} \\ {{y_2}({t_i})} \end{array}} \right] + {\epsilon _i},} \end{array}$$
((55))

where Var(ϵi) := R = 0.1, the data were equispaced with Δt = 2, t0 = 0 ≤ t ≤ tT = 50, and the discretization interval was chosen as δt = 0.01, Ji = Δt/δt = 200, i = 0, …, T = 25. A time series was computed according to (55), and using these data the likelihood function l(ψ) was simulated.

Results: The results are displayed in figures (1–4), where the variance reduction in M = 10 replications of the likelihood surface is summarized. Also shown is the exact result using the (linear) Kalman filter. The likelihoods and scores are plotted as a function of the parameter ψ3 = −16 in the interval [−20, −14]. Even in the case N = 1 (fig. 2), the sampling error is very small. Using larger discretization intervals δt = 0.1, 0.05, one can show numerically that the variance of the estimated likelihood increases. Therefore, approximation errors in the simulation (13) and in the transition density (9) lead to deviations from the (theoretically) exact variance-free estimate \(\tilde p({z_{i + 1}}{\rm{\vert}}{Z^i})\) (22).

Figure 1
figure 1

Simulated likelihood surface \(\tilde l({\psi _3})\): AR(2) without importance sampling (sample size N = 10). M = 10 replications (left), means and standard deviations (middle) and score (right). Interval −20 ≤ ψ3 ≤ −14. Bold line: exact likelihood l(ψ3) and score (right).

Figure 2
figure 2

Simulated likelihood surface: AR(2) with importance sampling (sample size N = 1). Bold line: exact likelihood and score (right).

Figure 3
figure 3

Likelihood surface: AR(2) with importance sampling (sample size N = 10).

Figure 4
figure 4

Likelihood surface: AR(2) with importance sampling (sample size N = 20).

7.2 Ginzburg-Landau model

The Ginzburg-Landau equation is a nonlinear diffusion equation whose drift coefficient is the negative gradient of a double well potential \(\Phi (y,{\rm{\{ }}\alpha ,\beta {\rm{\} }}){\rm{ = }}{\textstyle{\alpha \over 2}}{y^2} + {\textstyle{\beta \over 4}}{y^4},\;f = - \partial \Phi {\rm{/}}\partial y\):

$$dy\;\;\; = \;\;\; - [\alpha y + \beta {y^3}]dt + \sigma dW(t)$$
((56))
$${z_i}\;\; = \;\;y({t_i}) + {\epsilon_i}.$$
((57))

Models of this kind have been used to describe limit cycles, bifurcations, phase transitions and normal forms of nonlinear systems (cf. V.I. Arnold, 1973, 1986, Haken, 1977, Holmes, 1981, normal form theorem 4.4). Other applications are the modeling of equilibrium states of an economy (Herings, 1996) and the theory of system failure (Frey, 1996).

In the present context a parameter constellation of ψ = {α, β, σ, R} = {−1, 0.1, 2, 0.1}, R = Var(ϵ), was chosen, which corresponds to a potential with two minima and noisily sampled measurements (Δt = 2; 0 ≤ t ≤ 50; δt = 0.1).

The convergence of the simulated likelihood as a function of sample size N is shown in figure (5). Again the variance of the estimates is considerably reduced by importance sampling. The form of the likelihood surface and of the score as a function of ψ1 = α is compared in figures (6–7).

Figure 5
figure 5

Convergence of the simulated likelihood (Ginzburg-Landau model). Means ± standard deviations in M = 10 replications. Right picture: with importance sampling. Sample size N = 10, 50, 100, 200.

Figure 6
figure 6

Likelihood surface (Ginzburg-Landau model) as a function of ψ1 = α. M = 10 replications (left), means and standard deviations (middle) and score (right). Sample size N = 10 (without importance sampling).

Figure 7
figure 7

Likelihood surface (Ginzburg-Landau model) as a function of ψ1 = α. Sample size N = 10 (with importance sampling).

Finally, figs. (8–9) illustrate the effect of importance sampling on the trajectories drawn from p2. As shown in section 5, the actual drift and diffusion coefficients f, g are modified to f2, g2 in order to draw from the importance density, which concentrates the random numbers near the points in state space where p(zi+1|yi+1) is averaged in the likelihood expression Li+1 = ∫ p(zi+1|yi+1)p(yi+1|Zi)dyi+1 (cf. eq. 6). Clearly, without importance sampling, only a few trajectories are near the measurement at the end of the interval (see fig. 8) and the mean shows high dispersion.

Figure 8
figure 8

Sample trajectories from p1 (Ginzburg-Landau model). Sample size N = 10 (without importance sampling).

Figure 9
figure 9

Sample trajectories from p2 (Ginzburg-Landau model). Sample size N = 10 (with importance sampling).

Simulation studies (Singer, 1999b) compared the performance of the functional integral filter (FIF) with a filter based on kernel density estimates and with approximations based on Taylor expansions (EKF, 2nd order nonlinear filter SNF and local linearization (LL), cf. Shoji and Ozaki (1997, 1998)). It was shown that for large sampling intervals, the FIF with importance sampling exhibits the smallest bias even for small Monte Carlo sample sizes (N = 10), whereas without importance sampling, sample sizes of at least N = 50 are required. Moreover, Taylor expansion methods (EKF, SNF, LL) only yield good results for small measurement intervals.

7.3 Stochastic Volatility

Stochastic volatility models such as

$$\begin{array}{lll} {dS(t)} & = & {\mu S(t)dt + \sigma (t)S(t)dW(t)}\\ {d\sigma (t)} & = & {\lambda [\sigma (t) - \bar \sigma ]dt + \gamma dV(t),}\end{array}$$
((58))

Cov(dW, dV) = ρdt (Scott, 1987, Hull and White, 1987), where the volatility process σ(t) is not observable, can account for the fact that the returns

$$r(t) = dS{\rm{/}}S = \mu dt + \sigma (t)dW(t)$$
((59))

on financial time series exhibit a time dependent variance, and for the leptokurtosis of the return distribution. In contrast to ARCH and GARCH models, which also exhibit conditional heteroscedasticity, the variance equation is driven by a separate Wiener process, so the variance cannot be eliminated from the model. For example, the discrete time GARCH(1,1) process

$$\begin{array}{lll}{\;\;{\epsilon_i}} & = & {{\sigma _i}{z_i}}\\ {\sigma _i^2} & = & {\omega + \alpha \epsilon_{i - 1}^2 + \beta \sigma _{i - 1}^2}\end{array}$$

permits the recursive computation of σi given measurements of the innovation process ϵi (which corresponds to σdW) and an initial value σ0. It has been shown by Nelson (1990), however, that a continuous time limit of the GARCH(1,1)-M model (Engle and Bollerslev, 1986) in the mean corrected log returns log(Si+1/Si) := yi+1 − yi

$$\begin{array}{lll}{{y_{i + 1}}}&=& {{y_i} + c\sigma _i^2 + {\sigma _i}{z_i}}\\ {\;\;\;\sigma _i^2}&=& {\omega+ \alpha \epsilon_{i - 1}^2 + \beta \sigma _{i - 1}^2}\end{array}$$

leads to the system of stochastic differential equations

$$\begin{array}{lll}{\;\;dy(t)}&= & {c\sigma {{(t)}^2}dt + \sigma (t)dW(t)}\\ {d\sigma {{(t)}^2}}&= & {[\omega- \theta \sigma {{(t)}^2}]dt + \alpha \sigma {{(t)}^2}dV(t),}\end{array}$$

where W and V are independent standard Wiener processes and the coefficients are scaled as \(dt \to 0\;(\omega \to \omega dt,\;\beta \to 1 - \alpha \sqrt {dt{\rm{/}}2} - \theta dt,\;\alpha \to \alpha \sqrt {dt{\rm{/}}2} )\). This differs somewhat from equation (58), where the volatility satisfies an Ornstein-Uhlenbeck process.

Stochastic volatility models in discrete time have been used as approximations to the stochastic differential equation (58) or some variants such as

$$\begin{array}{lll} {\;\;\;\;\;\;\;\;dS(t)}&= & {\mu S(t)dt + \sigma (t)S(t)dW(t)}\\ {d\;\log \;{\sigma ^2}(t)}&= & {\lambda [\log {\sigma ^2}(t) - \log {{\bar \sigma }^2}]dt + \gamma dV(t)}\\ {\;\;\;\;\;\;\;(dh(t)}&= & {\lambda [h(t) - \bar h]dt + \gamma dV(t)),}\end{array}$$
((60))

where the log-volatility h(t) is modeled by an Ornstein-Uhlenbeck-process to ensure a positive σ (cf. Wiggins, 1987, Nelson, 1990). Taking logarithms and using Itô’s lemma, y = log S fulfils

$$\begin{array}{lll}{\;\;\;\;\;\;\;\;\;\;dy(t)}& = & {[\mu - \sigma {{(t)}^2}/2]dt + \sigma (t)dW(t)}\\ {d\;\log \;{\sigma ^2}(t)}& = & {\lambda [\log {\sigma ^2}(t) - \log {{\bar \sigma }^2}]dt + \gamma dV(t).}\end{array}$$
((61))

This has been shown to be the continuous time limit of an AR(1)-EGARCH model (Nelson, 1990, sect. 3.3) and corresponds to the discrete time model

$$\begin{array}{lll}{\;\;\;\;\;\;\;\;\;{y_i}}& = & {\exp ({h_i}{\rm{/}}2){\epsilon_i}}\\ {{h_{i + 1}} - \bar h}& = & {\lambda [{h_i} - \bar h] + \gamma {\eta _i}}\\{\;\;\;\;\;\;\;\;({h_i}}& = & {\log \sigma _i^2)}\end{array}$$

used for the mean corrected returns by Kim et al. (1998).

Since available data are measured in discrete time (daily or weekly), but the models in the option pricing literature are mostly formulated in continuous time, time series formulations are only approximations to the sampled stochastic processes. The continuous time asymptotics and asymptotic GARCH filters developed by Nelson (1990, 1992) are only valid in the limit of a small sampling interval. In analogy to linear theory, the differential equations should be filtered and estimated using discrete data with arbitrary time intervals. This involves sampled diffusion processes with latent variables, since the volatility is not observed. In contrast to linear models, where exact discrete time series can be derived explicitly (cf. Bergstrom, 1990), the exact analogs of nonlinear systems involve transition densities which are solutions of the Fokker-Planck equation.

The following example serves to illustrate the simulated properties of the continuous time stochastic log volatility model (61). We assume that T = 365 daily data are simulated (y0 = log 100, h0 = log(0.2²)), but measurements are taken only weekly, i.e. δt = 1/365; Δt = 7/365. Parameters were chosen as \(\psi \; = \;{\rm{\{ }}\mu ,\lambda ,\bar h,\gamma ,\rho {\rm{\} }}\; = \;{\rm{\{ }}0.07, - 1,\log ({0.2^2}) = - 3.21888,2,0{\rm{\} }}\) and the prior distribution of the state η(0) = {y(0), h(0)} was set to P0|−1 ∼ N({4, −3}, diag(1, 1)). Thus the correlation ρ between the Wiener processes W and V is zero, in accordance with Kim et al. In the measurement model the measurement error variance was set to R = 0.0001. Figures (10–15) show the simulated likelihood surface as a function of ψ3 = \(\bar h\) in the interval [−6, 1]. It is seen that sorting of the posterior distribution (cf. section 6.2) improves the smoothness of the likelihood and lowers the sampling error of the score function ∂l/∂ψ. In figure (12), antithetical sampling further improves the smoothness, as seen in the score (right picture). Figures (13–15) demonstrate that importance sampling permits the use of a much smaller Monte Carlo sample size N.
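To make the design explicit, the simulation step can be sketched as follows (an illustrative Python reading of this setup, not the original code):

```python
import numpy as np

rng = np.random.default_rng(3)
dt = 1 / 365                                   # daily Euler steps (delta t)
mu, lam, hbar, gamma = 0.07, -1.0, np.log(0.2 ** 2), 2.0
y, h = np.log(100.0), np.log(0.2 ** 2)         # y_0 = log 100, h_0 = log 0.2^2
ys = [y]
for _ in range(365):                           # T = 365 daily data
    dW, dV = np.sqrt(dt) * rng.standard_normal(2)        # rho = 0: independent
    y += (mu - np.exp(h) / 2) * dt + np.exp(h / 2) * dW  # first line of (61)
    h += lam * (h - hbar) * dt + gamma * dV              # second line of (61)
    ys.append(y)
# weekly measurements z_i = y(t_i) + eps_i with Var(eps_i) = R = 0.0001
z = np.array(ys[::7]) + np.sqrt(1e-4) * rng.standard_normal(len(ys[::7]))
```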

Figure 10
figure 10

Stochastic volatility model. Likelihood surface simulated without importance sampling. Deterministic resampling without ordering (sample size N = 10000).

Figure 11
figure 11

Stochastic volatility model. Likelihood surface simulated without importance sampling. Deterministic resampling with ordering (sample size N = 10000).

Figure 12
figure 12

Stochastic volatility model. Likelihood surface simulated without importance sampling. Deterministic resampling with ordering and antithetical variates (sample size N = 10000).

Figure 13
figure 13

Stochastic volatility model. Likelihood surface simulated with importance sampling. Deterministic resampling without ordering (sample size N = 200).

Figure 14
figure 14

Stochastic volatility model. Likelihood surface simulated with importance sampling. Deterministic resampling with ordering (sample size N = 200).

Figure 15
figure 15

Stochastic volatility model. Likelihood surface simulated with importance sampling. Deterministic resampling with ordering (sample size N = 500).

8 Conclusion

We have shown how the likelihood function of a continuous-discrete state space model can be simulated using Monte Carlo integration. The variance of the estimate is considerably reduced by using importance sampling. The importance density was computed by approximate smoothing algorithms, which run on gaussian sums and are only suboptimal in general nonlinear systems. Nevertheless, a strong reduction in dispersion is achieved. Currently, the algorithms are being tested in estimating the parameters of stochastic volatility models and the Lorenz model.