
In recent years state-space representations and the associated Kalman recursions have had a profound impact on time series analysis and many related areas. The techniques were originally developed in connection with the control of linear systems (for accounts of this subject see Davis and Vinter 1985; Hannan and Deistler 1988). An extremely rich class of models for time series, including and going well beyond the linear ARIMA and classical decomposition models considered so far in this book, can be formulated as special cases of the general state-space model defined below in Section 9.1. In econometrics the structural time series models developed by Harvey (1990) are formulated (like the classical decomposition model) directly in terms of components of interest such as trend, seasonal component, and noise. However, the rigidity of the classical decomposition model is avoided by allowing the trend and seasonal components to evolve randomly rather than deterministically. An introduction to these structural models is given in Section 9.2, and a state-space representation is developed for a general ARIMA process in Section 9.3. The Kalman recursions, which play a key role in the analysis of state-space models, are derived in Section 9.4. These recursions allow a unified approach to prediction and estimation for all processes that can be given a state-space representation. Following the development of the Kalman recursions we discuss estimation with structural models (Section 9.5) and the formulation of state-space models to deal with missing values (Section 9.6). In Section 9.7 we introduce the EM algorithm, an iterative procedure for maximizing the likelihood when only a subset of the complete data set is available. The EM algorithm is particularly well suited for estimation problems in the state-space framework. Generalized state-space models are introduced in Section 9.8. These are Bayesian models that can be used to represent time series of many different types, as demonstrated by two applications to time series of count data. Throughout the chapter we shall use the notation

$$ \displaystyle{ \{\mathbf{W}_{t}\} \sim \mathrm{WN}(\mathbf{0},\{R_{t}\}) } $$

to indicate that the random vectors W t have mean 0 and that

$$ \displaystyle{ E\left (\mathbf{W}_{s}\mathbf{W}_{t}'\right ) = \left \{\begin{array}{@{}l@{\quad }l@{}} R_{t},\quad &\mbox{ if }s = t, \\ 0, \quad &\mbox{ otherwise}. \end{array} \right. } $$

9.1 State-Space Representations

A state-space model for a (possibly multivariate) time series {Y t , t = 1, 2, …} consists of two equations. The first, known as the observation equation, expresses the w-dimensional observation Y t as a linear function of a v-dimensional state variable X t plus noise. Thus

$$ \displaystyle{ \mathbf{Y}_{t} = G_{t}\mathbf{X}_{t} + \mathbf{W}_{t},\quad t = 1,2,\ldots, } $$
(9.1.1)

where {W t } ∼ WN(0, {R t }) and {G t } is a sequence of w × v matrices. The second equation, called the state equation, determines the state X t+1 at time t + 1 in terms of the previous state X t and a noise term. The state equation is

$$ \displaystyle{ \mathbf{X}_{t+1} = F_{t}\mathbf{X}_{t} + \mathbf{V}_{t},\quad t = 1,2,\ldots, } $$
(9.1.2)

where {F t } is a sequence of v × v matrices, {V t } ∼ WN(0, {Q t }), and {V t } is uncorrelated with {W t } (i.e., E(W t V s ′) = 0 for all s and t). To complete the specification, it is assumed that the initial state X 1 is uncorrelated with all of the noise terms {V t } and {W t }.

Remark 1.

A more general form of the state-space model allows for correlation between V t and W t (see Brockwell and Davis (1991), Chapter 12) and for the addition of a control term H t u t in the state equation. In control theory, H t u t represents the effect of applying a “control” u t at time t for the purpose of influencing X t+1. However, the system defined by (9.1.1) and (9.1.2) with \( E{\bigl (\mathbf{W}_{t}\mathbf{V}_{s}'\bigr )} = 0 \) for all s and t will be adequate for our purposes. □ 

Remark 2.

In many important special cases, the matrices F t , G t , Q t , and R t will be independent of t, in which case the subscripts will be suppressed. □ 

Remark 3.

It follows from the observation equation (9.1.1) and the state equation (9.1.2) that X t and Y t have the functional forms, for t = 2, 3, ,

$$ \displaystyle\begin{array}{rcl} \mathbf{X}_{t}& =& F_{t-1}\mathbf{X}_{t-1} + \mathbf{V}_{t-1} \\ & =& F_{t-1}(F_{t-2}\mathbf{X}_{t-2} + \mathbf{V}_{t-2}) + \mathbf{V}_{t-1} \\ & & \ \ \vdots \\ & =& (F_{t-1}\cdots F_{1})\mathbf{X}_{1} + (F_{t-1}\cdots F_{2})\mathbf{V}_{1} + \cdots + F_{t-1}\mathbf{V}_{t-2} + \mathbf{V}_{t-1} \\ & =& f_{t}(\mathbf{X}_{1},\mathbf{V}_{1},\ldots,\mathbf{V}_{t-1}) {}\end{array} $$
(9.1.3)

and

$$ \displaystyle{ \mathbf{Y}_{t} = g_{t}(\mathbf{X}_{1},\mathbf{V}_{1},\ldots,\mathbf{V}_{t-1},\mathbf{W}_{t}).\mbox{ $\square $} } $$
(9.1.4)

Remark 4.

From Remark 3 and the assumptions on the noise terms, it is clear that

$$ \displaystyle{ E\left (\mathbf{V}_{t}\mathbf{X}'_{s}\right ) = 0,\qquad E\left (\mathbf{V}_{t}\mathbf{Y}'_{s}\right ) = 0,\quad 1 \leq s \leq t, } $$

and

$$ \displaystyle{ E\left (\mathbf{W}_{t}\mathbf{X}'_{s}\right ) = 0,\quad 1 \leq s \leq t,\qquad E(\mathbf{W}_{t}\mathbf{Y}'_{s}) = 0,\quad 1 \leq s < t.\mbox{ $\square $} } $$

Definition 9.1.1

A time series {Y t } has a state-space representation if there exists a state-space model for {Y t } as specified by equations (9.1.1) and (9.1.2).

As already indicated, it is possible to find a state-space representation for a large number of time-series (and other) models. It is clear also from the definition that neither {X t } nor {Y t } is necessarily stationary. The beauty of a state-space representation, when one can be found, lies in the simple structure of the state equation (9.1.2), which permits relatively simple analysis of the process {X t }. The behavior of {Y t } is then easy to determine from that of {X t } using the observation equation (9.1.1). If the sequence {X 1, V 1, V 2, …} is independent, then {X t } has the Markov property; i.e., the distribution of X t+1 given X t , …, X 1 is the same as the distribution of X t+1 given X t . This is a property possessed by many physical systems, provided that we include sufficiently many components in the specification of the state X t (for example, we may choose the state vector in such a way that X t includes components of X t−1 for each t).

Example 9.1.1

An AR(1) Process

Let {Y t } be the causal AR(1) process given by

$$ \displaystyle{ Y _{t} =\phi Y _{t-1} + Z_{t},\quad \{Z_{t}\} \sim \mathrm{WN}\left (0,\sigma ^{2}\right ). } $$
(9.1.5)

In this case, a state-space representation for {Y t } is easy to construct. We can, for example, define a sequence of state variables X t by

$$ \displaystyle{ X_{t+1} =\phi X_{t} + V _{t},\quad t = 1,2,\ldots, } $$
(9.1.6)

where \( X_{1} = Y _{1} =\sum _{j=0}^{\infty }\phi ^{j}Z_{1-j} \) and V t  = Z t+1. The process {Y t } then satisfies the observation equation

$$ \displaystyle{ Y _{t} = X_{t}, } $$

which has the form (9.1.1) with G t  = 1 and W t  = 0.

Example 9.1.2

An ARMA(1,1) Process

Let {Y t } be the causal and invertible ARMA(1,1) process satisfying the equations

$$ \displaystyle{ Y _{t} =\phi Y _{t-1} + Z_{t} +\theta Z_{t-1},\quad \{Z_{t}\} \sim \mathrm{WN}\left (0,\sigma ^{2}\right ). } $$
(9.1.7)

Although the existence of a state-space representation for {Y t } is not obvious, we can find one by observing that

$$ \displaystyle{ Y _{t} =\theta (B)X_{t} = \left [\begin{array}{*{10}c} \theta \quad 1 \end{array} \right ]\left [\begin{array}{*{10}c} X_{t-1} \\ X_{t} \end{array} \right ], } $$
(9.1.8)

where {X t } is the causal AR(1) process satisfying

$$ \displaystyle{ \phi (B)X_{t} = Z_{t}, } $$

or the equivalent equation

$$ \displaystyle{ \left [\begin{array}{*{10}c} X_{t} \\ X_{t+1} \end{array} \right ] = \left [\begin{array}{*{10}c} 0&1\\ 0 & \phi \end{array} \right ]\left [\begin{array}{*{10}c} X_{t-1} \\ X_{t} \end{array} \right ]+\left [\begin{array}{*{10}c} 0\\ Z_{t+1} \end{array} \right ]. } $$
(9.1.9)

Noting that \( X_{t} =\sum _{j=0}^{\infty }\phi ^{j}Z_{t-j} \), we see that equations (9.1.8) and (9.1.9) for t = 1, 2, … furnish a state-space representation of {Y t } with

$$ \displaystyle{ \mathbf{X}_{t} = \left [\begin{array}{*{10}c} X_{t-1} \\ X_{t}\\ \end{array} \right ]\ \mathrm{and}\ \mathbf{X}_{1} = \left [\begin{array}{*{10}c} \sum \limits _{j=0}^{\infty }\phi ^{j}Z_{-j} \\ \sum \limits _{j=0}^{\infty }\phi ^{j}Z_{1-j} \end{array} \right ]. } $$

The extension of this state-space representation to general ARMA and ARIMA processes is given in Section 9.3.
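
As a concrete illustration (ours, not part of the text), the following minimal Python/NumPy sketch assembles the matrices of (9.1.8)–(9.1.9) for given φ and θ and simulates the resulting observations; the function names, and the simplification of starting the state at zero rather than at the stationary X 1, are our own assumptions.

```python
import numpy as np

def arma11_state_space(phi, theta):
    """State-space matrices of the ARMA(1,1) representation (9.1.8)-(9.1.9)."""
    F = np.array([[0.0, 1.0],
                  [0.0, phi]])        # state transition matrix in (9.1.9)
    G = np.array([[theta, 1.0]])      # observation row [theta  1] in (9.1.8)
    return F, G

def simulate_arma11(phi, theta, sigma2, n, seed=0):
    """Simulate Y_1,...,Y_n; the state is started at 0 instead of the stationary X_1."""
    rng = np.random.default_rng(seed)
    F, G = arma11_state_space(phi, theta)
    x = np.zeros(2)
    y = np.empty(n)
    for t in range(n):
        y[t] = float(G @ x)                           # Y_t = [theta 1] X_t (no observation noise)
        v = np.array([0.0, rng.normal(scale=np.sqrt(sigma2))])
        x = F @ x + v                                 # X_{t+1} = F X_t + V_t, V_t = (0, Z_{t+1})'
    return y

y = simulate_arma11(phi=0.5, theta=0.4, sigma2=1.0, n=200)
```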

In subsequent sections we shall give examples that illustrate the versatility of state-space models. (More examples can be found in Aoki 1987; Hannan and Deistler 1988; Harvey 1990; West and Harrison 1989.) Before considering these, we need a slight modification of (9.1.1) and (9.1.2), which allows for series in which the time index runs from −∞ to ∞. This is a more natural formulation for many time series models.

9.1.1 State-Space Models with t ∈ {0, ±1, …}

Consider the observation and state equations

$$ \displaystyle{ \mathbf{Y}_{t} = G\mathbf{X}_{t} + \mathbf{W}_{t},\qquad t = 0,\pm 1,\ldots, } $$
(9.1.10)
$$ \displaystyle{ \mathbf{X}_{t+1} = F\mathbf{X}_{t} + \mathbf{V}_{t},\qquad t = 0,\pm 1,\ldots, } $$
(9.1.11)

where F and G are v × v and w × v matrices, respectively, {V t } ∼ WN(0, Q), \( \{\mathbf{W}_{t}\} \sim \mathrm{WN}(\mathbf{0},R) \), and E(V s W t ′) = 0 for all s, and t.

The state equation (9.1.11) is said to be stable if the matrix F has all its eigenvalues in the interior of the unit circle, or equivalently if det(I − Fz) ≠ 0 for all complex z such that | z | ≤ 1. The matrix F is then also said to be stable.

In the stable case equation (9.1.11) has the unique stationary solution (Problem 9.1) given by

$$ \displaystyle{ \mathbf{X}_{t} =\sum _{ j=0}^{\infty }F^{j}\mathbf{V}_{ t-j-1}. } $$

The corresponding sequence of observations

$$ \displaystyle{ \mathbf{Y}_{t} = \mathbf{W}_{t} +\sum _{ j=0}^{\infty }GF^{j}\mathbf{V}_{ t-j-1} } $$

is also stationary.

9.2 The Basic Structural Model

A structural time series model, like the classical decomposition model defined by (1.5.1), is specified in terms of components such as trend, seasonality, and noise, which are of direct interest in themselves. The deterministic nature of the trend and seasonal components in the classical decomposition model, however, limits its applicability. A natural way in which to overcome this deficiency is to permit random variation in these components. This can be very conveniently done in the framework of a state-space representation, and the resulting rather flexible model is called a structural model. Estimation and forecasting with this model can be encompassed in the general procedure for state-space models made possible by the Kalman recursions of Section 9.4.

Example 9.2.1

The Random Walk Plus Noise Model

One of the simplest structural models is obtained by adding noise to a random walk. It is suggested by the nonseasonal classical decomposition model

$$ \displaystyle{ Y _{t} = M_{t} + W_{t},\quad \mathrm{where}\ \{W_{t}\} \sim \mathrm{WN}\left (0,\sigma _{w}^{2}\right ), } $$
(9.2.1)

and M t  = m t , the deterministic “level” or “signal” at time t. We now introduce randomness into the level by supposing that M t is a random walk satisfying

$$ \displaystyle{ M_{t+1} = M_{t} + V _{t},\quad \mathrm{and}\quad \{V _{t}\} \sim \mathrm{WN}\left (0,\sigma _{v}^{2}\right ), } $$
(9.2.2)

with initial value M 1 = m 1. Equations (9.2.1) and (9.2.2) constitute the “local level” or “random walk plus noise” model. Figure 9.1 shows a realization of length 100 of this model with M 1 = 0, σ v 2 = 4, and σ w 2 = 8. (The realized values m t of M t  are plotted as a solid line, and the observed data are plotted as square boxes.) The differenced data

$$ \displaystyle{ D_{t}:= \nabla Y _{t} = Y _{t} - Y _{t-1} = V _{t-1} + W_{t} - W_{t-1},\quad t \geq 2, } $$

constitute a stationary time series with mean 0 and ACF

$$ \displaystyle{ \rho_{D}(h) = \left \{\begin{array}{@{}l@{\quad }l@{}} \dfrac{-\sigma _{w}^{2}} {2\sigma _{w}^{2} +\sigma _{ v}^{2}},\quad &\mbox{ if }\vert h\vert = 1, \\ 0, \quad &\mbox{ if }\vert h\vert > 1. \end{array} \right. } $$

Since {D t } is 1-correlated, we conclude from Proposition 2.1.1 that {D t } is an MA(1) process and hence that {Y t } is an ARIMA(0,1,1) process. More specifically,

$$ \displaystyle{ D_{t} = Z_{t} +\theta Z_{t-1},\quad \{Z_{t}\} \sim \mathrm{WN}\left (0,\sigma ^{2}\right ), } $$
(9.2.3)

where θ and σ 2 are found by solving the equations

$$ \displaystyle{ \frac{\theta } {1 +\theta ^{2}} = \frac{-\sigma _{w}^{2}} {2\sigma _{w}^{2} +\sigma _{ v}^{2}}\quad \mathrm{and}\quad \theta \sigma ^{2} = -\sigma _{ w}^{2}. } $$

For the process {Y t } generating the data in Figure 9.1, the parameters θ and σ 2 of the differenced series {D t } satisfy θ∕(1 +θ 2) = −0.4 and θ σ 2 = −8. Solving these equations for θ and σ 2, we find that θ = −0.5 and σ 2 = 16 (or θ = −2 and σ 2 = 4). The sample ACF of the observed differences D t of the realization of {Y t } in Figure 9.1 is shown in Figure 9.2.
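
The computation of θ and σ 2 can be packaged in a few lines. The sketch below (ours, not from the text) returns the invertible solution |θ| ≤ 1 for given σ v 2 and σ w 2, and reproduces θ = −0.5 and σ 2 = 16 for the parameters used above.

```python
import numpy as np

def local_level_ma1(sigma_v2, sigma_w2):
    """MA(1) parameters (theta, sigma2) of the differenced local level model.

    Solves theta/(1+theta^2) = -sigma_w2/(2*sigma_w2 + sigma_v2) and
    theta*sigma2 = -sigma_w2, returning the invertible root |theta| <= 1.
    """
    rho = -sigma_w2 / (2.0 * sigma_w2 + sigma_v2)          # lag-one ACF of D_t
    theta = (1.0 - np.sqrt(1.0 - 4.0 * rho**2)) / (2.0 * rho)
    sigma2 = -sigma_w2 / theta
    return theta, sigma2

print(local_level_ma1(sigma_v2=4.0, sigma_w2=8.0))         # approximately (-0.5, 16.0)
```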

Fig. 9.1
figure 1

Realization from a random walk plus noise model. The random walk is represented by the solid line and the data are represented by boxes

Fig. 9.2
figure 2

Sample ACF of the series obtained by differencing the data in Figure 9.1

The local level model is often used to represent a measured characteristic of the output of an industrial process for which the unobserved process level {M t } is intended to be within specified limits (to meet the design specifications of the manufactured product). To decide whether or not the process requires corrective attention, it is important to be able to test the hypothesis that the process level {M t } is constant. From the state equation, we see that {M t } is constant (and equal to m 1) when V t  = 0 or equivalently when σ v 2 = 0. This in turn is equivalent to the moving-average model (9.2.3) for {D t } being noninvertible with θ = −1 (see Problem 8.2). Tests of the unit root hypothesis θ = −1 were discussed in Section 6.3.2

The local level model can easily be extended to incorporate a locally linear trend with slope β t at time t. Equation (9.2.2) is replaced by

$$ \displaystyle{ M_{t} = M_{t-1} + B_{t-1} + V _{t-1}, } $$
(9.2.4)

where B t−1 = β t−1. Now if we introduce randomness into the slope by replacing it with the random walk

$$ \displaystyle{ B_{t} = B_{t-1} + U_{t-1},\quad \mathrm{where}\ \{U_{t}\} \sim \mathrm{WN}\left (0,\sigma _{u}^{2}\right ), } $$
(9.2.5)

we obtain the “local linear trend” model.

To express the local linear trend model in state-space form we introduce the state vector

$$ \displaystyle{ \mathbf{X}_{t} = (M_{t},B_{t})'. } $$

Then (9.2.4) and (9.2.5) can be written in the equivalent form

$$ \displaystyle{ \mathbf{X}_{t+1} = \left [\begin{array}{*{10}c} 1&1\\ 0 &1 \end{array} \right ]\mathbf{X}_{t}+\mathbf{V}_{t},\quad t = 1,2,\ldots, } $$
(9.2.6)

where V t  = (V t , U t )′. The process {Y t } is then determined by the observation equation

$$ \displaystyle{ Y _{t} = [1\quad 0]\ \mathbf{X}_{t} + W_{t}. } $$
(9.2.7)

If {X 1, U 1, V 1, W 1, U 2, V 2, W 2, } is an uncorrelated sequence, then equations (9.2.6) and (9.2.7) constitute a state-space representation of the process {Y t }, which is a model for data with randomly varying trend and added noise. For this model we have v = 2, w = 1,

$$ \displaystyle{ F = \left [\begin{array}{*{10}c} 1& 1\\ 0 &1 \end{array} \right ],\quad G = [1\quad 0],\quad Q = \left [\begin{array}{*{10}c} \sigma _{v}^{2} & 0 \\ 0 &\sigma _{u}^{2} \end{array} \right ],\quad \mathrm{and}\ R =\sigma _{ w}^{2}. } $$

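For concreteness, here is a short NumPy sketch (our own illustration) that simulates the local linear trend model (9.2.6)–(9.2.7) using the matrices F, G, Q, and R displayed above; the initial state and the function name are hypothetical choices.

```python
import numpy as np

def simulate_local_linear_trend(n, sigma_v2, sigma_u2, sigma_w2, x1=(0.0, 0.1), seed=0):
    """Simulate Y_1,...,Y_n from (9.2.6)-(9.2.7); x1 = (M_1, B_1)' is the initial state."""
    rng = np.random.default_rng(seed)
    F = np.array([[1.0, 1.0],
                  [0.0, 1.0]])
    G = np.array([1.0, 0.0])
    x = np.array(x1, dtype=float)
    y = np.empty(n)
    for t in range(n):
        y[t] = G @ x + rng.normal(scale=np.sqrt(sigma_w2))   # observation equation (9.2.7)
        v = np.array([rng.normal(scale=np.sqrt(sigma_v2)),   # V_t = (V_t, U_t)'
                      rng.normal(scale=np.sqrt(sigma_u2))])
        x = F @ x + v                                        # state equation (9.2.6)
    return y

y = simulate_local_linear_trend(n=100, sigma_v2=4.0, sigma_u2=1.0, sigma_w2=8.0)
```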
Example 9.2.2

A Seasonal Series with Noise

The classical decomposition (1.5.11) expressed the time series {X t } as a sum of trend, seasonal, and noise components. The seasonal component (with period d ) was a sequence {s t } with the properties s t+d  = s t and \( \sum _{t=1}^{d}s_{t} = 0 \). Such a sequence can be generated, for any values of s 1, s 0, …, s −d+3, by means of the recursions

$$ \displaystyle{ s_{t+1} = -s_{t} -\cdots - s_{t-d+2},\quad t = 1,2,\ldots. } $$
(9.2.8)

A somewhat more general seasonal component {Y t }, allowing for random deviations from strict periodicity, is obtained by adding a term S t to the right side of (9.2.8), where {S t } is white noise with mean zero. This leads to the recursion relations

$$ \displaystyle{ Y _{t+1} = -Y _{t} -\cdots - Y _{t-d+2} + S_{t},\quad t = 1,2,\ldots. } $$
(9.2.9)

To find a state-space representation for {Y t } we introduce the (d − 1)-dimensional state vector

$$ \displaystyle{ \mathbf{X}_{t} = (Y _{t},Y _{t-1},\ldots,Y _{t-d+2})'. } $$

The series {Y t } is then given by the observation equation

$$ \displaystyle{ Y _{t} = [1\quad 0\quad 0\ \cdots \ 0]\ \mathbf{X}_{t},\quad t = 1,2,\ldots, } $$
(9.2.10)

where {X t } satisfies the state equation

$$ \displaystyle{ \mathbf{X}_{t+1} = F\mathbf{X}_{t} + \mathbf{V}_{t},\quad t = 1,2\ldots, } $$
(9.2.11)

V t  = (S t , 0, , 0)′, and

$$ \displaystyle{ F = \left [\begin{array}{*{10}c} -1 & -1 & \cdots & -1 & -1\\ 1 & 0 & \cdots & 0 & 0 \\ 0 & 1 & \cdots & 0 & 0\\ \vdots & \vdots & \ddots & \vdots & \vdots\\ 0 & 0 & \cdots & 1 & 0 \end{array} \right ]. } $$
(9.2.12)
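
The (d − 1) × (d − 1) matrix (9.2.12) is easy to construct programmatically; the following sketch (ours) builds it for an arbitrary period d and prints the d = 4 case.

```python
import numpy as np

def seasonal_F(d):
    """Coefficient matrix (9.2.12) of the (d-1)-dimensional seasonal state equation."""
    m = d - 1
    F = np.zeros((m, m))
    F[0, :] = -1.0                 # first row: -1, -1, ..., -1
    F[1:, :-1] = np.eye(m - 1)     # subdiagonal of ones
    return F

print(seasonal_F(4))
# [[-1. -1. -1.]
#  [ 1.  0.  0.]
#  [ 0.  1.  0.]]
```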

Example 9.2.3

A Randomly Varying Trend with Random Seasonality and Noise

A series with randomly varying trend, random seasonality and noise can be constructed by adding the two series in Examples 9.2.1 and 9.2.2. (Addition of series with state-space representations is in fact always possible by means of the following construction. See Problem 9.9.) We introduce the state vector

$$ \displaystyle{ \mathbf{X}_{t} = \left [\begin{array}{*{10}c} \mathbf{X}_{t}^{1} \\ \mathbf{X}_{t}^{2} \end{array} \right ], } $$

where X t 1 and X t 2 are the state vectors in (9.2.6) and (9.2.11). We then have the following representation for {Y t }, the sum of the two series whose state-space representations were given in (9.2.6)–(9.2.7) and (9.2.10)–(9.2.11). The state equation is

$$ \displaystyle{ \mathbf{X}_{t+1} = \left [\begin{array}{*{10}c} F_{1} & 0 \\ 0 & F_{2} \end{array} \right ]\mathbf{X}_{t}+\left [\begin{array}{*{10}c} \mathbf{V}_{t}^{1} \\ \mathbf{V}_{t}^{2} \end{array} \right ], } $$
(9.2.13)

where F 1, F 2 are the coefficient matrices and {V t 1}, {V t 2} are the noise vectors in the state equations (9.2.6) and (9.2.11), respectively. The observation equation is

$$ \displaystyle{ Y _{t} = [1\quad 0\quad 1\quad 0\ \cdots \ 0]\,\mathbf{X}_{t} + W_{t}, } $$
(9.2.14)

where {W t } is the noise sequence in (9.2.7). If the sequence of random vectors {X 1, V 1 1, V 1 2, W 1, V 2 1, V 2 2, W 2, } is uncorrelated, then equations (9.2.13) and (9.2.14) constitute a state-space representation for {Y t }.

9.3 State-Space Representation of ARIMA Models

We begin by establishing a state-space representation for the causal AR(p) process and then build on this example to find representations for the general ARMA and ARIMA processes.

Example 9.3.1

State-Space Representation of a Causal AR(p) Process

Consider the AR(p) process defined by

$$ \displaystyle{ Y _{t+1} =\phi _{1}Y _{t} +\phi _{2}Y _{t-1} + \cdots +\phi _{p}Y _{t-p+1} + Z_{t+1},\quad t = 0,\pm 1,\ldots, } $$
(9.3.1)

where \( \{Z_{t}\} \sim \mathrm{WN}{\bigl (0,\sigma ^{2}\bigr )} \), and ϕ(z): = 1 −ϕ 1 z −⋯ −ϕ p z p is nonzero for | z | ≤ 1. To express {Y t } in state-space form we simply introduce the state vectors

$$ \displaystyle{ \mathbf{X}_{t} = \left [\begin{array}{*{10}c} Y _{t-p+1} \\ Y _{t-p+2}\\ \vdots \\ Y _{t} \end{array} \right ],\quad t = 0,\pm 1,\ldots. } $$
(9.3.2)

From (9.3.1) and (9.3.2) the observation equation is

$$ \displaystyle{ Y _{t} = [0\quad 0\quad 0\ \cdots \ 1]\mathbf{X}_{t},\quad t = 0,\pm 1,\ldots, } $$
(9.3.3)

while the state equation is given by

$$ \displaystyle{ \mathbf{X}_{t+1} = \left [\begin{array}{*{10}c} 0 & 1 & 0 & \cdots & 0\\ 0 & 0 & 1 & \cdots & 0\\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & 1\\ \phi _{ p} & \phi _{p-1} & \phi _{p-2} & \cdots & \phi _{1} \end{array} \right ]\mathbf{X}_{t}+\left [\begin{array}{*{10}c} 0\\ 0\\ \vdots \\ 0\\ 1 \end{array} \right ]Z_{t+1},\quad t = 0,\pm 1,\ldots. } $$
(9.3.4)

These equations have the required forms (9.1.10) and (9.1.11) with W t  = 0 and V t  = (0, 0, …, Z t+1)′, t = 0, ±1, …. 

Remark 1.

In Example 9.3.1 the causality condition ϕ(z) ≠ 0 for | z | ≤ 1 is equivalent to the condition that the state equation (9.3.4) is stable, since the eigenvalues of the coefficient matrix in (9.3.4) are simply the reciprocals of the zeros of ϕ(z) (Problem 9.3). □ 
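
Remark 1 can be checked numerically. The sketch below (our own, with a hypothetical AR(3) coefficient vector) builds the companion matrix of (9.3.4) and verifies that every eigenvalue λ satisfies ϕ(1∕λ) = 0, so that the eigenvalues are the reciprocals of the zeros of ϕ(z); causality corresponds to all |λ| < 1.

```python
import numpy as np

phi = np.array([0.5, -0.3, 0.2])      # hypothetical AR(3) coefficients phi_1, phi_2, phi_3
p = len(phi)

# companion matrix of the state equation (9.3.4)
F = np.zeros((p, p))
F[:-1, 1:] = np.eye(p - 1)
F[-1, :] = phi[::-1]                  # bottom row: phi_p, phi_{p-1}, ..., phi_1

lam = np.linalg.eigvals(F)

# phi(z) = 1 - phi_1 z - ... - phi_p z^p, coefficients in decreasing powers of z
coeffs = np.concatenate((-phi[::-1], [1.0]))

print(np.allclose(np.polyval(coeffs, 1.0 / lam), 0.0))   # True: 1/lambda are the zeros of phi
print(np.all(np.abs(lam) < 1.0))                         # stability of (9.3.4), i.e. causality
```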

Remark 2.

If equations (9.3.3) and (9.3.4) are postulated to hold only for t = 1, 2, …, and if X 1 is a random vector such that {X 1, Z 1, Z 2, …} is an uncorrelated sequence, then we have a state-space representation for {Y t } of the type defined earlier by (9.1.1) and (9.1.2). The resulting process {Y t } is well-defined, regardless of whether or not the state equation is stable, but it will not in general be stationary. It will be stationary if the state equation is stable and if X 1 is defined by (9.3.2) with \( Y _{t} =\sum _{j=0}^{\infty }\psi _{j}Z_{t-j} \), t = 1, 0, …, 2 − p, and ψ(z) = 1∕ϕ(z), | z | ≤ 1. □ 

Example 9.3.2

State-Space Form of a Causal ARMA(p, q) Process

State-space representations are not unique. Here we shall give one of the (infinitely many) possible representations of a causal ARMA(p,q) process that can easily be derived from Example 9.3.1. Consider the ARMA(p,q) process defined by

$$ \displaystyle{ \phi (B)Y _{t} =\theta (B)Z_{t},\quad t = 0,\pm 1,\ldots, } $$
(9.3.5)

where \( \{Z_{t}\} \sim \mathrm{WN}{\bigl (0,\sigma ^{2}\bigr )} \) and \( \phi (z)\neq 0 \) for | z | ≤ 1. Let

$$ \displaystyle{ r =\max (p,q + 1),\quad \phi _{j} = 0\quad \mathrm{for}\ j > p,\quad \theta _{j} = 0\quad \mathrm{for}\ \ j > q,\quad \mathrm{and}\quad \theta _{0} = 1. } $$

If {U t } is the causal AR( p) process satisfying

$$ \displaystyle{ \phi (B)U_{t} = Z_{t}, } $$
(9.3.6)

then Y t  = θ(B)U t , since

$$ \displaystyle{ \phi (B)Y _{t} =\phi (B)\theta (B)U_{t} =\theta (B)\phi (B)U_{t} =\theta (B)Z_{t}. } $$

Consequently,

$$ \displaystyle{ Y _{t} = [\theta _{r-1}\quad \theta _{r-2}\ \cdots \ \theta _{0}]\mathbf{X}_{t}, } $$
(9.3.7)

where

$$ \displaystyle{ \mathbf{X}_{t} = \left [\begin{array}{*{10}c} U_{t-r+1} \\ U_{t-r+2}\\ \vdots \\ U_{t} \end{array} \right ]. } $$
(9.3.8)

But from Example 9.3.1 we can write

$$ \displaystyle{ \mathbf{X}_{t+1} = \left [\begin{array}{*{10}c} 0 & 1 & 0 & \cdots & 0\\ 0 & 0 & 1 & \cdots & 0\\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & 1\\ \phi _{ r} & \phi _{r-1} & \phi _{r-2} & \cdots & \phi _{1} \end{array} \right ]\mathbf{X}_{t}+\left [\begin{array}{*{10}c} 0\\ 0\\ \vdots \\ 0\\ 1 \end{array} \right ]Z_{t+1},\quad t = 0,\pm 1,\ldots. } $$
(9.3.9)

Equations (9.3.7) and (9.3.9) are the required observation and state equations. As in Example 9.3.1, the observation and state noise vectors are again W t  = 0 and V t  = (0, 0, …, Z t+1)′, t = 0, ±1, ….
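
The construction of this example is easily automated. The sketch below (ours, not from the text) returns F and G of (9.3.9) and (9.3.7) for arbitrary ARMA(p, q) coefficients; applied to the ARMA(1,1) of Example 9.1.2 it recovers the matrices used there.

```python
import numpy as np

def arma_state_space(phi, theta):
    """F and G of the ARMA(p,q) representation (9.3.7) and (9.3.9).

    phi = (phi_1,...,phi_p), theta = (theta_1,...,theta_q); theta_0 = 1.
    """
    phi, theta = np.asarray(phi, float), np.asarray(theta, float)
    p, q = len(phi), len(theta)
    r = max(p, q + 1)
    phi_pad = np.concatenate((phi, np.zeros(r - p)))                  # phi_j = 0 for j > p
    theta_pad = np.concatenate(([1.0], theta, np.zeros(r - 1 - q)))   # theta_0,...,theta_{r-1}

    F = np.zeros((r, r))
    F[:-1, 1:] = np.eye(r - 1)
    F[-1, :] = phi_pad[::-1]                       # bottom row: phi_r, ..., phi_1
    G = theta_pad[::-1].reshape(1, r)              # [theta_{r-1} ... theta_0]
    return F, G

F, G = arma_state_space(phi=[0.5], theta=[0.4])    # gives F = [[0,1],[0,0.5]], G = [[0.4, 1]]
```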

Example 9.3.3

State-Space Representation of an ARIMA(p, d, q) Process

If \( \big\{Y _{t}\big\} \) is an ARIMA(p, d, q) process with {∇d Y t } satisfying (9.3.5), then by the preceding example \( \big\{\nabla ^{d}Y _{t}\big\} \) has the representation

$$ \displaystyle{ \nabla ^{d}Y _{ t} = G\mathbf{X}_{t},\quad t = 0,\pm 1,\ldots, } $$
(9.3.10)

where {X t } is the unique stationary solution of the state equation

$$ \displaystyle{ \mathbf{X}_{t+1} = F\mathbf{X}_{t} + \mathbf{V}_{t}, } $$

F and G are the coefficients of X t in (9.3.9) and (9.3.7), respectively, and V t  = (0, 0, , Z t+1)′. Let A and B be the \( d \times 1 \) and d × d matrices defined by A = B = 1 if d = 1 and

$$ \displaystyle{ A = \left [\begin{array}{*{10}c} 0\\ 0\\ \vdots \\ 0\\ 1 \end{array} \right ],\quad B = \left [\begin{array}{*{10}c} 0 & 1 & 0 & \cdots & 0\\ 0 & 0 & 1 & \cdots & 0\\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & 1 \\ (-1)^{d+1}{d\choose d} & (-1)^{d}{d\choose d - 1} & (-1)^{d-1}{d\choose d - 2} & \cdots & d \end{array} \right ] } $$

if d > 1. Then since

$$ \displaystyle{ Y _{t} = \nabla ^{d}Y _{ t} -\sum _{j=1}^{d}{d\choose j}(-1)^{j}Y _{ t-j}, } $$
(9.3.11)

the vector

$$ \displaystyle{ \mathbf{Y}_{t-1}:= (Y _{t-d},\ldots,Y _{t-1})' } $$

satisfies the equation

$$ \displaystyle{ \mathbf{Y}_{t} = A\nabla ^{d}Y _{ t} + B\mathbf{Y}_{t-1} = AG\mathbf{X}_{t} + B\mathbf{Y}_{t-1}. } $$

Defining a new state vector T t by stacking X t and Y t−1, we therefore obtain the state equation

$$ \displaystyle{ \mathbf{T}_{t+1}:= \left [\begin{array}{*{10}c} \mathbf{X}_{t+1} \\ \mathbf{Y}_{t} \end{array} \right ] = \left [\begin{array}{*{10}c} F & 0\\ AG &B \end{array} \right ]\mathbf{T}_{t}+\left [\begin{array}{*{10}c} \mathbf{V}_{t}\\ \mathbf{0} \end{array} \right ],\quad t = 1,2,\ldots, } $$
(9.3.12)

and the observation equation, from (9.3.10) and (9.3.11),

$$ \displaystyle\begin{array}{rcl} Y _{t}=\left [G\,(-1)^{d+1}{d\choose d}\ (-1)^{d}{d\choose d - 1}\ (-1)^{d-1}{d\choose d - 2}\quad \cdots \quad d\right ]& & \quad \left [\begin{array}{*{10}c} \mathbf{X}_{t} \\ \mathbf{Y}_{t-1} \end{array} \right ], \\ t = 1,2,\ldots,& & {}\end{array} $$
(9.3.13)

with initial condition

$$ \displaystyle{ \mathbf{T}_{1} = \left [\begin{array}{*{10}c} \mathbf{X}_{1} \\ \mathbf{Y}_{0} \end{array} \right ] = \left [\begin{array}{*{10}c} \sum \limits _{j=0}^{\infty }F^{ j}\ \mathbf{V}_{-j} \\ \mathbf{Y}_{0} \end{array} \right ], } $$
(9.3.14)

and the assumption

$$ \displaystyle{ E(\mathbf{Y}_{0}Z_{t}') = 0,\quad t = 0,\pm 1,\ldots, } $$
(9.3.15)

where Y 0 = (Y 1−d , Y 2−d , …, Y 0)′. The conditions (9.3.15), which are satisfied in particular if Y 0 is considered to be nonrandom and equal to the vector of observed values (y 1−d , y 2−d , …, y 0)′, are imposed to ensure that the assumptions of a state-space model given in Section 9.1 are satisfied. They also imply that \( E\left (\mathbf{X}_{1}\mathbf{Y}_{0}'\right ) = 0 \) and \( E\left (\mathbf{Y}_{0}\nabla ^{d}Y _{t}\right ) = 0 \), t ≥ 1, as required earlier in Section 6.4 for prediction of ARIMA processes.

State-space models for more general ARIMA processes (e.g., {Y t } such that {∇∇12 Y t } is an ARMA(p, q) process) can be constructed in the same way. See Problem 9.4.

For the ARIMA(1, 1, 1) process defined by

$$ \displaystyle{(1 -\phi B)(1 - B)Y _{t} = (1 +\theta B)Z_{t},\quad \{Z_{t}\} \sim \mathrm{WN}\left (0,\sigma ^{2}\right ),} $$

the vectors X t and Y t−1 reduce to X t  = (X t−1, X t )′ and Y t−1 = Y t−1. From (9.3.12) and (9.3.13) the state-space representation is therefore (Problem 9.8)

$$ \displaystyle{ Y _{t} = \left [\begin{array}{*{10}c} \theta &1&1 \end{array} \right ]\left [\begin{array}{*{10}c} X_{t-1} \\ X_{t} \\ Y _{t-1} \end{array} \right ], } $$
(9.3.16)

where

$$ \displaystyle{ \left [\begin{array}{*{10}c} X_{t} \\ X_{t+1} \\ Y _{t} \end{array} \right ] = \left [\begin{array}{*{10}c} 0 & 1 & 0\\ 0 & \phi & 0 \\ \theta & 1 & 1 \end{array} \right ]\left [\begin{array}{*{10}c} X_{t-1} \\ X_{t} \\ Y _{t-1} \end{array} \right ]+\left [\begin{array}{*{10}c} 0\\ Z_{ t+1} \\ 0 \end{array} \right ],\quad t = 1,2,\ldots, } $$
(9.3.17)

and

$$ \displaystyle{ \left [\begin{array}{*{10}c} X_{0} \\ X_{1} \\ Y _{0} \end{array} \right ] = \left [\begin{array}{*{10}c} \sum \limits _{j=0}^{\infty }\phi ^{j}Z_{-j} \\ \sum \limits _{j=0}^{\infty }\phi ^{j}Z_{1-j} \\ Y _{0} \end{array} \right ]. } $$
(9.3.18)

9.4 The Kalman Recursions

In this section we shall consider three fundamental problems associated with the state-space model defined by (9.1.1) and (9.1.2) in Section 9.1. These are all concerned with finding best (in the sense of minimum mean square error) linear estimates of the state-vector X t in terms of the observations Y 1, Y 2, …, and a random vector Y 0 that is orthogonal to V t and W t for all t ≥ 1. In many cases Y 0 will be the constant vector (1, 1, …, 1)′. Estimation of X t in terms of:

  a. Y 0, …, Y t−1 defines the prediction problem,

  b. Y 0, …, Y t defines the filtering problem,

  c. Y 0, …, Y n  (n > t) defines the smoothing problem.

Each of these problems can be solved recursively using an appropriate set of Kalman recursions, which will be established in this section.

In the following definition of best linear predictor (and throughout this chapter) it should be noted that we do not automatically include the constant 1 among the predictor variables as we did in Sections 2.5 and 8.5. (It can, however, be included by choosing Y 0 = (1, 1, …, 1)′.)

Definition 9.4.1

For the random vector X = (X 1, , X v )′,

$$ \displaystyle{ P_{t}(\mathbf{X}):= (P_{t}(X_{1}),\ldots,P_{t}(X_{v}))', } $$

where P t (X i ): = P(X i  | Y 0, Y 1, , Y t ), is the best linear predictor of X i in terms of all components of Y 0, Y 1, , Y t .

Remark 1.

By the definition of the best predictor of each component X i of X, P t (X) is the unique random vector of the form

$$ \displaystyle{ P_{t}(\mathbf{X}) = A_{0}\mathbf{Y}_{0} + \cdots + A_{t}\mathbf{Y}_{t} } $$

with v × w matrices A 0, , A t such that

$$ \displaystyle{ [\mathbf{X} - P_{t}(\mathbf{X})] \perp \mathbf{Y}_{s},\quad s = 0,\ldots,t } $$

[cf. (8.5.2) and (8.5.3)]. Recall that two random vectors X and Y are orthogonal (written X ⊥ Y) if E(XY′) is a matrix of zeros. □ 

Remark 2.

If all the components of \( \mathbf{X},\mathbf{Y}_{1},\ldots,\mathbf{Y}_{t} \) are jointly normally distributed and Y 0 = (1, , 1)′, then

$$ \displaystyle{ P_{t}(\mathbf{X}) = E(\mathbf{X}\vert \mathbf{Y}_{1},\ldots,\mathbf{Y}_{t}),\quad t \geq 1.\mbox{ $\square $} } $$

Remark 3.

P t is linear in the sense that if A is any k × v matrix and X, V are two v-variate random vectors with finite second moments, then (Problem 9.10)

$$ \displaystyle{ P_{t}(A\mathbf{X}) = AP_{t}(\mathbf{X}) } $$

and

$$ \displaystyle{ P_{t}(\mathbf{X} + \mathbf{V}) = P_{t}(\mathbf{X}) + P_{t}(\mathbf{V}). } $$

 □ 

Remark 4.

If X and Y are random vectors with v and w components, respectively, each with finite second moments, then

$$ \displaystyle{ P(\mathbf{X}\vert \mathbf{Y}) = M\mathbf{Y}, } $$

where M is a v × w matrix, M = E(XY′)[E(YY′)]−1 with [E(YY′)]−1 any generalized inverse of E(YY′). (A generalized inverse of a matrix S is a matrix S −1 such that SS −1 S = S. Every matrix has at least one. See Problem 9.11.)

In the notation just developed, the prediction, filtering, and smoothing problems (a), (b), and (c) formulated above reduce to the determination of P t−1(X t ), P t (X t ), and P n (X t ) (n > t), respectively. We deal first with the prediction problem. □ 

Kalman Prediction:

For the state-space model (9.1.1)–(9.1.2), the one-step predictors \( \hat{\mathbf{X}_{t}}:= P_{t-1}(\mathbf{X}_{t}) \) and their error covariance matrices \( \Omega _{t} = E{\bigl [{\bigl (\mathbf{X}_{t} -\hat{\mathbf{X}}_{t}\bigr )}{\bigl (\mathbf{X}_{t} -\hat{\mathbf{X}}_{t}\bigr )}'\bigr ]} \) are uniquely determined by the initial conditions

$$ \displaystyle{ \hat{\mathbf{X}}_{1} = P(\mathbf{X}_{1}\vert \mathbf{Y}_{0}),\quad \Omega _{1} = E{\bigl [{\bigl (\mathbf{X}_{1} -\hat{\mathbf{X}}_{1}\bigr )}{\bigl (\mathbf{X}_{1} -\hat{\mathbf{X}}_{1}\bigr )}'\bigr ]} } $$

and the recursions, for t = 1, 2, …, 

$$ \displaystyle{ \hat{\mathbf{X}}_{t+1} = F_{t}\hat{\mathbf{X}}_{t} +\varTheta _{t}\Delta _{t}^{-1}\left (\mathbf{Y}_{ t} - G_{t}\hat{\mathbf{X}}_{t}\right ), } $$
(9.4.1)
$$ \displaystyle{ \Omega _{t+1} = F_{t}\Omega _{t}F_{t}' + Q_{t} -\varTheta _{t}\Delta _{t}^{-1}\varTheta _{ t}', } $$
(9.4.2)

where

$$ \displaystyle{\Delta _{t} = G_{t}\Omega _{t}G_{t}' + R_{t},} $$
$$ \displaystyle{\varTheta _{t} = F_{t}\Omega _{t}G_{t}',} $$

and Δ t −1 is any generalized inverse of Δ t .

Proof.

We shall make use of the innovations I t defined by I 0 = Y 0 and

$$ \displaystyle{\mathbf{I}_{t} = \mathbf{Y}_{t} - P_{t-1}\mathbf{Y}_{t} = \mathbf{Y}_{t} - G_{t}\hat{\mathbf{X}}_{t} = G_{t}\left (\mathbf{X}_{t} -\hat{\mathbf{X}}_{t}\right ) + \mathbf{W}_{t},\quad t = 1,2,\ldots.} $$

The sequence {I t } is orthogonal by Remark 1. Using Remarks 3 and 4 and the relation

$$ \displaystyle{ P_{t}(\cdot ) = P_{t-1}(\cdot ) + P(\cdot \vert \mathbf{I}_{t}) } $$
(9.4.3)

(see Problem 9.12), we find that

$$ \displaystyle\begin{array}{rcl} \hat{\mathbf{X}}_{t+1}& =& P_{t-1}(\mathbf{X}_{t+1}) + P(\mathbf{X}_{t+1}\vert \mathbf{I}_{t}) = P_{t-1}(F_{t}\mathbf{X}_{t} + \mathbf{V}_{t}) +\varTheta _{t}\Delta _{t}^{-1}\mathbf{I}_{ t} \\ & =& F_{t}\hat{\mathbf{X}}_{t} +\varTheta _{t}\Delta _{t}^{-1}\mathbf{I}_{ t}, {}\end{array} $$
(9.4.4)

where

$$ \displaystyle{\Delta _{t} = E(\mathbf{I}_{t}\;\mathbf{I}_{t}') = G_{t}\Omega _{t}G_{t}' + R_{t},} $$
$$ \displaystyle{\varTheta _{t} = E(\mathbf{X}_{t+1}\mathbf{I}_{t}') = E\left [{\bigl (F_{t}\mathbf{X}_{t} + \mathbf{V}_{t}\bigr )}\left (\left [\mathbf{X}_{t} -\hat{\mathbf{X}}_{t}\right ]'G_{t}' + \mathbf{W}_{t}'\right )\right ] = F_{t}\Omega _{t}G_{t}'.} $$

To verify (9.4.2), we observe from the definition of Ω t+1 that

$$ \displaystyle{ \Omega _{t+1} = E\left (\mathbf{X}_{t+1}\mathbf{X}_{t+1}'\right ) - E\left (\hat{\mathbf{X}}_{t+1}\hat{\mathbf{X}}_{t+1}'\right ). } $$

With (9.1.2) and (9.4.4) this gives

$$ \displaystyle{\Omega _{t+1} = F_{t}E(\mathbf{X}_{t}\mathbf{X}_{t}')F_{t}' + Q_{t} - F_{t}E\left (\hat{\mathbf{X}}_{t}\hat{\mathbf{X}}_{t}'\right )F_{t}' -\varTheta _{t}\Delta _{t}^{-1}\varTheta _{ t}'} $$
$$ \displaystyle{\qquad \,\, = F_{t}\Omega _{t}F_{t}' + Q_{t} -\varTheta _{t}\Delta _{t}^{-1}\varTheta _{ t}'.\blacksquare } $$
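
In practice the prediction recursions translate directly into a few lines of code. The following is a minimal NumPy sketch (ours, not from the text) of (9.4.1)–(9.4.2) for constant F, G, Q, R, using the Moore–Penrose pseudoinverse as the generalized inverse Δ t −1; all names are our own.

```python
import numpy as np

def kalman_predict(ys, F, G, Q, R, x1, Omega1):
    """One-step predictors X_hat_t and error covariances Omega_t, t = 1,...,n+1,
    from the recursions (9.4.1)-(9.4.2), for constant F, G, Q, R."""
    x = np.asarray(x1, dtype=float)
    Omega = np.asarray(Omega1, dtype=float)
    xs, Omegas = [x.copy()], [Omega.copy()]
    for y in ys:
        Delta = G @ Omega @ G.T + R                    # Delta_t = G Omega_t G' + R
        Theta = F @ Omega @ G.T                        # Theta_t = F Omega_t G'
        K = Theta @ np.linalg.pinv(Delta)              # pseudoinverse as a generalized inverse
        x = F @ x + K @ (np.atleast_1d(y) - G @ x)     # (9.4.1)
        Omega = F @ Omega @ F.T + Q - K @ Theta.T      # (9.4.2): K Theta' = Theta Delta^{-1} Theta'
        xs.append(x)
        Omegas.append(Omega.copy())
    return np.array(xs), np.array(Omegas)
```

For the random walk plus noise model of Example 9.2.1, for instance, one would call this with F = G = np.eye(1), Q = [[σ v 2]], and R = [[σ w 2]].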

9.4.1 h-Step Prediction of {Y t } Using the Kalman Recursions

The Kalman prediction equations lead to a very simple algorithm for recursive calculation of the best linear mean square predictors P t Y t+h , h = 1, 2, …. From (9.4.4), (9.1.1), (9.1.2), and Remark 3 in Section 9.1, we find that

$$ \displaystyle{ P_{t}\mathbf{X}_{t+1} = F_{t}P_{t-1}\mathbf{X}_{t} +\varTheta _{t}\Delta _{t}^{-1}(\mathbf{Y}_{ t} - P_{t-1}\mathbf{Y}_{t}), } $$
(9.4.5)
$$ \displaystyle{P_{t}\mathbf{X}_{t+h} = F_{t+h-1}P_{t}\mathbf{X}_{t+h-1}} $$
$$ \displaystyle{\vdots} $$
$$ \displaystyle{ = \left (F_{t+h-1}F_{t+h-2}\cdots F_{t+1}\right )P_{t}\mathbf{X}_{t+1},\quad h = 2,3,\ldots, } $$
(9.4.6)

and

$$ \displaystyle{ P_{t}\mathbf{Y}_{t+h} = G_{t+h}P_{t}\mathbf{X}_{t+h},\quad h = 1,2,\ldots. } $$
(9.4.7)

From the relation

$$ \displaystyle{\mathbf{X}_{t+h} - P_{t}\mathbf{X}_{t+h} = F_{t+h-1}(\mathbf{X}_{t+h-1} - P_{t}\mathbf{X}_{t+h-1}) + \mathbf{V}_{t+h-1},\quad h = 2,3,\ldots,} $$

we find that \( \Omega _{t}^{(h)}:= E[(\mathbf{X}_{t+h} - P_{t}\mathbf{X}_{t+h})(\mathbf{X}_{t+h} - P_{t}\mathbf{X}_{t+h})'] \) satisfies the recursions

$$ \displaystyle{ \Omega _{t}^{(h)} = F_{ t+h-1}\Omega _{t}^{(h-1)}F_{ t+h-1}' + Q_{t+h-1},\quad h = 2,3,\ldots, } $$
(9.4.8)

with Ω t (1) = Ω t+1. Then from (9.1.1) and (9.4.7), Δ t (h) := E[(Y t+h  − P t Y t+h )(Y t+h  − P t Y t+h )′] is given by

$$ \displaystyle{ \Delta _{t}^{(h)} = G_{ t+h}\Omega _{t}^{(h)}G_{ t+h}' + R_{t+h},\quad h = 1,2,\ldots. } $$
(9.4.9)
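
Continuing the sketch above (our own names), the h-step quantities (9.4.6)–(9.4.9) follow by iterating the state matrices, assuming constant F, G, Q, R.

```python
import numpy as np

def h_step_forecast(x_next, Omega_next, F, G, Q, R, h):
    """P_t Y_{t+h} and its error covariance Delta_t^{(h)} via (9.4.6)-(9.4.9),
    given P_t X_{t+1} and Omega_t^{(1)} = Omega_{t+1} (constant F, G, Q, R)."""
    x = np.asarray(x_next, dtype=float)
    Om = np.asarray(Omega_next, dtype=float)
    for _ in range(h - 1):
        x = F @ x                           # (9.4.6)
        Om = F @ Om @ F.T + Q               # (9.4.8)
    return G @ x, G @ Om @ G.T + R          # (9.4.7) and (9.4.9)
```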

Example 9.4.1.

Consider the random walk plus noise model of Example 9.2.1 defined by

$$ \displaystyle{ Y _{t} = X_{t} + W_{t},\quad \{W_{t}\} \sim \mathrm{WN}\left (0,\sigma _{w}^{2}\right ), } $$

where the local level X t follows the random walk

$$ \displaystyle{ X_{t+1} = X_{t} + V _{t},\quad \{V _{t}\} \sim \mathrm{WN}\left (0,\sigma _{v}^{2}\right ). } $$

Applying the Kalman prediction equations with Y 0: = 1, R = σ w 2, and Q = σ v 2, we obtain

$$ \displaystyle{\hat{Y }_{t+1} = P_{t}Y _{t+1} =\hat{ X}_{t} + \frac{\varTheta _{t}} {\Delta _{t}}\left (Y _{t} -\hat{ Y }_{t}\right )} $$
$$ \displaystyle{= (1 - a_{t})\hat{Y }_{t} + a_{t}Y _{t}} $$

where

$$ \displaystyle{a_{t} = \frac{\varTheta _{t}} {\Delta _{t}} = \frac{\Omega _{t}} {\Omega _{t} +\sigma _{ w}^{2}}.} $$

For a state-space model (like this one) with time-independent parameters, the solution of the Kalman recursions (9.4.2) is called a steady-state solution if Ω t is independent of t. If Ω t  = Ω for all t, then from (9.4.2)

$$ \displaystyle{ \Omega _{t+1} = \Omega = \Omega +\sigma _{ v}^{2} - \frac{\Omega ^{2}} {\Omega +\sigma _{ w}^{2}} = \frac{\Omega \sigma _{w}^{2}} {\Omega +\sigma _{ w}^{2}} +\sigma _{ v}^{2}. } $$

Solving this quadratic equation for Ω and noting that Ω ≥ 0, we find that

$$ \displaystyle{ \Omega = \frac{1} {2}\left (\sigma _{v}^{2} + \sqrt{\sigma _{ v}^{4} + 4\sigma _{v}^{2}\sigma _{w}^{2}}\right ) } $$

Since Ω t+1 − Ω t is a continuous function of Ω t on Ω t  ≥ 0, positive at Ω t  = 0, negative for large Ω t , and zero only at Ω t  = Ω, it is clear that Ω t+1 − Ω t is negative for Ω t  > Ω and positive for Ω t  < Ω. A similar argument shows (Problem 9.14) that (Ω t+1 − Ω)(Ω t  − Ω) ≥ 0 for all \( \Omega _{t} \geq 0 \). These observations imply that Ω t+1 always falls between Ω and Ω t . Consequently, regardless of the value of Ω 1, Ω t converges to Ω, the unique solution of Ω t+1 = Ω t . For any initial predictors \( \hat{Y }_{1} =\hat{ X}_{1} \) and any initial mean squared error \( \Omega _{1} = E{\bigl (X_{1} -\hat{ X}_{1}\bigr )}^{2} \), the coefficients \( a_{t}:= \Omega _{t}/\left (\Omega _{t} +\sigma _{ w}^{2}\right ) \) converge to

$$ \displaystyle{ a ={ \Omega \over \Omega +\sigma _{ w}^{2}}, } $$

and the mean squared errors of the predictors defined by

$$ \displaystyle{ \hat{Y }_{t+1} = (1 - a_{t})\hat{Y }_{t} + a_{t}Y _{t} } $$

converge to Ω +σ w 2.

If, as is often the case, we do not know Ω 1, then we cannot determine the sequence {a t }. It is natural, therefore, to consider the behavior of the predictors defined by

$$ \displaystyle{ \hat{Y }_{t+1} = (1 - a)\hat{Y }_{t} + aY _{t} } $$

with a as above and arbitrary \( \hat{Y }_{1} \). It can be shown (Problem 9.16) that this sequence of predictors is also asymptotically optimal in the sense that the mean squared error converges to Ω +σ w 2 as t → ∞.

As shown in Example 9.2.1, the differenced process D t  = Y t  − Y t−1 is the MA(1) process

$$ \displaystyle{ D_{t} = Z_{t} +\theta Z_{t-1},\ {\bigl \{Z_{t}\bigr \}} \sim \mathrm{WN}\left (0,\sigma ^{2}\right ), } $$

where \( \theta /\left (1 +\theta ^{2}\right ) = -\sigma _{w}^{2}/\left (2\sigma _{w}^{2} +\sigma _{ v}^{2}\right ) \). Solving this equation for θ (Problem 9.15), we find that

$$ \displaystyle{ \theta = - \dfrac{1} {2\sigma _{w}^{2}}\left (2\sigma _{w}^{2} +\sigma _{ v}^{2} -\sqrt{\sigma _{ v}^{4} + 4\sigma _{v}^{2}\sigma _{w}^{2}}\right ) } $$

and that θ = a − 1.
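
A quick numerical check (ours) of these formulas for the parameters used in Figure 9.1, σ v 2 = 4 and σ w 2 = 8, confirming Ω = 8, a = 0.5, and θ = a − 1 = −0.5:

```python
import numpy as np

sigma_v2, sigma_w2 = 4.0, 8.0
root = np.sqrt(sigma_v2 ** 2 + 4.0 * sigma_v2 * sigma_w2)
Omega = 0.5 * (sigma_v2 + root)                            # steady-state error variance
a = Omega / (Omega + sigma_w2)                             # steady-state coefficient
theta = -(2.0 * sigma_w2 + sigma_v2 - root) / (2.0 * sigma_w2)
print(Omega, a, theta)                                     # 8.0  0.5  -0.5, so theta = a - 1
```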

It is instructive to derive the exponential smoothing formula for \( \hat{Y }_{t} \) directly from the ARIMA(0,1,1) structure of {Y t }. For t ≥ 2, we have from Section 6.5 that

$$ \displaystyle{ \hat{Y }_{t+1} = Y _{t} +\theta _{t1}(Y _{t} -\hat{ Y }_{t}) = -\theta _{t1}\hat{Y }_{t} + (1 +\theta _{t1})Y _{t} } $$

for t ≥ 2, where θ t1 is found by application of the innovations algorithm to an MA(1) process with coefficient θ. It follows that 1 − a t  = −θ t1, and since θ t1 → θ (see Remark 1 of Section 3.3) and a t converges to the steady-state solution a, we conclude that

$$ \displaystyle{ 1 - a =\lim _{t\rightarrow \infty }(1 - a_{t}) = -\lim _{t\rightarrow \infty }\theta _{t1} = -\theta. } $$

Example 9.4.2.

The lognormal stochastic volatility model

We can rewrite the defining equations (7.4.2) and (7.4.3) of the lognormal SV process {Z t } in the following state-space form

$$ \displaystyle\begin{array}{rcl} X_{t} =\gamma _{1}X_{t-1} +\eta _{t},& &{}\end{array} $$
(9.4.10)

and

$$ \displaystyle\begin{array}{rcl} Y _{t} = X_{t} +\varepsilon _{t},& &{}\end{array} $$
(9.4.11)

where the (one-dimensional) state and observation vectors are

$$ \displaystyle\begin{array}{rcl} X_{t} =\ell _{t} - \frac{\gamma _{0}} {1 -\gamma _{1}},& &{}\end{array} $$
(9.4.12)

and

$$ \displaystyle\begin{array}{rcl} Y _{t} =\ln Z_{t}^{2} + 1.27 - \frac{\gamma _{0}} {2(1 -\gamma _{1})}& &{}\end{array} $$
(9.4.13)

respectively. The independent white-noise sequences {η t } and {ɛ t } have zero means and variances σ 2 and 4.93 respectively.

Taking

$$ \displaystyle\begin{array}{rcl} \hat{X}_{0} = EX_{0} = 0& &{}\end{array} $$
(9.4.14)

and

$$ \displaystyle\begin{array}{rcl} \hat{\Omega }_{0} = \mathrm{Var}(X_{0}) =\sigma ^{2}/(1 -\gamma _{ 1}^{2}),& &{}\end{array} $$
(9.4.15)

we can directly apply the Kalman prediction recursions (9.4.1), (9.4.2), (9.4.6), and (9.4.8) to compute recursively the best linear predictor of X t+h in terms of {Y s , s ≤ t}, or equivalently of the log volatility ℓ t+h in terms of the observations {lnZ s 2, s ≤ t}.

Kalman Filtering:

The filtered estimates X t | t  = P t (X t ) and their error covariance matrices Ω t | t  = E[(X t  − X t | t )(X t  − X t | t )′] are determined by the relations

$$ \displaystyle{ P_{t}\mathbf{X}_{t} = P_{t-1}\mathbf{X}_{t} + \Omega _{t}G_{t}'\Delta _{t}^{-1}\left (\mathbf{Y}_{ t} - G_{t}\hat{\mathbf{X}}_{t}\right ) } $$
(9.4.16)

and

$$ \displaystyle{ \Omega _{t\vert t} = \Omega _{t} - \Omega _{t}G_{t}'\Delta _{t}^{-1}G_{ t}\Omega _{t}'. } $$
(9.4.17)

Proof.

From (9.4.3) it follows that

$$ \displaystyle{ P_{t}\mathbf{X}_{t} = P_{t-1}\mathbf{X}_{t} + M\mathbf{I}_{t}, } $$

where

$$ \displaystyle{ M = E(\mathbf{X}_{t}\ \mathbf{I}_{t}')[E(\mathbf{I}_{t}\ \mathbf{I}_{t}')]^{-1} = E{\bigl [\mathbf{X}_{ t}(G_{t}(\mathbf{X}_{t} -\hat{\mathbf{X}}_{t}) + W_{t})'\bigr ]}\Delta _{t}^{-1} = \Omega _{ t}G_{t}'\Delta _{t}^{-1}. } $$
(9.4.18)

To establish (9.4.17) we write

$$ \displaystyle{ \mathbf{X}_{t} - P_{t-1}\mathbf{X}_{t} = \mathbf{X}_{t} - P_{t}\mathbf{X}_{t} + P_{t}\mathbf{X}_{t} - P_{t-1}\mathbf{X}_{t} = \mathbf{X}_{t} - P_{t}\mathbf{X}_{t} + M\mathbf{I}_{t}. } $$

Using (9.4.18) and the orthogonality of X t P t X t and M I t , we find from the last equation that

$$ \displaystyle{ \Omega _{t} = \Omega _{t\vert t} + \Omega _{t}G_{t}'\Delta _{t}^{-1}G_{ t}\Omega _{t}', } $$

as required.
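
As with prediction, the filtering update is a one-step correction of the predicted state. A minimal NumPy sketch (ours) of (9.4.16)–(9.4.17), taking the predicted quantities from the prediction recursions as inputs:

```python
import numpy as np

def kalman_filter_step(y, x_pred, Omega_pred, G, R):
    """Filtered estimate P_t X_t and covariance Omega_{t|t} from (9.4.16)-(9.4.17),
    given the one-step predictor X_hat_t and its error covariance Omega_t."""
    Delta = G @ Omega_pred @ G.T + R
    M = Omega_pred @ G.T @ np.linalg.pinv(Delta)           # Omega_t G_t' Delta_t^{-1}
    x_filt = x_pred + M @ (np.atleast_1d(y) - G @ x_pred)  # (9.4.16)
    Omega_filt = Omega_pred - M @ G @ Omega_pred           # (9.4.17)
    return x_filt, Omega_filt
```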

Kalman Fixed-Point Smoothing:

The smoothed estimates X t | n  = P n X t and the error covariance matrices Ω t | n  = E[(X t  − X t | n )(X t  − X t | n )′] are determined for fixed t by the following recursions, which can be solved successively for n = t, t + 1, …:

$$ \displaystyle{ P_{n}\mathbf{X}_{t} = P_{n-1}\mathbf{X}_{t} + \Omega _{t,n}G_{n}'\Delta _{n}^{-1}\left (\mathbf{Y}_{ n} - G_{n}\hat{\mathbf{X}}_{n}\right ), } $$
(9.4.19)
$$ \displaystyle{ \Omega _{t,n+1} = \Omega _{t,n}[F_{n} -\varTheta _{n}\Delta _{n}^{-1}G_{ n}]', } $$
(9.4.20)
$$ \displaystyle{ \Omega _{t\vert n} = \Omega _{t\vert n-1} - \Omega _{t,n}G_{n}'\Delta _{n}^{-1}G_{ n}\Omega _{t,n}', } $$
(9.4.21)

with initial conditions \( P_{t-1}\mathbf{X}_{t} =\hat{ \mathbf{X}}_{t} \) and Ω t, t  = Ω t | t−1 = Ω t (found from Kalman prediction).

Proof.

Using (9.4.3) we can write P n X t  = P n−1 X t + C I n , where \( \mathbf{I}_{n} = G_{n}{\bigl (\mathbf{X}_{n} -\hat{\mathbf{X}}_{n}\bigr )} + \mathbf{W}_{n} \). By Remark 4 above,

$$ \displaystyle{ C = E\left [\mathbf{X}_{t}\left (G_{n}(\mathbf{X}_{n} -\hat{\mathbf{X}}_{n}) + \mathbf{W}_{n}\right )'\right ]\left [E\left (\mathbf{I}_{n}\mathbf{I}_{n}'\right )\right ]^{-1} = \Omega _{ t,n}G_{n}'\Delta _{n}^{-1}, } $$
(9.4.22)

where \( \Omega _{t,n}:= E\big[{\bigl (\mathbf{X}_{t} -\hat{\mathbf{X}}_{t}\bigr )}{\bigl (\mathbf{X}_{n} -\hat{\mathbf{X}}_{n}\bigr )}'\big] \). It follows now from (9.1.2), (9.4.5), the orthogonality of V n and W n with \( \mathbf{X}_{t} -\hat{\mathbf{X}}_{t} \), and the definition of Ω t, n that

$$ \displaystyle{ \Omega _{t,n+1}=E\left [\left (\mathbf{X}_{t} -\hat{\mathbf{X}}_{t}\right )\left (\mathbf{X}_{n} -\hat{\mathbf{X}}_{n}\right )'\left (F_{n} -\varTheta _{n}\Delta _{n}^{-1}G_{ n}\right )'\right ]=\Omega _{t,n}\left [F_{n} -\varTheta _{n}\Delta _{n}^{-1}G_{ n}\right ]', } $$

thus establishing (9.4.20). To establish (9.4.21) we write

$$ \displaystyle{ \mathbf{X}_{t} - P_{n}\mathbf{X}_{t} = \mathbf{X}_{t} - P_{n-1}\mathbf{X}_{t} - C\mathbf{I}_{n}. } $$

Using (9.4.22) and the orthogonality of X t P n X t and I n , the last equation then gives

$$ \displaystyle{ \Omega _{t\vert n} = \Omega _{t\vert n-1} - \Omega _{t,n}G_{n}'\Delta _{n}^{-1}G_{ n}\Omega _{t,n}',\quad n = t,t + 1,\ldots, } $$

as required.

9.5 Estimation for State-Space Models

Consider the state-space model defined by equations (9.1.1) and (9.1.2) and suppose that the model is completely parameterized by the components of the vector \( \theta \). The maximum likelihood estimate of \( \theta \) is found by maximizing the likelihood of the observations Y 1, …, Y n with respect to the components of the vector \( \theta \). If the conditional probability density of Y t given Y t−1 = y t−1, …, Y 0 = y 0 is f t (⋅ | y t−1, …, y 0), then the likelihood of Y t , t = 1, …, n (conditional on Y 0), can immediately be written as

$$ \displaystyle{ L(\theta;\mathbf{Y}_{1},\ldots,\mathbf{Y}_{n}) =\prod _{ t=1}^{n}f_{ t}(\mathbf{Y}_{t}\vert \mathbf{Y}_{t-1},\ldots,\mathbf{Y}_{0}). } $$
(9.5.1)

The calculation of the likelihood for any fixed numerical value of \( \theta \) is extremely complicated in general, but is greatly simplified if Y 0, X 1 and W t , V t , t = 1, 2, , are assumed to be jointly Gaussian. The resulting likelihood is called the Gaussian likelihood and is widely used in time series analysis (cf. Section 5.2) whether the time series is truly Gaussian or not. As before, we shall continue to use the term likelihood to mean Gaussian likelihood.

If Y 0, X 1 and W t , V t , t = 1, 2, , are jointly Gaussian, then the conditional densities in (9.5.1) are given by

$$ \displaystyle{ f_{t}(\mathbf{Y}_{t}\vert \mathbf{Y}_{t-1},\ldots,\mathbf{Y}_{0}) = (2\pi )^{-w/2}\left (\det \Delta _{ t}\right )^{-1/2}\exp \left [-{1 \over 2}\mathbf{I}_{t}'\Delta _{t}^{-1}\mathbf{I}_{ t}\right ], } $$

where \( \mathbf{I}_{t}\,=\,\mathbf{Y}_{t} - P_{t-1}\mathbf{Y}_{t}\,=\,\mathbf{Y}_{t} - G\hat{\mathbf{X}_{t}} \), P t−1 Y t , and Δ t , t ≥ 1, are the one-step predictors and error covariance matrices found from the Kalman prediction recursions. The likelihood of the observations Y 1, , Y n (conditional on Y 0) can therefore be expressed as

$$ \displaystyle{ L(\theta;\mathbf{Y}_{1},\ldots,\mathbf{Y}_{n}) = (2\pi )^{-nw/2}\left (\prod _{ j=1}^{n}\det \Delta _{ j}\right )^{-1/2}\exp \left [-{1 \over 2}\sum _{j=1}^{n}\mathbf{I}_{ j}'\Delta _{j}^{-1}\mathbf{I}_{ j}\right ]. } $$
(9.5.2)

Given the observations Y 1, , Y n , the distribution of Y 0 (see Section 9.4), and a particular parameter value \( \theta \), the numerical value of the likelihood L can be computed from the previous equation with the aid of the Kalman recursions of Section 9.4. To find maximum likelihood estimates of the components of \( \theta \), a nonlinear optimization algorithm must be used to search for the value of \( \theta \) that maximizes the value of L.
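
For a univariate series the computation of the likelihood reduces to a single pass of the prediction recursions. The sketch below (ours, not from the text) accumulates −2 ln L of (9.5.2) for constant F, G, Q, R with w = 1; all names are assumptions.

```python
import numpy as np

def neg2_loglik(ys, F, G, Q, R, x1, Omega1):
    """-2 ln(Gaussian likelihood) of (9.5.2) for univariate observations (w = 1),
    accumulated along the Kalman prediction recursions with constant F, G, Q, R."""
    x = np.asarray(x1, dtype=float)
    Omega = np.asarray(Omega1, dtype=float)
    val = len(ys) * np.log(2.0 * np.pi)
    for y in ys:
        Delta = float(G @ Omega @ G.T + R)            # innovation variance Delta_t
        innov = float(y - G @ x)                      # innovation I_t = Y_t - G X_hat_t
        val += np.log(Delta) + innov**2 / Delta
        Theta = (F @ Omega @ G.T).ravel()             # Theta_t = F Omega_t G'
        x = F @ x + Theta * (innov / Delta)           # (9.4.1)
        Omega = F @ Omega @ F.T + Q - np.outer(Theta, Theta) / Delta   # (9.4.2)
    return val
```

Minimizing this function over the free parameters (for example with a general-purpose numerical optimizer) gives the Gaussian maximum likelihood estimates; this is only one possible way to organize the computation.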

Having estimated the parameter vector \( \theta \), we can compute forecasts based on the fitted state-space model and estimated mean squared errors by direct application of equations (9.4.7) and (9.4.9).

9.5.1 Application to Structural Models

The general structural model for a univariate time series {Y t } of which we gave examples in Section 9.2 has the form

$$ \displaystyle{ Y _{t} = G\mathbf{X}_{t} + W_{t},\quad \{W_{t}\} \sim \mathrm{WN}\left (0,\sigma _{w}^{2}\right ),\qquad \qquad \qquad \qquad \qquad } $$
(9.5.3)
$$ \displaystyle{ \mathbf{X}_{t+1} = F\mathbf{X}_{t} + \mathbf{V}_{t},\quad \{\mathbf{V}_{t}\} \sim \mathrm{WN}(0,Q),\qquad \qquad \qquad \qquad \qquad } $$
(9.5.4)

for t = 1, 2, , where F and G are assumed known. We set Y 0 = 1 in order to include constant terms in our predictors and complete the specification of the model by prescribing the mean and covariance matrix of the initial state X 1. A simple and convenient assumption is that X 1 is equal to a deterministic but unknown parameter \( \boldsymbol{\mu } \) and that \( \hat{\mathbf{X}}_{1} =\boldsymbol{\mu } \), so that Ω 1 = 0. The parameters of the model are then \( \boldsymbol{\mu } \), Q, and σ w 2.

Direct maximization of the likelihood (9.5.2) is difficult if the dimension of the state vector is large. The maximization can, however, be simplified by the following stepwise procedure. For fixed Q we find \( \hat{\boldsymbol{\mu }}(Q) \) and \( \hat{\sigma }_{w}^{2}(Q) \) that maximize the likelihood \( L\left (\boldsymbol{\mu },Q,\sigma _{w}^{2}\right ) \). We then maximize the “reduced likelihood” \( L\left (\hat{\boldsymbol{\mu }}(Q),Q,\hat{\sigma }_{w}^{2}(Q)\right ) \) with respect to Q.

To achieve this we define the mean-corrected state vectors, \( \mathbf{X}_{t}^{{\ast}} = \mathbf{X}_{t} - F^{t-1}\boldsymbol{\mu } \), and apply the Kalman prediction recursions to \( \{\mathbf{X}_{t}^{{\ast}}\} \) with initial condition \( \mathbf{X}_{1}^{{\ast}} = \mathbf{0} \). This gives, from (9.4.1),

$$ \displaystyle{ \hat{\mathbf{X}}_{t+1}^{{\ast}} = F\hat{\mathbf{X}}_{ t}^{{\ast}} +\varTheta _{ t}\Delta _{t}^{-1}\left (Y _{ t} - G\hat{\mathbf{X}}_{t}^{{\ast}}\right ),\quad t = 1,2,\ldots, } $$
(9.5.5)

with \( \hat{\mathbf{X}}_{1}^{{\ast}} = \mathbf{0} \). Since \( \hat{\mathbf{X}}_{t} \) also satisfies (9.5.5), but with initial condition \( \hat{\mathbf{X}}_{1} =\boldsymbol{\mu } \), it follows that

$$ \displaystyle{ \hat{\mathbf{X}}_{t} =\hat{ \mathbf{X}}_{t}^{{\ast}} + C_{ t}\boldsymbol{\mu } } $$
(9.5.6)

for some v × v matrices C t . (Note that although \( \hat{\mathbf{X}}_{t} = P(\mathbf{X}_{t}\vert Y _{0},Y _{1},\ldots,Y _{t-1}) \), the quantity \( \hat{\mathbf{X}}_{t}^{{\ast}} \) is not the corresponding predictor of \( \mathbf{X}_{t}^{{\ast}} \).) The matrices C t can be determined recursively from (9.5.5), (9.5.6), and (9.4.1). Substituting (9.5.6) into (9.5.5) and using (9.4.1), we have

$$ \displaystyle\begin{array}{rcl} \hat{\mathbf{X}}_{t+1}^{{\ast}}& =& F\left (\hat{\mathbf{X}}_{ t} - C_{t}\boldsymbol{\mu }\right ) +\varTheta _{t}\Delta _{t}^{-1}\left (Y _{ t} - G\left (\hat{\mathbf{X}}_{t} - C_{t}\boldsymbol{\mu }\right )\right ) {}\\ & =& F\hat{\mathbf{X}}_{t} +\varTheta _{t}\Delta _{t}^{-1}\left (Y _{ t} - G\hat{\mathbf{X}}_{t}\right ) -\left (F -\varTheta _{t}\Delta _{t}^{-1}G\right )C_{ t}\boldsymbol{\mu } {}\\ & =& \hat{\mathbf{X}}_{t+1} -\left (F -\varTheta _{t}\Delta _{t}^{-1}G\right )C_{ t}\boldsymbol{\mu }, {}\\ \end{array} $$

so that

$$ \displaystyle{ C_{t+1} = \left (F -\varTheta _{t}\Delta _{t}^{-1}G\right )C_{ t} } $$
(9.5.7)

with C 1 equal to the identity matrix. The quadratic form in the likelihood (9.5.2) is therefore

$$ \displaystyle{ S(\boldsymbol{\mu },Q,\sigma _{w}^{2}) =\sum _{ t=1}^{n}\frac{\left (Y _{t} - G\hat{\mathbf{X}}_{t}\right )^{2}} {\Delta _{t}} } $$
(9.5.8)
$$ \displaystyle{ =\sum _{ t=1}^{n}\frac{\left (Y _{t} - G\hat{\mathbf{X}}_{t}^{{\ast}}- GC_{ t}\boldsymbol{\mu }\right )^{2}} {\Delta _{t}}. } $$
(9.5.9)

Now let \( Q^{{\ast}}:=\sigma _{w}^{-2}Q \) and define \( L^{{\ast}} \) to be the likelihood function with this new parameterization, i.e., \( L^{{\ast}}\left (\boldsymbol{\mu },Q^{{\ast}},\sigma _{w}^{2}\right ) = L\left (\boldsymbol{\mu },\sigma _{w}^{2}Q^{{\ast}},\sigma _{w}^{2}\right ) \). Writing \( \Delta _{t}^{{\ast}} =\sigma _{w}^{-2}\Delta _{t} \) and \( \Omega _{t}^{{\ast}} =\sigma _{w}^{-2}\Omega _{t} \), we see that the predictors \( \hat{\mathbf{X}}_{t}^{{\ast}} \) and the matrices C t in (9.5.7) depend on the parameters only through Q ∗. Thus,

$$ \displaystyle{ S\left (\boldsymbol{\mu },Q,\sigma _{w}^{2}\right ) =\sigma _{ w}^{-2}S\left (\boldsymbol{\mu },Q^{{\ast}},1\right ), } $$

so that

$$ \displaystyle\begin{array}{rcl} -2\ln L^{{\ast}}\left (\boldsymbol{\mu },Q^{{\ast}},\sigma _{ w}^{2}\right )& =& n\ln (2\pi ) +\sum _{ t=1}^{n}\ln \Delta _{ t} +\sigma _{ w}^{-2}S\left (\boldsymbol{\mu },Q^{{\ast}},1\right ) {}\\ & =& n\ln (2\pi ) +\sum _{ t=1}^{n}\ln \Delta _{ t}^{{\ast}} + n\ln \sigma _{ w}^{2} +\sigma _{ w}^{-2}S\left (\boldsymbol{\mu },Q^{{\ast}},1\right ). {}\\ \end{array} $$

For Q ∗ fixed, it is easy to show (see Problem 9.18) that this function is minimized when

$$ \displaystyle{ \hat{\boldsymbol{\mu }}=\hat{\boldsymbol{\mu }} \left (Q^{{\ast}}\right ) = \left [\sum _{ t=1}^{n}\frac{C_{t}'G'GC_{t}} {\Delta _{t}^{{\ast}}} \right ]^{-1}\sum _{ t=1}^{n}\frac{C_{t}'G'\left (Y _{t} - G\hat{\mathbf{X}}_{t}^{{\ast}}\right )} {\Delta _{t}^{{\ast}}} } $$
(9.5.10)

and

$$ \displaystyle\begin{array}{rcl} \hat{\sigma }_{w}^{2} =\hat{\sigma }_{ w}^{2}\left (Q^{{\ast}}\right ) = n^{-1}\sum _{ t=1}^{n}\frac{\left (Y _{t} - G\hat{\mathbf{X}}_{t}^{{\ast}}- GC_{ t}\hat{\boldsymbol{\mu }}\right )^{2}} {\Delta _{t}^{{\ast}}}.& &{}\end{array} $$
(9.5.11)

Replacing \( \boldsymbol{\mu } \) and σ w 2 by these values in \( -2\ln L^{{\ast}} \) and ignoring constants, the reduced likelihood becomes

$$ \displaystyle{ \ell\left (Q^{{\ast}}\right ) =\ln \left (n^{-1}\sum _{ t=1}^{n}\frac{\big(Y _{t} - G\hat{\mathbf{X}}_{t}^{{\ast}}- GC_{ t}\hat{\boldsymbol{\mu }}\big)^{2}} {\Delta _{t}^{{\ast}}} \right ) + n^{-1}\sum _{ t=1}^{n}\ln \left (\det \Delta _{ t}^{{\ast}}\right ). } $$
(9.5.12)

If \( \hat{Q}^{{\ast}} \) denotes the minimizer of (9.5.12), then the maximum likelihood estimators of the parameters \( \boldsymbol{\mu },Q,\sigma _{w}^{2} \) are \( \hat{\boldsymbol{\mu }},\hat{\sigma }_{w}^{2}\hat{Q}^{{\ast}},\hat{\sigma }_{w}^{2} \), where \( \hat{\boldsymbol{\mu }} \) and \( \hat{\sigma }_{w}^{2} \) are computed from (9.5.10) and (9.5.11) with Q ∗ replaced by \( \hat{Q}^{{\ast}} \).

We can now summarize the steps required for computing the maximum likelihood estimators of \( \boldsymbol{\mu } \), Q, and σ w 2 for the model (9.5.3)–(9.5.4).

  1. For a fixed Q ∗, apply the Kalman prediction recursions with \( \hat{\mathbf{X}}_{1}^{{\ast}} = \mathbf{0} \), Ω 1 = 0, Q = Q ∗, and σ w 2 = 1 to obtain the predictors \( \hat{\mathbf{X}}_{t}^{{\ast}} \). Let Δ t ∗ denote the one-step prediction error variances produced by these recursions.

  2. Set \( \hat{\boldsymbol{\mu }}=\hat{\boldsymbol{\mu }} (Q^{{\ast}}) = \left [\sum _{t=1}^{n}C_{t}'G'GC_{t}/\Delta _{t}^{{\ast}}\right ]^{-1}\sum _{t=1}^{n}C_{t}'G'(Y _{t} - G\hat{\mathbf{X}}_{t}^{{\ast}})/\Delta _{t}^{{\ast}} \).

  3. Let \( \hat{Q}^{{\ast}} \) be the minimizer of (9.5.12).

  4. The maximum likelihood estimators of \( \boldsymbol{\mu } \), Q, and σ w 2 are then given by \( \hat{\boldsymbol{\mu }},\hat{\sigma }_{w}^{2}\hat{Q}^{{\ast}} \), and \( \hat{\sigma }_{w}^{2} \), respectively, where \( \hat{\boldsymbol{\mu }} \) and \( \hat{\sigma }_{w}^{2} \) are found from (9.5.10) and (9.5.11) evaluated at \( \hat{Q}^{{\ast}} \).

Example 9.5.1.

Random Walk Plus Noise Model

In Example 9.2.1, 100 observations were generated from the structural model

$$ \displaystyle\begin{array}{rcl} Y _{t}& =& M_{t} + W_{t},\quad \{W_{t}\} \sim \mathrm{WN}\left (0,\sigma _{w}^{2}\right ),\qquad \qquad \qquad \qquad \quad {}\\ M_{t+1}& =& M_{t} + V _{t},\quad \{V _{t}\} \sim \mathrm{WN}\left (0,\sigma _{v}^{2}\right ),\qquad \qquad \qquad \qquad \qquad \quad {}\\ \end{array} $$

with initial values μ = M 1 = 0, σ w 2 = 8, and σ v 2 = 4. The maximum likelihood estimates of the parameters are found by first minimizing (9.5.12) with \( \hat{\mu } \) given by (9.5.10). Substituting these values into (9.5.11) gives \( \hat{\sigma }_{w}^{2} \). The resulting estimates are \( \hat{\mu }= 0.906, \) \( \hat{\sigma }_{v}^{2} = 5.351 \), and \( \hat{\sigma }_{w}^{2} = 8.233 \), which are in reasonably close agreement with the true values.

Example 9.5.2.

International Airline Passengers, 1949–1960; AIRPASS.TSM

The monthly totals of international airline passengers from January 1949 to December 1960 (Box and Jenkins 1976) are displayed in Figure 9.3. The data exhibit both a strong seasonal pattern and a nearly linear trend. Since the variability of the data Y 1, , Y 144 increases for larger values of Y t , it may be appropriate to consider a logarithmic transformation of the data. For the purpose of this illustration, however, we will fit a structural model incorporating a randomly varying trend and seasonal and noise components (see Example 9.2.3) to the raw data. This model has the form

$$ \displaystyle\begin{array}{rcl} Y _{t}& =& G\mathbf{X}_{t} + W_{t},\quad \{W_{t}\} \sim \mathrm{WN}\left (0,\,\sigma _{w}^{2}\right ), {}\\ \mathbf{X}_{t+1}& =& F\mathbf{X}_{t} + \mathbf{V}_{t},\quad \{\mathbf{V}_{t}\} \sim \mathrm{WN}(0,\,Q), {}\\ \end{array} $$

where X t is a 13-dimensional state-vector,

$$ \displaystyle\begin{array}{rcl} F& =& \left [\begin{array}{*{10}c} 1 & 1 & 0 & 0 & \cdots & 0 & 0\\ 0 & 1 & 0 & 0 & \cdots & 0 & 0 \\ 0 & 0 & -1 & -1 & \cdots & -1 & -1\\ 0 & 0 & 1 & 0 & \cdots & 0 & 0 \\ 0 & 0 & 0 & 1 & \cdots & 0 & 0\\ \vdots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots\\ 0 & 0 & 0 & 0 & \cdots & 1 & 0 \end{array} \right ], {}\\ G& =& \,\left [\begin{array}{*{10}c} 1\ &0\ &1\ &0\ &\ \cdots &\ 0 \end{array} \right ],{}\\ \end{array} $$

and

$$ \displaystyle{ Q = \left [\begin{array}{*{10}c} \sigma _{1}^{2} & 0 & 0 & 0 & \cdots & 0 \\ 0 & \sigma _{2}^{2} & 0 & 0 & \cdots & 0 \\ 0 & 0 & \sigma _{3}^{2} & 0 & \cdots & 0 \\ 0 & 0 & 0 & 0 & \cdots & 0\\ \vdots & \vdots & \vdots & \vdots & \ddots & \vdots\\ 0 & 0 & 0 & 0 & \cdots & 0 \end{array} \right ]. } $$

The parameters of the model are \( \boldsymbol{\mu },\sigma _{1}^{2},\sigma _{2}^{2},\sigma _{3}^{2} \), and σ w 2, where \( \boldsymbol{\mu }= \mathbf{X}_{1} \). Minimizing (9.5.12) with respect to Q we find from (9.5.11) and (9.5.12) that

$$ \displaystyle{\left (\hat{\sigma }_{1}^{2},\hat{\sigma }_{ 2}^{2},\hat{\sigma }_{ 3}^{2},\hat{\sigma }_{ w}^{2}\right ) = (170.63,.00000,11.338,.014179)} $$

and from (9.5.10) that \( \hat{\boldsymbol{\mu }}=\, \)(146.9, 2.171, −34.92, −34.12, −47.00, −16.98, 22.99, 53.99, 58.34, 33.65, 2.204, −4.053, −6.894)′. The first component, X t1, of the state vector corresponds to the local linear trend with slope X t2. Since \( \hat{\sigma }_{2}^{2} = 0 \), the slope at time t, which satisfies

$$ \displaystyle{ X_{t2} = X_{t-1,2} + V _{t2}, } $$

must be nearly constant and equal to \( \hat{X}_{12} = 2.171 \). The first three components of the predictors \( \hat{\mathbf{X}}_{t} \) are plotted in Figure 9.4. Notice that the first component varies like a random walk around a straight line, while the second component is nearly constant as a result of \( \hat{\sigma }_{2}^{2} \approx 0 \). The third component, corresponding to the seasonal component, exhibits a clear seasonal cycle that repeats roughly the same pattern throughout the 12 years of data. The one-step predictors \( \hat{X}_{t1} +\hat{ X}_{t3} \) of Y t are plotted in Figure 9.5 (solid line) together with the actual data (square boxes). For this model the predictors follow the movement of the data quite well.

Fig. 9.3

International airline passengers; monthly totals from January 1949 to December 1960

Fig. 9.4

The one-step predictors \( \left (\hat{X}_{t1},\hat{X}_{t2},\hat{X}_{t3}\right )' \) for the airline passenger data in Example 9.5.2

Fig. 9.5

The one-step predictors \( \hat{Y }_{t} \) for the airline passenger data (solid line) and the actual data (square boxes)

9.6 State-Space Models with Missing Observations

State-space representations and the associated Kalman recursions are ideally suited to the analysis of data with missing values, as was pointed out by Jones (1980) in the context of maximum likelihood estimation for ARMA processes. In this section we shall deal with two missing-value problems for state-space models. The first is the evaluation of the (Gaussian) likelihood based on \( \{\mathbf{Y}_{i_{1}},\ldots,\mathbf{Y}_{i_{r}}\} \), where i 1, i 2, …, i r are positive integers such that 1 ≤ i 1 < i 2 < ⋯ < i r  ≤ n. (This allows for observation of the process {Y t } at irregular intervals, or equivalently for the possibility that (n − r) observations are missing from the sequence {Y 1, …, Y n }.) The solution of this problem will, in particular, enable us to carry out maximum likelihood estimation for ARMA and ARIMA processes with missing values. The second problem to be considered is the minimum mean squared error estimation of the missing values themselves.

9.6.1 The Gaussian Likelihood of \( \{\mathbf{Y}_{i_{1}},\ldots,\mathbf{Y}_{i_{r}}\}, \) 1 ≤ i 1 < i 2 < ⋯ < i r  ≤ n

Consider the state-space model defined by equations (9.1.1) and (9.1.2) and suppose that the model is completely parameterized by the components of the vector \( \theta \). If there are no missing observations, i.e., if r = n and i j  = j, j = 1, , n, then the likelihood of the observations {Y 1, , Y n } is easily found as in Section 9.5 to be

$$ \displaystyle{ L(\theta;\mathbf{Y}_{1},\ldots,\mathbf{Y}_{n}) = (2\pi )^{-nw/2}\left (\prod _{ j=1}^{n}\det \Delta _{ j}\right )^{-1/2}\exp \left [-{1 \over 2}\sum _{j=1}^{n}\mathbf{I}_{ j}'\Delta _{j}^{-1}\mathbf{I}_{ j}\right ], } $$

where I j  = Y j  − P j−1 Y j , and P j−1 Y j and Δ j , j ≥ 1, are the one-step predictors and error covariance matrices found from (9.4.7) and (9.4.9) with Y 0 = 1.

To deal with the more general case of possibly irregularly spaced observations \( \{\mathbf{Y}_{i_{1}},\ldots,\mathbf{Y}_{i_{r}}\} \), we introduce a new series {Y t *}, related to the process {X t } by the modified observation equation

$$ \displaystyle{ \mathbf{Y}_{t}^{{\ast}} = G_{ t}^{{\ast}}\mathbf{X}_{ t} + \mathbf{W}_{t}^{{\ast}},\quad t = 1,2,\ldots, } $$
(9.6.1)

where

$$ \displaystyle{ G_{t}^{{\ast}} = \left \{\begin{array}{@{}l@{\quad }l@{}} G_{t}\quad &\mbox{ if }t \in \{ i_{1},\ldots,i_{r}\}, \\ 0 \quad &\mbox{ otherwise}, \end{array} \right.\quad \mathbf{W}_{t}^{{\ast}} = \left \{\begin{array}{@{}l@{\quad }l@{}} \mathbf{W}_{t}\quad &\mbox{ if }t \in \{ i_{1},\ldots,i_{r}\}, \\ \mathbf{N}_{t} \quad &\mbox{ otherwise}, \end{array} \right. } $$
(9.6.2)

and {N t } is iid with

$$ \displaystyle{ \mathbf{N}_{t} \sim \mathrm{N}(\mathbf{0},I_{w\,\times \,w}),\quad \mathbf{N}_{s} \perp \mathbf{X}_{1},\quad \mathbf{N}_{s} \perp \left [\begin{array}{*{10}c} \mathbf{V}_{t} \\ \mathbf{W}_{t} \end{array} \right ],\quad s,t = 0,\pm 1,\ldots. } $$
(9.6.3)

Equations (9.6.1) and (9.1.2) constitute a state-space representation for the new series {Y t *}, which coincides with {Y t } at each \( t \in \{ i_{1},i_{2},\ldots,i_{r}\} \), and at other times takes random values that are independent of {Y t } with a distribution independent of \( \theta \).

Let \( L_{1}\left (\theta;\,\mathbf{y}_{i_{1}},\ldots,\mathbf{y}_{i_{r}}\right ) \) be the Gaussian likelihood based on the observed values \( \mathbf{y}_{i_{1}},\ldots,\mathbf{y}_{i_{r}} \) of \( \mathbf{Y}_{i_{1}},\ldots,\mathbf{Y}_{i_{r}} \) under the model defined by (9.1.1) and (9.1.2). Corresponding to these observed values, we define a new sequence, y 1 *, …, y n *, by

$$ \displaystyle{ \mathbf{y}_{t}^{{\ast}} = \left \{\begin{array}{@{}l@{\quad }l@{}} \mathbf{y}_{t}\quad &\mbox{ if }t \in \{ i_{1},\ldots,i_{r}\}, \\ \mathbf{0} \quad &\mbox{ otherwise}. \end{array} \right. } $$
(9.6.4)

Then it is clear from the preceding paragraph that

$$ \displaystyle{ L_{1}\left (\theta;\mathbf{y}_{i_{1}},\ldots,\mathbf{y}_{i_{r}}\right ) = (2\pi )^{(n-r)w/2}L_{ 2}\left (\theta;\mathbf{y}_{1}^{{\ast}},\ldots,\mathbf{y}_{ n}^{{\ast}}\right ), } $$
(9.6.5)

where L 2 denotes the Gaussian likelihood under the model defined by (9.6.1) and (9.1.2).

In view of (9.6.5) we can now compute the required likelihood L 1 of the realized values {y t , t = i 1, , i r } as follows:

  i.

    Define the sequence {y t , t = 1, , n} as in (9.6.4).

  ii.

    Find the one-step predictors \( \hat{\mathbf{Y}}_{t}^{{\ast}} \) of Y t *, and their error covariance matrices Δ t *, using Kalman prediction and equations (9.4.7) and (9.4.9) applied to the state-space representation (9.6.1) and (9.1.2) of {Y t *}. Denote the realized values of the predictors, based on the observation sequence \( \left \{\mathbf{y}_{t}^{{\ast}}\right \} \), by \( \left \{\hat{\mathbf{y}}_{t}^{{\ast}}\right \} \).

  iii.

    The required Gaussian likelihood of the irregularly spaced observations \( \{\mathbf{y}_{i_{1}},\ldots,\mathbf{y}_{i_{r}}\} \) is then, by (9.6.5),

    $$ \displaystyle{L_{1 } (\theta; \mathbf{y}_{i_{1}},\ldots,\mathbf{y}_{i_{r}}) = (2\pi )^{-rw/2}\left (\prod _{ j=1}^{n}\det \Delta _{ j}^{{\ast}}\right )^{-1/2}\exp \left \{-\frac{1} {2}\sum _{j=1}^{n}\mathbf{i}_{ j}^{{\ast}}{}'\Delta _{ j}^{{\ast}-1}\mathbf{i}_{ j}^{{\ast}}\right \},} $$

    where i j * denotes the observed innovation \( \mathbf{y}_{j}^{{\ast}}-\hat{\mathbf{y}}_{j}^{{\ast}} \), j = 1, …, n.

Example 9.6.1.

An AR(1) Series with One Missing Observation

Let {Y t } be the causal AR(1) process defined by

$$ \displaystyle{ Y _{t} -\phi Y _{t-1} = Z_{t},\quad \ \{Z_{t}\} \sim \mathrm{WN}\left (0,\sigma ^{2}\right ). } $$

To find the Gaussian likelihood of the observations y 1, y 3, y 4, and y 5 of Y 1, Y 3, Y 4, and Y 5 we follow the steps outlined above.

  i.

    Set y i * = y i ,  i = 1, 3, 4, 5, and y 2 * = 0.

  ii.

    We start with the state-space model for {Y t } from Example 9.1.1, i.e., Y t  = X t ,  X t+1 = ϕ X t + Z t+1. The corresponding model for {Y t *} is then, from (9.6.1),

    $$ \displaystyle{ Y _{t}^{{\ast}} = G_{ t}^{{\ast}}X_{ t} + W_{t}^{{\ast}},\ t = 1,2,\ldots, } $$

    where

    $$ \displaystyle\begin{array}{rcl} X_{t+1}& =& F_{t}X_{t} + V _{t},\ t = 1,2,\ldots, {}\\ F_{t}& =& \phi,\quad G_{t}^{{\ast}} = \left \{\begin{array}{@{}l@{\quad }l@{}} 1\quad &\mbox{ if }t\neq 2, \\ 0\quad &\mbox{ if }t = 2, \end{array} \right.\quad V _{t} = Z_{t+1},\quad W_{t}^{{\ast}} = \left \{\begin{array}{@{}l@{\quad }l@{}} 0 \quad &\mbox{ if }t\neq 2, \\ N_{t}\quad &\mbox{ if }t = 2, \end{array} \right. {}\\ Q_{t}& =& \sigma ^{2},\qquad R_{ t}^{{\ast}} = \left \{\begin{array}{@{}l@{\quad }l@{}} 0\quad &\mbox{ if }t\neq 2, \\ 1\quad &\mbox{ if }t = 2, \end{array} \right.\qquad S_{t}^{{\ast}} = 0, {}\\ \end{array} $$

    and \( X_{1} =\sum _{j=0}^{\infty }\phi ^{j}Z_{1-j} \). Starting from the initial conditions

    $$ \displaystyle{ \hat{X}_{1} = 0,\qquad \Omega _{1} =\sigma ^{2}/\left (1 -\phi ^{2}\right ), } $$

    and applying the recursions (9.4.1) and (9.4.2), we find (Problem 9.19) that

    $$ \displaystyle{ \varTheta _{t}\Delta _{t}^{-1} = \left \{\begin{array}{@{}l@{\quad }l@{}} \phi \quad &\mbox{ if }t = 1,3,4,5, \\ 0\quad &\mbox{ if }t = 2, \end{array} \right.\quad \Omega _{t} = \left \{\begin{array}{@{}l@{\quad }l@{}} \sigma ^{2}/\left (1 -\phi ^{2}\right )\quad &\mbox{ if }t = 1, \\ \sigma ^{2}\left (1 +\phi ^{2}\right ) \quad &\mbox{ if }t = 3, \\ \sigma ^{2} \quad &\mbox{ if }t = 2,4,5, \end{array} \right. } $$

    and

    $$ \displaystyle{\hat{X}_{1} = 0,\quad \hat{X}_{2} =\phi Y _{1},\quad \hat{X}_{3} =\phi ^{2}Y _{ 1},\quad \hat{X}_{4} =\phi Y _{3},\quad \hat{X}_{5} =\phi Y _{4}.} $$

    From (9.4.7) and (9.4.9) with h = 1, we find that

    $$ \displaystyle{\hat{Y }_{1}^{{\ast}} = 0,\quad \hat{Y }_{ 2}^{{\ast}} = 0,\quad \hat{Y }_{ 3}^{{\ast}} =\phi ^{2}Y _{ 1},\quad \hat{Y }_{4}^{{\ast}} =\phi Y _{ 3},\quad \hat{Y }_{5}^{{\ast}} =\phi Y _{ 4},} $$

    with corresponding mean squared errors

    $$ \displaystyle{ \Delta _{1 }^{{\ast} } =\sigma ^{2}/\left (1 -\phi ^{2}\right ),\quad \Delta _{ 2}^{{\ast}} = 1,\quad \Delta _{ 3}^{{\ast}} =\sigma ^{2}\left (1 +\phi ^{2}\right ),\quad \Delta _{ 4}^{{\ast}} =\sigma ^{2},\quad \Delta _{ 5}^{{\ast}} =\sigma ^{2}.} $$
  iii.

    From the preceding calculations we can now write the likelihood of the original data as

    $$ \displaystyle\begin{array}{rcl} & &L_{1 } (\phi,\sigma ^{2};\,y_{ 1},y_{3},y_{4},y_{5})=\sigma ^{-4}(2\pi )^{-2}\left [\left (1-\phi ^{2}\right )/\left (1+\phi ^{2}\right )\right ]^{1/2} {}\\ & &\quad \times \exp \left \{-{ 1 \over 2\sigma ^{2}}\left [y_{1}^{2}\left (1-\phi ^{2}\right )+\frac{(y_{3}-\phi ^{2}y_{ 1})^{2}} {1+\phi ^{2}} +(y_{4}-\phi y_{3})^{2}+(y_{ 5}-\phi y_{4})^{2}\right ]\right \}. {}\\ \end{array} $$
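The likelihood just derived can be checked numerically. The following Python sketch (our own illustration, with arbitrary made-up values for y 1, y 3, y 4, y 5) computes −2 log L 1 in two ways: by running the Kalman prediction recursions on the modified series {Y t *} as in steps (i)–(iii), and from the closed-form expression above. The two computations agree to rounding error.

```python
import numpy as np

def ar1_missing_neg2loglik_kalman(phi, sigma2, y, observed):
    """-2 log L_1 for a causal AR(1) with missing values, via the modified
    state-space model (9.6.1)-(9.1.2): at unobserved times Y_t* carries no
    information about the state, so only the prediction step is performed."""
    n = len(y)
    xhat, omega = 0.0, sigma2 / (1.0 - phi ** 2)   # Xhat_1 = 0, Omega_1 = stationary variance
    val = 0.0
    for t in range(n):
        if observed[t]:
            delta = omega                          # Delta_t = Omega_t (Y_t = X_t exactly)
            innov = y[t] - xhat
            val += np.log(2.0 * np.pi) + np.log(delta) + innov ** 2 / delta
            xhat = phi * xhat + phi * innov        # Theta_t Delta_t^{-1} = phi
            omega = sigma2                         # Omega_{t+1} = phi^2 Omega_t + sigma^2 - phi^2 Omega_t
        else:
            xhat = phi * xhat                      # no update at a missing time
            omega = phi ** 2 * omega + sigma2
    return val

def ar1_missing_neg2loglik_closed(phi, sigma2, y1, y3, y4, y5):
    """-2 log of the closed-form likelihood of Example 9.6.1."""
    quad = (y1 ** 2 * (1 - phi ** 2) + (y3 - phi ** 2 * y1) ** 2 / (1 + phi ** 2)
            + (y4 - phi * y3) ** 2 + (y5 - phi * y4) ** 2)
    return (4 * np.log(2.0 * np.pi) + 4 * np.log(sigma2)
            - np.log((1 - phi ** 2) / (1 + phi ** 2)) + quad / sigma2)

y = np.array([1.2, 0.0, -0.5, 0.8, 0.3])           # hypothetical data; y[1] plays the missing Y_2
obs = np.array([True, False, True, True, True])
print(ar1_missing_neg2loglik_kalman(0.6, 2.0, y, obs))
print(ar1_missing_neg2loglik_closed(0.6, 2.0, 1.2, -0.5, 0.8, 0.3))
```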

Remark 1.

If we are given observations \( y_{1-d},y_{2-d},\ldots,y_{0},y_{i_{1}},y_{i_{2}},\ldots,y_{i_{r}} \) of an ARIMA( p, d, q) process at times 1 − d, 2 − d, , 0, i 1, , i r , where 1 ≤ i 1 < i 2 < ⋯ < i r  ≤ n, a similar argument can be used to find the Gaussian likelihood of \( y_{i_{1}},\ldots,y_{i_{r}} \) conditional on Y 1−d  = y 1−d , Y 2−d  = y 2−d , , Y 0 = y 0. Missing values among the first d observations y 1−d , y 2−d , , y 0 can be handled by treating them as unknown parameters for likelihood maximization. For more on ARIMA series with missing values see Brockwell and Davis (1991) and Ansley and Kohn (1985). □ 

9.6.2 Estimation of Missing Values for State-Space Models

Given that we observe only \( \mathbf{Y}_{i_{1}},\mathbf{Y}_{i_{2}},\ldots,\mathbf{Y}_{i_{r}},1 \leq i_{1} < i_{2} < \cdots < i_{r} \leq n \), where {Y t } has the state-space representation (9.1.1) and (9.1.2), we now consider the problem of finding the minimum mean squared error estimators \( P\left (\mathbf{Y}_{t}\vert \mathbf{Y}_{0},\mathbf{Y}_{i_{1}},\ldots,\mathbf{Y}_{i_{r}}\right ) \) of Y t , 1 ≤ t ≤ n, where Y 0 = 1. To handle this problem we again use the modified process {Y t *} defined by (9.6.1) and (9.1.2) with Y 0 * = 1. Since Y s * = Y s for s ∈ { i 1, …, i r } and Y s * ⊥ X t , Y 0 for 1 ≤ t ≤ n and s ∉ {0, i 1, …, i r }, we immediately obtain the minimum mean squared error state estimators

$$ \displaystyle{ P\left (\mathbf{X}_{t}\vert \mathbf{Y}_{0},\mathbf{Y}_{i_{1}},\ldots,\mathbf{Y}_{i_{r}}\right ) = P\left (\mathbf{X}_{t}\vert \mathbf{Y}_{0}^{{\ast}},\mathbf{Y}_{ 1}^{{\ast}},\ldots,\mathbf{Y}_{ n}^{{\ast}}\right ),\quad 1 \leq t \leq n. } $$
(9.6.6)

The right-hand side can be evaluated by application of the Kalman fixed-point smoothing algorithm to the state-space model (9.6.1) and (9.1.2). For computational purposes the observed values of Y t *,  t ∉ {0, i 1, …, i r }, are quite immaterial. They may, for example, all be set equal to zero, giving the sequence of observations \( \left \{\mathbf{y}_{t}^{{\ast}}\right \} \) defined in (9.6.4).

To evaluate \( P\left (\mathbf{Y}_{t}\vert \mathbf{Y}_{0},\mathbf{Y}_{i_{1}},\ldots,\mathbf{Y}_{i_{r}}\right ) \), 1 ≤ t ≤ n, we use (9.6.6) and the relation

$$ \displaystyle{ \mathbf{Y}_{t} = G_{t}\mathbf{X}_{t} + \mathbf{W}_{t}. } $$
(9.6.7)

Since \( E\left (\mathbf{V}_{t}\mathbf{W}_{t}'\right ) = S_{t} = 0,\quad t = 1,\ldots,n, \) we find from (9.6.7) that

$$ \displaystyle{ P\left (\mathbf{Y}_{t}\vert \mathbf{Y}_{0},\mathbf{Y}_{i_{1}},\ldots,\mathbf{Y}_{i_{r}}\right ) = G_{t}P\left (\mathbf{X}_{t}\vert \mathbf{Y}_{0}^{{\ast}},\mathbf{Y}_{ 1}^{{\ast}},\ldots,\mathbf{Y}_{ n}^{{\ast}}\right ). } $$
(9.6.8)

Example 9.6.2.

An AR(1) Series with One Missing Observation

Consider the problem of estimating the missing value Y 2 in Example 9.6.1 in terms of Y 0 = 1, Y 1, Y 3, Y 4, and Y 5. We start from the state-space model X t+1 = ϕ X t + Z t+1, Y t  = X t , for {Y t }. The corresponding model for {Y t *} is the one used in Example 9.6.1. Applying the Kalman smoothing equations to the latter model, we find that

$$ \displaystyle{ \begin{array}{l@{\quad }l@{\quad }l} P_{1}X_{2} =\phi Y _{1}, \quad &P_{2}X_{2} =\phi Y _{1}, \quad &P_{3}X_{2} = \dfrac{\phi (Y _{1} + Y _{3})} {(1 +\phi ^{2})}, \\ P_{4}X_{2} = P_{3}X_{2},\quad &P_{5}X_{2} = P_{3}X_{2},\quad \\ \Omega _{2,2} =\sigma ^{2}, \quad & \Omega _{2,3} =\phi \sigma ^{2}, \quad & \Omega _{2,t} = 0,\quad t \geq 4,\\ \quad \end{array} } $$

and

$$ \displaystyle{ \Omega _{2\vert 1} =\sigma ^{2},\quad \Omega _{ 2\vert 2} =\sigma ^{2},\quad \Omega _{ 2\vert t} ={ \sigma ^{2} \over (1 +\phi ^{2})},\quad t \geq 3, } $$

where P t (⋅ ) here denotes \( P\left (\cdot \vert Y _{0}^{{\ast}},\ldots,Y _{t}^{{\ast}}\right ) \) and \( \Omega _{t,n},\Omega _{t\vert n} \) are defined correspondingly. We deduce from (9.6.8) that the minimum mean squared error estimator of the missing value Y 2 is

$$ \displaystyle{ P_{5}Y _{2} = P_{5}X_{2} = \frac{\phi (Y _{1} + Y _{3})} {\left (1 +\phi ^{2}\right )}, } $$

with mean squared error

$$ \displaystyle{ \Omega _{2\vert 5} = \frac{\sigma ^{2}} {\left (1 +\phi ^{2}\right )}. } $$

Remark 2.

Suppose we have observations \( Y _{1-d},Y _{2-d},\ldots,Y _{0},Y _{i_{1}},\ldots,Y _{i_{r}} \) \( (1 \leq i_{1} < i_{2}\cdots < i_{r} \leq n) \) of an ARIMA(p, d, q) process. Determination of the best linear estimates of the missing values \( Y _{t},\,t\notin \{i_{1},\ldots,i_{r}\} \), in terms of \( Y _{t},\,t \in \{ i_{1},\ldots,i_{r}\} \), and the components of Y 0: = (Y 1−d , Y 2−d , , Y 0)′ can be carried out as in Example 9.6.2 using the state-space representation of the ARIMA series {Y t } from Example 9.3.3 and the Kalman recursions for the corresponding state-space model for {Y t } defined by (9.6.1) and (9.1.2). See Brockwell and Davis (1991) for further details. □ 

We close this section with a brief discussion of a direct approach to estimating missing observations. This approach is often more efficient than the methods just described, especially if the number of missing observations is small and we have a simple (e.g., autoregressive) model. Consider the general problem of computing E(X | Y) when the random vector (X′, Y′)′ has a multivariate normal distribution with mean 0 and covariance matrix Σ. (In the missing observation problem, think of X as the vector of the missing observations and Y as the vector of observed values.) Then the joint probability density function of X and Y can be written as

$$ \displaystyle{ f_{\mathbf{X},\mathbf{Y}}(\mathbf{x},\mathbf{y}) = f_{\mathbf{X}\vert \mathbf{Y}}(\mathbf{x}\vert \mathbf{y})f_{\mathbf{Y}}(\mathbf{y}), } $$
(9.6.9)

where \( f_{\mathbf{X}\vert \mathbf{Y}}(\mathbf{x}\vert \mathbf{y}) \) is a multivariate normal density with mean E(X | Y) and covariance matrix \( \varSigma_{\mathbf{X}\vert \mathbf{Y}} \) (see Proposition A.3.1). In particular,

$$ \displaystyle{ f_{\mathbf{X}\vert \mathbf{Y}}(\mathbf{x}\vert \mathbf{y}) = \frac{1} {\sqrt{(2\pi )^{q } \det \varSigma_{\mathbf{X} \vert \mathbf{Y}} }}\exp \left \{-\frac{1} {2}(\mathbf{x} - E(\mathbf{X}\vert \mathbf{y}))'\varSigma_{\mathbf{X}\vert \mathbf{Y}}^{-1}(\mathbf{x} - E(\mathbf{X}\vert \mathbf{y}))\right \}, } $$
(9.6.10)

where q = dim(X). It is clear from (9.6.10) that \( f_{\mathbf{X}\vert \mathbf{Y}} (\mathbf{x}\vert \mathbf{y}) \) (and also f X, Y (x, y)) is maximized when x = E(X | y). Thus, the best estimator of X in terms of Y can be found by maximizing the joint density of X and Y with respect to x. For autoregressive processes it is relatively straightforward to carry out this optimization, as shown in the following example.

Example 9.6.3.

Estimating Missing Observations in an AR Process

Suppose {Y t } is the AR( p) process defined by

$$ \displaystyle{ Y _{t} =\phi _{1}Y _{t-1} + \cdots +\phi _{p}Y _{t-p} + Z_{t},\quad \{Z_{t}\} \sim \mathrm{WN}\left (0,\sigma ^{2}\right ), } $$

and \( \mathbf{Y} = (Y _{i_{1}},\ldots,Y _{i_{r}})' \), with 1 ≤ i 1 < ⋯ < i r  ≤ n, are the observed values. If there are no missing observations in the first p observations, then the best estimates of the missing values are found by minimizing

$$ \displaystyle{ \sum _{t=p+1}^{n}(Y _{ t} -\phi _{1}Y _{t-1} -\cdots -\phi _{p}Y _{t-p})^{2} } $$
(9.6.11)

with respect to the missing values (see Problem 9.20). For the AR(1) model in Example 9.6.2, minimization of (9.6.11) is equivalent to minimizing

$$ \displaystyle{(Y _{2} -\phi Y _{1})^{2} + (Y _{ 3} -\phi Y _{2})^{2}} $$

with respect to Y 2. Setting the derivative of this expression with respect to Y 2 equal to 0 and solving for Y 2 we obtain \( E(Y _{2}\vert Y _{1},Y _{3},Y _{4},Y _{5}) =\phi (Y _{1} + Y _{3})/\left (1 +\phi ^{2}\right ) \).
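Because (9.6.11) is a quadratic function of the missing values, this direct approach amounts to solving a small linear system. The Python sketch below (our own illustration, not part of ITSM) does so by simple coordinate descent: it repeatedly sets each missing value to the minimizer of (9.6.11) with the remaining values held fixed. For a single missing value in an AR(1) series it reproduces the estimate ϕ(Y 1 + Y 3)∕(1 + ϕ²) obtained above.

```python
import numpy as np

def estimate_missing_ar(y, observed, phi, sweeps=100):
    """Fill in missing values of a (mean-corrected) AR(p) series by minimizing the
    sum of squares (9.6.11) with respect to the missing values, assuming none of
    the first p values is missing.  Coordinate descent on the quadratic objective."""
    y = np.asarray(y, dtype=float).copy()
    observed = np.asarray(observed, dtype=bool)
    phi = np.asarray(phi, dtype=float)
    p, n = len(phi), len(y)
    a = np.concatenate(([1.0], -phi))            # residual r_s = sum_k a_k y_{s-k}
    missing = np.where(~observed)[0]
    y[missing] = 0.0                             # crude starting values
    for _ in range(sweeps):
        for t in missing:
            num = den = 0.0
            for j in range(p + 1):               # residuals r_{t+j} involve y[t]
                s = t + j
                if s < p or s >= n:
                    continue
                r_without = sum(a[k] * y[s - k] for k in range(p + 1) if s - k != t)
                num -= a[j] * r_without
                den += a[j] ** 2
            y[t] = num / den
    return y

y = np.array([1.0, 0.0, 2.0, 0.5, -1.0])         # hypothetical AR(1) data, y[1] missing
filled = estimate_missing_ar(y, [True, False, True, True, True], [0.6])
print(filled[1], 0.6 * (1.0 + 2.0) / (1 + 0.6 ** 2))   # the two values coincide
```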

9.7 The EM Algorithm

The expectation-maximization (EM) algorithm is an iterative procedure for computing the maximum likelihood estimator when only a subset of the complete data set is available. Dempster et al. (1977) demonstrated the wide applicability of the EM algorithm and are largely responsible for popularizing this method in statistics. Details regarding the convergence and performance of the EM algorithm can be found in Wu (1983).

In the usual formulation of the EM algorithm, the “complete” data vector W is made up of “observed” data Y (sometimes called incomplete data) and “unobserved” data X. In many applications, X consists of values of a “latent” or unobserved process occurring in the specification of the model. For example, in the state-space model of Section 9.1, Y could consist of the observed vectors Y 1, , Y n and X of the unobserved state vectors X 1, , X n . The EM algorithm provides an iterative procedure for computing the maximum likelihood estimator based only on the observed data Y. Each iteration of the EM algorithm consists of two steps. If θ (i) denotes the estimated value of the parameter θ after i iterations, then the two steps in the (i + 1)th iteration are

$$ \displaystyle{ \mathbf{ E-step }.\qquad \qquad \mbox{ Calculate }Q(\theta \vert \theta ^{(i)}) = E_{\theta ^{ (i)}}\left [\ell(\theta;\mathbf{X},\mathbf{Y})\vert \mathbf{Y}\right ] } $$

and

$$ \displaystyle{ \mathbf{ M-step }.\qquad \qquad \mbox{ Maximize }Q(\theta \vert \theta ^{(i)})\mbox{ with respect to }\theta. } $$

Then θ (i+1) is set equal to the maximizer of Q in the M-step. In the E-step, \( \ell(\theta;\mathbf{x},\mathbf{y}) =\ln f(\mathbf{x},\mathbf{y};\theta ) \), and \( E_{\theta ^{(i)}}(\cdot \vert \mathbf{Y}) \) denotes the conditional expectation relative to the conditional density \( f{\bigl (\mathbf{x}\vert \mathbf{y};\theta ^{(i)}\bigr )} = f{\bigl (\mathbf{x},\mathbf{y};\theta ^{(i)}\bigr )}/f{\bigl (\mathbf{y};\theta ^{(i)}\bigr )} \).

It can be shown that \( \ell{\bigl (\theta ^{(i)};\mathbf{Y}\bigr )} \) is nondecreasing in i, and a simple heuristic argument shows that if θ (i) has a limit \( \hat{\theta } \) then \( \hat{\theta } \) must be a solution of the likelihood equations \( \ell'{\bigl (\hat{\theta };\mathbf{Y}\bigr )} = 0 \). To see this, observe that \( \ln f(\mathbf{x},\mathbf{y};\theta ) =\ln f(\mathbf{x}\vert \mathbf{y};\theta ) +\ell (\theta;\mathbf{y}) \), from which we obtain

$$ \displaystyle{ Q\left (\theta \vert \theta ^{(i)}\right ) =\int \left (\ln f(\mathbf{x}\vert \mathbf{Y};\theta )\right )f\left (\mathbf{x}\vert \mathbf{Y};\theta ^{(i)}\right )\,d\mathbf{x} +\ell (\theta;\mathbf{Y}) } $$

and

$$ \displaystyle{ Q'(\theta \vert \theta ^{(i)}) =\int \left [\frac{\partial } {\partial \theta }f(\mathbf{x}\vert \mathbf{Y};\theta )\right ]/f(\mathbf{x}\vert \mathbf{Y};\theta )f\left (\mathbf{x}\vert \mathbf{Y};\theta ^{(i)}\right )\,d\mathbf{x} +\ell '(\theta;\mathbf{Y}). } $$

Now replacing θ with θ (i+1), noticing that Q′(θ (i+1) | θ (i)) = 0, and letting i → ∞, we find that

$$ \displaystyle{ 0 =\int \frac{\partial } {\partial \theta }\left [\ f(\mathbf{x}\vert \mathbf{Y};\theta )\right ]_{\theta =\hat{\theta }}\,d\mathbf{x} +\ell '\left (\hat{\theta };\mathbf{Y}\right ) =\ell '\left (\hat{\theta };\mathbf{Y}\right ). } $$

The last equality follows from the fact that

$$ \displaystyle{ 0 = \frac{\partial } {\partial \theta }(1) = \frac{\partial } {\partial \theta }\left [\int f(\mathbf{x}\vert \mathbf{Y};\theta )\,d\mathbf{x}\right ]_{\theta =\hat{\theta }} =\int \left [\frac{\partial } {\partial \theta }\ f(\mathbf{x}\vert \mathbf{Y};\theta )\right ]_{\theta =\hat{\theta }}\,d\mathbf{x}. } $$

The computational advantage of the EM algorithm over direct maximization of the likelihood is most pronounced when the calculation and maximization of the exact likelihood is difficult as compared with the maximization of Q in the M-step. (There are some applications in which the maximization of Q can easily be carried out explicitly.)

9.7.1 Missing Data

The EM algorithm is particularly useful for estimation problems in which there are missing observations. Suppose the complete data set consists of Y 1, …, Y n of which r are observed and n − r are missing. Denote the observed and missing data by \( \mathbf{Y} = (Y _{i_{1}},\ldots,Y _{i_{r}})' \) and \( \mathbf{X} = (Y _{j_{1}},\ldots,Y _{j_{n-r}})' \), respectively. Assuming that W = (X′, Y′)′ has a multivariate normal distribution with mean 0 and covariance matrix Σ, which depends on the parameter \( \theta \), the log-likelihood of the complete data is given by

$$ \displaystyle{ \ell(\theta;\mathbf{W}) = -\frac{n} {2} \ln (2\pi ) -\frac{1} {2}\ln \det (\varSigma ) -\frac{1} {2}\mathbf{W}'\varSigma ^{-1}\mathbf{W}. } $$

The E-step requires that we compute the expectation of \( \ell(\theta;\mathbf{W}) \) with respect to the conditional distribution of W given Y with \( \theta =\theta ^{(i)} \). Writing \( \varSigma (\theta ) \) as the block matrix

$$ \displaystyle{ \varSigma = \left [\begin{array}{*{10}c} \varSigma _{11}\ & \varSigma _{12}\\ \varSigma _{ 21}\ & \varSigma _{22} \end{array} \right ], } $$

which is conformable with X and Y, the conditional distribution of W given Y is multivariate normal with mean \( \left[\begin{array}{c}\hat{{\mathbf{X}}}\\ \mathbf{Y}\end{array}\right] \) and covariance matrix \( \left[\begin{array}{cc}\varSigma _{11\vert 2}(\theta ) & 0\\ 0 & 0\end{array}\right] \), where \( \hat{\mathbf{X}} = \) \( E_{\theta }(\mathbf{X}\vert \mathbf{Y}) =\varSigma _{12}\varSigma _{22}^{-1}\mathbf{Y} \) and \( \varSigma _{11\vert 2}(\theta ) =\varSigma _{11} -\varSigma _{12}\varSigma _{22}^{-1}\varSigma _{21} \) (see Proposition A.3.1). Using Problem A.8, we have

$$ \displaystyle{ E_{\theta ^{(i) } } \left [(\mathbf{X}',\mathbf{Y}')\varSigma ^{-1}(\theta )(\mathbf{X}',\mathbf{Y}')'\vert \mathbf{Y}\right ] = \mbox{ trace}\left (\varSigma _{ 11\vert 2}(\theta ^{(i)})\varSigma _{ 11\vert 2}^{-1}(\theta )\right ) +\hat{ \mathbf{W}}'\varSigma ^{-1}(\theta )\hat{\mathbf{W}}, } $$

where \( \hat{\mathbf{W}} = \left (\hat{\mathbf{X}}',\mathbf{Y}'\right )' \). It follows that

$$ \displaystyle{ Q\left (\theta \vert \theta ^{(i)}\right ) =\ell \left (\theta,\hat{\mathbf{W}}\right ) -\frac{1} {2}\mathrm{trace}\left (\varSigma _{11\vert 2}\left (\theta ^{(i)}\right )\varSigma _{ 11\vert 2}^{-1}(\theta )\right ). } $$

The first term on the right is the log-likelihood based on the complete data, but with X replaced by its “best estimate” \( \hat{\mathbf{X}} \) calculated from the previous iteration. If the increments \( \theta ^{(i+1)} -\theta ^{(i)} \) are small, then the second term on the right is nearly constant ( ≈ n − r) and can be ignored. For ease of computation in this application we shall use the modified version

$$ \displaystyle{ \tilde{Q}\left (\theta \vert \theta ^{(i)}\right ) =\ell \left (\theta;\hat{\mathbf{W}}\right ). } $$

With this adjustment, the steps in the EM algorithm are as follows:

E-step:

Calculate \( E_{\theta ^{(i)}}(\mathbf{X}\vert \mathbf{Y}) \) (e.g., with the Kalman fixed-point smoother) and form \( \ \ \ell{\bigl (\theta;\hat{\mathbf{W}}\bigr )} \).

M-step:

Find the maximum likelihood estimator for the “complete” data problem, i.e., maximize \( \ell{\bigl (\theta;\hat{ \mathbf{W}}\bigr )} \). For ARMA processes, ITSM can be used directly, with the missing values replaced with their best estimates computed in the E-step.

Example 9.7.1.

The Lake Data

It was found in Example 5.2.5 that the AR(2) model

$$ \displaystyle{W_{t} - 1.0415W_{t-1} + 0.2494W_{t-2} = Z_{t},\ \ \{Z_{t}\} \sim \mathrm{WN}(0,.4790)} $$

was a good fit to the mean-corrected lake data {W t }. To illustrate the use of the EM algorithm for missing data, consider fitting an AR(2) model to the mean-corrected data assuming that there are 10 missing values at times t = 17, 24, 31, 38, 45, 52, 59, 66, 73, and 80. We start the algorithm at iteration 0 with \( \hat{\phi }_{1}^{(0)} =\hat{\phi }_{ 2}^{(0)} = 0 \). Since this initial model represents white noise, the first E-step gives, in the notation used above, \( \hat{W}_{17} = \cdots =\hat{ W}_{80} = 0 \). Replacing the “missing” values of the mean-corrected lake data with 0 and fitting a mean-zero AR(2) model to the resulting complete data set using the maximum likelihood option in ITSM, we find that \( \hat{\phi}_{1}^{(1)} = 0.7252 \), \( \hat{\phi}_{2}^{(1)} = 0.0236 \). (Examination of the plots of the ACF and PACF of this new data set suggests an AR(1) as a better model. This is also borne out by the small estimated value of ϕ 2.) The updated missing values at times t = 17, 24, , 80 are found (see Section 9.6 and Problem 9.21) by minimizing

$$ \displaystyle{ \sum _{j=0}^{2}\left (W_{ t+j} -\hat{\phi }_{1}^{(1)}W_{ t+j-1} -\hat{\phi }_{2}^{(1)}W_{ t+j-2}\right )^{2} } $$

with respect to W t . The solution is given by

$$ \displaystyle{ \hat{W}_{t} = \frac{\hat{\phi }_{2}^{(1)}(W_{t-2} + W_{t+2}) + \left (\hat{\phi }_{1}^{(1)} -\hat{\phi }_{1}^{(1)}\hat{\phi }_{2}^{(1)}\right )(W_{t-1} + W_{t+1})} {1 + \left (\hat{\phi }_{1}^{(1)}\right )^{2} + \left (\hat{\phi }_{2}^{(1)}\right )^{2}}. } $$

The M-step of iteration 1 is then carried out by fitting an AR(2) model using ITSM applied to the updated data set. As seen in the summary of the results reported in Table 9.1, the EM algorithm converges in four iterations with the final parameter estimates reasonably close to the fitted model based on the complete data set. (In Table 9.1, estimates of the missing values are recorded only for the first three.) Also notice how \( -2\ell\left (\theta ^{(i)},\mathbf{W}\right ) \) decreases at every iteration. The standard errors of the parameter estimates produced from the last iteration of ITSM are based on a “complete” data set and, as such, underestimate the true sampling errors. Formulae for adjusting the standard errors to reflect the true sampling error based on the observed data can be found in Dempster et al. (1977).

Table 9.1 Estimates of the missing observations at times t = 17, 24, 31 and the AR estimates using the EM algorithm in Example 9.7.1
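The iteration of Example 9.7.1 is easy to sketch in code. In the fragment below (our own illustration), the M-step uses an ordinary least-squares AR(2) fit as a stand-in for ITSM's exact maximum likelihood fit, and the data are simulated from the fitted lake model rather than taken from the actual lake series, so the resulting numbers will not match Table 9.1; the structure of the iteration, however, is the same.

```python
import numpy as np

def fit_ar2_ls(w):
    """Stand-in M-step: least-squares AR(2) fit (ITSM would use exact MLE)."""
    X = np.column_stack([w[1:-1], w[:-2]])
    coef, *_ = np.linalg.lstsq(X, w[2:], rcond=None)
    return coef                                        # (phi_1, phi_2)

def em_ar2_missing(w, missing, n_iter=5):
    """Alternate between refitting the AR(2) model and updating each (isolated)
    missing value with the closed-form minimizer displayed above."""
    w = np.asarray(w, dtype=float).copy()
    w[missing] = 0.0                                   # iteration 0: white-noise fit
    for _ in range(n_iter):
        phi1, phi2 = fit_ar2_ls(w)                     # M-step
        denom = 1.0 + phi1 ** 2 + phi2 ** 2            # E-step
        for t in missing:
            w[t] = (phi2 * (w[t - 2] + w[t + 2])
                    + (phi1 - phi1 * phi2) * (w[t - 1] + w[t + 1])) / denom
    return w, (phi1, phi2)

# Simulate 98 mean-corrected observations from the fitted lake model.
rng = np.random.default_rng(1)
n = 98
w = np.zeros(n)
z = rng.normal(0.0, np.sqrt(0.4790), n)
for t in range(2, n):
    w[t] = 1.0415 * w[t - 1] - 0.2494 * w[t - 2] + z[t]

missing = np.arange(16, 80, 7)          # t = 17, 24, ..., 80 in the book's numbering
w_filled, phi_hat = em_ar2_missing(w, missing)
print(phi_hat)
```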

9.8 Generalized State-Space Models

As in Section 9.1, we consider a sequence of state variables {X t , t ≥ 1} and a sequence of observations {Y t , t ≥ 1}. For simplicity, we consider only one-dimensional state and observation variables, since extensions to higher dimensions can be carried out with little change. Throughout this section it will be convenient to write Y (t) and X (t) for the t-dimensional column vectors Y (t) = (Y 1, Y 2, …, Y t )′ and X (t) = (X 1, X 2, …, X t )′.

There are two important types of state-space models, “parameter driven” and “observation driven,” both of which are frequently used in time series analysis. The observation equation is the same for both, but the state vectors of a parameter-driven model evolve independently of the past history of the observation process, while the state vectors of an observation-driven model depend on past observations.

9.8.1 Parameter-Driven Models

In place of the observation and state equations (9.1.1) and (9.1.2), we now make the assumptions that Y t given \( {\bigl (X_{t},\mathbf{X}^{(t-1)},\mathbf{Y}^{(t-1)}\bigr )} \) is independent of \( {\bigl (\mathbf{X}^{(t-1)},\mathbf{Y}^{(t-1)}\bigr )} \) with conditional probability density

$$ \displaystyle{ p(\,y_{t}\vert x_{t}):= p{\bigl (y_{t}\vert x_{t},\mathbf{x}^{(t-1)},\mathbf{y}^{(t-1)}\bigr )},\quad t = 1,2,\ldots, } $$
(9.8.1)

and that X t+1 given \( {\bigl (X_{t},\mathbf{X}^{(t-1)},\mathbf{Y}^{(t)}\bigr )} \) is independent of \( {\bigl (\mathbf{X}^{(t-1)},\mathbf{Y}^{(t)}\bigr )} \) with conditional density function

$$ \displaystyle{ p(x_{t+1}\vert x_{t}):= p{\bigl (x_{t+1}\vert x_{t},\mathbf{x}^{(t-1)},\mathbf{y}^{(t)}\bigr )}\quad t = 1,2,\ldots. } $$
(9.8.2)

We shall also assume that the initial state X 1 has probability density p 1. The joint density of the observation and state variables can be computed directly from (9.8.1)–(9.8.2) as

$$ \displaystyle\begin{array}{rcl} p(\,y_{1},\ldots,y_{n},x_{1},\ldots,x_{n})& =& p\left (y_{n}\vert x_{n},\mathbf{x}^{(n-1)},\mathbf{y}^{(n-1)}\right )p\left (x_{ n},\mathbf{x}^{(n-1)},\mathbf{y}^{(n-1)}\right ) {}\\ & =& p(\,y_{n}\vert x_{n})p\left (x_{n}\vert \mathbf{x}^{(n-1)},\mathbf{y}^{(n-1)}\right )p\left (\mathbf{y}^{(n-1)},\mathbf{x}^{(n-1)}\right ) {}\\ & =& p(\,y_{n}\vert x_{n})p(x_{n}\vert x_{n-1})p\left (\mathbf{y}^{(n-1)},\mathbf{x}^{(n-1)}\right ) {}\\ & =& \cdots {}\\ & =& \left (\prod _{j=1}^{n}p(\,y_{ j}\vert x_{j})\right )\left (\prod _{j=2}^{n}p(x_{ j}\vert x_{j-1})\right )p_{1}(x_{1}), {}\\ \end{array} $$

and since (9.8.2) implies that {X t } is Markov (see Problem 9.22),

$$ \displaystyle{ p(y_{1},\ldots,y_{n}\vert x_{1},\ldots,x_{n}) = \left (\prod _{j=1}^{n}p(\,y_{ j}\vert x_{j})\right ). } $$
(9.8.3)

We conclude that Y 1, , Y n are conditionally independent given the state variables X 1, , X n , so that the dependence structure of {Y t } is inherited from that of the state process {X t }. The sequence of state variables {X t } is often referred to as the hidden or latent generating process associated with the observed process.

In order to solve the filtering and prediction problems in this setting, we shall determine the conditional densities \( p\left (x_{t}\vert \mathbf{y}^{(t)}\right ) \) of X t given Y (t), and \( p\left (x_{t}\vert \mathbf{y}^{(t-1)}\right ) \) of X t given Y (t−1), respectively. The minimum mean squared error estimates of X t based on Y (t) and Y (t−1) can then be computed as the conditional expectations, \( E\left (X_{t}\vert \mathbf{Y}^{(t)}\right ) \) and \( E\left (X_{t}\vert \mathbf{Y}^{(t-1)}\right ) \).

An application of Bayes’s theorem, using the assumption that the distribution of Y t given \( \left (X_{t},\mathbf{X}^{(t-1)},\mathbf{Y}^{(t-1)}\right ) \) does not depend on \( \left (\mathbf{X}^{(t-1)},\mathbf{Y}^{(t-1)}\right ) \), yields

$$ \displaystyle{ p\left (x_{t}\vert \mathbf{y}^{(t)}\right ) = p(y_{ t}\vert x_{t})p\left (x_{t}\vert \mathbf{y}^{(t-1)}\right )/p\left (y_{ t}\vert \mathbf{y}^{(t-1)}\right ) } $$
(9.8.4)

and

$$ \displaystyle{ p\left (x_{t+1}\vert \mathbf{y}^{(t)}\right ) =\int p\left (x_{ t}\vert \mathbf{y}^{(t)}\right )p(x_{ t+1}\vert x_{t})\,d\mu (x_{t}). } $$
(9.8.5)

(The integral relative to d μ(x t ) in (9.8.5) is interpreted as the integral relative to dx t in the continuous case and as the sum over all values of x t in the discrete case.) The initial condition needed to solve these recursions is

$$ \displaystyle{ p\left (x_{1}\vert \mathbf{y}^{(0)}\right ):= p_{ 1}(x_{1}). } $$
(9.8.6)

The factor \( p\left (y_{t}\vert \mathbf{y}^{(t-1)}\right ) \) appearing in the denominator of (9.8.4) is just a scale factor, determined by the condition \( \int p\left (x_{t}\vert \mathbf{y}^{(t)}\right )\,d\mu (x_{t}) = 1. \) In the generalized state-space setup, prediction of a future state variable is less important than forecasting a future value of the observations. The relevant forecast density can be computed from (9.8.5) as

$$ \displaystyle{ p\left (\,y_{t+1}\vert \mathbf{y}^{(t)}\right ) =\int p(\,y_{ t+1}\vert x_{t+1})p\left (x_{t+1}\vert \mathbf{y}^{(t)}\right )\,d\mu (x_{ t+1}). } $$
(9.8.7)

Equations (9.8.1)–(9.8.2) can be regarded as a Bayesian model specification. A classical Bayesian model has two key assumptions. The first is that the data Y 1, , Y t , given an unobservable parameter (X (t) in our case), are independent with specified conditional distribution. This corresponds to (9.8.3). The second specifies a prior distribution for the parameter value. This corresponds to (9.8.2). The posterior distribution is then the conditional distribution of the parameter given the data. In the present setting the posterior distribution of the component X t of X (t) is determined by the solution (9.8.4) of the filtering problem.

Example 9.8.1.

Consider the simplified version of the linear state-space model of Section 9.1,

$$ \displaystyle{ Y _{t} = GX_{t} + W_{t},\quad \{W_{t}\} \sim \mathrm{iid\ N}(0,R), } $$
(9.8.8)
$$ \displaystyle{ X_{t+1} = FX_{t} + V _{t},\quad \{V _{t}\} \sim \mathrm{iid\ N}(0,Q), } $$
(9.8.9)

where the noise sequences {W t } and {V t } are independent of each other. For this model the probability densities in (9.8.1)–(9.8.2) become

$$ \displaystyle{ p_{1}(x_{1}) = n(x_{1};EX_{1},\mathrm{Var}(X_{1})), } $$
(9.8.10)
$$ \displaystyle{ p(y_{t}\vert x_{t}) = n(\,y_{t};Gx_{t},R), } $$
(9.8.11)
$$ \displaystyle{ p(x_{t+1}\vert x_{t}) = n(x_{t+1};Fx_{t},Q), } $$
(9.8.12)

where \( n\left (x;\mu,\sigma ^{2}\right ) \) is the normal density with mean μ and variance σ 2 defined in Example (a) of Section A.1.

To solve the filtering and prediction problems in this new framework, we first observe that the filtering and prediction densities in (9.8.4) and (9.8.5) are both normal. We shall write them, using the notation of Section 9.4, as

$$ \displaystyle{ p\left (x_{t}\vert \mathbf{Y}^{(t)}\right ) = n(x_{ t};X_{t\vert t},\Omega _{t\vert t}) } $$
(9.8.13)

and

$$ \displaystyle{ p\left (x_{t+1}\vert \mathbf{Y}^{(t)}\right ) = n\left (x_{ t+1};\hat{X}_{t+1},\Omega _{t+1}\right ). } $$
(9.8.14)

From (9.8.5), (9.8.12), (9.8.13), and (9.8.14), we find that

$$ \displaystyle\begin{array}{rcl} \hat{X}_{t+1}& =& \int _{-\infty }^{\infty }x_{ t+1}p(x_{t+1}\vert \mathbf{Y}^{(t)})dx_{ t+1} {}\\ & =& \int _{-\infty }^{\infty }x_{ t+1}\int _{-\infty }^{\infty }p(x_{ t}\vert \mathbf{Y}^{(t)})p(x_{ t+1}\vert x_{t})\,dx_{t}\,dx_{t+1}\phantom{well} {}\\ & =& \int _{-\infty }^{\infty }p(x_{ t}\vert \mathbf{Y}^{(t)})\left [\int _{ -\infty }^{\infty }x_{ t+1}p(x_{t+1}\vert x_{t})\,dx_{t+1}\right ]\,dx_{t} {}\\ & =& \int _{-\infty }^{\infty }Fx_{ t}p(x_{t}\vert \mathbf{Y}^{(t)})\,dx_{ t} {}\\ & =& FX_{t\vert t} {}\\ \end{array} $$

and (see Problem 9.23)

$$ \displaystyle{ \Omega _{t+1} = F^{2}\Omega _{ t\vert t} + Q. } $$

Substituting the corresponding densities (9.8.11) and (9.8.14) into (9.8.4), we find by equating the coefficient of x t 2 on both sides of (9.8.4) that

$$ \displaystyle{ \Omega _{t\vert t}^{-1} = G^{2}R^{-1} + \Omega _{ t}^{-1} = G^{2}R^{-1} + (F^{2}\Omega _{ t-1\vert t-1} + Q)^{-1} } $$

and

$$ \displaystyle{ X_{t\vert t} =\hat{ X}_{t} + \Omega _{t\vert t}GR^{-1}\left (Y _{ t} - G\hat{X}_{t}\right ). } $$

Also, from (9.8.4) with \( p\left (x_{1}\vert \mathbf{y}^{(0)}\right ) = n(x_{1};EX_{1},\Omega _{1}) \) we obtain the initial conditions

$$ \displaystyle{ X_{1\vert 1} = EX_{1} + \Omega _{1\vert 1}GR^{-1}(Y _{ 1} - GEX_{1}) } $$

and

$$ \displaystyle{ \Omega _{1\vert 1}^{-1} = G^{2}R^{-1} + \Omega _{ 1}^{-1}. } $$

The Kalman prediction and filtering recursions of Section 9.4 give the same results for \( \hat{X}_{t} \) and X t | t , since for Gaussian systems best linear mean square estimation is equivalent to best mean square estimation.
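A minimal Python sketch of these filtering and prediction recursions for the scalar model (9.8.8)–(9.8.9) is given below; it assumes R > 0 and Var(X 1) > 0 so that the inverses appearing above exist, and it returns both the filtered quantities X t | t , Ω t | t and the predictors \( \hat{X}_{t+1} \), Ω t+1.

```python
import numpy as np

def gaussian_filter(y, F, G, Q, R, x1_mean, x1_var):
    """Filtering recursions of Example 9.8.1 for the scalar linear Gaussian model."""
    n = len(y)
    x_filt, v_filt = np.empty(n), np.empty(n)          # X_{t|t}, Omega_{t|t}
    x_pred, v_pred = np.empty(n + 1), np.empty(n + 1)  # Xhat_t, Omega_t
    x_pred[0], v_pred[0] = x1_mean, x1_var
    for t in range(n):
        v_filt[t] = 1.0 / (G ** 2 / R + 1.0 / v_pred[t])
        x_filt[t] = x_pred[t] + v_filt[t] * G / R * (y[t] - G * x_pred[t])
        x_pred[t + 1] = F * x_filt[t]                  # Xhat_{t+1} = F X_{t|t}
        v_pred[t + 1] = F ** 2 * v_filt[t] + Q         # Omega_{t+1} = F^2 Omega_{t|t} + Q
    return x_filt, v_filt, x_pred, v_pred

# Example call with arbitrary parameter values:
y = np.array([0.3, -0.1, 0.8, 0.5])
print(gaussian_filter(y, F=0.9, G=1.0, Q=0.5, R=1.0, x1_mean=0.0, x1_var=2.0)[0])
```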

Example 9.8.2.

A non-Gaussian Example

In general, the solution of the recursions (9.8.4) and (9.8.5) presents substantial computational problems. Numerical methods for dealing with non-Gaussian models are discussed by Sorenson and Alspach (1971) and Kitagawa (1987). Here we shall illustrate the recursions (9.8.4) and (9.8.5) in a very simple special case. Consider the state equation

$$ \displaystyle{ X_{t} = aX_{t-1}, } $$
(9.8.15)

with observation density

$$ \displaystyle{ p(y_{t}\vert x_{t}) = \frac{(\pi x_{t})^{ y_{t}}e^{-\pi x_{t}}} {y_{t}!},\quad y_{t} = 0,1,\ldots, } $$
(9.8.16)

where π is a constant between 0 and 1. The relationship in (9.8.15) implies that the transition density [in the discrete sense—see the comment after (9.8.5)] for the state variables is

$$ \displaystyle{ p(x_{t+1}\vert x_{t}) = \left \{\begin{array}{@{}l@{\quad }l@{}} 1,\quad &\mbox{ if }x_{t+1} = ax_{t}, \\ 0,\quad &\mbox{ otherwise}. \end{array} \right. } $$

We shall assume that X 1 has the gamma density function

$$ \displaystyle{ p_{1}(x_{1}) = g(x_{1};\alpha,\lambda ) = \frac{\lambda ^{\alpha }x_{1}^{ \alpha -1}e^{-\lambda x_{1}}} {\Gamma (\alpha )},\quad x_{1} > 0. } $$

(This is a simplified model for the evolution of the number X t of individuals at time t infected with a rare disease, in which X t is treated as a continuous rather than an integer-valued random variable. The observation Y t represents the number of infected individuals observed in a random sample consisting of a small fraction π of the population at time t.) Because the transition distribution of {X t } is not continuous, we use the integrated version of (9.8.5) to compute the prediction density. Thus,

$$ \displaystyle\begin{array}{rcl} P\left (X_{t} \leq x\vert \mathbf{y}^{(t-1)}\right )& =& \int _{ 0}^{\infty }P(X_{ t} \leq x\vert x_{t-1})p\left (x_{t-1}\vert \mathbf{y}^{(t-1)}\right )\,dx_{ t-1} {}\\ & =& \int _{0}^{x/a}p\left (x_{ t-1}\vert \mathbf{y}^{(t-1)}\right )\,dx_{ t-1}. {}\\ \end{array} $$

Differentiation with respect to x gives

$$ \displaystyle{ p\left (x_{t}\vert \mathbf{y}^{(t-1)}\right ) = a^{-1}p_{ X_{t-1}\vert \mathbf{Y}^{(t-1)}}\left (a^{-1}x_{ t}\vert \mathbf{y}^{(t-1)}\right ). } $$
(9.8.17)

Now applying (9.8.4), we find that

$$ \displaystyle\begin{array}{rcl} p(x_{1}\vert y_{1})& =& p(\,y_{1}\vert x_{1})p_{1}(x_{1})/p(\,y_{1}) {}\\ & =& \left (\frac{(\pi x_{1})^{y_{1}}e^{-\pi x_{1}}} {y_{1}!} \right )\left (\frac{\lambda ^{\alpha }x_{1}^{\alpha -1}e^{-\lambda x_{1}}} {\Gamma (\alpha )} \right )\left ( \frac{1} {p(\,y_{1})}\right ) {}\\ & =& c(\,y_{1})x_{1}^{\alpha +y_{1}-1}e^{-(\pi +\lambda )x_{1} },\quad x_{1} > 0, {}\\ \end{array} $$

where c(y 1) is an integration factor ensuring that p(⋅ | y 1) integrates to 1. Since p(⋅ | y 1) has the form of a gamma density, we deduce (see Example (d) of Section A.1) that

$$ \displaystyle{ p(x_{1}\vert y_{1}) = g(x_{1};\alpha _{1},\lambda _{1}), } $$
(9.8.18)

where α 1 = α + y 1 and λ 1 = λ +π. The prediction density, calculated from (9.8.5) and (9.8.18), is

$$ \displaystyle\begin{array}{rcl} p\left (x_{2}\vert \mathbf{y}^{(1)}\right )& =& a^{-1}p_{ X_{1}\vert \mathbf{Y}^{(1)}}\left (a^{-1}x_{ 2}\vert \mathbf{y}^{(1)}\right )\phantom{well} {}\\ & =& a^{-1}g\left (a^{-1}x_{ 2};\alpha _{1},\lambda _{1}\right ) {}\\ & =& g(x_{2};\alpha _{1},\lambda _{1}/a). {}\\ \end{array} $$

Iterating the recursions (9.8.4) and (9.8.5) and using (9.8.17), we find that for t ≥ 1,

$$ \displaystyle{ \quad p\left (x_{t}\vert \mathbf{y}^{(t)}\right ) = g(x_{ t};\alpha _{t},\lambda _{t}) } $$
(9.8.19)

and

$$ \displaystyle\begin{array}{rcl} p\left (x_{t+1}\vert \mathbf{y}^{(t)}\right )& =& a^{-1}g\left (a^{-1}x_{ t+1};\alpha _{t},\lambda _{t}\right )\phantom{well} \\ & =& g(x_{t+1};\alpha _{t},\lambda _{t}/a), {}\end{array} $$
(9.8.20)

where α t  = α t−1 + y t  = α + y 1 + ⋯ + y t and \( \lambda _{t} =\lambda _{t-1}/a+\pi =\lambda a^{1-t} +\pi \left (1 - a^{-t}\right )/(1 - a^{-1}). \) In particular, the minimum mean squared error estimate of x t based on y (t) is the conditional expectation α t ∕λ t with conditional variance α t ∕λ t 2. From (9.8.7) the probability density of Y t+1 given Y (t) is

$$ \displaystyle\begin{array}{rcl} p(\,y_{t+1}\vert \mathbf{y}^{(t)})& =& \int _{ 0}^{\infty }\left (\dfrac{(\pi x_{t+1})^{y_{t+1}}e^{-\pi x_{t+1}}} {y_{t+1}!} \right )g(x_{t+1};\alpha _{t},\lambda _{t}/a)\,dx_{t+1} {}\\ & =& \dfrac{\Gamma (\alpha _{t} + y_{t+1})} {\Gamma (\alpha _{t})\Gamma (y_{t+1} + 1)}\left (1 - \dfrac{\pi } {\lambda _{t+1}}\right )^{\alpha _{t}}\left ( \dfrac{\pi } {\lambda _{t+1}}\right )^{y_{t+1} } {}\\ & =& nb(y_{t+1};\alpha _{t},1 -\pi /\lambda _{t+1}),\quad y_{t+1} = 0,1,\ldots, {}\\ \end{array} $$

where nb(y; α, p) is the negative binomial density defined in example (i) of Section A.1. Conditional on Y (t), the best one-step predictor of Y t+1 is therefore the mean, α t π∕(λ t+1 − π), of this negative binomial distribution. The conditional mean squared error of the predictor is Var\( \left (Y _{t+1}\vert \mathbf{Y}^{(t)}\right ) =\alpha _{t}\pi \lambda _{t+1}/(\lambda _{t+1}-\pi )^{2} \) (see Problem 9.25).
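The gamma–Poisson recursions of this example are simple enough to implement in a few lines. The Python sketch below (our own illustration, with arbitrary inputs) carries the parameters (α t , λ t ) through the filtering step (9.8.19) and the prediction step (9.8.20), and returns the one-step forecast mean and variance of Y t+1 given y (t) using the negative binomial formulas above.

```python
import numpy as np

def gamma_poisson_filter(y, a, pi, alpha1, lam1):
    """Filtering and one-step prediction for the model of Example 9.8.2.
    alpha1, lam1 are the parameters of the gamma prior p_1 of X_1."""
    alpha_prior, lam_prior = alpha1, lam1       # parameters of p(x_t | y^(t-1))
    posteriors, forecasts = [], []
    for yt in y:
        alpha_t = alpha_prior + yt              # (9.8.4): gamma prior x Poisson(pi*x) likelihood
        lam_t = lam_prior + pi                  # posterior p(x_t|y^(t)) = g(.; alpha_t, lambda_t)
        posteriors.append((alpha_t, lam_t))
        lam_next = lam_t / a                    # (9.8.20): p(x_{t+1}|y^(t)) = g(.; alpha_t, lambda_t/a)
        mean = alpha_t * pi / lam_next          # = alpha_t * pi / (lambda_{t+1} - pi)
        var = alpha_t * pi * (lam_next + pi) / lam_next ** 2
        forecasts.append((mean, var))
        alpha_prior, lam_prior = alpha_t, lam_next
    return posteriors, forecasts

posteriors, forecasts = gamma_poisson_filter([2, 0, 3, 1], a=1.1, pi=0.1, alpha1=1.0, lam1=1.0)
print(forecasts[-1])                            # forecast mean and variance of Y_5 given y^(4)
```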

Example 9.8.3.

A Model for Time Series of Counts

We often encounter time series in which the observations represent count data. One such example is the monthly number of newly recorded cases of poliomyelitis in the U.S. for the years 1970–1983 plotted in Figure 9.6. Unless the actual counts are large and can be approximated by continuous variables, Gaussian and linear time series models are generally inappropriate for analyzing such data. The parameter-driven specification provides a flexible class of models for modeling count data. We now discuss a specific model based on a Poisson observation density. This model is similar to the one presented by Zeger (1988) for analyzing the polio data. The observation density is assumed to be Poisson with mean exp{x t }, i.e.,

$$ \displaystyle{ p(\,y_{t}\vert x_{t}) = \frac{e^{ x_{t}y_{t}}e^{-e^{ x_{t}} }} {y_{t}!},\quad y_{t} = 0,1,\ldots, } $$
(9.8.21)

while the state variables are assumed to follow a regression model with Gaussian AR(1) noise. If u t  = (u t1, , u tk )′ are the regression variables, then

$$ \displaystyle{ X_{t} =\beta '\mathbf{u}_{t} + W_{t}, } $$
(9.8.22)

where \( \beta \) is a k-dimensional regression parameter and

$$ \displaystyle{ W_{t} =\phi W_{t-1} + Z_{t},\quad \{Z_{t}\} \sim \mathrm{IID\ N}\left (0,\sigma ^{2}\right ). } $$

The transition density function for the state variables is then

$$ \displaystyle{ p(x_{t+1}\vert x_{t}) = n\left (x_{t+1};\,\beta '\mathbf{u}_{t+1} +\phi \left (x_{t} -\beta '\mathbf{u}_{t}\right ),\,\sigma ^{2}\right ). } $$
(9.8.23)
Fig. 9.6

Monthly number of U.S. cases of polio, January 1970–December 1983

The case σ 2 = 0 corresponds to a log-linear model with Poisson noise.

Estimation of the parameters \( \theta = \left (\beta ',\phi,\sigma ^{2}\right )' \) in the model by direct numerical maximization of the likelihood function is difficult, since the likelihood cannot be written down in closed form. (From (9.8.3) the likelihood is the n-fold integral,

$$ \displaystyle{ \int _{-\infty }^{\infty }\cdots \int _{ -\infty }^{\infty }\exp \left \{\sum _{ t=1}^{n}{\bigl (x_{ t}y_{t} - e^{ x_{t} }\bigr )}\right \}L\left (\theta;\mathbf{x}^{(n)}\right )\,(dx_{ 1}\cdots dx_{n})\Big/\prod _{i=1}^{n}(y_{ i}!), } $$

where \( L(\theta;\mathbf{x}) \) is the likelihood based on X 1, , X n .) To overcome this difficulty, Chan and Ledolter (1995) proposed an algorithm, called Monte Carlo EM (MCEM), whose iterates θ (i) converge to the maximum likelihood estimate. To apply this algorithm, first note that the conditional distribution of Y (n) given X (n) does not depend on \( \theta \), so that the likelihood based on the complete data \( \left (\mathbf{X}^{(n)}{}',\mathbf{Y}^{(n)}{}'\right )' \) is given by

$$ \displaystyle{ L\left (\theta;\mathbf{X}^{(n)},\mathbf{Y}^{(n)}\right ) = f\left (\mathbf{Y}^{(n)}\vert \mathbf{X}^{(n)}\right )L\left (\theta;\mathbf{X}^{(n)}\right ). } $$

The E-step of the algorithm (see Section 9.7) requires calculation of

$$ \displaystyle\begin{array}{rcl} Q(\theta \vert \theta ^{(i)})& =& E_{\theta ^{ (i)}}\left (\ln L(\theta;\mathbf{X}^{(n)},\mathbf{Y}^{(n)})\vert \mathbf{Y}^{(n)}\right ) {}\\ & =& E_{\theta ^{(i)}}\left (\ln \ f(\mathbf{Y}^{(n)}\vert \mathbf{X}^{(n)})\vert \mathbf{Y}^{(n)}\right ) + E_{\theta ^{ (i)}}\left (\ln L(\theta;\mathbf{X}^{(n)})\vert \mathbf{Y}^{(n)}\right ). {}\\ \end{array} $$

We delete the first term from the definition of Q, since it is independent of \( \theta \) and hence plays no role in the M-step of the EM algorithm. The new Q is redefined as

$$ \displaystyle{ Q(\theta \vert \theta ^{(i)}) = E_{\theta ^{ (i)}}\left (\ln L(\theta;\mathbf{X}^{(n)})\vert \mathbf{Y}^{(n)}\right ). } $$
(9.8.24)

Even with this simplification, direct calculation of Q is still intractable. Suppose for the moment that it is possible to generate replicates of X (n) from the conditional distribution of X (n) given Y (n) when \( \theta =\theta ^{(i)} \). If we denote m independent replicates of X (n) by X 1 (n), , X m (n), then a Monte Carlo approximation to Q in (9.8.24) is given by

$$ \displaystyle{ Q_{m}\left (\theta \vert \theta ^{(i)}\right ) = \frac{1} {m}\sum _{j=1}^{m}\ln L\left (\theta;\mathbf{X}_{ j}^{(n)}\right ). } $$

The M-step is easy to carry out using Q m in place of Q (especially if we condition on X 1 = 0 in all the simulated replicates), since L is just the Gaussian likelihood of the regression model with AR(1) noise treated in Section 6.6. The difficult steps in the algorithm are the generation of replicates of X (n) given Y (n) and the choice of m. Chan and Ledolter (1995) discuss the use of the Gibbs sampler for generating the desired replicates and give some guidelines on the choice of m.

In their analyses of the polio data, Zeger (1988) and Chan and Ledolter (1995) included as regression components an intercept, a slope, and harmonics at periods of 6 and 12 months. Specifically, they took

$$ \displaystyle{ \mathbf{u}_{t} = (1,t/1000,\cos (2\pi t/12),\sin (2\pi t/12),\cos (2\pi t/6),\sin (2\pi t/6))'. } $$

The implementation of Chan and Ledolter’s MCEM method by Kuk and Cheng (1994) gave estimates \( \hat{\beta }=\, \)(0.247, −3.871, 0.162, −0.482, 0.414, −0.011)′, \( \hat{\phi }= 0.648 \), and \( \hat{\sigma }^{2} = 0.281 \). The estimated trend function \( \hat{\beta }'\mathbf{u}_{t} \) is displayed in Figure 9.7. The negative coefficient of t∕1000 indicates a slight downward trend in the monthly number of polio cases.

Fig. 9.7

Trend estimate for the monthly number of U.S. cases of polio, January 1970–December 1983
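The estimated trend can be reconstructed directly from these numbers. The short Python fragment below (our own illustration) builds the regression vectors u t over the 168 months of the sample and evaluates \( \hat{\beta }'\mathbf{u}_{t} \), the quantity plotted in Figure 9.7.

```python
import numpy as np

t = np.arange(1, 169)                     # January 1970 - December 1983
u = np.column_stack([np.ones_like(t, dtype=float), t / 1000.0,
                     np.cos(2 * np.pi * t / 12), np.sin(2 * np.pi * t / 12),
                     np.cos(2 * np.pi * t / 6),  np.sin(2 * np.pi * t / 6)])
beta_hat = np.array([0.247, -3.871, 0.162, -0.482, 0.414, -0.011])   # Kuk and Cheng (1994)
trend = u @ beta_hat                      # estimated trend beta'u_t
print(trend[:3], trend[-3:])
```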

9.8.2 Observation-Driven Models

Again we assume that Y t , conditional on \( \big(X_{t},\mathbf{X}^{(t-1)},\mathbf{Y}^{(t-1)}\big) \), is independent of \( \big(\mathbf{X}^{(t-1)},\mathbf{Y}^{(t-1)}\big) \). These models are specified by the conditional densities

$$ \displaystyle{ p(\,y_{t}\vert x_{t}) = p\big(y_{t}\vert \mathbf{x}^{(t)},\mathbf{y}^{(t-1)}\big),\quad t = 1,2,\ldots, } $$
(9.8.25)
$$ \displaystyle{ p\big(x_{t+1}\vert \mathbf{y}^{(t)}\big) = p_{ X_{t+1}\vert \mathbf{Y}^{(t)}}\big(x_{t+1}\vert \mathbf{y}^{(t)}\big),\quad t = 0,1,\ldots, } $$
(9.8.26)

where \( p\big(x_{1}\vert \mathbf{y}^{(0)}\big):= p_{1}(x_{1}) \) for some prespecified initial density p 1(x 1). The advantage of the observation-driven state equation (9.8.26) is that the posterior distribution of X t given Y (t) can be computed directly from (9.8.4) without the use of the updating formula (9.8.5). This then allows for easy computation of the forecast function in (9.8.7) and hence of the joint density function of (Y 1, , Y n )′,

$$ \displaystyle{ p(\,y_{1},\ldots,y_{n}) =\prod _{ t=1}^{n}p\left (y_{ t}\vert \mathbf{y}^{(t-1)}\right ). } $$
(9.8.27)

On the other hand, the mechanism by which the state X t−1 makes the transition to X t is not explicitly defined. In fact, without further assumptions there may be state sequences {X t } and {X t *} with different distributions for which both (9.8.25) and (9.8.26) hold (see Example 9.8.5). Both sequences, however, lead to the same joint distribution, given by (9.8.27), for Y 1, …, Y n . The ambiguity in the specification of the distribution of the state variables can be removed by assuming that X t+1 given \( \left (\mathbf{X}^{(t)},\mathbf{Y}^{(t)}\right ) \) is independent of X (t), with conditional distribution (9.8.26), i.e.,

$$ \displaystyle{ p\left (x_{t+1}\vert \mathbf{x}^{(t)},\mathbf{y}^{(t)}\right ) = p_{X_{t+1}\vert \mathbf{Y}^{(t)}} \left (x_{t+1}\vert \mathbf{y}^{(t)}\right ). } $$
(9.8.28)

With this modification, the joint density of Y (n) and X (n) is given by (cf. (9.8.3))

$$ \displaystyle\begin{array}{rcl} p\left (\mathbf{y}^{(n)},\mathbf{x}^{(n)}\right )& =& p(\,y_{ n}\vert x_{n})p\left (x_{n}\vert \mathbf{y}^{(n-1)}\right )p\left (\mathbf{y}^{(n-1)},\mathbf{x}^{(n-1)}\right ) {}\\ & =& \cdots {}\\ & =& \prod _{t=1}^{n}\left (p(\,y_{ t}\vert x_{t})p\left (x_{t}\vert \mathbf{y}^{(t-1)}\right )\right ). {}\\ \end{array} $$

Example 9.8.4.

An AR(1) Process

An AR(1) process with iid noise can be expressed as an observation driven model. Suppose {Y t } is the AR(1) process

$$ \displaystyle{ Y _{t} =\phi Y _{t-1} + Z_{t}, } $$

where {Z t } is an iid sequence of random variables with mean 0 and some probability density function f(x). Then with X t : = Y t−1 we have

$$ \displaystyle{ p(\,y_{t}\vert x_{t}) = f(\,y_{t} -\phi x_{t}) } $$

and

$$ \displaystyle{ p\left (x_{t+1}\vert \mathbf{y}^{(t)}\right ) = \left \{\begin{array}{@{}l@{\quad }l@{}} 1,\quad &\mbox{ if }x_{t+1} = y_{t}, \\ 0,\quad &\mbox{ otherwise}. \end{array} \right. } $$

Example 9.8.5.

Suppose the observation-equation density is given by

$$ \displaystyle{ p(\,y_{t}\vert x_{t}) = \frac{x_{t}^{ y_{t}}e^{-x_{t}}} {y_{t}!},\quad y_{t} = 0,1,\ldots, } $$
(9.8.29)

and the state equation (9.8.26) is

$$ \displaystyle{ p\left (x_{t+1}\vert \mathbf{y}^{(t)}\right ) = g(x_{t+1};\alpha _{t},\lambda _{t}), } $$
(9.8.30)

where α t  = α + y 1 + ⋯ + y t and λ t  = λ + t. It is possible to give a parameter-driven specification that gives rise to the same state equation (9.8.30). Let {X t *} be the parameter-driven state variables, where X t * = X t−1 * and X 1 * has a gamma distribution with parameters α and λ. (This corresponds to the model in Example 9.8.2 with π = a = 1.) Then from (9.8.19) we see that \( p\left (x_{t}^{{\ast}}\vert \mathbf{y}^{(t)}\right ) = g(x_{t}^{{\ast}};\alpha _{t},\lambda _{t}) \), which coincides with the state equation (9.8.30). If {X t } are the state variables whose joint distribution is specified through (9.8.28), then {X t } and {X t *} cannot have the same joint distributions. To see this, note that

$$ \displaystyle{ p\left (x_{t+1}^{{\ast}}\vert x_{ t}^{{\ast}}\right ) = \left \{\begin{array}{@{}l@{\quad }l@{}} 1,\quad &\mbox{ if }x_{t+1}^{{\ast}} = x_{ t}^{{\ast}}, \\ 0,\quad &\mbox{ otherwise}, \end{array} \right. } $$

while

$$ \displaystyle{ p\left (x_{t+1}\vert \mathbf{x}^{(t)},\mathbf{y}^{(t)}\right ) = p\left (x_{t+1}\vert \mathbf{y}^{(t)}\right ) = g(x_{t+1};\alpha _{t},\lambda _{t}). } $$

If the two sequences had the same joint distribution, then the latter density could take only the values 0 and 1, which contradicts the continuity (as a function of x t+1 ) of this density.

9.8.3 Exponential Family Models

The exponential family of distributions provides a large and flexible class of distributions for use in the observation equation. The density in the observation equation is said to belong to an exponential family (in natural parameterization) if

$$ \displaystyle{ p(\,y_{t}\vert x_{t}) =\exp \{ y_{t}x_{t} - b(x_{t}) + c(y_{t})\}, } $$
(9.8.31)

where b(⋅ ) is a twice continuously differentiable function and c(y t ) does not depend on x t . This family includes the normal, exponential, gamma, Poisson, binomial, and many other distributions frequently encountered in statistics. Detailed properties of the exponential family can be found in Barndorff-Nielsen (1978), and an excellent treatment of its use in the analysis of linear models is given by McCullagh and Nelder (1989). We shall need only the following important facts:

$$ \displaystyle{ e^{b(x_{t})} =\int \exp \{ y_{ t}x_{t} + c(y_{t})\}\,\nu (dy_{t}), } $$
(9.8.32)
$$ \displaystyle{ b'(x_{t}) = E(Y _{t}\vert x_{t}), } $$
(9.8.33)
$$ \displaystyle{ b''(x_{t}) = \mathrm{Var}(Y _{t}\vert x_{t}):=\,\int y_{t}^{2}p(y_{ t}\vert x_{t})\,\nu (dy_{t}) -\left [b'(x_{t})\right ]^{2}, } $$
(9.8.34)

where integration with respect to ν(dy t ) means integration with respect to dy t in the continuous case and summation over all values of y t in the discrete case.

Proof.

The first relation is simply the statement that p(y t  | x t ) integrates to 1. The second relation is established by differentiating both sides of (9.8.32) with respect to x t  and then multiplying through by \( e^{-b(x_{t})} \) (for justification of the differentiation under the integral sign see Barndorff-Nielsen 1978). The last relation is obtained by differentiating (9.8.32) twice with respect to x t and simplifying.

Example 9.8.6.

The Poisson Case

If the observation Y t , given X t  = x t , has a Poisson distribution of the form (9.8.21), then

$$ \displaystyle{ p(y_{t}\vert x_{t}) =\exp {\bigl \{ y_{t}x_{t} - e^{x_{t} } -\ln y_{t}!\bigr \}},\quad y_{t} = 0,1,\ldots, } $$
(9.8.35)

which has the form (9.8.31) with \( b(x_{t}) = e^{x_{t}} \) and c(y t ) = −lny t ! . From (9.8.33) we easily find that \( E(Y _{t}\vert x_{t}) = b'(x_{t}) = e^{x_{t}} \). This parameterization is slightly different from the one used in Examples 9.8.2 and 9.8.5, where the conditional mean of Y t  given x t was π x t and not \( e^{\,x_{t}} \). For this observation equation, define the family of densities

$$ \displaystyle{ f(x;\alpha,\lambda ) =\exp \{\alpha x -\lambda b(x) + A(\alpha,\lambda )\},\quad -\infty < x < \infty, } $$
(9.8.36)

where α > 0 and λ > 0 are parameters and A(α, λ) = −lnΓ(α) +αlnλ. Now consider state densities of the form

$$ \displaystyle{ p(x_{t+1}\vert \mathbf{y}^{(t)}) = f(x_{ t+1};\alpha _{t+1\vert t},\lambda _{t+1\vert t}), } $$
(9.8.37)

where α t+1 | t and λ t+1 | t are, for the moment, unspecified functions of y (t). (The subscript t + 1 | t on the parameters is a shorthand way to indicate dependence on the conditional distribution of X t+1 given Y (t).) With this specification of the state densities, the parameters α t+1 | t are related to the best one-step predictor of Y t through the formula

$$ \displaystyle{ \alpha _{t+1\vert t}/\lambda _{t+1\vert t} =\hat{ Y }_{t+1}:= E\left (Y _{t+1}\vert \mathbf{y}^{(t)}\right ). } $$
(9.8.38)

Proof.

We have from (9.8.7) and (9.8.33) that

$$ \displaystyle\begin{array}{rcl} E(Y _{t+1}\vert \mathbf{y}^{(t)})& =& \sum _{ y_{t+1}=0}^{\infty }\int _{ -\infty }^{\infty }y_{ t+1}p(y_{t+1}\vert x_{t+1})p\left (x_{t+1}\vert \mathbf{y}^{(t)}\right )\,dx_{ t+1} {}\\ & =& \int _{-\infty }^{\infty }b'(x_{ t+1})p\left (x_{t+1}\vert \mathbf{y}^{(t)}\right )\,dx_{ t+1}. {}\\ \end{array} $$

Addition and subtraction of α t+1 | t λ t+1 | t then gives

$$ \displaystyle\begin{array}{rcl} E(Y _{t+1}\vert \mathbf{y}^{(t)})& =& \int _{ -\infty }^{\infty }\left (b'(x_{ t+1}) -\frac{\alpha _{t+1\vert t}} {\lambda _{t+1\vert t}}\right )p\left (x_{t+1}\vert \mathbf{y}^{(t)}\right )\,dx_{ t+1} + \frac{\alpha _{t+1\vert t}} {\lambda _{t+1\vert t}} {}\\ & =& \int _{-\infty }^{\infty }-\lambda _{ t+1\vert t}^{-1}\,p'\left (x_{ t+1}\vert \mathbf{y}^{(t)}\right )\,dx_{ t+1} + \frac{\alpha _{t+1\vert t}} {\lambda _{t+1\vert t}} {}\\ & =& \left [-\lambda _{t+1\vert t}^{-1}\,p\left (x_{ t+1}\vert \mathbf{y}^{(t)}\right )\right ]_{ x_{t+1}=-\infty }^{x_{t+1}=\infty } + \frac{\alpha _{t+1\vert t}} {\lambda _{t+1\vert t}} {}\\ & =& \frac{\alpha _{t+1\vert t}} {\lambda _{t+1\vert t}}. {}\\ \end{array} $$

Letting A t | t−1 = A(α t | t−1, λ t | t−1), we can write the posterior density of X t given Y (t) as

$$ \displaystyle\begin{array}{rcl} p\left (x_{t}\vert \mathbf{y}^{(t)}\right )& =& \exp \{y_{t}x_{t} - b(x_{t}) + c(y_{t})\}\exp \{\alpha _{t\vert t-1}x_{t} -\lambda _{t\vert t-1}b(x_{t}) {}\\ & & \quad + A_{t\vert t-1}\}/p\left (y_{t}\vert \mathbf{y}^{(t-1)}\right ) {}\\ & =& \exp \{\alpha _{t}x_{t} -\lambda _{t}b(x_{t}) + A(\alpha _{t},\lambda _{t})\} {}\\ & =& f(x_{t};\alpha _{t},\lambda _{t}), {}\\ \end{array} $$

where we find, by equating coefficients of x t and b(x t ), that the coefficients λ t and α t are determined by

$$ \displaystyle{ \lambda _{t} = 1 +\lambda _{t\vert t-1}, } $$
(9.8.39)
$$ \displaystyle{ \alpha _{t} = y_{t} +\alpha _{t\vert t-1}. } $$
(9.8.40)

The family of prior densities in (9.8.37) is called a conjugate family of priors for the observation equation (9.8.35), since the resulting posterior densities are again members of the same family.

As mentioned earlier, the parameters α t | t−1 and λ t | t−1 can be quite arbitrary: Any nonnegative functions of y (t−1) will lead to a consistent specification of the state densities. One convenient choice is to link these parameters with the corresponding parameters of the posterior distribution at time t − 1 through the relations

$$ \displaystyle{ \lambda _{t+1\vert t} =\delta \lambda _{t}\left (=\delta (1 +\lambda _{t\vert t-1})\right ), } $$
(9.8.41)
$$ \displaystyle{ \alpha _{t+1\vert t} =\delta \alpha _{t}\left (=\delta (y_{t} +\alpha _{t\vert t-1})\right ), } $$
(9.8.42)

where 0 < δ < 1 (see Remark 4 below). Iterating the relation (9.8.41), we see that

$$ \displaystyle\begin{array}{rcl} \lambda _{t+1\vert t} =\delta (1 +\lambda _{t\vert t-1})& =& \delta +\delta \lambda _{t\vert t-1} \\ & =& \delta +\delta (\delta +\delta \lambda _{t-1\vert t-2}) \\ & =& \cdots \\ & =& \delta +\delta ^{2} + \cdots +\delta ^{t} +\delta ^{t}\lambda _{ 1\vert 0} \\ & \rightarrow & \delta /(1-\delta ) {}\end{array} $$
(9.8.43)

as t → ∞. Similarly,

$$ \displaystyle\begin{array}{rcl} \alpha _{t+1\vert t}& =& \delta y_{t} +\delta \alpha _{t\vert t-1} \\ & =& \cdots \\ & =& \delta y_{t} +\delta ^{2}y_{ t-1} + \cdots +\delta ^{t}y_{ 1} +\delta ^{t}\alpha _{ 1\vert 0}.{}\end{array} $$
(9.8.44)

For large t, we have the approximations

$$ \displaystyle{ \lambda _{t+1\vert t} \approx \delta /(1-\delta ) } $$
(9.8.45)

and

$$ \displaystyle{ \alpha _{t+1\vert t} \approx \delta \sum _{ j=0}^{t-1}\delta ^{ j}y_{ t-j}, } $$
(9.8.46)

which are exact if λ 1 | 0 = δ∕(1 −δ) and α 1 | 0 = 0. From (9.8.38) the one-step predictors are linear and given by

$$ \displaystyle{ \hat{Y }_{t+1} = \frac{\alpha _{t+1\vert t}} {\lambda _{t+1\vert t}} = \frac{\sum _{j=0}^{t-1}\delta ^{ j}y_{t-j} +\delta ^{t-1}\alpha _{1\vert 0}} {\sum _{j=0}^{t-1}\delta ^{ j} +\delta ^{t-1}\lambda _{1\vert 0}}. } $$
(9.8.47)

Replacing the denominator with its limiting value, or starting with λ 1 | 0 = δ∕(1 −δ), we find that \( \hat{Y }_{t+1} \) is the solution of the recursions

$$ \displaystyle{ \hat{Y }_{t+1} = (1-\delta )y_{t} +\delta \hat{ Y }_{t},\quad t = 1,2,\ldots, } $$
(9.8.48)

with initial condition \( \hat{Y }_{1} = (1-\delta )\delta ^{-1}\alpha _{1\vert 0} \). In other words, under the restrictions of (9.8.41) and (9.8.42), the best one-step predictors can be found by exponential smoothing.
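
A minimal sketch of these recursions, under the assumptions of this example (Poisson observations, conjugate log-gamma state densities, and the power steady conditions (9.8.41)–(9.8.42)), is given below; the observations and the value of δ are hypothetical. With λ 1 | 0 = δ∕(1 −δ) the two functions return identical predictors, illustrating the equivalence with exponential smoothing noted above.

```python
def conjugate_poisson_predictors(y, delta, alpha10=0.0, lam10=None):
    """One-step predictors for the Poisson/log-gamma model of Example 9.8.6.

    Posterior update (9.8.39)-(9.8.40):  lam_t = 1 + lam_{t|t-1},
                                         alpha_t = y_t + alpha_{t|t-1};
    power-steady propagation (9.8.41)-(9.8.42): multiply both by delta;
    predictor (9.8.38): Yhat_{t+1} = alpha_{t+1|t} / lam_{t+1|t}.
    """
    if lam10 is None:
        lam10 = delta / (1.0 - delta)       # makes (9.8.45) exact
    alpha, lam = alpha10, lam10
    preds = [alpha / lam]                   # Yhat_1 = alpha_{1|0}/lam_{1|0}
    for yt in y:
        alpha = delta * (yt + alpha)        # alpha_{t+1|t}, (9.8.42)
        lam = delta * (1.0 + lam)           # lam_{t+1|t},  (9.8.41)
        preds.append(alpha / lam)
    return preds

def exponential_smoothing(y, delta, yhat1=0.0):
    """Recursion (9.8.48): Yhat_{t+1} = (1 - delta)*y_t + delta*Yhat_t."""
    preds = [yhat1]
    for yt in y:
        preds.append((1.0 - delta) * yt + delta * preds[-1])
    return preds

if __name__ == "__main__":
    y = [0, 2, 1, 0, 3, 1, 1, 2]            # hypothetical counts
    delta = 0.8
    print(conjugate_poisson_predictors(y, delta))
    print(exponential_smoothing(y, delta))  # identical output
```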

Remark 1.

The preceding analysis for the Poisson-distributed observation equation holds, almost verbatim, for the general family of exponential densities (9.8.31). (One only needs to take care in specifying the correct range for x and the allowable parameter space for α and λ in (9.8.37).) The relations (9.8.43)–(9.8.44), as well as the exponential smoothing formula (9.8.48), continue to hold even in the more general setting, provided that the parameters α t | t−1 and λ t | t−1 satisfy the relations (9.8.41)–(9.8.42). □ 

Remark 2.

Equations (9.8.41)–(9.8.42) are equivalent to the assumption that the prior density of X t given y (t−1) is proportional to the δ-power of the posterior distribution of X t−1 given Y (t−1), or more succinctly that

$$ \displaystyle\begin{array}{rcl} f(x_{t};\alpha _{t\vert t-1},\lambda _{t\vert t-1})& =& f(x_{t};\delta \alpha _{t-1\vert t-1},\delta \lambda _{t-1\vert t-1}) {}\\ & \propto & f^{\delta }(x_{t};\alpha _{t-1\vert t-1},\lambda _{t-1\vert t-1}). {}\\ \end{array} $$

This power relationship is sometimes referred to as the power steady model (Grunwald et al. 1993; Smith 1979). □ 
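
The proportionality is immediate from (9.8.36), since the normalizing term A(α, λ) does not involve x. A quick numerical check (an illustration with arbitrary parameter values and the Poisson choice b(x) = e^{x}) is sketched below: the ratio of the unnormalized densities is constant in x.

```python
import math

def f_unnorm(x, alpha, lam):
    """Unnormalized f(x; alpha, lambda) of (9.8.36) with b(x) = e^x."""
    return math.exp(alpha * x - lam * math.exp(x))

if __name__ == "__main__":
    alpha, lam, delta = 3.0, 2.0, 0.8       # arbitrary values
    ratios = [f_unnorm(x, delta * alpha, delta * lam) / f_unnorm(x, alpha, lam) ** delta
              for x in (-1.0, 0.0, 0.5, 1.0)]
    print(ratios)   # constant in x (equal to 1 up to rounding), so
                    # f(x; delta*alpha, delta*lam) is proportional to f(x; alpha, lam)**delta
```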

Remark 3.

The transformed state variables \( W_{t} = e^{X_{t}} \) have a gamma state density given by

$$ \displaystyle{ p\left (w_{t+1}\vert \mathbf{y}^{(t)}\right ) = g(w_{ t+1};\alpha _{t+1\vert t},\lambda _{t+1\vert t}) } $$

(see Problem 9.26). The mean and variance of this conditional density are

$$ \displaystyle{ E\left (W_{t+1}\vert \mathbf{y}^{(t)}\right ) =\alpha _{ t+1\vert t}/\lambda _{t+1\vert t}\quad \mathrm{and}\quad \mathrm{Var}\left (W_{t+1}\vert \mathbf{y}^{(t)}\right ) =\alpha _{ t+1\vert t}/\lambda _{t+1\vert t}^{2}.\mbox{ $\square $} } $$

Remark 4.

If we regard the random walk plus noise model of Example 9.2.1 as the prototypical state-space model, then from the calculations in Example 9.8.1 with G = F = 1, we have

$$ \displaystyle{ E\left (X_{t+1}\vert \mathbf{Y}^{(t)}\right ) = E\left (X_{ t}\vert \mathbf{Y}^{(t)}\right ) } $$

and

$$ \displaystyle{ \mathrm{Var}\left (X_{t+1}\vert \mathbf{Y}^{(t)}\right ) = \mathrm{Var}\left (X_{ t}\vert \mathbf{Y}^{(t)}\right ) + Q > \mathrm{Var}\left (X_{ t}\vert \mathbf{Y}^{(t)}\right ). } $$

The first of these equations implies that the best estimate of the next state is the same as the best estimate of the current state, while the second implies that the variance increases. Under conditions (9.8.41) and (9.8.42), the same is true of the state variables in the model above (see Problem 9.26). This was, in part, the rationale given by Harvey and Fernandes (1989) for imposing these conditions. □ 

Remark 5.

While the calculations work out neatly for the power steady model, Grunwald et al. (1994) have shown that such processes have degenerate sample paths for large t. In the Poisson example above, they argue that the observations Y t converge to 0 as t → ∞ (see Figure 9.12). Although such models may still be useful in practice for modeling series of moderate length, their efficacy for describing long-term behavior is doubtful. □ 

Example 9.8.7.

Goals Scored by England Against Scotland

The time series of the number of goals scored by England against Scotland in soccer matches played at Hampden Park in Glasgow is graphed in Figure 9.8. The matches have been played nearly every second year, with interruptions during the war years. We will treat the data y 1, …, y 52 as coming from an equally spaced time series model {Y t }. Since the number of goals scored is small (see the frequency histogram in Figure 9.9), a model based on the Poisson distribution might be deemed appropriate. The observed relative frequencies and those based on a Poisson distribution with mean equal to \( \bar{y}_{52} = 1.269 \) are contained in Table 9.2. The standard chi-squared goodness of fit test, comparing the observed frequencies with expected frequencies based on a Poisson model, has a p-value of 0.02. The lack of fit with a Poisson distribution is hardly unexpected, since the sample variance (1.652) is much larger than the sample mean, while the mean and variance of the Poisson distribution are equal. In this case the data are said to be overdispersed in the sense that there is more variability in the data than one would expect from a sample of independent Poisson-distributed variables. Overdispersion can sometimes be explained by serial dependence in the data.

Fig. 9.8

Goals scored by England against Scotland at Hampden Park, Glasgow, 1872–1987

Fig. 9.9

Histogram of the data in Figure 9.8

Table 9.2 Relative frequency and fitted Poisson distribution of goals scored by England against Scotland

Dependence in count data can often be revealed by estimating the probabilities of transition from one state to another. Table 9.3 contains estimates of these probabilities, computed as the relative frequencies of one-step transitions from state y t to state y t+1. If the data were independent, then within each column the entries should be nearly the same. This is certainly not the case in Table 9.3. For example, England is very unlikely to be shut out, or to score three or more goals, in the next match after scoring at least three goals in the previous encounter.

Table 9.3 Transition probabilities for the number of goals scored by England against Scotland
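
Transition probabilities of the kind shown in Table 9.3 can be estimated from any count series by tabulating the relative frequencies of observed one-step transitions. The sketch below is a hypothetical illustration (the data and the pooling of all counts of three or more into a single state are assumptions for the example, not the goal series itself).

```python
from collections import Counter

def transition_matrix(counts, max_state=3):
    """Estimate P(Y_{t+1} = j | Y_t = i) by the relative frequencies of
    one-step transitions, pooling all counts >= max_state into one state."""
    s = [min(c, max_state) for c in counts]
    pairs = Counter(zip(s[:-1], s[1:]))
    rows = {}
    for i in range(max_state + 1):
        total = sum(pairs[(i, j)] for j in range(max_state + 1))
        rows[i] = [pairs[(i, j)] / total if total else 0.0
                   for j in range(max_state + 1)]
    return rows

if __name__ == "__main__":
    data = [0, 2, 1, 1, 3, 0, 1, 2, 2, 0, 1, 4, 1, 0, 2, 1]   # hypothetical counts
    for i, row in transition_matrix(data).items():
        print(i, ["%.2f" % p for p in row])
```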

Harvey and Fernandes (1989) model the dependence in this data using an observation-driven model of the type described in Example 9.8.6. Their model assumes a Poisson observation equation and a log-gamma state equation:

$$ \displaystyle\begin{array}{rcl} p(\,y_{t}\vert x_{t})& =& \frac{\exp \{\,y_{t}x_{t} - e^{x_{t}}\}} {y_{t}!},\quad y_{t} = 0,1,\ldots, {}\\ p\left (x_{t}\vert \mathbf{y}^{(t-1)}\right )& =& f(x_{ t};\alpha _{t\vert t-1},\lambda _{t\vert t-1}),\quad -\infty < x < \infty, {}\\ \end{array} $$

for t = 1, 2, …, where f is given by (9.8.36) and α 1 | 0 = 0, λ 1 | 0 = 0. The power steady conditions (9.8.41)–(9.8.42) are assumed to hold for α t | t−1 and λ t | t−1. The only unknown parameter in the model is δ. The log-likelihood function for δ based on the conditional distribution of y 1, …, y 52 given y 1 is given by [see (9.8.27)]

$$ \displaystyle{ \ell\left (\delta,\mathbf{y}^{(n)}\right ) =\sum _{ t=1}^{n-1}\ln p\left (y_{ t+1}\vert \mathbf{y}^{(t)}\right ), } $$
(9.8.49)

where \( p\left (\,y_{t+1}\vert \mathbf{y}^{(t)}\right ) \) is the negative binomial density [see Problem 9.25(c)]

$$ \displaystyle{ p\left (y_{t+1}\vert \mathbf{y}^{(t)}\right ) = nb\left (\,y_{ t+1};\alpha _{t+1\vert t},(1 +\lambda _{t+1\vert t})^{-1}\right ), } $$

with α t+1 | t and λ t+1 | t as defined in (9.8.44) and (9.8.43). (For the goal data, y 1 = 0, which implies α 2 | 1 = 0 and hence that \( p\left (\,y_{2}\vert y^{(1)}\right ) \) is a degenerate density with unit mass at y 2 = 0. Harvey and Fernandes avoid this complication by conditioning the likelihood on y (τ), where τ is the time of the first nonzero data value.)

Maximizing this likelihood with respect to δ, we obtain \( \hat{\delta }= 0.844 \). (Starting equations (9.8.43)–(9.8.44) with α 1 | 0 = 0 and λ 1 | 0 = δ∕(1 −δ), we obtain \( \hat{\delta }= 0.732 \).) With 0.844 as our estimate of δ, the prediction density of the next observation Y 53 given y (52) is nb(y 53; α 53 | 52, (1 +λ 53 | 52)−1). The first five values of this distribution are given in Table 9.4. Under this model, the probability that England will be held scoreless in the next match is 0.471. The one-step predictors \( \hat{Y }_{1} = 0,\hat{Y }_{2},\ldots,\hat{Y }_{52} \) are graphed in Figure 9.10. (This graph can be obtained by using the ITSM option Smooth>Exponential with α = 0.154.)

Table 9.4 Prediction density of Y 53 given Y (52) for the data in Figure 9.8
Fig. 9.10

One-step predictors of the goal data
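
The maximization of (9.8.49) over δ is a one-dimensional problem and is easy to carry out numerically. The sketch below illustrates the computation; it is not the authors' or ITSM's code, and the data shown are hypothetical rather than the goal series. The predictive density is taken to be the Poisson–gamma mixture pmf Γ(y+α)∕(y! Γ(α)) (λ∕(1+λ))^α (1+λ)^{−y}, which corresponds to nb(y; α, (1+λ)^{−1}) above; the recursions (9.8.43)–(9.8.44) are started from α 1 | 0 = λ 1 | 0 = 0, terms are accumulated only after the first nonzero observation (as described above), and a simple grid search stands in for a proper optimizer.

```python
import math

def nb_logpmf(y, alpha, p):
    """log of the Poisson-gamma mixture pmf
    Gamma(y+alpha)/(y! Gamma(alpha)) * (1-p)**alpha * p**y, with p = 1/(1+lambda)."""
    out = math.lgamma(y + alpha) - math.lgamma(y + 1) - math.lgamma(alpha)
    out += alpha * math.log(1.0 - p)
    if y > 0:
        out += y * math.log(p)
    return out

def conditional_loglik(y, delta):
    """Conditional log-likelihood (9.8.49) with alpha_{1|0} = lambda_{1|0} = 0 and
    the power steady recursions (9.8.43)-(9.8.44); terms are added only once
    alpha_{t+1|t} > 0, i.e., after the first nonzero observation."""
    alpha, lam, ll = 0.0, 0.0, 0.0
    for t in range(len(y) - 1):
        alpha = delta * (y[t] + alpha)          # alpha_{t+1|t}
        lam = delta * (1.0 + lam)               # lambda_{t+1|t}
        if alpha > 0.0:
            ll += nb_logpmf(y[t + 1], alpha, 1.0 / (1.0 + lam))
    return ll

if __name__ == "__main__":
    y = [0, 1, 2, 1, 0, 3, 1, 2, 0, 1, 1, 2]    # hypothetical counts, not the goal data
    grid = [d / 100.0 for d in range(1, 100)]   # candidate values of delta in (0, 1)
    dhat = max(grid, key=lambda d: conditional_loglik(y, d))
    print("delta-hat =", dhat)
```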

Figures 9.11 and 9.12 contain two realizations from the fitted model for the goal data. The general appearance of the first realization is somewhat compatible with the goal data, while the second realization illustrates the convergence of the sample path to 0 in accordance with the result of Grunwald et al. (1994).

Fig. 9.11

A simulated time series from the fitted model to the goal data

Fig. 9.12

A second simulated time series from the fitted model to the goal data
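
Sample paths like those in Figures 9.11 and 9.12 can be generated by alternately drawing the gamma-distributed state W t+1 = e^{X t+1} given y (t) (Remark 3) and then a Poisson observation with that mean, updating the parameters by (9.8.39)–(9.8.42) after each draw. The sketch below is an illustration only; the starting values α 1 | 0 and λ 1 | 0 are assumptions. Repeated runs with δ = 0.844 typically produce paths that are eventually absorbed at 0, consistent with the result of Grunwald et al. (1994).

```python
import numpy as np

def simulate_power_steady_poisson(n, delta, alpha0=1.0, lam0=None, seed=0):
    """Simulate a path from the Poisson/log-gamma power steady model:
    draw W_{t+1} ~ Gamma(alpha_{t+1|t}, rate lambda_{t+1|t}) given y^{(t)},
    then Y_{t+1} ~ Poisson(W_{t+1}), and update the parameters by
    (9.8.39)-(9.8.42).  alpha0 and lam0 are assumed starting values."""
    rng = np.random.default_rng(seed)
    if lam0 is None:
        lam0 = delta / (1.0 - delta)
    alpha, lam = alpha0, lam0
    path = []
    for _ in range(n):
        w = rng.gamma(shape=alpha, scale=1.0 / lam)   # state on the original scale
        yt = int(rng.poisson(w))
        path.append(yt)
        alpha = delta * (yt + alpha)                  # (9.8.40) then (9.8.42)
        lam = delta * (1.0 + lam)                     # (9.8.39) then (9.8.41)
    return path

if __name__ == "__main__":
    print(simulate_power_steady_poisson(52, delta=0.844))
```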

Example 9.8.8.

The Exponential Case

Suppose Y t given X t has an exponential density with mean − 1∕X t (X t  < 0). The observation density is given by

$$ \displaystyle{ p(y_{t}\vert x_{t}) =\exp \{ y_{t}x_{t} +\ln (-x_{t})\},\quad y_{t} > 0, } $$

which has the form (9.8.31) with b(x) = −ln(−x) and c(y) = 0. The state densities corresponding to the family of conjugate priors (see (9.8.37)) are given by

$$ \displaystyle{ p\left (x_{t+1}\vert \mathbf{y}^{(t)}\right ) =\exp \{\alpha _{ t+1\vert t}\,x_{t+1} -\lambda _{t+1\vert t}\,b(x_{t+1}) + A_{t+1\vert t}\},\quad -\infty < x < 0. } $$

(Here p(x t+1 | y (t)) is a probability density when α t+1 | t  > 0 and λ t+1 | t  > −1.) The one-step prediction density is

$$ \displaystyle\begin{array}{rcl} p\left (y_{t+1}\vert \mathbf{y}^{(t)}\right )& =& \int _{ -\infty }^{0}e^{x_{t+1}y_{t+1}+\ln (-x_{t+1})+\alpha _{t+1\vert t}x_{t+1}-\lambda _{t+1\vert t}b(x_{t+1})+A_{t+1\vert t} }\,dx_{t+1} {}\\ & =& (\lambda _{t+1\vert t} + 1)\alpha _{t+1\vert t}^{\lambda _{t+1\vert t}+1}(y_{ t+1} +\alpha _{t+1\vert t})^{-\lambda _{t+1\vert t}-2},\quad y_{ t+1} > 0 {}\\ \end{array} $$

(see Problem 9.28). While E(Y t+1 | y (t)) = α t+1 | t ∕λ t+1 | t , the conditional variance is finite if and only if λ t+1 | t  > 1. Under assumptions (9.8.41)–(9.8.42), and starting with λ 1 | 0 = δ∕(1 −δ), the exponential smoothing formula (9.8.48) remains valid.
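
As a numerical sanity check of the displayed predictive density (an illustration with arbitrary values standing in for α t+1 | t and λ t+1 | t ), one can verify that it integrates to 1 and has mean α t+1 | t ∕λ t+1 | t :

```python
import numpy as np
from scipy import integrate

def pareto_pred_density(y, alpha, lam):
    """p(y_{t+1} | y^{(t)}) = (lam + 1) * alpha**(lam + 1) * (y + alpha)**(-lam - 2), y > 0."""
    return (lam + 1.0) * alpha ** (lam + 1.0) * (y + alpha) ** (-lam - 2.0)

if __name__ == "__main__":
    alpha, lam = 2.0, 4.0   # arbitrary values of alpha_{t+1|t} and lambda_{t+1|t}
    total, _ = integrate.quad(lambda y: pareto_pred_density(y, alpha, lam), 0, np.inf)
    mean, _ = integrate.quad(lambda y: y * pareto_pred_density(y, alpha, lam), 0, np.inf)
    print(total)                 # approximately 1
    print(mean, alpha / lam)     # both approximately 0.5
```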

Problems

  1. 9.1

    Show that if all the eigenvalues of F are less than 1 in absolute value (or equivalently that \( F^{k}\! \rightarrow \! 0 \) as k → ∞), the unique stationary solution of equation (9.1.11) is given by the infinite series

    $$ \displaystyle{ \mathbf{X}_{t} =\sum _{ j=0}^{\infty }F^{ j}V _{ t-j-1} } $$

    and that the corresponding observation vectors are

    $$ \displaystyle{ \mathbf{Y}_{t} = \mathbf{W}_{t} +\sum _{ j=0}^{\infty }GF^{ j}\mathbf{V}_{ t-j-1}. } $$

    Deduce that {(X t ′, Y t ′)′} is a multivariate stationary process. (Hint: Use a vector analogue of the argument in Example 2.2.1.)

  2. 9.2

    In Example 9.2.1, show that θ = −1 if and only if σ v 2 = 0, which in turn is equivalent to the signal M t being constant.

  3. 9.3

    Let F be the coefficient of X t in the state equation (9.3.4) for the causal AR(p) process

    $$ \displaystyle{ X_{t} -\phi _{1}X_{t-1} -\cdots -\phi _{p}X_{t-p} = Z_{t},\quad \{Z_{t}\} \sim \mathrm{WN}\left (0,\sigma ^{2}\right ). } $$

    Establish the stability of (9.3.4) by showing that

    $$ \displaystyle{ \det (zI - F) = z^{p}\phi \left (z^{-1}\right ), } $$

    and hence that the eigenvalues of F are the reciprocals of the zeros of the autoregressive polynomial ϕ(z) = 1 −ϕ 1 z −⋯ −ϕ p z p.

  4. 9.4

    By following the argument in Example 9.3.3, find a state-space model for {Y t } when {∇∇12 Y t } is an ARMA(p, q) process.

  5. 9.5

    For the local linear trend model defined by equations (9.2.6)–(9.2.7), show that ∇2 Y t  = (1 − B)2 Y t is a 2-correlated sequence and hence, by Proposition 2.1.1, is an MA(2) process. Show that this MA(2) process is noninvertible if σ u 2 = 0.

  6. 9.6
    1. a.

      For the seasonal model of Example 9.2.2, show that ∇ d Y t  = Y t  − Y t−d is an MA(1) process.

    2. b.

      Show that ∇∇ d Y t is an MA(d + 1) process where {Y t } follows the seasonal model with a local linear trend as described in Example 9.2.3.

  7. 9.7

    Let {Y t } be the MA(1) process

    $$ \displaystyle{ Y _{t} = Z_{t} +\theta Z_{t-1},\ \ \{Z_{t}\} \sim \mathrm{WN}\left (0,\sigma ^{2}\right ). } $$

    Show that {Y t } has the state-space representation

    $$ \displaystyle{ Y _{t} = [1\quad 0]\mathbf{X}_{t}, } $$

    where {X t } is the unique stationary solution of

    $$ \displaystyle{ \mathbf{X}_{t+1} = \left [\begin{array}{*{10}c} 0&1\\ 0 &0 \end{array} \right ]\mathbf{X}_{t}+\left [\begin{array}{*{10}c} 1\\ \theta \end{array} \right ]Z_{t+1}. } $$

    In particular, show that the state vector X t can be written as

    $$ \displaystyle{ \mathbf{X}_{t} = \left [\begin{array}{*{10}c} 1& \theta \\ \theta &0 \end{array} \right ]\left [\begin{array}{*{10}c} Z_{t} \\ Z_{t-1} \end{array} \right ]. } $$
  8. 9.8

    Verify equations (9.3.16)–(9.3.18) for an ARIMA(1,1,1) process.

  9. 9.9

    Consider the two state-space models

    $$ \displaystyle{ \left \{\begin{array}{@{}l@{\quad }l@{}} \mathbf{X}_{t+1,1}\quad &=F_{1}\mathbf{X}_{t1} + \mathbf{V}_{t1}, \\ \mathbf{Y}_{t1} \quad &=G_{1}\mathbf{X}_{t1} + \mathbf{W}_{t1}, \end{array} \right. } $$

    and

    $$ \displaystyle{ \left \{\begin{array}{@{}l@{\quad }l@{}} \mathbf{X}_{t+1,2}\quad &=F_{2}\mathbf{X}_{t2} + \mathbf{V}_{t2}, \\ \mathbf{Y}_{t2} \quad &=G_{2}\mathbf{X}_{t2} + \mathbf{W}_{t2}, \end{array} \right. } $$

    where {(V t1′, W t1′, V t2′, W t2′)′} is white noise. Derive a state-space representation for {(Y t1′, Y t2′)′}.

  10. 9.10

    Use Remark 1 of Section 9.4 to establish the linearity properties of the operator P t stated in Remark 3.

  11. 9.11
    1. a.

      Show that if the matrix equation XS = B can be solved for X, then X = BS −1 is a solution for any generalized inverse S −1 of S.

    2. b.

      Use the result of (a) to derive the expression for P(X | Y) in Remark 4 of Section 9.4.

  12. 9.12

    In the notation of the Kalman prediction equations, show that every vector of the form

    $$ \displaystyle{ \mathbf{Y} = A_{1}\mathbf{X}_{1} + \cdots + A_{t}\mathbf{X}_{t} } $$

    can be expressed as

    $$ \displaystyle{ \mathbf{Y} = B_{1}\mathbf{X}_{1} + \cdots + B_{t-1}\mathbf{X}_{t-1} + C_{t}\mathbf{I}_{t}, } $$

    where B 1, …, B t−1 and C t are matrices that depend on the matrices A 1, …, A t . Show also that the converse is true. Use these results and the fact that E(X s I t ) = 0 for all s < t to establish (9.4.3).

  13. 9.13

    In Example 9.4.1, verify that the steady-state solution of the Kalman recursions (9.1.2) is given by \( \Omega _{t} = \left (\sigma _{v}^{2} + \sqrt{\sigma _{v }^{4 } + 4\sigma _{v }^{2 }\sigma _{w }^{2}}\right )/2 \).

  14. 9.14

    Show from the difference equations for Ω t in Example 9.4.1 that (Ω t+1 − Ω)(Ω t  − Ω) ≥ 0 for all Ω t  ≥ 0, where Ω is the steady-state solution for Ω t  given in Problem 9.13.

  15. 9.15

    Show directly that for the MA(1) model (9.2.3), the parameter θ is equal to \( -\left (2\sigma _{w}^{2} +\sigma _{ v}^{2} -\sqrt{\sigma _{v }^{4 } + 4\sigma _{v }^{2 }\sigma _{w }^{2}}\right )/\left (2\sigma _{w}^{2}\right ) \), which in turn is equal to −σ w 2∕(Ω +σ w 2), where Ω is the steady-state solution for Ω t given in Problem 9.13.

  16. 9.16

    Use the ARMA(0,1,1) representation of the series {Y t } in Example 9.4.1 to show that the predictors defined by

    $$ \displaystyle{ \hat{Y }_{n+1} = aY _{n} + (1 - a)\hat{Y }_{n},\quad n = 1,2,\ldots, } $$

    where a = Ω∕(Ω +σ w 2), satisfy

    $$ \displaystyle{ Y _{n+1} -\hat{ Y }_{n+1} = Z_{n+1} + (1 - a)^{n}\left (Y _{ 0} - Z_{0} -\hat{ Y }_{1}\right ). } $$

    Deduce that if 0 < a < 1, the mean squared error of \( \hat{Y }_{n+1} \) converges to Ω +σ w 2 for any initial predictor \( \hat{Y }_{1} \) with finite mean squared error.

  17. 9.17
    1. a.

      Using equations (9.4.1) and (9.4.16), show that \( \hat{\mathbf{X}}_{t+1} = F_{t}\mathbf{X}_{t\vert t} \).

    2. b.

      From (a) and (9.4.16) show that X t | t satisfies the recursions

      $$ \displaystyle{ \mathbf{X}_{t\vert t} = F_{t-1}\mathbf{X}_{t-1\vert t-1} + \Omega _{t}G_{t}'\Delta _{t}^{-1}(\mathbf{Y}_{ t} - G_{t}F_{t-1}\mathbf{X}_{t-1\vert t-1}) } $$

      for t = 2, 3, …, with \( \mathbf{X}_{1\vert 1} =\hat{ \mathbf{X}}_{1} + \Omega _{1}G_{1}'\Delta _{1}^{-1}\left (\mathbf{Y}_{1} - G_{1}\hat{\mathbf{X}}_{1}\right ) \).

  18. 9.18

    In Section 9.5, show that for fixed Q ∗, \( -2\ln L\left (\boldsymbol{\mu },Q^{{\ast}},\sigma _{w}^{2}\right ) \) is minimized when \( \boldsymbol{\mu } \) and σ w 2 are given by (9.5.10) and (9.5.11), respectively.

  19. 9.19

    Verify the calculation of Θ t Δ t −1 and Ω t in Example 9.6.1.

  20. 9.20

    Verify that the best estimates of missing values in an AR(p) process are found by minimizing (9.6.11) with respect to the missing values.

  21. 9.21

    Suppose that {Y t } is the AR(2) process

    $$ \displaystyle{ Y _{t} =\phi _{1}Y _{t-1} +\phi _{2}Y _{t-2} + Z_{t},\quad \{Z_{t}\} \sim \mathrm{WN}\left (0,\sigma ^{2}\right ), } $$

    and that we observe Y 1, Y 2, Y 4, Y 5, Y 6, Y 7. Show that the best estimator of Y 3 is

    $$ \displaystyle{ \left (\phi _{2}(Y _{1} + Y _{5}) + (\phi _{1} -\phi _{1}\phi _{2})(Y _{2} + Y _{4})\right )/\left (1 +\phi _{ 1}^{2} +\phi _{ 2}^{2}\right ). } $$
  22. 9.22

    Let X t be the state at time t of a parameter-driven model (see (9.8.2)). Show that {X t } is a Markov chain and that (9.8.3) holds.

  23. 9.23

    For the generalized state-space model of Example 9.8.1, show that Ω t+1 = F 2 Ω t | t + Q.

  24. 9.24

    If Y and X are random variables, show that

    $$ \displaystyle{ \mathrm{Var}(Y ) = E(\mathrm{Var}(Y \vert X)) + \mathrm{Var}(E(Y \vert X)). } $$
  25. 9.25

    Suppose that Y and X are two random variables such that the distribution of Y given X is Poisson with mean π X, 0 < π ≤ 1, and X has the gamma density g(x; α, λ).

    1. a.

      Show that the posterior distribution of X given Y also has a gamma density and determine its parameters.

    2. b.

      Compute E(X | Y ) and Var(X | Y ).

    3. c.

      Show that Y has a negative binomial density and determine its parameters.

    4. d.

      Use (c) to compute E(Y ) and Var(Y ).

    5. e.

      Verify in Example 9.8.2 that \( E\left (Y _{t+1}\vert \mathbf{Y}^{(t)}\right ) =\alpha _{t}\pi /(\lambda _{t+1}-\pi ) \) and Var\( \left (Y _{t+1}\vert \mathbf{Y}^{(t)}\right ) =\alpha _{t}\pi \lambda _{t+1}/(\lambda _{t+1}-\pi )^{2} \).

  26. 9.26

    For the model of Example 9.8.6, show that

    1. a.

      \( E\left (X_{t+1}\vert \mathbf{Y}^{(t)}\right ) = E\left (X_{t}\vert \mathbf{Y}^{(t)}\right ) \), Var\( \left (X_{t+1}\vert \mathbf{Y}^{(t)}\right ) > \) Var\( \left (X_{t}\vert \mathbf{Y}^{(t)}\right ) \), and

    2. b.

      the transformed sequence \( W_{t} = e^{X_{t}} \) has a gamma state density.

  27. 9.27

    Let {V t } be a sequence of independent exponential random variables with EV t  = 1∕t and suppose that {X t , t ≥ 1} and {Y t , t ≥ 1} are the state and observation random variables, respectively, of the parameter-driven state-space system

    $$ \displaystyle\begin{array}{rcl} X_{1}& =& V _{1}, {}\\ X_{t}& =& X_{t-1} + V _{t},\quad t = 2,3,\ldots, {}\\ \end{array} $$

    where the distribution of the observation Y t , conditional on the random variables Y 1, Y 2, …, Y t−1, X t , is Poisson with mean X t .

    1. a.

      Determine the observation and state transition density functions p(y t  | x t ) and p(x t+1 | x t ) in the parameter-driven model for {Y t }.

    2. b.

      Show, using (9.8.4)–(9.8.6), that

      $$ \displaystyle{ p(x_{1}\vert y_{1}) = g(x_{1};y_{1} + 1,2) } $$

      and

      $$ \displaystyle{ p(x_{2}\vert y_{1}) = g(x_{2};y_{1} + 2,2), } $$

      where g(x; α, λ) is the gamma density function (see Example (d) of Section A.1).

    3. c.

      Show that

      $$ \displaystyle{ p\left (x_{t}\vert \mathbf{y}^{(t)}\right ) = g(x_{ t};\alpha _{t} + t,t + 1) } $$

      and

      $$ \displaystyle{ p\left (x_{t+1}\vert \mathbf{y}^{(t)}\right ) = g(x_{ t+1};\alpha _{t} + t + 1,t + 1), } $$

      where α t  = y 1 + ⋯ + y t .

    4. d.

      Conclude from (c) that the minimum mean squared error estimates of X t  and X t+1 based on Y 1, …, Y t are

      $$ \displaystyle{ X_{t\vert t} = \frac{t + Y _{1} + \cdots + Y _{t}} {t + 1} } $$

      and

      $$ \displaystyle{ \hat{X}_{t+1} = \frac{t + 1 + Y _{1} + \cdots + Y _{t}} {t + 1}, } $$

      respectively.

  28. 9.28

    Let Y and X be two random variables such that Y given X is exponential with mean 1∕X, and X has the gamma density function

    $$ \displaystyle{ g(x;\lambda +1,\alpha ) = \frac{\alpha ^{\lambda +1}x^{\lambda }\exp \{ -\alpha x\}} {\Gamma (\lambda +1)},\quad x > 0, } $$

    where λ > −1 and α > 0.

    1. a.

      Determine the posterior distribution of X given Y.

    2. b.

      Show that Y has a Pareto distribution

      $$ \displaystyle{ p(y) = (\lambda +1)\alpha ^{\lambda +1}(y+\alpha )^{-\lambda -2},\quad y > 0. } $$
    3. c.

      Find the mean and variance of Y. Under what conditions on α and λ does the latter exist?

    4. d.

      Verify the calculation of \( p\left (y_{t+1}\vert \mathbf{y}^{(t)}\right ) \) and \( E\left (Y _{t+1}\vert \mathbf{y}^{(t)}\right ) \) for the model in Example 9.8.8.

  29. 9.29

    Consider an observation-driven model in which Y t given X t is binomial with parameters n and X t , i.e.,

    $$ \displaystyle{ p(y_{t}\vert x_{t}) ={ n\choose y_{t}}x_{t}^{y_{t} }(1 - x_{t})^{n-y_{t} },\quad y_{t} = 0,1,\ldots,n. } $$
    1. a.

    Show that the observation equation with state variable transformed by the logit transformation W t  = ln(X t ∕(1 − X t )) follows an exponential family

    $$ \displaystyle{ p(y_{t}\vert w_{t}) =\exp \{ y_{t}w_{t} - b(w_{t}) + c(y_{t})\}. } $$

    Determine the functions b(⋅ ) and c(⋅ ).

    2. b.

    Suppose that the state X t has the beta density

    $$ \displaystyle{ p(x_{t+1}\vert \mathbf{y}^{(t)}) = f(x_{ t+1};\alpha _{t+1\vert t},\lambda _{t+1\vert t}), } $$

    where

    $$ \displaystyle{ f(x;\alpha,\lambda ) = [B(\alpha,\lambda )]^{-1}x^{\alpha -1}(1 - x)^{\lambda -1},\quad 0 < x < 1, } $$

    B(α, λ): = Γ(α)Γ(λ)∕Γ(α +λ) is the beta function, and α, λ > 0. Show that the posterior distribution of X t given Y t is also beta and express its parameters in terms of y t and α t | t−1, λ t | t−1.

    3. c.

    Under the assumptions made in (b), show that \( E{\bigl (X_{t}\vert \mathbf{Y}^{(t)}\bigr )} = E{\bigl (X_{t+1}\vert \mathbf{Y}^{(t)}\bigr )} \) and Var\( {\bigl (X_{t}\vert \mathbf{Y}^{(t)}\bigr )} < \) Var\( {\bigl (X_{t+1}\vert \mathbf{Y}^{(t)}\bigr )} \).

    4. d.

    Assuming that the parameters in (b) satisfy (9.8.41)–(9.8.42), show that the one-step prediction density \( p{\bigl (y_{t+1}\vert \mathbf{y}^{(t)}\bigr )} \) is beta-binomial,

    $$ \displaystyle{ p(y_{t+1}\vert \mathbf{y}^{(t)}) = \frac{B(\alpha _{t+1\vert t} + y_{t+1},\lambda _{t+1\vert t} + n - y_{t+1})} {(n + 1)B(y_{t+1} + 1,n - y_{t+1} + 1)B(\alpha _{t+1\vert t},\lambda _{t+1\vert t})}, } $$

    and verify that \( \hat{Y }_{t+1} \) is given by (9.8.47).