Abstract
In recent years state-space representations and the associated Kalman recursions have had a profound impact on time series analysis and many related areas.
In recent years state-space representations and the associated Kalman recursions have had a profound impact on time series analysis and many related areas. The techniques were originally developed in connection with the control of linear systems (for accounts of this subject see Davis and Vinter 1985; Hannan and Deistler 1988). An extremely rich class of models for time series, including and going well beyond the linear ARIMA and classical decomposition models considered so far in this book, can be formulated as special cases of the general state-space model defined below in Section 9.1. In econometrics the structural time series models developed by Harvey (1990) are formulated (like the classical decomposition model) directly in terms of components of interest such as trend, seasonal component, and noise. However, the rigidity of the classical decomposition model is avoided by allowing the trend and seasonal components to evolve randomly rather than deterministically. An introduction to these structural models is given in Section 9.2, and a state-space representation is developed for a general ARIMA process in Section 9.3. The Kalman recursions, which play a key role in the analysis of state-space models, are derived in Section 9.4. These recursions allow a unified approach to prediction and estimation for all processes that can be given a state-space representation. Following the development of the Kalman recursions we discuss estimation with structural models (Section 9.5) and the formulation of state-space models to deal with missing values (Section 9.6). In Section 9.7 we introduce the EM algorithm, an iterative procedure for maximizing the likelihood when only a subset of the complete data set is available. The EM algorithm is particularly well suited for estimation problems in the state-space framework. Generalized state-space models are introduced in Section 9.8. 
These are Bayesian models that can be used to represent time series of many different types, as demonstrated by two applications to time series of count data. Throughout the chapter we shall use the notation \( \{\mathbf{W}_{t}\} \sim \mathrm{WN}(\mathbf{0},\{R_{t}\}) \)
to indicate that the random vectors W t have mean 0 and that \( E{\bigl (\mathbf{W}_{s}\mathbf{W}_{t}'\bigr )} = R_{t} \) if s = t and 0 otherwise.
9.1 State-Space Representations
A state-space model for a (possibly multivariate) time series {Y t , t = 1, 2, …} consists of two equations. The first, known as the observation equation, expresses the w-dimensional observation Y t as a linear function of a v-dimensional state variable X t plus noise. Thus
\( \mathbf{Y}_{t} = G_{t}\mathbf{X}_{t} + \mathbf{W}_{t},\quad t = 1,2,\ldots, \qquad (9.1.1) \)
where {W t } ∼ WN(0, {R t }) and {G t } is a sequence of w × v matrices. The second equation, called the state equation, determines the state X t+1 at time t + 1 in terms of the previous state X t and a noise term. The state equation is
\( \mathbf{X}_{t+1} = F_{t}\mathbf{X}_{t} + \mathbf{V}_{t},\quad t = 1,2,\ldots, \qquad (9.1.2) \)
where {F t } is a sequence of v × v matrices, {V t } ∼ WN(0, {Q t }), and {V t } is uncorrelated with {W t } (i.e., E(W t V s ′) = 0 for all s and t). To complete the specification, it is assumed that the initial state X 1 is uncorrelated with all of the noise terms {V t } and {W t }.
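Before turning to the remarks, it may help to see the two defining equations in executable form. The following is a minimal simulation sketch, not from the text: it assumes the numpy library and time-invariant matrices F, G, Q, R, with illustrative values (v = 2, w = 1) chosen only for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Time-invariant system matrices (illustrative values, not from the text)
F = np.array([[1.0, 1.0], [0.0, 1.0]])   # v x v state-transition matrix
G = np.array([[1.0, 0.0]])               # w x v observation matrix
Q = 0.1 * np.eye(2)                      # covariance of the state noise V_t
R = np.array([[1.0]])                    # covariance of the observation noise W_t

n = 200
X = np.zeros((n + 1, 2))                 # states X_1, ..., X_{n+1}
Y = np.zeros((n, 1))                     # observations Y_1, ..., Y_n
for t in range(n):
    W = rng.multivariate_normal(np.zeros(1), R)
    Y[t] = G @ X[t] + W                  # observation equation
    V = rng.multivariate_normal(np.zeros(2), Q)
    X[t + 1] = F @ X[t] + V              # state equation
```

The independence of the noise draws {V t } and {W t } in the loop mirrors the uncorrelatedness assumption E(W t V s ′) = 0 of the model.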
Remark 1.
A more general form of the state-space model allows for correlation between V t and W t (see Brockwell and Davis (1991), Chapter 12) and for the addition of a control term H t u t in the state equation. In control theory, H t u t represents the effect of applying a “control” u t at time t for the purpose of influencing X t+1. However, the system defined by (9.1.1) and (9.1.2) with \( E{\bigl (\mathbf{W}_{t}\mathbf{V}_{s}'\bigr )} = 0 \) for all s and t will be adequate for our purposes. □
Remark 2.
In many important special cases, the matrices F t , G t , Q t , and R t will be independent of t, in which case the subscripts will be suppressed. □
Remark 3.
It follows from the observation equation (9.1.1) and the state equation (9.1.2) that X t and Y t have the functional forms, for t = 2, 3, …,
and
Remark 4.
From Remark 3 and the assumptions on the noise terms, it is clear that
and
Definition 9.1.1
A time series {Y t } has a state-space representation if there exists a state-space model for {Y t } as specified by equations (9.1.1) and (9.1.2).
As already indicated, it is possible to find a state-space representation for a large number of time-series (and other) models. It is clear also from the definition that neither {X t } nor {Y t } is necessarily stationary. The beauty of a state-space representation, when one can be found, lies in the simple structure of the state equation (9.1.2), which permits relatively simple analysis of the process {X t }. The behavior of {Y t } is then easy to determine from that of {X t } using the observation equation (9.1.1). If the sequence {X 1, V 1, V 2, …} is independent, then {X t } has the Markov property; i.e., the distribution of X t+1 given X t , …, X 1 is the same as the distribution of X t+1 given X t . This is a property possessed by many physical systems, provided that we include sufficiently many components in the specification of the state X t (for example, we may choose the state vector in such a way that X t includes components of X t−1 for each t).
Example 9.1.1
An AR(1) Process
Let {Y t } be the causal AR(1) process given by
\( Y_{t} =\phi Y_{t-1} + Z_{t}, \)
where \( \{Z_{t}\} \sim \mathrm{WN}{\bigl (0,\sigma ^{2}\bigr )} \) and | ϕ | < 1.
In this case, a state-space representation for {Y t } is easy to construct. We can, for example, define a sequence of state variables X t by
\( X_{t+1} =\phi X_{t} + V_{t},\quad t = 1,2,\ldots, \)
where X 1 = Y 1 = ∑ j = 0 ∞ ϕ j Z 1−j and V t = Z t+1. The process {Y t } then satisfies the observation equation
\( Y_{t} = X_{t}, \)
which has the form (9.1.1) with G t = 1 and W t = 0.
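The equivalence is easy to check by simulation. A sketch (assuming numpy; the value φ = 0.7 is illustrative, and both recursions are started from zero rather than from the stationary initial condition X 1 = ∑ ϕ j Z 1−j):

```python
import numpy as np

phi, n = 0.7, 100
rng = np.random.default_rng(1)
Z = rng.normal(size=n + 1)               # Z_1, ..., Z_{n+1}

# Direct AR(1) recursion: Y_t = phi * Y_{t-1} + Z_t  (started from 0)
Y = np.zeros(n + 1)
for t in range(1, n + 1):
    Y[t] = phi * Y[t - 1] + Z[t]

# State-space version: X_{t+1} = phi X_t + V_t with V_t = Z_{t+1}
X = np.zeros(n + 1)
for t in range(1, n + 1):
    X[t] = phi * X[t - 1] + Z[t]

# Observation equation: Y_t = X_t, i.e. G_t = 1 and W_t = 0
assert np.allclose(X, Y)
```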
Example 9.1.2
An ARMA(1,1) Process
Let {Y t } be the causal and invertible ARMA(1,1) process satisfying the equations
\( Y_{t} -\phi Y_{t-1} = Z_{t} +\theta Z_{t-1}, \)
where \( \{Z_{t}\} \sim \mathrm{WN}{\bigl (0,\sigma ^{2}\bigr )} \), | ϕ | < 1, and | θ | < 1.
Although the existence of a state-space representation for {Y t } is not obvious, we can find one by observing that
\( Y_{t} = X_{t} +\theta X_{t-1}, \)
where {X t } is the causal AR(1) process satisfying
\( X_{t} -\phi X_{t-1} = Z_{t}, \)
or the equivalent equation
\( X_{t+1} =\phi X_{t} + Z_{t+1}. \)
Noting that X t = ∑ j = 0 ∞ ϕ j Z t−j , we see that equations (9.1.8) and (9.1.9) for t = 1, 2, … furnish a state-space representation of {Y t } with
The extension of this state-space representation to general ARMA and ARIMA processes is given in Section 9.3.
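The construction in Example 9.1.2 is easy to verify numerically: generate the AR(1) process {X t }, set Y t = X t + θX t−1, and check that {Y t } satisfies the ARMA(1,1) difference equations. A sketch assuming numpy (φ = 0.6, θ = 0.4, and the zero initial condition are illustrative):

```python
import numpy as np

phi, theta, n = 0.6, 0.4, 300
rng = np.random.default_rng(2)
Z = rng.normal(size=n + 1)

# Causal AR(1) component: X_t = phi X_{t-1} + Z_t (started from 0)
X = np.zeros(n + 1)
for t in range(1, n + 1):
    X[t] = phi * X[t - 1] + Z[t]

# Observation: Y_t = X_t + theta X_{t-1}
Y = X[1:] + theta * X[:-1]

# Check the ARMA(1,1) equations: Y_t - phi Y_{t-1} = Z_t + theta Z_{t-1}
lhs = Y[1:] - phi * Y[:-1]
rhs = Z[2:] + theta * Z[1:-1]
```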
In subsequent sections we shall give examples that illustrate the versatility of state-space models. (More examples can be found in Aoki 1987; Hannan and Deistler 1988; Harvey 1990; West and Harrison 1989.) Before considering these, we need a slight modification of (9.1.1) and (9.1.2), which allows for series in which the time index runs from −∞ to ∞. This is a more natural formulation for many time series models.
9.1.1 State-Space Models with t ∈ {0, ±1, …}
Consider the observation and state equations
where F and G are v × v and w × v matrices, respectively, {V t } ∼ WN(0, Q), \( \{\mathbf{W}_{t}\} \sim \mathrm{WN}(\mathbf{0},R) \), and E(V s W t ′) = 0 for all s and t.
The state equation (9.1.11) is said to be stable if the matrix F has all its eigenvalues in the interior of the unit circle, or equivalently if det(I − Fz) ≠ 0 for all z complex such that | z | ≤ 1. The matrix F is then also said to be stable.
In the stable case equation (9.1.11) has the unique stationary solution (Problem 9.1) given by
The corresponding sequence of observations
is also stationary.
9.2 The Basic Structural Model
A structural time series model, like the classical decomposition model defined by (1.5.1), is specified in terms of components such as trend, seasonality, and noise, which are of direct interest in themselves. The deterministic nature of the trend and seasonal components in the classical decomposition model, however, limits its applicability. A natural way in which to overcome this deficiency is to permit random variation in these components. This can be very conveniently done in the framework of a state-space representation, and the resulting rather flexible model is called a structural model. Estimation and forecasting with this model can be encompassed in the general procedure for state-space models made possible by the Kalman recursions of Section 9.4.
Example 9.2.1
The Random Walk Plus Noise Model
One of the simplest structural models is obtained by adding noise to a random walk. It is suggested by the nonseasonal classical decomposition model
\( Y_{t} = M_{t} + W_{t},\quad \{W_{t}\} \sim \mathrm{WN}{\bigl (0,\sigma _{w}^{2}\bigr )}, \qquad (9.2.1) \)
and M t = m t , the deterministic “level” or “signal” at time t. We now introduce randomness into the level by supposing that M t is a random walk satisfying
\( M_{t+1} = M_{t} + V_{t},\quad \{V_{t}\} \sim \mathrm{WN}{\bigl (0,\sigma _{v}^{2}\bigr )}, \qquad (9.2.2) \)
with initial value M 1 = m 1. Equations (9.2.1) and (9.2.2) constitute the “local level” or “random walk plus noise” model. Figure 9.1 shows a realization of length 100 of this model with M 1 = 0, σ v 2 = 4, and σ w 2 = 8. (The realized values m t of M t are plotted as a solid line, and the observed data are plotted as square boxes.) The differenced data
constitute a stationary time series with mean 0 and ACF
Since {D t } is 1-correlated, we conclude from Proposition 2.1.1 that {D t } is an MA(1) process and hence that {Y t } is an ARIMA(0,1,1) process. More specifically,
where θ and σ 2 are found by solving the equations
For the process {Y t } generating the data in Figure 9.1, the parameters θ and σ 2 of the differenced series {D t } satisfy θ∕(1 + θ 2) = −0.4 and θ σ 2 = −8. Solving these equations for θ and σ 2, we find that θ = −0.5 and σ 2 = 16 (or θ = −2 and σ 2 = 4). The sample ACF of the observed differences D t of the realization of {Y t } in Figure 9.1 is shown in Figure 9.2.
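The quoted values follow from the MA(1) moment equations, as the sketch below (assuming numpy) reproduces. The quadratic ρθ² − θ + ρ = 0 comes from rearranging ρ = θ∕(1 + θ²), and the root with |θ| < 1 is the invertible one.

```python
import numpy as np

sigv2, sigw2 = 4.0, 8.0                  # variances used for Figure 9.1

# ACVF of the differenced series D_t = V_{t-1} + W_t - W_{t-1}
gamma0 = sigv2 + 2 * sigw2               # variance of D_t
gamma1 = -sigw2                          # lag-one autocovariance
rho = gamma1 / gamma0                    # lag-one autocorrelation

# Solve rho = theta / (1 + theta^2), i.e. rho*theta^2 - theta + rho = 0,
# keeping the invertible root |theta| < 1
theta = (1 - np.sqrt(1 - 4 * rho**2)) / (2 * rho)
sigma2 = gamma1 / theta                  # since theta * sigma^2 = gamma(1)
```

With σ v 2 = 4 and σ w 2 = 8 this yields ρ = −0.4, θ = −0.5, and σ 2 = 16, matching the text.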
The local level model is often used to represent a measured characteristic of the output of an industrial process for which the unobserved process level {M t } is intended to be within specified limits (to meet the design specifications of the manufactured product). To decide whether or not the process requires corrective attention, it is important to be able to test the hypothesis that the process level {M t } is constant. From the state equation, we see that {M t } is constant (and equal to m 1) when V t = 0 or equivalently when σ v 2 = 0. This in turn is equivalent to the moving-average model (9.2.3) for {D t } being noninvertible with θ = −1 (see Problem 8.2). Tests of the unit root hypothesis θ = −1 were discussed in Section 6.3.2.
The local level model can easily be extended to incorporate a locally linear trend with slope β t at time t. Equation (9.2.2) is replaced by
where B t−1 = β t−1. Now if we introduce randomness into the slope by replacing it with the random walk
we obtain the “local linear trend” model.
To express the local linear trend model in state-space form we introduce the state vector
Then (9.2.4) and (9.2.5) can be written in the equivalent form
where V t = (V t , U t )′. The process {Y t } is then determined by the observation equation
If {X 1, U 1, V 1, W 1, U 2, V 2, W 2, …} is an uncorrelated sequence, then equations (9.2.6) and (9.2.7) constitute a state-space representation of the process {Y t }, which is a model for data with randomly varying trend and added noise. For this model we have v = 2, w = 1,
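The matrices F and G of this representation can be checked by switching the noise off, in which case the local linear trend equations must reproduce a deterministic straight line. A sketch assuming numpy (the level 1 and slope 0.5 are illustrative values, not from the text):

```python
import numpy as np

# Local linear trend in state-space form: state X_t = (M_t, B_t)'
F = np.array([[1.0, 1.0],
              [0.0, 1.0]])               # state-transition matrix
G = np.array([[1.0, 0.0]])               # observation matrix

# With the noise terms set to zero the model reduces to a deterministic
# straight line, which checks the algebra of the representation
n = 50
X = np.zeros((n, 2))
X[0] = [1.0, 0.5]                        # level m_1 = 1, slope beta = 0.5
for t in range(1, n):
    X[t] = F @ X[t - 1]                  # V_t = 0
Y = (G @ X.T).ravel()                    # W_t = 0
```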
Example 9.2.2
A Seasonal Series with Noise
The classical decomposition (1.5.11) expressed the time series {X t } as a sum of trend, seasonal, and noise components. The seasonal component (with period d ) was a sequence {s t } with the properties s t+d = s t and ∑ t = 1 d s t = 0. Such a sequence can be generated, for any values of s 1, s 0, …, s −d+3, by means of the recursions
A somewhat more general seasonal component {Y t }, allowing for random deviations from strict periodicity, is obtained by adding a term S t to the right side of (9.2.8), where {S t } is white noise with mean zero. This leads to the recursion relations
To find a state-space representation for {Y t } we introduce the (d − 1)-dimensional state vector
The series {Y t } is then given by the observation equation
where {X t } satisfies the state equation
V t = (S t , 0, …, 0)′, and
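The state equation here uses a (d − 1) × (d − 1) coefficient matrix with −1's in its first row and a shifted identity below. The following sketch (assuming numpy; d = 4 and the starting values are illustrative) builds that matrix and checks that, with S t = 0, the recursion produces a sequence of period d summing to zero over each cycle:

```python
import numpy as np

d = 4                                    # seasonal period (illustrative)
# (d-1) x (d-1) coefficient matrix of the seasonal state equation:
# first row of -1's, shifted identity underneath
F = np.zeros((d - 1, d - 1))
F[0, :] = -1.0
F[1:, :-1] = np.eye(d - 2)

# With S_t = 0 the recursion generates a strictly periodic sequence
X = np.array([2.0, -1.0, 0.5])           # state (Y_1, Y_0, Y_{-1})'
seq = [X[0]]
for _ in range(2 * d):
    X = F @ X                            # Y_{t+1} = -Y_t - ... - Y_{t-d+2}
    seq.append(X[0])
seq = np.array(seq)
```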
Example 9.2.3
A Randomly Varying Trend with Random Seasonality and Noise
A series with randomly varying trend, random seasonality and noise can be constructed by adding the two series in Examples 9.2.1 and 9.2.2. (Addition of series with state-space representations is in fact always possible by means of the following construction. See Problem 9.9.) We introduce the state vector
where X t 1 and X t 2 are the state vectors in (9.2.6) and (9.2.11). We then have the following representation for {Y t }, the sum of the two series whose state-space representations were given in (9.2.6)–(9.2.7) and (9.2.10)–(9.2.11). The state equation is
where F 1, F 2 are the coefficient matrices and {V t 1}, {V t 2} are the noise vectors in the state equations (9.2.6) and (9.2.11), respectively. The observation equation is
where {W t } is the noise sequence in (9.2.7). If the sequence of random vectors {X 1, V 1 1, V 1 2, W 1, V 2 1, V 2 2, W 2, …} is uncorrelated, then equations (9.2.13) and (9.2.14) constitute a state-space representation for {Y t }.
9.3 State-Space Representation of ARIMA Models
We begin by establishing a state-space representation for the causal AR(p) process and then build on this example to find representations for the general ARMA and ARIMA processes.
Example 9.3.1
State-Space Representation of a Causal AR(p) Process
Consider the AR(p) process defined by
\( Y_{t} =\phi _{1}Y_{t-1} +\phi _{2}Y_{t-2} + \cdots +\phi _{p}Y_{t-p} + Z_{t},\quad t = 0,\pm 1,\ldots, \)
where \( \{Z_{t}\} \sim \mathrm{WN}{\bigl (0,\sigma ^{2}\bigr )} \), and ϕ(z): = 1 −ϕ 1 z −⋯ −ϕ p z p is nonzero for | z | ≤ 1. To express {Y t } in state-space form we simply introduce the state vectors
\( \mathbf{X}_{t} = (Y_{t-p+1},Y_{t-p+2},\ldots,Y_{t})',\quad t = 0,\pm 1,\ldots. \)
From (9.3.1) and (9.3.2) the observation equation is
while the state equation is given by
These equations have the required forms (9.1.10) and (9.1.11) with W t = 0 and V t = (0, 0, …, Z t+1)′, t = 0, ±1, ….
Remark 1.
In Example 9.3.1 the causality condition ϕ(z) ≠ 0 for | z | ≤ 1 is equivalent to the condition that the state equation (9.3.4) is stable, since the eigenvalues of the coefficient matrix in (9.3.4) are simply the reciprocals of the zeros of ϕ(z) (Problem 9.3). □
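Remark 1 can be checked numerically for a particular AR(2) polynomial. In this sketch (assuming numpy; φ(z) = (1 − 0.25z)(1 − 0.5z) is an illustrative causal choice), the eigenvalues of the companion matrix of the state equation equal the reciprocals of the zeros of φ(z) and lie inside the unit circle:

```python
import numpy as np

phi = np.array([0.75, -0.125])           # AR(2) coefficients phi_1, phi_2

# Companion (coefficient) matrix of the state equation: shifted identity
# on the superdiagonal, (phi_p, ..., phi_1) in the last row
p = len(phi)
F = np.zeros((p, p))
F[:-1, 1:] = np.eye(p - 1)
F[-1, :] = phi[::-1]

eig = np.linalg.eigvals(F)

# Zeros of phi(z) = 1 - phi_1 z - phi_2 z^2  (coefficients high to low)
zeros = np.roots([-phi[1], -phi[0], 1.0])
```

Here the zeros of φ(z) are 2 and 4, the eigenvalues of F are 0.5 and 0.25, and the state equation is stable.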
Remark 2.
If equations (9.3.3) and (9.3.4) are postulated to hold only for t = 1, 2, …, and if X 1 is a random vector such that {X 1, Z 1, Z 2, …} is an uncorrelated sequence, then we have a state-space representation for {Y t } of the type defined earlier by (9.1.1) and (9.1.2). The resulting process {Y t } is well-defined, regardless of whether or not the state equation is stable, but it will not in general be stationary. It will be stationary if the state equation is stable and if X 1 is defined by (9.3.2) with Y t = ∑ j = 0 ∞ ψ j Z t−j , t = 1, 0, …, 2 − p, and ψ(z) = 1∕ϕ(z), | z | ≤ 1. □
Example 9.3.2
State-Space Form of a Causal ARMA(p, q) Process
State-space representations are not unique. Here we shall give one of the (infinitely many) possible representations of a causal ARMA(p,q) process that can easily be derived from Example 9.3.1. Consider the ARMA(p,q) process defined by
where \( \{Z_{t}\} \sim \mathrm{WN}{\bigl (0,\sigma ^{2}\bigr )} \) and \( \phi (z)\neq 0 \) for | z | ≤ 1. Let
If {U t } is the causal AR( p) process satisfying
then Y t = θ(B)U t , since
Consequently,
where
But from Example 9.3.1 we can write
Equations (9.3.7) and (9.3.9) are the required observation and state equations. As in Example 9.3.1, the observation and state noise vectors are again W t = 0 and V t = (0, 0, …, Z t+1)′, t = 0, ±1, ….
Example 9.3.3
State-Space Representation of an ARIMA(p, d, q) Process
If \( \big\{Y _{t}\big\} \) is an ARIMA(p, d, q) process with {∇d Y t } satisfying (9.3.5), then by the preceding example \( \big\{\nabla ^{d}Y _{t}\big\} \) has the representation
where {X t } is the unique stationary solution of the state equation
F and G are the coefficients of X t in (9.3.9) and (9.3.7), respectively, and V t = (0, 0, …, Z t+1)′. Let A and B be the \( d \times 1 \) and d × d matrices defined by A = B = 1 if d = 1 and
if d > 1. Then since
the vector
satisfies the equation
Defining a new state vector T t by stacking X t and Y t−1, we therefore obtain the state equation
and the observation equation, from (9.3.10) and (9.3.11),
with initial condition
and the assumption
where Y 0 = (Y 1−d , Y 2−d , …, Y 0)′. The conditions (9.3.15), which are satisfied in particular if Y 0 is considered to be nonrandom and equal to the vector of observed values (y 1−d , y 2−d , …, y 0)′, are imposed to ensure that the assumptions of a state-space model given in Section 9.1 are satisfied. They also imply that \( E\left (\mathbf{X}_{1}\mathbf{Y}_{0}'\right ) = 0 \) and E(Y 0∇d Y t ′) = 0, t ≥ 1, as required earlier in Section 6.4 for prediction of ARIMA processes.
State-space models for more general ARIMA processes (e.g., {Y t } such that {∇∇12 Y t } is an ARMA(p, q) process) can be constructed in the same way. See Problem 9.4.
For the ARIMA(1, 1, 1) process defined by
the vectors X t and Y t−1 reduce to X t = (X t−1, X t )′ and Y t−1 = Y t−1. From (9.3.12) and (9.3.13) the state-space representation is therefore (Problem 9.8)
where
and
9.4 The Kalman Recursions
In this section we shall consider three fundamental problems associated with the state-space model defined by (9.1.1) and (9.1.2) in Section 9.1. These are all concerned with finding best (in the sense of minimum mean square error) linear estimates of the state-vector X t in terms of the observations Y 1, Y 2, …, and a random vector Y 0 that is orthogonal to V t and W t for all t ≥ 1. In many cases Y 0 will be the constant vector (1, 1, …, 1)′. Estimation of X t in terms of:
- (a) Y 0, …, Y t−1 defines the prediction problem,
- (b) Y 0, …, Y t defines the filtering problem,
- (c) Y 0, …, Y n (n > t) defines the smoothing problem.
Each of these problems can be solved recursively using an appropriate set of Kalman recursions, which will be established in this section.
In the following definition of best linear predictor (and throughout this chapter) it should be noted that we do not automatically include the constant 1 among the predictor variables as we did in Sections 2.5 and 8.5. (It can, however, be included by choosing Y 0 = (1, 1, …, 1)′.)
Definition 9.4.1
For the random vector X = (X 1, …, X v )′,
where P t (X i ): = P(X i | Y 0, Y 1, …, Y t ), is the best linear predictor of X i in terms of all components of Y 0, Y 1, …, Y t .
Remark 1.
By the definition of the best predictor of each component X i of X, P t (X) is the unique random vector of the form
with v × w matrices A 0, …, A t such that
[cf. (8.5.2) and (8.5.3)]. Recall that two random vectors X and Y are orthogonal (written X ⊥ Y) if E(XY′) is a matrix of zeros. □
Remark 2.
If all the components of \( \mathbf{X},\mathbf{Y}_{1},\ldots,\mathbf{Y}_{t} \) are jointly normally distributed and Y 0 = (1, …, 1)′, then
Remark 3.
P t is linear in the sense that if A is any k × v matrix and X, V are two v-variate random vectors with finite second moments, then (Problem 9.10)
and
□
Remark 4.
If X and Y are random vectors with v and w components, respectively, each with finite second moments, then
where M is a v × w matrix, M = E(XY′)[E(YY′)]−1 with [E(YY′)]−1 any generalized inverse of E(YY′). (A generalized inverse of a matrix S is a matrix S −1 such that SS −1 S = S. Every matrix has at least one. See Problem 9.11.)
In the notation just developed, the prediction, filtering, and smoothing problems (a), (b), and (c) formulated above reduce to the determination of P t−1(X t ), P t (X t ), and P n (X t ) (n > t), respectively. We deal first with the prediction problem. □
Kalman Prediction:
For the state-space model (9.1.1)–(9.1.2), the one-step predictors \( \hat{\mathbf{X}_{t}}:= P_{t-1}(\mathbf{X}_{t}) \) and their error covariance matrices \( \Omega _{t} = E{\bigl [{\bigl (\mathbf{X}_{t} -\hat{\mathbf{X}}_{t}\bigr )}{\bigl (\mathbf{X}_{t} -\hat{\mathbf{X}}_{t}\bigr )}'\bigr ]} \) are uniquely determined by the initial conditions
and the recursions, for t = 1, …,
where
and Δ t −1 is any generalized inverse of Δ t .
Proof.
We shall make use of the innovations I t defined by I 0 = Y 0 and
The sequence {I t } is orthogonal by Remark 1. Using Remarks 3 and 4 and the relation
(see Problem 9.12), we find that
where
To verify (9.4.2), we observe from the definition of Ω t+1 that
With (9.1.2) and (9.4.4) this gives
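A compact implementation of the one-step prediction recursions, in the standard time-invariant form Δ t = GΩ t G′ + R and Θ t = FΩ t G′, is sketched below (assuming numpy; the generalized inverse is taken to be the Moore–Penrose pseudoinverse). Run on the local level model of Example 9.4.1 with σ v 2 = 4 and σ w 2 = 8, the error variance Ω t settles at its steady-state value 8 regardless of Ω 1.

```python
import numpy as np

def kalman_predict_step(xhat, Omega, y, F, G, Q, R):
    """One step of the Kalman prediction recursions (time-invariant case)."""
    Delta = G @ Omega @ G.T + R                  # innovation covariance
    Theta = F @ Omega @ G.T
    K = Theta @ np.linalg.pinv(Delta)            # generalized inverse of Delta
    xhat_new = F @ xhat + K @ (y - G @ xhat)     # X-hat_{t+1}
    Omega_new = F @ Omega @ F.T + Q - K @ Theta.T
    return xhat_new, Omega_new

# Local level model: F = G = 1, Q = sigma_v^2 = 4, R = sigma_w^2 = 8
F = G = np.array([[1.0]])
Q, R = np.array([[4.0]]), np.array([[8.0]])
rng = np.random.default_rng(3)
xhat, Omega = np.array([0.0]), np.array([[10.0]])   # arbitrary Omega_1
x = 0.0
for _ in range(200):
    y = np.array([x + rng.normal(scale=np.sqrt(8.0))])  # observation
    xhat, Omega = kalman_predict_step(xhat, Omega, y, F, G, Q, R)
    x = x + rng.normal(scale=2.0)                       # random-walk state
```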
9.4.1 h-Step Prediction of {Y t } Using the Kalman Recursions
The Kalman prediction equations lead to a very simple algorithm for recursive calculation of the best linear mean square predictors P t Y t+h , h = 1, 2, … . From (9.4.4), (9.1.1), (9.1.2), and Remark 3 in Section 9.1, we find that
and
From the relation
we find that \( \Omega _{t}^{(h)}:= E[(\mathbf{X}_{t+h} - P_{t}\mathbf{X}_{t+h})(\mathbf{X}_{t+h} - P_{t}\mathbf{X}_{t+h})'] \) satisfies the recursions
with Ω t (1) = Ω t+1. Then from (9.1.1) and (9.4.7), Δ t (h): = E[(Y t+h − P t Y t+h )(Y t+h − P t Y t+h )′] is given by
Example 9.4.1.
Consider the random walk plus noise model of Example 9.2.1 defined by
where the local level X t follows the random walk
Applying the Kalman prediction equations with Y 0: = 1, R = σ w 2, and Q = σ v 2, we obtain
where
For a state-space model (like this one) with time-independent parameters, the solution of the Kalman recursions (9.4.2) is called a steady-state solution if Ω t is independent of t. If Ω t = Ω for all t, then from (9.4.2)
\( \Omega = \Omega +\sigma _{v}^{2} - \Omega ^{2}/{\bigl (\Omega +\sigma _{w}^{2}\bigr )}, \)
i.e., \( \Omega ^{2} -\sigma _{v}^{2}\Omega -\sigma _{v}^{2}\sigma _{w}^{2} = 0. \)
Solving this quadratic equation for Ω and noting that Ω ≥ 0, we find that
\( \Omega = {\Bigl (\sigma _{v}^{2} +\sqrt{\sigma _{v }^{4 } + 4\sigma _{v }^{2 }\sigma _{w }^{2}}\Bigr )}\big/2. \)
Since Ω t+1 − Ω t is a continuous function of Ω t on Ω t ≥ 0, positive at Ω t = 0, negative for large Ω t , and zero only at Ω t = Ω, it is clear that Ω t+1 − Ω t is negative for Ω t > Ω and positive for Ω t < Ω. A similar argument shows (Problem 9.14) that (Ω t+1 − Ω)(Ω t − Ω) ≥ 0 for all \( \Omega _{t} \geq 0 \). These observations imply that Ω t+1 always falls between Ω and Ω t . Consequently, regardless of the value of Ω 1, Ω t converges to Ω, the unique solution of Ω t+1 = Ω t . For any initial predictors \( \hat{Y }_{1} =\hat{ X}_{1} \) and any initial mean squared error \( \Omega _{1} = E{\bigl (X_{1} -\hat{ X}_{1}\bigr )}^{2} \), the coefficients \( a_{t}:= \Omega _{t}/\left (\Omega _{t} +\sigma _{ w}^{2}\right ) \) converge to
\( a = \Omega /{\bigl (\Omega +\sigma _{w}^{2}\bigr )}, \)
and the mean squared errors of the predictors defined by
\( \hat{Y }_{t+1} = a_{t}Y _{t} + (1 - a_{t})\hat{Y }_{t} \)
converge to Ω +σ w 2.
If, as is often the case, we do not know Ω 1, then we cannot determine the sequence {a t }. It is natural, therefore, to consider the behavior of the predictors defined by
\( \hat{Y }_{t+1} = aY _{t} + (1 - a)\hat{Y }_{t}, \)
with a as above and arbitrary \( \hat{Y }_{1} \). It can be shown (Problem 9.16) that this sequence of predictors is also asymptotically optimal in the sense that the mean squared error converges to Ω +σ w 2 as t → ∞.
As shown in Example 9.2.1, the differenced process D t = Y t − Y t−1 is the MA(1) process
where \( \theta /\left (1 +\theta ^{2}\right ) = -\sigma _{w}^{2}/\left (2\sigma _{w}^{2} +\sigma _{ v}^{2}\right ) \). Solving this equation for θ (Problem 9.15), we find that
and that θ = a − 1.
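The relations between Ω, a, and θ for the model of Figure 9.1 can be verified directly. In this sketch (assuming numpy), Ω solves the steady-state quadratic Ω² − σ v 2 Ω − σ v 2 σ w 2 = 0, and iterating the variance recursion from an arbitrary Ω 1 reaches the same fixed point:

```python
import numpy as np

sigv2, sigw2 = 4.0, 8.0                  # model of Example 9.2.1 / Figure 9.1

# Nonnegative root of Omega^2 - sigv2*Omega - sigv2*sigw2 = 0
Omega = (sigv2 + np.sqrt(sigv2**2 + 4 * sigv2 * sigw2)) / 2

a = Omega / (Omega + sigw2)              # steady-state coefficient
theta = a - 1                            # MA(1) coefficient of the differences

# Iterating the variance recursion reaches the same fixed point
Om = 100.0                               # arbitrary starting value Omega_1
for _ in range(100):
    Om = Om + sigv2 - Om**2 / (Om + sigw2)
```

With σ v 2 = 4 and σ w 2 = 8 this gives Ω = 8, a = 0.5, and θ = −0.5, agreeing with the invertible MA(1) coefficient found in Example 9.2.1.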
It is instructive to derive the exponential smoothing formula for \( \hat{Y }_{t} \) directly from the ARIMA(0,1,1) structure of {Y t }. For t ≥ 2, we have from Section 6.5 that
for t ≥ 2, where θ t1 is found by application of the innovations algorithm to an MA(1) process with coefficient θ. It follows that 1 − a t = −θ t1, and since θ t1 → θ (see Remark 1 of Section 3.3) and a t converges to the steady-state solution a, we conclude that
Example 9.4.2.
The lognormal stochastic volatility model
We can rewrite the defining equations (7.4.2) and (7.4.3) of the lognormal SV process {Z t } in the following state-space form
and
where the (one-dimensional) state and observation vectors are
and
respectively. The independent white-noise sequences {η t } and {ɛ t } have zero means and variances σ 2 and 4.93 respectively.
Taking
and
we can directly apply the Kalman prediction recursions (9.4.1), (9.4.2), (9.4.6), and (9.4.8) to compute recursively the best linear predictor of X t+h in terms of {Y s , s ≤ t}, or equivalently of the log volatility ℓ t+h in terms of the observations {lnZ s 2, s ≤ t}.
Kalman Filtering:
The filtered estimates X t | t = P t (X t ) and their error covariance matrices Ω t | t = E[(X t −X t | t )(X t −X t | t )′] are determined by the relations
and
Proof.
From (9.4.3) it follows that
where
To establish (9.4.17) we write
Using (9.4.18) and the orthogonality of X t − P t X t and M I t , we find from the last equation that
as required. ■
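A single filtering update can be written as a correction of the one-step predictor. The sketch below uses the standard form X t | t = X̂ t + Ω t G′Δ t −1 (Y t − GX̂ t ) and Ω t | t = Ω t − Ω t G′Δ t −1 GΩ t (assuming numpy; the scalar local level values are illustrative):

```python
import numpy as np

def kalman_filter_step(xhat, Omega, y, G, R):
    """Filtered state X_{t|t} and covariance Omega_{t|t} from the
    one-step predictor (xhat, Omega) and the new observation y."""
    Delta = G @ Omega @ G.T + R                  # innovation covariance
    gain = Omega @ G.T @ np.linalg.pinv(Delta)
    x_filt = xhat + gain @ (y - G @ xhat)        # X_{t|t}
    Om_filt = Omega - gain @ G @ Omega           # Omega_{t|t}
    return x_filt, Om_filt

# Local level example: predictor 0 with Omega_t = 8, observation y = 4
G, R = np.array([[1.0]]), np.array([[8.0]])
x_filt, Om_filt = kalman_filter_step(np.array([0.0]), np.array([[8.0]]),
                                     np.array([4.0]), G, R)
```

Here Δ t = 16, the gain is 0.5, and filtering halves the error variance from 8 to 4, illustrating that Ω t | t ≤ Ω t .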
Kalman Fixed-Point Smoothing:
The smoothed estimates X t | n = P n X t and the error covariance matrices Ω t | n = E[(X t −X t | n )(X t −X t | n )′] are determined for fixed t by the following recursions, which can be solved successively for n = t, t + 1, …:
with initial conditions \( P_{t-1}\mathbf{X}_{t} =\hat{ \mathbf{X}}_{t} \) and Ω t, t = Ω t | t−1 = Ω t (found from Kalman prediction).
Proof.
Using (9.4.3) we can write P n X t = P n−1 X t + C I n , where \( \mathbf{I}_{n} = G_{n}{\bigl (\mathbf{X}_{n} -\hat{\mathbf{X}}_{n}\bigr )} + \mathbf{W}_{n} \). By Remark 4 above,
where \( \Omega _{t,n}:= E\big[{\bigl (\mathbf{X}_{t} -\hat{\mathbf{X}}_{t}\bigr )}{\bigl (\mathbf{X}_{n} -\hat{\mathbf{X}}_{n}\bigr )}'\big] \). It follows now from (9.1.2), (9.4.5), the orthogonality of V n and W n with \( \mathbf{X}_{t} -\hat{\mathbf{X}}_{t} \), and the definition of Ω t, n that
thus establishing (9.4.20). To establish (9.4.21) we write
Using (9.4.22) and the orthogonality of X t − P n X t and I n , the last equation then gives
as required. ■
9.5 Estimation for State-Space Models
Consider the state-space model defined by equations (9.1.1) and (9.1.2) and suppose that the model is completely parameterized by the components of the vector \( \theta \). The maximum likelihood estimate of \( \theta \) is found by maximizing the likelihood of the observations Y 1, …, Y n with respect to the components of the vector \( \theta \). If the conditional probability density of Y t given Y t−1 = y t−1, …, Y 0 = y 0 is f t (⋅ | y t−1, …, y 0), then the likelihood of Y t , t = 1, …, n (conditional on Y 0), can immediately be written as
The calculation of the likelihood for any fixed numerical value of \( \theta \) is extremely complicated in general, but is greatly simplified if Y 0, X 1 and W t , V t , t = 1, 2, …, are assumed to be jointly Gaussian. The resulting likelihood is called the Gaussian likelihood and is widely used in time series analysis (cf. Section 5.2) whether the time series is truly Gaussian or not. As before, we shall continue to use the term likelihood to mean Gaussian likelihood.
If Y 0, X 1 and W t , V t , t = 1, 2, …, are jointly Gaussian, then the conditional densities in (9.5.1) are given by
where \( \mathbf{I}_{t}\,=\,\mathbf{Y}_{t} - P_{t-1}\mathbf{Y}_{t}\,=\,\mathbf{Y}_{t} - G\hat{\mathbf{X}_{t}} \), P t−1 Y t , and Δ t , t ≥ 1, are the one-step predictors and error covariance matrices found from the Kalman prediction recursions. The likelihood of the observations Y 1, …, Y n (conditional on Y 0) can therefore be expressed as
Given the observations Y 1, …, Y n , the distribution of Y 0 (see Section 9.4), and a particular parameter value \( \theta \), the numerical value of the likelihood L can be computed from the previous equation with the aid of the Kalman recursions of Section 9.4. To find maximum likelihood estimates of the components of \( \theta \), a nonlinear optimization algorithm must be used to search for the value of \( \theta \) that maximizes the value of L.
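For a univariate model the numerical search described above only needs −2 lnL as a function of the parameters. A sketch for the local level model (assuming numpy; the accumulated quantity n ln2π + ∑ t [lnΔ t + I t 2∕Δ t ] is the scalar case of the Gaussian likelihood (9.5.2)):

```python
import numpy as np

def neg2_loglik(y, sigv2, sigw2, xhat1=0.0, Omega1=1.0):
    """-2 ln(Gaussian likelihood) of the local level model, accumulated
    from the Kalman innovations I_t and their variances Delta_t."""
    n = len(y)
    xhat, Omega = xhat1, Omega1
    val = n * np.log(2 * np.pi)
    for t in range(n):
        Delta = Omega + sigw2                # innovation variance
        innov = y[t] - xhat                  # innovation I_t
        val += np.log(Delta) + innov**2 / Delta
        a = Omega / Delta
        xhat = xhat + a * innov              # prediction recursion
        Omega = Omega + sigv2 - Omega**2 / Delta
    return val

# Simulated data from the model with sigma_v^2 = 4, sigma_w^2 = 8
rng = np.random.default_rng(4)
x = np.cumsum(rng.normal(scale=2.0, size=100))        # random-walk level
y = x + rng.normal(scale=np.sqrt(8.0), size=100)      # noisy observations
L_true = neg2_loglik(y, 4.0, 8.0)
L_bad = neg2_loglik(y, 0.01, 0.01)
```

An optimizer would minimize this function over (σ v 2, σ w 2); as expected, the generating parameters give a much smaller value of −2 lnL than grossly misspecified ones.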
Having estimated the parameter vector \( \theta \), we can compute forecasts based on the fitted state-space model and estimated mean squared errors by direct application of equations (9.4.7) and (9.4.9).
9.5.1 Application to Structural Models
The general structural model for a univariate time series {Y t } of which we gave examples in Section 9.2 has the form
for t = 1, 2, …, where F and G are assumed known. We set Y 0 = 1 in order to include constant terms in our predictors and complete the specification of the model by prescribing the mean and covariance matrix of the initial state X 1. A simple and convenient assumption is that X 1 is equal to a deterministic but unknown parameter \( \boldsymbol{\mu } \) and that \( \hat{\mathbf{X}}_{1} =\boldsymbol{\mu } \), so that Ω 1 = 0. The parameters of the model are then \( \boldsymbol{\mu } \), Q, and σ w 2.
Direct maximization of the likelihood (9.5.2) is difficult if the dimension of the state vector is large. The maximization can, however, be simplified by the following stepwise procedure. For fixed Q we find \( \hat{\boldsymbol{\mu }}(Q) \) and \( \hat{\sigma }_{w}^{2}(Q) \) that maximize the likelihood \( L\left (\boldsymbol{\mu },Q,\sigma _{w}^{2}\right ) \). We then maximize the “reduced likelihood” \( L\left (\hat{\boldsymbol{\mu }}(Q),Q,\hat{\sigma }_{w}^{2}(Q)\right ) \) with respect to Q.
To achieve this we define the mean-corrected state vectors, \( \mathbf{X}_{t}^{{\ast}} = \mathbf{X}_{t} - F^{t-1}\boldsymbol{\mu } \), and apply the Kalman prediction recursions to {X t ∗} with initial condition \( \mathbf{X}_{1}^{{\ast}} = \mathbf{0} \). This gives, from (9.4.1),
with \( \hat{\mathbf{X}}_{1}^{{\ast}} = \mathbf{0} \). Since \( \hat{\mathbf{X}}_{t} \) also satisfies (9.5.5), but with initial condition \( \hat{\mathbf{X}}_{1} =\boldsymbol{\mu } \), it follows that
for some v × v matrices C t . (Note that although \( \hat{\mathbf{X}}_{t} = P(\mathbf{X}_{t}\vert Y _{0},Y _{1},\ldots,Y _{t}) \), the quantity \( \hat{\mathbf{X}}_{t}^{{\ast}} \) is not the corresponding predictor of X t ∗.) The matrices C t can be determined recursively from (9.5.5), (9.5.6), and (9.4.1). Substituting (9.5.6) into (9.5.5) and using (9.4.1), we have
so that
with C 1 equal to the identity matrix. The quadratic form in the likelihood (9.5.2) is therefore
Now let Q ∗: = σ w −2 Q and define L ∗ to be the likelihood function with this new parameterization, i.e., \( L^{{\ast}}\left (\boldsymbol{\mu },Q^{{\ast}},\sigma _{w}^{2}\right ) = L\left (\boldsymbol{\mu },\sigma _{w}^{2}Q^{{\ast}},\sigma _{w}^{2}\right ) \). Writing Δ t ∗ = σ w −2 Δ t and Ω t ∗ = σ w −2 Ω t , we see that the predictors \( \hat{X}_{t}^{{\ast}} \) and the matrices C t in (9.5.7) depend on the parameters only through Q ∗. Thus,
so that
For Q ∗ fixed, it is easy to show (see Problem 9.18) that this function is minimized when
and
Replacing \( \boldsymbol{\mu } \) and σ w 2 by these values in − 2lnL ∗ and ignoring constants, the reduced likelihood becomes
If \( \hat{Q}^{{\ast}} \) denotes the minimizer of (9.5.12), then the maximum likelihood estimator of the parameters \( \boldsymbol{\mu },Q,\sigma _{w}^{2} \) are \( \hat{\boldsymbol{\mu }},\hat{\sigma }_{w}^{2}\hat{Q}^{{\ast}},\hat{\sigma }_{w}^{2} \), where \( \hat{\boldsymbol{\mu }} \) and \( \hat{\sigma }_{w}^{2} \) are computed from (9.5.10) and (9.5.11) with Q ∗ replaced by \( \hat{Q}^{{\ast}} \).
We can now summarize the steps required for computing the maximum likelihood estimators of \( \boldsymbol{\mu } \), Q, and σ w 2 for the model (9.5.3)–(9.5.4).
1. For a fixed Q ∗, apply the Kalman prediction recursions with \( \hat{\mathbf{X}}_{1}^{{\ast}} = \mathbf{0} \), Ω 1 = 0, Q = Q ∗, and σ w 2 = 1 to obtain the predictors \( \hat{\mathbf{X}}_{t}^{{\ast}} \). Let Δ t ∗ denote the one-step prediction error variances produced by these recursions.
2. Set \( \hat{\boldsymbol{\mu }}=\hat{\boldsymbol{\mu }} (Q^{{\ast}}) = \left [\sum _{t=1}^{n}C_{t}'G'GC_{t}/\Delta _{t}^{{\ast}}\right ]^{-1}\sum _{t=1}^{n}C_{t}'G'(Y _{t} - G\hat{\mathbf{X}}_{t}^{{\ast}})/\Delta _{t}^{{\ast}} \).
3. Let \( \hat{Q}^{{\ast}} \) be the minimizer of (9.5.12).
4. The maximum likelihood estimators of \( \boldsymbol{\mu } \), Q, and σ w 2 are then given by \( \hat{\boldsymbol{\mu }},\hat{\sigma }_{w}^{2}\hat{Q}^{{\ast}} \), and \( \hat{\sigma }_{w}^{2} \), respectively, where \( \hat{\boldsymbol{\mu }} \) and \( \hat{\sigma }_{w}^{2} \) are found from (9.5.10) and (9.5.11) evaluated at \( \hat{Q}^{{\ast}} \).
Example 9.5.1.
Random Walk Plus Noise Model
In Example 9.2.1, 100 observations were generated from the structural model
with initial values \( \mu = M_1 = 0 \), \( \sigma_w^2 = 8 \), and \( \sigma_v^2 = 4 \). The maximum likelihood estimates of the parameters are found by first minimizing (9.5.12) with \( \hat{\mu } \) given by (9.5.10). Substituting these values into (9.5.11) gives \( \hat{\sigma }_{w}^{2} \). The resulting estimates are \( \hat{\mu }= 0.906 \), \( \hat{\sigma }_{v}^{2} = 5.351 \), and \( \hat{\sigma }_{w}^{2} = 8.233 \), which are in reasonably close agreement with the true values.
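The likelihood computation in this example can be sketched in a few lines of code. The following Python fragment (our own illustration; the function and variable names are not from ITSM) evaluates \( -2\ln L(\sigma_w^2,\sigma_v^2) \) for the random walk plus noise model via the scalar Kalman prediction recursions, on data simulated with the true parameter values above; minimizing it numerically over the two variances yields estimates of the kind quoted.

```python
import numpy as np

def neg2loglik(y, sigw2, sigv2, m1=0.0, omega1=0.0):
    """-2 log Gaussian likelihood of Y_t = M_t + W_t, M_{t+1} = M_t + V_t,
    computed with the scalar Kalman prediction recursions."""
    mhat, omega = m1, omega1          # one-step predictor of M_t and its MSE
    out = 0.0
    for yt in y:
        delta = omega + sigw2         # variance of the innovation Y_t - mhat
        innov = yt - mhat
        out += np.log(2 * np.pi * delta) + innov ** 2 / delta
        gain = omega / delta          # Kalman gain
        mhat = mhat + gain * innov    # predictor of M_{t+1} (F = 1 here)
        omega = omega * (1 - gain) + sigv2
    return out

rng = np.random.default_rng(0)
m = np.concatenate(([0.0], np.cumsum(rng.normal(0.0, 2.0, 99))))  # sigma_v^2 = 4
y = m + rng.normal(0.0, np.sqrt(8.0), 100)                        # sigma_w^2 = 8
```

Evaluating `neg2loglik(y, 8.0, 4.0)` at the true parameters gives a markedly smaller value than at grossly wrong ones, which is what a numerical optimizer exploits.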
Example 9.5.2.
International Airline Passengers, 1949–1960; AIRPASS.TSM
The monthly totals of international airline passengers from January 1949 to December 1960 (Box and Jenkins 1976) are displayed in Figure 9.3. The data exhibit both a strong seasonal pattern and a nearly linear trend. Since the variability of the data \( Y_1,\ldots,Y_{144} \) increases for larger values of \( Y_t \), it may be appropriate to consider a logarithmic transformation of the data. For the purpose of this illustration, however, we will fit a structural model incorporating a randomly varying trend and seasonal and noise components (see Example 9.2.3) to the raw data. This model has the form
where \( \mathbf{X}_t \) is a 13-dimensional state vector,
and
The parameters of the model are \( \boldsymbol{\mu },\sigma _{1}^{2},\sigma _{2}^{2},\sigma _{3}^{2} \), and \( \sigma_w^2 \), where \( \boldsymbol{\mu }= \mathbf{X}_{1} \). Minimizing (9.5.12) with respect to \( Q^{\ast} \), we find from (9.5.11) and (9.5.12) that
and from (9.5.10) that \( \hat{\boldsymbol{\mu }}= \)(146.9, 2.171, −34.92, −34.12, −47.00, −16.98, 22.99, 53.99, 58.34, 33.65, 2.204, −4.053, −6.894)′. The first component, \( X_{t1} \), of the state vector corresponds to the local linear trend with slope \( X_{t2} \). Since \( \hat{\sigma }_{2}^{2} = 0 \), the slope at time t, which satisfies
must be nearly constant and equal to \( \hat{X}_{12} = 2.171 \). The first three components of the predictors \( \hat{\mathbf{X}}_{t} \) are plotted in Figure 9.4. Notice that the first component varies like a random walk around a straight line, while the second component is nearly constant as a result of \( \hat{\sigma }_{2}^{2} \approx 0 \). The third component, corresponding to the seasonal component, exhibits a clear seasonal cycle that repeats roughly the same pattern throughout the 12 years of data. The one-step predictors \( \hat{X}_{t1} +\hat{ X}_{t3} \) of Y t are plotted in Figure 9.5 (solid line) together with the actual data (square boxes). For this model the predictors follow the movement of the data quite well.
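The assembly of the 13-dimensional system matrices for this model can be sketched as follows (our own illustration, assuming the state is ordered as level, slope, then the current and ten lagged seasonal components; the matrices `F` and `G` below are assumptions matching that ordering, not copied from the text):

```python
import numpy as np

F_trend = np.array([[1.0, 1.0],
                    [0.0, 1.0]])        # local linear trend block
F_seas = np.zeros((11, 11))
F_seas[0, :] = -1.0                     # s_{t+1} = -(s_t + ... + s_{t-10})
F_seas[1:, :-1] = np.eye(10)            # shift the stored seasonal lags down
F = np.block([[F_trend, np.zeros((2, 11))],
              [np.zeros((11, 2)), F_seas]])
G = np.zeros((1, 13))
G[0, 0] = 1.0                           # observe the level ...
G[0, 2] = 1.0                           # ... plus the current seasonal term
```

A sanity check on the seasonal block: applied twelve times to any seasonal state it returns the same state, reflecting the period-12 constraint that twelve consecutive seasonal components sum to zero.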
9.6 State-Space Models with Missing Observations
State-space representations and the associated Kalman recursions are ideally suited to the analysis of data with missing values, as was pointed out by Jones (1980) in the context of maximum likelihood estimation for ARMA processes. In this section we shall deal with two missing-value problems for state-space models. The first is the evaluation of the (Gaussian) likelihood based on \( \{\mathbf{Y}_{i_{1}},\ldots,\mathbf{Y}_{i_{r}}\} \), where i 1, i 2, …, i r are positive integers such that 1 ≤ i 1 < i 2 < ⋯ < i r ≤ n. (This allows for observation of the process {Y t } at irregular intervals, or equivalently for the possibility that (n − r) observations are missing from the sequence {Y 1, …, Y n }.) The solution of this problem will, in particular, enable us to carry out maximum likelihood estimation for ARMA and ARIMA processes with missing values. The second problem to be considered is the minimum mean squared error estimation of the missing values themselves.
9.6.1 The Gaussian Likelihood of \( \{\mathbf{Y}_{i_{1}},\ldots,\mathbf{Y}_{i_{r}}\}, \) 1 ≤ i 1 < i 2 < ⋯ < i r ≤ n
Consider the state-space model defined by equations (9.1.1) and (9.1.2) and suppose that the model is completely parameterized by the components of the vector \( \theta \). If there are no missing observations, i.e., if r = n and i j = j, j = 1, …, n, then the likelihood of the observations {Y 1, …, Y n } is easily found as in Section 9.5 to be
where \( \mathbf{I}_j = \mathbf{Y}_j - P_{j-1}\mathbf{Y}_j \) and \( \Delta_j \), \( j \geq 1 \), are the one-step innovations and their error covariance matrices, found from (9.4.7) and (9.4.9) with \( \mathbf{Y}_0 = \mathbf{1} \).
To deal with the more general case of possibly irregularly spaced observations \( \{\mathbf{Y}_{i_{1}},\ldots,\mathbf{Y}_{i_{r}}\} \), we introduce a new series \( \{\mathbf{Y}_t^{\ast}\} \), related to the process \( \{\mathbf{X}_t\} \) by the modified observation equation
where
and \( \{\mathbf{N}_t\} \) is iid with
Equations (9.6.1) and (9.1.2) constitute a state-space representation for the new series \( \{\mathbf{Y}_t^{\ast}\} \), which coincides with \( \{\mathbf{Y}_t\} \) at each \( t \in \{ i_{1},i_{2},\ldots,i_{r}\} \), and at other times takes random values that are independent of \( \{\mathbf{Y}_t\} \) with a distribution independent of \( \theta \).
Let \( L_{1}\left (\theta;\,\mathbf{y}_{i_{1}},\ldots,\mathbf{y}_{i_{r}}\right ) \) be the Gaussian likelihood based on the observed values \( \mathbf{y}_{i_{1}},\ldots,\mathbf{y}_{i_{r}} \) of \( \mathbf{Y}_{i_{1}},\ldots,\mathbf{Y}_{i_{r}} \) under the model defined by (9.1.1) and (9.1.2). Corresponding to these observed values, we define a new sequence \( \mathbf{y}_1^{\ast},\ldots,\mathbf{y}_n^{\ast} \) by
Then it is clear from the preceding paragraph that
where L 2 denotes the Gaussian likelihood under the model defined by (9.6.1) and (9.1.2).
In view of (9.6.5) we can now compute the required likelihood \( L_1 \) of the realized values \( \{\mathbf{y}_t,\ t = i_1,\ldots,i_r\} \) as follows:
i. Define the sequence \( \{\mathbf{y}_t^{\ast},\ t = 1,\ldots,n\} \) as in (9.6.4).
ii. Find the one-step predictors \( \hat{\mathbf{Y}}_{t}^{{\ast}} \) of \( \mathbf{Y}_t^{\ast} \), and their error covariance matrices \( \Delta_t^{\ast} \), using Kalman prediction and equations (9.4.7) and (9.4.9) applied to the state-space representation (9.6.1) and (9.1.2) of \( \{\mathbf{Y}_t^{\ast}\} \). Denote the realized values of the predictors, based on the observation sequence \( \left \{\mathbf{y}_{t}^{{\ast}}\right \} \), by \( \left \{\hat{\mathbf{y}}_{t}^{{\ast}}\right \} \).
iii. The required Gaussian likelihood of the irregularly spaced observations \( \{\mathbf{y}_{i_{1}},\ldots,\mathbf{y}_{i_{r}}\} \) is then, by (9.6.5),
$$ \displaystyle{L_{1 } (\theta; \mathbf{y}_{i_{1}},\ldots,\mathbf{y}_{i_{r}}) = (2\pi )^{-rw/2}\left (\prod _{ j=1}^{n}\det \Delta _{ j}^{{\ast}}\right )^{-1/2}\exp \left \{-\frac{1} {2}\sum _{j=1}^{n}\mathbf{i}_{ j}^{{\ast}}{}'\Delta _{ j}^{{\ast}-1}\mathbf{i}_{ j}^{{\ast}}\right \},} $$where i j ∗ denotes the observed innovation \( \mathbf{y}_{j}^{{\ast}}-\hat{\mathbf{y}}_{j}^{{\ast}} \), j = 1, …, n.
Example 9.6.1.
An AR(1) Series with One Missing Observation
Let \( \{Y_t\} \) be the causal AR(1) process defined by
To find the Gaussian likelihood of the observations \( y_1, y_3, y_4 \), and \( y_5 \) of \( Y_1, Y_3, Y_4 \), and \( Y_5 \) we follow the steps outlined above.
i. Set \( y_i^{\ast} = y_i \), \( i = 1,3,4,5 \), and \( y_2^{\ast} = 0 \).
ii. We start with the state-space model for \( \{Y_t\} \) from Example 9.1.1, i.e., \( Y_t = X_t \), \( X_{t+1} = \phi X_t + Z_{t+1} \). The corresponding model for \( \{Y_t^{\ast}\} \) is then, from (9.6.1),
$$ \displaystyle{ Y _{t}^{{\ast}} = G_{ t}^{{\ast}}X_{ t} + W_{t}^{{\ast}},\ t = 1,2,\ldots, } $$where
$$ \displaystyle\begin{array}{rcl} X_{t+1}& =& F_{t}X_{t} + V _{t},\ t = 1,2,\ldots, {}\\ F_{t}& =& \phi,\quad G_{t}^{{\ast}} = \left \{\begin{array}{@{}l@{\quad }l@{}} 1\quad &\mbox{ if }t\neq 2, \\ 0\quad &\mbox{ if }t = 2, \end{array} \right.\quad V _{t} = Z_{t+1},\quad W_{t}^{{\ast}} = \left \{\begin{array}{@{}l@{\quad }l@{}} 0 \quad &\mbox{ if }t\neq 2, \\ N_{t}\quad &\mbox{ if }t = 2, \end{array} \right. {}\\ Q_{t}& =& \sigma ^{2},\qquad R_{ t}^{{\ast}} = \left \{\begin{array}{@{}l@{\quad }l@{}} 0\quad &\mbox{ if }t\neq 2, \\ 1\quad &\mbox{ if }t = 2, \end{array} \right.\qquad S_{t}^{{\ast}} = 0, {}\\ \end{array} $$and \( X_{1} = \sum_{j=0}^{\infty }\phi ^{j}Z_{1-j} \). Starting from the initial conditions
$$ \displaystyle{ \hat{X}_{1} = 0,\qquad \Omega _{1} =\sigma ^{2}/\left (1 -\phi ^{2}\right ), } $$and applying the recursions (9.4.1) and (9.4.2), we find (Problem 9.19) that
$$ \displaystyle{ \varTheta _{t}\Delta _{t}^{-1} = \left \{\begin{array}{@{}l@{\quad }l@{}} \phi \quad &\mbox{ if }t = 1,3,4,5, \\ 0\quad &\mbox{ if }t = 2, \end{array} \right.\quad \Omega _{t} = \left \{\begin{array}{@{}l@{\quad }l@{}} \sigma ^{2}/\left (1 -\phi ^{2}\right )\quad &\mbox{ if }t = 1, \\ \sigma ^{2}\left (1 +\phi ^{2}\right ) \quad &\mbox{ if }t = 3, \\ \sigma ^{2} \quad &\mbox{ if }t = 2,4,5, \end{array} \right. } $$and
$$ \displaystyle{\hat{X}_{1} = 0,\quad \hat{X}_{2} =\phi Y _{1},\quad \hat{X}_{3} =\phi ^{2}Y _{ 1},\quad \hat{X}_{4} =\phi Y _{3},\quad \hat{X}_{5} =\phi Y _{4}.} $$From (9.4.7) and (9.4.9) with h = 1, we find that
$$ \displaystyle{\hat{Y }_{1}^{{\ast}} = 0,\quad \hat{Y }_{ 2}^{{\ast}} = 0,\quad \hat{Y }_{ 3}^{{\ast}} =\phi ^{2}Y _{ 1},\quad \hat{Y }_{4}^{{\ast}} =\phi Y _{ 3},\quad \hat{Y }_{5}^{{\ast}} =\phi Y _{ 4},} $$with corresponding mean squared errors
$$ \displaystyle{ \Delta _{1}^{{\ast}} =\sigma ^{2}/\left (1 -\phi ^{2}\right ),\quad \Delta _{2}^{{\ast}} = 1,\quad \Delta _{3}^{{\ast}} =\sigma ^{2}\left (1 +\phi ^{2}\right ),\quad \Delta _{4}^{{\ast}} =\sigma ^{2},\quad \Delta _{5}^{{\ast}} =\sigma ^{2}.} $$
iii. From the preceding calculations we can now write the likelihood of the original data as
$$ \displaystyle\begin{array}{rcl} & &L_{1 } (\phi,\sigma ^{2};\,y_{ 1},y_{3},y_{4},y_{5})=\sigma ^{-4}(2\pi )^{-2}\left [\left (1-\phi ^{2}\right )/\left (1+\phi ^{2}\right )\right ]^{1/2} {}\\ & &\quad \times \exp \left \{-{ 1 \over 2\sigma ^{2}}\left [y_{1}^{2}\left (1-\phi ^{2}\right )+\frac{(y_{3}-\phi ^{2}y_{ 1})^{2}} {1+\phi ^{2}} +(y_{4}-\phi y_{3})^{2}+(y_{ 5}-\phi y_{4})^{2}\right ]\right \}. {}\\ \end{array} $$
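Steps (i)–(iii) are easy to mechanize. The sketch below (illustrative code, not part of ITSM) runs the Kalman recursions on the modified series \( \{Y_t^{\ast}\} \), skipping the likelihood contribution at the missing time (where \( \Delta_t^{\ast} = 1 \) and the innovation is zero, so the term contributes a factor of one), and it agrees with the closed-form likelihood displayed above.

```python
import numpy as np

def ar1_missing_loglik(y, observed, phi, sigma2):
    """Log Gaussian likelihood of a causal AR(1) observed only at the times
    where observed[t] is True, via the modified model with G* = 0, R* = 1,
    and y* = 0 at missing times."""
    xhat = 0.0
    omega = sigma2 / (1 - phi ** 2)        # stationary variance of X_1
    loglik = 0.0
    for yt, obs in zip(y, observed):
        g, rvar, ystar = (1.0, 0.0, yt) if obs else (0.0, 1.0, 0.0)
        delta = g * g * omega + rvar       # innovation variance Delta_t*
        innov = ystar - g * xhat
        if obs:                            # missing times contribute factor 1
            loglik -= 0.5 * (np.log(2 * np.pi * delta) + innov ** 2 / delta)
        theta = phi * omega * g            # Kalman gain numerator
        xhat = phi * xhat + theta / delta * innov
        omega = phi ** 2 * omega + sigma2 - theta ** 2 / delta
    return loglik

phi, s2 = 0.6, 2.0
y = [1.0, 0.0, -0.5, 1.2, 0.3]             # y_2 is missing (placeholder 0)
observed = [True, False, True, True, True]
loglik = ar1_missing_loglik(y, observed, phi, s2)
```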
Remark 1.
If we are given observations \( y_{1-d},y_{2-d},\ldots,y_{0},y_{i_{1}},y_{i_{2}},\ldots,y_{i_{r}} \) of an ARIMA( p, d, q) process at times 1 − d, 2 − d, …, 0, i 1, …, i r , where 1 ≤ i 1 < i 2 < ⋯ < i r ≤ n, a similar argument can be used to find the Gaussian likelihood of \( y_{i_{1}},\ldots,y_{i_{r}} \) conditional on Y 1−d = y 1−d , Y 2−d = y 2−d , …, Y 0 = y 0. Missing values among the first d observations y 1−d , y 2−d , …, y 0 can be handled by treating them as unknown parameters for likelihood maximization. For more on ARIMA series with missing values see Brockwell and Davis (1991) and Ansley and Kohn (1985). □
9.6.2 Estimation of Missing Values for State-Space Models
Given that we observe only \( \mathbf{Y}_{i_{1}},\mathbf{Y}_{i_{2}},\ldots,\mathbf{Y}_{i_{r}} \), \( 1 \leq i_{1} < i_{2} < \cdots < i_{r} \leq n \), where \( \{\mathbf{Y}_t\} \) has the state-space representation (9.1.1) and (9.1.2), we now consider the problem of finding the minimum mean squared error estimators \( P\left (\mathbf{Y}_{t}\vert \mathbf{Y}_{0},\mathbf{Y}_{i_{1}},\ldots,\mathbf{Y}_{i_{r}}\right ) \) of \( \mathbf{Y}_t \), \( 1 \leq t \leq n \), where \( \mathbf{Y}_0 = \mathbf{1} \). To handle this problem we again use the modified process \( \{\mathbf{Y}_t^{\ast}\} \) defined by (9.6.1) and (9.1.2) with \( \mathbf{Y}_0^{\ast} = \mathbf{1} \). Since \( \mathbf{Y}_s^{\ast} = \mathbf{Y}_s \) for \( s \in \{i_1,\ldots,i_r\} \) and \( \mathbf{Y}_s^{\ast} \perp \mathbf{X}_t,\ \mathbf{Y}_0 \) for \( 1 \leq t \leq n \) and \( s\notin \{0,i_1,\ldots,i_r\} \), we immediately obtain the minimum mean squared error state estimators
The right-hand side can be evaluated by application of the Kalman fixed-point smoothing algorithm to the state-space model (9.6.1) and (9.1.2). For computational purposes the observed values of \( \mathbf{Y}_t^{\ast} \), \( t\notin \{0,i_1,\ldots,i_r\} \), are quite immaterial. They may, for example, all be set equal to zero, giving the sequence of observations of \( \mathbf{Y}_t^{\ast} \) defined in (9.6.4).
To evaluate \( P\left (\mathbf{Y}_{t}\vert \mathbf{Y}_{0},\mathbf{Y}_{i_{1}},\ldots,\mathbf{Y}_{i_{r}}\right ) \), 1 ≤ t ≤ n, we use (9.6.6) and the relation
Since \( E\left (\mathbf{V}_{t}\mathbf{W}_{t}'\right ) = S_{t} = 0,\quad t = 1,\ldots,n, \) we find from (9.6.7) that
Example 9.6.2.
An AR(1) Series with One Missing Observation
Consider the problem of estimating the missing value Y 2 in Example 9.6.1 in terms of Y 0 = 1, Y 1, Y 3, Y 4, and Y 5. We start from the state-space model X t+1 = ϕ X t + Z t+1, Y t = X t , for {Y t }. The corresponding model for {Y t ∗} is the one used in Example 9.6.1. Applying the Kalman smoothing equations to the latter model, we find that
and
where P t (⋅ ) here denotes \( P\left (\cdot \vert Y _{0}^{{\ast}},\ldots,Y _{t}^{{\ast}}\right ) \) and \( \Omega _{t,n},\Omega _{t\vert n} \) are defined correspondingly. We deduce from (9.6.8) that the minimum mean squared error estimator of the missing value Y 2 is
with mean squared error
Remark 2.
Suppose we have observations \( Y _{1-d},Y _{2-d},\ldots,Y _{0},Y _{i_{1}},\ldots,Y _{i_{r}} \) \( (1 \leq i_{1} < i_{2}\cdots < i_{r} \leq n) \) of an ARIMA(p, d, q) process. Determination of the best linear estimates of the missing values \( Y _{t},\,t\notin \{i_{1},\ldots,i_{r}\} \), in terms of \( Y _{t},\,t \in \{ i_{1},\ldots,i_{r}\} \), and the components of Y 0: = (Y 1−d , Y 2−d , …, Y 0)′ can be carried out as in Example 9.6.2 using the state-space representation of the ARIMA series {Y t } from Example 9.3.3 and the Kalman recursions for the corresponding state-space model for {Y t ∗} defined by (9.6.1) and (9.1.2). See Brockwell and Davis (1991) for further details. □
We close this section with a brief discussion of a direct approach to estimating missing observations. This approach is often more efficient than the methods just described, especially if the number of missing observations is small and we have a simple (e.g., autoregressive) model. Consider the general problem of computing E(X | Y) when the random vector (X′, Y′)′ has a multivariate normal distribution with mean 0 and covariance matrix Σ. (In the missing observation problem, think of X as the vector of the missing observations and Y as the vector of observed values.) Then the joint probability density function of X and Y can be written as
where \( f_{\mathbf{X}\vert \mathbf{Y}}(\mathbf{x}\vert \mathbf{y}) \) is a multivariate normal density with mean E(X | Y) and covariance matrix \( \varSigma_{\mathbf{X}\vert \mathbf{Y}} \) (see Proposition A.3.1). In particular,
where q = dim(X). It is clear from (9.6.10) that \( f_{\mathbf{X}\vert \mathbf{Y}} (\mathbf{x}\vert \mathbf{y}) \) (and also \( f_{\mathbf{X},\mathbf{Y}}(\mathbf{x},\mathbf{y}) \)) is maximized when x = E(X | y). Thus, the best estimator of X in terms of Y can be found by maximizing the joint density of X and Y with respect to x. For autoregressive processes it is relatively straightforward to carry out this optimization, as shown in the following example.
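The claim that the maximizing value of x is \( E(\mathbf{X}\,\vert\,\mathbf{y}) \) can be verified numerically. In the sketch below (with an arbitrary positive definite \( \varSigma \) of our own choosing), setting the x-gradient of the Gaussian quadratic form to zero gives \( -P_{11}^{-1}P_{12}\mathbf{y} \), where \( P = \varSigma^{-1} \), and this coincides with the conditional mean \( \varSigma_{12}\varSigma_{22}^{-1}\mathbf{y} \) of Proposition A.3.1.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(4, 4))
Sigma = A @ A.T + 4 * np.eye(4)          # an arbitrary covariance matrix
S12, S22 = Sigma[:2, 2:], Sigma[2:, 2:]
y = np.array([1.0, -2.0])

# conditional mean E(X | Y = y) from Proposition A.3.1
cond_mean = S12 @ np.linalg.solve(S22, y)

# maximizer of the joint density: grad_x of x'P11 x + 2 x'P12 y vanishes
# at x = -P11^{-1} P12 y, with P the precision matrix Sigma^{-1}
P = np.linalg.inv(Sigma)
argmax_x = -np.linalg.solve(P[:2, :2], P[:2, 2:] @ y)
```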
Example 9.6.3.
Estimating Missing Observations in an AR Process
Suppose {Y t } is the AR( p) process defined by
and \( \mathbf{Y} = (Y _{i_{1}},\ldots,Y _{i_{r}})' \), with \( 1 \leq i_1 < \cdots < i_r \leq n \), are the observed values. If there are no missing values among the first \( p \) observations, then the best estimates of the missing values are found by minimizing
with respect to the missing values (see Problem 9.20). For the AR(1) model in Example 9.6.2, minimization of (9.6.11) is equivalent to minimizing
with respect to Y 2. Setting the derivative of this expression with respect to Y 2 equal to 0 and solving for Y 2 we obtain \( E(Y _{2}\vert Y _{1},Y _{3},Y _{4},Y _{5}) =\phi (Y _{1} + Y _{3})/\left (1 +\phi ^{2}\right ) \).
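A quick numerical check of this calculation (with illustrative values of \( \phi \), \( y_1 \), and \( y_3 \); the grid search below stands in for the calculus):

```python
import numpy as np

phi = 0.7
y1, y3 = 2.0, 1.0

def sum_of_squares(y2):
    # the two squared AR(1) residuals of (9.6.12) that involve Y_2
    return (y2 - phi * y1) ** 2 + (y3 - phi * y2) ** 2

grid = np.linspace(-5, 5, 200001)
y2_numeric = grid[np.argmin(sum_of_squares(grid))]
y2_closed = phi * (y1 + y3) / (1 + phi ** 2)   # the formula derived above
```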
9.7 The EM Algorithm
The expectation-maximization (EM) algorithm is an iterative procedure for computing the maximum likelihood estimator when only a subset of the complete data set is available. Dempster et al. (1977) demonstrated the wide applicability of the EM algorithm and are largely responsible for popularizing this method in statistics. Details regarding the convergence and performance of the EM algorithm can be found in Wu (1983).
In the usual formulation of the EM algorithm, the “complete” data vector W is made up of “observed” data Y (sometimes called incomplete data) and “unobserved” data X. In many applications, X consists of values of a “latent” or unobserved process occurring in the specification of the model. For example, in the state-space model of Section 9.1, Y could consist of the observed vectors Y 1, …, Y n and X of the unobserved state vectors X 1, …, X n . The EM algorithm provides an iterative procedure for computing the maximum likelihood estimator based only on the observed data Y. Each iteration of the EM algorithm consists of two steps. If θ (i) denotes the estimated value of the parameter θ after i iterations, then the two steps in the (i + 1)th iteration are
and
Then θ (i+1) is set equal to the maximizer of Q in the M-step. In the E-step, ℓ(θ; x, y) = lnf(x, y; θ), and \( E_{\theta ^{(i)}}(\cdot \vert \mathbf{Y}) \) denotes the conditional expectation relative to the conditional density \( f{\bigl (\mathbf{x}\vert \mathbf{y};\theta ^{(i)}\bigr )} = f{\bigl (\mathbf{x},\mathbf{y};\theta ^{(i)}\bigr )}/f{\bigl (\mathbf{y};\theta ^{(i)}\bigr )} \).
It can be shown that \( \ell{\bigl (\theta ^{(i)};\mathbf{Y}\bigr )} \) is nondecreasing in i, and a simple heuristic argument shows that if θ (i) has a limit \( \hat{\theta } \) then \( \hat{\theta } \) must be a solution of the likelihood equations \( \ell'{\bigl (\hat{\theta };\mathbf{Y}\bigr )} = 0 \). To see this, observe that lnf(x, y; θ) = lnf(x | y; θ) + ℓ(θ; y), from which we obtain
and
Now replacing θ with θ (i+1), noticing that Q′(θ (i+1) | θ (i)) = 0, and letting i → ∞, we find that
The last equality follows from the fact that
The computational advantage of the EM algorithm over direct maximization of the likelihood is most pronounced when the calculation and maximization of the exact likelihood is difficult as compared with the maximization of Q in the M-step. (There are some applications in which the maximization of Q can easily be carried out explicitly.)
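The two steps can be made concrete with a toy model of our own choosing (not from the text): the complete data are \( n+1 \) iid \( N(\theta,1) \) variables, of which one, X, is unobserved. Since X is independent of Y, the E-step replaces X by \( \theta^{(i)} \), and the M-step averages the completed sample; the iterates converge to the observed-data maximum likelihood estimate \( \bar{y} \), illustrating that the limit solves \( \ell'\bigl(\hat{\theta };\mathbf{Y}\bigr) = 0 \).

```python
import numpy as np

y = np.array([0.5, 1.5, 2.0, 1.0])   # observed data
theta = 10.0                         # deliberately poor starting value
for _ in range(200):
    # E-step: E_theta[X | Y] = theta, since X is independent of Y, so Q is
    # the complete-data log-likelihood with X filled in by theta^(i)
    xhat = theta
    # M-step: maximize Q, i.e., take the mean of the completed sample
    theta = (xhat + y.sum()) / (len(y) + 1)
```

Each iteration shrinks the error \( \theta^{(i)} - \bar{y} \) by the factor \( 1/(n+1) \), a simple instance of the linear convergence typical of EM.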
9.7.1 Missing Data
The EM algorithm is particularly useful for estimation problems in which there are missing observations. Suppose the complete data set consists of Y 1, …, Y n of which r are observed and n − r are missing. Denote the observed and missing data by \( \mathbf{Y} = (Y _{i_{1}},\ldots,Y _{i_{r}})' \) and \( \mathbf{X} = (Y _{j_{1}},\ldots,Y _{j_{n-r}})' \), respectively. Assuming that W = (X′, Y′)′ has a multivariate normal distribution with mean 0 and covariance matrix Σ, which depends on the parameter \( \theta \), the log-likelihood of the complete data is given by
The E-step requires that we compute the expectation of \( \ell(\theta;\mathbf{W}) \) with respect to the conditional distribution of W given Y with \( \theta =\theta ^{(i)} \). Writing \( \varSigma (\theta ) \) as the block matrix
which is conformable with X and Y, the conditional distribution of W given Y is multivariate normal with mean \( \left[\begin{array}{c}\hat{\mathbf{X}}\\ \mathbf{Y}\end{array}\right] \) and covariance matrix \( \left[\begin{array}{cc}\varSigma _{11\vert 2}(\theta ) & 0\\ 0 & 0\end{array}\right] \), where \( \hat{\mathbf{X}} = E_{\theta }(\mathbf{X}\vert \mathbf{Y}) =\varSigma _{12}\varSigma _{22}^{-1}\mathbf{Y} \) and \( \varSigma _{11\vert 2}(\theta ) =\varSigma _{11} -\varSigma _{12}\varSigma _{22}^{-1}\varSigma _{21} \) (see Proposition A.3.1). Using Problem A.8, we have
where \( \hat{\mathbf{W}} = \left (\hat{\mathbf{X}}',\mathbf{Y}'\right )' \). It follows that
The first term on the right is the log-likelihood based on the complete data, but with X replaced by its “best estimate” \( \hat{\mathbf{X}} \) calculated from the previous iteration. If the increments \( \theta ^{(i+1)} -\theta ^{(i)} \) are small, then the second term on the right is nearly constant ( ≈ n − r) and can be ignored. For ease of computation in this application we shall use the modified version
With this adjustment, the steps in the EM algorithm are as follows:
- E-step. Calculate \( E_{\theta ^{(i)}}(\mathbf{X}\vert \mathbf{Y}) \) (e.g., with the Kalman fixed-point smoother) and form \( \ell{\bigl (\theta;\hat{\mathbf{W}}\bigr )} \).
- M-step. Find the maximum likelihood estimator for the “complete” data problem, i.e., maximize \( \ell{\bigl (\theta;\hat{\mathbf{W}}\bigr )} \). For ARMA processes, ITSM can be used directly, with the missing values replaced with their best estimates computed in the E-step.
Example 9.7.1.
The Lake Data
It was found in Example 5.2.5 that the AR(2) model
was a good fit to the mean-corrected lake data {W t }. To illustrate the use of the EM algorithm for missing data, consider fitting an AR(2) model to the mean-corrected data assuming that there are 10 missing values at times t = 17, 24, 31, 38, 45, 52, 59, 66, 73, and 80. We start the algorithm at iteration 0 with \( \hat{\phi }_{1}^{(0)} =\hat{\phi }_{ 2}^{(0)} = 0 \). Since this initial model represents white noise, the first E-step gives, in the notation used above, \( \hat{W}_{17} = \cdots =\hat{ W}_{80} = 0 \). Replacing the “missing” values of the mean-corrected lake data with 0 and fitting a mean-zero AR(2) model to the resulting complete data set using the maximum likelihood option in ITSM, we find that \( \hat{\phi}_{1}^{(1)} = 0.7252 \), \( \hat{\phi}_{2}^{(1)} = 0.0236 \). (Examination of the plots of the ACF and PACF of this new data set suggests an AR(1) as a better model. This is also borne out by the small estimated value of ϕ 2.) The updated missing values at times t = 17, 24, …, 80 are found (see Section 9.6 and Problem 9.21) by minimizing
with respect to W t . The solution is given by
The M-step of iteration 1 is then carried out by fitting an AR(2) model using ITSM applied to the updated data set. As seen in the summary of the results reported in Table 9.1, the EM algorithm converges in four iterations with the final parameter estimates reasonably close to the fitted model based on the complete data set. (In Table 9.1, estimates of the missing values are recorded only for the first three.) Also notice how \( -2\ell\left (\theta ^{(i)},\mathbf{W}\right ) \) decreases at every iteration. The standard errors of the parameter estimates produced from the last iteration of ITSM are based on a “complete” data set and, as such, underestimate the true sampling errors. Formulae for adjusting the standard errors to reflect the true sampling error based on the observed data can be found in Dempster et al. (1977).
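For an isolated missing value the minimization in the E-step can be carried out in closed form by setting the derivative to zero; our own calculation gives \( \hat{W}_{t} = \bigl[\phi_{1}(1-\phi_{2})(W_{t-1}+W_{t+1}) +\phi_{2}(W_{t-2}+W_{t+2})\bigr]/\bigl(1+\phi_{1}^{2}+\phi_{2}^{2}\bigr) \). The sketch below (with illustrative coefficients and neighboring values, not the fitted lake-data quantities) checks this against a brute-force grid minimization of the three squared residuals involving \( W_t \).

```python
import numpy as np

phi1, phi2 = 1.0124, -0.2029                # illustrative AR(2) coefficients
w = {15: 0.8, 16: -0.3, 18: 0.6, 19: 1.1}   # neighbours of the missing W_17

def crit(wt):
    # the three squared AR(2) residuals that involve the missing W_17
    return ((wt - phi1 * w[16] - phi2 * w[15]) ** 2
            + (w[18] - phi1 * wt - phi2 * w[16]) ** 2
            + (w[19] - phi1 * w[18] - phi2 * wt) ** 2)

grid = np.linspace(-3, 3, 600001)
wt_numeric = grid[np.argmin(crit(grid))]
wt_closed = (phi1 * (1 - phi2) * (w[16] + w[18])
             + phi2 * (w[15] + w[19])) / (1 + phi1 ** 2 + phi2 ** 2)
```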
9.8 Generalized State-Space Models
As in Section 9.1, we consider a sequence of state variables \( \{X_t,\ t \geq 1\} \) and a sequence of observations \( \{Y_t,\ t \geq 1\} \). For simplicity, we consider only one-dimensional state and observation variables, since extensions to higher dimensions can be carried out with little change. Throughout this section it will be convenient to write \( \mathbf{Y}^{(t)} \) and \( \mathbf{X}^{(t)} \) for the t-dimensional column vectors \( \mathbf{Y}^{(t)} = (Y_1,Y_2,\ldots,Y_t)' \) and \( \mathbf{X}^{(t)} = (X_1,X_2,\ldots,X_t)' \).
There are two important types of state-space models, “parameter driven” and “observation driven,” both of which are frequently used in time series analysis. The observation equation is the same for both, but the state vectors of a parameter-driven model evolve independently of the past history of the observation process, while the state vectors of an observation-driven model depend on past observations.
9.8.1 Parameter-Driven Models
In place of the observation and state equations (9.1.1) and (9.1.2), we now make the assumptions that Y t given \( {\bigl (X_{t},\mathbf{X}^{(t-1)},\mathbf{Y}^{(t-1)}\bigr )} \) is independent of \( {\bigl (\mathbf{X}^{(t-1)},\mathbf{Y}^{(t-1)}\bigr )} \) with conditional probability density
and that X t+1 given \( {\bigl (X_{t},\mathbf{X}^{(t-1)},\mathbf{Y}^{(t)}\bigr )} \) is independent of \( {\bigl (\mathbf{X}^{(t-1)},\mathbf{Y}^{(t)}\bigr )} \) with conditional density function
We shall also assume that the initial state X 1 has probability density p 1. The joint density of the observation and state variables can be computed directly from (9.8.1)–(9.8.2) as
and since (9.8.2) implies that {X t } is Markov (see Problem 9.22),
We conclude that Y 1, …, Y n are conditionally independent given the state variables X 1, …, X n , so that the dependence structure of {Y t } is inherited from that of the state process {X t }. The sequence of state variables {X t } is often referred to as the hidden or latent generating process associated with the observed process.
In order to solve the filtering and prediction problems in this setting, we shall determine the conditional densities \( p\left (x_{t}\vert \mathbf{y}^{(t)}\right ) \) of X t given Y (t), and \( p\left (x_{t}\vert \mathbf{y}^{(t-1)}\right ) \) of X t given Y (t−1), respectively. The minimum mean squared error estimates of X t based on Y (t) and Y (t−1) can then be computed as the conditional expectations, \( E\left (X_{t}\vert \mathbf{Y}^{(t)}\right ) \) and \( E\left (X_{t}\vert \mathbf{Y}^{(t-1)}\right ) \).
An application of Bayes’s theorem, using the assumption that the distribution of Y t given \( \left (X_{t},\mathbf{X}^{(t-1)},\mathbf{Y}^{(t-1)}\right ) \) does not depend on \( \left (\mathbf{X}^{(t-1)},\mathbf{Y}^{(t-1)}\right ) \), yields
and
(The integral relative to d μ(x t ) in (9.8.4) is interpreted as the integral relative to dx t in the continuous case and as the sum over all values of x t in the discrete case.) The initial condition needed to solve these recursions is
The factor \( p\left (y_{t}\vert \mathbf{y}^{(t-1)}\right ) \) appearing in the denominator of (9.8.4) is just a scale factor, determined by the condition \( \int p\left (x_{t}\vert \mathbf{y}^{(t)}\right )\,d\mu (x_{t}) = 1. \) In the generalized state-space setup, prediction of a future state variable is less important than forecasting a future value of the observations. The relevant forecast density can be computed from (9.8.5) as
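When the state space is finite, the integral in (9.8.5) is a sum and the recursions (9.8.4)–(9.8.5) can be iterated exactly. The following sketch (a two-state chain with Bernoulli observations; all model values are our own illustration) computes the filtering densities \( p(x_t\,\vert\,\mathbf{y}^{(t)}) \):

```python
import numpy as np

# states x in {0, 1}
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])        # transition density p(x_{t+1} | x_t)
b = np.array([0.1, 0.7])          # observation density p(Y_t = 1 | X_t = x)
p1 = np.array([0.5, 0.5])         # initial density of X_1

def filter_densities(y):
    pred = p1.copy()              # p(x_1 | y^(0)) = p_1(x_1)
    filt_list = []
    for yt in y:
        like = b if yt == 1 else 1 - b   # p(y_t | x_t)
        filt = like * pred               # numerator of (9.8.4)
        filt /= filt.sum()               # scale factor p(y_t | y^(t-1))
        filt_list.append(filt)
        pred = P.T @ filt                # (9.8.5): sum_x p(x'|x) p(x|y^(t))
    return np.array(filt_list)

filt = filter_densities([1, 1, 0, 1])
```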
Equations (9.8.1)–(9.8.2) can be regarded as a Bayesian model specification. A classical Bayesian model has two key assumptions. The first is that the data Y 1, …, Y t , given an unobservable parameter (X (t) in our case), are independent with specified conditional distribution. This corresponds to (9.8.3). The second specifies a prior distribution for the parameter value. This corresponds to (9.8.2). The posterior distribution is then the conditional distribution of the parameter given the data. In the present setting the posterior distribution of the component X t of X (t) is determined by the solution (9.8.4) of the filtering problem.
Example 9.8.1.
Consider the simplified version of the linear state-space model of Section 9.1,
where the noise sequences {W t } and {V t } are independent of each other. For this model the probability densities in (9.8.1)–(9.8.2) become
where \( n\left (x;\mu,\sigma ^{2}\right ) \) is the normal density with mean μ and variance σ 2 defined in Example (a) of Section A.1.
To solve the filtering and prediction problems in this new framework, we first observe that the filtering and prediction densities in (9.8.4) and (9.8.5) are both normal. We shall write them, using the notation of Section 9.4, as
and
From (9.8.5), (9.8.12), (9.8.13), and (9.8.14), we find that
and (see Problem 9.23)
Substituting the corresponding densities (9.8.11) and (9.8.14) into (9.8.4), we find by equating the coefficient of x t 2 on both sides of (9.8.4) that
and
Also, from (9.8.4) with \( p\left (x_{1}\vert \mathbf{y}^{(0)}\right ) = n(x_{1};EX_{1},\Omega _{1}) \) we obtain the initial conditions
and
The Kalman prediction and filtering recursions of Section 9.4 give the same results for \( \hat{X}_{t} \) and X t | t , since for Gaussian systems best linear mean square estimation is equivalent to best mean square estimation.
Example 9.8.2.
A Non-Gaussian Example
In general, the solution of the recursions (9.8.4) and (9.8.5) presents substantial computational problems. Numerical methods for dealing with non-Gaussian models are discussed by Sorenson and Alspach (1971) and Kitagawa (1987). Here we shall illustrate the recursions (9.8.4) and (9.8.5) in a very simple special case. Consider the state equation
with observation density
where π is a constant between 0 and 1. The relationship in (9.8.15) implies that the transition density [in the discrete sense—see the comment after (9.8.5)] for the state variables is
We shall assume that X 1 has the gamma density function
(This is a simplified model for the evolution of the number X t of individuals at time t infected with a rare disease, in which X t is treated as a continuous rather than an integer-valued random variable. The observation Y t represents the number of infected individuals observed in a random sample consisting of a small fraction π of the population at time t.) Because the transition distribution of {X t } is not continuous, we use the integrated version of (9.8.5) to compute the prediction density. Thus,
Differentiation with respect to x gives
Now applying (9.8.4), we find that
where c(y 1) is an integration factor ensuring that p(⋅ | y 1) integrates to 1. Since p(⋅ | y 1) has the form of a gamma density, we deduce (see Example (d) of Section A.1) that
where α 1 = α + y 1 and λ 1 = λ +π. The prediction density, calculated from (9.8.5) and (9.8.18), is
Iterating the recursions (9.8.4) and (9.8.5) and using (9.8.17), we find that for t ≥ 1,
and
where \( \alpha_t = \alpha_{t-1} + y_t = \alpha + y_1 + \cdots + y_t \) and \( \lambda _{t} =\lambda _{t-1}/a+\pi =\lambda a^{1-t} +\pi \left (1 - a^{-t}\right )/(1 - a^{-1}) \). In particular, the minimum mean squared error estimate of \( X_t \) based on \( \mathbf{y}^{(t)} \) is the conditional expectation \( \alpha_t/\lambda_t \), with conditional variance \( \alpha_t/\lambda_t^2 \). From (9.8.7) the probability density of \( Y_{t+1} \) given \( \mathbf{Y}^{(t)} \) is
where \( \mathrm{nb}(y;\alpha,p) \) is the negative binomial density defined in Example (i) of Section A.1. Conditional on \( \mathbf{Y}^{(t)} \), the best one-step predictor of \( Y_{t+1} \) is therefore the mean, \( \alpha_t\pi/(\lambda_{t+1}-\pi) \), of this negative binomial distribution. The conditional mean squared error of the predictor is \( \mathrm{Var}\left (Y _{t+1}\vert \mathbf{Y}^{(t)}\right ) =\alpha _{t}\pi \lambda _{t+1}/(\lambda _{t+1}-\pi )^{2} \) (see Problem 9.25).
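The conjugate updates \( \alpha_t = \alpha_{t-1}+y_t \) and \( \lambda_t = \lambda_{t-1}/a + \pi \) are trivial to iterate. The sketch below (with illustrative parameter values of our own) runs the recursions, forms the one-step forecast mean \( \alpha_t\pi/(\lambda_{t+1}-\pi) \), and can be checked against the closed form for \( \lambda_t \) given above.

```python
alpha0, lam0 = 3.0, 2.0    # gamma prior parameters for X_1 (alpha, lambda)
a, pi_ = 1.5, 0.25         # state growth factor a and sampling fraction pi
y = [2, 0, 1, 3, 1]        # observed counts y_1, ..., y_5

alpha, lam = alpha0, lam0  # before y_1 is seen, X_1 ~ Gamma(alpha, lambda)
hist = []
for yt in y:
    # filtering step (9.8.4): the Poisson(pi * x) likelihood is conjugate
    alpha, lam = alpha + yt, lam + pi_
    hist.append((alpha, lam))
    # prediction step (9.8.5): X_{t+1} = a X_t rescales the gamma rate
    lam = lam / a

# one-step forecast mean alpha_t * pi / (lambda_{t+1} - pi); note that
# lambda_{t+1} - pi = lambda_t / a is exactly the current value of lam
forecast_mean = alpha * pi_ / lam
```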
Example 9.8.3.
A Model for Time Series of Counts
We often encounter time series in which the observations represent count data. One such example is the monthly number of newly recorded cases of poliomyelitis in the U.S. for the years 1970–1983, plotted in Figure 9.6. Unless the actual counts are large and can be approximated by continuous variables, Gaussian and linear time series models are generally inappropriate for analyzing such data. The parameter-driven specification provides a flexible class of models for count data. We now discuss a specific model based on a Poisson observation density, similar to the one presented by Zeger (1988) for analyzing the polio data. The observation density is assumed to be Poisson with mean \( \exp\{x_t\} \), i.e.,
while the state variables are assumed to follow a regression model with Gaussian AR(1) noise. If u t = (u t1, …, u tk )′ are the regression variables, then
where \( \beta \) is a k-dimensional regression parameter and
The transition density function for the state variables is then
The case σ 2 = 0 corresponds to a log-linear model with Poisson noise.
Estimation of the parameters \( \theta = \left (\beta ',\phi,\sigma ^{2}\right )' \) in the model by direct numerical maximization of the likelihood function is difficult, since the likelihood cannot be written down in closed form. (From (9.8.3) the likelihood is the n-fold integral,
where \( L(\theta;\mathbf{x}) \) is the likelihood based on X 1, …, X n .) To overcome this difficulty, Chan and Ledolter (1995) proposed an algorithm, called Monte Carlo EM (MCEM), whose iterates θ (i) converge to the maximum likelihood estimate. To apply this algorithm, first note that the conditional distribution of Y (n) given X (n) does not depend on \( \theta \), so that the likelihood based on the complete data \( \left (\mathbf{X}^{(n)}{}',\mathbf{Y}^{(n)}{}'\right )' \) is given by
The E-step of the algorithm (see Section 9.7) requires calculation of
We delete the first term from the definition of Q, since it is independent of \( \theta \) and hence plays no role in the M-step of the EM algorithm. The new Q is redefined as
Even with this simplification, direct calculation of Q is still intractable. Suppose for the moment that it is possible to generate replicates of X (n) from the conditional distribution of X (n) given Y (n) when \( \theta =\theta ^{(i)} \). If we denote m independent replicates of X (n) by X 1 (n), …, X m (n), then a Monte Carlo approximation to Q in (9.8.24) is given by
The M-step is easy to carry out using Q m in place of Q (especially if we condition on X 1 = 0 in all the simulated replicates), since L is just the Gaussian likelihood of the regression model with AR(1) noise treated in Section 6.6. The difficult steps in the algorithm are the generation of replicates of X (n) given Y (n) and the choice of m. Chan and Ledolter (1995) discuss the use of the Gibbs sampler for generating the desired replicates and give some guidelines on the choice of m.
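The structure of the Monte Carlo E-step can be made concrete with a minimal sketch (added here for illustration; the function names and the dummy likelihood in the test are hypothetical): given m replicates of the state path drawn from the conditional distribution of X (n) given Y (n) at θ (i) — e.g., by a Gibbs sampler, which is not implemented here — Q is approximated by the average complete-data log-likelihood.

```python
def monte_carlo_q(theta, state_replicates, y, complete_loglik):
    """Monte Carlo approximation to Q(theta):
    average the complete-data log-likelihood over m replicates of the
    state path x_1, ..., x_m drawn from p(x | y; theta_i).

    complete_loglik(theta, x, y) is a user-supplied function returning
    ln L(theta; x, y) for one simulated state path x."""
    m = len(state_replicates)
    return sum(complete_loglik(theta, x, y) for x in state_replicates) / m
```

In the M-step one would then maximize `monte_carlo_q` over theta (for this model, a Gaussian regression-with-AR(1)-noise likelihood maximization) and iterate.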
In their analyses of the polio data, Zeger (1988) and Chan and Ledolter (1995) included as regression components an intercept, a slope, and harmonics at periods of 6 and 12 months. Specifically, they took
The implementation of Chan and Ledolter’s MCEM method by Kuk and Cheng (1994) gave estimates \( \hat{\beta }=\, \)(0.247, −3.871, 0.162, −0.482, 0.414, −0.011)′, \( \hat{\phi }= 0.648 \), and \( \hat{\sigma }^{2} = 0.281 \). The estimated trend function \( \hat{\beta }'\mathbf{u}_{t} \) is displayed in Figure 9.7. The negative coefficient of t∕1000 indicates a slight downward trend in the monthly number of polio cases.
9.8.2 Observation-Driven Models
Again we assume that Y t , conditional on \( \big(X_{t},\mathbf{X}^{(t-1)},\mathbf{Y}^{(t-1)}\big) \), is independent of \( \big(\mathbf{X}^{(t-1)},\mathbf{Y}^{(t-1)}\big) \). These models are specified by the conditional densities
where \( p\big(x_{1}\vert \mathbf{y}^{(0)}\big):= p_{1}(x_{1}) \) for some prespecified initial density p 1(x 1). The advantage of the observation-driven state equation (9.8.26) is that the posterior distribution of X t given Y (t) can be computed directly from (9.8.4) without the use of the updating formula (9.8.5). This then allows for easy computation of the forecast function in (9.8.7) and hence of the joint density function of (Y 1, …, Y n )′,
On the other hand, the mechanism by which the state X t−1 makes the transition to X t is not explicitly defined. In fact, without further assumptions there may be state sequences {X t } and {X t ∗} with different distributions for which both (9.8.25) and (9.8.26) hold (see Example 9.8.6). Both sequences, however, lead to the same joint distribution, given by (9.8.27), for Y 1, …, Y n . The ambiguity in the specification of the distribution of the state variables can be removed by assuming that X t+1 given \( \left (\mathbf{X}^{(t)},\mathbf{Y}^{(t)}\right ) \) is independent of X (t), with conditional distribution (9.8.26), i.e.,
With this modification, the joint density of Y (n) and X (n) is given by (cf. (9.8.3))
Example 9.8.4.
An AR(1) Process
An AR(1) process with iid noise can be expressed as an observation-driven model. Suppose {Y t } is the AR(1) process
where {Z t } is an iid sequence of random variables with mean 0 and some probability density function f(x). Then with X t : = Y t−1 we have
and
Example 9.8.5.
Suppose the observation-equation density is given by
and the state equation (9.8.26) is
where α t = α + y 1 + ⋯ + y t and λ t = λ + t. It is possible to give a parameter-driven specification that gives rise to the same state equation (9.8.30). Let {X t ∗} be the parameter-driven state variables, where X t ∗ = X t−1 ∗ and X 1 ∗ has a gamma distribution with parameters α and λ. (This corresponds to the model in Example 9.8.2 with π = a = 1.) Then from (9.8.19) we see that \( p\left (x_{t}^{{\ast}}\vert \mathbf{y}^{(t)}\right ) = g(x_{t}^{{\ast}};\alpha _{t},\lambda _{t}) \), which coincides with the state equation (9.8.30). If {X t } are the state variables whose joint distribution is specified through (9.8.28), then {X t } and {X t ∗} cannot have the same joint distributions. To see this, note that
while
If the two sequences had the same joint distribution, then the latter density could take only the values 0 and 1, which contradicts the continuity (as a function of x t ) of this density.
9.8.3 Exponential Family Models
The exponential family of distributions provides a large and flexible class of distributions for use in the observation equation. The density in the observation equation is said to belong to an exponential family (in natural parameterization) if
where b(⋅ ) is a twice continuously differentiable function and c(y t ) does not depend on x t . This family includes the normal, exponential, gamma, Poisson, binomial, and many other distributions frequently encountered in statistics. Detailed properties of the exponential family can be found in Barndorff-Nielsen (1978), and an excellent treatment of its use in the analysis of generalized linear models is given by McCullagh and Nelder (1989). We shall need only the following important facts:
where integration with respect to ν(dy t ) means integration with respect to dy t in the continuous case and summation over all values of y t in the discrete case.
Proof.
The first relation is simply the statement that p(y t | x t ) integrates to 1. The second relation is established by differentiating both sides of (9.8.32) with respect to x t and then multiplying through by \( e^{-b(x_{t})} \) (for justification of the differentiation under the integral sign see Barndorff-Nielsen 1978). The last relation is obtained by differentiating (9.8.32) twice with respect to x t and simplifying. ■
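These identities are easy to verify numerically for a particular member of the family. The sketch below is an illustration added here (not part of the text): it instantiates the Poisson case with b(x) = e^x, which is treated in the next example, and checks that the density sums to 1 and that the conditional mean and variance both equal b′(x) = b″(x) = e^x.

```python
import math

def poisson_exp_family_check(x, tol=1e-9, ymax=200):
    """For the Poisson observation density p(y|x) = exp(x*y - e^x - ln y!),
    verify the exponential-family identities numerically:
    sum_y p(y|x) = 1,  E(Y|x) = b'(x) = e^x,  Var(Y|x) = b''(x) = e^x."""
    b = math.exp(x)  # b(x) = e^x, so b'(x) = b''(x) = e^x
    total = mean = second = 0.0
    for y in range(ymax):  # truncated sum; the tail is negligible for small e^x
        p = math.exp(x * y - b - math.lgamma(y + 1))
        total += p
        mean += y * p
        second += y * y * p
    var = second - mean ** 2
    assert abs(total - 1.0) < tol
    assert abs(mean - b) < tol
    assert abs(var - b) < tol
    return mean, var
```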
Example 9.8.6.
The Poisson Case
If the observation Y t , given X t = x t , has a Poisson distribution of the form (9.8.21), then
which has the form (9.8.31) with \( b(x_{t}) = e^{x_{t}} \) and c(y t ) = −lny t ! . From (9.8.33) we easily find that \( E(Y _{t}\vert x_{t}) = b'(x_{t}) = e^{x_{t}} \). This parameterization is slightly different from the one used in Examples 9.8.2 and 9.8.5, where the conditional mean of Y t given x t was π x t and not \( e^{\,x_{t}} \). For this observation equation, define the family of densities
where α > 0 and λ > 0 are parameters and A(α, λ) = −lnΓ(α) +αlnλ. Now consider state densities of the form
where α t+1 | t and λ t+1 | t are, for the moment, unspecified functions of y (t). (The subscript t + 1 | t on the parameters is a shorthand way to indicate dependence on the conditional distribution of X t+1 given Y (t).) With this specification of the state densities, the parameters α t+1 | t are related to the best one-step predictor of Y t through the formula
Proof.
We have from (9.8.7) and (9.8.33) that
Addition and subtraction of α t+1 | t ∕λ t+1 | t then gives
■
Letting A t | t−1 = A(α t | t−1, λ t | t−1), we can write the posterior density of X t given Y (t) as
where we find, by equating coefficients of x t and b(x t ), that the coefficients λ t and α t are determined by
The family of prior densities in (9.8.37) is called a conjugate family of priors for the observation equation (9.8.35), since the resulting posterior densities are again members of the same family.
As mentioned earlier, the parameters α t | t−1 and λ t | t−1 can be quite arbitrary: Any nonnegative functions of y (t−1) will lead to a consistent specification of the state densities. One convenient choice is to link these parameters with the corresponding parameters of the posterior distribution at time t − 1 through the relations
where 0 < δ < 1 (see Remark 4 below). Iterating the relation (9.8.41), we see that
as t → ∞. Similarly,
For large t, we have the approximations
and
which are exact if λ 1 | 0 = δ∕(1 −δ) and α 1 | 0 = 0. From (9.8.38) the one-step predictors are linear and given by
Replacing the denominator with its limiting value, or starting with λ 1 | 0 = δ∕(1 −δ), we find that \( \hat{Y }_{t+1} \) is the solution of the recursions
with initial condition \( \hat{Y }_{1} = (1-\delta )\delta ^{-1}\alpha _{1\vert 0} \). In other words, under the restrictions of (9.8.41) and (9.8.42), the best one-step predictors can be found by exponential smoothing.
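The equivalence with exponential smoothing can be checked directly. The sketch below is illustrative (the function name is a choice made here); it assumes the gamma-posterior updates α t = α t | t−1 + y t , λ t = λ t | t−1 + 1 together with the discounting relations (9.8.41)–(9.8.42), starts from λ 1 | 0 = δ∕(1 −δ), and computes the predictors \( \hat{Y }_{t+1} \) = α t+1 | t ∕λ t+1 | t .

```python
def one_step_predictors(y, delta, alpha_init=0.0):
    """One-step predictors Yhat_{t+1} = alpha_{t+1|t} / lambda_{t+1|t} under the
    power-steady relations, assuming the updates
        alpha_{t+1|t} = delta * (alpha_{t|t-1} + y_t),
        lambda_{t+1|t} = delta * (lambda_{t|t-1} + 1).
    Starting from lambda_{1|0} = delta/(1-delta) makes lambda constant in t."""
    alpha = alpha_init
    lam = delta / (1.0 - delta)
    preds = []
    for yt in y:
        alpha = delta * (alpha + yt)  # discounted posterior shape parameter
        lam = delta * (lam + 1.0)     # stays at delta/(1-delta), a fixed point
        preds.append(alpha / lam)     # Yhat_{t+1}
    return preds
```

With this starting value the recursion collapses algebraically to \( \hat{Y }_{t+1} = (1-\delta )y_{t} +\delta \hat{Y }_{t} \), i.e., exponential smoothing with smoothing constant 1 −δ.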
Remark 1.
The preceding analysis for the Poisson-distributed observation equation holds, almost verbatim, for the general family of exponential densities (9.8.31). (One only needs to take care in specifying the correct range for x and the allowable parameter space for α and λ in (9.8.37).) The relations (9.8.43)–(9.8.44), as well as the exponential smoothing formula (9.8.48), continue to hold even in the more general setting, provided that the parameters α t | t−1 and λ t | t−1 satisfy the relations (9.8.41)–(9.8.42). □
Remark 2.
Equations (9.8.41)–(9.8.42) are equivalent to the assumption that the prior density of X t given y (t−1) is proportional to the δ-power of the posterior distribution of X t−1 given Y (t−1), or more succinctly that
This power relationship is sometimes referred to as the power steady model (Grunwald et al. 1993; Smith 1979). □
Remark 3.
The transformed state variables \( W_{t} = e^{X_{t}} \) have a gamma state density given by
(see Problem 9.26). The mean and variance of this conditional density are
Remark 4.
If we regard the random walk plus noise model of Example 9.2.1 as the prototypical state-space model, then from the calculations in Example 9.8.1 with G = F = 1, we have
and
The first of these equations implies that the best estimate of the next state is the same as the best estimate of the current state, while the second implies that the variance increases. Under conditions (9.8.41) and (9.8.42), the same is also true for the state variables in the above model (see Problem 9.26). This was, in part, the rationale for these conditions given in Harvey and Fernandes (1989). □
Remark 5.
While the calculations work out neatly for the power steady model, Grunwald et al. (1994) have shown that such processes have degenerate sample paths for large t. In the Poisson example above, they argue that the observations Y t converge to 0 as t → ∞ (see Figure 9.12). Although such models may still be useful in practice for modeling series of moderate length, the efficacy of using such models for describing long-term behavior is doubtful. □
Example 9.8.7.
Goals Scored by England Against Scotland
The time series of the number of goals scored by England against Scotland in soccer matches played at Hampden Park in Glasgow is graphed in Figure 9.8. The matches have been played nearly every second year, with interruptions during the war years. We will treat the data y 1, …, y 52 as coming from an equally spaced time series model {Y t }. Since the number of goals scored is small (see the frequency histogram in Figure 9.9), a model based on the Poisson distribution might be deemed appropriate. The observed relative frequencies and those based on a Poisson distribution with mean equal to \( \bar{y}_{52} = 1.269 \) are contained in Table 9.2. The standard chi-squared goodness of fit test, comparing the observed frequencies with expected frequencies based on a Poisson model, has a p-value of 0.02. The lack of fit with a Poisson distribution is hardly unexpected, since the sample variance (1.652) is much larger than the sample mean, while the mean and variance of the Poisson distribution are equal. In this case the data are said to be overdispersed in the sense that there is more variability in the data than one would expect from a sample of independent Poisson-distributed variables. Overdispersion can sometimes be explained by serial dependence in the data.
Dependence in count data can often be revealed by estimating the probabilities of transition from one state to another. Table 9.3 contains estimates of these probabilities, computed as the relative frequencies of the observed one-step transitions from state y t to state y t+1. If the data were independent, then the entries in each column should be nearly the same. This is certainly not the case in Table 9.3. For example, England is very unlikely to be shut out or to score three or more goals in the next match after scoring at least three goals in the previous encounter.
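Tables of this kind are simple to compute. The sketch below is added here for illustration (the lumping of counts of 3 or more into a single "3+" state mirrors the goal-data table, but the function name and binning are otherwise choices made here): each probability is estimated as the relative frequency of the corresponding one-step transition.

```python
from collections import Counter

def transition_probs(y, max_state=3):
    """Estimate one-step transition probabilities for a count series,
    lumping counts >= max_state into a single top state (e.g., '3+')."""
    s = [min(v, max_state) for v in y]
    pair_counts = Counter(zip(s, s[1:]))          # counts of (state, next state)
    row_totals = Counter(a for a, _ in zip(s, s[1:]))  # transitions out of each state
    return {(a, b): c / row_totals[a] for (a, b), c in pair_counts.items()}
```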
Harvey and Fernandes (1989) model the dependence in this data using an observation-driven model of the type described in Example 9.8.6. Their model assumes a Poisson observation equation and a log-gamma state equation:
for t = 1, 2, …, where f is given by (9.8.36) and α 1 | 0 = 0, λ 1 | 0 = 0. The power steady conditions (9.8.41)–(9.8.42) are assumed to hold for α t | t−1 and λ t | t−1. The only unknown parameter in the model is δ. The log-likelihood function for δ based on the conditional distribution of y 1, …, y 52 given y 1 is given by [see (9.8.27)]
where \( p\left (\,y_{t+1}\vert \mathbf{y}^{(t)}\right ) \) is the negative binomial density [see Problem 9.25(c)]
with α t+1 | t and λ t+1 | t as defined in (9.8.44) and (9.8.43). (For the goal data, y 1 = 0, which implies α 2 | 1 = 0 and hence that \( p\left (\,y_{2}\vert y^{(1)}\right ) \) is a degenerate density with unit mass at y 2 = 0. Harvey and Fernandes avoid this complication by conditioning the likelihood on y (τ), where τ is the time of the first nonzero data value.)
Maximizing this likelihood with respect to δ, we obtain \( \hat{\delta }= 0.844 \). (Starting equations (9.8.43)–(9.8.44) with α 1 | 0 = 0 and λ 1 | 0 = δ∕(1 −δ), we obtain \( \hat{\delta }= 0.732 \).) With 0.844 as our estimate of δ, the prediction density of the next observation Y 53 given y (52) is nb(y 53; α 53 | 52, (1 +λ 53 | 52)−1). The first five values of this distribution are given in Table 9.4. Under this model, the probability that England will be held scoreless in the next match is 0.471. The one-step predictors, \( \hat{Y }_{1} = 0,\hat{Y }_{2},\ldots,\hat{Y }_{52} \), are graphed in Figure 9.10. (This graph can be obtained by using the ITSM option Smooth>Exponential with α = 0.154.)
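A sketch of the likelihood computation (added here; not the authors' code). It assumes the negative binomial form p(y) = Γ(α+y)∕(y! Γ(α)) · λ^α∕(1+λ)^{α+y} from Problem 9.25(c), the discounted parameter recursions of (9.8.41)–(9.8.42) with α 1 | 0 = λ 1 | 0 = 0, and conditions on the data up to and including the first nonzero observation, as Harvey and Fernandes do.

```python
import math

def nb_logpdf(y, alpha, lam):
    """log p(y) for the negative binomial prediction density
    p(y) = Gamma(alpha+y) / (y! Gamma(alpha)) * lam^alpha / (1+lam)^(alpha+y)."""
    return (math.lgamma(alpha + y) - math.lgamma(y + 1) - math.lgamma(alpha)
            + alpha * math.log(lam) - (alpha + y) * math.log(1.0 + lam))

def goal_loglik(y, delta):
    """Conditional log-likelihood of delta: recursions start at
    alpha_{1|0} = lambda_{1|0} = 0, and prediction terms are accumulated only
    after the first nonzero observation (to avoid the degenerate density)."""
    tau = next(i for i, v in enumerate(y) if v > 0)  # index of first nonzero value
    alpha, lam, ll = 0.0, 0.0, 0.0
    for t in range(len(y) - 1):
        alpha = delta * (alpha + y[t])  # discounted posterior parameters
        lam = delta * (lam + 1.0)
        if t >= tau:                    # predict y[t+1] given y[0..t]
            ll += nb_logpdf(y[t + 1], alpha, lam)
    return ll
```

Maximizing `goal_loglik` over a grid of δ values in (0, 1) is then a one-line search.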
Figures 9.11 and 9.12 contain two realizations from the fitted model for the goal data. The general appearance of the first realization is somewhat compatible with the goal data, while the second realization illustrates the convergence of the sample path to 0 in accordance with the result of Grunwald et al. (1994).
Example 9.8.8.
The Exponential Case
Suppose Y t given X t has an exponential density with mean − 1∕X t (X t < 0). The observation density is given by
which has the form (9.8.31) with b(x) = −ln(−x) and c(y) = 0. The state densities corresponding to the family of conjugate priors (see (9.8.37)) are given by
(Here p(x t+1 | y (t)) is a probability density when α t+1 | t > 0 and λ t+1 | t > −1.) The one-step prediction density is
(see Problem 9.28). While E(Y t+1 | y (t)) = α t+1 | t ∕λ t+1 | t , the conditional variance is finite if and only if λ t+1 | t > 1. Under assumptions (9.8.41)–(9.8.42), and starting with λ 1 | 0 = δ∕(1 −δ), the exponential smoothing formula (9.8.48) remains valid.
Problems
-
9.1
Show that if all the eigenvalues of F are less than 1 in absolute value (or equivalently that \( F^{k}\! \rightarrow \! 0 \) as k → ∞), the unique stationary solution of equation (9.1.11) is given by the infinite series
$$ \displaystyle{ \mathbf{X}_{t} =\sum _{ j=0}^{\infty }F^{ j}V _{ t-j-1} } $$and that the corresponding observation vectors are
$$ \displaystyle{ \mathbf{Y}_{t} = \mathbf{W}_{t} +\sum _{ j=0}^{\infty }GF^{ j}\mathbf{V}_{ t-j-1}. } $$Deduce that {(X t ′, Y t ′)′} is a multivariate stationary process. (Hint: Use a vector analogue of the argument in Example 2.2.1.)
-
9.2
In Example 9.2.1, show that θ = −1 if and only if σ v 2 = 0, which in turn is equivalent to the signal M t being constant.
-
9.3
Let F be the coefficient of X t in the state equation (9.3.4) for the causal AR(p) process
$$ \displaystyle{ X_{t} -\phi _{1}X_{t-1} -\cdots -\phi _{p}X_{t-p} = Z_{t},\quad \{Z_{t}\} \sim \mathrm{WN}\left (0,\sigma ^{2}\right ). } $$Establish the stability of (9.3.4) by showing that
$$ \displaystyle{ \det (zI - F) = z^{p}\phi \left (z^{-1}\right ), } $$and hence that the eigenvalues of F are the reciprocals of the zeros of the autoregressive polynomial ϕ(z) = 1 −ϕ 1 z −⋯ −ϕ p z p.
-
9.4
By following the argument in Example 9.3.3, find a state-space model for {Y t } when {∇∇12 Y t } is an ARMA(p, q) process.
-
9.5
For the local linear trend model defined by equations (9.2.6)–(9.2.7), show that ∇2 Y t = (1 − B)2 Y t is a 2-correlated sequence and hence, by Proposition 2.1.1, is an MA(2) process. Show that this MA(2) process is noninvertible if σ u 2 = 0.
- 9.6
-
9.7
Let {Y t } be the MA(1) process
$$ \displaystyle{ Y _{t} = Z_{t} +\theta Z_{t-1},\ \ \{Z_{t}\} \sim \mathrm{WN}\left (0,\sigma ^{2}\right ). } $$Show that {Y t } has the state-space representation
$$ \displaystyle{ Y _{t} = [1\quad 0]\mathbf{X}_{t}, } $$where {X t } is the unique stationary solution of
$$ \displaystyle{ \mathbf{X}_{t+1} = \left [\begin{array}{*{10}c} 0&1\\ 0 &0 \end{array} \right ]\mathbf{X}_{t}+\left [\begin{array}{*{10}c} 1\\ \theta \end{array} \right ]Z_{t+1}. } $$In particular, show that the state vector X t can be written as
$$ \displaystyle{ \mathbf{X}_{t} = \left [\begin{array}{*{10}c} 1& \theta \\ \theta &0 \end{array} \right ]\left [\begin{array}{*{10}c} Z_{t} \\ Z_{t-1} \end{array} \right ]. } $$ -
9.8
Verify equations (9.3.16)–(9.3.18) for an ARIMA(1,1,1) process.
-
9.9
Consider the two state-space models
$$ \displaystyle{ \left \{\begin{array}{@{}l@{\quad }l@{}} \mathbf{X}_{t+1,1}\quad &=F_{1}\mathbf{X}_{t1} + \mathbf{V}_{t1}, \\ \mathbf{Y}_{t1} \quad &=G_{1}\mathbf{X}_{t1} + \mathbf{W}_{t1}, \end{array} \right. } $$and
$$ \displaystyle{ \left \{\begin{array}{@{}l@{\quad }l@{}} \mathbf{X}_{t+1,2}\quad &=F_{2}\mathbf{X}_{t2} + \mathbf{V}_{t2}, \\ \mathbf{Y}_{t2} \quad &=G_{2}\mathbf{X}_{t2} + \mathbf{W}_{t2}, \end{array} \right. } $$where {(V t1′, W t1′, V t2′, W t2′)′} is white noise. Derive a state-space representation for {(Y t1′, Y t2′)′}.
-
9.10
Use Remark 1 of Section 9.4 to establish the linearity properties of the operator P t stated in Remark 3.
- 9.11
-
9.12
In the notation of the Kalman prediction equations, show that every vector of the form
$$ \displaystyle{ \mathbf{Y} = A_{1}\mathbf{X}_{1} + \cdots + A_{t}\mathbf{X}_{t} } $$can be expressed as
$$ \displaystyle{ \mathbf{Y} = B_{1}\mathbf{X}_{1} + \cdots + B_{t-1}\mathbf{X}_{t-1} + C_{t}\mathbf{I}_{t}, } $$where B 1, …, B t−1 and C t are matrices that depend on the matrices A 1, …, A t . Show also that the converse is true. Use these results and the fact that E(X s I t ) = 0 for all s < t to establish (9.4.3).
-
9.13
In Example 9.4.1, verify that the steady-state solution of the Kalman recursions (9.1.2) is given by \( \Omega _{t} = \left (\sigma _{v}^{2} + \sqrt{\sigma _{v }^{4 } + 4\sigma _{v }^{2 }\sigma _{w }^{2}}\right )/2 \).
-
9.14
Show from the difference equations for Ω t in Example 9.4.1 that (Ω t+1 − Ω)(Ω t − Ω) ≥ 0 for all Ω t ≥ 0, where Ω is the steady-state solution for Ω t given in Problem 9.13.
-
9.15
Show directly that for the MA(1) model (9.2.3), the parameter θ is equal to \( -\left (2\sigma _{w}^{2} +\sigma _{ v}^{2} -\sqrt{\sigma _{v }^{4 } + 4\sigma _{v }^{2 }\sigma _{w }^{2}}\right )/\left (2\sigma _{w}^{2}\right ) \), which in turn is equal to −σ w 2∕(Ω +σ w 2), where Ω is the steady-state solution for Ω t given in Problem 9.13.
-
9.16
Use the ARMA(0,1,1) representation of the series {Y t } in Example 9.4.1 to show that the predictors defined by
$$ \displaystyle{ \hat{Y }_{n+1} = aY _{n} + (1 - a)\hat{Y }_{n},\quad n = 1,2,\ldots, } $$where a = Ω∕(Ω +σ w 2), satisfy
$$ \displaystyle{ Y _{n+1} -\hat{ Y }_{n+1} = Z_{n+1} + (1 - a)^{n}\left (Y _{ 0} - Z_{0} -\hat{ Y }_{1}\right ). } $$Deduce that if 0 < a < 1, the mean squared error of \( \hat{Y }_{n+1} \) converges to Ω +σ w 2 for any initial predictor \( \hat{Y }_{1} \) with finite mean squared error.
-
9.17
-
a.
Using equations (9.4.1) and (9.4.16), show that \( \hat{\mathbf{X}}_{t+1} = F_{t}\mathbf{X}_{t\vert t} \).
-
b.
From (a) and (9.4.16) show that X t | t satisfies the recursions
$$ \displaystyle{ \mathbf{X}_{t\vert t} = F_{t-1}\mathbf{X}_{t-1\vert t-1} + \Omega _{t}G_{t}'\Delta _{t}^{-1}(\mathbf{Y}_{ t} - G_{t}F_{t-1}\mathbf{X}_{t-1\vert t-1}) } $$for t = 2, 3, …, with \( \mathbf{X}_{1\vert 1} =\hat{ \mathbf{X}}_{1} + \Omega _{1}G_{1}'\Delta _{1}^{-1}\left (\mathbf{Y}_{1} - G_{1}\hat{\mathbf{X}}_{1}\right ) \).
-
9.18
In Section 9.5, show that for fixed Q ∗, \( -2\ln L\left (\boldsymbol{\mu },Q^{{\ast}},\sigma _{w}^{2}\right ) \) is minimized when \( \boldsymbol{\mu } \) and σ w 2 are given by (9.5.10) and (9.5.11), respectively.
-
9.19
Verify the calculation of Θ t Δ t −1 and Ω t in Example 9.6.1.
-
9.20
Verify that the best estimates of missing values in an AR(p) process are found by minimizing (9.6.11) with respect to the missing values.
-
9.21
Suppose that {Y t } is the AR(2) process
$$ \displaystyle{ Y _{t} =\phi _{1}Y _{t-1} +\phi _{2}Y _{t-2} + Z_{t},\quad \{Z_{t}\} \sim \mathrm{WN}\left (0,\sigma ^{2}\right ), } $$and that we observe Y 1, Y 2, Y 4, Y 5, Y 6, Y 7. Show that the best estimator of Y 3 is
$$ \displaystyle{ \left (\phi _{2}(Y _{1} + Y _{5}) + (\phi _{1} -\phi _{1}\phi _{2})(Y _{2} + Y _{4})\right )/\left (1 +\phi _{ 1}^{2} +\phi _{ 2}^{2}\right ). } $$ -
9.22
Let X t be the state at time t of a parameter-driven model (see (9.8.2)). Show that {X t } is a Markov chain and that (9.8.3) holds.
-
9.23
For the generalized state-space model of Example 9.8.1, show that Ω t+1 = F 2 Ω t | t + Q.
-
9.24
If Y and X are random variables, show that
$$ \displaystyle{ \mathrm{Var}(Y ) = E(\mathrm{Var}(Y \vert X)) + \mathrm{Var}(E(Y \vert X)). } $$ -
9.25
Suppose that Y and X are two random variables such that the distribution of Y given X is Poisson with mean π X, 0 < π ≤ 1, and X has the gamma density g(x; α, λ).
-
a.
Show that the posterior distribution of X given Y also has a gamma density and determine its parameters.
-
b.
Compute E(X | Y ) and Var(X | Y ).
-
c.
Show that Y has a negative binomial density and determine its parameters.
-
d.
Use (c) to compute E(Y ) and Var(Y ).
-
e.
Verify in Example 9.8.2 that \( E\left (Y _{t+1}\vert \mathbf{Y}^{(t)}\right ) =\alpha _{t}\pi /(\lambda _{t+1}-\pi ) \) and Var\( \left (Y _{t+1}\vert \mathbf{Y}^{(t)}\right ) =\alpha _{t}\pi \lambda _{t+1}/(\lambda _{t+1}-\pi )^{2} \).
-
9.26
For the model of Example 9.8.6, show that
-
a.
\( E\left (X_{t+1}\vert \mathbf{Y}^{(t)}\right ) = E\left (X_{t}\vert \mathbf{Y}^{(t)}\right ) \), Var\( \left (X_{t+1}\vert \mathbf{Y}^{(t)}\right ) > \) Var\( \left (X_{t}\vert \mathbf{Y}^{(t)}\right ) \), and
-
b.
the transformed sequence \( W_{t} = e^{X_{t}} \) has a gamma state density.
-
9.27
Let {V t } be a sequence of independent exponential random variables with EV t = t −1 and suppose that {X t , t ≥ 1} and {Y t , t ≥ 1} are the state and observation random variables, respectively, of the parameter-driven state-space system
$$ \displaystyle\begin{array}{rcl} X_{1}& =& V _{1}, {}\\ X_{t}& =& X_{t-1} + V _{t},\quad t = 2,3,\ldots, {}\\ \end{array} $$where the distribution of the observation Y t , conditional on the random variables Y 1, Y 2, …, Y t−1, X t , is Poisson with mean X t .
-
a.
Determine the observation and state transition density functions p(y t | x t ) and p(x t+1 | x t ) in the parameter-driven model for {Y t }.
-
b.
Show, using (9.8.4)–(9.8.6), that
$$ \displaystyle{ p(x_{1}\vert y_{1}) = g(x_{1};y_{1} + 1,2) } $$and
$$ \displaystyle{ p(x_{2}\vert y_{1}) = g(x_{2};y_{1} + 2,2), } $$where g(x; α, λ) is the gamma density function (see Example (d) of Section A.1).
-
c.
Show that
$$ \displaystyle{ p\left (x_{t}\vert \mathbf{y}^{(t)}\right ) = g(x_{ t};\alpha _{t} + t,t + 1) } $$and
$$ \displaystyle{ p\left (x_{t+1}\vert \mathbf{y}^{(t)}\right ) = g(x_{ t+1};\alpha _{t} + t + 1,t + 1), } $$where α t = y 1 + ⋯ + y t .
-
d.
Conclude from (c) that the minimum mean squared error estimates of X t and X t+1 based on Y 1, …, Y t are
$$ \displaystyle{ X_{t\vert t} = \frac{t + Y _{1} + \cdots + Y _{t}} {t + 1} } $$and
$$ \displaystyle{ \hat{X}_{t+1} = \frac{t + 1 + Y _{1} + \cdots + Y _{t}} {t + 1}, } $$respectively.
-
9.28
Let Y and X be two random variables such that Y given X is exponential with mean 1∕X, and X has the gamma density function with
$$ \displaystyle{ g(x;\lambda +1,\alpha ) = \frac{\alpha ^{\lambda +1}x^{\lambda }\exp \{ -\alpha x\}} {\Gamma (\lambda +1)},\quad x > 0, } $$where λ > −1 and α > 0.
-
a.
Determine the posterior distribution of X given Y.
-
b.
Show that Y has a Pareto distribution
$$ \displaystyle{ p(y) = (\lambda +1)\alpha ^{\lambda +1}(y+\alpha )^{-\lambda -2},\quad y > 0. } $$ -
c.
Find the mean and variance of Y. Under what conditions on α and λ does the latter exist?
-
d.
Verify the calculation of \( p\left (y_{t+1}\vert \mathbf{y}^{(t)}\right ) \) and \( E\left (Y _{t+1}\vert \mathbf{y}^{(t)}\right ) \) for the model in Example 9.8.8.
-
9.29
Consider an observation-driven model in which Y t given X t is binomial with parameters n and X t , i.e.,
$$ \displaystyle{ p(y_{t}\vert x_{t}) ={ n\choose y_{t}}x_{t}^{y_{t} }(1 - x_{t})^{n-y_{t} },\quad y_{t} = 0,1,\ldots,n. } $$ -
a.
Show that the observation equation with state variable transformed by the logit transformation W t = ln(X t ∕(1 − X t )) follows an exponential family
$$ \displaystyle{ p(y_{t}\vert w_{t}) =\exp \{ y_{t}w_{t} - b(w_{t}) + c(y_{t})\}. } $$Determine the functions b(⋅ ) and c(⋅ ).
-
b.
Suppose that the state X t has the beta density
$$ \displaystyle{ p(x_{t+1}\vert \mathbf{y}^{(t)}) = f(x_{ t+1};\alpha _{t+1\vert t},\lambda _{t+1\vert t}), } $$where
$$ \displaystyle{ f(x;\alpha,\lambda ) = [B(\alpha,\lambda )]^{-1}x^{\alpha -1}(1 - x)^{\lambda -1},\quad 0 < x < 1, } $$B(α, λ): = Γ(α)Γ(λ)∕Γ(α +λ) is the beta function, and α, λ > 0. Show that the posterior distribution of X t given Y t is also beta and express its parameters in terms of y t and α t | t−1, λ t | t−1.
-
c.
Under the assumptions made in (b), show that \( E{\bigl (X_{t}\vert \mathbf{Y}^{(t)}\bigr )} = E{\bigl (X_{t+1}\vert \mathbf{Y}^{(t)}\bigr )} \) and Var\( {\bigl (X_{t}\vert \mathbf{Y}^{(t)}\bigr )} < \) Var\( {\bigl (X_{t+1}\vert \mathbf{Y}^{(t)}\bigr )} \).
-
d.
Assuming that the parameters in (b) satisfy (9.8.41)–(9.8.42), show that the one-step prediction density \( p{\bigl (y_{t+1}\vert \mathbf{y}^{(t)}\bigr )} \) is beta-binomial,
$$ \displaystyle{ p(y_{t+1}\vert \mathbf{y}^{(t)}) = \frac{B(\alpha _{t+1\vert t} + y_{t+1},\lambda _{t+1\vert t} + n - y_{t+1})} {(n + 1)B(y_{t+1} + 1,n - y_{t+1} + 1)B(\alpha _{t+1\vert t},\lambda _{t+1\vert t})}, } $$and verify that \( \hat{Y }_{t+1} \) is given by (9.8.47).
References
Ansley, C. F., & Kohn, R. (1985). On the estimation of ARIMA models with missing values. In E. Parzen (Ed.), Time series analysis of irregularly observed data. Springer lecture notes in statistics (Vol. 25, pp. 9–37), Springer-Verlag, Berlin, Heidelberg, New York.
Aoki, M. (1987). State space modeling of time series. Berlin: Springer.
Barndorff-Nielsen, O. E. (1978). Information and exponential families in statistical theory. New York: Wiley.
Box, G. E. P., & Jenkins, G. M. (1976). Time series analysis: Forecasting and control (revised edition). San Francisco: Holden-Day.
Brockwell, P. J., & Davis, R. A. (1991). Time series: Theory and methods (2nd ed.). New York: Springer.
Chan, K. S., & Ledolter, J. (1995). Monte Carlo EM estimation for time series models involving counts. Journal of the American Statistical Association, 90, 242–252.
Davis, M. H. A., & Vinter, R. B. (1985). Stochastic modelling and control. London: Chapman and Hall.
Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society B, 39, 1–38.
Grunwald, G. K., Hyndman, R. J., & Hamza, K. (1994). Some properties and generalizations of nonnegative Bayesian time series models (Technical report). Statistics Department, Melbourne University, Parkville, Australia.
Grunwald, G. K., Raftery, A. E., & Guttorp, P. (1993). Prediction rule for exponential family state space models. Journal of the Royal Statistical Society B, 55, 937–943.
Hannan, E. J., & Deistler, M. (1988). The statistical theory of linear systems. New York: Wiley.
Harvey, A. C. (1990). Forecasting, structural time series models and the Kalman filter. Cambridge: Cambridge University Press.
Harvey, A. C., & Fernandes, C. (1989). Time series models for count or qualitative observations. Journal of Business and Economic Statistics, 7, 407–422.
Jones, R. H. (1980). Maximum likelihood fitting of ARMA models to time series with missing observations. Technometrics, 22, 389–395.
Kitagawa, G. (1987). Non-Gaussian state-space modeling of non-stationary time series. Journal of the American Statistical Association, 82 (with discussion), 1032–1063.
Kuk, A. Y. C., & Cheng, Y. W. (1994). The Monte Carlo Newton-Raphson algorithm (Technical report S94-10). Department of Statistics, University of New South Wales, Sydney, Australia.
McCullagh, P., & Nelder, J. A. (1989). Generalized linear models (2nd ed.). London: Chapman and Hall.
Smith, J. Q. (1979). A generalization of the Bayesian steady forecasting model. Journal of the Royal Statistical Society B, 41, 375–387.
Sorenson, H. W., & Alspach, D. L. (1971). Recursive Bayesian estimation using Gaussian sums. Automatica, 7, 465–479.
West, M., & Harrison, P. J. (1989). Bayesian forecasting and dynamic models. New York: Springer.
Wu, C. F. J. (1983). On the convergence of the EM algorithm. Annals of Statistics, 11, 95–103.
Zeger, S. L. (1988). A regression model for time series of counts. Biometrika, 75, 621–629.
© 2016 Springer International Publishing Switzerland
Brockwell, P. J., & Davis, R. A. (2016). State-Space Models. In Introduction to Time Series and Forecasting. Springer Texts in Statistics. Springer, Cham. https://doi.org/10.1007/978-3-319-29854-2_9