Keywords

JEL Classifications

Introduction

This article reviews the derivation of formulas for linear least squares and robust prediction of stationary time series and geometrically discounted distributed leads of such series. The derivations employed are the classical, frequency-domain procedures employed by Whittle (1983) and Whiteman (1983), and result in nearly closed-form expressions. The formulas themselves are useful directly in forecasting, and have also found uses in economic modelling, primarily in macroeconomics. Indeed, Hansen and Sargent (1980) refer to the cross-equation restrictions connecting the time series representation of driving variables to the analogous representation for predicting the present value of such variables as the ‘hallmark of rational expectations models’.

The Wold Representation

Suppose that {xt} is a covariance-stationary stochastic process and assume (without loss of generality) that Ext = 0. Covariance stationarity ensures that first and second unconditional moments of the process do not vary with time. Then, by the Wold decomposition theorem (see Sargent 1987, for an elementary exposition and proof), xt can be represented by:

$$ {x}_t=\sum\limits_{j=0}^{\infty }{a}_j{\varepsilon}_{t-j} $$
(1)

with

$$ {a}_0=1,\sum\limits_{j=0}^{\infty}\kern0.24em {a}_j^2<\infty $$

and

$$ {\varepsilon}_t={x}_t-P\left({x}_t|{x}_{t-1},{x}_{t-2},\dots \right),E{\varepsilon}_t^2={\sigma}^2 $$

where P(xt|xt−1, xt−2, …) denotes the linear least squares projection (population regression) of xt on xt−1, xt−2, … Here, ‘represented by’ need not mean ‘generated by’, but rather ‘has the same variance and covariance structure as’. By construction, the ‘fundamental’ innovation εt is uncorrelated with information dated prior to t, including earlier values of the process itself: \( E{\varepsilon}_t{x}_{t-s}=0 \) for all s > 0. This fact makes the Wold representation very convenient for computing predictions. The convolution in (1) is often written xt = A(L)εt using the polynomial \( A(L)={\sum\limits}_{j=0}^{\infty}\kern0.24em {a}_j{L}^j \) in the ‘lag operator’ L, where \( L{\varepsilon}_t={\varepsilon}_{t-1} \).
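As a concrete illustration (not part of the original exposition; the value a = 0.8 and the truncation length are assumptions), the following sketch builds a truncated Wold representation for an AR(1) process and checks that it reproduces the simulated path and the variance implied by \( {\sigma}^2{\sum}_j{a}_j^2 \):

```python
import numpy as np

# Illustrative sketch (values assumed): the Wold coefficients of the AR(1)
# x_t = a x_{t-1} + eps_t are a_j = a**j, so that A(L) = 1/(1 - aL).
rng = np.random.default_rng(0)
a, sigma, n, J = 0.8, 1.0, 100_000, 200

eps = rng.normal(0.0, sigma, n)
x = np.zeros(n)
for t in range(1, n):
    x[t] = a * x[t - 1] + eps[t]

wold = a ** np.arange(J)                     # truncated Wold weights a_j
x_wold = np.convolve(eps, wold)[:n]          # x_t ~= sum_j a_j eps_{t-j}

print(np.max(np.abs(x[J:] - x_wold[J:])))    # ~ 0: same path up to truncation
print(x.var(), sigma**2 * wold @ wold, sigma**2 / (1 - a**2))  # all ~ 2.78
```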

Squared-Error Loss Optimal Prediction

The optimal prediction problem under squared-error loss can be thought of as follows. Given {xt} with the Wold representation (1) we want to find the stochastic process yt,

$$ {y}_t=\sum\limits_{j=0}^{\infty}\kern0.24em {c}_j{\varepsilon}_{t-j}=C(L){\varepsilon}_t $$

that will minimize the squared forecast error of the h-step ahead prediction

$$ \underset{\left\{{y}_t\right\}}{\min}\;E{\left({x}_{t+h}-{y}_t\right)}^2. $$

Equivalently, the problem can be written as

$$ \underset{\left\{{y}_t\right\}}{\min}\;E{\left({L}^{-h}{x}_t-{y}_t\right)}^2 $$

or

$$ \underset{\left\{{c}_j\right\}}{\min}\;E{\left({L}^{-h}\sum\limits_{j=0}^{\infty}\kern0.24em {a}_j{\varepsilon}_{t-j}-\sum\limits_{j=0}^{\infty}\kern0.24em {c}_j{\varepsilon}_{t-j}\right)}^2. $$
(2)

The problem in (2) involves finding a sequence of coefficients in the Wold representation of the unknown prediction process yt, and is referred to as the time domain problem. By virtue of the Riesz–Fisher theorem (see again Sargent 1987, for an exposition), the time-domain problem is equivalent to a frequency domain problem of finding an analytic function C(z) on the unit disk |z| ≤ 1 corresponding to the ‘z-transform’ of the {cj} sequence

$$ C(z)=\sum\limits_{j=0}^{\infty}\kern0.24em {c}_j{z}^j $$

that solves

$$ \underset{C(z)\in {H}^2}{\min}\frac{1}{2\pi i}\oint {\left|{z}^{-h}A(z)-C(z)\right|}^2\frac{dz}{z} $$
(3)

where H2 denotes the Hardy space of square-integrable analytic functions on the unit disk, and ∮ denotes (counterclockwise) integration about the unit circle. The requirement that C(z) ∈ H2 ensures that the forecast is causal, and contains no future values of the ε’s; this is equivalent to the requirement that C(z) have a well-behaved power series expansion in non-negative powers of z.

Each formulation of the problem is useful, as often one or the other will be simpler to solve. This stems from the fact that convolution in the time domain becomes multiplication in the frequency domain and vice versa. To see this, consider the two sequences \( {\left\{{g}_k\right\}}_{k=-\infty}^{\infty } \) and \( {\left\{{h}_k\right\}}_{k=-\infty}^{\infty } \). The convolution of {gk} and {hk} is the sequence {fk}, in which a typical element would be:

$$ {f}_k=\sum\limits_{j=-\infty}^{\infty}\kern0.24em {g}_j{h}_{k-j}. $$

The z-transform of the convolution is given by

$$ {\displaystyle \begin{array}{ll}\sum\limits \limits_{k=-\infty}^{\infty }{f}_k{z}^k& =\sum\limits \limits_{k=-\infty}^{\infty}\left(\sum\limits \limits_{j=-\infty}^{\infty}\kern0.24em {g}_j{h}_{k-j}\right){z}^k\hfill \\ {}& =\sum\limits \limits_{k=-\infty}^{\infty}\;\sum\limits \limits_{j=-\infty}^{\infty}\kern0.24em {g}_j{z}^j{h}_{k-j}{z}^{k-j}\hfill \\ {}& =\sum\limits \limits_{\left(k-j\right)=-\infty}^{\infty}\;\sum\limits \limits_{j=-\infty}^{\infty}\kern0.24em {g}_j{z}^j{h}_{k-j}{z}^{k-j}\hfill \\ {}& =\sum\limits \limits_{s=-\infty}^{\infty}\;\sum\limits \limits_{j=-\infty}^{\infty}\kern0.24em {g}_j{z}^j{h}_s{z}^s\;\left(\mathrm{Substituting}\;s=k-j\right)\hfill \\ {}& =\sum\limits \limits_{s=-\infty}^{\infty }{h}_s{z}^s\sum\limits \limits_{j=-\infty}^{\infty}\kern0.24em {g}_j{z}^j=g(z)h(z).\hfill \end{array}} $$

Thus the ‘z-transform’ of the convolution of the sequences {gk} and {hk} is the product of the z-transforms of the two sequences.
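A quick numerical check of this duality, using two short sequences whose values are arbitrary assumptions chosen for illustration:

```python
import numpy as np

# Check (with two arbitrary, assumed sequences) that the z-transform of a
# convolution is the product of the z-transforms.
g = np.array([1.0, 0.5, 0.25])            # g(z) = 1 + 0.5z + 0.25z**2
h = np.array([2.0, -1.0, 0.3, 0.1])       # h(z) = 2 - z + 0.3z**2 + 0.1z**3
f = np.convolve(g, h)                     # f_k = sum_j g_j h_{k-j}

ztrans = lambda c, z: sum(ck * z**k for k, ck in enumerate(c))
z = 0.7 * np.exp(1.3j)                    # an arbitrary point with |z| <= 1
print(ztrans(f, z), ztrans(g, z) * ztrans(h, z))   # equal up to rounding
```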

Similarly, the z-transform of the product of two sequences is the convolution of the z-transforms:

$$ \sum\limits_{k=-\infty}^{\infty}\kern0.24em {g}_k{h}_k{z}^k=\frac{1}{2\pi i}\oint g(p)h\left(z/p\right)\frac{dp}{p}. $$

To see why this is the case, note that

$$ g(p)h\left(z/p\right){p}^{-1}=\sum\limits_{j=-\infty}^{\infty}\kern0.24em {g}_j{p}^j\sum\limits_{k=-\infty}^{\infty}\kern0.24em {h}_k{z}^k{p}^{-k-1}, $$

implying

$$ {\displaystyle \begin{array}{ll}\hfill & \frac{1}{2\pi i}\oint g(p)h\left(z/p\right){p}^{-1} dp\\ {}& =\frac{1}{2\pi i}\oint \sum\limits \limits_{j=-\infty}^{\infty}\kern0.5em \sum\limits \limits_{k=-\infty}^{\infty}\kern0.24em {g}_j{h}_k{z}^k{p}^{j-k-1} dp.\hfill \end{array}} $$

But all of the terms vanish except where j = k because

$$ \frac{1}{2\pi i}\oint {z}^k\frac{dz}{z}=0 $$

except when k = 0. To see why, let \( z={e}^{i\theta} \). As θ increases from 0 to 2π, z goes around the unit circle. So, since \( dz=i{e}^{i\theta }d\theta \), we have that

$$ \frac{1}{2\pi i}\oint {z}^k\frac{dz}{z}=\frac{i}{2\pi i}\oint {e}^{i\theta k} d\theta =\left\{\begin{array}{ll}1\hfill & \mathrm{if}\kern0.6em k=0\hfill \\ {}{\left.\frac{1}{2\pi}\frac{1}{i k}{e}^{i\theta k}\right|}_0^{2\pi }=0\hfill & \mathrm{otherwise}.\hfill \end{array}\right. $$

Thus,

$$ \frac{1}{2\pi i}\oint g(p)h\left(z/p\right){p}^{-1} dp=\sum\limits_{j=-\infty}^{\infty}\kern0.24em {g}_j{h}_j{z}^j\frac{1}{2\pi i}\oint \frac{dp}{p}=\sum\limits_{j=-\infty}^{\infty}\kern0.24em {g}_j{h}_j{z}^j $$

by Cauchy’s Integral formula.

The frequency domain formulas can now be used to calculate moments quickly and conveniently. Consider \( {Ex}_t^2 \):

$$ {Ex}_t^2=E{\left(A(L){\varepsilon}_t\right)}^2=E{\left(\sum\limits_{j=0}^{\infty }{a}_j{\varepsilon}_{t-j}\right)}^2={\sigma}_{\varepsilon}^2\sum\limits_{j=0}^{\infty }{a}_j^2. $$
(4)

The result in Eq. (4) comes from the fact that \( E{\varepsilon}_t{\varepsilon}_{t-s}=0 \) for all s ≠ 0. Using the product-convolution relation, we see that

$$ \sum\limits_{j=0}^{\infty }{a}_j^2={\left.\sum\limits_{j=0}^{\infty }{a}_j^2{z}^j\right|}_{z=1}={\left.\frac{1}{2\pi i}\oint A(p)A\left(z/p\right)\frac{dp}{p}\right|}_{z=1}=\frac{1}{2\pi i}\oint A(p)A\left({p}^{-1}\right)\frac{dp}{p}=\frac{1}{2\pi i}\oint {\left|A(z)\right|}^2\frac{dz}{z}. $$
(5)
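A numerical sketch of (5) for an assumed AR(1) example with a = 0.8: the time-domain sum, the average of |A(z)|² around the unit circle, and the closed form 1/(1 − a²) all agree.

```python
import numpy as np

# Sketch of (5) for an assumed AR(1) with a = 0.8, A(z) = 1/(1 - a z):
# sum_j a_j**2 equals the average of |A(e^{i w})|**2 around the unit circle.
a = 0.8
A = lambda z: 1.0 / (1.0 - a * z)

time_domain = sum(a ** (2 * j) for j in range(2000))          # sum_j a_j**2
w = np.linspace(0.0, 2.0 * np.pi, 100_000, endpoint=False)
freq_domain = np.mean(np.abs(A(np.exp(1j * w))) ** 2)

print(time_domain, freq_domain, 1.0 / (1.0 - a**2))           # all ~ 2.7778
```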

Returning to the prediction problem, the task is to choose c0, c1, c2, … to

$$ \underset{\left\{{c}_j\right\}}{\min}\frac{1}{2\pi i}\oint {\left|{z}^{-h}A(z)-\sum\limits_{j=0}^{\infty}\kern0.24em {c}_j{z}^j\right|}^2\frac{dz}{z}. $$
(6)

The first-order conditions for the optimization in expression (6) are

$$ {\displaystyle \begin{array}{ll}0& =\frac{1}{2\pi i}\oint \left\{{z}^j\left[{z}^hA\left({z}^{-1}\right)-C\left({z}^{-1}\right)\right]+{z}^{-j}\left[{z}^{-h}A(z)-C(z)\right]\right\}\frac{dz}{z}\hfill \\ {}& =\frac{1}{2\pi i}\oint {z}^{-j}\left[{z}^{-h}A(z)-C(z)\right]\frac{dz}{z}-\frac{1}{2\pi i}\oint {p}^{-j}\left[{p}^{-h}A(p)-C(p)\right]\frac{dp}{p}\hfill \end{array}} $$
(7)

for j = 0, 1, 2, … , where the second integral is the result of a change of variable \( p={z}^{-1} \) so that \( dp=-{z}^{-2} dz \), resulting in

$$ \frac{dp}{p}=z\left(-{z}^{-2} dz\right)=-\frac{dz}{z}. $$

The result is that in the second integral, the direction of the contour integration is clockwise. Multiplying by −1 and integrating counterclockwise, the second integral becomes identical to the first, and we can write the set of first-order conditions as

$$ 0=\frac{1}{\pi i}\oint {z}^{-j}\left[{z}^{-h}A(z)-C(z)\right]\frac{dz}{z},\kern1em j=0,1,2,\dots $$
(8)

Define F(z) such that

$$ F(z)={z}^{-h}A(z)-C(z)=\sum\limits_{j=-\infty}^{\infty }{F}_j{z}^j. $$

From Eq. (8), it must be the case that all coefficients on non-negative powers of z equal zero:

$$ {F}_j=0,\kern0.62em j=0,1,2,\dots . $$

Multiplying by \( {z}^j \) and summing over all j = 0, ±1, ±2, … , we obtain

$$ F(z)=\sum\limits_{-\infty}^{-1} $$
(9)

where the term on the right-hand-side of (9) represents an unknown function in negative powers of z. Thus

$$ {z}^{-h}A(z)-C(z)=\sum\limits_{-\infty}^{-1}\kern0.36em , $$

which is an example of a ‘Wiener–Hopf’ equation. Now apply the (linear) ‘plussing’ operator, [⋅]+, which means ‘ignore negative powers of z’. The unknown function in negative powers of z is ‘annihilated’ by this operation, resulting in

$$ {\displaystyle \begin{array}{ll}C(z)& ={\left[{z}^{-h}A(z)\right]}_{+}\hfill \\ {}& ={\left[{z}^{-h}{a}_0+{z}^{-h+1}{a}_1+{z}^{-h+2}{a}_2+\dots \right]}_{+}\hfill \\ {}& =\left[{z}^0{a}_h+{z}^1{a}_{h+1}+{z}^2{a}_{h+2}+\dots \right]\hfill \\ {}& =\sum\limits_{j=h}^{\infty }{a}_j{z}^{j-h}\hfill \\ {}& ={z}^{-h}A(z)- pr\left[{z}^{-h}A(z)\right]\hfill \end{array}} $$

where \( pr\left[{z}^{-h}A(z)\right] \) is the principal part of the Laurent expansion of \( {z}^{-h}A(z) \) about z = 0. (The principal part of the Laurent expansion about z = 0 is the part involving negative powers of z.) This provides a very simple formula for computing forecasts.

AR(1) Example

Suppose that xt = axt−1 + εt. This means that A(z) = 1/(1 − az). In this case:

$$ {\displaystyle \begin{array}{ll}C(z)& ={\left[{z}^{-h}A(z)\right]}_{+}\hfill \\ {}& ={\left[{z}^{-h}\left(1+ az+{a}^2{z}^2+\dots \right)\right]}_{+}\hfill \\ {}& ={a}^h\left(1+ az+{a}^2{z}^2+\dots \right)\hfill \\ {}& =\frac{a^h}{\left(1- az\right)}\hfill \end{array}} $$

and the least squares loss predictor of xt+h using information dated t and earlier is

$$ {P}_t^{LS}{x}_{t+h}={y}_t=C(L){\varepsilon}_t=C(L){A}^{-1}(L){x}_t={a}^h{x}_t. $$

The forecast error is

$$ {x}_{t+h}-{a}^h{x}_t={\varepsilon}_{t+h}+a{\varepsilon}_{t+h-1}+\dots \kern0.5em +{a}^{h-1}{\varepsilon}_{t+1}, $$

which is serially correlated (for h ≥ 2), but not correlated with information dated t and earlier.
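A simulation sketch of this example (the values a = 0.8, h = 3 and the truncation length are assumptions): dropping the first h Wold coefficients, which is all the annihilation operation does here, reproduces the closed-form forecast \( {a}^h{x}_t \).

```python
import numpy as np

# Sketch (a = 0.8, h = 3 and the truncation are assumed): the annihilation
# operator applied to z**(-h) A(z) just drops the first h Wold coefficients,
# and the resulting forecast coincides with the closed form a**h * x_t.
a, h, J, n = 0.8, 3, 200, 50_000
eps = np.random.default_rng(1).normal(size=n)

a_coeffs = a ** np.arange(J)                 # Wold coefficients a_j = a**j
x = np.convolve(eps, a_coeffs)[:n]           # x_t = A(L) eps_t (truncated)

c = a_coeffs[h:]                             # c_j = a_{j+h}: [z**(-h) A(z)]_+
y = np.convolve(eps, c)[:n]                  # y_t = C(L) eps_t

print(np.allclose(y[J:], a**h * x[J:]))      # True
```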

MA(1) Example

Suppose that \( {x}_t={\varepsilon}_t-\alpha {\varepsilon}_{t-1} \), meaning A(z) = 1 − αz. Thus,

$$ C(z)={\left[{z}^{-h}A(z)\right]}_{+}={\left[{z}^{-h}\left(1-\alpha z\right)\right]}_{+}=\left\{\begin{array}{l}-\alpha \kern0.62em \mathrm{if}\kern0.62em h=1,\hfill \\ {}0\kern0.62em \mathrm{otherwise}.\hfill \end{array}\right. $$

So, the best one-step ahead predictor is

$$ -{\alpha \varepsilon}_t=-\alpha \left(1+\alpha L+{\alpha}^2{L}^2+\dots \right){x}_t $$

and the best predictor for forecasts of horizon two or more is exactly zero. For two-step-ahead (and beyond) prediction, the forecast error is xt+h itself, which is serially correlated but not correlated with information dated t and earlier.

Least Squares Prediction of Geometric Distributed Leads

A prediction problem that characterizes many models in economics involves the expectation of a discounted value. Perhaps the most common and widely studied example is the present value formula for stock prices. Abstracting from mean and trend, suppose the dividend process has a Wold representation given by

$$ {d}_t=\sum\limits_{j=0}^{\infty }{q}_j{\varepsilon}_{t-j}=q(L){\varepsilon}_t\kern0.86em E\left({\varepsilon}_t\right)=0,\kern0.86em E\left({\varepsilon}_t^2\right)=1. $$
(10)

Assuming that the constant discount factor is given by γ, we have the present value formula

$$ {p}_t={E}_t\sum\limits_{j=0}^{\infty }{\gamma}^j{d}_{t+j}={E}_t\left(\frac{q(L)}{1-\gamma {L}^{-1}}{\varepsilon}_t\right)={E}_t\left({p}_t^{\ast}\right). $$
(11)

The least-squares minimization problem the predictor faces is to find a stochastic process pt to minimize the expected squared prediction error \( E{\left({p}_t-{p}_t^{\ast}\right)}^2 \). In terms of the information known at date t, the agent’s task is to find a linear combination of current and past dividends, or, equivalently, of current and past dividend innovations εt, that is ‘close’ to \( {p}_t^{\ast } \). Writing pt = f(L)εt, the problem becomes one of finding the coefficients fj in f(L) = f0 + f1L + f2L2 + … to minimize \( E{\left(f(L){\varepsilon}_t-{p}_t^{\ast}\right)}^2. \) Using the method described in the previous section, the problem has an equivalent, frequency-domain representation

$$ \underset{f(z)\in {H}^2}{\min}\frac{1}{2\pi i}\oint {\left|\frac{q(z)}{1-\gamma {z}^{-1}}-f(z)\kern0.5em \right|}^2\frac{dz}{z}. $$
(12)

The first-order conditions for choosing fj are, after employing the same simplification used in (7),

$$ -\frac{2}{2\pi i}\oint {z}^{-j}\left[\frac{q(z)}{1-\gamma {z}^{-1}}-f(z)\right]\frac{dz}{z}=0,\kern1em j=0,1,2,\dots . $$
(13)

Now define

$$ H(z)=\frac{q(z)}{1-\gamma {z}^{-1}}-f(z) $$

so that (13) becomes

$$ -\frac{2}{2\pi i}\oint {z}^{-j}H(z)\frac{dz}{z}=0. $$

Then multiplying by \( {z}^j \) and summing over all j = 0, ±1, ±2, … as above, we obtain

$$ H(z)=\frac{q(z)}{1-\gamma {z}^{-1}}-f(z)=\sum\limits_{-\infty}^{-1}, $$

the Wiener–Hopf equation for this problem. Applying the plussing operator to both sides yields

$$ {\left[\frac{q(z)}{1-\gamma {z}^{-1}}\right]}_{+}-{\left[f(z)\right]}_{+}=0 $$

implying

$$ f(z)={\left[\frac{q(z)}{1-\gamma {z}^{-1}}\right]}_{+}={\left[\frac{zq(z)}{z-\gamma}\right]}_{+} $$

because f(z) is, by construction, one-sided in non-negative powers of z. As in the previous section,

$$ {\left[A(z)\right]}_{+}=A(z)-P(z) $$

where P(z) is the principal part of the Laurent series expansion of A(z). To determine the principal part of \( {\left(z-\gamma \right)}^{-1} zq(z) \), note that zq(z) has a well-behaved power series expansion about z = γ, where ‘well-behaved’ means ‘involving no negative powers of (z − γ)’. Thus \( {\left(z-\gamma \right)}^{-1} zq(z) \) has a power series expansion about z = γ involving a single term in \( {\left(z-\gamma \right)}^{-1} \):

$$ \left(\frac{zq(z)}{z-\gamma}\right)=\frac{b_{-1}}{z-\gamma }+{b}_0+{b}_1{\left(z-\gamma \right)}^1+{b}_2{\left(z-\gamma \right)}^2+\dots . $$

The principal part here is the part involving negative powers of (z − γ): \( {b}_{-1}{\left(z-\gamma \right)}^{-1} \). To determine it, multiply both sides by (z − γ) and evaluate what is left at z = γ to find b−1 = γq(γ). Thus

$$ f(z)={\left[\frac{q(z)}{1-\gamma {z}^{-1}}\right]}_{+}={\left[\frac{zq(z)}{z-\gamma}\right]}_{+}=\frac{zq(z)-\gamma q\left(\gamma \right)}{z-\gamma }. $$
(14)

The ‘cross-equation restrictions’ of rational expectations refer to the connection between the serial correlation structure of the driving process (here dividends) and the serial correlation structure of the expected discounted value of the driving process (here prices). That is, when dividends are characterized by q(z), prices are characterized by f(z), and f(z) depends upon q(z) as depicted in (14).
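Read coefficient by coefficient, (14) says that the weight on εt−j in pt is the discounted tail sum \( {\sum}_{k\ge 0}{\gamma}^k{q}_{j+k} \). A small numerical check of this reading, with an assumed (arbitrary) square-summable sequence qj and γ = 0.95:

```python
import numpy as np

# Sketch (q_j and gamma assumed for illustration): the coefficients of (14)
# are the discounted tail sums f_j = sum_k gamma**k q_{j+k}.
gamma = 0.95
q = 0.9 ** np.arange(400) * np.cos(0.3 * np.arange(400))   # square-summable q_j

f = np.array([np.sum(gamma ** np.arange(len(q) - j) * q[j:]) for j in range(200)])

series = lambda c, z: sum(ck * z**k for k, ck in enumerate(c))
for z in [0.3, -0.5 + 0.2j, 0.7j]:
    closed_form = (z * series(q, z) - gamma * series(q, gamma)) / (z - gamma)
    print(np.abs(series(f, z) - closed_form))              # ~ 0 at each test point
```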

To illustrate how the formula works, suppose detrended dividends are described by a first-order autoregression; that is, that q(L) = (1 − ρL)−1. Then

$$ {p}_t=f(L){\varepsilon}_t=\frac{Lq(L)-\gamma q\left(\gamma \right)}{L-\gamma }{\varepsilon}_t=\left(\frac{1}{1-\rho \gamma}\right){d}_t. $$
(15)
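One way to see how (15) follows from (14): with q(z) = (1 − ρz)−1 and hence q(γ) = (1 − ργ)−1,

$$ f(z)=\frac{zq(z)-\gamma q\left(\gamma \right)}{z-\gamma }=\frac{1}{z-\gamma}\left[\frac{z}{1-\rho z}-\frac{\gamma }{1-\rho \gamma}\right]=\frac{z\left(1-\rho \gamma \right)-\gamma \left(1-\rho z\right)}{\left(z-\gamma \right)\left(1-\rho z\right)\left(1-\rho \gamma \right)}=\frac{z-\gamma }{\left(z-\gamma \right)\left(1-\rho z\right)\left(1-\rho \gamma \right)}=\frac{q(z)}{1-\rho \gamma }, $$

so the factor (z − γ) cancels and pt = f(L)εt = dt/(1 − ργ), as in (15).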

It is instructive to note that, while the pricing formula (15) makes pt the best least squares predictor of \( {p}_t^{\ast } \), the prediction errors \( {p}_t-{p}_t^{\ast } \) will not be serially uncorrelated. Indeed

$$ {\displaystyle \begin{array}{ll}{p}_t-{p}_t^{\ast }& =\left\{\frac{Lq(L)-\gamma q\left(\gamma \right)}{L-\gamma }-\frac{q(L)}{1-\gamma {L}^{-1}}\right\}{\varepsilon}_t\hfill \\ {}& =\frac{-\gamma q\left(\gamma \right)}{L-\gamma }{\varepsilon}_t=-\gamma q\left(\gamma \right)\frac{L^{-1}}{1-\gamma {L}^{-1}}{\varepsilon}_t\hfill \\ {}& =-\gamma q\left(\gamma \right)\left\{{\varepsilon}_{t+1}+{\gamma \varepsilon}_{t+2}+{\gamma}^2{\varepsilon}_{t+3}+\dots \right\}.\hfill \end{array}} $$

Thus the prediction errors will be described by a highly persistent (γ is close to unity) first-order autoregression. But because this autoregression involves future εt’s, the serial correlation structure of the errors cannot be exploited to improve the quality of the prediction of \( {p}_t^{\ast } \). The reason is that the predictor ‘knows’ the model for price setting (the present value formula) and the dividend process; the best predictor \( {p}_t={E}_t{p}_t^{\ast } \) of \( {p}_t^{\ast } \) ‘tolerates’ the serial correlation because the (correct) model implies that it involves future εt’s and therefore cannot be predicted. If one only had data on the errors (and did not know the model that generated them), they would appear (rightly) to be characterized by a first-order autoregression; fitting an AR(1) (that is, the best linear model) and using it to ‘adjust’ pt by accounting for the serial correlation in the errors \( {p}_t-{p}_t^{\ast } \) would decrease the quality of the estimate of \( {p}_t^{\ast } \). The reason is the usual one that the Wold representation for \( {p}_t-{p}_t^{\ast } \) is not the economic model of \( {p}_t-{p}_t^{\ast } \), and (correct) models always beat Wold representations. This also serves as a reminder of circumstances under which one should be willing to tolerate serially correlated errors: when one knows the model that generated them, and the model implies that they are as small as they can be made.

Robust Optimal Prediction of Time Series

The squared-error loss function employed to this point is appropriate for situations in which the model (either the time series model or the economic model) is thought to be correct. But in many settings the forecaster or model builder may wish to guard against the possibility of misspecification. There are many ways to do this; an approach popular in the engineering literature and recently introduced into the economics literature by Hansen and Sargent (2007) involves behaving so as to minimize the maximum loss sustainable by using an approximating model when the truth may be something else. The ‘robust’ approach to this involves replacing the squared-error loss problem

$$ \underset{\left\{C(z)\right\}}{\min}\frac{1}{2\pi i}\oint {\left|{z}^{-h}A(z)-C(z)\right|}^2\frac{dz}{z} $$

with the ‘min-max’ problem

$$ \underset{\left\{C\left(\mathrm{z}\right)\right\}}{\min}\kern0.24em \underset{\mid z\mid =1}{\sup }{\left|{z}^{-h}A(z)-C(z)\right|}^2, $$

so that minimizing the ‘average’ value on the unit circle has been replaced by minimizing the max. This problem can also be written

$$ \underset{\left\{C(z)\right\}}{\min}\kern0.24em \underset{\mid z\mid =1}{\sup }{\left|A(z)-{z}^hC(z)\right|}^2. $$

This is known as the ‘minimum norm interpolation problem’ and amounts to finding a function φ(z) to

$$ \min {\left\Vert \varphi (z)\right\Vert}_{\infty } $$

subject to the restriction that the power series expansion of φ(z) matches that of A(z) in the coefficients on \( {z}^0,{z}^1,\dots, {z}^{h-1} \). This means that the following must hold:

$$ \sum\limits_{j=0}^{h-1}{\varphi}_j{z}^j=\sum\limits_{j=0}^{h-1}{a}_j{z}^j. $$
(16)

Theorem 1

The minimizing φ(z) function is such that \( {\left|\varphi (z)\right|}^2 \) is constant on |z| = 1. Moreover,

$$ \varphi (z)=M\prod\limits_{j=1}^{h-1}\frac{z-{\alpha}_j}{1-{\overline{\alpha}}_jz} $$

where \( M,{\alpha}_1,{\alpha}_2,\dots, {\alpha}_{h-1} \) are chosen to ensure that (16) holds.

Proof: see Nehari (1957).

To see that φ(z) must be of the indicated form, note that the ‘Blaschke factors’ in the product have unit modulus:

$$ \frac{z-{\alpha}_j}{1-{\overline{\alpha}}_jz}\left(\frac{z^{-1}-{\overline{\alpha}}_j}{1-{\alpha}_j{z}^{-1}}\right)=\left(\frac{z-{\alpha}_j}{1-{\overline{\alpha}}_jz}\right)\left({z}^{-1}z\right)\left(\frac{z^{-1}-{\overline{\alpha}}_j}{1-{\alpha}_j{z}^{-1}}\right)=\left(\frac{1-{\alpha}_j{z}^{-1}}{1-{\overline{\alpha}}_jz}\right)\left(\frac{1-{\overline{\alpha}}_jz}{1-{\alpha}_j{z}^{-1}}\right)=1, $$

so that |φ(z)|2 = M2.

In the general h-step-ahead prediction problem, we have that

$$ \varphi (z)=M\prod\limits_{j=1}^{h-1}\frac{z-{\alpha}_j}{1-{\overline{\alpha}}_jz}=A(z)-{z}^hC(z), $$

meaning that

$$ C(z)=\frac{1}{z^h}\left(A(z)-M\prod\limits_{j=1}^{h-1}\frac{z-{\alpha}_j}{1-{\overline{\alpha}}_jz}\right). $$

This is analogous to the solution in the least-squares case, but, instead of subtracting the principal part of \( {z}^{-h}A(z) \), we subtract a different function from \( {z}^{-h}A(z) \). Note also that because

$$ M\prod\limits_{j=1}^{h-1}\frac{z-{\alpha}_j}{1-{\overline{\alpha}}_jz} $$

matches the power series expansion of A(z) up to the power \( {z}^{h-1} \), C(z) is of the form

$$ C(z)={c}_0+{c}_1z+{c}_2{z}^2+\dots $$

Finally, note that the forecast error is serially uncorrelated because \( {\left|\varphi (z)\right|}^2 \) is constant on |z| = 1.

Example. AR(1)

Let

$$ A(z)=\frac{1}{1- az}. $$

For h = 1, we see that φ(z) = A(z) − zC(z) must be constant on |z| = 1, and that φ(0) = A(0) = 1. Thus, φ(z) = M = 1, so that

$$ C(z)=\frac{A(z)-1}{z}=\frac{az}{\left(1- az\right)z}=\frac{a}{1- az}, $$

which implies that the robust one-step ahead forecast is

$$ {y}_t^R={ax}_t, $$

which coincides with the best least-squares forecast. This equivalence between the robust and least-squares one-step ahead forecasts is to be expected because the best one-step-ahead least-squares forecast also has serially uncorrelated errors. For h = 2, we have that

$$ \varphi (z)=\frac{M\left(z-\alpha \right)}{1-\overline{\alpha}z} $$

where (again) φ(0) = 1, but now we also see that φ′(0) = a. Thus,

$$ \varphi (0)=1=-\alpha M\Rightarrow M=-\frac{1}{\alpha }, $$

and furthermore

$$ {\left.{\varphi}^{\prime }(0)=a=\frac{\left(1-\overline{\alpha}z\right)M-M\left(z-\alpha \right)\left(-\overline{\alpha}\right)}{{\left(1-\overline{\alpha}z\right)}^2}\right|}_{z=0}=M-M\left(\alpha \overline{\alpha}\right)=M\left(1-\alpha \overline{\alpha}\right). $$

Therefore, the solution will have the property that

$$ a=-\frac{1}{\alpha}\left(1-\alpha \overline{\alpha}\right)\kern0.5em \Rightarrow \kern0.5em -a\alpha =1-\alpha \overline{\alpha}\kern0.5em \Rightarrow \kern0.5em 0=1+a\alpha -\alpha \overline{\alpha}. $$

That is, the two roots have product −1, so their moduli are reciprocals. Notice that the discriminant is positive \( \left({a}^2+4>0\right) \), meaning that we will always have a real solution, and we choose the root with |α| < 1. Then, we have that

$$ {\displaystyle \begin{array}{ll}C(z)& =\frac{1}{z^2}\left[\frac{1}{1- az}-\frac{M\left(z-\alpha \right)}{1-\alpha z}\right]\hfill \\ {}& =\frac{1}{z^2}\frac{1-\alpha z-\left(1- az\right)\left(1-\frac{1}{\alpha }z\right)}{\left(1- az\right)\left(1-\alpha z\right)}\hfill \\ {}& =\frac{1-\alpha z-1+ az+\frac{1}{\alpha }z-\frac{a}{\alpha }{z}^2}{z^2\left(1- az\right)\left(1-\alpha z\right)}\hfill \\ {}& =\frac{-\frac{a}{\alpha }}{\left(1- az\right)\left(1-\alpha z\right)}.\hfill \end{array}} $$

where the terms linear in z in the numerator cancel because the quadratic implies \( a=\alpha -1/\alpha \). So, the robust prediction is given by

$$ {P}_t^R{x}_{t+2}=-\frac{a}{\alpha}\sum\limits_{j=0}^{\infty}\kern0.24em {\alpha}^j{x}_{t-j}, $$

in contrast to the least-squares prediction

$$ {P}_t^{LS}{x}_{t+2}={a}^2{x}_t. $$
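A simulation sketch of this comparison (a = 0.8, the sample size, and the helper calculations are assumptions for illustration): the robust two-step error is approximately white noise with variance 1/α², while the least squares error has the smaller variance 1 + a² but is serially correlated.

```python
import numpy as np

# Simulation sketch (a = 0.8 and sample sizes assumed; helper computations are
# illustrative): the robust two-step error is ~white noise with variance
# 1/alpha**2, the least squares error has smaller variance 1 + a**2 but is
# serially correlated.
a = 0.8
roots = np.roots([1.0, -a, -1.0])                 # alpha**2 - a*alpha - 1 = 0
alpha = float(np.real(roots[np.abs(roots) < 1][0]))

rng = np.random.default_rng(2)
eps = rng.normal(size=100_000)
x = np.zeros_like(eps)
for t in range(1, len(x)):
    x[t] = a * x[t - 1] + eps[t]

J = 200
w_rob = -(a / alpha) * alpha ** np.arange(J)      # -(a/alpha) * alpha**j
robust = np.convolve(x, w_rob)[: len(x)]          # robust forecast of x_{t+2}
ls = a**2 * x                                     # least squares forecast

e_rob, e_ls = x[2:] - robust[:-2], x[2:] - ls[:-2]
lag1 = lambda e: np.corrcoef(e[1:], e[:-1])[0, 1]
print(e_ls.var(), lag1(e_ls))      # ~ 1 + a**2 = 1.64, lag-1 corr ~ a/(1+a**2)
print(e_rob.var(), lag1(e_rob))    # ~ 1/alpha**2 ~ 2.18, lag-1 corr ~ 0
```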

Example. MA(1)

Suppose that the process follows an MA(1), \( {x}_t={\varepsilon}_t-\beta {\varepsilon}_{t-1} \), and therefore A(z) = 1 − βz. The analysis from the previous example still holds, and all of the following are true:

$$ \varphi (z)=\frac{M\left(z-\alpha \right)}{1-\overline{\alpha}z} $$

while

$$ \varphi (0)=1=-\alpha M\kern0.36em \Rightarrow \kern0.36em M=-{\alpha}^{-1} $$

and

$$ {\varphi}^{\prime }(0)=-\beta =-\frac{1}{\alpha}\left(1-\alpha \overline{\alpha}\right). $$

Therefore,

$$ 0=1-\alpha \beta -\alpha \overline{\alpha}, $$

meaning that, again, we have real roots which are reciprocal pairs and we can choose |α| < 1. Of course, α will depend upon the value of β, and we write α(β). Thus

$$ {\displaystyle \begin{array}{ll}C(z)& =\frac{1}{z^2}\left[1-\beta z-\frac{M\left(z-\alpha \left(\beta \right)\right)}{1-\alpha \left(\beta \right)z}\right]\hfill \\ {}& =\frac{1}{z^2}\left[\frac{\left(1-\beta z\right)\left(1-\alpha \left(\beta \right)z\right)-M\left(z-\alpha \left(\beta \right)\right)}{1-\alpha \left(\beta \right)z}\right]\hfill \\ {}& =\frac{1}{z^2}\left[\frac{1-\beta z-\alpha \left(\beta \right)z+\beta \alpha \left(\beta \right){z}^2- Mz+M\alpha \left(\beta \right)}{1-\alpha \left(\beta \right)z}\right]\hfill \\ {}& =\frac{\beta \alpha \left(\beta \right)}{1-\alpha \left(\beta \right)z}.\hfill \end{array}} $$

Therefore, we have the robust prediction

$$ {P}_t^R{x}_{t+2}=\frac{\beta \alpha \left(\beta \right)}{1-\alpha \left(\beta \right)L}{\varepsilon}_t=\frac{\beta \alpha \left(\beta \right)}{1-\alpha \left(\beta \right)L}\left[{x}_t+\beta {x}_{t-1}+{\beta}^2{x}_{t-2}+\dots \right], $$

while the least-squares prediction is the standard

$$ {P}_t^{LS}{x}_{t+2}=0. $$

Robust Prediction of Geometric Distributed Leads

Following the excellent treatment in Kasa (2001), a robust present-value predictor fears that dividends may not be generated by the process in (10), and so, instead of choosing an f(z) to minimize the average loss around the unit circle, chooses f(z) to minimize the maximum loss:

$$ \underset{f(z)\in {H}^{\infty }}{\min}\kern0.24em \underset{\mid z\mid =1}{\sup }{\left|\frac{q(z)}{1-\gamma {z}^{-1}}-f(z)\right|}^2\iff \underset{f(z)\in {H}^{\infty }}{\min}\kern0.24em \underset{\mid z\mid =1}{\sup }{\left|\frac{zq(z)}{z-\gamma }-f(z)\right|}^2. $$

Unlike in the least squares case (12), where f(z) was restricted to the class H2 of functions square integrable on the unit circle, the restriction now is to the class H∞ of functions with finite maximum modulus on the unit circle, and the H2 norm has been replaced by the H∞ norm.

To begin the solution process, note that there is considerable freedom in designing the minimizing function f(z): it must be well-behaved (that is, must have a convergent power series in non-negative powers of z on the unit disk), but is otherwise unrestricted. Recalling the Laurent expansion

$$ \frac{zq(z)}{z-\gamma }=\frac{b_{-1}}{z-\gamma }+{b}_0+{b}_1\left(z-\gamma \right)+{b}_2{\left(z-\gamma \right)}^2+\dots, $$

while in the least squares case f(z) was set to ‘cancel’ all the terms of this series except the first, here f(z) will be set to do something else. Now define the Blaschke factor Bγ(z) = (z − γ)/(1 − γz) and note that, because of the unit modulus condition, the problem can be written

$$ \underset{\left\{f(z)\right\}}{\min}\kern0.24em \underset{\mid z\mid =1}{\sup }{\left|\frac{zq(z)}{1-\gamma z}-\frac{z-\gamma }{1-\gamma z}f(z)\right|}^2. $$

Defining

$$ T(z)=\frac{zq(z)}{1-\gamma z} $$

we have

$$ \underset{f\in {H}^{\infty }}{\min}\kern0.24em \underset{\mid z\mid =1}{\sup }{\left|T(z)-{B}_{\gamma }(z)f(z)\right|}^2\iff \underset{f\in {H}^{\infty }}{\min }{\left\Vert T(z)-{B}_{\gamma }(z)f(z)\right\Vert}_{\infty }. $$

Define the function inside the ∥’s as

$$ \varphi (z)=T(z)-{B}_{\gamma }(z)f(z) $$

and note that φ(γ) = T(γ), since Bγ(γ) = 0. Thus the problem of finding f(z) reduces to the problem of finding the smallest φ(z) satisfying φ(γ) = T(γ):

$$ \underset{\varphi \in {H}^{\infty }}{\min }{\left\Vert \varphi (z)\right\Vert}_{\infty}\kern1em \mathrm{s}.\mathrm{t}.\kern0.5em \varphi \left(\gamma \right)=T\left(\gamma \right). $$

Theorem 2

(Kasa 2001). The solution to the minimization problem above is the constant function φ(z) = T(γ).

Proof. To see this, first note that the norm of a constant function is the squared modulus of the constant itself. This is written as

$$ {\left\Vert \varphi (z)\right\Vert}_{\infty }={\left\Vert T\left(\gamma \right)\right\Vert}_{\infty }={\left|T\left(\gamma \right)\right|}^2. $$
(17)

Next, suppose that there exists another function Ψ(z) ∈ H∞, with Ψ(γ) = T(γ) and also

$$ {\left\Vert \Psi (z)\right\Vert}_{\infty }<{\left\Vert \varphi (z)\right\Vert}_{\infty }. $$
(18)

Recalling the definition of the H∞ norm and using Eqs. (17) and (18):

$$ {\left\Vert \Psi (z)\right\Vert}_{\infty }=\underset{\mid z\mid =1}{\sup }{\left|\Psi (z)\right|}^2<{\left|T\left(\gamma \right)\right|}^2. $$

The maximum modulus theorem states that a function f which is analytic on the disk U attains its maximum modulus on the boundary ∂U of the disk. That is

$$ \underset{z\in U}{\sup }{\left|f(z)\right|}^2\le \underset{z\in \partial U}{\sup }{\left|f(z)\right|}^2. $$

Therefore, we can see that

$$ \underset{\mid z\mid <1}{\sup }{\left|\Psi (z)\right|}^2\le \underset{\mid z\mid =1}{\sup }{\left|\Psi (z)\right|}^2<{\left|T\left(\gamma \right)\right|}^2. $$

However, one of the points in the interior of the unit disk is z = γ, which can be inserted into the far left-hand side of the preceding inequality to get the result

$$ {\left|\Psi \left(\gamma \right)\right|}^2\le \underset{\mid z\mid =1}{\sup }{\left|\Psi (z)\right|}^2<{\left|T\left(\gamma \right)\right|}^2\Rightarrow {\left|\Psi \left(\gamma \right)\right|}^2<{\left|T\left(\gamma \right)\right|}^2. $$

This contradicts the requirement that Ψ(γ) = T(γ). Therefore, we have verified that there does not exist another function Ψ(z) ∈ H∞ such that Ψ(γ) = T(γ) and \( {\left\Vert \Psi (z)\right\Vert}_{\infty }<{\left\Vert \varphi (z)\right\Vert}_{\infty } \). □

Given the form for φ(z), the form for f(z) follows. After some tedious algebra, we obtain

$$ f(z)=\frac{T(z)-\varphi (z)}{B_{\gamma }(z)}=\frac{zq(z)-\gamma q\left(\gamma \right)}{z-\gamma }+\frac{\gamma^2}{1-{\gamma}^2}q\left(\gamma \right) $$

which is the least squares solution plus a constant. Thus the robust cross-equation restrictions likewise differ from the least squares cross-equation restrictions. After the initial period, the impulse response function for the robust predictor is identical to that of the least squares predictor. In the initial period, the least squares impulse response is q(γ), while the robust impulse response is larger: q(γ)/(1 − γ2).
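The ‘tedious algebra’ can be checked numerically. The sketch below, assuming AR(1) dividend weights qj = ρj with ρ = 0.9 and γ = 0.95, verifies that (T(z) − T(γ))/Bγ(z) coincides with the least squares f(z) plus the constant γ²q(γ)/(1 − γ²):

```python
import numpy as np

# Sketch (AR(1) dividend weights q_j = rho**j with rho = 0.9, gamma = 0.95
# assumed): (T(z) - T(gamma))/B_gamma(z) equals the least squares f(z) plus
# the constant gamma**2 q(gamma)/(1 - gamma**2).
rho, gamma = 0.9, 0.95
q = lambda z: 1.0 / (1.0 - rho * z)
T = lambda z: z * q(z) / (1.0 - gamma * z)
B = lambda z: (z - gamma) / (1.0 - gamma * z)

f_ls = lambda z: (z * q(z) - gamma * q(gamma)) / (z - gamma)
f_rob = lambda z: f_ls(z) + gamma**2 * q(gamma) / (1.0 - gamma**2)

for z in [0.2, -0.6, 0.5j]:
    print(np.abs((T(z) - T(gamma)) / B(z) - f_rob(z)))   # ~ 0
```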

Because γ is the discount factor, and therefore close to unity, the robust impulse response can be considerably larger than that of the least squares response. Relatedly, the volatility of prices in the robust case will be larger as well. For example, in the first-order autoregressive case studied above,

$$ {p}_t=f(L){\varepsilon}_t=\frac{1}{1-\rho \gamma}{d}_t+\frac{\gamma^2}{\left(1-{\gamma}^2\right)\left(1-\rho \gamma \right)}{\varepsilon}_t $$
(19)

from which the variance can be calculated as

$$ {\sigma}^2\left({p}_t\right)={\left(\frac{1}{1-\rho \gamma}\right)}^2{\sigma}^2\left({d}_t\right)+\frac{2{\gamma}^2-{\gamma}^4}{{\left(1-\rho \gamma \right)}^2{\left(1-{\gamma}^2\right)}^2}. $$

When the discount factor is large and dividends are highly persistent, the variance of the robust present value prediction can be considerably larger than that of the least squares prediction (the first term on the right alone).

Finally, recall that the least-squares present-value predictor behaved in such a way as to minimize the variance of the error \( {p}_t-{p}_t^{\ast } \). Here, robust prediction results in an error with Wold representation

$$ {p}_t-{p}_t^{\ast }=\left\{\frac{Lq(L)-\gamma q\left(\gamma \right)}{L-\gamma }+\frac{\gamma^2}{1-{\gamma}^2}q\left(\gamma \right)-\frac{q(L)}{1-\gamma {L}^{-1}}\right\}{\varepsilon}_t=-\frac{\gamma q\left(\gamma \right)}{1-{\gamma}^2}\left\{\frac{1-\gamma L}{L-\gamma}\right\}{\varepsilon}_t. $$

The term in braces has the form of a Blaschke factor. Applying such factors in the lag operator to a serially uncorrelated process like εt leaves a serially uncorrelated result; thus the robust present value predictor has behaved in such a way that the resulting errors are white noise. Of course this comes at a cost: to make the error serially uncorrelated, the robust predictor must tolerate an error variance that is larger than the least squares error variance by a factor of \( 1/\left(1-{\gamma}^2\right) \), which can be substantial when γ is close to unity.
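A closing simulation sketch (ρ = 0.9, γ = 0.95 and the truncation length are assumptions chosen for illustration) illustrates these claims: the robust errors are approximately white noise, the least squares errors are highly autocorrelated, and the error-variance ratio is roughly 1/(1 − γ²).

```python
import numpy as np

# Simulation sketch (rho = 0.9, gamma = 0.95, truncation K assumed): robust
# present-value errors are ~white noise but their variance exceeds the least
# squares error variance by roughly the factor 1/(1 - gamma**2).
rho, gamma, n, K = 0.9, 0.95, 200_000, 600
rng = np.random.default_rng(3)
eps = rng.normal(size=n)
d = np.zeros(n)
for t in range(1, n):
    d[t] = rho * d[t - 1] + eps[t]

w = gamma ** np.arange(K + 1)
p_star = np.convolve(d[::-1], w)[:n][::-1]        # p*_t = sum_k gamma**k d_{t+k}

q_gamma = 1.0 / (1.0 - rho * gamma)
p_ls = d / (1.0 - rho * gamma)                                # eq. (15)
p_rob = p_ls + gamma**2 * q_gamma / (1.0 - gamma**2) * eps    # eq. (19)

keep = slice(500, n - K)                           # drop start-up and truncation
e_ls, e_rob = (p_ls - p_star)[keep], (p_rob - p_star)[keep]
lag1 = lambda e: np.corrcoef(e[1:], e[:-1])[0, 1]
print(e_rob.var() / e_ls.var(), 1.0 / (1.0 - gamma**2))   # ratio ~ 10.3
print(lag1(e_ls), lag1(e_rob))                            # ~ gamma and ~ 0
```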

See Also