
1 Introduction

It is common practice to compare models by out-of-sample predictive mean squared error (PMSE). For example, Meese and Rogoff (1983a, 1983b) and Swanson and White (1997) compare models according to their PMSE calculated in rolling windows. Another common practice is to use a consistent information criterion such as the Schwarz Information Criterion (SIC), used for example in Swanson and White (1997). Both information criteria and out-of-sample PMSE criteria address the overfitting inherent in the in-sample PMSE criterion. Information criteria penalize overparameterized models via penalty terms and are easy to compute. Out-of-sample PMSE criteria simulate out-of-sample forecasts and are very intuitive (see Footnote 1).

In a recent paper, Inoue and Kilian (2006) show that the recursive and rolling PMSE criteria are inconsistent and recommend that consistent in-sample information criteria, such as the SIC, be used in model selection. They also show that, even when there is structural change, these out-of-sample PMSE criteria are not necessarily consistent. Their results are based on the assumption that the window size is proportional to the sample size.

In this chapter we consider an alternative framework in which the window size goes to infinity at a slower rate than the sample size. Under this assumption we show that the rolling-window PMSE criterion is consistent for selecting nesting linear forecasting models. When the nesting model is the truth, the criterion selects the nesting model with probability approaching one because the parameters and thus the PMSE are consistently estimated as the window size diverges. When the nested model is generating the data, the quadratic term in the quadratic expansion of the loss difference becomes dominant when the window size is small. Because the quadratic form is always positive, the criterion will select the nested model with probability approaching one. When the window size is large, however, the linear term and the quadratic term are of the same order and the sign cannot be determined. By letting the window size diverge slowly, the rolling PMSE criterion is consistent under a variety of environments, when parameters are constant or when they are time varying.

When the window size diverges at a slower rate than the sample size, the rolling regression estimator can be viewed as a nonparametric estimator (Giraitis et al. 2011) and time-varying parameters are consistently estimated. We show that our rolling-window PMSE criterion remains consistent even when parameters are time varying. When the window size is large, that is, when it is assumed to go to infinity at the same rate as the total sample size, the criterion is not consistent because the rolling regression estimator is oversmoothed. In the time-varying parameter case, the conventional information criterion is not consistent in general.

This chapter is related to, and different from, the works by West (1996), Clark and McCracken (2001), Giacomini and White (2006), Giacomini and Rossi (2010), and Rossi and Inoue (2011) in several ways. West (1996) and Clark and McCracken (2001) focus on comparing models’ relative forecasting performance when the window size is a fixed fraction of the total sample size, whereas Giacomini and White (2006) focus on the case where the window size is constant; this chapter focuses instead on the case where the window size goes to infinity but at a slower rate than the total sample size. Giacomini and Rossi (2010) argue that, in the presence of instabilities, traditional tests of predictive ability may be invalid, since they focus on the forecasting performance of the models on average over the out-of-sample portion of the data. To avoid the problem, they propose to compare models’ relative predictive ability in the presence of instabilities by using a rolling window approach over the out-of-sample portion of the data, which lets them follow the relative performance of the models as it evolves over time. In this chapter we focus on consistent model selection procedures rather than testing; furthermore, our aim is not to compare models’ predictive performance over time but to select the best forecasting model asymptotically. Rossi and Inoue (2011) focus on the problem of performing inference on predictive ability that is robust to the choice of the window size. In this chapter, instead, we take the choice of the window size as given and our objective is not to perform tests; we focus on understanding whether it is possible to consistently select the true model depending on the size of the window relative to the total sample size.

The rest of this chapter is organized as follows. In Sect. 2 we establish the consistency of the rolling PMSE criterion under the standard stationary environment as well as under the time-varying parameter environment. In Sect. 3 we investigate the finite-sample properties of the rolling-window PMSE criterion. Section 4 demonstrates the usefulness of our criteria in forecasting inflation. Section 5 concludes.

2 Asymptotic Theory

Consider two nesting linear forecasting models, models 1 and 2, to generate \( h\)-steps ahead direct forecasts (where \(h\) is finite):

$$\begin{aligned} \text{ Model} \text{1}: y_{t+h}&=\alpha ^{*\prime }x_{t}+u_{t+h},\end{aligned}$$
(1)
$$\begin{aligned} \text{ Model} \text{2}: y_{t+h}&=\beta ^{\prime }z_{t}+v_{t+h}\;\;=\;\;\alpha ^{\prime }x_{t}+\gamma ^{\prime }w_{t}+v_{t+h}, \end{aligned}$$
(2)

where dim\((\alpha )=k\) and dim\((\beta )=l\). The first terms on the right-hand sides of Eqs. (1) and (2), \(\alpha ^{*\prime }x_{t} \) and \(\beta ^{\prime }z_{t}\) are the population linear projections of \( y_{t+h}\) on \(x_{t}\) and \(z_{t}\), respectively. Thus, \(z_{t}\) is uncorrelated with \(v_{t+h}\), \(\alpha ^{*}=[E(x_{t}x_{t}^{\prime })]^{-1}E(x_{t}y_{t+h})\) and \(\beta =[E(z_{t}z_{t}^{\prime })]^{-1}E(z_{t}y_{t+h})\).

Define the population quadratic loss of each model by

$$\begin{aligned} \sigma _{1}^{2}&= \lim _{T\rightarrow \infty }\frac{1}{T-h} \sum \limits _{t=1}^{T-h}E[(y_{t+h}-\alpha ^{*\prime }x_{t})^{2}] = \lim _{T\rightarrow \infty }\frac{1}{T-h}\sum \limits _{t=1}^{T-h}E(u_{t+h}^{2}), \\ \sigma _{2}^{2}&= \lim _{T\rightarrow \infty }\frac{1}{T-h} \sum \limits _{t=1}^{T-h}E[(y_{t+h}-\beta ^{\prime }z_{t})^{2}] = \lim _{T\rightarrow \infty }\frac{1}{T-h}\sum \limits _{t=1}^{T-h}E(v_{t+h}^{2}). \end{aligned}$$

Our goal is to select the model with the smallest quadratic loss.

Let the window size used for parameter estimation be denoted by \(W\) for some \(W>h\). Define the rolling ordinary least squares (OLS) estimators as follows, for \(t=W+1,...,T\):

$$\begin{aligned} \hat{\alpha }_{t,W}&=\left( \sum _{s=t-W}^{t-h}x_{s}x_{s}^{\prime }\right) ^{-1}\sum _{s=t-W}^{t-h}x_{s}y_{s+h}, \end{aligned}$$
(3)
$$\begin{aligned} \hat{\beta }_{t,W}&=\left( \sum _{s=t-W}^{t-h}z_{s}z_{s}^{\prime }\right) ^{-1}\sum _{s=t-W}^{t-h}z_{s}y_{s+h}, \end{aligned}$$
(4)

and the associated rolling PMSEs by:

$$\begin{aligned} \hat{\sigma }_{1,W}^{2}&=\frac{1}{T-h-W}\sum _{t=W+1}^{T-h}\hat{u}_{t+h}^{2},\end{aligned}$$
(5)
$$\begin{aligned} \hat{\sigma }_{2,W}^{2}&=\frac{1}{T-h-W}\sum _{t=W+1}^{T-h}\hat{v}_{t+h}^{2}, \end{aligned}$$
(6)

where \(\hat{u}_{t+h}=y_{t+h}-\hat{\alpha }_{t,W}^{\prime }x_{t}\), \(\hat{v} _{t+h}=y_{t+h}-\hat{\beta }_{t,W}^{\prime }z_{t}.\) We say that the rolling PMSE criterion is consistent if

  • \(\hat{\sigma }_{1,W}^{2}<\hat{\sigma }_{2,W}^{2}\) with probability approaching one if \(\sigma _{1}^{2}=\sigma _{2}^{2}\); and

  • \(\hat{\sigma }_{1,W}^{2}>\hat{\sigma }_{2,W}^{2}\) with probability approaching one if \(\sigma _{1}^{2}>\sigma _{2}^{2}\).
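The rolling estimators in Eqs. (3) and (4) and the PMSEs in Eqs. (5) and (6) can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the function names `rolling_pmse` and `select_by_rolling_pmse` are hypothetical, arrays are 0-based (so index `t` corresponds to \(t+1\) in the text), and passing `X=None` is our own convention for a model with no regressors (the zero forecast).

```python
import numpy as np

def rolling_pmse(y, X, W, h=1):
    """Rolling-window PMSE for an h-step direct forecast.

    y : array of length T; X : array (T,) or (T, k), or None for the zero forecast.
    Each forecast uses OLS on the pairs (x_s, y_{s+h}), s = t-W, ..., t-h.
    """
    y = np.asarray(y, dtype=float)
    T = y.size
    if X is not None:
        X = np.asarray(X, dtype=float)
        if X.ndim == 1:
            X = X[:, None]
    sq_errors = []
    for t in range(W, T - h):                      # T - h - W forecasts in total
        if X is None:
            forecast = 0.0
        else:
            Xs = X[t - W:t - h + 1]                # regressors in the window
            ys = y[t - W + h:t + 1]                # matching h-step-ahead targets
            coef, *_ = np.linalg.lstsq(Xs, ys, rcond=None)
            forecast = X[t] @ coef
        sq_errors.append((y[t + h] - forecast) ** 2)
    return float(np.mean(sq_errors))

def select_by_rolling_pmse(y, X1, X2, W, h=1):
    """Return 1 if model 1 has the strictly smaller rolling PMSE, else 2."""
    return 1 if rolling_pmse(y, X1, W, h) < rolling_pmse(y, X2, W, h) else 2
```

Each window contains \(W-h+1\) observations, matching \(W_{h}\) in the notation introduced below, and the average is taken over \(T-h-W\) forecast errors as in Eqs. (5) and (6).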

Under what conditions on the window size is the rolling PMSE criterion consistent? The existing results are not positive. When the window size is large relative to the sample size (i.e., \(\exists \lambda \in (0,1)\) s.t. \(W=\lambda T+o(T)\)), Inoue and Kilian (2005) show that the criterion is not consistent. Specifically, when \(\sigma _{1}^{2}=\sigma _{2}^{2}\), they show that the criterion selects model 2 with a positive probability resulting in the overparameterized model. We will discuss this result in more detail in the next section, where we will compare it with the theoretical results proposed in this chapter.

When the window size is very small (i.e., \(W\) is a fixed constant), it is straightforward to show that the criterion may not be consistent. For example, compare the zero-forecast model (\(x_{t}=\emptyset \)) and the constant-forecast model (\(w_{t}=1\)) with \(W=h=1\). Suppose that \( y_{t+1}\;=\;c+u_{t+1}\), where \(u_{t}\sim iid(0,\sigma ^{2})\). Note that \( \sigma _{1}^{2}=c^{2}+\sigma ^{2}\) and \(\sigma _{2}^{2}=\sigma ^{2}\). Since

$$\begin{aligned} \begin{array}{llll} \hat{\sigma }_{1,1}^{2}&=\frac{1}{T-1}\sum \limits _{t=1}^{T-1}y_{t+1}^{2} \overset{p}{\rightarrow } c^{2}+\sigma ^{2}, \\ \hat{\sigma }_{2,1}^{2}&= \frac{1}{T-1}\sum \limits _{t=1}^{T-1}(y_{t+1}-y_{t})^{2} \overset{p}{\rightarrow } 2\sigma ^{2}, \end{array} \end{aligned}$$

however, \(\hat{\sigma }_{1,1}^{2}<\hat{\sigma }_{2,1}^{2}\) with probability approaching one whenever \(c^{2}<\sigma ^{2}\). This is because parameter estimation uncertainty never vanishes even asymptotically, when the window size is fixed.
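The fixed-window counterexample above is easy to check numerically. The following sketch assumes Gaussian errors and the illustrative values \(c=0.5\), \(\sigma =1\) (neither is specified in the text); the function name is hypothetical. With \(W=h=1\), the constant-forecast model simply predicts \(y_{t+1}\) by \(y_{t}\).

```python
import numpy as np

def fixed_window_pmse(c, sigma, T, seed=0):
    """PMSEs of the zero forecast and of the W = 1 constant forecast."""
    rng = np.random.default_rng(seed)
    y = c + sigma * rng.normal(size=T)            # y_{t+1} = c + u_{t+1}
    pmse_zero = np.mean(y[1:] ** 2)               # model 1: forecast is 0
    pmse_const = np.mean((y[1:] - y[:-1]) ** 2)   # model 2: forecast is y_t
    return pmse_zero, pmse_const

# Illustrative values with c^2 < sigma^2, so the wrong model should win.
p1, p2 = fixed_window_pmse(c=0.5, sigma=1.0, T=200_000)
print(f"zero forecast: {p1:.3f}  (limit c^2 + sigma^2 = 1.25)")
print(f"constant forecast: {p2:.3f}  (limit 2 sigma^2 = 2.00)")
```

Although model 2 is the true model here (\(\sigma _{2}^{2}<\sigma _{1}^{2}\)), the zero forecast attains the smaller rolling PMSE because the one-observation estimate of \(c\) doubles the error variance.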

The goal of the next section is to show that the criterion is consistent if the window size is small, but not too small, relative to the sample size in the following sense: \(W\rightarrow \infty \) and \(W/T\rightarrow 0\) as \(T\rightarrow \infty \). Following Clark and McCracken (2000), we use the following notation: Let \(q_{2,t}=z_{t}z_{t}^{\prime }\), \( q_{1,t}=x_{t}x_{t}^{\prime }\), \(B_{i}=\left[ E(q_{it})\right] ^{-1}\), \( B_{i}(t)=\left[ \frac{1}{W_{h}}\sum \limits _{s=t-W}^{t-h}q_{i,s}\right] ^{-1}\) , \(H_{1}(t)=\frac{1}{W_{h}}\sum \limits _{s=t-W}^{t-h}x_{s}(y_{s+h}-\alpha ^{*\prime }x_{s})\), \(H_{2}(t)=\frac{1}{W_{h}}\sum \limits _{s=t-W}^{t-h}z_{s}v_{s+h}\), where \(i\) is either 1 or 2 and \( W_{h}=W-h+1\).

2.1 Consistency of the Rolling-Window PMSE Criterion When Parameters are Constant

First, consider the case where the parameters are constant.

Assumption 1

As \(T \rightarrow \infty \), \( T^{1/2}/W=O(1)\) and \(W/T \rightarrow 0\).

Assumption 2

 

  1. (a)

    \(\{[x_t^{\prime }\;z_t^{\prime }\;y_{t+h}]^{\prime }\}\) is covariance stationary and has finite tenth moments, with \(E(z_{t}z_{t}^{\prime })\) positive definite and \(B_{2}(t)\) positive definite for all \(t\) almost surely.

  2. (b)

    \(W^{1/2}(B_{i}(t)-B_{i})\) and \(W^{1/2}H_{i}(t)\) have finite fourth moments uniformly in \(t\) for \(i=1,2\).

  3. (c)

    \(E(v_{t+h}|\mathcal F _{t})=0\) with probability one for \(t=1,2,\ldots \), where \(\mathcal F _{t}\) is the \(\sigma \) field generated by \( \{(y_{s+h},z_{s})\}_{s=1}^{t-h}\).

  4. (d)

    \(E[H_{1}^{\prime }(t)B_{1}(x_{t}x_{t}^{\prime }-E(x_{t}x_{t}^{\prime }))B_{1}H_{1}(t)]=o(W^{-1})\) and \(E[H_{2}^{\prime }(t)B_{2}(z_{t}z_{t}^{\prime }-E(z_{t}z_{t}^{\prime }))B_{2}H_{2}(t)]=o(W^{-1})\) uniformly in \(t\).

  5. (e)
    $$\begin{aligned}&\text{ Cov}\left[\text{ vech}\left(\sum _{t=W+1}^{T-h}H_{i}^{\prime }(t)(B_{i}(t)-B_{i})q_{i,t}(B_{i}(t)-B_{i})H_{i}(t)\right)\right]\\&\quad \qquad =O\left(\sum _{t=W+1}^{T-h}\text{ Cov}\left[\text{ vech}\left(H_{i}^{\prime }(t)(B_{i}(t)-B_{i})q_{i,t}(B_{i}(t)-B_{i})H_{i}(t)\right)\right]\right), \\&\text{ Cov}\left[\text{ vec}\left(\sum _{t=W+1}^{T-h}H_{i}^{\prime }(t)B_{i}q_{i,t}(B_{i}(t)-B_{i})H_{i}(t)\right)\right]\\&\quad \qquad = O\left(\sum _{t=W+1}^{T-h}\text{ Cov}\left[\text{ vec}\left(H_{i}^{\prime }(t)B_{i}q_{i,t}(B_{i}(t)-B_{i})H_{i}(t)\right)\right]\right), \\&\text{ Cov}\left[\text{ vech}\left(\sum _{t=W+1}^{T-h}H_{i}^{\prime }(t)B_{i}q_{i,t}B_{i}H_{i}(t)\right)\right]\\&\quad \qquad = O\left(\sum _{t=W+1}^{T-h}\text{ Cov}\left[\text{ vech}\left(H_{i}^{\prime }(t)B_{i}q_{i,t}B_{i}H_{i}(t)\right)\right]\right), \end{aligned}$$

    for \(i=1,2\).

Remarks

When the window size is assumed to be proportional to the sample size, \(W=[rT]\) for \(r\in [0,1]\), the functional central limit theorem (FCLT) is often used to find the asymptotic properties of the recursive and rolling regression estimators (e.g., Clark and McCracken 2001). For example, if \(h=1\),

$$\begin{aligned} \sqrt{T}(\hat{\beta }_{t,W}-\beta )\;=\;\left( \frac{1}{T} \sum _{s=t-W}^{t-1}z_{s}z_{s}^{\prime }\right) ^{-1}\frac{1}{\sqrt{T}} \sum _{s=t-W}^{t-1}z_{s}v_{s+1} \end{aligned}$$

and if vech\((z_{t}z_{t}^{\prime })\) and \(z_{t}v_{t+1}\) satisfy the FCLT, we obtain

$$\begin{aligned} \sqrt{T}(\hat{\beta }_{[rT]}-\beta )\;\Rightarrow \;\frac{\sigma }{r} [E(z_{t}z_{t}^{\prime })]^{-1/2}B_{l}(r) \end{aligned}$$

where \(B_{l}(r)\) is the \(l\)-dimensional standard Brownian motion, provided \( [z_{t}^{\prime }\;v_{t+1}]^{\prime }\) is covariance stationary. Thus, we have \(\hat{\beta }_{t,W}-\beta \;=\;O_{p}(T^{-1/2})\) uniformly in \(t\). When the window size diverges more slowly than the sample size, it is tempting to argue by analogy and claim that \(\hat{\beta }_{t,W}-\beta =O_{p}(W^{-1/2})\) uniformly in \(t\). This result does not follow from the FCLT, however, even though \(\hat{\beta }_{t,W}-\beta \;=\;O_{p}(W^{-1/2})\) pointwise in \(t\). To see why, let \(z_{t}=1\). Then

$$\begin{aligned} \hat{\beta }_{t,W}-\beta&=\frac{1}{W}\sum _{s=1}^{t-1}v_{s+1}-\frac{1}{W} \sum _{s=1}^{t-W-1}v_{s+1} \nonumber \\&=\frac{\sqrt{T}}{W}\frac{1}{\sqrt{T}}\sum _{s=1}^{t-1}v_{s+1}-\frac{\sqrt{T} }{W}\frac{1}{\sqrt{T}}\sum _{s=1}^{t-W-1}v_{s+1} \nonumber \\&=o_{p}\left( \frac{\sqrt{T}}{W}\right) \end{aligned}$$

uniformly in \(t\), where the last equality follows from \(\frac{1}{\sqrt{T}} \sum _{s=1}^{t-1}v_{s+1}-\frac{1}{\sqrt{T}}\sum _{s=1}^{t-W-1}v_{s+1}=o_{p}(1)\) by the FCLT and \(W=o(T)\). Thus, the FCLT alone does not imply \(\hat{\beta } _{t,W}-\beta =O_{p}(W^{-1/2})\) uniformly in \(t\) in general. This is why we need high-level assumptions such as Assumptions 2(b), (d), and (e).

Assumption 1 requires that \(W\) diverges more slowly than \(T\). This makes the convergence rates of the terms in the expansion of the PMSE differential uneven, which helps to establish the consistency of the criterion when the nested model is generating the data. Assumption 2(c) requires that the nesting model is (dynamically) correctly specified. Assumption 2(d) is trivially satisfied if \(z_{t}\) is strictly exogenous and allows for weak correlations between \(z_{t}\) and \(v_{s}\). Assumption 2(e) is a high-level assumption and imposes that the variance of the sum is of the same order as the sum of the variances. In other words, the summands are only weakly serially correlated so that their autocovariances decay fast enough. This assumption is somewhat related to the concept of essential stationarity of Wooldridge (1994, pp. 2643–2644). Assumptions somewhat similar to this condition are used in the central limit theorem for stationary and ergodic processes (e.g., Theorem 5.6 of Hall and Heyde 1980, p. 148) and the central limit theorem for near epoch-dependent processes (e.g., Theorem 5.3 of Gallant and White 1988, p. 76; Assumption C1 of Wooldridge and White 1988).

Theorem 1

Under Assumptions 1 and 2, the rolling-window PMSE criterion is consistent.

To compare our consistency result and the inconsistency result of Inoue and Kilian (2006), consider two simple competing models, \(y_{t+h}=u_{t+h}\) (model 1) and \(y_{t+h}=c+v_{t+h}\) (model 2) where \(v_{t+h}\;\)is i.i.d. with mean zero and variance \(\sigma _{2}^{2}\) and \(h=1.\) The difference of the out-of-sample PMSE can be written as

$$\begin{aligned} \hat{\sigma }_{2,W}^{2}-\hat{\sigma }_{1,W}^{2}\;=\;-\frac{2}{T-W-1} \sum _{t=W+1}^{T-1}(\hat{c}_{t}-c)v_{t+1}+\frac{1}{T-W-1}\sum _{t=W+1}^{T-1}( \hat{c}_{t}-c)^{2} \end{aligned}$$

where \(\hat{c}_{t}=(1/W)\sum _{s=t-W}^{t-1}y_{s+1}\). Assume that \(c=0\) in population.

When \(W=[\lambda T]\) for some \(\lambda \in (0,1)\), it follows from Lemmas A6 and A7 of Clark and McCracken (2000) that

$$\begin{aligned} T\left( \hat{\sigma }_{2,W}^{2}-\hat{\sigma }_{1,W}^{2}\right) \;\overset{d}{ \rightarrow }\;&-\frac{2}{\lambda \left( 1-\lambda \right) }\sigma _{2}^{2}\int \limits _{\lambda }^{1}(B(r)-B(r-\lambda ))\mathrm{d}B(r)\\&+\frac{1}{\lambda ^{2}\left( 1-\lambda \right) }\sigma _{2}^{2}\int \limits _{\lambda }^{1}(B(r)-B(r-\lambda ))^{\prime }(B(r)-B(r-\lambda ))\mathrm{d}r \end{aligned}$$

where \(B(\cdot )\) is the standard Brownian motion. Because the probability that the right-hand side is negative is nonzero, the criterion is inconsistent when \(c=0\). This is the inconsistency result in Inoue and Kilian (2006).

When \(W=o(T^{1/(1+2\varepsilon )})\) for some \(\varepsilon \in (0,1/2)\), the case considered in this chapter, we have:

$$\begin{aligned} W(\hat{\sigma }_{2,W}^{2}-\hat{\sigma }_{1,W}^{2})&=-\frac{2W^{\frac{1}{2} +\varepsilon }}{T-W-1}\sum _{t=W+1}^{T-1}\left( \frac{1}{W^{\frac{1}{2} +\varepsilon }}\sum _{s=t-W}^{t-1}v_{s+1}\right) v_{t+1} \\&\quad +\frac{1}{T-W-1}\sum _{t=W+1}^{T-1}\left( \frac{1}{W^{\frac{1}{2}}} \sum _{s=t-W}^{t-1}v_{s+1}\right) ^{2} \\&=\frac{1}{T-W-1}\sum _{t=W+1}^{T-1}\left( \frac{1}{W^{\frac{1}{2}}} \sum _{s=t-W}^{t-1}v_{s+1}\right) ^{2}+o_{p}(1) \end{aligned}$$

Because the right-hand side remains positive even asymptotically, the criterion will choose model 1 with probability approaching one. The key for the consistency result is that the last quadratic term in the expansion dominates the middle cross-term when the window size is small.
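The relative size of the two terms in the expansion can be inspected by simulation. The sketch below assumes Gaussian \(v_{t}\) with \(\sigma _{2}^{2}=1\) and \(c=0\); the function name `scaled_pmse_gap` and the particular window exponents are illustrative choices of ours. It returns the cross term and the quadratic term of \(W(\hat{\sigma }_{2,W}^{2}-\hat{\sigma }_{1,W}^{2})\) separately.

```python
import numpy as np

def scaled_pmse_gap(T, W, sigma=1.0, seed=0):
    """Cross and quadratic terms of W * (pmse_2 - pmse_1) when c = 0 and h = 1."""
    rng = np.random.default_rng(seed)
    v = sigma * rng.normal(size=T)                # y_t = v_t because c = 0
    csum = np.concatenate(([0.0], np.cumsum(v)))
    window_sum = csum[W:] - csum[:-W]             # sums over W consecutive observations
    S = window_sum[:-1]                           # window ending just before each target
    target = v[W:]                                # v_{t+1} for each forecast origin
    cross = -2.0 * np.mean(S * target)            # middle cross-term
    quad = np.mean(S ** 2) / W                    # last quadratic term
    return cross, quad

T = 100_000
for label, W in [("T^0.5", int(T ** 0.5)), ("T^0.6", int(T ** 0.6)), ("0.5 T", T // 2)]:
    cross, quad = scaled_pmse_gap(T, W)
    print(f"W = {label:6s} cross = {cross:+.3f}  quad = {quad:.3f}")
```

Under these assumptions the cross term is an average of martingale differences with standard deviation of order \((W/T)^{1/2}\), while the quadratic term has mean \(\sigma _{2}^{2}\). For slowly diverging \(W\) the cross term is therefore small relative to the quadratic term, whereas for \(W\) proportional to \(T\) the two are of comparable magnitude and the sign of the sum is not determined.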

Lastly, it should be noted that our consistency result does not imply that the resulting forecast based on a slowly diverging window size is optimal. When parameters are constant, one would expect that the optimal forecast for the \(T+1\)st observation should be based on all \(T\) observations, not on the last \(W\) observations. Assumption 1 is merely a device to obtain the consistency of the rolling PMSE criterion.

2.2 Consistency of the Rolling-Window PMSE Criterion When Parameters are Time Varying

Sometimes it is claimed that out-of-sample PMSE comparisons are used to protect practitioners from parameter instability. As Inoue and Kilian (2006) show, this is not always the case. In this section we show that the rolling PMSE criterion with small window sizes delivers consistent model selection even when parameters are time varying.

Suppose that the slope coefficients are time varying in the sense that

$$\begin{aligned} y_{T,t+h}\;=\;\beta \left( \frac{t}{T}\right) ^{\prime }z_{T,t}+v_{T,t+h} \end{aligned}$$
(7)

where \(\beta (r)=[\alpha (r)^{\prime }\;\gamma (r)^{\prime }]^{\prime }\) for \(r\in [0,1]\). When the slope coefficients are time varying, the second moments are also time varying. Let

$$\begin{aligned} \left[ \begin{array}{cc} \Gamma _{zz}\left(\frac{t}{T}\right)&\Gamma _{zy}\left(\frac{t}{T}\right)\\ \Gamma _{yz}\left(\frac{t}{T}\right)&\Gamma _{yy}\left(\frac{t}{T}\right) \end{array} \right]&=\left[ \begin{array}{ccc} \Gamma _{xx}\left(\frac{t}{T}\right)&\Gamma _{xw}\left(\frac{t}{T}\right)&\Gamma _{xy}\left(\frac{t}{T}\right) \\ \Gamma _{wx}\left(\frac{t}{T}\right)&\Gamma _{ww}\left(\frac{t}{T}\right)&\Gamma _{wy}\left(\frac{t}{T}\right) \\ \Gamma _{yx}\left(\frac{t}{T}\right)&\Gamma _{yw}\left(\frac{t}{T}\right)&\Gamma _{yy}\left(\frac{t}{T}\right) \end{array} \right] \\&=\left[ \begin{array}{ccc} E[x_{T,t}x_{T,t}^{\prime }]&E[x_{T,t}w_{T,t}^{\prime }]&E[x_{T,t}y_{T,t}]\\ E[w_{T,t}x_{T,t}^{\prime }]&E[w_{T,t}w_{T,t}^{\prime }]&E[w_{T,t}y_{T,t}]\\ E[y_{T,t}x_{T,t}^{\prime }]&E[y_{T,t}w_{T,t}^{\prime }]&E[y_{T,t}^{2}] \end{array} \right] , \end{aligned}$$

for \(t=1,2,...,T\) and \(T=1,2,...\). Let \(\bar{B}_{1}\left(\frac{t}{T} \right)=[E(x_{T,t}x_{T,t}^{\prime })]^{-1}\) and \(\bar{B}_{2}\left(\frac{t}{T} \right)=[E(z_{T,t}z_{T,t}^{\prime })]^{-1}\). Then \(\beta (\cdot )=[\Gamma _{zz}(\cdot )]^{-1}\Gamma _{zy}(\cdot )\). We compare

$$\begin{aligned} y_{T,t+h}\;=\;\alpha \left(\frac{t}{T}\right)^{\prime }x_{T,t}+u_{T,t+h} \end{aligned}$$
(8)

and (7), where (7) simplifies to (8) if \(\gamma (u)=0\) for all \(u\in [0,1]\).

Assumption 3

As \( T\rightarrow \infty \), \(T^{1/2}/W=O(1)\) and \(W=o(T^{2/3})\).

Assumption 4

 

  1. (a)
    $$\begin{aligned} \xi _{t}\;=\;\text{ vech}\left\{ \left[ \begin{array}{cc} z_{T,t}z_{T,t}^{\prime }&z_{T,t}y_{T,t+h}\\ y_{T,t+h}z_{T,t}^{\prime }&y_{T,t+h}^{2}\end{array} \right] -\left[ \begin{array}{cc} \Gamma _{zz}\left(\frac{t}{T}\right)&\Gamma _{zy}\left(\frac{t}{T}\right)\\ \Gamma _{yz}\left(\frac{t}{T}\right)&\Gamma _{yy}\left(\frac{t}{T}\right)\end{array} \right]\right\} \end{aligned}$$
    (9)

    has finite fifth moments with \(B_{2}(t)\) positive definite for all \(t\) almost surely.

  2. (b)

    \(W^{1/2}\left(B_{i}(t)-\bar{B}_{i}\left(\frac{t}{T }\right)\right)\) and \(W^{1/2}H_{i}(t)\) have finite fourth moments uniformly in \(t\) for \(i=1,2\).

  3. (c)

    \(E(v_{T,t+h}|\mathcal F _{T,t})=0\) with probability one for \(t=1,2,\ldots \), where \(\mathcal F _{T,t}\) is the \(\sigma \) field generated by \(\{(y_{T,s+h},z_{T,s})\}_{s=1}^{t-h}\).

  4. (d)

    \(E[H_{i}^{\prime }(t)\bar{B}_{i}\left(\frac{t}{T} \right)(q_{i,T,t}-E(q_{i,T,t}))\bar{B}_{i}\left(\frac{t}{T} \right)H_{i}(t)]=o(W^{-1})\) uniformly in \(t\) for \(i=1,2\), where \( q_{1,T,t}=x_{T,t}x_{T,t}^{\prime }\) and \(q_{2,T,t}=z_{T,t}z_{T,t}^{\prime }\).

  5. (e)
    $$\begin{aligned}&\text{ Cov}\left[\text{ vech}\left(\sum _{t=W+1}^{T-h}H_{i}^{\prime }(t)\left(B_{i}(t)-\bar{B}_{i}\left(\frac{t}{T}\right)\right)q_{i,T,t}\left(B_{i}(t)-\bar{B}_{i}\left(\frac{t}{T}\right)\right)H_{i}(t)\right)\right] \\&\quad \qquad = O\left(\sum _{t=W+1}^{T-h}\text{ Cov}\left[\text{ vech}\left(H_{i}^{\prime }(t)\left(B_{i}(t)-\bar{B}_{i}\left(\frac{t}{T}\right)\right)q_{i,T,t}\left(B_{i}(t)-\bar{B}_{i}\left(\frac{t}{T}\right)\right)H_{i}(t)\right)\right]\right), \\&\text{ Cov}\left[\text{ vec}\left(\sum _{t=W+1}^{T-h}H_{i}^{\prime }(t)\bar{B}_{i}\left(\frac{t}{T}\right)q_{i,T,t}\left(B_{i}(t)-\bar{B}_{i}\left(\frac{t}{T}\right)\right)H_{i}(t)\right)\right]\\&\quad \qquad = O\left(\sum _{t=W+1}^{T-h}\text{ Cov}\left[\text{ vec}\left(H_{i}^{\prime }(t)\bar{B}_{i}\left(\frac{t}{T}\right)q_{i,T,t}\left(B_{i}(t)-\bar{B}_{i}\left(\frac{t}{T}\right)\right)H_{i}(t)\right)\right]\right), \\&\text{ Cov}\left[\text{ vech}\left(\sum _{t=W+1}^{T-h}H_{i}^{\prime }(t)\bar{B}_{i}\left(\frac{t}{T}\right)q_{i,T,t}\bar{B}_{i}\left(\frac{t}{T}\right)H_{i}(t)\right)\right] \\&\quad \qquad = O\left(\sum _{t=W+1}^{T-h}\text{ Cov}\left[\text{ vech}\left(H_{i}^{\prime }(t)\bar{B}_{i}\left(\frac{t}{T}\right)q_{i,T,t}\bar{B}_{i}\left(\frac{t}{T}\right)H_{i}(t)\right)\right]\right), \end{aligned}$$

    where \(i=1,2\).

  6. (f)

    \(\Gamma _{zz}(u)\) is positive definite for all \(u\in [0,1]\), and \(\alpha (\cdot )\equiv \Gamma _{xx}(\cdot )^{-1}\Gamma _{xy}(\cdot )\) and \(\beta (\cdot )\equiv \Gamma _{zz}(\cdot )^{-1}\Gamma _{zy}(\cdot )\) satisfy a Lipschitz condition of order 1.

Remarks

Assumption 3 is more restrictive than Assumption 1 in order to keep the bias of the rolling regression estimator from interfering with the consistency of the rolling PMSE estimator. Assumptions 4(a) and (b) require that \(\xi _{t}\) behaves like a stationary process with sufficiently many moments. Assumptions 4(b)–(e) are analogs of Assumptions 2(b)–(e). Assumption 4(f) requires that the second moments change very smoothly.

Theorem 2

Suppose Assumptions 3 and 4 hold. Then the rolling-window PMSE criterion is consistent.

Remarks

The above consistency result is intuitive once it is recognized that the rolling regression estimator is a nonparametric regression estimator of parameters with a truncated kernel. For example, Cai (2007) establishes the consistency and asymptotic normality of nonparametric estimators of time-varying parameters, and Giraitis et al. (2011) prove the consistency and asymptotic normality of nonparametric estimators for stochastic time-varying coefficient AR(1) models.

In general, the conventional information criteria, such as SIC, are not consistent when parameters are time varying. To show why that is the case consider comparing two competing models \(y_{t+h}=u_{t+h} \) and \(y_{t+h}=c+v_{t+h}\) for \(h=1\) when the data are generated from:

$$\begin{aligned} y_{t}\;=\;\frac{t}{T}-\frac{1}{2}+\varepsilon _{t} \end{aligned}$$
(10)

where \(\varepsilon _{t}\) is i.i.d. with mean zero and variance \(\sigma ^{2}\). Then the population in-sample PMSE of the zero forecast model is

$$\begin{aligned} \lim _{T\rightarrow \infty }E\left( \frac{1}{T-1}\sum _{t=1}^{T-1}y_{t+1}^{2} \right) \;=\;\sigma ^{2}+\int \limits _{0}^{1}\left( r-\frac{1}{2}\right) ^{2}\mathrm{d}r\;=\sigma ^{2}+\frac{1}{12} \end{aligned}$$

The population in-sample PMSE of the constant-forecast model, in which a single constant is fitted over the full sample, is also

$$\begin{aligned} \lim _{T\rightarrow \infty }\min _{c}E\left( \frac{1}{T-1} \sum _{t=1}^{T-1}(y_{t+1}-c)^{2}\right) \;=\;\min _{c}\left( \sigma ^{2}+\int \limits _{0}^{1}\left( r-\frac{1}{2}-c\right) ^{2}\mathrm{d}r\right) \;=\sigma ^{2}+\frac{1}{12} \end{aligned}$$

Thus, the SIC would select the zero forecast model while the true DGP is a time-varying constant forecast model. Our criterion, by re-estimating the constant in rolling windows, is robust to time variation in the parameters and will select the second model with probability approaching unity asymptotically.
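The contrast between the in-sample comparison and the rolling-window comparison under the DGP in Eq. (10) can be illustrated numerically. The sketch below is ours, not the authors' code: it assumes Gaussian errors with \(\sigma =1\), uses the SIC in the common form \(\ln \hat{\sigma }^{2}+k\ln (T)/T\), and sets \(W=T^{0.6}\) as one window size compatible with Assumption 3. The function name and the returned keys are hypothetical.

```python
import numpy as np

def trend_mean_example(T=20_000, sigma=1.0, seed=0):
    rng = np.random.default_rng(seed)
    t = np.arange(1, T + 1)
    y = t / T - 0.5 + sigma * rng.normal(size=T)       # Eq. (10)

    # In-sample fit: zero forecast (k = 0) versus one full-sample constant (k = 1).
    mse_zero = np.mean(y ** 2)
    mse_const = np.mean((y - y.mean()) ** 2)
    sic_zero = np.log(mse_zero)
    sic_const = np.log(mse_const) + np.log(T) / T

    # Rolling-window PMSE with h = 1: forecast y_{t+1} by the mean of the last W values.
    W = int(T ** 0.6)
    csum = np.concatenate(([0.0], np.cumsum(y)))
    c_hat = (csum[W:-1] - csum[:-W - 1]) / W           # rolling means of y
    target = y[W:]
    pmse_zero = np.mean(target ** 2)
    pmse_const = np.mean((target - c_hat) ** 2)

    return {"mse_zero": mse_zero, "mse_const": mse_const,
            "sic_zero": sic_zero, "sic_const": sic_const,
            "pmse_zero": pmse_zero, "pmse_const": pmse_const}

for name, value in trend_mean_example().items():
    print(f"{name:10s} {value:.4f}")
```

In this setup the two in-sample mean squared errors are nearly identical, so the penalty term decides the SIC comparison in favor of the zero forecast, while the rolling mean tracks the drifting intercept and yields the smaller out-of-sample PMSE.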

3 Monte Carlo Evidence

In this section we investigate the finite-sample performance of the rolling-window PMSE criterion in two Monte Carlo experiments. In the first experiment, we use the data generating process (DGP) of Clark and McCracken (2005) as it is similar to the empirical application that we will consider in the next section. In the second experiment, we use a simple DGP in which the dependent and independent variables both follow first-order autoregressive processes, and consider both constant parameter and time-varying parameter cases.

3.1 Simulation 1: DGP2 in Clark and McCracken (2005)

The second DGP of Clark and McCracken (2005) is based on estimates from quarterly 1957:1–2004:3 data on inflation (\(Y\)) and the rate of capacity utilization in manufacturing (\(x\)). We consider restricted and unrestricted forecasting models as follows:

$$\begin{aligned} \text{ Model} \text{1} :\;\Delta Y_{t+1}&=\alpha _{0}+\alpha _{1}\Delta Y_{t}+\alpha _{2}\Delta Y_{t-1}+u_{1,t+1}\end{aligned}$$
(11)
$$\begin{aligned} \text{ Model} \text{2} :\;\Delta Y_{t+1}&=\alpha _{0}+\alpha _{1}\Delta Y_{t}+\alpha _{2}\Delta Y_{t-1}+\gamma _{1}x_{t-1}+\gamma _{2}x_{t-2}+\gamma _{3}x_{t-3} \nonumber \\&\qquad \qquad \qquad \qquad +\gamma _{4}x_{t-4}+u_{2,t+1} \end{aligned}$$
(12)

When the restricted model (11) is true, the DGP is parameterized using Eq. (7) in Clark and McCracken (2005):

$$\begin{aligned} \Delta Y_{t}&=-0.316\Delta Y_{t-1}-0.214\Delta Y_{t-2}+u_{y,t}, \end{aligned}$$
(13)
$$\begin{aligned} x_{t}&=-0.193\Delta Y_{t-1}-0.242\Delta Y_{t-2}-0.240\Delta Y_{t-3}-0.119\Delta Y_{t-4}\nonumber \\&\quad +1.427x_{t-1}-0.595x_{t-2}+0.294x_{t-3}-0.174x_{t-4}+u_{x,t}, \end{aligned}$$
(14)

where

$$\begin{aligned} \left[ \begin{array}{c} u_{y,t} \\ u_{x,t} \end{array} \right] \overset{iid}{\sim }N\left( \left[ \begin{array}{c} 0 \\ 0 \end{array} \right] ,\left[ \begin{array}{cc} 1.792&0.244 \\ 0.244&1.463 \end{array} \right] \right). \end{aligned}$$
(15)

When the unrestricted model (12) is the truth, the DGP is parameterized using Eq. (9) in Clark and McCracken (2005):

$$\begin{aligned} \Delta Y_{t} =&-0.419\Delta Y_{t-1}-0.258\Delta Y_{t-2} \nonumber \\&+0.331x_{t-1}-0.423x_{t-2}+0.309x_{t-3}-0.139x_{t-4}+u_{y,t}, \end{aligned}$$
(16)

where \(x_{t}\) is defined as in Eq. (14) and

$$\begin{aligned} \left[ \begin{array}{c} u_{y,t} \\ u_{x,t} \end{array} \right] \;\overset{iid}{\sim }N\left( \left[ \begin{array}{c} 0 \\ 0 \end{array} \right] ,\left[ \begin{array}{cc} 1.517&0.244 \\ 0.244&1.463 \end{array} \right] \right) , \end{aligned}$$
(17)

In both (15) and (17), the initial values of \(\Delta Y_{t}\) and \(x_{t}\) are generated with draws from the unconditional normal distribution. We compare the performance of the SIC and the rolling-window PMSE criterion; the latter is implemented with a window size that is either (i) constant (independent of the sample size); (ii) proportional to the sample size; or (iii) diverging more slowly than the sample size. The number of Monte Carlo replications is set to 10,000. Tables 1–4 report the empirical probabilities of selecting the correct model. If a procedure is consistent, the corresponding probabilities should approach unity as the sample size grows.
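For readers who wish to reproduce the design, the two DGPs in Eqs. (13)–(17) can be simulated as follows. This is a hedged sketch rather than the authors' code: the function name `simulate_cm_dgp` is hypothetical, and a burn-in period started from zeros (the `burn` argument) is used as a simple stand-in for the draws from the unconditional distribution described above.

```python
import numpy as np

def simulate_cm_dgp(T, rng, unrestricted=False, burn=200):
    """Simulate (Delta Y_t, x_t) from Eqs. (13)-(15) or, if unrestricted, Eqs. (14), (16), (17)."""
    var_y = 1.517 if unrestricted else 1.792
    cov = np.array([[var_y, 0.244], [0.244, 1.463]])
    u = rng.multivariate_normal(np.zeros(2), cov, size=T + burn)
    dy = np.zeros(T + burn)
    x = np.zeros(T + burn)
    for t in range(4, T + burn):
        if unrestricted:                                   # Eq. (16)
            dy[t] = (-0.419 * dy[t - 1] - 0.258 * dy[t - 2]
                     + 0.331 * x[t - 1] - 0.423 * x[t - 2]
                     + 0.309 * x[t - 3] - 0.139 * x[t - 4] + u[t, 0])
        else:                                              # Eq. (13)
            dy[t] = -0.316 * dy[t - 1] - 0.214 * dy[t - 2] + u[t, 0]
        x[t] = (-0.193 * dy[t - 1] - 0.242 * dy[t - 2]     # Eq. (14)
                - 0.240 * dy[t - 3] - 0.119 * dy[t - 4]
                + 1.427 * x[t - 1] - 0.595 * x[t - 2]
                + 0.294 * x[t - 3] - 0.174 * x[t - 4] + u[t, 1])
    return dy[burn:], x[burn:]                             # drop the burn-in draws

rng = np.random.default_rng(0)
dy, x = simulate_cm_dgp(200, rng)
print(dy[:3], x[:3])
```

Each Monte Carlo replication would draw one such sample and apply the SIC and the rolling-window PMSE criterion to Models 1 and 2.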

Table 1 Selection probabilities of the SIC
Table 2 Selection probabilities of the PMSE criterion when the window size is a fixed fraction of the total sample size
Table 3 Selection probabilities of the PMSE criterion when the window size is constant
Table 4 Selection probabilities of the PMSE criterion when the window size is slowly diverging

Tables 1, 2, and 3 report the results for the SIC, the PMSE criterion with \(W\) proportional to \(T\), and the PMSE criterion with fixed \(W\), respectively. As expected, the SIC selects the correct model with probability approaching one as the sample size increases. The second-to-last column of Table 2 shows that, when the window size is set to a fraction of the total sample size, \(W=[\pi T]\), the PMSE criterion tends to overparameterize the model when \(\pi \) is not very small. When the window size is fixed to a small number (\(W=10\)), the PMSE criterion tends to underparameterize the model. The results for \(W=[0.2T]\), \(W=50\), and \(W=90\) seem to contradict our claim that these specifications of the window size should yield inconsistent model selection; however, for reasonably large sample sizes, these specifications are observationally equivalent to the small window size specification we propose. Table 4 shows the results when the window size is small but diverging, \(W=o(T)\). The results for \(W=T^{3/4}\) support our consistency results. Although the window sizes \(W=T^{1/3}\) and \( W=T^{1/2}\) do not satisfy our sufficient condition (Assumption 1), the resulting criterion chooses the restricted model with probability approaching one when it is true. However, the PMSE criterion with \(W=T^{1/3}\) fails to choose the unrestricted model when it is the truth (see Footnote 2).

Overall, our results suggest that a window size that is a fixed fraction of the total sample size does not appear to give consistent selection when Model 1 is the true data generating process. On the other hand, a constant window size \(W=10\) does not yield consistent selection when Model 2 is true. The slowly diverging window size, in general, selects the correct model with probability approaching one. When \(W=T^{1/3}\), consistency is not evident in the reported sample sizes because the window is small, but unreported results show that the frequency of selecting the correct model eventually converges to 1 as the total sample size grows.

The SIC does select the correct model asymptotically, and it appears to do so with an even higher probability than the PMSE criterion with a slowly diverging window size. However, as we will show in the next set of Monte Carlo simulations, the SIC will not select the correct model in the presence of time variation.

3.2 Simulation 2: Autoregressive DGP With/Without a Time-Varying Parameter

Next we consider two forecasting models

$$\begin{aligned} \begin{aligned}&\text{Model 1}: \; y_t = \alpha y_{t-1} + u_{1,t},\\&\text{Model 2}: \; y_t = \alpha y_{t-1} + \gamma x_t + u_{2,t}, \end{aligned} \end{aligned}$$

where the data are generated by

$$\begin{aligned} \begin{aligned}&x_t = 0.5 x_{t-1}+u_{x,t}, \\&y_t = 0.5 y_{t-1} + \gamma x_t + u_{y,t} , \end{aligned} \end{aligned}$$

with \(u_{x,t}\sim iid\) \(N(0,1)\) and \(u_{y,t}\sim iid\) \(N(0,1)\) independent of each other. We consider four cases: \(\gamma =0\), \(\gamma =0.25\), \(\gamma =0.5\), and \(\gamma =\gamma _{T,t}=t/T-0.5\). When \(\gamma =0\), Model 1 is true. When \(\gamma =0.25\) or \(0.5\), Model 2 is true. Even when \(\gamma _{T,t}=t/T-0.5\), Model 2 should be selected since the true data generating process does include \(x_t\), although its coefficient is time varying. The number of Monte Carlo replications is set to 10,000.
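To make the design concrete, the following is a minimal sketch of a single Monte Carlo draw and the rolling-window PMSE comparison; it is not the code used to produce the tables. The function names (`simulate`, `ols2`, `rolling_pmse`), the burn-in length, the zero starting values, and the seed are assumptions made for illustration. The sketch also assumes that the forecast of \(y_t\) uses coefficients estimated on the \(W\) observations immediately preceding \(t\) and that Model 2 observes \(x_t\), as its equation suggests.

```python
import random


def simulate(T, gamma_fn, seed=0, burn=100):
    """Draw x_0..x_T and y_0..y_T from the DGP of Simulation 2.

    gamma_fn(t, T) returns the coefficient on x_t. The burn-in length and
    the zero starting values are assumptions of this sketch.
    """
    rng = random.Random(seed)
    x_prev = y_prev = 0.0
    xs, ys = [], []
    for t in range(-burn, T + 1):
        x = 0.5 * x_prev + rng.gauss(0.0, 1.0)
        gamma = gamma_fn(max(t, 1), T)  # burn-in periods reuse gamma at t = 1
        y = 0.5 * y_prev + gamma * x + rng.gauss(0.0, 1.0)
        if t >= 0:
            xs.append(x)
            ys.append(y)
        x_prev, y_prev = x, y
    return xs, ys


def ols2(z1, z2, y):
    """OLS of y on two regressors (no intercept) via the 2x2 normal equations."""
    s11 = sum(a * a for a in z1)
    s22 = sum(b * b for b in z2)
    s12 = sum(a * b for a, b in zip(z1, z2))
    b1 = sum(a * c for a, c in zip(z1, y))
    b2 = sum(b * c for b, c in zip(z2, y))
    det = s11 * s22 - s12 * s12
    return (b1 * s22 - b2 * s12) / det, (s11 * b2 - s12 * b1) / det


def rolling_pmse(xs, ys, W):
    """Rolling-window PMSE of Model 1 and Model 2 over t = W+1, ..., T."""
    T = len(ys) - 1
    sse1 = sse2 = 0.0
    for t in range(W + 1, T + 1):
        lag = ys[t - W - 1:t - 1]  # y_{s-1} for s = t-W, ..., t-1
        dep = ys[t - W:t]          # y_s
        reg = xs[t - W:t]          # x_s
        a1 = sum(l * d for l, d in zip(lag, dep)) / sum(l * l for l in lag)
        a2, g2 = ols2(lag, reg, dep)
        sse1 += (ys[t] - a1 * ys[t - 1]) ** 2
        sse2 += (ys[t] - a2 * ys[t - 1] - g2 * xs[t]) ** 2
    n = T - W
    return sse1 / n, sse2 / n


T = 400
W = int(T ** 0.75)  # slowly diverging window, W = T^{3/4}
xs, ys = simulate(T, lambda t, T: 0.5, seed=1)
pmse1, pmse2 = rolling_pmse(xs, ys, W)
print("W =", W, "selected model:", 2 if pmse2 < pmse1 else 1)
```

Passing `lambda t, T: t / T - 0.5` as `gamma_fn` gives the time-varying case; repeating the draw over many seeds and averaging the selection indicator would approximate the selection probabilities reported in the tables.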

Tables 5, 6, 7, and 8 report the empirical probabilities of selecting the right model for the SIC and for the rolling-window PMSE criterion with \(W=[\pi T] \), constant \(W\), and \( W=o(T)\), respectively, when \(\gamma \) is time invariant. As before, the SIC is consistent, and the PMSE criterion tends to overparameterize the model when \(W\) is a large fraction of \(T\) and to underparameterize it when \(W\) is a small constant. The results when \(W\) is a small fraction of \(T\) (\( \pi =0.2\)) or when \(W\) is 50 or 90 show that the PMSE criterion selects the correct model. This may be because, in finite samples, these window sizes are indistinguishable from slowly diverging ones. The results in Table 8 show that the PMSE criterion selects the correct forecasting model with probability approaching one as the sample size increases when \( W\rightarrow \infty \) and \(T^{1/2}/W=O(1)\) as \(T\) grows.

Table 5 Selection probabilities of the SIC
Table 6 Selection probabilities of the PMSE criterion when the window size is a fixed fraction of the sample
Table 7 Selection probabilities of the PMSE criterion when the window size is constant
Table 8 Selection probabilities of the PMSE criterion when the window size is slowly diverging

The aforementioned results indicate that, while the PMSE criterion with a slowly diverging window size is consistent, the SIC tends to perform better. One advantage of the PMSE criterion over the SIC is that the PMSE criterion is robust to parameter instabilities. Table 9 reports the selection probabilities of the SIC and the PMSE criterion when \( \gamma _{T,t}=t/T-0.5\). The coefficient \(\gamma _{T,t}\) is specified so that the in-sample PMSE of Model 2 equals that of Model 1 while the out-of-sample PMSE of Model 2 is smaller than that of Model 1. Table 9 shows that the PMSE criterion selects Model 2 with empirical probability approaching one while the SIC selects Model 1.Footnote 3
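One heuristic way to see why a full-sample criterion has difficulty with this design is that the coefficient averages to approximately zero over the sample. The calculation below is an illustration only, not a derivation taken from the chapter:

$$\begin{aligned} \frac{1}{T}\sum _{t=1}^{T}\gamma _{T,t}=\frac{1}{T}\sum _{t=1}^{T}\left( \frac{t}{T}-\frac{1}{2}\right) =\frac{T+1}{2T}-\frac{1}{2}=\frac{1}{2T}\rightarrow 0. \end{aligned}$$

Within a short window around period \(t\), by contrast, the coefficient is close to \(t/T-0.5\), which is nonzero except near the middle of the sample, so locally estimated versions of Model 2 can still exploit \(x_t\).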

Table 9 Selection probabilities when a parameter is time varying

To summarize, the Monte Carlo results are consistent with our asymptotic theory and the PMSE criterion with a slowly diverging window size chooses the correct forecasting model with probability approaching one, no matter whether the parameters are time varying or not. On the other hand, although the SIC is consistent when the parameter is constant over time, it is inconsistent when the parameter is time varying.

4 Empirical Application

We consider forecasting quarterly inflation \(h\) periods into the future. Let the regression model be:

$$\begin{aligned} y_{t+h}^{h}=\gamma _{0}+\gamma _{1}\left( L\right) x_{t}+\gamma _{2}\left( L\right) y_{t}+u_{t+h}^{h},t=1,\ldots ,T \end{aligned}$$
(18)

where the dependent variable is \(y_{t+h}^{h}=\left( 400/h\right) \ln (P_{t+h}/P_{t})-400\ln \left( P_{t}/P_{t-1}\right) \), \(P_{t}\) is the price level (CPI) at time \(t\), and \(h\) is the forecast horizon, set equal to four so that the forecasts involve annual percent growth rates of inflation. The lag polynomials are \( \gamma _{1}\left( L\right) =\sum _{j=0}^{p}\gamma _{1j}L^{j}\) and \(\gamma _{2}\left( L\right) =\sum _{j=0}^{q}\gamma _{2j}L^{j}\), where \(L\) is the lag operator. Following Stock and Watson (2003), we consider several explanatory variables, \(x_{t}\), one at a time. The explanatory variable, \(x_{t}\), is either an interest rate or a measure of real output, unemployment, price, money, or earnings. The data are transformed to eliminate stochastic or deterministic trends and are converted to quarterly frequency. For a detailed description of the variables that we consider, see Table 10. We utilize quarterly, finally revised data available in January 2011. The earliest starting point of the sample that we consider is January 1959, although both M3 and the exchange rate series have a later starting date due to data availability constraints. Overall, this implies that the total sample size is about 240 observations. In the out-of-sample forecasting exercise, we estimate the number of lags (\(p\) and \(q\)) recursively by BIC; the estimation scheme is rolling with a window size of \(40\) observations. The benchmark model is an autoregressive model:

$$\begin{aligned} y_{t+h}^{h}=\gamma _{0}+\gamma _{2}\left( L\right) y_{t}+u_{t+h}^{h},t=1,\ldots ,T. \end{aligned}$$
(19)
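The dependent variable shared by Eqs. (18) and (19) can be computed directly from a price-level series. The sketch below is a hypothetical helper, assuming `P` is a plain sequence of quarterly price levels indexed from zero; the function name and the example price path are illustrative and do not come from the chapter's data.

```python
import math


def inflation_target(P, t, h=4):
    """Return y^h_{t+h} = (400/h) ln(P[t+h]/P[t]) - 400 ln(P[t]/P[t-1]).

    P is a sequence of quarterly price levels; t must satisfy
    1 <= t <= len(P) - 1 - h so that all three prices exist.
    """
    if t < 1 or t + h >= len(P):
        raise IndexError("need P[t-1], P[t], and P[t+h]")
    future = (400.0 / h) * math.log(P[t + h] / P[t])  # annualized h-quarter inflation
    current = 400.0 * math.log(P[t] / P[t - 1])       # annualized one-quarter inflation
    return future - current


# Hypothetical price path growing 1% per quarter: inflation is constant,
# so the change in inflation that the models forecast is (numerically) zero.
prices = [100.0 * 1.01 ** k for k in range(10)]
print(abs(inflation_target(prices, t=1)) < 1e-9)
```

With \(h=4\) the first term is inflation over the next four quarters at an annual rate, and the second term is current quarterly inflation at an annual rate, so the target is the change in inflation.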

Results are reported in Fig. 1. The figure reports the ratio of the MSFE of the model, Eq. (18), relative to the MSFE of the autoregressive benchmark model, Eq. (19). According to the Monte Carlo simulations in the previous section, the most successful window sizes are between \(T^{1/2}\) and \(T^{2/3}\), which, given the available sample of data, implies between 16 and 39 observations.
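As a quick check of this range, taking \(T\approx 240\), the approximate sample size stated above, gives

$$\begin{aligned} T^{1/2}=240^{1/2}\approx 15.5,\qquad T^{2/3}=240^{2/3}\approx 38.6, \end{aligned}$$

so rounding up to whole observations yields window sizes between 16 and 39. Rounding up is an assumption made here for illustration; series with a later starting date have a smaller \(T\) and hence correspondingly smaller bounds.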

Table 10 Series description
Fig. 1 QLR break test

Panel A reports results for predictors (\(x_{t}\)) that include real output measures. It is well known that such measures should be good predictors of inflation according to the Phillips curve. Several studies have documented the empirical success of Phillips curve models (see, for example, Stock et al. 1999a, 1999b, 2003), although the empirical results in Marcellino et al. (2003) suggest that the ability of such measures to forecast inflation in Europe is more limited than in the United States. The figure shows that capacity utilization, employment, and unemployment measures are very useful predictors for inflation. In fact, when the window size is less than about 80, the MSFE of the model is always smaller than that of the autoregressive benchmark, sometimes even substantially. Note, however, that for larger window sizes the PMSE criterion would suggest that the AR benchmark forecasts better than the economic model.

Earnings, by contrast, are not a successful predictor: for window sizes in the range between \(T^{1/2}\) and \(T^{2/3}\), the model with earnings forecasts significantly worse than the benchmark; it forecasts better only occasionally, and only for larger window sizes. However, recall from the discussion in Sect. 2 that when the window size is large relative to the total sample size, Inoue and Kilian (2005) have shown that the PMSE criterion tends to select overparameterized models. When the window sizes are between \(T^{1/2}\) and \(T^{2/3}\), the previous sections showed that the PMSE criterion tends to select the correct model. This suggests that earnings are particularly unreliable for forecasting inflation.

The performance of industrial production and real GDP predictors, instead, is less clear: the ratio can be either above or below unity depending on the window size. Even for window sizes in the range between \(T^{1/2}\) and \(T^{2/3}\), the ratio can be either above or below unity. These results suggest instabilities in the forecasting performance of these predictors, and are consistent with the results in Rossi and Sekhposyan (2010), although the latter were interested in testing equal predictive ability rather than consistently selecting the correct model, as we do here. Rossi and Sekhposyan (2010) documented that the economic predictors have forecasting ability in the early part of their sample, but that this predictive ability disappears in the later part of their sample. The reversals in predictive ability happened, according to their tests, around the time of the Great Moderation, which the literature dates to 1983–1984 (see McConnell and Perez-Quiros 2000), similar to the results in D’Agostino et al. (2006).

Panel B focuses on monetary measures. M1, M2, and M3 have predictive ability only for a few selected window sizes, again pointing to the presence of instabilities.

Panel C focuses on interest rates. The results are quite interesting. They show that interest rates (such as 1-year or 10-year bonds) appear to be very good predictors of inflation for medium window sizes, below 120–140 observations. Again, however, for very large window sizes the PMSE criterion would select the smaller model. Short-term interest rates tend to be useful predictors only when the window size is large, but again the ratio is below unity for some selected window sizes and above unity for others. Again, we conjecture that instabilities are important, as discussed in Rossi and Sekhposyan (2010).

Panel D focuses on other monetary variables. Stock prices are never useful for predicting inflation. Interestingly, the producer price index is a good predictor for inflation: the figure shows that for the relevant window sizes, the ratio of the MSFE of the model relative to that of the benchmark is always lower than unity, and it becomes higher than unity only for large window sizes.

Table 11 QLR break test P-values

Overall, our empirical results suggest that traditional Phillips curve predictors such as capacity utilization and unemployment, as well as the producer price index, are useful in forecasting inflation. The empirical results for the other macroeconomic predictors are not clear-cut, and might signal the importance of instabilities in the data. In order to provide more information on the instability in the forecasting regressions we consider, we report joint tests for structural breaks in the parameters of Eq. (18) using Andrews' (1993) test for structural breaks. Table 11 reports the p-values of the test, which confirm that instabilities are extremely important.
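For readers unfamiliar with the QLR statistic, the sketch below illustrates the idea on a deliberately simplified regression with an intercept and one regressor: it computes a Chow-type F statistic at every candidate break date in a trimmed range and takes the maximum. This is an assumption-laden illustration, not the chapter's implementation: the homoskedastic F form, the 15% trimming, the function names, and the simulated data are all choices made here, and the p-values in Table 11 would additionally require the nonstandard critical values of Andrews (1993), which the sketch does not compute.

```python
import random


def ssr_simple(x, y):
    """Sum of squared residuals from OLS of y on an intercept and x."""
    n = len(y)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    syy = sum((b - my) ** 2 for b in y)
    return syy - sxy * sxy / sxx


def qlr(x, y, trim=0.15):
    """Return (sup-F statistic, maximizing break date) over trimmed break dates."""
    n, k = len(y), 2  # k = number of coefficients allowed to break
    ssr_full = ssr_simple(x, y)
    lo, hi = int(trim * n), int((1 - trim) * n)
    best_f, best_tau = float("-inf"), None
    for tau in range(lo, hi + 1):
        # Fit separate regressions before and after the candidate break.
        ssr_split = ssr_simple(x[:tau], y[:tau]) + ssr_simple(x[tau:], y[tau:])
        f = ((ssr_full - ssr_split) / k) / (ssr_split / (n - 2 * k))
        if f > best_f:
            best_f, best_tau = f, tau
    return best_f, best_tau


# Hypothetical data: a stable regression and one with an intercept shift at t = 50.
rng = random.Random(0)
x = [rng.gauss(0.0, 1.0) for _ in range(100)]
y_stable = [1.0 + 0.5 * a + rng.gauss(0.0, 1.0) for a in x]
y_break = [v + (5.0 if i >= 50 else 0.0) for i, v in enumerate(y_stable)]
print("stable:", qlr(x, y_stable))
print("break: ", qlr(x, y_break))
```

A large statistic relative to the stable case points to parameter instability, and the maximizing date gives a rough indication of where the break occurs.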

5 Concluding Remarks

When there is a known break, forecasters tend to use post-break observations when they make forecasts. In other words, they base their forecasts on a “truncated window” instead of the full sample. This chapter shows that this type of idea can deliver the consistency of the rolling PMSE criterion not only when parameters are time varying but also when they are constant over time.

In this chapter we focus on the rolling scheme. Inoue and Kilian (2006) show that the PMSE criterion based on the recursive scheme is inconsistent if the number of initial observations is large, i.e., a fixed fraction of the sample size, while Wei (1992) proves that it is consistent if the number of initial observations is very small, i.e., a fixed constant. One might be able to extend Wei's (1992) result to the case in which the number of initial observations diverges at a rate slower than the sample size. However, such a model selection criterion might not be robust to parameter instability.

It should be noted that our consistency results are based on correctly specified nested models. Although information criteria are not robust to parameter instabilities, they are robust to misspecification and nonnestedness (Sin and White 1996). We leave PMSE criterion-based model comparison of misspecified or non-nested models for future research.

The main objective of forecasters is often to minimize PMSE rather than to identify the true model. We are currently developing a data-dependent method for choosing the window size to achieve this goal in a separate chapter.