Consistent Model Selection: Over Rolling Windows

Inoue, Atsushi; Rossi, Barbara; Jin, Lu

doi:10.1007/978-1-4614-1653-1_12

Atsushi Inoue³,
Barbara Rossi⁴ &
Lu Jin⁵

1375 Accesses
1 Citations

Abstract

In this chapter we analyze asymptotic properties of the simulated out-of-sample predictive mean squared error (PMSE) criterion based on a rolling window when selecting among nested forecasting models. When the window size is a fixed fraction of the sample size, Inoue and Kilian (J Econ 130: 273–306, 2006) show that the PMSE criterion is inconsistent. We consider alternative schemes under which the rolling PMSE criterion is consistent. When the window size diverges slower than the sample size at a suitable rate, we show that the rolling PMSE criterion selects the correct model with probability approaching one when parameters are constant or when they are time varying. We provide Monte Carlo evidence and illustrate the usefulness of the proposed methods in forecasting inflation.

Access provided by Autonomous University of Puebla. Download chapter PDF

Forecasting Under Structural Change

Horizon confidence sets

Article Open access 11 June 2020

Long-term prediction intervals of economic time series

Article Open access 10 April 2019

Keywords

These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

1 Introduction

It is a common practice to compare models by out-of-sample predictive mean squared error (PMSE). For example, Meese and Rogoff (1983a); Meese and Rogoff (1983b) and Swanson and White (1997) compare models according to their PMSE calculated in rolling windows. Another common practice is to use a consistent information criterion such as the Schwarz Information Criterion (SIC), used for example in Swanson and White (1997). Information criteria and the out-of-sample PMSE criteria deal with the issue of overfitting inherent in the in-sample PMSE criterion. Information criteria penalizes overparameterized models via penalty terms and are easy to compute. The out-of-sample PMSE criteria simulate out-of-sample forecasts and are very intuitive.^{Footnote 1}

In a recent chapter, Inoue and Kilian (2006) show that the recursive and rolling PMSE criteria are inconsistent and recommend that consistent in-sample information criteria, such as the SIC, be used in model selection. They also show that even when there is structural change these out-of-sample PMSE criteria are not necessarily consistent. Their results are based on the assumption that the window size is proportional to the sample size.

In this chapter we consider an alternative framework in which the window size goes to infinity at a slower rate than the sample size. Under this assumption we show that the rolling-window PMSE criterion is consistent for selecting nesting linear forecasting models. When the nesting model is the truth, the criterion selects the nesting model with probability approaching one because the parameters and thus the PMSE are consistently estimated as the window size diverges. When the nested model is generating the data, the quadratic term in the quadratic expansion of the loss difference becomes dominant when the window size is small. Because the quadratic form is always positive, the criterion will select the nested model with probability approaching one. When the window size is large, however, the linear term and the quadratic term are of the same order and the sign cannot be determined. By letting the window size diverge slowly, the rolling PMSE criterion is consistent under a variety of environments, when parameters are constant or when they are time varying.

When the window size diverges at a slower rate than the sample size, the rolling regression estimator can be viewed as a nonparametric estimator (Giraitis et al. 2011) and time-varying parameters are consistently estimated. We show that our rolling-window PMSE criterion remains consistent even when parameters are time varying. When the window size is large, that is, when it is assumed to go to infinity at the same rate as the total sample size, the criterion is not consistent because the rolling regression estimator is oversmoothed. In the time-varying parameter case, the conventional information criterion is not consistent in general.

This chapter is related to, and different from, the works by West (1996); Clark and McCracken (2001); Giacomini and White (2006); Giacomini and Rossi (2010), and Rossi and Inoue (2011) in several ways. West (1996) and Clark and McCracken (2001) focus on comparing models’ relative to forecasting performance when the window size is a fixed fraction of the total sample size, whereas Giacomini and White (2006) focus on the case where the window size is constant; this chapter focuses instead on the case where the window size goes to infinity but at a slower rate than the total sample size. Giacomini and Rossi (2010) argue that, in the presence of instabilities, traditional tests of predictive ability may be invalid, since they focus on the forecasting performance of the models on average over the out-of-sample portion of the data. To avoid the problem, they propose to compare models’ relative predictive ability in the presence of instabilities by using a rolling window approach over the out-of-sample portion of the data. The latter helps them to follow the relative performance of the models as it evolves over time. In this chapter we focus on consistent model selection procedures, instead, rather than testing; furthermore, our focus is not to compare models’ predictive performance over time, rather to select the best forecasting model asymptotically. Rossi and Inoue (2011) focus on the problem of performing inference on predictive ability that is robust to the choice of the window size. In this chapter, instead, we take as given the choice of the window size and our objective is not to perform tests; we focus instead on understanding whether it is possible to consistently select the true model depending on the size of the window relative to the total sample size.

The rest of this chapter is organized as follows: In Sect. 2 we establish the consistency of the rolling PMSE criterion under the standard stationary environment as well as under the time-varying parameter environment. In Sect. 3 we investigate the finite-sample properties of the rolling-window PMSE criterion. Section 4 demonstrates the usefulness of our criteria in forecasting inflation. Section 5 concludes.

2 Asymptotic Theory

Consider two nesting linear forecasting models, models 1 and 2, to generate $ h$-steps ahead direct forecasts (where $h$ is finite):

$$\begin{aligned} \text{ Model} \text{1}: y_{t+h}&=\alpha ^{*\prime }x_{t}+u_{t+h},\end{aligned}$$

(1)

$$\begin{aligned} \text{ Model} \text{2}: y_{t+h}&=\beta ^{\prime }z_{t}+v_{t+h}\;\;=\;\;\alpha ^{\prime }x_{t}+\gamma ^{\prime }w_{t}+v_{t+h}, \end{aligned}$$

(2)

where dim$(\alpha )=k$ and dim$(\beta )=l$. The first terms on the right-hand sides of Eqs. (1) and (2), $\alpha ^{*\prime }x_{t} $ and $\beta ^{\prime }z_{t}$ are the population linear projections of $ y_{t+h}$ on $x_{t}$ and $z_{t}$, respectively. Thus, $z_{t}$ is uncorrelated with $v_{t+h}$, $\alpha ^{*}=[E(x_{t}x_{t}^{\prime })]^{-1}E(x_{t}y_{t+h})$ and $\beta =[E(z_{t}z_{t}^{\prime })]^{-1}E(z_{t}y_{t+h})$.

Define the population quadratic loss of each model by

$$\begin{aligned} \begin{array}{lllll} \sigma _{1}^{2}&= \lim _{T\rightarrow \infty }\frac{1}{T-h} \sum \limits _{t=1}^{T-h}E[(y_{t+h}-\alpha ^{\prime }x_{t})^{2}]&= \lim _{T\rightarrow \infty }\frac{1}{T-h}\sum \limits _{t=1}^{T-h}E(u_{t+h}^{2}), \\ \sigma _{2}^{2}&= \lim _{T\rightarrow \infty }\frac{1}{T-h} \sum \limits _{t=1}^{T-h}E[(y_{t+h}-\beta ^{\prime }z_{t})^{2}]&= \lim _{T\rightarrow \infty }\frac{1}{T-h}\sum \limits _{t=1}^{T-h}E(v_{t+h}^{2}). \end{array} \end{aligned}$$

Our goal is to select the model with smallest quadratic loss.

Let the window size used for parameter estimation be denoted by $W$ for some $W>h$. Define the rolling ordinary least squares (OLS) estimators as follows, for $t=W+1,...,T$:

$$\begin{aligned} \hat{\alpha }_{t,W}&=\left( \sum _{s=t-W}^{t-h}x_{s}x_{s}^{\prime }\right) ^{-1}\sum _{s=t-W}^{t-h}x_{s}y_{s+h}, \end{aligned}$$

(3)

$$\begin{aligned} \hat{\beta }_{t,W}&=\left( \sum _{s=t-W}^{t-h}z_{s}z_{s}^{\prime }\right) ^{-1}\sum _{s=t-W}^{t-h}z_{s}y_{s+h}, \end{aligned}$$

(4)

and the associated rolling PMSEs by:

$$\begin{aligned} \hat{\sigma }_{1,W}^{2}&=\frac{1}{T-h-W}\sum _{t=W+1}^{T-h}\hat{u}_{t+h}^{2},\end{aligned}$$

(5)

$$\begin{aligned} \hat{\sigma }_{2,W}^{2}&=\frac{1}{T-h-W}\sum _{t=W+1}^{T-h}\hat{v}_{t+h}^{2}, \end{aligned}$$

(6)

where $\hat{u}_{t+h}=y_{t+h}-\hat{\alpha }_{t,W}^{\prime }x_{t}$, $\hat{v} _{t+h}=y_{t+h}-\hat{\beta }_{t,W}^{\prime }z_{t}.$ We say that the rolling PMSE criterion is consistent if

$\hat{\sigma }_{1,W}^{2}<\hat{\sigma }_{2,W}^{2}$ with probability approaching one if $\sigma _{1}^{2}=\sigma _{2}^{2}$; and
$\hat{\sigma }_{1,W}^{2}>\hat{\sigma }_{2,W}^{2}$ with probability approaching one if $\sigma _{1}^{2}>\sigma _{2}^{2}$.

Under what conditions on the window size is the rolling PMSE criterion consistent? The existing results are not positive. When the window size is large relative to the sample size (i.e., $\exists \lambda \in (0,1)$ s.t. $W=\lambda T+o(T)$), Inoue and Kilian (2005) show that the criterion is not consistent. Specifically, when $\sigma _{1}^{2}=\sigma _{2}^{2}$, they show that the criterion selects model 2 with a positive probability resulting in the overparameterized model. We will discuss this result in more detail in the next section, where we will compare it with the theoretical results proposed in this chapter.

When the window size is very small (i.e., $W$ is a fixed constant), it is straightforward to show that the criterion may not be consistent. For example, compare the zero-forecast model ($x_{t}=\emptyset $) and the constant-forecast model ($w_{t}=1$) with $W=h=1$. Suppose that $ y_{t+1}\;=\;c+u_{t+1},\;$ where $u_{t}\sim iid(c,\sigma ^{2})$. Note that $ \sigma _{1}^{2}=c^{2}+\sigma ^{2}$ and $\sigma _{2}^{2}=\sigma ^{2}$. Since

$$\begin{aligned} \begin{array}{llll} \hat{\sigma }_{1,1}^{2}&=\frac{1}{T-1}\sum \limits _{t=1}^{T-1}y_{t+1}^{2} \overset{p}{\rightarrow } c^{2}+\sigma ^{2}, \\ \hat{\sigma }_{2,1}^{2}&= \frac{1}{T-1}\sum \limits _{t=1}^{T-1}(y_{t+1}-y_{t})^{2} \overset{p}{\rightarrow } 2\sigma ^{2}, \end{array} \end{aligned}$$

however, $\hat{\sigma }_{1,1}^{2}<\hat{\sigma }_{2,1}^{2}$ with probability approaching one whenever $c^{2}<\sigma ^{2}$. This is because parameter estimation uncertainty never vanishes even asymptotically, when the window size is fixed.

The goal of the next section is to show that the criterion is consistent if the window size is small, but not too small, relative to the sample size in the following sense: $W\rightarrow \infty $ and $W/T\rightarrow 0$ as $T\rightarrow \infty $. Following Clark and McCracken (2000), we use the following notation: Let $q_{2,t}=z_{t}z_{t}^{\prime }$, $ q_{1,t}=x_{t}x_{t}^{\prime }$, $B_{i}=\left[ E(q_{it})\right] ^{-1}$, $ B_{i}(t)=\left[ \frac{1}{W_{h}}\sum \limits _{s=t-W}^{t-h}q_{i,s}\right] ^{-1}$ , $H_{1}(t)=\frac{1}{W_{h}}\sum \limits _{s=t-W}^{t-h}x_{s}(y_{s+h}-\alpha ^{*\prime }x_{s})$, $H_{2}(t)=\frac{1}{W_{h}}\sum \limits _{s=t-W}^{t-h}z_{s}v_{s+h}$, where $i$ is either 1 or 2 and $ W_{h}=W-h+1$.

2.1 Consistency of the Rolling-Window PMSE Criterion When Parameters are Constant

First, consider the case where the parameters are constant.

Assumption 1

As $T \rightarrow \infty $, $ T^{1/2}/W=O(1)$ and $W/T \rightarrow 0$.

Assumption 2

(a)
$\{[x_t^{\prime }\;z_t^{\prime }\;y_{t+h}]^{\prime }\}$ is covariance stationary and has finite $10$ moments with $E(z_{t}z_{t}^{\prime })$ positive definite and $B_{2}(t)$ positive definite for all $t$ almost surely.
(b)
$W^{1/2}(B_{i}(t)-B_{i})$ and $W^{1/2}H_{i}(t)$ have finite fourth moments uniformly in $t$ for $i=1,2$.
(c)
$E(v_{t+h}|\mathcal F _{t})=0$ with probability one for $ 1,2,\ldots $, where $\mathcal F _{t}$ is the $\sigma $ field generated by $ \{(y_{s+h},z_{s})\}_{s=1}^{t-h}$.
(d)
$E[H_{1}^{\prime }(t)B_{1}(x_{t}x_{t}^{\prime }-E(x_{t}x_{t}^{\prime }))B_{1}H_{1}(t)]=o(W^{-1})$ and $E[H_{2}^{\prime }(t)B_{2}(z_{t}z_{t}^{\prime }-E(z_{t}z_{t}^{\prime }))B_{2}H_{2}(t)]=o(W^{-1})$ uniformly in $t$.
(e)
$$\begin{aligned}&\text{ Cov}\left[\text{ vech}\left(\sum _{t=W+1}^{T-h}H_{i}^{\prime }(t)(B_{i}(t)-B_{i})q_{i,t}(B_{i}(t)-B_{i})H_{i}(t)\right)\right]\\&\quad \qquad =O\left(\sum _{t=W+1}^{T-h}\text{ Cov}\left[\text{ vech}\left(H_{i}^{\prime }(t)(B_{i}(t)-B_{i})q_{i,t}(B_{i}(t)-B_{i})H_{i}(t)\right)\right]\right), \\&\text{ Cov}\left[\text{ vec}\left(\sum _{t=W+1}^{T-h}H_{i}^{\prime }(t)B_{i}q_{i,t}(B_{i}(t)-B_{i})H_{i}(t)\right)\right]\\&\quad \qquad = O\left(\sum _{t=W+1}^{T-h}\text{ Cov}\left[\text{ vec}\left(H_{i}^{\prime }(t)B_{i}q_{i,t}(B_{i}(t)-B_{i})H_{i}(t)\right)\right]\right), \\&\text{ Cov}\left[\text{ vech}\left(\sum _{t=W+1}^{T-h}H_{i}^{\prime }(t)B_{i}q_{i,t}B_{i}H_{i}(t)\right)\right]\\&\quad \qquad = O\left(\sum _{t=W+1}^{T-h}\text{ Cov}\left[\text{ vech}\left(H_{i}^{\prime }(t)B_{i}q_{i,t}B_{i}H_{i}\right)\right]\right), \end{aligned}$$
for $i=1,2$.

Remarks

When the window size is assumed to be proportional to the sample size, $W=[rT]$ for $r\in [0,1]$, the functional central limit theorem (FCLT) is often used to find the asymptotic properties of the recursive and rolling regression estimators (e.g., Clark and McCracken 2001). For example, if $h=1$,

$$\begin{aligned} \sqrt{T}(\hat{\beta }_{t,W}-\beta )\;=\;\left( \frac{1}{T} \sum _{s=t-W}^{t-1}z_{s}z_{s}^{\prime }\right) ^{-1}\frac{1}{\sqrt{T}} \sum _{s=t-W}^{t-1}z_{s}v_{s+1} \end{aligned}$$

and if vech$(z_{t}z_{t}^{\prime })$ and $z_{t}v_{t+1}$ satisfy the FCLT, we obtain

$$\begin{aligned} \sqrt{T}(\hat{\beta }_{[rT]}-\beta )\;\Rightarrow \;\frac{\sigma }{r} [E(z_{t}z_{t}^{\prime })]^{-1/2}B_{l}(r) \end{aligned}$$

where $B_{l}(r)$ is the $l$-dimensional standard Brownian motion, provided $ [z_{t}^{\prime }\;v_{t+1}]^{\prime }$ is covariance stationary. Thus, we have $\hat{\beta }_{t,W}-\beta \;=\;O_{p}(T^{-1/2})$ uniformly in $t$. When the window size diverges slower than the sample size it is tempting to use the same analogy and claims $\hat{\beta }_{t,W}-\beta =O_{p}(W^{-1/2})$ uniformly in $t$. This result does not follow from the FCLT, however, even though $\hat{\beta }_{t,W}-\beta \;=\;O_{p}(W^{-1/2})$ pointwise in $t$. To see why, let $z_{t}=1$. Then

$$\begin{aligned} \hat{\beta }_{t,W}-\beta&=\frac{1}{W}\sum _{s=1}^{t-1}v_{s+1}-\frac{1}{W} \sum _{s=1}^{t-W-1}v_{s+1} \nonumber \\&=\frac{\sqrt{T}}{W}\frac{1}{\sqrt{T}}\sum _{s=1}^{t-1}v_{s+1}-\frac{\sqrt{T} }{W}\frac{1}{\sqrt{T}}\sum _{s=1}^{t-W-1}v_{s+1} \nonumber \\&=o_{p}\left( \frac{\sqrt{T}}{W}\right) \end{aligned}$$

uniformly in $t$, where the last equality follows from $\frac{1}{\sqrt{T}} \sum _{s=1}^{t-1}v_{s+1}-\frac{1}{\sqrt{T}}\sum _{s=1}^{t-W-1}v_{s+1}=o_{p}(1)$ by the FCLT and $W=o(T)$. Thus, the FCLT alone does not imply $\hat{\beta } _{t,W}-\beta =O_{p}(W^{-1/2})$ uniformly in $t$ in general. This is why we need some high-level assumption, such as Assumptions 2(b)(d)(e).

Assumption 1 requires that $W$ diverges slower than $T$. This assumption makes the convergence rates of terms in the expansion of the PMSE differential uneven which helps to establish the consistency of this criterion when the nested model is generating the data. Assumption 2(c) requires that the nesting model is (dynamically) correctly specified. Assumption 2(d) is trivially satisfied if $z_{t}$ is strictly exogenous and allows for weak correlations between $z_{t}$ and $v_{s}$. Assumption 2(e) is a high-level assumption and imposes that the variance of the sum is in the same order of the sum of variances. In other words, the summands are only weakly serially correlated so that their autocovariances decay fast enough. This assumption is somewhat related to the concept of essential stationarity of (Wooldridge (1994), pp. 2643–2644). Assumptions somewhat similar to this condition are used in the central limit theorem for stationary and ergodic processes (e.g., Theorem 5.6 of Hall and Heyde 1980, p. 148) and the central limit theorem for near epoch-dependent processes (e.g., Theorem 5.3 of Gallant and White 1988, p. 76; Assumption C1 of Wooldridge and White 1988).

Theorem 1

Under Assumptions 1 and 2, the rolling-window PMSE criterion is consistent.

To compare our consistency result and the inconsistency result of Inoue and Kilian (2006), consider two simple competing models, $y_{t+h}=u_{t+h}$ (model 1) and $y_{t+h}=c+v_{t+h}$ (model 2) where $v_{t+h}\;$is i.i.d. with mean zero and variance $\sigma _{2}^{2}$ and $h=1.$ The difference of the out-of-sample PMSE can be written as

$$\begin{aligned} \hat{\sigma }_{2,W}^{2}-\hat{\sigma }_{1,W}^{2}\;=\;-\frac{2}{T-W-1} \sum _{t=W+1}^{T-1}(\hat{c}_{t}-c)v_{t+1}+\frac{1}{T-W-1}\sum _{t=W+1}^{T-1}( \hat{c}_{t}-c)^{2} \end{aligned}$$

where $\hat{c}_{t}=(1/W)\sum _{s=t-W}^{t-1}y_{s+1}$. Assume that $c=0$ in population.

When $W=[\lambda T]$ for some $\lambda \in (0,1)$, it follows from Lemmas A6 and A7 of Clark and McCracken (2000) that

$$\begin{aligned} T\left( \hat{\sigma }_{2,W}^{2}-\hat{\sigma }_{1,W}^{2}\right) \;\overset{d}{ \rightarrow }\;&-\frac{2}{\lambda \left( 1-\lambda \right) }\sigma _{2}^{2}\int \limits _{\lambda }^{1}(B(r)-B(r-\lambda ))\mathrm{d}B(r)\\&+\frac{1}{\lambda ^{2}\left( 1-\lambda \right) }\sigma _{2}^{2}\int \limits _{\lambda }^{1}(B(r)-B(r-\lambda ))^{\prime }(B(r)-B(r-\lambda ))\mathrm{d}r \end{aligned}$$

where $B(\cdot )$ is the standard Brownian motion. Because the probability that the right-hand side is negative is nonzero, the criterion is inconsistent when $c=0$. This is the inconsistency result in Inoue and Kilian (2006).

When $W=o(T^{1/(1+2\varepsilon )})$ for some $\varepsilon \in (0,1/2)$, the case considered in this chapter, we have:

$$\begin{aligned} W(\hat{\sigma }_{2,W}^{2}-\hat{\sigma }_{1,W}^{2})&=-\frac{2W^{\frac{1}{2} +\varepsilon }}{T-W-1}\sum _{t=W+1}^{T-1}\left( \frac{1}{W^{\frac{1}{2} +\varepsilon }}\sum _{s=t-W}^{t-1}v_{s+1}\right) v_{t+1} \\&\quad +\frac{1}{T-W-1}\sum _{t=W+1}^{T-1}\left( \frac{1}{W^{\frac{1}{2}}} \sum _{s=t-W}^{t-1}v_{s+1}\right) ^{2} \\&=\frac{1}{T-W-1}\sum _{t=W+1}^{T-1}\left( \frac{1}{W^{\frac{1}{2}}} \sum _{s=t-W}^{t-1}v_{s+1}\right) ^{2}+o_{p}(1) \end{aligned}$$

Because the right-hand side remains positive even asymptotically, the criterion will choose model 1 with probability approaching one. The key for the consistency result is that the last quadratic term in the expansion dominates the middle cross-term when the window size is small.

Lastly, it should be noted that our consistency result does not imply that the resulting forecast based on a slowly diverging window size is optimal. When parameters are constant, one would expect that the optimal forecast for the $T+1$st observation should be based on all $T$ observations, not on the last $W$ observations. Assumption 1 is merely a device to obtain the consistency of the rolling PMSE criterion.

2.2 Consistency of the Rolling-Window PMSE Criterion When Parameters are Time Varying

Sometimes it is claimed that out-of-sample PMSE comparisons are used to protect practitioners from parameter instability. As Inoue and Kilian (2006) show this is not always the case. In this section we show that the rolling PMSE criterion with small window sizes delivers consistent model selection even when parameters are time varying.

Suppose that the slope coefficients are time varying in the sense that

$$\begin{aligned} y_{T,t+h}\;=\;\beta \left( \frac{t}{T}\right) ^{\prime }z_{T,t}+v_{T,t+h} \end{aligned}$$

(7)

where $\beta (r)=[\alpha (r)^{\prime }\;\gamma (r)^{\prime }]^{\prime }$ for $r\in [0,1]$. When the slope coefficients are time varying, the second moments are also time varying. Let

$$\begin{aligned} \left[ \begin{array}{cc} \Gamma _{zz}\left(\frac{t}{T}\right)&\Gamma _{zy}\left(\frac{t}{T}\right)\\ \Gamma _{yz}\left(\frac{t}{T}\right)&\Gamma _{yy}\left(\frac{t}{T}\right) \end{array} \right]&=\left[ \begin{array}{ccc} \Gamma _{xx}\left(\frac{t}{T}\right)&\Gamma _{xw}\left(\frac{t}{T}\right)&\Gamma _{xy}\left(\frac{t}{T}\right) \\ \Gamma _{wx}\left(\frac{t}{T}\right)&\Gamma _{ww}\left(\frac{t}{T}\right)&\Gamma _{wy}\left(\frac{t}{T}\right) \\ \Gamma _{yx}\left(\frac{t}{T}\right)&\Gamma _{yw}\left(\frac{t}{T}\right)&\Gamma _{yy}\left(\frac{t}{T}\right) \end{array} \right] \\&=\left[ \begin{array}{ccc} E[x_{T,t}x_{T,t}^{\prime }]&E[x_{T,t}w_{T,t}^{\prime }]&E[x_{T,t}y_{T,t}]\\ E[w_{T,t}x_{T,t}^{\prime }]&E[w_{T,t}w_{T,t}^{\prime }]&E[w_{T,t}y_{T,t}]\\ E[y_{T,t}x_{T,t}^{\prime }]&E[y_{T,t}w_{T,t}^{\prime }]&E[y_{T,t}^{2}] \end{array} \right] , \end{aligned}$$

for $t=1,2,...,T$ and $T=1,2,...$. Let $\bar{B}_{1}\left(\frac{t}{T} \right)=[E(x_{T,t}x_{T,t}^{\prime })]^{-1}$ and $\bar{B}_{2}\left(\frac{t}{T} \right)=[E(z_{T,t}z_{T,t}^{\prime })]^{-1}$. Then $\beta (\cdot )=[\Gamma _{zz}(\cdot )]^{-1}\Gamma _{zy}(\cdot )$. We compare

$$\begin{aligned} y_{T,t+h}\;=\;\alpha \left(\frac{t}{T}\right)^{\prime }x_{T,t}+u_{T,t+h} \end{aligned}$$

(8)

and (7), where (7) simplifies to () if $\gamma (u)=0$ for all $u\in [0,1]$.

Assumption 3

As $ T\rightarrow \infty $, $T^{1/2}/W=O(1)$ and $W=o(T^{2/3})$.

Assumption 4

(a)
$$\begin{aligned} \xi _{t}\;=\;\text{ vech}\left\{ \left[ \begin{array}{ccc} z_{T,t}z_{T,t}^{\prime }&z_{T,t}y_{T,t+h}&\\ y_{T,t+h}z_{T,t}^{\prime }&y_{T,t+h}^{2}&\end{array} \right] -\left[ \begin{array}{cc} \Gamma _{zz}\left(\frac{t}{T}\right)&\Gamma _{zy}\left(\frac{t}{T}\right)\\ \Gamma _{yz}\left(\frac{t}{T}\right)&\Gamma _{yy}\left(\frac{t}{T}\right)\end{array} \right]\right\} \end{aligned}$$
(9)
has finite fifth moments with $B_{2}(t)$ positive definite for all $t$ almost surely.
(b)
$W^{1/2}\left(B_{i}(t)-\bar{B}_{i}\left(\frac{t}{T }\right)\right)$ and $W^{1/2}H_{i}(t)$ have finite fourth moments uniformly in $t$ for $i=1,2$.
(c)
$E(v_{T,t+h}|\mathcal F _{Tt})=0$ with probability one for $1,2,...$, where $\mathcal F _{Tt}$ is the $\sigma $ field generated by $\{(y_{T,s+h},z_{Ts})\}_{s=1}^{t-h}$.
(d)
$E[H_{i}^{\prime }(t)\bar{B}_{i}\left(\frac{t}{T} \right)(q_{i,T,t}-E(q_{i,T,t}))\bar{B}_{i}\left(\frac{t}{T} \right)H_{i}(t)]=o(W^{-1})$ uniformly in $t$ for $i=1,2$, where $ q_{1,T,t}=x_{T,t}x_{T,t}$ and $q_{2,T,t}=z_{T,t}z_{T,t}^{\prime }$.
(e)
$$\begin{aligned}&\text{ Cov}\left[\text{ vech}\left(\sum _{t=W+1}^{T-h}H_{i}^{\prime }(t)\left(B_{i}(t)-\bar{B}_{i}\left(\frac{t}{T}\right)\right)q_{i,T,t}\left(B_{i}(t)-\bar{B}_{i}\left(\frac{t}{T}\right)\right)H_{i}(t)\right)\right] \\&= O\left(\sum _{t=W+1}^{T-h}\text{ Cov}\left[\text{ vech}\left(H_{i}^{\prime }(t)\left(B_{i}(t)-\bar{B}_{i}\left(\frac{t}{T}\right)q_{i,T,t}\left(B_{i}(t)-\bar{B}_{i}\left(\frac{t}{T}\right)\right)H_{i}(t)\right)\right]\right)\right., \end{aligned}$$
$$\begin{aligned}&\text{ Cov}\left[\text{ vec}\left(\sum _{t=W+1}^{T-h}H_{i}^{\prime }(t)\bar{B}_{i}\left(\frac{t}{T}\right)q_{i,T,t}\left(B_{i}(t)-\bar{B}_{i}\left(\frac{t}{T}\right)\right)H_{i}(t)\right)\right]\\&= O\left(\sum _{t=W+1}^{T-h}\text{ Cov}\left[\text{ vec}\left(H_{i}^{\prime }(t)\bar{B}_{i}\left(\frac{t}{T}\right)q_{i,T,t}\left(B_{i}(t)-\bar{B}_{i}\left(\frac{t}{T}\right)\right)H_{i}(t)\right)\right]\right), \\&\text{ Cov}\left[\text{ vech}\left(\sum _{t=W+1}^{T-h}H_{i}^{\prime }(t)\bar{B}_{i}\left(\frac{t}{T}\right)q_{i,T,t}\bar{B}_{i}\left(\frac{t}{T}\right)H_{i}(t)\right)\right] \\&= O\left(\sum _{t=W+1}^{T-h}\text{ Cov}\left[\text{ vech}\left(H_{i}^{\prime }(t)\bar{B}_{i}\left(\frac{t}{T}\right)q_{i,T,t}\bar{B}_{i}\left(\frac{t}{T}\right)H_{i}\right)\right]\right), \end{aligned}$$
where $i=1,2$.
(f)
$\Gamma _{zz}(u)$ is positive definite for all $u\in [0,1]$, and $\alpha \left( .\right) \equiv $$ \Gamma _{xx}(\cdot )$$^{-1}$$\Gamma _{xy}(\cdot )$ and $\beta \left( .\right) \equiv $$\Gamma _{zz}(\cdot )$$^{-1}$$\Gamma _{zy}(\cdot )$ satisfy a Lipschitz condition of order 1.

Remarks

Assumption 3 is more restrictive than Assumption 1 to keep the bias of the rolling regression estimator from interfering the consistency of the rolling PMSE estimator. Assumptions 4(a)(b) requires that $\xi _{t}$ behaves like a stationary process with enough many moments. Assumptions 4(b)–(e) are analogs of Assumptions 2(b)–(e). Assumption 4(f) requires that the second moments change very smoothly.

Theorem 2

Suppose Assumptions 3 and 4 hold. Then the rolling-window PMSE criterion is consistent.

Remarks

The above consistency result is intuitive once it is recognized that the rolling regression estimator is a nonparametric regression estimator of parameters with a truncated kernel. For example, Cai (2007) establish the consistency and asymptotic normality of nonparametric estimators of time-varying parameters, and Giraitis et al. (2011) prove the consistency and asymptotic normality of nonparametric estimators for stochastic time-varying coefficient AR(1) models.

In general, the conventional information criteria, such as SIC, are not consistent when parameters are time varying. To show why that is the case consider comparing two competing models $y_{t+h}=u_{t+h} $ and $y_{t+h}=c+v_{t+h}$ for $h=1$ when the data are generated from:

$$\begin{aligned} y_{t}\;=\;\frac{t}{T}-\frac{1}{2}+\varepsilon _{t} \end{aligned}$$

(10)

where $\varepsilon _{t}\;$is i.i.d. with mean zero and variance $\sigma ^{2}$. Then the population in-sample PMSE of the zero forecast model is

$$\begin{aligned} \lim _{T\rightarrow \infty }E\left( \frac{1}{T-1}\sum _{t=1}^{T-1}y_{t+1}^{2} \right) \;=\;\sigma ^{2}+\int \limits _{0}^{1}\left( r-\frac{1}{2}\right) ^{2}\mathrm{d}r\;=\sigma ^{2}+\frac{1}{12} \end{aligned}$$

The population in-sample PMSE of the forecast model that estimates the constant in rolling windows is also

$$\begin{aligned} \lim _{T\rightarrow \infty }\min _{c}E\left( \frac{1}{T-1} \sum _{t=1}^{T-1}(y_{t+1}-c)^{2}\right) \;=\;\min _{c}\left( \sigma ^{2}+\int \limits _{0}^{1}(r-c)^{2}\mathrm{d}r\right) \;=\sigma ^{2}+\frac{1}{12} \end{aligned}$$

Thus, the SIC would select the zero forecast model while the true DGP is a time-varying constant forecast model. Our criterion, by re-estimating the constant in rolling windows, is robust to time variation in the parameters and will select the second model with probability approaching unity asymptotically.

3 Monte Carlo Evidence

In this section we investigate the finite-sample performance of the rolling-window PMSE criterion in two Monte Carlo experiments. In the first experiment, we use the data generating process (DGP) of Clark and McCracken (2005) as it is similar to the empirical application that we will consider in the next section. In the second experiment, we use a simple DGP in which the dependent and independent variables both follow first-order autoregressive processes, and consider both constant parameter and time-varying parameter cases.

3.1 Simulation 1: DGP2 in Clark and McCracken (2005)

The second DGP of Clark and McCracken (2005) is based on estimates based on quarterly 1957:1–2004:3 data of inflation ($Y$) and the rate of capacity utilization in manufacturing ($x$). We consider restricted and unrestricted forecasting models as follows:

$$\begin{aligned} \text{ Model} \text{1} :\;\Delta Y_{t+1}&=\alpha _{0}+\alpha _{1}\Delta Y_{t}+\alpha _{2}\Delta Y_{t-1}+u_{1,t+1}\end{aligned}$$

(11)

$$\begin{aligned} \text{ Model} \text{2} :\;\Delta Y_{t+1}&=\alpha _{0}+\alpha _{1}\Delta Y_{t}+\alpha _{2}\Delta Y_{t-1}+\gamma _{1}x_{t-1}+\gamma _{2}x_{t-2}+\gamma _{3}x_{t-3} \nonumber \\&\qquad \qquad \qquad \qquad +\gamma _{4}x_{t-4}+u_{2,t+1} \end{aligned}$$

(12)

When the restricted model (11) is true, the DGP is parameterized using Eq. (7) in Clark and McCracken (2005):

$$\begin{aligned} \Delta Y_{t}&=-0.316\Delta Y_{t-1}-0.214\Delta Y_{t-2}+u_{y,t}, \end{aligned}$$

(13)

$$\begin{aligned} x_{t}&=-0.193\Delta Y_{t-1}-0.242\Delta Y_{t-2}-0.240\Delta Y_{t-3}-0.119\Delta Y_{t-4}\nonumber \\&\quad +1.427x_{t-1}-0.595x_{t-2}+0.294x_{t-3}-0.174x_{t-4}+u_{x,t}, \end{aligned}$$

(14)

where

$$\begin{aligned} \left[ \begin{array}{c} u_{y,t} \\ u_{x,t} \end{array} \right] \overset{iid}{\sim }N\left( \left[ \begin{array}{c} 0 \\ 0 \end{array} \right] ,\left[ \begin{array}{cc} 1.792&0.244 \\ 0.244&1.463 \end{array} \right] \right). \end{aligned}$$

(15)

When the unrestricted model (12) is the truth, the DGP is parameterized using Eq. (9) in Clark and McCracken (2005).

$$\begin{aligned} \Delta Y_{t} =&-0.419\Delta Y_{t-1}-0.258\Delta Y_{t-2} \nonumber \\&+0.331x_{t-1}-0.423x_{t-2}+0.309x_{t-3}-0.139x_{t-4}+u_{y,t}, \end{aligned}$$

(16)

where $x_{t}$ is defined as in Eq. (14) and

$$\begin{aligned} \left[ \begin{array}{c} u_{y,t} \\ u_{x,t} \end{array} \right] \;\overset{iid}{\sim }N\left( \left[ \begin{array}{c} 0 \\ 0 \end{array} \right] ,\left[ \begin{array}{cc} 1.517&0.244 \\ 0.244&1.463 \end{array} \right] \right) , \end{aligned}$$

(17)

In both (15) and (17), the initial values of $\Delta Y_{t}$ and $x_{t}$ are generated with draws from the unconditional normal distribution. We compare the performance of the SIC and the rolling window PMSE criteria; the latter is implemented with a window size that is either (i) fixed relative to the sample size; (ii) proportional to the sample size; or (iii) diverging slower than the sample size. The number of Monte Carlo replications is set to 10,000. Tables 1, 2, 3, 4 report the empirical probabilities of selecting the correct model. If the procedure is correct, the corresponding probabilities in the tables should be unity.

Table 1 Selection probabilities of the SIC

Full size table

Table 2 Selection probabilities of the PMSE criterion when the window size is a fixed fraction of the total sample size

Full size table

Table 3 Selection probabilities of the PMSE criterion when the window size is constant

Full size table

Table 4 Selection probabilities of the PMSE criterion when the window size is slowly diverging

Full size table

Tables 1, 2 and 3 report the results for the SIC, the PMSE criterion with $W$ proportional to $T$, and the PMSE criterion with fixed $W$, respectively. As expected, the SIC selects the correct model with probability approaching one as the sample size increases. The second last column of Table 2 shows that, when the window size is set to a fraction of the total sample size, $W=[\pi T]$, the PMSE criterion tends to overparameterize the model when $\pi $ is not very small. When the window size is fixed to a small number ($W=10$), the PMSE criterion tends to underparameterize the model. The results for $W=[0.2T]$, $W=50$, and $W=90$ seem to contradict our claim that these specifications of the window size should yield inconsistent model selection; however, for reasonably large sample sizes, these specifications are observationally equivalent to the small window size specification we propose. Table 4 shows the results when the window size is small but diverging, $W=o(T)$. The results for $W=T^{3/4}$ support our consistency results. Although the window size $W=T^{1/3}$ and $ W=T^{1/2}$ does not satisfy our sufficient condition (Assumption 1), the resulting criterion chooses the restricted model with probability approaching one when it is true. However, the PMSE criterion with $W=T^{1/3}$ fails to choose the unrestricted model when it is the truth.^{Footnote 2}

Overall, our results suggest that a window size that is a fixed fraction of the total sample size does not appear to give consistent results when Model 1 is the true data generating process. On the other hand, a constant window size $W=10$ is not consistent when Model 2 is true. The divergent window size, in general, consistently selects the correct model, asymptotically. When $W=T^{1/3}$, the consistency is not obvious due to the small window size, but unreported results show that the frequency of consistency will eventually converge to 1 when the total sample size becomes infinitely large.

The SIC does select the correct model asymptotically, and it appears to do so with an even higher probability that the PMSE criterion with a slowly diverging window size. However, as we will show in the next set of Monte Carlo simulations, the SIC will not select the correct model in the presence of time variation.

3.2 Simulation 2:Autoregressive DGP With/Without a Time-Varying Parameter

Next we consider two forecasting models

$$\begin{aligned} \begin{aligned}&\mathrm Model 1: \; y_t = \alpha y_{t-1} + u_{1,t}\\&\mathrm Model 2: \; y_t = \alpha y_{t-1} + \gamma x_t + u_{2,t} \end{aligned} \end{aligned}$$

where the data are generated by

$$\begin{aligned} \begin{aligned}&x_t = 0.5 x_{t-1}+u_{x,t}, \\&y_t = 0.5 y_{t-1} + \gamma x_t + u_{y,t} , \end{aligned} \end{aligned}$$

$u_{x,t}\sim iid$ $N(0,1)$ and $u_{y,t}\sim iid$ $N(0,1)$ are independent of each other. We consider four cases: $\gamma =0;$ $\gamma =0.25;\gamma =$ $ 0.5 $ and $\gamma =t/T-0.5$. When $\gamma =0$ Model 1 is true. Under the cases where $\gamma =0.5$ or $0.25$, Model 2 is true. Even when $\gamma _{T,t}=t/T-0.5$, Model 2 should be selected since the true data generating process does include a constant, although the constant is time varying. The number of Monte Carlo replications is set to 10,000.

Tables 5, 6, 7, and 8 report the empirical probabilities of selecting the right model for the SIC and the rolling-window PMSE criterion with $W=[\pi T] $, $W$ being a constant, and $ W=o(T)$, respectively, when $\gamma $ is time invariant. As before, the SIC is consistent and the PMSE criterion tends to either overparameterize or underparameterize the model when $W$ is a large fraction of $T$ or when $W$ is a small constant. The results when $W$ is a small fraction of $T$ ($ \pi =0.2$) or when $W$ is 50 or 90 show that the PMSE criterion selects the correct model. This may be due to finite samples in which these window sizes are consistent with slowing diverging ones. The results in Table 8 show that the PMSE criterion selects the correct forecasting model with probability approaching one as the sample size increases when $ W\rightarrow \infty $ and $T^{1/2}/W=O(1)$ as $T$ grows.

Table 5 Selection probabilities of the SIC

Full size table

Table 6 Selection probabilities of the PMSE criterion when the window size is a fixed fraction of the sample

Full size table

Table 7 Selection probabilities of the PMSE criterion when the window size is constant

Full size table

Table 8 Selection probabilities of the PMSE criterion when the window size is slowly diverging

Full size table

The aforementioned results indicate that while the PMSE criterion with a slowly diverging window size is consistent the SIC tends to perform better. One advantage of the PMSE criterion over the SIC is that the PMSE criterion is robust to parameter instabilities. Table 9 reports the selection probabilities of the SIC and PMSE criterion when $ \gamma _{T,t}=t/T-0.5$. $\gamma _{T,t}$ is modeled so that the in-sample PMSE of Model 2 equals that of Model 1 while the out-of-sample PMSE of Model 2 is smaller than that of Model 1. Table 9 shows that the PMSE criterion selects Model 2 with empirical probability approaching one while the SIC selects Model 1.^{Footnote 3}

Table 9 Selection probabilities when a parameter is time varying

Full size table

To summarize, the Monte Carlo results are consistent with our asymptotic theory and the PMSE criterion with a slowly diverging window size chooses the correct forecasting model with probability approaching one, no matter whether the parameters are time varying or not. On the other hand, although the SIC is consistent when the parameter is constant over time, it is inconsistent when the parameter is time varying.

4 Empirical Application

We consider forecasting quarterly inflation $h$ -periods into the future. Let the regression model be:

$$\begin{aligned} y_{t+h}^{h}=\gamma _{0}+\gamma _{1}\left( L\right) x_{t}+\gamma _{2}\left( L\right) y_{t}+u_{t+h}^{h},t=1,\ldots ,T \end{aligned}$$

(18)

where the dependent variable is $y_{t+h}^{h}=\left( 400/h\right) \ln (P_{t+h}/P_{t})-400\ln \left( P_{t}/P_{t-1}\right) $ where $P_{t}$ is the price level (CPI) at time $t$, $h$ is the forecast horizon and equals four, so that the forecasts involve annual percent growth rates of inflation. $ \gamma _{1}\left( L\right) =\sum _{j=0}^{p}\gamma _{1j}L^{j}$ and $\gamma _{2}\left( L\right) =\sum _{j=0}^{q}\gamma _{2j}L^{j}$, where $L$ is the lag operator. Following Stock and Watson (2003), we consider several explanatory variables, $x_{t}$, one at a time. The explanatory variable, $x_{t}$, is either an interest rate or a measure of real output, unemployment, price, money, or earnings. The data are transformed to eliminate stochastic or deterministic trends and to quarterly frequencies. For a detailed description of the variables that we consider, see Table 10. We utilize quarterly, finally revised data available in January 2011. The earliest starting point of the sample that we consider is January 1959, although both M3 and the exchange rate series have a later starting date due to data availability constraints. Overall, this implies that the total sample size is about 240 observations. In the out-of-sample forecasting exercise, we estimate the number of lags ($p$ and $q$) recursively by BIC; the estimation scheme is rolling with a window size of $40$ observations. The benchmark model is an autoregressive model:

$$\begin{aligned} y_{t+h}^{h}=\gamma _{0}+\gamma _{2}\left( L\right) y_{t}+u_{t+h}^{h}, t=1,...,T. \end{aligned}$$

(19)

Results are reported in Fig. 1. The figure reports the ratio of the MSFE of the model, Eq. (18), relative to the MSFE of the autoregressive benchmark model, Eq. (). According to the Monte Carlo simulations in the previous section, the most successful window sizes are between $T^{1/2}$ and $T^{2/3}$, which, given the available sample of data, implies between 16 and 39 observations.

Table 10 Series description

Full size table

Panel A reports results for predictors ($x_{t}$) that include real output measures. It is well known that such measures should be good predictors of inflation according to the Phillips curve. Several studies have documented the empirical success of Phillips curve models, see for example Stock et al. (1999a); Stock et al. (1999b) and 2003, although the empirical results in Marcellino et al. (2003) suggests that the ability of such measures to forecast inflation in Europe is more limited than in the United States. The figure shows that capacity utilization, employment, and unemployment measures are very useful predictors for inflation. In fact, when the window size is less than about 80, the MSFE of the model is always smaller than that of the autoregressive benchmark, sometimes even substantially. Note that for larger window sizes the PMSE criterion would however suggest that the AR benchmark forecasts better than the economic model.

Earnings, instead, is not a successful predictor: in window sizes in the range between $T^{1/2}$ and $T^{2/3}$, it is significantly worse, and occasionally better, although only for larger window sizes. However, recall from the discussion in Sect. 2 that when the window size is large relative to the total sample size, Inoue and Kilian (2005) have shown that the PMSE criterion tends to select overparameterized models. When the window sizes are between $T^{1/2}$ and $T^{2/3}$, the previous sections showed that the PMSE criterion tends to select the correct model. This suggests that earnings are particularly unreliable for forecasting inflation.

The performance of industrial production and real GDP predictors, instead, is less clear: the ratio can be either above or below unity depending on the window size. Even for window sizes in the range between $T^{1/2}$ and $T^{2/3}$, the ratio can be either above or below unity. These results suggest instabilities in the forecasting performance of these predictors, and are consistent with the results in Rossi and Sekhposyan (2010), although the latter were interested in testing equal predictive ability rather than consistently selecting the correct model, as we do here. Rossi and Sekhposyan (2010) empirical evidence documented that the economic predictors have forecasting ability in the early part of their sample, but the predictive ability disappears in the later part of their sample. The reversals in predictive ability happened, according to their tests, around the time of the Great Moderation, which the literature dates back to 1983–1984 (see McConnell and Perez-Quiros 2000), similar to the results in D’Agostino et al. (2006).

Panel B focuses on monetary measures. M1, M2, and M3 never have predictive ability except for some selected window sizes, again pointing to the presence of instabilities.

Panel C focuses on interest rates. The results are quite interesting. They show that interest rates (such as 1-year or 10-year bonds) appear to be very good predictors of inflation for medium window sizes, below 120–140 observations. Again, however, for very large window sizes the PMSE criterion would select the smaller model. Short-term interest rates tend to be useful predictors only when the window size is large, but again the ratio is below unity for some selected window sizes and above unity for others. Again, we conjecture that instabilities are important, as discussed in Rossi and Sekhposyan (2010).

Panel D focuses on other monetary variables. Stock prices are never useful for predicting inflation. Interestingly, the producer price index is a good predictor for inflation: the figure shows that for the relevant window sizes, the ratio of the MSFE of the model relative to that of the benchmark is always lower than unity, and it becomes higher than unity only for large window sizes.

Table 11 QLR break test P-values

Full size table

Overall, our empirical results suggest that traditional Phillips curve predictors such as capacity utilization and unemployment are useful in forecasting inflation, as well as the producer price index. The empirical results for the other macroeconomic predictors are not clearcut, and might signal the importance of instabilities in the data. In order to provide more information on the instability in the forecasting regressions we consider, we report joint tests for structural breaks in the parameters of Eq. (18) using Andrews (1993) test for structural breaks. Table 11 reports the p-values of the test, which confirm that instabilities are extremely important.

5 Concluding Remarks

There is a known break, forecasters tend to use post-break observations when they make forecasts. In other words, they base their forecasts on a “truncated window” instead of the full sample. This chapter shows that this type of ideas can deliver the consistency of the rolling PMSE criterion not only when parameters are time varying but also when they are constant over time.

In this chapter we focus on the rolling scheme. Inoue and Kilian (2006) show that the PMSE criterion based on the recursive scheme is inconsistent if the number of initial observations is large, i.e., a fixed fraction of the sample size, while Wei (1992) proves that it is consistent if the number of initial observations is very small, i.e., a fixed constant. One might be able to extend Wei (1992) result to the case in which the number of initial observations diverges at a rate slower than the sample size. However, such a model selection criterion might not be robust to parameter instability.

It should be noted that our consistency results are based on correctly specified nested models. Although information criteria are not robust to parameter instabilities, they are robust to misspecification and nonnestedness (Sin and White 1996). We leave PMSE criterion-based model comparison of misspecified or non-nested models for future research.

The main object of forecasters is often to minimize PMSE rather than identify the true model. We are currently developing a data-dependent method for choosing the window size to achieve this goal in a separate chapter

Notes

1.
The out-of-sample PMSE criteria are based on simulated out-of-sample predictions where parameters are estimated from a subsample to predict an observation outside the subsample. When subsamples always start with the first observation and use consecutive observations whose number is increasing, we call the simulated quadratic loss the recursive PMSE criterion. When subsamples are based on the same number of observations and are moving, we call the simulated quadratic loss the rolling PMSE criterion and the number of observations in the subsamples is the window size. See Inoue and Kilian (2006) for more technical definitions of these criteria.
2.
When $T=100$, $W=T^{1/3}$ is too small to compute a rolling estimator.
3.
Technically, the window size $W=T^{2/3}$ does not satisfy our sufficient condition but yields good results.

References

Andrews, D.W.K. (1993), “Tests for Parameter Instability and Structural Change with Unknown Change Point”, Econometrica 61, 821–856.
Article Google Scholar
D’Agostino, A., D. Giannone, and P. Surico (2006), “(Un)Predictability and Macroeconomic Stability” ECB Working Paper 605.
Google Scholar
Cai, Z., (2007), “Trending Time-Varying Coefficient Time Series Models with Serially Correlated Errors”, Journal of Econometrics, 136, 163–188.
Article Google Scholar
Clark, T.E., and M.W. McCracken (2000), “Not-for-Publication Appendix to ” Tests of Equal Forecast Accuracy and Encompassing for Nested Models, unpublished manuscript, Federal Reserve Bank of Kansas City and Louisiana State Unviersity.
Google Scholar
Clark, T.E., and M.W. McCracken (2001) “Tests of Equal Forecast Accuracy and Encompassing for Nested Models”, Journal of Econometrics, 105, 85–110.
Article Google Scholar
Clark, T.E., and M.W. McCracken (2005), “Evaluating Direct Multistep Forecasts”, Econometric Reviews, 24, 369–404.
Article Google Scholar
Gallant, A.R., and H. White (1988), A Unified Theory of Estimation and Inference for Nonlinear Dynamic Models, Basil Blackwell: New York, NY.
Google Scholar
Giacomini, R. and B. Rossi (2010), “Forecast Comparisons in Unstable Environments”, Journal of Applied Econometrics 25(4), 595–620.
Article Google Scholar
Giacomini, R. and H. White (2006), “Tests of Conditional Predictive Ability”, Econometrica 74(6), 1545–1578.
Article Google Scholar
Giraitis, L., G. Kapetanios and T. Yates (2011),“Inference on Stochastic Time-Varying Coefficient Models”, unpublished manuscript, Queen Mary, University of London, and the Bank of England.
Google Scholar
Hall, P., and C.C. Heyde (1980), Martingale Limit Theory and its Application, Academic Press: San Diego CA.
Google Scholar
Inoue, A., and L. Kilian (2006), “On the Selection of Forecasting Models”, Journal of Econometrics, 130, 273–306.
Article Google Scholar
Marcellino, Massimiliano, James H. Stock and Mark W. Watson (2003), “Macroeconoinic Forecasting in the Euro Area: Country-Speci...c vs. Area-Wide Information”, European Economic Review, 47(1), pages 1–18.
Article Google Scholar
McConnell, M.M., and G. Perez-Quiros (2000), “Output Fluctuations in the United States: What Has Changed Since the Early 1980” American Economic Review, 90(5), 1464–1476.
Article Google Scholar
Meese, R. and K.S. Rogoff (1983a), “Exchange Rate Models of the Seventies. Do They Fit Out of Sample?”, The Journal of International Economics 14, 3–24.
Article Google Scholar
Meese, R. and K.S. Rogoff (1983b), “The Out of Sample Failure of Empirical Exchange Rate Models”, in Jacob Frankel (ed.), Exchange Rates and International Macroeconomics, Chicago: University of Chicago Press for NBER.
Google Scholar
Rossi, B., and A. Inoue (2011), “Out-of-Sample Forecast Tests Robust to the Choice of Window Size”, mimeo.
Google Scholar
Rossi, B., and T. Sekhposyan (2010), “Have Models™ Forecasting Performance Changed Over Time, and When?”, International Journal of Forecasting, 26(4).
Google Scholar
Sin, C.-Y., and H. White (1996), “Information Criteria for Selecting Possibly Misspecified Parametric Models”, Journal of Econometrics, 71, 207–225.
Article Google Scholar
Stock, James H. and Mark W. Watson (1999a), “Business Cycle Fluctuations in U.S. Macroeconomic Time Series”, in Handbook of Macroeconomics, Vol. 1, J.B. Taylor and M. Woodford, eds, Elsevier, 3–64.
Google Scholar
Stock, James H. and Mark W. Watson (1999b), “Forecasting In ation”, Journal of Monetary Economics, 44, 293–335.
Google Scholar
Swanson, N.R., and H. White (1997), “A Model Selection Approach to Real-Time Macroeconomic Forecasting Using Linear Models and Artificial Neural Networks”, Review of Economics and Statistics.
Google Scholar
Wei, C.Z. (1992), “On Predictive Least Squares Principles”, Annals of Statistics, 20, 1–42.
Article Google Scholar
West, K.D. (1996), “Asymptotic Inference about Predictive Ability”, Econometrica 64, 1067–1084.
Article Google Scholar
Wooldridge, J.M., and H. White (1988), Some Invariance Principles and Central Limit Theorems for Dependent Heterogeneous Processes, Econometric Theory 4, 210–230.
Article Google Scholar
Wooldridge, J.M. (1994), “Estimation and Inference for Dependent Processes”, in the Handbook of Econometrics, Volume IV, Edited by R.F. Engle and D.L. McFadden, Chapter 45, pp. 2639–2738.
Google Scholar

Download references

Author information

Authors and Affiliations

Department of Agricultural and Resource Economics, North Carolina State University, Campus Box 8109, Raleigh, NC, 27695-8109, USA
Atsushi Inoue
ICREA-UPF, Barcelona GSE and CR, , calle Ramon Trias Fargas 25-27, Barcelona, 08005, SPAIN
Barbara Rossi
Department of Economics, North Carolina State University, Campus Box 8110, Raleigh, NC, 27695-8110, USA
Lu Jin

Authors

Atsushi Inoue
View author publications
You can also search for this author in PubMed Google Scholar
Barbara Rossi
View author publications
You can also search for this author in PubMed Google Scholar
Lu Jin
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Atsushi Inoue .

Editor information

Editors and Affiliations

Yale University, 30 Hillhouse Ave, New Haven, 06501, USA
Xiaohong Chen
Hamilton St 75, NEW BRUNSWICK, 08901-1248, New Jersey, USA
Norman R. Swanson

Appendix

1.1 A.1 Lemmas

Next, we present a lemma similar to Lemma A2 of Clark and McCracken (2000).

Lemma 1

Suppose that Assumptions 1 and 2 hold and that $\gamma =0$. Then:

(a)
$\frac{1}{T-h-W} \sum _{t=W+1}^{T-h}u_{t+h}x_{t}B_{1}(t)H_{1}(t) =o_{p}\left(\frac{1}{W} \right) $.
(b)
$\frac{1}{T-h-W} \sum _{t=W+1}^{T-h}v_{t+h}z_{t}B_{2}(t)H_{2}(t) =o_{p}\left(\frac{1}{W} \right) $.
(c)
$\frac{1}{T-h-W}\sum \limits _{t=W+1}^{T-h}H_1^{\prime }(t)B_1(t) x_tx_t^{\prime }B_1(t)H_1(t)=\frac{1}{T-h-W}\sum \limits _{t=W+h}^{T-1}H_1^{\prime }(t)B_1H_1(t)+o_p\left( \frac{1}{W}\right)$.
(d)
$\frac{1}{T-h-W}\sum \limits _{t=W+1}^{T-h}H_2^{\prime }(t)B_2(t) z_tz_t^{\prime }B_2(t)H_2(t) = \frac{1}{T-h-W}\sum \limits _{t=W+1}^{T-h}H_2^{\prime }(t)B_2H_2(t)+o_{p}\left( \frac{1}{W}\right)$.

Proof of Lemma 1: The proofs for (a) and (c) are very similar to those for (b) and (d), respectively. For brevity, we only provide the proofs of (b) and (d). The results for (a) and (c) can be easily derived by replacing $z_{t}$ and $ \beta $ by $x_{t}$ and $\alpha $, respectively.

Note that

$$\begin{aligned} \frac{1}{T-h-W}\sum _{t=W+1}^{T-h}v_{t+h}z_{t}B_{2}(t)H_{2}(t)&= \frac{1}{T-h-W}\sum _{t=W+1}^{T-h}v_{t+h}z_{t}B_{2}H_{2}(t) \\&\quad +\frac{1}{T-h-W}\sum _{t=W+1}^{T-h}v_{t+h}z_{t}(B_{2}(t)-B_{2})H_{2}(t) \end{aligned}$$

By Assumption 2(b) and Hölder’s inequality, the second moments of the summands on the right-hand side are of order $O(W^{-1})$ and $O(W^{-2})$, respectively. Thus, it follows from Assumption 2(c) that the variance of the left-hand side is of order $O(T^{-1}W^{-1})$. By the Chebyshev inequality and Assumption 1, the left-hand side is $o_{p}(W^{-1})$.

The proof of (d) is composed of two stages. In the first stage, we show that $B_2(t)$ in the equation can be approximated by its expectation $B_2$, which is

$$\begin{aligned}&\frac{1}{T-h-W}\sum \limits _{t=W+1}^{T-h}H_2^{\prime }(t)B_2(t) z_tz_t^{\prime }B_{2}(t)H_2(t) \nonumber \\&\quad =\frac{1}{T-h-W}\sum \limits _{t=W+1}^{T-h}H_2^{\prime }(t)B_{2} z_tz_t^{\prime }B_{2}H_2(t)+o_p\left(\frac{1}{W}\right) \end{aligned}$$

(20)

Since the left-hand side of Eq. (A.1) contains four terms,

$$\begin{aligned}&\frac{1}{T-h-W} \sum \limits _{t=W+1}^{T-h}H_2^{\prime }(t)B_2(t) z_tz_t^{\prime }B_2(t)H_2(t) \nonumber \\&\quad \qquad = \frac{1}{T-h-W} \sum \limits _{t=W+1}^{T-h}H_2^{\prime }(t)B_2z_tz_t^{\prime }B_2H_2(t) \nonumber \nonumber \\&\quad \qquad \quad \quad + \frac{1}{T-h-W} \sum \limits _{t=W+1}^{T-h}H_2^{\prime }(t)(B_2(t)-B_2)z_tz_t^{\prime }(B_2(t)-B_2)H_2(t) \nonumber \nonumber \\&\quad \qquad \quad \quad + \frac{1}{T-h-W} \sum \limits _{t=W+1}^{T-h}H_2^{\prime }(t)B_2z_tz_t^{\prime }(B_2(t)-B_2)H_2(t) \nonumber \nonumber \\&\quad \qquad \quad \quad + \frac{1}{T-h-W} \sum \limits _{t=W+1}^{T-h}H_2^{\prime }(t)(B_2(t)-B_2)z_tz_t^{\prime }B_2H_2(t), \end{aligned}$$

(21)

which include the first term in the right-hand side of Eq. (A.1).

By Assumption 2(b) and Hölder’s inequality, the second moments of the summands in the last three terms are of order $ O(W^{-4})$, $O(W^{-3})$, and $O(W^{-3})$, respectively. Thus, their first moments are at most $O(W^{-3})=o(W^{-1})$. By using these and Assumption 2(e), the second moments of the last three terms are thus of the order $ O(T^{-1}W^{-4})$, $O(T^{-1}W^{-3})$ and $O(T^{-1}W^{-1})$, respectively. By the Chebyshev inequality and Assumption 1, these last three terms are of the order $o_{p}(W^{-1})$, proving (A.1).

The second stage of the proof of (d) is to show that we can further approximate $z_tz_t^{\prime }$ in the first term in the right-hand side of Eq. (A.2) by its expectation $ E(z_tz_t^{\prime }) $. Adding and subtracting $E(z_t z_t^{\prime })$, we obtain

$$\begin{aligned}&\frac{1}{T-h-W} \sum \limits _{t=W+1}^{T-h}H_2^{\prime }(t)B_2z_tz_t^{\prime }B_2H_2(t)\nonumber \\&\qquad =\frac{1}{T-h-W} \sum \limits _{t=W+1}^{T-h}H_2^{\prime }(t)B_2E(z_tz_t^{\prime })B_2H_2(t) \nonumber \nonumber \\&\qquad \quad \quad + \frac{1}{T-h-W} \sum \limits _{t=W+1}^{T-h}H_2^{\prime }(t)B_2(z_tz_t^{\prime }-E(z_tz_t^{\prime }))B_2 H_2(t) \end{aligned}$$

(22)

The mean of the second term is $o_{p}(W^{-1})$ by Assumption 2(d). The second moments of the summand in the second term is $O(W^{-2})$ by Assumption 2(b). Using these and Assumption 2(e), the second moment of the second term is of the order $o(W^{-2})$. By the Chebyshev inequality, () is $o_{p}(W^{-1})$.

Lemma 2.

Suppose that Assumptions 3 and 4 hold and that $\gamma (\cdot )=0$.

(a)
$\frac{1}{T-h-W} \sum _{t=W+1}^{T-h}u_{T,t+h}x_{T,t}B_{1}(t)H_{1}(t) =o_{p}\left(\frac{1}{W} \right)$.
(b)
$\frac{1}{T-h-W} \sum _{t=W+1}^{T-h}v_{T,t+h}z_{T,t}B_{2}(t)H_{2}(t) =o_{p}\left(\frac{1}{W} \right)$.
(c)
$\frac{1}{T-h-W}\sum \limits _{t=W+1}^{T-h}H_1^{\prime }(t)B_1(t) x_{T,t}x_{T,t}^{\prime }B_1(t)H_1(t) = \frac{1}{T-h-W}\sum \limits _{t=W+1}^{T-h}H_1^{\prime }(t)\bar{ B}_1\left(\frac{t}{T}\right)H_1(t)+o_p\left(\frac{1}{W}\right)$.
(d)
$\frac{1}{T-h-W}\sum \limits _{t=W+1}^{T-h}H_2^{\prime }(t)B_2(t) z_{T,t}z_{T,t}^{\prime }B_2(t)H_2(t) = \frac{1}{T-h-W}\sum \limits _{t=W+1}^{T-h}H_2^{\prime }(t)\bar{B}_2\left(\frac{t}{T}\right)H_2(t)+o_{p}\left(\frac{1}{W}\right)$.

Proof of Lemma 2 Under Assumptions 3 and 4 the proof of Lemma 2 takes exactly the same steps as the proof of Lemma 1 except that $B_{i}$, $u_{t}$, and $v_{t}$ are replaced by $\bar{B}_{i}\left(\frac{t}{T}\right)$, $u_{T,t}$, and $v_{T,t}$, respectively. This is because Lemma 2 is written in terms of $u_{T,t}$ and $ v_{T,t}$ rather than in terms of $\hat{\alpha }_{t,W}-\alpha \left(\frac{t}{T} \right)$ and $\hat{\beta }_{t,W}-\beta \left(\frac{t}{T}\right)$ which we deal with in the proof of Theorem 2.

1.2 A.2 Proofs of Theorems

Proof of Theorem 1 Note that the PMSEs $\hat{\sigma }_{1,W}^{2}$ and $\hat{\sigma }_{2,W}^{2}$ can be expanded as

$$\begin{aligned} \hat{\sigma }_{1,W}^{2}&=\frac{1}{T-h-W}\sum \limits _{t=W+1}^{T-h}{\left( { y_{t+h}-\widehat{\alpha }_{t}^{\prime }x_{t}}\right) }^{2} \nonumber \\&=\frac{1}{T-h-W}\sum \limits _{t=W+1}^{T-h}{\left( y_{t+h}-{\alpha ^{*}} ^{\prime }x_{t}-\left( \widehat{\alpha }_{t}^{\prime }x_{t}-{\alpha ^{*}}^{\prime }x_{t}\right) \right) ^{2}} \nonumber \\&=\frac{1}{T-h-W}\sum \limits _{t=W+1}^{T-h}{\left( y_{t+h}-{\alpha ^{*}} ^{\prime }x_{t}\right) ^{2}} \nonumber \\&\qquad -\frac{2}{T-h-W}\sum \limits _{t=W+1}^{T-h}{ \left( y_{t+h}-{\alpha ^{*}}^{\prime }x_{t}\right) x_{t}^{\prime }\left( \widehat{\alpha }_{t}-{\alpha ^{*}}\right) } \nonumber \\&\qquad +\frac{1}{T-h-W}\sum \limits _{t=W+1}^{T-h}{\left( \widehat{\alpha } _{t}^{\prime }-{\alpha ^{*}}^{\prime }\right) x_{t}x_{t}^{\prime }\left( \widehat{\alpha }_{t}-{\alpha ^{*}}\right) } \end{aligned}$$

(23)

and

$$\begin{aligned} \hat{\sigma }_{2,W}^{2}&=\frac{1}{T-h-W} \sum \limits _{t=W+1}^{T-h}{\left( { y_{t+h}-\widehat{\beta }_{t}^{\prime }z_{t}}\right) }^{2} \nonumber \\&=\frac{1}{T-h-W}\sum \limits _{t=W+1}^{T-h}{\left( {y_{t+h}-\beta ^{\prime }z_{t}-\left( {\widehat{\beta }_{t}^{\prime }z_{t}-\beta ^{\prime }z_{t}} \right) }\right) }^{2} \nonumber \\&=\frac{1}{T-h-W}\sum \limits _{t=W+1}^{T-h}{\left( {y_{t+h}-\beta ^{\prime }z_{t}}\right) }^{2} \nonumber \\&\qquad -\frac{2}{T-h-W} \sum \limits _{t=W+1}^{T-h}{\left( { y_{t+h}-\beta ^{\prime }z_{t}}\right) }z_{t}^{\prime }\left( {\widehat{\beta }_{t}-\beta }\right) \nonumber \\&\qquad +\frac{1}{T-h-W}\sum \limits _{t=W+1}^{T-h}{\left( {\widehat{\beta } _{t}^{\prime }-\beta ^{\prime }}\right) }z_{t}z_{t}^{\prime }\left( { \widehat{\beta }_{t}-\beta }\right) , \end{aligned}$$

(24)

respectively, where $\alpha ^{*}=[E(x_{t}x_{t}^{\prime })]^{-1}E(x_{t}y_{t+h})$. There are two cases: the case in which the data are generated from model 1, i.e., $\gamma =0$ (case 1) and the case in which the data are generated from model 2, i.e., $\gamma \ne 0$ (case 2).

In case 1, the actual model is $y_{t+h}=\alpha ^{\prime }x_{t}+v_{t+h}$. The first component of $\hat{\sigma }_{2,W}^{2}$ in Eq. (A.5) is numerically identical to the first component of $\hat{\sigma }_{1,W}^{2}$ in Eq. (A.4) because $\gamma =0$ and $\alpha -\alpha ^{*}=0$. Note that all the other components converge to zero faster since all parameters are consistently estimated. Under the case where Model 1 is true, the difference between the probability limit of $\hat{\sigma }_{1,W}^{2}$ and $\hat{\sigma }_{2,W}^{2}$ is zero, which does not identify which model is the true model. Only comparing the probability limits of $\hat{\sigma }_{1,W}^{2}$ and $\hat{\sigma }_{2,W}^{2}$ as $T$ and $W $ go to infinity and $W$ diverges slowly than $T$ is not sufficient for the model selection to indicate that $\lim _{T\rightarrow \infty ,\ W\rightarrow \infty }P(\hat{\sigma }_{1,W}^{2}<\hat{\sigma } _{2,W}^{2})=1$. However, if we can tell whether $\hat{\sigma }_{1,W}^{2}$ is always smaller than $\hat{\sigma }_{2,W}^{2}$ along the path of convergence of $T$ and $W$ toward infinity, the true model can still be identified. Since the models are nested $u_{t+h}=v_{t+h}$, it follows from () and (A.5) that

$$\begin{aligned} \hat{\sigma }_{2,W}^{2}-\hat{\sigma }_{1,W}^{2}&=\frac{2}{T-h-W} \sum _{t=W+1}^{T-h}\left[ v_{t+h}z_{t}^{\prime }(\widehat{\beta }_{t}-\beta )-v_{t+h}x_{t}^{\prime }(\widehat{\alpha }_{t}-\alpha )\right] \nonumber \\&\quad +\frac{1}{T-h-W}\sum \limits _{t=W+1}^{T-h} \left[ {\left( {\widehat{\beta } _{t}^{\prime }-\beta ^{\prime }}\right) }z_{t}z_{t}^{\prime }\left( { \widehat{\beta }_{t}-\beta }\right) \right. \nonumber \\&\qquad \qquad \qquad \qquad \qquad \qquad \left. -{\left( {\widehat{\alpha }_{t}^{\prime }-{\alpha }^{\prime }}\right) }x_{t}x_{t}^{\prime }\left( {\widehat{\alpha } _{t}-{\alpha }}\right) \right] \nonumber \\&=\frac{2}{T-h-W}\sum _{t=W+1}^{T-h}\left[ v_{t+h}z_{t}^{\prime }B_{2}(t)H_{2}(t)-v_{t+h}x_{t}^{\prime }B_{1}(t)H_{1}(t)\right] \nonumber \\&\quad +\frac{1}{T-h-W}\sum _{t=W+1}^{T-h}\left[ H_{2}(t)^{\prime }B_{2}(t)z_{t}z_{t}^{\prime }B_{2}(t)H_{2}(t)\right. \nonumber \\&\qquad \qquad \qquad \qquad \qquad \qquad \left.-H_{1}(t)^{\prime }B_{1}(t)x_{t}x_{t}^{\prime }B_{1}(t)H_{1}(t)\right] \nonumber \\&=\frac{1}{T-h-W}\sum _{t=W+1}^{T-h}\left[ H_{2}(t)^{\prime }B_{2}H_{2}(t)-H_{1}(t)^{\prime }B_{1}H_{1}(t)\right] +o_{p}\left( \frac{1}{W }\right) \end{aligned}$$

(25)

where the last equality follows from Lemma 1(a)–(d).

To get the sign of Eq. (A.6), we first define $Q$ by

$$\begin{aligned} Q\;=\;[E(z_{t}z_{t}^{\prime })]^{\frac{1}{2}}\left\{ [E(z_{t}z_{t}^{\prime })]^{-1}-\left[ \begin{array}{cc} [E(x_{t}x_{t}^{\prime })]^{-1}&\mathbf 0 _{l\times (k-l)} \\ \mathbf 0 _{(k-l)\times l}&\mathbf 0 _{(k-l)\times (k-l)} \end{array} \right] \right\} [E(z_{t}z_{t}^{\prime })]^{\frac{1}{2}} \end{aligned}$$

(26)

as in Lemma A.4 of Clark and McCracken (2000). Clark and McCracken (2000) show that the $Q$ matrix is symmetric and idempotent. An idempotent matrix is positive semidefinite, which means for all $v\in \mathfrak R ^{k}$, $v^{T}Qv\ge 0 $. It implies that

$$\begin{aligned}&\quad \left[ \frac{1}{W_{h}^{\frac{1}{2}}}\sum \limits _{s=t-W}^{t-h}z_{s}v_{s+h} \right] ^{\prime }[E(z_{t}z_{t}^{\prime })]^{-1}\left[ \frac{1}{W_{h}^{\frac{ 1}{2}}}\sum \limits _{s=t-W}^{t-h}z_{s}v_{s+h}\right] \nonumber \\&\quad -\left[ \frac{1}{W_{h}^{\frac{1}{2}}}\sum \limits _{s=t-W}^{t-h}x_{s}v_{s+h} \right] ^{\prime }[E(x_{t}x_{t}^{\prime })]^{-1}\left[ \frac{1}{W_{h}^{\frac{ 1}{2}}}\sum \limits _{s=t-W}^{t-h}x_{s}v_{s+h}\right] \nonumber \\&=\left[ \frac{1}{W_{h}^{\frac{1}{2}}}\sum \limits _{s=t-W}^{t-h}z_{s}v_{s+h} \right] ^{\prime }\left\{ [E(z_{t}z_{t}^{\prime })]^{-1}-\left[ \begin{array}{cc} [E(x_{t}x_{t}^{\prime })]^{-1}&\mathbf 0 _{l\times (k-l)} \\ \mathbf 0 _{(k-l)\times l}&\mathbf 0 _{(k-l)\times (k-l)} \end{array} \right] \right\} \nonumber \\&\qquad \times \left[ \frac{1}{W_{h}^{\frac{1}{2}}}\sum \limits _{s=t-W}^{t-h}z_{s}v_{s+h}\right] \nonumber \\&=\left[ \frac{1}{W_{h}^{\frac{1}{2}}}\sum \limits _{s=t-W}^{t-h}z_{s}v_{s+h} \right] ^{\prime }[E(z_{t}z_{t}^{\prime })]^{-\frac{1}{2}}\cdot Q\cdot [E(z_{t}z_{t}^{\prime })]^{-\frac{1}{2}} \nonumber \\&\qquad \times \left[ \frac{1}{W_{h}^{\frac{1}{2}}}\sum \limits _{s=t-W}^{t-h}z_{s}v_{s+h}\right]\;\;\ge \;\;0 \end{aligned}$$

(27)

Note that the probability that $[E(z_{t}z_{t}^{\prime })]^{-1/2}W_{h}^{-1/2}\sum _{s=t-W}^{t-h}z_{s}v_{s+h}$ lies in the null space of $Q$ for infinitely many $t$ approaches zero because the dimension of the null space is $l<k$. Thus, the average of (A.8) over $t$ is positive with probability approaching one. Combining the results in Eqs. (A.6) and (A.8), we find that the sign of $W(\hat{\sigma }_{2,W}^{2}-\hat{\sigma }_{1,W}^{2})$ is always positive with probability approaching one. Therefore, when $\gamma =0$, $\hat{\sigma } _{1,W}^{2}<\hat{\sigma }_{2,W}^{2}$ with probability approaching one.

In case 2, that is, when Model 2 is the true model, we have $y_{t+h}$ $=\beta ^{\prime }z_{t}+v_{t+h}$ $=\alpha ^{\prime }x_{t}+\gamma ^{\prime }w_{t}+v_{t+h}$. By Assumptions 2(a)(b), the second and third terms on the right-hand sides of () and (A.5) are both $o_{p}(T^{1/2}/W)$ and $ o_{p}(T/W^{2})$, respectively. Thus, they are $o_{p}(1)$ by Assumption 1. The first term on the right-hand side of Eq. (A.5) converges to the variance of $v_{t+h}$ as the sample size $T$ goes to infinity:

$$\begin{aligned} \frac{1}{T-h-W}\sum \limits _{t=W+1}^{T-h}{\left( {y_{t+h}-\beta ^{\prime }z_{t}}\right) }^{2}=\frac{1}{T-h-W}\sum \limits _{t=W+1}^{T-h}{v_{t+h}^{2}} \overset{p}{\rightarrow }\sigma _{2}^{2}. \end{aligned}$$

(28)

Similarly, the first term on the right-hand side of Eq. () converges in probability to the variance of $u_{t+h}\equiv y_{t+h}-\alpha ^{*\prime }x_{t}$:

$$\begin{aligned} \hat{\sigma }_{1,W}^{2}&=\frac{1}{T-h-W}\sum \limits _{t=W+h}^{T-1}{\left( y_{t+h}-{\alpha ^{*}}^{\prime }x_{t}\right) ^{2}}+o_{p}(1) \nonumber \\&\overset{p}{\rightarrow } E\left[ (y_{t+h}-{\alpha ^{*}}^{\prime }x_{t})^{2}\right] \nonumber \\&=E\left[ (\alpha ^{\prime }x_{t}+\gamma ^{\prime }w_{t}+v_{t+h}-{\alpha ^{*}}^{\prime }x_{t})^{2}\right] \nonumber \\&=E\left[ (v_{t+h}+(\alpha ^{\prime }-{\alpha ^{*}}^{\prime })x_{t}+\gamma ^{\prime }w_{t})^{2}\right] \nonumber \\&=\sigma _{2}^{2}+\left[ \begin{array}{c} \alpha -\alpha ^{*} \\ \gamma \end{array} \right] ^{\prime }\left[ \begin{array}{cc} E(x_{t}x_{t}^{\prime })&E(x_{t}w_{t}^{\prime }) \\ E(w_{t}x_{t}^{\prime })&E(w_{t}w_{t}^{\prime }) \end{array} \right] \left[ \begin{array}{c} \alpha -\alpha ^{*} \\ \gamma \end{array} \right]\;\;>\;\;\sigma _{2}^{2}. \end{aligned}$$

(29)

Therefore, when Model 2 is true, the PMSEs satisfy $P(\hat{\sigma }_{1,W}^{2}> \hat{\sigma }_{2,W}^{2})=1$ as $T\rightarrow \infty $ and $W\rightarrow \infty $, where $W$ diverges slower than $T$.

Proof of Theorem 2 Note that the PMSEs, $\hat{\sigma }_{1,W}^{2}$ and $\hat{\sigma }_{2,W}^{2}$ can be expanded as

$$\begin{aligned} \hat{\sigma }_{1,W}^{2}&=\frac{1}{T-h-W}\sum \limits _{t=W+1}^{T-h}\left( y_{T,t+h}-\alpha ^{*}\left(\frac{t}{T}\right) ^{\prime }x_{T,t}\right)^{2} \nonumber \\&\quad -\frac{2}{T-h-W}\sum \limits _{t=W+1}^{T-h}\left( y_{T,t+h}-\alpha ^{*}\left(\frac{t}{T}\right)^{\prime }x_{T,t}\right) x_{T,t}^{\prime } \left( \widehat{\alpha }_{t}-\alpha ^{*}\left(\frac{t}{T}\right)\right)\nonumber \\&\quad +\frac{1}{T-h-W}\sum \limits _{t=W+1}^{T-h}\left(\widehat{\alpha } _{t}^{\prime }-\alpha ^{*}\left(\frac{t}{T}\right)\right)^{\prime } x_{T,t}x_{T,t}^{\prime }\left( \widehat{\alpha }_{t}-\alpha ^{*}\left( \frac{t}{T}\right)\right) \end{aligned}$$

(30)

and

$$\begin{aligned} \hat{\sigma }_{2,W}^{2}&= \frac{1}{T-h-W}\sum \limits _{t=W+1}^{T-h}\left( y_{T,t+h}-\beta ^{\prime }\left(\frac{t}{T}\right)z_{T,t}\right)^{2} \nonumber \\&\quad -\left( \frac{2}{T-h-W}\right) \sum \limits _{t=W+1}^{T-h}\left( y_{T,t+h}-\beta ^{\prime }\left(\frac{t}{T}\right)z_{T,t}\right) z_{T,t}^{\prime }\left( \widehat{\beta }_{t}-\beta \left(\frac{t}{T}\right) \right) \nonumber \\&\quad +\frac{1}{T-h-W} \sum \limits _{t=W+1}^{T-h} \left(\widehat{\beta } _{t}^{\prime }-\beta ^{\prime }\left(\frac{t}{T}\right)\right)^{\prime } z_{t}z_{t}^{\prime }\left( \widehat{\beta }_{t}-\beta \left(\frac{t}{T} \right) \right) , \end{aligned}$$

(31)

respectively. If we show that each of

$$\begin{aligned}&\frac{1}{T-h-W}\sum \limits _{t=W+1}^{T-h}\left( y_{T,t+h}-\alpha ^{*}\left(\frac{t}{T}\right)^{\prime }x_{T,t}\right) x_{t}^{\prime }\left( \widehat{\alpha }_{t}-{\alpha ^{*}\left(\frac{t}{T}\right)}\right)\nonumber \\&\quad - \frac{1}{T-h-W}\sum \limits _{t=W+1}^{T-h}u_{T,t+h}x_{T,t}^{\prime } B_{1}(t)H_{1}(t), \end{aligned}$$

(32)

$$\begin{aligned}&\frac{1}{T-h-W}\sum \limits _{t=W+1}^{T-h}\left( \widehat{\alpha } _{t}^{\prime }-\alpha ^{*}\left(\frac{t}{T}\right)\right)^{ \prime }x_{T,t}x_{T,t}^{\prime }\left( \widehat{\alpha }_{t}-\alpha ^{*}\left(\frac{t}{T}\right)\right) \nonumber \\&\quad - \frac{1}{T-h-W}\sum \limits _{t=W+1}^{T-h}H_{1}(t)^{\prime }B_{1}(t)z_{T,t}z_{T,t}^{\prime }B_{2}(t)H_{2}(t), \end{aligned}$$

(33)

$$\begin{aligned}&\frac{1}{T-h-W}\sum \limits _{t=W+1}^{T-h}\left(y_{T,t+h}-\beta \left( \frac{t}{T}\right)^{\prime }z_{T,t}\right)z_{T,t}^{\prime }\left( \widehat{ \beta }_{t}-\beta \left(\frac{t}{T}\right)\right) \nonumber \\&\quad - \frac{1}{T-h-W}\sum \limits _{t=W+1}^{T-h}v_{T,t+h}z_{T,t}^{\prime } B_{2}(t)H_{2}(t), \end{aligned}$$

(34)

$$\begin{aligned}&\frac{1}{T-h-W}\sum \limits _{t=W+1}^{T-h}\left( \widehat{\beta } _{t}^{\prime }-\beta \left(\frac{t}{T}\right)^{\prime }\right)^{\prime } z_{T,t}z_{T,t}^{\prime }\left( \widehat{\beta }_{t}-\beta \left(\frac{t}{T} \right)\right) \nonumber \\&\quad - \frac{1}{T-h-W}\sum \limits _{t=W+1}^{T-h}H_{2}(t)^{\prime }B_{2}(t)z_{T,t}z_{T,t}^{\prime }B_{2}(t)H_{2}(t), \end{aligned}$$

(35)

are $o_{p}(1/W)$ when the data are generated from model 1 (case 1) and are $ o_{p}(1)$ when the data are generated from model 2 (case 2), the proof of Theorem 2 takes exactly the same steps as the proof of Theorem 1. Thus, it remains to show that (A.13)–(A.16) are $ o_{p}(W^{-1})$ in case 1 and $o_{p}(1)$ in case 2. Note that the bias of the rolling regression estimator can be written as:

$$\begin{aligned} \hat{\beta }_{W,t}-\beta \left(\frac{t}{T}\right)&= B_{2}(t)\frac{1}{W_{h}} \sum _{s=t-W}^{t-h}z_{s} \left[v_{s+h}+z_{s}^{\prime }\left(\beta \left(\frac{s }{T}\right)-\beta \left(\frac{t}{T}\right)\right) \right] \nonumber \\&= B_{2}(t)H_{2}(t)+\frac{B_{2}(t)}{W_{h}}\sum _{s=t-W}^{t-h}z_{s}z_{s}^{ \prime }\left(\beta \left(\frac{s}{T}\right)-\beta \left(\frac{t}{T} \right)\right) \end{aligned}$$

(36)

Thus, the difference (A.15) is

$$\begin{aligned}&\frac{1}{T-h-W}\sum _{t=W+1}^{T-h}v_{T,t+h}z_{T,t}^{\prime }B_{2}(t) \frac{1}{W_{h}}\sum _{s=t-W}^{t-h}z_{s}z_{s}^{\prime }\left(\beta \left(\frac{s }{T}\right)-\beta \left(\frac{t}{T}\right)\right). \nonumber \\&= \frac{1}{T-h-W}\sum _{t=W+1}^{T-h}v_{T,t+h}z_{T,t}^{\prime }\bar{B} _{2}\left(\frac{t}{T}\right) \frac{1}{W_{h}}\sum _{s=t-W}^{t-h}z_{s}z_{s}^{ \prime }\left(\beta \left(\frac{s}{T}\right)-\beta \left(\frac{t}{T} \right)\right) \nonumber \\&\quad +\frac{1}{T-h-W}\sum _{t=W+1}^{T-h}v_{T,t+h}z_{T,t}^{\prime }\left(B_{2}(t)-\bar{B}_{2}\left(\frac{t}{T}\right)\right) \frac{1}{W_{h}} \nonumber \\&\quad \times \sum _{s=t-W}^{t-h}z_{s}z_{s}^{\prime } \left(\beta \left(\frac{s}{T} \right)-\beta \left(\frac{t}{T}\right)\right). \end{aligned}$$

(37)

By Assumption 4(c), the summands have zero mean. By Hölder’s inequality and Assumptions 4(b)(c)(e)(f), the second moments of the right-hand side terms are $O(W/T^{2})$. By Chebyshev’s inequality, (A.15) is $ O_{p}(W^{1/2}/T)$ which is $o_{p}(1/W)$ by Assumption 3. It can be shown that (A.13) is also $o_{p}(1/W)$ in a similar fashion.

The difference (A.16) is the sum of the following three terms:

$$\begin{aligned} \frac{1}{T-h-W}\sum _{t=W+1}^{T-h} v_{T,t+h}z_{T,t}z_{T,t}^{\prime }B_{2}(t)\frac{1}{W_{h}}\sum _{s=t-W}^{t-h}z_{s}z_{s}^{\prime }\left(\beta \left(\frac{s}{T}\right)-\beta \left(\frac{t}{T}\right)\right), \end{aligned}$$

(38)

$$\begin{aligned} \frac{1}{T-h-W}\sum _{t=W+1}^{T-h} \left(\beta \left(\frac{s}{T} \right)-\beta \left(\frac{t}{T}\right)\right)^{\prime }\frac{1}{W_{h}} \sum _{s=t-W}^{t-h}z_{s}z_{s}^{\prime }B_{2}(t)z_{T,t}z_{T,t}^{\prime }v_{T,t+h}, \end{aligned}$$

(39)

$$\begin{aligned}&\frac{1}{T-h-W}\sum _{t=W+1}^{T-h} \left(\beta \left(\frac{s}{T} \right)-\beta \left(\frac{t}{T}\right)\right)^{\prime }\frac{1}{W_{h}} \sum _{s=t-W}^{t-h}z_{s}z_{s}^{\prime }B_{2}(t)z_{T,t} \nonumber \\&\quad \times z_{T,t}^{\prime }B_{2}(t)\frac{1}{W_{h}} \sum _{s=t-W}^{t-h}z_{s}z_{s}^{\prime }\left(\beta \left(\frac{s}{T} \right)-\beta \left(\frac{t}{T}\right)\right), \end{aligned}$$

(40)

Using Chebyshev’s inequality, Hölder’s inequality, Assumptions 3 and 4(b)(c)(e)(f), it can be shown that (A.19), (A.20), and (A.21) are $O_{p}(W^{1/2}T^{-2})$, $ O_{p}(W^{1/2}T^{-2})$ and $O_{p}(W^{2}T^{-2})$ all of which are $ o_{p}(W^{-1})$. It can be shown that (A.14) is also $ o_{p}(1/W) $ when $\gamma (\cdot )=0$ in an analogous fashion.

The rest of the proof of Theorem 2 takes exactly the same steps as the proof of Theorem 1 except that $\alpha ^{*}$, $\beta $, $ B_{i}$, $u_{t}$, $v_{t}$, $x_{t}$, $y_{t}$, $z_{t}$ and Lemma 1 is replaced by $\alpha \left(\frac{t}{T}\right)$, $\beta \left(\frac{t}{T}\right)$, $\bar{B }_{i}\left(\frac{t}{T}\right)$, $u_{Tt}$, $v_{Tt}$, $x_{Tt}$, $y_{Tt}$, $ z_{Tt}$ and Lemma 2, respectively.

Rights and permissions

Reprints and permissions

Copyright information

About this chapter

Cite this chapter

Inoue, A., Rossi, B., Jin, L. (2013). Consistent Model Selection: Over Rolling Windows. In: Chen, X., Swanson, N. (eds) Recent Advances and Future Directions in Causality, Prediction, and Specification Analysis. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-1653-1_12

Download citation

DOI: https://doi.org/10.1007/978-1-4614-1653-1_12
Published: 01 August 2012
Publisher Name: Springer, New York, NY
Print ISBN: 978-1-4614-1652-4
Online ISBN: 978-1-4614-1653-1
eBook Packages: Business and EconomicsEconomics and Finance (R0)

Publish with us

Policies and ethics

Consistent Model Selection: Over Rolling Windows

Abstract

Similar content being viewed by others

Forecasting Under Structural Change

Horizon confidence sets

Long-term prediction intervals of economic time series

Keywords

1 Introduction

2 Asymptotic Theory

2.1 Consistency of the Rolling-Window PMSE Criterion When Parameters are Constant

Assumption 1

Assumption 2

Remarks

Theorem 1

2.2 Consistency of the Rolling-Window PMSE Criterion When Parameters are Time Varying

Assumption 3

Assumption 4

Remarks

Theorem 2

Remarks

3 Monte Carlo Evidence

3.1 Simulation 1: DGP2 in Clark and McCracken (2005)

3.2 Simulation 2:Autoregressive DGP With/Without a Time-Varying Parameter

4 Empirical Application

5 Concluding Remarks

Notes

References

Author information

Authors and Affiliations

Corresponding author

Editor information

Editors and Affiliations

Appendix

Appendix

1.1 A.1 Lemmas

Lemma 1

Lemma 2.

1.2 A.2 Proofs of Theorems

Rights and permissions

Copyright information

About this chapter

Cite this chapter

Download citation

Share this chapter

Publish with us

Search

Navigation