1 Introduction

Regression models with mixed sampling frequencies have gained increasing popularity in modeling economic activity since they were introduced in different contexts by Ghysels et al. (2004, 2006), Ghysels and Wright (2009), Galvao (2013), Miller (2014) and Ghysels and Miller (2015), among others. Although time series are often sampled at different frequencies in applications, the typical practice in estimating econometric models is to aggregate all higher-frequency variables to the same (low) frequency using an equal weighting scheme. As illustrated by Andreou et al. (2010), decaying weights may be a more appropriate aggregation scheme for most time series data, and an equal weighting scheme will often yield inefficient and/or biased estimates. Mixed data sampling (MIDAS) regression models aim to extract information from high-frequency variables to improve the efficiency of estimators and/or enhance the forecast accuracy of target variables observed at a lower frequency. However, a threshold-type MIDAS model is still unavailable, even though threshold models can capture a rich set of stylized facts in economics, such as multiple states, asymmetries and cyclical effects (e.g., Hansen 2000; Chen 2015). The main purpose of this paper is to fill this gap in the literature.

In this paper, we contribute to the threshold and MIDAS literature by introducing a new model called threshold mixed data sampling (TMIDAS) regression, in which a threshold effect captures nonlinearity in the relationship between the dependent and explanatory variables, and the explanatory variables are sampled at a higher frequency than the dependent variable. Specifically, the TMIDAS model can be treated as an extension of the classical threshold regression model investigated by Hansen (2000) that allows for mixed data sampling. Hence, the TMIDAS model not only retains the important advantage of the classical threshold model in capturing nonlinear effects that widely exist in different fields of economics, but also enjoys the efficiency gains from extracting information from high-frequency variables. Building on the literature on MIDAS and threshold models, we develop an estimation procedure for the model and propose test statistics for the threshold effect and the equal weighting scheme. For estimation, we suggest a two-step procedure: first estimate the MIDAS model by nonlinear least squares (NLS) for any given threshold value, and then estimate the threshold parameter by a grid search, which is widely used in the threshold literature. For specification testing, we propose a test statistic for the presence of a threshold effect, and a test statistic for the null hypothesis of equal weights (simple average) in aggregating higher-frequency time series data before estimating econometric models. Moreover, we conduct Monte Carlo simulations to examine the finite-sample performance of the estimation and testing procedures. Our simulation results indicate that the estimation procedure works well in finite samples, and that the test statistics are correctly sized and have good power properties.

To illustrate the usefulness of TMIDAS, we also conduct Monte Carlo experiments to compare the out-of-sample forecasting performance of the TMIDAS model with that of the Markov-switching MIDAS (MS-MIDAS) model proposed by Guérin and Marcellino (2013) and of MIDAS models under different data generating processes (DGPs). The simulations indicate that the proposed TMIDAS model has the best forecasting performance when there is a threshold effect in the true model. The proposed TMIDAS model is then applied to investigate the presence and pattern of cyclical bias in quarterly GDP forecast errors. The empirical results indicate that there is an overestimation (underestimation) bias during periods of relatively good (bad) state, which implies that the GDP forecast errors are predictable. We then compare the out-of-sample performance of the TMIDAS model with that of the MS-MIDAS and MIDAS models for GDP forecast errors. The results imply that TMIDAS models clearly outperform the MS-MIDAS and MIDAS models in this application. Both simulation and empirical results demonstrate the usefulness of TMIDAS.

The remainder of this paper is organized as follows: Sect. 2 introduces the threshold mixed data sampling (TMIDAS) model, develops the estimation method of the model parameters and constructs tests for threshold effect and equal weights. Section 3 presents Monte Carlo experiments evaluating the finite-sample properties of the estimation procedure and the test statistics, and assessing out-of-sample forecasting performance of TMIDAS models. Section 4 provides an application and Sect. 5 concludes.

2 Threshold mixed data sampling model

Consider the mixed data sampling process \(\{y_t, {\mathbf{z}}_t, {\mathbf{x}}^{(m)}_{t/m}, q_t\}\), where \(y_t, {\mathbf{z}}_t\) and \(q_t\) are observed at \(t=1,2,...,T\) and the index t denotes the low frequency; \({\mathbf{x}}^{(m)}_{t/m}=(x^{(m)}_{1,t},...,x^{(m)}_{p,t})'\) is a p-dimensional vector of higher-frequency data, and the superscript m indicates that the series are observed at most m times between t and \(t-1\). The threshold mixed data sampling (TMIDAS) model in this paper is given by (see Footnote 1)

$$\begin{aligned} y_t=\left\{ \begin{matrix} {\varvec{\alpha }}'_1 {\mathbf{z}}_t+{{\varvec{\beta }}}'_1 {\mathbf{x}}_{t}({\varvec{\theta }})+e_t,\quad \text{ if } \;\; q_t\le \gamma \\ {\varvec{\alpha }}'_2 {\mathbf{z}}_t+{{\varvec{\beta }}}'_2 {\mathbf{x}}_{t}({\varvec{\theta }})+e_t,\quad \text{ if }\; \;q_t> \gamma \end{matrix}\right. ,\;t=1,2,...,T, \end{aligned}$$
(1)

where \(y_t, {\mathbf{z}}_t, {\mathbf{x}}_{t}({\varvec{\theta }})\) and \(q_t\) are assumed to be weakly dependent, \(q_t\) is the threshold variable and is used to split the sample into two subgroups, the random variable \(e_t\) is a regression disturbance, and \(\gamma \) is the threshold parameter. \({\mathbf{x}}_{t}({\varvec{\theta }})=(x^{(m)}_{1,t}({{\varvec{\theta }}}_1), x^{(m)}_{2,t}({{\varvec{\theta }}}_2),...,x^{(m)}_{p,t}({{\varvec{\theta }}}_p))'\) is a nonlinear function mapping the higher-frequency data into a low frequency such that

$$\begin{aligned} x^{(m)}_{k,t}({{\varvec{\theta }}}_k)=W(L^{1/m},{{\varvec{\theta }}}_k)x^{(m)}_{k,t}=\sum \limits _{j = 1}^J {w_{j,k}({{\varvec{\theta }}}_k)L^{j/m}x^{(m)}_{k,t}},\;{\text{ for }}\; k=1,2,...,p, \end{aligned}$$

in which \(L^{j/m}\) is the high-frequency lag operator such that \(L^{j/m} x^{(m)}_{k,t}=x^{(m)}_{k,t-(j/m)}\), and J is the number of high-frequency lags used in the temporal aggregation of \({\mathbf{x}}^{(m)}_{t/m}\), with \(J\ge m\). To identify the slope parameters \({{\varvec{\beta }}}_1\) and \({{\varvec{\beta }}}_2\), we assume that \(0<w_{j,k}({{\varvec{\theta }}}_k)<1\) and \(\sum \nolimits _{j=1}^J{w_{j,k}({{\varvec{\theta }}}_k)}=1\).
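The temporal aggregation above can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: the function name `midas_aggregate` and the array alignment (lag j of low-frequency period t stored at 0-based position m*t - j of a series of length mT) are assumptions made for the example.

```python
import numpy as np

def midas_aggregate(x_hf, weights, m):
    """Weighted aggregation of a high-frequency series to low frequency.

    Assumed alignment (hypothetical, not stated in the text): x_hf has
    length m*T, and lag j of low-frequency period t (t = 1..T) sits at
    0-based position m*t - j, so j = 1 is the latest observation of period t.
    """
    x_hf = np.asarray(x_hf, dtype=float)
    w = np.asarray(weights, dtype=float)
    J = w.size                          # number of high-frequency lags
    T = x_hf.size // m
    out = np.full(T, np.nan)            # NaN where fewer than J lags exist
    for t in range(1, T + 1):
        start = m * t - J               # position of the oldest lag (j = J)
        if start >= 0:
            lags = x_hf[start:m * t][::-1]   # lags[0] corresponds to j = 1
            out[t - 1] = w @ lags
    return out

x_hf = np.arange(1.0, 13.0)             # 12 monthly values, T = 4 quarters
x_lf = midas_aggregate(x_hf, [0.5, 0.3, 0.2], m=3)
x_avg = midas_aggregate(x_hf, [1 / 3, 1 / 3, 1 / 3], m=3)
```

With weights that sum to one, each low-frequency value is a weighted average of the J most recent high-frequency observations; equal weights reproduce the simple average.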

One of the key features of MIDAS models is that the lag coefficients are parameterized in a parsimonious way (e.g., Ghysels and Qian 2019). In the proposed model, it is suitable to employ the commonly used parsimonious polynomial specifications, including step functions, the beta polynomial and the Almon lag polynomial; see Ghysels et al. (2007) for a detailed discussion. As illustrated in the MIDAS literature (see, e.g., Ghysels et al. 2007; Andreou et al. 2010), a popular choice for the weighting scheme is the two-parameter exponential Almon lag polynomial, which is flexible enough to mimic different weighting shapes for the lag coefficients (see Footnote 2):

$$\begin{aligned} w_{j,k}({{\varvec{\theta }}}_k)=w_{j,k}(\theta _{k,1}, \theta _{k,2})=\frac{\exp (\theta _{k,1} j+\theta _{k,2} j^2)}{\sum ^m_{j=1}\exp (\theta _{k,1} j+\theta _{k,2} j^2)}. \end{aligned}$$
(2)
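To make the shape of the weights in (2) concrete, the following sketch evaluates the two-parameter exponential Almon polynomial for a given number of lags. The function name `exp_almon_weights` is hypothetical, and subtracting the maximum exponent is only a numerical safeguard that leaves the normalized weights unchanged.

```python
import numpy as np

def exp_almon_weights(theta1, theta2, n_lags):
    """Two-parameter exponential Almon weights for lags j = 1..n_lags."""
    j = np.arange(1, n_lags + 1)
    expo = theta1 * j + theta2 * j**2
    expo = expo - expo.max()            # avoids overflow; ratios are unchanged
    w = np.exp(expo)
    return w / w.sum()                  # weights are positive and sum to one

flat = exp_almon_weights(0.0, 0.0, 3)       # equal weights
decay = exp_almon_weights(-1.0, -1.0, 3)    # weights decline with the lag
extreme = exp_almon_weights(-5.0, -5.0, 3)  # almost all weight on lag 1
```

Setting both parameters to zero gives equal weights, while negative values concentrate the weight on the most recent lags.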

It is worth noting that the Almon lag parameters are not identified when \({{\varvec{\beta }}}_1={{\varvec{\beta }}}_2={\mathbf{0}}\). This is the well-known Davies’s problem (see, e.g., Davies 1987), which was addressed by Ghysels et al. (2007). To keep the focus, we do not discuss this issue further.

2.1 Model estimation

For ease of manipulation, we express the model defined in (1) and (2) in a more compact form. Define \({{\varvec{\alpha }}}=[{\varvec{\alpha }}'_2,{\varvec{\alpha }}'_1-{\varvec{\alpha }}'_2]'\), \({\mathbf{z}}_t(\gamma )=[{\mathbf{z}}'_t , {\mathbf{z}}'_t\{q_t\le \gamma \}]'\), \({{\varvec{\beta }}}=[{\varvec{\beta }}'_2,{\varvec{\beta }}'_1-{\varvec{\beta }}'_2]'\) and \({\mathbf{x}}_t(\varvec{\theta }, \gamma )=[{\mathbf{x}}'_t(\varvec{\theta }) , {\mathbf{x}}'_t (\varvec{\theta })\{q_t\le \gamma \}]'\), where \(\{.\}\) is the indicator function. Then the model defined in (1) and (2) can be rewritten as:

$$\begin{aligned} y_t={\varvec{\alpha }}' {\mathbf{z}}_t(\gamma )+{{\varvec{\beta }}}'{\mathbf{x}}_t(\varvec{\theta }, \gamma )+e_t. \end{aligned}$$
(3)
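The equivalence between the two-regime form (1) and the compact form (3) can be checked numerically. The sketch below is illustrative only: the helper `compact_regressors` and all parameter values are assumptions, and the aggregated regressor is drawn directly instead of being built from high-frequency data.

```python
import numpy as np

def compact_regressors(z, x_agg, q, gamma):
    """Stack [z_t, z_t*1{q_t<=gamma}, x_t, x_t*1{q_t<=gamma}] row by row."""
    q = np.asarray(q, dtype=float)
    z = np.asarray(z, dtype=float).reshape(q.size, -1)
    x = np.asarray(x_agg, dtype=float).reshape(q.size, -1)
    ind = (q <= gamma).astype(float)[:, None]    # indicator {q_t <= gamma}
    return np.hstack([z, z * ind, x, x * ind])

rng = np.random.default_rng(0)
T, gamma = 50, 0.0                       # illustrative values
z = np.ones(T)                           # intercept only
x = rng.standard_normal(T)               # stands in for x_t(theta)
q = rng.standard_normal(T)
alpha1, beta1 = np.array([1.0]), np.array([1.0])
alpha2, beta2 = np.array([2.0]), np.array([2.0])

# Coefficient ordering of the compact form: regime-2 values, then differences.
coef = np.concatenate([alpha2, alpha1 - alpha2, beta2, beta1 - beta2])
y_compact = compact_regressors(z, x, q, gamma) @ coef
y_regime = np.where(q <= gamma, alpha1[0] + beta1[0] * x, alpha2[0] + beta2[0] * x)
```

The two noise-free fitted paths coincide, which confirms that the stacked coefficients are the regime-2 parameters followed by the regime differences.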

We first estimate the MIDAS model by nonlinear least squares (NLS) for any given threshold value. For any fixed \(\gamma \), the model in (3) simplifies to a MIDAS model, so, following the MIDAS literature, we consider the NLS estimator. Define the objective function

$$\begin{aligned} \hbox {SSR}_T({\varvec{\alpha }}, {\varvec{\beta }}, \varvec{\theta }, \gamma )=\sum _{t=1}^T { e}_t^2({\varvec{\alpha }}, {\varvec{\beta }},\varvec{\theta }, \gamma )=\sum _{t=1}^T \left( y_t-{\varvec{\alpha }}' {\mathbf{z}}_t(\gamma )-{ {{\varvec{\beta }}}'}{\mathbf{x}}_t({\varvec{\theta }}, \gamma )\right) ^2. \end{aligned}$$
(4)

We assume \(\gamma \in \Gamma \) and \(({\varvec{\alpha }}, {\varvec{\beta }}, {\varvec{\theta }})\in {\varvec{\Theta }}\), where the parameter spaces \(\Gamma \) and \({\varvec{\Theta }}\) are bounded sets. The NLS estimator is then given by

$$\begin{aligned} ({\hat{\varvec{\alpha }}}(\gamma ), {\hat{\varvec{\beta }}}(\gamma ), \hat{{\varvec{\theta }}}(\gamma ))=\mathop {\arg \min }_{{\varvec{\alpha }}, {\varvec{\beta }}, {{\varvec{\theta }}\in {\varvec{\Theta }}}} \hbox {SSR}_T({\varvec{\alpha }}, {\varvec{\beta }}, {\varvec{\theta }}, \gamma ). \end{aligned}$$
(5)
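For a fixed threshold, the minimization in (5) can be sketched by concentrating out the linear coefficients with OLS for each candidate value of the weight parameters. In the sketch below, the coarse grid over the weight parameters is an assumption that stands in for a numerical optimizer, and the names `almon` and `nls_fixed_gamma` and the simulated inputs are hypothetical.

```python
import numpy as np

def almon(theta, n_lags):
    """Exponential Almon weights for lags 1..n_lags."""
    j = np.arange(1, n_lags + 1)
    expo = theta[0] * j + theta[1] * j**2
    w = np.exp(expo - expo.max())
    return w / w.sum()

def nls_fixed_gamma(y, Z, X_lags, q, gamma, theta_grid):
    """Return (SSR, coefficients, theta) minimizing the SSR for a given gamma.

    X_lags is a (T, J) matrix whose column j-1 holds lag j of the
    high-frequency regressor. theta_grid is an assumed stand-in for a
    numerical optimizer over the weight parameters.
    """
    ind = (np.asarray(q) <= gamma).astype(float)[:, None]
    best = None
    for theta in theta_grid:
        x = (X_lags @ almon(theta, X_lags.shape[1]))[:, None]
        D = np.hstack([Z, Z * ind, x, x * ind])       # compact-form regressors
        coef, *_ = np.linalg.lstsq(D, y, rcond=None)  # OLS given theta
        ssr = float(np.sum((y - D @ coef) ** 2))
        if best is None or ssr < best[0]:
            best = (ssr, coef, theta)
    return best

rng = np.random.default_rng(0)
T = 300
X_lags = rng.standard_normal((T, 3))
q = rng.standard_normal(T)
Z = np.ones((T, 1))
x_true = X_lags @ almon((-1.0, -1.0), 3)
y = np.where(q <= 0.0, 1.0 + 1.0 * x_true, 2.0 + 2.0 * x_true)
y = y + 0.01 * rng.standard_normal(T)
grid = [(a, b) for a in (-1.0, 0.0, 1.0) for b in (-1.0, 0.0)]
ssr, coef, theta_hat = nls_fixed_gamma(y, Z, X_lags, q, 0.0, grid)
```

The returned coefficients follow the ordering of the compact form: regime-2 intercept and slope first, then the regime differences.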

We then estimate the threshold parameter using a combination of concentration and grid search, which is widely used in the threshold literature (e.g., Hansen 2000). Denote the residuals by \({\hat{e}}_t(\gamma )=y_t-{\hat{\varvec{\alpha }}}' (\gamma ){\mathbf{z}}_t(\gamma )-{\hat{\varvec{\beta }}}'(\gamma ){\mathbf{x}}_t(\hat{{\varvec{\theta }}}(\gamma ), \gamma )\); the concentrated sum of squared errors is then \(S_T(\gamma )\equiv \hbox {SSR}_T({\hat{\varvec{\alpha }}} (\gamma ), {\hat{\varvec{\beta }}}(\gamma ),{\hat{{\varvec{\theta }}}}(\gamma ),\gamma )=\sum \limits _{t=1}^T{\hat{e}}_t^2(\gamma )\). The threshold parameter \(\gamma \) is estimated as

$$\begin{aligned} {\hat{\gamma }}=\mathop {\arg \min }_{{\gamma \in \Gamma }} S_T(\gamma ). \end{aligned}$$
(6)

Once \({{\hat{\gamma }}}\) is obtained, the other parameters are estimated as \({\hat{\varvec{\alpha }}}={\hat{\varvec{\alpha }}}({{\hat{\gamma }}})\), \({\hat{\varvec{\beta }}}={\hat{\varvec{\beta }}}({{\hat{\gamma }}})\), \(\hat{{\varvec{\theta }}}=\hat{{\varvec{\theta }}}({{\hat{\gamma }}})\). In practice, following Hansen (2000), a grid search is used to estimate the threshold parameter in (6). We can divide the parameter space \(\Gamma \) into N quantiles and let \(\Gamma _N=\{q_1,q_2,...,q_N\}\). Then the estimator \({\hat{\gamma }}_N=\mathop {\arg \min }_{{\gamma \in \Gamma _N}} S_T(\gamma )\) is a good approximation to \({{\hat{\gamma }}}\) when N is sufficiently large. It is also often undesirable to select a threshold \({{\hat{\gamma }}}\) that sorts too few observations into one regime or the other. As suggested by Hansen (1999), this possibility can be excluded by restricting the grid search over \(\gamma \) so that a minimal percentage of the observations (say, 10% or 15%) lies in each regime.
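The trimmed grid search can be sketched as follows. This is a simplified illustration under stated assumptions: the candidate thresholds are taken as empirical quantiles of the threshold variable, the inner estimation step is abstracted into a user-supplied function `ssr_fn` (hypothetical), and the toy example replaces the MIDAS regression with a model that has only regime-specific means.

```python
import numpy as np

def threshold_grid(q, n_grid=100, trim=0.15):
    """Candidate thresholds: quantiles of q between trim and 1 - trim."""
    probs = np.linspace(trim, 1.0 - trim, n_grid)
    return np.unique(np.quantile(q, probs))

def grid_search_threshold(q, ssr_fn, n_grid=100, trim=0.15):
    """Minimize the concentrated SSR, ssr_fn(gamma), over the trimmed grid."""
    grid = threshold_grid(q, n_grid, trim)
    ssr = np.array([ssr_fn(g) for g in grid])
    k = int(np.argmin(ssr))
    return grid[k], ssr[k]

# Toy example (assumed DGP): the mean of y shifts from 1 to 2 at q = 0.3.
rng = np.random.default_rng(1)
q = rng.standard_normal(400)
y = np.where(q <= 0.3, 1.0, 2.0) + 0.1 * rng.standard_normal(400)

def ssr_two_means(gamma):
    """Concentrated SSR when each regime is fitted by its own mean."""
    low = q <= gamma
    return float(np.sum((y[low] - y[low].mean()) ** 2)
                 + np.sum((y[~low] - y[~low].mean()) ** 2))

gamma_hat, s_min = grid_search_threshold(q, ssr_two_means)
```

In the TMIDAS setting, `ssr_fn` would return the NLS sum of squared errors for the given threshold; the trimming keeps at least roughly 15% of the observations in each regime.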

It is well known in the threshold literature that the estimated threshold \({{\hat{\gamma }}}\) is super-consistent and asymptotically independent of the other parameter estimates (e.g., Chan 1993), and that the distributions of the slope coefficients can be approximated as if \(\gamma \) were known with certainty (e.g., Chan 1993; Hansen 2000). Andreou et al. (2010) derived the asymptotic distribution of the MIDAS-NLS estimator. Therefore, this paper does not study the asymptotic properties of the estimator and instead examines its finite-sample properties through Monte Carlo simulations.

To construct confidence intervals for the model parameters, we invert the following statistic for the null \(H_0: \gamma = \gamma ^0\) given by

$$\begin{aligned} \hbox {LR}_T({ \gamma })= \frac{S_T({\gamma })-S_T({\hat{\gamma }})}{S_T({\hat{\gamma }})/T}. \end{aligned}$$

The null hypothesis is rejected for large values of \(\hbox {LR}_T({\gamma }^0)\). Following the literature (see, e.g., Hansen 1996, 1999; Chen 2015), we suggest computing the confidence intervals using a wild bootstrap procedure that conditions on the values of the explanatory variables.

Algorithm 1. Confidence Intervals for Parameters

Step 1 For t = 1,2,..., T, denote by \({\hat{e}}_t\) the residuals from the fitted TMIDAS model (3). Treat \(\{{\hat{e}}_{1}, {\hat{e}}_{2},..., {\hat{e}}_{T}\}\) as the empirical distribution to be used for bootstrapping. Draw (with replacement) random variables \(\{{\hat{e}}^*_{1}, {\hat{e}}^*_{2}, ...,{\hat{e}}^*_{T}\}\) from this empirical distribution.

Step 2 Set \(y^*_t={\hat{\varvec{\alpha }}}' {\mathbf{z}}_t({{\hat{\gamma }}})+{\hat{\varvec{\beta }}}'{\mathbf{x}}_t(\hat{\varvec{\theta }}, {{\hat{\gamma }}})+{\hat{e}}^*_t\), where \(({\hat{\varvec{\alpha }}}', {\hat{\varvec{\beta }}}', \hat{\varvec{\theta }}', {{\hat{\gamma }}})\) are estimates based on the original sample \(\{y_t, {\mathbf{z}}_t, {\mathbf{x}}^{(m)}_{t/m}, q_t\}\).

Step 3 Using the bootstrap sample \(\{y^*_t, {\mathbf{z}}_t, {\mathbf{x}}^{(m)}_{t/m}, q_t\}\), estimate the TMIDAS model and obtain the parameter estimates \(({\hat{\varvec{\alpha }}}^{*'}, {\hat{\varvec{\beta }}}^{*'}, \hat{\varvec{\theta }}^{*'}, {\hat{\gamma }}^*)\), and the sum of squared errors \(S^*_T( {{\hat{\gamma }}}^*)\).

Step 4 Compute the statistic for \({{\hat{\gamma }}}\)

$$\begin{aligned} \hbox {LR}^*_T({\hat{\gamma }})= \frac{S^*_T({\hat{\gamma }})-S^*_T( {\hat{\gamma }}^*)}{S^*_T( {\hat{\gamma }}^*)/T}, \end{aligned}$$

where \(S^*_T({\hat{\gamma }})= \sum \limits _{t=1}^T \left( y^*_t-{\hat{\varvec{\alpha }}}^{*'} {\mathbf{z}}_t({{\hat{\gamma }}})-{\hat{\varvec{\beta }}}^{*'}{\mathbf{x}}_t(\hat{\varvec{\theta }}^{*}, {{\hat{\gamma }}})\right) ^2\).

Step 5 Repeat Steps 1-4 B times and obtain a sample of simulated coefficient estimates \(({\hat{\varvec{\alpha }}}^{*'}, {\hat{\varvec{\beta }}}^{*'}, \hat{\varvec{\theta }}^{*'}, {\hat{\gamma }}^*)\) and a sample of \({\hbox {LR}^*_T({\hat{\gamma }})}\). Construct \(1-a\) bootstrap confidence intervals for the estimates \(({\hat{\varvec{\alpha }}}', {\hat{\varvec{\beta }}}', \hat{\varvec{\theta }}', {{\hat{\gamma }}})\) by the symmetric percentile method: the estimates plus and minus the \((1-a)\) quantile of the absolute centered bootstrap estimates. For example, the confidence interval of \({{\hat{\gamma }}}\) is \({\hat{\gamma }}\pm q^*_{1-a}\), where \(q^*_{1-a}\) is the \(1-a\) quantile of \(|{\hat{\gamma }}^*-{{\hat{\gamma }}}|\).
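The interval construction in Step 5 can be sketched as below, assuming the B bootstrap estimates from Steps 1 to 4 are already available as an array. The function name `symmetric_percentile_ci` and the illustrative bootstrap draws are hypothetical.

```python
import numpy as np

def symmetric_percentile_ci(estimate, boot_estimates, a=0.05):
    """Estimate plus/minus the (1 - a) quantile of |bootstrap - estimate|."""
    dev = np.abs(np.asarray(boot_estimates, dtype=float) - estimate)
    half_width = np.quantile(dev, 1.0 - a)
    return estimate - half_width, estimate + half_width

# Illustrative bootstrap draws spread evenly around a point estimate of 1.0.
boot = 1.0 + np.linspace(-1.0, 1.0, 201)
lower, upper = symmetric_percentile_ci(1.0, boot, a=0.05)
```

The same function applies to each element of the coefficient vector as well as to the threshold estimate.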

2.2 Model specification testing

In estimating a TMIDAS regression model, the applied researcher may be interested in investigating whether the TMIDAS model is significantly different from the MIDAS model, and whether the simple aggregation using equal weights in threshold models is supported by empirical data. In this section, we construct test statistics for threshold effect and the equal weighting scheme.

To test for the existence of the threshold effect in MIDAS models, we consider the null hypothesis \(H_0: {\varvec{\alpha }}_1={\varvec{\alpha }}_2, {\varvec{\beta }}_1={\varvec{\beta }}_2\). Under this null, the TMIDAS model reduces to the MIDAS model given by:

$$\begin{aligned} y_t= {\varvec{\alpha }}'_1 {\mathbf{z}}_t+{{\varvec{\beta }}}'_1 {\mathbf{x}}_{t}({\varvec{\theta }})+e_t,\;t=1,2,...,T. \end{aligned}$$
(7)

Here, the threshold \(\gamma \) is not identified under the null of linearity; hence, the null distribution of the test statistic is non-standard owing to the well-known Davies’s problem, and the limiting distribution is typically obtained by taking the supremum over all possible values of the unidentified parameter (see, e.g., Davies 1987; Hansen 1996). Denote the sum of squared errors of the MIDAS model (7) by \(S^{L}_0({\hat{{\varvec{\theta }}}})\) and that of the proposed TMIDAS model by \(S_T({\hat{{\varvec{\theta }}}}(\gamma ),\gamma )\equiv \hbox {SSR}_T({\hat{\varvec{\alpha }}} (\gamma ), {\hat{\varvec{\beta }}}(\gamma ),{\hat{{\varvec{\theta }}}}(\gamma ),\gamma )\), evaluated at the NLS estimator in (5). Then, a natural test for the null hypothesis of the MIDAS model against the TMIDAS model can be defined as follows:

$$\begin{aligned} \hbox {LR}_1=\sup _{\gamma \in \Gamma } \frac{S^{L}_0({\hat{{\varvec{\theta }}}})-S_T({\hat{{\varvec{\theta }}}}(\gamma ),\gamma )}{S_T({\hat{{\varvec{\theta }}}}(\gamma ),\gamma )/T}\equiv \frac{S^{L}_0({\hat{{\varvec{\theta }}}})-S_T({\hat{{\varvec{\theta }}}}({{\hat{\gamma }}}),{{\hat{\gamma }}})}{S_T({\hat{{\varvec{\theta }}}}({{\hat{\gamma }}}),{{\hat{\gamma }}})/T}. \end{aligned}$$
(8)

When the null hypothesis of no threshold effect is rejected, one can further examine whether or not the traditional approach using an equal weighting scheme in threshold models is supported by the empirical data. As shown by Andreou et al. (2010), aggregating the high-frequency data using equal weights (simple average) would generally lead to an inconsistent estimator; hence, it is important to test whether or not using equal weights is suitable in applications. To this end, consider the null \(H_0: {{\varvec{\theta }}}={\mathbf{0}}\) under which the weighting function in (2) becomes flat, leading to high-frequency data being aggregated using equal weights, and thus the TMIDAS model (1) becomes the usual threshold regression as in Hansen (2000):

$$\begin{aligned} y_t=\left\{ \begin{matrix} {\varvec{\alpha }}'_1 {\mathbf{z}}_t+{{\varvec{\beta }}}'_1 {\mathbf{x}}^*_{t}+e_t,\;\text{ if } \;\; q_t\le \gamma \\ {\varvec{\alpha }}'_2 {\mathbf{z}}_t+{{\varvec{\beta }}}'_2 {\mathbf{x}}^*_{t}+e_t,\;\text{ if } \; \;q_t> \gamma \end{matrix}\right. ,\;t=1,2,...,T, \end{aligned}$$
(9)

where \({\mathbf{x}}^*_t\) is taken as the simple average of the high-frequency data \({\mathbf{x}}^{(m)}_{t/m}\) over the periods between t and \(t-1\).
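To see why the null \({{\varvec{\theta }}}={\mathbf{0}}\) corresponds to equal weights, substitute \(\theta _{k,1}=\theta _{k,2}=0\) into (2). The following worked step assumes, as in the simulations and the application of this paper, that the number of lags equals m:

$$\begin{aligned} w_{j,k}(0,0)=\frac{\exp (0)}{\sum ^m_{i=1}\exp (0)}=\frac{1}{m},\qquad x^{(m)}_{k,t}({\mathbf{0}})=\frac{1}{m}\sum \limits _{j = 1}^m L^{j/m}x^{(m)}_{k,t}=x^*_{k,t}, \end{aligned}$$

so each high-frequency observation within the period receives the same weight and the aggregated regressor is the simple average.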

We can estimate the usual threshold model following Hansen (2000) and denote the sum of squared errors of the usual threshold regression as \(S^{TH}_0\). Then, a test statistic for the equal weighting scheme (the usual threshold model against the TMIDAS model) can be constructed as:

$$\begin{aligned} \hbox {LR}_2= \frac{S^{TH}_0-S_T({{\hat{\gamma }}})}{S_T({{\hat{\gamma }}})/T}. \end{aligned}$$
(10)

To implement the above test statistics, we propose a wild bootstrap algorithm following the classical threshold literature such as Hansen (1996, 1999, 2000, 2017). Hansen (1996) shows that the bootstrap approach produces asymptotically correct p values. The bootstrap procedure goes as follows.

Algorithm 2. Testing for threshold effect and the equal weighting scheme

Step 1 For \(t=1,2,...,T\), let \({\hat{e}}_{1t}\) be the residuals from the MIDAS model (7), and \({\hat{e}}_{2t}\) the residuals from the usual threshold model (9), in which the high-frequency data are aggregated using equal weights. Treat \(\{{\hat{e}}_{11}, {\hat{e}}_{12},..., {\hat{e}}_{1T}\}\) and \(\{{\hat{e}}_{21}, {\hat{e}}_{22},..., {\hat{e}}_{2T}\}\) as the empirical distributions to be used for bootstrapping. Draw (with replacement) random variables \(\{{\hat{e}}^*_{11}, {\hat{e}}^*_{12}, ...,{\hat{e}}^*_{1T}\}\) and \(\{{\hat{e}}^*_{21}, {\hat{e}}^*_{22}, ...,{\hat{e}}^*_{2T}\}\) from the empirical distributions.

Step 2 Set \( y^*_{1t}= {\hat{\varvec{\alpha }}}'_0 {\mathbf{z}}_t+{\hat{\varvec{\beta }}}'_0 {\mathbf{x}}_{t}(\hat{{\varvec{\theta }}}_0)+{\hat{e}}^*_{1t}\) and \(y^*_{2t}=\left\{ \begin{matrix} {\hat{\varvec{\alpha }}}'_1 {\mathbf{z}}_t+{\hat{\varvec{\beta }}}'_1 {\mathbf{x}}^*_{t}+{\hat{e}}^*_{2t},\;\text{ if } \;\; q_t\le {\hat{\gamma }}\\ {\hat{\varvec{\alpha }}}'_2 {\mathbf{z}}_t+{\hat{\varvec{\beta }}}'_2 {\mathbf{x}}^*_{t}+{\hat{e}}^*_{2t},\;\text{ if } \; \;q_t> {{\hat{\gamma }}} \end{matrix}\right. ,\;t=1,2,...,T\), where \(({\hat{\varvec{\alpha }}}'_0, {\hat{\varvec{\beta }}}'_0, \hat{\varvec{\theta }}'_0)\) and \(({\hat{\varvec{\alpha }}}'_1, {\hat{\varvec{\beta }}}'_1, {\hat{\varvec{\alpha }}}'_2, {\hat{\varvec{\beta }}}'_2, {{\hat{\gamma }}})\) are estimates based on the original sample \(\{y_t, {\mathbf{z}}_t, {\mathbf{x}}^{(m)}_{t/m}, q_t\}\).

Step 3 Using the bootstrap sample \(\{y^*_{1t}, y^*_{2t}, {\mathbf{z}}_t, {\mathbf{x}}^{(m)}_{t/m}, q_t\}\), estimate the MIDAS model (7), the usual threshold model (9) and the proposed TMIDAS model. Compute the bootstrap test statistics \(\hbox {LR}^*_{1}\) and \(\hbox {LR}^*_{2}\).

Step 4 Repeat Steps 1–3 B times, so as to obtain two samples \(\hbox {LR}^*_{1}(1),\hbox {LR}^*_{1}(2),...,\hbox {LR}^*_{1}(B)\), and \(\hbox {LR}^*_{2}(1),\hbox {LR}^*_{2}(2),...,\hbox {LR}^*_{2}(B)\) of simulated \(\hbox {LR}_{1}\) and \(\hbox {LR}_{2}\) statistics.

Step 5 Obtain the empirical p values as the percentage of the simulated statistics that exceed the actual value of the corresponding test statistic, with B chosen sufficiently large.
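The final two ingredients of the algorithm, the LR-type statistic and the empirical p value, can be sketched as follows. The function names and the numerical inputs are illustrative assumptions; the sums of squared errors would come from the fitted null and alternative models.

```python
import numpy as np

def lr_stat(ssr_null, ssr_alt, T):
    """LR-type statistic: (S_null - S_alt) / (S_alt / T)."""
    return (ssr_null - ssr_alt) / (ssr_alt / T)

def bootstrap_p_value(lr_actual, lr_boot):
    """Share of simulated statistics that exceed the actual statistic."""
    lr_boot = np.asarray(lr_boot, dtype=float)
    return float(np.mean(lr_boot > lr_actual))

lr = lr_stat(ssr_null=120.0, ssr_alt=100.0, T=50)    # illustrative inputs
p = bootstrap_p_value(lr, [1.0, 5.0, 12.0, 3.0, 20.0])
```

The null hypothesis is rejected at level a when the empirical p value falls below a.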

3 Monte Carlo simulations

In this section, we examine the finite sample performance of the estimation and model specification testing approaches proposed in Sect. 2. We also conduct Monte Carlo experiments to compare out-of-sample forecasting performance of the TMIDAS relative to the MS-MIDAS and MIDAS models.

The Monte Carlo design is based on the following DGP of the TMIDAS regression model:

$$\begin{aligned} y_t=\left\{ \begin{matrix} {\varvec{\alpha }}'_1 {\mathbf{z}}_t+ \beta _1 x_{t}({\varvec{\theta }})+e_t,\;\text{ if }\;\; q_t\le \gamma \\ {\varvec{\alpha }}'_2 {\mathbf{z}}_t+ \beta _2 x_{t}({\varvec{\theta }})+e_t,\;\text{ if }\; \;q_t> \gamma \end{matrix}\right. ,\;t=1,2,...,T, \end{aligned}$$
(11)

where \(x_{t}({\varvec{\theta }})=\sum \nolimits _{j = 1}^m {w_{j}({{\varvec{\theta }}})L^{j/m}x^{(m)}_{t}}\), \({\mathbf{z}}_t=(1, z_t)'\), \({x^{(m)}_{t/m}}\sim i.i.d.N(0,1)\), and \(z_t\), \(q_t\) and \(e_t\) follow i.i.d.N(0, 1) and are independent of each other. Similarly to Andreou et al. (2010), \(x^{(m)}_{t/m}\) is sampled m times between t and \(t-1\), so that the high-frequency sample size is mT, while \({\mathbf{z}}_t, q_t\) and \(e_t\) are sampled at a low frequency with a sample size T. Furthermore, the high-frequency data \({x^{(m)}_{t/m}}\) are projected onto the low-frequency data \(x_t({\varvec{\theta }})\) using the two-parameter exponential Almon lag polynomial given by

$$\begin{aligned} w_{j}({{\varvec{\theta }}})=w_{j}(\theta _{1}, \theta _{2})=\frac{\exp (\theta _{1} j+\theta _{2} j^2)}{\sum ^m_{j=1}\exp (\theta _{1} j+\theta _{2} j^2)}. \end{aligned}$$
(12)

in which we set \(m=3\), as in the case of a quarterly dependent variable and a monthly explanatory variable.
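A minimal sketch of drawing one sample from the DGP in (11) and (12) is given below. The function name `simulate_tmidas`, the seed handling and, in particular, the true threshold value (defaulting to 0 here) are assumptions made for illustration.

```python
import numpy as np

def simulate_tmidas(T, alpha1, alpha2, beta1, beta2, theta, gamma=0.0, m=3, seed=0):
    """Draw (y, Z, x_hf, q) from the two-regime DGP with Almon weights.

    gamma = 0.0 is an assumed true threshold, chosen for illustration only.
    """
    rng = np.random.default_rng(seed)
    j = np.arange(1, m + 1)
    w = np.exp(theta[0] * j + theta[1] * j**2)
    w /= w.sum()                                # exponential Almon weights
    x_hf = rng.standard_normal((T, m))          # row t: lags 1..m of period t
    x = x_hf @ w                                # aggregated regressor
    Z = np.column_stack([np.ones(T), rng.standard_normal(T)])
    q = rng.standard_normal(T)                  # threshold variable
    e = rng.standard_normal(T)                  # regression disturbance
    regime1 = Z @ np.asarray(alpha1, dtype=float) + beta1 * x
    regime2 = Z @ np.asarray(alpha2, dtype=float) + beta2 * x
    y = np.where(q <= gamma, regime1, regime2) + e
    return y, Z, x_hf, q

# Fast decaying weights with a large threshold effect.
y, Z, x_hf, q = simulate_tmidas(100, [1.0, 1.0], [2.0, 2.0], 1.0, 2.0, (-1.0, -1.0))
```

Repeating the draw with different seeds and re-estimating the model each time gives the Monte Carlo distribution of the estimates.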

Table 1 Estimates of the model parameters using the proposed estimation procedure

We first assess the finite-sample properties of the proposed estimation approach. In these simulations, we follow Miller (2014) and investigate four weighting settings: (a) flat weights, \({\varvec{\theta }}=(0, 0)\); (b) slow decaying weights, \({\varvec{\theta }}=(-0.5, 0.04)\); (c) fast decaying weights, \({\varvec{\theta }}=(-1, -1)\); and (d) extreme weights, \({\varvec{\theta }}=(-5, -5)\), which assign essentially unit weight to the first high-frequency variable and zero to the remaining high-frequency variables, a characteristic of selective sampling. We set \(({{\varvec{\alpha }}}'_1, \beta _1)=(\alpha _{11}, \alpha _{12}, \beta _1)=(1, 1, 1)\) and \(({{\varvec{\alpha }}}'_2, \beta _2)=(\alpha _{21}, \alpha _{22}, \beta _2)=(2, 2, 2)\), consider the four weighting settings described above, and run experiments on a range of sample sizes (\(T=100, 200, 500\)). Each experiment is replicated 1000 times to calculate the summary statistics (i.e., mean and standard deviation) of the parameter estimates. The simulation results are reported in Table 1. For all parameters, the mean of the estimates is close to the true value under every weighting scheme, and the associated standard deviation becomes smaller as the sample size T increases. These results indicate that the estimation approach works well in finite samples.

We next conduct simulations to evaluate the size and power properties of the test statistics for the threshold effect and the equal weighting scheme. In examining the finite-sample performance of the test for the threshold effect (the MIDAS model against the TMIDAS model), we set \(({{\varvec{\alpha }}}'_1, \beta _1)=(1, 1, 1)\), \({\varvec{\theta }}=(-1, -1)\). Then the rejection frequencies under the DGP in (11) with \(({{\varvec{\alpha }}}'_2, \beta _2)=(1, 1, 1)\) and \(({{\varvec{\alpha }}}'_2, \beta _2)=\{(1.3, 1.3, 1.3), (1.5, 1.5, 1.5), (2, 2, 2)\}\) (small, medium and large threshold effects) are the size and power of the proposed test for the threshold effect, respectively. Meanwhile, in evaluating the performance of the test for the equal weighting scheme (the usual threshold model against the TMIDAS model), we set \(({{\varvec{\alpha }}}'_1, \beta _1)=(1, 1, 1)\) and \(({{\varvec{\alpha }}}'_2, \beta _2)=(2, 2, 2)\). Then the rejection frequencies under the flat weighting setting with \({\varvec{\theta }}=(0, 0)\) and under the other three weighting settings (slow decaying weights, fast decaying weights and extreme weights) are the size and power of the proposed test for the equal weighting scheme, respectively. The simulation results are reported in Table 2, in which the size and power for each experiment are computed using 1000 replications and 100 bootstrap replications. The significance level is set at 5%. The simulation results indicate that the empirical sizes of the test statistics are close to 0.05, and that power increases with the sample size. Moreover, the power of the test for the threshold effect increases with the magnitude of the threshold effect, and the power of the test for the equal weighting scheme increases as the weights decay faster. These results indicate that the proposed test statistics have good size and power properties.

Table 2 Size and power of the proposed test statistics

Finally, we conduct Monte Carlo simulations to compare the out-of-sample forecasting performance of the TMIDAS model with that of the MS-MIDAS and MIDAS models under different data generating processes (DGPs). To better match the models employed in the following empirical application, we set \({\mathbf{z}}_t=1\) in (11) without loss of generality. For the sake of comparison, we consider the following four DGPs based on (11) and (12): (a) threshold model with decaying weights: \((\alpha _1, \beta _1)=(1,1), (\alpha _2, \beta _2)=(2,2)\), and \({{\varvec{\theta }}}=(0.5, -0.5)\); (b) threshold model with equal weights: \((\alpha _1, \beta _1)=(1,1), (\alpha _2, \beta _2)=(2,2)\), and \({{\varvec{\theta }}}=(0, 0)\); (c) linear model with decaying weights: \((\alpha _1, \beta _1)=(\alpha _2, \beta _2)=(1,1)\), and \({{\varvec{\theta }}}=(0.5, -0.5)\); (d) linear model with equal weights: \((\alpha _1, \beta _1)=(\alpha _2, \beta _2)=(1,1)\), and \({{\varvec{\theta }}}=(0, 0)\). The sample size T is set to 100 and is split between an estimation sample and an evaluation sample. We set the size of the evaluation sample to 28, which roughly matches the empirical application. In assessing out-of-sample performance, we follow Guérin and Marcellino (2013): we estimate the model using the estimation sample and compute one-step-ahead forecasts, and we recursively expand the estimation sample until we reach the end of the sample T, so that we can compute 28 forecasts. The simulation results are reported in Table 3, in which we report the average mean square forecast errors (MSFE) over 100 Monte Carlo replications. When the true DGP is a threshold model with decaying weights or equal weights, the proposed TMIDAS models outperform the MS-MIDAS model and obtain the best forecasts. When the true DGP is the linear model with decaying weights, the MIDAS model is the winner. In addition, when the true DGP is the linear model with equal weights, the MIDAS model with equal weights, which aggregates the high-frequency data using a simple average, outperforms the other models. Overall, the simulations indicate that the proposed TMIDAS model has the best forecasting performance when there is a threshold effect in the true model (see Footnote 3).
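The recursive forecasting scheme can be sketched generically as below. This is an illustration under stated assumptions: the helper names are hypothetical, the model is passed in as a pair of `fit` and `predict` functions, a plain OLS model stands in for the MIDAS-type models, and the regressors for the target period are assumed to be available when the forecast is made.

```python
import numpy as np

def recursive_msfe(y, X, n_eval, fit, predict):
    """Expanding-window one-step-ahead forecasts over the last n_eval periods."""
    T = len(y)
    errors = []
    for end in range(T - n_eval, T):
        params = fit(y[:end], X[:end])          # estimate on observations 0..end-1
        errors.append(y[end] - predict(params, X[end]))
    return float(np.mean(np.square(errors)))

def ols_fit(y, X):
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

def ols_predict(coef, x_row):
    return float(x_row @ coef)

rng = np.random.default_rng(0)
T = 100
X = np.column_stack([np.ones(T), rng.standard_normal(T)])
y = X @ np.array([1.0, 1.0]) + rng.standard_normal(T)
msfe = recursive_msfe(y, X, n_eval=28, fit=ols_fit, predict=ols_predict)
```

With T = 100 and an evaluation sample of 28, the first forecast uses 72 observations and the last uses 99; averaging the resulting MSFE over Monte Carlo replications gives figures comparable in kind to those reported for each model.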

Table 3 Forecasting exercise: mean square forecast errors (MSFE) based on different models

4 Empirical application

In this section, we apply the proposed TMIDAS model to investigate the presence and pattern of cyclical bias in US quarterly GDP forecast errors and compare the out-of-sample performance of the TMIDAS model with that of the MS-MIDAS and MIDAS models for GDP forecast errors.

Given that GDP data are often subject to substantial revisions after the initial release and that the most fully revised data often become available only after a long delay in many countries (e.g., Sinclair and Stekler 2013; Sinclair et al. 2015; Yang 2017, 2020), it is clearly important to evaluate the accuracy of the first-release GDP data. To do so, one of the most commonly used approaches is to treat the first-release data as a forecast of actual GDP and investigate the cyclicality in the GDP forecast errors following Holden and Peel (1990):

$$\begin{aligned} y_{t,h}-z_t = \alpha + e_t, \end{aligned}$$
(13)

in which the most recent data \(y_{t,h}\) reported at time \(t+h\) are treated as the actual data of the quarterly GDP growth at time t, and \(z_t\) is the first release data of quarterly GDP growth at time t. The accuracy of the first release data is evaluated by testing the null that \(\alpha =0\). Rejecting the null is evidence that the quarterly GDP data are biased and inefficient.

To investigate the characteristics of the GDP forecast errors, it may be desirable to introduce some available information into the above equation. Motivated by the well-known Phillips curve and Okun’s law, we extend the Holden-Peel regression by incorporating, in turn, monthly non-farm payroll employment growth as a measure of employment and the monthly consumer price index (CPI) as a measure of inflation. A significant coefficient on employment or CPI implies that the GDP forecasts are biased, because the information in employment or CPI is not incorporated into the GDP data. The modified regression is a standard MIDAS model given by

$$\begin{aligned} y_{t,h}-z_t = \alpha +\beta x_{t}({\varvec{\theta }})+ e_t, \end{aligned}$$
(14)

in which \(x_{t}({\varvec{\theta }})=\sum \nolimits _{j = 1}^3 {w_{j}({{\varvec{\theta }}})L^{j/3}x^{(3)}_{t}}\), \(x^{(3)}_{t/3}\) is the monthly non-farm payroll employment growth or the monthly CPI, and \(w_{j}({{\varvec{\theta }}})=w_{j}(\theta _{1}, \theta _{2})=\frac{\exp (\theta _{1} j+\theta _{2} j^2)}{\sum ^3_{j=1}\exp (\theta _{1} j+\theta _{2} j^2)}\). The properties of the forecast errors are examined by testing the null that \(\alpha =\beta =0\). Rejecting the null is evidence that the quarterly GDP forecast is biased and inefficient.

As suggested by recent studies (see, e.g., Sinclair and Stekler 2013; Sinclair et al. 2015; Messina et al. 2015; Xie and Hsu 2016; Yang 2020), macroeconomic data might contain cyclical bias that depends on the business cycle; moreover, these systematic forecast errors associated with the business cycle may offset each other, so that the null of unbiasedness cannot be rejected even though there are systematic errors associated with the state of the economy. To capture such a cyclical bias in the quarterly GDP forecast errors, we extend the MIDAS model to our proposed TMIDAS model given as

$$\begin{aligned} {y_{t,h}}-z_t = \left\{ \begin{array}{l} {\alpha _1} + \beta _{1} x_{t}({\varvec{\theta }}) + {e_{t}},\;\text{ if }\;\; {q_t} \le {\gamma }\\ {\alpha _2} + \beta _{2} x_{t}({\varvec{\theta }}) + {e_{t}},\;\text{ if }\;\; {q_t} > {\gamma } \end{array} \right. , \end{aligned}$$
(15)

in which the threshold variable \(q_t\) is chosen as the lagged dependent variable \({y_{t-1,h}}\) (the actual quarterly GDP growth at time \(t-1\)). This is an intuitive measure of real activity, and it reflects the fact that macroeconomic data might be overestimated during slowdowns and underestimated during booms.
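
The following minimal sketch illustrates how a model of this form can be fitted by searching over candidate thresholds. It is not the estimation code used in this paper: for simplicity, the NLS step for \({\varvec{\theta }}\) is replaced by a coarse grid, \(\alpha _i\) and \(\beta _i\) are concentrated out by OLS within each regime, and the 15% trimming, the grid, the function names, and the synthetic data are all assumptions of the illustration.

```python
import math
import random

def almon_weights(theta1, theta2, m=3):
    """Exponential Almon weights w_j(theta), j = 1..m, normalized to sum to one."""
    raw = [math.exp(theta1 * j + theta2 * j * j) for j in range(1, m + 1)]
    total = sum(raw)
    return [r / total for r in raw]

def ols_ssr(y, x):
    """Sum of squared residuals from regressing y on a constant and x."""
    n = len(y)
    my, mx = sum(y) / n, sum(x) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    beta = sxy / sxx if sxx > 0 else 0.0
    alpha = my - beta * mx
    return sum((yi - alpha - beta * xi) ** 2 for xi, yi in zip(x, y))

def fit_tmidas(y, hf, q, theta_grid, trim=0.15):
    """Grid search over (theta, gamma); returns (ssr, gamma, theta).

    hf[t] holds the three monthly values entering x_t(theta), lag j = 1 first.
    theta is common to both regimes; alpha and beta are regime-specific.
    """
    n = len(y)
    q_sorted = sorted(q)
    candidates = q_sorted[int(trim * n):int((1 - trim) * n)]  # trimmed thresholds
    best = None
    for theta in theta_grid:
        w = almon_weights(*theta)
        x = [sum(wj * v for wj, v in zip(w, row)) for row in hf]
        for gamma in candidates:
            low = [t for t in range(n) if q[t] <= gamma]
            high = [t for t in range(n) if q[t] > gamma]
            ssr = (ols_ssr([y[t] for t in low], [x[t] for t in low])
                   + ols_ssr([y[t] for t in high], [x[t] for t in high]))
            if best is None or ssr < best[0]:
                best = (ssr, gamma, theta)
    return best

def simulate(n=80, noise_sd=0.05, seed=1):
    """Synthetic data with true gamma = 0 and theta = (-1, 0); illustration only."""
    rng = random.Random(seed)
    w = almon_weights(-1.0, 0.0)
    hf = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(n)]
    q = [rng.gauss(0, 1) for _ in range(n)]
    y = []
    for row, qt in zip(hf, q):
        x = sum(wj * v for wj, v in zip(w, row))
        a, b = (1.0, 0.5) if qt <= 0 else (-1.0, 2.0)
        y.append(a + b * x + rng.gauss(0, noise_sd))
    return y, hf, q

theta_grid = [(t1 / 2, t2 / 10) for t1 in range(-4, 5) for t2 in range(-2, 3)]
y, hf, q = simulate()
ssr_hat, gamma_hat, theta_hat = fit_tmidas(y, hf, q, theta_grid)
```

Because the candidate thresholds are taken from the observed values of \(q_t\), the estimate \({{\hat{\gamma }}}\) is always one of the sample values; a numerical optimizer over \({\varvec{\theta }}\) could replace the grid without changing the outer search.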

We use US quarterly GDP data from 2000Q1 to 2019Q3, and monthly non-farm payroll employment growth and monthly CPI data from January 2000 to September 2019; there are 79 quarterly GDP observations and 237 monthly employment and CPI observations. GDP growth is the annualized quarterly growth rate of real GDP, and the employment growth and inflation rates are the annualized monthly growth rates of non-farm payroll employment and CPI. When aggregating the high-frequency data using a simple average, we take the average of the annualized rates. The data are downloaded from the website of the Federal Reserve Bank of Philadelphia (https://www.philadelphiafed.org/surveys-and-data/real-time-data-research/first-second-third).

Table 4 Empirical results and 95% confidence intervals based on different models

The empirical results are reported in Table 4. For the sake of comparison, we also report the results based on the Markov-switching MIDAS (MS-MIDAS) model proposed by Guérin and Marcellino (2013), as well as the results for the relationship between GDP growth and non-farm payroll employment growth (or CPI). In Cases A and B for the GDP forecast errors, the results based on the Holden-Peel regression and the MIDAS model with equal weights show that the intercept is not statistically significantly different from zero, and that the slopes of employment growth and CPI are not statistically significantly different from zero at the 5% level, which supports the view that the GDP forecast is unbiased and efficient. However, when the MIDAS model is employed, the intercept is negative and significantly different from zero, and the coefficient of employment growth is positive and significantly different from zero; no such significant results are observed for CPI. Furthermore, by allowing for a nonlinear effect, the TMIDAS model outperforms the MS-MIDAS and MIDAS models in terms of \(R^2\), and the empirical results based on the TMIDAS and MS-MIDAS models share some similarities. When the proposed TMIDAS model is employed to allow for a threshold effect, the intercept is significantly negative in boom periods (\(q>{{\hat{\gamma }}}\)) and significantly positive in recession periods (\(q\le {{\hat{\gamma }}}\)), implying an overestimation (underestimation) bias during relatively good (bad) states of the economy; moreover, the coefficients of employment growth are positive and statistically significantly different from zero at the 5% level. 
These results indicate that the GDP forecast is biased and inefficient, a conclusion that is also supported by the coefficients of CPI in Case B (see Footnote 4). According to the test results based on \(B=1000\) bootstrap replications, in Case A (GDP forecast errors and employment growth) we can reject the null of the flat weighting scheme and the null of no threshold effect at the 5% level, and in Case B (GDP forecast errors and CPI) we can reject these nulls at the 10% level, indicating that the TMIDAS model is suitable for investigating cyclical bias in quarterly GDP forecast errors. Hence, we conclude that the GDP forecast errors contain cyclical bias that depends on the state of the economy.

The above empirical results imply that the GDP forecast errors may be predictable from monthly non-farm payroll employment growth and monthly CPI. Therefore, we next investigate the out-of-sample performance of the models considered above in forecasting GDP forecast errors and GDP growth. To this end, the sample is split into an estimation sample and an evaluation sample. We set the size of the evaluation sample to 28 quarters (7 years). In assessing out-of-sample performance, we estimate the model using the estimation sample and compute a one-step-ahead forecast, and we recursively expand the estimation sample until we reach the end of the sample \(T\), which yields 28 forecasts. The results are reported in Table 5. They show that the TMIDAS model obtains the best forecasts in Cases A and B; these results are consistent with the test results in Table 4 and the forecasting simulations in Table 3.
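
The recursive scheme described above can be sketched as follows. The helper names (`expanding_window_msfe`, `fit_ols`, `predict_ols`) are hypothetical, a simple linear regression stands in for the MIDAS-type models, and the toy series is synthetic; the sketch also assumes that the regressor for period \(t\) is built from lagged data and is therefore available when the forecast is made.

```python
def expanding_window_msfe(y, x, fit, predict, n_eval):
    """Mean square forecast error of one-step-ahead forecasts
    from a recursively expanding estimation sample."""
    T = len(y)
    errors = []
    for t in range(T - n_eval, T):
        params = fit(y[:t], x[:t])        # estimate on observations 0..t-1 only
        forecast = predict(params, x[t])  # one-step-ahead forecast of y[t]
        errors.append(y[t] - forecast)
    return sum(e * e for e in errors) / len(errors)

def fit_ols(y, x):
    """OLS intercept and slope for y = alpha + beta * x + e."""
    n = len(y)
    my, mx = sum(y) / n, sum(x) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    beta = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    return my - beta * mx, beta

def predict_ols(params, x_new):
    alpha, beta = params
    return alpha + beta * x_new

# Synthetic series of length 79 with 28 evaluation points, mirroring the
# sample sizes in the text; the values themselves are purely illustrative.
x_demo = [float(i % 7) for i in range(79)]
y_demo = [2.0 + 3.0 * xi for xi in x_demo]
msfe_demo = expanding_window_msfe(y_demo, x_demo, fit_ols, predict_ols, n_eval=28)
```

Passing a different `fit`/`predict` pair for each competing model and comparing the resulting MSFE values reproduces the structure of the comparison, with the first forecast based on the initial 51 observations.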

In summary, based on the TMIDAS model and the GDP forecast errors, our empirical results support the conclusion that the GDP forecast errors contain cyclical bias depending on the state of the economy, and that monthly non-farm payroll employment growth and monthly CPI can improve the forecast.

Table 5 Mean square forecast errors (MSFE) based on different models

5 Conclusion

The relationships between economic variables are generally nonlinear, and the data are usually sampled at different frequencies. This paper proposes a model called threshold mixed data sampling (TMIDAS) regression, in which we allow for a threshold effect to capture nonlinear effects in the relationship between dependent and explanatory variables, and the explanatory variables are sampled at a frequency higher than the dependent variable. The proposed model can be treated as an extension of the classical threshold regression model in Hansen (2000) by allowing for mixed data sampling. Hence, the TMIDAS model not only has the important advantage of the classical threshold model in capturing nonlinear effects that widely exist in different fields of economics, but also enjoys the efficiency gains from extracting information from high-frequency variables.

We develop a two-step procedure to estimate the model based on nonlinear least squares (NLS) and the grid-search method, and we suggest a test statistic for the presence of a threshold effect and a test statistic for the null hypothesis of equal weights (simple average) in aggregating higher-frequency time series data before estimating econometric models. Moreover, we conduct Monte Carlo simulations to examine the performance of the estimation and testing procedures, as well as out-of-sample forecasting performance. Our simulation results indicate that the estimation and testing procedures work well in finite samples, and that TMIDAS models have the best forecasting performance when there is a threshold effect in the true model. We apply the TMIDAS model to investigate the presence and pattern of cyclical bias in quarterly GDP forecast errors and compare the out-of-sample performance of the TMIDAS model with that of the MS-MIDAS and MIDAS models for GDP forecast errors. Both simulation and empirical results demonstrate the usefulness of TMIDAS.

One limitation of the proposed TMIDAS model is that the threshold variable and the dependent variable are sampled at a frequency lower than the explanatory variables. It is imperative for future work to develop a model that allows the threshold variable to be sampled at the high frequency (see Footnote 5).