1 Introduction

Consider a regression model

$$\begin{aligned} y_t = \eta ({\varvec{x}}_t) + \epsilon _t, \end{aligned}$$
(1)

where \({\varvec{x}}_t\) is a p-dimensional covariate and \(\epsilon _t\) is a mean-zero stationary process with autocovariance function \(\gamma (\cdot )\). For independent \(\epsilon _t\)'s, there are several nonparametric approaches to estimating \(\eta \), such as kernel smoothing, spline smoothing, and wavelet methods (e.g., see Härdle 1990). Efforts have also been made to study estimation of \(\eta \) when the \(\epsilon _t\)'s are autocorrelated. Hall and Hart (1990) investigated convergence rates of nonparametric regression in the presence of both short- and long-range dependent errors. Johnstone and Silverman (1997) considered a level-dependent wavelet thresholding estimator for data with correlated noise. Robinson (1997) investigated nonparametric regression for a linear process with semiparametric modeling of the spectral density. Wang (1996) considered fractional Gaussian noise as an approximation for nonparametric regression with long-range dependence. Yang (2001) studied minimax rates of convergence for nonparametric regression under a random design, which differs from Hall and Hart (1990) and Johnstone and Silverman (1997). Martins-Filho and Yao (2009) established the asymptotic distribution of a local linear estimator of nonparametric regression under dependent errors with a general parametric covariance model.

In most cases, researchers have assumed certain parametric or semiparametric autocovariance functions, with the exception of Su and Ullah (2006), who modeled the error process with a nonparametric function. In particular, Su and Ullah (2006) assumed that the error has a finite-order nonlinear structure.

In this paper, we propose a nonparametric approach to modeling the autocovariance function of a stationary process. The proposed approach simultaneously estimates the autocorrelation structure and the mean. To focus on modeling the autocovariance function, we consider a parametric form of \(\eta ({\varvec{x}})\), in particular a linear model, by assuming that \(\eta ({\varvec{x}})={\varvec{x}}'{\varvec{\beta }}\). Then the model can be re-written as

$$\begin{aligned} y_t = {\varvec{x}}_t'{\varvec{\beta }}+ \epsilon _t, \end{aligned}$$
(2)

where \('\) denotes the transpose of a matrix or a vector. For the observed data, \(\{(y_t,{\varvec{x}}_t), t=1, \ldots , n\}\), the model in matrix form is \({\varvec{y}}={\varvec{X}}{\varvec{\beta }}+{\varvec{\epsilon }}\), and we let the autocovariance matrix of \({\varvec{\epsilon }}\) be \({\varvec{\Gamma }}_n\). For independent errors, \({\varvec{\Gamma }}_n= \gamma (0)\,I_n\). A Bayesian approach can be implemented by placing independent priors on \({\varvec{\beta }}\) and \(\gamma (0)\), such as \( {\varvec{\beta }}\sim N({\varvec{\beta }}_0,{\varvec{\Sigma }})\) and \(1/\gamma (0)\sim \texttt {gamma}(a,b)\), where \(N({\varvec{\beta }}_0, {\varvec{\Sigma }})\) is a normal distribution with mean \({\varvec{\beta }}_0\) and covariance matrix \({\varvec{\Sigma }}\), and \(\texttt {gamma}(a,b)\) is a Gamma distribution with mean ab. The posterior distributions of \({\varvec{\beta }}\) and \(1/\gamma (0)\) are again normal and Gamma, respectively, and one can use a Gibbs sampler for inference in such a parametric model.
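To make the conjugate updates concrete, here is a minimal sketch of such a Gibbs sampler for the independent-error case (the priors and hyperparameters \({\varvec{\beta }}_0\), \({\varvec{\Sigma }}\), a, b are as above; all variable names are ours):

```python
import numpy as np

def gibbs_iid(y, X, beta0, Sigma0, a, b, n_iter=2000):
    """Gibbs sampler for y = X beta + eps with eps_t iid N(0, gamma(0)).

    Priors: beta ~ N(beta0, Sigma0) and 1/gamma(0) ~ gamma(a, b),
    parameterized so that the prior mean of 1/gamma(0) is a*b.
    """
    n, p = X.shape
    Sigma0_inv = np.linalg.inv(Sigma0)
    prec = 1.0                       # current value of 1/gamma(0)
    draws = []
    for _ in range(n_iter):
        # beta | rest is normal by conjugacy
        V = np.linalg.inv(prec * X.T @ X + Sigma0_inv)
        m = V @ (prec * X.T @ y + Sigma0_inv @ beta0)
        beta = np.random.multivariate_normal(m, V)
        # 1/gamma(0) | rest is Gamma by conjugacy
        resid = y - X @ beta
        prec = np.random.gamma(a + 0.5 * n,
                               1.0 / (1.0 / b + 0.5 * resid @ resid))
        draws.append((beta, prec))
    return draws
```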

When \(\gamma \) is fully known, the mean and variance of the posterior distribution of \({\varvec{\beta }}\) are simply adjusted by the known \({\varvec{\Gamma }}_n\). When \(\gamma \) is parameterized with a set of unknown parameters, one can use a simple Gibbs sampler for those parameters to learn about the autocovariance matrix \({\varvec{\Gamma }}_n\) and the mean \({\varvec{X}}{\varvec{\beta }}\). That is, given the values of the autocorrelation parameters, we can compute \({\varvec{\Gamma }}_n\) and then sample from the posterior distribution of \({\varvec{\beta }}\).

For nonparametric modeling of \(\gamma \), we work with the corresponding spectral density in the frequency domain, placing a prior on the logarithm of the spectral density rather than on \(\gamma \) itself; the prior is updated using the Whittle likelihood (Whittle 1954). With the Whittle likelihood, the posterior distribution becomes computationally complicated. Thus, we apply the approach of Carter and Kohn (1997), approximating the distribution of the logarithm of an exponential random variable by a mixture of five normal distributions with known means, variances, and mixture proportions. By introducing the mixture components as unobserved latent variables, a Gibbs sampler can be constructed in which the conditional posteriors are normal by conjugacy.

The rest of this paper is organized as follows: in Sect. 2, we present the particulars of the proposed nonparametric modeling of \(\gamma \). The estimation procedure is described in Sect. 3, and a simulation study supporting our approach is given in Sect. 4. Two real data examples are illustrated in Sect. 5. The paper ends with a conclusion and discussion in Sect. 6.

2 Nonparametric modeling of serially correlated errors

To model an autocovariance function \(\gamma \) nonparametrically, we switch from the time domain to the frequency domain. For the autocovariance function \(\gamma \) of a stationary process, we have the corresponding spectral density \(\lambda \) such that

$$\begin{aligned} \lambda (\omega ) = \sum _{u= -\infty }^{\infty } \gamma (u) e^{- i 2\pi u \omega }, \end{aligned}$$
(3)

for \(\omega \in [0, 1)\), which is the Fourier transform of the covariance function \(\gamma \). Assume that the error process \(\epsilon _t\) is invertible and that the autocovariance function \(\gamma \) is absolutely summable. Then, the inverse Fourier transform enables us to get \(\gamma \) back from the spectral density \(\lambda \):

$$\begin{aligned} \gamma (u) = \int _{0}^1 \lambda (\omega ) e^{+i 2 \pi u \omega } d\omega . \end{aligned}$$
(4)

Thus, we model \(\lambda (\omega )\) instead of \(\gamma (u)\) and recover \(\gamma \) later. Note that we have \(0<\inf _{\omega }\lambda \le \sup _{\omega } \lambda < \infty \). Now, we consider modeling the spectral density \(\lambda \). Since \(\lambda >0\), we let \(\theta (\omega )=\log (\lambda (\omega ))\) and assume that \(\theta (\cdot )\) follows a Gaussian process (GP) with mean function \(\nu (\cdot )\) and covariance kernel \(\tau (\cdot , \cdot )\).

Due to its hierarchical structure, with the additional stochastic process assumption on \(\theta (\cdot )\), it is natural to interpret our approach within a Bayesian framework. That is, the GP assumption on \(\theta (\cdot )\) can be considered a prior on \(\theta (\cdot )\). Since we assume a linear relationship between \({\varvec{x}}_t\) and \(y_t\), \(\eta ({\varvec{x}}_t) = {\varvec{x}}_t'{\varvec{\beta }}\), we consider a Gaussian prior on \({\varvec{\beta }}\) such that \({\varvec{\beta }}~\sim ~ N({\varvec{\beta }}_0, \sigma _0^2 {\varvec{I}}_p)\).

Suppose that \((y_t,{\varvec{x}}_t')\) for \(t=1, \ldots , n\), are observed. Let \({\varvec{y}}=(y_1, \ldots , y_n)'\), \({\varvec{\epsilon }}=(\epsilon _1, \ldots , \epsilon _n)'\) and \({\varvec{\Gamma }}_n=\mathrm{Cov}({\varvec{\epsilon }})\) be the autocovariance matrix of the error vector. Then, the matrix form of the model (1) with n observations is

$$\begin{aligned} {\varvec{y}}= {\varvec{\eta }}+ {\varvec{\epsilon }}, \end{aligned}$$
(5)

with \({\varvec{\eta }}= {\varvec{X}}{\varvec{\beta }}\). Here \({\varvec{X}}\) is a design matrix whose t-th row is \({\varvec{x}}_t'\). The posterior density for \(\theta \) is not easy to obtain due to the complicated relationships among \(\lambda \), \(\gamma \) and \(\theta \). To overcome this issue, the following approach is considered. Note that a discrete version of the inverse Fourier transform given in (4) is

$$\begin{aligned} \gamma _n (u) = \frac{1}{n} \sum _{j=0}^{n-1} \lambda (\omega _j) e^{+i 2\pi u \omega _j}, \end{aligned}$$
(6)

where \(\omega _j = j/n\) are the Fourier frequencies. Since \(\gamma _n(u) \rightarrow \gamma (u)\) as \(n \rightarrow \infty \), once we have \(\lambda (\omega _j)\) from \(\theta (\omega _j)\) for \(j=0, \ldots , n-1\), we can approximate \(\gamma (u)\) for \(u=0, \ldots , n-1\) by \(\gamma _n(u)\) from the relationship (6) and thereby obtain \({\varvec{\Gamma }}_n\).
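Since (6) is exactly a discrete inverse Fourier transform, this step can be carried out with the fast Fourier transform; a minimal sketch (function names are ours):

```python
import numpy as np
from scipy.linalg import circulant

def autocov_matrix_from_spectral(lam):
    """Approximate Gamma_n from lambda(omega_j), omega_j = j/n.

    numpy's ifft computes (1/n) * sum_j lam[j] * exp(+2i*pi*u*j/n),
    which is exactly gamma_n(u) in (6).
    """
    gamma_n = np.fft.ifft(lam).real   # gamma_n(0), ..., gamma_n(n-1)
    # With the spectral factorization used in Sect. 3, the resulting
    # approximation to Gamma_n is the circulant matrix built from gamma_n.
    return circulant(gamma_n)
```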

For the posterior density of \(\theta \), we consider the periodogram, a nonparametric estimate of the spectral density \(\lambda \), defined as

$$\begin{aligned} I_n(\omega ) = \frac{1}{n} \left| \sum _{t=1}^n \epsilon _t e^{-i 2 \pi t \omega }\right| ^2 \end{aligned}$$
(7)

for \(\omega \in [0, 1)\). Note that \(I_n(\omega )\) is asymptotically exponentially distributed with mean \(\lambda (\omega )\), and \(I_n(\omega _0), \ldots , I_n(\omega _m)\) are asymptotically independent at the Fourier frequencies \(\omega _j = j/n\) for \(j=0, \ldots , m=\lfloor n/2 \rfloor \) [e.g., see Brockwell and Davis (1991)]. We focus on the first half of the Fourier frequencies since \(I_n\) is symmetric around 1/2. It is also known that the logarithm of a standard exponential random variable can be approximated by a mixture of five Gaussian random variables with known means and variances (Carter and Kohn 1997).
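As an aside, the periodogram at the Fourier frequencies is itself one FFT away from the residuals; a sketch (names are ours):

```python
import numpy as np

def periodogram(eps):
    """I_n(omega_j) at omega_j = j/n for j = 0, ..., floor(n/2), as in (7).

    The phase shift between summing over t = 1, ..., n and numpy's
    t = 0, ..., n-1 convention does not affect the squared modulus.
    """
    n = len(eps)
    dft = np.fft.fft(eps)             # sum_t eps_t exp(-2i*pi*t*j/n)
    I = np.abs(dft) ** 2 / n
    return I[: n // 2 + 1]            # first half; I_n is symmetric
```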

Let \(\xi \) be such a mixture of five Gaussian random variables. Its density is given by

$$\begin{aligned} \pi (\xi ) = \sum _{l=1}^5 p_l \phi _{v_l}(\xi -\kappa _l), \end{aligned}$$
(8)

where \(\phi _{v_l}\) is the density of the Gaussian distribution with mean zero and variance \(v_l^2\) and \(p_l\) is a mixture proportion. Here \(\kappa _l\) and \(v_l^2\) for \(l=1, \ldots , 5\) are known. Then, approximately, we have

$$\begin{aligned} \log (I_n(\omega _j)) =\theta (\omega _j) + \xi _{j}, \end{aligned}$$
(9)

for \(j=0, \ldots , m\), with \(\xi _{0}, \ldots , \xi _{m} {\mathop {\sim }\limits ^{i.i.d.}} \xi \). We can use (9) to approximate the posterior density of \({\varvec{\theta }}=(\theta (\omega _0),\ldots , \theta (\omega _m))'\). Let \(\varphi _{j}=\log (I_n(\omega _j))\). We now adopt a Bayesian Markov chain Monte Carlo (MCMC) method to obtain posterior samples for further analysis. The next section describes the detailed MCMC steps.
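A sketch of the likelihood implied by (8) and (9) is given below; the five mixture parameters \((p_l, \kappa _l, v_l^2)\) are tabulated in Carter and Kohn (1997), so the numbers used here are placeholders for illustration only:

```python
import numpy as np
from scipy.stats import norm

# Placeholder mixture parameters; the actual values of (p_l, kappa_l,
# v_l^2) must be taken from Carter and Kohn (1997).
p = np.array([0.20, 0.20, 0.20, 0.20, 0.20])
kappa = np.array([-2.0, -1.0, -0.5, 0.3, 1.0])
v2 = np.array([2.0, 1.0, 0.7, 0.6, 0.3])

def log_whittle_approx(varphi, theta):
    """Approximate log likelihood of the log-periodogram ordinates:
    varphi_j = theta_j + xi_j, with xi_j drawn from the mixture (8)."""
    resid = varphi - theta
    dens = (p * norm.pdf(resid[:, None], loc=kappa, scale=np.sqrt(v2))).sum(axis=1)
    return np.log(dens).sum()
```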

3 Estimation procedure

To use a Bayesian MCMC method for the proposed approach, we introduce settings for the hyper-parameters. For the mean function \(\nu (\cdot )\) and covariance kernel \(\tau (\cdot , \cdot )\) of \(\theta (\cdot )\), we assume that \(\nu (\cdot )\equiv 0\) and \(\tau (w_i,w_j)= (1/\tau ^2_0) \exp (-\rho \Vert w_i-w_j\Vert )\), where \(\Vert \cdot \Vert \) is the Euclidean norm. We further assume that \(\tau ^2_0 \sim \texttt {gamma}(a,b)\) and that \(\rho \) follows a noninformative prior. We need to find the posterior densities of the unknown quantities given the data. The unknown variables in the model are \({\varvec{\beta }}\), \({\varvec{\theta }}\), the mixture-component information in \({\varvec{\xi }}=(\xi _0, \ldots , \xi _m)'\), \(\tau ^2_0\) and \(\rho \). We introduce \(\psi _j\) for the mixture component label of \(\xi _j\), which takes values in \(\{1, \ldots , 5\}\); introducing \(\psi _j\) allows a simpler expression for the posterior densities. Updating steps for \(\tau ^2_0\) and \(\rho \) are omitted since they are standard. Here is a description of how the Gibbs sampler works in sampling \({\varvec{\beta }}\), \({\varvec{\theta }}\) and \({\varvec{\psi }}=(\psi _0,\ldots ,\psi _m)'\):

(1) Update \({\varvec{\beta }}\): Given \({\varvec{\theta }}\), we approximate \({\varvec{\Gamma }}_n\) using (6) and denote the result by \(\tilde{{\varvec{\Gamma }}}_n\), which is used to update \({\varvec{\beta }}\). The conditional posterior density for \({\varvec{\beta }}\) with the Gaussian prior \({\varvec{\beta }}\sim N({\varvec{\beta }}_{0}, \sigma ^2_0{\varvec{I}}_p)\) is

$$\begin{aligned}&{\varvec{\beta }}\,|\, {\cdots } ~\sim ~ N(\mu _{*}, {\varvec{\sigma }}^2_*),\nonumber \\&\quad {\varvec{\sigma }}^2_* = ({\varvec{X}}' \tilde{{\varvec{\Gamma }}}_n^{-1} {\varvec{X}}+ \sigma _0^{-2}{\varvec{I}}_p )^{-1},\nonumber \\&\quad \mu _{*} = {\varvec{\sigma }}^2_* ( {\varvec{X}}' \tilde{{\varvec{\Gamma }}}_n^{-1}{\varvec{y}}+ \sigma ^{-2}_0 {\varvec{\beta }}_0), \end{aligned}$$
(10)

where ‘\( \, |\, \cdots \)’ means conditioning on all other remaining variables and the data.
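In code, the draw (10) might look like the following sketch, where Gamma_tilde_inv stands for \(\tilde{{\varvec{\Gamma }}}_n^{-1}\) (computed efficiently as described below):

```python
import numpy as np

def update_beta(y, X, Gamma_tilde_inv, beta0, sigma0_sq):
    """Draw beta from its conditional normal posterior (10)."""
    p = X.shape[1]
    XtGi = X.T @ Gamma_tilde_inv
    post_var = np.linalg.inv(XtGi @ X + np.eye(p) / sigma0_sq)
    post_mean = post_var @ (XtGi @ y + beta0 / sigma0_sq)
    return np.random.multivariate_normal(post_mean, post_var)
```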

(2) Update \({\varvec{\theta }}\): Given \({\varvec{\eta }}={\varvec{X}}{\varvec{\beta }}\), we can compute \({\varvec{\varphi }}=(\varphi _0,\ldots , \varphi _m)'\) using (7) with \({\varvec{\epsilon }}= {\varvec{y}}-{\varvec{\eta }}\). Then, we have the following conditional distribution for \({\varvec{\theta }}\):

$$\begin{aligned}&{\varvec{\theta }}\,|\, {\cdots } ~\sim ~ N({\varvec{\nu }}^{*}, {\varvec{\Upsilon }}^*),\\&{\varvec{\Upsilon }}^* = ({\varvec{\Upsilon }}^{-1} + {\varvec{V}}_{\psi }^{-1})^{-1},\\&{\varvec{\nu }}^{*} = {{\varvec{\Upsilon }}^*} {\varvec{V}}_{\psi }^{-1} ({\varvec{\varphi }}- {\varvec{\kappa }}_{\psi } - {\varvec{\nu }}) + {\varvec{\nu }}, \end{aligned}$$

where \({\varvec{\nu }}=(\nu (w_0),\ldots , \nu (w_m))'\), \({\varvec{\Upsilon }}=(\tau (w_i,w_j))_{i,j=0,\ldots , m}\), \({\varvec{\kappa }}_{\psi } = (\kappa _{\psi _0}, \ldots , \kappa _{\psi _m})'\) and \({\varvec{V}}_{\psi }=\mathrm{diag}\{v^2_{\psi _0}, \ldots , v^2_{\psi _{m}}\}\).
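A corresponding sketch of the \({\varvec{\theta }}\) draw, using \(\nu (\cdot ) \equiv 0\) and the exponential kernel specified above (argument names are ours):

```python
import numpy as np

def update_theta(varphi, omega, psi, kappa, v2, tau0_sq, rho):
    """Draw theta = (theta(omega_0), ..., theta(omega_m))' from its
    conditional normal posterior, with nu(.) = 0 and kernel
    tau(w_i, w_j) = (1/tau0_sq) * exp(-rho * |w_i - w_j|).

    psi holds the mixture labels (0-based), indexing kappa and v2.
    """
    Upsilon = np.exp(-rho * np.abs(omega[:, None] - omega[None, :])) / tau0_sq
    V_psi_inv = np.diag(1.0 / v2[psi])            # V_psi^{-1}
    Upsilon_star = np.linalg.inv(np.linalg.inv(Upsilon) + V_psi_inv)
    nu_star = Upsilon_star @ (V_psi_inv @ (varphi - kappa[psi]))
    return np.random.multivariate_normal(nu_star, Upsilon_star)
```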

(3) Update \({\varvec{\psi }}\): Given \({\varvec{\beta }}\) and \({\varvec{\theta }}\), we can obtain the discrete posterior density for \(\psi _{j}\) such that, for \(l=1, \ldots , 5\),

$$\begin{aligned} P(\psi _{j} = l \, | \, {\varvec{\beta }}, {\varvec{\theta }}) \propto p_l \phi _{v_l}(\varphi _j - \theta (\omega _j) - \kappa _l). \end{aligned}$$
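After normalization over \(l\), each label is a categorical draw; a sketch:

```python
import numpy as np
from scipy.stats import norm

def update_psi(varphi, theta, p, kappa, v2):
    """Draw each mixture label psi_j with probability proportional to
    p_l * phi_{v_l}(varphi_j - theta_j - kappa_l)."""
    resid = varphi - theta
    w = p * norm.pdf(resid[:, None], loc=kappa, scale=np.sqrt(v2))
    w /= w.sum(axis=1, keepdims=True)    # normalize over l = 1, ..., 5
    return np.array([np.random.choice(len(p), p=row) for row in w])
```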

Computation of \(\tilde{{\varvec{\Gamma }}}_n^{-1}\) The inverse \(\tilde{{\varvec{\Gamma }}}_n^{-1}\) can be obtained efficiently by modeling \(\lambda \). We have

$$\begin{aligned} \tilde{{\varvec{\Gamma }}}_n = {\varvec{Q}}_n {\varvec{\Lambda }}_n {\varvec{Q}}_n^{*}, \end{aligned}$$
(11)

where \({\varvec{\Lambda }}_n=\mathrm{diag}\{\lambda (\omega _0), \ldots , \lambda (\omega _{n-1}) \}\), \({\varvec{Q}}_n\) is the \(n \times n\) matrix with (u,v)-th entry \(q_{u,v}= \frac{1}{\sqrt{n}} e^{i (u-1) (v-1) 2\pi /n}\), and \({\varvec{Q}}_n^{*}\) is the complex conjugate of \({\varvec{Q}}_n'\). Note that \(\tilde{{\varvec{\Gamma }}}_n\) is a real, symmetric and positive definite matrix although \({\varvec{Q}}_n\) and \({\varvec{Q}}_n^*\) involve complex numbers. One can show the equality in (11) by using the expression of \(\gamma _n(\cdot )\) given in (6). Since \({\varvec{Q}}_n\) is a unitary matrix, i.e., \({\varvec{Q}}_n{\varvec{Q}}_n^{*} ={\varvec{I}}\), we have \(\tilde{{\varvec{\Gamma }}}_n^{-1}= {\varvec{Q}}_n {\varvec{\Lambda }}_n^{-1} {\varvec{Q}}_n^{*}\), where \({\varvec{\Lambda }}_n^{-1}=\mathrm{diag}\{\lambda ^{-1}(\omega _0), \ldots , \lambda ^{-1}(\omega _{n-1}) \}\).
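In particular, products with \(\tilde{{\varvec{\Gamma }}}_n^{-1}\) reduce to two FFTs and a diagonal scaling, with no \(O(n^3)\) inversion; a sketch:

```python
import numpy as np

def gamma_inv_apply(lam, z):
    """Compute Gamma_tilde^{-1} z via (11).

    Gamma_tilde^{-1} z = Q_n Lambda_n^{-1} Q_n^* z; since Q_n is the
    unitary DFT matrix, this is an FFT, division by lambda(omega_j),
    and an inverse FFT (the 1/sqrt(n) factors cancel).
    """
    return np.fft.ifft(np.fft.fft(z) / lam).real
```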

Forecasting We can use the fitted model to forecast future values, in particular, to predict the k-step-ahead value of y given a value of \({\varvec{x}}\) and the observed data. Namely, we forecast \(y_{f} \) at \(f=k+n\) when \({\varvec{x}}_f\) is given, for observations \((y_1,{\varvec{x}}_1'), \ldots , (y_n, {\varvec{x}}_n')\). In many time series applications, \({\varvec{x}}_f\) can be anticipated reasonably well, or a desired value of \({\varvec{x}}_{f}\) can be specified.

The prediction of \(y_f= {\varvec{x}}_f' {\varvec{\beta }}+ \epsilon _f\) is given as \( {\varvec{x}}_f' \hat{{\varvec{\beta }}} + E(\epsilon _f \,|\, {\varvec{\epsilon }})\), where \(\hat{{\varvec{\beta }}}\) is the estimate obtained by the estimation procedure described above. The conditional expectation of \(\epsilon _f\) given \({\varvec{\epsilon }}\) is \(E(\epsilon _f \,|\, {\varvec{\epsilon }})={\varvec{h}}' {\varvec{\Gamma }}_n^{-1} {\varvec{\epsilon }}\), where \({\varvec{h}}=\mathrm{Cov}({\varvec{\epsilon }}, \epsilon _f)=(\gamma (k+n-1),\ldots , \gamma (k))'\), \(E({\varvec{\epsilon }})= {\varvec{0}}\) and \(\mathrm{Var}({\varvec{\epsilon }})={\varvec{\Gamma }}_n\). This prediction idea is embedded in the Gibbs sampling steps. In the r-th iteration, we get \({\varvec{\beta }}^{(r)}\) and \({\varvec{\Gamma }}_n^{(r)}\). From these, we obtain \({\varvec{\epsilon }}^{(r)}={\varvec{y}}-{\varvec{X}}{\varvec{\beta }}^{(r)}\). Then, \(y_f^{(r)}\) is obtained by \(y_f^{(r)}={\varvec{x}}_f'{\varvec{\beta }}^{(r)} + \epsilon _f^{(r)}\), where \(\epsilon _f^{(r)}=E(\epsilon _f \,|\, {\varvec{\epsilon }}^{(r)})={{\varvec{h}}^{(r)}}' ({\varvec{\Gamma }}_n^{(r)})^{-1} {\varvec{\epsilon }}^{(r)}\).

To complete the forecasting step, we need \({\varvec{h}}^{(r)}\), which consists of \(\gamma (k+n-1), \ldots , \gamma (k)\) at the r-th iteration. The values \(\gamma (n-1), \ldots , \gamma (0)\) are available from \({\varvec{\Gamma }}_n^{(r)}\), but we do not have \(\gamma (k+n-1),\ldots , \gamma (n)\). These quantities can be estimated in the following way. We have \(\lambda ^{(r)}(\omega _0), \ldots , \lambda ^{(r)}(\omega _{n-1}) \), where \(\omega _j= j/n\) are the Fourier frequencies based on n observations. We interpolate these to get \(\hat{\lambda }(\omega )\) for \(\omega \in [0, 1)\), so that we obtain \(\hat{{\varvec{\Lambda }}}_f = \mathrm{diag}\{ \hat{\lambda }(\omega _0^*), \ldots , \hat{\lambda }(\omega _{f-1}^*)\}\), where \(\omega _j^* = j/f\). Then, the estimate \({\varvec{h}}^{(r)}\) of \({\varvec{h}}\) is obtained from \({\varvec{\Gamma }}_f = {\varvec{Q}}_f \hat{{\varvec{\Lambda }}}_f {\varvec{Q}}_f^*\).
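A sketch of this interpolation step (we use linear interpolation for illustration; other schemes would also work):

```python
import numpy as np

def forecast_h(lam_n, n, k):
    """Build h = (gamma(k+n-1), ..., gamma(k))' from lambda(omega_j)
    at omega_j = j/n, by interpolating to omega_j^* = j/f, f = n + k,
    and inverting as in (6)."""
    f = n + k
    grid_n = np.arange(n) / n
    grid_f = np.arange(f) / f
    # periodic linear interpolation of lambda over [0, 1)
    lam_f = np.interp(grid_f, np.append(grid_n, 1.0),
                      np.append(lam_n, lam_n[0]))
    gamma_f = np.fft.ifft(lam_f).real         # gamma(0), ..., gamma(f-1)
    return gamma_f[::-1][:n]                  # gamma(k+n-1), ..., gamma(k)
```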

4 Simulation studies

For the simulation study, we consider bivariate \({\varvec{x}}_t\), where \(x_{t,1} \sim N(0,1)\) and \(x_{t,2} \sim \exp (1)\) are independent. The mean part is \(\eta ({\varvec{x}})= \beta _0 + \beta _1 x_1 + \beta _2 x_2\) with \((\beta _0, \beta _1, \beta _2)= (1,2,3)\). Three models are considered based on three different error processes: \(\epsilon _t \sim AR(2)\), \(\epsilon _t \sim ARMA(1,1)\) and \(\epsilon _t \sim ARFIMA(0,v,0)\), where AR(2) is an autoregressive process of order 2, ARMA(1, 1) is an autoregressive moving average process of order (1, 1), and ARFIMA(0, v, 0) is an autoregressive fractionally integrated moving average process of order (0, v, 0) for \(0< v <1/2\).

For the AR(2) error process, we have \(\epsilon _t - a_1 \epsilon _{t-1} -a_2 \epsilon _{t-2} = \sigma z_t\), where \(z_t \sim N(0,1)\), and the corresponding spectral density is \(\lambda (\omega )= \sigma ^2/|1-a_1 e^{i 2 \pi \omega } -a_2 e^{i 4 \pi \omega } |^2\). For the simulated data, we used \(\epsilon _t - 0.5 \epsilon _{t-1} +0.3 \epsilon _{t-2} = 0.5 z_t\). For the ARMA(1, 1) error process, we have \(\epsilon _t -a_1 \epsilon _{t-1} = b_1 \sigma z_{t-1}+\sigma z_t\), where \(z_t \sim N(0,1)\), and the corresponding spectral density is \(\lambda (\omega )= \sigma ^2|1+b_1 e^{i 2\pi \omega }|^2/|1-a_1e^{i 2\pi \omega }|^2\). For the simulated data, we used \(\epsilon _t -0.5 \epsilon _{t-1} = -0.6\cdot 0.5\, z_{t-1}+ 0.5 z_t\), i.e., \(a_1=0.5\), \(b_1=-0.6\) and \(\sigma =0.5\). For the ARFIMA(0, v, 0) process, the corresponding spectral density is \(\lambda (\omega )= \sigma ^2 /|1-e^{i 2\pi \omega }|^{2v}\). For the simulated data, we used \(v=0.25\) and \(\sigma =0.5\).
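For reference, the three true spectral densities above, with the parameter values used in the simulations, can be evaluated as follows (a direct transcription of the formulas):

```python
import numpy as np

def spec_ar2(w, sigma=0.5, a1=0.5, a2=-0.3):
    """AR(2): sigma^2 / |1 - a1 e^{2i pi w} - a2 e^{4i pi w}|^2."""
    z = np.exp(2j * np.pi * w)
    return sigma**2 / np.abs(1 - a1 * z - a2 * z**2) ** 2

def spec_arma11(w, sigma=0.5, a1=0.5, b1=-0.6):
    """ARMA(1,1): sigma^2 |1 + b1 e^{2i pi w}|^2 / |1 - a1 e^{2i pi w}|^2."""
    z = np.exp(2j * np.pi * w)
    return sigma**2 * np.abs(1 + b1 * z) ** 2 / np.abs(1 - a1 * z) ** 2

def spec_arfima(w, sigma=0.5, v=0.25):
    """ARFIMA(0, v, 0): sigma^2 / |1 - e^{2i pi w}|^{2v}."""
    return sigma**2 / np.abs(1 - np.exp(2j * np.pi * w)) ** (2 * v)
```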

For a given model, a sample size of \(n=200\) time points is considered, along with \(k=5\) time points for prediction. The Gibbs sampler is run with 3 chains of 5000 iterations each, and posterior analysis is based on the last 2500 iterations, convergence being well achieved by then. We repeat this sampling and estimation for \(D=100\) simulated data sets for each model. The results are compared between the parametric approaches (AR(2) and AR(1)) and our nonparametric approach; that is, for each data model, we fit the data using AR(2), AR(1) and our nonparametric approach.

First, we compute two quantities to assess the estimation of \({\varvec{\beta }}\): bias and MSE. Let \(\hat{\beta }_i^{(d)}\) be the posterior mean for \(\beta _i\) from the MCMC samples for the d-th data set, \(d=1, \ldots , D=100\). Then,

$$\begin{aligned} \mathrm{Bias}(\beta _i) = \frac{1}{D} \sum _{d=1}^D (\hat{\beta }_i^{(d)} -\beta _i), \quad \mathrm{MSE}(\beta _i) = \frac{1}{D} \sum _{d=1}^D (\hat{\beta }_i^{(d)} -\beta _i)^2 . \end{aligned}$$

Table 1 shows the bias and MSE of the estimates of \({\varvec{\beta }}\) for the three simulated models. Results from the three fitted models are comparable. When the data are generated from the AR(2) process, the AR(2) fitted model gives the best results, which is not surprising, but the other fitted models, including our nonparametric approach, provide comparable results as well. When the data are generated from ARMA(1, 1) or ARFIMA(0, v, 0), the AR(2) and AR(1) fitted models (misspecified models) present levels of bias similar to that of the proposed nonparametric approach.

Table 1 Bias and MSE for regression coefficients (\({\varvec{\beta }}\)) for three simulated models from 100 repeated datasets

We evaluate estimation of the dependence structure of the data through estimation of the spectral density. Figure 1 shows the estimated spectral density together with the true spectral density for the three simulated models using the three fitted models. Figure 1a–c correspond to data from an AR(2) model. The estimated spectral density using the AR(2) fitted model is better (Fig. 1a) than the other two (Fig. 1b, c), which is expected. However, the estimated spectral density from the proposed approach, which does not assume any parametric model for the error, also captures the shape of the true spectral density. The same is observed in the other two simulated model cases, where the AR(2) and AR(1) fitted models are misspecified. The AR(2) and AR(1) fitted models could not capture the downward shape of the true spectral density at low frequencies for the ARMA(1, 1) simulated model (Fig. 1d, e). On the other hand, the estimated spectral density from the proposed approach was able to capture the shape of the true spectral density (Fig. 1f). A similar story holds for the ARFIMA(0, v, 0) simulated model. From these results, we see that the proposed nonparametric approach is able to fit various shapes of a spectral density, in contrast to the parametric error models.

Fig. 1
figure 1

Comparison between true \(\lambda \) and estimated \(\lambda \) for three simulated models. The first column is from AR(2) parametric model to fit the data. The second column is from AR(1) parametric model to fit the data. The third column is from the proposed nonparametric approach to fit the data. Solid black curve is the true \(\lambda \), dashed red curve is the estimated \(\lambda \) and dotted blue curves are 95% credible bands

To compare how well the models fit the data, we compute marginal log likelihoods. As we consider 100 simulated data sets, we calculate 100 marginal log likelihoods for each model. Since this is a simulation study, we can also calculate the true marginal log likelihoods by plugging in the true parameter values. Table 2 shows the median and average of the marginal log likelihoods. One might expect the true marginal log likelihood, evaluated at the true parameter values, to be higher than the others. However, this is not necessarily true due to the randomness in simulating the data; indeed, the table shows no clear pattern. For example, all approaches give larger marginal log likelihoods than the true marginal log likelihood for the AR(2) simulated model. For the AR(2) and ARMA(1, 1) simulated models, the marginal log likelihoods of the fitted models are comparable. On the other hand, for the ARFIMA(0, v, 0) simulated model, the nonparametric approach shows lower marginal log likelihood values than the other fitted models. Note that the covariance function of ARFIMA(0, v, 0) decays more slowly as the lag increases, which corresponds to a faster rate of increase of the spectral density at low frequencies. Although the nonparametric approach was able to capture the shape of the spectral density well at low frequencies, one may suspect that the marginal log likelihood does not fully reflect this part (the covariance at larger lags), as it considers the data only up to the given finite lags.

Table 2 Median and average of marginal log likelihoods from 100 repeated datasets

Figure 2 shows the estimated prediction error for forecasting \(y_f\) for \(f=n+k\) with \(n=200\) and \(k=1, \ldots , 5\). The figure shows that the proposed nonparametric approach tends to have smaller prediction error than the two parametric fitted models for all three simulated models.

Fig. 2
figure 2

Prediction error for \(y_{f}\), \(f=n+k\) with \(n=200\) and \(k=1, \cdots , 5\) for three simulated models from 100 repeated data sets

We also calculate log predictive likelihoods (Geweke and Amisano 2008). As we consider forecasting up to 5 time points ahead and 100 simulated data sets for each model, we report in Table 3 the average of the log predictive likelihoods over the 5 forecasting time points and the 100 simulated data sets. All fitted models are comparable in terms of log predictive likelihoods, with our nonparametric approach giving slightly higher values than the other fitted models.

Table 3 Averaged log predictive likelihoods at 5 forecasting time points for 100 simulated data

5 Real data examples

As an application of the proposed methodology, we consider two real data examples.

5.1 Forward premium regression

The forward premium regression (Fama 1984) considers the following regression model:

$$\begin{aligned} s_{t}-s_{t-1}=\beta _{0}+\beta _{1}\left( f_{t-1}-s_{t-1}\right) +u_{t}, \end{aligned}$$
(12)

where \(s_{t}\) is the log of the monthly spot exchange rate at time t, quoted as the foreign price of domestic currency, and \(f_{t}\) is the log of the corresponding 30-day forward rate at time t. This model is widely used in the international finance literature to test whether or not the celebrated uncovered interest parity (UIP) condition holds. Under UIP, we have \(\beta _0=0\) and \(\beta _1=1\), which can be tested within the regression framework. In addition, (12) can be used for exchange rate forecasting. Based on the methodology developed in this article, we estimate (12) and use the result for forecasting.

To that end, we use the spot and 1-month forward exchange rate data for the British Pound (GBP), following Baillie and Kim (2015). The numeraire currency is the US Dollar (USD). Monthly data from December 1988 to October 2010 are employed in this study. Figure 3 shows \(s_{t}-s_{t-1}\) (black solid line) versus \(f_{t-1}-s_{t-1}\) (red dotted line).

Fig. 3
figure 3

The black solid line is \(s_{t}-s_{t-1}\) and the red dotted line is \(f_{t-1} - s_{t-1}\) for GBP/USD

The forward premium regression model is used to test \(H_0: \beta _0=0,~ \beta _1=1\). While researchers in international finance are often interested only in testing the simple hypothesis \(H_0: \beta _1=1\), testing \(H_0: \beta _0=0\) is also of great interest because the result can be used to determine the existence of a risk premium. Moreover, exchange rate forecasting based on (12) is of interest to many empirical researchers, because more precise prediction of exchange rates is a pressing need in international finance.

For the Bayesian MCMC implementation, we ran 3 Gibbs chains with 10,000 iterations and retained the last 1000 iterations from each chain for posterior analysis after checking convergence. Gibbs updates for the AR(2) and AR(1) parametric models and the proposed nonparametric approach were implemented to compare the results.

For the forecasting purpose, we employ a pseudo out-of-sample forecasting experiment. We consider one-step-ahead forecasting for the last 50 time points. If the original data period is \(n_0\), we use the data for the first \(n_0 - 50\) time points to forecast the response at \((n_0 - 50) + 1\). Then, we extend the data by adding one more time point to forecast the response at \((n_0-50) +2\), and so on. We repeat this procedure until we reach the final time point, running the Gibbs sampler anew for each one-step-ahead forecast, as sketched below.
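Schematically, the experiment is an expanding-window loop (run_gibbs and forecast_one_step are hypothetical stand-ins for the fitting and forecasting steps of Sect. 3):

```python
import numpy as np

def rolling_forecasts(y, X, run_gibbs, forecast_one_step, holdout=50):
    """Pseudo out-of-sample experiment: refit on an expanding window and
    forecast one step ahead for each of the last `holdout` time points."""
    n0 = len(y)
    preds = np.empty(holdout)
    for j in range(holdout):
        m = n0 - holdout + j                  # size of the training window
        fit = run_gibbs(y[:m], X[:m])         # rerun the Gibbs sampler
        preds[j] = forecast_one_step(fit, X[m])
    return preds
```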

Table 4 Posterior estimates of regression coefficients (\({\varvec{\beta }}\)). Posterior mean (mean), posterior standard deviation (sd) and 95% credible interval (95% CI) are reported

For the parameter estimates, we report the results from the fitted models in Table 4, which are used for the last one-step-ahead forecast. We also fit the model by ordinary least squares (OLS). The OLS estimates and the corresponding standard errors are \(\hat{\beta }_0 = 0.0003, se(\hat{\beta }_0) =0.0024\) and \(\hat{\beta }_1 = 0.1989, se(\hat{\beta }_1) =0.9872\), respectively. Parameter estimates for the regression coefficients are similar among the three models, as are the results from the OLS method. As shown in Table 4, all models appear to support the UIP hypothesis by not rejecting \(H_0: \beta _0=0,~ \beta _1=1\) given the 95% credible intervals, which is also the case for the OLS method. As a result, we fail to reject the UIP hypothesis for this particular data set (GBP/USD during Dec 1988–Oct 2010).

To compare which model fits the data better, we compute marginal log likelihoods. As we run the models 50 times by extending the time horizon, we obtain 50 marginal log likelihoods for each model. We report summary statistics comparing the three fitted models in Table 5. As we see from the table, the marginal log likelihood values for our approach are larger than those from AR(2) or AR(1); indeed, they are the largest in all 50 cases.

Table 5 Summary statistics of marginal log likelihoods obtained by fitting each model

Now we turn to forecasting. Forecasting exchange rates is important because a precise forecast of the rate is valuable in boosting one's portfolio return. For this purpose, we compare the root mean square error (RMSE), obtained from the forecast values and observed values for the last 50 time points. Table 6 shows that our approach has a smaller RMSE than the AR(2) and AR(1) models, by 2.1 and 2.6%, respectively. The log predictive likelihoods over the 50 forecasting time points for AR(2), AR(1) and the nonparametric approach are 101.4490, 101.4769 and 101.5894, respectively. Although the differences are marginal, our proposed approach is slightly better than the other two fitted models.

The empirical results indicate that our approach performs better than the AR(2) and AR(1) models. However, there could be models other than AR(2) and AR(1) that yield better forecasts than the models we have used in this study.

Table 6 Root mean square error between forecast values and the observed values for the last 50 time points

5.2 House price index

In this section, we analyze the quarterly Case–Shiller House Price Index (CSI), a benchmark measure of single-family house prices in the United States. Since prices move with the affordability of dwellers, and affordability is directly linked to family income, inflation, and interest rates, real personal income (RPI), the prime rate (PR), and the consumer price index (CPI) serve as natural choices of covariates. However, it is well known that, aside from these fundamental economic factors, market sentiment plays a big role (particularly in illiquid markets like housing) and causes unexplained price movements and bubbles. While the regression component explains the contribution of the fundamental factors, modeling the error autocorrelation may help in understanding the market sentiment.

Housing-related data were gathered from different sources: U.S. national level CSI data were collected from http://us.spindices.com/index-family/real-estate/sp-case-shiller, RPI data from the Federal Reserve Bank of St. Louis (http://research.stlouisfed.org/fred2/), and PR and CPI data from the Federal Housing Finance Agency (http://www.fhfa.gov/). The CSI is available quarterly, whereas the other variables are available on different time scales. For simplicity, we put all variables on a quarterly time scale by averaging or interpolating them as needed. We collected data for the years 1987–2011, used the data through the fourth quarter of 2010 for modeling, and used the remaining four quarters for comparing forecasts.

For data analysis, we considered the following model of quarterly changes in the logs of these indices:

$$\begin{aligned}&\log \left( \frac{CSI_t}{CSI_{t-1}}\right) \nonumber \\&\quad = \beta _1 \log \left( \frac{RPI_{t-1}}{RPI_{t-2}}\right) +\beta _2 \log \left( \frac{CPI_{t-1}}{CPI_{t-2}}\right) + \beta _3 \log \left( \frac{PR_{t-1}}{PR_{t-2}}\right) + \epsilon _t. \end{aligned}$$
(13)

There is a concern that \(\log \left( \frac{CSI_t}{CSI_{t-1}}\right) \) may not be well modeled by a stationary process. Since our model assumes stationarity of the error process, we check for a possible unit root in our data set using a standard Dickey–Fuller test (Dickey and Fuller 1979) and a Phillips–Perron test (Perron 1988). Specifically, we test for a unit root in the error term by testing the residuals obtained from the least squares fit. The p values (0.04417 for the Dickey–Fuller test and 0.04459 for the Phillips–Perron test) suggest that the residuals do not follow a unit root process. Thus, we apply our methodology to the model given in (13) with this data set.
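The residual-based check can be reproduced along the following lines (a sketch using the augmented Dickey–Fuller test in statsmodels; Phillips–Perron tests are available, e.g., in the arch package):

```python
import statsmodels.api as sm
from statsmodels.tsa.stattools import adfuller

def unit_root_check(y, X):
    """Fit (13) by least squares and test the residuals for a unit root."""
    resid = sm.OLS(y, X).fit().resid
    stat, pvalue, *_ = adfuller(resid)
    return stat, pvalue          # a small p-value rejects a unit root
```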

For the Bayesian MCMC implementation, we again ran 3 Gibbs chains with 10,000 iterations and retained the last 1000 iterations from each chain for posterior analysis after checking convergence. Gibbs updates for the AR(2) and AR(1) parametric models and the proposed nonparametric approach were implemented to compare the results.

Table 7 Posterior estimates of regression coefficients (\({\varvec{\beta }}\)) for the real data. The values in parentheses are posterior standard deviations

Everyday economics tells us that an increase in income and inflation will fuel house prices, while an increase in interest rates will shrink house prices. Thus, we expect \(\beta _1\) and \(\beta _2\) to be positive, and \(\beta _3\) to be negative. Table 7 shows the estimates of the regression coefficients for the three covariates. Although none of the estimates is statistically significant given the size of the posterior standard deviations, we compare the sign of each estimate. All three models estimate \(\beta _3\) to be negative. However, only the proposed nonparametric approach produces a positive estimate of \(\beta _1\). Surprisingly, all three models produce a negative estimate of \(\beta _2\), which is counterintuitive at first. However, a closer look reveals that inflation has already been accounted for by using real income instead of nominal income, which can lead to a negligibly small estimate of \(\beta _2\).

Table 8 shows the forecasting results for the real data. Similar to the simulation study, the estimated prediction error from our approach tends to be smaller than that of the two parametric approaches (AR(2) and AR(1)). Unfortunately, the predicted values are not that close to the observed values for any of the three fitted models, but the values from our approach are closer to the observed values than the results from the other two approaches.

Table 8 Prediction of the last four quarters for the real data

6 Discussion

We have proposed a nonparametric model for the autocovariance function in a time series regression model within a Bayesian framework. We achieved this by introducing a Gaussian process prior on the logarithm of the spectral density in the frequency domain. Using the Whittle likelihood led to a manageable posterior distribution, from which posterior samples were collected via an MCMC method for Bayesian inference.

Simulation results confirmed that the proposed methodology successfully captures the shape of the dependence structure, while parametric models may fail to do so when misspecified. In the real data example using the spot and 1-month forward exchange rate data for the British Pound/US Dollar, our approach showed better performance in fitting and forecasting than the parametric models considered. For the real data example using the Case–Shiller House Price Index, the proposed methodology, unlike the other parametric models, captured the expected sign of the regression parameter for real personal income (RPI).

The computational advantage of the proposed approach over parametric approaches is most relevant for large data sets: a typical parametric model requires inversion of the autocovariance matrix, which is often computationally expensive, whereas our approach avoids direct matrix inversion through the spectral decomposition (11) and is hence more efficient. A natural extension of this research would be to relax the stationarity assumption on the error structure to cover non-stationary autocovariance functions, which would allow modeling of more complex data sets.

The time series regression model we have considered does not account for heteroskedasticity, which is often of interest in economic time series modeling. Classical time-varying volatility models that accommodate heteroskedasticity, such as ARCH and GARCH, are stationary under some conditions on the parameter values and are unconditionally uncorrelated although not independent. For uncorrelated errors the corresponding spectral density is constant, so estimating the unconditional variability of classical ARCH or GARCH type models may not be of interest here. On the other hand, one can incorporate time-varying volatility within the model by assuming a time-varying parameter (for example, see Kim and Kim (2016)). That is, we can consider \(\epsilon _t= \sigma _t \cdot \epsilon ^{0}_t\), where \(\epsilon ^{0}_t\) is a weakly stationary process and \(\sigma _t\) is a time-varying parameter. In such a setting, we could still apply our approach by assuming that the log spectral density of the stationary error \(\epsilon ^{0}_t\) has a Gaussian process prior and that the time-varying parameter \(\sigma _t\) also has a Gaussian process prior. Under the Bayesian framework, we can obtain the posterior distributions, but this will require a Metropolis–Hastings sampler instead of a Gibbs sampler.