
In this chapter, statistical inference for nonstationary processes is discussed. For long-memory, or, more generally, fractional stochastic processes this is of particular interest because long-range dependence often generates sample paths that mimic certain features of nonstationarity. It is therefore often not easy to distinguish between stationary long-memory behaviour and nonstationary structures. For statistical inference, including estimation, testing and forecasting, the distinction between stationary and nonstationary, as well as between stochastic and deterministic components, is essential.

The most obvious type of nonstationarity in time series is a deterministic trend. Related to that is the issue of parametric and nonparametric regression. Both topics will be addressed (Sects. 7.1, 7.2, 7.4, 7.5, 7.7). A common feature is that there is a distinct difference between fixed and random design regression. For most fixed designs, long memory influences the rate of convergence of parametric and nonparametric regression estimators. In contrast, random design often removes the effect of strong dependence. The issue is, however, more complex, and will be discussed in detail.

Standard techniques in nonparametric regression are kernel and local polynomial smoothing. The main question one has to address is the choice of a suitable bandwidth. In the context of fractional processes with an unknown long-memory parameter d∈(−1/2,1/2), this is a formidable task. The optimal bandwidth depends on the unknown long-memory parameter d. At the same time, using an inappropriate bandwidth leads to biased estimates of d. To complicate the matter, the possibility of nonstationarity due to integration (i.e. random walk type behaviour) cannot be excluded a priori, and may be masked by antipersistent dependence. Nevertheless, it is possible to design data driven algorithms for asymptotically optimal bandwidth selection and simultaneous estimation of dependence parameters as well as identification of random walk type structures (see Sect. 7.4.5.1). Extensions to nonlinear processes with trends are considered briefly in Sect. 7.4.10. As an alternative to kernel and local polynomial smoothing, trend estimation based on wavelets and the issue of optimal selection of the number of resolution levels is discussed in Sect. 7.5. Furthermore, a semiparametric regression model, also known as partial linear regression, is considered in Sect. 7.7.

Another important class of nonstationary models can be subsumed under the notion of local stationarity, in the sense that certain parameters change as a function of time. Quantile estimation along this line is discussed in Sect. 7.6. Local FARIMA type estimation is considered in Sect. 7.8.

The chapter concludes with a section on change point detection (Sect. 7.9). This is an important issue in the long-memory context because occasional structural changes often generate sample paths that resemble stationary processes with long-range dependence. A typical example is a model with occasional shifts in the mean. Various methods have been developed in the literature for distinguishing between structural changes and long-range dependence. We discuss a selection of typical methods.

7.1 Parametric Linear Fixed-Design Regression

In this section, we discuss estimation in fixed design linear regression with residuals exhibiting long memory. The least squares estimator (LSE) is compared with the BLUE. It turns out that under long memory (as well as under antipersistence) the LSE usually loses efficiency compared to the BLUE. This is in contrast to the case of weak dependence studied in Grenander (1954) and Grenander and Rosenblatt (1957). The concrete asymptotic results, however, depend on the combination of long-memory properties of the residuals and the type of regression functions (Yajima 1988, 1991). A practical problem with the BLUE is that the weights depend on the unknown autocovariance function of the residual process. For certain situations, Dahlhaus (1995) designed explicit weights that eliminate this problem. The asymptotic results for the LSE can be extended to robust estimation (see Giraitis et al. 1996a which is an extension of Beran 1991 to the regression context). Finally, we briefly discuss the question of optimal design in the linear (fixed-design) regression context.

7.1.1 Asymptotic Distribution of the LSE

We consider linear regression of the form

$$ Y_{t}=\sum_{j=1}^{p} \beta_{j}x_{tj}+e_{t}\quad(t=1,2,\dots,n) $$
(7.1)

where

$$ e_{t}=\sum_{j=0}^{\infty}a_{j} \varepsilon_{t-j} $$
(7.2)

is a linear process with ε t i.i.d., E(ε t )=0, \(\operatorname{var}(\varepsilon_{t})=\sigma_{\varepsilon}^{2}<\infty\) and \(a_{j}=c_{a}j^{d-1}\) (\(0<d<\frac{1}{2}\)). The following notation will be used:

$$y(n)=\left ( \begin{array} [c]{c}Y_{1}\\ \vdots\\ Y_{n}\end{array} \right ) ,\qquad e(n)=\left ( \begin{array} [c]{c}e_{1}\\ \vdots\\ e_{n}\end{array} \right ) ,\qquad x_{\cdot j}(n)=\left ( \begin{array} [c]{c}x_{1j}\\ \vdots\\ x_{nj}\end{array} \right ) ,\qquad x_{t\cdot}=\left ( \begin{array} [c]{c}x_{t1}\\ \vdots\\ x_{tp}\end{array} \right ) $$

and

$$\underset{n\times p}{X}= \bigl[ x_{\cdot1}(n),\dots,x_{\cdot p}(n) \bigr] =\left [ \begin{array} [c]{c}x_{1\cdot}^{T}\\ \vdots\\ x_{n\cdot}^{T}\end{array} \right ] . $$

Then

$$ y(n)=X\beta+e(n). $$
(7.3)

The least squares estimator of β is equal to

$$ \hat{\beta}_{\mathrm{LSE}}= \bigl( X^{T}X \bigr)^{-1}X^{T}y(n) $$
(7.4)

so that

$$\hat{\beta}_{\mathrm{LSE}}-\beta= \bigl( X^{T}X \bigr)^{-1}X^{T}e(n)= \bigl( X^{T}X \bigr)^{-1}\left ( \begin{array} [c]{c}x_{\cdot1}^{T}e(n)\\ \vdots\\ x_{\cdot p}^{T}e(n) \end{array} \right ) . $$

More generally, for a weighted least squares estimator with weights q j (j=1,2,…,n) we have

$$ \hat{\beta}= \bigl( X^{T}QX \bigr)^{-1}X^{T}Qy(n) $$
(7.5)

and

$$ \hat{\beta}-\beta= \bigl( X^{T}QX \bigr)^{-1}X^{T}Qe(n)= \bigl( X^{T}QX \bigr)^{-1}\left ( \begin{array} [c]{c}x_{\cdot1}^{T}Qe(n)\\ \vdots\\ x_{\cdot p}^{T}Qe(n) \end{array} \right ) $$
(7.6)

where the n×n matrix Q is given by \(Q=\operatorname{diag} ( q_{1},\dots,q_{n} ) \). The covariance matrix of \(\hat{\beta}\) is equal to

$$\varSigma_{\hat{\beta}}=\operatorname{var} ( \hat{\beta} ) = \bigl( X^{T}QX \bigr)^{-1}X^{T}Q\varSigma_{e}Q^{T}X \bigl( X^{T}QX \bigr)^{-1}$$

where Σ e =[cov(e i ,e j )] is the covariance matrix of e(n). In particular, the best linear unbiased estimator (BLUE) is given by

$$ \hat{\beta}_{\mathrm{BLUE}}= \bigl( X^{T}\varSigma_{e}^{-1}X \bigr)^{-1}X^{T}\varSigma_{e}^{-1}y(n) $$
(7.7)

and its covariance matrix is equal to

$$\varSigma_{\hat{\beta}}=\operatorname{var} ( \hat{\beta} ) = \bigl( X^{T} \varSigma_{e}^{-1}X \bigr)^{-1}. $$
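The estimators (7.4) and (7.7) are straightforward to compute numerically once the residual autocovariances are known (or estimated). The following is a minimal numerical sketch, not taken from the text; the function names are illustrative, and the FARIMA(0,d,0) autocovariance recursion is used only as a convenient example of a long-memory residual model.

```python
# Minimal sketch (illustrative, not from the text): LSE (7.4) and BLUE (7.7)
# for y = X beta + e, with Sigma_e built from a given autocovariance sequence.
import numpy as np
from scipy.linalg import toeplitz
from scipy.special import gamma as G

def lse(X, y):
    # ordinary least squares: (X'X)^{-1} X'y
    return np.linalg.solve(X.T @ X, X.T @ y)

def blue(X, y, gamma_e):
    # generalized least squares with Sigma_e = Toeplitz(gamma_e):
    # (X' Sigma^{-1} X)^{-1} X' Sigma^{-1} y
    Sigma = toeplitz(gamma_e)
    SiX = np.linalg.solve(Sigma, X)
    Siy = np.linalg.solve(Sigma, y)
    return np.linalg.solve(X.T @ SiX, X.T @ Siy)

def farima0d0_acv(n, d, sigma2=1.0):
    # autocovariances of FARIMA(0,d,0): gamma(0) = sigma2*Gamma(1-2d)/Gamma(1-d)^2,
    # gamma(k) = gamma(k-1)*(k-1+d)/(k-d)
    g = np.empty(n)
    g[0] = sigma2 * G(1 - 2 * d) / G(1 - d) ** 2
    for k in range(1, n):
        g[k] = g[k - 1] * (k - 1 + d) / (k - d)
    return g

# Example: gamma = farima0d0_acv(n, 0.3); compare blue(X, y, gamma) with lse(X, y).
```

The exact BLUE covariance matrix \(( X^{T}\varSigma_{e}^{-1}X )^{-1}\) can be obtained from the same quantities, which is useful for checking the asymptotic formulas discussed below.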

To obtain a nondegenerate limit theorem for \(\hat{\beta}\) defined in (7.5), we need to standardize the estimator by a matrix that takes into account that \(\operatorname{var} ( \hat{\beta} ) \) depends on the design matrix X, the matrix Q and on the covariance matrix Σ e of the residuals. The first issue is taken into account by the normalizing diagonal p×p matrix

$$D_{n}=\operatorname{diag} \bigl( X^{\prime}X \bigr) =\left ( \begin{array} {c@{\quad}c@{\quad}c} \Vert x_{\cdot1}\Vert ^{2} & \cdots & 0\\ \vdots & \ddots & \vdots\\ 0 & \cdots & \Vert x_{\cdot p}\Vert ^{2}\end{array} \right ) $$

where for \(a\in\mathbb{R}^{p}\), \(\Vert a\Vert =\sqrt{a_{1}^{2}+\cdots+a_{p}^{2}}\) denotes the Euclidean norm. Then we can write

$$D_{n}^{\frac{1}{2}} ( \hat{\beta}-\beta ) =C_{n}^{-1}D_{n}^{-\frac{1}{2}}X^{T}Qe(n)\quad\text{with } C_{n}=D_{n}^{-\frac{1}{2}}X^{T}QXD_{n}^{-\frac{1}{2}}. $$

For most deterministic design matrices X and weights q j (i.e. Q), C n converges to a nondegenerate p×p matrix C so that

$$D_{n}^{\frac{1}{2}}\varSigma_{\hat{\beta}}D_n^{\frac{1}{2}}\approx C^{-1} \bigl( D_{n}^{-\frac{1}{2}}X^{T}Q \varSigma_{e}Q^{T}XD_n^{-\frac{1}{2}} \bigr) C^{-1}$$

and

$$D_{n}^{\frac{1}{2}} ( \hat{\beta}-\beta ) \approx C^{-1}D_{n}^{-\frac{1}{2}}X^{T}Qe(n)=C^{-1}W_{n}e(n). $$

Thus it is sufficient to study the asymptotic behaviour of W n e(n). If the elements of

$$W_{n}=D_{n}^{-\frac{1}{2}}X^{T}Q= [ w_{j,n} ]_{j=1,\ldots,p;\ i=1,\ldots,n}$$

can be written as a function of i/n, then this amounts to studying the joint distribution of weighted sums

$$Z_{n,j}=\sum_{i=1}^{n}w_{j,n} \biggl( \frac{i}{n} \biggr) e_{i}\quad(j=1,\ldots,p). $$

If, in addition,

$$w_{j,n}(u)\approx n^{-\kappa}w_{j}(u) $$

for fixed weight functions w j and a suitable power n κ, then results from Pipiras and Taqqu (2000c) can be used to obtain

$$n^{\kappa-H}D_{n}^{\frac{1}{2}} ( \hat{\beta}-\beta ) \underset {d}{\rightarrow}Z=C^{-1}\tilde{Z}$$

where \(H=d+\frac{1}{2}\) and

$$\tilde{Z}=\int_{0}^{1}w(u)\,dB_{H}(u)= \left ( \begin{array} [c]{c}\int_{0}^{1}w_{1}(u)\,dB_{H}(u)\\ \vdots\\ \int_{0}^{1}w_{p}(u)\,dB_{H}(u) \end{array} \right ) . $$

The vector Z is normally distributed with zero mean and covariance matrix \(\operatorname{var}(Z)=C^{-1}VC^{-1}\) where the elements of V=(v ij ) i,j=1,…,p are the covariances

$$ v_{ij}=\operatorname{cov} ( \tilde{Z}_{i},\tilde{Z}_{j} ) =\operatorname{cov} \biggl( \int_{0}^{1}w_{i}(u)\,dB_{H}(u),\int_{0}^{1}w_{j}(u)\,dB_{H}(u) \biggr) . $$
(7.8)

In terms of fractional integrals (see Sect. 7.3) this can also be written as

$$ v_{ij}= \biggl( \frac{\varGamma(d+1)}{c_{1}} \biggr)^{2}\int _{-\infty}^{\infty } \bigl( I_{-}^{d}w_{i} \bigr) (s) \bigl( I_{-}^{d}w_{j} \bigr) (s)\,ds $$
(7.9)

where

$$\bigl( I_{-}^{d}w_{j} \bigr) (s)= \frac{1}{\varGamma(d)}\int_{0}^{1}w_{j}(u) ( u-s )_{+}^{d-1}\,du $$

for 0≤s≤1 and zero otherwise, and c 1 is a constant that depends on d. To make sure that the v ij are all finite, certain conditions on w j must be imposed. For instance, Deo (1997) imposes the conditions that w j ∈C(0,1) and that \(x^{\alpha}(1-x)^{\alpha}w_{j}(x)\) is bounded for x∈[0,1] for some \(0<\alpha<\min ( \frac{1}{2},2d ) \).

Example 7.1

Consider a polynomial regression model of degree p defined by \(Y_{i}=\sum_{j=0}^{p}\beta_{j}i^{j}+e_{i}\). Note that, for obvious reasons, we deviate slightly from the previous notation by including j=0. Here, we have X=[x ⋅1(n),…,x ⋅,p+1(n)] with columns \(x_{\cdot j}(n)=(1^{j-1},2^{j-1},\dots,n^{j-1})^{T}\) and rows \(x_{i\cdot}(n)=(1,i,\dots,i^{p})^{T}\),

and the (p+1)×(p+1) matrix

$$D_{n}\approx \left ( \begin{array} {c@{\quad}c@{\quad}c@{\quad}c}n & 0 & \cdots & 0\\ 0 & \frac{n^{3}}{3} & & \vdots\\ \vdots & \ddots & \ddots & 0\\ 0 & \cdots & 0 & \frac{n^{2p+1}}{2p+1}\end{array} \right ) . $$

Furthermore, for the LSE (i.e. Q=I) we have \(W_{n}=D_{n}^{-\frac{1}{2}}X^{T}\). The elements of C n =(c ij ) i,j=1,…,p+1 are then given by

$$c_{ij}=\frac{ \langle x_{\cdot i},x_{\cdot j} \rangle }{\Vert x_{\cdot i}\Vert \Vert x_{\cdot j}\Vert }=\frac{\sum_{t=1}^{n}t^{i+j-2}}{\Vert x_{\cdot i}\Vert \Vert x_{\cdot j}\Vert }\rightarrow\frac{\sqrt{(2i-1)(2j-1)}}{i+j-1}$$

and

$$w_{j,n} \biggl( \frac{i}{n} \biggr) =\frac{i^{j-1}}{\Vert x_{\cdot j}\Vert }\approx n^{-\frac{1}{2}}\sqrt{2j-1} \biggl( \frac{i}{n} \biggr)^{j-1}$$

so that

$$w_{j}(u)=\sqrt{2j-1}\,u^{j-1}\quad(0\leq u\leq1). $$

Thus, we have \(\kappa=\frac{1}{2}\). Putting these results together and noting that κ−H=−d, we obtain

$$n^{-d}D_{n}^{\frac{1}{2}} ( \hat{\beta}-\beta ) \underset {d}{\rightarrow}Z=C^{-1}\tilde{Z}\sim N \bigl( 0,C^{-1}VC^{-1} \bigr) . $$

The explicit form of V is given by (Yajima 1988)

$$ v_{ij}=\frac{\sqrt{(2i-1)(2j-1)}\varGamma(1-2d)}{\varGamma(d)\varGamma(1-d)}\int_{0}^{1} \int_{0}^{1}x^{i-1}y^{j-1} \vert x-y\vert ^{2d-1}\,dy\,dx. $$
(7.10)
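The double integral in (7.10) has no simple closed form for general i, j, but it can be evaluated numerically. The sketch below is illustrative only (all names are mine); it assumes the Γ-factor of (7.10) in the form Γ(1−2d)/(Γ(d)Γ(1−d)).

```python
# Illustrative numerical evaluation of (7.10); not from the text.
import numpy as np
from scipy.special import gamma as G
from scipy.integrate import dblquad

def v_poly(i, j, d):
    c = np.sqrt((2 * i - 1) * (2 * j - 1)) * G(1 - 2 * d) / (G(d) * G(1 - d))
    # |x-y|^{2d-1} has an integrable singularity at x = y; dblquad may warn but converges
    val, _ = dblquad(lambda y, x: x ** (i - 1) * y ** (j - 1) * abs(x - y) ** (2 * d - 1),
                     0.0, 1.0, lambda x: 0.0, lambda x: 1.0)
    return c * val

# For i = j = 1 the integral equals 1/(d(2d+1)), which can be used as a check.
```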

7.1.2 The Regression Spectrum and Efficiency of the LSE

A natural question is whether the least squares estimator should be replaced by the best linear unbiased estimator (BLUE) that is optimally adapted to the covariance structure. This issue was first addressed in a systematic manner by Grenander (1954) and Grenander and Rosenblatt (1957) (also see, e.g. Priestley 1981 for a nice summary). To study the asymptotic covariance matrix of \(\hat{\beta}_{\mathrm{LSE}}\) and \(\hat{\beta}_{\mathrm{BLUE}}\) for a general class of deterministic regression functions the following conditions are imposed: Let

$$x_{\cdot j}(k)=\left ( \begin{array} [c]{c}x_{1+k,j}\\ \vdots\\ x_{n+k,j}\end{array} \right ) $$

with x i,j :=0 if i∉{1,2,…,n} and

$$\bigl\langle x_{\cdot j}(0),x_{\cdot l}(k) \bigr\rangle =\sum _{i=1}^{n}x_{ij}(0)x_{il}(k). $$

Then we assume, as n→∞,

  • (R1) \(\Vert x_{\cdot j}\Vert ^{2}\rightarrow\infty\);

  • (R2)

    $$\frac{x_{nj}^{2}}{\Vert x_{\cdot j}\Vert ^{2}}\rightarrow0; $$
  • (R3)

    $$r_{jl}^{(n)}(k)=\frac{ \langle x_{\cdot j}(0),x_{\cdot l}(k) \rangle }{\Vert x_{\cdot j}\Vert \Vert x_{\cdot l}\Vert }\rightarrow r_{jl}(k)\in\mathbb{R};$$
  • (R4) Define the p×p matrix R(k)=[r jl (k)] j,l=1,…,p . Then R(0) is nonsingular.

The first condition makes sure that x ij does not vanish too fast as time i tends to infinity. The second condition means that the last observed value x nj does not dominate all the previous ones. Condition (R3) defines a kind of cross-correlation. The last condition excludes asymptotic collinearity of the explanatory variables. From the definition of R(k) it follows that there is a (complex-valued) function M:λ↦M(λ) assigning to every frequency in [−π,π] a p×p matrix M(λ) such that

$$M(\lambda_{2})-M(\lambda_{1})\geq0 $$

for all λ 2λ 1, where “≥0” means positive semidefiniteness, and

$$R(k)=\int_{-\pi}^{\pi}e^{ik\lambda}\,dM(\lambda) $$

for all k. The so-called (regression) spectral distribution function M(⋅) plays a key role when comparing the relative asymptotic efficiency of the least squares estimator compared to the BLUE.

The matrix R(k) may be interpreted as a (noncentred) asymptotic correlation matrix for the regression functions x j . In particular, R jj (0)=∫dM jj (λ)=1. This implies a property of M that turns out to be important in the context of long-range dependence. Suppose that

$$ dM_{jj}(0)=M_{jj}(0+)-M_{jj}(0)=1. $$
(7.11)

Since dM jj (λ)≥0 and \(\vert dM_{jl}(\lambda)\vert \leq\sqrt{dM_{jj}(\lambda)\,dM_{ll}(\lambda)}\), this implies, for all j,l,

$$ dM_{jl}(\lambda)=0\quad(\lambda\neq0). $$
(7.12)

As we will see below, (7.11) causes particular difficulties under long memory.

Example 7.2

Let p=1 and x t1=x t ≡1. This means that Y t is stationary and β=μ is the expected value of Y t . Conditions (R1)–(R4) hold for obvious reasons, and r(k)=r 11(k)≡1. Hence,

$$R(k)=\int_{-\pi}^{\pi}e^{ik\lambda}\,dM(\lambda) \equiv1 $$

so that M has a point mass at the origin such that (7.11) and (7.12) hold.

Example 7.3

For polynomial regression of order k we have x tj =t j−1 (j=1,…,p; p=k+1). Then, as n→∞,

$$\Vert x_{\cdot j}\Vert ^{2}=\sum _{t=1}^{n}t^{2j-2}\sim n^{2j-1}\int_{0}^{1}u^{2j-2}\,du= \frac{n^{2j-1}}{2j-1} $$

and

$$\bigl\langle x_{\cdot j}(0),x_{\cdot l}(k) \bigr\rangle =\sum_{t=1}^{n}t^{j-1}(t+k)^{l-1}\sim\frac{n^{j+l-1}}{j+l-1},$$

so that

$$r_{jl}^{(n)}(k)\rightarrow\frac{\sqrt{(2j-1)(2l-1)}}{j+l-1}. $$

Thus, the “lag” k does not matter, i.e. for all k we have

$$r_{jl}(k)=\int_{-\pi}^{\pi}e^{ik\lambda}\,dM_{jl}( \lambda)\equiv\frac {\sqrt{(2j-1)(2l-1)}}{j+l-1}$$

which implies dM(λ)=0 (λ≠0) and

$$dM_{jl}(0)=\frac{\sqrt{(2j-1)(2l-1)}}{j+l-1}. $$

In particular,

$$dM_{jj}(0)=\frac{2j-1}{2j-1}=1 $$

so that again (7.11) and (7.12) hold.

Example 7.4

Let p=1 and x t1=cosλ 0 t for some λ 0∈(0,π). Then

$$\Vert x_{\cdot1}\Vert ^{2}\sim\frac{n}{2}$$

and

$$r(k)=\lim_{n\rightarrow\infty}\frac{ \langle x_{\cdot1}(0),x_{\cdot1}(k) \rangle }{\Vert x_{\cdot1}\Vert ^{2}}=\cos\lambda_{0}k=\frac{1}{2} \bigl( e^{i\lambda_{0}k}+e^{-i\lambda_{0}k} \bigr) . $$

Thus, \(dM(\pm\lambda_{0})=\frac{1}{2}\) and dM(λ)=0 otherwise.

Example 7.5

Let p=1 and x t =x t1=(−1)t=cosπt. Then x t x t+k =(−1)k=cosπk, ∥x ⋅12=n so that r(k)=(−1)k. This implies \(dM(\pm\pi)=\frac{1}{2}\) and dM(λ)=0 otherwise.

Example 7.6

Let p=1 and \(x_{t}=x_{t1}=t ( 1+e^{-i\lambda_{0}t} ) \) for some λ 0∈(0,π). Note that the definitions above can be extended in a natural way to complex valued x-variables, with \(\langle x_{\cdot j}(0),x_{\cdot l}(k) \rangle =\sum x_{tj}(0)\bar{x}_{tl}(k)\). Then

$$\Vert x_{\cdot1}\Vert ^{2}=2\sum t^{2} ( 1+\cos\lambda_{0}t ) \sim\frac{2}{3}n^{3}$$

and

$$\bigl\langle x_{\cdot1}(0),x_{\cdot1}(k) \bigr\rangle =\sum t(t+k) \bigl( 1+e^{-i\lambda_{0}t} \bigr) \bigl( 1+e^{i\lambda_{0}(t+k)} \bigr) \sim\frac{n^{3}}{3} \bigl( 1+e^{i\lambda_{0}k} \bigr) . $$

Hence

$$r(k)=r_{11}(k)=\frac{1}{2} \bigl( 1+e^{i\lambda_{0}k} \bigr) = \int_{-\pi }^{\pi}e^{ik\lambda}\,dM(\lambda) $$

so that

$$dM(0)=dM(\lambda_{0})=\frac{1}{2}$$

and dM(λ)=0 otherwise.

For residual processes with short-range dependence and spectral density f e , the asymptotic covariance matrix of \(\hat{\beta}_{\mathrm{LSE}}\) and \(\hat{\beta}_{\mathrm{BLUE}}\) can be expressed in terms of M and f e as follows (Grenander 1954; Grenander and Rosenblatt 1957):

Theorem 7.1

Let f e C[−π,π], \(D_{n}=\operatorname{diag} ( \Vert x_{\cdot1}\Vert ,\dots,\Vert x_{\cdot p}\Vert ) \) and assume that (R1)(R4) hold. Then, as n→∞,

$$ D_{n}\operatorname{var} ( \hat{\beta}_{\mathrm{LSE}} ) D_{n}\rightarrow2 \pi R^{-1} ( 0 ) \int_{-\pi}^{\pi}f_{e}( \lambda)\,dM ( \lambda ) R^{-1} ( 0 ) . $$
(7.13)

Theorem 7.2

Under the same assumptions as in Theorem 7.1 and, in addition, f e >0,

$$ D_{n}\operatorname{var} ( \hat{\beta}_{\mathrm{BLUE}} ) D_{n}\rightarrow \biggl[ \frac {1}{2\pi}\int_{-\pi}^{\pi} \frac{1}{f_{e}(\lambda)}\,dM ( \lambda ) \biggr]^{-1}. $$
(7.14)

Theorem 7.1 includes not only the case of short memory (with f e continuous) but also antipersistence with f e (λ)=L(λ)|λ|−2d (\(-\frac{1}{2}<d<0\)), provided that L(λ) is continuous. However, if M is such that dM(λ)=0 for all λ≠0, then ∫f e (λ) dM(λ)=0. In other words, for such explanatory variables the actual rate of convergence is faster than captured by (7.13). Theorem 7.2 does not include antipersistence because then f e (0)=0. The reason for the condition f e >0 is to avoid a pole in the integral \(\int f_{e}^{-1}\,dM\). It should be noted, however, that the conditions as stated here are sufficient but not necessary. For instance, piecewise continuous spectral densities f e may be considered or even cases where f e (0)=0 provided that dM is zero in a neighbourhood of the origin. Long memory is, however, not included in either of the two theorems (or possible simple modifications) because f e has a pole. This causes difficulties with some of the integrals. A partial extension of the results was obtained by Yajima (1991). The main problem caused by the pole of f e at the origin occurs when dM(0)>0. The reason is that then ∫f e (λ) dM(λ) is infinite. Moreover, if dM(λ)=0 outside the origin, then \(\int f_{e}^{-1}(\lambda)\,dM(\lambda)=0\) so that we would divide by zero in (7.14).

Two cases have to be distinguished when considering long memory, namely

$$ M_{jj}(0+)-M_{jj}(0)=0\quad\text{(case 1)} $$
(7.15)

and

$$ M_{jj}(0+)-M_{jj}(0)>0\quad\text{(case 2)}. $$
(7.16)

For the second case, a more refined distinction will have to be made, namely

$$ 0<M_{jj}(0+)-M_{jj}(0)<1\quad\text{(case 2a)} $$
(7.17)

and

$$ M_{jj}(0+)-M_{jj}(0)=1\quad \text{(case 2b).} $$
(7.18)

First, we state the result for case 1. Since M does not have any mass at zero, the pole of f e does not disturb, i.e. there is no “interference” between long memory and the regression function.

Theorem 7.3

Let \(f_{e}(\lambda)=L(\lambda)\vert 1-e^{-i\lambda}\vert ^{-2d}\) \((0<d<\frac{1}{2})\), L∈C[−π,π], and suppose that (7.15) holds for all j=1,…,p. Moreover, for j,l=1,…,p define

$$m_{jl}^{(n)}(\lambda)=\frac{1}{2\pi\Vert x_{\cdot j}\Vert \Vert x_{\cdot l}\Vert } \Biggl( \sum_{t=1}^{n}x_{tj}e^{it\lambda} \Biggr) \overline{ \Biggl( \sum_{s=1}^{n}x_{sl}e^{is\lambda} \Biggr) },\qquad M_{jl}^{(n)}(\lambda)=\int_{-\pi}^{\lambda}m_{jl}^{(n)}(\mu)\,d\mu. $$

Then, under (R1)–(R4),

$$ D_{n}\operatorname{var} ( \hat{\beta}_{\mathrm{LSE}} ) D_{n}\rightarrow2 \pi R^{-1} ( 0 ) \int_{-\pi}^{\pi}f_{e}( \lambda)\,dM ( \lambda ) R^{-1} ( 0 ) $$
(7.19)

if and only if for all δ>0 there exists a finite constant c>0 and \(n_{0}\in\mathbb{N}\) such that

$$ \int_{-c}^{c}f_{e}( \lambda)\,dM_{jj}^{(n)}(\lambda)<\delta $$
(7.20)

for all j=1,…,p and nn 0.

Proof

Suppose first that (7.19) holds. For the left-hand side of (7.19), we have

$$D_{n}\operatorname{var} ( \hat{\beta}_{\mathrm{LSE}} ) D_{n}= \bigl( D_{n}^{-1}X^{T}XD_{n}^{-1} \bigr)^{-1} \bigl( D_{n}^{-1}X^{T} \varSigma XD_{n}^{-1} \bigr) \bigl( D_{n}^{-1}X^{T}XD_{n}^{-1} \bigr)^{-1}. $$

Due to (R3), \(D_{n}^{-1}X^{T}XD_{n}^{-1}\) converges to R(0). Hence (7.19) and the definition of M (n) imply

$$ D_{n}^{-1}X^{T}\varSigma XD_{n}^{-1}=2 \pi\int_{-\pi}^{\pi}f_{e}(\lambda )\,dM^{(n)}(\lambda)\rightarrow2\pi\int_{-\pi}^{\pi}f_{e}( \lambda)\,dM ( \lambda ) . $$
(7.21)

Since M jj (0+)−M jj (0)=0, there exists a c>0 such that \(\int_{-c}^{c}f_{e}(\lambda)\,dM_{jj}(\lambda)<\delta\) for all j. Moreover, M (n) converges weakly to M and f e is continuous on {|λ|≥c} so that

$$ \int_{\vert \lambda \vert \geq c}f_{e}(\lambda)\,dM^{(n)}(\lambda)\rightarrow\int_{\vert \lambda \vert \geq c}f_{e}( \lambda)\,dM(\lambda). $$
(7.22)

Since, by (7.21), \(\int_{-\pi}^{\pi}f_{e}(\lambda)\,dM^{(n)}(\lambda)\) also converges to \(\int_{-\pi}^{\pi}f_{e}(\lambda)\,dM(\lambda)\), (7.20) follows for n large enough.

Suppose now that (7.20) holds. Again, by the same argument, (7.22) holds. Together with (7.20), this implies that \(\int_{-\pi}^{\pi}f_{e}(\lambda)\,dM^{(n)}(\lambda)\) converges to \(\int_{-\pi}^{\pi}f_{e}(\lambda)\,dM(\lambda)\), i.e. (7.19) holds. □

Condition (7.20) holds, for instance, if dM(λ)=0 in an open neighbourhood of the origin.

In case 2, components where (7.16) holds have to be standardized by a larger power of n as follows.

Theorem 7.4

Let f e be as in Theorem 7.3, c f =L(0)>0 and M such that (7.16) and (7.20) hold for j=1,…,p. Define the p×p matrix \(V^{\ast}= [ v_{jl}^{\ast} ]_{j,l=1,\ldots,p}\) with the elements

$$v_{jl}^{\ast}=c_{f}\lim_{n\rightarrow\infty}n^{-2d} \int_{-\pi}^{\pi}\bigl \vert 1-e^{-i\lambda}\bigr \vert ^{-2d}\,dM_{jl}^{(n)}(\lambda) $$

and assume that all \(v_{jl}^{\ast}\) are finite. Then

$$ n^{-2d}D_{n}\operatorname{var} ( \hat{\beta}_{\mathrm{LSE}} ) D_{n}\rightarrow V_{\mathrm{LSE}}=2\pi R^{-1} ( 0 ) V^{\ast}R^{-1} ( 0 ) . $$
(7.23)

Proof

First, note that, by setting

$$\tilde{D}_{n}=\operatorname{diag} \bigl( \Vert x_{\cdot1}\Vert n^{d},\dots,\Vert x_{\cdot p}\Vert n^{d} \bigr) =n^{d}D_{n},$$

we have

$$n^{-2d}D_{n}\operatorname{var} ( \hat{\beta}_{\mathrm{LSE}} ) D_{n}= \bigl( D_{n}^{-1}X^{T}XD_{n}^{-1} \bigr)^{-1} \bigl( \tilde{D}_{n}^{-1}X^{T}\varSigma X\tilde{D}_{n}^{-1} \bigr) \bigl( D_{n}^{-1}X^{T}XD_{n}^{-1} \bigr)^{-1}$$

and, by (R3), \(D_{n}^{-1}X^{T}XD_{n}^{-1}\rightarrow R(0)\). Thus, we may consider

$$\tilde{D}_{n}^{-1} \bigl( X^{T}X \bigr) \operatorname{var} ( \hat{\beta}_{\mathrm{LSE}} ) \bigl( X^{T}X \bigr) \tilde{D}_{n}^{-1}= \tilde{D}_{n}^{-1}X^{T}\varSigma X \tilde{D}_{n}^{-1}. $$

Now

$$\bigl( \tilde{D}_{n}^{-1}X^{T}\varSigma X\tilde{D}_{n}^{-1} \bigr)_{jl}=n^{-2d} \bigl( D_{n}^{-1}X^{T}\varSigma XD_{n}^{-1} \bigr)_{jl}=2\pi n^{-2d}\int_{-\pi}^{\pi}f_{e}(\lambda)\,dM_{jl}^{(n)}(\lambda) $$

by definition of \(M_{jl}^{(n)}(\lambda)\) and \(m_{jl}^{(n)}(\lambda)\). For j≥k+1 the result follows as in the previous theorem. Moreover, since f e is continuous for |λ|≥c and M (n)→M weakly, we have

$$\int_{\vert \lambda \vert \geq c}f_{e}(\lambda)\,dM_{jl}^{(n)}(\lambda)\rightarrow\int_{\vert \lambda \vert \geq c}f_{e}( \lambda)\,dM_{jl}(\lambda)<\infty. $$

The only integral we need to take care of is \(\int_{-c}^{c}\) \(f_{e}(\lambda)\,dM_{jl}^{(n)}(\lambda)\). Using the property f e (λ)∼c f |1−e |−2d (λ→0), one can show that

$$n^{-2d}\int_{-c}^{c}f_{e}( \lambda)\,dM_{jl}^{(n)}(\lambda)\sim n^{-2d}\int _{-\pi}^{\pi}\bigl \vert 1-e^{-i\lambda}\bigr \vert ^{-2d}\,dM_{jl}^{(n)}(\lambda) $$

which converges to \(v_{jl}^{\ast}\) by assumption. □

The difference to case 1 characterized by (7.15) (and also to short memory) is that an additional normalization by n −2d is required and a different limiting matrix V LSE is obtained. The reason for the slower rate of convergence is that under (7.16) the regression functions have a strong low-frequency component in the sense that M includes a point mass at the origin. This interferes with the pole of f e so that it becomes difficult to distinguish the low-frequency signal of the regression functions from low-frequency components in the residual process. Heuristically, the point mass of M at zero implies ∫f e (λ) dM(λ)≥f e (0) dM(0)=∞ so that n −2d has to be introduced to obtain a finite limit. A further interesting feature of (7.23) is that the asymptotic covariance matrix does not depend on the shape of f e outside the origin. Only c f and d are relevant. This is convenient for statistical inference since only these two parameters need to be estimated.

The evaluation of the matrix V is not always easy. An explicit formula is available for polynomial regression (Yajima 1988; also see Example 7.3):

Theorem 7.5

Let f e be as in Theorem 7.3, c f =L(0)>0 and x tj =t j−1. Then

$$ n^{-2d}D_{n}\operatorname{var} ( \hat{\beta}_{\mathrm{LSE}} ) D_{n}\rightarrow V_{\mathrm{LSE}}=2\pi R^{-1} ( 0 ) V^{\ast}R^{-1} ( 0 ) . $$
(7.24)

where [D n ] jj n j/j, and R(0)=[r jl ] j,l=1,…,p and \(V^{\ast}= [ v_{jl}^{\ast} ]_{j,l=1,\dots,p}\) have the elements

$$r_{jl}\equiv\frac{\sqrt{(2j-1)(2l-1)}}{j+l-1}$$

and

$$v_{jl}^{\ast}=c_{f}\frac{\sqrt{ ( 2j-1 ) ( 2l-1 ) }\varGamma ( 1-2d ) }{\varGamma ( d ) \varGamma ( 1-d ) }\int _{0}^{1}\int_{0}^{1}x^{j-1}y^{l-1} \vert x-y\vert ^{2d-1}\,dy\,dx, $$

respectively.

Example 7.7

Figure 7.1 illustrates which problems long memory in the residual process may cause when the regression function has a zero-frequency component characterized by (7.16). Specifically, we observe Y t =3+0.025t+e t (t=1,2,…,1000) where e t is a FARIMA(0,d,0) process \(e_{t}=(1-B)^{-d}\varepsilon_{t}\) with d=0.4 and \(\operatorname{var}(\varepsilon_{t})=1\). The sample path of the residual process e t (lower curve) has a spurious downward trend. The actual trend function with slope β 1=0.025 (full line) is therefore hardly visible in Y t . The least squares estimate is indeed \(\hat{\beta}_{1}=0.0002\) so that the fitted trend (dotted line) is practically horizontal. On the other hand, fitting a least squares line to the estimated residual process \(\hat{e}_{i}\) yields \(\hat{\beta}_{1}=-0.025\). This is actually a spurious trend. If we use the usual t-test which assumes independence, then we come to the wrong conclusion that \(\hat{\beta}_{1}\) is significantly different from zero with a p-value far below 1 %. Clearly, a correction of this test is needed to take into account the possibility of spurious trends in e t . This is reflected in the additional norming constant n −2d in Theorem 7.4. Theorem 7.5 leads to

$$V^{\ast}=\frac{2}{3}c_{f}\frac{\varGamma(1-2d)}{(2d+1)\varGamma(1-d)\varGamma (1+d)}=1.29, $$

\(D_{n}^{2}\sim\frac{1}{3}n^{3}\) and R(0)=1. Hence, an approximate corrected 95 %-confidence interval for β 1 is given by \(-0.025\pm2\sqrt {3\cdot2\pi\cdot1.29}n^{d-3/2}\approx[-0.09,0.04]\) which includes zero.
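A rough simulation of this example can be obtained with a truncated MA(∞) approximation of the FARIMA(0,d,0) residuals. The sketch below is purely illustrative (all names are mine) and only demonstrates the qualitative point that the naive i.i.d.-based interval for the slope is far too narrow, the correct order being n^{d−3/2}.

```python
# Illustrative simulation sketch of Example 7.7 (not from the text).
import numpy as np

def sim_farima0d0(n, d, burn=2000, rng=None):
    # truncated MA(infinity): psi_j = Gamma(j+d)/(Gamma(j+1)Gamma(d))
    rng = np.random.default_rng(rng)
    m = n + burn
    psi = np.empty(m)
    psi[0] = 1.0
    for j in range(1, m):
        psi[j] = psi[j - 1] * (j - 1 + d) / j
    eps = rng.standard_normal(m)
    return np.convolve(eps, psi)[:m][burn:]

n, d = 1000, 0.4
t = np.arange(1, n + 1)
y = 3 + 0.025 * t + sim_farima0d0(n, d, rng=1)
X = np.column_stack([np.ones(n), t])
beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
# Under long memory, sd(beta_1_hat) is of order n^{d-3/2} (Theorems 7.4/7.5),
# not n^{-3/2}, so intervals computed under independence are misleading.
```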

Fig. 7.1

Y t =3+0.025t+e t (t=1,2,…,1000) where e t is a FARIMA(0,d,0) process \(e_{t}=(1-B)^{-d}\varepsilon_{t}\) with d=0.4 and \(\operatorname{var}(\varepsilon_{t})=1\). The true trend function (full line) and the fitted least squares line (dotted line) are also plotted

Example 7.8

In Fig. 7.2, the same residuals as in the previous example are superimposed on a seasonal trend, namely Y t =cos(2πt/100)+e t . In spite of the spurious trend in the residual sample path, it is not too difficult to distinguish the seasonal fluctuation from e t . The reason is that the frequency λ 0=2π/100≈0.0628 is isolated and relatively far from zero. Therefore, according to Theorem 7.3, \(\hat{\beta}_{\mathrm{LSE}}\) has asymptotically the same rate of convergence as under independence. The only quantity that changes, depending on f e , is the finite constant

$$V=2\pi\int_{-\pi}^{\pi}f_{e}(\lambda)\,dM(\lambda)=2\pi f_{e}(\lambda_{0}). $$

The concrete estimate for the observed series in Fig. 7.2 is \(\hat{\beta}_{\mathrm{LSE}}=1.00\). Since

$$\sum_{t=1}^{n}\cos^{2} ( \lambda_{0}t ) \approx\frac{1}{2}\sum _{t=1}^{n}\bigl \vert e^{i\lambda_{0}t}\bigr \vert ^{2}=n/2, $$

we have \(D_{n}^{2}\sim\frac{1}{2}n\). An approximate 95 %-confidence interval for β 1 is therefore given by

$$\hat{\beta}_{\mathrm{LSE}}\pm2\sqrt{2\cdot2\pi f_{e}( \lambda_{0})}n^{-\frac{1}{2}}=1.00\pm2\sqrt{31.9}n^{-\frac{1}{2}}=[0.64,1.36]. $$

This is shown in Fig. 7.2 as shaded area for the trend function.
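Under the stated FARIMA(0,d,0) assumption, the constant 2πf e (λ 0) can be computed directly from the spectral density. The following sketch is illustrative only; it uses the standard normalization of f e with σ ε ²=1, so the resulting numerical constant may differ slightly from the one quoted above.

```python
# Illustrative computation of the interval half-width in Example 7.8 (not from the text).
import numpy as np

def f_farima0d0(lam, d, sigma2=1.0):
    # spectral density sigma2/(2 pi) |1 - exp(-i lambda)|^{-2d}
    return sigma2 / (2 * np.pi) * np.abs(1 - np.exp(-1j * lam)) ** (-2 * d)

n, d, lam0 = 1000, 0.4, 2 * np.pi / 100
half_width = 2 * np.sqrt(2 * 2 * np.pi * f_farima0d0(lam0, d) / n)
# approximate 95% interval: beta_hat +/- half_width
```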

Fig. 7.2

Y t =cos(2πt/100)+e t (t=1,2,…,1000) where e t is a FARIMA(0,d,0) process \(e_{t}=(1-B)^{-d}\varepsilon_{t}\) with d=0.4 and \(\operatorname{var}(\varepsilon_{t})=1\). The true trend function (full line) is also plotted. The shaded area represents a 95 %-confidence region for the trend function, based on Theorem 7.3

A mixed result can also be obtained. If (7.15) holds for j=1,…,k and (7.16) for j=k+1, then, by setting

$$\tilde{D}_{n}=\operatorname{diag} \bigl( \Vert x_{\cdot1}\Vert ,\dots, \Vert x_{\cdot k}\Vert ,\Vert x_{\cdot k+1}\Vert n^{d},\dots,\Vert x_{\cdot p}\Vert n^{d} \bigr), $$

the asymptotic covariance matrix is of the form

$$V_{\mathrm{LSE}}=\left ( \begin{array} {c@{\quad}c}V_{1} & 0\\ 0 & V_{2}\end{array} \right ) $$

where V 1 is as in Theorem 7.3 and V 2 as in Theorem 7.4.

The derivation of the asymptotic variance of \(\hat{\beta}_{\mathrm{BLUE}}\) is a more challenging task. The first question is to what extent formula (7.14) carries over to the long-memory case. The problem is that the integral \(\int f_{e}^{-1}(\lambda)\,dM(\lambda)\) may be zero. More specifically, suppose that M jj (0+)−M jj (0)=1. This implies dM jl (λ)=0 for all λ≠0 and j, l=1,…,p (see (7.11) and (7.12)) so that \(\int f_{e}^{-1}(\lambda)\,dM(\lambda)=0\) and the inverse does not exist. Therefore, we have to distinguish between the cases 2a (7.17) and 2b (7.18), i.e. 0<M jj (0+)−M jj (0)<1 and M jj (0+)−M jj (0)=1, respectively. Under assumption (7.17), formula (7.14) indeed carries over to the long-memory case. The same is true for case 1 (7.15).

Theorem 7.6

Let f e be as in Theorem 7.3, f e >0 and M such that either (7.15) or (7.17) holds for j=1,…,p. Moreover, under (7.17) assume further that, for all j=1,…,p and a suitable δ>1−2d,

$$\max_{1\leq t\leq n}\frac{x_{tj}^{2}}{\Vert x_{\cdot j}\Vert ^{2}}=o \bigl( n^{-\delta} \bigr) . $$

Then (7.14) holds, i.e.

$$ D_{n}\operatorname{var} ( \hat{\beta}_{\mathrm{BLUE}} ) D_{n}\rightarrow V_{\mathrm{BLUE}}= \biggl[ \frac{1}{2\pi}\int_{-\pi}^{\pi} \frac{1}{f_{e}(\lambda)}\,dM ( \lambda ) \biggr]^{-1}. $$
(7.25)

Proof

For case 1 with M jj (0+)−M jj (0)=0, the result follows by analogous arguments as in the short-memory case because on {|λ|≥c} (with c arbitrary) f e is continuous and such that \(0<f_{e}^{-1}(\lambda)<\infty\). For frequencies where dM jj (λ)>0, the function \(f_{e}^{-1}(\lambda)\) is bounded away from zero.

Consider now case 2a, i.e. 0<M jj (0+)−M jj (0)<1. Since

$$D_{n}\operatorname{var} ( \hat{\beta}_{\mathrm{BLUE}} ) D_{n}= \bigl( D_{n}^{-1}X^{T}\varSigma^{-1}XD_{n}^{-1} \bigr)^{-1},$$

we need to show that \(D_{n}^{-1}X^{T}\varSigma^{-1}XD_{n}^{-1}\) converges to \((2\pi)^{-1}\int f_{e}^{-1}(\lambda)\,dM(\lambda)\). The essential problem is that we have to deal with the inverse of the covariance matrix. It can be shown by some extended algebra that indeed

$$ D_{n}^{-1}X^{T} \bigl( \varSigma^{-1}-A_{n} \bigr) XD_{n}^{-1}\rightarrow0 $$
(7.26)

where A n =[a jl ] j,l=1,…,n has the elements

$$a_{jl}=\frac{1}{ ( 2\pi )^{2}}\int_{-\pi}^{\pi}e^{i(j-l)\lambda } \frac{1}{f_{e}(\lambda)}\,d\lambda. $$

Showing (7.26) is the main difficulty of the proof (see Yajima 1991 for details). Using this approximation, we obtain for \(C_{n}= [ c_{jl}^{(n)} ]_{j,l=1,\dots,p}=D_{n}^{-1}X^{T}A_{n}XD_{n}^{-1}\),

$$c_{jl}^{(n)}=\sum_{t,s=1}^{n} \frac{x_{tj}}{\Vert x_{\cdot j}\Vert }\frac{x_{sl}}{\Vert x_{\cdot l}\Vert }\int_{-\pi}^{\pi}e^{i ( t-s ) \lambda}g( \lambda)\,d\lambda=\int_{-\pi}^{\pi}g(\lambda )\,dM_{jl}^{(n)}(\lambda) $$

where 2πg(λ)=1/f e (λ). Since g(λ)∈C[−π,π] and M (n) converges weakly to M, this leads to

$$\lim_{n\rightarrow\infty}c_{jl}^{(n)}=\int_{-\pi}^{\pi}g( \lambda )\,dM(\lambda)=\frac{1}{2\pi}\int_{-\pi}^{\pi} \frac{1}{f_{e}(\lambda)} \,dM_{jl}(\lambda). $$

 □

This result means that if the regression spectral distribution is not completely concentrated at the origin (cases 1 and 2a), then the pole of f e at zero does not disturb the asymptotic covariance matrix of \(\hat{\beta}_{\mathrm{BLUE}}\). In contrast, in order that the asymptotic covariance matrix of \(\hat{\beta}_{\mathrm{LSE}}\) is unaffected by the pole of f e , M must not have any mass at the origin. What happens otherwise is illustrated in Theorem 7.4.

A general result for \(\hat{\beta}_{\mathrm{BLUE}}\) under condition (7.18) does not seem to be available currently. For polynomial regression, Yajima (1991) derived the following expression.

Theorem 7.7

Let f e be as in Theorem 7.3, f e >0 and x tj =t j−1 (j=1,…,p). Then

$$ n^{-2d}D_{n}\operatorname{var} ( \hat{\beta}_{\mathrm{BLUE}} ) D_{n}\rightarrow V_{\mathrm{BLUE}} $$
(7.27)

where V BLUE=2πc f W −1 and W=[w jl ] j,l=1,…,p with

$$ w_{jl}=\frac{\sqrt{ ( 2j-1 ) ( 2l-1 ) }}{j+l-1-2d}\frac{\varGamma(j-d)\varGamma(l-d)}{\varGamma(j-2d)\varGamma(l-2d)}. $$
(7.28)

Note that, as for the LSE in case 2, the asymptotic covariance matrix V BLUE in (7.27) does not depend on the shape of f e outside the origin.

Example 7.9

For Y t =μ+e t =β 0+e t with e t generated by any stationary long-memory process with long-memory parameter d and a constant c f , we have

$$W=w_{11}=\frac{1}{1-2d} \biggl[ \frac{\varGamma(1-d)}{\varGamma(1-2d)} \biggr]^{2}=\frac{\varGamma^{2} ( 1-d ) }{\varGamma ( 1-2d ) \varGamma ( 2-2d ) }$$

so that

$$V_{\mathrm{BLUE}}=2\pi c_{f}W^{-1}=2\pi c_{f} \frac{\varGamma(1-2d)\varGamma(2-2d)}{\varGamma^{2}(1-d)}. $$

In comparison, for the LSE which is the sample mean \(\bar{y}\), R(0)=1 and

$$V_{\mathrm{LSE}}=2\pi c_{f}\frac{\varGamma ( 1-2d ) }{\varGamma ( d ) \varGamma ( 1-d ) }\int _{0}^{1}\int_{0}^{1} \vert x-y\vert ^{2d-1}\,dy\,dx $$

with

$$\int_{0}^{1}\int_{0}^{1} \vert x-y\vert ^{2d-1}\,dy\,dx=\frac {2}{2d ( 2d+1 ) }. $$
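The value of this double integral can be checked directly by symmetry:

$$\int_{0}^{1}\int_{0}^{1} \vert x-y\vert ^{2d-1}\,dy\,dx=2\int_{0}^{1}\int_{0}^{x} ( x-y )^{2d-1}\,dy\,dx=2\int_{0}^{1}\frac{x^{2d}}{2d}\,dx=\frac{2}{2d ( 2d+1 ) }. $$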

Thus,

$$V_{\mathrm{LSE}}=2\pi c_{f}\frac{\varGamma ( 1-2d ) }{d ( 2d+1 ) \varGamma ( d ) \varGamma ( 1-d ) }. $$

Note that in Sect. 1.3.1 we derived the asymptotic variance of the sample mean to be equal to

$$\nu(d)c_{f}=c_{f}\frac{2\varGamma(1-2d)\sin\pi d}{d(2d+1)}. $$

This is indeed the same as the previous formula because

$$\varGamma(d)\varGamma(1-d)=\frac{\pi}{\sin\pi d}. $$

The asymptotic relative efficiency of the LSE compared with the BLUE is equal to

$$ e(d)=\frac{V_{\mathrm{BLUE}}}{V_{\mathrm{LSE}}}=\frac{(2d+1)\varGamma(2-2d)\varGamma(d+1)}{\varGamma(1-d)}. $$
(7.29)

This formula was first obtained by Adenstedt (1974) (also see Samarov and Taqqu 1988 and Beran and Künsch 1985), and holds for the whole range −1/2<d<1/2. We refer to the discussion in Sect. 5.2.2.

Example 7.10

Next, consider a linear trend model Y t =β 0+β 1 t+e t with e t generated by any stationary long-memory process. Then, by (7.28),

$$w_{11}=\frac{\varGamma^{2}(1-d)}{\varGamma(1-2d)\varGamma(2-2d)},\qquad w_{12}=w_{21}=\frac{\sqrt{3}}{2-2d}\,\frac{\varGamma(1-d)\varGamma(2-d)}{\varGamma(1-2d)\varGamma(2-2d)}$$

and

$$w_{22}=\frac{3}{3-2d}\,\frac{\varGamma^{2}(2-d)}{\varGamma^{2}(2-2d)}. $$

Thus

$$W=w_{11}\left ( \begin{array} {c@{\quad }c}1 & \frac{\sqrt{3} ( 1-d ) }{2-2d}\\ \frac{\sqrt{3} ( 1-d ) }{2-2d} & \frac{3 ( 1-d )^{2}}{ ( 3-2d ) ( 1-2d ) }\end{array} \right ) . $$

The inverse of W is equal to

$$W^{-1}=w_{11}^{-1}\left ( \begin{array} {c@{\quad}c}4 ( 1-d )^{2} & -\frac{2}{\sqrt{3}} ( 3-2d ) ( 1-2d ) \\ -\frac{2}{\sqrt{3}} ( 3-2d ) ( 1-2d ) & \frac{4}{3} ( 1-2d ) ( 3-2d ) \end{array} \right ) . $$

The determinant of W −1 is equal to

$$\det\bigl(W^{-1}\bigr)=w_{11}^{-2} \biggl( 4- \frac{32}{3}\,d+\frac{16}{3}\,d^{2} \biggr) $$

so that

$$\det(V_{\mathrm{BLUE}})= \biggl( \frac{2\pi c_{f}}{w_{11}} \biggr)^{2} \biggl( 4-\frac{32}{3}\,d+\frac{16}{3}\,d^{2} \biggr) . $$

By similar calculations, one can derive an explicit formula for V LSE and the relative efficiency

$$e(d)=\frac{\det ( V_{\mathrm{BLUE}} ) }{\det ( V_{\mathrm{LSE}} ) }=\frac{ ( 3+2d ) ( 3-2d ) }{36} \biggl[ \frac{ ( 1+2d ) \varGamma ( 1+d ) \varGamma ( 3-2d ) }{\varGamma ( 2-d ) } \biggr]^{2}. $$

(Note that there is a typo in Yajima 1988 in that 1/e(d) instead of e(d) is given.) Figure 7.3 shows slightly larger efficiency losses than for the previous case where β 1=0. However, qualitatively the behaviour of e(d) is quite similar.
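The two efficiency curves compared in Fig. 7.3 are easy to evaluate numerically; the following sketch (illustrative, not from the text) implements (7.29) and the linear-trend formula above.

```python
# Illustrative evaluation of the two relative efficiencies e(d) (cf. Fig. 7.3).
import numpy as np
from scipy.special import gamma as G

def e_location(d):
    # (7.29): Y_t = beta_0 + e_t
    return (2 * d + 1) * G(2 - 2 * d) * G(d + 1) / G(1 - d)

def e_trend(d):
    # linear trend model Y_t = beta_0 + beta_1 t + e_t
    return ((3 + 2 * d) * (3 - 2 * d) / 36.0) * \
           ((1 + 2 * d) * G(1 + d) * G(3 - 2 * d) / G(2 - d)) ** 2

d = np.linspace(-0.45, 0.45, 91)
# Both functions equal 1 at d = 0 and drop below 1 as |d| increases,
# with slightly larger losses for the trend model.
```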

Fig. 7.3

Relative asymptotic efficiency e(d)=det(V BLUE)/det(V LSE) of the least squares estimator in a linear regression model Y t =β 0+β 1 t+e t (full line) and a regression model with β 1=0, i.e. Y t =β 0+e t (dotted line)

Example 7.11

Let Y t =β 1(1+cosλ 0 t)+e t . Then this corresponds to case 2a with 0<M(0+)−M(0)<1. Thus, Theorem 7.6 can be applied.

The next question is the comparison of the asymptotic covariance matrices for \(\hat{\beta}_{\mathrm{LSE}}\) and \(\hat{\beta}_{\mathrm{BLUE}}\). The previous examples illustrated that for polynomial regression \(\hat{\beta}_{\mathrm{LSE}}\) is asymptotically efficient under short memory whereas this is not the case when d≠0. To what extent is this a general phenomenon? The short-memory case has been considered by Grenander (1954) (also see Grenander and Rosenblatt 1957). An essential notion in this context is the so-called regression spectrum:

Definition 7.1

Let M be a regression spectral distribution function. Then

$$S= \bigl\{ \lambda\in [ -\pi,\pi ] :dM(\lambda)>0 \bigr\} $$

is called the regression spectrum.

Each (regression) spectral distribution function M can be decomposed in the following way.

Lemma 7.1

There exist disjoint subsets S 1,…,S m (for some mp) such that

$$S=\bigcup_{j=1}^{m}S_{j}$$

and

where \(M(S_{j})=\int_{S_{j}}\,dM(\lambda)\) and \(M(\pi)=\int_{-\pi}^{\pi }\,dM(\lambda)\).

Lemma 7.1 leads to the following definition.

Definition 7.2

The sets S j are called the elements of the regression spectrum.

Using these definitions, Grenander derived the following necessary and sufficient conditions for the asymptotic efficiency of the LSE.

Theorem 7.8

Let f e C[−π,π], f e >0, \(D_{n}=\operatorname{diag} ( \Vert x_{\cdot1}\Vert ,\dots,\Vert x_{\cdot p}\Vert ) \), assume that (R1)(R4) hold and denote by S 1,…,S m the elements of the regression spectrum. Then

$$\lim_{n\rightarrow\infty}\operatorname{var} ( \hat{\beta}_{\mathrm{BLUE}} ) \bigl[ \operatorname{var} ( \hat{ \beta}_{\mathrm{LSE}} ) \bigr]^{-1}=I $$

if and only if there are constants c j (j=1,…,m) such that f e (λ)≡c j for λS j (i.e. f e is constant on each S j ). Moreover, this is equivalent to

$$\vert S\vert \leq p,\qquad \sum_{\lambda\in S}\operatorname{rank} \bigl \{ dM(\lambda ) \bigr\} =p. $$

This is a classical result (see, e.g. Grenander and Rosenblatt 1957), and we therefore only outline the basic idea. Suppose that f e is indeed constant on each element of the regression spectrum. Then Theorems 7.1 and 7.2 imply

$$\begin{aligned} &\operatorname{var} ( \hat{\beta}_{\mathrm{BLUE}} ) \bigl[ \operatorname{var} ( \hat{\beta}_{\mathrm{LSE}} ) \bigr]^{-1} \\ &\quad \sim2\pi R^{-1} ( 0 ) \int f_{e}( \lambda)\,dM(\lambda)R^{-1} ( 0 ) \cdot\frac{1}{2\pi}\int \frac{1}{f_{e}(\lambda)}\,dM(\lambda). \end{aligned}$$

Using R(0)=M(π) and Lemma 7.1, the right-hand side is equal to

The question is under which circumstances Theorem 7.8 can be carried over to the case where d≠0. As we saw in the examples discussed previously, Theorem 7.8 no longer holds for polynomial regression, whereas \(\hat{\beta}_{\mathrm{LSE}}\) turns out to be fully efficient for a periodic component. The essential argument in Theorem 7.8 is based on formulas (7.13) and (7.14) for the asymptotic covariance matrix of \(\hat{\beta}_{\mathrm{LSE}}\) and \(\hat{\beta }_{\mathrm{BLUE}}\), respectively. However, it is assumed implicitly that all quantities involved are finite. This is no longer the case, if f e has a pole at the origin and dM(0)>0. It can therefore be concluded that the LSE is asymptotically efficient, compared to the BLUE, if Theorems 7.3 and 7.6 are applicable and dM(0)=0:

Theorem 7.9

Let f e and x tj be as in Theorem 7.6 and \(D_{n}=\operatorname{diag} ( \Vert x_{\cdot1}\Vert ,\dots, \Vert x_{\cdot p}\Vert ) \). Assume that (R1)–(R4) hold and denote by S 1,…,S m the elements of the regression spectrum S=⋃S j (m≤p). Then

$$\lim_{n\rightarrow\infty}\operatorname{var} ( \hat{\beta}_{\mathrm{BLUE}} ) \bigl[ \operatorname{var} ( \hat{ \beta}_{\mathrm{LSE}} ) \bigr]^{-1}=I $$

if and only if S j ={λ j } with λ j ∈(0,π] and

$$\sum_{\lambda\in S}\operatorname{rank} \bigl\{ dM(\lambda) \bigr\} =p. $$

Formally, the result is due to the fact that if dM(0)<1, then there is at least one nonzero frequency where dM(λ)>0. The integral \(\int f_{e}^{-1}(\lambda)\,dM(\lambda)\) is therefore no longer zero and the usual formula for the asymptotic covariance matrix (which relies on the inverse of this integral) is applicable. Thus, essentially the LSE does not lose efficiency as long as the regression spectrum does not include the frequency zero. A loss of efficiency usually occurs, if dM(0)>0. The intuitive reason is that in this case both the regression function and the residual process have a strong zero-frequency component. Incorporating the covariance structure in the estimator relieves this problem up to a certain extent. In fact, comparing Theorems 7.2 and 7.6, in cases where 0<dM(0)<1, this even leads to an improvement of the rate of convergence, matching the rate under short range dependence! This is illustrated by the following example.

Example 7.12

Let Y t =β 1(−1)t+e t with long-memory residuals e t as above. Then \(dM(\pm\pi)=\frac{1}{2}\) and zero otherwise, \(D_{n}=\sqrt{n}\) and R(0)=1. Thus, by Theorem 7.9, the LSE is asymptotically efficient. The asymptotic variance is given by

$$n\cdot \operatorname{var}(\hat{\beta}_{1})\rightarrow V=2\pi\int _{-\pi}^{\pi}f_{e}(\lambda)\,dM( \lambda)=2\pi f_{e}(\pi). $$

For instance, if e t is a FARIMA(0,d,0) process with variance one, then

$$V=\bigl \vert 1-e^{-i\pi}\bigr \vert ^{-2d} \frac{\varGamma^{2} ( 1-d ) }{\varGamma ( 1-2d ) }=2^{-2d}\frac{\varGamma^{2} ( 1-d ) }{\varGamma ( 1-2d ) }. $$

This is a monotonically decreasing function of d. In particular, for d=0, we have V=1 whereas, for instance, for d=0.4 one obtains V=0.28. The intuitive explanation for the better performance under long memory is that the sample paths of e t tend to be “smoother” so that it is easier to distinguish them from the alternating function x t =(−1)t.
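The variance constant of this example is easily checked numerically (illustrative sketch, not from the text):

```python
# Numerical check of V = 2^{-2d} Gamma(1-d)^2 / Gamma(1-2d) in Example 7.12.
from scipy.special import gamma as G

def V_alt(d):
    return 2.0 ** (-2 * d) * G(1 - d) ** 2 / G(1 - 2 * d)

# V_alt(0.0) = 1.0 and V_alt(0.4) is approximately 0.28, decreasing in d.
```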

In summary, one can say that the efficiency of the LSE compared to the BLUE very much depends on the combination of the long-memory properties of e t and the type of regression functions x tj . A practical problem with the BLUE is, however, that the weights depend on the autocovariance function γ e of the residual process. For observed data, γ e is usually unknown and has to be estimated from the same data. Thus, in cases where only minor efficiency gains are to be expected, the LSE is preferred. In other cases, the BLUE is much more efficient so that one would like to use it. However, since γ e has to be estimated, a balance between the efficiency gain due to weighting by Σ −1 and the additional inaccuracy induced by estimation of Σ has to be found. A further complication is that for large sample sizes and strong long memory, inversion of Σ may be computationally difficult. As an alternative, Dahlhaus (1995) suggested using explicit weights without the need of inverting an n×n matrix. In particular, for polynomial regression with x tj =t j−1 (j=1,…,p) he showed that the weighted estimator

$$\hat{\beta}_{G}= \bigl( X^{T}GX \bigr)^{-1}X^{T}Gy(n) $$

with

$$\underset{n\times n}{G}=\operatorname{diag} \bigl( g(t_{1}),g(t_{2}), \dots,g(t_{n}) \bigr) , $$

t i =i/n and \(g(u)=u^{-d}(1-u)^{-d}\) has the same asymptotic covariance matrix as the BLUE. In applications, one would use a slightly modified weight function, e.g. \(g_{n}(u)=u^{-d}(1-u+\frac{1}{2n})^{-d}\), to avoid g(1)=∞. This result can be generalized to regressors generated by Jacobi polynomials (see Dahlhaus 1995 for details; also see Sect. 3.1.4 for the definition of Jacobi polynomials).
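A possible implementation of this weighted estimator is sketched below. It is illustrative only (all names are mine), and the finite-sample modification of the weight function near u=1 is one of several possible conventions rather than necessarily the one used by Dahlhaus (1995).

```python
# Illustrative sketch of a Dahlhaus-type weighted LSE for polynomial regressors.
import numpy as np

def weighted_lse(X, y, d):
    n = X.shape[0]
    u = np.arange(1, n + 1) / n
    # weights g(u) = u^{-d} (1-u)^{-d}, with 1-u shifted by 1/(2n) to avoid g(1) = infinity
    g = u ** (-d) * (1.0 - u + 1.0 / (2 * n)) ** (-d)
    Xw = X * g[:, None]
    return np.linalg.solve(X.T @ Xw, Xw.T @ y)
```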

7.1.3 Robust Linear Regression

Consider

$$ Y_{t}=\sum_{j=1}^{p} \beta_{j}x_{tj}+e_{t}=x_{t\cdot}^{\prime} \beta+e_{t}\quad(t=1,2,\dots,n) $$
(7.30)

as in (7.1) and a long-memory residual process as in (7.2). Denote by p e the probability density function of the marginal distribution of e t . A standard class of robust estimators of β (robust in the y-direction, see Hampel et al. 1986) can be defined as M-estimators, i.e. as solutions of p equations

$$ \sum_{t=1}^{n}\psi \bigl( Y_{t}-x_{t\cdot}^{\prime}\hat{\beta} \bigr) x_{t\cdot}=\underset{p\times1}{0} $$
(7.31)

where ψ is such that \(E [ \psi ( Y_{t}-x_{t\cdot}^{\prime}\beta ) x_{t\cdot} ] =0\). By similar arguments as for location estimation, it can be shown that the limit theorem (Theorem 4.33) for the empirical process implies asymptotic equivalence of any M-estimator and the LSE. If ψ is continuously differentiable, then this can be seen even more directly since (7.31) and consistency imply

$$\sum_{t=1}^{n}\psi \bigl( Y_{t}-x_{t\cdot}^{\prime}\beta \bigr) x_{t\cdot }- \sum_{t=1}^{n}\dot{\psi} \bigl( Y_{t}-x_{t\cdot}^{\prime}\beta \bigr) x_{t\cdot}x_{t\cdot}^{\prime} ( \hat{\beta}-\beta ) \approx0 $$

so that

$$ \hat{\beta}-\beta\approx \bigl\{ E \bigl[ \dot{\psi} ( e ) \bigr] X^{\prime}X \bigr\}^{-1}\sum_{t=1}^{n}\psi ( e_{t} ) x_{t\cdot}. $$
(7.32)

If we can use the approximation

$$\psi ( e_{t} ) =-\int\psi ( u ) p_{e}^{\prime}(u)\,du \cdot e_{t}+r_{t}=a_{\mathrm{app},1}e_{t}+r_{t}$$

with \(a_{\mathrm{app},1}=E [ \dot{\psi} ( e_{t} ) ] \) and r t in (7.32) is negligible (for instance, when a unique Appell expansion is valid), then

$$\hat{\beta}-\beta\approx \bigl( X^{\prime}X \bigr)^{-1}\sum_{t=1}^{n}x_{t\cdot}e_{t}= \bigl( X^{\prime}X \bigr)^{-1}X^{\prime}e(n)=\hat{ \beta}_{\mathrm{LSE}}-\beta. $$

For more general, not necessarily differentiable, functions ψ, the limit theorem for the empirical process has to be applied more directly, along the lines of the proof of Theorem 5.1. A simplified version of the result in Giraitis et al. (1996a) can be stated as follows:

Theorem 7.10

Let ψ be nondecreasing, right-continuous and bounded. Furthermore, suppose that (XX)−1 exists for n large enough,

$$ \sqrt{n}\max_{1\leq t\leq n}\bigl \vert x_{t\cdot}^{\prime} \bigl( X^{\prime }X \bigr)^{-\frac{1}{2}}\bigr \vert =O(1), $$
(7.33)

e t =∑a j ε tj is a linear process with a j c a j d−1 (\(0<d<\frac{1}{2}\)), E[|ε t |k]<∞ for all \(k\in\mathbb{N}\) and denote by I the p×p identity matrix. Then

$$\operatorname{var} ( \hat{\beta}_{\mathrm{LSE}} ) \bigl[ \operatorname{var} ( \hat{\beta} ) \bigr]^{-1}\rightarrow\underset{p\times p}{I}$$

and

$$\bigl[ \operatorname{var} ( \hat{\beta}_{\mathrm{LSE}} ) \bigr]^{-\frac{1}{2}} ( \hat{\beta}- \hat{\beta}_{\mathrm{LSE}} ) \rightarrow0. $$

Example 7.13

For polynomial regression

$$c_{kl}= \bigl( D_{n}^{-\frac{1}{2}}X^{\prime}XD_n^{-\frac{1}{2}} \bigr)_{kl}=\frac{ ( X^{\prime}X )_{kl}}{\Vert x_{\cdot k}(n)\Vert \Vert x_{\cdot l}\Vert }\sim\frac{\sqrt{ ( 2k-1 ) ( 2l-1 ) }}{k+l-1}$$

so that the elements of \(x_{t\cdot}^{\prime}D_{n}^{-\frac{1}{2}}\) are of order \(O(n^{-\frac{1}{2}})\), uniformly in t. Thus (7.33) holds and the theorem can be applied: for instance, if the e t are generated by a FARIMA(0,d,0) process, then Theorem 7.10 holds.

7.1.4 Optimal Deterministic Designs

So far, it was assumed that the regression functions were evaluated at equidistant (time) points. For instance, for polynomial regression we considered x ij =i j−1 (i=1,…,n). Replacing the diagonal matrix \(D_{n}=\operatorname{diag} ( n^{\frac{1}{2}},n^{\frac{3}{2}},\dots,n^{\frac{2p-1}{2}} ) \) by \(\tilde{D}_{n}=n\cdot \operatorname{diag} ( 1,1,\dots,1 ) \) we may consider an analogous regression with \(x_{ij}=t_{i}^{j-1}=g_{j}(t_{i})\) where t i =i/n. In some situations, it is possible to choose the points t i where the regression functions are observed. This can be modelled as follows. For a given \(T\in \mathbb{R}\), let

$$ h: [ 0,1 ] \rightarrow [ -T,T ] $$
(7.34)

be a function such that h(t) can be written as a quantile \(h(t)=F_{h}^{-1}(t)\) of a distribution function \(F_{h}(x)=\int_{-\infty}^{x}\varphi(u)\,du\). Then it is assumed that the regression functions are generated at points

$$t_{i,n}=h \biggl( \frac{i-1}{n-1} \biggr) . $$

The collection of all points,

$$\varXi_{n}= \{ t_{1,n},\dots,t_{n,n} \} = \bigl\{ h(0),\dots,h(1) \bigr\}, $$

is called the experimental design of the regression model. To obtain asymptotic results regarding the variance of \(\hat{\beta}\), observations are assumed to be given by

$$ Y_{t}=\beta_{1}g_{1} ( t ) +\cdots+ \beta_{p}g_{p}(t)+e_{n} ( t ) \quad(t=1,\dots,n) $$
(7.35)

where \(e_{n}(t)=e_{n}^{(1)}(t)+e_{n}^{(2)}(t)\), \(e_{n}^{(1)}\) and \(e_{n}^{(2)}\) are zero mean processes, independent of each other, with variances \(\sigma_{j}^{2}\) (j=1,2), \(e_{n}^{(1)}(t)\) being uncorrelated and \(e_{n}^{(2)}(t)\) having autocorrelations

$$ \mathit{corr} \bigl( e_{n}^{(2)}(t),e_{n}^{(2)}(t+k) \bigr) =\rho_{n} ( k ) =\rho ( nk ) $$
(7.36)

with ρ(u)∼c ρ u 2d−1 (\(0<d<\frac{1}{2}\)) as u→∞. Moreover, g j are “explanatory” linearly independent functions. We will use the notation

$$\kappa=\frac{\sigma_{2}^{2}}{\sigma_{1}^{2}+\sigma_{2}^{2}}. $$

Note that (7.36) is equivalent to letting T in (7.34) tend to infinity while keeping ρ n fixed. By similar arguments as in the previous sections, it can be shown that, under suitable regularity conditions, the asymptotic covariance matrix of the least squares estimator is given by (Dette et al. 2009)

(7.37)
(7.38)

where

and

$$Q(v)=c_{\rho}^{-1}\lim_{n\rightarrow\infty}n^{-2d}\sum _{j=1}^{n}\rho ( jv ) =\frac{v^{2d-1}}{2d}. $$

Note in particular, that for an equidistant design with h(u)=(2u−1)T (and hence h′(u)≡2T), (7.37) gives back the asymptotic formulas in the previous section. An asymptotically optimal design is obtained by minimizing the function Ψ with respect to the design density φ.

Example 7.14

For Y t =βt+e n (t), Dette et al. (2009) derived explicit expressions for the optimal design density φ opt. Essentially, as d approaches 0, φ opt tends to the uniform distribution on [−T,T]. This result is directly related to the fact that for short-memory processes the LSE is asymptotically efficient. Recall that for the same regression (however, with t∈[0,1]), \(w(u)=u^{-d}(1-u)^{-d}\) was the weight function yielding the same efficiency as the BLUE (Dahlhaus 1995). As d→0, w also converges to a constant function w≡1. On the other hand, when d approaches \(\frac{1}{2}\), then the optimal design density φ opt puts more and more weight close to the left and right end of the interval. This is in correspondence with Dahlhaus’ optimal weight function w(u), which in the equidistant case has increasingly steep poles at the ends of the interval. Intuitively, this means that one tries to estimate β from two parts of the series (the beginning and the end) that are as far apart in time as possible, thus avoiding too much correlation.

7.2 Parametric Linear Random-Design Regression

In this section, we address the problem of parameter estimation in a linear regression model

$$ Y_{t}=\sum_{j=1}^{p} \beta_{j}X_{tj}+e_{t}\quad (t=1,\ldots,n), $$
(7.39)

where the explanatory variables X t,j are random, and the processes X t,j (\(t\in\mathbb{Z}\)) and/or e t (\(t\in\mathbb{Z}\)) may be strongly dependent or nonstationary. In Sect. 7.2.1, we start with two examples that illustrate possible effects of long memory in errors and predictors on parameter estimation in the random design case. These examples will provide some intuition for asymptotic results on contrast estimation. Estimation of contrasts is, historically, one of the first illustrations of the phenomenon that estimators in random design regression tend to perform better than in a typical fixed design case (Künsch et al. 1993, also see Beran 1994a, Chap. 9).

In Sect. 7.2.2, we focus on the heteroskedastic case

$$Y_{t}=\beta_{0}+\beta_{1}X_{t}+ \sigma(X_{t})e_{t}, $$

where σ(⋅) is a positive function. We assume that predictors and errors are stationary with possible long memory, independent from each other. The general theory for the LSE is based on randomly weighted partial sums (see Sect. 7.2.3) as presented in Kulik and Wichelhaus (2012), see also Guo and Koul (2008). Other approaches, tailored for the homoscedastic case σ(⋅)≡σ are presented, following Robinson and Hidalgo (1997) and Choy and Taniguchi (2001). Further results can be found in Koul (1992), Koul and Mukherjee (1993), Giraitis et al. (1996a), Koul and Surgailis (1997, 2000), Hallin et al. (1999), Chung (2002), Koul et al. (2004), Lazarova (2005).

Section 7.2.4 addresses the problem of spurious correlation between nonstationary series X t , Y t that are independent of each other. In the case of a random walk and related integrated processes, it is well known that the sample correlation between two independent series does not converge to zero (see, e.g. Granger and Newbold 1974 and Phillips 1986). The same is true for fractionally integrated processes. We summarize detailed results including various combinations of nonstationarity, stationarity and long-range dependence as derived in Tsay and Chung (2000). Related results have been established in Phillips (1986, 1995), Phillips and Loretan (1991), Marmol (1995), Jeganathan (1999), Robinson and Marinucci (2003), Buchmann and Chan (2007).

Finally, Sect. 7.2.5 briefly addresses the problem of fractional cointegration. The idea of cointegration dates back to Granger (1981, 1983) and Engle and Granger (1987). In fractional cointegration, the reduction of the degree of integration is allowed to assume noninteger values. In some situations, this can lead to the lack of consistency of the LSE so that modifications are required (see, e.g. Robinson 1994a, 1994b and Marinucci 2000). Because the issue is of major interest in economics, there is meanwhile an extended literature. Important references are, for instance, Marinucci and Robinson (1999, 2001), Velasco (1999a, 1999b, 2003), Chen and Hurvich (2003a, 2003b, 2006) among others.

7.2.1 Some Examples, Estimation of Contrasts

As we saw in the previous section, the rate of convergence of (weighted) least squares estimators of β depends on the properties of the explanatory variables, i.e. on the regression design matrix X. If the explanatory variables themselves are random, then this means that the properties of \(\hat{\beta}\) depend on the distribution of X tj (j=1,…,p). Mainly two questions are relevant:

  1. Is μ j =E(X tj ) zero?

  2. What is the temporal dependence structure of X tj ?

This is illustrated by the following examples.

Example 7.15

Let Y t =βX t +e t with X t uncorrelated, E(X t )=0, \(\operatorname{var}(X_{t})= \sigma_{X}^{2}<\infty\), e t a zero mean stationary process with spectral density f e (λ)∼c f |λ|−2d (\(0<d<\frac{1}{2}\)) and independent of the process X t . Then, by the law of large numbers, the asymptotic distribution of

$$\hat{\beta}_{\mathrm{LSE}}=\frac{\sum_{t=1}^{n}X_{t}Y_{t}}{\sum X_{t}^{2}}\sim \sigma_{X}^{-2}n^{-1} \sum_{t=1}^{n}X_{t}Y_{t}$$

is the same as that of

$$\sigma_{X}^{-2}n^{-1}\sum _{t=1}^{n}X_{t}Y_{t}. $$

Furthermore,

$$\operatorname{var} \Biggl( \sigma_{X}^{-2}n^{-1}\sum _{t=1}^{n}X_{t}Y_{t} \Biggr) =\operatorname{var} \Biggl( \sigma_{X}^{-2}n^{-1}\sum _{t=1}^{n}X_{t}e_{t} \Biggr) \sim\sigma_{X}^{-4}n^{-2}\cdot n \sigma_{X}^{2}\sigma_{e}^{2}= \frac{\sigma_{e}^{2}}{\sigma_{X}^{2}}n^{-1}. $$

Thus, X t having zero mean and being uncorrelated removes a possible effect of (long-range) dependence in the residual process.

Example 7.16

Consider the same process as in the previous example; however, with μ=E(X t )≠0. Then the asymptotic distribution of \(\hat{\beta}_{\mathrm{LSE}}\) is the same as that of

$$\bigl( \sigma_{X}^{2}+\mu_{X}^{2} \bigr)^{-1}n^{-1}\sum_{t=1}^{n}X_{t}Y_{t}. $$

Furthermore,

$$\operatorname{var} \Biggl( n^{-1}\sum_{t=1}^{n}X_{t}e_{t} \Biggr) =n^{-2}\sum_{t,s=1}^{n} \bigl( \mu_{X}^{2}+\sigma_{X}^{2}\delta_{ts} \bigr) \gamma_{e}(t-s)\sim\operatorname{const}\cdot\mu_{X}^{2}n^{2d-1}. $$

Hence, even though X t are uncorrelated, the possible long-range dependence stemming from the residuals is not removed.
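A small Monte Carlo experiment (illustrative only; it reuses the truncated FARIMA(0,d,0) generator sketched in Example 7.7) makes the contrast between Examples 7.15 and 7.16 visible: with centred predictors the empirical variance of the LSE shrinks roughly like n −1, while with a nonzero predictor mean it shrinks only like n 2d−1.

```python
# Illustrative Monte Carlo for Examples 7.15 and 7.16 (not from the text).
import numpy as np

def farima_errors(n, d, burn=2000, rng=None):
    # truncated MA(infinity) approximation to FARIMA(0,d,0), as in the Example 7.7 sketch
    rng = np.random.default_rng(rng)
    m = n + burn
    psi = np.cumprod(np.r_[1.0, (np.arange(1, m) - 1 + d) / np.arange(1, m)])
    return np.convolve(rng.standard_normal(m), psi)[:m][burn:]

def slope_var(n, d, mu_x, beta=1.0, nrep=200, seed=1):
    rng = np.random.default_rng(seed)
    est = np.empty(nrep)
    for r in range(nrep):
        e = farima_errors(n, d, rng=rng)
        x = mu_x + rng.standard_normal(n)          # i.i.d. predictors with mean mu_x
        est[r] = np.sum(x * (beta * x + e)) / np.sum(x ** 2)
    return est.var()

# With d = 0.4: slope_var(n, 0.4, 0.0) decays roughly like 1/n (Example 7.15),
# whereas slope_var(n, 0.4, 1.0) decays only like n^{2d-1} = n^{-0.2} (Example 7.16).
```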

Example 7.17

Let \(X_{t}= ( -1 )^{Z_{t}}\) where Z t are i.i.d. Bernoulli random variables with \(P(Z_{t}=1)=P(Z_{t}=0)=\frac{1}{2}\) and independent of e t . Then \(\sigma_{X}^{2}=1\) and

$$\operatorname{var} ( \hat{\beta}_{\mathrm{LSE}} ) \sim\sigma_{e}^{2}n^{-1}=n^{-1} \int_{-\pi}^{\pi}f_{e}(\lambda)\,d\lambda. $$

It is in particular interesting to compare this with the asymptotic variance of \(\hat{\beta}_{\mathrm{LSE}}\) for the fixed-design regression with X t =(−1)t=cosπt where, from Theorem 7.3, one obtains \(n^{-1}2\pi f_{e}(\pi)\). If f e achieves its minimum at λ=π, then this means that alternating the sign systematically yields a better estimate of β than assigning the sign purely randomly. For instance, for a fractional ARIMA(0,d,0) model with d>0, f e (π) coincides with the minimum of f e whereas the contrary is true for d<0. For d=0, f e is constant so that 2πf e (π) and \(\int_{-\pi}^{\pi }f_{e}(\lambda)\,d\lambda\) are the same.

From the applied point of view, a simple principle that may be deduced from these examples is that estimation of ‘absolute’ constants is more difficult than estimation of contrasts (for the definition of contrasts, see (7.43)). Or in other words, it is easier to compare constants than to estimate their individual values. This has been known to applied statisticians for a long time. In the context of long-memory processes and simple experimental designs, this principle can be formulated explicitly as follows (see Künsch et al. 1993). Suppose p treatments are assigned randomly to n observational units that are observed in a certain temporal (or other) sequence. Assuming an additive effect of the treatments leads to the regression model

$$ Y_{t}=\sum_{j=1}^{p} \beta_{j}x_{t,j}+e_{t}=x_{t\cdot}^{T} \beta+e_{t} $$
(7.40)

where β=(β 1,…,β p )T, β j is the jth treatment effect and e t is a zero mean process with spectral density f e c e |λ|−2d (λ→0). The explanatory variables are defined by

$$x_{t,j}=1 \{ a_{t}=j \} $$

with a t ∈{1,…,p} defining the treatment used. The question is now to what extent long memory in the residuals affects the estimation of β and, in particular, whether the least squares estimator is asymptotically efficient. Furthermore, one may ask whether there are designs (random allocations of treatments) that improve the accuracy of estimates.

Künsch et al. (1993) considered the following standard designs:

  (a) Complete randomization: a t are i.i.d. with

    $$P(a_{t}=j)=\pi_{j}. $$
  (b) Restricted randomization: Given n, the number of assignments to treatment j (j=1,…,p) is fixed, i.e. n=n 1+⋯+n p and

    $$\sum_{t=1}^{n}x_{t,j}=n_{j}, $$

    and all possible allocations of this type have the same probability

    $$P ( a_{1},\dots,a_{n}\mid n_{1}, \dots,n_{p} ) =p ( a_{1},\dots,a_{n} ) =\frac{n_{1}!\cdots n_{p}!}{n!}. $$
  (c) Complete blockwise randomization: Restricted randomization within blocks, i.e. define b=[n/l] blocks of length l,

    $$B_{k}= \bigl\{ ( k-1 ) l+1,\dots,kl \bigr\} $$

    and, within each block (and independently of other blocks), apply restricted randomization subject to

    $$\sum_{t\in B_{k}}x_{t,j}=l_{j}\quad(j=1,\dots,p),\qquad l_{1}+\cdots+l_{p}=l. $$

The main difference between (a) and (b) is that in (a) n j (j=1,…,p) are random whereas they are fixed in (b). However, in (a) n j /n converges to π j almost surely so that for n large enough, n j is “in the neighbourhood” of the fixed number nπ j . The randomization in case (c) is even more restricted than in (b) because the number of assignments to treatment j is also fixed within each block. A typical choice of l and l j in (c) is l=p and l j =1.

In vector form, model (7.40) can be written as

$$ Y(n)=X\beta+e(n) $$
(7.41)

with Y(n)=(Y 1,…,Y n )T,

$$X= ( x_{\cdot1},\dots,x_{\cdot p} ) =\left ( \begin{array} [c]{c}x_{1\cdot}^{T}\\ \vdots\\ x_{n\cdot}^{T}\end{array} \right ) , $$

and column and row vectors x j =(x 1j ,…,x nj )T and x t=(x t1,…,x tp )T, respectively, such that

$$1^{T}x_{t\cdot}=\sum_{j=1}^{p}x_{tj}=1, \qquad 1^{T}x_{\cdot j}=\sum_{t=1}^{n}x_{tj}=n_{j}. $$

By definition, column vectors are orthogonal, i.e.

$$\langle x_{\cdot j},x_{\cdot l} \rangle =\sum _{t=1}^{n}x_{tj}x_{tl}=n_{j}\cdot\delta_{jl}$$

so that

$$X^{T}X=\left ( \begin{array} {c@{\quad}c@{\quad}c@{\quad}c}n_{1} & 0 & \cdots & 0\\ 0 & n_{2} & \ddots & \vdots\\ \vdots & \ddots & \ddots & 0\\ 0 & \cdots & 0 & n_{p}\end{array} \right ) . $$

Therefore, the least squares estimator of β can be written in a simple form

$$ \hat{\beta}_{\mathrm{LSE}}= \bigl( X^{T}X \bigr)^{-1}X^{T}y(n)= \left ( \begin{array} [c]{c}n_{1}^{-1}\sum_{t=1}^{n}x_{t1}y_{t}\\ \vdots\\ n_{p}^{-1}\sum_{t=1}^{n}x_{tp}y_{t}\end{array} \right ) . $$
(7.42)

For the BLUE, we have the usual formula

$$\hat{\beta}_{\mathrm{BLUE}}= \bigl( X^{T}\varSigma^{-1}X \bigr)^{-1}X^{T}\varSigma^{-1}y(n). $$

Now, instead of β itself, we are interested in estimation of contrasts. A contrast is defined by

$$ c=\eta^{T}\beta=\sum_{j=1}^{p} \eta_{j}\beta_{j}, $$
(7.43)

where η is a deterministic vector such that

$$1^{T}\eta=\sum_{j=1}^{p} \eta_{j}=0. $$

The variance of any estimated contrast can be written in terms of variances of estimates of the simple contrasts

$$c_{jk}=\beta_{j}-\beta_{k}. $$

It is therefore sufficient to study the variance of \(\hat{c}_{jk}=\hat{\beta }_{j}-\hat{\beta}_{k}\). Since usually inference is carried out conditionally on the given (randomly generated) design, one has to consider the asymptotic behaviour of the conditional variance \(V_{n} ( \hat{c}_{jk}\mid X ) =\operatorname{var} ( \hat{c}_{jk}\mid X ) \). Comparing the LSE and the BLUE of c jk , the corresponding conditional variances \(V_{n} ( \hat {c}_{jk;\mathrm{LSE}}\mid X ) \) and \(V_{n} ( \hat{c}_{jk:\mathrm{BLUE}}\mid X ) \) will be denoted by V n,LSE(X) and V n,BLUE(X), respectively. The following result can be obtained by relatively simple approximations of the second moment.

Theorem 7.11

Let f e satisfy one of the following conditions: (i) f e is piecewise continuous and 0<c≤f e ≤C for suitable finite constants c and C, or (ii) f e (λ)=L(λ)|λ|−2d with \(0<d<\frac {1}{2}\), L(⋅) continuous, of bounded variation and 0<c≤L≤C. Then, under complete randomization (design (a)), we have, as n→∞,

(7.44)

The first remarkable result in this theorem is that contrasts can be estimated with the same rate of convergence as under independence, since V n =O(n −1). This is in sharp contrast to estimates of the slope parameters β j themselves. Since the expected value of the explanatory variables is not zero, the rate of convergence of \(\hat{\beta}_{j,\mathrm{LSE}}\) and \(\hat{\beta}_{j,\mathrm{BLUE}}\) is slower, namely \(\operatorname{var}(\hat{\beta})\sim \operatorname{const}\cdot n^{2d-1}\). In contrast to the case of uncorrelated residuals, however, \(\hat{\beta}_{j,\mathrm{LSE}}\) and \(\hat{c}_{jk,\mathrm{LSE}}\) lose efficiency compared to \(\hat{\beta}_{j,\mathrm{BLUE}}\) and \(\hat{c}_{jk,\mathrm{BLUE}}\). This is even true for cases where d=0 but f e is not constant. Note that this is very much in contrast to fixed-design regression under Grenander’s conditions. There, under short memory, \(\hat {\beta}_{j,\mathrm{LSE}}\) (and hence also \(\hat{c}_{jk,\mathrm{LSE}}\)) does not lose efficiency. Here, under the given random design, conditionally on X (and hence also unconditionally), the asymptotic efficiency of \(\hat{c}_{jk,\mathrm{LSE}}=\hat{\beta }_{j,\mathrm{LSE}}-\hat{\beta}_{k,\mathrm{LSE}}\) compared to the best linear unbiased estimator \(\hat{c}_{jk,\mathrm{BLUE}}=\hat{\beta}_{j,\mathrm{BLUE}}-\hat{\beta}_{k,\mathrm{BLUE}}\) can be written as

$$\mathit{eff}(\hat{c}_{jk,\mathrm{LSE}})= \biggl[ \frac{\sigma_{e}^{2}}{ ( 2\pi )^{2}}\int _{-\pi}^{\pi}\frac{1}{f_{e}(\lambda)}\,d\lambda \biggr]^{-1}. $$

Note that although the result was derived originally for d>0 only and d=0 under the given assumptions, analogous arguments lead to (7.44) for d<0.

Example 7.18

For e t generated by a FARIMA(0,d,0) process with variance \(\sigma_{e}^{2}=1\), we have

Using the equality \(\int_{-\pi}^{\pi}\vert 1-e^{i\lambda}\vert ^{2d}\,d\lambda=2\pi\varGamma(1+2d)/\varGamma^{2}(1+d)\), we obtain

and the relative asymptotic efficiency

$$\mathit{eff}(\hat{c}_{jk,\mathrm{LSE}})=\frac{ [ \varGamma ( 1-d ) \varGamma ( 1+d ) ]^{2}}{\varGamma ( 1-2d ) \varGamma ( 1+2d ) }. $$

Figure 7.4 shows \(\mathit{eff}(\hat{c}_{jk,\mathrm{LSE}})\) for all values of d. Towards the two extremes \(d\rightarrow\pm\frac{1}{2}\), the efficiency converges to zero. Thus, although the LSE keeps the same rate of convergence, it may be worthwhile using the BLUE when d is far away from zero.

Fig. 7.4

Relative asymptotic efficiency of the LSE of a contrast β j β k compared to the BLUE, as a function of d. The model considered here is a FARIMA(0,d,0) process
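The efficiency curve of Fig. 7.4 can be reproduced numerically from the closed-form expression above. A minimal sketch:

```python
from scipy.special import gamma

def eff_lse_contrast(d):
    # relative asymptotic efficiency of the LSE of a contrast vs. the BLUE,
    # FARIMA(0,d,0) errors (formula of Example 7.18)
    return (gamma(1 - d) * gamma(1 + d)) ** 2 / (gamma(1 - 2 * d) * gamma(1 + 2 * d))

for d in (-0.45, -0.25, 0.0, 0.25, 0.45):
    print(f"d={d:+.2f}  eff={eff_lse_contrast(d):.3f}")
```

The efficiency equals one at d=0 and drops towards zero as d approaches ±1/2.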

Similarly, for restricted and blockwise randomization (designs (b) and (c)) it can be shown that the same asymptotic formulas for V n,LSE hold as under independence (see Künsch et al. 1993). For V n,BLUE this is conjectured to be true.

A possibility of improving the variance of the LSE is to apply blockwise randomization. The reason is that, under design (c), we have

If the autocovariance function γ e (k) is strictly positive and (strictly) monotonically decreasing with limit zero, then \(\sigma_{l}^{2}\) is strictly increasing in l and \(\sigma_{l}^{2}\rightarrow\sigma_{e}^{2}\) (see, e.g. Cochran 1946). Therefore, the smallest variance is expected under blockwise randomization with blocks of length l=p. Note, however, that this does not mean necessarily that, under this design, the efficiency of the LSE (compared to the BLUE) is better.

7.2.2 Some General Results and the Heteroskedastic Case

In this section, we consider a parametric random design regression model given by

$$ Y_{t}=\beta_{0}+\beta_{1}X_{t}+ \sigma(X_{t})e_{t}\quad(t=1,\ldots,n), $$
(7.45)

where σ(⋅) is a positive, deterministic function. As illustrated above, under random design, regression estimators may have a faster rate of convergence than in most fixed design cases. General results including the heteroskedastic case with σ(⋅) not constant can be derived, for instance, under the following conditions:

  • (P1) The sequence X t \((t\in\mathbb{Z})\) is i.i.d.;

  • (P2) The sequence X t \((t\in\mathbb{Z})\) is a linear process

    $$X_{t}=\mu_{X}+\sum_{j=0}^{\infty}b_{j} \xi_{t-j}, $$

    where ξ t (\(t\in\mathbb{Z}\)) are centred, i.i.d. random variables such that \(\operatorname{var}(X_{t})=\sigma_{X}^{2}=1\). Moreover, we assume \(b_{j}=j^{d_{X} -1}L_{b}(j)\), d X ∈(0,1/2). Unless stated otherwise, we assume μ X =0;

  • (E1) The sequence e t (\(t\in\mathbb{Z}\)) is i.i.d.;

  • (E2) The sequence e t (\(t\in\mathbb{Z}\)) is a linear process

    $$e_{t}=\sum_{j=0}^{\infty}a_{j} \varepsilon_{t-j}, $$

    where ε t (\(t\in\mathbb{Z}\)) are centred, i.i.d. random variables, \(\operatorname{var}(\varepsilon_{t})=\sigma_{\varepsilon}^{2}\) and \(a_{j} =j^{d_{e}-1}L_{a}(j)\), d e ∈(0,1/2).

Let f X and f e be the spectral densities of X t and e t , respectively. Under (P2) and (E2), we have \(f_{X}(\lambda)=|\lambda|^{-2d_{X}}L_{f_{X}}(\lambda^{-1})\), \(f_{e}(\lambda)=|\lambda|^{-2d_{e}}L_{f_{e}}(\lambda^{-1})\), where the functions \(L_{f_{X}}\) and \(L_{f_{e}}\) are slowly varying at infinity. Furthermore,

$$\operatorname{var} \Biggl( n^{-1}\sum_{t=1}^{n}e_{t} \Biggr) \sim n^{2d_{e}-1}L_{e}(n),\qquad {\operatorname{var}} \Biggl( n^{-1}\sum_{t=1}^{n}X_{t} \Biggr) \sim n^{2d_{X}-1}L_{X}(n), $$

where

(7.46)
(7.47)

Recall also that (see Sect. 4.2.4)

$$ n^{\frac{1}{2}-d_{e}}L_{e}^{-1/2}(n)\frac{1}{n}\sum_{t=1}^{n}e_{t}\overset{\mathrm{d}}{\rightarrow}Z_{0},\qquad n^{\frac{1}{2}-d_{X}}L_{X}^{-1/2}(n)\frac{1}{n}\sum_{t=1}^{n}X_{t}\overset{\mathrm{d}}{\rightarrow}Z_{1}, $$
(7.48)

where Z 0 and Z 1 are standard normal random variables. Throughout this section, it is also assumed that the sequences X t and e t (\(t\in\mathbb{Z}\)) are mutually independent (the results are not applicable otherwise, see Sect. 7.2.5). Thus, Z 0 and Z 1 are independent. We recall also that

$$ E[e_{0}e_{k}]=\gamma_{e}(k)\sim k^{2d_{e}-1}L_{a}^{2}(k) \sigma_{\varepsilon}^{2}\int_{0}^{\infty} \bigl(u+u^{2}\bigr)^{d_{e}-1}\,du\quad(k\rightarrow\infty). $$
(7.49)

We start our discussion with the classical least squares estimator (LSE), which leads to

(7.50)
(7.51)

where

$$V_{n}^{2}=\frac{1}{n}\sum _{t=1}^{n}X_{t}^{2}. $$

If \(\sigma_{X}^{2}=1\), then the sample standard deviation V n converges (in probability) to σ X . For the purpose of limit theorems, we can replace \(V_{n}^{2}\) by \(\sigma_{X}^{2}=1\) in the expression for \(\hat{\beta }_{1}\).

As we will see in Theorem 7.12, for stochastic regression, the rate of convergence of \(\hat{\beta}_{0}\) is always influenced by a possible memory in the errors e t . However, the rate of convergence of \(\hat{\beta}_{1}\) depends on the properties of the regressors X t (\(t\in\mathbb{Z}\)), the errors e t (\(t\in\mathbb{Z}\)) and on the function σ(⋅). We start with a simple example.

Example 7.19

Consider the homoskedastic linear regression model without intercept,

$$ Y_{t}=\beta_{1}X_{t}+e_{t} \quad(t=1,\ldots,n), $$
(7.52)

and assume that (P1) and (E2) hold. We note that

$$\operatorname{var} \Biggl( n^{-1}\sum_{t=1}^{n}X_{t}e_{t} \Biggr) =n^{-2}\sum_{t,s=1}^{n}E[X_{t}X_{s}]E[e_{t}e_{s}]=n^{-1} \sigma_{e}^{2}. $$

According to the law of large numbers, \(n^{-1}\sum_{t=1}^{n}X_{t}^{2}\overset{p}{\rightarrow}\sigma_{X}^{2}=1\). Therefore, the asymptotic behaviour of \(\hat{\beta}_{1}-\beta_{1}\) is the same as that of \(n^{-1}\sum_{t=1}^{n}X_{t}e_{t}\). The formula for the variance suggests that \(\hat{\beta}_{1}\) behaves as if the errors e t were uncorrelated. We expect that \(\sqrt {n}(\hat{\beta}_{1}-\beta_{1})\) converges in distribution to a normal random variable; see (7.58) of Theorem 7.13.

Example 7.20

We consider the heteroskedastic linear regression model without intercept:

$$ Y_{t}=\beta_{1}X_{t}+\sigma(X_{t})e_{t} \quad(t=1,\ldots,n). $$
(7.53)

We assume again that (P1) and (E2) hold, and furthermore 0≠E[σ(X 1)X 1]<∞. Then

$$\operatorname{Var} \Biggl( n^{-1}\sum_{t=1}^{n}X_{t} \sigma(X_{t})e_{t} \Biggr) \sim E^{2}\bigl[ \sigma(X_{1})X_{1}\bigr]n^{2d_{e}-1}L_{e}(n) $$

so that the rate of convergence of \(\hat{\beta}_{1}\) is influenced by long memory in e t .

Example 7.21

Consider the homoscedastic model without intercept (7.52) and assume that the errors and predictors fulfill (E2) and (P2), respectively. Since the two sequences are independent, \(\operatorname{var} ( n^{-1}\sum_{t=1}^{n}X_{t}e_{t} ) =n^{-2}\sum_{t,s=1}^{n}\gamma_{X}(t-s)\gamma_{e}(t-s)\). If 2(d e +d X )>1, then this variance is of the order \(n^{2(d_{e}+d_{X})-2}\) (up to a slowly varying factor).

Otherwise, if 2(d e +d X )<1, then the variance is of order n −1. Thus, long memory in both errors and predictors may influence the limiting behaviour of \(\hat{\beta}_{1}\); see Theorem 7.12.

The asymptotic behaviour of the least squares estimators (7.50) and (7.51) is characterized in the following two theorems. These theorems were proven in Guo and Koul (2008) and Kulik and Wichelhaus (2012). The proof is given in Sect. 7.2.3 in a general context of randomly weighted partial sums.

Theorem 7.12

Consider the random design regression model (7.45) and let \(\hat{\beta}_{1}\), \(\hat{\beta}_{0}\) be least squares estimators defined in (7.50) and (7.51).

  • Assume that (P1) or (P2), and (E1) hold. Then

    $$ \sqrt{n}(\hat{\beta}_{0}-\beta_{0})\overset{\mathrm{d}}{\rightarrow}\sqrt{E\bigl[\sigma^{2}(X_{1}) \bigr]\sigma_{e}^{2}}Z_{0} $$
    (7.54)

    and

    $$ \sqrt{n}(\hat{\beta}_{1}-\beta_{1})\overset{\mathrm{d}}{ \rightarrow}\sqrt{E\bigl[\sigma^{2}(X_{1})X_{1}^{2} \bigr]\sigma_{e}^{2}}Z_{1}, $$
    (7.55)

    where Z 0, Z 1 are independent standard normal random variables.

  • Assume that (P1) and (E2) hold. If E[σ(X 1)X 1]≠0, then

    $$ n^{\frac{1}{2}-d_{e}}L_{e}^{-1/2}(n) (\hat{\beta}_{1}- \beta_{1})\overset {\mathrm{d}}{\rightarrow}E\bigl[ \sigma(X_{1})X_{1}\bigr]Z_{0} $$
    (7.56)

    and

    $$ n^{\frac{1}{2}-d_{e}}L_{e}^{-1/2}(n) (\hat{\beta}_{0}- \beta_{0})\overset {\mathrm{d}}{\rightarrow}E\bigl[ \sigma(X_{1})\bigr]Z_{1}, $$
    (7.57)

    where Z 0, Z 1 are independent standard normal random variables.

  • Assume that (P2) and (E2) hold and that X t , e t are Gaussian. If E[σ(X 1)X 1]≠0, then (7.56) and (7.57) hold.

If E[σ(X 1)X 1]=0, then the limiting behaviour of LS estimators changes.

Theorem 7.13

Consider the random design regression model (7.45) and let \(\hat{\beta}_{1}\), \(\hat{\beta}_{0}\) be LS estimators defined in (7.50) and (7.51). Assume that (P1) or (P2) and (E2) hold with E[σ(X 1)X 1]=0 and that X t , e t are Gaussian.

  • If 2(d X +d e )>1 and \(E[\sigma(X_{1})X_{1}^{2}]<\infty\), then

    $$ n^{1-(d_{e}+d_{X})}\bigl(L_{f_{X}}(n)L_{f_{e}}(n) \bigr)^{-1/2}(\hat{\beta}_{1}-\beta_{1})\overset{ \mathrm{d}}{\rightarrow}E\bigl[\sigma(X_{1})X_{1}^{2} \bigr]Z_{1,1} $$
    (7.58)

    where the random variable Z 1,1 is defined in (7.63).

  • If 2(d X +d e )<1 and \(E[\sigma^{2}(X_{1})X_{1}^{2}]<\infty\), then

    $$ \sqrt{n}(\hat{\beta}_{1}-\beta_{1})\overset{\mathrm{d}}{ \rightarrow}N\bigl(0,C_{0}^{2}\bigr), $$
    (7.59)

    where \(C_{0}^{2}=\lim_{n\rightarrow\infty}\sum_{k=0}^{\infty}E [ X_{0}\sigma(X_{0})X_{k}\sigma(X_{k}) ] E [ \varepsilon_{0}\varepsilon_{k} ] \).

Of course, the LSE is not the only possible method. In the homoscedastic model without intercept it is possible to remove the dependence in e t first before estimating β 1. This way one can achieve \(\sqrt{n}\)-convergence. This is the case by definition for the BLUE. An alternative method that does not require inversion of the covariance matrix was suggested by Robinson and Hidalgo (1997). Thus, consider the homoscedastic regression model (7.52). Assume that (P2) and (E2) hold, possibly with μ X ≠0. Define the following weighted least squares estimator of β 1:

$$\hat{\beta}_{\phi,\mathrm{\mathrm{LSE}}}=\frac{\frac{1}{n}\sum_{t=1}^{n}\sum_{s=1}^{n}(X_{t}-\bar{x})(Y_{s}-\bar{y})\phi_{t-s}}{\frac{1}{n}\sum_{t=1}^{n}\sum_{s=1}^{n}(X_{t}-\bar{x})(X_{s}-\bar{x})\phi_{t-s}}, $$

where

$$\phi_{j}=\frac{1}{(2\pi)^{2}}\int_{-\pi}^{\pi} \phi(\lambda)\cos(j\lambda )\,d\lambda, $$

and ϕ(⋅) is some function such that \(\phi_{j}=O(j^{-\gamma})\), γ≥2d e +1. This holds in particular if \(\phi=f_{e}^{-1}\) is the reciprocal of the spectral density of e t (\(t\in\mathbb{Z}\)). One can verify that

$$\operatorname{var} \Biggl( \frac{1}{n}\sum_{t=1}^{n} \sum_{s=1}^{n}(X_{t}-\bar{x}) (Y_{s}-\bar{y})\phi_{t-s} \Biggr) =O \bigl(n^{-1}\bigr). $$

Consequently, the asymptotic variance of \(\hat{\beta}_{\phi,\mathrm{LSE}}\) is not influenced by LRD in X t or e t . This observation leads to the following result, proven in Robinson and Hidalgo (1997).

Theorem 7.14

Consider the model (7.52). Assume that (P2) and (E2) hold. Under appropriate technical conditions,

$$\sqrt{n} ( \hat{\beta}_{\phi,\mathrm{LSE}}-\beta_{1} ) \overset{ \mathrm{d}}{\rightarrow}N\bigl(0,\varSigma_{\phi}^{-1} \varSigma_{\psi}\varSigma_{\phi}^{-1}\bigr) , $$

where ψ(λ)=ϕ 2(λ)f e (λ) and we use the notation \(\varSigma_{h}=(2\pi)^{-1}\int_{-\pi}^{\pi}h(\lambda)\,d\lambda\) for h=ψ,ϕ.

The “appropriate technical conditions” are in particular continuity of ψ(⋅) and independence between errors and predictors. Moreover, it has to be mentioned that \(\sqrt{n}\)-consistency does not hold, in general, in the heteroskedastic case. To see this, assume for simplicity that (P1) holds and μ X =0. Then

$$\operatorname{var} \Biggl( \frac{1}{n}\sum_{t=1}^{n} \sum_{s=1}^{n}X_{t} \sigma(X_{t})e_{s}\phi_{t-s} \Biggr) \sim\phi_{0}^{2}E^{2} \bigl[ \sigma(X_{1})X_{1} \bigr] \operatorname{var} \Biggl( \frac{1}{n}\sum_{t=1}^{n}e_{t} \Biggr) . $$
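For the homoscedastic model (7.52), the weighted estimator \(\hat{\beta}_{\phi,\mathrm{LSE}}\) is straightforward to evaluate once ϕ is chosen. The following rough sketch takes ϕ=f e −1 for FARIMA(0,d e ,0) errors and computes the Fourier coefficients ϕ j by simple quadrature; the data-generating step uses i.i.d. placeholders only to keep the example short:

```python
import numpy as np

def phi_weights(d_e, max_lag, n_grid=4096):
    """Fourier coefficients phi_j of phi = 1/f_e for FARIMA(0, d_e, 0) errors
    (unit innovation variance), by a simple Riemann sum over (0, pi)."""
    lam = np.linspace(1e-6, np.pi, n_grid)
    dlam = lam[1] - lam[0]
    inv_f = 2 * np.pi * (2 * np.sin(lam / 2)) ** (2 * d_e)        # 1 / f_e(lambda)
    j = np.arange(max_lag + 1)
    # phi_j = (2 pi)^{-2} * integral_{-pi}^{pi} phi(lambda) cos(j lambda) d lambda (even integrand)
    return 2 * (np.cos(np.outer(j, lam)) @ inv_f) * dlam / (2 * np.pi) ** 2

def weighted_lse(x, y, d_e):
    """Weighted least squares estimator of the slope, cf. the formula above."""
    n = len(x)
    phi = phi_weights(d_e, n - 1)
    xc, yc = x - x.mean(), y - y.mean()
    Phi = phi[np.abs(np.subtract.outer(np.arange(n), np.arange(n)))]   # matrix of phi_{t-s}
    return (xc @ Phi @ yc) / (xc @ Phi @ xc)

rng = np.random.default_rng(1)
n, beta1, d_e = 500, 2.0, 0.3
x = rng.standard_normal(n)
e = rng.standard_normal(n)   # placeholder; in the model above e_t would be a FARIMA(0,d_e,0) series
y = beta1 * x + e
print(weighted_lse(x, y, d_e))
```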

Finally, we consider again the model (7.52) and the following estimators:

$$\hat{\beta}_{R}:=\sum_{t=1}^{n}Y_{t}\Big/ \sum_{t=1}^{n}X_{t}$$

and

$$\hat{\beta}_{\mathrm{BLUE}}=\bigl(X^{T}\varSigma^{-1}X \bigr)^{-1}X^{T}\varSigma^{-1}Y, $$

with column vectors X=(X 1,…,X n )′, Y=(Y 1,…,Y n )′, respectively, and Σ being the covariance matrix of e 1,…,e n . The following result (under a slightly different set of assumptions) was proven in Choy and Taniguchi (2001).

Theorem 7.15

Consider the model (7.52). Assume that (P2) and (E2) hold and that μ X =E[X 1]≠0. Then

$$n^{1/2-d_{e}}L_{e}^{-1/2}(n) (\hat{\beta}_{R}- \beta_{1})\overset{\mathrm{d}}{\rightarrow} \mu_{X}^{-1}Z_{0} $$

and

$$\sqrt{n}(\hat{\beta}_{\mathrm{BLUE}}-\beta_{1})\overset{ \mathrm{d}}{\rightarrow}CZ_{0}, $$

where \(C^{-1}=(2\pi)^{-1}\int_{-\pi}^{\pi}f_{e}^{-1}(\lambda)f_{X}(\lambda)\,d\lambda\).

Proof

We prove only the convergence of \(\hat{\beta}_{R}\). We have

$$\hat{\beta}_{R}-\beta_{1}=\frac{n^{-1}\sum_{t=1}^{n}e_{t}}{n^{-1}\sum_{t=1}^{n}X_{t}}. $$

By the law of large numbers, we may replace the denominator by μ X . The convergence of the numerator, and hence of \(\hat{\beta}_{R}\), follows from (7.48). □

By definition, \(\hat{\beta}_{\mathrm{BLUE}}\) is better than \(\hat{\beta}_{R}\) and \(\hat{\beta}_{\mathrm{LSE}}\) (in the sense of a smaller variance of the asymptotic distribution). However, in the heteroskedastic case, Σ is the covariance matrix of σ(X 1)e 1,…,σ(X n )e n . This involves knowledge of σ(⋅). In most situations with heteroskedastic errors, one may therefore prefer to use the LSE.

7.2.3 Randomly Weighted Partial Sums

Asymptotic results in the context of regression with stochastic explanatory variables are usually based on limit theorems for weighted sums, where weights are stochastic. It is therefore useful to consider such sums in general. Thus let

$$ R_{n}:=\frac{1}{n}\sum_{t=1}^{n} \nu(X_{t})e_{t} $$
(7.60)

where ν(⋅) is a deterministic function such that E[ν(X t )]≠0. Also, define the σ-algebras generated by the two sequences up to time t. The following properties will be used under different combinations of (E1), (E2), (P1) and (P2) (we used some of these properties also in Sect. 5.14 on density estimation):

  • (M) If (E1) holds, then R n (n≥1) is a martingale with respect to a suitable increasing sequence of σ-fields.

  • (M/L) If (P1) holds, we use the decomposition

    (7.61)

    The first part is a martingale, so that its convergence with scaling \(\sqrt {n}\) can be described by an appropriate martingale central limit theorem. Furthermore, the second sum equals \(E[\nu(X_{1})]\,n^{-1}\sum_{t=1}^{n}e_{t}\), i.e. it is just a scaled sum of the long-memory moving averages, and its asymptotic behaviour is the same as that of \(\sum_{t=1}^{n}e_{t}\) (cf. (7.48)).

    We will call the second term the LRD part. It contributes (and dominates) only if E[ν(X 1)]≠0.

  • (H) In general, under (E2) and (P2), we assume for simplicity that X t are standard Gaussian. We decompose R n as

    $$ R_{n}=E\bigl[\nu(X_{1})\bigr]\frac{1}{n}\sum _{t=1}^{n}e_{t}+\sum _{m=1}^{\infty}\frac{J(m)}{m!} \frac{1}{n}\sum_{t=1}^{n}e_{t}H_{m}(X_{t}), $$
    (7.62)

    where J(m) is the mth Hermite coefficient of z↦ν(z). If E[ν(X 1)]≠0, then the first term dominates, and convergence of R n is equivalent to convergence of the sum \(n^{-1}\sum_{t=1}^{n}e_{t}\). Indeed, let us note that from Lemma 3.5 the random variables H m (X t ) (m≥1) are uncorrelated. Since the sequences X t and e t are independent, we have for each m≠k and all t,s,

    $$\mathit{cov}\bigl(H_{m}(X_{t})e_{t},H_{k}(X_{s})e_{s} \bigr)=E\bigl(H_{m}(X_{t})H_{k}(X_{s}) \bigr)E(e_{t}e_{s})=0. $$

    Thus,

    $$\operatorname{var} \Biggl( \sum_{m=1}^{\infty} \frac{J(m)}{m!}\frac{1}{n}\sum_{t=1}^{n}e_{t}H_{m}(X_{t}) \Biggr) =\sum _{m=1}^{\infty}\frac{J^{2}(m)}{(m!)^{2}}\operatorname{var} \Biggl( \frac{1}{n}\sum_{t=1}^{n}e_{t}H_{m}(X_{t}) \Biggr) . $$

    Furthermore, for a given \(m\in\mathbb{N}\) we have

    where L is a slowly varying function.

These decompositions provide a general framework that will be used several times. In particular, we will use it to prove Theorem 7.12. We note, however, that the situation with E[σ(X 1)X 1]=0 and (E2) is not covered by any of these cases. To study this situation, we shall consider

$$T_{n}:=n^{-1}\sum_{t=1}^{n}X_{t}e_{t}$$

directly, assuming (P2), (E2), and also that X t , e t (\(t\in\mathbb{Z}\)) are two independent centred Gaussian sequences. We recall some spectral theory from Sect. 4.1.3, see also proof of Theorem 4.2. The innovation processes ξ t and ε t have the spectral representation

$$\xi_{t}=\frac{1}{\sqrt{2\pi}}\int_{-\pi}^{\pi}e^{it\lambda}\,dM_{0,\xi}(\lambda),\qquad\varepsilon_{t}=\frac{1}{\sqrt{2\pi}}\int _{-\pi}^{\pi }e^{it\lambda}\,dM_{0,\varepsilon}( \lambda)\quad(t\in\mathbb{Z}), $$

where M 0,ξ and M 0,ε are two independent complex-valued Gaussian random measures with independent increments such that \(E[|dM_{0,\xi }(\lambda)|^{2}]=\sigma_{\xi}^{2}\,d\lambda\), \(E[|dM_{0,\varepsilon}(\lambda )|^{2}] =\sigma_{\varepsilon}^{2}\,d\lambda\). Furthermore,

$$X_{t}=\int_{-\pi}^{\pi}e^{it\lambda}\,dM_{X}( \lambda),\qquad e_{t}=\int_{-\pi }^{\pi}e^{it\lambda}\,dM_{e}( \lambda), $$

where

Repeating the same argument as in the proof of Theorem 4.2,

If f X and f e are spectral densities of the two sequences, respectively, then by taking

$$b(\lambda)=L_{f_{X}}^{1/2}\bigl(\lambda^{-1}\bigr)| \lambda|^{-d_{X}},\qquad a(\omega)=L_{f_{e}}^{1/2}\bigl( \omega^{-1}\bigr)|\omega|^{-d_{e}}, $$

we may conclude for d X +d e >1/2 that

$$ \begin{aligned}[b] &n^{1-(d_{X}+d_{e})}\bigl(L_{f_{X}}(n)L_{f_{e}}(n) \bigr)^{-1/2}T_{n} \\ &\quad \overset{\mathrm{d}}{ \rightarrow}\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} \frac{1}{|\lambda|^{d_{X}}}\frac{1}{|\omega|^{d_{e}}} \frac{e^{i(\lambda+\omega)}}{i(\lambda+\omega)}\,dM_{0,\xi}(\lambda)\,dM_{0,\varepsilon}( \omega)=:Z_{1,1}. \end{aligned} $$
(7.63)

Having this general framework, we are ready to prove Theorems 7.12 and 7.13.

Proof of Theorem 7.12

Recall the formulas (7.50) and (7.51) for \(\hat{\beta}_{1}\) and \(\hat{\beta}_{0}\), and also that we may replace \(V_{n}^{2}\) by \(\sigma_{X}^{2}=1\).

1. If (E1) holds, i.e. the errors are i.i.d., we apply the (M)-decomposition to (7.60) with ν(X t )=σ(X t )X t and ν(X t )=σ(X t ), respectively. The martingale central limit theorem (Lemma 4.2) yields (7.54) and (7.55).

2. If (P1) and (E2) hold and E[σ(X 1)X 1]≠0, then we apply the (M/L)-decomposition to (7.60) with ν(X t )=σ(X t )X t . The limiting behaviour of \(\hat{\beta}_{1}-\beta_{1}\) is determined by

(7.64)

Similarly, the limiting behaviour of \(\hat{\beta}_{0}-\beta_{0}\) is determined by

(7.65)

We conclude (7.56) and (7.57). Independence of the limiting random variables follows from

$$\mathit{cov} ( \hat{\beta}_{1},\hat{\beta}_{0} ) \rightarrow0. $$

3. Under the conditions (E2) and (P2), and E[σ(X 1)X 1]≠0, we apply (7.62) to ν(X t )=σ(X t )X t and to ν(X t )=σ(X t ). Convergence of the regression estimates can be concluded the same way as under (P1) and (E2). □

Proof of Theorem 7.13

Under the conditions (E2), (P2) and E[σ(X 1)X 1]=0, we apply the (H)-decomposition (7.62) with ν(X t )=σ(X t )X t . Since E[ν(X 1)]=0, the limiting behaviour of \(\hat{\beta}_{1}-\beta_{1}\) is determined by

$$J(1)\frac{1}{n}\sum_{t=1}^{n}X_{t}e_{t}+ \sum_{m=2}^{\infty}\frac{J(m)}{m!}\frac{1}{n}\sum_{t=1}^{n}e_{t}H_{m}(X_{t}), $$

where \(J(1)=E[\sigma(X_{1})X_{1}^{2}]\) is the first Hermite coefficient of ν(z)=σ(z)z. Clearly, the first part dominates. Applying (7.63),

$$ n^{1-(d_{e}+d_{X})}\bigl(L_{f_{X}}(n)L_{f_{e}}(n) \bigr)^{-1/2}(\hat{\beta}_{1}-\beta_{1})\overset{ \mathrm{d}}{\rightarrow}J(1)Z_{1,1}. $$
(7.66)

 □

Finally, it is worth mentioning another possibility. Consider assumptions (P2) and (E2), but with the modification μ X ≠0 and instead of E[σ(X 1)X 1]=0 (which was used in Theorem 7.13) the condition E[σ(X 1)(X 1−μ X )]=0. Then, the estimator of β 1 has to be replaced by

$$ \hat{\beta}_{1}-\beta_{1}=\frac{1}{V_{n}^{2}} \Biggl( \frac{1}{n}\sum_{t=1}^{n}X_{t} \sigma(X_{t})e_{t}-\frac{1}{n}\sum _{t=1}^{n}X_{t}\frac{1}{n}\sum_{t=1}^{n}\sigma(X_{t})e_{t} \Biggr) , $$
(7.67)

with \(V_{n}^{2}=n^{-1}\sum_{t=1}^{n}(X_{t}-\bar{x})^{2}\). Again, we may replace \(V_{n}^{2}\) by \(\sigma_{X}^{2}=1\) asymptotically. Applying the (H)-decomposition to \(n^{-1}\sum_{t=1}^{n}\sigma(X_{t})e_{t}\) yields

$$\frac{1}{n}\sum_{t=1}^{n} \sigma(X_{t})e_{t}=E\bigl[\sigma(X_{t})\bigr] \frac{1}{n}\sum_{t=1}^{n}e_{t}+ \sum_{m=1}^{\infty}\frac{J^{\ast}(m)}{m!} \frac{1}{n}\sum_{t=1}^{n}e_{t}H_{m}(X_{t}), $$

where now J (m)=E[σ(X 1)H m (X 1)]. As in the proof of Theorem 7.13 (see also proof of Theorem 4.2),

$$n^{\frac{1}{2}-d_{e}}L_{f_{e}}^{-1/2}(n)\frac{1}{n}\sum _{t=1}^{n}e_{t}\overset{d}{\rightarrow}Z_{0},\qquad n^{\frac{1}{2}-d_{X}}L_{f_{X}}^{-1/2}(n) \frac{1}{n}\sum_{t=1}^{n}X_{t} \overset{d}{\rightarrow}Z_{1}, $$

where Z 0 and Z 1 are independent and standard normal. Independence is clear since E[X t σ(X s )e s ]=0 for all s,t. Combining this with (7.66), we obtain

$$n^{1-(d_{e}+d_{X})}\bigl(L_{f_{X}}(n)L_{f_{e}}(n) \bigr)^{-1/2}(\hat{\beta}_{1}-\beta_{1})\overset{d}{\rightarrow} \bigl( J(1)Z_{1,1}-E\bigl[\sigma(X_{1}) \bigr]Z_{0}Z_{1} \bigr) . $$

7.2.4 Spurious Correlations

So far it has been assumed that the explanatory variable(s) X t and the residual process e t are stationary. In practice, this is not always clear. In some applications, such as financial time series, it is, in fact, often more likely that none of the observed series is stationary. This is known to cause considerable problems for regression, even without introducing the complication of long memory or antipersistence. For instance, Granger and Newbold (1974) and Phillips (1986) considered two independent random walks

$$X_{t}=\sum_{j=1}^{t} \xi_{j},\qquad Y_{t}=\sum_{j=1}^{t} \eta_{j}, $$

i.e. with ξ j , η j , i.i.d. and independent of each other. Suppose we set up an equation of the form

$$Y_{t}=\beta X_{t}+e_{t}$$

with e t zero mean stationary. Since e t is stationary but Y t and X t are not, we certainly cannot have β=0. Of course, the model is misspecified. However, in practice we do not know that. The problem is then to see what happens if we actually fit a linear regression to the xy-observations. For instance, if \(\xi_{t}\sim N(0,\sigma_{\xi}^{2})\) and \(\eta_{t}\sim N(0,\sigma_{\eta}^{2})\) (for simplicity we set \(\sigma_{\xi}=\sigma_{\eta}=1\)), then \(\sum_{s=1}^{t}\xi_{s}=_{d}B_{1}(t)\), \(\sum_{s=1}^{t}\eta_{s}=_{d}B_{2}(t)\) where B 1, B 2 are two Brownian motions that are independent from each other. Hence,

$$\sum_{t=1}^{n}X_{t}Y_{t}\underset{d}{=}\sum_{i=1}^{n}B_{1}(u_{i}n)B_{2}(u_{i}n)\underset{d}{=}n^{2}\sum_{i=1}^{n}B_{1}(u_{i})B_{2}(u_{i})\frac{1}{n}, $$

where u i =i/n, so that

$$n^{-2}\sum X_{t}Y_{t}\underset{d}{ \rightarrow}\int_{0}^{1}B_{1}(u)B_{2}(u)\,du. $$

Similarly,

$$\sum_{t=1}^{n}X_{t}^{2} \underset{d}{=}n\sum_{i=1}^{n}B_{1}^{2}(u_{i})=n^{2}\sum_{i=1}^{n}B_{1}^{2}(u_{i}) \frac{1}{n}$$

implies

$$n^{-2}\sum_{t=1}^{n}X_{t}^{2} \underset{d}{\rightarrow}\int_{0}^{1}B_{1}^{2}(u)\,du. $$

Thus,

$$\hat{\beta}_{\mathrm{LSE}}=\frac{\sum X_{t}Y_{t}}{\sum X_{t}^{2}}\underset{d}{\rightarrow} \frac{\int_{0}^{1}B_{1}(u)B_{2}(u)\,du}{\int_{0}^{1}B_{1}^{2}(u)\,du}. $$

In other words, instead of tending to zero, \(\hat{\beta}_{\mathrm{LSE}}\) tends to a random variable that is not equal to zero with probability one. This means that, if a regression of Y on X is carried out, we will (for n large enough) always find a relationship even though it is not there. This is a famous phenomenon in econometrics, known as ‘spurious correlation’ or ‘spurious regression’. Initiated by Granger and others, methods for determining the relationship between integrated time series have become an extended branch of the econometric literature, mostly subsumed under the label ‘cointegration’.
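The phenomenon is easy to reproduce by simulation. The sketch below regresses one simulated random walk on another, independent one (no intercept, matching \(\hat{\beta}_{\mathrm{LSE}}=\sum X_{t}Y_{t}/\sum X_{t}^{2}\) above); the spread of \(\hat{\beta}_{\mathrm{LSE}}\) across replications does not shrink as n grows:

```python
import numpy as np

rng = np.random.default_rng(42)

def slope_no_intercept(n):
    """beta_hat = sum(X_t Y_t) / sum(X_t^2) for two independent random walks."""
    x = np.cumsum(rng.standard_normal(n))
    y = np.cumsum(rng.standard_normal(n))
    return (x @ y) / (x @ x)

for n in (100, 1000, 10000):
    betas = np.array([slope_no_intercept(n) for _ in range(500)])
    print(f"n={n:6d}  sd(beta1_hat)={betas.std():.3f}")   # stays away from zero for all n
```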

Results on spurious correlations can be generalized to long-memory processes. For instance, Tsai (2006) and Tsay and Chung (2000) consider the following situation. Let η t and ξ t be i.i.d. and independent of each other, E(η t )=E(ξ t )=0, \(\operatorname{var}(\eta_{t})=\sigma_{\eta}^{2}\) and \(\operatorname {var}(\xi_{t})=\sigma_{\xi}^{2}\). Furthermore, define the FARIMA processes

$$v_{t}= ( 1-B )^{-d_{1}}\eta_{t},\qquad w_{t}= ( 1-B )^{-d_{2}}\xi_{t}$$

with \(0<d_{1},d_{2}<\frac{1}{2}\), and the corresponding integrated processes, i.e. the FARIMA(0,1+d 1,0) and FARIMA(0,1+d 2,0) processes (starting at zero for t=0),

$$v_{t}^{\ast}=\sum_{s=1}^{t}v_{s},\qquad w_{t}^{\ast}=\sum_{s=1}^{t}w_{s}. $$

Now we consider \(\hat{\beta}_{\mathrm{LSE}}\) for the following regressions with intercept,

$$Y_{t}=\beta_{0}+\beta_{1}X_{t}+e_{t},$$

where X t , Y t are defined as follows:

  • Model 1: \(Y_{t}=v_{t}^{\ast}\), \(X_{t}=w_{t}^{\ast}\);

  • Model 2: Y t =v t , X t =w t with \(d_{1}+d_{2}>\frac{1}{2}\);

  • Model 3: \(Y_{t}=v_{t}^{\ast}\), X t =w t with d 2>0;

  • Model 4: Y t =v t , \(X_{t}=w_{t}^{\ast}\) with d 1>0;

  • Model 5: \(Y_{t}=v_{t}^{\ast}\) on X t =t;

  • Model 6: Y t =v t on X t =t with d 1>0.

Table 7.1 gives an overview. The following notation will be used:

Moreover, \(s^{2}=(n-2)^{-1}\sum_{t=1}^{n} ( y_{t}-\hat{y}_{t} )^{2}\) will denote the usual estimate of the variance of Y t (note, however, that for a nonstationary Y t , \(\sigma _{y}^{2}\) grows with t, i.e. the estimate s 2 is actually meaningless) and similarly, \(s_{\beta_{0}}^{2}\) and \(s_{\beta_{1}}^{2}\) are the usual estimates of \(\operatorname {var}(\hat{\beta}_{0})\) and \(\operatorname{var}(\hat{\beta}_{1})\). Finally, \(t_{\beta_{0}}=\hat{\beta }_{0}/s_{\beta_{0}}\) and \(t_{\beta_{1}}=\hat{\beta}_{1}/s_{\beta_{1}}\) are the corresponding t-statistics for β 0 and β 1. For simplicity of presentation, we assume all moments of η t and ξ t to be finite.

Table 7.1 Models considered in the context of spurious correlation

For Model 1, the limit theorems in Sect. 4.2 can be applied to obtain

with

$$c_{j}=\frac{\varGamma( 1-2d_{j} ) }{ ( 1+2d_{j} ) \varGamma( 1+d_{j} ) \varGamma( 1-d_{j} ) }\quad(j=1,2). $$

Assume for a moment that our FARIMA sequences v t and w t are replaced by fGn, i.e. increments of two independent fractional Brownian motions \(B_{H_{1}}\), \(B_{H_{2}}\) with \(H_{j}=d_{j}+\frac{1}{2}\). Then

$$\sum_{t=1}^{n}X_{t}=_{d} \sum_{t=1}^{n}B_{H_{2}}(t)=_{d}n^{1+H_{2}} \sum_{t=1}^{n}B_{H_{2}} \biggl( \frac{t}{n} \biggr) \frac{1}{n}, $$

and an analogous embedding applies to \(\sum_{t=1}^{n}Y_{t}\). Similarly, we can consider the other quantities in \(\hat{\beta}_{\mathrm{LSE}}\), including \(\sum_{t=1}^{n}X_{t}Y_{t}\) and \(\sum_{t=1}^{n}X_{t}^{2}\):

$$\sum_{t=1}^{n}X_{t}Y_{t}=_{d} \sum_{t=1}^{n}B_{H_{1}}(t)B_{H_{2}}(t)=_{d}n^{1+H_{1}+H_{2}}\sum_{t=1}^{n}B_{H_{1}} \biggl( \frac{t}{n} \biggr) B_{H_{2}} \biggl( \frac{t}{n} \biggr) \frac{1}{n}. $$

Using the notation

$$\int_{0}^{1}B_{H_{i}}(u)B_{H_{j}}(u)\,du=Z_{i,j}, \qquad\int_{0}^{1}B_{H_{i}}(u)\,du=Z_{i},$$

we have

and similarly,

$$n^{-(1+2H_{2})}\sum_{t=1}^{n}X_{t}^{2}=n^{-(2+2d_{2})} \sum_{t=1}^{n}X_{t}^{2} \rightarrow_{d}\int_{0}^{1}B_{H_{2}}^{2}(u)\,du=Z_{2,2}. $$

All asymptotic limits can be considered jointly. Since

we obtain

$$n^{d_{2}-d_{1}}\hat{\beta}_{1}\rightarrow_{d} \frac{Z_{1,2}-Z_{1}Z_{2}}{Z_{2,2}-Z_{2}^{2}}=: \beta_{1}^{\ast}. $$

Similar arguments apply to the other regression quantities of interest, and (due to convergence to fGn in D[0,1]) we may state the following result for general FARIMA models:

Theorem 7.16

Assume that the FARIMA processes have all moments finite. Then, under Model 1,

For related results, also see, e.g. Phillips (1995), Phillips and Loretan (1991), Marmol (1995), Jeganathan (1999), Robinson and Marinucci (2003, 2003), Buchmann and Chan (2007). Theorem 7.16 can be interpreted as follows. Model 1 deals with the case where Y t and X t are both integrated processes, independent of each other and such that the first difference exhibits (stationary) long memory. The estimated intercept \(\hat{\beta}_{0}\) always diverges. For the slope, it is more complicated. If long memory in the dependent variable Y t is at least as strong as in X t (i.e. d 1≥d 2) then the estimated slope \(\hat{\beta}_{1}\) does not converge to zero. In particular, if d 1=d 2, we have spurious correlation in the standard sense, namely \(\hat{\beta }_{1}\) converges to a non-constant random variable. If d 1>d 2, then \(\hat{\beta}_{1}\) assumes asymptotically the values ±∞ only. If X t has stronger long memory than Y t , then \(\hat{\beta}_{1}\) does converge to zero; however, at a very slow rate. What is even worse is that the R 2-statistic does not converge to zero, irrespective of the concrete values of d 1 and d 2. Furthermore, we also have spurious correlation at a second-order level for all values of d 1,d 2>0, in the sense that the usual t-tests for β 0 and β 1 asymptotically reject the null hypothesis that these parameters are zero.

Example 7.22

Figures 7.5(a)–(f) display simulated distributions and boxplots of \(\hat{\beta}_{1}\) for the cases d 1=d 2=0.4 and d 1=0.1, d 2=0.4, respectively, and sample sizes n=20,50,100,200,400,1000 and 2000. As expected from Theorem 7.16, the results for the two cases are very different. In case 2, the distribution of \(\hat{\beta}_{1}\) (Figs. 7.5(d)–(e)) is increasingly concentrated around the true value of β 1 as n grows. In case 1, however, the distribution remains essentially the same (Figs. 7.5 (a)–(b)). For R 2, the behaviour is the same in both cases. As expected from the asymptotic result, the distribution of R 2 stabilizes at a nondegenerate level (Figs. 7.5(c) and (f)). In other words, one is led to believe that there is a linear relationship between the two series, although in reality they are independent of each other.

Fig. 7.5

Simulated distributions and boxplots of \(\hat{\beta}_{1}\) in a regression of two independent integrated FARIMA(0,d,0) processes with d 1=d 2=0.4 ((a) and (b)) and d 1=0.1, d 2=0.4 ((d) and (e)), respectively. The sample sizes are n=20,50,100,200,400,1000 and 2000. Also shown are boxplots of the R 2-statistic ((c) and (f), respectively)
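A reduced version of the simulation underlying Fig. 7.5 can be sketched as follows; the FARIMA(0,d,0) series are generated via a truncated MA(∞) representation (an approximation, not necessarily the exact simulation method used for the figure):

```python
import numpy as np
from scipy.signal import lfilter

rng = np.random.default_rng(7)

def farima0d0(n, d, trunc=500):
    """Approximate FARIMA(0,d,0) via a truncated MA(inf) representation."""
    j = np.arange(1, trunc)
    psi = np.concatenate(([1.0], np.cumprod((j - 1 + d) / j)))  # psi_j = Gamma(j+d)/(Gamma(j+1)Gamma(d))
    eps = rng.standard_normal(n + trunc)
    return lfilter(psi, [1.0], eps)[trunc:]

def spurious_fit(n, d1, d2):
    y = np.cumsum(farima0d0(n, d1))   # integrated FARIMA(0, 1+d1, 0)
    x = np.cumsum(farima0d0(n, d2))
    xc, yc = x - x.mean(), y - y.mean()
    beta1 = (xc @ yc) / (xc @ xc)
    r2 = (xc @ yc) ** 2 / ((xc @ xc) * (yc @ yc))
    return beta1, r2

for d1, d2 in ((0.4, 0.4), (0.1, 0.4)):
    res = np.array([spurious_fit(2000, d1, d2) for _ in range(100)])
    print(f"d1={d1}, d2={d2}:  sd(beta1_hat)={res[:, 0].std():.3f}  "
          f"median R^2={np.median(res[:, 1]):.2f}")
```

In line with Theorem 7.16, the spread of \(\hat{\beta}_{1}\) does not vanish for d 1=d 2=0.4, whereas it is small for d 1=0.1, d 2=0.4; the R 2-statistic stays away from zero in both cases.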

The results for the other models (Models 2 through 6) can be obtained by similar arguments. In the following, only the order of the variables is written down since this is the essential part of the statements. To simplify notation, we will write “\(O_{p}^{\ast} ( n^{\alpha} ) \)” for a random quantity that is equal to n α times a random variable with positive variance. In contrast to Model 1, Model 2 involves the estimated relationship between two stationary long-memory processes. For obvious reasons, the least squares estimators of β 0 and β 1, as well as R 2, do converge to zero (see also (7.58) in Theorem 7.13). However, if \(d_{1}+d_{2}>\frac {1}{2}\), then

$$t_{\beta_{1}}=O_{p}^{\ast} \bigl( n^{d_{1}+d_{2}-\frac{1}{2}} \bigr) . $$

Thus, if the two variables have enough “joint” long memory, then second-order spurious correlations occur in the sense that the usual t-test rejects H 0:β 1=0 asymptotically. Long memory has to be taken into account to obtain correct rejection regions. This is analogous to tests and confidence intervals for the location parameter, as considered in Sect. 5.2.1.

A different result is obtained in Model 3 where a nonstationary series Y t is regressed on a stationary series X t . Here, nonstationarity of the response series alone leads to spurious correlations, as described in the following theorem.

Theorem 7.17

Under Model 3,

Thus, regressing a nonstationary long-memory process on an independent stationary long-memory series leads to spurious correlations in the sense that \(\vert \hat{\beta}_{1}\vert \) diverges to infinity, and the t-test for β 1 needs adjustment. On the other hand, there is no spurious correlation as such because R 2 (which is in the case of simple linear regression equal to the square of the sample correlation) converges to zero. In contrast, regressing a stationary process on a nonstationary series leads to a spurious effect only when considering the (unadjusted) t-test.

Theorem 7.18

Under Model 4,

Thus, apart from the need for an adjustment in the t-test, nothing too serious happens when regressing a stationary series on an unrelated nonstationary one.

The situation is different when fitting a linear trend function to an integrated process:

Theorem 7.19

Under Model 5,

Thus, the t-test and the value of R 2 indicate asymptotically the presence of a linear trend. On the other hand, \(\hat{\beta}_{1}\) itself is asymptotically zero with probability one, but the convergence to zero is very slow. Finally, if the differenced series (i.e. a stationary long-memory process) is regressed on a linear trend, then the only remaining problem is that the t-test would need adjustment. Specifically, one obtains for Model 6

$$t_{\beta_{1}}=O_{p}^{\ast} \bigl( n^{d_{1}} \bigr) . $$

7.2.5 Fractional Cointegration

The problem of spurious correlations leads to the natural question how to recognize which (linear) relationships between observed nonstationary time series are real and which ones are spurious. The original definition of cointegration of random walk type processes (or integrated processes with an integer valued degree of integration) was introduced by Granger (1981, 1983) and further developed in Engle and Granger (1987) and many subsequent papers. Qualitative considerations suggesting that certain nonstationary time series should not drift arbitrarily far apart existed before, for instance, in Davidson et al. (1978). Much later, cointegration was extended to fractionally integrated processes. There is an extended literature on this topic, and fractional cointegration is still somewhat controversial among economists. Here, only a very brief introduction is given.

For simplicity, we consider the bivariate case, i.e. two series Y t and X t . The first step is to specify exactly what kind of nonstationarity is considered. This leads to the notion of integrated processes. There are at least two possible ways of defining such processes, and these definitions are, in fact, quite different (see, e.g. Chen and Hurvich 2009). The first definition was used, for instance, in Velasco (1999a, 1999b), Chen and Hurvich (2003a, 2003b, 2006) and Velasco (2003):

Definition 7.3

A univariate process X t is called I(d) of Type I or integrated of order \(d>-\frac{1}{2}\) if either (a) \(-\frac{1}{2}<d<\frac{1}{2}\), X t is stationary and with spectral density f X (λ)∼c f |λ|−2d (λ→0), or (b) \(d>\frac{1}{2}\) and there is an integer m such that \(-\frac{1}{2}<d^{\ast}=d-m<\frac{1}{2}\) and (1−B)m X t is I(d ).

The second definition was used in Marinucci and Robinson (2000):

Definition 7.4

A univariate process X t (t≥1) is called I(d) of Type II or integrated of order \(d>-\frac{1}{2}\) if, for t≥1,

$$X_{t}=\sum_{j=0}^{t-1}a_{j} \xi_{t-j}=\sum_{j=0}^{\infty}a_{j} \xi_{t-j}^{\ast }= ( 1-B )^{-d}\xi_{t}^{\ast}$$

where ξ t are zero mean i.i.d. with finite variance, \(\xi _{t}^{\ast}=\xi_{t}\cdot1 \{ t\geq1 \} \), and

$$a_{j}=\frac{\varGamma ( j+d ) }{\varGamma ( j+1 ) \varGamma ( d ) }\sim\frac{j^{d-1}}{\varGamma(d)}\quad(j\rightarrow\infty). $$

The second definition may be generalized by imposing the asymptotic condition on a j only. It should be noted that the two definitions are quite different. For \(d>\frac{1}{2}\), both imply a nonstationary process. For \(-\frac{1}{2}<d<\frac{1}{2}\), X t obtained from Definition 7.3 is stationary, whereas this is only the case asymptotically when Definition 7.4 is used. Moreover, different limits for partial sums are obtained. For example, if X t is I(d) according to Definition 7.4 with \(\frac{1}{2}<d<\frac{3}{2}\), then

$$X_{n}=X_{1}^{\ast}+X_{2}^{\ast}+ \cdots+X_{n}^{\ast}$$

where

$$X_{t}^{\ast}= ( 1-B )^{-(d-1)}\xi_{t}^{\ast}, $$

and the partial sums

$$S_{n} ( u ) =\sum_{i=1}^{ [ nu ] }X_{i}^{\ast} \quad(0\leq u\leq1) $$

are such that \(Z_{n}(u)=S_{n} ( u ) /\sqrt{\operatorname{var} ( S_{n} ( 1 ) ) }\) converges to a so-called Type II or Riemann–Liouville fractional Brownian motion (Marinucci and Robinson 2000; also see Akonom and Gourieroux 1987; Silveira 1991) which is defined for all \(H=d+\frac{1}{2}>0\). On the other hand, if X t is obtained from Definition 7.3, then Z n (u) converges to the usual fractional Brownian motion as in Mandelbrot and van Ness (1968) (see Sect. 1.3.5) which is defined for 0<H<1 only. For limit theorems for Fourier transforms under the two definitions, see, e.g. Velasco (2007).
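Generating a Type II I(d) path according to Definition 7.4 amounts to convolving the i.i.d. innovations with the coefficients of (1−B)−d. A minimal sketch:

```python
import numpy as np

def type2_Id(n, d, rng):
    """Type II I(d) process of Definition 7.4: X_t = sum_{j=0}^{t-1} a_j xi_{t-j},
    with a_j = Gamma(j+d)/(Gamma(j+1)Gamma(d)), the coefficients of (1-B)^{-d}."""
    j = np.arange(1, n)
    a = np.concatenate(([1.0], np.cumprod((j - 1 + d) / j)))
    xi = rng.standard_normal(n)
    # the truncation j <= t-1 is automatic: only xi_1, ..., xi_t enter X_t
    return np.convolve(a, xi)[:n]

rng = np.random.default_rng(3)
x = type2_Id(500, d=0.75, rng=rng)   # nonstationary case 1/2 < d < 3/2
print(x[:3], x[-1])
```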

More generally, I(d) may be defined for bivariate (or multivariate) processes X t =(X t1,X t2) as follows. Using the spectral representation

$$X_{t,j}=\int_{-\pi}^{\pi}e^{it\lambda}\,dM_{j}( \lambda)\quad(j=1,2), $$

the cross-covariance is

Thus, in this notation,

$$f_{12}(\lambda)=E \bigl[ dM_{1}(\lambda) \,\overline{dM_{2}(\lambda)} \bigr] . $$

If, for instance, \(dM_{2}(\lambda)=e^{-i\phi_{12}(\lambda)}\, dM_{1}(\lambda)\) with ϕ 12(λ)=ϕλ and ϕ>0, then this means that X t,2 is delayed with respect to X t,1 by the time span ϕ. For the cross-spectral density, we have

$$f_{12}(\lambda)=e^{i\phi_{12}(\lambda)}\bigl \vert f_{12}( \lambda)\bigr \vert =e^{i\phi\lambda}\bigl \vert f_{12}(\lambda) \bigr \vert . $$

Thus, in the notation used here, the slope of the phase, \(\phi_{12}^{\prime}(\lambda)\), corresponds to the time delay of dM 2(λ) with respect to dM 1(λ) (see, e.g. Brockwell and Davis 1991). A possible definition of bivariate fractionally integrated processes is as follows:

Definition 7.5

A stationary process \(X_{t}=(X_{t,1},X_{t,2})^{T}\in\mathbb{R}^{2}\) is called I(d 1,d 2) of Type I if there exist \(-\frac{1}{2}<d_{1},d_{2}<\frac{1}{2}\) such that X t has a 2×2 spectral density

$$f_{X}(\lambda)\sim\varLambda(\lambda)C_{f}\bar{ \varLambda}(\lambda)\quad (\lambda\rightarrow0) $$

with C f a constant, real, positive semidefinite and symmetric 2×2 matrix such that [C f ] ii ≠0, and

$$\varLambda(\lambda)=\left ( \begin{array}{c@{\quad}c} \vert \lambda \vert ^{-d_{1}} & 0\\ 0 & e^{-i\phi_{12}(\lambda)}\vert \lambda \vert ^{-d_{2}}\end{array} \right ) $$

for some differentiable function ϕ 12 with derivative \(\phi_{12}^{\prime}\) such that \(\lim_{\lambda\rightarrow0}\phi _{12}^{\prime }(\lambda)=\phi_{0}\in(0,\pi]\). A nonstationary process X t is called I(d 1,d 2) of Type I if there is an integer m such that \(-\frac {1}{2}<d_{i}^{\ast}=d_{i}-m<\frac{1}{2}\) and (1−B)m X t =((1−B)m X t,1,(1−B)m X t,2)T is \(I ( d_{1}^{\ast},d_{2}^{\ast} ) \).

The generalization to p-dimensional cointegrated vector series is obvious. More explicitly, a stationary I(d 1,d 2) process has a spectral density that behaves at the origin like

In particular, this means that for low frequency components of X t there is an approximately constant phase shift corresponding to X t,2 being behind by Δt=ϕ 0. In the simplest case with \(\lim_{\lambda\rightarrow0}\phi_{12}^{\prime}(\lambda)=0\) (see, e.g. Christensen and Nielsen 2006), there is no phase shift for very low frequencies (more precisely, for λ→0).

Example 7.23

Consider a multivariate FARIMA model defined as the stationary solution of

$$ \left ( \begin{array} {c@{\quad}c}( 1-B )^{d_{1}} & 0\\ 0 & ( 1-B )^{d_{2}}\end{array} \right ) X_{t}=\varphi^{-1} ( B ) \psi ( B ) \xi_{t}=\eta_{t}=\binom{\eta_{t,1}}{\eta_{t,2}} $$
(7.68)

(see, e.g. Lobato 1999; Robinson and Yajima 2002; Shimotsu 2006) with i.i.d. ξ t =(ξ t,1,ξ t,2)T, zero mean random variables and ξ t,1 independent of ξ s,2 for all s,t. The spectral density of X t is given by

$$f(\lambda)=\left ( \begin{array} {c@{\quad}c}( 1-e^{-i\lambda} )^{-d_{1}} & 0\\ 0 & ( 1-e^{-i\lambda} )^{-d_{2}}\end{array} \right ) f_{\eta}(\lambda)\left ( \begin{array} {c@{\quad}c}( 1-e^{i\lambda} )^{-d_{1}} & 0\\ 0 & ( 1-e^{i\lambda} )^{-d_{2}}\end{array} \right ) $$

where

For λ→0,

$$f_{\eta}(\lambda)\rightarrow C_{f}=\frac{\sigma_{\xi}^{2}}{2\pi}\bigl \vert \psi( 1 ) \varphi^{-1} ( 1 ) \bigr \vert ^{2}$$

and

$$\bigl( 1-e^{i\lambda} \bigr)^{d}\sim( 1-1-i\lambda )^{d}=\lambda^{d}e^{-i\frac{\pi}{2}\,d}. $$

Thus,

so that Definition 7.5 applies with

$$\phi_{12}(\lambda)\equiv\frac{\pi}{2}(d_{1}-d_{2}) $$

and

$$\phi_{0}=\phi_{12}^{\prime}(\lambda)\equiv0. $$

This means that for FARIMA models as defined above there is no time shift, although the phase ϕ 12 itself is not zero except for d 1=d 2. (For less restrictive models, see, e.g. Robinson 2007). Note, however, that this only refers to λ→0. Outside any open neighbourhood of the origin, the AR- and MA-matrices φ and ψ can model any kind of phase shifts with \(\phi_{12}^{\prime}\neq0\).

Similarly, a Type II I(d 1,d 2)-process can be defined (see, e.g. Robinson and Marinucci 2001, 2003, Marinucci and Robinson 2000; Marmol and Velasco 2004; Nielsen and Shimotsu 2007).

A simple, though not most general, definition of cointegration can be given as follows (Chen and Hurvich 2003a, 2003b, 2006).

Definition 7.6

Let \(X_{t}\in\mathbb{R}^{2}\) be I(d 1,d 2) with \(d_{1}=d_{2}=d>-\frac{1}{2}\). Then X t is cointegrated of order d, b (or CI(d,b)) if there exists a vector \(\beta\in\mathbb{R}^{2}\) such that β≠0 and \(Y_{t}(\beta)=\beta ^{T}X_{t}\in\mathbb{R}\) is I(d ) with d =db<d. Any such vector β is called a cointegrating vector.

By definition, β is determined up to a scaling constant. Thus, for a bivariate series, there is at most one β with \(\Vert \beta \Vert =\sqrt{\beta_{1}^{2}+\beta_{2}^{2}}=1\). More generally, for p-dimensional series, there are at most p−1 such vectors. The number of linearly independent cointegrating vectors is then called the cointegrating rank. Note that originally, cointegration was defined for integer valued differencing parameters d j only (Engle and Granger 1987): the components of \(X_{t}\in\mathbb{R}^{p}\) are said to be cointegrated of order \(d,b\in\mathbb{N}\) in the sense of Engle and Granger (X t ∼CI(d,b)) if all components of X t are I(d) and there exists a vector \(\beta\in\mathbb {R}^{p}\) such that β T X t ∼I(d−b), b>0. Definition 7.6 is applicable to any d and b=d−d ∗. The possibility of extending cointegration to fractional differences was suggested before by Granger (1981, 1986). Note also that d ∗ may be less than or equal to \(-\frac{1}{2}\). This means that Y t (β) may turn out to be non-invertible. More general definitions that allow for d 1≠d 2 were also introduced in the literature, but are more complicated due to the variety of possible subsets with equal d j ’s (see, e.g. Robinson and Yajima 2002; Robinson and Marinucci 2003, 2003).

Example 7.24

Suppose that X t1 and X t2 are both Type I I(d) with \(d\in (0,\frac{1}{2})\) and \(e_{t}\in\mathbb{R}\) is Type I I(d e ) with \(0<d_{e}<d<\frac{1}{2}\). If there is an α≠0 such that

$$ X_{t2}=\alpha X_{t1}+e_{t}, $$
(7.69)

then X t =(X t1,X t2)T is fractionally cointegrated with cointegrating vector β=(1,−α)T and fractional integration parameters d and d e (see, e.g. Robinson 1994b).

Example 7.25

Let X t be defined as in the previous example and \(\tilde{X}_{t}\) be such that \(( 1-B ) \tilde{X}_{t}=X_{t}\). Also denote by \(\tilde{e}_{t}\) an I(d e +1) process such that \((1-B)\tilde{e}_{t}=e_{t}\). Then

$$ \tilde{X}_{t,2}=\mu+\alpha\tilde{X}_{t,1}+ \tilde{e}_{t} $$
(7.70)

where μ is an arbitrary constant. The integrated process \(\tilde{X}_{t}\) is cointegrated with cointegrating vector β=(1,−α)T and fractional integration parameters d+1 and d e +1 (see Chen and Hurvich 2003a for a generalization to d+m).

Example 7.26

A Type I p-dimensional fractional common component model proposed in Chen and Hurvich (2006) is defined as

$$X_{t}=A_{0}\xi_{t}^{(0)}+A_{1} \xi_{t}^{(1)}+\cdots+A_{s}\xi_{t}^{(s)}$$

with latent (unobserved) I(d j )-processes \(\xi _{t}^{(j)}\in\mathbb{R}^{p_{j}}\) such that

$$-m_{0}+\frac{1}{2}<d_{s}<\cdots<d_{0}< \frac{1}{2}, $$

A 0,…,A s are p×p j full-rank matrices with all columns linearly independent, p 0+⋯+p s =r, 1≤r<p and 1≤sr. This means that X t can be decomposed orthogonally into s cointegrating subspaces defined by A 1,…,A s and the cointegration rank is r. Moreover, by definition, X t is I(d 0). If we choose β as a linear combination of the columns of matrix A j (j≠0), then—due to orthogonality—

$$Y_{t}(\beta)=\beta^{T}X_{t}= \beta^{T}A_{j}\xi_{t}^{(j)}$$

so that Y t (β) is I(d j ).

Example 7.27

Sowell (1990) and Dueker and Startz (1998) consider a cointegrated FARIMA process of the form X t =(X t1,X t2)T with

$$ \underset{2\times2}{\varphi} ( B ) \left ( \begin{array} {c@{\quad}c}( 1-B )^{d_{1}} & 0\\ 0 & ( 1-B )^{d_{2}}\end{array} \right ) \left ( \begin{array} {c@{\quad}c}1 & 0\\ -\alpha& 1 \end{array} \right ) X_{t}=\underset{2\times2}{\psi} ( B ) \xi_{t} $$
(7.71)

where \(-\frac{1}{2}<d_{2}<d_{1}<\frac{1}{2}\), and φ and ψ are AR- and MA-operators of order p and q. This means that \(X_{t}^{\ast}= ( X_{t1},X_{t2}-\alpha X_{t1} )^{T}\) is the usual multivariate FARIMA process. The bivariate process X t is cointegrated with cointegrating vector β=(−α,1)T. If the i.i.d. innovation variables ξ t are assumed to be Gaussian, then, in principle, the parameters in (7.71) can be estimated by a maximum likelihood type method. For non-Gaussian innovations, the same method may be used (under moment assumptions), though it may not be optimal (see, e.g. Dueker and Startz 1998; Jeganathan 1999).

For further results, discussions and literature, see, e.g. Chan and Terrin (1995), Breitung and Hassler (2002), Davidson (2002), Dolado et al. (2003), Robinson and Hualde (2003), Nielsen (2005a, 2005b), Johansen (2008, 2008), Lasak (2010).

In classical cointegration with integer valued d and b, the cointegrating vector β=(1,−α)T can be estimated by minimizing ∑(X 1t −μ−αX 2t )2 with respect to μ and α. (The generalization to higher dimensions p>2 is obvious.) In addition, because of the problem of spurious correlation, one has to test whether \(\hat{\beta}\) is “real” or spurious. The classical method suggested by Engle and Granger is to test for unit roots in the residuals \(\hat{e}_{t}=X_{1t}-\hat{\mu}-\hat{\alpha}X_{2t}\) (i.e. H 0:φ=1 vs. H 1:|φ|<1 where we assume e t =φe t−1+u t ). This is typically done by a suitable version of the Dickey–Fuller test (Dickey and Fuller 1981). If H 0 is rejected, then cointegration is assumed to be real. An alternative method is based on reduced rank regression of a multivariate ARMA process in which the cointegration model can be embedded (see, e.g. Johansen 1996).
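For the classical (integer-order) case just described, the two-step procedure can be sketched as follows, using the augmented Dickey–Fuller statistic as the residual unit-root test (note that for a formal test the Engle–Granger critical values, not the standard Dickey–Fuller ones, should be used):

```python
import numpy as np
from statsmodels.tsa.stattools import adfuller

rng = np.random.default_rng(11)

n = 500
x1 = np.cumsum(rng.standard_normal(n))   # I(1) regressor
e = rng.standard_normal(n)               # I(0) error, so the two series are cointegrated
x2 = 0.5 + 2.0 * x1 + e

# Step 1: least squares regression of x2 on x1
A = np.column_stack([np.ones(n), x1])
mu_hat, alpha_hat = np.linalg.lstsq(A, x2, rcond=None)[0]
resid = x2 - mu_hat - alpha_hat * x1

# Step 2: unit-root test applied to the residuals
stat, pvalue = adfuller(resid)[:2]
print(f"alpha_hat={alpha_hat:.3f}  ADF stat={stat:.2f}  (nominal p={pvalue:.3f})")
```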

At first sight, the generalization of estimation and identification techniques to fractional cointegration is not obvious because unit root testing is not sufficient. The first question is estimation of β in the case where cointegration applies. The second question is how to guard against spurious correlations. In particular, the usual Dickey–Fuller test is not applicable. With respect to estimation no fundamentally new problem occurs if a parametric model, such as (7.71), is acceptable. In this case, maximum likelihood estimation of the cointegration vector β and other parameters of the model (including d 1, d 2) can be carried out in principle because everything is specified. However, in models where only the behaviour of the (cross-) spectrum near the origin is specified (see some of the examples above), the task is more difficult. Consider, for example, (7.69) with

$$ X_{t2}=\alpha X_{t1}+e_{t}, $$
(7.72)

X t1 stationary with autocovariance function γ 11(k), variance \(\operatorname{var}(X_{t1})=\gamma_{11}(0)=\sigma_{1}^{2}\) and I(d) for some \(0<d<\frac {1}{2}\), and e t stationary and I(d e ) with d e <d. For the least squares estimator of α, we then have

$$\hat{\alpha}_{\mathrm{LSE}}=\alpha+\frac{\sum_{t=1}^{n}X_{t1}e_{t}}{\sum _{t=1}^{n}X_{t1}^{2}}\underset{p}{ \rightarrow}\alpha+\frac{\mathit{cov} ( X_{t1},e_{t} ) }{\sigma_{1}^{2}}. $$

The bias term \(\mathit{cov} ( X_{t1},e_{t} ) /\sigma_{1}^{2}\) is equal to zero only if X t1 and e t are uncorrelated, so that \(\hat{\alpha}_{\mathrm{LSE}}\) is in general not consistent. The result is different from nonfractional cointegration where, for instance, X t,1, X t,2 are CI(1,1), which implies that \(\sum_{t=1}^{n}X_{t1}^{2}\) is of a larger order than \(\sum_{t=1}^{n}X_{t1}e_{t}\). A possible solution for the fractional cointegration model here is to apply least squares regression to low frequency components only. The reason is that

where

$$f(\lambda)=\left ( \begin{array} {c@{\quad}c}f_{11}(\lambda) & f_{1,e}(\lambda)\\ f_{e,1}(\lambda) & f_{ee}(\lambda) \end{array} \right ) $$

is the (real-valued) bivariate spectral density of (X t1,e t )′. Since \(0\leq \vert f_{1,e}\vert \leq \sqrt{f_{11}f_{ee}}\) and d e <d, we have for λ→0,

$$f_{1,e}(\lambda)=O \bigl( \lambda^{-d-d_{e}} \bigr) =o \bigl( \lambda^{-2d} \bigr) . $$

Denote by

$$Z_{j}(\lambda_{k})=\frac{1}{\sqrt{2\pi n}}\sum _{t=1}^{n}X_{tj}e^{i\lambda _{k}t} \quad(j=1,2) $$

the discrete Fourier transform of X tj at Fourier frequencies λ k =2πk/n and define

$$ \hat{\alpha}_{\mathrm{LSE}}(m_{n})=\frac{\sum_{k=1}^{m_{n}}Re ( Z_{1}(\lambda_{k})\overline{Z_{2} ( \lambda_{k} ) } ) }{\sum _{k=1}^{m_{n}}\vert Z_{1}(\lambda_{k})\vert ^{2}} $$
(7.73)

with m n →∞ such that m n /n→0. For Z j we have

and

$$E \bigl[ \bigl \vert Z_{1}(\lambda_{k})\bigr \vert ^{2} \bigr] =\frac{1}{2\pi n}\sum _{t,s=1}^{n}\gamma_{11}(t-s)e^{i\lambda_{k}(t-s)}=O \bigl( \lambda_{k}^{-2d} \bigr) . $$

Similar arguments apply to the variance of the numerator and denominator in (7.73) so that, under suitable detailed regularity conditions,

$$\hat{\alpha}_{\mathrm{LSE}}(m_{n})=\alpha+O_{p} \bigl( \lambda_{m_{n}}^{d-d_{e}} \bigr) =\alpha+o_{p}(1) $$

(see Robinson 1994b). Robinson and Marinucci (2001) showed that \(\hat{\alpha}_{\mathrm{LSE}}(m_{n})\) is also consistent for a Type II nonstationary cointegration model. Similarly, Chen and Hurvich (2003a) showed consistency and derived the asymptotic distribution of \(\hat{\alpha}_{\mathrm{LSE}}(m_{n})\) refined by tapering, under a Type I cointegration model with arbitrary integer integration parameter (also see, e.g. Chen and Hurvich 2006; Robinson and Yajima 2002; Velasco 2003; Nielsen and Shimotsu 2007). Also note that an alternative estimator based on the Whittle approximation is proposed in Robinson (2008). Moreover, Johansen and Nielsen (2010a, 2010b) show how to generalize reduced rank regression to fractional cointegration (also see Johansen 2010a, 2010b, 1996, 2008, Lütkepohl 2006).
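A bare-bones implementation of the narrow-band estimator \(\hat{\alpha}_{\mathrm{LSE}}(m_{n})\) of (7.73) only requires the discrete Fourier transforms at the first m n Fourier frequencies. A sketch with placeholder data (in the fractional cointegration setting X t1 would be I(d) and e t I(d e ) with d e <d):

```python
import numpy as np

def narrowband_lse(x1, x2, m):
    """Frequency-domain least squares over the first m Fourier frequencies, cf. (7.73)."""
    n = len(x1)
    lam = 2 * np.pi * np.arange(1, m + 1) / n
    t = np.arange(1, n + 1)
    E = np.exp(1j * np.outer(lam, t))                 # e^{i lambda_k t}
    z1 = E @ x1 / np.sqrt(2 * np.pi * n)
    z2 = E @ x2 / np.sqrt(2 * np.pi * n)
    return np.sum(np.real(z1 * np.conj(z2))) / np.sum(np.abs(z1) ** 2)

rng = np.random.default_rng(5)
n, alpha = 2000, 1.5
x1 = np.cumsum(rng.standard_normal(n))   # placeholder regressor dominating at low frequencies
e = rng.standard_normal(n)               # placeholder error
x2 = alpha * x1 + e
print(narrowband_lse(x1, x2, m=int(n ** 0.5)))
```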

The second question is how to design “unit roots” tests that detect fractional departures from stationarity. More generally, the question is how to identify the cointegration rank in the fractional cointegration context. Tests along this line are discussed, for instance, in Breitung and Hassler (2002, 2006), Davidson (2002, 2006), Robinson and Yajima (2002), Marmol and Velasco (2004), Nielsen (2004b, 2004c, 2004a, 2005a, 2005b), Chen and Hurvich (2006), Nielsen and Shimotsu (2007), Hualde and Velasco (2008), Avarucci and Velasco (2009), Lasak (2010), MacKinnon and Nielsen (2010). For additional references to fractional cointegration, see, e.g. Cheung and Lai (1993), Baillie and Bollerslev (1994), Ravishanker and Ray (1997, 2002), Kim and Phillips (2001), Gil-Alana (2004), Nielsen (2004b, 2004c), Robinson and Iacone (2005), Hualde and Robinson (2007, 2010), Robinson (2008), Berger et al. (2009), Davidson and Hashimzade (2009a, 2009b), Gil-Alana and Hualde (2009), Sela and Hurvich (2009), Franchi (2010), Nielsen (2010, 2011), Nielsen and Frederiksen (2011).

7.3 Piecewise Polynomial and Spline Regression

We consider a process of the form

$$ X_{t}=m \biggl( \frac{t}{n} \biggr) +e_{t}\quad ( t=1,\ldots,n ) $$
(7.74)

where e t is a zero mean second-order stationary process. In some situations, a natural model for the expected value m is a piecewise polynomial. For instance, Fig. 1.18 in Sect. 1.2 shows typical olfactory response curves to an odorant stimulus administered at a known time point t 0. In this case, a continuous piecewise linear polynomial (or in other words, a linear spline function) with one known knot at time t 0 and one subsequent unknown knot characterizes the essential features of the expected value as a function of time. The residual processes e t often exhibit long memory.

More generally, we may consider an arbitrary continuous piecewise polynomial function

$$m ( s ) =\sum_{k=0}^{l}\sum _{j=1}^{p_{k}}a_{k,j} ( s-\eta_{k} )_{+}^{\beta_{j,k}}$$

with β j,k <β j+1,k , knots 0=η 0<η 1<⋯<η l <1 of which some (but not necessarily all) are unknown. Note that m is continuous if β j,k ≥1 for k≥1. The definition includes splines, but is more general since apart from continuity no differentiability conditions are imposed. For simplicity of presentation, we will discuss the case with one unknown knot η only. As we will see, however, results can be formulated in a general form so that all cases with an arbitrary number of knots and arbitrary polynomials are included. Thus, suppose that there is one unknown knot η. Then m(s) has the representation

$$ m(s)=\sum_{j=1}^{p}\alpha_{j}f_{j}(s) \quad\bigl(s\in [0,1]\bigr) $$
(7.75)

with α T=(α 1,…,α p ) denoting unknown regression coefficients and

$$ \begin{aligned} &f_{1} ( s ) =1,\qquad f_{2}(s)=s,\qquad \ldots,\qquad f_{q} ( s ) =s^{q-1},\qquad \\ &f_{q+1} ( s ) =(s-\eta)_{+},\qquad \ldots, \qquad f_{p} ( s ) =(s-\eta)_{+}^{p-q}\end{aligned}$$
(7.76)

(where \((s-\eta)_{+}^{l}:=\max ( 0,(s-\eta)^{l} ) \)). The unknown parameter vector is θ=(α T,η)T. The true value of θ will be denoted by θ 0. Note that for identifiability of η 0, one needs the condition that \(\alpha_{j}^{0}\neq0\) for at least one j≥q+1. Beran and Weiershäuser (2011) and Beran et al. (2013) derived the asymptotic distribution of the least squares estimator of θ 0 under long memory, short memory and antipersistence of the residual process e t . In particular, if e t is linear, then unified formulas applicable to all three cases can be derived. The key to obtaining these results is a linearization of the nonlinear regression estimator of θ and convergence of weighted sums of e t to integrals with respect to fractional Brownian motion. Combined with fractional calculus, unified formulas follow.

We will use the notation ν(d) as in Corollary 1.2. Minimizing the sum of the squared residuals, \(Q(\theta)=\sum_{t=1}^{n} [ X_{t}-m ( s_{n};\theta ) ]^{2}\) (with s n =t/n) with respect to θ can be done in two steps. First of all, for each value of η, the optimal value of α is obtained by standard linear least squares regression on the functions f j defined by using knot η. Thus, for each η∈(0,1) we define the n×p matrix

$$ \mathbf{W}_{n}=\mathbf{W}_{n}(\eta)=(w_{ij})_{i=1,\dots,n;j=1,\dots,p}=(\mathbf{w}_{1,n},\dots,\mathbf{w}_{p,n}) $$
(7.77)

with \(w_{i,j}=f_{j} ( \frac{i}{n} ) \) (1≤in;1≤jp), and column vectors denoted by w j,n (j=1,…,p). For n large enough, \(\mathbf{W}_{n}^{T}\mathbf{W}_{n}\) is invertible so that the projection matrix on the column space of W n (η) may be written as

$$ P_{\mathbf{W}_{n}}=P_{\mathbf{W}_{n}}(\eta)=\mathbf{W}_{n}\bigl( \mathbf{W}_{n}^{T}\mathbf{W}_{n} \bigr)^{-1}\mathbf{W}_{n}^{T}. $$
(7.78)

Thus, given observations X=(X 1,…,X n )T, \(\hat{\eta}\) is obtained by minimizing \(\Vert \mathbf{X}-P_{\mathbf{W}_{n} }(\eta)\mathbf{X}\Vert ^{2}\) with respect to η. The slope estimates are given by

$$\hat{\alpha}=\bigl(\mathbf{W}_{n}^{T}\mathbf{W}_{n} \bigr)^{-1}\mathbf{W}_{n}^{T} \mathbf{X}$$

and m(s 1),…,m(s n ) are estimated by

$$ \biggl[ m \biggl( \frac{1}{n};\hat{\theta} \biggr) ,m \biggl( \frac{2}{n};\hat{\theta} \biggr),\dots,m ( 1;\hat{\theta} ) \biggr]^{T}=P_{\mathbf{W}_{n}(\hat{\eta})}\mathbf{X}. $$
(7.79)
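For readers who want to experiment with this two-step procedure, the following is a minimal numerical sketch (not taken from the original text; all function names are hypothetical). A grid search over the knot η replaces a continuous optimization, and ordinary linear least squares yields α for each candidate knot, exactly as described above.

```python
import numpy as np

def design_matrix(eta, s, q=2, p=3):
    """Columns f_1, ..., f_p of (7.76): 1, s, ..., s^(q-1), (s-eta)_+, ..., (s-eta)_+^(p-q)."""
    cols = [s ** j for j in range(q)]
    cols += [np.maximum(s - eta, 0.0) ** l for l in range(1, p - q + 1)]
    return np.column_stack(cols)

def fit_spline_knot(x, q=2, p=3, grid=None):
    """Profile least squares: linear LS in alpha for each candidate knot eta,
    then keep the eta with the smallest residual sum of squares."""
    n = len(x)
    s = np.arange(1, n + 1) / n
    grid = np.linspace(0.05, 0.95, 181) if grid is None else grid
    best = (np.inf, None, None)
    for eta in grid:
        W = design_matrix(eta, s, q, p)
        alpha = np.linalg.lstsq(W, x, rcond=None)[0]
        rss = float(np.sum((x - W @ alpha) ** 2))
        if rss < best[0]:
            best = (rss, eta, alpha)
    return best[1], best[2]  # eta_hat, alpha_hat
```

The fitted values at the grid points are then \(P_{\mathbf{W}_{n}(\hat{\eta})}\mathbf{X}\) as in (7.79).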

Note that, in spite of the projection, neither \(\hat{\alpha}\) nor \(\hat{\eta}\) are linear in X. For general piecewise polynomials, linearization of \(\hat{\theta}\) has to take into account that derivatives of m with respect to η may not exist for t=η. Denoting by m (j+) the right-hand partial derivatives of m with respect to θ j and defining the n×(p+1) matrix

$$ \mathbf{M}_{n+}= \bigl[ m_{(j+)}(t/n) \bigr]_{t=1,\ldots,n;j=1,\ldots ,p+1} \in\mathbb{R}^{n\times(p+1)}$$
(7.80)

the limit

$$ \lim_{n\rightarrow\infty}n^{-1}\bigl(\mathbf{M}_{n+}^{T} \mathbf{M}_{n+}\bigr)_{jk}=\int _{0}^{1}m_{(j+)}(s,\theta)m_{(k+)}(s, \theta)\,ds $$
(7.81)

exists; assuming that this limit matrix is nonsingular, \(\mathbf{M}_{n+}^{T}\mathbf{M}_{n+}\) is of full rank for n large enough, and we can also define the asymptotic matrix

$$ \varLambda=\lim_{n}n\bigl(\mathbf{M}_{n+}^{T}\mathbf{M}_{n+} \bigr)^{-1}. $$
(7.82)

Suppose now that the spectral density of e t is of the form f e (λ)∼c f |λ|−2d for λ→0 where \(d\in ( -\frac{1}{2},\frac{1}{2} ) \). Using the notation e(n)=(e 1,…,e n )T it can then be shown that \(\Vert \hat{\theta}-\theta-(\mathbf{M}_{n+}^{T}\mathbf{M}_{n+})^{-1}\mathbf{M}_{n+}^{T}e(n)\Vert =o_{p} ( n^{d-\frac{1}{2}} ) \) and

$$ \lim_{n\rightarrow\infty}\mathit{cov} \bigl( n^{\frac{1}{2}-d}\nu^{-\frac{1}{2}} ( d ) \bigl( \mathbf{M}_{n+}^{T}\mathbf{M}_{n+} \bigr)^{-1}\mathbf{M}_{n+}^{T}e(n) \bigr) =\varLambda\varSigma_{0}\varLambda $$
(7.83)

where Σ 0 depends on d. At first sight, the formulas for Σ 0 seem to be quite different depending on whether we have long memory, short memory or antipersistence:

  1. 1.

    d>0:

    $$ \varSigma_{0}=d ( 1-2d ) \biggl( \int_{0}^{1} \int_{0}^{1}\frac {m_{(j)}(s)m_{(k)}(t)\,dt\,ds}{|s-t|^{1-2d}} \biggr)_{j,k=1,\ldots,p+1}. $$
    (7.84)
  2. 2.

    d=0:

    $$ \varSigma_{0}= \biggl( \int_{0}^{1}m_{(j)}(t)m_{(k)}(t)\,dt \biggr)_{j,k=1,\ldots ,p+1}. $$
    (7.85)
  3. 3.

    d<0:

    $$ \begin{aligned} \varSigma_{0}&=c \biggl( \int_{0}^{1}m_{(j)}(t) \int_{\mathbb{R}\setminus [0,1]}\frac{m_{(k)}(t)}{|s-t|^{1-2d}}\,ds \\ &\quad {}-\int_{0}^{1} \frac{m_{(k)}(s)-m_{(k)}(t)}{|s-t|^{1-2d}}\,ds\,dt \biggr)_{j,k=1,\ldots,p+1}\end{aligned}$$
    (7.86)

    with c=d(1−2d).

However, using fractional calculus (as discussed in Sect. 3.7.3), one formula for all three cases can be given. This approach also helps deriving the asymptotic distribution of \(\hat{\theta}\) in an elegant way similar to Pipiras and Taqqu (2000a, 2000c, 2003). Extending m (j+) to the real axis by setting m (j+)(t)=0 (j=1,…,p+1) for t∉[0,1), the unified formula for Σ 0 can be given as follows (Beran et al. 2013):

Theorem 7.20

Define

$$c_{1}^{2}(d):=\int_{0}^{\infty} \bigl( ( 1+s )^{d}-s^{d} \bigr)^{2}\,ds+\frac{1}{2d+1}. $$

Then

$$\varSigma_{0}= \biggl[ \frac{\varGamma(d+1)^{2}}{c_{1}^{2}(d)}\int_{\mathbb{R}} \bigl( I_{-}^{d}m_{(j+)} \bigr) (s) \bigl( I_{-}^{d}m_{(k+)} \bigr) (s)\,ds \biggr]_{j,k=1,\ldots,p+1}. $$

Finally, recalling the linearization

$$n^{\frac{1}{2}-d}\nu^{-\frac{1}{2}} ( d ) ( \hat{\theta }-\theta ) \approx n^{\frac{1}{2}-d}\nu^{-\frac{1}{2}} ( d ) \bigl(\mathbf{M}_{n+}^{T} \mathbf{M}_{n+}\bigr)^{-1}\mathbf{M}_{n+}e(n), $$

convergence to a normal distribution can be derived by extending limit theorems for weighted sums given in Pipiras and Taqqu (2000a, 2000c). The limit is a linear transformation of the (p+1)-dimensional Gaussian variable

$$Z:= \biggl( \int m_{(j+)}(s)\,dB_{H}(s) \biggr)_{j=1,\ldots,p+1}$$

where B H (s) denotes a fractional Brownian motion with Hurst parameter H=d+0.5 and the integral ∫⋅ dB H (s) is understood in the sense of Pipiras and Taqqu (2000a, 2000c). The asymptotic distribution can then be expressed as follows.

Theorem 7.21

Under the assumptions summarized above (see Beran and Weiershäuser 2011 and Beran et al. 2013 for detailed assumptions) we have, as n→∞,

$$ n^{\frac{1}{2}-d}\nu^{-\frac{1}{2}} ( d ) ( \hat{\theta }-\theta ) \underset{d}{\rightarrow}\varLambda Z\sim N(0,\varLambda\varSigma_{0}\varLambda). $$
(7.87)

Note that the formulation of the asymptotic distribution in terms of fractional integration is general so that it directly applies to any continuous piecewise polynomial function \(m ( s ) =\sum_{k=0} ^{l}\sum_{j=1}^{p_{k}}a_{k,j} ( s-\eta_{k} )_{+}^{\beta_{j,k}}\) as specified above.

An application of these results to calcium imaging data in the context of olfactory research was introduced in Sect. 1.2. The data displayed in Fig. 1.18 are part of a data set consisting of estimated entropy series for 25 adult forager bees (Apis mellifera carnica). The original series were based on calcium imaging data reflecting the response in the antennal lobe of bees to an odorant stimulus (more specifically, hexanol). For the response series in Fig. 1.18, a linear spline function (i.e. a continuous piecewise linear function) with one known knot at the time of intervention and two subsequent unknown knots provides a rather accurate approximation of the main characteristics. For each bee, two response series were measured under two different conditions, namely without and with the addition of the neurotransmitter octopamine. The research hypothesis was that under the influence of the neurotransmitter, the change in entropy should be faster. Using a linear spline fit with one known knot η 0 at the time of intervention and two subsequent unknown knots η 1,η 2, we have \(m(s)=\alpha_{0}+\alpha_{1}s+\alpha_{2}(s-\eta_{0})_{+}+\alpha_{3}(s-\eta_{1})_{+}+\alpha_{4}(s-\eta_{2})_{+}\) with unknown parameter vector θ=(α 0,…,α 4,η 1,η 2). Let θ without and θ with be the parameters without and with octopamine. Then checking the research hypothesis can be interpreted as testing the null hypothesis H 0:α 2,without=α 2,with. Using least squares estimation for each of the response series, the distribution of \(\hat{\alpha}_{2,\mathrm{without}}\) and \(\hat{\alpha}_{2,\mathrm{with}}\), respectively, follows from the theorem above. Since the two series are always measured within one individual bee, the estimates are correlated so that a paired test has to be applied that takes into account the correlation ρ between the two estimates. The difference \(\hat{\varDelta }=\hat{\alpha}_{2,\mathrm{with}}-\hat{\alpha}_{2,\mathrm{without}}\) is then approximately normal with variance \(\operatorname{var} ( \hat{\varDelta } ) =\operatorname{var} ( \hat{\alpha}_{2,\mathrm{with}} ) +\operatorname{var} ( \hat{\alpha}_{2,\mathrm{without}} ) -2\rho\sqrt{\operatorname{var} ( \hat{\alpha}_{2,\mathrm{with}} ) \operatorname{var} ( \hat{\alpha}_{2,\mathrm{without}} ) }\). The variances are obtained from the asymptotic results above, whereas ρ may be replaced by the sample correlation based on all bees in the data set. Beran et al. (2013) used these estimates to calculate an optimally weighted mean as an estimate of \(\mu_{\varDelta }=E ( \hat{\varDelta } ) \). Using asymptotic normality or bootstrap, it could indeed be shown that μ Δ >0 with a p-value below 1 %.
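As a small illustration (not part of the original analysis), the paired comparison just described amounts to a few lines of arithmetic once the slope estimates, their asymptotic variances from (7.87) and the correlation ρ are available; all names below are hypothetical.

```python
import numpy as np

def paired_delta_z(a2_with, a2_without, var_with, var_without, rho):
    """z-statistic for H0: alpha_{2,with} = alpha_{2,without}, using
    var(Delta) = var_with + var_without - 2*rho*sqrt(var_with*var_without)."""
    delta = a2_with - a2_without
    se = np.sqrt(var_with + var_without - 2.0 * rho * np.sqrt(var_with * var_without))
    return delta / se
```

A two-sided p-value then follows from the standard normal approximation or, as mentioned above, from a bootstrap.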

7.4 Nonparametric Regression with LRD Errors—Kernel and Local Polynomial Smoothing

In this section, we consider the nonparametric regression model

$$ Y_{i}=m(X_{i})+\sigma(X_{i})e_{i} \quad(i=1,\ldots,n), $$
(7.88)

where m(⋅), σ(⋅) are unknown functions, X i are predictors (deterministic or random), and e i is a second-order stationary process. First, in Sect. 7.4.1, we give a brief introduction to kernel (Priestley–Chao, Nadaraya–Watson) and local polynomial smoothing. We provide some preliminary calculations of the bias and variance and point out important differences between fixed and random design. It turns out that random design may improve rates of convergence. We have observed this already for parametric regression in Sects. 7.1 and 7.2 . Methods for estimating derivatives and boundary effects are also discussed.

In Sects. 7.4.2–7.4.3, we present general results for fixed design kernel and local polynomial estimation. In particular, it is shown that long memory or antipersistence influences rates of convergence. Hall and Hart (1990b) were the first to derive an asymptotic formula for the mean squared error of kernel estimators of the trend function in fixed-design regression with long-memory errors. This result was extended further in Beran and Feng (2001a, 2001b, 2002a, 2002b, 2002c), including kernel estimation with boundary corrections, local polynomial estimation of derivatives and integrated processes. Further results have been obtained in Csörgő and Mielniczuk (1995b, 1995a), Robinson (1997), Beran and Feng (2001a, 2007), Pawlak and Stadtmüller (2007), Feng et al. (2007). Extensions to LARCH-type residuals are given in Beran and Feng (2007). Optimal convergence rates are derived in Feng and Beran (2012), but will not be discussed here. The nonexistence of optimal kernels in the long-memory setting is shown in Beran and Feng (2007). Sections 7.4.4 and 7.4.6 are devoted to bandwidth choice in nonparametric kernel and local polynomial regression. Bandwidth choice in the long-memory context by cross-validation originates from Hall et al. (1995a), whereas the plug-in approach is discussed in Ray and Tsay (1997), Beran and Feng (2002a, 2002b, 2002c). Sections 7.4.5 and 7.4.6 include a discussion of the so-called SEMIFAR models and iterative procedures to estimate the trend function and, in particular, the long-memory parameter simultaneously (Beran 1999; Beran and Feng 2001a, 2001b, 2002a, 2002b, 2007; Beran and Ocker 2001). Furthermore, robust versions of local polynomial estimators in the long-memory context are considered in Beran et al. (2002) and Beran et al. (2003). Extensions to nonequidistant time series and tests for rapid change points are discussed in Sect. 7.10 (Menéndez et al. 2010).

Section 7.4.8 is devoted to random design regression. It turns out that the choice of a bandwidth is even more fundamental than for fixed design regression. We show a dichotomy between small and large bandwidths. This is the same phenomenon as observed already for density estimation (see Sect. 5.14). For small bandwidths, long-range dependence in the residuals has no influence and one obtains exactly the same asymptotic distribution as for i.i.d. data. This is in contrast to fixed-design kernel (and local polynomial) regression. For large bandwidths, we have a long-memory behaviour. We also show an improvement in the rate of convergence for shape functions. Such observations have their origin in the work by Cheng and Robinson (1994). Further references include Csörgő and Mielniczuk (1999, 2000), Mielniczuk and Wu (2004), Zhao and Wu (2008), Kulik and Lorek (2011). In the latter article, the authors consider a very general class of errors that includes FARIMA–GARCH and antipersistent processes. In Bryk and Mielniczuk (2008), the authors consider a randomization scheme for fixed-design regression. As a consequence, the resulting kernel estimator has a rate of convergence as in the random-design case. Results for the Nadaraya–Watson estimator have further extensions to local linear regression estimators (see Masry and Mielniczuk 1999 and Masry 2001). Furthermore, Benhenni et al. (2008) considered consistency of a kernel estimator in functional regression with stochastic regressors and long-memory errors.

In Sect. 7.4.9, we deal with estimation of the conditional variance σ 2(⋅) in random-design regression. Rates of convergence are different from those for estimation of the conditional mean m(⋅) in the model (7.88). Such results are obtained in Guo and Koul (2008), Zhao and Wu (2008), Kulik and Wichelhaus (2011, 2012), and also have some connections to residual empirical processes. The latter topic is not discussed here; we refer to Chan and Ling (2008) and Kulik and Lorek (2012).

7.4.1 Introduction

Here we briefly recall some basic results from kernel- and local polynomial smoothing. Also some first heuristic comments are made on the role of long-range dependence and antipersistence in the context of nonparametric regression.

7.4.1.1 The Priestley–Chao Regression Estimator—Deterministic Design

We consider the nonparametric regression model with a response variable Y being a function of a deterministic design variable X. In the simplest case, we have the regression model

$$ Y_{i}=m(x_{i})+e_{i}\quad (i=1,2,\ldots,n) $$
(7.89)

with fixed (i.e. deterministic) equally spaced design variables x 1,x 2,…,x n . Often one uses x i =t i =in −1∈[0,1]. To emphasize that the “explanatory” variables x i are deterministic and equally spaced, we will use the notation t i instead of x i . Note that, strictly speaking, one actually has a sequence of models Y i,n because the grid of t-values (x-values) changes slightly with each n, i.e.

$$Y_{i}=Y_{i,n}=m(t_{i})+e_{i}. $$

The residual process e i is assumed to be second-order stationary with E(e i )=0, autocovariances γ e (k) and variance \(\sigma_{e}^{2}=\gamma_{e} ( 0 ) \). The regression function m(t i ) is not specified except for suitable regularity conditions. In kernel and local polynomial smoothing, one usually assumes that m is at least continuous, or even a few times continuously differentiable (see, e.g. standard books such as Härdle 1990a, 1990b; Wand and Jones 1994; Fan and Gijbels 1996; Simonoff 1996; Eubank 1999; Tsybakov 2010).

Effective estimation of m can be quite difficult in the presence of long-range dependence. The reason is that long-memory processes tend to exhibit spurious trends which may be mistaken for deterministic ones. At the same time, smooth trends can lead to increased values of the periodogram near the origin and to sample autocovariances with a high positive bias. For example, considering a sample autocovariance at a fixed lag k≥0,

$$ \hat{\gamma}(k)=n^{-1}\sum_{i=1}^{n-k}(y_{i}- \bar{y}) (y_{i+k}-\bar{y}) $$
(7.90)

we have, as n→∞, \(\operatorname{var} ( \hat{\gamma}(k) ) =o ( 1 ) \), but

$$ \mathrm{Bias}=E\bigl[\hat{\gamma}(k)\bigr]-\gamma_{e} ( k ) \sim\int \biggl[ m(t)-\int m(s)\,ds \biggr]^{2}\,dt, $$
(7.91)

which is a positive constant, unless m is constant almost everywhere. Thus, not removing the trend function leads to the overestimation of d. Related to this is the problem that the choice of a good estimate of m depends on approximate knowledge of d. A feasible solution that will be described below (Sects. 7.4.4 and 7.4.6) can be given in terms of an iterative procedure where trend estimation and estimation of the dependence parameters of e i are applied repeatedly (Beran and Feng 2002a, 2002b; Ray and Tsay 1997).

Suppose now that m is smooth (in a sense to be specified). The problem is nonparametric estimation of this function. The Priestley–Chao estimator (0<t<1) is given by

$$ \widehat{m}_{\mathrm{PC}}(t)=\frac{1}{nb}\sum _{i=1}^{n}y_{i}K \biggl( \frac{t_{i}-t}{b} \biggr) $$
(7.92)

(Priestley and Chao 1972) where b>0 is a bandwidth, and K≥0 is a symmetric kernel function with support [−1,1] and ∫K(u) du=1. The idea is that, since m is continuous, the value of m(t) may be estimated by taking a weighted average over a neighbourhood of x. For instance, if \(K ( u ) =\frac{1}{2}1 \{ -1\leq u\leq1 \} \), then \(\widehat{m}_{\mathrm{PC}}(t)\) is the average over all y i with tbt i t+b. Since t i =in −1, this condition means n(tb)≤in(t+b) so that we are taking an average over 2[nb]+1 observations. Since the grid of t-values is increasingly dense and m is continuous, the bias of \(\widehat{m}_{\mathrm{PC}}(t)\) converges to zero, provided that the neighbourhood we are taking observations from shrinks. At the same time, however, one needs to make sure that the variance of \(\widehat{m}_{\mathrm{PC}}(t)\) tends to zero which means that the number of observations in the weighted mean must increase to infinity. This leads to the conditions b→0 and nb→∞.
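To make the construction concrete, here is a minimal numpy sketch of (7.92) on the equidistant design (an illustration only, using the Epanechnikov kernel as an example and no boundary correction; the function name is hypothetical).

```python
import numpy as np

def priestley_chao(y, t_eval, b):
    """Priestley-Chao estimate (7.92) on t_i = i/n with an Epanechnikov kernel."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    t = np.arange(1, n + 1) / n
    K = lambda u: 0.75 * np.clip(1.0 - u ** 2, 0.0, None)
    return np.array([np.sum(K((t - t0) / b) * y) / (n * b) for t0 in np.atleast_1d(t_eval)])
```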

The most important decision in kernel regression is the choice of the bandwidth b. If b is chosen too small, then the number of averaged observations is small so that the variance is large. On the other hand, if b is too large, then one averages the function m over a large neighbourhood of t. For highly nonlinear functions, this leads to a large bias. This dilemma leads to a trade-off between minimizing bias and variance. If the mean squared error is used as a criterion, then the separation of the two effects is additive,

$$ \mathit{MSE} ( t;b ) =E \bigl[ \bigl( \widehat{m}_{\mathrm{PC}}(t)-m ( t ) \bigr)^{2} \bigr] =\mathrm{Bias}^{2} ( t;b ) +\operatorname{var} \bigl( \widehat{m}_{\mathrm{PC}}(t) \bigr) . $$

Asymptotic expressions for the bias do not depend on the autocovariance structure of e i . Suppose that m is twice continuously differentiable. Using the notation i 0:=[nt] and u i =(t i −t)/b, the standard argument is a Taylor expansion of the form

$$ \begin{aligned} E \bigl[ \widehat{m}_{\mathrm{PC}}(t) \bigr] -m ( t ) &=\frac{1}{nb}\sum_{i=1}^{n}K ( u_{i} ) m ( t_{i} ) -m ( t ) \approx\int_{-1}^{1}K ( u ) \bigl[ m ( t+ub ) -m ( t ) \bigr]\,du\\ &\approx\frac{b^{2}}{2}m^{\prime\prime} ( t ) \int_{-1}^{1}u^{2}K ( u )\,du. \end{aligned}$$

(Note that the symmetry of K implies ∫K(u)udu=0.) Thus, the bias is proportional to the squared bandwidth and to the second derivative of m(t). If we can assume a higher degree of smoothness of m(t), then an even better order of the bias can be achieved by using a different type of kernel. Suppose that m(t) is k times differentiable. Using a Lipschitz continuous kernel with

$$ \int K(u)u^{i}\,du=\left \{ \begin{array} {l@{\quad}l}1, & i=0,\\ 0, & i=1,\dots,k-1,\\ \beta_{k}, & i=k, \end{array} \right . $$
(7.93)

we obtain

$$ E \bigl[ \widehat{m}_{\mathrm{PC}}(t) \bigr] -m ( t ) \approx\frac{\beta_{k}}{k!}m^{ ( k ) } ( t ) b^{k}, $$

provided that the error term in the Taylor expansion can be controlled well. Thus the bias is of order O(b k ). Kernels with property (7.93) are called kernels of order k; the kth moment of K, denoted by β k =∫K(u)u kdu≠0, is the so-called kernel constant in the asymptotic bias. In most cases, one uses kernels of order 2 for estimating m(t) because one would like to keep the assumptions on the unknown function as general as possible. More comments on the choice of a kernel are given in the next section.

In contrast to the bias, the variance of \(\widehat{m}_{\mathrm{PC}} ( t ) \),

$$\operatorname{var} \bigl( \widehat{m}_{\mathrm{PC}}(t) \bigr) = ( nb )^{-2}\sum _{i,j=1}^{n}K \biggl( \frac{t_{i}-t}{b} \biggr) K \biggl( \frac{t_{j}-t}{b} \biggr) \gamma_{e} ( i-j ), $$

depends on the autocovariance structure of e i . In particular, the distinction between short memory, long memory or antipersistence is essential because the variance turns out to be proportional to (nb)2d−1. This implies that a bandwidth chosen by minimizing the MSE will be of a different order for different values of d. It should be noted that the choice of b is not only important for estimating m but also for reliable estimation of the parameters d and c f which, in turn, determine the optimal value of b. Moreover, knowledge of these two parameters is needed for tests and confidence intervals for m, as well as for forecasting.

If one lets d vary freely, then the choice of a good bandwidth is not only more difficult but also more important than in situations where one assumes short memory (i.e. d=0) a priori. The reason is that, as mentioned above, the estimation of d from the residuals \(\hat{e}_{i}=y_{i}-\hat{m} ( t_{i} ) \) very much depends on the choice of b. This is illustrated in Fig. 7.6 with \(m ( t ) =\tanh ( \frac{1}{2} ( t-\frac{1}{2} ) ) \) and e i generated by a FARIMA(0,0.3,0) process with innovation variance one. The four figures show nonparametric fits \(\hat{m} ( t ) \) based on kernel regression with the rectangular kernel and different bandwidths: (a) very small bandwidth; (b) medium size bandwidth; (c) large bandwidth; (d) b=∞ (so that \(\hat{m} ( t ) \equiv\bar{y}\)). The true trend function m(t) is also displayed in Fig. 7.6(d). The bandwidth in (a) is clearly too small. The fitted line follows the data too closely. The corresponding residual series \(\hat{e}_{i}\) (Fig. 7.7(a)) therefore resembles an antipersistent process. Fitting a FARIMA(0,d,0) process to \(\hat{e}_{i}\) by maximum likelihood estimation (including model choice by the BIC) indeed yields a value of \(\hat{d}=-0.34\). The moderate and large bandwidths used in (b) and (c) provide much better trend estimates. The corresponding values of \(\hat{d}\) are equal to 0.23 and 0.25, respectively, and thus much closer to the true value of d=0.3. On the other hand, choosing an infinite bandwidth, and thus not removing any trend at all (Fig. 7.7(d)), leads to slight overestimation with \(\hat{d}=0.33\).

Fig. 7.6

The four pictures show the same series Y i =m(t i )+e i with \(m ( t ) =\tanh ( \frac{1}{2} ( t-\frac {1}{2} ) ) \) and e i generated by a FARIMA(0,0.3,0) process with innovation variance one. The four figures show nonparametric fits \(\hat{m} ( t ) \) based on kernel regression with the rectangular kernel and different bandwidths: (a) very small bandwidth; (b) medium size bandwidth; (c) large bandwidth; (d) b=∞. In (d), the true trend function is also shown

Fig. 7.7

Residuals \(\hat{e}_{i}=Y_{i}-\hat{m} ( t_{i} ) \) based on the fits in Figs. 7.6(a)–(d)
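The experiment just described (Figs. 7.6–7.7) can be mimicked in a few lines. The sketch below is an illustration only (hypothetical function names, a truncated MA(∞) simulation of FARIMA(0,d,0), a rectangular-kernel smoother and a crude log-periodogram regression for d), so the numbers will not coincide exactly with those quoted above.

```python
import numpy as np

def farima0d0(n, d, burn=2000, seed=None):
    """FARIMA(0,d,0) via a truncated MA(infinity) expansion:
    psi_0 = 1, psi_k = psi_{k-1} * (k - 1 + d) / k, unit innovation variance."""
    rng = np.random.default_rng(seed)
    m = n + burn
    psi = np.empty(m)
    psi[0] = 1.0
    for k in range(1, m):
        psi[k] = psi[k - 1] * (k - 1 + d) / k
    eps = rng.standard_normal(m)
    return np.convolve(eps, psi)[:m][burn:]

def kernel_smooth(y, b):
    """Rectangular-kernel (moving-average) trend fit on the grid t_i = i/n."""
    n, k = len(y), max(int(b * len(y)), 1)
    return np.array([y[max(0, i - k):min(n, i + k + 1)].mean() for i in range(n)])

def gph(e, frac=0.5):
    """Crude log-periodogram (GPH-type) regression estimate of d."""
    n = len(e)
    m = int(n ** frac)
    lam = 2 * np.pi * np.arange(1, m + 1) / n
    I = np.abs(np.fft.fft(e - e.mean())[1:m + 1]) ** 2 / (2 * np.pi * n)
    z = -2 * np.log(lam)
    z -= z.mean()
    return float(np.sum(z * (np.log(I) - np.log(I).mean())) / np.sum(z * z))

n, d = 1000, 0.3
t = np.arange(1, n + 1) / n
y = np.tanh(0.5 * (t - 0.5)) + farima0d0(n, d, seed=42)
for b in (0.01, 0.10, 0.30):
    print(f"b = {b:.2f}:  d estimated from residuals = {gph(y - kernel_smooth(y, b)):.2f}")
```

A very small b produces residuals that look antipersistent, whereas omitting the trend adjustment altogether inflates the estimate of d, in line with the discussion above.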

The easiest way to see the essential difference between long memory, short memory and antipersistence more formally is to look at the rectangular kernel \(K ( u ) =\frac{1}{2}1 \{ -1\leq u\leq1 \} \). For this second-order kernel, \(\widehat{m}_{\mathrm{PC}}(t)\) is simply a sample mean of 2[nb]+1 consecutive observations. From Corollary 1.2, we know that the variance can be approximated by \(c_{f}\nu ( d ) 2^{2d-1} ( nb )^{2d-1}\) where the spectral density of e i is assumed to be such that f e (λ)∼c f |λ|−2d, as λ→0, and

$$\nu ( d ) =\frac{\varGamma ( 1-2d ) 2\sin\pi d}{d(2d+1)}\quad(d\neq 0),\ \nu ( 0 ) =2\pi. $$

Thus, for the mean squared error we have

$$ \mathit{MSE} ( t;b ) \sim\tilde{C}_{1} ( t ) b^{4}+ \tilde{C}_{2} ( nb )^{2d-1} $$
(7.94)

with

$$\tilde{C}_{1} ( t ) = \biggl\{ {\frac{1}{2}}m^{\prime\prime} ( t ) \int_{-1}^{1}u^{2}K(u)\,du \biggr \}^{2}=\frac{1}{36} \bigl\{ m^{\prime\prime} ( t ) \bigr \}^{2}$$

and \(\tilde{C}_{2}=\nu ( d ) 2^{2d-1}c_{f}\). If the approximation is uniform in t (in a suitable sense), then we obtain an analogous formula for the integrated mean squared error

$$ \mathit{IMSE} ( b ) =\int_{0}^{1}\mathit{MSE} ( t;b )\,dt\sim C_{1}b^{4}+C_{2} ( nb )^{2d-1} $$
(7.95)

with

$$C_{1}=\int_{0}^{1} \tilde{C}_{1} ( t )\,dt=\frac{1}{36}\int_{0}^{1} \bigl\{ m^{\prime\prime} ( t ) \bigr\}^{2}\,dt $$

and \(C_{2}=\nu ( d ) 2^{2d-1}c_{f}\). Setting the derivative of the right-hand side of (7.95) equal to zero, we obtain the asymptotically optimal bandwidth

$$ b_{\mathrm{opt}}=C_{\mathrm{opt}}n^{-\beta_{\mathrm{opt}}} $$
(7.96)

with

$$\beta_{\mathrm{opt}}=\frac{1-2d}{5-2d},\qquad C_{\mathrm{opt}}= \biggl( \frac{ ( 1-2d ) C_{2}}{4C_{1}} \biggr)^{\frac{1}{5-2d}}. $$

The integrated squared curvature \(\int_{0}^{1} \{ m^{\prime\prime} ( t ) \}^{2}\,dt\) is in the denominator. This means that a smaller bandwidth is required if m has various sharp turns. The reason is that the bias can become quite large when we average over a too large neighbourhood. In contrast, if m is close to a straight line, then the curvature is almost zero so that one may average with a large bandwidth without causing much damage. Note that b opt is such that the bias and the variance terms in the MSE are of the same order. The optimal mean squared error is then of the order b 4 which means

$$ \mathit{MSE}_{\mathrm{opt}}\sim \mathrm{const}\cdot n^{-4\beta_{\mathrm{opt}}}=\mathrm{const}\cdot n^{-\frac{4-8d}{5-2d}}. $$
(7.97)

Under short memory (including independence) with d=0, one has the well known rates of \(b_{\mathrm{opt}}\sim \mathrm{const}\cdot n^{-\frac{1}{5}}\) and \(\mathit{MSE}_{\mathrm{opt}}\sim \mathrm{const}\cdot n^{-\frac{4}{5}}\). For long memory, β opt is smaller than \(\frac{1}{5}\) so that b opt is larger and the MSE opt converges to zero at a slower rate. The reason is that, due to long-term positive dependence, one needs more data to make the variance of the sample mean small. In contrast, under antipersistence (d<0) β opt is larger than \(\frac {1}{5}\) so that the optimal bandwidth and mean squared error converge to zero faster than under short memory. These properties carry over to other kernels K. In summary, optimal bandwidth selection very much depends on the type of memory we have in the residual process. In the case of long memory, larger bandwidths are required. This is also related to the problem that it is often difficult to distinguish between long-range dependence and deterministic trend functions or change points in the mean (see also Sect. 7.9). The basic reason is that trend functions tend to increase the values of the periodogram near the origin. This can be confounded with a pole due to long memory.
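For instance, plugging d=0.3 into these expressions gives

$$\beta_{\mathrm{opt}}=\frac{1-0.6}{5-0.6}=\frac{1}{11}\approx0.09,\qquad \mathit{MSE}_{\mathrm{opt}}\sim \mathrm{const}\cdot n^{-4/11}\approx \mathrm{const}\cdot n^{-0.36}, $$

so that, compared with the short-memory rates \(n^{-\frac{1}{5}}\) and \(n^{-\frac{4}{5}}\), a considerably larger bandwidth is needed and the attainable mean squared error decays much more slowly.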

The practical application of (7.95) is not straightforward because it involves the unknown quantities d, c f and m′′(t). If we are willing to assume short memory, then the problem is less difficult because the long-memory parameter is fixed at d=0. Various methods have been developed for obtaining a data driven approximation of the IMSE and thus an approximately optimal bandwidth. Well known methods are, for instance, cross-validation and iterative plug-in methods. If d is a free parameter in the interval \(( -\frac {1}{2},\frac{1}{2} ) \), then the problem is more involved. Data driven plug-in methods, however, have been developed, for instance, in Ray and Tsay (1997) and Beran and Feng (2002a, 2002b). The idea is to start with initial estimates of m(⋅) and m′′(t), estimate the parameters d and c f from the residuals, obtain an estimate of b opt and then iterate the procedure. This will be discussed below in Sects. 7.4.4 and 7.4.6. In the short-memory context, similar methods are discussed in Gasser et al. (1991) and Ruppert et al. (1995).
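The following is a schematic sketch of such an iteration. It is not the actual algorithm of Ray and Tsay (1997) or Beran and Feng (2002a, 2002b): the helper names are hypothetical, the smoother is the simple rectangular-kernel fit used earlier, d and c f are obtained from a crude log-periodogram regression, and the curvature ∫(m′′)² dt is approximated by double differencing an oversmoothed fit.

```python
import numpy as np
from math import gamma, pi, sin

def kernel_smooth(y, b):
    """Rectangular-kernel trend fit on t_i = i/n (same simple smoother as above)."""
    n, k = len(y), max(int(b * len(y)), 1)
    return np.array([y[max(0, i - k):min(n, i + k + 1)].mean() for i in range(n)])

def log_periodogram_fit(e, frac=0.5):
    """Crude (d, c_f) estimates from a log-periodogram regression near the origin,
    assuming f_e(lambda) ~ c_f |lambda|^{-2d}."""
    n = len(e)
    m = int(n ** frac)
    lam = 2 * np.pi * np.arange(1, m + 1) / n
    I = np.abs(np.fft.fft(e - e.mean())[1:m + 1]) ** 2 / (2 * np.pi * n)
    A = np.column_stack([np.ones(m), -2 * np.log(lam)])
    (a, d_hat), *_ = np.linalg.lstsq(A, np.log(I), rcond=None)
    return float(d_hat), float(np.exp(a + np.euler_gamma))  # Euler constant corrects E[log I]

def nu(d):
    """nu(d) as in Corollary 1.2, with nu(0) = 2*pi."""
    return 2 * pi if abs(d) < 1e-10 else gamma(1 - 2 * d) * 2 * sin(pi * d) / (d * (2 * d + 1))

def plugin_bandwidth(y, b0=0.1, n_iter=5):
    """Schematic plug-in iteration for b_opt in (7.96): alternate trend fitting,
    residual-based estimation of (d, c_f) and a naive estimate of int (m'')^2 dt."""
    n = len(y)
    t = np.arange(1, n + 1) / n
    b = b0
    for _ in range(n_iter):
        d_hat, cf_hat = log_periodogram_fit(y - kernel_smooth(y, b))
        m2 = np.gradient(np.gradient(kernel_smooth(y, min(2 * b, 0.45)), t), t)
        C1 = np.mean(m2 ** 2) / 36.0                  # (1/36) int (m'')^2 dt, rectangular kernel
        C2 = nu(d_hat) * 2 ** (2 * d_hat - 1) * cf_hat
        beta = (1 - 2 * d_hat) / (5 - 2 * d_hat)
        b = ((1 - 2 * d_hat) * C2 / (4 * C1)) ** (1 / (5 - 2 * d_hat)) * n ** (-beta)
        b = float(np.clip(b, 2 / n, 0.49))            # keep the pilot bandwidth usable
    return b, d_hat
```

In practice, the pilot bandwidths, the number of frequencies used for estimating d and the curvature estimate all require considerably more care; see the references cited above.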

7.4.1.2 Higher-Order Kernel Estimators and Estimation of Derivatives

So far we assumed that the kernel function K is given. More generally, not only the bandwidth but also the kernel K has to be chosen before carrying out a kernel regression. Although the choice of K is generally less important, it is still worth investigating the role of K in detail. In particular, one gains insight into the interplay between smoothness of the function and a suitable choice of the kernel, and it becomes more clear how to estimate derivatives.

Commonly used second-order kernels on [−1,1] are of the form

$$ K_{\mu}(u)=C_{\mu}\bigl(1-u^{2} \bigr)^{\mu}1\{-1\leq u\leq1\} $$
(7.98)

for some nonnegative integer μ, where C μ is such that ∫K(u) du=1. The parameter μ is called the degree of smoothness (or simply smoothness) of a kernel function of this type (see Müller 1984) which means that the (μ−1)th derivative of the kernel function is Lipschitz continuous. This also controls the degree of smoothness of the corresponding kernel estimator. For μ=0,1,2,3, K μ in (7.98) corresponds to the Uniform kernel, the Epanechnikov kernel, the Bisquare kernel and the Triweight kernel, respectively. Another commonly used kernel—which has, however, an unbounded support—is the Gaussian (or normal) kernel, i.e. the standard normal density function. It can also be considered as a rescaled limit of K μ for μ→∞. Explicit formulae of these kernel functions are given in Table 7.2.

Table 7.2 Some second-order kernels

The Uniform, the Epanechnikov and the Bisquare kernels are shown in Fig. 7.8. Corresponding higher-order kernels and kernels for estimating derivatives m (j)(t)=d j/dt j m(t) can be generated based on kernel functions defined in (7.98). This will be discussed below.
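As a small numerical aside (not from the original text), the normalizing constants C μ in (7.98) follow from a Beta integral; the sketch below reproduces the familiar values for the kernels just mentioned.

```python
from math import gamma, sqrt, pi

def C_mu(mu):
    """Normalizing constant of K_mu(u) = C_mu (1 - u^2)^mu on [-1, 1], cf. (7.98);
    int_{-1}^{1} (1 - u^2)^mu du = sqrt(pi) * Gamma(mu + 1) / Gamma(mu + 3/2)."""
    return gamma(mu + 1.5) / (sqrt(pi) * gamma(mu + 1))

def K_mu(mu, u):
    """Evaluate the kernel K_mu at a point u."""
    return C_mu(mu) * (1.0 - u * u) ** mu if abs(u) <= 1 else 0.0

# C_mu(0..3) = 0.5, 0.75, 0.9375, 1.09375: Uniform, Epanechnikov, Bisquare, Triweight
print([round(C_mu(m), 5) for m in range(4)])
```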

Fig. 7.8

Three commonly used second-order kernels with compact support

As already mentioned before, higher-order kernels as defined in (7.93) can be used to reduce the bias of \(\hat {m} ( t ) \), if we are willing to assume stronger smoothness properties for m. Note that a high-order kernel with k>2 (see (7.93)) is symmetric but not necessarily nonnegative. Thus, for

$$\hat{m} ( t ) = ( nb )^{-1}\sum y_{i}K \bigl( ( t_{i}-t ) /b \bigr) =\sum w_{i}y_{i}$$

the weights w i are sometimes negative, although we still have ∑w i =1. Second-order kernels defined by (7.98) are special cases of (7.93) with k=2. Most commonly used higher-order kernel functions are generated by the special kernels given in Table 7.2 (see Tables 5.7 of Müller 1988). Only kernels of polynomial form will be used for simplicity in the following. Most of the standard kernels proposed in the literature are of polynomial form.

Once the order of the kernel is fixed, its shape is less important and in particular does not influence the rate of convergence. If the residuals e i are i.i.d., then the optimal second-order kernel is Epanechnikov’s function \(K(u)=\frac{3}{4}(1-u^{2})\), in the sense that it minimizes the MSE when the optimal bandwidth is used (Epanechnikov 1969; Benedetti 1977). Similarly, higher-order kernels generated by the Epanechnikov kernel are also optimal for the corresponding order. These findings remain true under short memory. Despite its elegance this result is of little practical relevance because using suboptimal kernels does not lead to a substantial increase in the asymptotic MSE (Rosenblatt 1971). Furthermore, it turns out that an optimal kernel function does not exist in the long-memory setting.

Slightly more important than the shape is the degree of smoothness of the kernel function because it carries over to \(\hat{m} ( t ) \). If a kernel of smoothness μ is used, then \(\hat{m}\) has the same degree of smoothness, i.e. the (μ−1)th derivative of \(\hat{m}\) is Lipschitz continuous. Thus, the higher the μ the smoother the \(\hat{m}\). For instance, \(\hat{m}\) obtained with the uniform kernel is discontinuous because the kernel itself is discontinuous at both end points (u=±1). Note in particular that this does not depend on the smoothness of the true function m, nor is it influenced by the dependence structure of e i .

The most important feature of a kernel is its order. As demonstrated above, the optimal rate of convergence of \(\hat{m} ( t ) \) is faster the higher the order k. One should bear in mind, however, that, in general, this is only true if m(t) itself is smooth enough. Otherwise the asymptotic arguments leading to a bias of order O(b 2k) do not apply. Thus, using higher-order kernels and the corresponding asymptotic results involves rather strong assumptions on the unknown trend function m. Moreover, the finite sample variance of a higher order kernel estimator is usually larger than for a second-order kernel estimator. For small samples, the performance of a higher-order kernel estimator is therefore not necessarily better, even if m has the required smoothness properties. In practice, the order of the kernel is often chosen subjectively according to the data and further analysis. The safest choice that requires minimal assumptions is, however, a kernel of order 2.

Though the notion of higher-order kernels for estimating m(t) may seem mainly of theoretical interest, the general approach of defining higher-order kernels via their moments becomes practically relevant when it comes to estimating derivatives. Estimation of derivatives is not only important in applications where the derivatives themselves are the object of interest: even if the actual aim is to estimate m(t), optimal data driven bandwidth selection based on the plug-in idea requires the estimation of higher-order derivatives (see, e.g. (7.96)). Kernel estimators of m (j)(t) in the i.i.d. case are investigated, for instance, in Gasser and Müller (1984), Rice (1986) and Ullah (1988, 1989). The simplest way of obtaining an estimate of the jth derivative is to start with \(\hat{m} ( t ) \) based on a kernel of order k>j (as in definition (7.93)) that is at least j times differentiable, and then take the derivative. Thus we define

$$ \hat{m}^{(j)}(t):=\frac{d^{j}}{dt^{j}}\hat{m} ( t ) =\frac{d^{j}}{dt^{j}} \Biggl( \frac{1}{nb}\sum_{i=1}^{n}K \biggl( \frac{t_{i}-t}{b} \biggr) y_{i} \Biggr) $$
(7.99)
$$ =\frac{ ( -1 )^{j}}{nb^{j+1}}\sum_{i=1}^{n}K^{(j)} \biggl( \frac{t_{i}-t}{b} \biggr) y_{i}. $$
(7.100)

A more systematic approach is to define a new class of kernels as follows. Let j≥0 be an integer and k such that kj≥2 is an even number. A kernel function K of order (j, k) for estimating the jth derivative of m(t) (Gasser et al. 1985; Müller 1984, 1988) is defined as a Lipschitz continuous function satisfying the moment conditions

$$ \int K(u)u^{i}\,du=\left \{ \begin{array} {l@{\quad}l}0, & 0\leq i\leq k-1,i\neq j,\\ j!, & i=j,\\ \beta_{k}, & i=k, \end{array} \right . $$
(7.101)

where β k =∫K(u)u kdu≠0 is again a kernel constant in the asymptotic bias. A kernel of order (j,k) with k=j+2 is called a standard kernel function. On the other hand, K is called a higher-order kernel, if k>j+2. The estimator of m (j)(t) is then given by

$$ \hat{m}_{\mathrm{PC}}^{(j)}(t)=\frac{1}{nb^{j+1}}\sum _{i=1}^{n}K \biggl( \frac{t_{i}-t}{b} \biggr) y_{i}=\sum_{i=1}^{n}w_{i}^{j}y_{i} $$
(7.102)

with \(w_{i}^{j}=(nb^{j+1})^{-1}K((t_{i}-t)/b)\). As will be seen below, a necessary and sufficient condition for consistency of \(\hat{m}_{\mathrm{PC}}^{ ( j ) }(t)\), for d∈(−0.5,0.5), is that b→0 and (nb)1−2d b 2j→∞. In particular, the second condition implies nb 1+j→∞ which is a necessary condition for \(w_{i}^{j}\) to tend to zero uniformly. More exactly, (7.102) is a good definition for interior points only. As discussed in the next section, the kernel has to be modified near the border to keep the bias small. This will be discussed below. A heuristic justification of definition (7.101) and (7.102) can be given as before, namely

$$E \bigl[ \hat{m}_{\mathrm{PC}}^{ ( j ) }(t) \bigr] \approx\frac{1}{b^{j}}\int K ( u ) m ( t+ub )\,du\approx\frac{1}{b^{j}}\sum_{i=0}^{k}\frac{m^{ ( i ) } ( t ) }{i!}b^{i}\int K ( u ) u^{i}\,du=m^{ ( j ) } ( t ) +\frac{\beta_{k}}{k!}m^{ ( k ) } ( t ) b^{k-j}. $$

Note that kernels of order (0,k) coincide with kernels of order k according to the previous definition (7.93). Besides the moment conditions given in (7.101), some additional conditions are often required, such as the degree of smoothness and the minimal number of sign changes.

7.4.1.3 Boundary Effects and Boundary Kernels

Formula (7.102) does not yield good results for boundary points t∈[0,b)∪(1−b,1] (see, e.g. Gasser and Müller 1979 and Müller 1984). The reason is that observations are not placed symmetrically on both sides of t. This increases the bias. While the bias of the estimator in (7.102) is of the order O(b 2), it is the order O(b) at boundary points. This problem can be solved by using the so-called boundary kernels. The solution is relatively complex in general though, in particular when higher order kernels are used or when estimation of the derivatives is considered. A more elegant solution is provided by local polynomial regression discussed later, where adaptation at the boundary is automatic. Nevertheless, it is interesting to study the approach of boundary kernels because one gains a better understanding of boundary problems. Moreover, local polynomial fits can be represented asymptotically as kernel estimators with boundary kernels at boundary points (see Sect. 7.4.1.6).

Consider, for instance, a second-order kernel estimator \(\hat{m} ( t ) \) of m(t) and denote by Δ(t) its bias. The contribution of the bias to the IMSE is \(B=\int_{0} ^{1}\varDelta ^{2}(t)\,dt\). Although the length of the boundary areas tends to zero, the contribution of Δ(t) in the boundary region is not negligible. The reason is that the contribution of interior points to the IMSE is

$$\int _{b}^{1-b}\varDelta ^{2}(t)\,dt=\int _{b}^{1-b}O \bigl( b^{4} \bigr)\,dt=O \bigl( b^{4} \bigr) $$

whereas for boundary points we have

$$\int _{0}^{b}\varDelta ^{2}(t)\,dt=\int _{0}^{b}O\bigl(b^{2}\bigr)\,dt=O \bigl( b^{3} \bigr) $$

and the same holds for \(\int_{1-b}^{1}\varDelta ^{2} ( t )\,dt\). This means that the integrated squared bias is dominated by the bias in the boundary regions. In the extreme case with t=0, the estimator in (7.102) even converges to \(\frac{1}{2}m ( 0 ) \) because we have only half of the weights (Müller 1991). The boundary effect is even worse for higher-order kernel estimators and kernel estimators of derivatives.

The problem can be overcome by using boundary kernels that are designed to make the bias of the same order of magnitude for all t∈[0,1]. To achieve that, the moment conditions given in (7.101) should be satisfied not only at interior but also at boundary points. Boundary kernels are solutions obtained from (7.101) and additional side conditions. Examples of boundary kernels may be found in Gasser and Müller (1979), Gasser et al. (1985), Müller (1991) and Müller and Wang (1994). In the following, the discussion will only be carried out for left boundary points t∈[0,b). For the right boundary, arguments are analogous. Note that asymptotically any fixed point t∈(0,1) is an interior point because b→0. A left boundary point can be written as t=cb with 0≤c=c(t)<1. For interior points t∈[b,1−b], we define c=1.

A left boundary kernel K c (u) of order (j,k) is defined as a Lipschitz continuous function with compact support [−1,c] satisfying the moment conditions

$$ \int_{-1}^{c}K_{c}(u)u^{i}\,du= \left \{ \begin{array} {l@{\quad}l}0, & i=0,\dots,j-1,\ j+1,\dots,k-1,\\ j!, & i=j,\\ \beta_{c,k}\neq0, & i=k. \end{array} \right . $$
(7.103)

Boundary kernels for the right boundary t∈(1−b,1] are defined in an analogous manner.

For the kernel function in the interior, some additional conditions are often required such as a certain degree of smoothness. Müller (1991) proposed a class of the so-called μ-smooth optimal boundary kernels which are obtained by solving (7.103) under the side condition that \(\int_{-1}^{c}[K_{c}^{(\mu)}(u)]^{2}\,du\) is minimized. Such kernels have the same degree of smoothness in the boundary area as in the interior. Also, the degree of smoothness of such boundary kernels is always μ over the whole support [−1,c]. Second-order boundary kernels of this type (for estimating the regression function m itself) corresponding to the Uniform, the Epanechnikov and the Bisquare kernels in the interior (see Table 1 in Müller 1991) are listed in Table 7.3. For c=1, these formulae reduce to the corresponding ones in the interior given in Table 7.2.

Table 7.3 Three commonly used second-order μ-smooth boundary kernels

Another class of boundary kernels with a so-called (μ,μ−1) degree of smoothness was proposed by Müller and Wang (1994). These are defined as solutions of (7.103) under certain smoothness conditions (see (K2) and (K3) in Müller and Wang 1994, with α and β there corresponding to μ and μ−1, respectively). At a boundary point t=cb with 0≤c<1, the degree of smoothness of a boundary kernel in this class is μ at the left end point u=−1 and μ−1 at the right end point u=c, provided that μ>1. In the interior, one obtains the same kernels as before. In particular, the kernels given in Table 7.3 may be called boundary kernels with a (μ,μ) degree of smoothness. The authors showed that these new boundary kernels have some advantages over those proposed in Müller (1991). Note that the boundary kernels given in Table 7.3 are polynomials of order 2μ−2 in the interior and of order 2μ−1 at the boundary. In contrast, for μ≥1, the boundary kernels proposed by Müller and Wang (1994) are of the same order 2μ−2 in the interior and at the boundary. Boundary kernels in this class corresponding to the Uniform, the Epanechnikov and the Bisquare kernels in the interior are listed in Table 7.4. Note that here the boundary kernel corresponding to the Epanechnikov kernel with c<1 is discontinuous at u=c. This means that the degree of smoothness at this end point is μ−1=0.

Table 7.4 Three second-order boundary kernels proposed by Müller and Wang (1994)

Further examples of boundary kernels can be found, for instance, in Gasser et al. (1985), Müller (1988, Sect. 5.8). Messer and Goldstein (1993) considered the continuation of equivalent spline kernels from the interior to the boundary. Gasser et al. (1985) also proposed some boundary kernels which, for any μ, are non-smooth at the end point u=c (c≠1). Boundary kernels considered by Gasser et al. (1985) belong to another class generated by local polynomial regression with a truncated weight function at the boundary.

7.4.1.4 The Nadaraya–Watson Regression Estimator—Random Design

If we consider the same nonparametric regression model (7.89),

$$Y_{i}=m(x_{i})+e_{i}\quad(i=1,\ldots,n), $$

but with a design variable X=x that is random, say with density function p X , then the Priestley–Chao estimator has to be modified, in general. The reason is that by analogous arguments as above one obtains

$$E \bigl( \widehat{m}_{\mathrm{PC}}(x) \bigr) =p_{X}(x)m(x)+O \bigl(b^{2}\bigr)\quad \bigl( x\in ( 0,1 ) \bigr). $$

Thus, in general, one has a bias that does not disappear asymptotically, unless p X is the uniform distribution on [0,1]. (Note, in particular, that the equidistant fixed design considered previously can be seen as a special case, or rather an extended special case, in the sense of conditional inference given x 1,…,x n and a uniform limiting design density p X .) A simple solution is to divide \(\widehat{m}_{\mathrm{PC}}(x)\) by a consistent estimate of p X (x). This is the idea of the Nadaraya–Watson estimator (Nadaraya 1964; Watson 1964)

$$ \widehat{m}_{\mathrm{NW}}(x)=\frac{\sum_{i=1}^{n}y_{i}K ( \frac{x_{i}-x}{b} ) }{\sum_{i=1}^{n}K ( \frac{x_{i}-x}{b} ) }= \frac{\widehat{m}_{\mathrm{PC}}(x)}{\hat{p}_{X}(x)} $$
(7.104)

where

$$\hat{p}_{X}(x)=\frac{1}{nb}\sum_{i=1}^{n}K \biggl( \frac{x_{i}-x}{b} \biggr) $$

is the so-called Parzen–Rosenblatt kernel estimator of p X (x) (Rosenblatt 1956; Parzen 1979). Since, under standard conditions, \(\hat{p}_{X}(x)\rightarrow_{p}p_{X} ( x ) \) and \(\widehat{m}_{\mathrm{PC}}(x)\rightarrow_{p}p_{X} ( x ) m ( x ) \), the Nadaraya–Watson estimator \(\widehat{m}_{\mathrm{NW}}(x)\) converges in probability to m(x). Expressions for the bias and variance are slightly more complicated than those for \(\widehat{m}_{\mathrm{PC}}(x)\) in the deterministic equidistant case because the accuracy of \(\hat{p}_{X}(x)\) also plays a role. However, the order of the bias is as before, namely O(b 2) for second-order kernels. To what extent the variance of \(\widehat{m}_{\mathrm{NW}}(x)\) is influenced by the autocovariance structure depends on the random mechanism generating the values of X. This is similar to a parametric linear regression where, for instance, autocorrelations play no role when Y i =βx i +e i with x 1,…,x n obtained by i.i.d. sampling of a zero-mean random variable X, whereas the opposite is true when E(X)≠0 (see Sect. 7.2).
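As a small illustration (not from the original text), the estimator (7.104) with a Gaussian kernel can be coded as follows; the function name is hypothetical.

```python
import numpy as np

def nadaraya_watson(x, y, x_eval, b):
    """Nadaraya-Watson estimate (7.104): ratio of the Priestley-Chao numerator
    and the Parzen-Rosenblatt density estimate, here with a Gaussian kernel."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    K = lambda u: np.exp(-0.5 * u ** 2)  # unnormalized Gaussian kernel (constants cancel)
    out = []
    for x0 in np.atleast_1d(x_eval):
        w = K((x - x0) / b)
        out.append(np.sum(w * y) / np.sum(w))
    return np.array(out)
```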

7.4.1.5 Local Polynomial Smoothing

The main idea behind local polynomial smoothing (see, e.g. Ruppert and Wand 1994 and Fan and Gijbels 1995, 1996 and references therein) is based on a polynomial approximation of a (p+1)-times differentiable function m(x) in a small neighbourhood of x. This is applicable to deterministic as well as to random designs. By a Taylor series expansion around x, a pth-degree polynomial approximation of m(x i ) is given by

$$m(x_{i})\approx m(x)+(x_{i}-x)m^{(1)}(x)+{ \frac{(x_{i}-x)^{2}}{2!}}m^{(2)}(x)+\cdots+{\frac{(x_{i}-x)^{p}}{p!}}m^{(p)}(x). $$

As before, we use the notation m (j) for the jth derivative. Since the coefficients

$$\beta_{j}=\beta_{j}(x)={\frac{m^{(j)}(x)}{j!}}\quad(j=0,1,2, \ldots ,p)$$

are fixed, we can rewrite m(x i ) as

$$m(x_{i})\approx\sum_{j=0}^{p}(x_{i}-x)^{j} \beta_{j}$$

where the coefficients β 0,…,β p are the same for all x i “close” to x. This enables us to estimate m(x) and its derivatives m (j)(x) (j=1,2,…,p) by fitting a local polynomial of degree p to observations (x i ,y i ) with x i (fixed or random) in the neighbourhood of x. Estimates of derivatives are then defined by

$$\hat{m}^{ ( j ) }(x)=j!\hat{\beta}_{j}\quad(j=0,1,\dots,p). $$

In other words, we apply a polynomial regression locally. The regression parameter β=β(x)=(β 0,…,β p )T is estimated by minimizing a weighted sum of squared residuals,

$$Q(x)=\sum_{i=1}^{n} \Biggl\{ y_{i}-\sum_{j=0}^{p}(x_{i}-x)^{j} \beta_{j} \Biggr\}^{2}D \biggl( \frac{x_{i}-x}{b} \biggr), $$

with respect to β where the weights D((xx i )/b) make sure that only values in the neighbourhood of x are included. In matrix form, Q can also be written as

$$Q(x)= ( \mathbf{y}-\mathbf{X\beta} )^{\prime}\mathbf{D}(x) ( \mathbf{y}-\mathbf{X\beta} ) $$

where

$$\mathbf{X}= ( \mathbf{x}_{\cdot1},\dots,\mathbf{x}_{\cdot p+1} ) = \left ( \begin{array} {c@{\quad}c@{\quad}c@{\quad}c}1 & x_{1}-x & \dots & ( x_{1}-x )^{p}\\ \vdots & \vdots & \ddots & \vdots\\ 1 & x_{n}-x & \dots& ( x_{n}-x )^{p}\end{array} \right ) $$

and

$$ \mathbf{D}=\left ( \begin{array} {c@{\quad}c@{\quad}c@{\quad}c}D ( \frac{x_{1}-x}{b} ) & 0 & \dots & 0\\ 0 & D ( \frac{x_{2}-x}{b} ) & \ddots & \vdots\\ \vdots & \ddots & \ddots & 0\\ 0 & \dots & 0 & D ( \frac{x_{n}-x}{b} ) \end{array} \right ) . $$
(7.105)

The weighted least squares solution can be written as

$$ \widehat{m^{(j)}}(x)=j!\hat{\beta}_{j}=j! \mathbf{\delta}_{j+1}^{T} \bigl(\mathbf{X}^{T}\mathbf{DX} \bigr)^{-1}\mathbf{X}^{T}\mathbf{Dy} $$
(7.106)

where \(\mathbf{\delta}_{j}= ( \delta_{1,j},\dots,\delta_{p+1,j} )^{T}\) (j=1,…,p+1) denote unit vectors with δ j,j =1, δ i,j =0 (ij).
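A minimal sketch of (7.106) at a single point x 0 may help to see that only a weighted linear regression is involved (illustration only; Epanechnikov weight function, hypothetical names).

```python
import numpy as np
from math import factorial

def local_polynomial(x, y, x0, b, p=1, j=0):
    """Local polynomial estimate of m^{(j)}(x0), cf. (7.106): weighted least squares
    fit of degree p with Epanechnikov weights D((x_i - x0)/b)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    D = 0.75 * np.clip(1.0 - ((x - x0) / b) ** 2, 0.0, None)
    X = np.column_stack([(x - x0) ** l for l in range(p + 1)])
    XtD = X.T * D                                 # X^T D
    beta = np.linalg.solve(XtD @ X, XtD @ y)      # (X^T D X)^{-1} X^T D y
    return factorial(j) * beta[j]
```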

To derive asymptotic properties of \(\widehat{m^{( j )}}(x )\), it is often convenient to write (7.106) as a weighted sum. Defining the weighting system

$$ \mathbf{w}_{j;b,n}^{T}= \bigl( w_{j;b,n} ( x;1 ) , \dots,w_{j;b,n}( x;n ) \bigr) =j!\mathbf{\delta}_{j+1}^{T} \bigl( \mathbf{X}^{T}\mathbf{DX} \bigr)^{-1} \mathbf{X}^{T}\mathbf{D,} $$
(7.107)

we have

$$\widehat{m^{(j)}}(x)=\mathbf{w}_{j;b,n}^{T} \mathbf{y}=\sum_{i=1}^{n}w_{j;b,n} ( x;i ) Y_{i}. $$

Note that each weight w j;b,n (x;i) associated with Y i changes with the sample size n. Thus, investigating the asymptotic distribution of \(\widehat{m^{(j)}}(x)\) amounts to studying the sequence of sums

$$ S_{n}=\sum_{i=1}^{n}w_{j;b,n} ( x;i ) e_{i}=\sum_{i=1}^{n}\zeta_{i,n}\quad(n\in\mathbb{N}) $$
(7.108)

of a triangular array ζ i,n =w j;b,n (x;i)e i (1≤i≤n; \(n\in\mathbb{N}\)). Since

$$\mathbf{\delta}_{j+1}^{T} \bigl( \mathbf{X}^{T} \mathbf{DX} \bigr)^{-1}\mathbf{X}^{T}\mathbf{DX}= \mathbf{\delta}_{j+1}^{T}= ( 0,\dots,0,1,0,\dots,0 ) $$

(with 1 being the (j+1)st component), the weights have the property

$$ \mathbf{w}_{j;b,n}^{T}\mathbf{x}_{\cdot j+1}=\sum _{i=1}^{n}w_{j;b,n} ( x;i ) ( x_{i}-x )^{j}=j! $$
(7.109)

and

$$ \mathbf{w}_{j;b,n}^{T}\mathbf{x}_{\cdot l+1}=\sum _{i=1}^{n}w_{j;b,n} ( x;i ) ( x_{i}-x )^{l}=0\quad(l\neq j,\ 0\leq l\leq p ). $$
(7.110)

These equations hold under any design, and they imply that \(\hat{m}^{(j)}\) is exactly unbiased in the case where m is a polynomial of degree q≤p.

The bias of local polynomial estimators is of the same order for interior and boundary points. For instance, if j=0 and p=1, then

$$E \bigl[ \hat{m} ( x ) \bigr] -m ( x ) =\sum _{i=1}^{n}w_{0;b,n} ( x;i ) \bigl[ m ( x_{i} ) -m ( x ) \bigr] =O \bigl( b^{2} \bigr), $$

where the latter equality follows from (7.110) and a detailed argument for the remainder term using the property (x i −x)2≤b 2. More generally, local polynomial estimators of m (j) are automatically boundary corrected if p−j is odd, in the sense that the bias at interior and boundary points is of the same order. In contrast, for kernel estimators (7.109) and (7.110) hold only approximately, and this leads to problems at the boundary. Furthermore, these properties show that local polynomial regression is design adaptive. In contrast to the Priestley–Chao kernel estimator, no adjustment by the design density is required.

More specifically, if b→0 and nb 3→∞, then, under suitable conditions on D, expressions for the bias of \(\widehat {m^{(j)}}(x)\) can be shown to be of the form

$$ \mathit{Bias} \bigl( \widehat{m^{(j)}}(x) \bigr) \approx\left \{ \begin{array} {l@{\quad}l}c_{1}b^{p+1-j}m^{(p+1)}(x), & p-j\ \mathrm{odd},\\ b^{p+2-j} \bigl( c_{1}m^{(p+2)}(x)+c_{2}m^{(p+1)}(x)\frac{p_{X}^{\prime}(x)}{p_{X}(x)} \bigr) , & p-j\ \mathrm{even}, \end{array} \right . $$

with c 1 and c 2 not depending on m. In particular, this means that if p−j is even, then the bias is affected by the design density. This can be problematic especially near the boundary of the x-space, and thus we have another reason for choosing p−j odd. Moreover, one would like to choose p as small as possible in order to avoid unnecessary differentiability conditions on m. Therefore, the usual choice of p is j+1 which leads to a bias of the order O(b 2).

The variance of \(\widehat{m^{(j)}}(x)\) depends on the autocovariance structure and the design. For asymptotic considerations, it is also useful to note that local polynomials can be approximated by kernel estimators. For instance, in the case of equidistant fixed design regression with x i =i/n=:t i , the asymptotically equivalent kernel estimator is (see Müller 1987 and Feng 1999)

$$\tilde{m}^{ ( j ) } ( t ) =\frac{1}{nb^{j+1}}\sum K_{ ( j,p+1,c ) } \biggl( \frac{t_{i}-t}{b} \biggr) Y_{i}$$

where the “equivalent kernel” K (j,p+1,c) has the following properties. As before, we write t=cb and t=1−cb with 0≤c<1 for boundary points, and c=1 for interior points t∈[b,1−b]. Then K (j,p+1,c)(u) is such that, for 0≤j≤p,

$$\int_{-c}^{1}K_{ ( j,p+1,c ) } ( u ) u^{i}\,du=\left \{ \begin{array} {l@{\quad}l}j!, & i=j,\\ 0, & i\neq j,\ 0\leq i\leq p, \end{array} \right . $$

and

$$\tau=\int_{-c}^{1}K_{ ( j,p+1,c ) } ( u ) u^{p+1}\,du\neq0. $$

Note that the kernel is different for boundary points. This reflects the automatic boundary correction of local polynomials. Equivalence is expressed in terms of a uniform approximation of the weighting system w j;b,n of \(\hat{m}^{ ( j ) } ( t ) \) by the weighting system \(\tilde{\mathbf{w}}_{j;b,n}\) of \(\tilde{m}^{ ( j ) } ( t ) \), namely

$$\lim _{n\rightarrow\infty}\sup _{1\leq i\leq n}\biggl| \frac {w_{j;b,n} ( t;i ) }{\tilde{w}_{j;b,n} ( t;i ) }-1\biggr|=0 $$

where we define 0/0:=1 (Müller 1987; also see Lejeune 1985; Lejeune and Sarda 1992 and Ruppert and Wand 1994). Using the approximation by \(\tilde {m}^{ ( j ) } ( t ) \), one obtains the asymptotic variance of \(\hat{m}^{ ( j ) } ( t ) \) by similar arguments as for the Priestley–Chao kernel estimator,

$$\operatorname{var} \bigl( \hat{m}^{ ( j ) } ( t ) \bigr) =O \bigl( ( nb )^{2d-1}b^{-2j} \bigr) \quad ( n\rightarrow\infty ) $$

(Beran and Feng 2001a, 2001b, 2002c, 2007).

Example 7.28

Let p=0. Then we obtain a local constant fit that minimizes

$$Q(x)=\sum_{i=1}^{n} \{ y_{i}- \beta_{0} \}^{2}D \biggl( {\frac {t_{i}-t}{b}} \biggr) . $$

The solution is a weighted sample mean

$$\hat{\beta}_{0}(x)=\frac{1}{nb}\sum _{i=1}^{n}\tilde{D} \biggl( {\frac{t_{i}-t}{b}} \biggr) y_{i}$$

with

$$\tilde{D}(u)=\frac{D(u)}{(nb)^{-1}\sum_{i=1}^{n}D ( \frac{t_{i}-t}{b} ) }. $$

Thus, \(\tilde{D}(u)\) is the equivalent kernel. Note that \(\hat{\beta}_{0}(x)\) is the Nadaraya–Watson estimator discussed in the previous section. Explicit formulae of the weights for the local linear estimator of m(t) are given by (2.3) and (2.4) in Fan (1992).

In summary, the main practical advantages of local polynomial estimation compared to direct kernel smoothing are the direct availability of estimated derivatives, the automatic bias correction at the border (for more discussion on this topic, see, e.g. Fan and Gijbels 1996) and design adaptivity. The calculation of \(\hat{m}^{ ( j ) } ( x ) \) is very simple because it essentially only requires a program for linear regression. The representation by an equivalent kernel estimator is useful for deriving asymptotic results.

7.4.1.6 Calculation of Equivalent Kernels

Here we provide some details on the calculation of the equivalent kernel introduced above. We consider the case of j=0 only, i.e. estimation of m(x) by

$$\widehat{m}(x)=\mathbf{w}^{T}\mathbf{y}=\sum _{i=1}^{n}w ( i ) Y_{i}$$

with

$$\mathbf{w=w}_{0;b,n}^{T}=\mathbf{\delta}_{1}^{T} \bigl( \mathbf{X}^{T}\mathbf{DX} \bigr)^{-1} \mathbf{X}^{T}\mathbf{D}. $$

Lejeune and Sarda (1992) showed that there is a kth order equivalent kernel function (for estimating m) where k=p+1 if p is odd and k=p+2 if p is even. It can be calculated as follows. Let

$$ \mathbf{N}_{p}=\left ( \begin{array} {c@{\quad}c@{\quad}c@{\quad}c}1 & \mu_{1} & \dots & \mu_{p}\\ \mu_{1} & \mu_{2} & \dots & \mu_{p+1}\\ \vdots & \vdots & \ddots & \vdots\\ \mu_{p} & \mu_{p+1} & \dots & \mu_{2p}\end{array} \right ) , $$
(7.111)

and

$$ \mathbf{M}_{p}=\left ( \begin{array} {c@{\quad}c@{\quad}c@{\quad}c}1 & \mu_{1} & \dots & \mu_{p}\\ u & \mu_{2} & \dots & \mu_{p+1}\\ \vdots & \vdots & \ddots & \vdots\\ u^{p} & \mu_{p+1} & \dots & \mu_{2p}\end{array} \right ) , $$
(7.112)

where \(\mu_{j}=\int_{-1}^{1}u^{j}D(u)\,du\) is the jth moment of D(u). The equivalent kernel function is given by

$$ K(u)=K_{k}(u)=\frac{\det ( \mathbf{M}_{p}(u) ) }{\det ( \mathbf{N}_{p} ) }D(u). $$
(7.113)

Note that the kernel function is determined by the weight function D(u) and the order of the polynomial p. It does not depend on the design and is therefore the same for fixed (equi- and nonequidistant) and random design. Another representation is

$$ K(u)= \Biggl( \sum_{j=1}^{p+1}a_{1j}u^{j-1} \Biggr) D(u), $$
(7.114)

where \(\mathbf{N}_{p}^{-1}=(a_{ij})_{i,j=1,\dots,p+1}\). Note that for j even, a 1j =0. Thus, all odd powers of u in (7.114) vanish. One can also see that K(u) is a polynomial kernel whenever D(u) is a polynomial. Moreover, if p is even, then k=p+2=(p+1)+1, and one can see that K=K k is the same for p and p+1.

Let w NW(x;i) denote the weights of the Nadaraya–Watson estimator of m(⋅) defined by K k (u). It can be shown that w(x;i)=w NW(x;i)[1+o p (1)]. Hence the kernel K k (u) is often called the (asymptotically) equivalent kernel function of the local polynomial regression. This interpretation is, however, somewhat inaccurate because the detailed difference between the NW-estimator and the local polynomial estimator is only asymptotically negligible in the case of an equidistant design. This is not true for random or non-equidistant fixed design.

We conclude the discussion with two examples of equivalent kernels.

Example 7.29

Consider a local quadratic (p=2) or local cubic (p=3) estimator of m(t) using the Epanechnikov kernel \(D(u)=\frac{3}{4}(1-u^{2})\) (|u|≤1) as weight function. We have k=4, \(a_{11}=\frac{15}{8}\) and \(a_{13}=-\frac{35}{8}\). The resulting equivalent kernel is

$$ K_{4}^{\mathrm{E}}(u)=\frac{15}{32}\bigl(3-10u^{2}+7u^{4} \bigr), $$
(7.115)

which is a well known fourth-order kernel used in the literature (Gasser et al. 1985).
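A quick numerical check of (7.111)–(7.113) (an illustration only, with hypothetical names and numerically computed moments) reproduces the kernel in (7.115):

```python
import numpy as np

def equivalent_kernel(D, p, u, n_grid=20001):
    """Equivalent kernel (7.113): K(u) = det(M_p(u)) / det(N_p) * D(u), with the
    moments mu_j = int_{-1}^{1} v^j D(v) dv computed by a trapezoidal rule."""
    v = np.linspace(-1.0, 1.0, n_grid)
    trap = lambda f: float(np.dot((f[:-1] + f[1:]) / 2.0, np.diff(v)))
    mu = [trap(v ** j * D(v)) for j in range(2 * p + 1)]
    N = np.array([[mu[i + j] for j in range(p + 1)] for i in range(p + 1)])
    M = N.copy()
    M[:, 0] = [u ** i for i in range(p + 1)]
    return np.linalg.det(M) / np.linalg.det(N) * D(u)

# check against Example 7.29: local quadratic fit with the Epanechnikov weight function
D_epa = lambda v: 0.75 * np.clip(1.0 - np.asarray(v, float) ** 2, 0.0, None)
for u in (0.0, 0.3, 0.8):
    print(u, float(equivalent_kernel(D_epa, 2, u)), 15 / 32 * (3 - 10 * u ** 2 + 7 * u ** 4))
```

The two printed columns coincide up to numerical integration error. (For the Gaussian weight function of Example 7.30, the integration range would have to be widened accordingly.)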

Example 7.30

Consider a local quadratic (p=2) or local cubic (p=3) estimator of m(t) using the Gaussian kernel \(D(u)=\varphi(u)=(2\pi)^{-\frac{1}{2}}\exp(-\frac {1}{2}u^{2})\) as weight function. We have k=4, \(a_{11}=\frac{3}{2}\) and \(a_{13}=-\frac{1}{2}\). The resulting equivalent kernel is

$$ K_{4}^{\mathrm{G}}(u)=\frac{1}{2}\bigl(3-u^{2} \bigr)\varphi(u). $$
(7.116)

Further examples of equivalent kernel functions in the interior may be found in Gasser et al. (1985) and Müller (1988). Examples of equivalent kernels including boundary kernels and estimation of derivatives are given in Feng (1999, 2004a, 2004b).

7.4.2 Fixed-Design Regression with Homoscedastic LRD Errors

7.4.2.1 Bias and Variance of Kernel and Local Polynomial Estimators

We assume a nonparametric regression model (7.89) with a fixed equidistant design,

$$Y_{i}=Y_{i,n}=m(t_{i})+e_{i}, $$

where t i =i/n and e i is a second-order zero mean stationary process with spectral density f e (λ)∼c f |λ|−2d for some \(d\in ( -\frac{1}{2},\frac{1}{2} ) \). In view of the discussion above, essentially the same results are expected to hold for local polynomial estimators and kernel estimators with boundary kernels. The following results are therefore formulated under the assumption that \(\hat{m}^{ ( j ) }\) is either a local polynomial estimator (with polynomials of degree p) or a kernel estimator of the corresponding degree and boundary corrections.

For reasons discussed previously, we will assume pj to be odd. Moreover, we will use the notation k=p+1. Thus kj+2 and kj is always even. If \(\hat{m}^{(j)}\) is a local polynomial estimator with polynomials of order p, then it is asymptotically equivalent to a certain kth order kernel estimator with boundary corrections (see discussion above). The corresponding kernel is denoted by K (j,p+1,c). Otherwise, if we use a kernel estimator, then this denotes the kernel we use. To derive the asymptotic mean squared error, the following assumptions are sufficient (but not necessary).

  1. A1.

    The errors e i have the Wold decomposition

    $$e_{i}=\sum_{s=0}^{\infty}a_{s} \varepsilon_{i-s}$$

    where E(ε i )=0, \(\sigma_{\varepsilon}^{2}=\operatorname{var} ( \varepsilon_{i} ) <\infty\),

    $$f_{e} ( \lambda ) =\frac{\sigma_{\varepsilon}^{2}}{2\pi}\bigl \vert A \bigl( e^{-i\lambda} \bigr) \bigr \vert ^{2}\sim c_{f}\vert \lambda \vert ^{-2d}\quad(\lambda\rightarrow0) $$

    for some d∈(−0.5,0.5) and ε i is a martingale difference.

  2. A2.

    The trend function m(t) is at least k (=p+1) times continuously differentiable on [0,1] with k≥j+2 and k−j even, and \(\hat{m}^{(j)}\) is either a pth order local polynomial or a kth order kernel estimator with a corresponding boundary correction.

  3. A3.

    For the bandwidth we have, as n tends to infinity,

    $$b\rightarrow0,\qquad ( nb )^{1-2d}b^{2j}\rightarrow\infty. $$
  4. A4.

    For y=x−(x−y) (with x and y in the support of K (j,p+1,c)) the kernel K (j,p+1,c) can be written as

    $$ K_{ ( j,p+1,c ) }(y)=K_{ ( j,p+1,c ) }(x)+\tilde {K}_{ ( j,p+1,c ) }(x-y), $$
    (7.117)

    where

    $$\tilde{K}_{ ( j,p+1,c ) }(x-y)=\sum_{j=1}^{r} \eta_{j}(x-y)^{j}, $$

    with coefficients η j =η j (x) determined by the value of x.

These conditions are sufficient for deriving the asymptotic results given below. Note, however, that for the derivation of the minimax lower bounds, for estimating the unknown dependence structure after subtracting a nonparametric trend estimate or for the development of data-driven algorithms, stronger conditions are required.

Assumption A1 defines the linear dependence structure, including short memory (with d=0), long memory (d>0) and antipersistence (d<0). If ε i are i.i.d., then e i is a linear fractional process. However, linearity is not required. It is sufficient that the innovations ε i form a martingale difference sequence. This is particularly useful when one would like to include short-range volatility dependence. For instance, Beran and Feng (2001a) consider the case where e i is a FARIMA–GARCH process with GARCH-innovations ε i . In other words,

$$e_{i}=A(B)\varepsilon_{i}, $$

where \(A(B)=(1-B)^{-d}\varphi^{-1}(B)\psi(B)\) is the usual FARIMA(p,d,q) operator. If only the asymptotic variance of \(\hat{m}^{(j)}\) is of interest, then weaker conditions than the martingale assumption are sufficient. The martingale assumption is useful when it comes to deriving the asymptotic distribution of \(\hat {m}^{(j)}\). Assumption A2 is a regularity condition on the smoothness of m which, together with A3, is required for the derivation of the order of magnitude of the bias of \(\hat{m}^{(j)}\). If only consistency is required, then it is sufficient that m (j) is continuous in a neighbourhood of x. As discussed previously, the first condition in A3 is needed so that the bias converges to zero. The second condition is needed for the variance to tend to zero. More specifically, (nb)1−2d b 2j→∞ implies nb j+1→∞ for all d∈(−0.5,0.5). This ensures that w j;b,n (t;i)→0 (see (7.107)). Condition A4 is needed for the case of antipersistence (see the result below). For local polynomial estimation A4 can be achieved, for instance, by using a second-order weight function K(u) in (7.105) that is μ-smooth and of the form

$$K(u)=C_{\mu}\bigl(1-u^{2}\bigr)^{\mu}1 \{ -1\leq u \leq1 \} $$

for some \(\mu\in\mathbb{N}\). For kernel estimation a polynomial kernel can be chosen directly by taking into account (7.117).

For any point t∈[0,1], the asymptotic mean squared error can be obtained by detailed arguments along the lines of the heuristic ideas outlined so far. As before, we write c=1 for interior points t∈(0,1), whereas boundary points are written as t=cb or t=1−cb with 0≤c<1. The corresponding support of K (j,p+1,c) is denoted by \([-a_{1},a_{2}]\), with a 1=c and a 2=1 for a left, and a 1=1 and a 2=c for a right boundary kernel. In the interior, we have a 1=a 2=1.

Theorem 7.22

Assume Conditions A1–A4. We define a 1=a 2=1 for interior points t∈[b,1−b], a 1=c, a 2=1 for left boundary points t=cb∈[0,b) and a 1=1, a 2=c for right boundary points t=1−cb∈(1−b,1]. Then for d∈(−0.5,0.5) and any t∈[0,1] we have

  1. (i)

    Bias:

    $$ E\bigl[\hat{m}^{(j)}(t)-m^{(j)}(t)\bigr]=b^{k-j} \frac{m^{(k)}(t)\beta_{ ( j,k,c ) }}{k!}\bigl[1+o(1)\bigr], $$
    (7.118)

    where \(\beta_{ ( j,k,c ) }=\int_{-a_{1}}^{a_{2}}u^{k}K_{ ( j,k,c ) }(u)\,du\),

  2. (ii)

    Variance:

    $$ \operatorname{var} \bigl( \hat{m}^{(j)}(t) \bigr) =(nb)^{2d-1}b^{-2j}V_{ ( j,k,c ) } ( d ) \bigl[1+o(1)\bigr], $$
    (7.119)

where for d=0 we have

$$ V_{ ( j,k,c ) } ( 0 ) =2\pi c_{f}\int_{-a_{1}}^{a_{2}}K_{ ( j,k,c ) }^{2}(x)\,dx, $$
(7.120)

for d>0,

$$ \begin{aligned}[b] V_{ ( j,k,c ) } ( d ) &=2c_{f}\varGamma(1-2d)\sin\pi d \\ &\quad {}\times\int _{-a_{1}}^{a_{2}}\int_{-a_{1}}^{a_{2}}K_{ ( j,k,c ) }(x)K_{ ( j,k,c ) }(y)|x-y|^{(2d-1)}\,dx\,dy \end{aligned} $$
(7.121)

and for d<0,

$$ V_{ ( j,k,c ) } ( d ) =2c_{f}\varGamma(1-2d)\sin(\pi d)I(j,k,c;d) $$
(7.122)

with

$$ I(j,k,c;d)=\int_{-a_{1}}^{a_{2}}K_{ ( j,k,c ) }(x)M(x)\,dx, $$
(7.123)
$$ M(x)=\int_{-a_{1}}^{a_{2}}\tilde{K}_{ ( j,k,c ) }(x-y)|x-y|^{2d-1}\,dy-K_{ ( j,k,c ) }(x)\int_{\{y<-a_{1}\}\cup\{y>a_{2}\}}|x-y|^{2d-1}\,dy. $$
(7.124)

We note that for j=0, k=2 the results in Theorem 7.22 agree with the expressions for bias and variance given above. Note also that being in the boundary region affects not only the bias but also the variance. The reason is that having less data in the boundary regions necessarily increases the variance, although the order of magnitude does not change. A detailed proof of Theorem 7.22 can be found in Beran and Feng (2002a). For earlier partial results in the short- and long-memory context, respectively, see, e.g. Altman (1990), Hart (1991) and Hall and Hart (1990a). Note that, for d<0, the integral on the right-hand side of (7.121) is not well defined. However, the two integrals in (7.124), which are based on the decomposition (7.117) of the kernel function, are both well defined, since −0.5<d<0 and the powers of (y−x) in \(\tilde{K}_{(j,k,c)}(x-y)\) are at least of order one. This is why the decomposition was needed.
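For d>0 the variance constant in (7.121) can also be evaluated numerically. The following sketch (assuming the rectangular kernel K (0,2,1)(u)=1/2 on [−1,1] and, purely for illustration, c f =1; function names are illustrative) compares a numerical evaluation of the double integral with the closed form \(2^{2d-1}/(d(2d+1))\) that the integral takes for this particular kernel.

```python
from math import gamma, sin, pi
from scipy.integrate import quad

def V_rect(d, c_f=1.0):
    """V_(0,2,1)(d) of (7.121) for the rectangular kernel K(u) = 1/2 on [-1,1].
    The inner integral over y, int_{-1}^{1} |x-y|^(2d-1) dy, is done analytically,
    which avoids the weak singularity at y = x; the outer integral is numerical."""
    inner = lambda x: ((x + 1) ** (2 * d) + (1 - x) ** (2 * d)) / (2 * d)
    I, _ = quad(lambda x: 0.25 * inner(x), -1, 1)     # 0.25 = K(x) * K(y)
    return 2 * c_f * gamma(1 - 2 * d) * sin(pi * d) * I

def V_rect_closed(d, c_f=1.0):
    # closed form of the double integral: 2^(2d-1) / (d * (2d+1))
    return 2 * c_f * gamma(1 - 2 * d) * sin(pi * d) * 2 ** (2 * d - 1) / (d * (2 * d + 1))

for d in (0.1, 0.25, 0.4):
    print(d, V_rect(d), V_rect_closed(d))     # the two values should agree
```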

Example 7.31

Let e t be generated by a FARIMA(0,d,0) process. Consider the kernel estimation of m with the rectangular kernel for interior points and the corresponding boundary kernels for left and right boundary points. Thus, j=0, and we choose k=2. For interior points, we have

$$K_{ ( 0,2,1 ) } ( u ) =\frac{1}{2}1 \{ -1\leq u\leq1 \} $$

and, for instance, for left boundary points we have the kernel

$$K_{ ( 0,2,c ) } ( u ) =\frac{1}{c+1} \biggl\{ 1+3 \biggl( \frac{1-c}{1+c} \biggr)^{2}+6\frac{1-c}{(1+c)^{2}}u \biggr\} $$

with 0≤c<1 (see Table 7.3). Note in particular that K (j,k,c) converges to the rectangular kernel as c→1. For β (j,p+1,c) we have

$$\beta_{ ( 0,2,1 ) }=\int _{-1}^{1}u^{2}K_{ ( 0,2,1 ) }(u)\,du= \frac{1}{2}\int_{-1}^{1}u^{2}\,du= \frac{1}{3}$$

and, with c<1, \(\beta_{ ( 0,2,c ) }\) is obtained from the analogous integral of \(u^{2}K_{ ( 0,2,c ) }(u)\) over the support of the boundary kernel.

Figure 7.9 shows how β (0,2,c) increases as c decreases to zero. The largest value, attained at c=0, is \(\beta_{ ( 0,2,0 ) }=\frac{13}{3}\). Thus, the bias of \(\hat {m} ( 0 ) \) is more than four times larger than for interior points. More specifically, we have for t∈[b,1−b],

$$\mathrm{Bias}=E\bigl[\hat{m}(t)\bigr]-m(t)=b^{2}\frac{1}{6}m^{(2)}(t)+o \bigl(b^{2}\bigr) $$

and for t=0,

$$\mathrm{Bias}=E\bigl[\hat{m}(0)\bigr]-m(0)=b^{2}\frac{13}{8}m^{(2)}(0)+o \bigl(b^{2}\bigr). $$

The variance can be evaluated from (7.119) by inserting K (0,2,c) in the corresponding integral. Figure 7.10 shows V (j,k,c)(d) as a function of c∈[0,1] for different values of d. As for the bias, the variance increases the closer we are to the boundary. However, in contrast to the bias, the effect is stronger for higher values of d. This means that the increase in the variance near the border is much more dramatic in the presence of strong long memory so that, for instance, confidence intervals for m(t) near the border can differ considerably from those at interior points. Note also that for d<0, the function \(\tilde{K}_{ ( j,p+1,c ) }=\tilde{K}_{ ( 0,2,1 ) }\) is given as follows. Let y=(yx)+x. Then for interior points (c=1) we have

$$K_{ ( 0,2,1 ) }(y)=\frac{1}{2}1 \{ -1\leq y\leq1 \} =K_{ ( 0,2,1 ) }(x)+ \tilde{K}_{ ( 0,2,1 ) }(x-y) $$

with \(\tilde{K}_{ ( 0,2,1 ) }\) being an indicator function determined by the value of x, namely

$$\tilde{K}_{ ( 0,2,1 ) }(u)=-\frac{1}{2} \bigl( 1 \{ u<x-1 \} +1 \{ u>x+1 \} \bigr) . $$

For 0≤c<1 and left boundary points, we have

$$\tilde{K}_{ ( 0,2,c ) }(u)=1 \{ -1\leq x\leq c \} 1 \{ x-c\leq x-y\leq x+1 \} , $$

and for right boundary points,

$$\tilde{K}_{ ( 0,2,c ) }(u)=1 \{ -c\leq x\leq1 \} 1 \{ x-1\leq x-y\leq x+c \} . $$

Again, the variance increases with decreasing c.

Fig. 7.9
figure 9

Plot of β (0,2,c) for 0≤c<1 and K (0,2,c) derived from the rectangular kernel

Fig. 7.10
figure 10

V (0,2,c)(d) plotted as a function of c∈[0,1) for different values of \(d\in ( 0,\frac{1}{2} ) \)

Theorem 7.22 implies an asymptotic formula for the MSE at t of the form

$$ \begin{aligned} \mathit{MSE} ( t ) &=E \bigl[ \bigl( \hat{m}^{(j)}(t)-m^{(j)}(t) \bigr)^{2} \bigr] \\ &\sim b^{2 ( k-j ) } \biggl( \frac{m^{(k)}(t)\beta_{ ( j,k,c ) }}{k!} \biggr)^{2}+ ( nb )^{2d-1}b^{-2j}V_{ ( j,k,c ) } ( d ) . \end{aligned} $$
(7.125)
(7.126)

By minimizing this expression, we obtain the asymptotically optimal local bandwidth

$$ b_{\mathrm{opt}}=b_{\mathrm{opt}}(t)=C_{\mathrm{opt}}(t)n^{-\alpha_{\mathrm{opt}}} $$
(7.127)

where

$$\alpha_{\mathrm{opt}}=\frac{1-2d}{2k+1-2d}$$

and

$$ C_{\mathrm{opt}} ( t ) = \biggl\{ \frac{2j+1-2d}{2(k-j)}\, \biggl( \frac {k!}{m^{(k)}(t)\beta_{ ( j,k,c ) }} \biggr)^{2}V_{ ( j,k,c ) } ( d ) \biggr \}^{\frac{1}{2k+1-2d}}. $$
(7.128)

Here it was tacitly assumed that m (k)(t)≠0. Note that a bandwidth of the optimal order \(n^{-\alpha_{\mathrm{opt}}}\) is such that the squared asymptotic bias and the asymptotic variance are of the same order of magnitude. Inserting b opt(t) in (7.125), we obtain an optimal MSE of the order

$$ \mathit{MSE}_{\mathrm{opt}}=O\bigl(n^{-r}\bigr), $$
(7.129)

with

$$ r=2 ( k-j ) \alpha_{\mathrm{opt}}=2(k-j)\cdot\frac{1-2d}{2k+1-2d}. $$
(7.130)

Under the assumptions of Theorem 7.22, this rate turns out to be optimal among all possible nonparametric regression estimators (Feng and Beran 2012). Moreover, Beran and Feng (2007) show that there is no kernel (or weighting system) that would be optimal for all values of \(d\in ( 0,\frac{1}{2} ) \). Thus, in contrast to the case where we restrict models to short-range autocorrelations, optimization with respect to the kernel is not meaningful because the value of d is not known a priori.

The standard choice of k is k=j+2 which leads to

$$r_{\mathrm{opt}}=r_{\mathrm{opt}} ( j,d ) =\frac{4(1-2d)}{2j+5-2d}$$

and

$$\varDelta r_{\mathrm{opt}} ( j,d ) =r_{\mathrm{opt}} ( j,0 ) -r_{\mathrm{opt}} ( j,d ) =\frac{4}{2j+5}-\frac{4(1-2d)}{2j+5-2d}.$$

Thus, compared to the case of short memory with d=0, the optimal order of the MSE is increased for d>0 and decreased for d<0 by the factor \(n^{\varDelta r_{\mathrm{opt}} ( j,d ) }\). In Fig. 7.11, Δr opt(j,d) is plotted against j=0, 1, 2, 3 and 4 for n=1000, and d ranging between −0.4 and 0.4. The effect is quite dramatic for low values of j and strong long memory. The largest increase within the range considered here is obtained for j=0 and d=0.4, with Δr opt(0,0.4)≈0.61. Note that for n=1000 this amounts to an increase by the factor \(n^{\varDelta r_{\mathrm{opt}} ( j,d ) }\approx67\).
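The quoted values are easily reproduced; a minimal sketch (assuming the standard choice k=j+2):

```python
def r_opt(j, d, k=None):
    """Optimal MSE exponent r in MSE_opt = O(n^{-r}), see (7.130)."""
    k = j + 2 if k is None else k
    return 2 * (k - j) * (1 - 2 * d) / (2 * k + 1 - 2 * d)

def delta_r(j, d):
    """Loss of rate relative to the short-memory case d = 0."""
    return r_opt(j, 0) - r_opt(j, d)

print(round(delta_r(0, 0.4), 2))          # approximately 0.61
print(round(1000 ** delta_r(0, 0.4)))     # approximately 67 for n = 1000
```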

Fig. 7.11
figure 11

Change Δr of the optimal exponent r opt in \(\mathit{MSE}_{\mathrm{opt}} (\hat{m}^{(j)})=O(n^{-r_{\mathrm{opt}}})\) compared to the case of short memory, as a function of j, plotted for different values of d

If one prefers to use a global bandwidth instead of a local one, then one can minimize an integrated MSE (IMSE). If we use local polynomial estimation or a kernel estimator with boundary kernels, then the bias for boundary points is of the same order as in the interior. The contribution of boundary points to the IMSE is therefore asymptotically negligible because the boundary intervals shrink to length zero. (It should be emphasized, however, that this conclusion is wrong when one does not use boundary kernels—see the previous discussion.) The asymptotic expression therefore simplifies to

$$ \begin{aligned} \mathit{IMSE}&=\int_{0}^{1}\mathit{MSE} ( t ) \,dt \\ &\sim b^{2 ( k-j ) } \biggl( \frac{\beta_{ ( j,k,1 ) }}{k!} \biggr)^{2}I_{k}+ ( nb )^{2d-1}b^{-2j}V_{ ( j,k,1 ) } ( d ) . \end{aligned} $$
(7.131)
(7.132)

where

$$ I_{k}=\int_{0}^{1} \bigl( m^{(k)}(t) \bigr)^{2}\,dt. $$
(7.133)

The asymptotically optimal global bandwidth is then given by

$$ b_{\mathrm{opt}}=C_{\mathrm{opt}}n^{-\alpha_{\mathrm{opt}}}$$
(7.134)

where α opt is as before and

$$ C_{\mathrm{opt}}= \biggl\{ \frac{2j+1-2d}{2(k-j)}\, \biggl( \frac{k!}{\beta_{ ( j,k,1 ) }} \biggr)^{2}\frac{V_{ ( j,k,1 ) } ( d ) }{I_{k}} \biggr\}^{\frac{1}{2k+1-2d}}. $$
(7.135)

Example 7.32

Let e t be generated by a FARIMA(0,d,0) process with \(0<d<\frac{1}{2}\). Consider kernel estimation of m with the rectangular kernel for interior points and the corresponding boundary kernels for left and right boundary points. Then j=0, k=2,

$$\beta_{ ( 0,2,1 ) }=\frac{1}{3},\qquad c_{f}=\frac{\sigma_{\varepsilon}^{2}}{2\pi},\qquad V_{ ( 0,2,1 ) } ( d ) =\frac{\sigma_{\varepsilon}^{2}}{\pi}\varGamma(1-2d)\sin(\pi d)\frac{2^{2d-1}}{d(2d+1)},$$

and (with the notation from (7.133))

(7.136)
(7.137)

This is the same expression we obtained in (7.95).

7.4.2.2 Asymptotic Distribution

As mentioned previously in (7.108), local polynomial and kernel estimators can be written as sums of triangular arrays. Investigating the asymptotic behaviour of \(\hat{m}^{(j)}(t)\) amounts to studying a sequence of sums

$$ S_{n}=\sum_{i=1}^{n} \zeta_{i,n}\quad (n\in\mathbb{N})$$
(7.138)

with

$$\zeta_{i,n}=w_{j;b,n} ( i ) e_{i}$$

(1≤in; \(n\in\mathbb{N}\)). The asymptotic distribution of \(\hat {m}^{(j)}(t)\) therefore follows as a corollary of a suitable limit theorem for triangular arrays. For instance, Beran and Feng (2002a) consider the case of a second order stationary residual process

$$e_{i}=\sum_{s=0}^{\infty}a_{s} \varepsilon_{i-s}$$

with square integrable martingale differences ε i and

$$f_{e} ( \lambda ) =\frac{\sigma_{\varepsilon}^{2}}{2\pi}\bigl \vert A \bigl( e^{-i\lambda} \bigr) \bigr \vert ^{2}\sim c_{f}\vert \lambda \vert ^{-2d}\quad(\lambda\rightarrow0) $$

for some d∈(−0.5,0.5). This includes not only second-order stationary linear processes but also nonlinear fractional processes such as FARIMA–GARCH models. Under relatively mild conditions on the marginal distribution of e i , one has a limit theorem

$$\sigma_{n}^{-1}\sum_{i=1}^{n}e_{i} \underset{d}{\rightarrow}Z\sim N ( 0,1 ) , $$

where

$$\sigma_{n}^{2}=\operatorname{var} \Biggl( \sum _{i=1}^{n}e_{i} \Biggr) . $$

This can be extended to sums of arrays ζ i,n =w i,n e i as follows.

Theorem 7.23

Under the conditions stated above (see Beran and Feng 2002a), the following holds. Let (w i,n ) be a triangular array of weights such that \(\sigma_{n,w}^{2}:=\operatorname{var}(\sum_{i=1}^{n}w_{i,n}e_{i})>0\) for all n. If

$$ \max_{1\leq i\leq n}|w_{i,n}|/\sigma_{n,w}\rightarrow0 $$
(7.139)

and

$$ \sup_{i}\Biggl|\sum_{j=1}^{n}w_{j,n}a_{i-j}\Biggr|\Big/ \sigma_{n,w}\rightarrow0, $$
(7.140)

then

$$ \Biggl[\sum_{i=1}^{n}w_{i,n}e_{i}\Biggr]\Big/ \sigma_{n,w}\underset{d}{\rightarrow}Z\sim N(0,1). $$
(7.141)

The detailed proof of Theorem 7.23 can be found in Beran and Feng (2002a). Condition (7.139) means that the weights w i,n are uniformly negligible. Note that, if max|w i,n |=O(1), then \(\sigma_{n,w}^{2}\rightarrow\infty\) as n→∞. Condition (7.140) on the weighted sum \(\sum_{j}w_{j,n}a_{i-j}\) is often related to (7.139). Theorem 4.2 in Müller (1988) on the asymptotic normality of a weighted sum of i.i.d. random variables is a special case of Theorem 7.23. Related results on the asymptotic normality of weighted sums can be found, for instance, in Fuller (1996, Theorem 6.3.4).

Asymptotic normality for local polynomial and kernel estimators is now a corollary of (7.141). As before, we distinguish between interior points t∈(0,1) with c=1, and boundary points t=cb or t=1−cb with c∈[0,1).

Corollary 7.1

Let \(\hat{m}^{(j)}(t)\) (t∈[0,1]) be a local polynomial estimator or a kernel estimator with boundary kernels. Suppose that the conditions of Theorem 7.22 and the conditions on e i in Theorem 7.23 hold. Assume furthermore that the bandwidth is of the optimal order, i.e.

$$b=c_{b}\cdot n^{-\alpha_{\mathrm{opt}}}$$

(for some 0<c b <∞), and let

$$ \mu_{ ( j,k,c ) }=c_{b}^{\frac{1}{2}-d+k}\frac{m^{(k)}(t)\beta_{ ( j,k,c ) }}{k!}. $$
(7.142)

Then, for any \(d\in(-\frac{1}{2},\frac{1}{2})\), we have

$$ (nb)^{\frac{1}{2}-d}b^{j}\bigl[\hat{m}^{(j)}(t)-m^{(j)}(t)\bigr]\underset {d}{\rightarrow}Z_{ ( j,k,c ) }\sim N \bigl( \mu_{ ( j,k,c ) },V_{ ( j,k,c ) } ( d ) \bigr) , $$
(7.143)

where V (j,k,c)(d) and β (j,k,c) are the constants defined in Theorem 7.22.

Note that, as usual in nonparametric regression, using a bandwidth with the optimal rate leads to a non-negligible asymptotic bias after standardization. For statistical inference about m (j)(t), this means that one needs to include an estimate of this bias. The other option is to use a slightly faster rate for the bandwidth so that the bias disappears asymptotically because it is dominated by the variance.
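Ignoring the asymptotic bias (i.e. assuming an undersmoothing bandwidth, the second option mentioned above), Corollary 7.1 suggests an approximate pointwise confidence interval of the form \(\hat{m}(t)\pm z_{1-\alpha/2}(nb)^{d-1/2}\sqrt{V_{(0,2,1)}(d)}\) at interior points. A minimal sketch for j=0, k=2 with the rectangular kernel; the plugged-in values of d and c f are treated as known here, which is an assumption of this illustration (in practice they have to be estimated, see Sect. 7.4.4):

```python
from math import gamma, sin, pi

def pointwise_ci(m_hat, n, b, d, c_f, z=1.96):
    """Approximate pointwise confidence interval for m(t) at an interior point,
    based on Corollary 7.1 with j = 0, k = 2 and the rectangular kernel.
    The asymptotic bias is ignored (undersmoothing bandwidth assumed);
    only d >= 0 is covered here."""
    if d > 0:
        V = c_f * gamma(1 - 2 * d) * sin(pi * d) * 2 ** (2 * d) / (d * (2 * d + 1))
    else:
        V = pi * c_f      # d = 0; the antipersistent case d < 0 needs (7.122)-(7.124)
    se = (n * b) ** (d - 0.5) * V ** 0.5
    return m_hat - z * se, m_hat + z * se

print(pointwise_ci(1.0, n=1000, b=0.1, d=0.3, c_f=0.2))
```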

A further result that is useful for simultaneous confidence bands for the function m(t) has been shown in Csörgő and Mielniczuk (1995a) for the case of long memory. Assuming a spectral density f e (λ)∼c f |λ|−2d or autocovariances γ e (k)∼c γ |k|2d−1 with \(0<d<\frac{1}{2}\), and a second-order kernel estimator \(\hat{m}\), one can show that for interior points 0<t 1<⋯<t l <1 one has asymptotic independence. In other words,

$$ ( nb )^{1/2-d}V_{ ( 0,2,1 ) }^{-\frac{1}{2}} \bigl( \hat{m} ( t_{1} ) -m ( t_{1} ) ,\dots,\hat{m} ( t_{l} ) -m ( t_{l} ) \bigr) \underset{d}{\rightarrow } ( Z_{1}, \dots,Z_{l} ) $$
(7.144)

where Z i are independent standard normal random variables and V (0,2,1) is defined in (7.121). The result is, of course, only valid if the standardized sums of e i are also asymptotically normal. Specifically, Csörgő and Mielniczuk (1995a) consider Gaussian residuals as well as Gaussian subordination. In the latter case, the Hermite rank of the transformation has to be one (see Sect. 4.2.3). The reason why we have asymptotic independence can be seen quite easily. For t≠s, we have

$$\mathit{cov} \bigl( \hat{m} ( t ) ,\hat{m} ( s ) \bigr) \sim c_{\gamma}n^{2d-1}\int_{-1}^{1}\int_{-1}^{1}K(u)K(v)\bigl \vert t-s-b(u-v)\bigr \vert ^{2d-1}\,du\,dv. $$

Up to this point, the evaluation is almost the same as for the variance of \(\hat{m} ( t ) \). However, the crucial difference is that, as b→0, the function g(u,v)=|t−s−b(u−v)| converges to |t−s| uniformly in (u,v)∈[−1,1]². Therefore,

$$\mathit{cov} \bigl( \hat{m} ( t ) ,\hat{m} ( s ) \bigr) \sim c_{\gamma}n^{2d-1}|t-s|^{2d-1}. $$

However, our standardization in (7.144) is (nb)1/2−d so that

$$( nb )^{1-2d}\mathit{cov} \bigl( \hat{m} ( t ) ,\hat{m} ( s ) \bigr) \sim c_{\gamma}b^{1-2d}|t-s|^{2d-1}\rightarrow0. $$

Note finally that all asymptotic considerations above were made under the assumption that f e (λ)∼c f |λ|−2d and γ e (k)∼c γ |k|2d−1. More generally, the same results follow when the constants c f and c γ are replaced by slowly varying functions. Also extensions to Gaussian subordination with non-Gaussian limits can be considered (see Csörgő and Mielniczuk 1995a). Further results can be found, for instance, in Robinson (1997).

7.4.3 Fixed-Design Regression with Heteroskedastic LRD Errors

Suppose we have a slightly more general model with a deterministic equidistant design, namely with a residual process that has a time-varying variance. More specifically, we assume

$$ Y_{i}=m(t_{i})+\sigma(t_{i})e_{i} $$
(7.145)

with σ(⋅) continuous and e i as before. Suppose moreover that, apart from possible heteroskedasticity modelled by σ, the assumptions of Theorem 7.22 hold. Since the bias is not influenced by the autocovariance structure, the asymptotic expression for the bias remains the same. For the variance, the assumption that σ is continuous implies that at point t only σ 2(t) comes in asymptotically. Thus, in the formulas for the asymptotic variance given in Theorem 7.22, we just have to multiply V (j,k,c) by σ 2(t). Formula (7.125) changes to

$$ \mathit{MSE} ( t ) \sim b^{2 ( k-j ) } \biggl( \frac {m^{(k)}(t)\beta_{ ( j,k,c ) }}{k!} \biggr)^{2}+ ( nb )^{2d-1}b^{-2j}\sigma^{2} ( t ) V_{ ( j,k,c ) } ( d ) . $$
(7.146)

All other formulas for b opt and MSE opt, Theorem 7.22, Corollary 7.1, and (7.144) have to be modified accordingly.

7.4.4 Bandwidth Choice for Fixed Design Nonparametric Regression—Part I

Nonparametric regression works well only if an appropriate bandwidth is chosen. Unfortunately, asymptotic expressions for the MSE and IMSE all involve unknown parameters. If we allow d to vary, instead of being fixed at zero, the situation is even worse because a good estimate of d is essential, in particular if d>0 (see, e.g. Figs. 7.6 and 7.7). It is therefore very important to design a reliable data-adaptive method for the case of fractional residuals with unknown correlation structure.

Bandwidth selection in nonparametric regression with uncorrelated errors is well studied. Numerous results on this topic may be found in the literature. Standard bandwidth selection rules include cross-validation (CV; Clark 1975; Bowman 1984), generalized cross-validation (GCV; Craven and Wahba 1979) and the so-called R-Criterion (Rice 1984). Also see Härdle et al. (1988), Marron (1989) and Jones et al. (1996) for related surveys on bandwidth selection rules in the closely related context of nonparametric density estimation. The main drawback of those bandwidth selection rules is that their rate of convergence is just O(n −1/10). Other, more recent, bandwidth selection rules in nonparametric regression have higher rates of convergence. These include, for instance, the iterative plug-in (IPI, Gasser et al. 1991), the direct plug-in (DPI, Ruppert et al. 1995) and the double smoothing approach (DS, Müller 1985; Härdle et al. 1992; Heiler and Feng 1998). Bandwidth selection in nonparametric regression with dependent errors is more difficult because the bandwidth selection and the estimation of the dependence structure depend on each other. This problem is discussed, for instance, in Altman (1990), Hart (1991), Herrmann et al. (1992), Hall et al. (1995a), Ray and Tsay (1997), Opsomer et al. (2001) and Beran and Feng (2002a, 2002b, 2002c). The two main approaches discussed in the long-memory context are bootstrap based cross-validation (Hall et al. 1995b), and the iterative plug-in method (Ray and Tsay 1997; Beran and Feng 2002a, 2002b, 2002c).

Although the case of a fractional residual process is very general, it does have a clear structure due to the asymptotic dominance of the parameters d and c f . An iterative plug-in (IPI) algorithm is therefore a natural approach. The first IPI algorithm in the long-memory context was proposed by Ray and Tsay (1997).

Specifically, consider a second-order kernel estimator of m. Ray and Tsay (1997) propose the following iteration.

  1. 1.

    Estimate an “optimal” bandwidth \(\hat{b}_{\mathrm{opt}}\), assuming only short-range dependent errors, using a standard method such as the procedure in Herrmann et al. (1992).

  2. 2.

    Set \(b_{0}=\hat{b}_{\mathrm{opt}}\).

  3. 3.

    For j≥1 estimate m(t) using b j−1 and let \(\hat{e}_{i}=y_{i}-\hat{m}(t_{i})\). Estimate d and c f using the log-periodogram regression by Geweke and Porter-Hudak (or any other semiparametric method) applied to \(\hat{e}_{i}\).

  4. 4.

    Let \(b_{2,j}=b_{j-1}n^{(1-2\hat{d})/(2(5-2\hat{d}))}\), and estimate m′′ and I(m′′)=∫(m′′(t))2dt using a fourth-order kernel estimator for estimating the second derivative with the bandwidth b 2,j .

  5. 5.

    Improve b j−1 by setting

    $$ b_{j}=\hat{C}_{\mathrm{opt}}n^{(2\hat{d}-1)/(5-2\hat{d})} $$
    (7.147)

    where \(\hat{C}_{\mathrm{opt}}\) is obtained from the current estimates of d, c f , and I(m′′).

  6. 6.

    Increase j by 1 and repeat Steps 3 to 5 until convergence is reached. Finally, at the end of the iteration set \(\hat{b}_{\mathrm{opt}}=b_{j}\).

This algorithm is based on the proposal of Herrmann et al. (1992). The formula \(b_{2,j}=b_{j-1}n^{(1-2\hat{d})/(2(5-2\hat{d}))}\) in Step 4 is called an inflation method. An improved algorithm was proposed in Beran and Feng (2002a, 2002b, 2002c). This is discussed in more detail in Sect. 7.4.6.
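The structure of such an iterative plug-in procedure can be sketched in a few lines of code. The following is a schematic illustration only, not the implementation of Ray and Tsay (1997): it uses the rectangular kernel with renormalised weights instead of boundary kernels, a crude log-periodogram regression for d and c f , a finite-difference estimate of m″ instead of a fourth-order kernel estimator, and restricts d to [0,1/2); all function names are illustrative.

```python
import numpy as np
from math import gamma, sin, pi

def smooth(y, b):
    """Kernel smoother with the rectangular kernel on t_i = i/n (weights
    renormalised near the boundary instead of using boundary kernels)."""
    n = len(y)
    t = np.arange(1, n + 1) / n
    W = (np.abs(t[:, None] - t[None, :]) <= b).astype(float)
    return (W / W.sum(axis=1, keepdims=True)) @ y

def gph(x, m=None):
    """Crude log-periodogram regression for d and c_f (a stand-in for the
    semiparametric estimation in Step 3; the intercept is only a rough proxy)."""
    n = len(x); m = int(n ** 0.5) if m is None else m
    lam = 2 * pi * np.arange(1, m + 1) / n
    I = np.abs(np.fft.fft(x - x.mean())[1:m + 1]) ** 2 / (2 * pi * n)
    A = np.column_stack([np.ones(m), -2 * np.log(lam)])
    coef, *_ = np.linalg.lstsq(A, np.log(I), rcond=None)
    return coef[1], np.exp(coef[0])                       # d_hat, rough c_f_hat

def V_rect(d, c_f):
    """Asymptotic variance constant for the rectangular kernel (d >= 0 only)."""
    if d < 1e-8:
        return pi * c_f
    return c_f * gamma(1 - 2 * d) * sin(pi * d) * 2 ** (2 * d) / (d * (2 * d + 1))

def ipi_bandwidth(y, b0=0.1, n_iter=10):
    """Schematic iterative plug-in in the spirit of Steps 3-5 (k = 2, j = 0,
    beta = 1/3 for the rectangular kernel)."""
    n = len(y); b = b0
    for _ in range(n_iter):
        d, c_f = gph(y - smooth(y, b))                                    # Step 3
        d = min(max(d, 0.0), 0.49)
        b2 = min(b * n ** ((1 - 2 * d) / (2 * (5 - 2 * d))), 0.49)        # Step 4 (inflation)
        m2 = np.gradient(np.gradient(smooth(y, b2))) * n ** 2             # crude m'' estimate
        I2 = max(np.mean(m2[int(n * b2):n - int(n * b2)] ** 2), 1e-8)
        C = ((1 - 2 * d) * V_rect(d, c_f) / ((1 / 3) ** 2 * I2)) ** (1 / (5 - 2 * d))
        b = min(max(C * n ** ((2 * d - 1) / (5 - 2 * d)), 2 / n), 0.49)   # Step 5
    return b, d

rng = np.random.default_rng(0)
t = np.arange(1, 1001) / 1000
y = np.sin(2 * pi * t) + rng.standard_normal(1000)    # i.i.d. errors, for illustration only
print(ipi_bandwidth(y))
```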

7.4.5 The SEMIFAR Model

7.4.5.1 Introduction

As we have seen in this chapter, distinguishing between deterministic trend functions and random stationary fluctuations with long memory can be quite difficult. A further complication is that sometimes it may not even be clear whether the stochastic component of the observed series is stationary. For practical applications, one would therefore like to have a data-driven methodology that is able to identify at least certain standard types of stochastic nonstationarities and distinguish them from stationary dependence (including short and long memory, and antipersistence) or deterministic trend functions. A semiparametric approach along this line, the so-called SEMIFAR (semiparametric autoregressive) models, has been developed in Beran (1999) and Beran and Feng (2001b, 2002a, 2002b). For applications, see, e.g. Beran and Ocker (2001), Beran et al. (2003), Beran (2007b) and Feng et al. (2007). An implementation is available in the S-Plus module S+FinMetrics (see Zivot and Wang 2003).

The idea is to define a semiparametric model that incorporates a nonparametric trend function, parameters that determine whether the detrended series is integrated or stationary, and parameters determining the detailed dependence structure of the underlying stationary process. All parameters are estimated from the data, including an integer valued and a fractional differencing parameter. The SEMIFAR model, originally introduced in Beran (1999), extends the model in Beran (1995) by including a trend function.

7.4.5.2 Definition of the SEMIFAR Model

Assume that m(t) (t∈[0,1]) is a trend function satisfying suitable smoothness conditions, let ε i \((i\in\mathbb{N})\) be a sequence of i.i.d. zero mean random variables with finite variance \(\sigma_{\varepsilon}^{2}=\operatorname{var}(\varepsilon_{i})\), define B j m(t i )=m(t ij ), where t i =i/n is rescaled time, and denote by \(\varphi(z)=1-\sum_{j=1}^{p}\varphi_{j}z^{j}\) a polynomial with all roots outside the unit circle. A SEMIFAR model is defined as follows.

Definition 7.7

A process X i is called a semiparametric fractional autoregressive (or SEMIFAR) model if there exist an integer r∈{0,1} and a d∈(−0.5,0.5) such that

$$ \varphi(B) (1-B)^{d}\bigl\{(1-B)^{r}X_{i}-m(t_{i}) \bigr\}=\varepsilon_{i}. $$
(7.148)

For Y i =(1−B)r X i we are back to the model with a nonparametric trend function and stationary errors generated by a FARIMA(p,d,0) process, namely

$$ Y_{i}=m(t_{i})+e_{i}\quad(i=1,2,\dots,n), $$
(7.149)

where e i =φ −1(B)(1−B)d ε i . We will also use the notation

$$ E_{i}=(1-B)^{d}e_{i}=\sum _{j=0}^{\infty}b_{j}e_{i-j}= \varphi^{-1}(B)\varepsilon_{i} $$
(7.150)

for the autoregressive residuals obtained after filtering out the fractional differencing component. Note, however, that we are assuming r to be unknown, so that taking the appropriate rth difference cannot be applied directly.
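For illustration, a SEMIFAR path can be simulated directly from Definition 7.7 by combining a truncated MA(∞) representation of (1−B)^{−d} with the autoregressive recursion and, for r=1, integration of the resulting series. The following sketch makes several simplifying assumptions (Gaussian innovations, truncation of the MA(∞) filter, a burn-in period, an arbitrary initial value for the integration); the function name simulate_semifar is illustrative.

```python
import numpy as np

def simulate_semifar(n, d, phi=(), r=0, m=None, sigma_eps=1.0, seed=0, burn=500):
    """Simulate X_i from (7.148): Y_i = m(t_i) + e_i with e_i a FARIMA(p,d,0)
    process (e = phi(B)^{-1}(1-B)^{-d} eps) and X = Y for r = 0, X = cumulative
    sum of Y for r = 1. A truncated MA(infinity) representation of (1-B)^{-d}
    is used; 'burn' extra innovations reduce the truncation effect."""
    rng = np.random.default_rng(seed)
    N = n + burn
    eps = sigma_eps * rng.standard_normal(N)
    # coefficients of (1-B)^{-d}: psi_0 = 1, psi_j = psi_{j-1} * (j - 1 + d) / j
    psi = np.ones(N)
    for j in range(1, N):
        psi[j] = psi[j - 1] * (j - 1 + d) / j
    u = np.array([psi[:i + 1][::-1] @ eps[:i + 1] for i in range(N)])  # (1-B)^{-d} eps
    e = u.copy()                          # apply phi(B)^{-1}: e_i = u_i + sum_k phi_k e_{i-k}
    for i in range(N):
        for k, ph in enumerate(phi, start=1):
            if i - k >= 0:
                e[i] += ph * e[i - k]
    e = e[burn:]
    t = np.arange(1, n + 1) / n
    Y = (m(t) if m is not None else 0.0) + e
    return np.cumsum(Y) if r == 1 else Y

# the setting of Example 7.33(a) below: p = 1, d = 0.3, phi_1 = -0.4, r = 1, m = 0
X = simulate_semifar(1000, d=0.3, phi=(-0.4,), r=1)
```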

7.4.5.3 Fitting the SEMIFAR Model

Fitting a SEMIFAR model consists of two main parts: (a) nonparametric estimation of the trend function m(t) and (b) estimation of the parameters \(\sigma_{\varepsilon}^{2}\), r, d, p and φ 1,…,φ p . Since r is an integer and \(d\in(-\frac{1}{2},\frac{1}{2})\), r and d can be summarized by one parameter d total=d+r only. The two differencing parameters can be obtained from d total by r=[d total+0.5] and d=d total−r, where [⋅] denotes the integer part. Parts (a) and (b) of SEMIFAR fitting depend on each other because for (b) we need to have subtracted a good estimate of the trend function, whereas for (a) one would need to know r in the first place, and also have some knowledge of d, \(\sigma_{\varepsilon}^{2}\) and φ 1,…,φ p (and the second derivative of m) to calculate the optimal bandwidth. The method considered in Beran (1999) and Beran and Feng (2002a, 2002b) is an iterative plug-in algorithm. It is related (but not identical) to similar methods in the short-memory context (Gasser et al. 1991; Ruppert et al. 1995) and to the method by Ray and Tsay (1997) introduced in Sect. 7.4.4. Note that, as discussed in Sect. 7.4.4, other methods like cross-validation seem less appropriate. Even in the i.i.d. context, it is well known that cross-validation and related methods (Clark 1975; Bowman 1984; Craven and Wahba 1979) lead to highly volatile bandwidths that converge to the optimal one at the slow rate of \(O(n^{-\frac{1}{10}})\). Methods based on the plug-in principle are known to provide more reliable bandwidth estimates with a smaller variability and much faster convergence to the optimal bandwidth (Gasser et al. 1991; Ruppert et al. 1995; Müller 1985; Härdle et al. 1992; Heiler and Feng 1998). In the context of long memory, the situation is even worse for cross-validation, since the estimate of the IMSE obtained by cross-validation converges to the actual IMSE only under very restrictive conditions. In contrast, the plug-in method (for fixed design) considered here can be shown to provide reasonably reliable estimates of the optimal bandwidth (see results below).

The key ingredient of the plug-in method is the possibility of estimating the unknown parameter vector consistently even though the trend estimate \(\hat {m}(t)\) may not be optimal. More specifically, let \(\vartheta^{0}=(\sigma_{\varepsilon,0}^{2},\theta^{0})=(\sigma_{\varepsilon,0}^{2},d_{\mathrm{total}}^{0},\varphi_{1}^{0},\dots,\varphi_{p^{0}}^{0})\) be the true parameter vector defining the (possibly integrated) fractional ARIMA component. Suppose that \(\hat{m}(x)\) is a kth order kernel regression estimator with a bandwidth \(b=O(n^{-\alpha})\) such that 0<α<1/2. Then it can be shown that, under some regularity conditions and the assumption \(k\alpha+d^{0}>0\) (which always holds for d 0>0), the parameter θ 0 (including the integer differencing parameter r 0) can be estimated consistently. The same is true when the autoregressive order p 0 is chosen by the BIC (Beran et al. 1998) as discussed in Sect. 5.5.6 (provided that p 0 does not exceed the maximal autoregressive order p max used in the selection). Moreover, if \(k\alpha+d^{0}>\frac{1}{4}\), then the approximate MLE defined in Beran (1995) yields a \(\sqrt{n}\)-consistent estimator of θ 0 (for more details, see Beran and Feng 2002a and Feng 2004a, 2004b). Note that this is a specific condition for avoiding too large bandwidths.

7.4.6 Bandwidth Choice for Fixed Design Nonparametric Regression—Part II: Data-Driven SEMIFAR Algorithms

In the following, we present two data-driven algorithms within the SEMIFAR framework. The first algorithm (Algorithm A, AlgA) relies on a full search with respect to d, and was originally proposed in Beran (1999) (also see Beran and Ocker 2001). The second algorithm (Algorithm B, AlgB) was proposed in Beran and Feng (2002b) and runs much faster than Algorithm A because a full search is avoided. As explained below, both methods are superior to the plug-in procedure proposed by Ray and Tsay (1997) in different ways. To simplify the presentation, only local linear estimates of the trend function m will be considered here, and m′′ (needed in the constant of the bias) will be calculated using a local cubic or a fourth-order kernel estimator.

Algorithm A

  1. Step 1:

    Let p max be the maximal order of φ(B) that will be tried, and define a sufficiently fine grid \(G\subset(-0.5,1.5)\setminus\{0.5\}\). First, carry out Steps 2 through 4 for p=p max in order to select the integer differencing order r.

  2. Step 2:

    For each d totalG, set r=[d total+0.5], d=d totalr, and Y i (r)=(1−B)r X i , and carry out Step 3.

  3. Step 3:

    Carry out the following iteration:

    1. Step 3a:

      Let b 0=Δ 0min(n (2d−1)/(5−2d),0.5) (for some fixed Δ 0>0) and set j=1.

    2. Step 3b:

      Calculate \(\hat{m}(t_{i};r)\) using the bandwidth b j−1. Set \(\hat{e}_{i}(r)=Y_{i}(r)-\hat{m}(t_{i};r)\).

    3. Step 3c:

      Set \(\hat{E}_{i,d_{\mathrm{total}}}=\sum_{j=0}^{i-1}b_{j}(d)\hat{e}_{i-j}\) (\(\approx(1-B)^{d}\hat{e}_{i}\)), where \(b_{j}= (-1)^{j}\binom{d}{j}\).

    4. Step 3d:

      Estimate the autoregressive parameters φ 1,…,φ p , from \(\hat{E}_{i,d_{\mathrm{total}}}\) and obtain the estimates \(\hat{\sigma}_{\varepsilon}^{2}=\hat{\sigma}_{\varepsilon}^{2}(d_{\mathrm{total}};j)\) and \(\hat{c}_{f}=\hat{c}_{f}(j)\). Estimation of the parameters can be done, for instance, by using the S-PLUS function ar.burg or arima.mle or an analogous R-function for autoregressive MLE. If p=0, set \(\hat{\sigma}_{\varepsilon}^{2}\) equal to \(n^{-1}\sum\hat{E}_{i,d_{\mathrm{total}}}^{2}\) and \(\hat{c}_{f}\) equal to \(\hat{\sigma}_{\varepsilon}^{2}/(2\pi)\).

    5. Step 3e:

      Set b 2,j =(b j−1)α with α=α 0=(5−2d)/(9−2d), and improve b j−1 by defining

      $$ b_{j}= \biggl( \frac{1-2d}{I^{2}(K)}\,\frac{(1-2d)\hat{V}}{\hat{I}(m^{\prime\prime}(t;b_{2,j}))} \biggr)^{1/(5-2d)}\cdot n^{(2d-1)/(5-2d)} $$
      (7.151)

      where I(K)=∫u 2 K(u) du, \(I(\hat{m}^{\prime\prime}(t;b_{2,j}))\) is an estimate of I(m′′)=∫[m′′(t)]2dt using bandwidth b 2,j and \(\hat{V}\) is an estimate of the constant in the asymptotic variance (see Theorem 7.22).

    6. Step 3f:

      Increase j by one and repeat Steps 3b to 3e until convergence is reached or until a given number of iterations has been carried out. This yields, for each d totalG separately, the ultimate value of \(\hat{\sigma}_{\varepsilon}^{2}(d_{\mathrm{total}})\), as a function of d total.

  4. Step 4:

    Define \(\hat{d}_{\mathrm{total}}\) to be the value of d total for which \(\hat{\sigma}_{\varepsilon}^{2}(d_{\text{total}})\) is minimal, and let \(\hat{r}=[\hat{d}_{\mathrm{total}}+0.5]\).

  5. Step 5:

    For each p=0,1,…,p max, carry out Steps 2 through 4 for \(r=\hat{r}\). Define \(\hat{d}_{\mathrm{total}}\) to be the value of d total for which \(\hat{\sigma}_{\varepsilon}^{2}(d_{\text{total}})\) is minimal. This, together with the corresponding estimates of the AR parameters, yields a value of an information criterion for the given order p, e.g. BIC\((p)=n\log\hat{\sigma}_{\varepsilon}^{2}(p)+p\log n\), as a function of p and the corresponding values of \(\hat{\theta}\) and \(\hat{m}\).

  6. Step 6:

    Select the order p that minimizes the BIC(p). This yields the final estimates of θ 0 and m.

This algorithm differs from Ray and Tsay (1997) mainly in the inflation method and in the estimation of the integer differencing parameter r. The inflation method used here in Step 3e is b 2,j =(b j−1)α with \(\alpha=\alpha_{0}=(5-2\hat{d})/(9-2\hat{d})\). This is also called an exponential inflation method (EIM). Ray and Tsay (1997) use instead a multiplicative inflation method (MIM) of the form b 2,j =b j−1 n β with \(\beta=\beta_{\text{\textrm{v}}}=\frac{1}{2}(1-2\hat{d})/(5-2\hat{d})\). The constants α or β in the two inflation methods are called inflation factors. The asymptotic rate of convergence of \(\hat{b}\) depends on the choice of the inflation factor only, not on the choice of the inflation method. However, an algorithm based on the EIM requires a smaller number of iterations to reach a consistent bandwidth estimate. Commonly used choices of the inflation factors are: (i) α v or β v such that the variance of \(\hat{b}\) is minimized; (ii) α opt or β opt such that the MSE of \(\hat{I}\) is minimized and the rate of convergence of \(\hat{b}\) is optimized; or (iii) α 0 or β 0 such that the MSE of \(\hat{m}^{\prime\prime}\) is minimized. Explicit formulae for these inflation factors may be found in Beran and Feng (2002b). The rate of convergence of \(\hat{b}\) based on α v or β v is the worst of all three choices, namely \(O(n^{(2d^{0}-1)/(5-2d^{0})})\). The rate of convergence of AlgA—which is based on α 0—is of the order \(O(n^{2(2d^{0}-1)/(9-2d^{0})})\) which is slightly faster than for the algorithm in Ray and Tsay (1997). Another advantage of AlgA compared to Ray and Tsay (1997) is the choice of the initial bandwidth. Although it does not affect the rate of convergence of \(\hat{b}\), the initial bandwidth in AlgA is already of the correct optimal order. This reduces the number of required iterations.
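The fractional differencing in Step 3c amounts to applying the truncated filter with coefficients \(b_{j}=(-1)^{j}\binom{d}{j}\) to the residuals. A minimal sketch (the function name frac_diff is illustrative):

```python
import numpy as np

def frac_diff(x, d):
    """Truncated fractional difference (1-B)^d x, as used in Step 3c of Algorithm A:
    b_0 = 1, b_j = b_{j-1} * (j - 1 - d) / j, i.e. b_j = (-1)^j * binom(d, j)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    b = np.ones(n)
    for j in range(1, n):
        b[j] = b[j - 1] * (j - 1 - d) / j
    return np.array([b[:i + 1] @ x[i::-1] for i in range(n)])

rng = np.random.default_rng(0)
e = rng.standard_normal(200)
E = frac_diff(e, 0.3)                                   # approximately (1-B)^0.3 e
print(np.allclose(frac_diff(E, -0.3), e))               # the truncated filters invert each other
```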

Algorithm B

AlgA is straightforward and intuitive. However, the iterative procedure has to be carried out for each trial value dG. This makes the algorithm computationally slow. Beran and Feng (2002b) therefore proposed a much faster algorithm where all parameters, except for p and r, are estimated directly from the residuals by maximizing the likelihood function. In the practical implementation, the S-PLUS function arima.fracdiff or an analogous R-function can be used. The algorithm can essentially be described as follows:

  1. Step 1:

    First, we obtain a bandwidth for estimating r 0:

    1. Step 1a:

      Set r=1. Calculate Y i (r) = (1−B)r X i , and estimate m from Y i (r) using the initial bandwidth b 0 = n −1/3. Calculate the residuals.

    2. Step 1b:

      Set p=p max and assume that the residual process follows a FARIMA(p,d,0) model. Calculate a second initial bandwidth b 1 following, e.g. AlgA or another simple bandwidth selection procedure, but with \(\alpha=\hat{\alpha}_{\mathrm{opt}}=(5-2\hat{d})/(7-2\hat{d})\).

  2. Step 2:

    Estimate r 0:

    1. Step 2a:

      Carry out Steps 1a and 1b with the selected b 1 as new initial bandwidth for r=0 and r=1 separately.

    2. Step 2b:

      Select r following the BIC. Now we obtain an estimate \(\hat{r}\) of r 0.

    3. Step 2c:

      Set \(r=\hat{r}\).

  3. Step 3:

    Further iterations: Carry out further iterations for each p=0,1,…,p max with \(r=\hat{r}\) and a new starting bandwidth \(b_{2}:=\frac{1}{3}n^{-1/3}\) (or b 2:=n −5/7) until convergence is reached or until a prescribed maximal number of iterations has been carried out.

  4. Step 4:

    Select the best AR order p following the BIC and take the parameter estimate corresponding to \(\hat{p}\) as the final estimate.

In this algorithm, r=1 is used at the first iteration as a starting value of r. The initial input of the S-PLUS function arima.fracdiff is therefore always stationary, no matter what the value of r 0 is. The purpose of this step is to obtain a starting bandwidth for estimating r. The estimate of r 0 is then selected in the second iteration and is asymptotically consistent. The use of p=p max avoids the selection of p in the first two steps. Afterwards, \(\hat{r}\) is used as a known parameter. At the beginning, the starting bandwidth b 0=n −1/3 is used. Since (2⋅(−0.5)−1)/(5−2⋅(−0.5))=−1/3, this is the smallest possible order of optimal bandwidths for d in the range (−0.5,0.5). The order of magnitude of b 0 also ensures that, for any r 0∈{0,1}, the bandwidth selected at the end of Step 1 fulfills the basic assumptions on the bandwidth.

AlgB runs much quicker than AlgA. Furthermore, the rate of convergence of \(\hat{b}\) is improved by choosing the inflation factor \(\alpha_{\mathrm{opt}}=(5-2\hat{d})/(7-2\hat{d})\). The resulting rate of convergence of \(\hat{b}\) is now of the order \(O_{p}(n^{2(2{d}^{0}-1)/(7-2{d}^{0})})\), which is the highest known rate for an iterative plug-in bandwidth selector in the current context. More specifically, the following results can be shown (Beran and Feng 2002b).

Proposition 7.1

Let X i be a SEMIFAR process defined by (7.148). Suppose that m(t)∈C 4[0,1] and, as n→∞, nb→∞ and b→0. Denote by b A the optimal asymptotic bandwidth obtained by minimizing the asymptotic formula for the IMSE and let b M be the actually optimal bandwidth that minimizes the exact finite sample IMSE. Then

$$\frac{b_{A}-b_{M}}{b_{M}}=O\bigl(b_{M}^{2}\bigr). $$

For the data driven bandwidths obtained by AlgA and AlgB, respectively, the following asymptotic formulas hold (Beran and Feng 2002b):

Theorem 7.24

Let X i be a SEMIFAR process with autoregressive order p 0, fractional differencing parameter d 0, and integer differencing parameter r 0∈{0,1}. Suppose that m(t)∈C 4[0,1], and denote by \(\hat{b}_{\mathrm{AlgA}}\) and \(\hat{b}_{\mathrm{AlgB}}\) the data driven bandwidths obtained by Algorithms A and B, respectively, with maximal AR-order p max≥p 0. Then

$$\frac{\hat{b}_{\mathrm{AlgA}}-b_{M}}{b_{M}}=O_{p} \bigl( n^{2(2d^{0}-1)/(9-2d^{0})} \bigr) \quad\textit{and}\quad \frac{\hat{b}_{\mathrm{AlgB}}-b_{M}}{b_{M}}=O_{p} \bigl( n^{2(2d^{0}-1)/(7-2d^{0})} \bigr) .$$

For details, see Beran and Feng (2002a, 2002b). The iterative plug-in algorithms can easily be adapted to select bandwidths for estimating derivatives \(\hat {m}^{(j)}\) (j>0). Similar asymptotic results can be obtained for \(\hat{b}\) as in Theorem 7.24.
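The two rates can be compared directly; a minimal sketch (assuming the relative-error formulation used above):

```python
def bandwidth_rate_exponent(d, algorithm="AlgB"):
    """Exponent in (b_hat - b_M)/b_M = O_p(n^exponent) for the two algorithms,
    corresponding to the inflation factors alpha_0 (AlgA) and alpha_opt (AlgB)."""
    denom = (9 if algorithm == "AlgA" else 7) - 2 * d
    return 2 * (2 * d - 1) / denom

for d in (0.0, 0.2, 0.4):
    print(d, bandwidth_rate_exponent(d, "AlgA"), bandwidth_rate_exponent(d, "AlgB"))
```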

Example 7.33

Figure 7.12 shows two simulated SEMIFAR series. In Fig. 7.12(a), the sample path was simulated by an integrated FARIMA process without trend. More specifically, we have n=1000 observations of a FARIMA(p 0,d 0,0) series with p 0=1, \(d_{\mathrm{total}}^{0}=1.3\) (r 0=1, d=0.3) and \(\varphi_{1}^{0}=-0.4\). This is the same as a SEMIFAR model with the same parameters and m(t)≡0. The SEMIFAR fit using Algorithm B is \(\hat{p}=1\), \(\hat{d}_{\mathrm{total}}=1.29\) (hence \(\hat{r}=1\), \(\hat{d}=0.29\)) and \(\hat{\varphi}=-0.43\), with 95 %-confidence intervals [1.23,1.35] and [−0.50,−0.36], respectively. Moreover, no significant trend was found. The series in (b) is a SEMIFAR process with the same parameters for the stochastic part, but including a trend function m(t). The estimated parameters obtained by AlgB are \(\hat{p}=1\), \(\hat{d}_{\mathrm{total}}=1.28\) and \(\hat{\varphi }=-0.37\), with 95 %-confidence intervals [1.22,1.34] and [−0.44,−0.30], respectively. The estimated trend function is significant (at the 5 %-level) and also plotted, together with the true trend function. Note that m(t) is the trend function of the differenced process. Figure 7.12(b) shows, however, the integrated process. In contrast to m, the integrated trend function is not bounded. This explains why the estimated trend in the picture is relatively far from the true trend: errors \(\hat{m}(t_{i})-m(t_{i})\) in the differenced domain have a long-lasting effect in the integrated domain. This reflects the general uncertainty about trends when considering integrated processes.

Fig. 7.12
figure 12

(a) Simulated FARIMA(p 0,d 0,0) series with p 0=1, \(d_{\mathrm{total}}^{0}=1.3\) (r 0=1, d=0.3) and \(\varphi_{1}^{0}=-0.4\). This is the same as a SEMIFAR model with the same parameters and m(t)≡0. (b) SEMIFAR process with the same parameters as in (a), but including a non-constant trend function m(t). The estimated trend (full line) is also plotted together with the true (integrated) trend function (dotted line)

Example 7.34

Figure 7.13(a) shows a volatility series of the DAX between January 3, 2000 and September 12, 2011 as defined in Sect. 1.2. A nonparametric trend function fitted by Algorithm B is also shown. The trend is significant at the 5 %-level. The parameter estimates are \(\hat{p}=2\), \(\hat{d}_{\mathrm{total}}=0.26\) (i.e. \(\hat{r}=0\), \(\hat{d}=0.26\)), \(\hat{\varphi}_{1}=-0.28\), \(\hat{\varphi}_{2}=-0.09\) with 95 %-confidence intervals [0.21,0.30], [−0.33,−0.22] and [−0.14,−0.04], respectively. The corresponding log–log-plot of the periodogram (of the detrended process) together with the fitted spectral density is displayed in (b). The results are confirmed when one looks at weekly aggregates. Figure 7.13(c) shows weekly averages of the original series displayed in (a). The SEMIFAR fit again yields a significant trend which looks very much like the function fitted in (a). As expected (see Sect. 2.2.1), due to temporal aggregation, the log–log-plot of the periodogram (of the detrended series) displayed in (d) is closer to a straight line. Applying Algorithm B indeed yields \(\hat {p}=0\) so that a pure FARIMA(0,d,0) model seems appropriate. (Note that the spectral density of a FARIMA(0,d,0) model is very close to that of fractional Gaussian noise.) The estimated value of d is 0.34 with a 95 %-confidence interval of [0.27,0.40].

Fig. 7.13
figure 13

Volatility series for the DAX between January 3, 2000 and September 12, 2011. (a) Shows daily data together with a nonparametric trend function fitted by Algorithm B. The corresponding log–log-plot of the periodogram together with the fitted spectral density is displayed in (b). (c) and (d) show analogous results, however, for weekly aggregates of the original data

Example 7.35

Figure 7.14(a) shows monthly precipitation anomalies for the Sahel region between January 1900 and December 2011 (data courtesy of Todd Mitchell, The Joint Institute for the Study of the Atmosphere and Ocean at the University of Washington, JISAO; the data source is the National Oceanic and Atmospheric Administration Global Historical Climatology Network (version 2), at the National Climatic Data Center of NOAA; http://www.ncdc.noaa.gov/temp-and-precip/ghcn-gridded-products.php). First, we try to fit a stationary FARIMA(p,d,0) process by selecting the order p using the BIC with p≤p max=16. Figure 7.14(b) displays the periodogram of the data in log–log-coordinates, together with the fitted spectral density. The fit appears to be quite good, and mimics in particular the seasonal peaks. The estimated AR-order is \(\hat{p}=13\). The estimated long-memory parameter is equal to \(\hat{d}=0.35\) with a 95 %-confidence interval of [0.14,0.55]. Note, however, that we used the restriction d<0.5. Now the question is whether the apparent long memory may not rather be caused by a deterministic trend function or an integrated process (i.e. d total>0.5). We therefore fit a SEMIFAR process using AlgB and the BIC with p≤p max=16. The fitted trend function indeed turns out to be significantly different from a constant (see (c), with horizontal lines marking the critical limits). As suspected, the trend indicates a decline in precipitation starting around 1960. Subtracting the trend function seems to have removed long memory, since for the residuals we obtain a 95 %-confidence interval for d of [−0.28,0.18] (and \(\hat{p}=12\)). The corresponding log–log-periodogram and fitted spectral density of the detrended data are shown in (d). Note also that the possibility of an integrated process (d total>0.5, r=[d total+0.5]) was excluded by the estimation procedure. A more detailed analysis can be obtained by separating the rainy season (June to October) from the rest of the year. Figure 7.14(e) shows the Sahel rainfall index with each year being represented by measurements from the rainy season only (i.e. we have June to October only for each year). The fitted trend function is very similar to the one in Fig. 7.14(c), and significant. Also as before, the estimated value of d is no longer significant, with a 95 %-confidence interval of [−0.20,0.13] (see (f) for the log–log-periodogram and spectral density). Note also that the selected autoregressive order of \(\hat{p}=3\) is much smaller than before because of the different (stochastic) periodicity. Finally, Fig. 7.14(g)–(h) show the results for the other months. This time the trend function is not quite significant at the 5 %-level. However, it is close to the critical limits and clearly monotonically decreasing. In contrast to the rainy season, \(\hat{d}=0.09\) with a 95 %-confidence interval of [0.03,0.15] indicates the possibility of slight long-range dependence in the residuals. Moreover, there does not appear to be any periodicity left (see Fig. 7.14(h)), and accordingly we have \(\hat{p}=0\). In summary, we may say that there is relatively clear evidence for a decline in precipitation in the Sahel zone starting around 1960. The alternative models of an integrated process or of stationarity with long memory can probably be excluded.

Fig. 7.14
figure 14

Monthly precipitation anomalies for the Sahel region between January 1900 and December 2011 (data courtesy of Todd Mitchell, JISAO, University of Washington; http://www.ncdc.noaa.gov/temp-and-precip/ghcn-gridded-products.php): (a) original series; (b) log–log-periodogram and spectral density obtained by stationary fit; (c) data with fitted SEMIFAR trend (and critical limits); (d) log–log-periodogram and spectral density after SEMIFAR fit; (e) series with rainy seasons only; (f) log–log-periodogram and spectral density after SEMIFAR fit for data in (e); (g) series excluding rainy seasons; (h) log–log-periodogram and spectral density after SEMIFAR fit for data in (g)

7.4.7 Trend Estimation from Replicates

Suppose that we have N time series Y j (i) where j=1,2,…,N denotes a replicate, i=1,2,…,n denotes time and the problem is estimation of the common trend m(⋅) in the nonparametric regression model

$$y_{j}(i)=m(t_{i})+e_{j}(i)\quad \biggl(t_{i}=\frac{i}{n}\biggr) $$

by smoothing the average series \(\bar{y}(i)=N^{-1}\sum_{j=1}^{N}y_{j}(i)\). The function m(t) (t∈(0,1)) is assumed to be smooth whereas e j (i) are random error terms that are stationary zero mean processes within each replicate but independent between replicates. In other words, cov(e j (i),e l (i+k)) is zero if j≠l and equals γ j (k) otherwise, where γ j is a covariance function.

Specifically, we make the following assumptions on the jth error series e j (i):

  • (A1) Mean: E[e j (i)]=0;

  • (A2) Spectral density: \(\lim_{\lambda\rightarrow0} [ f_{j}(\lambda)/ \{ D_{j}|\lambda|^{-2d_{j}} \} ] =1\) where D j >0, 0<d j <1/2 and the convergence is uniform;

  • (A3) Covariances: \(\mathit{cov} ( e_{j}(i),e_{j}(i+k) ) =\gamma_{j}(k)\sim C_{j}|k|^{2d_{j}-1}\) as |k|→∞, d j ≠0, C j >0, where C j =sin(πd j )Γ(1−2d j )D j /(1+2d j ).

Consider the Priestley–Chao estimate of m(t),

$$\hat{m}(t)=\frac{1}{nb}\sum_{i=1}^{n}K \biggl( \frac{t_{i}-t}{b} \biggr) \bar{y}(i), $$

where the kernel K is a symmetric probability density function on (−1,1) and b is a bandwidth such that

$$b\rightarrow0\quad \text{and}\quad nb^{3}\rightarrow\infty\quad \text{as}\ n \rightarrow \infty. $$

The uniform kernel \(K(u)=\frac{1}{2}1\{|u|\leq1\}\) is an example of such a kernel which we use in this section, but the arguments also hold for other kernels.

Clearly, the precision of such an estimator will depend on n as well as on N. Two different cases are of interest: (i) N is fixed and finite and (ii) N→∞.

  1. Case (i)

    N is fixed and finite. As we shall see, in this case the mean squared error of the estimated trend function will be dominated by the largest fractional differencing parameter.

Theorem 7.25

Let N be fixed and finite. Then, as n→∞, the asymptotic expression of the bias of \(\widehat{m}(t)\) for t∈(0,1) is

$$E \bigl[ \widehat{m}(t) \bigr] -m(t)={\frac{b^{2}}{2}}m^{\prime\prime}(t)\int_{-1}^{1}u^{2}K(u)\,du+o \bigl(b^{2}\bigr). $$

Proof

Since \(E [ \bar{y}(i) ] =m(t_{i})\), the proof follows, as we have seen before in previous sections, from a two-term Taylor series expansion of m(t i ) around t and in particular by noting that as n→∞,

$$\Biggl \vert \frac{1}{nb}\sum_{j=1}^{n} \biggl( \frac{t_{j}-t}{b} \biggr)^{p}K \biggl( \frac{t_{j}-t}{b} \biggr) -\int_{-1}^{1}u^{p}K(u)\,du\Biggr \vert =O \biggl( \frac{1}{nb} \biggr) $$

where p is a positive integer. To simplify further, the term O((nb)−1) can be absorbed into o(b 2) since nb 3→∞. □

As an example, when K is the uniform kernel on (−1,1), since \(\int_{-1}^{1}u^{2}K(u)\,du =1/3\) the asymptotic expression of the bias of \(\widehat{m}(t)\) is

$$E \bigl[ \widehat{m}(t) \bigr] -m(t)={\frac{b^{2}}{6}}m^{(2)}(t)+o \bigl(b^{2}\bigr) $$

and for η∈(0,1/2), as n→∞, the integrated squared bias of \(\widehat{m}\) is:

$$\int_{\eta}^{1-\eta} \bigl\{ E \bigl[ \widehat{m}(t) \bigr] -m(t) \bigr\}^{2}\,dt={\frac{b^{4}}{36}}\int _{\eta}^{1-\eta} \bigl\{ m^{(2)}(t) \bigr \}^{2}\,dt+o\bigl(b^{4}\bigr). $$

As for the covariances, note that when d=max{d 1,…,d N }, N is fixed and finite and \(\bar{e}(i)=N^{-1}\sum_{j=1}^{N}e_{j}(i)\), by (A2) and (A3),

$$\mathit{cov} \bigl( \bar{e}(i),\bar{e}(i+k) \bigr) =\gamma_{\bar{e}}(k)={ \frac {1}{N^{2}}}\sum_{j=1}^{N} \gamma_{j}(k)\sim{\frac{1}{N^{2}}}C_{d,N}|k|^{2d-1}\quad (\text{as}\ |k|\rightarrow\infty) $$

where

$$C_{d,N}=\sum_{j:d_{j}=d}C_{j}. $$

Similarly, the spectral density is

$$f_{\bar{e}}(\lambda)={\frac{1}{2\pi}}\sum _{k=-\infty}^{\infty}\gamma_{\bar{e}}(k)e^{-ik\lambda}={ \frac{1}{N^{2}}}\sum_{j=1}^{N}f_{j}( \lambda)\sim{\frac {1}{N^{2}}}D_{d,N}|\lambda|^{-2d}\quad (\text{as}\ \lambda\rightarrow0) $$

where

$$D_{d,N}=\sum_{j:d_{j}=d}D_{j}. $$

These facts can be summarized as follows:

Lemma 7.2

Let d=max{d 1,…,d N }, and let N be fixed and finite. Then the largest fractional differencing parameter d is also the fractional differencing parameter for the sample mean process \(\bar{e}(i)\) (i=1,2,…).

Theorem 7.26

Let N be fixed and finite. Let \(K(u)={\frac{1}{2}}1\{-1\leq u\leq1\}\), d=max{d 1,…,d N } and

$$\beta(d,N)=\frac{{2^{2d-1}}}{d(2d+1)}C_{d,N}. $$

Then for η∈(0,1/2) and as n→∞, the integrated variance of \(\widehat{m}\) is

$$\int_{\eta}^{1-\eta}\operatorname{Var} \bigl[ \widehat{m}(t) \bigr] \,dt={\frac{1}{N^{2}}}(1-2\eta) (nb)^{2d-1}\beta(d,N)+o \bigl( (nb)^{2d-1} \bigr). $$

Proof

For every fixed t∈(0,1),

where the last expression is obtained by substituting \(r_{1}=r-n(t-b)+1\) and \(s_{1}=s-n(t-b)+1\). Thus, we get

where

We have d j ∈(0,1/2) so that 2d j −1∈(−1,0) and

$$\lim_{nb\rightarrow\infty}\sum_{k=-2nb}^{2nb} \gamma_{j}(k)=\gamma_{j}(0)+2C_{j} \lim_{nb\rightarrow\infty}\sum_{k=1}^{2nb}|k|^{2d_{j}-1}=\infty. $$

Also as nb→∞,

$$\Biggl \vert \sum_{u=1}^{2nb}|u|^{2d_{j}-1}-(2nb)^{2d_{j}} \int_{0}^{1}x^{2d_{j}-1}\,dx \Biggr \vert =O \bigl( (nb)^{2d_{j}-1} \bigr) . $$

Simplifying, and since \((nb)^{2d_{j}-2}=o((nb)^{2d_{j}-1})\),

$$V_{n,j}^{(1)}=\frac{C_{j}}{d_{j}}(2nb)^{2d_{j}-1}+o \bigl( (nb)^{2d_{j}-1} \bigr) $$

and clearly \(V_{n,j}^{(2)}=o(V_{n,j}^{(1)})\). As for \(V_{n,j}^{(3)}\), \(|k|\gamma_{j}(k)\sim C_{j}|k|^{2d_{j}}\) as |k|→∞, so that

$$V_{n,j}^{(3)}=\frac{2C_{j}}{2d_{j}+1}(2nb)^{2d_{j}-1}+o \bigl( (nb)^{2d_{j}-1} \bigr). $$

The theorem follows by noting that \(V_{n,j}^{(1)}-V_{n,j} ^{(3)}=(2nb)^{2d_{j}-1}C_{j}/ ( d_{j}(2d_{j}+1) ) +o ( (nb)^{2d_{j}-1} ) \) and, as n→∞, the sum \(\sum_{j=1}^{N}\{V_{n,j}^{(1)}-V_{n,j}^{(3)}\}\) will be dominated by a multiple of (nb)2d−1 where d is the largest fractional differencing parameter. □

Corollary 7.2

Let \(K(u)=\frac{1}{2}1\{-1<u<1\}\) and, as n→∞, b→0 and nb 3→∞. If N is fixed and finite and d j (j=1,2,…,N) are fractional differencing parameters with \(d=\max\{d_{1},\ldots,d_{N}\},\ 0<d_{j}<\frac{1}{2}\), then for η∈(0,1/2), the asymptotic expression for the integrated mean squared error for \(\widehat{m}\) is (as n→∞)

$$\mathit{IMSE} ( \widehat{m} ) ={\frac{b^{4}}{36}}\int_{\eta}^{1-\eta} \bigl\{ m^{(2)}(t) \bigr\}^{2}\,dt+{\frac{1}{N^{2}}}(1-2\eta) (nb)^{2d-1}\beta(d,N)+o\bigl(b^{4}\bigr)+o \bigl( (nb)^{2d-1} \bigr) $$

and the global optimum bandwidth minimising \(\mathit{IMSE} ( \widehat {m} ) \) is

$$b_{\mathrm{opt}}= \biggl[ {\frac{9(1-2\eta)(1-2d)\beta(d,N)}{\int_{\eta}^{1-\eta }\{m^{(2)}(t)\}^{2}\,dt}} \biggr]^{1/(5-2d)}\times n^{(2d-1)/(5-2d)}N^{-2/(5-2d)}$$

where β(d,N) is defined in Theorem 7.26.

Substituting b opt in the leading term of \(\mathit{IMSE} ( \widehat{m} ) \) the optimum rate of convergence can be obtained as O(n (8d−4)/(5−2d) N −8/(5−2d)). Note that when d→0 (i.e. the process approaches short-memory or independence) and N=1, the familiar rate n −4/5 for the integrated mean squared error for estimation of the trend function can be confirmed. As usual, the rate of convergence under long memory (d>0) is slower than under independence (d=0). Compare also with (7.97) which corresponds to the case N=1.
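The bandwidth formula of Corollary 7.2 is straightforward to evaluate once d, C d,N and the integral of (m″)² are given (in practice these quantities have to be estimated). A minimal sketch with illustrative input values:

```python
def beta_dN(d, C_dN):
    """beta(d, N) from Theorem 7.26."""
    return 2 ** (2 * d - 1) / (d * (2 * d + 1)) * C_dN

def b_opt_replicates(n, N, d, C_dN, I2, eta=0.05):
    """Global optimal bandwidth of Corollary 7.2; I2 = int_eta^{1-eta} (m'')^2 dt."""
    const = 9 * (1 - 2 * eta) * (1 - 2 * d) * beta_dN(d, C_dN) / I2
    return const ** (1 / (5 - 2 * d)) * n ** ((2 * d - 1) / (5 - 2 * d)) * N ** (-2 / (5 - 2 * d))

# e.g. n = 1000 observations, N = 5 replicates, d = 0.3 (all values purely illustrative)
print(b_opt_replicates(1000, 5, d=0.3, C_dN=1.0, I2=10.0))
```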

  1. Case (ii)

    In this case, infinitely many replicates are available asymptotically.

Theorem 7.27

We assume that \(\lim_{N\rightarrow\infty}N^{-1}\sum_{j=1}^{N}f_{j}(\lambda)=f(\lambda)\) uniformly in λ∈(0,π) with f(λ)∼L(λ)|λ|−2d, 0<d<1/2, where L is slowly varying at zero in the sense of Zygmund. Let \(\gamma(k)=(2\pi)^{-1}\int_{-\pi}^{\pi}f(\lambda)e^{ik\lambda}\,d\lambda\sim L(1/|k|)|k|^{2d-1}\) (|k|→∞). Then for η∈(0,1/2), the asymptotic expression for the integrated mean squared error of \(\widehat{m}\) (as N→∞, n→∞) is

$$ \mathit{IMSE} ( \widehat{m} ) ={\frac{b^{4}}{36}}\int_{\eta}^{1-\eta} \bigl\{ m^{(2)}(t) \bigr\}^{2}\,dt+{\frac{1-2\eta}{N}}(nb)^{2d-1}\frac{2^{2d-1}}{d(2d+1)}L\bigl(1/(nb)\bigr)+o\bigl(b^{4}\bigr)+o \bigl( N^{-1}(nb)^{2d-1} \bigr) . $$
(7.152)

Proof

The expression for the bias term follows as in Theorem 7.25. As for the variance, note first that the index j disappears due to the convergence of the average \(N^{-1}\sum_{j=1}^{N}\gamma_{j}(k)\), which appears in \(\operatorname{var} ( \widehat{m}(t) ) \), to the limit γ(k), which decays hyperbolically as assumed above. The rest of the proof follows from arguments similar to those for Theorem 7.26. □

Corollary 7.3

Under the conditions of Theorem 7.27, the global optimum bandwidth minimizing \(\mathit{IMSE} ( \widehat{m} ) \) is

$$\begin{aligned} b_{\mathrm{opt}}&= \biggl[ {\frac{9(1-2\eta)(1-2d)2^{2d-1}L(1/(nb))}{d(2d+1)\int _{\eta}^{1-\eta}\bigl\{m^{(2)}(t)\bigr \}^{2}\,dt}} \biggr]^{1/(5-2d)}\\ &\quad {}\times n^{(2d-1)/(5-2d)}N^{-1/(5-2d)}\end{aligned}$$

where the slowly-varying function L is defined in Theorem 7.27.

Remark

By assumption, the spectral density f j (λ) of the jth error process e j behaves at zero like a constant D j times \(|\lambda|^{-2d_{j}}\). In the theorem above, however, we assume the average spectral density to be a product of a slowly varying function L and |λ|−2d where 0<d<1/2. In particular, L need not be a constant. An insight into this may be gained, for instance, by considering the case of i.i.d. random fractional differencing parameters having a moment generating function M where M(−2log|u|)=L(u)|u|−2d; an example is the uniform distribution; see Ghosh (2001). In this case, the expected value of the spectral density function is directly proportional to L(λ)×|λ|−2θ where 1/2>θ>0, and L(u)∝1/log(|u|).

7.4.8 Random-Design Regression Under LRD

In this section, our goal is to estimate the conditional mean function m(x)=E(Y i |X i =x) in a random-design model with residuals exhibiting long-range dependence and a variance that may depend on X i . Thus, we have

$$ Y_{i}=m(X_{i})+\sigma(X_{i})e_{i} $$
(7.153)

where now X i is a stationary process with marginal density p X , e i is a stationary zero mean process with long memory and σ is a continuous function of X i . Since the design is random, we consider the Nadaraya–Watson estimator (7.104), i.e.

$$ \widehat{m}_{\mathrm{NW}}(x)=\frac{\widehat{m}_{\mathrm{PC}}(x)}{\hat{p}_{X}(x)}=\frac {(nb)^{-1}\sum_{i=1}^{n}K ( \frac{X_{i}-x}{b} ) Y_{i}}{\hat{p}_{X}(x)}$$
(7.154)

where

$$ \hat{p}_{X}(x)=\frac{1}{nb}\sum_{i=1}^{n}K \biggl( \frac{X_{i}-x}{b} \biggr) $$
(7.155)

is a kernel density estimator of p X .
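The following minimal Python sketch illustrates (7.154) and (7.155); the Epanechnikov kernel and all function names are illustrative choices of ours and are not prescribed by the references cited below.

```python
import numpy as np

def epanechnikov(u):
    # Epanechnikov kernel; any symmetric density supported on [-1, 1] would do.
    return 0.75 * (1.0 - u**2) * (np.abs(u) <= 1.0)

def nw_estimate(x0, X, Y, b, kernel=epanechnikov):
    """Nadaraya-Watson estimator (7.154) at a point x0.

    X, Y : arrays of length n (random design; Y may have LRD errors)
    b    : bandwidth
    Returns the estimate of m(x0) and the kernel density estimate (7.155).
    """
    n = len(X)
    w = kernel((X - x0) / b)
    p_hat = w.sum() / (n * b)        # kernel density estimator of p_X at x0
    m_pc = (w * Y).sum() / (n * b)   # Priestley-Chao type numerator
    return m_pc / p_hat, p_hat

# toy example (i.i.d. design; the error structure is irrelevant for the formula)
rng = np.random.default_rng(0)
X = rng.normal(size=500)
Y = np.sin(X) + 0.3 * rng.normal(size=500)
m_hat, p_hat = nw_estimate(0.0, X, Y, b=0.3)
```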

We can summarize the limiting behaviour of \(\widehat{m}_{\mathrm{NW}}\) in the following theorem. This theorem summarizes results obtained under different sets of assumptions and using different techniques in papers like Cheng and Robinson (1994), Csörgő and Mielniczuk (1999, 2000), Mielniczuk and Wu (2004), Zhao and Wu (2008) and Kulik and Lorek (2011).

Theorem 7.28

Suppose that m and σ are twice continuously differentiable in a neighbourhood of x 0. Then the following holds:

  • Suppose that X_i are i.i.d. and \(e_{i}=\sum_{j=0}^{\infty}a_{j}\varepsilon_{i-j}\) is a linear process with i.i.d. zero mean innovations ε_i, \(\sigma_{\varepsilon}^{2}=\operatorname{var}(\varepsilon_{i})<\infty\) and \(a_{j}\sim c_{a}j^{d_{e}-1}\) for some \(0<d_{e}<\frac{1}{2}\). Then, for a sequence of bandwidths

    $$b=o\bigl(n^{-2d_{e}}\bigr) $$

    we have

    $$ \sqrt{nb}\sqrt{\hat{p}_{X}(x_{0})} \bigl\{ \hat{m}(x_{0})-E \bigl[ \hat {m}(x_{0}) \bigr] \bigr\} \overset{\mathrm{d}}{\rightarrow}Z\sqrt {\sigma^{2}(x_{0})p(x_{0}) \int K^{2}(u)\,du} $$
    (7.156)

    where Z is a standard normal random variable.

  • Under the same assumptions, but with

    $$b\gg n^{-2d_{e}},$$

    we have

    $$ n^{\frac{1}{2}-d_{e}}c_{e}^{-\frac{1}{2}} \bigl\{ \hat{m}(x_{0})-E \bigl[ \hat{m}(x_{0}) \bigr] \bigr\} \overset{\mathrm{d}}{ \rightarrow}\sigma (x_{0})Z $$
    (7.157)

    where \(c_{e}=c_{f_{e}}\nu(d_{e})\) is the constant in \(\operatorname{var}(\sum_{i=1}^{n}e_{i})\sim c_{e}n^{2d_{e}+1}\).

  • Suppose that \(X_{i}=\sum_{j=0}^{\infty}a_{j,X}\xi_{i-j}\) is a zero mean Gaussian process with long-range dependence such that γ_X(k)∼c_γ|k|^{2d−1} (\(0<d<\frac{1}{2}\)). Then, keeping the other conditions as above, the same results follow for \(b=o(n^{-2d_{e}})\) and \(b\gg n^{-2d_{e}}\), respectively.

Proof

We write

It can be shown that the first term is o_p((nb)^{−1/2}) and is hence asymptotically negligible. The second term has the structure \(R_{n}:=n^{-1}\sum_{i=1}^{n}\nu_{n}(X_{i})e_{i}\) (cf. (7.60)), where

$$\nu_{n}(X_{i})=b^{-1}K \biggl( \frac{x_{0}-X_{i}}{b} \biggr) \sigma (X_{i})=b^{-1}K \biggl( \frac{X_{i}-x_{0}}{b} \biggr) \sigma(X_{i}). $$

Note that

$$ \begin{aligned}[b] E \bigl[ \nu_{n}(X_{1}) \bigr] &=b^{-1}\int K \biggl( \frac{x_{0}-u}{b} \biggr) \sigma(u)p_{X}(u)\,du \\ &=\int K(u) \sigma(x_{0}-ub)p_{X}(x_{0}-ub)\,du\neq 0. \end{aligned} $$
(7.158)

Since σ and p X are assumed to be twice continuously differentiable in a neighbourhood of x 0, with bounded second derivatives, we have

$$ E \bigl[ \nu_{n}(X_{1}) \bigr] \sim\sigma(x_{0})p_{X}(x_{0}), \qquad {\operatorname{var}} \bigl( \nu_{n}(X_{1}) \bigr) \sim b^{-1}\sigma^{2}(x_{0})p_{X}(x_{0})\int K^{2}(u)\,du. $$
(7.159)

Thus, we can apply techniques from Sect. 7.2.3:

  • If e i are i.i.d., then R n is a martingale. An application of a martingale central limit theorem (Lemma 4.2) yields

    $$\sqrt{nb}\frac{1}{nb}\sum_{i=1}^{n}K \biggl( \frac{x_{0}-X_{i}}{b} \biggr) \sigma(X_{i})e_{i} \overset{\mathrm{d}}{\rightarrow}\sigma(x_{0})Z\sqrt {p_{X}(x_{0})\int K^{2}(u)\,du}. $$
  • If e i is a linear long-memory process and X i are i.i.d., then we apply the (M/L)-decomposition

    The second part is a martingale and again an application of the martingale CLT yields

    $$ \sqrt{nb}R_{n,2}\overset{\mathrm{d}}{\rightarrow}Z\sqrt{ \sigma^{2}(x_{0})p_{X}(x_{0}) \int K^{2}(u)\,du}. $$
    (7.160)

    For the first part, we have, recalling (7.48) and (7.159),

    $$ n^{\frac{1}{2}-d_{e}}c_{e}^{-\frac{1}{2}}R_{n,1}\overset{ \mathrm{d}}{\rightarrow}\sigma(x_{0})p_{X}(x_{0})Z. $$
    (7.161)
  • If both X_i and e_i are linear processes with long memory, then we proceed in exactly the same way as in the case of parametric linear regression. The direct application of the Hermite polynomial decomposition does not lead to the weakly dependent behaviour in (7.156). However, conditioning on ξ_i,ξ_{i−1},…, we start with an (M/L)-decomposition

    (7.162)

    where p ξ (⋅) is the density of ξ i and \(\hat{X}_{i}=X_{i}-\xi_{i}\) is the one-step forecast of X i given ξ s (si−1). Now, \(\tilde{R}_{n,2}\) is a martingale and its limiting properties are described by (7.160). For \(\tilde{R}_{n,1}\) we apply the Hermite polynomial decomposition (7.62) with

    $$\tilde{\nu}_{n}(z)=\int K \biggl( \frac{x_{0}-(u+z)}{b} \biggr) \sigma (u+z)p_{\xi}(u)\,du. $$

    Let \(p_{\hat{X}}\) be the density of \(\hat{X}_{i}\). Note that p X is the convolution of \(p_{\hat{X}}\) and p ξ , i.e. \(p_{X}=p_{\hat{X}}\ast p_{\xi}\). Then

    Thus, using the same argument as for parametric regression, we are able to conclude that (7.161) holds for \(\tilde{R}_{n,1}\). The result then follows by comparing the term R_{n,1} with R_{n,2}, and \(\tilde{R}_{n,1}\) with \(\tilde{R}_{n,2}\), respectively, and noting that \(\hat{p}_{X}\) is a consistent estimator of p_X (see Sect. 5.14).  □

The theorem is remarkable in several ways. First of all, it reveals a dichotomy between small and large bandwidths. This is the same phenomenon as observed already for density estimation (see Sect. 5.14). For small bandwidths \(b=cn^{-\alpha}=o(n^{-2d_{e}})\), the long-range dependence in the residuals has no influence, and one obtains exactly the same asymptotic distribution as for i.i.d. data. The optimal bandwidth is then of the form \(b=cn^{-\frac{1}{5}}\), and the optimal MSE has the order \(O(n^{-\frac{4}{5}})\). This is in contrast to fixed-design kernel estimation. On the other hand, this behaviour is not unexpected in view of similar results for random design linear regression (Sect. 7.2) and kernel density estimation (Sect. 5.14). For large bandwidths \(b\gg n^{-2d_{e}}\), the contribution of the bias is proportional to \(n^{-4\alpha}\gg n^{-8d_{e}}\) whereas the variance is proportional to \(n^{-(1-2d_{e})}\). Since 1−2d_e<8d_e is equivalent to d_e>0.1, the first conclusion is that the optimal MSE is of the order \(n^{-\frac{4}{5}}\) (with \(b_{\mathrm{opt}}=cn^{-\frac{1}{5}}\)) only if d_e<0.1. For d_e>0.1, the optimal order is \(n^{-(1-2d_{e})}\), which is achieved as long as the variance dominates the bias. This is the case for a whole range of bandwidths b=cn^{−α} with 1−2d_e<4α<8d_e. These general results are the same as for density estimation. We therefore do not repeat the same comments and refer the reader to Sect. 5.14. The second remarkable aspect of Theorem 7.28 is that long memory in the explanatory process X_i does not influence the asymptotic behaviour.

The results can be generalized to multivariate time series. In the context of (7.160), the limit is multivariate normal with independent components; in the context of (7.161), the limit is multivariate normal with perfectly correlated components. Furthermore, one can also obtain analogous results for multivariate predictors.

The main conclusion is that for d e >0.1, the MSE is dominated by the variance as long as the bandwidth is not too large but of a larger order than \(n^{-2d_{e}}\). An exact choice of b is not needed to achieve the optimal rate of \(n^{-(1-2d_{e})}\). However, as for density estimation, a higher-order expansion of the MSE can be used to derive a criterion for an optimal bandwidth—even though it may not have an influence asymptotically. Considering a weighted integrated mean squared error

$$\mathit{IMSE}(\hat{m},m;w)=\int E \bigl[ \bigl( \hat{m}(x)-m(x) \bigr)^{2} \bigr] w(x)\,dx, $$

Kulik and Lorek (2011) obtained the following formula.

Proposition 7.2

Under the assumptions of the third part of Theorem 7.28 (i.e. when both e i and X i have long memory), we have

(7.163)

where κ_1=∫K^2(u) du, κ_2=∫u^2K(u) du, and

$$\psi_{e}(x)=\sigma(x)\frac{(\sigma(x)p_{X}(x))^{\prime\prime}}{p_{X}(x)}. $$

Of course, the weight function w must be chosen in such a way that the integrals are finite. For example, if σ(x)≡1 and p_X is the standard normal density, then

$$\int\frac{\sigma^{2}(x)}{p_{X}(x)}w(x)\,dx=\int\frac{w(x)}{p_{X}(x)}\,dx $$

would be infinite if we chose w(x)≡1, whereas this is not the case, for instance, for \(w(x)=p_{X}^{2}(x)\).

The first term in (7.163) is due to the bias, the second one describes i.i.d.-type behaviour. The term involving d_e describes a possible contribution of long memory. Note that we have to include the term \(b^{2}n^{2d_{e}-1}c_{e}\) to obtain a criterion for bandwidth selection that can also be used for d_e>0.1. For d_e>0.1 this term does not have an influence on the optimal behaviour of the MISE, but it improves the higher-order term in the expansion. Optimizing the higher-order expansion with respect to b yields

$$b_{\mathrm{opt}}\sim \left \{ \begin{array} {l@{\quad}l}Cn^{-\frac{1}{5}} & \text{if}\ d_{e}<0.3,\\ Cn^{-\frac{2}{3}\,d_{e}} & \text{if}\ d_{e}>0.3. \end{array} \right . $$

The optimal \(\mathit{IMSE}(\hat{m},m;w)\) with b_opt is then proportional to n^{−4/5} if d_e<1/10, and to \(n^{2d_{e}-1}c_{e}(n)\) if d_e>1/10. However, as discussed above (also see Sect. 5.14), for d_e>1/10 the optimal order can be achieved even if b is not exactly of the order \(O(n^{-\frac{2}{3}\,d_{e}})\).
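For illustration only, the piecewise bandwidth rule displayed above can be encoded as follows; the leading constant C is left unspecified, since it comes from the higher-order expansion of the IMSE.

```python
def b_opt_order(n, d_e, C=1.0):
    # Order of the IMSE-optimal bandwidth as a function of the memory
    # parameter d_e of the errors; C is a placeholder constant.
    if d_e < 0.3:
        return C * n ** (-1.0 / 5.0)
    return C * n ** (-(2.0 / 3.0) * d_e)
```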

The optimal bandwidth depends on unknown parameters. Moreover, for d e >0.1 data driven bandwidth choice is not quite trivial because b opt is based on a higher order expansion of the IMSE. Given an observed series where we may not know much about the underlying process, it seems quite difficult to estimate the IMSE with sufficient accuracy to assess the contribution of higher-order terms. For instance, cross-validation turns out to be applicable for d e <0.1 only (for a precise statement, see Kulik and Lorek 2011).

An improved result can be obtained if one is interested in the shape of the function m(x) only. This means that the aim is to estimate

$$m^{\ast}(x)=E[Y|X=x]-E[Y]=m(x)-\int m(x)p_{X}(x)\,dx. $$

The natural estimator is given by

$$ \hat{m}^{\ast}(x)=\widehat{m}_{\mathrm{NW}}(x)-\bar{y} $$
(7.164)

where \(\bar{y}=n^{-1}\sum Y_{i}\). In contrast to Proposition 7.2, the mean squared error is now influenced by the dependence structure of X i (Kulik and Lorek 2011) whereas the long-memory property of e i disappears:

Theorem 7.29

Suppose that m is twice continuously differentiable in a neighbourhood of x 0 and σ(x)≡1. Then the following holds:

  • Suppose that X i are i.i.d. and \(e_{i}=\sum_{j=0}^{\infty}a_{j}\varepsilon_{i-j}\) is a linear process with i.i.d. zero mean innovations ε i , \(\sigma_{\varepsilon}^{2}=\operatorname{var}(\varepsilon_{i})<\infty\) and \(a_{j}\sim c_{a}j^{d_{e}-1}\) for some \(0<d_{e}<\frac{1}{2}\). Then

    (7.165)

    where κ_1=∫K^2(u) du, κ_2=∫u^2K(u) du.

  • Suppose that X_i is a zero mean Gaussian process with long-range dependence such that \(\gamma_{X}(k)\sim c_{\gamma} \vert k\vert ^{2d_{X}-1}\) (\(0<d_{X}<\frac{1}{2}\)) and \(\operatorname{var}(\sum_{i=1}^{n}X_{i})\sim c_{X}n^{2d_{X}+1}\). Then

    (7.166)

The first part of Theorem 7.29 means that for i.i.d. explanatory variables the asymptotic mean squared error is exactly the same as for i.i.d. residuals. Thus, if we are interested in the shape of m only, then the optimal bandwidth is the same as under i.i.d. assumptions, namely \(b_{\mathrm{opt}}=C_{\mathrm{opt}}n^{-\frac{1}{5}}\), and the optimal IMSE is of the order \(O(n^{-\frac{4}{5}})\). This is similar to results on linear regression through the origin with explanatory variables having expected value zero. Note in particular that even if ∫m(x)p_X(x) dx=0, the rate can be improved by subtracting \(\bar{y}\). This is similar to the improved rate of the empirical process when subtracting the sample mean (see Sect. 4.8.3) and results discussed in the context of goodness-of-fit testing where estimation of nuisance parameters improves the rate of convergence (Sect. 5.16). On the other hand, if X_i exhibits long memory, then the rate deteriorates for functions m whose Hermite rank is one. In terms of orders, we have \(\mathit{IMSE}=O(b^{4})+O((nb)^{-1})+O(n^{2d_{X}-1})\). Minimization with respect to b=cn^{−α} therefore yields exactly the same optimal value \(b_{\mathrm{opt}}=C_{\mathrm{opt}}n^{-\frac{1}{5}}\) as for i.i.d. residuals. However, the optimal mean squared error is of the order \(O(n^{-\frac{4}{5}})\) only if \(\frac{4}{5}\leq1-2d_{X}\) which means d_X≤0.1. For d_X>0.1 the variance dominates the optimal IMSE which is asymptotically proportional to \(n^{2d_{X}-1}\). On the other hand, for very large bandwidths b=cn^{−α} with \(\alpha<\frac{1}{4}(1-2d_{X})\), the bias dominates the IMSE which is then, however, far from the optimal one. In summary, if X_i exhibits long memory, then the results are analogous to estimation of m; however, with d_e replaced by d_X.
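A minimal sketch of the shape estimator (7.164) is given below; the rectangular kernel and the function name are our own illustrative choices.

```python
import numpy as np

def shape_estimate(x0, X, Y, b):
    # \hat m^*(x0) = \hat m_NW(x0) - \bar y, cf. (7.164);
    # rectangular kernel, so the NW estimator reduces to a local average of Y.
    w = (np.abs((X - x0) / b) <= 1.0).astype(float)
    m_nw = (w * Y).sum() / w.sum()
    return m_nw - Y.mean()
```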

7.4.9 Conditional Variance Estimation

We go back to the parametric regression model (7.45)

$$Y_{i}=\beta_{0}+\beta_{1}X_{i}+ \sigma(X_{i})e_{i}. $$

Our goal now is to estimate the conditional variance function σ 2(⋅) in a nonparametric way. To do so, we first estimate β 0 and β 1 by the least squares method studied in Sect. 7.2. Then, in analogy to conditional mean estimation, we estimate σ 2(⋅) by smoothing residuals with a kernel K and a bandwidth b,

$$ \hat{\sigma}^{2}(x_{0})=\frac{(nb)^{-1}\sum_{i=1}^{n}(Y_{i}-\hat{\beta}_{0}-\hat{\beta}_{1}X_{i})^{2}K ( \frac{X_{i}-x_{0}}{b} ) }{\hat {p}_{X}(x_{0})}, $$
(7.167)

where \(\hat{p}_{X}(x_{0})\) is the kernel density estimator defined in (7.155). It is known that in the case of weakly dependent errors and/or predictors, estimation of β 0 and β 1 does not influence the performance of \(\hat{\sigma}^{2}(\cdot)\) (see Fan and Yao 1998; Zhao and Wu 2008).
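A minimal sketch of (7.167) may look as follows; the least squares step via numpy's polyfit and the rectangular kernel are illustrative choices only.

```python
import numpy as np

def sigma2_estimate(x0, X, Y, b):
    """Sketch of (7.167): smooth squared LS residuals around x0.

    The (nb) factors in numerator and denominator of (7.167) cancel,
    so the estimator reduces to a kernel-weighted average of the
    squared residuals.
    """
    beta1, beta0 = np.polyfit(X, Y, deg=1)   # least squares slope and intercept
    resid2 = (Y - beta0 - beta1 * X) ** 2
    w = (np.abs((X - x0) / b) <= 1.0).astype(float)
    return (w * resid2).sum() / w.sum()
```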

To see what happens in the case of long memory, we will work under the condition that X_i are i.i.d. and \(e_{i}=\sum a_{j}\varepsilon_{i-j}\) is a linear long-memory process with \(a_{j}\sim c_{a}j^{d-1}\) (\(0<d<\frac{1}{2}\)). Defining

$$\varDelta _{t}= ( \hat{\beta}_{0}-\beta_{0} ) + ( \hat{\beta}_{1}-\beta_{1} ) X_{t}=: \varDelta _{0}+\varDelta _{1,t},$$

we can write down the decomposition

If β_0 and β_1 were known, then we would have Δ_i=0 and thus J_3=J_4≡0. Let us recall the proof of Theorem 7.28. The first two terms J_1 and J_2 are very similar to the terms appearing in the decomposition of \(\hat{p}_{X}(x_{0}) ( \hat{m}(x_{0})-m(x_{0}) ) \). If we assume nb^5→0, then \(\sqrt{nb}J_{1}=o_{p}(1)\) so that the term J_1 is negligible. The second term can be decomposed into two terms J_{21} and J_{22} with

$$ \sqrt{nb}J_{21}\overset{\mathrm{d}}{\rightarrow}Z_{1} \sigma^{2}(x_{0})\sqrt{p_{X}(x_{0}) \int K^{2}(u)\,du} $$
(7.168)

and, if d∈(1/4,1/2),

$$ n^{1-2d_{\varepsilon}}c_{e,2}^{-\frac{1}{2}}J_{22}\overset{ \mathrm{d}}{\rightarrow}\sigma^{2}(x_{0})p_{X}(x_{0})Z_{2,H_{0}}(1) $$
(7.169)

where \(Z_{2,H_{0}}(1)\) is the Hermite–Rosenblatt process at time 1 and c_{e,2} is the constant in \(\operatorname{var}(\sum(e_{i}^{2}-1))\sim c_{e,2}n^{4d}\). If d∈(0,1/4), then \(\sqrt{nb}J_{22}=o_{P}(1)\). The reason for the difference between (7.161) and (7.169) is that the latter involves the limiting behaviour of \(\sum_{t=1}^{n}(e_{t}^{2}-1)\).

To deal with J 3, write

Defining the quantity

$$\tilde{J}_{3}:=\frac{2}{n^{2}b}\sum_{i=1}^{n} \sum_{j=1}^{n}K \biggl( \frac{X_{i}-x_{0}}{b} \biggr) \sigma(X_{i})\sigma(X_{j})X_{i}X_{j}e_{i}e_{j}, $$

we may decompose J 3 into two parts,

$$ J_{3}=\tilde{L}_{3}\frac{1}{n}\sum _{i=1}^{n}\sigma(X_{i}) e_{i}+\frac{1}{V_{n}}\tilde{J}_{3}, $$
(7.170)

with \(V_{n}^{2}=n^{-1}\sum_{i=1}^{n}X_{i}^{2}\). Furthermore, in \(\tilde{J}_{3}\) we may ignore summation over i=j. Since X_i are i.i.d., the (M/L)-decomposition suggests that \(\tilde{J}_{3}\) behaves like

$$E \biggl[ b^{-1}K \biggl( \frac{X_{i}-x_{0}}{b} \biggr) \sigma(X_{i})\sigma(X_{j})X_{i}X_{j} \biggr] n^{-2}\sum_{t=1}^{n}\sum _{s=1,s\neq t}^{n}e_{t}e_{s}. $$

Since the expected value above behaves like E[σ(X_1)X_1]σ(x_0)x_0p_X(x_0), we conclude from (7.48) that

$$ n^{(1-2d_{e})}c_{e}^{-\frac{1}{2}}\tilde{J}_{3} \overset{\mathrm{d}}{\rightarrow}2E\bigl[\sigma(X_{1})X_{1} \bigr]\sigma(x_{0})x_{0}p_{X}(x_{0}) \cdot Z_{0}^{2}. $$
(7.171)

Similar arguments yield

$$ n^{(1-2d_{e})}c_{e}^{-\frac{1}{2}}\tilde{L}_{3}n^{-1} \sum_{i=1}^{n}\sigma(X_{i})e_{i}\overset{\mathrm{d}}{\rightarrow}2E \bigl[\sigma(X_{1})\bigr]\sigma(x_{0})p_{X}(x_{0}) \cdot Z_{0}^{2}. $$
(7.172)

Since V n converges in probability to 1, the last two equations mean that \(n^{1-2d_{e}}c_{e}^{-\frac{1}{2}}J_{3}\) converges in distribution to

$$2 \bigl\{ E\bigl[\sigma(X_{1})X_{1}\bigr]x_{0}+E \bigl[\sigma(X_{1})\bigr] \bigr\} \sigma (x_{0})p_{X}(x_{0}) \cdot Z_{0}^{2}. $$

We note that this conclusion is obtained by justifying that the convergence in (7.171) and (7.172) is joint. Similar considerations can be applied to J_4. Details can be found in Kulik and Wichelhaus (2011). There, the results are obtained under more general assumptions on the predictors; see also Guo and Koul (2008). Extensions to conditional variance estimation in the model (7.153) are given in Kulik and Wichelhaus (2012) and Zhao and Wu (2008). In summary, the following dichotomy is obtained:

Theorem 7.30

Consider the random design regression model (7.45). Assume that nb 5→0 and σ is twice continuously differentiable in a neighbourhood of x 0. Furthermore, suppose that X i are i.i.d. and \(e_{i}=\sum_{j=0}^{\infty}a_{j}\varepsilon_{i-j}\) is a second-order stationary linear process with \(a_{j}\sim c_{a}j^{d_{e}-1}\) (\(0<d_{e}<\frac{1}{2}\)), and denote by Z and Z 0 standard normal variables and by \(Z_{2,H_{0}}(1)\) an Hermite–Rosenblatt variable. Then the following holds:

  • If \(b=o(n^{1-4d_{e}})\), then

    $$\sqrt{nb}\sqrt{\hat{p}_{X}(x_{0})} \bigl( \hat{ \sigma}^{2}(x_{0})-\sigma^{2}(x_{0}) \bigr) \overset{\mathrm{d}}{\rightarrow}Z\sigma^{2}(x_{0})\sqrt{p_{X}(x_{0})\int K^{2}(u) \,du}; $$
  • If \(b\gg n^{1-4d_{e}}\), then

    (7.173)

The last two terms quantify the price we have to pay due to estimation of β_0 and β_1 and due to the fact that the error process has long-range dependence. Note that the first of the two terms disappears if E[σ(X_1)X_1]=0. Finally, note that the assumption nb^5→0 was used for convenience in order that the bias of \(\hat{\sigma}^{2}(x_{0})\) be asymptotically negligible. This assumption can be dropped, but then \(\hat{\sigma}^{2}(x_{0})-\sigma^{2}(x_{0})\) has to be replaced by \(\hat{\sigma}^{2}(x_{0})-E [ \hat{\sigma}^{2}(x_{0}) ] \), and the bias of \(\hat{\sigma}^{2}(x_{0})\) has to be treated separately (as was done previously when estimating the conditional mean function m(x_0) nonparametrically).

7.4.10 Estimation of Trend Functions for LARCH Processes

Consider a time series model Y_i=m(t_i)+e_i with a nonparametric trend function m(t_i) (t_i∈[0,1]) and residuals e_i that exhibit long-range dependence in volatility, and a linear dependence structure corresponding either to short memory, long memory or antipersistence. The main question addressed here is the asymptotic behaviour of nonparametric estimators of m. In particular, one is interested in characterizing the influence of the linear and nonlinear dependence structure on \(\hat{m}\).

More specifically, Beran and Feng (2007) consider residuals e i having a Wold decomposition

$$e_{i}=\sum_{j=0}^{\infty}a_{j}Z_{i-j}=A ( B ) Z_{i}$$

with \(\vert A ( e^{-i\lambda} ) \vert ^{2}\sim L_{f_{e}} ( \lambda ) \vert \lambda \vert ^{-2d_{1}}\) (\(-\frac{1}{2}<d_{1}<\frac{1}{2}\)) as λ→0, \(L_{f_{e}} ( \lambda ) \in C [ -\pi,\pi ] \) slowly varying, and Z_i is a long-memory LARCH process with \(b_{j}\sim cj^{d_{2}-1}\) (as j→∞) for some \(0<d_{2}<\frac{1}{2}\) and \(\sum b_{j}^{2}<1\). For the autocovariances of e_i, we have \(\gamma_{e} ( k ) \sim L_{\gamma_{e}} ( k ) \vert k\vert ^{2d_{1}-1}\) with \(L_{\gamma_{e}}\) slowly varying, whereas the Z_i are uncorrelated but the squares \(Z_{i}^{2}\) have autocovariances of the form \(\gamma_{Z^{2}} ( k ) \sim L_{\gamma_{Z^{2}}} ( k ) \vert k\vert ^{2d_{2}-1}\) (as k→∞) where \(L_{\gamma_{Z^{2}}}\) is another slowly varying function.

We recall that, given a polynomial degree \(p\in\mathbb{N}\) and a bandwidth b>0, a local polynomial estimator of the jth derivative m (j)(t 0) (for a fixed t 0∈[0,1]) can be written as

(7.174)
(7.175)

where \(\mathbf{\delta}_{j}= ( \delta_{1,j},\dots,\delta_{p+1,j} )^{T}\) (j=1,…,p+1) denote unit vectors with δ j,j =1, δ i,j =0 (ij) (see (7.106)). Thus, investigating the asymptotic behaviour of \(\hat{\mu}^{ ( j ) } ( t_{0} ) \) amounts to studying the sequence of sums

$$S_{n}=\sum_{i=1}^{n}w_{j,b;n} ( i ) Y_{i}=\sum_{i=1}^{n} \zeta_{i,n}\quad(n\in\mathbb{N})$$

of a triangular array ζ i,n =w j,b;n (i)Y i (1≤in; \(n\in\mathbb{N}\)). For the specific weights given by local polynomial estimation, Beran and Feng (2007) derive asymptotic normality of S n under suitable conditions on the tail behaviour of e i and on the weights w j,b;n . In particular, one must make sure that the weights are balanced in the sense that \(\max_{1\leq i\leq n}w_{j,b;n}^{2} ( i ) \) is asymptotically of a smaller order than \(\operatorname{var} ( S_{n} ) \) (for the detailed assumptions, see Beran and Feng 2007). Also note that the results for the mean squared error are the same as in Theorem 7.22 because these depend on the linear dependence structure only.

7.4.11 Further Bibliographic Comments

Hall and Hart (1990b) were the first to derive an asymptotic formula for the mean squared error of kernel estimators of the trend function m(t) in fixed-design regression with long-memory errors. This result was extended further in Beran and Feng (2001a, 2001b, 2002a, 2002b, 2002c), including kernel estimation with boundary corrections, local polynomial estimation of derivatives and integrated processes. Results along the line of (7.144) were proven in Csörgő and Mielniczuk (1995a) under the condition of a homoscedastic Gaussian residual process (the modification to the heteroskedastic case is obvious). See also Csörgő and Mielniczuk (1995b) and Robinson (1997). Nonparametric trend estimation in replicated long-memory time series is considered in Ghosh (2001). The general results applicable to local polynomial estimators of m^{(j)} and kernel estimators with boundary correction were given in Beran and Feng (2001a, 2001b, 2002a) (also see Feng et al. 2007). Properties of cross-validation and plug-in bandwidths were studied in Hall et al. (1995a) and Beran and Feng (2002a, 2002b, 2002c), respectively. Data driven bandwidth selection including asymptotic results on the convergence of the estimated bandwidth can also be found in Beran and Feng (2002a, 2002b, 2002c). Extensions to LARCH-type residuals are given in Beran and Feng (2007). Opsomer et al. (2001) give an overview of existing results in nonparametric estimation with short- and long-memory errors. Robust versions of local polynomial estimators in the long-memory context are considered in Beran et al. (2002) and Beran et al. (2003). Optimal convergence rates in the long-memory setting are derived in Feng and Beran (2012). The nonexistence of optimal kernels in the long-memory setting is shown in Beran and Feng (2007). Extensions to nonequidistant time series and tests for rapid change points are derived in Menéndez et al. (2010).

Theorem 7.28 has its origin in work by Cheng and Robinson (1994). Further references include Csörgő and Mielniczuk (1999, 2000), Mielniczuk and Wu (2004), Zhao and Wu (2008), and Kulik and Lorek (2011). In the latter article, the authors consider a very general class of errors, which includes FARIMA–GARCH and antipersistent processes. In Bryk and Mielniczuk (2008), the authors consider a randomization scheme for fixed-design regression. As a consequence, the resulting kernel estimator has a rate of convergence as in the random-design case. Results for the kernel Nadaraya–Watson estimator have further extensions to local linear regression estimators; see Masry and Mielniczuk (1999) and Masry (2001).

7.5 Trend Estimation Based on Wavelets

7.5.1 Introduction

In this section, we consider adaptive estimation of the trend function m(t_i)=E(Y_i) using wavelets. The advantage of the wavelet approach is evident for functions m that are inhomogeneous in time or not smooth. We start with the fixed-design case. As was shown for kernel and local polynomial estimation, the rates of convergence are affected by the presence of long memory. The same happens for wavelet methods (see, e.g. Wang 1996; Wang 1997; Johnstone and Silverman 1997; Johnstone 1999; Li and Xiao 2007; Kulik and Raimondo 2009a; Beran and Shumeyko 2012a). Again, in the random design case, it is possible to achieve the same rates as for weakly dependent data (Kulik and Raimondo 2009b).

7.5.2 Fixed Design

7.5.2.1 Data Adaptive Trend Estimation

As before, we consider a model with trend,

$$ Y_{i}=m(t_{i})+e_{i}, $$
(7.176)

with t_i=i/n, m∈L^2[0,1] and e_i a zero mean stationary process with long-range dependence. Wavelet based trend estimation in the context of i.i.d. or short-range dependent residuals has been considered by many authors (see, e.g. a series of pioneering papers by Donoho and Johnstone). Most results deal with optimality in the sense of a minimax risk, and are partially also applicable in the long-memory setting. For an observed data set, however, the minimax principle often leads to estimates of m that may be far from optimal in the specific situation. A useful alternative is therefore to take a data adaptive approach where one tries to extract information about the dependence structure of e_i and preliminary information about m in order to come up with a (close to) optimal solution for \(\hat{m}\). Results along this line are available in Li and Xiao (2007) and Beran and Shumeyko (2012a). For simplicity, suppose that e_i is a Gaussian process with autocovariance function γ(k)=E(e_i e_{i+k})∼C_γ|k|^{2d−1} (k→∞) and spectral density f(λ)=(2π)^{−1}∑_k γ(k)exp(−ikλ)∼C_f|λ|^{−2d} (λ→0). To include a larger variety of wavelets, Beran and Shumeyko (2012a) assume that the support of the father and mother wavelets ϕ(t) and ψ(t) is [0,N] with N an arbitrary integer. Moreover, ψ(0)=ψ(N)=0 and

$$ \int_{0}^{N}\phi(t)\,dt=\int _{0}^{N}\phi^{2}(t)\,dt=\int _{0}^{N}\psi^{2}(t)\,dt=1. $$
(7.177)

Then, for any J≥0, the system \(\{\phi_{Jk},\psi_{jk},k\in\mathbb{Z},j\geq0\}\) with

$$\psi_{jk}(t)=N^{1/2}2^{(J+j)/2}\psi \bigl(N2^{J+j}t-k\bigr),\qquad \phi_{Jk}(t)=N^{1/2}2^{J/2}\phi\bigl(N2^{J}t-k\bigr), $$

is an orthonormal basis in \(L^{2}(\mathbb{R})\) (see Sect. 3.5). An important role is played by the number \(M_{\psi}\in\mathbb{N}\) of vanishing moments, defined by the properties

$$ \int_{0}^{N}t^{k}\psi(t)\,dt=0 \quad(k=0,1,\dots,M_{\psi}-1) $$
(7.178)

and

$$ \int_{0}^{N}t^{M_{\psi}}\psi(t)\,dt= \nu_{M_{\psi}}\neq0. $$
(7.179)

Recall that for every fixed J≥0, every function m∈L^2([0,1]) has a unique orthogonal wavelet representation

$$ m(t)=\sum_{k=-N+1}^{N2^{J}-1}s_{Jk} \phi_{Jk}(t)+\sum_{j=0}^{\infty}\sum_{k=-N+1}^{N2^{J+j}-1}\,d_{jk} \psi_{jk}(t), $$
(7.180)

with

$$s_{Jk}=\int_{0}^{1}m(t) \phi_{Jk}(t)\,dt,\qquad d_{jk}=\int_{0}^{1}m(t) \psi_{jk}(t)\,dt. $$

Setting

$$\hat{s}_{Jk}=\frac{1}{n}\sum_{i=1}^{n}Y_{i} \phi_{Jk}(t_{i}),\qquad \hat{d}_{jk}= \frac{1}{n}\sum_{i=1}^{n}Y_{i} \psi_{jk}(t_{i}), $$

a (hard) thresholding wavelet estimator of m is defined by

$$ \hat{g}(t)=\sum_{k=-N+1}^{N2^{J}-1} \hat{s}_{Jk}\phi_{Jk}(t)+\sum_{j=0}^{q}\sum_{k=-N+1}^{N2^{J+j}-1}\hat{d}_{jk} \,I\bigl(|\hat{d}_{jk}|>\delta_{j}\bigr)\psi_{jk}(t). $$
(7.181)

The constants J, q and δ_j are called the decomposition level, smoothing parameter and threshold, respectively, and can be chosen quite freely except for some minimal asymptotic requirements such as δ_j→0 (with rates in a certain range), q→∞, etc. The decomposition level J may also tend to infinity, but a reasonable assumption is that 2^J=o(n). The reason is that the lowest resolution scale, which is of the order O(2^{−J}), should tend to zero at a slower rate than the distance n^{−1} between successive observational time points. This requirement corresponds to letting the length of the window of a kernel estimator tend to zero at a slower rate than n^{−1}. More specifically, N2^Jt∈[0,N] if and only if 0≤t≤2^{−J}, so that we need n^{−1}=o(2^{−J}).
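For concreteness, the sketch below implements (7.181) for Haar wavelets (N=1) with a single threshold δ used at all levels; it only illustrates the roles of J, q and δ_j and does not reproduce the optimal choices discussed below.

```python
import numpy as np

def haar_phi(t):
    # Haar father wavelet: indicator of [0, 1)
    return ((t >= 0.0) & (t < 1.0)).astype(float)

def haar_psi(t):
    # Haar mother wavelet: +1 on [0, 1/2), -1 on [1/2, 1)
    return haar_phi(2.0 * t) - haar_phi(2.0 * t - 1.0)

def wavelet_trend(Y, J=3, q=2, delta=0.0):
    """Hard-thresholding estimator in the spirit of (7.181), Haar case."""
    n = len(Y)
    t = np.arange(1, n + 1) / n
    g_hat = np.zeros(n)
    # father wavelet part
    for k in range(2 ** J):
        phi_Jk = 2.0 ** (J / 2.0) * haar_phi(2.0 ** J * t - k)
        s_Jk = np.mean(Y * phi_Jk)            # empirical coefficient \hat s_Jk
        g_hat += s_Jk * phi_Jk
    # thresholded mother wavelet part
    for j in range(q + 1):
        scale = 2 ** (J + j)
        for k in range(scale):
            psi_jk = np.sqrt(scale) * haar_psi(scale * t - k)
            d_jk = np.mean(Y * psi_jk)        # empirical coefficient \hat d_jk
            if abs(d_jk) > delta:
                g_hat += d_jk * psi_jk
    return g_hat
```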

The question of interest is now how to choose the constants J, q and δ_j optimally for a given data set. An asymptotic answer is given, at least partially, in Beran and Shumeyko (2012a) (also see Li and Xiao 2007). The solution consists of an asymptotic expression for the integrated mean squared error \(\mathit{MISE}=\int E [ ( \hat{m}(t)-m(t) )^{2} ]\, dt\) that can be minimized. The result depends on the differentiability of m, the number M_ψ of vanishing moments and further regularity properties of the mother wavelet ψ, and on the long-memory parameter d. A specific assumption used in Beran and Shumeyko (2012a) is a uniform Hölder condition with exponent 1/2, i.e.

$$ \bigl|\psi(x)-\psi(y)\bigr|\leq C|x-y|^{1/2},\quad\forall x,y\in[0,N]. $$
(7.182)

This is, however, not necessary since analogous results can be derived, for instance, for Haar wavelets.

In a first step, it can be shown that minimization with respect to J, q and {δ j } yields the following optimal order of the MISE:

Theorem 7.31

Suppose that m∈C^r[0,1], m^{(r)}(t)≠0 on a set of non-zero Lebesgue measure, the process e_i is Gaussian with covariance structure γ(k)=E(e_i e_{i+k})∼C_γ|k|^{2d−1}, and ψ is such that M_ψ=r. Then, minimizing the MISE with respect to J, q and {δ_j} yields the optimal order

$$ \mathit{IMSE}_{\mathrm{opt}}=O\bigl(n^{-\frac{2r\alpha}{2r+\alpha}}\bigr) $$
(7.183)

where α=1−2d.

Since only the rate is given, Theorem 7.31 is not directly applicable in practice. Instead, an expression for the IMSE including all relevant constants is required. Moreover, the trend function (or its derivatives) should be allowed to have at least a finite number of jumps.

It turns out that the optimal order can be achieved without thresholding, i.e. setting δ j =0 for all j. Using no thresholding simplifies asymptotic calculations. A detailed analysis of the IMSE yields the following optimal values of J and q.

Theorem 7.32

Under the assumptions of the previous theorem and thresholds

$$\delta_{j}=0\quad (0\leq j\leq q), $$

the following holds: Let

(7.184)
(7.185)
  1. (i)

    If \((2^{\alpha}-1)C_{\phi}^{2}>C_{\psi}^{2}\), then the asymptotic IMSE is minimized by decomposition levels J satisfying \(2^{J^{\ast} }=o ( n^{\frac{\alpha}{2r+\alpha}} ) \) and smoothing parameters

    $$ q^{\ast}= \biggl\lfloor \frac{\alpha}{2r+\alpha}\log_{2}n+C_{\psi}^{\ast } \biggr\rfloor -J^{\ast} $$
    (7.186)

    where log2 denotes logarithm to the base 2. The optimal IMSE is of the form

    $$ \mathit{MISE}=A_{1}A_{2}\cdot n^{-\frac{2r\alpha}{2r+\alpha}}+{o \bigl( n^{-\frac {2r\alpha}{2r+\alpha}} \bigr) } $$
    (7.187)

    with constants A 1, A 2 defined explicitly as functions of d, and the wavelet functions (see Beran and Shumeyko 2012a).

  2. (ii)

    If \((2^{\alpha}-1)C_{\phi}^{2}<C_{\psi}^{2}\), then minimizing the asymptotic IMSE with respect to J and q yields

    $$ \hat{g}(t)=\sum_{k=-N+1}^{N2^{J^{\ast}}-1} \hat{s}_{J^{\ast}k}\phi_{J^{\ast}k}(t), $$
    (7.188)

    with

    $$ J^{\ast}= \biggl\lfloor \frac{\alpha}{2r+\alpha}\log_{2}n+C_{\phi}^{\ast } \biggr\rfloor +1 $$
    (7.189)

    and \(C_{\phi}^{\ast}\) defined explicitly as a function of d, and the wavelet functions (see Beran and Shumeyko 2012a). The optimal IMSE is of the form

    $$ \mathit{IMSE}=A_{3}A_{2}\cdot n^{-\frac{2r\alpha}{2r+\alpha}}+{o \bigl( n^{-\frac {2r\alpha}{2r+\alpha}} \bigr) }, $$
    (7.190)

    where again A_3, A_2 can be given explicitly.

This result establishes an explicit asymptotic expression (and not just the order) for optimal choices of J^∗ and q^∗, for the case where g is sufficiently smooth and when a wavelet basis is used that matches at least this degree of smoothness. Most interesting is part (ii) where the optimal estimator does not contain any mother wavelets. Thus, smoothing is done solely by refining the resolution level J^∗ in the father wavelet decomposition. The optimal choice is a logarithmic increase of J^∗ with constants as given in (7.189).

If jumps in the function g are expected, then the same asymptotic formula for the MISE holds when essentially the same rules as in this theorem are used, except that thresholded mother wavelet components are added to capture local disturbances. Thus, consider

$$ \hat{g}(t)=\sum_{k=-N+1}^{N2^{J}-1} \hat{s}_{Jk}\phi_{Jk}(t)+\sum_{j=0}^{q}\sum_{k=-N+1}^{N2^{J+j}-1}\hat{d}_{jk} \,I\bigl(|\hat{d}_{jk}|>\delta_{j}\bigr)\psi_{jk}(t). $$
(7.191)

Then the following holds.

Theorem 7.33

Suppose that g (r) exists on [0,1] except for at most a finite number of points, and, where it exists, it is piecewise continuous and bounded. Furthermore, assume that supp(g (r)) has positive Lebesgue measure, M ψ =r and the process e i is Gaussian with long memory as specified above. Then the following holds:

  1. (i)

    If \((2^{\alpha}-1)C_{\phi}^{2}>C_{\psi}^{2}\), J is such that \(2^{J}=o ( n^{\frac{\alpha}{2r+\alpha}} ) \), q=⌊log_2 n⌋−J, q^∗ is defined by (7.186), and δ_j is such that for 0≤j≤q^∗

    $$ \delta_{j}=0 $$
    (7.192)

    and for q <jq

    $$ 2^{J+j}\delta_{j}^{2}\rightarrow0, 2^{(J+j)(2r+1)}\delta_{j}^{2}\rightarrow \infty, \qquad \delta_{j}^{2}\ge\frac{4 e C_{\psi}^{2}N^{-1+\alpha }(\ln n)^{2} }{n^{\alpha}2^{(J+j)(1-\alpha)}}, $$
    (7.193)

    then (7.187) holds.

  2. (ii)

    If \((2^{\alpha}-1)C_{\phi}^{2}<C_{\psi}^{2}\), J=J^∗ with J^∗ defined by (7.189), q=⌊log_2 n⌋−J and δ_j such that

    $$ \begin{aligned} &2^{J+j}\delta_{j}^{2}\rightarrow0, 2^{(J+j)(2r+1)}\delta_{j}^{2} \rightarrow \infty, \\ &\delta_{j}^{2}\geq\frac{4eC_{\psi}^{2}N^{-1+\alpha}(\ln n)^{2}}{n^{\alpha}2^{(J+j)(1-\alpha)}}\quad (0\leq j\leq q), \end{aligned}$$
    (7.194)

    then (7.190) holds.

7.5.2.2 Convergence in Besov Classes

An alternative approach to convergence rates of wavelet estimators in the long-memory context was initiated by Wang (1996). Assume that the error sequence e_i is Gaussian with covariance function γ(k)∼c_γ k^{2d−1}, d∈(0,1/2). As before, set α=1−2d. Then, in continuous time, a model that is analogous to Y_i=m(t_i)+e_i discussed above is given by

$$ dY(t)=m(t)\,dt+\varepsilon ^{\alpha}\,dB_{H}(t), $$
(7.195)

where B_H(t) (t∈[0,1]) is a standard fractional Brownian motion (fBm) with Hurst index H=d+1/2, and ε=n^{−1/2} is the “noise level”.

Recall that the function m(t) can be expanded as

$$m(t)=\sum_{k=-\infty}^{\infty}\alpha_{Jk} \phi_{Jk}(t)+\sum_{j\geq J}\sum _{k=0}^{\infty}\beta_{jk}\psi_{jk}(t). $$

Equivalently, we may write

$$m(t)=\alpha_{00}\phi_{00}(t)+\sum _{j\geq0}\sum_{k=0}^{\infty} \beta_{jk}\psi_{jk}(t) $$

where ϕ 00(t) is a suitable father wavelet. To characterize properties of m, one considers the so-called Besov spaces, characterised by the behaviour of the wavelet coefficients as follows:

Definition 7.8

Assume that m∈L^λ([0,1]). We say that m belongs to the Besov space with parameters r, λ and s if

$$ \sum_{j\geq0}2^{j(r+1/2-1/\lambda)s}\biggl[\sum _{0\leq k\leq2^{j}}|\beta_{jk}|^{\lambda} \biggr]^{s/\lambda}<\infty. $$
(7.196)

The parameter r can be thought of as related to the number of derivatives of m. With different values of λ and s, Besov spaces capture a variety of smoothness features in a function, including spatially inhomogeneous behaviour.

The wavelet estimator is constructed similarly to (7.181):

$$\hat{m}(t)=\hat{\alpha}_{00}\phi_{00}(t)+\sum _{j=0}^{J}\sum_{k=0}^{2^{j}-1} \hat{\beta}_{jk}1\bigl(|\hat{\beta}_{jk}|>\delta_{j}\bigr) \psi_{jk}(t), $$

where in the continuous time model (7.195) we set

$$ \hat{\beta}_{jk}:=\hat{\beta}_{jk}^{C}:=\int \psi_{jk}(t)\,dY_{t}. $$
(7.197)

Of course, in the original model we have to take instead

$$ \hat{\beta}_{jk}:=\hat{\beta}_{jk}^{D}:= \frac{1}{n}\sum_{i=1}^{n} \psi_{jk}(t_{i})Y_{i}. $$
(7.198)

The tuning parameters J and δ j are chosen as follows:

  • Fine resolution level J:

    $$ 2^{J}= \biggl( \frac{n}{\log n} \biggr)^{\alpha}= \biggl( \frac{n}{\log n} \biggr)^{1-2d}. $$
    (7.199)
  • Threshold: The threshold value δ=δ j has three input parameters and is written as

    $$ \delta_{j}= \eta \sigma_{j} c_{n}$$
    (7.200)
    • η: \(\eta>\sqrt{8\alpha}\sqrt{2\vee p}\);

    • σ j : a level-dependent scaling factor

      (7.201)
      (7.202)
    • c n : a sample size-dependent scaling factor

      $$ c_{n}=(\log n)^{\frac{1}{2}}\, {n}^{-\frac{\alpha}{2}}. $$
      (7.203)

The following comments have to be made here. First, in the definition of η, we have a new parameter p that is connected to the loss function we would like to use. Specifically, let

$$\Vert f-g\Vert_{\nu}^{\nu}=\int\bigl|f(t)-g(t)\bigr|^{\nu}\,dt $$

be the νth norm. Then we will measure accuracy of the estimator \(\hat{m}\) by computing

$$E \bigl( \Vert\hat{m}-m\Vert_{\nu}^{\nu} \bigr) . $$

Clearly, if ν=2, this definition agrees with the IMSE, as considered in Theorem 7.31. The value of σ j comes from

$$\sigma_{j}^{2}=\operatorname{var} \biggl( \int\psi_{jk}(t)\,dB_{H}(t) \biggr) . $$

Furthermore, the parameter τ in (7.202) is chosen for the continuous model (7.195). For the original discrete time model, the parameter should be changed to

$$\tau^{2}=c_{f}\int_{0}^{1} \int_{0}^{1}\psi(u)\psi(v)|u-v|^{-\alpha}\,du\,dv. $$

We note that the estimator is adaptive with respect to the smoothness class as our tuning paradigm does not depend on r.
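The tuning rules (7.199) and (7.203) are straightforward to encode, as in the sketch below; η is user supplied (subject to the lower bound stated above) and the level-dependent factor σ_j of (7.201)–(7.202) is left to the user.

```python
import numpy as np

def besov_tuning(n, d, eta):
    """Finest resolution level (7.199) and scaling factor (7.203).

    d is the long-memory parameter and alpha = 1 - 2d; eta must satisfy
    the lower bound stated in the text.  The threshold is then
    delta_j = eta * sigma_j * c_n with sigma_j from (7.201)-(7.202).
    """
    alpha = 1.0 - 2.0 * d
    J = int(np.floor(alpha * np.log2(n / np.log(n))))   # 2**J ~ (n/log n)**alpha
    c_n = np.sqrt(np.log(n)) * n ** (-alpha / 2.0)
    return J, c_n
```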

The following result was proven in Kulik and Raimondo (2009a), see also Wang (1996), Wang (1997), Johnstone and Silverman (1997), Johnstone (1999) and Li and Xiao (2007).

Theorem 7.34

Consider the continuous time model (7.195) with ε=n^{−1/2}, and the wavelet estimator with (7.199), (7.200), (7.201), (7.202) and (7.203). Assume p>1 and that m belongs to a Besov space as in Definition 7.8 with \(r\geq\frac{1}{\lambda}\). There exists a constant C>0 such that for all n≥0,

$$E \bigl( \Vert \hat{m}-m\Vert _{\nu}^{\nu} \bigr) \leq C \biggl( \frac{(\log n)^{\frac{1}{\alpha}}}{n} \biggr)^{\gamma}, $$

with

(7.204)
(7.205)
(7.206)

The proof of this result is based on the so-called maxiset theorem, see Kerkyacharian and Picard (2000). In particular, the following estimates are crucial. First, \(E (\hat{\beta}_{jk} ) =\beta_{jk}\) and

$$\operatorname{var} ( \hat{\beta}_{jk} ) =\operatorname{var} \biggl( \varepsilon ^{\alpha}\int \psi_{jk}(t)\,dB_{H}(t) \biggr) =n^{-\alpha}2^{-j(1-\alpha)} \tau^{2}\leq C\sigma_{j}^{2}\,c_{n}^{2}. $$

Since the random variables \(\hat{\beta}_{jk}-\beta_{jk}\) are Gaussian, we have the following large deviations inequality

$$ P \bigl( |\hat{\beta}_{jk}-\beta_{jk}|>\eta \sigma_{j}\,c_{n}/2 \bigr) \leq\exp \biggl( -\log n \frac{\eta^{2}}{8} \biggr) \leq C\,\bigl(c_{n}^{2p}\wedge c_{n}^{4}\bigr) $$
(7.207)

provided \(\eta>\sqrt{8\alpha}\sqrt{p\vee2}\).

The two rate regimes (7.204) and (7.206) are referred to as the ‘dense’ and ‘sparse’ phases (see, e.g. Kerkyacharian and Picard 2000 in the i.i.d. case). The result above shows that the boundary region \(r=\frac{\alpha}{2}(\frac{p}{\lambda}-1)\) depends on the LRD index α, and the sparse region is smaller for dependent data. In other words, some inhomogeneous properties of the trend function are “hidden” in the LRD noise. We note further that the condition \(p>\frac{2}{\alpha}+\lambda\) is required for the sparse regime to be visible. In particular, if p=2 then there is no sparse region and the rate results agree (up to a logarithmic term) with the result in Theorem 7.31.

7.5.3 Random Design

In this part, we are interested in estimating the conditional mean function m(⋅) in the heteroskedastic model

$$ Y_{i}=m(X_{i})+\sigma(X_{i})e_{i} \quad(i=1,\ldots,n). $$
(7.208)

Again, the rates of convergence will be analysed using Besov classes, although in the random-design context we cannot change this model to a continuous set-up as we did before. Furthermore, the fact that we consider random design has to be addressed appropriately. This can be done using the so-called warped wavelets. The wavelet expansion of m(t) is replaced by

$$ m(x)=\alpha_{0,0}\phi_{00}\bigl(F(x)\bigr)+\sum _{j\geq0}\sum_{k=0}^{\infty} \beta_{jk}\psi_{jk}\bigl(F(x)\bigr), $$
(7.209)

with

$$ \beta_{{jk}}=\int_{0}^{1}m(x)p(x) \psi_{{jk}}\bigl(F(x)\bigr)\,dx, $$
(7.210)

and F(⋅), p=F′ being the cumulative distribution function and the density function of X_1, respectively.

The partially adaptive wavelet estimator we are going to consider is

$$ \hat{m}(t)=\hat{\alpha}_{00}\phi_{00}\bigl(F(t)\bigr)+\sum _{j=0}^{J}\sum _{k=0}^{2^{j}-1}\hat{\beta}_{jk} 1\bigl(|\hat{ \beta}_{{jk}}|\geq\delta_{j}\bigr)\psi_{{jk}}\bigl(F(t) \bigr), $$
(7.211)

where

$$ \hat{\alpha}_{{00}}:=\frac{1}{n}\sum _{i=1}^{n}\phi_{{00}}\bigl(F(X_{i})\bigr)Y_{i},\qquad\hat{\beta}_{{jk}}:=\frac{1}{n} \sum_{i=1}^{n}\psi_{{jk}}\bigl(F(X_{i})\bigr)Y_{i}. $$
(7.212)

The highest resolution level is chosen as

$$2^{J}\sim\frac{n}{\log n}. $$

The theoretical level-dependent threshold parameter is set to be

$$\delta_{j}=\tau_{0} \biggl( \frac{\log n}{\sqrt{n}}\vee1\bigl \{E \bigl( \psi_{jk}\bigl(F(X_{1})\bigr) \sigma(X_{1}) \bigr) \neq 0\bigr\}\frac{(\log n)^{1/2}}{n^{\alpha/2}} \biggr) $$

where τ 0 is large enough and α=1−2d. We note the significant difference between fixed and random design. The choice of the highest resolution level J in the case of a random design does not involve LRD. Furthermore, in most regular cases the threshold δ j does not depend on α. Indeed, we have

$$E \bigl[ \psi_{jk}\bigl(F(X_{1})\bigr) \sigma(X_{1}) \bigr] =\int\psi_{jk}(u)\sigma \bigl(F^{-1}(u)\bigr)\,du. $$

Note first that if σ(⋅)≡σ, then the above integral vanishes. Furthermore, this is also the case if σ(⋅) has polynomial-like behaviour and appropriately regular wavelets are used. Consequently, in most practical cases the parameters of the wavelet estimator can be tuned without knowledge of α.
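As an illustration, the empirical warped-wavelet coefficients (7.212) can be computed as in the following sketch; the Haar mother wavelet and the standard normal design distribution F are assumptions made purely for the example.

```python
import numpy as np
from scipy.stats import norm

def haar_psi(t):
    # Haar mother wavelet: +1 on [0, 1/2), -1 on [1/2, 1)
    return (np.where((t >= 0.0) & (t < 0.5), 1.0, 0.0)
            - np.where((t >= 0.5) & (t < 1.0), 1.0, 0.0))

def warped_coeff(j, k, X, Y, F=norm.cdf):
    # Empirical warped-wavelet coefficient (7.212), with
    # psi_{jk}(u) = 2^{j/2} psi(2^j u - k) and F assumed known.
    psi_jk = 2.0 ** (j / 2.0) * haar_psi(2.0 ** j * F(X) - k)
    return np.mean(psi_jk * Y)
```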

Since we deal with warped wavelets, we have to consider the following weighted norm

$$\Vert f-g\Vert_{L^{\nu}(p)}^{\nu}= \biggl( \int\bigl|f(x)-g(x)\bigr|^{\nu}p(x)\,dx \biggr) . $$

Using the notation

$$ \alpha_{D}:=\frac{2r}{2r+1},\qquad\alpha_{S}:= \frac{2 ( r- ( \frac {1}{\lambda}-\frac{1}{\nu} ) ) }{2(r-\frac{1}{\lambda})+1}, $$
(7.213)

the following rates of convergence can be derived (Kulik and Raimondo 2009b):

Theorem 7.35

Consider the random-design regression model (7.208) such that X_i are i.i.d. and e_i is a long-range dependent Gaussian sequence such that γ_e(k)∼c_γ k^{2d−1}. Both sequences are assumed to be independent of each other. Assume furthermore that m belongs to a Besov space as in Definition 7.8 with λ≥1, where \(r>\max\{\frac{1}{\lambda},\frac{1}{2}\}\). Then

$$E \bigl( \Vert\hat{m}-m\Vert_{L^{\nu}(p)}^{\nu} \bigr) \leq Cn^{-\frac{\nu }{2}\,\gamma}(\log n)^{\kappa}, $$

where

$$\gamma=\left \{ \begin{array} {l@{\quad}l}\alpha_{D} & \textit{if}\ \alpha>\alpha_{D}\ \textit{and}\ r>\frac {\nu-\pi}{2\pi},\ \textit{dense phase}; \\ \alpha_{S} & \textit{if}\ \alpha>\alpha_{S}\ \textit{and}\ \frac {1}{\pi}<r<\frac{p-\pi}{2\pi},\ \textit{sparse phase}; \\ \alpha & \textit{if}\ \alpha\leq\min(\alpha_{S},\alpha_{D}),\ \textit{LRD phase}, \end{array} \right . $$

α S , α D are given in (7.213), and κ>0. If α=1, then the LRD phase is not relevant.

The proof is based on the M/L technique, as discussed before in the context of random-design regression. The main tool is a large deviation inequality for LRD processes. Informally speaking, LRD appears at low resolution levels only and is suppressed by the additional threshold term.

Furthermore, as in the case of kernel estimators, the rates of convergence improve when one considers estimation of the shape function m^∗(t)=m(t)−E(m(X_1)).

To get full adaptiveness F(⋅) has to be replaced by its empirical counterpart F n (⋅). The results of Theorem 7.35 continue to hold. However, the highest resolution level must be chosen according to \(2^{J}\sim\sqrt{n/\log n}\).

The results in Theorem 7.35 are optimal. In other words, it is not possible to find estimators that achieve better rates of convergence.

7.6 Estimation of Time Dependent Distribution Functions and Quantiles

Limit theorems for empirical quantiles of stationary long-memory processes, and their direct application to quantile estimation have been discussed in Sect. 4.8.2.1. Here we consider the more complicated situation where quantiles may change with time. The approach introduced in the following is nonparametric.

Consider time series observations Y_1,Y_2,…,Y_n such that Y_i=G(Z_i,t_i) where t_i=i/n are rescaled times and {Z_i, i=1,2,…} is a zero mean stationary Gaussian process with long memory. The function G(⋅,t) is assumed to be an unknown square integrable function (with respect to the N(0,1) density). As for the Gaussian process Z_i, we assume that

$$\mathit{cov}(Z_{i},Z_{i+k})=\gamma(k)\sim C|k|^{2H-2}, \quad \text{as}\ |k|\rightarrow\infty, $$

H being the long-memory parameter with 1/2<H<1 and C a positive constant. For \(y\in\mathbb{R}\), t_i=i/n, define the cumulative distribution function of Y at rescaled time t_i to be

$$F_{t_{i}}(y)=P(Y_{i}\leq y). $$

For simplicity of arguments, let F t , t∈(0,1) be continuous with a probability density function f t defined by

$$f_{t}(y)={\frac{\partial}{\partial y}}F_{t}(y). $$

The problem is the nonparametric estimation of F t (⋅), t∈(0,1) and consequently the estimation of the α-quantile (0<α<1)

$$\theta_{t}(\alpha)=\underset{y}{\operatorname{inf}}\bigl \{y|F_{t}(y)\geq\alpha\bigr\}, $$

and deriving asymptotic confidence bands for these functions. The results summarized in this section can be found in Ghosh et al. (1997). As for applicability of these ideas, estimation and prediction of the time dependent probability function F t (y) can be of practical relevance in various situations. For instance, if Y i is precipitation at time i (rescaled time t i ), then 1−F t (y) is the probability that the amount of rain at time t will exceed a previously specified level y, having implications for regions where heavy rainfall is the primary factor leading to floods. Equivalently, quantile functions may be considered. Very low values of θ t (α) for low α may be indicative of a drought, also having serious implications for agriculture.

The time dependent Gaussian subordination model considered here is a model for processes that are nonstationary in the sense that the marginal distribution function may change with time. Moreover, the distribution may be Gaussian or non-Gaussian. Some simple examples are:

  1. (i)

     Y i =μ(t i )+σ(t i )Z i , where μ and σ are real-valued functions;

  2. (ii)

     \(Y_{i}=\mu_{1}(t_{i})Z_{i}^{2}+\mu_{2}(t_{i})Z_{i}^{3}\) where μ 1 and μ 2 are real-valued functions;

  3. (iii)

     Y i =1{Z i <z}−P(Z i <z), \(z\in\mathbb{R}\), etc.

Let K(u), u∈(−1,1) be a symmetric probability density function on (−1,1). Also let b_n=b be a sequence of bandwidths such that b→0 and nb^3→∞ as n→∞. Define the Priestley–Chao estimator

$$\widehat{F}_{t}(y)={\frac{1}{nb}}\sum _{i=1}^{n}K \biggl( {\frac{t_{i}-t}{b}} \biggr) I_{i}(y) $$

where

$$I_{i}(y)=1\quad \text{if}\ Y_{i}\leq y\quad \text{and}\quad I_{i}(y)=0\quad \text{otherwise.}$$

Since the indicator function I i (y) is a function of Y i , it is also Gaussian subordinated. We assume that the following Hermite polynomial expansion holds

$$I_{i}(y)-P ( Y_{i}\leq y ) =\sum _{l=m}^{\infty}{\frac{c_{l}(t_{i},y)}{l!}}H_{l}(Z_{i}). $$

In the above expansion, m is the Hermite rank of G, the functions c_l are the Hermite coefficients, and H_l denotes the Hermite polynomial of degree l. Note that when H>1−1/(2m), the sequence I_i(y)−P(Y_i≤y), i=1,2,…, will have long memory.
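A minimal numerical version of the estimator \(\widehat{F}_{t}(y)\) defined above reads as follows; the rectangular kernel is just one admissible choice of K.

```python
import numpy as np

def F_hat(t, y, Y, b):
    """Priestley-Chao estimator of the time dependent c.d.f. F_t(y).

    Y : observed series Y_1, ..., Y_n at rescaled times t_i = i/n
    b : bandwidth with b -> 0 and n*b**3 -> infinity
    """
    n = len(Y)
    ti = np.arange(1, n + 1) / n
    K = 0.5 * (np.abs((ti - t) / b) < 1.0)   # rectangular density on (-1, 1)
    I = (Y <= y).astype(float)               # indicators I_i(y)
    return (K * I).sum() / (n * b)
```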

Theorem 7.36

Under the conditions stated above for H>1−1/(2m) and under further regularity conditions on the Hermite coefficients and assuming that the distribution function F t (y) is twice differentiable with respect to t, for fixed t and y and as n→∞, \(\widehat{F}_{t}(y)\) will have the following asymptotic properties:

where

Proof

We have,

$$E \bigl[ \widehat{F}_{t}(y) \bigr] ={\frac{1}{nb}}\sum _{i=1}^{n}K \biggl( {\frac{t_{i}-t}{b}} \biggr) E \bigl[ I_{i}(y) \bigr] ={\frac{1}{nb}}\sum_{i=1}^{n}K \biggl( { \frac{t_{i}-t}{b}} \biggr) F_{t_{i}}(y). $$

The proof for bias of \(\widehat{F}_{t}(y)\) then follows by a Taylor series expansion of \(F_{t_{i}}(y)\) around t and by noting that as n→∞,

$$\Biggl \vert {\frac{1}{nb}}\sum_{i=1}^{n} \biggl( {\frac{t_{i}-t}{b}} \biggr)^{p}K \biggl( { \frac{t_{i}-t}{b}} \biggr) -\int_{-1}^{1}u^{p}K(u)\,du \Biggr \vert =O \biggl( {\frac{1}{nb}} \biggr) $$

where p is a positive integer, and also \(O({\frac{1}{nb}})=o(b^{2})\) since nb 3→∞. Moreover, since K is a symmetric probability density function, \(\int_{-1}^{1}u^{p}K(u)\,du\) equals 1 when p=0 and equals 0 when p is odd.

As for the variance, since \(\mathit{cov} [ H_{l_{1}}(Z_{i}),H_{l_{2}}(Z_{j}) ] =0\) if l_1≠l_2 and equals l![γ(i−j)]^l if l_1=l_2=l,

The last step follows since ∑_{i,j}|i−j|^{l(2H−2)} diverges as n→∞. Now using a one-term Taylor series expansion of c_l(t_i) and c_l(t_j) around t and due to the convergence of the Riemann sums involving the kernel K, the expression for the variance follows. The formula for the mean squared error (MSE) follows by definition. □

By differentiating the asymptotic expression for the MSE with respect to b, a formula for an optimal bandwidth for estimating F t (y) can be derived as

$$b_{t}^{(\mathrm{opt})}(y) = Q_{t}(y) \times n^{m(2H-2)/(4+m(2-2H))}$$

where

$$Q_{t}(y)= \biggl[ {\frac{m(2-2H)B(t,y) }{4A^{2}(t,y)}} \biggr]^{1/[4+m(2-2H)]}. $$

Thus, for instance, when m=1 and H≈1/2, \(b_{t}^{(\mathrm{opt})}(y) \propto n^{-1/5}\). As H moves away from 0.5 and approaches 1, \(b_{t}^{(\mathrm{opt})}(y)\) becomes large as well. This has to do with the fact that long memory creates an apparent smoothness in the data as a result of which larger bandwidths suffice for optimum smoothing.
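The bandwidth exponent above is easy to tabulate; the small helper below (a name of our choosing) returns it as a function of H and m and reproduces −1/5 for m=1 and H→1/2.

```python
def bandwidth_exponent(H, m=1):
    # Exponent in b_t^(opt)(y) = Q_t(y) * n**exponent
    return m * (2.0 * H - 2.0) / (4.0 + m * (2.0 - 2.0 * H))
```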

The quantile function θ_t(α) for a given α can be estimated by inverting the estimated distribution function \(\widehat{F}_{t}(y), \ y \in\mathbb{R}\) as follows:

$$\hat{\theta}_{t}(\alpha)=\underset{y}{\operatorname{inf}}\bigl\{y|\widehat{F}_{t}(y)\geq\alpha\bigr\}. $$

It turns out that the estimator \(\hat{\theta}_{t}\) inherits the asymptotic properties of \(\widehat{F}_{t}\). Specifically, we have the following result:

Theorem 7.37

Let θ t (α) be unique and f t (θ t (α))>0. Then,

Proof

For additional information, refer to Rao (1973, Chap. 6f.2) and Serfling (1980, Chap. 2.3). First of all, as n→∞, \(\hat{\theta}_{t}(\alpha) \to {\theta}_{t}(\alpha)\) in probability. Secondly, as in Pollard (1984, p. 98),

$$(nb)^{m(2-2H)} \bigl[ \hat{\theta}_{t}(\alpha)-{ \theta}_{t}(\alpha) \bigr] ={\frac{-(nb)^{m(2-2H)} [ \widehat{F}_{t}(\hat{\theta}_{t}(\alpha ))-F_{t}(\hat{\theta}_{t}(\alpha)) ] -o_{p}(1) }{f_{t}({\theta}_{t}(\alpha))+o_{p}(1)}}.$$

The result follows from the continuous mapping theorem. □

Remark

It is easy to see that the asymptotically optimal local bandwidth that minimizes the leading term in the MSE of \(\hat{\theta }_{t}(\alpha)\) (term inside the square brackets) is the same as the bandwidth needed for the estimation of F t (θ t (α)).

Under the condition that the Hermite rank of the function G is equal to 1, we have the following central limit theorem:

Theorem 7.38

Let m=1.

  1. (a)

    CLT for \(\widehat{F}_{t_{i}}(y)\): Let \(y\in\mathbb{R}\), k≥1 and \(t_{1}^{0}<t_{2}^{0}<\cdots<t_{k}^{0}\) (with \(t_{i}^{0}\in(0,1)\)) be fixed. Define

    $$U_{i,n}=(nb)^{1-H}{\frac{ [ \widehat{F}_{t_{i}}(y)-F_{t_{i}}(y)-b^{2}A(t_{i},y) ] }{\sqrt{B(t_{i},y)}}},\quad t_{i}=t_{i_{n}}=i_{n}/n $$

    with \(t_{i}\rightarrow t_{i}^{0}\) (i=1,2,…,k) as n→∞. Then as n→∞, the random vector

    $$\mathbf{U}_{n}=(U_{1,n},U_{2,n}, \ldots,U_{k,n})^{T}$$

    converges in distribution to \(\mathbf{Z}^{u}=(Z_{1}^{u},Z_{2}^{u},\ldots ,Z_{k}^{u})^{T}\) where \(Z_{i}^{u}\), i=1,2,…,k are independent and identically distributed standard normal random variables.

  2. (b)

    CLT for \(\hat{\theta}_{t_{i}}(\alpha)\): Let α∈(0,1) and k≥1 be fixed, and \(t_{i}^{0}\) as before. Define

    $$\begin{aligned} &W_{i,n}= (nb)^{1-H}{\frac{ [ \hat{\theta}_{t_{i}}(\alpha)-{\theta}_{t_{i}}(\alpha)-b^{2}A(t_{i},{\theta}_{t_{i}}(\alpha))/f_{t_{i}}(\theta_{t_{i}}(\alpha)) ] }{\sqrt{B(t_{i},\theta_{t_{i}}(\alpha))}/f_{t_{i}}(\theta_{t_{i}}(\alpha))}}, \\ &\quad t_{i}=t_{i_{n}}=i_{n}/n \end{aligned}$$

    with \(t_{i_{n}}\) as above. Then as n→∞, the random vector

    $$\mathbf{W}_{n}=(W_{1,n},W_{2,n}, \ldots,W_{k,n})^{T}$$

    converges in distribution to \(\mathbf{Z}^{w}=(Z_{1}^{w},Z_{2}^{w},\ldots ,Z_{k}^{w})^{^{T}}\) where \(Z_{i}^{w}\), i=1,2,…,k are independent and identically distributed standard normal random variables.

Proof

(a) Due to Theorem 7.36, as n→∞, for each t∈(0,1)

$$(nb)^{1-H}\bigl|\widehat{F}_{t}(y) - F_{t}(y) - b^{2} A(t,y) - R_{n}(t,y)\bigr| \to0 $$

in probability, where

$$R_{n}(t,y) = (nb)^{-1} \sum_{i=1}^{n} K \biggl( {\frac{t_{i} -t }{b}} \biggr) c_{1}(t_{i},y)Z_{i}. $$

Note that (nb)1−H R n (t,y) has a normal distribution because it is a linear combination of standard normal random variables that are also jointly normal. Also, \(\mathit{cov} ( (nb)^{1-H}\widehat{F}_{t}(y), (nb)^{1-H}\widehat {F}_{s}(y) ) \) for ts converges to zero in probability. The result follows by considering the sequence of random vectors U n and Theorem 7.36(i) in Csörgő and Mielniczuk (1995a).

(b) The proof follows from (a) above and the arguments of Theorem 7.37(b). □

7.7 Partial Linear Models

A partial linear model is a semiparametric regression model containing a nonparametric as well as a linear parametric regression component. An example is as follows:

$$y(i)=\mathbf{x}^{T}(i)\mathbf{\beta}+\mu(t_{i})+ \varepsilon (i) $$

where y(i), i=1,2,…,n is an observation on the dependent variable y, x T(i) is a (row) vector of explanatory variables

$$\mathbf{x}^{T}(i)=\bigl(x_{1}(i),x_{2}(i), \ldots,x_{p}(i)\bigr),\quad p\geq1, $$

\(\mathbf{\beta}\) is a (column) vector of regression parameters

$$\mathbf{\beta}^{T}=(\beta_{1},\beta_{2},\ldots, \beta_{p}) $$

and t i =i/n is rescaled time. The nonparametric component μ is an unknown but smooth function in C 2[0,1] whereas ε(i) is the error term with zero mean. Of special interest is the case when ε(i) is a stationary long-memory process. Specifically, let ε(i) have a covariance function γ ε and a spectral density f ε

where as usual ∼ means that the left-hand side divided by the right-hand side converges to one, c ε is a positive constant and \(0\leq d_{\varepsilon }<\frac{1}{2}\). Let E(εε T)=Γ ε,n =Γ ε =[γ ε (ij)] i,j=1,2,…,n . The uncorrelated case, namely when \(\mathbf{\beta}\) and μ are unknown but the errors are uncorrelated, is considered in Speckman (1988). He suggests a \(\sqrt{n}\)-consistent estimator for \(\mathbf{\beta}\) under the assumption that also the explanatory variables contain a rough component. Beran and Ghosh (1998) examine Speckman’s method of estimation under long-memory in the errors. As it turns out, even under long-memory, a \(\sqrt{n}\)-rate of convergence of the slope estimates can be achieved. In this section, we take a closer look at some of these results.

To start with, we set our notation: we observe (x^T(i),y(i)) at time points i=1,2,…,n. Using vector notation, we define

Let the n×p full design matrix be

$$\mathbf{X}=\mathbf{M}+\mathbf{\eta}$$

where M is a deterministic matrix of order n×p and \(\mathbf{\eta}\) is a random matrix, its elements being zero mean random variables. The ith row of X is x T(i), the columns of M are (m 1,m 2,…,m p ),

$$\mathbf{m}_{j}^{T}=\bigl(m_{j}(t_{1}),m_{j}(t_{2}), \ldots,m_{j}(t_{n})\bigr),\quad j=1,2,\ldots,p $$

whereas the ith row of M is

$$\bigl(m_{1}(t_{i}),m_{2}(t_{i}), \ldots,m_{p}(t_{i})\bigr),\quad i=1,2,\ldots,n. $$

The functions m j (⋅) are in C 2[0,1]. The columns of the random matrix η are denoted by e j , i.e.

$$\mathbf{\eta}=(\mathbf{e}_{1},\mathbf{e}_{2},\ldots, \mathbf{e}_{p}) $$

where

$$\mathbf{e}_{j}^{T}=\bigl(e_{j}(1),e_{j}(2), \dots,e_{j}(n)\bigr),\quad j=1,2,\ldots,p, $$

rows are given by

$$\mathbf{e}^{T}(i)=\bigl(e_{1}(i),e_{2}(i), \ldots,e_{p}(i)\bigr). $$

The random “error” terms in X are assumed to have the following properties: \(\mathbf{\eta}\) is independent of \(\mathbf{\varepsilon }\). As for the covariances,

where \(c_{e_{j}}\) is a positive constant and \(0\leq d_{e_{j}}<\frac{1}{2}\). Let σ e (j,l)=Cov(e j (i),e l (i)) so that the p×p matrix of zero-lag cross-covariances is E(e(i)e T(i))=Γ e =[σ e (j,l)] j,l=1,2,…,p . The partial linear model is then of the form

$$\mathbf{y}=\mathbf{X}\mathbf{\beta}+\mathbf{\mu}+\mathbf{\varepsilon } =\mathbf{M\beta}+\mathbf{\eta\beta}+\mathbf{\mu}+\mathbf{\varepsilon }. $$

In the above formula, M β+μ is deterministic whereas \(\mathbf{\eta\beta}+\mathbf{\varepsilon }\) is random. The main idea is to smooth the values of y to obtain an estimate of the deterministic part and consequently an estimate of the error. Similarly, the error in X can be estimated by detrending the data series containing the values of the explanatory variables. These error estimates are then used in a regression model to recover \(\mathbf{\beta}\). For instance, consider the Nadaraya–Watson kernel (see Gasser et al. 1985)

$$K(t_{i},t_{j},n,b)={\frac{w ( {\frac{t_{i}-t_{j}}{b}} ) }{\sum_{l=1}^{n}w ( { \frac{t_{i}-t_{l}}{b}} ) }}$$

and define the kernel matrix

$$\mathbf{K}=\bigl[K(t_{i},t_{j},n,b)\bigr]_{i,j=1,2,\ldots,n}.$$

Here b is a bandwidth satisfying in particular that as n→∞, b→0, nb→∞, and w is a bounded, non-negative, symmetric and piecewise continuous function with support [−1,1] such that \(\int_{-1}^{1}w(s)\,ds=1\). Additional conditions on b that are used to prove the asymptotic results concerning the estimated slope are in Beran and Ghosh (1998).

Define the residuals

$$\tilde{\mathbf{X}}=(\mathbf{I}-\mathbf{K})\mathbf{X},\qquad \tilde{\mathbf{y}}= (\mathbf{I}-\mathbf{K})\mathbf{y}. $$

Then the semiparametric regression estimate of the slope parameter \(\mathbf{\beta}\) can be given by

$$\hat{\beta}= \bigl( \tilde{\mathbf{X}}^{T}\tilde{\mathbf{X}} \bigr)^{-1}\tilde{\mathbf{X}}^{T}\tilde{\mathbf{y}}.$$
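
To make the two-step construction concrete, the following sketch (with illustrative function names that do not come from any particular package) detrends y and the columns of X with the same kernel smoother and then computes the slope estimate; the rectangular kernel \(w(s)=\frac{1}{2}1\{\vert s\vert \leq1\}\) is one admissible choice satisfying the conditions stated below.

import numpy as np

def kernel_matrix(t, b):
    # Weight matrix K for rescaled times t_i = i/n, using the rectangular
    # kernel w(s) = 1/2 on [-1, 1]; each row is normalized to sum to one.
    u = (t[:, None] - t[None, :]) / b
    w = 0.5 * (np.abs(u) <= 1)
    return w / w.sum(axis=1, keepdims=True)

def partial_linear_slope(y, X, b):
    # Speckman-type estimate of beta in y = X beta + mu(t) + eps:
    # regress the kernel-detrended response on the kernel-detrended covariates.
    n = len(y)
    t = np.arange(1, n + 1) / n
    K = kernel_matrix(t, b)
    X_tilde = X - K @ X
    y_tilde = y - K @ y
    beta_hat = np.linalg.solve(X_tilde.T @ X_tilde, X_tilde.T @ y_tilde)
    mu_hat = K @ (y - X @ beta_hat)   # estimate of the nonparametric component
    return beta_hat, mu_hat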

In addition to the conditions stated earlier, let, as n→∞,

$$n\bigl(\mathbf{\eta}^{T}\mathbf{\eta}\bigr)^{-1}\mathbf{ \eta}^{T}\varSigma_{\mathbf{\varepsilon }}\mathbf{\eta}\bigl(\mathbf{ \eta}^{T}\mathbf{\eta}\bigr)^{-1}\rightarrow \mathbf{A}$$

almost surely, and

$$\sqrt{n}\bigl(\mathbf{\eta}^{T}\mathbf{\eta}\bigr)^{-1} \mathbf{\eta}^{T}\mathbf{\varepsilon }\rightarrow N(0,\mathbf{A}) $$

in distribution where N(0,A) denotes a p-variate normal distribution with zero mean and covariance matrix A. These conditions ensure that \(\mathbf{\beta}\) can be estimated with \(\sqrt{n}\)-convergence. For sufficient conditions for these to hold, see Sect. 7.2 (and in particular Yajima 1991 and Künsch et al. 1993). Under the conditions stated above, the following asymptotic results can be derived.

Theorem 7.39

Let \(d_{0}=\max_{j=1,\ldots,p}\,d_{e_{j}}\). Then as n→∞, conditionally on X,

Note in particular that asymptotically the bias is of a smaller order than the variance. For the proof of the theorem and additional technical conditions on the bandwidth, see Beran and Ghosh (1998). In applications, the covariance matrix A would have to be estimated. These authors recommend fitting a parametric model \(f_{\mathbf{\varepsilon }}(\lambda;\hat{\theta})\) for the spectral density to the residuals \(\hat{\varepsilon }(i)=\tilde{y}(i)-\tilde{\mathbf{x}}^{T}(i)\hat{\mathbf{\beta}}\) and setting \(\hat{\varGamma}_{\mathbf{\varepsilon }} =\varGamma_{\mathbf{\varepsilon }}(\hat{\theta})\). For an extension of these results to testing for partial linear models with long memory, see Aneiros-Pérez et al. (2004).

7.8 Inference for Locally Stationary Processes

7.8.1 Introduction

In this short section, we discuss estimation for locally stationary long-memory processes. In the context of weakly dependent processes, the mathematical background stems from Dahlhaus (1997) (also see, e.g. Priestley 1981 for earlier references). In a long-memory setting, the general idea is that the long-memory parameter is treated as a smooth function of time (that is, the dependence parameter becomes a curve). Specifically, Whitcher and Jensen (2000) propose locally stationary ARFIMA processes. Ghosh et al. (1997) consider subordinated locally stationary Gaussian processes in the context of quantile estimation. Asymptotic theory for estimators of the “dependence curves” is presented in Beran (2009). The results use tools from kernel regression, as discussed before in Sect. 7.4. Roueff and von Sachs (2011) discuss estimation for locally stationary processes using wavelet methods.

The motivation for considering locally stationary processes is the observation that often time series appear to be stationary when one looks at short time periods; however, in the long run, the structure changes. If changes are not abrupt, then such data can be modelled by the so-called locally stationary processes. The general idea is that the probabilistic structure of the process changes smoothly in time such that locally the series are stationary in a first approximation. In engineering, this idea has been used long before exact mathematical definitions of local stationarity were introduced. A systematic mathematical approach was initiated by pioneering contributions of Subba Rao (1970), Hallin (1978) and Priestley (1981), followed by Dahlhaus (1997) who developed a general theory based on an exact definition of locally stationary processes in terms of their spectral representation \(X_{t}=\int e^{it\lambda}A(e^{-i\lambda};u_{t,n})\,dM_{\varepsilon}(\lambda)\) where M ε is the spectral measure of white noise, u t,n =t/n and A depends (smoothly) on rescaled time u t,n . More exactly, we have a sequence of processes

$$ X_{t,n}=\int_{-\pi}^{\pi}e^{it\lambda}A_{t,n}^{0} \bigl( e^{-i\lambda};\theta(u_{t,n}) \bigr)\,dM_{\varepsilon}(\lambda ) $$
(7.214)

with transfer functions \(A_{t,n}^{0}(e^{-i\lambda};\theta)\) such that

$$ \sup_{\lambda\in[-\pi,\pi],t=1,2,\ldots,n}\bigl \vert A_{t,n}^{0}\bigl(e^{-i\lambda};\theta(u_{t,n})\bigr)-A\bigl(e^{-i\lambda}; \theta(u_{t,n})\bigr)\bigr \vert \leq Cn^{-1}$$
(7.215)

for all n, some constant C and a certain transfer function A(e ;θ). This definition allows for changes in the linear dependence structure. As an alternative definition that also includes the possibility of changes in the spectral measure dM ε (⋅), Ghosh et al. (1997) and Ghosh and Draghicescu (2002a, 2002b) propose using the concept of subordination, defining X t,n =G(ζ t ;u n ) where ζ t is a stationary process and G(⋅;u) is a smooth function of u. In the following, we discuss inference for processes that are locally stationary in the sense of definition (7.214).

In the context of long-memory processes, changes in the long-memory parameter d are of particular interest. Numerous data examples are reported in the literature where d may be changing in time (see, e.g. Vesilo and Chan 1996; Whitcher and Jensen 2000; Whitcher et al. 2000, 2002; Lavielle and Ludena 2000; Ray and Tsay 2002; Granger and Hyung 2004; Falconer and Fernandez 2007). This motivated Whitcher and Jensen (2000) to consider locally stationary fractional ARIMA (FARIMA) processes. Optimal fitting of parameters in locally stationary long-memory processes is discussed in Beran (2009). An example is plotted in Figs. 7.15(a)–(b). After subtracting the nonparametric trend (see the nonlinear line in Fig. 7.15(a)), estimated values of d based on moving (overlapping) blocks of 175 years are plotted against the year in the middle of each block. The plot indicates that long memory is stronger for the initial measurements and then declines to a lower level.

Fig. 7.15 (a) Central England temperature series with fitted linear and nonparametric trend function respectively; (b) local maximum likelihood estimates of d for detrended series, based on moving blocks of 176 years and a fractional ARIMA(0,d,0) model

7.8.2 Optimal Estimation for Locally Stationary Processes

In the following, we consider a locally stationary long-memory model of the following form. Define a sequence of processes X t,n with a time-varying infinite autoregressive representation given by

$$ X_{t,n}=\sum_{j=1}^{\infty}b_{j,n}X_{t-j,n}+ \varepsilon_{t} $$
(7.216)

where ε t are i.i.d. zero-mean random variables with finite variance \(\sigma_{\varepsilon}^{2}=\sigma_{\varepsilon}^{2}(u_{n})\) (u n =t/n) and coefficients b j,n =b j (θ(u n )). For fixed u, it is assumed that \(d(u)\in(0,\frac{1}{2})\) and the coefficients are such that

(7.217)
(7.218)

where c b , c f are positive constants. Specifically, we may consider a locally stationary fractional ARIMA(p,d,q) process. Then \(c_{f} ( u ) =\sigma_{\varepsilon}^{2} ( u ) /(2\pi)\) and for \(z\in\mathbb{C}\), with |z|≤1 and z≠1,

$$ 1-\sum_{j=1}^{\infty}b_{j}\bigl( \theta ( u ) \bigr)z^{j}=\varphi (z;u)\psi^{-1}(z;u) ( 1-z )^{d ( u ) }$$
(7.219)

where θ(u)=[d(u),φ 1(u),…,φ p (u),ψ 1(u),…,ψ q (u)]T,

$$ \varphi (z;u)=1-\sum_{j=1}^{p}\varphi_{j}(u)z^{j}, $$
(7.220)
$$ \psi(z;u)=1+\sum_{j=1}^{q}\psi_{j}(u)z^{j}. $$
(7.221)

Separating σ ε from the other parameters in the spectral representation, we can write

$$ X_{t,n}=\sigma_{\varepsilon}(u_{t,n})\int _{-\pi}^{\pi}e^{it\lambda}A_{t,n}^{0} \bigl( e^{-i\lambda};\theta(u_{t,n}) \bigr)\,dM_{\varepsilon}(\lambda) $$
(7.222)

with

$$ A_{t,n}^{0} \bigl( z;\theta(u) \bigr) =\frac{\psi(z;u)}{\varphi(z;u)} ( 1-z )^{-d ( u ) }. $$
(7.223)

Let θ 0(u) denote the true parameter function, and X t,n a locally stationary FARIMA process. In general, the shape of θ 0(⋅) is unknown. Under smoothness conditions, estimation of θ 0(⋅) can be done in a similar manner as regression smoothing. Suppose we would like to estimate θ 0 at a fixed rescaled time point u 0∈(0,1). A natural approach is to apply quasi-maximum likelihood estimation based on time points in a small neighbourhood of u 0. Using the Gaussian likelihood, this is essentially equivalent to local minimization of the sum of squared residuals estimated from (7.216). Thus, let t 0(n)=[nu 0], \(u_{t_{0},n}=t_{0}(n)/n\). Given a kernel function K≥0 with K(−x)=K(x), K(x)=0 (|x|>1) and ∫K(x) dx=1, a kernel estimate of θ 0(u 0) minimizes

$$ \sum_{t=t_{0}(n)-[nb]}^{t_{0}(n)+[nb]}K \biggl( \frac{t_{0}(n)-t}{nb} \biggr) \bigl( \varepsilon_{t}^{\ast}(\theta) \bigr)^{2} $$
(7.224)

or solves the equation

$$ \sum_{t=t_{0}(n)-[nb]}^{t_{0}(n)+[nb]}K \biggl( \frac{t_{0}(n)-t}{nb} \biggr) \dot{\varepsilon}_{t}^{\ast}(\theta)\varepsilon_{t}^{\ast}(\theta)=0 $$
(7.225)

where

$$ \varepsilon_{t}^{\ast}(\theta)=X_{t}-\sum _{j=1}^{t-1}b_{j}(\theta )X_{t-j},\qquad \dot{\varepsilon}_{t}^{\ast}(\theta)= \frac{\partial}{\partial \theta}\varepsilon_{t}^{\ast}(\theta)=-\sum _{j=1}^{t-1}\dot{b}_{j}(\theta)X_{t-j}$$
(7.226)

are approximations of

$$ \varepsilon_{t}(\theta)=X_{t}-\sum _{j=1}^{\infty}b_{j}(\theta)X_{t-j}$$
(7.227)

and

$$ \dot{\varepsilon}_{t}(\theta)=-\sum_{j=1}^{\infty} \dot{b}_{j}(\theta)X_{t-j},$$
(7.228)

respectively, and \(\dot{b}_{j}=\partial/\partial\theta b_{j}\in\mathbb{R}^{p+q+1}\). The asymptotic distribution of \(\hat{\theta}(u_{0})\) was derived in Beran (2009) in an analogous manner as for stationary processes. The same result was later also shown to hold for the local Whittle estimator (Palma and Olea 2010).
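
To illustrate the localized criterion in the simplest case, the following sketch estimates d(u 0) for a locally stationary FARIMA(0,d,0) model by minimizing the sum of squared residuals \(\varepsilon_{t}^{\ast}(d)\) over the window \(t_{0}\pm[nb]\); with the rectangular kernel the weights are constant on the window and can be dropped. The function names are illustrative, and scipy is used only for the one-dimensional minimization.

import numpy as np
from scipy.optimize import minimize_scalar

def ar_inf_coeffs(d, m):
    # Coefficients pi_j of (1 - z)^d, j = 0, ..., m:
    # pi_0 = 1, pi_j = pi_{j-1} * (j - 1 - d) / j.
    pi = np.empty(m + 1)
    pi[0] = 1.0
    for j in range(1, m + 1):
        pi[j] = pi[j - 1] * (j - 1 - d) / j
    return pi

def local_d_estimate(x, u0, b):
    # Local least-squares estimate of d(u0) for FARIMA(0,d,0):
    # minimize the sum of eps*_t(d)^2 over the window around t0 = [n u0].
    n = len(x)
    t0 = int(n * u0)
    lo, hi = max(1, t0 - int(n * b)), min(n, t0 + int(n * b))
    def rss(d):
        pi = ar_inf_coeffs(d, hi)
        # eps*_t(d) = sum_{j=0}^{t-1} pi_j x_{t-j} (truncated AR(infinity) residual)
        return sum(np.dot(pi[:t + 1], x[t::-1]) ** 2 for t in range(lo, hi))
    return minimize_scalar(rss, bounds=(0.01, 0.49), method="bounded").x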

Theorem 7.40

Let X t,n be a locally stationary FARIMA process defined by (7.222) and (7.223) and let u 0∈(0,1). Moreover, assume that, as n tends to infinity, b→0 and nb 3→∞. Then, under regularity assumptions and moment conditions (see Beran 2009), there is a sequence of solutions \(\hat{\theta}_{n}\) of (7.225) such that \(\hat{\theta}_{n}\rightarrow\theta^{0}(u_{0})\) in probability. Moreover,

$$ \sqrt{nb}\bigl(\hat{\theta}_{n}-E(\hat{\theta}_{n})\bigr) \rightarrow_{d}N(0,V) $$
(7.229)

where

$$ V=J^{-1}\bigl(\theta^{0}\bigr)\int_{-1}^{1}K^{2}(x)\,dx $$
(7.230)

with

$$ J\bigl(\theta^{0}\bigr)= \biggl[ \frac{1}{4\pi}\int _{-\pi}^{\pi}\frac{\partial}{\partial \theta_{r}}\log g\bigl(\lambda;\theta^{0}\bigr) \frac{\partial}{\partial \theta_{s}}\log g\bigl(\lambda;\theta^{0}\bigr)\,d\lambda \biggr]_{r,s=1,\ldots,k}$$
(7.231)

and \(g ( \lambda;\theta(u_{t,n}) ) =|A_{t,n}^{0}(e^{-i\lambda };\theta(u_{t,n}))|^{2}\).

Once the estimate of θ 0(u 0) is given, \(\sigma_{\varepsilon}^{2}(u_{0})\) can be estimated by

$$ \hat{\sigma}_{\varepsilon}^{2}(u_{0})=\sum _{t=t_{0}-[nb]}^{t_{0}+[nb]}K \biggl( \frac{t_{0}(n)-t}{nb} \biggr) \bigl( \varepsilon_{t}^{\ast}(\hat{\theta }) \bigr)^{2}. $$
(7.232)

As in the stationary case, \(\hat{\sigma}_{\varepsilon}^{2}(u_{0})\) is asymptotically independent of \(\hat{\theta}\) and the asymptotic distribution of \(\hat{\theta}\) does not depend on \(\sigma_{\varepsilon}^{2}\).

Example 7.36

Let X t,n be a local fractional ARIMA(0,d,0) process. Then J=π 2/6 for any value of θ 0(u 0). The asymptotic variance of \(\sqrt{nb}(\hat{d}-d^{0}(u_{0}))\) is therefore nuisance parameter free. If we use, for instance, the rectangular kernel \(K(x)=\frac{1}{2}1\{\vert x\vert \leq1\}\), then \(\int K^{2} ( x)\, dx=\frac{1}{2}\) and

$$ V=\frac{6}{\pi^{2}}\frac{1}{2}=\frac{3}{\pi^{2}}\approx0.304. $$
(7.233)

The limit theorem cannot be used directly for inference about θ 0 because it refers to the deviation of \(\hat{\theta}\) from its expected value. What we would need instead is a result for \(\hat{\theta}-\theta^{0}\). As always in nonparametric smoothing, an asymptotic formula for the bias \(E ( \hat{\theta} ) -\theta^{0}\) is required. Since the order of the bias is not influenced by the dependence structure, we have \(E ( \hat{\theta} ) -\theta^{0}=O ( b^{2} ) \). Moreover, in contrast to nonparametric regression smoothing with long-memory errors, the rate of convergence of \(\hat{\theta}-E ( \hat{\theta} ) \) is the same as under independence. Therefore, the mean squared error \(E [ \Vert \hat{\theta} ( u_{0} ) -\theta^{0} ( u_{0} ) \Vert ^{2} ] \) can be approximated by the sum of a bias term of order O(b 4) and a variance term of order O((nb)−1), and the optimal bandwidth is of the order \(O ( n^{-\frac{1}{5}} ) \).

More specifically, suppose, for instance, that X t,n is a locally stationary fractional ARIMA(0,d,0) process. Then the optimal choice of b can be based on the following result.

Theorem 7.41

Let dC 2[0,1] and d′′(u 0)≠0. Then under regularity and moment assumptions (see Beran 2009), we have, as n→∞,

  1. 1.

    Bias:

    $$ E\bigl[\hat{d}(u_{0})\bigr]-d^{0}(u_{0})=b^{2} \frac{1}{2}\,d^{\prime\prime}(u_{0})\int _{-1}^{1}K(x)x^{2}\,dx+o \bigl(b^{2}\bigr); $$
    (7.234)
  2. 2.

    Variance:

    $$ \operatorname{var} \bigl( \hat{d}(u_{0}) \bigr) =(nb)^{-1}J^{-1}\int_{-1}^{1}K^{2}(x)\,dx+o \bigl( (nb)^{-1} \bigr) $$
    (7.235)
    $$ =(nb)^{-1}\frac{6}{\pi^{2}}\int_{-1}^{1}K^{2}(x)\,dx+o \bigl( (nb)^{-1} \bigr) ; $$
    (7.236)
  3. 3.

    Mean squared error:

    $$ \mathit{MSE}(\hat{d})=E\bigl[\bigl(\hat{d}-d^{0}\bigr)^{2} \bigr]=b^{4}C_{1}+(nb)^{-1}C_{2}+o \bigl \{ \max\bigl(b^{4},(nb)^{-1}\bigr) \bigr\} $$
    (7.237)

    with

    $$ C_{1}(u_{0})= \biggl[ \frac{1}{2}\,d^{\prime\prime}(u_{0}) \int_{-1}^{1}K(x)x^{2}\,dx \biggr]^{2}$$
    (7.238)

    and

    $$ C_{2}=J^{-1}\int_{-1}^{1}K^{2}(x)\,dx= \frac{6}{\pi^{2}}\int_{-1}^{1}K^{2}(x)\,dx. $$
    (7.239)

By minimizing the asymptotic expression (7.237) with respect to b, the asymptotically optimal bandwidth is of the form

$$ b_{\mathrm{opt}}(u_{0})=C_{\mathrm{opt}}(u_{0})n^{-1/5} $$
(7.240)

with

$$ C_{\mathrm{opt}}(u_{0})= \biggl[ \frac{C_{2}}{4C_{1}(u_{0})} \biggr]^{1/5}. $$
(7.241)

The resulting MSE is of the order O(n −4/5). This result is analogous to nonparametric regression with uncorrelated residuals. The reason is the \(\sqrt{n}\)-rate of convergence of \(\hat{\theta}\). The second derivative d′′ of the estimated d-curve influences the constant C opt. The stronger the curvature of d(u) at the point u 0, the smaller the locally optimal bandwidth b opt(u 0). Similar results are derived in Dahlhaus and Giraitis (1998) for locally stationary AR(p) processes. For practical purposes, one may prefer using a global bandwidth that minimizes the asymptotic integrated mean squared error. To avoid boundary effects, one may use the formula

$$ \mathit{IMSE}=b^{4}\int_{\delta}^{1-\delta}C_{1}(u)\,du+(nb)^{-1} \int_{\delta}^{1-\delta }C_{2}(u)\,du $$
(7.242)

where \(0<\delta<\frac{1}{2}\). The constant C opt in (7.240) has to be adjusted accordingly.

If the optimal bandwidth or a bandwidth of the same order is used, then inference about the curve d 0(u) has to take into account that the bias is of the same order as the standard deviation. This means that a bias correction has to be subtracted before using the bounds based on the CLT. An easier solution is to use a bandwidth that is of a slightly smaller order than O(n −1/5). This way one can avoid a bias correction. Approximate (1−α)-confidence intervals can then be given by

$$\hat{d} ( u_{0} ) \pm z_{1-\alpha/2}\frac{\sqrt{6}}{\pi} \biggl( \int_{-1}^{1}K^{2}(x)\,dx \biggr)^{\frac{1}{2}} ( nb )^{-\frac {1}{2}}. $$

In particular, for the rectangular kernel we have \(\int K^{2}\,dx=\frac{1}{2}\), so that the interval reduces to

$$\hat{d} ( u_{0} ) \pm z_{1-\alpha/2}\frac{\sqrt{3}}{\pi} ( nb )^{-\frac{1}{2}}. $$

Analogous formulas can be given for FARIMA(p,d,q) processes with p and q arbitrary. However, in general the optimal bandwidth and the confidence intervals are no longer parameter free.
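
As a small numerical illustration of the interval above (a sketch for the FARIMA(0,d,0) case with the rectangular kernel; the local estimate \(\hat{d}(u_{0})\), the sample size and the bandwidth are taken as given):

import math
from scipy.stats import norm

def ci_local_d(d_hat, n, b, alpha=0.05):
    # Approximate interval d_hat +/- z_{1-alpha/2} * sqrt(3)/pi * (nb)^(-1/2),
    # valid when the bandwidth is of smaller order than n^(-1/5) (negligible bias).
    z = norm.ppf(1 - alpha / 2)
    half = z * math.sqrt(3.0) / math.pi / math.sqrt(n * b)
    return d_hat - half, d_hat + half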

7.8.3 Computational Issues

In practice, the involved parameters and hence also C opt and b opt are unknown and have to be estimated. In the context of nonparametric regression with i.i.d. errors, various data driven methods for bandwidth choice are known (see, e.g. Gasser et al. 1991; Herrmann et al. 1992). Similar algorithms may be applied here. A possible solution to this problem is an iterative plug-in algorithm where one obtains initial parameter estimates using a first bandwidth. This yields new estimates of b opt so that one can again obtain new parameter estimates and so on. Beran (2009) suggests, for instance, the following algorithm for locally stationary fractional ARIMA(0,d,0) processes:

Algorithm 1

  • Step 1: Set j=0 and set b j equal to an initial bandwidth.

  • Step 2: Estimate d(⋅) using the bandwidth b j .

  • Step 3: For each u 0, fit a local polynomial regression \(\beta_{0}(u_{0})+\beta_{1}(u_{0})(u-u_{0})+\frac{1}{2}\beta_{2}(u_{0})(u-u_{0} )^{2}\) directly to \(\hat{d}(u)\) (plotted against u) using a suitable bandwidth b 2.

  • Step 4: For each u 0, set \(\hat{d}^{\prime\prime}(u_{0})=2\beta_{2}(u_{0})\), and calculate an estimate of C opt(u 0) (or a global value C opt minimizing the integrated mean squared error).

  • Step 5: Set j=j+1 and b j =C opt n −1/5. If b j and b j−1 are very similar (according to a specified criterion), go to Step 6. Otherwise go to Step 2.

  • Step 6: Fit a kernel regression with kernel K and bandwidth b j to \(\hat{d}(u)\) directly.

Note that the only purpose of Step 6 is to obtain a somewhat smoother curve, without changing the order of the mean squared error. This step is, however, not necessary. The algorithm can easily be generalized to FARIMA(p,d,q) or more general processes. To do so, one needs to define a suitable mean squared error criterion such as \(E [ \Vert \hat{\theta}-\theta \Vert ^{2} ] \) and plug \(\hat{\theta}\) into the asymptotic expression of the criterion. A more complicated algorithm has to be designed if one wants to combine optimal bandwidth selection with data driven choice of the AR- and MA-orders p and q. A proposal in the context of short-memory AR(p) processes is given in Van Bellegem and Dahlhaus (2006) under the assumption that p (which is unknown) remains constant. Note, however, that even in the AR(p) case the assumption that p is constant may not be reasonable. In view of the fact that even for stationary fractional ARIMA(p,d,q) processes choosing p and q in a data adaptive way is not easy (see, e.g. Sect. 5.5.6), the problem of including unknown orders p and q (which may also change in time) is far from trivial in the context of locally stationary processes. Alternatively, if the interest lies solely in estimating the long-memory curve d(u), a possibly more elegant solution is to apply a semiparametric method for estimating d(u) locally. This approach is discussed in Roueff and von Sachs (2011) where results on local wavelet estimation of d are obtained.
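
A rough sketch of the plug-in iteration behind Algorithm 1 for the FARIMA(0,d,0) case, assuming a routine local_d_estimate(x, u, b) as in the earlier sketch; the pilot bandwidths, the grid and the stopping rule are ad hoc choices, and the rectangular-kernel constants \(\int K(x)x^{2}\,dx=1/3\) and \(\int K^{2}(x)\,dx=1/2\) are used.

import numpy as np

def plugin_bandwidth(x, b_init=0.2, b2=0.3, n_grid=17, tol=1e-3, max_iter=10):
    # Iterative plug-in choice of a global bandwidth b_opt = [C2 / (4 C1)]^{1/5} n^{-1/5},
    # with C1 averaged over a grid of rescaled time points (a sketch, not the exact recipe).
    n = len(x)
    u_grid = np.linspace(0.1, 0.9, n_grid)
    C2 = (6.0 / np.pi ** 2) * 0.5
    b = b_init
    for _ in range(max_iter):
        d_hat = np.array([local_d_estimate(x, u, b) for u in u_grid])
        C1 = np.empty(n_grid)
        for k, u0 in enumerate(u_grid):
            m = np.abs(u_grid - u0) <= b2          # local quadratic fit to d_hat(u)
            Z = np.column_stack([np.ones(m.sum()),
                                 u_grid[m] - u0,
                                 (u_grid[m] - u0) ** 2])
            coef = np.linalg.lstsq(Z, d_hat[m], rcond=None)[0]
            d2 = 2.0 * coef[2]                     # estimated second derivative d''(u0)
            C1[k] = (0.5 * d2 / 3.0) ** 2
        b_new = (C2 / (4.0 * C1.mean() + 1e-12)) ** 0.2 * n ** (-0.2)
        if abs(b_new - b) < tol:
            break
        b = b_new
    return b_new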

7.9 Estimation and Testing for Change Points, Trends and Related Alternatives

7.9.1 Introduction

Modelling time series by locally stationary processes is closely related to change point detection and estimation. The main difference is that in change point analysis the emphasis is on abrupt changes. Changes can occur in any aspect of the probability distribution, but most frequently these are the expected value, the marginal distribution or the correlation structure. Here we consider such questions in the long-memory context. An additional issue is that sample paths of short-range dependent processes with change points may be almost indistinguishable from a stationary process with long-range dependence (see, e.g. Bhattacharya et al. 1983; Künsch 1986; Granger and Ding 1996; Teverovsky and Taqqu 1997; Hidalgo and Robinson 1996; Bai 1998; Krämer and Sibbertsen 2000; Mikosch and Starica 2000, 2004; Diebold and Inoue 2001; Granger and Hyung 2004; Davidson and Sibbertsen 2005, also see Sibbertsen 2004 and Banerjee and Urga 2005 and references therein). An important question is therefore how to distinguish “genuine” long memory from such models.

Change point analysis is a classical field of probability theory and statistics, and the literature is enormous (for an overview, see, e.g. Basseville and Nikiforov 1993; Csörgő and Horváth 1998 and references therein), even if we restrict attention to long-memory processes. In the following, some exemplary change point problems are discussed in the context of long-memory processes.

We start with change points in the mean. The standard approach is based on the so-called CUSUM statistics and the asymptotic results follow directly from the asymptotic behaviour of partial sums discussed in Sect. 4.2. In the long-memory context, CUSUM tests are discussed in Horváth and Kokoszka (1997).

Changes in the distribution are detected using empirical processes. In a weakly dependent situation, a sequential empirical process converges to a bivariate Gaussian process, the so-called Kiefer process. In the long-memory set-up the latter process has to be replaced by a process that is degenerate in one dimension and a fractional Brownian bridge in the other. Such results follow from Dehling and Taqqu (1989a, 1989b), see also Sect. 4.8.

Changes in the spectrum (i.e. in the linear dependence structure) are considered in Giraitis and Leipus (1992), Beran and Terrin (1994) and Horváth and Shao (1999), among others. In the last two papers, the dependence parameter before and after a potential change is estimated using Whittle’s estimator. Hence, the asymptotic distribution under the “no-change” assumption follows from results for quadratic forms.

Tests that distinguish between changes in the mean (as null hypothesis) and stationary long memory are also available. The best results in this direction are obtained in Berkes et al. (2006); further improvements are suggested in Baek and Pipiras (2011).

Finally, this section is concluded with the question of detecting so-called rapid change points. This notion refers to smooth but very fast changes in the mean. Results in the long-memory context and applications to paleoclimatology are discussed in Menéndez et al. (2010).

7.9.2 Changes in the Mean Under Long Memory

Suppose we would like to test whether a process is stationary against the alternative that there may be changes in the expected value. If, under the alternative, the mean function μ(t)=E(X t ) is expected to follow certain regularity conditions such as differentiability or L 2-integrability, then we are back to the question of simultaneous modelling of trend functions and dependence structure. We refer to Sects. 7.1, 7.4 and 7.5 for a discussion of this topic. On the other hand, if abrupt changes are expected, then this leads to questions in the realm of change point detection and estimation. (Another situation that is somewhere between standard nonparametric trend estimation and change point analysis is the so-called rapid change point detection discussed in Sect. 7.10.)

Specifically, consider the null hypothesis

$$H_{0}:Y_{t}=\mu+X_{t}$$

where X t is a zero mean second-order stationary process against the alternative

$$H_{1}:Y_{t}=\mu+\varDelta \cdot1 \{ t>t_{0}+1 \} +X_{t}\quad (\varDelta \neq0) $$

where t 0 (1≤t 0<n) is an unknown change point. The best known approach is based on the CUSUM statistic (originally introduced by Page 1954 in the context of quality control; also see Barnard 1959) defined by

$$D_{1,n}=\max_{1\leq i<n}\vert V_{i}\vert $$

where we use the notation

$$V_{i}=S_{1,i}-\frac{i}{n}S_{1,n}, \qquad S_{i,j}=\sum_{t=i}^{j}Y_{t}$$

and

$$S_{n} ( u ) =\sum_{t=1}^{ [ nu ] }Y_{t}. $$

Note that n −1 V i can also be written as a weighted sum of the difference between the two sample means before and after i, namely

$$n^{-1}V_{i}=\frac{i}{n} \biggl( 1-\frac{i}{n} \biggr) \biggl( \frac{1}{i}S_{1,i}- \frac{1}{n-i}S_{i+1,n} \biggr) . $$

In the classical change point analysis, the process X t is assumed to be in the domain of attraction of Brownian motion in the sense that S n (u), properly standardized, converges in the space of càdlàg functions D[0,1] to a standard Brownian motion B(u) (u∈[0,1]). This result usually applies to second-order stationary short-memory processes where \(\operatorname{var} ( S_{n} ( 1 ) ) \sim c_{S}n\). Thus, under H 0, we have a functional limit theorem with \(\tilde{Z}_{n} ( u ) = ( S_{n}(u)-uS_{n} ( 1 ) ) c_{S}^{-\frac{1}{2}}n^{-\frac{1}{2}}\) converging to a Brownian bridge \(\tilde{B} ( u ) =B ( u ) -uB ( 1 ) \), and hence

$$c_{S}^{-\frac{1}{2}}n^{-\frac{1}{2}}D_{1,n}\underset{d}{ \rightarrow}\sup_{u\in [ 0,1 ] }\bigl \vert \tilde{B} ( u ) \bigr \vert . $$

In view of the limit theorems discussed in Chap. 4, this result can be generalized quite easily to processes with long memory and antipersistence, respectively. Suppose that X t is in the domain of attraction of fractional Brownian motion B H (u) (again in the sense of a functional limit theorem) with self-similarity parameter H∈(0,1). The case of short memory is included here, with \(H=\frac{1}{2}\), antipersistence corresponds to \(H<\frac{1}{2}\) and long memory to \(H>\frac{1}{2}\). Then, under the null hypothesis formulated above, the process

$$\tilde{Z}_{n} ( u ) \approx L_{S}^{-\frac{1}{2}} ( n ) n^{-H} \bigl( S_{n}(u)-uS_{n} ( 1 ) \bigr) $$

(with L S a slowly varying function as defined in Sect. 4.2.2) converges to a fractional Brownian bridge \(\tilde{B}_{H} ( u ) =B_{H} ( u ) -uB_{H} ( 1 ) \). For the standardized statistic, we then have

$$T=L_{S}^{-\frac{1}{2}} ( n ) n^{-H}D_{1,n} \underset{d}{\rightarrow }\sup_{u\in [ 0,1 ] }\bigl \vert \tilde{B}_{H} ( u ) \bigr \vert . $$

In contrast, under the alternative H 1 with a change point in μ(t)=E(Y t ), the expected value of S n (u)−uS n (1) is of the order \(n\gg n^{H}\), so that \(T\rightarrow_{p}\infty\) (for further results and detailed regularity assumptions, see, e.g. Csörgő and Horváth 1998; Berkes et al. 2006). Note that an analogous result can be obtained in principle for processes in the domain of attraction of a Hermite process of any order.

The standardization \(L_{S}^{-\frac{1}{2}} ( n ) n^{-H}\) contains the unknown self-similarity parameter H and the slowly varying function L S . Both have to be estimated from the observed data. For most practical purposes, it is sufficient to assume that L S converges to a constant c S >0 so that \(\operatorname{var} ( S_{n} ( 1 ) ) \sim c_{S}\cdot n^{2H}\) (n→∞). In view of Sect. 1.3.1, a natural way of rewriting the standardization is

$$L_{S}^{\frac{1}{2}} ( n ) n^{H}=\sqrt{\nu ( d ) c_{f_{X}}}n^{d+\frac{1}{2}}=\sqrt{\nu ( d ) f_{X} \bigl( n^{-1} \bigr) }n^{\frac{1}{2}}$$

with \(d=H-\frac{1}{2}\),

$$\nu ( d ) =\frac{2\sin ( \pi d ) \varGamma ( 1-2d ) }{d ( 2d+1 ) }$$

and \(c_{f_{X}}\) such that \(f_{X} ( \lambda ) \sim c_{f_{X}}\vert \lambda \vert ^{-2d}\) (λ→0). In the classical change point analysis, H is assumed to be equal to \(\frac{1}{2}\) a priori so that only the constant c f , or equivalently f X (0), needs to be estimated (see, e.g. Csörgő and Horváth 1998 and references therein). However, if we calculate T under this assumption but the true value of H is actually larger than \(\frac{1}{2}\), then the asymptotic rejection probability tends to one even if the null hypothesis is true (for a further discussion along this line, see, e.g. Horváth and Kokoszka 1997; Wright 1998; Krämer et al. 2002; Sibbertsen 2004; for extensions to linear regression, see, e.g. Krämer and Sibbertsen 2000). In other words, assuming independence or short-range dependence ultimately leads to the erroneous conclusion that the mean is not constant. The formal reason is that the standardization by \(n^{\frac{1}{2}}\) is too small by a factor proportional to \(n^{H-\frac{1}{2}}\rightarrow\infty\) so that T tends to infinity. The intuitive explanation is that long-range dependent series exhibit local spurious trends and tend to stay on one side of the expected value for a long time. This often looks as if the mean were changing occasionally.

If we are not assuming \(H=\frac{1}{2}\) a priori, then both parameters, c f and H, need to be estimated consistently. Given such estimates, we define the statistic

$$T=n^{-\hat{H}}\hat{\nu}^{-\frac{1}{2}}\hat{c}_{f_{X}}^{-\frac{1}{2}}D_{1,n}$$

with \(\hat{H}=\hat{d}+\frac{1}{2}\) and \(\hat{\nu}=\nu ( \hat{d} ) \). The null hypothesis of no change point is rejected at the level of significance α, if T>q 1−α where q 1−α is defined by

$$P \Bigl( \sup_{u\in [ 0,1 ] }\bigl \vert \tilde{B}_{\hat{H}} ( u ) \bigr \vert >q_{1-\alpha} \Bigr) =\alpha. $$

(Note that here the probability is evaluated for a fractional Brownian bridge with \(\hat{H}\) being fixed.)
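
A sketch of the resulting test statistic, with the estimates \(\hat{d}\) and \(\hat{c}_{f}\) supplied by the user (for instance from a FARIMA fit as in Example 7.37 below); the constant ν(d) is written out explicitly in the code, and the quantile of \(\sup\vert \tilde{B}_{\hat{H}}\vert \) still has to be obtained separately, e.g. by simulating fractional Brownian bridges.

import numpy as np
from math import gamma, sin, pi

def cusum_lrd_statistic(y, d_hat, c_f_hat):
    # T = n^{-H} [nu(d) c_f]^{-1/2} max_{1<=i<n} |S_{1,i} - (i/n) S_{1,n}|,
    # with H = d + 1/2 and nu(d) = 2 sin(pi d) Gamma(1 - 2d) / (d (2d + 1)).
    # Requires 0 < d_hat < 1/2.
    n = len(y)
    S = np.cumsum(y)
    idx = np.arange(1, n + 1)
    V = S - (idx / n) * S[-1]
    D = np.max(np.abs(V[:-1]))          # V_n = 0 is excluded
    H = d_hat + 0.5
    nu = 2.0 * sin(pi * d_hat) * gamma(1.0 - 2.0 * d_hat) / (d_hat * (2.0 * d_hat + 1.0))
    return n ** (-H) * D / np.sqrt(nu * c_f_hat)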

Example 7.37

Let X t be generated by a fractional ARIMA(0,d,0) process with zero mean i.i.d. innovations ε t . Then \(c_{f}=\sigma_{\varepsilon}^{2}/ ( 2\pi ) \) and we may estimate \(\theta=(\sigma_{\varepsilon }^{2},d)\) by one of the (quasi-) maximum likelihood methods discussed in Sect. 5.5. The test statistic simplifies to

$$\tilde{T}=n^{-\frac{1}{2}-\hat{d}}\hat{\nu}^{-\frac{1}{2}}\sqrt{2\pi}\hat{ \sigma}_{\varepsilon}^{-1}D_{1,n}. $$

Example 7.38

Figure 7.16(a) displays simulated sample paths of

$$Y_{t}=\varDelta \cdot1 \{ t\geq120 \} +X_{t}$$

(t=1,2,…,400) with Δ=1 and 0, respectively, and X t generated by a fractional ARIMA(0,0.3,0) process. The shift is hardly visible by eye. Nevertheless, H 0 is rejected at the 5 %-level of significance. The fact that H and c f have to be estimated does not make much of a difference. This can be seen from Figs. 7.16(c)–(d) where the values of \(S_{1,i}-\frac{i}{n}S_{1,n}\) are plotted against i, together with critical 10 %- and 5 %-limits (horizontal lines) based on the true parameters (Fig. 7.16(c)) and the estimated parameters (Fig. 7.16(d)), respectively. The estimated value of H is 0.78.

Fig. 7.16 Simulated sample paths of Y t =Δ⋅1{t≥120}+X t (a) and X t (b) where X t is a FARIMA(0,0.3,0) process and Δ=1. The values of V i =S 1,i −(i/n)S 1,n are plotted against i in (c) and (d), with 5 %- and 10 %-critical values (horizontal lines) based on the true (c) and estimated parameters d and c f  (d), respectively

Although in this example the estimation of d and c f has almost no influence on the result, this may not always be the case. In fact, under the alternative, the observed process is no longer stationary. This may have undesirable effects on the estimates. Sometimes it may first be necessary to remove an estimated trend function \(\hat{\mu}(t)\) before estimating d and c f . This brings us back, however, to the question how to fit a trend function in the presence of dependent errors (see Sects. 7.1, 7.4 and 7.5). If a step function with a finite but unknown number of change points is expected under the alternative, then one may try, for instance, wavelet thresholding with Haar wavelets (see Sect. 7.5) or nonlinear regression with piecewise constant polynomials (see Sect. 7.3). Another possibility is to first calculate parameter estimates based on relatively short disjoint blocks of observations and then take their average. For quasi-maximum likelihood estimation, this can be done without any loss of asymptotic efficiency (Beran and Terrin 1996). This approach is illustrated in the following example.

Example 7.39

Figure 7.17(a) displays a sample path of Y t =μ(t)+X t where X t is a FARIMA(0,0.1,0) process and μ(t) has multiple change points with values switching between 0 and 1 as displayed in Fig. 7.17(b). The values of V i =S 1,i −(i/n)S 1,n are plotted in Figs. 7.17(c)–(d). In Fig. 7.17(c), the horizontal lines correspond to 10 %- and 5 %-critical values when using \(\hat{d}\) and \(\hat{c}_{f}\) estimated (by QMLE) from the complete series Y t (t=1,2,…,n) directly, whereas in Fig. 7.17(d), the critical boundaries are based on averages of estimates \(\hat{d}_{j}\) and \(\hat{c}_{f,j}\) (j=1,2,…,10) obtained from disjoint blocks Y 1+(j−1)100,…,Y j100 of length 100. In the first case, d 0=0.1 is overestimated by the amount of \(\hat{d}-d^{0}=0.13\) whereas in the second case overestimation is less severe with \(\hat{d}-d^{0}=0.06\). This leads to a clear rejection of H 0 at the 5 %-level in the second case, whereas H 0 is not rejected in the first case.

Fig. 7.17 Figure (a) shows a sample path of Y t =μ(t)+X t where X t is a FARIMA(0,0.1,0) process and μ(t) has multiple change points with values switching between 0 and 1 as displayed in (b). The values of V i =S 1,i −(i/n)S 1,n are plotted in (c) and (d). The horizontal lines correspond to 10 %- and 5 %-critical values using estimates of d and c f . In (c), the estimates were based on Y t (t=1,2,…,n), whereas in (d) these are averages of estimates \(\hat{d}_{j}\) and \(\hat{c}_{f,j}\) (j=1,2,…,10) obtained from disjoint blocks Y 1+(j−1)100,…,Y j100 of length 100

The test statistics above do not take into account that the variance function of \(\tilde{B}_{H} ( u ) \) is not constant. More specifically, we have

$$w_{H} ( u ) =\operatorname{var} \bigl( \tilde{B}_{H} ( u ) \bigr) = ( 1-u ) u^{2H}+u ( 1-u )^{2H}-u ( 1-u ) . $$

Since w H is zero at both ends and achieves its maximum in the middle (see Fig. 7.18), the test based on T or \(\tilde{T}\) may have little power when change points occur near the two ends. One therefore sometimes prefers to standardize by \(\sqrt{w_{H} ( u ) }\) before taking the supremum. This means that one defines a test based on \(D_{1,n}^{\ast}=\max \vert V_{i}\vert /\sqrt{w ( \frac{i}{n} ) }\). The asymptotic distribution of \(D_{1,n}^{\ast}\) is, however, more difficult to derive.

Fig. 7.18 Standard deviation of a fractional Brownian bridge \(\tilde{B} _{H} ( u ) \)

The statistics \(w^{-\frac{1}{2}}V_{i}\) (i=2,…,n−1) are also often used for estimating the change point t 0 itself, namely by choosing \(\hat{t}_{0}=i\) such that \(\vert w^{-\frac{1}{2}}V_{i}\vert \) is maximal. For i.i.d. data, the asymptotic distribution of \(\hat{t}_{0}\) has been derived by Antoch et al. (1995) (also see Hinkley 1970; Yao 1987 for earlier results). Similar results in the context of short-range dependence can be found, for instance, in Bagshaw and Johnson (1975), Davis et al. (1995), Horváth (1993), Johnson and Bagshaw (1974) and Tang and MacNeill (1993). Horváth and Kokoszka (1997) derive limit theorems for \(\hat{t}_{0}\) under more general dependence assumptions in the domain of attraction of fractional Brownian motion with H∈(0,1), and also consider a more general class of estimators.

Change point estimation in the mean can be extended to the problem of structural breaks in regression models. Results along this line in the long-memory context can be found, for instance, in Wright (1998), Krämer and Sibbertsen (2003), Sibbertsen (2004), Lazarova (2005), Gil-Alana (2008). Also see Ben Hariz and Wylie (2005) and Ben Hariz et al. (2007) for general results. Change point estimation in the long-memory context based on the Wilcoxon two-sample test is considered in Dehling et al. (2013), rank tests are developed in Wang (2008).

7.9.3 Changes in the Marginal Distribution

Instead of testing for changes in the mean, one may more generally test whether any changes in the marginal distribution occur. If we do not want to specify which features of the distribution may change, then we are led to nonparametric testing based on the empirical distribution function. This problem has been addressed, for instance, in Giraitis et al. (1996b) by studying a test based on the Kolmogorov–Smirnov statistic. In the i.i.d. and short memory context, such tests have been studied extensively (see, e.g. Picard 1985; Carlstein 1988; Leipus 1988; Dümbgen 1991; Ferger and Stute 1992; Carlstein and Lele 1993; Ferger 1994; also see Csörgő and Horváth 1988, 1998; Brodsky and Darkhovsky 1993 and references therein).

The essential probabilistic result one needs is the asymptotic distribution of the empirical process. More specifically, suppose we observe Y 1,…,Y n generated by a stationary process with marginal distribution F(y)=P(Y≤y). A natural statistic for testing for changes in the marginal distribution function can be constructed by comparing an estimated cumulative distribution of Y 1,…,Y i with the corresponding estimate for Y i+1,…,Y n . Let

$$F_{i,j} ( y ) =\frac{1}{ ( j-i+1 ) }\sum_{t=i}^{j}1 \{ Y_{t}\leq y \} $$

where ji, and

$$F_{1, [ nu ] } ( y ) =F_{ [ nu ] } ( y ) $$

with u∈[0,1] and [nu] denoting the largest integer not exceeding nu. Then we consider weighted differences

$$V_{i} ( y ) =\frac{i}{n} \biggl( 1-\frac{i}{n} \biggr) \bigl[ F_{1,i} ( y ) -F_{i+1,n} ( y ) \bigr]\quad (i=1, \dots,n-1). $$

Let u∈(0,1) and i=[nu]. Then we can rewrite V i (y) as

$$V_{ [ nu ] } ( y ) =\frac{ [ nu ] }{n} \bigl[ F_{ [ nu ] } ( y ) -F ( y ) \bigr] -\frac{ [ nu ] }{n} \bigl[ F_{n} ( y ) -F ( y ) \bigr] . $$

This is analogous to the quantities used for the CUSUM statistic in the previous section. The only difference is that instead of the observations themselves we average the 0–1-variables 1{Y t ≤y}. The CUSUM statistic is then of the form

$$D_{1,n}=\max_{1\leq i<n}\sup_{y\in\mathbb{R}}\bigl \vert V_{i} ( y ) \bigr \vert $$

(see, e.g. Picard 1985). The asymptotic distribution of D 1,n follows easily, once we have a suitable functional limit theorem for the difference F [nu](y)−F(y), understood as a stochastic process in (u,y)∈[0,1]×[−∞,∞].
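
In practice \(D_{1,n}\) can be evaluated on the grid of observed values; a sketch (quadratic in n in both time and memory, which is sufficient for moderate sample sizes):

import numpy as np

def ks_change_statistic(y):
    # D_{1,n} = max_{1<=i<n} sup_y |V_i(y)| with
    # V_i(y) = (i/n)(1 - i/n) [F_{1,i}(y) - F_{i+1,n}(y)],
    # the supremum over y being taken over the observed values.
    y = np.asarray(y, dtype=float)
    n = len(y)
    grid = np.sort(y)
    ind = (y[:, None] <= grid[None, :]).astype(float)      # 1{Y_t <= y_k}
    csum = np.cumsum(ind, axis=0)
    d_max = 0.0
    for i in range(1, n):
        F_left = csum[i - 1] / i                           # F_{1,i}
        F_right = (csum[-1] - csum[i - 1]) / (n - i)       # F_{i+1,n}
        V_i = (i / n) * (1 - i / n) * (F_left - F_right)
        d_max = max(d_max, np.max(np.abs(V_i)))
    return d_max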

Suppose that there is a suitable sequence of numbers v n →0 such that

$$v_{n}^{-\frac{1}{2}} \bigl[ F_{ [ nu ] } ( y ) -F ( y ) \bigr] $$

converges (weakly in a suitable manner) to a process W(u,y). Then we define the test statistic

$$T=v_{n}^{-\frac{1}{2}}D_{1,n}. $$

Under the null hypothesis that the marginal distribution remains the same, we have

$$T\underset{d}{\rightarrow}\sup_{ ( u,y ) \in [ 0,1 ] \times\mathbb{R}}\bigl \vert W ( u,y ) -uW ( 1,y ) \bigr \vert . $$

Thus, a rejection region at a level of significance α can be defined by K α ={T>q 1−α } where q 1−α are (1−α)-quantiles defined by

$$P \Bigl( \sup_{ ( u,y ) \in [ 0,1 ] \times\mathbb{R}}\bigl \vert W ( u,y ) -uW ( 1,y ) \bigr \vert >q_{1-\alpha} \Bigr) =\alpha. $$

For i.i.d. observations, it is well known that the asymptotic limit of

$$W_{n} ( u,y ) =n^{\frac{1}{2}} \bigl[ F_{ [ nu ] } ( y ) -F ( y ) \bigr] $$

is a Kiefer process W(u,y) where convergence is in the space D([0,1]×[−∞,∞]). Recall that a Kiefer process is a Gaussian process (in (u,y)) with zero mean and covariance function

$$\mathit{cov} \bigl( W ( u_{1},y_{1} ) ,W ( u_{2},y_{2} ) \bigr) =\min \{ u_{1},u_{2} \} \cdot \bigl[ F \bigl( \min ( y_{1},y_{2} ) \bigr) -F ( y_{1} ) F ( y_{2} ) \bigr] $$

(see, e.g. Shorack and Wellner 1986 and references therein). This result can be generalized to standard short-memory conditions to obtain a Gaussian limiting process with covariance function

$$\mathit{cov} \bigl( W ( u_{1},y_{1} ) ,W ( u_{2},y_{2} ) \bigr) =\min \{ u_{1},u_{2} \} \cdot\sigma ( y_{1},y_{2} ) $$

where

$$\sigma ( y_{1},y_{2} ) =\sum_{t=-\infty}^{\infty} \bigl[ P ( Y_{0}\leq y_{1},Y_{t}\leq y_{2} ) -P ( Y_{0}\leq y_{1} ) P ( Y_{t}\leq y_{2} ) \bigr] $$

(see, e.g. Berkes and Philipp 1977). In contrast, under long memory the rate of convergence is slower and one obtains a degenerate limiting process (see Sect. 4.8). For instance, let Y t =G(Z t ) where Z t is a zero mean Gaussian process with variance one, slowly decaying autocovariances γ Z (k)∼L γ (k)|k|2d−1 and assume that 1{G(Z t )≤y} has Hermite rank m=1. Then Dehling and Taqqu (1989b) showed that

$$W_{n,H} ( u,y ) =L_{S}^{-\frac{1}{2}} ( n ) n^{1-H} \bigl[ F_{ [ nu ] } ( y ) -F ( y ) \bigr] $$

(with \(H=d+\frac{1}{2}\) and L S (n)=L γ (n)(d(2d+1))−1, see Sect. 4.2.2) converges in D([0,1]×[−∞,∞]) equipped with the sup-norm to a constant (depending on y) times a fractional Brownian motion B H , or more specifically,

$$W ( u,y ) =W_{H} ( u,y ) =J_{1} ( y ) B_{H} ( u ) $$

where J 1(y)=E[1{G(Z)≤y}Z]. An analogous result holds for higher Hermite ranks with B H replaced by the corresponding Hermite process of order m. This result is remarkable because along the y-axis, no stochasticity is involved. Once u is fixed and the random variable B H (u) is generated, the process evolves in y only via multiplication by the deterministic function J 1(y). The asymptotic distribution of D 1,n is therefore much simpler than under short memory. Defining

$$T=L_{S}^{-\frac{1}{2}} ( n ) n^{-H}D_{1,n},$$

we obtain

$$T\underset{d}{=}\zeta+o_{p} ( 1 ) $$

with

$$\zeta=\sup_{y\in\mathbb{R}}\bigl \vert J_{1} ( y ) \bigr \vert \cdot\sup_{u\in [ 0,1 ] }\bigl \vert \tilde{B}_{H} ( u ) \bigr \vert . $$

The first factor is a deterministic constant that only depends on the transformation G. The second term is the usual supremum of a fractional Brownian bridge. Now we can calculate critical values for testing the null hypothesis that we observe a stationary process Y t =G(Z t ) with a certain (unknown) marginal distribution F against the alternative

$$H_{1}:Y_{t}=X_{t,1}\quad (1\leq t\leq t_{0}),\qquad Y_{t}=X_{t,2} \quad (t_{0}<t\leq n) $$

where X t,1, X t,2 are two stationary processes with marginal distributions F 1F 2 and t 0 is an unknown change point. A rejection region at level of significance α can be defined by

$$T>\sup_{y\in\mathbb{R}}\bigl \vert J_{1} ( y ) \bigr \vert \cdot q_{1-\alpha},$$

or equivalently,

$$D_{1,n}>L_{S}^{\frac{1}{2}} ( n ) n^{H}\cdot \sup_{y\in\mathbb{R}}\bigl \vert J_{1} ( y ) \bigr \vert \cdot q_{1-\alpha}$$

where q 1−α is defined by

$$P \Bigl( \sup_{u\in [ 0,1 ] }\bigl \vert \tilde{B} ( u ) \bigr \vert >q_{1-\alpha} \Bigr) =\alpha. $$

Example 7.40

Let Y t be a Gaussian FARIMA(0,d,0) process with \(\operatorname{var} ( \varepsilon_{t} ) =1\). Then Y t =σ Y Z t with \(\sigma_{Y} ^{2}=\operatorname{var} ( Y_{t} ) =\varGamma ( 1-2d ) /\varGamma^{2} ( 1-d ) \) and

$$J_{1} ( y ) =E \bigl[ 1 \{ \sigma_{Y}Z\leq y \} Z \bigr] =\int_{-\infty}^{\sigma_{Y}^{-1}y}z\frac{1}{\sqrt{2\pi}}e^{-\frac {1}{2}z^{2}}\,dz=- \frac{1}{\sqrt{2\pi}}e^{-\frac{1}{2}\sigma_{Y}^{-2}y^{2}}. $$

The supremum of |J 1(y)| is \(1/\sqrt{2\pi}\). Moreover,

$$L_{\gamma} ( n ) =\varGamma ( 1-2d ) / \bigl[ \varGamma ( d ) \varGamma ( 1-d ) \bigr] $$

so that

$$L_{S} ( n ) =L_{\gamma} ( n ) \bigl( d ( 2d+1 ) \bigr)^{-1}=\frac{\varGamma ( 1-2d ) }{\varGamma ( 1+d ) \varGamma ( 1-d ) ( 2d+1 ) }. $$

A critical region at level α is therefore given by

$$\biggl\{ T>\frac{1}{\sqrt{2\pi}}\cdot q_{1-\alpha} \biggr\} = \biggl\{ D_{1,n}>n^{H}\cdot\sqrt{\frac{\varGamma ( 1-2d ) }{2\pi\varGamma ( 1+d ) \varGamma ( 1-d ) ( 2d+1 ) }}\cdot q_{1-\alpha} \biggr\} $$

where \(H=d+\frac{1}{2}\).
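
For this Gaussian FARIMA(0,d,0) case the critical bound for \(D_{1,n}\) is explicit up to the quantile \(q_{1-\alpha}\), which has to be obtained by simulation; a sketch with q supplied by the user:

from math import gamma, sqrt, pi

def gaussian_farima_ks_bound(d, n, q):
    # Critical bound n^H * sqrt( Gamma(1-2d) / (2 pi Gamma(1+d) Gamma(1-d) (2d+1)) ) * q
    # for D_{1,n}, where q is the (1-alpha)-quantile of the limiting supremum.
    H = d + 0.5
    L_S = gamma(1 - 2 * d) / (gamma(1 + d) * gamma(1 - d) * (2 * d + 1))
    return n ** H * sqrt(L_S / (2 * pi)) * q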

7.9.4 Changes in the Linear Dependence Structure

Often the dependence structure in an observed time series is not constant. Slow changes can be captured by locally stationary processes. This has been discussed in Sect. 7.8. On the other hand, there are situations where the dependence structure changes suddenly. Such situations are in the realm of change point analysis. The null hypothesis we are testing is that the observed process Y t is stationary with a fixed spectral distribution F Y . The alternative is that there is a change point t 0 such that Y t has the spectral distributions F 1 and F 2 for t≤t 0 and t>t 0, respectively, with F 1≠F 2. Note that here F denotes the spectral distribution, and not the marginal distribution.

A simple way of testing for change points in the correlation structure is considered in Beran and Terrin (1994). Suppose we have a parametric model with \(\theta= ( \sigma_{\varepsilon}^{2},d,\dots )^{T}= ( \sigma_{\varepsilon}^{2},\eta )^{T}\) where the central limit theorem holds for quasi-maximum likelihood estimates as discussed in Sect. 5.5. For instance, we may assume a FARIMA(p,d,q) process with spectral density

$$f ( \lambda;\theta ) =\sigma_{\varepsilon}^{2}\bigl \vert 1-\exp(-i \lambda)\bigr \vert ^{-2d}\biggl \vert \frac{\psi ( e^{-i\lambda } ) }{\phi ( e^{-i\lambda} ) }\biggr \vert ^{2}. $$

First, we divide the time axis into m blocks I 1={1,2,…,n 1}, I 2={n 1+1,…,n 1+n 2},… such that ∑n j =n and n j /n→p j ∈(0,1). For each block of observations Y t (t∈I j ) a quasi-MLE \(\hat{\eta}_{j}\) is computed. Similar arguments as in Sect. 5.5 (Beran and Terrin 1994) show that, as n→∞, \(Z_{j,n}=\sqrt{n_{j}} ( \hat{\eta}_{j}-\eta ) \) (j=1,2,…,m) are asymptotically independent of each other, with limiting N(0,Σ j )-distribution where Σ j =4πV −1 and

$$V=\int\frac{\partial}{\partial\eta}\log f ( \lambda ;\theta ) \biggl[ \frac{\partial}{\partial\eta}\log f ( \lambda;\theta ) \biggr]^{T}\,d\lambda . $$

This can be used for testing whether the parameter η remains constant over time. For simplicity suppose that we are only interested in changes of the long-memory parameter d. Then the null hypothesis is that Y t is stationary, which means in particular that d is constant. Denoting by d j the long-memory parameter in block I j (j=1,2,…,m), the null hypothesis implies d 1=⋯=d m =d. The alternative is specified by the existence of at least one pair j 1,j 2∈{1,2,…,m} such that \(d_{j_{1}}\neq d_{j_{2}}\). Suppose for simplicity that n 1=⋯=n m =nm −1 and denote by v m,n =4π[V −1]11 mn −1 the approximate variance of each \(\hat{d}_{j}\). Using the notation \(\bar{d}=m^{-1}\sum\hat{d}_{j}\), a simple test statistic of H 0 can be based on

$$\chi^{2}=v_{m,n}^{-1}\sum_{j=1}^{m} ( \hat{d}_{j}-\bar{d} )^{2}. $$

Under H 0, the statistic is approximately \(\chi_{m-1}^{2}\)-distributed. In contrast, under the alternative, \(\sum(\hat{d}_{j}-\bar{d})^{2}\) converges in probability to \(\sum_{j=1}^{m} ( d_{j}-d )^{2}>0\) where d=m −1∑d j so that χ 2 diverges to infinity.

Example 7.41

Let Y t be a FARIMA(0,d,0) process. Then 4π[V −1]11=6/π 2. The null hypothesis is rejected at the level of significance α, if

$$\frac{\pi^{2}}{6}\frac{n}{m}\sum_{j=1}^{m} ( \hat{d}_{j}-\bar{d} )^{2}>\chi_{m-1;1-\alpha}^{2}$$

with \(\chi_{m-1;1-\alpha}^{2}\) denoting the (1−α)-quantile of a \(\chi_{m-1}^{2}\)-distribution. We apply this test to the detrended central England temperatures displayed in Fig. 7.19(b). The sample size is n=352. Using m=4 blocks of length n j =88, and a FARIMA(0,d,0) fit for each block, the maximum likelihood estimates \(\hat{d}_{j}\) (j=1,2,3,4) are equal to 0.30, 0.07, 0.02 and 0.29, respectively. The value of the χ 2-statistic is about 9.15 which corresponds to a p-value (based on a \(\chi_{3}^{2}\)-distribution) of 0.027. Thus, there is quite strong evidence for a change in d. This confirms the visual impression of the log–log-periodogram plots for the four blocks in Figs. 7.19(c)–(f), and also the impression obtained by fitting a locally stationary FARIMA(0,d,0) process in Sect. 7.8. (Note also that the FARIMA(0,d,0) model does indeed fit the data reasonably well, locally.)
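
Given the block-wise estimates, the χ²-statistic of Example 7.41 takes one line; the sketch below also returns a p-value and, applied to the four (rounded) estimates quoted above with block length 88, gives a value close to the one reported there.

import numpy as np
from scipy.stats import chi2

def block_chi2_test(d_hats, block_length):
    # FARIMA(0,d,0) case: stat = (pi^2 / 6) * (n/m) * sum_j (d_j - dbar)^2,
    # approximately chi^2_{m-1} under the hypothesis of a constant d.
    d = np.asarray(d_hats, dtype=float)
    stat = (np.pi ** 2 / 6.0) * block_length * np.sum((d - d.mean()) ** 2)
    return stat, chi2.sf(stat, df=len(d) - 1)

# e.g. block_chi2_test([0.30, 0.07, 0.02, 0.29], 88)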

Fig. 7.19 Yearly Central England temperatures 1659–2010 (a) and the detrended series (b) after subtracting a nonparametric trend function. Also displayed are log–log-periodograms and FARIMA(0,d,0) spectral densities fitted to four disjoint blocks of length n j =88

In situations where the location of change points is unknown, one would prefer a method where one does not have to divide the time axis into blocks by hand. Assume again a parametric model with spectral density f(λ;θ) and a p-dimensional parameter \(\theta= ( \sigma_{\varepsilon}^{2},d,\dots )^{T}= ( \sigma_{\varepsilon}^{2},\eta )^{T}\). Suppose for simplicity of presentation that we are only interested in changes in the long-memory parameter d. A CUSUM type statistic can be defined by

$$D_{1,n}=\max_{n_{\mathrm{low}}\leq i\leq n_{\mathrm{up}}}\biggl \vert \frac{i}{n} \biggl( 1- \frac{i}{n} \biggr) ( \hat{d}_{1,i}-\hat{d}_{i+1,n} ) \biggr \vert $$

with \(\hat{d}_{1,i}=[\hat{\eta}_{1,i}]_{1}\), \(\hat{d}_{i+1,n}=[\hat{\eta }_{i+1,n}]_{1}\) where \(\hat{\eta}_{1,i}\) and \(\hat{\eta}_{i+1,n}\) are estimates of η=(d,…)T based on X 1,X 2,…,X i and X i+1,…,X n , respectively. Note that, in contrast to the sample mean, the estimates require a certain minimal size of the sample. Therefore, in practice n low has to be chosen larger than 1, and n up smaller than n.

Suppose now that under the null hypothesis H 0 the observed time series Y t (t=1,…,n) is generated by a stationary process in the parametric class with θ=θ 0. The alternative H 1 we would like to test against is that there is a change point 1<t 0<n such that the long-memory parameter is d=d 1 for tt 0 and d=d 2d 1 for t>t 0. To estimate θ 0 we use one of the approximate quasi-maximum likelihood estimators derived from the normal likelihood. Recall that under H 0, the central limit theorem holds for \(\hat{\theta}\) with a \(\sqrt{n}\)-rate of convergence, and the scale estimator is asymptotically independent of \(\hat{\eta}\). The proof of this result relies either on a central limit theorem for quadratic forms or on an approximation by martingale differences (see Sect. 5.5). For instance, if we use the second approach, then \(\hat{\eta}\) is defined by minimizing \(\sum e_{t}^{2} ( \eta ) \) where \(e_{t} ( \eta ) =\sum_{j=0}^{t-1}b_{j} ( \eta ) Y_{t-j}\) is an approximation of ε t obtained from the autoregressive representation \(\varepsilon_{t}=\sum_{j=0}^{\infty}b_{j} ( \eta ) Y_{t-j}\), and \(\hat{\theta}_{1}=\hat{\sigma}_{\varepsilon}^{2}\) is set equal to \(n^{-1}\sum e_{t}^{2} ( \hat{\eta} ) \). Then, based on n observations, we have the approximation

$$\hat{\eta}-\eta^{0}=n^{-1}S_{n}+o_{p} \bigl( n^{-1} \bigr) $$

where

$$S_{n}= \bigl( S_{n}^{1},\dots,S_{n}^{p-1} \bigr)^{T}=M^{-1}\sum_{t=2}^{n}\dot{\varepsilon}_{t} \bigl( \eta^{0} \bigr) \varepsilon_{t} \bigl( \eta^{0} \bigr) , $$

\(M=E ( \dot{\varepsilon}_{t}\dot{\varepsilon}_{t}^{T} ) \) and \(\dot{\varepsilon}_{t}=\partial/\partial\eta\varepsilon_{t} ( \eta ) \mid_{\eta=\eta^{0}}=\sum\dot{b}_{j}Y_{t-j}\). Using the notation

$$\zeta_{t}= \bigl( \zeta_{t}^{1},\dots, \zeta_{t}^{p-1} \bigr)^{T}=M^{-1}\dot{\varepsilon}_{t} \bigl( \eta^{0} \bigr) \varepsilon_{t} \bigl( \eta^{0} \bigr) $$

and

$$\zeta_{t}^{j}=\sum_{l=1}^{p-1} \tilde{m}_{jl} \biggl\{ \frac{\partial}{\partial \eta_{l}}\varepsilon_{t} \bigl( \eta^{0} \bigr) \varepsilon_{t} \bigl( \eta^{0} \bigr) \biggr\} $$

with \(M^{-1}= [ \tilde{m}_{jl} ]_{j,l=1,\dots,p-1}\), we can write \(S_{n}=\sum_{t=2}^{n}\zeta_{t}\). Since we are only interested in d, the only relevant component of S n is

$$S_{n}^{1}=\sum_{t=2}^{n} \zeta_{t}^{1}. $$

This means that asymptotically \(\hat{d}-d^{0}\) can be approximated by a sample mean, and D 1,n can be written in the form of a usual CUSUM statistic with sample means. Furthermore, since \(\dot{\varepsilon}_{t} ( \eta^{0} ) \varepsilon_{t} ( \eta^{0} ) \) is a martingale difference, we have, under suitable moment conditions, a functional limit theorem

$$n^{-\frac{1}{2}}S_{n}^{1} ( u ) =n^{-\frac{1}{2}}\sum _{t=2}^{ [ nu ] }\zeta_{t}^{1} \rightarrow \mathrm{const}\cdot B ( u ) $$

where convergence is in D[0,1] and B(u) (u∈[0,1]) is a standard Brownian motion. Assuming that n low/n→0 and n up/n→1, we may therefore write, for i=[nu],

$$\sqrt{n}\frac{i}{n} \biggl( 1-\frac{i}{n} \biggr) ( \hat{d}_{1,i}-\hat{d}_{i+1,n} ) \approx n^{-\frac{1}{2}} \bigl( S_{n}^{1} ( u ) -uS_{n}^{1} ( 1 ) \bigr) \rightarrow \mathrm{const}\cdot\tilde{B} ( u ) $$

with \(\tilde{B}\) denoting a standard Brownian bridge. Analogous arguments can be carried out using a quasi-MLE based on quadratic forms. The derivation given here is, of course, purely heuristic; an exact proof is more difficult. For the approach based on quadratic forms, a complete proof can be found in Horváth and Shao (1999). Specifically, the following result is derived.

Theorem 7.42

Consider a parametric family \(Y_{t}=\sum_{j=-\infty}^{\infty}a_{j} ( \eta ) \varepsilon_{t-j}\) of second-order stationary linear processes with \(\theta= ( \sigma_{\varepsilon}^{2},\eta^{T} )^{T}= ( \sigma_{\varepsilon}^{2},d,\dots )^{T}\in\varTheta\subseteq\mathbb{R}_{+}\times ( 0,\frac{1}{2} ) \times\mathbb{R}^{p-2}\). Suppose that we observe Y 1,…,Y n with the true parameter θ 0 in the interior of Θ 0. Let \(\hat{d}_{1,i}\) and \(\hat{d}_{i+1,n}\) be the first components of \(\hat{\eta}_{1,i}\) and \(\hat{\eta}_{i+1,n}\), respectively, obtained by Whittle estimation. Assume furthermore that the conditions in the central limit theorem for Whittle estimators given in Giraitis and Surgailis (1990) hold, and also \(E ( \varepsilon_{t}^{4+r} ) <\infty\) for some r>0. Denote by Σ η =4πV −1 the asymptotic covariance matrix of \(\hat{\eta}\) with

$$V=\int\partial/\partial\eta\log f [ \partial/\partial\eta\log f ]^{T}\,d \lambda $$

and by v d =[Σ η ]11 the asymptotic variance of  \(\hat{d}\). Then

$$n^{\frac{1}{2}}u ( 1-u ) ( \hat{d}_{1,i}-\hat{d}_{i+1,n} ) \rightarrow\sqrt{v_{d}}\tilde{B} ( u ) $$

where \(\tilde{B} ( u ) \) is a standard Brownian bridge.

The theorem implies that under the null hypothesis

$$T=\sqrt{n}D_{1,n}=\sqrt{n}v_{d}^{-\frac{1}{2}} \max_{n_{\mathrm{low}}\leq i\leq n_{\mathrm{up}}}\biggl \vert \frac{i}{n} \biggl( 1-\frac{i}{n} \biggr) ( \hat{d}_{1,i}-\hat{d}_{i+1,n} ) \biggr \vert \underset{d}{\rightarrow}\sup_{u\in [ 0,1 ] }\bigl \vert \tilde{B} ( u ) \bigr \vert . $$

Thus, we reject H 0 at the level of significance α, if T>q 1−α where q 1−α is the (1−α)-quantile of \(\sup_{u\in [ 0,1 ] }\vert \tilde{B} ( u ) \vert \).

Example 7.42

Let Y t be a FARIMA(0,d,0) process. Then v d =6/π 2 so that an approximate rejection region at level α is given by

$$T=\sqrt{n}\frac{\pi}{\sqrt{6}}\max_{n_{\mathrm{low}}\leq i\leq n_{\mathrm{up}}}\biggl \vert \frac {i}{n} \biggl( 1-\frac{i}{n} \biggr) ( \hat{d}_{1,i}- \hat{d}_{i+1,n} ) \biggr \vert >q_{1-\alpha}. $$

We apply this method to the detrended central England temperature series considered before. The practical difficulty one encounters is that it is not clear how to choose n low and n up. Although the results in Horváth and Shao suggest that asymptotically one may choose n low=1 and n up=n, this is not really true because the calculation of the MLE based on one (or a very small number of) observation is not meaningful; in fact, for very small samples, numerical optimization often fails to find a solution in the interior of the parameter space. Here, we chose n low=100 and n up=n−100=252. This means, however, that u=n low/n≈0.28 and u=n up/n≈0.72 are far from the left and right border of the interval [0,1]. Instead of using quantiles of the supremum of \(\vert \tilde{B} ( u ) \vert \) over the whole range of u∈[0,1] we therefore calculated quantiles of \(\sup_{u\in [ 0.28,0.72 ] }\vert \tilde{B} ( u ) \vert \). The critical 5 %-level value is about 1.34. The observed value of T is 0.99 so that, in contrast to the simple χ 2-test calculated previously, H 0 is not rejected.
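
A sketch of the statistic T for a change in d, assuming a user-supplied function whittle_d that returns an estimate of d for a stretch of data (for the FARIMA(0,d,0) case the asymptotic variance v_d equals 6/π²):

import numpy as np

def cusum_change_in_d(x, whittle_d, n_low, n_up, v_d=6.0 / np.pi ** 2):
    # T = sqrt(n / v_d) * max_{n_low <= i <= n_up} |(i/n)(1 - i/n)(d_hat_{1,i} - d_hat_{i+1,n})|
    # whittle_d: estimator of d on a data segment, supplied by the user (an assumption here).
    n = len(x)
    stat = 0.0
    for i in range(n_low, n_up + 1):
        diff = whittle_d(x[:i]) - whittle_d(x[i:])
        stat = max(stat, abs((i / n) * (1 - i / n) * diff))
    return np.sqrt(n / v_d) * stat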

The failure to reject in this example may be due to the (conjectured) possibility that the potential change points are near the two borders of the observational period (recall that the estimates of d calculated for the four blocks were 0.30, 0.07, 0.02 and 0.29). The test based on T has little power when changes occur near the borders because the variance of \(\tilde{B} ( u ) \) is equal to u(1−u) and thus approaches zero at the two ends. One may increase the power by changing the standardization by the factor \([ u ( 1-u ) ]^{-\frac{1}{2}}\) and hence using the statistic

$$\tilde{T}=\sqrt{n}\tilde{D}_{1,n}=\sqrt{n}v_{d}^{-\frac{1}{2}} \max_{n_{\mathrm{low}}\leq i\leq n_{\mathrm{up}}}\biggl \vert \sqrt{\frac{i}{n} \biggl( 1- \frac{i}{n} \biggr) } ( \hat{d}_{1,i}- \hat{d}_{i+1,n} ) \biggr \vert . $$

The derivation of the asymptotic distribution of \(\tilde{T}\) is more involved, however, because convergence in D[0,1] no longer holds. The statistic \(\tilde{T}\) was suggested in Beran and Terrin (1996); its asymptotic distribution was derived by Horváth and Shao (1999). Under additional regularity conditions, Horváth and Shao obtain the asymptotic expression

$$\lim_{n\rightarrow\infty}P \bigl( \sqrt{2\log n}\,\tilde{T}\leq c ( x ) \bigr) =\exp \bigl( -2e^{-x} \bigr) $$

where

$$c ( x ) =x+2\log x+\frac{1}{2}\log\log x-\frac{1}{2}\log\pi. $$

Thus, given a level of significance α, we first need to determine x α such that \(\exp ( -2e^{-x_{\alpha}} ) =1-\alpha\). We reject H 0 at the level of significance α, if

$$\tilde{T}>\frac{c ( x_{\alpha} ) }{\sqrt{2\log n}}, $$

where

$$x_{\alpha}=-\log\log\frac{1}{\sqrt{1-\alpha}}. $$

For instance, for α=0.05 we have x α =3.66 and c(x α )=5.82.
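
The critical value \(c(x_{\alpha})/\sqrt{2\log n}\) is easy to tabulate; for instance, for α=0.05 and n=352 the following sketch returns approximately 1.70.

from math import log, sqrt, pi

def critical_value_T_tilde(alpha, n):
    # x_alpha = -log log (1 / sqrt(1 - alpha)),
    # c(x) = x + 2 log x + 0.5 log log x - 0.5 log pi.
    x = -log(log(1.0 / sqrt(1.0 - alpha)))
    c = x + 2.0 * log(x) + 0.5 * log(log(x)) - 0.5 * log(pi)
    return c / sqrt(2.0 * log(n))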

Example 7.43

We apply the test based on \(\tilde{T}\) to the detrended Central England series, using a FARIMA(0,d,0) model. For α=0.01 and 0.05 we have \(c ( x_{\alpha} ) / \sqrt{2\log n}=2.43\) and 1.70, respectively. The value of \(\tilde{T}\) turns out to be 2.13. Thus, in contrast to the test based on T, we can reject H 0 at α=0.05. Figure 7.20 shows a comparison between \(\vert i/n ( 1-i/n ) ( \hat {d}_{1,i}-\hat{d}_{i+1,n} ) \vert \) and \(\vert \sqrt{i/n ( 1-i/n ) } ( \hat{d}_{1,i}-\hat{d}_{i+1,n} ) \vert \). Due to the new standardization, the second statistic is indeed much larger near the left border.

Fig. 7.20 Plot of \(\vert \frac{i}{n} ( 1-\frac{i}{n} ) ( \hat{d}_{1,i}-\hat{d}_{i+1,n} ) \vert \) and \(\vert \sqrt{\frac{i}{n} ( 1-\frac{i}{n} ) } ( \hat{d}_{1,i}-\hat {d}_{i+1,n} ) \vert \) against i=100,…,252 for detrended yearly Central England temperatures. The horizontal line corresponds to the 5 %-critical value for the second statistic. The corresponding critical value for the first statistic is outside the plotted range

7.9.5 Changes in the Mean vs. Long-Range Dependence

One of the controversial issues in the applied literature is whether long-memory phenomena may not be caused by changes in parameters of a short-memory process rather than stationary long-range dependence (see, e.g. Klemes 1974; Boes and Salas 1978; Roughan and Veitch 1999; Veres and Boda 2000; Karagiannis et al. 2004; Diebold and Inoue 2001; Granger and Hyung 2004; Mikosch and Starica 2004; Charfeddine and Guegan 2009; Mills 2007). One way to answer this is the pragmatic view that in situations where the data were actually generated by a more complex short-memory mechanism, stationary processes with long-range dependence often provide a convenient parsimonious model (by including just one additional parameter d or H). Nevertheless, one would at least like to be able to distinguish long memory from certain simple alternatives. Among the most important competitors are short-memory processes with changes in the expected value. Essentially, we may distinguish two situations: (a) E(Y t ) changes gradually; (b) E(Y t ) changes abruptly. In the first case, the standard nonparametric approach is to consider a sequence of models Y t,n =m(t/n)+X t where X t is a zero mean stationary process and \(m: [ 0,1 ] \rightarrow\mathbb{R}\) satisfies certain regularity conditions such as mC[0,1] or L 2[0,1]. This leads back to the question of estimating a deterministic trend function m and parameters describing the stochastic dependence structure simultaneously. This topic is discussed in Sects. 7.4 and 7.5. (Note, in particular, that wavelet thresholding provides a way of distinguishing m from the dependence structure of X t even if m is not smooth, which is the case under alternatives in change point analysis.)

In this section, we turn to scenario (b) where changes in the expected value are abrupt. The fundamental difficulty of distinguishing between a stationary long-memory process and a short-memory process with change points can be illustrated by the following example. Suppose that X t are i.i.d. with zero mean. We observe Y t =μ(t)+X t with μ(t)=μ(t;ω)∈{0,1} generated by an ON–OFF process that is independent of X t and has long memory. In other words,

$$\mu ( t;\omega ) =W ( t ) =\sum_{j=-\infty}^{\infty }1 \{ \tau_{j-1}\leq t<\tau_{j-1}+T_{j,\mathrm{on}} \} , $$

with T j =τ j −τ j−1=T j,on+T j,off as defined in Sect. 2.2.3 (there we used the notation X j,on, X j,off instead of T j,on, T j,off). The distributions of the ON and OFF intervals are such that \(P ( T_{j,\mathrm{on}}>x ) \sim C_{\mathrm{on}}x^{-\alpha_{\mathrm{on}}}\) and \(P ( T_{j,\mathrm{off}}>x ) \sim C_{\mathrm{off}}x^{-\alpha_{\mathrm{off}}}\) with 1<α on<α off<2. Then \(\mathit{cov} ( \mu ( t ) ,\mu ( t+k ) ) \sim \mathrm{const}\cdot \vert k\vert ^{- ( \alpha_{\mathrm{on}}-1 ) }\). This means that μ(t), and hence also Y t , has long-range dependence. On the other hand, conditionally on μ(t;ω) the observations Y t (t=1,2,…,n) are independent. Figures 7.21(a)–(i) show simulated sample paths of μ(t;ω), X t and Y t , together with the corresponding empirical correlograms and log–log periodograms. Here, T j,on and T j,off are equal to 10 times standard Pareto-distributed variables with α on=1.1 and α off=1.2, respectively, i.e. P(T j,on>x)=(x/10)−1.1 and P(T j,off>x)=(x/10)−1.2 (for x≥10). The correlogram of X t (which is the same as the conditional correlogram of Y t given μ(t;ω)) does not show any dependence, whereas the long memory of μ is clearly visible in the (unconditional) correlogram of Y t . If we observe one sample path of the process Y t only, then in principle we are not able to tell whether μ(t) has been generated randomly or whether it is deterministic, unless we know or assume a priori that the class of possible deterministic functions has certain properties that make them distinguishable asymptotically from typical sample paths of the long-memory ON–OFF process. If, however, no assumptions are imposed on the function E(Y t ), then one realization of the process Y t with μ generated by the ON–OFF process can also be interpreted as a series of independent observations with deterministic shifts in the expected value. More generally, one can say that the question whether we have stationarity with long memory or short memory with shifts in the mean function is ill-posed, unless one specifies a priori some detailed properties of the shifts in E(Y t ). Such restrictions may concern, for example, the maximal number, the frequency, the location, the spacing, the integrability or the size of the shifts.
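
The mechanism is easy to reproduce in a few lines. The following sketch (Python with numpy; parameter values as in the example above, but otherwise an illustration rather than the code used for Fig. 7.21) generates μ(t;ω) from heavy-tailed ON and OFF periods, adds i.i.d. noise, and compares the conditional and unconditional correlograms.

```python
import numpy as np

rng = np.random.default_rng(0)

def pareto(alpha, scale, size):
    # P(T > x) = (x / scale)^(-alpha) for x >= scale
    return scale * (1.0 - rng.uniform(size=size)) ** (-1.0 / alpha)

n = 10_000
t_on = pareto(1.1, 10.0, 5_000)        # ON durations,  alpha_on  = 1.1
t_off = pareto(1.2, 10.0, 5_000)       # OFF durations, alpha_off = 1.2

mu, pos = np.zeros(n), 0               # build the ON-OFF mean function mu(t)
for on, off in zip(t_on, t_off):
    for length, value in ((on, 1.0), (off, 0.0)):
        nxt = min(n, pos + int(length))
        mu[pos:nxt] = value
        pos = nxt
    if pos >= n:
        break

x = rng.standard_normal(n)             # i.i.d. residuals X_t
y = mu + x                             # observed series Y_t

def acf(z, nlags=50):
    z = z - z.mean()
    return np.array([z[:len(z) - k] @ z[k:] for k in range(nlags + 1)]) / (z @ z)

# acf(x) shows no dependence, while acf(y) decays very slowly,
# mimicking long-range dependence as in Fig. 7.21.
```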

Fig. 7.21

Figure (g) shows a simulated sample path of Y t =μ(t/n)+X t where X t are i.i.d. N(0,1)-variables and μ(u) (u∈[0,1]) is generated by an ON–OFF-process with long-range dependence. The ON–OFF-process is displayed in (a), the residual process X t in (d). Also shown are the corresponding correlograms ((b), (e) and (h)) and log–log-periodograms ((c), (f) and (i))

Once we have decided on what type of change point models we would like to compare with, an appropriate statistical test can be set up. Depending on the application, the assumption of stationarity with long memory can be assigned to the null hypothesis H 0 or to the alternative H 1. The former is considered, for instance, in Ohanissian et al. (2008), Müller and Watson (2008), Qu (2010), Kuswanto (2011), the latter in Berkes et al. (2006), Jach and Kokoszka (2008) and Baek and Pipiras (2011).

As an example, we discuss the method proposed by Berkes et al. (2006). The idea is to start with testing

$$H_{0}:Y_{t}=\mu+\varDelta \cdot1 \{ t\geq t_{0}+1 \} +X_{t}\quad ( \varDelta \neq0) $$

where 1≤t 0<n and X t is a fourth-order stationary zero mean short-memory process with absolutely summable autocovariances γ X (k) in the domain of attraction of a Brownian motion. The alternative is

$$H_{1}:Y_{t}=\mu+X_{t}$$

where X t is a fourth-order stationary zero mean long-memory process with autocovariances γ X (k)∼c γ |k|2d−1 (|k|→∞) for some \(0<d<\frac{1}{2}\), in the domain of attraction of a fractional Brownian motion. An additional technical assumption is that under H 0 the fourth-order cumulants

$$\kappa ( k_{1},k_{2},k_{3} ) =\operatorname{cum} ( X_{t},X_{t+k_{1}},X_{t+k_{2}},X_{t+k_{3}} ) $$

are such that

$$\sup_{k_{1}}\sum_{k_{2},k_{3}=-\infty}^{\infty}\bigl \vert \kappa ( k_{1},k_{2},k_{3} ) \bigr \vert <\infty. $$

Under H 1, the fourth-order cumulants are assumed to be such that

$$\sup_{k_{1}}\sum_{k_{2},k_{3}=-n}^{n}\bigl \vert \kappa ( k_{1},k_{2},k_{3} ) \bigr \vert =O \bigl( n^{2d} \bigr) . $$

The idea of the test proposed in Berkes et al. (2006) is to use a CUSUM statistic with a standardization of the order \(O(\sqrt{n})\) that leads to a well-known limiting distribution under H 0, but to divergence under H 1, where dividing by \(n^{\frac{1}{2}}\) is not sufficient. The distribution of CUSUM statistics is well known under the assumption of no change in the mean. Under the null hypothesis considered here, we have one change point. If we knew the change point t 0, then we could consider a CUSUM statistic for \(Y_{1},\dots,Y_{t_{0}}\) and another CUSUM statistic for \(Y_{t_{0}+1},\dots,Y_{n}\) separately. For each statistic, the asymptotic distribution could be calculated using the supremum of a Brownian bridge. A natural approach to testing H 0 is therefore to first estimate the change point t 0, and then to consider the two CUSUM statistics for Y t (\(t\leq\hat{t}_{0}\)) and Y t (\(t\geq\hat{t}_{0}+1\)). Estimation of t 0 can also be done by means of a CUSUM statistic. Thus, we define

$$\hat{t}_{0}=\min \Bigl\{ i:\vert V_{i}\vert = \max_{1\leq j\leq n}\vert V_{j}\vert \Bigr\} $$

where

$$V_{i}=S_{1,i}-\frac{i}{n}S_{1,n}. $$

Given \(\hat{t}_{0}\), we consider

$$D_{1,\hat{t}_{0}}=\max_{1\leq i\leq\hat{t}_{0}}\biggl \vert S_{1,i}- \frac{i}{\hat{t}_{0}}S_{1,\hat{t}_{0}}\biggr \vert $$

and

$$D_{\hat{t}_{0}+1,n}=\max_{\hat{t}_{0}+1\leq i\leq n}\biggl \vert S_{\hat{t}_{0}+1,i}- \frac{i-\hat{t}_{0}}{n-\hat{t}_{0}}S_{\hat{t}_{0}+1,n}\biggr \vert . $$

Note that in both cases, the location parameter is removed automatically. The essential part is therefore the standardization of \(D_{1,\hat{t}_{0}}\) and \(D_{\hat{t}_{0}+1,n}\). To obtain a standardization that corresponds to \(\sqrt{\operatorname{var} ( S_{1,t_{0}} ) }\) and \(\sqrt{\operatorname{var} ( S_{t_{0}+1,n} ) }\) asymptotically under H 0, but remains of the order \(O ( \sqrt{n} ) \) under H 1, Berkes et al. (2006) propose Bartlett estimators defined by

where \(m_{\hat{t}_{0}}\) and \(m_{n-\hat{t}_{0}}\) tend to infinity at a slower rate than n. Here we use the notation

$$\hat{\gamma}_{i,j} ( u ) =\frac{1}{n_{i,j}}\sum _{t=i}^{j-\vert u\vert } ( Y_{t}-\bar{y}_{i,j} ) ( Y_{t+\vert u\vert }-\bar{y}_{i,j} ) $$

for the sample autocovariance at lag u (where j>i), based on observations Y i ,Y i+1,…,Y j , with n i,j =j−i+1 and \(\bar{y}_{i,j}=n_{i,j}^{-1}S_{i,j}\). If it is assumed that under H 0 the change point t 0 is asymptotically proportional to n (but not equal to n), then \(v_{1,\hat{t}_{0}}\) and \(v_{\hat{t}_{0}+1,n}\) both converge in probability to \(\sum_{u=-\infty}^{\infty}\gamma_{X} ( u ) =2\pi f_{X} ( 0 ) \). This is the asymptotic variance of a standardized sum since \(\operatorname{var} ( S_{1,n} ) \sim2\pi f_{X} ( 0 ) n\). On the other hand, under H 1, \(\operatorname{var} ( S_{1,n} ) \sim c_{S}n^{2d+1}\), but \(v_{1,\hat{t}_{0}}\) and \(v_{\hat{t}_{0}+1,n}\) diverge to infinity at a slower rate than n 2d. This essentially follows from \(\sum_{k=1}^{m}k^{2d-1}\sim \mathrm{const}\cdot m^{2d}=o ( n^{2d} ) \). Thus we obtain the desired asymptotic properties for the test statistics

$$T_{1,\hat{t}_{0}}=\hat{t}_{0}{}^{-\frac{1}{2}}v_{1,\hat{t}_{0}}^{-\frac{1}{2}}D_{1,\hat{t}_{0}}$$

and

$$T_{\hat{t}_{0}+1,n}= ( n-\hat{t}_{0} )^{-\frac{1}{2}}v_{\hat {t}_{0}+1,n}^{-\frac{1}{2}}D_{\hat{t}_{0}+1,n}. $$

More specifically, Berkes et al. (2006) use the following additional conditions:

and

$$\varDelta ^{2}\vert \hat{t}_{0}-t_{0}\vert =O_{p} ( 1 ) . $$

The joint distribution of the two statistics under H 0 is given by

Theorem 7.43

Suppose H 0 holds, and m n is nondecreasing, m n →∞ and such that

$$\sup_{k\geq0}\frac{m_{2^{k+1}}}{m_{2^{k}}}<\infty,\qquad m_{n} ( \log n )^{4}=O ( n ) . $$

Then, under the conditions above,

$$( T_{1,\hat{t}_{0}},T_{\hat{t}_{0}+1,n} ) \underset{d}{\rightarrow} \Bigl( \sup_{0\leq u\leq1}\bigl \vert \tilde{B}^{ ( 1 ) } ( u ) \bigr \vert ,\sup_{0\leq u\leq1}\bigl \vert \tilde{B}^{ ( 2 ) } ( u ) \bigr \vert \Bigr) $$

where \(\tilde{B}^{ ( 1 ) }\), \(\tilde{B}^{ ( 2 ) }\) are two independent Brownian bridges, i.e. \(\tilde{B}^{ ( i ) } ( u ) =B^{ ( i ) } ( u ) -uB^{ ( i ) } ( 1 ) \) with B (i) (i=1,2) two independent standard Brownian motions.

In contrast, under the alternative, we have long-range dependence so that the rate of convergence of sums is slower, the two statistics are no longer asymptotically independent and their distribution can be expressed in terms of one common fractional Brownian motion:

Theorem 7.44

Suppose that H 1 holds, and m n is nondecreasing, m n →∞ and such that

$$\sup_{k\geq0}\frac{m_{2^{k+1}}}{m_{2^{k}}}<\infty,\qquad m_{n} ( \log n )^{\frac{7}{2-4d}}=O ( n ) . $$

Then, under the conditions above,

$$\biggl( \biggl( \frac{m_{\hat{t}_{0}}}{n} \biggr)^{d}T_{1,\hat{t}_{0}}, \biggl( \frac{m_{n-\hat{t}_{0}}}{n} \biggr)^{d}T_{\hat{t}_{0}+1,n} \biggr) \underset{d}{\rightarrow} ( Z_{1},Z_{2} ) $$

where

B H is a fractional Brownian motion with self-similarity parameter \(H=d+\frac{1}{2}\) and

$$\tau=\inf \Bigl\{ t\geq0:\bigl \vert B_{H} ( t ) \bigr \vert = \sup_{0\leq u\leq1}\bigl \vert B_{H} ( u ) \bigr \vert \Bigr\} . $$

By assumption \(m_{\hat{t}_{0}}/n\) and \(m_{n-\hat{t}_{0}}/n\) converge to zero so that, under H 1, the vector \(( T_{1,\hat{t}_{0}},T_{\hat{t}_{0}+1,n} ) \) diverges to (∞,∞) in probability. Defining

$$T=\max \{ T_{1,\hat{t}_{0}},T_{\hat{t}_{0}+1,n} \}, $$

we have

$$T\underset{d}{\rightarrow}\max \Bigl\{ \sup_{0\leq u\leq1}\bigl \vert \tilde {B}^{ ( 1 ) } ( u ) \bigr \vert ,\sup_{0\leq u\leq1}\bigl \vert \tilde{B}^{ ( 2 ) } ( u ) \bigr \vert \Bigr\} , $$

under H 0, whereas under H 1 the statistic diverges to infinity. The results can be extended to a null hypothesis that allows for several shifts in the mean.
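
A compact implementation of the whole procedure is straightforward. The sketch below (Python with numpy; the Bartlett weights 1−|u|/(m+1) are one common choice and may differ in detail from the definition used by Berkes et al. 2006) estimates the change point, forms the two standardized CUSUM statistics, and compares their maximum with the critical value obtained from the distribution of the supremum of a Brownian bridge.

```python
import numpy as np

def bartlett_lrv(z, m):
    """Bartlett-type long-run variance estimate with bandwidth m (one common
    weighting scheme; the exact definition in the reference may differ)."""
    z = z - z.mean()
    n = len(z)
    gamma = [z[:n - u] @ z[u:] / n for u in range(m + 1)]
    return gamma[0] + 2.0 * sum((1 - u / (m + 1)) * gamma[u] for u in range(1, m + 1))

def cusum_change_test(y, m1, m2):
    """Change-point estimate t0 and the statistic T = max(T_1, T_2)."""
    n = len(y)
    s = np.cumsum(y)
    v = s - np.arange(1, n + 1) / n * s[-1]      # V_i = S_{1,i} - (i/n) S_{1,n}
    t0 = int(np.argmax(np.abs(v))) + 1           # first maximizer, 1-based index

    def d_stat(seg):                             # max_i |S_{1,i} - (i/len) S_{1,len}|
        ss = np.cumsum(seg)
        return np.max(np.abs(ss - np.arange(1, len(seg) + 1) / len(seg) * ss[-1]))

    y1, y2 = y[:t0], y[t0:]
    t_1 = d_stat(y1) / np.sqrt(len(y1) * bartlett_lrv(y1, m1))
    t_2 = d_stat(y2) / np.sqrt(len(y2) * bartlett_lrv(y2, m2))
    return t0, max(t_1, t_2)

def sup_bridge_cdf(x, terms=100):
    """P(sup |B~| <= x), the Kolmogorov distribution."""
    k = np.arange(1, terms + 1)
    return 1.0 - 2.0 * np.sum((-1.0) ** (k - 1) * np.exp(-2.0 * k ** 2 * x ** 2))

# Under H0 the limit of T is the maximum of two independent sup|B~|, so the
# alpha-level critical value c solves sup_bridge_cdf(c) = sqrt(1 - alpha).
```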

An essential element in the test procedure by Berkes et al. (2006) is the Bartlett estimator based on sample autocovariances. Apart from the difficulty of choosing appropriate sequences \(m_{\hat{t}_{0}}\) and \(m_{n-\hat{t}_{0}}\), more efficient estimators of the asymptotic values of γ X (k) exist because γ X (k)∼c γ |k|2d−1 is characterized by two parameters only. A test where all autocovariances are estimated by the sample autocovariance is likely to have relatively low power. Baek and Pipiras (2011) therefore suggest a more powerful test procedure where the hyperbolic shape of the autocovariances and the spectral density is exploited more directly. As before, in a first step \(\hat{t}_{0}\) is calculated. In a second step, the data are centred using \(\hat{t}_{0}\) by defining

$$\hat{X}_{t}=Y_{t}-\bar{y}_{1,\hat{t}_{0}}\quad ( 1\leq t\leq\hat{t}_{0} ) ,\qquad \hat{X}_{t}=Y_{t}-\bar{y}_{\hat{t}_{0}+1,n}\quad ( \hat{t}_{0}<t\leq n ) . $$

The third step is to estimate the long-memory parameter from \(\hat{X}_{1},\dots,\hat{X}_{n}\). If \(\hat{t}_{0}\) converges to t 0 fast enough, then \(\hat{d}\) converges to the true value d 0 under H 0 and under H 1. Thus, if we are able to establish that under H 0 a standardized statistic \(n^{\beta} ( \hat{d}-d^{0} ) \) converges to a nondegenerate random variable ζ, then we may use the test statistic \(T^{\ast}=\vert n^{\beta} ( \hat{d}-\frac{1}{2} ) \vert \). Under H 0, T ∗ converges in distribution to |ζ| whereas under H 1 the statistic diverges to infinity because the true value of d is not \(\frac{1}{2}\). For instance, Baek and Pipiras (2011) show the following result for the local Whittle estimator.

Theorem 7.45

Let \(\hat{d}\) be a local Whittle estimator based on \(\hat{X}_{t}\) using m Fourier frequencies λ j =2πj/n (j=1,2,…,m). Suppose that the conditions used in the theorems above as well as the regularity conditions needed for the local Whittle estimator (see Theorem 2 in Robinson 1995b; also see Chap. 5) hold. Furthermore, assume

$$\frac{m\log^{2}m}{n\varDelta ^{2}}\rightarrow0. $$

Then, under H 0,

$$\sqrt{m} \biggl( \hat{d}-\frac{1}{2} \biggr) \underset{d}{\rightarrow} \zeta\sim N \biggl( 0,\frac{1}{4} \biggr) , $$

whereas under H 1 with \(d^{0}\in ( 0,\frac{1}{2} ) \),

$$\hat{d}\underset{d}{\rightarrow}\,d^{0}. $$

For exact regularity conditions and detailed proofs, see Baek and Pipiras (2011). Note that Δ may even tend to zero, provided it does so at a slower rate than \(\log m\sqrt{m/n}\). The theorem essentially says that estimation of t 0 does not change the asymptotic distribution of the local Whittle estimator under H 0, and that under H 1 the estimator remains consistent. We may therefore reject H 0 at the level of significance α if

$$T^{\ast}=\biggl \vert \sqrt{m} \biggl( \hat{d}-\frac{1}{2} \biggr) \biggr \vert >\frac{1}{2}z_{1-\frac{\alpha}{2}}$$

where \(z_{1-\frac{\alpha}{2}}\) is the \(( 1-\frac{\alpha}{2} ) \)-quantile of the standard normal distribution.
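
The test is easy to mimic with a generic local Whittle routine. The sketch below (Python with numpy and scipy assumed; the segmentwise centring follows the description above, and d_null=1/2 is the centring value used in the text) is an illustration, not the authors' implementation.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def local_whittle(x, m):
    """Local Whittle estimate of the memory parameter using the first m
    Fourier frequencies (standard textbook implementation)."""
    n = len(x)
    lam = 2.0 * np.pi * np.arange(1, m + 1) / n
    I = np.abs(np.fft.fft(x)[1:m + 1]) ** 2 / (2.0 * np.pi * n)

    def profile(d):
        return np.log(np.mean(lam ** (2.0 * d) * I)) - 2.0 * d * np.mean(np.log(lam))

    return minimize_scalar(profile, bounds=(-0.49, 0.99), method="bounded").x

def baek_pipiras_stat(y, m, d_null=0.5):
    """T* = |sqrt(m)(d_hat - d_null)| after centring the two segments defined by
    the CUSUM change-point estimate; reject H0 if T* > 0.5 * z_{1-alpha/2}."""
    n = len(y)
    s = np.cumsum(y)
    v = s - np.arange(1, n + 1) / n * s[-1]
    t0 = int(np.argmax(np.abs(v))) + 1
    x_hat = np.concatenate([y[:t0] - y[:t0].mean(), y[t0:] - y[t0:].mean()])
    d_hat = local_whittle(x_hat, m)
    return abs(np.sqrt(m) * (d_hat - d_null))
```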

7.10 Estimation of Rapid Change Points in the Trend Function

In this section, we address the detection of rapid change points in a nonparametric regression function whose residuals are obtained from a long-memory Gaussian process by an unknown subordinating function (see Sect. 7.6). Motivated by a specific application, we base the estimation procedure on time series observed at unevenly spaced time points. This type of problem occurs, for instance, in palaeoclimatic research: to answer questions concerning past environmental changes, one may analyse environmental proxies such as pollen, oxygen and other gas isotopes found in ice or sediment samples. Such proxies give rise to time series whose successive observations are unevenly spaced in time. One important topic is rapid climate change, where one is concerned with the identification of rapid change points in the trend function; see Ammann et al. (2000) for background information on palaeoclimatic research. Most of the material covered in this section can be found in Menéndez et al. (2010); also see Menéndez (2009) and Menéndez et al. (2012). We start by introducing a continuous time stationary Gaussian process Z(u) \((u\in \mathbb{R})\) with E[Z(u)]=0, \(\operatorname{var}(Z)=1\) and

$$\gamma_{Z} ( v ) =\mathit{cov} \bigl( Z ( u ) ,Z ( u+v ) \bigr) \sim C_{Z}v^{2H-2} $$

as v→∞ where H∈(0,1). Here “∼” means that the ratio of the left and right hand side tends to one. The observed time series Y 1,…,Y n is assumed to be generated by a nonparametric regression model of the form

$$Y_{i}=m ( t_{i} ) +\varepsilon_{i},\quad i=1,\dots ,n, $$

where ε i =G(Z(T i ),t i ), \(T_{i}\in \mathbb{R}_{+}\), T 1T 2≤⋯≤T n , t i =T i /T n ∈[0,1] and m(⋅) is a smooth function. For each fixed t∈[0,1] the function G(⋅,t) is assumed to be in the L 2-space of functions (on \(\mathbb{R}\)) with \(E[G(Z,t)]=(2\pi )^{-\frac{1}{2}}\int G(z,t) \exp (-z^{2}/2)\,dz=0\) and ∥G2=E[G 2(Z,t)]<∞. This implies a convergent L 2-expansion

$$G ( z,t ) =\sum_{k=q}^{\infty}\frac{c_{k} ( t ) }{k!}H_{k} ( z ) , $$

where H k (⋅) are Hermite polynomials and q≥1 is the Hermite rank. The function G allows for non-Gaussian residuals with a changing marginal distribution (see Sect. 7.6). The spacings between successive time points are arbitrary except for some technical conditions (similar in spirit to the equidistant case, where T i =iT n /n and t i =i/n).
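
For simulation experiments, such an error process can be generated exactly at arbitrary time points by a Cholesky factorization of the Gaussian covariance matrix. The following sketch (Python with numpy; the Cauchy-type covariance (1+|v|)^{2H−2} is an illustrative choice that is positive definite, has unit variance and the hyperbolic decay assumed above) is not taken from the cited references.

```python
import numpy as np

def simulate_subordinated(T, H, G, rng=None):
    """Simulate eps_i = G(Z(T_i), T_i / T_n) at arbitrary time points T, where
    Z is a zero-mean, unit-variance Gaussian process with covariance
    gamma_Z(v) = (1 + |v|)^(2H - 2)  (illustrative choice; decays like v^(2H-2))."""
    rng = np.random.default_rng() if rng is None else rng
    T = np.asarray(T, dtype=float)
    v = np.abs(T[:, None] - T[None, :])
    cov = (1.0 + v) ** (2.0 * H - 2.0)
    L = np.linalg.cholesky(cov + 1e-10 * np.eye(len(T)))   # exact simulation of Z(T_i)
    z = L @ rng.standard_normal(len(T))
    t = T / T[-1]
    return np.array([G(zi, ti) for zi, ti in zip(z, t)])

# Example with Hermite rank q = 1 and a slowly changing scale:
# eps = simulate_subordinated(T, H=0.8, G=lambda z, t: (1.0 + 0.5 * t) * z)
```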

Rapid change is defined in terms of derivatives of the trend function: the change is fast but continuous, so that m itself does not jump. More specifically, rapid change is said to occur whenever the absolute value of the first derivative of m has a local maximum and exceeds a certain threshold. Let m (i)(t) denote the ith derivative of m with respect to t. We follow the definition of a rapid change point considered in Müller and Wang (1994) in the context of hazard rate estimation:

Definition 7.9

Given a threshold η>0, the p time points τ 1,τ 2,…,τ p ∈(0,1) are rapid change points of the trend function m if

$$\bigl \vert m^{ ( 1 ) } ( \tau_{j} ) \bigr \vert >\eta\quad\mbox{and}\quad \bigl \vert m^{ ( 1 ) }\bigr \vert \mbox{ has a local maximum at } \tau_{j}\quad ( j=1,\dots ,p ) . $$

In applications, the trend derivatives have to be estimated. Thus, consider the nonparametric curve estimates based on a Priestley–Chao type kernel estimator

where ν=0,1,2,…,t 0=0 and the kernel K satisfies the following conditions (Gasser and Müller 1984):

(i) K∈C ν+1[−1,1];

(ii) K(x)≥0, K(x)=0 (|x|>1), \(\int_{-1}^{1}K(x)\,dx=1\);

(iii) |K (ν)(x)−K (ν)(y)|≤L 0|x−y| for all x,y∈[−1,1], where \(L_{0}\in \mathbb{R}^{+}\) is a constant;

(iv) K is of order (ν,k), ν≤k−2, where k is a positive integer, i.e.

$$ \int_{-1}^{1}K^{(\nu )}(x)x^{j}\,dx =\left \{ \begin{array}{l@{\quad}l} (-1)^{\nu }\nu !, & j=\nu, \\ 0, & j=0,\ldots ,\nu -1,\nu +1,\ldots ,k-1, \\ \theta, & j=k \end{array} \right . $$

where θ≠0 is a constant;

(v) K (j)(1)=K (j)(−1)=0 for all j=0,1,…,ν−1.

It turns out that by Lemma 1 in Gasser and Müller (1984) one can also write

$$ \int_{-1}^{1}K(x)x^{j}\,dx=\left \{ \begin{array}{l@{\quad}l} 1, & j=0, \\ 0, & j=1,\ldots ,k-\nu -1, \\ (-1)^{\nu }\theta {\frac{(k-\nu )!}{k!}}, & j=k-\nu. \end{array} \right . $$
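
A direct implementation of a kernel estimator of this type is short. The sketch below (Python with numpy; it assumes a Priestley–Chao-type weighting \(b^{-(\nu+1)}\sum_{i}(t_{i}-t_{i-1})K^{(\nu)}((t-t_{i})/b)Y_{i}\) with a Gaussian kernel, as in the data example at the end of the section, and is not necessarily the exact definition used in Menéndez et al. 2010) estimates the trend derivatives on an unevenly spaced design and flags candidate rapid change points.

```python
import numpy as np

def gauss_deriv(x, nu):
    """Derivatives of the standard Gaussian density, nu = 0, 1, 2, 3."""
    phi = np.exp(-0.5 * x ** 2) / np.sqrt(2.0 * np.pi)
    poly = {0: np.ones_like(x), 1: -x, 2: x ** 2 - 1.0, 3: 3.0 * x - x ** 3}[nu]
    return poly * phi

def pc_derivative(u, t, y, b, nu):
    """Priestley-Chao-type estimate of m^(nu)(u) from unevenly spaced (t_i, Y_i),
    with rescaled design points t in [0,1] and t_0 = 0 (illustrative form)."""
    dt = np.diff(np.concatenate(([0.0], t)))      # spacings t_i - t_{i-1}
    w = gauss_deriv((u - t) / b, nu) * dt
    return np.sum(w * y) / b ** (nu + 1)

def candidate_rapid_changes(t, y, b, eta, grid):
    """Candidate rapid change points on a grid: the estimated second derivative
    changes sign while the estimated first derivative exceeds eta in absolute value."""
    m1 = np.array([pc_derivative(u, t, y, b, 1) for u in grid])
    m2 = np.array([pc_derivative(u, t, y, b, 2) for u in grid])
    flag = (np.abs(m1[:-1]) > eta) & (np.sign(m2[:-1]) != np.sign(m2[1:]))
    return grid[:-1][flag]
```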

For a given sample and a fixed value of the first-derivative threshold η, the number \(\hat{p}\) of estimated change points, i.e. of zeros of \(\hat{m}^{(2)}\) at which the estimated first derivative exceeds η in absolute value, is random, whereas the true number of change points p is unknown. However, as the sample size increases, consistency of \(\hat{m}\) and \(\hat{p}\) follows under suitable regularity conditions on m. The following technical conditions are used to prove the consistency result in the theorem below:

(A1) The coefficients c k (t)=E[G(Z,t)H k (Z)] in the Hermite expansion of G(Z,t) are continuously differentiable with respect to t∈[0,1];

(A2) 1−(2q)−1<H<1;

(A3) m∈C ν+1[0,1];

(A4) 0≤T 1≤T 2≤⋯≤T n , t i =T i /T n ∈[0,1];

(A5) \(\alpha_{n}^{-1}\leq t_{j}-t_{j-1}\leq \beta_{n}^{-1}\) where α n ≥β n >0 and β n →∞;

(A6) \(b\rightarrow0\), \(b^{2\nu} ( T_{n}b )^{ ( 2-2H ) q}\rightarrow\infty\), and \(b\beta_{n}\rightarrow\infty\);

(A7) \(\lim_{n\rightarrow \infty } ( b\alpha_{n} )^{1+ ( 2-2H ) q} ( b\beta_{n} )^{-2}=0\);

(A8) K∈C ν+1[0,1] with 0<c ν+1=sup u∈[0,1]|K (ν+1)(u)|<∞.

The following observations can be made. (A1) implies a slowly changing marginal distribution of the regression residuals, which may be understood as a type of local stationarity. Due to (A2), the long-memory property of Z is inherited by the subordinated error process. (A5) ensures that there are no repeated time points and, more generally, no extreme clustering of the time points. A special case is that of equidistant time points (set α n =β n =n). The first condition in (A6) is needed to avoid an asymptotic bias in \(\hat{m}^{(\nu )}(t)\), whereas the second and the third conditions ensure that the asymptotic expression for the variance of \(\hat{m}^{(\nu )}(t)\) converges to zero. (A7) is needed for the asymptotic approximation of the mean squared error. Due to (A2), (2−2H)q<1 so that (A7) is possible although α n ≥β n . For additional discussions and related results, specifically for monotone transforms G and slightly different conditions on the spacings T i −T i−1 between successive observations, see Menéndez et al. (2012).

Theorem 7.46

Under the assumptions stated earlier in this section and (A1)(A7), we have for t∈(0,1):

where

and

Proof

Let t∈(0,1) be a scalar. The expression for the bias follows from a Taylor series expansion of m and properties of the kernel. To prove the result for the variance, note that

where

$$V_{i,j}=\mathit{Cov}(Y_{i},Y_{j})=\sum _{l=q}^{\infty}{\frac{c_{l}(t_{i})c_{l}(t_{j})}{l!}}\gamma_{Z}^{l}(T_{i}-T_{j}). $$

Recalling

$$ \gamma_{Z}(T_{i}-T_{j})\sim C_{Z} \vert T_{i}-T_{j}\vert ^{2H-2} $$

and −1<(2H−2)q<0, we have

$$ \mathit{Cov}(Y_{i},Y_{j})\sim \frac{c_{q}^{2}(t)}{q!} \gamma_{Z}^{q}(T_{i}-T_{j}) $$

for i,jU b (t) with \(U_{b}= \{ k\in \mathbb{N}:\vert t-T_{k}/T_{n}\vert \leq b \} \). It is then sufficient to consider

$$ \begin{aligned} S_{n}&=b^{-2}(T_{n}b)^{(2-2H)q}\sum _{i\neq j}(t_{i}-t_{i-1}) (t_{j}-t_{j-1})K^{(\nu )} \biggl( \frac{t_{i}-t}{b} \biggr) K^{(\nu )} \\ &\quad {}\times\biggl( \frac{t_{j}-t}{b} \biggr) \vert T_{i}-T_{j}\vert ^{(2H-2)q}. \end{aligned}$$

Since K(u)=0 for |u|>1, we have

$$ S_{n}=\sum_{i:\vert T_{i}-tT_{n}\vert \leq T_n b}K^{(\nu )} \biggl( \frac{t_{i}-t}{b} \biggr) \frac{t_{i}-t_{i-1}}{b} [ S_{i,1}+S_{i,2} ] $$

where

Setting

$$ h_{n}(x)=K^{(\nu )} \biggl( x- {t \over b} \biggr) \times \biggl( {t_i \over b} -x \biggr)^{(2H-2)q}, $$

we have

and an analogous expression for S i,2 where t j−1/bx j t j /b and \(h_{n}^{\prime }(x)=g_{n,1}(x)+g_{n,2}(x)\) with

By assumption we have \(\alpha_{n}^{-1}\leq \vert t_{j}-t_{j-1}\vert \leq \beta_{n}^{-1}\), −1<(2H−2)q<0 and

$$ 0\leq \sup_{u\in [ 0,1]}\bigl \vert K^{(\nu +1)} ( u ) \bigr \vert =c_{\nu +1}<\infty . $$

Also note that the assumption bβ n →∞ implies bα n →∞. Using the notation j 1=[α n (t−b)] and j 2=[α n (t+b)], an upper bound can be given by

Thus if (2H−2)q>−1 and \(\lim_{n\rightarrow \infty }b^{-1}\alpha_{n}\beta_{n}^{-2}=0\), there is a uniform (in i) upper bound on the remainder term r n,i,1. Note that 1+(2−2H)q>1 and bα n →∞, so that \(\lim_{n\rightarrow\infty} ( b\alpha_{n} ) ( b\beta_{n} )^{-2}=0\) follows from the assumption that \(\lim_{n\rightarrow\infty} ( b\alpha_{n} )^{1+ ( 2-2H ) q} ( b\beta_{n} )^{-2}=0\). Similarly, considering the remainder term r n,i,2 for g n,2, we have

so that, under the assumption that H<1 and \(\lim_{n\rightarrow\infty} ( b\alpha_{n} )^{1+ ( 2-2H ) q} ( b\beta_{n} )^{-2}=0\), there is a uniform (in i) upper bound on the remainder term r n,i,2. Analogous arguments apply to S i,2 , so that the sum S n converges to the corresponding double integral and \(C_{Z}c_{q}^{2}(t)/q!\) times S n converges to the asymptotic variance given in the theorem. □

The asymptotic formula for the mean squared error stated above implies an asymptotically optimal bandwidth of the form

$$ b_{\mathrm{opt}}= \biggl[ \frac{2\nu +(2-2H)q}{2(k-\nu )}\frac{I_{q}}{J_{\nu ,k}^{2}} \biggr]^{\frac{1}{2k+(2-2H)q}}T_{n}^{\frac{(2H-2)q}{2k+(2-2H)q}}. $$
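
The form of b opt can be checked by a short calculus argument. If, as the expression above suggests, the asymptotic mean squared error is of the bias-squared plus variance form

$$\mathrm{MSE} ( b ) \approx J_{\nu ,k}^{2}b^{2 ( k-\nu ) }+I_{q}b^{-2\nu } ( T_{n}b )^{ ( 2H-2 ) q}, $$

then setting the derivative with respect to b equal to zero yields

$$2 ( k-\nu ) J_{\nu ,k}^{2}b^{2k+ ( 2-2H ) q}= \bigl( 2\nu + ( 2-2H ) q \bigr) I_{q}T_{n}^{ ( 2H-2 ) q}, $$

whose solution is exactly the bandwidth b opt given above.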

The central limit theorem in the corollary below states that if the Hermite rank q equals 1, the limiting distribution of \(\hat{m}^{(\nu )}(t)\) is normal and the estimates at different fixed values t 1,…,t k are asymptotically independent. If, however, q≥2, a similar limit theorem can be derived but with a non-normal asymptotic distribution which would correspond to the marginal distribution of a Hermite process of order q.

Corollary 7.4

Suppose that the Hermite rank q of G is one. Let t=(t 1,…,t k )′, \(\hat{\mathbf{m}}^{(\nu )}(\mathbf{t})= [ \hat{m}^{(\nu )}(t_{1}),\dots,\hat{m}^{(\nu )}(t_{k}) ]^{\prime }\) and define the k×k diagonal matrix

$$\mathbf{D}=\operatorname{diag}\bigl(\sqrt{I_{1}(t_{1})} , \dots,\sqrt{I_{1}(t_{k})}\bigr). $$

Then, under the assumptions of Theorem 7.46, we have, as n tends to infinity,

$$ b^{\nu }(T_{n}b)^{1-H}D^{-1} \bigl\{ \hat{\mathbf{m}}^{(\nu )}(\mathbf{t})-E\bigl[ \hat{\mathbf{m}}^{(\nu )}(\mathbf{t}) \bigr] \bigr\} \underset{d}{\rightarrow } ( \zeta_{1},\dots,\zeta_{k} )^{\prime } $$

where ζ i are i.i.d. standard normal variables.

Proof

The result follows from the previous theorem and the fact that asymptotically the distribution of

$$ \varDelta _{n}=(T_{n}b)^{(1-H)q} \bigl\{ \hat{m}^{(2)}(\tau_{i})-E \bigl[ \hat{m}^{(2)}( \tau_{i}) \bigr] \bigr\} $$

is equivalent to the asymptotic distribution of

which is a sequence of normal variables. Asymptotic independence of \(\hat{m}^{(\nu )}(t)\) and \(\hat{m}^{(\nu )}(s)\) for ts follows by analogous arguments as in the proof of the last theorem, along the lines of Csörgő and Mielniczuk (1995b). □

Note that the estimate of the change points will involve estimates of the trend derivatives, which in turn will depend on the respective bandwidths. As we have seen in the theorem earlier, if b is too large, and in particular if b −2ν(T n b)(2H−2)q is of smaller order than b 2(kν), then the bias of \(\hat{\mathbf{\tau}}_{n}\) will dominate the mean squared error and no reasonable confidence interval for \(\mathbf{\tau }\) can be given. Consider, however, (i) b 2k=o((T n b)(2H−2)q) which allows the bias to be asymptotically negligible, or (ii) b 2kC⋅(T n b)(2H−2)q which makes the asymptotic contribution of both bias and variance of the same order. For these cases, if the Hermite rank of G is one, asymptotic normality of \(\hat{\mathbf{\tau}}_{n}\) follows.

Theorem 7.47

Let \(\mathbf{\tau }= ( \tau_{1,}\tau_{2},\dots,\tau_{p} )^{\prime }\) be the points of rapid change of m, and suppose that the assumptions of the corollary to the last theorem hold. Then there is a sequence \(\hat{\mathbf{\tau}}_{n} = ( \hat{\tau}_{n;1,}\hat{\tau}_{n;2},\dots,\hat{\tau}_{n;p} )^{\prime }\) such that \(\hat{m}^{(2)}(\hat{\tau}_{n;i})=0\) (1≤ip) and \(\hat{\mathbf{\tau}}_{n}\rightarrow_{p}\mathbf{\tau }\). Moreover, define the p×p diagonal matrix

$$ \tilde{\mathbf{D}} = \operatorname{diag}\bigl(\sqrt{I_{1}(\tau_{1})}\big/ \bigl \vert m^{(3)}(\tau_{1}) \bigr \vert ,\dots, \sqrt{I_{1}(\tau_{p})}\big/ \bigl \vert m^{(3)}( \tau_{p})\bigr \vert \bigr). $$

Then the asymptotic distribution of \(\hat{\mathbf{\tau}}_{n}\) is given as follows:

(i) If b 2k=o((T n b)2H−2), then \((T_{n}b)^{1-H}\tilde{\mathbf{D}}^{-1} ( \hat{\mathbf{\tau}}_{n}-\mathbf{\tau } ) \underset{d}{\rightarrow }(\zeta_{1},\dots,\zeta_{p})^{\prime }\) where ζ i are i.i.d. standard normal variables;

(ii) If b 2k∼C⋅(T n b)2H−2, then \((T_{n}b)^{1-H}\tilde{\mathbf{D}}^{-1} ( \hat{\mathbf{\tau}}_{n}-\mathbf{\tau } ) \underset{d}{\rightarrow }(\mu_{1}+\zeta_{1},\dots,\mu_{p}+ \zeta_{p})^{\prime }\) where ζ i are as in (i) and

$$ \mu_{i}= \biggl[ \frac{m^{(k)}(\tau_{i})}{k!} \int_{-1}^{1}K^{(\nu)}(u)u^{k-\nu }\,du \biggr] \Big/m^{(3)}(\tau_{i}). $$

Proof

Consistency follows from m(t)∈C ν+1[0,1] and the consistency of \(\hat{m}^{(2)}(t)\). For the asymptotic distribution of \(\hat{\mathbf{\tau}}_{n}\), we have by Taylor expansion

$$ \hat{\mathbf{\tau}}_{n:i}-E ( \hat{\mathbf{\tau}}_{n:i} ) =- \hat{m}^{(2)}(\tau_{i}) \bigl[ m^{(3)}( \tau_{i}) \bigr]^{-1}+o_{p}\bigl(b^{-2}(T_{n}b)^{H-1} \bigr). $$

Since the Hermite rank q of G is equal to one, the limiting behaviour given in (i) and (ii) then follows from the last theorem and its corollary. □

Note that a similar non-Gaussian limit theorem can be derived for q≥2. By arguments analogous to those above, it can be shown that the number of zeros of \(\hat{m}^{(2)}\) with \(\vert \hat{m}^{(1)}\vert >\eta \) converges to p in probability, so that when n is sufficiently large, p can be estimated with arbitrary precision; in particular, the estimate of p can be plugged in for computing confidence intervals for the change points.

The following example is concerned with evidence of rapid climate changes in the northern hemisphere approximately 20,000 years before present (‘present’ being set at 1989). The observations are oxygen isotope ratio measurements from a Greenland ice core (Johnsen et al. 1997), resulting in unevenly spaced time series observations, so that a continuous time process is appropriate for modelling the regression errors. The data are analysed and rapid change points in the trend function are identified using the methods described in this section. For curve estimation, the Gaussian kernel and its derivatives with support \(\mathbb{R}\) were used, which gave very smooth curve estimates; this is appropriate in the current example. The regression residuals are estimated by detrending the data series locally, using the optimal bandwidth formula given above. The distribution of the residuals turned out to be very close to normal, so that one may assume q=1 and \(c_{1}^{2}(t_{i})\approx \operatorname{var}(Y_{i})\). On the original time scale in years (before 1989), the method identifies the main points of rapid change around the epoch known as the Younger Dryas, at about 11,560 and 14,658 years before 1989 (see Fig. 7.22). For further details of the data analysis, see Menéndez et al. (2010).

Fig. 7.22

Top: Oxygen isotope values plotted against age (years before present, i.e. before 1989) and an estimated trend curve. Left middle: Distance between successive time points. Right middle: Periodogram of residuals and fitted spectral density in log–log coordinates. Bottom: Estimated trend derivatives \(\widehat{m^{(\nu )}}\) (ν=0,1,2,3). The curve estimates are rescaled for better visibility. The two vertical lines mark rapid climate change points, where the threshold for the speed of change is set at η=100. The two main points of rapid climate change are estimated to be at around 11,560 and 14,658 years before 1989. The asymptotic 95 %-confidence intervals for the change points (in years before 1989), ignoring estimation bias, are (11,554;11,566) and (14,646;14,670), respectively. Data source: Greenland Ice Core Project dataset, Johnsen et al. 1997. The figure is reproduced from the Journal of Statistical Planning and Inference (2010), vol. 140, 3343–3354