Abstract
We address the problem of forecasting a time series satisfying the Causal Bernoulli Shift model, using a parametric set of predictors. The aggregation technique provides a predictor with well-established and quite satisfactory theoretical properties, expressed by an oracle inequality for the prediction risk. The numerical computation of the aggregated predictor usually relies on a Markov chain Monte Carlo method whose convergence should be evaluated. In particular, it is crucial to bound the number of simulations needed to achieve a numerical precision of the same order as the prediction risk. In this direction we present a fairly general result, which can be seen as an oracle inequality including the numerical cost of computing the predictor. The numerical cost appears by letting the oracle inequality depend on the number of simulations required in the Monte Carlo approximation. Some numerical experiments are then carried out to support our findings.
1 Introduction
The objective of our work is to forecast a stationary time series \(Y = \left (Y _{t}\right )_{t\in \mathbb{Z}}\) taking values in \(\mathcal{X} \subseteq \mathbb{R}^{r}\) with r ≥ 1. For this purpose we propose and study an aggregation scheme using exponential weights.
Consider a set of individual predictors, each giving its prediction at each time t. An aggregation method builds from this set a new prediction which is nearly as good as the best individual one with respect to a given risk criterion (see [17]). This kind of result is established through oracle inequalities. The power and the beauty of the technique lie in its simplicity and versatility. The most basic and general context of application is that of individual sequences, where no assumption is made on the observations (see [9] for a comprehensive overview). Nevertheless, results need to be adapted if we impose a stochastic model on the observations.
The use of exponential weighting in aggregation and its links with the PAC-Bayesian approach have been investigated, for example, in [5, 8] and [11]. Dependent processes have not received much attention from this viewpoint, except in [1] and [2]. In the present paper we study the properties of the Gibbs predictor applied to Causal Bernoulli Shifts (CBS). CBS are an example of dependent processes (see [12] and [13]).
Our predictor is expressed as an integral, since the set over which we aggregate is in general not finite. High-dimensional settings are increasingly common, and the computation of this integral is a major issue. We use classical Markov chain Monte Carlo (MCMC) methods to approximate it. Results from Łatuszyński [15, 16] control the number of MCMC iterations needed to obtain precise bounds for the approximation of the integral. These bounds hold in expectation and in probability with respect to the distribution of the underlying Markov chain.
In this contribution we first slightly revisit certain lemmas presented in [2, 8] and [20] to derive an oracle bound for the prediction risk of the Gibbs predictor. We stress that the inequality controls the convergence rate of the exact predictor. Our second goal is to investigate the impact of the approximation of the predictor on the convergence guarantees described for its exact version. Combining the PAC-Bayesian bounds with the MCMC control, we then provide an oracle inequality that applies to the MCMC approximation of the predictor, which is actually used in practice.
The paper is organised as follows: we introduce a motivating example and several definitions and assumptions in Sect. 2. In Sect. 3 we describe the methodology of aggregation and provide the oracle inequality for the exact Gibbs predictor. The stochastic approximation is studied in Sect. 4. We state a general proposition independent of the model for the Gibbs predictor. Next, we apply it to the more particular framework delineated in our paper. A concrete case study is analysed in Sect. 5, including some numerical work. A brief discussion follows in Sect. 6. The proofs of most of the results are deferred to Sect. 7.
Throughout the paper, for \(\boldsymbol{a} \in \mathbb{R}^{q}\) with \(q \in \mathbb{N}^{{\ast}}\), \(\|\boldsymbol{a}\|\) denotes its Euclidean norm, \(\|\boldsymbol{a}\| = (\sum _{i=1}^{q}a_{i}^{2})^{1/2}\) and \(\|\boldsymbol{a}\|_{1}\) its 1-norm \(\|\boldsymbol{a}\|_{1} =\sum _{ i=1}^{q}\vert a_{i}\vert \). We denote, for \(\boldsymbol{a} \in \mathbb{R}^{q}\) and Δ > 0, \(B\left (\boldsymbol{a},\varDelta \right ) =\{\boldsymbol{ a}_{1} \in \mathbb{R}^{q}:\|\boldsymbol{ a} -\boldsymbol{ a}_{1}\| \leq \varDelta \}\) and \(B_{1}\left (\boldsymbol{a},\varDelta \right ) =\{\boldsymbol{ a}_{1} \in \mathbb{R}^{q}:\|\boldsymbol{ a} -\boldsymbol{ a}_{1}\|_{1} \leq \varDelta \}\) the corresponding balls centered at \(\boldsymbol{a}\) of radius Δ > 0. In general bold characters represent column vectors and normal characters their components; for example \(\boldsymbol{y} = \left (\,y_{i}\right )_{i\in \mathbb{Z}}\). The use of subscripts with ‘:’ refers to certain vector components \(\boldsymbol{y}_{1:k} = \left (\,y_{i}\right )_{1\leq i\leq k}\), or elements of a sequence \(X_{1:k} = \left (X_{t}\right )_{1\leq t\leq k}\). For a random variable U distributed as ν and a measurable function h, ν[h(U)] or simply ν[h] stands for the expectation of h(U): ν[h] = ∫ h(u)ν(du).
2 Problem Statement and Main Assumptions
Real stable autoregressive processes of a fixed order, referred to as AR(d) processes, are among the simplest examples of CBS. They are defined as the stationary solution of
$$\displaystyle{ X_{t} =\sum _{k=1}^{d}\theta _{k}X_{t-k} +\sigma \xi _{t}\;, }$$(1)
where the \((\xi _{t})_{t\in \mathbb{Z}}\) are i.i.d. real random variables with \(\mathbb{E}[\xi _{t}] = 0\) and \(\mathbb{E}[\xi _{t}^{2}] = 1\).
Several efficient estimators of the parameter \(\boldsymbol{\theta }= \left [\begin{array}{*{10}c} \theta _{1} & \ldots & \theta _{d} \end{array} \right ]^{{\prime}}\) are available; they can be computed via simple algorithms such as the Levinson-Durbin or the Burg algorithm. From them we also derive efficient predictors. However, as the model is simple to handle, we use it to progressively introduce our general setup.
Denote
\(\boldsymbol{X}_{t-1} = \left [\begin{array}{*{10}c} X_{t-1} & \ldots & X_{t-d} \end{array} \right ]^{{\prime}}\) and \(\boldsymbol{e}_{1} = \left [\begin{array}{*{10}c} 1&0&\ldots &0 \end{array} \right ]^{{\prime}}\) the first canonical vector of \(\mathbb{R}^{d}\). M′ represents the transpose of the matrix M (including vectors). The recurrence (1) gives
$$\displaystyle{ \boldsymbol{X}_{t} = A\left (\boldsymbol{\theta }\right )\boldsymbol{X}_{t-1} +\sigma \xi _{t}\boldsymbol{e}_{1}\;, }$$(2)
where \(A\left (\boldsymbol{\theta }\right )\) is the companion matrix with first row \(\boldsymbol{\theta }'\) and the identity matrix of size d − 1 below it.
The eigenvalues of \(A\left (\boldsymbol{\theta }\right )\) are the inverses of the roots of the autoregressive polynomial \(\boldsymbol{\theta }\left (z\right ) = 1 -\sum _{k=1}^{d}\theta _{k}z^{k}\); they are thus of modulus at most δ for some \(\delta \in \left (0,1\right )\), due to the stability of X (see [7]). In other words \(\boldsymbol{\theta }\in s_{d}\left (\delta \right ) =\{\boldsymbol{\theta }:\ \boldsymbol{\theta }\left (z\right )\neq 0\;\mbox{ for}\;\vert z\vert <\delta ^{-1}\} \subseteq s_{d}\left (1\right )\). In this context (or even in a more general one, see [14]), for all \(\delta _{1} \in (\delta,1)\) there is a constant \(\bar{K}\) depending only on \(\boldsymbol{\theta }\) and \(\delta _{1}\) such that for all j ≥ 0
$$\displaystyle{ \left \|A^{j}\left (\boldsymbol{\theta }\right )\right \| \leq \bar{ K}\delta _{1}^{j}\;, }$$(3)
and then, the variance of X t , denoted γ 0, satisfies \(\gamma _{0} =\sigma ^{2}\sum _{j=0}^{\infty }\vert \boldsymbol{e}'_{1}A^{j}\left (\boldsymbol{\theta }\right )\boldsymbol{e}_{1}\vert ^{2} \leq \bar{ K}^{2}\sigma ^{2}/(1 -\delta _{1}^{2})\).
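To make this setup concrete, here is a minimal numerical sketch (the AR(2) parameter, sample size and Gaussian innovations are illustrative assumptions) of the companion-matrix representation and of the stability check through the spectral radius of \(A\left (\boldsymbol{\theta }\right )\):

```python
import numpy as np

def companion(theta):
    """Companion matrix A(theta): first row theta', identity below it."""
    d = len(theta)
    A = np.zeros((d, d))
    A[0, :] = theta
    if d > 1:
        A[1:, :-1] = np.eye(d - 1)
    return A

def simulate_ar(theta, sigma=1.0, T=1000, burn=500, rng=None):
    """Simulate X_t = sum_k theta_k X_{t-k} + sigma*xi_t with i.i.d. N(0,1) xi."""
    rng = np.random.default_rng(rng)
    d = len(theta)
    x = np.zeros(T + burn)
    for t in range(d, T + burn):
        x[t] = np.dot(theta, x[t - d:t][::-1]) + sigma * rng.standard_normal()
    return x[burn:]

theta = np.array([0.5, -0.25])                       # illustrative stable AR(2)
rho = max(abs(np.linalg.eigvals(companion(theta))))  # max inverse root modulus
assert rho < 1                                       # theta lies in s_d(delta) for delta >= rho
x = simulate_ar(theta, T=2000, rng=0)
```

The spectral radius computed here plays the role of δ: any δ1 ∈ (ρ, 1) gives a geometric decay as in (3).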
The following definition allows us to introduce the process of interest.
Definition 1
Let \(\mathcal{X}'\subseteq \mathbb{R}^{r'}\) for some r′ ≥ 1 and let \(A = (A_{j})_{j\geq 0}\) be a sequence of non-negative numbers. A function \(H: (\mathcal{X}')^{\mathbb{N}} \rightarrow \mathcal{X}\) is said to be A-Lipschitz if
$$\displaystyle{ \left \|H\left (\boldsymbol{u}\right ) - H\left (\boldsymbol{v}\right )\right \| \leq \sum _{j=0}^{\infty }A_{j}\left \|u_{j} - v_{j}\right \|\;, }$$
for any \(\boldsymbol{u} = (u_{j})_{j\in \mathbb{N}},\boldsymbol{v} = (v_{j})_{j\in \mathbb{N}} \in (\mathcal{X}')^{\mathbb{N}}\).
Given \(A = (A_{j})_{j\geq 0}\) with A j ≥ 0 for all j ≥ 0, an i.i.d. sequence \(\left (\xi _{t}\right )_{t\in \mathbb{Z}}\) of \(\mathcal{X}'\)-valued random variables and \(H: (\mathcal{X}')^{\mathbb{N}} \rightarrow \mathcal{X}\), we say that a time series \(X = \left (X_{t}\right )_{t\in \mathbb{Z}}\) admitting the following property is a Causal Bernoulli Shift (CBS) with Lipschitz coefficients A and innovations \(\left (\xi _{t}\right )_{t\in \mathbb{Z}}\).
(M) The process \(X = \left (X_{t}\right )_{t\in \mathbb{Z}}\) meets the representation
$$\displaystyle\begin{array}{rcl} X_{t}& =& H\left (\xi _{t},\xi _{t-1},\xi _{t-2},\ldots \right ),\forall t \in \mathbb{Z}\;, {}\\ \end{array}$$
where H is an A-Lipschitz function with the sequence A satisfying
$$\displaystyle\begin{array}{rcl} \tilde{A}_{{\ast}}& =& \sum _{j=0}^{\infty }jA_{ j} < \infty \;. {}\end{array}$$(4)
We additionally define
$$\displaystyle{ A_{{\ast}} =\sum _{ j=0}^{\infty }A_{ j}\;. }$$(5)
CBS encompass several types of nonmixing stationary Markov chains, real-valued functional autoregressive models and Volterra processes, among other interesting models (see [10]). Thanks to the representation (2) and the inequality (3), AR(d) processes are CBS with \(A_{j} =\sigma \bar{ K}\delta _{1}^{j}\) for j ≥ 0.
We let ξ denote a random variable distributed as the ξ t 's. Results from [1] and [2] require a control on the exponential moment of ξ at ζ = A ∗, which is provided by the following hypothesis.
(I) The innovations \(\left (\xi _{t}\right )_{t\in \mathbb{Z}}\) satisfy \(\phi (\zeta ) = \mathbb{E}\left [\mathrm{e}^{\zeta \|\xi \|}\right ] < \infty \).
Bounded or Gaussian innovations trivially satisfy this hypothesis for any \(\zeta \in \mathbb{R}\).
Let π 0 denote the probability distribution of the time series Y that we aim to forecast. Observe that for a CBS, π 0 depends only on H and the distribution of ξ. For any measurable \(f: \mathcal{X}^{\mathbb{N}^{{\ast}} } \rightarrow \mathcal{X}\) and \(t \in \mathbb{Z}\) we consider \(\hat{Y }_{t} = f\left (\left (Y _{t-i}\right )_{i\geq 1}\right )\), a possible predictor of Y t from its past. For a given loss function \(\ell: \mathcal{X}\times \mathcal{X} \rightarrow \mathbb{R}_{+}\), the prediction risk is evaluated by the expectation of \(\ell(\hat{Y }_{t},Y _{t})\),
$$\displaystyle{ R\left (\,f\right ) =\pi _{0}\left [\ell(\hat{Y }_{t},Y _{t})\right ]\;, }$$
which does not depend on t by stationarity.
We assume in the following that the loss function ℓ fulfills the condition:
(L) For all \(\boldsymbol{y},\boldsymbol{z} \in \mathcal{X}\), \(\ell\left (\,\boldsymbol{y},\boldsymbol{z}\right ) = g\left (\,\boldsymbol{y} -\boldsymbol{ z}\right )\) for some convex function g which is non-negative, satisfies \(g\left (0\right ) = 0\) and is K-Lipschitz: \(\left \vert g\left (\,\boldsymbol{y}\right ) - g\left (\boldsymbol{z}\right )\right \vert \leq K\|\boldsymbol{y} -\boldsymbol{ z}\|\).
If \(\mathcal{X}\) is a subset of \(\mathbb{R}\), the loss \(\ell\left (\,y,z\right ) = \left \vert y - z\right \vert \) satisfies Assumption (L) with K = 1.
From estimators of dimension d of \(\boldsymbol{\theta }\) we can build the corresponding linear predictors \(f_{\boldsymbol{\theta }}\left (\,\boldsymbol{y}\right ) =\boldsymbol{\theta } '\boldsymbol{y}_{1:d}\). More broadly, consider a set Θ and an associated set of predictors \(\left \{\,f_{\boldsymbol{\theta }},\boldsymbol{\theta }\in \varTheta \right \}\). For each \(\boldsymbol{\theta }\in \varTheta\) there is a unique \(d = d\left (\boldsymbol{\theta }\right ) \in \mathbb{N}^{{\ast}}\) such that \(f_{\boldsymbol{\theta }}: \mathcal{X}^{d} \rightarrow \mathcal{X}\) is a measurable function, from which we define
$$\displaystyle{ \hat{Y }_{t}^{\boldsymbol{\theta }} = f_{\boldsymbol{\theta }}\left (\left (Y _{t-i}\right )_{i\geq 1}\right ) }$$
as a predictor of Y t given its past. We can extend all functions \(f_{\boldsymbol{\theta }}\) in a trivial way (using dummy variables) so that they are defined on \(\mathcal{X}^{\mathbb{N}^{{\ast}} }\). A natural way to evaluate the predictor associated with \(\boldsymbol{\theta }\) is to compute the risk \(R\left (\boldsymbol{\theta }\right ) = R\left (\,f_{\boldsymbol{\theta }}\right )\); we use the same letter R by an abuse of notation.
We observe X 1: T from \(X = \left (X_{t}\right )_{t\in \mathbb{Z}}\), an independent copy of Y. A crucial goal of this work is to build a predictor function \(\hat{\,f}_{T}\) for Y, inferred from the sample X 1: T and Θ such that \(R(\hat{\,f}_{T})\) is close to \(\inf _{\boldsymbol{\theta }\in \varTheta }R\left (\boldsymbol{\theta }\right )\) with π 0- probability close to 1.
The set Θ also depends on T; we write \(\varTheta \equiv \varTheta _{T}\). Let us define
$$\displaystyle{ d_{T} =\sup _{\boldsymbol{\theta }\in \varTheta _{T}}d\left (\boldsymbol{\theta }\right )\;. }$$(6)
The main assumptions on the set of predictors are the following ones.
(P-1) The set \(\left \{\,f_{\boldsymbol{\theta }},\boldsymbol{\theta }\in \varTheta _{T}\right \}\) is such that for any \(\boldsymbol{\theta }\in \varTheta _{T}\) there are \(b_{1}\left (\boldsymbol{\theta }\right ),\ldots,b_{d\left (\boldsymbol{\theta }\right )}\left (\boldsymbol{\theta }\right ) \in \mathbb{R}_{+}\) satisfying for all \(\boldsymbol{y} = \left (\,y_{i}\right )_{i\in \mathbb{N}^{{\ast}}},\boldsymbol{z} = \left (z_{i}\right )_{i\in \mathbb{N}^{{\ast}}} \in \mathcal{X}^{\mathbb{N}^{{\ast}} }\),
$$\displaystyle\begin{array}{rcl} \left \vert \left \vert \,f_{\boldsymbol{\theta }}(\boldsymbol{y}) - f_{\boldsymbol{\theta }}(\boldsymbol{z})\right \vert \right \vert &\leq &\sum \limits _{j=1}^{d\left (\boldsymbol{\theta }\right )}b_{ j}(\boldsymbol{\theta })\left \vert \left \vert y_{j} - z_{j}\right \vert \right \vert \;. {}\\ \end{array}$$
We assume moreover that \(L_{T} =\sup _{\boldsymbol{\theta }\in \varTheta _{T}}\sum _{j=1}^{d\left (\boldsymbol{\theta }\right )}b_{j}\left (\boldsymbol{\theta }\right ) < \infty \).
(P-2) The inequality L T + 1 ≤ logT holds for all T ≥ 4.
In the case where \(\mathcal{X} \subseteq \mathbb{R}\) and \(\left \{\,f_{\boldsymbol{\theta }},\boldsymbol{\theta }\in \varTheta _{T}\right \}\) is such that \(\boldsymbol{\theta }\in \mathbb{R}^{d\left (\boldsymbol{\theta }\right )}\) and \(f_{\boldsymbol{\theta }}\left (\boldsymbol{y}\right ) =\boldsymbol{\theta } '\boldsymbol{y}_{1:d(\boldsymbol{\theta })}\) for all \(\boldsymbol{y} \in \mathbb{R}^{\mathbb{N}}\), we have \(b_{j}\left (\boldsymbol{\theta }\right ) = \vert \theta _{j}\vert \) and thus \(L_{T} =\sup _{\boldsymbol{\theta }\in \varTheta _{T}}\|\boldsymbol{\theta }\|_{1}\).
The last conditions are satisfied by the linear predictors when Θ T is a subset of the ℓ 1-ball of radius logT − 1 in \(\mathbb{R}^{d_{T}}\).
3 Prediction via Aggregation
The predictor that we propose is defined as an average of predictors \(f_{\boldsymbol{\theta }}\) weighted by the empirical version of the risk,
$$\displaystyle{ r_{T}\left (\boldsymbol{\theta }\left \vert X\right.\right ) = \frac{1} {T - d_{T}}\sum _{t=d_{T}+1}^{T}\ell\big(\hat{X}_{t}^{\boldsymbol{\theta }},X_{t}\big)\;, }$$
where \(\hat{X}_{t}^{\boldsymbol{\theta }} = f_{\boldsymbol{\theta }}\left (\left (X_{t-i}\right )_{i\geq 1}\right )\). The function \(r_{T}\left (\boldsymbol{\theta }\left \vert X\right.\right )\) depends only on X 1: T and can be computed at stage T; it is in fact a statistic.
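For linear predictors this statistic is straightforward to compute; a minimal sketch (the helper name and the exact averaging range are illustrative assumptions, following the convention that only fully observed pasts are used):

```python
import numpy as np

def empirical_risk(theta, x, loss=abs):
    """Sketch of r_T(theta|X) for a linear predictor theta'y_{1:d}:
    average loss of the one-step prediction of x[t] from the d previous
    observations, over the indices whose past is fully observed."""
    d = len(theta)
    errs = [loss(x[t] - float(np.dot(theta, x[t - d:t][::-1])))
            for t in range(d, len(x))]
    return float(np.mean(errs))
```

With the absolute loss, the empirical risk is smaller near the data-generating parameter than far from it, which is what the Gibbs weighting below exploits.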
We consider a prior probability measure π T on Θ T . The prior serves to control the complexity of predictors associated with Θ T . Using π T we can construct one predictor in particular, as detailed in the following.
3.1 Gibbs Predictor
For a measure ν and a measurable function h (called the energy function) such that \(\nu \left [\exp \left (h\right )\right ] =\int \exp \left (h\right )\;\mathrm{d}\nu < \infty \;,\) we denote by \(\nu \left \{h\right \}\) the measure defined as
$$\displaystyle{ \nu \left \{h\right \}\left (\mathrm{d}\boldsymbol{\theta }\right ) = \frac{\exp \left (h\left (\boldsymbol{\theta }\right )\right )} {\nu \left [\exp \left (h\right )\right ]}\,\nu \left (\mathrm{d}\boldsymbol{\theta }\right )\;. }$$
It is known as the Gibbs measure.
Definition 2 (Gibbs predictor)
Given η > 0, called the temperature or the learning rate, we define the Gibbs predictor as the expectation of \(f_{\boldsymbol{\theta }}\), where \(\boldsymbol{\theta }\) is drawn under \(\pi _{T}\left \{-\eta r_{T}\left (\cdot \left \vert X\right.\right )\right \}\), that is
$$\displaystyle{ \hat{\,f}_{\eta,T}\left (\cdot \left \vert X\right.\right ) =\int _{\varTheta _{T}}f_{\boldsymbol{\theta }}\left (\cdot \right )\,\pi _{T}\left \{-\eta r_{T}\left (\cdot \left \vert X\right.\right )\right \}\left (\mathrm{d}\boldsymbol{\theta }\right )\;. }$$(7)
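On a finite parameter grid the integral defining the Gibbs predictor reduces to a weighted sum; here is a minimal sketch of these exponential weights (the grid of risks, the uniform prior and the temperature are illustrative assumptions; in the paper Θ T is continuous and the integral is approximated by MCMC, see Sect. 4):

```python
import numpy as np

def gibbs_weights(risks, eta):
    """Exponential weights of pi_T{-eta r_T} for a uniform prior on a
    finite grid: w_i proportional to exp(-eta * r_i)."""
    z = -eta * np.asarray(risks, dtype=float)
    w = np.exp(z - z.max())          # subtract max for numerical stability
    return w / w.sum()

risks = [0.9, 0.5, 0.7]              # illustrative empirical risks r_T(theta_i|X)
w = gibbs_weights(risks, eta=10.0)   # lowest-risk candidate gets the largest weight
```

As η grows the weights concentrate on the empirical risk minimiser; as η → 0 they revert to the prior, which is the usual temperature trade-off.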
3.2 PAC-Bayesian Inequality
At this point more care must be taken in describing Θ T . Here and in the following we suppose that
$$\displaystyle{ \varTheta _{T} \subseteq \mathbb{R}^{n_{T}}\;\mbox{ for some}\;n_{T} \in \mathbb{N}^{{\ast}}\;. }$$(8)
Suppose moreover that Θ T is equipped with the Borel σ-algebra \(\mathcal{B}(\varTheta _{T})\).
A Lipschitz type hypothesis on \(\boldsymbol{\theta }\) guarantees the robustness of the set \(\left \{\,f_{\boldsymbol{\theta }},\boldsymbol{\theta }\in \varTheta _{T}\right \}\) with respect to the risk R.
(P-3) There is \(\mathcal{D} < \infty \) such that for all \(\boldsymbol{\theta }_{1},\boldsymbol{\theta }_{2} \in \varTheta _{T}\),
$$\displaystyle\begin{array}{rcl} \pi _{0}\left [\left \vert \left \vert \,f_{\boldsymbol{\theta }_{1}}\left (\left (X_{t-i}\right )_{i\geq 1}\right ) - f_{\boldsymbol{\theta }_{2}}\left (\left (X_{t-i}\right )_{i\geq 1}\right )\right \vert \right \vert \right ]& \leq &\mathcal{D}d_{T}^{1/2}\left \vert \left \vert \boldsymbol{\theta }_{ 1} -\boldsymbol{\theta }_{2}\right \vert \right \vert \;, {}\\ \end{array}$$
where d T is defined in (6).
Linear predictors satisfy this last condition with \(\mathcal{D} =\pi _{0}\left [\left \vert X_{1}\right \vert \right ]\).
Suppose that the \(\boldsymbol{\theta }\) reaching \(\inf _{\boldsymbol{\theta }\in \varTheta _{T}}R(\boldsymbol{\theta })\) has some zero components, i.e. \(\vert \mathrm{supp}(\boldsymbol{\theta })\vert < n_{T}\). Any prior with a lower-bounded density (with respect to the Lebesgue measure) allocates zero mass to lower-dimensional subsets of Θ T . Furthermore, if the density is upper bounded we have \(\pi _{T}[B(\boldsymbol{\theta },\varDelta ) \cap \varTheta _{T}] = O(\varDelta ^{n_{T}})\) for Δ small enough. As we will see in the proof of Theorem 1, a bound like the previous one would impose a tighter constraint on n T . Instead we set the following condition.
(P-4) There is a sequence \(\left (\boldsymbol{\theta }_{T}\right )_{T\geq 4}\) and constants \(\mathcal{C}_{1} > 0\), \(\mathcal{C}_{2},\mathcal{C}_{3} \in (0,1]\) and γ ≥ 1 such that \(\boldsymbol{\theta }_{T} \in \varTheta _{T}\),
$$\displaystyle\begin{array}{rcl} R\left (\boldsymbol{\theta }_{T}\right )& \leq &\inf \limits _{\boldsymbol{\theta }\in \varTheta _{T}}R\left (\boldsymbol{\theta }\right ) + \mathcal{C}_{1} \frac{\log ^{3}\left (T\right )} {T^{1/2}}\;, {}\\ \mathrm{and}\;\;\;\pi _{T}\left [B\left (\boldsymbol{\theta }_{T},\varDelta \right ) \cap \varTheta _{T}\right ]& \geq &\mathcal{C}_{2}\varDelta ^{n_{T}^{1/\gamma } },\forall 0 \leq \varDelta \leq \varDelta _{T} = \frac{\mathcal{C}_{3}} {T} \;. {}\\ \end{array}$$
A concrete example is provided in Sect. 5.
We can now present the main result of this section, our PAC-Bayesian inequality concerning the predictor \(\hat{\,f}_{\eta _{T},T}\left (\cdot \left \vert X\right.\right )\) built following (7) with the learning rate η = η T = T 1∕2∕(4logT), provided an arbitrary probability measure π T on Θ T .
Theorem 1
Let ℓ be a loss function such that Assumption ( L ) holds. Consider a process \(X = \left (X_{t}\right )_{t\in \mathbb{Z}}\) satisfying Assumption ( M ) and let π 0 denote its probability distribution. Assume that the innovations fulfill Assumption ( I ) with ζ = A ∗ ; A ∗ is defined in ( 5 ). For each T ≥ 4 let \(\left \{\,f_{\boldsymbol{\theta }},\boldsymbol{\theta }\in \varTheta _{T}\right \}\) be a set of predictors meeting Assumptions ( P-1 ), ( P-2 ) and ( P-3 ), such that d T , defined in ( 6 ), is at most T∕2. Suppose that the set Θ T is as in (8) with n T ≤ logγ T for some γ ≥ 1, and let π T be a probability measure on it such that Assumption ( P-4 ) holds for the same γ. Then for any \(\varepsilon > 0\) , with π 0 -probability at least \(1-\varepsilon\) ,
$$\displaystyle{ R\big(\hat{\,f}_{\eta _{T},T}\left (\cdot \left \vert X\right.\right )\big) \leq \inf _{\boldsymbol{\theta }\in \varTheta _{T}}R\left (\boldsymbol{\theta }\right ) + \mathcal{E}\,\frac{\log ^{3}T} {T^{1/2}}\;, }$$
where
with \(\tilde{A}_{{\ast}}\) defined in (4) , K, ϕ and \(\mathcal{D}\) in Assumptions ( L ), ( I ) and ( P-3 ), respectively, and \(\mathcal{C}_{1}\) , \(\mathcal{C}_{2}\) and \(\mathcal{C}_{3}\) in Assumption ( P-4 ).
The proof is postponed to Sect. 7.1.
We insist, however, on the fact that this inequality applies to the exact aggregated predictor \(\hat{\,f}_{\eta _{T},T}\left (\cdot \left \vert X\right.\right )\). We need to investigate how this predictor is computed in practice and how the numerical approximation behaves compared to the properties of the exact version.
4 Stochastic Approximation
Once we have the observations X 1: T , we use the Metropolis–Hastings algorithm to compute \(\hat{\,f}_{\eta,T}\left (\cdot \left \vert X\right.\right ) =\int f_{\boldsymbol{\theta }}\left (\cdot \right )\pi _{T}\left \{-\eta r_{T}\left (\boldsymbol{\theta }\left \vert X\right.\right )\right \}\left (\mathrm{d}\boldsymbol{\theta }\right )\). The Gibbs measure \(\pi _{T}\left \{-\eta r_{T}\left (\cdot \left \vert X\right.\right )\right \}\) is a distribution on Θ T whose density \(\pi _{\eta,T}\left (\cdot \left \vert X\right.\right )\) with respect to π T is proportional to \(\exp \left (-\eta r_{T}\left (\cdot \left \vert X\right.\right )\right )\).
4.1 Metropolis–Hastings Algorithm
Given \(X \in \mathcal{X}^{\mathbb{Z}}\), the Metropolis-Hastings algorithm generates a Markov chain \(\varPhi _{\eta,T}\left (X\right ) = (\boldsymbol{\theta }_{\eta,T,n}(X))_{n\geq 0}\) with kernel P η, T (only depending on X 1: T ) having the target distribution \(\pi _{T}\left \{-\eta r_{T}\left (\cdot \left \vert X\right.\right )\right \}\) as the unique invariant measure, based on the transitions of another Markov chain which serves as a proposal (see [21]). We consider a proposal transition of the form \(Q_{\eta,T}(\boldsymbol{\theta }_{1},\mathrm{d}\boldsymbol{\theta }) = q_{\eta,T}(\boldsymbol{\theta }_{1},\boldsymbol{\theta })\pi _{T}(\mathrm{d}\boldsymbol{\theta })\) where the conditional density kernel q η, T (possibly also depending on X 1: T ) on \(\varTheta _{T} \times \varTheta _{T}\) is such that
This is the case for the independent Hastings algorithm, where the proposal is i.i.d. with density q η, T . The condition then becomes
In Sect. 5 we provide an example.
The relation (10) implies that the algorithm is uniformly ergodic, i.e. we have a control in total variation norm (\(\|\cdot \|_{TV }\)). Thus, the following condition holds (see [18]).
(A) Given η, T > 0, there is \(\beta _{\eta,T}: \mathcal{X}^{\mathbb{Z}} \rightarrow \left (0,1\right )\) such that for any \(\boldsymbol{\theta }_{0} \in \varTheta _{T}\), \(\boldsymbol{x} \in \mathcal{X}^{\mathbb{Z}}\) and \(n \in \mathbb{N}\), the chain \(\varPhi _{\eta,T}\left (\boldsymbol{x}\right )\) with transition law P η, T and invariant distribution \(\pi _{T}\left \{-\eta r_{T}\left (\cdot \left \vert \boldsymbol{x}\right.\right )\right \}\) satisfies
$$\displaystyle\begin{array}{rcl} \left \vert \left \vert P_{\eta,T}^{n}\left (\boldsymbol{\theta }_{ 0},\cdot \right ) -\pi _{T}\left \{-\eta r_{T}\left (\cdot \left \vert \boldsymbol{x}\right.\right )\right \}\right \vert \right \vert _{TV }& \leq & 2\left (1 -\beta _{\eta,T}\left (\boldsymbol{x}\right )\right )^{n}\;. {}\\ \end{array}$$
4.2 Theoretical Bounds for the Computation
In [16, Theorem 3.1] we find a bound on the mean square error made when approximating an integral by the empirical estimate obtained from successive samples of certain ergodic Markov chains, including those generated by the MCMC method that we use.
An MCMC method adds a second source of randomness to the forecasting process, and our aim is to measure it. Let \(\boldsymbol{\theta }_{0} \in \cap _{T\geq 1}\varTheta _{T}\); we set \(\boldsymbol{\theta }_{\eta,T,0}\left (\boldsymbol{x}\right ) =\boldsymbol{\theta } _{0}\) for all T, η > 0 and \(\boldsymbol{x} \in \mathcal{X}^{\mathbb{Z}}\). We denote by \(\mu _{\eta,T}\left (\cdot \left \vert X\right.\right )\) the probability distribution of the Markov chain \(\varPhi _{\eta,T}\left (X\right )\) with initial point \(\boldsymbol{\theta }_{0}\) and kernel P η, T .
Let ν η, T denote the probability distribution of \((X,\varPhi _{\eta,T}\left (X\right ))\); it is defined by setting, for all sets \(A \in (\mathcal{B}(\mathcal{X}))^{\otimes \mathbb{Z}}\) and \(B \in (\mathcal{B}(\varTheta _{T}))^{\otimes \mathbb{N}}\),
$$\displaystyle{ \nu _{\eta,T}\left (A \times B\right ) =\int _{A}\mu _{\eta,T}\left (B\left \vert \boldsymbol{x}\right.\right )\,\pi _{0}\left (\mathrm{d}\boldsymbol{x}\right )\;. }$$
Given \(\varPhi _{\eta,T} = (\boldsymbol{\theta }_{\eta,T,n})_{n\geq 0}\), we then define for \(n \in \mathbb{N}^{{\ast}}\)
$$\displaystyle{ \bar{\,f}_{\eta,T,n} = \frac{1} {n}\sum _{i=0}^{n-1}f_{\boldsymbol{\theta }_{\eta,T,i}}\;. }$$(13)
Since our chain depends on X, we make this explicit by using the notation \(\bar{\,f}_{\eta,T,n}\left (\cdot \left \vert X\right.\right )\). [16, Theorem 3.1] leads to a proposition that applies to the numerical approximation of the Gibbs predictor (the proof is in Sect. 7.2). We stress that it is independent of the model (CBS or any other), of the set of predictors and of the theoretical guarantees of Theorem 1.
Proposition 1
Let ℓ be a loss function meeting Assumption ( L ). Consider any process \(X = \left (X_{t}\right )_{t\in \mathbb{Z}}\) with an arbitrary probability distribution π 0 . Given T ≥ 2, η > 0, a set of predictors \(\left \{\,f_{\boldsymbol{\theta }},\boldsymbol{\theta }\in \varTheta _{T}\right \}\) and \(\pi _{T} \in \mathcal{M}_{+}^{1}\left (\varTheta _{T}\right )\) , let \(\hat{\,f}_{\eta,T}\left (\cdot \left \vert X\right.\right )\) be defined by (7) and let \(\bar{\,f}_{\eta,T,n}\left (\cdot \left \vert X\right.\right )\) be defined by (13) . Suppose that Φ η,T meets Assumption ( A ) for η and T with a function \(\beta _{\eta,T}: \mathcal{X}^{\mathbb{Z}} \rightarrow (0,1)\) . Let ν η,T denote the probability distribution of \((X,\varPhi _{\eta,T}\left (X\right ))\) as defined in (14) . Then, for all n ≥ 1 and D > 0, with ν η,T - probability at least max {0,1 − A η,T ∕(Dn 1∕2 )} we have \(\vert R(\bar{\,f}_{\eta,T,n}\left (\cdot \left \vert X\right.\right )) - R(\hat{\,f}_{\eta,T}\left (\cdot \left \vert X\right.\right ))\vert \leq D\) , where
We denote by \(\nu _{T} =\nu _{\eta _{T},T}\) the probability distribution of \((X,\varPhi _{\eta _{T},T}\left (X\right ))\), setting η = η T = T 1∕2∕(4logT). As Theorem 1 does not involve any simulation, it also holds in ν T - probability. From this and Proposition 1, a union bound gives us the following.
Theorem 2
Under the hypotheses of Theorem 1 , suppose moreover that Assumption ( A ) is fulfilled by \(\varPhi _{\eta _{T},T}\) for all T ≥ 4. Then, for all \(\varepsilon > 0\) and \(n \geq M\left (T,\varepsilon \right )\) , with ν T - probability at least \(1-\varepsilon\) we have
$$\displaystyle{ R\big(\bar{\,f}_{\eta _{T},T,n}\left (\cdot \left \vert X\right.\right )\big) \leq \inf _{\boldsymbol{\theta }\in \varTheta _{T}}R\left (\boldsymbol{\theta }\right ) + \left (\mathcal{E} + 1\right )\frac{\log ^{3}T} {T^{1/2}}\;, }$$
where \(\mathcal{E}\) is defined in (9) and \(M\left (T,\varepsilon \right ) = A_{\eta _{T},T}^{2}T/(\varepsilon ^{2}\log ^{6}T)\) with A η,T as in (14) .
5 Applications to the Autoregressive Process
We carefully recapitulate all the assumptions of Theorem 2 in the context of an autoregressive process. After that, we illustrate numerically the behaviour of the proposed method.
5.1 Theoretical Considerations
Consider a real-valued stable autoregressive process of finite order d, as defined by (1), with parameter \(\boldsymbol{\theta }\) lying in the interior of \(s_{d}\left (\delta \right )\) and standard normally distributed innovations (Assumptions (M) and (I) hold). With the loss function \(\ell\left (\,y,z\right ) = \left \vert y - z\right \vert \), Assumption (L) holds as well. The set we test is that of the linear predictors; they meet Assumption (P-3). Without loss of generality assume that d T = n T . In the described framework we have \(\hat{\,f}_{\eta,T}\left (\cdot \left \vert X\right.\right ) = f_{\widehat{\boldsymbol{\theta }}_{\eta,T}\left (X\right )}\), where
$$\displaystyle{ \widehat{\boldsymbol{\theta }}_{\eta,T}\left (X\right ) =\int _{\varTheta _{T}}\boldsymbol{\theta }\,\pi _{T}\left \{-\eta r_{T}\left (\cdot \left \vert X\right.\right )\right \}\left (\mathrm{d}\boldsymbol{\theta }\right )\;. }$$
This \(\hat{\boldsymbol{\theta }}_{\eta,T}\left (X\right ) \in \mathbb{R}^{d_{T}}\) is known as the Gibbs estimator.
Remark that, by (2) and the normality of the innovations, the risk of any \(\hat{\boldsymbol{\theta }}\in \mathbb{R}^{d_{T}}\) equals the absolute moment of a centered Gaussian, namely
$$\displaystyle{ R\big(\hat{\boldsymbol{\theta }}\big) = \left (\frac{2} {\pi }\right )^{1/2}\left (\sigma ^{2} + \big(\hat{\boldsymbol{\theta }} -\boldsymbol{\theta }\big)'\varGamma _{T}\big(\hat{\boldsymbol{\theta }} -\boldsymbol{\theta }\big)\right )^{1/2}\;, }$$(15)
where \(\varGamma _{T} = (\gamma _{i,j})_{0\leq i,j\leq d_{T}-1}\) is the covariance matrix of the process. In (15) the vector \(\boldsymbol{\theta }\) originally in \(\mathbb{R}^{d}\) is completed by d T − d zeros.
In this context the infimum \(\inf _{\boldsymbol{\theta }\in \mathbb{R}^{\mathbb{N}^{{\ast}}}}R\left (\boldsymbol{\theta }\right )\) is attained at the true parameter \(\boldsymbol{\theta }\in s_{d}(1)\) generating the process. Let us verify Assumption (P-4) by choosing Θ T and π T conveniently. Let Δ d∗ > 0 be such that \(B\left (\boldsymbol{\theta },\varDelta _{d{\ast}}\right ) \subseteq s_{d}(1)\).
We express \(\varTheta _{T} =\bigcup _{ k=1}^{d_{T}}\varTheta _{k,T}\) where \(\boldsymbol{\theta }\in \varTheta _{k,T}\) if and only if \(d\left (\boldsymbol{\theta }\right ) = k\). It is interesting to set Θ k, T as the part of the stability domain of an AR(k) process satisfying Assumptions (P-3) and (P-4). Consider \(\varTheta _{1,T} = s_{1}(1) \times \{ 0\}^{d_{T}-1} \cap B_{1}\left (\boldsymbol{0},\log T - 1\right )\) and \(\varTheta _{k,T} = s_{k}(1) \times \{ 0\}^{d_{T}-k} \cap B_{1}\left (\boldsymbol{0},\log T - 1\right )\setminus \varTheta _{k-1,T}\) for k ≥ 2. Assume moreover that \(d_{T} = \lfloor \log ^{\gamma }T\rfloor \).
We write \(\pi _{T} =\sum _{ k=1}^{d_{T}}c_{k,T}\pi _{k,T}\) where for all k, c k, T π k, T is the restriction of π T to Θ k, T with c k, T a real non negative number and π k, T a probability measure on Θ k, T . In this setup \(c_{k,T} =\pi _{T}\left [\varTheta _{k,T}\right ]\) and \(\pi _{k,T}\left [A \cap \varTheta _{k,T}\right ] =\pi _{T}\left [A \cap \varTheta _{k,T}\right ]/c_{k,T}\) if c k, T > 0 and \(\pi _{k,T}\left [A \cap \varTheta _{k,T}\right ] = 0\) otherwise. The vector \(\left [\begin{array}{*{10}c} c_{1,T}&\ldots &c_{d_{T},T} \end{array} \right ]\) could be interpreted as a prior on the model order. Set \(c_{k,T} = c_{k}/(\sum _{i=1}^{d_{T}}c_{i})\) where c k > 0 is the k-th term of a convergent series (\(\sum _{k=1}^{\infty }c_{k} = c^{{\ast}} < \infty \)).
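For instance, with the convergent series c k = k −2 used in Sect. 5.2, the normalised weights c k, T can be computed as follows (the truncation level d T is an illustrative value):

```python
import numpy as np

d_T = 12                                   # illustrative truncation level
c = np.array([k ** -2.0 for k in range(1, d_T + 1)])
c_T = c / c.sum()                          # c_{k,T} = c_k / sum_i c_i
# the weights sum to one and decrease with the order k, favouring sparsity
```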
The distribution π k, T is inferred from some transformations explained below. Observe first that if a ≤ b we have \(s_{k}(a) \subseteq s_{k}(b)\). If \(\boldsymbol{\theta }\in s_{k}(1)\) then \(\left [\begin{array}{*{10}c} \lambda \theta _{1} & \ldots & \lambda ^{k}\theta _{k} \end{array} \right ]^{{\prime}} \in s_{k}(1)\) for any λ ∈ (−1, 1). Let us set
We define \(F_{k,T}(\boldsymbol{\theta }) = \left [\begin{array}{*{10}c} \lambda _{T}(\boldsymbol{\theta })\theta _{1} & \ldots & \lambda _{T}^{k}(\boldsymbol{\theta })\theta _{k}&0&\ldots &0 \end{array} \right ]^{{\prime}}\in \mathbb{R}^{d_{T}}\). Remark that for any \(\boldsymbol{\theta }\in s_{k}(1)\), \(\|F_{k,T}(\boldsymbol{\theta })\|_{1} \leq \lambda _{T}(\boldsymbol{\theta })\|\boldsymbol{\theta }\|_{1} \leq \log T - 1\). This suggests a way to generate vectors in Θ k, T . Our distribution π k, T is deduced from:
Algorithm 1 π k, T generation
The distribution π k, T is lower bounded by the uniform distribution on s k (1).
Given any γ ≥ 1, let \(T^{{\ast}} =\min \{ T:\ d_{T} \geq d^{\gamma },\ \log T \geq d^{1/2}2^{d}\}\). Since \(s_{k}(1) \subseteq B(\boldsymbol{0},2^{k} - 1)\) (see [19, Lemma 1]) and \(k^{1/2}\|\boldsymbol{\theta }\| \geq \|\boldsymbol{\theta }\|_{1}\) for any \(\boldsymbol{\theta }\in \mathbb{R}^{k}\), the constraint \(\|\boldsymbol{\theta }\|_{1} \leq \log T - 1\) becomes redundant in Θ k, T for 1 ≤ k ≤ d and T ≥ T ∗, i.e. \(\varTheta _{1,T} = s_{1}(1) \times \{ 0\}^{d_{T}-1}\) and \(\varTheta _{k,T} = s_{k}(1) \times \{ 0\}^{d_{T}-k}\setminus \varTheta _{k-1,T}\) for 2 ≤ k ≤ d. We define the sequence of Assumption (P-4) as \(\boldsymbol{\theta }_{T} =\boldsymbol{ 0}\) for T < T ∗ and \(\boldsymbol{\theta }_{T} =\arg \inf _{\boldsymbol{\theta }\in \varTheta _{T}}R(\boldsymbol{\theta })\) for T ≥ T ∗. Remark that the first d components of \(\boldsymbol{\theta }_{T}\) are constant for T ≥ T ∗ (they correspond to the \(\boldsymbol{\theta }\in \mathbb{R}^{d}\) generating the AR(d) process), and the last d T − d are zero. Let \(\varDelta _{1{\ast}} = 2\log 2 - 1\). Then, for T < T ∗ and all Δ ∈ [0, Δ 1∗] we have
Furthermore, for T ≥ T ∗ and Δ ∈ [0, Δ d∗]
Assumption (P-4) is then fulfilled for any γ ≥ 1 with
Let q η, T be the constant function 1; this means that the proposal has the same distribution π T . Let us bound the ratio (11).
Now note that
Plugging the bound (17) into (16) with η = η T ,
we deduce that
Taking (18) into account, setting γ = 1 (thus \(d_{T} = \lfloor \log T\rfloor \)), using Assumption (P-3) with K = 1 and applying the Cauchy-Schwarz inequality, we get
As X 1 is centered and normally distributed of variance γ 0, \(\pi _{0}\left [\left \vert X_{1}\right \vert \right ] = \left (2\gamma _{0}/\pi \right )^{1/2}\) and \(\pi _{0}[\exp (T^{1/2}\left \vert X_{1}\right \vert /4)] =\gamma _{0}T^{1/2}\exp (\gamma _{0}T/32)/4\).
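The first identity is easy to check numerically; here is a quick Monte Carlo sanity check of \(\pi _{0}\left [\left \vert X_{1}\right \vert \right ] = \left (2\gamma _{0}/\pi \right )^{1/2}\) (the value of γ 0 and the sample size are illustrative):

```python
import numpy as np

gamma0 = 2.5                                # illustrative variance of X_1
rng = np.random.default_rng(0)
sample = rng.normal(0.0, np.sqrt(gamma0), size=1_000_000)
mc = np.abs(sample).mean()                  # Monte Carlo estimate of pi_0[|X_1|]
exact = np.sqrt(2.0 * gamma0 / np.pi)       # closed form for a centered Gaussian
assert abs(mc - exact) < 0.01
```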
For \(n \geq M^{{\ast}}\left (T,\varepsilon \right ) = 9\gamma _{0}^{3}T^{2}\exp \left (\gamma _{0}T/16\right )/(2\pi \varepsilon ^{2}\log ^{3}T)\) the result of Theorem 2 is attained. This bound on \(M\left (T,\varepsilon \right )\) is prohibitive from a computational viewpoint; that is why we limit the number of iterations to a fixed n ∗.
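To see how prohibitive the bound is, one can evaluate it directly; the sketch below fixes \(\gamma_0 = 1\) for illustration.

```python
import math

def M_star(T, eps, gamma0=1.0):
    # M*(T, eps) = 9 gamma0^3 T^2 exp(gamma0 T / 16) / (2 pi eps^2 log^3 T)
    return (9.0 * gamma0**3 * T**2 * math.exp(gamma0 * T / 16.0)
            / (2.0 * math.pi * eps**2 * math.log(T) ** 3))
```

With ε = 0.1, M*(64, 0.1) is already of order 10^5 while M*(256, 0.1) exceeds 10^11: the bound grows exponentially in T, hence the fixed cap n ∗.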
What we obtain from MCMC is \(\bar{\,f}_{\eta _{T},T,n}\left (\,\boldsymbol{y}\left \vert X\right.\right ) =\bar{\boldsymbol{\theta }} '_{\eta _{T},T,n}\left (X\right )\boldsymbol{y}_{1:d_{T}}\) with \(\bar{\boldsymbol{\theta }}_{\eta _{T},T,n}\left (X\right ) =\sum _{ i=0}^{n-1}\boldsymbol{\theta }_{\eta _{T},T,i}\left (X\right )/n\). Remark that \(\bar{\,f}_{\eta _{T},T,n}\left (\cdot \left \vert X\right.\right ) = f_{\bar{\boldsymbol{\theta }}_{\eta _{ T},T,n}\left (X\right )}\). The risk is expressed as
5.2 Numerical Work
Consider 100 realisations of an autoregressive process X, all simulated with the same \(\boldsymbol{\theta }\in s_{d}\left (\delta \right )\) for d = 8 and δ = 3∕4, and with σ = 1. Let \(\boldsymbol{c}^{(i)}\), i = 1, 2, be the sequences defining two different priors on the model order:
1. \(c_{k}^{(1)} = k^{-2}\), sparsity is favoured,
2. \(c_{k}^{(2)} = e^{-k}\), sparsity is strongly favoured.
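The realisations can be simulated along the following lines (a sketch; `simulate_ar` is a hypothetical helper and the burn-in length is an arbitrary choice):

```python
import numpy as np

def simulate_ar(theta, T, sigma=1.0, burn=500, rng=None):
    """Simulate T observations of X_t = sum_i theta_i X_{t-i} + sigma xi_t
    with i.i.d. standard Gaussian innovations, discarding a burn-in."""
    rng = rng if rng is not None else np.random.default_rng()
    d = len(theta)
    x = np.zeros(T + burn + d)
    for t in range(d, len(x)):
        # most recent past values first, matching the order of theta
        x[t] = np.dot(theta, x[t - d:t][::-1]) + sigma * rng.standard_normal()
    return x[-T:]
```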
For each sequence c and for each value of T ∈ { 2j, j = 6, …, 12} we compute \(\bar{\boldsymbol{\theta }}_{\eta _{T},T,n^{{\ast}}}\), the MCMC approximation of the Gibbs estimator using Algorithm 2 with η = η T .
Algorithm 2 Independent Hastings Sampler
The acceptance rate is computed as \(\alpha _{\eta,T,X}(\boldsymbol{\theta }_{1},\boldsymbol{\theta }_{2}) =\exp (\eta r_{T}(\boldsymbol{\theta }_{1}\left \vert X\right.) -\eta r_{T}\left (\boldsymbol{\theta }_{2}\left \vert X\right.\right ))\).
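A minimal sketch of such an independence sampler follows; since the proposal is drawn from the prior itself, the Hastings ratio reduces to the expression above. The quadratic toy risk and uniform prior in the usage example are illustrative assumptions, not the paper's model.

```python
import numpy as np

def independent_hastings(r_emp, sample_prior, eta, n, rng):
    # Independence Hastings sampler targeting the Gibbs measure
    # pi_T{-eta r_T}: proposals come from the prior pi_T, so the
    # acceptance ratio is exp(eta * (r(current) - r(proposal))).
    theta = sample_prior(rng)
    chain = [theta]
    for _ in range(n - 1):
        proposal = sample_prior(rng)
        log_alpha = eta * (r_emp(theta) - r_emp(proposal))
        if np.log(rng.uniform()) < log_alpha:
            theta = proposal
        chain.append(theta)
    # average along the chain, as for the MCMC approximation of Sect. 5
    return np.mean(chain, axis=0)
```

For instance, with the toy risk r(θ) = (θ − 1)² and a uniform prior on [−2, 2], the averaged parameter concentrates near 1 for large η.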
Algorithm 1, used for the distributions π k, T , generates uniform random vectors on \(s_{k}\left (1\right )\) by the method described in [6], which relies on the Levinson-Durbin recursion. We also implemented the numerical improvements of [3].
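The core of that method is the Levinson-Durbin step-up recursion, which maps reflection (partial autocorrelation) coefficients in (−1, 1) to a stable AR coefficient vector. A sketch is given below; the specific distributions to place on the reflection coefficients so that the resulting vector is uniform on \(s_{k}(1)\) are those of [6] and are not reproduced here.

```python
import numpy as np

def stepup(kappa):
    """Levinson-Durbin step-up: map reflection coefficients kappa_m in (-1,1)
    to AR coefficients phi of X_t = sum_i phi_i X_{t-i} + xi_t.
    |kappa_m| < 1 for every m guarantees a causal (stable) model."""
    phi = []
    for m, k in enumerate(kappa, start=1):
        phi = [phi[i] - k * phi[m - 2 - i] for i in range(m - 1)] + [k]
    return np.array(phi)
```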
Set \(\varepsilon = 0.1\). Figure 1 displays the empirical \((1-\varepsilon )\)-quantiles of \(R(\bar{\boldsymbol{\theta }}_{\eta _{T},T,n^{{\ast}}}\left (X\right )) - (2/\pi )^{1/2}\sigma ^{2}\) over the 100 realisations, for \(\boldsymbol{c}^{(1)}\) and \(\boldsymbol{c}^{(2)}\) and different values of n ∗.
Note that, for the proposed algorithm, the prediction risk decreases very slowly when the number T of observations grows while the number of MCMC iterations remains constant. For n ∗ = 1,000 the decay is faster than for n ∗ = 100 at smaller values of T. For T ≥ 2,000 both rates are roughly the same on the logarithmic scale. This behaviour is similar in both cases presented in Fig. 1. As expected, the risk of the approximated predictor does not converge as log3 T∕T 1∕2.
6 Discussion
There are two sources of error in our method: the prediction error (of the exact Gibbs predictor) and the approximation error (from the MCMC). The first decays as T grows, while the guarantees we obtained for the second explode. We found a possibly pessimistic upper bound for M(T, ε); the exponential growth of this bound is the main weakness of our procedure. The use of a better adapted proposal in the MCMC algorithm needs to be investigated, and the Metropolis adjusted Langevin algorithm (see [4]) gives an insight in this direction. However, it is encouraging to see that, in the practical case analysed, the risk of \(\bar{\,f}_{\eta _{T},T,n^{{\ast}}}\left (\cdot \left \vert X\right.\right )\) does not increase with T.
7 Technical Proofs
7.1 Proof of Theorem 1
The proof of Theorem 1 is based on the same tools used by [2] up to Lemma 3. For the sake of completeness we quote the essential ones.
We denote by \(\mathcal{M}_{+}^{1}\left (F\right )\) the set of probability measures on the measurable space \((F,\mathcal{F})\). For \(\rho,\nu \in \mathcal{M}_{+}^{1}\left (F\right )\), \(\mathcal{K}\left (\rho,\nu \right )\) stands for the Kullback-Leibler divergence of ρ with respect to ν.
The first lemma can be found in [8, Equation 5.2.1].
Lemma 1 (Legendre transform of the Kullback divergence function)
Let \((F,\mathcal{F})\) be any measurable space. For any \(\nu \in \mathcal{M}_{+}^{1}\left (F\right )\) and any measurable function \(h: F \rightarrow \mathbb{R}\) such that \(\nu \left [\exp \left (h\right )\right ] < \infty \) we have,
with the convention ∞−∞ = −∞. Moreover, as soon as h is upper-bounded on the support of ν, the supremum with respect to ρ in the right-hand side is reached by the Gibbs measure \(\nu \left \{h\right \}\) .
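For reference, this duality is the Donsker-Varadhan variational formula, which in the notation above reads:

```latex
\log \nu\left[\exp(h)\right]
  \;=\; \sup_{\rho \in \mathcal{M}_{+}^{1}(F)}
        \Big( \rho[h] \;-\; \mathcal{K}(\rho,\nu) \Big),
\qquad \text{with } \nu\{h\} \text{ defined by }
\frac{\mathrm{d}\nu\{h\}}{\mathrm{d}\nu} \;=\; \frac{\exp(h)}{\nu[\exp(h)]}.
```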
For a fixed C > 0, let \(\tilde{\xi }_{t}^{\left (C\right )} =\max \left \{\min \left \{\xi _{t},C\right \},-C\right \}\). Consider \(\tilde{X}_{t} = H(\tilde{\xi }_{t}^{\left (C\right )},\tilde{\xi }_{t-1}^{\left (C\right )},\ldots )\).
Denote \(\tilde{X} = (\tilde{X}_{t})_{t\in \mathbb{Z}}\) and by \(\tilde{R}\left (\boldsymbol{\theta }\right )\) and \(\tilde{r}_{T}\left (\boldsymbol{\theta }\left \vert \tilde{X}\right.\right )\) the respective exact and empirical risks associated with \(\tilde{X}\) in \(\boldsymbol{\theta }\).
where \(\widehat{\tilde{X}}_{t}^{\boldsymbol{\theta }} = f_{\boldsymbol{\theta }}((\tilde{X}_{t-i})_{i\geq 1})\).
This thresholding is interesting because truncated CBS are weakly dependent processes (see [2, Section 4.2]).
A Hoeffding type inequality introduced in [20, Theorem 1] provides useful controls on the difference between empirical and exact risks of a truncated process.
Lemma 2 (Laplace transform of the risk)
Let ℓ be a loss function meeting Assumption ( L ) and \(X = \left (X_{t}\right )_{t\in \mathbb{Z}}\) a process satisfying Assumption ( M ). For all T ≥ 2, any \(\left \{\,f_{\boldsymbol{\theta }},\boldsymbol{\theta }\in \varTheta _{T}\right \}\) satisfying Assumption ( P-1 ), Θ T such that d T , defined in (6), is at most T∕2, any truncation level C > 0, η ≥ 0 and \(\boldsymbol{\theta }\in \varTheta _{T}\) we have,
and
where \(k(T,C) = 2^{1/2}CK(1 + L_{T})\left (A_{{\ast}} +\tilde{ A}_{{\ast}}\right )\) . The constants \(\tilde{A}_{{\ast}}\) and A ∗ are defined in (4) and (5) respectively, K and L T in Assumptions ( L ) and ( P-1 ) respectively.
The following lemma is a slight modification of [2, Lemma 6.5]. It links the two versions of the empirical risk: original and truncated.
Lemma 3
Suppose that Assumption ( L ) holds for the loss function ℓ, Assumption ( P-1 ) holds for \(X = \left (X_{t}\right )_{t\in \mathbb{Z}}\) and Assumption ( I ) holds for the innovations with ζ = A ∗ ; A ∗ is defined in (5) . For all T ≥ 2, any \(\left \{\,f_{\boldsymbol{\theta }},\boldsymbol{\theta }\in \varTheta _{T}\right \}\) meeting Assumption ( P-1 ) with Θ T such that d T , defined in (6) , is at most T∕2, any truncation level C > 0 and any \(0 \leq \eta \leq T/4\left (1 + L_{T}\right )\) we have,
where
with K and L T defined in Assumptions ( L ) and ( P-1 ) respectively.
Finally we present a result on the aggregated predictor defined in (7). The proof is partially inspired by that of [2, Theorem 3.2].
Lemma 4
Let ℓ be a loss function such that Assumption ( L ) holds and let \(X\ =\ \left (X_{t}\right )_{t\in \mathbb{Z}}\) be a process satisfying Assumption ( M ) with probability distribution π 0 . For each T ≥ 2 let \(\left \{\,f_{\boldsymbol{\theta }},\boldsymbol{\theta }\in \varTheta _{T}\right \}\) be a set of predictors and \(\pi _{T} \in \mathcal{M}_{+}^{1}\left (\varTheta _{T}\right )\) any prior probability distribution on Θ T . We build the predictor \(\hat{\,f}_{\eta,T}\left (\cdot \left \vert X\right.\right )\) following (7) with any η > 0. For any \(\varepsilon > 0\) and any truncation level C > 0, with π 0 -probability at least \(1-\varepsilon\) we have,
Proof
We use Tonelli’s theorem and Jensen’s inequality with the convex function g to obtain an upper bound for \(R\left (\hat{\,f}_{\eta,T}\left (\cdot \left \vert X\right.\right )\right )\)
In the remainder of this proof we seek an upper bound for \(\pi _{T}\left \{-\eta r_{T}\left (\cdot \left \vert X\right.\right )\right \}\left [R\right ]\).
First, we use the relationship:
For simplicity, and since it does not impair clarity, we lighten the notation of r T and \(\tilde{r}_{T}\). We now suppose that, in place of \(\boldsymbol{\theta }\), we have a random variable distributed as \(\pi _{T} \in \mathcal{M}_{+}^{1}\left (\varTheta _{T}\right )\); this is taken into account in the following expectations. The identity (21) and the Cauchy-Schwarz inequality lead to
Observe now that \(R\left (\boldsymbol{\theta }\right ) = \mathbb{E}\left [r_{T}\left (\boldsymbol{\theta }\left \vert X\right.\right )\right ]\) and \(\tilde{R}\left (\boldsymbol{\theta }\right ) = \mathbb{E}[\tilde{r}_{T}(\boldsymbol{\theta }\vert \tilde{X})]\). Jensen’s inequality for the exponential function gives that
From (23) we see that
Combining (22) and (24) we obtain
Let \(L_{\eta,T,C} =\log ((\mathbb{E}[\exp (\eta (\tilde{R} -\tilde{ r}_{T}))])^{1/2}\mathbb{E}[\exp (\eta \sup _{\boldsymbol{\theta }\in \varTheta _{T}}\vert r_{T}(\boldsymbol{\theta }\vert X) -\tilde{ r}_{T}(\boldsymbol{\theta }\vert \tilde{X})\vert )])\). Remark that the left-hand side of (25) is equal to the integral of the expression enclosed in brackets with respect to the measure π 0 ×π T . Replacing η by 2η and applying Lemma 1 we get
Markov’s inequality implies that for all \(\varepsilon > 0\), with π 0- probability at least \(1-\varepsilon\)
Hence, for any \(\pi _{T} \in \mathcal{M}_{+}^{1}\left (\varTheta _{T}\right )\) and η > 0, with π 0- probability at least \(1-\varepsilon\), for all \(\rho \in \mathcal{M}_{+}^{1}\left (\varTheta _{T}\right )\)
By setting \(\rho =\pi _{T}\{ -\eta r_{T}\left (\cdot \left \vert X\right.\right )\}\) and relying on Lemma 1, we have
Using (26) with \(\rho =\pi _{T}\{ -\eta r_{T}\left (\cdot \left \vert X\right.\right )\}\) it follows that, with π 0- probability at least \(1-\varepsilon\),
To upper bound ρ[r T (⋅ | X)] we use an upper bound on \(\rho \left [r_{T}(\cdot \vert X) - R\right ]\). We obtain an inequality similar to (26) with \(\rho \left [R - r_{T}(\cdot \vert X)\right ]\) replaced by \(\rho \left [r_{T}(\cdot \vert X) - R\right ]\) and L η, T, C replaced by \(L'_{\eta,T,C} =\log ((\mathbb{E}[\exp (\eta (\tilde{r}_{T} -\tilde{ R}))])^{1/2}\mathbb{E}[\exp (\eta \sup _{\boldsymbol{\theta }\in \varTheta _{T}}\vert r_{T}(\boldsymbol{\theta }\vert X) -\tilde{ r}_{T}(\boldsymbol{\theta }\vert \tilde{X})\vert )])\). This provides us with another inequality satisfied with π 0- probability at least \(1-\varepsilon\). To obtain a π 0- probability of the intersection larger than \(1-\varepsilon\) we apply the previous computations with \(\varepsilon /2\) instead of \(\varepsilon\) and hence,
We can now prove Theorem 1.
Proof
Let π 0, C denote the distribution on \(\mathcal{X}^{\mathbb{Z}} \times \mathcal{X}^{\mathbb{Z}}\) of the couple \((X,\tilde{X})\). Fubini’s theorem and (19) of Lemma 2 imply that
Using (20), we analogously get
Consider the set of probability measures \(\left \{\rho _{\boldsymbol{\theta }_{ T},\varDelta },T \geq 2,0 \leq \varDelta \leq \varDelta _{T}\right \} \subset \mathcal{M}_{+}^{1}\left (\varTheta _{ T}\right )\), where \(\boldsymbol{\theta }_{T}\) is the parameter defined by Assumption (P-4) and \(\rho _{\boldsymbol{\theta }_{ T},\varDelta }\left (\boldsymbol{\theta }\right ) \propto \pi _{T}\left (\boldsymbol{\theta }\right ) \mathbb{1}_{B\left (\boldsymbol{\theta }_{T},\varDelta \right )\cap \varTheta _{T}}\left (\boldsymbol{\theta }\right )\). Lemma 4, together with Lemma 3, (27) and (28) guarantee that for all \(0 <\eta \leq T/8\left (1 + L_{T}\right )\)
Thanks to Assumptions (L) and (P-3), for any T ≥ 2 and \(\boldsymbol{\theta }\in B\left (\boldsymbol{\theta }_{T},\varDelta \right )\)
For T ≥ 4 Assumption (P-4) gives
Plugging (30) and (31) into (29) and using again Assumption (P-4)
where \(\mathcal{E}_{1} = K\mathcal{D}\), \(\mathcal{E}_{2} = 32K^{2}\left (A_{{\ast}} +\tilde{ A}_{{\ast}}\right )^{2}\), \(\mathcal{E}_{3} = 8K\phi (A_{{\ast}})A_{{\ast}}\) and \(\mathcal{E}_{4} = 32K^{2}\phi (A_{{\ast}})\).
We upper bound d T by T∕2 and n T by \(\log ^{\gamma }T\), and substitute \(\varDelta _{T} = \mathcal{C}_{3}/T\). Since it is difficult to minimise the right-hand side of (32) with respect to η and C simultaneously, we evaluate them at particular values to obtain a convenient upper bound.
At a fixed \(\varepsilon\), the convergence rate of \(\left [2\log \left (2/\varepsilon \right ) - 2\log \left (\mathcal{C}_{2}\right )\right ]/\eta + \mathcal{E}_{4}\left (1 + L_{T}\right )^{2}\eta /T\) is at best logT∕T 1∕2, attained by taking \(\eta \propto T^{1/2}/\log T\). Since η ≤ T∕8(1 + L T ) is required, we set η = η T = T 1∕2∕(4logT).
The order of the terms already chosen is log3 T∕T 1∕2; taking C = logT∕A ∗ preserves it. Taking into account that \(R\left (\boldsymbol{\theta }_{T}\right ) \leq \inf _{\boldsymbol{\theta }\in \varTheta _{T}}R\left (\boldsymbol{\theta }\right ) + \mathcal{C}_{1}\log ^{3}T/T^{1/2}\), the result follows.
7.2 Proof of Proposition 1
Considering that Assumption (L) holds we get
Observe that the last expression depends on X 1: T and \(\varPhi _{\eta,T}\left (X\right )\). We bound the expectation to infer a bound in probability.
Tonelli’s theorem and Jensen’s inequality lead to
We are then interested in upper bounding the expression under the square root. To that end, we use [16, Theorem 3.1] which implies that for any \(\boldsymbol{x}\)
Plugging this into (33), using that n ≥ 1 and that
we obtain the following
The result follows from Markov’s inequality.
References
Alquier, P., & Li, X. (2012). Prediction of quantiles by statistical learning and application to GDP forecasting. In J.-G. Ganascia, P. Lenca, & J.-M. Petit (Eds.), Discovery science (Volume 7569 of Lecture notes in computer science, pp. 22–36). Berlin/Heidelberg: Springer.
Alquier, P., & Wintenberger, O. (2012). Model selection for weakly dependent time series forecasting. Bernoulli, 18(3), 883–913.
Andrieu, C., & Doucet, A. (1999). An improved method for uniform simulation of stable minimum phase real ARMA (p,q) processes. IEEE Signal Processing Letters, 6(6), 142–144.
Atchadé, Y. F. (2006). An adaptive version for the Metropolis adjusted Langevin algorithm with a truncated drift. Methodology and Computing in Applied Probability, 8(2), 235–254.
Audibert, J.-Y. (2004). PAC-bayesian statistical learning theory. PhD thesis, Université Pierre et Marie Curie-Paris VI.
Beadle, E. R., & Djurić, P. M. (1997). Uniform random parameter generation of stable minimum-phase real ARMA (p,q) processes. IEEE Signal Processing Letters, 4(9), 259–261.
Brockwell, P. J., & Davis, R. A. (2006). Time series: Theory and methods (Springer series in statistics). New York: Springer. Reprint of the second (1991) edition.
Catoni, O. (2004). Statistical learning theory and stochastic optimization (Volume 1851 of Lecture notes in mathematics). Berlin: Springer. Lecture notes from the 31st Summer School on Probability Theory held in Saint-Flour, 8–25 July 2001.
Cesa-Bianchi, N., & Lugosi, G. (2006). Prediction, learning, and games. Cambridge: Cambridge University Press.
Coulon-Prieur, C., & Doukhan, P. (2000). A triangular central limit theorem under a new weak dependence condition. Statistics and Probability Letters, 47(1), 61–68.
Dalalyan, A. S., & Tsybakov, A. B. (2008). Aggregation by exponential weighting, sharp PAC-bayesian bounds and sparsity. Machine Learning, 72(1–2), 39–61.
Dedecker, J., Doukhan, P., Lang, G., León R, J. R., Louhichi, S., & Prieur, C. (2007). Weak dependence: With examples and applications (Volume 190 of Lecture notes in statistics). New York: Springer.
Dedecker, J., & Prieur, C. (2005). New dependence coefficients. Examples and applications to statistics. Probability Theory and Related Fields, 132(2), 203–236.
Künsch, H. R. (1995). A note on causal solutions for locally stationary AR-processes. Note from ETH Zürich, available on line here: ftp://ftp.stat.math.ethz.ch/U/hkuensch/localstat-ar.pdf.
Łatuszyński, K., Miasojedow, B., & Niemiro, W. (2013). Nonasymptotic bounds on the estimation error of MCMC algorithms. Bernoulli, 19, 2033–2066.
Łatuszyński, K., & Niemiro, W. (2011). Rigorous confidence bounds for MCMC under a geometric drift condition. Journal of Complexity, 27(1), 23–38.
Leung, G., & Barron, A. R. (2006). Information theory and mixing least-squares regressions. IEEE Transactions on Information Theory, 52(8), 3396–3410.
Mengersen, K. L., & Tweedie, R. L. (1996). Rates of convergence of the Hastings and Metropolis algorithms. The Annals of Statistics, 24(1), 101–121.
Moulines, E., Priouret, P., & Roueff, F. (2005). On recursive estimation for time varying autoregressive processes. The Annals of Statistics, 33(6), 2610–2654.
Rio, E. (2000). Inégalités de Hoeffding pour les fonctions lipschitziennes de suites dépendantes. Comptes Rendus de l’Academie des Sciences Paris Series I Mathematics, 330(10), 905–908.
Roberts, G. O., & Rosenthal, J. S. (2004). General state space Markov chains and MCMC algorithms. Probability Surveys, 1, 20–71.
Acknowledgements
The author is especially thankful to François Roueff, Christophe Giraud, Peter Weyer-Brown and the two referees for their extremely careful readings and highly pertinent remarks which substantially improved the paper. This work has been partially supported by the Conseil régional d’Île-de-France under a doctoral allowance of its program Réseau de Recherche Doctoral en Mathématiques de l’Île de France (RDM-IdF) for the period 2012–2015 and by the Labex LMH (ANR-11-IDEX-003-02).
© 2015 Springer International Publishing Switzerland
Sanchez-Perez, A. (2015). Time Series Prediction via Aggregation: An Oracle Bound Including Numerical Cost. In: Antoniadis, A., Poggi, JM., Brossat, X. (eds) Modeling and Stochastic Learning for Forecasting in High Dimensions. Lecture Notes in Statistics(), vol 217. Springer, Cham. https://doi.org/10.1007/978-3-319-18732-7_13