1 Introduction

The objective of our work is to forecast a stationary time series \(Y = \left (Y _{t}\right )_{t\in \mathbb{Z}}\) taking values in \(\mathcal{X} \subseteq \mathbb{R}^{r}\) with r ≥ 1. For this purpose we propose and study an aggregation scheme using exponential weights.

Consider a set of individual predictors giving their predictions at each moment t. An aggregation method consists of building from this set a new prediction which is nearly as good as the best among the individual ones, with respect to a given risk criterion (see [17]). This kind of result is established by oracle inequalities. The power and the beauty of the technique lie in its simplicity and versatility. The most basic and general context of application is that of individual sequences, where no assumption is made on the observations (see [9] for a comprehensive overview). Nevertheless, results need to be adapted if we impose a stochastic model on the observations.

The use of exponential weighting in aggregation and its links with the PAC-Bayesian approach have been investigated for example in [5, 8] and [11]. Dependent processes have not received much attention from this viewpoint, except in [1] and [2]. In the present paper we study the properties of the Gibbs predictor applied to Causal Bernoulli Shifts (CBS). CBS are an example of dependent processes (see [12] and [13]).

Our predictor is expressed as an integral, since the set over which we aggregate is in general not finite. High-dimensional settings are increasingly common, and the computation of this integral is a major issue. We use classical Markov chain Monte Carlo (MCMC) methods to approximate it. Results from Łatuszyński [15, 16] control the number of MCMC iterations needed to obtain precise bounds for the approximation of the integral. These bounds hold in expectation and in probability with respect to the distribution of the underlying Markov chain.

In this contribution we first slightly revisit certain lemmas presented in [2, 8] and [20] to derive an oracle bound for the prediction risk of the Gibbs predictor. We stress that the inequality controls the convergence rate of the exact predictor. Our second goal is to investigate the impact of the approximation of the predictor on the convergence guarantees described for its exact version. Combining the PAC-Bayesian bounds with the MCMC control, we then provide an oracle inequality that applies to the MCMC approximation of the predictor, which is actually used in practice.

The paper is organised as follows: we introduce a motivating example and several definitions and assumptions in Sect. 2. In Sect. 3 we describe the methodology of aggregation and provide the oracle inequality for the exact Gibbs predictor. The stochastic approximation is studied in Sect. 4. We state a general proposition independent of the model for the Gibbs predictor. Next, we apply it to the more particular framework delineated in our paper. A concrete case study is analysed in Sect. 5, including some numerical work. A brief discussion follows in Sect. 6. The proofs of most of the results are deferred to Sect. 7.

Throughout the paper, for \(\boldsymbol{a} \in \mathbb{R}^{q}\) with \(q \in \mathbb{N}^{{\ast}}\), \(\|\boldsymbol{a}\|\) denotes its Euclidean norm, \(\|\boldsymbol{a}\| = (\sum _{i=1}^{q}a_{i}^{2})^{1/2}\) and \(\|\boldsymbol{a}\|_{1}\) its 1-norm \(\|\boldsymbol{a}\|_{1} =\sum _{ i=1}^{q}\vert a_{i}\vert \). We denote, for \(\boldsymbol{a} \in \mathbb{R}^{q}\) and Δ > 0, \(B\left (\boldsymbol{a},\varDelta \right ) =\{\boldsymbol{ a}_{1} \in \mathbb{R}^{q}:\|\boldsymbol{ a} -\boldsymbol{ a}_{1}\| \leq \varDelta \}\) and \(B_{1}\left (\boldsymbol{a},\varDelta \right ) =\{\boldsymbol{ a}_{1} \in \mathbb{R}^{q}:\|\boldsymbol{ a} -\boldsymbol{ a}_{1}\|_{1} \leq \varDelta \}\) the corresponding balls centered at \(\boldsymbol{a}\) of radius Δ > 0. In general bold characters represent column vectors and normal characters their components; for example \(\boldsymbol{y} = \left (\,y_{i}\right )_{i\in \mathbb{Z}}\). The use of subscripts with ‘:’ refers to certain vector components \(\boldsymbol{y}_{1:k} = \left (\,y_{i}\right )_{1\leq i\leq k}\), or elements of a sequence \(X_{1:k} = \left (X_{t}\right )_{1\leq t\leq k}\). For a random variable U distributed as ν and a measurable function h, ν[h(U)] or simply ν[h] stands for the expectation of h(U): ν[h] = ∫ h(u)ν(du).

2 Problem Statement and Main Assumptions

Real stable autoregressive processes of a fixed order, referred to as AR(d) processes, are among the simplest examples of CBS. They are defined as the stationary solution of

$$\displaystyle\begin{array}{rcl} X_{t}& =& \sum \limits _{j=1}^{d}\theta _{ j}X_{t-j} +\sigma \xi _{t}\;,{}\end{array}$$
(1)

where the \((\xi _{t})_{t\in \mathbb{Z}}\) are i.i.d. real random variables with \(\mathbb{E}[\xi _{t}] = 0\) and \(\mathbb{E}[\xi _{t}^{2}] = 1\).

Several efficient estimators of the parameter \(\boldsymbol{\theta }= \left [\begin{array}{*{10}c} \theta _{1} & \ldots & \theta _{d} \end{array} \right ]^{{\prime}}\) are available; they can be computed via simple algorithms such as the Levinson-Durbin or Burg algorithms. From them we also derive efficient predictors. However, as the model is simple to handle, we use it to progressively introduce our general setup.
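
For illustration, here is a minimal, self-contained sketch of this classical route (not the authors' code; the function names and the AR(2) parameter are ours): it simulates a stable AR(d) process and recovers \(\boldsymbol{\theta }\) by solving the Yule-Walker equations with the Levinson-Durbin recursion.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_ar(theta, sigma, T, burn=500):
    """Simulate X_t = sum_j theta_j X_{t-j} + sigma * xi_t with i.i.d. N(0,1) innovations."""
    d = len(theta)
    x = np.zeros(T + burn)
    for t in range(d, T + burn):
        x[t] = theta @ x[t - d:t][::-1] + sigma * rng.standard_normal()
    return x[burn:]

def levinson_durbin(gamma, d):
    """Solve the Yule-Walker equations for an AR(d) fit, given autocovariances gamma[0..d]."""
    phi = np.zeros(d)
    prev = np.zeros(d)
    v = gamma[0]                                   # running innovation-variance estimate
    for k in range(1, d + 1):
        kappa = (gamma[k] - prev[:k - 1] @ gamma[1:k][::-1]) / v
        phi[:k - 1] = prev[:k - 1] - kappa * prev[:k - 1][::-1]
        phi[k - 1] = kappa
        v *= 1.0 - kappa ** 2
        prev[:k] = phi[:k]
    return phi, v

theta_true = np.array([0.5, -0.3])                 # an illustrative stable AR(2) parameter
x = simulate_ar(theta_true, sigma=1.0, T=5000)
gamma = np.array([np.mean(x[: len(x) - h] * x[h:]) for h in range(3)])
theta_hat, sigma2_hat = levinson_durbin(gamma, d=2)
print(theta_hat, sigma2_hat)                       # close to theta_true and sigma^2 = 1
```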

Denote

$$\displaystyle{A\left (\boldsymbol{\theta }\right ) = \left [\begin{array}{*{10}c} \theta _{1} & \theta _{2} & \ldots & \ldots & \theta _{d} \\ 1&0& \ldots & \ldots &0\\ 0 &1 &0 & \ddots &0 \\ \vdots &0& \ddots & \ddots & \vdots\\ 0 & \ldots &0 &1 &0 \end{array} \right ]\;,}$$

\(\boldsymbol{X}_{t-1} = \left [\begin{array}{*{10}c} X_{t-1} & \ldots & X_{t-d} \end{array} \right ]^{{\prime}}\) and \(\boldsymbol{e}_{1} = \left [\begin{array}{*{10}c} 1&0&\ldots &0 \end{array} \right ]^{{\prime}}\) the first canonical vector of \(\mathbb{R}^{d}\). M′ represents the transpose of matrix M (including vectors). The recurrence (1) gives

$$\displaystyle\begin{array}{rcl} X_{t} =\boldsymbol{\theta } '\boldsymbol{X}_{t-1} +\sigma \xi _{t} =\sigma \sum \limits _{ j=0}^{\infty }\boldsymbol{e}'_{ 1}A^{j}\left (\boldsymbol{\theta }\right )\boldsymbol{e}_{ 1}\xi _{t-j}\;.& &{}\end{array}$$
(2)

The eigenvalues of \(A\left (\boldsymbol{\theta }\right )\) are the inverses of the roots of the autoregressive polynomial \(\boldsymbol{\theta }\left (z\right ) = 1 -\sum _{k=1}^{d}\theta _{k}z^{k}\); their moduli are therefore at most δ for some \(\delta \in \left (0,1\right )\), due to the stability of X (see [7]). In other words \(\boldsymbol{\theta }\in s_{d}\left (\delta \right ) =\{\boldsymbol{\theta }:\ \boldsymbol{\theta }\left (z\right )\neq 0\;\mbox{ for}\;\vert z\vert <\delta ^{-1}\} \subseteq s_{d}\left (1\right )\). In this context (or even in a more general one, see [14]), for all δ 1 ∈ (δ, 1) there is a constant \(\bar{K}\) depending only on \(\boldsymbol{\theta }\) and δ 1 such that for all j ≥ 0

$$\displaystyle\begin{array}{rcl} \left \vert \boldsymbol{e}'_{1}A^{j}\left (\boldsymbol{\theta }\right )\boldsymbol{e}_{ 1}\right \vert \leq \bar{ K}\delta _{1}^{j}\;,& &{}\end{array}$$
(3)

and then, the variance of X t , denoted γ 0, satisfies \(\gamma _{0} =\sigma ^{2}\sum _{j=0}^{\infty }\vert \boldsymbol{e}'_{1}A^{j}\left (\boldsymbol{\theta }\right )\boldsymbol{e}_{1}\vert ^{2} \leq \bar{ K}^{2}\sigma ^{2}/(1 -\delta _{1}^{2})\).
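
A quick numerical check of (3), under our own illustrative choice of \(\boldsymbol{\theta }\) (this snippet is not part of the original development): build the companion matrix \(A\left (\boldsymbol{\theta }\right )\) and verify that \(\vert \boldsymbol{e}'_{1}A^{j}\left (\boldsymbol{\theta }\right )\boldsymbol{e}_{1}\vert /\delta _{1}^{j}\) stays bounded for some δ 1 < 1.

```python
import numpy as np

def companion(theta):
    """Companion matrix A(theta) of the AR(d) recursion (1)."""
    d = len(theta)
    A = np.zeros((d, d))
    A[0, :] = theta
    A[1:, :-1] = np.eye(d - 1)
    return A

theta = np.array([0.5, -0.3])                     # illustrative stable AR(2) parameter
A = companion(theta)
delta = max(abs(np.linalg.eigvals(A)))            # largest eigenvalue modulus, < 1 by stability
delta1 = 1.001 * delta                            # any delta_1 in (delta, 1) works in (3)
coeffs = [abs(np.linalg.matrix_power(A, j)[0, 0]) for j in range(15)]   # |e_1' A^j e_1|
print(delta, max(c / delta1 ** j for j, c in enumerate(coeffs)))        # finite bound K_bar
```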

The following definition allows us to introduce the process of interest.

Definition 1

Let \(\mathcal{X}'\subseteq \mathbb{R}^{r'}\) for some r′ ≥ 1 and let \(A = (A_{j})_{j\geq 0}\) be a sequence of non-negative numbers. A function \(H: (\mathcal{X}')^{\mathbb{N}} \rightarrow \mathcal{X}\) is said to be A-Lipschitz if

$$\displaystyle\begin{array}{rcl} \|H\left (\boldsymbol{u}\right ) - H\left (\boldsymbol{v}\right )\|& \leq & \sum \limits _{j=0}^{\infty }A_{ j}\|u_{j} - v_{j}\|\;, {}\\ \end{array}$$

for any \(\boldsymbol{u} = (u_{j})_{j\in \mathbb{N}},\boldsymbol{v} = (v_{j})_{j\in \mathbb{N}} \in (\mathcal{X}')^{\mathbb{N}}\).

Given \(A = (A_{j})_{j\geq 0}\) with A j  ≥ 0 for all j ≥ 0, an i.i.d. sequence of \(\mathcal{X}'\)-valued random variables \(\left (\xi _{t}\right )_{t\in \mathbb{Z}}\) and \(H: (\mathcal{X}')^{\mathbb{N}} \rightarrow \mathcal{X}\), we say that a time series \(X\ =\ \left (X_{t}\right )_{t\in \mathbb{Z}}\) satisfying the following property is a Causal Bernoulli Shift (CBS) with Lipschitz coefficients A and innovations \(\left (\xi _{t}\right )_{t\in \mathbb{Z}}\).

  1. (M)

    The process \(X = \left (X_{t}\right )_{t\in \mathbb{Z}}\) meets the representation

    $$\displaystyle\begin{array}{rcl} X_{t}& =& H\left (\xi _{t},\xi _{t-1},\xi _{t-2},\ldots \right ),\forall t \in \mathbb{Z}\;, {}\\ \end{array}$$

    where H is an A -Lipschitz function with the sequence A satisfying

    $$\displaystyle\begin{array}{rcl} \tilde{A}_{{\ast}}& =& \sum _{j=0}^{\infty }jA_{ j} < \infty \;. {}\end{array}$$
    (4)

    We additionally define

    $$\displaystyle{ A_{{\ast}} =\sum _{ j=0}^{\infty }A_{ j}\;. }$$
    (5)

CBS encompass several types of nonmixing stationary Markov chains, real-valued functional autoregressive models and Volterra processes, among other interesting models (see [10]). Thanks to the representation (2) and the inequality (3), we can assert that AR(d) processes are CBS with \(A_{j} =\sigma \bar{ K}\delta _{1}^{j}\) for j ≥ 0.

We let ξ denote a random variable distributed as the ξ t 's. The results from [1] and [2] require a control on the exponential moment of ξ at \(\zeta = A_{{\ast}}\), which is provided via the following hypothesis.

  1. (I)

    The innovations \(\left (\xi _{t}\right )_{t\in \mathbb{Z}}\) satisfy \(\phi (\zeta ) = \mathbb{E}\left [\mathrm{e}^{\zeta \|\xi \|}\right ] < \infty \).

Bounded or Gaussian innovations trivially satisfy this hypothesis for any \(\zeta \in \mathbb{R}\).

Let π 0 denote the probability distribution of the time series Y that we aim to forecast. Observe that for a CBS, π 0 depends only on H and the distribution of ξ. For any \(f: \mathcal{X}^{\mathbb{N}^{{\ast}} } \rightarrow \mathcal{X}\) measurable and \(t \in \mathbb{Z}\) we consider \(\hat{Y }_{t} = f\left (\left (Y _{t-i}\right )_{i\geq 1}\right )\), a possible predictor of Y t from its past. For a given loss function \(\ell: \mathcal{X}\times \mathcal{X} \rightarrow \mathbb{R}_{+}\), the prediction risk is evaluated by the expectation of \(\ell(\hat{Y }_{t},Y _{t})\)

$$\displaystyle\begin{array}{rcl} R\left (\,f\right ) = \mathbb{E}\left [\ell\left (\hat{Y }_{t},Y _{t}\right )\right ] =\pi _{0}\left [\ell\left (\hat{Y }_{t},Y _{t}\right )\right ] =\int \limits _{\mathcal{X}^{\mathbb{Z}}}\ell\left (\,f\left (\left (\,y_{t-i}\right )_{i\geq 1}\right ),y_{t}\right )\pi _{0}\left (\mathrm{d}\boldsymbol{y}\right )\;.& & {}\\ \end{array}$$

We assume in the following that the loss function fulfills the condition:

  1. (L)

    For all \(\boldsymbol{y},\boldsymbol{z} \in \mathcal{X}\), \(\ell\left (\,\boldsymbol{y},\boldsymbol{z}\right ) = g\left (\,\boldsymbol{y} -\boldsymbol{ z}\right )\), for some convex function g which is non-negative, \(g\left (0\right ) = 0\) and K- Lipschitz: \(\left \vert g\left (\,\boldsymbol{y}\right ) - g\left (\boldsymbol{z}\right )\right \vert \leq K\|\boldsymbol{y} -\boldsymbol{ z}\|\).

If \(\mathcal{X}\) is a subset of \(\mathbb{R}\), \(\ell\left (\,y,z\right ) = \left \vert y - z\right \vert \) satisfies Assumption ( L ) with K = 1.

From estimators of \(\boldsymbol{\theta }\) of dimension d we can build the corresponding linear predictors \(f_{\boldsymbol{\theta }}\left (\,\boldsymbol{y}\right ) =\boldsymbol{\theta } '\boldsymbol{y}_{1:d}\). More generally, consider a set Θ and an associated set of predictors \(\left \{\,f_{\boldsymbol{\theta }},\boldsymbol{\theta }\in \varTheta \right \}\). For each \(\boldsymbol{\theta }\in \varTheta\) there is a unique \(d = d\left (\boldsymbol{\theta }\right ) \in \mathbb{N}^{{\ast}}\) such that \(f_{\boldsymbol{\theta }}: \mathcal{X}^{d} \rightarrow \mathcal{X}\) is a measurable function, from which we define

$$\displaystyle\begin{array}{rcl} \hat{Y }_{t}^{\boldsymbol{\theta }}& =& f_{\boldsymbol{\theta }}\left (Y _{ t-1},\ldots,Y _{t-d}\right )\;, {}\\ \end{array}$$

as a predictor of Y t given its past. We can extend all functions \(f_{\boldsymbol{\theta }}\) in a trivial way (using dummy variables) to start from \(\mathcal{X}^{\mathbb{N}^{{\ast}} }\). A natural way to evaluate the predictor associated with \(\boldsymbol{\theta }\) is to compute the risk \(R\left (\boldsymbol{\theta }\right ) = R\left (\,f_{\boldsymbol{\theta }}\right )\). We use the same letter R by an abuse of notation.

We observe X 1: T from \(X = \left (X_{t}\right )_{t\in \mathbb{Z}}\), an independent copy of Y. A crucial goal of this work is to build a predictor function \(\hat{\,f}_{T}\) for Y, inferred from the sample X 1: T and Θ such that \(R(\hat{\,f}_{T})\) is close to \(\inf _{\boldsymbol{\theta }\in \varTheta }R\left (\boldsymbol{\theta }\right )\) with π 0- probability close to 1.

The set Θ also depends on T; we write \(\varTheta \equiv \varTheta _{T}\). Let us define

$$\displaystyle\begin{array}{rcl} d_{T} =\sup _{\boldsymbol{\theta }\in \varTheta _{T}}d\left (\boldsymbol{\theta }\right )\;.& &{}\end{array}$$
(6)

The main assumptions on the set of predictors are the following ones.

  1. (P-1)

    The set \(\left \{\,f_{\boldsymbol{\theta }},\boldsymbol{\theta }\in \varTheta _{T}\right \}\) is such that for any \(\boldsymbol{\theta }\in \varTheta _{T}\) there are \(b_{1}\left (\boldsymbol{\theta }\right ),\ldots,\) \(b_{d\left (\boldsymbol{\theta }\right )}\left (\boldsymbol{\theta }\right )\ \in \ \mathbb{R}_{+}\) satisfying for all \(\boldsymbol{y} = \left (\,y_{i}\right )_{i\in \mathbb{N}^{{\ast}}},\boldsymbol{z} = \left (z_{i}\right )_{i\in \mathbb{N}^{{\ast}}} \in \mathcal{X}^{\mathbb{N}^{{\ast}} }\),

    $$\displaystyle\begin{array}{rcl} \left \vert \left \vert \,f_{\boldsymbol{\theta }}(\boldsymbol{y}) - f_{\boldsymbol{\theta }}(\boldsymbol{z})\right \vert \right \vert &\leq &\sum \limits _{j=1}^{d\left (\boldsymbol{\theta }\right )}b_{ j}(\boldsymbol{\theta })\left \vert \left \vert y_{j} - z_{j}\right \vert \right \vert \;. {}\\ \end{array}$$

    We assume moreover that \(L_{T} =\sup _{\boldsymbol{\theta }\in \varTheta _{T}}\sum _{j=1}^{d\left (\boldsymbol{\theta }\right )}b_{j}\left (\boldsymbol{\theta }\right ) < \infty \).

  2. (P-2)

    The inequality L T + 1 ≤ logT holds for all T ≥ 4.

In the case where \(\mathcal{X} \subseteq \mathbb{R}\) and \(\left \{\,f_{\boldsymbol{\theta }},\boldsymbol{\theta }\in \varTheta _{T}\right \}\) is such that \(\boldsymbol{\theta }\in \mathbb{R}^{d\left (\boldsymbol{\theta }\right )}\) and \(f_{\boldsymbol{\theta }}\left (\boldsymbol{y}\right ) =\boldsymbol{\theta } '\boldsymbol{y}_{1:d(\boldsymbol{\theta })}\) for all \(\boldsymbol{y} \in \mathbb{R}^{\mathbb{N}}\), we have

$$\displaystyle\begin{array}{rcl} \left \vert \,f_{\boldsymbol{\theta }}(\boldsymbol{y}) - f_{\boldsymbol{\theta }}(\boldsymbol{z})\right \vert & \leq & \sum \limits _{j=1}^{d\left (\boldsymbol{\theta }\right )}\left \vert \theta _{ j}\right \vert \left \vert y_{j} - z_{j}\right \vert \;. {}\\ \end{array}$$

The last conditions are satisfied by the linear predictors when Θ T is a subset of the 1-ball of radius logT − 1 in \(\mathbb{R}^{d_{T}}\).

3 Prediction via Aggregation

The predictor that we propose is defined as an average of predictors \(f_{\boldsymbol{\theta }}\) based on the empirical version of the risk,

$$\displaystyle\begin{array}{rcl} r_{T}\left (\boldsymbol{\theta }\left \vert X\right.\right )& =& \frac{1} {T - d\left (\boldsymbol{\theta }\right )}\sum \limits _{t=d\left (\boldsymbol{\theta }\right )+1}^{T}\ell\left (\hat{X}_{ t}^{\boldsymbol{\theta }},X_{ t}\right )\;. {}\\ \end{array}$$

where \(\hat{X}_{t}^{\boldsymbol{\theta }} = f_{\boldsymbol{\theta }}\left (\left (X_{t-i}\right )_{i\geq 1}\right )\). The function \(r_{T}\left (\boldsymbol{\theta }\left \vert X\right.\right )\) relies only on X 1: T and can be computed at stage T; it is in fact a statistic.
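
For concreteness, a minimal sketch of \(r_{T}\left (\boldsymbol{\theta }\left \vert X\right.\right )\) for linear predictors under the absolute loss (the function name and array conventions are ours; x[0], …, x[T-1] stand for X 1: T , and \(d\left (\boldsymbol{\theta }\right )\) is taken here as the position of the last non-zero component):

```python
import numpy as np

def empirical_risk(theta, x):
    """r_T(theta | X) for the linear predictor hat X_t^theta = theta'(X_{t-1}, ..., X_{t-d})
    and the loss l(y, z) = |y - z|, averaged over t = d(theta)+1, ..., T."""
    theta = np.trim_zeros(np.asarray(theta, dtype=float), trim='b')    # drop trailing zeros: d = d(theta)
    d, T = len(theta), len(x)
    preds = np.array([theta @ x[t - d:t][::-1] for t in range(d, T)])  # hat X_t^theta
    return np.mean(np.abs(x[d:] - preds))
```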

We consider a prior probability measure π T on Θ T . The prior serves to control the complexity of predictors associated with Θ T . Using π T we can construct one predictor in particular, as detailed in the following.

3.1 Gibbs Predictor

For a measure ν and a measurable function h (called energy function) such that \(\nu \left [\exp \left (h\right )\right ] =\int \exp \left (h\right )\;\mathrm{d}\nu < \infty \;,\) we denote by \(\nu \left \{h\right \}\) the measure defined as

$$\displaystyle\begin{array}{rcl} \nu \left \{h\right \}\left (\mathrm{d}\boldsymbol{\theta }\right )& =& \frac{\exp \left (h\left (\boldsymbol{\theta }\right )\right )} {\nu \left [\exp \left (h\right )\right ]} \nu \left (\mathrm{d}\boldsymbol{\theta }\right )\;. {}\\ \end{array}$$

It is known as the Gibbs measure.

Definition 2 (Gibbs predictor)

Given η > 0, called the temperature or the learning rate parameter, we define the Gibbs predictor as the expectation of \(f_{\boldsymbol{\theta }}\), where \(\boldsymbol{\theta }\) is drawn under \(\pi _{T}\left \{-\eta r_{T}\left (\cdot \left \vert X\right.\right )\right \}\), that is

$$\displaystyle\begin{array}{rcl} \hat{\,f}_{\eta,T}\left (\,\boldsymbol{y}\left \vert X\right.\right ) =\pi _{T}\left \{-\eta r_{T}\left (\cdot \left \vert X\right.\right )\right \}\left [\,f_{\cdot }\left (\,\boldsymbol{y}\right )\right ] =\int \limits _{\varTheta _{T}}f_{\boldsymbol{\theta }}\left (\,\boldsymbol{y}\right ) \frac{\exp \left (-\eta r_{T}\left (\boldsymbol{\theta }\left \vert X\right.\right )\right )} {\pi _{T}\left [\exp \left (-\eta r_{T}\left (\cdot \left \vert X\right.\right )\right )\right ]}\pi _{T}\left (\mathrm{d}\boldsymbol{\theta }\right )\;.\!\!\!& & \\ & &{}\end{array}$$
(7)
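
In the simplest situation where π T is uniform on a finite family of candidate parameters, the integral in (7) is a plain exponentially weighted average; for linear predictors, aggregating the predictors amounts to averaging their parameters. A minimal sketch of this finite case (our own illustration; empirical_risk is a hypothetical routine such as the one sketched above):

```python
import numpy as np

def gibbs_predictor_grid(x, thetas, eta, empirical_risk):
    """Gibbs aggregation (7) when pi_T is uniform on the finite family `thetas`
    (each theta padded to a common length).  Returns the aggregated parameter
    and the Gibbs weights."""
    risks = np.array([empirical_risk(th, x) for th in thetas])
    w = np.exp(-eta * (risks - risks.min()))    # subtracting the minimum avoids underflow
    w /= w.sum()
    theta_bar = w @ np.asarray(thetas)          # linear predictors: aggregating f_theta = averaging theta
    return theta_bar, w
```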

3.2 PAC-Bayesian Inequality

At this point more care must be taken to describe Θ T . Here and in the following we suppose that

$$\displaystyle\begin{array}{rcl} \varTheta _{T} \subseteq \mathbb{R}^{n_{T} }\;\;\;\mbox{ for some }n_{T} \in \mathbb{N}^{{\ast}}\;.& &{}\end{array}$$
(8)

Suppose moreover that Θ T is equipped with the Borel σ-algebra \(\mathcal{B}(\varTheta _{T})\).

A Lipschitz type hypothesis on \(\boldsymbol{\theta }\) guarantees the robustness of the set \(\left \{\,f_{\boldsymbol{\theta }},\boldsymbol{\theta }\in \varTheta _{T}\right \}\) with respect to the risk R.

  1. (P-3)

    There is \(\mathcal{D} < \infty \) such that for all \(\boldsymbol{\theta }_{1},\boldsymbol{\theta }_{2} \in \varTheta _{T}\),

    $$\displaystyle\begin{array}{rcl} \pi _{0}\left [\left \vert \left \vert \,f_{\boldsymbol{\theta }_{1}}\left (\left (X_{t-i}\right )_{i\geq 1}\right ) - f_{\boldsymbol{\theta }_{2}}\left (\left (X_{t-i}\right )_{i\geq 1}\right )\right \vert \right \vert \right ]& \leq &\mathcal{D}d_{T}^{1/2}\left \vert \left \vert \boldsymbol{\theta }_{ 1} -\boldsymbol{\theta }_{2}\right \vert \right \vert \;, {}\\ \end{array}$$

    where d T is defined in (6).

Linear predictors satisfy this last condition with \(\mathcal{D} =\pi _{0}\left [\left \vert X_{1}\right \vert \right ]\).

Suppose that the \(\boldsymbol{\theta }\) reaching the \(\inf _{\boldsymbol{\theta }\in \varTheta _{T}}R(\boldsymbol{\theta })\) has some zero components, i.e. \(\vert \mathrm{supp}(\boldsymbol{\theta })\vert < n_{T}\). Any prior with a lower bounded density (with respect to the Lebesgue measure) allocates zero mass to lower dimensional subsets of Θ T . Furthermore, if the density is upper bounded we have \(\pi _{T}[B(\boldsymbol{\theta },\varDelta ) \cap \varTheta _{T}] = O(\varDelta ^{n_{T}})\) for Δ small enough. As we will see in the proof of Theorem 1, a bound like the previous one would impose a tighter constraint on n T . Instead we set the following condition.

  1. (P-4)

    There is a sequence \(\left (\boldsymbol{\theta }_{T}\right )_{T\geq 4}\) and constants \(\mathcal{C}_{1} > 0\), \(\mathcal{C}_{2},\mathcal{C}_{3} \in (0,1]\) and γ ≥ 1 such that \(\boldsymbol{\theta }_{T} \in \varTheta _{T}\),

    $$\displaystyle\begin{array}{rcl} R\left (\boldsymbol{\theta }_{T}\right )& \leq &\inf \limits _{\boldsymbol{\theta }\in \varTheta _{T}}R\left (\boldsymbol{\theta }\right ) + \mathcal{C}_{1} \frac{\log ^{3}\left (T\right )} {T^{1/2}}\;, {}\\ \mathrm{and}\;\;\;\pi _{T}\left [B\left (\boldsymbol{\theta }_{T},\varDelta \right ) \cap \varTheta _{T}\right ]& \geq &\mathcal{C}_{2}\varDelta ^{n_{T}^{1/\gamma } },\forall 0 \leq \varDelta \leq \varDelta _{T} = \frac{\mathcal{C}_{3}} {T} \;. {}\\ \end{array}$$

A concrete example is provided in Sect. 5.

We can now present the main result of this section, our PAC-Bayesian inequality concerning the predictor \(\hat{\,f}_{\eta _{T},T}\left (\cdot \left \vert X\right.\right )\) built according to (7) with the learning rate η  =  η T   =  T 1∕2∕(4logT), given an arbitrary probability measure π T on Θ T .

Theorem 1

Let ℓ be a loss function such that Assumption ( L ) holds. Consider a process \(X = \left (X_{t}\right )_{t\in \mathbb{Z}}\) satisfying Assumption ( M ) and let π 0 denote its probability distribution. Assume that the innovations fulfill Assumption ( I ) with \(\zeta = A_{{\ast}}\); \(A_{{\ast}}\) is defined in ( 5 ). For each T ≥ 4 let \(\left \{\,f_{\boldsymbol{\theta }},\boldsymbol{\theta }\in \varTheta _{T}\right \}\) be a set of predictors meeting Assumptions ( P-1 ), ( P-2 ) and ( P-3 ) such that d T , defined in ( 6 ), is at most T∕2. Suppose that the set Θ T is as in (8) with \(n_{T} \leq \log ^{\gamma }T\) for some γ ≥ 1 and let π T be a probability measure on it such that Assumption ( P-4 ) holds for the same γ. Then for any \(\varepsilon > 0\) , with π 0 -probability at least \(1-\varepsilon\) ,

$$\displaystyle\begin{array}{rcl} R\left (\hat{\,f}_{\eta _{T},T}\left (\cdot \left \vert X\right.\right )\right )& \leq & \inf \limits _{\boldsymbol{\theta }\in \varTheta _{T}}R\left (\,f_{\boldsymbol{\theta }}\right ) + \mathcal{E} \frac{\log ^{3}T} {T^{1/2}} + \frac{8\log T} {T^{1/2}}\log \left (\frac{1} {\varepsilon } \right )\;, {}\\ \end{array}$$

where

$$\displaystyle\begin{array}{rcl} \mathcal{E} = \mathcal{C}_{1} + 8 + \frac{2} {\log 2} -\frac{2\log \mathcal{C}_{2}} {\log ^{2}2} -\frac{4\log \mathcal{C}_{3}} {\log 2} & +& \frac{8K^{2}\left (A_{{\ast}} +\tilde{ A}_{{\ast}}\right )^{2}} {\tilde{A}_{{\ast}}^{2}} + \frac{K\mathcal{D}\mathcal{C}_{3}} {8\log ^{3}2} \\ & & +\frac{4K\phi (A_{{\ast}})} {\log 2} + \frac{2K^{2}\phi (A_{{\ast}})} {\log ^{2}2} \;, {}\end{array}$$
(9)

with \(\tilde{A}_{{\ast}}\) defined in (4) , K, ϕ and \(\mathcal{D}\) in Assumptions ( L ), ( I ) and ( P-3 ), respectively, and \(\mathcal{C}_{1}\) , \(\mathcal{C}_{2}\) and \(\mathcal{C}_{3}\) in Assumption ( P-4 ).

The proof is postponed to Sect. 7.1.

We stress, however, that this inequality applies to the exact aggregated predictor \(\hat{\,f}_{\eta _{T},T}\left (\cdot \left \vert X\right.\right )\). We need to investigate how these predictors are computed in practice and how the numerical approximation behaves compared to the exact version.

4 Stochastic Approximation

Once we have the observations X 1: T , we use the Metropolis-Hastings algorithm to compute \(\hat{\,f}_{\eta,T}\left (\cdot \left \vert X\right.\right ) =\int f_{\boldsymbol{\theta }}\left (\cdot \right )\pi _{T}\left \{-\eta r_{T}\left (\boldsymbol{\theta }\left \vert X\right.\right )\right \}\left (\mathrm{d}\boldsymbol{\theta }\right )\). The Gibbs measure \(\pi _{T}\left \{-\eta r_{T}\left (\cdot \left \vert X\right.\right )\right \}\) is a distribution on Θ T whose density \(\pi _{\eta,T}\left (\cdot \left \vert X\right.\right )\) with respect to π T is proportional to \(\exp \left (-\eta r_{T}\left (\cdot \left \vert X\right.\right )\right )\).

4.1 Metropolis-Hastings Algorithm

Given \(X \in \mathcal{X}^{\mathbb{Z}}\), the Metropolis-Hastings algorithm generates a Markov chain \(\varPhi _{\eta,T}\left (X\right ) = (\boldsymbol{\theta }_{\eta,T,n}(X))_{n\geq 0}\) with kernel P η, T (only depending on X 1: T ) having the target distribution \(\pi _{T}\left \{-\eta r_{T}\left (\cdot \left \vert X\right.\right )\right \}\) as the unique invariant measure, based on the transitions of another Markov chain which serves as a proposal (see [21]). We consider a proposal transition of the form \(Q_{\eta,T}(\boldsymbol{\theta }_{1},\mathrm{d}\boldsymbol{\theta }) = q_{\eta,T}(\boldsymbol{\theta }_{1},\boldsymbol{\theta })\pi _{T}(\mathrm{d}\boldsymbol{\theta })\) where the conditional density kernel q η, T (possibly also depending on X 1: T ) on \(\varTheta _{T} \times \varTheta _{T}\) is such that

$$\displaystyle\begin{array}{rcl} \beta _{\eta,T}\left (X\right ) =\inf \limits _{\left (\boldsymbol{\theta }_{1},\boldsymbol{\theta }_{2}\right )\in \varTheta _{T}\times \varTheta _{T}} \frac{q_{\eta,T}\left (\boldsymbol{\theta }_{1},\boldsymbol{\theta }_{2}\right )} {\pi _{\eta,T}\left (\boldsymbol{\theta }_{2}\left \vert X\right.\right )} \in \left (0,1\right )\;.& &{}\end{array}$$
(10)

This is the case for the independent Hastings algorithm, where the proposal is i.i.d. with density q η, T . The condition then becomes

$$\displaystyle\begin{array}{rcl} \beta _{\eta,T}\left (X\right ) =\inf \limits _{\boldsymbol{\theta }\in \varTheta _{T}} \frac{q_{\eta,T}\left (\boldsymbol{\theta }\right )} {\pi _{\eta,T}\left (\boldsymbol{\theta }\left \vert X\right.\right )} \in \left (0,1\right )\;.& &{}\end{array}$$
(11)

In Sect. 5 we provide an example.

The relation (10) implies that the algorithm is uniformly ergodic, i.e. we have a control in total variation norm (\(\|\cdot \|_{TV }\)). Thus, the following condition holds (see [18]).

  1. (A)

    Given η, T > 0, there is \(\beta _{\eta,T}: \mathcal{X}^{\mathbb{Z}} \rightarrow \left (0,1\right )\) such that for any \(\boldsymbol{\theta }_{0} \in \varTheta _{T}\), \(\boldsymbol{x} \in \mathcal{X}^{\mathbb{Z}}\) and \(n \in \mathbb{N}\), the chain \(\varPhi _{\eta,T}\left (\boldsymbol{x}\right )\) with transition law P η, T and invariant distribution \(\pi _{T}\left \{-\eta r_{T}\left (\cdot \left \vert \boldsymbol{x}\right.\right )\right \}\) satisfies

    $$\displaystyle\begin{array}{rcl} \left \vert \left \vert P_{\eta,T}^{n}\left (\boldsymbol{\theta }_{ 0},\cdot \right ) -\pi _{T}\left \{-\eta r_{T}\left (\cdot \left \vert \boldsymbol{x}\right.\right )\right \}\right \vert \right \vert _{TV }& \leq & 2\left (1 -\beta _{\eta,T}\left (\boldsymbol{x}\right )\right )^{n}\;. {}\\ \end{array}$$

4.2 Theoretical Bounds for the Computation

In [16, Theorem 3.1] we find a bound on the mean square error of approximating one integral by the empirical estimate obtained from the successive samples of certain ergodic Markov chains, including those generated by the MCMC method that we use.

An MCMC method adds a second source of randomness to the forecasting process, and our aim is to measure it. Let \(\boldsymbol{\theta }_{0} \in \cap _{T\geq 1}\varTheta _{T}\); we set \(\boldsymbol{\theta }_{\eta,T,0}\left (\boldsymbol{x}\right ) =\boldsymbol{\theta } _{0}\) for all T, η > 0 and \(\boldsymbol{x} \in \mathcal{X}^{\mathbb{Z}}\). We denote by \(\mu _{\eta,T}\left (\cdot \left \vert X\right.\right )\) the probability distribution of the Markov chain \(\varPhi _{\eta,T}\left (X\right )\) with initial point \(\boldsymbol{\theta }_{0}\) and kernel P η, T .

Let ν η, T denote the probability distribution of \((X,\varPhi _{\eta,T}\left (X\right ))\); it is defined by setting for all sets \(A \in (\mathcal{B}(\mathcal{X}))^{\otimes \mathbb{Z}}\) and \(B \in (\mathcal{B}(\varTheta _{T}))^{\otimes \mathbb{N}}\)

$$\displaystyle\begin{array}{rcl} \nu _{\eta,T}\left (A \times B\right ) =\int \mathbb{1}_{A}\left (\boldsymbol{x}\right ) \mathbb{1}_{B}\left (\boldsymbol{\phi }\right )\mu _{\eta,T}\left (\mathrm{d}\boldsymbol{\phi }\left \vert \boldsymbol{x}\right.\right )\pi _{0}\left (\mathrm{d}\boldsymbol{x}\right )& &{}\end{array}$$
(12)

Given \(\varPhi _{\eta,T} = (\boldsymbol{\theta }_{\eta,T,n})_{n\geq 0}\), we then define for \(n \in \mathbb{N}^{{\ast}}\)

$$\displaystyle\begin{array}{rcl} \bar{\,f}_{\eta,T,n} = \frac{1} {n}\sum _{i=0}^{n-1}f_{\boldsymbol{\theta }_{\eta,T,i}}\,.& &{}\end{array}$$
(13)

Since our chain depends on X, we make it explicit by using the notation \(\bar{\,f}_{\eta,T,n}\left (\cdot \left \vert X\right.\right )\). The cited [16, Theorem 3.1] leads to a proposition that applies to the numerical approximation of the Gibbs predictor (the proof is in Sect. 7.2). We stress that this is independent of the model (CBS or any), of the set of predictors and of the theoretical guarantees of Theorem 1.

Proposition 1

Let ℓ be a loss function meeting Assumption ( L ). Consider any process \(X = \left (X_{t}\right )_{t\in \mathbb{Z}}\) with an arbitrary probability distribution π 0 . Given T ≥ 2, η > 0, a set of predictors \(\left \{\,f_{\boldsymbol{\theta }},\boldsymbol{\theta }\in \varTheta _{T}\right \}\) and \(\pi _{T} \in \mathcal{M}_{+}^{1}\left (\varTheta _{T}\right )\) , let \(\hat{\,f}_{\eta,T}\left (\cdot \left \vert X\right.\right )\) be defined by (7) and let \(\bar{\,f}_{\eta,T,n}\left (\cdot \left \vert X\right.\right )\) be defined by (13) . Suppose that Φ η,T meets Assumption ( A ) for η and T with a function \(\beta _{\eta,T}: \mathcal{X}^{\mathbb{Z}} \rightarrow (0,1)\) . Let ν η,T denote the probability distribution of \((X,\varPhi _{\eta,T}\left (X\right ))\) as defined in (12) . Then, for all n ≥ 1 and D > 0, with ν η,T - probability at least max {0,1 − A η,T ∕(Dn 1∕2 )} we have \(\vert R(\bar{\,f}_{\eta,T,n}\left (\cdot \left \vert X\right.\right )) - R(\hat{\,f}_{\eta,T}\left (\cdot \left \vert X\right.\right ))\vert \leq D\) , where

$$\displaystyle\begin{array}{rcl} A_{\eta,T} = 3K\int \limits _{\mathcal{X}^{\mathbb{Z}}} \frac{1} {\beta _{\eta,T}\left (\boldsymbol{x}\right )}\int \limits _{\mathcal{X}^{\mathbb{Z}}}\sup \limits _{\boldsymbol{\theta }\in \varTheta _{T}}\left \vert \,f_{\boldsymbol{\theta }}\left (\boldsymbol{y}\right ) -\hat{\, f}_{\eta,T}\left (\boldsymbol{y}\left \vert \boldsymbol{x}\right.\right )\right \vert \pi _{0}\left (\mathrm{d}\boldsymbol{y}\right )\pi _{0}\left (\mathrm{d}\boldsymbol{x}\right )\,.& &{}\end{array}$$
(14)

We denote by \(\nu _{T} =\nu _{\eta _{T},T}\) the probability distribution of \((X,\varPhi _{\eta,T}\left (X\right ))\) setting η  =  η T   =  T 1∕2∕(4logT). As Theorem 1 does not involve any simulation, it also holds in ν T - probability. From this and Proposition 1 a union bound gives us the following.

Theorem 2

Under the hypotheses of Theorem 1 , assume moreover that Assumption ( A ) is fulfilled by Φ η,T for η = η T and all T ≥ 4. Then, for all \(\varepsilon > 0\) and \(n \geq M\left (T,\varepsilon \right )\) , with ν T - probability at least \(1-\varepsilon\) we have

$$\displaystyle\begin{array}{rcl} R\left (\bar{\,f}_{\eta _{T},T,n}\left (\cdot \left \vert X\right.\right )\right )& \leq & \inf \limits _{\boldsymbol{\theta }\in \varTheta _{T}}R\left (\,f_{\boldsymbol{\theta }}\right ) + \left (\mathcal{E} + \frac{2} {\log 2} + 2\right ) \frac{\log ^{3}T} {T^{1/2}} + \frac{8\log T} {T^{1/2}}\log \left (\frac{1} {\varepsilon } \right )\;, {}\\ \end{array}$$

where \(\mathcal{E}\) is defined in (9) and \(M\left (T,\varepsilon \right ) = A_{\eta _{T},T}^{2}T/(\varepsilon ^{2}\log ^{6}T)\) with A η,T as in (14) .

5 Applications to the Autoregressive Process

We carefully recapitulate all the assumptions of Theorem 2 in the context of an autoregressive process. After that, we illustrate numerically the behaviour of the proposed method.

5.1 Theoretical Considerations

Consider a real-valued stable autoregressive process of finite order d as defined by (1), with parameter \(\boldsymbol{\theta }\) lying in the interior of \(s_{d}\left (\delta \right )\) and standard normally distributed innovations (Assumptions (M) and (I) hold). With the loss function \(\ell\left (\,y,z\right ) = \left \vert y - z\right \vert \) Assumption (L) holds as well. Linear predictors form the set that we test; they meet Assumption (P-3). Without loss of generality assume that d T  = n T . In the described framework we have \(\hat{\,f}_{\eta,T}\left (\cdot \left \vert X\right.\right ) = f_{\widehat{\boldsymbol{\theta }}_{\eta,T}\left (X\right )}\), where

$$\displaystyle{\hat{\boldsymbol{\theta }}_{\eta,T}\left (X\right ) =\int \limits _{\varTheta _{T}}\boldsymbol{\theta } \frac{\exp \left (-\eta r_{T}\left (\boldsymbol{\theta }\left \vert X\right.\right )\right )} {\pi _{T}\left [\exp \left (-\eta r_{T}\left (\boldsymbol{\theta }\left \vert X\right.\right )\right )\right ]}\pi _{T}\left (\mathrm{d}\boldsymbol{\theta }\right )\,.}$$

This \(\hat{\boldsymbol{\theta }}_{\eta,T}\left (X\right ) \in \mathbb{R}^{d_{T}}\) is known as the Gibbs estimator.

Remark that, by (2) and the normality of the innovations, the risk of any \(\hat{\boldsymbol{\theta }}\in \mathbb{R}^{d_{T}}\) is computed as the absolute moment of a centered Gaussian, namely

$$\displaystyle\begin{array}{rcl} R\left (\,f_{\hat{\boldsymbol{\theta }}}\right ) = R\left (\hat{\boldsymbol{\theta }}\right ) = \frac{\left (2\left (\hat{\boldsymbol{\theta }}-\boldsymbol{\theta }\right )'\varGamma _{T}\left (\hat{\boldsymbol{\theta }}-\boldsymbol{\theta }\right ) + 2\sigma ^{2}\right )^{1/2}} {\pi ^{1/2}} \;,& &{}\end{array}$$
(15)

where \(\varGamma _{T} = (\gamma _{i,j})_{0\leq i,j\leq d_{T}-1}\) is the covariance matrix of the process. In (15) the vector \(\boldsymbol{\theta }\), originally in \(\mathbb{R}^{d}\), is completed by \(d_{T} - d\) zeros.
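
Equation (15) translates directly into code; a small helper of our own (taking the covariance matrix Γ T and σ as inputs) reads:

```python
import numpy as np

def prediction_risk(theta_hat, theta_true, Gamma, sigma):
    """Risk (15) of the linear predictor f_{theta_hat} under the absolute loss:
    the first absolute moment of the centred Gaussian prediction error."""
    diff = np.asarray(theta_hat) - np.asarray(theta_true)    # both padded to dimension d_T
    var = diff @ Gamma @ diff + sigma ** 2                   # variance of X_t - hat X_t^theta_hat
    return np.sqrt(2.0 * var / np.pi)
```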

In this context \(\arg \inf _{\boldsymbol{\theta }\in \mathbb{R}^{\mathbb{N}^{{\ast}}}}R\left (\boldsymbol{\theta }\right ) \in s_{d}(1)\) is the true parameter \(\boldsymbol{\theta }\) generating the process. Let us verify Assumption (P-4) by choosing Θ T and π T conveniently. Let \(\varDelta _{d{\ast}} > 0\) be such that \(B\left (\boldsymbol{\theta },\varDelta _{d{\ast}}\right ) \subseteq s_{d}(1)\).

We express \(\varTheta _{T} =\bigcup _{ k=1}^{d_{T}}\varTheta _{k,T}\) where \(\boldsymbol{\theta }\in \varTheta _{k,T}\) if and only if \(d\left (\boldsymbol{\theta }\right ) = k\). It is interesting to set Θ k, T as the part of the stability domain of an AR(k) process satisfying Assumptions (P-3) and (P-4). Consider \(\varTheta _{1,T} = s_{1}(1) \times \{ 0\}^{d_{T}-1} \cap B_{1}\left (\boldsymbol{0},\log T - 1\right )\) and \(\varTheta _{k,T} = s_{k}(1) \times \{ 0\}^{d_{T}-k} \cap B_{1}\left (\boldsymbol{0},\log T - 1\right )\setminus \varTheta _{k-1,T}\) for k ≥ 2. Assume moreover that \(d_{T} = \lfloor \log ^{\gamma }T\rfloor \).

We write \(\pi _{T} =\sum _{ k=1}^{d_{T}}c_{k,T}\pi _{k,T}\) where for all k, c k, T π k, T is the restriction of π T to Θ k, T with c k, T a real non negative number and π k, T a probability measure on Θ k, T . In this setup \(c_{k,T} =\pi _{T}\left [\varTheta _{k,T}\right ]\) and \(\pi _{k,T}\left [A \cap \varTheta _{k,T}\right ] =\pi _{T}\left [A \cap \varTheta _{k,T}\right ]/c_{k,T}\) if c k, T  > 0 and \(\pi _{k,T}\left [A \cap \varTheta _{k,T}\right ] = 0\) otherwise. The vector \(\left [\begin{array}{*{10}c} c_{1,T}&\ldots &c_{d_{T},T} \end{array} \right ]\) could be interpreted as a prior on the model order. Set \(c_{k,T} = c_{k}/(\sum _{i=1}^{d_{T}}c_{i})\) where c k  > 0 is the k-th term of a convergent series (\(\sum _{k=1}^{\infty }c_{k} = c^{{\ast}} < \infty \)).

The distribution π k, T is inferred from some transformations explained below. Observe first that if a ≤ b we have \(s_{k}(a) \subseteq s_{k}(b)\). If \(\boldsymbol{\theta }\in s_{k}(1)\) then \(\left [\begin{array}{*{10}c} \lambda \theta _{1} & \ldots & \lambda ^{k}\theta _{k} \end{array} \right ]^{{\prime}} \in s_{k}(1)\) for any λ ∈ (−1, 1). Let us set

$$\displaystyle{\lambda _{T}(\boldsymbol{\theta }) =\min \left \{1, \frac{\log T - 1} {\|\boldsymbol{\theta }\|_{1}} \right \}\;.}$$

We define \(F_{k,T}(\boldsymbol{\theta }) = \left [\begin{array}{*{10}c} \lambda _{T}(\boldsymbol{\theta })\theta _{1} & \ldots & \lambda _{T}^{k}(\boldsymbol{\theta })\theta _{k}&0&\ldots &0 \end{array} \right ]^{{\prime}}\in \mathbb{R}^{d_{T}}\). Remark that for any \(\boldsymbol{\theta }\in s_{k}(1)\), \(\|F_{k,T}(\boldsymbol{\theta })\|_{1} \leq \lambda _{T}(\boldsymbol{\theta })\|\boldsymbol{\theta }\|_{1} \leq \log T - 1\). This gives us an idea to generate vectors in Θ k, T . Our distribution π k, T is deduced from:

Algorithm 1 π k, T generation
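
The body of Algorithm 1 is not reproduced here; the following sketch (our own code, not the authors') conveys the idea: map reflection coefficients in (−1,1)^k to a point of \(s_{k}\left (1\right )\) through a Levinson-Durbin type recursion and then apply the shrinkage \(F_{k,T}\). Drawing the reflection coefficients uniformly is only an illustration; the exact distribution yielding the uniform law on \(s_{k}\left (1\right )\) is the one described in [6].

```python
import numpy as np

rng = np.random.default_rng(1)

def pacf_to_ar(kappa):
    """Map reflection coefficients kappa in (-1,1)^k to AR coefficients in s_k(1)
    via the step-up (Levinson-Durbin type) recursion."""
    phi = np.array([kappa[0]])
    for m in range(1, len(kappa)):
        phi = np.concatenate([phi - kappa[m] * phi[::-1], [kappa[m]]])
    return phi

def draw_pi_kT(k, d_T, log_T):
    """Illustrative draw from pi_{k,T}: a random point of s_k(1), shrunk through
    F_{k,T} so that its 1-norm is at most log(T) - 1, padded with zeros to R^{d_T}."""
    kappa = rng.uniform(-1.0, 1.0, size=k)          # illustration only, not the exact law of [6]
    theta = pacf_to_ar(kappa)
    lam = min(1.0, (log_T - 1.0) / np.sum(np.abs(theta)))
    theta = theta * lam ** np.arange(1, k + 1)      # F_{k,T}: [lam*theta_1, ..., lam^k*theta_k]
    return np.concatenate([theta, np.zeros(d_T - k)])
```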

The distribution π k, T is lower bounded by the uniform distribution on s k (1).

Given any γ ≥ 1, let \(T_{{\ast}} =\min \{ T:\ d_{T} \geq d\gamma,\ \log T \geq d^{1/2}2^{d}\}\). Since \(s_{k}(1) \subseteq B(\boldsymbol{0},2^{k}\ -\ 1)\) (see [19, Lemma 1]) and \(k^{1/2}\|\boldsymbol{\theta }\| \geq \|\boldsymbol{\theta }\|_{1}\) for any \(\boldsymbol{\theta }\in \mathbb{R}^{k}\), the constraint \(\|\boldsymbol{\theta }\|_{1} \leq \log T - 1\) becomes redundant in Θ k, T for 1 ≤ k ≤ d and \(T \geq T_{{\ast}}\), i.e. \(\varTheta _{1,T} = s_{1}(1) \times \{ 0\}^{d_{T}-1}\) and \(\varTheta _{k,T} = s_{k}(1) \times \{ 0\}^{d_{T}-k}\setminus \varTheta _{k-1,T}\) for 2 ≤ k ≤ d. We define the sequence of Assumption (P-4) as \(\boldsymbol{\theta }_{T} =\boldsymbol{ 0}\) for \(T < T_{{\ast}}\) and \(\boldsymbol{\theta }_{T} =\arg \inf _{\boldsymbol{\theta }\in \varTheta _{T}}R(\boldsymbol{\theta })\) for \(T \geq T_{{\ast}}\). Remark that the first d components of \(\boldsymbol{\theta }_{T}\) are constant for \(T \geq T_{{\ast}}\) (they correspond to the \(\boldsymbol{\theta }\in \mathbb{R}^{d}\) generating the AR(d) process), and the last \(d_{T} - d\) are zero. Let \(\varDelta _{1{\ast}} = 2\log 2 - 1\). Then, we have for \(T < T_{{\ast}}\) and all \(\varDelta \in [0,\varDelta _{1{\ast}}]\)

$$\displaystyle{\pi _{T}\left [B\left (\boldsymbol{\theta }_{T},\varDelta \right ) \cap \varTheta _{T}\right ] \geq c_{1,T}\pi _{1,T}\left [B\left (\boldsymbol{0},\varDelta \right ) \cap s_{1}(1) \times \{ 0\}^{d_{T}-1}\right ] \geq \frac{c_{1}} {c^{{\ast}}}\varDelta \;.}$$

Furthermore, for \(T \geq T_{{\ast}}\) and \(\varDelta \in [0,\varDelta _{d{\ast}}]\)

$$\displaystyle{\pi _{T}\left [B\left (\boldsymbol{\theta }_{T},\varDelta \right ) \cap \varTheta _{T}\right ] \geq c_{d,T}\pi _{d,T}\left [B\left (\boldsymbol{\theta }_{T},\varDelta \right ) \cap s_{d}(1) \times \{ 0\}^{d_{T}-d}\right ] \geq \frac{c_{d}} {2^{d^{2} }c^{{\ast}}}\varDelta ^{d}\;.}$$

Assumption (P-4) is then fulfilled for any γ ≥ 1 with

$$\displaystyle\begin{array}{rcl} \mathcal{C}_{1}& =& \max \left \{0,(R\left (0\right ) -\inf _{\boldsymbol{\theta }\in \varTheta _{T}}R\left (\boldsymbol{\theta }\right ))T^{1/2}\log ^{-3}T,4 \leq T < T_{ {\ast}}\right \} {}\\ \mathcal{C}_{2}& =& \min \left \{1, \frac{c_{1}} {c^{{\ast}}}, \frac{c_{d}} {2^{d^{2} }c^{{\ast}}}\right \} {}\\ \mathcal{C}_{3}& =& \min \left \{1,4\varDelta _{1{\ast}},T_{{\ast}}\varDelta _{d{\ast}}\right \}\;. {}\\ \end{array}$$

Let q η, T be the constant function 1; this means that the proposal has distribution π T . Let us bound the ratio in (11).

$$\displaystyle\begin{array}{rcl} \beta _{\eta,T}\left (X\right ) =\inf \limits _{\boldsymbol{\theta }\in \varTheta _{T}} \frac{q_{\eta,T}\left (\boldsymbol{\theta }\right )} {\pi _{\eta,T}\left (\boldsymbol{\theta }\left \vert X\right.\right )}& =& \inf \limits _{\boldsymbol{\theta }\in \varTheta _{T}}\frac{\sum \limits _{k=1}^{d_{T} }c_{k,T}\int \limits _{\varTheta _{k,T}}\exp \left (-\eta r_{T}\left (z\left \vert X\right.\right )\right )\pi _{k,T}\left (\mathrm{d}\boldsymbol{z}\right )} {\exp \left (-\eta r_{T}\left (\boldsymbol{\theta }\left \vert X\right.\right )\right )} \\ & \geq & \sum \limits _{k=1}^{d_{T} }c_{k,T}\int \limits _{\varTheta _{k,T}}\exp \left (-\eta r_{T}\left (z\left \vert X\right.\right )\right )\pi _{k,T}\left (\mathrm{d}\boldsymbol{z}\right ) > 0\;.\quad {}\end{array}$$
(16)

Now note that

$$\displaystyle\begin{array}{rcl} \left \vert x_{t} - f_{\boldsymbol{\theta }}\left (\left (x_{t-i}\right )_{i\geq 1}\right )\right \vert \leq \left \vert x_{t}\right \vert +\sum \limits _{ j=1}^{d\left (\boldsymbol{\theta }\right )}\left \vert \theta _{ j}\right \vert \left \vert x_{t-j}\right \vert \leq \log T\max \limits _{j=0,\ldots,d\left (\boldsymbol{\theta }\right )}\left \vert x_{t-j}\right \vert \;.& &{}\end{array}$$
(17)

Plugging the bound (17) into (16) with η = η T ,

$$\displaystyle{\beta _{\eta _{T},T}\left (\boldsymbol{x}\right ) \geq \sum \limits _{k=1}^{d_{T} }c_{k}\int \limits _{\varTheta _{k}}\exp \left (-\eta _{T}r_{T}\left (z\left \vert \boldsymbol{x}\right.\right )\right )\pi _{k}\left (\mathrm{d}\boldsymbol{z}\right ) \geq \exp \left (-\frac{T^{1/2}} {4} \max \limits _{j=0,\ldots,d_{T}}\left \vert x_{t-j}\right \vert \right )\;,}$$

we deduce that

$$\displaystyle\begin{array}{rcl} \frac{1} {\beta _{\eta _{T},T}\left (\boldsymbol{x}\right )} \leq \sum \limits _{k=0}^{d_{T} }\exp \left (\frac{T^{1/2}\left \vert x_{t-k}\right \vert } {4} \right )\;.& &{}\end{array}$$
(18)

Taking (18) into account, setting γ = 1 (thus \(d_{T} = \lfloor \log T\rfloor \)), using Assumption (P-3) and the fact that K = 1, and applying the Cauchy-Schwarz inequality, we get

$$\displaystyle\begin{array}{rcl} A_{\eta _{T},T}& =& 3K\int \limits _{\mathcal{X}^{\mathbb{Z}}} \frac{1} {\beta _{\eta _{T},T}\left (\boldsymbol{x}\right )}\int \limits _{\mathcal{X}^{\mathbb{Z}}}\sup \limits _{\boldsymbol{\theta }\in \varTheta _{T}}\left \vert \,f_{\boldsymbol{\theta }}\left (\boldsymbol{y}\right ) - f_{\hat{\boldsymbol{\theta }}_{\eta _{ T},T}\left (\boldsymbol{x}\right )}\left (\boldsymbol{y}\right )\right \vert \pi _{0}\left (\mathrm{d}\boldsymbol{y}\right )\pi _{0}\left (\mathrm{d}\boldsymbol{x}\right ) {}\\ & \leq & 3\left (d_{T} + 1\right )d_{T}^{1/2}\pi _{ 0}\left [\exp \left (\frac{T^{1/2}\left \vert X_{1}\right \vert } {4} \right )\right ]\pi _{0}\left [\left \vert X_{1}\right \vert \right ]\sup \limits _{\boldsymbol{\theta }\in \varTheta _{T}}\left \vert \left \vert \boldsymbol{\theta }\right \vert \right \vert {}\\ &\leq & 6\log ^{3/2}T\pi _{ 0}\left [\exp \left (\frac{T^{1/2}\left \vert X_{1}\right \vert } {4} \right )\right ]\pi _{0}\left [\left \vert X_{1}\right \vert \right ]\,. {}\\ \end{array}$$

As X 1 is centered and normally distributed of variance γ 0, \(\pi _{0}\left [\left \vert X_{1}\right \vert \right ] = \left (2\gamma _{0}/\pi \right )^{1/2}\) and \(\pi _{0}[\exp (T^{1/2}\left \vert X_{1}\right \vert /4)] =\gamma _{0}T^{1/2}\exp (\gamma _{0}T/32)/4\).

For \(n \geq M^{{\ast}}\left (T,\varepsilon \right ) = 9\gamma _{0}^{3}T^{2}\exp \left (\gamma _{0}T/16\right )/(2\pi \varepsilon ^{2}\log ^{3}T)\) the conclusion of Theorem 2 is reached. This bound on \(M\left (T,\varepsilon \right )\) is prohibitive from a computational viewpoint, which is why we limit the number of iterations to a fixed \(n^{{\ast}}\).

What we obtain from MCMC is \(\bar{\,f}_{\eta _{T},T,n}\left (\,\boldsymbol{y}\left \vert X\right.\right ) =\bar{\boldsymbol{\theta }} '_{\eta _{T},T,n}\left (X\right )\boldsymbol{y}_{1:d_{T}}\) with \(\bar{\boldsymbol{\theta }}_{\eta _{T},T,n}\left (X\right ) =\sum _{ i=0}^{n-1}\boldsymbol{\theta }_{\eta _{T},T,i}\left (X\right )/n\). Remark that \(\bar{\,f}_{\eta _{T},T,n}\left (\cdot \left \vert X\right.\right ) = f_{\bar{\boldsymbol{\theta }}_{\eta _{ T},T,n}\left (X\right )}\). The risk is expressed as

$$\displaystyle{R\left (\bar{\,f}_{\eta _{T},T,n}\left (\cdot \left \vert X\right.\right )\right ) = \frac{\left (2\left (\bar{\boldsymbol{\theta }}_{\eta _{T},T,n}\left (X\right )-\boldsymbol{\theta }\right )'\varGamma \left (Y \right )\left (\bar{\boldsymbol{\theta }}_{\eta _{T},T,n}\left (X\right )-\boldsymbol{\theta }\right ) + 2\sigma ^{2}\right )^{1/2}} {\pi ^{1/2}} \;.}$$

5.2 Numerical Work

Consider 100 realisations of an autoregressive process X simulated with the same \(\boldsymbol{\theta }\in s_{d}\left (\delta \right )\) for d = 8 and δ = 3∕4, and with σ = 1. Let \(\boldsymbol{c}^{(i)}\), i = 1, 2, be the sequences defining two different priors on the model order:

  1. \(c_{k}^{(1)} = k^{-2}\), the sparsity is favoured,

  2. \(c_{k}^{(2)} = \mathrm{e}^{-k}\), the sparsity is strongly favoured.

For each sequence \(\boldsymbol{c}^{(i)}\) and for each value of \(T \in \{ 2^{j},\ j = 6,\ldots,12\}\) we compute \(\bar{\boldsymbol{\theta }}_{\eta _{T},T,n^{{\ast}}}\), the MCMC approximation of the Gibbs estimator, using Algorithm 2 with η = η T .

Algorithm 2 Independent Hastings Sampler

The acceptance rate is computed as \(\alpha _{\eta,T,X}(\boldsymbol{\theta }_{1},\boldsymbol{\theta }_{2}) =\exp \left (\eta r_{T}\left (\boldsymbol{\theta }_{1}\left \vert X\right.\right ) -\eta r_{T}\left (\boldsymbol{\theta }_{2}\left \vert X\right.\right )\right )\).
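
A compact sketch of the whole sampler follows (our own code; prior_draw and empirical_risk are hypothetical helpers, for instance the ones sketched earlier, with prior_draw mixing over the order k according to the weights c k, T ).

```python
import numpy as np

rng = np.random.default_rng(2)

def independent_hastings(x, eta, n_iter, prior_draw, empirical_risk):
    """Independent Hastings sampler targeting pi_T{-eta r_T(.|X)} with the prior
    pi_T itself as proposal (q_{eta,T} = 1).  Returns the Cesaro average (13),
    i.e. the MCMC approximation of the Gibbs estimator."""
    theta = prior_draw()                      # theta_{eta,T,0} (here drawn from the prior)
    r_cur = empirical_risk(theta, x)
    total = np.array(theta, dtype=float)
    for _ in range(n_iter - 1):
        prop = prior_draw()
        r_prop = empirical_risk(prop, x)
        # accept with probability min{1, alpha_{eta,T,X}(theta, prop)} = min{1, exp(eta*(r_cur - r_prop))}
        if np.log(rng.uniform()) < eta * (r_cur - r_prop):
            theta, r_cur = prop, r_prop
        total += theta
    return total / n_iter                     # (1/n) sum_{i=0}^{n-1} theta_{eta,T,i}
```

With η = η T and n_iter set to \(n^{{\ast}}\), the output plays the role of \(\bar{\boldsymbol{\theta }}_{\eta _{T},T,n^{{\ast}}}\left (X\right )\), whose risk is reported in Fig. 1.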

Algorithm 1, used by the distributions π k, T , generates uniform random vectors on \(s_{k}\left (1\right )\) by the method described in [6]. It relies on the Levinson-Durbin recursion. We also implemented the numerical improvements of [3].

Set \(\varepsilon = 0.1\). Figure 1 displays the \((1-\varepsilon )\)-quantiles over the 100 realisations of \(R(\bar{\boldsymbol{\theta }}_{\eta _{T},T,n^{{\ast}}}\left (X\right )) - (2/\pi )^{1/2}\sigma ^{2}\) for \(\boldsymbol{c}^{(1)}\) and \(\boldsymbol{c}^{(2)}\), using different values of \(n^{{\ast}}\).

Fig. 1

The plots represent the 0.9-quantiles over the realisations of \(R(\bar{\boldsymbol{\theta }}_{\eta _{T},T,n^{{\ast}}}\left (X\right )) - (2/\pi )^{1/2}\sigma ^{2}\) for T = 32, 64, …, 4,096. The graph on the left corresponds to the order prior \(c_{k}^{(1)} = k^{-2}\) while that on the right corresponds to \(c_{k}^{(2)} = \mathrm{e}^{-k}\). The solid curves were plotted with \(n^{{\ast}}\) = 100, the dashed ones with \(n^{{\ast}}\) = 1,000 and, as a reference, the dotted curve is proportional to \(\log ^{3}T/T^{1/2}\).

Note that, for the proposed algorithm, the prediction risk decreases very slowly when the number T of observations grows while the number of MCMC iterations remains constant. If \(n^{{\ast}}\) = 1,000 the decay rate is faster than if \(n^{{\ast}}\) = 100 for smaller values of T. For T ≥ 2,000 we observe that both rates are roughly the same on the logarithmic scale. This behaviour is similar in both cases presented in Fig. 1. As expected, the risk of the approximated predictor does not converge as \(\log ^{3}T/T^{1/2}\).

6 Discussion

There are two sources of error in our method: prediction (of the exact Gibbs predictor) and approximation (using the MCMC). The first one decays as T grows, while the guarantees obtained for the second one explode. We found a possibly pessimistic upper bound for M(T, ε). The exponential growth of this bound is the main weakness of our procedure. The use of a better adapted proposal in the MCMC algorithm needs to be investigated. The Metropolis Langevin Algorithm (see [4]) gives us an insight in this direction. However, it is encouraging to see that, in the analysed practical case, the risk of \(\bar{\,f}_{\eta _{T},T,n^{{\ast}}}\left (\cdot \left \vert X\right.\right )\) does not increase with T.

7 Technical Proofs

7.1 Proof of Theorem 1

The proof of Theorem 1 is based on the same tools used by [2] up to Lemma 3. For the sake of completeness we quote the essential ones.

We denote by \(\mathcal{M}_{+}^{1}\left (F\right )\) the set of probability measures on the measurable space \((F,\mathcal{F})\). For \(\rho,\nu \in \mathcal{M}_{+}^{1}\left (F\right )\), \(\mathcal{K}\left (\rho,\nu \right )\) stands for the Kullback-Leibler divergence of ρ with respect to ν:

$$\displaystyle\begin{array}{rcl} \mathcal{K}\left (\rho,\nu \right )& =& \left \{\begin{array}{ll} \int \log \frac{\mathrm{d}\rho } {\mathrm{d}\nu }\left (\boldsymbol{\theta }\right )\rho \left (\mathrm{d}\boldsymbol{\theta }\right )&\mbox{ if }\rho \ll \nu \;, \\ +\infty &\mbox{ otherwise}\;. \end{array} \right.{}\\ \end{array}$$

The first lemma can be found in [8, Equation 5.2.1].

Lemma 1 (Legendre transform of the Kullback divergence function)

Let \((F,\mathcal{F})\) be any measurable space. For any \(\nu \in \mathcal{M}_{+}^{1}\left (F\right )\) and any measurable function \(h\:\ F\ \rightarrow \ \mathbb{R}\) such that \(\nu \left [\exp \left (h\right )\right ] < \infty \) we have,

$$\displaystyle\begin{array}{rcl} \nu \left [\exp \left (h\right )\right ]& =& \exp \left (\sup \limits _{\rho \in \mathcal{M}_{+}^{1}\left (F\right )}\left (\rho \left [h\right ] -\mathcal{K}\left (\rho,\nu \right )\right )\right )\;, {}\\ \end{array}$$

with the convention ∞−∞ = −∞. Moreover, as soon as h is upper-bounded on the support of ν, the supremum with respect to ρ in the right-hand side is reached by the Gibbs measure \(\nu \left \{h\right \}\) .
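
For completeness, the identity behind Lemma 1 can be checked directly when \(\rho \ll \nu\) (a standard computation, reproduced here only as a reading aid): since \(\log \frac{\mathrm{d}\nu \left \{h\right \}} {\mathrm{d}\nu } = h -\log \nu \left [\exp \left (h\right )\right ]\),

$$\displaystyle\begin{array}{rcl} \rho \left [h\right ] -\mathcal{K}\left (\rho,\nu \right )& =& \rho \left [h\right ] -\rho \left [\log \frac{\mathrm{d}\rho } {\mathrm{d}\nu }\right ] =\log \nu \left [\exp \left (h\right )\right ] +\rho \left [\log \frac{\mathrm{d}\nu \left \{h\right \}} {\mathrm{d}\nu }\right ] -\rho \left [\log \frac{\mathrm{d}\rho } {\mathrm{d}\nu }\right ] {}\\ & =& \log \nu \left [\exp \left (h\right )\right ] -\mathcal{K}\left (\rho,\nu \left \{h\right \}\right )\;, {}\\ \end{array}$$

and the non-negativity of \(\mathcal{K}\left (\rho,\nu \left \{h\right \}\right )\), with equality if and only if \(\rho =\nu \left \{h\right \}\), gives both the value of the supremum and the maximiser.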

For a fixed C > 0, let \(\tilde{\xi }_{t}^{\left (C\right )} =\max \left \{\min \left \{\xi _{t},C\right \},-C\right \}\). Consider \(\tilde{X}_{t} = H(\tilde{\xi }_{t}^{\left (C\right )},\tilde{\xi }_{t-1}^{\left (C\right )},\ldots )\).

Denote \(\tilde{X} = (\tilde{X}_{t})_{t\in \mathbb{Z}}\) and by \(\tilde{R}\left (\boldsymbol{\theta }\right )\) and \(\tilde{r}_{T}\left (\boldsymbol{\theta }\left \vert \tilde{X}\right.\right )\) the exact and empirical risks, respectively, associated with \(\tilde{X}\) at \(\boldsymbol{\theta }\):

$$\displaystyle\begin{array}{rcl} \tilde{R}\left (\boldsymbol{\theta }\right )& =& \mathbb{E}\left [\ell\left (\widehat{\tilde{X}}_{t}^{\boldsymbol{\theta }},\tilde{X}_{ t}\right )\right ]\;, {}\\ \tilde{r}_{T}\left (\boldsymbol{\theta }\left \vert \tilde{X}\right.\right )& =& \frac{1} {T - d\left (\boldsymbol{\theta }\right )}\sum \limits _{t=d\left (\boldsymbol{\theta }\right )+1}^{T}\ell\left (\widehat{\tilde{X}}_{ t}^{\boldsymbol{\theta }},\tilde{X}_{ t}\right )\;, {}\\ \end{array}$$

where \(\widehat{\tilde{X}}_{t}^{\boldsymbol{\theta }} = f_{\boldsymbol{\theta }}((\tilde{X}_{t-i})_{i\geq 1})\).

This thresholding is interesting because truncated CBS are weakly dependent processes (see [2, Section 4.2]).

A Hoeffding type inequality introduced in [20, Theorem 1] provides useful controls on the difference between empirical and exact risks of a truncated process.

Lemma 2 (Laplace transform of the risk)

Let ℓ be a loss function meeting Assumption ( L ) and \(X = \left (X_{t}\right )_{t\in \mathbb{Z}}\) a process satisfying Assumption ( M ). For all T ≥ 2, any \(\left \{\,f_{\boldsymbol{\theta }},\boldsymbol{\theta }\in \varTheta _{T}\right \}\) satisfying Assumption ( P-1 ) with Θ T such that d T , defined in (6), is at most T∕2, any truncation level C > 0, any η ≥ 0 and any \(\boldsymbol{\theta }\in \varTheta _{T}\) we have,

$$\displaystyle\begin{array}{rcl} \mathbb{E}\left [\exp \left (\eta \left (\tilde{R}(\boldsymbol{\theta }) -\tilde{ r}_{T}\left (\boldsymbol{\theta }\left \vert \tilde{X}\right.\right )\right )\right )\right ]& \leq & \exp \left (\frac{4\eta ^{2}k^{2}(T,C)} {T} \right )\;,{}\end{array}$$
(19)

and

$$\displaystyle\begin{array}{rcl} \mathbb{E}\left [\exp \left (\eta \left (\tilde{r}_{T}\left (\boldsymbol{\theta }\left \vert \tilde{X}\right.\right ) -\tilde{ R}(\boldsymbol{\theta })\right )\right )\right ]& \leq & \exp \left (\frac{4\eta ^{2}k^{2}(T,C)} {T} \right )\;,{}\end{array}$$
(20)

where \(k(T,C) = 2^{1/2}CK(1 + L_{T})\left (A_{{\ast}} +\tilde{ A}_{{\ast}}\right )\). The constants \(\tilde{A}_{{\ast}}\) and \(A_{{\ast}}\) are defined in (4) and (5) respectively, and K and L T in Assumptions ( L ) and ( P-1 ) respectively.

The following lemma is a slight modification of [2, Lemma 6.5]. It links the two versions of the empirical risk: original and truncated.

Lemma 3

Suppose that Assumption ( L ) holds for the loss function ℓ, Assumption ( M ) holds for \(X = \left (X_{t}\right )_{t\in \mathbb{Z}}\) and Assumption ( I ) holds for the innovations with \(\zeta = A_{{\ast}}\); \(A_{{\ast}}\) is defined in (5) . For all T ≥ 2, any \(\left \{\,f_{\boldsymbol{\theta }},\boldsymbol{\theta }\in \varTheta _{T}\right \}\) meeting Assumption ( P-1 ) with Θ T such that d T , defined in (6) , is at most T∕2, any truncation level C > 0 and any \(0 \leq \eta \leq T/4\left (1 + L_{T}\right )\) we have,

$$\displaystyle\begin{array}{rcl} \mathbb{E}\left [\exp \left (\eta \sup \limits _{\boldsymbol{\theta }\in \varTheta _{T}}\left \vert r_{T}\left (\boldsymbol{\theta }\left \vert X\right.\right ) -\tilde{ r}_{T}\left (\boldsymbol{\theta }\left \vert \tilde{X}\right.\right )\right \vert \right )\right ]& \leq & \exp \left (\eta \varphi \left (T,C,\eta \right )\right )\;, {}\\ \end{array}$$

where

$$\displaystyle\begin{array}{rcl} \varphi (T,C,\eta )& =& 2K(1 + L_{T})\phi (A_{{\ast}})\left ( \frac{A_{{\ast}}C} {\exp \left (A_{{\ast}}C\right ) - 1} +\eta \frac{4K(1 + L_{T})} {T} \right )\;, {}\\ \end{array}$$

with K and L T defined in Assumptions ( L ) and ( P-1 ) respectively.

Finally we present a result on the aggregated predictor defined in (7). The proof is partially inspired by that of [2, Theorem 3.2].

Lemma 4

Let ℓ be a loss function such that Assumption ( L ) holds and let \(X\ =\ \left (X_{t}\right )_{t\in \mathbb{Z}}\) be a process satisfying Assumption ( M ) with probability distribution π 0 . For each T ≥ 2 let \(\left \{\,f_{\boldsymbol{\theta }},\boldsymbol{\theta }\in \varTheta _{T}\right \}\) be a set of predictors and \(\pi _{T} \in \mathcal{M}_{+}^{1}\left (\varTheta _{T}\right )\) any prior probability distribution on Θ T . We build the predictor \(\hat{\,f}_{\eta,T}\left (\cdot \left \vert X\right.\right )\) following (7) with any η > 0. For any \(\varepsilon > 0\) and any truncation level C > 0, with π 0 -probability at least \(1-\varepsilon\) we have,

$$\displaystyle\begin{array}{rcl} & & R\left (\hat{\,f}_{\eta,T}\left (\cdot \left \vert X\right.\right )\right ) \leq \inf \limits _{\rho \in \mathcal{M}_{+}^{1}\left (\varTheta _{T}\right )}\left \{\rho \left [R\right ] + \frac{2\mathcal{K}\left (\rho,\pi _{T}\right )} {\eta } \right \} + \frac{2\log \left (\frac{2} {\varepsilon } \right )} {\eta } {}\\ & & \qquad \quad + \frac{1} {2\eta }\log \left (\mathbb{E}\left [\exp \left (2\eta \left (\tilde{R} -\tilde{ r}_{T}\right )\right )\right ]\right ) + \frac{1} {2\eta }\log \left (\mathbb{E}\left [\exp \left (2\eta \left (\tilde{r}_{T} -\tilde{ R}\right )\right )\right ]\right ) {}\\ & & \qquad \qquad \qquad \qquad \quad + \frac{2} {\eta } \log \left (\mathbb{E}\left [\exp \left (2\eta \sup _{\boldsymbol{\theta }\in \varTheta _{T}}\left \vert r_{T}\left (\boldsymbol{\theta }\left \vert X\right.\right ) -\tilde{ r}_{T}\left (\boldsymbol{\theta }\left \vert \tilde{X}\right.\right )\right \vert \right )\right ]\right )\;. {}\\ \end{array}$$

Proof

We use Tonelli’s theorem and Jensen’s inequality with the convex function g to obtain an upper bound for \(R\left (\hat{\,f}_{\eta,T}\left (\cdot \left \vert X\right.\right )\right )\)

$$\displaystyle\begin{array}{rcl} & & \ \ R\left (\hat{\,f}_{\eta,T}\left (\cdot \left \vert X\right.\right )\right ) =\int \limits _{\mathcal{X}^{\mathbb{Z}}}g\left (\,\int \limits _{\varTheta _{T}}\left (\,f_{\boldsymbol{\theta }}\left (\left (\,y_{t-i}\right )_{i\geq 1}\right ) - y_{t}\right )\pi _{T}\left \{-\eta r_{T}\left (\cdot \left \vert X\right.\right )\right \}\left (\mathrm{d}\boldsymbol{\theta }\right )\right )\pi _{0}\left (\mathrm{d}\boldsymbol{y}\right ) {}\\ & & \qquad \qquad \qquad \leq \int \limits _{\mathcal{X}^{\mathbb{Z}}}\left [\,\int \limits _{\varTheta _{T}}g\left (\,f_{\boldsymbol{\theta }}\left (\left (\,y_{t-i}\right )_{i\geq 1}\right ) - y_{t}\right )\pi _{T}\left \{-\eta r_{T}\left (\cdot \left \vert X\right.\right )\right \}\left (\mathrm{d}\boldsymbol{\theta }\right )\right ]\pi _{0}\left (\mathrm{d}\boldsymbol{y}\right ) {}\\ & & =\int \limits _{\varTheta _{T}}\left [\,\int \limits _{\,\mathcal{X}^{\mathbb{Z}}}g\left (\,f_{\boldsymbol{\theta }}\left (\left (\,y_{t-i}\right )_{i\geq 1}\right ) - y_{t}\right )\pi _{0}\left (\mathrm{d}\boldsymbol{y}\right )\right ]\pi _{T}\left \{-\eta r_{T}\left (\cdot \left \vert X\right.\right )\right \}\left (\mathrm{d}\boldsymbol{\theta }\right ) =\pi _{T}\left \{-\eta r_{T}\left (\cdot \left \vert X\right.\right )\right \}\left [R\right ]\;. {}\\ \end{array}$$

In the remainder of this proof we seek an upper bound for \(\pi _{T}\left \{-\eta r_{T}\left (\cdot \left \vert X\right.\right )\right \}\left [R\right ]\).

First, we use the relationship:

$$\displaystyle\begin{array}{rcl} R - r_{T}\left (\cdot \left \vert X\right.\right ) = \left (\tilde{R} -\tilde{ r}_{T}\left (\cdot \left \vert \tilde{X}\right.\right )\right ) + \left (R -\tilde{ R}\right ) -\left (r_{T}\left (\cdot \left \vert X\right.\right ) -\tilde{ r}_{T}\left (\cdot \left \vert \tilde{X}\right.\right )\right )\;.& &{}\end{array}$$
(21)

For the sake of simplicity, and where it does not affect clarity, we lighten the notation for r T and \(\tilde{r}_{T}\). We now suppose that, in place of \(\boldsymbol{\theta }\), we have a random variable distributed as \(\pi _{T} \in \mathcal{M}_{+}^{1}\left (\varTheta _{T}\right )\); this is taken into account in the following expectations. The identity (21) and the Cauchy-Schwarz inequality lead to

$$\displaystyle\begin{array}{rcl} & & \ \ \ \mathbb{E}\left [\exp \left ( \frac{\eta } {2}\left (R - r_{T}\right )\right )\right ] = \mathbb{E}\left [\exp \left ( \frac{\eta } {2}\left (\tilde{R} -\tilde{ r}_{T}\right )\right )\exp \left ( \frac{\eta } {2}\left (\left (R -\tilde{ R}\right ) -\left (r_{T} -\tilde{ r}_{T}\right )\right )\right )\right ] \\ & & \quad \qquad \qquad \leq \left (\mathbb{E}\left [\exp \left (\eta \left (\tilde{R} -\tilde{ r}_{T}\right )\right )\right ]\mathbb{E}\left [\exp \left (\eta \left (\left (R -\tilde{ R}\right ) -\left (r_{T} -\tilde{ r}_{T}\right )\right )\right )\right ]\right )^{1/2} \\ & & \leq \left (\mathbb{E}\left [\exp \left (\eta \left (\tilde{R} -\tilde{ r}_{T}\right )\right )\right ]\mathbb{E}\left [\exp \left (\eta \sup \limits _{\boldsymbol{\theta }\in \varTheta _{T}}\left \vert \left (R -\tilde{ R}\right )\left (\boldsymbol{\theta }\right ) -\left (r_{T} -\tilde{ r}_{T}\right )\left (\boldsymbol{\theta }\right )\right \vert \right )\right ]\right )^{1/2}\;.\! {}\end{array}$$
(22)

Observe now that \(R\left (\boldsymbol{\theta }\right ) = \mathbb{E}\left [r_{T}\left (\boldsymbol{\theta }\left \vert X\right.\right )\right ]\) and \(\tilde{R}\left (\boldsymbol{\theta }\right ) = \mathbb{E}[\tilde{r}_{T}(\boldsymbol{\theta }\vert \tilde{X})]\). Jensen’s inequality for the exponential function gives that

$$\displaystyle\begin{array}{rcl} \exp \left (\eta \sup \limits _{\boldsymbol{\theta }\in \varTheta _{T}}\left \vert R\left (\boldsymbol{\theta }\right ) -\tilde{ R}\left (\boldsymbol{\theta }\right )\right \vert \right )& \leq & \exp \left (\eta \mathbb{E}\left [\sup \limits _{\boldsymbol{\theta }\in \varTheta _{T}}\left \vert r_{T}\left (\boldsymbol{\theta }\left \vert X\right.\right ) -\tilde{ r}_{T}\left (\boldsymbol{\theta }\left \vert \tilde{X}\right.\right )\right \vert \right ]\right ) \\ & \leq & \mathbb{E}\left [\exp \left (\eta \sup \limits _{\boldsymbol{\theta }\in \varTheta _{T}}\left \vert r_{T}\left (\boldsymbol{\theta }\left \vert X\right.\right ) -\tilde{ r}_{T}\left (\boldsymbol{\theta }\left \vert \tilde{X}\right.\right )\right \vert \right )\right ]\;.{}\end{array}$$
(23)

From (23) we see that

$$\displaystyle\begin{array}{rcl} & & \mathbb{E}\left [\exp \left (\eta \sup \limits _{\boldsymbol{\theta }\in \varTheta _{T}}\left \vert \left (R -\tilde{ R}\right )\left (\boldsymbol{\theta }\right ) -\left (r_{T} -\tilde{ r}_{T}\right )\left (\boldsymbol{\theta }\right )\right \vert \right )\right ] \\ & & \quad \leq \mathbb{E}\left [\exp \left (\eta \sup \limits _{\boldsymbol{\theta }\in \varTheta _{T}}\left \vert R\left (\boldsymbol{\theta }\right ) -\tilde{ R}\left (\boldsymbol{\theta }\right )\right \vert \right )\exp \left (\eta \sup \limits _{\boldsymbol{\theta }\in \varTheta _{T}}\left \vert r_{T}\left (\boldsymbol{\theta }\left \vert X\right.\right ) -\tilde{ r}_{T}\left (\boldsymbol{\theta }\left \vert \tilde{X}\right.\right )\right \vert \right )\right ] \\ & & \qquad \qquad \qquad \qquad \quad \leq \left (\mathbb{E}\left [\exp \left (\eta \sup \limits _{\boldsymbol{\theta }\in \varTheta _{T}}\left \vert r_{T}\left (\boldsymbol{\theta }\left \vert X\right.\right ) -\tilde{ r}_{T}\left (\boldsymbol{\theta }\left \vert \tilde{X}\right.\right )\right \vert \right )\right ]\right )^{2}\;. {}\end{array}$$
(24)

Combining (22) and (24) we obtain

$$\displaystyle\begin{array}{rcl} & & \mathbb{E}\left [\exp \left ( \frac{\eta } {2}\left (R - r_{T}\left (\cdot \left \vert X\right.\right )\right )\right )\right ] \leq \left (\mathbb{E}\left [\exp \left (\eta \left (\tilde{R} -\tilde{ r}_{T}\right )\right )\right ]\right )^{1/2} \\ & & \qquad \qquad \qquad \qquad \qquad \qquad \mathbb{E}\left [\exp \left (\eta \sup \limits _{\boldsymbol{\theta }\in \varTheta _{T}}\left \vert r_{T}\left (\boldsymbol{\theta }\left \vert X\right.\right ) -\tilde{ r}_{T}\left (\boldsymbol{\theta }\left \vert \tilde{X}\right.\right )\right \vert \right )\right ]\;. {}\end{array}$$
(25)

Let \(L_{\eta,T,C} =\log ((\mathbb{E}[\exp (\eta (\tilde{R} -\tilde{ r}_{T}))])^{1/2}\mathbb{E}[\exp (\eta \sup _{\boldsymbol{\theta }\in \varTheta _{T}}\vert r_{T}(\boldsymbol{\theta }\vert X) -\tilde{ r}_{T}(\boldsymbol{\theta }\vert \tilde{X})\vert )])\). Note that the left-hand side of (25) is the integral, with respect to the measure \(\pi _{0} \times \pi _{T}\), of the expression inside the expectation. Replacing \(\eta\) with \(2\eta\) and applying Lemma 1, we get

$$\displaystyle\begin{array}{rcl} \pi _{0}\left [\exp \left (\sup \limits _{\rho \in \mathcal{M}_{+}^{1}\left (\varTheta _{T}\right )}\left (\eta \rho [R - r_{T}\left (\cdot \left \vert X\right.\right )] -\mathcal{K}\left (\rho,\pi _{T}\right )\right )\right )\right ] \leq \exp \left (L_{2\eta,T,C}\right )\;.& & {}\\ \end{array}$$
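Recall, for clarity, the classical duality formula for the Kullback–Leibler divergence, which is (presumably in this form) what Lemma 1 provides: for any \(\pi \in \mathcal{M}_{+}^{1}\left (\varTheta _{T}\right )\) and any measurable function \(h\) with \(\pi \left [\exp \left (h\right )\right ] < \infty \),

$$\displaystyle{\log \left (\pi \left [\exp \left (h\right )\right ]\right ) =\sup \limits _{\rho \in \mathcal{M}_{+}^{1}\left (\varTheta _{T}\right )}\left \{\rho \left [h\right ] -\mathcal{K}\left (\rho,\pi \right )\right \}\;.}$$

Applied with \(\pi =\pi _{T}\) and \(h =\eta \left (R - r_{T}\left (\cdot \left \vert X\right.\right )\right )\) for each fixed \(X\), and then integrated with respect to \(\pi _{0}\), it turns the bound (25) with \(\eta\) replaced by \(2\eta\) into the display above.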

Markov’s inequality implies that for all \(\varepsilon > 0\), with \(\pi _{0}\)-probability at least \(1-\varepsilon\)

$$\displaystyle\begin{array}{rcl} \sup \limits _{\rho \in \mathcal{M}_{+}^{1}\left (\varTheta _{T}\right )}\left (\eta \rho \left [R - r_{T}\left (\cdot \left \vert X\right.\right )\right ] -\mathcal{K}\left (\rho,\pi _{T}\right )\right ) -\log \left (\frac{1} {\varepsilon } \right ) - L_{2\eta,T,C} \leq 0\;.& & {}\\ \end{array}$$

Hence, for any \(\pi _{T} \in \mathcal{M}_{+}^{1}\left (\varTheta _{T}\right )\) and η > 0, with \(\pi _{0}\)-probability at least \(1-\varepsilon\), for all \(\rho \in \mathcal{M}_{+}^{1}\left (\varTheta _{T}\right )\)

$$\displaystyle\begin{array}{rcl} \rho \left [R - r_{T}\left (\cdot \left \vert X\right.\right )\right ] -\frac{1} {\eta } \mathcal{K}\left (\rho,\pi _{T}\right ) -\frac{1} {\eta } \log \left (\frac{1} {\varepsilon } \right ) -\frac{L_{2\eta,T,C}} {\eta } & \leq & 0\;.{}\end{array}$$
(26)

By setting \(\rho =\pi _{T}\{ -\eta r_{T}\left (\cdot \left \vert X\right.\right )\}\) and relying on Lemma 1, we have

$$\displaystyle\begin{array}{rcl} \mathcal{K}\left (\pi _{T}\left \{-\eta r_{T}\right \},\pi _{T}\right )& =& \pi _{T}\left \{-\eta r_{T}\right \}\left [\log \frac{\mathrm{d}\pi _{T}\left \{-\eta r_{T}\right \}} {\mathrm{d}\pi _{T}} \right ] =\pi _{T}\left \{-\eta r_{T}\right \}\left [\log \frac{\exp \left (-\eta r_{T}\right )} {\pi _{T}\left [\exp \left (-\eta r_{T}\right )\right ]}\right ] {}\\ & =& \pi _{T}\left \{-\eta r_{T}\right \}\left [-\eta r_{T}\right ] -\log \left (\pi _{T}\left [\exp \left (-\eta r_{T}\right )\right ]\right ) {}\\ & =& \pi _{T}\left \{-\eta r_{T}\right \}\left [-\eta r_{T}\right ] +\inf \limits _{\rho \in \mathcal{M}_{+}^{1}\left (\varTheta _{T}\right )}\left \{\rho \left [\eta r_{T}\right ] + \mathcal{K}\left (\rho,\pi _{T}\right )\right \} {}\\ & & {}\\ \end{array}$$

Using (26) with \(\rho =\pi _{T}\{ -\eta r_{T}\left (\cdot \left \vert X\right.\right )\}\), and noting that, by the display above, \(\pi _{T}\left \{-\eta r_{T}\right \}\left [r_{T}\right ] + \mathcal{K}\left (\pi _{T}\left \{-\eta r_{T}\right \},\pi _{T}\right )/\eta =\inf _{\rho \in \mathcal{M}_{+}^{1}\left (\varTheta _{T}\right )}\left \{\rho \left [r_{T}\right ] + \mathcal{K}\left (\rho,\pi _{T}\right )/\eta \right \}\), it follows that, with \(\pi _{0}\)-probability at least \(1-\varepsilon\),

$$\displaystyle\begin{array}{rcl} \pi _{T}\left \{-\eta r_{T}\left (\cdot \left \vert X\right.\right )\right \}\left [R\right ] \leq & \inf \limits _{\rho \in \mathcal{M}_{+}^{1}\left (\varTheta _{T}\right )}\left \{\rho \left [r_{T}\left (\cdot \left \vert X\right.\right )\right ]+\frac{\mathcal{K}\left (\rho,\pi _{T}\right )} {\eta } \right \} + \frac{\log \left (\frac{1} {\varepsilon } \right )} {\eta } + \frac{L_{2\eta,T,C}} {\eta } \;.& {}\\ \end{array}$$

To upper bound ρ[r T (⋅ | X)] we use an upper bound on \(\rho \left [r_{T}(\cdot \vert X) - R\right ]\). We obtain an inequality similar to (26), with \(\rho \left [R - r_{T}(\cdot \vert X)\right ]\) replaced by \(\rho \left [r_{T}(\cdot \vert X) - R\right ]\) and \(L_{\eta,T,C}\) replaced by \(L'_{\eta,T,C} =\log ((\mathbb{E}[\exp (\eta (\tilde{r}_{T} -\tilde{ R}))])^{1/2}\mathbb{E}[\exp (\eta \sup _{\boldsymbol{\theta }\in \varTheta _{T}}\vert r_{T}(\boldsymbol{\theta }\vert X) -\tilde{ r}_{T}(\boldsymbol{\theta }\vert \tilde{X})\vert )])\). This yields a second inequality that holds with \(\pi _{0}\)-probability at least \(1-\varepsilon\). To ensure that both inequalities hold simultaneously with \(\pi _{0}\)-probability at least \(1-\varepsilon\), we apply the previous computations with \(\varepsilon /2\) in place of \(\varepsilon\) and use a union bound; hence,

$$\displaystyle\begin{array}{rcl} & & \pi _{T}\left \{-\eta r_{T}\left (\cdot \left \vert X\right.\right )\right \}\left [R\right ] \leq \inf \limits _{\rho \in \mathcal{M}_{+}^{1}\left (\varTheta _{T}\right )}\left \{\rho \left [R\right ] + \frac{2\mathcal{K}\left (\rho,\pi _{T}\right )} {\eta } \right \} + \frac{2\log \left (\frac{2} {\varepsilon } \right )} {\eta } {}\\ & & \qquad + \frac{1} {2\eta }\log \left (\mathbb{E}\left [\exp \left (2\eta \left (\tilde{R} -\tilde{ r}_{T}\right )\right )\right ]\right ) + \frac{1} {2\eta }\log \left (\mathbb{E}\left [\exp \left (2\eta \left (\tilde{r}_{T} -\tilde{ R}\right )\right )\right ]\right ) {}\\ & & \qquad \qquad \qquad \qquad + \frac{2} {\eta } \log \left (\mathbb{E}\left [\exp \left (2\eta \sup _{\boldsymbol{\theta }\in \varTheta _{T}}\left \vert r_{T}\left (\boldsymbol{\theta }\left \vert X\right.\right ) -\tilde{ r}_{T}\left (\boldsymbol{\theta }\left \vert \tilde{X}\right.\right )\right \vert \right )\right ]\right )\;. {}\\ \end{array}$$

We can now prove Theorem 1.

Proof

Let \(\pi _{0,C}\) denote the distribution on \(\mathcal{X}^{\mathbb{Z}} \times \mathcal{X}^{\mathbb{Z}}\) of the pair \((X,\tilde{X})\). Fubini’s theorem and (19) of Lemma 2 imply that

$$\displaystyle\begin{array}{rcl} & & \mathbb{E}\left [\exp \left (2\eta \left (\tilde{R} -\tilde{ r}_{T}\right )\right )\right ] =\pi _{0,C} \times \pi _{T}\left [\exp \left (2\eta \left (\tilde{R} -\tilde{ r}_{T}\right )\right )\right ] =\pi _{T} \times \pi _{0,C}\left [\exp \left (2\eta \left (\tilde{R} -\tilde{ r}_{T}\right )\right )\right ] \\ & & \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \leq \exp \left (\frac{16\eta ^{2}k^{2}(T,C)} {T} \right )\;. {}\end{array}$$
(27)

Using (20), we analogously get

$$\displaystyle\begin{array}{rcl} \mathbb{E}\left [\exp \left (2\eta \left (\tilde{r}_{T} -\tilde{ R}\right )\right )\right ] \leq \exp \left (\frac{16\eta ^{2}k^{2}(T,C)} {T} \right )\;.& &{}\end{array}$$
(28)

Consider the set of probability measures \(\left \{\rho _{\boldsymbol{\theta }_{ T},\varDelta },T \geq 2,0 \leq \varDelta \leq \varDelta _{T}\right \} \subset \mathcal{M}_{+}^{1}\left (\varTheta _{ T}\right )\), where \(\boldsymbol{\theta }_{T}\) is the parameter defined by Assumption (P-4) and \(\rho _{\boldsymbol{\theta }_{ T},\varDelta }\left (\boldsymbol{\theta }\right ) \propto \pi _{T}\left (\boldsymbol{\theta }\right ) \mathbb{1}_{B\left (\boldsymbol{\theta }_{T},\varDelta \right )\cap \varTheta _{T}}\left (\boldsymbol{\theta }\right )\). Lemma 4, together with Lemma 3, (27) and (28), guarantees that for all \(0 <\eta \leq T/\left (8\left (1 + L_{T}\right )\right )\)

$$\displaystyle\begin{array}{rcl} & & R\left (\hat{\,f}_{\eta,T}\left (\cdot \left \vert X\right.\right )\right ) \leq \inf \limits _{0\leq \varDelta \leq \varDelta _{T}}\left \{\rho _{\boldsymbol{\theta }_{T},\varDelta }\left [R\right ] + \frac{2\mathcal{K}\left (\rho _{\boldsymbol{\theta }_{T},\varDelta },\pi _{T}\right )} {\eta } \right \} + \frac{16\eta k^{2}(T,C)} {T} + \frac{2\log \left (\frac{2} {\varepsilon } \right )} {\eta } + \\ & & \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad 4\varphi (T,C,2\eta )\;. {}\end{array}$$
(29)

Thanks to Assumptions (L) and (P-3), for any T ≥ 2 and \(\boldsymbol{\theta }\in B\left (\boldsymbol{\theta }_{T},\varDelta \right )\)

$$\displaystyle\begin{array}{rcl} R\left (\boldsymbol{\theta }\right ) - R\left (\boldsymbol{\theta }_{T}\right ) \leq K\pi _{0}\left [\left \vert \left \vert \,f_{\boldsymbol{\theta }}\left (\left (Y _{t-i}\right )_{i\geq 1}\right ) - f_{\boldsymbol{\theta }_{T}}\left (\left (Y _{t-i}\right )_{i\geq 1}\right )\right \vert \right \vert \right ] \leq K\mathcal{D}d_{T}^{1/2}\varDelta \;.& &{}\end{array}$$
(30)

For T ≥ 4 Assumption (P-4) gives

$$\displaystyle\begin{array}{rcl} \mathcal{K}\left (\rho _{\boldsymbol{\theta }_{T},\varDelta },\pi _{T}\right ) =\log \left ( \frac{1} {\pi _{T}\left [B\left (\boldsymbol{\theta }_{T},\varDelta \right ) \cap \varTheta _{T}\right ]}\right ) \leq -n_{T}^{1/\gamma }\log \left (\varDelta \right ) -\log \left (\mathcal{C}_{ 2}\right )\;.& &{}\end{array}$$
(31)

Plugging (30) and (31) into (29) and using Assumption (P-4) again, we obtain

$$\displaystyle\begin{array}{rcl} R\left (\hat{\,f}_{\eta,T}\left (\cdot \left \vert X\right.\right )\right )& \leq & R\left (\boldsymbol{\theta }_{T}\right ) +\inf \limits _{0\leq \varDelta \leq \varDelta _{T}}\left \{\mathcal{E}_{1}d_{T}^{1/2}\varDelta -\frac{2n_{T}^{1/\gamma }\log \left (\varDelta \right )} {\eta } \right \} + \frac{\mathcal{E}_{2}\eta \left (1 + L_{T}\right )^{2}C^{2}} {T} \\ & & +\frac{\mathcal{E}_{3}\left (1 + L_{T}\right )C} {\exp \left (A_{{\ast}}C\right ) - 1} + \frac{2\log \left (\frac{2} {\varepsilon } \right ) - 2\log \left (\mathcal{C}_{2}\right )} {\eta } + \frac{\mathcal{E}_{4}\left (1 + L_{T}\right )^{2}\eta } {T} {}\end{array}$$
(32)

where \(\mathcal{E}_{1} = K\mathcal{D}\), \(\mathcal{E}_{2} = 32K^{2}\left (A_{{\ast}} +\tilde{ A}_{{\ast}}\right )^{2}\), \(\mathcal{E}_{3} = 8K\phi (A_{{\ast}})A_{{\ast}}\) and \(\mathcal{E}_{4} = 32K^{2}\phi (A_{{\ast}})\).

We upper bound \(d_{T}\) by \(T/2\) and \(n_{T}\) by \(\log ^{\gamma }T\), and substitute \(\varDelta _{T} = \mathcal{C}_{3}/T\). Since it is difficult to minimize the right-hand side of (32) with respect to \(\eta\) and \(C\) simultaneously, we evaluate these parameters at suitably chosen values to obtain a convenient upper bound.

At fixed \(\varepsilon\), the convergence rate of \(\left [2\log \left (2/\varepsilon \right ) - 2\log \left (\mathcal{C}_{2}\right )\right ]/\eta + \mathcal{E}_{4}\left (1 + L_{T}\right )^{2}\eta /T\) is at best \(\log T/T^{1/2}\), and we attain it by taking \(\eta \propto T^{1/2}/\log T\). Since we also need \(\eta \leq T/\left (8\left (1 + L_{T}\right )\right )\), we set \(\eta =\eta _{T} = T^{1/2}/(4\log T)\).
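The balancing behind this choice is elementary. Writing \(a\) and \(b\) as shorthand (not notation used elsewhere in the paper) for \(2\log \left (2/\varepsilon \right ) - 2\log \left (\mathcal{C}_{2}\right )\) and \(\mathcal{E}_{4}\left (1 + L_{T}\right )^{2}\) respectively, we have for all \(a,b > 0\)

$$\displaystyle{\frac{a} {\eta } + \frac{b\,\eta } {T} \geq 2\left (\frac{ab} {T} \right )^{1/2}\;,\qquad \text{with equality at }\eta = \left (\frac{aT} {b} \right )^{1/2}\;.}$$

If, as the announced rate suggests, \(1 + L_{T}\) is of order \(\log T\), the optimal \(\eta\) is proportional to \(T^{1/2}/\log T\) and the minimal value is of order \(\log T/T^{1/2}\).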

The terms chosen so far are of order \(\log ^{3}T/T^{1/2}\), and taking \(C =\log T/A_{{\ast}}\) preserves this order. Taking into account that \(R\left (\boldsymbol{\theta }_{T}\right ) \leq \inf _{\boldsymbol{\theta }\in \varTheta _{T}}R\left (\boldsymbol{\theta }\right ) + \mathcal{C}_{1}\log ^{3}T/T^{1/2}\), the result follows.
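To make the claim about \(C\) explicit (a quick sketch of the bookkeeping, under the choice \(C =\log T/A_{{\ast}}\) made above), note that \(\exp \left (A_{{\ast}}C\right ) - 1 = T - 1\), so that with \(\eta =\eta _{T}\)

$$\displaystyle{\frac{\mathcal{E}_{2}\,\eta _{T}\left (1 + L_{T}\right )^{2}C^{2}} {T} = \frac{\mathcal{E}_{2}\left (1 + L_{T}\right )^{2}\log T} {4A_{{\ast}}^{2}\,T^{1/2}} \;,\qquad \frac{\mathcal{E}_{3}\left (1 + L_{T}\right )C} {\exp \left (A_{{\ast}}C\right ) - 1} = \frac{\mathcal{E}_{3}\left (1 + L_{T}\right )\log T} {A_{{\ast}}\left (T - 1\right )} \;,}$$

and both right-hand sides are \(O\left (\log ^{3}T/T^{1/2}\right )\) as soon as \(1 + L_{T}\) is of order \(\log T\).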

7.2 Proof of Proposition 1

Since Assumption (L) holds, we get

$$\displaystyle\begin{array}{rcl} \left \vert R\left (\bar{\,f}_{\eta,T,n}\left (\cdot \left \vert X\right.\right )\right ) - R\left (\hat{\,f}_{\eta,T}\left (\cdot \left \vert X\right.\right )\right )\right \vert \leq K\int \limits _{\mathcal{X}^{\mathbb{Z}}}\left \vert \bar{\,f}_{\eta,T,n}\left (\boldsymbol{y}\left \vert X\right.\right ) -\hat{\, f}_{\eta,T}\left (\boldsymbol{y}\left \vert X\right.\right )\right \vert \pi _{0}\left (\mathrm{d}\boldsymbol{y}\right )& & {}\\ \end{array}$$

Observe that the last expression depends on \(X_{1:T}\) and \(\varPhi _{\eta,T}\left (X\right )\). We first bound its expectation and then deduce a bound in probability.

Tonelli’s theorem and Jensen’s inequality lead to

$$\displaystyle\begin{array}{rcl} & & \nu _{\eta,T}\left [\left \vert R\left (\bar{\,f}_{\eta,T,n}\left (\cdot \left \vert X\right.\right )\right ) - R\left (\hat{\,f}_{\eta,T}\left (\cdot \left \vert X\right.\right )\right )\right \vert \right ] \leq \\ & &\quad K\int \limits _{\mathcal{X}^{\mathbb{Z}}}\int \limits _{\mathcal{X}^{\mathbb{Z}}}\left (\,\int \limits _{\varTheta _{T}^{\mathbb{N}}}\left \vert \bar{\,f}_{\eta,T,n}\left (\boldsymbol{y}\left \vert \boldsymbol{x}\right.\right ) -\hat{\, f}_{\eta,T}\left (\boldsymbol{y}\left \vert \boldsymbol{x}\right.\right )\right \vert ^{2}\mu _{ \eta,T}\left (\mathrm{d}\boldsymbol{\phi }\left \vert \boldsymbol{x}\right.\right )\right )^{1/2}\pi _{ 0}\left (\mathrm{d}\boldsymbol{y}\right )\pi _{0}\left (\mathrm{d}\boldsymbol{x}\right )\;.{}\end{array}$$
(33)

We are thus left with upper bounding the expression under the square root. To that end, we use [16, Theorem 3.1], which implies that for any \(\boldsymbol{x}\)

$$\displaystyle\begin{array}{rcl} & & \int \limits _{\varTheta _{T}^{\mathbb{N}}}\left \vert \bar{\,f}_{\eta,T,n}\left (\boldsymbol{y}\left \vert \boldsymbol{x}\right.\right ) -\hat{\, f}_{\eta,T}\left (\boldsymbol{y}\left \vert \boldsymbol{x}\right.\right )\right \vert ^{2}\mu _{ \eta,T}\left (\mathrm{d}\boldsymbol{\phi }\left \vert \boldsymbol{x}\right.\right ) \leq {}\\ & &\qquad \qquad \qquad \sup \limits _{\boldsymbol{\theta }\in \varTheta _{T}}\left (\,f_{\boldsymbol{\theta }}\left (\boldsymbol{y}\right ) -\hat{\, f}_{\eta,T}\left (\boldsymbol{y}\left \vert \boldsymbol{x}\right.\right )\right )^{2}\left ( \frac{4} {\beta _{\eta,T}\left (\boldsymbol{x}\right )} - 3\right )\left ( \frac{1} {n} + \frac{2} {n^{2}\beta _{\eta,T}\left (\boldsymbol{x}\right )}\right )\;. {}\\ \end{array}$$

Plugging this into (33), using that n ≥ 1 and the elementary bound

$$\displaystyle{\left (\left (4 - 3\beta _{\eta,T}\left (\boldsymbol{x}\right )\right )\left (2 +\beta _{\eta,T}\left (\boldsymbol{x}\right )\right )\right )^{1/2} \leq 3\;,}$$

(which holds since \(\left (4 - 3\beta \right )\left (2 +\beta \right ) = 8 - 2\beta - 3\beta ^{2} \leq 8 < 9\) for any \(\beta \geq 0\)), we obtain

$$\displaystyle\begin{array}{rcl} & & \nu _{\eta,T}\left [\left \vert R\left (\bar{\,f}_{\eta,T,n}\left (\cdot \left \vert X\right.\right )\right ) - R\left (\hat{\,f}_{\eta,T}\left (\cdot \left \vert X\right.\right )\right )\right \vert \right ] \leq {}\\ & &\qquad \qquad \qquad \quad \frac{3K} {n^{1/2}}\int \limits _{\mathcal{X}^{\mathbb{Z}}} \frac{1} {\beta _{\eta,T}\left (\boldsymbol{x}\right )}\int \limits _{\mathcal{X}^{\mathbb{Z}}}\sup \limits _{\boldsymbol{\theta }\in \varTheta _{T}}\left \vert \,f_{\boldsymbol{\theta }}\left (\boldsymbol{y}\right ) -\hat{\, f}_{\eta,T}\left (\boldsymbol{y}\left \vert \boldsymbol{x}\right.\right )\right \vert \pi _{0}\left (\mathrm{d}\boldsymbol{y}\right )\pi _{0}\left (\mathrm{d}\boldsymbol{x}\right )\,. {}\\ \end{array}$$

The result follows from Markov’s inequality.
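To make the approximated object concrete, the following is a minimal numerical sketch of the quantities appearing in Proposition 1, under purely illustrative assumptions that are not the paper's setup: a linear AR(\(d\)) predictor in the role of \(f_{\boldsymbol{\theta }}\), a quadratic empirical risk in the role of \(r_{T}\), a standard Gaussian prior in the role of \(\pi _{T}\), and a random-walk Metropolis chain in the role of \(\varPhi _{\eta,T}\).

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative instance only (assumptions for this sketch, not the paper's
# exact framework): AR(d) linear predictors f_theta(y) = theta . (y_{t-1},
# ..., y_{t-d}), quadratic empirical risk in the role of r_T, and a standard
# Gaussian prior in the role of pi_T.
d, T = 2, 200
eta = np.sqrt(T) / (4 * np.log(T))          # eta_T = T^{1/2} / (4 log T)

# Simulate observations from a stable AR(2) process.
X = np.zeros(T)
for t in range(2, T):
    X[t] = 0.5 * X[t - 1] - 0.3 * X[t - 2] + rng.normal(scale=0.5)

def empirical_risk(theta):
    """Average squared prediction error of f_theta on the sample (role of r_T)."""
    preds = np.array([theta @ X[t - d:t][::-1] for t in range(d, T)])
    return np.mean((preds - X[d:T]) ** 2)

def log_target(theta):
    """Log-density, up to a constant, of the Gibbs measure pi_T{-eta r_T}."""
    return -eta * empirical_risk(theta) - 0.5 * theta @ theta  # N(0, I) prior

def mcmc_chain(n, step=0.1):
    """Random-walk Metropolis chain (phi_1, ..., phi_n) targeting the Gibbs measure."""
    theta, lt, chain = np.zeros(d), log_target(np.zeros(d)), []
    for _ in range(n):
        prop = theta + step * rng.normal(size=d)
        lp = log_target(prop)
        if np.log(rng.uniform()) < lp - lt:  # Metropolis acceptance step
            theta, lt = prop, lp
        chain.append(theta.copy())
    return np.array(chain)

# MCMC approximation of the Gibbs prediction of the next observation: the
# empirical mean of f_{phi_i} over the chain, i.e. the role of bar f_{eta,T,n}.
last = X[T - d:T][::-1]                      # most recent d observations, newest first
for n in (100, 1000, 10000):
    phis = mcmc_chain(n)
    print(n, np.mean(phis @ last))           # stabilizes as n grows
```

Proposition 1 quantifies the distance between this Monte Carlo average and the exact Gibbs predictor in terms of \(n\) and \(\beta _{\eta,T}\); the sketch only illustrates the stabilization of the average as \(n\) grows, not the constants involved.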