Abstract
In the Bayes paradigm, given a loss function and an n-sample, we present the construction of a new type of posterior distribution, that extends the classical Bayes one. The loss functions we have in mind are either those derived from the total variation and Hellinger distances or some \({\mathbb {L}}_{j}\)-ones for \(j>1\). We prove that, with a probability close to one, this new posterior distribution concentrates its mass in a neighbourhood (for the chosen loss function) of the law of the data, provided that this law belongs to the support of the prior or, at least, lies close enough to it. We therefore establish that the new posterior distribution enjoys some robustness properties with respect to a possible misspecification of the prior, or more precisely, its support. We also show that the posterior distribution is stable with respect to the equidistribution assumption we started from. Besides, when the model is regular and well-specified and one uses the squared Hellinger loss, we show that our credible regions possess, at least for n sufficiently large, the same ellipsoidal shapes and approximately the same sizes as those we would derive from the classical Bayesian posterior distribution by using the Bernstein–von Mises theorem. Then we use our Bayesian-like approach to solve the following problems. We first consider the estimation of a location parameter or both the location and scale parameters of a density in a nonparametric framework. Then we tackle the problem of estimating a density, with the squared Hellinger loss, in a high-dimensional parametric model under some sparsity conditions on the parameter. Importantly, the results established in this paper are nonasymptotic and provide, as much as possible, bounds with explicit constants.
1 Introduction
Observe n i.i.d. random variables \(X_1,\ldots ,X_n\) with values in a measurable space \((E,{\mathcal {E}})\) and assume that their common distribution \(P^{\star }\) belongs to a family \({\mathscr {M}}\) of candidate probabilities, or at least lies close enough to it in a suitable sense. We consider the problem of estimating \(P^{\star }\) from the observation of \({\varvec{X}}=(X_1,\ldots ,X_n)\) and we evaluate the performance of an estimator with values in \({\mathscr {M}}\) by means of a given loss function \(\ell :{\mathscr {P}}\times {\mathscr {M}}\rightarrow {\mathbb {R}}_{+}\), where \({\mathscr {P}}\) denotes a set of probabilities containing \(P^{\star }\).
Our approach to solve this estimation problem has a Bayesian flavour. We endow \({\mathscr {M}}\) with a \(\sigma \)-algebra \({\mathcal {A}}\) and a probability measure \(\pi \) that plays the same role as the prior in the classical Bayes paradigm. Our aim is to design a posterior distribution \({\widehat{\pi }}_{{\varvec{X}}}\), solely based on \({\varvec{X}}\) and the choice of \(\ell \), that concentrates its mass, with a probability close to one, on an \(\ell \)-ball, namely a set of the form
$$\begin{aligned} {\mathscr {B}}(P^{\star },r)=\left\{ {P\in {\mathscr {M}},\; \ell (P^{\star },P)\leqslant r}\right\} \quad \text {with } r>0. \end{aligned}$$(1)
This means that with a probability close to 1, a point \({{\widehat{P}}}\) which is randomly drawn according to our (random) distribution \({\widehat{\pi }}_{{\varvec{X}}}\) is likely to estimate \(P^{\star }\) with an accuracy (with respect to the chosen loss \(\ell \)) not larger than r. Our objective is to design \({\widehat{\pi }}_{{\varvec{X}}}\) in such a way that this concentration property holds for small values of r and under mild assumptions on \(P^{\star }\) and \({\mathscr {M}}\).
In the literature, many authors have studied the concentration properties of the classical Bayes posterior distribution on Hellinger balls. We refer to the pioneering papers by van der Vaart and his co-authors—see for example Ghosal, Ghosh and van der Vaart [19]. They show that the concentration property around \(P^{\star }\) holds, as n tends to infinity, provided that the prior \(\pi \) puts enough mass on sets of the form \({\mathcal {K}}(P^{\star },{\varepsilon })=\{P\in {\mathscr {M}},\; K(P^{\star },P)<{\varepsilon }\}\) where \({\varepsilon }\) is a positive number and \(K(P^{\star },P)\) the Kullback–Leibler divergence between \(P^{\star }\) and P. This assumption may, however, be quite restrictive even in the favorable situation where \(P^{\star }\) belongs to the model \({\mathscr {M}}\). Such sets may indeed be empty, and the condition therefore unsatisfied, when the probabilities in \({\mathscr {M}}\) are not equivalent. This is for example the case when \({\mathscr {M}}\) is the set of all uniform distributions \(P_{\theta }\) on \([\theta -1/2,\theta +1/2]\), with \(\theta \in {\mathbb {R}}\), although the problem of estimating \(P^{\star }\in {\mathscr {M}}\) in this setting is quite easy, even in the Bayesian paradigm. The assumption appears even more restrictive when the probability \(P^{\star }\) does not belong to \({\mathscr {M}}\), that is when the model is misspecified. For example, if the distributions in \({\mathscr {M}}\) are all equivalent and R is singular with respect to \({\overline{P}}\in {\mathscr {M}}\), \({\mathcal {K}}(P^{\star },{\varepsilon })\) is empty for \(P^{\star }=(1-10^{-10}){\overline{P}} +10^{-10}R\) although \(P^{\star }\) and \({\overline{P}}\in {\mathscr {M}}\) are statistically indistinguishable from any n-sample of realistic size.
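To see how drastic the failure of the Kullback–Leibler neighbourhood condition can be, here is a small numerical analogue of this contamination example (a discrete stand-in of our own, not taken from the paper), with \({\overline{P}}\) uniform on \(\{0,1\}\) and R a point mass at 2:

```python
import math

# Discrete analogue of the contamination example: p_bar is uniform on {0, 1},
# r is a point mass at 2 (hence singular with respect to p_bar), and
# p_star = (1 - eps) * p_bar + eps * r with eps = 1e-10.
eps = 1e-10
support = [0, 1, 2]
p_bar = {0: 0.5, 1: 0.5, 2: 0.0}
r = {0: 0.0, 1: 0.0, 2: 1.0}
p_star = {x: (1 - eps) * p_bar[x] + eps * r[x] for x in support}

# Total variation distance: half the L1 distance between the mass functions.
tv = 0.5 * sum(abs(p_star[x] - p_bar[x]) for x in support)

def kl(p, q):
    """Kullback-Leibler divergence K(p, q): infinite as soon as p charges
    a point that q does not."""
    total = 0.0
    for x in support:
        if p[x] > 0:
            if q[x] == 0:
                return math.inf
            total += p[x] * math.log(p[x] / q[x])
    return total

print(tv)                 # ~1e-10: the two laws are statistically indistinguishable
print(kl(p_star, p_bar))  # inf: the Kullback-Leibler neighbourhood condition fails
```

The two distributions are indistinguishable from any sample of realistic size, yet every Kullback–Leibler neighbourhood of \(P^{\star }\) inside the model is empty.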
Unfortunately, it is in general impossible to get rid of the restrictive conditions we have mentioned above. It is well known that the Bayes posterior distribution can be unstable in case of a misspecification of the model. Examples that illustrate this weakness have been given in Jiang and Tanner [21] and Baraud and Birgé [6] for instance. This instability is due to the fact that the Bayes posterior distribution is based on the log-likelihood function and similar issues are known for the maximum likelihood estimator.
In order to obtain the concentration and stability properties we look for, we replace the log-likelihood function by a more stable one. Substituting another function for the log-likelihood is not new in the literature and leads to what are called quasi-posterior distributions. The resulting estimators, called quasi-Bayesian estimators or Laplace-type estimators, have been studied by various statisticians, among which Chernozhukov and Hong [18] and Bissiri et al. [16] (we also refer to the references therein). These papers, however, do not address the problem of misspecification. In contrast, it is addressed in Jiang and Tanner [21] for the purpose of variable selection in the logistic model. The authors show that the classical Bayesian approach is no longer reliable when the model is slightly misspecified, while their Gibbs posterior distribution performs well and thus offers a much safer alternative. The problem of estimating a high-dimensional parameter \(\theta \in {\mathbb {R}}^{d}\) under a sparsity condition was considered in Atchadé [2]. His quasi-posterior distribution is obtained by replacing the joint density of the data by a more suitable one and by using a specific prior that forces sparsity. He proves that the so-defined posterior distribution contracts around the true parameter \(\theta ^{\star }\) at rate \(\sqrt{(s^{\star }\log d)/n}\) (where \(s^{\star }\) is the number of nonzero coordinates of \(\theta ^{\star }\)) when both d and n tend to infinity. A common feature of the papers cited above lies in their asymptotic nature. This is not the case for Bhattacharya et al. [8], who replaced the likelihood function in the expression of the posterior distribution by the fractional likelihood, that is, a suitable power of the likelihood function.
The authors also consider the situation where the model is possibly misspecified, but their result involves the \(\alpha \)-divergence which, like the Kullback–Leibler divergence, can be infinite even when the true distribution of the data is close to the model in total variation or Hellinger distance.
Baraud and Birgé [6] propose a surrogate for the Bayes posterior distribution, called the \(\rho \)-posterior distribution in reference to the theory of \(\rho \)-estimation developed in Baraud et al. [7] and Baraud and Birgé [5]. In the frequentist paradigm, this theory aims at solving the various problems connected to the instability of the maximum likelihood method. The \(\rho \)-posterior distribution preserves some of the nice features of the classical Bayes one but also possesses the robustness property we are interested in. The authors show that their posterior distribution concentrates on a Hellinger ball around \(P^{\star }\) as soon as the prior puts enough mass around a point which is close enough to \(P^{\star }\). However, their approach applies only to specific dominated models \({\mathscr {M}}=\{P=p\cdot \mu ,\; p\in {\mathcal {M}}\}\). They assume that the family \({\mathcal {M}}\) of densities that defines their model possesses some special combinatorial structure, which is met either when \({\mathcal {M}}\) is finite or when it satisfies some VC-type condition (see their Section 5). As a consequence, the concentration radius they obtain depends not only on the choice of the prior but also on a complexity term linked to this structure. Unlike theirs, our approach makes no such assumptions on \({\mathcal {M}}\), and we are therefore able to get rid of this unpleasant complexity term while retaining a similar dependency with respect to the choice of the prior. Baraud and Birgé's posterior distribution also has the drawback of involving the supremum of an empirical process over the family \({\mathcal {M}}\). Their posterior distribution is therefore difficult to compute in practice, unless \({\mathcal {M}}\) is finite with a reasonable size.
From a more theoretical point of view, it also raises some unpleasant measurability issues for this supremum when the family \({\mathcal {M}}\) is uncountable, which is the typical case. Finally, Baraud and Birgé's approach is restricted to the squared Hellinger loss, while ours applies to many other losses.
Closer to our approach are the aggregation methods and PAC-Bayesian techniques popularized by Olivier Catoni in statistical learning (see Catoni [17]). These techniques have mainly been applied for the purpose of empirical risk minimization and statistical learning (see for example Alquier [1]). Our aim is to extend them into a versatile tool that can solve our Bayes-like estimation problem for various loss functions simultaneously.
The problem of designing a good estimator of \(P^{\star }\) for a given loss function \(\ell \) was tackled in the frequentist paradigm in Baraud [4]. There, the author provides a general framework that enables one to deal with various loss functions of interest, among which the total variation, 1-Wasserstein, Hellinger and \({\mathbb {L}}_{j}\)-losses. His approach relies on the construction of a suitable family of robust tests and lies in the line of the earlier work of Le Cam [22], Birgé [9] and Birgé [11]. The aim of the present paper is to transpose this theory from the frequentist to the Bayesian paradigm. If \(\ell \) is the Kullback–Leibler divergence, our construction recovers the classical Bayes posterior distribution, even though this is not the choice we would recommend, for the reasons we have explained before.
Quite surprisingly, the concentration properties that we establish here require almost no assumption on \({\mathscr {M}}\) and the distribution of the data (apart from independence). They mostly depend on the choices of the prior \(\pi \) and the loss function \(\ell \). For a suitable element P which belongs to the model \({\mathscr {M}}\) and lies close enough to \(P^{\star }\), these concentration properties depend on the smallest radius r above which the log-ratio \(V(P,r)=\log \left[ {\pi ({\mathscr {B}}(P,2r))/\pi ({\mathscr {B}}(P,r))}\right] \) (with \({\mathscr {B}}\) defined in (1)) grows at most linearly with r. This log-ratio was introduced in Birgé [12] for the purpose of analyzing the behaviour of the classical Bayes posterior distribution. In our Bayes-like paradigm, we show that the behaviour of the quantities V(P, r) for \(P\in {\mathscr {M}}\) and \(r>0\) completely encapsulates the complexity of the model \({\mathscr {M}}\). We prove that our posterior distribution \({\widehat{\pi }}_{{\varvec{X}}}\) concentrates on an \(\ell \)-ball centered at \(P^{\star }\) whose radius \(r=r(n)\) is usually of minimax order as n tends to infinity when the model is well-specified. From a nonasymptotic point of view, we show that \({\widehat{\pi }}_{{\varvec{X}}}\) retains its nice concentration properties as long as \(P^{\star }\) remains close enough to an element P in \({\mathscr {M}}\) around which the prior puts enough mass, that is, even in the situation where the model is slightly misspecified. Actually, we establish the stronger result that even when the data are only independent but not i.i.d., the above conclusion remains true for the average \({\overline{P}}^{\star }\) of their marginal distributions in place of \(P^{\star }\). We therefore show that the posterior distribution \({\widehat{\pi }}_{{\varvec{X}}}\) enjoys some robustness with respect to the equidistribution assumption we started from.
The main theorems involve explicit numerical constants as much as possible. We illustrate our results with examples that are deliberately chosen to be as general and simple as possible. Our aim is to give a flavour of the results that can be established with our Bayes-like posterior, avoiding as much as possible the technicalities that would result from the choice of ad hoc priors introduced to solve specific problems. Instead, we wish to discuss the optimality and robustness properties of our construction for solving general parametric and nonparametric estimation problems in the density framework, under assumptions that we wish to be as weak as possible. These posterior distributions will therefore provide a benchmark for comparison with other methods. Their practical implementation will be the subject of future work.
Of special interest is the choice of \(\ell \) given by the total variation distance or the Hellinger one. As we shall see, for such losses the stability of our posterior distribution automatically leads to estimators \({{\widehat{P}}}\sim {\widehat{\pi }}_{{\varvec{X}}}\) that are naturally robust to the presence of outliers or contaminating data among the sample. These results contrast sharply with the instability of the classical Bayes posterior distribution we underlined earlier. Nevertheless, our posterior distribution also shares some similarities with the classical Bayes one. When the model is well-specified and one uses the squared Hellinger loss, we show that the credible regions of our posterior distribution asymptotically possess the same ellipsoidal shapes and approximately the same sizes as the ones we derive from the classical Bayes posterior by means of the Bernstein–von Mises theorem. Establishing an analogue of this theorem for our Bayes-like posterior distribution is, however, beyond the scope of the present paper.
Our paper is organized as follows. We present our statistical setting in Sect. 2. We consider there independent but not necessarily i.i.d. data in order to analyse later on the behaviour of our posterior distribution with respect to a possible departure from equidistribution. The construction of the posterior distribution is described in Sect. 3. In this section, we also show how more classical constructions based on the likelihood or fractional likelihoods are particular cases of ours. We complete this section with some heuristics which, we hope, help in understanding the main ideas of our approach. In particular, we connect the problem of designing robust posterior distributions to that of testing between two disjoint \(\ell \)-balls. Section 4 is devoted to the main theorems. We describe there the concentration properties of our posterior distribution. The applications of these results to classical loss functions are presented in Sect. 5. We put a special emphasis on the cases of the total variation distance and the squared Hellinger loss. In the remaining part of the paper, we only focus on these two losses. In Sect. 6 we highlight some similarities and differences between the classical Bayes posterior and ours for the squared Hellinger loss. In Sect. 7 we explain how our posterior distribution can be used to solve the problem of estimating a density, or a parameter associated with it, in several statistical frameworks of interest. We discuss there how the concentration properties of our posterior distribution deteriorate in the case of a misspecification of the model by the prior. We also consider the problems of estimating a density in a location-scale family and a high-dimensional parameter in a parametric model under a sparsity constraint. We also show how our estimation strategy leads to unusual rates of convergence for estimating a translation parameter in a non-regular statistical model. In Sect. 8, we provide an evaluation of the concentration radius of our posterior distributions in the parametric framework. Finally, Sect. 9 is devoted to the proofs of the main theorems and Sect. 10 to the other proofs.
2 The statistical setting
Let \({\varvec{X}}=(X_1,\ldots ,X_n)\) be an n-tuple of independent random variables with values in a measurable space \((E,{\mathcal {E}})\) and joint distribution \({\textbf{P}}^{\star }=\bigotimes _{i=1}^{n}P_{i}^{\star }\). Even though this might not be true, we pretend that the \(X_{i}\) are i.i.d. and our aim is to estimate their (presumed) common distribution \(P^{\star }\) from the observation of \({\varvec{X}}\). To do so, we introduce a family \({\mathscr {M}}\) that consists of candidate probabilities (or merely finite signed measures in the case of the \({\mathbb {L}}_{j}\)-loss). The reason for considering finite signed measures lies in the fact that statisticians sometimes estimate probability densities by integrable functions that are not necessarily densities but elements of a suitable linear space for instance (think of the case of projection estimators). We endow \({\mathscr {M}}\) with a \(\sigma \)-algebra \({\mathcal {A}}\) and a probability measure \(\pi \), that we call a prior by analogy to the classical Bayesian framework, and we refer to the resulting pair \(({\mathscr {M}},\pi )\) as our model. The model \(({\mathscr {M}},\pi )\) plays here a similar role as in the classical Bayes paradigm. It encapsulates the a priori information that the statistician has on \(P^{\star }\). Nevertheless, we do not assume that \(P^{\star }\), if it ever exists, belongs to \({\mathscr {M}}\) nor that the true marginals \(P_{i}^{\star }\) do. We rather assume that the model \(({\mathscr {M}},\pi )\) is approximately correct in the sense that the average distribution
$$\begin{aligned} {\overline{P}}^{\star }=\frac{1}{n}\sum _{i=1}^{n}P_{i}^{\star } \end{aligned}$$
is close enough to some point P in \({\mathscr {M}}\) around which the prior \(\pi \) puts enough mass. We assume that \({\overline{P}}^{\star }\) belongs to a given set \({\mathscr {P}}\) of probability measures on \((E,{\mathcal {E}})\) and we measure the estimation accuracy by means of a loss function \(\ell :({\mathscr {M}}\cup {\mathscr {P}})\times {\mathscr {M}}\rightarrow {\mathbb {R}}_{+}\) which is not identically 0 in order to avoid trivialities. Even though \(\ell \) may not be a genuine distance in general, we assume that it shares some similar features and we interpret it as if it were. For this reason, we call \(\ell \)-ball (or ball for short) centered at \(P\in {\mathscr {P}}\cup {\mathscr {M}}\) with radius \(r>0\) the subset of \({\mathscr {M}}\)
$$\begin{aligned} {\mathscr {B}}(P,r)=\left\{ {Q\in {\mathscr {M}},\; \ell (P,Q)\leqslant r}\right\} . \end{aligned}$$(2)
Our aim is to build a posterior distribution (or posterior for short) \({\widehat{\pi }}_{{\varvec{X}}}\) on \(({\mathscr {M}},{\mathcal {A}})\), depending on our observation \({\varvec{X}}\), which concentrates with a probability close to 1 on an \(\ell \)-ball of the form \({\mathscr {B}}({\overline{P}}^{\star },r_{n})\), where we wish the value of \(r_{n}>0\) to be small.
2.1 The special case of parametrized models
In many situations we consider statistical models \({\mathscr {M}}=\{P_{\theta },\; \theta \in \Theta \}\) which are parametrized via a one-to-one mapping \(\theta \mapsto P_{\theta }\). When \((\Theta , \mathfrak {B},\nu )\) is a measurable space, we endow \({\mathscr {M}}\) with the \(\sigma \)-algebra \({\mathcal {A}}=\{A,\; \{\theta \in \Theta ,\; P_{\theta }\in A\}\in \mathfrak {B}\}\). This choice possesses several advantages. First, the mapping \(\theta \mapsto P_{\theta }\) is measurable from \((\Theta ,\mathfrak {B})\) to \(({\mathscr {M}},{\mathcal {A}})\) and we may therefore define the prior \(\pi \) on \(({\mathscr {M}},{\mathcal {A}})\) as the image of \(\nu \) by this mapping. Besides, a function F is measurable on \(({\mathscr {M}},{\mathcal {A}})\) if and only if the mapping \(\theta \mapsto F\circ P_{\theta }\) is measurable on \((\Theta ,\mathfrak {B})\). This property makes the measurability of F easier to check in general. In particular, the mapping \(F:P_{\theta }\mapsto \theta \) is measurable on \(({\mathscr {M}},{\mathcal {A}})\) because \(\theta \mapsto F\circ P_{\theta }=\theta \) is measurable on \((\Theta ,\mathfrak {B})\) and we may then define a posterior \(\widehat{\nu }_{{\varvec{X}}}\) on \((\Theta ,\mathfrak {B})\) as the image by F of our posterior \({\widehat{\pi }}_{{\varvec{X}}}\) on \(({\mathscr {M}},{\mathcal {A}})\). By definition of \({\widehat{\nu }}_{{\varvec{X}}}\), for all \(\theta \in \Theta \) and \(r>0\),
$$\begin{aligned} {\widehat{\nu }}_{{\varvec{X}}}\left( {\left\{ {\theta '\in \Theta ,\; \ell (\theta ,\theta ')\leqslant r}\right\} }\right) ={\widehat{\pi }}_{{\varvec{X}}}\left( {{\mathscr {B}}(P_{\theta },r)}\right) , \end{aligned}$$
where \(\ell (\theta ,\theta ')\) denotes, slightly abusively, \(\ell (P_{\theta },P_{\theta '})\) for \(\theta ,\theta '\in \Theta \). The concentration of \({\widehat{\pi }}_{{\varvec{X}}}\) on an \(\ell \)-ball centered at \(P_{\theta }\) with radius \(r>0\) is then equivalent to the concentration of \({\widehat{\nu }}_{{\varvec{X}}}\) on the set \(\{\theta '\in \Theta ,\; \ell (\theta ,\theta ')\leqslant r\}\). Every time we consider a parametrized model, we assume that it is identifiable and implicitly use the construction that we presented above as well as its consequences.
2.2 Notation and conventions
Throughout this paper, we use the following notation and conventions. For \(a,b\in {\mathbb {R}}\), \(a\vee b \) and \(a\wedge b\) denote \(\max \{a,b\}\) and \(\min \{a,b\}\) respectively. For \(x\in {\mathbb {R}}\), \((x)_{+}=x\vee 0\) while \((x)_{-}=(-x)\vee 0\). The Euclidean spaces \({\mathbb {R}}^{k}\) with \(k\geqslant 1\) are equipped with their Borel \(\sigma \)-algebras. The cardinality of a set A is denoted |A| and its complement \({^\textsf{c}}{\!{A}}{}\). In particular, for \(P\in {\mathscr {P}}\cup {\mathscr {M}}\) and \(r>0\), \({^\textsf{c}}{\!{{\mathscr {B}}}}{}(P,r)=\left\{ {Q\in {\mathscr {M}},\; \ell (P,Q)> r}\right\} \). The elements of \({\mathbb {R}}^{k}\) with \(k>1\) are denoted with bold letters, e.g. \({\varvec{x}}=(x_{1},\ldots ,x_{k})\) and \(\varvec{0}=(0,\ldots ,0)\). For \({\varvec{x}}\in {\mathbb {R}}^{k}\), \(|{\varvec{x}}|_{\infty }=\max _{i\in \{1,\ldots ,k\}}|x_{i}|\) while \(\left| {{\varvec{x}}}\right| \) denotes the Euclidean norm of \({\varvec{x}}\). The inner product of \({\mathbb {R}}^{k}\) is denoted by \(\langle \cdot ,\cdot \rangle \) and the closed Euclidean ball centered at \({\varvec{x}}\) with radius \(r\geqslant 0\) by \({\mathcal {B}}({\varvec{x}},r)\). By convention \(\inf _{{\varnothing }}=+\infty \) unless otherwise specified. We write \(f\equiv c\) when a function f is constant and equals c on its domain. For all suitable functions f on \((E^{n},{\mathcal {E}}^{\otimes n})\), \({\mathbb {E}}\left[ {f({\varvec{X}})}\right] \) means \(\int _{E^{n}}fd{\textbf{P}}^{\star }\) while for f on \((E,{\mathcal {E}})\), \({\mathbb {E}}_{S}\left[ {f(X)}\right] \) denotes the integral \(\int _{E}fdS\) with respect to the measure S on \((E,{\mathcal {E}})\).
For \(j\in [1,+\infty )\), we denote by \({\mathscr {L}}_{j}(E,{\mathcal {E}},\mu )\) the set of measurable functions f on \((E,{\mathcal {E}})\) such that \(\left\| {f}\right\| _{j,\mu }=[\int _{E}|f|^{j}d\mu ]^{1/j}<+\infty \), while \(\left\| {f}\right\| _{\infty }=\sup _{x\in E}|f(x)|\) denotes the supremum norm of a function f on E. If \(\pi '\) is a distribution on \(({\mathscr {M}},{\mathcal {A}})\), \(Q\sim \pi '\) means that Q is a random variable with distribution \(\pi '\). Finally, all the measures that we consider are implicitly assumed to be \(\sigma \)-finite.
3 Construction of the posterior distribution
Throughout this section, the model \(({\mathscr {M}},\pi )\) is assumed to be fixed.
3.1 The properties of our loss functions
The construction of the posterior not only depends on the prior \(\pi \) but also on the choice of the loss function. We first assume that \(\ell \) satisfies some basic properties which are described below.
Assumption 1
For all \(S\in {\mathscr {P}}\cup {\mathscr {M}}\), the mapping \(Q\mapsto \ell (S,Q)\), defined on \(({\mathscr {M}},{\mathcal {A}})\) with values in \({\mathbb {R}}_{+}\),
is measurable.
Under such an assumption, \(\ell \)-balls are measurable and the quantities \(\pi ({\mathscr {B}}(P,r))\) for \(P\in {\mathscr {P}}\cup {\mathscr {M}}\) and \(r>0\) are therefore well-defined.
Assumption 2
There exists a positive number \(\tau \) such that, for all \(S\in {\mathscr {P}}\) and \(P,Q\in {\mathscr {M}}\),
$$\begin{aligned} \ell (S,Q)\leqslant \tau \left[ {\ell (S,P)+\ell (P,Q)}\right] \end{aligned}$$(3)
and
$$\begin{aligned} \ell (P,Q)\leqslant \tau \left[ {\ell (S,P)+\ell (S,Q)}\right] . \end{aligned}$$(4)
When \(\ell \) is a genuine distance, inequalities (3) and (4) are satisfied with \(\tau =1\) since they correspond to the triangle inequality. When \(\ell \) is the square of a distance, these inequalities are satisfied with \(\tau =2\).
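For instance, the squared Hellinger distance satisfies both inequalities with \(\tau =2\), as follows from \((a+b)^{2}\leqslant 2a^{2}+2b^{2}\) applied to the triangle inequality for the Hellinger distance itself. A quick numerical sanity check on random discrete distributions (our own sketch, not part of the paper):

```python
import random

random.seed(0)

def hellinger_sq(p, q):
    """Squared Hellinger distance between two discrete distributions:
    h^2(p, q) = (1/2) * sum_x (sqrt(p(x)) - sqrt(q(x)))^2."""
    return 0.5 * sum((px ** 0.5 - qx ** 0.5) ** 2 for px, qx in zip(p, q))

def random_distribution(k):
    w = [random.random() for _ in range(k)]
    s = sum(w)
    return [x / s for x in w]

# Check the tau = 2 triangle-type inequalities on many random triples (S, P, Q).
tau = 2.0
for _ in range(1000):
    s, p, q = (random_distribution(5) for _ in range(3))
    assert hellinger_sq(s, q) <= tau * (hellinger_sq(s, p) + hellinger_sq(p, q)) + 1e-12
    assert hellinger_sq(p, q) <= tau * (hellinger_sq(s, p) + hellinger_sq(s, q)) + 1e-12
print("tau = 2 inequalities hold on all sampled triples")
```

The check cannot fail: squaring the genuine triangle inequality for h and bounding the cross term yields exactly the \(\tau =2\) inequalities.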
Importantly, we assume that \(\ell \) is associated with a family \({\mathscr {T}}(\ell ,{\mathscr {M}})=\big \{t_{(P,Q)},\; (P,Q)\in {\mathscr {M}}^{2}\big \}\) of test statistics on \((E,{\mathcal {E}})\) which possesses the properties below. We shall see in Sect. 5 that many classical loss functions (among which the total variation distance, the squared Hellinger distance, etc.) can be associated with families \({\mathscr {T}}(\ell ,{\mathscr {M}})\) satisfying the following assumptions.
Assumption 3
The elements \(t_{(P,Q)}\) of \({\mathscr {T}}(\ell ,{\mathscr {M}})\) satisfy:
(i) The mapping
$$\begin{aligned} t:\; (E\times {\mathscr {M}}\times {\mathscr {M}},{\mathcal {E}}\otimes {\mathcal {A}}\otimes {\mathcal {A}}) \longrightarrow {\mathbb {R}},\qquad (x,P,Q) \longmapsto t_{(P,Q)}(x), \end{aligned}$$
is measurable.

(ii) For all \(P,Q\in {\mathscr {M}}\), \(t_{(P,Q)}=-t_{(Q,P)}\).

(iii) There exist positive numbers \(a_{0},a_{1}\) such that, for all \(S\in {\mathscr {P}}\) and \(P,Q\in {\mathscr {M}}\),
$$\begin{aligned} {\mathbb {E}}_{S}\left[ {t_{(P,Q)}(X)}\right] \leqslant a_{0}\ell (S,P)-a_{1}\ell (S,Q). \end{aligned}$$(5)

(iv) For all \(P,Q\in {\mathscr {M}}\),
$$\begin{aligned} \sup _{x\in E}t_{(P,Q)}(x)-\inf _{x\in E}t_{(P,Q)}(x)\leqslant 1. \end{aligned}$$
Under assumption (ii), \(t_{(P,P)}=0\) and we deduce from (5) that \( (a_{0}-a_{1})\ell (S,P)\geqslant 0\), hence that \(a_{0}\geqslant a_{1}\) since \(\ell \) is not constantly equal to 0.
Some families \({\mathscr {T}}(\ell ,{\mathscr {M}})\) may satisfy the following stronger assumption.
Assumption 4
Additionally to Assumption 3, there exists \(a_{2}>0\) such that

(v) for all \(S\in {\mathscr {P}}\) and \(P,Q\in {\mathscr {M}}\),
$$\begin{aligned} {\text {Var}}_{S}\left[ {t_{(P,Q)}(X)}\right] \leqslant a_{2}\left[ {\ell (S,P)+\ell (S,Q)}\right] . \end{aligned}$$
3.2 Construction of the posterior
Let \({\mathscr {T}}(\ell ,{\mathscr {M}})\) be a family of test statistics that satisfies our Assumption 3 and let \(\beta \) and \(\lambda \) be two positive numbers such that
We set
and define \({\widetilde{\pi }}_{{\varvec{X}}}(\cdot |P)\) as the probability on \(({\mathscr {M}},{\mathcal {A}})\) with density
Then, for \(P\in {\mathscr {M}}\) we set
Finally, we define \({\widehat{\pi }}_{{\varvec{X}}}\) as the posterior distribution on \(({\mathscr {M}},{\mathcal {A}})\) with density
Our Assumption 3-(i) ensures that \(d\widetilde{\pi }_{{\varvec{X}}}(\cdot |P)/d\pi \) is a measurable function of \(({\varvec{X}},P,Q)\) and \(d{\widehat{\pi }}_{{\varvec{X}}}/d\pi \) a measurable function of \(({\varvec{X}},P)\).
The posterior distribution depends on our choice of \(\beta \) and \(\lambda \) (or equivalently c) even though we drop this dependency with the notation \({\widehat{\pi }}_{{\varvec{X}}}\).
3.3 Monte Carlo computation of functions of the posterior
Even though we focus on the concentration properties of the posterior \({\widehat{\pi }}_{{\varvec{X}}}\), one may alternatively be interested in some estimators derived from it, for example estimators of the form
$$\begin{aligned} I=\int _{{\mathscr {M}}}F(P)\,d{\widehat{\pi }}_{{\varvec{X}}}(P), \end{aligned}$$
where F is a real-valued \(\pi \)-integrable function on \(({\mathscr {M}},{\mathcal {A}})\). For typical choices of F, I gives the mean, the mode or a median of the posterior whenever these quantities make sense. One may also choose \(F:P\mapsto {\mathbb {1}}_{P\in {\mathscr {B}}(P_{0},{\varepsilon })}\) with \(P_{0}\in {\mathscr {M}}\) and \({\varepsilon }>0\) in order to compute the (posterior) probability that \(\ell (P_{0}, {{\widehat{P}}})\) is not larger than \({\varepsilon }\) when \({{\widehat{P}}}\sim {\widehat{\pi }}_{{\varvec{X}}}\).
Interestingly, the integral I can be approximated by Monte Carlo as follows. Assume that the prior \(\pi \) admits a density of the form \(C^{-1}\Pi \) with respect to a given probability measure \(\mathfrak {m}\), where \(\Pi \) is a nonnegative \(\mathfrak {m}\)-integrable function on \(({\mathscr {M}},{\mathcal {A}})\) and \(C=\int _{{\mathscr {M}}}\Pi (P)d\mathfrak {m}(P)>0\) a positive normalizing constant (that will not be involved in our calculation). Let \(P_{1},\ldots ,P_{N}\) be an N-sample with distribution \(\mathfrak {m}\) and for each \(i\in \{1,\ldots ,N\}\), \(Q_{i}^{(1)},\ldots ,Q_{i}^{(N')}\) an independent \(N'\)-sample with the same distribution. We may approximate I by
where for all \(i\in \{1,\ldots ,N\}\),
It is then easy to check that, by the law of large numbers,
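In essence, this scheme is a self-normalized importance sampler: candidates are drawn from \(\mathfrak {m}\) and reweighted by (estimates of) the unnormalized posterior density. The toy sketch below illustrates the mechanism with a one-dimensional stand-in for the model and a made-up unnormalized density `rho` (both are illustrative assumptions of ours; in the actual scheme the weight of each \(P_{i}\) would itself be approximated with the inner \(N'\)-sample rather than evaluated exactly):

```python
import math
import random

random.seed(1)

# Toy self-normalized importance sampler.  The proposal m is uniform on
# [-5, 5]; rho is an *unnormalized* density with respect to m (here
# proportional to a standard Gaussian, an arbitrary stand-in for the
# posterior density, which we pretend we can only evaluate up to a
# constant).  We estimate I = integral of F against the normalized target.
def rho(x):
    return math.exp(-0.5 * x * x)

def F(x):
    return x * x  # under the normalized target (a standard Gaussian), I = 1

N = 200_000
samples = [random.uniform(-5.0, 5.0) for _ in range(N)]
weights = [rho(x) for x in samples]

# The normalizing constant of rho cancels in the ratio, exactly as the
# constant C cancels in the scheme above.
I_hat = sum(w * F(x) for w, x in zip(weights, samples)) / sum(weights)
print(I_hat)  # close to 1
```

The unknown normalizing constant of the target never needs to be computed, which is precisely what makes the approximation of I practicable.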
3.4 Connection with the classical Bayes posterior distribution
The classical Bayes posterior turns out to be a particular case of the posterior-type ones introduced in Sect. 3.2. As we shall see now, they are associated with the Kullback–Leibler divergence loss. We recall that the Kullback–Leibler divergence \(\ell (P,Q)=K(P,Q)\) between two probabilities P, Q on \((E,{\mathcal {E}})\) is defined by
$$\begin{aligned} K(P,Q)={\left\{ \begin{array}{ll} {\mathbb {E}}_{P}\left[ {\log \dfrac{dP}{dQ}}\right] &{}\text {if } P\ll Q,\\ +\infty &{}\text {otherwise.} \end{array}\right. } \end{aligned}$$
Let us consider now a family \({\mathscr {M}}\) of probabilities that satisfy for some \(a>0\) and suitable versions of their densities dQ/dP the following inequalities:
$$\begin{aligned} e^{-a}\leqslant \frac{dQ}{dP}\leqslant e^{a}\quad \text {for all } P,Q\in {\mathscr {M}}. \end{aligned}$$
It follows from Baraud [4, Proposition 12] that the family of functions
$$\begin{aligned} {\mathscr {T}}(\ell ,{\mathscr {M}})=\left\{ {t_{(P,Q)}=\frac{1}{2a}\log \frac{dQ}{dP},\; (P,Q)\in {\mathscr {M}}^{2}}\right\} \end{aligned}$$
satisfies our Assumptions 3 and 4 with \(a_{0}=a_{1}=1/(2a)\) and \(a_{2}=2a/[\tanh (a/2)]\). Note that given \(P,Q\in {\mathscr {M}}\), \(P\ne Q\), the test based on the sign of \(t_{(P,Q)}\) is the classical likelihood ratio test between P and Q.
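These properties can be checked numerically on a finite sample space. The sketch below assumes the test statistics take the form \(t_{(P,Q)}=\frac{1}{2a}\log \frac{dQ}{dP}\) (an assumption on our part, consistent with the constants \(a_{0}=a_{1}=1/(2a)\) and the likelihood-ratio remark above) and verifies antisymmetry, the equality version of (5), and the range condition of Assumption 3-(iv):

```python
import math
import random

random.seed(2)

a = 1.0  # assumed bound on |log(dQ/dP)| over the model

def t(p, q, x):
    """Assumed test statistic t_{(P,Q)}(x) = (1/(2a)) * log(q(x)/p(x))."""
    return math.log(q[x] / p[x]) / (2 * a)

def kl(s, p):
    """Kullback-Leibler divergence K(s, p) for discrete distributions."""
    return sum(s[x] * math.log(s[x] / p[x]) for x in range(len(s)))

def rand_dist(k, lo=0.1):
    w = [lo + random.random() for _ in range(k)]
    z = sum(w)
    return [x / z for x in w]

k = 4
S, P, Q = rand_dist(k), rand_dist(k), rand_dist(k)

# (ii) antisymmetry: t_{(P,Q)} = -t_{(Q,P)}
assert all(abs(t(P, Q, x) + t(Q, P, x)) < 1e-12 for x in range(k))

# (iii) holds with equality here: E_S[t_{(P,Q)}] = (1/(2a)) [K(S,P) - K(S,Q)],
# which is why a_0 = a_1 = 1/(2a).
lhs = sum(S[x] * t(P, Q, x) for x in range(k))
rhs = (kl(S, P) - kl(S, Q)) / (2 * a)
assert abs(lhs - rhs) < 1e-12

# (iv) range condition: sup t - inf t <= 1 whenever |log(q/p)| <= a pointwise.
vals = [t(P, Q, x) for x in range(k)]
if all(abs(math.log(Q[x] / P[x])) <= a for x in range(k)):
    assert max(vals) - min(vals) <= 1.0
print("Assumption 3 (ii)-(iv) verified on this example")
```

The equality in (iii) comes from \({\mathbb {E}}_{S}[\log (dQ/dP)]=K(S,P)-K(S,Q)\) whenever S is absolutely continuous with respect to the model, while the range condition relies on the assumed bound \(|\log (dQ/dP)|\leqslant a\).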
If we apply the construction described in Sect. 3.2 to the family \({\mathscr {T}}(\ell ,{\mathscr {M}})\) we obtain that for all \(P,Q,P_{0}\in {\mathscr {M}}\),
For all \(\lambda >0\), the density of \({\widetilde{\pi }}_{{\varvec{X}}}(\cdot |P)\)
is independent of P and writing \({\widetilde{\pi }}_{{\varvec{X}}}(\cdot )\) in place of \({\widetilde{\pi }}_{{\varvec{X}}}(\cdot |P)\) we obtain that
Finally, the density of our posterior \({\widehat{\pi }}_{{\varvec{X}}}\) at \(P\in {\mathscr {M}}\) is given by
This is the density of the classical Bayes posterior when \(\beta =2a\), while for other values of \(\beta \) it is that of a fractional Bayes posterior.
Nevertheless, in the present paper we restrict our study to loss functions that satisfy some triangle-type inequality – see Assumption 2. This excludes the Kullback–Leibler divergence unless one is ready to make strong assumptions on the unknown distribution of the data, which we do not want to do here.
3.5 Some heuristics
In this section, we present the basic ideas that underlie our approach. In particular, we shall see how the estimation problem we want to solve is linked to the one of testing between two disjoint \(\ell \)-balls \({\mathscr {B}}(P,r)\) and \({\mathscr {B}}(Q,r)\) with \(P,Q\in {\mathscr {M}}\).
In order to avoid unnecessary details, we assume here that we observe i.i.d. data \(X_1,\ldots ,X_n\) with distribution \(P^{\star }\in {\mathscr {P}}\) and that we have at our disposal a family \({\mathscr {T}}(\ell ,{\mathscr {M}})\) of functions that satisfies our Assumption 3. In particular, it follows from Assumption 3-(iii) that
The antisymmetric property required by Assumption 3-(ii) entails that
and leads to the lower bound
Assuming for the sake of simplicity that \(a_{0}=a_{1}=1\), these calculations show that \(n^{-1}{\textbf{T}}({\varvec{X}},P,Q)=n^{-1}\sum _{i=1}^{n}t_{(P,Q)}(X_{i})\) is an unbiased and consistent estimator of \(\ell (P^{\star },P)-\ell (P^{\star },Q)\). In particular, if the two \(\ell \)-balls \({\mathscr {B}}(P,r)\), \({\mathscr {B}}(Q,r)\) are disjoint and \(P^{\star }\) belongs to one of them, the sign of \(n^{-1}{\textbf{T}}({\varvec{X}},P,Q)\) provides a consistent test for deciding which one contains \(P^{\star }\). In fact, the test does not depend on the value of r and consequently chooses, at least when n is large enough, the element among \(\{P,Q\}\) which is the closest to \(P^{\star }\) (with respect to \(\ell \)). As compared to the classical likelihood ratio test between P and Q, this test has the advantage of not assuming that \(P^{\star }\) is either P or Q, but only that it lies in a small enough \(\ell \)-vicinity of one of these two probabilities. The test is said to be robust with respect to the model \(\{P,Q\}\). Its nonasymptotic properties have been studied in Baraud [4].
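As a hypothetical numerical illustration of this robustness (the bounded statistic below is a simplified stand-in for \(\sum _{i}t_{(P,Q)}(X_{i})\), not the actual family of the paper): with P = N(0, 1), Q = N(3, 1) and data drawn from P contaminated by 10% gross outliers, each observation only contributes a bounded vote, so the bounded test still selects the element closest to the bulk of the data, while the unbounded likelihood ratio is dragged away by the outliers.

```python
import random

random.seed(1)

# P = N(0,1), Q = N(3,1); log densities up to a common additive constant.
def log_p(x):
    return -0.5 * x * x

def log_q(x):
    return -0.5 * (x - 3.0) ** 2

# Data: 90% from P, 10% gross outliers at 100 (so P* stays close to P in TV).
n = 2000
data = [random.gauss(0.0, 1.0) if random.random() < 0.9 else 100.0
        for _ in range(n)]

# Bounded "sign" statistic: each observation votes +1/-1 for the density
# that is larger at it (a simplified, hypothetical stand-in for t_{(P,Q)}).
sign_stat = sum((log_q(x) > log_p(x)) - (log_p(x) > log_q(x)) for x in data)

# Classical (unbounded) log-likelihood ratio statistic.
llr_stat = sum(log_q(x) - log_p(x) for x in data)

print(sign_stat < 0)  # True: the bounded test still picks P, the closer one
print(llr_stat > 0)   # True: the LR test is pulled toward Q by the outliers
```

The boundedness of each summand is what makes the test insensitive to a small fraction of arbitrarily bad observations.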
Let us now explain how such families \(\{{\textbf{T}}({\varvec{X}},P,Q), (P,Q)\in {\mathscr {M}}^{2}\}\) of test statistics can be used to build robust estimators and not only tests. In the frequentist paradigm, the construction of \(\ell \)-estimators is based on the following heuristics. If, with a probability close to 1, \(n^{-1}{\textbf{T}}({\varvec{X}},P,Q)\) is close to its expectation \(\ell (P^{\star },P)-\ell (P^{\star },Q)\) uniformly with respect to \((P,Q)\in {\mathscr {M}}^{2}\) then \(n^{-1}{\textbf{T}}'({\varvec{X}},P)=\sup _{Q\in {\mathscr {M}}}\left[ {n^{-1}{\textbf{T}}({\varvec{X}},P,Q)}\right] \) is close to
We therefore expect a minimizer over \({\mathscr {M}}\) of the function \(P\in {\mathscr {M}}\mapsto n^{-1}{\textbf{T}}'({\varvec{X}},P)\) to be close to a minimizer over \({\mathscr {M}}\) of the function \(P\in {\mathscr {M}}\mapsto \ell (P^{\star },P)-\inf _{Q\in {\mathscr {M}}}\ell (P^{\star },Q)\), that is, to an element that minimizes the loss \(\ell (P^{\star },P)\) among the probabilities \(P\in {\mathscr {M}}\).
In the Bayesian paradigm, we may argue in a similar way as follows. Replacing \(n^{-1}{\textbf{T}}({\varvec{X}},P,Q)\) by its expectation \(\ell (P^{\star },P)-\ell (P^{\star },Q)\), as we did before, amounts to replacing \({\textbf{T}}({\varvec{X}},{\textbf{P}})\) by
Note that the second term on the right-hand side does not depend on P. Consequently, replacing \({\textbf{T}}({\varvec{X}},{\textbf{P}})\) by \(\overline{{\textbf{T}}}({\varvec{X}},{\textbf{P}})\) in the expression (7) of the density of \({\widehat{\pi }}_{{\varvec{X}}}\) leads to the density
We recognize here the density of a Gibbs measure associated with the energy \(\ell (P^{\star },P)\) at point \(P\in {\mathscr {M}}\) and inverse temperature \(n\beta >0\). We know that when the temperature goes to 0 (or equivalently \(n\beta \) to infinity), Gibbs measures concentrate their masses in vicinities of low energy points in \({\mathscr {M}}\). In our case, these low energy points are those for which \(\ell (P^{\star },P)\) is minimal.
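A toy sketch of this concentration phenomenon, with a hypothetical quadratic energy on a discretized parameter set: as the inverse temperature \(n\beta \) grows, the Gibbs weights \(e^{-n\beta \,\ell }\) pile up near the energy minimizer.

```python
import math

# Hypothetical energy: squared distance to the minimizer theta_star = 0.3,
# with a discretized uniform prior on a grid over [0, 1].
theta_star = 0.3
grid = [i / 1000 for i in range(1001)]
loss = [(t - theta_star) ** 2 for t in grid]

def gibbs_mass_near_min(inv_temp, radius=0.05):
    """Mass that the Gibbs measure proportional to exp(-inv_temp * loss)
    puts within `radius` of the energy minimizer."""
    weights = [math.exp(-inv_temp * l) for l in loss]
    total = sum(weights)
    near = sum(w for t, w in zip(grid, weights) if abs(t - theta_star) <= radius)
    return near / total

print(gibbs_mass_near_min(10.0))     # small inverse temperature: mass spread out
print(gibbs_mass_near_min(10000.0))  # large n*beta: mass concentrates near 0.3
```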
Similar ideas can be found in Catoni’s work and more specifically in his construction of Gibbs estimators—see Catoni [17, Chapter 4]. There, Catoni shows how to aggregate a continuous family of estimators in order to minimize a risk. In the present paper, we do not aim at aggregating estimators but we use similar ideas and tools that are due to Catoni and his co-authors for the construction of our robust posterior distribution.
4 The main results
4.1 Linking the prior to the complexity of the model
For \(P\in {\mathscr {M}}\) and \(r>0\), we recall that
where we use the convention \(a/0=+\infty \) for all \(a\geqslant 0\). We said in the Introduction that such quantities encapsulate in some sense the complexity of the model \(({\mathscr {M}},\pi )\) and we shall now explain why. If \({\mathscr {M}}=\{P_{{\varvec{\theta }}},\; {\varvec{\theta }}\in {\mathbb {R}}^{k}\}\) is a parametric model endowed with a loss \(\ell \) such that \(\ell ({\varvec{\theta }},{\varvec{\theta }}')=\left| {{\varvec{\theta }}-{\varvec{\theta }}'}\right| \), so that \(({\mathscr {M}},\ell )\) is isometric to \(({\mathbb {R}}^{k},\left| {\cdot }\right| )\), and if the prior \(\nu \) on \(\Theta ={\mathbb {R}}^{k}\) is improper and given by the Lebesgue measure, we obtain that for all \(P\in {\mathscr {M}}\) and \(r>0\)
We observe that V(P, r) corresponds in this case to the usual dimension of \({\mathbb {R}}^{k}\) (up to the factor \(\log 2\)). For more general models \(({\mathscr {M}},\pi )\) and loss functions \(\ell \), we may interpret V(P, r) as some notion of dimension (or complexity) associated with the element \(P\in {\mathscr {M}}\) at the scale \(r>0\). As we do not consider improper priors but probability distributions, \(\lim _{r\rightarrow +\infty }\pi ({\mathscr {B}}(P,r))=1\) and consequently \(\lim _{r\rightarrow +\infty }V(P,r)=0\). This means that the connection with the notion of “dimension” is only relevant for values of r which are not too large.
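To illustrate the claim, assume (a common form in this literature, taken here as an assumption since the display defining V(P, r) is not reproduced) that \(V(P,r)=\log \left[ \pi ({\mathscr {B}}(P,2r))/\pi ({\mathscr {B}}(P,r))\right] \). For the Lebesgue measure on \({\mathbb {R}}^{k}\), the ratio of the volumes of balls of radii 2r and r is \(2^{k}\), whence \(V\equiv k\log 2\). A quick Monte Carlo check in dimension k = 3:

```python
import random
import math

random.seed(2)

# Volumes of Euclidean balls of radii 2r and r in R^3, estimated by
# uniform sampling in the enclosing cube [-2r, 2r]^3.
k, r, m = 3, 1.0, 400_000
in_small = in_big = 0
for _ in range(m):
    x = [random.uniform(-2 * r, 2 * r) for _ in range(k)]
    d2 = sum(c * c for c in x)
    if d2 <= (2 * r) ** 2:
        in_big += 1
    if d2 <= r * r:
        in_small += 1

# Common sampling cube, so the volume ratio is the ratio of hit counts.
V_estimate = math.log(in_big / in_small)
print(V_estimate, k * math.log(2))  # both close to 3 log 2 = 2.079...
```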
Given \(\gamma \in (0,1]\), the set
is the subinterval of \({\mathbb {R}}_{+}\) on which the mapping \(r\mapsto V(P,r)\) is not larger than \(r\mapsto \gamma n\beta a_{1}{r}\). We denote by
the left endpoint of \({\mathcal {R}}(\beta ,P)\). Since \({\mathcal {R}}(\beta ,P)\) is increasing with \(\beta \) with respect to set inclusion, \({r}_{n}(\beta ,P)\) is a nonincreasing function of \(\beta \). For example, in the ideal situation given in (10) where \(V(P,r)\equiv k\log 2\) with \(k\log 2\geqslant 1\), \({r}_{n}(\beta ,P)=(\gamma a_{1})^{-1}[k\log 2/(n\beta )]\). When the model \({\mathscr {M}}=\{P_{{\varvec{\theta }}},\; {\varvec{\theta }}\in \Theta \}\) is parametric and the parameter space \(\Theta \) is an open subset of \({\mathbb {R}}^{k}\) endowed with a prior \(\nu \), we shall see in Sect. 8.2 that under suitable assumptions \({r}_{n}(\beta ,P_{{\varvec{\theta }}})\) is indeed of order \(k/(n\beta )\), at least for n sufficiently large.
The Bayesian paradigm offers the possibility to favour some elements of \({\mathscr {M}}\) as compared to others. The order of magnitude of \({r}_{n}(\beta ,P)\) allows one to quantify how much the prior \(\pi \) advantages or disadvantages \(P\in {\mathscr {M}}\). It follows from the definition of \({r}_{n}(\beta ,P)\) that
Letting \({r}\) decrease to \({r}_{n}(\beta ,P)\), we derive that (12) also holds for \({r}={r}_{n}(\beta ,P)\). In particular, \(\pi \left( {{\mathscr {B}}(P,r)}\right) >0\) for \({r}={r}_{n}(\beta ,P)\). If the prior puts no mass on the \(\ell \)-ball \({\mathscr {B}}(P,{r})\), which clearly corresponds to a situation where the prior disadvantages P, then \({r}_{n}(\beta ,P)>{r}\), and \({r}_{n}(\beta ,P)\) is therefore large if \({r}\) is. In the opposite case, if the prior puts enough mass on \({\mathscr {B}}(P,{r})\) in the sense that
then for all \({r}'\geqslant {r}\),
hence,
The quantity \({r}_{n}(\beta ,P)\) is therefore small if \({r}\) is small. Although (13) is not equivalent to (12) (it is actually stronger), the previous arguments provide a partial view of the relationship between \(\pi \) and \({r}_{n}\) and conditions to decide whether P is favoured by \(\pi \) or not, according to the size of \({r}_{n}(\beta ,P)\).
4.2 A general result on the concentration property of the posterior distribution
According to the discussion of Sect. 4.1, we see that, when the set
is nonempty, it contains the most favoured elements of the model \(({\mathscr {M}},\pi )\) at level \(a_{1}^{-1}\beta \). Since \({r}_{n}(\beta ,P)\) is nonincreasing with \(\beta \), the set \({\mathscr {M}}(\beta )\) is increasing with \(\beta \) with respect to set inclusion. If \(a_{1}^{-1}\beta \geqslant (n\beta a_{1})^{-1}\) or equivalently \(\beta \geqslant 1/\sqrt{n}\), the set \({\mathscr {M}}(\beta )\) can alternatively be defined from V(P, r) as follows:
This set plays a crucial role in our first result.
Theorem 1
Assume that the model \(({\mathscr {M}},\pi )\) and the loss \(\ell \) satisfy Assumptions 1 and 2 and the family \({\mathscr {T}}(\ell ,{\mathscr {M}})\) Assumption 3. Let \(\gamma <(c_{0}\wedge {c})/(2\tau )\) and \(\beta \geqslant 1/\sqrt{n}\) be chosen in such a way that the set \({\mathscr {M}}(\beta )\) defined by (14) is not empty. Then, the posterior \(\widehat{\pi }_{{\varvec{X}}}\) defined by (7) possesses the following property. There exists \(\kappa _{0}>0\) only depending on \({c},\tau ,\gamma \) and the ratio \(a_{0}/a_{1}\) such that, for all \(\xi >0\) and any distribution \({\textbf{P}}^{\star }\) with marginals in \({\mathscr {P}}\),
with
In particular,
The value of \(\kappa _{0}\) is given by (119) in the proof. It depends only on the choice of the family \({\mathscr {T}}(\ell ,{\mathscr {M}})\), not on the prior \(\pi \). Hence, for a given family \({\mathscr {T}}(\ell ,{\mathscr {M}})\), \(\kappa _{0}\) is a numerical constant.
Let us now comment on Theorem 1. When \(X_1,\ldots ,X_n\) are truly i.i.d. with distribution \(P^{\star }\) and the prior puts enough mass around \(P^{\star }\), in the sense that \(P^{\star }\in {\mathscr {M}}(\beta )\), then \(r=a_{1}^{-1}[\beta +2\xi /(n\beta )]\) in (17). When this ideal situation is not met, either because the data are not identically distributed or because \(P^{\star }\) does not belong to \({\mathscr {M}}(\beta )\), r increases by at most an additive term of order \(\inf _{P\in {\mathscr {M}}(\beta )}\ell ({\overline{P}}^{\star },P)\). When this approximation term remains small as compared to \(a_{1}^{-1}\beta \), the value of r does not deteriorate too much as compared to the previous situation.
The value of \({r}\) given by (17) depends on the choice of the parameter \(\beta \). Since the set \({\mathscr {M}}(\beta )\) is increasing (with respect to set inclusion) as \(\beta \) gets larger, the two terms \(\inf _{P\in {\mathscr {M}}(\beta )}\ell ({\overline{P}}^{\star },P)\) and \(a_{1}^{-1}\beta \) vary in opposite directions as \(\beta \) increases. The set \({\mathscr {M}}(\beta )\) must be large enough to provide a suitable approximation of \({\overline{P}}^{\star }\) while \(\beta \) must not be too large in order to keep \(a_{1}^{-1}\beta \) to a reasonable size. Practically, we recommend choosing \(\beta =\beta (\alpha )\geqslant 1/\sqrt{n}\) such that
In Example 1 below and in Sect. 7.1, we give some examples of choices of \(\beta \).
Example 1
Let \(({\mathscr {M}},\pi )\) be a model where the prior \(\pi \) satisfies for some \(k\geqslant 1\) and constants \(0<A\leqslant (2/e)B\),
This means that the prior \(\pi \) behaves like the Lebesgue measure on a Euclidean space of dimension k for small enough values of r. Then,
which implies that for all \(P\in {\mathscr {M}}\)
The right-hand side is not larger than \(a_{1}^{-1}\beta \) for
which is larger than \(1/\sqrt{n}\) since \((2B/A)\geqslant e\) and \(\gamma \in (0,1]\). For such a value of \(\beta \), which does not depend on the distribution of the data, the element P belongs to \({\mathscr {M}}(\beta )\) given by (15), and since P is arbitrary we derive that \({\mathscr {M}}(\beta )={\mathscr {M}}\). Applying Theorem 1 we conclude that the distribution \({\widehat{\pi }}_{{\varvec{X}}}\) concentrates on an \(\ell \)-ball centered at \({\overline{P}}^{\star }\) with a radius \({r}\) of order
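Reading off the previous display, the choice appears to be \(\beta =\sqrt{k\log (2B/A)/(\gamma n)}\); we treat this form as an assumption in the sketch below, which checks the constraint \(\beta \geqslant 1/\sqrt{n}\) on hypothetical values of k, n, A, B and \(\gamma \).

```python
import math

def beta_choice(k, n, A, B, gamma):
    """Tuning parameter for the Euclidean-like prior of Example 1
    (assumed form: beta = sqrt(k * log(2B/A) / (gamma * n)))."""
    assert 0 < A <= (2 / math.e) * B and 0 < gamma <= 1
    return math.sqrt(k * math.log(2 * B / A) / (gamma * n))

# Hypothetical values of the dimension, sample size and prior constants.
k, n, A, B, gamma = 5, 10_000, 1.0, 4.0, 1 / 100
beta = beta_choice(k, n, A, B, gamma)
print(beta, 1 / math.sqrt(n))
# beta >= 1/sqrt(n) holds since log(2B/A) >= 1 (as A <= (2/e)B) and gamma <= 1
```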
4.3 A refined result under Assumption 4
Let us assume now that the family \({\mathscr {T}}(\ell ,{\mathscr {M}})\) satisfies the stronger Assumption 4. We introduce the mapping
The function \(\phi \) is increasing on \((0,+\infty )\) and tends to 1 when z tends to 0. Given \(\beta >0\) and a family \({\mathscr {T}}(\ell ,{\mathscr {M}})\) that satisfies Assumption 4, we define
Note that the value of \({\overline{{c}}}_{1}\wedge {\overline{{c}}}_{2} \wedge {\overline{{c}}}_{3}\) is positive for \(\beta =0\) and decreases continuously to \(-\infty \) when \(\beta \) grows to infinity. Consequently, there exists some \(\beta _{0}>0\) for which \({\overline{{c}}}_{1}\wedge {\overline{{c}}}_{2} \wedge {\overline{{c}}}_{3}=0\) and \({\overline{{c}}}_{1}\wedge {\overline{{c}}}_{2} \wedge {\overline{{c}}}_{3}\) is positive for all values \(\beta \in (0,\beta _{0})\).
Let us now present our second result on the concentration property of our posterior \({\widehat{\pi }}_{{\varvec{X}}}\).
Theorem 2
Assume that the model \(({\mathscr {M}},\pi )\) and the loss \(\ell \) satisfy Assumptions 1 and 2 and the family \({\mathscr {T}}(\ell ,{\mathscr {M}})\) Assumption 4. For \(\beta \in (0,\beta _{0})\) and \(\gamma <({\overline{{c}}}_{1}\wedge {\overline{{c}}}_{2} \wedge {\overline{{c}}}_{3})/(2\tau )\), the posterior \(\widehat{\pi }_{{\varvec{X}}}\) defined by (7) satisfies the following property. There exists \(\kappa _{0}>0\) only depending on \(a_{0}/a_{1},a_{2}/a_{1},{c},\tau ,\beta \) and \(\gamma \) such that, for all \(\xi >0\) and any distribution \({\textbf{P}}^{\star }\) with marginals in \({\mathscr {P}}\),
with
In particular,
The value of \(\kappa _{0}\) is given by (132) in the proof. Note that the constraints on \(\beta \) and \(\gamma \) required by Theorem 2, as well as that on \({c}\) given in (6), depend only on \(a_{0},a_{1}\) and \(a_{2}\), hence on the choice of the family \({\mathscr {T}}(\ell ,{\mathscr {M}})\). When \(a_{0},a_{1}\) and \(a_{2}\) do not depend on \({\mathscr {M}}\), the value of \(\beta \) can be chosen as a universal constant. In particular, it depends neither on the model \(({\mathscr {M}},\pi )\) nor on the sample size n.
Example 2
(Example 1 continued) Let us go back to the framework of our Example 1 and assume that \({\mathscr {T}}(\ell ,{\mathscr {M}})\) satisfies the requirements of Theorem 2, hence Assumption 4. Applying our construction with some numerical value of \(\beta \) which satisfies the constraint of our Theorem 2, we deduce from (21) that \({\widehat{\pi }}_{{\varvec{X}}}\) concentrates on an \(\ell \)-ball with radius of order
When the model is well-specified, \(\inf _{P\in {\mathscr {M}}}\ell (\overline{P}^{\star },P)=0\) and the ball \({\mathscr {B}}(P^{\star },\kappa _{0}{\overline{{r}}})\) with radius \({\overline{{r}}}={\overline{{r}}}(n)\) contracts at the rate 1/n. Applying our Theorem 1 under Assumption 3, and ignoring the fact that the family \({\mathscr {T}}(\ell ,{\mathscr {M}})\) also satisfies Assumption 4, would lead to the weaker result that, when the model is well-specified, the posterior concentrates on an \(\ell \)-ball with radius of order \(\sqrt{k/n}\), hence at a rate \(1/\sqrt{n}\), as shown by (23).
4.4 Concentrated priors
Theorems 1 and 2 show that starting from a prior \(\pi \) that puts enough mass around most of the elements of \({\mathscr {M}}\), the posterior \({\widehat{\pi }}_{{\varvec{X}}}\) concentrates on an \(\ell \)-ball with radius of order \(\inf _{P\in {\mathscr {M}}}\ell ({\overline{P}}^{\star },P)+r_{n}\) where \(r_{n}\) is small, at least under suitable assumptions and for n sufficiently large. The situation we want to investigate now is what happens when the prior is very concentrated on a small \(\ell \)-ball with radius \({\varepsilon }>0\) around an element \({{\overline{Q}}}\in {\mathscr {M}}\) that might not be the true distribution of the data. More precisely, assume the following
Assumption 5
For \({{\overline{Q}}}\in {\mathscr {M}}\) and \({\varepsilon }>0\),
In this case, we establish the following result.
Theorem 3
Assume that the model \(({\mathscr {M}},\pi )\) and the loss \(\ell \) satisfy Assumptions 1 and 2 and the family \({\mathscr {T}}(\ell ,{\mathscr {M}})\) Assumption 3. If Assumption 5 is satisfied, there exists \(\kappa _{0}>0\) only depending on \({c},\tau \) and the ratio \(a_{0}/a_{1}\) such that for any distribution \({\textbf{P}}^{\star }\) with marginals in \({\mathscr {P}}\),
In particular, for the choice \(\beta =a_{1} {\varepsilon }\), \(r=\ell (\overline{P}^{\star },{{\overline{Q}}})\vee {\varepsilon }\).
If, furthermore, Assumption 4 is satisfied and \(\beta \in (0,\beta _{0})\) (where \(\beta _{0}\) is defined in Sect. 4.3), there exists \(\kappa _{0}'>0\) only depending on \(\tau ,\beta , a_{0}/a_{1}\) and \(a_{2}/a_{1}\) such that for any distribution \({\textbf{P}}^{\star }\) with marginals in \({\mathscr {P}}\),
This result shows that for a suitable choice of \(\beta \), the posterior \({\widehat{\pi }}_{{\varvec{X}}}\) also concentrates on an \(\ell \)-ball centered at \({\overline{P}}^{\star }\) with radius of order \({\varepsilon }\) when the model is well-specified, that is, when the data are i.i.d. with distribution \({\overline{P}}^{\star }={{\overline{Q}}}\). When the model is misspecified, the radius of the ball is of order \(\ell ({\overline{P}}^{\star },{{\overline{Q}}})\vee {\varepsilon }\) and therefore does not inflate by more than the distance of \({\overline{P}}^{\star }\) to the center \({{\overline{Q}}}\). This result illustrates the stability of the posterior \({\widehat{\pi }}_{{\varvec{X}}}\) with respect to misspecification.
5 Applications to classical loss functions
The aim of this section is to show how our general construction can be applied to loss functions \(\ell \) of interest. The propositions contained in this section about the corresponding families \({\mathscr {T}}(\ell ,{\mathscr {M}})\) have been established in Baraud [4] except for the squared Hellinger loss for which we refer to Baraud and Birgé [5, Proposition 3]. The list of loss functions we present here is not exhaustive. Our results also apply to all loss functions that derive from a variational formula of the form
where \({\mathscr {F}}\) is a suitable class of bounded functions. For such losses, we refer the reader to Baraud [4].
In this section, we consider models \({\mathscr {M}}=\{P=p\cdot \mu , p\in {\mathcal {M}}\}\) which are dominated by a measure \(\mu \) on \((E,{\mathcal {E}})\) and we denote by \({\mathcal {M}}\subset {\mathscr {L}}_{1}(E,{\mathcal {E}},\mu )\) the corresponding families of densities with respect to \(\mu \). Elements \(P,Q,\ldots \) in \({\mathscr {M}}\) are associated with their densities in \({\mathcal {M}}\) by using lower case letters \(p,q,\ldots \). In all the cases we consider, \(t_{(P,Q)}(x)\) is a measurable function of (p(x), q(x)) for \(P,Q\in {\mathscr {M}}\) and \(x\in E\). In order to satisfy our measurability Assumption 3-(i), it is therefore sufficient to assume that
is measurable. In the case of a parametrized model \({\mathscr {M}}=\{P_{\theta }=p_{\theta }\cdot \mu , \theta \in \Theta \}\), as described in Sect. 2.1, this condition is satisfied as soon as the mapping
is measurable. Throughout this section, we assume that such measurability assumptions are satisfied.
5.1 The case of the total variation distance
In this section, \({\mathscr {P}}\) is the set of all probability measures on \((E,{\mathcal {E}})\) and
denotes the total variation loss (TV-loss for short) between \(P,Q\in {\mathscr {P}}\).
Proposition 1
The family \({\mathscr {T}}(\ell ,{\mathscr {M}})\) which consists of all the functions \(t_{(P,Q)}\) defined for \(P=p\cdot \mu \) and \(Q=q\cdot \mu \) in \({\mathscr {M}}\) by
satisfies Assumption 2 with \(\tau =1\) and Assumption 3 with \(a_{0}=3/2\) and \(a_{1}=1/2\).
It follows from Proposition 1 that we may apply our general construction to the family \({\mathscr {T}}(\ell ,{\mathscr {M}})\) so defined, with the values \({c}={c}_{0}=1/3\) (hence \(\lambda =4/3\)). The reader can check that the value \(\gamma =1/100\) satisfies the requirement of our Theorem 1 and that (16) is satisfied with \(\kappa _{0}=220\). Theorem 1 can therefore be rephrased as follows.
Corollary 1
Let \(\beta \geqslant 1/\sqrt{n}\), \({c}=1/3\) and \({\widehat{\pi }}_{{\varvec{X}}}^{{{\text {TV}}}}\) be the posterior defined by (7) and associated with the family \({\mathscr {T}}(\ell ,{\mathscr {M}})\) given in Proposition 1. For all \(\xi >0\) and any distribution \({\textbf{P}}^{\star }\), with a probability at least \(1-2e^{-\xi /2}\), the posterior \({\widehat{\pi }}_{{\varvec{X}}}^{{\text {TV}}}\) satisfies
where
By convexity, we may write that
and the left-hand side is therefore small when there exists \(P\in {\mathscr {M}}(\beta )\) that approximates well enough most of the marginals of \({\textbf{P}}^{\star }\). The concentration properties of \(\widehat{\pi }_{{\varvec{X}}}^{{\text {TV}}}\) remain thus stable with respect to a possible misspecification of the model and a departure from the equidistribution assumption.
In fact, as we shall see in our Example 3 below, the average distribution \({\overline{P}}^{\star }\) may belong to \({\mathscr {M}}(\beta )\) even when none of the marginals \(P_{i}^{\star }\) does. This means that in good situations, the posterior may concentrate around \(\overline{P}^{\star }\), as it would do in the i.i.d. case when the distribution of the data does belong to \({\mathscr {M}}(\beta )\), even when the data are non-i.i.d. and their marginals do not belong to \({\mathscr {M}}(\beta )\).
Example 3
(Example 1 continued) Going back to Example 1 and taking for \(\ell \) the TV-loss (then \(a_{1}=1/2\)), we deduce from (23) that
In particular, if for each \(i\in \{1,\ldots ,n\}\), \(P_{i}^{\star }\) is the uniform distribution on \([(i-1)/n,i/n]\) and \({\mathscr {M}}\) contains the uniform distribution \({\mathcal {U}}([0,1])\) on [0, 1], then \({\mathscr {M}}\) contains \({\overline{P}}^{\star }={\mathcal {U}}([0,1])\), even though none of the marginals \(P_{i}^{\star }\) belongs to \({\mathscr {M}}\). We then get that
and the posterior concentrates around \({\overline{P}}^{\star }\) at a parametric rate.
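The claim that \({\overline{P}}^{\star }={\mathcal {U}}([0,1])\) can be checked directly: the average of the n densities \(p_{i}=n\,{\textbf{1}}_{[(i-1)/n,\,i/n)}\) is identically 1 on [0, 1).

```python
def average_density(x, n):
    """Average of the n marginal densities p_i = n * 1_{[(i-1)/n, i/n)} at x."""
    total = 0.0
    for i in range(1, n + 1):
        if (i - 1) / n <= x < i / n:
            total += n  # p_i(x) = n on its support, an interval of length 1/n
    return total / n

n = 50
values = [average_density(x / 1000, n) for x in range(1000)]
print(min(values), max(values))  # both 1.0: the mixture is U([0,1])
```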
5.2 Case of the \({\mathbb {L}}_{j}\)-loss
Let \(j\in (1,+\infty )\). We denote by \({\mathscr {P}}_{j}\) the set of all finite and signed measures on \((E,{\mathcal {E}},\mu )\) which are of the form \(P=p\cdot \mu \) with \(p\in {\mathscr {L}}_{j}(E,\mu )\cap {\mathscr {L}}_{1}(E,\mu )\). Let \(\ell _{j}\) be the loss defined by \(\ell _{j}(P,Q)=\left\| {p-q}\right\| _{\mu ,j}\) for all \(P=p\cdot \mu \) and \(Q=q\cdot \mu \) in \({\mathscr {P}}_{j}\). In this section, \({\mathscr {P}}\) is the subset that consists of all the probability measures in \({\mathscr {P}}_{j}\).
Proposition 2
Let \({\mathscr {M}}=\left\{ {P=p\cdot \mu ,\; p\in {\mathcal {M}}}\right\} \) be a subset of \({\mathscr {P}}_{j}\) for which \({\mathcal {M}}\) satisfies for some \(R>0\)
Define for \(P=p\cdot \mu \) and \(Q=q\cdot \mu \) in \({\mathscr {M}}\),
Then, the family \({\mathscr {T}}(\ell _{j},{\mathscr {M}})\) which contains the functions \(t_{(P,Q)}\) defined for \(P,Q\in {\mathscr {M}}\) by
satisfies Assumption 2 with \(\tau =1\) and Assumption 3 with \(a_{0}=3/(4R^{j-1})\) and \(a_{1}=1/(4R^{j-1})\).
When \(j=2\), (36) is typically satisfied when \({\mathcal {M}}\) is a subset of a linear space enjoying good connections between the \({\mathbb {L}}_{2}(\mu )\) and the supremum norms. Many finite dimensional linear spaces with good approximation properties do satisfy such connections (e.g. piecewise polynomials of a fixed degree on a regular partition of [0, 1], trigonometric polynomials on [0, 1) etc.). We refer the reader to Birgé and Massart [14, Section 3] for additional examples. The property may also hold for infinite dimensional linear spaces as proven in Baraud [4].
It follows from Proposition 2 that one may choose \({c}={c}_{0}=1/3\) in (6) and \(\gamma =1/100\) in Theorem 1. Besides, Theorem 1 applies with \(\kappa _{0}=220\).
Example 4
(Example 1 continued) Let us go back to our Example 1 with \(\ell =\ell _{j}\) and \({\mathscr {T}}(\ell ,{\mathscr {M}})\) given in Proposition 2. For the choice of \(\beta \) given in (22) and \(\gamma =1/100\), we deduce from (23) (with \(a_{1}=1/(4R^{j-1})\)) that the resulting posterior \({\widehat{\pi }}_{{\varvec{X}}}\) concentrates on an \(\ell _{j}\)-ball around \({\overline{P}}^{\star }\) with a radius of order
5.3 The case of the squared Hellinger loss
Here, \({\mathscr {P}}\) is the set of all probability measures on \((E,{\mathcal {E}})\) and
is the squared Hellinger distance between two probabilities \(P,Q\in {\mathscr {P}}\).
Proposition 3
Let \(\psi \) be the function defined by
The family \({\mathscr {T}}(\ell ,{\mathscr {M}})\) containing the functions \(t_{(P,Q)}\) defined for \(P=p\cdot \mu \) and \(Q=q\cdot \mu \) in \({\mathscr {M}}\) by
(with the conventions \(0/0=1\) and \(x/0=+\infty \) for all \(x>0\)) satisfies Assumption 2 with \(\tau =2\) and Assumption 4 with \(a_{0}=2\), \(a_{1}=3/16\), \(a_{2}=3\sqrt{2}/4\).
With such a choice of family \({\mathscr {T}}(\ell ,{\mathscr {M}})\), (6) is satisfied with \({c}=1/125\), then \({c}_{0}\in [0.922,0.923]\), and the requirements of Theorem 2 are satisfied with \(\beta =2\gamma =1/500\). Then the value \(\kappa _{0}=1694\) suits. The definition (11) of \({r}_{n}(\beta ,P)\) for \(P\in {\mathscr {M}}\) becomes
with the convention \(\sup {\varnothing }=8000/(3n)\). Theorem 2 can then be rephrased as follows.
Corollary 2
Let \({\widehat{\pi }}_{{\varvec{X}}}^{h}\) be the posterior defined by (7) and associated with the family \({\mathscr {T}}(\ell ,{\mathscr {M}})\) given in Proposition 3 and the choices \({c}=1/125\) and \(\beta =1/500\). For all \(\xi >0\) and any distribution \({\textbf{P}}^{\star }\), with a probability at least \(1-2e^{-\xi /2}\),
where
and \({r}_{n}(\beta ,P)\) is given by (40).
As for the total variation distance, we may write that
The left-hand side is small when there exists an element \(P\in {\mathscr {M}}\) that approximates well most of the marginal distributions \(P_{i}^{\star }\). If, for such a P, the quantity \({r}_{n}(\beta ,P)\) is small enough, the posterior concentrates around \({\overline{P}}^{\star }\) just as it would do if the data were truly i.i.d. with distribution \(P\in {\mathscr {M}}\).
Example 5
(Example 1 continued) Let us go back to Example 1, more precisely Example 2, with \(\ell =h^{2}\) and \({\mathscr {T}}(\ell ,{\mathscr {M}})\) given in Proposition 3. Inequality (21) is satisfied with \(\beta =2\gamma =1/500\) and \(a_{1}=3/16\). It follows from (30) that \({\widehat{\pi }}_{{\varvec{X}}}^{h}\) concentrates on an \(h^{2}\)-ball around \({\overline{P}}^{\star }\) with a radius of order
6 Comparing the classical Bayesian approach to ours
In this section, our aim is to highlight some similarities and differences between the Bayesian posterior and ours. Throughout this section, we consider the squared Hellinger loss \(\ell =h^{2}\) and denote by \({\widehat{\pi }}_{{\varvec{X}}}^{K}\) the Bayes posterior associated with the model \(({\mathscr {M}},\pi )\). The letter K in the notation \(\widehat{\pi }_{{\varvec{X}}}^{K}\) refers to the fact that the Bayesian posterior can be obtained from our general construction by using the Kullback–Leibler divergence as explained in Sect. 3.4. When \({\mathscr {M}}=\{P_{{\varvec{\theta }}},\; {\varvec{\theta }}\in \Theta \}\) is parametric with \(\Theta \subset {\mathbb {R}}^{k}\), we denote by \({\widehat{\nu }}_{{\varvec{X}}}^{K}\) the Bayesian posterior on the parameter space \(\Theta \) and \({\widehat{\nu }}_{{\varvec{X}}}^{h}\) that associated to \({\widehat{\pi }}_{{\varvec{X}}}^{h}\).
6.1 Some classical concentration results for the Bayes posterior distribution
Most of the results that have been established about the concentration properties of the Bayesian posterior are asymptotic in nature. It seems difficult to establish a general nonasymptotic version of these results as we do for our posterior. One of the few exceptions we are aware of is Birgé [13].
When the data are i.i.d. with a distribution \(P^{\star }\in {\mathscr {M}}\), a typical asymptotic form of these results is the following one (see, e.g., Ghosal et al. [19, Theorems 2.1 and 2.4]). Let \({\varepsilon }_{n}\) be a sequence of positive numbers that converges to zero when n goes to infinity. If \(P^{\star }\) fulfils some suitable conditions, which we shall discuss later on and which depend on the prior \(\pi \) and \({\varepsilon }_{n}\), the following convergence in probability holds true
In (41), \(M_{n}=M\) denotes some large enough positive constant if \(n{\varepsilon }_{n}^{2}\rightarrow +\infty \) as \(n\rightarrow +\infty \), while \(M_{n}\) increases to infinity if one only has \(\liminf _{n\rightarrow +\infty } n{\varepsilon }_{n}^{2}>0\). The first condition on \({\varepsilon }_{n}\) is typically satisfied when \({\mathscr {M}}\) is a nonparametric model, while the second one generally applies to parametric ones.
In comparison, in this well-specified framework, our Corollary 2 leads to the following result. For all \(P^{\star }\in {\mathscr {M}}\) and \(\xi >0\)
for some numerical constant \(\kappa _{0}'>0\). If \(P^{\star }\) satisfies \(r_{n}(\beta ,P^{\star })\leqslant {\varepsilon }_{n}^{2}\), we recover (41) by setting \(\xi =\xi _{n}=(M_{n}/(\kappa _{0}')-1)n{\varepsilon }_{n}^{2}\). However, our condition that \(r_{n}(\beta ,P^{\star })\leqslant {\varepsilon }_{n}^{2}\) is not equivalent to that imposed on \(P^{\star }\) by Ghosal, Ghosh and van der Vaart [19]. It is actually weaker. In their paper, this condition is fulfilled when the prior puts enough mass on Kullback–Leibler type balls around \(P^{\star }\). Our approach allows one to consider Hellinger balls only, which are larger and make our assumption weaker. In fact, as already underlined in the Introduction, these Kullback–Leibler type balls could be empty, and the condition unsatisfied, while our theorem would still apply.
The result established by Birgé [13] provides an improvement as compared to the one presented above and established by Ghosal, Ghosh and van der Vaart. Birgé shows that it is essentially possible to get rid of the Kullback–Leibler divergence (see his Theorem 2), but only when the model is parametric and well-specified. Apart from the nonparametric framework, this result leaves little room for improvement since we know that the Bayesian posterior may fail to concentrate around the true parameter when the model becomes slightly ill-specified.
Another consequence of our Corollary 2, as compared to (41), is that it allows one to control
uniformly over the set \(\{P^{\star }\in {\mathscr {M}}, r_{n}(\beta ,P^{\star })\leqslant {\varepsilon }_{n}^{2}\}\). For example, in the framework of Example 2, for the choice \({\varepsilon }_{n}^{2}=ck/n\) with \(c=\log (2B/A)/(\gamma a_{1}\beta )\), we know that \(r_{n}(\beta ,P^{\star })\leqslant {\varepsilon }_{n}^{2}\) for all \(P^{\star }\in {\mathscr {M}}\) and we deduce from (42) that
The concentration properties of our posterior are therefore uniform over the statistical model \({\mathscr {M}}\).
6.2 About the shapes and sizes of the credible regions
A nice feature of the Bayesian approach lies in the fact that it allows one to build credible regions. In practice, they often play the same role as the confidence regions in the frequentist paradigm. When the data are i.i.d. with distribution \(P^{\star }=P_{{\varvec{\theta }}^{\star }}\) in a parametric model \({\mathscr {M}}=\{P_{{\varvec{\theta }}},\; {\varvec{\theta }}\in \Theta \}\), \(\Theta \subset {\mathbb {R}}^{k}\), a credible set for the parameter \({\varvec{\theta }}^{\star }\) is a subset \({\widehat{\Theta }}_{n,{\varvec{X}}}\subset \Theta \) (only depending on observable quantities) that satisfies \(\widehat{\nu }_{{\varvec{X}}}^{K}({\widehat{\Theta }}_{n,{\varvec{X}}})\geqslant 1-e^{-\xi }\) for some choice of \(\xi >0\). When \({\mathscr {M}}\) is a regular parametric model with a nonsingular Fisher information matrix \({\textbf{J}}\), and provided that it satisfies additional assumptions—see van der Vaart [24]—the Bernstein–von Mises theorem applies and tells us that
where \({\widehat{{\varvec{\theta }}}}_{n}\) denotes the Maximum Likelihood Estimator (MLE for short). Denoting by \(\overline{\chi }_{k}^{-1}(\xi )\) the \((1-e^{-\xi })\)-quantile of a chi-square random variable with k degrees of freedom and
we deduce that
hence
The asymptotic level of “credibility” of the set \(\Theta _{n,{\varvec{X}}}\) is therefore \(1-e^{-\xi }\). This set is not, however, a genuine credible region since it depends on the unknown parameter \({\varvec{\theta }}^{\star }\). We would obtain a genuine credible region by replacing \({\varvec{\theta }}^{\star }\) by \({\widehat{{\varvec{\theta }}}}_{n}\) in the expression of \(\Theta _{n,{\varvec{X}}}\). This substitution would change the level of credibility but not the shape of the region, which is an ellipsoid centred at \({\widehat{{\varvec{\theta }}}}_{n}\) and the axes of which are given by the eigenvectors of the Fisher information matrix.
The aim of this section is to show that our posterior concentrates its mass on regions that have the same shape and approximately the same size. The size of \(\Theta _{n,{\varvec{X}}}\) is determined by the value of the quantile \({\overline{\chi }}_{k}^{-1}(\xi )\). The following lemma specifies the order of magnitude of this quantile as a function of k and \(\xi \). In fact, we consider below the more general case of the quantiles of a gamma distribution \(\gamma (s,\sigma )\) with parameters \(s,\sigma >0\), that is, the distribution with density \(x\mapsto (x^{s-1}e^{-x/\sigma })/(\sigma ^{s}\Gamma (s))\) with respect to the Lebesgue measure on \({\mathbb {R}}_{+}\). The proof is postponed to Sect. 10.1.
Lemma 1
For \(s,\sigma ,\xi >0\), let \({\overline{\gamma }}_{s,\sigma }^{-1}(\xi )\) be the \((1-e^{-\xi })\)-quantile of the gamma distribution \(\gamma (s,\sigma )\) and \({\overline{\Phi }}^{-1}(\xi )\) that of a standard Gaussian random variable. Then,
and for all \(s=t+1>1\) and \(\xi \geqslant \log 2+1/(12t)\),
Since \({\overline{\Phi }}^{-1}(\xi )\) is equivalent to \(\sqrt{2\xi }\) for large values of \(\xi >0\), these two inequalities show that for s and \(\xi \) large enough, \({\overline{\gamma }}_{s,\sigma }^{-1}(\xi )\) is of order \(\sigma \left[ {s+\xi }\right] \). In particular, \(\overline{\chi }_{k}^{-1}(\xi )={\overline{\gamma }}_{k/2,2}^{-1}(\xi )\) is of order \(k+\xi \) for k and \(\xi \) large enough.
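This order of magnitude is easy to check empirically. The following standard-library sketch (a Monte Carlo approximation, not part of the paper) estimates the \((1-e^{-\xi })\)-quantile of a chi-square distribution with k degrees of freedom, which is a gamma distribution with shape k/2 and scale 2, and compares it with \(k+\xi \).

```python
# Monte Carlo check (standard library only) that the (1 - e^{-xi})-quantile
# of a chi-square distribution with k degrees of freedom grows like k + xi.
# A chi-square with k degrees of freedom is a gamma(k/2, scale=2) distribution.
import math
import random

random.seed(0)

def chi2_quantile(k, xi, n_samples=200_000):
    """Empirical (1 - e^{-xi})-quantile of chi-square(k) via gamma sampling."""
    samples = sorted(random.gammavariate(k / 2, 2) for _ in range(n_samples))
    level = 1 - math.exp(-xi)
    return samples[int(level * n_samples)]

for k, xi in [(5, 2), (20, 5), (50, 10)]:
    q = chi2_quantile(k, xi)
    print(f"k={k:3d} xi={xi:3d}  quantile ~ {q:7.2f}  k+xi={k + xi}")
```

The ratio between the estimated quantile and \(k+\xi \) stays within a small constant factor, in line with Lemma 1.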
To compare ourselves with the classical Bayesian paradigm, we prove in Sect. 10.2 the result below for our posterior. This result is based on the assumption that the statistical model \({\mathscr {M}}\) is regular in the sense that is defined in Ibragimov and Has’minskiĭ [20]. In order to avoid too many technicalities here, we refer the reader to our Sect. 8.3, more precisely Corollary 4, for a complete description of the assumptions on the statistical model \({\mathscr {M}}\).
Theorem 4
Assume that the statistical model \({\mathscr {M}}\) satisfies the assumptions of Corollary 4. If \(X_{1},\ldots ,X_{n}\) are i.i.d. with distribution \(P_{{\varvec{\theta }}^{\star }}\in {\mathscr {M}}\), for all \(\xi >0\) and n large enough, with a probability \(1-2e^{-\xi }\),
where \(\kappa ^{\star }\) is a positive numerical constant.
The set
possesses the same shape and, by Lemma 1, approximately the same size as the set \(\Theta _{n,{\varvec{X}}}\) defined by (43). We deduce from Theorem 4 that the classical Bayes posterior and ours both concentrate on similar sets. If \({\widehat{{\varvec{\theta }}}}_{n}\) is an asymptotically efficient estimator of \({\varvec{\theta }}^{\star }\), it is therefore reasonable to look for a credible region of the form
for \({\widehat{\nu }}_{{\varvec{X}}}^{h}\) as we would do for the classical Bayes one.
6.3 Robustness
As already mentioned, our approach allows the statistician to design robust posteriors by choosing as a loss function the squared Hellinger loss or the total variation one. In this section, we illustrate this property on a concrete example. Consider the statistical model \({\mathscr {M}}=\{P_{\theta }={\mathcal {N}}(\theta ,1),\; \theta \in {\mathbb {R}}\}\) and the prior \(\pi \) associated with the distribution \(\nu ={\mathcal {N}}(0,1)\) on \(\Theta ={\mathbb {R}}\). Then, the Bayes posterior on \(\Theta \) is \(\widehat{\nu }_{{\varvec{X}}}^{K}={\mathcal {N}}({{\widehat{m}}}_{n}, \sigma _{n}^{2})\) with \(\widehat{m}_{n}=(n+1)^{-1}\sum _{i=1}^{n} X_{i}\) and \(\sigma _{n}^{2}=1/(n+1)\). It concentrates on intervals of the form \([\widehat{m}_{n}-c/\sqrt{n+1},{{\widehat{m}}}_{n}+c/\sqrt{n+1}]\) for \(c>0\) large enough. If the distribution of the data is contaminated so that \(X_{1},\ldots ,X_{n}\) are i.i.d. with distribution
then with a probability at least \(1-(1-1/n)^{n}\geqslant 1-1/e>63\%\), the posterior concentrates around \({{\widehat{m}}}_{n}\approx 10^{4}\), hence far away from 0, even though \(P^{\star }\) and \(P_{0}\) are close: \(\left\| {P^{\star }-P_{0}}\right\| \leqslant 1/n\).
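The effect described above can be reproduced numerically. The sketch below is illustrative only: the paper's exact contaminating distribution is not restated here, and we simply plant one gross outlier; the sample median serves merely as a stand-in for a robust procedure, not as the paper's posterior.

```python
# Sketch of the contamination effect described above (illustrative values).
# One wild observation drags the conjugate-Gaussian posterior mean far from 0,
# while a robust summary such as the sample median barely moves.
import random
import statistics

random.seed(1)
n = 100
data = [random.gauss(0.0, 1.0) for _ in range(n)]
data[0] = 1e4 * (n + 1)          # a single gross outlier

m_hat = sum(data) / (n + 1)      # posterior mean for the N(0,1) conjugate prior
med = statistics.median(data)    # stand-in for a robust estimator

print(f"posterior mean ~ {m_hat:.1f}, median ~ {med:.3f}")
```

With this contamination, the conjugate posterior mean lands near \(10^{4}\) while the median stays near 0.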
In this specific framework, the model \({\mathscr {M}}\) is regular, the Fisher information is constant and positive, \(\nu \) admits a positive density which is continuous at \(\theta ^{\star }=0\) and for all \(\theta ,\theta '\in \Theta \), \(h^{2}(\theta ,\theta ')=1-e^{-|\theta -\theta '|^{2}/8}\). We shall see in Sect. 8, more precisely in Corollary 4, that for such regular statistical models \(r_{n}(\beta ,P_{0})\leqslant \kappa ^{\star }/n\) for some numerical constant \(\kappa ^{\star }>0\), at least for n large enough. Since \(h^{2}(P^{\star },P_{0})\leqslant \left\| {P^{\star }-P_{0}}\right\| \leqslant 1/n\), we deduce from Corollary 2 that the posterior \(\widehat{\nu }_{{\varvec{X}}}^{h}\) concentrates on a set of the form
with \(c>0\). This set is an interval around 0 of approximate length \(1/\sqrt{n}\), at least for n sufficiently large. Despite the contamination of the data, the concentration property of \(\widehat{\nu }_{{\varvec{X}}}^{h}\) thus remains the same as in the well-specified case.
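The closed-form expression \(h^{2}(\theta ,\theta ')=1-e^{-|\theta -\theta '|^{2}/8}\) used above is easy to verify numerically. A minimal check by Riemann summation (standard library only):

```python
# Numerical check of h^2(P_theta, P_theta') = 1 - e^{-(theta - theta')^2 / 8}
# for two N(theta, 1) densities, using a plain Riemann sum.
import math

def hellinger_sq_gauss(a, b, lo=-30.0, hi=30.0, step=1e-3):
    """h^2 = 1 - integral of sqrt(p_a * p_b), where p_t is the N(t,1) density."""
    def pdf(t, x):
        return math.exp(-0.5 * (x - t) ** 2) / math.sqrt(2 * math.pi)
    x, acc = lo, 0.0
    while x < hi:
        acc += math.sqrt(pdf(a, x) * pdf(b, x)) * step
        x += step
    return 1.0 - acc

for a, b in [(0.0, 1.0), (0.0, 3.0), (-2.0, 2.0)]:
    exact = 1.0 - math.exp(-((a - b) ** 2) / 8)
    approx = hellinger_sq_gauss(a, b)
    print(f"theta={a}, theta'={b}: numeric={approx:.6f}, closed form={exact:.6f}")
```

The numeric and closed-form values agree to several decimal places.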
7 Applications
7.1 How to choose \(\beta \) in Theorem 1 for a translation model?
In this section, we consider the translation model \({\mathscr {M}}=\{P_{\theta }=p(\cdot -\theta )\cdot \mu ,\; \theta \in {\mathbb {R}}\}\) where p is a density on \({\mathbb {R}}\) with respect to the Lebesgue measure \(\mu \). Our aim is to estimate the translation parameter \(\theta \) by using a prior \(\nu _{\sigma }\) on \(\Theta ={\mathbb {R}}\) with a density (with respect to \(\mu \)) of the form \(q(\cdot /\sigma )/\sigma \) for some density q and positive number \(\sigma \). We evaluate the estimation error by means of the total variation loss. In order to use our construction we need to tune the parameter \(\beta \). In Sect. 4.2, we suggested choosing \(\beta \geqslant 1/\sqrt{n}\) satisfying (18). In order to find such a value of \(\beta =\beta (\alpha )\), we may proceed as follows. Consider a symmetric bounded interval \(I=[-l/2,l/2]\subset {\mathbb {R}}\) of length \(l>0\) satisfying \(\nu _{\sigma }(I)\geqslant 1-\alpha \), hence concentrating most of the mass of the prior \(\nu _{\sigma }\). If the set \({\mathscr {M}}(\beta )\) is large enough to contain \(\{P_{\theta },\; \theta \in I\}\),
and \(\beta \) satisfies (18). We deduce from our Corollary 1 that the corresponding posterior \(\widehat{\pi }_{{\varvec{X}}}^{{\text {TV}}}\) concentrates with a probability at least \(1-2e^{-\xi /2}\) on a TV-ball with a radius of order
The approximation term \(\inf _{\theta \in I}\ell (\overline{P}^{\star },P_{\theta })\) is small as soon as \({\overline{P}}^{\star }\) is close enough to a distribution \(P_{\theta ^{\star }}\) whose parameter \(\theta ^{\star }\) belongs to I. If we want to guard against the situation where \(\textrm{argmin}_{\theta \in \Theta }\ell ({\overline{P}}^{\star },P_{\theta })\) is far from 0, we need to enlarge I (or equivalently diminish \(\alpha \)). What would be the consequence for the value of \(\beta =\beta (\alpha )\)? What if we increase \(\sigma \), to make the prior distribution flatter, or diminish \(\sigma \) to make it more peaked? Finally, what is the influence of the choice of the density q on the size of \(\beta \)?
These are the questions we want to answer in this section. In order to simplify the presentation of our results and avoid technicalities, we make the change of variables \(l=2\sigma t\), or equivalently \(t=l/(2\sigma )>0\), and assume the following.
Assumption 6
The density q is positive, symmetric and decreasing on \({\mathbb {R}}_{+}\). There exists some nonnegative and nondecreasing function \(\varphi :[0,1)\rightarrow {\mathbb {R}}_{+}\) such that
When p is symmetric and nonincreasing on \({\mathbb {R}}_{+}\), the total variation distance between \(P_{0}\) and \(P_{\theta }\) is given by
Our Assumption 6 is then satisfied with \(\varphi (r)=F_{0}^{-1}[(r+1)/2]\) for all \(r\in [0,1)\), where \(F_{0}^{-1}\) denotes the quantile function of the distribution \(P_{0}\). We set
and assume that this quantity is finite. Note that it only depends on q(0) and p. For example, if p is the density \(x\mapsto (1/2)e^{-|x|}\),
Since the mapping \(r\mapsto [\varphi (2r)/\varphi (r)]\) is increasing, we obtain in this case
If now \(p:x\mapsto (s/2)(1-|x|)^{s-1}{\mathbb {1}}_{|x|<1}\) with \(s>0\),
The mapping \(r\mapsto \varphi (2r)/\varphi (r)\) has a continuous extension on [0, 1/4] and is therefore bounded. Given q(0), \({\overline{\Gamma }}\) is thus a finite number.
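As a sanity check of these computations, one can evaluate the total variation distance between \(P_{0}\) and \(P_{\theta }\) numerically for the Laplace density \(p:x\mapsto (1/2)e^{-|x|}\) and compare it with the crossing-point identity \(F_{0}(\theta /2)-F_{0}(-\theta /2)=1-e^{-\theta /2}\) (for \(\theta >0\)), which is the standard closed form for symmetric unimodal translation families:

```python
# Numerical check that, for the Laplace density p(x) = (1/2)e^{-|x|},
# TV(P_0, P_theta) = F_0(theta/2) - F_0(-theta/2) = 1 - e^{-theta/2}, theta > 0.
import math

def tv_laplace(theta, lo=-30.0, hi=30.0, step=1e-3):
    """TV = (1/2) * integral of |p(x) - p(x - theta)| dx, by Riemann sum."""
    p = lambda x: 0.5 * math.exp(-abs(x))
    x, acc = lo, 0.0
    while x < hi + theta:
        acc += abs(p(x) - p(x - theta)) * step
        x += step
    return 0.5 * acc

for theta in [0.5, 1.0, 4.0]:
    print(f"theta={theta}: numeric={tv_laplace(theta):.6f}, "
          f"closed form={1 - math.exp(-theta / 2):.6f}")
```

The two values agree to the accuracy of the Riemann sum.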
The following result is proven in Sect. 10.3.
Proposition 4
Assume that Assumption 6 is satisfied and \(\overline{\Gamma }\) is finite. Let t be a \((1-\alpha /2)\)-quantile of q with \(\alpha \leqslant 1/2\). The set \({\mathscr {M}}(\beta )\) contains the subset \(\left\{ {P_{\theta }, \theta \in [-\sigma t,\sigma t]}\right\} \) and therefore satisfies (47) if
Let us now comment on this result. The quantity \({\overline{\beta }}\) may be written as \(C/\sqrt{n}\) with
Increasing the value of \(\sigma \) or that of t enlarges the interval \(I=[-\sigma t,\sigma t]\). It also makes the value of \(C=C(\sigma ,t)\) larger. Increasing \(\sigma \) makes the prior \(\nu _{\sigma }\) flatter and for a fixed value of \(t>0\), \(C=C(\sigma )\) increases as \(\sqrt{\log \sigma }\) when \(\sigma \) is larger than 1. In the other case, for a fixed value of \(\sigma \), \(C=C(t)\) increases as \(\sqrt{\log (1/q(2t))}\). For example, when q is the density of a standard Gaussian random variable, \(\sqrt{\log (1/q(2t))}\) is of order t, while for the Laplace and the Cauchy distributions it is of order \(\sqrt{t}\) and \(\sqrt{\log t}\) respectively. This result illustrates the fact that it is safer to use priors with heavy tails when the size of the location parameter is uncertain. In case of a light-tailed prior, it may be wise to introduce a scaling parameter \(\sigma >1\). By taking \(\sigma =10\), the concentration radius only increases by a factor less than 1.6, while the interval I is ten times longer.
7.2 Fast rates
We go back to the statistical framework described in Sect. 7.1 and consider the special case of the density \(p:x\mapsto s x^{s-1}{\mathbb {1}}_{(0,1]}\) with \(s\in (0,1]\). As before, we choose the TV-loss. In this specific situation,
and consequently, \(\varphi (r)=r^{1/s}\) for all \(r\in [0,1)\). Besides, the family \({\mathscr {T}}(\ell ,{\mathscr {M}})\) given by (34) satisfies not only Assumption 3 but also Assumption 4 with \(a_{2}=1\). These two facts are proven in Baraud [4, Examples 5 and 6]. As a consequence, Theorem 2 applies. The reader can check that the constants \({c}=\beta =0.1\) and \(\gamma =0.01\) satisfy the requirements of Theorem 2 and that its conclusion holds true with \(\kappa _{0}=144\).
In order to be more specific about the concentration radius of our posterior \({\widehat{\pi }}_{{\varvec{X}}}^{{\text {TV}}}\), the following proposition provides an upper bound for the quantity \(r_{n}(\beta ,P_{\theta })\). The proof is postponed to Sect. 10.4.
Proposition 5
Let \(t_{0}\) be the third quartile of \(\nu _{1}\). If the density q is positive, symmetric and decreasing on \([0,+\infty )\), for all \(\theta \in {\mathbb {R}}\) the quantity \(r_{n}(\beta ,P_{\theta })\) is not larger than
Then, our Theorem 2 tells us that for all \(\xi >0\), with a probability at least \(1-2e^{-\xi /2}\), the posterior satisfies
with
When the data are i.i.d. with distribution \(P_{\theta ^{\star }}\), a randomized estimator \(P_{{\widehat{\theta }}}\) with distribution \({\widehat{\pi }}_{{\varvec{X}}}^{{\text {TV}}}\) satisfies, with a probability close to 1,
This inequality implies, at least for n large enough, that
which means that the parameter \(\theta ^{\star }\) is estimated at rate \(n^{-1/s}\). This rate is much faster than the usual \(1/\sqrt{n}\) parametric rate reached, for instance, by an estimator based on the method of moments. For example, when \(s=1/3\) and \(n=100\), a moment estimator provides an accuracy of order \(10^{-1}\) while that of \({\widehat{\theta }}\) is of order \(10^{-6}\). Note that since p is unbounded, the maximum likelihood estimator of \(\theta ^{\star }\) does not exist and is therefore of no use here.
It follows from the work of Le Cam that in a translation model \({\mathscr {M}}\) of the form \(\{P_{\theta }=p(\cdot -\theta )\cdot \mu , \theta \in {\mathbb {R}}\}\), where p is a density with respect to the Lebesgue measure \(\mu \), it is impossible to estimate a distribution \(P^{\star }\in {\mathscr {M}}\) from an n-sample at a rate faster than 1/n for the TV-loss. Because of (51), the rate we get is not only optimal for estimating the distribution \(P_{\theta ^{\star }}\) but also for estimating the parameter \(\theta ^{\star }\) with respect to the Euclidean distance.
An alternative rate-optimal estimator of \(\theta ^{\star }\) is the minimum of the observations. Unfortunately, this estimator is clearly non-robust to the presence of even a single outlier in the sample. Our construction provides an estimator that is both rate-optimal and robust.
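Both claims, the fast rate of the minimum and its fragility, can be illustrated by simulation. This is a sketch, not the paper's posterior: we draw from \(p(\cdot -\theta ^{\star })\) with \(p:x\mapsto sx^{s-1}{\mathbb {1}}_{(0,1]}\) by the inverse-CDF method and compare the minimum with a method-of-moments estimator.

```python
# Sketch for the model p(x) = s*x^{s-1} on (0,1], s = 1/3: the minimum of the
# sample estimates theta* at rate n^{-1/s} = n^{-3}, far faster than the
# sqrt(n)-rate of a moment estimator, but a single outlier below theta* ruins it.
import random

random.seed(2)
s, n, theta_star = 1 / 3, 100, 0.7
# If U is uniform on (0,1), then U^{1/s} has density s*x^{s-1} on (0,1].
sample = [theta_star + random.random() ** (1 / s) for _ in range(n)]

min_err = abs(min(sample) - theta_star)
# Moment estimator: E[X] = theta* + s/(s+1), hence theta ~ mean - s/(s+1).
mom_err = abs(sum(sample) / n - s / (s + 1) - theta_star)
print(f"min-based error ~ {min_err:.2e}, moment-based error ~ {mom_err:.2e}")

# One contaminated observation far below theta* destroys the minimum.
corrupted = sample + [theta_star - 5.0]
print(f"after one outlier, min-based error = {abs(min(corrupted) - theta_star):.2f}")
```

The minimum typically misses \(\theta ^{\star }\) by roughly \(10^{-6}\) here, while the moment estimator misses by a few percent; a single outlier shifts the minimum by 5.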
It is also interesting to see how the quantity \(\overline{r}_{n}(\beta ,P_{\theta })\) given in (52) deteriorates under a misspecification of the prior \(\nu _{\sigma }\), that is, when the size of the parameter \(\theta ^{\star }\) is large compared to \(\sigma \). When q is Gaussian, \(\overline{r}_{n}(\beta ,P_{\theta ^{\star }})\) increases by a factor of order \((\theta ^{\star }/\sigma )^{2}\), while for the Laplace and Cauchy distributions the factor is of order \(|\theta ^{\star }|/\sigma \) and \(\log (|\theta ^{\star }|/\sigma )\) respectively. From these results, we conclude as before that the Cauchy distribution possesses some advantages over the other two when little information is available on the location of the parameter \(\theta ^{\star }\).
7.3 A general result under entropy
In this section, we equip \(E={\mathbb {R}}^{k}\) with the Lebesgue measure \(\mu \) and the norm \(\left| {\cdot }\right| _{\infty }\). We consider the TV-loss and the location-scale family
where \({\mathcal {M}}_{0}\) is a set of densities on \({\mathbb {R}}^{k}\). Given independent observations \(X_1,\ldots ,X_n\) with presumed distribution \(P^{\star }=P_{(p^{\star },{\textbf{m}}^{\star },\sigma ^{\star })}\in {\mathscr {M}}\), our aim is to estimate the density \(p^{\star }\in {\mathcal {M}}_{0}\), the location parameter \({\textbf{m}}^{\star }\in {\mathbb {R}}^{k}\) and the scale parameter \(\sigma ^{\star }>0\), hence the parameter \(\theta ^{\star }=(p^{\star },{\textbf{m}}^{\star },\sigma ^{\star })\in \Theta ={\mathcal {M}}_{0}\times {\mathbb {R}}^{k}\times (0,+\infty )\). We assume that the set of densities \({\mathcal {M}}_{0}\) satisfies the following conditions.
Assumption 7
Let \({{\widetilde{D}}}\) be a continuous nonincreasing mapping from \((0,+\infty )\) to \([1,+\infty )\) such that \(\lim _{\eta \rightarrow +\infty }\eta ^{-2}{{\widetilde{D}}}(\eta )=0\). For all \(\eta >0\), there exists a finite subset \({\mathcal {M}}_{0}[\eta ]\subset {\mathcal {M}}_{0}\) satisfying
such that for all \(p\in {\mathcal {M}}_{0}\), there exists \({\overline{p}}\in {\mathcal {M}}_{0}[\eta ]\) that satisfies
Besides, we assume that there exist \(A,s>0\) such that for all \(p\in {\mathcal {M}}_{0}\), \({\textbf{m}}\in {\mathbb {R}}^{k}\) and \(\sigma \geqslant 1\),
The first part of Assumption 7, which corresponds to inequalities (55) and (56), aims at measuring the size of the set \({\mathcal {M}}_{0}\) by means of its entropy. The entropy of a set controls its metric dimension and usually determines the minimax rate of convergence over it as shown in Birgé [9]. With the second part of Assumption 7, namely inequality (57), we require some regularity properties of the TV-loss with respect to the location and scale parameters. It will be commented on later. We shall see that this condition may be satisfied even when the densities in \({\mathcal {M}}_{0}\) are not smooth.
Let us now turn to the choice of our prior. We first consider a countable subset of the parameter space \(\Theta \) that will be proven to possess good approximation properties. Namely, we define for \(\eta ,\delta >0\)
and we associate a positive weight \(L_{\theta }\) with any element \(\theta =\theta ({\overline{p}},j_{0},{\textbf{j}})\in \Theta [\eta ,\delta ]\) as follows
with \(L=\log \left[ {(\pi ^{2}/3)-1}\right] \). It is not difficult to check that \(\sum _{\theta \in \Theta [\eta ,\delta ]}e^{-L_{\theta }}=1\), and we may therefore endow \({\mathscr {M}}\) with the (discrete) prior \(\pi \) defined as
With such a prior, our posterior \({\widehat{\pi }}_{{\varvec{X}}}^{{\text {TV}}}\) given in Corollary 1 possesses the following properties.
Corollary 3
Let \(\xi >0\) , \(K>1\). Assume that \({\mathcal {M}}_{0}\) satisfies Assumption 7 and define
and the subset \({\mathscr {M}}_{n}(K)\) of \({\mathscr {M}}\) that consists of the elements \(P_{(p,{\textbf{m}},\sigma )}\) for which
Then, the posterior \({\widehat{\pi }}_{{\varvec{X}}}^{{\text {TV}}}\) satisfies the following property: there exists a numerical constant \(\kappa _{0}'>0\) such that for all \(\xi >0\),
with
Let us now comment on this result. The radius \({r}_{n}\) is the sum of three main terms, omitting the dependency with respect to \(\xi \). The first one, \(\inf _{P\in {\mathscr {M}}_{n}(K)}\ell ({\overline{P}}^{\star },P)\), corresponds to the approximation of \({\overline{P}}^{\star }\) by an element of \({\mathscr {M}}\) whose location and scale parameters satisfy the constraints given in (63). The quantity \(\eta _{n}\), involved in the second term, usually corresponds to the minimax rate for solely estimating a density \(p\in {\mathcal {M}}_{0}\) from an n-sample. Finally, the third term \(\sqrt{(k+1)/n}\) corresponds to the rate we would get for solely estimating the location and scale parameters \(({\textbf{m}},\sigma )\in {\mathbb {R}}^{k+1}\) when the density p is known.
Let us now provide some examples for which our condition (57) is satisfied. We start with an example where the densities in \({\mathcal {M}}_{0}\) are smooth.
Lemma 2
Assume that the set \({\mathcal {M}}_{0}\) consists of densities p that are supported on \([0,1]^{k}\), satisfy \(\sup _{p\in {\mathcal {M}}_{0}}\left\| {p}\right\| _{\infty }\leqslant L_{0}\) and
with constants \(L_{0},L_{1}>0\) and \(s\in (0,1]\). Then (57) is satisfied with \(A=L_{1}\vee [(1+L_{1}k^{s/2}+L_{0})/2]\).
Nevertheless, condition (57) may also be satisfied for families \({\mathcal {M}}_{0}\) of densities that are not smooth, as shown in Lemma 3 below. This makes it possible to consider the following example.
Example 6
We consider here the situation where \(k=1\) and \({\mathcal {M}}_{0}\) is the set of all nonincreasing densities on [0, 1] that are bounded by \(B>1\). Then, \({\mathscr {M}}\) consists of all the probabilities whose densities are supported on intervals I with positive lengths, nonincreasing on I and bounded by \(B/\mu (I)\). Birman and Solomjak [15] proved that \({\mathcal {M}}_{0}\) satisfies Assumption 7 with \({{\widetilde{D}}}(\eta )\) of order \((1/\eta )\vee 1\) (up to a constant that depends on B). We deduce from (60) that \(\eta _{n}\) is of order \(n^{-1/3}\). Besides, it follows from Lemma 3 below that (57) is satisfied with \(A=B\) and \(s=1\). We may therefore apply Corollary 3. For a value of K large enough compared to 1, \(\Lambda _{n}\) defined by (63) is larger than \(\exp \left[ {CK^{2}n^{1/3}}\right] \) for some constant \(C>0\) (depending on A). In particular, if \(X_{1},\ldots ,X_{n}\) are i.i.d. with a density of the form
where \(p\in {\mathcal {M}}_{0}\), \(|m^{\star }/\sigma ^{\star }|\leqslant \exp \left[ {CK^{2}n^{1/3}}\right] \) and
(64) is satisfied with \(r_{n}\) of order \(C'n^{-1/3}\), where the constant \(C'>0\) only depends on \(\xi ,K,B\) but not on \(m^{\star }\) and \(\sigma ^{\star }\). This means that the concentration properties of \({\widehat{\pi }}_{{\varvec{X}}}^{{\text {TV}}}\) hold uniformly over a very wide range of location and scale parameters \({\textbf{m}}\) and \(\sigma \) when n is large enough.
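The rate \(\eta _{n}\sim n^{-1/3}\) can be recovered numerically. Assuming \(\eta _{n}\) solves the balance equation \({\widetilde{D}}(\eta )=n\eta ^{2}\) (a common definition; (60) is not restated here) with \({\widetilde{D}}(\eta )=1/\eta \), a simple bisection gives \(\eta _{n}=n^{-1/3}\):

```python
# Bisection sketch for the rate eta_n in Example 6, assuming eta_n solves the
# balance equation D(eta) = n * eta^2 with the entropy bound D(eta) = 1/eta
# for monotone densities: the solution is n^{-1/3}.
def eta_n(n, D=lambda eta: 1.0 / eta):
    lo, hi = 1e-12, 1.0
    for _ in range(200):                      # bisection on D(eta) - n*eta^2
        mid = 0.5 * (lo + hi)
        if D(mid) > n * mid * mid:
            lo = mid                          # D still too big: eta too small
        else:
            hi = mid
    return hi

for n in [10**3, 10**6, 10**9]:
    print(n, eta_n(n), n ** (-1 / 3))
```

The same routine applies to any entropy bound \({\widetilde{D}}\), which is why only the order of \({\widetilde{D}}(\eta )\) matters in the statement.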
Lemma 3
Let p be a nonincreasing density on \((0,+\infty )\). For all \(\sigma \geqslant 1\)
If, furthermore, p is bounded by \(B\geqslant 1\), for all \(m\in {\mathbb {R}}\),
In particular, for all \(m\in {\mathbb {R}}\) and \(\sigma \geqslant 1\),
7.4 Estimating a parameter under sparsity
Let us consider a parametric dominated model \({\mathscr {M}}=\left\{ {P_{{\varvec{\theta }}}=p_{{\varvec{\theta }}}\cdot \mu ,\; \varvec{\theta }\in {\mathbb {R}}^{k}}\right\} \) where the dimension k of the parameters is large. We presume, even though this might not be true, that the data are i.i.d. with distribution \(P_{\varvec{\theta }^{\star }}\in {\mathscr {M}}\) and that the coordinates of the true parameter \(\varvec{\theta }^{\star }=(\theta _{1}^{\star },\ldots ,\theta _{k}^{\star })\) are all zero except for a small number of them. Our aim is to estimate \(P_{\varvec{\theta }^{\star }}\) from the observation of \(X_1,\ldots ,X_n\) by using the squared Hellinger loss.
To tackle this problem, we partition the model \({\mathscr {M}}\) into the sub-models \(\{{\mathscr {M}}_{m},\; m\subset \{1,\ldots ,k\}\}\) where \({\mathscr {M}}_{m}\) consists of those distributions \(P_{\varvec{\theta }}\in {\mathscr {M}}\) for which the coordinates of \(\varvec{\theta }=(\theta _{1},\ldots ,\theta _{k})\) are all zero except those with an index \(i\in m\). We denote by \(\Theta _{m}\) the set of such parameters, so that \({\mathscr {M}}_{m}=\{P_{\varvec{\theta }},\; \varvec{\theta }\in \Theta _{m}\}\), and we use the conventions \(\Theta _{{\varnothing }}=\{\textbf{0}\}\) and \({\mathscr {M}}_{{\varnothing }}=\{P_{\textbf{0}}\}\). Given some positive number \(R>0\), we equip each parameter space \(\Theta _{m}\), \(m\subset \{1,\ldots ,k\}\), with the uniform distribution \(\nu _{m}\) on \(\Theta _{m}(R)=[-R,R]^{k}\cap \Theta _{m}\) when \(m\ne {\varnothing }\) and the Dirac mass \(\nu _{{\varnothing }}=\delta _{\textbf{0}}\) at \(\textbf{0}\in {\mathbb {R}}^{k}\) when \(m={\varnothing }\). We may then define on \({\mathbb {R}}^{k}=\bigcup _{m\subset \{1,\ldots ,k\}}\Theta _{m}\), the hierarchical prior
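To make the two-stage construction concrete, here is a minimal sampler from a hierarchical prior of this kind. The support-size weights below are hypothetical placeholders (the paper's exact weights are not restated here); only the structure, first a support m, then a uniform draw on \([-R,R]\) for the coordinates in m, matches the text.

```python
# Illustrative sampler for a hierarchical sparsity prior: draw a support m,
# then draw the nonzero coordinates uniformly on [-R, R].  The support-size
# distribution used here is a hypothetical placeholder, not the paper's.
import math
import random

random.seed(3)

def sample_sparse_prior(k, R, lam=1.0):
    # Hypothetical support-size weights: P(|m| = d) proportional to exp(-lam*d).
    weights = [math.exp(-lam * d) for d in range(k + 1)]
    u, acc, size = random.random() * sum(weights), 0.0, 0
    for d, w in enumerate(weights):
        acc += w
        if u <= acc:
            size = d
            break
    support = random.sample(range(k), size)   # uniform choice of the support m
    theta = [0.0] * k
    for i in support:
        theta[i] = random.uniform(-R, R)      # nu_m: uniform on [-R, R]^m
    return theta

theta = sample_sparse_prior(k=50, R=10.0)
print("nonzero coordinates:", sum(1 for t in theta if t != 0.0))
```

Most draws are sparse: the prior charges every sub-model \(\Theta _{m}\) but strongly favours small supports.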
We endow \({\mathscr {M}}\) with the \(\sigma \)-algebra and the prior \(\pi \) as described in Sect. 2.1. Besides, we assume that there exist \(s>0\) and a positive number \(B_{k}=B_{k}(R)\), possibly depending on k and R (although we drop the dependency with respect to R), such that
The following result is proven in Sect. 10.8.
Proposition 6
Assume that
is measurable. If \(RB_{k}^{1/s}\geqslant 1\) there exists a numerical constant \(\kappa _{0}'>0\) such that for any distribution \({\textbf{P}}^{\star }\) and \(\xi >0\)
where
Let us now comment on this result. First of all, the mapping
being nondecreasing, our condition \(RB_{k}^{1/s}=R[B_{k}(R)]^{1/s}\geqslant 1\) is always satisfied for a value of R sufficiently large.
When \(B_{k}\) does not increase faster than a power of k, the radius r given in (72) only depends logarithmically on the dimension k of the parameter space, as expected.
Let us now illustrate Proposition 6 by choosing some specific models \({\mathscr {M}}=\{P_{\varvec{\theta }},\; \varvec{\theta }\in {\mathbb {R}}^{k}\}\). If \(P_{\varvec{\theta }}\) is the Gaussian distribution with mean \(\varvec{\theta }\in {\mathbb {R}}^{k}\) and covariance matrix \(\sigma ^{2} I_{k}\), where \(I_{k}\) denotes the \(k\times k\) identity matrix,
Then, inequality (71) is satisfied with \(B_{k}=k/(8\sigma ^{2})\) and \(s=2\). In particular, our condition \(RB_{k}^{1/s}\geqslant 1\) is equivalent to \(R\geqslant 2\sigma \sqrt{(2/k)}\). In this case, the value of r given by (72) is of order
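The values \(B_{k}=k/(8\sigma ^{2})\) and \(s=2\) follow from \(1-e^{-x}\leqslant x\) and \(|\cdot |_{2}^{2}\leqslant k\,|\cdot |_{\infty }^{2}\). Assuming (71) bounds \(h^{2}(P_{\varvec{\theta }},P_{\varvec{\theta }'})\) by \(B_{k}|\varvec{\theta }-\varvec{\theta }'|_{\infty }^{s}\), the following sketch verifies the inequality on random parameter differences:

```python
# Check of the Gaussian bound discussed above: with
# h^2 = 1 - exp(-|d|_2^2 / (8 sigma^2)), one has
# h^2 <= (k / (8 sigma^2)) * |d|_inf^2,
# since 1 - e^{-x} <= x and |d|_2^2 <= k * |d|_inf^2.
import math
import random

random.seed(4)
k, sigma = 12, 1.5
B_k = k / (8 * sigma ** 2)
for _ in range(1000):
    d = [random.uniform(-3, 3) for _ in range(k)]
    h2 = 1 - math.exp(-sum(x * x for x in d) / (8 * sigma ** 2))
    bound = B_k * max(abs(x) for x in d) ** 2
    assert h2 <= bound + 1e-12
print("bound verified on 1000 random parameter differences")
```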
More generally, if \({\mathscr {M}}=\{P_{\varvec{\theta }},\varvec{\theta }\in {\mathbb {R}}^{k}\}\) is a regular statistical model with a nonsingular Fisher information matrix \({\textbf{J}}(\varvec{\theta })\) for all \(\varvec{\theta }\in {\mathbb {R}}^{k}\), we know from the book of Ibragimov and Has’minskiĭ [20, Theorem 7.1, p. 81] that for all \(\varvec{\theta },\varvec{\theta }'\in {\mathbb {R}}^{k}\) such that \(\varvec{\theta },\varvec{\theta }'\in [-R,R]^{k}\)
Then, Assumption (71) holds with \(s=2\) and we may take
where \(\varrho \left( {{\textbf{J}}(\varvec{\theta }'')}\right) \) denotes the largest eigenvalue of the matrix \({\textbf{J}}(\varvec{\theta }'')\). This value is independent of \(\varvec{\theta }''\) when \({\mathscr {M}}\) is a translation model.
Finally note that the second term in (72) only increases logarithmically with respect to R, at least when \(B_{k}=B_{k}(R)\) does not increase faster than a power of R. By taking larger values of R one may therefore considerably enlarge the sizes of the cubes \(\Theta _{m}(R)\), and therefore diminish the approximation term in (72), while only slightly increasing the second term \([|m|\log (2kR(nB_{k})^{1/s})+\xi ]/n\).
8 Some tools for evaluating \({r}_{n}(\beta ,P)\)
The aim of this section is to provide some mathematical results that allow one to bound the quantity \({r}_{n}(\beta ,P)\) from above, or at least evaluate its order of magnitude, when n is sufficiently large. Throughout this section, we consider a parametric statistical model \({\mathscr {M}}=\{P_{{\varvec{\theta }}},\; {\varvec{\theta }}\in \Theta \}\) where the parameter space \(\Theta \subset {\mathbb {R}}^{k}\) is endowed with a prior \(\nu \) which admits a density q with respect to the Lebesgue measure on \({\mathbb {R}}^{k}\). In order to use the definition (11) of the quantity \({r}_{n}(\beta ,P)\), we assume that we have at our disposal a family \({\mathscr {T}}(\ell ,{\mathscr {M}})\) that satisfies our Assumption 3, which provides us with a value of \(a_{1}>0\), as well as a value \(\gamma \) that satisfies the requirements of our main theorems. Our aim is to bound \({r}_{n}(\beta ,P)\) as a function of \(a_{1},\gamma ,\beta , k\) and n under suitable assumptions on the density q and the behaviour of the loss \(\ell \). Once \(\ell \) and \({\mathscr {T}}(\ell ,{\mathscr {M}})\) are given, \(a_{1}\) and \(\gamma \) can be considered as fixed numerical constants. The value of \(\beta \) can also be considered as a numerical constant when Theorem 2 applies. Otherwise, it can be chosen of order \(\sqrt{k/n}\) as in our Example 1.
8.1 Bounding \({r}_{n}(\beta ,P_{{\varvec{\theta }}})\) in parametric models
In what follows, \(\left| {\cdot }\right| _{*}\) denotes some arbitrary norm on \({\mathbb {R}}^{k}\) and \({\mathcal {B}}_{*}({\varvec{x}},z)\) the corresponding closed ball centered at \({\varvec{x}}\in {\mathbb {R}}^{k}\) with radius \(z\geqslant 0\).
Assumption 8
Let \({\varvec{\theta }}^{\star }\) be an element of \(\Theta \subset {\mathbb {R}}^{k}\).
-
(i)
There exist positive numbers \({\underline{a}},{\overline{a}}\) and \({s}\) such that
$$\begin{aligned} {\underline{a}}\left| {{\varvec{\theta }}-{\varvec{\theta }}^{\star }}\right| _{*}^{s}\leqslant \ell ({\varvec{\theta }},{\varvec{\theta }}^{\star })\leqslant \overline{a}\left| {{\varvec{\theta }}-{\varvec{\theta }}^{\star }}\right| _{*}^{s}\quad \text{ for } \text{ all } {\varvec{\theta }}\in \Theta . \end{aligned}$$(73) -
(ii)
There exists a positive nonincreasing function \(\upsilon _{{\varvec{\theta }}}\) on \({\mathbb {R}}_{+}\) such that
$$\begin{aligned} \nu ({\mathcal {B}}_{*}({\varvec{\theta }}^{\star },2x))\leqslant \upsilon _{{\varvec{\theta }}^{\star }}(x)\nu ({\mathcal {B}}_{*}({\varvec{\theta }}^{\star },x))\quad \text{ for } \text{ all } x>0. \end{aligned}$$(74)
Under Assumption 8-(i), the loss function behaves like a power of a norm between the parameters.
The following result is an extension of Proposition 10 in Baraud and Birgé [6]. It was established there for the special case of the squared Hellinger loss and we provide here an extension to an arbitrary one. Since the proof follows the same lines, we omit it.
Proposition 7
Under Assumption 8,
with \(\varrho _{0}=1+\log (2{\overline{a}}/{\underline{a}})/[{s}\log 2]\). If \(\upsilon _{{\varvec{\theta }}^{\star }}\equiv \upsilon >0\), then
If Assumption 8-(i) is satisfied and if the parameter space \({\Theta }\) is convex and q satisfies
then Assumption 8-(ii) holds with \(\upsilon _{{\varvec{\theta }}^{\star }}\equiv 2^{k}({\overline{b}}/{\underline{b}})\). Consequently,
When Assumption 8-(i) is satisfied and \(\nu \) admits a density which is bounded away from 0 and infinity on a convex parameter space \(\Theta \subset {\mathbb {R}}^{k}\), \({r}_{n}(\beta ,P_{{\varvec{\theta }}})\) is of order \(k/(n\beta )\) for all \({\varvec{\theta }}\in \Theta \). This result may also hold true when the density is not bounded away from infinity as shown in the following example. If \(k=1\), \(\Theta =[-1,1]\) and \(q:\theta \mapsto (t/2)|\theta |^{t-1}{\mathbb {1}}_{[-1,1]}(\theta )\) with \(t\in (0,1)\), Assumption 8-(ii) holds with \(\upsilon _{\theta }\equiv 2^{1+t}\left( 2^{t}-1\right) ^{-1}\) for all \(\theta \in [-1,1]\)—see Baraud and Birgé [6, Proposition 11]. Then (76) still applies. In the other direction, when the density q takes very small values in the neighbourhood of the parameter \({\varvec{\theta }}\), the function \(\upsilon _{{\varvec{\theta }}}\) may take large values around 0. This is for example the case when q is proportional to \(\theta \mapsto \exp \left[ -1/\left( 2|\theta |^{t}\right) \right] {\mathbb {1}}_{[-1,1]}(\theta )\), \(t>0\), and \(\theta =0\). It follows from Baraud and Birgé [6, Proposition 12] (and its proof) that Assumption 8-(ii) is satisfied with \(\upsilon _{\theta }:x\mapsto \exp (c(t)/x^{t})\) for some quantity \(c(t)>0\). Applying (75) leads to an upper bound on \({r}_{n}(\beta ,P_{\theta })\) of order \((n \beta )^{-s/(s+t)}\).
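The doubling condition (74) for the density \(q:\theta \mapsto (t/2)|\theta |^{t-1}{\mathbb {1}}_{[-1,1]}(\theta )\), with the constant \(\upsilon =2^{1+t}\left( 2^{t}-1\right) ^{-1}\) quoted above, can be checked numerically on a grid (here \(t=1/2\)):

```python
# Numerical check of the doubling condition
# nu(B(theta, 2x)) <= v * nu(B(theta, x)) for the prior density
# q(theta) = (t/2)|theta|^{t-1} on [-1, 1], with v = 2^{1+t}/(2^t - 1), t = 1/2.
t = 0.5

def G(u):
    """CDF of q on [-1, 1], clamped outside the support."""
    u = max(-1.0, min(1.0, u))
    return 0.5 + 0.5 * abs(u) ** t if u >= 0 else 0.5 - 0.5 * abs(u) ** t

def ball_mass(theta, x):
    return G(theta + x) - G(theta - x)

v = 2 ** (1 + t) / (2 ** t - 1)
worst = 0.0
for i in range(-100, 101):                    # grid of centres theta in [-1, 1]
    theta = i / 100
    for x in [1e-4, 1e-3, 1e-2, 0.1, 0.5, 1.0]:
        ratio = ball_mass(theta, 2 * x) / ball_mass(theta, x)
        worst = max(worst, ratio)
print(f"worst doubling ratio on grid: {worst:.3f} (bound: {v:.3f})")
```

The worst ratio on the grid stays safely below the stated constant \(\upsilon \).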
8.2 Some asymptotic order of magnitude
In Sect. 8.1, we gave some general tools for controlling the quantity \(r_{n}(\beta ,P_{{\varvec{\theta }}})\) for a given value of n. In this section, we present some sufficient conditions under which \(r_{n}(\beta ,P_{{\varvec{\theta }}})\) is of order \(k/(n\beta )\), at least when n is large enough. These conditions are not the weakest possible ones, but they have the advantage of being relatively easy to check in many examples.
Assumption 9
The density q is continuous and positive at \({\varvec{\theta }}^{\star }\in \Theta \). The loss function \(\ell \) satisfies the following properties for some positive number \(s>0\) and a norm \(\left| {\cdot }\right| _{*}\) on \({\mathbb {R}}^{k}\).
-
(i)
For all \({\varepsilon }>0\), there exists \(z=z({\varepsilon })>0\) such that
$$\begin{aligned} (1-{\varepsilon })\left| {{\varvec{\theta }}-{\varvec{\theta }}^{\star }}\right| _{*}^{s}\leqslant \ell ({\varvec{\theta }},{\varvec{\theta }}^{\star })\leqslant (1+{\varepsilon })\left| {{\varvec{\theta }}-{\varvec{\theta }}^{\star }}\right| _{*}^{s}\quad \text {for all } {\varvec{\theta }}\in {\mathcal {B}}_{*}({\varvec{\theta }}^{\star },z). \end{aligned}$$ -
(ii)
There exists a subset \({\mathcal {K}}\subset \Theta \), the interior of which contains \({\varvec{\theta }}^{\star }\), that satisfies for some positive numbers \({\underline{a}}_{{\mathcal {K}}}\) and \(\eta \):
$$\begin{aligned} {\underline{a}}_{{\mathcal {K}}}\left| {{\varvec{\theta }}-{\varvec{\theta }}^{\star }}\right| _{*}^{s}\leqslant \ell ({\varvec{\theta }},{\varvec{\theta }}^{\star })\quad \text {for }{\varvec{\theta }}\in {\mathcal {K}}\text { and for } {\varvec{\theta }}\not \in {\mathcal {K}}\quad \ell ({\varvec{\theta }},{\varvec{\theta }}^{\star })\geqslant \eta >0. \end{aligned}$$(79)
Under these assumptions, we establish the following proposition, the proof of which is postponed to Sect. 10.10.
Proposition 8
Under Assumption 9, at least for n sufficiently large,
8.3 The case of the squared Hellinger loss on a regular statistical model
Of particular interest is the situation where the statistical model \({\mathscr {M}}=\{P_{{\varvec{\theta }}},\; {\varvec{\theta }}\in \Theta \}\), \(\Theta \subset {\mathbb {R}}^{k}\), is regular. There exist several ways of defining a regular model in statistics and we adopt here the definition of Ibragimov and Has’minskiĭ [20].
Definition 1
Let \(\mu \) be a measure on \((E,{\mathcal {E}})\) and \(\Theta \) an open subset of \({\mathbb {R}}^{k}\). The statistical model \({\mathscr {M}}=\{P_{{\varvec{\theta }}}=p_{{\varvec{\theta }}}\cdot \mu ,\; {\varvec{\theta }}\in \Theta \}\) is said to be regular if the family of functions \(\{\zeta _{{\varvec{\theta }}}=\sqrt{p_{{\varvec{\theta }}}},\; {\varvec{\theta }}\in \Theta \}\subset {\mathscr {L}}_{2}(E,{\mathcal {E}},\mu )\) satisfies the following properties.
-
(i)
For \(\mu \)-almost all \(x\in E\), \({\varvec{\theta }}\mapsto \zeta _{{\varvec{\theta }}}(x)\) is continuous.
-
(ii)
For all \({\varvec{\theta }}\in \Theta \), there exists \(\dot{\varvec{\zeta }}_{{\varvec{\theta }}}=({\dot{\zeta }}_{{\varvec{\theta }},1},\ldots ,{\dot{\zeta }}_{{\varvec{\theta }},k}):E\rightarrow {\mathbb {R}}^{k}\) such that
$$\begin{aligned} \int _{E}\left| {\dot{\varvec{\zeta }}_{{\varvec{\theta }}}(x)}\right| ^{2}d\mu (x)<+\infty \end{aligned}$$and
$$\begin{aligned} \int _{E}\left| {\zeta _{{\varvec{\theta }}+\varvec{\epsilon }}(x)-\zeta _{{\varvec{\theta }}}(x) -\langle \dot{\varvec{\zeta }}_{{\varvec{\theta }}}(x),\varvec{\epsilon }\rangle }\right| ^{2}d\mu (x)=o(\left| {\varvec{\epsilon }}\right| ^{2})\quad \text {when }\left| {\varvec{\epsilon }}\right| \rightarrow 0. \end{aligned}$$ -
(iii)
For all \(i\in \{1,\ldots ,k\}\), the mapping \({\varvec{\theta }}\mapsto {\dot{\zeta }}_{{\varvec{\theta }},i}\) is continuous in \( {\mathscr {L}}_{2}(E,{\mathcal {E}},\mu )\).
When the model is regular, the matrix
$$\begin{aligned} {\textbf{J}}({\varvec{\theta }})=4\int _{E}\dot{\varvec{\zeta }}_{{\varvec{\theta }}}(x)\,\dot{\varvec{\zeta }}^{\top }_{{\varvec{\theta }}}(x)\,d\mu (x) \end{aligned}$$
is called the Fisher information matrix.
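As a concrete illustration (an addition of ours, not part of the text), consider the unit-variance Gaussian location model on \(E={\mathbb {R}}\): then \(\zeta _{\theta }(x)=(2\pi )^{-1/4}e^{-(x-\theta )^{2}/4}\), \({\dot{\zeta }}_{\theta }(x)=\tfrac{x-\theta }{2}\zeta _{\theta }(x)\), and, assuming the Ibragimov–Has'minskiĭ normalisation \({\textbf{J}}(\theta )=4\int _{E}{\dot{\zeta }}_{\theta }^{2}\,d\mu \), the Fisher information equals 1. A quick numerical sanity check:

```python
import math

def zeta(x, theta):
    # zeta_theta(x) = sqrt(p_theta(x)) for the N(theta, 1) density
    return (2 * math.pi) ** -0.25 * math.exp(-((x - theta) ** 2) / 4)

def zeta_dot(x, theta):
    # derivative of zeta_theta(x) in theta: ((x - theta) / 2) * zeta_theta(x)
    return 0.5 * (x - theta) * zeta(x, theta)

# J(theta) = 4 * integral of zeta_dot^2 dmu, approximated by a midpoint Riemann sum
theta = 0.7
lo, hi, n = theta - 15.0, theta + 15.0, 200000
step = (hi - lo) / n
J = 4 * step * sum(zeta_dot(lo + (i + 0.5) * step, theta) ** 2 for i in range(n))
assert abs(J - 1.0) < 1e-6  # Fisher information of the unit-variance location model
```

The integral reduces to \({\mathbb {E}}[(X-\theta )^{2}]=1\) for \(X\sim {\mathcal {N}}(\theta ,1)\), which the sum recovers to high accuracy.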
The matrix \({\textbf{J}}({\varvec{\theta }})\) is symmetric and nonnegative and we may therefore consider its square root \({\textbf{J}}^{1/2}({\varvec{\theta }})\), that is, the symmetric \((k\times k)\)-nonnegative matrix that satisfies \({\textbf{J}}^{1/2}({\varvec{\theta }}){\textbf{J}}^{1/2}({\varvec{\theta }})={\textbf{J}}({\varvec{\theta }})\).
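In low dimension the symmetric square root has an explicit form; the following pure-Python sketch for the \(2\times 2\) case uses the standard identity \({\textbf{M}}^{1/2}=({\textbf{M}}+\sqrt{\det {\textbf{M}}}\,{\textbf{I}})/\sqrt{{\text {tr}}\,{\textbf{M}}+2\sqrt{\det {\textbf{M}}}}\) for a symmetric positive definite \({\textbf{M}}\), which is a classical identity and not part of the text:

```python
import math

def sqrtm_2x2_spd(a, b, c):
    # Symmetric square root of the 2x2 SPD matrix [[a, b], [b, c]]:
    # M^{1/2} = (M + sqrt(det M) I) / sqrt(tr M + 2 sqrt(det M))
    d = math.sqrt(a * c - b * b)   # sqrt(det M)
    t = math.sqrt(a + c + 2 * d)   # sqrt(tr M + 2 sqrt(det M))
    return [[(a + d) / t, b / t], [b / t, (c + d) / t]]

# A Fisher-information-like SPD matrix (illustrative values)
J = [[2.0, 0.5], [0.5, 1.0]]
R = sqrtm_2x2_spd(J[0][0], J[0][1], J[1][1])
# Check R @ R == J entrywise
RR = [[sum(R[i][k] * R[k][j] for k in range(2)) for j in range(2)] for i in range(2)]
assert all(abs(RR[i][j] - J[i][j]) < 1e-12 for i in range(2) for j in range(2))
```

The identity is exact: both eigenvalues of the right-hand side are the square roots of those of \({\textbf{M}}\), in the same eigenbasis.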
Regular statistical models enjoy nice metric properties that are described in Proposition 9 below. For a proof we refer the reader to Ibragimov and Has’minskiĭ [20]—Lemma 7.1 page 65, Theorem 7.6 page 81 and its proof.
Proposition 9
Let \(\Theta \) be an open subset of \({\mathbb {R}}^{k}\) and \({\varvec{\theta }}^{\star }\in \Theta \). If \({\mathscr {M}}=\{P_{{\varvec{\theta }}}=p_{{\varvec{\theta }}}\cdot \mu ,\; {\varvec{\theta }}\in \Theta \}\) is regular and the Fisher information matrix \({\textbf{J}}({\varvec{\theta }}^{\star })\) is nonsingular at \({\varvec{\theta }}^{\star }\), Assumption 9-(i) is satisfied with \(\ell =h^{2}\), \(s=2\) and for the norm \(\left| {\cdot }\right| _{*}\) defined by
Besides, for any compact subset \({\mathcal {K}}\subset \Theta \) there exist positive numbers \( {\overline{a}}_{{\mathcal {K}}},{\underline{a}}_{{\mathcal {K}}}\) such that
Using Proposition 8, we immediately infer the following result.
Corollary 4
Let \(\Theta \) be an open subset of \({\mathbb {R}}^{k}\). Assume that \({\mathscr {M}}=\{P_{{\varvec{\theta }}}=p_{{\varvec{\theta }}}\cdot \mu ,\; {\varvec{\theta }}\in \Theta \}\) is regular and that the Fisher information matrix \({\textbf{J}}({\varvec{\theta }}^{\star })\) is nonsingular at \({\varvec{\theta }}^{\star }\in \Theta \). Assume that there exists a compact set \({\mathcal {K}}\subset \Theta \), containing \({\varvec{\theta }}^{\star }\) in its interior, such that \(h({\varvec{\theta }},{\varvec{\theta }}^{\star })\geqslant \eta >0\) for all \({\varvec{\theta }}\not \in {\mathcal {K}}\). Assume furthermore that the density q is continuous and positive at \({\varvec{\theta }}^{\star }\). Then, \(r_{n}(\beta ,P_{{\varvec{\theta }}^{\star }})\leqslant [3/(2a_{1}\gamma \beta )] (k/n)\), at least for n sufficiently large.
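As a numerical illustration of the local equivalence between \(h^{2}\) and a squared norm (an addition of ours, with the convention \(h^{2}(P,Q)=\frac{1}{2}\int (\sqrt{p}-\sqrt{q}\,)^{2}d\mu \)): for the unit-variance Gaussian location model one has the closed form \(h^{2}(P_{\theta },P_{\theta ^{\star }})=1-e^{-(\theta -\theta ^{\star })^{2}/8}\), so that \(h^{2}(P_{\theta },P_{\theta ^{\star }})/|\theta -\theta ^{\star }|^{2}\rightarrow 1/8\), in line with Assumption 9-(i) for \(s=2\):

```python
import math

def hellinger_sq_gauss(t1, t2):
    # h^2(N(t1,1), N(t2,1)) = 1 - exp(-(t1 - t2)^2 / 8), with the convention
    # h^2(P,Q) = (1/2) * int (sqrt(p) - sqrt(q))^2 dmu
    return 1.0 - math.exp(-(t1 - t2) ** 2 / 8.0)

theta_star = 0.3
for eps in [1e-1, 1e-2, 1e-3]:
    # ratio of the loss to its local quadratic approximation |theta - theta*|^2 / 8
    ratio = hellinger_sq_gauss(theta_star + eps, theta_star) / (eps ** 2 / 8.0)
    assert abs(ratio - 1.0) < eps  # two-sided bound of Assumption 9-(i), s = 2
```

The ratio tends to 1 as \(\theta \rightarrow \theta ^{\star }\), which is exactly the two-sided control required by Assumption 9-(i).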
9 Proofs of Theorems 1, 2 and 3
Throughout this section, we fix some \({{\overline{Q}}}\in {\mathscr {M}}\), \({r},\beta >0\) and use the following notation: \({c}_{1}=1+{c}\), \({c}_{2}=2+{c}\),
and for \({r}\in {\mathcal {V}}(\pi ,{{\overline{Q}}})\), \({\mathscr {B}}={\mathscr {B}}({{\overline{Q}}},{r})\) and \(\pi _{{\mathscr {B}}}=\left[ {\pi ({\mathscr {B}})}\right] ^{-1}{\mathbb {1}}_{{\mathscr {B}}}\cdot \pi \).
9.1 Main parts of the proofs of Theorems 1 and 2
Throughout the proofs of these two theorems we fix some positive number z, that will be chosen later on, \(r\geqslant {r}_{n}(\beta ,{{\overline{Q}}})\) and set
It follows from the definition (7) of \(\widehat{\pi }_{{\varvec{X}}}\) that for all \(J\in {\mathbb {N}}\)
In a first step, we prove that for some well chosen values of \(\beta ,z,{r}\) and for J large enough, each of the two terms in the right-hand side of (83) is not larger than \(e^{-\xi }\). To achieve this goal, we bound the first term of the right-hand side of (83) by applying Markov’s inequality
and then by using Lemma 6, we obtain that
We can therefore control \({\mathbb {P}}({^\textsf{c}}{\!{A}}{})\) by choosing z small enough. We bound the second term of (83) by using Lemma 5.
We then finish the proofs of Theorems 1 and 2 as follows. In the context of Theorem 1, we finally establish that for a suitable value of J and all \({{\overline{Q}}}\in {\mathscr {M}}(\beta )\),
By (3), \({\mathscr {B}}({{\overline{Q}}},2^{J}r)\subset {\mathscr {B}}(\overline{P}^{\star },\tau \ell ({\overline{P}}^{\star },{{\overline{Q}}})+\tau 2^{J}r)\) for all \({{\overline{Q}}}\in {\mathscr {M}}(\beta )\), and consequently \({\mathbb {E}}\left[ {\widehat{\pi }_{{\varvec{X}}}\left( {{^\textsf{c}}{\!{{\mathscr {B}}}}{}({\overline{P}}^{\star },{\overline{{r}}}}\right) }\right] \leqslant 2e^{-\xi }\) with
We obtain (16) by monotone convergence, taking a sequence \(({{\overline{Q}}}_{N})_{N\geqslant 0}\subset {\mathscr {M}}(\beta )\) such that \(\ell (\overline{P}^{\star },{{\overline{Q}}}_{N})\) is nonincreasing to \(\inf _{P\in {\mathscr {M}}(\beta )}\ell ({\overline{P}}^{\star },P)\), so that
and (16) holds provided that \(\kappa _{0}\geqslant \tau (2^{J}+1)\).
In the context of Theorem 2, we show that for some suitable value of J and all \({{\overline{Q}}}\in {\mathscr {M}}\),
and we get (28) by arguing similarly.
9.2 Preliminary results
In the proofs of Theorems 1 and 2, we use the following consequence of our Assumption 3. We may write
and we deduce from (5) that for all \(P,Q\in {\mathscr {M}}\),
Besides, using the antisymmetry property (ii) we also obtain that
For the proof of Theorem 2, we additionally use the following consequence of our Assumption 4. By taking \(S={\overline{P}}^{\star }\) and using the convexity of the mapping \(u\mapsto u^{2}\), we deduce that for all \(P,Q\in {\mathscr {M}}\)
and it follows then from Assumption 4 (iv) that for all \(P,Q\in {\mathscr {M}}\)
The proofs of our main results rely on the following lemmas.
Lemma 4
Let (U, V) be a pair of random variables with values in a product space \((E\times F,{\mathcal {E}}\otimes {\mathcal {F}})\) and marginal distributions \(P_{U}\) and \(P_{V}\) respectively. For every measurable function h on \((E\times F,{\mathcal {E}}\otimes {\mathcal {F}})\),
This lemma is proven in Audibert and Catoni [3, Lemma 4.2, p. 28].
Lemma 5
For \(P,Q\in {\mathscr {M}}\), we set
For all \({r}\in {\mathcal {V}}(\pi ,{{\overline{Q}}})\) and \(P\in {\mathscr {M}}\),
Proof
Let \({r}\in {\mathcal {V}}(\pi ,{{\overline{Q}}})\). For \(P,Q\in {\mathscr {M}}\), we set
Then,
Since \(\lambda ={c}_{1}\beta =(1+{c})\beta \), it follows from the convexity of the exponential that
Hence,
Applying Lemma 4 with \(U={\varvec{X}}\), \(V=Q\) with distribution \(\pi _{{\mathscr {B}}}\), and \(h(U,V)=-I({\varvec{X}},P,Q)\), we obtain that
and (88) follows from (89). \(\square \)
Lemma 6
For \(P,Q\in {\mathscr {M}}\), we set
For all \({r}\in {\mathcal {V}}(\pi ,{{\overline{Q}}})\),
Proof
For \(P,Q\in {\mathscr {M}}\), we set
Then,
It follows from the convexity of the exponential and the fact that \(\lambda ={c}_{1}\beta \) that for all \(P\in {\mathscr {M}}\),
Applying Lemma 4 with \(U={\varvec{X}}\), \(V=Q\) with distribution \(\pi \), and \(h(U,V)=-H({\varvec{X}},P,Q)\) we obtain that
We deduce from (90) that for all \(P\in {\mathscr {M}}\)
Applying Lemma 4 with \(U={\varvec{X}}\), \(V=P\) with distribution \(\pi \) and \(h(U,V)=\beta {\textbf{T}}({\varvec{X}},P)\), gives
which together with (91) leads to the result. \(\square \)
The proofs of Theorems 1 and 2 rely on suitable bounds on the Laplace transforms of sums of independent random variables and on a summation lemma. These results are presented below.
Lemma 7
For all \(\beta \in {\mathbb {R}}\) and random variable U with values in an interval of length \(l\in (0,+\infty )\),
Lemma 8
Let U be a square-integrable random variable not larger than \(b> 0\). For all \(\beta >0\),
where \(\phi \) is defined by (24).
The proofs of Lemmas 7 and 8 can be found on pages 21 and 23 in Massart [23] (where our function \(\phi \) is defined as twice his).
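Lemma 7 is Hoeffding's lemma, whose standard form (our reading, since the display is omitted here) bounds \(\log {\mathbb {E}}\,e^{\beta (U-{\mathbb {E}}U)}\) by \(\beta ^{2}l^{2}/8\). A quick numerical sanity check for Bernoulli variables, for which \(l=1\):

```python
import math

def log_mgf_centered_bernoulli(p, beta):
    # log E exp(beta * (U - E U)) for U ~ Bernoulli(p), E U = p
    m = (1 - p) * math.exp(-beta * p) + p * math.exp(beta * (1 - p))
    return math.log(m)

# Hoeffding's lemma: log-MGF of the centered variable is at most beta^2 * l^2 / 8
for p in [0.1, 0.3, 0.5, 0.9]:
    for beta in [-2.0, -0.5, 0.5, 1.0, 3.0]:
        assert log_mgf_centered_bernoulli(p, beta) <= beta ** 2 / 8.0
```

The bound is tight in the limit \(p=1/2\), \(\beta \rightarrow 0\), where the log-moment generating function behaves like \(\beta ^{2}/8\).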
Lemma 9
Let \(J\in {\mathbb {N}}\), \(\gamma >0\) and \({{\overline{Q}}}\in {\mathscr {M}}\). If \({r}\) satisfies \(n\beta a_{1}{r}\geqslant 1\) and (12), for all \(\gamma _{0}>2\gamma \)
with
Besides,
with
Proof
From (12), we deduce by induction that for all \(j\geqslant 0\)
Consequently,
Since \(2^{j}\geqslant j+1\) for all \(j\geqslant 0\) we obtain that
which leads to (94) since \(n\beta a_{1}2^{J}{r}\geqslant n\beta a_{1}{r}\geqslant 1\). Finally, by applying this inequality with \(J=0\) we obtain that
which is (95). \(\square \)
9.3 Proof of Theorem 1
For all \(i\in \{1,\ldots ,n\}\) and \(P,Q,Q'\in {\mathscr {M}}\), let us set
The random variables \(U_{i}\) are independent and, under Assumption 3-(iv), they take their values in an interval of length \(l_{1}={c}+{c}_{1}=1+2{c}\). The \(V_{i}\) are also independent and take their values in an interval of length \(l_{2}={c}_{1}+{c}_{2}=3+2{c}\). Applying Lemma 7, we obtain that
and
By using Assumption 2 and the fact that \({c}_{0}={c}_{1}-{c}a_{0}/a_{1}>0\),
with
It follows from (100) and Assumptions 3-(iii), more precisely its consequences (85) and (86), that
Since \(a_{0}\geqslant a_{1}\) and \({c}_{2}>{c}_{1}\), \({c}_{0}'={c}_{2}(a_{0}/a_{1})-{c}_{1}>0\) and by arguing as above, we obtain similarly that
with
Using (98) and (102), we deduce that for all \(P,Q,Q'\in {\mathscr {M}}\)
with
Using (99) and (103), we obtain similarly that for all \(P,Q,Q'\in {\mathscr {M}}\)
with
Since \(2\gamma<\tau ^{-1}{c}<\tau ^{-1}{c}_{2}\), we may apply Lemma 9 with \(\gamma _{0}=\tau ^{-1}{c}\) and \(\gamma _{0}=\tau ^{-1}{c}_{2}\) successively which leads to
and
with
Putting (107) and (110) together leads to
and since, for all \((P,Q)\in {\mathscr {B}}^{2}\), by definition (108) of \(\Delta _{2}(P,Q)\),
we derive that
We deduce from (84) that
In particular, \({\mathbb {P}}({^\textsf{c}}{\!{A}}{})\leqslant e^{-\xi }\) for z satisfying
Putting (105) and (109) together, we obtain that
It follows from the definition (106) of \(\Delta _{1}(P,Q)\) that for all \(P\in {\mathscr {M}}\) and for all \(Q\in {\mathscr {B}}\),
and consequently, for all \(P\in {\mathscr {M}}\) and \(Q\in {\mathscr {B}}\)
We derive from Lemma 5 that
hence,
Applying Lemma 9 with \(\gamma _{0}=\tau ^{-1}{c}_{0}>2\gamma \) and setting \(e_{2}=\tau ^{-1}{c}_{0}-2\gamma \), we get
with
which together with (114) leads to
Using the definitions (113) of z and (112) of \(\Delta _{2}\) we deduce from (116) that
Setting,
we see that the right-hand side of (117) is not larger than \(-\xi \), provided that
or equivalently if
Choosing \({{\overline{Q}}}\) in \({\mathscr {M}}(\beta )\) and using the inequalities \(a_{1}^{-1}\beta \geqslant {r}_{n}(\beta ,{{\overline{Q}}})\geqslant 1/(\beta na_{1})\), for
we obtain that the right-hand side of (118) satisfies
with \(C_{3}=\max \{1,C_{1},\left[ {l_{1}^{2}+l_{2}^{2}}\right] /8\}\). Inequality (118) is therefore satisfied for \(J\in {\mathbb {N}}\) such that
and we may take
We recall below the list of constants, depending on \(a_{0},a_{1},c,\tau \) and \(\gamma \), that we have used along the proof.
and
9.4 Proof of Theorem 2
The proof follows the same lines as that of Theorem 1. Under Assumption 3-(iv), the random variables \(U_{i}\) and \(V_{i}\) defined by (96) and (97) are not larger than \(b\) with \(b={c}+{c}_{1}=l_{1}\) and \(b={c}_{2}+{c}_{1}=l_{2}\) respectively. Since, under Assumption 4, more precisely its consequence (87),
and
we may apply Lemma 8 and using the notation \(\Lambda _{1}=\tau \phi (\beta l_{1})\), \(\Lambda _{2}=\tau \phi (\beta l_{2})\) and Assumption 1, we get
and similarly
It follows from (102) that
Using the definitions (25) of \({\overline{{c}}}_{1}\) and (26) of \({\overline{{c}}}_{2}\), that is,
and setting
and arguing as in the proof of inequality (105), we deduce from (120) that
It follows from (103) that
Using the definition (27) of \({\overline{{c}}}_{3}\), that is,
and setting
and arguing as in the proof of (107), we deduce from (121) that
Under our assumption on \(\beta \), we know that the quantities \({\overline{{c}}}_{2}\) and \({\overline{{c}}}_{3}\) are positive and that \(2\gamma <\tau ^{-1}\left( {{\overline{{c}}}_{2}\wedge {\overline{{c}}}_{3}}\right) \). We may therefore apply Lemma 9 with \(\gamma _{0}=\tau ^{-1} {\overline{{c}}}_{2}\) and \(\gamma _{0}=\tau ^{-1} {\overline{{c}}}_{3}\) successively and get
and
with
Putting (123) and (125) together, we obtain that for all \((P,Q)\in {\mathscr {B}}^{2}\)
Consequently,
We deduce from (84) that
In particular, \({\mathbb {P}}({^\textsf{c}}{\!{A}}{})\leqslant e^{-\xi }\) for z satisfying
Putting (122) and (124) together, we obtain that for all \(Q\in {\mathscr {B}}\)
We derive from Lemma 5 that
and consequently,
Since under our assumptions \( {\overline{{c}}}_{1}>0\) and \(2\gamma <\tau ^{-1} {\overline{{c}}}_{1}\), we may apply Lemma 9 with \(\gamma _{0}=\tau ^{-1} {\overline{{c}}}_{1}\) and, setting \(e_{8}=\tau ^{-1} {\overline{{c}}}_{1}-2\gamma \), we get
with
which together with (128) leads to
Using the definition (127) of z, we deduce that
The right-hand side is not larger than \(-\xi \) provided that
Using the fact that \({r}_{n}(\beta ,{{\overline{Q}}})\geqslant 1/(n \beta a_{1})\), with the choice
the right-hand side of (131) satisfies
Inequality (131) holds for \(J\in {\mathbb {N}}\) such that
and we may take
In complement to the constants listed at the end of the proof of Theorem 1, we recall that
and
9.5 Proof of Theorem 3
Let us take \({r}\geqslant {\varepsilon }\) and set \(\varpi =2\xi +1\) so that
In order to prove the first part, let us go back to the proof of Theorem 1. Clearly,
and similarly,
Inequalities (109) and (110) are therefore satisfied with \(\Xi _{1}=\log (1+e^{-1})\). Moreover,
We deduce from (114) that
and consequently,
Using the definitions (113) of z and (112) of \(\Delta _{2}\), we deduce that
where the constants \(C_{1}\) and \(C_{2}\) are the same as those defined in the proof of Theorem 1. If we choose \(r=\ell ({\overline{P}}^{\star },{{\overline{Q}}})\vee (\beta /a_{1})\vee {\varepsilon }\) and J such that \(\tau ^{-1}c_{0}2^{J}\geqslant C_{1}+C_{2}+(l_{1}^{2}+l_{2}^{2})/8\), we obtain that
since \(\varpi =2\xi +1\geqslant 2(\xi +\Xi _{1})\). We conclude as in the proof of Theorem 1.
In order to prove the second part of Theorem 3, we go back to the proof of Theorem 2. The arguments are similar. As before,
and
Inequalities (124) and (125) are therefore both satisfied with \({\overline{\Xi }}_{1}=\log (1+e^{-1})\). Moreover
and we deduce from (128) that
Using the definition (127) of z, we deduce that
Taking \(r=\ell ({\overline{P}}^{\star },{{\overline{Q}}})\vee {\varepsilon }\geqslant {\varepsilon }\) and \(J\geqslant 0\) such that
we obtain that
and we conclude as before.
10 Other proofs
10.1 Proof of Lemma 1
Let Y be a random variable with gamma distribution \(\gamma (s,1)\). Since \(\sigma Y\sim \gamma (s,\sigma )\), it is sufficient to prove the result for \(\sigma =1\). Using the inequality \(\log (1-x)\geqslant -x/(1-x)\) which holds for all \(x\in [0,1)\), we obtain that
Applying Lemma 8.2 in Birgé [10] with \(a=\sqrt{s}\) and \(b=1\), we obtain that
which proves (44). Let us now turn to the lower bound. For \(x\geqslant 0\), let us set
For all \(t,u\geqslant 0\),
where \({\overline{F}}(z)={\mathbb {P}}\left[ {{\mathcal {N}}(0,1)\geqslant z}\right] \) for all \(z\in {\mathbb {R}}\). Using the following inequalities
that can be found in Whittaker and Watson [25, p. 253], with \(t=s-1>0\), we deduce that
Using the fact that \({\overline{F}}({\overline{\Phi }}^{-1}(z))=e^{-z}\) for all \(z\geqslant 0\), we obtain that for the choice
which is nonnegative for \(\xi \geqslant \log 2+1/(12t)\), the quantity \({\mathbb {P}}\left[ {Y\geqslant t+u}\right] \) is at least \(e^{-\xi }\), which proves (45).
10.2 Proof of Theorem 4
Throughout this proof, \(a_{0}=2\), \(a_{1}=3/16\), \(\beta =2\gamma =1/500\) and \(\kappa \) denotes a positive numerical constant that may vary from line to line. It follows from Corollary 4 that for n large enough, \(r_{n}(\beta ,P_{\theta ^{\star }})\leqslant r_{n}^{\star }=\kappa k/n\). Applying our Corollary 2 with \(\ell =h^{2}\) (and \(2\xi \) in place of \(\xi \)), we obtain that for n large enough, with a probability at least \(1-2e^{-\xi }\),
We know by Proposition 9 that under the assumptions of Corollary 4, Assumption 9-(i) is satisfied with \(s=2\), \(\left| {\cdot }\right| _{*}\) given by (81) and \({\varepsilon }=1/2\). This implies that for n large
which leads to (46).
10.3 Proof of Proposition 4
Let us denote by \(F_{\sigma }\) the distribution function of \(\nu _{\sigma }\). Throughout this proof, we fix some \(\theta ^{\star }\in [-\sigma t, \sigma t]\). Our aim is to prove that \(P_{\theta ^{\star }}\) belongs to \({\mathscr {M}}({\overline{\beta }})\).
Since the total variation distance is translation invariant, \(\left\| {P_{\theta }-P_{\theta ^{\star }}}\right\| = \left\| {P_{\theta -\theta ^{\star }}-P_{0}}\right\| =\left\| {P_{\theta ^{\star }-\theta }-P_{0}}\right\| \) and consequently, for all \(r\in [0,1)\),
while for \(r\geqslant 1\), \(\left\{ {\theta \in \Theta ,\; \left\| {P_{\theta }-P_{\theta ^{\star }}}\right\| \leqslant r}\right\} =\Theta ={\mathbb {R}}\), since the total variation distance is at most 1.
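The translation invariance used above can be checked numerically; the Gaussian example, the grid integration and the closed form \(\left\| {{\mathcal {N}}(d,1)-{\mathcal {N}}(0,1)}\right\| =2\Phi (|d|/2)-1\) below are illustrative additions of ours, not taken from the text:

```python
import math

def tv_gauss(m1, m2, s=1.0, n=100000, lo=-20.0, hi=20.0):
    # Total variation ||N(m1, s^2) - N(m2, s^2)|| = (1/2) * int |p - q|,
    # approximated by a midpoint Riemann sum on [lo, hi].
    step = (hi - lo) / n
    def pdf(x, m):
        return math.exp(-((x - m) / s) ** 2 / 2) / (s * math.sqrt(2 * math.pi))
    return 0.5 * step * sum(
        abs(pdf(lo + (i + 0.5) * step, m1) - pdf(lo + (i + 0.5) * step, m2))
        for i in range(n)
    )

# translation invariance: ||P_theta - P_{theta*}|| = ||P_{theta - theta*} - P_0||
a = tv_gauss(1.7, 0.4)
b = tv_gauss(1.3, 0.0)
assert abs(a - b) < 1e-5
# closed form for equal variances: 2 * Phi(d / 2) - 1 = erf(d / (2 * sqrt(2)))
assert abs(b - math.erf(1.3 / (2 * math.sqrt(2)))) < 1e-4
```

Both checks confirm that the distance depends on the locations only through \(|\theta -\theta ^{\star }|\), which is the property the proof exploits.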
We set \(r_{0}=\sup \{r>0,\; \varphi (r)\leqslant \sigma t\}\) and distinguish between two cases.
Case 1 Assume \(r_{0}\leqslant 1/4\). For all \(r<r_{0}\), \(\varphi (r)<\sigma t\), \(2r<1\), and since q is symmetric, positive and decreasing on \({\mathbb {R}}_{+}\),
For all \(r_{0}<r<1\), \(|\theta ^{\star }|\leqslant \sigma t< \varphi (r)\), hence \(F_{\sigma }(|\theta ^{\star }|-\varphi (r))\leqslant F_{\sigma }(0)=1/2\) and \(F_{\sigma }(|\theta ^{\star }|+\varphi (r))\geqslant F_{\sigma }(\varphi (r))\geqslant F_{\sigma }(\sigma t)= F_{1}(t)\geqslant 3/4\) under our assumption on t. Consequently,
Note that the result also holds for \(r=r_{0}\) by letting r decrease to \(r_{0}\).
Case 2 Assume that \(r_{0}>1/4\). Then \(\varphi (1/4)\leqslant \sigma t\) and arguing as before, we obtain that for all \(r\leqslant 1/4<r_{0}\),
For all \(r\in (1/4,1)\), \(\varphi (r)\geqslant \varphi (1/4)\) and
In both cases, we obtain that for all \(r\in (0,1)\) and \(\theta ^{\star }\in [-\sigma t,\sigma t]\),
The inequality is also clearly true for \(r\geqslant 1\) since then \(\pi ({\mathscr {B}}(P_{\theta ^{\star }},2r))=\pi ({\mathscr {B}}(P_{\theta ^{\star }},r))=1\). Hence, for all \(r\geqslant a_{1}^{-1}\beta \)
The right-hand side is not larger than \(\beta \) provided that it satisfies (50) and this lower bound is not smaller than \(1/\sqrt{n}\) since \(\gamma \leqslant 1\). We conclude by using (15).
10.4 Proof of Proposition 5
Under our assumption on q, Assumption 6 is satisfied and
Let \(t=(|\theta |/\sigma )\vee t_{0}\). Then, \(\theta \in [-\sigma t,\sigma t]\), \(\nu _{1}([t,+\infty ))\leqslant 1/4\) and inequality (134) holds true. We deduce from (11) that
and the result follows from our specific choices of \(a_{1},\gamma \) and \(\beta \).
10.5 Proof of Corollary 3
We set for short \(\Theta =\Theta [\eta ,\delta ]\) with the parameters \(\eta \) and \(\delta \) defined by (60) and (61) respectively and also define
so that \({\mathscr {M}}_{n}(K)\) contains the elements \(P=P_{(p,{\textbf{m}},\sigma )}\) of \({\mathscr {M}}\) such that
Hereafter we fix \(P=P_{(p,{\textbf{m}},\sigma )}\in {\mathscr {M}}_{n}(K)\). There exists \(\theta =\theta (P)=({{\overline{Q}}},{\overline{{\textbf{m}}}},{\overline{\sigma }})\in \Theta \) with \({\overline{\sigma }}=(1+\delta )^{j_{0}}\), \({\overline{{\textbf{m}}}}=\overline{\sigma }\delta {\textbf{j}}\), \((j_{0},{\textbf{j}})\in {\mathbb {Z}}\times {\mathbb {Z}}^{k}\) such that
for all \(i\in \{1,\ldots ,k\}\). Consequently,
and we infer from (56) and (57) and the fact that the total variation loss is translation and scale invariant that \(P_{\theta }\) satisfies
Besides, the parameters \((j_{0},{\textbf{j}})\in {\mathbb {Z}}\times {\mathbb {Z}}^{k}\) can be controlled in the following way. Using that \(\sigma \leqslant \overline{\sigma }\), the inequality \(\log (1+\delta )\leqslant \delta \) and (137), we obtain that for all \(i\in \{1,\ldots ,k\}\),
Besides,
and using the inequality \(\log (1+2x)\leqslant 2\log (1+x)\), which holds for all \(x\geqslant 0\) since \((1+x)^{2}\geqslant 1+2x\), we obtain that
Putting these inequalities together and using the fact that \(P\in {\mathscr {M}}_{n}(K)\), we get
For all \(r>0\), \(e^{-L_{\theta }}\leqslant \pi \left( {{\mathscr {B}}(P_{\theta },r)}\right) \leqslant 1\) and these two inequalities together with the definition (60) of \(\eta \) and Assumption 7 imply that for all \(r>0\)
Using (138), the definition (135) of \(J_{n}\) and the fact that \(\log (2+x)\leqslant \log 3+\log x\) for all \(x\geqslant 1\), we derive that
and since \(\gamma =1/6\leqslant L'=L+\log 9<3.1\),
For the choice of \(\beta =\beta _{n}\) given by (62),
hence, \(r_{n}(\beta ,P_{\theta })\leqslant a_{1}^{-1}\beta \) and \(P_{\theta }\in {\mathscr {M}}(\beta )\). This implies that
and the result follows by applying Corollary 1 and by using the fact that P is arbitrary in \({\mathscr {M}}_{n}(K)\).
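The discretization of \(({\textbf{m}},\sigma )\) onto the grid \({\overline{\sigma }}=(1+\delta )^{j_{0}}\), \({\overline{{\textbf{m}}}}={\overline{\sigma }}\delta {\textbf{j}}\) used at the start of this proof can be sketched as follows; the specific rounding rules (\(j_{0}\) by rounding up, \(j_{i}\) by rounding to the nearest integer) are our own illustrative choices, not taken verbatim from the text:

```python
import math

def discretize(m, sigma, delta):
    # Pick sigma_bar = (1+delta)^{j0} with sigma <= sigma_bar <= (1+delta)*sigma,
    # then snap each location coordinate to the grid sigma_bar * delta * Z.
    j0 = math.ceil(math.log(sigma) / math.log(1 + delta))
    sigma_bar = (1 + delta) ** j0
    j = [round(mi / (sigma_bar * delta)) for mi in m]
    m_bar = [sigma_bar * delta * ji for ji in j]
    return j0, j, sigma_bar, m_bar

j0, j, sigma_bar, m_bar = discretize([0.7, -2.3], sigma=1.9, delta=0.05)
assert 1.9 <= sigma_bar <= 1.9 * 1.05              # scale approximated within a (1+delta) factor
assert all(abs(mb - mi) <= sigma_bar * 0.05 / 2    # each location within half a grid step
           for mb, mi in zip(m_bar, [0.7, -2.3]))
```

With these choices, the scale is matched up to a factor \(1+\delta \) and each location coordinate up to \({\overline{\sigma }}\delta /2\), which is the kind of control the proof requires before invoking the translation and scale invariance of the total variation loss.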
10.6 Proof of Lemma 2
For all \(p\in {\mathcal {M}}_{0}\), \(\sigma \geqslant 1\) and \({\textbf{m}}\in {\mathbb {R}}^{k}\), the supports of the functions \({\varvec{x}}\mapsto p({\varvec{x}}/\sigma )\) and \({\varvec{x}}\mapsto p(({\varvec{x}}-{\textbf{m}})/\sigma )\) are included in the set \({\mathcal {K}}=[0,\sigma ]^{k}\cup \{{\textbf{m}}+{\varvec{x}},\; {\varvec{x}}\in [0,\sigma ]^{k}\}\) the Lebesgue measure of which is not larger than \(2\sigma ^{k}\). Consequently, using (66), we deduce that for all \(p\in {\mathcal {M}}_{0}\), \(\sigma \geqslant 1\) and \({\textbf{m}}\in {\mathbb {R}}^{k}\),
and (57) is therefore satisfied with \(A=L_{1}\vee [(1+L_{1}k^{s/2}+L_{0})/2]\).
10.7 Proof of Lemma 3
By making the change of variables \(u=x-m\) in (68) if necessary, we may assume with no loss of generality that \(m>0\). Then, since p is nonincreasing on \((0,+\infty )\) and vanishes elsewhere, \(p(x-m)\geqslant p(x)\) for all \(x\geqslant m\) and \(p(x)\geqslant p(x-m)=0\) for all \(x\in (0,m)\). Consequently,
and we obtain (68).
Since \(\sigma \geqslant 1\), \(p(x/\sigma )\geqslant p(x)\) and \(p(x)/\sigma \leqslant p(x)\) for all \(x>0\). Hence,
which leads to (67).
Finally, by combining (68) and (67) we deduce that for all \(m\in {\mathbb {R}}\) and \(\sigma \geqslant 1\)
which yields (69).
10.8 Proof of Proposition 6
This proposition is a consequence of Corollary 2. Let us first check that the assumptions of this corollary are satisfied. For all \(S\in {\mathscr {P}}\), the mapping \({\varvec{\theta }}\mapsto h(S,P_{\theta })\) is continuous because of (71). It is therefore measurable and it follows from the definition of the algebra \({\mathcal {A}}\) that Assumption 1 is satisfied. Since the mapping \((x,{\varvec{\theta }})\mapsto p(x,{\varvec{\theta }})\) is measurable, so are the mappings
and \( (x,\varvec{\theta },\varvec{\theta }')\mapsto \psi \left( {\sqrt{p_{\varvec{\theta }'}(x)/p_{\varvec{\theta }}(x)}}\right) \), since \(\psi \) is measurable. We deduce that \((x,P,P')\mapsto t_{(P,P')}(x)\) is measurable on \((E\times {\mathscr {M}}\times {\mathscr {M}},{\mathcal {E}}\otimes {\mathcal {A}}\otimes {\mathcal {A}})\) which proves that Assumption 3-(i) holds true. The requirements of Corollary 2 are therefore satisfied and we may apply it. In order to evaluate the quantity \(r_{n}(\beta ,P_{{\varvec{\theta }}})\) for \({\varvec{\theta }}\in {\mathbb {R}}^{k}\), we use the following lemma the proof of which is postponed to Sect. 10.9.
Lemma 10
Let \(\varvec{\theta }\in [-R,R]^{k}\). For all \(m\subset \{1,\ldots ,k\}\) and \(r>0\)
with the convention \(\prod _{{\varnothing }}=1\). In particular, if \(\varvec{\theta }\in \Theta _{m}(R)\) and
and for all \(K>1\)
Let us set \(B=B_{k}\) for short and define \(m^{\star }\) as the subset of \(\{1,\ldots ,k\}\) that minimizes over all \(m\subset \{1,\ldots ,k\}\) the mapping
Finally, let \(\varvec{\theta }^{\star }\) be an arbitrary element of \(\Theta _{m^{\star }}(R)\). It follows from (71) and (139) that for all \(r>0\),
where the last inequality holds true under the assumption that \(RB^{1/s}\geqslant 1\).
We deduce from (141) that for all \(r>0\)
Provided that
we obtain
and deduce from (142) that \( r_{n}(\beta ,P_{\varvec{\theta }^{\star }})\) defined by (11) satisfies
Applying Corollary 2, we obtain that for some numerical constant \(\kappa _{0}'>0\),
with
Finally, the conclusion follows from the definition of \(m^{\star }\) and the fact that \(\varvec{\theta }^{\star }\) is arbitrary in \(\Theta _{m^{\star }}(R)\).
10.9 Proof of Lemma 10
Let \(\nu \) be the uniform distribution on \([-R,R]\). For all \(\theta \in [-R,R]\) and \(r>0\),
Let now \(\varvec{\theta }\in {\mathbb {R}}^{k}\) such that \(\left| {\varvec{\theta }}\right| _{\infty }\leqslant R\). For all \(m\subset \{1,\ldots ,k\}\), \(m\ne {\varnothing }\),
if there exists \(i\not \in m\) such that \(|\theta _{i}|>r\). Otherwise
If \(m={\varnothing }\),
Let us now turn to the proof of (140). Since \(\varvec{\theta }\in \Theta _{m}(R)\), for all \(K'\in \{1,K\}\)
It is therefore enough to show that for all \(r>0\) and \(\theta \in [0,R]\)
This is what we do now by distinguishing between several cases.
When \(\theta +Kr\leqslant R\), \(\theta -Kr\geqslant 2\theta -R\geqslant -R\) and consequently, \(\Delta (r)=K\). When \(\theta +Kr>R\) and \(-R\leqslant \theta -Kr\),
and the conclusion follows from the fact that \(0\leqslant R-\theta \leqslant Kr\). When \(\theta +Kr> R\) and \(\theta -Kr< -R\), \(r\geqslant (\theta +R)/K\geqslant R/K\), hence \(R+r-\theta \geqslant 2R/K\) and \(R\leqslant Kr\). Consequently,
which concludes the proof.
10.10 Proof of Proposition 8
Let \({\varepsilon }\) be a small enough positive number. Since q is continuous and positive at \({\varvec{\theta }}^{\star }\) and since \({\mathcal {K}}\) has a nonempty interior, there exists \(z^{\star }>0\) such that \(\Theta ^{\star }={\mathcal {B}}_{*}({\varvec{\theta }}^{\star },z^{\star })\subset {\mathcal {K}}\),
for all \({\varvec{\theta }}\in \Theta ^{\star }\) and
In particular, \(\nu (\Theta ^{\star })>0\) and we may define the distribution \(\nu ^{\star }=\nu (\cdot \cap \Theta ^{\star })/\nu (\Theta ^{\star })\) on \(\Theta ^{\star }\) with density \(q^{\star }=q{\mathbb {1}}_{\Theta ^{\star }}/\nu (\Theta ^{\star })\). Let \({\mathscr {M}}^{\star }=\{P_{{\varvec{\theta }}},\; {\varvec{\theta }}\in \Theta ^{\star }\}\) and \(\pi ^{\star }\) be the prior on \({\mathscr {M}}^{\star }\) associated with \(\nu ^{\star }\). The parameter space \(\Theta ^{\star }\) is convex and it follows from (144) that \((\Theta ^{\star },{\varvec{\theta }}^{\star },\ell , \nu ^{\star })\) satisfies Assumption 8-(i) with \(\overline{a}=1+{\varepsilon }, {\underline{a}}=1-{\varepsilon }\). Besides, it follows from (143) that the density \(q^{\star }\) satisfies condition (77) on \(\Theta ^{\star }\). We may apply Proposition 7 and deduce that for the model \(({\mathscr {M}}^{\star },\pi ^{\star })\), \(r_{n}^{\star }=r_{n}^{\star }(\beta ,P_{{\varvec{\theta }}^{\star }})\) is not larger than \(\kappa _{0}^{\star }k/(\beta n)\) with
for \({\varepsilon }\) small enough. Consequently, by definition of \(r_{n}^{\star }\), for all \(r\geqslant r_{n}^{\star }\)
Let \(r_{1}=[(z^{\star })^{s}{\underline{a}}_{{\mathcal {K}}}\wedge \eta ]/2\). If \(r\in (0,r_{1})\) and the parameter \({\varvec{\theta }}\in \Theta \) satisfies \(\ell ({\varvec{\theta }},{\varvec{\theta }}^{\star })\leqslant 2r\), then \(\ell ({\varvec{\theta }},{\varvec{\theta }}^{\star })<\eta \) and \({\varvec{\theta }}\) necessarily belongs to \({\mathcal {K}}\) under Assumption 9-(ii). Applying (79) we deduce that for such a parameter \({\varvec{\theta }}\in \Theta \)
which implies that \({\varvec{\theta }}\in \Theta ^{\star }\). For n large enough, \(r_{n}^{\star }=\kappa _{0}^{\star }k/n<r_{1}\) and for \(r\in (r_{n}^{\star },r_{1})\) we may therefore write, using (145),
Since q is bounded away from 0 in a neighbourhood of \({\varvec{\theta }}^{\star }\), \(\pi \left( {{\mathscr {B}}(P_{{\varvec{\theta }}^{\star }},r_{1})}\right) >0\) and we may also write that for \(r\geqslant r_{1}\) and n large enough
Putting (146) and (147) together we obtain that for n large enough
and consequently that \(r_{n}(\beta ,P_{{\varvec{\theta }}^{\star }})\leqslant r_{n}^{\star }=\kappa _{0}^{\star }k/n\).
References
Alquier, P.: PAC-Bayesian bounds for randomized empirical risk minimizers. Math. Methods Stat. 17(4), 279–304 (2008)
Atchadé, Y.A.: On the contraction properties of some high-dimensional quasi-posterior distributions. Ann. Stat. 45(5), 2248–2273 (2017)
Audibert, J.-Y., Catoni, O.: Linear regression through PAC-Bayesian truncation. arXiv:1010.0072 (2011)
Baraud, Y.: Tests and estimation strategies associated to some loss functions. Probab. Theory Relat. Fields 180(3), 799–846 (2021)
Baraud, Y., Birgé, L.: Rho-estimators revisited: general theory and applications. Ann. Stat. 46(6B), 3767–3804 (2018)
Baraud, Y., Birgé, L.: Robust Bayes-like estimation: Rho-Bayes estimation. Ann. Stat. 48(6), 3699–3720 (2020)
Baraud, Y., Birgé, L., Sart, M.: A new method for estimation and model selection: \(\rho \)-estimation. Invent. Math. 207(2), 425–517 (2017)
Bhattacharya, A., Pati, D., Yang, Y.: Bayesian fractional posteriors. Ann. Stat. 47(1), 39–66 (2019)
Birgé, L.: Approximation dans les espaces métriques et théorie de l’estimation. Z. Wahrsch. Verw. Gebiete 65(2), 181–237 (1983)
Birgé, L.: An alternative point of view on Lepski’s method. In: State of the Art in Probability and Statistics (Leiden, 1999), Volume 36 of IMS Lecture Notes Monograph Series, pp. 113–133. Institute of Mathematical Statistics, Beachwood (2001)
Birgé, L.: Model selection via testing: an alternative to (penalized) maximum likelihood estimators. Ann. Inst. H. Poincaré Probab. Stat. 42(3), 273–325 (2006)
Birgé, L.: About the non-asymptotic behaviour of Bayes estimators. J. Stat. Plan. Inference 166, 67–77 (2015)
Birgé, L., Massart, P.: Minimum contrast estimators on sieves: exponential bounds and rates of convergence. Bernoulli 4(3), 329–375 (1998)
Birman, M.Š., Solomjak, M.Z.: Piecewise polynomial approximations of functions of classes \(W_{p}^{\alpha }\). Mat. Sb. (N.S.) 73(115), 331–355 (1967)
Bissiri, P.G., Holmes, C.C., Walker, S.G.: A general framework for updating belief distributions. J. R. Stat. Soc. Ser. B. Stat. Methodol. 78(5), 1103–1130 (2016)
Catoni, O.: Statistical learning theory and stochastic optimization. In: Lecture Notes from the 31st Summer School on Probability Theory Held in Saint-Flour, July 8–25, 2001. Springer, Berlin (2004)
Chernozhukov, V., Hong, H.: An MCMC approach to classical estimation. J. Econom. 115(2), 293–346 (2003)
Ghosal, S., Ghosh, J.K., van der Vaart, A.W.: Convergence rates of posterior distributions. Ann. Stat. 28(2), 500–531 (2000)
Ibragimov, I.A., Has’minskiĭ, R.Z.: Statistical Estimation. Asymptotic Theory, vol. 16. Springer, New York (1981)
Jiang, W., Tanner, M.A.: Gibbs posterior for variable selection in high-dimensional classification and data mining. Ann. Stat. 36(5), 2207–2231 (2008)
Le Cam, L.: Convergence of estimates under dimensionality restrictions. Ann. Stat. 1, 38–53 (1973)
Massart, P.: Concentration Inequalities and Model Selection, Volume 1896 of Lecture Notes in Mathematics. Springer, Berlin. Lectures from the 33rd Summer School on Probability Theory held in Saint-Flour, July 6–23, 2003 (2007)
van der Vaart, A.W.: Asymptotic Statistics, Volume 3 of Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, Cambridge (1998)
Whittaker, E.T., Watson, G.N.: A Course of Modern Analysis. Cambridge Mathematical Library. Cambridge University Press, Cambridge. An introduction to the general theory of infinite processes and of analytic functions; with an account of the principal transcendental functions, Reprint of the fourth (1927) edition (1996)
Acknowledgements
The author is grateful to Lucien Birgé and the two anonymous referees for their support and suggestions, which contributed to improving a previous version of the present paper.
This project has received funding from the European Union’s Horizon 2020 research and innovation programme under Grant Agreement No. 811017.
Cite this article
Baraud, Y. From robust tests to Bayes-like posterior distributions. Probab. Theory Relat. Fields 188, 159–234 (2024). https://doi.org/10.1007/s00440-023-01222-8
Keywords
- Bayes procedure
- Gibbs estimator
- Posterior distribution
- Robustness
- Hellinger distance
- Total variation distance