Abstract
We consider Bayesian nonparametric density estimation with a Dirichlet process kernel mixture as a prior on the class of Lebesgue univariate densities, the emphasis being on the achievability of the error rate \(n^{-1/2}\), up to a logarithmic factor, depending on the kernel. We derive rates of convergence of the Bayes estimator for super-smooth densities that are location-scale mixtures of densities whose Fourier transforms have sub-exponential tails. We show that a nearly parametric rate is attainable in the \(L^1\)-norm, under weak assumptions on the tail decay of the true mixing distribution and the overall Dirichlet process base measure.
1 Introduction
Consider the estimation of a density \(f_0\) on \(\mathbb{R}\) from observations \(X_1,\dots,X_n\), taking a Bayesian nonparametric approach. A prior is defined on a metric space of probability measures with Lebesgue density, and a summary of the posterior, e.g., the posterior expected density, is employed. The so-called “what if” approach, which consists in investigating frequentist asymptotic properties of the posterior under the non-Bayesian assumption that the data are generated from a fixed density, provides a way to validate priors on infinite-dimensional spaces. Desirable asymptotic properties of posterior distributions are consistency, a minimax-optimal concentration rate of the posterior mass around the “truth” as the sample size grows, possibly with full adaptation to the regularity level of \(f_0\) if unknown, and distributional convergence. For bounded and convex distances, posterior contraction rates yield upper bounds on convergence rates of the Bayes estimator, which motivates the interest in their study. Since the seminal articles of Ferguson [2] and Lo [4], the idea of constructing priors on spaces of densities by convolving a fixed kernel with a random distribution has been successfully exploited in density estimation. Even if much progress has been made during the last decade in understanding frequentist asymptotic properties of mixture models, the choice of the kernel is a topic largely ignored in the literature, except for the article of Wu and Ghosal [9], mainly focused on consistency. Posterior contraction rates for Dirichlet process kernel mixture priors have been investigated by Ghosal and van der Vaart [3] and Scricciolo [5]. One key message is that some constraints on the regularity of the kernel and on the tail decay of the true mixing distribution are necessary to accurately estimate a density.
Most of the literature has dealt with the estimation of mixtures with a normal (or generalized normal) kernel and a mixing distribution having either compact support or sub-exponential tails, finding a nearly parametric rate, up to a logarithmic factor, in the \(L^1\)-distance; there are almost no results beyond the Gaussian kernel. The aim of this work is to contribute to the understanding of the role of the kernel choice in density estimation with a Dirichlet process mixture prior. The main result states that a nearly parametric rate can be attained when estimating mixtures of super-smooth densities, i.e., densities having exponentially decaying Fourier transforms, whatever the tail decay of the kernel; heavy-tailed distributions, like the Student’s-t or the Cauchy, are included, which have proved extremely useful in accurately modeling different kinds of financial data. For example, individual stock indices can be modeled by stable laws, and multivariate stable laws have been fruitfully used in computer networks, see Bickson and Guestrin [1]. The assumption of exponential tail decay of the true mixing distribution seems unavoidable in order to find a finite approximating mixture with a sufficiently restricted number of support points. This step is a delicate mathematical point of the proof, see Lemma 1. Such an approximation result, which is reported in the Appendix, may be of independent interest as well. In Sect. 2, we fix the notation and present the result.
2 Main Result
We derive rates for location-scale mixtures of super-smooth densities. The model is \(f_{F,\,G}(x):=\int _{ 0}^{\infty }(F {\ast} K_{\sigma })(x)\,\mathrm{d}G(\sigma )\), \(x \in \mathbb{R}\), where K is a kernel density, \(F \sim D_{\alpha }\) is a Dirichlet process with base measure \(\alpha:=\alpha (\mathbb{R})\bar{\alpha }\), for \(0 <\alpha (\mathbb{R}) <\infty\) and \(\bar{\alpha }\) a probability measure on \(\mathbb{R}\), and \(G \sim D_{\beta }\), with finite and positive base measure β on \((0,\,\infty )\). We assume that \(f_{0} = f_{F_{0},\,G_{0}}\), with \(F_0\) and \(G_0\) denoting the true mixing distributions for the location and scale parameters, respectively. We use the following assumptions.
(A)
The true mixing distribution \(G_0\) for the scale parameter satisfies
$$\displaystyle{ \int _{0}^{\infty }\sigma \,\mathrm{d}G_{0}(\sigma ) <\infty \qquad \mathrm{and}\qquad \int _{0}^{\infty }\frac{1} {\sigma } \,\mathrm{d}G_{0}(\sigma ) <\infty. }$$ (1)
Also, for constants \(d_1,\,d_2 > 0\) and \(0 <\gamma _{1}^{0},\,\gamma _{2}^{0} \leq \infty\),
$$\displaystyle{G_{0}(s) \lesssim e^{-d_{1}s^{-\gamma _{1}^{0}} }\quad \mathrm{as\ }s \rightarrow 0\quad \,\mathrm{and}\quad \,1 - G_{0}(s) \lesssim e^{-d_{2}s^{\gamma _{2}^{0}} }\quad \mathrm{as\ }s \rightarrow \infty.}$$
(B)
The base measure β of the Dirichlet process prior for G has a continuous and positive Lebesgue density β′ on \((0,\,\infty )\) such that, for constants \(C_j,\,D_j > 0\), \(j = 1,\dots,4\), \(q_1,\,q_2,\,r_1,\,r_2 \geq 0\) and \(0 <\gamma _{1},\,\gamma _{2} \leq \infty\),
$$\displaystyle{ C_{1}\sigma ^{-q_{1} }e^{-C_{2}\sigma ^{-\gamma _{1}}(\log (1/\sigma ))^{r_{1}} } \leq \beta '(\sigma ) \leq C_{3}\sigma ^{-q_{1} }e^{-C_{4}\sigma ^{-\gamma _{1}}(\log (1/\sigma ))^{r_{1}} } }$$ (2)
for all \(\sigma\) in a neighborhood of 0, and
$$\displaystyle{ D_{1}\sigma ^{q_{2} }e^{-D_{2}\sigma ^{\gamma _{2}}(\log \sigma )^{r_{2}} } \leq \beta '(\sigma ) \leq D_{3}\sigma ^{q_{2} }e^{-D_{4}\sigma ^{\gamma _{2}}(\log \sigma )^{r_{2}} } }$$ (3)
for all \(\sigma\) large enough.
Remark 1
The right-hand side requirement in (1) has also been postulated by Tokdar [7], see condition 3 of Lemma 5.1 and condition 4 of Theorem 5.2, pp. 102–103. If, for example, \(G_0\) is an \(\mathrm{IG}(\nu,\,\lambda )\), with shape parameter ν > 0 and scale parameter \(\lambda> 0\), then \(\int _{0}^{\infty }\sigma ^{-1}\,\mathrm{d}G_{0}(\sigma ) =\nu /\lambda <\infty\). If \(G_0\) is a right-truncated distribution, then the requirement on the upper tail is satisfied with \(\gamma _{2}^{0} = \infty\). A right-truncated Inverse-Gamma distribution meets all the requirements of assumption (A).
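The Inverse-Gamma computation above can be checked numerically; a minimal sketch, assuming the illustrative values ν = 4 and λ = 2 (not prescribed by the paper). If \(\sigma \sim \mathrm{IG}(\nu,\,\lambda)\), then \(1/\sigma \sim \mathrm{Gamma}(\nu,\,\mathrm{rate}=\lambda)\), so both integrals in (1) are finite for ν > 1:

```python
import numpy as np

rng = np.random.default_rng(1)
nu, lam = 4.0, 2.0  # illustrative shape and scale of the Inverse-Gamma

# If sigma ~ IG(nu, lam), then 1/sigma ~ Gamma(shape=nu, rate=lam),
# so sigma can be sampled as the reciprocal of a Gamma draw.
sigma = 1.0 / rng.gamma(shape=nu, scale=1.0 / lam, size=1_000_000)

mc_inv = np.mean(1.0 / sigma)   # estimates int sigma^{-1} dG_0 = nu/lam
mc_mean = np.mean(sigma)        # estimates int sigma dG_0 = lam/(nu - 1)
print(mc_inv, mc_mean)          # close to 2.0 and 0.6667, respectively
```

By contrast, the upper tail of a non-truncated Inverse-Gamma decays only polynomially, so the second bound in assumption (A) fails without right truncation, matching the remark.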
Remark 2
Condition (2) is satisfied (with r 1 = 0) if β′ is an Inverse-Gamma density; it can be seen that (2) yields an analogous exponential bound on \(\beta ((0,\,s))\) as \(s \rightarrow 0\). Condition (3) has been considered by van der Vaart and van Zanten [8], p. 2660, and implies that \(\beta ((s,\,\infty )) \lesssim \exp \{-D_{4}s^{\gamma _{2}}/2\}\) as \(s \rightarrow \infty\), see Lemma 4.9, p. 2669.
We assess rates for location-scale mixtures of symmetric stable laws. The result goes through to location-scale mixtures of Student’s-t distributions.
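To make the model concrete, the following sketch draws truncated stick-breaking approximations of \(F \sim D_{\alpha}\) and \(G \sim D_{\beta}\) and evaluates \(f_{F,\,G}\) with a Cauchy kernel (symmetric stable law of index r = 1). All specific choices here — standard-normal \(\bar{\alpha}\), reciprocal-Gamma scale atoms, unit total masses, truncation level — are illustrative assumptions, not prescriptions from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def stick_breaking(mass, n_atoms, rng):
    """Truncated stick-breaking weights of a Dirichlet process."""
    v = rng.beta(1.0, mass, size=n_atoms)
    w = v * np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
    return w / w.sum()  # renormalize to absorb the truncation error

n_atoms = 25
w_loc = stick_breaking(1.0, n_atoms, rng)      # weights of F ~ D_alpha
theta = rng.normal(size=n_atoms)               # location atoms from bar-alpha
w_sc = stick_breaking(1.0, n_atoms, rng)       # weights of G ~ D_beta
sig = 1.0 / rng.gamma(shape=5.0, scale=1.0, size=n_atoms)  # scale atoms

def f_FG(x):
    """f_{F,G}(x) = sum_j sum_k p_j q_k K_{sigma_k}(x - theta_j),
    with the Cauchy kernel K_sigma(u) = sigma / (pi (sigma^2 + u^2))."""
    x = np.atleast_1d(x)[:, None, None]
    s = sig[None, None, :]
    kern = s / (np.pi * (s**2 + (x - theta[None, :, None])**2))
    return (w_loc[None, :, None] * w_sc[None, None, :] * kern).sum(axis=(1, 2))

xs = np.linspace(-10.0, 10.0, 1001)
vals = f_FG(xs)
mass_est = vals.sum() * (xs[1] - xs[0])   # Riemann mass on [-10, 10]
print(mass_est)  # close to 1; the Cauchy tails leave a little mass outside
```

The double sum over location and scale atoms is exactly the discrete analogue of \(\int_0^\infty (F \ast K_\sigma)\,\mathrm{d}G(\sigma)\) for finitely supported F and G.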
Theorem 1
Let K be the density of a symmetric stable law of index \(0 < r \leq 2\). Suppose that \(f_{0} =\int _{ 0}^{\infty }(F_{0} {\ast} K_{\sigma })\,\mathrm{d}G_{0}(\sigma )\), with the true mixing distribution \(F_0\) for the location parameter satisfying the tail condition
$$\displaystyle{ 1 - F_{0}([-a,\,a]) \lesssim e^{-c_{0}a^{1+I_{(1,\,2]}(r)/(r-1)} }\quad \mathrm{as\ }a \rightarrow \infty }$$ (4)
for some constant \(c_0 > 0\), and the true mixing distribution \(G_0\) for the scale parameter satisfying assumption (A), with \(\gamma _{2}^{0} = \infty\). If the base measure α has a density α′ which, for constants \(b > 0\) and \(0 <\delta \leq 1 + I_{(1,\,2]}(r)/(r - 1)\), satisfies
$$\displaystyle{ \alpha '(\theta ) \gtrsim e^{-b\vert \theta \vert ^{\delta } },\quad \theta \in \mathbb{R}, }$$ (5)
and
the base measure β satisfies assumption (B), with \(0 <\gamma _{j} \leq \gamma _{j}^{0} \leq \infty\) and \(\gamma _{j} <\gamma _{j}^{0}\) if \(r_j > 0\), \(j = 1,\,2\), then the posterior rate of convergence relative to the Hellinger distance is \(\varepsilon _{n} = n^{-1/2}(\log n)^{\kappa }\), with \(\kappa > 0\) depending on \(\gamma _{1}^{0}\), \(\gamma _{1}\), \(\gamma _{2}\), and r.
Proof
The proof is in the same spirit as that of Theorem 4.1 in Scricciolo [6] and, for space limitations, cannot be reported here in full. Let \(\bar{\varepsilon }_{n} = n^{-1/2}(\log n)^{\kappa }\) and \(\tilde{\varepsilon }_{n} = n^{-1/2}(\log n)^{\tau }\), with κ > τ > 0 whose rather lengthy expressions we refrain from writing down. Let \(0 <s_{n} \leq E(\log (1/\bar{\varepsilon }_{n}))^{-2\tau /\gamma _{1}}\), \(0 <S_{n} \leq F(\log (1/\bar{\varepsilon }_{n}))^{2\tau /\gamma _{2}}\), and \(0 <a_{n} \leq L(\log (1/\bar{\varepsilon }_{n}))^{2\tau /\delta }\), with E, F, L > 0 suitable constants. Replacing the expression of N in (A.19) of Lemma A.7 of Scricciolo [6] with that in Lemma 1, we can estimate the covering number of the sieve set
and show that \(\log D(\bar{\varepsilon }_{n},\,\mathcal{F}_{n},\,d_{\mathrm{H}}) \lesssim (\log n)^{2\kappa } = n\bar{\varepsilon }_{n}^{2}\). Verification of the remaining mass condition \(\pi (\mathcal{F}_{n}^{c}) \lesssim \exp \{-(c_{2} + 4)n\tilde{\varepsilon }_{n}^{2}\}\) can proceed as in the aforementioned theorem using, among others, the fact that 2τ > 1.
We now turn to consider the small ball probability condition. For \(0 <\varepsilon <1/4\), let \(a_{\varepsilon }:= (c_{0}^{-1}\log (1/(s_{\varepsilon }\varepsilon )))^{1/(1+I_{(1,\,2]}(r)/(r-1))}\) and \(s_{\varepsilon }:= (d_{1}^{-1}\log (1/\varepsilon ))^{-1/\gamma _{1}^{0} }\). Let G 0 ∗ be the re-normalized restriction of G 0 to \([s_{\varepsilon },\,S_{0}]\), with S 0 the upper endpoint of the support of G 0, and F 0 ∗ the re-normalized restriction of F 0 to \([-a_{\varepsilon },\,a_{\varepsilon }]\). Then, \(\|f_{F_{0}^{{\ast}},\,G_{0}^{{\ast}}}- f_{0}\|_{1} \lesssim \varepsilon\). By Lemma 1, there exist discrete distributions \(F'_{0}:=\sum _{ j=1}^{N}p_{j}\delta _{\theta _{j}}\) on \([-a_{\varepsilon },\,a_{\varepsilon }]\) and \(G_{0}':=\sum _{ k=1}^{N}q_{k}\delta _{\sigma _{k}}\) on \([s_{\varepsilon },\,S_{0}]\), with at most \(N \lesssim (\log (1/\varepsilon ))^{2\tau -1}\) support points, such that \(\|\,f_{F'_{0},\,G'_{0}} - f_{F_{0}^{{\ast}},\,G_{0}^{{\ast}}}\|_{\infty }\lesssim \varepsilon\). For \(T_{\varepsilon }:= (2a_{\varepsilon } \vee \varepsilon ^{-1/(r+I_{(0,\,1]}(r))})\),
Without loss of generality, the \(\theta _{j}\)’s and \(\sigma _{k}\)’s can be taken to be at least \(2\varepsilon\)-separated. For any distribution F on \(\mathbb{R}\) and G on \((0,\,\infty )\) such that
by the same arguments as in the proof of Theorem 4.1 in Scricciolo [6],
Consequently,
By an analogue of the last part of the same proof, we get that \(\pi (B_{\mathrm{KL}}(\,f_{0};\,\tilde{\varepsilon }_{n}^{2})) \gtrsim \exp \{-c_{2}n\tilde{\varepsilon }_{n}^{2}\}\).
Remark 3
Assumptions (4) on \(F_0\) and (5) on α′ imply that \(\mathrm{supp}(F_{0}) \subseteq \mathrm{supp}(\alpha )\); thus, \(F_0\) is in the weak support of \(D_{\alpha}\). Analogously, assumptions (A) on \(G_0\) and (B) on β′, together with the restrictions on \(\gamma _{j}^{0}\), \(\gamma _{j}\), \(j = 1,\,2\), imply that \(\mathrm{supp}(G_{0}) \subseteq \mathrm{supp}(\beta )\); thus, \(G_0\) is in the weak support of \(D_{\beta}\).
Remark 4
If \(\gamma _{1} =\gamma _{2} = \infty\), then also \(\gamma _{1}^{0} =\gamma _{2}^{0} = \infty\), i.e., the true mixing distribution \(G_0\) for \(\sigma\) is compactly supported on an interval \([s_0,\,S_0]\), for some \(0 <s_{0} \leq S_{0} <\infty\), and (an upper bound on) the rate is given by \(\varepsilon _{n} = n^{-1/2}(\log n)^{\kappa }\), with κ whose value for Gaussian mixtures (r = 2) reduces to that found by Ghosal and van der Vaart [3] in Theorem 6.1, p. 1255.
References
[1] Bickson, D., Guestrin, C.: Inference with multivariate heavy-tails in linear models. In: Lafferty, J.D., Williams, C.K.I., Shawe-Taylor, J., Zemel, R.S., Culotta, A. (eds.) Advances in Neural Information Processing Systems, vol. 23, pp. 208–216. Curran Associates, Red Hook (2010). http://papers.nips.cc/paper/3949-inference-with-multivariate-heavy-tails-in-linear-models.pdf
[2] Ferguson, T.S.: Bayesian density estimation by mixtures of normal distributions. In: Rizvi, M.H., Rustagi, J.S., Siegmund, D. (eds.) Recent Advances in Statistics, pp. 287–302. Academic Press, New York (1983)
[3] Ghosal, S., van der Vaart, A.W.: Entropies and rates of convergence for maximum likelihood and Bayes estimation for mixtures of normal densities. Ann. Stat. 29, 1233–1263 (2001)
[4] Lo, A.Y.: On a class of Bayesian nonparametric estimates: I. Density estimates. Ann. Stat. 12, 351–357 (1984)
[5] Scricciolo, C.: Posterior rates of convergence for Dirichlet mixtures of exponential power densities. Electron. J. Stat. 5, 270–308 (2011)
[6] Scricciolo, C.: Rates of convergence for Bayesian density estimation with Dirichlet process mixtures of super-smooth kernels. Working Paper No. 1, DEC, Bocconi University (2011)
[7] Tokdar, S.T.: Posterior consistency of Dirichlet location-scale mixture of normals in density estimation and regression. Sankhyā 68, 90–110 (2006)
[8] van der Vaart, A.W., van Zanten, J.H.: Adaptive Bayesian estimation using a Gaussian random field with inverse Gamma bandwidth. Ann. Stat. 37, 2655–2675 (2009)
[9] Wu, Y., Ghosal, S.: Kullback–Leibler property of kernel mixture priors in Bayesian density estimation. Electron. J. Stat. 2, 298–331 (2008)
Appendix
The following lemma provides an upper bound on the number of mixing components of finite location-scale mixtures of symmetric stable laws that uniformly approximate densities of the same type with compactly supported mixing distributions. We use \(\mathbb{E}\) and \(\mathbb{E}'\) to denote expectations under the distributions G and G′ for the scale parameter \(\varSigma\), respectively.
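The basic estimate behind the lemma combines Fourier inversion with the exponential decay of \(\varPhi_K\). The following is a compressed sketch of that estimate (constants and the exact handling of the mixing over \(\varSigma\) follow the proof below):

```latex
% For each s <= sigma <= S and any cut-off M > 0, Fourier inversion gives
\begin{align*}
\bigl|(F \ast K_{\sigma})(x) - (F' \ast K_{\sigma})(x)\bigr|
  &\le \frac{1}{2\pi}\int_{-M}^{M}
        \bigl|\varPhi_{F}(t) - \varPhi_{F'}(t)\bigr|\,
        \bigl|\varPhi_{K}(\sigma t)\bigr|\,\mathrm{d}t
     + \frac{1}{2\pi}\int_{|t| > M}
        2\,\bigl|\varPhi_{K}(\sigma t)\bigr|\,\mathrm{d}t \\
  &\le \frac{1}{2\pi}\int_{-M}^{M}
        \bigl|\varPhi_{F}(t) - \varPhi_{F'}(t)\bigr|\,
        A e^{-\rho(\sigma|t|)^{r}}\,\mathrm{d}t
     + \frac{2A}{\pi}\int_{M}^{\infty} e^{-\rho(\sigma t)^{r}}\,\mathrm{d}t.
\end{align*}
% For M >= (rho^{1/r} s)^{-1} (log(1/(s^r eps)))^{1/r}, the tail term is
% O(eps); moment matching of F' to F (and of G' to G) then controls the
% low-frequency term.
```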
Lemma 1
Let K be a density with Fourier transform such that, for constants A, ρ > 0 and 0 < r < 2, \(\varPhi _{K}(t) = Ae^{-\rho \vert t\vert ^{r} }\), \(t \in \mathbb{R}\) . Let 0 < ε < 1, \(0 <a <\infty\) and \(0 <s \leq S <\infty\) be given, with (a∕s) ≥ 1. For any pair of probability measures F on [−a, a] and G on [s, S], there exist discrete probability measures F′ on [−a, a] and G′ on [s, S], with at most
support points, such that \(\|\mathbb{E}[F {\ast} K_{\varSigma }] - \mathbb{E}'[F' {\ast} K_{\varSigma }]\|_{\infty }\lesssim \varepsilon\).
Proof
We first consider the case where 1 < r < 2: since (a∕s) ≥ 1 by assumption, we can appeal to Lemma A.1 of Scricciolo [6]. The arguments of the first part of the proof can then also be used to deal with the case where 0 < r ≤ 1. For each \(s \leq \sigma \leq S\), since \(\int _{-\infty }^{\infty }\vert \varPhi _{K}(\sigma t)\vert \,\mathrm{d}t <\infty\), the inversion formula can be applied to recover both \(F {\ast} K_{\sigma }\) and \(F' {\ast} K_{\sigma }\). For any M > 0 and \(x \in \mathbb{R}\),
Let
and
For \(M \geq (\rho ^{1/r}s)^{-1}(\log (1/(s^{r}\varepsilon )))^{1/r}\),
In order to find an upper bound on U, we apply Lemma A.1 of Ghosal and van der Vaart [3], p. 1260, to both F and G. There exists a discrete probability measure F′ on [−a, a], with at most \(N_1 + 1\) support points, where \(N_1\) is a positive integer to be suitably chosen later on, that matches the (finite) moments of F up to order \(N_1\), i.e., \(\mathbb{E}'[\varTheta ^{j}] = \mathbb{E}[\varTheta ^{j}]\) for all \(j = 1,\dots,N_1\). Analogously, there exists a discrete probability measure G′ on [s, S], with at most \(N_2\) support points, where \(N_2\) is a positive integer to be suitably chosen later on, such that
Both \(N_1\) and \(N_2\) will be chosen to be increasing functions of \(1/\varepsilon\). By virtue of the latter matching conditions,
Using arguments of Lemma A.1 in Scricciolo [6] and inequality (6),
where, in the last line, we have used Stirling’s approximation for (N 2)! , assuming N 2 is large enough. For \(N_{1} \lesssim \max \{\log (1/(s\varepsilon )),\,(a/s)^{r/(r-1)}\}\),
Let M be such that aM ≥ 1 and (ρ 1∕r SM) ≥ 2a. Then, for \(N_{2} \geq \max \{ (2N_{1} + 1)(r - 1)/(r(2 - r)),\,e^{3}(\rho ^{1/r}SM)^{r/(r-1)},\,\log (1/\varepsilon )\}\),
and
Hence, \(N_{2} \lesssim \max \{ (a/s)^{r/(r-1)},\,\left ((S/s)\right )^{r/(r-1)}(\log (1/s^{r}\varepsilon ))^{1/(r-1)}\}\).
In the case where 0 < r ≤ 1, since (a∕s) ≥ 1, we need to restrict the support of the mixing distribution F. To this aim, we consider a partition of [−a, a] into \(k = \lceil (a/s)(\log (1/(s\varepsilon )))^{1/r-1}\rceil\) subintervals \(I_1,\dots,I_k\) of equal length \(0 <l \leq 2s(\log (1/(s\varepsilon )))^{-(1-r)/r}\) and, possibly, a final interval \(I_{k+1}\) of length \(0 \leq l_{k+1} < l\). Let J be the number of intervals in the partition, which can be either k or k + 1. Write \(F =\sum _{ j=1}^{J}F(I_{j})F_{j}\), where \(F_j\) denotes the re-normalized restriction of F to \(I_j\). Then, for each \(s \leq \sigma \leq S\), we have \((F {\ast} K_{\sigma })(x) =\sum _{ j=1}^{J}F(I_{j})(F_{j} {\ast} K_{\sigma })(x)\), \(x \in \mathbb{R}\). For any probability measure F′ such that \(F'(I_{j}) = F(I_{j})\), \(j = 1,\dots,J\),
Reasoning as in the case where 1 < r < 2, with a to be understood as l∕2 and \(N_1\) as the number of support points of the generic \(F_j\), for \(M \geq ((\rho /2)^{1/r}s)^{-1}(\log (1/\varepsilon ))^{1/r}\),
Since \((a/s) \lesssim (\log (1/(s\varepsilon )))^{-(1-r)/r}\) by construction, for \(N_{1} =\log (1/(s\varepsilon ))\), it turns out that \(U_{1} \lesssim \varepsilon\). For \(N_{2} \geq \max \{ N_{1},\,2e^{4}\rho (SM)^{r}\log (1/(s\varepsilon )),\,\log (1/\varepsilon )\}\),
and \(U_{2} \lesssim \varepsilon\). Then, \(N_{2} \lesssim (S/s)^{r}(\log (1/(s\varepsilon )))^{2}\) and the total number \(N_T\) of support points of F′ is bounded above by
The proof is thus complete.
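The moment-matching mechanism used above can be illustrated numerically. In this toy sketch (all choices illustrative: F = Uniform[−1, 1], Cauchy kernel, i.e., stable index r = 1, and σ = 0.5), the N-point Gauss–Legendre rule supplies a discrete F′ on [−1, 1] whose moments match those of F up to order 2N − 1, and the continuous and discrete mixtures are uniformly close, in the spirit of Lemma 1:

```python
import numpy as np

# Discrete approximation by moment matching: the N Gauss-Legendre
# nodes/weights (weights divided by 2) define a discrete probability
# measure F' on [-1, 1] matching the moments of F = Uniform[-1, 1]
# up to order 2N - 1.
N, sigma = 16, 0.5
theta, w = np.polynomial.legendre.leggauss(N)
w = w / 2.0                                  # now a probability vector

xs = np.linspace(-5.0, 5.0, 2001)

# Exact continuous mixture (F * K_sigma)(x) for F = Uniform[-1, 1] and
# the Cauchy kernel: (1/(2 pi)) [arctan((x+1)/s) - arctan((x-1)/s)].
exact = (np.arctan((xs + 1) / sigma)
         - np.arctan((xs - 1) / sigma)) / (2 * np.pi)

# Discrete mixture F' * K_sigma.
approx = sum(wi * sigma / (np.pi * (sigma**2 + (xs - ti)**2))
             for wi, ti in zip(w, theta))

sup_err = np.abs(exact - approx).max()
print(sup_err)   # uniformly small already for moderate N
```

The geometric decay of the sup-norm error in N reflects the analyticity of the kernel on a strip, which is what the exponential decay of \(\varPhi_K\) buys in the proof.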
Remark 5
Lemma 1 does not cover the case where r = 2, i.e., a Gaussian kernel; this restriction is possibly an artifact of the arguments laid out in the proof. The case can be retrieved from Lemma A.2 in Scricciolo [6] when p = 2.
Scricciolo, C. (2016). Rates for Bayesian Estimation of Location-Scale Mixtures of Super-Smooth Densities. In: Alleva, G., Giommi, A. (eds.) Topics in Theoretical and Applied Statistics. Studies in Theoretical and Applied Statistics. Springer, Cham. https://doi.org/10.1007/978-3-319-27274-0_5