
1 Introduction

Consider the estimation of a density f 0 on \(\mathbb{R}\) from observations X 1, …, X n, taking a Bayesian nonparametric approach. A prior is defined on a metric space of probability measures with Lebesgue density, and a summary of the posterior, e.g., the posterior expected density, is employed. The so-called “what if” approach, which consists in investigating frequentist asymptotic properties of the posterior under the non-Bayesian assumption that the data are generated from a fixed density, provides a way to validate priors on infinite-dimensional spaces. Desirable asymptotic properties of posterior distributions are consistency, a minimax-optimal concentration rate of the posterior mass around the “truth” as the sample size grows, possibly with full adaptation to the regularity level of f 0 when it is unknown, and distributional convergence. For bounded and convex distances, posterior contraction rates yield upper bounds on convergence rates of the Bayes estimator, which motivates the interest in their study. Since the seminal articles of Ferguson [2] and Lo [4], the idea of constructing priors on spaces of densities by convolving a fixed kernel with a random distribution has been successfully exploited in density estimation. Even if much progress has been made during the last decade in understanding frequentist asymptotic properties of mixture models, the choice of the kernel is a topic largely ignored in the literature, except for the article of Wu and Ghosal [9], mainly focussed on consistency. Posterior contraction rates for Dirichlet process kernel mixture priors have been investigated by Ghosal and van der Vaart [3] and Scricciolo [5]. One key message is that some constraints on the regularity of the kernel and on the tail decay of the true mixing distribution are necessary to accurately estimate a density.
Most of the literature has dealt with the estimation of mixtures with normal (or generalized normal) kernel and mixing distribution having either compact support or sub-exponential tails, finding a nearly parametric rate, up to a logarithmic factor, in the L 1-distance, but there are almost no results beyond the Gaussian kernel. The aim of this work is to contribute to the understanding of the role of the kernel choice in density estimation with a Dirichlet process mixture prior. The main result states that a nearly parametric rate can be attained when estimating mixtures of super-smooth densities, i.e., densities whose Fourier transforms decay exponentially, whatever the tail decay of the kernel; heavy-tailed distributions, like the Student’s-t or the Cauchy, are thus included, and these have proved extremely useful in accurately modeling different kinds of financial data. For example, individual stock indices can be modeled as stable laws. Multivariate stable laws have been fruitfully used in computer networks, see Bickson and Guestrin [1]. The assumption of exponential tail decay of the true mixing distribution seems unavoidable in order to find a finite approximating mixture with a sufficiently restricted number of support points. This step is a delicate mathematical point in the proof, see Lemma 1. Such an approximation result, which is reported in the Appendix, may be of independent interest as well. In Sect. 2, we fix the notation and present the result.

2 Main Result

We derive rates for location-scale mixtures of super-smooth densities. The model is \(f_{F,\,G}(x):=\int _{ 0}^{\infty }(F {\ast} K_{\sigma })(x)\,\mathrm{d}G(\sigma )\), \(x \in \mathbb{R}\), where K is a kernel density, \(F \sim D_{\alpha }\) is a Dirichlet process with base measure \(\alpha:=\alpha (\mathbb{R})\bar{\alpha }\), for \(0 <\alpha (\mathbb{R}) <\infty\) and \(\bar{\alpha }\) a probability measure on \(\mathbb{R}\), and \(G \sim D_{\beta }\), with finite and positive base measure β on \((0,\,\infty )\). We assume that \(f_{0} = f_{F_{0},\,G_{0}}\), with F 0 and G 0 denoting the true mixing distributions for the location and scale parameters, respectively. We use the following assumptions.
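As an illustration of the model, a draw from the prior can be simulated via the truncated stick-breaking representation of the Dirichlet process. The sketch below is only a schematic rendering of \(f_{F,\,G}\): the truncation level, the Cauchy kernel (the r = 1 symmetric stable law considered later), and the base measures (standard normal for \(\bar{\alpha }\), a shifted exponential for β) are all illustrative choices, not those of the paper.

```python
import math
import random

def stick_breaking(mass, base_sampler, n_atoms, rng):
    """Truncated stick-breaking draw from a Dirichlet process:
    v_k ~ Beta(1, mass), w_k = v_k * prod_{j<k} (1 - v_j)."""
    weights, atoms, remaining = [], [], 1.0
    for _ in range(n_atoms):
        v = rng.betavariate(1.0, mass)
        weights.append(remaining * v)
        atoms.append(base_sampler(rng))
        remaining *= 1.0 - v
    weights.append(remaining)          # leftover stick mass on one extra atom
    atoms.append(base_sampler(rng))
    return weights, atoms

def mixture_density(x, locs, wlocs, scales, wscales):
    """f_{F,G}(x) = sum_{j,k} p_j q_k K_sigma_k(x - theta_j) with the
    Cauchy kernel K_sigma(u) = sigma / (pi (sigma^2 + u^2))."""
    return sum(p * q * s / (math.pi * (s * s + (x - th) ** 2))
               for th, p in zip(locs, wlocs)
               for s, q in zip(scales, wscales))

rng = random.Random(0)
# F ~ D_alpha with standard normal base measure on R;
# G ~ D_beta with base measure on (0, inf) (here a shifted exponential)
wl, locs = stick_breaking(1.0, lambda r: r.gauss(0.0, 1.0), 25, rng)
ws, scales = stick_breaking(1.0, lambda r: 0.5 + r.expovariate(1.0), 25, rng)

# sanity check: the random density is positive and nearly integrates to one
# (Cauchy tails leave a little mass outside the finite window)
step = 0.1
integral = sum(mixture_density(-100.0 + i * step, locs, wl, scales, ws)
               for i in range(2001)) * step
```

A prior draw is thus a countable location-scale mixture with random weights and atoms; the theorem below concerns the posterior over such mixtures, not a single draw.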

  1. (A)

    The true mixing distribution G 0 for the scale parameter satisfies

    $$\displaystyle{ \int _{0}^{\infty }\sigma \,\mathrm{d}G_{ 0}(\sigma ) <\infty \qquad \mathrm{and}\qquad \int _{0}^{\infty }\frac{1} {\sigma } \,\mathrm{d}G_{0}(\sigma ) <\infty. }$$
    (1)

    Also, for constants d 1, d 2 > 0 and \(0 <\gamma _{ 1}^{0},\,\gamma _{2}^{0} \leq \infty\),

    $$\displaystyle{G_{0}(s) \lesssim e^{-d_{1}s^{-\gamma _{1}^{0}} }\quad \mathrm{as\ }s \rightarrow 0\quad \,\mathrm{and}\quad \,1 - G_{0}(s) \lesssim e^{-d_{2}s^{\gamma _{2}^{0}} }\quad \mathrm{as\ }s \rightarrow \infty.}$$
  2. (B)

    The base measure β of the Dirichlet process prior for G has a continuous and positive Lebesgue density β′ on \((0,\,\infty )\) such that, for constants C j, D j > 0, j = 1, …, 4, q 1, q 2, r 1, r 2 ≥ 0 and \(0 <\gamma _{1},\,\gamma _{2} \leq \infty\),

    $$\displaystyle{ C_{1}\sigma ^{-q_{1} }e^{-C_{2}\sigma ^{-\gamma _{1}}(\log (1/\sigma ))^{r_{1}} } \leq \beta '(\sigma ) \leq C_{3}\sigma ^{-q_{1} }e^{-C_{4}\sigma ^{-\gamma _{1}}(\log (1/\sigma ))^{r_{1}} } }$$
    (2)

    for all \(\sigma\) in a neighborhood of 0, and

    $$\displaystyle{ D_{1}\sigma ^{q_{2} }e^{-D_{2}\sigma ^{\gamma _{2}}(\log \sigma )^{r_{2}} } \leq \beta '(\sigma ) \leq D_{3}\sigma ^{q_{2} }e^{-D_{4}\sigma ^{\gamma _{2}}(\log \sigma )^{r_{2}} } }$$
    (3)

    for all \(\sigma\) large enough.

Remark 1

The right-hand side requirement in (1) has also been postulated by Tokdar [7], see condition 3 of Lemma 5.1 and condition 4 of Theorem 5.2, pp. 102–103. If, for example, G 0 is an \(\mathrm{IG}(\nu,\,\lambda )\), with shape parameter ν > 0 and scale parameter \(\lambda> 0\), then \(\int _{0}^{\infty }\sigma ^{-1}\,\mathrm{d}G_{0}(\sigma ) = (\nu /\lambda ) <\infty\). If G 0 is a right-truncated distribution, then the requirement on the upper tail is satisfied with \(\gamma _{2}^{0} = \infty\). A right-truncated Inverse-Gamma distribution meets all the requirements of assumption (A).
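The two moment conditions in (1) can be checked numerically for the Inverse-Gamma example of Remark 1. The sketch below uses a plain midpoint rule with illustrative parameter values; for \(\mathrm{IG}(\nu,\,\lambda)\) one expects \(\int_{0}^{\infty}\sigma^{-1}\,\mathrm{d}G_{0}(\sigma) = \nu/\lambda\) and, for ν > 1, \(\int_{0}^{\infty}\sigma\,\mathrm{d}G_{0}(\sigma) = \lambda/(\nu-1)\).

```python
import math

def inv_gamma_pdf(sigma, nu, lam):
    """Inverse-Gamma IG(nu, lam) density: shape nu, scale lam."""
    return lam ** nu / math.gamma(nu) * sigma ** (-nu - 1.0) * math.exp(-lam / sigma)

def midpoint(f, lo, hi, n=100000):
    """Composite midpoint rule; accurate enough for these smooth integrands."""
    h = (hi - lo) / n
    return sum(f(lo + (i + 0.5) * h) for i in range(n)) * h

nu, lam = 3.0, 2.0   # illustrative values with nu > 1, so both moments exist
mean_sigma = midpoint(lambda s: s * inv_gamma_pdf(s, nu, lam), 1e-6, 200.0)
mean_inv_sigma = midpoint(lambda s: inv_gamma_pdf(s, nu, lam) / s, 1e-6, 200.0)
```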

Remark 2

Condition (2) is satisfied (with r 1 = 0) if β′ is an Inverse-Gamma density. It can be seen that (2) implies that

$$\displaystyle{\beta ((0,\,s]) \leq \exp \left \{-\frac{C_{4}} {2} s^{-\gamma _{1} }\left (\log \frac{1} {s}\right )^{r_{1} }\right \} \lesssim e^{-\frac{1} {2} C_{4}s^{-\gamma _{1}} }\qquad \mathrm{as\ }s \rightarrow 0.}$$

Condition (3) has been considered by van der Vaart and van Zanten [8], p. 2660, and implies that \(\beta ((s,\,\infty )) \lesssim \exp \{-D_{4}s^{\gamma _{2}}/2\}\) as \(s \rightarrow \infty\), see Lemma 4.9, p. 2669.
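The implied small-scale bound can be checked numerically in the Inverse-Gamma case, where (2) holds with \(\gamma_1 = 1\), \(r_1 = 0\) and \(C_4 = \lambda\). The following sketch (illustrative parameters, midpoint quadrature) confirms \(\beta((0,\,s]) \leq e^{-\lambda s^{-1}/2}\) at a few small values of s.

```python
import math

nu, lam = 2.0, 1.0   # Inverse-Gamma IG(nu, lam); satisfies (2) with C_4 = lam

def beta_prime(sigma):
    """Inverse-Gamma density, the example of Remark 2."""
    return lam ** nu / math.gamma(nu) * sigma ** (-nu - 1.0) * math.exp(-lam / sigma)

def lower_tail(s, n=50000):
    """beta((0, s]) by the midpoint rule (the integrand vanishes fast at 0)."""
    h = s / n
    return sum(beta_prime((i + 0.5) * h) for i in range(n)) * h

# tail mass versus the bound exp(-(C_4/2) s^{-gamma_1}) as s -> 0
checks = {s: (lower_tail(s), math.exp(-0.5 * lam / s)) for s in (0.2, 0.1, 0.05)}
```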

We assess rates for location-scale mixtures of symmetric stable laws. The result carries over to location-scale mixtures of Student’s-t distributions.

Theorem 1

Let K be the density of a symmetric stable law of index 0 < r ≤ 2. Suppose that \(f_{0} =\int _{ 0}^{\infty }(F_{0} {\ast} K_{\sigma })\,\mathrm{d}G_{0}(\sigma )\) , with the true mixing distribution F 0 for the location parameter satisfying the tail condition

$$\displaystyle{ F_{0}(\{\theta:\, \vert \theta \vert> t\}) \lesssim \exp \{-c_{0}t^{1+I_{(1,\,2]}(r)/(r-1)}\}\qquad \mathrm{for\ large}\,t> 0, }$$
(4)

for some constant c 0 > 0, and the true mixing distribution G 0 for the scale parameter satisfying assumption (A), with \(\gamma _{2}^{0} = \infty\) . If the base measure α has a density α′ that, for constants b > 0 and \(0 <\delta \leq 1 + I_{(1,\,2]}(r)/(r - 1)\) , satisfies

$$\displaystyle{ \alpha '(\theta ) \propto e^{-b\vert \theta \vert ^{\delta }},\qquad \theta \in \mathbb{R}, }$$
(5)

the base measure β satisfies assumption (B), with \(0 <\gamma _{j} \leq \gamma _{j}^{0} \leq \infty\) and \(\gamma _{j} <\gamma _{j}^{0}\) if \(r_{j}> 0\), \(j = 1,\,2\), then the posterior rate of convergence relative to the Hellinger distance is \(\varepsilon _{n} = n^{-1/2}(\log n)^{\kappa }\) , with κ > 0 depending on \(\gamma _{1}^{0}\), \(\gamma _{1}\), \(\gamma _{2}\), and r.
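Under the standard normalization, a symmetric stable law of index r has characteristic function \(e^{-\vert t\vert ^{r}}\), so its density can be evaluated by Fourier inversion even though closed forms exist only at r = 1 (Cauchy) and r = 2 (a normal with variance 2). A sketch of this inversion, with an illustrative quadrature that plays no role in the theorem:

```python
import math

def stable_pdf(x, r, tmax=60.0, n=60000):
    """Symmetric stable density of index r via numerical Fourier inversion:
    f(x) = (1/pi) * int_0^tmax exp(-t^r) cos(t x) dt  (truncated integral;
    the tail beyond tmax is negligible for r >= 1)."""
    h = tmax / n
    return sum(math.exp(-((i + 0.5) * h) ** r) * math.cos((i + 0.5) * h * x)
               for i in range(n)) * h / math.pi

xs = (0.0, 0.5, 1.0, 2.0)
# r = 1 recovers the standard Cauchy density 1/(pi (1 + x^2))
cauchy_err = max(abs(stable_pdf(x, 1.0) - 1.0 / (math.pi * (1.0 + x * x)))
                 for x in xs)
# r = 2 recovers the N(0, 2) density exp(-x^2/4) / (2 sqrt(pi))
gauss_err = max(abs(stable_pdf(x, 2.0)
                    - math.exp(-x * x / 4.0) / (2.0 * math.sqrt(math.pi)))
                for x in xs)
```

For intermediate indices, e.g. r = 1.5, no closed form is available and such an inversion is the standard way to evaluate the kernel numerically.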

Proof

The proof is in the same spirit as that of Theorem 4.1 in Scricciolo [6], which, for space limitations, cannot be reported here. Let \(\bar{\varepsilon }_{n} = n^{-1/2}(\log n)^{\kappa }\) and \(\tilde{\varepsilon }_{n} = n^{-1/2}(\log n)^{\tau }\), with κ > τ > 0 whose rather lengthy expressions we refrain from writing down. Let \(0 <s_{n} \leq E(\log (1/\bar{\varepsilon }_{n}))^{-2\tau /\gamma _{1}}\), \(0 <S_{n} \leq F(\log (1/\bar{\varepsilon }_{n}))^{2\tau /\gamma _{2}}\), and \(0 <a_{n} \leq L(\log (1/\bar{\varepsilon }_{n}))^{2\tau /\delta }\), with E, F, L > 0 suitable constants. Replacing the expression of N in (A.19) of Lemma A.7 of Scricciolo [6], with that in Lemma 1, we can estimate the covering number of the sieve set

$$\displaystyle{\mathcal{F}_{n}:=\{\, f_{F,\,G}:\, F([-a_{n},\,a_{n}]) \geq 1 -\bar{\varepsilon }_{n}/2,\quad G([s_{n},\,S_{n}]) \geq 1 -\bar{\varepsilon }_{n}/2\}}$$

and show that \(\log D(\bar{\varepsilon }_{n},\,\mathcal{F}_{n},\,d_{\mathrm{H}}) \lesssim (\log n)^{2\kappa } = n\bar{\varepsilon }_{n}^{2}\). Verification of the remaining mass condition \(\pi (\mathcal{F}_{n}^{c}) \lesssim \exp \{-(c_{2} + 4)n\tilde{\varepsilon }_{n}^{2}\}\) can proceed as in the aforementioned theorem using, among other things, the fact that 2τ > 1.

We now turn to consider the small ball probability condition. For \(0 <\varepsilon <1/4\), let \(a_{\varepsilon }:= (c_{0}^{-1}\log (1/(s_{\varepsilon }\varepsilon )))^{1/(1+I_{(1,\,2]}(r)/(r-1))}\) and \(s_{\varepsilon }:= (d_{1}^{-1}\log (1/\varepsilon ))^{-1/\gamma _{1}^{0} }\). Let \(G_{0}^{{\ast}}\) be the re-normalized restriction of G 0 to \([s_{\varepsilon },\,S_{0}]\), with S 0 the upper endpoint of the support of G 0, and \(F_{0}^{{\ast}}\) the re-normalized restriction of F 0 to \([-a_{\varepsilon },\,a_{\varepsilon }]\). Then, \(\|f_{F_{0}^{{\ast}},\,G_{0}^{{\ast}}}- f_{0}\|_{1} \lesssim \varepsilon\). By Lemma 1, there exist discrete distributions \(F'_{0}:=\sum _{ j=1}^{N}p_{j}\delta _{\theta _{j}}\) on \([-a_{\varepsilon },\,a_{\varepsilon }]\) and \(G'_{0}:=\sum _{ k=1}^{N}q_{k}\delta _{\sigma _{k}}\) on \([s_{\varepsilon },\,S_{0}]\), with at most \(N \lesssim (\log (1/\varepsilon ))^{2\tau -1}\) support points, such that \(\|\,f_{F'_{0},\,G'_{0}} - f_{F_{0}^{{\ast}},\,G_{0}^{{\ast}}}\|_{\infty }\lesssim \varepsilon\). For \(T_{\varepsilon }:= (2a_{\varepsilon } \vee \varepsilon ^{-1/(r+I_{(0,\,1]}(r))})\),

$$\displaystyle{\|f_{F'_{0},\,G'_{0}} - f_{F_{0}^{{\ast}},\,G_{0}^{{\ast}}}\|_{1} \lesssim T_{\varepsilon }\|f_{F'_{0},\,G'_{0}} - f_{F_{0}^{{\ast}},\,G_{0}^{{\ast}}}\|_{\infty } + T_{\varepsilon }^{-r} \lesssim \varepsilon ^{1-1/(r+I_{(0,\,1]}(r))}.}$$

Without loss of generality, the \(\theta _{j}\)’s and \(\sigma _{k}\)’s can be taken to be at least \(2\varepsilon\)-separated. For any distribution F on \(\mathbb{R}\) and G on \((0,\,\infty )\) such that

$$\displaystyle{\sum _{j=1}^{N}\vert F([\theta _{ j}-\varepsilon,\,\theta _{j}+\varepsilon ]) - p_{j}\vert \leq \varepsilon \quad \mathrm{and}\quad \sum _{k=1}^{N}\vert G([\sigma _{ k}-\varepsilon,\,\sigma _{k}+\varepsilon ]) - q_{k}\vert \leq \varepsilon,}$$

by the same arguments as in the proof of Theorem 4.1 in Scricciolo [6],

$$\displaystyle{\|f_{F,\,G} - f_{F'_{0},\,G'_{0}}\|_{1} \lesssim \varepsilon.}$$

Consequently,

$$\displaystyle\begin{array}{rcl} d_{\mathrm{H}}^{2}(\,f_{ F,\,G},\,f_{0})& \leq & \|f_{F,\,G} - f_{F_{0}',\,G_{0}'}\|_{1} +\| f_{F_{0}',\,G_{0}'} - f_{F_{0}^{{\ast}},\,G_{0}^{{\ast}}}\|_{1} +\| f_{F_{0}^{{\ast}},\,G_{0}^{{\ast}}}- f_{0}\|_{1} {}\\ & \lesssim & \varepsilon ^{1-1/(r+I_{(0,\,1]}(r))}. {}\\ \end{array}$$

By an analogue of the last part of the same proof, we get that \(\pi (B_{\mathrm{KL}}(\,f_{0};\,\tilde{\varepsilon }_{n}^{2})) \gtrsim \exp \{-c_{2}n\tilde{\varepsilon }_{n}^{2}\}\).
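The key approximation step (Lemma 1) replaces the continuous mixing distributions by discrete ones with few support points while keeping the mixtures close in sup-norm. The following sketch illustrates the phenomenon only, with a Cauchy kernel and a Uniform location mixing distribution, both hypothetical choices unrelated to the actual construction in the Appendix:

```python
import math

def cauchy_kernel(u, sigma):
    """Cauchy kernel K_sigma(u) = sigma / (pi (sigma^2 + u^2))."""
    return sigma / (math.pi * (sigma * sigma + u * u))

def mixture(x, a, sigma, n):
    """Location mixture with n equally weighted atoms at the midpoints of n
    equal bins of [-a, a]; for large n this approximates the mixture with
    continuous Uniform[-a, a] mixing distribution."""
    h = 2.0 * a / n
    return sum(cauchy_kernel(x - (-a + (j + 0.5) * h), sigma)
               for j in range(n)) / n

a, sigma = 2.0, 0.5
grid = [i * 0.2 for i in range(-30, 31)]
reference = {x: mixture(x, a, sigma, 5000) for x in grid}  # proxy for continuous mixing
sup_err = {N: max(abs(mixture(x, a, sigma, N) - reference[x]) for x in grid)
           for N in (5, 10, 20, 40)}
```

Already a handful of atoms gives a small sup-norm error; Lemma 1 quantifies how few points, \(N \lesssim (\log (1/\varepsilon ))^{2\tau -1}\), suffice under the exponential tail conditions.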

Remark 3

Assumptions (4) on F 0 and (5) on α′ imply that \(\mathrm{supp}(F_{0}) \subseteq \mathrm{ supp}(\alpha )\); thus, F 0 is in the weak support of \(D_{\alpha }\). Analogously, assumptions (A) on G 0 and (B) on β′, together with the restrictions on \(\gamma _{j}^{0}\), \(\gamma _{j}\), \(j = 1,\,2\), imply that \(\mathrm{supp}(G_{0}) \subseteq \mathrm{ supp}(\beta )\); thus, G 0 is in the weak support of \(D_{\beta }\).

Remark 4

If \(\gamma _{1} =\gamma _{2} = \infty\), then also \(\gamma _{1}^{0} =\gamma _{ 2}^{0} = \infty\), i.e., the true mixing distribution G 0 for \(\sigma\) is compactly supported on an interval [s 0, S 0], for some \(0 <s_{0} \leq S_{0} <\infty\), and (an upper bound on) the rate is given by \(\varepsilon _{n} = n^{-1/2}(\log n)^{\kappa }\), with a value of κ that, for Gaussian mixtures (r = 2), coincides with the one found by Ghosal and van der Vaart [3] in Theorem 6.1, p. 1255.