
1 Introduction

The Bayesian analysis of the problem of recovering the unknown mixing distribution in mixture models has recently attracted much attention and stimulated an active discussion encouraging new ideas. Several papers, including Efron [4], Gao and van der Vaart [5], Heinrich and Kahn [9], Ishwaran et al. [11], Nguyen [14] and Scricciolo [18], have been devoted to the investigation of this topic, with extensive comparisons with the frequentist solutions. In order to introduce the problem, suppose that \(x\mapsto k(x\mid y)\) is a probability density for every \(y\in \mathscr {Y}\subseteq \mathbb {R}\), where \((\mathscr {Y},\,\mathscr {B})\) is a measurable space. If the mapping \((x,\,y)\mapsto k(x\mid y)\) is jointly measurable, then

$$\begin{aligned} p_G(x):=\int _{\mathscr {Y}} k(x\mid y)\,\mathrm {d}G(y),\quad x\in \mathbb {R}, \end{aligned}$$
(1)

defines a probability density on \(\mathbb {R}\) for every probability measure G on \((\mathscr {Y}, \mathscr {B})\); the collection of all such probability measures is denoted by \(\mathscr {G}\). The cumulative distribution function of the mixed density in (1) is denoted by

$$\begin{aligned} F_G(x)=\int _{-\infty }^x p_G(u)\,\mathrm {d}u,\quad x\in \mathbb {R}. \end{aligned}$$

Suppose we observe n independent random variables \(X_1,\,\ldots ,\,X_n\) identically distributed according to the mixed density

$$\begin{aligned} p_0(x)\equiv p_{G_0}(x)=\int _{\mathscr {Y}}k(x\mid y)\,\mathrm {d}G_0(y),\quad x\in \mathbb {R}. \end{aligned}$$

We denote by \(F_0\) the cumulative distribution function of the density \(p_0\), namely,

$$F_0(x)\equiv F_{G_0}(x)=\int _{-\infty }^xp_0(u)\,\mathrm {d}u,\quad x\in \mathbb {R}.$$

The interest is in recovering the unknown mixing distribution \(G_0\in \mathscr {G}\) from observations of the random sample \(X^{(n)}:=(X_1,\,\ldots ,\,X_n)\). The formulation of the problem applies to both finite and infinite mixtures, but the focus of this chapter is primarily on the case in which the sampling density is a mixture with a number of components that is unknown but bounded above.
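To fix ideas on the objects just introduced, the following Python sketch evaluates the mixed density in (1) and its distribution function for a discrete mixing distribution; the standard Gaussian location kernel and all numerical values are assumptions made only for this illustration and play no role in the results of the chapter.

```python
import numpy as np
from scipy.stats import norm

# Illustrative discrete mixing distribution G = sum_j w_j * delta_{y_j} and a
# standard Gaussian location kernel k(x | y) = phi(x - y); both are assumptions
# made only for this sketch.
weights = np.array([0.3, 0.7])   # mixture weights w_j
atoms = np.array([-1.0, 2.0])    # support points y_j

def p_G(x, w=weights, y=atoms):
    """Mixed density p_G(x) = sum_j w_j k(x | y_j), cf. (1)."""
    return np.sum(w * norm.pdf(np.asarray(x)[..., None] - y), axis=-1)

def F_G(x, w=weights, y=atoms):
    """Mixed distribution function F_G(x) = sum_j w_j K(x | y_j)."""
    return np.sum(w * norm.cdf(np.asarray(x)[..., None] - y), axis=-1)

x_grid = np.linspace(-8.0, 10.0, 2001)
dx = x_grid[1] - x_grid[0]
print(np.sum(p_G(x_grid)) * dx)          # ~1: p_G integrates to one
print(F_G(np.array([-1.0, 0.0, 2.0])))   # values of the mixed c.d.f.
```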

The problem was initially studied from the frequentist perspective by Chen [1], who established that, when \(p_0\) has an unknown number of components \(d_0\) such that \(1\le d_0\le N\) for some fixed integer N, the optimal rate for estimating the mixing distribution \(G_0\) is only \(n^{-1/4}\) and that this rate is achievable, under identifiability conditions, by a minimum distance estimator. Although Theorem 2 in Chen [1], p. 226, is not correct because of the Lemma 2 it relies on, an amended version of Lemma 2 has recently been given by Heinrich and Kahn [9] in assertion (21) of their Theorem 6.3, p. 2857, which compares a fixed mixture with all the mixtures having mixing distributions in an \(L^1\)-Wasserstein ball, instead of comparing all possible pairs of mixtures in a ball. As a consequence, Theorem 2 of Chen [1] remains valid once uniformity over an \(L^1\)-Wasserstein ball is dropped, and the statement weakens to an assertion on the optimal pointwise rate of estimation: for any fixed mixing distribution, say \(G_0\), the minimum distance estimator converges at the \(n^{-1/4}\)-rate, but with a multiplicative constant that may depend on \(G_0\). The first Bayesian analysis of the problem we are aware of traces back to Ishwaran et al. [11], who define a prior law over the space of all mixing distributions with at most N components, the mixture weights being assigned an N-dimensional Dirichlet distribution with a non-informative choice for the shape parameters that are all set equal to \(\alpha /N\) for a positive constant \(\alpha \). Under conditions similar to those postulated by Chen [1], which, in particular, employ the notion of strong identifiability in mixture models, they prove that Bayesian estimation of the mixing distribution in the Kantorovich metric is possible at the optimal rate \(n^{-1/4}\), up to a \(\log n\)-factor. More recently, posterior convergence rates for estimating the mixing distribution in the \(L^2\)-Wasserstein metric for finite mixtures of multivariate distributions have been discussed by Nguyen [14], following a different line of reasoning. In this chapter, we show that the mixing distribution is estimable in the Kantorovich or \(L^1\)-Wasserstein metric at the optimal rate \(n^{-1/4}\) (up to a logarithmic factor) for a large class of prior laws over the space of mixing distributions with at most N components, under less stringent conditions than those used in Ishwaran et al. [11] or in Nguyen [14]. This is achieved by combining the approach of Ishwaran et al. [11], which instrumentally uses posterior contraction rates in the sup-norm for the distribution function and strong identifiability to shift to the Kantorovich distance between mixing distributions, with the current methodology for the study of posterior contraction rates, which can by now draw upon many refined results on small ball prior probability estimates. Many aspects of this fundamental statistical problem still remain unclear and we hope to contribute to a better understanding of it in a follow-up study.

Before introducing the notation, a remark on the use of the term “Bayesian deconvolution” is in order. This phrase has been recently introduced by Efron [4] to describe a maximum likelihood procedure for estimating the mixing distribution in general mixture models of the form in (1). Even if the mixtures herein considered are not necessarily convolution kernel mixtures, we retain the expression for its evocative power in recalling the general inverse problem of recovering the unknown mixing distribution.

Notation. In this paragraph, we set out the notation and recall some definitions used throughout the chapter.

  • The symbols “\(\lesssim \)” and “\(\gtrsim \)” indicate inequalities valid up to a constant multiple that is universal or fixed within the context, but anyway inessential for our purposes.

  • All probability density functions are meant to be with respect to Lebesgue measure \(\lambda \) on \(\mathbb {R}\) or on some subset thereof.

  • The same symbol, say G, is used to denote a probability measure on \((\mathscr {Y},\,\mathscr {B})\) as well as the corresponding cumulative distribution function.

  • The degenerate probability distribution putting mass one at a point \(y\in \mathbb {R}\) is denoted by \(\delta _y\).

  • The notation Pf stands for the expected value \(\int f\,\mathrm {d}P\), where the integral is understood to extend over the entire natural domain when, here and elsewhere, the domain of integration is omitted. With this convention, for the empirical measure \({\mathbb P_n}:={n}^{-1}\sum _{i=1}^n\delta _{X_i}\) associated with the random sample \(X_1,\,\ldots ,\,X_n\), namely, the discrete uniform distribution on the sample values that puts mass 1 / n on each one of the observations, the notation \(\mathbb {P}_nf\) abbreviates the formula \(n^{-1}\sum _{i=1}^n f(X_i)\).

  • For every pair \(\mathbf {x}_N,\,\mathbf {y}_N\in \mathbb {R}^N\), \(\Vert \mathbf {x}_N-\mathbf {y}_N\Vert _{\ell ^1}\) stands for the \(\ell ^1\)-distance \(\sum _{j=1}^N|x_j-y_j|\).

  • For a probability measure Q on \((\mathbb {R},\,\mathscr {B}({\mathbb {R}}))\), let q denote its density. For any \(\epsilon >0\),

    $$B_{\mathrm {KL}}(P_0;\,\epsilon ^2):= \left\{ Q:\,P_0\left( \log \frac{p_0}{q}\right) \le \epsilon ^2,\,\,\, P_0\left( \log \frac{p_0}{q}\right) ^2\le \epsilon ^2 \right\} $$

    denotes a Kullback-Leibler type neighborhood of \(P_0\) of radius \(\epsilon ^2\). Define, for every \(\alpha \in (0,\,1]\), the divergence \(\rho _\alpha (P_0\Vert Q):=(1/\alpha )[P_0(p_0/q)^\alpha -1]\), see Wong and Shen [21], pp. 351–352. Then

    $$B_{\rho _\alpha }(P_0;\, \epsilon ^2):= \left\{ Q:\,\rho _\alpha (P_0\Vert Q)\le \epsilon ^2 \right\} $$

    is the \(\rho _\alpha \)-neighborhood of \(P_0\) of radius \(\epsilon ^2\). The definition of \(\rho _\alpha \) extends to negative values of \(\alpha \in (-1,\,0)\). In particular, for \(\alpha =-1/2\), the divergence \(\rho _{-1/2}(P_0\Vert Q)=-2\int p_0[(q/p_0)^{1/2}-1]\,\mathrm {d}\lambda =\int (p_0^{1/2}-q^{1/2})^2\,\mathrm {d}\lambda \) is the squared Hellinger distance. We can thus define the following Hellinger type neighborhood of \(P_0\) of radius \(\epsilon ^2\):

    $$B_{\rho _{-1/2}\Vert \cdot \Vert _\infty }(P_0;\, \epsilon ^2):= \left\{ Q:\,\rho _{-1/2}(P_0\Vert Q)\left\| \frac{p_0}{q}\right\| _\infty \le \epsilon ^2 \right\} .$$
  • For any real number \(p\ge 1\) and any pair of probability measures \(G_1,\,G_2\in \mathscr {G}\) with finite pth absolute moments, the \(L^p\)-Wasserstein distance between \(G_1\) and \(G_2\) is defined as

    $$W_p(G_1,\,G_2):=\left( \inf _{\gamma \in \Gamma (G_1,\,G_2)}\int _{\mathscr {Y}\times \mathscr {Y}}|y_1-y_2|^p\, \gamma (\mathrm {d}y_1,\,\mathrm {d}y_2)\right) ^{1/p},$$

    where \(\Gamma (G_1,\,G_2)\) is the set of all joint probability measures on \((\mathscr {Y}\times \mathscr {Y})\subseteq \mathbb {R}^2\), with marginal distributions \(G_1\) and \(G_2\) on the first and second arguments, respectively.
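For probability measures on the real line, the case \(p=1\) (the Kantorovich metric) admits the distribution-function representation \(W_1(G_1,\,G_2)=\int |G_1-G_2|\,\mathrm {d}\lambda \) used later in the chapter. The following sketch compares this representation with the routine scipy.stats.wasserstein_distance, which computes \(W_1\) for one-dimensional distributions; the two discrete mixing distributions are arbitrary choices made for the illustration.

```python
import numpy as np
from scipy.stats import wasserstein_distance

# Two illustrative discrete mixing distributions G1, G2 (atoms and weights are
# arbitrary choices made for this sketch).
y1, w1 = np.array([-1.0, 0.5, 2.0]), np.array([0.2, 0.5, 0.3])
y2, w2 = np.array([-0.5, 1.0]), np.array([0.6, 0.4])

# Kantorovich (L^1-Wasserstein) distance via scipy, which handles the
# one-dimensional case p = 1 directly.
d_scipy = wasserstein_distance(y1, y2, u_weights=w1, v_weights=w2)

# The same distance via the c.d.f. representation W_1(G1, G2) = int |G1 - G2| d(lambda).
grid = np.linspace(-5.0, 5.0, 40001)
dx = grid[1] - grid[0]
G1 = np.array([w1[y1 <= t].sum() for t in grid])   # c.d.f. of G1
G2 = np.array([w2[y2 <= t].sum() for t in grid])   # c.d.f. of G2
d_cdf = np.sum(np.abs(G1 - G2)) * dx

print(d_scipy, d_cdf)   # the two values agree up to discretization error
```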

2 Main Results

This section is devoted to exposing the main results of the chapter and is split into two parts. In the first one, preliminary results on Bayesian estimation of distribution functions in the Kolmogorov metric, which are valid for a large class of prior laws, are presented and some issues are highlighted. In the second part, arguably the most relevant, attention is restricted to finite mixtures with an unknown, but bounded above, number of components, and Bayesian estimation of the mixing distribution in the Kantorovich metric at the optimal rate \(n^{-1/4}\) (up to a logarithmic factor) is discussed.

Posterior Concentration of Kernel Mixtures in the Kolmogorov Metric

The following assumption will be hereafter in force.

Assumption \(\mathbf {A}\). Let

$$\begin{aligned} \epsilon _n:=\bigg (\frac{\log n}{n}\bigg )^{1/2}L_n,\quad n\in \mathbb {N}, \end{aligned}$$
(2)

where, depending on the prior concentration rate on small balls around \(P_0\), the sequence of positive real numbers \((L_n)\) can be either slowly varying at \(+\infty \) or degenerate at an appropriate constant \(L_0\).

Comments on the two possible specifications of \((L_n)\) in connection with the prior concentration rate are postponed to Lemma 1, which provides sufficient conditions on the distribution function \(F_0\) and the prior concentration rate \(\epsilon _n\) for the posterior to contract at a nearly \(\sqrt{n}\)-rate on Kolmogorov neighborhoods of \(F_0\). We warn the reader that, unless otherwise specified, in all stochastic order symbols used hereafter, the probability measure \(\mathbf {P}\) is understood to be \(P_0^n\), the joint law of the first n coordinate projections of the infinite product probability measure \(P_0^{\infty }\). Also, \(\varPi _n\) stands for a prior law, possibly depending on the sample size, over the space of probability measures \(\{P_G,\, G\in \mathscr {G}\}\), with density \(p_G\) as defined in (1).

Lemma 1

Let \(F_0\) be a continuous distribution function. If, for a constant \(C>0\) and a sequence \(\epsilon _n\) as defined in (2), we have

$$\begin{aligned} \varPi _n(B_{\mathrm {KL}}(P_0;\,\epsilon _n^2))\gtrsim \exp {(-Cn\epsilon _n^2)}, \end{aligned}$$
(3)

then, for \(M_n\gtrsim \sqrt{(C+1/2)}L_n\),

$$\begin{aligned} \varPi _n\bigg (\sqrt{n}\sup _x|(F_G-F_0)(x)|>M_n(\log n)^{1/2}\mid X^{(n)}\bigg )=o_{\mathbf {P}}(1). \end{aligned}$$
(4)

Proof

The posterior probability of the event

$$A_n^c:=\bigg \{G:\,\sqrt{n}\sup _x|(F_G-F_0)(x)|>M_n(\log n)^{1/2}\bigg \}$$

is given by

$$\varPi _n(A_n^c\mid X^{(n)})=\frac{\int _{A_n^c}\prod _{i=1}^np_G(X_i)\, \varPi _n(\mathrm {d}G)}{\int _{\mathscr {G}} \prod _{i=1}^np_G(X_i)\, \varPi _n(\mathrm {d}G)}.$$

We construct (a sequence of) tests \((\phi _n)\) for testing the hypothesis

$$H_0:\,P=P_0\quad \text{ versus }\quad H_1:\,P=P_G,\, G\in A_n^c,$$

where \(\phi _n\equiv \phi _n(X^{(n)};\,P_0):\,\mathscr {X}^n\rightarrow \{0,\,1\}\) is the indicator function of the rejection region of \(H_0\), such that

$$\begin{aligned}&P_0^n\phi _n\rightarrow 0 \,\,\, \text{ as } n\rightarrow +\infty \\[5pt]&\qquad \quad \text{ and } \,\,\,\, \sup _{G\in A_n^c} P_G^n(1-\phi _n)\le 2\exp {(-2(M_n-K)^2\log n)} \, \text{ for } \text{ sufficiently } \text{ large } \text{ n }, \end{aligned}$$

with a finite constant \(K>0\) and a sequence \(M_n>K\) for every n large enough. Define the test

$$\phi _n:=1_{R_n},\,\, \text{ with } \,\,R_n:=\bigg \{x^{(n)}:\,\sqrt{n}\sup _x|(F_n-F_0)(x)|>K(\log n)^{1/2}\bigg \},$$

where \(F_n\) is the empirical distribution function, that is, the distribution function associated with the empirical probability measure \(\mathbb {P}_n\) of the sample \(X^{(n)}\). Since \(x\mapsto F_0(x)\) is continuous by assumption, by virtue of the Dvoretzky–Kiefer–Wolfowitz [3] (DKW for short) inequality, with the tight universal constant in Massart [13], the type I error probability \(P_0^n\phi _n\) can be bounded above as follows

$$\begin{aligned} P_0^n\phi _n=P_0^n(R_n)\le 2\exp {(-2K^2\log n)}. \end{aligned}$$

Then,

$$\begin{aligned} E_0^n[\varPi _n(A_n^c\mid X^{(n)})\phi _n]\le P_0^n\phi _n\le 2\exp {(-2K^2\log n)}, \end{aligned}$$
(5)

where \(E_0^n\) denotes expectation with respect to \(P_0^n\), and

$$\begin{aligned} E_0^n[\varPi _n(A_n^c\mid X^{(n)})]= & {} E_0^n[\varPi _n(A_n^c\mid X^{(n)})\phi _n]+ E_0^n[\varPi _n(A_n^c\mid X^{(n)})(1-\phi _n)]\\\le & {} 2\exp {(-2K^2\log n)}+ E_0^n[\varPi _n(A_n^c\mid X^{(n)})(1-\phi _n)]. \end{aligned}$$

It remains to control the term \(E_0^n[\varPi _n(A_n^c\mid X^{(n)})(1-\phi _n)]\). Having defined the set

$$D_n:=\left\{ x^{(n)}:\, \int _{\mathscr {G}}\prod _{i=1}^n \frac{p_G}{p_0}(x_i)\,\varPi _n(\mathrm {d}G)\le \varPi _n(B_{\mathrm {KL}}(P_0;\,\epsilon _n^2)) \exp {(-(C+1)n\epsilon _n^2)} \right\} ,$$

consider the following decomposition

$$E_0^n[\varPi _n(A_n^c\mid X^{(n)})(1-\phi _n)]= E_0^n[\varPi _n(A_n^c\mid X^{(n)})(1-\phi _n)(1_{D_n}+1_{D_n^c})].$$

It is known from Lemma 8.1 of Ghosal et al. [7], p. 524, that \(P_0^n(D_n)\le (C^2n\epsilon _n^2)^{-1}\). It follows that

$$\begin{aligned} E_0^n[\varPi _n(A_n^c\mid X^{(n)})(1-\phi _n)1_{D_n}]\le P_0^n(D_n)\le (C^2n\epsilon _n^2)^{-1}. \end{aligned}$$
(6)

By the assumption in (3) and Fubini’s theorem,

$$\begin{aligned} E_0^n[\varPi _n(A_n^c\mid X^{(n)})(1-\phi _n)1_{D_n^c} ]\lesssim \exp {((2C+1)n\epsilon _n^2)}\int _{A_n^c} P_G^n(1-\phi _n)\,\varPi _n(\mathrm {d}G). \end{aligned}$$
(7)

The following arguments are aimed at finding an exponential upper bound on \(\sup _{G\in A_n^c} P_G^n(1-\phi _n)\). By the triangle inequality, over the set \(R_n^c\), for every \(G\in A_n^c\),

$$\begin{aligned} M_n(\log n)^{1/2}<&\sqrt{n}\sup _x|(F_G-F_0)(x)|\\&\qquad \quad \le \sqrt{n}\sup _x|(F_G- F_n)(x)|+ \sqrt{n}\sup _x|(F_n-F_0)(x)|\\&\qquad \quad \le \sqrt{n}\sup _x|(F_G-F_n)(x)|+K(\log n)^{1/2}, \end{aligned}$$

which implies that

$$\sqrt{n}\sup _x|(F_G-F_n)(x)|>(M_n-K)(\log n)^{1/2}.$$

Since \(x\mapsto F_G(x):=\int _{-\infty }^xp_G(u)\,\mathrm {d}u\) is continuous, by again applying the DKW inequality, we obtain that

$$\begin{aligned} \sup _{G\in A_n^c} P_G^n(1-\phi _n)&\le \sup _{G\in A_n^c} P_G^n\bigg ( \sqrt{n}\sup _x|(F_G-F_n)(x)|>(M_n-K)(\log n)^{1/2} \bigg )\\&\le 2\exp {(-2(M_n-K)^2\log n)}. \end{aligned} $$

Combining the above assertion with (7), we see that

$$\begin{aligned} E_0^n[\varPi _n(A_n^c\mid X^{(n)})(1-\phi _n)1_{D_n^c} ]\lesssim 2\exp {(-[2(M_n-K)^2-(2C+1)L_n^2]\log n)}, \end{aligned}$$
(8)

where the right-hand side of the above inequality tends to zero provided that \((M_n-K)>\sqrt{(C+1/2)}L_n\) for every sufficiently large n. The in-probability convergence in (4) follows from (5), (6) and (8). This concludes the proof.    \(\square \)
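The test used in the proof is fully constructive. The sketch below computes \(\phi _n\) from the Kolmogorov–Smirnov statistic and estimates its type I error probability by simulation, to be compared with the DKW bound \(2\exp {(-2K^2\log n)}=2n^{-2K^2}\); the two-component Gaussian mixture standing in for \(P_0\), as well as the values of n, K and the number of replications, are arbitrary choices made for this illustration only.

```python
import numpy as np
from scipy.stats import norm, kstest

rng = np.random.default_rng(0)

# Illustrative true density p_0: a two-component Gaussian mixture (an assumption
# of this sketch; the chapter keeps P_0 generic).
w0, mu0 = np.array([0.4, 0.6]), np.array([-1.0, 1.5])
F0 = lambda x: np.sum(w0 * norm.cdf(np.atleast_1d(x)[..., None] - mu0), axis=-1)

def sample_p0(n):
    comp = rng.choice(len(w0), size=n, p=w0)
    return rng.normal(loc=mu0[comp], scale=1.0)

def phi_n(x, K):
    """Indicator of R_n = {sqrt(n) sup_x |F_n(x) - F_0(x)| > K (log n)^{1/2}}."""
    n = len(x)
    D_n = kstest(x, F0).statistic   # Kolmogorov-Smirnov statistic sup_x |F_n - F_0|
    return float(np.sqrt(n) * D_n > K * np.sqrt(np.log(n)))

n, K, reps = 500, 0.7, 2000
level = np.mean([phi_n(sample_p0(n), K) for _ in range(reps)])
print(level, 2.0 * n ** (-2 * K ** 2))   # empirical type I error vs DKW bound 2 n^{-2K^2}
```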

Some remarks and comments on Lemma 1 are in order.

  • The first one aims at spelling out the assumptions used in the proof, some of which could otherwise erroneously seem to be confined to the context of finite mixture models, as well as at clarifying their role. Given the prior concentration rate \(\epsilon _n\) in (3), which depends on the prior distribution \(\varPi _n\) and the “point” \(P_0\), the only further assumption used is the continuity of the distribution functions \(F_0\) and \(F_G\), which is satisfied for Lebesgue dominated probability measures \(P_0\) and \(P_G\). This condition is used to control the type I and type II error probabilities of the (sequence of) tests \((\phi _n)\) via the DKW inequality. The assumption that the density \(p_G\) is modeled as a mixture is, instead, in no way used, so that, even if the result has its origin in the context of finite mixtures, it applies to general dominated models, and a nearly parametric (up to a logarithmic factor) prior concentration rate is the only driver of posterior contraction.

  • Lemma 1 has its roots in Theorem 2 of Ishwaran et al. [11], p. 1324 (see pp. 1330–1331 for the proof), which deals with finite mixtures having an unknown number of components \(d_0\), yet bounded above by an integer N, namely, \(1\le d_0\le N<+\infty \), while the prior is supported over the space of all mixing distributions with at most N components, the mixture weights being assigned an N-dimensional Dirichlet distribution with a non-informative choice for the shape parameters that are all set equal to \(\alpha /N\) for a positive constant \(\alpha \). Nonetheless, as previously remarked, Lemma 1 has a broader scope of validity and applies also to infinite kernel mixtures with prior laws for the mixing distribution other than the Dirichlet process, which “locally” attain an almost parametric prior concentration rate. This is the case for Dirichlet location or location-scale mixtures of normal densities and, more generally, for location-scale mixtures of exponential power densities with an even integer shape parameter, when the sampling density is of the same form as the assumed model, with mixing distribution being either compactly supported or having sub-exponential tails, see Ghosal and van der Vaart [8] and Scricciolo [16], Theorems 4.1, 4.2 and Corollary 4.1, pp. 285–290. In all these cases, the prior concentration rate is (at worst) \(\epsilon _n=n^{-1/2}\log n\), where \(L_n=(\log n)^{1/2}\). An extension of the previous results to convolution mixtures of super-smooth kernels, with Pitman-Yor or normalized inverse-Gaussian processes as priors for the mixing distribution, for which Lemma 1 also holds, is considered in Scricciolo [17], see Theorem 1, pp. 486–487. Another class of priors on kernel mixtures to which Lemma 1 applies is that of sieve priors. For a given kernel, a sieve prior is defined by combining single priors on classes of kernel mixtures, each one indexed by the number of mixture components, with a prior on that random number. A probability measure with kernel mixture density is then generated in two steps: first the model index, i.e., the number of mixture components, is selected; then a probability measure is generated from the chosen model according to a prior on it. When the true density \(p_0\) is itself a kernel mixture, the prior concentration rate can be assessed by bounding below the probability of Kullback-Leibler type neighborhoods of \(P_0\) by the probability of \(\ell ^1\)-balls of appropriate dimension. In fact, the approximation properties of the mixtures under consideration can be exploited to find a well-fitting distribution for the sampling density within a proper subclass. More precisely, any finite kernel mixture can be approximated arbitrarily well (in the distance induced by the \(L^1\)-norm) by mixtures having the same number of components, with the mixture components and weights taking values in \(\ell ^1\)-neighborhoods of the corresponding true elements. The number of mixture components is then constant, leading to the prior concentration rate \(\epsilon _n\propto (n/\log n)^{-1/2}\), where \(L_n\equiv L_0\). Examples of sieve priors in which, for every choice of the model index, the mixture weights are jointly distributed according to a Dirichlet distribution are provided by Bernstein polynomials, see Theorem 2.2 of Ghosal [6], pp. 1268–1269, and by histograms and polygons, see Theorem 1 of Scricciolo [15], pp. 629–630 (and pp. 636–637 for the proof).
If, as a special case, a single prior distribution on kernel mixtures with a sample size-dependent number \(N\equiv L_n\) of mixture components is considered, then the prior concentration rate is \(\epsilon _n=(n/\log n)^{-1/2}L_n\) for an arbitrary slowly varying sequence \(L_n\rightarrow +\infty \).

We now state sufficient conditions on the kernel density and the prior distributions for the mixture atoms and weights so that the overall prior on kernel mixtures with (at most) N components verifies condition (3) for \(\epsilon _n\propto (n/\log n)^{-1/2}\), when the sampling density is itself a kernel mixture with \(1\le d_0\le N\) components. The aim of this analysis is twofold: first, to provide less stringent requirements on the kernel density than those postulated in condition (b) employed in Theorem 2 of Ishwaran et al. [11], p. 1324, which relies on Lemma 4 of Ishwaran [10], pp. 2170–2171; second, to generalize the aforementioned result to a class of prior distributions on the mixture weights that comprises the Dirichlet distribution as a special case. The density \(p_G\) is modeled as

$$\begin{aligned} p_{G}(\cdot )=\sum _{j=1}^Nw_jk(\cdot \mid y_j), \end{aligned}$$

with a discrete mixing distribution \(G=\sum _{j=1}^{N}w_j\delta _{y_j}\). The vector \(\mathbf {w}_N:=(w_1,\,\ldots ,w_N)\) of mixing weights has a prior distribution \(\tilde{\pi }_N\) on the \((N-1)\)-dimensional simplex \(\varDelta _{N}:=\{\mathbf {w}_N\in \mathbb {R}^N:\, 0\le w_j\le 1,\,\,j=1,\,\ldots ,\,N,\,\,\,\,\sum _{j=1}^Nw_j=1\}\) and the atoms \(y_1,\,\ldots ,\,y_N\) are independently and identically distributed according to a prior distribution \(\pi \). We shall also use the notation \(\mathbf {y}_N\) for \((y_1,\,\ldots ,\,y_N)\). The model can be thus described:

  • the random vectors \(\mathbf {y}_N\) and \(\mathbf {w}_N\) are independent;

  • given \((\mathbf {y}_N,\,\mathbf {w}_N)\), the random variables \(X_1,\,\ldots ,\,X_n\) are conditionally independent and identically distributed according to \(p_{G}\).

The overall prior is then \(\varPi =\tilde{\pi }_N\times \pi ^{\otimes N}\). Let the sampling density \(p_0\) be itself a finite kernel mixture, with \(1\le d_0\le N\) components,

$$\begin{aligned} p_0(\cdot )\equiv p_{G_0}(\cdot )=\sum _{j=1}^{d_0}w_j^0k(\cdot \mid y_j^0), \end{aligned}$$

where the mixing distribution is \(G_0=\sum _{j=1}^{d_0}w_j^0\delta _{y_j^0}\) for weights \(\mathbf {w}_{d_0}^0:=(w^0_1,\,\ldots ,w^0_{d_0})\in \varDelta _{d_0}\) and support points \(\mathbf {y}_{d_0}^0:=(y^0_1,\,\ldots ,\,y^0_{d_0})\in \mathbb {R}^{d_0}\). A caveat applies: if \(d_0\) is strictly smaller than N, that is, \(1\le d_0<N\), then the vectors \(\mathbf {w}_{d_0}^0\) and \(\mathbf {y}_{d_0}^0\) are viewed as degenerate elements of \(\varDelta _{N}\) and \(\mathbb {R}^{N}\), respectively, with coordinates \(w_{d_0+1}=\,\cdots \,=w_{N}=0\) and \(y_{d_0+1}=\,\cdots \,=y_{N}=0\).
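A minimal generative sketch of this hierarchical model follows; the symmetric Dirichlet distribution on the weights, the Gaussian prior \(\pi \) on the atoms and the Gaussian location kernel are assumptions made only for the illustration, since the results below require nothing beyond conditions (i)–(iii) stated next.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

# Hypothetical choices for this sketch only: symmetric Dirichlet weights,
# a Gaussian prior pi on the atoms and a Gaussian location kernel.
N, alpha, n = 5, 1.0, 200

# Step 1: draw (w_N, y_N) independently from the prior Pi = pi_tilde_N x pi^{(x)N}.
w_N = rng.dirichlet(alpha * np.ones(N))        # mixture weights on the simplex
y_N = rng.normal(loc=0.0, scale=2.0, size=N)   # i.i.d. atoms from pi

# Step 2: given (y_N, w_N), draw X_1, ..., X_n i.i.d. from p_G.
components = rng.choice(N, size=n, p=w_N)
X = rng.normal(loc=y_N[components], scale=1.0)

# The resulting sampling density p_G, evaluated on a grid.
grid = np.linspace(-10.0, 10.0, 801)
p_G = np.sum(w_N * norm.pdf(grid[:, None] - y_N), axis=1)
print(w_N.round(3), X[:5].round(2), (np.sum(p_G) * (grid[1] - grid[0])).round(4))
```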

We assume that

  1. (i)

    there exists a constant \(c_k>0\) such that

    $$\Vert k(\cdot \mid y_1)-k(\cdot \mid y_2)\Vert _1\le c_k|y_1-y_2|\,\,\, \text{ for } \text{ all } y_1,\,y_2\in \mathscr {Y};$$
  2. (ii)

    for every \(\epsilon >0\) small enough and a constant \(c_0>0\),

    $$\tilde{\pi }_N(\{\mathbf {w}_{N}\in \varDelta _{N}:\,\Vert \mathbf {w}_{N}-\mathbf {w}^0_{N}\Vert _{\ell ^1}\le \epsilon \})\gtrsim \epsilon ^{c_0N};$$
  3. (iii)

    the prior distribution \(\pi \) for the atoms has a continuous and positive Lebesgue density (also denoted by \(\pi \)) on an interval containing the support of \(G_0\).

Some remarks and comments on the previously listed assumptions are in order. Condition (i) requires the map \(y\mapsto k(\cdot \mid y)\), from \(\mathscr {Y}\) into \(L^1(\lambda )\), to be globally Lipschitz continuous. Condition (ii) is satisfied for a Dirichlet prior distribution \(\tilde{\pi }_N=\mathrm {Dir}(\alpha _{1},\,\ldots ,\,\alpha _{N})\), with parameters \(\alpha _{1},\,\ldots ,\,\alpha _{N}\) such that, for constants \(a,\,A>0\) and \(D\ge 1\), and for \(0<\epsilon \le 1/(DN)\),

$$\begin{aligned} A\epsilon ^a\le \alpha _{j}\le D, \quad j=1,\,\ldots ,\,N. \end{aligned}$$

Using Lemma A.1 of Ghosal [6], pp. 1278–1279, we find that \(\tilde{\pi }_{N}(N(\mathbf {w}_{N}^0;\,\epsilon ))\gtrsim \exp {(-c_0N\log (1/\epsilon ))}\) for a constant \(c_0>0\) depending only on \(a,\,A,\,D\) and \(\sum _{j=1}^{N}\alpha _{j}\), where \(N(\mathbf {w}_{N}^0;\,\epsilon )\) denotes the \(\mathbf {w}^0_{N}\)-centered \(\ell ^1\)-ball of radius \(\epsilon \) defined in the proof of Proposition 1 below.
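As a rough numerical check of condition (ii), the sketch below estimates \(\tilde{\pi }_N(\{\Vert \mathbf {w}_{N}-\mathbf {w}^0_{N}\Vert _{\ell ^1}\le \epsilon \})\) by Monte Carlo for a symmetric Dirichlet prior and monitors the log-log ratio, which should remain bounded if the probability decays no faster than polynomially in \(\epsilon \); the values of N, of the shape parameters and of \(\mathbf {w}^0_{N}\) are arbitrary choices made for the illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative setting: N components, a symmetric Dirichlet(alpha/N, ..., alpha/N)
# prior on the weights and an arbitrary "true" weight vector w0 in the simplex.
N, alpha = 3, 1.0
w0 = np.array([0.5, 0.3, 0.2])

draws = rng.dirichlet(np.full(N, alpha / N), size=2_000_000)
l1 = np.abs(draws - w0).sum(axis=1)

for eps in (0.4, 0.2, 0.1):
    prob = np.mean(l1 <= eps)
    # Condition (ii) asks for prob >= const * eps^{c_0 N}; the log-log ratio
    # below should therefore stay bounded as eps decreases.
    print(eps, prob, np.log(prob) / np.log(eps))
```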

Proposition 1

Under assumptions (i)–(iii), condition (3) is verified for

$$\epsilon _n\propto (n/\log n)^{-1/2}.$$

Proof

For every density \(p_{G}\), with mixing distribution \(G=\sum _{j=1}^{N}w_j\delta _{y_j}\) having support points \(\mathbf {y}_{N}\in \mathbb {R}^{N}\) and mixture weights \(\mathbf {w}_{N}\in \varDelta _{N}\), by assumption (i) we have

$$\begin{aligned} \Vert p_{G}-p_0\Vert _1&\lesssim \sum _{j=1}^{N}w_j^0 \Vert k(\cdot \mid y_j)-k(\cdot \mid y_j^0)\Vert _1\,+\, \sum _{j=1}^{N}|w_j-w_j^0|\Vert k(\cdot \mid y_j)\Vert _1\\[0pt]&\lesssim \Vert \mathbf {y}_{N}-\mathbf {y}^0_{N}\Vert _{\ell ^1} \,+\,\Vert \mathbf {w}_{N}-\mathbf {w}_{N}^0\Vert _{\ell ^1}. \end{aligned} $$

Let \(0<\epsilon \le [(1/2)\wedge (1-e^{-1})/\sqrt{2}]\) be fixed. For \(\mathbf {y}_{N}\in \mathbb {R}^{N}\) and \(\mathbf {w}_{N}\in \varDelta _{N}\) such that \(\Vert \mathbf {y}_{N}-\mathbf {y}^0_{N}\Vert _{\ell ^1}\le \epsilon \) and \(\Vert \mathbf {w}_{N}-\mathbf {w}_{N}^0\Vert _{\ell ^1}\le \epsilon \), by the inequalities of LeCam [12], p. 40, relating the \(L^1\)-norm and the Hellinger metric, the squared Hellinger distance between \(p_0\) and \(p_G\) can be bounded above by a multiple of \(\epsilon \):

$$\rho _{-1/2}(P_0\Vert P_G)=\int (p_{G}^{1/2}-p_0^{1/2})^2\,\mathrm {d}\lambda \le \Vert p_{G}-p_0\Vert _1\lesssim \epsilon .$$

Then, by Lemma A.10 in Scricciolo [16], p. 305, for a suitable constant \(c_1>0\),

$$\left\{ p_{G}:\,G=\sum _{j=1}^{N}w_j\delta _{y_j}, \,\,\,\Vert \mathbf {w}_{N}-\mathbf {w}_{N}^0\Vert _{\ell ^1}\le \epsilon ,\,\,\, \Vert \mathbf {y}_{N}-\mathbf {y}^0_{N}\Vert _{\ell ^1}\le \epsilon \right\} \subseteq B_{\mathrm {KL}}\left( P_0;\,c_1\epsilon \left( \log \frac{1}{\epsilon }\right) ^2\right) .$$

Next, define the set \(N(\mathbf {w}_{N}^0;\,\epsilon ):=\{\mathbf {w}_{N}\in \varDelta _{N}:\,\Vert \mathbf {w}_{N}-\mathbf {w}^0_{N}\Vert _{\ell ^1}\le \epsilon \}\). For \(\epsilon >0\) small enough, by assumption (ii),

$$\tilde{\pi }_{N}(N(\mathbf {w}_{N}^0;\,\epsilon ))\gtrsim \exp {(-c_0N\log (1/\epsilon ))}$$

with an appropriate constant \(c_0>0\). Denoted by \(B(\mathbf {y}_{N}^0;\,\epsilon )\) the \(\mathbf {y}^0_{N}\)-centered \(\ell ^1\)-ball of radius \(\epsilon >0\),

$$B(\mathbf {y}^0_{N};\,\epsilon ):=\{\mathbf {y}_{N}\in \mathbb {R}^{N}:\, \Vert \mathbf {y}_{N}- \mathbf {y}_{N}^0\Vert _{\ell ^1}\le \epsilon \},$$

by condition (iii) the prior probability of \(B(\mathbf {y}^0_{N};\,\epsilon )\) under the N-fold product measure \(\pi ^{\otimes N}\) can be bounded below as follows:

$$\begin{aligned} \pi ^{\otimes N}(B(\mathbf {y}^0_{N};\,\epsilon ))&\ge \prod _{j=1}^{N} \pi ([y_j^0-(\epsilon /N),\,y_j^0+(\epsilon /N)])\\&=\prod _{j=1}^{N} \int _{y_j^0-(\epsilon /N)}^{y_j^0+(\epsilon /N)} \pi (y)\,\mathrm {d}y\gtrsim \exp {(-d_1N\log (1/\epsilon ))} \end{aligned} $$

for a positive constant \(d_1\). Therefore, for appropriate constants \(c_1,\,d_2>0\),

$$ \varPi (B_{\mathrm {KL}}(P_0;\,c_1\epsilon |\log \epsilon |^2))\gtrsim \tilde{\pi }_{N}(N(\mathbf {w}_{N}^0;\,\epsilon )) \,\pi ^{\otimes N} (B(\mathbf {y}^0_{N};\,\epsilon ))\gtrsim \exp {(-d_2N\log (1/\epsilon ))}. $$

Set \(\xi :=(c_1\epsilon )^{1/2}\log (1/\epsilon )\); since \(\log (1/\epsilon )\lesssim \log (1/\xi )\), we have \(\varPi (B_{\mathrm {KL}}(P_0;\,\xi ^2))\gtrsim \exp {(-c_2\log (1/\xi ))}\) for a real constant \(c_2>0\) (possibly depending on \(p_0\)). Replacing \(\xi \) with \({\epsilon }_n\), we get \(\varPi (B_{\mathrm {KL}}(P_0;\,{\epsilon }_n^2))\gtrsim \exp {(-c_2n{\epsilon }^2_n)}\) for sufficiently large n, and the proof is complete.    \(\square \)

Inspection of the proof of Lemma 1 reveals that, under the small ball prior probability estimate in (3), we have

$$E_0^n[\varPi _n(A_n^c\mid X^{(n)})]=O((n\epsilon _n^2)^{-1}).$$

The assertion of Lemma 1 can be enhanced to have

$$E_0^n[\varPi _n(A_n^c\mid X^{(n)})]=O(\exp {(-B_1n\epsilon _n^2)})$$

by employing a small ball prior probability estimate involving stronger divergences. The convergence in (4) then becomes almost sure. Besides, since the posterior probability vanishes exponentially fast, namely, along almost all sample sequences, for a finite constant \(B>0\),

$$\varPi _n(A_n^c\mid X^{(n)})\lesssim \exp {(-Bn\epsilon _n^2)}\, \text{ for } \text{ all } \text{ but } \text{ finitely } \text{ many } \text{ n, }$$

the stochastic order of the maximum absolute difference between \(F_0\) and the posterior expected distribution function can be assessed, see Corollary 1 below.

Lemma 2

Under the conditions of Lemma 1, if the small ball prior probability estimate in (3) is replaced by

$$\begin{aligned} \varPi _n(B_{\rho _\alpha }(P_0;\, \epsilon _n^2))\gtrsim \exp {(-Cn\epsilon _n^2)},\,\,\, \text{ for } \alpha \in (0,\,1]\text{, } \end{aligned}$$
(9)

then, for \(M_n\gtrsim \sqrt{(C+1/2)}L_n\),

$$ \varPi _n\bigg (\sqrt{n}\sup _x|(F_G-F_0)(x)|>M_n(\log n)^{1/2}\mid X^{(n)}\bigg )\rightarrow 0 \quad P_0^\infty \text{-almost } \text{ surely. } $$

Proof

The proof is an adaptation of that of Lemma 1. We therefore highlight only the main changes. Taking a sequence \(K_n=\theta M_n\) for any \(\theta \in (0,\,1)\), we have

$$ E_0^n[\varPi _n(A_n^c\mid X^{(n)})\phi _n]\le P_0^n\phi _n\le 2\exp {(-2\theta ^2 M_n^2\log n)} $$

and

$$ E_0^n[\varPi _n(A_n^c\mid X^{(n)})(1-\phi _n)1_{D_n^c} ]\lesssim 2\exp {(-[2(1-\theta )^2M_n^2-(2C+1)L_n^2]\log n)}, $$

with

$$\begin{aligned} M_n>(1-\theta )^{-1}\sqrt{(C+1/2)}L_n \end{aligned}$$
(10)

for every sufficiently large n. A straightforward extension of Lemma 2 in Shen and Wasserman [19], p. 691 (and pp. 709–710 for the proof), yields that, for every \(\xi \in (0,\,1)\),

$$\begin{aligned} P_0^n\bigg (\int _{\mathscr {G}}\prod _{i=1}^n \frac{p_G}{p_0}(X_i)\,\varPi _n(\mathrm {d}G)\le \xi \,\varPi _n(B_{\rho _\alpha }(P_0;\, \epsilon _n^2))\exp {(-(C+1)n\epsilon _n^2)}\bigg ) \le (1-\xi )^{-1}\exp {(-\alpha Cn\epsilon _n^2)}. \end{aligned}$$
(11)

Considering \(M_n=IL_n\) for a finite constant \(I>(1-\theta )^{-1}\sqrt{(C+1/2)}\) so that condition (10) is satisfied, by combining partial bounds we obtain that

$$ E_0^n[\varPi _n(A_n^c\mid X^{(n)})]\lesssim \exp {(-B_1n\epsilon _n^2)} $$

for an appropriate finite constant \(B_1>0\). For a constant \(B>0\),

$$\begin{aligned} P_0^n\big (\varPi _n(A_n^c\mid X^{(n)})\ge \exp {(-Bn\epsilon _n^2)}\big )\lesssim \exp {\big (-(B_1-B)n\epsilon _n^2\big )}. \end{aligned}$$

Choose \(0<B<B_1\). Since \(\sum _{n=1}^{\infty }\exp {(-(B_1-B)n\epsilon _n^2)}<+\infty \), almost sure convergence follows from the first Borel-Cantelli lemma.    \(\square \)

Remark 1

The assertion of Lemma 2 still holds if the small ball prior probability estimate in (9) is replaced by the requirement

$$\begin{aligned} \varPi _n(B_{\rho _{-1/2}\Vert \cdot \Vert _\infty }(P_0;\, \epsilon _n^2))\gtrsim \exp {(-Cn\epsilon _n^2)}, \end{aligned}$$
(12)

which involves a Hellinger type neighborhood of \(P_0\). Then, a bound similar to that in (11) is given in Lemma 8.4 of Ghosal et al. [7], pp. 526–527.

As previously mentioned, Lemma 2 allows one to derive the stochastic order of the maximum absolute difference between \(F_0\) and its Bayes’ estimator

$$F_n^{\text {B}}(\cdot ):= \int _{\mathscr {G}} F_G(\cdot )\,\varPi _n(\mathrm {d}G\mid X^{(n)}), $$

namely, the posterior expected distribution function.
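In practice, \(F_n^{\mathrm {B}}\) can be approximated by averaging \(F_G\) over draws of \((\mathbf {w}_N,\,\mathbf {y}_N)\) from the posterior distribution. The sketch below assumes that such draws are available from some posterior sampler, which is not specified in the chapter, and uses a Gaussian location kernel; the fake draws are generated only to make the code self-contained.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)

# Hypothetical posterior draws of (w_N, y_N): in practice they would come from
# an MCMC or similar sampler, which is not specified here; the draws below are
# faked around a two-component configuration purely to make the sketch run.
S, N = 1000, 3
w_draws = rng.dirichlet(np.array([8.0, 8.0, 1.0]), size=S)            # (S, N)
y_draws = np.array([-1.0, 1.5, 0.0]) + 0.1 * rng.normal(size=(S, N))  # (S, N)

def F_G(x, w, y):
    """Mixed c.d.f. F_G(x) = sum_j w_j K(x | y_j) for a Gaussian location kernel."""
    return np.sum(w * norm.cdf(x[:, None] - y), axis=1)

def F_bayes(x):
    """Posterior expected distribution function F_n^B, approximated by averaging
    F_G over the S available posterior draws."""
    return np.mean([F_G(x, w_draws[s], y_draws[s]) for s in range(S)], axis=0)

x_grid = np.linspace(-5.0, 5.0, 11)
print(F_bayes(x_grid).round(3))   # nondecreasing values in [0, 1]
```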

Corollary 1

Under the conditions of Lemma 2, we have

$$\begin{aligned} \sqrt{n}\sup _x|(F_n^{\mathrm {B}}-F_0)(x)|=O_{\mathbf {P}}(M_n(\log n)^{1/2}). \end{aligned}$$

Proof

By standard arguments,

$$\begin{aligned} \sqrt{n}\sup _x|(F_n^{\mathrm {B}}-F_0)(x)|&= \sqrt{n}\sup _x\bigg |\int _{\mathscr {G}} F_G(x)\,\varPi _n(\mathrm {d}G\mid X^{(n)})-F_0(x)\bigg |\\&\le \int _{\mathscr {G}}\sqrt{n}\sup _x |(F_G-F_0)(x)|\,\varPi _n(\mathrm {d}G\mid X^{(n)})\\&= \bigg (\int _{A_n}+\int _{A_n^c}\bigg )\sqrt{n}\sup _x |(F_G-F_0)(x)|\,\varPi _n(\mathrm {d}G\mid X^{(n)})\\&\le M_n(\log n)^{1/2}+2\sqrt{n}\varPi _n(A_n^c\mid X^{(n)})\\&\lesssim M_n(\log n)^{1/2}\quad \text{ for } \text{ sufficiently } \text{ large } \text{ n } \end{aligned}$$

because condition (9) yields that, with probability one, for a finite constant \(B>0\), we have \(\sqrt{n}\,\varPi _n(A_n^c\mid X^{(n)})\lesssim \sqrt{n}\exp {(-Bn\epsilon _n^2)}\) for all but finitely many n. The assertion follows.    \(\square \)

Posterior Concentration of the Mixing Distribution in the Kantorovich Metric

In this section, we deal with the case where the prior distribution \(\varPi \) is supported over the collection of finite kernel mixtures with at most N components. Sufficient conditions are stated in Theorem 1 below so that the posterior rate of convergence, relative to the Kantorovich or \(L^1\)-Wasserstein metric, for the mixing distribution of over-fitted mixtures is, up to a slowly varying sequence, (at worst) equal to \((n/\log n)^{-1/4}\), the optimal pointwise rate being \(n^{-1/4}\), cf. Chen [1], Sect. 2, pp. 222–224.

In order to state the result, we need to introduce some more notation. For every \(y\in \mathscr {Y}\), we denote by \(K(x\mid y)\) the cumulative distribution function at x of the kernel density \(k(\cdot \mid y)\),

$$K(x\mid y):=\int _{-\infty }^xk(u\mid y)\,\mathrm {d}u.$$

For clarity of exposition, we recall that \(F_0\) is the distribution function of the mixture density \(p_0\equiv p_{G_0}\) corresponding to the mixing distribution \(G_0\) having an unknown number of components \(d_0\) bounded above by a fixed integer N.

Theorem 1

Under the conditions of Lemma 1, if, in addition,

(a):

\(\mathscr {Y}\) is compact,

(b):

for all \(x\in \mathbb R\), \(K(x\mid y)\) is twice differentiable with respect to y,

(c):

\(\left\{ K(\cdot \mid y):\,y\in \mathscr {Y}\right\} \) is strongly identifiable in the sense of Definition 2 in Chen [1], p. 225, equivalently, 2-strongly identifiable in the sense of Definition 2.2 in Heinrich and Kahn [9], p. 2848,

(d):

there exists a uniform modulus of continuity \(\omega (\cdot )\) such that

$$\sup _x|K^{(2)}(x\mid y)-K^{(2)}(x\mid y')|\le \omega (|y-y'|)\,\,\, \text{ with } \lim _{h\rightarrow 0}\omega (h)=0,$$

then, for \(M_n\gtrsim \sqrt{(C+1/2)}L_n\),

$$ \varPi \big (n^{1/4}W_1(G,\,G_0)>\sqrt{M_n}(\log n)^{1/4}\mid X^{(n)}\big )=o_{\mathbf {P}}(1). $$

Proof

Since Lemma 1 holds, we have

$$\begin{aligned} \varPi \bigg (\sqrt{n}\sup _x|(F_G-F_0)(x)|>M_n(\log n)^{1/2}\mid X^{(n)}\bigg )=o_{\mathbf {P}}(1). \end{aligned}$$
(13)

Consistently with the notation introduced in Lemma 1, we set

$$A_n:=\bigg \{G:\,\sqrt{n}\sup _x|(F_G-F_0)(x)|\le M_n(\log n)^{1/2}\bigg \}.$$

Under assumptions (a)–(d), assertion (21) of Theorem 6.3 of Heinrich and Kahn [9], p. 2857, holds true, which implies that, for every \(G\in A_n\), the Kolmogorov distance between the distribution functions \(F_G\) and \(F_0\) is bounded below (up to a constant) by the squared \(L^1\)-distance between the mixing distributions G and \(G_0\): there exists a constant \(C_0>0\) (possibly depending on \(G_0\)) such that, for every \(G\in A_n\),

$$\begin{aligned} C_0\Vert G-G_0\Vert _1^2<\sup _{x}|(F_G-F_0)(x)|\le M_nn^{-1/2}(\log n)^{1/2}. \end{aligned}$$
(14)

Taking into account the following representation of the \(L^1\)-Wasserstein distance

$$W_1(G,\,G_0)=\Vert G-G_0\Vert _1,$$

which goes back to Dall’Aglio [2], see, e.g., Shorack and Wellner [20], pp. 64–66, the assertion follows by combining (13) with (14). This concludes the proof.    \(\square \)
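The quadratic lower bound in (14) is what produces the \(n^{-1/4}\)-rate: in the over-fitted regime, two mixing distributions can be at \(W_1\)-distance t while the corresponding mixture distribution functions differ only by the order of \(t^2\) in the Kolmogorov metric. The sketch below illustrates this phenomenon for a standard Gaussian location kernel, an assumption made only for the illustration, by comparing \(G_0=\delta _0\) with the moment-matched \(G_t=\tfrac{1}{2}\delta _{-t}+\tfrac{1}{2}\delta _{t}\).

```python
import numpy as np
from scipy.stats import norm

# Over-fitted configuration behind (14): G_0 = delta_0 versus the moment-matched
# G_t = 0.5 delta_{-t} + 0.5 delta_{t}; the standard Gaussian location kernel is
# an assumption of this sketch only.
x = np.linspace(-10.0, 10.0, 100001)
F0 = norm.cdf(x)

for t in (0.4, 0.2, 0.1, 0.05):
    Ft = 0.5 * norm.cdf(x + t) + 0.5 * norm.cdf(x - t)   # F_{G_t}
    kolm = np.max(np.abs(Ft - F0))                       # sup_x |F_{G_t} - F_{G_0}|
    W1 = t                                               # W_1(G_t, G_0) = int |y| dG_t = t
    # The ratio kolm / W1^2 stabilises (quadratic behaviour) while kolm / W1
    # vanishes: a sqrt(n)-rate for the distribution function thus translates
    # into only an n^{1/4}-rate for the mixing distribution.
    print(t, round(kolm / W1 ** 2, 4), round(kolm / W1, 4))
```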

Some comments on the applicability and consequences of Theorem 1 are in order.

  • Theorem 1, like Lemma 1, has its roots in Theorem 2 of Ishwaran et al. [11], p. 1324, which is tailored for finite Dirichlet mixtures. However, thanks to Proposition 1, which yields the conclusion of Lemma 1 while ensuring applicability to a larger family of prior distributions, under conditions (a)–(d) the assertion that, for a sufficiently large constant \(M>0\), the convergence

    $$ \varPi \big (n^{1/4}W_1(G,\,G_0)>M(\log n)^{1/4}\mid X^{(n)}\big )\rightarrow 0 \quad \text{ in } P_0^n\text{-probability } $$

    takes place, still holds. The present result differs from that of Theorem 5 in Nguyen [14], pp. 383–384, in various respects: the latter gives an assessment of posterior contraction in the \(L^2\)-Wasserstein, as opposed to the \(L^1\)-Wasserstein metric, for finite mixtures of multivariate distributions, under more stringent conditions and following a completely different line of reasoning.

  • As previously observed in the transition from Lemma 1 to Lemma 2, if the small ball prior probability estimate in (3) is replaced with either that in (9) or that in (12), then the almost-sure version of Theorem 1

    $$\begin{aligned} \varPi \big (n^{1/4}W_1(G,\,G_0)>\sqrt{M_n}(\log n)^{1/4}\mid X^{(n)}\big )\rightarrow 0 \quad P_0^\infty \text{-almost } \text{ surely } \end{aligned}$$

    holds and the rate of convergence for the Bayes’ estimator of the mixing distribution can be assessed as follows.

Corollary 2

Under the conditions of Theorem 1, with the small ball prior probability estimate in (9), we have

$$ W_1(G_n^{\mathrm {B}},\,G_0)=O_{\mathbf {P}}(\sqrt{M_n}(n/\log n)^{-1/4}),$$

where \( G_n^{\mathrm B}(\cdot ):=\int _{\mathscr {G}}G(\cdot )\varPi (\mathrm {d}G\mid X^{(n)})\) is the Bayes’ estimator of the mixing distribution.
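As with \(F_n^{\mathrm {B}}\), the estimator \(G_n^{\mathrm B}\) can be approximated by averaging the mixing distribution functions over posterior draws, and its Kantorovich distance from \(G_0\) can then be evaluated through the distribution-function representation of \(W_1\). The sketch below does so with hypothetical posterior draws and an arbitrary two-atom \(G_0\); no actual posterior sampler is specified in the chapter.

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical posterior draws of the mixing distribution (weights, atoms), as in
# the sketch following the definition of F_n^B; no actual sampler is specified.
S, N = 1000, 3
w_draws = rng.dirichlet(np.array([8.0, 8.0, 1.0]), size=S)
y_draws = np.array([-1.0, 1.5, 0.0]) + 0.1 * rng.normal(size=(S, N))

# Illustrative true mixing distribution G_0 = 0.5 delta_{-1} + 0.5 delta_{1.5}.
w0, y0 = np.array([0.5, 0.5]), np.array([-1.0, 1.5])

grid = np.linspace(-6.0, 6.0, 4001)
dx = grid[1] - grid[0]
G0 = np.array([w0[y0 <= t].sum() for t in grid])
# Bayes estimator G_n^B(t): posterior mean of the mixing c.d.f. G(t).
GnB = np.mean([np.sum(w_draws[s] * (y_draws[s] <= grid[:, None]), axis=1)
               for s in range(S)], axis=0)
# W_1(G_n^B, G_0) via the c.d.f. representation of the Kantorovich distance.
print(np.sum(np.abs(GnB - G0)) * dx)
```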