1 Introduction

This note studies the problem of recovering a distribution function from observations additively contaminated with measurement errors. Assuming that the data are sampled from a convolution kernel mixture, the interest is in estimating the mixing, or latent, distribution from the contaminated observations. The problem can be stated as follows. Let X be a random variable (r.v.) with probability measure \(P_0\) on the Borel-measurable space \((\mathbb {R},\,\mathscr {B}(\mathbb {R}))\), with Lebesgue density \(p_0:=\mathrm {d}P_0/\mathrm {d} \lambda \). Suppose that

$$\begin{aligned} X=Y+Z, \end{aligned}$$

where Y and Z are independent, unobservable random variables, Z having Lebesgue density f. We examine the case where the error has the standard Laplace distribution with density

$$\begin{aligned} f(z)=\frac{1}{2}e^{-|z|}, \quad z\in \mathbb {R}. \end{aligned}$$

The r.v. Y has unknown distribution \(G_0\) on some measurable space \((\mathscr {Y},\,\mathscr {B}(\mathscr {Y}))\), with \(\mathscr {Y}\subseteq \mathbb {R}\) and \(\mathscr {B}(\mathscr {Y})\) the Borel \(\sigma \)-field on \(\mathscr {Y}\). The density \(p_0\) is then the convolution of \(G_0\) and f,

$$\begin{aligned} p_0(x)=(G_0*f)(x)=\int _{\mathscr {Y}}f(x-y)\,\mathrm {d}G_0(y),\quad x\in \mathbb {R}. \end{aligned}$$

In what follows, we also write \(p_0\equiv p_{G_0}\) to stress the dependence of \(p_0\) on \(G_0\). Letting \(\mathscr {G}\) be the set of all probability measures G on \((\mathscr {Y},\,\mathscr {B}(\mathscr {Y}))\), the parameter space

$$\begin{aligned} \mathscr {P}:=\Bigg \{p_G(\cdot ):=\int _{\mathscr {Y}}f(\cdot -y)\,\mathrm {d}G(y),\, G\in \mathscr {G}\Bigg \} \end{aligned}$$

is the collection of all Laplace convolution mixtures; the model is thus nonparametric.
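For concreteness, sampling from a density \(p_G\in \mathscr {P}\) amounts to drawing \(Y\sim G\) and adding independent standard Laplace noise. The following minimal sketch (in Python; the two-point mixing distribution is purely illustrative and not part of the model assumptions) makes the data-generating mechanism explicit.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_laplace_mixture(n, atoms, weights):
    """Draw n observations X = Y + Z with Y ~ G (here a discrete, purely
    illustrative mixing distribution) and Z ~ standard Laplace, so that
    X has density p_G = G * f."""
    y = rng.choice(atoms, size=n, p=weights)      # latent locations Y_i ~ G
    z = rng.laplace(loc=0.0, scale=1.0, size=n)   # errors Z_i with density (1/2)exp(-|z|)
    return y + z

x = sample_laplace_mixture(1000, atoms=np.array([-2.0, 1.0]), weights=[0.3, 0.7])
```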

Suppose we observe n independent copies \(X_1,\,\ldots ,\,X_n\) of X. The r.v.’s \(X_1,\,\ldots ,\,X_n\) are independent and identically distributed (i.i.d.) according to the density \(p_0\equiv p_{G_0}\) on the real line. The interest is in recovering the mixing distribution \(G_0\in \mathscr {G}\) from these indirect observations. Deconvolution problems arise in a wide variety of contexts, the error distribution being typically modelled as Gaussian, although the Laplace also has relevant applications. Full density deconvolution, together with the related many normal means problem, has drawn attention in the literature since the late 1950s, and various deconvolution methods have since been proposed and developed within the frequentist approach, the most popular being based on nonparametric maximum likelihood and kernel methods. Rates of convergence have mostly been investigated for density deconvolution: Fan (1991a, b) showed that deconvolution kernel density estimators achieve global optimal rates for weighted \(L^p\)-risks, \(p\ge 1\), when the smoothness of the density to be recovered is measured in terms of the number of its derivatives. Hall and Lahiri (2008) considered estimation of the distribution function using the cumulative distribution function corresponding to the deconvolution kernel density estimator and showed that it attains minimax-optimal pointwise and global rates for the integrated mean-squared error over different functional classes for the error and latent distributions, smoothness being described through the tail behaviour of their Fourier transforms. For a comprehensive account of the topic, the reader may refer to the monograph of Meister (2009).

In this note, we do not assume that the probability measure \(G_0\) possesses a Lebesgue density. Wasserstein metrics are then particularly well-suited as global loss functions: convergence in \(L^p\)-Wasserstein metrics for discrete mixing distributions has, in fact, a natural interpretation in terms of convergence of the single supporting atoms of the probability measures involved. Dedecker and Michel (2015) have obtained a lower bound on the rate of convergence for the \(L^p\)-Wasserstein risk, \(p\ge 1\), when no smoothness assumption, except for a moment condition, is imposed on the latent distribution and the error distribution is ordinary smooth, the Laplace being a special case.

Deconvolution problems have only recently begun to be studied from a Bayesian perspective: the typical scheme considers the mixing distribution as a draw from a Dirichlet process prior. Posterior contraction rates for recovering the mixing distribution in \(L^p\)-Wasserstein metrics have been investigated in Nguyen (2013) and Gao and van der Vaart (2016), even though the upper bounds in these articles do not match the lower bound in Dedecker and Michel (2015). Minimax-optimal adaptive recovery rates for mixing densities belonging to Sobolev spaces have instead been obtained by Donnet et al. (2018) under both a fully Bayes and an empirical Bayes approach to inference, the latter accounting for a data-driven choice of the prior hyperparameters of the Dirichlet process baseline measure.

In this note, we study nonparametric Bayes and maximum likelihood estimation of the mixing distribution \(G_0\), when no smoothness assumption is imposed on it. The analysis begins with the estimation of the sampling density \(p_0\): estimating the mixed density \(p_0\) can, in effect, be the first step towards recovering the mixing distribution \(G_0\). Taking a Bayesian approach, if the random density \(p_G\) is modelled as a Dirichlet–Laplace mixture, then \(p_0\) can be consistently estimated at a rate \(n^{-3/8}\), up to a \((\log n)\)-factor, provided \(G_0\) has tails matching those of the baseline measure of the Dirichlet process, which essentially requires \(G_0\) to be in the weak support of the process, see Propositions 1 and 2. This requirement makes it possible to extend to a possibly unbounded set of locations the results of Gao and van der Vaart (2016), which cover only the case of compactly supported mixing distributions. Taking a frequentist approach, \(p_0\) can be estimated by maximum likelihood, again at a rate \(n^{-3/8}\) up to a logarithmic factor. As far as we are aware, the result on the rate of convergence in the Hellinger metric for the maximum likelihood estimator (MLE) of a Laplace convolution mixture is new; it is obtained taking the approach proposed by Van de Geer (1996), according to which it is the “dimension” of the class of kernels and the behaviour of \(p_0\) near zero that determine the rate of convergence for the MLE. As previously mentioned, results on the estimation of \(p_0\) are of interest because, appealing to an inversion inequality that translates the Hellinger or the \(L^2\)-distance between kernel mixtures, with Fourier transform of the kernel density having polynomially decaying tails, into any \(L^p\)-Wasserstein distance, \(p\ge 1\), between the corresponding mixing distributions, rates of convergence in the \(L^1\)-Wasserstein metric for the MLE and the Bayes’ estimator of the mixing distribution can be assessed. Merging in the \(L^1\)-Wasserstein metric between Bayes and maximum likelihood for deconvolving Laplace mixtures follows as a by-product.

Organization.   The note is organized as follows. Convergence rates in the Hellinger metric for Bayes and maximum likelihood density estimation of Laplace convolution mixtures are preliminarily studied in Sect. 2 and in Sect. 3, respectively, in view of their subsequent instrumental use for assessing the \(L^1\)-Wasserstein accuracy of the two estimation procedures in recovering the mixing distribution of the sampling density. Merging between Bayes and maximum likelihood follows, as shown in Sect. 4. Remarks and suggestions for possible refinements and extensions of the exposed results are presented in Sect. 5. Auxiliary lemmas, along with the proofs of the main results, are deferred to Appendices A–D.

Notation. We fix the notation and recall some definitions used throughout.

Calculus

  • The symbols “\(\lesssim \)” and “\(\gtrsim \)” indicate inequalities valid up to a constant multiple that is universal or fixed within the context, and in any case inessential for our purposes.

  • For sequences of real numbers \((a_n)_{n\in \mathbb {N}}\) and \((b_n)_{n\in \mathbb {N}}\), the notation \(a_n\sim b_n\) means that \((a_n/b_n)\rightarrow 1\) as \(n\rightarrow +\infty \). Analogously, for real-valued functions f and g, the notation \(f\sim g\) means that \(f/g\rightarrow 1\) in an asymptotic regime that is clear from the context.

Covering and entropy numbers

  • Let \((T,\,d)\) be a (subset of a) semi-metric space. For every \(\varepsilon >0\), the \(\varepsilon \)-covering number of \((T,\,d)\), denoted by \(N(\varepsilon ,\,T,\,d)\), is defined as the minimum number of d-balls of radius \(\varepsilon \) needed to cover T. Take \(N(\varepsilon ,\,T,\,d)=+\infty \) if no finite covering by d-balls of radius \(\varepsilon \) exists. The logarithm of the \(\varepsilon \)-covering number, \(\log N(\varepsilon ,\,T,\,d)\), is called the \(\varepsilon \)-entropy.

  • Let \((T,\,d)\) be a (subset of a) semi-metric space. For every \(\varepsilon >0\), the \(\varepsilon \)-packing number of \((T,\,d)\), denoted by \(D(\varepsilon ,\,T,\,d)\), is defined as the maximum number of points in T whose pairwise distances are all strictly greater than \(\varepsilon \). Take \(D(\varepsilon ,\,T,\,d)=+\infty \) if \(\varepsilon \)-packings of arbitrarily large cardinality exist. The logarithm of the \(\varepsilon \)-packing number, \(\log D(\varepsilon ,\,T,\,d)\), is called the \(\varepsilon \)-packing entropy.

Covering and packing numbers are related by the inequalities

$$\begin{aligned} N(\varepsilon ,\,T,\,d)\le D(\varepsilon ,\,T,\,d)\le N(\varepsilon /2,\,T,\,d). \end{aligned}$$

Function spaces and probability

  • For a real number \(1\le p <+\infty \), let \(L^p(\mathbb {R}):=\{f \mid f:\mathbb {R}\rightarrow \mathbb {C},\,f \hbox { is Borel measurable, }\int |f|^p\,\mathrm {d}\lambda <+\infty \}\). For \(f\in L^p(\mathbb {R})\), the \(L^p\)-norm of f is defined as \(||f||_p:=(\int |f|^p\,\mathrm {d}\lambda )^{1/p}\). The supremum norm of a function f is defined as \(||f||_\infty :=\sup _{x\in \mathbb {R}}|f(x)|\).

  • For \(f\in L^1(\mathbb {R})\), the complex-valued function \(\hat{f}(t):=\int _{-\infty }^{+\infty } e^{itx}f(x)\,\mathrm {d}x\), \(t\in \mathbb {R}\), is called the Fourier transform of f.

  • All probability density functions are meant to be with respect to Lebesgue measure \(\lambda \) on \(\mathbb {R}\) or on some subset thereof.

  • The same symbol, G (say), is used to denote a probability measure on a Borel-measurable space \((\mathscr {Y},\,\mathscr {B}(\mathscr {Y}))\) and the corresponding cumulative distribution function (c.d.f.).

  • The degenerate probability distribution putting mass one at a point \(\theta \in \mathbb {R}\) is denoted by \(\delta _\theta \).

  • The notation Pf abbreviates the expected value \(\int f\,\mathrm {d}P\), where the integral is understood to extend over the entire natural domain when, here and elsewhere, the domain of integration is omitted.

  • Given a r.v. Y with distribution G, the moment generating function of Y or the Laplace transform of the probability measure G is defined as

    $$\begin{aligned} M_G(s):=E[e^{sY}]=\int _{\mathscr {Y}}e^{sy}\,\mathrm {d}G(y) \,\,\,\hbox { for all } s \hbox { for which the integral is finite.} \end{aligned}$$

Metrics and divergences

  • The Hellinger distance between any pair of probability density functions \(q_1\) and \(q_2\) on \(\mathbb {R}\) is defined as \(h(q_1,\,q_2):=\{\int (q_1^{1/2}-q_2^{1/2})^2\,\mathrm {d}\lambda \}^{1/2}\), the \(L^2\)-distance between the square-root densities. The following inequalities relating the \(L^1\)-norm and the Hellinger distance, due to Le Cam (1973), p. 40, hold:

    $$\begin{aligned} h^2(q_1,\,q_2)\le ||q_1-q_2||_1 \end{aligned}$$
    (1)

    and

    $$\begin{aligned} ||q_1-q_2||_1\le 2 h(q_1,\,q_2). \end{aligned}$$
    (2)
  • For ease of notation, the same symbol d is used throughout to denote the \(L^1\)-norm, the \(L^2\)-norm or the Hellinger metric, the intended meaning being declared at each occurrence.

  • For any probability measure Q on \((\mathbb {R},\,\mathscr {B}({\mathbb {R}}))\) with density q, let

    $$\begin{aligned}\text {KL}(P_0\Vert Q):= \left\{ \begin{array}{ll} \displaystyle \int \log \frac{\mathrm {d}P_0}{\mathrm {d}Q}\,\mathrm {d}P_0=\int _{p_0q>0} p_0\log \frac{p_0}{q}\,\mathrm {d}\lambda , &{} \hbox { if } P_0\ll Q,\\ \quad +\infty , &{}\hbox { otherwise,} \end{array}\right. \end{aligned}$$

    be the Kullback–Leibler divergence of Q from \(P_0\) and, for \(k\ge 2\), let

    $$\begin{aligned} \mathrm{V}_k(P_0\Vert Q):=\left\{ \begin{array}{ll} \displaystyle \int \bigg |\log \frac{\mathrm {d}P_0}{\mathrm {d}Q}\bigg |^k\,\mathrm {d}P_0= \int _{p_0q>0} p_0\bigg |\log \frac{p_0}{q}\bigg |^k\,\mathrm {d}\lambda , &{} \hbox { if } P_0\ll Q,\\ \quad +\infty , &{}\hbox { otherwise,} \end{array}\right. \end{aligned}$$

    be the kth absolute moment of \(\log (\mathrm {d}P_0/\mathrm {d}Q)\). For any \(\varepsilon >0\) and a given \(k\ge 2\), define a Kullback–Leibler type neighborhood of \(P_0\) as

    $$\begin{aligned} B_{\mathrm {KL}}(P_0;\,\varepsilon ^k):=\left\{ Q:\,\text {KL}(P_0\Vert Q)\le \varepsilon ^2,\,\text {V}_k(P_0\Vert Q)\le \varepsilon ^k\right\} . \end{aligned}$$
  • For any real number \(p\ge 1\) and any pair of probability measures \(G_1,\,G_2\in \mathscr {G}\) with finite pth absolute moments, the \(L^p\)-Wasserstein distance between \(G_1\) and \(G_2\) is defined as

    $$\begin{aligned} W_p(G_1,G_2):=\left( {\inf _{\gamma \in \Gamma (G_1,\,G_2)}\int _{\mathscr {Y}\times \mathscr {Y}}|y_1-y_2|^p\,} \gamma (\mathrm {d}y_1,\,\mathrm {d}y_2)\right) ^{1/p} , \end{aligned}$$

    where \(\Gamma (G_1,\,G_2)\) is the set of all joint probability measures on \((\mathscr {Y}\times \mathscr {Y})\subseteq \mathbb {R}^2\), with marginals \(G_1\) and \(G_2\) on the first and second arguments, respectively.
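As a simple illustration of these notions in the setting of this note (the shifts \(\theta _1,\,\theta _2\in \mathbb {R}\) below are purely illustrative), for the standard Laplace kernel density f one can compute explicitly, with \(\theta :=|\theta _1-\theta _2|\),

$$\begin{aligned} h^2(f(\cdot -\theta _1),\,f(\cdot -\theta _2))=2\big [1-e^{-\theta /2}(1+\theta /2)\big ],\quad ||f(\cdot -\theta _1)-f(\cdot -\theta _2)||_1=2\big (1-e^{-\theta /2}\big ),\quad W_p(\delta _{\theta _1},\,\delta _{\theta _2})=\theta , \end{aligned}$$

the last identity holding for every \(p\ge 1\) because the only coupling of two point masses is \(\delta _{(\theta _1,\,\theta _2)}\); the first two quantities are easily checked to satisfy the inequalities in (1) and (2).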

Stochastic order symbols

Let \((Z_n)_{n\in \mathbb {N}}\) be a sequence of real-valued random variables, possibly defined on entirely different probability spaces \((\Omega _n,\,\mathscr {F}_n,\,\mathbf {P}_n)_{n\in \mathbb {N}}\). Suppressing n in \(\mathbf {P}\) causes no confusion if it is understood that \(\mathbf {P}\) refers to whatever probability space \(Z_n\) is defined on. Let \((k_n)_{n\in \mathbb {N}}\) be a sequence of positive real numbers. We write

  • \(Z_n=O_{\mathbf {P}}(k_n)\) if \(\lim _{T\rightarrow +\infty }\limsup _{n\rightarrow +\infty }\mathbf {P}(|Z_n|>Tk_n)=0\). Then, \(Z_n/k_n=O_{\mathbf {P}}(1)\),

  • \(Z_n=o_{\mathbf {P}}(k_n)\) if, for every \(\varepsilon >0\), \(\lim _{n\rightarrow +\infty }\mathbf {P}(|Z_n|>\varepsilon k_n)=0\). Then, \(Z_n/k_n=o_{\mathbf {P}}(1)\).

Unless otherwise specified, in all stochastic order symbols used throughout, the probability measure \(\mathbf {P}\) is understood to be \(P_0^n\), the joint law of the first n coordinate projections of the infinite product probability measure \(P_0^{\mathbb {N}}\).

2 Rates of convergence for \(L^1\)-Wasserstein deconvolution of Dirichlet–Laplace mixtures

In this section, we present some results on the Bayesian recovery of a distribution function from data contaminated with an additive random error following the standard Laplace distribution: we derive rates of convergence for the \(L^1\)-Wasserstein deconvolution of Dirichlet–Laplace mixture densities. The sampling density is modelled as a Dirichlet–Laplace mixture

$$\begin{aligned} p_{G}(\cdot )\equiv (G *f)(\cdot )=\int _{\mathscr {Y}} f(\cdot -y)\,\mathrm {d}G(y), \end{aligned}$$

with the kernel density f being the standard Laplace and the mixing distribution G being any probability measure on \((\mathscr {Y},\,\mathscr {B}(\mathscr {Y}))\), with \(\mathscr {Y}\subseteq \mathbb {R}\). As a prior for G, we consider a Dirichlet process with base measure \(\alpha \) on \((\mathscr {Y},\,\mathscr {B}(\mathscr {Y}))\), denoted by \(\mathscr {D}_{\alpha }\). We recall that a Dirichlet process on a measurable space \((\mathscr {Y},\,\mathscr {B}(\mathscr {Y}))\), with finite and positive base measure \(\alpha \) on \((\mathscr {Y},\,\mathscr {B}(\mathscr {Y}))\), is a random probability measure \(\tilde{G}\) on \((\mathscr {Y},\,\mathscr {B}(\mathscr {Y}))\) such that, for every finite partition \((B_1,\,\ldots ,\,B_k)\) of \(\mathscr {Y}\), \(k\ge 1\), the vector of random probabilities \((\tilde{G}(B_1),\,\ldots ,\,\tilde{G}(B_k))\) has Dirichlet distribution with parameters \((\alpha (B_1),\,\ldots ,\,\alpha (B_k))\). A Dirichlet process mixture of Laplace densities can be structurally described as follows:

  •     \(\tilde{G}\sim \mathscr {D}_{\alpha }\),

  •     given \(\tilde{G}=G\), the r.v.’s \(Y_1,\,\ldots ,\,Y_n\) are i.i.d. according to G,

  •     given \((G,\,Y_1,\,\ldots ,\,Y_n)\), the r.v.’s \(Z_1,\,\ldots ,\,Z_n\) are i.i.d. according to f,

  •     sampled values from \(p_G\) are defined as \(X_i:=Y_i+Z_i\) for \(i=1,\,\ldots ,\,n\).
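
The scheme just described can be simulated directly. The following sketch draws \(\tilde{G}\) (approximately) from \(\mathscr {D}_{\alpha }\) via a truncated stick-breaking (Sethuraman) representation and then generates the sample; the truncation level, the total mass \(\alpha (\mathbb {R})=1\) and the double-exponential baseline density \(\alpha '(y)\propto e^{-|y|}\) are illustrative assumptions, not prescriptions of the note.

```python
import numpy as np

rng = np.random.default_rng(1)

def draw_dirichlet_laplace_sample(n, total_mass=1.0, trunc=200):
    """Approximate draw of X_1,...,X_n from the scheme above: G ~ D_alpha via
    truncated stick-breaking, Y_i | G i.i.d. ~ G, X_i = Y_i + Z_i with Z_i Laplace."""
    # Stick-breaking: v_k ~ Beta(1, alpha(R)), w_k = v_k * prod_{j<k} (1 - v_j)
    v = rng.beta(1.0, total_mass, size=trunc)
    w = v * np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))
    # Atom locations from the (normalized) baseline measure; here a density
    # proportional to exp(-|y|), an illustrative choice
    atoms = rng.laplace(0.0, 1.0, size=trunc)
    labels = rng.choice(trunc, size=n, p=w / w.sum())  # Y_i | G ~ G (weights renormalized)
    z = rng.laplace(0.0, 1.0, size=n)                  # standard Laplace errors Z_i
    return atoms[labels] + z

x = draw_dirichlet_laplace_sample(500)
```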

Let the sampling density \(p_0\) be itself a Laplace mixture with mixing distribution \(G_0\), that is, \(p_0\equiv p_{G_0}=G_0*f\). In order to assess the rate of convergence in the \(L^1\)-Wasserstein metric for the Bayes’ estimator of the true mixing distribution \(G_0\), we appeal to an inversion inequality relating the \(L^2\)-norm or the Hellinger distance between Laplace mixture densities to any \(L^p\)-Wasserstein distance, \(p\ge 1\), between the corresponding mixing distributions, see Lemma 4 in Appendix D. Therefore, we first derive rates of contraction in the \(L^2\)-norm and the Hellinger metric for the posterior distribution of a Dirichlet–Laplace mixture prior: convergence of the posterior distribution at a rate \(\varepsilon _n\), in fact, implies the existence of Bayes’ point estimators that converge at least as fast as \(\varepsilon _n\) in the frequentist sense. The same indirect approach has been taken by Gao and van der Vaart (2016), who deal with the case of compactly supported mixing distributions, while we extend the results to mixing distributions possibly supported on the whole real line or on some unbounded subset thereof. We present two results on posterior contraction rates for a Dirichlet–Laplace mixture prior. The first, stated in Proposition 1, concerns the \(L^1\)-norm and the Hellinger metric; the second, stated in Proposition 2, concerns the \(L^2\)-metric. Proofs are deferred to Appendix C.

Proposition 1

Let \(X_1,\,\ldots ,\,X_n\) be i.i.d. observations from a density \(p_0\equiv p_{G_0}=G_0*f\), with the kernel density f being the standard Laplace and the mixing distribution \(G_0\) such that, for some decreasing function \(A_0:\,(0,\,+\infty )\rightarrow [0,\,1]\) and a constant \(0<c_0<+\infty \),

$$\begin{aligned} G_0([-T,\,T]^c)\le A_0(T)\lesssim \exp {(-c_0T)}\quad \hbox {for large } T>0. \end{aligned}$$
(3)

If the baseline measure \(\alpha \) of the Dirichlet process is symmetric around zero and possesses density \(\alpha '\) such that, for some constants \(0<b<+\infty \) and \(0<\tau \le 1\),

$$\begin{aligned} \alpha '(y)\propto \exp {(-b|y|^\tau )},\quad y\in \mathbb {R}, \end{aligned}$$
(4)

then there exists a sufficiently large constant \(M>0\) such that

$$\begin{aligned} \Pi (d(p_G,\,p_0)\ge M n^{-3/8}\log ^{5/8}n \mid X^{(n)})=o_{\mathbf {P}}(1), \end{aligned}$$

where \(\Pi (\cdot \mid X^{(n)})\) denotes the posterior distribution corresponding to a Dirichlet–Laplace process mixture prior after n observations and d can be either the Hellinger or the \(L^1\)-metric.

Remark 1

By virtue of the following inequality,

$$\begin{aligned} \forall \,G,\,G'\in \mathscr {G},\,\,\, ||p_G-p_{G'}||_2^2\le 4||f||_\infty h^2(p_G,\,p_{G'}), \end{aligned}$$

where \(||f||_\infty =1/2\) for the standard Laplace kernel density, see (28) in Lemma 3, the \(L^2\)-metric posterior contraction rate for a Dirichlet–Laplace mixture prior could, in principle, be derived from Proposition 1, which relies on Theorem 2.1 of Ghosal et al. (2000), p. 503, or Theorem 2.1 of Ghosal and van der Vaart (2001), p. 1239. This would, however, impose slightly stronger conditions on the density \(\alpha '\) of the baseline measure than those required in Proposition 2 below, which is based on Theorem 3 of Giné and Nickl (2011), p. 2892; the latter result is tailored for assessing posterior contraction rates in \(L^r\)-metrics, \(1< r< +\infty \), and takes an approach that can only be used if one has sufficiently fine control of the approximation properties of the prior support in the \(L^r\)-metric considered.
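For the reader's convenience, the inequality displayed above can be obtained directly (a short argument consistent with (28) in Lemma 3): writing \(p_G-p_{G'}=(\sqrt{p_G}-\sqrt{p_{G'}})(\sqrt{p_G}+\sqrt{p_{G'}})\), using \((a+b)^2\le 2(a^2+b^2)\) and noting that \(\Vert p_G\Vert _\infty \le \Vert f\Vert _\infty \) for every Laplace mixture,

$$\begin{aligned} ||p_G-p_{G'}||_2^2=\int (\sqrt{p_G}-\sqrt{p_{G'}})^2(\sqrt{p_G}+\sqrt{p_{G'}})^2\,\mathrm {d}\lambda \le 2\,(\Vert p_G\Vert _\infty +\Vert p_{G'}\Vert _\infty )\,h^2(p_G,\,p_{G'})\le 4\Vert f\Vert _\infty \,h^2(p_G,\,p_{G'}). \end{aligned}$$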

Proposition 2

Let \(X_1,\,\ldots ,\,X_n\) be i.i.d. observations from a density \(p_0\equiv p_{G_0}=G_0*f\), with the kernel density f being the standard Laplace and the mixing distribution \(G_0\) such that condition (3) holds as in Proposition 1. If the baseline measure \(\alpha \) of the Dirichlet process possesses continuous and positive density \(\alpha '\) such that, for some constants \(0<b<+\infty \) and \(0<\tau \le 1\),

$$\begin{aligned} \alpha '(y)\gtrsim \exp {(-b|y|^\tau )}\quad \hbox {for large } |y|, \end{aligned}$$
(5)

then there exists a sufficiently large constant \(M>0\) such that

$$\begin{aligned} \Pi (||p_G-p_0||_2\ge M n^{-3/8}\log ^{5/8}n \mid X^{(n)})=o_{\mathbf {P}}(1), \end{aligned}$$
(6)

where \(\Pi (\cdot \mid X^{(n)})\) denotes the posterior distribution corresponding to a Dirichlet–Laplace process mixture prior after n observations.

As previously mentioned, convergence of the posterior distribution at a rate \(\varepsilon _n\) implies the existence of point estimators that converge at least as fast as \(\varepsilon _n\) in the frequentist sense, see, for instance, Theorem 2.5 in Ghosal et al. (2000), p. 506, for the construction of a point estimator that applies to general statistical models and posterior distributions. The posterior expectation of the density \(p_G\), which we refer to as the Bayes’ density estimator,

$$\begin{aligned} \hat{p}_n^{\text {B}}(\cdot ):= \int _{\mathscr {G}} p_G(\cdot )\Pi (\mathrm {d}G\mid X^{(n)}), \end{aligned}$$

has a similar property when jointly considered with bounded semi-metrics that are convex or whose square is convex in one argument. When the random mixing distribution \(\tilde{G}\) is distributed according to a Dirichlet process, the expression of the Bayes’ density estimator \(\hat{p}_n^{\text {B}}\) is given by formula (2.6) of Lo (1984), p. 353, replacing \(K(\cdot ,\,u)\) with \(\frac{1}{2}\exp {\{-|\cdot -u|\}}\) at each occurrence.

Corollary 1

Suppose that condition (3) holds for some decreasing function \(A_0:\,(0,\,+\infty )\rightarrow [0,\,1]\) and a finite constant \(c_0>(1/e)\) such that

$$\begin{aligned} G_0([-T,\,T]^c)\le A_0(T)\lesssim \exp {(-e^{c_0T})}\quad \hbox {for large } T>0 \end{aligned}$$
(7)

and condition (4) holds as in Proposition 1. Then,

$$\begin{aligned} d(\hat{p}^{\mathrm {B}}_n,\,p_0)=O_{\mathbf {P}}(n^{-3/8}\log ^{1/2} n), \end{aligned}$$

for d being either the Hellinger or the \(L^1\)-metric.

Proof

In virtue of the inequality in (2), it suffices to prove the assertion for the Hellinger metric. The proof follows standard arguments as, for instance, in Ghosal et al. (2000), pp. 506–507. By convexity of \(h^2\) in each argument and Jensen’s inequality, for \(\varepsilon _n:=\max \{\bar{\varepsilon _n},\, \tilde{\varepsilon _n}\}=n^{-3/8}(\log n)^{(3\vee 4)/8}=n^{-3/8}\log ^{1/2} n \) and a sufficiently large constant \(M>0\),

$$\begin{aligned} h^2(\hat{p}_n^{\text {B}},\,p_0)\le & {} \int _{\mathscr {G}} h^2(p_G,\,p_0)\Pi (\mathrm {d}G\mid X^{(n)})\\= & {} \left( \int _{h(p_G,\,p_0)<M\varepsilon _n}+\int _{h(p_G,\,p_0)\ge M\varepsilon _n} \right) h^2(p_G,\,p_0)\Pi (\mathrm {d}G\mid X^{(n)})\\\lesssim & {} M^2\varepsilon _n^2 + 2 \Pi (h(p_G,\,p_0) \ge M\varepsilon _n\mid X^{(n)}). \end{aligned}$$

It follows that

$$\begin{aligned} P_0^n h^2(\hat{p}_n^{\text {B}},\,p_0) \lesssim M^2\varepsilon _n^2 + 2 P_0^n\Pi ( h(p_G,\,p_0) \ge M\varepsilon _n\mid X^{(n)})\lesssim \varepsilon ^2_n+o(\varepsilon _n^2) \end{aligned}$$

because we can apply the almost sure version of Theorem 7 in Scricciolo (2007), p. 636 (see also Theorem A.1 in Scricciolo (2006), p. 2918), which, under the prior mass condition

$$\begin{aligned} \Pi (h^2(p_G,\,p_0)\Vert p_0/p_G\Vert _\infty \le \tilde{\varepsilon }_n ^2)\gtrsim \exp {(-Bn\tilde{\varepsilon }_n^2)}, \end{aligned}$$
(8)

with \(\tilde{\varepsilon }_n:=n^{-3/8}\log ^{1/2} n\) and a constant \(0<B<+\infty \), yields exponentially fast convergence of the posterior distribution since \(P_0^n\Pi ( h(p_G,\,p_0) \ge M\varepsilon _n\mid X^{(n)})\lesssim \exp {(-B_1n\tilde{\varepsilon }_n^2)}\) for a suitable constant \(0<B_1<+\infty \). To verify that condition (8) is satisfied, we can proceed as in the proof of Proposition 2: for any G satisfying (27), not only is \(h(p_G,\,p_0)\lesssim \varepsilon \), but, under assumption (7), which guarantees that \(M_{G_0}(-1)<+\infty \) and \(M_{G_0}(1)<+\infty \), we also have

$$\begin{aligned}&||p_0/p_G||_\infty \le e^{a_\varepsilon }[M_{G_0}(-1)+M_{G_0}(1)]\lesssim \log (1/\varepsilon ),\quad \\&\quad \hbox {for}\,\,a_\varepsilon := A_0^{-1}(\varepsilon ^2)\lesssim \log \log (1/\varepsilon ). \end{aligned}$$

Then,

$$\begin{aligned} \log \Pi (h^2(p_G,\,p_0)\Vert p_0/p_G\Vert _\infty \le \varepsilon ^2\log (1/\varepsilon ))\gtrsim -\varepsilon ^{-2/3}\log (1/\varepsilon ). \end{aligned}$$

Condition (8) is thus verified for \(\tilde{\varepsilon }_n:=\varepsilon \log ^{1/2}(1/\varepsilon )=n^{-3/8}\log ^{1/2}n\). Conclude that \(h(\hat{p}^{\mathrm {B}}_n,\,p_0)=O_{\mathbf {P}}(\varepsilon _n)\). \(\square \)

Remark 2

Admittedly, condition (7) imposes a stringent constraint on the tail decay rate of \(G_0\). An alternative sufficient condition for concluding that

$$\begin{aligned} P_0^n\Pi ( d(p_G,\,p_0) \ge M\varepsilon _n\mid X^{(n)})=o(\varepsilon _n^2),\quad \hbox { for } \,d=h\, \hbox { or } \, d=\Vert \cdot \Vert _1, \end{aligned}$$
(9)

is a prior mass condition involving the kth absolute moment of \(\log (p_0/p_G)\) for a suitable value of k, in place of the sup-norm \(\Vert p_0/p_G\Vert _\infty \), which can possibly induce a lighter condition on \(G_0\). For \(\tilde{\varepsilon }_n:=n^{-3/8}\log ^{\omega }n\), with \(\omega >0\), let \(\varepsilon _n:=\max \{\bar{\varepsilon }_n,\, \tilde{\varepsilon }_n\}=n^{-3/8}(\log n)^{(3/8)\vee \omega }\). It is known from Lemma 10 of Ghosal and van der Vaart (2007b), p. 220, that if

$$\begin{aligned} \Pi (B_{\mathrm {KL}}(P_0;\,\tilde{\varepsilon }_n^k))\gtrsim \exp {(-Bn\tilde{\varepsilon }_n^2)},\quad k\ge 2, \end{aligned}$$
(10)

then

$$\begin{aligned} P_0^n\Pi ( d(p_G,\,p_0) \ge M\varepsilon _n\mid X^{(n)})\lesssim (n\tilde{\varepsilon }_n^2)^{-k/2}. \end{aligned}$$
(11)

Thus, if condition (10) holds for some \(k\ge 6\), so that \((n\tilde{\varepsilon }_n^2)^{-k/2}=o(\varepsilon _n^2)\) (the value \(k=6\) already suffices for this purpose), then condition (9) is satisfied.

We now state a result on the rate of convergence for the Bayes’ estimator, denoted by \(\hat{G}_n^{\text {B}}\), of the mixing distribution \(G_0\) for the \(L^1\)-Wasserstein deconvolution of Dirichlet–Laplace mixtures. The Bayes’ estimator is the posterior expectation of the random probability measure \(\tilde{G}\), that is, \(\hat{G}_n^{\text {B}}(\cdot ):=E[\tilde{G}(\cdot )\mid X^{(n)}]\) and its expression can be derived from the expression of the posterior distribution, cf. Ghosh and Ramamoorthi (2003), pp. 144–146. In order to state the result, let \(M_{\hat{G}_n^{\text {B}}}(s):=\int _{-\infty }^{+\infty } e^{sy}\,\mathrm {d}\hat{G}_n^{\text {B}}(y)\), \(s\in \mathbb {R}\), whose expression can be obtained from formula (2.6) of Lo (1984), p. 353, replacing \(K(x,\,u)\) with \(e^{s u}\) at all occurrences (s playing the role of x).

Proposition 3

Suppose that the assumptions of Corollary 1 hold. If, in addition, \(\bar{\alpha }:=\alpha /\alpha (\mathbb {R})\) has finite moment generating function on some interval \((-s_0,\,s_0)\), with \(0<s_0<1\), and

$$\begin{aligned}&\forall \, 0<s<s_0,\quad \limsup _{n\rightarrow +\infty }P_0^nM_{\hat{G}_n^{\mathrm {B}}}(-s)\le M_{G_0}(-s)\nonumber \\&\hbox {and} \quad \limsup _{n\rightarrow +\infty }P_0^nM_{\hat{G}_n^{\mathrm {B}}}(s)\le M_{G_0}(s), \end{aligned}$$
(12)

then

$$\begin{aligned} W_1(\hat{G}_n^{\mathrm {B}},\,G_0)=O_{\mathbf {P}}(n^{-1/8}(\log n)^{2/3}). \end{aligned}$$
(13)

Proof

Let \(\rho _n:=n^{-1/8}(\log n)^{2/3}\) and, for a suitable finite constant \(c_1>0\), \(M_n=c_1(\log n)\). Fix numbers s and u such that \(0<u<s<s_0<1\). For sufficiently large constants \(0<T, \,T',\,T''<+\infty \), reasoning as in Lemma 4,

$$\begin{aligned} P_0^n(W_1(\hat{G}_n^{\mathrm {B}},\,G_0)>T\rho _n)\le & {} P_0^n(h(\hat{p}^{\mathrm {B}}_n,\,p_0)>T'\rho _n^3(\log n)^{-3/2})\\&+P_0^n(M_{\hat{G}_n^{\mathrm {B}}}(-s)+M_{\hat{G}_n^{\mathrm {B}}}(s)>T''e^{uM_n}\rho _n)=:P_1+P_2. \end{aligned}$$

By Corollary 1, \(h(\hat{p}^{\mathrm {B}}_n,\,p_0)=O_{\mathbf {P}}(n^{-3/8}\log ^{1/2}n)\). Hence, \(P_1\rightarrow 0\) as \(n\rightarrow +\infty \). By Markov’s inequality, for some real \(\nu >0\),

$$\begin{aligned} P_2\lesssim & {} e^{-uM_n}\rho _n^{-1} [P_0^nM_{\hat{G}_n^{\mathrm {B}}}(-s)+P_0^nM_{\hat{G}_n^{\mathrm {B}}}(s)]\\\lesssim & {} \frac{1}{n^\nu } [P_0^nM_{\hat{G}_n^{\mathrm {B}}}(-s)+P_0^nM_{\hat{G}_n^{\mathrm {B}}}(s)] \rightarrow 0 \quad \text{ as } n\rightarrow +\infty \end{aligned}$$

by assumption (12). Thus, \(P_2\rightarrow 0\) as \(n\rightarrow +\infty \). The assertion follows. \(\square \)

Some remarks are in order. There are two main reasons why we focus on deconvolution in the \(L^1\)-Wasserstein metric. The first one is related to the inversion inequality in (30), where the upper bound on the \(L^p\)-Wasserstein metric, as a function of the order \(p\ge 1\), increases as p gets larger, thus making it advisable to begin the analysis from the smallest value of p. The second reason is related to the interpretation of the assertion in (13): the \(L^1\)-Wasserstein distance between any two probability measures \(G_1\) and \(G_2\) on some Borel-measurable space \((\mathscr {Y},\,\mathscr {B}(\mathscr {Y}))\), \(\mathscr {Y}\subseteq \mathbb {R}\), with finite first absolute moments, is by itself an interesting distance because it metrizes weak convergence plus convergence of the first absolute moments, but it is even more interesting in view of the fact that, letting \(G_1^{-1}(\cdot )\) and \(G_2^{-1}(\cdot )\) denote the left-continuous inverse or quantile functions, \(G_i^{-1}(u):=\inf \{y\in \mathscr {Y}:\,G_i(y)\ge u\}\), \(u\in (0,\,1)\), \(i=1,\,2\), it can be written as the \(L^1\)-distance between the quantile functions or, equivalently, as the \(L^1\)-distance between the cumulative distribution functions,

$$\begin{aligned} W_1(G_1,\,G_2)= & {} \int _{0}^{1}|G_1^{-1}(u)-G_2^{-1}(u)|\,\mathrm {d}u \nonumber \\= & {} \int _{\mathscr {Y}}|G_1(y)-G_2(y)|\,\mathrm {d}y=||G_1-G_2||_1, \end{aligned}$$
(14)

see, e.g., Shorack and Wellner (1986), pp. 64–66. The representation in (14) was obtained by Dall’Aglio (1956). Thus, by rewriting \(W_1(\hat{G}_n^{\mathrm {B}},\,G_0)\) as the \(L^1\)-distance between the c.d.f.’s \(\hat{G}_n^{\mathrm {B}}\) and \(G_0\), the assertion of Proposition 3,

$$\begin{aligned} W_1(\hat{G}_n^{\mathrm {B}},\,G_0)=||\hat{G}_n^{\mathrm {B}}-G_0||_1=O_{\mathbf {P}}(n^{-1/8}(\log n)^{2/3}), \end{aligned}$$

becomes more transparent and meaningful.
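The representation in (14) also makes \(W_1\) straightforward to approximate numerically. The sketch below (the samples and the grid are illustrative and not part of the theoretical development) computes the distance between two empirical distributions both via sorted samples, i.e. the quantile form, and via the \(L^1\)-distance between empirical c.d.f.'s.

```python
import numpy as np

rng = np.random.default_rng(2)

# Two illustrative samples standing in for draws from G_1 and G_2 (here Laplace
# laws differing only by a location shift of 0.5, for which W_1 equals 0.5).
y1 = rng.laplace(0.0, 1.0, size=5000)
y2 = rng.laplace(0.5, 1.0, size=5000)

# Quantile form of (14): with equal sample sizes, W_1 between the empirical
# distributions is the mean absolute difference of the sorted samples.
w1_quantile = np.mean(np.abs(np.sort(y1) - np.sort(y2)))

# C.d.f. form of (14): L^1-distance between the empirical c.d.f.'s on a grid.
grid = np.linspace(-12.0, 12.0, 4001)
F1 = np.searchsorted(np.sort(y1), grid, side="right") / y1.size
F2 = np.searchsorted(np.sort(y2), grid, side="right") / y2.size
w1_cdf = np.trapz(np.abs(F1 - F2), grid)

# Both values agree up to discretization and Monte Carlo error;
# scipy.stats.wasserstein_distance(y1, y2) returns the same quantity.
```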

3 Rates of convergence for ML estimation and \(L^1\)-Wasserstein deconvolution of Laplace mixtures

In this section, we first study the rate of convergence in the Hellinger metric for the MLE \(\hat{p}_n\) of a Laplace mixture density \(p_0\equiv p_{G_0}=G_0 *f\), with unknown mixing distribution \(G_0\in \mathscr {G}\). We then derive the rate of convergence in the \(L^1\)-Wasserstein metric for the MLE \(\hat{G}_n\) of the mixing distribution \(G_0\), which corresponds to the MLE \(\hat{p}_n\) of the mixed density \(p_0\), by appealing to an inversion inequality relating the Hellinger distance between Laplace mixture densities to any \(L^p\)-Wasserstein distance, \(p\ge 1\), between the corresponding mixing distributions (see Lemma 4 in Appendix D).

A MLE \(\hat{p}_n\) of \(p_0\) is a measurable function of the observations taking values in \(\mathscr {P}:=\{p_G:\,G\in \mathscr {G}\}\) such that

$$\begin{aligned} \hat{p}_n\in \underset{p_G\in \mathscr {P}}{\arg \max } \frac{1}{n}\sum _{i=1}^n\log p_G(X_i)=\underset{p_G\in \mathscr {P}}{\arg \max } \int (\log p_G)\,\mathrm {d}{\mathbb P_n}, \end{aligned}$$

where \({\mathbb P_n}:={n}^{-1}\sum _{i=1}^n\delta _{X_i}\) is the empirical measure associated with the random sample \(X_1,\,\ldots ,\,X_n\), namely, the discrete uniform distribution on the sample values that puts mass 1/n on each one of the observations. We assume that the MLE exists, but do not require it to be unique; see Lindsay (1995), Theorem 18, p. 112, for sufficient conditions ensuring uniqueness.
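In practice, the maximization over \(\mathscr {G}\) is often approximated by restricting G to a fixed, fine grid of candidate support points and maximizing over the mixing weights only, which reduces to a finite-dimensional convex problem solvable, for instance, by EM. The sketch below is such a discretized stand-in (the grid and the number of iterations are illustrative choices), not the estimator analysed in this section.

```python
import numpy as np

def npmle_weights_em(x, grid, n_iter=500):
    """EM iterations for the mixing weights of a Laplace mixture supported on a
    fixed grid of candidate atoms: a discretized approximation to the MLE of G."""
    # Kernel matrix K[i, j] = f(x_i - grid_j), with f the standard Laplace density
    K = 0.5 * np.exp(-np.abs(x[:, None] - grid[None, :]))
    w = np.full(grid.size, 1.0 / grid.size)       # uniform starting weights
    for _ in range(n_iter):
        resp = K * w                              # responsibilities, up to row normalization
        resp /= resp.sum(axis=1, keepdims=True)
        w = resp.mean(axis=0)                     # update: average responsibilities
    return w                                      # approximate weights on the grid

# Example usage (x an array of contaminated observations):
# grid = np.linspace(x.min(), x.max(), 200)
# w_hat = npmle_weights_em(x, grid)
```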

Results on rates of convergence in the Hellinger metric for the MLE of a density can be found in Birgé and Massart (1993), Van de Geer (1993) and Wong and Shen (1995); it can, however, be difficult to calculate the \(L^2\)-metric entropy with bracketing of the square-root densities that is employed in these articles. Taking instead into account that a mixture model \(\{\int _{\mathscr {Y}}K(\cdot ,\,y)\,\mathrm {d}G(y):\,G\in \mathscr {G}\}\) is the closure of the convex hull of the collection of kernels \(\{K(\cdot ,\,y):\,y\in \mathscr {Y}\subseteq \mathbb {R}\}\), which is typically a much smaller class, a bound on a form of metric entropy without bracketing of the class of mixtures can be derived from a covering number of the class of kernels, by a result on the metric entropy without bracketing of convex hulls deducible from Ball and Pajor (1990). A relatively simple “recipe” can thus be given to obtain (an upper bound on) the rate of convergence in the Hellinger metric for the MLE of a density in terms of the “dimension” of the class of kernels and the behaviour of \(p_0\) near zero, cf. Corollary 2.3 of Van de Geer (1996), p. 298.
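The entropy calculus behind this recipe is, in the form used below, the convex-hull bound of Ball and Pajor (1990) as reported in Theorem 1.1 of Van de Geer (1996), p. 295: if a class \(\mathscr {K}\), uniformly bounded in \(L^2(Q)\), satisfies \(N(\delta ,\,\mathscr {K},\,||\cdot ||_{2,Q})\lesssim \delta ^{-V}\) for all \(\delta >0\) and some \(V>0\), then

$$\begin{aligned} \log N(\delta ,\,\overline{\mathrm {conv}}(\mathscr {K}),\,||\cdot ||_{2,Q})\lesssim \delta ^{-2V/(V+2)},\quad \delta >0. \end{aligned}$$

In the proof of Proposition 4 below, the relevant (rescaled) Laplace kernel class satisfies this with \(V=1\), yielding the \(\delta ^{-2/3}\)-entropy bound for its convex hull.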

Proposition 4

Let the sampling density be \(p_0\equiv p_{G_0}=G_0*f\), with the kernel density f being the standard Laplace and the mixing distribution \(G_0\in \mathscr {G}\). Suppose that, for a sequence of non-negative real numbers \(\sigma _n=O(n^{-3/8}\log ^{1/8}n)\), we have

(a) \(\int _{p_0\le \sigma _n}p_0\,\mathrm {d}\lambda \lesssim \sigma _n^2\),

(b) \(\int _{p_0>\sigma _n}(1/p_0)\,\mathrm {d}\lambda \lesssim \log (1/\sigma _n)\).

Then,

$$\begin{aligned} h(\hat{p}_n,\, p_0)=O_{\mathbf {P}}(n^{-3/8}\log ^{1/8}n). \end{aligned}$$

Proof

We begin by spelling out the remark mentioned in the introduction concerning the fact that a mixture model is the closure of the convex hull of the collection of kernels. Recall that the convex hull of a class \(\mathscr {K}\) of functions, denoted by \(\mathrm {conv}(\mathscr {K})\), is defined as the set of all finite convex combinations of functions in \(\mathscr {K}\),

$$\begin{aligned} \mathrm {conv}(\mathscr {K}):=\Bigg \{\sum _{j=1}^r\theta _jK_j:\, \theta _j\ge 0,\,K_j\in \mathscr {K},\,j=1,\,\ldots ,\,r,\,\sum _{j=1}^r\theta _j=1,\,r\in \mathbb {N}\Bigg \}. \end{aligned}$$

In our case,

$$\begin{aligned} \mathscr {K}:=\left\{ f(\cdot -y):\,y\in \mathscr {Y}\subseteq \mathbb {R}\right\} \end{aligned}$$

is the collection of kernels with f the standard Laplace density. The class \(\mathscr {P}:=\{p_G:\,G\in \mathscr {G}\}\) of all Laplace convolution mixtures \(p_G=G*f\) is the closure of the convex hull of \(\mathscr {K}\),

$$\begin{aligned} \mathscr {P}=\overline{\mathrm {conv}}(\mathscr {K}). \end{aligned}$$

Clearly, \(\mathscr {P}\) is itself a convex class. This remark enables us to apply Theorem 2.2 and Corollary 2.3 of Van de Geer (1996), pp. 297–298 and 310, or, equivalently, Theorem 7.7 of Van de Geer (2000), pp. 104–105, whose conditions are hereafter shown to be satisfied. To this aim, we define the class

$$\begin{aligned} {\mathscr {K}}/p_0:=\left\{ \frac{f(\cdot -y)}{p_0(\cdot )}\mathbf {1}\{p_0>\sigma _n\}:\,y\in \mathscr {Y}\right\} \end{aligned}$$

and the envelope function

$$\begin{aligned} \bar{K}(\cdot ):=\sup _{y\in \mathscr {Y}}\frac{f(\cdot -y)}{p_0(\cdot )}\mathbf {1}\{p_0>\sigma _n\}, \end{aligned}$$

where, to lighten the notation, we have suppressed in \({\mathscr {K}}/p_0\) and \(\bar{K}(\cdot )\) the subscript n that would stress their possible dependence on \(\sigma _n\) when \(\sigma _n>0\). Since, by assumption (a),

$$\begin{aligned} \int _{p_0\le \sigma _n}\mathrm {d}P_0= \int _{p_0\le \sigma _n}p_0\,\mathrm {d}\lambda \lesssim \sigma _n^2 \end{aligned}$$

and, by assumption (b), together with the fact that \(\Vert f\Vert _\infty =1/2\),

$$\begin{aligned} \int \bar{K}^2\,\mathrm {d}P_0 \lesssim \int _{p_0>\sigma _n}\frac{1}{p_0}\,\mathrm {d}\lambda \lesssim \log (1/\sigma _n), \end{aligned}$$
(15)

we can take the sequence \(\delta _n^2\propto \sigma _n^2\) in condition (7.21) of Theorem 7.7 of Van de Geer (2000), p. 104. Because the (standard) Laplace kernel density f is Lipschitz,

$$\begin{aligned} \forall \,y_1,\,y_2\in \mathscr {Y},\quad |f(\cdot -y_1)-f(\cdot -y_2)|\le \frac{1}{2} |y_1-y_2|, \end{aligned}$$

see, e.g., Lemma A.1 in Scricciolo (2011), pp. 299–300, on the set

$$\begin{aligned} \Bigg \{\int \bar{K}^2\,\mathrm {d}{\mathbb P_n}\le T^2\log (1/\delta _n)\Bigg \}, \end{aligned}$$
(16)

where \(T>0\) is a finite constant, we find that, for \(\mathrm {d}{\mathbb Q_n}:= \mathrm {d}{\mathbb P_n}/(T^2\log (1/\delta _n))\),

$$\begin{aligned} N(\delta ,\,{\mathscr {K}}/p_0,\,||\cdot ||_{2,{\mathbb Q_n}})\lesssim \delta ^{-1}\quad \hbox {for }\delta >0, \end{aligned}$$

where \(||\cdot ||_{2,{\mathbb Q_n}}\) denotes the \(L^2({\mathbb Q_n})\)-norm, that is, \(||g||_{2,{\mathbb Q_n}}:=(\int |g|^2\,\mathrm {d}{\mathbb Q_n})^{1/2}\). So, in view of the result of Ball and Pajor (1990), reported as Theorem 1.1 in Van de Geer (1996), p. 295, on the same set as in (16), we have

$$\begin{aligned} \log N(\delta ,\,\overline{\mathrm {conv}}({\mathscr {K}}/p_0),\,||\cdot ||_{2,{\mathbb Q_n}})\lesssim \delta ^{-2/3}, \end{aligned}$$

hence

$$\begin{aligned} \log N(\delta ,\,\overline{\mathrm {conv}}({\mathscr {K}}/p_0),\,||\cdot ||_{2,{\mathbb P_n}})\lesssim \left( \frac{T\log ^{1/2}(1/\delta _n)}{\delta }\right) ^{2/3}. \end{aligned}$$

Next, define the class

$$\begin{aligned} \mathscr {P}^{(\mathrm {conv})}_{\sigma _n}:=\Bigg \{\frac{p_G+p_0}{2p_0}\,\mathbf {1}\{p_0>\sigma _n\}:\,G\in \mathscr {G}\Bigg \} \end{aligned}$$

considered in condition (7.20) of Theorem 7.7 in Van de Geer (2000), p. 104. Since

$$\begin{aligned} \log N(2\delta ,\,\mathscr {P}^{(\mathrm {conv})}_{\sigma _n},\,||\cdot ||_{2,{\mathbb P_n}})\le \log N(\delta ,\,\overline{\mathrm {conv}}({\mathscr {K}}/p_0),\,||\cdot ||_{2,{\mathbb P_n}}), \end{aligned}$$

in view of (15), we have

$$\begin{aligned} \sup _{\delta >0}\frac{\log N(\delta ,\,\mathscr {P}^{(\mathrm {conv})}_{\sigma _n},\,||\cdot ||_{2,{\mathbb P_n}})}{H(\delta )}=O_{\mathbf {P}}(1) \end{aligned}$$

for the non-increasing function of \(\delta \)

$$\begin{aligned} H(\delta ):=\delta ^{-2/3}\log ^{1/3}(1/\delta _n),\quad \delta >0. \end{aligned}$$

Taking \(\Psi (\delta ):=c_1\delta ^{2/3}\log ^{1/6}(1/\delta _n)\), with a suitable finite constant \(c_1>0\), we have

$$\begin{aligned} \forall \,\delta \in (0,\,1),\,\,\,\Psi (\delta )\ge \left( \int _{\delta ^2/c}^\delta H^{1/2}(u)\,\mathrm {d}u\right) \vee \delta \end{aligned}$$

and, for some \(\varepsilon >0\), \(\Psi (\delta )/\delta ^{2-\varepsilon }\) is non-increasing. Then, for \(\delta _n\) satisfying \(\sqrt{n}\delta _n^2\ge \Psi (\delta _n)\), cf. condition (7.22) of Theorem 7.7 in Van de Geer (2000), p. 104, we have \(h(\hat{p}_n,\, p_0)=O_{\mathbf {P}}(\delta _n)\); consistently with the initial choice of \(\sigma _n\), we can take \(\delta _n\propto n^{-3/8}\log ^{1/8}n\), and the proof is complete. \(\square \)

Remark 3

If \(p_0>0\) and \(\mathscr {Y}\) is a compact interval \([-a,\,a]\), with \(a>0\), then \(h(\hat{p}_n,\,p_0)=O_{\mathbf {P}}(n^{-3/8})\). In fact, taking the sequence \(\sigma _n\equiv 0\), we have \(||\bar{K}||_\infty \le e^{2a}\) and \(\int \bar{K}^2\,\mathrm {d}P_0 \le e^{4a}\), so that, on the set \(\{\int \bar{K}^2\,\mathrm {d}{\mathbb P_n}\le T\}\), the entropy \(\log N(\delta ,\,\overline{\mathrm {conv}}({\mathscr {K}}/p_0),\,||\cdot ||_{2,{\mathbb P_n}})\lesssim \delta ^{-2/3}\) and, reasoning as in the proof of Proposition 4, we find the rate \(n^{-3/8}\).

We now derive a consequence of Proposition 4 on the rate of convergence in the \(L^1\)-Wasserstein metric for the MLE of \(G_0\). A MLE \(\hat{p}_n\) of the mixed density \(p_0\) corresponds to a MLE \(\hat{G}_n\) of the mixing distribution \(G_0\), that is, \(\hat{p}_n\equiv p_{\hat{G}_n}\), such that

$$\begin{aligned} \hat{G}_n\in \underset{G\in \mathscr {G}}{\arg \max } \frac{1}{n}\sum _{i=1}^n\log p_G(X_i)=\underset{G\in \mathscr {G}}{\arg \max } \int (\log p_G)\,\mathrm {d}{\mathbb P_n}. \end{aligned}$$

Clearly, \(\hat{G}_n\) is a discrete distribution, but the number of its components is not known in advance: Lindsay (1995) showed that the MLE \(\hat{G}_n\) is supported on at most \(k\le n\) points, k being the number of distinct observed values or data points.

Corollary 2

Suppose that the assumptions of Proposition 4 hold. If, in addition, the mixing distribution \(G_0\) has finite moment generating function in some interval \((-s_0,\,s_0)\), with \(0<s_0<1\), and

$$\begin{aligned}&\forall \, 0<s<s_0,\quad \limsup _{n\rightarrow +\infty }P_0^nM_{\hat{G}_n}(-s)\le M_{G_0}(-s) \quad \hbox {and} \quad \nonumber \\&\limsup _{n\rightarrow +\infty }P_0^nM_{\hat{G}_n}(s)\le M_{G_0}(s), \end{aligned}$$
(17)

where \(M_{\hat{G}_n}(s):=\int _{\mathscr {Y}} e^{sy}\,\mathrm {d}\hat{G}_n(y)\), \(s\in \mathbb {R}\), then

$$\begin{aligned} W_1(\hat{G}_n,\,G_0)=O_{\mathbf {P}}(n^{-1/8}(\log n)^{13/24}). \end{aligned}$$

Proof

Let \(k_n:=n^{-1/8}(\log n)^{13/24}\) and, for a suitable finite constant \(c_2>0\), \(M_n=c_2(\log n)\). Fix numbers s and u such that \(0<u<s<s_0<1\). For sufficiently large constants \(0<T,\,T',\,T''<+\infty \), reasoning as in Lemma 4, we have

$$\begin{aligned} \begin{aligned}&P_0^n(W_1(\hat{G}_n,\,G_0)>T k_n)\\&\quad \le P_0^n(h(\hat{p}_n,\, p_0)>T' k_n^3(\log n)^{-3/2})\\&\qquad + P_0^n(M_{\hat{G}_n}(-s)+M_{\hat{G}_n}(s)>T''k_ne^{uM_n})=:P_1+P_2. \end{aligned} \end{aligned}$$

The term \(P_1\) can be made arbitrarily small because \(h(\hat{p}_n,\, p_0)=O_{\mathbf {P}}(n^{-3/8}\log ^{1/8}n)\) by Proposition 4. The term \(P_2\) goes to zero as \(n\rightarrow +\infty \): in fact, by Markov’s inequality and assumption (17), for some real \(0<l<+\infty \),

$$\begin{aligned} P_2\lesssim & {} e^{-uM_n}k_n^{-1} [P_0^nM_{\hat{G}_n}(-s)+P_0^nM_{\hat{G}_n}(s)]\\\lesssim & {} \frac{1}{n^l} [P_0^nM_{\hat{G}_n}(-s)+P_0^nM_{\hat{G}_n}(s)]\rightarrow 0 \quad \text{ as } n\rightarrow +\infty \end{aligned}$$

and the assertion follows. \(\square \)

Remark 4

Assumption (17) essentially requires that \(M_{\hat{G}_n}\) be an asymptotically unbiased estimator of \(M_{G_0}\) in some neighborhood \((-s_0,\,s_0)\) of zero, with \(0<s_0<1\). An analysis of the asymptotic behaviour of certain linear functionals of the MLE \(\hat{G}_n\) is presented in Van de Geer (1995), wherein sufficient conditions are provided for them to be \(\sqrt{n}\)-consistent, asymptotically normal and efficient.

4 Merging of Bayes and ML for \(L^1\)-Wasserstein deconvolution of Laplace mixtures

In this section, we show that the Bayes’ estimator and the MLE of \(G_0\) merge in the \(L^1\)-Wasserstein metric, their discrepancy vanishing, at worst, at rate \(n^{-1/8}(\log n)^{2/3}\) because they both consistently estimate \(G_0\) at a speed which is within a \((\log n)\)-factor of \(n^{-1/8}\), cf. Proposition 3 and Corollary 2.

Proposition 5

Under the assumptions of Proposition 3 and Corollary 2, we have

$$\begin{aligned} W_1(\hat{G}_n^{\mathrm {B}},\,\hat{G}_n)=O_{\mathbf {P}}(n^{-1/8}(\log n)^{2/3}). \end{aligned}$$
(18)

Proof

By the triangle inequality,

$$\begin{aligned} W_1(\hat{G}_n^{\mathrm {B}},\,\hat{G}_n)\le W_1(\hat{G}_n^{\mathrm {B}},\,G_0)+ W_1(G_0,\,\hat{G}_n), \end{aligned}$$

where \(W_1(\hat{G}_n^{\mathrm {B}},\,G_0)=O_{\mathbf {P}}(n^{-1/8}(\log n)^{2/3})\) and \(W_1(G_0,\,\hat{G}_n)=O_{\mathbf {P}}(n^{-1/8}(\log n)^{13/24})\) by Proposition 3 and Corollary 2, respectively. Relationship (18) follows. \(\square \)

Proposition 5 states that the Bayes’ estimator and the MLE of \(G_0\) will eventually be indistinguishable and that (an upper bound on) the speed of convergence for their \(L^1\)-Wasserstein discrepancy is determined by the stochastic orders of their errors in recovering \(G_0\). The crucial question that remains open is whether the Bayes’ estimator and the MLE are rate-optimal. Concerning this issue, we note that, on the one hand, other deconvolution estimators of the distribution function attain the rate \(n^{-1/8}\) when the error distribution is the standard Laplace, with the proviso, however, that the \(L^1\)-Wasserstein metric is not linked to the integrated quadratic risk between the c.d.f.’s used in the result we are going to mention, so that the rates are not directly comparable. For instance, the estimator \(G_n^{K}(h_n)(y):=\int _{-\infty }^yp_n^K(h_n)(u)\,\mathrm {d}u\), \(y\in \mathbb {R}\), of the c.d.f. \(G_0\) based on the standard deconvolution kernel density estimator is such that \(\{\int _{-\infty }^{+\infty }E[G_n^{K}(h_n)(y)-G_0(y)]^2\,\mathrm {d}y\}^{1/2}=O(n^{-1/8})\) when no assumptions on \(G_0\) are postulated, except for the existence of the first absolute moment; see (3.12) in Corollary 3.3 of Hall and Lahiri (2008), p. 2117. On the other hand, a recent lower bound result, due to Dedecker and Michel (2015), Theorem 4.1, pp. 246–248, suggests that better rates are possible. For \(M>0\) and \(r\ge 1\), let \(\mathscr {D}(M,\,r)\) be the class of all probability measures G on \((\mathbb {R},\,\mathscr {B}(\mathbb {R}))\) such that \(\int _{-\infty }^{+\infty } |y|^r\,\mathrm {d}G(y)\le M\). Let f be the error density. Assume that there exist \(\beta >0\) and \(c>0\) such that, for every \(\ell \in \{0,\,1,\,2\}\), we have \(|\hat{f}^{(\ell )}(t)|\le c(1+|t|)^{-\beta }\), \(t\in \mathbb {R}\). Then, there exists a finite constant \(C>0\) such that, for any estimator \({\hat{G}}_n\) (we warn the reader of the clash of notation with the symbol \(\hat{G}_n\) previously used to denote the MLE of \(G_0\)),

$$\begin{aligned} \liminf _{n\rightarrow +\infty }n^{p/(2\beta +1)}\sup _{G\in \mathscr {D}(M,\,r)}EW_p^p({\hat{G}}_n,\,G)>C. \end{aligned}$$

For \(p=1\) and the (standard) Laplace error distribution, this yields the lower bound \(n^{-1/5}\), which is faster than the leading term \(n^{-1/8}\) of the upper bounds we have found, although it is not known whether either the Bayes’ estimator or the MLE attains it.
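For comparison purposes only, the kernel-based estimator mentioned above can be sketched numerically as follows (a rough illustration with a sinc kernel; the bandwidth, grids and truncations are illustrative choices, and the code is not meant to reproduce the exact estimator of Hall and Lahiri 2008). Since the standard Laplace density has Fourier transform \(\hat{f}(t)=1/(1+t^2)\), deconvolution amounts to multiplying the empirical characteristic function by \(1+t^2\) before Fourier inversion.

```python
import numpy as np

def deconv_density(x_obs, grid, h, n_t=2001):
    """Deconvolution kernel density estimate under standard Laplace error,
    using a sinc kernel of bandwidth h (Fourier cut-off at |t| <= 1/h)."""
    t = np.linspace(-1.0 / h, 1.0 / h, n_t)
    # empirical characteristic function of the contaminated observations
    ecf = np.exp(1j * t[:, None] * x_obs[None, :]).mean(axis=1)
    # divide by the Laplace error characteristic function 1/(1+t^2)
    integrand = ecf * (1.0 + t ** 2)
    vals = np.trapz(np.exp(-1j * np.outer(grid, t)) * integrand[None, :], t, axis=1)
    return np.real(vals) / (2.0 * np.pi)

def deconv_cdf(x_obs, grid, h):
    """Plug-in estimate of the latent c.d.f. obtained by integrating the density estimate."""
    dens = deconv_density(x_obs, grid, h)
    return np.cumsum(dens) * (grid[1] - grid[0])
```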

Finally, a remark on the use of the term “merging”. Even though this term is used here with a different meaning from that considered in Barron (1988), where merging is intended as the convergence to one of the ratio of the marginal likelihood to the joint density of the first n observations, or from that in Diaconis and Freedman (1986), where merging refers to the “intersubjective agreement”, as more and more data become available, between two Bayesians with different prior opinions, the underlying idea is, in a broad sense, the same: different inferential procedures become essentially indistinguishable for large sample sizes.

5 Final remarks

In this note, we have studied rates of convergence for Bayes and maximum likelihood estimation of Laplace mixtures and for their \(L^1\)-Wasserstein deconvolution. The result on the convergence rate in the Hellinger metric for the MLE of Laplace mixtures is achieved taking a different approach from that adopted in Ghosal and van der Vaart (2001), which is based on the \(L^1\)-metric entropy with bracketing of the set of densities under consideration and is difficult to apply in the present context, due to the non-analyticity of the Laplace density. Posterior contraction rates for Dirichlet–Laplace mixtures have been previously studied by Gao and van der Vaart (2016) in the case of compactly supported mixing distributions and have been extended here to mixing distributions with a possibly unbounded set of locations, which requires the derivation of more general entropy estimates, cf. Appendix B. An interesting extension to pursue would be that of considering general kernel densities with polynomially decaying Fourier transforms in the sense of Definition 1: indeed, in the proof of Proposition 2, which gives an assessment of the posterior contraction rate in the \(L^2\)-metric for Dirichlet–Laplace mixtures, all conditions, except for the Kullback–Leibler prior mass requirement, hold for any kernel density as in Definition 1, provided that \(\beta >1\). The missing piece is an extension of Lemma 2 in Gao and van der Vaart (2016), pp. 615–616, which is preliminary for checking the Kullback–Leibler prior mass condition and guarantees that a Laplace mixture, with mixing distribution that is the re-normalized restriction of \(G_0\) to a compact interval, can be approximated in the Hellinger metric by a Laplace mixture with a discrete mixing distribution having a sufficiently restricted number of support points. We believe that, as for the Laplace kernel, the number of support points of the approximating mixing distribution will ultimately depend only on the decay rate of the Fourier transform of the kernel density, even though, in a general proof, the explicit expression of the kernel density cannot be exploited as in the Laplace case. Extending the result on posterior contraction rates to general kernel mixtures would be of interest in itself and for extending the \(L^1\)-Wasserstein deconvolution result, even though this would pose the rate-optimality question in more general terms, as happens for the \(n^{-1/8}\)-rate in the Laplace case; see the remarks at the end of Sect. 4. We hope to report on these issues in a follow-up contribution.