1 Introduction

Distances between probability measures have many applications, including distribution testing, density estimation, and signal detection (Rachev 1991; Vajda 1989). In recent years, many distances between probability measures have been proposed (see, for example, Sriperumbudur et al. 2009, 2010). Many of them are built on the general framework of the integral probability metric (IPM) (Müller 1997).

To introduce the approach, denote by \(\mathcal {P}\) the set of all Borel probability measures on a measurable space \((M,\mathcal {A})\). The IPM between \(\mathbb {P}_1,\mathbb {P}_2\in \mathcal {P}\) is defined as

$$\begin{aligned} \gamma _{\mathcal {F}}(\mathbb {P}_1,\mathbb {P}_2)=\sup _{f\in \mathcal {F}}\left| \int _M f\,\mathrm {d}\mathbb {P}_1-\int _M f\,\mathrm {d}\mathbb {P}_2\right| , \end{aligned}$$
(1)

where \(\mathcal {F}\) is a class of real-valued bounded measurable functions on M. Different choices of the class \(\mathcal {F}\) yield different metrics \(\gamma _\mathcal {F}\) on \(\mathcal {P}\). The following are among the well-known examples in the literature:

  1.

    Total variation distance: \(\mathcal {F}=\text {C}_{bu}(M)\), the space of all uniformly bounded continuous functions on M or \(\mathcal {F}=\{f:\Vert f\Vert _\infty \le 1\}\), where \(\Vert f\Vert _\infty =\sup _{x\in M}|f(x)|\) (see, for example, Shorack 2000, Chapter 19);

  2.

    The Kolmogorov distance: \(\mathcal {F}=\{1_{(-\infty ,t]}:t\in \mathbb {R}^d\}\), where \(1_A\) denotes the characteristic function of a subset A of \(\mathbb {R}^d\) (see, for example, Shorack 2000, Chapter 19);

  3.

    The Kantorovich metric or Wasserstein distance: \(\mathcal {F}:=\{f:\Vert f\Vert _L\le 1\}\) where \(\Vert f\Vert _L:=\sup \{|f(x)-f(y)|/\rho (x,y),\ x\ne y\in M\}\) with M being a metric space with metric \(\rho \) (see, Dudley 2002, Theorem 11.8.2);

  4.

    Reproducing kernel Hilbert space embedding of measures: \(\mathcal {F}=\{f\in \mathcal {H}_K: \Vert f\Vert _{\mathcal {H}_K}\le 1\}\), where \(\mathcal {H}_K\) is the reproducing kernel Hilbert space of a reproducing kernel K on M (Gretton et al. 2007; Smola et al. 2007).

  5.

    Reproducing kernel Banach space embedding of measures (Sriperumbudur et al. 2011): \(\mathcal {F}=\{f\in \mathcal {B}: \Vert f\Vert _{\mathcal {B}}\le 1\}\), where \(\mathcal {B}\) is a reproducing kernel Banach space on M (Song et al. 2013; Zhang et al. 2009).

This paper attempts to contribute to approach 4 above. We first recall the concepts of reproducing kernels and reproducing kernel Hilbert spaces (Aronszajn 1950).

Definition 1.1

Let M be a prescribed set. A reproducing kernel on M is a real-valued function \(K:M\times M\rightarrow \mathbb {R}\) such that for any finitely many points \(x_1,x_2,\ldots ,x_n\in M\), the matrix

$$\begin{aligned} {[}K(x_j,x_k)]_{j,k=1}^n \end{aligned}$$

is symmetric and positive semi-definite.

For a reproducing kernel K on M, there exists a unique associated Hilbert space denoted by \(\mathcal {H}_K\) consisting of certain functions on M such that \(K(x,\cdot )\in \mathcal {H}_K\) for all \(x\in M\) and

$$\begin{aligned} f(x)=\langle f,K(x,\cdot )\rangle _{\mathcal {H}_K} \text{ for } \text{ all } f\in \mathcal {H}_K,\ x\in M, \end{aligned}$$

where \(\langle \cdot ,\cdot \rangle _{\mathcal {H}_K}\) denotes the inner product on \(\mathcal {H}_K\). The space \(\mathcal {H}_K\) is called the reproducing kernel Hilbert space (RKHS) of the reproducing kernel K. We introduce the notation

$$\begin{aligned} \gamma _K(\mathbb {P}_1,\mathbb {P}_2)=\sup _{f\in \mathcal {H}_K,\Vert f\Vert _{\mathcal {H}_K}\le 1}\left| \int _M f\,\mathrm {d}\mathbb {P}_1-\int _M f\,\mathrm {d}\mathbb {P}_2\right| . \end{aligned}$$
(2)

When only finite i.i.d. random samples \(\{X_i:1\le i\le m\}\) and \(\{Y_j:1\le j\le n\}\) drawn from unknown measures \(\mathbb {P}_1,\mathbb {P}_2\) are available, one approximates \(\mathbb {P}_1\) and \(\mathbb {P}_2\) respectively by

$$\begin{aligned} \mathbb {P}_{1m}:=\frac{1}{m}\sum _{i=1}^m\delta _{X_i} \text{ and } \mathbb {P}_{2n}:=\frac{1}{n}\sum _{j=1}^n\delta _{Y_j} \end{aligned}$$

and thereby approximates \(\gamma _\mathcal {F}(\mathbb {P}_1,\mathbb {P}_2)\) by \(\gamma _\mathcal {F}(\mathbb {P}_{1m},\mathbb {P}_{2n})\). By choosing \(\mathcal {F}\) to be the unit ball of the reproducing kernel Hilbert space of a reproducing kernel K, the approach of reproducing kernel Hilbert space embedding of measures enjoys many advantages over the other approaches (Gretton et al. 2007; Sriperumbudur et al. 2009; Weaver 1999). Firstly, \(\gamma _K(\mathbb {P}_{1m},\mathbb {P}_{2n})\) is simply a sum of expectations of the kernel K and hence is much easier to compute than the other choices. Secondly, \(\gamma _K(\mathbb {P}_{1m},\mathbb {P}_{2n})\) is a \(\sqrt{mn/(m+n)}\)-consistent estimate of \(\gamma _K(\mathbb {P}_1,\mathbb {P}_2)\) for all \(\mathbb {P}_1,\mathbb {P}_2\) under the mild condition that K is measurable and bounded (Gretton et al. 2007). Thirdly, when K is translation-invariant, the rate of approximation of \(\gamma _K(\mathbb {P}_1,\mathbb {P}_2)\) by \(\gamma _K(\mathbb {P}_{1m},\mathbb {P}_{2n})\) is independent of the dimension (Sriperumbudur et al. 2009).
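To illustrate the first advantage, the squared empirical distance expands into three kernel sums, \(\gamma _K^2(\mathbb {P}_{1m},\mathbb {P}_{2n})=\frac{1}{m^2}\sum _{i,j}K(X_i,X_j)+\frac{1}{n^2}\sum _{i,j}K(Y_i,Y_j)-\frac{2}{mn}\sum _{i,j}K(X_i,Y_j)\). The following sketch computes this estimate; the function name and the choice of the Gaussian kernel are ours, for illustration only.

```python
import numpy as np

def mmd_gaussian(X, Y, tau=1.0):
    """Empirical gamma_K(P_1m, P_2n) for the Gaussian kernel
    K(x, y) = exp(-||x - y||^2 / (2 tau^2)).

    X: (m, d) array of samples from P_1; Y: (n, d) array from P_2.
    """
    def gram(A, B):
        # Pairwise squared distances via broadcasting, then the kernel.
        sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
        return np.exp(-sq / (2.0 * tau ** 2))

    m, n = len(X), len(Y)
    gamma_sq = (gram(X, X).sum() / m ** 2
                + gram(Y, Y).sum() / n ** 2
                - 2.0 * gram(X, Y).sum() / (m * n))
    return np.sqrt(max(gamma_sq, 0.0))
```

For instance, samples drawn from two well-separated Gaussians yield a larger value than samples drawn from two nearby ones, in line with requirement (A1) below.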

Despite these favorable properties, there is one critical question that is not well addressed in the RKHS embedding of measures: the choice of the reproducing kernel. An RKHS \(\mathcal {H}_K\) is completely determined by its reproducing kernel K (Aronszajn 1950; Zhang and Zhao 2013). In fact, \(\mathcal {H}_K\) is the completion of the linear space

$$\begin{aligned} \,\mathrm{span}\,\{K(x,\cdot ):x\in M\} \end{aligned}$$

under the inner product

$$\begin{aligned} \left\langle \sum _{j=1}^p c_j K(x_j,\cdot ),\sum _{k=1}^q d_k K(y_k,\cdot )\right\rangle _{\mathcal {H}_K}=\sum _{j=1}^p\sum _{k=1}^q c_jd_kK(x_j,y_k)\qquad c_j,d_k\in \mathbb {R}. \end{aligned}$$

Therefore, the choice of the reproducing kernel K greatly affects the embedding of probability measures in \(\mathcal {H}_K\). So far, studies in the literature have focused on characteristic kernels, which ensure that \(\gamma _K(\mathbb {P}_1,\mathbb {P}_2)\) is a metric on \(\mathcal {P}\) (see, for example, Berlinet and Thomas-Agnan 2004; Chen et al. 2016; Fukumizu et al. 2009, 2008; Gretton et al. 2007; Sriperumbudur et al. 2011, 2009, 2010; Steinwart 2001). Being characteristic can be viewed as a preliminary requirement on the reproducing kernel. In this paper, we attempt to impose a further admissibility criterion on the reproducing kernel used to measure the similarity of a class of probability distributions. Let us make our objective precise.

Assume that K is a characteristic kernel. Then \(\gamma _K(\mathbb {P}_1,\mathbb {P}_2)\) is a metric and can be used to measure the similarity between two probability measures \(\mathbb {P}_1,\mathbb {P}_2\). Consider the important class of Gaussian measures

$$\begin{aligned} \,\mathrm {d}\mathbb {P}_{\mu ,\sigma }(x)=\frac{1}{\sqrt{2\pi }\sigma }\exp \left( -\frac{(x-\mu )^2}{2\sigma ^2}\right) \,\mathrm {d}x,\ \ x\in \mathbb {R}, \end{aligned}$$

with mean \(\mu \) and standard deviation \(\sigma \). Naturally, two Gaussian measures \(\mathbb {P}_{\mu _1,\sigma _1}\) and \(\mathbb {P}_{\mu _2,\sigma _2}\) should be regarded as closer in the following two cases (see Figs. 1 and 2 for illustration):

  (i)

    when the means are getting closer, that is, \(\mu _1\) tends to \(\mu _2\);

  (ii)

    when \(\sigma _1=\sigma _2\) increases while the means remain fixed and different.

Fig. 1

Let \(\mu <\nu \) and \(\sigma \) be fixed. Then the regions under the curves of the density functions of \(\mathbb {P}_{\mu ,\sigma }\) and \(\mathbb {P}_{\nu ,\sigma }\) have a larger overlapping area when \(\mu ,\nu \) become closer

Fig. 2

Let \(\mu <\nu \) be fixed. Then the regions under the curves of the density functions of \(\mathbb {P}_{\mu ,\sigma }\) and \(\mathbb {P}_{\nu ,\sigma }\) have a larger overlapping area as \(\sigma \) increases

To summarize, we shall study conditions on the reproducing kernel K that make it admissible for the Gaussian distributions in the following sense. Denote by \(\Vert \cdot \Vert \) the standard Euclidean norm on \(\mathbb {R}^d\).

Definition 1.2

Let \(\mathbb {Q}\) be a Borel probability measure on \(\mathbb {R}^d\). A reproducing kernel K on \(\mathbb {R}^d\) is said to be admissible for the class of distributions

$$\begin{aligned} \,\mathrm {d}\mathbb {P}_{\mu ,\sigma }(x)=\frac{1}{\sigma ^d}\,\mathrm {d}\mathbb {Q}\left( \frac{x-\mu }{\sigma }\right) ,~~~~~x\in \mathbb {R}^d,\mu \in \mathbb {R}^d,\sigma >0 \end{aligned}$$

(or simply, \(\mathbb Q\)-admissible), if the following two conditions are satisfied:

(A1):

\(\gamma _K(\mathbb {P}_{\mu _1,\sigma }, \mathbb {P}_{\mu _2,\sigma })\) is strictly decreasing as \(\Vert \mu _1-\mu _2\Vert \) decreases;

(A2):

\(\gamma _K(\mathbb {P}_{\mu _1,\sigma }, \mathbb {P}_{\mu _2,\sigma })\) is strictly decreasing as \(\sigma \) increases, for fixed \(\mu _1\ne \mu _2\).

We present sufficient conditions for a translation-invariant kernel K to be \(\mathbb {Q}\)-admissible in Sect. 2, where we also give the necessary preliminaries on reproducing kernels and RKHS embeddings of probability measures and investigate the concrete example of Gaussian distributions. Section 3 is devoted to non-symmetric multivariate Gaussian distributions. We remark that, as the illustration of our motivation in Figs. 1 and 2 suggests, the notion of admissible kernels introduced in this paper seems useful only for unimodal probability distributions such as the Gaussian distributions. Kernel methods for evaluating the distance between multimodal probability distributions would be an interesting question for future research.

2 Admissible kernels

Let K be a reproducing kernel on \(\mathbb {R}^d\) that is translation-invariant in the sense that

$$\begin{aligned} K(x,y)=K(x-z,y-z) \end{aligned}$$

for all \(x,y,z\in \mathbb {R}^d\). It is easy to see

$$\begin{aligned} K(x,y)=\psi (x-y),\ \ x,y\in \mathbb {R}^d \end{aligned}$$
(3)

for some function \(\psi (x)=K(x,0)\) on \(\mathbb {R}^d\). By the celebrated Bochner theorem (Bochner 1959), if \(\psi \) is continuous on \(\mathbb {R}^d\), then \(K(x,y)=\psi (x-y)\) is a reproducing kernel on \(\mathbb {R}^d\) if and only if there exists a finite positive Borel measure \(\rho \) on \(\mathbb {R}^d\) such that

$$\begin{aligned} \psi (x)=\int _{\mathbb {R}^d}e^{-i x\cdot t}\,\mathrm {d}\rho (t),\ \ x\in \mathbb {R}^d. \end{aligned}$$
(4)

It was shown in Sriperumbudur et al. (2010) that K is a characteristic kernel, that is, \(\gamma _K\) is a metric, if and only if \(\rho \) is supported on all of \(\mathbb {R}^d\). For simplicity, we also assume that \(\rho \) is symmetric about the origin so that \(\psi \) and K are real-valued.

Denote by \(\mathcal {P}(\mathbb {R}^d)\) the set of all Borel probability measures on \(\mathbb {R}^d\). Let \(\mathbb {R}_+:=[0,+\infty )\) and let \(\mathbb {N}\) be the set of positive integers. We shall need the convolution of a bounded continuous function f on \(\mathbb {R}^d\) and a measure \(\rho \in \mathcal {P}(\mathbb {R}^d)\) given as

$$\begin{aligned} (f*\rho )(x):=\int _{\mathbb {R}^d} f(x-y)\,\mathrm {d}\rho (y),~~~x\in \mathbb {R}^d, \end{aligned}$$

and the convolution of two probability measures \(\varrho ,\lambda \in \mathcal {P}(\mathbb {R}^d)\)

$$\begin{aligned} (\varrho *\lambda )(E):=\int _{\mathbb {R}^d}\int _{\mathbb {R}^d}1_E(x+y)\,\mathrm {d}\varrho (x)\,\mathrm {d}\lambda (y). \end{aligned}$$

For a probability distribution \(\mathbb {Q}\), denote

$$\begin{aligned} \bar{\mathbb {Q}} := \mathbb {Q}* \tilde{\mathbb {Q}}, \end{aligned}$$
(5)

where \(\,\mathrm {d}\tilde{\mathbb {Q}}(x):=\,\mathrm {d}\mathbb {Q}(-x)\).

Let K be given by (3) and (4), where \(\rho \) is a finite positive Borel measure on \(\mathbb {R}^d\). Then K is bounded on \(\mathbb {R}^d\times \mathbb {R}^d\). Consequently, the function

$$\begin{aligned} f_\varrho (x):=\int _{\mathbb {R}^d} K(x,t)\,\mathrm {d}\varrho (t),\ \ x\in \mathbb {R}^d \end{aligned}$$

is well-defined for each \(\varrho \in \mathcal {P}(\mathbb {R}^d)\). An important observation made in Sriperumbudur et al. (2010) is that \(f_\varrho \in \mathcal {H}_K\) for all \(\varrho \in \mathcal {P}(\mathbb {R}^d)\) and that

$$\begin{aligned} \gamma _K(\varrho ,\lambda )=\left\| f_\varrho -f_\lambda \right\| _{\mathcal {H}_K},\ \ \varrho ,\lambda \in \mathcal {P}(\mathbb {R}^d). \end{aligned}$$
(6)

Moreover,

$$\begin{aligned} \langle f_\varrho ,g\rangle _{\mathcal {H}_K}=\int _{\mathbb {R}^d} g(x)\,\mathrm {d}\varrho (x),\ \ \varrho \in \mathcal {P}(\mathbb {R}^d),\ g\in \mathcal {H}_K. \end{aligned}$$
(7)

Returning to our main theme, we let \(\mathbb {Q}\in \mathcal {P}(\mathbb {R}^d)\) and define the associated class of probability measures

$$\begin{aligned} \,\mathrm {d}\mathbb {P}_{\mu ,\sigma }(x)=\frac{1}{\sigma ^d}\,\mathrm {d}\mathbb {Q}\left( \frac{x-\mu }{\sigma }\right) ,\ \ x\in \mathbb {R}^d,\ \ \mu \in \mathbb {R}^d,\sigma >0. \end{aligned}$$
(8)

We first give an initial result for the reproducing kernel K given by (3) and (4) to satisfy the two admissibility requirements (A1) and (A2). To this end, we introduce the following definitions.

Definition 2.1

Let f be a function on \(\mathbb {R}^d\). Then it is said to be

  • radial provided that \(f(x)=f(y)\) whenever \(\Vert x\Vert =\Vert y\Vert \);

  • radially decreasing if f is radial and \(f(x)\le f(y)\) whenever \(\Vert x\Vert >\Vert y\Vert \);

  • strictly radially decreasing if f is radial and \(f(x)< f(y)\) whenever \(\Vert x\Vert >\Vert y\Vert \);

  • radially increasing (strictly radially increasing) if \(-f\) is radially decreasing (strictly radially decreasing).

It can be verified that the convolution of two radial functions remains radial. Interested readers are referred to Lieb and Loss (2001) for more properties about radial functions.

Lemma 2.2

Let K be the translation-invariant kernel given by (3) and (4), where \(\rho \) is supported on the whole \(\mathbb {R}^d\). Define the function

$$\begin{aligned} \mathcal {G}_{\psi _\sigma ,\mathbb {Q}}(x):=(\psi _\sigma * \bar{\mathbb {Q}})(x),\ \ x\in \mathbb {R}^d, \end{aligned}$$
(9)

where

$$\begin{aligned} \psi _\sigma (x):=\psi (\sigma x), \ \ x\in \mathbb {R}^d. \end{aligned}$$

Then K is admissible for the class of distributions (8) if and only if \(\mathcal {G}_{\psi _\sigma ,\mathbb {Q}}(x)\) is strictly decreasing as \(\Vert x\Vert \) increases for any fixed \(\sigma >0\), and as \(\sigma \) increases for fixed \(\Vert x\Vert \).

Proof

By (6) and (7), we have for two measures \(\varrho ,\lambda \in \mathcal {P}(\mathbb {R}^d)\)

$$\begin{aligned} \begin{aligned} \big (\gamma _K(\varrho , \lambda )\big )^2&=\left\| f_\varrho -f_\lambda \right\| _{\mathcal {H}_K}^2\\&=\langle f_\varrho ,f_\varrho \rangle _{\mathcal {H}_K}+\langle f_\lambda ,f_\lambda \rangle _{\mathcal {H}_K}-2\langle f_\varrho ,f_\lambda \rangle _{\mathcal {H}_K}\\&=\int _{\mathbb {R}^d}f_\varrho (x)\,\mathrm {d}\varrho (x)+\int _{\mathbb {R}^d}f_\lambda (x)\,\mathrm {d}\lambda (x)-2\int _{\mathbb {R}^d}f_\varrho (x)\,\mathrm {d}\lambda (x)\\&=\int _{\mathbb {R}^d}\int _{\mathbb {R}^d} K(x,y) \,\mathrm {d}\varrho (x) \,\mathrm {d}\varrho (y) + \int _{\mathbb {R}^d}\int _{\mathbb {R}^d} K(x,y) \,\mathrm {d}\lambda (x) \,\mathrm {d}\lambda (y)\\&\quad - 2 \int _{\mathbb {R}^d}\int _{\mathbb {R}^d} K(x,y) \,\mathrm {d}\varrho (x) \,\mathrm {d}\lambda (y). \end{aligned} \end{aligned}$$
(10)

Specifying the distributions \(\mathbb {P}_{\mu _1,\sigma }, \mathbb {P}_{\mu _2,\sigma }\), we compute

$$\begin{aligned} \begin{aligned} \gamma _K^2(\mathbb {P}_{\mu _1,\sigma }, \mathbb {P}_{\mu _2, \sigma })&= \int _{\mathbb {R}^d} \int _{\mathbb {R}^d} \psi (x - y) \,\mathrm {d}\mathbb {P}_{\mu _1,\sigma }(y) \,\mathrm {d}\mathbb {P}_{\mu _1,\sigma }(x) \\&\qquad + \int _{\mathbb {R}^d}\int _{\mathbb {R}^d} \psi (x - y) \,\mathrm {d}\mathbb {P}_{\mu _2, \sigma }(y) \,\mathrm {d}\mathbb {P}_{\mu _2, \sigma }(x) \\&\qquad - 2 \int _{\mathbb {R}^d} \int _{\mathbb {R}^d} \psi (x-y) \,\mathrm {d}\mathbb {P}_{\mu _2,\sigma }(y) \,\mathrm {d}\mathbb {P}_{\mu _1,\sigma }(x) \\&= \int _{\mathbb {R}^d} \int _{\mathbb {R}^d} \psi (\sigma x + \mu _1 - \sigma y - \mu _1) \,\mathrm {d}\mathbb {Q}(y) \,\mathrm {d}\mathbb {Q}(x) \\&\qquad +\int _{\mathbb {R}^d}\int _{\mathbb {R}^d} \psi (\sigma x + \mu _2 - \sigma y - \mu _2) \,\mathrm {d}\mathbb {Q}(y) \,\mathrm {d}\mathbb {Q}(x) \\&\qquad - 2\int _{\mathbb {R}^d}\int _{\mathbb {R}^d} \psi (\sigma x + \mu _1 -\sigma y - \mu _2) \,\mathrm {d}\mathbb {Q}(y) \,\mathrm {d}\mathbb {Q}(x) \\&= 2\int _{\mathbb {R}^d} \int _{\mathbb {R}^d} \left[ \psi \left( \sigma (x - y) \right) -\psi \bigl (\sigma (x - y) + \mu _1 - \mu _2 \bigr )\right] \,\mathrm {d}\mathbb {Q}(y) \,\mathrm {d}\mathbb {Q}(x) \\&= 2\left[ \left( \psi _\sigma * {\bar{\mathbb {Q}}}\right) (0) - (\psi _\sigma * {\bar{\mathbb {Q}}})\left( \frac{\mu _1 - \mu _2}{\sigma }\right) \right] . \end{aligned} \end{aligned}$$

Since \(\rho \) is supported on all of \(\mathbb {R}^d\), \(\gamma _K\) is a metric on \(\mathcal {P}(\mathbb {R}^d)\). Then by the above calculation, we have that

$$\begin{aligned} (\psi _\sigma * \bar{\mathbb {Q}})(t)<(\psi _\sigma * \bar{\mathbb {Q}})(0)~ \text{ for } t\ne 0. \end{aligned}$$

As a result, the requirements (A1) and (A2) are satisfied if and only if \((\psi _\sigma * \bar{\mathbb {Q}})(t)\) is strictly decreasing as \(\Vert t\Vert \) increases for fixed \(\sigma >0\) and as \(\sigma \) increases for fixed \(\Vert t\Vert \). \(\square \)

Specifying a certain class of probability distributions, we are able to show that not all translation-invariant characteristic kernels are admissible for the RKHS embedding of probability distributions.

Example 2.3

Consider dimension \(d=1\). Let \(\mathbb {Q}\) be the Bernoulli distribution with success probability \(\frac{1}{2}\), i.e. \(\mathbb {Q}(0)=\mathbb {Q}(1)=\frac{1}{2}\). Then we have \(\bar{\mathbb {Q}}(-1)=\bar{\mathbb {Q}}(1)=\frac{1}{4}\) and \(\bar{\mathbb {Q}}(0)=\frac{1}{2}\). With \(\psi \) being given by (4), we have

$$\begin{aligned} \mathcal {G}_{\psi _\sigma ,\mathbb {Q}}(t)= (\psi _\sigma * \bar{\mathbb {Q}})(t)=\frac{1}{2}\psi (\sigma t)+\frac{1}{4}\psi (\sigma t-\sigma )+\frac{1}{4}\psi (\sigma t+\sigma ). \end{aligned}$$

Now let \(\psi \) be an even function on \(\mathbb {R}\) that is strictly decreasing on \(\mathbb {R}_+\) and converges to 0 (examples include \(e^{-|t|}\) and \(e^{-t^2}\)). Choose \(\sigma \) large enough that

$$\begin{aligned} \psi \bigl (\frac{\sigma }{2}\bigr ) < \frac{1}{4} \psi (0). \end{aligned}$$

Then

$$\begin{aligned} \begin{aligned} \mathcal {G}_{\psi _\sigma ,\mathbb {Q}}\left( \frac{1}{2}\right)&= \tfrac{1}{4} \psi _\sigma \bigl (-\tfrac{1}{2}\bigr ) + \tfrac{1}{2} \psi _\sigma \bigl (\tfrac{1}{2}\bigr ) + \tfrac{1}{4} \psi _\sigma \bigl (\tfrac{3}{2}\bigr )\\&\le \psi _\sigma \bigl (\tfrac{1}{2}\bigr )\\&< \tfrac{1}{4} \psi _\sigma (0) + \tfrac{1}{2} \psi _\sigma (1) + \tfrac{1}{4} \psi _\sigma (2) = \mathcal {G}_{\psi _\sigma ,\mathbb {Q}}(1). \end{aligned} \end{aligned}$$

Therefore, the function \(\mathcal {G}_{\psi _\sigma ,\mathbb {Q}}\) is not monotonically decreasing for this example. By Lemma 2.2, the kernel K is not admissible.
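The failure of monotonicity can also be observed numerically. The sketch below assumes \(\psi (t)=e^{-t^2}\) and \(\sigma =4\) (so that \(\psi (\sigma /2)=e^{-4}<\frac{1}{4}\psi (0)\)); both choices are ours, for illustration only.

```python
import numpy as np

# G(t) = (psi_sigma * Qbar)(t) for the Bernoulli(1/2) distribution Q,
# with psi(t) = exp(-t^2) and sigma = 4 (illustrative choices).
def G(t, sigma=4.0):
    psi = lambda u: np.exp(-u ** 2)
    return (0.5 * psi(sigma * t)
            + 0.25 * psi(sigma * t - sigma)
            + 0.25 * psi(sigma * t + sigma))
```

One finds \(G(1/2)<G(1)\), confirming that \(\mathcal {G}_{\psi _\sigma ,\mathbb {Q}}\) fails to decrease monotonically in \(|t|\).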

Next, we shall present a sufficient condition and a necessary condition for admissibility, which cover a large class of reproducing kernels and probability distributions.

Theorem 2.4

Let \(\psi \) and \(\bar{\mathbb {Q}}\) be defined by (4) and (5), respectively, with \(\rho \) supported on all of \(\mathbb {R}^d\). Suppose that \(\bar{\mathbb {Q}}\) has a Lebesgue integrable density function f. If both \(\psi \) and f are radially decreasing, with at least one of them strictly radially decreasing, then the kernel K given by (3) is admissible for the class of distributions (8).

Conversely, suppose that f and \(\psi \) are radial and that f is radially decreasing. If the kernel K is admissible, then \(\psi \) is radially decreasing.

Proof

For the first part, by Lemma 2.2, we need to show that the function \(\mathcal {G}_{\psi _\sigma ,\mathbb {Q}}(t)\) defined by (9) is strictly decreasing as \(\Vert t\Vert \) increases and as \(\sigma \) increases.

We only consider the case when \(\sigma \) is fixed; the case when \(\sigma \) varies can be handled similarly. Let \(\delta ,\Delta \) be two points in \(\mathbb {R}^d\) with \(\Vert \delta \Vert <\Vert \Delta \Vert \). As \(\psi _\sigma \) and f are both radial, so is \(\mathcal {G}_{\psi _\sigma ,\mathbb {Q}}=\psi _\sigma *f\). We may hence assume \(\delta =(\delta _1,0,\ldots ,0), \Delta =(\Delta _1,0,\ldots ,0)\), where \(0\le \delta _1<\Delta _1\). Also, we may further assume that \(\psi \) is the one that is strictly radially decreasing.

Set \(H_{t}:=\{x\in \mathbb {R}^d:x_1< t\}\), \(t\in \mathbb {R}\). We first write

$$\begin{aligned} \begin{aligned}&\mathcal {G}_{\psi _\sigma ,\mathbb {Q}}(\delta )-\mathcal {G}_{\psi _\sigma ,\mathbb {Q}}(\Delta )=(\psi _\sigma * f)(\delta )-(\psi _\sigma * f)(\Delta )\\&\quad =\int _{\mathbb {R}^d}\bigl [\psi _\sigma (\delta - x)-\psi _\sigma (\Delta -x)\bigr ]f(x)\,\mathrm {d}x \\&\quad =\int _{H_\frac{\delta _1+\Delta _1}{2}}\bigl [\psi _\sigma (\delta - x)-\psi _\sigma (\Delta -x)\bigr ]f(x)\,\mathrm {d}x\\&\qquad -\int _{\mathbb {R}^d{\setminus } H_\frac{\delta _1+\Delta _1}{2}}\bigl [\psi _\sigma (\Delta -x)-\psi _\sigma (\delta -x)\bigr ]f(x)\,\mathrm {d}x. \end{aligned} \end{aligned}$$

We then apply the change of variables \(x\mapsto \delta +\Delta -x\) to the second integral above and use the radiality of \(\psi \) and f to get

$$\begin{aligned}&\mathcal {G}_{\psi _\sigma ,\mathbb {Q}}(\delta )-\mathcal {G}_{\psi _\sigma ,\mathbb {Q}}(\Delta )\nonumber \\&\quad =\int _{H_\frac{\delta _1+\Delta _1}{2}}\bigl [\psi _\sigma (\delta - x)-\psi _\sigma (\Delta -x)\bigr ]\bigl [f(x)-f(x-\delta -\Delta )\bigr ]\,\mathrm {d}x. \end{aligned}$$
(11)

Note that for \(x\in H_\frac{\delta _1+\Delta _1}{2}\),

$$\begin{aligned} \Vert \delta - x\Vert<\Vert \Delta -x\Vert \ \text{ and } \ \Vert x\Vert <\Vert x-\delta -\Delta \Vert . \end{aligned}$$

Since \(\psi _\sigma \) is strictly radially decreasing and f is radially decreasing,

$$\begin{aligned} \psi _\sigma (\delta - x)-\psi _\sigma (\Delta -x)>0 \text{ and } f(x)-f(x-\delta -\Delta )\ge 0,\ \ x\in H_\frac{\delta _1+\Delta _1}{2}. \end{aligned}$$

As a consequence, we get by (11) that \(\mathcal {G}_{\psi _\sigma ,\mathbb {Q}}(\delta )-\mathcal {G}_{\psi _\sigma ,\mathbb {Q}}(\Delta )\ge 0\).

To obtain \(\mathcal {G}_{\psi _\sigma ,\mathbb {Q}}(\delta )-\mathcal {G}_{\psi _\sigma ,\mathbb {Q}}(\Delta )>0\), it remains to show that the set \(E=\{x\in H_{\frac{\delta _1+\Delta _1}{2}}:f(x)\ne f(x-\delta -\Delta )\}\) has positive Lebesgue measure in \(\mathbb {R}^d\). Assume on the contrary that the Lebesgue measure of E equals 0. Then f is periodic with period \(\delta +\Delta \) on \(H_{\frac{\delta _1+\Delta _1}{2}}{\setminus } E\). Since f is radially decreasing, it must be constant on \(H_{\frac{\delta _1+\Delta _1}{2}}{\setminus } E\). This, together with the assumption that f is radial, implies that f equals a constant almost everywhere on \(\mathbb {R}^d\). It is hence impossible for f to be the density function of a probability distribution, which is a contradiction.

For the second part, it suffices to prove that if f is radially decreasing and the kernel K is admissible, then \(\psi \) is radially decreasing. Since f is the density function of \(\bar{\mathbb {Q}}\), we have \(\int _{\mathbb {R}^d} f(x)\,\mathrm {d}x=1\). Define

$$\begin{aligned} \begin{aligned} \mathcal {G}_{\psi _\sigma ,\mathbb {Q}}(x)&=\int _{\mathbb {R}^d}\psi _\sigma (x-y)f(y)\,\mathrm {d}y \\&=\frac{1}{\sigma ^d}\int _{\mathbb {R}^d}\psi (\sigma x-y)f\left( \frac{y}{\sigma }\right) \,\mathrm {d}y, \end{aligned} \end{aligned}$$

where \(\psi \) is a bounded and continuous function on \(\mathbb {R}^d\). The density function \(\frac{1}{\sigma ^d}f(\frac{y}{\sigma })\) converges weakly to the Dirac measure \(\delta _0\) as \(\sigma \rightarrow 0\). Hence \(\mathcal {G}_{\psi _\sigma ,\mathbb {Q}}(x/\sigma )\rightarrow \psi (x)\) as \(\sigma \rightarrow 0\). Now assume on the contrary that \(\psi \) is not radially decreasing. Then there exist two points \(x_0,y_0\) such that \(\Vert x_0\Vert \le \Vert y_0\Vert \) and \(\psi (x_0)< \psi (y_0)\). Let \(\varepsilon \) be a real number such that \(0<\varepsilon <\psi (y_0)-\psi (x_0)\). Then by the continuity of \(\psi \) and the convergence \(\mathcal {G}_{\psi _\sigma ,\mathbb {Q}}(x/\sigma )\rightarrow \psi (x)\), there exists \(\sigma \) small enough such that

$$\begin{aligned} \mathcal {G}_{\psi _\sigma ,\mathbb {Q}}(x_0/\sigma )\le \psi (x_0)+\varepsilon /2<\psi (y_0)-\varepsilon /2\le \mathcal {G}_{\psi _\sigma ,\mathbb {Q}}(y_0/\sigma ). \end{aligned}$$

Since \(\Vert x_0/\sigma \Vert \le \Vert y_0/\sigma \Vert \), Lemma 2.2 implies that K is not admissible. Hence, if K is admissible, then \(\psi \) must be radially decreasing. \(\square \)

Note that a kernel satisfying the assumptions in Theorem 2.4 is of the form

$$\begin{aligned} K(x,y)=\psi (x-y)=\phi (\Vert x-y\Vert ),~~~~x,y\in \mathbb {R}^d, \end{aligned}$$
(12)

where \(\phi \) is decreasing on \(\mathbb {R}_{+}\). Kernels of the above form are called radial basis functions (Wendland 2005; Wu 1995). Let \(\phi \) be a function on \(\mathbb {R}_+\) such that \(\phi (\Vert x-y\Vert )\) is a reproducing kernel on \(\mathbb {R}^d\). It is quite natural to ask when \(\phi \) is strictly decreasing. A fundamental result on radial basis functions due to Schoenberg (1938) states that \(\phi (\Vert x-y\Vert )\) is a reproducing kernel on \(\mathbb {R}^d\) for all dimensions \(d\in \mathbb {N}\) if and only if there is a finite positive Borel measure \(\mu \) on \(\mathbb {R}_{+}\) such that

$$\begin{aligned} \phi (t)=\int _0^{+\infty }\exp (-st^2)\,\mathrm {d}\mu (s),~~~t\in \mathbb {R}_+. \end{aligned}$$

In this case, \(\phi \) is automatically decreasing on \(\mathbb {R}_+\) and is strictly decreasing as long as \(\,\mathrm{supp}\,\mu \ne \{0\}\), which is equivalent to saying that the radial kernel K in (12) is characteristic (Sriperumbudur et al. 2011, Proposition 5). In conclusion, if K is a nontrivial reproducing kernel on \(\mathbb {R}^d\) for all \(d\in \mathbb {N}\), and hence a characteristic kernel, we have by Theorem 2.4 that K is \(\mathbb {Q}\)-admissible provided that the density function f of \(\mathbb {Q}\) is radially decreasing.

Things are different if \(\phi (\Vert x-y\Vert )\) is a kernel only in certain dimensions. We present an explicit example to illustrate this. It was proved in Schoenberg (1938) that for a fixed dimension d, \(\phi (\Vert x-y\Vert )\) is a kernel on \(\mathbb {R}^d\) if and only if

$$\begin{aligned} \phi (t)=\int _0^{+\infty }\Omega _d(ts)\,\mathrm {d}\mu (s),~~~t\ge 0, \end{aligned}$$

where \(\mu \) is a finite positive Borel measure on \(\mathbb {R}_+\) and

$$\begin{aligned} \Omega _d(r):=\frac{\int _0^\pi e^{ir\cos \theta }\sin ^{d-2}\theta \,\mathrm {d}\theta }{\int _0^\pi \sin ^{d-2}\theta \,\mathrm {d}\theta },~~~d\ge 2,~r\ge 0. \end{aligned}$$

Setting \(d=3\) leads to

$$\begin{aligned} \Omega _3(r)=\frac{\sin r}{r}. \end{aligned}$$

We then choose the measure \(\mu \) such that \(\,\mathrm{supp}\,\mu =[0,2\pi ]\) and \(\,\mathrm {d}\mu (s)=s\,\mathrm {d}s\) for \(s\in [0,2\pi ]\). The resulting function \(\phi \) is

$$\begin{aligned} \phi (t)=\int _0^{2\pi }\frac{\sin st}{st}\,\mathrm {d}\mu (s)=\frac{1-\cos 2\pi t}{t^2},\ \ t\ge 0, \end{aligned}$$

which is not decreasing since \(\phi (1)=0\) while \(\phi (\frac{3}{2})=\frac{8}{9}\). Therefore, by Theorem 2.4, \(K(x,y)=\phi (\Vert x-y\Vert )\) is not admissible for any class of radially decreasing distributions. In particular, it is not admissible for the Gaussian distributions.
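This non-monotonicity is easy to check numerically; the sketch below compares a trapezoidal quadrature of the Schoenberg representation with the closed form (function names are ours, for illustration only).

```python
import numpy as np

# phi(t) = int_0^{2 pi} sin(s t)/(s t) * s ds, evaluated two ways.
def phi_quad(t, n=200001):
    s = np.linspace(0.0, 2.0 * np.pi, n)
    vals = np.sin(s * t) / t          # sin(st)/(st) * s simplifies to sin(st)/t
    ds = s[1] - s[0]
    return (vals.sum() - 0.5 * (vals[0] + vals[-1])) * ds  # trapezoid rule

def phi_closed(t):
    return (1.0 - np.cos(2.0 * np.pi * t)) / t ** 2
```

Both evaluations give \(\phi (1)=0\) and \(\phi (3/2)=8/9\), so \(\phi \) indeed fails to be decreasing.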

Nevertheless, there exists a large class of compactly supported decreasing \(\phi \) so that (12) defines a kernel that satisfies the conditions in Theorem 2.4. These are the compactly supported radial basis functions of minimal degree constructed in Wendland (2005) and Wu (1995). Examples for dimension \(d=3\) include

$$\begin{aligned} \phi (r):=(1-r)_+^3 \text{ and } \phi (r):=(1-r)_+^4(1+4r),\ \ r\in \mathbb {R}_+, \end{aligned}$$

where \((1-r)_+:=\max \{0,1-r\}\). More examples are available in Wendland (2005, Chapter 9).
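A quick numerical check (the profile names `phi1` and `phi2` are ours) confirms that both quoted profiles are decreasing on \(\mathbb {R}_+\):

```python
import numpy as np

# The two d = 3 profiles quoted above: (1 - r)_+^3 and (1 - r)_+^4 (1 + 4 r).
def phi1(r):
    return np.clip(1.0 - r, 0.0, None) ** 3

def phi2(r):
    return np.clip(1.0 - r, 0.0, None) ** 4 * (1.0 + 4.0 * r)
```

Both equal 1 at \(r=0\), decrease to 0 at \(r=1\), and vanish beyond; for the second profile, the derivative on [0, 1] is \(-20r(1-r)^3\le 0\).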

Going back to our main objective, we shall use Theorem 2.4 to establish admissibility for the RKHS embedding of Gaussian distributions.

Theorem 2.5

Let K be a reproducing kernel on \(\mathbb {R}^d\) of the form \(K(x,y)=\psi (x-y)=\phi (\Vert x-y\Vert )\), where \(\phi \) is decreasing on \(\mathbb {R}_+\) and the measure \(\rho \) in (4) is supported on all of \(\mathbb {R}^d\). Then K is admissible for the class of Gaussian distributions

$$\begin{aligned} \,\mathrm {d}\mathbb {P}_{\mu ,\sigma }(x)=\left( \frac{1}{\sqrt{2\pi }\sigma }\right) ^d\exp \left( -\frac{\Vert x-\mu \Vert ^2}{2\sigma ^2}\right) \,\mathrm {d}x,\ \ x\in \mathbb {R}^d,\ \ \mu \in \mathbb {R}^d,\sigma >0 \end{aligned}$$
(13)

and for the class of generalized Gaussian distributions

$$\begin{aligned}&\,\mathrm {d}E_{\mu ,\sigma }(x)=\left( \frac{1}{\sqrt{2\pi }\sigma }\right) ^dc_\omega \int _0^\infty \exp \left( -\frac{\Vert x-\mu \Vert ^2}{2\tau \sigma ^2}\right) \,\mathrm {d}\omega (\tau )\,\mathrm {d}x,\nonumber \\&\quad \qquad \ \ x\in \mathbb {R}^d,\ \ \mu \in \mathbb {R}^d,\sigma >0 \end{aligned}$$
(14)

where \(\omega \) is a nontrivial finite positive Borel measure on \(\mathbb {R}_+\) with \(\,\mathrm{supp}\,\omega \ne \{0\}\), and \(c_\omega \) is a positive constant such that

$$\begin{aligned} c_\omega \int _0^\infty \tau ^{\frac{d}{2}}\,\mathrm {d}\omega (\tau )=1. \end{aligned}$$

Proof

It suffices to verify that the conditions in Theorem 2.4 are satisfied. Firstly, \(K(x,y)=\psi (x-y)\), where \(\psi (x)=\phi (\Vert x\Vert )\) is radial and radially decreasing. For the Gaussian distributions, we see that

$$\begin{aligned} \,\mathrm {d}\mathbb {P}_{\mu ,\sigma }(x)=\frac{1}{\sigma ^d}\,\mathrm {d}\mathbb {Q}\left( \frac{x-\mu }{\sigma }\right) \end{aligned}$$

where \(\mathbb {Q}\) has the density function

$$\begin{aligned} g(x):=\left( \frac{1}{\sqrt{2\pi }}\right) ^d \exp \left( -\frac{\Vert x\Vert ^2}{2}\right) ,\ \ x\in \mathbb {R}^d. \end{aligned}$$

Thus, the density function for \(\bar{\mathbb {Q}}=\mathbb {Q}* \tilde{\mathbb {Q}}\) is

$$\begin{aligned} f(x):=(g*g)(x):=\left( \frac{1}{2\sqrt{\pi }}\right) ^d\exp \left( -\frac{\Vert x\Vert ^2}{4}\right) ,\ \ x\in \mathbb {R}^d, \end{aligned}$$

which is radial and strictly radially decreasing. Therefore K is admissible for the Gaussian distributions (13).

For the generalized Gaussian distributions (14), the density function of \(\bar{\mathbb {Q}}\) is

$$\begin{aligned} f(x):=\left( \frac{1}{\sqrt{2\pi }}\right) ^d c_\omega ^2\int _0^\infty \int _0^\infty \left( \frac{ts}{t+s}\right) ^{\frac{d}{2}}\exp \left( -\frac{ \Vert x \Vert ^2}{2(t+s)}\right) \,\mathrm {d}\omega (s)\,\mathrm {d}\omega (t),\ \ x\in \mathbb {R}^d, \end{aligned}$$

which is also radial and strictly radially decreasing. \(\square \)
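The convolution identity for the standard normal density used in the proof can be spot-checked numerically in dimension one (a sketch; function names are ours):

```python
import numpy as np

# d = 1: the convolution g * g of the standard normal density with itself
# should equal (1 / (2 sqrt(pi))) exp(-x^2 / 4), the N(0, 2) density.
def g(x):
    return np.exp(-x ** 2 / 2.0) / np.sqrt(2.0 * np.pi)

def g_conv_g(x, half_width=10.0, n=100001):
    # Trapezoid rule for (g * g)(x) = int g(x - y) g(y) dy.
    y = np.linspace(-half_width, half_width, n)
    vals = g(x - y) * g(y)
    dy = y[1] - y[0]
    return (vals.sum() - 0.5 * (vals[0] + vals[-1])) * dy

def f_closed(x):
    return np.exp(-x ** 2 / 4.0) / (2.0 * np.sqrt(np.pi))
```

The two evaluations agree to high accuracy on a range of test points.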

Generalized Gaussian distributions have found many applications in image processing (Mallat 1989; Moulin and Liu 1999) and in engineering (Miller and Thomas 1972; Beaulieu and Young 2009). Classical probability density functions for generalized Gaussian distributions are of the form

$$\begin{aligned} f_p(x)=\frac{c_p}{\sigma ^d} \exp \left( -\frac{\Vert x-\mu \Vert ^p}{2\sigma ^p}\right) , ~~~x\in \mathbb {R}^d, \end{aligned}$$

where \(p>0\) and \(c_p\) is the constant that makes \(f_p\) a density function. The existence of the measure \(\omega \) can be guaranteed by theoretical results in Bochner (1937).

Specifying the measure \(\omega \), we have the following examples.

Example 2.6

Let the Borel measure in (14) be \(\,\mathrm {d}\omega (\tau )=\,\mathrm {d}\tau /\tau ^2,~\tau \in [1,2]\). Then the corresponding generalized Gaussian distribution is

$$\begin{aligned} \,\mathrm {d}E_{\mu ,\sigma }(x)&=\frac{2c_\omega }{(2\pi )^{d/2}\sigma ^{d-2}\Vert x-\mu \Vert ^2} \left( \exp \left( -\frac{\Vert x-\mu \Vert ^2}{4\sigma ^2}\right) \right. \\&\quad \left. -\exp \left( -\frac{\Vert x-\mu \Vert ^2}{2\sigma ^2}\right) \right) \,\mathrm {d}x,~~~x\in \mathbb {R}^d. \end{aligned}$$
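The closed form in this example rests on the identity \(\int _1^2 e^{-r^2/(2\tau \sigma ^2)}\tau ^{-2}\,\mathrm {d}\tau =\frac{2\sigma ^2}{r^2}\big (e^{-r^2/(4\sigma ^2)}-e^{-r^2/(2\sigma ^2)}\big )\), obtained by the substitution \(u=1/\tau \). A quick numerical sketch (the test values of the radial variable r and the choice \(\sigma =1\) are arbitrary):

```python
import numpy as np

def trapezoid(y, x):
    # Composite trapezoidal rule
    return float(np.sum((y[1:] + y[:-1]) * np.diff(x)) / 2)

def lhs(r, sigma=1.0, n=100001):
    # Integral of exp(-r^2/(2 tau sigma^2)) / tau^2 over tau in [1, 2]
    tau = np.linspace(1.0, 2.0, n)
    return trapezoid(np.exp(-r**2 / (2 * tau * sigma**2)) / tau**2, tau)

def rhs(r, sigma=1.0):
    # Closed form obtained from the substitution u = 1/tau
    return (2 * sigma**2 / r**2) * (np.exp(-r**2 / (4 * sigma**2))
                                    - np.exp(-r**2 / (2 * sigma**2)))

for r in [0.5, 1.0, 2.0, 4.0]:
    assert abs(lhs(r) - rhs(r)) < 1e-8
```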

Let the Borel measure \(\omega =\sum _{i=1}^m \alpha _i\delta _{\tau _i}\), where \(\alpha _i\ge 0\) and \(\delta _{\tau _i}\) is the Dirac measure at the point \(\tau _i>0\). Then the corresponding generalized Gaussian distribution is simply a linear combination of Gaussian distributions with the same expectation. That is,

$$\begin{aligned} \,\mathrm {d}E_{\mu ,\sigma }(x)=\left( \frac{1}{\sqrt{2\pi }\sigma }\right) ^dc_\omega \sum _{i=1}^m \alpha _i \exp \left( -\frac{\Vert x-\mu \Vert ^2}{2\tau _i\sigma ^2}\right) \,\mathrm {d}x,~~~x\in \mathbb {R}^d. \end{aligned}$$

In particular, the Wendland functions (Wendland 2005) and the Gaussian kernels are admissible for the Gaussian distributions. We remark that the latter observation can also be made from direct computation as done in Sriperumbudur et al. (2009), where it was shown that for two Gaussian distributions \(\mathbb {P}_{\mu ,\sigma },\mathbb {P}_{\nu ,\theta }\) and for the Gaussian kernel

$$\begin{aligned} K(x,y)=\exp \left( -\frac{\Vert x-y\Vert ^2}{2\tau ^2}\right) ,\ \ x,y\in \mathbb {R}^d, \tau > 0, \end{aligned}$$

it holds that

$$\begin{aligned} \begin{aligned} \gamma _K^2(\mathbb {P}_{\mu ,\sigma },\mathbb {P}_{\nu ,\theta })&=\left( \frac{\tau }{\sqrt{2\sigma ^2+\tau ^2}}\right) ^d\\&\quad +\left( \frac{\tau }{\sqrt{2\theta ^2+\tau ^2}}\right) ^d-2\prod _{i=1}^d\frac{\tau \exp \left( -\frac{(\mu _i-\nu _i)^2}{2(\sigma ^2 +\theta ^2+\tau ^2)}\right) }{\sqrt{\sigma ^2+\theta ^2+\tau ^2}}. \end{aligned} \end{aligned}$$

Thus, when \(\sigma =\theta \),

$$\begin{aligned} \gamma ^2_K(\mathbb {P}_{\mu ,\sigma },\mathbb {P}_{\nu ,\sigma })=2\left( \frac{\tau }{\sqrt{2\sigma ^2+\tau ^2}}\right) ^d \left( 1-\exp \left( -\frac{\Vert \mu -\nu \Vert ^2}{2(2\sigma ^2+\tau ^2)}\right) \right) . \end{aligned}$$
(15)

Clearly, the two admissibility requirements are satisfied.
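Formula (15) can be sanity-checked numerically. Since the Gaussian kernel depends only on \(x-y\), each double integral defining \(\gamma _K^2\) equals \(\mathbb {E}\exp (-Z^2/(2\tau ^2))\) in one dimension, where \(Z\sim N(\mu -\nu ,2\sigma ^2)\) for the cross term and \(Z\sim N(0,2\sigma ^2)\) for the two same-distribution terms. The following sketch compares quadrature against (15) with \(d=1\) (the parameter values are arbitrary test choices):

```python
import numpy as np

def trapezoid(y, x):
    # Composite trapezoidal rule
    return float(np.sum((y[1:] + y[:-1]) * np.diff(x)) / 2)

def expect_kernel(m, s2, tau, n=200001, width=12.0):
    # E exp(-Z^2 / (2 tau^2)) for Z ~ N(m, s2), by one-dimensional quadrature
    s = np.sqrt(s2)
    z = np.linspace(m - width * s, m + width * s, n)
    dens = np.exp(-(z - m)**2 / (2 * s2)) / np.sqrt(2 * np.pi * s2)
    return trapezoid(np.exp(-z**2 / (2 * tau**2)) * dens, z)

def gamma_sq_15(mu, nu, sigma, tau):
    # Formula (15) with d = 1
    c = tau / np.sqrt(2 * sigma**2 + tau**2)
    return 2 * c * (1 - np.exp(-(mu - nu)**2 / (2 * (2 * sigma**2 + tau**2))))

mu, nu, sigma, tau = 0.0, 1.0, 0.8, 1.0
# X - X' and Y - Y' are N(0, 2 sigma^2); X - Y is N(mu - nu, 2 sigma^2)
quad = 2 * expect_kernel(0.0, 2 * sigma**2, tau) \
     - 2 * expect_kernel(mu - nu, 2 * sigma**2, tau)
assert abs(quad - gamma_sq_15(mu, nu, sigma, tau)) < 1e-6
```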

3 Non-radial Gaussian distributions

In this section, we study the similarity of two multivariate non-radial Gaussian distributions under RKHS embedding. Such distributions appear widely in probability and statistics. They are of the general form

$$\begin{aligned} \,\mathrm {d}\mathbb {P}_{\mu ,\Sigma }(x)=\left( \frac{1}{\sqrt{2\pi }}\right) ^d(\det \Sigma )^{-\frac{1}{2}} \exp \left( -\frac{1}{2}(x-\mu )^T\Sigma ^{-1}(x-\mu )\right) \,\mathrm {d}x, \ \ x\in {\mathbb {R}}^d \end{aligned}$$
(16)

where \(\mu \in \mathbb {R}^d\) and \(\Sigma \) is a symmetric positive-definite \(d\times d\) matrix. For our purpose, we need to introduce the definition of multivariate monotonic functions from Engelking (1989).

Definition 3.1

Let h be a function from \(\mathbb {R}^n\) to \(\mathbb {R}^k\). We say that h is monotonic provided that for any \(y\in \mathbb {R}^k\), \(h^{-1}(y)\) is connected in \(\mathbb {R}^n\).

Recall that a set C in \(\mathbb {R}^n\) is connected if there do not exist two disjoint open subsets \(U,V\subseteq \mathbb {R}^n\) such that \(C\subseteq U\cup V\) and both \(C\cap U\) and \(C\cap V\) are nonempty.

Obviously, a constant function f on \(\mathbb {R}^n\) is monotonic, as for every \(c\in \mathbb {R}\), \(f^{-1}(c)\) is either empty or the entire \(\mathbb {R}^n\). We give some other nontrivial examples of multivariate monotonic functions to help illustrate this definition.

We first point out that the above seemingly abstract definition coincides with the ordinary one for continuous univariate monotonic functions.

Example 3.2

Let f be a continuous function on \(\mathbb {R}\). If f is monotonic in the ordinary sense, then it is easy to see that it satisfies Definition 3.1. Conversely, assume that f is monotonic according to Definition 3.1, that is, for each \(c\in \mathbb {R}\), \(f^{-1}(c)\) is connected in \(\mathbb {R}\). One shows by the intermediate value theorem for continuous functions that f must be monotonic in the ordinary sense.

Example 3.3

A linear function \(f:\mathbb {R}^n\rightarrow \mathbb {R}\) defined by \(f(x):=a_1x_1+a_2x_2+\cdots +a_nx_n\), where the \(a_i\in \mathbb {R}\) are not all zero, is a monotonic function. This is because for all \(y\in \mathbb {R}\), \(f^{-1}(y)=\{x\in \mathbb {R}^n:\sum _{i=1}^n a_ix_i=y\}\) is a hyperplane in \(\mathbb {R}^n\), which is connected in \(\mathbb {R}^n\).

The following lemma provides an example that will appear in our discussion.

Lemma 3.4

Let \(f:\mathbb {R}^n\rightarrow \mathbb {R}\) be defined by \(f(x)=\exp (\alpha _1 x_1^2+ \alpha _2 x_2^2+\cdots +\alpha _n x_n^2)\), where the \(\alpha _i\) are either all negative or all positive. Then f is a monotonic function.

Proof

Without loss of generality, we assume \(\alpha _1, \alpha _2,\dots ,\alpha _n> 0\). Then \(f(x)\ge 1\) and \(f^{-1}(1)=\{0\}\), which is a connected set. For \(c> 1\), the set \(f^{-1}(c)=\{(x_1,x_2,\dots ,x_n):\alpha _1 x_1^2+\alpha _2 x_2^2+\cdots +\alpha _nx_n^2=\ln c\}\) is an ellipsoid in \(\mathbb {R}^n\), which is connected. Therefore, we have by Definition 3.1 that f is a monotonic function. \(\square \)

Before proceeding to the next lemma, we need two classical results from van Mill (1989).

  1. A continuous injective function f from a compact subset of \(\mathbb {R}^n\) to \(\mathbb {R}^n\) is a homeomorphism onto its image (van Mill 1989, Exercise 1.1.4).

  2. The Brouwer Invariance of Domain Theorem: if two sets \(X,Y\subseteq \mathbb {R}^n\) are homeomorphic, then so are their interiors (van Mill 1989, Theorem 4.6.7, Corollary 4.6.6).

Lemma 3.5

Let \(F:\mathbb {R}^n\rightarrow \mathbb {R}\) be a continuous monotonic function and \(G:\mathbb {R}^n\rightarrow \mathbb {R}^n\) be a continuous injective mapping. Then their composition \(F\circ G:\mathbb {R}^n\rightarrow \mathbb {R}\) is monotonic.

Proof

It is easy to see that \(F\circ G\) is a continuous function. By definition, we have to show that for every \(c\in \mathbb {R}\), the set \((F\circ G)^{-1}(c)\) is connected in \(\mathbb {R}^n.\) Since F is monotonic, we know that \(F^{-1}(c)\) is a connected set in \(\mathbb {R}^n\). By the fact that the continuous image of a connected set is still connected, the set \((F\circ G)^{-1}(c)=G^{-1}(F^{-1}(c))\) is connected provided that \(G^{-1}\) is continuous.

We then prove that if G is continuous and injective, then its inverse \(G^{-1}\) is a continuous function from \(G(\mathbb {R}^n)\) to \(\mathbb {R}^n\). For any sequence \(\{y_k:k=1,2,\ldots \}\subseteq G(\mathbb {R}^n)\) that converges to \(y_0\in G(\mathbb {R}^n)\), denote their preimages by \(\{x_k:k=1,2,\ldots \}\) and \(x_0\), respectively. Let \(B_r\) be a closed ball with radius r centered at \(x_0\), which is clearly compact. By result 1 above, \(B_r\) is homeomorphic to \(G(B_r)\). Then we have by the Brouwer Invariance of Domain Theorem that \(\mathrm{int}\,B_r\) and \(\mathrm{int}\,G(B_r)\) are homeomorphic. Since \(y_0=G(x_0)\in \mathrm{int}\,G(B_r)\), \(y_k\) must be located in \(\mathrm{int}\,G(B_r)\) for sufficiently large k. Again by the injectivity of G, \(x_k\) must be contained in \(B_r\) for all sufficiently large k. Since the radius r is arbitrary, \(x_k\) converges to \(x_0\); namely, \(G^{-1}\) is continuous. \(\square \)

We are ready to present the main result of this section about the similarity of two general d-dimensional Gaussian distributions.

Theorem 3.6

The Gaussian reproducing kernel \(K(x,y)=\exp (-\frac{\Vert x-y\Vert _2^2}{2\tau ^2}),~\tau >0\) is admissible for the class of d-dimensional Gaussian distributions given by (16) in the sense that

(A1’):

As a function of \(\mu ,\nu \), \(\gamma _K(\mathbb {P}_{\mu ,\Sigma },\mathbb {P}_{\nu ,\Sigma })\) decreases monotonically to 0 as \(\mu \) tends to \(\nu \);

(A2’):

As a function of the eigenvalues of \(\Sigma ^{-1}\), \(\gamma _K(\mathbb {P}_{\mu ,\Sigma },\mathbb {P}_{\nu ,\Sigma })\) decreases monotonically to 0 as \(\det \Sigma \) tends to infinity.

Proof

Let \(\mathbb {P}_1,~\mathbb {P}_2\) be two d-dimensional Gaussian distributions given by

$$\begin{aligned} \,\mathrm {d}\mathbb {P}_1(x)&=\frac{1}{(2\pi )^{\frac{d}{2}}(\det \Sigma _1)^{\frac{1}{2}}}\exp \left( -\frac{1}{2}(x-\mu )^T\Sigma _1^{-1}(x-\mu )\right) \,\mathrm {d}x,\\ \,\mathrm {d}\mathbb {P}_2(x)&=\frac{1}{(2\pi )^{\frac{d}{2}}(\det \Sigma _2)^{\frac{1}{2}}}\exp \left( -\frac{1}{2}(x-\nu )^T\Sigma _2^{-1}(x-\nu )\right) \,\mathrm {d}x, \end{aligned}$$

where \(\Sigma _1,\Sigma _2\) are two positive-definite matrices.

If \(\Sigma _1\Sigma _2=\Sigma _2\Sigma _1\) then there exists an orthogonal \(d\times d\) matrix B such that

$$\begin{aligned} \Lambda _1=B^T\Sigma _1^{-1}B=\mathrm {diag}(\alpha _1,\alpha _2,\ldots ,\alpha _d),\quad \Lambda _2=B^T\Sigma _2^{-1}B=\mathrm {diag}(\beta _1,\beta _2,\ldots ,\beta _d), \end{aligned}$$

where \(\alpha _i,\beta _i,i=1,2,\ldots ,d\) are all positive. Then

$$\begin{aligned} \begin{aligned}&\iint _{\mathbb {R}^d\times \mathbb {R}^d} K(x,y) \,\mathrm {d}\mathbb {P}_1(x) \,\mathrm {d}\mathbb {P}_1(y)\\&\quad = \frac{1}{(2\pi )^d\det \Sigma _1}\iint _{\mathbb {R}^d\times \mathbb {R}^d} K(x,y)\\&\qquad \exp \left\{ -\frac{1}{2}\left[ (x-\mu )^T\Sigma _1^{-1}(x-\mu )+(y-\mu )^T\Sigma _1^{-1}(y-\mu )\right] \right\} \,\mathrm {d}x \,\mathrm {d}y\\&\quad \overset{(\spadesuit )}{=}\frac{1}{(2\pi )^d\det \Sigma _1}\iint _{\mathbb {R}^d\times \mathbb {R}^d}\\&\qquad \exp \left( -\frac{\Vert x-y\Vert _2 ^2}{2\tau ^2}\right) \exp \left\{ -\frac{1}{2}(x^T\Sigma _1^{-1}x+y^T\Sigma _1^{-1}y)\right\} \,\mathrm {d}x \,\mathrm {d}y\\&\quad \overset{(\heartsuit )}{=}\frac{1}{(2\pi )^d\det \Sigma _1}\iint _{\mathbb {R}^d\times \mathbb {R}^d} \exp \left( -\frac{\Vert x-y\Vert _2 ^2}{2\tau ^2}\right) \\&\qquad \exp \left\{ -\frac{1}{2}(x^T\Lambda _1x+y^T\Lambda _1y)\right\} \,\mathrm {d}x \,\mathrm {d}y\\&\quad =\frac{1}{(\det \Sigma _1)^{\frac{1}{2}}}\prod _{i=1}^d\frac{\tau }{\sqrt{2+\alpha _i\tau ^2}}, \end{aligned} \end{aligned}$$

where the equality \((\spadesuit )\) holds since the kernel \(K(x,y)\) is translation invariant, and \((\heartsuit )\) follows from replacing x and y with Bx and By, respectively.

Similarly, we have

$$\begin{aligned} \iint _{\mathbb {R}^d\times \mathbb {R}^d} K(x,y) \,\mathrm {d}\mathbb {P}_2(x) \,\mathrm {d}\mathbb {P}_2(y)=\frac{1}{(\det \Sigma _2)^{\frac{1}{2}}}\prod _{i=1}^d\frac{\tau }{\sqrt{2+\beta _i\tau ^2}} \end{aligned}$$

and

$$\begin{aligned} \iint _{\mathbb {R}^d\times \mathbb {R}^d} K(x,y) \,\mathrm {d}\mathbb {P}_1(x) \,\mathrm {d}\mathbb {P}_2(y)=\frac{1}{(\det (\Sigma _1\Sigma _2))^{\frac{1}{2}}}\prod _{i=1}^d\frac{\tau e^{-\frac{\alpha _i\beta _i(\mu _i-\nu _i)^2}{2(\alpha _i+\beta _i+\alpha _i\beta _i\tau ^2)}}}{\sqrt{\alpha _i+\beta _i+\alpha _i\beta _i\tau ^2}}. \end{aligned}$$

As a result,

$$\begin{aligned} \begin{aligned} \gamma _K^2(\mathbb {P}_1,\mathbb {P}_2)&=\frac{1}{(\det \Sigma _1)^{\frac{1}{2}}}\prod _{i=1}^d\frac{\tau }{\sqrt{2+\alpha _i\tau ^2}}+ \frac{1}{(\det \Sigma _2)^{\frac{1}{2}}}\prod _{i=1}^d\frac{\tau }{\sqrt{2+\beta _i\tau ^2}}\\&\quad -\frac{2}{(\det (\Sigma _1\Sigma _2))^{\frac{1}{2}}}\prod _{i=1}^d\frac{\tau e^{-\frac{\alpha _i\beta _i(\mu _i-\nu _i)^2}{2(\alpha _i+\beta _i+\alpha _i\beta _i\tau ^2)}}}{\sqrt{\alpha _i+\beta _i+\alpha _i\beta _i\tau ^2}}. \end{aligned} \end{aligned}$$

In particular, we have

$$\begin{aligned} \gamma _K^2(\mathbb {P}_{\mu ,\Sigma },\mathbb {P}_{\nu ,\Sigma })=\frac{2\tau ^d}{(\det \Sigma )^{\frac{1}{2}}} \frac{1-\prod _{i=1}^de^{-\frac{\alpha _i(\mu _i-\nu _i)^2}{2(2+\alpha _i\tau ^2)}}}{\prod _{i=1}^d\sqrt{2+\alpha _i\tau ^2}}, \end{aligned}$$
(17)

where the \(\alpha _i\)’s are the eigenvalues of \(\Sigma ^{-1}\). Based on this formula, we shall verify that the properties (A1’) and (A2’) hold true.
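Before doing so, formula (17) itself can be sanity-checked numerically: with \(\Sigma =\mathrm {diag}(1/\alpha _1,\ldots ,1/\alpha _d)\), the Gaussian kernel factorizes over coordinates, so each double integral is a product of one-dimensional expectations \(\mathbb {E}\exp (-(X_i-Y_i)^2/(2\tau ^2))\) with \(X_i-Y_i\) Gaussian. A sketch with \(d=2\) (the parameter values are arbitrary test choices):

```python
import numpy as np

def trapezoid(y, x):
    # Composite trapezoidal rule
    return float(np.sum((y[1:] + y[:-1]) * np.diff(x)) / 2)

def expect_kernel(m, s2, tau, n=200001, width=12.0):
    # E exp(-Z^2 / (2 tau^2)) for Z ~ N(m, s2), by quadrature
    s = np.sqrt(s2)
    z = np.linspace(m - width * s, m + width * s, n)
    dens = np.exp(-(z - m)**2 / (2 * s2)) / np.sqrt(2 * np.pi * s2)
    return trapezoid(np.exp(-z**2 / (2 * tau**2)) * dens, z)

def gamma_sq_17(mu, nu, alphas, tau):
    # Formula (17): Sigma = diag(1/alpha_i), so 1/sqrt(det Sigma) = prod sqrt(alpha_i)
    mu, nu, alphas = map(np.asarray, (mu, nu, alphas))
    d = len(alphas)
    pref = 2 * tau**d * np.prod(np.sqrt(alphas))
    top = 1 - np.prod(np.exp(-alphas * (mu - nu)**2 / (2 * (2 + alphas * tau**2))))
    return pref * top / np.prod(np.sqrt(2 + alphas * tau**2))

mu, nu = [0.0, 0.5], [1.0, -0.3]
alphas, tau = [0.7, 2.0], 1.2
# Per coordinate: X_i - X_i' ~ N(0, 2/alpha_i) and X_i - Y_i ~ N(mu_i - nu_i, 2/alpha_i)
same = np.prod([expect_kernel(0.0, 2 / a, tau) for a in alphas])
cross = np.prod([expect_kernel(m - v, 2 / a, tau)
                 for a, m, v in zip(alphas, mu, nu)])
assert abs(2 * same - 2 * cross - gamma_sq_17(mu, nu, alphas, tau)) < 1e-6
```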

It is easy to see that \(\gamma _K(\mathbb {P}_{\mu ,\Sigma },\mathbb {P}_{\nu ,\Sigma })\rightarrow 0\) as \(\Vert \mu -\nu \Vert \rightarrow 0\). To show the monotonicity of \(\gamma _K\) with respect to \(\mu -\nu \), we just have to verify that

$$\begin{aligned} D(x):=\prod _{i=1}^{d} \exp \left( -\frac{\alpha _ix_i^2}{2(2+\alpha _i\tau ^2)}\right) ,\ x\in \mathbb {R}^d \end{aligned}$$

is monotonic on \(\mathbb {R}^d\) according to Definition 3.1. This follows from Lemma 3.4, since the coefficients \(-\frac{\alpha _i}{2(2+\alpha _i\tau ^2)}\) are all negative. Thus, (A1') is verified.

Next, let \(\mu ,\nu \) be fixed. By Formula (17), \(\gamma _K(\mathbb {P}_{\mu ,\Sigma },\mathbb {P}_{\nu ,\Sigma })\rightarrow 0\) as \(\det \Sigma \rightarrow \infty \). It remains to show that \(\gamma _K\) is monotonic with respect to \((\alpha _1,\alpha _2,\ldots ,\alpha _d)\). Set

$$\begin{aligned} g(s):=\sqrt{\frac{s}{2+\tau ^2 s}},\ s\ge 0~\text {and}~G(t):=(g(t_1),g(t_2),\ldots ,g(t_d)),\ t\in \mathbb {R}_+^d. \end{aligned}$$

Then g is continuous and strictly increasing on \(\mathbb {R}_{+}\). Therefore G is continuous and injective on \(\mathbb {R}_{+}^d\). Denote

$$\begin{aligned} F_c(t_1,t_2,\ldots ,t_d)=t_1t_2\cdots t_d(1-\exp [-(c_1t_1^2+c_2t_2^2+\cdots +c_dt_d^2)]), \end{aligned}$$

where \(c=(c_1,c_2,\ldots ,c_d)\) and \(t=(t_1,t_2,\dots ,t_d)\) are both in \(\mathbb {R}_{+}^d\). Then by Formula (17) we have

$$\begin{aligned} \gamma _K^2(\mathbb {P}_{\mu ,\Sigma },\mathbb {P}_{\nu ,\Sigma })=2\tau ^dF_{\omega }(G(\alpha _1,\alpha _2,\ldots ,\alpha _d)) \end{aligned}$$

where \(\omega =(\frac{(\mu _1-\nu _1)^2}{2},\frac{(\mu _2-\nu _2)^2}{2},\ldots ,\frac{(\mu _d-\nu _d)^2}{2})\). For every \(s\in \mathbb {R}_{+},\) the set

$$\begin{aligned} \begin{aligned} F_\omega ^{-1}(s)&=\{x\in \mathbb {R}_{+}^d:F_\omega (x)=s\}\\&=\left\{ (x_1,\ldots ,x_d):(x_1\cdots x_d)\right. \\&\quad \left. \left( 1-\prod _{i=1}^d\exp \left[ -\left( \frac{(\mu _i-\nu _i)^2}{2}x_i^2\right) \right] \right) =s\right\} \end{aligned} \end{aligned}$$

is connected in \(\mathbb {R}_{+}^d\) (see the proof in the Appendix). By Lemma 3.5, \(2\tau ^dF_\omega \circ G(\alpha _1,\ldots ,\alpha _d)\) is monotonic in \(\alpha \), which confirms (A2'). \(\square \)
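As an illustration of (A2'), the following sketch evaluates formula (17) along the family \(\Sigma _t^{-1}=t\,\mathrm {diag}(\alpha _1,\ldots ,\alpha _d)\) as \(t\downarrow 0\) (so that \(\det \Sigma _t\rightarrow \infty \)) and checks that \(\gamma _K^2\) decreases to 0. The particular eigenvalues, mean difference, and \(\tau \) are arbitrary test choices:

```python
import numpy as np

def gamma_sq_17(alphas, diff, tau=1.0):
    # Formula (17) with diff_i = mu_i - nu_i and Sigma = diag(1/alpha_i)
    alphas, diff = np.asarray(alphas), np.asarray(diff)
    d = len(alphas)
    pref = 2 * tau**d * np.prod(np.sqrt(alphas))  # 2 tau^d / sqrt(det Sigma)
    top = 1 - np.exp(-np.sum(alphas * diff**2 / (2 * (2 + alphas * tau**2))))
    return pref * top / np.prod(np.sqrt(2 + alphas * tau**2))

diff = [1.0, -0.5, 0.3]
base = np.array([2.0, 1.0, 0.5])
# Shrinking all eigenvalues of Sigma^{-1} (det Sigma grows) decreases gamma_K
vals = [gamma_sq_17(t * base, diff) for t in [2.0, 1.0, 0.5, 0.1, 0.01]]
assert all(a > b for a, b in zip(vals, vals[1:]))
assert vals[-1] < 0.01 * vals[0]
```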

4 Conclusion

Measuring the similarity and distance between two probability distributions is important in many applications of statistics. The approach of RKHS embedding has many advantages over other integral probability metrics. Due to the one-to-one correspondence between reproducing kernels and reproducing kernel Hilbert spaces, the choice of the reproducing kernel is critical in this approach. Past studies have focused on when the kernel is characteristic. We investigate an admissibility criterion on the kernel to ensure that the similarity among the RKHS embeddings of the same class of distributions satisfies two natural requirements. Necessary and sufficient conditions are provided. In particular, we find that radially decreasing radial basis functions are admissible for Gaussian distributions. We remark that the study can be extended to other classes of probability distributions and to other norms on the Euclidean space.