Abstract
Similarity measurement of two probability distributions is important in many applications of statistics. Embedding such distributions into a reproducing kernel Hilbert space (RKHS) has many favorable properties. The choice of the reproducing kernel is crucial in the approach. We study this question by considering the similarity of two distributions of the same class. In particular, we investigate when the RKHS embedding is “admissible” in the sense that the distance between the embeddings should become smaller when the expectations are getting closer or when the variance is increasing to infinity. We give conditions on the widely-used translation-invariant reproducing kernels to be admissible. We also extend the study to multivariate non-symmetric Gaussian distributions.
Similar content being viewed by others
Avoid common mistakes on your manuscript.
1 Introduction
Distance between probability measures has many applications, including distribution testing, density estimation, signal detection, etc (Rachev 1991; Vajda 1989). In recent years, many kinds of distance between probability measures have been proposed (see, for example, Sriperumbudur et al. 2009, 2010). Many of them are built on the general approach of integral probability metric (IPM) (Müller 1997).
To introduce the approach, denote by \(\mathcal {P}\) the set of all Borel probability measures on a probability space \((M,\mathcal {A})\). The IPM between \(\mathbb {P}_1,\mathbb {P}_2\in \mathcal {P}\) is defined as
where \(\mathcal {F}\) is a class of real-valued bounded measurable functions on M. Different choices of the class \(\mathcal {F}\) yield different metrics \(\gamma _\mathcal {F}\) on the given probability space. The followings are among the well-known examples in the literature:
-
1.
Total variation distance: \(\mathcal {F}=\text {C}_{bu}(M)\), the space of all uniformly bounded continuous functions on M or \(\mathcal {F}=\{f:\Vert f\Vert _\infty \le 1\}\), where \(\Vert f\Vert _\infty =\sup _{x\in M}|f(x)|\) (see, for example, Shorack 2000, Chapter 19);
-
2.
The Kolmogorov distance: \(\mathcal {F}=\{1_{(-\infty ,t]}:t\in \mathbb {R}^d\}\), where \(1_A\) denotes the characteristic function of a subset A of \(\mathbb {R}^d\) (see, for example, Shorack 2000, Chapter 19);
-
3.
The Kantorovich metric or Wasserstein distance: \(\mathcal {F}:=\{f:\Vert f\Vert _L\le 1\}\) where \(\Vert f\Vert _L:=\sup \{|f(x)-f(y)|/\rho (x,y),\ x\ne y\in M\}\) with M being a metric space with metric \(\rho \) (see, Dudley 2002, Theorem 11.8.2);
-
4.
Reproducing kernel Hilbert space embedding of measures: \(\mathcal {F}=\{f\in \mathcal {H}_K: \Vert f\Vert _{\mathcal {H}_K}\le 1\}\), where \(\mathcal {H}_K\) is the reproducing kernel Hilbert space of a reproducing kernel K on M (Gretton et al. 2007; Smola et al. 2007).
-
5.
Reproducing kernel Banach space embedding of measures (Sriperumbudur et al. 2011): \(\mathcal {F}=\{f\in \mathcal {B}: \Vert f\Vert _{\mathcal {B}}\le 1\}\), where \(\mathcal {B}\) is a reproducing kernel Banach space on M (Song et al. 2013; Zhang et al. 2009).
This paper attempts to contribute to the approach 4 above. We introduce the concept of reproducing kernel Hilbert spaces and reproducing kernels (Aronszajn 1950).
Definition 1.1
Let M be a prescribed set. A reproducing kernel on M is a real-valued function \(K:M\times M\rightarrow \mathbb {R}\) such that for all finite points \(x_1,x_2,\ldots ,x_n\in M\), the matrix
is symmetric and positive semi-definite.
For a reproducing kernel K on M, there exists a unique associated Hilbert space denoted by \(\mathcal {H}_K\) consisting of certain functions on M such that \(K(x,\cdot )\in \mathcal {H}_K\) for all \(x\in M\) and
where \(\langle \cdot ,\cdot \rangle _{\mathcal {H}_K}\) denotes the inner product on \(\mathcal {H}_K\). The space \(\mathcal {H}_K\) is called the reproducing kernel Hilbert space (RKHS) of the reproducing kernel K. We introduce the notation
When only finite i.i.d. random samples \(\{X_i:1\le i\le m\}\) and \(\{Y_j:1\le j\le n\}\) drawn from unknown measures \(\mathbb {P}_1,\mathbb {P}_2\) are available, one approximates \(\mathbb {P}_1\) and \(\mathbb {P}_2\) respectively by
and thereby approximating \(\gamma _\mathcal {F}(\mathbb {P}_1,\mathbb {P}_2)\) by \(\gamma _\mathcal {F}(\mathbb {P}_{1m},\mathbb {P}_{2n})\). By choosing \(\mathcal {F}\) to be the unit ball of the reproducing kernel Hilbert space of a reproducing kernel K, the approach of reproducing kernel Hilbert space embedding of measures enjoys many advantages over other approaches (Gretton et al. 2007; Sriperumbudur et al. 2009; Weaver 1999). Firstly, \(\gamma _K(\mathbb {P}_{1m},\mathbb {P}_{2n})\) is simply a sum of expectations of the kernel K and hence is much easier to compute compared to other choices. Secondly, \(\gamma _K(\mathbb {P}_{1m},\mathbb {P}_{2n})\) is a \(\sqrt{mn/(m+n)}\)-consistent estimate of \(\gamma _K(\mathbb {P}_1,\mathbb {P}_2)\) for all \(\mathbb {P}_1,\mathbb {P}_2\) under the mild conditions that K is measurable and bounded (Gretton et al. 2007). Thirdly, when K is translation-invariant, the rate of approximating \(\gamma _K(\mathbb {P}_1,\mathbb {P}_2)\) by \(\gamma _K(\mathbb {P}_{1m},\mathbb {P}_{2n})\) is independent of the dimension (Sriperumbudur et al. 2009).
Despite many favorable properties, there is one critical question not well-addressed in the RKHS embedding of measures, which is the choice of reproducing kernels. An RKHS \(\mathcal {H}_K\) is completely determined by its reproducing kernel K (Aronszajn 1950; Zhang and Zhao 2013). In fact, \(\mathcal {H}_K\) is the completion of the linear space
under the inner product
Therefore, the choice of the reproducing kernel K much affects the embedding of probability measures in \(\mathcal {H}_K\). So far, studies in the literature have focused on characteristic kernels which ensure \(\gamma _K(\mathbb {P}_1,\mathbb {P}_2)\) to be a metric on \((M,\mathcal {A})\) (see, for example, Berlinet and Thomas-Agnan 2004; Chen et al. 2016; Fukumizu et al. 2009, 2008; Gretton et al. 2007; Sriperumbudur et al. 2011, 2009, 2010; Steinwart 2001). Being characteristic can be viewed as a preliminary requirement on the reproducing kernel. Our attempt in this paper is to impose another admissibility criterion on the reproducing kernel in measuring the similarity of a class of probability distributions. Let us make our objective clear.
Assume that K is a characteristic kernel. Thus \(\gamma _K(\mathbb {P}_1,\mathbb {P}_2)\) is a metric and can be used to measure the similarity between two probability measures \(\mathbb {P}_1,\mathbb {P}_2\). Consider the most important class of Gaussian distributions measures
with mean \(\mu \) and standard deviation \(\sigma \). Naturally, two Gaussian measures \(\mathbb {P}_{\mu _1,\sigma _1}\) and \(\mathbb {P}_{\mu _2,\sigma _2}\) should be closer in the following two cases (see Figs. 1 and 2 for illustration and explanation):
-
(i)
when the means are getting closer, that is, \(\mu _1\) tends to \(\mu _2\);
-
(ii)
when \(\sigma _1=\sigma _2\) are increasing while the means are different but fixed.
To summarize, we shall study conditions on the reproducing kernel K that is admissible for the Gaussian distributions in the following sense. Denote by \(\Vert \cdot \Vert \) the standard Euclidean norm on \(\mathbb {R}^d\).
Definition 1.2
Let \(\mathbb {Q}\) be a Borel probability measure on \(\mathbb {R}^d\). A reproducing kernel K on \(\mathbb {R}^d\) is said to be admissible for the class of distributions
(or simply, \(\mathbb Q\)-admissible), if the following two conditions are satisfied:
- (A1):
-
\(\gamma _K(\mathbb {P}_{\mu _1,\sigma }, \mathbb {P}_{\mu _2,\sigma })\) is strictly decreasing as \(\Vert \mu _1-\mu _2\Vert \) decreases;
- (A2):
-
\(\gamma _K(\mathbb {P}_{\mu _1,\sigma }, \mathbb {P}_{\mu _2,\sigma })\) is strictly decreasing as \(\sigma \) increases.
We present sufficient conditions for a translation-invariant kernel K to be \(\mathbb {Q}\)-admissible in Sect. 3. The concrete examples of Gaussian distributions is then investigated. To this end, we present necessary preliminaries on reproducing kernels and RKHS embeddings of probability measures in Sect. 2. Section 3 is devoted to non-symmetric multivariate Gaussian distributions. We remark that by the illustration of our motivation in Figs. 1 and 2, the notion of admissible kernels introduced in the paper seems useful only for probability distributions of a single mode such as the Gaussian distributions. Kernel methods to evaluate the distance between probability distributions of multiple modes would be an interesting question for us in the future.
2 Admissible kernels
Let K be a reproducing kernel on \(\mathbb {R}^d\) that is translation-invariant in the sense
for all \(x,y,z\in \mathbb {R}^d\). It is easy to see
for some function \(\psi (x)=K(x,0)\) on \(\mathbb {R}^d\). By the celebrated Bochner theorem (Bochner 1959), if \(\psi \) is continuous on \(\mathbb {R}^d\) then \(K(x,y)=\psi (x-y)\) makes a reproducing kernel on \(\mathbb {R}^d\) if and only if there exists a finite positive Borel measure \(\rho \) on \(\mathbb {R}^d\) such that
It was shown in Sriperumbudur et al. (2010) that K is a characteristic kernel, that is, \(\gamma _K\) is a metric, if and only if \(\rho \) is supported on the whole \(\mathbb {R}^d\). For simplicity, we also assume that \(\rho \) is symmetric about the origin so that \(\psi \) and K are real-valued.
Denote by \(\mathcal {P}(\mathbb {R}^d)\) the set of all Borel probability measures on \(\mathbb {R}^d\). Let \(\mathbb {R}_+:=[0,+\infty )\) and let \(\mathbb {N}\) be the set of positive integers. We shall need the convolution of a bounded continuous function f on \(\mathbb {R}^d\) and a measure \(\rho \in \mathcal {P}(\mathbb {R}^d)\) given as
and the convolution of two probability measures \(\varrho ,\lambda \in \mathcal {P}(\mathbb {R}^d)\)
For a probability distribution \(\mathbb {Q}\), denote
where \(\,\mathrm {d}\tilde{\mathbb {Q}}(x):=\,\mathrm {d}\mathbb {Q}(-x)\).
Let K be given by (3) and (4), where \(\rho \) is a finite positive Borel measure on \(\mathbb {R}^d\). Then K is bounded on \(\mathbb {R}^d\times \mathbb {R}^d\). Consequently, the function
is well-defined for each \(\varrho \in \mathcal {P}(\mathbb {R}^d)\). An important observation made in Sriperumbudur et al. (2010) is that \(f_\varrho \in \mathcal {H}_K\) for all \(\varrho \in \mathcal {P}(\mathbb {R}^d)\) and that
Moreover,
Returning to our main theme, we let \(\mathbb {Q}\in \mathcal {P}(\mathbb {R}^d)\) and define the associated class of probability measures
We first give an initial result for the reproducing kernel K given by (3), (4) to satisfy the two admissible requirements (A1) and (A2). To this end, we introduce the following definitions.
Definition 2.1
Let f be a function on \(\mathbb {R}^d\). Then it is said to be
-
radial provided that \(f(x)=f(y)\) whenever \(\Vert x\Vert =\Vert y\Vert \);
-
radially decreasing if f is radial and \(f(x)\le f(y)\) whenever \(\Vert x\Vert >\Vert y\Vert \);
-
strictly radially decreasing if f is radial and \(f(x)< f(y)\) whenever \(\Vert x\Vert >\Vert y\Vert \);
-
radially increasing (strictly radially increasing) if \(-f\) is radially decreasing (strictly radially decreasing).
It can be verified that the convolution of two radial functions remains radial. Interested readers are referred to Lieb and Loss (2001) for more properties about radial functions.
Lemma 2.2
Let K be the translation-invariant kernel given by (3) and (4), where \(\rho \) is supported on the whole \(\mathbb {R}^d\). Define the function
where
Then K is admissible for the class of distributions (8) if and only if \(\mathcal {G}_{\psi _\sigma ,\mathbb {Q}}\) is strictly decreasing as \(\Vert x\Vert \) increases for any \(\sigma >0\) and as \(\sigma \) increases for fixed \(\Vert x\Vert \).
Proof
By (6) and (7), we have for two measures \(\varrho ,\lambda \in \mathcal {P}(\mathbb {R}^d)\)
Specifying the distributions \(\mathbb {P}_{\mu _1,\sigma }, \mathbb {P}_{\mu _2,\sigma }\), we compute
Since \(\rho \) is supported on the whole real line, \(\gamma _K\) is a metric on \(\mathcal {P}(\mathbb {R}^d)\). Then by the above calculation, we have that
As a result, the requirements (A1) and (A2) are satisfied if and only if \((\psi _\sigma * \bar{\mathbb {Q}})(t)\) is monotonically decreasing as \(\Vert t\Vert \) increases for fixed \(\sigma >0\) and as \(\sigma \) increases for fixed \(\Vert t\Vert \). \(\square \)
Specifying a certain class of probability distributions, we are able to show that not all translation-invariant characteristic kernels are admissible for the RKHS embedding of probability distributions.
Example 2.3
Consider dimension \(d=1\). Let \(\mathbb {Q}\) be the Bernoulli distribution with success probability \(\frac{1}{2}\), i.e. \(\mathbb {Q}(0)=\mathbb {Q}(1)=\frac{1}{2}\). Then we have \(\bar{\mathbb {Q}}(-1)=\bar{\mathbb {Q}}(1)=\frac{1}{4}\) and \(\bar{\mathbb {Q}}(0)=\frac{1}{2}\). With \(\psi \) being given by (4), we have
Now let \(\psi \) be an even function on \(\mathbb {R}\) which is strictly decreasing on \(\mathbb {R}_+\) and converging to 0 (examples including \(e^{-|t|}\), \(e^{-t^2}\), etc.). Choose \(\sigma \) large enough such that
Then
Therefore, the function \(\mathcal {G}_{\psi _\sigma ,\mathbb {Q}}\) is not monotonically decreasing for this example. By Lemma 2.2, the kernel K is not admissible.
Next we shall present a sufficient condition and a necessary condition guaranteeing admissibility, which covers a large class of reproducing kernels and probability distributions.
Theorem 2.4
Let \(\psi \) and \(\bar{\mathbb {Q}}\) be defined by (4) and (5), respectively, with \(\rho \) supported on \(\mathbb {R}^d\). Suppose that \(\bar{\mathbb {Q}}\) has a Lebesgue integrable density function f. If both \(\psi \) and f are radially decreasing with at least one of them being strictly radially decreasing then the kernel K given by (3) is admissible for the class of distributions (8).
Conversely, suppose that f and \(\psi \) are radial and that f is radially decreasing. If the kernel K is admissible, then \(\psi \) is radially decreasing.
Proof
For the first part, by Lemma 2.2, we need to show that the function \(\mathcal {G}_{\psi _\sigma ,\mathbb {Q}}(t)\) defined by (9) is monotonically decreasing as \(\Vert t\Vert \) and \(\sigma \) increase.
We only consider the case when \(\sigma \) is fixed, the case when \(\sigma \) varies can be handled similarly. Let \(\delta ,\Delta \) be two points in \(\mathbb {R}^d\) with \(\Vert \delta \Vert <\Vert \Delta \Vert \). As \(\psi _\sigma ,f\) are both radial, so is \(\mathcal {G}_{\psi _\sigma ,\mathbb {Q}}=\psi _\sigma *f\). We may hence assume \(\delta =(\delta _1,0,\ldots ,0), \Delta =(\Delta _1,0,\ldots ,0)\), where \(0\le \delta _1<\Delta _1\). Also, we may assume further that \(\psi \) is the one which is strictly radially decreasing.
Set \(H_{t}:=\{x\in \mathbb {R}^d:x_1< t\}\), \(t\in \mathbb {R}\). We first write
We then apply the substitution \(x=\delta +\Delta -t\) to the second integral above and use the radiality of \(\psi \) and f to get
Note that for \(x\in H_\frac{\delta _1+\Delta _1}{2}\),
Since \(\psi _\sigma \) is strictly radially decreasing and f is radially decreasing,
As a consequence, we get by (11) that \(\mathcal {G}_{\psi _\sigma ,\mathbb {Q}}(\delta )-\mathcal {G}_{\psi _\sigma ,\mathbb {Q}}(\Delta )\ge 0\).
To obtain \(\mathcal {G}_{\psi _\sigma ,\mathbb {Q}}(\delta )-\mathcal {G}_{\psi _\sigma ,\mathbb {Q}}(\Delta )>0\), we have to show that the set \(E=\{x\in H_{\frac{\delta _1+\Delta _1}{2}}:f(x)\ne f(x-\delta -\Delta )\}\) has a positive Lebesgue measure on \(\mathbb {R}^d\). Assume on the contrary that the Lebesgue measure of E equals 0. Then f is a periodic function on \(H_{\frac{\delta _1+\Delta _1}{2}}{\setminus } E\). Since f is radially decreasing, it must be constant on \(H_{\frac{\delta _1+\Delta _1}{2}}{\setminus } E\). This together with the assumption that f is radial implies that it equals a constant almost everywhere on \(\mathbb {R}^d\). It is hence impossible to be the density function of a probability distribution, which is a contradiction.
For the second part, we only have to prove that if f is radially decreasing, then the kernel K being admissible implies \(\psi \) being radially decreasing. Since f is radially decreasing and is the density function of \(\bar{\mathbb {Q}}\), we have \(\int _{\mathbb {R}^d} f(x)\,\mathrm {d}x=1\). Define
where \(\psi \) is a bounded and continuous function on \(\mathbb {R}^d\). The density function \(\frac{1}{\sigma ^d}f(\frac{y}{\sigma })\) clearly converges in distribution to the Dirac mass \(\delta _0\) in y when \(\sigma \rightarrow 0\). Hence \(\mathcal {G}_{\psi _\sigma ,\mathbb {Q}}(x)\rightarrow \psi (x)\) when \(\sigma \rightarrow 0\). Now assume on the contrary that \(\psi \) is not radially decreasing. Then there exist two points \(x_0,y_0\) such that \(\Vert x_0\Vert \le \Vert y_0\Vert \) and \(\psi (x_0)< \psi (y_0)\). Let \(\varepsilon \) be a real number such that \(0<\varepsilon <\psi (y_0)-\psi (x_0)\). Then by the continuity of \(\psi \) and the fact that \(\mathcal {G}_{\psi _\sigma ,\mathbb {Q}}(x)\rightarrow \psi (x)~(as~\sigma \rightarrow 0)\), there exists \(\sigma \) small enough such that
Therefore, by Lemma 2.2 we know that K is not admissible. So if K is admissible, then \(\psi \) must be radially decreasing. \(\square \)
Note that the kernel which satisfies the assumptions in Theorem 2.4 is of the form
where \(\phi \) is decreasing on \(\mathbb {R}_{+}\). Kernels of the above form are called radial basis functions (Wendland 2005; Wu 1995). Let \(\phi \) be a function on \(\mathbb {R}_+\) such that \(\phi (\Vert x-y\Vert )\) is a reproducing kernel on \(\mathbb {R}^d\). It is quite natural to ask when \(\phi \) is strictly decreasing. A fundamental result on radial basis functions due to Schoenberg (1938) states that \(\phi (\Vert x-y\Vert )\) makes a reproducing kernel on \(\mathbb {R}^d\) for all dimensions \(d\in \mathbb {N}\) if and only if there is a finite positive Borel measure \(\mu \) on \(\mathbb {R}_{+}\) such that
In this case, \(\phi \) is automatically decreasing on \(\mathbb {R}_+\) and is strictly decreasing as long as \(\,\mathrm{supp}\,\mu \ne \{0\}\), which is equivalent to say that the radial kernel K in (12) is characteristic (Sriperumbudur et al. 2011, Proposition 5). In conclusion, if for all \(d\in \mathbb {N}\), K is a nontrivial reproducing kernel on \(\mathbb {R}^d\) and a characteristic kernel, then we have by Theorem 2.4 that K is \(\mathbb {Q}\)-admissible provided that the density function f of \(\mathbb {Q}\) is radially decreasing.
Things are different if \(\phi (\Vert x-y\Vert )\) is only a kernel on certain dimensions. We present an explicit example to illustrate this. It was proved in Schoenberg (1938) that for a fixed dimension d, \(\phi (\Vert x-y\Vert )\) is a kernel on \(\mathbb {R}^d\) if and only if
where \(\mu \) is a finite positive Borel measure on \(\mathbb {R}_+\) and
Setting \(d=3\) leads to
We then choose the measure \(\mu \) such that \(\,\mathrm{supp}\,\mu =[0,2\pi ]\) and \(\,\mathrm {d}\mu (s)=s\,\mathrm {d}s\) for \(s\in [0,2\pi ]\). The resulting function \(\phi \) is
which is not decreasing since \(\phi (1)=0\) while \(\phi (\frac{3}{2})=\frac{8}{9}\). Therefore by Theorem 2.4\(K(x,y)=\phi (\Vert x-y\Vert )\) is not admissible for any class of radially decreasing distributions. In particular, it is not admissible for the Gaussian distributions.
Nevertheless, there exists a large class of compactly supported decreasing \(\phi \) so that (12) defines a kernel that satisfies the conditions in Theorem 2.4. These are the compactly supported radial basis functions of minimal degree constructed in Wendland (2005) and Wu (1995). Examples for dimension \(d=3\) include
where \((1-r)_+:=\max \{0,1-r\}\). More examples are available in (Wendland 2005, Chapter 9).
Going back to our main objective, we shall use Theorem 2.4 to establish admissibility for the RKHS embedding of Gaussian distributions.
Theorem 2.5
Let K be a reproducing kernel on \(\mathbb {R}^d\) of the form \(K(x,y)=\psi (x-y)=\phi (\Vert x-y\Vert )\) where \(\phi \) is deceasing on \(\mathbb {R}_+\) and such that \(\rho \) in (4) is supported on all of \(\mathbb {R}^d\). Then K is admissible for the class of Gaussian distributions
and for the class of generalized Gaussian distributions
where \(\omega \) is a nontrivial finite positive Borel measure on \(\mathbb {R}_+\) with \(\,\mathrm{supp}\,\omega \ne \{0\}\), and \(c_w\) is a positive constant such that
Proof
It suffices to verify that the conditions in Theorem 2.4 is satisfied. Firstly, \(K(x,y)=\psi (x-y)\) where \(\psi (x)=\phi (\Vert x\Vert )\) is radial and radially decreasing. For the Gaussian distributions, we see that
where \(\mathbb {Q}\) has the density function
Thus, the density function for \(\bar{\mathbb {Q}}=\mathbb {Q}* \tilde{\mathbb {Q}}\) is
which is radial and strictly radially decreasing. Therefore K is admissible for the Gaussian distributions (13).
For the generalized Gaussian distributions (14), the density function of \(\bar{\mathbb {Q}}\) is
which is also radial and strictly radially decreasing. \(\square \)
Generalized Gaussian distributions have found many applications in image processing (Mallat 1989; Moulin and Liu 1999) and the field of engineering (Miller and Thomas 1972; Beaulieu and Young 2009). Classical probability density functions for generalized Gaussian distributions are of the following form
where \(p>0\) and \(c_p\) is the constant that makes \(f_p(x)\) a density function. The existence of the measure \(\omega (\tau )\) can be guaranteed by theoretic results in Bochner (1937).
Specifying the measure \(\omega \), we have the following examples.
Example 2.6
Let the Borel measure \(\,\mathrm {d}\omega (\tau )=\,\mathrm {d}\tau /\tau ^2,~\tau \in [1,2]\) in (14). Then the corresponding generalized Gaussian distribution is
Let the Borel measure \(\omega =\sum _{i=1}^m \alpha _i\delta _{\tau _i}\), where \(\alpha _i\ge 0\) and \(\delta _{\tau _i}\) is the dirac measure at point \(\tau _i>0\). Then the corresponding generalized Gaussian distribution is simply the linear combination of Gaussian distributions with the same expectation. That is,
In particular, the Wendland functions (Wendland 2005) and the Gaussian kernels are admissible for the Gaussian distributions. We remark that the latter observation can also be made from direct computation as done in Sriperumbudur et al. (2009), where it was shown that for two Gaussian distributions \(\mathbb {P}_{\mu ,\sigma },\mathbb {P}_{\nu ,\theta }\) and for the Gaussian kernel
it holds
Thus, when \(\sigma =\theta \),
Clearly, the two admissibility requirements are satisfied.
3 Non-radial Gaussian distributions
In this section, we study similarity of two multivariate non-radial Gaussian distributions under RKHS embedding. Such distributions appear widely in probability and statistics. They are of the general form
where \(\mu \in \mathbb {R}^d\) and \(\Sigma \) is a radial and positive-definite \(d\times d\) matrix. To fulfill our purpose, we need to introduce the definition of multivariate monotonic functions from (Engelking 1989).
Definition 3.1
Let h be a function from \(\mathbb {R}^n\) to \(\mathbb {R}^k\). We say that h is monotonic provided that for any \(y\in \mathbb {R}^k\), \(h^{-1}(y)\) is connected in \(\mathbb {R}^n\).
Recall that a set C in \(\mathbb {R}^n\) is connected if there do not exist two disjoint open subsets \(U,V\in \mathbb {R}^n\) such that \(C\subseteq U\cup V\) and both \(C\cap U\) and \(C\cap V\) are nonempty.
Obviously, a constant function f on \(\mathbb {R}^n\) is monotonic as for every \(c\in \mathbb {R},\) \(f^{-1}(c)\) is either empty or the entire \(\mathbb {R}^n\). We give some other nontrivial examples of multivariate monotonic functions to help comprehend this definition.
We first point out that the above seemingly abstract definition coincides with the ordinary one for continuous univariate monotonic functions.
Example 3.2
Let f be a continuous function on \(\mathbb {R}\). If f is monotonic in the ordinary sense then it is easy to see that it satisfies Definition 3.1. On the other hand, assume that it is monotonic according to Definition 3.1. In other words, for each \(c\in \mathbb {R}\), \(f^{-1}(c)\) is connected in \(\mathbb {R}\). One shows by the intermediate value theorem for continuous functions that f must be monotonic in the ordinary sense.
Example 3.3
A linear function \(f:\mathbb {R}^n\rightarrow \mathbb {R}\) defined by \(f(x):=a_1x_1+a_2x_2+\cdots +a_nx_n\), where \(a_i\in \mathbb {R}\) is a monotonic function. This is because for all \(y\in \mathbb {R}\), \(f^{-1}(y)=\{x\in \mathbb {R}^n:\sum _{i=1}^n a_ix_i=y\}\) is a hyperplane in \(\mathbb {R}^n\), which is connected in \(\mathbb {R}^n\).
The following example will appear in our discussion.
Lemma 3.4
Let \(f:\mathbb {R}^n\rightarrow \mathbb {R}\) defined by \(f(x)=\exp (\alpha _1 x_1^2+ \alpha _2 x_2^2+\cdots +\alpha _n x_n^2)\), where \(\alpha _i\) are simultaneously all negative or all positive. Then f is a monotonic function.
Proof
Without loss of generality, we assume \(\alpha _1, \alpha _2,\dots ,\alpha _n> 0\). Then \(f(x)\ge 1\) and \(f^{-1}(1)=\{0\}\), which is a connected set. For \(c> 1\), the set \(f^{-1}(c)=\{(x_1,x_2,\dots ,x_n):\alpha _1 x_1^2+\alpha _2 x_2^2+\alpha _nx_n^2=\ln c\}\) is an n-dimensional ellipsoid, which is connected in \(\mathbb {R}^n\). Therefore we have by Definition 3.1 that f is a monotonic function. \(\square \)
Before proceeding to the next lemma, two classical results are needed, see van Mill (1989).
-
1.
A continuous injective function f from a compact subset of \(\mathbb {R}^n\) to \(\mathbb {R}^n\) is a homeomorphism (van Mill 1989, Excise 1.1.4).
-
2.
The Brouwer Invariance of Domain Theorem: If two sets \(X,Y\subseteq \mathbb {R}^n\) are homeomorphic then so are their interiors (van Mill 1989, Theorem 4.6.7, Corollary 4.6.6).
Lemma 3.5
Let \(F:\mathbb {R}^n\rightarrow \mathbb {R}\) be a continuous monotonic function and \(G:\mathbb {R}^n\rightarrow \mathbb {R}^n\) be a continuous injective mapping. Then their composition \(F\circ G:\mathbb {R}^n\rightarrow \mathbb {R}\) is monotonic.
Proof
It is easy to see that \(F\circ G\) is a continuous function. By definition, we have to show that for every \(c\in \mathbb {R}\), the set \((F\circ G)^{-1}(c)\) is connected in \(\mathbb {R}^n.\) Since F is monotonic, we know that \(F^{-1}(c)\) is a connected set in \(\mathbb {R}^n\). By the fact that the continuous image of a connected set is still connected, the set \((F\circ G)^{-1}(c)=G^{-1}(F^{-1}(c))\) is connected provided that \(G^{-1}\) is continuous.
We then prove that if G is continuous and injective, then its inverse \(G^{-1}\) is a continuous function from \(G(\mathbb {R}^n)\) to \(\mathbb {R}^n\). For any sequence \(\{y_k:k=1,2,\ldots \}\subseteq G(\mathbb {R}^n)\) that converges to \(y_0\in G(\mathbb {R}^n)\), denote their preimages by \(\{x_k:k=1,2,\ldots \}\) and \(x_0\), respectively. Let \(B_r\) be a closed ball with radius r centered at \(x_0\), which is clearly compact. Using the result 1 above we have the fact that \(B_r\) must be homeomorphic to \(G(B_r)\). Then we have by the Brouwer Invariance of Domain Theorem that \(\mathrm{int}\,B_r\) and \(\mathrm{int}\,G(B_r)\) are homeomorphic. Therefore for sufficiently large k, \(y_k\) must be located in \(\mathrm{int}\,G(B_r)\). Again by the fact that G is injective, \(x_k\) must be contained in \(B_r\) for all sufficient large k. Since the radius r is arbitrary, we have that \(x_k\) converges to \(x_0\), namely, \(G^{-1}\) is continuous. \(\square \)
We are ready to present the main result of this section about the similarity of two general d-dimensional Gaussian distributions.
Theorem 3.6
The Gaussian reproducing kernel \(K(x,y)=\exp (-\frac{\Vert x-y\Vert _2^2}{2\tau ^2}),~\tau >0\) is admissible for the class of d-dimensional Gaussian distributions given by (16) in the sense that
- (A1’):
-
As a function of \(\mu ,\nu \), \(\gamma _K(\mathbb {P}_{\mu ,\Sigma },\mathbb {P}_{\nu ,\Sigma })\) decreases monotonically to 0 as \(\mu \) tends to \(\nu \),
- (A2’):
-
As a function of the eigenvalues of \(\Sigma ^{-1}\), \(\gamma _K(\mathbb {P}_{\mu ,\Sigma },\mathbb {P}_{\nu ,\Sigma })\) decreases monotonically to 0 as \(\det \Sigma \) tends to infinity.
Proof
Let \(\mathbb {P}_1,~\mathbb {P}_2\) be two d-dimensional Gaussian distributions given by
where \(\Sigma _1,\Sigma _2\) are two positive-definite matrices.
If \(\Sigma _1\Sigma _2=\Sigma _2\Sigma _1\) then there exists an orthogonal \(d\times d\) matrix B such that
where \(\alpha _i,\beta _i,i=1,2,\ldots ,d\) are all positive. Then
the equality \((\spadesuit )\) holds since the kernel K(x, y) is translation invariance, and \((\heartsuit )\) follows if we replace x and y with Bx and By, respectively.
Similarly, we have
and
As a result,
In particular, we have
where the \(\alpha _i\)’s are the eigenvalues of \(\Sigma ^{-1}\). Based on this formula, we shall verify that the properties (A1’) and (A2’) hold true.
It is easy to see that \(\gamma _K(\mathbb {P}_{\mu ,\Sigma },\mathbb {P}_{\nu ,\Sigma })\rightarrow 0\) as \(\Vert \mu -\nu \Vert \rightarrow 0\). To show the monotonicity of \(\gamma _K\) with respect to \(\mu -\nu \), we just have to verify that
is monotonic on \(\mathbb {R}^d\) according to Definition 3.1. This falls into Lemma 3.4. Thus, (A1’) is verified.
Next, let \(\mu ,\nu \) be fixed. By Formula (17), \(\gamma _K(\mathbb {P}_{\mu ,\Sigma },\mathbb {P}_{\nu ,\Sigma })\rightarrow 0\) as \(\det \Sigma \rightarrow \infty \). It remains to show that \(\gamma _K\) is monotonic with respect to \((\alpha _1,\alpha _2,\ldots ,\alpha _d)\). Set
Then g is continuous and strictly increasing on \(\mathbb {R}_{+}\). Therefore G is continuous and injective on \(\mathbb {R}^n\). Denote
where \(c=(c_1,c_2,\ldots ,c_d)\) and \(t=(t_1,t_2,\dots ,t_d)\) are both in \(\mathbb {R}_{+}^d\). Then by Formula (17) we have
where \(\omega =(-\frac{(\mu _1-\nu _1)^2}{2},-\frac{(\mu _2-\nu _2)^2}{2},\ldots ,-\frac{(\mu _d-\nu _d)^2}{2})\). For every \(s\in \mathbb {R}_{+},\) the set
is connected in \(\mathbb {R}_{+}^d\) (Please see the proof in the Appendix). By Lemma 3.5, \(2\tau ^dF_\omega \circ G(\alpha _1,\ldots ,\alpha _d)\) is monotonic on \(\alpha \), which confirms (A2’). \(\square \)
4 Conclusion
Measuring the similarity and distance between two probability distributions is important in many applications of statistics. The approach of RKHS embedding has many advantages over other integral probability metrics. Due to the one-to-one correspondence between reproducing kernels and reproducing kernel Hilbert spaces, the choice of the reproducing kernel is critical in the approach. Past studies have been focusing on when the kernel is characteristic. We investigate an admissibility criterion on the kernel to ensure that the similarity among the RKHS embeddings of the same class of distributions would satisfy two natural requirements. Sufficient and necessary conditions are provided. In particular, we find that radially decreasing radial basis functions are admissible for Gaussian distributions. We remark that the study can be extended to other classes of probability distributions and to other norms on the Euclidean space.
References
Aronszajn N (1950) Theory of reproducing kernels. Trans Am Math Soc 68:337–404
Beaulieu NC, Young DJ (2009) Designing time-hopping ultrawide bandwidth receivers for multiuser interference environments. Proc IEEE 97(2):255–284
Berlinet A, Thomas-Agnan C (2004) Reproducing kernel Hilbert spaces in probability and statistics. Kluwer, Dordrecht
Bochner S (1937) Stable laws of probability and completely monotone functions. Duke Math J 3(4):726–728
Bochner S (1959) Lectures on Fourier integrals with an author’s supplement on monotonic functions, Stieltjes integrals, and harmonic analysis. Annals of mathematics studies, vol 42. Princeton University, New Jersey
Chen W, Wang B, Zhang H (2016) Universalities of reproducing kernels revisited. Appl Anal 95:1776–1791
Dudley RM (2002) Real analysis and probability. Cambridge University Press, Cambridge, UK
Engelking R (1989) Gerneral topology, 2nd edn. Heldermann-Verlag, Berlin
Fukumizu K, Gretton A, Sun X, Schölkopf B (2008) Kernel measures of conditional dependence. In: Advances in neural information processing systems, vol 20. MIT Press, Cambridge, pp 489–496
Fukumizu K, Bach FR, Jordan MI (2009) Kernel dimension reduction in regression. Ann Stat 37:1871–1905
Gretton A, Borgwardt K, Rasch B, Schölkopf B, Smola A (2007) A kernel methods for the two sample problem. In: Advances in neural information processing systems, vol 19. MIT Press, Cambridge, pp 513–520
Lieb EH, Loss M (2001) Analysis. American Mathematical Society, New York
Mallat SG (1989) A theory for multiresolution signal decomposition: the wavelet representation. IEEE Trans Pattern Anal Mach Intell 11(7):674–693
Miller J, Thomas JB (1972) Detectors for discrete-time signals in non-Gaussian noise. IEEE Trans Inf Theory 18(2):241–250
Moulin P, Liu J (1999) Analysis of multiresolution image denoising schemes using generalized Gaussian and complexity priors. IEEE Trans Inf Theory 45(3):909–919
Müller A (1997) Integral probability metrics and their generating classes of functions. Adv Appl Probab 29:429–443
Rachev ST (1991) Probability metrics and the stability models. Wiley, Chichester
Schoenberg IJ (1938) Metric spaces and completely monotone functions. Ann. Math. (2) 39:811–841
Shorack GR (2000) Probability for statisticians. Springer, New York
Smola AJ, Gretton A, Song L, Schölkopf B (2007) A Hilbert space embedding for distributions. In: Proc. 18th international conference on algorithmic learning theory. Springer, Berlin, pp 13–31
Song G, Zhang H, Hickernell FJ (2013) Reproducing kernel banach spaces with the \(\ell ^1\) norm. Appl Comput Harmon Anal 34:96–116
Sriperumbudur BK, Gretton A, Fukumizu K, Schölkopf B, Lanckriet GRG (2009) On integral probability metrics, \(\phi \)-divergences and binary classification. Computing Research Repository. arXiv: 0901.2698v4
Sriperumbudur BK, Gretton A, Fukumizu K, Schölkopf B, Lanckriet GRG (2010) Hilbert space embeddings and metrics on probability measures. J Mach Learn Res 11:1517–1561
Sriperumbudur BK, Fukumizu K, Lanckriet GRG (2011) Learning in Hilbert vs. Banach spaces: a measure embedding viewpoint. In: Advances in neural information processing systems, vol 24. MIT Press, pp 1773–1781
Sriperumbudur BK, Fukumizu K, Lanckriet GRG (2011) Universality, characteristic kernels and RKHS embedding of measures. J Mach Learn Res 12:2389–2410
Steinwart I (2001) On the influence of the kernel on the consistency of support vector machines. J Mach Learn Res 2:67–93
Vajda I (1989) Theory of statistical inference and information. Kluwer Academic Publishers, Boston
van Mill J (1989) Infinite-dimensional topology, prerequisites and introduction. North-Holland math. library, vol 43. Elsevier, Amsterdam
Weaver N (1999) Lipschitz algebras. World Scientific Publishing Company, Singapore
Wendland H (2005) Scattered data approximation. Cambridge University Press, Cambridge
Wu ZM (1995) Compactly supported positive definite radial functions. Adv Comput Math 4(3):283–292
Zhang H, Zhao L (2013) On the inclusion relation of reproducing kernel Hilbert spaces. Anal Appl 11, 1350014
Zhang H, Xu Y, Zhang J (2009) Reproducing kernel Banach spaces for machine learning. J Mach Learn Res 10:2741–2775
Author information
Authors and Affiliations
Corresponding author
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
H. Zhang: Supported in part by NSFC under Grants 11571377 and 11971490, and by the Guangdong Natural Science Foundation (No. 2018A030313841). This work was completed while this author was visiting the Hong Kong University of Science and Technology.
Appendix
Appendix
We now prove the connectivity of the set \(F_\omega ^{-1}(s)\) in Theorem 3.6.
For if \(\omega =\mathbf{0}\), then \(F_\mathbf{0}(x_1,x_2,\dots ,x_d)=0\) for any \(x\in \mathbb {R}_{+}^d\). Therefore, \(F_\mathbf{0}^{-1}(0)=\mathbb {R}_{+}^d\) and for any \(s\ne 0\), \(F_\mathbf{0}^{-1}(s)\) is an empty set. In both cases, the set \(F_\mathbf{0}^{-1}(s)\) is connected.
If \(\omega \ne \mathbf{0}\), then there exists at least one coordinate \(\omega _i\ne 0\). For the case \(s=0\), it is easy to see that the set
is clearly a connected subset in \(\mathbb {R}_{+}^d\).
Now assume \(s> 0\), by scaling and permuting the coordinates we can assume further that \(\omega _1=\dots =\omega _n=1\) and \(\omega _{n+1}=\dots =\omega _d=0\), where \(n\le d\). We then use the polar coordinates to simplify the problem. That is, let
where \(r> 0\) and \(\theta _1,\dots ,\theta _{n-1}\in (0,\pi /2)\). Then we have
where \(\Theta (\theta _1,\ldots ,\theta _{n-1})=\sin ^{n-1}\theta _1\sin ^{n-2}\theta _2\cdots \sin \theta _{n-1}\cdot \cos \theta _1\cdots \cos \theta _{n-1}\).
For every fixed point \((\theta _1,\dots ,\theta _{n-1})\in (0,\pi /2)^{n-1}\) and fixed product \(p=x_{n+1}\cdots x_d\), there exists a unique solution \(r_s=r(\theta _1,\dots ,\theta _{n-1},p)>0\) to the equation
The inverse mapping theorem shows that \(r(\theta _1,\dots ,\theta _{n-1},p)\) is a continuous function with respect to the variables \(\theta _1,\dots ,\theta _{n-1},p\). We now show that this implies that the following set
is path connected, and therefore connected. Indeed, for any two distinct points \(a,b\in F_\omega ^{-1}(s)\), \(s>0\), assume their corresponding polar coordinates (without the r coordinate) are \(\alpha =(\theta _1',\dots ,\theta _{n-1}',a_{n+1},\dots ,a_d)\) and \(\beta =(\theta _1'',\dots ,\theta _{n-1}'',b_{n+1},\dots ,b_d)\), respectively. Let \(\gamma _t=(1-t)\alpha + t\beta \). Then by the definition of \(r_s\), we have for every \(t\in [0,1]\), \((r_s(\gamma _t),\gamma _t)\in F_\omega ^{-1}(s)\), and therefore \((r_s(\gamma _t),\gamma _t),~t\in [0,1]\) is a path from a to b.
Rights and permissions
About this article
Cite this article
Chen, L., Hotz, T. & Zhang, H. Admissible kernels for RKHS embedding of probability distributions. Stat Papers 62, 1499–1518 (2021). https://doi.org/10.1007/s00362-019-01144-5
Received:
Revised:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s00362-019-01144-5