Abstract
The Wasserstein distance on multivariate non-degenerate Gaussian densities is a Riemannian distance. After reviewing the properties of the distance and the metric geodesic, we present an explicit form of the Riemannian metric on positive-definite matrices and compute its tensor form with respect to the trace inner product. The tensor is a matrix which is the solution to a Lyapunov equation. We compute the explicit formula for the Riemannian exponential, the normal coordinate charts and the Riemannian gradient. Finally, the Levi-Civita covariant derivative is computed in matrix form, together with the differential equation for the parallel transport. While all computations are given in matrix form, we also discuss the use of a special moving frame.
1 Introduction
Given two probability measures \(\nu _1\) and \(\nu _2\) on \(\mathbb {R}^n\), with finite second moments, consider the set \(\mathscr {P}(\nu _1,\nu _2)\) of probability measures on the product sample space \(\mathbb {R}^{2n}\), such that the two n-dimensional margins have the prescribed distributions, \(X_1 \sim \nu _1\) and \(X_2 \sim \nu _2\). The index
as a measure of dissimilarity between distributions has been considered by many classical authors, e.g., C. Gini, P. Lévy, and M.R. Fréchet. There is considerable contemporary literature discussing the index W, which is usually called the Wasserstein distance; see, e.g., the monograph by C. Villani [37]. We also mention Y. Brenier [9] and R.J. McCann [27].
There is an important particular case, where the above problem reduces to the Monge transport problem. Borrowing the argument from M. Knott and C.S. Smith [18], assume \(\varPhi :\mathbb {R}^n \rightarrow \mathbb {R}\) is a smooth convex function and \(\nabla \varPhi (X_1) \sim \nu _2\). Clearly, the condition
turns out to be equivalent to \({{\mathrm{\mathbb E}}}_{\mu }\left[ X_1 \cdot \nabla \varPhi (X_1)\right] \ge {{\mathrm{\mathbb E}}}_{\mu }\left[ X_1 \cdot X_2\right] \). The latter inequality shows that the minimum quadratic distance is attained. In view of this new formulation, let \(\varPhi ^*\) be the convex conjugate of \(\varPhi \). By the Young inequality we have
as well as the Young equality
By assumption \(X_2 \sim \nabla \varPhi (X_1)\), so that
which proves that \(\nabla \varPhi (X_1)\) solves the Monge problem.
This argument, including an existence proof, is in Y. Brenier [9]. In the present paper we shall study the same problem where all the involved distributions are Gaussian. It would be feasible to reduce the Gaussian case to the general one. However, we resort to methods specially suited for this case.
1.1 The Gaussian case
Given two Gaussian distributions \(\nu _i={{\mathrm{N}}}_{n}\left( \mu _i,\varSigma _i\right) \), \(i=1,2\), consider the set \({\mathscr {G}}(\nu _1,\nu _2)\) of Gaussian distributions on \(\mathbb {R}^{2n}\) such that the two n-dimensional margins have the prescribed distributions, \(X_i \sim \nu _i\). The corresponding index is
Observe that if \(\mu _1=\mu _2=0\) and U is a symmetric matrix such that \(U\varSigma _1U = \varSigma _2\), then the previous argument applies by means of the convex function \(\varPhi (x) = \frac{1}{2} x^t U x\).
The value of \(W^2\) in Eq. (1) as a function of the mean and the dispersion matrix has been computed by some authors, in particular: I. Olkin and F. Pukelsheim [28], D. C. Dowson and B. V. Landau [12], C. R. Givens and R. M. Shortt [14], M. Gelbrich [13]. They found the (equivalent) forms
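As a quick numerical illustration of the closed form just cited, \(W^2 = \left\| \mu_1-\mu_2\right\|^2 + {{\mathrm{Tr}}}\left( \varSigma_1+\varSigma_2-2(\varSigma_1^{1/2}\varSigma_2\varSigma_1^{1/2})^{1/2}\right) \), here is a minimal Python/NumPy sketch; the function name `gaussian_w2` is our own illustrative choice, not part of the original papers.

```python
import numpy as np
from scipy.linalg import sqrtm

def gaussian_w2(mu1, sigma1, mu2, sigma2):
    """Squared L2-Wasserstein distance between N(mu1, sigma1) and N(mu2, sigma2)."""
    root1 = np.real(sqrtm(sigma1))
    cross = np.real(sqrtm(root1 @ sigma2 @ root1))
    return float(np.sum((mu1 - mu2) ** 2) + np.trace(sigma1 + sigma2 - 2.0 * cross))

# In the commuting (here diagonal) case the formula reduces to the squared
# Euclidean distance between the standard deviations.
mu = np.zeros(2)
s1 = np.diag([1.0, 4.0])
s2 = np.diag([4.0, 9.0])
print(gaussian_w2(mu, s1, mu, s2))  # (1-2)^2 + (2-3)^2 = 2
```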
Further interpretations of W are available. R. Bhatia et al. [8] showed that W is also the solution of constrained minimization problems for the Frobenius matrix norm \(\left\| M\right\| = \sqrt{{{\mathrm{Tr}}}\left( M^*M\right) }\), when \(\mu _1=\mu _2=0\). Especially,
Notice that \(\varSigma ^{1/2}U\) is the generic transformation of the standard Gaussian to the Gaussian with dispersion matrix \(\varSigma \).
Because of the exponent 2 in Eq. (1), the W distance is more precisely called \(L^2\)-Wasserstein distance. Other exponents or other distances could be used in the definition. The quadratic case is particularly relevant as W is a Riemannian distance. More references will be given later.
In an Information Geometry perspective, we can mimic the argument of the seminal paper by Amari [4], who derived the notion of both Fisher metric and natural gradient, from the second order approximation of the Kullback-Leibler divergence.
It will be shown (see Sect. 2) that the value \(W^2\) of Eq. (2) has the differential second-order expansion for small H:
where \(\mathscr {L} _ {\varSigma } \left[ H\right] = X\) is the solution to the Lyapunov equation \(X \varSigma + \varSigma X = H\).
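The operator \(\mathscr {L} _ {\varSigma }\) is directly computable. A minimal sketch, assuming SciPy is available: `scipy.linalg.solve_continuous_lyapunov(a, q)` solves \(aX + Xa^* = q\), which for symmetric \(a = \varSigma \) is exactly the equation above.

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

def lyapunov(sigma, h):
    """The operator L_sigma[h]: the symmetric solution X of X sigma + sigma X = h."""
    # SciPy solves a X + X a^T = q; with a = sigma symmetric this is our equation.
    return solve_continuous_lyapunov(sigma, h)

rng = np.random.default_rng(0)
a = rng.standard_normal((4, 4))
sigma = a @ a.T + 4 * np.eye(4)   # positive-definite Sigma
h = rng.standard_normal((4, 4))
h = h + h.T                       # symmetric right-hand side
x = lyapunov(sigma, h)
print(np.allclose(x @ sigma + sigma @ x, h))  # the defining equation holds
```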
The quadratic form in the RHS of Eq. (3) provides a candidate for the Riemannian inner product associated with the distance W. In addition, if f is a smooth real function defined on a small W-sphere, i.e., \(W(\varSigma ,\varSigma +H) = \epsilon \) for small \(\epsilon \), then the increment \(f(\varSigma +H)-f(\varSigma )\) is maximized along the direction
where here \(\nabla \) denotes the Euclidean gradient. The operator \({{\mathrm{grad}}}\) is Amari’s natural gradient, i.e., the Riemannian gradient.
It is remarkable that all the geometric objects in the previous equations may be expressed as matrix operations. In this paper, we systematically develop the Wasserstein geometry of Gaussian models according to such a formalism.
1.2 Relations with the literature on the general transport theory
The Wasserstein distance and its relevant geometry can also be studied non-parametrically for general distributions. We do not pursue this direction and refer to the monograph by C. Villani [37]. The \(L^2\)-Wasserstein metric geometry has been shown to be Riemannian by F. Otto [29, §4] and J. Lott [21]. Cf. the earlier account by J.D. Lafferty [19].
Let us briefly discuss Otto’s approach in the language of Information Geometry, i.e., with reference to S. Amari and H. Nagaoka [3]. In view of the non-parametric approach first introduced in [33], and denoting by \({\mathscr {M}}\) the set of n-dimensional Gaussian densities with zero mean, the vector bundle
is the Amari Hilbert bundle on \({\mathscr {M}}\). The Hilbert bundle contains the statistical bundle whose fibers consist of the scores \(\left. \frac{d}{dt} \log \rho (t) \right| _{t=0}\) for all smooth curves \(t \mapsto \rho (t) \in {\mathscr {M}}\) with \(\rho (0)=\rho \). In turn, the statistical bundle is the tangent space of \({\mathscr {M}}\) considered as an exponential manifold, see [32, 33].
In our present case, since the model \({\mathscr {M}}\) is an exponential family, the natural parameter is the concentration matrix \(C = \varSigma ^{-1}\). The log-likelihood is
If V is a symmetric matrix, the derivative of \(C \mapsto \log \rho (y;C)\) in the direction V is
where \(\phi (y;C) = \frac{1}{2}(C^{-1} - yy^*)\) is a symmetric matrix identified with a linear operator on symmetric matrices \({{\mathrm{Sym}}}\left( n\right) \), equipped with the Frobenius inner product. The fiber at \(\rho (\cdot ;C)\) consists of the vector space of functions \({{\mathrm{Tr}}}\left( \phi (\cdot ;C)V\right) \), \(V \in {{\mathrm{Sym}}}\left( n\right) \). The inner product in the Hilbert bundle, restricted to the parameterized statistical bundle, is the Fisher metric
The study of the Fisher metric in the Gaussian case has been done first by L.T. Skovgaard [35].
F. Otto [29, §1.3], who was motivated by the study of a class of partial differential equations, considered an inner product defined on smooth functions of the \(\rho \)-fiber of the Hilbert bundle, as
In the non-parametric case, Otto’s metric of Eq. (5) is related to the Wasserstein distance; for a detailed study of such a metric see J. Lott [21].
If we apply this definition to our score \({{\mathrm{Tr}}}\left( \phi (y;C)V\right) = {{\mathrm{Tr}}}\left( \frac{1}{2}(C^{-1}-yy^*)V\right) \) and \(V \in {{\mathrm{Sym}}}\left( n\right) \), the gradient is \(\nabla {{\mathrm{Tr}}}\left( \phi (y;C)V\right) = - V y\) and the metric becomes
The equivalence between the metric in Eq. (6) and the one in Eq. (4) can be seen by a change of parameterization both in \({\mathscr {M}}\) and in each fiber. First, one defines the inner product at \(\varSigma \) to be the inner product computed in the bijection \(\varSigma \leftrightarrow C\); this gives \({{\mathrm{Tr}}}\left( U \varSigma V\right) \), which is the form of the metric provided by A. Takatsu [36, Prop. A]. Second, one changes the parameterization on each fiber of the statistical bundle by \(U \mapsto U \varSigma + \varSigma U\). The resulting change of parameterization in the statistical bundle, \((C,U) \mapsto (C^{-1},UC^{-1}+C^{-1}U)\), whose inverse is \((\varSigma ,X) \mapsto (\varSigma ^{-1},\mathscr {L} _ {\varSigma } \left[ X\right] )\), produces the desired inner product.
We mention also that the Machine Learning literature discusses a divergence introduced by A. Hyvärinen [16], which is related to Otto’s metric. Precisely, in the concentration parameterization the Hyvärinen divergence is
and the second derivative of \(D \mapsto {\text {DH}}\left( D \vert C \right) \) at C is
In Statistics, Hyvärinen divergence is related to local proper scoring rules, see M. Parry et al. [31].
1.3 Overview
The first two sections of the paper are mostly a review of known material. In Sect. 2 we recall some properties of the space of symmetric matrices. In particular, we study the Riccati equation, the Lyapunov equation, and we calculate derivatives of the two mappings \({{\mathrm{sq}}}:A \mapsto A^2\) and \({{\mathrm{sqrt}}}:A \mapsto A^{1/2}\). The mapping \(\sigma :A \mapsto AA^*\), where A is a non-singular square matrix, is shown to be a submersion, and the horizontal vectors at each point are computed. Although our manifold is finite-dimensional, there is no need to choose a basis, as all operations of interest are matrix operations. For that reason, we rely on the language of non-parametric differential geometry of W. Klingenberg [17] and S. Lang [20].
In Sect. 3 we discuss known results about the metric geometry induced by the Wasserstein distance. These results are re-stated in Prop. 3 and, for the sake of completeness, we provide a further proof inspired by [12]. Prop. 4 provides an explicit metric geodesic, as done by R.J. McCann [27, Example 1.7].
The space of non-degenerate Gaussian measures (or, equivalently, the space of positive definite matrices) can be endowed with a Riemannian structure that induces the Wasserstein distance. This is elaborated in Sect. 4, where we use the presentation given by [36], cf. also [8], which in turn adapts to the Gaussian case the original work [29, §4].
The remaining part of the paper is offered as a new contribution to this topic. The Wasserstein Riemannian metric turns out to be
at each matrix \(\varSigma \), where U, V are symmetric matrices. By submersion methods we study the more general problem of the horizontal surfaces in \({\text {GL}}\left( n\right) \), characterized in Prop. 8. As a special case we obtain the Riemannian geodesic, which agrees with the metric geodesic of Sect. 3.
The explicit form of Riemannian exponential is obtained in Sect. 5. The natural (Riemannian) gradient is discussed in Sect. 6 and some applications to optimization are provided in Sect. 6.1. The analysis of the second-order geometry is treated in Sect. 7, where we compute the Levi-Civita covariant derivative, the Riemannian Hessian, and discuss other related topics. However, the curvature tensor will not be taken into consideration in the present paper.
In the final Sect. 8, we discuss the results in view of applications and in Information Geometry of statistical sub-models of the Gaussian manifold.
2 Symmetric matrices
The set \({\mathscr {G}}^{n}\) of Gaussian distributions on \({\mathbb {R}}^{n}\) is in 1-to-1 correspondence with the space of its parameters, \({\mathscr {G}}^{n}\ni {\text {N}}_{n}\left( \mu ,\varSigma \right) \leftrightarrow (\mu ,\varSigma )\in {\mathbb {R}}^{n}\times {\text {Sym}}^{+}\left( n\right) \). Moreover, \({\mathscr {G}}^{n}\) is closed under weak convergence and the identification is continuous in both directions. A reference for Gaussian distributions is the monograph by T.W. Anderson [6].
For ease of later reference, we recall a few results on spaces of matrices. General references are the monographs by P. R. Halmos [15], J. R. Magnus and H. Neudecker [22], and R. Bhatia [7].
The vector space of \(n\times m\) real matrices is denoted by \({\text {M}}(n\times m)\), while square matrices are denoted \({\text {M}}(n)={\text {M}}(n\times n)\). It is a Euclidean space of dimension nm, and the vectorization mapping \({\text {M}} (n\times m)\ni A\mapsto \mathbf {vec}\left( A\right) \in {\mathbb {R}}^{nm}\) is an isometry for the Frobenius inner product \(\left\langle A,B\right\rangle =(\mathbf {vec}\left( A\right) )^{*}(\mathbf {vec}\left( B\right) )={\text {Tr}}\left( AB^{*}\right) \).
Symmetric matrices \({{\mathrm{Sym}}}\left( n\right) \) form a vector subspace of M(n) whose orthogonal complement is the space of anti-symmetric matrices \({{\mathrm{Sym}}}^{\perp }\left( n\right) \). For symmetric matrices, we will find it convenient to use the equivalent inner product \( \left\langle A,B\right\rangle _{2}=\frac{1}{2}{\text {Tr}}\left( AB\right) \), see e.g. Eq. (18) below. The closed pointed cone of non-negative-definite symmetric matrices is denoted by \({{\mathrm{Sym}}}^+\left( n\right) \) and its interior, the open cone of positive-definite symmetric matrices, by \({{\mathrm{Sym}}}^{++}\left( n\right) \).
Given \(A,B\in {\text {Sym}}\left( n\right) \), the equation \(TAT=B\) is called the Riccati equation. If \(A\in {\text {Sym}}^{++}\left( n\right) \) and \(B\in {\text {Sym}}^{+}\left( n\right) \), then the equation \(TAT=B\) has a unique solution \(T\in {\text {Sym}}^{+}\left( n\right) \). In fact, from \(TAT=B\) it follows that \(A^{1/2}TA^{1/2}A^{1/2}TA^{1/2}=A^{1/2}BA^{1/2}\) and, in turn, \(A^{1/2}TA^{1/2}=\left( A^{1/2}BA^{1/2}\right) ^{1/2}\) because \(T \in {{\mathrm{Sym}}}^+\left( n\right) \). Hence, the solution to the Riccati equation is
Notice that \({{\mathrm{det}}}\left( T\right) = {{\mathrm{det}}}\left( A\right) ^{-1/2} {{\mathrm{det}}}\left( B\right) ^{1/2}\), consequently \({{\mathrm{det}}}\left( T\right) > 0\) if \({{\mathrm{det}}}\left( B\right) > 0\). In terms of random variables, if \(X \sim {{\mathrm{N}}}_{n}\left( 0,A\right) \) and \(Y \sim {{\mathrm{N}}}_{n}\left( 0,B\right) \), then T is the unique matrix of \({{\mathrm{Sym}}}^+\left( n\right) \) such that \(Y \sim TX\).
A more compact closed-form solution of the Riccati equation is available. Given \(A \in {{\mathrm{Sym}}}^{++}\left( n\right) \) and \(B \in {{\mathrm{Sym}}}^+\left( n\right) \), observe that \(AB = A^{1/2}(A^{1/2}BA^{1/2})A^{-1/2}\). By similarity, the eigenvalues of AB are non-negative, hence the square root
is well defined, see [7, Ex. 4.5.2]. Therefore, an equivalent formulation of Eq. (8) is
Since \(AB = A(BA)A^{-1}\), the eigenvalues of AB and BA are identical, so that the same argument used before yields too
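The three equivalent closed forms of the Riccati solution just derived, \(T = A^{-1/2}\left( A^{1/2}BA^{1/2}\right) ^{1/2}A^{-1/2} = A^{-1}(AB)^{1/2} = (BA)^{1/2}A^{-1}\), can be checked numerically. A sketch in Python/NumPy, with randomly generated positive-definite matrices of our own choosing:

```python
import numpy as np
from scipy.linalg import sqrtm

rng = np.random.default_rng(1)
def spd(n):
    m = rng.standard_normal((n, n))
    return m @ m.T + n * np.eye(n)   # well-conditioned positive definite

A, B = spd(3), spd(3)
rA = np.real(sqrtm(A))
irA = np.linalg.inv(rA)

T1 = irA @ np.real(sqrtm(rA @ B @ rA)) @ irA   # symmetric form
T2 = np.linalg.inv(A) @ np.real(sqrtm(A @ B))  # via (AB)^{1/2}
T3 = np.real(sqrtm(B @ A)) @ np.linalg.inv(A)  # via (BA)^{1/2}

# T solves the Riccati equation and all three expressions agree.
print(np.allclose(T1 @ A @ T1, B), np.allclose(T1, T2), np.allclose(T2, T3))
```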
The square mapping \({{\mathrm{sq}}}:A \mapsto A^2\) is a bijection of \({{\mathrm{Sym}}}^{++}\left( n\right) \) onto itself with derivative \(d_X {{\mathrm{sq}}}(A) = XA + AX\). Hence, the derivative operator \(d{{\mathrm{sq}}}(A)\) is invertible. An alternative notation for the derivative, which we find convenient to use now and then, is \(d_X {{\mathrm{sq}}}(A) = d {{\mathrm{sq}}}(A)[X]\).
For each assigned matrix \(V \in {{\mathrm{Sym}}}\left( n\right) \), the matrix \(X = (d {{\mathrm{sq}}}(A))^{-1} V\) is the unique solution X in the space \({{\mathrm{Sym}}}\left( n\right) \) to the Lyapunov equation
Its solution will be written \(X = \mathscr {L} _ {A} \left[ V\right] \). Clearly we have also
The Lyapunov operator itself can be seen as a derivative. In fact, the inverse of the square mapping \({{\mathrm{sq}}}\) is the square root mapping \({{\mathrm{sqrt}}}:\varSigma \rightarrow \varSigma ^{1/2}\). By the derivative-of-the-inverse rule,
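The identity produced by the derivative-of-the-inverse rule, namely \(d\,{{\mathrm{sqrt}}}(\varSigma )[V] = \mathscr {L} _ {\varSigma ^{1/2}} \left[ V\right] \), lends itself to a finite-difference check. A sketch under our own choice of test matrices:

```python
import numpy as np
from scipy.linalg import sqrtm, solve_continuous_lyapunov

rng = np.random.default_rng(2)
m = rng.standard_normal((3, 3))
sigma = m @ m.T + 3 * np.eye(3)            # positive definite
v = rng.standard_normal((3, 3)); v = v + v.T

# Central finite difference of sqrtm along direction v.
eps = 1e-6
fd = (np.real(sqrtm(sigma + eps * v)) - np.real(sqrtm(sigma - eps * v))) / (2 * eps)

# The claimed derivative: the Lyapunov operator evaluated at sigma^{1/2}.
deriv = solve_continuous_lyapunov(np.real(sqrtm(sigma)), v)

print(np.max(np.abs(fd - deriv)))  # small: finite differences agree
```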
If \(\varSigma \) is the dispersion of a non-singular Gaussian distribution, then \(C = \varSigma ^{-1} \in {{\mathrm{Sym}}}^{++}\left( n\right) \) is the concentration matrix and represents an alternative and useful parameterization. From the Lyapunov equation \(V = X\varSigma + \varSigma X\) we obtain \(\varSigma ^{-1}V\varSigma ^{-1} = \varSigma ^{-1}X + X\varSigma ^{-1}\), hence
Likewise, another useful formula is
There is also a relation between the Lyapunov equation and the trace. From \(X \varSigma + \varSigma X = V\), it follows \(\varSigma ^{-1} X \varSigma + X = \varSigma ^{-1}V\). Then
We will later need the derivative of the mapping \(A \mapsto \mathscr {L} _ {A} \left[ V\right] \), for a fixed V. Differentiating the first identity in Eq. (13) in the direction U, we have
Hence \(d_{U}\mathscr {L} _ {A} \left[ V\right] \) is the solution to the Lyapunov equation
so that we get
It will be useful in the following to evaluate the second derivative of the mapping \({{\mathrm{sqrt}}}:\varSigma \mapsto \varSigma ^{1/2}\). From Eqs. (14) and (17) it follows
The Lyapunov equation plays a crucial role, as the linear operator \({\mathscr {L}}_A\) enters the expression of the Riemannian metric with respect to the standard inner product, see Eq. (7). As a consequence, the numerical implementation of the inner product \(W_\varSigma (U,V)\) requires the computation of the matrix \(\mathscr {L} _ {\varSigma } \left[ U\right] \). There are many ways to write down the closed-form solution to Eq. (12); they are discussed in [7]. However, efficient numerical solutions are not based on the closed forms, but rely on specialized numerical algorithms, as discussed by E. L. Wachspress [38] and by V. Simoncini [34].
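One elementary closed form uses vectorization, \((I\otimes \varSigma + \varSigma \otimes I)\,\mathbf {vec}(X) = \mathbf {vec}(V)\); it costs \(O(n^6)\) and is only practical for small n, whereas library solvers (e.g., the Bartels-Stewart type routine behind SciPy's `solve_continuous_lyapunov`) are far more efficient. A sketch comparing the two, which also checks the trace identity \({{\mathrm{Tr}}}\left( \mathscr {L} _ {\varSigma } \left[ V\right] \right) = \frac{1}{2}{{\mathrm{Tr}}}\left( \varSigma ^{-1}V\right) \) derived above:

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

rng = np.random.default_rng(3)
m = rng.standard_normal((3, 3))
sigma = m @ m.T + 3 * np.eye(3)
v = rng.standard_normal((3, 3)); v = v + v.T

# Closed form via column-major (Fortran-order) vectorization.
n = sigma.shape[0]
kron = np.kron(np.eye(n), sigma) + np.kron(sigma, np.eye(n))
x_closed = np.linalg.solve(kron, v.reshape(-1, order="F")).reshape(n, n, order="F")

# Specialized solver.
x_scipy = solve_continuous_lyapunov(sigma, v)
print(np.allclose(x_closed, x_scipy))

# Trace identity: Tr(L_sigma[V]) = Tr(sigma^{-1} V) / 2.
print(np.isclose(np.trace(x_scipy), 0.5 * np.trace(np.linalg.solve(sigma, v))))
```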
We now turn to the computation of the second-order approximation of \(W^2\) in Eq. (2).
Fix \(\varSigma \in {{\mathrm{Sym}}}^{++}\left( n\right) \) and let \(H \in {{\mathrm{Sym}}}\left( n\right) \) so that \((\varSigma \pm H) \in {{\mathrm{Sym}}}^{++}\left( n\right) \). Hence, \(\varSigma + \theta H\in {{\mathrm{Sym}}}^{++}\left( n\right) \) for all \(\theta \in [-1,+1]\). Consider the expression of \(W^2\) with \(\mu _1=\mu _2=0\), \(\varSigma _1=\varSigma \), \(\varSigma _2=\varSigma +\theta H\), namely
By Eq. (14) and Eq. (16), the first-order derivative is
Observe that \(\left. \frac{d}{d\theta }W^2(\varSigma ,\varSigma +\theta H)\right| _{\theta =0} = 0\).
The second derivative is
with
so that
where Eq. (15) has been used. Finally, observe that
We can conclude that
Therefore, the bilinear form in the RHS suggests the form of the Riemannian metric to be derived.
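The second-order expansion can be tested numerically: the ratio \(W^2(\varSigma ,\varSigma +tH)/t^2\) should converge, as \(t \rightarrow 0\), to the quadratic form \(\frac{1}{2}{{\mathrm{Tr}}}\left( H\,\mathscr {L} _ {\varSigma } \left[ H\right] \right) \) (equivalently, \({{\mathrm{Tr}}}\left( \mathscr {L} _ {\varSigma } \left[ H\right] \varSigma \mathscr {L} _ {\varSigma } \left[ H\right] \right) \)). A sketch with matrices of our own choosing:

```python
import numpy as np
from scipy.linalg import sqrtm, solve_continuous_lyapunov

def w2(s1, s2):
    """Squared Wasserstein distance between zero-mean Gaussians."""
    r = np.real(sqrtm(s1))
    return float(np.trace(s1 + s2 - 2 * np.real(sqrtm(r @ s2 @ r))))

rng = np.random.default_rng(4)
m = rng.standard_normal((3, 3))
sigma = m @ m.T + 3 * np.eye(3)
h = rng.standard_normal((3, 3)); h = h + h.T

# Candidate quadratic form: (1/2) Tr(H L_sigma[H]).
quad = 0.5 * np.trace(h @ solve_continuous_lyapunov(sigma, h))
for t in (1e-1, 1e-2, 1e-3):
    print(w2(sigma, sigma + t * h) / t**2)  # approaches `quad` as t -> 0
print(quad)
```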
2.1 The mapping \(A \mapsto AA^*\)
We study now the extension of the square operation to general invertible matrices, namely the mapping \({{\mathrm{\sigma }}}: {\text {GL}}\left( n\right) \rightarrow {{\mathrm{Sym}}}^{++}\left( n\right) \), defined by \({{\mathrm{\sigma }}}(A) = AA^*\). The next proposition shows that this operation is a submersion. We first recall the definition, see [10, Ch. 8, Ex. 8–10] or [20, §II.2].
Let \({\mathscr {O}}\) be an open set of the Hilbert space H, and let \(f:{ \mathscr {O}}\rightarrow {\mathscr {N}}\) be a smooth submersion onto a manifold \({\mathscr {N}}\), i.e., assume that for each \(A\in { \mathscr {O}}\) the derivative at A, \(df(A):H\rightarrow T_{f(A)}{ \mathscr {N}}\), is surjective. In such a case, for each \(C\in {\mathscr {N}}\), the fiber \(f^{-1}(C)\) is a sub-manifold. Given a point \(A\in f^{-1}(C)\), a vector \(U\in H\) is called vertical if it is tangent to the manifold \( f^{-1}(C)\). Each such tangent vector U is the velocity at \(t=0\) of some smooth curve \(t\mapsto \gamma (t)\) with \(\gamma (0)=A\) and \({\dot{\gamma }}(0)=U\). From \(f(\gamma (t))=C\) for all t we derive the characterization of vertical vectors: \(df(A)[{\dot{\gamma }}(0)]=0\), i.e., the tangent space at A is \(T_{A}f^{-1}(f(A))={\text {Ker}}(df(A))\). The orthogonal complement of the tangent space \(T_{A}f^{-1}(f(A))\) is called the space of horizontal vectors at A,
Let us apply this argument to our specific case. Let \({\text {GL}}(n)\subset {\text {M}}(n)\) be the open set of invertible matrices; \({\text {O}}\left( n\right) \) the subgroup of \({\text {GL}}(n)\) of orthogonal matrices; \({\text {Sym}}^{\perp }\left( n\right) \) the subspace of \({\text {M}}(n)\) of anti-symmetric matrices.
Proposition 1
-
1.
For each given \(A\in {\text {GL}}(n)\) we have the orthogonal splitting
$$\begin{aligned} {\text {M}}(n)={\text {Sym}}\left( n\right) A\oplus {\text {Sym}}^{\perp }\left( n\right) (A^{*})^{-1}. \end{aligned}$$ -
2.
The mapping
$$\begin{aligned} \sigma :{\text {GL}}(n)\ni A\mapsto AA^{*}\in {\text {Sym}}^{++}\left( n\right) \end{aligned}$$has derivative at A given by \(d_{X}\sigma (A)=XA^{*}+AX^{*}\). It is a submersion with fibers
$$\begin{aligned} \sigma ^{-1}(C)=\left\{ C^{1/2}R \vert R\in {\text {O}}(n) \right\} . \end{aligned}$$ -
3.
The kernel of the differential is
$$\begin{aligned} {\text {Ker}}(d\sigma (A))={\text {Sym}}^{\perp }\left( n\right) (A^{*})^{-1}\ \end{aligned}$$and its orthogonal complement, \({\mathscr {H}}_{A}={\text {Ker}}(d\sigma (A))^{\perp },\) is
$$\begin{aligned} {\mathscr {H}}_{A}={\text {Sym}}\left( n\right) A. \end{aligned}$$ -
4.
The orthogonal projection of \(X \in M(n)\) onto \({\mathscr {H}}_A\) is \(\mathscr {L} _ {AA^*} \left[ XA^*+AX^*\right] A\).
Proof
We provide the proof here for the sake of completeness. See also [36] and [8].
-
1.
If \(\left\langle B,CA\right\rangle =0\) for all \(C\in {{\mathrm{Sym}}}\left( n\right) \), i.e., for all elements of \({{\mathrm{Sym}}}\left( n\right) A\), then \({\text {Tr}}\left( BA^{*}C\right) =0\), so that \( BA^{*}\in {\text {Sym}}^{\perp }\left( n\right) \), that is, \(B \in {{\mathrm{Sym}}}^{\perp }\left( n\right) (A^*)^{-1}\).
-
2.
Let the matrix A be an element in the fiber manifold \(\sigma ^{-1}(AA^{*})\). The derivative of \(\sigma \) at A, \(X \mapsto XA^{*}+AX^{*}\), is surjective, because for each \( W\in {\text {Sym}}\left( n\right) \) we have \(d\sigma (A)\left[ \frac{1}{2} W(A^{*})^{-1}\right] =W\). Hence \(\sigma \) is a submersion and the fiber \(\sigma ^{-1}(AA^*)=\left\{ (AA^*)^{1/2}R \vert R \in {\text {O}}(n) \right\} \) is a sub-manifold of \({\text {GL}}(n)\).
-
3.
Let us compute the splitting of \({{\mathrm{M}}}(n)\) into the kernel of \(d\sigma (A)\) and its orthogonal: \({\text {M}}(n)={\text {Ker}}(d\sigma (A))\oplus {\mathscr {H}}_{A}\). The vector space tangent to \(\sigma ^{-1}(AA^{*})\) at A is the kernel of the derivative at A:
$$\begin{aligned} {\text {Ker}}\left( d\sigma (A)\right)&= \left\{ X\in {\text {M}}(n)\mid XA^{*}+AX^{*}=0\right\} \\&= \left\{ X\in {\text {M}}(n)\mid (AX^{*})^{*}=-AX^{*}\right\} . \end{aligned}$$Therefore, \(X\in {\text {Ker}}(d\sigma (A))\) if, and only if, \(AX^{*}\in {\text { Sym}}^{\perp }\left( n\right) \), i.e., \({\text {Ker}}(d\sigma (A))={\text {Sym}} ^{\perp }\left( n\right) (A^{*})^{-1}\). We have just proved (point 1) that this implies \(\mathscr {H}_{A}= {{\mathrm{Sym}}}\left( n\right) A\).
-
4.
Consider the decomposition of X into the horizontal and the vertical part: \(X = C A + D (A^*)^{-1}\) with \(C \in {{\mathrm{Sym}}}\left( n\right) \) and \(D \in {{\mathrm{Sym}}}^{\perp }\left( n\right) \). By transposition, we get \(X^* = A^* C - A^{-1} D\). From the previous two equations, we obtain \(XA^* = C (AA^*) + D\) and \(AX^* = (AA^*)C - D\). The sum of these two equations is \(XA^* + AX^* = C(AA^*)+(AA^*)C\), which is a Lyapunov equation with solution \(C = \mathscr {L} _ {AA^*} \left[ XA^* + AX^*\right] \). It follows that the projection is \(CA = \mathscr {L} _ {AA^*} \left[ XA^* + AX^*\right] A\). \(\square \)
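The horizontal-vertical splitting of Proposition 1 can be verified numerically: the projection \(\mathscr {L} _ {AA^*} \left[ XA^*+AX^*\right] A\) should leave a remainder of the form \(D(A^*)^{-1}\) with D anti-symmetric, orthogonal to the horizontal part. A sketch with a random invertible A of our own choosing:

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

rng = np.random.default_rng(5)
A = rng.standard_normal((4, 4))   # generically invertible
X = rng.standard_normal((4, 4))

S = A @ A.T                        # sigma(A) = A A^*
C = solve_continuous_lyapunov(S, X @ A.T + A @ X.T)   # L_{AA^*}[X A^* + A X^*]
horizontal = C @ A
vertical = X - horizontal

# The vertical part is D (A^*)^{-1} with D anti-symmetric.
D = vertical @ A.T
print(np.allclose(D, -D.T))

# Horizontal and vertical parts are Frobenius-orthogonal.
print(abs(np.sum(horizontal * vertical)) < 1e-8)
```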
3 Wasserstein distance
The aim of this section is to discuss the Wasserstein distance for the Gaussian case as well as the equation for the associated metric geodesic. Most of its content is an exposition of known results.
3.1 Block-Gaussian
Let us suppose that the dispersion matrix \(\varSigma \in {{\mathrm{Sym}}}^+\left( 2n\right) \) is partitioned into \(n\times n\) blocks, and consider random variables X and Y such that
so that \(K_{ij}={\text {Cov}}\left( X_{i},Y_{j}\right) \) for \(i=1,\dots ,n\) and \( j=(n+1),\dots ,2n\). It follows that \(K_{ij}^{2} \le (\varSigma _{1})_{ii}(\varSigma _{2})_{jj}\), hence \(|K_{ij}| \le \frac{1}{2} \left( (\varSigma _{1})_{ii} + (\varSigma _{2})_{jj}\right) \), which in turn implies the bounds
For mean vectors \(\mu _{1},\mu _{2}\in {\mathbb {R}}^{n}\) and dispersion matrices \(\varSigma _{1},\varSigma _{2}\in {\text {Sym}}^{+}\left( n\right) \), define the set of jointly Gaussian distributions with given marginals to be
and the Gini dissimilarity index
Actually, in view of either of the bounds in Eq. (19), the set \({ \mathscr {G}}((\mu _{1},\varSigma _{1}),(\mu _{2},\varSigma _{2}))\) is compact and the \(\inf \) is attained.
It is easy to verify that
defines a distance on the space \({\mathscr {G}}_{n}\simeq {\mathbb {R}}^{n}\times {\text { Sym}}^{+}\left( n\right) \). The symmetry of W is clear, as is the triangle inequality, which follows by considering Gaussian distributions on \({\mathbb {R}} ^{n}\times {\mathbb {R}}^{n}\times {\mathbb {R}}^{n}\) with given marginals. To conclude, assume that the \(\min \) is reached at some \({\overline{\gamma }}\). Then
A further observation is that the distance W is homogeneous, i.e.,
3.2 Computing the quadratic dissimilarity index
We will present a proof as given by Dowson and Landau [12], but with some corrections.
Given \(\varSigma _{1},\varSigma _{2}\in {\text {Sym}}^{+}\left( n\right) \), each admissible K in (20) belongs to a compact set of \({\text {M}} (n) \) thanks to the bound (19), so the maximum of the function \(2 {\text {Tr}}\left( K\right) \) is attained. Therefore, we are led to study the problem
The value of the similar problem with \(\max \) replaced by \(\min \) will be denoted by \(\beta (\varSigma _{1},\varSigma _{2}).\)
Proposition 2
-
1.
Let \(\varSigma _{1},\varSigma _{2}\in {\text {Sym}}^{+}\left( n\right) \). Then
$$\begin{aligned} \alpha (\varSigma _{1},\varSigma _{2})=2{\text {Tr}}\left( \left( \varSigma _{1}^{1/2}\varSigma _{2}\varSigma _{1}^{1/2}\right) ^{1/2}\right) \text { and } \beta (\varSigma _{1},\varSigma _{2})=-\alpha (\varSigma _{1},\varSigma _{2}). \end{aligned}$$ -
2.
If moreover \({{\mathrm{det}}}\left( \varSigma _1\right) > 0\), then
$$\begin{aligned} \alpha (\varSigma _{1},\varSigma _{2}) = 2 {{\mathrm{Tr}}}\left( (\varSigma _1\varSigma _2)^{1/2}\right) . \end{aligned}$$
Proof
(point (1)) A symmetric matrix \(\varSigma \in {\text {Sym}}\left( 2n\right) \) is non-negative definite if, and only if, it is of the form \(\varSigma =SS^{*}\), with \(S\in {\text {M}}\left( 2n\right) \). Given the block structure of \( \varSigma \) in (21), we can write
where A and B are two matrices in \({\text {M}}(n\times 2n).\)
Therefore, problem (21) becomes
We have already observed that the optimum exists, so the necessary conditions of the Lagrange theorem allow us to characterize this optimum. However, the two constraints \(\varSigma _{1}=AA^{*}\) and \(\varSigma _{2}=BB^{ *} \) are not necessarily regular at every point (i.e., the Jacobian of the transformation may fail to be of full rank at some point), so we must take into account that the optimum could be an irregular point. To this purpose, as is customary, we shall adopt the Fritz John first-order formulation for the Lagrangian (see [25]).
We shall initially assume that both \(\varSigma _{1}\) and \(\varSigma _{2}\) are non-singular.
Let then \(\left( \nu _{0},\varLambda ,\varGamma \right) \in \left\{ 0,1\right\} \times {\text {Sym}}\left( n\right) \times {\text {Sym}}\left( n\right) \), \(\left( \nu _{0},\varLambda ,\varGamma \right) \ne \left( 0,0,0\right) \), where the symmetric matrices \(\varLambda \) and \(\varGamma \) are the Lagrange multipliers. The Lagrangian function will be
The first-order conditions of L lead to
In the case \(\nu _{0}=1,\) i.e., the case of stationary regular points, Eq. (22) becomes
which in turn implies
and further
Of course, Eqs. (24) could be more general than Eqs. (23) and thus possibly contain undesirable solutions. In this light, we establish the following facts, in which both matrices \(\varSigma _{1}\) and \( \varSigma _{2}\) must be nonsingular. Notice that in this case Eqs. (24) imply that both \(\varLambda \) and \(\varGamma \) are nonsingular as well.
Claim 1: If \((\varGamma ,\varLambda )\) is a solution to (24) and \(\varLambda ^{-1}=\varGamma \), then \((\varGamma ,\varLambda )\) is a pair of Lagrange multipliers of Problem (21).
Actually, let \(\varSigma _{1}=AA^{*}\), \(A\in {\text {M}}(n\times 2n)\) be any representation of the matrix \(\varSigma _{1}\). Define \(B=\varLambda A\) so that \( A=\varLambda ^{-1}B=\varGamma B\). Moreover
and so \(\left( \varLambda ,\varGamma \right) \) are multipliers associated with the feasible point (A, B).
Claim 2: The set of solutions to (24), such that \( \varGamma ^{-1}=\varLambda \), is not empty. In particular, there is a unique pair \( \left( {\widetilde{\varLambda }},{\widetilde{\varGamma }}\right) \) where both \( {\widetilde{\varLambda }}\) and \({\widetilde{\varGamma }}\) are positive definite.
We have already observed that Eqs. (24) imply that \(\varLambda \) and \( \varGamma \) are nonsingular. Moreover, we have \(\varGamma ^{-1}\varSigma _{1}\varGamma ^{-1}=\varSigma _{2}\). Recalling that the Riccati equation has one and only one solution in the class of positive definite matrices, we get \(X=\varLambda =\varGamma ^{-1}\).
Now we proceed to study the solutions to \(\varLambda \varSigma _{1}\varLambda = \varSigma _{2} \), and we shall show that Eq. (24) has infinitely many solutions. For each such \(\varLambda \), the value of the objective function is \(2{\text {Tr}}\left( K\right) =2{\text {Tr}} \left( \varSigma _{1}\varLambda \right) \). Therefore, we must select the matrix \( \varLambda \) such that \({\mathrm {Tr}}\left( \varSigma _{1}\varLambda \right) \) is maximized.
Following [12], we define
so that, in view of (24), we have
Moreover,
Eq. (25) shows that, though the Lagrangian can have many stationary points (i.e., many solutions \(\varLambda \)), the matrix \(R^{2}=\varSigma _{1}^{1/2} \varSigma _{2}\varSigma _{1}^{1/2}\in {\text {Sym}}^{+}\left( n\right) \) remains constant. This is not true of the value of the objective function \({\text {Tr}}\left( K\right) ={\text {Tr}}\left( R\right) \), which depends on R (i.e., on \(\varLambda \)).
Let
denote the spectral decomposition of \(R^{2}\); then the solutions R are of the form
with \(\varepsilon _{k}=\pm 1\). Hence \({\text {Tr}}\left( K\right) ={\text {Tr}} \left( R\right) \) will be maximized whenever \(\varepsilon _{k}\equiv 1\), and so \(R\in {\text {Sym}}^{+}\left( n\right) \). Clearly the objective function will be minimized if \(\varepsilon _{k}\equiv -1\); the proof of the \(\min \) statement proceeds similarly.
Hence the maximum of the trace occurs at
namely \(\varLambda =\varSigma _{1}^{-1/2}\left( \varSigma _{1}^{1/2}\varSigma _{2}\varSigma _{1}^{1/2}\right) ^{1/2}\varSigma _{1}^{-1/2}.\) Thanks to Claims 1-2 this matrix is a multiplier of the Lagrangian and so we would have
as long as the optimum is attained at a regular point. In fact, to complete the proof, we must still examine the case \(\nu _{0}=0\), for which Eq. (22) becomes
It follows
and consequently \(\varLambda =\varGamma =0\). Therefore there is no irregular point, provided \(\varSigma _{1}\) and \(\varSigma _{2}\) are not singular matrices. So we have proved the relation (26) under the above assumptions.
The last step is to extend our result to possibly singular matrices \(\varSigma _{1} \) and \(\varSigma _{2}\).
Given the two matrices \(\varSigma _{1},\varSigma _{2}\in {\text {Sym}}^{+}\left( n\right) \), set
If \(\varepsilon >0\), then
where \(\lambda _{i,j}\), \(j=1,\dots ,n\), are the eigenvalues of \(\varSigma _{i}\), \(i=1,2\). Let us consider the parametric programming problem
Observe that the feasible region is contained in a compact set independent of \(\varepsilon \in \left[ 0,1\right] \) because of the bound (19).
Now the continuity of the optimal value \(\varepsilon \mapsto \alpha (\varSigma _{1}(\varepsilon ),\varSigma _{2}(\varepsilon ))\) follows easily from Berge's maximum theorem, see for instance [2, Th. 17.31]. Hence
and the assertion is proved for any \(\varSigma _{1},\varSigma _{2}\in {\text {Sym}} ^{+}\left( n\right) \). \(\square \)
Proof
(point (2)) From Eq. (9) we have
\(\square \)
The following result provides exact lower and upper bounds for \(\mathbb {E}\left[ \left\| X-Y\right\| ^{2}\right] \).
Proposition 3
Let X, Y be multivariate Gaussian random variables taking values in \({\mathbb {R}} ^{n}\) and having means \(\mu _{1}\) and \(\mu _{2}\) and dispersion matrices \( \varSigma _{1}\) and \(\varSigma _{2}\) respectively. Then
If \(\det \varSigma _{1}\ne 0\), then the extremal values are attained at the joint distribution of
respectively, where \(T\in {\text {Sym}}^{+}\left( n\right) \) is the solution to the Riccati equation \(T\varSigma _{1}T=\varSigma _{2}\).
Proof
From Proposition 2 and Eq. (20), it follows
To check the extremal points it suffices to observe that, in view of relation (8):
Hence it is verified that the extremal values are attained at \(Y=\mu _{2}\pm T(X-\mu _{1})\). In the second form of the distribution we are using Eq. (10) and Eq. (11). \(\square \)
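Proposition 3 can be checked numerically. The sketch below (assuming NumPy; `spd_sqrt` and `random_spd` are our helpers) computes the positive-definite solution T of \(T\varSigma _{1}T=\varSigma _{2}\) and verifies that the couplings \(Y=\mu _{2}\pm T(X-\mu _{1})\) attain \(\left\| \mu _{1}-\mu _{2}\right\| ^{2}+{\text {Tr}}\left( \varSigma _{1}\right) +{\text {Tr}}\left( \varSigma _{2}\right) \mp 2{\text {Tr}}\left( R\right) \), using the closed form \(\mathbb {E}\left[ \left\| X-Y\right\| ^{2}\right] =\left\| \mu _{1}-\mu _{2}\right\| ^{2}+{\text {Tr}}\left( (I\mp T)\varSigma _{1}(I\mp T)\right) \).

```python
import numpy as np

rng = np.random.default_rng(1)

def spd_sqrt(M):
    # Symmetric positive-definite square root via eigendecomposition.
    w, V = np.linalg.eigh(M)
    return (V * np.sqrt(w)) @ V.T

def random_spd(n):
    X = rng.standard_normal((n, n))
    return X @ X.T + n * np.eye(n)

n = 3
S1, S2 = random_spd(n), random_spd(n)
mu1, mu2 = rng.standard_normal(n), rng.standard_normal(n)

# T: the positive-definite solution of the Riccati equation T S1 T = S2.
S1h = spd_sqrt(S1)
S1hi = np.linalg.inv(S1h)
T = S1hi @ spd_sqrt(S1h @ S2 @ S1h) @ S1hi
assert np.allclose(T @ S1 @ T, S2)

d2 = float((mu1 - mu2) @ (mu1 - mu2))
trR = np.trace(spd_sqrt(S1h @ S2 @ S1h))
tr12 = np.trace(S1) + np.trace(S2)

# Coupling Y = mu2 + s T (X - mu1), s = +1 (minimum) or s = -1 (maximum):
# E||X - Y||^2 = ||mu1 - mu2||^2 + Tr((I - sT) S1 (I - sT)).
for s, bound in [(+1, d2 + tr12 - 2 * trR), (-1, d2 + tr12 + 2 * trR)]:
    M = np.eye(n) - s * T
    assert np.isclose(d2 + np.trace(M @ S1 @ M), bound)
```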
The W-distance defines on \({\mathbb {R}}^{n} \times {{\mathrm{Sym}}}^{++}\left( n\right) \) a metric geometry with geodesics. This result is due to [27].
Proposition 4
The relation
defines a distance on \({\mathbb {R}}^{n}\times {\text {Sym}}^{+}\left( n\right) \). The geodesic from \((\mu _{1},\varSigma _{1})\) to \((\mu _{2},\varSigma _{2})\), with \((\mu _{1},\varSigma _{1}),(\mu _{2},\varSigma _{2})\in {\mathbb {R}}^{n}\times {\text {Sym}}^{++}\left( n\right) \), is the curve
where \(\mu (t) = (1-t)\mu _{1} + t\mu _{2}\) and
and T is the (unique) non-negative definite solution to the Riccati equation \(T\varSigma _{1}T=\varSigma _{2}\).
Proof
Clearly, \(\varGamma (0)=\left( \mu _{1},\varSigma _{1}\right) \) and \(\varGamma (1)=\left( \mu _{2},\varSigma _{2}\right) \). Let us compute the distance between \(\varGamma (0)\) and the point
We have
so that
and hence
We have
Collecting all the above results,
In conclusion,
\(\square \)
We end this section by adding a few remarks.
In metric spaces, the definition of geodesic we use here is related to the Menger convexity property, see [30, p. 78]. A stronger definition requires the proportionality of the distance between pairs of points on the curve, i.e.,
for \(s,t\in \left[ 0,1\right] \). It will be proved later that our geodesics do in fact enjoy this stronger property.
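The proportionality property can be observed numerically on the geodesic of Proposition 4 (matrix part only). The sketch below (assuming NumPy; the helpers are ours) uses the closed-form distance \(W^{2}(\varSigma _{1},\varSigma _{2})={\text {Tr}}\left( \varSigma _{1}\right) +{\text {Tr}}\left( \varSigma _{2}\right) -2{\text {Tr}}\left( (\varSigma _{1}^{1/2}\varSigma _{2}\varSigma _{1}^{1/2})^{1/2}\right) \) and checks \(W(\varGamma (s),\varGamma (t))=\left| t-s\right| W(\varSigma _{1},\varSigma _{2})\) at a few parameter pairs.

```python
import numpy as np

rng = np.random.default_rng(2)

def spd_sqrt(M):
    w, V = np.linalg.eigh(M)
    return (V * np.sqrt(np.clip(w, 0.0, None))) @ V.T

def random_spd(n):
    X = rng.standard_normal((n, n))
    return X @ X.T + n * np.eye(n)

def wdist(A, B):
    # Wasserstein distance between N(0, A) and N(0, B).
    Ah = spd_sqrt(A)
    return np.sqrt(np.trace(A) + np.trace(B) - 2 * np.trace(spd_sqrt(Ah @ B @ Ah)))

n = 3
S1, S2 = random_spd(n), random_spd(n)
S1h = spd_sqrt(S1)
S1hi = np.linalg.inv(S1h)
T = S1hi @ spd_sqrt(S1h @ S2 @ S1h) @ S1hi   # T S1 T = S2

def gamma(t):
    # Matrix part of the geodesic from S1 to S2.
    M = (1 - t) * np.eye(n) + t * T
    return M @ S1 @ M

d = wdist(S1, S2)
for s, t in [(0.0, 0.3), (0.2, 0.9), (0.25, 0.75)]:
    assert np.isclose(wdist(gamma(s), gamma(t)), abs(t - s) * d)
```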
Clearly, Proposition 4 still holds under the sole assumption that \(\varSigma _{1}\) is not singular, but the case in which both distributions are degenerate remains excluded.
The simplest example occurs when the two subspaces, \({\text {Range}}\varSigma _{1}\) and \({\text {Range}}\varSigma _{2}\), are orthogonal. In this case, for every joint distribution of the random vector (X, Y), with marginals \( X\sim {\text {N}}_{n}\left( 0,\varSigma _{1}\right) \) and \(Y\sim {\text {N}}_{n}\left( 0,\varSigma _{2}\right) \), the values of X and Y lie in orthogonal subspaces, so that \(XY^{*}=0\). Hence \(\left\| X-Y\right\| ^{2}=\left\| X\right\| ^{2}+\left\| Y\right\| ^{2}\), and
So any joint distribution (X, Y) attains the optimal value \(\sqrt{{\text {Tr}} \left( \varSigma _{1}\right) +{\text {Tr}}\left( \varSigma _{2}\right) }.\)
If we now define \(X(t)=(1-t)X+tY\), then
consequently X(t) is the geodesic joining the two random vectors X and Y.
The previous example can be extended by taking two singular matrices
where \(v\ne w\in \mathbb {R}^{n}\) and \(\left\| v\right\| =\left\| w\right\| =1\). Clearly, \({\text {Range}}\varSigma _{1}\cap {\text {Range}} \varSigma _{2}=\left\{ 0\right\} \) and they are one-dimensional spaces spanned by vectors v and w, respectively (it is not restrictive to assume \( v^{*}w\ge 0\), too). By Eq. (27),
Despite the singularity of these matrices, the point realizing the minimum in (20) can be found directly; it is the singular matrix in \({\text {Sym}}^{+}\left( 2n\right) \):
4 Wasserstein Riemannian geometry
We have seen how to compute the geodesic for the distance W. Since the component \(\mathbb {R}^n\) carries the standard Euclidean geometry, we focus on the geometry of the matrix part, i.e., we shall restrict our analysis to 0-mean distributions \({\text {N}}_{n}\left( 0,\varSigma \right) \). Moreover, \(\varSigma \) will be assumed to be positive definite. Our purpose is to endow the open set \({{\mathrm{Sym}}}^{++}\left( n\right) \) with a structure of Riemannian manifold whose metric tensor generates the Wasserstein distance. The Riemannian metric is obtained by pushing forward the Euclidean geometry of square matrices to the space of dispersion matrices via the mapping \(\sigma :A \mapsto AA^* = \varSigma \). This approach has been introduced by F. Otto [29] in the general non-parametric case and developed in the Gaussian case by A. Takatsu [36] and R. Bhatia [8].
In view of Prop. 1, \(\sigma :{\text {GL}} (n)\rightarrow {\text {Sym}}^{++}\left( n\right) \subset {\text {M}}(n)\) is a submersion and \({\mathscr {H}}_{A}={\text {Sym}}\left( n\right) A\) is the space of horizontal vectors at A.
We recall that a submersion \(f:{\text {GL}}(n)\rightarrow {\text {Sym}} ^{++}\left( n\right) \) is called Riemannian if for all A the differential restricted to horizontal vectors
is an isometry i.e.,
A linear isometry is always 1-to-1 and, if it is also onto, we can invert it and write
Conversely, the previous equation provides the definition of a metric on \({{\mathrm{Sym}}}^{++}\left( n\right) \) for which the submersion f is Riemannian.
If \(U_A\) is the projection of U on \({\mathscr {H}}_A\), then \(df(A)[U] = df(A)[U_A]\) and Eq. (28) becomes
In general, a submersion induces local diffeomorphisms from the horizontal spaces to the image manifold. In our case, the submersion \(\sigma \) provides a global parameterization of the manifold \({\text {Sym}}^{++}\left( n\right) \). Fix a matrix \(A\in {\text {GL}}(n)\) such that \(\sigma (A)=AA^{*}=\varSigma \), and consider the open convex cone
We denote by \(\sigma _{A}\) the restriction to \({\mathscr {H}}_{A}^{++}\) of \( \sigma \).
Proposition 5
For all \(A \in {\text {GL}}\left( n\right) \), the mapping
is a bijection, with inverse
Proof
For each \(C\in {\text {Sym}}^{++}\left( n\right) \), the equation
is a Riccati equation for \(BA^{-1}\). As \(B \in {\text { Sym}}^{++}\left( n\right) A\), we have \(BA^{-1}\in {\text {Sym}}^{++}\left( n\right) \) and
is the unique solution. \(\square \)
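The inversion of \(\sigma _{A}\) through the Riccati equation can be illustrated numerically. In the sketch below (assuming NumPy; `spd_sqrt` is our helper), given an invertible A with \(\varSigma =AA^{*}\) and a target \(C\in {\text {Sym}}^{++}\left( n\right) \), we solve \(T\varSigma T=C\) for \(T\in {\text {Sym}}^{++}\left( n\right) \) and set \(B=TA\), so that \(BB^{*}=C\) and \(BA^{-1}=T\) is symmetric positive definite, i.e., B lies in \({\mathscr {H}}_{A}^{++}\).

```python
import numpy as np

rng = np.random.default_rng(3)

def spd_sqrt(M):
    w, V = np.linalg.eigh(M)
    return (V * np.sqrt(w)) @ V.T

n = 3
A = rng.standard_normal((n, n)) + n * np.eye(n)   # a generic invertible matrix
Sigma = A @ A.T

X = rng.standard_normal((n, n))
C = X @ X.T + n * np.eye(n)                       # target in Sym^{++}(n)

# Inverse of sigma_A: solve T Sigma T = C in Sym^{++}(n), then put B = T A,
# so that B B^* = T Sigma T = C.
Sh = spd_sqrt(Sigma)
Shi = np.linalg.inv(Sh)
T = Shi @ spd_sqrt(Sh @ C @ Sh) @ Shi
B = T @ A

assert np.allclose(B @ B.T, C)                     # sigma(B) = C
assert np.allclose(B @ np.linalg.inv(A), (T + T.T) / 2)   # B A^{-1} symmetric
assert np.all(np.linalg.eigvalsh((T + T.T) / 2) > 0)      # and positive definite
```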
We come now to the point, i.e., the construction of a metric based on horizontal vectors at a given matrix \(\varSigma \). We are here using Prop. 1.
Proposition 6
The inner product
defines a metric on \({{\mathrm{Sym}}}^{++}\left( n\right) \) such that \(\sigma :A\mapsto AA^{*}\) is a Riemannian submersion.
Proof
Let \(X\in {\text {M}}(n)\) and consider the decomposition of \(X=X_{V}+X_{H}\) with \(X_{V}\) vertical at A and \(X_{H}\) horizontal at A. Then \(d\sigma (A)[X]=d\sigma (A)[X_{H}]\) and the restriction of the derivative \(d\sigma (A) \) to the vector space \({\mathscr {H}}_{A}\) of horizontal vectors at A is 1-to-1 onto the tangent space of \({\text {Sym}}^{++}\left( n\right) \) at \( AA^{*}\), that is, \({\text {Sym}}\left( n\right) \). For such a restriction, for each \(H\in {\mathscr {H}}_{A},\)
so that the inverse mapping of the restriction is given by
Let us push-forward the inner product from \({\mathscr {H}}_{A}\) to \( T_{AA^{*}}{\text {Sym}}^{++}\left( n\right) \).
From Eq. (29), we have
which depends on \(AA^{*}=\varSigma \) only. \(\square \)
The next proposition provides a useful tensorial form of the Wasserstein Riemannian metric.
Proposition 7
It holds
Proof
We have
and, taking the semi-sum of the first and the last term of the previous equation,
\(\square \)
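The tensorial form can be checked numerically. The sketch below (assuming NumPy; `lyap` is our eigenbasis-based solver for the Lyapunov equation \(L\varSigma +\varSigma L=U\), and we use the normalization \(W_{\varSigma }(U,V)=\left\langle \mathscr {L} _ {\varSigma } \left[ U\right] ,V\right\rangle _{2}=\frac{1}{2}{{\mathrm{Tr}}}\left( \mathscr {L} _ {\varSigma } \left[ U\right] V\right) \) stated later in the text) verifies that this expression agrees with the horizontal-lift form \({{\mathrm{Tr}}}\left( \mathscr {L} _ {\varSigma } \left[ U\right] \varSigma \mathscr {L} _ {\varSigma } \left[ V\right] \right) \) and is symmetric in U and V.

```python
import numpy as np

rng = np.random.default_rng(4)

def lyap(S, U):
    # L_S[U]: the unique solution L of  L S + S L = U  (S symmetric positive
    # definite), computed in the eigenbasis of S.
    w, V = np.linalg.eigh(S)
    return V @ ((V.T @ U @ V) / (w[:, None] + w[None, :])) @ V.T

def random_sym(n):
    X = rng.standard_normal((n, n))
    return X + X.T

n = 4
X = rng.standard_normal((n, n))
Sigma = X @ X.T + n * np.eye(n)          # point in Sym^{++}(n)
U, V = random_sym(n), random_sym(n)      # tangent vectors in Sym(n)

LU, LV = lyap(Sigma, U), lyap(Sigma, V)
assert np.allclose(LU @ Sigma + Sigma @ LU, U)   # defining Lyapunov equation

# Tensorial form of the metric ...
W_UV = 0.5 * np.trace(LU @ V)
# ... agrees with the horizontal-lift expression and is symmetric:
assert np.isclose(W_UV, np.trace(LU @ Sigma @ LV))
assert np.isclose(W_UV, 0.5 * np.trace(LV @ U))
```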
After having shown in Prop. 4 the existence of a metric geodesic for the Wasserstein distance, connecting a pair of matrices \(\varSigma _{1},\varSigma _{2} \in {{\mathrm{Sym}}}^{++}\left( n\right) \), we prove that the same curve is a Riemannian geodesic, see R.J. McCann [26] and also [8, 36].
More generally, let us discuss the existence of affine horizontal surfaces in \({\text {GL}}\left( n\right) \) and the existence of geodesically convex surfaces in \({{\mathrm{Sym}}}^{++}\left( n\right) \). As a particular case, the results give rise to the desired Riemannian geodesics.
A surface \(\theta \mapsto A(\theta ) \in {\text {GL}}\left( n\right) \), with \(\theta \in \varTheta \), where \(\varTheta \) is an open subset of \(\mathbb {R}^n\), is called horizontal for the submersion \(\sigma :A \mapsto AA^*\), if \(\partial /\partial {\theta _j} A(\theta ) \in {\mathscr {H}}_{A(\theta )}\) for each j and \(\theta \), i.e.,
A surface is horizontal if, and only if, every smooth curve which lies in it is horizontal.
Proposition 8
-
1.
The surface \(\varTheta \ni \theta \mapsto A(\theta ) \in {\text {GL}}\left( n\right) \) is horizontal for \(\sigma \) if, and only if,
$$\begin{aligned} \frac{\partial }{\partial \theta _j} A^*(\theta ) A(\theta )=A^*(\theta ) \frac{\partial }{\partial \theta _j} A(\theta ), \quad j=1,\dots ,k, \quad \theta \in \varTheta \ . \end{aligned}$$(31) -
2.
Let
$$\begin{aligned} A(\theta ) = A_0 + \sum _{i=1}^k \theta _i (A_i - A_0) , \quad \theta \in \varTheta , \end{aligned}$$(32)be a surface in \({\text {GL}}\left( n\right) \) with the k-simplex of \(\mathbb {R}^k\) contained in \(\varTheta \). The surface is horizontal if, and only if,
$$\begin{aligned} A^*_j A_i = A^*_i A_j , \quad i,j=0,\dots ,k . \end{aligned}$$ -
3.
Let be given \(\varSigma _0,\varSigma _1 \in {{\mathrm{Sym}}}^{++}\left( n\right) \) and choose \(A_0,A_1\) such that \(\varSigma _0 = A_0A_0^*\) and \(\varSigma _1 = A_1A_1^*\). The line
$$\begin{aligned} A(\theta ) = (1-\theta ) A_0+\theta A_1 \end{aligned}$$(33)is horizontal for \(\theta \) in an open interval containing 0 and 1 if, and only if, \(A_1 = TA_0\) with \(T \in {{\mathrm{Sym}}}^{++}\left( n\right) \). This implies T is the solution of the Riccati equation \(T\varSigma _0T = \varSigma _1\).
-
4.
Let be given \(\varSigma _j = A_jA_j^*\in {{\mathrm{Sym}}}^{++}\left( n\right) \), \(j=0,1,\dots ,k\). The surface
$$\begin{aligned} \theta \mapsto A_0 + \sum _{j=1}^k \theta _j (A_j-A_0) \end{aligned}$$is horizontal in an open set of parameters containing the k-simplex if, and only if, \(A_i = T_{ij} A_j\) with \(T_{ij} \in {{\mathrm{Sym}}}^{++}\left( n\right) \), \(i,j=0,\dots ,k\).
Proof
-
1.
Eq. (30) is equivalent to \(A^*(\theta )^{-1}\partial / \partial {\theta _j} A^*(\theta ) = \partial / \partial {\theta _j} A(\theta )A(\theta )^{-1}\) hence to \(\partial / \partial {\theta _j} A^*(\theta ) A(\theta ) = A^*(\theta ) \partial / \partial {\theta _j} A(\theta )\).
-
2.
For the surface in Eq. (32) we have \(\partial /\partial {\theta _j} A(\theta ) = A_j\) so that Eq. (31) becomes
$$\begin{aligned} A_j^* A(\theta )=A^*(\theta ) A_j, \quad j=1,\dots ,k\ , \quad \theta \in \varTheta . \end{aligned}$$If \(\theta = 0\), it holds \(A_j^*A_0 = A_0^*A_j\), \(j = 1,\dots ,k\). If \(\theta =e_i\), then it holds \(A_j^* A_i = A_i^* A_j\) for \(i,j=1,\dots ,k\). The converse holds by linearity.
-
3.
Assume \(\theta \mapsto A(\theta )\) of Eq. (33) is horizontal on \(\varTheta \). Then, from the previous item we know \(A_1^*A_0 = A_0^* A_1\). In turn, this implies \({A_0^*}^{-1}A_1^* = A_1 A_0^{-1}\), hence \(T = A_1A_0^{-1} \in {{\mathrm{Sym}}}\left( n\right) \). It follows \(T \varSigma _0 T = A_1A_0^{-1} \varSigma _0 (A_0^*)^{-1} A_1^* = \varSigma _1\). It remains to show that T is positive definite. Actually, it holds
$$\begin{aligned} (1 - \theta ) A_0 + \theta A_1 = \left( (1 - \theta ) I + \theta T\right) A_0 \in {\text {GL}}\left( n\right) , \quad \theta \in \varTheta . \end{aligned}$$If \(\lambda _i\) are eigenvalues of the matrix T, then the eigenvalues of the matrix \((1 - \theta ) I + \theta T\) are \((1-\theta ) + \theta \lambda _i\). As they are never zero for any \(\theta \in [0,1]\), it follows that no \(\lambda _i\) can be negative. The \(\lambda _i\) are not zero by assumption and the conclusion \(T \in {{\mathrm{Sym}}}^{++}\left( n\right) \) follows.
Conversely, if \(T \in {{\mathrm{Sym}}}^{++}\left( n\right) \) and \(TA_0 = A_1\), then \(A_1^* A_0 = A_0^* T A_0\) is symmetric. Consequently, for all \(\theta \) such that \((1-\theta )A_0+\theta A_1 \in {\text {GL}}\left( n\right) \) the curve is horizontal. On the other hand, \((1-\theta )I + \theta T\) is a convex combination of positive definite matrices for \(\theta \in [0,1]\), hence it is positive definite on an open interval containing [0, 1].
-
4.
The proof follows exactly the same arguments as in the two-point case of the previous item.\(\square \)
The previous proposition shows that the metric geodesic derived from the Wasserstein distance coincides with the geodesic we obtain from the submersion argument. Moreover, in the next Corollary we also characterize the existence of geodesically convex surfaces with given vertices.
Corollary 1
-
1.
Given \(\varSigma _0, \varSigma _1 \in {{\mathrm{Sym}}}^{++}\left( n\right) \), there exists an open interval \(\varTheta \supset [0,1]\) such that the curve
$$\begin{aligned} \varSigma (\theta ) = \left( (1-\theta )I + \theta T\right) \varSigma _0 \left( (1-\theta )I + \theta T\right) , \quad \theta \in \varTheta , \end{aligned}$$(34)is the Wasserstein Riemannian geodesic through \(\varSigma _0\) and \(\varSigma _1\), with \(T\varSigma _0T = \varSigma _1\).
-
2.
Let \(\varSigma _0, \dots ,\varSigma _k \in {{\mathrm{Sym}}}^{++}\left( n\right) \), there exists an open set \(\varTheta \) containing the k-simplex such that the surface
$$\begin{aligned} \varSigma (\theta ) = \left( I + \sum _{j=1}^k\theta _j(T_j-I)\right) \varSigma _0 \left( I + \sum _{j=1}^k\theta _j(T_j-I)\right) , \quad \theta \in \varTheta , \end{aligned}$$is the Wasserstein Riemannian geodesic surface through \(\varSigma _0, \dots ,\varSigma _k\) if, and only if, the matrices \(T_j\), which are the positive definite solutions of the Riccati equations \(T_j \varSigma _0T_j = \varSigma _j\), \(j=1,\dots ,k\), pairwise commute.
Proof
-
1.
Pick \(A_0 = \varSigma _0^{1/2} U\), with \(U \in {\text {O}}(n)\), and \(A_1 = TA_0\), where T is the positive definite solution of the Riccati equation \(T\varSigma _0T = \varSigma _1\) and so \(A_1A_1^* = \varSigma _1\). By Prop. 8, Item 3, \(\theta \mapsto A(\theta )\) is horizontal in \({\text {GL}}\left( n\right) \). Consequently, \(\varSigma (\theta )=A(\theta )A^*(\theta )\) is a geodesic.
-
2.
In view of Prop. 8, Item 4, \(T_{ij} = T_iT_j^{-1}\). The surface is horizontal if, and only if, each \(T_{ij}\) is symmetric, that is, \(T_iT_j^{-1} = T_j T_i^{-1}\), which, in turn, is equivalent to \(T_iT_j = T_jT_i\).\(\square \)
Unlike the two-point case, the commutativity condition puts severe restrictions on the set of matrices \(\varSigma _0,\dots ,\varSigma _k\) generating a geodesic surface, when \(k>1\). For instance, if \(\varSigma _0=I\), then we have \(T_i=\varSigma _i^{1/2}\). Hence, Corollary 1 entails that the matrices \(I,\varSigma _1,\dots ,\varSigma _k\) generate a geodesic surface if, and only if, they pairwise commute.
5 Wasserstein Riemannian exponential
We aim now at reformulating a Riemannian geodesic in terms of the exponential map. In other words, the purpose is that of writing the geodesic arc passing through a given point and having a given velocity at the point itself.
The velocity of the geodesic of Eq. (34) is
Using the horizontal lift \(\varSigma (\theta ) = A(\theta )A^*(\theta )\), the velocity turns out to be
where \(\dot{A}(\theta ) A^{-1}(\theta ) \in {{\mathrm{Sym}}}\left( n\right) \) by Eq. (30). Therefore,
In particular, the initial velocity is
and \(T - I = \mathscr {L} _ {\varSigma (0)} \left[ {{\dot{\varSigma }}}(0)\right] \).
Let us compute the norm of the velocity in the Riemannian metric. The value of \(W_{\varSigma (\theta )}\left( {{\dot{\varSigma }}},{{\dot{\varSigma }}}\right) \) at \(\varSigma (\theta )\) is
It is constant, as we expect from the definition by isometric submersion. Also, we can confirm that the length of the geodesic is
The last equality follows from the relation \(\varSigma _0^{1/2} T \varSigma _0^{1/2} = (\varSigma _0^{1/2}\varSigma _1\varSigma _0^{1/2})^{1/2}\).
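These two facts, constant speed and length equal to the Wasserstein distance, can be verified numerically. The sketch below (assuming NumPy; `spd_sqrt` and `lyap` are our helpers, and we use the normalization \(W_{\varSigma }(U,V)=\frac{1}{2}{{\mathrm{Tr}}}\left( \mathscr {L} _ {\varSigma } \left[ U\right] V\right) \)) evaluates the squared speed along the geodesic (34) at several values of \(\theta \) and compares it with \(W^{2}(\varSigma _{0},\varSigma _{1})\).

```python
import numpy as np

rng = np.random.default_rng(5)

def spd_sqrt(M):
    w, V = np.linalg.eigh(M)
    return (V * np.sqrt(w)) @ V.T

def lyap(S, U):
    # Solution L of  L S + S L = U  for S symmetric positive definite.
    w, V = np.linalg.eigh(S)
    return V @ ((V.T @ U @ V) / (w[:, None] + w[None, :])) @ V.T

def random_spd(n):
    X = rng.standard_normal((n, n))
    return X @ X.T + n * np.eye(n)

n = 3
S0, S1 = random_spd(n), random_spd(n)
S0h = spd_sqrt(S0)
S0hi = np.linalg.inv(S0h)
T = S0hi @ spd_sqrt(S0h @ S1 @ S0h) @ S0hi         # T S0 T = S1
I = np.eye(n)

w2 = np.trace(S0) + np.trace(S1) - 2 * np.trace(spd_sqrt(S0h @ S1 @ S0h))

for theta in [0.0, 0.25, 0.5, 1.0]:
    M = (1 - theta) * I + theta * T
    Sigma = M @ S0 @ M                              # geodesic point
    Sdot = (T - I) @ S0 @ M + M @ S0 @ (T - I)      # geodesic velocity
    speed2 = 0.5 * np.trace(lyap(Sigma, Sdot) @ Sdot)
    assert np.isclose(speed2, w2)                   # constant speed, equal to W^2
```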
By substituting Eq. (35) into the equation of the geodesic (34), we get
We are thus led to the following definition, see for example [1, pp. 101–102].
Definition 1
For any \(C\in {\text {Sym}}^{++}\left( n\right) \) and \(V\in {\text {Sym}}\left( n\right) \simeq T_{C}{\text {Sym}}^{++}\left( n\right) \), the Wasserstein Riemannian exponential is
Next proposition collects some properties of the Riemannian exponential.
Proposition 9
-
1.
All geodesics emanating from a point \(C\in {{\mathrm{Sym}}}^{++}\left( n\right) \) are of the form \(\varSigma (\theta )={\text {Exp}}_{C}\left( \theta V\right) \), with \(\theta \in J_{V}\), where \(J_{V}\) is the open interval about the origin:
$$\begin{aligned} J_{V}=\left\{ \theta \in \mathbb {R} \vert I+\theta \mathscr {L} _ {C} \left[ V\right] \in {{\mathrm{Sym}}}^{++}\left( n\right) \right\} \ . \end{aligned}$$ -
2.
The map \(V\mapsto {\text {Exp}}_{C}\left( V\right) ,\) restricted to the open set
$$\begin{aligned} \varTheta =\left\{ V\in {\text {Sym}}\left( n\right) :I+\mathscr {L}_{C}[V]\in {\text {Sym}}^{++}\left( n\right) \right\} , \end{aligned}$$is a diffeomorphism of \(\varTheta \) onto \({\text {Sym}}^{++}\left( n\right) \) with inverse
$$\begin{aligned} {\text {Log}}_{C}\left( B\right) = (BC)^{1/2} + (CB)^{1/2} - 2C \ ; \end{aligned}$$ -
3.
The derivative of the Riemannian exponential is
$$\begin{aligned} d_{X}\left( V\longmapsto {\text {Exp}}_{C}\left( V\right) \right) =X+\mathscr {L} _{C}[X]C\mathscr {L}_{C}[V]+\mathscr {L}_{C}[V]C\mathscr {L}_{C}[X]. \end{aligned}$$
Remark 1
Notice that \(I + \theta \mathscr {L} _ {C} \left[ V\right] = \mathscr {L} _ {C} \left[ \frac{1}{2} C^{-1}+\theta V\right] \) hence, \(\theta \in J_V\) if \(\frac{1}{2} C^{-1} + \theta V \in {{\mathrm{Sym}}}^{++}\left( n\right) \).
Clearly, \(0\in J_{V}\) and \({\text {Exp}}_C(0) = C\) and the maximal open interval containing 0 in which \({\text {Exp}}_C(\theta V) \in {{\mathrm{Sym}}}^{++}\left( n\right) \) is precisely \(J_V\). Moreover, the interval \(J_{V}\) is unbounded from the right, i.e., it is of the kind \(J_{V}=\left( {\bar{\theta }} ,+\infty \right) \), provided \(V\in {{\mathrm{Sym}}}^+\left( n\right) \). Likewise, \(J_{V}=\left( -\infty ,{\bar{\theta }}\right) \), if \(- V\in {{\mathrm{Sym}}}^+\left( n\right) \). Similarly, \(\varTheta \) is an open set containing the origin and so \(V\mapsto {\text {Exp}}_{C}\left( V\right) \) is a local diffeomorphism around the origin.
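The exponential and its inverse can be checked against each other numerically. The sketch below (assuming NumPy and SciPy's `sqrtm` for the non-symmetric square roots \((BC)^{1/2}\) and \((CB)^{1/2}\); `lyap` is our Lyapunov solver) computes \(B={\text {Exp}}_{C}\left( V\right) \) for a tangent vector V in \(\varTheta \) and verifies that \({\text {Log}}_{C}\left( B\right) =V\).

```python
import numpy as np
from scipy.linalg import sqrtm

rng = np.random.default_rng(6)

def lyap(S, U):
    # L_S[U]: solution of  L S + S L = U.
    w, V = np.linalg.eigh(S)
    return V @ ((V.T @ U @ V) / (w[:, None] + w[None, :])) @ V.T

n = 3
X = rng.standard_normal((n, n))
C = X @ X.T + n * np.eye(n)              # base point in Sym^{++}(n)
V0 = rng.standard_normal((n, n))
V = 0.1 * (V0 + V0.T)                    # small tangent vector, so V lies in Theta

L = lyap(C, V)
I = np.eye(n)
assert np.all(np.linalg.eigvalsh(I + L) > 0)   # I + L_C[V] in Sym^{++}(n)

# Riemannian exponential: Exp_C(V) = (I + L_C[V]) C (I + L_C[V]).
B = (I + L) @ C @ (I + L)

# Riemannian logarithm: Log_C(B) = (BC)^{1/2} + (CB)^{1/2} - 2C.
LogB = sqrtm(B @ C).real + sqrtm(C @ B).real - 2 * C
assert np.allclose(LogB, V)              # Log_C inverts Exp_C
```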
Since the geodesics are not defined for all values of the parameter \( t\in \mathbb {R}\), we infer that the Riemannian manifold \({\text {Sym}}^{++}\left( n\right) \) is geodesically incomplete. Of course this is not a surprising fact: \({\text {Sym}}^{++}\left( n\right) \) is not a complete metric space, and hence the Hopf–Rinow theorem implies that it cannot be geodesically complete, see M.P. do Carmo [10].
Proof
-
1.
Let
$$\begin{aligned} \varSigma (\theta ) = {\text {Exp}}_{C}\left( \theta V\right) =C+\theta V+\theta ^{2}\mathscr {L} _{C}[V]C\mathscr {L}_{C}[V] , \quad \theta \in J_{V} . \end{aligned}$$Clearly, \(\varSigma (0)=C\) and \({\dot{\varSigma }}(0)=V\). Pick a scalar \({\bar{\theta }}\in J_{V}\) and consider the two matrices \(\varSigma \left( 0\right) \) and \(\varSigma \left( {\bar{\theta }}\right) \) belonging to the curve \(\varSigma .\) Introduce the new parameterization \({\tilde{\varSigma }}\left( \tau \right) =\varSigma \left( \tau {\bar{\theta }}\right) \), so that \({\tilde{\varSigma }}\left( 0\right) =\varSigma \left( 0\right) \) and \({\tilde{\varSigma }}\left( 1\right) =\varSigma \left( {\bar{\theta }} \right) \). We have,
$$\begin{aligned} {\tilde{\varSigma }}\left( \tau \right) =C+\tau ({\bar{\theta }}V) + \tau ^{2}\mathscr {L} _ {C} \left[ {\bar{\theta }}V\right] C \mathscr {L} _ {C} \left[ {\bar{\theta }}V\right] \ . \end{aligned}$$(37)Setting \({\tilde{T}}-I=\mathscr {L} _ {C} \left[ {\bar{\theta }}V\right] \), we have \({\widetilde{T}} \in {{\mathrm{Sym}}}^{++}\left( n\right) \) and
$$\begin{aligned} {\tilde{T}} C {\tilde{T}} = (I+\mathscr {L} _ {C} \left[ {\bar{\theta }}V\right] ) C (I+\mathscr {L} _ {C} \left[ {\bar{\theta }}V\right] ) = {{\tilde{\varSigma }}}(1) , \end{aligned}$$and the Eq. (37) above becomes
$$\begin{aligned}&{\tilde{\varSigma }}\left( \tau \right) = C+\tau ({\tilde{T}}-I)C+\tau C({\tilde{T}}-I)+\tau ^{2}({\tilde{T}}-I)C({\tilde{T}}-I)= \\&\quad =\left[ \left( 1-\tau \right) I+\tau {\tilde{T}}\right] C\left[ \left( 1-\tau \right) I+\tau {\tilde{T}}\right] , \end{aligned}$$which is the geodesic connecting \(\varSigma (0) = {{\tilde{\varSigma }}}(0) = C\) to \({{\tilde{\varSigma }}}(1) = \varSigma ({{\bar{\theta }}})\).
-
2.
By Eq. (36), the solution to the Riccati equation
$$\begin{aligned} {\text {Exp}}_{C}\left( V\right) =(I+\mathscr {L}_{C}[V])C(I+\mathscr {L} _{C}[V])=B \end{aligned}$$is
$$\begin{aligned} I+\mathscr {L}_{C}[V]=C^{-1/2}(C^{1/2}BC^{1/2})^{1/2}C^{-1/2}\ \end{aligned}$$provided \(I+\mathscr {L}_{C}[V]\in {\text {Sym}}^{++}\left( n\right) \). This is true in a sufficiently small neighborhood \(\left\| V\right\| <r\) of the origin. The inversion of the operator \(\mathscr {L}_{C}[\cdot ]\) and Eq. (9) provide the desired formula for \({\text {Log}}_{C}\left( B\right) \).
-
3.
The derivative follows from a simple bilinear computation.
\(\square \)
The second order properties of the geodesic and the Riemannian exponential will be established in Sect. 7.6.
6 Natural gradient
We have found the form of the Riemannian metric associated with the Wasserstein distance. In turn, the inner product equals the second-order approximation of \(W^2\). This is a general fact, whose interpretation rests on characterizing the natural gradient of the metric as the solution to the problem
which allows the identification of the direction of maximal increase of the function f with the natural gradient, according to the name introduced by Amari [4], i.e., the Riemannian gradient as defined below.
The Riemannian gradient is the gradient with respect to the inner product of the metric. We denote by \(\nabla \) the gradient with respect to the inner product \(\left\langle \cdot ,\cdot \right\rangle _{2}\) and by \({{\mathrm{grad}}}\) the gradient with respect to the Riemannian metric. By Prop. 7, \(W_\varSigma (X,Y) = \left\langle \mathscr {L} _ {\varSigma } \left[ X\right] ,Y\right\rangle _{2}\), hence for each smooth scalar field \(\phi \) we have
where the second equality follows from the definition of \(\mathscr {L}_\varSigma \). Conversely,
The gradient flow of a smooth scalar field \(\phi \) is the flow generated by the vector field
that is, the flow of the differential equation
The gradient flow equation is the model for many optimization problems which are based on various discrete-time approximations of the gradient flow. It should be noted that the expression of the natural gradient in the Wasserstein Riemannian metric is simple and does not require any time-consuming operation, as is the case in optimization methods using the Fisher Riemannian metric. We do not discuss this issue here and refer to [1, 4, 24].
6.1 Gradient flow and optimization
With reference to the full Gaussian distribution, one can consider smooth functions defined on \({\mathbb {R}}^n\times {\text {Sym}}^{++}\left( n\right) \). The first component of the gradient does not require a special gradient as the Riemannian structure is the Euclidean one. The full gradient will thus have two components:
An important example is based on the gradient flow of the mean value of an objective function \(f :\mathbb {R}^n \rightarrow \mathbb {R}\). Its Euler scheme is used in optimization, see [1, Ch. 4] and [23]. In the second example in Sect. 6.2 we discuss the gradient flow of the entropy function of a centered Gaussian.
We call relaxation to the full Gaussian model of the objective function \(f:{\mathbb {R}}^{n}\rightarrow {\mathbb {R}}\) the function
If we included the Dirac measures in the Gaussian model, then \( f(x)=\phi (x,0)\) and the function \(\phi \) would actually be an extension of the given function. However, we consider only \(\varSigma \in {\text {Sym}}^{++}\left( n\right) \) in order to work with a function defined on our manifold.
There are two ways to calculate the expected value as a function of \(\mu \) and \(\varSigma \). Each of them leads to a distinct expression of the natural gradient.
The first one arises from the relation
which will lead to an equation for the gradient involving the derivatives of f. The second one uses
In the second case, the natural gradient will be achieved by an equation not involving the gradient of the function f. Both forms have their own field of application.
Consider the first approach, under standard conditions for differentiation under the expectation sign. We have
By means of Eq. (14), it is straightforward to compute \(d_{U}\left( \varSigma \mapsto \phi (\mu ,\varSigma )\right) \).
Note that \(\nabla f\) is a column vector, so \(\nabla ^{*}f\) is a row vector. We have
Under symmetrization (and setting \(X=\varSigma ^{1/2}Z+\mu \)):
It follows that
Calculating the natural gradient:
If we set \(\varXi =\mathbb {E}\left[ Z\nabla ^{*}f(X)+\nabla f(X)Z\right] \), the natural gradient admits the representation
We move on to consider the second procedure. Following the standard computation of the Fisher score and starting from the log-density \(p(x;\mu ,\varSigma )\) of \({\text {N}}_{n}\left( \mu ,\varSigma \right) \), we have
Denoting the partial derivative \(d_{u}\left( \mu \longmapsto \log p(x;\mu ,\varSigma )\right) \) as \(d_{u}\log p(x;\mu ,\varSigma )\), and the other derivative \(d_{U}\left( \varSigma \longmapsto \log p(x;\mu ,\varSigma )\right) \) as \(d_{U}\log p(x;\mu ,\varSigma )\), we get:
So that
and
At last, thanks to Eq. (38), the natural gradient of \(\phi (\mu ,\varSigma )\) will be
6.2 Entropy gradient flow
The flow of entropy can be easily calculated by Eq. (39). We have
The entropy does not depend on \(\mu \) so that \(\nabla _{1}\mathscr {E}(\mu ,\varSigma )=0\). Moreover (see [22, §8.3]) we know that \(\nabla \mathscr {E}(\varSigma )=\varSigma ^{-1}\), so that
The entropic flow will be solution to the equations
that is
The integral curve is defined for all t such that \(2t<\lambda _{*}\), \(\lambda _{*}\) being the minimum of the spectrum of \(\varSigma (0)\).
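The entropic flow can be illustrated numerically. In the sketch below (assuming NumPy), we use the fact that the identity \(W_{\varSigma }(X,Y)=\left\langle \mathscr {L} _ {\varSigma } \left[ X\right] ,Y\right\rangle _{2}\) gives \({{\mathrm{grad}}}\,\phi (\varSigma )=\nabla \phi (\varSigma )\varSigma +\varSigma \nabla \phi (\varSigma )\), so that \(\nabla \mathscr {E}(\varSigma )=\varSigma ^{-1}\) yields \({{\mathrm{grad}}}\,\mathscr {E}(\varSigma )=2I\); assuming the descent sign convention, consistent with the stated domain \(2t<\lambda _{*}\), the integral curve is \(\varSigma (t)=\varSigma (0)-2tI\).

```python
import numpy as np

rng = np.random.default_rng(7)

n = 4
X = rng.standard_normal((n, n))
Sigma0 = X @ X.T + n * np.eye(n)

# Natural gradient of the entropy: the Euclidean gradient is Sigma^{-1}, hence
# grad E(Sigma) = Sigma^{-1} Sigma + Sigma Sigma^{-1} = 2 I at every Sigma.
Si = np.linalg.inv(Sigma0)
grad = Si @ Sigma0 + Sigma0 @ Si
assert np.allclose(grad, 2 * np.eye(n))

# Assumed descent flow: Sigma(t) = Sigma(0) - 2 t I, which stays positive
# definite exactly while 2t < lambda_min(Sigma(0)).
lam = np.linalg.eigvalsh(Sigma0).min()
Sig = lambda t: Sigma0 - 2 * t * np.eye(n)
assert np.all(np.linalg.eigvalsh(Sig(0.49 * lam)) > 0)
assert not np.all(np.linalg.eigvalsh(Sig(0.51 * lam)) > 0)
```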
7 Second order geometry
Recall that \({{\mathrm{Sym}}}^{++}\left( n\right) \) is an open set of the Hilbert space \({{\mathrm{Sym}}}\left( n\right) \), endowed with the inner product \(\left\langle X,Y\right\rangle _{2} = \frac{1}{2} {{\mathrm{Tr}}}\left( XY\right) \). Prop. 7 states that the Wasserstein Riemannian metric W can be expressed through the inner product of \({{\mathrm{Sym}}}\left( n\right) \), as
for each \((\varSigma ,X)\) and \((\varSigma ,Y)\) in the trivial tangent bundle \(T {{\mathrm{Sym}}}^{++}\left( n\right) \simeq {{\mathrm{Sym}}}^{++}\left( n\right) \times {{\mathrm{Sym}}}\left( n\right) \). In the equation above, \({\mathscr {L}} :{{\mathrm{Sym}}}^{++}\left( n\right) \mapsto L({{\mathrm{Sym}}}\left( n\right) ,{{\mathrm{Sym}}}\left( n\right) )\) is the field of linear operators defining the Wasserstein metric with respect to the standard inner product.
In the trivial chart, a smooth vector field X is a smooth mapping \(X :{{\mathrm{Sym}}}^{++}\left( n\right) \rightarrow {{\mathrm{Sym}}}\left( n\right) \). The action of the vector field X on the scalar field f, that is, Xf, is expressed in the trivial chart by \(d_Xf\), i.e., the scalar field whose value at the point \(\varSigma \) is the derivative of f in the direction \(X(\varSigma )\). Similarly, \(d_YX\) denotes the vector field whose value at the point \(\varSigma \) is the derivative at \(\varSigma \) of X in the direction \(Y(\varSigma )\). The Lie bracket [X, Y] of two smooth vector fields X, Y is given by \(d_XY - d_YX\).
7.1 The moving frame
While we prefer to express our computation by matrix algebra, in some cases it may be useful to employ a vector basis. Let us now introduce a field of vector bases of particular interest.
The set of symmetric matrices
\(e_p\) being the p-th element of the standard basis of \(\mathbb {R}^n\), spans the vector space \({{\mathrm{Sym}}}\left( n\right) \). Notice that \({{\mathrm{Tr}}}\left( E^{p,q}\right) = 2 \delta _{p,q}\), where \(\delta \) is the Kronecker symbol. To avoid repeated elements, a unique enumeration is obtained by taking indexes in the set A of subsets of \(\left\{ 1,\dots ,n\right\} \) having one or two elements.
The generating set of Eq. (40) is related to the symmetric product of matrices by the equation
where \(\delta \) is the Kronecker symbol.
In particular, if we take the trace of the equation above, we get
which in turn implies
In the sequel, we denote by \((E^\alpha )_{\alpha \in A}\) the vector basis above, properly normalized to obtain an orthonormal basis. We do not write down the normalizing constants in order to simplify the notation.
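The generating set can be built and checked numerically. The sketch below (assuming NumPy; the realization \(E^{p,q} = e_p e_q^* + e_q e_p^*\) is our assumption, consistent with the stated property \({{\mathrm{Tr}}}\left( E^{p,q}\right) = 2\delta _{p,q}\)) verifies the trace identity, the cardinality \(\left| A\right| = n(n+1)/2\), and the spanning property.

```python
import numpy as np
from itertools import combinations_with_replacement

n = 3

def E(p, q):
    # Assumed realization of the generators: E^{p,q} = e_p e_q^* + e_q e_p^*.
    M = np.zeros((n, n))
    M[p, q] += 1.0
    M[q, p] += 1.0
    return M

idx = list(combinations_with_replacement(range(n), 2))   # the index set A
assert len(idx) == n * (n + 1) // 2

# Tr(E^{p,q}) = 2 delta_{p,q}:
for p, q in idx:
    assert np.trace(E(p, q)) == (2.0 if p == q else 0.0)

# The family spans Sym(n): the flattened generators have full rank.
flat = np.array([E(p, q).ravel() for p, q in idx])
assert np.linalg.matrix_rank(flat) == n * (n + 1) // 2
```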
For each \(\varSigma \in {{\mathrm{Sym}}}^{++}\left( n\right) \) the sequence
is a vector basis of \({{\mathrm{Sym}}}\left( n\right) \simeq T_\varSigma {{\mathrm{Sym}}}^{++}\left( n\right) \), because it is the image of a vector basis under a linear mapping which is onto. We will call such a sequence of vector fields the (principal) moving frame.
Notice the following properties:
At a generic point \(\varSigma \), we can express each \(\mathscr {E}^{\alpha }\) in the \((E^\beta )_\beta \)’s orthonormal basis as
Since
the matrix \([g_{\alpha ,\beta }]_{\alpha ,\beta }\) is the expression of the Riemannian metric in such a moving frame. Namely, if X, Y are vector fields expressed in the moving frame as \(X = \sum _\alpha x_\alpha \mathscr {E}^{\alpha }\) and \(Y = \sum _\beta y_\beta \mathscr {E}^{\beta }\), then
This expression of the inner product is to be compared to that used in [36].
In this way, any vector field X has two representations: one with respect to the moving frame \((\mathscr {E}^{\alpha })_\alpha \) and another one with respect to the basis \((E^\alpha )_\alpha \). These two representations are related to each other as follows. We have
so that
hence, by applying the inverse matrix \([g^{\alpha ,\beta }(\varSigma )]=[g_{\alpha ,\beta }(\varSigma )]^{-1}\), we have
For example, \(\mathscr {L} _ {\varSigma } \left[ V\right] = \sum _\alpha \ell _{\varSigma }^\alpha (V) \mathscr {E}^{\alpha }(\varSigma )\), with
7.2 Covariant derivative in the moving frame
If X and Y are vector fields, denote by \(D_YX\) the action of a covariant derivative, namely, a bilinear operator satisfying, for each scalar field f, the following two conditions:
- (CD1) :
-
\(D_{fY}X = fD_Y X\) ,
- (CD2) :
-
\(D_Y (fX) = (d_Yf) X + f D_Y X\) .
see, e.g., [10, Sect. 3] or [20, Ch. 8.4].
A convenient way to express a covariant derivative in the moving frame (41) is to define Christoffel symbols in the moving frame as
Each \(\varGamma _{\alpha ,\beta }^\gamma \) is to be computed by means of Eq. (43).
If \(X=\sum _\alpha x_\alpha \mathscr {E}^{\alpha }\) and \(Y = \sum _\beta y_\beta \mathscr {E}^{\beta }\), by using (CD1), (CD2), and Eq. (42), we obtain
The inner product of \(D_XY\) and \(Z = \sum _\delta z_\delta \mathscr {E}^{\delta }\) is
7.3 Levi-Civita derivative
The Levi-Civita (covariant) derivative is the unique covariant derivative D that, for all vector fields X, Y, Z, is:
- (LC1) compatible with the metric, \(d_{X}W(Y,Z)=W(D_X Y,Z) + W(Y,D_{X}Z)\);
- (LC2) torsion-free, \(D_{Y}X-D_{X}Y = [X,Y] = d_Y X - d_X Y\).
In order to keep a compact notation, it will be convenient to make use of the symmetrization of a matrix \(A \in {{\mathrm{M}}}\left( n\right) \), defined by \(\left\{ A\right\} _S = \frac{1}{2}\left( A+A^*\right) \). If either A or B is symmetric, then \({{\mathrm{Tr}}}\left( \left\{ A\right\} _SB\right) = {{\mathrm{Tr}}}\left( AB\right) \).
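Since the identity \({{\mathrm{Tr}}}\left( \left\{ A\right\} _SB\right) = {{\mathrm{Tr}}}\left( AB\right) \) is used repeatedly below, a quick numerical sanity check may help (a Python sketch of ours; all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 4))               # arbitrary square matrix
B = rng.standard_normal((4, 4)); B = B + B.T  # symmetric matrix
sym_A = 0.5 * (A + A.T)                       # the symmetrization {A}_S

# Tr({A}_S B) = Tr(A B) whenever B is symmetric, because the
# antisymmetric part of A is trace-orthogonal to symmetric matrices.
assert np.isclose(np.trace(sym_A @ B), np.trace(A @ B))
```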
We denote by X, Y, Z smooth vector fields on \({{\mathrm{Sym}}}^{++}\left( n\right) \) and we shall use frequently the derivative of the vector field \(\varSigma \mapsto \mathscr {L} _ {\varSigma } \left[ X\right] \). In view of Eq. (17) and under our notation for the symmetrization, we have
Proposition 10
The Levi-Civita derivative \(D_{X}Y\) is implicitly defined by
while the Levi-Civita derivative itself is given by
Proof
In our case, Eq. MD3 of [20, p. 205] becomes
By Eq. (17) we have
and, analogously,
This way, Eq. (45) becomes the first part of Eq. (44).
The second part of Eq. (44) is then easily obtained. For instance,
Regarding the explicit formula of the Levi-Civita derivative (10), observe that
Moreover,
Therefore, Eq. (44) can be written as
and the desired result follows. \(\square \)
Observe that we have computed the Levi-Civita covariant derivative using its explicit expression in terms of the derivatives of the metric. However, it is easy to check the result directly using the properties of the Lyapunov operator.
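Such a direct check can also be carried out numerically. The display equations are omitted from this excerpt, so the explicit form used below is our reconstruction: in the trivial chart, the Christoffel operator consistent with the Wasserstein geodesics \(t \mapsto (I+tL)\varSigma (I+tL)\) is \(\varGamma (\varSigma ;X,Y) = -\left( \mathscr {L} _ {\varSigma } \left[ X\right] \varSigma \mathscr {L} _ {\varSigma } \left[ Y\right] + \mathscr {L} _ {\varSigma } \left[ Y\right] \varSigma \mathscr {L} _ {\varSigma } \left[ X\right] \right) \), so that \(D_XY = d_XY + \varGamma (\varSigma ;X,Y)\). The Python sketch below verifies the metric compatibility (LC1) by finite differences for constant vector fields, for which (LC2) is automatic since \(\varGamma \) is symmetric.

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

def lyap(Sigma, V):
    # L_Sigma[V]: symmetric solution of Sigma L + L Sigma = V
    return solve_continuous_lyapunov(Sigma, V)

def W(Sigma, X, Y):
    # Wasserstein metric in the paper's form <L_Sigma[X], Y>_2
    return np.trace(lyap(Sigma, X) @ Y)

def Gamma(Sigma, X, Y):
    # candidate Christoffel operator (our reconstruction, see lead-in)
    LX, LY = lyap(Sigma, X), lyap(Sigma, Y)
    return -(LX @ Sigma @ LY + LY @ Sigma @ LX)

def rand_sym(n, rng):
    M = rng.standard_normal((n, n))
    return M + M.T

n, h = 3, 1e-5
rng = np.random.default_rng(2)
A = rng.standard_normal((n, n))
Sigma = A @ A.T + n * np.eye(n)
X, Y, Z = (rand_sym(n, rng) for _ in range(3))

# d_X W(Y, Z) by central finite differences in the trivial chart
lhs = (W(Sigma + h * X, Y, Z) - W(Sigma - h * X, Y, Z)) / (2 * h)
# W(D_X Y, Z) + W(Y, D_X Z), with d_X Y = d_X Z = 0 for constant fields
rhs = W(Sigma, Gamma(Sigma, X, Y), Z) + W(Sigma, Y, Gamma(Sigma, X, Z))
assert np.isclose(lhs, rhs, rtol=1e-4)
```

Since a torsion-free connection with the correct geodesics is unique, agreement of this check with (LC1) singles out the Levi-Civita derivative.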
7.4 Levi-Civita derivative in a moving frame
Let us write the Levi-Civita derivative explicitly in the moving frame (41). Note that \(X(\varSigma ) = \mathscr {E}^{\alpha }(\varSigma ) = E^\alpha \varSigma + \varSigma E^\alpha \) and \(Y(\varSigma ) = \mathscr {E}^{\beta }(\varSigma ) = E^\beta \varSigma + \varSigma E^\beta \) are vector fields.
Proposition 11
For the Levi-Civita covariant derivative D, it holds
Proof
Eq. (10) yields
We are going to compute one by one the three terms in this equation.
The first term of Eq. (46) is
The second one is
Their sum is
The third term is
\(\square \)
The computation of the Christoffel symbols \(\sum _\gamma \varGamma _{\alpha ,\beta }^\gamma \mathscr {E}^{\gamma }= D_{\mathscr {E}^{\alpha }}\mathscr {E}^{\beta }\) would require the solution of the equations
We do not discuss that here.
Instead, let us now take \(X = x_\alpha \mathscr {E}^{\alpha }\) and \(Y=y_\beta \mathscr {E}^{\beta }\). Properties (CD1) and (CD2) lead to
Finally, for general X and Y,
which is the desired result.
7.5 Parallel transport
The expression of the Levi-Civita derivative in Eq. (44) can be re-written as
where \(\varGamma (\varSigma ;\cdot ,\cdot )\) is the symmetric tensor field defined by
We have
and, on the diagonal,
\(\varGamma (\varSigma ;X,Y)\) is the expression in the trivial chart of the Christoffel symbol of the Levi-Civita derivative as in [17]. In [20], \(-\varGamma \) is called the spray of the Levi-Civita derivative.
Given the Christoffel symbol, the linear differential equation of the parallel transport along a curve \(t \mapsto \varSigma (t)\) is
see [20, VIII, §3 and §4]. Recall that the parallel transport for the Levi-Civita derivative is isometric.
We do not discuss here the representation in the moving frame of the parallel transport equation above. We limit ourselves to mentioning that the action of the Christoffel symbol on vector fields expressed in the moving frame can be computed from
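The parallel transport equation can nonetheless be integrated numerically. The sketch below (ours; it assumes the reconstructed Christoffel operator \(\varGamma (\varSigma ;X,Y) = -\left( \mathscr {L} _ {\varSigma } \left[ X\right] \varSigma \mathscr {L} _ {\varSigma } \left[ Y\right] + \mathscr {L} _ {\varSigma } \left[ Y\right] \varSigma \mathscr {L} _ {\varSigma } \left[ X\right] \right) \) and the geodesic form \(\varSigma (t) = (I+tL_0)\varSigma _0(I+tL_0)\)) integrates \(\dot{U} = -\varGamma (\varSigma ;\dot{\varSigma },U)\) with a classical Runge-Kutta scheme and checks that the transport is isometric, as recalled above.

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

def lyap(S, V):
    return solve_continuous_lyapunov(S, V)   # Sigma L + L Sigma = V

def W(S, X, Y):
    return np.trace(lyap(S, X) @ Y)          # Wasserstein metric

n = 3
rng = np.random.default_rng(3)
A = rng.standard_normal((n, n))
Sigma0 = A @ A.T + n * np.eye(n)
M = rng.standard_normal((n, n)); Vgeo = 0.1 * (M + M.T)  # small geodesic velocity
L0 = lyap(Sigma0, Vgeo)

def Sigma(t):        # geodesic (I + t L0) Sigma0 (I + t L0)
    At = np.eye(n) + t * L0
    return At @ Sigma0 @ At

def Sigmadot(t):     # its velocity
    At = np.eye(n) + t * L0
    return L0 @ Sigma0 @ At + At @ Sigma0 @ L0

def rhs(t, U):       # Udot = -Gamma(Sigma; Sigmadot, U)
    S = Sigma(t)
    LS, LU = lyap(S, Sigmadot(t)), lyap(S, U)
    return LS @ S @ LU + LU @ S @ LS

M2 = rng.standard_normal((n, n)); U = M2 + M2.T
norm0 = W(Sigma(0.0), U, U)
t, dt = 0.0, 1e-3
for _ in range(1000):                         # RK4 over t in [0, 1]
    k1 = rhs(t, U)
    k2 = rhs(t + dt / 2, U + dt / 2 * k1)
    k3 = rhs(t + dt / 2, U + dt / 2 * k2)
    k4 = rhs(t + dt, U + dt * k3)
    U = U + dt / 6 * (k1 + 2 * k2 + 2 * k3 + k4)
    t += dt

# the parallel transport for the Levi-Civita derivative is isometric
assert np.isclose(W(Sigma(1.0), U, U), norm0, rtol=1e-5)
```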
7.6 Riemannian Hessian
According to [1, Def. 5.5.1] and [10, p. 141], the Riemannian Hessian of a smooth scalar field \(\phi :{{\mathrm{Sym}}}^{++}\left( n\right) \rightarrow \mathbb {R}\) is the Levi-Civita covariant derivative of the natural gradient \({{\mathrm{grad}}}\phi \). Namely, for each vector field X, it is the vector field \({{\mathrm{Hess}}}_X \phi \) whose value at \(\varSigma \) is
The associated symmetric bilinear form is (see [1, Prop. 5.5.3])
For our purposes, it will be enough to compute the diagonal of the symmetric form. Therefore, letting \(X=Z=V\) in the second part of Eq. (44), we obtain
where \(Y={\text {grad}}\phi \left( \varSigma \right) \). After plugging \(Y = {\text {grad}}\phi \left( \varSigma \right) =\varSigma \nabla \phi \left( \varSigma \right) +\nabla \phi \left( \varSigma \right) \varSigma \) into it, we easily get
Plugging \(V = \mathscr {L} _ {\varSigma } \left[ V\right] \varSigma + \varSigma \mathscr {L} _ {\varSigma } \left[ V\right] \) into the second term of the right-hand side, we finally obtain
Relation (47) substantiates the following important property that links the Hessian to the derivative along a geodesic (see the proof of Prop. 5.5.4 of [1]).
Proposition 12
Let \(\phi :{\text {Sym}}^{++}\left( n\right) \rightarrow {\mathbb {R}}\) be a smooth scalar field and define
It holds
Proof
By Prop. 9
where \(\varSigma (0)=\varSigma \) and \({\dot{\varSigma }}(0)=V.\) Hence \({\dot{\varphi }} \left( t\right) =\left\langle \nabla \phi (\varSigma (t)),{\dot{\varSigma }} (t)\right\rangle _{2}\), and
which, evaluated at \(t=0\), gives
In view of Eq. (47),
\(\square \)
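Proposition 12 lends itself to a direct numerical verification. The Python sketch below (ours; it assumes the Riemannian exponential in the form \({\text {Exp}}_\varSigma (tV) = (I + t\mathscr {L} _ {\varSigma } \left[ V\right] )\varSigma (I + t\mathscr {L} _ {\varSigma } \left[ V\right] )\) and the reconstructed Christoffel operator used earlier) compares the second derivative of \(t \mapsto \phi ({\text {Exp}}_\varSigma (tV))\) at \(t=0\) with the Hessian quadratic form \(W_\varSigma (D_V {\text {grad}}\phi , V)\), for the concrete choice \(\phi (\varSigma ) = {{\mathrm{Tr}}}\left( \varSigma C\right) \), whose Euclidean gradient is the fixed symmetric matrix C.

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

def lyap(S, V):
    return solve_continuous_lyapunov(S, V)   # Sigma L + L Sigma = V

def W(S, X, Y):
    return np.trace(lyap(S, X) @ Y)

def Gamma(S, X, Y):
    LX, LY = lyap(S, X), lyap(S, Y)
    return -(LX @ S @ LY + LY @ S @ LX)

n = 3
rng = np.random.default_rng(4)
A = rng.standard_normal((n, n))
Sigma = A @ A.T + n * np.eye(n)
Mv = rng.standard_normal((n, n)); V = Mv + Mv.T
Mc = rng.standard_normal((n, n)); C = Mc + Mc.T

phi = lambda S: np.trace(S @ C)
grad = lambda S: S @ C + C @ S               # grad phi = Sigma*nabla + nabla*Sigma

def Exp(S, V, t):
    # assumed Riemannian exponential (I + t L) S (I + t L), L = L_S[V]
    At = np.eye(n) + t * lyap(S, V)
    return At @ S @ At

# Hess phi(V, V) = W(D_V grad phi, V); here grad is linear in Sigma,
# so its directional derivative along V is exact:
dV_grad = V @ C + C @ V
hess_VV = W(Sigma, dV_grad + Gamma(Sigma, V, grad(Sigma)), V)

# second derivative of t -> phi(Exp_Sigma(t V)) at t = 0
h = 1e-4
f = lambda t: phi(Exp(Sigma, V, t))
second = (f(h) - 2 * f(0.0) + f(-h)) / h**2
assert np.isclose(hess_VV, second, rtol=1e-4, atol=1e-6)
```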
8 Conclusion
In the present paper we have discussed in some detail the Wasserstein geometric properties of the manifold of Gaussian densities. We have followed a known argument based on the geometric notion of submersion, and we have improved upon what is known in the literature by offering a number of further results. In particular, we have studied the geodesic surfaces and provided an explicit form of the Riemannian exponential. More importantly, a new formulation of the metric based on the field of operators \(\varSigma \mapsto \mathscr {L} _ {\varSigma } \left[ \cdot \right] \) has been introduced. This field of operators yields the Riemannian metric via the Frobenius inner product: \(W_\varSigma (X,Y) = \left\langle \mathscr {L} _ {\varSigma } \left[ X\right] ,Y\right\rangle _{2}\). This gives rise to an explicit identification of the Riemannian gradient, as well as to the calculation of the Levi-Civita covariant derivative through the partial derivatives of the metric. The equations of the parallel transport and of the Riemannian Hessian have also been derived.
While the form of the natural gradient is simple and may be a source of applications, such as those of interest in Machine Learning, the Levi-Civita covariant derivative turns out to be more involved, and it is not clear how to use it in applications. However, we have produced a simpler form through the introduction of a special moving frame. For the same reason, we have not proceeded in this paper to compute other geometrical quantities of interest, such as the curvature tensor.
Numerical and simulation methods for the relevant equations of the geometry (geodesics, parallel transport, Hessians) should also be considered. Applications of special interest are in the area of linear optimization, by means of the natural gradient as a direction of increase and of the Riemannian exponential as a retraction, cf. [1] and the Amari monograph [5]. Second-order optimization methods (Newton method), via the Riemannian Hessian and the Riemannian exponential, cf. [1] and [5], are also a source of promising research.
The issue of a comparison between Fisher and Wasserstein metric is not taken into account here as it is, for example, in Chevallier et al. [11].
From the point of view of applications in Statistics and Machine Learning, the use of the full Gaussian model is in many cases not realistic. We expect our results to be used to compute the Wasserstein geometry induced on parsimonious sub-manifolds such as those listed below.
1. The sub-manifold of correlation matrices, i.e., matrices with unit diagonal elements. In this case, the tangent space at each point is the space of symmetric matrices with zero diagonal.
2. The sub-manifold of trace-1 matrices. This case is of particular interest in Physics and calls for a generalization of the theory to complex Gaussians, i.e., Gaussian densities on \(\mathbb C^n\). Such distributions have Hermitian covariance matrices, a case that is discussed in [8].
3. The sub-manifold of concentration matrices with a given sparsity pattern. Notice that concentration matrices and dispersion matrices are both elements of the same space \({{\mathrm{Sym}}}^{++}\left( n\right) \). In this case the statistical interpretation of the Wasserstein distance is not available, but other interpretations of the distance are mentioned in the Introduction.
References
Absil, P.A., Mahony, R., Sepulchre, R.: Optimization Algorithms on Matrix Manifolds. Princeton University Press, Princeton (2008). (with a foreword by Paul Van Dooren)
Aliprantis, C.D., Border, K.C.: Infinite Dimensional Analysis. A Hitchhiker’s Guide, 3rd edn. Springer, Berlin (2006)
Amari, S., Nagaoka, H.: Methods of information geometry. American Mathematical Society, Providence (2000). (translated from the 1993 Japanese original by Daishi Harada)
Amari, S.I.: Natural gradient works efficiently in learning. Neural Comput. 10(2), 251–276 (1998). https://doi.org/10.1162/089976698300017746
Amari, S.I.: Information geometry and its applications. Appl. Math. Sci. 194 (2016). https://doi.org/10.1007/978-4-431-55978-8
Anderson, T.W.: An Introduction to Multivariate Statistical Analysis. Wiley Series in Probability and Statistics, 3rd edn. Wiley, Hoboken (2003)
Bhatia, R.: Positive Definite Matrices. Princeton Series in Applied Mathematics. Princeton University Press, Princeton (2007). ([2015] paperback edition of the 2007 original [MR2284176])
Bhatia, R., Jain, T., Lim, Y.: On the Bures-Wasserstein distance between positive definite matrices. Expositiones Mathematicae (2018). https://doi.org/10.1016/j.exmath.2018.01.002 arXiv:1712.01504 (in press)
Brenier, Y.: Polar factorization and monotone rearrangement of vector-valued functions. Comm. Pure Appl. Math. 44(4), 375–417 (1991). https://doi.org/10.1002/cpa.3160440402
do Carmo, M.P.: Riemannian geometry. Mathematics: Theory and Applications. Birkhäuser Boston Inc., Cambridge (1992). (translated from the second Portuguese edition by Francis Flaherty)
Chevallier, E., Kalunga, E., Angulo, J.: Kernel density estimation on spaces of Gaussian distributions and symmetric positive definite matrices. SIAM J. Imaging Sci. 10(1), 191–215 (2017). https://doi.org/10.1137/15M1053566
Dowson, D.C., Landau, B.V.: The Fréchet distance between multivariate normal distributions. J. Multivar. Anal. 12(3), 450–455 (1982). https://doi.org/10.1016/0047-259X(82)90077-X
Gelbrich, M.: On a formula for the \(L^2\) Wasserstein metric between measures on Euclidean and Hilbert spaces. Math. Nachr. 147, 185–203 (1990). https://doi.org/10.1002/mana.19901470121
Givens, C.R., Shortt, R.M.: A class of Wasserstein metrics for probability distributions. Michigan Math. J. 31(2), 231–240 (1984). https://doi.org/10.1307/mmj/1029003026
Halmos, P.R.: Finite-dimensional vector spaces. The University Series in Undergraduate Mathematics, 2nd edn. D. Van Nostrand Co., Inc., Princeton-Toronto-New York-London (1958)
Hyvärinen, A.: Estimation of non-normalized statistical models by score matching. J. Mach. Learn. Res. 6, 695–709 (2005)
Klingenberg, W.P.A.: Riemannian Geometry, De Gruyter Studies in Mathematics, vol. 1, 2nd edn. Walter de Gruyter & Co., Berlin (1995). https://doi.org/10.1515/9783110905120
Knott, M., Smith, C.S.: On the optimal mapping of distributions. J. Optim. Theory Appl. 43(1), 39–49 (1984). https://doi.org/10.1007/BF00934745
Lafferty, J.D.: The density manifold and configuration space quantization. Trans. Am. Math. Soc. 305(2), 699–741 (1988). https://doi.org/10.2307/2000885
Lang, S.: Differential and Riemannian manifolds, Graduate Texts in Mathematics, vol. 160, 3rd edn. Springer, Berlin Heidelberg (1995)
Lott, J.: Some geometric calculations on Wasserstein space. Comm. Math. Phys. 277(2), 423–437 (2008). https://doi.org/10.1007/s00220-007-0367-3
Magnus, J.R., Neudecker, H.: Matrix Differential Calculus with Applications in Statistics and Econometrics. Wiley Series in Probability and Statistics. Wiley, Chichester (1999). (Revised reprint of the 1988 original)
Malagò, L., Pistone, G.: Combinatorial optimization with information geometry: Newton method. Entropy 16, 4260–4289 (2014)
Malagò, L., Pistone, G.: Information geometry of the Gaussian distribution in view of stochastic optimization. In: Proceedings of FOGA’15, held on January 17–20, 2015, Aberystwyth, Wales (2015)
Mangasarian, O.L., Fromovitz, S.: The Fritz John necessary optimality conditions in the presence of equality and inequality constraints. J. Math. Anal. Appl. 17, 37–47 (1967). https://doi.org/10.1016/0022-247X(67)90163-1
McCann, R.J.: A convexity principle for interacting gases. Adv. Math. 128(1), 153–179 (1997). https://doi.org/10.1006/aima.1997.1634
McCann, R.J.: Polar factorization of maps on Riemannian manifolds. Geom. Funct. Anal. 11(3), 589–608 (2001). https://doi.org/10.1007/PL00001679
Olkin, I., Pukelsheim, F.: The distance between two random vectors with given dispersion matrices. Linear Algebra Appl. 48, 257–263 (1982). https://doi.org/10.1016/0024-3795(82)90112-4
Otto, F.: The geometry of dissipative evolution equations: the porous medium equation. Comm. Partial Differential Equations 26(1-2), 101–174 (2001)
Papadopoulos, A.: Metric spaces, convexity and non-positive curvature, IRMA Lectures in Mathematics and Theoretical Physics, vol. 6, 2nd edn. European Mathematical Society (EMS), Zürich (2014). https://doi.org/10.4171/132
Parry, M., Dawid, A.P., Lauritzen, S.: Proper local scoring rules. Ann. Stat. 40(1), 561–592 (2012). https://doi.org/10.1214/12-AOS971
Pistone, G.: Nonparametric information geometry. In: F. Nielsen, F. Barbaresco (eds.) Geometric Science of Information, Lecture Notes in Comput. Sci., vol. 8085, pp. 5–36. Springer, Heidelberg (2013). First International Conference, GSI 2013 Paris, France, August 28-30 (2013) (proceedings)
Pistone, G., Sempi, C.: An infinite-dimensional geometric structure on the space of all the probability measures equivalent to a given one. Ann. Stat. 23(5), 1543–1561 (1995)
Simoncini, V.: Computational methods for linear matrix equations. SIAM Rev. 58(3), 377–441 (2016). https://doi.org/10.1137/130912839
Skovgaard, L.T.: A Riemannian geometry of the multivariate normal model. Scand. J. Stat. 11(4), 211–223 (1984)
Takatsu, A.: Wasserstein geometry of Gaussian measures. Osaka J. Math. 48(4), 1005–1026 (2011)
Villani, C.: Optimal Transport: Old and New. Grundlehren der mathematischen Wissenschaften. Springer, Berlin Heidelberg (2008)
Wachspress, E.L.: Trail to a Lyapunov equation solver. Comput. Math. Appl. 55(8), 1653–1659 (2008). https://doi.org/10.1016/j.camwa.2007.04.048
Acknowledgements
The authors wish to thank two anonymous referees for helpful comments. G. Pistone acknowledges the support of de Castro Statistics and Collegio Carlo Alberto. He is a member of GNAMPA-INdAM.
Malagò, L., Montrucchio, L. & Pistone, G. Wasserstein Riemannian geometry of Gaussian densities. Info. Geo. 1, 137–179 (2018). https://doi.org/10.1007/s41884-018-0014-4
Keywords
- Information geometry
- Gaussian distribution
- Wasserstein distance
- Riemannian metrics
- Natural gradient
- Riemannian exponential
- Normal coordinates
- Levi-Civita covariant derivative
- Optimization on positive-definite symmetric matrices