
It is common knowledge that the Riemannian barycentre \(\bar{x}\) of a probability distribution P, defined on a Riemannian manifold M, may fail to be unique. However, if P is supported inside a geodesic ball \(B(x^*,\delta )\) with radius \(\delta < \frac{1}{2} r_{\scriptscriptstyle cx}\) (\(r_{\scriptscriptstyle cx}\) being the convexity radius of M), then \(\bar{x}\) is unique and also belongs to \(B(x^*,\delta )\). In fact, Afsari has shown this to be true even when \(\delta < r_{\scriptscriptstyle cx}\) (see [1, 2]).

Does this statement continue to hold, if P is not supported inside \(B(x^*,\delta )\), but merely concentrated on this ball? The answer to this question is positive, assuming that M is a simply-connected compact Riemannian symmetric space, and \(P = P_{\scriptscriptstyle {T}} \propto \exp (-U/T)\), where the function U has unique global minimum at \(x^* \in M\). This is given by Proposition 2, in Sect. 2 below.

Proposition 2 motivates the main idea of the present work: the Riemannian barycentre \(\bar{x}_{\scriptscriptstyle T}\) of \(P_{\scriptscriptstyle {T}}\) can be used as a proxy for the global minimum \(x^*\) of U. In general, \(\bar{x}_{\scriptscriptstyle T}\) only provides an approximation of \(x^*\), but the two are equal if U is invariant by geodesic symmetry about \(x^*\), as stated in Proposition 3, in Sect. 4 below.

The following Sect. 1 introduces Proposition 1, which estimates the Riemannian distance between \(\bar{x}_{\scriptscriptstyle T}\) and \(x^*\), as a function of T.

1 Concentration of the Barycentre

Let P be a probability distribution on a complete Riemannian manifold M. A (Riemannian) barycentre of P is any global minimiser \(\bar{x} \in M\) of the function

$$\begin{aligned} \mathcal {E}(x) = \frac{1}{2}\, \int _M d^{\scriptscriptstyle 2}(x,z)P(dz) \quad \text{ for } \, x \in M \end{aligned}$$
(1)

The following statement is due to Karcher, and was improved upon by Afsari [1, 2]: if P is supported inside a geodesic ball \(B(x^*,\delta )\), where \(x^* \in M\) and \(\delta < \frac{1}{2} r_{\scriptscriptstyle cx}\) (\(r_{\scriptscriptstyle cx}\) the convexity radius of M), then \(\mathcal {E}\) is strongly convex on \(B(x^*,\delta )\), and P has a unique barycentre \(\bar{x} \in B(x^*,\delta )\).
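
For illustration, the following minimal sketch estimates the barycentre of an empirical distribution on \(M = S^2\) by Riemannian gradient descent on the empirical version of (1), using the standard fact that \(\nabla f_z(x) = -\mathrm {Exp}^{-1}_x(z)\) for \(f_z(x) = \frac{1}{2}d^{\scriptscriptstyle 2}(x,z)\) away from the cut locus of x. The function names (exp_map, log_map, barycentre) are illustrative and do not come from the present work.

```python
import numpy as np

def exp_map(x, v):
    """Exponential map on S^2: follow the geodesic from x with initial velocity v."""
    nv = np.linalg.norm(v)
    if nv < 1e-12:
        return x
    return np.cos(nv) * x + np.sin(nv) * (v / nv)

def log_map(x, z):
    """Inverse exponential map on S^2: tangent vector at x pointing towards z."""
    p = z - np.dot(x, z) * x                      # project z onto the tangent space at x
    norm_p = np.linalg.norm(p)
    if norm_p < 1e-12:
        return np.zeros(3)
    return np.arccos(np.clip(np.dot(x, z), -1.0, 1.0)) * (p / norm_p)

def barycentre(points, steps=100):
    """Minimise the empirical version of (1) by Riemannian gradient descent,
    using grad E(x) = -mean_z Exp_x^{-1}(z)."""
    x = points[0]
    for _ in range(steps):
        grad = -np.mean([log_map(x, z) for z in points], axis=0)
        x = exp_map(x, -grad)
    return x

# points concentrated in a small geodesic ball around (0, 0, 1)
pts = np.array([[0.1, 0.0, 1.0], [0.0, 0.1, 1.0], [-0.1, -0.1, 1.0]])
pts = pts / np.linalg.norm(pts, axis=1, keepdims=True)
print(barycentre(pts))        # close to (0, 0, 1)
```

When the points lie in a small geodesic ball, as in the Karcher-Afsari setting above, this iteration is expected to converge to the unique barycentre.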

On the other hand, the present work considers a setting where P is not supported inside \(B(x^*,\delta )\), but merely concentrated on this ball. Precisely, assume P is equal to the Gibbs distribution

$$\begin{aligned} P_{\scriptscriptstyle {T}}(dz) \,=\, \left( Z(T)\right) ^{-1}\,\exp \left[ -\frac{U(z)}{T}\right] \mathrm {vol}(dz);\, T > 0 \end{aligned}$$
(2)

where Z(T) is a normalising constant, U is a \(C^2\) function with unique global minimum at \(x^*\), and \(\mathrm {vol}\) is the Riemannian volume of M. Then, let \(\mathcal {E}_{\scriptscriptstyle {T}}\) denote the function \(\mathcal {E}\) in (1), and let \(\bar{x}_{\scriptscriptstyle {T}}\) denote any barycentre of \(P_{\scriptscriptstyle {T\,}}\).
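
For intuition about (2), here is a small numeric sketch on \(M = S^2\): it evaluates the unnormalised density \(\exp (-U/T)\) and estimates Z(T) by crude Monte Carlo over uniform samples. The choice \(U(x) = -x^{\scriptscriptstyle 3}\) (minus the third coordinate) and all names are illustrative assumptions, not taken from the present work.

```python
import numpy as np

def U(x):
    """Illustrative potential on S^2: minus the third coordinate, so the unique
    global minimum is at x* = (0, 0, 1)."""
    return -x[..., 2]

def gibbs_unnormalised(x, T):
    """Unnormalised Gibbs density exp(-U(x)/T), as in (2)."""
    return np.exp(-U(x) / T)

def estimate_Z(T, n_samples=200_000, seed=0):
    """Crude Monte Carlo estimate of Z(T) = int_M exp(-U/T) vol(dz), with M = S^2."""
    rng = np.random.default_rng(seed)
    z = rng.normal(size=(n_samples, 3))
    z /= np.linalg.norm(z, axis=1, keepdims=True)        # uniform samples on S^2
    return 4.0 * np.pi * np.mean(gibbs_unnormalised(z, T))

# for this U, the exact value is 2*pi*T*(exp(1/T) - exp(-1/T))
print(estimate_Z(0.5), 2.0 * np.pi * 0.5 * (np.exp(2.0) - np.exp(-2.0)))
```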

In this new setting, it is not clear whether \(\mathcal {E}_{\scriptscriptstyle {T}}\) is differentiable or not. Therefore, statements about convexity of \(\mathcal {E}_{\scriptscriptstyle {T}}\) and uniqueness of \(\bar{x}_{\scriptscriptstyle {T}}\) are postponed to the following Sect. 2. For now, it is possible to state the following Proposition 1. In this proposition, \(d(\cdot ,\cdot )\) denotes Riemannian distance, and \(W(\cdot ,\cdot )\) denotes the Kantorovich (\(L^1\)-Wasserstein) distance [3, 4]. Moreover, \((\,\mu _{\scriptscriptstyle \min \,},\mu _{\scriptscriptstyle \max })\) is any open interval which contains the spectrum of the Hessian \(\nabla ^2 U(x^*)\), considered as a linear mapping of the tangent space \(T_{\scriptscriptstyle x^*}M\).

Proposition 1

Assume M is an n-dimensional compact Riemannian manifold with non-negative sectional curvature. Denote \(\delta _{\scriptscriptstyle x^*}\) the Dirac distribution at \(x^*\). The following hold,

(i) for any \(\eta > 0\),

$$\begin{aligned} W(P_{\scriptscriptstyle {T\,}},\delta _{\scriptscriptstyle x^*})< \frac{\eta ^2}{(4\,\mathrm {diam}\, M)}\,\,\Longrightarrow \,\, d(\bar{x}_{\scriptscriptstyle {T\,}},x^*) < \eta \end{aligned}$$
(3)

(ii) for \(T \le T_o\) (which can be computed explicitly)

$$\begin{aligned} W(P_{\scriptscriptstyle {T\,}},\delta _{\scriptscriptstyle x^*}) \le \sqrt{2}\,\left( \pi /2\right) ^{n-1}\,B^{-1}_n\,\left( \mu _{\scriptscriptstyle \max }/\mu _{\scriptscriptstyle \min }\right) ^{n/2}\,\left( T/\mu _{\scriptscriptstyle \min }\right) ^{1/2} \end{aligned}$$
(4)

where \(B_n = B(1/2,n/2)\) in terms of the Beta function.

Proposition 1 is motivated by the idea of using \(\bar{x}_{\scriptscriptstyle T}\) as an approximation of \(x^*\). Intuitively, this requires choosing T so small that \(P_{\scriptscriptstyle T}\) is sufficiently close to \(\delta _{\scriptscriptstyle x^*\,}\). Just how small a T may be required is indicated by the inequality in (4). This inequality is optimal and explicit, in the following sense.

It is optimal because the dependence on \(T^{1/2}\) in its right-hand side cannot be improved. Indeed, by the multi-dimensional Laplace approximation (see [5], for example), the left-hand side is equivalent to \(\mathrm {L}\cdot T^{1/2\,}\) in the limit \(T \rightarrow 0\). While this constant \(\mathrm {L}\) is not tractable, the constants appearing in Inequality (4) depend explicitly on the manifold M and the function U. In fact, this inequality does not follow from the multi-dimensional Laplace approximation, but rather from volume comparison theorems of Riemannian geometry [6].

In spite of these nice properties, Inequality (4) does not escape the curse of dimensionality. Indeed, for fixed T, its right-hand side increases exponentially with the dimension n (note that \(B_n\) decreases only like \(n^{\scriptscriptstyle -1/2}\)). On the other hand, although \(T_o\) also depends on n, it is typically much less affected by dimensionality, and decreases more slowly than \(n^{-1}\) as n increases.
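
To make these remarks concrete, the following snippet evaluates the right-hand side of (4), computing \(B_n = B(1/2,n/2)\) from Gamma functions. The numerical values of \(\mu _{\scriptscriptstyle \min }\), \(\mu _{\scriptscriptstyle \max }\) and T are placeholder assumptions, chosen only to display the \(T^{1/2}\) dependence in T and the exponential growth in n.

```python
import math

def rhs_of_4(n, T, mu_min, mu_max):
    """Right-hand side of Inequality (4)."""
    B_n = math.gamma(0.5) * math.gamma(n / 2) / math.gamma((n + 1) / 2)   # B(1/2, n/2)
    return (math.sqrt(2) * (math.pi / 2) ** (n - 1) / B_n
            * (mu_max / mu_min) ** (n / 2) * math.sqrt(T / mu_min))

# illustrative values (assumptions, not from the paper)
mu_min, mu_max, T = 1.0, 2.0, 1e-3
for n in (2, 5, 10, 20):
    print(n, rhs_of_4(n, T, mu_min, mu_max))
# the bound scales like sqrt(T) in T, but grows exponentially with n,
# while B_n only decays like n**(-1/2)
```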

2 Convexity and Uniqueness

Assume now that M is a simply-connected, compact Riemannian symmetric space. In this case, for any T, the function \(\mathcal {E}_{\scriptscriptstyle T}\) turns out to be \(C^2\) throughout M. This results from the following lemma.

Lemma 1

Let M be a simply-connected compact Riemannian symmetric space. Let \(\gamma : I \rightarrow M\) be a geodesic defined on a compact interval I. Denote \(\mathrm {Cut}(\gamma )\) the union of all cut loci \(\mathrm {Cut}(\gamma (t))\) for \(t \in I\). Then, the topological dimension of \(\mathrm {Cut}(\gamma )\) is strictly less than \(n = \dim M\). In particular, \(\mathrm {Cut}(\gamma )\) is a set with volume equal to zero.

Remark: The assumption that M is simply-connected cannot be removed, as the conclusion does not hold if M is a real projective space.

The proof of Lemma 1 uses the structure of Riemannian symmetric spaces, as well as some results from topological dimension theory [7] (Chapter VII). The notion of topological dimension arises because \(\mathrm {Cut}(\gamma )\) may fail to be a manifold. The lemma immediately implies, for all t,

$$ \mathcal {E}_{\scriptscriptstyle T}(\gamma (t)) \,=\, \frac{1}{2}\,\int _{M}\,d^{\scriptscriptstyle 2}(\gamma (t),z)P_{\scriptscriptstyle T}(dz)\,=\, \frac{1}{2}\,\int _{M - \mathrm {Cut}(\gamma )}\,d^{\scriptscriptstyle 2}(\gamma (t),z)P_{\scriptscriptstyle T}(dz) $$

Then, since the domain of integration avoids the cut loci of all the \(\gamma (t)\), it becomes possible to differentiate under the integral. This is used in obtaining the following (the assumptions are the same as in Lemma 1).

Corollary 1

For \(x \in M\), let \(G_x(z)=\nabla f_z(x)\) and \(H_x(z) = \nabla ^2f_z(x)\), where \(f_z\) is the function \(x\mapsto \frac{1}{2}\,d^{\scriptscriptstyle 2}(x,z)\). The following integrals converge for any T

$$ G_x \,=\, \int _{M-\mathrm {Cut}(x)}\,G_x(z)\,P_{\scriptscriptstyle T}(dz); \;\;\; H_x \,=\, \int _{M-\mathrm {Cut}(x)}\,H_x(z)\,P_{\scriptscriptstyle T}(dz) $$

and both depend continuously on x. Moreover,

$$\begin{aligned} \nabla \mathcal {E}_{\scriptscriptstyle T}(x) = G_{x} \;\; \text{ and } \;\; \nabla ^2 \mathcal {E}_{\scriptscriptstyle T}(x) = H_x \end{aligned}$$
(5)

so that \(\mathcal {E}_{\scriptscriptstyle T}\) is \(C^2\) throughout M.
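
In practice, (5) means that \(\nabla \mathcal {E}_{\scriptscriptstyle T}(x)\) can be approximated by Monte Carlo, once (approximate) samples from \(P_{\scriptscriptstyle T}\) are available, since \(\nabla f_z(x) = -\mathrm {Exp}^{-1}_x(z)\) off the cut locus of x. A minimal sketch on \(M = S^2\) (the function names and the hand-made samples are illustrative assumptions):

```python
import numpy as np

def log_map(x, z):
    """Inverse exponential map on S^2, defined off the cut locus of x."""
    p = z - np.dot(x, z) * x
    norm_p = np.linalg.norm(p)
    if norm_p < 1e-12:
        return np.zeros(3)
    return np.arccos(np.clip(np.dot(x, z), -1.0, 1.0)) * (p / norm_p)

def grad_E(x, samples):
    """Monte Carlo estimate of G_x = int grad f_z(x) P_T(dz), using
    grad f_z(x) = -Exp_x^{-1}(z); `samples` are (approximate) draws from P_T."""
    return -np.mean([log_map(x, z) for z in samples], axis=0)

# demo: samples clustered near (0, 0, 1) stand in for draws from P_T
samples = np.array([[0.05, 0.0, 1.0], [0.0, -0.05, 1.0], [0.02, 0.03, 1.0]])
samples /= np.linalg.norm(samples, axis=1, keepdims=True)
print(grad_E(np.array([0.0, 0.0, 1.0]), samples))   # small vector: near-stationary point
```

At the barycentre \(\bar{x}_{\scriptscriptstyle T}\), this estimate should be close to zero; the Hessian \(H_x\) can be estimated in the same way from \(\nabla ^2 f_z(x)\).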

With Corollary 1 at hand, it is possible to obtain Proposition 2, which is concerned with the convexity of \(\mathcal {E}_{\scriptscriptstyle T}\) and uniqueness of \(\bar{x}_{\scriptscriptstyle T\,}\). In this proposition, the following notation is used

$$\begin{aligned} f(T) = (4/\pi )\left( \pi /8\right) ^{n/2}\left( \mu _{\scriptscriptstyle \max }/T\right) ^{n/2}\exp \left( -U_{\scriptscriptstyle \delta }/T\right) \end{aligned}$$
(6)

where \(U_{\scriptscriptstyle \delta } = \inf \lbrace U(x) - U(x^*)\,;\, x \notin B(x^*,\delta )\rbrace \) for positive \(\delta \). Note that f(T) decreases to 0 as T decreases to 0.

Proposition 2

Let M be a simply-connected compact Riemannian symmetric space. Let \(\kappa ^2\) be the maximum sectional curvature of M, and \(r_{\scriptscriptstyle cx} = \kappa ^{-1}\frac{\pi }{2}\) its convexity radius. If \(T \le T_o\) (see (ii) of Proposition 1), then the following hold for any \(\delta < \frac{1}{2} r_{\scriptscriptstyle cx}\).

(i) for all x in the geodesic ball \(B(x^*,\delta )\),

$$\begin{aligned} \nabla ^2 \mathcal {E}_{\scriptscriptstyle T}(x) \ge \mathrm {Ct}(2\delta )\left( 1 - \mathrm {vol}(M) f(T)\right) - \pi A_M f(T) \end{aligned}$$
(7)

where \(\mathrm {Ct}(2\delta ) = 2\kappa \delta \cot (2\kappa \delta ) > 0\) and \(A_M > 0\) is a constant given by the structure of the symmetric space M.

(ii) there exists \(T_{\scriptscriptstyle \delta }\) (which can be computed explicitly), such that \(T \le T_{\scriptscriptstyle \delta }\) implies \(\mathcal {E}_{\scriptscriptstyle {T}}\) is strongly convex on \(B(x^*,\delta )\,\), and has a unique global minimum \(\bar{x}_{\scriptscriptstyle T} \in B(x^*,\delta )\). In particular, this means \(\bar{x}_{\scriptscriptstyle T}\) is the unique barycentre of \(P_{\scriptscriptstyle T\,}\).

Note that (ii) of Proposition 2 generalises the statement due to Karcher [1], which was recalled in Sect. 1.

3 Finding \(T_o\) and \(T_{\scriptscriptstyle \delta }\)

Propositions 1 and 2 claim that \(T_o\) and \(T_{\scriptscriptstyle \delta }\) can be computed explicitly. This means that, with some knowledge of the Riemannian manifold M and the function U, \(T_o\) and \(T_{\scriptscriptstyle \delta }\) can be found by solving scalar equations. The current section gives the definitions of \(T_o\) and \(T_{\scriptscriptstyle \delta \,}\).

In the notation of Proposition 1, let \(\rho > 0\) be small enough, so that,

$$ \mu _{\scriptscriptstyle \min \,}d^{\scriptscriptstyle 2}(x,x^*) \,\le \, 2\left( U(x)-U(x^*)\right) \,\le \, \mu _{\scriptscriptstyle \max \,}d^{\scriptscriptstyle 2}(x,x^*) $$

whenever \(d(x,x^*) \le \rho \,\), and consider the quantity

$$ f(T,m,\rho ) \,=\, (2/\pi )^{1/2}\,\left( \mu _{\scriptscriptstyle \max }/T\right) ^{m/2}\,\exp \left( -U_{\scriptscriptstyle \rho }/T\right) $$

where \(U_{\scriptscriptstyle \rho }\) is defined as in (6). Note that \(f(T,m,\rho )\) decreases to 0 as T decreases to 0, for fixed m and \(\rho \). Now, it is possible to define \(T_o\) as

$$\begin{aligned} T_o \,=\, \min \left\{ T^1_o,\,T^2_o\right\} \quad \text{ where } \end{aligned}$$
(8)
$$\begin{aligned} T^1_o&= \inf \left\{ T> 0\,:\, f(T,n-2,\rho )\,>\,\rho ^{2-n}\,A_{n-1}\,\right\} \\ T^2_o&= \inf \left\{ T> 0\,:\, f(T,n+1,\rho )\,>\,\left( \mu _{\scriptscriptstyle \max }/\mu _{\scriptscriptstyle \min }\right) ^{n/2}\,C_n\right\} \\ \end{aligned}$$

Here, \(A_n = E|X|^n\) for \(X \sim N(0,1)\), and \(C_n = \omega _n\,A_{n}/\!\left( \mathrm {diam}\, M\times \mathrm {vol}\, M\right) \), where \(\omega _n\) is the surface area of a unit sphere \(S^{n-1\,}\).
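
As noted above, \(f(T,m,\rho )\) decreases to 0 as T decreases to 0, so the infima in (8) amount to locating the smallest T at which \(f(T,m,\rho )\) first exceeds a given threshold. A possible numerical sketch (grid search followed by bisection) is given below; all numerical inputs, in particular the stand-in threshold, are placeholder assumptions.

```python
import math

def f(T, m, mu_max, U_rho):
    """f(T, m, rho) = (2/pi)^{1/2} (mu_max/T)^{m/2} exp(-U_rho/T)."""
    return math.sqrt(2.0 / math.pi) * (mu_max / T) ** (m / 2) * math.exp(-U_rho / T)

def smallest_crossing(threshold, m, mu_max, U_rho,
                      T_lo=1e-8, T_hi=10.0, grid=10_000, iters=60):
    """Smallest T in (T_lo, T_hi] with f(T, m, rho) > threshold, as in (8)."""
    Ts = [T_lo * (T_hi / T_lo) ** (k / grid) for k in range(grid + 1)]
    for a, b in zip(Ts, Ts[1:]):
        if f(b, m, mu_max, U_rho) > threshold:     # first bracket where f crosses the threshold
            for _ in range(iters):                 # bisection inside the bracket
                mid = 0.5 * (a + b)
                if f(mid, m, mu_max, U_rho) > threshold:
                    b = mid
                else:
                    a = mid
            return b
    return None   # threshold never exceeded: this term puts no constraint on T_o

# placeholder inputs (assumptions): m = n - 2 and a stand-in for rho^{2-n} A_{n-1}
print(smallest_crossing(threshold=2.0, m=3, mu_max=2.0, U_rho=0.3))
```

\(T^2_o\) is obtained from the same routine, with \(m = n+1\) and threshold \(\left( \mu _{\scriptscriptstyle \max }/\mu _{\scriptscriptstyle \min }\right) ^{n/2}C_n\), and then \(T_o = \min \lbrace T^1_o,\,T^2_o\rbrace \).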

With regard to Proposition 2, define \(T_{\scriptscriptstyle \delta }\) as follows,

$$\begin{aligned} T_{\scriptscriptstyle \delta } \,=\, \min \left\{ T^1_{\scriptscriptstyle \delta },\,T^2_{\scriptscriptstyle \delta }\right\} - \varepsilon \end{aligned}$$
(9)

for some arbitrary \(\varepsilon > 0\). Here, in the notation of (4), (6) and (7),

$$\begin{aligned} T^1_{\scriptscriptstyle \delta }&= \inf \left\{ T\le T_o\,:\, \sqrt{2\pi }\,(T/\mu _{\scriptscriptstyle \min })^{1/2}\,>\,\delta ^2\left( \mu _{\scriptscriptstyle \min }/\mu _{\scriptscriptstyle \max }\right) ^{n/2}\,D_n\right\} \\ T^2_{\scriptscriptstyle \delta }&= \inf \left\{ T\le T_o \,:\, f(T)\,>\, \mathrm {Ct}(2\delta )\left( \mathrm {Ct}(2\delta )\,\mathrm {vol}\,M + \pi A_M\right) ^{-1}\,\right\} \\ \end{aligned}$$

where \(D_n = (2/\pi )^{n-1}\,B_n/(4\,\mathrm {diam}\,M)\).
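
By way of illustration, \(T^1_{\scriptscriptstyle \delta }\) even admits a closed-form expression, since the left-hand side \(\sqrt{2\pi }\,(T/\mu _{\scriptscriptstyle \min })^{1/2}\) is increasing in T, while \(T^2_{\scriptscriptstyle \delta }\) can be found with the same grid-and-bisection search as above, now applied to f(T) from (6). The sketch below uses placeholder inputs (these are assumptions, not values from the paper).

```python
import math

def T1_delta(delta, n, mu_min, mu_max, diam_M, T_o):
    """Closed-form value of T^1_delta in (9): the left-hand side
    sqrt(2*pi)*(T/mu_min)^{1/2} is increasing in T, so the infimum is the crossing point."""
    B_n = math.gamma(0.5) * math.gamma(n / 2) / math.gamma((n + 1) / 2)   # B(1/2, n/2)
    D_n = (2.0 / math.pi) ** (n - 1) * B_n / (4.0 * diam_M)
    rhs = delta ** 2 * (mu_min / mu_max) ** (n / 2) * D_n
    T_cross = mu_min * (rhs / math.sqrt(2.0 * math.pi)) ** 2
    return T_cross if T_cross < T_o else math.inf   # empty set in (9): no constraint

def f6(T, n, mu_max, U_delta):
    """f(T) of (6), used for T^2_delta via the same smallest-crossing search as above."""
    return ((4.0 / math.pi) * (math.pi / 8.0) ** (n / 2)
            * (mu_max / T) ** (n / 2) * math.exp(-U_delta / T))

# placeholder inputs (assumptions)
print(T1_delta(delta=0.3, n=5, mu_min=1.0, mu_max=2.0, diam_M=math.pi, T_o=0.1))
```

Then \(T_{\scriptscriptstyle \delta } = \min \lbrace T^1_{\scriptscriptstyle \delta },\,T^2_{\scriptscriptstyle \delta }\rbrace - \varepsilon \), as in (9).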

4 Black-Box Optimisation

Consider the problem of searching for the unique global minimum \(x^*\) of U. In black-box optimisation, it is only possible to evaluate U(x) for given \(x \in M\), and the cost of this evaluation precludes numerical approximation of derivatives. Then, the problem is to find \(x^*\) using successive evaluations of U(x) (hopefully, as few of these evaluations as possible).

Here, a new algorithm for solving this problem is described. The idea of this algorithm is to find \(\bar{x}_{\scriptscriptstyle T}\) using successive evaluations of U(x), in the hope that \(\bar{x}_{\scriptscriptstyle T}\) will provide a good approximation of \(x^*\). While the quality of this approximation is controlled by Inequalities (3) and (4) of Proposition 1, in some cases of interest \(\bar{x}_{\scriptscriptstyle T}\) is exactly equal to \(x^*\), for correctly chosen T, as stated in Proposition 3 below.

To state this proposition, let \(s_{\scriptscriptstyle {x^*}}\) denote the geodesic symmetry about \(x^*\) (see [7]). This is the transformation of M which leaves \(x^*\) fixed and reverses the direction of geodesics passing through \(x^*\).

Proposition 3

Assume that U is invariant by geodesic symmetry about \(x^*\,\), in the sense that \(U \circ s_{\scriptscriptstyle {x^*}} = U\). If \(T \le T_{\scriptscriptstyle \delta }\) (see (ii) of Proposition 2), then \(\bar{x}_{\scriptscriptstyle T} = x^*\) is the unique barycentre of \(P_{\scriptscriptstyle T\,}\).

Proposition 3 follows rather directly from Proposition 2. Precisely, by (ii) of Proposition 2, the condition \(T \le T_{\scriptscriptstyle \delta }\) implies \(\mathcal {E}_{\scriptscriptstyle {T}}\) is strongly convex on \(B(x^*,\delta )\), and \(\bar{x}_{\scriptscriptstyle T} \in B(x^*,\delta )\). Thus, \(\bar{x}_{\scriptscriptstyle T}\) is the unique stationary point of \(\mathcal {E}_{\scriptscriptstyle {T}}\) in \(B(x^*,\delta )\). But, using the fact that U is invariant by geodesic symmetry about \(x^*\), it is possible to prove that \(x^*\) is a stationary point of \(\mathcal {E}_{\scriptscriptstyle {T\,}}\), and this implies \(\bar{x}_{\scriptscriptstyle T} = x^*\).

The following two examples satisfy the conditions of Proposition 3.

Example 1

Assume \(M = \mathrm {Gr}(k,\mathbb {C}^n)\) is a complex Grassmann manifold. In particular, M is a simply-connected, compact Riemannian symmetric space. Identify M with the set of Hermitian projectors \(x : \mathbb {C}^n\rightarrow \mathbb {C}^n\) such that \(\mathrm {tr}(x) = k\), where \(\mathrm {tr}\) denotes the trace. Then, define \(U(x) = -\,\mathrm {tr}(C\,x)\) for \(x \in \mathrm {Gr}(k,\mathbb {C}^n)\), where C is a Hermitian positive-definite matrix with distinct eigenvalues. Now, the unique global minimum of U occurs at \(x^*\), the projector onto the principal k-subspace of C. Also, the geodesic symmetry \(s_{\scriptscriptstyle x^*}\) is given by \(s_{\scriptscriptstyle x^*}\cdot x = r_{\scriptscriptstyle x^*}x\,r_{\scriptscriptstyle x^*}\), where \(r_{\scriptscriptstyle x^*} : \mathbb {C}^n\rightarrow \mathbb {C}^n\) denotes reflection through the image space of \(x^*\). It is elementary to verify that U is invariant by this geodesic symmetry.
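
A quick numerical check of Example 1 (a sketch with illustrative dimensions, assuming numpy): construct \(x^*\) from the leading k eigenvectors of C, form the reflection \(r_{\scriptscriptstyle x^*} = 2x^* - I\), and verify that \(U(r_{\scriptscriptstyle x^*}\,x\,r_{\scriptscriptstyle x^*}) = U(x)\) on a random rank-k projector x.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 6, 2                                     # illustrative dimensions

# Hermitian positive-definite C with (almost surely) distinct eigenvalues
A = rng.normal(size=(n, n)) + 1j * rng.normal(size=(n, n))
C = A @ A.conj().T + np.diag(np.arange(1.0, n + 1.0))

# x*: projector onto the principal k-subspace of C
eigvals, eigvecs = np.linalg.eigh(C)
Vk = eigvecs[:, -k:]                            # eigenvectors of the k largest eigenvalues
x_star = Vk @ Vk.conj().T

def U(x):
    """U(x) = -tr(Cx) on Gr(k, C^n)."""
    return -np.trace(C @ x).real

# geodesic symmetry about x*: x -> r x r, with r the reflection through im(x*)
r = 2.0 * x_star - np.eye(n)

# invariance check on a random rank-k Hermitian projector
B = rng.normal(size=(n, k)) + 1j * rng.normal(size=(n, k))
Q, _ = np.linalg.qr(B)
x = Q @ Q.conj().T
print(np.isclose(U(r @ x @ r), U(x)))           # True
```

The invariance holds because the image of \(x^*\) is an invariant subspace of C, so that \(r_{\scriptscriptstyle x^*}\) commutes with C and \(\mathrm {tr}(C\,r_{\scriptscriptstyle x^*}x\,r_{\scriptscriptstyle x^*}) = \mathrm {tr}(r_{\scriptscriptstyle x^*}C\,r_{\scriptscriptstyle x^*}x) = \mathrm {tr}(C\,x)\).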

Example 2

Let M be a simply-connected, compact Riemannian symmetric space, and \(U_{\scriptscriptstyle o}\) a function on M with unique global minimum at \(o \in M\). Assume moreover that \(U_{\scriptscriptstyle o}\) is invariant by geodesic symmetry about o. For each \(x^* \in M\), there exists an isometry g of M, such that \(x^* = g\cdot o\). Then, \(U(x) = U_{\scriptscriptstyle o}(g^{\scriptscriptstyle {-1}}\cdot x)\) has unique global minimum at \(x^*\), and is invariant by geodesic symmetry about \(x^*\).
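
A minimal instance of this construction on \(M = S^2\) (all choices below, the template \(U_{\scriptscriptstyle o}\) as well as the rotation g, are illustrative assumptions): \(U_{\scriptscriptstyle o}\) depends only on the inner product with o, hence is invariant by geodesic symmetry about o, and g is a rotation of \(\mathbb {R}^3\).

```python
import numpy as np
from scipy.spatial.transform import Rotation

o = np.array([0.0, 0.0, 1.0])

def U_o(x):
    """Template on S^2: depends only on <x, o>, hence invariant by geodesic
    symmetry about o, with unique global minimum at o."""
    t = np.dot(x, o)
    return -(t ** 3 + t)

g = Rotation.from_euler("xyz", [0.3, -0.7, 1.1])   # the isometry (unknown in practice)
x_star = g.apply(o)                                 # global minimum of the observed pattern

def U(x):
    """Observed pattern U = U_o(g^{-1} x), minimised at x* = g . o."""
    return U_o(g.apply(x, inverse=True))

print(U(x_star), U(o))   # U attains its minimum -2 at x*, not at the template minimum o
```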

Example 1 describes the standard problem of finding the principal subspace of the covariance matrix C. In Example 2, the function \(U_{\scriptscriptstyle o}\) is a known template, which undergoes an unknown transformation g, leading to the observed pattern U. This is a typical situation in pattern recognition problems.

Of course, from a mathematical point of view, Example 2 is not really an example, since it describes the completely general setting where the conditions of Proposition 3 are verified. In this setting, consider the following algorithm.

Description of the algorithm:

[Algorithm (figure a in the original): lines (1)--(3) generate \(z_{\scriptscriptstyle n}\) by a symmetric Metropolis-Hastings step targeting \(P_{\scriptscriptstyle T}\); line (4) updates the recursive barycentre \(\hat{x}_{\scriptscriptstyle n} = \hat{x}_{\scriptscriptstyle n-1}\, \#_{\scriptscriptstyle \frac{1}{n}}\, z_{\scriptscriptstyle n}\) as in (10).]

The above algorithm recursively computes the Riemannian barycentre \(\hat{x}_{\scriptscriptstyle n\,}\) of the samples \(z_{\scriptscriptstyle n}\) generated by a symmetric Metropolis-Hastings algorithm (see [8]). The Metropolis-Hastings algorithm is implemented in lines (1)--(3), while line (4) takes care of the Riemannian barycentre. Precisely, if \(\gamma :[0,1]\rightarrow M\) is a length-minimising geodesic connecting \(\hat{x}_{\scriptscriptstyle n-1}\) to \(z_{\scriptscriptstyle n}\), let

$$\begin{aligned} \hat{x}_{\scriptscriptstyle n-1}\, \#_{\scriptscriptstyle \frac{1}{n}}\, z_{\scriptscriptstyle n} \,=\,\gamma \left( 1/n\right) \end{aligned}$$
(10)

This geodesic \(\gamma \) need not be unique.
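
On the sphere \(S^2\), the point \(\gamma (1/n)\) in (10) is available in closed form through spherical linear interpolation. A minimal sketch (the function name is illustrative):

```python
import numpy as np

def geodesic_point(x, z, t):
    """gamma(t) along the minimising geodesic from x to z on S^2, i.e. x #_t z
    in the notation of (10); when z = -x the geodesic is not unique."""
    theta = np.arccos(np.clip(np.dot(x, z), -1.0, 1.0))   # Riemannian distance d(x, z)
    if theta < 1e-12:
        return x
    return (np.sin((1.0 - t) * theta) * x + np.sin(t * theta) * z) / np.sin(theta)

# recursive barycentre update of line (4):  x_hat_n = x_hat_{n-1} #_{1/n} z_n
# x_hat = geodesic_point(x_hat, z_n, 1.0 / n)
```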

The point of using the Metropolis-Hastings algorithm is that the generated \(z_{\scriptscriptstyle n}\) eventually sample from the Gibbs distribution \(P_{\scriptscriptstyle T}\). The convergence of the distribution \(P_{\scriptscriptstyle n}\) of \(z_{\scriptscriptstyle n}\) to \(P_{\scriptscriptstyle T}\) takes place exponentially fast. Indeed, it may be inferred from [8] (see Theorem 8, Page 36) that

$$\begin{aligned} \Vert P_{\scriptscriptstyle n} - P_{\scriptscriptstyle T}\Vert _{\scriptscriptstyle TV} \le (1-p_{\,\scriptscriptstyle T}\,)^n \end{aligned}$$
(11)

where \(\Vert \cdot \Vert _{\scriptscriptstyle TV}\) is the total variation norm, and \(p_{\,\scriptscriptstyle T} \in (0,1)\) verifies

$$ p_{\,\scriptscriptstyle T} \le \, (\mathrm {vol}(M))\,\inf _{x,z} \,q(x,z)\,\exp (-\sup _x U(x)/T) $$

so the rate of convergence is degraded when T is small.

Accordingly, the intuitive justification of the above algorithm is the following. Since the \(z_{\scriptscriptstyle n}\) eventually sample from the Gibbs distribution \(P_{\scriptscriptstyle T\,}\), and the desired global minimum \(x^*\) of U is equal to the barycentre \(\bar{x}_{\scriptscriptstyle T}\) of \(P_{\scriptscriptstyle T\,}\) (by Proposition 3), then the barycentre \(\hat{x}_{\scriptscriptstyle n}\) of the \(z_{\scriptscriptstyle n}\) is expected to converge to \(x^*\).

It should be emphasised that, in the present state of the literature, there is no rigorous result which confirms this convergence \(\hat{x}_{\scriptscriptstyle n} \rightarrow x^*\). It is therefore an open problem, to be addressed in future work.

For a basic computer experiment, consider \(M = S^2 \subset \mathbb {R}^3,\) and let

$$\begin{aligned} U(x) = -\,P_{\scriptscriptstyle 9}(x^{\scriptscriptstyle 3}) \quad \text{ for } \, x = (x^{\scriptscriptstyle 1},x^{\scriptscriptstyle 2},x^{\scriptscriptstyle 3}) \in S^2 \end{aligned}$$
(12)

where \(P_{\scriptscriptstyle 9}\) is the Legendre polynomial of degree 9 [9]. The unique global minimiser of U is \(x^* = (0,0,1)\), and the conditions of Proposition 3 are verified, since U is invariant by reflection in the \(x^{\scriptscriptstyle 3}\) axis, which is geodesic symmetry about \(x^*\).

Fig. 1. Graph of \(-P_{\scriptscriptstyle 9}(x^{\scriptscriptstyle 3})\)

Fig. 2. \(\hat{x}^{\scriptscriptstyle 3}_{\scriptscriptstyle n}\) versus n

Figure 1 shows the dependence of U(x) on \(x^{\scriptscriptstyle 3}\), displaying multiple local minima and maxima. Figure 2 shows the algorithm overcoming these local minima and maxima, and converging to the global minimum \(x^* = (0,0,1)\), within \(n=5000\) iterations. The experiment was conducted with \(T = 0.2\), and the Markov kernel Q obtained from the von Mises-Fisher distribution (see [10]). The initial guess \(\hat{x}_{\scriptscriptstyle 0} = (0,0,-1)\) is not shown in Fig. 2.
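
For reference, the following self-contained sketch reproduces the structure of this experiment: \(U(x) = -P_{\scriptscriptstyle 9}(x^{\scriptscriptstyle 3})\) evaluated with scipy's eval_legendre, a von Mises-Fisher proposal, \(T = 0.2\), initial guess \((0,0,-1)\) and 5000 iterations. The concentration parameter \(\kappa \) of the proposal and all other implementation details are assumptions made for illustration; this is not the authors' code.

```python
import numpy as np
from scipy.special import eval_legendre

T, kappa, n_iter = 0.2, 20.0, 5000        # kappa: assumed proposal concentration
rng = np.random.default_rng(0)

def U(x):
    """U(x) = -P_9(x^3), as in (12)."""
    return -eval_legendre(9, x[2])

def sample_vmf(mu, kappa):
    """One draw from the von Mises-Fisher distribution on S^2 with mean direction mu."""
    u = rng.uniform()
    w = 1.0 + np.log(u + (1.0 - u) * np.exp(-2.0 * kappa)) / kappa   # cosine of polar angle
    w = np.clip(w, -1.0, 1.0)
    phi = rng.uniform(0.0, 2.0 * np.pi)
    v = np.array([np.sqrt(1.0 - w ** 2) * np.cos(phi),
                  np.sqrt(1.0 - w ** 2) * np.sin(phi), w])           # sample around the north pole
    e3 = np.array([0.0, 0.0, 1.0])                                   # rotate the north pole onto mu
    if np.allclose(mu, e3):
        return v
    if np.allclose(mu, -e3):
        return -v
    axis = np.cross(e3, mu); axis /= np.linalg.norm(axis)
    angle = np.arccos(np.clip(mu[2], -1.0, 1.0))
    K = np.array([[0.0, -axis[2], axis[1]],
                  [axis[2], 0.0, -axis[0]],
                  [-axis[1], axis[0], 0.0]])
    R = np.eye(3) + np.sin(angle) * K + (1.0 - np.cos(angle)) * (K @ K)   # Rodrigues formula
    return R @ v

def geodesic_point(x, z, t):
    """x #_t z on S^2 (spherical linear interpolation), as in (10)."""
    theta = np.arccos(np.clip(np.dot(x, z), -1.0, 1.0))
    if theta < 1e-12:
        return x
    return (np.sin((1.0 - t) * theta) * x + np.sin(t * theta) * z) / np.sin(theta)

z = x_hat = np.array([0.0, 0.0, -1.0])                    # initial guess
for n in range(1, n_iter + 1):
    prop = sample_vmf(z, kappa)                           # lines (1)-(3): symmetric MH step
    if rng.uniform() < np.exp(-(U(prop) - U(z)) / T):
        z = prop
    x_hat = geodesic_point(x_hat, z, 1.0 / n)             # line (4): update (10)

print(x_hat)    # expected to approach x* = (0, 0, 1)
```

With these assumed settings, the final \(\hat{x}_{\scriptscriptstyle n}\) is expected to lie close to \(x^* = (0,0,1)\), in line with Fig. 2.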

In comparison, a standard simulated annealing method offered less robust performance, which varied considerably with the choice of annealing schedule.

Proofs

The proofs of all results stated in this work are detailed in the extended version, available online: https://arxiv.org/abs/1902.03885