
1 Introduction

The natural gradient method [2] in optimization originates from information geometry [4], which utilizes the Riemannian geometry of statistical manifolds (the parameter spaces of model families) endowed with the Fisher–Rao metric. The natural gradient is used for minimizing the Kullback–Leibler (KL) divergence, a similarity measure between a model distribution and a target distribution, which can be shown to be equivalent to maximizing the model likelihood of given data. The success of the natural gradient in optimization stems from accelerating likelihood maximization and from its infinitesimal invariance to reparametrizations of the model, which yields robustness towards arbitrary parametrization choices.

In the modern formulation of the natural gradient, a Riemannian metric on the statistical manifold is chosen, with respect to which the gradient of the given similarity is computed [4, Sec. 12]. The choice of the Riemannian metric should, however, relate closely to the similarity measure being minimized. We have illustrated this in Fig. 1, where model selection for Gaussian process regression is carried out by maximizing the prior likelihood of the data with natural gradients stemming from different metrics. Clearly, the Fisher–Rao metric—which infinitesimally corresponds to the KL-divergence—achieves the fastest convergence.

One approach to choosing a related Riemannian metric is the classical Newton’s method, which derives a metric from the Hessian of a convex objective function, or from its absolute value in the non-convex case [7]. Unfortunately, evaluating the Hessian is not feasible in some cases. Instead, we can compute a local Hessian, which corresponds to a local second-order expansion of the similarity measure [3]. This approach generalizes the natural gradient from the KL-divergence case to general similarity measures; to avoid confusion with the well-known KL-divergence setting, we refer to this approach as the formal natural gradient. We furthermore discuss the similarities between the trust region, proximal, and natural gradient methods in Sect. 3 and provide example computations in Sect. 4.

Fig. 1. Maximizing prior likelihood for Gaussian process regression using natural gradients under different metrics on Gaussian distributions. Convergence plots on the left; data and model fit, with optimal exponentiated quadratic kernel parameters, on the right.

2 Useful Metrics via Formalizing the Natural Gradient

The natural gradient is computed with respect to a chosen metric on the statistical manifold, which often results from pulling back a metric between distributions. This way, the gradient takes into account how the metric on distributions penalizes movement in different directions. We will now review how the natural gradient is computed given a Riemannian metric. Then, we introduce the formal natural gradient, which derives this metric from the similarity measure itself.

Statistical Manifold. Let \(\mathrm {AC}(X)\) denote the set of absolutely continuous probability distributions on some manifold X. A statistical manifold is defined by a triple \((X, \varTheta , \rho )\), where X is called the sample space and \(\varTheta \subseteq \mathbb {R}^n\) the parameter space. Then, \(\rho :\varTheta \rightarrow \mathrm {AC}(X)\) maps a parameter to a density, given by \(\rho :\theta \mapsto \rho _\theta (\cdot )\), for any \(\theta \in \varTheta \). Abusing terminology, we also call \(\varTheta \) the statistical manifold.

Cost Function. Let a similarity measure \(c^* :\mathrm {AC}(X) \times \mathrm {AC}(X) \rightarrow \mathbb {R}_{\ge 0}\) (e.g. a metric or an information divergence) be defined on \(\mathrm {AC}(X)\), satisfying \(c^*(\rho , \rho ')=0\) if and only if \(\rho =\rho '\). Assume \(c^*\) to be strictly convex in \(\rho \). Given a target distribution \(\rho \in \mathrm {AC}(X)\) and a statistical manifold \((X, \varTheta , \rho )\), we wish to minimize the cost function \(c :\varTheta \times \mathrm {AC}(X) \rightarrow \mathbb {R}_{\ge 0}\) given by

$$\begin{aligned} c(\theta , \rho ) = c^*(\rho _\theta , \rho ). \end{aligned}$$
(2.1)

If \(\rho = \rho _{\theta '}\) for some \(\theta ' \in \varTheta \), then by abuse of notation we write \(c(\theta , \theta ')\). We finally assume that \(\theta \mapsto c(\theta , \theta ')\) is \(C^2\) whenever \(\theta \ne \theta '\).

Natural Gradient. Assume a Riemannian structure \((\varTheta , g^\varTheta )\) on the statistical manifold. The Riemannian metric \(g^\varTheta \) induces a metric tensor \(G^\varTheta \), given by \(g_\theta ^\varTheta (u,v) = u^TG_\theta ^\varTheta v\), and a distance function which we denote by \(d_\varTheta \). Here the vectors \(u, v\) belong to the tangent space \(T_\theta \varTheta \) at \(\theta \). It is common intuition that the negative gradient \(v = -\nabla _\theta c(\theta , \rho )\) gives the direction of maximal descent for c. However, this is only true on a Euclidean manifold. Consider

$$\begin{aligned} \hat{v} = \mathop {\text {arg min}}\limits _{v \in T_\theta \varTheta :d_\varTheta (\theta , \theta + v)= \varDelta } c(\theta + v, \rho ), \end{aligned}$$
(2.2)

where \(\theta + v\) is to be understood in a chart of \(\varTheta \), and \(\varDelta >0\) defines the radius of the trust region. Linearly approximating the objective and quadratically approximating the constraint, this is solved using Lagrange multipliers, giving the natural gradient

$$\begin{aligned} \hat{v} = - \frac{1}{\lambda }\left[ G_\theta ^\varTheta \right] ^{-1}\nabla _\theta c(\theta , \rho ), \end{aligned}$$
(2.3)

for some Lagrange multiplier \(\lambda > 0\), which we refer to as the learning rate. Below, a similar derivation is carried out in more detail.
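To make the update concrete, here is a minimal sketch (ours, not from the original text) of (2.3) in Python; grad_c and metric_tensor are hypothetical user-supplied callables, and the linear system is solved instead of forming \(\left[ G_\theta ^\varTheta \right] ^{-1}\) explicitly:

```python
import numpy as np

def natural_gradient_step(theta, grad_c, metric_tensor, lr=0.1):
    """One natural gradient step (2.3); lr plays the role of 1/lambda.

    grad_c: callable returning the Euclidean gradient of c(., rho) at theta.
    metric_tensor: callable returning the SPD matrix G_theta.
    """
    G = metric_tensor(theta)
    g = grad_c(theta)
    return theta - lr * np.linalg.solve(G, g)  # G^{-1} g without forming the inverse
```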

Formal Natural Gradient. Traditionally, the natural gradient uses the Fisher–Rao metric when the similarity measure is the KL-divergence. We now show how a trust region formulation with respect to the chosen similarity measure can be used to derive a natural metric, under which the natural gradient can be computed; this results in the formal natural gradient. Thus, consider the minimization task

$$\begin{aligned} \hat{v}:= \mathop {\text {arg min}}_{v\in T_\theta \varTheta ,~c(\theta + v, \theta ) = \varDelta } c(\theta + v, \rho ). \end{aligned}$$
(2.4)

We approximate the constraint by the second-degree Taylor expansion

$$\begin{aligned} c(\theta + v, \theta ) \approx \frac{1}{2}v^T \left( \nabla ^2_{\eta \rightarrow \theta }c(\eta , \theta )\right) v, \end{aligned}$$
(2.5)

where the zeroth and first degree terms vanish, as \(c(\theta + v,\theta )\) attains its minimum 0 at \(v=0\). We call the symmetric positive definite matrix \(H^c_\theta :=\nabla ^2_{\eta \rightarrow \theta }c(\eta , \theta )\) the local Hessian; the notation \(\nabla ^2_{\eta \rightarrow \theta }\) indicates that the Hessian in \(\eta \) is taken in the limit \(\eta \rightarrow \theta \) (see Remark 1). Then, we further approximate the objective function

$$\begin{aligned} c(\theta + v, \rho ) \approx c(\theta , \rho ) + \nabla _{\theta } c(\theta , \rho )^Tv. \end{aligned}$$
(2.6)

Writing the approximate Lagrangian \(\mathcal {L}(v)\) of (2.4) with a multiplier \(\lambda > 0\), we get

$$\begin{aligned} \mathcal {L}(v) \approx c(\theta , \rho ) + \nabla _{\theta } c(\theta , \rho )^Tv + \frac{\lambda }{2}v^T \left( \nabla ^2_{\eta \rightarrow \theta }c(\eta , \theta )\right) v. \end{aligned}$$
(2.7)

Thus, by the method of Lagrange multipliers, (2.4) is solved as

$$\begin{aligned} \hat{v} = -\frac{1}{\lambda }\left[ H^c_\theta \right] ^{-1}\nabla _{\theta }c(\theta , \rho ). \end{aligned}$$
(2.8)

We refer to \(\hat{v}\) as the formal natural gradient with respect to c.
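As a quick sanity check of (2.5) and (2.8), take \(c = D_{\mathrm {KL}}\) on univariate Gaussians with fixed variance \(\sigma ^2\) and parameter \(\theta = \mu \); then

$$\begin{aligned} c(\mu + v, \mu ) = D_{\mathrm {KL}}\left( \mathcal {N}(\mu + v, \sigma ^2)\,||\,\mathcal {N}(\mu , \sigma ^2)\right) = \frac{v^2}{2\sigma ^2}, \end{aligned}$$

so \(H^c_\mu = 1/\sigma ^2\), the Fisher information of the mean parameter, and the step (2.8) rescales the Euclidean gradient by \(\sigma ^2\).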

Remark 1

We could have simply substituted \(\eta = \theta \) in the local Hessian if \(\nabla ^2_\eta c(\eta , \theta )\) were continuous at \(\eta = \theta \). However, when studying Finsler metrics later in this work, the expression has a discontinuity at \(\eta = \theta \). Therefore, a direction for the limit has to be chosen, and as a straightforward candidate we compute the limit from the direction of the gradient.
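When c is \(C^2\) around the diagonal (as in the f-divergence and Riemannian examples of Sect. 4, but not the Finsler case of this remark), the local Hessian can also be approximated numerically; a finite-difference sketch (ours, names hypothetical):

```python
import numpy as np

def local_hessian(c, theta, eps=1e-4):
    """Central finite-difference estimate of H^c_theta = grad^2_eta c(eta, theta)
    evaluated at eta = theta, for c(eta, theta) >= 0 with c(theta, theta) = 0."""
    n = theta.size
    I = np.eye(n)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ei, ej = eps * I[i], eps * I[j]
            H[i, j] = (c(theta + ei + ej, theta) - c(theta + ei - ej, theta)
                       - c(theta - ei + ej, theta) + c(theta - ei - ej, theta)) / (4 * eps**2)
    return 0.5 * (H + H.T)  # symmetrize against round-off
```

Given a gradient g of \(\theta \mapsto c(\theta , \rho )\), the step (2.8) is then theta - lr * np.linalg.solve(local_hessian(c, theta), g).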

Metric Interpretation. The local Hessian \(H^c_\theta \) can be seen as a metric tensor at any \(\theta \in \varTheta \), inducing an inner product \(g_\theta ^c :T_\theta \varTheta \times T_\theta \varTheta \rightarrow \mathbb {R}\) given by \(g_\theta ^c(v,u) = v^TH^c_\theta u\). This imposes a pseudo-Riemannian structure on \(\varTheta \), forming the pseudo-Riemannian manifold \((\varTheta , g^c)\). Therefore, \(H^c_\theta \) provides us a natural metric under which to compute the natural gradient for a general \(c^*\). If \(\rho \) has a full-rank Jacobian everywhere, then a Riemannian metric is retrieved. There is also an obvious pullback structure at play. Recall that the cost is defined by \(c(\theta , \theta ')=c^*(\rho _{\theta }, \rho _{\theta '})\). Then, computing the local Hessian yields

$$\begin{aligned} H^c_\theta = J_\theta ^T H^{c^*}_{\rho _\theta } J_\theta , \end{aligned}$$
(2.9)

where \(H^{c^*}_{\rho _\theta } = \nabla ^2_{\rho \rightarrow \rho _\theta } c^*(\rho , \rho _\theta )\). Thus, \(H^c\) results from pulling back the \(c^*\) induced metric tensor \(H^{c^*}\) on \(\mathrm {AC}(X)\) to the statistical manifold \(\varTheta \). In information geometry, this Riemannian metric is said to be induced by the corresponding divergence (similarity measure) [3]. Therefore, the formal natural gradient is just the Riemannian gradient under the aforementioned induced metric.
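In code, (2.9) is a single congruence transform; a two-line sketch (ours), under the assumption that \(\rho _\theta \) admits a finite-dimensional representation with Jacobian J:

```python
import numpy as np

def pullback_metric(J, H_star):
    """(2.9): pull back a metric tensor on AC(X) to parameter space.

    J: (m, n) Jacobian of theta -> rho_theta (m: dimension of the
       finite representation of the density); H_star: (m, m) tensor.
    Returns J^T H_star J, which is SPD when J has full column rank.
    """
    return J.T @ H_star @ J
```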

Asymptotically Newton’s Method. We provide a straightforward result stating that the local Hessian approaches the actual Hessian in the limit; thus the formal natural gradient method approaches Newton’s method. This is well known in the Fisher–Rao case, but for completeness we state the result for the formal natural gradient.

Proposition 1

Assume \(c(\theta ,\rho ) = c(\theta , \theta ')\) for some \(\theta ' \in \varTheta \), and that c is \(C^2\) in \(\theta \). Then, the formal natural gradient asymptotically yields Newton’s method.

Proof

The Hessian at \(\theta \) is given by \(\nabla ^2_\theta c(\theta , \theta ')\). Then, as c is \(C^2\) in the first argument, passing to the limit \(\theta \rightarrow \theta '\) yields

$$\begin{aligned} H^c_\theta = \nabla ^2_{\eta \rightarrow \theta } c(\eta , \theta ) \overset{\theta \rightarrow \theta '}{\rightarrow } \nabla ^2_{\eta \rightarrow \theta '}c(\eta ,\theta ') = \nabla ^2_{\eta = \theta '} c(\eta , \theta '), \end{aligned}$$
(2.10)

where the last expression is the Hessian at \(\theta '\).

3 Loved Child Has Many Names – Related Methods

In this section, we discuss connections between seemingly different optimization methods. Some of these connections have already been reported in the literature, and some are likely known to some extent in the community. However, the authors are unaware of previous work drawing out these connections to their full extent. We provide such a discussion, and then present other related connections.

As discussed in [14], proximal methods and trust region methods are equivalent up to the learning rate. Trust region methods employ an \(l^2\)-metric constraint

$$\begin{aligned} x_{t+1} = \mathop {\text {arg min}}\limits _{x:\Vert x-x_t\Vert _2 \le \varDelta } f(x),~\varDelta >0, \end{aligned}$$
(3.1)

whereas proximal methods include an \(l^2\)-metric penalization term

$$\begin{aligned} x_{t+1} = \mathop {\text {arg min}}\limits _{x}\left\{ f(x) + \frac{1}{2\lambda }\Vert x - x_t\Vert _2^2\right\} ,~\lambda > 0. \end{aligned}$$
(3.2)

The two can be shown to be equivalent up to learning rate via Lagrangian duality.
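To sketch the duality argument for convex f: squaring the constraint in (3.1) (equivalent for \(\varDelta > 0\)) and introducing a multiplier \(\mu \ge 0\) gives the Lagrangian

$$\begin{aligned} \mathcal {L}(x, \mu ) = f(x) + \mu \left( \Vert x - x_t\Vert _2^2 - \varDelta ^2\right) , \end{aligned}$$

and minimizing \(\mathcal {L}(\cdot , \mu ^*)\) at the optimal multiplier \(\mu ^*\) recovers (3.2) under the identification \(\mu ^* = \frac{1}{2\lambda }\); each radius \(\varDelta \) thus corresponds to some learning rate \(\lambda \), and vice versa.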

Instead of the \(l^2\)-metric penalization, mirror gradient descent [13] employs a more general proximity function \(\varPsi :\mathbb {R}^n \times \mathbb {R}^n \rightarrow \mathbb {R}_{\ge 0}\) that is strictly convex in the first argument. Then, the mirror descent step is given by

$$\begin{aligned} x_{t+1} = \mathop {\text {arg min}}\limits _{x}\left\{ \langle x - x_t, \nabla f(x_t)\rangle + \frac{1}{\lambda }\varPsi (x, x_t)\right\} . \end{aligned}$$
(3.3)

Commonly, \(\varPsi \) is chosen to be a Bregman divergence \(D_g\), defined by choosing a strictly convex \(C^2\) function g and writing

$$\begin{aligned} D_g(x,x') = g(x) - g(x') - \langle \nabla g(x'), x - x' \rangle . \end{aligned}$$
(3.4)
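As a concrete instance, taking g to be the negative entropy on the probability simplex makes \(D_g\) the KL-divergence, and (3.3) then has a closed-form solution, the exponentiated gradient update; a minimal sketch (ours):

```python
import numpy as np

def mirror_descent_step(x, grad_f, lr=0.1):
    """One mirror descent step (3.3) on the simplex with Psi = D_g,
    g = negative entropy: the exponentiated gradient update."""
    y = x * np.exp(-lr * grad_f(x))
    return y / y.sum()  # re-normalize onto the simplex
```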

To explain how these methods are related to the natural gradient, assume that we are minimizing a general similarity measure c(x, y) with respect to x, as in Sect. 2. Recall that we first defined the natural gradient as a trust region step. In order to derive an analytical expression for the iteration, we approximated the objective function by its first-order Taylor polynomial and the constraint by the local Hessian, and then used Lagrangian duality to obtain a proximal expression, which, when solved, yields the formal natural gradient. In Sect. 4, we show how this workflow indeed corresponds to known examples of the natural gradient.

Further Connections. Raskutti and Mukherjee [16] showed that Bregman divergence proximal mirror gradient descent is equivalent to the natural gradient method on the dual manifold of the Bregman divergence. Khan et al. [8] consider a KL-divergence proximal algorithm for learning conditionally conjugate exponential families, which they show to correspond to a natural gradient step. For exponential families, the KL-divergence corresponds to a Bregman divergence, and so the natural gradient step is on the primal manifold of the Bregman divergence. Thus the result seems to conflict with the result in [16]. However, this can be explained by the gradient being taken with respect to a different argument of the divergence, i.e., they consider \(\nabla _x D_g(x',x)\) and not \(\nabla _x D_g(x,x')\). It is intriguing how two different geometries are involved in this choice.

Pascanu and Bengio [15] remarked on the connections between the natural gradient method and Hessian-free optimization [11], Krylov Subspace Descent [17], and TONGA [9]. The main connection between Hessian-free optimization and Krylov subspace descent is the use of the extended Gauss–Newton approximation of the Hessian [18], which gives a similar square form involving the Jacobian as the pullback Fisher–Rao metric on a statistical manifold. The connection was further studied by Martens [12], where an equivalence criterion between the Fisher–Rao natural gradient and the extended Gauss–Newton approximation was given.

4 Example Computations

We will now provide example computations for the local Hessian \(H^c\) of different similarity measures c, as it is the essential object in computing the natural gradient given in (2.8). We first show that in the cases of KL-divergence and a Riemannian metric, the definition of the formal natural gradient matches the classical definition, as expected. Furthermore, we contribute local Hessians for general f-divergences and Finsler metrics, specifically for the p-Wasserstein metrics.

Natural Gradient of f-Divergences. Let \(\rho , \rho ' \in \mathrm {AC}(X)\) and \(f:\mathbb {R}_{>0}\rightarrow \mathbb {R}_{\ge 0}\) be a convex function satisfying \(f(1) = 0\). Then, the f-divergence from \(\rho '\) to \(\rho \) is

$$\begin{aligned} D_f(\rho ||\rho ') = \int _X \rho (x) f\left( \frac{\rho '(x)}{\rho (x)}\right) dx. \end{aligned}$$
(4.1)
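As an illustration (ours, assuming SciPy is available; all function names hypothetical), (4.1) can be evaluated by one-dimensional quadrature and checked against closed forms:

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

def f_divergence(rho, rho_prime, f, lo=-30.0, hi=30.0):
    """Numerical f-divergence (4.1); assumes negligible mass outside [lo, hi]."""
    def integrand(x):
        r = rho(x)
        return r * f(rho_prime(x) / r) if r > 0 else 0.0
    return quad(integrand, lo, hi)[0]

# f = -log recovers D_KL(rho || rho'); closed form for these Gaussians:
# log(2) + (1 + 1) / (2 * 2**2) - 1/2 = 0.4431...
kl = f_divergence(norm(0, 1).pdf, norm(1, 2).pdf, lambda t: -np.log(t))
```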

Now, consider the statistical manifold \((\mathbb {R}^d, \varTheta , \rho )\), and compute the local Hessian

$$\begin{aligned} \left[ H^{D_f}_\theta \right] _{ij} = \nabla ^2f(1)\int _{X}\frac{\partial \log \rho _\theta (x)}{\partial \theta _i}\frac{\partial \log \rho _\theta (x)}{\partial \theta _j} \rho _\theta (x) dx. \end{aligned}$$
(4.2)

Substituting \(f = - \log \) in (4.1) results in the KL-divergence, denoted by \(D_{\mathrm {KL}}(\rho ||\rho ')\). Noticing that \(\nabla ^2f(1)=1\) with this substitution, we can write (4.2) as \(H^{D_f}_\theta = \nabla ^2 f(1)H^{D_{\mathrm {KL}}}_\theta \), where the local Hessian \(H^{D_{\mathrm {KL}}}_\theta \) is exactly the Fisher–Rao metric tensor at \(\theta \); thus the natural gradient of Amari [2] is retrieved.
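The integral in (4.2) is an expectation of a score outer product under \(\rho _\theta \) and can thus be estimated by sampling; a sketch (ours) for \(\mathcal {N}(\mu , \sigma ^2)\) with \(\theta = (\mu , \sigma )\), where the exact Fisher–Rao tensor \(\mathrm {diag}(1/\sigma ^2, 2/\sigma ^2)\) is available for checking:

```python
import numpy as np

def fisher_rao_gaussian(mu, sigma, n_samples=100_000, seed=0):
    """Monte Carlo estimate of (4.2) for N(mu, sigma^2), theta = (mu, sigma).

    Estimates E[score score^T] under rho_theta; the exact tensor is
    diag(1/sigma^2, 2/sigma^2).
    """
    rng = np.random.default_rng(seed)
    x = rng.normal(mu, sigma, n_samples)
    score = np.stack([(x - mu) / sigma**2,                   # d log p / d mu
                      ((x - mu)**2 - sigma**2) / sigma**3])  # d log p / d sigma
    return score @ score.T / n_samples
```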

Natural Gradient of Riemannian Distance. Let (M, g) be a Riemannian manifold with the induced distance function \(d_g\), and denote the metric tensor at \(\rho \in M\) by \(G^M_\rho \). Finally, denote by \(\rho _\theta \) a submanifold of M parametrized by \(\theta \in \varTheta \). Then, for \(c=\frac{1}{2}d_g^2\), we compute \(H^{\frac{1}{2}d_g^2}_\theta \) as follows

$$\begin{aligned} \begin{aligned} \left[ H^{\frac{1}{2}d^2}_\theta \right] _{ij}=&\frac{1}{2}\left( \frac{\partial }{\partial \theta _j}\rho _\theta \right) ^T \left[ \nabla ^2_{\rho _\eta \rightarrow \rho _\theta }d^2(\rho _\eta ,\rho _{\theta })\right] \left( \frac{\partial }{\partial \theta _i}\rho _\theta \right) \\&+\frac{1}{2}\left[ \frac{\partial ^2}{\partial \theta _j \partial \theta _i}\rho _\theta \right] \left[ \nabla _{\rho _\eta \rightarrow \rho _\theta }d^2(\rho _\eta ,\rho _{\theta })\right] , \end{aligned} \end{aligned}$$
(4.3)

where, in the limit \(\eta \rightarrow \theta \), the second term vanishes, as the gradient of \(d^2(\cdot ,\rho _\theta )\) is zero at the minimizer \(\rho _\theta \). Finally, \(\nabla ^2_{\rho _\eta \rightarrow \rho _\theta } d^2(\rho _\eta , \rho _{\theta }) = 2G^M_{\rho _\theta }\), thus

$$\begin{aligned} H^{\frac{1}{2}d_g^2}_{\theta } =J_\theta ^T G^M_{\rho _\theta } J_\theta , \end{aligned}$$
(4.4)

where \(J_\theta = \frac{\partial }{\partial \theta }\rho _\theta \) denotes the Jacobian. Therefore, the formal natural gradient corresponds to the traditional coordinate-free definition of a gradient on a Riemannian manifold, when the metric is given by the pullback.
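In particular, if M is a Euclidean space with \(G^M = I\) — e.g. least-squares fitting of a model \(f_\theta \) to targets y with \(c = \frac{1}{2}\Vert f_\theta - y\Vert _2^2\) — then (4.4) gives \(H_\theta = J_\theta ^TJ_\theta \), and the formal natural gradient step is a Gauss–Newton step (cf. Sect. 3); a sketch (ours):

```python
import numpy as np

def gauss_newton_step(theta, f, jac, y, lr=1.0):
    """Formal natural gradient step for c = 0.5 * ||f(theta) - y||_2^2 under
    the pullback Euclidean metric H = J^T J (a Gauss-Newton step)."""
    J = jac(theta)                             # (m, n) Jacobian of f at theta
    grad = J.T @ (f(theta) - y)                # Euclidean gradient of the objective
    H = J.T @ J + 1e-10 * np.eye(theta.size)   # (4.4) with G^M = I; tiny jitter
    return theta - lr * np.linalg.solve(H, grad)
```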

Natural Gradient of Finsler Distance. Let (M, F) denote a Finsler manifold, where \(F_\rho :T_\rho M \rightarrow \mathbb {R}_{\ge 0}\), for any \(\rho \in M\), is a Finsler metric satisfying the properties of strong convexity, positive 1-homogeneity, and positive definiteness. Then, a distance \(d_F\) is induced on M by

$$\begin{aligned} d_F(\rho ,\rho ') = \inf \limits _{\gamma } \int _0^1 F_{\gamma (t)}(\dot{\gamma }(t))dt,~\rho ,\rho '\in M \end{aligned}$$
(4.5)

where \(\gamma \) is any continuous curve, parametrized over the unit interval, with \(\gamma (0) = \rho \) and \(\gamma (1) = \rho '\).

The fundamental tensor \(G^F\) of F at \((\rho ,v)\) is defined as \(G^F_{\rho }(v) = \frac{1}{2}\nabla ^2_{v} F^2_\rho (v)\). Then, \(G^F_\rho \) is 0-homogeneous as the second differential of a 2-homogeneous function. Therefore, \(G^F_\rho (\lambda v) = G^F_\rho (v)\) for any \(\lambda > 0\). Furthermore, \(G^F_\rho (v)\) is positive-definite when \(v\ne 0\). Now, let \(u = -J_\theta \nabla _\theta d^2_F(\rho _\theta , \rho ')\), and as we can locally write \(d^2_F(\rho , \rho ') = F^2_{\rho _\theta }(v)\) for a suitable v, we obtain

$$\begin{aligned} H^{\frac{1}{2}d_F^2}_\theta =\frac{1}{2}\nabla ^2_{\eta \rightarrow \theta } d^2_F(\rho _\eta , \rho _{\theta }) = \frac{1}{2}\lim \limits _{\lambda \rightarrow 0} \nabla ^2_{v= \lambda u} F^2_{\rho _\theta }(v) = J_\theta ^T G^F_{\rho _\theta }(u) J_\theta . \end{aligned}$$
(4.6)

Coordinate-free gradient descent on Finsler manifolds has been studied by Bercu [5]. The formal natural gradient differs slightly from this, as we use \(v =-J_\theta \nabla _\theta d^2_F(\rho _\theta ,\rho ')\) in the preconditioning matrix \(G^F_{\rho _\theta }(v)\) (see Remark 1), whereas in [5], v is chosen to maximize the descent. Thus, natural gradient descent in the Finsler case approximates the geometry quadratically in the direction of the gradient to improve the descent, but fails to take the entire local geometry into account.

p-Wasserstein Metric. Let \(X=\mathbb {R}^n\), and write \(\rho \in \mathcal {P}_p(X)\) if

$$\begin{aligned} \int _{X}d_2^p(x_0,x)\rho (x)dx < \infty ,~\text {for some }x_0\in X, \end{aligned}$$
(4.7)

where \(d_2\) is the Euclidean distance. Then, the p-Wasserstein distance \(W_p\) between \(\rho , \rho ' \in \mathcal {P}_p(X)\) is given by

$$\begin{aligned} W_p(\rho , \rho ') = \left( \inf \limits _{\gamma \in \mathrm {ADM}(\rho , \rho ')}\int _{X\times X}d_2^p(x,x')d\gamma (x,x')\right) ^\frac{1}{p}, \end{aligned}$$
(4.8)

where \(\mathrm {ADM}(\rho , \rho ')\) is the set of joint measures with marginal densities \(\rho \) and \(\rho '\). The p-Wasserstein distance is induced by a Finsler metric [1], given by

$$\begin{aligned} F_\rho (v) = \left( \int _X \Vert \nabla \varPhi _v\Vert _2^p d\rho \right) ^\frac{1}{p}, \end{aligned}$$
(4.9)

where \(v\in T_\rho \mathcal {P}_p(X)\) and \(\varPhi _v\) satisfies \(v(x) = -\nabla \cdot \left( \rho (x) \nabla _x \varPhi _v(x)\right) \) for any \(x\in X\), where \(\nabla \cdot \) is the divergence operator. Now, choose \(v = -J_\theta \nabla _\theta W_p^2(\rho _\theta , \rho )\). Then, a somewhat cumbersome computation yields how the local Hessian acts on two tangent vectors \(d\theta _1, d\theta _2\in T_\theta \varTheta \):

$$\begin{aligned} \begin{aligned}&H^{\frac{1}{2}W_p^2}_\theta (d\theta _1, d\theta _2) \\ =&(2-p)F^{2(1-p)}_{\rho _\theta }(v)\left( \int _X \Vert \nabla \varPhi _v\Vert _2^{p-2}\langle \nabla \varPhi _{d\theta _1},\nabla \varPhi _v \rangle d\rho _\theta \right) \\&\times \, \left( \int _X \Vert \nabla \varPhi _v\Vert _2^{p-2}\langle \nabla \varPhi _{d\theta _2}, \nabla \varPhi _v \rangle d\rho _\theta \right) \\&+\, F_{\rho _\theta }^{2-p}(v)\int _X\Vert \nabla \varPhi _v\Vert _2^{p-2}\langle \nabla \varPhi _{d\theta _1},\nabla \varPhi _{d\theta _2} \rangle d\rho _\theta \\&+\, (p-2)F_{\rho _\theta }^{2-p}(v)\int _X\Vert \nabla \varPhi _v\Vert _2^{p-4}\langle \nabla \varPhi _{d\theta _1},\nabla \varPhi _v \rangle \langle \nabla \varPhi _{d\theta _2},\nabla \varPhi _v \rangle d\rho _\theta , \end{aligned} \end{aligned}$$
(4.10)

where \(J_\theta d\theta _i = - \nabla \cdot \left( \rho _\theta \nabla \varPhi _{d\theta _i}\right) \) for \(i=1,2\). The case \(p=2\) is special, as the 2-Wasserstein metric is induced by a Riemannian metric, whose pullback can be recovered by substituting \(p=2\) in (4.10), yielding

$$\begin{aligned} H^{\frac{1}{2}W_2^2}_{\theta }(d\theta _1, d\theta _2) = \int _{X} \langle \nabla \varPhi _{d\theta _1}, \nabla \varPhi _{d\theta _2}\rangle d\rho _\theta . \end{aligned}$$
(4.11)

This yields the natural gradient of \(W_2^2\) as introduced in [6, 10].
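As a closing sanity check (ours): for univariate Gaussians, \(W_2^2(\mathcal {N}(\mu , \sigma ^2), \mathcal {N}(\mu ', \sigma '^2)) = (\mu - \mu ')^2 + (\sigma - \sigma ')^2\), so in the parametrization \(\theta = (\mu , \sigma )\) the local Hessian of \(\frac{1}{2}W_2^2\) is the identity, and the formal natural gradient coincides with the Euclidean gradient. This can be verified with the finite-difference sketch of Sect. 2:

```python
import numpy as np

def w2_sq_gauss1d(a, b):
    """Closed-form W_2^2 between N(a[0], a[1]^2) and N(b[0], b[1]^2)."""
    return (a[0] - b[0])**2 + (a[1] - b[1])**2

theta = np.array([0.0, 2.0])
c = lambda eta, th: 0.5 * w2_sq_gauss1d(eta, th)
H = local_hessian(c, theta)  # finite-difference sketch from Sect. 2
print(np.round(H, 3))        # approximately the 2x2 identity
```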