
1 Introduction

The natural gradient method [2] in optimization originates from information geometry [4], which utilizes the Riemannian geometry of statistical manifolds (the parameter spaces of model families) endowed with the Fisher–Rao metric. The natural gradient is used for minimizing the Kullback–Leibler (KL) divergence, a similarity measure between a model distribution and a target distribution, which can be shown to be equivalent to maximizing the model likelihood of given data. The success of the natural gradient in optimization stems from accelerating likelihood maximization and from its infinitesimal invariance to reparametrizations of the model, which yields robustness towards arbitrary parametrization choices.

In the modern formulation of the natural gradient, a Riemannian metric on the statistical manifold is chosen, with respect to which the gradient of the given similarity is computed [4, Sec. 12]. The choice of the Riemannian metric should, however, relate closely to the similarity measure being minimized. We have illustrated this in Fig. 1, where model selection for Gaussian process regression is carried out by maximizing the prior likelihood of the data with natural gradients stemming from different metrics. Clearly, the Fisher–Rao metric—which infinitesimally corresponds to the KL-divergence—achieves the fastest convergence.

One approach to choosing a related Riemannian metric is the classical Newton’s method, which derives a metric from the Hessian of a convex objective function, or from its absolute value in the non-convex case [7]. Unfortunately, evaluating the Hessian is not feasible in some cases. Instead, we can compute a local Hessian, which corresponds to a local second-order expansion of the similarity measure [3]. This approach generalizes the natural gradient from the KL-divergence case to general similarity measures; to avoid confusion with the well-known KL-divergence setting, we refer to this approach as the formal natural gradient. We furthermore discuss the similarities between the trust region, proximal, and natural gradient methods in Sect. 3 and provide example computations in Sect. 4.

Fig. 1. Maximizing prior likelihood for Gaussian process regression using natural gradients under different metrics on Gaussian distributions. Convergence plots on the left; data and model fit, with optimal exponentiated quadratic kernel parameters, on the right.

2 Useful Metrics via Formalizing the Natural Gradient

The natural gradient is computed with respect to a chosen metric on the statistical manifold, which often results from pulling back a metric between distributions. This way, the gradient takes into account how the metric on distributions penalizes movement in different directions. We will now review how the natural gradient is computed given a Riemannian metric. Then, we introduce the formal natural gradient, which derives this metric from the similarity measure itself.

Statistical Manifold. Let \(\mathrm {AC}(X)\) denote the set of absolutely continuous probability distributions on some manifold X. A statistical manifold is defined by a triple \((X, \varTheta , \rho )\), where X is called the sample space and \(\varTheta \subseteq \mathbb {R}^n\) the parameter space. Then, \(\rho :\varTheta \rightarrow \mathrm {AC}(X)\) maps a parameter to a density, given by \(\rho :\theta \mapsto \rho _\theta (\cdot )\), for any \(\theta \in \varTheta \). Abusing terminology, we also call \(\varTheta \) the statistical manifold.

Cost Function. Let a similarity measure \(c^* :\mathrm {AC}(X) \times \mathrm {AC}(X) \rightarrow \mathbb {R}_{\ge 0}\) (e.g. a metric or an information divergence) be defined on \(\mathrm {AC}(X)\), satisfying \(c^*(\rho , \rho ')=0\) if and only if \(\rho =\rho '\). Assume \(c^*\) to be strictly convex in \(\rho \). Given a target distribution \(\rho \in \mathrm {AC}(X)\) and a statistical manifold \((X, \varTheta , \rho )\), we wish to minimize the cost function \(c :\varTheta \times \mathrm {AC}(X) \rightarrow \mathbb {R}_{\ge 0}\) given by

$$\begin{aligned} c(\theta , \rho ) = c^*(\rho _\theta , \rho ). \end{aligned}$$
(2.1)

If \(\rho = \rho _{\theta '}\) for some \(\theta ' \in \varTheta \), then by abuse of notation we write \(c(\theta , \theta ')\). We finally assume that \(\theta \mapsto c(\theta , \theta ')\) is \(C^2\) whenever \(\theta \ne \theta '\).

Natural Gradient. Assume a Riemannian structure \((\varTheta , g^\varTheta )\) on the statistical manifold. The Riemannian metric \(g^\varTheta \) induces a metric tensor \(G^\varTheta \), given by \(g_\theta ^\varTheta (u,v) = u^TG_\theta ^\varTheta v\), and a distance function which we denote by \(d_\varTheta \). Here the vectors \(u, v\) belong to the tangent space \(T_\theta \varTheta \) at \(\theta \). It is common intuition that the negative gradient \(v = -\nabla _\theta c(\theta , \rho )\) gives the direction of maximal descent for c. However, this is only true on a Euclidean manifold. Consider

$$\begin{aligned} \hat{v} = \mathop {\text {arg min}}\limits _{v \in T_\theta \varTheta :d_\varTheta (\theta , \theta + v)= \varDelta } c(\theta + v, \rho ), \end{aligned}$$
(2.2)

where \(\theta + v\) is to be understood in a chart of \(\varTheta \), and \(\varDelta >0\) defines the radius of the trust region. Linearly approximating the objective and quadratically approximating the constraint, this is solved using Lagrange multipliers, giving the natural gradient

$$\begin{aligned} \hat{v} = - \frac{1}{\lambda }\left[ G_\theta ^\varTheta \right] ^{-1}\nabla _\theta c(\theta , \rho ), \end{aligned}$$
(2.3)

for some Lagrange multiplier \(\lambda > 0\), which we refer to as the learning rate. Below, a similar derivation is carried out in more detail.
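To make the update concrete, here is a minimal sketch (ours, not from the original text) of (2.3) in Python; grad_c and metric_tensor are hypothetical user-supplied callables, and the linear system is solved instead of forming \(\left[ G_\theta ^\varTheta \right] ^{-1}\) explicitly:

```python
import numpy as np

def natural_gradient_step(theta, grad_c, metric_tensor, lr=0.1):
    """One natural gradient step (2.3); lr plays the role of 1/lambda.

    grad_c: callable returning the Euclidean gradient of c(., rho) at theta.
    metric_tensor: callable returning the SPD matrix G_theta.
    """
    G = metric_tensor(theta)
    g = grad_c(theta)
    return theta - lr * np.linalg.solve(G, g)  # G^{-1} g without forming the inverse
```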

Formal Natural Gradient. Traditionally, the natural gradient uses the Fisher–Rao metric when the similarity measure is the KL-divergence. We now show how a trust region formulation with respect to the chosen similarity measure can be used to derive a natural metric, under which the natural gradient can be computed; this results in the formal natural gradient. Thus, consider the minimization task

$$\begin{aligned} \hat{v}:= \mathop {\text {arg min}}_{v\in T_\theta \varTheta ,~c(\theta + v, \theta ) = \varDelta } c(\theta + v, \rho ). \end{aligned}$$
(2.4)

We approximate the constraint by the second-degree Taylor expansion

$$\begin{aligned} c(\theta + v, \theta ) \approx \frac{1}{2}v^T \left( \nabla ^2_{\eta \rightarrow \theta }c(\eta , \theta )\right) v, \end{aligned}$$
(2.5)

where the zeroth and first degree terms vanish, as \(c(\theta + v,\theta )\) attains its minimum 0 at \(v=0\). We call the symmetric positive definite matrix \(H^c_\theta :=\nabla ^2_{\eta \rightarrow \theta }c(\eta , \theta )\) the local Hessian; the notation \(\nabla ^2_{\eta \rightarrow \theta }\) indicates that the Hessian in \(\eta \) is taken in the limit \(\eta \rightarrow \theta \) (see Remark 1). Then, we further approximate the objective function

$$\begin{aligned} c(\theta + v, \rho ) \approx c(\theta , \rho ) + \nabla _{\theta } c(\theta , \rho )^Tv. \end{aligned}$$
(2.6)

Writing the approximate Lagrangian \(\mathcal {L}(v)\) of (2.4) with a multiplier \(\lambda > 0\), we get

$$\begin{aligned} \mathcal {L}(v) \approx c(\theta , \rho ) + \nabla _{\theta } c(\theta , \rho )^Tv + \frac{\lambda }{2}v^T \left( \nabla ^2_{\eta \rightarrow \theta }c(\eta , \theta )\right) v. \end{aligned}$$
(2.7)

Thus, by the method of Lagrange multipliers, (2.4) is solved as

$$\begin{aligned} \hat{v} = -\frac{1}{\lambda }\left[ H^c_\theta \right] ^{-1}\nabla _{\theta }c(\theta , \rho ). \end{aligned}$$
(2.8)

We refer to \(\hat{v}\) as the formal natural gradient with respect to c.
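As a quick sanity check of (2.5) and (2.8), take \(c = D_{\mathrm {KL}}\) on univariate Gaussians with fixed variance \(\sigma ^2\) and parameter \(\theta = \mu \); then

$$\begin{aligned} c(\mu + v, \mu ) = D_{\mathrm {KL}}\left( \mathcal {N}(\mu + v, \sigma ^2)\,||\,\mathcal {N}(\mu , \sigma ^2)\right) = \frac{v^2}{2\sigma ^2}, \end{aligned}$$

so \(H^c_\mu = 1/\sigma ^2\), the Fisher information of the mean parameter, and the step (2.8) rescales the Euclidean gradient by \(\sigma ^2\).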

Remark 1

We could have simply substituted \(\eta = \theta \) in the local Hessian if \(\nabla ^2_\eta c(\eta , \theta )\) were continuous at \(\eta = \theta \). However, when studying Finsler metrics later in this work, the expression has a discontinuity at \(\eta = \theta \). Therefore, a direction for the limit has to be chosen, and as a straightforward candidate we compute the limit from the direction of the gradient.
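When c is \(C^2\) around the diagonal (as in the f-divergence and Riemannian examples of Sect. 4, but not the Finsler case of this remark), the local Hessian can also be approximated numerically; a finite-difference sketch (ours, names hypothetical):

```python
import numpy as np

def local_hessian(c, theta, eps=1e-4):
    """Central finite-difference estimate of H^c_theta = grad^2_eta c(eta, theta)
    evaluated at eta = theta, for c(eta, theta) >= 0 with c(theta, theta) = 0."""
    n = theta.size
    I = np.eye(n)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ei, ej = eps * I[i], eps * I[j]
            H[i, j] = (c(theta + ei + ej, theta) - c(theta + ei - ej, theta)
                       - c(theta - ei + ej, theta) + c(theta - ei - ej, theta)) / (4 * eps**2)
    return 0.5 * (H + H.T)  # symmetrize against round-off
```

Given a gradient g of \(\theta \mapsto c(\theta , \rho )\), the step (2.8) is then theta - lr * np.linalg.solve(local_hessian(c, theta), g).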

Metric Interpretation. The local Hessian \(H^c_\theta \) can be seen as a metric tensor at any \(\theta \in \varTheta \), inducing an inner product \(g_\theta ^c :T_\theta \varTheta \times T_\theta \varTheta \rightarrow \mathbb {R}\) given by \(g_\theta ^c(v,u) = v^TH^c_\theta u\). This imposes a pseudo-Riemannian structure on \(\varTheta \), forming the pseudo-Riemannian manifold \((\varTheta , g^c)\). Therefore, \(H^c_\theta \) provides us a natural metric under which to compute the natural gradient for a general \(c^*\). If \(\rho \) has a full-rank Jacobian everywhere, then a Riemannian metric is retrieved. There is also an obvious pullback structure at play. Recall that the cost is defined by \(c(\theta , \theta ')=c^*(\rho _{\theta }, \rho _{\theta '})\). Then, computing the local Hessian yields

$$\begin{aligned} H^c_\theta = J_\theta ^T H^{c^*}_{\rho _\theta } J_\theta , \end{aligned}$$
(2.9)

where \(H^{c^*}_{\rho _\theta } = \nabla ^2_{\rho \rightarrow \rho _\theta } c^*(\rho , \rho _\theta )\). Thus, \(H^c\) results from pulling back the \(c^*\) induced metric tensor \(H^{c^*}\) on \(\mathrm {AC}(X)\) to the statistical manifold \(\varTheta \). In information geometry, this Riemannian metric is said to be induced by the corresponding divergence (similarity measure) [3]. Therefore, the formal natural gradient is just the Riemannian gradient under the aforementioned induced metric.
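In code, (2.9) is a single congruence transform; a two-line sketch (ours), under the assumption that \(\rho _\theta \) admits a finite-dimensional representation with Jacobian J:

```python
import numpy as np

def pullback_metric(J, H_star):
    """(2.9): pull back a metric tensor on AC(X) to parameter space.

    J: (m, n) Jacobian of theta -> rho_theta (m: dimension of the
       finite representation of the density); H_star: (m, m) tensor.
    Returns J^T H_star J, which is SPD when J has full column rank.
    """
    return J.T @ H_star @ J
```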

Asymptotically Newton’s Method. We provide a straightforward result stating that the local Hessian approaches the actual Hessian in the limit; thus the formal natural gradient method approaches Newton’s method. This is well known in the Fisher–Rao case, but for completeness we state the result for the formal natural gradient.

Proposition 1

Assume \(c(\theta ,\rho ) = c(\theta , \theta ')\) for some \(\theta ' \in \varTheta \), and that c is \(C^2\) in \(\theta \). Then, the formal natural gradient asymptotically yields Newton’s method.

Proof

The Hessian at \(\theta \) is given by \(\nabla ^2_\theta c(\theta , \theta ')\). Then, as c is \(C^2\) in the first argument, passing to the limit \(\theta \rightarrow \theta '\) yields

$$\begin{aligned} H^c_\theta = \nabla ^2_{\eta \rightarrow \theta } c(\eta , \theta ) \overset{\theta \rightarrow \theta '}{\rightarrow } \nabla ^2_{\eta \rightarrow \theta '}c(\eta ,\theta ') = \nabla ^2_{\eta = \theta '} c(\eta , \theta '), \end{aligned}$$
(2.10)

where the last expression is the Hessian at \(\theta '\).

3 Loved Child Has Many Names – Related Methods

In this section, we discuss connections between seemingly different optimization methods. Some of these connections have already been reported in the literature, and some are likely known to some extent in the community. However, the authors are unaware of previous work drawing out these connections to their full extent. We provide such a discussion, and then present other related connections.

As discussed in [14], proximal methods and trust region methods are equivalent up to the learning rate. Trust region methods employ an \(l^2\)-metric constraint

$$\begin{aligned} x_{t+1} = \mathop {\text {arg min}}\limits _{x:\Vert x-x_t\Vert _2 \le \varDelta } f(x),~\varDelta >0, \end{aligned}$$
(3.1)

whereas proximal methods include an \(l^2\)-metric penalization term

$$\begin{aligned} x_{t+1} = \mathop {\text {arg min}}\limits _{x}\left\{ f(x) + \frac{1}{2\lambda }\Vert x - x_t\Vert _2^2\right\} ,~\lambda > 0. \end{aligned}$$
(3.2)

The two can be shown to be equivalent up to learning rate via Lagrangian duality.
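To sketch the duality argument for convex f: squaring the constraint in (3.1) (equivalent for \(\varDelta > 0\)) and introducing a multiplier \(\mu \ge 0\) gives the Lagrangian

$$\begin{aligned} \mathcal {L}(x, \mu ) = f(x) + \mu \left( \Vert x - x_t\Vert _2^2 - \varDelta ^2\right) , \end{aligned}$$

and minimizing \(\mathcal {L}(\cdot , \mu ^*)\) at the optimal multiplier \(\mu ^*\) recovers (3.2) under the identification \(\mu ^* = \frac{1}{2\lambda }\); each radius \(\varDelta \) thus corresponds to some learning rate \(\lambda \), and vice versa.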

Instead of the \(l^2\)-metric penalization, mirror gradient descent [13] employs a more general proximity function \(\varPsi :\mathbb {R}^n \times \mathbb {R}^n \rightarrow \mathbb {R}_{\ge 0}\) that is strictly convex in the first argument. Then, the mirror descent step is given by

$$\begin{aligned} x_{t+1} = \mathop {\text {arg min}}\limits _{x}\left\{ \langle x - x_t, \nabla f(x_t)\rangle + \frac{1}{\lambda }\varPsi (x, x_t)\right\} . \end{aligned}$$
(3.3)

Commonly, \(\varPsi \) is chosen to be a Bregman divergence \(D_g\), defined by choosing a strictly convex \(C^2\) function g and writing

$$\begin{aligned} D_g(x,x') = g(x) - g(x') - \langle \nabla g(x'), x - x' \rangle . \end{aligned}$$
(3.4)
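As a concrete instance, taking g to be the negative entropy on the probability simplex makes \(D_g\) the KL-divergence, and (3.3) then has a closed-form solution, the exponentiated gradient update; a minimal sketch (ours):

```python
import numpy as np

def mirror_descent_step(x, grad_f, lr=0.1):
    """One mirror descent step (3.3) on the simplex with Psi = D_g,
    g = negative entropy: the exponentiated gradient update."""
    y = x * np.exp(-lr * grad_f(x))
    return y / y.sum()  # re-normalize onto the simplex
```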

To explain how these methods are related to the natural gradient, assume that we are minimizing a general similarity measure c(x, y) with respect to x, as in Sect. 2. Recall that we first defined the natural gradient as a trust region step. In order to derive an analytical expression for the iteration, we approximated the objective function by its first-order Taylor polynomial and the constraint by the local Hessian, and then used Lagrangian duality to obtain a proximal expression, which, when solved, yields the formal natural gradient. In Sect. 4, we show how this workflow indeed corresponds to known examples of the natural gradient.

Further Connections. Raskutti and Mukherjee [16] showed that Bregman divergence proximal mirror gradient descent is equivalent to the natural gradient method on the dual manifold of the Bregman divergence. Khan et al. [8] consider a KL-divergence proximal algorithm for learning conditionally conjugate exponential families, which they show to correspond to a natural gradient step. For exponential families, the KL-divergence corresponds to a Bregman divergence, and so the natural gradient step is on the primal manifold of the Bregman divergence. Thus the result seems to conflict with the result in [16]. However, this can be explained by the gradient being taken with respect to a different argument of the divergence, i.e., they consider \(\nabla _x D_g(x',x)\) and not \(\nabla _x D_g(x,x')\). It is intriguing how two different geometries are involved in this choice.

Pascanu and Bengio [15] remarked on the connections between the natural gradient method and Hessian-free optimization [11], Krylov Subspace Descent [17], and TONGA [9]. The main connection between Hessian-free optimization and Krylov subspace descent is the use of the extended Gauss–Newton approximation of the Hessian [18], which gives a similar square form involving the Jacobian as the pullback Fisher–Rao metric on a statistical manifold. The connection was further studied by Martens [12], where an equivalence criterion between the Fisher–Rao natural gradient and the extended Gauss–Newton approximation was given.

4 Example Computations

We will now provide example computations for the local Hessian \(H^c\) of different similarity measures c, as it is the essential object in computing the natural gradient given in (2.8). We first show that in the cases of KL-divergence and a Riemannian metric, the definition of the formal natural gradient matches the classical definition, as expected. Furthermore, we contribute local Hessians for general f-divergences and Finsler metrics, specifically for the p-Wasserstein metrics.

Natural Gradient of f-Divergences. Let \(\rho , \rho ' \in \mathrm {AC}(X)\) and \(f:\mathbb {R}_{>0}\rightarrow \mathbb {R}_{\ge 0}\) be a convex function satisfying \(f(1) = 0\). Then, the f-divergence from \(\rho '\) to \(\rho \) is

$$\begin{aligned} D_f(\rho ||\rho ') = \int _X \rho (x) f\left( \frac{\rho '(x)}{\rho (x)}\right) dx. \end{aligned}$$
(4.1)
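As an illustration (ours, assuming SciPy is available; all function names hypothetical), (4.1) can be evaluated by one-dimensional quadrature and checked against closed forms:

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

def f_divergence(rho, rho_prime, f, lo=-30.0, hi=30.0):
    """Numerical f-divergence (4.1); assumes negligible mass outside [lo, hi]."""
    def integrand(x):
        r = rho(x)
        return r * f(rho_prime(x) / r) if r > 0 else 0.0
    return quad(integrand, lo, hi)[0]

# f = -log recovers D_KL(rho || rho'); closed form for these Gaussians:
# log(2) + (1 + 1) / (2 * 2**2) - 1/2 = 0.4431...
kl = f_divergence(norm(0, 1).pdf, norm(1, 2).pdf, lambda t: -np.log(t))
```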

Now, consider the statistical manifold \((\mathbb {R}^d, \varTheta , \rho )\), and compute the local Hessian

$$\begin{aligned} \left[ H^{D_f}_\theta \right] _{ij} = \nabla ^2f(1)\int _{X}\frac{\partial \log \rho _\theta (x)}{\partial \theta _i}\frac{\partial \log \rho _\theta (x)}{\partial \theta _j} \rho _\theta (x) dx. \end{aligned}$$
(4.2)

Substituting \(f = - \log \) in (4.1) results in the KL-divergence, denoted by \(D_{\mathrm {KL}}(\rho ||\rho ')\). Noticing that \(\nabla ^2f(1)=1\) with this substitution, we can write (4.2) as \(H^{D_f}_\theta = \nabla ^2 f(1)H^{D_{\mathrm {KL}}}_\theta \), where the local Hessian \(H^{D_{\mathrm {KL}}}_\theta \) is exactly the Fisher–Rao metric tensor at \(\theta \); thus the natural gradient of Amari [2] is retrieved.
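The integral in (4.2) is an expectation of a score outer product under \(\rho _\theta \) and can thus be estimated by sampling; a sketch (ours) for \(\mathcal {N}(\mu , \sigma ^2)\) with \(\theta = (\mu , \sigma )\), where the exact Fisher–Rao tensor \(\mathrm {diag}(1/\sigma ^2, 2/\sigma ^2)\) is available for checking:

```python
import numpy as np

def fisher_rao_gaussian(mu, sigma, n_samples=100_000, seed=0):
    """Monte Carlo estimate of (4.2) for N(mu, sigma^2), theta = (mu, sigma).

    Estimates E[score score^T] under rho_theta; the exact tensor is
    diag(1/sigma^2, 2/sigma^2).
    """
    rng = np.random.default_rng(seed)
    x = rng.normal(mu, sigma, n_samples)
    score = np.stack([(x - mu) / sigma**2,                   # d log p / d mu
                      ((x - mu)**2 - sigma**2) / sigma**3])  # d log p / d sigma
    return score @ score.T / n_samples
```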

Natural Gradient of Riemannian Distance. Let (M, g) be a Riemannian manifold with the induced distance function \(d_g\), and denote the metric tensor at \(\rho \in M\) by \(G^M_\rho \). Finally, denote by \(\rho _\theta \) a submanifold of M parametrized by \(\theta \in \varTheta \). Then, for \(c=\frac{1}{2}d_g^2\), we compute \(H^{\frac{1}{2}d_g^2}_\theta \) as follows

$$\begin{aligned} \begin{aligned} \left[ H^{\frac{1}{2}d^2}_\theta \right] _{ij}=&\frac{1}{2}\left( \frac{\partial }{\partial \theta _j}\rho _\theta \right) ^T \left[ \nabla ^2_{\rho _\eta \rightarrow \rho _\theta }d^2(\rho _\eta ,\rho _{\theta })\right] \left( \frac{\partial }{\partial \theta _i}\rho _\theta \right) \\&+\frac{1}{2}\left[ \frac{\partial ^2}{\partial \theta _j \partial \theta _i}\rho _\theta \right] \left[ \nabla _{\rho _\eta \rightarrow \rho _\theta }d^2(\rho _\eta ,\rho _{\theta })\right] , \end{aligned} \end{aligned}$$
(4.3)

where, in the limit \(\eta \rightarrow \theta \), the second term vanishes, as the gradient of \(d^2(\cdot ,\rho _\theta )\) is zero at the minimizer \(\rho _\theta \). Finally, \(\nabla ^2_{\rho _\eta \rightarrow \rho _\theta } d^2(\rho _\eta , \rho _{\theta }) = 2G^M_{\rho _\theta }\), thus

$$\begin{aligned} H^{\frac{1}{2}d_g^2}_{\theta } =J_\theta ^T G^M_{\rho _\theta } J_\theta , \end{aligned}$$
(4.4)

where \(J_\theta = \frac{\partial }{\partial \theta }\rho _\theta \) denotes the Jacobian. Therefore, the formal natural gradient corresponds to the traditional coordinate-free definition of a gradient on a Riemannian manifold, when the metric is given by the pullback.
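In particular, if M is a Euclidean space with \(G^M = I\) — e.g. least-squares fitting of a model \(f_\theta \) to targets y with \(c = \frac{1}{2}\Vert f_\theta - y\Vert _2^2\) — then (4.4) gives \(H_\theta = J_\theta ^TJ_\theta \), and the formal natural gradient step is a Gauss–Newton step (cf. Sect. 3); a sketch (ours):

```python
import numpy as np

def gauss_newton_step(theta, f, jac, y, lr=1.0):
    """Formal natural gradient step for c = 0.5 * ||f(theta) - y||_2^2 under
    the pullback Euclidean metric H = J^T J (a Gauss-Newton step)."""
    J = jac(theta)                             # (m, n) Jacobian of f at theta
    grad = J.T @ (f(theta) - y)                # Euclidean gradient of the objective
    H = J.T @ J + 1e-10 * np.eye(theta.size)   # (4.4) with G^M = I; tiny jitter
    return theta - lr * np.linalg.solve(H, grad)
```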

Natural Gradient of Finsler Distance. Let (M, F) denote a Finsler manifold, where \(F_\rho :T_\rho M \rightarrow \mathbb {R}_{\ge 0}\), for any \(\rho \in M\), is a Finsler metric satisfying the properties of strong convexity, positive 1-homogeneity, and positive definiteness. Then, a distance \(d_F\) is induced on M by

$$\begin{aligned} d_F(\rho ,\rho ') = \inf \limits _{\gamma } \int _0^1 F_{\gamma (t)}(\dot{\gamma }(t))dt,~\rho ,\rho '\in M \end{aligned}$$
(4.5)

where \(\gamma \) is any continuous curve, parametrized over the unit interval, with \(\gamma (0) = \rho \) and \(\gamma (1) = \rho '\).

The fundamental tensor \(G^F\) of F at \((\rho ,v)\) is defined as \(G^F_{\rho }(v) = \frac{1}{2}\nabla ^2_{v} F^2_\rho (v)\). Then, \(G^F_\rho \) is 0-homogeneous as the second differential of a 2-homogeneous function. Therefore, \(G^F_\rho (\lambda v) = G^F_\rho (v)\) for any \(\lambda > 0\). Furthermore, \(G^F_\rho (v)\) is positive-definite when \(v\ne 0\). Now, let \(u = -J_\theta \nabla _\theta d^2_F(\rho _\theta , \rho ')\), and as we can locally write \(d^2_F(\rho , \rho ') = F^2_{\rho _\theta }(v)\) for a suitable v, we obtain

$$\begin{aligned} H^{\frac{1}{2}d_F^2}_\theta =\frac{1}{2}\nabla ^2_{\eta \rightarrow \theta } d^2_F(\rho _\eta , \rho _{\theta }) = \frac{1}{2}\lim \limits _{\lambda \rightarrow 0} \nabla ^2_{v= \lambda u} F^2_{\rho _\theta }(v) = J_\theta ^T G^F_{\rho _\theta }(u) J_\theta . \end{aligned}$$
(4.6)

Coordinate-free gradient descent on Finsler manifolds has been studied by Bercu [5]. The formal natural gradient differs slightly from this, as we use \(v =-J_\theta \nabla _\theta d^2_F(\rho _\theta ,\rho ')\) in the preconditioning matrix \(G^F_{\rho _\theta }(v)\) (see Remark 1), whereas in [5], v is chosen to maximize the descent. Thus, natural gradient descent in the Finsler case approximates the geometry quadratically in the direction of the gradient to improve the descent, but fails to take the entire local geometry into account.

p-Wasserstein Metric. Let \(X=\mathbb {R}^n\), and write \(\rho \in \mathcal {P}_p(X)\) if

$$\begin{aligned} \int _{X}d_2^p(x_0,x)\rho (x)dx < \infty ,~\text {for some }x_0\in X, \end{aligned}$$
(4.7)

where \(d_2\) is the Euclidean distance. Then, the p-Wasserstein distance \(W_p\) between \(\rho , \rho ' \in \mathcal {P}_p(X)\) is given by

$$\begin{aligned} W_p(\rho , \rho ') = \left( \inf \limits _{\gamma \in \mathrm {ADM}(\rho , \rho ')}\int _{X\times X}d_2^p(x,x')d\gamma (x,x')\right) ^\frac{1}{p}, \end{aligned}$$
(4.8)

where \(\mathrm {ADM}(\rho , \rho ')\) is the set of joint measures with marginal densities \(\rho \) and \(\rho '\). The p-Wasserstein distance is induced by a Finsler metric [1], given by

$$\begin{aligned} F_\rho (v) = \left( \int _X \Vert \nabla \varPhi _v\Vert _2^p d\rho \right) ^\frac{1}{p}, \end{aligned}$$
(4.9)

where \(v\in T_\rho \mathcal {P}_p(X)\) and \(\varPhi _v\) satisfies \(v(x) = -\nabla \cdot \left( \rho (x) \nabla _x \varPhi _v(x)\right) \) for any \(x\in X\), where \(\nabla \cdot \) is the divergence operator. Now, choose \(v = -J_\theta \nabla _\theta W_p^2(\rho _\theta , \rho )\). Then, a somewhat cumbersome computation yields how the local Hessian acts on two tangent vectors \(d\theta _1, d\theta _2\in T_\theta \varTheta \):

$$\begin{aligned} \begin{aligned}&H^{\frac{1}{2}W_p^2}_\theta (d\theta _1, d\theta _2) \\ =&(2-p)F^{2(1-p)}_{\rho _\theta }(v)\left( \int _X \Vert \nabla \varPhi _v\Vert _2^{p-2}\langle \nabla \varPhi _{d\theta _1},\nabla \varPhi _v \rangle d\rho _\theta \right) \\&\times \, \left( \int _X \Vert \nabla \varPhi _v\Vert _2^{p-2}\langle \nabla \varPhi _{d\theta _2}, \nabla \varPhi _v \rangle d\rho _\theta \right) \\&+\, F_{\rho _\theta }^{2-p}(v)\int _X\Vert \nabla \varPhi _v\Vert _2^{p-2}\langle \nabla \varPhi _{d\theta _1},\nabla \varPhi _{d\theta _2} \rangle d\rho _\theta \\&+\, (p-2)F_{\rho _\theta }^{2-p}(v)\int _X\Vert \nabla \varPhi _v\Vert _2^{p-4}\langle \nabla \varPhi _{d\theta _1},\nabla \varPhi _v \rangle \langle \nabla \varPhi _{d\theta _2},\nabla \varPhi _v \rangle d\rho _\theta , \end{aligned} \end{aligned}$$
(4.10)

where \(J_\theta d\theta _i = - \nabla \cdot \left( \rho _\theta \nabla \varPhi _{d\theta _i}\right) \) for \(i=1,2\). The case \(p=2\) is special, as the 2-Wasserstein metric is induced by a Riemannian metric, whose pullback can be recovered by substituting \(p=2\) in (4.10), yielding

$$\begin{aligned} H^{\frac{1}{2}W_2^2}_{\theta }(d\theta _1, d\theta _2) = \int _{X} \langle \nabla \varPhi _{d\theta _1}, \nabla \varPhi _{d\theta _2}\rangle d\rho _\theta . \end{aligned}$$
(4.11)

This yields the natural gradient of \(W_2^2\) as introduced in [6, 10].
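As a closing sanity check (ours): for univariate Gaussians, \(W_2^2(\mathcal {N}(\mu , \sigma ^2), \mathcal {N}(\mu ', \sigma '^2)) = (\mu - \mu ')^2 + (\sigma - \sigma ')^2\), so in the parametrization \(\theta = (\mu , \sigma )\) the local Hessian of \(\frac{1}{2}W_2^2\) is the identity, and the formal natural gradient coincides with the Euclidean gradient. This can be verified with the finite-difference sketch of Sect. 2:

```python
import numpy as np

def w2_sq_gauss1d(a, b):
    """Closed-form W_2^2 between N(a[0], a[1]^2) and N(b[0], b[1]^2)."""
    return (a[0] - b[0])**2 + (a[1] - b[1])**2

theta = np.array([0.0, 2.0])
c = lambda eta, th: 0.5 * w2_sq_gauss1d(eta, th)
H = local_hessian(c, theta)  # finite-difference sketch from Sect. 2
print(np.round(H, 3))        # approximately the 2x2 identity
```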