Abstract
In optimization, the natural gradient method is well-known for likelihood maximization. The method uses the Kullback–Leibler (KL) divergence, corresponding infinitesimally to the Fisher–Rao metric, which is pulled back to the parameter space of a family of probability distributions. This way, gradients with respect to the parameters respect the Fisher–Rao geometry of the space of distributions, which might differ vastly from the standard Euclidean geometry of the parameter space, often leading to faster convergence. The concept of natural gradient has in most discussions been restricted to the KL-divergence/Fisher–Rao case, although in information geometry the local \(C^2\) structure of a general divergence has been used for deriving a closely related Riemannian metric analogous to the KL-divergence case. In this work, we wish to cast natural gradients into this more general context and provide example computations, notably in the case of a Finsler metric and the p-Wasserstein metric. We additionally discuss connections between the natural gradient method and multiple other optimization techniques in the literature.
1 Introduction
The natural gradient method [2] in optimization originates from information geometry [4], which utilizes the Riemannian geometry of statistical manifolds (the parameter spaces of model families) endowed with the Fisher–Rao metric. The natural gradient is used for minimizing the Kullback–Leibler (KL) divergence, a similarity measure between a model distribution and a target distribution; this minimization can be shown to be equivalent to maximizing the model likelihood of given data. The success of the natural gradient in optimization stems from accelerating likelihood maximization and from its infinitesimal invariance to reparametrizations of the model, which provides robustness towards arbitrary parametrization choices.
In the modern formulation of the natural gradient, a Riemannian metric on the statistical manifold is chosen, with respect to which the gradient of the given similarity is computed [4, Sec. 12]. The choice of the Riemannian metric should, however, relate closely to the similarity measure being minimized. We have illustrated this in Fig. 1, where model selection for Gaussian process regression is carried out by maximizing the prior-likelihood of the data with natural gradients stemming from different metrics. Clearly, the Fisher–Rao metric, which infinitesimally corresponds to the KL-divergence, achieves the fastest convergence.
An example of an approach to choose a related Riemannian metric is the classical Newton’s method that derives a metric from the Hessian of a convex objective function, or its absolute value in the non-convex case [7]. Unfortunately, evaluating the Hessian is not feasible in some cases. Instead, we can compute a local Hessian, which corresponds to a local second order expansion of the similarity measure [3]. This approach generalizes the natural gradient from the KL-divergence case to general similarity measures, and to avoid confusion with the well-known KL-divergence setting, we refer to this approach as the formal natural gradient. We furthermore discuss the similarities between the trust region, proximal, and natural gradient methods in Sect. 3 and provide example computations in Sect. 4.
2 Useful Metrics via Formalizing the Natural Gradient
The natural gradient is computed with respect to a chosen metric on the statistical manifold, which often results from pulling back a metric between distributions. This way, the gradient takes into account how the metric on distributions penalizes movement into different directions. We will now review how the natural gradient is computed given a Riemannian metric. Then, we introduce the formal natural gradient, which derives this metric from the similarity measure.
Statistical Manifold. Let \(\mathrm {AC}(X)\) denote the set of absolutely continuous probability distributions on some manifold X. A statistical manifold is defined by a triple \((X, \varTheta , \rho )\), where X is called the sample space and \(\varTheta \subseteq \mathbb {R}^n\) the parameter space. Then, \(\rho :\varTheta \rightarrow \mathrm {AC}(X)\) maps a parameter to a density, given by \(\rho :\theta \mapsto \rho _\theta (\cdot )\), for any \(\theta \in \varTheta \). Abusing terminology, we also call \(\varTheta \) the statistical manifold.
Cost Function. Let a similarity measure \(c^* :\mathrm {AC}(X) \times \mathrm {AC}(X) \rightarrow \mathbb {R}_{\ge 0}\) (e.g. a metric or an information divergence) be defined on \(\mathrm {AC}(X)\), satisfying \(c^*(\rho , \rho ')=0\) if and only if \(\rho =\rho '\). Assume \(c^*\) to be strictly convex in \(\rho \). Given a target distribution \(\rho \in \mathrm {AC}(X)\) and a statistical manifold \((X, \varTheta , \rho )\), we wish to minimize the cost function \(c :\varTheta \times \mathrm {AC}(X) \rightarrow \mathbb {R}_{\ge 0}\) given by
\[c(\theta , \rho ) = c^*(\rho _\theta , \rho ). \tag{2.1}\]
If \(\rho = \rho _{\theta '}\) for some \(\theta ' \in \varTheta \), then by abuse of notation we write \(c(\theta , \theta ')\). We finally assume that \(\theta \mapsto c(\theta , \theta ')\) is \(C^2\) whenever \(\theta \ne \theta '\).
Natural Gradient. Assume a Riemannian structure \((\varTheta , g^\varTheta )\) on the statistical manifold. The Riemannian metric \(g^\varTheta \) induces a metric tensor \(G^\varTheta \), given by \(g_\theta ^\varTheta (u,v) = u^TG_\theta ^\varTheta v\), and a distance function which we denote by \(d_\varTheta \). The vectors u, v belong to the tangent space \(T_\theta \varTheta \) at \(\theta \). It is common intuition that the negative gradient \(v = -\nabla _\theta c(\theta , \rho )\) gives the direction of maximal descent for c. However, this is only true on a Euclidean manifold. Consider
\[\hat{v} = \mathop {\mathrm {arg\,min}}_{v \in T_\theta \varTheta \,:\, d_\varTheta (\theta , \theta + v) = \varDelta } c(\theta + v, \rho ), \tag{2.2}\]
where \(\theta + v\) is to be understood in a chart of \(\varTheta \), and \(\varDelta >0\) defines the radius of the trust region. Linearly approximating the objective and quadratically approximating the constraint, this is solved using Lagrangian multipliers, giving the natural gradient
\[\hat{v} = -\frac{1}{\lambda }\left[ G^\varTheta _\theta \right] ^{-1}\nabla _\theta c(\theta , \rho ), \tag{2.3}\]
for some Lagrangian multiplier \(\lambda > 0\), which we refer to as the learning rate. Below, a similar derivation is carried out in more detail.
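As a minimal sketch of the step above, assume the natural gradient takes the usual form \(\hat{v} = -\frac{1}{\lambda }[G^\varTheta _\theta ]^{-1}\nabla _\theta c(\theta , \rho )\) on a two-dimensional parameter space. The metric tensor `G`, the gradient `grad` and all function names below are hypothetical choices for illustration. The sketch compares the linearized decrease of the natural gradient step with that of a Euclidean gradient step rescaled to the same length under `G`.

```python
import math

def solve2(G, b):
    """Solve the 2x2 linear system G x = b by Cramer's rule."""
    (a, c), (d, e) = G
    det = a * e - c * d
    return [(b[0] * e - c * b[1]) / det, (a * b[1] - d * b[0]) / det]

def natural_gradient_step(grad, G, lam):
    """Return v = -(1/lam) * G^{-1} grad for a 2x2 metric tensor G."""
    return [-x / lam for x in solve2(G, grad)]

def g_norm(v, G):
    """Length of v under the inner product u^T G v."""
    return math.sqrt(sum(v[i] * G[i][j] * v[j] for i in range(2) for j in range(2)))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Hypothetical metric tensor and Euclidean gradient at the current parameter.
G = [[4.0, 0.0], [0.0, 1.0]]
grad = [4.0, 1.0]

v_nat = natural_gradient_step(grad, G, lam=1.0)

# Euclidean steepest descent direction, rescaled to the same G-length as v_nat,
# so that both steps lie on the same trust-region boundary.
scale = g_norm(v_nat, G) / g_norm(grad, G)
v_euc = [-scale * x for x in grad]

# Linearized change of the objective, grad^T v, for both steps.
decrease_nat = dot(grad, v_nat)
decrease_euc = dot(grad, v_euc)
print(v_nat, decrease_nat, decrease_euc)
```

In this toy setting the natural gradient step attains the larger linearized decrease among the two steps of equal length under `G`, which is the sense in which it is the steepest descent direction with respect to the chosen metric.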
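Formal Natural Gradient. Traditionally, the natural gradient uses the Fisher–Rao metric when the similarity measure used is the KL-divergence. We will now show how a trust region formulation with respect to the chosen similarity measure can be used to derive a natural metric under which the natural gradient can be computed, resulting in the formal natural gradient. Thus, consider the minimization task
\[\hat{v} = \mathop {\mathrm {arg\,min}}_{v \in T_\theta \varTheta \,:\, c(\theta + v, \theta ) = \varDelta } c(\theta + v, \rho ). \tag{2.4}\]
We approximate the constraint by the second degree Taylor expansion
\[c(\theta + v, \theta ) \approx \frac{1}{2} v^T \left[ \nabla ^2_{\eta \rightarrow \theta }c(\eta , \theta )\right] v, \tag{2.5}\]
where the 0th and 1st degree terms disappear as \(c(\theta + v,\theta )\) has a minimum 0 at \(v=0\). We call the symmetric positive definite matrix \(H^c_\theta :=\nabla ^2_{\eta \rightarrow \theta }c(\eta , \theta )\), that is, the Hessian with respect to \(\eta \) in the limit \(\eta \rightarrow \theta \), the local Hessian. Then, we further approximate the objective function
\[c(\theta + v, \rho ) \approx c(\theta , \rho ) + \nabla _\theta c(\theta , \rho )^T v. \tag{2.6}\]
Writing the approximate Lagrangian \(\mathcal {L}(v)\) of (2.4) with a multiplier \(\lambda > 0\), we get
\[\mathcal {L}(v) = c(\theta , \rho ) + \nabla _\theta c(\theta , \rho )^T v + \lambda \left( \frac{1}{2} v^T H^c_\theta v - \varDelta \right) . \tag{2.7}\]
Setting \(\nabla _v \mathcal {L}(v) = \nabla _\theta c(\theta , \rho ) + \lambda H^c_\theta v = 0\), the method of Lagrangian multipliers solves (2.4) as
\[\hat{v} = -\frac{1}{\lambda }\left[ H^c_\theta \right] ^{-1}\nabla _\theta c(\theta , \rho ). \tag{2.8}\]
We refer to \(\hat{v}\) as the formal natural gradient with respect to c.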
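The derivation can be checked numerically on a small example. The following sketch assumes the family of one-dimensional Gaussians parametrized by \(\theta = (\mu , \sigma )\) and takes \(c^*\) to be the KL-divergence, whose closed form between Gaussians is a standard fact not taken from the text above. The local Hessian and the gradient are approximated by central finite differences; all function names, the step sizes and the value of \(\lambda \) are illustrative assumptions.

```python
import math

def kl_gauss(p, q):
    """KL(N(m1, s1^2) || N(m2, s2^2)) for 1D Gaussians given as (mean, std)."""
    (m1, s1), (m2, s2) = p, q
    return math.log(s2 / s1) + (s1 ** 2 + (m1 - m2) ** 2) / (2 * s2 ** 2) - 0.5

def grad_fd(f, x, h=1e-6):
    """Central finite-difference gradient of f at x."""
    g = []
    for i in range(len(x)):
        xp, xm = list(x), list(x)
        xp[i] += h
        xm[i] -= h
        g.append((f(xp) - f(xm)) / (2 * h))
    return g

def hessian_fd(f, x, h=1e-4):
    """Central finite-difference Hessian of f at x."""
    n = len(x)
    H = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            total = 0.0
            for si, sj in ((1, 1), (1, -1), (-1, 1), (-1, -1)):
                y = list(x)
                y[i] += si * h
                y[j] += sj * h
                total += si * sj * f(y)
            H[i][j] = total / (4 * h * h)
    return H

def formal_natural_gradient(theta, target, lam):
    """Return (v, H) with v = -(1/lam) H^{-1} grad c and H the local Hessian."""
    # Euclidean gradient of the cost c(theta, rho) = KL(rho_theta || rho).
    g = grad_fd(lambda t: kl_gauss(t, target), theta)
    # Local Hessian: Hessian of eta -> c(eta, theta), evaluated at eta = theta.
    H = hessian_fd(lambda eta: kl_gauss(eta, theta), theta)
    det = H[0][0] * H[1][1] - H[0][1] * H[1][0]
    x0 = (H[1][1] * g[0] - H[0][1] * g[1]) / det
    x1 = (H[0][0] * g[1] - H[1][0] * g[0]) / det
    return [-x0 / lam, -x1 / lam], H

theta, target = [0.0, 2.0], [1.0, 1.0]
v, H = formal_natural_gradient(theta, target, lam=4.0)
new_theta = [theta[0] + v[0], theta[1] + v[1]]
print(H, v, kl_gauss(theta, target), kl_gauss(new_theta, target))
```

For this family the numerically computed local Hessian should agree with the Fisher–Rao metric tensor \(\mathrm {diag}(1/\sigma ^2, 2/\sigma ^2)\) in \((\mu , \sigma )\) coordinates, and the step decreases the cost for the chosen \(\lambda \).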
Remark 1
We could have just substituted \(\eta = \theta \) in the local Hessian if \(\nabla ^2_\eta c(\eta , \theta )\) was continuous at \(\eta = \theta \). However, when studying Finsler metrics later in this work, the expression has a discontinuity at \(\eta = \theta \). Therefore, a direction for the limit has to be chosen, and as a straightforward candidate we compute the limit from the direction of the gradient.
Metric Interpretation. The local Hessian \(H^c_\theta \) can be seen as a metric tensor at any \(\theta \in \varTheta \), inducing an inner product \(g_\theta ^c :T_\theta \varTheta \times T_\theta \varTheta \rightarrow \mathbb {R}\) given by \(g_\theta ^c(v,u) = v^TH^c_\theta u\). This imposes a pseudo-Riemannian structure on \(\varTheta \), forming the pseudo-Riemannian manifold \((\varTheta , g^c)\). Therefore, \(H^c_\theta \) provides us a natural metric under which to compute the natural gradient for a general \(c^*\). If \(\rho \) has a full rank Jacobian everywhere, then a Riemannian metric is retrieved. Also, there is an obvious pullback structure at play. Recall that the cost is defined by \(c(\theta , \theta ')=c^*(\rho _{\theta }, \rho _{\theta '})\). Then, computing the local Hessian yields
\[H^c_\theta = J_\theta ^T H^{c^*}_{\rho _\theta } J_\theta , \tag{2.9}\]
where \(H^{c^*}_{\rho _\theta } = \nabla ^2_{\rho \rightarrow \rho _\theta } c^*(\rho , \rho _\theta )\) and \(J_\theta = \frac{\partial }{\partial \theta }\rho _\theta \) denotes the Jacobian. Thus, \(H^c\) results from pulling back the \(c^*\) induced metric tensor \(H^{c^*}\) on \(\mathrm {AC}(X)\) to the statistical manifold \(\varTheta \). In information geometry, this Riemannian metric is said to be induced by the corresponding divergence (similarity measure) [3]. Therefore, the formal natural gradient is just the Riemannian gradient under the aforementioned induced metric.
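The pullback structure can be illustrated on a finite sample space, which is an assumption made here only to keep the computation finite-dimensional (the text works with densities in \(\mathrm {AC}(X)\)). The sketch below takes a hypothetical softmax-parametrized categorical family and the KL-divergence, for which the Hessian of \(\rho \mapsto c^*(\rho , \rho _\theta )\) at \(\rho = \rho _\theta \) is \(\mathrm {diag}(1/\rho _\theta )\), and compares \(J_\theta ^T H^{c^*}_{\rho _\theta } J_\theta \) with a finite-difference local Hessian computed directly in \(\theta \).

```python
import math

def softmax(theta):
    """Map parameters to a categorical distribution (hypothetical model family)."""
    m = max(theta)
    e = [math.exp(t - m) for t in theta]
    z = sum(e)
    return [x / z for x in e]

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def jacobian(theta):
    """J[i][j] = d rho_i / d theta_j for the softmax map."""
    p = softmax(theta)
    n = len(p)
    return [[p[i] * ((1.0 if i == j else 0.0) - p[j]) for j in range(n)]
            for i in range(n)]

def pullback_metric(theta):
    """J^T diag(1/rho) J, the pullback of the KL-induced metric tensor."""
    p = softmax(theta)
    J = jacobian(theta)
    n = len(p)
    return [[sum(J[k][i] * J[k][j] / p[k] for k in range(n)) for j in range(n)]
            for i in range(n)]

def local_hessian_fd(theta, h=1e-4):
    """Finite-difference Hessian of eta -> KL(rho_eta || rho_theta) at eta = theta."""
    q = softmax(theta)
    n = len(theta)
    H = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            total = 0.0
            for si, sj in ((1, 1), (1, -1), (-1, 1), (-1, -1)):
                eta = list(theta)
                eta[i] += si * h
                eta[j] += sj * h
                total += si * sj * kl(softmax(eta), q)
            H[i][j] = total / (4 * h * h)
    return H

theta = [0.2, -0.5, 1.0]
G_pull = pullback_metric(theta)
H_fd = local_hessian_fd(theta)
max_gap = max(abs(G_pull[i][j] - H_fd[i][j]) for i in range(3) for j in range(3))
print(max_gap)
```

Note that the softmax Jacobian is rank deficient, so the resulting tensor is only positive semi-definite; this is an instance of the case where a full rank Jacobian is needed to retrieve a Riemannian metric.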
Asymptotically Newton’s Method. We provide a straightforward result, stating that the local Hessian approaches the actual Hessian in the limit, thus the formal natural gradient method approaches Newton’s method. This is well known in the Fisher–Rao case, but for completeness we provide the result for the formal natural gradient.
Proposition 1
Assume \(c(\theta ,\rho ) = c(\theta , \theta ')\) for some \(\theta ' \in \varTheta \), and that c is \(C^2\) in \(\theta \). Then, the natural gradient yields asymptotically Newton’s method.
Proof
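The Hessian at \(\theta \) is given by \(\nabla ^2_\theta c(\theta , \theta ')\). Then, as c is \(C^2\) in the first argument, passing the limit \(\theta \rightarrow \theta '\) yields
\[H^c_{\theta '} = \nabla ^2_{\theta \rightarrow \theta '} c(\theta , \theta ') = \lim _{\theta \rightarrow \theta '} \nabla ^2_\theta c(\theta , \theta ') = \nabla ^2_\theta c(\theta , \theta ')\big |_{\theta = \theta '}, \tag{2.10}\]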
where the last expression is the Hessian at \(\theta '\).
3 Loved Child has Many Names – Related Methods
In this section, we discuss connections between seemingly different optimization methods. Some of these connections have already been reported in the literature, some are likely to be known to some extent in the community. However, the authors are unaware of previous work drawing out these connections in their full extent. We provide such a discussion, and then present other related connections.
As discussed in [14], proximal methods and trust region methods are equivalent up to learning rate. Trust region methods constrain each step by an \(l^2\)-metric constraint on the distance to the current iterate, whereas proximal methods add an \(l^2\)-metric penalization term on that distance to the objective. The two can be shown to be equivalent up to learning rate via Lagrangian duality.
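The equivalence can be made concrete in the simplest setting. The sketch below assumes a linearized objective \(g^T v\) (an assumption made here so that both problems have closed-form solutions), a trust region \(\Vert v\Vert _2 \le \varDelta \) and a penalty \(\frac{\lambda }{2}\Vert v\Vert _2^2\); the function names and numbers are hypothetical. The two steps coincide when \(\lambda = \Vert g\Vert _2 / \varDelta \).

```python
import math

def trust_region_step(g, delta):
    """Minimize g^T v subject to ||v||_2 <= delta: move to the boundary along -g."""
    norm = math.sqrt(sum(x * x for x in g))
    return [-delta * x / norm for x in g]

def proximal_step(g, lam):
    """Minimize g^T v + (lam / 2) ||v||_2^2: stationarity gives v = -g / lam."""
    return [-x / lam for x in g]

g = [3.0, 4.0]          # hypothetical gradient at the current iterate
delta = 0.5             # trust region radius
lam = math.sqrt(sum(x * x for x in g)) / delta  # matching penalty weight

tr = trust_region_step(g, delta)
px = proximal_step(g, lam)
print(tr, px)
```

Here `lam` plays the role of the Lagrangian multiplier of the trust region constraint, which is why the two formulations differ only in how the step length is specified.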
Instead of the \(l^2\)-metric penalization, mirror gradient descent [13] employs a more general proximity function \(\varPsi :\mathbb {R}^n \times \mathbb {R}^n \rightarrow \mathbb {R}_{\ge 0}\) that is strictly convex in the first argument. The mirror descent step then minimizes the linearized objective penalized by \(\varPsi \), evaluated between the candidate point and the current iterate.
Commonly, \(\varPsi \) is chosen to be a Bregman divergence \(D_g\), defined by choosing a strictly convex \(C^2\) function g and writing
\[D_g(x, x') = g(x) - g(x') - \nabla g(x')^T (x - x').\]
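As a small illustration, the sketch below evaluates the Bregman divergence \(D_g(x, x') = g(x) - g(x') - \nabla g(x')^T(x - x')\) for two standard choices of g: the negative entropy, which gives the KL-divergence on probability vectors, and half the squared Euclidean norm, which gives half the squared \(l^2\) distance. The test vectors and function names are hypothetical.

```python
import math

def bregman(g, grad_g, x, y):
    """D_g(x, y) = g(x) - g(y) - <grad g(y), x - y>."""
    return g(x) - g(y) - sum(gi * (xi - yi) for gi, xi, yi in zip(grad_g(y), x, y))

def neg_entropy(x):
    return sum(xi * math.log(xi) for xi in x)

def neg_entropy_grad(x):
    return [math.log(xi) + 1.0 for xi in x]

def half_sq_norm(x):
    return 0.5 * sum(xi * xi for xi in x)

def identity_grad(x):
    return list(x)

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# Two hypothetical probability vectors.
p = [0.2, 0.3, 0.5]
q = [0.4, 0.4, 0.2]

d_entropy = bregman(neg_entropy, neg_entropy_grad, p, q)
d_entropy_swapped = bregman(neg_entropy, neg_entropy_grad, q, p)
d_square = bregman(half_sq_norm, identity_grad, p, q)
half_sq_dist = 0.5 * sum((a - b) ** 2 for a, b in zip(p, q))
print(d_entropy, kl(p, q), d_square, half_sq_dist)
```

The quadratic choice recovers the \(l^2\) penalization of proximal methods, while the entropy choice is not symmetric in its arguments, so the order of the arguments matters.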
To explain how these methods are related to the natural gradient, assume that we are minimizing a general similarity measure c(x, y) with respect to x, as in Sect. 2. Recall, that we first defined the natural gradient as a trust region step. In order to derive an analytical expression for the iteration, we approximated the objective function with the first order Taylor polynomial and the constraints by the local Hessian and then used Lagrangian duality to yield a proximal expression, which yields the formal natural gradient when solved. In Sect. 4, we will show how this workflow indeed corresponds to known examples of the natural gradient.
Further Connections. Raskutti and Mukherjee [16] showed that Bregman divergence proximal mirror gradient descent is equivalent to the natural gradient method on the dual manifold of the Bregman divergence. Khan et al. [8] consider a KL-divergence proximal algorithm for learning conditionally conjugate exponential families, which they show to correspond to a natural gradient step. For exponential families, the KL-divergence corresponds to a Bregman divergence, and so the natural gradient step is on the primal manifold of the Bregman divergence. Thus the result seems to conflict with the result in [16]. However, this can be explained, as the gradient is taken with respect to a different argument of the divergence, i.e., they consider \(\nabla _x D_g(x',x)\) and not \(\nabla _x D_g(x,x')\). It is intriguing how two different geometries are involved in this choice.
Pascanu and Bengio [15] remarked on the connections between the natural gradient method and Hessian-free optimization [11], Krylov Subspace Descent [17], and TONGA [9]. The main connection between Hessian-free optimization and Krylov subspace descent is the use of extended Gauss–Newton approximation of the Hessian [18], which gives a similar square form involving the Jacobian as the pullback Fisher–Rao metric on a statistical manifold. The connection was further studied by Martens [12], where an equivalence criterion between the Fisher–Rao natural gradient and extended Gauss–Newton was given.
4 Example Computations
We will now provide example computations for the local Hessian \(H^c\) of different similarity measures c, as it is the essential object in computing the natural gradient given in (2.8). We first show that in the cases of KL-divergence and a Riemannian metric, the definition of the formal natural gradient matches the classical definition, as expected. Furthermore, we contribute local Hessians for general f-divergences and Finsler metrics, specifically for the p-Wasserstein metrics.
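Natural Gradient of f-Divergences. Let \(\rho , \rho ' \in \mathrm {AC}(X)\) and \(f:\mathbb {R}_{>0}\rightarrow \mathbb {R}_{\ge 0}\) be a convex function satisfying \(f(1) = 0\). Then, the f-divergence from \(\rho '\) to \(\rho \) is
\[D_f(\rho ||\rho ') = \int _X \rho (x) f\left( \frac{\rho '(x)}{\rho (x)}\right) dx. \tag{4.1}\]
Now, consider the statistical manifold \((\mathbb {R}^d, \varTheta , \rho )\), and compute the local Hessian
\[H^{D_f}_\theta = \nabla ^2_{\eta \rightarrow \theta } D_f(\rho _\eta ||\rho _\theta ) = \nabla ^2 f(1) \int _{\mathbb {R}^d} \frac{1}{\rho _\theta (x)} \nabla _\theta \rho _\theta (x) \nabla _\theta \rho _\theta (x)^T dx. \tag{4.2}\]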
Substituting \(f = - \log \) in (4.1) results in the KL-divergence, denoted by \(D_{\mathrm {KL}}(\rho ||\rho ')\). Noticing that \(\nabla ^2f(1)=1\) with this substitution, we can write (4.2) as \(H^{D_f}_\theta = \nabla ^2 f(1)H^{D_{\mathrm {KL}}}_\theta \), where the local Hessian \(H^{D_{\mathrm {KL}}}_\theta \) is also the Fisher–Rao metric tensor at \(\theta \), and thus the natural gradient of Amari [2] is retrieved.
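The relation \(H^{D_f}_\theta = \nabla ^2 f(1)H^{D_{\mathrm {KL}}}_\theta \) can be checked numerically. The sketch below assumes a finite sample space with a hypothetical softmax-parametrized family, so that the integral becomes a sum, and uses \(f(t) = (t-1)^2\), for which \(\nabla ^2 f(1) = 2\), next to \(f = -\log \). Local Hessians are approximated by central finite differences; all names and step sizes are illustrative.

```python
import math

def softmax(theta):
    m = max(theta)
    e = [math.exp(t - m) for t in theta]
    z = sum(e)
    return [x / z for x in e]

def f_divergence(f, p, q):
    """Discrete analogue of D_f(p || q) = sum_x p(x) f(q(x) / p(x))."""
    return sum(pi * f(qi / pi) for pi, qi in zip(p, q))

def local_hessian_fd(f, theta, h=1e-4):
    """Finite-difference Hessian of eta -> D_f(rho_eta || rho_theta) at eta = theta."""
    q = softmax(theta)
    n = len(theta)
    H = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            total = 0.0
            for si, sj in ((1, 1), (1, -1), (-1, 1), (-1, -1)):
                eta = list(theta)
                eta[i] += si * h
                eta[j] += sj * h
                total += si * sj * f_divergence(f, softmax(eta), q)
            H[i][j] = total / (4 * h * h)
    return H

def neg_log(t):
    """f = -log, giving the KL-divergence; second derivative at 1 equals 1."""
    return -math.log(t)

def squared_deviation(t):
    """f(t) = (t - 1)^2; convex, f(1) = 0, second derivative at 1 equals 2."""
    return (t - 1.0) ** 2

theta = [0.2, -0.5, 1.0]
H_kl = local_hessian_fd(neg_log, theta)
H_f = local_hessian_fd(squared_deviation, theta)
max_gap = max(abs(H_f[i][j] - 2.0 * H_kl[i][j]) for i in range(3) for j in range(3))
print(max_gap)
```

Up to the constant \(\nabla ^2 f(1)\), which can be absorbed into the learning rate, the two divergences therefore induce the same formal natural gradient direction in this example.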
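Natural Gradient of Riemannian Distance. Let (M, g) be a Riemannian manifold with the induced distance function \(d_g\) and the metric tensor at \(\rho \in M\) denoted by \(G^M_\rho \). Finally, denote by \(\rho _\theta \) a submanifold of M parametrized by \(\theta \in \varTheta \). Then, when \(c=\frac{1}{2}d_g^2\), we compute the local Hessian \(H_\theta ^{\frac{1}{2}d_g^2}\) by the chain rule as follows
\[H_\theta ^{\frac{1}{2}d_g^2} = \lim _{\theta ' \rightarrow \theta } \left[ J_{\theta '}^T \left( \nabla ^2_{\rho _{\theta '}} \tfrac{1}{2} d_g^2(\rho _{\theta '}, \rho _\theta )\right) J_{\theta '} + \nabla _{\rho _{\theta '}} \tfrac{1}{2} d_g^2(\rho _{\theta '}, \rho _\theta ) \, \nabla ^2_{\theta '} \rho _{\theta '} \right] ; \tag{4.3}\]
as \(\theta ' \rightarrow \theta \), the second term vanishes, since the gradient of \(d_g^2(\cdot , \rho _\theta )\) is zero at its minimum. Finally, \(\nabla ^2_{\rho _\eta \rightarrow \rho _\theta } d_g^2(\rho _\eta , \rho _{\theta }) = 2G^M_{\rho _\theta }\), thus
\[H_\theta ^{\frac{1}{2}d_g^2} = J_\theta ^T G^M_{\rho _\theta } J_\theta , \tag{4.4}\]
where \(J_\theta = \frac{\partial }{\partial \theta }\rho _\theta \) denotes the Jacobian. Therefore, the formal natural gradient corresponds to the traditional coordinate-free definition of a gradient on a Riemannian manifold, when the metric is given by the pullback.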
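As a sanity check of the pullback formula \(J_\theta ^T G^M_{\rho _\theta } J_\theta \), the sketch below takes M to be \(\mathbb {R}^3\) with the Euclidean metric (so \(G^M\) is the identity) and the unit sphere in spherical coordinates as the parametrized submanifold. These choices, the function names and the evaluation point are assumptions made for illustration only. The finite-difference local Hessian of half the squared ambient distance is compared with \(J_\theta ^T J_\theta \).

```python
import math

def embed(theta):
    """Hypothetical parametrization: unit sphere in R^3 via angles (a, b)."""
    a, b = theta
    return [math.sin(a) * math.cos(b), math.sin(a) * math.sin(b), math.cos(a)]

def half_sq_dist(eta, theta):
    """c(eta, theta) = (1/2) d^2 with d the Euclidean distance of the ambient space."""
    return 0.5 * sum((x - y) ** 2 for x, y in zip(embed(eta), embed(theta)))

def jacobian(theta):
    """J[k][i] = d embed_k / d theta_i."""
    a, b = theta
    return [[math.cos(a) * math.cos(b), -math.sin(a) * math.sin(b)],
            [math.cos(a) * math.sin(b), math.sin(a) * math.cos(b)],
            [-math.sin(a), 0.0]]

def pullback_metric(theta):
    """J^T G J with G the identity metric tensor of Euclidean R^3."""
    J = jacobian(theta)
    return [[sum(J[k][i] * J[k][j] for k in range(3)) for j in range(2)]
            for i in range(2)]

def local_hessian_fd(theta, h=1e-4):
    """Finite-difference Hessian of eta -> c(eta, theta) at eta = theta."""
    H = [[0.0, 0.0], [0.0, 0.0]]
    for i in range(2):
        for j in range(2):
            total = 0.0
            for si, sj in ((1, 1), (1, -1), (-1, 1), (-1, -1)):
                eta = list(theta)
                eta[i] += si * h
                eta[j] += sj * h
                total += si * sj * half_sq_dist(eta, theta)
            H[i][j] = total / (4 * h * h)
    return H

theta = [0.7, 1.3]
G_pull = pullback_metric(theta)
H_fd = local_hessian_fd(theta)
max_gap = max(abs(G_pull[i][j] - H_fd[i][j]) for i in range(2) for j in range(2))
print(G_pull, max_gap)
```

In these coordinates the pullback is the familiar round metric \(\mathrm {diag}(1, \sin ^2 a)\) of the sphere.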
Natural Gradient of Finsler Distance. Let (M, F) denote a Finsler manifold, where \(F_\rho :T_\rho M \rightarrow \mathbb {R}_{\ge 0}\), for any \(\rho \in M\), is a Finsler metric, satisfying the properties of strong convexity, positive 1-homogeneity and positive definiteness. Then, a distance \(d_F\) is induced on M by
\[d_F(\rho , \rho ') = \inf _\gamma \int _0^1 F_{\gamma (t)}\left( \dot{\gamma }(t)\right) dt, \tag{4.5}\]
where \(\gamma \) is any continuous, unit-parametrized curve with \(\gamma (0) = \rho \) and \(\gamma (1) = \rho '\).
The fundamental tensor \(G^F\) of F at \((\rho ,v)\) is defined as \(G^F_{\rho }(v) = \frac{1}{2}\nabla ^2_{v} F^2_\rho (v)\). Then, \(G^F_\rho \) is 0-homogeneous as the second differential of a 2-homogeneous function. Therefore, \(G^F_\rho (\lambda v) = G^F_\rho (v)\) for any \(\lambda > 0\). Furthermore, \(G^F_\rho (v)\) is positive-definite when \(v\ne 0\). Now, let \(u = -J_\theta \nabla _\theta d^2_F(\rho _\theta , \rho ') \), and as we can locally write \(d^2_F(\rho , \rho ') = F^2_{\rho _\theta }(v)\) for a suitable v, then
Coordinate-free gradient descent on Finsler manifolds has been studied by Bercu [5]. The formal natural gradient differs slightly from this, as we use \(v =-J_\theta \nabla _\theta d^2_F(\rho _\theta ,\rho ')\) in the preconditioning matrix \(G^F_{(\rho _\theta , v)}\) (see Remark 1), whereas in [5], v is chosen to maximize the descent. Thus the natural gradient descent in the Finsler case approximates the geometry in the direction of the gradient quadratically to improve the descent, but fails to take the entire local geometry into account.
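p-Wasserstein Metric. Let \(X=\mathbb {R}^n\) and \(\rho \in \mathcal {P}_p(X)\) if
\[\int _X d_2^p(x_0, x) \rho (x) dx < \infty \quad \text {for some } x_0 \in X, \tag{4.7}\]
where \(d_2\) is the Euclidean distance. Then, the p-Wasserstein distance \(W_p\) between \(\rho , \rho ' \in \mathcal {P}_p(X)\) is given by
\[W_p(\rho , \rho ') = \left( \inf _{\gamma \in \mathrm {ADM}(\rho , \rho ')} \int _{X \times X} d_2^p(x, x') d\gamma (x, x')\right) ^{1/p}, \tag{4.8}\]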
where \(\mathrm {ADM}(\rho , \rho ')\) is the set of joint measures with marginal densities \(\rho \) and \(\rho '\). The p-Wasserstein distance is induced by a Finsler metric [1], given by
where \(v\in T_\rho \mathcal {P}_p(X)\) and \(\varPhi _v\) satisfies \(v(x) = -\nabla \cdot \left( \rho (x) \nabla _x \varPhi _v(x)\right) \) for any \(x\in X\), where \(\nabla \cdot \) is the divergence operator. Now, choose \(v = -J_\theta \nabla _\theta W_p^2(\rho _\theta , \rho )\). Then, through a cumbersome computation, we compute how the local Hessian acts on two tangent vectors \(d\theta _1, d\theta _2\in T_\theta \varTheta \)
where \(J_\theta d\theta _i = - \nabla \cdot \left( \rho _\theta \nabla \varPhi _{d\theta _i}\right) \) for \(i=1,2\). The case \(p=2\) is special, as the 2-Wasserstein metric is induced by a Riemannian metric, whose pullback can be recovered by substituting \(p=2\) in (4.10), yielding
This yields the natural gradient of \(W_2^2\) as introduced in [6, 10].
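For a concrete feel of the case \(p=2\), the sketch below assumes the family of one-dimensional Gaussians in \((\mu , \sigma )\) coordinates, for which the closed forms \(W_2^2 = (\mu _1 - \mu _2)^2 + (\sigma _1 - \sigma _2)^2\) and the Gaussian KL-divergence are standard facts not taken from the text above. It compares the finite-difference local Hessians of \(\frac{1}{2}W_2^2\) and of the KL-divergence; function names and step sizes are illustrative.

```python
import math

def w2_squared(p, q):
    """Squared 2-Wasserstein distance between 1D Gaussians given as (mean, std)."""
    (m1, s1), (m2, s2) = p, q
    return (m1 - m2) ** 2 + (s1 - s2) ** 2

def kl_gauss(p, q):
    """KL(N(m1, s1^2) || N(m2, s2^2)) for 1D Gaussians given as (mean, std)."""
    (m1, s1), (m2, s2) = p, q
    return math.log(s2 / s1) + (s1 ** 2 + (m1 - m2) ** 2) / (2 * s2 ** 2) - 0.5

def local_hessian_fd(c, theta, h=1e-4):
    """Finite-difference Hessian of eta -> c(eta, theta) at eta = theta."""
    n = len(theta)
    H = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            total = 0.0
            for si, sj in ((1, 1), (1, -1), (-1, 1), (-1, -1)):
                eta = list(theta)
                eta[i] += si * h
                eta[j] += sj * h
                total += si * sj * c(eta, theta)
            H[i][j] = total / (4 * h * h)
    return H

theta = [0.0, 2.0]
H_w2 = local_hessian_fd(lambda eta, th: 0.5 * w2_squared(eta, th), theta)
H_kl = local_hessian_fd(kl_gauss, theta)
print(H_w2, H_kl)
```

In this family the local Hessian of \(\frac{1}{2}W_2^2\) is the identity in \((\mu , \sigma )\) coordinates, so the corresponding formal natural gradient coincides with the Euclidean gradient, whereas the KL-divergence yields the Fisher–Rao tensor \(\mathrm {diag}(1/\sigma ^2, 2/\sigma ^2)\) and hence a differently scaled step.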
References
1. Agueh, M.: Finsler structure in the p-Wasserstein space and gradient flows. Comptes Rendus Mathematique 350(1–2), 35–40 (2012)
2. Amari, S.I.: Natural gradient works efficiently in learning. Neural Comput. 10(2), 251–276 (1998)
3. Amari, S.I.: Divergence function, information monotonicity and information geometry. In: Workshop on Information Theoretic Methods in Science and Engineering (WITMSE). Citeseer (2009)
4. Amari, S.I.: Information Geometry and Its Applications. Springer, Tokyo (2016). https://doi.org/10.1007/978-4-431-55978-8
5. Bercu, G.: Gradient methods on Finsler manifolds. In: Proceedings of the Workshop on Global Analysis, Differential Geometry and Lie Algebras, pp. 230–233 (2000)
6. Chen, Y., Li, W.: Natural gradient in Wasserstein statistical manifold. arXiv preprint arXiv:1805.08380 (2018)
7. Dauphin, Y.N., Pascanu, R., Gulcehre, C., Cho, K., Ganguli, S., Bengio, Y.: Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In: Advances in Neural Information Processing Systems, pp. 2933–2941 (2014)
8. Khan, M.E., Baqué, P., Fleuret, F., Fua, P.: Kullback-Leibler proximal variational inference. In: Advances in Neural Information Processing Systems, pp. 3402–3410 (2015)
9. Le Roux, N., Manzagol, P.A., Bengio, Y.: Topmoumoute online natural gradient algorithm. In: Advances in Neural Information Processing Systems, pp. 849–856 (2008)
10. Li, W., Montúfar, G.: Natural gradient via optimal transport. Inf. Geom. 1(2), 181–214 (2018)
11. Martens, J.: Deep learning via Hessian-free optimization. In: ICML, vol. 27, pp. 735–742 (2010)
12. Martens, J.: New insights and perspectives on the natural gradient method. arXiv preprint arXiv:1412.1193 (2014)
13. Nemirovsky, A.S., Yudin, D.B.: Problem complexity and method efficiency in optimization (1983)
14. Parikh, N., Boyd, S., et al.: Proximal algorithms. Found. Trends Optim. 1(3), 127–239 (2014)
15. Pascanu, R., Bengio, Y.: Revisiting natural gradient for deep networks. arXiv preprint arXiv:1301.3584 (2013)
16. Raskutti, G., Mukherjee, S.: The information geometry of mirror descent. IEEE Trans. Inf. Theory 61(3), 1451–1457 (2015)
17. Saad, Y.: Krylov subspace methods for solving large unsymmetric linear systems. Math. Comput. 37(155), 105–126 (1981)
18. Schraudolph, N.N.: Fast curvature matrix-vector products for second-order gradient descent. Neural Comput. 14(7), 1723–1738 (2002)
Acknowledgements
The authors were supported by Centre for Stochastic Geometry and Advanced Bioimaging, and a block stipendium, both funded by a grant from the Villum Foundation. We furthermore wish to thank the anonymous reviewers for their very useful comments.
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Mallasto, A., Haije, T.D., Feragen, A. (2019). A Formalization of the Natural Gradient Method for General Similarity Measures. In: Nielsen, F., Barbaresco, F. (eds) Geometric Science of Information. GSI 2019. Lecture Notes in Computer Science(), vol 11712. Springer, Cham. https://doi.org/10.1007/978-3-030-26980-7_62
DOI: https://doi.org/10.1007/978-3-030-26980-7_62
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-26979-1
Online ISBN: 978-3-030-26980-7
eBook Packages: Computer Science, Computer Science (R0)