
1 Introduction

Machine learning in general, and deep learning (LeCun 2015) in particular, often amounts to solving an optimization problem in which a loss function has to be minimized. For complex problems (due to nonlinearity, non-convexity, etc.), the best recourse is often to rely on iterative schemes that exploit first-order or second-order derivatives of the loss function to obtain successive improvements and converge towards a local minimum. This explains why variants of gradient descent are becoming increasingly ubiquitous in machine learning and have been made widely available in the main deep learning libraries, being the tool of choice to optimize deep neural networks. Among these methods, vanilla gradient descent strongly benefits from its computational efficiency, as it simply computes partial derivatives at each step of an iterative process. Though it is widely used, it is limited for two main reasons: its steps depend on arbitrary parametrization choices, and it may diverge or converge slowly if the step size is not properly tuned. To address these issues, several lines of improvement exist. Here, we focus on two of them. On the one hand, first-order methods such as the natural gradient introduce particular metrics to restrict gradient steps and make them independent of parametrization choices (Amari 1998). On the other hand, second-order methods use the Hessian matrix of the loss or its approximations to take into account its local curvature.

Both types of approaches enhance the vanilla gradient descent update by multiplying it by the inverse of a large matrix (of size \(d \times d\), where d is the dimension of the parameter space). The contribution of this paper is to study the connections between 6 popular first-order and second-order improvements of gradient descent, by presenting them as variants of a single simple framework. This general framework uses a first-order approximation of the loss and constrains the step with a quadratic norm. Therefore, each modification \(\delta {\boldsymbol{\theta }}\) of the vector of parameters \({\boldsymbol{\theta }}\) is computed via an optimization problem of the following form:

$$\begin{aligned} \left\{ \begin{array}{l} \min _{\delta {\boldsymbol{\theta }}} \nabla _{{\boldsymbol{\theta }}} L({\boldsymbol{\theta }})^T \delta {\boldsymbol{\theta }}\\ \delta {\boldsymbol{\theta }}^T M({\boldsymbol{\theta }}) \delta {\boldsymbol{\theta }}\le \epsilon ^2, \end{array} \right. \end{aligned}$$
(1)

where \(\nabla _{{\boldsymbol{\theta }}} L({\boldsymbol{\theta }})\) is the gradient of the loss \(L({\boldsymbol{\theta }})\), and \(M({\boldsymbol{\theta }})\) a symmetric positive-definite (SPD) matrix. The 6 methods differ in the matrix \(M({\boldsymbol{\theta }})\), which affects not only the size of the steps but also their direction, as illustrated in Fig. 1.

Fig. 1. Different metrics affect both the gradient step size and direction. Here, a different \(\delta {\boldsymbol{\theta }}\) is obtained with \(M=I\) or M an arbitrary SPD matrix.

The solution of the minimization problem (1) has the following form:

$$\delta {\boldsymbol{\theta }}= -\alpha M({\boldsymbol{\theta }})^{-1} \nabla _{{\boldsymbol{\theta }}} L({\boldsymbol{\theta }}).$$
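As a concrete illustration of this closed form, the following minimal numpy sketch (our own illustrative code, with hypothetical names, not from the paper) solves problem (1) for a given gradient and SPD matrix \(M\):

```python
import numpy as np

def constrained_step(grad, M, epsilon):
    """Solve min_d grad^T d  s.t.  d^T M d <= epsilon^2, for SPD M.
    The solution is d = -alpha * M^{-1} grad, with alpha chosen so
    that the constraint is tight (see Sect. 3.2, 'Learning Rate')."""
    direction = np.linalg.solve(M, grad)          # M^{-1} grad
    alpha = epsilon / np.sqrt(grad @ direction)   # eps / sqrt(g^T M^{-1} g)
    return -alpha * direction

# hypothetical usage: with M = I, this reduces to vanilla gradient descent
g = np.array([3.0, 4.0])
print(constrained_step(g, np.eye(2), epsilon=0.1))  # step of norm 0.1
```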

Section 2 mainly defines notations, and the rest of the paper is organized as follows. In Sect. 3, we show how the vanilla gradient descent, the classical Gauss-Newton method and the natural gradient descent method fit into the proposed framework, with constraints that are independent from the loss function. In Sect. 4, we consider approaches that depend on the loss, namely the gradient covariance matrix approach, Newton’s method, and the generalized Gauss-Newton method, and show that they also fit into the framework. Table 1 summarizes the values of \(M({\boldsymbol{\theta }})\) for all 6 approaches.

Table 1. The matrices \(M({\boldsymbol{\theta }})\) associated to 6 popular variants of the gradient descent, when interpreted as instances of the optimization problem (1). See Sect. 2 for the definitions of the notations.

Vanilla gradient descent: \(M({\boldsymbol{\theta }}) = I\)

Classical Gauss-Newton: \(M({\boldsymbol{\theta }}) = \mathbb {E}_{{\boldsymbol{s}}}{\left[ J_{{\boldsymbol{x}}}({\boldsymbol{\theta }})^TJ_{{\boldsymbol{x}}}({\boldsymbol{\theta }})\right] }\)

Natural gradient (empirical Fisher matrix): \(M({\boldsymbol{\theta }}) = F({\boldsymbol{\theta }}) = \mathbb {E}_{{\boldsymbol{s}}}\left[ \nabla _{{\boldsymbol{\theta }}}\log {\left( p_{{\boldsymbol{\theta }}}({\boldsymbol{y}}|{\boldsymbol{x}}) \right) } \nabla _{{\boldsymbol{\theta }}}\log {\left( p_{{\boldsymbol{\theta }}}({\boldsymbol{y}}|{\boldsymbol{x}}) \right) }^T \right] \)

Gradient covariance matrix: \(M({\boldsymbol{\theta }}) = \mathbb {E}_{{\boldsymbol{s}}}\left[ \nabla _{{\boldsymbol{\theta }}} l_{{\boldsymbol{\theta }}}({\boldsymbol{s}}) \nabla _{{\boldsymbol{\theta }}} l_{{\boldsymbol{\theta }}}({\boldsymbol{s}})^T \right] \)

Newton's method: \(M({\boldsymbol{\theta }}) = H({\boldsymbol{\theta }})\)

Generalized Gauss-Newton: \(M({\boldsymbol{\theta }}) = \mathbb {E}_{{\boldsymbol{s}}}\left[ J_{{\boldsymbol{x}}}({\boldsymbol{\theta }})^T \mathcal {H}_{{\boldsymbol{y}}}({\boldsymbol{h}}_{{\boldsymbol{\theta }}}({\boldsymbol{x}})) J_{{\boldsymbol{x}}}({\boldsymbol{\theta }}) \right] \)

Providing a unifying view of several first-order and second-order variants of the gradient descent, the framework presented in this paper makes the connections between different approaches more obvious, and can hopefully give insights on these connections and help clarify some of the literature.

2 Problem Statement and Notations

\(J_{{\boldsymbol{x}}}({\boldsymbol{\theta }})\): Jacobian of the function \({\boldsymbol{\theta }}\mapsto {\boldsymbol{h}}_{{\boldsymbol{\theta }}}({\boldsymbol{x}})\)

\(\mathbb {E}_{a \sim p_{{\boldsymbol{\theta }}}(\cdot |{\boldsymbol{x}})}[f(a)]\): Expected value of f(a), when a follows the distribution \(p_{{\boldsymbol{\theta }}}(\cdot |{\boldsymbol{x}})\)

\(H({\boldsymbol{\theta }})\): Hessian of the loss L, defined by \(\left[ H({\boldsymbol{\theta }})\right] _{i,j} = \frac{\partial ^2 L}{\partial {\boldsymbol{\theta }}_i \partial {\boldsymbol{\theta }}_j}({\boldsymbol{\theta }})\)

\(\mathcal {H}_{{\boldsymbol{y}}}({\boldsymbol{h}}_{{\boldsymbol{\theta }}}({\boldsymbol{x}}))\): Hessian of the function \({\boldsymbol{h}}\mapsto l({\boldsymbol{y}}, {\boldsymbol{h}})\) at \({\boldsymbol{h}}= {\boldsymbol{h}}_{{\boldsymbol{\theta }}}({\boldsymbol{x}})\)

We consider a context of regression analysis in which, based on samples \({\boldsymbol{s}}= ({\boldsymbol{x}}, {\boldsymbol{y}})\), the objective is to estimate the conditional distribution of \({\boldsymbol{y}}\) given \({\boldsymbol{x}}\). More formally, this distribution is estimated by a parametrized probability density function (p.d.f.) \(p_{{\boldsymbol{\theta }}}({\boldsymbol{y}}| {\boldsymbol{x}})\), and the goal of learning is to progressively optimize the vector \({\boldsymbol{\theta }}\) to improve the accuracy of this estimation. We furthermore assume that the p.d.f. \(p_{{\boldsymbol{\theta }}}( \cdot | {\boldsymbol{x}})\) can be represented by a finite-dimensional vector \({\boldsymbol{h}}_{{\boldsymbol{\theta }}}({\boldsymbol{x}})\). For instance, in many applications, \(p_{{\boldsymbol{\theta }}}(\cdot | {\boldsymbol{x}})\) is a multivariate Gaussian distribution, and in this case the vector \({\boldsymbol{h}}_{{\boldsymbol{\theta }}}({\boldsymbol{x}})\) would typically contain the mean and the covariance matrix components.

The accuracy of \(p_{{\boldsymbol{\theta }}}( \cdot | {\boldsymbol{x}})\) is measured via a loss function L estimated over a dataset of samples \({\boldsymbol{s}}\). L depends only on \({\boldsymbol{\theta }}\), and we assume that it is expressed as the expected value (over the sample distribution) of an atomic loss, i.e. a per-sample loss \(l_{{\boldsymbol{\theta }}}({\boldsymbol{s}})\):

$$ L({\boldsymbol{\theta }}) = \mathbb {E}_{{\boldsymbol{s}}}{\left[ l_{{\boldsymbol{\theta }}}({\boldsymbol{s}}) \right] }. $$

In the remainder of the paper, for simplicity we keep L expressed as an expected value over the samples, but in practice it is replaced by an empirical mean over a batch, which makes its gradient expressible from the gradients of the atomic loss (w.r.t. \({\boldsymbol{\theta }}\)). The atomic loss depends on \({\boldsymbol{\theta }}\) only via \(p_{{\boldsymbol{\theta }}}( \cdot | {\boldsymbol{x}})\), so we can also express it as a function of the finite-dimensional representation \({\boldsymbol{h}}_{{\boldsymbol{\theta }}}({\boldsymbol{x}})\):

$$ l_{{\boldsymbol{\theta }}}({\boldsymbol{s}}) = l({\boldsymbol{y}}, {\boldsymbol{h}}_{{\boldsymbol{\theta }}}({\boldsymbol{x}})). $$
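To make this setup concrete, here is a minimal sketch of such a probabilistic regression model, assuming a fixed-variance Gaussian \(p_{{\boldsymbol{\theta }}}(\cdot |{\boldsymbol{x}})\) whose mean is an affine function of \({\boldsymbol{x}}\); the names (`h`, `atomic_loss`, `loss`) and the affine model are illustrative choices, not prescribed by the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def h(theta, x):
    """Finite-dimensional representation h_theta(x) of p_theta(.|x):
    here, the mean of a Gaussian with fixed covariance."""
    W, b = theta
    return W @ x + b

def atomic_loss(theta, s):
    """l(y, h_theta(x)): negative log-likelihood of the fixed-variance
    Gaussian, up to additive and multiplicative constants."""
    x, y = s
    return 0.5 * np.sum((y - h(theta, x)) ** 2)

def loss(theta, samples):
    """L(theta): empirical mean of the atomic loss over a batch."""
    return np.mean([atomic_loss(theta, s) for s in samples])

# tiny synthetic batch of samples s = (x, y)
theta = (np.zeros((2, 3)), np.zeros(2))
samples = [(rng.normal(size=3), rng.normal(size=2)) for _ in range(8)]
print(loss(theta, samples))
```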

3 Vanilla, Classical Gauss-Newton and Natural Gradient Descent

3.1 Vanilla Gradient Descent

The core objective of gradient descent algorithms is to determine the direction and magnitude of \(\delta {\boldsymbol{\theta }}\), a small modification of \({\boldsymbol{\theta }}\) computed iteratively to decrease the value of L. The so-called “vanilla” gradient descent is a first-order method that relies on a first-order Taylor approximation of the loss function L:

$$\begin{aligned} L({\boldsymbol{\theta }}+ \delta {\boldsymbol{\theta }}) \simeq L({\boldsymbol{\theta }}) + \nabla _{{\boldsymbol{\theta }}} L({\boldsymbol{\theta }})^T \delta {\boldsymbol{\theta }}. \end{aligned}$$
(2)

At each iteration, the objective is the minimization of \(\nabla _{{\boldsymbol{\theta }}} L({\boldsymbol{\theta }})^T \delta {\boldsymbol{\theta }}\) with the variable \(\delta {\boldsymbol{\theta }}\). If the gradient is non-zero, the value of this term is unbounded below: it suffices for instance to set \(\delta {\boldsymbol{\theta }}= -\alpha \nabla _{{\boldsymbol{\theta }}} L({\boldsymbol{\theta }})\) with \(\alpha \) arbitrarily large. As a result, constraints are needed to avoid making excessively large steps. In vanilla approaches, the Euclidean norm \(\Vert \delta {\boldsymbol{\theta }}\Vert = \sqrt{\delta {\boldsymbol{\theta }}^T \delta {\boldsymbol{\theta }}}\) is used to bound the increments \(\delta {\boldsymbol{\theta }}\). The optimization problem solved at every step of the scheme is:

$$\begin{aligned} \left\{ \begin{array}{l} \min _{\delta {\boldsymbol{\theta }}} \nabla _{{\boldsymbol{\theta }}} L({\boldsymbol{\theta }})^T \delta {\boldsymbol{\theta }}\\ \delta {\boldsymbol{\theta }}^T \delta {\boldsymbol{\theta }}\le \epsilon ^2, \end{array} \right. \end{aligned}$$
(3)

where \(\epsilon \) is a user-defined upper bound. This is the simplest instance of the general framework (1), and the solution of this problem is \(\delta {\boldsymbol{\theta }}= -\alpha \nabla _{{\boldsymbol{\theta }}}L({\boldsymbol{\theta }})\), with \(\alpha = \frac{\epsilon }{\Vert \nabla _{{\boldsymbol{\theta }}} L({\boldsymbol{\theta }})\Vert }\). To set the size of the step, instead of tuning \(\epsilon \), the most common approach is to use the expression \(-\alpha \nabla _{{\boldsymbol{\theta }}}L({\boldsymbol{\theta }})\) and directly tune \(\alpha \), which is called the learning rate. An interesting property of this approach is that, as \({\boldsymbol{\theta }}\) gets closer to an optimum, the norm of the gradient \(\Vert \nabla _{{\boldsymbol{\theta }}} L({\boldsymbol{\theta }})\Vert \) decreases, so the \(\epsilon \) corresponding to the fixed \(\alpha \) decreases as well. This means that steps tend to become smaller and smaller, which is a necessary property to make asymptotic convergence possible.
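In code, the resulting iteration is as simple as the following sketch, where `grad_loss` is assumed to return \(\nabla _{{\boldsymbol{\theta }}} L({\boldsymbol{\theta }})\) estimated on a batch:

```python
import numpy as np

def vanilla_gradient_descent(theta, grad_loss, alpha=1e-2, n_steps=100):
    """Iterate delta_theta = -alpha * grad: the solution of (3), with a
    fixed learning rate alpha in place of a fixed bound epsilon."""
    for _ in range(n_steps):
        theta = theta - alpha * grad_loss(theta)
    return theta
```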

3.2 Classical Gauss-Newton

As mentioned in Sect. 2, the atomic loss function \(l_{{\boldsymbol{\theta }}}({\boldsymbol{s}}) = l({\boldsymbol{y}}, {\boldsymbol{h}}_{{\boldsymbol{\theta }}}({\boldsymbol{x}}))\) depends indirectly on the parameters \({\boldsymbol{\theta }}\) via the vector \({\boldsymbol{h}}_{{\boldsymbol{\theta }}}({\boldsymbol{x}})\), which is a finite-dimensional representation of the p.d.f. \(p_{{\boldsymbol{\theta }}}( \cdot | {\boldsymbol{x}})\). So it may be more meaningful to measure the modifications arising from an update \(\delta {\boldsymbol{\theta }}\) by looking at the change in \({\boldsymbol{h}}_{{\boldsymbol{\theta }}}({\boldsymbol{x}})\), not simply at \(\delta {\boldsymbol{\theta }}\) as in the vanilla gradient descent approach. The constraint \(\delta {\boldsymbol{\theta }}^T \delta {\boldsymbol{\theta }}\le \epsilon ^2\) acts as if all components of \(\delta {\boldsymbol{\theta }}\) had the same importance, which is not necessarily the case. Some components of \({\boldsymbol{\theta }}\) might have much smaller effects on \({\boldsymbol{h}}_{{\boldsymbol{\theta }}}({\boldsymbol{x}})\) than others, and this is not taken into account by the vanilla gradient descent method, which typically performs badly with unbalanced parametrizations. Measuring and bounding the change in the vector \({\boldsymbol{h}}_{{\boldsymbol{\theta }}}({\boldsymbol{x}})\) makes the updates independent of the way \({\boldsymbol{h}}_{{\boldsymbol{\theta }}}\) is parametrized. To do this, a natural choice is to bound the expected squared Euclidean distance between \({\boldsymbol{h}}_{{\boldsymbol{\theta }}}({\boldsymbol{x}})\) and \({\boldsymbol{h}}_{{\boldsymbol{\theta }}+ \delta {\boldsymbol{\theta }}}({\boldsymbol{x}})\):

$$ \mathbb {E}_{{\boldsymbol{s}}}{\left[ \Vert {\boldsymbol{h}}_{{\boldsymbol{\theta }}+ \delta {\boldsymbol{\theta }}}({\boldsymbol{x}}) - {\boldsymbol{h}}_{{\boldsymbol{\theta }}}({\boldsymbol{x}})\Vert ^2\right] }\le \epsilon ^2. $$

Using again a first-order approximation, we have \({\boldsymbol{h}}_{{\boldsymbol{\theta }}+ \delta {\boldsymbol{\theta }}}({\boldsymbol{x}}) - {\boldsymbol{h}}_{{\boldsymbol{\theta }}}({\boldsymbol{x}}) \simeq J_{{\boldsymbol{x}}}({\boldsymbol{\theta }})\delta {\boldsymbol{\theta }}\), where \(J_{{\boldsymbol{x}}}({\boldsymbol{\theta }})\) is the Jacobian of the function \({\boldsymbol{\theta }}\mapsto {\boldsymbol{h}}_{{\boldsymbol{\theta }}}({\boldsymbol{x}})\). The constraint can be rewritten:

$$ \mathbb {E}_{{\boldsymbol{s}}}{\left[ \Vert J_{{\boldsymbol{x}}}({\boldsymbol{\theta }})\delta {\boldsymbol{\theta }}\Vert ^2\right] } = \delta {\boldsymbol{\theta }}^T\mathbb {E}_{{\boldsymbol{s}}}{\left[ J_{{\boldsymbol{x}}}({\boldsymbol{\theta }})^TJ_{{\boldsymbol{x}}}({\boldsymbol{\theta }})\right] }\delta {\boldsymbol{\theta }}\le \epsilon ^2, $$

resulting in the optimization problem:

$$\begin{aligned} \left\{ \begin{array}{l} \min _{\delta {\boldsymbol{\theta }}} \nabla _{{\boldsymbol{\theta }}} L({\boldsymbol{\theta }})^T \delta {\boldsymbol{\theta }}\\ \delta {\boldsymbol{\theta }}^T \mathbb {E}_{{\boldsymbol{s}}}{\left[ J_{{\boldsymbol{x}}}({\boldsymbol{\theta }})^TJ_{{\boldsymbol{x}}}({\boldsymbol{\theta }})\right] } \delta {\boldsymbol{\theta }}\le \epsilon ^2, \end{array} \right. \end{aligned}$$
(4)

which fits into the general framework (1), with \(M_{CGN}({\boldsymbol{\theta }}) = \mathbb {E}_{{\boldsymbol{s}}}{\left[ J_{{\boldsymbol{x}}}({\boldsymbol{\theta }})^TJ_{{\boldsymbol{x}}}({\boldsymbol{\theta }})\right] }\), provided that this matrix is SPD.

Damping. The structure of the matrix \(M_{CGN}({\boldsymbol{\theta }})\) makes it symmetric and positive semi-definite, but not necessarily positive-definite. To ensure positive-definiteness, a regularization or damping term \(\lambda I\) can be added, resulting in the constraint \(\delta {\boldsymbol{\theta }}^T \big (M_{CGN}({\boldsymbol{\theta }}) + \lambda I \big ) \delta {\boldsymbol{\theta }}\le \epsilon ^2\), which can be rewritten:

$$ \delta {\boldsymbol{\theta }}^T M_{CGN}({\boldsymbol{\theta }}) \delta {\boldsymbol{\theta }}+ \lambda \delta {\boldsymbol{\theta }}^T \delta {\boldsymbol{\theta }}\le \epsilon ^2. $$

This damping, often called Tikhonov damping (Martens and Sutskever 2012), regularizes the constraint with a term proportional to the squared Euclidean norm of \(\delta {\boldsymbol{\theta }}\); it must be noted that with this term, the constraint is no longer independent of the parametrization in \({\boldsymbol{\theta }}\).
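As an illustration, a damped classical Gauss-Newton step might be computed as in the following sketch, in which the per-sample Jacobians \(J_{{\boldsymbol{x}}}({\boldsymbol{\theta }})\) are assumed to be provided (e.g. by an autodiff library); the names are ours:

```python
import numpy as np

def classical_gauss_newton_step(grad, jacobians, alpha, damping=1e-4):
    """Damped classical Gauss-Newton step.
    jacobians: one J_x(theta) array per sample of the batch."""
    M = np.mean([J.T @ J for J in jacobians], axis=0)  # E_s[J^T J]
    M += damping * np.eye(grad.shape[0])               # Tikhonov damping
    return -alpha * np.linalg.solve(M, grad)
```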

Classical Gauss-Newton as a Second Order Approximation. Let us assume that the atomic loss \(l_{{\boldsymbol{\theta }}}({\boldsymbol{s}})\) is defined as follows: \(l_{{\boldsymbol{\theta }}}({\boldsymbol{s}}) = \frac{1}{2}\Vert \varDelta _{{\boldsymbol{\theta }}}({\boldsymbol{s}})\Vert ^2\), where \(\varDelta _{{\boldsymbol{\theta }}}({\boldsymbol{s}})\) is a vector-valued function. Functions of the form \(\varDelta _{{\boldsymbol{\theta }}}({\boldsymbol{s}}) = {\boldsymbol{y}}- f_{{\boldsymbol{\theta }}}({\boldsymbol{x}})\) are typical examples in the context of regression. Denoting by \(\mathcal {J}^\varDelta _{{\boldsymbol{s}}}({\boldsymbol{\theta }})\) the Jacobian of \({\boldsymbol{\theta }}\mapsto \varDelta _{{\boldsymbol{\theta }}}({\boldsymbol{s}})\), it can be shown that the following expression is a second-order Taylor expansion of the loss in which the terms involving second derivatives of \(\varDelta _{{\boldsymbol{\theta }}}\) with respect to \({\boldsymbol{\theta }}\) have been dropped (Bottou et al. 2018):

$$ L({\boldsymbol{\theta }}+ \delta {\boldsymbol{\theta }}) = L({\boldsymbol{\theta }}) + \nabla _{{\boldsymbol{\theta }}}L({\boldsymbol{\theta }})^T\delta {\boldsymbol{\theta }}+ \frac{1}{2}\delta {\boldsymbol{\theta }}^T \mathbb {E}_{{\boldsymbol{s}}}{\left[ \mathcal {J}^\varDelta _{{\boldsymbol{s}}}({\boldsymbol{\theta }})^T \mathcal {J}^\varDelta _{{\boldsymbol{s}}}({\boldsymbol{\theta }}) \right] } \delta {\boldsymbol{\theta }}+ O(\delta {\boldsymbol{\theta }}^3). $$

Assuming that \(\mathbb {E}_{{\boldsymbol{s}}}{\left[ \mathcal {J}^\varDelta _{{\boldsymbol{s}}}({\boldsymbol{\theta }})^T \mathcal {J}^\varDelta _{{\boldsymbol{s}}}({\boldsymbol{\theta }}) \right] }\) is positive-definite, the minimum is reached with

$$ \delta {\boldsymbol{\theta }}= -\mathbb {E}_{{\boldsymbol{s}}}{\left[ \mathcal {J}^\varDelta _{{\boldsymbol{s}}}({\boldsymbol{\theta }})^T \mathcal {J}^\varDelta _{{\boldsymbol{s}}}({\boldsymbol{\theta }}) \right] }^{-1} \nabla _{{\boldsymbol{\theta }}}L({\boldsymbol{\theta }}). $$

As in the update obtained with the optimization problem (4), the matrix with a structure of type Jacobian transpose-times-Jacobian is characteristic of the classical Gauss-Newton approach. To ensure positive-definiteness, damping can be added in the exact same way. The derivation that led to (4) shows that this kind of update does not only make sense for squared-error losses, so in some sense it generalizes the context in which a classical Gauss-Newton approach may be useful. If the dependency of the loss on \({\boldsymbol{\theta }}\) is naturally expressed via a finite-dimensional vector \({\boldsymbol{v}}({\boldsymbol{\theta }})\) (e.g. \({\boldsymbol{h}}_{{\boldsymbol{\theta }}}({\boldsymbol{x}})\) or \(\varDelta _{{\boldsymbol{\theta }}}({\boldsymbol{s}})\) in the above cases), then measuring the quantity \(\Vert {\boldsymbol{v}}({\boldsymbol{\theta }}+ \delta {\boldsymbol{\theta }}) - {\boldsymbol{v}}({\boldsymbol{\theta }})\Vert \) to evaluate the magnitude of the modifications induced by \(\delta {\boldsymbol{\theta }}\) is likely to be more meaningful than the vanilla approach (i.e. simply measuring \(\Vert ({\boldsymbol{\theta }}+ \delta {\boldsymbol{\theta }}) - {\boldsymbol{\theta }}\Vert = \Vert \delta {\boldsymbol{\theta }}\Vert \)).

Learning Rate. The solution \(\delta {\boldsymbol{\theta }}=-\alpha M({\boldsymbol{\theta }})^{-1} \nabla _{{\boldsymbol{\theta }}}L({\boldsymbol{\theta }})\) to the general framework (1) is such that \(\alpha = \frac{\epsilon }{\sqrt{\nabla _{{\boldsymbol{\theta }}}L({\boldsymbol{\theta }})^T M({\boldsymbol{\theta }})^{-1}\nabla _{{\boldsymbol{\theta }}}L({\boldsymbol{\theta }})}}\). The classical Gauss-Newton approach corresponds to \(M({\boldsymbol{\theta }}) = M_{CGN}({\boldsymbol{\theta }}) + \lambda I\), or \(M({\boldsymbol{\theta }}) = M_{CGN}({\boldsymbol{\theta }})\) if we ignore the damping. With the approach based on the second-order approximation of the loss expressed as a squared error, the resulting update has the form \(\delta {\boldsymbol{\theta }}=- M({\boldsymbol{\theta }})^{-1} \nabla _{{\boldsymbol{\theta }}}L({\boldsymbol{\theta }})\), which is similar to the above expression except that \(\alpha = 1\). However, this theoretical difference in \(\alpha \) (referred to as the learning rate in Sect. 3.1) is not significant in practice since its value is usually redefined separately, for various reasons. In particular, when \(M({\boldsymbol{\theta }})\) is a very large matrix, \(M({\boldsymbol{\theta }})^{-1}\) is often estimated via drastic approximations. In that case, it can be preferable to only compute the update direction, and then perform a line search to find a value of \(\alpha \) for which it is verified that the corresponding step size is reasonable. This line search is an important component of the popular reinforcement learning (RL) algorithm TRPO (Schulman et al. 2015).
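A minimal backtracking sketch of such a line search along a precomputed direction is given below; note that this simplified version only checks for a decrease of the loss, whereas TRPO's line search additionally verifies a KL-divergence constraint:

```python
def backtracking_line_search(theta, direction, loss, alpha0=1.0,
                             shrink=0.5, max_iter=20):
    """Shrink alpha until the step decreases the loss.
    direction: the precomputed update direction -M(theta)^{-1} grad."""
    base = loss(theta)
    alpha = alpha0
    for _ in range(max_iter):
        if loss(theta + alpha * direction) < base:
            return alpha
        alpha *= shrink
    return 0.0  # no acceptable step length found
```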

3.3 Natural Gradient

To further improve the independence from the parametrization, it is possible to directly measure the change from \(p_{{\boldsymbol{\theta }}}( \cdot | {\boldsymbol{x}})\) to \(p_{{\boldsymbol{\theta }}+ \delta {\boldsymbol{\theta }}}( \cdot | {\boldsymbol{x}})\) with a metric on probability density functions. This way, the updates do not even depend on the choice of the finite-dimensional representation \({\boldsymbol{h}}_{{\boldsymbol{\theta }}}\). Amari (1997; 1998) proposed and popularized the notion of natural gradient, which is based on a matrix called the Fisher information matrix, defined for the p.d.f. \(p_{{\boldsymbol{\theta }}}( \cdot | {\boldsymbol{x}})\) by:

$$ \mathcal {I}_{{\boldsymbol{x}}}({\boldsymbol{\theta }}) = \mathbb {E}_{a\sim p_{{\boldsymbol{\theta }}}(\cdot |{\boldsymbol{x}})}{\left[ \nabla _{{\boldsymbol{\theta }}}\log {\left( p_{{\boldsymbol{\theta }}}(a|{\boldsymbol{x}}) \right) } \nabla _{{\boldsymbol{\theta }}}\log {\left( p_{{\boldsymbol{\theta }}}(a|{\boldsymbol{x}}) \right) }^T \right] }. $$

It can be used to measure a “distance” \(d\ell \) between two infinitesimally close probability distributions \( p_{{\boldsymbol{\theta }}}(\cdot |{\boldsymbol{x}})\) and \(p_{{\boldsymbol{\theta }}+\delta {\boldsymbol{\theta }}}(\cdot |{\boldsymbol{x}})\) as follows:

$$ d\ell ^2(p_{{\boldsymbol{\theta }}}(\cdot |{\boldsymbol{x}}), p_{{\boldsymbol{\theta }}+\delta {\boldsymbol{\theta }}}(\cdot |{\boldsymbol{x}})) = \delta {\boldsymbol{\theta }}^T \mathcal {I}_{{\boldsymbol{x}}}({\boldsymbol{\theta }}) \delta {\boldsymbol{\theta }}. $$

Averaging over the samples, we extrapolate a measure of distance between \({\boldsymbol{\theta }}\) and \({\boldsymbol{\theta }}+ \delta {\boldsymbol{\theta }}\):

$$ DL^2({\boldsymbol{\theta }}, {\boldsymbol{\theta }}+ \delta {\boldsymbol{\theta }}) = \delta {\boldsymbol{\theta }}^T \mathbb {E}_{{\boldsymbol{s}}}\left[ \mathcal {I}_{{\boldsymbol{x}}}({\boldsymbol{\theta }}) \right] \delta {\boldsymbol{\theta }}, $$

where \(\mathbb {E}_{{\boldsymbol{s}}}\left[ \mathcal {I}_{{\boldsymbol{x}}}({\boldsymbol{\theta }}) \right] = \mathbb {E}_{{\boldsymbol{s}}}\left[ \mathbb {E}_{a\sim p_{{\boldsymbol{\theta }}}(\cdot |{\boldsymbol{x}})}{\left[ \nabla _{{\boldsymbol{\theta }}}\log {\left( p_{{\boldsymbol{\theta }}}(a|{\boldsymbol{x}}) \right) } \nabla _{{\boldsymbol{\theta }}}\log {\left( p_{{\boldsymbol{\theta }}}(a|{\boldsymbol{x}}) \right) }^T \right] } \right] \) is the averaged Fisher information matrix. It is common to approximate \(\mathbb {E}_{{\boldsymbol{s}}}\left[ \mathbb {E}_{a \sim p_{{\boldsymbol{\theta }}}(\cdot |{\boldsymbol{x}})}[\cdot ] \right] \) with the empirical mean over the samples, which reduces the above expression to

$$ DL^2({\boldsymbol{\theta }}, {\boldsymbol{\theta }}+ \delta {\boldsymbol{\theta }}) \approx \delta {\boldsymbol{\theta }}^T \mathbb {E}_{{\boldsymbol{s}}}\left[ \nabla _{{\boldsymbol{\theta }}}\log {\left( p_{{\boldsymbol{\theta }}}({\boldsymbol{y}}|{\boldsymbol{x}}) \right) } \nabla _{{\boldsymbol{\theta }}}\log {\left( p_{{\boldsymbol{\theta }}}({\boldsymbol{y}}|{\boldsymbol{x}}) \right) }^T \right] \delta {\boldsymbol{\theta }}. $$

\( \mathbb {E}_{{\boldsymbol{s}}}\left[ \nabla _{{\boldsymbol{\theta }}}\log {\left( p_{{\boldsymbol{\theta }}}({\boldsymbol{y}}|{\boldsymbol{x}}) \right) } \nabla _{{\boldsymbol{\theta }}}\log {\left( p_{{\boldsymbol{\theta }}}({\boldsymbol{y}}|{\boldsymbol{x}}) \right) }^T \right] \) is usually called the empirical Fisher matrix (Martens 2014). We denote it by \(F({\boldsymbol{\theta }})\). Putting an upper bound on \(\delta {\boldsymbol{\theta }}^T F({\boldsymbol{\theta }}) \delta {\boldsymbol{\theta }}\) results in the following optimization problem:

$$\begin{aligned} \left\{ \begin{array}{l} \min _{\delta {\boldsymbol{\theta }}} \nabla _{{\boldsymbol{\theta }}} L({\boldsymbol{\theta }})^T \delta {\boldsymbol{\theta }}\\ \delta {\boldsymbol{\theta }}^T F({\boldsymbol{\theta }}) \delta {\boldsymbol{\theta }}\le \epsilon ^2, \end{array} \right. \end{aligned}$$
(5)

which yields natural gradient steps of the form

$$ \delta {\boldsymbol{\theta }}= -\alpha F({\boldsymbol{\theta }})^{-1} \nabla _{{\boldsymbol{\theta }}}L({\boldsymbol{\theta }}), $$

provided that \(F({\boldsymbol{\theta }})\) is invertible. \(F({\boldsymbol{\theta }})\) is always positive semi-definite. Therefore, as in Sect. 3.2 with the classical Gauss-Newton approach, a damping term can be added to ensure invertibility (but again, by doing so the independence to the parametrization is lost). The Fisher information matrix is in some sense uniquely defined by the property of invariance to reparametrization of the metric it induces (Čencov 1982), and can be obtained from many different derivations. But a particularly interesting fact is that \(d\ell ^2(p_{{\boldsymbol{\theta }}}(\cdot |{\boldsymbol{x}}), p_{{\boldsymbol{\theta }}+\delta {\boldsymbol{\theta }}}(\cdot |{\boldsymbol{x}}))\) corresponds to the second-order approximation of the Kullback-Leibler (KL) divergence \(KL(p_{{\boldsymbol{\theta }}}(\cdot |{\boldsymbol{x}}),p_{{\boldsymbol{\theta }}+\delta {\boldsymbol{\theta }}}(\cdot |{\boldsymbol{x}}))\) (Kullback 1997; Akimoto and Ollivier 2013). Hence, the terms \(\delta {\boldsymbol{\theta }}^T \mathcal {I}_{{\boldsymbol{x}}}({\boldsymbol{\theta }}) \delta {\boldsymbol{\theta }}\) and \(\delta {\boldsymbol{\theta }}^T F({\boldsymbol{\theta }}) \delta {\boldsymbol{\theta }}\) share some of the properties of the KL divergence. For instance, when the variance of the probability distribution \(p_{{\boldsymbol{\theta }}}(\cdot |{\boldsymbol{x}})\) decreases, the same parameter modification \(\delta {\boldsymbol{\theta }}\) tends to result in increasingly large measures \(\delta {\boldsymbol{\theta }}^T \mathcal {I}_{{\boldsymbol{x}}}({\boldsymbol{\theta }}) \delta {\boldsymbol{\theta }}\) (see Fig. 2).

Fig. 2. The same parameter change (here, a constant shift of the mean to the right) yields a larger KL divergence when the variance is small.

Consequently, if the bound \(\epsilon ^2\) of Eq. (5) is kept constant, possible modifications of \({\boldsymbol{\theta }}\) become smaller when the variance of the parametrized distribution decreases. Thus natural gradient iterations slow down when the variance becomes small, which is a desirable property when keeping some variability matters. Typically, in RL, this variability can be related to exploration, and should not vanish early, which is one of the reasons why some RL algorithms benefit from natural gradient steps (Peters and Schaal 2008; Schulman et al. 2015; Wu et al. 2017).
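As an illustration, a natural gradient step based on the empirical Fisher matrix might be computed as follows; the per-sample score vectors \(\nabla _{{\boldsymbol{\theta }}}\log p_{{\boldsymbol{\theta }}}({\boldsymbol{y}}|{\boldsymbol{x}})\) are assumed to be provided, and the names are illustrative:

```python
import numpy as np

def natural_gradient_step(grad, scores, alpha, damping=1e-4):
    """Natural gradient step with the empirical Fisher matrix.
    scores: one grad_theta log p_theta(y|x) per sample of the batch."""
    F = np.mean([np.outer(g, g) for g in scores], axis=0)  # F(theta)
    F += damping * np.eye(grad.shape[0])  # damping ensures invertibility
    return -alpha * np.linalg.solve(F, grad)
```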

Relation Between Natural Gradient and Classical Gauss-Newton. Let us consider a very simple case where \(p_{{\boldsymbol{\theta }}}(\cdot |{\boldsymbol{x}})\) is a multivariate normal distribution with fixed covariance matrix \(\varSigma = \beta ^2 I\). The only variable parameter on which the distribution \(p_{{\boldsymbol{\theta }}}(\cdot |{\boldsymbol{x}})\) depends is its mean \({\boldsymbol{\mu }}_{{\boldsymbol{\theta }}}\), so we can use it as a representation of the distribution itself and write

$$ {\boldsymbol{h}}_{{\boldsymbol{\theta }}}({\boldsymbol{x}}) = {\boldsymbol{\mu }}_{{\boldsymbol{\theta }}}. $$

It can be shown that the KL divergence between two normal distributions of equal variance is proportional to the squared Euclidean distance between the means. More precisely, the KL divergence between \(p_{{\boldsymbol{\theta }}}(\cdot |{\boldsymbol{x}})\) and \(p_{{\boldsymbol{\theta }}+\delta {\boldsymbol{\theta }}}(\cdot |{\boldsymbol{x}})\) is equal to \(\frac{1}{2\beta ^2}\Vert {\boldsymbol{h}}_{{\boldsymbol{\theta }}+ \delta {\boldsymbol{\theta }}}({\boldsymbol{x}}) - {\boldsymbol{h}}_{{\boldsymbol{\theta }}}({\boldsymbol{x}})\Vert ^2\). For small values of \(\delta {\boldsymbol{\theta }}\), it is approximately equal to the measure obtained with the true Fisher information matrix:

$$ \frac{1}{2\beta ^2}\Vert {\boldsymbol{h}}_{{\boldsymbol{\theta }}+ \delta {\boldsymbol{\theta }}}({\boldsymbol{x}}) - {\boldsymbol{h}}_{{\boldsymbol{\theta }}}({\boldsymbol{x}})\Vert ^2 \approx \delta {\boldsymbol{\theta }}^T \mathcal {I}_{{\boldsymbol{x}}}({\boldsymbol{\theta }}) \delta {\boldsymbol{\theta }}. $$

Bounding the average over the samples of the right-hand term is the motivation of the natural gradient descent method. Besides, we have seen in Sect. 3.2 that the classical Gauss-Newton method can be considered as a way to bound \(\mathbb {E}_{{\boldsymbol{s}}}[\Vert {\boldsymbol{h}}_{{\boldsymbol{\theta }}+ \delta {\boldsymbol{\theta }}}({\boldsymbol{x}}) - {\boldsymbol{h}}_{{\boldsymbol{\theta }}}({\boldsymbol{x}})\Vert ^2]\), which is equal to the average of the left-hand term over the samples, up to a multiplicative constant. Hence, even though both methods introduce slightly different approximations, in this context the classical Gauss-Newton and natural gradient descent methods are very similar. This property is used in Pascanu and Bengio (2013) to perform a natural gradient descent on deterministic neural networks, by interpreting their outputs as the mean of a conditional Gaussian distribution with fixed variance.
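For reference, the standard closed form behind the claim above: for two Gaussians with the same covariance \(\varSigma = \beta ^2 I\),

$$ KL\left( \mathcal {N}({\boldsymbol{\mu }}_1, \beta ^2 I) \,\Vert \, \mathcal {N}({\boldsymbol{\mu }}_2, \beta ^2 I) \right) = \frac{1}{2}({\boldsymbol{\mu }}_2 - {\boldsymbol{\mu }}_1)^T (\beta ^2 I)^{-1} ({\boldsymbol{\mu }}_2 - {\boldsymbol{\mu }}_1) = \frac{1}{2\beta ^2} \Vert {\boldsymbol{\mu }}_2 - {\boldsymbol{\mu }}_1 \Vert ^2, $$

which, with \({\boldsymbol{h}}_{{\boldsymbol{\theta }}}({\boldsymbol{x}}) = {\boldsymbol{\mu }}_{{\boldsymbol{\theta }}}\), is exactly the left-hand term of the approximation above.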

4 Gradient Covariance Matrix, Newton’s Method and Generalized Gauss-Newton

The previous approaches fit into the general framework (1) with matrices \(M({\boldsymbol{\theta }})\) that do not depend on the loss function, while the 3 approaches presented in this section fit into the same framework but with matrices \(M({\boldsymbol{\theta }})\) that do depend on the loss.

4.1 Gradient Covariance Matrix

The simplest way to exploit the loss to measure the magnitude of changes is to consider the expected squared difference between \(l_{{\boldsymbol{\theta }}}({\boldsymbol{s}})\) and \(l_{{\boldsymbol{\theta }}+\delta {\boldsymbol{\theta }}}({\boldsymbol{s}})\):

$$ \mathbb {E}_{{\boldsymbol{s}}}\left[ \big (l_{{\boldsymbol{\theta }}+\delta {\boldsymbol{\theta }}}({\boldsymbol{s}}) - l_{{\boldsymbol{\theta }}}({\boldsymbol{s}})\big )^2 \right] . $$

For a single sample \({\boldsymbol{s}}\), slightly changing \({\boldsymbol{\theta }}\) does not necessarily modify the atomic loss \(l_{{\boldsymbol{\theta }}}({\boldsymbol{s}})\), but in many cases it can be assumed that the loss becomes different for at least some of the samples, yielding a positive value for \( \mathbb {E}_{{\boldsymbol{s}}}\left[ \big (l_{{\boldsymbol{\theta }}+\delta {\boldsymbol{\theta }}}({\boldsymbol{s}}) - l_{{\boldsymbol{\theta }}}({\boldsymbol{s}})\big )^2 \right] \), which quantifies in some sense the amount of change introduced by \(\delta {\boldsymbol{\theta }}\) with respect to the objective. It can be a meaningful measure, as it usually depends on the most relevant features for the task to achieve. Let us replace \(l_{{\boldsymbol{\theta }}+\delta {\boldsymbol{\theta }}}({\boldsymbol{s}})\) by a first-order approximation: \( l_{{\boldsymbol{\theta }}+\delta {\boldsymbol{\theta }}}({\boldsymbol{s}}) \simeq l_{{\boldsymbol{\theta }}}({\boldsymbol{s}}) + \nabla _{{\boldsymbol{\theta }}} l_{{\boldsymbol{\theta }}}({\boldsymbol{s}})^T \delta {\boldsymbol{\theta }}\). The above expectation simplifies to

$$ \mathbb {E}_{{\boldsymbol{s}}}\left[ \big (\nabla _{{\boldsymbol{\theta }}} l_{{\boldsymbol{\theta }}}({\boldsymbol{s}})^T \delta {\boldsymbol{\theta }}\big )^2 \right] = \delta {\boldsymbol{\theta }}^T \mathbb {E}_{{\boldsymbol{s}}}\left[ \nabla _{{\boldsymbol{\theta }}} l_{{\boldsymbol{\theta }}}({\boldsymbol{s}}) \nabla _{{\boldsymbol{\theta }}} l_{{\boldsymbol{\theta }}}({\boldsymbol{s}})^T \right] \delta {\boldsymbol{\theta }}. $$

\(\mathbb {E}_{{\boldsymbol{s}}}\left[ \nabla _{{\boldsymbol{\theta }}} l_{{\boldsymbol{\theta }}}({\boldsymbol{s}}) \nabla _{{\boldsymbol{\theta }}} l_{{\boldsymbol{\theta }}}({\boldsymbol{s}})^T \right] \) is called the outer product metric (Ollivier 2015) or the gradient covariance matrix (Bottou and Bousquet 2008). Putting a bound on the term \(\delta {\boldsymbol{\theta }}^T \mathbb {E}_{{\boldsymbol{s}}}\left[ \nabla _{{\boldsymbol{\theta }}} l_{{\boldsymbol{\theta }}}({\boldsymbol{s}}) \nabla _{{\boldsymbol{\theta }}} l_{{\boldsymbol{\theta }}}({\boldsymbol{s}})^T \right] \delta {\boldsymbol{\theta }}\), the iterated optimization becomes:

$$\begin{aligned} \left\{ \begin{array}{l} \min _{\delta {\boldsymbol{\theta }}} \nabla _{{\boldsymbol{\theta }}} L({\boldsymbol{\theta }})^T \delta {\boldsymbol{\theta }}\\ \delta {\boldsymbol{\theta }}^T \mathbb {E}_{{\boldsymbol{s}}}\left[ \nabla _{{\boldsymbol{\theta }}} l_{{\boldsymbol{\theta }}}({\boldsymbol{s}}) \nabla _{{\boldsymbol{\theta }}} l_{{\boldsymbol{\theta }}}({\boldsymbol{s}})^T \right] \delta {\boldsymbol{\theta }}\le \epsilon ^2. \end{array} \right. \end{aligned}$$
(6)

It results in updates of the form:

$$ \delta {\boldsymbol{\theta }}= -\alpha \mathbb {E}_{{\boldsymbol{s}}}\left[ \nabla _{{\boldsymbol{\theta }}} l_{{\boldsymbol{\theta }}}({\boldsymbol{s}}) \nabla _{{\boldsymbol{\theta }}} l_{{\boldsymbol{\theta }}}({\boldsymbol{s}})^T \right] ^{-1} \nabla _{{\boldsymbol{\theta }}}L({\boldsymbol{\theta }}). $$

Again, a regularization term may be added to ensure the invertibility of the matrix.

Link with the Natural Gradient. Let us assume that the atomic loss on a sample \({\boldsymbol{s}}= ({\boldsymbol{x}}, {\boldsymbol{y}})\) is the negative log-likelihood (a common case): \( l_{{\boldsymbol{\theta }}}({\boldsymbol{s}}) = -\log ( p_{{\boldsymbol{\theta }}}({\boldsymbol{y}}|{\boldsymbol{x}}) ) \). It follows that the empirical Fisher matrix, as defined in Sect. 3.3, is equal to \(\mathbb {E}_{{\boldsymbol{s}}}\left[ \nabla _{{\boldsymbol{\theta }}} l_{{\boldsymbol{\theta }}}({\boldsymbol{s}}) \nabla _{{\boldsymbol{\theta }}} l_{{\boldsymbol{\theta }}}({\boldsymbol{s}})^T \right] \), which is exactly the definition of the gradient covariance matrix. Thus, in this case, the two approaches are identical. Several algorithms use this identity for the natural gradient computation, e.g. George et al. (2018).
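Indeed, the identity is immediate: with \( l_{{\boldsymbol{\theta }}}({\boldsymbol{s}}) = -\log ( p_{{\boldsymbol{\theta }}}({\boldsymbol{y}}|{\boldsymbol{x}}) ) \), the atomic gradient is minus the score, and the sign cancels in the outer product:

$$ \nabla _{{\boldsymbol{\theta }}} l_{{\boldsymbol{\theta }}}({\boldsymbol{s}}) \nabla _{{\boldsymbol{\theta }}} l_{{\boldsymbol{\theta }}}({\boldsymbol{s}})^T = \left( -\nabla _{{\boldsymbol{\theta }}}\log p_{{\boldsymbol{\theta }}}({\boldsymbol{y}}|{\boldsymbol{x}}) \right) \left( -\nabla _{{\boldsymbol{\theta }}}\log p_{{\boldsymbol{\theta }}}({\boldsymbol{y}}|{\boldsymbol{x}}) \right) ^T = \nabla _{{\boldsymbol{\theta }}}\log p_{{\boldsymbol{\theta }}}({\boldsymbol{y}}|{\boldsymbol{x}}) \nabla _{{\boldsymbol{\theta }}}\log p_{{\boldsymbol{\theta }}}({\boldsymbol{y}}|{\boldsymbol{x}})^T. $$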

4.2 Newton’s Method

Let us consider now a second-order approximation of the loss:

$$ L({\boldsymbol{\theta }}+ \delta {\boldsymbol{\theta }}) \approx L({\boldsymbol{\theta }}) + \nabla _{{\boldsymbol{\theta }}} L({\boldsymbol{\theta }})^T \delta {\boldsymbol{\theta }}+ \frac{1}{2}\delta {\boldsymbol{\theta }}^T H({\boldsymbol{\theta }}) \delta {\boldsymbol{\theta }}, $$

where \(H({\boldsymbol{\theta }})\) is the Hessian matrix: \(\left[ H({\boldsymbol{\theta }})\right] _{i,j} = \frac{\partial ^2 L}{\partial {\boldsymbol{\theta }}_i \partial {\boldsymbol{\theta }}_j}({\boldsymbol{\theta }})\). Although there are obvious counterexamples, one can argue that the first-order approximation, i.e. \(L({\boldsymbol{\theta }}+ \delta {\boldsymbol{\theta }}) \approx L({\boldsymbol{\theta }}) + \nabla _{{\boldsymbol{\theta }}} L({\boldsymbol{\theta }})^T \delta {\boldsymbol{\theta }}\) (which is used as the minimization objective in gradient descent), is most likely good as long as the second-order term \(\frac{1}{2}\delta {\boldsymbol{\theta }}^T H({\boldsymbol{\theta }}) \delta {\boldsymbol{\theta }}\) remains small. Therefore, it makes sense to directly put an upper bound on this quantity to restrict \(\delta {\boldsymbol{\theta }}\), as follows:

$$ \delta {\boldsymbol{\theta }}^T H({\boldsymbol{\theta }}) \delta {\boldsymbol{\theta }}\le \epsilon ^2. $$

If \(H({\boldsymbol{\theta }})\) is SPD, this constraint defines a trust region, i.e. a neighborhood of \({\boldsymbol{\theta }}\) in which the first-order approximation of \(L({\boldsymbol{\theta }}+ \delta {\boldsymbol{\theta }})\) is supposed to be reasonably accurate. However, \(H({\boldsymbol{\theta }})\) is symmetric but not necessarily positive semi-definite, unlike the matrices obtained with the previous approaches. Therefore, the damping required to make it positive-definite may be larger than with other methods. This leads to the following optimization problem solved at every iteration:

$$\begin{aligned} \left\{ \begin{array}{l} \min _{\delta {\boldsymbol{\theta }}} \nabla _{{\boldsymbol{\theta }}} L({\boldsymbol{\theta }})^T \delta {\boldsymbol{\theta }}\\ \delta {\boldsymbol{\theta }}^T (H({\boldsymbol{\theta }}) + \lambda I) \delta {\boldsymbol{\theta }}\le \epsilon ^2, \end{array} \right. \end{aligned}$$
(7)

and to updates of the form: \( \delta {\boldsymbol{\theta }}= -\alpha (H({\boldsymbol{\theta }}) + \lambda I)^{-1} \nabla _{{\boldsymbol{\theta }}}L({\boldsymbol{\theta }}) \).

Newton’s Method as a Second-Order Approximation. The same update direction is obtained by directly minimizing the damped second-order approximation:

$$ L({\boldsymbol{\theta }}) + \nabla _{{\boldsymbol{\theta }}} L({\boldsymbol{\theta }})^T \delta {\boldsymbol{\theta }}+ \frac{1}{2}\delta {\boldsymbol{\theta }}^T (H({\boldsymbol{\theta }}) + \lambda I) \delta {\boldsymbol{\theta }}. $$

When \((H({\boldsymbol{\theta }}) + \lambda I)\) is SPD, the minimum of this expression is obtained for \( \delta {\boldsymbol{\theta }}= - (H({\boldsymbol{\theta }}) + \lambda I)^{-1} \nabla _{{\boldsymbol{\theta }}}L({\boldsymbol{\theta }}) \).
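A sketch of such an update is given below; the choice of \(\lambda \) from the spectrum of \(H({\boldsymbol{\theta }})\), so that \(H({\boldsymbol{\theta }}) + \lambda I\) is SPD, is one possible heuristic and is not prescribed by the method:

```python
import numpy as np

def newton_step(grad, hessian, alpha=1.0, margin=1e-4):
    """Damped Newton step. H(theta) may be indefinite, so lambda is
    chosen from its spectrum to make H + lambda*I positive-definite."""
    eigmin = np.linalg.eigvalsh(hessian)[0]     # smallest eigenvalue
    lam = max(0.0, -eigmin) + margin            # shift spectrum above zero
    d = grad.shape[0]
    return -alpha * np.linalg.solve(hessian + lam * np.eye(d), grad)
```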

4.3 Generalized Gauss-Newton

\(L({\boldsymbol{\theta }})\) is equal to \(\mathbb {E}_{{\boldsymbol{s}}}{\left[ l({\boldsymbol{y}}, {\boldsymbol{h}}_{{\boldsymbol{\theta }}}({\boldsymbol{x}})) \right] }\): it depends on \({\boldsymbol{\theta }}\) only through the outputs of \({\boldsymbol{h}}_{{\boldsymbol{\theta }}}\), which are finite-dimensional vectors. Setting \(\delta {\boldsymbol{h}}= {\boldsymbol{h}}_{{\boldsymbol{\theta }}+ \delta {\boldsymbol{\theta }}}({\boldsymbol{x}}) - {\boldsymbol{h}}_{{\boldsymbol{\theta }}}({\boldsymbol{x}})\), a second-order Taylor expansion of \(l({\boldsymbol{y}}, {\boldsymbol{h}}_{{\boldsymbol{\theta }}+ \delta {\boldsymbol{\theta }}}({\boldsymbol{x}}))\) can be written:

$$ l({\boldsymbol{y}}, {\boldsymbol{h}}_{{\boldsymbol{\theta }}+ \delta {\boldsymbol{\theta }}}({\boldsymbol{x}})) = l({\boldsymbol{y}}, {\boldsymbol{h}}_{{\boldsymbol{\theta }}}({\boldsymbol{x}})) + \frac{\partial l({\boldsymbol{y}}, {\boldsymbol{h}}_{{\boldsymbol{\theta }}}({\boldsymbol{x}}))}{\partial {\boldsymbol{h}}}^T \delta {\boldsymbol{h}}+ \frac{1}{2}\delta {\boldsymbol{h}}^T \mathcal {H}_{{\boldsymbol{y}}}({\boldsymbol{h}}_{{\boldsymbol{\theta }}}({\boldsymbol{x}})) \delta {\boldsymbol{h}}+ O(\delta {\boldsymbol{h}}^3), $$

where \(\mathcal {H}_{{\boldsymbol{y}}}({\boldsymbol{h}}_{{\boldsymbol{\theta }}}({\boldsymbol{x}}))\) is the Hessian matrix of the atomic loss \(l({\boldsymbol{y}}, {\boldsymbol{h}}_{{\boldsymbol{\theta }}}({\boldsymbol{x}}))\) with respect to variations of \({\boldsymbol{h}}_{{\boldsymbol{\theta }}}({\boldsymbol{x}})\), and \(\frac{\partial l({\boldsymbol{y}}, {\boldsymbol{h}}_{{\boldsymbol{\theta }}}({\boldsymbol{x}}))}{\partial {\boldsymbol{h}}}\) is the gradient of \(l({\boldsymbol{y}}, {\boldsymbol{h}}_{{\boldsymbol{\theta }}}({\boldsymbol{x}}))\) w.r.t. variations of \({\boldsymbol{h}}_{{\boldsymbol{\theta }}}({\boldsymbol{x}})\). Using the equality \(\delta {\boldsymbol{h}}= J_{{\boldsymbol{x}}}({\boldsymbol{\theta }}) \delta {\boldsymbol{\theta }}+ O(\delta {\boldsymbol{\theta }}^2)\) (with \(J_{{\boldsymbol{x}}}({\boldsymbol{\theta }})\) the Jacobian of the function \({\boldsymbol{\theta }}\mapsto {\boldsymbol{h}}_{{\boldsymbol{\theta }}}({\boldsymbol{x}})\)):

$$\begin{aligned} \begin{aligned} l({\boldsymbol{y}}, {\boldsymbol{h}}_{{\boldsymbol{\theta }}+ \delta {\boldsymbol{\theta }}}({\boldsymbol{x}})) =&\ l({\boldsymbol{y}}, {\boldsymbol{h}}_{{\boldsymbol{\theta }}}({\boldsymbol{x}})) + \frac{\partial l({\boldsymbol{y}}, {\boldsymbol{h}}_{{\boldsymbol{\theta }}}({\boldsymbol{x}}))}{\partial {\boldsymbol{h}}}^T J_{{\boldsymbol{x}}}({\boldsymbol{\theta }}) \delta {\boldsymbol{\theta }}+ \frac{\partial l({\boldsymbol{y}}, {\boldsymbol{h}}_{{\boldsymbol{\theta }}}({\boldsymbol{x}}))}{\partial {\boldsymbol{h}}}^T O(\delta {\boldsymbol{\theta }}^2) \\&+ \frac{1}{2}\delta {\boldsymbol{\theta }}^T J_{{\boldsymbol{x}}}({\boldsymbol{\theta }})^T \mathcal {H}_{{\boldsymbol{y}}}( {\boldsymbol{h}}_{{\boldsymbol{\theta }}}({\boldsymbol{x}})) J_{{\boldsymbol{x}}}({\boldsymbol{\theta }}) \delta {\boldsymbol{\theta }}+ O(\delta {\boldsymbol{\theta }}^3). \end{aligned} \end{aligned}$$

The generalized Gauss-Newton approach is an approximation that consists in dropping the term \(\frac{\partial l({\boldsymbol{y}}, {\boldsymbol{h}}_{{\boldsymbol{\theta }}}({\boldsymbol{x}}))}{\partial {\boldsymbol{h}}}^T O(\delta {\boldsymbol{\theta }}^2)\). Averaging over the samples yields the following approximation of \(L({\boldsymbol{\theta }}+ \delta {\boldsymbol{\theta }})\):

$$ L({\boldsymbol{\theta }}) + \mathbb {E}_{{\boldsymbol{s}}}\left[ \frac{\partial l({\boldsymbol{y}}, {\boldsymbol{h}}_{{\boldsymbol{\theta }}}({\boldsymbol{x}}))}{\partial {\boldsymbol{h}}}^T J_{{\boldsymbol{x}}}({\boldsymbol{\theta }})\right] ^T \delta {\boldsymbol{\theta }}+ \frac{1}{2}\delta {\boldsymbol{\theta }}^T \mathbb {E}_{{\boldsymbol{s}}}\left[ J_{{\boldsymbol{x}}}({\boldsymbol{\theta }})^T \mathcal {H}_{{\boldsymbol{y}}}({\boldsymbol{h}}_{{\boldsymbol{\theta }}}({\boldsymbol{x}})) J_{{\boldsymbol{x}}}({\boldsymbol{\theta }}) \right] \delta {\boldsymbol{\theta }}. $$

Noticing that \(\mathbb {E}_{{\boldsymbol{s}}}\left[ \frac{\partial l({\boldsymbol{y}}, {\boldsymbol{h}}_{{\boldsymbol{\theta }}}({\boldsymbol{x}}))}{\partial {\boldsymbol{h}}}^T J_{{\boldsymbol{x}}}({\boldsymbol{\theta }}) \right] = \nabla _{{\boldsymbol{\theta }}} L({\boldsymbol{\theta }})\), it can be rewritten:

$$ L({\boldsymbol{\theta }}+ \delta {\boldsymbol{\theta }}) \approx L({\boldsymbol{\theta }}) + \nabla _{{\boldsymbol{\theta }}} L({\boldsymbol{\theta }})^T \delta {\boldsymbol{\theta }}+ \frac{1}{2}\delta {\boldsymbol{\theta }}^T \mathbb {E}_{{\boldsymbol{s}}}\left[ J_{{\boldsymbol{x}}}({\boldsymbol{\theta }})^T \mathcal {H}_{{\boldsymbol{y}}}({\boldsymbol{h}}_{{\boldsymbol{\theta }}}({\boldsymbol{x}})) J_{{\boldsymbol{x}}}({\boldsymbol{\theta }}) \right] \delta {\boldsymbol{\theta }}. $$

As for Newton’s method, the usual way to derive the generalized Gauss-Newton method is to directly minimize this expression (see Martens (2014)), but we can also put a bound on the quantity \(\delta {\boldsymbol{\theta }}^T \mathbb {E}_{{\boldsymbol{s}}}\left[ J_{{\boldsymbol{x}}}({\boldsymbol{\theta }})^T \mathcal {H}_{{\boldsymbol{y}}}({\boldsymbol{h}}_{{\boldsymbol{\theta }}}({\boldsymbol{x}})) J_{{\boldsymbol{x}}}({\boldsymbol{\theta }}) \right] \delta {\boldsymbol{\theta }}\) to define a trust region for the validity of the first-order approximation (as in Sect. 4.2), provided that \(\mathbb {E}_{{\boldsymbol{s}}}\left[ J_{{\boldsymbol{x}}}({\boldsymbol{\theta }})^T \mathcal {H}_{{\boldsymbol{y}}}({\boldsymbol{h}}_{{\boldsymbol{\theta }}}({\boldsymbol{x}})) J_{{\boldsymbol{x}}}({\boldsymbol{\theta }}) \right] \) is SPD. If the loss \(l({\boldsymbol{y}}, {\boldsymbol{h}}_{{\boldsymbol{\theta }}}({\boldsymbol{x}}))\) is convex in \({\boldsymbol{h}}_{{\boldsymbol{\theta }}}({\boldsymbol{x}})\) (which is often true), the matrix is at least positive semi-definite, so a small damping term suffices to make it positive-definite. If a non-negligible portion of the matrices \(J_{{\boldsymbol{x}}}({\boldsymbol{\theta }})\) are full rank, the damping term may be added to \(\mathcal {H}_{{\boldsymbol{y}}}({\boldsymbol{h}}_{{\boldsymbol{\theta }}}({\boldsymbol{x}}))\) rather than to the full matrix. See Martens and Sutskever (2012) for an extensive discussion on different options for the damping and their benefits and drawbacks. With the damping on the full matrix, the optimization problem to solve at every iteration becomes:

$$\begin{aligned} \left\{ \begin{array}{l} \min _{\delta {\boldsymbol{\theta }}} \nabla _{{\boldsymbol{\theta }}} L({\boldsymbol{\theta }})^T \delta {\boldsymbol{\theta }}\\ \delta {\boldsymbol{\theta }}^T \left( \mathbb {E}_{{\boldsymbol{s}}}\left[ J_{{\boldsymbol{x}}}({\boldsymbol{\theta }})^T \mathcal {H}_{{\boldsymbol{y}}}({\boldsymbol{h}}_{{\boldsymbol{\theta }}}({\boldsymbol{x}})) J_{{\boldsymbol{x}}}({\boldsymbol{\theta }}) \right] + \lambda I\right) \delta {\boldsymbol{\theta }}\le \epsilon ^2, \end{array} \right. \end{aligned}$$
(8)

with updates \( \delta {\boldsymbol{\theta }}= - \alpha \left( \mathbb {E}_{{\boldsymbol{s}}}\left[ J_{{\boldsymbol{x}}}({\boldsymbol{\theta }})^T \mathcal {H}_{{\boldsymbol{y}}}({\boldsymbol{h}}_{{\boldsymbol{\theta }}}({\boldsymbol{x}})) J_{{\boldsymbol{x}}}({\boldsymbol{\theta }}) \right] + \lambda I\right) ^{-1} \nabla _{{\boldsymbol{\theta }}}L({\boldsymbol{\theta }}) \).
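Such an update might be computed as in the sketch below, assuming the per-sample Jacobians and atomic-loss Hessians are provided (e.g. by an autodiff library); the names are illustrative:

```python
import numpy as np

def generalized_gauss_newton_step(grad, jacobians, atomic_hessians,
                                  alpha, damping=1e-4):
    """Damped generalized Gauss-Newton step.
    jacobians: one J_x(theta) per sample; atomic_hessians: one Hessian
    of h -> l(y, h) at h = h_theta(x) per sample."""
    M = np.mean([J.T @ Hy @ J
                 for J, Hy in zip(jacobians, atomic_hessians)], axis=0)
    M += damping * np.eye(grad.shape[0])  # damping on the full matrix
    return -alpha * np.linalg.solve(M, grad)
```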

5 Summary and Conclusion

In Sects. 3 and 4 we motivated and derived 6 different ways to compute parameter updates, which can all be interpreted as solving an optimization problem of the following type:

$$\begin{aligned} \left\{ \begin{array}{l} \min _{\delta {\boldsymbol{\theta }}} \nabla _{{\boldsymbol{\theta }}} L({\boldsymbol{\theta }})^T \delta {\boldsymbol{\theta }}\\ \delta {\boldsymbol{\theta }}^T M({\boldsymbol{\theta }}) \delta {\boldsymbol{\theta }}\le \epsilon ^2, \end{array} \right. \end{aligned}$$

resulting in updates of the form \( \delta {\boldsymbol{\theta }}= - \alpha M({\boldsymbol{\theta }})^{-1} \nabla _{{\boldsymbol{\theta }}}L({\boldsymbol{\theta }}) \), with \(M({\boldsymbol{\theta }})\) SPD.

The quadratic term of the inequality corresponds to a specific metric defined by \(M({\boldsymbol{\theta }})\), used to measure the magnitude of the modification induced by \(\delta {\boldsymbol{\theta }}\). To evaluate this magnitude, the focus can be simply on the norm of \(\delta {\boldsymbol{\theta }}\), or on the effect of \(\delta {\boldsymbol{\theta }}\) on the loss, on \({\boldsymbol{h}}_{{\boldsymbol{\theta }}}({\boldsymbol{x}})\), or on \(p_{{\boldsymbol{\theta }}}(\cdot |{\boldsymbol{x}})\), resulting in various approaches, with various definitions of \(M({\boldsymbol{\theta }})\). In a context of probabilistic regression, we gave 6 examples that correspond to popular variants of the gradient descent, summarized in Table 1. All methods except the natural gradient can be adapted to deterministic cases. Unifying several first-order and second-order variants of the gradient descent method reveals links between these different approaches, and contexts in which some of them are equivalent. The proposed framework gives a compact overview of common variants of the gradient descent, and can hopefully help choose between them appropriately, depending on the problem to solve.