1 Introduction

Saddle points have long been regarded as a major obstacle for machine learning methods that require optimizing a non-convex objective [21, 43]. It is well understood that in many applications of interest, saddle points significantly outnumber local minima, which is especially problematic when the solutions associated with worst-case saddle points are considerably worse than those associated with worst-case local minima [19].

Moreover, it is not hard to construct examples where a worst-case initialization of gradient descent (or other first-order methods) provably converges to a saddle point [39, Sect. 1.2.3].

The main message of our paper is that, under very mild regularity conditions, saddle points have little effect on the asymptotic behavior of first-order methods. Building on tools from the theory of dynamical systems, we generalize recent analyses of gradient descent [32, 42] to establish that a wide variety of first-order methods—including gradient descent, the proximal point algorithm, block coordinate descent, and mirror descent—avoid so-called “strict” saddle points for almost all initializations; that is, saddle points where the Hessian of the objective function admits at least one direction of negative curvature (see Definition 1).

Our results provide a unified theoretical framework for analyzing the asymptotic behavior of a wide variety of classic optimization algorithms in non-convex optimization. Furthermore, we believe that furthering our understanding of the behavior and geometry of deterministic optimization techniques with random initialization can serve in the development of stochastic algorithms which improve upon their deterministic counterparts and achieve strong convergence-rate results; indeed, such insights have already led to significant improvements in modifying gradient descent to navigate saddle-point geometry [22, 28].

1.1 Related work

In recent years, the optimization and machine learning communities have dedicated much effort to understanding the geometry of non-convex landscapes by searching for unified geometric properties which could be leveraged by general-purpose optimization techniques. The strict saddle property (Definition 1) is one such property, which has been shown to hold in a wide and diverse range of salient objective functions: PCA, a fourth-order tensor factorization [24], formulations of dictionary learning [52, 53], phase retrieval [51], low-rank matrix factorizations [12, 25, 26], and simple neural networks [16, 23, 49]. It is also known that, in the worst case, the strict saddle property is unavoidable, as finding descent directions at critical points with degenerate Hessians is NP-hard in general [38].

Earlier work had shown that first-order descent methods can circumvent strict saddle points, provided that they are augmented with unbiased noise whose variance is sufficiently large in each direction. For example, Pemantle [44] establishes convergence of the Robbins–Monro stochastic approximation method to local minimizers for strict saddle functions. More recently, Ge et al. [24] give quantitative rates on the convergence of noisy gradient descent to local minimizers, for strict saddle functions. O'Neill and Wright [41] use the stable manifold theorem to prove that the accelerated gradient method converges to a single strict saddle point with probability zero. Theorem 2 is complementary and strengthens their result: it guarantees that the probability of converging to the set of strict saddle points is zero, rather than to a single (or countably many) strict saddle point.

To obtain provable guarantees without the addition of stochastic noise, Sun et al. [51,52,53] adopt trust-region methods which leverage Hessian information in order to circumvent saddle points. This approach represents a refinement of a long tradition of related, “second-order” strategies, including: a modified Newton’s method with curvilinear line search [37], the modified Cholesky method [27], trust-region methods [20], and the related cubic regularized Newton’s method [40]. Specialized to deep learning applications, Dauphin et al. [21] and Pascanu et al. [43] introduced a saddle-free Newton method.

However, such curvature-based optimization algorithms have a per-iteration computational complexity which scales quadratically or even cubically in the dimension d, rendering them unsuitable for optimization of high-dimensional functions. More recently, several authors, including Liu and Yang [34], Reddi et al. [45], and Royer and Wright [47], have presented faster curvature-based methods that combine fast first-order methods with fast eigenvector algorithms to obtain lower per-iteration complexity.

Fortunately, it appears that neither the addition of isotropic noise nor the use of second-order methods is necessary for circumventing saddle points. For example, recent work by Jin et al. [28] showed that carefully perturbing the iterates of gradient descent in the vicinity of possible saddles yields a first-order method which converges to local minimizers in a number of iterations with only poly-logarithmic dimension dependence. Moreover, many recent works have shown that, even without any random perturbations, a combination of gradient descent and smart initialization provably converges to the global minimum for a variety of non-convex problems: such settings include matrix factorization [29, 55], phase retrieval [17, 18], dictionary learning [6], and latent-variable models [11, 54]. While our results only guarantee convergence to local minimizers, they eschew the need for complex and often computationally prohibitive initialization procedures.

In addition to what has been established theoretically, there is a broadly accepted folklore in the field that running gradient descent with a random initialization is sufficient to identify a local optimum. For example, Sun et al. [51] empirically observe that gradient descent with 100 random initializations on the phase retrieval problem always converges to a local minimizer, one whose quality matches that of the solution found using more costly trust-region techniques. It is the purpose of this work to place these intuitions on firm mathematical footing.

Finally, we emphasize that there are many settings in which all local optima (but not saddles!) have objective values which are nearly as small as those of the global minima; see for example Ge et al. [25, 26], Soltanolkotabi et al. [49], Sun et al. [50, 52]. Some preliminary results have suggested that this may be a quite general phenomenon. For example, Choromanska et al. [19] study the loss surface of a particular Gaussian random field as a proxy for understanding the objective landscape of deep neural nets. The results leverage the Kac–Rice Theorem [5, 9], and establish that critical points with more positive eigenvalues have lower expected function value, often close to that of the global minimizer. We remark that functions drawn from this Gaussian random field model share the strict saddle property defined above, and so our results apply in this setting. On the other hand, our results are considerably more general, as they do not place stringent generative assumptions on the objective function f.

1.2 Organization

The rest of the paper is organized as follows. Section 2 introduces the notation and definitions used throughout the paper. Section 3 provides an intuitive explanation for why it is unlikely that gradient descent converges to a saddle point, by studying a non-convex quadratic and emphasizing the analogy with power iteration. Section 4 develops the main technical theorem, which uses the stable manifold theorem to show that the stable set of unstable fixed points has measure zero. Section 5 applies the main theorem to show that gradient descent, block coordinate descent, the proximal point algorithm, manifold gradient descent, and mirror descent all avoid strict saddle points. Finally, we conclude in Sect. 6 by suggesting several directions of future work.

2 Preliminaries

Throughout the paper, we will use \(f: {\mathcal {X}}\rightarrow {\mathbf {R}}\) to denote a real-valued function in \(C^2\), the space of twice-continuously differentiable functions.

Definition 1

(Strict saddle) When \({\mathcal {X}}= {\mathbf {R}}^n\),

  1.

    A point \(x^*\) is a critical point of f if \(\nabla f(x^*)=0\).

  2.

    A point \(x^*\) is a strict saddle point of f if \(x^*\) is a critical point and \(\lambda _{\min } (\nabla ^2 f(x^*)) <0\), where \(\lambda _{\min } (H)\) is the smallest eigenvalue of H. Let \({\mathcal {X}}^*\) denote the set of strict saddle points.

When \({\mathcal {X}}\) is a manifold, the same definition applies, but with gradient and Hessian replaced by the Riemannian gradient \(\nabla _{R} f(x)\) and Riemannian Hessian \(\nabla ^2_{R} f(x)\). See Sect. 5.5 for details, and Chapter 5.5 of Absil et al. [2].

Our interest is in the attraction region of an optimization algorithm g, viewed as a mapping from \({\mathcal {X}}\rightarrow {\mathcal {X}}\). The iterates of the algorithm are generated by the sequence

$$\begin{aligned} x_{k}=g(x_{k-1})=g^k (x_0), \end{aligned}$$

where \(g^k\) is the k-fold composition of g. As an example, gradient descent corresponds to \(g(x) = x - \alpha \nabla f(x)\).
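To make the dynamical-systems viewpoint concrete, here is a minimal Python sketch (the quadratic objective at the end is an illustrative choice, not one from the paper) of an algorithm as a mapping g together with its k-fold composition:

```python
import numpy as np

def gradient_descent_map(grad_f, alpha):
    """The update mapping g(x) = x - alpha * grad_f(x)."""
    return lambda x: x - alpha * grad_f(x)

def iterate(g, x0, k):
    """The k-fold composition g^k(x0)."""
    x = x0
    for _ in range(k):
        x = g(x)
    return x

# Illustrative example: f(x) = ||x||^2 / 2, so grad_f(x) = x.
g = gradient_descent_map(lambda x: x, alpha=0.1)
print(iterate(g, np.array([1.0, -2.0]), k=100))  # approaches the minimizer at 0
```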

Since we are interested in the region of attraction of a critical point, we provide the definition of the stable set.

Definition 2

(Global stable set) The global stable set of the strict saddles is the set of initial conditions where iteration of the mapping g converges to a strict saddle. This is defined as

$$\begin{aligned} W_g = \left\{ x_0: \lim _k g^k (x_0) \in {\mathcal {X}}^* \right\} . \end{aligned}$$

3 Intuition

To illustrate why gradient descent and related first-order methods do not converge to saddle points, consider the case of a non-convex quadratic, \(f(x)=\frac{1}{2} x^T H x\). Without loss of generality, assume \(H = \mathop {\mathbf {diag}}(\lambda _1,\ldots ,\lambda _n)\) with \(\lambda _1, \ldots , \lambda _k >0\) and \(\lambda _{k+1}, \ldots , \lambda _n <0\). \(x^*=0\) is the unique critical point of this function and the Hessian at \(x^*\) is H. Gradient descent initialized from \(x_0\) has iterates

$$\begin{aligned} x_{t+1} =g(x_t)=\sum _{i=1}^n (1-\alpha \lambda _i)^{t+1} \langle e_i,x_0 \rangle e_i\,. \end{aligned}$$

where \(e_i\) denote the standard basis vectors. This iteration resembles power iteration with the matrix \(I - \alpha H\).

Let \(L = \max |\lambda _i|\), and suppose \(\alpha < 1/L\). Thus we have \((1- \alpha \lambda _i)< 1\) for \(i \le k\) and \((1-\alpha \lambda _i) >1\) for \(i >k\). If \(x_0 \in E_s:={{\,\mathrm{span}\,}}(e_1,\ldots , e_k)\), then \(x_t\) converges to the saddle point at zero since \((1-\alpha \lambda _i)^{t+1} \rightarrow 0\). However, if \(x_0\) has a component outside \(E_s\) then gradient descent diverges to \(\infty \). For this simple quadratic function, we see that the global stable set (attractive set) of zero is the subspace \(E_s\). Now, if we choose our initial point at random, the probability of that point landing in \(E_s\) is zero as long as \(k<n\) (i.e., \(E_s\) is not full dimensional).
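The following sketch (with hypothetical eigenvalues chosen for illustration) simulates this dichotomy: an initial point in \(E_s\) is attracted to the saddle at zero, while a random initial point almost surely has a component along a direction of negative curvature and escapes:

```python
import numpy as np

rng = np.random.default_rng(0)
lam = np.array([1.0, 0.5, -0.5])          # k = 2 positive, one negative eigenvalue
H = np.diag(lam)
alpha = 0.9 / np.max(np.abs(lam))          # step-size below 1/L

def gd(x, t):
    for _ in range(t):
        x = x - alpha * H @ x              # g(x) = (I - alpha * H) x
    return x

x_stable = np.array([1.0, 1.0, 0.0])       # lies in E_s = span(e_1, e_2)
x_random = rng.standard_normal(3)          # almost surely not in E_s
print(np.linalg.norm(gd(x_stable, 200)))   # -> 0: converges to the saddle
print(np.linalg.norm(gd(x_random, 200)))   # blows up along e_3: escapes
```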

As an example of this phenomenon for a non-quadratic function, consider the following function from Nesterov [39, Sect. 1.2.3]. Letting \(f(x,y) = \frac{1}{2} x^2 +\frac{1}{4} y^4-\frac{1}{2} y^2\), the corresponding gradient mapping is

$$\begin{aligned} g(x)&= \begin{bmatrix} (1-\alpha ) x\\ (1+\alpha ) y - \alpha y^3 \end{bmatrix}. \end{aligned}$$

The critical points are

$$\begin{aligned} z_1 = \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \quad z_2 = \begin{bmatrix} 0 \\ -1 \end{bmatrix},\quad z_3 = \begin{bmatrix} 0 \\ 1 \end{bmatrix}. \end{aligned}$$

The points \(z_2\) and \(z_3\) are isolated local minima, and \(z_1\) is a saddle point.

Gradient descent initialized from any point of the form \(\begin{bmatrix} x \\ 0 \end{bmatrix}\) converges to the saddle point \(z_1\). Any other initial point either diverges, or converges to a local minimum, so the stable set of \(z_1\) is the x-axis, which is a zero-measure set in \({\mathbf {R}}^2\). By computing the Hessian,

$$\begin{aligned} \nabla ^2 f(x) = \begin{bmatrix} 1&\quad 0 \\ 0&\quad 3y^2 -1 \end{bmatrix}, \end{aligned}$$

we find that \(\nabla ^2 f(z_1)\) has one positive eigenvalue with eigenvector that spans the x-axis, thus agreeing with our above characterization of the stable set. If the initial point is chosen randomly, there is zero probability of initializing on the x-axis and thus zero probability of converging to the saddle point \(z_1\).
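A short simulation (with an illustrative step-size \(\alpha = 0.1\)) confirms this behavior: random initializations land on one of the two local minima, never on the saddle:

```python
import numpy as np

alpha = 0.1
def g(p):                                   # gradient mapping for Nesterov's example
    x, y = p
    return np.array([(1 - alpha) * x, (1 + alpha) * y - alpha * y**3])

rng = np.random.default_rng(1)
for _ in range(5):
    p = rng.uniform(-1.5, 1.5, size=2)      # random initialization
    for _ in range(500):
        p = g(p)
    print(np.round(p, 6))                   # lands on (0, 1) or (0, -1), never on z_1
```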

For gradient descent, the local attractive set of a critical point \(x^*\) is well-approximated by the span of the eigenvectors corresponding to positive eigenvalues of the Hessian. By an application of Taylor’s theorem, one can see that if the initial point \(x_0\) is chosen uniformly at random in a small neighborhood around \(x^*\), then the probability of initializing in the span of these eigenvectors is zero whenever there is a negative eigenvalue. Thus, gradient descent initialized at \(x_0\) will leave the neighborhood of \(x^*\). Although this argument provides valuable intuition, there are several difficulties in formalizing it: (1) \(x_0\) is randomly distributed over the entire domain, not over a small neighborhood around \(x^*\), and Taylor’s theorem does not provide any global guarantees; and (2) the argument does not rule out converging to a different saddle point.

4 Stable manifold theorem and unstable fixed points

4.1 Setup

For the rest of this paper, g is a mapping from \({\mathcal {X}}\) to itself, and \({\mathcal {X}}\) is an n-dimensional manifold without boundary. Recall that a \({\mathcal {C}}^k\)-smooth, n-dimensional manifold is a space \({\mathcal {X}}\), together with a collection of charts \(\{(U_{\alpha },\phi _{\alpha })\}\), called an atlas, where each \(\phi _{\alpha }\) is a homeomorphism from an open subset \(U_{\alpha } \subset {\mathcal {X}}\) to \({\mathbb {R}}^n\). The charts are required to be compatible in the sense that, whenever \(U_{\alpha } \cap U_{\beta } \ne \emptyset \), the transition map \(\phi _{\alpha } \circ \phi _{\beta }^{-1}\) is a \({\mathcal {C}}^k\) map from \(\phi _{\beta }(U_{\beta } \cap U_{\alpha }) \rightarrow {\mathbb {R}}^n\). We also require that \(\bigcup _{\alpha } U_{\alpha } = {\mathcal {X}}\), and that \({\mathcal {X}}\) is second countable, which means that for any set U contained in \(\bigcup _{\alpha \in {\mathcal {I}}}U_{\alpha }\) for some index set \({\mathcal {I}}\), there exists a countable set \({\mathcal {J}} \subset {\mathcal {I}}\) such that \(U \subset \bigcup _{\alpha \in {\mathcal {J}}}U_{\alpha }\). We can now recall the definition of a measure zero subset of a manifold:

Definition 3

(Section 5.4 of Mikusinski and Taylor [36]) Given an n-dimensional manifold \({\mathcal {X}}\), we say that a set \(S \subset {\mathcal {X}}\) has measure zero if there is an atlas \(\{U_i,\phi _i\}_{i \ge 1}\) such that \(\phi _i(S \cap U_i)\) has Lebesgue measure zero as a subset of \({\mathbb {R}}^n\). In this case, we use the shorthand \(\mu (S) = 0\). The measure zero property is independent of the choice of atlas [36, Chapter 5].

Definition 4

(Chapter 3 of Absil et al. [2]) The differential of g, denoted as \(\mathrm {D}g (x)\), is a linear operator from \({\mathcal {T}}(x) \rightarrow {\mathcal {T}}(g(x))\), where \({\mathcal {T}}(x)\) is the tangent space of \({\mathcal {X}}\) at point x. Given a curve \(\gamma (t)\) in \({\mathcal {X}}\) with \(\gamma (0) =x\) and \(\frac{d\gamma }{dt}(0) =v \in {\mathcal {T}}(x)\), the linear operator is defined as \(\mathrm {D}g (x) v = \frac{d( g \circ \gamma ) }{dt} (0) \in {\mathcal {T}}(g(x))\). The determinant of the linear operator \(\det (\mathrm {D}g(x) ) \) is the determinant of the matrix representing \(\mathrm {D}g(x)\) with respect to an arbitrary basis.

Lemma 1

Let \(E \subset {\mathcal {X}}\) be a measure zero subset. If \(\det (Dg(x)) \ne 0\) for all \(x \in {\mathcal {X}}\) , then \(\mu (g^{-1}(E))=0\).

Proof

Let \(h=g^{-1}\), and let \(\{(V_i , \psi _i)\} \) be a countable collection of charts of the co-domain of h. By countable additivity of measure, it suffices to show that each \(h(E) \cap V_i\) is measure zero. Without loss of generality, we may assume that h(E) is contained in a single chart \(( V, \psi )\); otherwise we repeat the same argument for each chart.

We wish to show that \(\mu ( \psi \circ h (E)) =0\). Let \(\{(U_i, \phi _i)\}\) be another countable collection of charts of the domain of g. Define \(E_i = E \cap U_i \), and note that \(E= \cup _{i=1}^{\infty } E_i = \cup _i \phi _i^{-1} \circ \phi _i (E_i)\). Thus

$$\begin{aligned} \mu (\psi \circ h (E))&= \mu (\psi \circ h (\cup _{i} \phi _i^{-1} \circ \phi _i (E_i)))\\&\le \sum _{i=1}^\infty \mu ( \psi \circ h \circ \phi _i ^{-1} (\phi _i(E_i))) . \end{aligned}$$

By assumption, \(\phi _i(E_i)\) is measure zero. The function \( \psi \circ h \circ \phi _i^{-1} =\psi \circ g^{-1} \circ \phi _i^{-1}\) is \(C^1\) since \(\det ( \mathrm {D}g) \ne 0\), and is thus locally Lipschitz, so it preserves measure zero sets. By countable additivity and the displayed inequality above, \(\psi \circ h (E)\) has measure zero, and hence \(\mu (g^{-1}(E)) = 0\). \(\square \)

4.2 Unstable fixed points

Definition 5

(Set of unstable fixed points) Let

$$\begin{aligned} {\mathcal {A}}^* _g= \left\{ x: g(x) =x, \max _{i} |\lambda _i (Dg(x))|>1 \right\} \end{aligned}$$

be the set of fixed points where the differential has at least a single eigenvalue with magnitude greater than one. These are the unstable fixed points.

Theorem 1

(Theorem III.7, Shub [48]) Let \(x^*\) be a fixed point for the \(C^r\) local diffeomorphism \(g: {\mathcal {X}}\rightarrow {\mathcal {X}}\). Suppose that the tangent space at \(x^*\) splits as \(E = E_{cs} \oplus E_u\), where \(E_{cs}\) is the span of the eigenvectors corresponding to eigenvalues of magnitude less than or equal to one of \(\mathrm {D}g(x^*)\), and \(E_u\) is the span of the eigenvectors corresponding to eigenvalues of magnitude greater than one of \(\mathrm {D}g(x^*)\). Then there exists a \(C^r\) embedded disk \(W^{cs}_{loc}\) that is tangent to \(E_{cs}\) at \(x^*\), called the local center stable manifold. Moreover, there exists a neighborhood B of \(x^*\) such that \(g(W^{cs}_{loc} ) \cap B \subset W^{cs} _{loc}\) and \(\cap _{k=0}^\infty g^{-k} (B) \subset W^{cs}_{loc}\).

Theorem 2

Let g be a \(C^1\) mapping from \({\mathcal {X}}\rightarrow {\mathcal {X}}\) and \(\det ( \mathrm {D}g(x))\ne 0 \) for all \(x \in {\mathcal {X}}\). Then the set of initial points that converge to an unstable fixed point has measure zero, \(\mu (\{x_0: \lim _{k\rightarrow \infty } x_k \in {\mathcal {A}}^*_g \} ) =0\).

Proof

For each \(x^* \in {\mathcal {A}}^* _g\), there is an associated open neighborhood \(B_{x^*}\) promised by the Stable Manifold Theorem 1. The union \( \cup _{x^* \in {\mathcal {A}}^*_g} B_{x^*}\) forms an open cover of \({\mathcal {A}}^*_g\), and since \({\mathcal {X}}\) is second-countable we can extract a countable subcover, so that \(\cup _{x^* \in {\mathcal {A}}^*_g} B_{x^*}= \cup _{i=1}^\infty B_{x^*_i}\).

Define \(W=\{x_0: \lim _{k\rightarrow \infty } x_k \in {\mathcal {A}}^*_g\}\). Fix a point \(x_0 \in W\). Since \(x_k \rightarrow x^* \in {\mathcal {A}}^*_g \), there is a non-negative integer T such that for all \(t\ge T\), \(g^t(x_0) \in \cup _{x^* \in {\mathcal {A}}^*_g} B_{x^*}\). Since we have a countable sub-cover, \(g^t(x_0) \in B_{x^*_i} \) for some \(x^* _i \in {\mathcal {A}}^*_g\) and all \(t\ge T\). Since \(g^{t+k}(x_0) \in B_{x^*_i}\), we have \(g^t (x_0) \in g^{-k} (B_{x^*_i})\), and thus \(g^t (x_0) \in \cap _{k=0}^\infty g^{-k} ( B_{x^*_i})\) for all \(t\ge T\). By Theorem 1, \( S_i \triangleq \cap _{k=0}^\infty g^{-k} ( B_{x^*_i})\) is a subset of the local center stable manifold, which has co-dimension at least one, so \(S_i\) has measure zero.

Finally, \(g^T(x_0) \in S_i\) implies that \(x_0 \in g^{-T} ( S_i)\). Since T is unknown, we take a union over all non-negative integers to obtain \(x_0 \in \cup _{j=0}^\infty g^{-j} (S_i)\). Since \(x_0\) was arbitrary, we have shown that \(W \subset \cup _{i=1}^\infty \cup _{j=0}^\infty g^{-j} (S_i)\). Using Lemma 1 and the fact that a countable union of measure zero sets is measure zero, W has measure zero. \(\square \)

Next, we state a simple corollary that only requires verifying \(\det ( \mathrm {D}g(x) ) \ne 0\), and \({\mathcal {X}}^* \subset {\mathcal {A}}^*_g\).

Corollary 1

Assume, in addition to the conditions of Theorem 2, that \({\mathcal {X}}^* \subset {\mathcal {A}}^* _g\). Then the global stable set (defined in Definition 2) has measure zero, \(\mu (W_g ) =0\).

Proof

Since \({\mathcal {X}}^* \subset {\mathcal {A}}^* _g\), the global stable set satisfies \(W_g \subset \{x_0: \lim _{k\rightarrow \infty } g^k (x_0) \in {\mathcal {A}}^*_g \}\). Using Theorem 2, \(\mu (W_g) =0\). \(\square \)

5 Application to optimization

5.1 Gradient descent

As an application of Theorem 2, we show that gradient descent avoids saddle points. Consider the gradient descent algorithm with step-size \(\alpha \):

$$\begin{aligned} x_{k+1}=g(x_k) \triangleq x_k - \alpha \nabla f(x_k). \end{aligned}$$
(1)

Assumption 1

(Bounded Hessian) Let \(f \in C^2\), and \(\left\| \nabla ^2 f(x) \right\| _2 \le L\) for all x.

Proposition 1

Every strict saddle point \(x^* \) is an unstable fixed point of gradient descent, meaning \({\mathcal {X}}^* \subset {\mathcal {A}}^* _g\) .

Proof

First we verify that critical points of f are fixed points of g. Since \(\nabla f(x^*) =0\), we have \(g(x^*) = x^*-\alpha \nabla f(x^*) =x^*\), so \(x^*\) is a fixed point.

At a strict saddle \( x^* \in {\mathcal {X}}^*\), \(\mathrm {D}g(x^*) =I-\alpha \nabla ^2 f(x^*)\) with eigenvalues \( 1- \alpha \lambda _i\), where \(\lambda _i \) are the eigenvalues of \(\nabla ^2 f(x^*)\). Since \(x^*\) is a strict saddle, there is at least one eigenvalue \(\lambda _i<0\), for which \(1-\alpha \lambda _i >1\). Thus \(x^* \in {\mathcal {A}}^* _g\).\(\square \)

Proposition 2

Under Assumption 1 and \(\alpha < \frac{1}{L}\), \(\det (\mathrm {D}g (x)) \ne 0\) for all x.

Proof

By a straightforward calculation

$$\begin{aligned} \mathrm {D}g(x) = I- \alpha \nabla ^2 f(x)= I- \alpha V D V^T, \end{aligned}$$

where \(\nabla ^2 f(x) = V DV^T\). The eigenvalues of Dg(x) are \(1- \alpha \lambda _i \), and so

$$\begin{aligned} \det (\mathrm {D}g(x) ) =\prod _{i} ( 1-\alpha \lambda _i). \end{aligned}$$

Using the bounded Hessian assumption, \(\alpha< 1/L \le 1/|\lambda _i |\) for every i, so each term in the product is positive and \(\det (\mathrm {D}g(x)) >0\).\(\square \)

Corollary 2

Let g be the gradient descent algorithm as defined in Eq. (1). Under Assumption 1 and \(\alpha < \frac{1}{L}\), the stable set of the strict saddle points has measure zero, meaning \(\mu (W_g)=0\).

Proof

The proof is a straightforward application of the previous two Propositions and Corollary 1. Proposition 1 shows that \({\mathcal {X}}^* \subset {\mathcal {A}}^* _g\), and Proposition 2 shows that \(\det ( \mathrm {D}g (x)) \ne 0\). By applying Corollary 1, we conclude that \(\mu ( \{x_0: \lim _k g^k (x_0) \in {\mathcal {X}}^*\}) =0\).\(\square \)

5.2 Proximal point

The proximal point algorithm is given by the iteration

$$\begin{aligned} x_{k+1} = g(x_k) \triangleq \arg \min _{z} f(z) +\frac{1}{2\alpha } \left\| x_k -z \right\| ^2. \end{aligned}$$
(2)
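The inner \(\arg \min \) rarely has a closed form; the following sketch (an illustrative instance, reusing Nesterov's example from Sect. 3 and solving the inner problem numerically with scipy) shows one way to iterate the update:

```python
import numpy as np
from scipy.optimize import minimize

def prox_step(f, grad_f, x, alpha):
    """One proximal point update: argmin_z f(z) + ||x - z||^2 / (2 alpha)."""
    obj = lambda z: f(z) + np.sum((x - z) ** 2) / (2 * alpha)
    jac = lambda z: grad_f(z) - (x - z) / alpha
    return minimize(obj, x, jac=jac, method="BFGS").x

# f(x, y) = x^2/2 + y^4/4 - y^2/2; the inner problem is strongly convex for alpha < 1.
f = lambda p: 0.5 * p[0] ** 2 + 0.25 * p[1] ** 4 - 0.5 * p[1] ** 2
grad_f = lambda p: np.array([p[0], p[1] ** 3 - p[1]])
p = np.array([0.3, 0.2])
for _ in range(100):
    p = prox_step(f, grad_f, p, alpha=0.4)
print(np.round(p, 4))                       # approaches the local minimizer (0, 1)
```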

Proposition 3

Under Assumption 1 and \(\alpha < \frac{1}{L}\):

  1.

    \(\det ( \mathrm {D}g(x)) \ne 0\).

  2.

    Every strict saddle point \(x^*\) is an unstable fixed point of proximal point, meaning \({\mathcal {X}}^* \subset {\mathcal {A}}^* _g\).

Proof

Since \(\nabla f\) is L-Lipschitz, \(f(z) +\frac{1}{2\alpha } \left\| x-z \right\| ^2\) is strongly convex for \(\alpha < \frac{1}{L}\), and the \(\arg \min \) is well-defined and unique. By the optimality conditions, \(g(x) +\alpha \nabla f(g(x)) =x\). By implicit differentiation, \( \mathrm {D}g(x) +\alpha \nabla ^2 f(g(x)) \mathrm {D}g(x) = I\), and so

$$\begin{aligned} \mathrm {D}g(x) = (I+\alpha \nabla ^2 f(g(x)))^{-1} . \end{aligned}$$

At a strict saddle \(x^*\), \(\mathrm {D}g(x^*) = (I+\alpha \nabla ^2 f(x^*) )^{-1}\), whose eigenvalues are \(\frac{1}{1+\alpha \lambda _i}\) for the eigenvalues \(\lambda _i\) of \(\nabla ^2 f(x^*)\); since \(\nabla ^2 f(x^*)\) has an eigenvalue \(\lambda _{\min } <0\) and \(\alpha < \frac{1}{L}\) gives \(1+\alpha \lambda _{\min } \in (0,1)\), \(\mathrm {D}g(x^*)\) has an eigenvalue greater than one. Moreover, for \(\alpha <\frac{1}{L}\), \(I+\alpha \nabla ^2 f(g(x))\) is positive definite, so \(\mathrm {D}g (x)\) is invertible and \(\det ( \mathrm {D}g (x)) \ne 0\).\(\square \)

By combining Proposition 3 and Corollary 1, we have the following:

Corollary 3

(Proximal Point) Let g be the proximal point algorithm as defined in Eq. (2). Under Assumption 1 and \(\alpha < \frac{1}{L}\), the stable set of the strict saddle points has measure zero, meaning \(\mu (W_g)=0\).

5.3 Coordinate descent

[Algorithm 1: Coordinate gradient descent]

We define \(g_i(x)=x - \alpha (0,\ldots ,0, \frac{\partial f(x)}{\partial x_i},0,\ldots ,0)\) to be the coordinate descent update of index i in Algorithm 1. One iteration of coordinate gradient descent corresponds to the update

$$\begin{aligned} x_{k+1} = g(x_k)= g_n \circ g_{n-1} \circ \cdots \circ g_1(x_k). \end{aligned}$$
(4)
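A minimal sketch of one such sweep (the quadratic test function and step-size are illustrative choices), previewing the escape behavior guaranteed by Proposition 5 below:

```python
import numpy as np

def coordinate_descent_sweep(grad_f, x, alpha):
    """One pass g = g_n ∘ ... ∘ g_1 of Algorithm 1: update coordinates in
    order, evaluating each partial derivative at the freshest iterate."""
    x = x.copy()
    for i in range(len(x)):
        x[i] -= alpha * grad_f(x)[i]       # g_i changes only coordinate i
    return x

# Illustrative strict saddle: f(x) = x^T H x / 2 with lambda_min(H) < 0.
H = np.array([[2.0, 1.0], [1.0, -1.0]])    # L_max = max |H_ii| = 2
x = np.array([1e-8, 1e-8])                 # start very close to the saddle at 0
for _ in range(60):
    x = coordinate_descent_sweep(lambda z: H @ z, x, alpha=0.3)  # alpha < 1/L_max
print(np.linalg.norm(x))                    # grows: the iterates escape the saddle
```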

Assumption 2

(Lipschitz coordinate gradient) Let \(f \in C^2\), and \(|\nabla ^2 f(x) _{ii}|\le L_{\max }\) for all \(i \in [n]\) and x.

Lemma 2

The differential is

$$\begin{aligned} \mathrm {D}g (x_k) = \prod _{j=1}^n (I- \alpha e_{n-j+1}e_{n-j+1}^{T} \nabla ^2 f(y_k^{n-j})), \end{aligned}$$
(5)

where \(e_i\) is the ith standard basis vector and \(y_k^{i} = g_i \circ \cdots \circ g_1 (x_k)\) is the intermediate iterate of Algorithm 1 after the first i coordinate updates (with \(y_k^0 = x_k\)).

Proof

This is an application of the chain rule. The differential of the composition of two functions \(f \circ h\) is just \(\mathrm {D}f(h(x)) \cdot \mathrm {D}h (x)\). By repeatedly applying this and observing that \(\mathrm {D}g_i (x) = I- \alpha e_i e_i^{T} \nabla ^2 f(x)\), we have the result. \(\square \)

Proposition 4

Under Assumption 2 and \(\alpha < \frac{1}{L_{\max }}\), \(\det ( \mathrm {D}g(x)) \ne 0\).

Proof

It suffices to prove that every factor in Eq. (5) is an invertible matrix. Using the matrix determinant lemma, the characteristic polynomial of the matrix \(I- \alpha e_i e_i^{T}\nabla ^2 f(x)\) is equal to \((\lambda -1)^{n-1} (\lambda - 1+\alpha \frac{\partial ^2 f(x)}{\partial x_i^2})\). For \(\alpha < \frac{1}{L_{\max }} \le 1/|\frac{\partial ^2 f(x)}{\partial x_i^2}|\), the eigenvalues of \(\mathrm {D}g_i \) are all positive, and thus \(\mathrm {D}g_i\) is invertible. \(\square \)

Proposition 5

(Instability at saddle points) Under Assumption 2 and \(\alpha < \frac{1}{L_{\max }}\), every strict saddle point \(x^*\) is an unstable fixed point of coordinate descent, meaning \({\mathcal {X}}^* \subset {\mathcal {A}}^* _g\).

Proof

Let \(H= \nabla ^2 f(x^*)\), \(J= \mathrm {D}g(x^*) =\prod _{j=1}^n (I- \alpha e_{n-j+1}e_{n-j+1}^{T} H)\), and \(y_0\) be the eigenvector corresponding to the smallest (leftmost) eigenvalue of H.

We shall prove that \(\left\| J^t y_0 \right\| _2 \ge c(1+\frac{\epsilon }{4})^t \) for some \(\epsilon ,c\) which depend on H, but not on t. By proving the aforementioned fact, we get that \(\left\| J^t \right\| _2 \ge \frac{\left\| J^t y_0 \right\| _2}{\left\| y_0 \right\| _2} \ge \frac{c}{\left\| y_0 \right\| _2} (1+\frac{\epsilon }{4})^t\). Gelfand’s theorem states that the spectral radius \(\rho \) satisfies

$$\begin{aligned} \rho (J) = \lim _{t\rightarrow \infty } \left\| J^t \right\| _2^{1/t}, \end{aligned}$$

and thus

$$\begin{aligned} \rho (J) = \lim _{t\rightarrow \infty } \left\| J^t \right\| _2^{1/t} \ge \lim _{t\rightarrow \infty }\left( \frac{c}{\left\| y_0 \right\| _2}\right) ^{1/t}(1+\frac{\epsilon }{4}) = (1+\frac{\epsilon }{4}), \end{aligned}$$

i.e., J must have an eigenvalue of magnitude greater than one.

We fix some arbitrary iteration t and let \(y_t = J^t y_0\). We will first show that there exists an \(\epsilon >0\) so that

$$\begin{aligned} y_{t+1}^{T}Hy_{t+1} \le (1+\epsilon ) y_{t}^{T}Hy_{t}, \end{aligned}$$
(6)

for all \(t\in {\mathbb {N}}\). Let \(z_1 = y_t\) and \(z_{i+1} = (I- \alpha e_ie_i^{T}H)z_i = z_i - \alpha (e_i^{T}Hz_i)e_i \), so that \(y_{t+1} = Jy_t = z_{n+1}\). We see that the sequence \(z_{i} ^T H z_{i}\) is non-increasing:

$$\begin{aligned} z_{i+1}^{T} Hz_{i+1}&= [z_i^{T} - \alpha (e_i^{T}Hz_i)e_i^{T}] H [z_i - \alpha (e_i^{T}Hz_i)e_i] \nonumber \\&= z_i^{T}Hz_i -\alpha (z_i^{T}He_i)(e_i^{T}Hz_i) - \alpha (e_i^{T}Hz_i)e_i^{T}Hz_i + \alpha ^2 (e_i^{T}Hz_i)^2 e_i^{T}He_i \nonumber \\&= z_i^{T}Hz_i - \alpha (z_i^{T}He_i)^2(2 - \alpha e_i^{T}He_i) \nonumber \\&< z_i^{T}Hz_i - \alpha (z_i^{T}He_i)^2 , \end{aligned}$$
(7)

where the last inequality uses that \(\alpha < \frac{1}{L_{\max }}\).

Next, we use the following claim to show sufficient decrease by lower bounding \((z_i ^T H e_i )^2\).

Claim

Let \(y_t\) be in the range of H. There exists a \(j \in [n]\) so that \(\alpha |e_j^{T}Hz_j| \ge \delta \left\| z_j \right\| _2\) for some global constant \(\delta >0\) that depends on H and n.

The proof of the claim is provided in “Appendix A”.

Decompose \(y_t= y_{{\mathcal {N}}} + y_{{\mathcal {R}}}\) into the orthogonal components defined by the nullspace \({\mathcal {N}}(H)\) and range space \({\mathcal {R}}(H)\). Notice that J acts as the identity on \({\mathcal {N}}(H)\), so

$$\begin{aligned} y_{t+1}&=J y_t = y_{{\mathcal {N}}} +Jy_{{\mathcal {R}}}. \end{aligned}$$

Define an auxiliary sequence by \({\overline{y}}_t = y_{{\mathcal {R}}}\) and \({\overline{y}}_{t+1} = J {\overline{y}}_t\). Similarly, let \({\overline{z}}_1 = y_{{\mathcal {R}}}\) and \({\overline{z}}_{i+1} = (I- \alpha e_i e_i ^T H ) {\overline{z}}_i\), so that \({\overline{z}}_{n+1} = \overline{y}_{t+1}\). It follows that

$$\begin{aligned} y_{t+1} ^T H y_{t+1}&= (y_{{\mathcal {N}}} + Jy_{{\mathcal {R}}})^T H (y_{{\mathcal {N}}} + Jy_{{\mathcal {R}}})\\&= (J y_{{\mathcal {R}}})^T H ( J y_{{\mathcal {R}}})\\&= {\overline{z}}_{n+1} ^T H {\overline{z}}_{n+1}\\&\le {\overline{z}}_{j+1}^T H {\overline{z}}_{j+1} \quad {\text {(non-increasing property in Eq. } (7))} \\&< {\overline{z}}_j ^T H {\overline{z}}_j - \alpha ( {\overline{z}}_j ^T H e_j )^2 \quad {\text {(using Eq. } (7))} \\&\le {\overline{z}}_j ^T H {\overline{z}}_j - \frac{\delta ^2}{\alpha } \left\| \overline{z}_j \right\| _{2}^{2} \quad {\text {(using the Claim)}}\\&\le {\overline{z}}_j^T H {\overline{z}}_j + \frac{\delta ^2}{\alpha L} {\overline{z}}_{j}^{T} H {\overline{z}}_j \quad {\left( \text {since } {\overline{z}}_j ^T H \overline{z}_j \ge -L \left\| {\overline{z}}_j \right\| ^2 \right) }\\&= \left( 1+\frac{\delta ^2}{\alpha L}\right) {\overline{z}}_j ^T H {\overline{z}}_j\\&\le \left( 1+\frac{\delta ^2}{\alpha L}\right) {\overline{z}}_1 ^T H {\overline{z}}_1 \quad {\text {(non-increasing property)}}\\&= \left( 1+\frac{\delta ^2}{\alpha L}\right) {\overline{y}}_t ^T H {\overline{y}}_t\\&= \left( 1+\frac{\delta ^2}{\alpha L}\right) y_t ^T H y_t, \end{aligned}$$

where we used that Eq. (7) applies also to the sequence \({\overline{z}}_j\). Let \(\epsilon =\frac{\delta ^2}{\alpha L}\). By induction, and noting that \(y_0 ^T H y_0 = -\lambda \) (where \(-\lambda = \lambda _{\min }(H) <0\) and \(y_0\) is taken to be a unit vector),

$$\begin{aligned} y_t ^T H y_t&\le (1+\epsilon )^t y_0 ^T H y_0\\&=-\lambda (1+\epsilon )^t. \end{aligned}$$

Using \(-\lambda \left\| y_t \right\| _2 ^2 \le y_t ^T H y_t\),

$$\begin{aligned} - \lambda \left\| y_t \right\| _2 ^2&\le -\lambda (1+\epsilon )^t\\ \left\| y_t \right\| _2^2&\ge (1+\epsilon )^t\\ \left\| J^t y_0 \right\| _2&\ge (1+ \epsilon )^{t/2}\\&\ge \left( 1+ \frac{\epsilon }{4}\right) ^{t}, \end{aligned}$$

where the last inequality uses that \(\epsilon \le \frac{1}{2}\). By Gelfand’s theorem, we have established

$$\begin{aligned} \rho (J) \ge 1+\frac{\epsilon }{4}, \end{aligned}$$

and thus J has an eigenvalue of magnitude greater than one. Thus \(x^* \in {\mathcal {A}}^* _g\). \(\square \)
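The conclusion \(\rho (J) > 1\) can be checked numerically; the following sketch (an illustrative randomly generated instance, with \(\alpha < 1/L_{\max }\)) builds J at a strict saddle and inspects its spectral radius:

```python
import numpy as np

rng = np.random.default_rng(2)
Q, _ = np.linalg.qr(rng.standard_normal((4, 4)))
H = Q @ np.diag([2.0, 1.0, 0.5, -0.8]) @ Q.T      # strict saddle: one negative eigenvalue
alpha = 0.9 / np.max(np.abs(np.diag(H)))           # alpha < 1 / L_max

J = np.eye(4)
for i in range(4):                                 # J = Dg_n ... Dg_1, g_1 applied first
    E = np.zeros((4, 4)); E[i, i] = 1.0
    J = (np.eye(4) - alpha * E @ H) @ J
print(np.max(np.abs(np.linalg.eigvals(J))))        # spectral radius exceeds one
```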

By combining Propositions 4 and 5 with Corollary 1, we have the following:

Corollary 4

(Coordinate Descent) Let g be the coordinate descent algorithm as defined in Eq. (4). Under Assumption 2 and \(\alpha < \frac{1}{L_{\max }}\), the stable set of the strict saddle points has measure zero, meaning \(\mu (W_g)=0\).

Remark 1

In the worst case, \(L_{\max } =L\), but in many instances \(L_{\max } \ll L\), so coordinate descent can use more aggressive step-sizes. The step-size choice \(\alpha < \frac{1}{L_{\max }} \) is standard for coordinate descent methods [46].

5.4 Block coordinate descent

The results of this section are a generalization of the previous section, but we present the coordinate descent case separately, since the proofs are simpler.

We partition the set \([n]=\{1, 2, \ldots , n\}\) into b blocks \(\{S_1, S_2, \ldots , S_b\}\) such that \([n]= \cup _{i} S_i\). For ease of notation, we define \(S_0=\emptyset \).

[Algorithm 2: Block coordinate gradient descent]

We define \(g_i(x)\) to be the block coordinate descent update of block i in Algorithm 2. Block coordinate gradient descent is a dynamical system

$$\begin{aligned} x_{k+1} = g(x_k) = g_b \circ g_{b-1} \circ \cdots \circ g_1 (x_k), \end{aligned}$$
(9)

where \(g_i (x) = x - \alpha \sum _{j\in S_i} e_j e_j^T \nabla f(x)\). We define the matrix \(P_S= \sum _{i \in S} e_{i}e_{i}^{T}\), i.e., the projector onto the entries in S, so that \(g_i(x) = x - \alpha P_{S_i} \nabla f(x)\).
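As a quick sketch (with an illustrative quadratic and block partition of our choosing), one sweep of Algorithm 2 might look like the following:

```python
import numpy as np

def block_cd_sweep(grad_f, x, alpha, blocks):
    """One pass g = g_b ∘ ... ∘ g_1; g_i updates only the block S_i."""
    x = x.copy()
    for S in blocks:                       # S_1, ..., S_b, in order
        x[S] -= alpha * grad_f(x)[S]       # x - alpha * P_S grad_f(x)
    return x

# Usage on a 4-dimensional quadratic with blocks S_1 = {0, 1}, S_2 = {2, 3}.
H = np.diag([2.0, 1.0, 0.5, -0.5])         # L_b = 2, so we need alpha < 1/2
x = np.ones(4)
for _ in range(50):
    x = block_cd_sweep(lambda z: H @ z, x, alpha=0.4, blocks=[[0, 1], [2, 3]])
print(x)                                   # first three coordinates decay; x_4 escapes
```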

Lemma 3

The differential is

$$\begin{aligned} \mathrm {D}g(x_k) = \prod _{i=1}^b (I- \alpha P_{S_{b-i+1}} \nabla ^2 f(y_k^{S_{b-i}})). \end{aligned}$$
(10)

Proof

This is an application of the chain rule. The differential of the composition of two functions \(f \circ h\) is just \(\mathrm {D}f(h(x)) \cdot \mathrm {D}h(x)\). By repeatedly applying this and observing that \(\mathrm {D}g_i (x) = I- \alpha P_{S_i}\nabla ^2 f(x)\), we obtain the result.\(\square \)

Assumption 3

Let \(f \in C^2\), and let \(\nabla ^2 f(x)_{S}\) be the submatrix of \(\nabla ^2 f(x)\) obtained by extracting the rows and columns indexed by S. Assume \(\left\| \nabla ^2 f(x)_{S_i} \right\| _2 \le L_b\) for all \(i \in [b]\) and all x.

Proposition 6

Under Assumption 3 and \(\alpha <\frac{1}{L_b}\), \(\det ( \mathrm {D}g(x) ) \ne 0\).

Proof

It suffices to prove that every factor of the product (10) is an invertible matrix. Every matrix of the form \(I- \alpha P_{S_i} \nabla ^2 f(x)\) has \(n-|S_i|\) eigenvalues equal to one, and the rest of its eigenvalues correspond to eigenvalues of \(I_{S_i} - \alpha \nabla ^2 f(x) _{S_i, S_i}\). Since \(\alpha < \frac{1}{L_b}\), the eigenvalues of \(I_{S_i} - \alpha \nabla ^2 f(x) _{S_i, S_i}\) are all greater than zero. Thus each \(I- \alpha P_{S_i} \nabla ^2 f(x)\) is invertible, and \(\mathrm {D}g\) is also invertible. \(\square \)

Proposition 7

(Instability at saddle points) Let \(x^*\) be a strict saddle point of f. The Jacobian of the update rule of block coordinate descent computed at the point \(x^*\) has an eigenvalue of modulus greater than one.

The proof is provided in “Appendix B”. By combining Propositions 6 and 7 with Corollary 1, we have the following:

Corollary 5

(Block Coordinate Descent) Let g be the block coordinate descent algorithm as defined in Eq. (9). Under Assumption 3 and \(\alpha < \frac{1}{L_{b}}\), the stable set of the strict saddle points has measure zero, meaning \(\mu (W_g)=0\).

Remark 2

The step-size choice \(\alpha < \frac{1}{L_{b}} \) is standard for block coordinate descent methods [46].

5.5 Manifold gradient descent

Let \({\mathcal {M}}\) be a submanifold of \({\mathbf {R}}^n\), and \({\mathcal {T}}(x)\) be the tangent space of \({\mathcal {M}}\) at x. \(P_{{\mathcal {M}}}\) and \(P_{{\mathcal {T}}(x)}\) are the orthogonal projectors onto \({\mathcal {M}}\) and \({\mathcal {T}}(x)\), respectively, and \(I_{{\mathcal {T}}}\) is the identity operator on \({\mathcal {T}}\). Let \({\overline{f}}\) be a smooth extension of f to \({\mathbf {R}}^n\), and \(f= {\overline{f}}|_{{\mathcal {M}}}\). The manifold gradient descent algorithm is:

$$\begin{aligned} x_{k+1}= g(x_k) \triangleq P_{{\mathcal {M}}} ( x_k - \alpha P_{{\mathcal {T}}(x_k)} \nabla {\overline{f}}(x_k)). \end{aligned}$$
(11)

Recall that the Riemannian gradient \(\nabla _{R} f(x) = P_{{\mathcal {T}}(x)} \nabla {\overline{f}}(x)\), so the above iteration is precisely manifold gradient descent with \(P_{{\mathcal {M}}}\) as retraction.
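As a concrete instance (of our choosing, not from the paper), take \({\mathcal {M}}\) to be the unit sphere, where \(P_{{\mathcal {M}}}(y) = y/\left\| y \right\| \) and \(P_{{\mathcal {T}}(x)} = I - xx^T\). Minimizing \({\overline{f}}(x) = -\frac{1}{2} x^T A x\) over the sphere recovers a top eigenvector of A, while the intermediate eigenvectors are strict saddles that a random initialization generically avoids:

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((5, 5)); A = (A + A.T) / 2
grad_f_bar = lambda x: -A @ x               # gradient of the extension f_bar

def manifold_gd_step(x, alpha):
    riem_grad = grad_f_bar(x) - (x @ grad_f_bar(x)) * x   # P_T(x) = I - x x^T
    y = x - alpha * riem_grad
    return y / np.linalg.norm(y)            # retraction P_M(y) = y / ||y||

x = rng.standard_normal(5); x /= np.linalg.norm(x)
for _ in range(2000):
    x = manifold_gd_step(x, alpha=0.05)
top = np.linalg.eigh(A)[1][:, -1]           # top eigenvector of A
print(abs(x @ top))                         # close to 1 for a generic initialization
```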

Proposition 8

At a strict saddle point \(x^*\), \(\mathrm {D}g (x^*)\) has an eigenvalue of magnitude larger than 1.

Proof

Since \(x^*\) is a strict saddle, the Riemannian Hessian (see Sect. 5.5 of [2] for the definition) \(\nabla ^2 _{R} f (x^*)\) has a negative eigenvalue \(\lambda _v\) with corresponding eigenvector v, and \(P_{{\mathcal {T}}(x^*)} \nabla {\overline{f}}(x^*) =0\).

Using Lewis and Malick [33, Lemma 4], which gives \(\mathrm {D}P_{{\mathcal {M}}} (x) = P_{{\mathcal {T}}(x)}\) for \(x \in {\mathcal {M}}\), we compute

$$\begin{aligned} \mathrm {D}g(x^*) v =P_{{\mathcal {T}}(x^*)} v- \alpha P_{{\mathcal {T}}(x^*)} \mathrm {D}( P_{{\mathcal {T}}} \nabla {\overline{f}})(x^*) v. \end{aligned}$$

Using Absil et al. [3, Eq. 4], \(P_{{\mathcal {T}}(x)}\mathrm {D}(P_{{\mathcal {T}}} \nabla {\overline{f}})(x) v = \nabla _{R} ^2 f(x) v\), so

$$\begin{aligned} \mathrm {D}g(x^*) v= v- \alpha \lambda _v v . \end{aligned}$$

Thus v is an eigenvector of \(\mathrm {D}g(x^*)\) with eigenvalue \(1-\alpha \lambda _v >1\). \(\square \)

Proposition 9

For a compact submanifold \({\mathcal {M}}\), there is a strictly positive constant \(C_{{\mathcal {M}},f}\) such that \(\det ( \mathrm {D}g) \ne 0\) for every step-size \(\alpha < C_{{\mathcal {M}},f}\).

Proof

Since \({\mathcal {M}}\) is a compact smooth manifold, \(P_{{\mathcal {M}}}\) is unique and smooth in a neighborhood of radius r of the manifold [4]. Letting \(\alpha < \frac{r}{\max _{x \in {\mathcal {M}}} \left\| \nabla f(x) \right\| }\), \(P_{{\mathcal {M}}} (x- \alpha P_{{\mathcal {T}}(x)} \nabla f(x))\) and its derivatives exist.

We wish to show that \(\mathrm {D}g(x) = \mathrm {D}P_{{\mathcal {M}}} ( x- \alpha P_{{\mathcal {T}}(x)} \nabla f(x)) ( I- \alpha \mathrm {D}(P_{{\mathcal {T}}} \nabla f) (x) )\) is invertible. Define

$$\begin{aligned} h_x(\alpha ) = \det \left( \mathrm {D}P_{{\mathcal {M}}} ( x- \alpha P_{{\mathcal {T}}(x)} \nabla f(x)) ( I- \alpha \mathrm {D}(P_{{\mathcal {T}}} \nabla f) (x) )\right) . \end{aligned}$$

Using \(\mathrm {D}P_{{\mathcal {M}}} (x) = P_{{\mathcal {T}}(x)}\) [4], \(h_x(0) =1 \). Since

$$\begin{aligned} B:=\max _{x \in {\mathcal {M}},\, \alpha<\frac{r}{\max _{x \in {\mathcal {M}}} \left\| \nabla f(x) \right\| }} \left| \frac{\partial h_x }{\partial \alpha } (\alpha ) \right| <\infty , \end{aligned}$$

we see that for

$$\begin{aligned}\alpha < C_{{\mathcal {M}},f}\triangleq \min \left( \frac{r}{\max _{x \in {\mathcal {M}}} \left\| \nabla f(x) \right\| }, \frac{1}{B} \right) \end{aligned}$$

the determinant \(h_x (\alpha )\) is positive, so \(\mathrm {D}g\) is invertible.\(\square \)

Corollary 6

Let g be the operator corresponding to the manifold gradient descent algorithm of Eq. (11), and \({\mathcal {M}}\) be a compact sub-manifold of \({\mathbf {R}}^n\). Then there is a \(C_{{\mathcal {M}},f} >0\), that only depends on the properties of \({\mathcal {M}}\) and f, such that for any step-size \(\alpha <C_{{\mathcal {M}}, f} \) , the stable set of the strict saddle points has measure zero, meaning \(\mu (W_g) =0\).

Proof

The proof is a straightforward application of the previous two Propositions and Corollary 1. Proposition 8 shows that \({\mathcal {X}}^* \subset {\mathcal {A}}^* _g\), and Proposition 9 shows that \(\det ( \mathrm {D}g (x)) \ne 0\). By applying Corollary 1, we conclude that \(\mu ( \{x_0: \lim _k g^k(x_0) \in {\mathcal {X}}^*\}) =0\). The parameter \(C_{{\mathcal {M}},f}\) is specified in the proof of Proposition 9.\(\square \)

5.6 Mirror descent

In this section, we consider the mirror descent algorithm. Let \({\mathcal {D}}\) be a convex open subset of \({\mathbf {R}}^n\), and \({\mathcal {X}}= {\mathcal {D}}\cap {\mathcal {M}}\) for some affine space \({\mathcal {M}}\). Given a mirror map \(\varPhi \), we define the mirror descent algorithm in Algorithm 3.

Assumption 4

(Mirror map) We say that \(\varPhi \) is a mirror map if it satisfies the following properties:

  1.

    \(\varPhi \) is a proper closed convex function with \({{\overline{{\mathcal {D}}}}}= \text {dom} (\varPhi )\).

  2.

    \(\varPhi \) is \(C^2\) on \({\mathcal {D}}= \text {int}({{\overline{{\mathcal {D}}}}})\).

  3.

    The gradient of \(\varPhi \) is surjective onto \({\mathbf {R}}^n\), that is \(\nabla \varPhi ( {\mathcal {D}})={\mathbf {R}}^n\).

  4.

    The subgradient \(\partial \varPhi (x) = \{ \nabla \varPhi (x) \}\) for \( x \in {\mathcal {D}}\) and \(\partial \varPhi (x) = \emptyset \) for \( x \notin {\mathcal {D}}\).

[Algorithm 3: Mirror descent]

Due to the affine constraint, \({\mathcal {X}}\) may not be full-dimensional, so we define the appropriate notions of gradient and Hessian. Let \({\mathcal {T}}\) be the tangent space of \({\mathcal {M}}\). The Riemannian gradient is \(\nabla _{R} \varPhi (x) = P_{{\mathcal {T}}} \nabla \varPhi (x) \). Similarly the Riemannian Hessian is \(\nabla ^2_{R} \varPhi (x) = P_{{\mathcal {T}}} \nabla ^2 \varPhi (x) P_{{\mathcal {T}}}\) and is a linear mapping from \({\mathcal {T}}\rightarrow {\mathcal {T}}\). Finally, the mirror descent mapping is defined as

$$\begin{aligned} g(x) =h \circ F (x) , \end{aligned}$$
(12)

with \(F(x) = \nabla \varPhi (x) - \alpha \nabla f(x)\) and \(h(x) =\arg \max _{z \in {\mathcal {X}}} z^T x - \varPhi (z)\).

Assumption 5

(Strong convexity of \(\varPhi \) and Lipschitz gradient) Let \(I_{{\mathcal {T}}}\) be the identity mapping on \({\mathcal {T}}\). We assume that

  1.

    \(\varPhi \) is \(\mu \)-strongly convex, meaning \(\nabla ^2 _{R} \varPhi (x) \succeq \mu I_{{\mathcal {T}}}\) for \(x \in {\mathcal {X}}\).

  2.

    f has L-Lipschitz gradient, meaning \(\nabla ^2 _{R} f(x) \preceq L I_{{\mathcal {T}}}\) for \(x \in {\mathcal {X}}\).

Example 1

(Probability simplex) Define the mirror map \(\varPhi (x) = \sum _i x_i \log x_i \), with \({\mathcal {D}}\) the positive orthant \({\mathbf {R}}^{n} _{>0}\), \({{\overline{{\mathcal {D}}}}}\) the non-negative orthant, and \({\mathcal {M}}= \{x: \sum _i x_i =1\}\) the affine space. The domain is \({{\overline{{\mathcal {X}}}}} = {{\overline{{\mathcal {D}}}}} \cap {\mathcal {M}}\), which is the probability simplex. The mirror descent algorithm corresponds to the update:

$$\begin{aligned} x_i \leftarrow \frac{x_i \exp \left( - \alpha \frac{\partial f}{\partial x_i} (x) \right) }{\sum _{j} x_j \exp \left( - \alpha \frac{\partial f}{\partial x_j} (x) \right) }. \end{aligned}$$

The strong convexity parameter satisfies \(\mu \ge 1\).
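A minimal sketch of this update in Python (the two-dimensional objective below is an illustrative choice):

```python
import numpy as np

def mirror_descent_step(grad_f, x, alpha):
    """Entropic mirror descent on the simplex (the update of Example 1)."""
    w = x * np.exp(-alpha * grad_f(x))
    return w / w.sum()

# Usage: minimize f(x) = x^T H x / 2 over the probability simplex.
H = np.array([[1.0, 0.0], [0.0, -1.0]])
x = np.array([0.6, 0.4])
for _ in range(300):
    x = mirror_descent_step(lambda z: H @ z, x, alpha=0.1)
print(np.round(x, 4))   # mass shifts toward the second coordinate, the minimizer (0, 1)
```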

We first express the mapping g as a composition of simple mappings.

Lemma 4

Assume that \(\varPhi \) is a \(\mu \)-strongly convex mirror map. The mirror descent algorithm can be equivalently expressed as \(g(x) = (\nabla _{R} \varPhi )^{-1} \circ P_{{\mathcal {T}}} \circ F (x)\), and \((\nabla _{R} \varPhi )^{-1}:{\mathcal {T}}\rightarrow {\mathcal {X}}\) is a local diffeomorphism.

Proof

Recall that \(h(w) =\arg \max _{z \in {\mathcal {X}}} z^Tw - \varPhi (z)\). By a standard result [10, Theorem 9.8], the maximum is uniquely attained and \(h(w) \in {\mathcal {D}}\).

By the first-order optimality conditions, \( \nabla _{R} \varPhi (h(w)) = P_{{\mathcal {T}}} w \), and thus

$$\begin{aligned} h(w) = (\nabla _{R} \varPhi )^{-1} \circ P_{{\mathcal {T}}} (w). \end{aligned}$$

As a shorthand, let \(\varPsi =(\nabla _{R} \varPhi )^{-1}\). By existence and uniqueness of the maximizer, \(\varPsi \) is a single-valued function from \({\mathcal {T}}\rightarrow {\mathcal {X}}\). Thus \(g= h \circ F = \varPsi \circ P_{{\mathcal {T}}} \circ F\).

Finally we verify that \(\varPsi = (\nabla _R \varPhi )^{-1}\) is a local diffeomorphism. Since \(\mathrm {D}\varPsi (\nabla _R \varPhi (x)) = (\nabla ^2_R \varPhi (x))^{-1}\) and \(\varPhi \) is strongly convex over \({\mathcal {T}}\), the inverse function theorem ensures that \(\varPsi \) is a local diffeomorphism. \(\square \)

Proposition 10

Under Assumptions 4 and 5, and \(\alpha < \frac{\mu }{L}\):

  1.

    \(\det (\mathrm {D}g(x)) \ne 0\).

  2.

    Every strict saddle point \(x^*\) of f is an unstable fixed point of mirror descent, meaning \(x^* \in {\mathcal {A}}^* _g\).

Proof

Again adopt the shorthand \(\varPsi = (\nabla _{R} \varPhi )^{-1} \). Using Lemma 4, \(g= \varPsi \circ \left( P_{{\mathcal {T}}} \circ F\right) \), and \(\varPsi \) is a local diffeomorphism. To show \(\det ( \mathrm {D}g) \ne 0 \), it suffices to show that \(P_{{\mathcal {T}}} \circ F\) is a local diffeomorphism, or equivalently verify that \( \mathrm {D}( P_{{\mathcal {T}}} \circ F) (x): {\mathcal {T}}\rightarrow {\mathcal {T}}\) is an invertible linear transformation for every \(x \in {\mathcal {X}}\).

First, \( \mathrm {D}( P_{{\mathcal {T}}} \circ F) (x) = \nabla _{R} ^2 \varPhi (x) - \alpha \nabla _{R} ^2 f(x)\). Then using that \(\nabla ^2_R \varPhi (x) \succeq \mu I_{{\mathcal {T}}}\), \(\nabla ^2 _R f(x) \preceq L I_{{\mathcal {T}}}\), and \(\alpha < \frac{\mu }{L}\),

$$\begin{aligned} \alpha \nabla _{R}^2 f(x) \preceq \alpha L I_{{\mathcal {T}}} \prec \mu I_{{\mathcal {T}}} \preceq \nabla ^2 _{R} \varPhi (x). \end{aligned}$$

Thus \(\nabla _{R} ^2 \varPhi (x) - \alpha \nabla _{R} ^2 f(x) \succ 0\) and is invertible. This completes our proof of the first part.

Let \(x^* \in {\mathcal {X}}\) be a strict saddle point. First we verify that it is a fixed point of g. Using that \(x^*\) is a critical point,

$$\begin{aligned} g(x^*)&=\varPsi ( P_{{\mathcal {T}}} \nabla \varPhi (x^*) -\alpha P_{{\mathcal {T}}}\nabla f(x^*)) \\&= \varPsi ( P_{{\mathcal {T}}}\nabla \varPhi (x^*) )\\&= \varPsi ( \nabla _{R} \varPhi (x^*))\\&=x^*. \end{aligned}$$

Next we verify that \(\mathrm {D}g(x^*)\) has an eigenvalue of magnitude greater than one. Using the chain rule and then inverse function theorem,

$$\begin{aligned} \mathrm {D}g(x^*)&= \mathrm {D}\varPsi (P_{{\mathcal {T}}} \circ F(x^*)) (\nabla ^2_{R} \varPhi (x^*) - \alpha \nabla ^2 _{R} f(x^*)) \quad {\text {(chain rule)}}\\&=\nabla ^2 _{R} \varPhi ( x^*) ^{-1} (\nabla ^2_{R} \varPhi (x^*) - \alpha \nabla ^2 _{R} f(x^*)) \\&\quad \quad {\text {(inverse function theorem and } \varPsi \circ P_{{\mathcal {T}}} \circ F(x^*) =x^* )}\\&= I_{{\mathcal {T}}} - \alpha \nabla ^2 _{R} \varPhi ( x^*) ^{-1} \nabla ^2 _{R} f(x^*). \end{aligned}$$

Define \(A= \nabla _{R} ^2 \varPhi (x^*)\) and \( H= \nabla ^2 _{R} f(x^*)\). By similarity transformation under \(A^{1/2}\),

$$\begin{aligned} A^{1/2} Dg(x^*)A^{-1/2} = I_{{\mathcal {T}}} - \alpha A^{-1/2} H A^{-1/2} , \end{aligned}$$

which is a symmetric linear operator. Define \({\overline{v}} = A^{1/2} v \), where v is an eigenvector of \(\nabla ^2 _{R} f(x^*)\) corresponding to a strictly negative eigenvalue \(\lambda \), then \({\overline{v}}^T A^{-1/2} H A^{-1/2} {\overline{v}} <0\), so \(\lambda _{\min } ( A^{-1/2} H A^{-1/2}) <0\). Thus \( 1-\alpha \lambda _{\min } ( A^{-1/2} H A^{-1/2})\) is an eigenvalue of \( I_{{\mathcal {T}}} - \alpha A^{-1/2} H A^{-1/2}\) that is greater than one. Since similarity transformations preserve eigenvalues, \(Dg(x^*)\) also has an eigenvalue greater than one, and so \(x ^* \in {\mathcal {A}}^* _g\). \(\square \)

By combining Proposition 10 with Corollary 1, we have the following:

Corollary 7

Let g be the mirror descent algorithm defined in Eq. (12). Under Assumptions 4 and 5, and \(\alpha < \frac{\mu }{L}\), the stable set of the strict saddles in \({\mathcal {X}}\) has measure zero, meaning \(\mu (W_g ) =0\).

Remark 3

This corollary does not guarantee that the stable set of saddles on \({{\overline{{\mathcal {X}}}}}\backslash {\mathcal {X}}\) is measure zero. For example in the Multiplicative Weights algorithm, there are fixed points on \(\overline{{\mathcal {X}}}\backslash {\mathcal {X}}\) (e.g. all the vectors x with support size 1).

5.7 Lojasiewicz inequality and existence of a single limit point

In the previous parts of this section, we have implicitly assumed that \(\lim _k x_k\) exists; otherwise the theorems are vacuously true. As we show below, with specific citations to recent papers, the existence of the limit is ensured when the function satisfies the Lojasiewicz inequality and the iterates are bounded. Both assumptions are very weak, as we explain below.

Assumption 6

(Lojasiewicz inequality) Let \(x^*\) be a critical point of f. f is said to satisfy the Lojasiewicz inequality if there are constants \(c>0\) and \(\mu \in [0,1)\) such that

$$\begin{aligned} \left\| \nabla f(x) \right\| \ge c | f(x) - f(x^*)|^\mu , \end{aligned}$$

for all x in a neighborhood of \(x^*\).

The Lojasiewicz inequality is very general, as discussed in [7, 8, 13], and applies to all real analytic functions. For instance, \(f(x) = x^2\) satisfies it at \(x^* =0\) with \(c=2\) and \(\mu = 1/2\), since \(\left\| \nabla f(x) \right\| = 2|x| = 2|f(x)-f(x^*)|^{1/2}\).

Assumption 7

(Bounded iterates) The iterates \(\{x_k\}\) remain in a bounded set.

Remark 4

The bounded iterates assumption is necessary for \(\lim x_k\) to exist. For example, gradient descent on \(f(x) =e^x\) minimizes f by generating a sequence with \(x_k \downarrow -\infty \), so \(\lim x_k\) does not exist. For algorithms that generate non-increasing objective values, bounded iterates can be ensured when the function has compact sub-level sets.

For gradient descent, Absil et al. [1] showed that \(\lim x_k\) exists when f satisfies the Lojasiewicz inequality. We recall the results below.

Proposition 11

(Theorems 3.2 and 4.1 of [1]) Let f satisfy Assumptions 1, 6, and 7. If the stepsize satisfies \(\alpha < \frac{1}{L}\), then \(\lim x_k \) exists.

Proof

Theorem 3.2 of [1] shows that \(\lim x_k \) exists whenever the function satisfies the Lojasiewicz inequality and the algorithm satisfies the sufficient descent condition. An immediate consequence of Theorem 4.1 of [1] is that gradient descent with stepsize \(\alpha < \frac{1}{L}\) satisfies the sufficient descent condition. \(\square \)

For block coordinate descent and manifold gradient descent, Bolte et al. [14] showed a similar result.

Proposition 12

(Block coordinate descent limit exists [14]) Let f satisfy Assumptions 3, 6, and 7. Assume that \(\alpha < \frac{1}{L_b}\), then the block coordinate descent iterates attain a limit, \(\lim x_k\) exists.

Proof

This is proved in Theorem 1 and Sect. 3.6 of [14]. \(\square \)

Proposition 13

(Mirror Descent limit exists [15]) Under Assumptions 4, 5, 6, and 7, with \(\alpha < \frac{\mu }{L}\), the mirror descent iterates attain a limit, \(\lim x_k\) exists.

Proof

This follows from a minor modification of Bolte et al. [15], which studies mirror descent on composite functions of the form \(\varPsi (x) = f(x) + g(x)\), where f is smooth and differentiable and g is proper and lower semicontinuous. For our setting, we choose g as the indicator function of \({\mathcal {M}}\), \(g(x) = {\mathbf {1}}_{{\mathcal {M}}} (x)\). Bolte et al. assume that \(\text {dom} (\varPhi ) = {\mathbf {R}}^n\), but this is only used in Eq. (4.7) of [15]. The same result can be obtained from the strong convexity of \(\varPhi \) over the domain \({\mathcal {X}}={\mathcal {D}}\cap {\mathcal {M}}\) and the fact that \(x_{k+1}-x_k \in {\mathcal {T}}\), which yield the analogue of Eq. (4.7):

$$\begin{aligned} f(x_k) -f(x_{k+1})&\ge \left( \frac{1}{\alpha }- L\right) ( \varPhi (x_{k+1}) -\varPhi (x_k)-\nabla \varPhi (x_k)^T (x_{k+1}-x_k))\\&= \left( \frac{1}{\alpha }- L\right) \int _0^1 (1-t)\, (x_{k+1} -x_k)^T \nabla ^2 \varPhi (x_k +t (x_{k+1}-x_k)) (x_{k+1}-x_k) \, dt \\&\ge \left( \frac{1}{\alpha }- L\right) \frac{\mu }{2} \left\| x_{k+1}-x_k \right\| ^2. \end{aligned}$$

The right-hand side is the same as Eq. (4.7) of [15], so the rest of the argument proceeds exactly as in Theorem 4.1 of Bolte et al. \(\square \)

Remark 5

For functions f that do not satisfy the Lojasiewicz inequality, it is possible to prove that \(\lim _k x_k\) exists under Assumption 7 together with the assumption that the critical points are isolated. The strategy is standard in the literature and can be found in [31, 35], so we omit the details in the sketch below.

Let \(\{x_k\}_{k \in {\mathbb {N}}}\) be the trajectory of any of the algorithms discussed and \(\varOmega \) be the set of its limit points. First, \(\varOmega \) is non-empty due to Assumption 7. Since the algorithms are sufficient descent methods, it can be shown that \(\varOmega \) consists of critical points and is a compact, connected set.

Finally, by the assumption that the critical points are isolated, it follows that \(\varOmega \) is a singleton critical point (since \(\varOmega \) is connected), i.e., \(\varOmega = \{x^*\}\) and hence \(\lim _k x_k = x^*\).

6 Conclusion

We have shown that first-order methods with random initialization and appropriate constant step-size do not converge to a strict saddle point. Our results apply to gradient descent, the proximal point algorithm, coordinate descent, block coordinate descent, manifold gradient descent and mirror descent. The key common insight in analyzing all these optimization methods is to treat these algorithms as dynamical systems. Every strict saddle point is shown to be locally unstable for these first-order methods, and applications of the center-stable manifold theorem suffice to characterize the local behavior. As long as the mapping induced by the optimization method is sufficiently well behaved, e.g. a local diffeomorphism, these local arguments can be extended to the whole domain. Proving the instability of saddle points, as well as the smoothness and invertibility of the corresponding maps, depends upon careful instantiations of these generic arguments (e.g. choice of step-size) on a case-by-case basis. The global instability of saddle points for first-order methods is often informally invoked without careful discussion of the technical conditions needed to formalize these arguments. We hope that this work will help ground these arguments on a unified formal foundation. We end this paper with a brief discussion of some open directions:

Step-size It is not clear if the step-size restrictions are necessary to avoid saddle points (e.g. \(\alpha <1/L\) for gradient descent; see Panageas and Piliouras [42], who provide examples where \(\alpha < 2/L\) is necessary for gradient descent). Most of the constructions where the gradient method converges to saddle points require fragile initial conditions, as discussed in Sect. 3. It remains a possibility that an adaptive choice of step-size, by Wolfe line search or backtracking, may still avoid saddle points provided the initial point is chosen at random.

Strict saddles It is also important to understand how stringent the strict saddle assumption is. Will a perturbation of a function always satisfy the strict saddle property? Adler and Taylor [5] provide very general sufficient conditions for a random function to be Morse, meaning the eigenvalues at critical points are non-zero, which implies the strict saddle condition. These conditions rely on checking that the density of \(\nabla ^2 f(x)\) has full support conditioned on the event that \(\nabla f(x)=0\). This can be explicitly verified for functions f that arise from learning problems. Similar arguments for applications that arise in game theory are developed in Kleinberg et al. [30].

However, we note that there are very difficult unconstrained optimization problems where the strict saddle condition fails. Perhaps the simplest is optimization of quartic polynomials. Indeed, checking if zero is a local minimizer of the quartic

$$\begin{aligned} f(x) = \sum _{i,j=1}^n q_{ij} x_i^2 x_j^2 \end{aligned}$$

is equivalent to checking whether the matrix \(Q=[q_{ij}]\) is copositive, a co-NP-complete problem. For this f, the Hessian at \(x=0\) is zero, so \(x=0\) is a second-order KKT point, but not necessarily a local minimizer. By the change of variables \(z_i = x_i^2\), we see that checking local minimality in a problem with a quadratic objective and non-negative inequality constraints is also co-NP-complete.

Speed of convergence Although gradient descent can take an exponential amount of time to escape from saddle points, at least for some carefully constructed non-convex functions [22], its stochastic counterparts perform much better [28]. It would be interesting to characterize these hard instances to the extent possible, and to understand whether they are indeed prevalent in applications of interest (e.g. deep learning). In the other direction, it would be rather useful to show that all first-order methods can be sped up by switching to carefully chosen stochastic variants.

Beyond saddle points Even if saddle points are provably avoided, there can be multiple local minima of widely different objective value. The performance of first-order methods would depend crucially on whether they converge for most initial conditions to nearly optimal global minima. Panageas and Piliouras analyze such a game-theoretic application and show that the size of the region of attraction of the good local optima indeed dominates that of the bad local optima, implying nearly optimal average-case performance. Such arguments depend crucially both on the setting as well as on the chosen optimization method, and it would be interesting to explore their applicability in other settings.