1 Introduction

For most of the survey, we focus on the unconstrained optimization problem:

$$\begin{aligned} \min _{w\in {\mathbb{R}}^d} f(w), \end{aligned}$$
(1)

where \(f:{\mathbb{R}}^d \rightarrow {\mathbb{R}}\) is a scalar-valued function. This formulation captures several critical ML problems, such as least squares regression, training SVMs, and training neural networks. It has been studied extensively in the literature, especially for convex functions5, defined below.

Definition 1

(Convex functions) \(f:{\mathbb{R}}^d \rightarrow {\mathbb{R}}\) is a convex function if for all \(x, y\in \text{Domain}(f)\) and \(0\le \lambda \le 1,\) we have: \(f(\lambda x+ (1-\lambda )y)\le \lambda f(x)+(1-\lambda )f(y)\). Equivalently, if f is differentiable,

$$\begin{aligned} f(x)+\langle \nabla f(x), y-x\rangle \le f(y),\quad \text{or, if } f \text{ is twice differentiable},\quad \nabla ^2 f(x)\succeq 0, \end{aligned}$$

where \(\nabla f(x)\) is the gradient of f computed at x and \(\nabla ^2 f(x)\) is the Hessian of f at x.

For convex functions, it is well known that (1) can be solved optimally with techniques as simple as gradient descent (GD). That is, gradient descent (Algorithm 1) requires only finitely many iterations to obtain an \(\varepsilon \)-approximate solution to (1), i.e.,

$$\begin{aligned} f(w_T)\le \min _{w\in {\mathbb{R}}^d} f(w)+\varepsilon , \end{aligned}$$

where \(w_T\) is the Tth iterate of the GD algorithm. Throughout the article, \(w_*\) refers to an optimal solution of (1).

Note that Algorithm 1 is defined only for differentiable functions f. While it can be extended to non-differentiable convex functions using sub-gradients, in this paper we will focus only on smooth differentiable functions.

Definition 2

(Lipschitz continuous gradient functions) \(f:{\mathbb{R}}^d \rightarrow {\mathbb{R}}\) is L-smooth or has an L-Lipschitz continuous gradient if the following holds for all \(x, y\in \text{Domain}(f)\): \(\Vert \nabla _w f(x)-\nabla _w f(y)\Vert \le L\Vert x-y\Vert .\) Equivalently,

$$\begin{aligned} \Vert \nabla ^2 f(x)\Vert _2\le L,\quad \text{or},\quad f(y)\le f(x)+\langle \nabla f(x), y-x\rangle + \frac{L}{2}\Vert y-x\Vert ^2. \end{aligned}$$

Throughout the article, we use \(\Vert x\Vert \) to denote the \(L_2\) norm of a vector x. Similarly, \(\Vert X\Vert \) denotes the operator norm of a matrix X.

For smooth convex optimization, it is well known that Algorithm 1 converges to an \(\varepsilon \)-optimal solution in \(T=O(L/\varepsilon )\) iterations, where L is the smoothness constant of f. Hence, the time complexity of the algorithm is \(O((d+T_f)\cdot L/\varepsilon )\), where \(T_f\) is the time complexity of computing the gradient of f at any point w.
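Since the pseudocode of Algorithm 1 appears here only as a figure, the following minimal Python sketch illustrates the update \(w_{t+1}=w_t-\eta \nabla _w f(w_t)\) with \(\eta =1/L\); the least-squares instance and the fixed iteration budget are illustrative assumptions, not part of the original listing.

```python
import numpy as np

def gradient_descent(grad, w0, L, T):
    """Sketch of Algorithm 1: w_{t+1} = w_t - (1/L) * grad f(w_t)."""
    w = w0.copy()
    for _ in range(T):
        w = w - (1.0 / L) * grad(w)
    return w

# Illustrative smooth convex instance: least squares, f(w) = 0.5 * ||A w - b||^2.
rng = np.random.default_rng(0)
A, b = rng.standard_normal((50, 10)), rng.standard_normal(50)
L = np.linalg.norm(A.T @ A, 2)        # smoothness constant of f
grad_f = lambda w: A.T @ (A @ w - b)  # gradient of f
w_T = gradient_descent(grad_f, np.zeros(10), L, T=500)
```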

Problem (1) is significantly more complex for non-convex f; in general, it is NP-hard. For convex f, \(\nabla _w f(w)=0\) implies global optimality. However, for non-convex f, such critical points need not even be local optima. In fact, even finding local optima is NP-hard in general. Hence, various methods in the literature instead aim to find first-order or second-order stationary points of f. We first define first-order stationary points (FOSP) below.

Definition 3

(First-order stationary point (FOSP)) w is a first-order stationary point (FOSP) of \(f:{\mathbb{R}}^d \rightarrow {\mathbb{R}}\) iff \(\nabla _w f(w)=0\). Further, w is an \(\varepsilon \)-FOSP of f iff \(\Vert \nabla _w f(w)\Vert \le \varepsilon \). For a randomized output w, the corresponding condition is \({\mathbb{E}}_w[\Vert \nabla _w f(w)\Vert ]\le \varepsilon \).

Note that an FOSP can be a saddle point or even a local maximum (see Fig. 1). Hence, several works in the literature have studied methods for finding second-order stationary points (SOSP), defined below.

Definition 4

(Second-order stationary point (SOSP)) w is a second-order stationary point (SOSP) of \(f:{\mathbb{R}}^d \rightarrow {\mathbb{R}}\) iff \(\nabla _w f(w)=0\) and \(\nabla ^2_w f(w)\succeq 0\). Further, w is an \((\varepsilon , \rho )\)-SOSP of f iff:

$$\begin{aligned} \Vert \nabla _w f(w)\Vert \le \varepsilon ,\quad \nabla _w^2 f(w)\succeq -\sqrt{\rho \varepsilon }I. \end{aligned}$$
Figure 1: Various functions of two variables and their corresponding FOSP and SOSP.

That is, an SOSP (see Fig. 1) avoids first-order saddle points. One can also define higher-order stationary points, such as third-order stationary points, but in general computing fourth- or higher-order stationary points is NP-hard4.
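As a concrete illustration of Definitions 3 and 4, the following sketch checks the \(\varepsilon \)-FOSP and \((\varepsilon ,\rho )\)-SOSP conditions for the toy saddle \(f(x,y)=x^2-y^2\); the function, tolerances, and helper names are assumptions made purely for illustration.

```python
import numpy as np

def is_eps_fosp(grad_f, w, eps):
    return np.linalg.norm(grad_f(w)) <= eps

def is_eps_rho_sosp(grad_f, hess_f, w, eps, rho):
    small_grad = np.linalg.norm(grad_f(w)) <= eps
    lam_min = np.linalg.eigvalsh(hess_f(w)).min()   # most negative Hessian eigenvalue
    return small_grad and lam_min >= -np.sqrt(rho * eps)

# f(x, y) = x^2 - y^2: the origin has zero gradient (an FOSP) but is a saddle.
grad_f = lambda w: np.array([2 * w[0], -2 * w[1]])
hess_f = lambda w: np.diag([2.0, -2.0])
w = np.zeros(2)
print(is_eps_fosp(grad_f, w, eps=1e-3))              # True
print(is_eps_rho_sosp(grad_f, hess_f, w, 1e-3, 1.0)) # False: lambda_min = -2
```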

Note that while an FOSP can be obtained using standard techniques such as gradient descent or stochastic gradient descent, finding an SOSP requires more carefully designed algorithms, such as cubic regularization or noisy stochastic gradient descent, and in general significantly more involved analysis.

In Sect. 2 we will discuss various methods for finding FOSP in different settings. Section 3 discusses a few techniques for finding SOSP. Finally, we conclude with Sect. 4.

2 Methods for Finding FOSP

The goal in finding an FOSP of a function f is to find a point w s.t. \(\nabla _w f(w)=0\). While several techniques exist to achieve this goal5, most of them use second- or higher-order derivatives of f and hence are not suitable for large-scale ML problems. In contrast, first-order techniques, based only on gradients of f, perform well in practice. In the rest of this section, we focus on such first-order techniques for finding FOSP.

2.1 Gradient Descent for FOSP

[Algorithm 1 (Gradient Descent): pseudocode figure]

In this section, we study gradient descent for finding FOSP. While standard analysis of GD exists for convex functions, it turns out that a similar analysis can be applied to arbitrary non-convex problems for finding FOSP.

Theorem 1

(GD finds FOSP) Let \(f:{\mathbb{R}}^d \rightarrow {\mathbb{R}}\) be an L-smooth non-convex function. Then the output of Algorithm 1 with \(\eta =\frac{1}{L}\) and \(T=O(\frac{L(f(w_0)-f(w_*))}{\varepsilon ^2})\) is an \(\varepsilon \)-FOSP (Definition 3) of Problem (1). \(w_0\) is the initial point provided to Algorithm 1, and \(f(w_*)=\min _w f(w)\) is the optimal value.

Proof

Using smoothness property (Definition 2), we have

$$\begin{aligned} f(w_{t+1})&\le f(w_t)+\langle \nabla _w f(w_t), w_{t+1}-w_t\rangle +\frac{L}{2}\Vert w_{t+1}-w_t\Vert ^2,\\&{\mathop {=}\limits ^{\zeta _1}}\, f(w_t)-\eta \left( 1-\frac{L\cdot \eta }{2}\right) \Vert \nabla _w f(w_t)\Vert ^2,\\&{\mathop {\le }\limits ^{\zeta _2}}\, f(w_t)-\frac{1}{2L}\Vert \nabla _w f(w_t)\Vert ^2, \end{aligned}$$

where \(\zeta _1\) follows from the definition of \(w_{t+1}\) and \(\zeta _2\) follows from \(\eta =\frac{1}{L}\). The theorem follows by summing the above inequality over \(t=0,\ldots ,T-1\).
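In more detail, summing the last inequality over \(t=0,\ldots ,T-1\) and telescoping gives

$$\begin{aligned} \frac{1}{2L}\sum _{t=0}^{T-1}\Vert \nabla _w f(w_t)\Vert ^2\le f(w_0)-f(w_T)\le f(w_0)-f(w_*), \end{aligned}$$

so that \(\min _{t<T}\Vert \nabla _w f(w_t)\Vert ^2\le \frac{2L(f(w_0)-f(w_*))}{T}\le \varepsilon ^2\) once \(T\ge \frac{2L(f(w_0)-f(w_*))}{\varepsilon ^2}\).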

Note that for non-convex functions GD requires \(O(1/\varepsilon ^2)\) iterations, while for convex functions it requires \(O(1/\varepsilon )\) iterations to achieve similar sub-optimality. One can accelerate convergence using momentum-based techniques21, 23; however, the gap between the rates for non-convex and convex functions remains an active area of research.

Algorithm 2 presents pseudo-code of an acceleration-based algorithm for general non-convex functions, and the theorem below bounds the number of iterations required to \(\varepsilon \)-approximately solve the FOSP problem.

Theorem 2

(Accelerated gradient descent for FOSP) Let \(f:{\mathbb{R}}^d \rightarrow {\mathbb{R}}\) be an L-smooth non-convex function. Then there exist parameters \(\alpha _t, \beta _t\) s.t. Algorithm 2 with \(T=O(\frac{\sqrt{L(f(w_0)-f(w_*))}}{\varepsilon })\) outputs an \(\varepsilon \)-FOSP (Definition 3) of Problem (1). \(w_0\) is the initial point provided to Algorithm 2, and \(f(w_*)=\min _w f(w)\) is the optimal value.

See13 for a proof of the above theorem.

[Algorithm 2 (Accelerated Gradient Descent): pseudocode figure]
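Since Algorithm 2 is likewise shown only as a figure, the sketch below conveys one common momentum-style update of the kind the theorem refers to; the fixed momentum coefficient \(\beta \) is an illustrative assumption and not the tuned \(\alpha _t, \beta _t\) schedule of Algorithm 2.

```python
def accelerated_gd(grad, w0, eta, beta, T):
    """Sketch of a momentum-style update; w0 is assumed to be a NumPy array.
    y is the extrapolated point; the exact schedule in Algorithm 2 differs."""
    w_prev, w = w0.copy(), w0.copy()
    for _ in range(T):
        y = w + beta * (w - w_prev)       # momentum extrapolation
        w_prev, w = w, y - eta * grad(y)  # gradient step at the extrapolated point
    return w
```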

2.2 Stochastic Gradient Methods for FOSP

[Algorithm 3 (Stochastic Gradient Descent): pseudocode figure]

Typical ML optimization problems require optimizing a function over many data points. For example, deep learning requires minimizing a loss function that is itself a sum of losses over individual data points. That is, the function f can be written as a finite sum \(f(w)=\frac{1}{n}\sum _i f_i(w)\).

Thus, computing the gradient itself requires O(nd) time, where n is the number of points and d is the number of parameters, and n is typically very large. For example, ImageNet9, a standard image classification benchmark, has about 14M data points, so computing the full gradient is prohibitively expensive.

However, one can exploit the fact that f is a finite-sum function to devise cheaply computable randomized gradients that match the actual gradient in expectation. That is, at the tth iteration, we can sample a function \(f_{i_t}\) and compute the gradient of only the sampled function. Note that \({\mathbb{E}}_{i_t\sim {\mathrm{Unif}}[1,n]}[\nabla _w f_{i_t}(w)]=\nabla _w f(w)\) for any w that is fixed with respect to \(i_t\).

Such a proxy for the gradient is used by the well-known stochastic gradient descent (SGD) algorithm (Algorithm 3). Interestingly, similar to GD, the analysis of SGD also follows easily from the smoothness definition.
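A minimal sketch of the SGD update described above (the listing of Algorithm 3 is only a figure); the uniform sampling and constant step size follow the text, while the helper names and the example component gradient are assumptions for illustration.

```python
import numpy as np

def sgd(grad_i, n, w0, eta, T, seed=0):
    """Sketch of Algorithm 3: at step t, sample i_t ~ Unif{0,...,n-1} and step
    along grad f_{i_t}(w_t), an unbiased proxy for the full gradient."""
    rng = np.random.default_rng(seed)
    w = w0.copy()
    for _ in range(T):
        i_t = rng.integers(n)
        w = w - eta * grad_i(i_t, w)
    return w

# Example component gradients for f_i(w) = 0.5 * (a_i^T w - b_i)^2:
# grad_i = lambda i, w: (A[i] @ w - b[i]) * A[i]
```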

Theorem 3

(SGD finds FOSP) Let \(f:{\mathbb{R}}^d \rightarrow {\mathbb{R}}\) be an L-smooth non-convex function. Then the \(T=O(\frac{L\sigma ^2 (f(w_0)-f(w_*))}{\varepsilon ^4})\)th iterate of Algorithm 3 with \(\eta =\frac{\sqrt{f(w_0)-f(w_*)}}{\sigma \cdot \sqrt{L}\cdot \sqrt{T}}\) satisfies:

$$\begin{aligned} \min _{k\in [T]}{\mathbb{E}}[\Vert \nabla _w f(w_k)\Vert ^2]\le \varepsilon ^2. \end{aligned}$$

\(w_0\) is the initial point provided to Algorithm 3, and \(f(w_*)=\min _w f(w)\) is the optimal value.

Proof

Using smoothness property (Definition 2), we have

$$\begin{aligned} {\mathbb{E}}[f(w_{t+1})]&\le {\mathbb{E}}[f(w_t)]+{\mathbb{E}}\left[ \langle \nabla _w f(w_t), w_{t+1}-w_t\rangle +\frac{L}{2}\Vert w_{t+1}-w_t\Vert ^2\right] ,\\&{\mathop {=}\limits ^{\zeta _1}}\, {\mathbb{E}}[f(w_t)]-\eta \left( 1-\frac{\eta \cdot L}{2}\right) {\mathbb{E}}[\Vert \nabla _w f(w_t)\Vert ^2]+\frac{L}{2}\eta ^2\sigma ^2, \end{aligned}$$

where \(\sigma ^2=\max _t \left( {\mathbb{E}}[\Vert \nabla _w f_{i_t}(w_t)\Vert ^2]-{\mathbb{E}}[\Vert \nabla _w f(w_t)\Vert ^2]\right) \) and \(\zeta _1\) follows from the definition of \(w_{t+1}\) and the fact that \(i_t\) is selected independently of \(w_t\).

Summing the above inequality over \(t=0,\ldots ,T-1\) and rearranging (using \(\eta \le 1/L\), which holds for the chosen \(\eta \) once T is large enough), we have

$$\begin{aligned} \min _{k\in [T]} {\mathbb{E}}[\Vert \nabla _w f(w_k)\Vert ^2]&\le \frac{2(f(w_0)-f(w_*))}{\eta T}+L\sigma ^2 \eta ,\\&\le \frac{3\sigma \sqrt{L\cdot (f(w_0)-f(w_*))}}{\sqrt{T}}\le \varepsilon ^2, \end{aligned}$$

where the second inequality follows by setting \(\eta =\frac{\sqrt{f(w_0)-f(w_*)}}{\sigma \cdot \sqrt{L}\cdot \sqrt{T}}\).

Note that while the number of iterations is larger for SGD (in terms of the sub-optimality \(\varepsilon \)), the time complexity per step is a factor of n smaller12. Hence such methods are widely used in deep learning, where \(\varepsilon \) is not required to be extremely small.

[Algorithm 4 (NC-SVRG): pseudocode figure]

However, the time complexity depends on the variance term, which can be large. As in the convex optimization literature, there are several variance reduction techniques to further improve the time complexity; below we present one such approach. Algorithm 4 presents pseudo-code for the NC-SVRG algorithm, and its iteration complexity is given by the following theorem.

Theorem 4

(SVRG finds FOSP) Let \(f:{\mathbb{R}}^d \rightarrow {\mathbb{R}}\) be an L-smooth non-convex function. Then the output of Algorithm 4 with \(T=O(\frac{L(f(w_0)-f(w_*))\cdot n^{-1/3}}{\varepsilon ^2}),\) \(m=O(n),\) and \(\eta =O(1/(Ln^{2/3}))\) satisfies

$$\begin{aligned} \min _{k\in [T]}{\mathbb{E}}[\Vert \nabla _w f(w_k)\Vert ^2]\le \varepsilon ^2. \end{aligned}$$

\(w_0\) is the initial point provided to Algorithm 4, and \(f(w_*)=\min _w f(w)\) is the optimal value.

Note that the above theorem shows that the total number of gradient calls made by the algorithm scales as \(O(L(f(w_0)-f(w_*))\cdot (n+\frac{n^{2/3}}{\varepsilon ^2}))\). See3, 25 for a proof of the above theorem.
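The sketch below illustrates the variance-reduction idea behind Algorithm 4: each epoch computes one full gradient at a snapshot and then takes cheap steps with the estimator \(\nabla f_i(w)-\nabla f_i(\tilde{w})+\nabla f(\tilde{w})\), which is unbiased and has low variance near the snapshot. The epoch structure and parameter names are illustrative assumptions rather than the exact constants of Theorem 4.

```python
import numpy as np

def nc_svrg(grad_i, full_grad, n, w0, eta, epochs, m, seed=0):
    """Sketch of SVRG for the finite-sum non-convex setting."""
    rng = np.random.default_rng(seed)
    w = w0.copy()
    for _ in range(epochs):
        w_tilde = w.copy()
        g_tilde = full_grad(w_tilde)          # one full gradient per epoch
        for _ in range(m):
            i = rng.integers(n)
            # variance-reduced estimator: unbiased, small variance near w_tilde
            v = grad_i(i, w) - grad_i(i, w_tilde) + g_tilde
            w = w - eta * v
    return w
```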

2.3 Summary

Finding FOSP is an important problem in the non-convex optimization literature, and several algorithms have been proposed recently; in particular, a lot of progress has been made for the finite-sum case. Table 1 summarizes the current state-of-the-art for finding FOSP of smooth but non-convex functions. Also see the companion article22 for a discussion of finding FOSP/SOSP in the finite-sum setting. Recently, several other problem settings have been considered as well, for example, functions of the form \(f(w)=\frac{1}{n}\sum _i f_i(w)+g(w)\), where g is a non-smooth but convex function14, 24. Despite this progress, several important questions remain open. For example, can we obtain tighter rates of convergence to FOSP for the methods discussed above? Also, SVRG for convex functions has a \(\sqrt{n}\) term while for non-convex functions it is \(n^{2/3}\); whether this gap in time complexity can be bridged is another interesting open question.

Table 1: FOSP: no. of first-order oracle calls (gradient computations) required by different methods for converging to \(\varepsilon \)-FOSP (Definition 3) of f when f is a non-convex and a convex function, respectively.

3 Methods for Finding SOSP

As discussed in the previous section, standard gradient-descent-type techniques can find FOSP, which is not entirely surprising as the goal is only to guarantee first-order stationarity (\(\nabla _w f(w)=0\)). However, an SOSP requires the second-order condition \(\nabla _w^2 f(w)\succeq 0\), and hence standard gradient descent techniques fail. For example, if gradient descent is initialized at a first-order saddle point, then it will not even move, as the gradient is 0.

Hence, it was widely believed that we require a “second-order” method for finding SOSP, i.e., a method that has access to the Hessian of the function f along with the gradient. Nesterov and Polyak20 introduced one such method that updates w in each iteration using

$$\begin{aligned} w_{t+1}=\mathop {\arg \min }_w f(w_t)+\langle \nabla _w f(w_t), w-w_t \rangle + \frac{1}{2} (w-w_t)^{\mathrm{T}} \nabla _w^2 f(w_t) (w-w_t) + \frac{\rho }{6} \Vert w-w_t\Vert ^3. \end{aligned}$$
(2)
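For concreteness, here is a minimal sketch of one step of update (2); the generic inner solver for the cubic subproblem (SciPy's BFGS routine) is purely an illustrative assumption and not the solver analyzed by Nesterov and Polyak20.

```python
import numpy as np
from scipy.optimize import minimize

def cubic_reg_step(grad_f, hess_f, w_t, rho):
    """One cubic-regularized step: minimize the model in (2) over s = w - w_t."""
    g, H = grad_f(w_t), hess_f(w_t)
    model = lambda s: g @ s + 0.5 * s @ (H @ s) + (rho / 6.0) * np.linalg.norm(s) ** 3
    s_star = minimize(model, np.zeros_like(w_t), method="BFGS").x  # illustrative inner solver
    return w_t + s_star
```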

In (2), \(\rho \) denotes (an upper bound on) the Lipschitz constant of the Hessian of f (Definition 5 below). Nesterov and Polyak20 showed that the above method converges to an SOSP in \(O(\frac{1}{\varepsilon ^{1.5}})\) iterations, but each iteration requires computing the Hessian, which can be significantly expensive. Several recent results relax this Hessian computation requirement given access only to products of the Hessian with arbitrary vectors, albeit with worse convergence rates1, 27. Unfortunately, even computing Hessian-vector products is expensive for standard ML problems such as deep learning training, which leads to the following question:

  • Can we find an SOSP using first-order methods, i.e., by using just the gradients of f?

Interestingly, the answer is yes, under the additional assumption that the Hessian of f is Lipschitz continuous.

Definition 5

(Lipschitz continuous Hessian) \(f:{\mathbb{R}}^d \rightarrow {\mathbb{R}}\) has a \(\rho \)-Lipschitz continuous Hessian if the following holds for all \(x, y\in \text{Domain}(f)\): \(\Vert \nabla _w^2 f(x)-\nabla _w^2 f(y)\Vert \le \rho \Vert x-y\Vert .\)

Below, we describe one such technique for finding SOSP, then extend it to the finite-sum setting, and finally summarize some more advanced results.

3.1 Noisy Gradient Descent for SOSP

[Algorithm 5 (Noisy Gradient Descent): pseudocode figure]

As standard gradient descent can get stuck at an FOSP due to the zero gradient, a natural approach is to perturb the iterate so that it can escape the first-order saddle point. Noisy gradient descent techniques exploit precisely this intuition; see Algorithm 5 for pseudo-code. Note that the perturbation should be large enough to ensure escaping a saddle point, but not so large that it escapes local or global minima, else the method might not converge at all. The following theorem shows that such a balance can be struck and that the algorithm converges to an SOSP in \(\text{poly}(d, 1/\varepsilon )\) iterations.
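Since Algorithm 5 appears only as a figure, the sketch below conveys the idea described above: take gradient steps and inject an isotropic perturbation whenever the gradient is small; the trigger threshold and noise scale are illustrative assumptions (the analysis below takes \({\mathbb{E}}[\zeta _t\zeta _t^{\mathrm{T}}]=\eta ^2 I\)).

```python
import numpy as np

def noisy_gd(grad_f, w0, eta, T, eps, seed=0):
    """Sketch of perturbed gradient descent for escaping first-order saddle points."""
    rng = np.random.default_rng(seed)
    w = w0.copy()
    for _ in range(T):
        if np.linalg.norm(grad_f(w)) <= eps:
            w = w + eta * rng.standard_normal(w.shape)  # perturbation zeta_t
        w = w - eta * grad_f(w)                         # standard GD step
    return w
```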

Theorem 5

(Convergence to SOSP) Let \(f:{\mathbb{R}}^d \rightarrow {\mathbb{R}}\) be an L-smooth, \(\rho \)-Hessian-Lipschitz continuous function. Then the output of Algorithm 5 with \(T=O(\text{poly}(d/\varepsilon ))\), \(R\le 5\log d / (\eta \cdot \sqrt{\rho \varepsilon })\), and \(\eta \le \text{poly}(d, 1/\varepsilon , \rho , L, f(w_0)-f(w_*))\) is an \(\varepsilon \)-SOSP, i.e.,

$$\begin{aligned} {\mathbb{E}}[\Vert \nabla _w f(w_T)\Vert ^2]\le \varepsilon ^2,\quad {\mathbb{E}}[\nabla _w^2 f(w_T)]\succeq -\sqrt{\rho \varepsilon }I. \end{aligned}$$

Proof

The proof can be broken into three parts:

  1. \(\Vert \nabla _w f(w_t)\Vert > \varepsilon \): during this phase, Algorithm 5 performs standard gradient descent. Hence, the standard GD analysis guarantees convergence to an FOSP (\(\Vert \nabla _w f(w_t)\Vert \le \varepsilon \)) in \(O(\frac{1}{\varepsilon ^2})\) iterations.

  2. \(\Vert \nabla _w f(w_t)\Vert \le \varepsilon \) but \(w_t\) is not an SOSP: for such \(w_t\), we need to show that after r iterations of gradient descent, \(f(w_{t+r})\) is significantly smaller than \(f(w_t)\), thus escaping the first-order saddle point.

  3. \(\Vert \nabla _w f(w_t)\Vert \le \varepsilon \) and \(w_t\) is an SOSP: for such \(w_t\), we need to show that after r iterations of gradient descent, \(f(w_{t+r})\) is not much larger than \(f(w_t)\); thus the algorithm does not escape the second-order saddle point.

Combining all the above three parts would give us the desired result. As mentioned above, the first part follows from Theorem 1 directly.

Escaping first-order stationary point For the second part, let us recall the update \(w_{t+r}=w_{t+r-1}-\eta \nabla _w f(w_{t+r-1})\). Now, as f has a Lipschitz continuous Hessian, we can essentially replace f by a quadratic function without incurring significant error. That is, define

$$\begin{aligned} \hat{f}(w)=f(w_t)+\frac{1}{2}(w-w_t)^{\mathrm{T}} H (w-w_t), \end{aligned}$$

where \(H=\nabla _w^2 f(w_t)\). Also, define

$$\begin{aligned} \widehat{w}_{t+r+1}=\widehat{w}_{t+r}-\eta \nabla _w \hat{f}(\widehat{w}_{t+r}),\quad 1\le r\le R, \end{aligned}$$

where \(\widehat{w}_{t+1}=w_{t+1}=w_t+\zeta _t\).

It is easy to observe that

$$\begin{aligned} \widehat{w}_{t+r+1}-w_t=\left( I-\eta H\right) ^r\zeta _t. \end{aligned}$$

That is, \(\widehat{w}_{t+r+1}\) is the result of r power-method updates on \(M=I-\eta H\). Note that \(w_t\) is a saddle point, i.e., \(\lambda _{\mathrm{min}}(H)\le -\sqrt{\rho \varepsilon }\), where \(\lambda _{\mathrm{min}}(A)\) is the smallest (most negative) eigenvalue of A. Hence, \(\widehat{w}_{t+r+1}\) converges to the direction of the eigenvector of M corresponding to its largest eigenvalue, i.e., the eigenvector corresponding to the most negative eigenvalue of H. Hence, \(\hat{f}(\widehat{w}_{t+R})\) should be significantly smaller than \(f(w_t)\).
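A quick numerical illustration of this power-iteration effect, with a toy Hessian and step size chosen purely for illustration: after enough iterations, \((I-\eta H)^r\zeta \) aligns with the eigenvector of H corresponding to its most negative eigenvalue.

```python
import numpy as np

rng = np.random.default_rng(0)
H = np.diag([1.0, 0.5, -0.3])             # toy Hessian with one negative eigenvalue
eta = 0.5                                 # step size with eta <= 1/||H||
zeta = rng.standard_normal(3)             # the injected perturbation zeta_t
v = np.linalg.matrix_power(np.eye(3) - eta * H, 100) @ zeta
v /= np.linalg.norm(v)
print(np.round(v, 3))                     # ~ +/- e_3, the negative-curvature direction
```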

Formally, using \({\mathbb{E}}[\zeta _t]=0\) and \({\mathbb{E}}[\zeta _t \zeta _t^{\mathrm{T}}]=\eta ^2 I\):

$$\begin{aligned} {\mathbb{E}}[\hat{f}(\widehat{w}_{t+R})]=f(w_t)+\frac{\eta ^2}{2}\,\text{tr}((I-\eta H)^{2R}H)\le f(w_t)-\eta /2. \end{aligned}$$

The inequality follows using \(R=\frac{5\log (d)}{\eta \sqrt{\rho \varepsilon }}\) and \(\eta \le 1/L\le 1/\Vert H\Vert \). Finally, using Hessian Lipschitz continuity and straightforward arguments, we get \(f({w}_{t+R})\le \hat{f}(\widehat{w}_{t+R}) + \eta /4\). Hence, we get

$$\begin{aligned} f({w}_{t+R})\le f(w_t)-\eta /4. \end{aligned}$$
(3)

Entrapment near SOSP For the third part, we again use \(\hat{f}(w)=f(w_t)+\frac{1}{2}(w-w_t)^{\mathrm{T}} H (w-w_t).\) Using similar arguments, and the fact that \(R=\frac{5\log d}{\eta \sqrt{\rho \varepsilon }}\), we have \(f({w}_{t+R})\le f(w_t)+\eta ^2/4\), i.e., the function value does not increase significantly. Hence, once entrapped near an SOSP, the algorithm does not leave the neighborhood.

The theorem follows by combining the above three observations.

Note that the above proof requires \(O(\text{poly}(d))\) iterations, while for convex functions the convergence is independent of d and depends only on L. Jin et al.17 resolved this gap by showing that a variant of the above algorithm converges in a number of iterations that depends only poly-logarithmically on d. The main weakness of the above analysis is that it only analyzes a quadratic approximation of the problem and then uses Hessian continuity arguments to show convergence to an SOSP. Jin et al.17 directly analyze how the iterates escape saddle points, leading to a tighter result, albeit with more involved arguments.

Note that while cubic regularization converges in \(O(1/\varepsilon ^{1.5})\) iterations, noisy gradient descent requires \(O(1/\varepsilon ^2)\) iterations. A natural question is whether one can use a noisy version of accelerated gradient descent19 to similarly escape saddle points but with fewer iterations. Jin et al.18 show that this is possible, but a gap with respect to cubic regularization remains: they show that a noisy accelerated gradient descent method converges to an \(\varepsilon \)-approximate SOSP in \(O(1/\varepsilon ^{1.75})\) iterations. Devising first-order methods that achieve the same iteration complexity as cubic regularization for finding SOSP, or proving a lower bound for first-order methods, remains an important open question.

3.2 Noisy Stochastic Gradient Descent for SOSP

[Algorithm 6 (Noisy Stochastic Gradient Descent): pseudocode figure]

Similar to FOSP, for several ML problems it is critical to design SOSP algorithms that work in the finite-sum setting, i.e., when \(f(w)=\frac{1}{n}\sum _{i=1}^n f_i(w)\). For this setting, we can use the gradient of a randomly sampled \(f_i\) as a proxy for the gradient of f, which leads to Algorithm 6. The convergence rate of the algorithm is similar to the one given in Theorem 5. As with FOSP, the convergence rate of Algorithm 6 also depends on the variance term \(\max _w {\mathbb{E}}[\Vert \nabla _w f_{i}(w)\Vert ^2]-\Vert \nabla _w f(w)\Vert ^2\). Several recent results address this concern using variance reduction techniques.
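As with Algorithm 5, the listing of Algorithm 6 is only a figure; a minimal sketch of the idea, a sampled component gradient plus isotropic noise, follows, with the sampling scheme and perturbation scale as illustrative assumptions.

```python
import numpy as np

def noisy_sgd(grad_i, n, w0, eta, T, noise_std, seed=0):
    """Sketch of noisy SGD for f(w) = (1/n) * sum_i f_i(w)."""
    rng = np.random.default_rng(seed)
    w = w0.copy()
    for _ in range(T):
        i = rng.integers(n)                                           # sample a component f_i
        g = grad_i(i, w) + noise_std * rng.standard_normal(w.shape)   # stochastic gradient + noise
        w = w - eta * g
    return w
```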

3.3 Summary

Finding SOSP is critical in several domains. Even in deep learning, it is widely believed that second-order saddle points are reasonable solutions whereas first-order saddle points are not desirable. Several recent results have addressed this problem; we summarize a few of them in the tables below.

Open problems Note that all the above results require the Hessian to be Lipschitz continuous, which is a stricter condition than the smoothness condition required by standard convex optimization problems. A natural open question is to design and analyze algorithms for convergence to SOSP that do not require such an assumption on the Hessian. Another important open question is to bridge the gap between the iteration complexity of cubic regularization and that of the first-order methods discussed above. Finally, further exploiting structure in f, such as non-convex + smooth or non-convex + strongly-convex components, to design faster algorithms is another interesting research direction (Tables 2, 3, 4).

Table 2: FOSP in the finite-sum setting (\(f(w)=\frac{1}{n}\sum _{i=1}^n f_i(w)\)): no. of first-order oracle calls (individual gradient computations \(\nabla f_i(w)\)) required by different methods for converging to an \(\varepsilon \)-FOSP (Definition 3) of f when f is a non-convex and a convex function, respectively.
Table 3: SOSP: no. of first-order oracle calls (gradient computations) required by different methods for converging to an \(\varepsilon \)-SOSP (Definition 4) of f when f is a non-convex and a convex function, respectively.
Table 4: SOSP in the finite-sum setting (\(f(w)=\frac{1}{n}\sum _{i=1}^n f_i(w)\)): no. of first-order oracle calls (individual gradient computations \(\nabla f_i(w)\)) required by different methods for converging to an \(\varepsilon \)-SOSP (Definition 4) of f when f is a non-convex and a convex function, respectively.

4 Discussion

Non-convex optimization is critical to several ML problems, with applications in deep learning, recommendation systems (matrix completion), dimensionality reduction (PCA, sparse PCA), robust learning, etc. Most of these problems are NP-hard in general, but come with certain structure that helps make them more tractable in practice.

In this survey, we focused on a canonical unconstrained non-convex optimization formulation. However, several other formulations are equally critical in the ML domain, for example, optimizing non-convex functions over convex sets. Another is optimizing a convex function over a non-convex set, which captures several critical problems such as PCA, sparse regression, and tensor decomposition. We refer readers to15 for a survey of some of these techniques.

For unconstrained non-convex optimization, we discussed a few scalable techniques for finding FOSP as well as SOSP. While SOSPs are desirable in practice, especially for deep learning, most existing practical algorithms use variants of SGD (Algorithm 3) with a momentum term for acceleration. While theoretically we can only guarantee convergence to FOSP for such algorithms, in practice this simple algorithm converges to an SOSP26. One possible reason is that SGD itself adds a significant amount of noise, thus avoiding saddle points; however, existing SGD analyses are unable to provide a general result without additional assumptions on the noise added by SGD in each iteration8, and this remains an active area of research. In addition, studying and possibly bridging the gap between the time complexity of finding FOSP for convex and non-convex functions is an important problem.

Several recent results have shown that many non-convex optimization problems do not admit spurious local minima, i.e., every SOSP is globally optimal; hence, convergence to an SOSP is enough to guarantee global optimality6, 11. The past few years have seen a flurry of activity in the SOSP domain, with several new and scalable techniques. However, a few fundamental questions remain open. For example, can we obtain solutions as efficiently as cubic regularization (in terms of iteration complexity) using first-order methods? Does randomly initialized gradient descent already converge to an SOSP without any additional noise? Similarly, do SGD's iterates add enough randomness to escape first-order saddle points efficiently?

Finally, statistics plays a critical role in non-convex optimization problems in ML, which presents exciting new opportunities. For example, the standard matrix completion problem is NP-hard, but by imposing statistical restrictions on the generated data, we can reduce it to an SOSP problem that can be solved in time polynomial in \(1/\varepsilon \)11. Similarly, sparse linear regression/compressed sensing problems admit optimal solutions to \(L_0\)-constrained non-convex optimization problems due to the statistical nature of the observations16. It is widely believed that techniques that are aware of both the statistical and the optimization aspects of ML problems can produce strong results.