1 Introduction

In this paper, we consider the following nonlinear programming problem

$$\begin{aligned} \varPsi ^* := \min \limits _{x \in X} \{\varPsi (x) := f(x) + {\mathcal X}(x)\}, \end{aligned}$$
(1.1)

where \(X \subseteq {\mathbb {R}}^n\) is a closed convex set, \(f:X \rightarrow {\mathbb {R}}\) is possibly nonconvex, and \({\mathcal X}\) is a simple but possibly non-differentiable convex function with known structure (e.g., \({\mathcal X}(x) = \Vert x\Vert _1\), or \({\mathcal X}\) may even vanish). Moreover, we assume that f has a Hölder continuous gradient on X, i.e., there exist \( H >0\) and \(\nu \in [0,1]\) such that

$$\begin{aligned} \Vert f'(y) - f'(x)\Vert \le H \Vert y-x\Vert ^\nu \quad \forall \, x, y \in X, \end{aligned}$$
(1.2)

where \(f'(x)\) is the gradient of f at x and \(\Vert \cdot \Vert \) is the Euclidean norm in \({\mathbb {R}}^n\). The assumption in (1.2) covers a wide class of objective functions, including smooth functions (with \(\nu = 1\)), weakly smooth functions (with \(\nu \in (0,1)\)), and nonsmooth convex functions with bounded subgradients (with \(\nu = 0\)).
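To make the condition concrete, the following sketch (ours, not part of the paper's algorithms) numerically checks (1.2) for the weakly smooth scalar function \(f(x) = |x|^{1+\nu }/(1+\nu )\) with \(\nu = 0.5\); the bound \(H = 2\) used below is an assumed, conservative Hölder constant for this particular instance.

```python
import numpy as np

# Check the Hölder condition (1.2) for f(x) = |x|^(1+nu)/(1+nu), whose
# gradient is f'(x) = sign(x)*|x|^nu.  H = 2 is an assumed constant
# for this instance (the worst case, x = -y, gives ratio 2^(1-nu)).
nu, H = 0.5, 2.0

def grad_f(x):
    return np.sign(x) * np.abs(x) ** nu

rng = np.random.default_rng(0)
x = rng.uniform(-5.0, 5.0, 1000)
y = rng.uniform(-5.0, 5.0, 1000)
mask = x != y
ratio = np.abs(grad_f(y[mask]) - grad_f(x[mask])) / np.abs(y[mask] - x[mask]) ** nu
print(ratio.max() <= H)  # the Hölder constant H is never exceeded
```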

The complexity of solving (1.1) has been well-understood under the convex setting, i.e., when f is convex. According to the classic complexity theory by Nemirovski and Yudin [21], if f is a general convex function with bounded subgradients (i.e., \(\nu = 0\)), then the number of subgradient evaluations of f required to find a solution \({{\bar{x}}} \in X\) such that \(\varPsi ({{\bar{x}}}) -\varPsi ^* \le \epsilon \) cannot be smaller than \({\mathcal O}(1/\epsilon ^2)\) when n is sufficiently large. Here \(\varPsi ^*\) denotes the optimal value of (1.1). Such a lower complexity bound can be achieved by different first-order methods, including the subgradient and mirror descent methods in [21], and the bundle-level type methods in [4, 18]. Moreover, if f is a smooth convex function with Lipschitz continuous gradients (i.e., \(\nu = 1\)), then the number of gradient evaluations of f cannot be smaller than \({\mathcal O}(1/\sqrt{\epsilon })\) for sufficiently large n. Such a lower complexity bound can be achieved by the well-known Nesterov’s accelerated gradient (AG) method, which was originally developed in [22] for the smooth case with \({\mathcal X}= 0\), and recently extended for an emerging class of composite problems with a relatively simple nonsmooth term \({\mathcal X}\) [2, 24, 29].

While traditionally different classes of convex optimization problems were solved by different algorithms, the last few years have seen a growing interest in the development of unified algorithms that can achieve the optimal complexity for solving different classes of convex problems, preferably without requiring the input of any problem parameters. Lan showed in [14] that Nesterov’s AG method in [22, 23] can achieve the optimal complexity for solving not only smooth, but also general nonsmooth optimization problems (i.e., \(\nu = 0\) in (1.2)) by introducing a novel stepsize policy and some new convergence analysis techniques. Devolder, Glineur and Nesterov [28] further generalized this development and showed that the AG method can exhibit the optimal \({\mathcal O}(1/\epsilon ^{\frac{2}{1+3\nu }})\) complexity for solving weakly smooth problems. The methods in [14, 28] still require the input of problem parameters like \(\nu \) and H, and even the iteration limit N. In a different line of research, Lan [15] generalized the bundle-level type methods, originally designed for nonsmooth problems, to both smooth and weakly smooth problems. He showed that these accelerated level methods are uniformly optimal for convex optimization in the sense that they achieve the optimal complexity bounds without requiring the input of any problem parameters or the iteration limit. Simplified variants of these methods were also proposed in [7] for solving ball-constrained and unconstrained convex optimization problems. Recently, Nesterov [25] presented another uniformly optimal method, namely, the universal accelerated gradient method for nonsmooth and (weakly) smooth convex programming. This method only needs the target accuracy as an input, which, similar to the methods in [7, 15], completely removes the need to input the problem parameters required in [14, 28].

In general, all the above-mentioned methods require convexity of the objective function to establish their convergence results. When f is possibly nonconvex, a different termination criterion based on the projected gradient \(g_{_{X,k}}\) (see e.g., (2.16)) is often employed to analyze the complexity of the solution methods. While there is no known lower iteration complexity bound for first-order methods to solve problem (1.1), the (projected) gradient-type methods [6, 12, 23] achieve the best-known iteration complexity \({\mathcal O}(1/\epsilon )\) to find a solution such that \(\Vert g_{_{X,k}}\Vert ^2 \le \epsilon \) when f in (1.1) has Lipschitz continuous gradient. Recently, Ghadimi and Lan [11] generalized Nesterov’s AG method to solve this class of nonconvex nonlinear optimization problems. They showed that this generalized AG method can not only achieve the best-known \({\mathcal O}(1/\epsilon )\) iteration complexity for finding approximate stationary points of nonconvex problems, but also exhibit the optimal iteration complexity if the objective function turns out to be convex. However, in order to apply this method, we need to assume that all the generated iterates lie in a bounded set and that the gradients of f are Lipschitz continuous, and the method also requires the input of a few problem parameters a priori. The main goal of this paper is to understand whether we can generalize some of the aforementioned uniformly optimal methods to solve the broader class of nonlinear programming problems given in (1.1), where the function f may be nonconvex and only weakly smooth or nonsmooth (see (1.2)). In addition to these theoretical aspects, our study has also been motivated by the following applications.

  • In many machine learning problems, the regularized loss function in the objective is given as a summation of convex and nonconvex terms (see e.g., [8, 19, 20]). A unified approach may help us exploit the possible local convexity of the objective function in this class of problems, even though these problems are not globally convex.

  • Some optimization problems are given through a black-box oracle (see e.g., [1, 9, 17]). Hence, both the smoothness level of the objective function and its convexity are unknown. A unified algorithm that handles both convex and nonconvex optimization and automatically adapts to the smoothness level of the objective function could achieve better convergence performance for this class of problems.

Our contribution in this paper mainly consists of the following aspects. First, we generalize Nesterov’s AG method and present a unified problem parameter free accelerated gradient (UPFAG) method for a broad class of problems with \(\nu \in [0,1]\) in (1.2). The basic idea of this method is to combine a gradient descent step with Nesterov’s AG method and maintain the monotonicity of the objective function value at the iterates generated by the algorithm. Hence, the UPFAG method contains the gradient projection and Nesterov’s AG methods as special cases (see the discussions after presenting Algorithm 1). It should also be pointed out that two different line search procedures are implemented at each iteration of the UPFAG method, which remove the need to know problem parameters in advance. We then show that this method achieves the best known iteration complexity

$$\begin{aligned} {\mathcal O} \left( \frac{H^{\frac{1}{\nu }}}{\epsilon ^{\frac{1+\nu }{2\nu }}}\right) \end{aligned}$$
(1.3)

to find at least one k such that \(\Vert g_{_{X,k}}\Vert ^2 \le \epsilon \), and exhibits, similar to those in [15, 25, 28], the optimal complexity

$$\begin{aligned} {\mathcal O} \left( \left[ \frac{H}{\epsilon }\right] ^{\frac{2}{1+3\nu }}\right) \end{aligned}$$
(1.4)

to find a solution \({{\bar{x}}} \in X\) such that \(\varPsi ({{\bar{x}}}) -\varPsi ^* \le \epsilon \), if f turns out to be convex.

To the best of our knowledge, this is the first time that a uniform first-order algorithm, which does not require any problem parameter information but only takes the target accuracy and a few user-defined line search parameters as input, has been presented for solving smooth, nonsmooth, weakly smooth convex and nonconvex optimization problems. Moreover, this algorithm can also exploit a more efficient quasi-Newton step in place of the gradient descent step while achieving the same iteration complexity bounds.

Second, by incorporating a gradient descent step into the framework of the bundle-level type methods, namely, the accelerated prox-level (APL) method presented in [15], we propose a unified APL (UAPL) method for solving the class of nonlinear programming problems defined in (1.1), where f satisfies (1.2). We show that this method achieves the complexity bounds in (1.4) and (1.3) for both convex and nonconvex optimization. Moreover, we simplify this method and present a fast variant, by incorporating a gradient descent step into the framework of the fast APL method [7], for solving ball-constrained and unconstrained problems. To the best of our knowledge, this is the first time that bundle-level type methods have been generalized to such nonconvex nonlinear programming problems. Indeed, the new algorithmic framework presented in this paper can be used to generalize any optimal first-order algorithm for convex optimization to solve general nonlinear (not necessarily convex) optimization problems.

The rest of the paper is organized as follows. In Sect. 2, we present the UPFAG method for solving a broad class of nonlinear programming problems in which the objective function is the summation of a weakly smooth function with Hölder continuous gradient and a simple convex function, and establish its convergence results. In Sect. 3, we provide different variants of the bundle-level type methods for solving the aforementioned class of nonlinear programming problems. In Sect. 4, we present numerical illustrations of the above-mentioned algorithms.

Notation For a differentiable function \(h: {\mathbb {R}}^n \rightarrow {\mathbb {R}}\), \(h'(x)\) is the gradient of h at x. More generally, when h is a proper convex function, \(\partial h(x)\) denotes the subdifferential of h at x. For \(x \in {\mathbb {R}}^n\) and \(y \in {\mathbb {R}}^n\), \(\langle x, y \rangle \) is the standard inner product in \({\mathbb {R}}^n\). The norm \(\Vert \cdot \Vert \) is the Euclidean norm given by \(\Vert x\Vert = \sqrt{ \langle x,x \rangle }\), while \(\Vert x\Vert _G = \sqrt{ \langle G x,x \rangle }\) for a positive definite matrix G. Moreover, we let \({\mathcal B}({{\bar{x}}}, r)\) be the ball with radius r centered at \({{\bar{x}}}\), i.e., \({\mathcal B}({{\bar{x}}}, r) =\{x \in {\mathbb {R}}^n \ \ | \ \ \Vert x-{{\bar{x}}}\Vert \le r\}\). We denote by I the identity matrix. For any real number r, \(\lceil r \rceil \) and \(\lfloor r \rfloor \) denote the nearest integers to r from above and below, respectively. \({\mathbb {R}}_+\) denotes the set of nonnegative real numbers.

2 Unified Problem Parameter Free Accelerated Gradient Method

Our goal in this section is to present a unified gradient type method to solve problem (1.1) when f is smooth, weakly smooth or nonsmooth. In particular, we consider the class of problems in (1.1) such that the gradient of f is Hölder continuous in the sense of (1.2), which also implies

$$\begin{aligned} |f(y) - f(x) - \langle f'(x), y - x \rangle | \le \frac{H}{1+\nu } \Vert y - x\Vert ^{1+\nu } \ \ \ \text{ for } \text{ any } x, y \in X. \end{aligned}$$
(2.1)
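For completeness, here is the standard one-line argument (using the convexity of X, so that the segment between x and y lies in X) showing that (1.2) implies (2.1):

```latex
\begin{aligned}
|f(y) - f(x) - \langle f'(x), y - x \rangle |
  &= \Big| \int_0^1 \langle f'(x + t(y-x)) - f'(x),\, y - x \rangle \, dt \Big| \\
  &\le \int_0^1 \| f'(x + t(y-x)) - f'(x)\| \, \|y - x\| \, dt
   \le \int_0^1 H t^{\nu} \|y-x\|^{1+\nu} \, dt
   = \frac{H}{1+\nu}\, \|y-x\|^{1+\nu}.
\end{aligned}
```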

Our algorithm is stated below as Algorithm 1 which involves two line search procedures.

[Algorithm 1: the unified problem parameter free accelerated gradient (UPFAG) method]

We have the following remarks about this algorithm. First, observe that by just considering (2.3), (2.4), (2.5) and setting \(x^{ag}_k = {{\tilde{x}}}^{ag}_k\), Algorithm 1 would reduce to a variant of the well-known Nesterov’s optimal gradient method (see, e.g., [23]). Moreover, if we replace \(x^{ag}_{k-1}\) by \(x^{md}_k\) in (2.10) and set \(x^{ag}_k = {{\bar{x}}}^{ag}_k\), then (2.3), (2.4) and (2.10) would give the accelerated gradient (AG) method proposed by Ghadimi and Lan [11]. However, when f is nonconvex, the convergence of this AG method in [11] requires a boundedness assumption on the iterates. On the other hand, by just considering (2.10) and setting \(x^{ag}_k = {{\bar{x}}}^{ag}_k\), the UPFAG algorithm would become a variant of the projected gradient method [12]. Indeed, it follows from these observations that we can possibly perform the convergence analysis of the UPFAG method for convex and nonconvex optimization separately (see the discussions after Corollary 1). It should also be pointed out that these algorithms are designed to solve the class of problems in (1.1) when f is smooth, and they do not necessarily use line search procedures to specify the stepsizes. Moreover, (2.11) guarantees that the objective function value at the iterates \(x^{ag}_k\) generated by the UPFAG algorithm is non-increasing. In contrast to previous accelerated gradient methods, such monotonicity of the objective function value is enforced by the UPFAG method in order to establish its convergence when \(\varPsi \) is nonconvex, as shown in the proof of Theorem 1.b).
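As an illustration of the composite step (2.10) (our sketch, not part of the algorithm's statement), consider the special case \(X = {\mathbb {R}}^n\) and \({\mathcal X}(x) = \tau \Vert x\Vert _1\): the subproblem then has the well-known closed-form solution given by soft-thresholding the gradient step.

```python
import numpy as np

# Closed form of step (2.10) when X = R^n and calX(u) = tau*||u||_1:
#   argmin_u <g, u> + ||u - x||^2/(2*beta) + tau*||u||_1
# is the soft-thresholding of the gradient step z = x - beta*g.
def composite_step(x, g, beta, tau):
    z = x - beta * g
    return np.sign(z) * np.maximum(np.abs(z) - beta * tau, 0.0)

x = np.array([1.0, -0.2, 0.5])
g = np.array([0.5, 0.0, -1.0])
print(composite_step(x, g, beta=1.0, tau=0.3))
```

Small entries of the gradient step are zeroed out, which is how the \(\ell _1\) term promotes sparsity in the iterates.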

Second, note that in steps 1 and 2, we implement two independent line search procedures, respectively, in (2.6) and (2.9). Indeed, we start with initial choices of stepsizes \({\hat{\lambda }}_k\) and \({\hat{\beta }}_k\), and then perform Armijo-type line searches until certain specially designed line search conditions are satisfied. We will show in Theorem 1.a) that the two line search procedures finish in a finite number of inner iterations. One simple choice in practice is to set the initial stepsizes to some Barzilai-Borwein type stepsizes such as

$$\begin{aligned} {\hat{\lambda }}_k = \max \left\{ \frac{\langle s^{md}_{k-1}, y^{md}_{k-1} \rangle }{\langle y^{md}_{k-1}, y^{md}_{k-1} \rangle }, \sigma \right\} \text{ for } k \ge 1, \quad \text{ and } \quad {\hat{\beta }}_k = \max \left\{ \frac{\langle s^{ag}_{k-1}, y^{ag}_{k-1} \rangle }{\langle y^{ag}_{k-1}, y^{ag}_{k-1} \rangle }, \sigma \right\} \text{ for } k > 1,\nonumber \\ \end{aligned}$$
(2.12)

where \(s^{md}_{k-1} = x^{md}_{k-1} - {\tilde{x}}^{ag}_{k-1}\), \(y^{md}_{k-1} = f'(x^{md}_{k-1}) - f'({\tilde{x}}^{ag}_{k-1})\), \(s^{ag}_{k-1} = x^{ag}_{k-1} - x^{ag}_{k-2}\) and \(y^{ag}_{k-1} = f'(x^{ag}_{k-1}) - f'(x^{ag}_{k-2})\). Moreover, we can choose \({\hat{\beta }}_1=1/{{\hat{H}}}\) if an estimate \({{\hat{H}}}\) of the Hölder continuity constant H in (1.2) is known; otherwise, we can simply set \({\hat{\beta }}_1\) to a positive constant (e.g., \({\hat{\beta }}_1 = \min \{1/\Vert f'(x_0)\Vert , 1\}\)).
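A minimal sketch of the safeguarded Barzilai-Borwein choice (2.12) follows; the function name and the value of \(\sigma \) are ours and purely illustrative.

```python
import numpy as np

# Safeguarded Barzilai-Borwein initial stepsize as in (2.12):
# max{ <s, y>/<y, y>, sigma } with a user-chosen lower bound sigma > 0.
def bb_stepsize(s, y, sigma=1e-4):
    yy = float(y @ y)
    if yy == 0.0:          # degenerate case: fall back to the safeguard
        return sigma
    return max(float(s @ y) / yy, sigma)

s = np.array([0.5, -1.0])   # iterate difference
y = np.array([1.0, -2.0])   # gradient difference
print(bb_stepsize(s, y))    # <s,y>/<y,y> = 2.5/5 = 0.5
```

Note that the safeguard also handles the case \(\langle s, y \rangle \le 0\), which can occur when f is nonconvex.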

Third, the Euclidean metric used in the quadratic term of the gradient projection step in (2.10) can be replaced by a variable metric. In particular, \(\Vert \cdot \Vert \) can be replaced by \(\Vert \cdot \Vert _{G_k}\) for some positive definite matrix \(G_k\), i.e.,

$$\begin{aligned} {{\bar{x}}}^{ag}_k = \arg \min \limits _{u \in X} \left\{ \langle f'(x^{ag}_{k-1}), u \rangle + \frac{1}{2 \beta _k} \Vert u- x^{ag}_{k-1}\Vert _{G_k}^2 +{\mathcal X}(u) \right\} . \end{aligned}$$
(2.13)

This modification allows us to include some curvature information about f in the matrix \(G_k\) so as to obtain a better local approximation of the function \(\varPsi \) at \(x^{ag}_{k-1}\). In this case, a unit initial stepsize is often preferred, i.e., \({\hat{\beta }}_k = 1\) for all \(k \ge 1\). In practice, we can set \(G_k\) to be some Quasi-Newton matrix, e.g., the well-known BFGS or limited memory BFGS (LBFGS) matrix (see e.g., [5, 26, 27]). Then the unit step \({\hat{\beta }}_k = 1\) can often be accepted near the solution, which implies that (2.13) is exactly a Quasi-Newton step if \({\mathcal X}(x) \equiv 0\), and hence a fast local convergence rate could be expected in practice (see Sect. 4 for more details). For some specially structured X and \({\mathcal X}\), if the inverse \(G_k^{-1}\) is known, we may have a closed-form solution of the subproblem (2.13) (e.g., see [3]). For more general cases, however, no closed-form solution may be available. Then the alternating direction method of multipliers or primal-dual type algorithms could solve (2.13) quite efficiently, since its objective function is just a composition of a simple convex function \({\mathcal X}\) and a convex quadratic function with known inverse of the Hessian. It is also worth noting that we actually do not need to solve the subproblem (2.13) exactly if no closed-form solution exists. We can see from the proof of Theorem 1 that, instead of having (2.23) implied by the exact subproblem solution, we only need to find an approximate solution \({{\bar{x}}}^{ag}_k\) of (2.13) such that

$$\begin{aligned} \langle f'(x^{ag}_{k-1}), {{\bar{x}}}^{ag}_k - x^{ag}_{k-1} \rangle + {\mathcal X}({{\bar{x}}}^{ag}_k) \le {\mathcal X}(x^{ag}_{k-1}) - \frac{\sigma }{\beta _k} \Vert {{\bar{x}}}^{ag}_k - x^{ag}_{k-1}\Vert ^2. \end{aligned}$$
(2.14)

for some \(\sigma \in (0,1)\). Hence, it is quite flexible to choose proper algorithms to solve the subproblem (2.13). We may even apply different algorithms to solve (2.13) when different solution accuracies are required. Moreover, similar to the derivation of (2.23), the solution of the subproblem (2.13) satisfies (2.14) as long as \(G_k \succeq \sigma I\) for all k. In addition, similar to the proof of Theorem 1, if the parameter \(\gamma _3\) in Algorithm 1 is chosen such that \(0< \gamma _3 < 2 \sigma \), the line search condition (2.9) will eventually be satisfied as \(\beta _k\) goes to zero. Hence, step 2 of Algorithm 1 is also well-defined under a variable metric and inexact solutions of subproblem (2.13). In general, through different choices of the matrix \(G_k\), many well-studied efficient methods in nonlinear programming can be incorporated into the algorithm. Furthermore, from the complexity point of view, instead of setting the initial stepsizes as in (2.12), we could also take advantage of the line search in the previous iteration and set

$$\begin{aligned} {\hat{\lambda }}_k = \eta _{k-1} \quad \text{ and } \quad {\hat{\beta }}_k = \beta _{k-1}, \end{aligned}$$
(2.15)

where \(\eta _{k-1}\) and \(\beta _{k-1}\) are the stepsizes accepted in the \((k-1)\)-th iteration. The choice of initial stepsizes in (2.12) is more aggressive and inherits some quasi-Newton information, and hence could perform better in practice. However, the strategy in (2.15) has theoretical advantages in terms of the total number of inner iterations needed in the line searches (see the discussion after Corollary 1).
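To make the inexactness test concrete, the following sketch (ours; \({\mathcal X}(u) = \tau \Vert u\Vert _1\) is an illustrative choice of the simple convex term) evaluates the acceptance condition (2.14) for a candidate solution of (2.13). For \(X = {\mathbb {R}}^n\), \({\mathcal X}= 0\) and \(G_k = I\), the exact solution \(x^{ag}_{k-1} - \beta _k f'(x^{ag}_{k-1})\) passes the test for any \(\sigma < 1\).

```python
import numpy as np

# Acceptance test (2.14) for an approximate solution xbar of (2.13),
# with calX(u) = tau*||u||_1 as an illustrative simple convex term.
def satisfies_2_14(x_prev, xbar, grad, beta, sigma, tau=0.0):
    calX = lambda u: tau * np.abs(u).sum()
    d = xbar - x_prev
    lhs = float(grad @ d) + calX(xbar)
    rhs = calX(x_prev) - (sigma / beta) * float(d @ d)
    return lhs <= rhs

x = np.array([1.0, 0.0])
g = np.array([0.4, -0.6])
beta = 0.5
xbar = x - beta * g                    # exact step when X = R^n, calX = 0
print(satisfies_2_14(x, xbar, g, beta, sigma=0.9))  # True for any sigma < 1
```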

In the remaining part of this section, we provide convergence properties of Algorithm 1. Noticing that \({\mathcal X}\) in problem (1.1) is not necessarily differentiable and f in (1.1) may not be a convex function, we need to define a termination criterion for this algorithm when the objective function \(\varPsi \) is nonconvex. In this case, we would terminate the algorithm when the norm of the generalized projected gradient defined by

$$\begin{aligned} g_{_{X,k}} = \frac{x^{ag}_{k-1} - {{\bar{x}}}^{ag}_k}{\beta _k} \end{aligned}$$
(2.16)

is sufficiently small. Note that \(g_{_{X,k}} = f'(x^{ag}_{k-1})\) when \({\mathcal X}\) vanishes and \(X={\mathbb {R}}^n\). Indeed, the above generalized projected gradient plays, in constrained nonsmooth optimization, a role analogous to that of the gradient in unconstrained smooth optimization. In particular, it can be shown that if \(\Vert g_{_{X,k}}\Vert \le \epsilon \), then \(\varPsi '({\bar{x}}^{ag}_k) \in -{\mathcal N}_X ({\bar{x}}^{ag}_k) + {\mathcal B}(\epsilon ^{\nu } (\epsilon ^{1-\nu }+H \beta _k^{\nu }))\), where \(\varPsi '({\bar{x}}^{ag}_k) \in \partial \varPsi ({\bar{x}}^{ag}_k)\), \({\mathcal N}_X ({{\bar{x}}}^{ag}_k)\) is the normal cone of X at \({{\bar{x}}}^{ag}_k\), and \({\mathcal B}(r) := \{x \in {\mathbb {R}}^n: \Vert x\Vert \le r\}\) (see e.g., [11]).
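For intuition, the sketch below (ours) computes \(g_{_{X,k}}\) from (2.16) in the simple case \({\mathcal X}= 0\) with X a box, where step (2.10) reduces to a Euclidean projection; on coordinates where the step stays strictly inside X, \(g_{_{X,k}}\) coincides with the corresponding component of \(f'(x^{ag}_{k-1})\).

```python
import numpy as np

# Generalized projected gradient (2.16) with calX = 0 and X = [lo, hi]^n:
# step (2.10) becomes xbar = proj_X(x - beta*grad), so g = (x - xbar)/beta.
def gen_proj_grad(x, grad, beta, lo, hi):
    xbar = np.clip(x - beta * grad, lo, hi)   # projection onto the box
    return (x - xbar) / beta

x = np.array([0.5, 1.0])
grad = np.array([0.2, 3.0])
print(gen_proj_grad(x, grad, beta=0.5, lo=0.0, hi=1.0))
```

Here the first coordinate is unconstrained and returns the plain gradient component 0.2, while the second is clipped at the lower bound, yielding 2.0 instead of 3.0.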

To establish the convergence of the UPFAG algorithm, we also need the following simple technical result (see Lemma 3 of [16] for a slightly more general result).

Lemma 1

Let \(\{\alpha _k\}\) be a sequence of real numbers such that \(\alpha _1 = 1\) and \(\alpha _k \in (0,1)\) for any \(k \ge 2\). If a sequence \(\{\omega _k\}\) satisfies

$$\begin{aligned} \omega _k \le (1-\alpha _k) \omega _{k-1}+\zeta _k, \ \ k=1,2,\ldots , \end{aligned}$$
(2.17)

then for any \(k \ge 1\) we have

$$\begin{aligned} \omega _k \le \varGamma _k \sum _{i=1}^k ( \zeta _i/\varGamma _i), \end{aligned}$$

where

$$\begin{aligned} \varGamma _{k} := \left\{ \begin{array}{ll} 1, &{} k = 1,\\ (1 - \alpha _{k}) \varGamma _{k-1}, &{} k \ge 2. \end{array} \right. \end{aligned}$$
(2.18)
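As a quick numerical sanity check of Lemma 1 (our illustration, not part of the paper), one can run the recursion (2.17) with equality and compare \(\omega _k\) against the bound \(\varGamma _k \sum _{i=1}^k \zeta _i/\varGamma _i\):

```python
import numpy as np

# Verify Lemma 1 numerically: build omega_k attaining equality in (2.17)
# and check it never exceeds Gamma_k * sum_{i<=k} zeta_i / Gamma_i.
rng = np.random.default_rng(1)
K = 50
alpha = np.concatenate(([1.0], rng.uniform(0.1, 0.9, K - 1)))  # alpha_1 = 1
zeta = rng.uniform(0.0, 1.0, K)

Gamma, omega, acc, ok = 1.0, 0.0, 0.0, True
for k in range(K):
    if k > 0:
        Gamma *= 1.0 - alpha[k]        # recursion (2.18)
    omega = (1.0 - alpha[k]) * omega + zeta[k]
    acc += zeta[k] / Gamma             # running sum of zeta_i / Gamma_i
    ok = ok and omega <= Gamma * acc + 1e-9
print(ok)
```

Since \(\alpha _1 = 1\), the initial value of \(\omega \) is irrelevant, and with equality in (2.17) the bound is attained exactly (up to floating-point error).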

Below, we present the main convergence properties of the UPFAG algorithm.

Theorem 1

Let \(\{x^{ag}_k\}\) be the iterates generated by Algorithm 1 and \(\varGamma _k\) be defined in (2.18).

  1. a)

    The line search procedures in step 1 and step 2 of the algorithm finish in a finite number of inner iterations.

  2. b)

    Suppose that \(\varPsi \) is bounded below over X, i.e., \(\varPsi ^*\) is finite. Then, for any \(N \ge 1\), we have

    $$\begin{aligned} \min \limits _{k=1,\ldots ,N} \Vert g_{_{X,k}}\Vert ^2 \le \frac{2\left[ \varPsi (x_0) - \varPsi ^*+\sum _{k=\lfloor N/2 \rfloor +1}^N k^{-1}\right] }{\gamma _3 \sum _{k=\lfloor N/2 \rfloor +1}^N \beta _k}, \end{aligned}$$
    (2.19)

    where \(g_{_{X,k}}\) is defined in (2.16).

  3. c)

    Suppose that \(\varPsi \) is convex and an optimal solution \(x^*\) exists for problem (1.1). Then for any \(N \ge 1\), we have

    $$\begin{aligned} \varPsi (x^{ag}_N)-\varPsi (x^*)\le & {} \frac{\varGamma _N\Vert x_0 - x^*\Vert ^2}{2\lambda _1}+\delta . \end{aligned}$$
    (2.20)

Proof

We first show part a). By (2.1), we have

$$\begin{aligned} f({{\bar{x}}}^{ag}_k) \le f(x^{ag}_{k-1}) + \langle f'(x^{ag}_{k-1}), {{\bar{x}}}^{ag}_k- x^{ag}_{k-1} \rangle +\frac{H}{1+\nu } \Vert {\bar{x}}^{ag}_k- x^{ag}_{k-1} \Vert ^{1+\nu }. \end{aligned}$$
(2.21)

Noting that the objective in (2.10) is strongly convex with modulus \(1/\beta _k\) and hence grows quadratically away from its minimizer, for any \(x \in X\), we have

$$\begin{aligned}&\langle f'(x^{ag}_{k-1}), {{\bar{x}}}^{ag}_k - x \rangle + {\mathcal X}({\bar{x}}^{ag}_k) \le {\mathcal X}(x) \nonumber \\&\quad + \frac{1}{2\beta _k} \left[ \Vert x^{ag}_{k-1} - x\Vert ^2-\Vert {{\bar{x}}}^{ag}_k - x\Vert ^2 -\Vert x^{ag}_{k-1} -{{\bar{x}}}^{ag}_k\Vert ^2 \right] . \end{aligned}$$
(2.22)

Letting \(x=x^{ag}_{k-1}\) in the above inequality, we have

$$\begin{aligned} \langle f'(x^{ag}_{k-1}), {{\bar{x}}}^{ag}_k - x^{ag}_{k-1} \rangle + {\mathcal X}({{\bar{x}}}^{ag}_k) \le {\mathcal X}(x^{ag}_{k-1}) - \frac{1}{\beta _k} \Vert {{\bar{x}}}^{ag}_k - x^{ag}_{k-1}\Vert ^2. \end{aligned}$$
(2.23)

Summing (2.23) with (2.21), we have

$$\begin{aligned} \varPsi ({{\bar{x}}}^{ag}_k) \le \varPsi (x^{ag}_{k-1}) - \left( \frac{\Vert {\bar{x}}^{ag}_k- x^{ag}_{k-1} \Vert ^2}{\beta _k}-\frac{H\Vert {{\bar{x}}}^{ag}_k- x^{ag}_{k-1} \Vert ^{1+\nu }}{1+\nu } \right) . \end{aligned}$$
(2.24)

Now, for \(\nu \in [0,1)\), it follows from the inequality \(ab \le a^p/p+b^q/q\) with \(p=\frac{2}{1+\nu }\), \(q = \frac{2}{1-\nu } \), and

$$\begin{aligned} a = \frac{H}{1+\nu } \left[ \frac{(1-\nu )k}{2}\right] ^{\frac{1-\nu }{2}}\Vert x^{ag}_{k-1}-{\bar{x}}^{ag}_k\Vert ^{1+\nu } \quad \text{ and } \quad b = \left[ \frac{2}{(1-\nu )k}\right] ^{\frac{1-\nu }{2}} \end{aligned}$$

that

$$\begin{aligned} \frac{H}{1+\nu } \Vert x^{ag}_{k-1}-{{\bar{x}}}^{ag}_k\Vert ^{1+\nu } = a b \le L(\nu ,H)k^{\frac{1-\nu }{1+\nu }} \Vert x^{ag}_{k-1}-{{\bar{x}}}^{ag}_k\Vert ^2 +\frac{1}{k}, \end{aligned}$$
(2.25)

where

$$\begin{aligned} L(\nu ,H) = \left\{ \frac{H}{2\left[ \frac{1+\nu }{1-\nu }\right] ^{\frac{1-\nu }{2}}}\right\} ^{\frac{2}{1+\nu }}. \end{aligned}$$
(2.26)

Let us define

$$\begin{aligned} L(1,H) = \lim _{\nu \rightarrow 1} L(\nu ,H) = \frac{H}{2}. \end{aligned}$$
(2.27)

Then, (2.25) holds for all \(\nu \in [0,1]\). Combining (2.24) and (2.25), we have

$$\begin{aligned} \varPsi ({{\bar{x}}}^{ag}_k) \le \varPsi (x^{ag}_{k-1}) - \frac{1- L(\nu ,H) k^{\frac{1-\nu }{1+\nu }} \beta _k}{\beta _k} \Vert {{\bar{x}}}^{ag}_k- x^{ag}_{k-1}\Vert ^2 + \frac{1}{k}. \end{aligned}$$
(2.28)

Also, by (2.1), (2.3), and (2.5), we have

$$\begin{aligned} f({{\tilde{x}}}^{ag}_k)\le & {} f(x^{md}_k) + \langle f'(x^{md}_k), {\tilde{x}}^{ag}_k- x^{md}_k \rangle +\frac{H}{1+\nu } \Vert {{\tilde{x}}}^{ag}_k- x^{md}_k \Vert ^{1+\nu } \nonumber \\= & {} f(x^{md}_k) + \alpha _k \langle f'(x^{md}_k), x_k- x_{k-1} \rangle +\frac{H \alpha _k^{1+\nu }}{1+\nu } \Vert x_k- x_{k-1}\Vert ^{1+\nu }\nonumber \\= & {} f(x^{md}_k) + \alpha _k \langle f'(x^{md}_k), x_k- x_{k-1} \rangle \nonumber \\&-\alpha _k \left( \frac{\Vert x_k- x_{k-1}\Vert ^2}{2\lambda _k}-\frac{H \alpha _k^{\nu }}{1+\nu } \Vert x_k- x_{k-1}\Vert ^{1+\nu }\right) +\frac{\alpha _k}{2\lambda _k} \Vert x_k- x_{k-1}\Vert ^2 \nonumber \\\le & {} f(x^{md}_k) + \alpha _k \langle f'(x^{md}_k), x_k- x_{k-1} \rangle \nonumber \\&-\frac{\alpha _k\left( 1-2L(\nu ,H)\alpha _k^{\frac{2\nu }{1+\nu }}\lambda _k \delta ^{\frac{\nu -1}{1+\nu }}\right) \Vert x_k- x_{k-1}\Vert ^2}{2\lambda _k} + \frac{\alpha _k}{2\lambda _k}\Vert x_k- x_{k-1}\Vert ^2+\delta \alpha _k,\nonumber \\ \end{aligned}$$
(2.29)

where the last inequality is obtained similar to (2.25) and \(L(\nu ,H)\) is defined in (2.26) and (2.27).

Now, observe that if

$$\begin{aligned} \alpha _k^{\frac{2\nu }{1+\nu }} \lambda _k \le \frac{\delta ^{\frac{1-\nu }{1+\nu }}}{2 L(\nu ,H)} \quad \text{ and } \quad \beta _k \le \frac{(2 - \gamma _3) k^{\frac{\nu -1}{1+\nu }}}{2 L(\nu ,H)}, \end{aligned}$$
(2.30)

then (2.28) and (2.29), respectively, imply (2.9) and (2.6). By (2.2) and our setting of \(\alpha _k = \lambda _k/\varLambda _k = \lambda _k/(\lambda _k + \varLambda _{k-1})\), we have \(\alpha _k \lambda _k = \eta _k\). Hence,

$$\begin{aligned} \alpha _k^{\frac{2\nu }{1+\nu }} \lambda _k = (\alpha _k \lambda _k )^{\frac{2\nu }{1+\nu }} \lambda _k^{\frac{1-\nu }{1+\nu }} = \eta _k^{\frac{2\nu }{1+\nu }} \lambda _k^{\frac{1-\nu }{1+\nu }}. \end{aligned}$$
(2.31)

By (2.2), (2.8) and \(\gamma _1, \gamma _2 \in (0,1)\), we have

$$\begin{aligned} \lim _{\tau _{1,k} \rightarrow \infty } \eta _k = 0, \quad \lim _{\eta _k \rightarrow 0} \lambda _k = 0 \quad \text{ and } \quad \lim _{\tau _{2,k} \rightarrow \infty } \beta _k = 0, \end{aligned}$$

for any fixed k, which together with (2.31) implies that (2.30) will eventually be satisfied in the line search procedures and, therefore, so will (2.9) and (2.6). Hence, the line search procedures in step 1 and step 2 of Algorithm 1 are well-defined and terminate in a finite number of inner iterations.

We now show part b). Noting (2.9), (2.11), and in view of (2.16), we have

$$\begin{aligned} \varPsi (x^{ag}_k) \le \varPsi ({{\bar{x}}}^{ag}_k) \le \varPsi (x^{ag}_{k-1}) - \frac{ \gamma _3 \Vert {{\bar{x}}}^{ag}_k- x^{ag}_{k-1}\Vert ^2}{2\beta _k} +\frac{1}{k} = \varPsi (x^{ag}_{k-1}) - \frac{\gamma _3 \beta _k}{2} \Vert g_{_{X,k}}\Vert ^2 +\frac{1}{k}. \end{aligned}$$

Summing up the above inequalities for k from \(\lfloor N/2 \rfloor +1\) to N and re-arranging the terms, we obtain

$$\begin{aligned}&\min \limits _{k=1,\ldots ,N} \Vert g_{_{X,k}}\Vert ^2 \sum _{k=\lfloor N/2 \rfloor +1}^N \frac{\gamma _3 \beta _k}{2} \le \min \limits _{k=\lfloor N/2 \rfloor +1,\ldots ,N} \Vert g_{_{X,k}}\Vert ^2 \sum _{k=\lfloor N/2 \rfloor +1}^N \frac{\gamma _3 \beta _k}{2}\nonumber \\&\quad \le \sum _{k=\lfloor N/2 \rfloor +1}^N \frac{ \gamma _3 \beta _k}{2}\Vert g_{_{X,k}}\Vert ^2 \le \varPsi (x^{ag}_{\lfloor N/2 \rfloor }) -\varPsi (x^{ag}_N)\nonumber \\&\quad +\sum _{k=\lfloor N/2 \rfloor +1}^N \frac{1}{k} \le \varPsi (x_0) -\varPsi ^*+\sum _{k=\lfloor N/2 \rfloor +1}^N \frac{1}{k}, \end{aligned}$$
(2.32)

where the last inequality follows from (2.11) and the fact that \(\varPsi ^* \le \varPsi (x^{ag}_N)\). Dividing both sides of the above inequality by \(\gamma _3 \sum _{k=\lfloor N/2 \rfloor +1}^N \frac{\beta _k}{2}\), we clearly obtain (2.19).

We now show part c). By (2.3), (2.5), (2.6), and the convexity of f, for any \(x \in X\), we have

$$\begin{aligned} f({{\tilde{x}}}^{ag}_k)\le & {} f(x^{md}_k) + \alpha _k \langle f'(x^{md}_k), x_k- x_{k-1} \rangle +\frac{\alpha _k\Vert x_k- x_{k-1}\Vert ^2}{2\lambda _k}+\delta \alpha _k \nonumber \\= & {} f(x^{md}_k)+\langle f'(x^{md}_k), {{\tilde{x}}}^{ag}_k- x^{md}_k \rangle +\frac{\alpha _k\Vert x_k- x_{k-1}\Vert ^2}{2\lambda _k}+\delta \alpha _k \nonumber \\= & {} (1-\alpha _k)[f(x^{md}_k) + \langle f'(x^{md}_k), x^{ag}_{k-1}-x^{md}_k \rangle ] + \alpha _k [f(x^{md}_k) \nonumber \\+ & {} \langle f'(x^{md}_k), x_k-x^{md}_k \rangle ] + \frac{\alpha _k\Vert x_k- x_{k-1}\Vert ^2}{2\lambda _k}+\delta \alpha _k \nonumber \\\le & {} (1-\alpha _k)f(x^{ag}_{k-1})+\alpha _k f(x) +\alpha _k \langle f'(x^{md}_k), x_k-x\rangle +\frac{\alpha _k\Vert x_k- x_{k-1}\Vert ^2}{2\lambda _k}+\delta \alpha _k,\nonumber \\ \end{aligned}$$
(2.33)

which together with (2.11), and the convexity of \({\mathcal X}\) imply that

$$\begin{aligned}&\varPsi (x^{ag}_k) \le \varPsi ({{\tilde{x}}}^{ag}_k) = f({{\tilde{x}}}^{ag}_k)+{\mathcal X}({{\tilde{x}}}^{ag}_k) \nonumber \\&\quad \le (1-\alpha _k)f(x^{ag}_{k-1})+\alpha _k f(x) +\alpha _k \langle f'(x^{md}_k), x_k-x\rangle \nonumber \\&\qquad +\frac{\alpha _k\Vert x_k- x_{k-1}\Vert ^2}{2\lambda _k}+\delta \alpha _k+(1-\alpha _k) {\mathcal X}(x^{ag}_{k-1})+\alpha _k {\mathcal X}(x_k) \nonumber \\&\quad = (1-\alpha _k)\varPsi (x^{ag}_{k-1})+\alpha _k f(x) +\alpha _k \langle f'(x^{md}_k), x_k-x\rangle +\frac{\alpha _k\Vert x_k- x_{k-1}\Vert ^2}{2\lambda _k}+\delta \alpha _k+\alpha _k {\mathcal X}(x_k). \end{aligned}$$
(2.34)

Now, by (2.4) and similar to (2.22), we have

$$\begin{aligned} \langle f'(x^{md}_k), x_k - x \rangle + {\mathcal X}(x_k) \le {\mathcal X}(x) + \frac{1}{2\lambda _k} \left[ \Vert x_{k-1} - x\Vert ^2-\Vert x_k - x\Vert ^2 -\Vert x_k -x_{k-1}\Vert ^2 \right] . \end{aligned}$$

Multiplying both sides of the above inequality by \(\alpha _k\) and summing it up with (2.34), we obtain

$$\begin{aligned} \varPsi (x^{ag}_k) \le (1-\alpha _k)\varPsi (x^{ag}_{k-1})+\alpha _k \varPsi (x)+\frac{\alpha _k}{2\lambda _k} \left[ \Vert x_{k-1} - x\Vert ^2-\Vert x_k - x\Vert ^2\right] + \delta \alpha _k.\nonumber \\ \end{aligned}$$
(2.35)

Also, note that by (2.7) and (2.18), we can easily show that

$$\begin{aligned} \varGamma _k = \frac{\lambda _1}{\sum _{i=1}^k \lambda _i} \ \ \text{ and } \ \ \frac{\alpha _k}{\lambda _k \varGamma _k} = \frac{1}{\lambda _1} \ \ \ \forall k \ge 1. \end{aligned}$$
(2.36)

Subtracting \(\varPsi (x)\) from both sides of (2.35) and dividing them by \(\varGamma _k\), it then follows from Lemma 1 that for any \(x \in X\) we have

$$\begin{aligned} \frac{\varPsi (x^{ag}_N)-\varPsi (x)}{\varGamma _N}\le & {} \sum _{k=1}^N \frac{\alpha _k}{2 \lambda _k \varGamma _k} \left[ \Vert x_{k-1}-x\Vert ^2 -\Vert x_k-x\Vert ^2 \right] +\delta \sum _{k=1}^N \frac{\alpha _k}{\varGamma _k}\nonumber \\\le & {} \frac{\Vert x_0 - x\Vert ^2}{2 \lambda _1} +\frac{\delta }{\varGamma _N}, \end{aligned}$$
(2.37)

where the second inequality follows from (2.36) and the fact that

$$\begin{aligned} \sum _{k = 1}^N \frac{\alpha _k}{\varGamma _k} = \frac{\alpha _1}{\varGamma _1} + \sum _{k = 2}^N \frac{1}{\varGamma _k} \left( 1 - \frac{\varGamma _k}{\varGamma _{k-1}} \right) = \frac{1}{\varGamma _1} + \sum _{k = 2}^N \left( \frac{1}{\varGamma _k} - \frac{1}{\varGamma _{k-1}} \right) = \frac{1}{\varGamma _N}. \end{aligned}$$
(2.38)

Then, (2.20) follows immediately from (2.37) with \(x=x^*\). \(\square \)
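As a quick numerical sanity check (a sketch, not part of the analysis), the identities (2.36) and (2.38) can be verified for an arbitrary choice of positive stepsizes \(\lambda _k\), with \(\varGamma _k\) and \(\alpha _k\) defined accordingly:

```python
# Hypothetical positive stepsizes lambda_k; any choice verifies the identities.
lam = [0.7, 1.3, 0.4, 2.1, 0.9]
S = [sum(lam[:k + 1]) for k in range(len(lam))]   # partial sums of lambda_i
Gamma = [lam[0] / s for s in S]                   # (2.36): Gamma_k = lambda_1 / sum_{i<=k} lambda_i
alpha = [1.0] + [1 - Gamma[k] / Gamma[k - 1] for k in range(1, len(lam))]

# alpha_k / (lambda_k * Gamma_k) = 1 / lambda_1 for every k, as in (2.36)
for k in range(len(lam)):
    assert abs(alpha[k] / (lam[k] * Gamma[k]) - 1 / lam[0]) < 1e-12

# Telescoping sum (2.38): sum_k alpha_k / Gamma_k = 1 / Gamma_N
total = sum(alpha[k] / Gamma[k] for k in range(len(lam)))
assert abs(total - 1 / Gamma[-1]) < 1e-12
```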

In the next result we specify the convergence rates of the UPFAG algorithm.

Corollary 1

Let \(\{x^{ag}_k\}\) be the iterates generated by Algorithm 1. Suppose there exist some constants \(\lambda >0\) and \(\beta >0\) such that the initial stepsizes \({\hat{\lambda }}_k \ge \lambda \) and \({\hat{\beta }}_k \ge \beta \) for all \(k \ge 1\).

  1. a)

    Suppose that \(\varPsi \) is bounded below over X, i.e., \(\varPsi ^*\) is finite. Then, for any \(N \ge 1\), we have

    $$\begin{aligned} \min \limits _{k=1,\ldots ,N} \Vert g_{_{X,k}}\Vert ^2 \le \frac{8[\varPsi (x_0) - \varPsi ^*+1]}{\gamma _2} \left[ \frac{8 L(\nu ,H)}{ (2 - \gamma _3) N^{\frac{2\nu }{1+\nu }}}+\frac{1}{\beta N}\right] . \end{aligned}$$
    (2.39)
  2. b)

    Suppose that f is convex and an optimal solution \(x^*\) exists for problem (1.1). Then for any \(N \ge 1\), we have

    $$\begin{aligned} \varPsi (x^{ag}_N)-\varPsi (x^*)\le & {} \frac{4\Vert x_0 - x^*\Vert ^2}{\gamma _1 N^{\frac{1+3\nu }{1+\nu }} }\left[ \frac{2 L(\nu ,H)}{\delta ^{\frac{1-\nu }{1+\nu }}}+\frac{1}{\lambda }\right] +\delta . \end{aligned}$$
    (2.40)

Proof

Since \({\hat{\lambda }}_k \ge \lambda \) and \({\hat{\beta }}_k \ge \beta \) for all \(k \ge 1\), it follows from (2.2), (2.8), (2.30) and \(\eta _k = \alpha _k \lambda _k\) that

$$\begin{aligned} \beta _k \ge \gamma _2 \min \left\{ \frac{ (2 - \gamma _3) k^{\frac{\nu -1}{1+\nu }}}{2 L(\nu ,H)}, \beta \right\} \ \ \ \text{ and } \ \ \ \alpha _k^{\frac{2\nu }{1+\nu }} \lambda _k \ge \gamma _1 \min \left\{ \frac{ \delta ^{\frac{1-\nu }{1+\nu }}}{2 L(\nu ,H)}, \lambda \alpha _k^{\frac{\nu -1}{1+\nu }}\right\} ,\nonumber \\ \end{aligned}$$
(2.41)

which together with

$$\begin{aligned} \sum _{k=\lfloor N/2 \rfloor +1}^N k^{\frac{1-\nu }{1+\nu }}\le & {} \int _{x=\lfloor N/2 \rfloor +1}^{N+1} x^{\frac{1-\nu }{1+\nu }} dx \le 4N^{\frac{2}{1+\nu }},\nonumber \\ \sum _{k=\lfloor N/2 \rfloor +1}^N \frac{1}{k}\le & {} \int _{x=\lfloor N/2 \rfloor }^N \frac{dx}{x} = \ln \frac{N}{\lfloor N/2 \rfloor } \le 1, \end{aligned}$$
(2.42)

and the arithmetic-harmonic mean inequality imply that

$$\begin{aligned}&\sum _{k=\lfloor N/2 \rfloor +1}^N \beta _k \ge \sum _{k=\lfloor N/2 \rfloor +1}^N \gamma _2 \min \left\{ \frac{(2 - \gamma _3) k^{\frac{\nu -1}{1+\nu }}}{2 L(\nu ,H)}, \beta \right\} \nonumber \\&\qquad \ge \frac{\gamma _2 N^2}{4\sum _{k=\lfloor N/2 \rfloor +1}^N \max \left\{ \frac{2 L(\nu ,H)}{(2 - \gamma _3)k^{\frac{\nu -1}{1+\nu }}}, \frac{1}{\beta }\right\} } \nonumber \\&\qquad \ge \frac{\gamma _2 N^2}{4\sum _{k=\lfloor N/2 \rfloor +1}^N \left\{ 2(2 - \gamma _3)^{-1} L(\nu ,H) k^{\frac{1-\nu }{1+\nu }}+\beta ^{-1}\right\} } \nonumber \\&\qquad \ge \frac{\gamma _2}{4 \left( 8 (2 - \gamma _3)^{-1} L(\nu ,H) N^{\frac{-2\nu }{1+\nu }}+(\beta N)^{-1}\right) }. \end{aligned}$$
(2.43)

Combining the above relation with (2.19), we clearly obtain (2.39).
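For completeness, the arithmetic-harmonic mean step used in (2.43) is the elementary inequality, for positive numbers \(a_1, \ldots, a_n\),

```latex
\sum_{k=1}^n a_k \;\ge\; \frac{n^2}{\sum_{k=1}^n a_k^{-1}},
```

applied with \(a_k = \min \big\{ (2-\gamma _3) k^{\frac{\nu -1}{1+\nu }}/(2L(\nu ,H)),\, \beta \big\}\), so that \(a_k^{-1} = \max \big\{ 2L(\nu ,H) k^{\frac{1-\nu }{1+\nu }}/(2-\gamma _3),\, \beta ^{-1} \big\}\), together with the fact that the number of summands satisfies \(N - \lfloor N/2 \rfloor \ge N/2\).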

Now, observing (2.41) and the facts that \(\alpha _k \in (0,1]\) and \(\nu \in [0,1]\), we have

$$\begin{aligned} \alpha _k^{\frac{2\nu }{1+\nu }} \lambda _k \ge \gamma _1 \min \left\{ \frac{ \delta ^{\frac{1-\nu }{1+\nu }}}{2 L(\nu ,H)}, \lambda \right\} , \end{aligned}$$

which together with (2.36) implies that

$$\begin{aligned} \lambda _k \ge \left( \frac{\lambda _1}{\varGamma _k}\right) ^{\frac{2\nu }{1+3\nu }} \gamma _1^{\frac{1+\nu }{1+3\nu }} \min \left\{ \left( \frac{ \delta ^{\frac{1-\nu }{1+\nu }}}{2 L(\nu ,H)}\right) ^{\frac{1+\nu }{1+3\nu }}, \lambda ^{\frac{1+\nu }{1+3\nu }}\right\} . \end{aligned}$$

Using this observation and defining \(c=\frac{1+\nu }{1+3\nu }\), by (2.18) and (2.36) we have

$$\begin{aligned} \frac{1}{\varGamma _k^c} -\frac{1}{\varGamma _{k-1}^c} = \frac{1}{\varGamma _k^c} -\frac{(1-\alpha _k)^c}{\varGamma _k^c} \ge \frac{c \alpha _k}{\varGamma _k^c} =\frac{c \lambda _k \varGamma _k^{1-c}}{\lambda _1} \ge \frac{c \gamma _1^c}{\lambda _1^c} \min \left\{ \frac{\delta ^{\frac{1-\nu }{1+3\nu }}}{[2 L(\nu ,H)]^c}, \lambda ^c\right\} , \end{aligned}$$

where the first inequality follows from the fact that \(1-(1-\alpha )^c \ge c \alpha \) for all \(\alpha \in (0,1]\) and \(c \in [1/2,1]\). By the above inequality and noticing \(\varGamma _0=1\), similar to (2.43), for any \(N \ge 1\) we obtain

$$\begin{aligned} \frac{1}{\varGamma _N^c} \ge \frac{c \gamma _1^c N^2}{\lambda _1^c \sum _{k=1}^N \left\{ [2 L(\nu ,H)\delta ^{\frac{\nu -1}{1+\nu }}]^c+\lambda ^{-c}\right\} } \ge \frac{c \gamma _1^c N}{\lambda _1^c \left\{ [2 L(\nu ,H)\delta ^{\frac{\nu -1}{1+\nu }}]^c +\lambda ^{-c}\right\} }, \end{aligned}$$

which, together with the facts that \(c^{\frac{1}{c}} \ge 1/4\) and \((a+b)^{\frac{1}{c}} \le 2 (a^{\frac{1}{c}}+b^{\frac{1}{c}})\) for any \(a,b>0\), implies that

$$\begin{aligned} \varGamma _N \le \frac{8\lambda _1}{\gamma _1 N^{\frac{1+3\nu }{1+\nu }}}\left[ \frac{2 L(\nu ,H)}{ \delta ^{\frac{1-\nu }{1+\nu }}}+\frac{1}{\lambda }\right] . \end{aligned}$$

Combining the above relation with (2.20), clearly we obtain (2.40). \(\square \)
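The three elementary inequalities invoked in this proof, namely \(1-(1-\alpha )^c \ge c \alpha \), \(c^{1/c} \ge 1/4\), and \((a+b)^{1/c} \le 2 (a^{1/c}+b^{1/c})\) for \(c \in [1/2,1]\), can also be checked numerically; the following sketch scans a grid of values:

```python
# Grid check of the elementary facts used in the proof of Corollary 1(b):
#   (i)   1 - (1 - a)**c >= c * a                    for a in (0,1], c in [1/2,1]
#   (ii)  c**(1/c) >= 1/4                            for c in [1/2,1]
#   (iii) (a + b)**(1/c) <= 2*(a**(1/c) + b**(1/c))  for a, b > 0, c in [1/2,1]
alphas = [i / 100 for i in range(1, 101)]
cs = [0.5 + i / 100 for i in range(0, 51)]

for c in cs:
    assert c ** (1 / c) >= 0.25 - 1e-12                          # (ii)
    p = 1 / c                                                    # p in [1, 2]
    for a in alphas:
        assert 1 - (1 - a) ** c >= c * a - 1e-12                 # (i)
        for b in (0.3, 1.0, 4.2):
            assert (a + b) ** p <= 2 * (a ** p + b ** p) + 1e-9  # (iii)
```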

We now add a few remarks about the results obtained in Corollary 1. First, for simplicity let us assume \(\beta \ge (2 - \gamma _3)/ L(\nu , H)\). Then, by (2.39), we conclude that the number of iterations performed by the UPFAG method to have \(\Vert g_{_{X,k}}\Vert ^2 \le \epsilon \) for at least one k, after disregarding some constants, is bounded by

$$\begin{aligned} {\mathcal O} \left( H^{\frac{1}{\nu }} \left[ \frac{\varPsi (x_0) - \varPsi ^*}{\epsilon }\right] ^{\frac{1+\nu }{2\nu }}\right) . \end{aligned}$$
(2.44)

This bound can also be obtained by using a variable metric in (2.13) instead of the Euclidean one in (2.10). When \(G_k\) is uniformly bounded from above, i.e., \(G_k \preceq M I\) for some constant \(M >0 \), the projected gradients defined according to (2.10) and (2.13) are equivalent. Moreover, when \(\nu =1\), the above bound reduces to the best known iteration complexity for the class of nonconvex functions with Lipschitz continuous gradient, which is also achieved by the steepest descent method for unconstrained problems [23] and by the projected gradient method for composite problems in Ghadimi, Lan and Zhang [12]. In this case, a similar bound is also obtained by the AG method [10], which, however, for composite problems relies on the additional assumption that the iterates are bounded. In spite of this similarity in iteration complexity, the UPFAG method can outperform the other two methods in practice (see Sect. 4 for more details). On the other hand, one possible advantage of the AG method in [10] is that it can separate the effects of the Lipschitz constants of the smooth convex terms.

Second, by choosing \(\delta =\epsilon /2\) and assuming that \(\lambda \ge \delta ^{\frac{1-\nu }{1+\nu }}/L(\nu , H)\), (2.40) implies that the UPFAG method can find a solution \({{\bar{x}}}\) such that \(\varPsi ({{\bar{x}}})-\varPsi (x^*) \le \epsilon \), after disregarding some constants, in at most

$$\begin{aligned} {\mathcal O} \left( \left[ \frac{H\Vert x_0 - x^*\Vert ^{1+\nu }}{\epsilon }\right] ^{\frac{2}{1+3\nu }}\right) \end{aligned}$$
(2.45)

iterations, which is optimal for convex programming [21]. Note that (2.45) is of the same order of magnitude as the bound obtained by the universal fast gradient method proposed by Nesterov [25] for convex optimization problems. While this method also uses a line search procedure to specify stepsizes, it is not applicable to nonconvex optimization. On the other hand, the UPFAG method does not need to know a priori whether the objective function is convex: it treats convex and nonconvex optimization problems in a unified way. In any case, this method always achieves the complexity bound in (2.44), and when the objective function happens to be convex, it also achieves the optimal complexity bound in (2.45). This unified approach, as mentioned before, can help us to exploit the local convexity structure of general nonconvex problems and find a stationary point faster.
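To see how (2.45) interpolates between the classical complexity bounds recalled in the introduction, consider its two extreme cases:

```latex
\nu = 1:\ {\mathcal O}\left( \left[ \frac{H \Vert x_0-x^*\Vert^2}{\epsilon}\right]^{\frac{1}{2}}\right),
\qquad
\nu = 0:\ {\mathcal O}\left( \left[ \frac{H \Vert x_0-x^*\Vert}{\epsilon}\right]^{2}\right),
```

which match, respectively, the optimal \({\mathcal O}(1/\sqrt{\epsilon })\) rate for smooth convex problems and the optimal \({\mathcal O}(1/\epsilon ^2)\) rate for nonsmooth convex problems with bounded subgradients.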

Finally, it is interesting to bound the number of gradient computations at each iteration of Algorithm 1, assuming the initial stepsizes satisfy \({\hat{\lambda }}_k \ge \lambda \) and \({\hat{\beta }}_k \ge \beta \) for all \(k \ge 1\). According to (2.8) and (2.41), we conclude that, after disregarding some constants, \(\tau _{2,k} \le \log k\), which implies that the number of gradient computations at the point \(x^{ag}_{k-1}\) is bounded by \(\log k\). Similarly, the number of gradient computations at the point \(x^{md}_{k}\) is also bounded by \(\log k\). Hence, the total number of gradient computations at the k-th iteration is bounded by \(2 \log k\). On the other hand, suppose that we choose \({\hat{\beta }}_k\) and \({\hat{\lambda }}_k\) according to (2.15). Then, we have

$$\begin{aligned} \tau _{1,k} = (\log \eta _k-\log \eta _{k-1})/\log \gamma _1 \quad \text{ and } \quad \tau _{2,k} = (\log \beta _k-\log \beta _{k-1})/\log \gamma _2. \end{aligned}$$

So the number of gradient evaluations in steps 1 and 2 at the k-th iteration is bounded by

$$\begin{aligned} 1 + \tau _{1,k} = 1 + (\log \eta _k-\log \eta _{k-1})/\log \gamma _1 \quad \text{ and } \quad 1 + \tau _{2,k} = 1+ (\log \beta _k-\log \beta _{k-1})/\log \gamma _2, \end{aligned}$$

which implies that the total number of gradient evaluations in steps 1 and 2 is bounded by

$$\begin{aligned} N_\eta= & {} N + \sum _{k=1}^N \tau _{1,k} = N + \frac{\log \eta _N-\log \eta _0}{\log \gamma _1} \quad \text{ and } \\ N_\beta= & {} N + \sum _{k=1}^N \tau _{2,k} = N + \frac{\log \beta _N-\log \beta _0}{\log \gamma _2}. \end{aligned}$$

Note that (2.41) implies \(N_\eta \le N+c_1\) and \(N_\beta \le N+c_2 \log N\) for some positive constants \(c_1\) and \(c_2\). Hence, the above relations show that the average number of gradient computations per iteration is bounded by a constant, which improves on the aforementioned logarithmic bound \(\log k\) obtained when \({\hat{\beta }}_k\) and \({\hat{\lambda }}_k\) are chosen according to (2.2) and (2.8). However, (2.2) and (2.8) give the algorithm more freedom in choosing the initial stepsizes.
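The telescoping argument behind \(N_\eta \) and \(N_\beta \) can be illustrated with a small sketch; the backtracking counts below are hypothetical, and only the structure of (2.15), in which each backtrack multiplies the stepsize by a fixed factor \(\gamma _1 \in (0,1)\), matters:

```python
import math

# Hypothetical line-search trace: iteration k performs tau_k backtracks,
# and each backtrack multiplies the stepsize eta by gamma1 in (0, 1).
gamma1 = 0.5
eta = [1.0]                        # eta_0
tau = []
for k in range(1, 21):
    backtracks = 1 if k % 4 else 3  # arbitrary hypothetical counts
    tau.append(backtracks)
    eta.append(eta[-1] * gamma1 ** backtracks)

# tau_k = (log eta_k - log eta_{k-1}) / log gamma1, so the sum telescopes:
total = sum(tau)
telescoped = (math.log(eta[-1]) - math.log(eta[0])) / math.log(gamma1)
assert abs(total - telescoped) < 1e-9
```

so the total number of backtracks over N iterations depends only on the first and last stepsizes, not on each individual iteration.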

One possible drawback of the UPFAG method is that the accuracy \(\delta \) must be fixed before running the algorithm; if we want to change it, we have to run the algorithm from the beginning. Moreover, two line search procedures must be implemented to find \(\beta _k\) and \(\lambda _k\) in each iteration. In the next section, we address these issues by presenting problem-parameter free bundle-level type algorithms which do not require a fixed target accuracy in advance and perform only one line search procedure per iteration. These bundle-level type methods can sometimes outperform gradient type methods in practice (see Sect. 4 for more details).

3 Unified Bundle-Level Type Methods

Our goal in this section is to generalize bundle-level type methods, originally designed for convex programming, to a class of possibly nonconvex nonlinear programming problems. Specifically, in Sect. 3.1, we introduce a unified bundle-level type method by incorporating a local search step into an accelerated prox-level method and establish its convergence properties under the assumption that the feasible set is bounded. In Sect. 3.2, we simplify this algorithm and provide fast variants for solving ball-constrained problems and unconstrained problems with bounded level sets. To the best of our knowledge, this is the first time that bundle-level type methods have been proposed for solving a class of nonconvex optimization problems.

3.1 Unified Accelerated Prox-level Method

In this subsection, we generalize the accelerated prox-level (APL) method presented by Lan [15] to solve the class of nonlinear programming problems given in (1.1), where f has Hölder continuous gradient on X. Lan [15] showed that the APL method is uniformly optimal for solving problem (1.1) when f is convex and satisfies (1.2). Here, we combine the framework of this algorithm with a gradient descent step and present a unified accelerated prox-level (UAPL) method for solving both convex and nonconvex optimization problems.

As in bundle-level type methods, we introduce some basic definitions concerning the objective function and the feasible set. We first define a function \(h(y,x)\) for a given \(y \in X\) as

$$\begin{aligned} h(y,x)=f(y)+\langle f'(y), x-y \rangle +{\mathcal X}(x) \ \ \text{ for } \text{ any } x \in X. \end{aligned}$$
(2.1)

Note that if f is convex, then we have \(h(y,x) \le f(x)+{\mathcal X}(x)=\varPsi (x)\), and hence \(h(y,x)\) defines a lower bound on \(\varPsi (x)\). Also, let \({\mathcal S}_\varPsi (l)\) be the level set of \(\varPsi \) given by \({\mathcal S}_\varPsi (l)=\{x \in X : \varPsi (x) \le l\}\), and call a convex compact set \(X'\) a localizer of the level set \({\mathcal S}_\varPsi (l)\) if it satisfies \({\mathcal S}_\varPsi (l) \subseteq X' \subseteq X\). Then, it can be shown [16] that, when \(\varPsi \) is convex, \(\min \{l, {\underline{h}}(y) \} \le \varPsi (x)\) for any \(x \in X\), where

$$\begin{aligned} {\underline{h}}(y) = \min \{h(y,x) : x \in X'\}. \end{aligned}$$
(2.2)
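For a convex f, the fact that the cutting plane \(h(y,x)\) in (2.1) underestimates \(\varPsi \) can be checked numerically; the following sketch uses the illustrative one-dimensional choices \(f(x)=x^2\) and \({\mathcal X}(x)=|x|\) (these functions are not from the paper):

```python
# Cutting plane h(y, x) = f(y) + f'(y)(x - y) + X(x) for the illustrative
# convex choices f(x) = x**2 (so f'(x) = 2x) and X(x) = |x|.
def f(x):
    return x * x

def fprime(x):
    return 2 * x

def chi(x):
    return abs(x)

def h(y, x):
    return f(y) + fprime(y) * (x - y) + chi(x)

# For convex f, h(y, x) <= Psi(x) = f(x) + X(x) for all x and y.
pts = [i / 10 for i in range(-30, 31)]
for y in pts:
    for x in pts:
        assert h(y, x) <= f(x) + chi(x) + 1e-12
```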

Using the above definitions, we present a unified accelerated prox-level (UAPL) algorithm, Algorithm 2, for nonlinear programming.

figure b

We now make a few remarks about the above algorithm. First, note that the updating of \({{\bar{x}}}^{ag}_k\) in step 3 of the UAPL algorithm is essentially a gradient descent step; without this update, the UAPL algorithm would reduce to a simple variant of the APL method in [15] for convex programming. However, this update is required to establish convergence when the objective function is nonconvex. Second, the UAPL algorithm has two nested loops. Each run of the outer loop, called a phase, is counted by the index s, and the number of inner iterations within each phase is counted by the index t. If we make progress in reducing the gap between the lower and upper bounds on \(\varPsi \), we terminate the current phase (inner loop) in step 4 and start the next one with a new lower bound. As shown in [15], the number of steps in each phase is finite when \(\varPsi \) is convex. However, when \(\varPsi \) is nonconvex, \({{\underline{\varPsi }}}\) is not necessarily a lower bound on \(\varPsi \), and hence the termination criterion in step 4 may never be satisfied. In this case, we can still establish convergence of the algorithm in terms of the projected gradient defined in (2.16), owing to the gradient descent step incorporated in step 3.

The following lemma, from [15], states some properties of the UAPL algorithm, generalizing the prox-level method in [4].

Lemma 2

Assume that \(\varPsi \) is convex and bounded below over X, i.e., \(\varPsi ^*\) is finite. Then, the following statements hold for each phase of Algorithm 2.

  1. a)

    \(\{X'_t\}_{t \ge 0}\) is a sequence of localizers of the level set \({\mathcal S}_\varPsi (l)\).

  2. b)

    \({{\underline{\varPsi }}}_0 \le {{\underline{\varPsi }}}_1 \le \ldots \le {{\underline{\varPsi }}}_t \le \varPsi ^* \le {{\bar{\varPsi }}}_t \le \ldots \le {\bar{\varPsi }}_1 \le {{\bar{\varPsi }}}_0\) for any \(t \ge 1\).

  3. c)

    \(\emptyset \ne {\underline{X}}_t \subseteq {{\bar{X}}}_t\) for any \(t \ge 1\) and hence, step 5 is always feasible unless the current phase terminates.

Now, we can present the main convergence properties of the above UAPL algorithm.

Theorem 2

Let the feasible set X be bounded.

  1. a)

    Suppose \(\varPsi \) is bounded below over X, i.e., \(\varPsi ^*\) is finite. The total number of iterations performed by the UAPL method to have \(\Vert g_{_{X,k}}\Vert ^2 \le \epsilon \) for at least one k, after disregarding some constants, is bounded by (2.44).

  2. b)

    Suppose that f is convex and an optimal solution \(x^*\) exists for problem (1.1). Then, the number of phases performed by the UAPL method to find a solution \({{\bar{x}}}\) such that \(\varPsi ({{\bar{x}}})-\varPsi (x^*) \le \epsilon \), is bounded by

    $$\begin{aligned} S(\epsilon ) = \left\lceil \max \left\{ 0, \log _{\frac{1}{q}} \frac{H \max _{x,y \in X} \Vert x-y\Vert ^{1+\nu }}{(1+\nu )\epsilon }\right\} \right\rceil , \end{aligned}$$
    (3.5)

    where

    $$\begin{aligned} q = 1-\frac{1}{2}\min \{\eta , 1-\eta \}. \end{aligned}$$
    (3.6)

    In addition, by choosing \(\alpha _t = 2/(t+1)\) for any \( t \ge 1\), the total number of iterations to find the aforementioned solution \({{\bar{x}}}\) is bounded by

    $$\begin{aligned} S(\epsilon )+ \frac{1}{1-q^{\frac{2}{3\nu +1}}}\left( \frac{ 4 \sqrt{3} H \max _{x,y \in X} \Vert x-y\Vert ^{1+\nu }}{\eta \theta (1+\nu ) \epsilon }\right) ^{\frac{2}{3\nu +1}}. \end{aligned}$$
    (3.7)

Proof

First, note that part a) can be established by essentially following the same arguments as those in Theorem 1.b) and Corollary 1.a), due to step 3 of Algorithm 2. Second, due to the termination criterion in step 4 of Algorithm 2, for any phase \(s \ge 1\), we have

$$\begin{aligned} \varPsi (p_{s+1})-\mathrm{lb}_{s+1} \le q [\varPsi (p_s)-\mathrm{lb}_s], \end{aligned}$$

which, by induction together with the facts that \(\mathrm{lb}_1=h(p_0,p_1)\) and \(\mathrm{lb}_s \le \varPsi (x^*)\), clearly implies

$$\begin{aligned} \varPsi (p_s)-\varPsi (x^*) \le \left[ \varPsi (p_1)-h(p_0,p_1)\right] q^{s-1}. \end{aligned}$$
(3.8)

This relation, as shown in Theorem 4 of [15] for convex programming, implies (3.5). Third, (3.7) follows from (3.5) and [15, Proposition 2, Theorem 3]. \(\square \)
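For intuition, (3.5) can be read off from (3.8) as follows. Writing \(\Delta _1 := \varPsi (p_1)-h(p_0,p_1)\), the optimality gap falls below \(\epsilon \) once \(\Delta _1 q^{s-1} \le \epsilon \), i.e., after

```latex
s \;\ge\; 1 + \log_{\frac{1}{q}} \frac{\Delta_1}{\epsilon}
```

phases, and the Hölder condition (1.2) bounds the initial gap by \(\Delta _1 = f(p_1)-f(p_0)-\langle f'(p_0), p_1-p_0\rangle \le H \Vert p_1-p_0\Vert ^{1+\nu }/(1+\nu ) \le H \max _{x,y \in X} \Vert x-y\Vert ^{1+\nu }/(1+\nu )\), which gives (3.5).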

We now add a few remarks about the above results. First, note that the UAPL and UPFAG methods essentially have the same mechanism to ensure global convergence when problem (1.1) is nonconvex. To the best of our knowledge, this is the first time that a bundle-level type method has been proposed for solving a class of possibly nonconvex nonlinear programming problems. It should also be mentioned that similar complexity results can be obtained by using (2.13) instead of (2.10) in step 3 of the UAPL method. Second, note that the bound in (3.7) is of the same order of magnitude as the optimal bound in (2.45) for convex programming. However, to obtain this bound, we need the boundedness assumption on the feasible set X, although we do not need the target accuracy a priori. Third, parts a) and b) of Theorem 2 imply that the UAPL method can uniformly solve weakly smooth nonconvex and convex problems without requiring any problem parameters. In particular, it achieves the best known convergence rate for nonconvex problems, and its convergence rate is optimal if the problem turns out to be convex.

Finally, in steps 1 and 2 of the UAPL algorithm, we need to solve two subproblems which can be time consuming. Moreover, to establish the convergence of this algorithm, we need the boundedness assumption on X as mentioned above. In the next subsection, we address these issues by exploiting the framework of another bundle-level type method which can significantly reduce the iteration cost.

3.2 Unified Fast Accelerated Prox-level Method

In this subsection, we aim to simplify the UAPL method for solving ball-constrained problems and unconstrained problems with bounded level sets. Recently, Chen et al. [7] presented a simplified variant of the APL method, namely the fast APL (FAPL) method, for ball-constrained convex problems. They showed that the number of subproblems per iteration is reduced from two to one and presented an exact method to solve the remaining subproblem.

In this subsection, we first generalize the FAPL method to ball-constrained nonconvex problems and then discuss how to extend it to unconstrained problems with bounded level sets. It should be mentioned that throughout this subsection, we assume that the simple nonsmooth convex term vanishes in the objective function, i.e., \({\mathcal X}\equiv 0\) in (1.1). Below, we present the unified FAPL (UFAPL) method to solve problem (1.1) with the ball constraint \(X = {\mathcal B}({{\bar{x}}}, R)\).

We now add a few remarks about this algorithm. First, note that we do not need to solve the subproblem (2.2) in the UFAPL method. Moreover, the subproblem (3.3) in the UFAPL method amounts to projecting \({\bar{x}}\) onto a closed polyhedron, and quite efficient methods exist for performing such a projection (see e.g., [7, 13]). When (2.13) is used and \(G_k = I\), the subproblem associated with finding \({{\bar{x}}}_k^{ag}\) in step 3 of the UFAPL method has a closed-form solution. Hence, there is only one subproblem to be solved in each iteration of the UFAPL method, and this subproblem can be solved quite efficiently. In addition, since the objective function may be nonconvex, when a phase s starts in the UAPL or UFAPL method, the lower bound \({\underline{\varPsi }}_0\) could even be larger than the upper bound \({\bar{\varPsi }}_0\). When this happens, from a practical efficiency point of view, one may simply restart the algorithm at the current iterate \(p_s\), or may even perform only the local nonlinear search iterations starting at \(p_s\).

Second, note that the UFAPL algorithm can be terminated in step 3 or step 4. Moreover, by combining the convergence results in [7] with techniques similar to those used in proving Theorem 1, we can establish complexity results for the UFAPL method similar to those in Theorem 2. For brevity, we do not repeat these arguments here.

Finally, we can extend the above results for the UFAPL method to unconstrained problems. In particular, suppose the level set

$$\begin{aligned} {\mathcal S}_0 = \{x \in {\mathbb {R}}^n \ \ | \varPsi (x) \le \varPsi (x_0) \}, \end{aligned}$$

is bounded, where \(x_0\) is the initial point for the UFAPL method. Now, consider the ball-constrained problem

$$\begin{aligned} \min _{x \in {\mathcal B} (x_0, R)} \varPsi (x), \end{aligned}$$
(3.9)

where \(R = \max _{x,y \in {\mathcal S}_0} \Vert x-y\Vert +\delta \) for a given \(\delta >0\). To solve this problem, we can apply the UFAPL method with small modifications. Specifically, we use \(X={\mathbb {R}}^n\) to find \({{\bar{x}}}^{ag}_k\) in step 3 of this method. Now, let \(\{x^{ag}_k\}_{k \ge 1}\) be generated by this modified UFAPL method. By steps 3 and 4 of this method, we clearly have \(\varPsi (x^{ag}_k) \le \varPsi (x^{ag}_{k-1})\) for all \(k \ge 1\), which implies that \(x^{ag}_k \in {\mathcal S}_0\) for all \(k \ge 1\). Hence, all generated points \(\{x^{ag}_k\}_{k \ge 1}\) lie in the interior of the aforementioned ball \({\mathcal B} (x_0, R)\) due to \(\delta >0\), which consequently implies that the optimal solution of problem (3.9) is the same as that of the unconstrained problem \(\min _{x \in {\mathbb {R}}^n} \varPsi (x)\). Therefore, under the boundedness assumption on \({\mathcal S}_0\), we can apply the above modified UFAPL method to the ball-constrained problem (3.9) in order to solve the original unconstrained problem.

Table 1 Average required number of iterations (Iter(k)), runtime (T(s)), objective value, and classification error found till reaching a desired accuracy for \(\Vert g_{_X} ({\bar{x}}^*)\Vert ^2 \) over 10 randomly generated instances of the SVM problem with the sigmoid loss function
Table 2 Required number of iterations (Iter(k)), runtime (T(s)), objective value, and classification error found till reaching a desired accuracy for \(\Vert g_{_X} ({\bar{x}}^*)\Vert ^2\) over real data of the SVM problem with the sigmoid loss function
Table 3 Average required number of iterations (Iter(k)), runtime (T(s)), objective value, and classification error found till reaching a desired accuracy for \(\Vert g_{_X} ({\bar{x}}^*)\Vert ^2 \) over 10 randomly generated instances of the SVM problem with the squared hinge loss function
Table 4 Average required number of iterations (Iter(k)), runtime (T(s)), objective value, and classification error found till reaching a desired accuracy for \(\Vert g_{_X} ({\bar{x}}^*)\Vert ^2 \) over 10 randomly generated instances of the SVM problem with the squared hinge loss function
Table 5 Required number of iterations (Iter(k)), runtime (T(s)), objective value, and classification error found till reaching a desired accuracy for \(\Vert g_{_X} ({\bar{x}}^*)\Vert ^2 \) over data set a6a of the SVM problem with the squared hinge loss function
Table 6 Required number of iterations (Iter(k)), runtime (T(s)), objective value, and classification error found till reaching a desired accuracy for \(\Vert g_{_X} ({\bar{x}}^*)\Vert ^2 \) over data set w1a of the SVM problem with the squared hinge loss function
figure c

4 Numerical Experiments

In this section, we show the performance of our algorithms for solving two support vector machine (SVM) problems. In particular, given data points \(\{(u_i, v_i)\}_{i=1}^m\) we consider the SVM problem defined as

$$\begin{aligned} \min _{\Vert x\Vert \le a} \left\{ \varPsi (x) :=\frac{1}{m}\sum _{i=1}^m l(x;u_i,v_i) + \frac{ \lambda }{2} \Vert x\Vert _2^2\right\} \end{aligned}$$
(4.1)

for some \(\lambda >0\), where \(l(x;u_i,v_i)\) is either the nonconvex sigmoid loss function [20], i.e.,

$$\begin{aligned} l(x;u_i,v_i) = 1 - \tanh \left( v_i \langle x, u_i \rangle \right) , \end{aligned}$$

or the convex squared hinge loss, i.e.,

$$\begin{aligned} l(x;u_i,v_i) = \max \left\{ 0, 1 - v_i \langle x, u_i \rangle \right\} ^2. \end{aligned}$$

Note that the above loss functions are smooth and have Lipschitz continuous gradients. Hence, (4.1) fits into the setting of problem (1.1) with \({\mathcal X}\equiv 0\).
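As an illustration, the objective in (4.1) with either loss can be written down directly; the following sketch uses a tiny hypothetical data set and plain Python lists (the actual experiments use the data sets described below):

```python
import math

def sigmoid_loss(x, u, v):
    # nonconvex sigmoid loss: 1 - tanh(v * <x, u>)
    return 1 - math.tanh(v * sum(xi * ui for xi, ui in zip(x, u)))

def squared_hinge_loss(x, u, v):
    # convex squared hinge loss: max(0, 1 - v * <x, u>)**2
    return max(0.0, 1 - v * sum(xi * ui for xi, ui in zip(x, u))) ** 2

def svm_objective(x, data, lam, loss):
    # Psi(x) = (1/m) * sum_i loss(x; u_i, v_i) + (lam / 2) * ||x||^2
    reg = 0.5 * lam * sum(xi * xi for xi in x)
    return sum(loss(x, u, v) for u, v in data) / len(data) + reg

# tiny hypothetical instance: two labeled points in R^2
data = [([1.0, 0.0], 1), ([0.0, 1.0], -1)]
x = [0.5, -0.5]
# both margins equal 0.5, so the mean squared hinge loss is (1 - 0.5)**2 = 0.25,
# and the regularizer is 0.5 * 0.5 * 0.5 = 0.125
assert abs(svm_objective(x, data, 0.5, squared_hinge_loss) - 0.375) < 1e-12
```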

Here, we use both real and synthetic data sets. For the former, we use some data sets from LIBSVM at http://www.csie.ntu.edu.tw/~cjlin/libsvm. For the latter, we assume that each data point \((u_i, v_i)\) is drawn from the uniform distribution on \([0,1]^n \times \{-1,1\}\), where \(u_i \in {\mathbb {R}}^n\) is the feature vector and \(v_i \in \{-1,1\}\) denotes the corresponding label. Moreover, we assume that \(u_i\) is sparse with \(5\%\) nonzero components and \(v_i = \text {sign} \left( \langle {\bar{x}}, u_i \rangle \right) \) for some \({\bar{x}} \in {\mathbb {R}}^n\) with \(\Vert {{\bar{x}}}\Vert \le a\). We consider two problem sizes, \(n=2000\) with \(m=1000\) and \(n=8000\) with \(m=4000\).

In our experiments, we set \(\lambda =1/m\) and \(a=50\). The initial point is randomly chosen within the ball of radius a centered at the origin. For the synthetic data sets, we report the average results of running the different algorithms over 10 instances for each problem size, using a large sample size of \(K=10000\). Moreover, in order to further assess the quality of the generated solutions, we also report the classification error of the classifier x, given by

$$\begin{aligned} er(x):=\frac{|\left\{ i: v_i \ne \text {sign} (\langle x, u_i \rangle ), i = 1, \ldots , K \right\} |}{K}. \end{aligned}$$
(4.2)
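The error measure (4.2) is straightforward to implement; the sample below is hypothetical, and the convention that a zero inner product counts as the label \(-1\) is an assumption of this sketch:

```python
def sign(t):
    # sign convention of this sketch: sign(0) = -1
    return 1 if t > 0 else -1

def classification_error(x, sample):
    # er(x): fraction of points whose label disagrees with sign(<x, u_i>)
    wrong = sum(1 for u, v in sample
                if v != sign(sum(xi * ui for xi, ui in zip(x, u))))
    return wrong / len(sample)

# hypothetical sample of four labeled points in R^2
sample = [([1.0, 0.0], 1), ([0.0, 1.0], 1), ([1.0, 1.0], -1), ([-1.0, 0.0], -1)]
x = [1.0, -1.0]
# <x, u> takes the values 1, -1, 0, -1; only the second point is misclassified
# (the third has <x, u> = 0, labeled -1 under this convention, matching v = -1).
assert classification_error(x, sample) == 0.25
```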

Real data sets are divided into two categories: training sets are used to form the loss functions, and the results are evaluated on the test sets. Tables 1, 2, 3, 4, 5 and 6 summarize the results of implementing the following algorithms. The LBFGS method only performs the local quasi-Newton search step (2.13), where \(G_k\) is the LBFGS [26, 27] matrix, without incorporating the acceleration techniques for convex optimization. The UPFAG-LBFGS, UPFAG-fullBB, and UPFAG-GradDescent methods are different variants of Algorithm 1. In particular, the UPFAG-LBFGS method applies the LBFGS local search (2.13) and uses the Barzilai-Borwein initial stepsizes (2.12). The UPFAG-fullBB and UPFAG-GradDescent methods both apply the gradient projection step (2.10), while the former uses the Barzilai-Borwein initial stepsizes (2.12) and the latter uses the stepsize policy in (2.15). The UAPL and UFAPL methods are also equipped with the LBFGS quasi-Newton step (2.13). In all the above algorithms, step (2.13), if used, is only solved inexactly according to (2.14) with \(\sigma = 10^{-10}\). In addition to the above algorithms, when the convex loss function is used, we also implement two well-known optimal methods for convex optimization, namely, a variant of the accelerated gradient (AG) method in [11] and the FAPL method in [7]. To make a fair comparison, the generalized projected gradient (2.16) with \(\beta _k=1\) is always used in the stopping condition for all the compared algorithms.

The following observations can be made from the numerical results. First, for the nonconvex problems, the UPFAG-GradDescent method performs the worst among all the compared algorithms. Second, the UPFAG-LBFGS method outperforms the other two variants of Algorithm 1, showing the effectiveness of the quasi-Newton step (2.13) in comparison to the gradient descent step in (2.10). The LBFGS method is also usually worse than the UPFAG-LBFGS method. Third, the UPFAG-LBFGS and UFAPL methods have the best and most stable performance over all instances. While both methods have comparable running times, the UFAPL method sometimes outperforms the UPFAG-LBFGS method in terms of the number of iterations. The UAPL method is also usually comparable to the UFAPL method in terms of the number of iterations; however, its running time is much worse, as expected, due to the two subproblems solved at each iteration. It is worth noting that the UFAPL method simplifies the subproblems of the UAPL method, and hence reduces its running time, only when the feasible region is a ball or the problem is unconstrained.

In addition, similar observations can be made about the above-mentioned algorithms when the convex loss function is used. Furthermore, it can be observed that, by incorporating local quasi-Newton steps, the unified algorithms presented here perform significantly better than the two optimal methods even for solving convex problems, which shows the practical importance of the local search steps (2.10) and (2.13), especially in a large data setting. Finally, the bundle-level type methods often outperform the accelerated gradient type methods in terms of the number of iterations for solving both convex and nonconvex models, indicating the potential of these methods for situations where the computation of gradients is more expensive than the solution of the projection subproblems. This observation has also been made for convex problems without using the local search step (see e.g., [7, 15]).

5 Concluding Remarks

In this paper, we extend the framework of uniformly optimal algorithms, currently designed for convex programming, to nonconvex nonlinear programming. In particular, by incorporating a gradient descent step into the framework of uniformly optimal convex programming methods, namely, accelerated gradient and bundle-level type methods, and enforcing the function values evaluated at each iteration of these methods to be non-increasing, we present unified algorithms for minimizing composite objective functions given by the sum of a function f with Hölder continuous gradient and a simple convex term over a convex set. We show that these algorithms exhibit the best known convergence rate when f is nonconvex and possess the optimal convergence rate if f turns out to be convex. Therefore, these algorithms allow us to treat nonlinear programming problems in a unified way regardless of their smoothness level and convexity. Furthermore, we show that the practical performance of these algorithms can often be greatly improved by incorporating quasi-Newton local search steps. Preliminary numerical experiments are presented to show the performance of the developed algorithms on a couple of important nonlinear programming problems arising from machine learning.