1 Introduction

1.1 Problem description

In this paper, we are interested in solving the following constrained optimization problem, whose objective function F may contain either or both of \(f_0\) and f:

$$\begin{aligned} \min _{x\in X}\ \{F(x):=f_0(x)+f(x)\},\quad \text {s.t.}\quad g_i(x) \le 0,\ \forall \, i = 1,\ldots ,n_c. \end{aligned}$$
(1)

Here X is a compact convex set in \(\mathbb {R}^{d_X}\), the functions \(f_0,g_i:X\rightarrow \mathbb {R}\) are proper, closed, and convex, and there exist \(L_{0}, L_{i}>0\) and \(\rho _{0} , \rho _{i}\in [0,1]\) such that

$$\begin{aligned} f_0(\bar{x})-f_0(x)-\langle \xi (x), \bar{x}-x \rangle&\le \frac{L_{0}}{1+\rho _{0} } \Vert \bar{x}-x\Vert ^{1+\rho _{0} }, \quad \forall \, x,\bar{x} \in X, \end{aligned}$$
(2)
$$\begin{aligned} g_i(\bar{x})-g_i(x)- \langle \zeta _i(x), \bar{x}-x \rangle&\le \frac{L_{i}}{1+\rho _{i}} \Vert \bar{x}-x\Vert ^{1+\rho _{i}}, \quad \forall \, x,\bar{x} \in X, \quad i=1,\ldots , n_c. \end{aligned}$$
(3)

In (1), f is a special max-type (possibly nonsmooth) function:

$$\begin{aligned} f(x):=\max \limits _{y\in Y}\ \{\left\langle Ax,y\right\rangle -\chi (y)\}, \end{aligned}$$
(4)

where \(Y\subseteq \mathbb {R}^{d_Y}\) is a compact convex set, \(\chi : Y\rightarrow \mathbb {R}\) is a proper, closed convex function, and \(A:\mathbb {R}^{d_X}\rightarrow \mathbb {R}^{d_Y}\) is a linear operator (such as a matrix). In (2)–(4), \(\Vert \cdot \Vert\) is the standard Euclidean norm, \(\langle \cdot , \cdot \rangle\) is the Euclidean inner product, and \(\xi (x)\in \partial f_0(x)\) and \(\zeta _i(x)\in \partial g_i(x)\) are any (sub)gradients of \(f_0\) and \(g_i\) at x, respectively. The conditions given in (2) and (3) allow us to present our algorithm and convergence results in a unified framework with \(f_0\) and the \(g_i\)’s of any smoothness level \(\rho _{0} ,\rho _{i} \in [0,1]\). More precisely, \(f_0\) is nonsmooth if \(\rho _{0} =0\), weakly smooth if \(\rho _{0} \in (0,1)\), and smooth if \(\rho _{0} =1\). Similar statements hold for the constraint functions \(g_i\)’s. For example, if \(\rho _{0}=0\) but \(\rho _{i}=1\) for all \(i=1,\ldots ,n_c\), then we are dealing with an instance of problem (1) that has a nonsmooth objective function F and smooth constraints \(g_i\). Moreover, if \(\rho _{0}=1\) (or \(f_0\) is not present), then we can show that the problem with a nonsmooth max-type f given in (4) can be handled with a better iteration complexity than that obtained by treating f as a generic nonsmooth objective function.
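To make the smoothness levels concrete, consider the scalar function \(f_0(t)=|t|^{1+\rho }/(1+\rho )\) with \(\rho \in (0,1)\); its derivative \(\mathrm {sign}(t)|t|^{\rho }\) is Hölder continuous with exponent \(\rho\), so an inequality of the form (2) holds with, e.g., \(L_0=2\) (a safe but not tight constant). The following short Python check is our own illustration and not part of the paper:

```python
import numpy as np

# Weakly smooth scalar example: f0(t) = |t|^{1+rho}/(1+rho) with rho in (0,1).
# Its derivative sign(t)|t|^rho is Hoelder continuous with exponent rho, so the
# Hoelder-type inequality (2) holds; L0 = 2 is a safe (not tight) constant here.
rho = 0.5
f  = lambda t: np.abs(t) ** (1 + rho) / (1 + rho)
df = lambda t: np.sign(t) * np.abs(t) ** rho

rng = np.random.default_rng(0)
x, xbar = rng.uniform(-2, 2, size=(2, 10000))
lhs = f(xbar) - f(x) - df(x) * (xbar - x)                 # left-hand side of (2)
rhs = 2.0 / (1 + rho) * np.abs(xbar - x) ** (1 + rho)     # right-hand side with L0 = 2
assert np.all(lhs <= rhs + 1e-12)
```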

The goal of this paper is to develop an algorithm that solves problem (1) uniformly for smooth, weakly smooth, and nonsmooth \(f_0\) and \(g_i\), where the whole or part of the objective function F can be written as the structured, possibly nonsmooth f in (4). This type of problem has extensive real-world applications in signal/image processing [2, 24], tensor decomposition [15, 26], overlapped group lasso [10, 18], graph regularization [25], etc. To cope with the special structure of f, we consider a prox-function (also known as a distance generating function) \(v:Y\rightarrow \mathbb {R}\), which is strongly convex with modulus \(\sigma _v\). Then the Bregman divergence associated with this v is defined by

$$\begin{aligned} V(y):=v(y) - v(c_v) - \left\langle \nabla v(c_v), y - c_v\right\rangle , \end{aligned}$$
(5)

where \(c_v:={\mathrm{arg\,min}}_{y\in Y}v(y)\). By employing Nesterov’s smoothing technique [22], we can approximate \(f(\cdot )\) by the smooth function \(f_{\eta }\) defined as follows:

$$\begin{aligned} f_{\eta }(x):= \max _{y\in Y}\{\left\langle Ax,y\right\rangle - \chi (y) - \eta V(y)\}, \end{aligned}$$
(6)

where \(\eta\) is called the smoothing parameter. It is shown in [22] that \(f_{\eta }(x)\) is differentiable and has Lipschitz continuous gradient \(\nabla f_{\eta }\) with Lipschitz constant

$$\begin{aligned} L_{\eta }:={\Vert A\Vert ^2}/({\eta \sigma _v}), \end{aligned}$$
(7)

where \(\Vert A\Vert\) is the operator norm of A. Moreover, the “closeness” of \(f_{\eta }(\cdot )\) to \(f(\cdot )\) depends linearly on the smoothing parameter \(\eta\). More precisely,

$$\begin{aligned} f_{\eta }(x)\le f(x)\le f_{\eta }(x)+\eta D_{v,Y},\ \forall x\in X, \end{aligned}$$
(8)

where the “diameter” of Y under the Bregman distance induced by v is defined as

$$\begin{aligned} D_{v,Y}:=\max _{y,z\in Y}\{v(y)-v(z)-\left\langle \nabla v(z),y-z\right\rangle \}. \end{aligned}$$
(9)
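To make (4)–(9) concrete, consider the structured function \(f(x)=\Vert Ax-b\Vert _1=\max _{\Vert y\Vert _\infty \le 1}\{\langle Ax,y\rangle -\langle b,y\rangle \}\) (the same structure used later in Sect. 4.2) with the prox-function \(v(y)=\frac{1}{2}\Vert y\Vert ^2\), so that \(\sigma _v=1\), \(c_v=0\), and \(D_{v,Y}=2m\) for \(Y=[-1,1]^m\). In this case the maximizer in (6) has a closed form and \(f_{\eta }\) is the familiar Huber-type approximation of the \(\ell _1\) norm; the following Python sketch (our illustration, not code from the paper) evaluates \(f_{\eta }\) and \(\nabla f_{\eta }\):

```python
import numpy as np

def smoothed_l1(A, b, x, eta):
    """Nesterov smoothing (6) of f(x) = ||Ax - b||_1 = max_{||y||_inf<=1} <Ax - b, y>
    with v(y) = ||y||^2/2 (sigma_v = 1, c_v = 0).  Returns f_eta(x) and grad f_eta(x)."""
    r = A @ x - b
    y = np.clip(r / eta, -1.0, 1.0)             # maximizer of <r, y> - (eta/2)||y||^2 over [-1,1]^m
    f_eta = r @ y - 0.5 * eta * np.sum(y ** 2)  # by (8): f_eta(x) <= f(x) <= f_eta(x) + eta * D_vY
    grad = A.T @ y                              # grad f_eta is Lipschitz with constant ||A||^2/eta, cf. (7)
    return f_eta, grad

# Usage: with D_vY = 2m, choosing eta = epsilon / (4m) keeps the smoothing error below epsilon/2.
A = np.random.randn(50, 100); b = np.random.randn(50); x = np.zeros(100)
val, grad = smoothed_l1(A, b, x, eta=1e-2)
```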

In this paper, we assume that (1) has at least one solution; we let \(x^*\) denote any solution of (1) and \(F^*:=F(x^*)\) be the optimal objective function value. We also assume that there exists a first-order oracle that computes \(f_0(x)\), f(x) and \(g_i(x)\), as well as some \(\xi _0(x)\in \partial f_0(x)\), \(\xi (x)\in \partial f(x)\) and \(\zeta _i(x)\in \partial g_i(x)\), for any given \(x\in X\). Then our goal is, for any prescribed tolerance \(\epsilon >0\), to find an \(\epsilon\)-solution \(x_{\epsilon }\) to (1) such that

$$\begin{aligned} F(x_{\epsilon }) - F^* \le \epsilon \quad \text {and} \quad g_i(x_{\epsilon })\le \epsilon ,\ \forall \, i=1,\ldots ,n_c. \end{aligned}$$
(10)

The functional constrained problem (1) is considered very challenging, especially if the \(g_i\)’s are not simple and/or the projection onto the feasible set \(\{x\in X: g_i(x)\le 0,\ \forall \,i\}\) is difficult to compute [1]. In recent years, we have witnessed a fast development of level-bundle methods for solving (1), which employ historical information and bundle management techniques in a sophisticated manner to achieve very promising efficiency. For a series of important works on level-bundle methods, we refer to [8, 11, 13, 17, 27].

For notational simplicity, we present the proposed level-bundle methods with only one constraint \(g(x)\le 0\), as it is straightforward to rewrite our results for the multiple-constraint case: for example, the improvement function \(h(x,L):=\max \{f(x)-L,g(x)\}\) with one constraint becomes \(h(x,L):=\max \{f(x)-L,g_1(x),\ldots ,g_{n_c}(x)\}\) with multiple constraints, etc. Note that \(\rho _{g}\) in the present work is interpreted as \(\min _i\{\rho _{i}\}\), i.e., it is only bottlenecked by the least smooth constraint function among the \(g_i\)’s. This is in contrast to existing level-bundle methods designed to work for nonsmooth constraints, where multiple constraints \(g_i \le 0\) for all i are reduced to a single constraint \(g(x):=\max _{i}\{g_i(x)\}\le 0\). As a consequence, g can be nonsmooth (\(\rho _{g}=0\)) even if all \(g_i\)’s are smooth (\(\min _i\{\rho _{i}\}=1\)). This artificial reduction from the \(g_i\)’s to g makes no difference for those methods, but yields a worse iteration complexity bound than treating the \(g_i\)’s separately as in our method.

1.2 Related work

In contrast to widely used gradient-descent methods and their numerous variants, level-bundle methods are based on a very different approach: they find an \(\epsilon\)-solution by tightening the gap between upper and lower bounds of the optimal objective value. The so-called cutting plane model plays an important role in generating those bounds in level-bundle methods. For the following convex programming (CP) problem

$$\begin{aligned} f^*_X:=\min _{x\in X}f(x), \end{aligned}$$
(11)

with given \(x_1,x_2,\ldots ,x_k\in X\), the cutting plane model is defined by

$$\begin{aligned} m_k^f(x) := \max \{\ell _f(x_i,x),1\le i\le k\}, \end{aligned}$$
(12)

where a cutting plane \(\ell _f(z,\cdot )\) of f at z is defined by

$$\begin{aligned} \ell _f(z,x) := f(z) + \langle \xi (z), x-z \rangle \end{aligned}$$
(13)

for some \(\xi (z) \in \partial f(z)\). Therefore (12) bounds the objective function \(f(\cdot )\) from below due to the convexity of f. The classical level-bundle method proposed by Lemaréchal, Nemirovskii and Nesterov [17] defines the basic framework of level-bundle methods: given \(x_1,x_2,\ldots ,x_k\), it performs the following three steps in each iteration:

  1. a.

    Set \(\overline{f}_k:=\min \{f(x_i),1\le i\le k\}\) and compute \(\underline{f}_k = \min _{x\in X} m_k^f(x)\) as the upper and lower bounds of \(f_X^*\), respectively.

  2. b.

    Set the level \(l_k = \beta \underline{f}_k + (1-\beta )\overline{f}_k\) for some \(\beta \in (0,1)\).

  3. c.

    Set \(X_k:=\{x\in X: m_k^f(x)\le l_k\}\) and determine a new iterate by solving

    $$\begin{aligned} x_{k+1} = {\mathrm{arg\,min}}_{x \in X_k} \Vert x - x_k\Vert ^2. \end{aligned}$$
    (14)

We can see that the upper bounds \(\{\overline{f}_k\}\) and lower bounds \(\{\underline{f}_k\}\) on \(f^*_X\) are monotonically non-increasing and non-decreasing, respectively, so the gap between them is tightened. If the termination condition is set to \(\overline{f}_k - \underline{f}_k\le \epsilon\), where \(\overline{f}_k=f(x_{i_k})\) for some \(i_k\le k\), then \(0\le f(x_{i_k})-f_X^*\le \overline{f}_k - \underline{f}_k\le \epsilon\), which means that \(x_{i_k}\) is an \(\epsilon\)-solution of (11).
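For readers who prefer code, the following minimal Python sketch (our own illustration, not an implementation from [17]) carries out steps a–c above for a simple smooth objective on a box, using generic LP/QP solvers from SciPy for the model minimization and the projection (14):

```python
import numpy as np
from scipy.optimize import linprog, minimize

def classical_level_bundle(f_oracle, d, lo=-1.0, hi=1.0, beta=0.5, eps=1e-4, max_iter=100):
    """Classical level-bundle method on X = [lo, hi]^d; f_oracle(x) -> (f(x), subgradient)."""
    x = np.zeros(d)
    cuts = []                                        # bundle: list of (x_i, f(x_i), xi_i)
    f_best, x_best = np.inf, x.copy()
    for _ in range(max_iter):
        fx, gx = f_oracle(x)
        cuts.append((x.copy(), fx, gx.copy()))
        if fx < f_best:
            f_best, x_best = fx, x.copy()            # step a: upper bound \bar f_k
        # step a: lower bound \underline f_k = min_{x in X} m_k^f(x), an LP in (x, t)
        A_ub = np.array([np.append(gi, -1.0) for _, _, gi in cuts])
        b_ub = np.array([gi @ xi - fi for xi, fi, gi in cuts])
        lp = linprog(np.append(np.zeros(d), 1.0), A_ub=A_ub, b_ub=b_ub,
                     bounds=[(lo, hi)] * d + [(None, None)], method="highs")
        f_low = lp.x[-1]
        if f_best - f_low <= eps:
            return x_best, f_best, f_low
        level = beta * f_low + (1 - beta) * f_best   # step b
        # step c: project x_k onto the level set {x in X : m_k^f(x) <= level}, cf. (14)
        cons = [{"type": "ineq",
                 "fun": lambda z, xi=xi, fi=fi, gi=gi: level - (fi + gi @ (z - xi))}
                for xi, fi, gi in cuts]
        qp = minimize(lambda z: np.sum((z - x) ** 2), x, jac=lambda z: 2 * (z - x),
                      bounds=[(lo, hi)] * d, constraints=cons, method="SLSQP")
        x = qp.x
    return x_best, f_best, f_low

# Example usage: minimize ||x - c||^2 over the box [-1, 1]^5.
c = np.array([0.3, -0.7, 0.5, 0.0, 0.9])
x_best, ub, lb = classical_level_bundle(lambda x: (np.sum((x - c) ** 2), 2 * (x - c)), d=5)
```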

In recent years, there has been increasing research interest in improving the iteration complexity of level-bundle type methods for solving smooth CPs, inspired by the development of accelerated gradient descent methods. In [16], Nesterov’s accelerated multi-sequence scheme for smooth CPs [20, 23] and the smoothing technique [22] for nonsmooth CPs are employed to improve the iteration complexity of level-bundle methods for unconstrained convex optimization problems whose objective functions are (weakly) smooth or belong to an important class of saddle-point (SP) problems. In [4], the performance of these accelerated methods is further improved by simplified schemes and more efficient subproblem solvers. Moreover, these accelerated algorithms are extended to unconstrained CP problems where X is unbounded. For more details about the development of level-bundle methods we refer to [7, 16, 17]. Recently, several level-bundle methods incorporating Nesterov’s multi-sequence scheme for smooth CPs and a class of saddle point problems with inexact oracles have been developed [3]. The accuracy of the approximate solutions and the convergence analysis of those algorithms are also studied there.

Level-bundle methods have also been developed to solve the functional constrained convex optimization problem (1) where f and the \(g_i\)’s can be nonsmooth [8, 11, 13, 17, 27]. The idea shared by these methods is to convert the constrained problem (1) into an equivalent, unconstrained problem, to which the classical level-bundle method described above can be applied with necessary modifications. For example, in [13, 27], restricted-memory variants [5, 12, 14] are employed in the level-bundle method to solve the following problem, which is equivalent to (1):

$$\begin{aligned} \min _{x\in X}\ h^*(x), \quad \text {where}\quad h^*(x) := \max \{f(x)-f^*, g(x) \}. \end{aligned}$$
(15)

We can see that \(x_{\epsilon }\) is an \(\epsilon\)-solution to (15) if and only if \(f(x_{\epsilon }) - f^*\le \epsilon\) and \(g(x_{\epsilon })\le \epsilon\), i.e., \(x_{\epsilon }\) is an \(\epsilon\)-solution to (1) in the sense of (10). However, the optimal value \(f^*\) is unknown in practice and hence one cannot obtain the first-order information of \(h^*(x)\) directly. To tackle this issue, a non-decreasing sequence \(\{L_k\}\) is generated such that \(L_k\uparrow f^*\) and used in place of \(f^*\) in (15) in each iteration k [13, 27]. Namely, \(h^*(x)\) in (15) is replaced by

$$\begin{aligned} h(x, L_k):= \max \{f(x)- L_k,g(x)\}, \end{aligned}$$
(16)

for some \(L_k\le f^*\). Note that \(h(x,L_k)\ge h^*(x)\ge 0\) for any \(x\in X\). In addition, it is easy to see that \(h^*(x^*) = 0\) and therefore \(h^*(x)\) has a tight lower bound 0. If we define

$$\begin{aligned} \bar{h}_k := \min \{h(x_i,L_k):i=1, \ldots ,k \}, \end{aligned}$$
(17)

then \(\bar{h}_k\) is the current best estimate of \(\min _{x\in X}h(x,L_k)\). Therefore, the goal is to generate \(\{x_k\}\) and \(\{L_k\}\) such that \(\bar{h}_k\downarrow 0\) [13, 27]. To this end, [13] computes

$$\begin{aligned} L_k := \min \{m^f_k(x):m^g_k(x)\le 0, x\in X\}, \end{aligned}$$
(18)

for given \(x_1,\ldots , x_k\), where \(m^f_k(x)\) and \(m^g_k(x)\) are the current cutting plane models of \(f(\cdot )\) and \(g(\cdot )\) defined as in (12), respectively. Then a level \(l_k:=(1-\beta )\bar{h}_k\) is set, where \(\bar{h}_k\) is defined in (17), and the next iterate \(x_{k+1}\) is obtained by projecting \(x_k\) onto the level set \(X_k := \{x \in X : m_k^f(x)-L_k \le l_k, m_k^g(x)\le 0\}\). This algorithm achieves an iteration complexity of \(O(\epsilon ^{-2}\log (1/\epsilon ))\). Different from [13], the constrained level-bundle method in [27] generates the sequence \(\{L_k\}\) and the iterates \(\{x_k\}\) jointly without solving (18). Instead, it projects \(x_k\) onto the level set \(X_k\): if no feasible solution \(x_{k+1}\) can be found, then it enlarges the level set by increasing \(L_k\) to \(L_k+l_k\), and repeats this procedure until a new iterate \(x_{k+1}\) is found. In [8, 17], the constrained problem (1) is reformulated as the following equivalent min-max problem using duality theory:

$$\begin{aligned} \min _{x\in X} \max _{0\le \alpha \le 1} h(x, \alpha )\quad \text {where}\quad h(x,\alpha ) := \alpha (f(x) - f^*) + (1-\alpha ) g(x). \end{aligned}$$
(19)

Then, at each iteration k, the unknown \(f^*\) is replaced by its lower bound \(L_k\) as in (18); namely, \(h(x,\alpha )\) is replaced by

$$\begin{aligned} h(x, \alpha , L_k) := \alpha (f(x)-L_k)+(1-\alpha )g(x), \end{aligned}$$

and \(\alpha\) and x are updated alternately in these algorithms. For fixed \(\alpha _k\) at iteration k, the classical level-bundle method is applied to \(h(x, \alpha _k, L_k)\). Similar to the constrained level-bundle methods discussed above, \(h(x^*,\alpha _k,L_k)\) is bounded below by 0 and above by \(\bar{h}_k:=\min \{h(x_i,\alpha _k,L_k) : i=1,\ldots ,k \}\), which therefore serves as the gap between the upper and lower bounds of \(\min _{x\in X}h(x,\alpha _k,L_k)\). Then a level set \(X_k := \{x \in X : \alpha _k(m_k^f(x)-L_k)+(1-\alpha _k) m_k^g(x)\le l_k\}\), where \(l_k:=(1-\beta )\bar{h}_k\), is built for \(h(x,\alpha _k,L_k)\). With a similar idea, the method in [11] further incorporates the bundle aggregation technique and a filter strategy for evaluating candidate points when solving \(\min _{x\in X}h(x,\alpha _k,L_k)\). In [6], a bundle method is combined with the target radius method to solve nonsmooth convex optimization problems where the constraint is the boundedness of a specific strongly convex function, achieving iteration complexity \({\mathcal {O}}(\epsilon ^{-2})\).

To obtain an \(\epsilon\)-solution to (1) with nonsmooth \(f(\cdot )\) and \(g(\cdot )\), the algorithms in [8, 13, 17] and [27] exhibit iteration complexities \({\mathcal {O}}(\epsilon ^{-2}\log (1/\epsilon ))\) and \({\mathcal {O}}(\epsilon ^{-3})\), respectively. Moreover, [8, 9] and [27] further extend the constrained level-bundle methods in [17] and [13], respectively, to deal with inexact oracles.

1.3 Contribution

Our main contribution in this work lies in the development of a unified level-bundle method, called the accelerated constrained level bundle (ACLB) method, for solving the constrained convex optimization problem (1). By employing Nesterov’s accelerated gradient technique, we show that ACLB maintains the same iteration complexity as existing level-bundle methods for nonsmooth problems, while significantly improving the complexity if the objective and constraint functions are (weakly) smooth. We also show that, if the objective function has a nonsmooth component which can be written as a max-type function, the iteration complexity for this component can be much lower than if we treated it as a generic nonsmooth function. In summary, compared to existing level-bundle methods, the proposed algorithms enjoy iteration complexities of lower order when the objective and constraint functions are (weakly) smooth. More specifically, ACLB attains the iteration complexity \({\mathcal {O}}(\epsilon ^{-3(1+\rho )/(1+3\rho )}+\Vert A\Vert \epsilon ^{-2})\), where \(\rho = \min \{\rho _{i}:i=0,1,\ldots ,n_c\}\) is the smallest smoothness exponent among the functions \(f_0\) and \(g_i\)’s in problem (1). A comparison with existing methods with respect to iteration complexity in two extreme cases is given in Table 1.

Table 1 Iteration complexity of the existing best and our ACLB method on two extreme cases \(\rho =0\) and \(\rho =1\). Note that \(\rho =\min \{\rho _{i}:i=0,1,\ldots ,n_c\}\) for ACLB

Given the improved iteration complexity of ACLB for (weakly) smooth problems, we also point out a drawback of our methods: besides every call to the standard first-order oracle, which computes both function values and (sub)gradients of f and the \(g_i\)’s, our methods require an additional function evaluation of f and the \(g_i\)’s. In other words, each iteration of our algorithms calls the zeroth-order oracle and the first-order oracle once each. This is in contrast to most existing level-bundle methods, in which each iteration only calls the first-order oracle once. Therefore, the actual per-iteration cost of ACLB is higher (but no more than twice as high) than that of those existing level-bundle methods. For certain problems, such as unit-commitment, Lagrangian relaxation, and two-stage stochastic problems, a subgradient is a byproduct of the function evaluation and hence the zeroth-order and first-order oracles have the same cost. Meanwhile, there are also many problems where function evaluations are computationally much less expensive than (sub)gradient evaluations, for which the increase in our per-iteration cost due to this additional zeroth-order oracle is very minor. Moreover, the much lower order of iteration complexity attained by our methods can well compensate for such a minor increase in per-iteration cost. In addition, our algorithms require memory space for \(x^l\) and \(x^u\) besides x, but these are overwritten rather than accumulated during the iterations, which is often a negligible issue in practice.

1.4 Paper organization

The remainder of this paper is organized as follows. In Sect. 2, we present our ACLB algorithm in detail. In Sect. 3, we provide a comprehensive analysis of the iteration complexity of ACLB. Numerical results of ACLB are presented in Sect. 4. Section 5 concludes this paper.

2 Accelerated constrained level-bundle method

In this section, we provide a detailed description of our proposed accelerated constrained level-bundle (ACLB) method. To tackle the constraint, ACLB relies on the improvement function h defined by

$$\begin{aligned} h(x, L):= \max \,\{F(x)- L,g(x)\}, \end{aligned}$$
(20)

where L is a lower bound of the unknown optimal value \(F^*\), i.e., \(L \le F^*\). Note that \(h(x,L)\ge 0\) for any \(x\in X\). The goal of ACLB is thus to generate a sequence of pairs of iterates and lower bounds \((x_n, L_n)\) such that \(L_n\uparrow F^*\) and \(h(x_n,L_n)\downarrow 0\) as \(n\rightarrow \infty\). Therefore, at the heart of ACLB is a “gap reduction” procedure \({\mathcal {G}}_{\mathrm {ACLB}}\). Let \(\beta ,\theta \in (0,1)\) be arbitrary and fixed. Given an input triple \((x,L,\Delta )\), where \(\Delta :=h(x,L)\ge 0\) is the gap, the gap reduction procedure \({\mathcal {G}}_{\mathrm {ACLB}}\) outputs \((x^+,L^+,\Delta ^+)\) such that either the new lower bound \(L^+\) improves over L in the sense that \(L^+=L+(1-\beta )\Delta \in (L, F^*)\), or \(L^+=L\) and the new gap is reduced in the sense that \(\Delta ^+:=h(x^+,L^+)\le (1-\beta +\theta \beta ) \Delta = q \Delta\), where \(q:=1-\beta +\theta \beta \in (0,1)\). The ACLB gap reduction procedure \({\mathcal {G}}_{\mathrm {ACLB}}\) is given in Procedure 1.

Procedure 1 ACLB gap reduction procedure \({\mathcal {G}}_{\mathrm {ACLB}}\): \((x^+,L^+,\Delta ^+)={\mathcal {G}}_{\mathrm {ACLB}}(x, L, \Delta , \beta , \theta ,D_{v,Y})\)

1: (Initialization) Denote \(F_{\eta }(x) := f_0(x) + f_{\eta }(x)\) and define \(h_{\eta }(x)=\max \{F_{\eta }(x)-L,g(x)\}\) (the input L is fixed throughout the procedure), where

$$\begin{aligned} \eta := \beta \theta \Delta / (2D_{v,Y}). \end{aligned}$$
(21)

Set \(\overline{h}_0=\Delta\), \(l=(1-\beta ) \Delta\), \(R_0=X\), \(x_0^u=x_0=x\), and \(k=1\).

2: (Update the cutting plane model) Set

$$\begin{aligned} x_k^l = (1-\alpha _k)x_{k-1}^u+\alpha _k x_{k-1}, \end{aligned}$$
(22)
$$\begin{aligned} \underline{R}_k= \{x\in R_{k-1}:\ell _{F_{\eta }}(x_k^l,x) - L\le l,\ \ell _g(x_k^l, x)\le 0\}. \end{aligned}$$
(23)

3: (Update the iterate or lower bound of \(F^*\)) Solve for \(x_k\) from

$$\begin{aligned} x_k={\mathrm{arg\,min}}_{z\in \underline{R}_k}\left\{ d(x,z):= \frac{1}{2}\Vert z-x\Vert ^2\right\} . \end{aligned}$$
(24)

If there is no solution, i.e., \(\underline{R}_k=\varnothing\), then terminate and output \(x^+=x_{k-1}^u\), \(L^+=L + l\), \(\Delta ^+=h(x^+,L^+)\).

4: (Update the upper bound) Set

$$\begin{aligned} \tilde{x}_k^u = (1-\alpha _k)x_{k-1}^u + \alpha _k x_k, \end{aligned}$$
(25)
$$\begin{aligned} x_k^u = {\left\{ \begin{array}{ll} \tilde{x}_k^u, &{}\text { if } h_{\eta }(\tilde{x}_k^u)< h_{\eta }(x_{k-1}^u), \\ x_{k-1}^u, &{} \text {otherwise}, \end{array}\right. } \end{aligned}$$
(26)

and set \(\overline{h}_k=h(x_k^u,L)\). If \(\overline{h}_k - l\le \beta \theta \Delta\), then terminate and output \(x^+=x_{k}^u\), \(L^+=L\), \(\Delta ^+=\overline{h}_k\).

5: (Bundle management) Define \(\overline{R}_k\) and choose \(R_k\) satisfying \(\underline{R}_k\subseteq R_k\subseteq \overline{R}_k\), where

$$\begin{aligned} \overline{R}_k:=\{z\in X:\left\langle x_k-x, z-x_k\right\rangle \ge 0\}. \end{aligned}$$
(27)

Set \(k=k+1\) and go to Step 2.

The key to the acceleration property of Procedure 1 is Nesterov’s acceleration technique. Specifically, we need a sequence of combination parameters \(\{\alpha _k\} \subset \mathbb {R}\) satisfying the following properties: there exist \(c_1,c_2>0\) such that

$$\begin{aligned} \alpha _1=1,\quad 0<\alpha _k\le 1,\quad \frac{c_1}{k}\le \alpha _k \le \frac{c_2}{k},\quad \text {and}\quad \frac{1-\alpha _{k+1}}{\alpha _{k+1}^{2}}\le \frac{1}{\alpha _k^{2}},\quad \forall \, k\ge 1. \end{aligned}$$
(28)

The following proposition provides two examples of such \(\{\alpha _k\}\).

Proposition 1

The sequence \(\{\alpha _k\}\) generated by either way below satisfies (28) with \(c_1=1\) and \(c_2=2\):

  1. a.

    \(\alpha _k=\frac{2}{k + 1}\)for \(k\ge 1\).

  2. b.

    \(\alpha _k > 0\)is recursively defined by

    $$\begin{aligned} \alpha _1=1,\quad \alpha _{k+1}^2=(1-\alpha _{k+1})\alpha _{k}^2, \quad \forall \, k\ge 1, \end{aligned}$$
    (29)

Proof

Part (a) can be verified directly by checking \(\alpha _k=\frac{2}{k+1}\) in (28).

For part (b), it is easy to show by induction that \(\alpha _k\in (0,1]\) and \(\alpha _{k+1} <\alpha _k\) for all \(k\ge 1\). Since (29) implies that \(\frac{1}{\alpha _{k+1}}-\frac{1}{\alpha _{k}} = \frac{\alpha _k-\alpha _{k+1}}{\alpha _k\alpha _{k+1}}=\frac{\alpha _{k}}{\alpha _k+\alpha _{k+1}}\), we can readily show that \(1> \frac{1}{\alpha _{k+1}}-\frac{1}{\alpha _{k}} \ge \frac{1}{2}\) for all \(k\ge 1\). Noting that \(\alpha _1=1\), we have \(\frac{1}{\alpha _k} = 1 + \sum _{i=1}^{k-1}(\frac{1}{\alpha _{i+1}} - \frac{1}{\alpha _{i}})\), which is bounded between \(1+\frac{k-1}{2}=\frac{k+1}{2}\) and \(1+(k-1)=k\). Therefore \(\frac{1}{k}<\alpha _k\le \frac{2}{k+1}< \frac{2}{k}\). \(\square\)
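As a quick numerical sanity check of Proposition 1 (our own illustration, not part of the paper), the snippet below generates both sequences and verifies the conditions in (28) with \(c_1=1\) and \(c_2=2\):

```python
import numpy as np

K = 50
a1 = 2.0 / np.arange(2, K + 2)                       # choice (a): alpha_k = 2/(k+1)

a2 = np.empty(K); a2[0] = 1.0                        # choice (b): alpha_{k+1}^2 = (1 - alpha_{k+1}) alpha_k^2
for k in range(K - 1):
    c = a2[k] ** 2
    a2[k + 1] = (-c + np.sqrt(c ** 2 + 4.0 * c)) / 2.0   # positive root of t^2 + c*t - c = 0

ks = np.arange(1, K + 1)
for a in (a1, a2):
    assert np.all((1.0 / ks <= a + 1e-12) & (a <= 2.0 / ks + 1e-12))     # c1 = 1, c2 = 2 in (28)
    assert np.all((1 - a[1:]) / a[1:] ** 2 <= 1.0 / a[:-1] ** 2 + 1e-9)  # last condition in (28)
```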

We now add a few remarks about \({\mathcal {G}}_{\mathrm {ACLB}}\) (Procedure 1). Firstly, the linear approximation of \(F_{\eta }(\cdot )\), instead of that of \(F(\cdot )\), is used to define \(\underline{R}_k\) if the max-type function f is present in the objective function of (1). However, \(\bar{h}_k\) and the termination condition in Step 4 are still defined through the original improvement function h(x, L). Secondly, the parameter \(\eta\), due to the presence of f, is specified as a function of the input \(\Delta\) and the parameters \(\beta\), \(\theta\), and \(D_{v,Y}\) (or any user-chosen value greater than \(D_{v,Y}\)); it is fixed within each call of \({\mathcal {G}}_{\mathrm {ACLB}}\) and is proportional to the input gap \(\Delta\), hence it decreases as the outer iterations progress. Thirdly, \({\mathcal {G}}_{\mathrm {ACLB}}\) terminates either at Step 3, which increases L, or at Step 4, which reduces \(\Delta\).

Our ACLB gap reduction procedure \({\mathcal {G}}_{\mathrm {ACLB}}\) (Procedure 1) also employs bundle management (Step 5), also known as bundle compression or restricted memory in the literature, to maintain a finite bundle size and hence ensure implementability in practice. This is realized by the flexible choice of the localizer \(R_k\) in Step 5: one can discard some of the linear inequality constraints in \(\underline{R}_k\) to obtain \(R_k\), as long as \(R_k\) is still contained in the set \(\overline{R}_k\) defined by the half space in (27). Note that, in addition to the cost of the first-order oracles, bundle methods (including ACLB) also require solving a quadratic program (QP) in each iteration, which can be costly for problems with high dimensionality (large \(d_X\)). However, with a small bundle size (e.g., 5-10), one can instead solve the dual of this QP at very low computational cost [4].
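To fix ideas, the following Python sketch (our own illustration under simplifying assumptions, not the authors' implementation) carries out one inner iteration, i.e., Steps 2–5 of Procedure 1, for \(X=[-1,1]^{d_X}\), with the localizer stored as a list of linear cuts \(a^\top z\le c\) and the projection (24) solved by a generic QP routine; the Step 4 termination test \(\overline{h}_k-l\le \beta \theta \Delta\) is left to the caller, which also tracks \(\overline{h}_k\):

```python
import numpy as np
from scipy.optimize import minimize

def aclb_inner_step(k, x, L, l, x_prev, xu_prev, cuts, F_eta_oracle, g_oracle, h_eta, d):
    """One pass of Steps 2-5 of Procedure 1 (a sketch; names are ours).
    cuts: list of (a, c) encoding the retained localizer R_{k-1} = {z : a^T z <= c} inside X.
    F_eta_oracle / g_oracle: return (value, (sub)gradient); h_eta(z) = max{F_eta(z)-L, g(z)}."""
    alpha = 2.0 / (k + 1)                                    # Proposition 1(a)
    xl = (1 - alpha) * xu_prev + alpha * x_prev              # (22)
    Fv, Fg = F_eta_oracle(xl)
    gv, gg = g_oracle(xl)
    new_cuts = cuts + [(Fg, Fg @ xl - Fv + L + l),           # ell_{F_eta}(x_k^l, z) - L <= l
                       (gg, gg @ xl - gv)]                   # ell_g(x_k^l, z) <= 0
    cons = [{"type": "ineq", "fun": lambda z, a=a, c=c: c - a @ z} for a, c in new_cuts]
    qp = minimize(lambda z: 0.5 * np.sum((z - x) ** 2), x_prev, jac=lambda z: z - x,
                  bounds=[(-1.0, 1.0)] * d, constraints=cons, method="SLSQP")   # projection (24)
    if not qp.success:
        # proxy for an empty feasible set in (24): Step 3 exit, raise the lower bound;
        # the caller then sets Delta^+ = h(x^+, L + l)
        return ("raise_L", xu_prev, L + l)
    xk = qp.x
    xu_tilde = (1 - alpha) * xu_prev + alpha * xk            # (25)
    xu = xu_tilde if h_eta(xu_tilde) < h_eta(xu_prev) else xu_prev   # (26)
    # Step 5 (one valid bundle management choice): keep only the two newest cuts plus the
    # aggregated half-space <x_k - x, z - x_k> >= 0 from (27), so the bundle stays small.
    kept = new_cuts[-2:] + [(x - xk, (x - xk) @ xk)]
    return ("continue", xk, xu, kept)
```

A robust implementation would detect infeasibility of (24) more carefully (e.g., via a phase-one LP) rather than relying on the solver status, and any other choice of \(R_k\) with \(\underline{R}_k\subseteq R_k\subseteq \overline{R}_k\) is equally valid.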

Finally we are ready to present the ACLB method in Algorithm 1.

Algorithm 1 Accelerated constrained level bundle (ACLB) method

1: Set the tolerance \(\epsilon >0\) and \(\beta ,\theta \in (0,1)\). If f is present in (1), then also choose the prox-function \(v(\cdot )\) in (5) and \(D_{v,Y}\) in (9) (or any number greater).

2: Choose an initial \(p_0\in X\) and compute \(x_0 \in {\mathrm{arg\,min}}_{x\in X} \{\ell _F(p_0, x): \ell _g(p_0, x)\le 0 \}\). Set \(L_0 = \ell _F(p_0, x_0)\), \(\Delta _0=\max \{F(x_0)-L_0, g(x_0)\}\), and \(n=0\).

3: If \(\Delta _n\le \epsilon\), terminate and output the \(\epsilon\)-solution \(x_n\).

4: Compute \((x_{n+1}, L_{n+1},\Delta _{n+1})={\mathcal {G}}_{\mathrm {ACLB}}(x_n, L_n, \Delta _n, \beta , \theta , D_{v,Y})\).

5: Set \(n=n+1\) and go to Step 3.
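Algorithm 1 is thus a thin loop around the gap reduction procedure. The following Python skeleton (our own sketch; gap_reduction is assumed to implement Procedure 1, and the names are illustrative only) mirrors Steps 3–5:

```python
# Outer loop of ACLB (Algorithm 1): a thin wrapper around the gap reduction
# procedure.  `gap_reduction` is assumed to implement Procedure 1 and return
# (x_plus, L_plus, Delta_plus); the function names are illustrative only.
def aclb(x0, L0, Delta0, gap_reduction, eps, beta=0.6, theta=0.6, D_vY=None):
    x, L, Delta = x0, L0, Delta0
    while Delta > eps:                       # Step 3: terminate once Delta_n <= eps
        x, L, Delta = gap_reduction(x, L, Delta, beta, theta, D_vY)   # Step 4: one call to G_ACLB
    return x                                 # eps-solution in the sense of (10)
```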

3 Convergence analysis

In this section, we establish the iteration complexities of the proposed ACLB (Algorithm 1).

Lemma 2

Suppose that \(\{x_k^l, x_k, \tilde{x}_k^u, x_k^u\}\) are generated by \({\mathcal {G}}_{\mathrm {ACLB}}\) (Procedure 1) with input \(\Delta\), and that the procedure does not terminate at the Kth iteration. Then the following estimate holds for \(k=1,\ldots ,K\):

$$\begin{aligned} h_{\eta }(x_k^u)- l \le&\ (1-\alpha _k)(h_{\eta }(x_{k-1}^u)-l)+ \frac{\alpha _k^{1+\rho _{0}}L_{0}}{1+\rho _{0}}\Vert x_k-x_{k-1}\Vert ^{1+\rho _{0}} \\&+\frac{\alpha _k^{1+\rho _{g}}L_g}{1+\rho _g}\Vert x_k-x_{k-1}\Vert ^{1+\rho _g}+\frac{\alpha _k^2 L_{\eta }}{2}\Vert x_k-x_{k-1}\Vert ^2, \nonumber \end{aligned}$$
(30)

where \(l=(1-\beta )\Delta\) and \(L_{\eta }\) is the Lipschitz constant of \(\nabla f_{\eta }\).

Proof

For any \(k\ge 1\), we have

$$\begin{aligned} F_{\eta }(\tilde{x}_k^u) - L&\le \ell _{F_{\eta }}(x_k^l,\tilde{x}_k^u) - L \\&\quad + \frac{L_{0}}{1+\rho _{0}}\Vert \tilde{x}_k^u-x_k^l\Vert ^{1+\rho _{0} }+\frac{L_{\eta }}{2}\Vert \tilde{x}_k^u-x_k^l\Vert ^2 \\&= (1-\alpha _k)\ell _{F_{\eta }}(x_k^l,x_{k-1}^u)+\alpha _k \ell _{F_{\eta }}(x_k^l,x_k) - L \\&\quad +\frac{\alpha _k^{1+\rho _{0}}L_{0}}{1+\rho _{0}}\Vert x_k-x_{k-1}\Vert ^{1+\rho _{0} }+\frac{\alpha _k^2 L_{\eta }}{2}\Vert x_k-x_{k-1}\Vert ^2 \\&\le (1-\alpha _k)(F_{\eta }(x_{k-1}^u) - L) + \alpha _k(\ell _{F_{\eta }}(x_k^l,x_k) - L) + \frac{\alpha _k^{1+\rho _{0}}L_{0}}{1+\rho _{0}}\Vert x_k-x_{k-1}\Vert ^{1+\rho _{0} } \\&\quad +\frac{\alpha _k^2 L_{\eta }}{2}\Vert x_k-x_{k-1}\Vert ^2 \\&\le (1-\alpha _k)(F_{\eta }(x_{k-1}^u) - L) + \alpha _k l + \frac{\alpha _k^{1+\rho _{0}}L_{0}}{1+\rho _{0}}\Vert x_k-x_{k-1}\Vert ^{1+\rho _{0} } \\&\quad +\frac{\alpha _k^2 L_{\eta }}{2}\Vert x_k-x_{k-1}\Vert ^2, \end{aligned}$$

where the first inequality is due to (2) and the \(L_{\eta }\)-Lipschitz continuity of \(\nabla f_{\eta }\), the first equality follows from the definitions of \(x_k^l\) in (22) and \(\tilde{x}_k^u\) in (25) and the linearity of \(\ell _{F_{\eta }}(x_k^l,\cdot )\), the second inequality is due to the convexity of \(F_{\eta }\), and the last follows from \(x_k\in \underline{R}_k\) by (23) and (24). Similarly, for \(g(\cdot )\), we have

$$\begin{aligned} g(\tilde{x}_k^u)&\le \ell _g(x_k^l, \tilde{x}_k^u) + \frac{L_g}{1+\rho _g}\Vert \tilde{x}_k^u-x_k^l\Vert ^{1+\rho _g}\\&\le (1-\alpha _k)\ell _g(x_k^l,x_{k-1}^u)+\alpha _k \ell _g(x_k^l,x_k) + \frac{\alpha _k^{1+\rho _g}L_g}{1+\rho _g}\Vert x_k-x_{k-1}\Vert ^{1+\rho _g} \\&\le (1-\alpha _k)g(x_{k-1}^u) + \alpha _kl +\frac{\alpha _k^{1+\rho _g}L_g}{1+\rho _g}\Vert x_k-x_{k-1}\Vert ^{1+\rho _g}, \end{aligned}$$

where the second inequality uses the definitions of \(x_k^l\) and \(\tilde{x}_k^u\) and the linearity of \(\ell _g(x_k^l,\cdot )\), and the last inequality is due to the convexity of g and \(\ell _g(x_k^l,x_k) \le 0 < l\) since \(x_k\in \underline{R}_k\). In view of (26), we have \(h_{\eta }(x_k^u) \le h_{\eta }(\tilde{x}_k^u) = \max \{F_{\eta }(\tilde{x}_k^u)-L, g(\tilde{x}_k^u)\}\). Combining the two inequalities above, we get

$$\begin{aligned} h_{\eta }(x_k^u)&\le (1-\alpha _k)\max \{F_{\eta }(x_{k-1}^u) - L, g(x_{k-1}^u)\} + \alpha _kl + \frac{\alpha _k^{1+\rho _{0}}L_{0}}{1+\rho _{0}}\Vert x_k-x_{k-1}\Vert ^{1+\rho _{0}} \\&\quad +\frac{\alpha _k^{1+\rho _g}L_g}{1+\rho _g}\Vert x_k-x_{k-1}\Vert ^{1+\rho _g} +\frac{\alpha _k^2L_{\eta }}{2}\Vert x_k-x_{k-1}\Vert ^2. \end{aligned}$$

Subtracting l on both sides of the above estimate, we obtain (30). \(\square\)

The following lemma provides several important properties of the bundle management step in \({\mathcal {G}}_{\mathrm {ACLB}}\) (Procedure 1).

Lemma 3

Let \((x,L,\Delta )\) be the input of \({\mathcal {G}}_{\mathrm {ACLB}}\) (Procedure 1), and denote \(\mathcal {E}_l:=\{\bar{x}\in X:F(\bar{x}) - L\le l,\ g(\bar{x})\le 0\}\), where \(l=(1-\beta )\Delta\). Then the following statements hold:

  1. (a)

    \(\underline{R}_k \subseteq \overline{R}_k\) for all \(k\ge 1\).

  2. (b)

    We have \(\mathcal {E}_l\subseteq \underline{R}_k\subseteq R_k \subseteq \overline{R}_k\) for all \(k \ge 1\). If \(\mathcal {E}_l\ne \varnothing\), then (24) has a unique solution.

  3. (c)

    If \(\underline{R}_k=\varnothing\), then \(L + l < F^*\).

  4. (d)

    If terminated in Step 4, then \(\Delta ^+\le q\Delta\)where \(q:=1-\beta +\theta \beta\).

Proof

  1. (a)

    If x in (24) satisfies \(x\in \underline{R}_k\), then due to the fact that \(x_k\) is the projection of x onto \(\underline{R}_k\) in (24), we know \(x_k=x\), and \(\langle x_k-x, z-x_k \rangle =0\) for all \(z\in X\). Therefore \(\overline{R}_k=X\) and \(\underline{R}_k\subseteq \overline{R}_k\). If \(x\notin \underline{R}_k\), then due to the optimality condition of \(x_k\) in (24), we have \(\Vert z-x\Vert ^2\ge \Vert z-x_k\Vert ^2+\Vert x_k-x\Vert ^2\) for all \(z\in \underline{R}_k\), from which we obtain \(\langle x_k-x, z-x_k \rangle \ge 0\), i.e., \(z\in \overline{R}_k\). Hence \(\underline{R}_k\subseteq \overline{R}_k\).

  2. (b)

    We prove the result by induction. Since \(R_0 = X\), we have \(\mathcal {E}_l\subseteq R_0\). Assume that \(\mathcal {E}_l\subseteq R_{k-1}\) holds for some \(k\ge 1\). Note that for any \(x\in \mathcal {E}_l\), we have \(\ell _{F_{\eta }}(x_k^l, x)-L\le {F_{\eta }}(x)-L \le F(x)-L\le l\) and \(\ell _g(x_k^l, x)\le g(x)\le 0\) in \({\mathcal {G}}_{\mathrm {ACLB}}\), due to the convexity of \(F_{\eta }\) and g and the fact that \(f_{\eta }\le f\) by (8). By the definition of \(\underline{R}_k\) in (23) and the induction assumption \(\mathcal {E}_l\subseteq R_{k-1}\), we have \(\mathcal {E}_l\subseteq \underline{R}_k\). Due to \(\underline{R}_k\subseteq \overline{R}_k\) from Part (a) and the choice of \(R_k\) such that \(\underline{R}_k\subseteq R_k \subseteq \overline{R}_k\) for any \(k \ge 1\), we obtain \(\mathcal {E}_l \subseteq R_k\). Therefore we have \(\mathcal {E}_l\subseteq \underline{R}_k\subseteq R_k \subseteq \overline{R}_k\) by induction. Moreover, if \(\mathcal {E}_l\ne \varnothing\), then \(\underline{R}_k\supseteq \mathcal {E}_l\) is nonempty; since \(d(x,\cdot )\) is strongly convex, (24) has a unique solution.

  3. (c)

    If \(L+l\ge F^*\), then for every solution \(x^*\) to (1) we have \(F(x^*)-L=F^*-L \le l\) and \(g(x^*)\le 0\), and hence \(x^*\in \mathcal {E}_l\subseteq \underline{R}_k\) by part (b), which contradicts \(\underline{R}_k = \varnothing\). Therefore \(L+l<F^*\).

  4. (d)

    The inequality \(\Delta ^+\le q\Delta\) follows immediately from the definition \(l=(1-\beta )\Delta\) and the termination condition \(\overline{h}_k - l\le \beta \theta \Delta\) in Step 4, since \(\Delta ^+=\overline{h}_k\le l+\beta \theta \Delta =(1-\beta +\theta \beta )\Delta =q\Delta\).

\(\square\)

Now we have the following bound for the iterates \(\{x_k\}\) within each \({\mathcal {G}}_{\mathrm {ACLB}}\).

Lemma 4

Let \(\{x_k\}\) be the iterates generated by \({\mathcal {G}}_{\mathrm {ACLB}}\) before it terminates at some iteration K. Then

$$\begin{aligned} \sum _{k=1}^K \Vert x_{k}-x_{k-1}\Vert ^2\le D_X^2, \end{aligned}$$
(31)

where the diameter \(D_X\) of the compact set X is defined by

$$\begin{aligned} D_X := \max _{x, y \in X}\Vert x - y\Vert . \end{aligned}$$
(32)

Proof

For all \(k>1\), we have \(x_k\in \underline{R}_k \subseteq R_{k-1} \subseteq \overline{R}_{k-1}\), where the first inclusion is due to the definition of \(x_k\) in (24), the second due to the definition of \(\underline{R}_k\) in (23), and the last due to the selection \(R_{k}\subseteq \overline{R}_k\) in Step 5 of \({\mathcal {G}}_{\mathrm {ACLB}}\) for all k. Furthermore, for any input x of \({\mathcal {G}}_{\mathrm {ACLB}}\), we know that \(d(x,z)=(1/2)\cdot \Vert z-x\Vert ^2\) is strongly convex in z with modulus 1. Therefore,

$$\begin{aligned} d(x,x_{k})\ge d(x,x_{k-1})+\left\langle x_{k-1}-x,x_{k}-x_{k-1}\right\rangle +\frac{1}{2}\Vert x_{k}-x_{k-1}\Vert ^2\ge d(x,x_{k-1}) + \frac{1}{2}\Vert x_{k}-x_{k-1}\Vert ^2, \end{aligned}$$

where the last inequality follows from \(x_k\in \overline{R}_{k-1}\) as we showed earlier and the definition of \(\overline{R}_k\) in (27). By taking the sum of both sides over \(k=1,2,\ldots ,K\), we obtain

$$\begin{aligned} \frac{1}{2}\sum _{k=1}^{K}\Vert x_{k}-x_{k-1}\Vert ^2\le d(x,x_{K})-d(x,x_0)\le d(x,x_K)=\frac{1}{2}\Vert x_{K}-x\Vert ^2\le \frac{1}{2}D_X^2, \end{aligned}$$

which implies the bound (31). \(\square\)

Proposition 5

If \(\{\alpha _k\}\) in \({\mathcal {G}}_{\mathrm {ACLB}}\) is chosen such that (28) holds, then the number of iterations performed within each \({\mathcal {G}}_{\mathrm {ACLB}}\) (Procedure 1) with input \(\Delta\) does not exceed

$$\begin{aligned} N_{\mathrm {ACLB}}^{\mathrm {inner}}(\Delta ) :=&\ C\left( \Big (\frac{L_{0}}{\Delta }\Big )^{2/(1+3\rho _{0})} + \Big (\frac{L_{g}}{\Delta }\Big )^{2/(1+3\rho _{g})} + \frac{\Vert A\Vert }{\Delta }\sqrt{\frac{D_{v,Y}}{\sigma _v}}\right) \nonumber \\ =&\ {\mathcal {O}}\left( \Big (\frac{L_{0}}{\Delta }\Big )^{2/(1+3\rho _{0})}\right) + {\mathcal {O}}\left( \Big (\frac{L_{g}}{\Delta }\Big )^{2/(1+3\rho _{g})}\right) + {\mathcal {O}}\left( \frac{\Vert A\Vert }{\Delta }\right) , \end{aligned}$$
(33)

where \(C>0\) is a constant depending only on \(\rho _{0}\), \(\rho _g\), \(D_X\), \(\theta\) and \(\beta\), and \(D_X\) and \(D_{v,Y}\) are defined in (32) and (9), respectively.

Proof

Suppose that \({\mathcal {G}}_{\mathrm {ACLB}}\) does not terminate at the Kth iteration for some \(K > 0\). Dividing both sides of (30) by \(\alpha _k^2\), we obtain that

$$\begin{aligned} \frac{h_{\eta }(x_k^u)- l }{\alpha _k^2}&\le \frac{(1-\alpha _k)(h_{\eta }(x_{k-1}^u)-l)}{\alpha _k^2}+ \frac{\alpha _k^{\rho _{0}-1}L_{0}}{1+\rho _{0}}\Vert x_k-x_{k-1}\Vert ^{1+\rho _{0}} \nonumber \\&\quad +\frac{\alpha _k^{\rho _{g}-1}L_g}{1+\rho _g}\Vert x_k-x_{k-1}\Vert ^{1+\rho _g} +\frac{L_{\eta }}{2}\Vert x_k-x_{k-1}\Vert ^2. \end{aligned}$$
(34)

Noting that \(\alpha _k\) satisfies (28) and summing both sides of (34) over \(k=1,\ldots ,K\), we have

$$\begin{aligned} \frac{h_{\eta }(x_K^u)- l}{\alpha _K^2}&\le \frac{L_{0}}{1+\rho _{0}} \sum _{k=1}^{K}\alpha _k^{\rho _{0}-1}\Vert x_{k} - x_{k-1}\Vert ^{1+\rho _{0}} +\frac{L_g}{1+\rho _g}\sum _{k=1}^{K} \alpha _k^{\rho _{g}-1}\Vert x_k-x_{k-1}\Vert ^{1+\rho _g} \nonumber \\&\quad + \frac{L_\eta }{2}\sum _{k=1}^{K}\Vert x_{k} - x_{k-1}\Vert ^2. \end{aligned}$$
(35)

By Hölder’s inequality, (31) from Lemma 4, and the bound \(\alpha _k\ge c_1/k\) in (28), we obtain

$$\begin{aligned} \sum _{k=1}^{K}\alpha _k^{\rho -1}\Vert x_{k} - x_{k-1}\Vert ^{1+\rho } \le \left( \sum _{k=1}^{K} \alpha _k^{-2}\right) ^{\frac{1-\rho }{2}} \left( \sum _{k=1}^K\Vert x_k-x_{k-1}\Vert ^2\right) ^{\frac{1+\rho }{2}}\le C K^{\frac{3-3\rho }{2}}D_X^{1+\rho }, \end{aligned}$$
(36)

where \(C>0\) is a constant depending only on \(\rho\) and \(c_1\). By (36) with \(\rho\) substituted by \(\rho _{0}\) and \(\rho _{g}\), and \(c_1/K<\alpha _K\le c_2/K\), we deduce from (35) that

$$\begin{aligned} h_{\eta }(x_K^u)- l \le C\left( L_{0}K^{-\frac{1+3\rho _{0}}{2}}D_X^{1+\rho _{0}}+L_{g}K^{-\frac{1+3\rho _g}{2}}D_X^{1+\rho _{g}}+\frac{L_\eta }{2}K^{-2}D_X^2\right) . \end{aligned}$$
(37)

Then by (8) and (21), we can obtain

$$\begin{aligned} \bar{h}_K-l&= h(x_K^u,L)-l\le h_{\eta }(x_K^u)-l + \eta D_{v,Y} \\&\le C\left( L_{0}K^{-\frac{1+3\rho _{0}}{2}}D_X^{1+\rho _{0}}+L_{g}K^{-\frac{1+3\rho _g}{2}}D_X^{1+\rho _{g}}+\frac{L_\eta }{2}K^{-2}D_X^2\right) + \frac{\theta \beta \Delta }{2}, \end{aligned}$$

where we used (37) and \(\eta =\beta \theta \Delta /(2D_{v,Y})\) from (21). Since the procedure does not terminate at the Kth iteration, the termination condition in Step 4 of \({\mathcal {G}}_{\mathrm {ACLB}}\) (Procedure 1) implies that \(\bar{h}_K-l > \theta \beta \Delta\). Therefore,

$$\begin{aligned} \frac{\theta \beta \Delta }{2}\le 3C \max \left\{ L_{0}K^{-\frac{1+3\rho _{0}}{2}}D_X^{1+\rho _{0}},\ L_{g}K^{-\frac{1+3\rho _g}{2}}D_X^{1+\rho _{g}},\ \frac{L_\eta }{2}K^{-2}D_X^2\right\} . \end{aligned}$$
(38)

If the first term in the max in (38) is the largest among the three, then \(\frac{\theta \beta \Delta }{2}\le 3C L_{0}K^{-\frac{1+3\rho _{0}}{2}}D_X^{1+\rho _{0}}\) implies \(K\le C(\frac{L_{0}}{\Delta })^{2/(1+3\rho _{0})}={\mathcal {O}}((\frac{L_{0}}{\Delta })^{2/(1+3\rho _{0})})\) for a constant C depending only on \(\rho _{0}\), \(D_X\), \(\theta\), and \(\beta\). A similar argument applies if the second term in the max in (38) is the largest. If the third term in the max in (38) is the largest, then by the fact that \(L_\eta =\Vert A\Vert ^2/(\eta \sigma _v)=2\Vert A\Vert ^2D_{v,Y}/(\theta \beta \sigma _v\Delta )\) from (7) and (21), we obtain \(K\le C\frac{\Vert A\Vert }{\Delta }\sqrt{\frac{D_{v,Y}}{\sigma _v}}={\mathcal {O}}(\frac{\Vert A\Vert }{\Delta })\). Therefore (33) holds. \(\square\)

Finally, we are ready to establish the iteration complexity of ACLB (Algorithm 1). For ease of presentation, we call a gap reduction procedure \({\mathcal {G}}_{\mathrm {ACLB}}\) (Procedure 1) critical if \({\mathcal {G}}_{\mathrm {ACLB}}\) terminates at Step 4, i.e., \(\bar{h}_k-l\le \beta \theta \Delta\) in \({\mathcal {G}}_{\mathrm {ACLB}}\); otherwise it is called non-critical, i.e., it terminates at Step 3 and the level set \(\underline{R}_k=\varnothing\) in \({\mathcal {G}}_{\mathrm {ACLB}}\).

Theorem 6

For any given \(\epsilon >0\), if \(\{\alpha _k\}\) in every \({\mathcal {G}}_{\mathrm {ACLB}}\) (Procedure 1) satisfies (28), then the following statements hold for ACLB (Algorithm 1) to compute an \(\epsilon\)-solution to problem (1):

  1. (a)

    The total number of calls to \({\mathcal {G}}_{\mathrm {ACLB}}\) in ACLB (Algorithm 1) does not exceed

    $$\begin{aligned} N_{\mathrm {ACLB}}^{\mathrm {outer}}(\epsilon ) := 1 + \frac{F^* - L_0}{(1-\beta )\epsilon } + \log _{\frac{1}{q}}\frac{{\hat{V}}_X}{\epsilon } = {\mathcal {O}}\left( \frac{1}{\epsilon }\right) , \end{aligned}$$
    (39)

    where the constant \(\hat{V}_X\), independent of \(\epsilon\), is defined by

    $$\begin{aligned} {\hat{V}} _X:= \max \left\{ 2\Vert A\Vert \sqrt{2D_{v,Y}/\sigma _v},\ \frac{L_g}{1+\rho _g}D_X^{1+\rho _g},\ \frac{L_{0}}{1+\rho _{0}} D_X^{1+\rho _{0}}\right\} . \end{aligned}$$
    (40)
  2. (b)

    The total number of calls to the first-order oracle in ACLB (Algorithm 1) does not exceed

    $$\begin{aligned} N_{\mathrm {ACLB}}^{\mathrm {total}}(\epsilon ) := N_{\mathrm {ACLB}}^{\mathrm {inner}}(\epsilon ) \cdot N_{\mathrm {ACLB}}^{\mathrm {outer}}(\epsilon ) \le {\mathcal {O}}(L_{0}^{\frac{2}{1+3\rho _{0}}}\epsilon ^{-\frac{3(1+\rho _{0})}{1+3\rho _{0}}})+{\mathcal {O}}(L_{g}^{\frac{2}{1+3\rho _{g}}}\epsilon ^{-\frac{3(1+\rho _{g})}{1+3\rho _{g}}})+{\mathcal {O}}(\Vert A\Vert \epsilon ^{-2}). \end{aligned}$$
    (41)

Proof

  1. (a)

    We can partition the set of iteration counters of ACLB (Algorithm 1) into \(\{i_1,\ldots ,i_{\bar{n}}\}\) for the non-critical calls of \({\mathcal {G}}_{\mathrm {ACLB}}\) and \(\{j_1,\ldots , j_{\bar{m}}\}\) for the critical calls of \({\mathcal {G}}_{\mathrm {ACLB}}\), where \(\bar{N}=\bar{n}+\bar{m}\) is the total number of calls of \({\mathcal {G}}_{\mathrm {ACLB}}\) in ACLB (Algorithm 1).

    Note that \({\mathcal {G}}_{\mathrm {ACLB}}\) with input \(\Delta\) will output \(\Delta ^+\le \Delta\). In addition, if \({\mathcal {G}}_{\mathrm {ACLB}}\) with input \(\Delta\) is critical, we have \(\Delta ^+ \le q \Delta\) for \(q=1-\beta +\theta \beta \in (0,1)\). Therefore, one can easily see that the number of critical \({\mathcal {G}}_{\mathrm {ACLB}}\)’s in Algorithm 1 is finite: we have \(\Delta _{j_{m+1}} \le q \Delta _{j_m}\) for all \(m=1,\ldots ,\bar{m}-1\), and \(\Delta _{j_{\bar{m}}}\le \epsilon <\Delta _{j_{\bar{m}-1}}\) due to the termination condition in Step 3 of ACLB (Algorithm 1). Therefore, we have \(\epsilon <\Delta _{j_{\bar{m}-1}}\le q^{\bar{m}-2}\Delta _{j_1}\le q^{\bar{m}-2}\Delta _{0}\), which implies that

    $$\begin{aligned} \bar{m} < \log _{\frac{1}{q}}\frac{\Delta _0}{\epsilon }. \end{aligned}$$
    (42)

    Now we only need to show that \(\Delta _0 \le \hat{V}_X\) to finalize the role of (42) in (39), where \(\Delta _0 =\max \{F(x_0) - L_0, g(x_0)\}\). From (2) and (3), we have

    $$\begin{aligned} F(x_0) - L_0&= F(x_0) - \ell _F(p_0, x_0)\le \frac{L_{0}}{1+\rho _{0} }\Vert x_0-p_0\Vert ^{1+\rho _{0} }+f(x_0)- \ell _f(p_0,x_0) \nonumber \\&\le \frac{L_{0}}{1+\rho _{0} }D_X^{1+\rho _{0} }+f(x_0)- \ell _f(p_0,x_0), \end{aligned}$$
    (43)
    $$\begin{aligned} g(x_0)&\le \ell _g(p_0, x_0) + \frac{L_g}{1+\rho _{g}}\Vert x_0-p_0\Vert ^{1+\rho _{g}} \le \frac{L_g}{1+\rho _{g}}D_X^{1+\rho _{g}}, \end{aligned}$$
    (44)

    where we used the fact that \(L_0 = \ell _F(p_0, x_0)\) (see Step 2 of Algorithm 1) to obtain the equality in (43), and \(\ell _g(p_0,x_0)\le 0\) in (44) due to the way we compute \(x_0\) in Step 2 of ACLB (Algorithm 1). Moreover, we have from Lemma 8 in [16] that

    $$\begin{aligned} f(x_0)- \ell _f(p_0,x_0) \le 2\Vert A\Vert \sqrt{2D_{v,Y}/ \sigma _v}. \end{aligned}$$
    (45)

    Combining (43), (44) and (45), we obtain \(\Delta _0 \le \hat{V}_X\), where \(\hat{V}_X\) is defined in (40). Therefore, we have

    $$\begin{aligned} \bar{m}< \log _{\frac{1}{q}}\frac{\hat{V}_X}{\epsilon }=:N_{\mathrm {ACLB}}^{\mathrm {outer-c}}(\epsilon )={\mathcal {O}}(\log (1/\epsilon )). \end{aligned}$$
    (46)

    For each non-critical \({\mathcal {G}}_{\mathrm {ACLB}}\) with input L, the output \(L^+\) satisfies \(L^+-L=l=(1-\beta )\Delta\). Since the input gap satisfies \(\Delta _n>\epsilon\) for all n before ACLB (Algorithm 1) terminates, we know that \(L^+-L\ge (1-\beta )\epsilon\). Therefore, the number of non-critical \({\mathcal {G}}_{\mathrm {ACLB}}\)'s in ACLB (Algorithm 1) is bounded by

    $$\begin{aligned} N_{\mathrm {ACLB}}^{\mathrm {outer-nc}}(\epsilon ):= \frac{F^* - L_0}{(1-\beta )\epsilon } + 1 = {\mathcal {O}}\left( \frac{1}{\epsilon }\right) . \end{aligned}$$
    (47)

    Combining (46) and (47), we obtain that the total number of calls of \({\mathcal {G}}_{\mathrm {ACLB}}\) in ACLB (Algorithm 1) is bounded by \(N_{\mathrm {ACLB}}^{\mathrm {outer}}(\epsilon ):=N_{\mathrm {ACLB}}^{\mathrm {outer-c}}(\epsilon )+N_{\mathrm {ACLB}}^{\mathrm {outer-nc}}(\epsilon )\), as given in (39).

  2. (b)

    We have shown that the number of calls to \({\mathcal {G}}_{\mathrm {ACLB}}\) in ACLB (Algorithm 1) is bounded above by \(N_{\mathrm {ACLB}}^{\mathrm {outer}}(\epsilon )\) in (39). On the other hand, the number of calls to the first-order oracle within the \({\mathcal {G}}_{\mathrm {ACLB}}\) (Procedure 1) invoked at the nth iteration of ACLB (Algorithm 1) is bounded by \(N_{\mathrm {ACLB}}^{\mathrm {inner}}(\Delta _n)\) given in (33). Hence, the total number of iterations (calls to the first-order oracles of \(f_0\), f and g) is bounded above by

    $$\begin{aligned} \sum _{n=1}^{\bar{N}} N_{\mathrm {ACLB}}^{\mathrm {inner}}(\Delta _n) \le N_{\mathrm {ACLB}}^{\mathrm {outer}}(\epsilon ) \cdot N_{\mathrm {ACLB}}^{\mathrm {inner}}(\epsilon ) = N_{\mathrm {ACLB}}^{\mathrm {total}}(\epsilon ), \end{aligned}$$
    (48)

    where we used the facts that \(\Delta _n >\epsilon\) for all n and that \(N_{\mathrm {ACLB}}^{\mathrm {inner}}(\Delta )\) is non-increasing in \(\Delta\) in the first inequality. Applying (33) and (39) to (48) yields the bound in (41).

\(\square\)

This theorem provides an iteration complexity bound expressed in terms of the smoothness level of each function appearing in the objective and the constraints. Hence we can read off iteration complexity bounds for (1) in various cases.
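For instance (simply substituting into (41), as a concrete reading of the bound), in the two extreme cases of Table 1 we obtain, for \(\rho _0=\rho _g=1\) (smooth \(f_0\) and g) and for \(\rho _0=\rho _g=0\) (nonsmooth \(f_0\) and g), respectively,

$$\begin{aligned} N_{\mathrm {ACLB}}^{\mathrm {total}}(\epsilon )={\mathcal {O}}\left( L_{0}^{1/2}\epsilon ^{-3/2}+L_{g}^{1/2}\epsilon ^{-3/2}+\Vert A\Vert \epsilon ^{-2}\right) \quad \text {and}\quad N_{\mathrm {ACLB}}^{\mathrm {total}}(\epsilon )={\mathcal {O}}\left( L_{0}^{2}\epsilon ^{-3}+L_{g}^{2}\epsilon ^{-3}+\Vert A\Vert \epsilon ^{-2}\right) . \end{aligned}$$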

4 Numerical experiment

In this section, we conduct a series of numerical experiments on synthetic constrained optimization problems. In the first part of our experiments, we evaluate the performance of the proposed ACLB algorithm on a smooth constrained optimization problem, i.e., both the objective and constraint functions are smooth. We demonstrate the improved convergence rate of ACLB by comparing it to a state-of-the-art constrained level-bundle (CLB) method [27] for such smooth constrained problems. In the second part of our experiments, we compare the proposed ACLB on nonsmooth but structured constrained problems, where we implement our method either as if the problem were a generic nonsmooth problem (labeled ACLB) or utilizing the smoothing technique (labeled ACLB-S). More specifically, the objective function in the latter problem is nonsmooth but can be written as a max-type function using Fenchel duality. We construct a special type of problem of this kind, and show that ACLB-S may outperform ACLB by making use of the max-type structure of the objective function.

The performance of all comparison algorithms is evaluated using the progress of the improvement function \(h_k\) and the estimated lower bound \(L_k\), where k is the iteration counter (the number of calls to the first-order oracle), averaged over 10 random instances. The improvement function \(h_k\) should monotonically decrease to 0, and the estimated lower bound \(L_k\) should increase to \(F^*\), although \(F^*\) is often unknown. Therefore, the algorithm with faster decay (increase) of \(h_k\) (\(L_k\), respectively) is considered more efficient. All the experiments are implemented and tested in MATLAB R2018a on a Windows desktop with a 3.70GHz CPU and 16GB of memory.

4.1 Smooth constrained optimization

We consider the following constrained optimization problem:

$$\begin{aligned} \min _{x \in X} \frac{1}{2}\Vert Ax-b\Vert ^2, \quad \text{ subject } \text{ to } \frac{1}{2}\Vert Cx - d\Vert ^2 \le e, \end{aligned}$$
(49)

where \(x\in X \triangleq \{x\in \mathbb {R}^n : \Vert x\Vert _{\infty } \le 1\}\), A and C are given matrices, b, d are given vectors of compatible dimensions, and \(e>0\) is a given error tolerance. In our experiments, we set both A and C as m-by-n Gaussian random matrices, where n is the dimension of the unknown x and \(b,d\in \mathbb {R}^m\), for two different pairs of (m, n), namely (50, 100) and (200, 500). More specifically, for each pair of (m, n), we first generate a reference \(\bar{x}\in \mathbb {R}^n\) using the MATLAB randn function and normalize it by \(\bar{x}\leftarrow \bar{x}/\Vert \bar{x}\Vert\), then generate two random matrices \(A,C\in \mathbb {R}^{m\times n}\), and set \(b=A\bar{x}\), \(d=C\bar{x}\), and \(e=10\). Namely, \(F(x) = f_0(x) = \frac{1}{2}\Vert Ax - b\Vert ^2\) and there is one constraint \(g(x) = \frac{1}{2}\Vert Cx - d \Vert ^2 -e\) in (1). In this case, we know \(\bar{x}\) is an optimal solution, and the optimal objective function value is \(F^*=0\). Therefore, a convergent level-bundle method applied to (49) should generate \(\{x_k\}\) such that \(h_k \downarrow 0\) and \(L_k\uparrow F^*=0\) as \(k\rightarrow \infty\). For comparison, we apply CLB (Algorithm 1 in [27]) and the proposed ACLB algorithm to solve (49). We set \(\gamma =0.6\) for CLB and \(\beta =\theta =0.6\) for ACLB, try the initial lower bound of \(F^*\) as \(L_0=-1\) and \(-1000\) (denoted by \(f_{\text {low}}^0\) in CLB [27]), and set the total bundle size (nb) for both objective and constraint functions to 10 and 60 (i.e., 5 and 30 for each of f and g). The initial \(x_0\) is set to 0, and \(\epsilon = 10^{-6}\) in the termination condition (same for all experiments in this section). We also set the same maximum number 800 of calls to the first-order oracle in all methods. We use the MATLAB built-in QP solver quadprog to solve the subproblems in both algorithms. For \(m=50\) and \(n=100\), we generate 10 random instances, and show the average of \(h_k\) and \(L_k\) versus iteration number with initial lower bound \(L_0=-1\) (left two plots) and \(L_0=-1000\) (right two plots) in the top row of Fig. 1. Due to the data generation above, the true \(F^*=0\) in all cases. The results of the same experiment with problem size \(m=200\) and \(n=500\) are shown in the bottom row of Fig. 1. In Table 2, we also show the mean and standard deviation (in parentheses) of \(L_k\), \(h_k\), and CPU time (in seconds) after \(k=800\) iterations over the 10 random instances in the following order from top to bottom: \(L_0=-1\) (first table) and \(L_0=-1000\) (second table) with problem size \(n=100\), and \(L_0=-1\) (third table) and \(L_0=-1000\) (fourth table) with problem size \(n=500\). As we can see, ACLB significantly improves the convergence for the smooth constrained problem (49).
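For reference, the synthetic data generation described at the beginning of this subsection amounts to the following (a Python transcription of the MATLAB setup described above; the function and variable names are ours):

```python
import numpy as np

def make_instance(m, n, e=10.0, seed=0):
    """Synthetic instance of (49): Gaussian A, C and consistent b, d so that
    the reference point xbar is optimal with F* = 0."""
    rng = np.random.default_rng(seed)
    xbar = rng.standard_normal(n); xbar /= np.linalg.norm(xbar)
    A, C = rng.standard_normal((m, n)), rng.standard_normal((m, n))
    b, d = A @ xbar, C @ xbar
    F = lambda x: 0.5 * np.sum((A @ x - b) ** 2)          # objective f_0 in (49)
    g = lambda x: 0.5 * np.sum((C @ x - d) ** 2) - e      # single constraint g(x) <= 0
    return A, b, C, d, F, g, xbar

A, b, C, d, F, g, xbar = make_instance(50, 100)           # the (m, n) = (50, 100) setting
```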

Fig. 1

Average of improvement function h and the corresponding lower bound L over 10 instances versus iteration with initial \(L_0=-1\) (left two columns) and initial \(L_0=-1000\) (right two columns) for the smooth constrained problem (49) with problem sizes \(n=100\) (top row) and \(n=500\) (bottom row)

Table 2 Comparison of CLB and ACLB on the smooth constrained problem (49) with different bundle size (nb)

4.2 Structured nonsmooth constrained optimization

We propose to leverage the special structure of certain nonsmooth constrained problems to achieve an improved convergence rate with ACLB; we call the resulting variant ACLB-S. To demonstrate the improvement gained by ACLB-S, we consider the following nonsmooth constrained problem:

$$\begin{aligned} \min _{x\in X} \{\Vert Ax-b\Vert _1 = \max _{\Vert y\Vert _{\infty }\le 1}\langle Ax-b, y\rangle \} , \quad \text{ subject } \text{ to } \frac{1}{2}\Vert Cx - d\Vert ^2 \le e, \end{aligned}$$
(50)

where again \(X \triangleq \{x\in \mathbb {R}^n : \Vert x\Vert _{\infty } \le 1\}\), A and C are given matrices, b, d are given vectors of compatible dimensions, and \(e>0\) is a given error tolerance. Therefore, the objective function is \(F(x)=f(x) = \max _{y \in Y} \{\langle Ax, y\rangle - \chi (y)\}\) where \(Y=\{y \in \mathbb {R}^m : \Vert y\Vert _{\infty } \le 1\}\) and \(\chi (y) = \langle b, y\rangle\). We apply both ACLB (as if the objective function were a generic nonsmooth function \(F(x)=f_0(x)=\Vert Ax-b\Vert _1\)) and ACLB-S to the problem (50). For this test, both A and C are \(2062\times 2062\) matrices, where A is from the worst-case QP instance for first-order methods generated by Nemirovski (see the construction scheme in [19, 21]) and C is randomly generated using the MATLAB built-in function randn. Then we set \(b=Ax, d= Cx, e=10^{-3}\), and the initial lower bound of \(f^*\) to \(L_0=-1\) and \(L_0=-1000\) for testing. For \(L_0=-1\), we set \(\theta =0.8\) for both ACLB and ACLB-S, and \(\beta =0.5\) for ACLB and 0.4 for ACLB-S. For \(L_0=-1000\), we set \(\theta =\beta =0.6\) for ACLB, and \(\theta =0.7,\beta =0.4\) for ACLB-S. The total bundle size for both objective and constraint functions is set to 10. For ACLB-S, we use a sufficiently large estimate 1000 for \(D_{v,Y}\) and compute the corresponding \(\eta\) (same below). We also applied CLB (Algorithm 1 in [27]) with \(\gamma =0.6\) to this problem with the same bundle management setting. These parameter settings appear to yield the best performance of the comparison algorithms among those we tested. We also set the same maximum number 600 of calls to the first-order oracle in all methods. We use the MATLAB built-in QP solver quadprog to solve the subproblems in all three algorithms. We again generate 10 random instances. The average of \(h_k\) and \(L_k\) versus iteration number with initial \(L_0=-1\) (left two plots) and \(L_0=-1000\) (right two plots) are given in Fig. 2. In Table 3, we also show the mean and standard deviation (in parentheses) of \(L_k\), \(h_k\), and CPU time (in seconds) after \(k=600\) iterations with initial \(L_0=-1\) (top table) and \(L_0=-1000\) (bottom table). From these results, we can see that ACLB and ACLB-S have similar efficiency for this problem, and both outperform CLB significantly.

Fig. 2

Average of improvement function h and the corresponding lower bound L over 10 instances for the bundle size 10 versus iteration with initial \(L_0=-1\) (left two plots) and initial \(L_0=-1000\) (right two plots) for the structured nonsmooth constrained problem (50) of size \(n=2062\)

Table 3 Comparison of CLB, ACLB, and ACLB-S on the structured nonsmooth constrained problem (50) of size \(n=2062\)

We further consider the problem (50) where A and C are both \(2062\times 4124\), and \(X = \{x\in \mathbb {R}^n : \Vert x\Vert \le 1\}\). In this case, the MATLAB built-in QP solver becomes much slower as the dimension of x is \(n=4124\). However, at the small bundle size 10, we can still solve the dual problem efficiently as in [4], since the set X is a ball in \(\mathbb {R}^n\). By employing this solver, we apply ACLB and ACLB-S to problem (50) with initial \(L_0\) set to \(-1\) and \(-1000\). In both cases, we set \(\theta =\beta =0.5\) for both ACLB and ACLB-S, apply them to 10 randomly generated instances of (50), and compare their performance. We also set the same maximum number 800 of calls to the first-order oracle in these methods. The average of \(h_k\) and \(L_k\) versus iteration number with initial \(L_0=-1\) (left two plots) and \(L_0=-1000\) (right two plots) are given in Fig. 3. In Table 4, we also show the mean and standard deviation (in parentheses) of \(L_k\), \(h_k\), and CPU time (in seconds) after \(k=800\) iterations with initial \(L_0=-1\) (top table) and \(L_0=-1000\) (bottom table). From these results with the larger problem size, we can see that ACLB-S outperforms ACLB by making use of the max-type structure of the objective function for improved theoretical and practical convergence rates.

Fig. 3

Average of improvement function h and the corresponding lower bound L over 10 instances for the bundle size 10 versus iteration with initial \(L_0=-1\) (left two plots) and initial \(L_0=-1000\) (right two plots) for the structured nonsmooth constrained problem (50) of size \(n=4124\)

Table 4 Comparison of ACLB and ACLB-S on the structured nonsmooth constrained problem (50) of size \(n=4124\)

5 Concluding remarks

In this paper, we presented an accelerated constrained level-bundle method, called ACLB, that uniformly solves smooth, weakly smooth, and nonsmooth convex constrained optimization problems. We provided a convergence analysis of the proposed method. The iteration complexity bound of ACLB is obtained via estimates of the total number of calls to the gap reduction procedure and of the number of iterations performed within the critical and non-critical gap reduction procedures. To the best of our knowledge, this is the first work in the literature that establishes iteration complexity bounds for accelerated level-bundle methods applied to constrained optimization problems in which the objective and/or constraint functions are smooth or weakly smooth. We provided numerical results to demonstrate the improved efficiency of ACLB in practice.