1 Introduction

Many data analysis problems are modeled as instances of the following broad class of convex programming (CP) problems:

$$\begin{aligned} f^*:=\min _{x\in X} f(x), \end{aligned}$$
(1.1)

where \(X \subseteq \mathbb {R}^n\) is a closed convex set, and \(f:X\rightarrow \mathbb {R}\) is a convex function. We denote by \(X^*\) the solution set of the above problem. Throughout this paper, we assume that the solution set \(X^*\) is nonempty. One example of problem (1.1) is the classic ridge regression model (namely, Tikhonov regularization) in statistical learning, which estimates the parameters \(\beta \) by

$$\begin{aligned} \min _{\beta \in \mathbb {R}^n}\Vert y-A\beta \Vert ^2 \text { subject to } \Vert \beta \Vert \le \lambda , \end{aligned}$$
(1.2)

where \(\Vert \cdot \Vert \) denotes the Euclidean norm, y describes the observed outcome, A is the matrix of predictors in the observed data, and \(\lambda \) is a regularization parameter. The above model can be viewed as a special case of (1.1) with \(x=\beta \), \(f(x) = \Vert y-Ax\Vert ^2\), and \(X=\{x\in \mathbb {R}^n: \Vert x\Vert \le \lambda \}\). Another important example is the classical two-dimensional total variation (TV) based image reconstruction problem [2, 3] given by:

$$\begin{aligned} \min _{u\in \mathbb {R}^n}\frac{1}{2}\Vert Au-b\Vert ^2 + \lambda \Vert u\Vert _{TV}, \end{aligned}$$
(1.3)

where A is the measurement matrix, u is the n-vector form of the two-dimensional image to be reconstructed, b represents the observed data, \(\Vert \cdot \Vert _{TV}\) is the discrete TV semi-norm, and \(\lambda \) is the regularization parameter. Problem (1.3) can also be cast as (1.1) by setting \(x=u\), \(f(x) = \Vert Ax-b\Vert ^2/2 + \lambda \Vert x\Vert _{TV}\), and \(X=\mathbb {R}^n\). It is worth noting that while problem (1.2) has a Euclidean ball constraint, problem (1.3) is an unconstrained CP problem. Moreover, the objective function in (1.2) is smooth, while the objective function in (1.3) is the sum of a smooth term \(\Vert Ax-b\Vert ^2/2\) and a nonsmooth term \(\lambda \Vert x\Vert _{TV}\).
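To illustrate how these models fit the template (1.1), the following minimal Python sketch (with synthetic placeholder data, not taken from the paper) assembles the three ingredients a first-order method needs for the ridge regression instance (1.2): the objective f, its gradient, and the Euclidean projection onto X.

```python
import numpy as np

# Synthetic placeholder data for the ridge regression instance (1.2).
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 20))    # predictors
y = rng.standard_normal(50)          # observed outcome
lam = 5.0                            # radius of the feasible ball X

def f(x):
    """Objective f(x) = ||y - A x||^2 in (1.2)."""
    r = y - A @ x
    return r @ r

def grad_f(x):
    """Gradient of f: -2 A^T (y - A x)."""
    return -2.0 * A.T @ (y - A @ x)

def project_X(x):
    """Euclidean projection onto X = {x : ||x|| <= lam}."""
    nx = np.linalg.norm(x)
    return x if nx <= lam else (lam / nx) * x

# One projected-gradient step from the origin, just to show how the pieces fit (1.1).
x0 = np.zeros(20)
step = 1.0 / (2.0 * np.linalg.norm(A, 2) ** 2)   # 1/L for this smooth f
x1 = project_X(x0 - step * grad_f(x0))
print(f(x0), f(x1))
```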

Due to the high dimensionality of x in many applications of data analysis and imaging, much recent research effort has been directed to the development of efficient first-order methods for solving (1.1). First-order methods use gradients (or subgradients) of f exclusively and hence have a significantly lower iteration cost than second-order methods. The efficiency of these algorithms is often measured by their iteration complexity in terms of the number of (sub)gradient evaluations required to find an approximate solution of (1.1). In view of the classic complexity theory [4], for any first-order method the number of (sub)gradient evaluations required to find an \(\epsilon \)-solution of (1.1) (i.e., a point \(x_{\epsilon }\in \mathbb {R}^n\) satisfying \(f(x_{\epsilon }) - f^*\le \epsilon \)) cannot be smaller than \(\mathcal {O}(1/\epsilon ^2)\) if f is nonsmooth. This bound can be achieved, for example, by traditional subgradient methods. For a smooth f, the optimal iteration complexity is \(\mathcal {O}(1/\sqrt{\epsilon })\), which can be achieved, for example, by Nesterov’s accelerated gradient (AG) algorithms [5,6,7]. Recently, by adapting Nesterov’s AG schemes and smoothing technique [5], several popular classes of first-order methods have been developed to improve their iteration complexity bounds. For instance, the accelerated primal dual (APD) algorithm [8] and the accelerated hybrid proximal extragradient algorithm [9], which exhibit the optimal iteration complexity for solving a broad class of saddle point problems, as well as several variants of the alternating direction method of multipliers (ADMM) [10,11,12,13,14,15], have improved the iteration complexity with respect to the smooth component.

In this paper, we focus on a different class of first-order methods, i.e., bundle-level (BL) type methods, which utilize historical first-order information through cutting plane models to accelerate the numerical performance of the gradient descent type methods mentioned above. We first give a brief review of several types of BL methods.

1.1 Cutting plane, bundle and bundle-level methods

The bundle-level method originated from Kelley’s well-known cutting-plane method proposed in 1960 [16]. Consider the convex programming problem

$$\begin{aligned} f^*_X:=\min _{x\in X}f(x), \end{aligned}$$
(1.4)

where X is a compact convex set and f is a closed convex function. The fundamental idea of the cutting plane method is to generate a sequence of piecewise linear functions to approximate f on X. In particular, given \(x_1,x_2,\ldots ,x_k\in X\), this method approximates f by

$$\begin{aligned} m_k(x):=\max \{h(x_i,x),1\le i\le k\}, \end{aligned}$$
(1.5)

and computes the iterate \(x_{k+1}\) by

$$\begin{aligned} x_{k+1}\in \mathrm{Argmin}_{x\in X} m_k(x), \end{aligned}$$
(1.6)

where

$$\begin{aligned} h(z,x):=f(z)+\left\langle f'(z),x-z\right\rangle , \end{aligned}$$
(1.7)

and \(f'(x)\in \partial f(x)\), where \(\partial f(x):=\{\xi \in \mathbb {R}^n|f(y)\ge f(x) + \langle \xi , y-x\rangle ,\ \forall y\in \mathbb {R}^n\}\) denotes the subdifferential of f at x. Clearly, the functions \(m_i\), \(i = 1, 2, \ldots \), satisfy \(m_i(x)\le m_{i+1}(x)\le f(x)\) for any \(x\in X\), and are identical to f at those search points \(x_i\), \(i = 1, \ldots , k\). However, the inaccuracy and instability of the piecewise linear approximation \(m_k\) over the whole feasible set X may affect the selection of new iterates, and the above scheme converges slowly both theoretically and practically [4, 7]. Some important improvements of Kelley’s method have been made under the name of bundle methods (see, e.g., [17,18,19,20,21]). In particular, by incorporating the level sets into Kelley’s method, Lemaréchal, Nemirovskii and Nesterov [20] proposed in 1995 the classic bundle-level (BL) method by performing a series of projections over the approximate level sets.

Given \(x_1,x_2,\ldots ,x_k\), the classic BL iteration consists of the following three steps:

  1. (a)

    Set \(\overline{f}_k:=\min \{f(x_i),1\le i\le k\}\) and compute a lower bound on \(f^*_X\) by \( \underline{\smash {f}}_k = \min _{x\in X} m_k(x). \)

  2. (b)

    Set the level \(l_k = \beta \underline{\smash {f}}_k + (1-\beta )\overline{f}_k\) for some \(\beta \in (0,1)\).

  3. (c)

    Set \( X_k:=\{x\in X: m_k(x)\le l_k\}\) and determine the new iterate by

    $$\begin{aligned} x_{k+1} = \mathrm{argmin}_{x \in X_k} \Vert x - x_k\Vert ^2. \end{aligned}$$
    (1.8)

In the BL method, the localizer \(X_k\) is used to approximate the level set \(L_k := \{x: f(x) \le l_k \}\), because the projection over \(L_k\) is often too difficult to compute. Intuitively, as k increases, the value of \(l_k\) will converge to \(f^*_X\), and consequently both \(L_k\) and \(X_k\) will converge to the set of optimal solutions to problem (1.4). It is shown in [20] that the number of BL iterations required to find an \(\epsilon \)-solution to problem (1.4), i.e., a point \(\hat{x} \in X\) s.t. \(f(\hat{x}) - f^*_X \le \epsilon \), can be bounded by \(\mathcal{O} (1/\epsilon ^2)\), which is optimal for general nonsmooth convex optimization in the black-box model.
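For concreteness, the following hedged Python sketch (an illustration on a synthetic piecewise-linear instance, not code from the paper) implements the three steps (a)–(c) above: the lower bound in step (a) is computed by the same linear program as Kelley's subproblem (1.6), and the projection (1.8) onto the localizer \(X_k\) is handed to a general-purpose constrained solver.

```python
import numpy as np
from scipy.optimize import linprog, minimize

# Synthetic nonsmooth instance: f(x) = max_j (C x + d)_j over the box X = [-1, 1]^n.
rng = np.random.default_rng(1)
n, m = 5, 8
C = rng.standard_normal((m, n))
d = rng.standard_normal(m)
beta = 0.5

def f_and_subgrad(x):
    vals = C @ x + d
    j = int(np.argmax(vals))
    return vals[j], C[j]            # f(x) and one subgradient f'(x)

x = np.zeros(n)
cuts, f_best = [], np.inf           # each cut stores (g_i, c_i) with h(x_i, z) = <g_i, z> + c_i
for k in range(15):
    fx, g = f_and_subgrad(x)
    cuts.append((g, fx - g @ x))
    f_best = min(f_best, fx)                                 # step (a): upper bound
    # step (a): lower bound  min_{z in X} m_k(z), an LP in (z, t) -- Kelley's subproblem (1.6)
    A_ub = np.array([np.append(gi, -1.0) for gi, _ in cuts])
    b_ub = np.array([-ci for _, ci in cuts])
    res = linprog(np.append(np.zeros(n), 1.0), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(-1.0, 1.0)] * n + [(None, None)])
    f_low = res.x[n]
    level = beta * f_low + (1.0 - beta) * f_best             # step (b)
    # step (c): project the current iterate onto X_k = {z in X : m_k(z) <= level}
    cons = [{'type': 'ineq', 'fun': lambda z, gi=gi, ci=ci: level - (gi @ z + ci)}
            for gi, ci in cuts]
    proj = minimize(lambda z: np.sum((z - x) ** 2), x, method='SLSQP',
                    constraints=cons, bounds=[(-1.0, 1.0)] * n)
    x = proj.x
    print(k, f_best, f_low)
```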

Observe that in the above BL method the localizer \(X_k\) accumulates constraints, and hence the subproblem in Step c) becomes more and more expensive to solve. In order to overcome this difficulty, some restricted memory BL algorithms have been developed in [19, 22]. In particular, Ben-Tal and Nemirovski [22] introduced the non-Euclidean restricted memory level (NERML) method, in which the number of extra linear constraints in \(X_k\) can be as small as 1 or 2, without affecting the optimal iteration complexity. Moreover, the objective function \(\Vert \cdot \Vert ^2\) in (1.8) is replaced by a general Bregman distance \(d(\cdot )\) for exploiting the geometry of the feasible set X. From our understanding, the efficiency of NERML can be attributed to its combination of earlier advances in BL methods (e.g., [7,11,5,36]) with the incorporation of the mirror descent idea in order to exploit the geometry of the feasible set. More recent developments of inexact proximal bundle methods and BL methods can be found in [23,24,25,26,27,28,29,30].

While the classic BL method is optimal only for solving nonsmooth CP problems, Lan [1] recently generalized this method significantly so that it can optimally solve any black-box CP problem, including nonsmooth, smooth and weakly smooth CP problems. In particular, for problem (1.4) over a compact feasible set X, the two new BL methods proposed in [1], i.e., the accelerated bundle-level (ABL) and accelerated prox-level (APL) methods, can solve these problems optimally without requiring any information on problem parameters. The ABL method can be viewed as an accelerated version of the classic BL method. As in the classic BL method, the lower bound on \(f^*_X\) is estimated from the cutting plane model \(m_k\) in (1.5), and the upper bound on \(f^*_X\) is given by the best objective value found so far. The novelty of the ABL method lies in the use of three different sequences, i.e., \(\{x_k^l\},\{x_k\}\) and \(\{x_k^u\}\), for updating the lower bound, prox-center, and upper bound, respectively, which leads to its accelerated iteration complexity for smooth and weakly smooth problems. The APL method is a more practical, restricted memory version of the ABL method, which also employs non-Euclidean prox-functions to exploit the geometry of the feasible set X. It is shown in [1] that both the ABL and APL methods achieve the optimal iteration complexity uniformly for smooth, weakly smooth and nonsmooth convex functions when solving problem (1.4). Moreover, by incorporating Nesterov’s smoothing technique [5] into the APL method, Lan also presented in [1] the uniform smoothing level (USL) method, which achieves the optimal complexity for solving an important class of nonsmooth structured saddle point (SP) problems without requiring the input of any problem parameters (see Sect. 2.2 for more details).

1.2 Motivation and contribution of this paper

One crucial problem associated with most existing BL type methods, including APL and USL, is that each iteration of these algorithms involves solving two optimization problems: first a linear programming problem to compute the lower bound, and then a constrained quadratic programming problem to update the prox-center or new iterate. In fact, the efficiency of these algorithms relies on efficient solutions of these two subproblems, and the latter is often more complicated than the projection subproblem in gradient projection type methods. Moreover, most existing BL type methods require the assumption that the feasible set is bounded, for the following two reasons. Firstly, the feasible set has to be bounded in order to compute a meaningful lower bound by solving the aforementioned linear programming problem. Secondly, the iteration complexity analysis of these methods, such as the classical BL method [20], NERML [22], and ABL, APL and USL [1], relies on the assumption that the feasible set is compact. It should be noted that there exist some variants of BL methods [23, 29, 31, 32] for solving nonsmooth CP problems in which the computation of the subproblem for updating the lower bound is skipped, so that the feasible set X is allowed to be unbounded. For instance, the level bundle method in [32] updates the level parameter directly when the distance from the stability center to the newly generated iterate becomes larger than a chosen parameter. However, all these methods focus on solving nonsmooth CP problems; from the complexity analysis point of view, when applied to smooth CP problems they do not necessarily achieve the optimal rate of convergence for smooth convex optimization. In this paper, our main focus is to develop a BL type method that achieves the optimal rate of convergence for unconstrained smooth convex optimization. Our contribution mainly consists of the following three aspects.

Firstly, we propose the FAPL and FUSL methods that greatly reduce the computational cost per iteration of the APL and USL methods for solving ball-constrained CP problems. The improvement is achieved mainly by eliminating the linear optimization subproblem for computing the lower bound and removing the ball constraint in the quadratic subproblem by properly choosing the prox-functions. More importantly, we are able to show that with these simplifications the FAPL and FUSL methods can also achieve the optimal iteration complexity uniformly for smooth, nonsmooth and weakly smooth functions.

Secondly, we propose a novel algorithmic framework for solving unconstrained CP problems. The proposed framework solves unconstrained CP problems through solutions to a series of ball-constrained CP problems and achieves the same order of the iteration complexity as the corresponding ball-constrained CP problems. In particular, if there exists a uniformly optimal method (e.g., the APL and USL methods) that solves ball-constrained black-box or structured CP problems, then the proposed algorithm solves unconstrained black-box or structured CP problems with optimal complexity without requiring the input of any additional problem parameters. To the best of our knowledge, this is the first time in the literature that the complexity analysis has been performed for BL type methods to solve unconstrained CP problems (see Sections 3.2 and 3.3 in [33] for more details).

Finally, we apply the FAPL and FUSL methods to solve large-scale least squares problems and total variation based image reconstruction problems. Our experimental results show that these algorithms outperform APL, NERML, Nesterov’s optimal method [5], the accelerated primal dual (APD) method [8], Nesterov’s smoothing (NEST-S) gradient method [5, 34], the accelerated linearized ADMM (AL-ADMM) with line search [15], and the MATLAB solver for linear systems, especially when the dimension and/or the Lipschitz constant of the problem is large. Moreover, by using the FAPL and FUSL methods, we can achieve more accurate solutions to the corresponding CP problems, which results in better reconstructed image quality and lower acquisition rates.

1.3 Organization of the paper

The paper is organized as follows. In Sect. 2, the new FAPL and FUSL methods are proposed followed by their convergence analysis, then an exact approach is introduced to solve the subproblem in these algorithms. In Sect. 3, we present a general scheme to extend the optimal methods for ball-constrained CP problems to unconstrained CP problems. The applications to quadratic programming and image processing problems are presented in Sect. 4.

2 Fast prox-level type methods for ball-constrained problems

In this section, we discuss the following ball-constrained CP problem:

$$\begin{aligned} f^*_{\overline{x}, R}:=\min _{x\in B(\overline{x},R)}f(x), \end{aligned}$$
(2.1)

where \(B(\bar{x}, R):= \left\{ x \in \mathbb {R}^n: \Vert x - \overline{x}\Vert \le R \right\} \) denotes the closed Euclidean ball centered at \(\bar{x}\) with radius R, and \(f: \mathbb {R}^n\rightarrow \mathbb {R}\) is a convex function satisfying

$$\begin{aligned} f(y)-f(x)-\left\langle f'(x),y-x\right\rangle \le \frac{M}{1+\rho }\Vert y-x\Vert ^{1+\rho }, \ \ \forall x,y \in B(\bar{x}, R), \end{aligned}$$
(2.2)

for some \(M>0\) and \(\rho \in [0, 1]\). This \(f(\cdot )\) can be nonsmooth (\(\rho =0\)), smooth (\(\rho =1\)) and weakly smooth (\(0<\rho <1\)).
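For example, the smooth least-squares objective \(f(x)=\Vert Ax-b\Vert ^2/2\) from (1.3) satisfies (2.2) with \(\rho =1\) and \(M=\Vert A^TA\Vert \); the short numerical check below (an illustration with random data) verifies this inequality on sampled pairs.

```python
import numpy as np

# Check (2.2) with rho = 1 and M = ||A^T A|| for f(x) = ||A x - b||^2 / 2 (random data).
rng = np.random.default_rng(0)
A = rng.standard_normal((30, 10))
b = rng.standard_normal(30)
M = np.linalg.norm(A.T @ A, 2)          # Lipschitz constant of the gradient

f = lambda x: 0.5 * np.sum((A @ x - b) ** 2)
grad = lambda x: A.T @ (A @ x - b)

for _ in range(1000):
    x, y = rng.standard_normal(10), rng.standard_normal(10)
    lhs = f(y) - f(x) - grad(x) @ (y - x)
    rhs = (M / 2.0) * np.linalg.norm(y - x) ** 2    # M / (1 + rho) with rho = 1
    assert lhs <= rhs + 1e-9
print("(2.2) verified empirically with rho = 1, M =", M)
```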

This section consists of two subsections. We first present a much simplified APL method, referred to as the fast APL (FAPL) method, for solving ball-constrained black-box CP problems in Sect. 2.1, and then present the fast USL (FUSL) method for solving a special class of ball-constrained structured CP problems in Sect. 2.2. We show how to solve the subproblems of these two algorithms in “Appendix A”.

2.1 FAPL for ball-constrained black-box problems

Our goal in this subsection is to present the FAPL method, which reduces the iteration cost of the APL method applied to problem (2.1). In particular, we show that only one subproblem, rather than two subproblems (see Equations (2.9) and (2.10) in [1]) as in the APL method, is required in the FAPL method for defining a new iterate (or prox-center) and updating the lower bound. We also demonstrate that the ball constraint in (2.1) can be eliminated from the subproblem by properly specifying the prox-function.

Similarly to the APL method, the FAPL method consists of outer-inner loops, and in each outer iteration, an inner loop, the FAPL gap reduction procedure denoted by \(\mathcal {G}_{FAPL}\), is called to reduce the gap between the upper and lower bounds on \(f^*_{\overline{x}, R}\) in (2.1) by a constant factor.

We start by describing the FAPL gap reduction procedure in Procedure 1. This procedure differs from the gap reduction procedure used in the APL method in the following aspects. Firstly, the localizers \(\underline{\smash {Q}}_k\) and \(\overline{Q}_k\) in procedure \(\mathcal {G}_{FAPL}\) (see Steps 1 and 4) only contain linear constraints and hence are possibly unbounded, while the localizers in the APL method must be compact. Secondly, we eliminate the subproblem that updates the lower bound on \(f^*_{\overline{x},R}\) in the APL method. Instead, in the FAPL method, the lower bound is updated to the level l directly whenever \(\underline{\smash {Q}}_k = \emptyset \) or \(\Vert x_k-\overline{x}\Vert >R\). Note that this strategy has been used in some existing BL methods [32]. Thirdly, we choose a specific prox-function \(d(x)=\frac{1}{2}\Vert x-\overline{x}\Vert ^2\), and as a result, all three sequences \(\{x_k\},\{x_k^l\}\) and \(\{x_k^u\}\) reside in the ball \(B(\overline{x},R)\). Lastly, as we will show in Appendix A, since the subproblem (2.6) only contains a limited number of linear constraints (the depth of memory), we can solve it efficiently, or even exactly if the depth of memory is small (e.g., less than 10).

[Procedure 1: The FAPL gap reduction procedure \(\mathcal {G}_{FAPL}\)]

We now add a few more remarks about the technical details of Procedure 1. Firstly, Procedure 1 is terminated at Step 2 if \(\underline{\smash {Q}}_k=\emptyset \) or \(\Vert x_k-\overline{x}\Vert > R\), which can be checked automatically when solving the subproblem (2.6) (see Appendix A for more details). Secondly, the set \(\underline{\smash {Q}}_k\) is nonempty at Step 4 (otherwise, the procedure would have stopped at Step 2). Moreover, we can show that \(\underline{\smash {Q}}_k\) is a subset of \(\bar{Q}_k\) (see Lemma 2.2). Therefore, \(\bar{Q}_k\) will never be empty at Step 4. Thirdly, in Step 4, \(Q_k\) can be any polyhedral set between \(\underline{\smash {Q}}_k\) and \(\overline{Q}_k\). In practice, we can simply choose \(Q_k\) to be the intersection of the half-space \(\{x\in \mathbb {R}^n: \left\langle x_k-\overline{x},x-x_k\right\rangle \ge 0\}\) and a few of the most recently generated half-spaces, each of which is defined by \(\{x\in \mathbb {R}^n:h(x_\tau ^l,x)\le l\}\) for some \(1\le \tau \le k\). Finally, in order to guarantee the termination of procedure \(\mathcal {G}_{FAPL}\) and the optimal iteration complexity, the parameters \(\{\alpha _k\}\) used in this procedure need to be properly chosen. One set of conditions on \(\{\alpha _k\}\) that guarantees the convergence of procedure \(\mathcal {G}_{FAPL}\) is given as follows:

$$\begin{aligned} \alpha _1=1,\ 0<\alpha _k\le 1, \alpha _k \le \frac{c}{k} \ \text {and}\ \frac{1-\alpha _{k+1}}{\alpha _{k+1}^{1+\rho }}\le \frac{1}{\alpha _k^{1+\rho }}, \forall k\ge 1, \end{aligned}$$
(2.10)

for some constant \(c>0\) and for all \(\rho \in [0,1]\).

The following lemma provides two examples for the selection of \(\{\alpha _k\}\).

Lemma 2.1

  1. (a)

    If \(\alpha _k={2}/(k+1)\), \(k=1,2,\ldots \), then the condition (2.10) is satisfied with \(c=2\).

  2. (b)

    If \(\{\alpha _k\}\) is recursively defined by

    $$\begin{aligned} \alpha _1=1,\alpha _{k+1} = \frac{-\alpha _k^2 + \sqrt{\alpha _k^4+4\alpha _k^2}}{2}, \forall k\ge 1, \end{aligned}$$
    (2.11)

    then the condition (2.10) holds with \(c=2\). Note that the stepsizes in (2.11) satisfy

    $$\begin{aligned} \alpha _{k+1}^2=(1-\alpha _{k+1})\alpha _{k}^2 \end{aligned}$$
    (2.12)

Proof

To prove part a), note that if \(\alpha _k=2/(k+1)\), then we have \(\alpha _1=1,\ 0<\alpha _k\le 1\), and \(\alpha _k \le {2}/{k}\). Moreover, for any \(\rho \in [0,1]\) we have

$$\begin{aligned} (1-\alpha _{k+1})\cdot \frac{\alpha _k^{1+\rho }}{\alpha _{k+1}^{1+\rho }}&= \frac{k}{k+2}\cdot \left( \frac{k+2}{k+1}\right) ^{1+\rho } = \frac{k}{k+1}\cdot \left( \frac{k+2}{k+1}\right) ^{\rho } \\&\le \frac{k}{k+1}\cdot \left( \frac{k+1}{k}\right) ^{\rho } = \left( \frac{k}{k+1}\right) ^{1-\rho }\le 1. \end{aligned}$$

Therefore (2.10) holds.

We continue to prove part b). First it is easy to verify that the stepsizes defined in (2.11) satisfy the relation (2.12). Moreover, by (2.12) we also have \(\alpha _{k+1}\not =0\) as long as \(\alpha _k\not =0\). Therefore, noting that \(\alpha _1=1\) we have \(\alpha _k\not =0\) for all \(k\ge 1\). Consequently, from (2.11) we have

$$\begin{aligned} \alpha _{k+1} = \frac{4\alpha _k^2}{2(\alpha _k^2 + \sqrt{\alpha _k^4+4\alpha _k^2})} \le \frac{4\alpha _k^2}{4\alpha _k^2} = 1,\ \forall k\ge 1, \end{aligned}$$

and thus we have \(\alpha _k\in (0,1]\) for all \(k\ge 1\). Using this result and (2.12), we also have \(\alpha _{k+1}^2<\alpha _k^2\), hence \(\alpha _{k+1}<\alpha _k\). Now rearranging the terms in (2.12), we observe that \(\alpha _k^2 - \alpha _{k+1}^2 = \alpha _{k+1}\alpha _k^2\). Using this observation and the previous result that \(\alpha _{k+1}<\alpha _k\), we obtain

$$\begin{aligned} \frac{1}{\alpha _{k+1}}-\frac{1}{\alpha _{k}} = \frac{\alpha _k-\alpha _{k+1}}{\alpha _k\alpha _{k+1}}=\frac{\alpha _k^2-\alpha _{k+1}^2}{\alpha _{k+1}\alpha _k^2}\cdot \frac{\alpha _{k}}{\alpha _k+\alpha _{k+1}} = \frac{\alpha _{k}}{\alpha _k+\alpha _{k+1}}> \frac{1}{2}, \forall k\ge 1. \end{aligned}$$

Using the above inequality and noting that \(\alpha _1=1\), we have

$$\begin{aligned} \frac{1}{\alpha _k} =&\left( \frac{1}{\alpha _k} - \frac{1}{\alpha _{k-1}}\right) + \left( \frac{1}{\alpha _{k-1}} - \frac{1}{\alpha _{k-2}}\right) + \cdots \left( \frac{1}{\alpha _2} - \frac{1}{\alpha _{1}}\right) \nonumber \\&+ \frac{1}{\alpha _1} > (k-1)\cdot \frac{1}{2} + 1 = \frac{k+1}{2}, \end{aligned}$$
(2.13)

and hence \(\alpha _k\le {2}/{(k+1)}< {2}/{k}\). Finally, using (2.12) and the result that \(\alpha _{k+1}<\alpha _k\), for any \(\rho \in [0,1]\) we have

$$\begin{aligned} (1-\alpha _{k+1})\cdot \frac{\alpha _k^{1+\rho }}{\alpha _{k+1}^{1+\rho }} = \frac{(1-\alpha _{k+1})\alpha _k^2}{\alpha _{k+1}^2}\cdot \frac{\alpha _k^{\rho -1}}{\alpha _{k+1}^{\rho -1}} = \frac{\alpha _k^{\rho -1}}{\alpha _{k+1}^{\rho -1}} = \left( \frac{\alpha _{k+1}}{\alpha _k}\right) ^{1-\rho } <1. \end{aligned}$$

Therefore (2.10) holds. \(\square \)
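Both stepsize rules in Lemma 2.1 are straightforward to generate and to verify numerically; the snippet below (illustrative only) produces the two sequences and checks the conditions in (2.10) with \(c=2\) for a chosen value of \(\rho \).

```python
import numpy as np

K, rho, c = 50, 1.0, 2.0   # Lemma 2.1 covers all rho in [0, 1]; we check rho = 1 here

# Rule (a): alpha_k = 2 / (k + 1).
alpha_a = np.array([2.0 / (k + 1) for k in range(1, K + 1)])

# Rule (b): the recursion (2.11), equivalent to alpha_{k+1}^2 = (1 - alpha_{k+1}) * alpha_k^2.
alpha_b = [1.0]
for _ in range(K - 1):
    a = alpha_b[-1]
    alpha_b.append((-a**2 + np.sqrt(a**4 + 4.0 * a**2)) / 2.0)
alpha_b = np.array(alpha_b)

def satisfies_2_10(alpha, tol=1e-12):
    """Numerically verify the conditions in (2.10) for a given stepsize sequence."""
    ks = np.arange(1, K + 1)
    basic = alpha[0] == 1.0 and np.all((0 < alpha) & (alpha <= 1)) and np.all(alpha <= c / ks + tol)
    lhs = (1 - alpha[1:]) / alpha[1:] ** (1 + rho)
    rhs = 1 / alpha[:-1] ** (1 + rho)
    return basic and np.all(lhs <= rhs * (1 + tol))

print(satisfies_2_10(alpha_a), satisfies_2_10(alpha_b))   # expected: True True
```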

The following lemma describes some important observations regarding the execution of procedure \(\mathcal {G}_{FAPL}\).

Lemma 2.2

Let \(\mathcal {E}_f(l):=\{x\in B(\overline{x},R):f(x)\le l\}\). The following statements hold for procedure \(\mathcal {G}_{FAPL}\).

  1. (a)

    If \(\mathcal {E}_f(l)\ne \emptyset \), then \(\mathcal {E}_f(l)\subseteq \underline{\smash {Q}}_k\subseteq Q_k \subseteq \overline{Q}_k\) for any \(k\ge 1\).

  2. (b)

    If \(\underline{\smash {Q}}_k \ne \emptyset \), then problem (2.6) in Step 2 has a unique solution. Moreover, if procedure \(\mathcal {G}_{FAPL}\) terminates at Step 2, then \(l\le f^{*}_{\overline{x},R}\).

Proof

To prove part a), it suffices to prove that \(\mathcal {E}_f(l)\subseteq \underline{\smash {Q}}_k\) and \(\underline{\smash {Q}}_k\subseteq \overline{Q}_k\) for all \(k\ge 1\), since from the definition of \(Q_k\) we already have \(\underline{\smash {Q}}_k\subseteq Q_k\subseteq \overline{Q}_k\). We first use induction to prove the former relation. As \(Q_0\) is set to \(\mathbb {R}^n\), we have \(\mathcal {E}_f(l)\subseteq Q_0\). Moreover, if \(\mathcal {E}_f(l)\subseteq Q_{k-1}\) for some \(k\ge 1\), then from the definition of \(\underline{\smash {Q}}_k\) in (2.5) and the observation that \(h(x_k^l,z)\le f(z)\le l\) for all \(z\in \mathcal {E}_f(l)\), we have \(\mathcal {E}_f(l)\subseteq \underline{\smash {Q}}_k\). To prove the relation that \(\underline{\smash {Q}}_k\subseteq \overline{Q}_k\), observe that the definition of \(\overline{Q}_k\) in (2.9) implies that \(\overline{Q}_k = \{x\in \mathbb {R}^n:d(x)\ge d(x_k) \}\). Combining such observation with the definition of \(x_k\) in (2.6), we have \(\underline{\smash {Q}}_k\subseteq \overline{Q}_k\) and conclude part a).

We now provide the proof of part b). From the definition of \(Q_k\) in Step 4 and the definition of \(\underline{\smash {Q}}_k\) in (2.5), we can see that \(\underline{\smash {Q}}_k\) is an intersection of half-spaces, and hence it is closed and convex. Therefore, since the prox-function \(d(x)=\frac{1}{2}\Vert x-\overline{x}\Vert ^2\) is strongly convex, subproblem (2.6) always has a unique solution as long as \(\underline{\smash {Q}}_k\) is nonempty.

To finish the proof it suffices to show that \(\mathcal {E}_f(l)=\emptyset \) when either \(\underline{\smash {Q}}_k=\emptyset \) or \(\Vert x_k-\overline{x}\Vert >R\), which can be proved by contradiction. Firstly, if \(\underline{\smash {Q}}_k=\emptyset \) but \(\mathcal {E}_f(l)\not =\emptyset \), then by part a) proved above, we have \(\mathcal {E}_f(l)\subseteq \underline{\smash {Q}}_k\), which contradicts the assumption that \(\underline{\smash {Q}}_k\) is empty. On the other hand, suppose that \(\Vert x_k-\overline{x}\Vert > R\) and \(\mathcal {E}_f(l)\not =\emptyset \), and let \(x_R^{*}\in \mathrm{Argmin}_{x\in B(\overline{x},R)}f(x)\). Then \(x_R^*\in \mathcal {E}_f(l)\subseteq \underline{\smash {Q}}_k\) by part a); however, \(\Vert x_R^{*}-\overline{x}\Vert \le R <\Vert x_k-\overline{x}\Vert \), which contradicts the definition of \(x_k\) in (2.6). \(\square \)
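Lemma 2.2(b) relies on the fact that (2.6) is simply the projection of \(\overline{x}\) onto the polyhedron \(\underline{\smash {Q}}_k\) described by a few linear cuts. The exact solution scheme used in this paper is given in Appendix A; the hedged Python sketch below is not that scheme, but only illustrates why the subproblem is cheap: writing the cuts as \(Ax\le b\), Lagrangian duality gives \(x=\overline{x}-A^T\lambda \) with \(\lambda \ge 0\) maximizing a tiny concave quadratic, which we solve here by projected gradient ascent (the checks of Step 2 are obtained from the same computation, as discussed in Appendix A).

```python
import numpy as np

def solve_prox_subproblem(x_bar, A, b, iters=2000):
    """Minimize d(x) = 0.5 * ||x - x_bar||^2 over {x : A x <= b} (a few linear cuts,
    as in (2.6)) by projected gradient ascent on the dual.  The dual objective is
    -0.5 * lam' (A A') lam + lam' (A x_bar - b) over lam >= 0, and the primal
    solution is recovered as x = x_bar - A' lam."""
    G = A @ A.T                       # small Gram matrix (one row per cut)
    g0 = A @ x_bar - b
    lam = np.zeros(A.shape[0])
    step = 1.0 / max(np.linalg.norm(G, 2), 1e-12)
    for _ in range(iters):
        lam = np.maximum(0.0, lam + step * (g0 - G @ lam))   # dual gradient is g0 - G lam
    return x_bar - A.T @ lam

# Tiny example: project the prox-center (2, 1) onto the cuts x1 <= 0.5 and x2 <= 0.5.
x_bar = np.array([2.0, 1.0])
A = np.array([[1.0, 0.0], [0.0, 1.0]])
b = np.array([0.5, 0.5])
print(solve_prox_subproblem(x_bar, A, b))   # approximately [0.5, 0.5]
```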

Let \(\mathrm{ub}:=f(\hat{x}),\mathrm{ub}^+:=f(x^+)\) be the input and output upper bounds on \(f^*_{\overline{x},R}\) in procedure \(\mathcal {G}_{FAPL}\), respectively. The following lemma shows that whenever procedure \(\mathcal {G}_{FAPL}\) terminates, the gap between the upper and lower bounds on \(f^*_{\overline{x},R}\) is reduced by a constant factor.

Lemma 2.3

Whenever procedure \(\mathcal {G}_{FAPL}\) terminates, we have \(\mathrm{ub}^+-\mathrm{lb}^+\le q(\mathrm{ub}-\mathrm{lb})\), where

$$\begin{aligned} q:=\max \{\beta , 1-(1-\theta )\beta \}. \end{aligned}$$
(2.14)

Proof

By the definition of \(x_k^u\) in (2.7) and the definition of \(\overline{f}_k\) in Step 3 of procedure \(\mathcal {G}_{FAPL}\), we have \(\overline{f}_{k}\le \overline{f}_{k-1},\ \forall k\ge 1\), which implies \(\mathrm{ub}^+\le \mathrm{ub}\). Procedure \(\mathcal {G}_{FAPL}\) may terminate at either Step 2 or Step 3. We first assume that it terminates at Step 2 at the \(K^{th}\) iteration. The termination condition gives \(\mathrm{lb}^+=l=\beta \cdot \mathrm{lb}+(1-\beta )\mathrm{ub}\), and hence

$$\begin{aligned} \mathrm{ub}^+-\mathrm{lb}^+\le \mathrm{ub}-\beta \, \mathrm{lb}- (1-\beta )\mathrm{ub}= \beta (\mathrm{ub}-\mathrm{lb}). \end{aligned}$$

Next suppose that Procedure 1 terminates at Step 3 at the \(K^{th}\) iteration. We have \(\mathrm{ub}^+=f(x^+)=\overline{f}_K\le l+\theta (\mathrm{ub}-l)\). Since \(\mathrm{lb}^+\ge \mathrm{lb}\) and \(l=\beta \cdot \mathrm{lb}+(1-\beta )\mathrm{ub}\), we conclude that

$$\begin{aligned} \mathrm{ub}^+-\mathrm{lb}^+\le l+\theta \, (\mathrm{ub}-l)-\mathrm{lb}=[1-(1-\theta )\beta ](\mathrm{ub}-\mathrm{lb}). \end{aligned}$$

The lemma is proved by combining the above two cases. \(\square \)

We now provide a bound on the number of iterations performed by procedure \(\mathcal {G}_{FAPL}\). Note that the proof of this result is similar to that of Theorem 3 in [1].

Proposition 2.4

If the stepsizes \(\{\alpha _k\}_{k\ge 1}\) are chosen such that (2.10) holds, then the number of iterations performed by procedure \(\mathcal {G}_{FAPL}\) does not exceed

$$\begin{aligned} N(\Delta ):= \left( \frac{c^{1+\rho }MR^{1+\rho }}{(1+\rho )\theta \beta \Delta } \right) ^{\frac{2}{1+3\rho }}+1, \end{aligned}$$
(2.15)

where \(\Delta :=\mathrm{ub}-\mathrm{lb}\).

Proof

Suppose that procedure \(\mathcal {G}_{FAPL}\) does not terminate at the \(K^{th}\) iteration for some \(K >0\). Then, observing that \(x_{k}\in \underline{\smash {Q}}_k\subseteq Q_{k-1}\subseteq \overline{Q}_{k-1}\) for any \(2\le k\le K\) due to (2.5) and (2.6), and that \(x_1\in Q_0\) with \(x_0=\mathrm{argmin}_{x\in Q_0} d(x)\), we have \(\left\langle \nabla d(x_{k-1}),x_{k}-x_{k-1}\right\rangle \ge 0, \forall 1\le k \le K\). Since d(x) is strongly convex with modulus 1, we also have

$$\begin{aligned} d(x_{k})\ge d(x_{k-1})+\left\langle \nabla d(x_{k-1}),x_{k}-x_{k-1}\right\rangle +\frac{1}{2}\Vert x_{k}-x_{k-1}\Vert ^2. \end{aligned}$$

Combining the above two relations yields \(\frac{1}{2}\Vert x_{k}-x_{k-1}\Vert ^2\le d(x_{k})-d(x_{k-1})\). Summing up these inequalities for \(1\le k\le K\), we conclude that

$$\begin{aligned} \frac{1}{2}\sum \limits _{k=1}^{K}\Vert x_{k}-x_{k-1} \Vert ^2\le d(x_K)=\frac{1}{2}\Vert x_K-\overline{x}\Vert ^2\le \frac{1}{2}R^2. \end{aligned}$$
(2.16)

By (2.2), (2.7), (2.8) and the convexity of f, we have for any \(1\le k\le K\),

$$\begin{aligned} f(x_k^u)&\le f(\tilde{x}_k^u)\le h(x_k^l,\tilde{x}_k^u) + \frac{M}{1+\rho }\Vert \tilde{x}_k^u-x_k^l\Vert ^{1+\rho } \end{aligned}$$
(2.17)
$$\begin{aligned}&= (1-\alpha _k)h(x_k^l,x_{k-1}^u)+\alpha _k h(x_k^l,x_k) +\frac{M}{1+\rho }\Vert \tilde{x}_k^u-x_k^l\Vert ^{1+\rho }. \end{aligned}$$
(2.18)

Since \(h(x_k^l,x_{k-1}^u)\le f(x_{k-1}^u)\), \(h(x_k^l,x_k)\le l\) due to (2.5) and (2.6), and \(\tilde{x}_k^u-x_k^l=\alpha _k(x_k-x_{k-1})\) due to (2.3) and (2.7), we have

$$\begin{aligned} f(x_k^u) - l \le (1-\alpha _k)(f(x_{k-1}^u)-l) + \frac{\alpha _k^{1+\rho }M}{1+\rho }\Vert x_k-x_{k-1}\Vert ^{1+\rho } \end{aligned}$$
(2.19)

Dividing both sides of (2.19) by \(\alpha _k^{1+\rho }\) and then summing up these inequalities for \(1\le k\le K\), by (2.10) and the fact that \(f(x_k^u)-l\ge 0\) for all \(1\le k\le K\), we have

$$\begin{aligned} f(x_K^u) - l \le \frac{\alpha _K^{1+\rho }M}{1+\rho }\sum _{k=1}^{K}\Vert x_k-x_{k-1}\Vert ^{1+\rho } \end{aligned}$$
(2.20)

Applying Hölder’s inequality and using (2.16), we have

$$\begin{aligned} \sum _{k=1}^{K}\Vert x_k-x_{k-1}\Vert ^{1+\rho } \le K^{\frac{1-\rho }{2}}\left( \sum _{k=1}^K\Vert x_k-x_{k-1}\Vert ^2\right) ^{\frac{1+\rho }{2}}\le K^{\frac{1-\rho }{2}}R^{1+\rho }, \end{aligned}$$

and \(\alpha _K^{1+\rho }\le c^{1+\rho }K^{-(1+\rho )}\) due to (2.10). Therefore, (2.20) gives

$$\begin{aligned} f(x_K^u) - l \le \frac{MR^{1+\rho }}{(1+\rho )}\cdot \frac{c^{1+\rho }}{K^{\frac{1+3\rho }{2}}} \end{aligned}$$
(2.21)

In view of Step 3 in Procedure 1, and using the fact that \(l=\beta \cdot \mathrm{lb}+ (1-\beta )\mathrm{ub}\) in Step 0, we also have

$$\begin{aligned} f(x_K^u)-l>\theta (\mathrm{ub}-l)=\theta \beta \Delta . \end{aligned}$$

Combining the above two relations, and using (2.10) and (2.16), we obtain

$$\begin{aligned} \theta \beta \Delta < \frac{MR^{1+\rho }}{(1+\rho )}\cdot \frac{c^{1+\rho }}{K^{\frac{1+3\rho }{2}}}, \end{aligned}$$
(2.22)

which implies that

$$\begin{aligned} K<\left( \frac{c^{1+\rho }MR^{1+\rho }}{(1+\rho )\theta \beta \Delta } \right) ^{\frac{2}{1+3\rho }}. \end{aligned}$$
(2.23)

\(\square \)

In view of Lemma 2.3 and Proposition 2.4, we are now ready to describe the FAPL method, which performs a sequence of calls to procedure \(\mathcal {G}_{FAPL}\) until an approximate solution with sufficient accuracy is found.

[Algorithm 1: The FAPL method]

For the sake of simplicity, each iteration of Procedure \(\mathcal {G}_{FAPL}\) is also referred to as an inner iteration of the FAPL method. Our complexity bound is then based on the total number of combined inner iterations of the FAPL method, namely, the number of inner iterations performed by all calls to Procedure \(\mathcal {G}_{FAPL}\). The following theorem establishes the complexity bounds on the number of gap reduction procedures \(\mathcal {G}_{FAPL}\) and the total number of combined inner iterations performed by the FAPL method; its proof is similar to that of Theorem 4 in [1].

Theorem 2.5

For any given \(\epsilon >0\), if the stepsizes \(\{\alpha _k\}\) in procedure \(\mathcal {G}_{FAPL}\) are chosen such that (2.10) holds, then the following statements hold for the FAPL method to compute an \(\epsilon \)-solution to problem (2.1).

  1. (a)

    The number of gap reduction procedures \(\mathcal {G}_{FAPL}\) performed by the FAPL method does not exceed

    $$\begin{aligned} S:=\left\lceil \max \left\{ 0,\log _{\frac{1}{q}} \left( \frac{(2R)^{1+\rho }M}{(1+\rho )\epsilon } \right) \right\} \right\rceil . \end{aligned}$$
    (2.24)
  2. (b)

    The total number of combined inner iterations performed by the FAPL method can be bounded by

    $$\begin{aligned} N(\epsilon ):= S+\frac{1}{1-q^{\frac{2}{1+3\rho }}}\left( \frac{c^{1+\rho }MR^{1+\rho }}{(1+\rho )\theta \beta \epsilon } \right) ^{\frac{2}{1+3\rho }}, \end{aligned}$$
    (2.25)

    where q is defined in (2.14).

Proof

We first prove part a). Let \(\Delta _s:=\mathrm{ub}_s-\mathrm{lb}_s\). Without loss of generality, we assume that \(\Delta _1>\epsilon \). In view of Step 0 in Algorithm 1 and (2.2), we have

$$\begin{aligned} \Delta _1\le f(p_1)-h(p_0,p_1)=f(p_1)-f(p_0)-\left\langle f'(p_0),p_1-p_0\right\rangle \le \frac{(2R)^{1+\rho }M}{1+\rho }. \end{aligned}$$
(2.26)

Also, by Lemma 2.3 we can see that \(\Delta _{s+1}\le q\Delta _s\) for any \(s\ge 1\), which implies that

$$\begin{aligned} \Delta _{s+1}\le q^s \Delta _1,\ \forall s\ge 0. \end{aligned}$$

Moreover, if an \(\epsilon \)-solution is found after performing S gap reduction procedures \(\mathcal {G}_{FAPL}\), then we have

$$\begin{aligned} \Delta _{S}> \epsilon \ge \Delta _{S+1}. \end{aligned}$$
(2.27)

Combining the above three inequalities, we conclude that

$$\begin{aligned} \epsilon < q^{S-1}\Delta _1\le q^{S-1}\frac{(2R)^{1+\rho }M}{1+\rho }, \end{aligned}$$
(2.28)

and part a) follows immediately from the above inequality. We are now ready to prove part b). In view of Lemma 2.3 and (2.27), we have \(\Delta _s\ge \epsilon q^{s-S}\) for any \(1\le s\le S\). Using this estimate, part a), and Proposition 2.4, we conclude that the total number of combined inner iterations performed by the FAPL method is bounded by

$$\begin{aligned} N(\epsilon )&=\sum _{s=1}^{S}N_s= S+\left( \frac{c^{1+\rho }MR^{1+\rho }}{(1+\rho )\theta \beta \epsilon } \right) ^{\frac{2}{1+3\rho }}\sum _{s=1}^{S}q^{\frac{2(S-s)}{1+3\rho }} \nonumber \\&<S+\frac{1}{1-q^{\frac{2}{1+3\rho }}}\left( \frac{c^{1+\rho }MR^{1+\rho }}{(1+\rho )\theta \beta \epsilon } \right) ^{\frac{2}{1+3\rho }}, \end{aligned}$$
(2.29)

where \(N_s\) denotes the number of the inner iterations performed by the sth gap reduction procedure \(\mathcal {G}_{FAPL}\) for any \(1\le s\le S\). \(\square \)

A few remarks are in order for Theorem 2.5. First, the number of iterations performed by the FAPL method is defined as the number of gradient/subgradient evaluations, but the FAPL method requires two function evaluations in each iteration. Therefore, the number of function evaluations of the FAPL method is the same as that of the APL method in [1], but twice that of some BL methods for nonsmooth optimization (e.g., [20, 22]) and of Nesterov’s accelerated gradient method [5]. Second, when f is nonsmooth (i.e., \(\rho =0\) in (2.2)), setting \(\rho =0\) in (2.25) we have

$$\begin{aligned} N(\epsilon ) = S + \frac{1}{1-q^2}\left( \frac{cMR}{\theta \beta \epsilon }\right) ^2 = \mathcal {O}\left( \frac{M^2R^2}{\epsilon ^2}\right) . \end{aligned}$$
(2.30)

Following similar computations, we summarize the total number of inner iterations \(N(\epsilon )\) performed by the FAPL method in Table 1. Third, it is important to note that, when f is smooth, the total number of inner iterations \(N(\epsilon )\) in Table 1 is of a smaller order than in the case when f is nonsmooth. Finally, we can compare the iteration complexity results in Table 1 with those of other BL methods in the literature. For BL methods that focus on nonsmooth CP problems, e.g., the NERML method in [22], their complexity results match ours for nonsmooth f (i.e., \(\rho =0\)). However, we are able to improve the complexity results when f is smooth or weakly smooth. To the best of our knowledge, in the current literature the APL method in [1] is the only other method that achieves all three iteration complexity results in Table 1. However, the APL method is not applicable to unconstrained CP problems, while we extend our study to an expansion algorithm for unconstrained CP problems (see Sect. 3).

Table 1 The total number of inner iterations \(N(\epsilon )\) performed by the FAPL method with respect to different smoothness conditions on f
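To give a feel for how the bound (2.25) behaves across the three regimes summarized in Table 1, the snippet below (with arbitrary placeholder values of M, R, c, \(\theta \) and \(\beta \)) simply evaluates \(N(\epsilon )\) for \(\rho =0\), 1/2 and 1.

```python
import math

# Arbitrary placeholder constants, only to illustrate how (2.25) scales with rho.
M, R, c, theta, beta, eps = 1.0, 1.0, 2.0, 0.5, 0.5, 1e-4
q = max(beta, 1.0 - (1.0 - theta) * beta)          # (2.14)

def N_eps(rho):
    """Evaluate the bound (2.24)-(2.25) for a given smoothness level rho."""
    S = max(0, math.ceil(math.log((2 * R) ** (1 + rho) * M / ((1 + rho) * eps), 1 / q)))
    term = (c ** (1 + rho) * M * R ** (1 + rho) / ((1 + rho) * theta * beta * eps)) ** (2 / (1 + 3 * rho))
    return S + term / (1 - q ** (2 / (1 + 3 * rho)))

for rho in (0.0, 0.5, 1.0):
    print(rho, N_eps(rho))   # nonsmooth ~ O(1/eps^2), smooth ~ O(1/sqrt(eps))
```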

2.2 FUSL for ball-constrained structured problems

In this subsection, we still consider the ball-constrained problem (2.1), but assume that its objective function is given by

$$\begin{aligned} f(x):=\hat{f}(x)+F(x), \end{aligned}$$
(2.31)

where \(\hat{f}\) is a smooth convex function, i.e., \(\exists L_{\hat{f}}>0\) s.t.

$$\begin{aligned} \hat{f}(y)-\hat{f}(x)-\langle \nabla \hat{f}(x),y-x\rangle \le \frac{L_{\hat{f}}}{2}\Vert y-x\Vert ^2, \end{aligned}$$
(2.32)

and

$$\begin{aligned} F(x):=\max \limits _{y\in Y}\{\left\langle Ax,y\right\rangle -\hat{g}(y)\}. \end{aligned}$$
(2.33)

Here, \(Y\subseteq \mathbb {R}^m\) is a compact convex set, \(\hat{g}: Y\rightarrow \mathbb {R}\) is a relatively simple convex function, and \(A:\mathbb {R}^n\rightarrow \mathbb {R}^m\) is a linear operator. We will integrate Nesterov’s smoothing technique for minimizing nonsmooth functions [5] into the FAPL method to solve problem (2.1)–(2.31) (i.e., problem (2.1) with f defined in (2.31)), which reduces the iteration cost relative to the USL method [1].

Let \(v:Y\rightarrow \mathbb {R}\) be a prox-function with modulus \(\sigma _v\) and denote \(c_v:=\mathrm{argmin}_{y\in Y} v(y)\). By employing the idea in the seminal work of Nesterov [5], we can approximate F in (2.33) by the smooth function

$$\begin{aligned} F_{\eta }(x):&=\max _{y\in Y}\{\left\langle Ax,y\right\rangle -\hat{g}(y)-\eta V(y)\}, \end{aligned}$$
(2.34)

where \(\eta >0\) is called the smoothing parameter, and \(V(\cdot )\) is the Bregman divergence defined by

$$\begin{aligned} V(y):=v(y)-v(c_v)-\left\langle \nabla v(c_v),y-c_v\right\rangle . \end{aligned}$$
(2.35)

It was shown in [5] that the gradient of \(F_{\eta }(\cdot )\) given by \(\nabla F_{\eta }(x)= A^{*}y^{*}(x)\) is Lipschitz continuous with constant

$$\begin{aligned} L_{\eta }:={\Vert A\Vert ^2}/({\eta \sigma _v}), \end{aligned}$$
(2.36)

where \(\Vert A\Vert \) is the operator norm of A, \(A^{*}\) is the adjoint operator, and \(y^{*}(x)\in Y\) is the solution to the optimization problem (2.34). Moreover, the “closeness” of \(F_{\eta }(\cdot )\) to \(F(\cdot )\) depends linearly on the smoothing parameter \(\eta \), i.e.,

$$\begin{aligned} F_{\eta }(x)&\le F(x)\le F_{\eta }(x)+\eta D_{v,Y},\ \forall x\in X, \end{aligned}$$
(2.37)

where

$$\begin{aligned} D_{v,Y}:=\max _{y,z\in Y}\Big \{v(y)-v(z)-\left\langle \nabla v(z),y-z\right\rangle \Big \}. \end{aligned}$$
(2.38)

Therefore, if we denote

$$\begin{aligned} f_{\eta }(x):=\hat{f}(x)+F_{\eta }(x), \end{aligned}$$
(2.39)

then

$$\begin{aligned} f_{\eta }(x)&\le f(x) \le f_{\eta }(x)+\eta D_{v,Y}. \end{aligned}$$
(2.40)

Applying an optimal gradient method to minimize the smooth function \(f_\eta \) in (2.39), Nesterov showed in [5] that the iteration complexity for computing an \(\epsilon \)-solution to problem (2.1)–(2.31) is bounded by \(\mathcal{O}(\frac{1}{\sqrt{\epsilon }}+\frac{1}{\epsilon })\). However, the values of quite a few problem parameters, such as \(\Vert A\Vert ,\sigma _v\) and \(D_{v,Y}\), are required for the implementation of Nesterov’s smoothing scheme.
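As a concrete (and hedged) instance of the construction (2.33)–(2.36), take \(Y=\{y:\Vert y\Vert _\infty \le 1\}\), \(\hat{g}\equiv 0\) and \(v(y)=\frac{1}{2}\Vert y\Vert ^2\) (so \(\sigma _v=1\) and \(c_v=0\)); then \(F(x)=\Vert Ax\Vert _1\), the maximizer in (2.34) is obtained by clipping \(Ax/\eta \) to \([-1,1]\), and \(F_\eta \) is the familiar Huber-type smoothing with \(\nabla F_\eta (x)=A^Ty^*(x)\). The Python sketch below only illustrates these formulas.

```python
import numpy as np

def smoothed_l1(A, x, eta):
    """F_eta and grad F_eta from (2.34)-(2.36) for the special case F(x) = ||A x||_1,
    i.e. Y = {y : ||y||_inf <= 1}, g_hat = 0, v(y) = 0.5 * ||y||^2 (sigma_v = 1, c_v = 0).
    The Lipschitz constant of grad F_eta is L_eta = ||A||^2 / eta, as in (2.36)."""
    z = A @ x
    y_star = np.clip(z / eta, -1.0, 1.0)                 # maximizer in (2.34)
    F_eta = z @ y_star - 0.5 * eta * (y_star @ y_star)   # <Ax, y*> - eta * V(y*)
    grad = A.T @ y_star                                  # = A^* y^*(x)
    return F_eta, grad

rng = np.random.default_rng(0)
A, x = rng.standard_normal((20, 10)), rng.standard_normal(10)
for eta in (1.0, 0.1, 0.01):
    F_eta, _ = smoothed_l1(A, x, eta)
    # (2.37): F_eta(x) <= ||Ax||_1 <= F_eta(x) + eta * D_{v,Y}  (here D_{v,Y} = 2m).
    print(eta, F_eta, np.abs(A @ x).sum())
```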

By incorporating Nesterov’s smoothing technique [5] into the APL method, Lan developed in [1] a new bundle-level type method, namely the uniform smoothing level (USL) method, to solve structured problems given in the form of (2.1)–(2.31). While the USL method achieves the same optimal iteration complexity as Nesterov’s smoothing scheme in [5], one advantage of the USL method over Nesterov’s smoothing scheme is that the smoothing parameter \(\eta \) is adjusted dynamically during the execution, and an estimate of \(D_{v,Y}\) is obtained automatically, which makes the USL method problem parameter free. However, similar to the APL method, each iteration of the USL method involves the solution of two subproblems. Based on the USL method in [1] and our analysis of the FAPL method in Sect. 2.1, we propose a fast USL (FUSL) method that solves problem (2.1)–(2.31) with the same optimal iteration complexity as the USL method, but requires solving only one simpler subproblem in each iteration.

Similar to the FAPL method, the FUSL method has an outer-inner loops structure, and each outer iteration calls an inner loop, a gap reduction procedure denoted by \(\mathcal {G}_{FUSL}\), to reduce the gap between the upper and lower bounds on \(f^{*}_{\overline{x},R}\) in (2.1)–(2.31) by a constant factor unless an \(\epsilon \)-solution is found. We start by describing procedure \(\mathcal {G}_{FUSL}\).

[Procedure 2: The FUSL gap reduction procedure \(\mathcal {G}_{FUSL}\)]

A few remarks about procedure \(\mathcal {G}_{FUSL}\) are in order. Firstly, since the nonsmooth objective function f is replaced by its smoothed approximation \(f_\eta \), we replace the cutting plane model in (1.7) with the one for \(f_\eta \) (see (2.42)). Also note that for the USL method in [1], \(\hat{f}\) is assumed to be a simple Lipschitz continuous convex function, and only \(F_{\eta }\) is approximated by its linear estimation. However, in the FUSL method, we assume that \(\hat{f}\) is general smooth and convex, and linearize both \(\hat{f}\) and \(F_{\eta }\) in (2.42). Secondly, the smoothing parameter \(\eta \) is specified as a function of the parameters D, \(\bar{f}_0\) and l, where D is an estimate of \(D_{v,Y}\) defined by (2.38) and is given as an input parameter to procedure \(\mathcal {G}_{FUSL}\). Thirdly, as in the FAPL method, the parameters \(\{\alpha _k\}\) are chosen according to (2.10). Such conditions are required to guarantee the optimal iteration complexity of the FUSL method for solving problem (2.1)–(2.31). Fourthly, similar to the FAPL method, the localizers \(\underline{\smash {Q}}_k, Q_k, \overline{Q}_k\) only contain a limited number of linear constraints, and there is only one subproblem (i.e., (2.6)) involved in procedure \(\mathcal {G}_{FUSL}\), which can be solved exactly when the depth of memory is small.

The following lemma provides some important observations about procedure \(\mathcal {G}_{FUSL}\), which are similar to those for the USL gap reduction procedure in [1].

Lemma 2.6

The following statements hold for procedure \(\mathcal {G}_{FUSL}\).

  1. (a)

    If procedure \(\mathcal {G}_{FUSL}\) terminates at Steps 2 or 3(a), then we have \(\mathrm{ub}^+-\mathrm{lb}^+\le q(\mathrm{ub}-\mathrm{lb})\), where q is defined by (2.14) and \(\mathrm{ub}:=f(\hat{x}),\mathrm{ub}^{+}:=f(x^{+})\).

  2. (b)

    If procedure \(\mathcal {G}_{FUSL}\) terminates at Step 3b), then \(D< D_{v,Y}\) and \(D^+<2D_{v,Y}\).

Proof

The proof of part (a) is the same as that of Lemma 2.3, and we only prove part (b) here. By the termination condition at Step 3b), we have

$$\begin{aligned} f(x_k^u)-f_{\eta }(x_k^u)>\frac{\theta }{2}(\overline{f}_0-l). \end{aligned}$$

We conclude from the above relation, (2.40), and (2.41) that

$$\begin{aligned} D_{v,Y}\ge \frac{f(x_k^u)-f_{\eta }(x_k^u)}{\eta }>\frac{\theta (\overline{f}_0-l)}{2\eta } = D. \end{aligned}$$

Finally, \(D^+<2D_{v,Y}\) follows immediately from the above relation and the definition of \(D^+\) in Step 3b). \(\square \)

The following result provides a bound on the number of iterations performed by procedure \(\mathcal {G}_{FUSL}\).

Proposition 2.7

Suppose that \(\{\alpha _k\}_{k\ge 1}\) in procedure \(\mathcal {G}_{FUSL}\) are chosen such that (2.10) holds. Then, the number of iterations performed by this procedure does not exceed

$$\begin{aligned} \overline{N}(\Delta , D):=cR\left( \sqrt{\frac{L_{\hat{f}}}{\theta \beta \Delta }}+\frac{\sqrt{2 }\Vert A\Vert }{\theta \beta \Delta }\sqrt{\frac{D}{\sigma _v}}\right) +1, \end{aligned}$$
(2.43)

where \(\Delta :=f(\hat{x})-\mathrm{lb}\).

Proof

It is easy to see that \(f_{\eta }\) defined in (2.39) has Lipschitz continuous gradient with constant \(L=L_{\hat{f}}+L_{\eta }\), where \(L_\eta \) and \(L_{\hat{f}}\) are defined in (2.36) and (2.32), respectively. Suppose that procedure \(\mathcal {G}_{FUSL}\) does not terminate at iteration K. Since the \(\mathcal {G}_{FUSL}\) procedure can be viewed as applying the \(\mathcal {G}_{FAPL}\) procedure to \(f_{\eta }\), similarly to the discussion on (2.21), and noticing that \(\rho =1\) since \(f_{\eta }\) is smooth, we have

$$\begin{aligned} f_{\eta }(x_K^u)-l\le \frac{c^2L d(x_K)}{K^2}\le \frac{c^2LR^2}{2K^2}, \end{aligned}$$
(2.44)

where c is defined in (2.10). Also, since procedure \(\mathcal {G}_{FUSL}\) does not terminate at iteration K, in view of the termination condition in Step 3b) and the definition of l in Step 0, we have

$$\begin{aligned} f_{\eta }(x_K^u)-l>\frac{\theta \beta \Delta }{2}. \end{aligned}$$
(2.45)

By (2.44), (2.45), (2.36) and (2.41), we conclude that

$$\begin{aligned} K<\sqrt{\frac{c^2LR^2}{\theta \beta \Delta }}\le cR\left( \sqrt{\frac{L_{\hat{f}}}{\theta \beta \Delta }}+\frac{\sqrt{2 }\Vert A\Vert }{\theta \beta \Delta }\sqrt{\frac{D}{\sigma _v}}\right) . \end{aligned}$$
(2.46)

\(\square \)

We are now ready to describe the FUSL method, which iteratively calls procedure \(\mathcal {G}_{FUSL}\) to solve the structured saddle point problem (2.1)–(2.31).

[Algorithm 2: The FUSL method]

For simplicity, we say that a phase (i.e., an outer iteration) of the FUSL method occurs when s increases by 1. More specifically, similar to the USL method, we classify two types of phases in the FUSL method. A phase is called significant if the corresponding \(\mathcal {G}_{FUSL}\) procedure terminates at Step 2 or Step 3a), otherwise it is called non-significant. Clearly, if the value of \(D_{v,Y}\) is provided, which is the assumption made in Nesterov’s smoothing scheme [5], then we can set \(D_1=D_{v,Y}\) in the schemes of both the USL and FUSL methods, and consequently, all the phases of both methods become significant.

As in the FAPL method, an iteration of procedure \(\mathcal {G}_{FUSL}\) is also referred to as an iteration of the FUSL method. The following result establishes a bound on the total number of iterations performed by the FUSL method to find an \(\epsilon \)-solution to problem (2.1)–(2.31). Note that the proof of this result is similar to that of Theorem 7 in [1].

Theorem 2.8

Suppose that \(\{\alpha _k\}\) in procedure \(\mathcal {G}_{FUSL}\) are chosen such that (2.10) holds. Then, the total number of iterations performed by the FUSL method for computing an \(\epsilon \)-solution to problem (2.1)–(2.31) is bounded by

$$\begin{aligned} \overline{N}(\epsilon ):=S_1+S_2+\left( \frac{2}{\sqrt{2} -1}+\frac{\sqrt{2}}{1- q}\right) \frac{cR\Vert A\Vert }{\theta \beta \epsilon }\sqrt{\frac{\tilde{D}}{\sigma _v}} +\left( S_1+\frac{1}{1-\sqrt{q}}\right) cR\sqrt{\frac{L_{\hat{f}}}{\theta \beta \epsilon }}, \end{aligned}$$
(2.47)

where q and \(D_{v,Y}\) are defined by (2.14) and (2.38) respectively, and

$$\begin{aligned}&\tilde{D}:=\max \{D_1,2D_{v,Y}\}, S_1:=\max \left\{ \left\lceil \log _2\frac{D_{v,Y}}{D_1} \right\rceil , 0\right\} \text {\ and\ }\nonumber \\&S_2:=\left\lceil \log _{\frac{1}{q}}\frac{4\sqrt{2}R\Vert A\Vert \sqrt{\frac{D_{v,Y}}{\sigma _v}}+2R^2L_{\hat{f}}}{\epsilon } \right\rceil . \end{aligned}$$
(2.48)

Proof

We prove this result by estimating the numbers of iterations performed within the non-significant and significant phases separately. Suppose that the sets of indices of the non-significant and significant phases are \(\{m_1,m_2,\ldots ,m_{s_1}\}\) and \(\{n_1,n_2,\ldots ,n_{s_2}\}\), respectively. For any non-significant phase \(m_k\), \(1\le k\le s_1\), we can easily see from Step 3b) that \(D_{m_{k+1}}=2D_{m_k}\). Hence, by part b) of Lemma 2.6, the number of non-significant phases performed by the FUSL method is bounded by \(S_1\) defined above, i.e., \(s_1\le S_1\).

In addition, since \(D_{m_{s_1}}\le \tilde{D}\), we have \(D_{m_k}\le (1/2)^{s_1-k}\tilde{D}\), where \(\tilde{D}\) is defined above. Combining the above estimates on \(s_1\) and \(D_{m_k}\), and in view of the fact that \(\Delta _{m_k}>\epsilon \) for all \(1\le k\le s_1\), we can bound the number of iterations performed within the non-significant phases by

$$\begin{aligned} \begin{aligned} \overline{N}_1&=\sum _{k=1}^{s_1}\overline{N}(\Delta _{m_k},D_{m_k})\le \sum _{k=1}^{s_1}\overline{N}\left( \epsilon ,\tilde{D}/2^{s_1-k}\right) \\&\le S_1\left( cR\sqrt{\frac{L_{\hat{f}}}{\theta \beta \epsilon }}+1\right) +\frac{\sqrt{2}cR\Vert A\Vert }{\theta \beta \epsilon }\sqrt{\frac{\tilde{D}}{\sigma _v}}\sum _{k=1}^{S_1}2^{-\frac{(S_1-k)}{2}}\\&\le S_1\left( cR\sqrt{\frac{L_{\hat{f}}}{\theta \beta \epsilon }}+1\right) +\frac{2cR\Vert A\Vert }{(\sqrt{2}-1)\theta \beta \epsilon }\sqrt{\frac{\tilde{D}}{\sigma _v}}. \end{aligned} \end{aligned}$$
(2.49)

Applying Lemma 8 in [1] and relation (2.32), and in view of the fact that \(p_0, p_1\in B(\overline{x},R)\) in Algorithm 2, the initial gap is bounded by

$$\begin{aligned} \Delta _1:=&\,\mathrm{ub}_1-\mathrm{lb}_1\le \left[ F(p_0)-F(p_1)-\left\langle F'(p_1),p_0-p_1\right\rangle \right] \nonumber \\&+\left[ \hat{f}(p_0)-\hat{f}(p_1)-\left\langle \hat{f}'(p_1),p_0-p_1\right\rangle \right] \end{aligned}$$
(2.50)
$$\begin{aligned} \le \,&4\sqrt{2}R\Vert A\Vert \sqrt{\frac{D_{v,Y}}{\sigma _v}}+2R^2L_{\hat{f}}, \end{aligned}$$
(2.51)

where \(F'(p_1)\in \partial F(p_1)\). Then for the significant phases, similar to the proof of Theorem 2.5, we have \(s_2\le S_2\). Moreover, for any \(n_k\), \(1\le k\le s_2\), using Lemmas 2.3 and 2.6, we have \(D_{n_k}\le \tilde{D}\), \(\Delta _{n_{k+1}}\le q\Delta _{n_{k}}\), and \(\Delta _{n_{s_2}}> \epsilon \), which implies \(\Delta _{n_k}>\epsilon /q^{s_2-k}\). Combining these estimates on \(D_{n_k}\) and \(\Delta _{n_k}\) with the bound on \(s_2\), we see that the total number of iterations performed within the significant phases is bounded by

$$\begin{aligned} \begin{aligned} \overline{N}_2&=\sum _{k=1}^{s_2}\overline{N}(\Delta _{n_k},D_{n_k})\le \sum _{k=1}^{s_2}\overline{N}(\epsilon /q^{s_2-k}, \tilde{D})\\&\le S_2+cR\sqrt{\frac{L_{\hat{f}}}{\theta \beta \epsilon }}\sum _{k=1}^{S_2}q^{\frac{S_2-k}{2}}+\frac{\sqrt{2}cR\Vert A\Vert }{\theta \beta \epsilon }\sqrt{\frac{ \tilde{D}}{\sigma _v}}\sum _{k=1}^{S_2}q^{S_2-k}\\&\le S_2+\frac{cR}{1-\sqrt{q}}\sqrt{\frac{L_{\hat{f}}}{\theta \beta \epsilon }}+\frac{\sqrt{2}cR\Vert A\Vert }{\theta \beta \epsilon (1-q)}\sqrt{\frac{\tilde{D}}{\sigma _v}}. \end{aligned} \end{aligned}$$
(2.52)

Finally, the total number of iterations performed by the FUSL method is bounded by \(\overline{N}_1+\overline{N}_2\), and thus (2.47) holds. \(\square \)

From (2.47) in the above theorem, we can see that the iteration complexity of the FUSL method for solving problem (2.1)–(2.31) is bounded by

$$\begin{aligned} \mathcal {O}\left( \sqrt{\frac{L_{\hat{f}}}{\epsilon }} + \frac{\Vert A\Vert }{\epsilon }\right) . \end{aligned}$$
(2.53)

Note that, similar to the FAPL method, the FUSL method requires two function evaluations in each iteration. The above iteration complexity is the same as that of Nesterov’s smoothing scheme in [5] and the USL method in [1]. However, both the USL and FUSL methods improve Nesterov’s smoothing scheme in that both of them are problem parameter free. In addition, as detailed in Appendix A, the FUSL method further improves the USL method by reducing its iteration cost and improving the accuracy of the solutions of its subproblems.

3 Solving unconstrained CP problems through ball-constrained CP problems

In this section we discuss the following unconstrained CP problem:

$$\begin{aligned} f^*:=\min _{x\in \mathbb {R}^n} f(x), \end{aligned}$$
(3.1)

where \(f:\mathbb {R}^n \rightarrow \mathbb {R}\) is convex, and for any closed set \(\Omega \subseteq \mathbb {R}^n\), there exist \(M(\Omega )>0\) and \(\rho (\Omega )\in [0,1]\) such that

$$\begin{aligned} f(y)-f(x)-\left\langle f'(x),y-x\right\rangle \le \frac{M(\Omega )}{1+\rho (\Omega )}\Vert y-x\Vert ^{1+\rho (\Omega )}, \ \ \forall x,y \in \Omega . \end{aligned}$$
(3.2)

The above assumption on \(f(\cdot )\) covers unconstrained nonsmooth (\(\rho (\Omega )=0\)), smooth (\(\rho (\Omega )=1\)) and weakly smooth (\(0<\rho (\Omega )<1\)) CP problems.

We first present a generic algorithmic framework to solve unconstrained CP problems through solutions to a series of ball-constrained CP problems in Sect. 3.1, then extend the FAPL and FUSL methods to unconstrained CP problems with iteration complexity analysis in Sect. 3.2.

3.1 Expansion algorithm for unconstrained CP problems

In order to analyze the computational complexity of BL methods, it is commonly assumed that the feasible set of interest is compact (e.g., [1, 20, 22]). While the compactness assumption is essential for performing the complexity analysis, it may impair the applicability of existing BL methods to unconstrained CP problems. In practice, one possible workaround for BL methods to solve (3.1) is to assume a bound on the distance from a point \(\bar{x}\) to an optimal solution \(x^*\), namely, \(x^*\in B(\bar{x}, R):= \left\{ x \in \mathbb {R}^n: \Vert x - \overline{x}\Vert \le R \right\} \) for some \(\bar{x}\) and R. With such an assumption, we can solve (3.1) by considering a ball-constrained CP problem

$$\begin{aligned} f^*_{\overline{x}, R}:=\min _{x\in B(\overline{x},R)}f(x). \end{aligned}$$
(3.3)

The above technique was also discussed in [32] (see Sect. 4 in [32]). It should be noted that, while the above equivalent reformulation seems straightforward, its computational complexity depends almost exclusively on the radius R. In particular, if R is close to the distance from \(\bar{x}\) to \(X^*\), the optimal solution set of (3.1), i.e., if \(R-D^*\) is sufficiently small, where

$$\begin{aligned} x^{*}:=\mathrm{argmin}_{x}\{\Vert \overline{x} - x\Vert : x\in X^{*}\} \ \ \text{ and } \ \ \ D^{*} := \Vert \overline{x} - x^{*}\Vert , \end{aligned}$$
(3.4)

then the computational complexities for computing approximate solutions to the ball-constrained CP problem (3.3) and to the original unconstrained CP problem (3.1) are close, and it is reasonable to solve (3.3) instead. However, if R severely overestimates \(D^*\), the computational complexity for solving the ball-constrained CP problem (3.3) may become much higher than the optimal complexity bound that depends on \(D^*\).

Based on the above discussion, we conclude that a good BL method should tackle the unconstrained CP problem (3.1) from two perspectives. Firstly, without any reliable knowledge of \(D^*\), such a BL method should still be able to solve (3.1) with an optimal complexity bound that depends on \(D^*\). Secondly, if an estimate R close to \(D^*\) is available, a good BL method should solve the ball-constrained problem (3.3) efficiently. We will consider a generic framework that follows the former perspective in this subsection, and then extend the BL methods to solve (3.1) in the next subsection.

In this subsection, our goal is to design a generic algorithmic framework that follows the former perspective in the above discussion and solves the unconstrained CP problem (3.1) through solutions to a series of ball-constrained CP problems. It should be noted that this strategy follows naturally from the two perspectives discussed above: if a BL method is computationally efficient for ball-constrained CP problems of the form (3.3), then starting from a certain R for (3.3) and doubling it each time, we will eventually reach a sufficiently close estimate of \(D^*\) after a logarithmic number of expansions.

Given \(\overline{x}\in \mathbb {R}^n, R>0,\epsilon >0\), let us assume that there exists a first-order algorithm, denoted by \(\mathcal {A}(\overline{x},R,\epsilon )\), which can find an \(\epsilon \)-solution to (3.3). In other words, we assume that each call to \(\mathcal {A}(\overline{x}, R,\epsilon )\) will compute a point \(z\in B(\overline{x},R)\) such that \(f(z)-f^*_{\overline{x},R}\le \epsilon \). Moreover, throughout this section, we assume that the number of (sub)gradient evaluations required by \(\mathcal {A}(\overline{x}, R,\epsilon )\) for finding an \(\epsilon \)-solution to (3.3) is bounded by

$$\begin{aligned} N_{\overline{x},R,\epsilon }:=\frac{C_1(\overline{x}, R, f)R^{\alpha _1}}{\epsilon ^{\beta _1}}+\frac{C_2(\overline{x}, R, f)R^{\alpha _2}}{\epsilon ^{\beta _2}}, \end{aligned}$$
(3.5)

where \(\alpha _1\ge \beta _1>0\) and \(\alpha _2\ge \beta _2>0\), and \(C_1(\overline{x}, R, f)\) and \(C_2(\overline{x}, R, f)\) are constants that depend on f in (3.3) and are nondecreasing with respect to R. For example, if f is a smooth convex function, \(\nabla f\) is Lipschitz continuous in \(\mathbb {R}^n\) with constant L, i.e., (3.2) holds with \(\rho (\mathbb {R}^n)=1\) and \(M(\mathbb {R}^n)=L\), and we apply the APL method to (3.3), then (3.5) has only one term with \(\alpha _1=1\), \(\beta _1=1/2\), and \(C_1(\overline{x}, R, f)=c\sqrt{L}\), where c is a universal constant. Observe that the two complexity terms in (3.5) will be useful for analyzing some structured CP problems in Sect. 2.2. It should also be noted that a more accurate estimate of \(C_1(\overline{x}, R, f)\) is \(c\sqrt{M(B(\overline{x},R))}\), since the Lipschitz constant \(L=M(\mathbb {R}^n)\) over \(\mathbb {R}^n\) is larger than or equal to the local Lipschitz constant on \(B(\overline{x},R)\).
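To make this abstraction concrete, the following sketch (in Python; all names and numerical values are hypothetical illustrations, not part of the algorithmic development) evaluates the assumed bound (3.5), together with the smooth-case instantiation \(C_1=c\sqrt{L}\), \(\alpha _1=1\), \(\beta _1=1/2\) mentioned above.

```python
# A minimal sketch (not the authors' code): the assumed complexity bound (3.5)
# for a ball-constrained oracle A(x_bar, R, eps), with hypothetical parameters.

def iteration_bound(R, eps, C1, alpha1, beta1, C2=0.0, alpha2=1.0, beta2=1.0):
    """Evaluate N_{x_bar,R,eps} = C1*R^alpha1/eps^beta1 + C2*R^alpha2/eps^beta2."""
    bound = C1 * R**alpha1 / eps**beta1
    if C2 > 0:
        bound += C2 * R**alpha2 / eps**beta2
    return bound

# Smooth example from the text: APL on a ball of radius R with L-Lipschitz gradient,
# so alpha1 = 1, beta1 = 1/2, C1 = c*sqrt(L) for some universal constant c.
c, L = 1.0, 1e4          # illustrative values only
print(iteration_bound(R=10.0, eps=1e-6, C1=c * L**0.5, alpha1=1.0, beta1=0.5))
```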

Let \(\mathcal {A}\) be an algorithm that can find an \(\epsilon \)-solution to the ball-constrained problem (3.3). By utilizing a novel guess-and-check procedure, we present an expansion algorithm for unconstrained convex optimization as follows:

Algorithm 3 Expansion algorithm for unconstrained convex optimization (figure e)

Note that, in order to reflect its flexibility, we allow \(r_1\) to be any positive constant in Algorithm 3. However, from the theoretical complexity point of view, we prefer to start with a smaller \(r_1\) (i.e., a lower bound on \(D^*\)). In fact, given the target accuracy \(\epsilon \), one can derive a theoretically viable selection of \(r_1\) as follows. For simplicity, assume that f has Lipschitz continuous gradient with constant L; then \(f(x_0) - f^* \le \frac{L}{2} \Vert x_0 - x^*\Vert ^2\). If \(\Vert x_0 - x^*\Vert \le \sqrt{2 \epsilon /L}\), then \(x_0\) already satisfies \(f(x_0) - f^* \le \epsilon \). Therefore, we can set \(r_1= \sqrt{2\epsilon /L}\). Such an estimate requires some prior information on L. It should be noted that an overestimate of L does not hurt the complexity bound, since \(r_k\) is increased exponentially fast and will approach \(D^*\) quickly.

Steps 1 and 2 in Algorithm 3 constitute a loop for finding a pair of solutions \((\overline{x}_k',\overline{x}_k'')\) satisfying \(0\le f(\overline{x}_k')-f(\overline{x}_k'')\le \Delta _k\) for any \(k\ge 1\). Since \(\overline{x}_k'\) and \(\overline{x}_k''\) are \(\Delta _k\)-optimal solutions to \(\min _{x\in B(\overline{x},r_k)}f(x)\) and \(\min _{x\in B(\overline{x},2r_k)}f(x)\) respectively, this loop must terminate in finite time, because it will terminate whenever \(r_k\ge D^*\).

For simplicity, we call it an expansion whenever the radius is doubled in Step 2. Each iteration may contain several expansions before the solution \(\overline{x}_k^{*}\) is output in Step 3. Note that, in the implementation of Algorithm 3, most of the computational work lies in Step 1, which involves a sequence of calls to algorithm \(\mathcal {A}\). However, since the gaps (\(\Delta _k\)) and radii (\(r_k\)) of these calls are monotonically decreasing and increasing, respectively, substantial computational cost can be saved by reusing results from previous expansions and iterations. For instance, the output solutions of previous calls to \(\mathcal {A}\) associated with larger gaps or smaller radii can always be used as the starting point for the current call to \(\mathcal {A}\). For successive expansions, if the previous execution of Step 1 called \(\mathcal {A}(\bar{x}, r_k,\Delta _k)\) and \(\mathcal {A}(\bar{x}, 2r_k,\Delta _k)\), and the current execution of Step 1 calls \(\mathcal {A}(\bar{x}, 2r_k,\Delta _k)\) and \(\mathcal {A}(\bar{x}, 4r_k,\Delta _k)\), then the computation of the call \(\mathcal {A}(\bar{x}, 2r_k,\Delta _k)\) can be saved. Moreover, if \(\mathcal {A}\) refers to specific algorithms, such as BL type methods, further previous results, such as the lower bounds and prox-centers, can also be reused to save computational cost (see the sketch below).
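The following Python sketch records our reading of Algorithm 3 as described above; it is not the authors' implementation. The oracle `A(x_bar, R, eps)` stands for any routine satisfying the assumption on (3.3), warm-starting is omitted for brevity, and the choices of returning \(\overline{x}_k''\) as \(\overline{x}_k^{*}\) and halving \(\Delta _k\) in Step 3 are assumptions consistent with the analysis in Theorem 3.2 below rather than a verbatim transcription of the figure.

```python
# A minimal sketch of the expansion scheme of Algorithm 3 (assumptions flagged below).
# `A(x_bar, R, eps)` is any oracle returning an eps-solution of
# min_{x in B(x_bar, R)} f(x); `f` is the objective of (3.1).

def expansion_algorithm(f, A, x_bar, r1, delta1, n_outer):
    r, delta = r1, delta1
    outputs = []
    for k in range(n_outer):
        # Step 1: Delta_k-solutions on B(x_bar, r_k) and B(x_bar, 2 r_k);
        # in practice the second call would be warm-started from x1.
        x1 = A(x_bar, r, delta)
        x2 = A(x_bar, 2 * r, delta)
        # Step 2: expand (double the radius) while the gap check fails.
        while f(x1) - f(x2) > delta:
            r *= 2
            x1, x2 = x2, A(x_bar, 2 * r, delta)  # reuse the previous larger-ball solution
        # Step 3: output x_k^* := x2; halving Delta_k is our reading of the update rule.
        outputs.append(x2)
        delta /= 2
    return outputs
```

In an actual implementation, the previous larger-ball solution and, for BL type methods, the accumulated lower bounds and prox-centers would be passed to the next call to \(\mathcal {A}\), as discussed above.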

Note that there are proximal bundle methods that incorporate trust region techniques for updating the new iterate [35, 36]. In [35] a quadratic cutting plane model and in [36] the idea of the Chebyshev center were used to generate the trust regions. These trust region methods restrict the iterates to the trust region for better convergence to the optimal solution, while the approximate solutions in the search balls generated by our expansion algorithm are used only for checking whether the current search ball needs to be expanded in order to obtain a better estimate of \(D^*\).

Before analyzing the iteration complexity of Algorithm 3, we discuss some important observations related to the aforementioned expansions.

Lemma 3.1

Let \(\bar{x} \in \mathbb {R}^n\), let \(r>0\) be a constant, and let \(\overline{x}_1\) and \(\overline{x}_2\) be \(\Delta \)-solutions to the CP problems

$$\begin{aligned} f_{\overline{x},r}^{*} :=\min _{x\in B(\overline{x}, r)}f(x) \ \ \ \text{ and } \ \ \ f_{\overline{x},2r}^{*} :=\min _{x\in B(\overline{x}, 2r)}f(x), \end{aligned}$$
(3.6)

respectively. If \(0\le f(\overline{x}_1)-f(\overline{x}_2)\le \Delta \), then we have

$$\begin{aligned} f(\overline{x}_2)-f^{*}\le \left( 3+\frac{2D^{*}}{r} \right) \Delta , \end{aligned}$$
(3.7)

where \(f^*\) and \(D^{*}\) are defined in (3.1) and (3.4) respectively.

Proof

Clearly, by definition, we have \(\Vert \overline{x}_1-\overline{x}\Vert \le r\), \(\Vert \overline{x}_2-\overline{x}\Vert \le 2r\), \(0\le f(\overline{x}_1)-f_{\overline{x},r}^{*}\le \Delta \) and \(0\le f(\overline{x}_2)-f_{\overline{x},2r}^{*}\le \Delta .\) It suffices to consider the case when \(f^*_{\bar{x}, 2r}> f^{*}\) and \(\Vert x^*-\overline{x}\Vert >2r\), since otherwise (3.7) holds trivially. Let \(x_1^{*}\) and \(x_2^{*}\) be the solutions to the first and second problems in (3.6), respectively, let \(\hat{x}\) be the intersection of the line segment \((x^{*},x_1^{*})\) with the ball \(B(\overline{x},2r)\), and denote \(R_1:=\Vert \hat{x}-x_1^{*}\Vert \) and \(R_2:=\Vert x^{*}-x_1^{*}\Vert \). Clearly, \(\hat{x}=(1-\frac{R_1}{R_2})x_1^{*}+\frac{R_1}{R_2}x^{*}\). By the convexity of \(f(\cdot )\), we have

$$\begin{aligned} f(\hat{x})\le \left( 1-\frac{R_1}{R_2}\right) f(x_1^{*})+\frac{R_1}{R_2}f(x^{*}), \end{aligned}$$
(3.8)

which implies that

$$\begin{aligned} \frac{R_1}{R_2}\left[ f(x_1^{*})-f(x^{*})\right] \le f(x_1^{*})-f(\hat{x}), \end{aligned}$$
(3.9)

and that \(f(\hat{x})\le f(x_1^{*})\) due to the fact that \(f(x^{*})\le f(x_1^{*})\). Also, we have \(f(\hat{x})\ge f(x_2^{*})\) since \(\hat{x} \in B(\overline{x}, 2r)\). In addition,

$$\begin{aligned} f(x_1^{*})-f(x_2^{*})&=[f(x_1^{*})-f(\overline{x}_1)]+[f(\overline{x}_1)-f(\overline{x}_2)]+[f(\overline{x}_2)-f(x_2^{*})] \end{aligned}$$
(3.10)
$$\begin{aligned}&\le 0+\Delta +\Delta =2\Delta . \end{aligned}$$
(3.11)

Combining the previous inequalities, we obtain

$$\begin{aligned} f(x_1^{*})-2\Delta \le f(x_2^{*})\le f(\hat{x})\le f(x_1^{*}), \end{aligned}$$
(3.12)

which implies that \(f(x_1^{*})-f(\hat{x})\le 2\Delta \). Using (3.9), and the fact that \(R_1\ge r\) and \(R_2\le D^{*}+r\), we have

$$\begin{aligned} f(x_1^{*})-f(x^{*})\le \frac{2 R_2\Delta }{R_1}\le \left( 2+\frac{2D^{*}}{r}\right) \Delta . \end{aligned}$$

Therefore,

$$\begin{aligned} f(\overline{x}_2)-f(x^{*})\le & {} f(\overline{x}_1)-f(x^{*})\le [f(\overline{x}_1)-f(x_1^{*})]+[f(x_1^{*})-f(x^{*})]\\\le & {} \left( 3+\frac{2D^{*}}{r} \right) \Delta . \end{aligned}$$

\(\square \)

We are now ready to prove the iteration complexity of Algorithm 3 for solving the unconstrained CP problem (3.1).

Theorem 3.2

Suppose that the number of (sub)gradient evaluations required by \(\mathcal {A}(\overline{x}, R,\epsilon )\) for finding an \(\epsilon \)-solution to (3.3) is bounded by (3.5). For any \(k\ge 1\), denote \(\epsilon _k:=f(\overline{x}_k^{*})-f^{*}\) for the output \(\overline{x}_k^{*}\) in Algorithm 3. Then we have

(a) \(r_k \le \bar{r} :=\max \{r_1,2D^{*}\}\) for all \(k\ge 1\), where \(D^{*}\) is defined in (3.4);

(b) \(\lim _{k\rightarrow \infty }\epsilon _k=0\);

(c) The total number of (sub)gradient evaluations performed by Algorithm 3 for finding the \(\epsilon _k\)-solution \(\bar{x}_k^*\) to problem (3.1) is bounded by

    $$\begin{aligned} \mathcal {O}\left( \frac{C_1(\overline{x}, 2\bar{r}, f) \bar{r}^{\alpha _1}}{\epsilon _k^{\beta _1}}+\frac{C_2(\overline{x}, 2\bar{r}, f) \bar{r}^{\alpha _2}}{\epsilon _k^{\beta _2}}\right) . \end{aligned}$$
    (3.13)

Proof

We start by proving part (a). If \(r_1 \ge D^*\), then \(f(\bar{x}_k') - f(\bar{x}_k'')\le \Delta _k\) for any \(k\ge 1\) and no expansion takes place, hence \(r_k=r_1\le \bar{r}\). If \(r_1 < D^*\), we see from Algorithm 3 that an expansion occurs at Step 2 if and only if \(f(\overline{x}_k')-f(\overline{x}_k'')>\Delta _k\). Hence, once \(r_k \ge D^*\), this condition is no longer satisfied and no more expansion is performed. Since each expansion doubles the radius and the last expansion is performed with a radius smaller than \(D^*\), this implies \(r_k< 2D^{*}\le \bar{r}\).

To prove part (b), observe that \(\bar{x}_k'\) is used as the initial point for computing \(\bar{x}_k''\) in Algorithm 3, and hence \(f(\bar{x}_k'') \le f(\bar{x}_k')\). Combining this observation with the condition in Step 3, we have \(0\le f(\bar{x}_k') - f(\bar{x}_k'') \le \Delta _k\). Applying Lemma 3.1, we obtain

$$\begin{aligned} \epsilon _k=f(\overline{x}_k'')-f^{*}\le \left( 3+\frac{2D^{*}}{r_k} \right) \Delta _k. \end{aligned}$$
(3.14)

Note that the total number of expansions is bounded by

$$\begin{aligned} \overline{S}_1:=\left\lceil \log _2{\frac{\bar{r}}{r_1}}\right\rceil , \end{aligned}$$
(3.15)

Since the number of expansions is finite, \(\Delta _k\) is decreased in all but finitely many iterations and hence decreases to 0 as k increases; in view of (3.14) and the boundedness of \(r_k\) in part (a), we have \(\lim _{k\rightarrow \infty }\epsilon _k= 0\).

To prove part (c), assume that the number of executions of Step 1 in Algorithm 3 for finding \(\overline{x}_k^{*}\) is K. For any \(1\le j\le K\), let \(\mathcal {A}(\overline{x}, R_j, \delta _j)\) and \(\mathcal {A}(\overline{x}, 2R_j, \delta _j)\) be called in the \(j^{th}\) execution of Step 1. By using (3.5) and noting that \(C_1(\overline{x}, R, f)\) and \(C_2(\overline{x}, R, f)\) are nondecreasing with respect to R, we see that the number of (sub)gradient evaluations performed by the \(j^{th}\) execution of Step 1 is bounded by

$$\begin{aligned} N_j:= \frac{(1+2^{\alpha _1})C_1(\overline{x},2R_j,f)R_j^{\alpha _1}}{\delta _j^{\beta _1}}+ \frac{(1+2^{\alpha _2})C_2(\overline{x},2R_j,f)R_j^{\alpha _2}}{\delta _j^{\beta _2}}. \end{aligned}$$
(3.16)

Let \(N_j'\) and \(N_j''\) be the first and second terms on the right-hand side of (3.16), respectively. The \((j+1)^{th}\) execution of Step 1 either doubles the radius or halves the gap compared with the \(j^{th}\) execution, i.e., \(R_{j+1} = 2R_j\) or \(\delta _{j+1} = \delta _j/2\), respectively. Therefore, we have either \(N_{j+1}'\ge 2^{\alpha _1}N_j'\) and \(N_{j+1}''\ge 2^{\alpha _2}N_j''\), or \(N_{j+1}'\ge 2^{\beta _1}N_j'\) and \(N_{j+1}''\ge 2^{\beta _2}N_j''\). Since \(2^{\alpha _1}\ge 2^{\beta _1}>1\) and \(2^{\alpha _2}\ge 2^{\beta _2}>1\), we can combine these two cases to obtain

$$\begin{aligned} N_{j}'\le 2^{-\beta _1}N_{j+1}' \text {\ and\ } N_{j}''\le 2^{-\beta _2}N_{j+1}'',\text {\ for\ } 1\le j\le K-1, \end{aligned}$$
(3.17)

which further implies

$$\begin{aligned} N_{j}'\le 2^{-\beta _1(K-j)}N_{K}' \text {\ and\ } N_{j}''\le 2^{-\beta _2(K-j)}N_{K}'', \text {\ for\ } 1\le j\le K. \end{aligned}$$
(3.18)

Then the total number of (sub)gradient evaluations performed by these K executions of Step 1 in Algorithm 3 is bounded by

$$\begin{aligned} N:&=\sum _{j=1}^K (N_j'+N_j'') \le N_K'\sum _{j=1}^{K} 2^{-\beta _1(K-j)}+N_K''\sum _{j=1}^{K} 2^{-\beta _2(K-j)} \end{aligned}$$
(3.19)
$$\begin{aligned}&< N_K'\sum _{j=0}^{+\infty } 2^{-\beta _1 j}+ N_K''\sum _{j=0}^{+\infty } 2^{-\beta _2 j}\le \frac{1}{1-2^{-\beta _1}}N_K'+\frac{1}{1-2^{-\beta _2}}N_K'' \end{aligned}$$
(3.20)
$$\begin{aligned}&\le \frac{(1+2^{\alpha _1})C_1(\overline{x}, 2r_{k}, f)}{1-2^{-\beta _1} }\cdot \frac{r_{k}^{\alpha _1}}{\Delta _k^{\beta _1}}+ \frac{(1+2^{\alpha _2})C_2(\overline{x}, 2r_{k}, f)}{1-2^{-\beta _2} }\cdot \frac{r_{k}^{\alpha _2}}{\Delta _k^{\beta _2}}. \end{aligned}$$
(3.21)

The last inequality follows from the observation that the last (i.e., \(K^{th}\)) execution of Step 1 before \(\bar{x}_k^*\) is found has \(R_K = r_{k}\) and \(\delta _K = \Delta _k\). Combining the above inequality with (3.14), we have

$$\begin{aligned} N<\sum _{i=1}^2\frac{(1+2^{\alpha _i})C_i(\overline{x}, 2r_{k}, f)}{1-2^{-\beta _i}}\cdot \frac{r_{k}^{\alpha _i}\left( 3+\frac{2D^{*}}{r_{k}}\right) ^{\beta _i}}{\epsilon _k^{\beta _i}}. \end{aligned}$$
(3.22)

Since \(\alpha _i\ge \beta _i>0\), the quantities \(C_i(\overline{x}, 2r_{k}, f)r_{k}^{\alpha _i}(3+\frac{2D^{*}}{r_{k}})^{\beta _i}=C_i(\overline{x}, 2r_{k}, f)r_{k}^{\alpha _i-\beta _i}(3r_{k}+2D^{*})^{\beta _i}\), \(i=1,2\), are monotonically nondecreasing with respect to \(r_{k}\). In view of the facts that \(r_{k}\le \bar{r}\) by part (a) and \(2D^{*}\le \bar{r}\) by the definition of \(\bar{r}\), this implies

$$\begin{aligned} N< \sum _{i=1}^2 \frac{(2^{3\beta _i}+2^{\alpha _i + 3\beta _i})C_i(\overline{x},2\bar{r},f)}{2^{\beta _i}-1}\cdot \frac{ \bar{r}^{\alpha _i}}{\epsilon _k^{\beta _i}}. \end{aligned}$$
(3.23)

Hence the proof is complete. \(\square \)

Note that, for solving the unconstrained CP problem (3.1), the termination criteria of most first-order algorithms are based on the residual of the (sub)gradient, which leads to a different complexity analysis. To the best of our knowledge, without any prior information on \(D^{*}\), there is no verifiable termination criterion based on the functional optimality gap that could guarantee the termination of algorithms for finding an \(\epsilon \)-solution to (3.1). Compared with Nesterov's optimal gradient method for unconstrained problems in [6], Algorithm 3 only provides efficiency estimates for \(\epsilon _k:=f(\overline{x}^{*}_k)-f^{*}\) when the output \(\overline{x}^{*}_k\) is updated, while the optimal gradient method provides estimates for \(\overline{\epsilon }_k:=f(x_k)-f^{*}\) at each iterate \(x_k\). For both methods the efficiency estimates involve \(D^{*}\). Since Algorithm 3 extends methods for ball-constrained CP problems to solve (3.1), and the expansions in Algorithm 3 can be regarded as a guess-and-check procedure for estimating \(D^{*}\), it is reasonable that the efficiency estimates are only provided for the non-expansion steps, i.e., Step 3 of Algorithm 3, which outputs \(\overline{x}^{*}_k\).

It has been shown in [1] that the APL method and its variant, the USL method, achieve the optimal iteration complexities for solving (3.3) for smooth, nonsmooth and weakly smooth CP problems and for a class of structured saddle point problems, respectively, on any convex and compact feasible set. Hence, incorporated into Algorithm 3, they can also solve (3.1) with the optimal iteration complexities. Therefore, the remaining part of this paper focuses on improving the efficiency of these BL type methods for solving ball-constrained CP problems.

3.2 Extending FAPL and FUSL for unconstrained problems

In this subsection, we study how to utilize the FAPL and FUSL methods to solve unconstrained problems based on the results in Sect. 3.1.

Let us first consider the case when f in (3.1) satisfies (3.2). If the method \(\mathcal {A}\) in Step 1 of Algorithm 3 is given as the FAPL method, then by Theorem 2.5, and the fact that only one (sub)gradient of f is computed in each iteration of the FAPL method, the number of (sub)gradient evaluations within one call to \(\mathcal {A}(\overline{x}, R,\epsilon )\) is bounded by \(N(\epsilon )\) given by (2.33).

Therefore, for the FAPL method, we have \(\alpha _1 = \frac{2(1+\rho )}{1+3\rho }, \beta _1= \frac{2}{1+3\rho }, C_1(\overline{x},R,f)=C'M^{\frac{2}{1+3\rho }}\) and \(C_2(\overline{x},R,f)=\alpha _2=\beta _2=0\) in (3.5), where \(C'\) is a constant depending on the parameters \(q,\theta ,\beta ,\rho \) and c in the FAPL method. Letting \(\epsilon _k:=f(\overline{x}^*_k)-f^{*}\) for \(k\ge 1\) in Algorithm 3 and applying Theorem 3.2, we conclude that, for finding the \(\epsilon _k\)-solution \(\overline{x}_k^*\) to problem (3.1), the total number of (sub)gradient evaluations performed by Algorithm 3 is bounded by

$$\begin{aligned} \mathcal {O}\left( \left[ \frac{M(2\overline{r})^{1+\rho }}{\epsilon _k}\right] ^{\frac{2}{1+3\rho }}\right) , \end{aligned}$$
(3.24)

where \(M:=M(B(\overline{x},2\overline{r}))\), \(\rho :=\rho (B(\overline{x},2\overline{r})) \), and \(\overline{r}\) is defined in part (a) of Theorem 3.2. It should be noted that M and \(\rho \) are local constants that depend on the size of the initial ball, i.e., \(r_1\), and the distance between \(\overline{x}\) and \(x^*\); they are not required as inputs to the FAPL method or Algorithm 3, and they are also generally smaller than the constants \(M(\mathbb {R}^n)\) and \(\rho (\mathbb {R}^n)\), respectively, in the global Hölder continuity condition.
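To see how (3.24) follows from Theorem 3.2, substitute the above parameters into the first term of (3.13) (the second term vanishes since \(C_2(\overline{x},R,f)=0\)):

$$\begin{aligned} \frac{C_1(\overline{x}, 2\bar{r}, f)\, \bar{r}^{\alpha _1}}{\epsilon _k^{\beta _1}} = C'M^{\frac{2}{1+3\rho }}\cdot \frac{\bar{r}^{\frac{2(1+\rho )}{1+3\rho }}}{\epsilon _k^{\frac{2}{1+3\rho }}} = C'\left[ \frac{M\bar{r}^{1+\rho }}{\epsilon _k}\right] ^{\frac{2}{1+3\rho }}, \end{aligned}$$

which agrees with (3.24) up to the absolute constant factor \(2^{\frac{2(1+\rho )}{1+3\rho }}\le 4\).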

Moreover, if f in (3.1) is given in the form of (2.31) as a structured nonsmooth CP problem, then the FUSL method could be applied to solve the corresponding structured ball-constrained problems in Algorithm 3. By Theorem 2.8, the number of (sub)gradient evaluations of f within one call to \(\mathcal {A}(\overline{x},R,\epsilon )\) is bounded by

$$\begin{aligned} S_1+S_2+C'R\sqrt{\frac{L_{\hat{f}}}{\epsilon }}+\frac{C''R\Vert A\Vert }{\epsilon }, \end{aligned}$$
(3.25)

where \(C',C''\) are some constants depending on the parameters \(q,\theta ,\beta ,\sigma _v, c, D_0\) and \(D_{v,Y}\) in the FUSL method.

Applying Theorem 3.2 with \(\alpha _1=\alpha _2 =1\), \(\beta _1=\frac{1}{2}\), \(\beta _2=1\), \(C_1(\overline{x},R,f)=C' \sqrt{L_{\hat{f}}}\), and \(C_2(\overline{x},R,f)=C''\Vert A\Vert \), we conclude that the total number of (sub)gradient evaluations performed by Algorithm 3 to find the \(\epsilon _k\)-solution \(\overline{x}_k^*\) to problem (3.1)-(2.31) is bounded by

$$\begin{aligned} \mathcal {O}\left( 2C'\overline{r}\sqrt{\frac{L_{\hat{f}}}{\epsilon _k}}+\frac{2C''\overline{r}\Vert A\Vert }{\epsilon _k} \right) . \end{aligned}$$
(3.26)

Similar to the FAPL method, here \(L_{\hat{f}}:=L_{\hat{f}}(B(\overline{x},2\overline{r}))\) is a lower bound on \(L_{\hat{f}}(\mathbb {R}^n)\).
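For completeness, the same substitution as in the smooth case shows how (3.26) arises from (3.13): plugging in the above parameters gives

$$\begin{aligned} \frac{C_1(\overline{x},2\bar{r},f)\bar{r}^{\alpha _1}}{\epsilon _k^{\beta _1}}+\frac{C_2(\overline{x},2\bar{r},f)\bar{r}^{\alpha _2}}{\epsilon _k^{\beta _2}} = C'\sqrt{L_{\hat{f}}}\cdot \frac{\bar{r}}{\epsilon _k^{1/2}}+C''\Vert A\Vert \cdot \frac{\bar{r}}{\epsilon _k} = C'\bar{r}\sqrt{\frac{L_{\hat{f}}}{\epsilon _k}}+\frac{C''\bar{r}\Vert A\Vert }{\epsilon _k}, \end{aligned}$$

which matches (3.26) up to absolute constants.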

4 Numerical experiments

In this section we apply the FAPL and FUSL methods to a few large-scale CP problems, including quadratic programming problems with large Lipschitz constants and synthetic and real-world total variation based image reconstruction problems, and compare them with some other first-order algorithms. All the algorithms were implemented in MATLAB R2011a, and all experiments were performed on a desktop with an Intel Core 2 Duo 3.3 GHz CPU and 8 GB of memory. In Sects. 4.1 and 4.2, some random matrices are generated during the numerical experiments.

4.1 Quadratic programming

The main purpose of this subsection is to investigate the performance of the FAPL method for solving smooth CP problems, especially those with large Lipschitz constants, and to demonstrate its improvement over some other BL type methods. Since most BL type methods require compact feasible sets, we consider the following quadratic programming problem:

$$\begin{aligned} \min _{\Vert x\Vert \le 1}\Vert Ax-b\Vert ^2, \end{aligned}$$
(4.1)

where \(A\in \mathbb {R}^{m \times n}\) and \(b\in \mathbb {R}^m\). We compare the FAPL method with NERML [22] and APL [1], as well as with the built-in MATLAB linear system solver. In the APL method, the subproblems are solved by MOSEK [37], an efficient software package for linear and second-order cone programming. Two cases with different choices of the initial lower bound LB are considered: (1) \(LB=0\) and (2) \(LB=-\infty \). Moreover, for all the instances, except for the worst-case instance where the given data is specially designed and has no randomness, we randomly generate the data and run the algorithms 100 times; the means and standard deviations of the number of iterations, CPU time and solution accuracy are then computed and reported for each algorithm.

In our experiments, given m and n, two types of matrix A are generated. The first type of matrix A is randomly generated with entries uniformly distributed in [0,1], while the entries of the second type are normally distributed according to N(0, 1). We then randomly choose an optimal solution \(x^{*}\) within the unit ball in \(\mathbb {R}^n\), and generate the data b by \(b=Ax^{*}\). We apply FAPL, NERML and APL to solve (4.1) with the data A and b, and the accuracy of the solutions is measured by \(e_k=\Vert Ax_k-b\Vert ^2\). Additionally, for each type of matrix A, we run one instance 30 times and compute the means and standard deviations of the number of iterations, CPU time and accuracy. The results are shown in Tables 2 and 3.
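For illustration, the instance generation just described could be reproduced as follows (a sketch only; the dimensions and random seed are arbitrary and not those used in the reported experiments):

```python
import numpy as np

rng = np.random.default_rng(0)   # arbitrary seed, for reproducibility only
m, n = 2000, 4000                # illustrative dimensions

# Two types of matrices, as described in the text.
A_uniform = rng.uniform(0.0, 1.0, size=(m, n))
A_gaussian = rng.standard_normal(size=(m, n))

# Random optimal solution inside the unit ball, and consistent data b = A x*.
x_star = rng.standard_normal(n)
x_star *= rng.uniform(0.0, 1.0) / np.linalg.norm(x_star)
b = A_uniform @ x_star

# Accuracy measure used in the experiments: e_k = ||A x_k - b||^2.
def accuracy(A, b, x_k):
    r = A @ x_k - b
    return float(r @ r)
```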

Table 2 Uniformly distributed QP instances
Table 3 Gaussian distributed QP instances

In order to investigate the efficiency of solving unconstrained CP problems using the proposed expansion algorithm (i.e., Algorithm 3), we conduct two sets of experiments that compare two strategies for solving the unconstrained QP problem \(\min _{x\in \mathbb {R}^n}\Vert Ax -b\Vert ^2\): the first starts with a small initial ball and applies the unconstrained FAPL method (i.e., the expansion algorithm with the FAPL method as subroutine \(\mathcal {A}\)), while the second, under the assumption that a bound on \(D^*\) defined in (3.4) is known, applies the ball-constrained FAPL method directly to some large ball that contains at least one optimal solution.

Since the performance of both methods is affected by the distance between the initial point \(x_0\) and the optimal solution set, in both experiments we set \(\bar{x} =0\) and \(D^* \approx 1\), and for any initial ball \(B(\bar{x},R)\) we choose the starting point \(x_0\) randomly within the ball and then rescale it so that \(\Vert x_0\Vert = \frac{R}{2}\). For both methods, the initial lower bound is set to \(-\infty \), and the parameters of the FAPL method are the same. In the first experiment, A is generated randomly with entries uniformly distributed in [0, 1]. In the second experiment, we use the worst-case QP instance for first-order methods constructed by A. Nemirovski (see the construction scheme in [7] and [4]). When the unconstrained FAPL method is applied, the radii of the initial balls are chosen as \(10^{-5}D\), \(10^{-4}D\), \(10^{-3}D\), \(10^{-2}D\) and \(10^{-1}D\), respectively. When the ball-constrained FAPL method is employed, the radii of the balls are selected as \(10^{5} D\), \(10^{4} D\), \(10^{3} D\), \(10^{2} D\) and 10D, respectively. The results are shown in Table 5.

Table 4 Comparison to Matlab solver
Table 5 Unconstrained QP instances

The advantages of the FAPL method can be observed from these experiments. Firstly, among these three BL type methods, NERML requires more iterations than APL and FAPL, which have optimal iteration complexity for this problem. Moreover, in terms of CPU time, the computational cost per iteration of FAPL is about ten times lower than that of APL and NERML.

Secondly, considering the difference in performance between setting the initial lower bound to 0 and to \(-\infty \), it is also evident that FAPL is more robust to the choice of the initial lower bound and updates the lower bound more efficiently than the other two BL type methods. Though setting the initial lower bound to \(-\infty \) increases the number of iterations for all three BL type methods, a close examination reveals that the difference between setting the lower bound to zero and to \(-\infty \) is not as significant for FAPL as for APL and NERML, especially for large matrices (for example, the second instance in Table 2).

Thirdly, FAPL needs fewer iterations than APL, especially when the required accuracy is high. A plausible explanation is that solving the subproblems exactly provides better updates for the prox-centers, and, consequently, more accurate prox-centers improve the efficiency of the algorithm. The experiments show that, for APL and NERML, it is hard to improve the accuracy beyond \(10^{-10}\). However, FAPL maintains almost the same speed while decreasing the objective value from \(10^6\) to \(10^{-21}\).

Fourthly, we can clearly see from Table 4 that FAPL is comparable to or outperforms the built-in MATLAB solver for randomly generated linear systems, even though our code is implemented in MATLAB rather than in lower-level languages such as C or FORTRAN. We expect that the efficiency of FAPL would be further improved by a C or FORTRAN implementation, as is used in the MATLAB solver for linear systems.

Finally, from Table 5 it is evident that the performance of both the unconstrained FAPL method and the ball-constrained FAPL method is affected by the distance between the starting point \(x_0\) and the optimal solution set, and that improper estimates of \(D^*\) can increase the computational cost significantly. Comparing the results presented in the same rows of the left and right columns of Table 5, one can see that, for the uniform instance, both methods achieve highly accurate solutions and the ball-constrained FAPL method needs fewer iterations and less CPU time than the unconstrained FAPL method, while for the worst-case instance the unconstrained FAPL method significantly outperforms the ball-constrained FAPL method in terms of solution accuracy, iterations and CPU time.

In summary, due to its low iteration cost and effective use of the memory of first-order information, the FAPL method is a powerful tool for solving ball-constrained smooth CP problems, especially when the number of variables is huge and/or the Lipschitz constant is large. Moreover, by incorporating the proposed expansion algorithm, the unconstrained FAPL method is very efficient for solving unconstrained CP problems, especially when a proper estimate of the optimal solution set is not available.

4.2 Total variation based image reconstruction

In this subsection, we apply the FUSL method to solve the nonsmooth total variation (TV) based image reconstruction problem:

$$\begin{aligned} \min _{u\in \mathbb {R}^N}\frac{1}{2}\Vert Au-b\Vert _2^2+\lambda \Vert u\Vert _{TV}, \end{aligned}$$
(4.2)

where A is a given matrix, u is the vector form of the image to be reconstructed, b represents the observed data, and \(\Vert \cdot \Vert _{TV}\) is the discrete TV semi-norm defined by

$$\begin{aligned} \Vert u\Vert _{TV}:=\sum _{i=1}^N\Vert D_iu\Vert _2, \end{aligned}$$
(4.3)

where \(D_iu\in \mathbb {R}^2\) is the discrete gradient (finite differences along the coordinate directions) of u at the \(i^{th}\) pixel, and N is the number of pixels in the image. Note that \(\Vert u\Vert _{TV}\) is convex and nonsmooth.
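As a concrete illustration of (4.3), the following sketch evaluates the discrete TV semi-norm of a two-dimensional image; the forward-difference convention and the boundary handling are our assumptions, since the text only specifies finite differences along the coordinate directions.

```python
import numpy as np

def tv_seminorm(img):
    """Discrete isotropic TV of a 2D image: sum_i ||D_i u||_2 with forward differences."""
    dx = np.zeros_like(img)
    dy = np.zeros_like(img)
    dx[:, :-1] = img[:, 1:] - img[:, :-1]   # horizontal finite differences
    dy[:-1, :] = img[1:, :] - img[:-1, :]   # vertical finite differences
    return np.sum(np.sqrt(dx**2 + dy**2))   # 2-norm of D_i u at each pixel, summed

# Example usage on a random 16 x 16 image (illustrative only).
u = np.random.default_rng(0).random((16, 16))
print(tv_seminorm(u))
```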

One approach to solving this problem is to consider the associated dual or primal-dual formulations of (4.2) based on the dual formulation of the TV semi-norm:

$$\begin{aligned} \Vert u\Vert _{TV}= & {} \max _{p\in Y} \left\langle p,Du\right\rangle , \text {where } Y=\{p=(p_1,\ldots ,p_N)\in \mathbb {R}^{2N}:p_i\in \mathbb {R}^2,\nonumber \\&\Vert p_i\Vert _2\le 1,1\le i\le N \}. \end{aligned}$$
(4.4)

Consequently, we can rewrite (4.2) as a saddle-point problem:

$$\begin{aligned} \min _{u\in \mathbb {R}^N}\max _{p\in Y}\frac{1}{2}\Vert Au-b\Vert ^2_2+\lambda \left\langle p,Du\right\rangle . \end{aligned}$$
(4.5)

Note that (4.5) is exactly the form we considered in the USL and FUSL methods if we set \(\hat{g}(y)=0\). Specifically, the prox-function v(y) on Y is simply chosen as \(v(y)=\frac{1}{2}\Vert y\Vert ^2\) in these smoothing techniques.

In our experiments, we consider two types of instances depending on how the matrix A is generated. Specifically, for the first case, the entries of A are normally distributed, while for the second one, the entries are uniformly distributed. For both types of instances, we first generate the matrix \(A\in \mathbb {R}^{m\times n}\), then choose a true image \(x_{true}\) and convert it to a vector, and finally compute b by \(b=Ax_{true}+\epsilon \), where \(\epsilon \) is Gaussian noise with distribution \(N(0,\sigma )\). We compare the following algorithms: the accelerated primal dual (APD) method [8], Nesterov's smoothing (NEST-S) method [5, 34], and the FUSL method.

In our first experiment, the entries of the matrix A of size \(4,096 \times 16,384\) are randomly generated from the normal distribution N(0, 64), and the image \(x_{true}\) is a \(128 \times 128\) Shepp–Logan phantom generated by MATLAB. Moreover, we set \(\lambda =10^{-3}\) and the standard deviation \(\sigma =10^{-3}\). The values of the Lipschitz constants are provided to APD and NEST-S, and the initial lower bound for FUSL is set to 0. We run 300 iterations of each algorithm, and present the objective values of problem (4.2) and the relative errors defined by \(\Vert x_k-x_{true}\Vert _2/\Vert x_{true}\Vert _2\) in Fig. 1. In our second experiment, the matrix A is randomly generated with entries uniformly distributed in [0, 1]. We use a \(200 \times 200\) brain image [38] as the true image \(x_{true}\), and set \(m=20,000\), \(\lambda =10\), and \(\sigma =10^{-2}\). The other settings are the same as in the first experiment, and the results are shown in Fig. 2.

Fig. 1 TV-based reconstruction (Shepp–Logan phantom)

Fig. 2 TV-based reconstruction (brain image)

We make some observations about the results in Figs. 1 and 2. In the first experiment, there is almost no difference between APD and NEST-S, but FUSL outperforms both of them after 5 seconds in terms of both objective value and relative error. The second experiment clearly demonstrates the advantage of FUSL for solving CP problems with large Lipschitz constants. The Lipschitz constant of the matrix A in this instance is about \(2 \times 10^8\), much larger than the Lipschitz constant (about 5.9) in the first experiment. FUSL still converges quickly and decreases the relative error to 0.05 in fewer than 100 iterations, while APD and NEST-S converge very slowly and require more than 1,000 iterations due to the large Lipschitz constant. It appears that FUSL is not as sensitive to the Lipschitz constant as the other two methods, which makes it more efficient for solving large-scale CP problems, which often have large Lipschitz constants.

In summary, for the TV-based image reconstruction problem (4.2), FUSL not only enjoys a completely parameter-free property (and hence there is no need to estimate the Lipschitz constant), but also demonstrates advantages in its speed of convergence and its solution quality in terms of relative error, especially for large-scale problems.

4.3 Partially parallel imaging

In this subsection, we compare the performance of the FUSL method with several related algorithms for the reconstruction of magnetic resonance (MR) images from partially parallel imaging (PPI), to further confirm the advantages of the FUSL method observed above. A detailed background and description of PPI reconstruction can be found in [38]. This image reconstruction problem can be modeled as

$$\begin{aligned} \min _{u\in C^N}\sum _{j=1}^k\Vert M\mathcal {F}S_ju-f_j\Vert ^2+\lambda \sum _{i=1}^N\Vert D_iu\Vert _2, \end{aligned}$$

where u is the N-vector form of a two-dimensional complex-valued image to be reconstructed, and k is the number of coils (viewed as sensors) in the MR parallel imaging system. \(\mathcal {F}\in C^{N\times N}\) is a 2D discrete Fourier transform matrix, \(S_j\in C^{N\times N}\) is the sensitivity map of the j-th sensor, and \(M\in R^{N\times N}\) is a binary mask that describes the scanning pattern. Note that the percentage of nonzero elements in M describes the compression ratio of the PPI scan. In our experiments, the sensitivity maps \(\{S_j\}_{j=1}^k\) are shown in Fig. 3, the true image \(x_{true}\), of size \(512 \times 512\), is shown in Figs. 4 and 5, and the measurements \(\{f_j\}\) are generated by

$$\begin{aligned} f_j=M(\mathcal {F}S_jx_{true}+\epsilon ^{re}_j/\sqrt{2}+\epsilon ^{im}_j/\sqrt{-2}),\ j=1,\ldots ,k, \end{aligned}$$
(4.6)

where \(\epsilon ^{re}_j,\epsilon ^{im}_j\) are noise vectors with entries independently distributed according to \(N(0,\sigma )\). We conduct two experiments on this data set with different acquisition rates, and compare the FUSL method with the NEST-S method and the accelerated linearized alternating direction method of multipliers (AL-ADMM) with line search [15].
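For illustration, the data generation in (4.6) could be sketched as follows. This is not the authors' code: the orthonormal 2D FFT, the complex-noise convention, the array shapes, and the mask construction (keeping one row out of every seven, as in the first experiment below) are our assumptions.

```python
import numpy as np

def cartesian_mask(shape, keep_every=7):
    """Binary mask keeping one row out of every `keep_every` rows (illustrative)."""
    mask = np.zeros(shape)
    mask[::keep_every, :] = 1.0
    return mask

def ppi_measurements(x_true, sens_maps, mask, sigma, rng):
    """Generate f_j = M(F S_j x + noise) for each coil, following (4.6)."""
    f = []
    for S_j in sens_maps:
        coil_img = S_j * x_true                        # apply sensitivity map
        kspace = np.fft.fft2(coil_img, norm="ortho")   # 2D discrete Fourier transform
        noise = (rng.normal(0, sigma, x_true.shape)
                 + 1j * rng.normal(0, sigma, x_true.shape)) / np.sqrt(2)
        f.append(mask * (kspace + noise))              # masked, noisy k-space data
    return f

# Illustrative usage with random maps and image (not the data set used in the paper).
rng = np.random.default_rng(0)
x = rng.random((512, 512)) + 1j * rng.random((512, 512))
maps = [rng.random((512, 512)) for _ in range(4)]      # 4 hypothetical coils
data = ppi_measurements(x, maps, cartesian_mask((512, 512)), sigma=3e-2, rng=rng)
```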

Fig. 3 Sensitivity map and Cartesian masks

Fig. 4 PPI image reconstruction (acquisition rate: \(14\%\))

Fig. 5 PPI image reconstruction (acquisition rate: \(10\%\))

For both experiments, we set \(\sigma =3\times 10^{-2}\) and \(\lambda =10^{-5}\), and \(\{f_j\}_{j=1}^k\) are generated by (4.6). In the first experiment, we use a Cartesian mask with acquisition rate \(14\%\), which acquires one row of the image out of every seven successive rows, while in the second one, we use a Cartesian mask with acquisition rate \(10\%\), which acquires one row out of every ten successive rows. The two masks are shown in Fig. 3. The results of the first and second experiments are shown in Figs. 4 and 5, respectively. These experiments again demonstrate the advantages of the FUSL method over the other techniques for PPI image reconstruction.

5 Concluding remarks

In this paper, we presented two new BL type methods, the FAPL and FUSL methods, that uniformly solve smooth, nonsmooth, and weakly smooth CP problems, as well as a class of structured nonsmooth problems, with optimal iteration complexities. Owing to the use of cutting plane models, the restricted-memory technique, and an efficient and scalable solver for the involved subproblems, the proposed methods have lower iteration cost and can find solutions of higher accuracy in fewer iterations than many gradient-descent-type methods. Moreover, these BL type methods do not require the input of any problem parameter, nor do they involve any stepsize that could be affected by a large Lipschitz constant of the objective function or by the large dimension of the problem. These built-in features of the proposed methods are essential for solving large-scale CP problems. Our numerical results on least squares problems and total variation based image reconstruction clearly demonstrate the advantages of the FAPL and FUSL methods over the original APL and USL methods and some other first-order methods.