1 Introduction

In this work, we are interested in solving problems of the form

$$\begin{aligned} f^{{{\mathrm{inf}}}}:=\inf _{{x}\in {\fancyscript{X}}} f({x}), \end{aligned}$$
(1)

where \(f:\mathfrak {R}^n\rightarrow \mathfrak {R}\) is a nonsmooth convex function and \({\fancyscript{X}}\subset \mathfrak {R}^n\) is a nonempty convex and closed set, typically polyhedral. As is widely accepted, the most efficient optimization techniques to solve such problems are the bundle methods, e.g., [18, Chap. XIV], [4, Part II], and the analytic-center cutting-plane methods, e.g., [15, 16]. All bundle methods make use of the following three ingredients:

  1. (i)

    a convex model \(\check{f}_k\) of \(f\) (usually, \(\check{f}_k\) is a cutting-plane approximation satisfying \(\check{f}_k\leqslant f\));

  2. (ii)

    a stability center \({\hat{{x}}}_k\) (some previous iterate, usually the “best” point generated by the iterative process so far);

  3. (iii)

    a certain algorithmic parameter updated at each iteration (proximal, level, or trust-region, depending on the variant of the method).

The new iterate \({{x}_{k+1}}\) of a bundle method depends on the above three ingredients, whose organization defines different methods. The main classes are the proximal, level, and trust-region methods. We next discuss some details of the proximal and level variants, as these are the two strategies relevant for our developments. Their simplified conceptual versions can be stated as follows.

Proximal bundle method, e.g., [10, 13, 19, 23],

$$\begin{aligned} {{x}_{k+1}}:=\arg \min \left\{ \check{f}_k({x})+\frac{1}{2\tau _k}|{x}-{\hat{{x}}}_k|^2:\; {x}\in {\fancyscript{X}}\right\} , \end{aligned}$$
(2)

where \(\tau _k >0\) is the proximal parameter.

Level bundle method, e.g., [5, 9, 20, 22],

$$\begin{aligned} {{x}_{k+1}}:=\arg \min \left\{ \frac{1}{2}|{x}-{\hat{{x}}}_k|^2:\;\check{f}_k({x})\le \ell _k,\quad {x}\in {\fancyscript{X}}\right\} , \end{aligned}$$
(3)

where \(\ell _k\in \mathfrak {R}\) is the level parameter.

As is well known, for the same model \(\check{f}_k\) and the same stability center \({\hat{{x}}}_k\), one can find proximal and level parameters \(\tau _k\) and \(\ell _k\) such that the two versions above generate the same next iterate \({{x}_{k+1}}\) (i.e., for some choice of parameters, the solutions of the subproblems (2) and (3) coincide). In this (formal, theoretical) sense the two approaches can be considered equivalent. Details of implementation and practical performance can be quite different, however, in particular because the parameters are updated by strategies specific to each of the methods, and the corresponding rules are not related in any direct way.
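
For illustration, the following small script checks this equivalence numerically on a toy instance: it solves (2), then solves (3) with the level set to the model value at the proximal solution, and verifies that the two iterates coincide. This is a hedged sketch; the data are ours, \({\fancyscript{X}}=\mathfrak {R}^n\) is assumed, and cvxpy is used merely as a convenient solver.

```python
# Illustrative check of the equivalence of (2) and (3) for matched parameters,
# assuming a toy cutting-plane model fcheck(x) = max_j (b[j] + A[j] @ x) and X = R^n.
# All data and names below are ours, for illustration only.
import cvxpy as cp
import numpy as np

A = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])  # subgradients of the cuts
b = np.array([0.0, -0.5, 1.0])                         # intercepts of the cuts
x_hat = np.array([2.0, 1.0])                           # stability center
tau = 0.5                                              # proximal parameter

# Proximal subproblem (2): min fcheck(x) + |x - x_hat|^2 / (2 tau).
x = cp.Variable(2)
cp.Problem(cp.Minimize(cp.max(A @ x + b)
                       + cp.sum_squares(x - x_hat) / (2 * tau))).solve()
x_prox = x.value

# Level subproblem (3) with the "matched" level ell := fcheck(x_prox).
ell = float(np.max(A @ x_prox + b))
y = cp.Variable(2)
cp.Problem(cp.Minimize(cp.sum_squares(y - x_hat) / 2),
           [cp.max(A @ y + b) <= ell]).solve()

print(np.allclose(x_prox, y.value, atol=1e-5))  # True: same next iterate
```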

It is worth mentioning that updating the stability center \({{\hat{{x}}}_k}\) in item (ii) above is mandatory for (at least the most standard) versions of proximal bundle methods, but it may not be necessary for level methods. In some variants of level methods one can update \({{\hat{{x}}}_k}\) at each iteration [12, 22], or keep \({{\hat{{x}}}_k}={\hat{{x}}}\) fixed for all iterations [2] (in which case \({{\hat{{x}}}_k}\) does not play the role of the “best” point computed so far). See also [3, 5, 20] for various rules to manage the stability center \({{\hat{{x}}}_k}\) in level methods.

It should be stressed that the choice of the parameter \(\tau _k\) in the proximal variant is quite a delicate task. Although the simplest choice \(\tau _k= \tau >0\) (for all \(k\)) is enough to prove theoretical convergence, it is well understood that for practical efficiency \(\tau _k\) must be properly updated along the iterations. We refer to [19] and [23] for some strategies that usually work well in practice. However, the former has some heuristic features, while the latter (based on quasi-Newton formulas) is designed for unconstrained problems and needs some safeguards to fit the general convergence theory. Also, it was noticed during the numerical experimentation in [25] that for constrained problems the rule of [23] does not work as well as for unconstrained ones.

Continuing the discussion of choosing parameters, keeping the level parameter \(\ell _k\) fixed is not possible, of course, as this may give infeasible subproblems (3). But there exist strategies that manage \(\ell _k\) by simple explicit calculations (whether the problem is unconstrained or constrained), and which are theoretically justified. As a somewhat more costly but very efficient option, the level parameter \(\ell _k\) can be adjusted by solving a linear program (when the feasible set \({\fancyscript{X}}\) is polyhedral, and is either bounded or a lower bound \(f^{{{\mathrm{low}}}}\) for the optimal value \(f^{{{\mathrm{inf}}}}\) is known); see [12, 22] and (17) below.

Overall, there seems to be a consensus that for solving unconstrained problems proximal bundle methods are very good choices, although the updating rule for \(\tau _k\) is somewhat of an issue (at least from the viewpoint of combining theory and efficiency). On the other hand, there is some evidence that for constrained problems level bundle methods might be preferable. Also, strategies for updating the level parameter \(\ell _k\) are readily available. It is thus appealing to try to combine the attractive features of both approaches in a single algorithm that performs for unconstrained (respectively, constrained) problems as well as proximal bundle methods (respectively, level bundle methods), or maybe even better in some cases. To this end, we propose what we call a doubly stabilized bundle method, that combines both proximal and level stabilizations in the same subproblem, namely:

$$\begin{aligned} {{x}_{k+1}}:=\arg \min \left\{ \check{f}_k({x})+\frac{1}{2\tau _k}|{x}-{\hat{{x}}}_k|^2:\; \check{f}_k({x})\le \ell _k,\quad {x}\in {\fancyscript{X}}\right\} . \end{aligned}$$
(4)

We immediately comment that (4) can be reformulated as a quadratic program (if \({\fancyscript{X}}\) is polyhedral), just like (2) or (3), with just one extra scalar bound constraint compared to (2), or one extra scalar variable and scalar bound constraint compared to (3); see (8) below. The dual for (4) is also very similar in structure to the duals of (2) or (3). Thus, the subproblem (4) is no harder (or at least, cannot be much harder) to solve than (2) or (3). Moreover, it turns out that the (unique) solution to problem (4) is also a solution to at least one of the problems (2) or (3); see Lemma 1 below. This reveals that the proposed method indeed combines the proximal and the level approaches, “automatically” choosing between the two at every step.

The advantages derived from (4) can be summarized as follows:

  • the level parameter \(\ell _k\) is easily updated, and can take into account a lower bound for \(f^{{{\mathrm{inf}}}}\) if it is available;

  • the level constraint \(\check{f}_k({x})\le \ell _k\) provides:

    • a Lagrange multiplier useful to update the proximal parameter \(\tau _k\);

    • an additional useful stopping test, based on a certain optimality gap;

  • the objective function \(\check{f}_k({x})+\frac{1}{2\tau _k}|{x}-{\hat{{x}}}_k|^2\) with proximal regularization allows for searching for good points inside the level set \(\{ x\in {\fancyscript{X}}: \check{f}_k({x})\le \ell _k\}\), and not only on its boundary, as the level method does.

(It should be noted here that proximal bundle methods can also exploit known lower bounds for \(f^{{{\mathrm{inf}}}}\) by adding certain associated linearizations [14].)

Among other things, our new variant aims at taking advantage of the simplicity of managing the level parameter \(\ell _k\) to produce a simple and efficient rule to update the proximal parameter \(\tau _k\). In particular, this update depends on whether or not the level constraint is active. In this sense, activity of this constraint (and the associated Lagrange multiplier) can be seen as a tool indicating when and how to update \(\tau _k\). Furthermore, depending on the feasible set \({\fancyscript{X}}\) (for example if it is bounded), the management of the level parameter can provide a lower bound for \(f^{{{\mathrm{inf}}}}\), giving an additional stopping test based on a certain optimality gap. It will be shown in Sect. 2 that the lower bound can be updated in a computationally cheap way.

The idea to combine proximal and level bundle methods was first suggested in [8] (with some limited numerical experiments, without proof of convergence and without handling inexact data). To the best of our knowledge, the only other bundle-type method which employs some kind of double stabilization is [1], where the proximal and trust-region features are present for piecewise quadratic models of \(f\). However, the motivation for this and the resulting algorithm are rather different from ours. For example, the subproblems in [1] are those of minimizing a quadratic function subject to quadratic constraints.

The rest of this paper is organized as follows. Section 2 introduces the doubly stabilized bundle algorithm more formally. Section 3 is devoted to convergence analysis of the method. Inexactness of the function and subgradient evaluations is addressed in Sect. 4. Section 5 contains numerical experiments comparing the proposed algorithm with: the proximal bundle methods using the updating rules for \(\tau _k\) based on [19] and [23]; and the level method given in [5]. A variety of different types of problems are used to validate our proposal: the model unit-commitment problem in the energy sector, two-stage stochastic linear programming, nonsmoothly-regularized maxima of quadratic functions, and some standard nonsmooth optimization test problems. Finally, Sect. 6 gives some concluding comments and remarks.

Our notation is standard. For any points \(x,y \in \mathfrak {R}^n\), \(\langle x,y\rangle \) stands for the Euclidean inner product, and \(|\cdot |\) for the associated norm, i.e., \(|x|=\sqrt{\langle x,x\rangle }\). For a set \({\fancyscript{X}}\subset \mathfrak {R}^n\), we denote by \(i_{\fancyscript{X}}\) its indicator function, i.e., \(i_{\fancyscript{X}}(x)=0\) if \(x\in {\fancyscript{X}}\) and \(i_{\fancyscript{X}}(x) =+\infty \) otherwise. For a convex set \({\fancyscript{X}}\), \(\mathtt{ri }{\fancyscript{X}}\) stands for its relative interior, and \(N_{{\fancyscript{X}}}(x)\) for its normal cone at the point \(x\), i.e., the set \(\{y: \langle y,z-x\rangle \leqslant 0\;\forall \, z\in {\fancyscript{X}}\}\) if \(x\in {\fancyscript{X}}\) and the empty set otherwise. Given a convex function \(f\), we denote its subdifferential at the point \(x\) by \(\partial f(x)=\{g: f(y)\geqslant f(x)+\langle g,y-x\rangle \;\forall \,y\}\).

2 A doubly stabilized bundle method

The method generates a sequence of feasible iterates \(\{{x}_k\} \subset {\fancyscript{X}}\). For each point \({{x}_{k}}\) an oracle (black-box) is called to compute the function value \(f({{x}_{k}})\) and one arbitrary subgradient \(g_k\in \partial f({{x}_{k}})\). With this information, the method creates the linearization

$$\begin{aligned} \bar{f}_k ({x}):=f({x}_k)+\langle g_k,{x}-{x}_k\rangle \leqslant f({x}), \end{aligned}$$
(5)

where the inequality holds by the definition of the subgradient of \(f\). At iteration \(k\), a polyhedral cutting-plane model of \(f\) is available:

$$\begin{aligned} \check{f}_k({x}):=\max _{j \in {{\fancyscript{B}}_{k}}}\,\bar{f}_j({x}) \leqslant f({x}), \end{aligned}$$
(6)

where the set \({{\fancyscript{B}}_{k}}\) may index some of the linearizations \(\bar{f}_j\), \(j\leqslant k\), of the form in (5), but also affine functions obtained as certain convex combinations of such previous linearizations (the so-called aggregate linearizations, defined below). Note that (5) implies that the inequality in (6) holds for such a construction. Some additional (standard) conditions on the model \(\check{f}_k\) will be imposed further below, when needed. Note finally that in our notation \({{\fancyscript{B}}_{k}}\) simply enumerates the affine functions comprising \(\check{f}_k\), and thus \({{\fancyscript{B}}_{k}}\) need not be a subset of \(\{1, \ldots , k\}\) even though \(\check{f}_k\) is, of course, built with information computed on those previous iterations. In particular, the aggregate linearization mentioned above may be indexed by some \(j\not \in \{1, \ldots , k\}\) (this gives some notational convenience; for example, we do not have to worry about assigning to an aggregate linearization an index already taken by the “usual” previous cutting plane).

Let \({\hat{{x}}}_k\) be the current stability center (the best past iterate), and let \(v^{\ell }_k\) be a nonnegative scalar representing how much we aim to reduce the value \(f({\hat{{x}}}_k)\) at the current iteration. Define the corresponding level parameter by

$$\begin{aligned} \ell _k := f\left( {\hat{{x}}}_k\right) -v^{\ell }_k. \end{aligned}$$

Then the level set associated with the model \(\check{f}_k\) and the parameter \(\ell _k\) is given by

$$\begin{aligned} {\mathbb {X}}_k := \{{x}\in {\fancyscript{X}}:\check{f}_k({x})\leqslant \ell _k \}, \end{aligned}$$
(7)

which is polyhedral if \({\fancyscript{X}}\) is polyhedral.

We first observe that in the standard (via slack variable) reformulation of the doubly stabilized subproblem (4) given by

$$\begin{aligned} \displaystyle \min _{({x},r) \in \mathfrak {R}^{n+1}} \left\{ r + \displaystyle \frac{1}{2\tau _k}|{x}-{{\hat{{x}}}_k}|^2 : \check{f}_k({x})\leqslant r,\quad \check{f}_k({x}) \leqslant \ell _k,\quad {x}\in {\fancyscript{X}}\right\} , \end{aligned}$$

the first constraint must be active at the solution (\(\check{f}_k({x}) = r\)), as otherwise the \(r\) term in the objective could be reduced while maintaining feasibility (with the same \(x\) part of the solution). This observation implies that the solution to the latter problem, and thus to (4), can alternatively be obtained by solving the simpler

$$\begin{aligned} \displaystyle \min _{({x},r) \in \mathfrak {R}^{n+1}} \left\{ r + \displaystyle \frac{1}{2\tau _k}|{x}-{{\hat{{x}}}_k}|^2 :\check{f}_k({x})\leqslant r,\quad r \leqslant \ell _k,\quad {x}\in {\fancyscript{X}}\right\} . \end{aligned}$$
(8)
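
To make the above concrete, a minimal sketch of solving (8) with a generic QP solver is given below (illustrative only: the bundle is stored as pairs \((a_j,b_j)\) with \(\bar{f}_j({x})=b_j+\langle a_j,{x}\rangle \), \({\fancyscript{X}}=\mathfrak {R}^n\) is taken for simplicity, and cvxpy stands in for any QP solver that returns Lagrange multipliers). The multiplier of the constraint \(r\leqslant \ell _k\) is the \(\lambda _k\) of Proposition 1 below, so that \(\mu _k=\lambda _k+1\).

```python
# Hedged sketch of the doubly stabilized subproblem in the QP form (8).
# Bundle data: bar f_j(x) = b[j] + A[j] @ x; feasible set taken as R^n for simplicity.
# All names are illustrative; any QP solver exposing dual multipliers would do.
import cvxpy as cp
import numpy as np

def doubly_stabilized_step(A, b, x_hat, tau, ell):
    n = x_hat.size
    x, r = cp.Variable(n), cp.Variable()
    cuts = A @ x + b <= r            # bar f_j(x) <= r for all j in B_k
    level = r <= ell                 # the level constraint
    prob = cp.Problem(cp.Minimize(r + cp.sum_squares(x - x_hat) / (2 * tau)),
                      [cuts, level])
    prob.solve()
    lam = float(level.dual_value)    # multiplier of r <= ell, i.e. lambda_k >= 0
    return x.value, float(r.value), lam

A = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
b = np.array([0.0, -0.5, 1.0])
x_next, r_next, lam = doubly_stabilized_step(A, b, np.array([2.0, 1.0]), tau=0.5, ell=0.3)
mu = lam + 1.0                       # mu_k = lambda_k + 1, cf. (9)
print("level iterate" if mu > 1.0 + 1e-8 else "proximal iterate")
```

In line with Lemma 1 below, the returned multiplier immediately classifies the new point as a proximal or a level iterate.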

We now state some properties of the minimizer \({{x}_{k+1}}\) in (4), or equivalently of the \(x\) part of the solution in (8).

Proposition 1

If \({\mathbb {X}}_k\ne \emptyset \) then problem (4) has the unique solution \({{x}_{k+1}}\).

In addition, if \({\fancyscript{X}}\) is polyhedral or \(\mathtt{ri } {\fancyscript{X}}\cap \{{x}\in \mathfrak {R}^n : \check{f}_k (x) \leqslant \ell _k\} \ne \emptyset \) then there exist \(s_{k+1} \in \partial \check{f}_k({{x}_{k+1}})\) and \(h_{k+1}\in N_{{\fancyscript{X}}}({{x}_{k+1}}) =\partial i_{\fancyscript{X}}({{x}_{k+1}})\), and (scalar) Lagrange multipliers \(\mu _k\geqslant 1\) and \(\lambda _k\geqslant 0\) such that

$$\begin{aligned} {{x}_{k+1}}&= {\hat{{x}}}_k-\tau _k\mu _k\hat{g}_k,\,\, \text{ with }\;\hat{g}_k = s_{k+1} + \frac{1}{\mu _k} h_{k+1},\nonumber \\&\quad \mu _k=\lambda _k +1 \hbox { and }\, \lambda _k(\check{f}_k({{x}_{k+1}})-\ell _k)=0. \end{aligned}$$
(9)

In addition, for all \({x}\in {\fancyscript{X}}\) the aggregate linearization

$$\begin{aligned} \bar{f}_{k}^a(\cdot ):=\check{f}_k({{x}_{k+1}})+ \langle \hat{g}_k,\cdot -{{x}_{k+1}}\rangle \quad \text{ satisfies }\quad \bar{f}_{k}^a({x}) \leqslant \check{f}_k({x}) \leqslant f({x}). \end{aligned}$$
(10)

Proof

The existence and uniqueness of the solution \({{x}_{k+1}}\) to (4) follow from the assumption that the problem is feasible and the fact that its objective function is strongly convex.

Next, under the stated assumptions, combining the results from [18] (specifically, Thm. 1.1.1 on p. 293, Prop. 5.3.1 and Remark 5.3.2 on p. 139, Prop. 2.2.2 on p. 308), the optimality conditions for (8) assert that there exist \(\mu _k\geqslant 0\) and \(\lambda _k\geqslant 0\) such that

$$\begin{aligned}&0\in \frac{1}{\tau _k}\left( {{x}_{k+1}}-{{\hat{{x}}}_k}\right) + \mu _k \partial \check{f}_k ({{x}_{k+1}})+N_{\fancyscript{X}}({{x}_{k+1}}),\\&0= 1-\mu _k +\lambda _k,\\&\mu _k\left( \check{f}_k({{x}_{k+1}})-r_{k+1}\right) =0,\quad \lambda _k(r_{k+1}-\ell _k) =0. \end{aligned}$$

In particular, \(\mu _k = 1+\lambda _k\ge 1\) and thus \(r_{k+1}=\check{f}_k({{x}_{k+1}})\), and there exist \(s_{k+1}\in \partial \check{f}_k ({{x}_{k+1}})\) and \(h_{k+1}\in N_{\fancyscript{X}}({{x}_{k+1}})\) such that

$$\begin{aligned} {{x}_{k+1}}= {{\hat{{x}}}_k}- \tau _k (\mu _k s_{k+1} + h_{k+1}) = {{\hat{{x}}}_k}- \tau _k\mu _k \left( s_{k+1} +\frac{1}{\mu _k} h_{k+1}\right) , \end{aligned}$$

which completes the proof of all the relations in (9).

To show (10), note that for all \(x\in {\fancyscript{X}}\) it holds that

$$\begin{aligned} \bar{f}_{k}^a (x) = \check{f}_k({{x}_{k+1}}) + \langle s_{k+1},{x}- {{x}_{k+1}}\rangle + \frac{1}{\mu _k}\langle h_{k+1},{x}- {{x}_{k+1}}\rangle \leqslant \check{f}_k({x}) \leqslant f(x), \end{aligned}$$

where the first inequality follows from the facts that \(s_{k+1}\in \partial \check{f}_k({{x}_{k+1}})\) and \(h_{k+1}\in N_{\fancyscript{X}}({{x}_{k+1}})\). \(\square \)

The next result shows that the solution \({{x}_{k+1}}\) of the doubly stabilized problem (4) solves at least one of the “singly” stabilized problems: the proximal (2) or the level (3).

Lemma 1

For \(\tau _k>0\) and \(\ell _k \in \mathfrak {R}\), let \({x}^\tau _k\in \mathfrak {R}^n\) and \({x}^\ell _k\in \mathfrak {R}^n\) be the (unique) solutions of problems (2) and (3), respectively. Let \({{x}_{k+1}}\in \mathfrak {R}^n\) be the unique solution of problem (4). Then it holds that

$$\begin{aligned} {{x}_{k+1}}=\left\{ \begin{array}{ll} {x}^\tau _k &{}\quad \mathrm{if}\;\mu _k =1 \\ {x}^\ell _k &{}\quad \mathrm{if}\;\mu _k >1, \end{array}\right. \end{aligned}$$

where \(\mu _k\) is the Lagrange multiplier defined in Proposition 1.

Proof

Let \(\mu _k=1\). Similarly to the proof of Proposition 1, writing the optimality conditions for (4) with \(\mu _k=1\) gives

$$\begin{aligned} 0\in \frac{1}{\tau _k}\left( {{x}_{k+1}}-{{\hat{{x}}}_k}\right) + \partial \check{f}_k ({{x}_{k+1}})+N_{\fancyscript{X}}({{x}_{k+1}}) , \end{aligned}$$

which shows that \({{x}_{k+1}}\) satisfies the optimality condition for (2). Since the solutions of the respective problems are unique, it holds that \({{x}_{k+1}}={x}^\tau _k\).

If \(\mu _k>1\) then \(\lambda _k >0\), and hence \(\check{f}_k({{x}_{k+1}})=\ell _k\) by (9). Clearly, the solution \({x}^\ell _k\) of (3) is also the unique solution of

$$\begin{aligned} \min \left\{ \ell _k+\frac{1}{2\tau _k}|{x}-{\hat{{x}}}_k|^2: \check{f}_k({x})\leqslant \ell _k,\quad {x}\in {\fancyscript{X}}\right\} . \end{aligned}$$
(11)

Observe that the optimal value of (11) is bounded below by the optimal value of the problem (4), due to the level constraint \(\ell _k \geqslant \check{f}_k({x})\). As the solution \({{x}_{k+1}}\) of (4) is feasible in (11) and achieves this lower bound (since \(\ell _k = \check{f}_k({{x}_{k+1}})\)), it follows that \({{x}_{k+1}}\) solves (11). Since problems (11) and (4) have unique solutions, it holds that \({{x}_{k+1}}={x}^\ell _k\). \(\square \)

According to Lemma 1, we shall call \({{x}_{k+1}}\) a proximal iterate if \(\mu _k=1\), and otherwise (\(\mu _k>1\)), we shall call it a level iterate. Similarly, an iteration \(k\) will be referred to as a proximal or a level iteration. It is thus clear that each iteration of the doubly stabilized algorithm makes either a step of the associated proximal bundle method, or of the level method. At every iteration, the algorithm makes this choice automatically.

We now define the decrease predicted by the model \(\check{f}_k\) as

$$\begin{aligned} v^{\tau }_k := f({{\hat{{x}}}_k}) -\check{f}_k ({{x}_{k+1}}) \geqslant 0, \end{aligned}$$
(12)

where the inequality follows from \({{x}_{k+1}}\) being the solution of (4) via

$$\begin{aligned} f({{\hat{{x}}}_k})\geqslant \check{f}_k ({{\hat{{x}}}_k}) \geqslant \check{f}_k ({{x}_{k+1}}) + \frac{1}{2\tau _k} \left| {{x}_{k+1}}-{{\hat{{x}}}_k}\right| ^2. \end{aligned}$$

As discussed in [11], there are alternatives other than (12) for defining the predicted decrease. We have chosen (12) because of its direct connection with the level parameter \(\ell _k\), established in (15) below.

Once the iterate \({{x}_{k+1}}\) is computed, the oracle provides the new function value \(f({{x}_{k+1}})\). As is usual in bundle methods, we change the stability center when the new iterate gives sufficient descent relative to the predicted one, namely, when

$$\begin{aligned} f({{x}_{k+1}}) \leqslant f({{\hat{{x}}}_k})-m_{f}v^{\tau }_k, \end{aligned}$$
(13)

where \(m_{f}\in (0,1)\). Accordingly, each iteration results either

  • in a descent step when (13) holds, in which case \({{\hat{{x}}}_k}\) is moved to \({{x}_{k+1}}\); or

  • in a null step when (13) does not hold, and the stability center is maintained.
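
A minimal sketch of this decision, in plain Python (the names are illustrative; \(v^{\tau }_k\) is assumed to have been computed from (12) and the oracle to have returned \(f({{x}_{k+1}})\)):

```python
# Descent test (13) and the corresponding step, sketched in plain Python.
def take_step(f_hat, f_new, v_tau, m_f=0.1):
    """Return True for a descent step, False for a null step (m_f in (0,1))."""
    return f_new <= f_hat - m_f * v_tau

# inside one iteration (illustrative):
# if take_step(f(x_hat), f(x_next), v_tau):
#     x_hat = x_next          # descent step: move the stability center
# else:
#     pass                    # null step: keep x_hat and enrich the model
```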

We next provide useful connections between the predicted decrease \(v^{\ell }_k = f({{\hat{{x}}}_k}) -\ell _k\) related to the level parameter \(\ell _k\), the predicted decrease \(v^{\tau }_k =f({{\hat{{x}}}_k})-\check{f}_k ({{x}_{k+1}})\) related to the solution of (4) and thus to the proximal parameter \(\tau _k\), and the aggregate linearization error given by

$$\begin{aligned} \hat{e}_k := f({{\hat{{x}}}_k}) -\bar{f}_k^a ({{\hat{{x}}}_k}). \end{aligned}$$
(14)

We also establish a key relation that will be the basis for the subsequent convergence analysis.

Proposition 2

It holds that

$$\begin{aligned} \hat{e}_k \geqslant 0,\quad \hat{e}_k+\tau _k\mu _k \left| \hat{g}_k \right| ^2=v^{\tau }_k\geqslant f({\hat{{x}}}_k)-\ell _k=v^{\ell }_k, \end{aligned}$$
(15)

where \(\mu _k\) is the Lagrange multiplier defined in Proposition 1. Moreover, if \(\mu _k >1\) then \(v^{\tau }_k =v^{\ell }_k\).

Furthermore, for all \({x}\in {\fancyscript{X}}\) it holds that

$$\begin{aligned} f({{\hat{{x}}}_k}) +\left\langle \hat{g}_k,{x}-{{\hat{{x}}}_k}\right\rangle - \hat{e}_k \leqslant f(x). \end{aligned}$$
(16)

(In other words, \(\hat{g}_k\) is an \(\hat{e}_k\)-subgradient of the essential objective \((f+i_{\fancyscript{X}})\) at \({{\hat{{x}}}_k}\).)

Proof

The fact that \(\hat{e}_k\geqslant 0\) follows directly from (10). To show (15), note that

$$\begin{aligned} \begin{array}{lll} \hat{e}_k &{}= f({{\hat{{x}}}_k}) -\bar{f}_k^a ({{\hat{{x}}}_k})\\ &{}= f({{\hat{{x}}}_k}) -\left( \check{f}_k({{x}_{k+1}})+\left\langle \hat{g}_k,{{\hat{{x}}}_k}-{{x}_{k+1}}\right\rangle \right) \\ &{}= v^{\tau }_k - \left\langle \hat{g}_k,{{\hat{{x}}}_k}-{{x}_{k+1}}\right\rangle \\ &{}= v^{\tau }_k - \tau _k\mu _k|\hat{g}_k|^2, \end{array} \end{aligned}$$

where the last equality follows from (9). In addition, since \({{x}_{k+1}}\) is feasible in (4), we have that \(\check{f}_k({{x}_{k+1}})\leqslant \ell _k=f({{\hat{{x}}}_k}) - v^{\ell }_k\), which implies \(v^{\ell }_k\leqslant v^{\tau }_k\). This completes the proof of (15). (Recall also that if \(\mu _k >1\) then \(\lambda _k >0\), in which case (9) implies \(\check{f}_k ({{x}_{k+1}}) =\ell _k\), so that \(v^{\ell }_k =v^{\tau }_k\)).

The relation (16) follows from the fact that \(\hat{g}_k\) is an \(\hat{e}_k\)-subgradient of the essential objective at \({{\hat{{x}}}_k}\), which is verified as follows. Using again (10), for all \(x\in {\fancyscript{X}}\) it holds that

$$\begin{aligned} f(x)&\geqslant \bar{f}_k^a(x) \\&= \check{f}_k ({{x}_{k+1}}) +\langle \hat{g}_k, {x}-{{x}_{k+1}}\rangle \\&= f({{\hat{{x}}}_k}) -\left( f({{\hat{{x}}}_k})-\check{f}_k ({{x}_{k+1}})\right) +\langle \hat{g}_k, {x}-{{\hat{{x}}}_k}\rangle +\langle \hat{g}_k, {{\hat{{x}}}_k}-{{x}_{k+1}}\rangle \\&= f({{\hat{{x}}}_k}) -v^{\tau }_k +\langle \hat{g}_k, {x}-{{\hat{{x}}}_k}\rangle +\tau _k\mu _k |\hat{g}_k|^2 , \end{aligned}$$

and (16) follows taking into account (15). \(\square \)

The relation (16) motivates one of the alternative stopping tests for our algorithm, which is in the spirit of standard bundle methods: stop the algorithm when both \(|\hat{g}_k|\) and \(\hat{e}_k\) are small enough, i.e., an approximate optimality condition holds.

We now state the algorithm in full detail, and then comment on some of its ingredients.

Algorithm 1 (Doubly stabilized bundle method)

Some comments are in order.

  1. (a)

    Observe that the lower bound \(f^{{{\mathrm{low}}}}_k\) is updated either when the level set \({\mathbb {X}}_k\) is empty in Step 2.1, or in Step 5. In the second case, it is explicit that \(f^{{{\mathrm{low}}}}_k \leqslant f^{{{\mathrm{inf}}}}\). In the first case, \({\mathbb {X}}_k =\emptyset \) means that \(\ell _k < \check{f}_k({x}) \leqslant f({x})\) for all \({x}\in {\fancyscript{X}}\); since the update sets \(f^{{{\mathrm{low}}}}_k \leftarrow \ell _k\), it again holds that \(f^{{{\mathrm{low}}}}_k \leqslant f^{{{\mathrm{inf}}}}\). Therefore, \(f^{{{\mathrm{low}}}}_k \leqslant f^{{{\mathrm{inf}}}}\) for all \(k\), and if the algorithm stops at Step 1, we have that

    $$\begin{aligned} \mathtt{Tol }_{\varDelta }\geqslant f\left( {{\hat{{x}}}_k}\right) -f^{{{\mathrm{low}}}}_k\geqslant f\left( {{\hat{{x}}}_k}\right) -f^{{{\mathrm{inf}}}}, \end{aligned}$$

    i.e., \({{\hat{{x}}}_k}\) is a \(\mathtt{Tol }_{\varDelta }\)-approximate solution to problem (1).

    Note that when the level set \({\mathbb {X}}_k\) is empty, the update rules in the pass through Step 2.1 and back through Step 1 decrease the optimality gap \(\varDelta _k\) by a factor of \((1-m_{\ell })\).

    A simple update of the lower bound in Step 5 is \(f^{{{\mathrm{low}}}}_{k+1}\leftarrow f^{{{\mathrm{low}}}}_k\).

  2. (b)

    To identify whether the level set is empty, the most natural approach is probably to proceed as usual, solving (4) and letting the solver return an infeasibility flag. Note that this is not a wasteful computation, as it leads to adjusting the level parameter as well as improving the lower bound \(f^{{{\mathrm{low}}}}_k\). Alternatively, to detect infeasibility we can solve the linear program (if \({\fancyscript{X}}\) is a polyhedron)

    $$\begin{aligned} \min \;s\;\text{ s.t. }\;\bar{f}_j({x})- s\leqslant \ell _k\;\forall j \in {{\fancyscript{B}}_{k}},\quad {x}\in {\fancyscript{X}},\quad s\geqslant 0. \end{aligned}$$

    If its optimal value is positive then \({\mathbb {X}}_k =\emptyset \).

  3. (c)

    If one prefers to avoid infeasible level sets \({\mathbb {X}}_k\), then when \({\fancyscript{X}}\) is bounded or \(f^{{{\mathrm{low}}}}_{k}\) is finite, it is enough to update \(f^{{{\mathrm{low}}}}_{k}\) in Step 5 by solving the linear program:

    $$\begin{aligned} \text{ set }\, f^{{{\mathrm{low}}}}_{k+1} \leftarrow \min \;r \quad \text{ s.t. }\;\bar{f}_j({x})\leqslant r\;\forall j \in {{\fancyscript{B}}_{k}},\quad f^{{{\mathrm{low}}}}_k\leqslant r,\quad {x}\in {\fancyscript{X}},\quad r\in \mathfrak {R}. \end{aligned}$$
    (17)

    This strategy is especially effective when solving an LP is not too expensive relative to the other tasks of the algorithm (in particular, the oracle computations).

  4. (d)

    If \({\fancyscript{X}}\) is unbounded, the level set \({\mathbb {X}}_k\) can be nonempty for all \(k\), and \(f^{{{\mathrm{low}}}}_k\) will never be updated (for example, for problem (1) with \(f(x)=e^{-x}\) and \({\fancyscript{X}}=[0,+\infty )\)). In that case, the algorithm will not stop at Step 1, unless the initial lower bound \(f^{{{\mathrm{low}}}}_1\) is within the \(\mathtt{Tol }_{\varDelta }\)-tolerance of \(f^{{{\mathrm{inf}}}}\).

  5. (e)

    Step 5 increases the proximal parameter \(\tau _k\) only after descent steps resulting from level iterations (\(\mu _k>1\)). On the other hand, \(\tau _k\) can be decreased only after null steps. A simple rule used in the numerical experiments of Sect. 5 is

    $$\begin{aligned} \tau _{k+1}\leftarrow \max \left\{ \tau _{\min },\,\tau _k{v^{\ell }_k}/{v^{\tau }_k}\right\} , \end{aligned}$$

    which decreases the proximal parameter only after null steps resulting from proximal iterations (\(v^{\tau }_k>v^{\ell }_k\) is only possible when \(\mu _k=1\), see Proposition 2). In this manner, the level parameter \(\ell _k\) and the multiplier \(\mu _k\) indicate how to update the proximal parameter \(\tau _k\). This is precisely the novel strategy for managing the proximal parameter proposed in this work; a small sketch of this update appears after these comments.

  6. (f)

    If at Step 2 (for all \(k\)) the rule \(\ell _k=f({{\hat{{x}}}_k})-v^{\ell }_k\) is replaced by \(\ell _k=+\infty \), Algorithm 1 becomes a proximal bundle algorithm (all iterations are proximal iterations).

  7. (g)

    The QP formulation of subproblem (8) is given by

    $$\begin{aligned} \displaystyle \min _{({x},r) \in \mathfrak {R}^{n+1}} \left\{ r + \displaystyle \frac{1}{2\tau _k}\left| {x}-{{\hat{{x}}}_k}\right| ^2 : \bar{f}_j({x})\leqslant r \;\forall j \in {{\fancyscript{B}}_{k}},\quad r \leqslant \ell _k,\; {x}\in {\fancyscript{X}}\right\} . \end{aligned}$$

    It can be seen that (if \({\fancyscript{X}}=\mathfrak {R}^n\)) its dual has as many variables as there are cutting planes in the bundle.

    To keep the size of this QP (or of its dual) manageable, the number of elements in the bundle (the cardinality of the set \({{\fancyscript{B}}_{k}}\)) should be kept bounded, without impairing convergence. For this, the usual aggregation techniques of proximal bundle methods can be employed here. After a descent step, the only requirement is that the model should be below the objective function (which means that elements of the bundle can be deleted arbitrarily); this is reflected in Step 5.1 of Algorithm 1. During a sequence of consecutive null steps, the model \(\check{f}_k\) can be composed of as few as two cutting planes, corresponding to the new linearization \(\bar{f}_{k+1}\) and the aggregate linearization \(\bar{f}_k^a\) (or any number of cutting planes, as long as these two are included). This is reflected in the choice of the model specified in Step 5.2 of Algorithm 1. If the next model contains all the linearizations for which the constraint \(\bar{f}_j({x})\leqslant r\) of the above QP is active at its solution \(({{x}_{k+1}},r_{k+1})\), then there is no need to include the aggregate linearization \(\bar{f}_k^a\) (see also the sketch below).
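
The following sketch collects the bookkeeping mentioned in comments (e) and (g): the \(\tau \)-update used in the numerical experiments, the aggregate linearization \(\bar{f}_k^a\) written as an affine function via (9)–(10), and a simple pruning that always keeps the newest and the aggregate cuts. This is a hedged illustration; all names are ours and \({\fancyscript{X}}=\mathfrak {R}^n\) is assumed.

```python
import numpy as np

def update_tau(tau, v_level, v_tau, descent_step, tau_min=1e-6):
    # comment (e): decrease only after null steps; v_tau > v_level can only
    # happen on proximal iterations (Proposition 2), so level null steps keep tau
    if not descent_step:
        return max(tau_min, tau * v_level / v_tau)
    return tau   # a possible increase after descent level iterations is omitted here

def aggregate_cut(x_next, x_hat, tau, mu, fcheck_next):
    # (9): g_hat = (x_hat - x_next) / (tau * mu);  (10): bar f^a_k(x) = b_a + <g_hat, x>
    g_hat = (x_hat - x_next) / (tau * mu)
    b_a = fcheck_next - g_hat @ x_next
    return g_hat, b_a

def prune_bundle(A, b, a_new, b_new, a_agg, b_agg, max_cuts=30):
    # comment (g): keep at most max_cuts linearizations, always including the
    # newest linearization and the aggregate one
    A = np.vstack([A, a_new, a_agg])
    b = np.concatenate([b, [b_new, b_agg]])
    return A[-max_cuts:], b[-max_cuts:]
```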

3 Convergence analysis

Convergence analysis of the doubly stabilized bundle method has to account for all the possible combinations of level and proximal steps, whether null or descent, and the possibility of empty level sets. To that end, we consider the following three possible cases:

  • The level sets \({\mathbb {X}}_k\) are empty infinitely many times;

  • The above does not happen, and infinitely many descent steps are generated;

  • In the same situation, finitely many descent steps are generated.

In what follows, we assume that \(\mathtt{Tol }_{\varDelta }=\mathtt{Tol }_{e}=\mathtt{Tol }_{g}=0\) and that Algorithm 1 does not stop. (If the algorithm stops with zero tolerance in Step 1, then the last descent iterate is, by comment (a) above, a solution to the problem. The same conclusion holds, by (16), if the method stops with zero tolerances in Step 3.) As a by-product of our convergence analysis, it also follows that if the stopping rule parameters are positive, then the method terminates in a finite number of iterations with an appropriate approximate solution.

Lemma 2

Suppose the level set \({\mathbb {X}}_k\) is empty infinitely many times.

Then \(\varDelta _k\rightarrow 0\), \(\{f({{\hat{{x}}}_k})\}\rightarrow f^{{{\mathrm{inf}}}}\), and every cluster point of the sequence \(\{{{\hat{{x}}}_k}\}\) (if any exists) is a solution to problem (1); or the last \({{\hat{{x}}}_k}\) is a solution if this sequence is finite.

Proof

It follows by Step 2 that for all \(k\) after the first \({\mathbb {X}}_k =\emptyset \) is encountered, we have \(f^{{{\mathrm{low}}}}_k >-\infty \) and thus \(\varDelta _k <+\infty \). Also, by Steps 2 and 5, \(v^{\ell }_k\leqslant (1-m_{\ell })\varDelta _k\). Thus,

$$\begin{aligned} f({{\hat{{x}}}_k})-\ell _k=f({{\hat{{x}}}_k})-\left( f({{\hat{{x}}}_k})-v^{\ell }_k\right) =v^{\ell }_k\leqslant (1-m_{\ell })\varDelta _k, \end{aligned}$$

which shows that if \({\mathbb {X}}_k =\emptyset \) at iteration \(k\), then the update \(f^{{{\mathrm{low}}}}_k\leftarrow \ell _k\) decreases the optimality gap \(\varDelta _k\) by a factor of at least \((1-m_{\ell })\). Hence, if this happens infinitely many times, we have that \(\varDelta _k \rightarrow 0\). Moreover, as no level set can be empty if \(f^{{{\mathrm{inf}}}}=-\infty \), in the case under consideration \(f^{{{\mathrm{inf}}}}>-\infty \). We can then write \(\varDelta _k = f({{\hat{{x}}}_k})-f^{{{\mathrm{low}}}}_k \geqslant f({{\hat{{x}}}_k}) - f^{{{\mathrm{inf}}}}\), which implies the assertion as \(\varDelta _k\rightarrow 0\). \(\square \)

From now on, we consider the case when \({\mathbb {X}}_k\ne \emptyset \) for all \(k\) large enough. Clearly, without loss of generality, we can simply assume that \({\mathbb {X}}_k\ne \emptyset \) for all \(k\).

Analysis in the case of infinitely many descent steps essentially follows the theory for proximal bundle methods; in particular the argument in [6] can be readily applied.

Lemma 3

Suppose Algorithm 1 generates infinitely many descent steps.

Then \(\{f({{\hat{{x}}}_k})\}\rightarrow f^{{{\mathrm{inf}}}}\) and every cluster point of the sequence \(\{{{\hat{{x}}}_k}\}\) (if any exist) is a solution to problem (1).

In addition, if the solution set of (1) is nonempty and the sequence \(\{\tau _k\mu _k\}\) is bounded above (for example, this is the case when there are finitely many level iterations) then the sequence \(\{{{\hat{{x}}}_k}\}\) converges to a solution of (1).

Proof

Let \(\{{\hat{{x}}}_{k(j)}\}\) be the subsequence of \(\{{{\hat{{x}}}_k}\}\) such that \(k(j)\) corresponds to the \(j\)th descent step. Define \(i(j)=k(j+1)-1\). Recalling (13), (15) and Proposition 2, we then have an iterative sequence satisfying, for all \(j\geqslant 1\), the relations

$$\begin{aligned}&{\hat{{x}}}_{k(j+1)} = {\hat{{x}}}_{k(j)} - \tau _{i(j)}\mu _{i(j)} \hat{g}_{i(j)}, \quad \hat{g}_{i(j)}\in \partial _{\hat{e}_{i(j)}} (f+i_{\fancyscript{X}}) \left( {\hat{{x}}}_{k(j)}\right) , \quad \tau _{i(j)}\mu _{i(j)} \ge \tau _{\min },\\&\quad f\left( {\hat{{x}}}_{k(j)}\right) -f\left( {\hat{{x}}}_{k(j+1)}\right) \geqslant m_{f}\left( \hat{e}_{i(j)} + \tau _{i(j)}\mu _{i(j)} \left| \hat{g}_{i(j)}\right| ^2\right) . \end{aligned}$$

We are thus in the setting of the \(\epsilon \)-subgradient method with an additional descent condition along the iterations. The announced convergence properties follow from [6].

For the last assertion, recall that \(\tau _k\) can increase only on descent steps resulting from level iterations (in the case of \(\mu _k >1\)). Thus, if the number of such iterations is finite, the sequence \(\{\mu _k\tau _k\}\) is bounded above. Then, [6, Prop. 2.2] with \(t_k\) therein replaced by \(\mu _k\tau _k\) can be invoked to obtain the stated assertion. \(\square \)

Now we consider the last case, when \({{\hat{{x}}}_k}\) is eventually fixed and the last descent step is followed by an infinite number of null steps (note also that in this case the level sets \({\mathbb {X}}_k\) are nonempty).

Lemma 4

Suppose there exists an index \(k_1\geqslant 1\) such that the descent test (13) is not satisfied for all \(k \geqslant k_1\).

Then there is an infinite number of level iterations, and the last descent iterate \({\hat{{x}}}_{k_1}\) is a solution to problem (1).

Proof

Note that the sequence \(\{v^{\ell }_k\}\) is nonincreasing. Let \(K\) be the set of indices \(k\) such that \(\mu _k >1\) (level iterations), so that, according to Step 5.2 of Algorithm 1, \(v^{\ell }_{k+1}=m_{\ell }v^{\ell }_k\) for \(k\in K\). We then have that the values in \(\{v^{\ell }_k\}\) decrease only at indices in \(K\) and do not change otherwise.

Suppose first that \(K\) is a finite set. Then, by Proposition 2, there exists an index \(k_2\ge k_1\) such that \(\mu _k=1\), \(\lambda _k =0\) and \(v^{\ell }_k =v^{\ell }_{k_2}> 0\) for all \(k\ge k_2\). Thus, by (15),

$$\begin{aligned} v^{\tau }_k\ge v^{\ell }_{k_2} >0 \quad \text{ for } \text{ all }\quad k\ge k_2. \end{aligned}$$
(18)

Moreover, by Lemma 1, all such iterations are proximal iterations. Hence, all iterations of Algorithm 1 indexed by \(k\ge k_2\) can be considered as those of the classical proximal bundle method applied to the same problem. It then follows from [18, Chap. XV, Thm. 3.2.4] that \(v^{\tau }_k \rightarrow 0\), in contradiction with (18).

Hence, \(K\) must contain infinitely many indices. But then the values of \(v^{\ell }_k\) are reduced by a factor of \(m_{\ell }\) infinitely many times, so that \(\{v^{\ell }_k\}\rightarrow 0\) as \(k\rightarrow \infty \). Since for \(k\in K\) it holds that \(v^{\tau }_k =v^{\ell }_k\) (cf. Proposition 2), we conclude that \(\{v^{\tau }_k\}\rightarrow 0\) as \(K\ni k\rightarrow \infty \). As \(\tau _k\geqslant \tau _{\min }>0\) and \(\mu _k\geqslant 1\), it follows from (15) that

$$\begin{aligned} \hat{e}_k\rightarrow 0\;\text{ and }\;|\hat{g}_k| \rightarrow 0 \quad \text{ as }\;K\ni k\rightarrow \infty . \end{aligned}$$
(19)

As \(\hat{g}_k\) is an \(\hat{e}_k\)-subgradient of the essential objective \((f+i_{\fancyscript{X}})\) at \({\hat{{x}}}_{k_1}\), (19) implies that \({\hat{{x}}}_{k_1}\) is a solution to (1). This completes the proof. \(\square \)

Summarizing Lemmas 2–4, we state the following convergence properties of Algorithm 1.

Theorem 1

If for the sequence generated by Algorithm 1 it holds that \({{\hat{{x}}}_k}={\hat{{x}}}_{k_1}\) for all \(k\ge k_1\), then \({\hat{{x}}}_{k_1}\) is a solution to (1). Otherwise, \(\{f({{\hat{{x}}}_k})\}\rightarrow f^{{{\mathrm{inf}}}}\) as \(k\rightarrow \infty \), and every cluster point of \(\{{{\hat{{x}}}_k}\}\) (if any exist) is a solution to problem (1). In addition, if the solution set of (1) is nonempty, and an infinite number of descent steps is generated among which the number of level iterations is finite, then the sequence \(\{{{\hat{{x}}}_k}\}\) converges to a solution of (1).

An interesting question is whether the level bundle methods’ lower worst-case complexity (when compared to the proximal versions) extends to the doubly stabilized algorithm. At this time, we conjecture this is probably not the case, as there does not seem to be a way to estimate the number of proximal iterations between level iterations.

We finish this section by considering a more general strategy of managing the level parameter, which we found useful in our numerical experiments. Note that Step 5.2 of Algorithm 1 reduces the predicted decrease \(v^{\ell }_k\) by a factor of \(m_{\ell }\) on null level iterations (\(\mu _k >1\)), and keeps it unchanged on null proximal ones. Decreasing \(v^{\ell }_k\) implies increasing the level parameter \(\ell _k\) (Step 2 in Algorithm 1). The idea is that it may be sometimes useful to keep \(\ell _k\) fixed for some null level iterations, because this can lead to infeasible level sets which, in turn, leads to updating the lower bound \(f^{{{\mathrm{low}}}}_k\) thus decreasing the optimality gap \(\varDelta _k\). The idea itself can be implemented in a number of different ways. For example, by decreasing \(v^{\ell }_k\) after some fixed number of consecutive null steps. Note, however, that the argument in Lemma 4 would not apply (because not all null level iterations reduce \(v^{\ell }_k\), which is an important ingredient in the proof). Thus the implementation should be such that convergence can still be justified by other tools.

3.1 Managing the level parameter

Consider an additional parameter \(\mu _{\max }\ge 1\) as input for the algorithm, and replace the update rule for \(v^{\ell }_k\) in Step 5.2 of Algorithm 1 by the following:

$$\begin{aligned} \text{ If }\,\mu _k>\mu _{\max },\,\hbox {set}\,v^{\ell }_{k+1}\leftarrow m_{\ell }v^{\ell }_k; \quad \hbox {otherwise}\quad \hbox {set}\,v^{\ell }_{k+1}\leftarrow v^{\ell }_k. \end{aligned}$$
(20)

Note that \(\mu _{\max }=1\) recovers the original formulation of Algorithm 1. The parameter \(v^{\ell }_k\) remains fixed for null level iterations that result in a multiplier \(\mu _k\) not large enough; when it is sufficiently large, \(v^{\ell }_k\) is decreased and the level parameter \(\ell _k\) is increased. The motivation for keeping \(v^{\ell }_k\) fixed on some iterations is outlined above. The reason for updating \(v^{\ell }_k\) when \(\mu _k > \mu _{\max } >1\) has to do with using [5, Thm. 3.7] to show convergence in the corresponding case. Additionally, an intuition as to why it is reasonable that the update of \(v^{\ell }_k\) depends on \(\mu _k\) can be derived from Lemma 7 below. The arguments in the proof of Lemma 7 (it is not important that it considers the more general case with inexact data) show that if \(v^{\ell }_k\) is fixed over a sequence of null steps then \(\mu _k\) is increasing (tends to \(+\infty \) if the sequence is continued infinitely). Thus, if \(\mu _{\max }\) is large enough, the rule (20) is likely to keep \(v^{\ell }_k\) fixed, but only for some iterations so that the parameter is eventually updated.

As the modified rule (20) plays a role only on null steps, to verify convergence of this version of the algorithm we only have to consider the case when all the level sets are nonempty and there is a finite number of descent steps, i.e., all iterations from some point on are null steps. Apart from the condition \(\mu _{\max } >1\), we need the following stronger (but not at all restrictive from the practical viewpoint) condition on managing the bundle during null steps. Let \(p(k)\) be the last proximal iteration performed up to iteration \(k\). Choose \(\check{f}_{k+1}\) to satisfy

$$\begin{aligned} \max \left\{ \bar{f}_{k+1}(\cdot ), \bar{f}_k^a (\cdot ), \bar{f}_{p(k)+1}(\cdot ), \bar{f}_{p(k)}^a (\cdot )\right\} \leqslant \check{f}_{k+1}(\cdot )\leqslant f(\cdot ). \end{aligned}$$
(21)

In particular, if \(k\) is a null proximal iteration, then \(p(k)=k\) and the above rule is the same as for usual proximal bundle methods [6, 13]. However, (21) differs from standard rules in the case of null level steps: during null level iterations information about the last proximal iteration is kept in the bundle.

If there are infinitely many null proximal iterations, the algorithm can be interpreted as a proximal bundle method in the case of a finite number of descent steps followed by null steps, with level iterates seen as merely enriching the cutting-plane model. In particular, the key conditions (4.7)–(4.9) in [6] are satisfied. Convergence then follows from [18, Chap. XV, Thm. 3.2.4]; see also [6, 11].

On the other hand, if there are only finitely many proximal iterations, the algorithm becomes essentially a level bundle method in the case of a finite number of descent steps followed by null steps. In this case, [5, Thm. 3.7] provides the assertion on convergence [we note that for this it is important that \(\mu _{\max } >1\), because \(\lambda _k\) in [5] is required to be bounded by some \(\lambda _{\max }>0\), and we have \(\mu _k=\lambda _k +1\) in (9)].

4 Handling inexact data

In various real-world applications, the objective function and/or its subgradient can be too costly (sometimes impossible) to compute. This is particularly true when \(f\) is itself given by some optimization problem, e.g., \(f({x})=\max _{u\in U} \varphi (u,{x})\), as in the numerical experiments of Sect. 5.2. In such situations, approximate values must be used.

Various inexact bundle methods that use approximate function and subgradient evaluations have been studied in [10, 11, 17, 21, 26]. The natural setting is to assume that, given any \(x\in \mathfrak {R}^n\), the oracle provides some approximate values \(f_x\in \mathfrak {R}\) and \(g_x\in \mathfrak {R}^n\) of the objective function and its subgradient, respectively, such that

$$\begin{aligned} \left\{ \begin{array}{l} f_{x}=f({x}) - \eta _x \quad \text{ and }\\ f(\cdot )\geqslant f_{x}+ \langle g_{x},\cdot -x\rangle - \eta ^{g}_x, \end{array} \right. \end{aligned}$$
(22)

where \(\eta _x \in \mathfrak {R}\) and \(\eta ^{g}_x \geqslant 0\) are some unknown but uniformly bounded errors. Specifically, there exist \(\eta \geqslant 0\) and \(\eta ^{g}\geqslant 0\) such that

$$\begin{aligned} |\eta _x|\leqslant \eta \quad \text{ and }\quad \eta ^{g}_x \leqslant \eta ^{g}\quad \text{ for } \text{ all }\;x\in {\fancyscript{X}}. \end{aligned}$$
(23)
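
As an illustration only (the construction and all names are ours), an oracle of the type (22)–(23) arises already when a finite-max function \(f({x})=\max _j\,(b_j+\langle a_j,{x}\rangle )\) is evaluated by inspecting only a subset of its pieces: the returned value then underestimates \(f({x})\) by some \(\eta _x\geqslant 0\) (uniformly bounded whenever \({\fancyscript{X}}\) is bounded), while the returned vector is an exact subgradient of the selected piece, so that \(\eta ^{g}_x=0\).

```python
# Hedged sketch of an inexact oracle satisfying (22)-(23); data are illustrative.
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 10))   # subgradients a_j of the pieces
b = rng.standard_normal(50)         # intercepts b_j

def inexact_oracle(x, n_pieces=20):
    idx = rng.choice(len(b), size=n_pieces, replace=False)  # inspect a subset only
    vals = A[idx] @ x + b[idx]
    j = idx[np.argmax(vals)]
    f_x = float(vals.max())   # f_x = f(x) - eta_x with eta_x >= 0
    g_x = A[j]                # exact subgradient of piece j, so for all y:
    return f_x, g_x           # f(y) >= f_x + <g_x, y - x>, i.e. eta^g_x = 0
```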

Remark 1

Assumptions (22) and (23) are also present in [21] and [27]. They are weaker than the assumptions employed by the level bundle methods given in [9], which require \(\eta ^{g}=0\), and further require the bound \(\eta \) to be known, controllable, and asymptotically vanishing in a suitable sense. Thus, the ingredients of our analysis concerning level iterations are certainly applicable to the setting of [9], and lead to new results under the weaker oracle assumptions. On the other hand, using the stronger assumptions, [9] is able to compute exact solutions, rather than inexact ones as in our case.

In [27], nonlinearly constrained problems are considered, which require the use of non-static merit functions (specifically, of improvement functions as in [25]). Thus, even considering level iterations only, [27] is very different from our case. Also, [27] requires boundedness of the feasible set \({\fancyscript{X}}\) for convergence analysis, and in fact for convergence itself (there are examples which show that the method therein can fail for unbounded \({\fancyscript{X}}\)).

With given oracle information, the inexact linearization of \(f\) at iteration \(k\) is defined accordingly by

$$\begin{aligned} \bar{f}_k({x}):=f_{{x}_k} + \left\langle g_{{x}_k},{x}-{x}_k\right\rangle \quad (\leqslant f({x}) + \eta ^{g}), \end{aligned}$$

and the inexact model \(\check{f}_k\) is then defined as in (6). However, because of the inexactness, we now have the weaker property \(\check{f}_k(\cdot ) \leqslant f(\cdot ) + \eta ^{g}\). The predicted decrease must now employ only the available (inexact) information; the counterpart of (12) is thus given by

$$\begin{aligned} v^{\tau }_k :=f_{{{\hat{{x}}}_k}}-\check{f}_k({{x}_{k+1}}), \end{aligned}$$

and the level parameter is

$$\begin{aligned} \ell _k := f_{{{\hat{{x}}}_k}}-v^{\ell }_k \quad \hbox {for}\,\hbox {a}\,\hbox {given}\; v^{\ell }_k>0. \end{aligned}$$

Solving the doubly stabilized bundle subproblem (4) for the inexact model \(\check{f}_k\), the direction of change \(\hat{g}_k\) and the aggregate linearization \(\bar{f}_k^a\) are defined exactly as before, i.e., by (9) and (10), respectively. The aggregate linearization error is now given by

$$\begin{aligned} \hat{e}_k:=f_{{{\hat{{x}}}_k}} -\bar{f}_k^a({{\hat{{x}}}_k}). \end{aligned}$$

The first observation is that, unlike in the exact case [recall (15)], the aggregate linearization error \(\hat{e}_k\) can be negative due to inaccuracy in the data. However, given the oracle assumptions (22), the following lower bound holds:

$$\begin{aligned} \hat{e}_k\geqslant f({{\hat{{x}}}_k})-\eta -\bar{f}_k^a({{\hat{{x}}}_k})\geqslant f({{\hat{{x}}}_k})-\eta -\left( f({{\hat{{x}}}_k}) + \eta ^{g}\right) = -(\eta +\eta ^{g}). \end{aligned}$$
(24)

Most inexact proximal bundle methods work as follows. In the proximal setting the predicted decrease has the form \(v^{\tau }_k=\hat{e}_k + \tau _k|\hat{g}_k|^2\) (recall Proposition 2, where the proximal method corresponds to \(\mu _k=1\)). Then \(v^{\tau }_k<0\) means that \(\hat{e}_k\) is too negative (the oracle error is excessive). In such a case, the descent test

$$\begin{aligned} f_{{{x}_{k+1}}}\leqslant f_{{{\hat{{x}}}_k}}- m_f v^{\tau }_k, \end{aligned}$$
(25)

mimicking (13), is not meaningful. The methods in [10, 11, 21] deal with this situation using the following simple idea. To make \(v^{\tau }_k\) positive (when \(\hat{g}_k \ne 0\)), the strategy is then to increase the proximal parameter \(\tau _k\) and solve again the QP with the same model \(\check{f}_k\) to get another candidate \({{x}_{k+1}}\). This procedure, called noise attenuation [21], ensures that:

  1. (i)

    the predicted decrease \(v^{\tau }_k\) is always nonnegative before testing for descent;

  2. (ii)

    if the noise is persistently excessive (an infinite number of noise attenuation steps is required) then the associated parameter is driven to infinity, which ensures in turn that \(\hat{g}_k\) tends to zero.

With exact oracles, the predicted decrease \(v^{\tau }_k\) can be seen as an optimality measure: if the proximal parameter \(\tau _k>0\) is bounded away from zero, (15) ensures that

$$\begin{aligned} v^{\tau }_k=0\; \Longleftrightarrow \; \hat{e}_k =0\quad \text{ and }\quad \hat{g}_k=0. \end{aligned}$$

The above is no longer true for inexact oracles. For the proximal version (corresponding to \(\mu _k =1\) above), one has the following (much weaker) relation:

$$\begin{aligned} v^{\tau }_k\le 0 \;\Longrightarrow \; \tau _k|\hat{g}_k|^2 \le -\hat{e}_k \quad \left( \le \eta +\eta ^{g}\right) . \end{aligned}$$

It then holds that

$$\begin{aligned} |\hat{g}_k|^2 \le \frac{(\eta +\eta ^{g})}{ \tau _k}. \end{aligned}$$

And this is where the property (ii) above comes into play. To ensure that \(\hat{g}_k\) goes to zero in the case of excessive oracle errors, [21] drives \(\tau _k\) to infinity. In principle, a similar strategy can be implemented in Algorithm 1. However, this clearly has some disadvantages. To start with, the QP has to be solved again with the same model \(\check{f}_k\) and a (sharply) increased prox-parameter, to obtain another candidate \({{x}_{k+1}}\). And this may need to be done more than once consecutively. Also, it may eventually turn out that this increase of the prox-parameter is harmful, or at least unnecessary in some sense (note that there are only heuristic rules for this update).

It turns out that the doubly stabilized method does not require such procedures to ensure that \(\hat{g}_k\) always tends to zero. Instead of “manually” increasing \(\tau _k\), the algorithm controls the steps automatically and properly via the multipliers \(\mu _k\) (as is revealed by the analysis in Lemma 7 below). This is an interesting, and clearly desirable, property. Another interesting feature of the doubly stabilized method is that the predicted decrease \(v^{\tau }_k\) is always positive, i.e., property (i) above holds true. To see the latter, first note that if \(v^{\ell }_k\) becomes nonpositive at some iteration \(k\) due to the updates in Steps 2 and 5 of Algorithm 1, then so does the inexact optimality gap \(\varDelta _k\) in Step 1, and the algorithm stops immediately (and it can be seen that an appropriate approximate solution is obtained). We can thus consider that \(v^{\ell }_k >0\) for all \(k\). Then the same argument as that in Proposition 2 shows that

$$\begin{aligned} v^{\tau }_k=\hat{e}_k + \tau _k\mu _k|\hat{g}_k|^2 \geqslant f_{{{\hat{{x}}}_k}}-\ell _k=v^{\ell }_k >0 \quad \forall \,k. \end{aligned}$$
(26)

Therefore, a descent test like (25) is always meaningful, unlike for the proximal bundle approach with inexact data. In conclusion, our doubly stabilized method does not require the noise attenuation modification to handle inexact data: the property (i) is automatic, while the assertion of (ii) is obtained as a consequence of the algorithm’s behavior (the iterates it generates) rather than driving some parameter to extreme values by “brute-force”.

In what follows, we consider Algorithm 1 with the change of notation that \(\check{f}_k\) now refers to the inexact model with the data satisfying (22) and (23). Accordingly, \(f({{\hat{{x}}}_k})\) in Algorithm 1 is replaced by \(f_{{\hat{{x}}}_k}\), etc. The quantities \(v^{\tau }_k\), \(\ell _k\) and \(\hat{e}_k\) are as defined in this section above. Finally, for the current inexact setting the bundle management rule given in (21) becomes

$$\begin{aligned} \max \left\{ \bar{f}_{k+1}(\cdot ), \bar{f}_k^a (\cdot ), \bar{f}_{p(k)+1}(\cdot ), \bar{f}_{p(k)}^a (\cdot )\right\} \leqslant \check{f}_{k+1}(\cdot ) \leqslant f(\cdot )+\eta ^{g}, \end{aligned}$$
(27)

where \(p(k)\) once again stands for the last proximal iteration performed up to iteration \(k\).

As is standard in inexact proximal bundle methods, the linearization error \(\hat{e}_k\) is declared not too negative when the inequality

$$\begin{aligned} \hat{e}_k \geqslant - m_{e}\tau _k\mu _k|\hat{g}_k|^2 \end{aligned}$$
(28)

holds for some parameter \(m_{e}\in (0,1)\). This inequality, together with a parameter \(\mu _{\max }\ge 1\), is employed to update \(v^{\ell }_k\) in Step 5.2 of Algorithm 1 as follows:

$$\begin{aligned} \text{ If }\,\mu _k> \mu _{\max }\,\hbox {and}\,(28)\,\hbox {holds},\,\hbox {set}\, v^{\ell }_{k+1}\leftarrow m_{\ell }v^{\ell }_k;\quad \hbox {otherwise set}\; v^{\ell }_{k+1}\leftarrow v^{\ell }_k. \end{aligned}$$
(29)
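
In plain terms, the modified Step 5.2 can be sketched as follows (an illustrative sketch; the names are ours and the quantities are those produced by the solution of (4) at the current iteration):

```python
# Tests (28)-(29) for updating v^l_k in the inexact setting, sketched in Python.
import numpy as np

def update_v_level(v_level, e_hat, g_hat, tau, mu,
                   m_e=0.5, m_level=0.5, mu_max=10.0):
    not_too_negative = e_hat >= -m_e * tau * mu * float(g_hat @ g_hat)  # test (28)
    if mu > mu_max and not_too_negative:                                # rule (29)
        return m_level * v_level
    return v_level
```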

Since our method does not use noise attenuation, we cannot invoke the results from [21] and [11] for the case of infinitely many proximal iterations. For the case of finitely many proximal iterations, we cannot invoke previous results on inexact level bundle methods either; see the comments in Remark 1. Therefore, a convergence analysis largely independent of the previous literature is in order (although, naturally, a few ingredients will be familiar). First note that if the oracle errors do not vanish in the limit, of course only approximate solutions to (1) can be expected in general. This is natural, and similar to [10, 11, 21, 26].

4.1 Convergence analysis for the inexact case

We can proceed as in Proposition 1 to show that \(\bar{f}_k^a({x}) \leqslant \check{f}_k({x})\) for all \({x}\in {\fancyscript{X}}\). Since by the inexact oracle definition (22) we have that \(\bar{f}_j(\cdot ) \leqslant f(\cdot )+ \eta ^{g}\) for all \(j\in {{\fancyscript{B}}_{k}}\), we conclude that for all \(x\in {\fancyscript{X}}\) it holds that

$$\begin{aligned} f(x) +\eta ^{g}&\geqslant \check{f}_k({x}) \geqslant \bar{f}_k^a(x) = \check{f}_k ({{x}_{k+1}}) +\left\langle \hat{g}_k,{x}-{{x}_{k+1}}\right\rangle \end{aligned}$$
(30)
$$\begin{aligned}&= f_{{\hat{{x}}}_k}-\left( f_{{\hat{{x}}}_k}-\check{f}_k ({{x}_{k+1}})\right) +\left\langle \hat{g}_k,{x}-{{\hat{{x}}}_k}\right\rangle +\left\langle \hat{g}_k,{{\hat{{x}}}_k}-{{x}_{k+1}}\right\rangle \nonumber \\&= f_{{\hat{{x}}}_k}-v^{\tau }_k +\left\langle \hat{g}_k,{x}-{{\hat{{x}}}_k}\right\rangle +\tau _k\mu _k |\hat{g}_k|^2 \nonumber \\&= f_{{\hat{{x}}}_k}-\hat{e}_k +\left\langle \hat{g}_k,{x}-{{\hat{{x}}}_k}\right\rangle \end{aligned}$$
(31)
$$\begin{aligned}&\geqslant f_{{\hat{{x}}}_k}-v^{\tau }_k +\left\langle \hat{g}_k,{x}-{{\hat{{x}}}_k}\right\rangle . \end{aligned}$$
(32)

Note also that as in the exact case, if \(\check{f}_k({{x}_{k+1}})= \ell _k\) (which holds if \(\mu _k>1\)), then in (26) we have that \(v^{\ell }_k=v^{\tau }_k\).

As in Sect. 3, we consider separately the same three possible cases.

Lemma 5

Suppose the level set \({\mathbb {X}}_k\) is empty infinitely many times.

Then \(\varDelta _k\rightarrow 0\),

$$\begin{aligned} \lim _{k\rightarrow \infty } f_{{\hat{{x}}}_k}\leqslant f^{{{\mathrm{inf}}}}+ \eta ^{g}, \end{aligned}$$
(33)

and every cluster point of the sequence \(\{{{\hat{{x}}}_k}\}\) (if any exist) is an \((\eta +\eta ^{g})\)-approximate solution to problem (1), or the last \({{\hat{{x}}}_k}\) is an \((\eta +\eta ^{g})\)-approximate solution if this sequence is finite.

Proof

Recall that in the case under consideration \(f^{{{\mathrm{inf}}}}>-\infty \). The same argument as that of Lemma 2 shows that \(\varDelta _k\rightarrow 0\). Also, on the iterations in question we have that \(\ell _k < \check{f}_k (x)\) for all \(x\in {\fancyscript{X}}\), and thus the update in Step 2 and (30) ensure that \(f^{{{\mathrm{low}}}}_k \leqslant f^{{{\mathrm{inf}}}}+\eta ^{g}\). As \(\{f_{{\hat{{x}}}_k}\}\) is decreasing and bounded below (since \(f^{{{\mathrm{inf}}}}>-\infty \)), we conclude that

$$\begin{aligned} \lim _{k\rightarrow \infty } f_{{\hat{{x}}}_k}-f^{{{\mathrm{inf}}}}-\eta ^{g}\leqslant \lim _{k\rightarrow \infty } \left( f_{{\hat{{x}}}_k}- f^{{{\mathrm{low}}}}_k\right) =\lim _{k\rightarrow \infty }\varDelta _k = 0, \end{aligned}$$

which gives (33).

Now let \(\tilde{x}\) be any cluster point of \(\{{{\hat{{x}}}_k}\}\), and let \(\{\hat{x}_{k_j}\}\) be a subsequence converging to \(\tilde{x}\) as \(j\rightarrow \infty \). Then

$$\begin{aligned} f^{{{\mathrm{inf}}}}+ \eta ^{g}\geqslant \lim _{j\rightarrow \infty } f_{\hat{x}_{k_j}} = \lim _{j\rightarrow \infty } \left( f\left( \hat{x}_{k_j}\right) -\eta _{\hat{x}_{k_j}}\right) \geqslant f(\tilde{x}) -\eta , \end{aligned}$$
(34)

which establishes the last assertion. \(\square \)

Consider now the case where \({\mathbb {X}}_k\ne \emptyset \) for all \(k\) large enough, and there is an infinite number of descent steps [for which (25) holds].

Lemma 6

Suppose Algorithm 1 generates infinitely many descent steps.

Then (33) holds and every cluster point of the sequence \(\{{{\hat{{x}}}_k}\}\) (if any exist) is an \((\eta +\eta ^{g})\)-approximate solution to problem (1).

Proof

Let \(\{{\hat{{x}}}_{k(j)}\}\) be the subsequence of \(\{{{\hat{{x}}}_k}\}\) such that \(k(j)\) corresponds to the \(j\)th descent step, and define \(i(j)=k(j+1)-1\). It follows from (25) that \(\{f_{{\hat{{x}}}_{k(j)}}\}\) is decreasing and either \(\{f_{{\hat{{x}}}_{k(j)}}\}\rightarrow -\infty \), in which case (22), (23) imply that \(\{f({\hat{{x}}}_{k(j)})\}\rightarrow -\infty \) and the conclusions are obvious, or the limit of \(\{f_{{\hat{{x}}}_{k(j)}}\}\) is finite. In the second case (25) implies that

$$\begin{aligned} \lim _{j\rightarrow \infty }v^{\tau }_{i(j)}=0. \end{aligned}$$

Let \(x\in {\fancyscript{X}}\) be arbitrary. Using (32) and the fact that \({\hat{{x}}}_{k(j)}={\hat{{x}}}_{i(j)}\), we then obtain that

$$\begin{aligned} \left| {\hat{{x}}}_{k(j+1)}-{x}\right| ^2&= \left| {\hat{{x}}}_{k(j)}-{x}\right| ^2 + \left( \tau _{i(j)}\mu _{i(j)}\right) ^2|\hat{g}_{i(j)}|^2\\&+\,2\tau _{i(j)}\mu _{i(j)}\left\langle \hat{g}_{i(j)}, {x}- {\hat{{x}}}_{k(j)}\right\rangle \\&\leqslant \left| {\hat{{x}}}_{k(j)}-{x}\right| ^2 + \left( \tau _{i(j)}\mu _{i(j)}\right) ^2 |\hat{g}_{i(j)}|^2\\&+\, 2 \tau _{i(j)}\mu _{i(j)} \left( f(x) +\eta ^{g}- f_{{\hat{{x}}}_{k(j)}} +v^{\tau }_{i(j)}\right) . \end{aligned}$$

Suppose that (33) does not hold. Then there exist \(t>0\) and \(\tilde{x}\in {\fancyscript{X}}\) such that \(f_{{\hat{{x}}}_{k(j)}} \geqslant f(\tilde{x}) +\eta ^{g}+t\) for all \(j\). Taking \(j\) large enough so that \(v^{\tau }_{i(j)}\leqslant t/2\), and choosing \(x=\tilde{x}\) in the chain of inequalities above, we obtain that

$$\begin{aligned} \left| {\hat{{x}}}_{k(j+1)}-\tilde{x}\right| ^2&\leqslant \left| {\hat{{x}}}_{k(j)}-\tilde{x}\right| ^2 -\tau _{i(j)}\mu _{i(j)} t \\&\leqslant \left| {\hat{{x}}}_{k(1)}-\tilde{x}\right| ^2 - t \sum _{q=1}^j \tau _{i(q)}\mu _{i(q)} \\&\leqslant \left| {\hat{{x}}}_{k(1)} -\tilde{x}\right| ^2 -j t\tau _{\min }, \end{aligned}$$

where we used the fact that \(\tau _k\mu _k\ge \tau _k\ge \tau _{\min }\). The above gives a contradiction when \(j\rightarrow \infty \). We conclude that (33) holds. The last assertion then follows by the same argument as in Lemma 5. \(\square \)

We now consider the case of finitely many descent steps, with the level set \({\mathbb {X}}_k\) nonempty (for all \(k\) large enough).

Lemma 7

Suppose that for Algorithm 1, with the additional bundle management rule (27) and Step 5 employing (29) with \(\mu _{\max }\geqslant 1\), there exists an index \(k_1\geqslant 1\) such that the descent test (25) is not satisfied for all \(k \geqslant k_1\).

Then the last descent iterate \({\hat{{x}}}_{k_1}\) is an \((\eta +\eta ^{g})\)-approximate solution to problem (1).

Proof

The sequence \(\{v^{\ell }_k\}\) is monotone and when its elements decrease, they decrease by a fixed fraction \(m_{e}\in (0,1)\). Thus either \(v^{\ell }_k\rightarrow 0\) or \(v^{\ell }_k = v^{\ell }>0\) for all \(k\) large enough.

Consider first the case of \(v^{\ell }_k\rightarrow 0\). Then by rule (29) there exists an infinite index set \(K\) such that \(\mu _k>\mu _{\max }\) and the inequality (28) is valid for \(k\in K\). For such indices, it then holds that

$$\begin{aligned} 0 \leqslant (1-m_{e})\tau _{\min }|\hat{g}_k|^2 \leqslant (1-m_{e})\tau _k\mu _k|\hat{g}_k|^2 \leqslant \hat{e}_k + \tau _k\mu _k|\hat{g}_k|^2 =v^{\tau }_k=v^{\ell }_k, \end{aligned}$$
(35)

where the last equality follows from Proposition 2, because \(\mu _k>\mu _{\max } \geqslant 1\) for all \(k \in K\). It follows from (35) that

$$\begin{aligned} \tau _k\mu _k|\hat{g}_k|^2\rightarrow 0,\quad \hat{g}_k\rightarrow 0, \quad \hat{e}_k \rightarrow 0 \quad \text{ as }\;K\ni k\rightarrow \infty . \end{aligned}$$

Now passing to the limit in (31) as \(K\ni k\rightarrow \infty \), with \(x\in {\fancyscript{X}}\) fixed but arbitrary and \(\hat{x}_k =\hat{x}_{k_1}\) fixed, implies the assertion.

We now consider the second case: \(v^{\ell }_k= v^{\ell }>0\) for all \(k\geqslant k_2\).

Suppose first that there exists an infinite subsequence of null proximal steps (\(\mu _k = 1\)), indexed by \(\{k(j)\}\). “Ignoring” the possible null level steps in between, we can consider the sequence \(\{x_{k(j)}\}\) as one generated by the proximal bundle method, where the model satisfies, by the rule (27), the key conditions

$$\begin{aligned} \max \left\{ \bar{f}_{k(j)} (\cdot ), \bar{f}_{k(j)-1}^a(\cdot )\right\} \leqslant \check{f}_{i} (\cdot ) \leqslant f (\cdot )+\eta ^{g},\quad \text{ for }\; k(j)\leqslant i\leqslant k(j+1)-1. \end{aligned}$$

Of specific importance here is the relation for \(i=k(j+1)-1\), which shows that on consecutive null proximal steps the model satisfies the conditions which, together with \(\{\tau _{k(j)}\}\) being nonincreasing, guarantee the following property of the (inexact) proximal bundle method:

$$\begin{aligned} 0\geqslant \limsup _{j\rightarrow \infty }\left( f_{x_{k(j)}} -\check{f}_{k(j)-1}(x_{k(j)})\right) . \end{aligned}$$
(36)

(See [11, Theorem 6.4] and/or [21, Lemma 3.3 and Section 4.1].) On the other hand, as the descent condition (25) does not hold,

$$\begin{aligned} f_{x_{k(j)}} -\check{f}_{k(j)-1}(x_{k(j)})&> f_{\hat{x}_{k_1}} - m_{f}v^{\tau }_{k(j)-1} - \check{f}_{k(j)-1}(x_{k(j)})\\&= (1 - m_{f})v^{\tau }_{k(j)-1} \\&\geqslant (1 - m_{f})v^{\ell }> 0, \end{aligned}$$

which gives a contradiction with (36).

Therefore, in the case under consideration, there can only be a finite number of null proximal steps. Hence, all iterations indexed by \(k\geqslant k_3\) are of the null level type, and it holds that \(\mu _k >1, \lambda _k >0, \hat{x}_k =\hat{x}, v^{\ell }_k =v^{\ell }>0\) and \(\ell _k =\ell \) for all \(k\geqslant k_3\).

Note that

$$\begin{aligned} \ell \geqslant \check{f}_k ({{x}_{k+1}}) \geqslant \bar{f}_{k-1}^a ({{x}_{k+1}}) = \check{f}_{k-1} ({{x}_{k}})+ \left\langle \hat{g}_{k-1}, {x}_{k+1} - {x}_{k}\right\rangle . \end{aligned}$$

By Proposition 1, as \(\lambda _{k-1} >0\) it holds that \(\check{f}_{k-1} ({{x}_{k}}) = \ell \). Hence, \(0 \geqslant \langle \hat{g}_{k-1},{x}_{k+1} - {x}_{k}\rangle \), and since \(\hat{x} - x_k = \tau _{k-1}\mu _{k-1} \hat{g}_{k-1}\), it holds that

$$\begin{aligned} 0 \geqslant \left\langle \hat{x} - x_k, {x}_{k+1} - {x}_{k}\right\rangle . \end{aligned}$$

Expanding \(|{{x}_{k+1}}- {\hat{{x}}}|^2=|{{x}_{k}}- {\hat{{x}}}|^2+2\left\langle {{x}_{k}}-{\hat{{x}}}, {{x}_{k+1}}-{{x}_{k}}\right\rangle +|{{x}_{k+1}}-{{x}_{k}}|^2\) and noting that the inequality above makes the cross term nonnegative, it then follows that

$$\begin{aligned} \left| {x}_{k+1} - {\hat{{x}}}\right| ^2 \geqslant \left| {x}_k -{\hat{{x}}}\right| ^2+ \left| {{x}_{k+1}}- {{x}_{k}}\right| ^2. \end{aligned}$$
(37)

Note that

$$\begin{aligned} \ell \geqslant \check{f}_k ({{x}_{k+1}}) \geqslant \bar{f}_k ({{x}_{k+1}}) = f_{{{x}_{k}}}+ \left\langle g_{{{x}_{k}}}, {{x}_{k+1}}- {{x}_{k}}\right\rangle . \end{aligned}$$

Using the Cauchy–Schwarz inequality, we obtain that

$$\begin{aligned} |g_{{{x}_{k}}}||{{x}_{k+1}}- {{x}_{k}}|\geqslant f_{{{x}_{k}}}-\ell . \end{aligned}$$
(38)

Since this is a null step, it holds that

$$\begin{aligned} f_{{{x}_{k}}} > f_{{\hat{{x}}}} - m_{f}v^{\tau }_{k-1}, \end{aligned}$$

and since it is a level step, \(v^{\tau }_{k-1}= v^{\ell }_{k-1}=v^{\ell }>0\). Using further the definition \(\ell = f_{\hat{{x}}}- v^{\ell }\), we conclude that

$$\begin{aligned} f_{{{x}_{k}}}-\ell \geqslant (1-m_{f})v^{\ell }>0. \end{aligned}$$

In view of (38) and the last inequality above, it holds that \(g_{{{x}_{k}}}\ne 0\) and we obtain that

$$\begin{aligned} \left| {{x}_{k+1}}- {{x}_{k}}\right| \geqslant \frac{f_{{{x}_{k}}}-\ell }{\left| g_{{{x}_{k}}}\right| } \geqslant \frac{(1 -m_{f})v^{\ell }}{\left| g_{{{x}_{k}}}\right| }. \end{aligned}$$

Using now (37), it follows that

$$\begin{aligned} \left| {{x}_{k+1}}- {\hat{{x}}}\right| ^2 \geqslant \left| {{x}_{k}}- {\hat{{x}}}\right| ^2 + \left( \frac{(1 -m_{f})v^{\ell }}{|g_{{{x}_{k}}}|}\right) ^2. \end{aligned}$$
(39)

If the sequence \(\{x_k\}\) were bounded, there would exist a constant \(C>0\) such that \(|g_{{{x}_{k}}}|\leqslant C\) for all \(k\) [by boundedness of the \(\varepsilon \)-subdifferential on bounded sets and (22), (23)]. But then (39) would mean that the sequence \(\{|{{x}_{k}}- {\hat{{x}}}|^2\}\) increases at every iteration by the fixed positive quantity \(((1 -m_{f})v^{\ell }/C)^2\), contradicting boundedness. Hence, \(\{x_k\}\) is unbounded, and since \(\{|{{x}_{k}}- {\hat{{x}}}|^2\}\) is monotone by (39), it follows that \(|{{x}_{k}}- {\hat{{x}}}|\rightarrow +\infty \) as \(k\rightarrow \infty \).

We next show that \(\limsup _{k\rightarrow \infty } \mu _k = +\infty \). Suppose the contrary, i.e., that there exists \(\bar{\mu } > 0\) such that \(\mu _k \leqslant \bar{\mu } \) for all \(k\). As \(\{\tau _k\}\) is nonincreasing for \(k\geqslant k_3\) and \(v^{\tau }_k = v^{\ell }_k =v^{\ell }\), using (24) we have that

$$\begin{aligned} \tau _{k_3}\bar{\mu } v^{\ell }&\geqslant \tau _k\mu _k v^{\tau }_k = \tau _k\mu _k \hat{e}_k + (\tau _k\mu _k)^2 |\hat{g}_k|^2\\&\geqslant -\tau _{k_3}\bar{\mu }(\eta +\eta ^{g}) + \left| {{x}_{k+1}}- {\hat{{x}}}\right| ^2, \end{aligned}$$

in contradiction with \(|{{x}_{k}}- {\hat{{x}}}|\rightarrow +\infty \). Hence, \(\limsup _{k\rightarrow \infty } \mu _k = +\infty \).

In the case under consideration, by rule (29) of Algorithm 1, \(\hat{e}_k < -m_e \tau _k\mu _k |\hat{g}_k|^2\) for all \(k\geqslant k_3\). In particular, \( \limsup _{k\rightarrow \infty } \hat{e}_k \leqslant 0.\) Also, using again (24), from \(\hat{e}_k < -m_e \tau _k\mu _k |\hat{g}_k|^2\) it follows that

$$\begin{aligned} \frac{\left( \eta +\eta ^{g}\right) }{\tau _{\min } \mu _k}> m_{e}|\hat{g}_k|. \end{aligned}$$

As \(\limsup _{k\rightarrow \infty } \mu _k = +\infty \), this implies that \( \liminf _{k\rightarrow \infty } |\hat{g}_k|=0\). Now fixing an arbitrary \(x\in {\fancyscript{X}}\), and passing to the limit in (31) along a subsequence for which the last relation above holds (taking also into account that in the case under consideration \(\hat{e}_k \leqslant 0\)), concludes the proof. \(\square \)

Combining all the cases considered above, we conclude the following.

Theorem 2

If Algorithm 1 [with the additional rules (29) and (27)] generates a sequence such that \({{\hat{{x}}}_k}={\hat{{x}}}_{k_1}\) for all \(k\ge k_1\), then \({\hat{{x}}}_{k_1}\) is an \((\eta +\eta ^{g})\)-approximate solution to (1). Otherwise, (33) holds and every cluster point of the sequence \(\{{{\hat{{x}}}_k}\}\) (if any exist) is an \((\eta +\eta ^{g})\)-approximate solution to problem (1).

The analysis above also shows that in all the cases either \(\varDelta _k \rightarrow 0\) or there exists a subsequence \(K\subset \{1,2,\ldots \}\) such that \(\limsup _{K\ni k \rightarrow \infty } \hat{e}_k \leqslant 0\) and \(\lim _{K\ni k \rightarrow \infty }|\hat{g}_k| =0\). This means that, for positive tolerances, some stopping rule in Algorithm 1 is eventually satisfied (at which time an appropriate approximate solution is obtained).
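To make the termination logic concrete, a minimal sketch of the two tests just mentioned is given below (this is our own illustration; the variable and tolerance names are ours and are not part of Algorithm 1):

```python
def should_stop(e_hat, g_hat_norm, gap, tol_e, tol_g, tol_gap):
    """Sketch of the stopping logic discussed above: stop when the aggregate
    error and the aggregate subgradient are both small, or when the
    optimality gap Delta_k is small."""
    aggregate_test = (e_hat <= tol_e) and (g_hat_norm <= tol_g)
    gap_test = gap <= tol_gap
    return aggregate_test or gap_test
```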

5 Numerical results

In this section we report computational experiments on different types of problems: two-stage stochastic linear programming, nonsmoothly-regularized maxima of quadratic functions, the model unit-commitment problem in the energy sector, and some standard nonsmooth test problems (about 1000 instances overall). We compare the following four solvers:

  • PBM-1—proximal bundle method using a rule to update \(\tau _k\) based on [19];

  • PBM-2—proximal bundle method using a rule to update \(\tau _k\) based on [23];

  • LBM—level bundle method of [5];

  • DSBM—doubly stabilized bundle method (the algorithm described in this article).

The runs were performed on a computer with an Intel(R) Core(TM) i3-3110M CPU @ 2.40 GHz and 4 GB of RAM, under Windows 8 (64 bits). The QPs (and also LPs) were solved by the MOSEK 7 toolbox for MATLAB (http://www.mosek.com/); the MATLAB version is R2012a.

Our analysis of the outcomes reports success or failure (i.e., whether a stopping test was eventually satisfied or the maximal number of iterations was reached), the number of oracle calls (here, the same as the number of iterations), and the CPU time to termination. We also compare the quality of the solutions obtained at termination. To get some further insight, we report the number of descent steps for all the solvers, the number of empty level sets encountered by LBM and DSBM, and, for DSBM (which has several possibilities), the number of level iterations and which stopping criterion triggered termination.

We start by describing some details of the implementations, the parameter tuning, and the stopping rules of the algorithms in question.

5.1 Implementations, tuning the parameters, and stopping criteria

Many parameters need to be set for the solvers: the constant for the descent test \(m_f \in (0,1)\) in (13) (used in all four solvers), the constant \(m_\ell \in (0,1)\) for adjusting the level parameter (for LBM and DSBM), and some further parameters for updating \(\tau _k\) in the proximal solvers PBM-1, PBM-2 and DSBM.

Some specific parameters of each solver are listed below.

5.1.1 The level bundle algorithm LBM

The algorithm is as described in [5]. The initial predicted decrease is given by \(v^{\ell }_1=f({x}_1)-\check{f}_1(\tilde{x})\), where \(\tilde{x}\) is the solution of the QP (2) with \(k=1\) and \(\tau _1\) given. When a lower bound \(f^{{{\mathrm{low}}}}_k\) for the optimal value \(f^{{{\mathrm{inf}}}}\) is found, the subsequent iterations solve the LP (17) to update \(f^{{{\mathrm{low}}}}_k\) to \(f^{{{\mathrm{low}}}}_{k+1}\).

As in the rule (20), the LBM method of [5] employs the parameter \(\mu _{\max }>0\). For this solver, we need to set mainly the parameters \(m_{\ell }\), \(\mu _{\max }\) and \(\tau _1\) (the latter defines \(v^{\ell }_1\) as explained above).

5.1.2 The proximal bundle solvers PBM-1 and PBM-2

The rule to update the prox-parameter \(\tau _k\) is as follows: let \(a>1\) and \(\tau _{\min }>0\) be two given parameters, and \(\tau _\mathtt{aux}^k\) be an auxiliary prox-parameter at iteration \(k\) (different for PBM-1 and PBM-2).

  • If null step, set \( \tau _{k+1} \leftarrow \min \{ \tau _k, \max \{\tau _\mathtt{aux}^k,\tau _k/a,\tau _{\min }\}\} \)

  • If descent step:

    • if more than five consecutive descent steps have occurred, set \(\tau _\mathtt{aux}^k\leftarrow a\tau _\mathtt{aux}^k\)

    • set \(\tau _{k+1} \leftarrow \min \{ \tau _\mathtt{aux}^k,10\tau _k\}\).

In PBM-1 [19], one sets

$$\begin{aligned} \tau _\mathtt{aux}^k \leftarrow 2\tau _k\left( 1+\frac{f({{\hat{{x}}}_k})-f({{x}_{k+1}})}{v^{\tau }_k}\right) . \end{aligned}$$

In PBM-2 [23], one sets

$$\begin{aligned} \tau _\mathtt{aux}^k \leftarrow \tau _k\left( 1+\frac{\langle g_{k+1}-g_k,{{x}_{k+1}}-{{x}_{k}}\rangle }{|g_{k+1}-g_k|^2}\right) , \end{aligned}$$

under some safeguards [23, Section 4.2].
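For illustration, the safeguarded update and the two auxiliary formulas above can be rendered as the following sketch (written by us in Python; the function names and the counter of consecutive descent steps are our own bookkeeping, and the PBM-2 formula is shown without the safeguards of [23, Section 4.2]):

```python
import numpy as np

def update_tau(tau_k, tau_aux, descent_step, consecutive_descents,
               a=2.0, tau_min=1e-6):
    """Safeguarded prox-parameter update common to PBM-1 and PBM-2 (sketch)."""
    if not descent_step:
        # null step: never increase tau, and keep it above tau_min
        return min(tau_k, max(tau_aux, tau_k / a, tau_min))
    if consecutive_descents > 5:
        tau_aux = a * tau_aux          # more than five consecutive descent steps
    return min(tau_aux, 10.0 * tau_k)

def tau_aux_pbm1(tau_k, f_center, f_next, v_tau):
    """Auxiliary prox-parameter of PBM-1 (rule based on [19])."""
    return 2.0 * tau_k * (1.0 + (f_center - f_next) / v_tau)

def tau_aux_pbm2(tau_k, g_next, g_curr, x_next, x_curr):
    """Auxiliary prox-parameter of PBM-2 (quasi-Newton-like rule based on [23])."""
    dg, dx = g_next - g_curr, x_next - x_curr
    return tau_k * (1.0 + float(dg @ dx) / float(dg @ dg))
```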

The essential parameters to tune in the updates above are \(a\), \(\tau _1\) and \(\tau _{\min }\). The constants 10 and 2 appearing in the updates of \(\tau _k\) could also be tuned, but we use their standard values here.

5.1.3 The doubly stabilized DSBM solver

This is Algorithm 1 employing rule (20) in Step 5. The initial predicted decrease is given by \(v^{\ell }_1=f({x}_1)-\check{f}_1(\tilde{x})\), where \(\tilde{{x}}\) is the solution of the QP (2) (the same as for LBM). When a lower bound \(f^{{{\mathrm{low}}}}_k\) for the optimal value \(f^{{{\mathrm{inf}}}}\) is found, the subsequent iterations solve the LP (17) to update \(f^{{{\mathrm{low}}}}_k\) to \(f^{{{\mathrm{low}}}}_{k+1}\) (the same as in LBM).

The essential parameters to tune in the updates above are \(\mu _{\max }\), \(m_{\ell }\), \(\tau _1\) and \(\tau _{\min }\).

5.1.4 Tuning the parameters

The parameters were tuned for each problem class separately. To decide on the “best” settings of parameters, we first ran each solver on representative instances (a subset of about 10 %) of each considered family of problems, with various possible combinations of the solvers’ parameters.

  • Setting the stopping tolerances. Depending on the solver, the tolerances involved in stopping tests are: \(\mathtt{Tol }_{e}\) for the aggregate error \(\hat{e}_k\), \(\mathtt{Tol }_{g}\) for the norm of the aggregate subgradient \(\hat{g}_k\) and \(\mathtt{Tol }_{\varDelta }\) for the optimality gap \(\varDelta _k\). As it is natural to have the optimality measures \(\hat{e}_k\) and \(\varDelta _k\) of the same magnitude, we set \(\mathtt{Tol }_{e}= \mathtt{Tol }_{\varDelta }=\mathtt{Tol }\). On the other hand, \(|\hat{g}_k|\) is a dimension-dependent measure, which can be different. To set \(\mathtt{Tol }_{g}\) we performed the following steps for each class of problems:

    • first, the sample of problems was solved by Algorithm 1 with the stopping test \(\hat{e}_k \le 10^{-8}\) (checking also that \(\hat{g}_k\) is small enough at termination);

    • at the last iteration \(k_i\) of the given method on problem \(i\), we performed a linear regression on the data \(\{\hat{e}_{k_i}\}\) and \(\{|\hat{g}_{k_i}|\}\) to estimate the best constant \(\rho >0\) that minimizes the mean square error \(\sum _i (\rho \hat{e}_{k_i} - |\hat{g}_{k_i}|)^2\);

    • given tolerance \(\mathtt{Tol }\) for \(\hat{e}_k\) and \(\varDelta _k\), we then set \(\mathtt{Tol }_{g}:= \rho \mathtt{Tol }\) (a small computational sketch of this calibration is given right after this list).

    In the final experiments reported, the solvers terminate either if the number of oracle calls reaches 1000 (considered a failure) or when

    $$\begin{aligned} \hat{e}_k \!\leqslant \! \mathtt{Tol }\quad \text{ and } \quad |\hat{g}_k| \!\leqslant \! \mathtt{Tol }_{g},\quad \text{ or }\quad \varDelta _k \!\leqslant \! \mathtt{Tol }\quad \text{ with } \quad \mathtt{Tol }\!=\!\left( 1 \!+\! \left| \bar{f}\right| \right) 10^{-8}. \end{aligned}$$
    (40)

    Here, \(\bar{f}\) is a good approximation of the optimal value \(f^{{{\mathrm{inf}}}}\), obtained by running one solver in advance, and the stopping tolerances are set as described above. The last stopping test, based on the optimality gap, is employed only by the solvers LBM and DSBM.

  • Setting the initial prox-parameter.  As mentioned, all the solvers employ an initial prox-parameter \(\tau _1\) (solvers LBM and DSBM use \(\tau _1\) to define \(v^{\ell }_1\)). For each class of problems we tested \(\tau _1\in \{1,5,10\}\).

  • Lower bound for the prox-parameter.  For all solvers except LBM: \(\tau _{\min } \in \{10^{-6},10^{-5},10^{-3}\}\).

  • Parameter \(a\) to update \(\tau _k\) during null steps. Only for solvers PBM-1 and PBM-2: \(a \in \{2,4,5\}\).

  • Level parameter \(m_{\ell }\). Only for solvers LBM and DSBM: \(m_{\ell }\in \{0.2,0.5,0.7\}\).

  • Descent parameter \(m_{f}\). All solvers: \(m_{f}\in \{0.1,0.5,0.7\}\).

  • Parameter \(\mu _{\max }\) in (20). Only for solvers LBM and DSBM: \(\mu _{\max }\in \{1,5,10\}\).
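Returning to the tolerance-setting procedure described in the first item of the list above, the least-squares fit of \(\rho \) has the closed-form solution \(\rho =\sum _i \hat{e}_{k_i}|\hat{g}_{k_i}|/\sum _i \hat{e}_{k_i}^2\). A small sketch of this calibration (our own illustration, with hypothetical names) is:

```python
import numpy as np

def calibrate_tol_g(e_hat_final, g_norm_final, tol):
    """Given the values of e_hat and |g_hat| collected at the last iteration
    on each sample problem, compute rho minimizing
    sum_i (rho * e_hat_i - |g_hat_i|)**2 and return Tol_g = rho * Tol."""
    e = np.asarray(e_hat_final, dtype=float)
    g = np.asarray(g_norm_final, dtype=float)
    rho = float(e @ g) / float(e @ e)   # least squares without intercept
    return rho * tol
```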

As expected, the standard choice \(m_{f}=0.1\) for the descent test proved adequate for all the solvers. Another adequate choice was \(\mu _{\max }=5\). Other parameters take different values depending on the class of problems, as shown below.

In all the solvers, all linearizations are kept in the model until the bundle reaches its maximal allowed size, which was set at 334 (approximately one third of the maximum number of iterations). When the maximal allowed size is reached, the solvers eliminate inactive linearizations, if any exist. If there are no inactive linearizations, the bundle is compressed: the two “less active” linearizations (with the smallest Lagrange multipliers) are replaced by the latest \(\bar{f}_{k+1}\) and by the aggregate linearization \(\bar{f}_k^a\).
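A schematic rendering of this bundle management (our own sketch; how linearizations are represented and how their QP multipliers are obtained is left abstract) could be:

```python
def manage_bundle(bundle, multipliers, new_lin, aggregate_lin, max_size=334):
    """Sketch of the bundle management described above.  bundle: list of
    linearizations; multipliers: Lagrange multipliers from the last
    subproblem (same order); new_lin: the latest linearization;
    aggregate_lin: the aggregate linearization."""
    if len(bundle) < max_size:
        bundle.append(new_lin)                 # room left: keep everything
        return bundle
    active = [i for i, m in enumerate(multipliers) if m > 0.0]
    if len(active) < len(bundle):
        # eliminate inactive linearizations (zero multipliers), then add the new one
        bundle = [bundle[i] for i in active]
        bundle.append(new_lin)
        return bundle
    # no inactive linearization: replace the two "least active" ones
    order = sorted(range(len(bundle)), key=lambda i: multipliers[i])
    keep = [bundle[i] for i in range(len(bundle)) if i not in set(order[:2])]
    return keep + [new_lin, aggregate_lin]
```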

5.2 Two-stage stochastic linear programming problems

We consider ten families of problems available at http://web.uni-corvinus.hu/~ideak1/kut_en.htm, by I. Deák. They result in convex linearly-constrained nonsmooth problems, of the form (1). Specifically,

$$\begin{aligned} f({x}):= \langle c, {x}\rangle + \sum _{i=1}^N p_iQ({x}, h_i)\quad \text{ and }\quad {\fancyscript{X}}:=\left\{ {x}\in \mathfrak {R}^{n}_+:\;A{x}=b\right\} , \end{aligned}$$

where

$$\begin{aligned} Q({x}, h_i):=\min _{y\in \mathfrak {R}^{n_2}_+} \langle q, y\rangle \quad \text{ s.t. }\; Tx+Wy=h_i \end{aligned}$$

is the recourse function corresponding to the \(i\)th scenario \(h_i\in \mathfrak {R}^{m_2}\) with probability \(p_i>0\) (\(W\) and \(T\) above are matrices of appropriate dimensions); \(c\in \mathfrak {R}^n\), matrix \( A \in \mathfrak {R}^{m_1\times n}\) and vector \( b \in \mathfrak {R}^{m_1}\) are such that the set \({\fancyscript{X}}\) is bounded. We consider twenty instances, corresponding to the numbers of scenarios

$$\begin{aligned} N \in \{5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100\}. \end{aligned}$$

The best configuration found for the parameters is the following: PBM-1 and PBM-2: \(\tau _1=10, \tau _{\min }=10^{-6}\) and \(a=2\); LBM: \(\tau _1=10\), and \(m_{\ell }=0.2\); DSBM: \(\tau _1=1\), \(\tau _{\min }=10^{-6}\) and \(m_{\ell }=0.2\). All the solvers employed the tolerances \(\mathtt{Tol }_{g}=100\mathtt{Tol }\) and \(\mathtt{Tol }\) as in (40).

Table 1 shows the total number of oracle calls and CPU times for solving (successively) all the twenty instances of each of the 10 problems (in total, 200) by the four methods.

Table 1 Total number of oracle calls and CPU time: sum over 200 instances

DSBM is the fastest solver, followed by LBM. There were no failures in this benchmark. The DSBM solver stopped by the relative optimality gap test in 93 % of the instances, whereas LBM did so in around 96 %.

Optimality measures are reported in Table 2, for a subset of the instances. Ideally, both measures \({\hat{e}_k}/(1+|\bar{f}|)\) and \(|\hat{g}_k|/(1+|\bar{f}|)\), or the measure \({\varDelta _k}/(1+|\bar{f}|)\), should be zero. Table 2 presents the number of digits of accuracy for these quantities. For instance, the number 09 for \({\hat{e}_k}/(1+|\bar{f}|)\) means that the quantity in question has the value \(c\,10^{-09}\), with some \(c \in (1,10)\).

Table 2 Comparison of the optimality measures: digits of accuracy
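For completeness, the “digits of accuracy” convention just described corresponds to the following trivial computation (a helper of our own, added only to make the tables’ convention explicit):

```python
import math

def digits_of_accuracy(value):
    """Return d such that value = c * 10**(-d) with c in (1, 10);
    for example, 3.2e-9 gives 9."""
    return int(-math.floor(math.log10(value)))
```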

In Fig. 1 we give performance profiles [7] of the four solvers over the 200 instances. The top graph considers the number of oracle calls (iterations), and the bottom one the CPU time. For example, let the criterion be CPU time. For each algorithm, we plot the proportion of problems that it solved within a given factor of the time required by the best algorithm. In other words, denoting by \(t_s(p)\) the time spent by solver \(s\) to solve problem \(p\) and by \(t^*(p)\) the best time for the same problem among all the solvers, the proportion of problems solved by \(s\) within a factor \(\gamma \) is

$$\begin{aligned} \phi _s(\gamma ) = \displaystyle \frac{\text {number}\,\text {of}\,\text {problems}\;p\; \text {such}\,\text {that}\;t_s(p) \le \gamma t^*(p)}{\text {total}\,\text {number}\, \text {of}\,\text {problems}}. \end{aligned}$$

Therefore, the value \(\phi _s(1)\) gives the probability that solver \(s\) is the best by the given criterion. Furthermore, unless \(t_s(p)=\infty \) (which means that solver \(s\) failed to solve problem \(p\)), it follows that \(\lim _{\gamma \rightarrow \infty }\phi _s(\gamma )=1\). Thus, the higher the line, the better the solver (by this criterion).
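A compact way to compute such profiles from a matrix of per-problem costs (number of oracle calls or CPU time), encoding failures as infinite cost in accordance with the convention \(t_s(p)=\infty \) above, is sketched below (our own illustration):

```python
import numpy as np

def performance_profile(costs, gammas):
    """costs[s, p] = t_s(p) for solver s on problem p (np.inf for failures).
    Returns phi with phi[s, j] = proportion of problems such that
    t_s(p) <= gammas[j] * t*(p)."""
    costs = np.asarray(costs, dtype=float)
    best = costs.min(axis=0)                  # t*(p): best cost per problem
    ratios = costs / best                     # performance ratios
    n_problems = costs.shape[1]
    phi = np.array([(ratios <= g).sum(axis=1) / n_problems for g in gammas]).T
    return phi
```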

Fig. 1
figure 1

Performance profile: 200 instances of two-stage stochastic problems

We conclude from Fig. 1 that among the four solvers, LBM used fewer oracle calls (and less CPU time) in approximately 60 % (respectively, 58 %) of the 200 instances, followed by DSBM (40 %), which was better than both PBM-1 and PBM-2.

5.3 RandMaxQuad problems

In this subsection we consider a family of randomly generated problems of the form (1) with the objective function given by

$$\begin{aligned} f({x})&= \max _{i=1,\ldots ,10} \bigl \{\left\langle Q_i{x}, {x}\right\rangle + \left\langle q_i, {x}\right\rangle \bigr \} + \alpha |{x}|_1\quad \text{ and } \\ f({x})&= \max _{i=1,\ldots ,10} \bigl \{\left\langle Q_i{x}, {x}\right\rangle + \left\langle q_i, {x}\right\rangle \bigr \} + \alpha |{x}|_\infty , \end{aligned}$$

where \(Q_i \in \mathfrak {R}^{n\times n}\) and \(q_i \in \mathfrak {R}^n\) are randomly generated, \(Q_i\) being symmetric positive semidefinite, \(i=1,\ldots , 10\). The problem’s dimension \(n\) varies according to

$$\begin{aligned} n \in \{10,20,30,40,50,60,70,80,90,100,150,200,250,300,500\}. \end{aligned}$$

Parameter \(\alpha \) runs through the values \(\alpha \in \{0.1,0.5,1\}\). Two settings were considered for the feasible set \({\fancyscript{X}}\) in (1): \({\fancyscript{X}}=\mathfrak {R}^n\) (unconstrained setting) and \({\fancyscript{X}}=\{{x}\in \mathfrak {R}_+^n: \sum _{i=1}^n{x}_i=1 \}\) (simplex setting). In total, 708 different instances of problem (1) were obtained by using different seeds for the MATLAB random number generator: 354 unconstrained and 354 constrained.
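For this family the oracle is exact and inexpensive: the index attaining the maximum yields a subgradient of the max term, to which any subgradient of the regularizer is added. A minimal sketch for the \(\ell _1\) variant (our own illustration) is:

```python
import numpy as np

def randmaxquad_oracle(x, Qs, qs, alpha):
    """f(x) = max_i { <Q_i x, x> + <q_i, x> } + alpha*||x||_1, with Q_i
    symmetric positive semidefinite.  Returns f(x) and one subgradient g."""
    values = [float(x @ Q @ x + q @ x) for Q, q in zip(Qs, qs)]
    i_star = int(np.argmax(values))              # an active piece of the max
    f_val = values[i_star] + alpha * np.linalg.norm(x, 1)
    # 2*Q_i*x + q_i is the gradient of the active quadratic; sign(x)
    # (with sign(0) = 0) is a valid subgradient of the l1-norm
    g = 2.0 * Qs[i_star] @ x + qs[i_star] + alpha * np.sign(x)
    return f_val, g
```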

5.3.1 Unconstrained RandMaxQuad: 354 instances

The best configurations found for the parameters were: PBM-1: \(\tau _1=1, \tau _{\min }=10^{-6}\) and \(a=5\); PBM-2: \(\tau _1=1\), \(\tau _{\min }=10^{-6}\) and \(a=2\); LBM: \(\tau _1=1\), and \(m_{\ell }=0.2\); DSBM: \(\tau _1=1\), \(\tau _{\min }=10^{-6}\) and \(m_{\ell }=0.2\). Tolerances were set as \(\mathtt{Tol }_{g}=1000\mathtt{Tol }\), with \(\mathtt{Tol }\) given in (40).

Among other information, Table 3 shows the total CPU time (in minutes) and the total number of oracle calls required to solve all the 354 unconstrained instances. Notice that the least demanding solver with respect to oracle calls and CPU time is DSBM, followed by PBM-2. Table 3 also shows that around 75 % of the DSBM iterations were of the level type.

Table 3 Total number of oracle calls and CPU time: sum over 354 instances

Table 4 presents optimality measures at termination for each solver on some instances. DSBM stopped by the relative optimality gap in 29 % of the instances, whereas LBM triggered this additional stopping test in 38 %.

Table 4 Comparison of the optimality measures: digits of accuracy

Figure 2 gives performance profiles [7] of the four solvers over 354 instances of the unconstrained RandMaxQuad problem.

Fig. 2
figure 2

Performance profile of the four solvers over 354 instances of MaxQuad

We observe that DSBM required fewer oracle calls in approximately 48 % of the 354 instances, followed by PBM-2 and PBM-1 (around 40 and 10 %, respectively). Besides, DSBM is more robust in terms of oracle calls: it achieves \(\phi (\gamma )=1\) for lower values of \(\gamma \). For this type of problems, both PBM-1 and PBM-2 are more robust than LBM, which failed to satisfy the stopping test in around 25 % of the instances, as reported in Table 3.

5.3.2 Constrained RandMaxQuad: 354 instances

We now consider problems with the same 354 objective functions, but constrained on a simplex. The employed solver parameters are the following: PBM-1: \(\tau _1=1, \tau _{\min }=10^{-3}\) and \(a=5\); PBM-2: \(\tau _1=1, \tau _{\min }=10^{-5}\) and \(a=2\); LBM: \(\tau _1=1\), and \(m_{\ell }=0.7\); DSBM: \(\tau _1=1, \tau _{\min }=10^{-6}\) and \(m_{\ell }=0.7\). Tolerances were set as \(\mathtt{Tol }_{g}=1000\mathtt{Tol }\), with \(\mathtt{Tol }\) given in (40).

Table 5 shows the total number of oracle calls, CPU time, descent steps and level steps required to solve all the constrained instances. We observe that the solvers LBM and DSBM are much more effective on the constrained problems than PBM-1 and PBM-2. Around 45 % of the DSBM iterations were of the level type.

Table 5 Total number of oracle calls and CPU time: sum over 354 instances

LBM triggered the optimality gap stopping test in 62 % of the instances, while DSBM in 72 %. Note that these percentages were smaller for the unconstrained instances: 38 and 29 %, respectively. Table 6 reports (for some selected instances) the optimality measures at the last iteration.

Table 6 Comparison of the optimality measures: digits of accuracy

Performance profiles of the four solvers on these 354 constrained instances are presented in Fig. 3. Among the considered solvers, we notice that DSBM is both the fastest and the most robust one, followed by LBM.

Fig. 3
figure 3

Performance profile of the four solvers over 354 instances of constrained MaxQuad

5.4 Unit-commitment energy problems

In this subsection we consider a unit-commitment problem for a power system operation model with four power plants. For each given point \({x}\), an oracle must solve four mixed-integer linear programming problems to compute \(f({x})\) and \(g\in \partial f({x})\). The feasible set for this problem is the positive orthant \({\fancyscript{X}}=\mathfrak {R}^n_+\). In our configuration, the problem’s dimension ranges in \( n \in \{12,24,36,48,60\}. \) The electricity demands for the unit-commitment problem were chosen randomly, using ten different seeds for the random number generator. In total, 50 instances of the problem were considered.

The employed solver parameters are the following: PBM-1: \(\tau _1=10, \tau _{\min }=10^{-6}\) and \(a=2\); PBM-2: \(\tau _1=1, \tau _{\min }=10^{-5}\) and \(a=4\); LBM: \(\tau _1=1\), and \(m_{\ell }=0.7\); DSBM: \(\tau _1=1, \tau _{\min }=10^{-6}\) and \(m_{\ell }=0.2\). Tolerances were set as \(\mathtt{Tol }_{g}=\mathtt{Tol }\), with \(\mathtt{Tol }\) given in (40).

In this battery of problems, all the runs were successful, i.e., a stopping test was satisfied before the maximal number of iterations was reached. Table 7 shows the total number of oracle calls and CPU times required to stop the four solvers over all instances of the problem.

Table 7 Total number of oracle calls and CPU times: sum over 50 instances

PBM-2 was the fastest solver on these problems, followed by DSBM. DSBM terminated by the optimality gap test in 98 % of the instances, whereas LBM did so in 96 %. Moreover, around 58 % of the DSBM iterations were of the level type. In Table 8 we present (for some instances) the optimality measures at the last iteration.

Table 8 Comparison of the optimality measures: digits of accuracy

Figure 4 gives performance profiles of the four solvers over 50 instances of the problem.

Fig. 4
figure 4

Performance profile of the four solvers over 50 instances of the problem

We observe that DSBM required fewer oracle calls in approximately 90 % of the cases, while PBM-2 was the fastest solver in 20 % of the instances.

5.5 Classical unconstrained nonsmooth test problems

In this subsection we consider some typical functions for nonsmooth optimization benchmarking, such as MaxQuad [4, p. 153], TR48 [18, p.21, vol. II] and others. All the problems are unconstrained and have known optimal values. We refer to [24] for more information on these test problems.

Tables 9 and 10 report on results obtained by the four solvers on this type of problems, using default dimensions and starting points.

Table 9 Total number of oracle calls and CPU time
Table 10 Digits of accuracy in the difference \(f({\hat{{x}}})-f^{{{\mathrm{inf}}}}\)

Table 10 shows the true optimal value of each problem (column \(f^{{{\mathrm{inf}}}}\)) and the number of digits of accuracy in the difference \(f({\hat{{x}}})-f^{{{\mathrm{inf}}}}\) for the four solvers at termination, where \({\hat{{x}}}\) is the obtained solution.

We can conclude from Table 10 that the quality of the solutions obtained by the doubly stabilized method is as good as that of the other solvers.

6 Concluding remarks

We proposed a new algorithm for nonsmooth convex minimization, called doubly stabilized bundle method. It combines the level and proximal stabilizations in a single subproblem, and at each iteration automatically “chooses” between a proximal and a level step. The aim is to take advantage of good properties of both, depending on the problem at hand, and also use the simplicity of updating the level parameter to produce a simple and efficient rule to update the proximal parameter, thus speeding up the optimization process. In addition, the method provides a useful stopping test based on the optimality gap.

The algorithm appears to perform well in computation, as validated in Sect. 5, where almost 1,000 instances of various types of problems were considered. Numerical results show that the proposed method compares favorably with both the proximal and level bundle methods.

The new doubly stabilized algorithm can also handle inexactness of data in a natural way, without introducing special modifications to the iterative procedure (such as noise attenuation).