1 Introduction

Consider the following unconstrained optimization problem

$$\begin{aligned} \min {\quad }&f(x), \nonumber \\&x \in \mathbb {R}^{n} , \end{aligned}$$
(1)

where \( \mathbb {R}^{n} \) denotes the n-dimensional Euclidean space and \( f: \mathbb {R}^{n} \rightarrow \mathbb {R} \) is a locally Lipschitz function, but possibly nonsmooth and nonconvex.

Several numerical methods have been proposed for solving nonsmooth optimization problems. These methods include subgradient algorithms [1, 3, 23, 24], bundle methods [5, 8, 9, 11, 12, 20], gradient sampling algorithms [4, 19], the trust region algorithm [10] and the conjugate gradient method [16].

Bundle methods were first introduced by Lemaréchal [20] and have been developed over the years for various classes of problems. Proximal bundle methods are currently considered among the most efficient methods for nonsmooth problems; see [21] for the convex case and [18] for the nonconvex case. Recently, a proximal bundle algorithm was developed for the problem (1) with \(\mathop {\mathrm {lower}}-C^{2} \) objective functions in [8] and with \(\mathop {\mathrm {lower}}-C^{1} \) objective functions in [9]. This method was further generalized to the problem (1) with \(\mathop {\mathrm {lower}}-C^{1} \) and \(\mathop {\mathrm {upper}}-C^{1} \) objective functions in [6] and with DC objective functions in [17]. In addition, it was extended to constrained optimization problems with \(\mathop {\mathrm {lower}}-C^{1} \) or \(\mathop {\mathrm {upper}}-C^{1} \) functions in [5] and with regular functions in [11, 12]. More recently, proximal bundle algorithms have been studied for nonsmooth multiobjective problems in [13, 14].

In this paper, we extend the redistributed proximal bundle method of [8, 9] to the class of all locally Lipschitz functions, which is less restrictive than the \( \mathop {\mathrm {lower}}-C^{2} \), \( \mathop {\mathrm {lower}}-C^{1} \) and \( \mathop {\mathrm {upper}}-C^{1} \) assumptions. We strengthen the existing convergence results for this algorithm and introduce a slightly revised version for which convergence is established for a locally Lipschitz objective function without any additional assumptions. To handle nonconvexity, an augmented objective function and its piecewise linear approximation are used. Iterates of the proximal bundle algorithm are generated by solving a quadratic programming subproblem. Each subproblem is defined by the piecewise linear model of the augmented objective function, stabilized by a quadratic term centered at the best point obtained so far (referred to as the latest serious iterate). The quadratic term guarantees the existence and uniqueness of the minimizer and also keeps the approximation sufficiently local. Minimizing the resulting model yields a trial iterate and new bundle elements. In contrast to [8, 9, 12], we do not require any lower-\( C^{2} \), lower-\( C^{1} \) or regularity assumption, and the augmented function may be convex or nonconvex. Moreover, unlike [8, 9], we employ the upper envelope model of [5, 11, 12], which connects the convex model with the original nonconvex problem in the global convergence analysis. This modification, together with different techniques in the convergence theory, enables us to derive convergence results under weaker assumptions than those of [8, 9]. We prove that if the algorithm stops after a finite number of iterations, then the latest serious iterate is a stationary point. If the algorithm generates an infinite sequence of iterates, then two cases may occur: (I) if there is a finite number of serious iterates followed by an infinite number of null iterates, then the latest serious iterate is a stationary point; (II) if an infinite number of serious iterates is generated, then every accumulation point of this sequence is a stationary point. Finally, the algorithm is implemented in the MATLAB environment and applied to some nonsmooth test problems. Numerical results illustrate the efficiency of the proximal bundle algorithm in practical computation.

Throughout the paper, we use the following notation and definitions. We denote by \( \langle u,v\rangle :=\sum _{i=1}^{n} u_{i}v_{i} \) the inner product of two vectors \( u,v \in \mathbb {R}^{n} \) and by \( \Vert \cdot \Vert \) the standard Euclidean norm. For \( x \in \mathbb {R}^{n} \) and \( \varepsilon >0 \), \( B_{\varepsilon }(x) \) (\( \bar{B}_{\varepsilon }(x) \)) denotes the open (closed) ball of radius \( \varepsilon \) centered at x.

The function \( f : \mathbb {R}^{n} \rightarrow \mathbb {R} \) is convex if \( f(\lambda x+ (1-\lambda )y) \le \lambda f(x)+ (1-\lambda ) f(y) \), for all \( x,y \in \mathbb {R}^{n} \) and \( \lambda \in [0,1] \). The subdifferential of a convex function f at x is given by \(\partial _{c} f(x) := \{\xi \in \mathbb {R}^{n}| ~f(y) \ge f(x) +\langle \xi ,y-x \rangle , ~ \forall y\in \mathbb {R}^{n}\}\). For any \( \varepsilon \ge 0 \), the \( \varepsilon - \)subdifferential [21] of a convex function f at x is defined as

$$\begin{aligned} \partial _{\varepsilon } f(x) := \{\xi \in \mathbb {R}^{n}| ~f(y) \ge f(x)+\langle \xi ,y-x\rangle - \varepsilon , ~ \forall y\in \mathbb {R}^{n}\}. \end{aligned}$$
(2)

A function \( f : \mathbb {R}^{n} \rightarrow \mathbb {R} \) is said to be locally Lipschitz of rank \( L> 0 \) at \( x \in \mathbb {R}^{n} \) if for some \( \varepsilon > 0 \) we have \( |f(y) -f(z) | \le L \Vert y - z \Vert \) for all \( y , z \in B_{\varepsilon }(x) \). The Clarke subdifferential (generalized gradient) of f at x is defined as \( \partial f(x) := \{\xi \in \mathbb {R}^{n}| ~ \langle \xi , d\rangle \le f^{\circ }(x; d),~\forall d\in \mathbb {R}^{n}\} \), where \( f^{\circ }(x;d) \) is the Clarke generalized directional derivative of f at x in the direction d; for a convex function the Clarke subdifferential coincides with the convex subdifferential. Each element \( \xi \in \partial f(x) \) is called a subgradient of f at x. It is well known that \({\partial } f(x)\) is a nonempty convex compact subset of \( \mathbb {R}^{n} \), and that the set-valued map \( x\mapsto \partial f(x)\) is upper semicontinuous at every \( x\in \mathbb {R}^{n} \).
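As a simple illustration (not taken from the paper), let \( n=1 \) and \( f(x)=|x| \), which is locally Lipschitz of rank \( L=1 \) and convex; then

$$\begin{aligned} f^{\circ }(0;d)=\limsup _{y\rightarrow 0,\, t\downarrow 0}\frac{|y+td|-|y|}{t}=|d|, \qquad \partial f(0)=\{\xi \in \mathbb {R}~|~\xi d\le |d|,~\forall d\in \mathbb {R}\}=[-1,1], \end{aligned}$$

which coincides with the convex subdifferential of \( |\cdot | \) at 0, in agreement with the statement above.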

This paper is organized as follows. The new algorithm and its convergence analysis are described in Sect. 2. The corresponding encouraging computational results are reported in Sect. 3.

2 The proximal bundle algorithm

Suppose that \( x^{l} \) is the latest serious iterate, and let l and k denote the serious iteration counter and the iteration counter, respectively. We use \( \mathcal {L}_{k} \) and \( \{y_{j}\}_{j\in \mathcal {L}_{k}} \) to denote the index set and the set of bundle points, respectively, and we assume that the latest serious iterate is always one of the bundle points: \( x^{l}\in \{y_{j}\}_{j\in \mathcal {L}_{k}} \). As usual in bundle methods, the generated information is used to build a piecewise linear model and to compute a new iterate. Since f is only assumed to be locally Lipschitz, it may be nonconvex. Motivated by the methods presented in [8, 9, 11, 12] for the nonconvex case, we use the augmented function \( f_{\eta _{k}}(\cdot ,x^{l}):= f(\cdot )+ \frac{\eta _{k}}{2} \Vert \cdot -x^{l}\Vert ^{2} \), where \( \eta _{k}>0 \) is a parameter that is adjusted dynamically.

The piecewise linear model of the augmented function \( f_{\eta _{k}} \) at the \(k\)th iteration is formed as follows:

$$\begin{aligned} M^{k}(y,x^{l}):=f(x^{l})+\max _{j\in \mathcal {L}_{k}}\{-c_{j}^{l}+\langle \xi _{j}^{l},y-x^{l} \rangle \}, \end{aligned}$$
(3)

where \( \xi _{j}\in \partial f(y_{j}) \), \( e_{j}^{l}:=f(x^{l})-f(y_{j})-\langle \xi _{j},x^{l}-y_{j} \rangle \), \( c_{j}^{l}:=e_{j}^{l}+\eta _{k}b_{j}^{l} \), \( \xi _{j}^{l}:= \xi _{j}+\eta _{k} d_{j}^{l} \), \( d_{j}^{l}:=y_{j}-x^{l} \) and \( b_{j}^{l}:=\frac{\Vert y_{j}-x^{l}\Vert ^{2}}{2} \). The index set at the \(k\)th iteration is \( \mathcal {L}_{k}\subseteq \{1,2,\ldots ,k\} \) and the bundle of information is \( \mathcal {B}_{k}:=\{(\xi _{j},e_{j}^{l},b_{j}^{l},d_{j}^{l})\}_{j\in \mathcal {L}_{k}} \). Our aim is to keep \( c_{j}^{l} \), \( j\in \mathcal {L}_{k} \), nonnegative; for this purpose we take

$$\begin{aligned} \eta _{k}\ge \max \{ \max _{\begin{array}{c} j\in \mathcal {L}_{k}, y_{j} \ne x^{l} \end{array}} \frac{- 2 e_{j}^{l}}{\Vert y_{j}-x^{l}\Vert ^2} ,\omega \}+\omega , \end{aligned}$$
(4)

where \( \omega >0 \) is a constant. This choice of \( \eta _k \) is motivated by [8, 9, 12]. By (4), for all \( j\in \mathcal {L}_{k} \) such that \( y_{j} \ne x^{l} \), we have \( c_{j}^{l}=e_{j}^{l}+\frac{\eta _{k}}{2}\Vert y_{j}-x^{l}\Vert ^{2}\ge \frac{\omega }{2}\Vert y_{j}-x^{l}\Vert ^{2}\ge 0\). On the other hand, if \( y_{j} = x^{l} \), then \( c_{j}^{l}=0 \). Therefore \( c_{j}^{l}\ge 0 \) for all \( j\in \mathcal {L}_{k} \). In other words, the augmented linearization errors \( c_{j}^{l}=e_{j}^{l}+\eta _{k}b_{j}^{l} \) remain positive for all \( j\in \mathcal {L}_{k} \) with \( y_{j}\ne x^{l} \), and in addition we have

$$\begin{aligned} c_{j}^{l}=e_{j}^{l}+\eta _{k}b_{j}^{l}\ge \frac{\omega }{2}\Vert y_{j}-x^{l}\Vert ^2. \end{aligned}$$
(5)

The relation (5) is the key to the proof of parts (ii) and (iv) of Lemma 1. Taking the maximum with \( \omega \) inside (4) guarantees that \( \eta _{k} \) is positive, while adding the extra \( \omega \) ensures that \( \eta _{k} \) satisfies (5).
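To make the bookkeeping concrete, the following Python sketch (our own illustration with hypothetical names, not the authors' MATLAB code) computes \( \eta _{k} \) from (4), taken with equality as in Sect. 3, together with the augmented quantities \( c_{j}^{l} \) and \( \xi _{j}^{l} \) from the raw bundle data \( (y_{j}, f(y_{j}), \xi _{j}) \):

```python
import numpy as np

def augmented_bundle(x, fx, Y, FY, XI, omega):
    """Compute eta_k via (4) (with equality) and the augmented quantities of (3).

    x  : current serious iterate x^l, shape (n,)
    fx : f(x^l)
    Y  : bundle points y_j, shape (m, n)
    FY : values f(y_j), shape (m,)
    XI : subgradients xi_j in partial f(y_j), shape (m, n)
    """
    D = Y - x                                    # d_j^l = y_j - x^l
    sq = np.sum(D ** 2, axis=1)                  # ||y_j - x^l||^2
    e = fx - FY - np.sum(XI * (x - Y), axis=1)   # linearization errors e_j^l
    mask = sq > 0                                # bundle points with y_j != x^l
    inner = np.max(-2.0 * e[mask] / sq[mask]) if mask.any() else -np.inf
    eta = max(inner, omega) + omega              # rule (4), taken with equality
    c = e + 0.5 * eta * sq                       # c_j^l = e_j^l + eta_k b_j^l >= 0
    XIl = XI + eta * D                           # xi_j^l = xi_j + eta_k d_j^l
    return eta, c, XIl, D
```

When the bundle contains only the serious iterate itself, the sketch returns \( \eta _{k}=2\omega \), the smallest value allowed by (4).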

Since the latest serious iterate is one of the bundle points, there exists \( j(l)\in \mathcal {L}_{k} \) such that \( x^{l}=y_{j(l)} \), and hence \( M^{k}(x^{l},x^{l})=f(x^{l})+\max _{j\in \mathcal {L}_{k}}\{-c_{j}^{l}\}=f(x^{l}) \). In addition, by (3) and \( M^{k}(x^{l},x^{l})=f(x^{l}) \) we obtain \( M^{k}(y,x^{l})\ge M^{k}(x^{l},x^{l})+\langle \xi _{j}^{l},y-x^{l} \rangle -c_{j}^{l} \) for all \( y\in \mathbb {R}^{n} \) and \( j\in \mathcal {L}_{k} \). Using the definition of the \( \varepsilon \)-subdifferential in (2), we conclude

$$\begin{aligned} \xi _{j}^{l}\in \partial _{c_{j}^{l}} M^{k}(x^{l},x^{l}),\quad \forall j\in \mathcal {L}_{k}. \end{aligned}$$
(6)

To generate the next iterate \( y_{k+1} \), our bundle method chooses a proximal parameter \( \mu _{k}>0 \) and solves the following quadratic program

$$\begin{aligned} \min _{y\in \mathbb {R}^{n}}~~ M^{k}(y,x^{l})+ \frac{\mu _{k}}{2} \Vert y-x^{l}\Vert ^{2}. \end{aligned}$$
(7)

Clearly \( y_{k+1} \) is unique, since the objective function is strictly convex. Set \( d_{k+1}:=y_{k+1}-x^{l} \) and \( v_{k+1}:=M^{k}(y_{k+1},x^{l})-f(x^{l}) \). If \( y_{k+1}=x^{l} \), then \( v_{k+1}=0 \) and the algorithm stops. Therefore we assume \( y_{k+1}\ne x^{l} \). By uniqueness of \( y_{k+1} \) as the solution of the problem (7), we get \( M^{k}(y_{k+1},x^{l})+ \frac{\mu _{k}}{2} \Vert y_{k+1}-x^{l}\Vert ^{2}< M^{k}(x^{l},x^{l})+ \frac{\mu _{k}}{2} \Vert 0\Vert ^{2}= M^{k}(x^{l},x^{l})=f(x^{l}). \) On the other hand, using \( \frac{\mu _{k}}{2} \Vert y_{k+1}-x^{l}\Vert ^{2}\ge 0 \), we have \( M^{k}(y_{k+1},x^{l})<f(x^{l}) \) and \( v_{k+1}<0 \).

The problem (7) can be rewritten in the following smooth form

$$\begin{aligned} \min {\quad }&v+ \frac{\mu _{k}}{2} \Vert y-x^{l}\Vert ^{2}\nonumber \\&\langle \xi _{j}^{l},y-x^{l}\rangle -c_{j}^{l}\le v,\quad \forall j\in \mathcal {L}_{k},\nonumber \\&y\in \mathbb {R}^{n},~v\in \mathbb {R}. \end{aligned}$$
(8)

The quadratic dual problem of the problem (8) is formulated as follows:

$$\begin{aligned} \min {\quad }&\frac{1}{2\mu _{k}}\Vert \sum _{j\in \mathcal {L}_{k}} \lambda _{j}\xi _{j}^{l} \Vert ^{2}+\sum _{j\in \mathcal {L}_{k}} \lambda _{j}c_{j}^{l}\nonumber \\&\sum _{j\in \mathcal {L}_{k}}\lambda _{j}=1,~\lambda _{j}\ge 0,\quad \forall j\in \mathcal {L}_{k}. \end{aligned}$$
(9)

If \( (\lambda _{j})_{j\in \mathcal {L}_{k}} \) solves the problem (9), then the unique solution of the problem (8) is obtained, using the relationship between the primal and dual solutions, in the form

$$\begin{aligned} d_{k+1}=-\frac{1}{\mu _{k}}\sum _{j\in \mathcal {L}_{k}} \lambda _{j}\xi _{j}^{l},~~ \text {and} ~~ v_{k+1}=-\Big (\frac{1}{\mu _{k}} \Vert \sum _{j\in \mathcal {L}_{k}}\lambda _{j}\xi _{j}^{l} \Vert ^{2}+\sum _{j\in \mathcal {L}_{k}}\lambda _{j}c_{j}^{l}\Big ), \end{aligned}$$
(10)

where \( y_{k+1}=d_{k+1}+x^{l} \).
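As an illustration of how \( d_{k+1} \) and \( v_{k+1} \) can be obtained in practice, the following Python sketch solves the dual (9) over the unit simplex with scipy's SLSQP solver and then applies (10). This is only a stand-in for the authors' approach, which solves the primal form (8) with MATLAB's quadprog.m (see Sect. 3); the function and variable names are ours.

```python
import numpy as np
from scipy.optimize import minimize

def solve_dual_qp(XIl, c, mu):
    """Solve (9): min (1/(2 mu))||sum_j lam_j xi_j^l||^2 + sum_j lam_j c_j^l
    over the unit simplex, then recover d_{k+1} and v_{k+1} via (10).

    XIl : augmented subgradients xi_j^l, shape (m, n)
    c   : augmented linearization errors c_j^l, shape (m,)
    mu  : proximal parameter mu_k > 0
    """
    m = len(c)
    G = XIl @ XIl.T                                  # Gram matrix of the xi_j^l

    def obj(lam):
        return 0.5 / mu * lam @ G @ lam + lam @ c

    def grad(lam):
        return (G @ lam) / mu + c

    cons = [{"type": "eq", "fun": lambda lam: lam.sum() - 1.0,
             "jac": lambda lam: np.ones(m)}]
    res = minimize(obj, np.full(m, 1.0 / m), jac=grad,
                   bounds=[(0.0, None)] * m, constraints=cons, method="SLSQP")
    lam = res.x
    p = XIl.T @ lam                                  # aggregate sum_j lam_j xi_j^l
    d = -p / mu                                      # d_{k+1} from (10)
    v = -(p @ p / mu + lam @ c)                      # v_{k+1} from (10)
    return lam, d, v
```

The trial point is then \( y_{k+1}=x^{l}+d_{k+1} \), which coincides with the minimizer of the primal problem (8).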

Once the new trial iterate \( y_{k+1} \) has been computed, we check whether it provides a sufficient decrease of the objective function compared with the latest serious iterate. If the decrease is sufficient, the corresponding point is declared a new serious iterate (a so-called serious iteration). More precisely, when \(y_{k+1}\) satisfies the sufficient descent test

$$\begin{aligned} f(y_{k+1})\le f(x^{l})+m_{L}v_{k+1}, \end{aligned}$$
(11)

where \( m_{L}\in (0,1) \), then we have a new serious iterate and set \( x^{l+1}=y_{k+1} \). Hence the model \( M^{k}(\cdot ,x^{l}) \) must be updated and every element in \( \mathcal {B}_{k} \) that depends on l must be redefined. Motivated by formula (12) in [8], we update the elements of \( \mathcal {B}_{k} \) according to the following relations (see also the sketch after (12c)):

$$\begin{aligned} e_{j}^{l+1}&= e_{j}^{l} + f(x^{l+1})- f(x^{l})-\langle \xi _{j}, x^{l+1}-x^{l}\rangle , \end{aligned}$$
(12a)
$$\begin{aligned} b_{j}^{l+1}&= b_{j}^{l}+ \frac{1}{2}\Vert x^{l+1} -x^{l}\Vert ^{2} - \langle d_{j}^{l},x^{l+1} -x^{l}\rangle , \end{aligned}$$
(12b)
$$\begin{aligned} d_{j}^{l+1}&= d_{j}^{l}-(x^{l+1}-x^{l}). \end{aligned}$$
(12c)

If (11) does not hold, then \(y_{k+1}\) is a null iterate and the latest serious iterate \(x^l\) remains unchanged (a so-called null iteration). In this case the model is improved by adding new information to the bundle; that is, we update the index set and the bundle as \( \mathcal {L}_{k+1}=\mathcal {L}_{k}\cup \{k+1\} \) and \( \mathcal {B}_{k+1}=\mathcal {B}_{k}\cup \{(\xi _{k+1},e_{k+1}^{l},b_{k+1}^{l},d_{k+1}^{l})\} \). A new iterate \( y_{k+2} \) can then be calculated and the process is repeated. Null steps are performed until (11) is satisfied and a sufficient descent is reached.

We are now in a position to state the proximal bundle algorithm for solving the problem (1).

Algorithm 1 (the proximal bundle algorithm; presented as a figure in the original)
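Since the pseudocode of Algorithm 1 is displayed only as a figure, we give a schematic Python reconstruction of the main loop assembled from the description above; it reuses the helper sketches augmented_bundle and solve_dual_qp, fixes \( \mu _{k} \) and omits the bundle-size limit of Sect. 3, so its details (step numbering, parameter handling) may differ from the actual Algorithm 1.

```python
import numpy as np
# assumes augmented_bundle and solve_dual_qp from the sketches above are in scope

def proximal_bundle(f, subgrad, x1, omega=1.2, mL=1e-2, mu=1.0,
                    tol=1e-8, max_iter=1000):
    """Schematic reconstruction of the proximal bundle loop of Sect. 2.
    The model data are recomputed from the raw bundle (y_j, f(y_j), xi_j) at every
    iteration, which is mathematically equivalent to the incremental updates
    (12a)-(12c) and needs no extra function or subgradient evaluations."""
    x = np.asarray(x1, dtype=float)
    fx = f(x)
    Y, FY, XI = x[None, :].copy(), np.array([fx]), np.atleast_2d(subgrad(x))
    for k in range(max_iter):
        eta, c, XIl, D = augmented_bundle(x, fx, Y, FY, XI, omega)  # model data, (3)-(4)
        lam, d, v = solve_dual_qp(XIl, c, mu)                       # subproblem (9)-(10)
        if -v <= tol:                        # stopping test: v_{k+1} = 0
            break
        y = x + d                            # trial point y_{k+1}
        fy, xi = f(y), subgrad(y)
        if fy <= fx + mL * v:                # descent test (11): serious step
            x, fx = y, fy
        # serious or null step: enrich the bundle with the new trial information
        Y, FY, XI = np.vstack([Y, y]), np.append(FY, fy), np.vstack([XI, xi])
    return x, fx
```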

Remark 1

In the sequel, we suppose that \( \{\eta _{k}\}_{k} \) is bounded. By (4) we have \( \eta _{k}\ge 2\omega \) for all k; therefore we assume that there exists \( \bar{\eta } \) such that \( 2\omega \le \eta _{k}\le \bar{\eta } \) for all k. Since the value of \( \bar{\eta } \) is needed neither to run the algorithm nor, explicitly, in its analysis, this assumption does not restrict the implementation of the algorithm. The boundedness of \( \{\eta _{k}\}_{k} \) has been assumed in [9] for \( \mathop {\mathrm {lower}}-C^{1} \) functions and in [11, 12] for regular functions; here we impose it for locally Lipschitz functions.

For analytical purposes, motivated by [5, 11, 12, 22], we define the upper envelope model associated with the cutting planes. Assume that \( \bar{B}_{R}(x) \) is a fixed closed ball large enough to contain all possible trial steps \(y^{+}\), and set \( \mathcal {B}(x):=\{y^{+}|~y^{+}\in \bar{B}_{R}(x),~y^{+}\text { is a trial point} \} \). The upper envelope model \( M^{\uparrow }(\cdot ,x):\mathbb {R}^{n}\rightarrow \mathbb {R} \) of the objective function f associated with a given point \( x\in \mathbb {R}^{n} \) is defined as

$$\begin{aligned} M^{\uparrow }(y,x):= f(x)+\sup _{2\omega \le \eta \le \bar{\eta },~y^{+}\in \mathcal {B}(x),~\xi \in \partial f(y^{+})} \{ m_{y^{+},\xi ,\eta }(y,x) \}, \end{aligned}$$

where \( 2\omega \le \eta \le \bar{\eta } \) with \( \bar{\eta } \) as in Remark 1. Here \( m_{y^{+},\xi ,\eta }(y,x) \) is the cutting plane at the serious iterate x and the trial step \( y^{+} \), defined by \( m_{y^{+},\xi ,\eta }(y,x) :=- \frac{\omega }{2}\Vert y^{+}-x\Vert ^{2}+ \langle \xi +\eta (y^{+}-x), y-x\rangle . \) The boundedness of \( \bar{B}_{R}(x) \) and the bounds on \( \eta \) imply that \( M^{\uparrow }(\cdot ,x) \) is finite everywhere. Some useful properties of the upper envelope model \( M^{\uparrow }(\cdot ,x) \) are stated in Lemma 1. Similar results can be found in [5, Lemma 6.1]; however, the definitions of the upper envelope model \( M^{\uparrow }(\cdot ,x) \) and of its cutting planes differ from those in [5]. The proof follows from [12, Lemma 5] with slight modifications.

Lemma 1

Suppose that \( f:\mathbb {R}^{n}\rightarrow \mathbb {R} \) is a locally Lipschitz function, then the following hold:

  (i) \( M^{\uparrow }(\cdot ,x) \) is a convex function.

  (ii) \( M^{k}(\cdot ,x) \le M^{\uparrow }(\cdot ,x) \), \(\forall k \).

  (iii) \( M^{\uparrow }(x,x)=f(x). \)

  (iv) \( \partial _{c} M^{\uparrow }(x,x)\subseteq \partial f(x). \)

Now we examine the convergence properties of Algorithm 1. We consider the different cases that may occur during its execution and state the corresponding results.

Theorem 1

Assume that f is a locally Lipschitz function, \( \text {tol}=0 \) and \( \mu _{k}>0 \), for all k. If Algorithm 1 stops with a finite number of iterations, then the latest serious iterate \( x^{l} \) is a stationary point for the problem (1).

Proof

According to the assumption, Algorithm 1 stops with a finite number of iterations. This can happen at Step 2 with \( -v_{k+1}\le 0 \) and since \( -v_{k+1}\ge 0 \) we obtain \( -v_{k+1}=0 \). By definition \(-v_{k+1}=\frac{1}{\mu _{k}}\Vert \sum _{j\in \mathcal {L}_{k}} \lambda _{j}\xi _{j}^{l}\Vert ^{2}+\sum _{j\in \mathcal {L}_{k}}\lambda _{j}c_{j}^{l} \) and using \( \mu _{k}>0 \), we deduce

$$\begin{aligned} \Vert \sum _{j\in \mathcal {L}_{k}}\lambda _{j}\xi _{j}^{l}\Vert ^{2}=0,\quad \text {and} \quad \sum _{j\in \mathcal {L}_{k}}\lambda _{j}c_{j}^{l}=0. \end{aligned}$$
(13)

By (6) we have \( \xi _{j}^{l}\in \partial _{c_{j}^{l}}M^{k}(x^{l},x^{l}) \), and consequently \( M^{k}(y,x^{l})\ge M^{k}(x^{l},x^{l})+ \langle \xi _{j}^{l},y-x^{l}\rangle - c_{j}^{l}\), for all \( y\in \mathbb {R}^{n} \). Taking into account that \( \lambda _{j}\ge 0 \) and \( \sum _{j\in \mathcal {L}_{k}}\lambda _{j}=1 \), we get \( M^{k}(y,x^{l})\ge M^{k}(x^{l},x^{l})+ \langle \sum _{j\in \mathcal {L}_{k}}\lambda _{j}\xi _{j}^{l},y-x^{l}\rangle - \sum _{j\in \mathcal {L}_{k}}\lambda _{j} c_{j}^{l} \). By \( M^{k}(x^{l},x^{l})=f(x^{l}) \) and Lemma 1 (ii)–(iii), we conclude that \(M^{\uparrow }(y,x^{l})\ge M^{\uparrow }(x^{l},x^{l})+ \langle \sum _{j\in \mathcal {L}_{k}}\lambda _{j}\xi _{j}^{l},y-x^{l}\rangle - \sum _{j\in \mathcal {L}_{k}}\lambda _{j} c_{j}^{l}\). Using (13) and Lemma 1 (i), (iv), we get \( 0\in \partial _{c} M^{\uparrow }(x^{l},x^{l}) \) and hence \( 0\in \partial f(x^{l}) \). Consequently the latest serious iterate \( x^{l} \) is a stationary point, and the proof is complete.\(\square \)

From now on, we assume that Algorithm 1 does not stop and generates an infinite sequence of iterates. We consider the different cases that may occur in this situation.

Theorem 2

Assume that f is a locally Lipschitz function, \( \{\eta _{k}\}_{k} \) is bounded above and \( \mu _{k} \le \mu _{k+1} \le \bar{\mu } \) for all k. If Algorithm 1 generates infinitely many iterations but only a finite number of serious iterations, then the latest serious iterate \( x^{l} \) is a stationary point for the problem (1).

Proof

Assume that Algorithm 1 produces a finite number of serious iterates followed by an infinite number of null iterates. Suppose that \( y_{k+1} \) is the optimal solution of the problem (8) and \( d_{k+1}=y_{k+1}-x^{l} \). Throughout this process the latest serious iterate \( x^{l} \) remains fixed, and we show that it is a stationary point. First, we show that the sequence \( \{M^{k}(y_{k+1},x^{l})+ \frac{\mu _{k}}{2} \Vert d_{k+1}\Vert ^2\}_{k} \) is bounded above and nondecreasing.

Since the problem (7) is strictly convex, its solution \( y_{k+1} \) is unique. Thus \( M^{k}(y_{k+1},x^{l})+ \frac{\mu _{k}}{2} \Vert d_{k+1}\Vert ^2< M^{k}(y,x^{l})+ \frac{\mu _{k}}{2} \Vert y-x^{l}\Vert ^2 \) for all \( y\ne y_{k+1} \). Setting \( y=x^{l} \) gives \( M^{k}(y_{k+1},x^{l})+ \frac{\mu _{k}}{2} \Vert d_{k+1}\Vert ^2 < M^{k}(x^{l},x^{l})=f(x^{l}) \). Therefore the sequence \( \{M^{k}(y_{k+1},x^{l})+ \frac{\mu _{k}}{2} \Vert d_{k+1}\Vert ^2\}_{k} \) is bounded from above by \( f(x^{l}) \) (recall that \( x^{l} \) is fixed here). Next we prove that this sequence is nondecreasing:

$$\begin{aligned}&M^{k+1}(y_{k+2},x^{l})+ \frac{\mu _{k+1}}{2} \Vert d_{k+2}\Vert ^2\\&\ge M^{k+1}(y_{k+2},x^{l})+ \frac{\mu _{k}}{2} \Vert d_{k+2}\Vert ^2\\&\ge f(x^{l})-c_{j}^{l}+\langle \xi _{j}^{l},y_{k+2}-x^{l} \rangle + \frac{\mu _{k}}{2} \Vert d_{k+2}\Vert ^2\\&= f(x^{l})-c_{j}^{l}+\langle \xi _{j}^{l},d_{k+2}-d_{k+1} \rangle +\langle \xi _{j}^{l},d_{k+1} \rangle + \frac{\mu _{k}}{2} \Vert d_{k+2}-d_{k+1}+d_{k+1}\Vert ^2\\&= f(x^{l})-c_{j}^{l}+\langle \xi _{j}^{l},d_{k+2}-d_{k+1} \rangle +\langle \xi _{j}^{l},d_{k+1} \rangle \\&\quad + \frac{\mu _{k}}{2} \Vert d_{k+2}-d_{k+1}\Vert ^{2}+\frac{\mu _{k}}{2}\Vert d_{k+1}\Vert ^2+\mu _{k} \langle d_{k+2}-d_{k+1},d_{k+1} \rangle , \end{aligned}$$

where the first inequality follows from \( \mu _{k} \le \mu _{k+1} \) and the second from the definition of \( M^{k+1}(\cdot ,x^{l}) \); the remaining relations are straightforward. The above relation holds for all \( j\in \mathcal {L}_{k+1} \), and \( \mathcal {L}_{k+1}=\mathcal {L}_{k}\cup \{k+1\} \). Multiplying the relation for each \( j\in \mathcal {L}_{k} \) by the corresponding \( \lambda _{j} \), summing up and using \( \sum _{j\in \mathcal {L}_{k}}\lambda _{j}=1 \), we arrive at

$$\begin{aligned}&M^{k+1}(y_{k+2},x^{l})+ \frac{\mu _{k+1}}{2} \Vert d_{k+2}\Vert ^2\ge f(x^{l})-\sum _{j\in \mathcal {L}_{k}}\lambda _{j}c_{j}^{l}+\langle \sum _{j\in \mathcal {L}_{k}}\lambda _{j}\xi _{j}^{l},d_{k+2}-d_{k+1} \rangle \\&+ \langle \sum _{j\in \mathcal {L}_{k}}\lambda _{j}\xi _{j}^{l},d_{k+1} \rangle + \frac{\mu _{k}}{2} \Vert d_{k+2}-d_{k+1}\Vert ^{2}+ \frac{\mu _{k}}{2}\Vert d_{k+1}\Vert ^2+\mu _{k} \langle d_{k+2}-d_{k+1},d_{k+1} \rangle . \end{aligned}$$

Since \( d_{k+1}=-\frac{1}{\mu _{k}}\sum _{j\in \mathcal {L}_{k}} \lambda _{j}\xi _{j}^{l} \), we have

$$\begin{aligned}&M^{k+1}(y_{k+2},x^{l})+ \frac{\mu _{k+1}}{2} \Vert d_{k+2}\Vert ^2\\&\ge f(x^{l})-\sum _{j\in \mathcal {L}_{k}}\lambda _{j}c_{j}^{l}+ \frac{\mu _{k}}{2}\Vert d_{k+1}\Vert ^2+\langle \sum _{j\in \mathcal {L}_{k}} \lambda _{j}\xi _{j}^{l},-\frac{1}{\mu _{k}}\sum _{j\in \mathcal {L}_{k}} \lambda _{j}\xi _{j}^{l} \rangle \\&\quad + \frac{\mu _{k}}{2} \Vert d_{k+2}-d_{k+1}\Vert ^{2}\\&=f(x^{l})-\big (\sum _{j\in \mathcal {L}_{k}}\lambda _{j}c_{j}^{l}+ \frac{1}{\mu _{k}}\Vert \sum _{j\in \mathcal {L}_{k}}\lambda _{j}\xi _{j}^{l} \Vert ^{2}\big )+ \frac{\mu _{k}}{2}\Vert d_{k+1}\Vert ^2+\frac{\mu _{k}}{2} \Vert d_{k+2}-d_{k+1}\Vert ^{2}\\&= f(x^{l})+v_{k+1}+ \frac{\mu _{k}}{2}\Vert d_{k+1}\Vert ^2+ \frac{\mu _{k}}{2} \Vert d_{k+2}-d_{k+1}\Vert ^{2}\\&=M^{k}(y_{k+1},x^{l})+\frac{\mu _{k}}{2}\Vert d_{k+1}\Vert ^2 +\frac{\mu _{k}}{2} \Vert d_{k+2}-d_{k+1}\Vert ^{2}\\&\ge M^{k}(y_{k+1},x^{l})+\frac{\mu _{k}}{2}\Vert d_{k+1}\Vert ^2. \end{aligned}$$

Therefore the sequence \( \{M^{k}(y_{k+1},x^{l})+ \frac{\mu _{k}}{2} \Vert d_{k+1}\Vert ^2\}_{k} \) is nondecreasing and bounded above, hence convergent. Passing to the limit in the above inequality as \( k\rightarrow \infty \), we get \( \lim _{k \rightarrow \infty } \frac{\mu _{k}}{2} \Vert d_{k+2}-d_{k+1}\Vert ^{2} = 0 \). Since \( \mu _{k}\le \mu _{k+1} \le \bar{\mu } \), the sequence \( \{\mu _{k}\}_{k} \) is nondecreasing and therefore bounded below by a positive constant \( \mu _{k^{*}} \), its value at the first iteration of this cycle of null steps. This implies that \( d_{k+2}-d_{k+1} \rightarrow 0 \) as \( k\rightarrow \infty \). Moreover, taking \( j=j(l) \) in (3) (recall that \( y_{j(l)}=x^{l} \), so \( c_{j(l)}^{l}=0 \) and \( \xi _{j(l)}^{l}=\xi _{j(l)} \)) gives \( M^{k}(y_{k+1},x^{l})\ge f(x^{l})+\langle \xi _{j(l)},d_{k+1}\rangle \); combining this with \( M^{k}(y_{k+1},x^{l})+\frac{\mu _{k}}{2}\Vert d_{k+1}\Vert ^{2}<f(x^{l}) \) yields \( \frac{\mu _{k^{*}}}{2}\Vert d_{k+1}\Vert ^{2}\le \Vert \xi _{j(l)}\Vert \Vert d_{k+1}\Vert \), so \( \{d_{k}\}_{k} \) is bounded. Using the definition of \( M^{k+1}(y_{k+2},x^{l}) \), for all \( j\in \mathcal {L}_{k+1} \) we have \( M^{k+1}(y_{k+2},x^{l})\ge f(x^{l})-c_{j}^{l}+\langle \xi _{j}^{l},y_{k+2}-x^{l} \rangle \). Setting \( j=k+1 \) in this relation, we obtain

$$\begin{aligned} M^{k+1}(y_{k+2},x^{l})&\ge f(x^{l})-c_{k+1}^{l}+\langle \xi _{k+1}^{l},d_{k+2} \rangle \\&= f(y_{k+1})+\langle \xi _{k+1},x^{l}-y_{k+1} \rangle -\frac{\eta _{k+1}}{2}\Vert y_{k+1}-x^{l}\Vert ^{2}+\langle \xi _{k+1}^{l},d_{k+2} \rangle \\&= f(y_{k+1})-\langle \xi _{k+1},d_{k+1} \rangle -\frac{\eta _{k+1}}{2}\Vert d_{k+1}\Vert ^{2} +\langle \xi _{k+1}^{l},d_{k+2}\rangle \\&\ge f(y_{k+1})-\langle \xi _{k+1},d_{k+1} \rangle -\eta _{k+1}\Vert d_{k+1}\Vert ^{2} +\langle \xi _{k+1}^{l},d_{k+2}\rangle \\&= f(y_{k+1})-\langle \xi _{k+1},d_{k+1} \rangle -\eta _{k+1} \langle y_{k+1}-x^{l},d_{k+1}\rangle +\langle \xi _{k+1}^{l},d_{k+2}\rangle \\&= f(y_{k+1})-\langle \xi _{k+1}+\eta _{k+1}(y_{k+1}-x^{l}),d_{k+1}\rangle +\langle \xi _{k+1}^{l},d_{k+2}\rangle \\&= f(y_{k+1})+\langle \xi _{k+1}^{l},d_{k+2}-d_{k+1}\rangle \\&> f(x^{l})+m_{L} v_{k+1}+\langle \xi _{k+1}^{l},d_{k+2}-d_{k+1}\rangle . \end{aligned}$$

By the definition of \( v_{k+2} \), namely \( v_{k+2}=M^{k+1}(y_{k+2},x^{l})-f(x^{l}) \), we get

$$\begin{aligned} 0 \le -v_{k+2}&=f(x^{l})-M^{k+1}(y_{k+2},x^{l}) < -m_{L}v_{k+1}- \langle \xi _{k+1}^{l},d_{k+2}-d_{k+1} \rangle . \end{aligned}$$

Since \( d_{k+2}-d_{k+1}\rightarrow 0 \) as \( k\rightarrow \infty \), \( m_{L}\in (0,1) \) and \( \{\xi _{k+1}^{l}\}_{k} \) is bounded (because \( \{d_{k+1}\}_{k} \) and \( \{\eta _{k}\}_k \) are bounded and f is locally Lipschitz), it follows that \( -v_{k+1}\rightarrow 0 \). By the expression for \( v_{k+1}\) in (10), \( c_{j}^{l}\ge 0 \) for all \( j\in \mathcal {L}_{k} \) and \( \mu _{k}\le \bar{\mu } \), we get

$$\begin{aligned} \sum _{j\in \mathcal {L}_{k}} \lambda _{j}\xi _{j}^{l}\rightarrow 0,\quad \text {and}\quad \sum _{j\in \mathcal {L}_{k}} \lambda _{j}c_{j}^{l}\rightarrow 0,\quad k\rightarrow \infty . \end{aligned}$$
(14)

Using (6) and the definition of the \(\varepsilon \)-subdifferential, we deduce \( M^{k}(y,x^{l})\ge f(x^{l})+\langle \xi _{j}^{l},y-x^{l}\rangle -c_{j}^{l} \). Multiplying this relation by \( \lambda _{j}\ge 0 \), summing up and using \( \sum _{j\in \mathcal {L}_{k}}\lambda _{j}=1 \), we obtain \( M^{k}(y,x^{l})\ge f(x^{l})+\langle \sum _{j\in \mathcal {L}_{k}}\lambda _{j}\xi _{j}^{l},y-x^{l}\rangle -\sum _{j\in \mathcal {L}_{k}}\lambda _{j}c_{j}^{l} \). From Lemma 1 (ii)–(iii), we get \( M^{\uparrow }(y,x^{l})\ge M^{\uparrow }(x^{l},x^{l})+\langle \sum _{j\in \mathcal {L}_{k}}\lambda _{j}\xi _{j}^{l},y-x^{l}\rangle - \sum _{j\in \mathcal {L}_{k}}\lambda _{j}c_{j}^{l} \). Taking the limit as \( k\rightarrow \infty \) and using (14), we deduce \(M^{\uparrow }(y,x^{l})\ge M^{\uparrow }(x^{l},x^{l})\) for all \( y\in \mathbb {R}^{n} \). Hence by Lemma 1 (i) and (iv) we obtain \( 0\in \partial _{c} M^{\uparrow }(x^{l},x^{l}) \), and thus \( 0\in \partial f(x^{l}) \). \(\square \)

Theorem 3

Assume that f is locally Lipschitz and there exists \( \bar{\mu } >0 \) such that \( \mu _{k} \le \bar{\mu } \) for all k. In addition, suppose that the level set \( A(x^{1}):=\{x\in \mathbb {R}^{n}~|~f(x)\le f(x^{1})\} \) is bounded. If Algorithm 1 generates an infinite number of serious iterates, then every accumulation point of the serious iterate sequence is stationary for the problem (1).

Proof

By assumption, the algorithm generates an infinite sequence of serious iterates \( \{x^{l}\}_{l} \). The method is of descent type, so \( \{x^{l}\}_{l}\subseteq A(x^{1}) \). Since f is continuous and \( A(x^{1}) \) is bounded (and closed), the sequence \( \{f(x^{l})\}_{l} \) is bounded below. On the other hand, for each l we have \(f(x^{l+1})\le f(x^{l}) +m_{L}v_{k(l)+1}\), where k(l) denotes the iteration at which \( x^{l+1} \) is accepted. Since \( \{f(x^{l})\}_{l} \) is nonincreasing and bounded below, we deduce \( v_{k(l)+1}\rightarrow 0 \) as \( l\rightarrow \infty \). Using \( \mu _{k}\le \bar{\mu } \) we conclude

$$\begin{aligned} \sum _{j\in \mathcal {L}_{k}}\lambda _{j}\xi _{j}^{l}\rightarrow 0,\quad \text {and}\quad \sum _{j\in \mathcal {L}_{k}}\lambda _{j} c_{j}^{l}\rightarrow 0,\quad ~ l\rightarrow \infty . \end{aligned}$$
(15)

On the other hand, the level set \( A(x^{1}) \) is bounded by assumption. Therefore the sequence \( \{x^{l}\}_{l} \) is bounded and has a convergent subsequence \( \{x^{l}\}_{l\in \mathcal {A}}\subseteq \{x^{l}\}_{l} \); that is, there exists \( x^{*}\in \mathbb {R}^{n} \) such that \( x^{l}\rightarrow _{l\in \mathcal {A}} x^{*} \). By (6) we have \( \xi _{j}^{l}\in \partial _{c_{j}^{l}}M^{k}(x^{l},x^{l}) \), and consequently \( M^{k}(y,x^{l})\ge M^{k}(x^{l},x^{l})+ \langle \xi _{j}^{l},y-x^{l}\rangle -c_{j}^{l}\) for all \( y\in \mathbb {R}^{n} \). By Lemma 1 (ii)–(iii), \( \lambda _{j}\ge 0 \) and \( \sum _{j\in \mathcal {L}_{k}}\lambda _{j}=1 \), we have \( M^{\uparrow }(y,x^{l})\ge M^{\uparrow }(x^{l},x^{l})+ \langle \sum _{j\in \mathcal {L}_{k}}\lambda _{j}\xi _{j}^{l},y-x^{l}\rangle - \sum _{j\in \mathcal {L}_{k}}\lambda _{j} c_{j}^{l} \). Passing to the limit in the above relation for \( l\in \mathcal {A} \), \( l\rightarrow \infty \), and using (15), we obtain \(M^{\uparrow }(y,x^{*}) \ge M^{\uparrow }(x^{*},x^{*})\) for all \( y\in \mathbb {R}^{n} \). By Lemma 1 (i) and (iv), we deduce \( 0\in \partial f(x^{*}) \), and thus \( x^{*} \) is a stationary point. \(\square \)

3 Numerical experiments

In this section, we report the results of numerical experiments for the proposed algorithm and compare it with two existing algorithms from the literature: SolvOpt (Solver for local nonlinear optimization problems) [24] and HANSO 2.2 (Hybrid Algorithm for Nonsmooth Optimization, Version 2.2) [4, 15]. All codes are written in MATLAB R2016a and run on a PC with an Intel Core i5 CPU at 2.5 GHz and 4 GB of RAM. The numbers of function evaluations and subgradient evaluations are used as measures of efficiency. For SolvOpt and HANSO 2.2, the parameters are set to the default values suggested by the respective codes and the stopping tolerance is \( \text {tol}:=10^{-8} \). We set the parameters of Algorithm 1 as \( \text {tol}:=10^{-8} \), \( m_{L}:=10^{-2} \) and \( \omega :=1.2 \). The proximal parameter is fixed at one in all iterations, i.e., \( \mu _{k}:=1 \) for all k. Since the size of the bundle must be limited, we choose \( L_{\max }:=n+50 \) as the maximal bundle size, where n is the dimension of each problem. The index set \( \mathcal {L}_{k} \) and the bundle of information \( \mathcal {B}_{k} \) are updated as in Step 3 if \( |\mathcal {L}_{k}|<L_{\max } \); if \( |\mathcal {L}_{k}|=L_{\max } \), then the set \(\mathcal {L}_{k+1}=\mathcal {L}_{k}\cup \{k+1\}\setminus \{k-L_{\max }\}\) is used and the bundle is updated accordingly. The quadratic programming solver is \( \mathsf {quadprog.m} \), which is available in the MATLAB Optimization Toolbox. In our experiments, \( \eta _{k} \) is chosen so that (4) holds with equality.
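For illustration, the sliding-window bundle management can be sketched as follows; the paper removes the index \( k-L_{\max } \), while this hypothetical Python helper simply drops the oldest stored index, which plays the same role:

```python
def truncate_bundle(Lk, k, L_max):
    """Append the new index k+1 and, once the bundle is full,
    discard the oldest stored index (sliding window of size L_max)."""
    Lk = Lk + [k + 1]
    if len(Lk) > L_max:
        Lk = Lk[1:]
    return Lk
```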

Table 1 Results of P1–P20 with given starting points
Table 2 Average results of P1–P20 with 20 random starting points
Table 3 Results of large scale problems with various starting points and dimensions

To measure the efficiency of the considered algorithms, two classes of test problems from [2] are used. The first class consists of problems 1–10 (denoted by P1–P10) and 21–30 (denoted by P11–P20), which have a fixed number of variables; the problems in the second class can be formulated with any number of variables.

We first applied the algorithms to problems P1–P20 using the starting points given in [2]. The results are presented in Table 1, where \( f_{\mathop {\mathrm {final}}} \) denotes the value of the objective function at the final point, and \(n_f\) and \(n_{\xi }\) denote the numbers of function and subgradient evaluations, respectively. The numerical experiments show that Algorithm 1 behaves acceptably on nonsmooth problems: it uses the smallest number of function evaluations for 17 problems, while HANSO 2.2 needs the smallest number for 3 problems. On the other hand, SolvOpt uses the smallest number of subgradient evaluations for 12 problems and Algorithm 1 for 6 problems. For more details and a comparison of the average numbers of evaluations, see Table 1.

In the next experiment, we use 20 randomly generated starting points (obtained with the \(\mathsf {randn.m}\) function in MATLAB) for each problem; the starting points are the same for all algorithms. To compare the performance of the algorithms, we use the indicator \(n_b\), the number of successful runs with respect to the best known solution. We say that an algorithm finds a solution of a problem with tolerance \( \varepsilon >0 \) if \( |f_{\mathop {\mathrm {final}}}-f^{*}|/(1+|f^{*}|)\le \varepsilon \), where \(f^{*}\) is the optimal value; in our experiments \( \varepsilon =5\times 10^{-4} \). The results are presented in Table 2, which reports \(n_b\) for each problem as well as the average objective function value (denoted by \(f_{\text {ave}}\)) and the average numbers of function and subgradient evaluations (denoted by \(n_{f}^{\text {ave}}\) and \(n_{\xi }^{\text {ave}}\), respectively) over the 20 runs for each problem and solver. We observe that Algorithm 1 successfully solves 355 examples, whereas HANSO 2.2 and SolvOpt solve 318 and 344 examples, respectively.
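The success test used for the \(n_b\) counts is a direct transcription of the relative-error criterion above; a minimal Python sketch:

```python
def solved(f_final, f_star, eps=5e-4):
    """Return True if the computed value meets the relative-error tolerance."""
    return abs(f_final - f_star) / (1.0 + abs(f_star)) <= eps
```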

In the third experiment, we consider the generalizations of MAXQ, GMXHIB and Chained Crescent I [2], which can be formulated with any number of variables. We use \( n=5,10,50,100,200 \) and 4 different starting points; the starting points are the same for all test problems and all three algorithms. The results are summarized in Table 3 and show that the values of \(f_{\mathop {\mathrm {final}}}\) obtained by Algorithm 1 are accurate for all problems, whereas for MAXQ with \( x^{1}=[1,1,\ldots ,1] \) and \( x^{1}=[-1.5,2,\ldots ] \) HANSO 2.2 cannot compute any descent direction for \( n=5,10,50,100,200 \) and SolvOpt fails for \( n=50,100,200 \). For a more detailed comparison, see Table 3.

In addition, to compare the considered algorithms further, we analyze the results using performance profiles; see [7] for their definition. Here the numbers of function evaluations and of subgradient evaluations are used as the efficiency measures defining the profiles. In a performance profile, the value of \( \rho _{s}(\tau ) \) at \( \log (\tau )=0 \) gives the ratio of test problems for which the solver s is the best, i.e., uses the smallest number of function (or subgradient) evaluations. The value of \( \rho _{s}(\tau ) \) at the rightmost abscissa gives the ratio of test problems that the solver s can solve, i.e., the robustness of the solver s. Moreover, a higher curve indicates a better solver.
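For reference, a performance profile in the sense of [7] can be computed as in the following Python sketch (the standard Dolan–Moré construction; the array layout and names are ours, and failed runs are encoded as infinite cost):

```python
import numpy as np

def performance_profile(T, taus):
    """T[p, s]: cost (e.g. n_f) of solver s on problem p, np.inf for failed runs.
    Returns rho[s, i] = fraction of problems whose performance ratio is <= taus[i]."""
    best = np.min(T, axis=1, keepdims=True)          # best cost per problem
    R = T / best                                     # performance ratios r_{p,s}
    n_solvers = T.shape[1]
    rho = np.array([[np.mean(R[:, s] <= tau) for tau in taus]
                    for s in range(n_solvers)])
    return rho
```

The curves in Figs. 1, 2 and 3 are plots of \( \rho _{s}(\tau ) \) against \( \log (\tau ) \).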

The results for P1–P20 with the starting points given in [2] are reported in part (a) of Figs. 1, 2 and 3. Part (b) shows the results for P1–P20 with 20 random starting points together with the results for MAXQ, GMXHIB and Chained Crescent I with dimensions \( n=5,10 \). Part (c) reports the results for MAXQ, GMXHIB and Chained Crescent I with dimensions \( n=50,100,200 \).

From Fig. 1, we deduce that Algorithm 1 is more efficient than HANSO 2.2 in terms of the numbers of function and subgradient evaluations for the small problems with the given and random starting points, since it is superior to HANSO 2.2 in Fig. 1(a), (b). Algorithm 1 solves more than \( 94\% \) of the small problems with various starting points, whereas HANSO 2.2 solves nearly \( 85\% \). For most of the small test problems, Algorithm 1 is better than HANSO 2.2, i.e., it uses the smallest number of function and subgradient evaluations. For the large problems, whose results are shown in Fig. 1(c), HANSO 2.2 is better than Algorithm 1 for \( 60\% \) of the problems and Algorithm 1 is better for \( 40\% \). Moreover, Algorithm 1 solves \( 100\% \) of the problems successfully, whereas HANSO 2.2 solves \( 83\% \).

Fig. 1

Comparison between Algorithm 1 and HANSO 2.2, Number of function and subgradient evaluations

From Fig. 2, it is clear that, in terms of the number of function evaluations, Algorithm 1 is better than SolvOpt for all considered problems.

Fig. 2

Comparison between Algorithm 1 and SolvOpt, Number of function evaluations

From Fig. 3(a), (b), we deduce that, for the small test problems with the given and random starting points, SolvOpt is more efficient than Algorithm 1 when the number of subgradient evaluations is taken as the measure of efficiency. For the large problems, Fig. 3(c), SolvOpt is better than Algorithm 1 for nearly \( 55\% \) of the problems, while Algorithm 1 is better for \( 45\% \). On the other hand, SolvOpt solves nearly \( 83\% \) of the problems successfully, while Algorithm 1 solves \( 100\% \). Overall, the performances of these two solvers are comparable.

Fig. 3

Comparison between Algorithm 1 and SolvOpt, Number of subgradient evaluations

Overall, our results show that, in general, the proposed method is efficient and more robust than other methods used in the numerical experiments.

Finally, we examine the assumption that the parameter \( \eta _k \) remains bounded. In [9] the authors report that \( \eta _k \) was less than \(2n+2\) when solving 73 of the 75 unconstrained test problems; for one problem \( \eta _k \) was between \( 2n+2 \) and 25n, and for the remaining one \( \eta _k \) exceeded 25n. In [12], the authors state that \( \eta _k=2\omega \) for 85 problems, \( \eta _k \) is bounded by 2n for 55 problems and it lies between 2n and 25n for 10 problems. In our experiments, \( \eta _k=2\omega \) for 315 examples, \( \eta _k \) is bounded by 2n for 127 cases, it lies between 2n and 25n for 36 examples, and for 3 cases \( \eta _k \) is bounded by 31n. Altogether, the numerical experiments support that the boundedness assumption on the sequence \( \{\eta _{k}\} \) is reasonable.