1 Introduction

We consider the following constrained optimization problem

$$\begin{aligned} \min _{x \in X}&\quad f(x), \nonumber \\&\quad g(x) \le 0, \end{aligned}$$
(1)

where the objective and constraint functions \( f, g{:}\,\mathbb {R}^{n} \rightarrow \mathbb {R} \) are locally Lipschitz but potentially nonsmooth and nonconvex. In addition \( X\subseteq \mathbb {R}^n \) is a convex compact set.

Several numerical methods have been proposed for solving nonsmooth problems. Two of the most efficient classes of methods are subgradient methods [17, 34] and bundle methods. Bundle methods were first introduced by Lemaréchal [23] and have been developed over the years. Proximal bundle methods are currently considered among the most efficient methods for nonsmooth optimization, at least for convex problems [15, 16, 24]. Not long after the first works on bundle methods for the convex case, the unconstrained counterpart of problem (1) was considered in [8, 20, 22] and, more recently, in [13, 14, 18]. The success of the proximal bundle method in unconstrained optimization (for both convex and nonconvex functions) greatly stimulated research on extending it to nonconvex constrained problems.

Extending Lemaréchal’s algorithm to the nonconvex case, Mifflin [28] gives a bundle algorithm using the so-called downshift mechanism for problem (1). In [27], a proximal bundle method is presented for problem (1) with linear constraints, using the improvement function to handle the constraints. This method admits only weak convergence, in the sense that at least one accumulation point of the sequence of serious iterates is stationary. We recall that a new trial point is called a serious iterate if it is accepted as the next step, and a null iterate otherwise. Recently, an algorithm with strong convergence, in the sense that every accumulation point is stationary, was stated in [11] for problem (1) for classes of \( {\mathrm {lower}}-C^{1} \) functions (in the sense of [31]). In [7], an algorithm is proposed for problem (1) where f and g are \( {\mathrm {lower}}-C^{1} \) or \( {\mathrm {upper}}-C^{1} \) functions; the progress function is used to handle the constraint, and it is shown that every accumulation point of the sequence of serious iterates is stationary for the progress function. In addition, the redistributed proximal bundle methods [13, 14] are extended to problem (1) in [37] by means of a penalty function, where the objective and constraint functions are \( {\mathrm {lower}}-C^{2} \). Moreover, these methods are studied in [25] by using the improvement function, where the objective and constraint functions are \( {\mathrm {lower}}-C^{1} \). More recently, a semi-infinite programming (SIP) problem is studied in [26], where the objective function is \( {\mathrm {lower}}-C^{2} \) and the constraint functions are twice continuously differentiable. By using a maximum function, the SIP problem is formulated as a nonsmooth constrained optimization problem, and an infeasible bundle method based on the so-called improvement function is presented. Under a special constraint qualification, global convergence is obtained in the sense that the generated sequence converges to KKT points of problem (1) as well as KKT points of the SIP problem.

Other methods for constrained nonsmooth nonconvex optimization problems are gradient sampling (GS) methods; see, for instance, [5, 36], where the penalty and improvement functions, respectively, are used to handle the constraints. The global convergence of the sequential quadratic programming GS algorithm [36] and of the penalty sequential quadratic programming GS algorithm [5] is established in the sense that, with probability one, every accumulation point of the iterative sequence is stationary for the improvement function and for the penalty function, respectively.

Techniques for bundle compression and aggregation, which are usual in convex bundle algorithms, are not used in [26, 28, 37] and, although mentioned in [7, 11, 25, 27], are not explicitly addressed through updating formulas and practical aspects. Without these features, one cannot guarantee that the number of constraints in the subproblems stays below a given desired bound; the method therefore cannot be guaranteed to be practical, and numerical difficulties may arise. In our method, we preserve the possibility of using aggregation techniques, which allows effective control of the size of the subproblems, and in addition we present updating formulas.

It is worth mentioning that in most of the cited methods for constrained problems, e.g., [5, 7, 27, 28, 36], all the serious iterates, and also the starting point, are required to be feasible. In those methods, computing a feasible point is essential to start the algorithm. This preliminary general nonsmooth feasibility problem may be as difficult as solving problem (1) itself.

In this paper, we propose an infeasible proximal bundle method to solve problem (1). To handle nonconvexity, we extend the redistributed proximal bundle method [13, 14] to the class of all regular locally Lipschitz functions, which is less restrictive than the classes of \( {\mathrm {lower}}-C^{1} \) and \( {\mathrm {lower}}-C^{2} \) functions. We use the improvement function to handle the constraint, which is one of the most effective tools in this context; see [21, 26, 33]. At least part of the motivation for using the improvement function is to avoid estimating a suitable value of the penalty parameter, which is often a delicate task. We then obtain a convex cutting plane model. In contrast to the unconstrained case, our cutting plane model no longer approximates the objective function f or its local convexification as in [13, 14]. In addition, unlike [26], our model does not approximate a maximum of piecewise-linear models of the local convexifications of the objective and constraint functions; instead, we consider the augmented improvement function at the current step. Moreover, we compute the proximal point of the model function to obtain new bundle elements and to generate a better estimate of the minimizer. Global convergence of the proposed algorithm is proved in the sense that every accumulation point of the sequence of serious iterates is stationary for the improvement function, when f and g are regular locally Lipschitz functions.

It is worth mentioning that the authors of [26, 37] use the fact that, when f and g are \( {\mathrm {lower}}-C^2 \) and the convexity parameter is large enough, a local convexification of f and g is obtained; in our model, by contrast, merely regular functions are considered.

The proposed method is implemented in the MATLAB environment and applied to a collection of test problems. The numerical results show that the method can be used successfully to solve such problems. We also compare the proposed algorithm with the public software SolvOpt [34] and the proximal bundle method of [25].

The rest of the paper is organized as follows. In Sect. 2 we review some basic definitions and results from nonsmooth analysis. In Sect. 3 we provide the details of our algorithm, whose convergence is analyzed in Sect. 4. Computational experience is reported in Sect. 5 and, finally, the conclusions are presented in Sect. 6.

2 Preliminaries

Our notations are fairly standard. We denote by \( \langle u,v\rangle =\sum _{i=1}^{n} u_{i}v_{i} \) the inner product of two vectors \( u,v \in \mathbb {R}^{n} \). We use \( \Vert \cdot \Vert \) to denote the standard Euclidean norm. For \( x \in \mathbb {R}^{n} \) and \( \varepsilon >0 \), \( B_{\varepsilon }(x) \) is the open ball of radius \( \varepsilon \) centered at x and \( \overline{B}_{\varepsilon }(x) \) is the closed ball of radius \( \varepsilon \) centered at x.

First we recall some concepts and results from nonsmooth analysis. Most of the materials included here can be found in [3]. Then we recall the definition of the improvement function and state some of its properties.

A function \( h {:}\,\mathbb {R}^{n} \rightarrow \mathbb {R} \) is said to be locally Lipschitz of rank \( L> 0 \) at \( x \in \mathbb {R}^{n} \) if for some \( \varepsilon > 0 \), we have \( |h(y) -h(z) | \le L \Vert y - z \Vert \) for all \( y, z \in B_{\varepsilon }(x) \). The classical directional derivative of h at x in the direction \( d\in \mathbb {R}^{n} \) is defined as

$$\begin{aligned} h^{\prime }(x ; d) := \lim _{\alpha \downarrow 0} \frac{h(x+\alpha d) - h(x)}{\alpha }. \end{aligned}$$

The Clarke directional derivative of h at x in the direction d is defined by

$$\begin{aligned} h^{\circ }(x ; d) := \limsup _{ \begin{array}{c} y \rightarrow x \\ \alpha \downarrow 0 \end{array}} ~ \frac{h(y+ \alpha d) - h(y)}{\alpha }. \end{aligned}$$

The Clarke subdifferential (generalized gradient) of h at x is defined as

$$\begin{aligned} \partial _{c} h(x) := \left\{ \xi \in \mathbb {R}^{n}| ~ \xi ^{T} d \le h^{\circ }(x; d),~\forall d\in \mathbb {R}^{n}\right\} . \end{aligned}$$

Each element \( \xi \in \partial _{c} h(x) \) is called a subgradient of h at x. It is well-known that \({\partial }_{c} h(x)\) is a nonempty convex compact set in \( \mathbb {R}^{n} \). Also the Clarke subdifferential \( \partial _{c} h(x)\) is upper semicontinuous for every \( x\in \mathbb {R}^{n} \).

The function \( h {:}\,\mathbb {R}^{n} \rightarrow \mathbb {R} \) is convex, if \( h(\lambda x+ (1-\lambda )y) \le \lambda h(x)+ (1-\lambda ) h(y) \), for all \( x,y \in \mathbb {R}^{n} \) and \( \lambda \in [0,1] \). The subdifferential of a convex function h at x is the set

$$\begin{aligned} \partial h(x) := \left\{ \xi \in \mathbb {R}^{n}| ~h(y) \ge h(x) +\langle \xi ,y-x \rangle , ~ \forall y\in \mathbb {R}^{n}\right\} , \end{aligned}$$

and it coincides with the Clarke subdifferential for every convex function.

Let \( \varepsilon \ge 0 \). The \( \varepsilon \)-subdifferential of a convex function h at x is defined as

$$\begin{aligned} \partial _{\varepsilon } h(x) := \left\{ \xi \in \mathbb {R}^{n}| ~h(y) \ge h(x)+\langle \xi ,y-x\rangle - \varepsilon , ~\forall y\in \mathbb {R}^{n}\right\} . \end{aligned}$$

Each element \( \xi \in \partial _{\varepsilon } h(x) \) is called an \( \varepsilon \)-subgradient of h at x [27].

The function \( h{:}\,\mathbb {R}^{n} \rightarrow \mathbb {R} \) is regular at x provided that h is locally Lipschitz at x and admits directional derivatives \( h^{\prime }(x;d) \) at x for all d, with \( h^{\prime }(x;d)=h^{\circ }(x;d) \).

Let S be a subset of \( \mathbb {R}^{n} \). Its indicator function \( I_{S}{:}\,\mathbb {R}^{n}\rightarrow \mathbb {R}_{\infty } \) is the function which has value 0 on S and \( +\,\infty \) elsewhere.

The family of \( {\mathrm {lower}}-C^{1} \) functions was introduced in [35]. It constitutes a broad class of locally Lipschitz functions that contains the \( {\mathrm {lower}}-C^{2} \) functions as a subfamily. Combining [6, Thm. 2 and Cor. 3] with [35, Prob. 2.4], when \( h{:}\,\mathbb {R}^{n} \rightarrow \mathbb {R} \) is \( {\mathrm {lower}}-C^{1} \), then h is semismooth and regular. The converse is not necessarily true: Spingarn [35, p. 83] gives an example of a regular function on \( \mathbb {R}^{2} \) whose Clarke subdifferential is not strictly submonotone at some points, and therefore the function is not \( {\mathrm {lower}}-C^{1} \).

A point \( x^{*} \in \mathbb {R}^{n} \) is called a stationary point of \( h{:}\,\mathbb {R}^{n}\rightarrow \mathbb {R} \) if \( 0\in \partial _{c}h(x^{*}) \). If \( x^{*}\) is a local minimizer of h, then \( x^{*}\) is a stationary point of h. Also a point \( x^{*}\in X \) is called a stationary point of h on X if \( 0\in \partial _{c}h(x^{*})+\partial I_{X}(x^{*}) \). When \( x^{*} \in X \) is a local minimizer of h on X, then \( x^{*} \) is stationary for h on X.

Recall that \( x^{*}\in X \) satisfies the Fritz John necessary optimality conditions for problem (1) if there exist \( \lambda _{0}^{*}\ge 0 \), \( \lambda _{1}^{*}\ge 0 \) with \( \lambda _{0}^{*}+ \lambda _{1}^{*}=1 \) such that \( 0 \in \lambda _{0}^{*} \partial _{c} f(x^{*}) + \lambda _{1}^{*} \partial _{c} g(x^{*})+\partial I_{X}(x^{*}) \) and \( \lambda _{1}^{*}g(x^{*})= 0 \), \( g(x^{*})\le 0 \). When \( \lambda _{0}^{*}>0 \), then \( x^{*} \) satisfies the Karush–Kuhn–Tucker (KKT) conditions.

2.1 The improvement function

We introduce the improvement function \( h{:}\,\mathbb {R}^{n}\rightarrow \mathbb {R} \) associated with problem (1). For a given point \( x \in \mathbb {R}^{n} \), let \( h_{x}(y)= h(y,x):= \max \{f(y)-f(x), g(y)\} \). We recall some useful properties of this function. Directly by definition of the improvement function and subdifferential calculus [3], we have

$$\begin{aligned} \partial _{c} h(y,x) \subseteq \left\{ \begin{array}{ll} \partial _{c} f(y) &{} \quad f(y)-f(x) > g(y) \\ {\mathrm {conv}}\{\partial _{c} f(y)\cup \partial _{c} g(y)\} &{} \quad f(y)-f(x) = g(y)\\ \partial _{c} g(y) &{} \quad f(y)-f(x) < g(y), \end{array} \right. \end{aligned}$$
(2)

where “\( {\mathrm {conv}}\)” denotes convex hull, and equality holds when f and g are regular functions. In addition \( h(x,x)= g^{+}(x)= \max \{g(x),0\} \), for all \( x \in \mathbb {R}^{n} \).
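As a simple illustration (this example is ours, not part of the original text), take \( n=1 \), \( f(y)=|y| \), \( g(y)=y-1 \) and \( x=3 \), so that \( f(x)=3 \) and \( g^{+}(x)=2 \). Then

$$\begin{aligned} h(y,3)=\max \{|y|-3,~ y-1\}= \left\{ \begin{array}{ll} -y-3 &{} \quad y\le -1 \\ y-1 &{} \quad y\ge -1, \end{array} \right. \qquad h(3,3)=g^{+}(3)=2, \end{aligned}$$

and at the kink \( y=-1 \), where \( f(y)-f(x)=g(y) \), formula (2) holds with equality: \( \partial _{c} h(-1,3)={\mathrm {conv}}\{\partial _{c} f(-1)\cup \partial _{c} g(-1)\}={\mathrm {conv}}\{-1,1\}=[-1,1] \).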

Let \( X\subseteq \mathbb {R}^{n} \) be a convex set. Then \( I_{X}(\cdot ) \) is a convex function and \( \partial I_{X}(x) \) is the normal cone to X at x, i.e.,

$$\begin{aligned} \partial I_{X}(x)=N_{X}(x):=\left\{ v \in \mathbb {R}^{n}|~ \langle v,y-x \rangle \le 0,~ \forall y\in X\right\} . \end{aligned}$$

First we recall the following result from [7, Lemma 2.1], which shows that the improvement function is suitable to serve as an alternative of the penalty function.

Lemma 1

Suppose that f and g are locally Lipschitz functions. Then the following hold:

  1. (A)

    If \( x^{*} \) is a local minimizer of problem (1), then it is a local minimizer of \( h(\cdot ,x^{*}) \) on X. In addition, we have \( 0 \in \partial _{c} h(x^{*},x^{*})+ \partial I_{X}(x^*) \).

  2. (B)

    If \( 0 \in \partial _{c} h(x^{*},x^{*})+ \partial I_{X}(x^{*}) \) for \( x^{*} \in X \), then only one of the following occurs:

    1. (i)

      if \( g(x^{*})>0 \), then \( 0 \in \partial _{c} g(x^{*})+ \partial I_{X}(x^{*}) \), i.e., \( x^{*} \) is a stationary point of g on X.

    2. (ii)

      if \( g(x^{*})< 0 \), then \( 0 \in \partial _{c} f(x^{*})+\partial I_{X}(x^{*}) \), i.e., \( x^{*} \) is stationary for f on X and it is a KKT point of problem (1).

    3. (iii)

      finally \( g(x^{*})= 0 \) implies that \( x^{*} \) satisfies the Fritz John necessary conditions for problem (1).

3 Description of the algorithm

In this section we describe the new algorithm. The main idea of the method is as follows. Given the last serious iterate \( x^{k} \) (the starting point \( x^{0} \) is regarded as the first serious iterate), we generate candidate points by iterating with the proximal bundle method applied to a piecewise-linear model of the augmented improvement function, until the next serious iterate \( x^{k+1} \) is obtained. To decide whether a candidate point can become the next serious iterate, we use a usual descent condition. Once the new serious iterate \( x^{k+1} \) has been found, the piecewise-linear model is redefined, according to \( x^{k+1} \), for the next iteration. In the rest of this section we discuss the different elements of the algorithm, which itself is presented at the end of the section.

3.1 Model function

As usual in bundle methods, the available information is used to define a piecewise-linear model of \( h(\cdot ,x^{k}) \). In our setting, we have to cope with possible nonconvexity of f and g and, consequently, of \( h(\cdot ,x^{k}) \). We define the linearization errors for f and g at \( x^{k} \), corresponding to \( y^{j}\in \mathbb {R}^{n} \), respectively by

$$\begin{aligned}&e_{f_{j}}^{k}:= f(x^{k}) - f(y^{j}) - \langle \xi _{f}^{j}, x^{k}-y^{j}\rangle , \end{aligned}$$
(3a)
$$\begin{aligned}&e_{g_{j}}^{k}:=g(x^{k}) - g(y^{j}) - \langle \xi _{g}^{j}, x^{k}-y^{j}\rangle , \end{aligned}$$
(3b)

where \( \xi _{f}^{j} \in \partial _{c}f(y^{j}) \) and \( \xi _{g}^{j} \in \partial _{c} g(y^{j}) \). For a convex function, linearization errors are always nonnegative. This feature is crucial in the convergence proofs of most bundle methods [4, 16]. In the nonconvex case, however, the linearization errors may be negative. To overcome this drawback, we extend the approach introduced in [13, 14] to nonconvex constrained optimization problems and work with the following augmented function

$$\begin{aligned} h_{\eta _{l}}(\cdot ,x^{k})= h(\cdot ,x^{k})+ \frac{\eta _{l}}{2} \Vert \cdot -x^{k}\Vert ^{2}, \end{aligned}$$
(4)

where \( \eta _{l} \) is a parameter that is adjusted dynamically. Let l be the current iteration index and let \( \mathcal {L}_{l}\subseteq \{0,1,\ldots ,l-1\} \) be the index set at the lth iteration. We generate the convex piecewise-linear model as

$$\begin{aligned} H_{l}(y,x^{k})&:= h(x^{k},x^{k})+\max _{j \in \mathcal {L}_{l}}\left\{ -c_{f_{j}}^{k}+\langle s_{f_{j}}^{k},y-x^{k}\rangle , -c_{g_{j}}^{k}+\langle s_{g_{j}}^{k},y-x^{k}\rangle \right\} \nonumber \\&= g^{+}(x^{k})+\max _{j \in \mathcal {L}_{l}}\left\{ -c_{f_{j}}^{k}+\langle s_{f_{j}}^{k},y-x^{k}\rangle ,-c_{g_{j}}^{k}+\langle s_{g_{j}}^{k},y-x^{k}\rangle \right\} , \end{aligned}$$
(5)

where

$$\begin{aligned}&c_{f_{j}}^{k}:=e_{f_{j}}^{k}+g^{+}(x^{k}) +\eta _{l} b_{j}^{k}, \quad c_{g_{j}}^{k}:=e_{g_{j}}^{k}+g^{+}(x^{k}) - g(x^{k})+\eta _{l} b_{j}^{k}, \end{aligned}$$
(6a)
$$\begin{aligned}&s_{f_{j}}^{k}:= \xi ^{j}_{f}+\eta _{l} d_{j}^{k},~ s_{g_{j}}^{k}:= \xi ^{j}_{g}+\eta _{l} d_{j}^{k},~ d_{j}^{k}:=y^{j}-x^{k},~ b_{j}^{k}:=\frac{\Vert y^{j}-x^{k}\Vert ^{2}}{2}, \end{aligned}$$
(6b)

and \( e_{f_{j}}^{k} \), \( e_{g_{j}}^{k} \) are stated in (3a), (3b) and the bundle of information is defined as follows

$$\begin{aligned} \mathcal {B}_{l}\subseteq \bigcup _{i\in \mathcal {L}_{l}}\big \{\big (f_{i},g_{i}, \xi _{f}^{i}, \xi _{g}^{i}, e_{f_{i}}^{k}, e_{g_{i}}^{k}, b_{i}^{k},d_{i}^{k} \big )\big \}, \end{aligned}$$

where \( f_{i}:=f(y^{i})\) and \( g_{i}:=g(y^{i}) \). Clearly \( H_{l}(\cdot ,x^{k}) \) is a piecewise-linear model of the augmented improvement function (4). We call \( m_{f}(\cdot ,x^{k}):=g^{+}(x^{k}) -c_{f_{j}}^{k}+\langle s_{f_{j}}^{k},\cdot -x^{k}\rangle \) and \( m_{g}(\cdot ,x^{k}):=g^{+}(x^{k}) -c_{g_{j}}^{k}+\langle s_{g_{j}}^{k},\cdot -x^{k}\rangle \) the oracle planes. Any choice of the positive parameter \( \eta _{l} \in \mathbb {R} \) that keeps \( c_{f_{j}}^{k} \) and \( c_{g_{j}}^{k} \) nonnegative is acceptable. In our setting we take

$$\begin{aligned} \eta _{l}\ge \max \left\{ \max _{\begin{array}{c} j\in \mathcal {L}_{l}\\ y^{j} \ne x^{k} \end{array}} \left\{ \frac{- \left( e_{f_{j}}^{k}+g^{+}(x^{k})\right) }{b_{j}^{k}}, \frac{-\left( e_{g_{j}}^{k}+g^{+}(x^{k}) - g(x^{k})\right) }{b_{j}^{k}}\right\} ,\omega \right\} +\omega , \end{aligned}$$
(7)

where \( \omega >0 \) is a positive constant.
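To make this construction concrete, the following sketch (ours, not part of the original paper, whose implementation is in MATLAB) assembles the quantities of (3a), (3b), (6a), (6b) and a value of \( \eta _{l} \) satisfying (7) for a given oracle bundle; the function name, the dictionary-based bundle representation and the default value of \( \omega \) are illustrative assumptions.

```python
import numpy as np

def build_model_data(xk, fxk, gxk, bundle, omega=1e-2):
    """Assemble the quantities of (3), (6) and a value of eta_l satisfying (7).

    bundle: list of dicts with keys 'y', 'f', 'g', 'xi_f', 'xi_g'
            (trial point, its f- and g-values, and subgradients, as in B_l^oracle).
    """
    gplus = max(gxk, 0.0)
    e_f, e_g, b, d = [], [], [], []
    for el in bundle:
        dj = el['y'] - xk
        e_f.append(fxk - el['f'] - el['xi_f'] @ (xk - el['y']))   # (3a)
        e_g.append(gxk - el['g'] - el['xi_g'] @ (xk - el['y']))   # (3b)
        d.append(dj)                                               # d_j^k in (6b)
        b.append(0.5 * float(dj @ dj))                             # b_j^k in (6b)

    # lower bound of (7); bundle points with y^j = x^k are skipped
    ratios = [max(-(ef + gplus), -(eg + gplus - gxk)) / bj
              for ef, eg, bj in zip(e_f, e_g, b) if bj > 0.0]
    eta = max(max(ratios, default=omega), omega) + omega

    c_f = [ef + gplus + eta * bj for ef, bj in zip(e_f, b)]              # (6a)
    c_g = [eg + gplus - gxk + eta * bj for eg, bj in zip(e_g, b)]        # (6a)
    s_f = [el['xi_f'] + eta * dj for el, dj in zip(bundle, d)]           # (6b)
    s_g = [el['xi_g'] + eta * dj for el, dj in zip(bundle, d)]           # (6b)
    return dict(eta=eta, gplus=gplus, c_f=c_f, c_g=c_g, s_f=s_f, s_g=s_g,
                e_f=e_f, e_g=e_g, b=b, d=d)
```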

Remark 1

In [13], the authors established a proximal bundle method for the unconstrained problem when f is \( {\mathrm {lower}}-C^{2} \). Moreover, a proximal bundle method was presented in [37] and an infeasible proximal bundle algorithm in [26] for constrained problems where f and g are \( {\mathrm {lower}}-C^{2} \). These methods rely on the fact that, when \( \eta _{l} \) is large enough and f, g are \( {\mathrm {lower}}-C^{2} \), the augmented function \( h_{\eta _{l}}(\cdot ,x^{k}) \) is convex; hence they can say nothing about what happens if the sequence \( \{\eta _{l}\} \) remains too small. In our approach, by contrast, we do not require any \( {\mathrm {lower}}-C^{2} \), \( {\mathrm {lower}}-C^{1} \) or prox-regularity assumption on the functions. Therefore \( h_{\eta _{l}}(\cdot ,x^{k}) \) in (4) can be nonconvex at some iterations, and we only choose \( \eta _{l} \) such that (7) is satisfied.

Lemma 2

If \( \eta _{l} \) is selected according to Eq. (7), then for every \( j\in \mathcal {L}_{l} \), we obtain \( c_{f_{j}}^{k} \ge 0 \) and \( c_{g_{j}}^{k} \ge 0 \).

Proof

By using (7) for all \( j\in \mathcal {L}_{l} \) when \( y^{j} \ne x^{k} \), we have

$$\begin{aligned} (\eta _{l} -\omega ) b_{j}^{k} \ge -\left( e_{f_{j}}^{k}+g^{+}(x^{k})\right) \quad \text {and} \quad (\eta _{l}-\omega )b_{j}^{k} \ge -\left( e_{g_{j}}^{k}+g^{+}(x^{k})-g(x^{k})\right) , \end{aligned}$$

and by (6a) we obtain \( c_{f_{j}}^{k}\ge \omega b_{j}^{k}\ge 0 \) and \( c_{g_{j}}^{k}\ge \omega b_{j}^{k}\ge 0 \), for all \( j\in \mathcal {L}_{l} \) and \( y^{j} \ne x^{k} \). When \( y^{j} = x^{k} \) we have \( c_{f_{j}}^{k}=g^{+}(x^{k})\ge 0 \) and \( c_{g_{j}}^{k}=g^{+}(x^{k})-g(x^{k})\ge 0 \). Hence for all \( j\in \mathcal {L}_{l} \) we get \( c_{f_{j}}^{k} \ge 0 \) and \( c_{g_{j}}^{k} \ge 0 \). \(\square \)

We assume that the last serious iterate obtained so far is one of the bundle points, i.e., \( x^{k} \in \{y^{j}\}_{j\in \mathcal {L}_{l}} \); that is, there exists \( l(k)\in \mathcal {L}_{l} \) such that \( x^{k}=y^{l(k)} \). Hence \( m^{0}_{f}(\cdot ,x^{k})=g^{+}(x^{k}) + \langle \xi ^{l(k)}_{f}, \cdot - x^{k}\rangle \) and \( m^{0}_{g}(\cdot ,x^{k})=g^{+}(x^{k})-g(x^{k}) + \langle \xi ^{l(k)}_{g}, \cdot - x^{k}\rangle \) are cutting planes of \( H_{l}(\cdot ,x^{k}) \). Also \( c_{f_{l(k)}}^{k}= g^{+}(x^{k})\) and \( c_{g_{l(k)}}^{k}= g^{+}(x^{k})- g(x^{k}) \), therefore we have

$$\begin{aligned} H_{l}(x^{k},x^{k})= g^{+}(x^{k})+\max _{j \in \mathcal {L}_{l}}\left\{ -c_{f_{j}}^{k}, -c_{g_{j}}^{k}\right\} =g^{+}(x^{k}). \end{aligned}$$

The iterates \( \{y^{l}\} \) include both the serious steps \( x^{k} \) and the null steps \( y^{l} \) (which are candidate points that were not serious). As we have mentioned above, for all k there exists l(k) such that \( x^{k}=y^{l(k)} \). Therefore we have \( \{x^{k}\}\subseteq \{y^{l}\} \).

Given \( \mu _{l}>0 \), a positive proximal parameter, the next iterate \( y^{l} \) is generated by solving the following convex problem

$$\begin{aligned} \min _{y\in X}~~ H_{l}(y,x^{k})+ \frac{\mu _{l}}{2} \Vert y-x^{k}\Vert ^{2}. \end{aligned}$$
(8)

Clearly, \( y^{l} \) is unique. If \( y^{l} \) provides a significant decrease with respect to \( h(x^{k},x^{k}) \), we call it a serious iterate; otherwise it is a null iterate (we explain this case in the next subsection). From the optimality conditions of problem (8) we have \( 0\in \partial H_{l}(y^{l},x^{k})+ \mu _{l}(y^{l}-x^{k})+\partial I_{X}(y^{l}) \). Therefore, there exist \( \widehat{\xi }_{l}^{k}\in \partial H_{l}(y^{l},x^{k}) \) and \( v_{l}\in \partial I_{X}(y^{l}) \) such that \( y^{l}=x^{k} - \frac{1}{\mu _{l}} (\widehat{\xi }_{l}^{k}+v_{l}) \) and \( \mu _{l}(x^{k}-y^{l}) =(\widehat{\xi }_{l}^{k}+v_{l}) \in \partial H_{l}(y^{l},x^{k})+\partial I_{X}(y^{l}) \). Since the model (5) is piecewise-linear, there exist \( \alpha ^{l}, \beta ^{l} \in \mathbb {R}^{|\mathcal {L}_{l}|} \) with \( \alpha _{j}^{l},\beta _{j}^{l} \ge 0 \) and \( \sum _{j=1}^{|\mathcal {L}_{l}|} (\alpha _{j}^{l}+ \beta _{j}^{l})=1 \) such that \( \widehat{\xi }_{l}^{k}:= \sum _{j=1}^{|\mathcal {L}_{l}|} (\alpha _{j}^{l} s_{f_{j}}^{k}+ \beta _{j}^{l} s_{g_{j}}^{k}) \). We call \( \widehat{\xi }_{l}^{k} \) the aggregate subgradient. The aggregate error is defined by

$$\begin{aligned} \widehat{e}_{l}^{k}:= g^{+}(x^{k}) - H_{l}(y^{l},x^{k})-\langle \widehat{\xi }_{l}^{k},x^{k}-y^{l} \rangle \ge 0, \end{aligned}$$
(9)

where the inequality follows from \( \widehat{\xi }_{l}^{k}\in \partial H_{l}(y^{l},x^{k}) \). It is easy to verify that \( \widehat{\xi }_{l}^{k}\in \partial _{\widehat{e}_{l}^{k}} H_{l}(x^{k},x^{k}) \). We define the aggregate plane by \( m_{l}^{*}(\cdot ,x^{k}):=g^{+}(x^{k}) -\widehat{e}_{l}^{k}+\langle \widehat{\xi }_{l}^{k},\cdot -x^{k}\rangle \). To avoid overflow, when generating the new model \( H_{l}(\cdot ,x^{k}) \), we may replace all older planes corresponding to \( \alpha _{j}^{l}>0 \) and \( \beta _{j}^{l}>0 \) by the aggregate plane. Note that this aggregate plane can directly be defined from the aggregate couple \( (\widehat{e}_{l}^{k},\widehat{\xi }_{l}^{k}) \).
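A minimal sketch of solving the subproblem (8) for the oracle-plane model (5) could look as follows; it is not the authors' implementation. For illustration only, X is assumed to be a box, the modeling library cvxpy is used, and the simplex multipliers \( \alpha ^{l},\beta ^{l} \) are recovered as the dual values of the epigraph cuts, from which the aggregate subgradient and the aggregate error (9) are formed. Aggregate planes of (10) could be appended to the cut list in exactly the same way.

```python
import numpy as np
import cvxpy as cp

def solve_subproblem(xk, gplus, c_f, c_g, s_f, s_g, mu, lb, ub):
    """Solve the proximal subproblem (8) for the oracle-plane model (5),
    with X taken to be the box [lb, ub] (an assumption of this sketch only).
    Returns the trial point y^l, the aggregate subgradient, the aggregate
    error of (9) and the multipliers of the cuts."""
    n = xk.size
    y = cp.Variable(n)
    t = cp.Variable()

    # epigraph reformulation of H_l(y, x^k): one cut per oracle plane
    subgrads = list(s_f) + list(s_g)
    offsets = list(c_f) + list(c_g)
    cuts = [t >= gplus - c + s @ (y - xk) for c, s in zip(offsets, subgrads)]
    box = [y >= lb, y <= ub]

    prob = cp.Problem(cp.Minimize(t + 0.5 * mu * cp.sum_squares(y - xk)),
                      cuts + box)
    prob.solve()
    yl = y.value

    # the dual values of the cuts play the role of alpha^l, beta^l: they are
    # nonnegative and sum to one by stationarity with respect to t
    lam = np.array([float(con.dual_value) for con in cuts])
    xi_hat = sum(l_j * s for l_j, s in zip(lam, subgrads))     # aggregate subgradient
    H_yl = gplus + max(-c + s @ (yl - xk) for c, s in zip(offsets, subgrads))
    e_hat = gplus - H_yl - xi_hat @ (xk - yl)                  # aggregate error (9)
    return yl, xi_hat, e_hat, lam
```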

Accordingly, we have two kinds of planes in the bundle, namely: the oracle and aggregate planes. Therefore we shall write the index set in the form \( \mathcal {L}_{l}:=\mathcal {L}_{l}^{{\mathrm {oracle}}} \bigcup \mathcal {L}_{l}^{{\mathrm {agg}}} \) and their bundles in the form \( \mathcal {B}_{l}:=\mathcal {B}_{l}^{{\mathrm {oracle}}} \bigcup \mathcal {B}_{l}^{{\mathrm {agg}}} \),

$$\begin{aligned} \mathcal {B}_{l}^{{\mathrm {oracle}}}&\subseteq \bigcup _{i\in \mathcal {L}_{l}^{{\mathrm {oracle}}}}\left\{ \big (f_{i},g_{i}, \xi _{f}^{i}, \xi _{g}^{i}, e_{f_{i}}^{k}, e_{g_{i}}^{k}, b_{i}^{k},d_{i}^{k} \big )\right\} , \\ \quad \text {and} \quad \mathcal {B}_{l}^{{\mathrm {agg}}}&\subseteq \bigcup _{i\in \mathcal {L}_{l}^{{\mathrm {agg}}}}\left\{ \big (\widehat{e}_{i}^{k}, \widehat{\xi }_{i}^{k}\in \partial H_{i}(y^{i},x^{k}) \big )\right\} . \end{aligned}$$

It is worth mentioning that for any iterate \( y^{l} \) (serious or null), we add one element to \( \mathcal {B}_{l}^{{\mathrm {oracle}}} \), namely \( \big (f_{l},g_{l}, \xi _{f}^{l}, \xi _{g}^{l}, e_{f_{l}}^{k}, e_{g_{l}}^{k}, b_{l}^{k},d_{l}^{k} \big ) \) (and the corresponding index to \( \mathcal {L}_{l}^{{\mathrm {oracle}}} \)). Corresponding to each element of this bundle we have two cutting planes in the model, one for f and one for g. As in our method, a single bundle for both f and g is used in [19, 33] for convex constrained problems, which keeps the implementation of the algorithm and the updating formulas simple. In contrast, some bundle methods for constrained problems maintain two independent bundles for f and g; see, e.g., [25, 26]. Therefore the convex piecewise-linear model reads as follows

$$\begin{aligned} H_{l}(y,x^{k})= & {} g^{+}(x^{k})+ \max \bigg \{ \max _{j \in \mathcal {L}_{l}^{{\mathrm {agg}}}} \left\{ -\widehat{e}^{k}_{j}+\langle \widehat{\xi }_{j}^{k},y-x^{k}\rangle \right\} , \nonumber \\&\quad \max _{j \in \mathcal {L}_{l}^{{\mathrm {oracle}}}} \left\{ -c_{f_{j}}^{k}+\langle s_{f_{j}}^{k},y-x^{k}\rangle ,-c_{g_{j}}^{k}+\langle s_{g_{j}}^{k},y-x^{k}\rangle \right\} \bigg \}.\nonumber \\ \end{aligned}$$
(10)

3.2 Acceptance test

An iterate \( y^{l} \) is considered good enough to become the next serious iterate when \( h(y^{l},x^{k}) \) provides a significant decrease with respect to \( h(x^{k},x^{k})=g^{+}(x^{k}) \). Let \( m\in (0,1) \) be a given parameter. Define the predicted decrease \( \delta ^{l}:= \widehat{e}_{l}^{k} + \frac{1}{2\mu _{l}} \Vert \widehat{\xi }_{l}^{k}+v_{l}\Vert ^{2} \). Note that since \( \widehat{e}_{l}^{k}\ge 0 \), it follows that \( \delta ^{l}\ge 0 \). When \( y^{l} \) satisfies the descent test

$$\begin{aligned} h(y^{l},x^{k}) \le g^{+}(x^{k}) - m \delta ^{l}, \end{aligned}$$
(11)

a serious iterate is declared, i.e., \( x^{k+1}:= y^{l} \). Otherwise, \( y^{l} \) is a null iterate and \( x^{k} \) remains unchanged. The algorithm stops when \( \delta ^{l} \) is small enough. In this case, both \( \widehat{e}_{l}^{k} \) and \( \Vert \widehat{\xi }_{l}^{k}+v_{l}\Vert \) are small and since \( \widehat{\xi }_{l}^{k}\in \partial _{\widehat{e}_{l}^{k}} H_{l}(y^{l},x^{k}) \) and \( v_{l}\in \partial I_{X}(y^{l}) \), we deduce \( x^{k} \) is a stationary point (we will resume the discussion of convergence in Sect. 4).
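In code, the predicted decrease and the test (11) amount to a few lines. The sketch below (ours, not the authors' implementation) also folds in the stopping rule on \( \delta ^{l} \); the default values of the descent parameter m and of the tolerance are illustrative assumptions. Here `h_yl` stands for \( h(y^{l},x^{k})=\max \{f(y^{l})-f(x^{k}),g(y^{l})\} \) and `xi_plus_v_norm` for \( \Vert \widehat{\xi }_{l}^{k}+v_{l}\Vert =\mu _{l}\Vert x^{k}-y^{l}\Vert \).

```python
def descent_test(h_yl, gplus_xk, e_hat, xi_plus_v_norm, mu, m=0.1, tol=1e-8):
    """Predicted decrease delta^l and the descent test (11); m in (0,1), tol >= 0."""
    delta = e_hat + 0.5 * xi_plus_v_norm**2 / mu
    if delta <= tol:
        return 'stop', delta        # delta^l small enough: approximate stationarity
    if h_yl <= gplus_xk - m * delta:
        return 'serious', delta     # (11) holds: x^{k+1} := y^l
    return 'null', delta            # y^l becomes a null iterate
```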

Remark 2

Recalling the definition of \( h(\cdot ,x^{k}) \), we conclude that if the descent test is satisfied and a serious iterate is declared, we get

$$\begin{aligned} f(x^{k+1})-f(x^{k}) \le g^{+}(x^{k}) - m \delta ^{l} \quad \text {and} \quad g(x^{k+1})\le g^{+}(x^{k}) - m \delta ^{l}. \end{aligned}$$
(12)

In particular, if \( x^{k} \) is feasible, i.e., \( g^{+}(x^{k})=0 \), it follows from (12) that \( f(x^{k+1})<f(x^{k}) \) and \( g(x^{k+1})\le 0 \). Thus \( x^{k+1} \) is feasible too and \( \{f(x^{k})\} \) is strictly decreasing. Otherwise, when \( x^{k} \) is infeasible, it is possible that \( f(x^{k+1}) > f(x^{k}) \), since \( g^{+}(x^{k}) = g(x^{k})>0 \). Therefore, outside the feasible region the method is not monotone with respect to f. However, outside the feasible region we have \( g(x^{k+1}) <g^{+}(x^{k})=g(x^{k}) \). This seems intuitively reasonable: it is natural to accept an increase in the objective function value in order to decrease infeasibility.

3.3 Updating the model

Suppose that \( y^{l} \) is a null iterate; we then augment the piecewise-linear model \( H_{l}(\cdot ,x^{k}) \) to generate \( H_{l+1}(\cdot ,x^{k}) \). Note that the cutting plane corresponding to the last serious iterate (i.e., \( x^{k} \)) must be kept in the model. To make \( H_{l+1}(\cdot ,x^{k}) \) better than \( H_{l}(\cdot ,x^{k}) \), we add two more elements: the cutting plane and the aggregate plane corresponding to \( y^{l} \). It is enough to ensure that \( \{(\widehat{e}_{l}^{k}, \widehat{\xi }_{l}^{k})\} \subseteq \mathcal {B}_{l+1}^{{\mathrm {agg}}} \) and \( \{(f_{l},g_{l}, \xi _{f}^{l}, \xi _{g}^{l}, e_{f_{l}}^{k}, e_{g_{l}}^{k},b_{l}^{k},d_{l}^{k})\} \subseteq \mathcal {B}_{l+1}^{{\mathrm {oracle}}}\). Moreover, we set \( \mathcal {L}_{l+1}^{{\mathrm {oracle}}}=\mathcal {L}_{l}^{{\mathrm {oracle}}}\cup \{l\} \) and \( \mathcal {L}_{l+1}^{{\mathrm {agg}}}=\mathcal {L}_{l}^{{\mathrm {agg}}}\cup \{l\} \). Therefore, after a null iterate, updating the bundle and forming the new model present no difficulty.

Remark 3

It is worth mentioning that the total number of elements in \( \mathcal {B}_{l} \) can be kept as small as three at every step of the method. Such a bundle can be given by

$$\begin{aligned} \mathcal {B}_{l+1}= & {} \left\{ \left( f_{l},g_{l}, \xi _{f}^{l}, \xi _{g}^{l}, e_{f_{l}}^{k}, e_{g_{l}}^{k},b_{l}^{k},d_{l}^{k}\right) ,\left( \widehat{e}_{l}^{k}, \widehat{\xi }_{l}^{k}\right) , \right. \\&\left. \quad \left( f_{l(k)},g_{l(k)}, \xi _{f}^{l(k)}, \xi _{g}^{l(k)}, e_{f_{l(k)}}^{k}, e_{g_{l(k)}}^{k},b_{l(k)}^{k},d_{l(k)}^{k}\right) \right\} . \end{aligned}$$

As usual in convex proximal bundle methods (see, for instance, [19]), the total number of elements in the bundle can be kept as small as two, namely the cutting plane at the last iterate \( y^{l} \) and the last aggregate plane. Let us emphasize that the situation in our setting is slightly different from methods for standard convex programming. These differences are natural, having in mind that the convex and nonconvex cases require significantly different tools. As in our setting, in [7] the cutting plane corresponding to the last serious iterate \( x^{k} \) is kept.

When a new serious iterate is declared, the model \( H_{l}(\cdot ,x^{k}) \) has to be properly revised, and at the next iteration we start working with the new model \( H_{l}(\cdot ,x^{k+1}) \). Therefore any element in \( \mathcal {B}_{l}^{{\mathrm {agg}}} \) and \( \mathcal {B}_{l}^{{\mathrm {oracle}}} \) that depends on k must be redefined. On the other hand, since we do not store \( y^{j} \) and \( H_{j}(y^{j},x^{k}) \) for \( j\in \mathcal {L}_{l} \), we redefine the elements of \( \mathcal {B}_{l}^{{\mathrm {agg}}} \) and \( \mathcal {B}_{l}^{{\mathrm {oracle}}} \) by means of updating formulas. Incidentally, with these updating formulas the computation of the new cutting plane model \( H_{l}(\cdot ,x^{k+1}) \) is easy. We update the linearization errors and the other items for all \( j\in \mathcal {L}_{l}^{{\mathrm {oracle}}} \) according to the following relations

$$\begin{aligned} e_{f_{j}}^{k+1}&= e_{f_{j}}^{k} + f(x^{k+1})- f(x^{k})-\langle \xi _{f}^{j}, x^{k+1}-x^{k}\rangle , \end{aligned}$$
(13a)
$$\begin{aligned} e_{g_{j}}^{k+1}&= e_{g_{j}}^{k} + g(x^{k+1})- g(x^{k}) -\langle \xi _{g}^{j}, x^{k+1}-x^{k}\rangle , \end{aligned}$$
(13b)
$$\begin{aligned} b_{j}^{k+1}&= b_{j}^{k}+ \frac{1}{2}\Vert x^{k+1} -x^{k}\Vert ^{2} - \langle d_{j}^{k},x^{k+1} -x^{k}\rangle , \end{aligned}$$
(13c)
$$\begin{aligned} d_{j}^{k+1}&= d_{j}^{k}-(x^{k+1}-x^{k}). \end{aligned}$$
(13d)

The update aims at satisfying (3a), (3b) and the last two relations in (6b). The aggregate subgradients and the aggregate errors for all \( j\in \mathcal {L}_{l}^{{\mathrm {agg}}} \) are updated, respectively, by

$$\begin{aligned} \widehat{\xi }_{j}^{k+1}&= \widehat{\xi }_{j}^{k}+ \eta _{l} (x^{k} -x^{k+1}), \end{aligned}$$
(14a)
$$\begin{aligned} \widehat{e}_{j}^{k+1}&= \widehat{e}_{j}^{k}+g^{+}(x^{k+1})-g^{+}(x^{k}) + \frac{\eta _{l}}{2} \Vert x^{k+1}-x^{k}\Vert ^{2} -\langle \widehat{\xi }_{j}^{k}, x^{k+1}-x^{k}\rangle \nonumber \\&\quad + (f(x^{k+1})-f(x^{k}))^{+}. \end{aligned}$$
(14b)

The update aims at satisfying the key relation \( \widehat{\xi } _{j}^{k} \in \partial _{\widehat{e}_{j}^{k}} H_{l}(x^{k},x^{k}) \); see Lemma 4.
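A direct transcription of the updating formulas (13a)–(13d) and (14a)–(14b) could read as follows (ours, not the authors' code); it assumes that each oracle element additionally stores its current \( e_{f_{j}}^{k} \), \( e_{g_{j}}^{k} \), \( b_{j}^{k} \), \( d_{j}^{k} \) under the keys 'e_f', 'e_g', 'b', 'd', and that each aggregate element stores the couple \( (\widehat{e}_{j}^{k},\widehat{\xi }_{j}^{k}) \).

```python
def update_bundle_on_serious(xk, xk1, fxk, fxk1, gxk, gxk1, eta, oracle, agg):
    """Apply (13a)-(13d) to the oracle bundle and (14a)-(14b) to the
    aggregate bundle after the serious step x^{k+1} := y^l."""
    dx = xk1 - xk
    gplus_k, gplus_k1 = max(gxk, 0.0), max(gxk1, 0.0)
    df_plus = max(fxk1 - fxk, 0.0)            # (f(x^{k+1}) - f(x^k))^+

    for el in oracle:
        el['e_f'] += fxk1 - fxk - el['xi_f'] @ dx            # (13a)
        el['e_g'] += gxk1 - gxk - el['xi_g'] @ dx            # (13b)
        el['b'] += 0.5 * float(dx @ dx) - el['d'] @ dx       # (13c)
        el['d'] = el['d'] - dx                               # (13d)

    for el in agg:
        xi_old = el['xi_hat'].copy()
        el['xi_hat'] = xi_old - eta * dx                     # (14a): xi + eta*(x^k - x^{k+1})
        el['e_hat'] += (gplus_k1 - gplus_k + 0.5 * eta * float(dx @ dx)
                        - xi_old @ dx + df_plus)             # (14b)
```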

In the following lemma, we show the relationship between \( h_{\eta _{l}}(\cdot ,x^{k}) \) and \( h_{\eta _{l}}(\cdot ,x^{k+1}) \), the augmented functions at two consecutive serious iterations. We also show that the same relationship holds for their cutting plane models.

Lemma 3

Functions \( h_{\eta _{l}}(\cdot ,x^{k}) \) and \( h_{\eta _{l}}(\cdot ,x^{k+1}) \) satisfy the following inequality

$$\begin{aligned} h_{\eta _{l}}(y,x^{k+1})\ge & {} h_{\eta _{l}}(y,x^{k}) -(f(x^{k+1})-f(x^{k}))^{+}+ \eta _{l} \langle y-x^{k+1},x^{k}-x^{k+1}\rangle \nonumber \\&-\,\frac{\eta _{l}}{2}\Vert x^{k+1}-x^{k}\Vert ^{2}, \end{aligned}$$
(15)

for all \( y\in \mathbb {R}^{n} \). Furthermore, the above inequality remains true when \( h_{\eta _{l}}(\cdot ,x^{k}) \) and \( h_{\eta _{l}}(\cdot ,x^{k+1}) \) are replaced by \( H_{l}(\cdot ,x^{k}) \) and \( H_{l}(\cdot ,x^{k+1}) \), respectively.

Proof

First, we show that

$$\begin{aligned} h(y,x^{k+1})\ge h(y,x^{k}) -(f(x^{k+1})-f(x^{k}))^{+}, \quad \forall y \in \mathbb {R}^{n}. \end{aligned}$$
(16)

Consider \( y\in \mathbb {R}^{n} \). If \( f(y)- f(x^{k+1})\ge g(y) \) and \( f(y)- f(x^{k})\ge g(y) \), then \( h(y,x^{k+1})-h(y,x^{k})= f(x^{k}) - f(x^{k+1}) \ge -(f(x^{k+1}) - f(x^{k}))^{+} \). When \( f(y)- f(x^{k+1})< g(y) \) and \( f(y)- f(x^{k})< g(y) \), then \( h(y,x^{k+1})-h(y,x^{k})= 0 \) and (16) holds. Now suppose that \( f(y)- f(x^{k+1})< g(y) \) and \( f(y)- f(x^{k})\ge g(y) \), then \( h(y,x^{k+1})-h(y,x^{k})= g(y)- f(y)+f(x^{k})> f(y)-f(x^{k+1})-f(y)+f(x^{k})=f(x^{k})-f(x^{k+1}) \ge -(f(x^{k+1}) - f(x^{k}))^{+} \). Finally if \( f(y)- f(x^{k+1})\ge g(y) \) and \( f(y)- f(x^{k})< g(y) \), then \( h(y,x^{k+1})-h(y,x^{k})=f(y)-f(x^{k+1}) -g(y)\ge 0 \) which again implies (16). Using (4) we get

$$\begin{aligned} h_{\eta _{l}}(y,x^{k+1})\,{\ge }\, h_{\eta _{l}}(y,x^{k}) {+} \frac{\eta _{l}}{2}(\Vert y-x^{k+1}\Vert ^{2}{-} \Vert y-x^{k}\Vert ^{2})-(f(x^{k+1})-f(x^{k}))^{+}. \end{aligned}$$

We know that

$$\begin{aligned} \Vert y-x^{k}\Vert ^{2}&=\Vert y-x^{k+1}+x^{k+1}-x^{k}\Vert ^{2}\\&=\Vert y-x^{k+1}\Vert ^{2}+\Vert x^{k+1}-x^{k}\Vert ^{2}+2\langle y-x^{k+1},x^{k+1}-x^{k} \rangle , \end{aligned}$$

which implies (15).

Now we show that (15) holds with \( h_{\eta _{l}}(\cdot ,x^{k}) \) and \( h_{\eta _{l}}(\cdot ,x^{k+1}) \) replaced by \( H_{l}(\cdot ,x^{k}) \) and \( H_{l}(\cdot ,x^{k+1}) \), respectively. For all \( j\in \mathcal {L}^{{\mathrm {oracle}}}_{l} \) we have

$$\begin{aligned} H_{l}(y,x^{k+1})&\ge g^{+}(x^{k+1})-c_{f_{j}}^{k+1}+\langle s_{f_{j}}^{k+1},y-x^{k+1}\rangle \nonumber \\&= -e_{f_{j}}^{k+1}-\eta _{l} b_{j}^{k+1}+\langle \xi _{f}^{j}+\eta _{l}(y^{j}-x^{k+1}),y-x^{k+1}\rangle \nonumber \\&=-\Big (e_{f_{j}}^{k}{+}f(x^{k+1}){-}f(x^{k}){-}\langle \xi _{f}^{j},x^{k+1}-x^{k}\rangle \Big ) -\eta _{l}\Big (\frac{1}{2}\Vert x^{k+1}-x^{k}\Vert ^{2}\nonumber \\&\quad +\,b_{j}^{k}-\langle y^{j}-x^{k},x^{k+1}-x^{k}\rangle \Big )+\langle \xi _{f}^{j}+\eta _{l}(y^{j}-x^{k+1}),y-x^{k+1}\rangle \nonumber \\&=-(e_{f_{j}}^{k}{+}\eta _{l}b_{j}^{k}){+}f(x^{k})-f(x^{k+1})-\frac{\eta _{l}}{2}\Vert x^{k+1}-x^{k}\Vert ^{2}+\langle \xi _{f}^{j},y-x^{k}\rangle \nonumber \\&\quad +\,\eta _{l}\langle y^{j}-x^{k+1},y-x^{k+1}\rangle +\eta _{l}\langle y^{j}-x^{k},x^{k+1}-x^{k}\rangle \nonumber \\&=-(e_{f_{j}}^{k}+\eta _{l}b_{j}^{k})+f(x^{k}){-}f(x^{k+1}){-}\frac{\eta _{l}}{2}\Vert x^{k+1}-x^{k}\Vert ^{2}+\langle \xi _{f}^{j},y-x^{k}\rangle \nonumber \\&\quad +\,\eta _{l}\langle y^{j}-x^{k},y-x^{k}\rangle +\eta _{l}\langle x^{k}-x^{k+1},y-x^{k+1}\rangle \nonumber \\&= -(e_{f_{j}}^{k}+\eta _{l}b_{j}^{k}){+}f(x^{k})-f(x^{k+1}){-}\frac{\eta _{l}}{2}\Vert x^{k+1}-x^{k}\Vert ^{2}+\langle s_{f_{j}}^{k},y-x^{k}\rangle \nonumber \\&\quad +\,\eta _{l}\langle x^{k}-x^{k+1},y-x^{k+1}\rangle \nonumber \\&=g^{+}(x^{k})-c_{f_{j}}^{k}+ \langle s_{f_{j}}^{k},y-x^{k}\rangle -(f(x^{k+1})-f(x^{k}))^{+}\nonumber \\&\quad -\,\frac{\eta _{l}}{2}\Vert x^{k+1}-x^{k}\Vert ^{2} +\eta _{l}\langle x^{k}-x^{k+1},y-x^{k+1}\rangle , \end{aligned}$$
(17)

where the first inequality comes from the definition of \( H_{l}(y,x^{k+1}) \) in (10), the first equality holds by the first relation in (6a) and the first relation in (6b), the second equality is satisfied by (13a) and (13c) and the two last equalities follow from the first relation in (6a) and the first relation in (6b), respectively.

By the same discussion and adding \( -(f(x^{k+1})-f(x^{k}))^{+} \), for all \( j\in \mathcal {L}^{{\mathrm {oracle}}}_{l} \) we deduce

$$\begin{aligned} H_{l}(y,x^{k+1})\ge & {} g^{+}(x^{k})-c_{g_{j}}^{k}+ \langle s_{g_{j}}^{k},y-x^{k}\rangle -(f(x^{k+1})-f(x^{k}))^+\nonumber \\&-\frac{\eta _{l}}{2}\Vert x^{k+1}-x^{k}\Vert ^{2}+\,\eta _{l}\langle x^{k}-x^{k+1},y-x^{k+1}\rangle . \end{aligned}$$
(18)

Since (17) and (18) hold for all \( j\in \mathcal {L}_{l}^{{\mathrm {oracle}}} \), we get

$$\begin{aligned} H_{l}(y,x^{k+1})\ge & {} g^{+}(x^{k}) + \max _{j\in \mathcal {L}_{l}^{{\mathrm {oracle}}}}\{-c_{f_{j}}^{k}+ \langle s_{f_{j}}^{k},y-x^{k}\rangle , -c_{g_{j}}^{k}+ \langle s_{g_{j}}^{k},y-x^{k}\rangle \}\nonumber \\&-\,(f(x^{k+1})-f(x^{k}))^+{-}\frac{\eta _{l}}{2}\Vert x^{k+1}{-}x^{k}\Vert ^{2}{+}\eta _{l}\langle x^{k}-x^{k+1},y-x^{k+1}\rangle .\nonumber \\ \end{aligned}$$
(19)

By using the definition of \( H_{l}(y,x^{k}) \), (14a), (14b), for all \( j\in \mathcal {L}_{l}^{{\mathrm {agg}}} \), we have

$$\begin{aligned} H_{l}(y,x^{k+1})&\ge g^{+}(x^{k+1})-\widehat{e}_{j}^{k+1}+\langle \widehat{\xi }_{j}^{k+1},y-x^{k+1}\rangle \\&= g^{+}(x^{k+1}){-}\Big (\widehat{e}_{j}^{k}+g^{+}(x^{k+1}){-}g^{+}(x^{k}) {+} \frac{\eta _{l}}{2} \Vert x^{k+1}-x^{k}\Vert ^{2}+ (f(x^{k+1}) \\&\quad -\,f(x^{k}))^{+}-\langle \widehat{\xi }_{j}^{k}, x^{k+1}-x^{k}\rangle \Big )+\langle \widehat{\xi }_{j}^{k}+ \eta _{l} (x^{k} -x^{k+1}),y-x^{k+1} \rangle \\&= g^{+}(x^{k})+(-\widehat{e}_{j}^{k}+\langle \widehat{\xi }_{j}^{k},y{-}x^{k}\rangle ){+}\eta _{l}\langle x^{k}-x^{k+1},y-x^{k+1}\rangle \\&\quad -\,(f(x^{k+1})-f(x^{k}))^{+}-\frac{\eta _{l}}{2}\Vert x^{k+1}-x^{k}\Vert ^{2}, \end{aligned}$$

therefore

$$\begin{aligned} H_{l}(y,x^{k+1})\ge & {} g^{+}(x^{k}) + \max _{j\in \mathcal {L}_{l}^{{\mathrm {agg}}}}\left\{ -\widehat{e}_{j}^{k}+\langle \widehat{\xi }_{j}^{k},y-x^{k}\rangle \right\} -(f(x^{k+1})-f(x^{k}))^{+}\nonumber \\&-\,\frac{\eta _{l}}{2}\Vert x^{k+1}-x^{k}\Vert ^{2}+\eta _{l}\langle x^{k}-x^{k+1},y-x^{k+1}\rangle . \end{aligned}$$
(20)

Using (10), (19) and (20), we deduce

$$\begin{aligned} H_{l}(y,x^{k+1})\ge & {} H_{l}(y,x^{k}) -(f(x^{k+1})-f(x^{k}))^{+}-\frac{\eta _{l}}{2}\Vert x^{k+1}-x^{k}\Vert ^{2}\nonumber \\&+\,\eta _{l}\langle x^{k}-x^{k+1},y-x^{k+1}\rangle , \end{aligned}$$
(21)

and the proof is completed. \(\square \)

Lemma 4

Suppose that the associated \( y^{l} \) is declared as a serious iterate, i.e., \( x^{k+1}=y^{l} \). Then the following hold:

  1. (i)

    If for each \( j\in \mathcal {L}^{{\mathrm {oracle}}}_{l} \) we update \( e_{f_{j}}^{k+1} \) and \( e_{g_{j}}^{k+1} \) according to relations (13a), (13b), then conditions (3a), (3b) are satisfied with \( k:=k+1 \).

  2. (ii)

    If for each \( j\in \mathcal {L}^{{\mathrm {agg}}}_{l} \) we update the aggregate errors and the aggregate subgradients according to relations (14a) and (14b), then \( \widehat{e}_{j}^{k+1}\ge 0 \) and \( \widehat{\xi }_{j}^{k+1} \in \partial _{\widehat{e}_{j}^{k+1}}H_{j}(x^{k+1},x^{k+1}) \).

Proof

(i) Let \( j\in \mathcal {L}^{{\mathrm {oracle}}}_{l} \), we have

$$\begin{aligned} e_{f_{j}}^{k+1}&= e_{f_{j}}^{k} + f(x^{k+1})- f(x^{k}) - \langle \xi _{f}^{j}, x^{k+1}-x^{k}\rangle \\&= f(x^{k}) - f(y^{j}) - \langle \xi _{f}^{j}, x^{k}-y^{j}\rangle + f(x^{k+1})- f(x^{k}) - \langle \xi _{f}^{j}, x^{k+1}-x^{k}\rangle \\&= f(x^{k+1}) - f(y^{j}) - \langle \xi _{f}^{j}, x^{k+1}-y^{j}\rangle , \end{aligned}$$

and the same argument applies for \( e_{g_{j}}^{k+1} \).

(ii) Let \( j\in \mathcal {L}^{{\mathrm {agg}}}_{l} \). Since \( \widehat{\xi }_{j}^{k} \in \partial _{\widehat{e}^{k}_{j} } H_{j}(x^{k},x^{k}) \) and \( H_{j}(\cdot ,x^{k}) \) is convex, it follows that \( H_{j}(y,x^{k}) \ge H_{j}(x^{k},x^{k})+ \langle \widehat{\xi }_{j}^{k},y-x^{k}\rangle - \widehat{e}^{k}_{j} \) for all \( y\in \mathbb {R}^{n} \). By using (14a), (14b) and (21), we obtain

$$\begin{aligned} H_{j}(y,x^{k+1})&\ge H_{j}(y,x^{k})-(f(x^{k+1})-f(x^{k}))^{+}+\eta _{l}\langle x^{k}-x^{k+1}, y-x^{k+1}\rangle \nonumber \\&\quad -\,\frac{\eta _{l}}{2}\Vert x^{k+1}-x^{k}\Vert ^{2} \nonumber \\&\ge H_{j}(x^{k},x^{k})+ \langle \widehat{\xi }_{j}^{k},y-x^{k}\rangle - \widehat{e}^{k}_{j} -(f(x^{k+1})-f(x^{k}))^{+} \nonumber \\&\quad +\,\eta _{l}\langle x^{k}-x^{k+1}, y-x^{k+1}\rangle -\frac{\eta _{l}}{2}\Vert x^{k+1}-x^{k}\Vert ^{2}\nonumber \\&=\langle \widehat{\xi }_{j}^{k},y-x^{k}\rangle +g^{+}(x^{k+1}) - \widehat{e}^{k+1}_{j} +\eta _{l}\langle x^{k}-x^{k+1}, y-x^{k+1}\rangle \nonumber \\&\quad -\,\langle \widehat{\xi }_{j}^{k},x^{k+1}-x^{k}\rangle \nonumber \\&=g^{+}(x^{k+1}) - \widehat{e}^{k+1}_{j}+\eta _{l}\langle x^{k}-x^{k+1}, y-x^{k+1}\rangle +\langle \widehat{\xi }_{j}^{k},y-x^{k+1}\rangle \nonumber \\&= g^{+}(x^{k+1}) + \langle \widehat{\xi }_{j}^{k}+ \eta _{l}( x^{k}-x^{k+1}), y-x^{k+1}\rangle - \widehat{e}^{k+1}_{j}. \end{aligned}$$
(22)

By the convexity of \( H_{j}(\cdot ,x^{k+1}) \), we get \( \widehat{\xi }_{j}^{k+1} \in \partial _{\widehat{e}^{k+1}_{j} } H_{j}(x^{k+1},x^{k+1}) \). In particular, setting \( y:=x^{k+1} \) in (22) and using the fact that \( H_{j}(x^{k+1},x^{k+1})= g^{+}(x^{k+1}) \), we deduce \( \widehat{e}^{k+1}_{j}\ge 0 \). \(\square \)

Remark 4

In the following we assume that \( \{\eta _{l}\} \) is bounded. From (7) it is clear that \( \eta _{l}\ge 2\omega \) for all l, so we suppose that there exists \( \overline{\eta } \) such that \( 2\omega \le \eta _{l}\le \overline{\eta } \) for all l. Since the analysis of the algorithm does not require the constant \( \overline{\eta } \) to be known, this assumption is not restrictive for the implementation of the algorithm. The boundedness of \( \{\eta _{l}\} \) has been established in [14] for unconstrained optimization problems with \( {\mathrm {lower}}-C^{1} \) objective functions, and in [25] for constrained optimization problems with \( {\mathrm {lower}}-C^{1} \) objective and constraint functions. In our convergence analysis, however, we impose this assumption for regular functions, which is a weaker requirement than the \( {\mathrm {lower}}-C^{1} \) assumption.

3.4 Upper envelope model

The arguments of this paper are actually two-fold. The first line is as follows. For the nonconvex problem (1), we use the improvement function to get rid of the constraint, and the term \( \frac{\eta _{l}}{2}\Vert \cdot -x\Vert ^{2} \) to deal with potentially negative linearization errors. We do not require any \( {\mathrm {lower}}-C^{2} \), \( {\mathrm {lower}}-C^{1} \) or prox-regularity assumption, so the augmented function (4) can be convex or nonconvex. We then state the convex cutting plane model \( H_{l}(\cdot ,x^{k}) \) for (4) and use the usual bundle arguments for this model. The second line of argument connects the convex cutting plane model \( H_{l}(\cdot ,x^{k}) \) with the original nonconvex problem in order to analyze the convergence of the method. For this purpose, motivated by [30], we define the upper envelope model \( M^{\uparrow }(y,x) \) associated with the cutting planes for constrained problems. It is worth mentioning that the upper envelope model in [30] was defined for unconstrained optimization problems. Furthermore, the method of [30] employs the “downshift” mechanism, which is completely different from our mechanism for defining the cutting planes.

Suppose that \( \overline{B}_{R}(x) \) is a fixed closed ball, large enough to contain all possible trial steps \(y^{+}\) (this assumption was also made in [7, 30]). Set \( \mathcal {B}(x):=\{y|~y\in \overline{B}_{R}(x), y\text { is a trial point} \} \). For a given point \( x\in \mathbb {R}^{n} \), we define the upper envelope model \( M^{\uparrow }(\cdot ,x){:}\,\mathbb {R}^{n}\rightarrow \mathbb {R} \) as

$$\begin{aligned} M^{\uparrow }(y,x):= g^{+}(x)+\sup _{\begin{array}{c} 2\omega \le \eta \le \overline{\eta },~y^{+}\in \mathcal {B}(x) \\ \xi _{f}\in \partial _{c} f(y^{+}),~\xi _{g}\in \partial _{c} g(y^{+}) \end{array}} \left\{ m^{f}_{y^{+},\xi _{f},\eta }(y,x),~m^{g}_{y^{+},\xi _{g},\eta }(y,x)\right\} , \end{aligned}$$

where\(\,2\omega \le \eta \le \overline{\eta } \) with \( \overline{\eta } \) defined in Remark 4. The plane \( m^{f}_{y^{+},\xi _{f},\eta }(y,x) \) is the cutting plane at the serious iterate x and the trial step \( y^{+} \) as follows

$$\begin{aligned} m^{f}_{y^{+},\xi _{f},\eta }(y,x) :&= - c_{y^{+}}+ \langle \xi _{f}+\eta (y^{+}-x), y-x\rangle \nonumber \\&=- \frac{\omega }{2}\Vert y^{+}-x\Vert ^{2}+ \langle \xi _{f}+\eta (y^{+}-x), y-x\rangle . \end{aligned}$$
(23)

Also \( m^{g}_{y^{+},\xi _{g},\eta }(y,x) \) is the cutting plane at the serious iterate x and the trial step \( y^{+} \) as

$$\begin{aligned} m^{g}_{y^{+},\xi _{g},\eta }(y,x) :&=- c_{y^{+}}+ \langle \xi _{g}+\eta (y^{+}-x), y-x\rangle \nonumber \\&=- \frac{\omega }{2}\Vert y^{+}-x\Vert ^{2}+ \langle \xi _{g}+\eta (y^{+}-x), y-x\rangle . \end{aligned}$$
(24)

The boundedness of \( \overline{B}_{R}(x) \) and the definition of \( \eta \) imply that \( M^{\uparrow }(\cdot ,x) \) is defined everywhere. In the following we state some properties of the upper envelope model \( M^{\uparrow }(\cdot ,x) \). These results can be found in [7, Lemma 6.1]; however, we give the proof, since the definition of the upper envelope model \( M^{\uparrow }(\cdot ,x) \), its cutting planes and the details of the proof differ from those of [7].

Lemma 5

Suppose that f and g are regular functions, then the following hold:

  1. (i)

    \( M^{\uparrow }(\cdot ,x) \) is a convex function.

  2. (ii)

    \( H_{l}(\cdot ,x) \le M^{\uparrow }(\cdot ,x) \) for all l.

  3. (iii)

    \( M^{\uparrow }(x,x)=g^{+}(x). \)

  4. (iv)

    \( \partial M^{\uparrow }(x,x)\subseteq \partial _{c} h(x,x). \)

Proof

The proof of (i) is obvious. To prove (ii), from (6a), for all \( j\in \mathcal {L}_{l} \), we have \(c_{f_{j}}^{k}=e_{f_{j}}^{k}+g^{+}(x^{k})+\eta _{l}b_{j}^{k}\). By using (7), we obtain

$$\begin{aligned} e_{f_{j}}^{k}+g^{+}(x^{k})+\eta _{l}b_{j}^{k}\ge \frac{\omega }{2}\Vert y_{j}-x^{k}\Vert ^2, \end{aligned}$$

therefore for all \( j\in \mathcal {L}_{l} \) we have \( c_{f_{j}}^{k}\ge \frac{\omega }{2}\Vert y_{j}-x^{k}\Vert ^2 \). The same argument holds for g, and we get \( c_{g_{j}}^{k}\ge \frac{\omega }{2}\Vert y_{j}-x^{k}\Vert ^2 \). Therefore \( -\,c_{f_{j}}^{k}\le -\frac{\omega }{2}\Vert y_{j}-x^{k}\Vert ^2 \) and \( -\,c_{g_{j}}^{k}\le -\frac{\omega }{2}\Vert y_{j}-x^{k}\Vert ^2 \). Compare the definition of \( H_{l}(\cdot ,x^{k}) \) in (5) with the definition of \( M^{\uparrow }(\cdot ,x) \) and their cutting planes [see relations (6a), (6b), (23) and (24)]. The second terms of their cutting planes have the same structure, i.e., \( s_{f_{j}}^{k} \) corresponds to \( \xi _{f}+\eta (y^{+}-x) \) and \( s_{g_{j}}^{k} \) to \( \xi _{g}+\eta (y^{+}-x) \). Therefore

$$\begin{aligned} -c_{f_{j}}^{k}+\langle s_{f_{j}}^{k},y-x^{k} \rangle \le -\frac{\omega }{2}\Vert y_{j}-x^{k}\Vert ^{2}+\langle \xi _{f}^{j}+\eta _{l}(y_{j}-x^{k}),y-x^{k} \rangle , \end{aligned}$$

and

$$\begin{aligned} -c_{g_{j}}^{k}+\langle s_{g_{j}}^{k},y-x^{k} \rangle \le -\frac{\omega }{2}\Vert y_{j}-x^{k}\Vert ^{2}+\langle \xi _{g}^{j}+\eta _{l}(y_{j}-x^{k}),y-x^{k} \rangle . \end{aligned}$$

Since we assume \( \overline{B}_{R}(x) \) is large enough such that it contains all possible trial steps, we get \( H_{l}(\cdot ,x) \le M^{\uparrow }(\cdot ,x) \) for all l and the proof of (ii) is completed.

We have \( c_{y^{+}}= \frac{\omega }{2}\Vert y^{+}-x\Vert ^{2} \ge 0 \) and \( c_{x} = 0 \), therefore we obtain (iii)

$$\begin{aligned} M^{\uparrow }(x,x) = g^{+}(x) + \sup _{y^{+}\in \overline{B}_{R}(x)} \{- c_{y^{+}} \} = g^{+}(x). \end{aligned}$$

To prove (iv), let \( \overline{\xi }\in \partial M^{\uparrow }(x,x) \) and define \( \overline{m}(\cdot ,x)= M^{\uparrow }(x,x) + \langle \overline{\xi },\cdot -x\rangle = g^{+}(x)+ \langle \overline{\xi },\cdot -x\rangle \), the tangent plane to the graph of \( M^{\uparrow }(\cdot ,x) \) at x associated with \( \overline{\xi } \). By convexity, we have \( \overline{m}(\cdot ,x)\le M^{\uparrow }(\cdot ,x) \). Fix a vector \( d \in \mathbb {R}^{n} \) and consider the values \( M^{\uparrow }(x+td,x) \) for \( t>0 \). According to the definition of \( M^{\uparrow }(\cdot ,x) \), for any \( t>0 \) there exist \( 2\omega \le \eta _{t} \le \overline{\eta } \), \( y_{t}\in \overline{B}_{R}(x) \) and \( \xi _{t}^{f} \in \partial _{c} f(y_{t}) \) or \( \xi _{t}^{g} \in \partial _{c} g(y_{t}) \) such that \( M^{\uparrow }(x+td,x)\le g^{+}(x^{k})+m^{f}_{y_{t},\xi ^{f}_{t},\eta _{t}}(x+td,x)+t^{2} \) or \( M^{\uparrow }(x+td,x)\le g^{+}(x^{k})+m^{g}_{y_{t},\xi ^{g}_{t},\eta _{t}}(x+td,x)+t^{2} \).

We suppose that \( M^{\uparrow }(x+td,x)\le g^{+}(x^{k})+m^{f}_{y_{t},\xi ^{f}_{t},\eta _{t}}(x+td,x)+t^{2} \), therefore

$$\begin{aligned} \overline{m}(x+td,x)&\le g^{+}(x^{k})+m^{f}_{y_{t},\xi ^{f}_{t},\eta _{t}}(x+td,x)+t^{2}\\&= g^{+}(x)- c_{y_{t}}+\langle \xi ^{f}_{t}+\eta _{t}(y_{t}-x),td\rangle +t^{2}\\&=g^{+}(x)- \frac{\omega }{2}\Vert y_{t}-x\Vert ^{2}+\langle \xi _{t}^{f}+\eta _{t}(y_{t}-x),td\rangle +t^{2}. \end{aligned}$$

When \( M^{\uparrow }(x+td,x)\le g^{+}(x^{k})+m^{g}_{y_{t},\xi ^{g}_{t},\eta _{t}}(x+td,x)+t^{2} \), we again get the above relation with \( \xi _{t}^{g} \in \partial _{c} g(y^{t}) \). Therefore, we can say

$$\begin{aligned} \overline{m}(x+td,x)\le g^{+}(x)- \frac{\omega }{2}\Vert y_{t}-x\Vert ^{2}+\langle \xi _{t}+\eta _{t}(y_{t}-x),td\rangle +t^{2}, \end{aligned}$$
(25)

where \( \xi _{t}=\xi _{t}^{f} \) or \( \xi _{t}=\xi _{t}^{g} \). Since \( \xi _{t}^{f} \in \partial _{c} f(y_{t}) \), \( \xi _{t}^{g} \in \partial _{c} g(y_{t}) \) and f, g are locally Lipschitz on \( \overline{B}_{R}(x) \), the sequence \( \{\xi _{t}\} \) is bounded. Therefore \( \{\xi _{t}+\eta _{t}(y_{t}-x)\} \) is bounded and, since d is an arbitrary fixed vector, \( \langle \xi _{t}+\eta _{t}(y_{t}-x),td\rangle \rightarrow 0 \) as \( t\rightarrow 0 \). Passing to the limit in (25), we get \( \lim _{t\rightarrow 0} -\frac{\omega }{2} \Vert y_{t}-x\Vert ^{2} \ge 0 \), which implies that \( \lim _{t\rightarrow 0} \frac{\omega }{2}\Vert y_{t}-x\Vert ^{2} = 0 \), and we deduce \( y_{t} \rightarrow x \). Consider the sequence \( \{\xi _{t}\} \); since f and g are regular, we have \( \xi _{t} \in \partial _{c} h(y_{t},x) \). As mentioned above, this sequence is bounded on \( \overline{B}_{R}(x) \); passing to a subsequence if necessary, we may assume that there exists \( \widehat{\xi } \) such that \( \xi _{t} \rightarrow \widehat{\xi } \). Since \( y_{t} \rightarrow x \), it follows from the upper semicontinuity of the Clarke subdifferential that \( \widehat{\xi }\in \partial _{c} h(x,x) \). From (25), we obtain \( \langle \overline{\xi },d\rangle \le \langle \xi _{t}+\eta _{t}(y_{t}-x),d\rangle +t \); letting \( t\rightarrow 0 \), we get \( \langle \overline{\xi },d\rangle \le \langle \widehat{\xi },d\rangle \). Thus

$$\begin{aligned} \langle \overline{\xi },d\rangle \le \langle \widehat{\xi },d\rangle \le \max \{\langle \xi ,d\rangle ,~ \xi \in \partial _{c} h(x,x)\} = (h(\cdot ,x))^{\circ }(x,d). \end{aligned}$$

Since d is an arbitrary vector, we get \( \overline{\xi } \in \partial _{c} h(x,x) \) and the proof is completed. \(\square \)

3.5 Statement of the algorithm

Now we are in a position to state our infeasible proximal bundle algorithm for solving problem (1).

Algorithm 1: Infeasible proximal bundle algorithm
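Since the formal listing of Algorithm 1 appears as a figure in the original, we give only a schematic driver loop reconstructed from the description in Sects. 3.1–3.3, combining the sketches given above (build_model_data, solve_subproblem, descent_test); the parameter defaults, the fixed proximal parameter, and the omission of aggregate planes and bundle compression are simplifying assumptions of this sketch, not the authors' rules.

```python
import numpy as np

def infeasible_proximal_bundle(f, g, subgrad_f, subgrad_g, x0, lb, ub,
                               mu=1.0, m=0.1, omega=1e-2, tol=1e-6,
                               max_iter=500):
    """Schematic driver loop.  For brevity the oracle quantities are recomputed
    from scratch at every iteration and aggregate planes are not kept, so the
    updating formulas (13)-(14) and bundle compression are not exercised here;
    a practical implementation would use update_bundle_on_serious instead."""
    xk = np.asarray(x0, dtype=float)
    fxk, gxk = f(xk), g(xk)
    # x^0 is regarded as the first serious iterate and enters the bundle
    oracle = [dict(y=xk.copy(), f=fxk, g=gxk,
                   xi_f=subgrad_f(xk), xi_g=subgrad_g(xk))]

    for _ in range(max_iter):
        data = build_model_data(xk, fxk, gxk, oracle, omega)
        yl, xi_hat, e_hat, _ = solve_subproblem(xk, data['gplus'],
                                                data['c_f'], data['c_g'],
                                                data['s_f'], data['s_g'],
                                                mu, lb, ub)
        fyl, gyl = f(yl), g(yl)
        # ||xi_hat + v_l|| = mu * ||x^k - y^l|| by the optimality condition of (8)
        status, _ = descent_test(max(fyl - fxk, gyl), data['gplus'], e_hat,
                                 mu * np.linalg.norm(xk - yl), mu, m, tol)
        if status == 'stop':
            break
        # every trial point (serious or null) enriches the oracle bundle
        oracle.append(dict(y=yl, f=fyl, g=gyl,
                           xi_f=subgrad_f(yl), xi_g=subgrad_g(yl)))
        if status == 'serious':
            xk, fxk, gxk = yl, fyl, gyl
    return xk
```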

Remark 5

It is worth mentioning that, contrary to some nonconvex constrained optimization algorithms that solve a quadratic programming subproblem and also use a line search, e.g., [5, 27, 28, 36], our method does not employ any line search subroutine; it only solves a quadratic programming subproblem, without impairing its global convergence.

Remark 6

We require a usual rule for choosing the proximal parameter \( \mu _{l} \). The conditions that \( \mu _{l} \) should satisfy for global convergence are very simple, and we state them in the convergence results (see Sect. 4).

4 Convergence analysis

In this section, we establish the global convergence of Algorithm 1. We assume from now on that the stopping tolerance tol is set to zero. Note that if the algorithm stops at Step 3, then \( \delta ^{l}=0 \). Indeed, \( \delta ^{l}=0 \) implies that \( \Vert \widehat{\xi }_{l}^{k}+v_{l}\Vert =0 \) and \( \widehat{e}_{l}^{k}=0 \). Since \( \widehat{\xi }_{l}^{k}\in \partial _{\widehat{e}_{l}^{k}} H_{l}(x^{k},x^{k}) \), we have

$$\begin{aligned} H_{l}(y,x^{k}) \ge H_{l}(x^{k},x^{k})+ \langle \widehat{\xi }_{l}^{k}, y-x^{k}\rangle -\widehat{e}_{l}^{k}=g^{+}(x^{k})+ \langle \widehat{\xi }_{l}^{k}, y-x^{k}\rangle -\widehat{e}_{l}^{k}, \end{aligned}$$
(26)

for all \( y \in \mathbb {R}^{n} \). By \( v_{l}\in \partial I_{X}(y^{l}) \), we get

$$\begin{aligned} I_{X}(y)&\ge I_{X}(y^{l})+ \langle v_{l},y-y^{l} \rangle \nonumber \\&=\langle v_{l},y-y^{l} \rangle =I_{X}(x^{k})+\langle v_{l},y-x^{k} \rangle +\langle v_{l},x^{k}-y^{l} \rangle \nonumber \\&=I_{X}(x^{k})+\langle v_{l},y-x^{k} \rangle +\langle v_{l},\frac{1}{\mu _{l}}(\widehat{\xi }_{l}^{k}+v_{l})\rangle , \end{aligned}$$
(27)

for all \( y \in \mathbb {R}^{n} \). By (26) and (27), we obtain

$$\begin{aligned} H_{l}(y,x^{k})+I_{X}(y)&\ge g^{+}(x^{k})+ \langle \widehat{\xi }_{l}^{k}, y-x^{k}\rangle -\widehat{e}_{l}^{k}+I_{X}(x^{k}) +\langle v_{l},y-x^{k} \rangle \nonumber \\&\quad +\,\left\langle v_{l},\frac{1}{\mu _{l}}(\widehat{\xi }_{l}^{k}+v_{l})\right\rangle \nonumber \\&= g^{+}(x^{k}){+}I_{X}(x^{k}){+} \langle \widehat{\xi }_{l}^{k}+v_{l}, y-x^{k}\rangle -\widehat{e}_{l}^{k} {+}\left\langle v_{l},\frac{1}{\mu _{l}}(\widehat{\xi }_{l}^{k}+v_{l})\right\rangle . \end{aligned}$$
(28)

Since \( \Vert \widehat{\xi }_{l}^{k}+v_{l}\Vert =\widehat{e}_{l}^{k}=0 \), using this fact in relation (28) we deduce

$$\begin{aligned} H_{l}(y,x^{k})+I_{X}(y) \ge g^{+}(x^{k})+I_{X}(x^{k})+ \langle 0, y-x^{k}\rangle . \end{aligned}$$

Due to Lemma 5 (ii) and (iii), we get

$$\begin{aligned} M^{\uparrow }(y,x^{k})+I_{X}(y) \ge M^{\uparrow }(x^{k},x^{k})+I_{X}(x^{k})+ \langle 0, y-x^{k}\rangle , \end{aligned}$$

and by the convexity of \( M^{\uparrow }(\cdot ,x^{k}) \) and \( I_{X}(\cdot ) \), we deduce \( 0\in \partial M^{\uparrow }(x^{k},x^{k})+\partial I_{X}(x^{k}) \). Using Lemma 5 (iv) we obtain \( 0\in \partial _{c} h(x^{k},x^{k})+\partial I_{X}(x^{k}) \). Hence when Algorithm 1 stops finitely, the last serious iterate \( x^{k} \) is stationary for the improvement function \( h(\cdot ,x^{k}) \) on X.

From now on, we assume that Algorithm 1 does not stop and generates an infinite sequence of iterates. As is customary in bundle methods, we consider two cases: the sequence of serious iterates \( \{x^{k}\} \) is either finite or infinite. The set \( \mathcal {A}:=\{l|~l~\text {is a serious iteration}\} \) collects the indices of the serious iterates in the sequence \( \{y^{l}\} \). We denote the feasible set of problem (1) by \( \mathcal {D}:= \{x \in X|~ g(x) \le 0\} \).

Proposition 1

Let \( k_{1} \) be any serious iteration index. Then for all \( k\ge k_{1} \) we have \( x^{k} \in \{x \in \mathbb {R}^{n}|~ g(x) \le g^{+}(x^{k_{1}})\} \). Therefore, if there exists \( k_{2} \) such that \( x^{k_{2}} \in \mathcal {D} \), then \( x^{k} \in \mathcal {D} \) for all \( k\ge k_{2} \).

Proof

By Remark 2, for every serious step we have \( g(x^{k+1})\le g^{+}(x^{k})-m\delta ^{l}\le g^{+}(x^{k}) \), and hence \( g^{+}(x^{k+1})\le g^{+}(x^{k}) \). By induction, \( g(x^{k})\le g^{+}(x^{k})\le g^{+}(x^{k_{1}}) \) for all \( k\ge k_{1} \). In particular, if \( x^{k_{2}}\in \mathcal {D} \), then \( g^{+}(x^{k_{2}})=0 \), so \( g(x^{k})\le 0 \) and, since \( x^{k}\in X \), \( x^{k}\in \mathcal {D} \) for all \( k\ge k_{2} \). \(\square \)

From Proposition 1, we obtain immediately that if \( x^{0} \) is a feasible point, then the sequence of the serious iterates \( \{x ^{k}\} \) is feasible too.

Lemma 6

Suppose that f is bounded below on \( \mathcal {D} \), and Algorithm 1 generates an infinite number of serious iterates. Then \( \{\delta ^{l}\}_{l \in \mathcal {A}}\rightarrow 0 \) and \( \{ \widehat{e}_{l}^{k}\}_{l \in \mathcal {A}}\rightarrow 0 \). Furthermore if there exists \( \overline{\mu }>0 \) such that \( \mu _{l}\le \overline{\mu } \) for all \( l\in \mathcal {A} \), then \( \{ \widehat{\xi }_{l}^{k}+v_{l}\}_{l \in \mathcal {A}}\rightarrow 0 \).

Proof

By Proposition 1, we consider two cases: either \( x^{k} \notin \mathcal {D} \) for all k or there exists an index \( k_{3} \) such that \( x^{k}\in \mathcal {D} \) for all \( k\ge k_{3} \).

  1. (i)

    If \( x^{k} \notin \mathcal {D} \) for all k, then by (12) we have \( m\delta ^{l(k+1)}\le g(x^{k})- g(x^{k+1}) \) for all \( k\ge 0 \). Hence \( \{g(x^{k})\} \) is decreasing and, since \( g(x^{k})>0 \), it is bounded below; therefore there exists \( \overline{g} \) such that \( g(x^{k})\rightarrow \overline{g} \) as \( k\rightarrow \infty \). Summing the above inequality over k yields \( \sum _{l\in \mathcal {A}} \delta ^{l} \le \frac{g(x^{0})- \overline{g}}{m} <\infty \).

  2. (ii)

    Now suppose that there exists \( k_{3} \) such that \( x^{k} \notin \mathcal {D} \) for \( k<k_{3} \) and \( x^{k} \in \mathcal {D} \) for all \( k\ge k_{3} \). By (12) we have \( m\delta ^{l(k+1)}\le f(x^{k})- f(x^{k+1}) \) for all \( k\ge k_{3} \). Hence \( \{f(x^{k})\} \) is decreasing for \( k\ge k_{3} \) and, since f is bounded below on \( \mathcal {D} \), there exists \( \overline{f} \) such that \( f(x^{k})\rightarrow \overline{f} \) as \( k\rightarrow \infty \). This implies

    $$\begin{aligned} \sum _{l\in \mathcal {A}} \delta ^{l}= \sum _{l\in \mathcal {A}, ~l<l(k_{3})} \delta ^{l}{+} \sum _{l\in \mathcal {A},~ l\ge l(k_{3})} \delta ^{l} \le \frac{1}{m}(g(x^{0}) {-} g(x^{k_{3}}) + f(x^{k_{3}}) - \overline{f})<\infty . \end{aligned}$$

Hence in each case we have \( \sum _{l\in \mathcal {A}} \delta ^{l}<\infty \), which implies that \( \{\delta ^{l}\}_{l \in \mathcal {A}}\rightarrow 0 \). Since, by the definition of \( \delta ^{l} \), \( \widehat{e}_{l}^{k} \le \delta ^{l} \) and \( \frac{1}{2\mu _{l}} \Vert \widehat{\xi }_{l}^{k}+v_{l}\Vert ^{2} \le \delta ^{l} \), it immediately follows that \( \{\widehat{e}_{l}^{k}\}_{l \in \mathcal {A}}\rightarrow 0 \), and by \( \mu _{l}\le \overline{\mu } \) we also have \( \{\widehat{\xi }_{l}^{k}+v_{l}\}_{l \in \mathcal {A}}\rightarrow 0 \). This completes the proof. \(\square \)

In what follows, we assume that Algorithm 1 generates an infinite sequence of serious iterates and show that every accumulation point of this sequence is a stationary point of the improvement function. Here, motivated by Lemma 1, we call \( x^{*} \) a stationary point of the constraint violation if \( 0 \in \partial _{c} g(x^{*}) \) and \( g(x^{*})> 0 \).

Theorem 1

Suppose that f and g are regular functions, f is bounded below on \( \mathcal {D} \) and \( \{\eta _{l}\} \) is bounded above. Assume that Algorithm 1 generates an infinite sequence of serious iterates. If there exists \( \overline{\mu }>0 \) such that \( \mu _{l} \le \overline{\mu } \), then every accumulation point \( x^{*} \) of \( \{x^{k}\} \) satisfies \( 0\in \partial _{c} h(x^{*},x^{*})+\partial I_{X}(x^{*}) \); that is, \( x^{*} \) is a stationary point of the improvement function on X. In particular, \( x^{*} \) is either a stationary point of the constraint violation, a Fritz John point, or a KKT point of problem (1).

Proof

By using Lemma 6 we have

$$\begin{aligned} \{\delta ^{l}\}_{l \in \mathcal {A}}\rightarrow 0, \quad \{ \widehat{e}_{l}^{k}\}_{l \in \mathcal {A}}\rightarrow 0 \quad \text {and} \quad \{ \widehat{\xi }_{l}^{k}+v_{l}\}_{l \in \mathcal {A}}\rightarrow 0. \end{aligned}$$
(29)

Let \( l \in \mathcal {A} \). Since \( \widehat{\xi }_{l}^{k}+v_{l} \in \partial H_{l} (y^{l},x^{k})+ \partial I_{X}(y^{l})\), the convexity of \( H_{l} (\cdot ,x^{k})+I_{X}(\cdot ) \) yields, for all \( y\in \mathbb {R}^{n} \),

$$\begin{aligned} M^{\uparrow }(y,x^{k})+I_{X}(y)&\ge H_{l} (y,x^{k})+I_{X}(y) \nonumber \\&\ge H_{l} (y^{l},x^{k})+ I_{X}(y^{l})+ \langle \widehat{\xi }_{l}^{k}+v_{l}, y- y^{l}\rangle \nonumber \\&=g^{+}(x^{k})- \widehat{e}_{l}^{k}+\langle \widehat{\xi }_{l}^{k},y^{l}-x^{k}\rangle + I_{X}(y^{l})+ \langle \widehat{\xi }_{l}^{k}+v_{l}, y- y^{l}\rangle \nonumber \\&=g^{+}(x^{k})- \widehat{e}_{l}^{k}+\langle \widehat{\xi }_{l}^{k}+v_{l},y-x^{k}\rangle +\langle v_{l}, x^{k}- y^{l}\rangle \nonumber \\&=g^{+}(x^{k})- \widehat{e}_{l}^{k}+\langle \widehat{\xi }_{l}^{k}+v_{l},y-x^{k}\rangle +\left\langle v_{l}, \frac{1}{\mu _{l}}(\widehat{\xi }_{l}^{k}+v_{l})\right\rangle , \end{aligned}$$
(30)

where the first inequality comes from Lemma 5 (ii), the first equality follows from (9) and the last equality from \( x^{k}-y^{l}=\frac{1}{\mu _{l}}(\widehat{\xi }_{l}^{k}+v_{l}) \).

Since \( \{x^{k}\} \) lies in the compact set X, it has a subsequence \( x^{k} \rightarrow _{k \in \mathcal {K}} x^{*} \) with \( x^{*} \in X \). For \( l\in \mathcal {A} \) we write \( l=l(k) \). Since \( x^{k}\rightarrow x^{*} \) and \( x^{k}-y^{l}=\frac{1}{\mu _{l}}(\widehat{\xi }_{l}^{k}+v_{l}) \rightarrow 0 \), we get \( y^{l}\rightarrow x^{*} \) as \( k \in \mathcal {K},~ l(k)\rightarrow \infty \). On the other hand, \( v_{l}\in \partial I_{X}(y^{l}) \) and X is a convex compact set, hence there exists \( \overline{\varepsilon }\) such that \( \{v_{l}\}\subseteq \partial I_{X}(x^{*})+\overline{B}_{\overline{\varepsilon }}(0) \). Combining (30) and (29) for \( k \in \mathcal {K} \), passing to the limit and using the boundedness of \( \{v_{l}\} \), we obtain \( \lim _{k\rightarrow \infty ,k \in \mathcal {K}} M^{\uparrow }(y,x^{k})+ I_{X}(y) \ge \lim _{k\rightarrow \infty ,k \in \mathcal {K}}g^{+}(x^{k}) +\langle 0, y-x^{k}\rangle \) and hence

$$\begin{aligned} M^{\uparrow }(y,x^{*})+ I_{X}(y)\ge g^{+}(x^{*}) + \langle 0, y-x^{*}\rangle = M^{\uparrow }(x^{*},x^{*})+ I_{X}(x^{*})+ \langle 0, y-x^{*}\rangle . \end{aligned}$$

Using the convexity of \( M^{\uparrow }(\cdot ,x^{*})+ I_{X}(\cdot ) \), we deduce \( 0\in \partial M^{\uparrow }(x^{*},x^{*})+\partial I_{X}(x^{*}) \) and by Lemma 5 (iv) we have \( 0\in \partial _{c} h(x^{*},x^{*})+\partial I_{X}(x^{*}) \).

If \( g(x^{*})>0 \), then \( 0 \in \partial _{c} g(x^{*})+\partial I_{X}(x^{*}) \), so \( x^{*} \) is a stationary point of the constraint violation. If \( g(x^{*})=0 \), then \(x^{*}\) is a Fritz John point. Otherwise, \( x^{*} \) is a KKT point of problem (1). \(\square \)

We conclude by considering the case when the number of serious iterates is finite, i.e., there exists \( l^{*}:=\max \{l|~l\in \mathcal {A}\} \). We denote the corresponding last serious iterate by \( x^{{ end}}:=x^{k(l^{*})}=y^{l^{*}} \). Then for all \( l\ge l^{*} \) the iterate \( x^{{ end}} \) is fixed. We show that \( x^{{ end}} \) is a stationary point of the improvement function on X.

Theorem 2

Assume that f and g are regular functions and Algorithm 1 takes a finite number of serious iterates. If \( \{\eta _{l}\} \) is bounded above and \( \mu _{l} \le \mu _{l+1} \le \overline{\mu } \) for all \( l\ge l^{*} \), then \( 0 \in \partial _{c} h( x^{{ end}},x^{{ end}})+\partial I_{X}(x^{{ end}}) \), i.e., \( x^{{ end}} \) is a stationary point of the improvement function on X. In particular, \( x^{{ end}} \) is either a stationary point of the constraint violation, a Fritz John point, or a KKT point of problem (1).

Proof

Suppose that \( l\ge l^{*} \) and let \( y^{l} \) be the optimal solution of (8). We show that the sequence \( \{H_{l}(y^{l},x^{{ end}})+ \frac{\mu _{l}}{2} \Vert y^{l}-x^{{ end}}\Vert ^2\} \) is bounded above and nondecreasing. Since \( \widehat{\xi }_{l}^{{ end}}\in \partial H_{l}(y^{l},x^{{ end}}) \) and \( H_{l}(\cdot ,x^{{ end}}) \) is convex, we have \( H_{l}(y,x^{{ end}}) \ge H_{l}(y^{l},x^{{ end}})+\langle \widehat{\xi }_{l}^{{ end}},y-y^{l}\rangle \) for every \( y \in \mathbb {R}^{n} \). By the definition of the aggregate plane and (9), we have \( m_{l}^{*}(y,x^{{ end}})=H_{l}(y^{l},x^{{ end}})+\langle \widehat{\xi }_{l}^{{ end}},y-y^{l}\rangle \). Substituting \( y=x^{{ end}} \), we get \( H_{l}(y^{l},x^{{ end}}) = m_{l}^{*}(x^{{ end}},x^{{ end}})- \langle \widehat{\xi }_{l}^{{ end}},x^{{ end}}-y^{l}\rangle \). Then we obtain

$$\begin{aligned}&H_{l}(y^{l},x^{{ end}})+ \frac{\mu _{l}}{2} \Vert y^{l}-x^{{ end}}\Vert ^{2} \\&\quad \le H_{l}(y^{l},x^{{ end}})+ \mu _{l} \Vert y^{l}-x^{{ end}}\Vert ^{2}\\&\quad \le H_{l}(y^{l},x^{{ end}})+ \mu _{l} \langle y^{l}-x^{{ end}},y^{l}-x^{{ end}} \rangle - \langle v_{l},x^{{ end}}-y^{l} \rangle \\&\quad = H_{l}(y^{l},x^{{ end}})+ \langle \widehat{\xi }_{l}^{{ end}},x^{{ end}}-y^{l} \rangle \\&\quad = m_{l}^{*}(x^{{ end}},x^{{ end}}) \le H_{l+1} (x^{{ end}},x^{{ end}})=g^{+}(x^{{ end}}), \end{aligned}$$

where the first inequality is clear, the second inequality follows from \( v_{l}\in \partial I_{X}(y^{l}) \), the first equality holds by the definition of \( \widehat{\xi }_{l}^{{ end}} \), the second equality by the definition of \( m_{l}^{*}(\cdot ,x^{{ end}}) \), and the last inequality by the definition of \( H_{l+1} (\cdot ,x^{{ end}}) \). Hence the sequence is bounded above. Next, we prove that this sequence is nondecreasing:

$$\begin{aligned}&H_{l+1}(y^{l+1},x^{{ end}})+ \frac{\mu _{l+1}}{2} \Vert y^{l+1}-x^{{ end}}\Vert ^{2} \\&\quad \ge m_{l}^{*}(y^{l+1},x^{{ end}})+ \frac{\mu _{l+1}}{2} \Vert y^{l+1}-x^{{ end}}\Vert ^{2}\\&\quad \ge m_{l}^{*}(y^{l+1},x^{{ end}})+ \frac{\mu _{l}}{2} \Vert y^{l+1}-x^{{ end}}\Vert ^{2}\\&\quad = H_{l}(y^{l},x^{{ end}})+ \langle \widehat{\xi }_{l}^{{ end}},y^{l+1}-y^{l} \rangle + \frac{\mu _{l}}{2} \Vert y^{l+1}-x^{{ end}}\Vert ^{2}\\&\quad = H_{l}(y^{l},x^{{ end}})+\langle \mu _{l}(x^{{ end}}-y^{l})-v_{l}, y^{l+1}-y^{l}\rangle +\frac{\mu _{l}}{2} \Vert y^{l+1}-x^{{ end}}\Vert ^{2}\\&\quad \ge H_{l}(y^{l},x^{{ end}})+\langle \mu _{l}(x^{{ end}}-y^{l}), y^{l+1}-y^{l}\rangle +\frac{\mu _{l}}{2} \Vert y^{l+1}-x^{{ end}}\Vert ^{2}\\&\quad =H_{l}(y^{l},x^{{ end}}) + \frac{\mu _{l}}{2} \big ( \Vert y^{l}-x^{{ end}}\Vert ^{2} + \Vert y^{l+1}-y^{l}\Vert ^{2} \big )\\&\quad \ge H_{l}(y^{l},x^{{ end}}) + \frac{\mu _{l}}{2} \Vert y^{l}-x^{{ end}}\Vert ^{2}. \end{aligned}$$

where the first inequality holds by the definition of \( H_{l+1}(\cdot ,x^{{ end}}) \), the second inequality by \( \mu _{l+1}\ge \mu _{l} \), the first equality by the definition of \( m_{l}^{*}(\cdot ,x^{{ end}}) \), the second equality by \( \widehat{\xi }_{l}^{{ end}}=\mu _{l}(x^{{ end}}-y^{l})-v_{l} \), the third inequality from \( v_{l}\in \partial I_{X}(y^{l}) \), and the penultimate equality by the elementary expansion written out after this paragraph. Therefore the sequence \( \{H_{l}(y^{l},x^{{ end}})+ \frac{\mu _{l}}{2} \Vert y^{l}-x^{{ end}}\Vert ^2\} \) is nondecreasing and bounded above, hence convergent. Passing to the limit in the above chain we obtain \( \lim _{l \rightarrow \infty } \frac{\mu _{l}}{2} \Vert y^{l+1}-y^{l}\Vert ^{2} \le 0 \), and since this term is nonnegative the limit equals zero. By the assumption \( \mu _{l}\le \mu _{l+1} \le \overline{\mu } \), the sequence \( \{\mu _{l}\} \) is bounded below by \( \mu _{l^{*}} \). This implies that \( y^{l+1}-y^{l} \rightarrow 0 \); moreover, since \( v_{l}\in \partial I_{X}(y^{l}) \) we have \( y^{l}\in X \) for every l, so \( \{y^{l}\} \) is bounded.
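For the reader's convenience, the expansion behind the penultimate equality in the chain above is elementary: writing \( a:=y^{l}-x^{{ end}} \) and \( b:=y^{l+1}-y^{l} \), so that \( y^{l+1}-x^{{ end}}=a+b \), we have

$$\begin{aligned} \langle \mu _{l}(x^{{ end}}-y^{l}), y^{l+1}-y^{l}\rangle +\frac{\mu _{l}}{2} \Vert y^{l+1}-x^{{ end}}\Vert ^{2}&= -\mu _{l}\langle a,b\rangle +\frac{\mu _{l}}{2}\big (\Vert a\Vert ^{2}+2\langle a,b\rangle +\Vert b\Vert ^{2}\big )\\&=\frac{\mu _{l}}{2}\big (\Vert y^{l}-x^{{ end}}\Vert ^{2}+\Vert y^{l+1}-y^{l}\Vert ^{2}\big ). \end{aligned}$$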

By using the definition of \( \delta ^{l} \) we have

$$\begin{aligned} \delta ^{l}&= \widehat{e}_{l}^{{ end}} + \frac{1}{2\mu _{l}} \Vert \widehat{\xi }_{l}^{{ end}}+v_{l}\Vert ^{2}\\&=g^{+}(x^{{ end}}) - H_{l}(y^{l},x^{{ end}})- \langle \widehat{\xi }_{l}^{{ end}}, x^{{ end}}-y^{l}\rangle + \frac{1}{2\mu _{l}} \Vert \widehat{\xi }_{l}^{{ end}}+v_{l}\Vert ^{2}\\&\le g^{+}(x^{{ end}}) - H_{l}(y^{l},x^{{ end}})- \langle \widehat{\xi }_{l}^{{ end}}, x^{{ end}}-y^{l} \rangle + \frac{1}{\mu _{l}} \Vert \widehat{\xi }_{l}^{{ end}}+v_{l}\Vert ^{2}\\&= g^{+}(x^{{ end}}) - H_{l}(y^{l},x^{{ end}})- \langle \widehat{\xi }_{l}^{{ end}}, x^{{ end}}-y^{l} \rangle + \left\langle \frac{1}{\mu _{l}} (\widehat{\xi }_{l}^{{ end}}+v_{l}),\widehat{\xi }_{l}^{{ end}}+v_{l}\right\rangle \\&= g^{+}(x^{{ end}}) - H_{l}(y^{l},x^{{ end}})- \langle \widehat{\xi }_{l}^{{ end}}, x^{{ end}}-y^{l} \rangle + \langle x^{{ end}}-y^{l},\widehat{\xi }_{l}^{{ end}}+v_{l}\rangle \\&\le g^{+}(x^{{ end}}) - H_{l}(y^{l},x^{{ end}})+ \langle x^{{ end}}-y^{l},v_{l}\rangle \\&\le g^{+}(x^{{ end}}) - H_{l}(y^{l},x^{{ end}}), \end{aligned}$$

where the first equality holds by the definition of \( \widehat{e}_{l}^{{ end}} \) in (9), the last inequality follows from \( v_{l}\in \partial I_{X}(y^{l}) \) and \( \langle v_{l},x^{{ end}}-y^{l}\rangle \le 0 \), and the other relations are straightforward. Writing the above inequality for \( l+1 \), we obtain

$$\begin{aligned} \delta ^{l+1} \le g^{+}(x^{{ end}}) - H_{l+1}(y^{l+1},x^{{ end}}). \end{aligned}$$
(31)

Since l is a null iteration, (11) is not satisfied; that is, either \( f(y^{l})-f(x^{{ end}})>g^{+}(x^{{ end}})-m\delta ^{l} \) or \( g(y^{l})> g^{+}(x^{{ end}})-m\delta ^{l} \).

(I) Suppose first that \( f(y^{l})-f(x^{{ end}})>g^{+}(x^{{ end}})-m\delta ^{l} \). Then we get

$$\begin{aligned} -c_{f_{l}}^{{ end}} + \langle s^{{ end}}_{f_{l}}, y^{l}-x^{{ end}}\rangle&= -\Big (e_{f_{l}}^{{ end}}+g^{+}(x^{{ end}})+\frac{\eta _{l}}{2} \Vert y^{l}-x^{{ end}}\Vert ^{2} \Big )\nonumber \\&\quad +\,\langle \xi _{f}^{l}+\eta _{l}(y^{l}-x^{{ end}}),y^{l}-x^{{ end}}\rangle \nonumber \\&= -f(x^{{ end}})+f(y^{l})+ \langle \xi ^{l}_{f}, x^{{ end}}-y^{l}\rangle - g^{+}(x^{{ end}}) \nonumber \\&\quad +\,\langle \xi ^{l}_{f}, y^{l}-x^{{ end}}\rangle -\frac{\eta _{l}}{2} \Vert y^{l}-x^{{ end}}\Vert ^{2} + \eta _{l}\Vert y^{l}-x^{{ end}}\Vert ^{2}\nonumber \\&=f(y^{l})-f(x^{{ end}})-g^{+}(x^{{ end}})+\frac{\eta _{l}}{2} \Vert y^{l}-x^{{ end}}\Vert ^{2} \nonumber \\&> g^{+}(x^{{ end}}) - m \delta ^{l} -g^{+}(x^{{ end}})+ \frac{\eta _{l}}{2}\Vert y^{l}-x^{{ end}}\Vert ^{2} \nonumber \\&\ge - m \delta ^{l}, \end{aligned}$$
(32)

where the first equality is due to the first relation in (6a) and the relations in (6b), and the second equality holds by (3a). By the definition of \( H_{l+1}(\cdot ,x^{{ end}}) \) in (10), we obtain

$$\begin{aligned} H_{l+1}(y,x^{{ end}}) \ge g^{+}(x^{{ end}}) -c_{f_{l}}^{{ end}}+\langle s_{f_{l}}^{{ end}},y-x^{{ end}}\rangle , \quad \forall y \in \mathbb {R}^{n}. \end{aligned}$$

Substituting \( y:=y^{l+1} \), we get

$$\begin{aligned} -H_{l+1}(y^{l+1},x^{{ end}}) \le - g^{+}(x^{{ end}}) +c_{f_{l}}^{{ end}}-\langle s_{f_{l}}^{{ end}},y^{l+1}-x^{{ end}}\rangle . \end{aligned}$$
(33)

Thus (32) and (33) imply that \( - m \delta ^{l} -H_{l+1}(y^{l+1},x^{{ end}}) \le - g^{+}(x^{{ end}}) +\langle s^{{ end}}_{f_{l}}, y^{l}-y^{l+1}\rangle \), and by (31) we obtain \( 0 \le \delta ^{l+1} \le m \delta ^{l} +\langle s_{f_{l}}^{{ end}}, y^{l}-y^{l+1}\rangle \).

(II) If \( g(y^{l})> g^{+}(x^{{ end}})-m\delta ^{l} \), then by the same argument we again conclude that

$$\begin{aligned} 0 \le \delta ^{l+1} \le m \delta ^{l} +\langle s_{g_{l}}^{{ end}}, y^{l}-y^{l+1}\rangle . \end{aligned}$$
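Both cases therefore yield a recursion of the form \( \delta ^{l+1}\le m\delta ^{l}+\varepsilon _{l} \), where \( \varepsilon _{l} \) denotes the corresponding inner product (a shorthand used only in this remark). Once \( \varepsilon _{l}\rightarrow 0 \) is established below, such a recursion forces \( \delta ^{l}\rightarrow 0 \): the recursion together with the boundedness of \( \{\varepsilon _{l}\} \) keeps \( \{\delta ^{l}\} \) bounded, and

$$\begin{aligned} \limsup _{l\rightarrow \infty }\delta ^{l+1}\le m\limsup _{l\rightarrow \infty }\delta ^{l}+\limsup _{l\rightarrow \infty }\varepsilon _{l} = m\limsup _{l\rightarrow \infty }\delta ^{l}, \end{aligned}$$

so that \( (1-m)\limsup _{l\rightarrow \infty }\delta ^{l}\le 0 \); together with \( \delta ^{l}\ge 0 \) and \( m\in (0,1) \), this gives \( \delta ^{l}\rightarrow 0 \).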

Since \( \{y^{l}\} \) is bounded, the functions f and g are locally Lipschitz and X is compact, the sequences \( \{\xi _{f}^{l}\} \) and \( \{\xi _{g}^{l}\} \) are bounded. Hence, by the boundedness of \( \{\eta _{l}\} \), the vectors \( s_{f_{l}}^{{ end}}=\xi _{f}^{l}+\eta _{l}(y^{l}-x^{{ end}}) \) and \( s_{g_{l}}^{{ end}}=\xi _{g}^{l}+\eta _{l}(y^{l}-x^{{ end}}) \) are bounded. Since \( y^{l}-y^{l+1} \rightarrow 0 \) and \( m\in (0,1) \), we deduce \( \delta ^{l} \rightarrow 0 \). By the definition of \( \delta ^{l} \) we also have \( \widehat{e}_{l}^{{ end}} \rightarrow 0 \) and \( \frac{1}{2\mu _{l}} \Vert \widehat{\xi }_{l}^{{ end}}+v_{l}\Vert ^{2} \rightarrow 0 \); since \( \mu _{l} \le \overline{\mu } \), this gives \( \widehat{\xi }_{l}^{{ end}}+v_{l}\rightarrow 0 \) and hence \( y^{l} \rightarrow x^{{ end}} \). By the convexity of \( H_{l}(\cdot ,x^{{ end}}) \) and \( \widehat{\xi }_{l}^{{ end}}+v_{l} \in \partial H_{l}(y^{l},x^{{ end}})+ \partial I_{X}(y^{l}) \), it follows from Lemma 5 (ii) that

$$\begin{aligned} M^{\uparrow }(y,x^{{ end}})+ I_{X}(y)&\ge H_{l}(y,x^{{ end}})+ I_{X}(y) \\&\ge H_{l}(y^{l},x^{{ end}})+ I_{X}(y^{l}) + \langle \widehat{\xi }_{l}^{{ end}}+v_{l},y-y^{l}\rangle \\&= g^{+}(x^{{ end}})- \widehat{e}_{l}^{{ end}} + \langle \widehat{\xi }_{l}^{{ end}}, y^{l}-x^{{ end}} \rangle +\langle \widehat{\xi }_{l}^{{ end}}+v_{l},y-y^{l}\rangle \\&= g^{+}(x^{{ end}})- \widehat{e}_{l}^{{ end}} +\langle \widehat{\xi }_{l}^{{ end}}+v_{l}, y-x^{{ end}}\rangle +\langle v_{l},x^{{ end}}-y^{l} \rangle \\&=g^{+}(x^{{ end}}){-}\widehat{e}_{l}^{{ end}} {+}\langle \widehat{\xi }_{l}^{{ end}}{+}v_{l}, y{-}x^{{ end}}\rangle {+}\left\langle v_{l},\frac{1}{\mu _{l}}(\widehat{\xi }_{l}^{{ end}}+v_{l})\right\rangle , \end{aligned}$$

where the first equality follows from (9) and the last equality from \( x^{{ end}}-y^{l}=\frac{1}{\mu _{l}}(\widehat{\xi }_{l}^{{ end}}+v_{l}) \). Passing to the limit in the above inequality and using the boundedness of \( \{v_{l}\} \) (there exists \( \overline{\varepsilon } \) such that \( \{v_{l}\}\subseteq \partial I_{X}(x^{{ end}})+\overline{B}_{\overline{\varepsilon }}(0) \)), we obtain

$$\begin{aligned} M^{\uparrow }(y,x^{{ end}})+ I_{X}(y)&\ge g^{+}(x^{{ end}})+ \langle 0, y-x^{{ end}}\rangle \\&=M^{\uparrow }(x^{{ end}},x^{{ end}})+ I_{X}(x^{{ end}})+ \langle 0, y-x^{{ end}}\rangle . \end{aligned}$$

By the convexity of \( M^{\uparrow }(\cdot ,x^{{ end}}) \) and \( I_{X}(\cdot ) \), we get \( 0 \in \partial M^{\uparrow }(x^{{ end}},x^{{ end}})+ \partial I_{X}(x^{{ end}}) \) and using Lemma 5 (iv) we deduce \( 0\in \partial _{c} h(x^{{ end}},x^{{ end}})+\partial I_{X}(x^{{ end}}) \). The remainder of the proof is the same as the proof of Theorem 1. \(\square \)

5 Numerical experiments

In this section, we report the results of numerical experiments with the proposed algorithm. To provide a brief comparison for Algorithm 1, we use the public software SolvOpt [34], one of the most efficient codes available for nonsmooth optimization, and the proximal bundle method (PBM) [25], which was proposed for constrained nonsmooth nonconvex optimization with inexact information. In our experiments, however, PBM is run without any noise.

All algorithms are coded in MATLAB R2012a and run on an Intel Core i5 PC with a 2.5 GHz CPU and 4 GB of RAM.

Two classes of test problems are used to measure the efficiency of the considered algorithms. The first class, taken from [1, 32], has a fixed number of variables; the second class, from [10], can be formulated with any number of variables. We therefore report the numerical results in two experiments.

We set the algorithm parameters as follows: \( { tol}:=10^{-8} \), \( m:=0.01 \) or \( m:=0.25 \) (we used \( m=0.25 \) for \( n=20,50 \) and \( m=0.01 \) otherwise), \( \omega :=1.2 \) and \( |\mathcal {L}|_{max}:=100 \). As the convex compact set we use \( X:=\overline{B}_{10}(0) \). The quadratic programming subproblems are solved with \( \mathsf {quadprog.m} \) from the MATLAB Optimization Toolbox; the results are not sensitive to this choice, and any other quadratic programming solver could be used instead. For each run and for every algorithm, we report the total number of evaluations (function evaluations plus subgradient evaluations). We do not report computational times: for most of the problems (\( n=2,3,4,5 \)) the time is near zero, and for the remaining problems the times of the different solvers do not differ significantly. To assess precision, we report the objective and constraint function values at the last serious iterate.

We use two stopping criteria for all algorithms: the algorithms stop either when the optimality condition is satisfied or when the number of iterations exceeds 10,000. This iteration budget was never reached in these experiments. In addition, Algorithm 1 and PBM terminate when one of the following criteria is satisfied: \( \delta ^{l}\le 10^{-8} \), or the value of \( \delta ^{l} \) has not decreased over 150 consecutive iterations.
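One possible reading of the last test, written as a small Python sketch (ours, not the authors' MATLAB code; the function name and the way the 150-iteration window is checked are our choices), is the following.

```python
def bundle_stop(delta_history, tol=1e-8, stall=150):
    """Stop when the latest delta^l is below tol, or when delta^l has not
    decreased over the last `stall` consecutive iterations."""
    if delta_history[-1] <= tol:
        return True
    if len(delta_history) > stall:
        window = delta_history[-(stall + 1):]
        # no value in the last `stall` iterations dropped below the reference value
        return min(window[1:]) >= window[0]
    return False
```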

To obtain a better comparison of the considered algorithms, we analyze the results using the performance and data profiles introduced in [2, 9, 29]. We next give a brief description of both. Let S be a set of solvers and P a set of problems; we use the number of function evaluations as the efficiency measure. For each problem \( p\in P \) and solver \( s\in S \), define

$$\begin{aligned} t_{p,s}= \text {number of function evaluations required to solve problem p by solver s}. \end{aligned}$$

In the performance profile, a ratio is used to compare the performance of each solver s on problem p with that of the best solver on this problem. This ratio is defined as follows

$$\begin{aligned} r_{p,s}=\frac{t_{p,s}}{\min \{t_{p,s}|~s\in S\}}. \end{aligned}$$

We choose a parameter \( r_{M}\ge r_{p,s} \) for all p and s, and set \( r_{p,s}=r_{M} \) if and only if solver s does not solve problem p. The performance profile \( \rho _{s}(\tau ) \) is defined as

$$\begin{aligned} \rho _{s}(\tau )=\frac{1}{n_{p}} \text {size}\{p\in P|~ r_{p,s}\le \tau \}. \end{aligned}$$

Here \( n_{p} \) is the number of problems in P, and clearly \( \tau \in [1,r_{M}] \). The value of \( \rho _{s}(1) \) gives the percentage of test problems for which the corresponding algorithm is the best (with respect to the number of function evaluations). Since the ratios for some problems and solvers are large, we scale \( \tau \) using the natural logarithm; therefore, in the performance profiles, the value of \( \rho _{s}(\tau ) \) at \( \log (\tau )=0 \) (rather than at \( \tau =1 \)) gives the percentage of test problems for which the corresponding algorithm is the best.

The data profile was first introduced in [29]. It answers the question: what percentage of problems can be solved within a budget of k function evaluations? The authors of [29] assume that the number of function evaluations required to satisfy the convergence test is likely to grow as the number of variables increases. The data profile of an optimization algorithm s is defined by

$$\begin{aligned} d_{s}(\alpha )=\frac{1}{n_{p}} \text {size}\left\{ p\in P|~ \frac{t_{p,s}}{n+1}\le \alpha \right\} , \end{aligned}$$

where n is the number of variables in the problem \( p\in P \) and \( d_{s}(\alpha ) \) is the percentage of problems that can be solved with \( \alpha (n+1) \) function evaluations.
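To make both definitions concrete, the following Python/NumPy sketch (ours, not the authors' MATLAB code) computes \( \rho _{s}(\tau ) \) and \( d_{s}(\alpha ) \) from a matrix of evaluation counts; in this sketch a failure of solver s on problem p is encoded as np.inf, which plays the role of \( r_{M} \).

```python
import numpy as np

def performance_profile(t, taus):
    """t[p, s] = evaluations solver s needed on problem p (np.inf on failure).
    Returns rho[s, j] = fraction of problems with ratio r_{p,s} <= taus[j]."""
    best = np.min(t, axis=1, keepdims=True)          # best solver on each problem
    r = t / best                                     # performance ratios r_{p,s}
    return np.array([[np.mean(r[:, s] <= tau) for tau in taus]
                     for s in range(t.shape[1])])

def data_profile(t, n_vars, alphas):
    """d[s, j] = fraction of problems solved within alphas[j]*(n+1) evaluations,
    where n_vars[p] is the number of variables of problem p."""
    budgets = np.outer(np.asarray(n_vars) + 1, alphas)   # shape (n_p, n_alpha)
    return np.array([[np.mean(t[:, s] <= budgets[:, j]) for j in range(len(alphas))]
                     for s in range(t.shape[1])])

# Illustrative counts (not taken from the tables): 3 problems, 2 solvers.
t = np.array([[30.0, 45.0], [120.0, 80.0], [200.0, np.inf]])
print(performance_profile(t, taus=[1.0, 1.5, 2.0]))
print(data_profile(t, n_vars=[2, 5, 10], alphas=[10, 40]))
```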

5.1 Experiment 1

We recall that the convergence analysis only assumes that f and g are regular locally Lipschitz functions, which is less restrictive than the \( {\mathrm {lower}}-C^{1} \) assumption. Our aim here is therefore to design some examples that are not \( {\mathrm {lower}}-C^{1} \). Spingarn [35] gives the following example of a regular function on \( \mathbb {R}^{2} \) which is not \( {\mathrm {lower}}-C^{1} \):

$$\begin{aligned} f(x_{1},x_{2})=\left\{ \begin{array}{ll} |x_{2}| &{}\quad \text {if} ~x_{1}\le 0\\ |x_{2}|-x_{1}^2 &{}\quad \text {if}~ x_{1} \ge 0,~ |x_{2}|\ge x_{1}^{2}\\ \frac{x_{1}^{4}-x_{2}^{2}}{2x_{1}^{2}} &{}\quad \text {if}~ x_{1} \ge 0,~ |x_{2}|\le x_{1}^{2}. \end{array} \right. \end{aligned}$$

We take this function as the objective and, using some \( {\mathrm {lower}}-C^{1} \) constraints from [32], design three problems as follows:

$$\begin{aligned} \text {Problem 1}{:}\,\min _{x\in X} \quad&f(x_{1},x_{2}) \\&g(x)=\max \{-x_{1}^{4}-2x_{2}^{2}-1,~ 2x_{1}^{2}-x_{2}^{2}-2.5\} \le 0. \\ \text {Problem 2}{:}\,\min _{x\in X} \quad&f(x_{1},x_{2}) \\&g(x)=\max \{x_{1}^{2}-x_{2}^{2},~-2x_{1}^{2}-x_{2}^{2}\}\le 0.\\ \text {Problem 3}{:}\,\min _{x\in X} \quad&f(x_{1},x_{2}) \\&g(x)=\max \{100x_{1}^{2}+x_{2}^{2}-101,~80x_{1}^{2}-x_{2}^{2}-79\} \le 0. \end{aligned}$$

We observe that \( f(x_{1},x_{2}) \) is bounded below by 0. We also have \( f(0,0)=0 \), so \( 0=\min _{(x_{1},x_{2})\in \mathbb {R}^{n}}f(x_{1},x_{2}) \). On the other hand (0, 0) is a feasible point for Problems 1–3. Therefore \( x^{*}=(0,0) \) is an optimal solution for these problems and the optimal value is \( f^{*}=0 \).
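For reference, a short Python sketch of these test functions (ours; the experiments themselves were run in MATLAB) is given below. The branch \( x_{1}\le 0 \) also covers \( x_{1}=0 \), so the division in the last branch of the Spingarn function is safe; the constraint of Problem 1 is shown as an example, and Problems 2 and 3 only change the two quadratic pieces.

```python
def spingarn_f(x1, x2):
    """Spingarn's regular function on R^2 that is not lower-C^1."""
    if x1 <= 0.0:
        return abs(x2)
    if abs(x2) >= x1 ** 2:
        return abs(x2) - x1 ** 2
    return (x1 ** 4 - x2 ** 2) / (2.0 * x1 ** 2)

def g_problem1(x1, x2):
    """Constraint of Problem 1: a pointwise maximum of two smooth functions."""
    return max(-x1 ** 4 - 2.0 * x2 ** 2 - 1.0, 2.0 * x1 ** 2 - x2 ** 2 - 2.5)

# (0, 0) is feasible and attains the optimal value 0:
assert spingarn_f(0.0, 0.0) == 0.0 and g_problem1(0.0, 0.0) <= 0.0
```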

In addition, we consider some problems from [1, 32] of two types: with nonsmooth convex and with nonsmooth nonconvex functions. A brief description of each test problem is given in Table 1, where n denotes the number of variables, \( f_{opt} \) the optimal value and Ref. the reference of each problem. For each problem we used ten randomly generated starting points (generated with the \( \mathsf {rand.m}\) and \(\mathsf {randn.m}\) functions in MATLAB); the starting points are the same for all algorithms, and some of them are feasible while the others are infeasible. To keep the presentation compact, these starting points are not reported in the tables; the interested reader can obtain them from the authors by email. The numerical results are listed in Tables 2, 3, 4, 5, 6, 7, 8 and 9, where the following notations are used:

  • \( x_{0} \): starting point;

  • F/InF: feasible/infeasible;

  • \({ NE}\): number of evaluations;

  • \( f_{{\mathrm {final}}} \): final objective value;

  • \( g_{{\mathrm {final}}} \): final constraint value.

The results in Tables 2, 3, 4, 5, 6, 7, 8 and 9 show a reasonable performance of Algorithm 1: it performs well and provides an optimal solution for each example. The accuracy obtained by Algorithm 1 was similar for feasible and infeasible starting points, which means that the algorithm is not sensitive to the feasibility of the starting point. Comparing with PBM and SolvOpt, each algorithm was run 80 times in this experiment: Algorithm 1 needs the fewest function evaluations on 51 examples, SolvOpt on 28 examples, and PBM on one example. For a more detailed comparison, see Tables 2, 3, 4, 5, 6, 7, 8 and 9.

Table 1 The description of test problems
Table 2 Comparison between Algorithm 1, PBM and SolvOpt for Problem 1
Table 3 Comparison between Algorithm 1, PBM and SolvOpt for Problem 2
Table 4 Comparison between Algorithm 1, PBM and SolvOpt for Problem 3
Table 5 Comparison between Algorithm 1, PBM and SolvOpt for Problem 4
Table 6 Comparison between Algorithm 1, PBM and SolvOpt for Problem 5
Table 7 Comparison between Algorithm 1, PBM and SolvOpt for Problem 6
Table 8 Comparison between Algorithm 1, PBM and SolvOpt for Problem 7
Table 9 Comparison between Algorithm 1, PBM and SolvOpt for Problem 8

Figure 1 presents the performance profiles for Experiment 1. Algorithm 1 has the best performance on \(63\%\) of the problems, meaning that it solves \(63\%\) of the problems with fewer function evaluations than the other solvers; SolvOpt and PBM require the fewest function evaluations on roughly \(39\%\) and \(3\%\) of the problems, respectively. If we accept any solver within a factor of \(\log (\tau )\simeq 0.4\) of the best one, then Algorithm 1 covers all of the problems, while SolvOpt and PBM cover \(78\%\) and \( 50\% \) of the problems, respectively.

Fig. 1
figure 1

Performance profile for Experiment 1

Figure 2 illustrates the data profiles for Experiment 1. With a budget of 100 function evaluations, Algorithm 1 has the best performance, solving roughly \(68\%\) of the problems, while SolvOpt and PBM solve \( 52\% \) and \(18\%\) of the problems, respectively. With a budget of 200 function evaluations, Algorithm 1 solves \( 100\% \) of the problems, the best performance among all the solvers, whereas PBM has the worst performance, solving roughly \(64\%\) of the problems with this budget.

Fig. 2
figure 2

Data profile for Experiment 1

Moreover, it is clear from Figs. 1 and 2 that Algorithm 1 and SolvOpt can solve \( 100\% \) and \( 97\% \) of the problems, respectively, while PBM solves \( 80\% \) of the problems.

5.2 Experiment 2

In this subsection, we consider a series of polynomial test functions developed in [10] and used in [13, 14, 25, 37]; these problems can be formulated with any number of variables. For each \( i=1,2,\ldots ,n \), the function \( h_{i}{:}\,\mathbb {R}^{n}\rightarrow \mathbb {R} \) is defined by \( h_{i}(x):=(ix_{i}^2-2x_{i})+\sum _{j=1}^{n}x_{j} \). Five classes of test functions are defined from the \( h_{i} \) in [10] as follows:

$$\begin{aligned} f_{1}(x)&:=\sum _{i=1}^{n} |h_{i}(x)|, \\ f_{2}(x)&:=\max _{1\le i\le n} |h_{i}(x)|, \\ f_{3}(x)&:=\sum _{i=1}^{n} |h_{i}(x)|+ \frac{1}{2} \Vert x\Vert ^2,\\ f_{4}(x)&:=\sum _{i=1}^{n} |h_{i}(x)|+ \frac{1}{2} \Vert x\Vert ,\\ f_{5}(x)&:=\sum _{i=1}^{n} (h_{i}(x))^{2}. \end{aligned}$$

It is proved in [13] that these test functions are nonconvex and globally \( {\mathrm {lower}}-C^{1} \) on \( \mathbb {R}^{2} \), and all of them except \( f_{5} \) are nonsmooth; these properties carry over to higher dimensions as well. We also have \( 0=\min _{x\in \mathbb {R}^{n}} f_{k}(x) \) and \( 0 \in {\mathrm {argmin}}_{x\in \mathbb {R}^{n}} f_{k}(x) \) for \( k=1,2,3,4,5 \). For the constraint function, we consider the pointwise maximum of a finite collection of quadratic functions from [12]:

$$\begin{aligned} g(x)=\max _{1\le i\le N} \{ \langle x, A_{i} x \rangle +\langle b_{i},x \rangle +c_{i} \}, \end{aligned}$$

where the \( A_{i} \) are \( n\times n \) matrices, \( b_{i} \in \mathbb {R}^{n} \) and \( c_{i} \in \mathbb {R} \) for \( i=1,2,\ldots ,N \). The entries of \(A_i\) and \(b_i\) and the scalars \(c_i\) were drawn randomly from the uniform distribution on \([-5,5]\). We consider the following test problems

$$\begin{aligned} \min _{x\in X} \quad&f_{i}(x)\\&g(x)\le 0, \end{aligned}$$

where \( X\subseteq \mathbb {R}^{n} \), \( i=1,2,3,4,5 \) and \( n=2,3,4,5,10,20,50 \). For each problem we use the two starting points \( x_{0}=[1,1,\ldots ,1] \) and \( x_{0}=[1,1/2,\ldots ,1/n] \); hence we have 70 examples.
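These test problems are easy to reproduce; the following Python sketch (ours) transcribes \( h_{i} \), the five objectives and the random constraint. The number N of quadratic pieces and the random seed are our choices, since they are not fixed in the text.

```python
import numpy as np

def h(x):
    """h_i(x) = (i*x_i^2 - 2*x_i) + sum_j x_j for i = 1, ..., n."""
    i = np.arange(1, x.size + 1)
    return i * x ** 2 - 2.0 * x + x.sum()

def f1(x): return np.abs(h(x)).sum()
def f2(x): return np.abs(h(x)).max()
def f3(x): return f1(x) + 0.5 * np.dot(x, x)
def f4(x): return f1(x) + 0.5 * np.linalg.norm(x)
def f5(x): return np.square(h(x)).sum()

def random_max_quadratic(n, N=5, seed=0):
    """g(x) = max_i <x, A_i x> + <b_i, x> + c_i, entries drawn from U[-5, 5]."""
    rng = np.random.default_rng(seed)
    A = rng.uniform(-5.0, 5.0, size=(N, n, n))
    b = rng.uniform(-5.0, 5.0, size=(N, n))
    c = rng.uniform(-5.0, 5.0, size=N)
    return lambda x: max(float(x @ A[i] @ x + b[i] @ x + c[i]) for i in range(N))

# The two starting points used for each problem (here for n = 5):
n = 5
x0_a = np.ones(n)
x0_b = 1.0 / np.arange(1, n + 1)
```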

Our results for these deterministic tests are summarized in Tables 10, 11, 12, 13 and 14, which show a reasonable performance of Algorithm 1; the notation is the same as in the previous subsection. We analyse the results in more detail as follows:

  • For the objective function \( f_{1} \), we tested 14 nonconvex examples. Table 10 shows that the values of \( f_{{\mathrm {final}}} \) obtained by Algorithm 1 and SolvOpt are accurate, whereas PBM fails to compute the correct solution for \( n=10,20,50 \). Algorithm 1 needs the fewest function evaluations for \( n=2,3,4 \), whereas SolvOpt does for \( n=5,10,20,50 \).

  • For the objective function \( f_{2} \), we tested 14 examples. Table 11 shows that \( f_{{\mathrm {final}}} \) obtained by Algorithm 1 is accurate for all examples, whereas SolvOpt fails to find the correct solution for \( n=5 \) with \( x_{0}=[1,1,\ldots ,1] \), and PBM fails for \( n=5,20,50 \) with \( x_{0}=[1,1,\ldots ,1] \) and for \( n=50 \) with \( x_{0}=[1,1/2,\ldots ,1/n] \). Hence Algorithm 1 is the most accurate solver for \( f_{2} \). Moreover, Algorithm 1 needs the fewest function evaluations on 10 examples, SolvOpt on 3 examples and PBM on one example.

  • The results for the 14 examples of function \( f_{3} \) are reported in Table 12. They show that \( f_{{\mathrm {final}}} \) obtained by Algorithm 1 is accurate for all examples, while PBM fails to find the correct solution for \( n=20,50 \) with \( x_{0}=[1,1,\ldots ,1] \) and for \( n=10 \) with \( x_{0}=[1,1/2,\ldots ,1/n] \), and the solutions produced by SolvOpt are incorrect for \( n=10,50 \) with \( x_{0}=[1,1,\ldots ,1] \). Moreover, Algorithm 1 needs the fewest function evaluations on 8 examples and SolvOpt on 6 examples.

  • The results for function \( f_{4} \) are reported in Table 13. They show that \( f_{{\mathrm {final}}} \) obtained by Algorithm 1 is accurate for all examples except \( n=20 \) with \( x_{0}=[1,1,\ldots ,1] \) and \( n=50 \) with \( x_{0}=[1,1/2,\ldots ,1/n] \). PBM fails to compute the correct solutions for \( n=3,10,20,50 \) with \( x_{0}=[1,1,\ldots ,1] \) and for \( n=10,50 \) with \( x_{0}=[1,1/2,\ldots ,1/n] \), and SolvOpt fails for \( n=10,20,50 \) with \( x_{0}=[1,1,\ldots ,1] \). Moreover, Algorithm 1 uses the fewest evaluations on 6 examples, SolvOpt on 7 examples and PBM on one example.

  • The results for function \( f_{5} \) are reported in Table 14. Algorithm 1 finds the correct solutions for all examples, while PBM fails for \( n=3,5,50 \) with \( x_{0}=[1,1,\ldots ,1] \) and for \( n=5,10,20 \) with \( x_{0}=[1,1/2,\ldots ,1/n] \), and SolvOpt fails for \( n=20 \) with \( x_{0}=[1,1,\ldots ,1] \). Moreover, SolvOpt uses the fewest function evaluations on 12 examples and Algorithm 1 on 2 examples; thus SolvOpt is the most economical solver for \( f_{5} \), which is to be expected since \( f_{5} \) is smooth.

Table 10 Comparison between Algorithm 1, PBM and SolvOpt for \( f_{1}\)
Table 11 Comparison between Algorithm 1, PBM and SolvOpt for \( f_{2} \)
Table 12 Comparison between Algorithm 1, PBM and SolvOpt for \( f_{3} \)
Table 13 Comparison between Algorithm 1, PBM and SolvOpt for \( f_{4} \)
Table 14 Comparison between Algorithm 1, PBM and SolvOpt for \( f_{5} \)

Figure 3 presents the performance profiles for Experiment 2. SolvOpt and Algorithm 1 require the fewest function evaluations on \( 51\% \) and \( 47\% \) of the problems, respectively, while PBM does so on \( 5\% \) of the problems. On these problems, Algorithm 1 and SolvOpt are clearly more competitive than PBM.

Fig. 3
figure 3

Performance profile for Experiment 2

Figure 4 shows the data profiles for Experiment 2. With a budget of 100 function evaluations, SolvOpt and Algorithm 1 solve roughly \( 53\% \) and \( 52\% \) of the problems, respectively, while PBM solves about \(18\%\). With a budget of 300 function evaluations, Algorithm 1 and SolvOpt solve about \( 93\% \) and \(90\%\) of the problems, while PBM solves \(58\%\).

Fig. 4
figure 4

Data profile for Experiment 2

Moreover, Figs. 3 and 4 show that Algorithm 1 and SolvOpt can solve \( 97\% \) and \( 94\% \) of the problems, respectively, while PBM solves roughly \( 70\% \) of the problems.

5.3 The behaviour of parameter \(\eta _{l}\)

Finally, we examine the assumption that the parameter \( \eta _{l} \) remains bounded. In [14] the authors report that \(\eta _l\) stayed below \(2n + 2\) on 73 of their 75 exact unconstrained test problems; on one problem \(\eta _l\) was between \(2n + 2\) and 25n, and on the remaining one it exceeded 25n.

In our implementation, \( \eta _{l} \) is selected in Step 2 by taking (7) with equality, i.e.,

$$\begin{aligned} \eta _{l}= \max \left\{ \max _{\begin{array}{c} j\in \mathcal {L}_{l}^{{\mathrm {oracle}}}\\ y^{j} \ne x^{k} \end{array}} \left\{ \frac{- \left( e_{f_{j}}^{k}+g^{+}(x^{k})\right) }{b_{j}^{k}}, \frac{-\left( e_{g_{j}}^{k}+g^{+}(x^{k}) - g(x^{k})\right) }{b_{j}^{k}}\right\} ,\omega \right\} +\omega . \end{aligned}$$
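In code, this rule is a single maximum over the oracle bundle; a minimal Python transcription (the names e_f, e_g and b stand for the stored quantities \( e_{f_{j}}^{k} \), \( e_{g_{j}}^{k} \) and \( b_{j}^{k} \) over the indices \( j\in \mathcal {L}_{l}^{{\mathrm {oracle}}} \) with \( y^{j}\ne x^{k} \), and are our notation) reads:

```python
def select_eta(e_f, e_g, b, g_plus_xk, g_xk, omega):
    """eta_l from (7) taken with equality; the inputs range over the oracle
    bundle indices j with y^j != x^k, which avoids a vanishing denominator."""
    ratios = [max(-(ef + g_plus_xk) / bj,
                  -(eg + g_plus_xk - g_xk) / bj)
              for ef, eg, bj in zip(e_f, e_g, b)]
    inner = max(ratios) if ratios else omega   # convention for an empty index set
    return max(inner, omega) + omega
```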

For the 80 problems of Experiment 1 described in Sect. 5.1, \(\eta _l =2\omega \) in 60 problems, \(\eta _l\) is bounded by 2n in 16 problems, and it lies between 2n and 25n in the remaining 4 problems. For the 70 problems of Experiment 2 considered in Sect. 5.2, \(\eta _l =2\omega \) in 25 problems, \(\eta _l\) is bounded by 2n in 39 problems, and it lies between 2n and 25n in 6 problems. Overall, the experiments support the assumption that the sequence \( \{\eta _{l}\} \) remains bounded.

6 Conclusions

We have presented an infeasible proximal bundle method, based on the improvement function and the aggregation technique, for nonsmooth nonconvex constrained optimization problems with regular functions. The global convergence of the proposed algorithm was proved in the sense that, for an arbitrary starting point, every accumulation point of the sequence of serious iterates is stationary for the improvement function. We also reported numerical experiments on a collection of test problems. Our limited computational experience suggests that the proposed method performs well and is comparable with existing solvers for nonsmooth nonconvex optimization problems.