1 Introduction

In this work, we are concerned with local optimization methods to address the following class of nonsmooth and nonconvex problems:

$$\begin{aligned} \min _{x \in X}\, f(x) \quad \hbox {with}\quad f(x) := f_1(x) - f_2(x), \end{aligned}$$
(1)

where \(\emptyset \ne X \subset {\mathbb {R}}^n\) is a simple closed convex set (typically, but not necessarily, a polyhedral one) contained in an open set \(\varOmega \subset {\mathbb {R}}^n\), and \(f_1: \varOmega \rightarrow {\mathbb {R}}\) and \(f_2: \varOmega \rightarrow {\mathbb {R}}\) are both convex and possibly nonsmooth functions. The mapping \(f\), being the difference of two convex functions, is called a DC function, and \(f_1\) and \(f_2\) are its DC components. Accordingly, problem (1) is a convex-constrained DC program.

The past few years have witnessed a substantial development in the area of DC programming. This class of problems forms an important sub-field of nonconvex programming and has been receiving much attention from the mathematical programming community [14, 15, 22, 27, 31, 32, 38, 41,42,43]. We refer to [43, Part II] for a comprehensive presentation of several algorithms designed to find global solutions. Yet, local optimization methods play an important role in global optimization because algorithms of the latter class typically employ local methods to find stationary/critical points that feed a certain search strategy for global solutions. A non-exhaustive list of applications of DC programming fitting the above formulation includes production–transportation planning problems [23], location planning problems [43, Chapter 5], physical-layer-based security in digital communication systems [38], cluster analysis [3, 29], sensor covering [1], and engineering problems [15, 43].

A well-known method for dealing with the optimization problem (1) is the DC Algorithm—DCA—of [31, 42], which iteratively linearizes the second DC component, yielding convex optimization subproblems that are solved to define trial points. Another important algorithm for DC programming is the Proximal Linearized Method—PLM—[7, 38, 41], which can be seen as a regularized variant of DCA: the convex optimization subproblem is augmented with a Bregman function to prevent the tailing-off effect that makes calculations unstable as the iteration process progresses.
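For concreteness, the following minimal sketch outlines one possible implementation of the DCA iteration just described; the routine argmin_convex (assumed to minimize a given convex function over X) and the oracle subgrad_f2 are hypothetical helpers introduced only for illustration.

```python
import numpy as np

def dca(x0, f1, subgrad_f2, argmin_convex, max_iter=100, tol=1e-8):
    """Sketch of DCA: linearize f2 at the current point and solve the convex subproblem."""
    x = x0
    for _ in range(max_iter):
        g2 = subgrad_f2(x)                                     # g2 in the subdifferential of f2 at x
        x_new = argmin_convex(lambda y: f1(y) - g2 @ (y - x))  # convex subproblem over X
        if np.linalg.norm(x_new - x) <= tol:                   # successive iterates coincide: stop
            return x_new
        x = x_new
    return x
```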

The main disadvantage of both DCA and PLM is the need to solve a convex nonsmooth program exactly at every iteration. As an attempt to overcome this issue, the recent work [41] investigates an inexact version of PLM by requiring the convex subproblems to be asymptotically solved up to optimality and the subgradients of \(f_1\) to satisfy a certain proximity property with respect to the previously computed subgradient of \(f_2\) [41, Equation (15)].

In this work, we propose more sophisticated and implementable variants of PLM for handling problem (1). The new algorithms belong to the bundle method family [21, Chapter XV], a class of methods proposed by C. Lemaréchal in the 70s [33]. Bundle methods are among the most efficient algorithms for solving nonsmooth convex optimization problems. This class of methods constitutes a very active area of research in the nonsmooth optimization community [8,9,10, 12, 13, 30, 48]. Extensions of proximal bundle algorithms to nonsmooth nonconvex programs have been investigated by different authors in [14, 17, 18, 36, 37] and references therein.

As far as the DC setting is concerned, the insightful paper [27] proposes a proximal bundle method for finding critical points of unconstrained DC programs, i.e., problem (1) with \(X={\mathbb {R}}^n\). The authors consider a DC piecewise linear model approximating the objective function \(f\) and compute trial points by globally minimizing such a model over \({\mathbb {R}}^n\): this task amounts to solving a fixed number of quadratic programs (QPs) per iteration. In general terms, the number of QPs solved per iteration increases with the DC model’s size [27, p. 514].

The recent paper [16] also proposes a bundle method employing a DC piecewise linear model to deal with unconstrained DC programs. Differently from [27], which globally minimizes the DC model to define new iterates, the DC model in [16] is tackled by means of two auxiliary QPs that have different local approximation properties. These quadratic programs must be solved at every iteration to compute descent directions along which a line-search is performed to define trial points.

Inspired by these methods for unconstrained DC programs, we propose in this work two proximal bundle methods for finding critical and d(irectional)-stationary points (see definitions in Sect. 2 below) of DC problems (1). The given variants relate to both the proximal linearized methods of [7, 41] and the proximal bundle algorithms of [16, 27]. Instead of a more difficult (but possibly better) nonconvex approximation of \(f\), we employ a simple convex piecewise linear model. No line-search nor estimates of Lipschitz constants of the DC component are required. Moreover, both variants employ a reliable and straightforward stopping test. We care to mention that the master program defining trial points in our algorithms is not necessarily a QP but a more general strictly convex program exploiting possible compelling structures of the feasible set X. This is computationally useful when X has a particular structure (for instance a ball, a simplex, a spectahedron, and other domains [4, Sect. 2.3]) and a specialized solver for handling the master problem is at our disposal. We assume throughout this paper that such a specialized solver is available.

As standard in the DC literature, our first algorithm is shown to generate a subsequence of points that converges to a critical point of (1). Under the hypothesis that \(f_2\) in (1) is the pointwise maximum of N known differentiable functions, our second algorithm is ensured to compute a d-stationary point, which is the sharpest stationarity definition in DC programming [38, Sect. 3.2]. The price to obtain this stronger result is the solution of possibly several (but no more than N) master programs at certain iterations. We care to mention that the bundle method of [26] is ensured to compute Clarke stationary points of unconstrained DC programs. To this end, the authors of [26] assume that the subdifferentials of the DC components at any point \(x\in {\mathbb {R}}^n\) are polytopes. This is, for instance, the case when both \(f_1\) and \(f_2\) are the pointwise maximum of finitely many differentiable functions. Clarke stationarity is a stronger property than criticality, but weaker than d-stationarity [38].

The remainder of this work is organized as follows: some well-known results and properties of DC programming are recalled in Sect. 2. The first proximal bundle method for finding critical points is presented in Sect. 3, and its convergence analysis is given in Sect. 4. The second proximal bundle algorithm, for finding d-stationary points of a particular class of problem (1), is introduced in Sect. 5 together with its convergence analysis. We report some numerical experiments in Sect. 6, and Sect. 7 closes the paper with some final remarks.

2 Basic concepts and properties of DC programming

As mentioned in [31], DC programming is an extension of convex programming that is vast enough to cover almost all nonconvex optimization problems, but still allows the use of powerful tools from convex analysis and convex optimization. As in nonconvex nonsmooth optimization, many definitions of stationary points exist. Below we list some of them.

A point \({\bar{x}} \in X\) is called a local minimizer of problem (1) if there exists a neighborhood \(V\subset X\) of \({\bar{x}}\) such that \(f({\bar{x}})\le f(x)\) for all \(x \in V\). As convex functions are directionally differentiable in the interior of their domains [6, Proposition 2.2.7], the directional derivative

$$\begin{aligned} f_i'(x;d):= \lim _{t\downarrow 0}\frac{f_i(x+td)-f_i(x)}{t} \end{aligned}$$

for the DC component \(f_i\), \(i=1,2\), is well defined for all x in the open set \(\varOmega \subset \mathtt{dom}\, f_i\) and all \(d \in {\mathbb {R}}^n\). It is well known that \(f_i'(x;d)=\max _{g \in \partial f_i(x)} \langle g,d\rangle \), where

$$\begin{aligned} \partial f_i(x) :=\{g\in {\mathbb {R}}^n:\, f_i(y)\ge f_i(x)+ \langle g,y-x\rangle \quad \forall \, y \in \varOmega \} \end{aligned}$$

is the subdifferential of \(f_i\) at the point x. For \(\epsilon \ge 0\), the inexact subdifferential is denoted by

$$\begin{aligned} \partial _\epsilon f_i(x) :=\{g\in {\mathbb {R}}^n:\, f_i(y)\ge f_i(x)+ \langle g,y-x\rangle - \epsilon \quad \forall \, y \in \varOmega \}. \end{aligned}$$

Since the DC components are locally Lipschitz continuous, DC functions are locally Lipschitz continuous as well. Thus, the directional and Clarke directional derivatives of a DC function are well defined for all \(x\in \varOmega \):

$$\begin{aligned} \begin{array}{lll} \hbox {(directional derivative) }&{} &{} f'(x;d)=f_1'(x;d)-f_2'(x;d)\\ \hbox {(Clarke directional derivative) }&{} &{} f^0(x;d):= \limsup \nolimits _{\underset{t\downarrow 0}{y\rightarrow x}} \frac{f(y+td)-f(y)}{t}. \end{array} \end{aligned}$$

It can be shown that if \( {\bar{x}} \in X\subset \varOmega \) is a local minimizer of \(f\) over X, then \({\bar{x}} \in X\) is a d(irectional)-stationary point, i.e., \(f'({\bar{x}};(x- \bar{x}))\ge 0\) for all \(x\in X\). For DC programs this definition is equivalent to

$$\begin{aligned} f_1'({\bar{x}};(x-{\bar{x}}))\ge f_2'({\bar{x}};(x-{\bar{x}}))\, \quad \forall x \in X. \end{aligned}$$

The above inequality is equivalent to \(f_1'({\bar{x}};(x-{\bar{x}}))\ge \max _{g_2 \in \partial f_2({\bar{x}})}\langle g_2,x -{\bar{x}}\rangle \) for all \(x \in X\). In other words, \({\bar{x}} \in X\) is a d-stationary point of (1) if for all \(g_2 \in \partial f_2({\bar{x}})\)

$$\begin{aligned} f_1'({\bar{x}};(x-{\bar{x}}))\ge \langle g_2,x -{\bar{x}}\rangle \quad \forall x \in X. \end{aligned}$$

It follows from convexity of \(f_1\) that \({\bar{x}} \in X\) satisfies the above inequality if and only if

$$\begin{aligned} {\bar{x}} \in \arg \min _{x \in X} [f_1(x) - \langle g_2, x\rangle ] \quad \forall \, g_2 \in \partial f_2({\bar{x}}). \end{aligned}$$

This shows that \({\bar{x}} \in X\) is a d-stationary point of (1) if

$$\begin{aligned} \partial f_2({\bar{x}})\subseteq \partial f_1({\bar{x}}) + N_X(\bar{x})\quad (= \partial [f_1({\bar{x}})+i_X({\bar{x}})]), \end{aligned}$$
(2)

where \(N_X({\bar{x}})\) is the normal cone of X at \({\bar{x}}\) and \(i_X\) is the indicator function of set X. The equality \(\partial f_1({\bar{x}}) + N_X({\bar{x}})= \partial [f_1({\bar{x}})+i_X({\bar{x}})]\) follows from convexity of \(X \subset \varOmega \subset \texttt {dom}(f_1)\) and the convexity of \(f_1\) [40]. Notice that verifying the above characterization of d-stationarity is impossible in many cases of interest. Hence, one generally employs a weaker notion of stationarity: a point \({\bar{x}} \in X\) is called a critical point of problem (1) if

$$\begin{aligned} \emptyset \ne \partial f_2({\bar{x}}) \cap \partial [f_1({\bar{x}})+ i_X({\bar{x}})], \quad \hbox {or equivalently}\quad \emptyset \ne \partial f_2({\bar{x}}) \cap \{\partial f_1({\bar{x}})+ N_X({\bar{x}})\}. \end{aligned}$$
(3)

Another useful notion is Clarke stationarity: a point \({\bar{x}} \in X\) is a c(larke)-stationary point of problem (1) if

$$\begin{aligned} f^0({\bar{x}};(x-{\bar{x}}))\ge 0 \; \forall x \in X. \end{aligned}$$

It follows from the inequality \(f^0(x;d) \ge f'(x;d)\) that every d-stationary point is also c-stationary. The reverse implication is not always true [38, Example 2]. The following sequence of implications related to problem (1) can be found in the DC programming literature (see for instance [16, 22, 26, 31, 38, 43]):

$$\begin{aligned} \hbox {local minimizer} \quad \Rightarrow \quad \hbox {d-stationarity} \quad \Rightarrow \quad \hbox {c-stationarity} \quad \Rightarrow \quad \hbox {criticality.} \end{aligned}$$

If \(f_1\) is a continuously differentiable function on \(\varOmega \), then

$$\begin{aligned} \hbox {c-stationarity} \quad \Leftrightarrow \quad \hbox {criticality.} \end{aligned}$$

If \(f_2\) is a continuously differentiable function on \(\varOmega \), then

$$\begin{aligned} \hbox {d-stationarity} \quad \Leftrightarrow \quad \hbox {c-stationarity} \quad \Leftrightarrow \quad \hbox {criticality.} \end{aligned}$$

If \(f_2\) is a polyhedral function \(f_2(x)=\max _{i=1,\ldots , N} \{\langle a_i, x\rangle +b_i\}\) (where \(a_i\in {\mathbb {R}}^n\) and \(b_i \in {\mathbb {R}}\)), then \({\bar{x}} \in X\) is a local minimizer if and only if \({\bar{x}}\) is d-stationary point [31, Theorem 1]. Moreover,

$$\begin{aligned} \begin{array}{lll} \displaystyle \min _{x \in X} [f_1(x)-f_2(x)] &{}=&{} \displaystyle \min _{x \in X} [f_1(x)-\max _{i=1,\ldots ,N}\{\langle a_i, x \rangle +b_i\}] \\ &{}=&{} \displaystyle \min _{x \in X} [f_1(x)+\min _{i=1,\ldots ,N}-\{\langle a_i, x\rangle +b_i\} ]\\ &{}= &{} \displaystyle \min _{i=1,\ldots ,N}[\min _{x \in X} \{f_1(x)-\langle a_i, x\rangle -b_i\} ], \end{array} \end{aligned}$$

showing that a global solution of the DC program \(\min _{x \in X} [f_1(x)-f_2(x)]\) can be obtained by solving N convex programs \(\min _{x \in X} [f_1(x)-\langle a_i, x\rangle ] -b_i\).
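To make the enumeration above concrete, the following minimal sketch solves a toy instance; the choices \(f_1(x)=\Arrowvert x \Arrowvert _2^2\), \(X=[-1,1]^n\), the random data and the use of the cvxpy package are ours, for illustration only.

```python
import cvxpy as cp
import numpy as np

n, N = 3, 4
rng = np.random.default_rng(0)
A, b = rng.standard_normal((N, n)), rng.standard_normal(N)   # affine pieces of f2

best_val, best_x = np.inf, None
for i in range(N):                       # one convex program per affine piece of f2
    x = cp.Variable(n)
    prob = cp.Problem(cp.Minimize(cp.sum_squares(x) - A[i] @ x - b[i]),
                      [x >= -1, x <= 1])
    prob.solve()
    if prob.value < best_val:
        best_val, best_x = prob.value, x.value
# best_x is a global minimizer of f1(x) - max_i(<a_i,x>+b_i) over X
print(best_val, best_x)
```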

3 Proximal bundle method for convex constrained DC programming

Let \(\omega : \varOmega \rightarrow {\mathbb {R}}\) be a twice differentiable and strongly convex function on X with parameter \(\rho >0\), w.r.t. the norm \(\Arrowvert \cdot \Arrowvert _p\) (\(p \in [1,\infty ]\)), that is

$$\begin{aligned} \omega (x) \ge \omega (y) + \langle \nabla \omega (y), x-y\rangle + \frac{\rho }{2} \Arrowvert x-y \Arrowvert _p^2\quad \forall \, x,y \in X. \end{aligned}$$
(4)

We can rewrite the DC function \(f=f_1-f_2\) in (1) for a given \(\omega \) as

$$\begin{aligned} f(x) = f_1(x)+\omega (x) - [f_2(x)+\omega (x)]. \end{aligned}$$

Let \(y \in X\), \(g_2\in \partial f_2(y)\) and \(\nabla \omega (y)\) be given. We can then overestimate \(f\) by the following convex function

$$\begin{aligned} f(x)&\le f_1(x)+\omega (x) - [f_2(y)+\omega (y)+\langle g_2 +\nabla \omega (y),x-y\rangle ]\\&= f_1(x) -f_2(y)-\langle g_2,x-y\rangle + D(x,y), \end{aligned}$$

where \(D(\cdot ,\cdot )\) is the Bregman function

$$\begin{aligned} D(x,y) := \omega (x)- [\omega (y)+\langle \nabla \omega (y),x-y\rangle ] \quad \left( \ge \; \frac{\rho }{2} \Arrowvert x-y \Arrowvert _p^2\right) . \end{aligned}$$
(5)
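For illustration, the Bregman function (5) can be evaluated as below for two standard choices of \(\omega \) (the squared Euclidean norm and the negative entropy); both choices and the numerical data are ours and serve only as an example.

```python
import numpy as np

def bregman(omega, grad_omega, x, y):
    """D(x, y) = omega(x) - omega(y) - <grad omega(y), x - y>, as in (5)."""
    return omega(x) - omega(y) - grad_omega(y) @ (x - y)

# omega(x) = ||x||_2^2 / 2  gives  D(x, y) = ||x - y||_2^2 / 2
sq = (lambda x: 0.5 * x @ x, lambda x: x)
# omega(x) = sum_i x_i log x_i gives the Kullback-Leibler divergence on the simplex
ent = (lambda x: np.sum(x * np.log(x)), lambda x: np.log(x) + 1.0)

x, y = np.array([0.2, 0.3, 0.5]), np.array([0.4, 0.4, 0.2])
print(bregman(*sq, x, y), bregman(*ent, x, y))
```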

It is shown in [7] that the following Proximal Linearized Method generates a sequence of trial points \(x^{k+1}\) whose cluster points (if any) are critical points of problem (1).

[Figure a: statement of the Proximal Linearized Method (PLM); its convex subproblem is referred to as (6) in the text.]

The work [41] studies the above algorithm with the standard choice \(\omega (\cdot )=\Arrowvert \cdot \Arrowvert _2^2/2\) (Euclidean norm). With this same regularizing function the authors of [38] propose two algorithms (akin to the above one) to find d-stationary points of (1). Both results in [41] and [38] can be extended to a more general strongly convex function \(\omega \) without much difficulty [7].

Notice that the above proximal algorithm requires solving exactly a constrained convex nonsmooth program per iteration. This can be a difficult task when dealing with real-life DC programs, especially if \(f_1\) can only be assessed via a black-box/oracle. In order to accelerate the optimization process, the authors of [41] investigate an inexact version of the above algorithm with \(\omega (\cdot )=\Arrowvert \cdot \Arrowvert ^2_2/2\). In what follows we propose two proximal bundle algorithms that do not require solving subproblem (6) exactly and, differently from [41], make no assumption on the computed subgradient \(g_1^{k+1} \in \partial f_1(x^{k+1})\). The first of these methods yields critical points, whereas the second algorithm, presented in Sect. 5, finds d-stationary points under the more restrictive assumption that \(f_2\) is the pointwise maximum of finitely many convex differentiable functions.

3.1 A proximal bundle method for DC programs

Let k denote an iteration counter and let \(g_1^j \in \partial f_1(x^j)\) and \(g_2^j \in \partial f_2(x^j)\) be subgradients of the DC components calculated during an iteration \(j \in \{0,1,2,\ldots \}\). Convexity of \(f_1\) implies that the linearization

$$\begin{aligned} {\bar{f}}_1^j(x):= f_1(x^j) + \langle g_1^j,x-x^j\rangle \end{aligned}$$

approximates \(f_1(x)\) from below. As a result, we can construct a cutting-plane model

$$\begin{aligned} {\check{f}}_1^k(x):= \max _{j \in {\mathcal {B}}_1^k}\, {\bar{f}}_1^j(x)\quad \le \quad f_1(x)\quad \forall \, x \in \varOmega , \end{aligned}$$
(7)

where \({\mathcal {B}}_1^k \subset \{0,1,\ldots , k\}\) is the index set containing the bundle of information of \(f_1\). By following the general ideas of bundle methods we replace \(f_1\) in the master program (6) with its cutting-plane model \({\check{f}}_1^k\). Since \({\check{f}}_1^k\) can be a rough approximation of \(f_1\) at certain iterations k, the trial point \(x^{k+1}\) obtained from (6) with \(f_1\) replaced by \({\check{f}}_1^k\) can be far away from the solution of (6). In order to diminish the impact of coarse approximations of \(f_1\) along the iterative process, we shall regularize the resulting master program by keeping trial points near to a certain stability center \(x^{k(\ell )}\in X\), where the index \(\ell \) counts the number of times that such a center has been updated and \(k(\ell )\) is the iteration in which the center is obtained. The stability center \(x^{k(\ell )}\) is some previous iterate, usually the “best” point generated by the iterative process so far. Accordingly, we replace subproblem (6) with

$$\begin{aligned} \min _{x \in X} \;{\check{f}}_1^k(x) -\langle g_2^{k(\ell )},x-x^{k(\ell )}\rangle + \mu _k D(x,x^{k(\ell )}), \end{aligned}$$
(8)

where \(\mu _k>0\) is a prox-parameter determining the influence of the Bregman function D on the next trial point \(x^{k+1}\). In terms of its optimal solution \(x^{k+1}\), subproblem (8) is equivalent to

$$\begin{aligned} \left\{ \begin{array}{lll} \displaystyle \min _{x \in X, r \in {\mathbb {R}}}&{}r +\mu _k\omega (x)- \langle g_2^{k(\ell )}+\mu _k\nabla \omega (x^{k(\ell )}),x\rangle \\ \hbox {s.t.} &{} {\bar{f}}_1^j(x) \le r,\;\; j \in {\mathcal {B}}_1^k. \end{array}\right. \end{aligned}$$
(9)

Notice that if X is a polyhedron and \(\omega (\cdot )=\Arrowvert \cdot \Arrowvert _2^2/2\), then \(D(x,x^{k(\ell )})=\Arrowvert x-x^{k(\ell )} \Arrowvert ^2_2/2\) and subproblem (9) is a convex quadratic problem (QP).
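For illustration, a minimal sketch of the master problem (9) for the particular case \(\omega (\cdot )=\Arrowvert \cdot \Arrowvert _2^2/2\) and a box-shaped X is given next; the bundle data structure, the box feasible set and the use of the cvxpy package are our own choices and not part of the algorithm's statement.

```python
import cvxpy as cp
import numpy as np

def solve_master(bundle, g2_center, center, mu, lb, ub):
    """Solve (9) with omega = ||.||^2/2; bundle is a list of triples (x_j, f1(x_j), g1_j)."""
    n = center.size
    x, r = cp.Variable(n), cp.Variable()
    cuts = [f1_xj + g1_j @ (x - x_j) <= r for (x_j, f1_xj, g1_j) in bundle]
    obj = r + 0.5 * mu * cp.sum_squares(x) - (g2_center + mu * center) @ x
    prob = cp.Problem(cp.Minimize(obj), cuts + [x >= lb, x <= ub])
    prob.solve()
    alphas = [c.dual_value for c in cuts]        # Lagrange multipliers of the cuts
    return x.value, alphas                       # trial point x^{k+1} and the alpha_j
```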

Proposition 1

Given a stability center \(x^{k(\ell )}\in X\) and a prox-parameter \(\mu _k >0\), let \(x^{k+1}\) be the unique solution of subproblem (9). Assume that either X is polyhedral or an appropriate constraint qualification [21] holds in (9). Then there exist \(s^{k+1} \in N_X(x^{k+1})\) and \(\alpha _j \ge 0\) with \(\sum _{j \in {\mathcal {B}}_1^k} \alpha _j =1\) such that \(\sum _{j \in {\mathcal {B}}_1^k} \alpha _j g_1^{j}:= p^{k+1} \in \partial {\check{f}}_1^k(x^{k+1})\) and

$$\begin{aligned} p^{k+1} + s^{k+1}- g_2^{k(\ell )} +\mu _k\left( \nabla \omega (x^{k+1}) -\nabla \omega (x^{k(\ell )}) \right) = 0. \end{aligned}$$
(10)

Moreover, the aggregate linearization

$$\begin{aligned} \bar{f}_1^{-k}(x):= {\check{f}}_1^k(x^{k+1}) +\langle p^{k+1},x- x^{k+1}\rangle \; \hbox {satisfies }\bar{f}_1^{-k}(x)\le f_1(x)\hbox { for all }x\in {\mathbb {R}}^n. \end{aligned}$$
(11)

Proof

The assumption on X ensures the existence of Lagrange multipliers \(\alpha _j\ge 0\) associated to the constraints \(\bar{f}_1^j(x)=f_1(x^j) + \langle g_1^j ,x-x^j\rangle \le r\), \(j \in {\mathcal {B}}_1^k\). Hence, the optimality conditions of (9) read as:

$$\begin{aligned} -\begin{pmatrix} -g_2^{k(\ell )}+\mu _k\left( \nabla \omega (x^{k+1}) -\nabla \omega (x^{k(\ell )}) \right) \\ 1 \end{pmatrix} - \sum _{j \in {\mathcal {B}}_1^k} \alpha _j \begin{pmatrix} g_1^j \\ -1 \end{pmatrix} \in \begin{pmatrix} N_X(x^{k+1})\\ 0 \end{pmatrix}. \end{aligned}$$

The above inclusion implies that \(\sum _{j \in {\mathcal {B}}_1^k} \alpha _j =1\) and that there exists \(s^{k+1} \in N_X(x^{k+1})\) such that \(s^{k+1} = g_2^{k(\ell )} -\mu _k (\nabla \omega (x^{k+1}) - \nabla \omega (x^{k(\ell )}))- \sum _{j \in {\mathcal {B}}_1^k} \alpha _j g_1^j\), which is exactly (10) with \(p^{k+1}= \sum _{j \in {\mathcal {B}}_1^k} \alpha _j g_1^j\). Note that \(p^{k+1}\) is a convex combination of active subgradients of \({\check{f}}_1^k\) at \(x^{k+1}\). As a result, \(p^{k+1} \in \partial {\check{f}}_1^k(x^{k+1})\) [5, Lemma 10.8]. The inequality \(\bar{f}_1^{-k}(x)\le f_1(x)\) holds because \( p^{k+1} \in \partial {\check{f}}_1^k(x^{k+1})\) and \({\check{f}}_1^k{(\cdot )} \le f_1{(\cdot )}\). \(\square \)

If \(X={\mathbb {R}}^n\) and \(\omega (x)=\Arrowvert x \Arrowvert ^2_2/2\) then the solution \(x^{k+1}\) of (8) is, from Eq. (10), \(x^{k+1}= x^{k(\ell )}+\frac{1}{\mu _k}(g_2^{k(\ell )}- \sum _{j \in {\mathcal {B}}_1^k} \alpha _j g_1^{j})\). Furthermore, the Lagrange multipliers \(\alpha _j\) (with \(j \in {\mathcal {B}}_1^k\)) can be obtained by solving the dual QP of (8), that has dimension \(|{\mathcal {B}}_1^k|\) (see [5, Lemma 10.8]). Therefore, by keeping the size of \({\mathcal {B}}_1^k\) bounded we also keep the method’s memory (number of elements in the bundle) limited.
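In this unconstrained Euclidean setting the trial point therefore admits the closed-form update below; the snippet simply evaluates the formula above, with the multipliers \(\alpha _j\) assumed to be available (e.g., from the dual of the master problem).

```python
import numpy as np

def unconstrained_trial_point(center, g2_center, g1_bundle, alphas, mu):
    """x^{k+1} = x^{k(l)} + (g2^{k(l)} - sum_j alpha_j g1^j) / mu_k, from Eq. (10)."""
    p = np.sum([a * g for a, g in zip(alphas, g1_bundle)], axis=0)   # p^{k+1}
    return center + (g2_center - p) / mu
```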

Once the trial point \(x^{k+1}\) is computed by a specialized solver (for QP, quadratically constrained QP, conic programming, etc.) a classification rule decides when to update \(x^{k(\ell )}\). For a given \(\kappa \in (0,1)\) and a lower bound \({\underline{\mu }}>0\) of the prox-parameter \(\mu _k\), a possible rule is as follows: if

$$\begin{aligned} f(x^{k+1}) \le f(x^{k(\ell )}) -\kappa {\underline{\mu }}D(x^{k+1},x^{k(\ell )}) \end{aligned}$$
(12)

then a serious step is performed and we set \(x^{k(\ell +1)} := x^{k+1}\) and \(\ell :=\ell +1\). Otherwise, a null step is performed and both stability center and counter \(\ell \) remain unchanged. A more economical rule in terms of evaluations of \(f_2\) is to test the condition

$$\begin{aligned} f_1(x^{k+1}) \le f_1(x^{k(\ell )}) + \langle g_2^{k(\ell )},x^{k+1}- x^{k(\ell )}\rangle -\kappa {\underline{\mu }}D(x^{k+1},x^{k(\ell )}). \end{aligned}$$
(13)

If it holds we do a serious step. Otherwise we perform a null step. Note that (13) takes into account only the first DC component \(f_1\) and not the DC function \(f\). Moreover, the above inequality does not necessarily imply that \(f_1(x^{k+1})\le f_1(x^{k(\ell )})\) due to the term \(\langle g_2^{k(\ell )},x^{k+1}- x^{k(\ell )}\rangle \). However, when (13) holds the DC function is decreased by an amount of at least \(\kappa {\underline{\mu }}D(x^{k(\ell +1)},x^{k(\ell )})\).
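The two rules can be coded as follows for the particular choice \(\omega (\cdot )=\Arrowvert \cdot \Arrowvert _2^2/2\) (so that \(D(x,y)=\Arrowvert x-y \Arrowvert _2^2/2\)); note that (13) only requires \(f_1\) and the subgradient of \(f_2\) stored at the stability center, whereas (12) also requires the value of \(f\) at the trial point. The choice of \(\omega \) and the function names are ours.

```python
import numpy as np

def serious_test_12(f_new, f_center, x_new, center, kappa, mu_lb):
    """Descent rule (12): sufficient decrease of f itself."""
    D = 0.5 * np.sum((x_new - center) ** 2)
    return f_new <= f_center - kappa * mu_lb * D

def serious_test_13(f1_new, f1_center, g2_center, x_new, center, kappa, mu_lb):
    """Descent rule (13): only f1 and the stored subgradient of f2 are needed."""
    D = 0.5 * np.sum((x_new - center) ** 2)
    return f1_new <= f1_center + g2_center @ (x_new - center) - kappa * mu_lb * D
```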

Lemma 1

If (13) holds then \(f(x^{k(\ell )})\ge f(x^{k(\ell +1)})+\kappa {\underline{\mu }}D(x^{k(\ell +1)},x^{k(\ell )}). \)

Proof

When inequality (13) is satisfied we obtain \(x^{k(\ell +1)}=x^{k+1}\) and thus

$$\begin{aligned} f_1(x^{k(\ell )}) \ge f_1(x^{k(\ell +1)}) - \langle g_2^{k(\ell )},x^{k(\ell +1)} - x^{k(\ell )}\rangle +\kappa {\underline{\mu }}D(x^{k(\ell +1)},x^{k(\ell )}). \end{aligned}$$

Convexity of \(f_2\) yields \(f_2(x^{k(\ell +1)})\ge f_2(x^{k(\ell )}) +\langle g_2^{k(\ell )},x^{k(\ell +1)} - x^{k(\ell )}\rangle \). By summing these two inequalities we obtain

$$\begin{aligned} f_1(x^{k(\ell )})+f_2(x^{k(\ell +1)})\ge f_1(x^{k(\ell +1)})+ f_2(x^{k(\ell )})+\kappa {\underline{\mu }}D(x^{k(\ell +1)},x^{k(\ell )}), \end{aligned}$$

i.e., \(f_1(x^{k(\ell )})-f_2(x^{k(\ell )}) \ge f_1(x^{k(\ell +1)})-f_2(x^{k(\ell +1)}) +\kappa {\underline{\mu }}D(x^{k(\ell +1)},x^{k(\ell )})\) as stated. \(\square \)

We are now in a position to present our first algorithm, which makes use of (13) to update stability centers. As a result, there is no need to evaluate \(f_2\) at \(x^{k+1}\): only subgradients of \(f_2\) are computed during serious steps, i.e., when (13) is satisfied. During null steps (when (13) does not hold) only the first DC component \(f_1\) needs to be assessed through an oracle that returns at \(x^{k+1}\) the value of the function and one of its subgradients. A similar algorithm can be stated with rule (12) instead, but with evaluations of \(f_2\) at every iteration.

[Figure b: statement of Algorithm 1, the proximal bundle method for computing critical points of (1).]

After a null step, the bundle of information \({\mathcal {B}}_1^k\) can have as few as three linearizations: the new linearization calculated at \(x^{k+1}\), the linearization formed at the current stability center \(x^{k(\ell )}\) and the aggregate linearization \(\bar{f}_1^{-k}\) defined in Proposition 1. Right after a serious step the bundle size can be reset to only one linearization, as is standard in proximal bundle methods for convex optimization. Notice also that the proximal parameter \(\mu _k\) is forbidden to decrease after a null step. This is crucial to prove convergence of the algorithm, which stops when the next iterate approximately coincides with the current stability center. This is a cheap and reliable stopping test as shown in Theorem 1 below.
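A high-level sketch of the resulting loop is given below; it reuses solve_master and serious_test_13 from the previous snippets, and the oracle interfaces, the box feasible set and the simple prox-parameter update are assumptions made only for the sake of the example.

```python
import numpy as np

def dc_proximal_bundle(x0, oracle_f1, subgrad_f2, lb, ub, kappa=0.1,
                       mu=1.0, mu_lb=1e-3, mu_ub=1e3, tol=1e-6, max_iter=500):
    """Sketch of Algorithm 1: oracle_f1(x) -> (f1(x), g1), subgrad_f2(x) -> g2."""
    center = x0
    f1_c, g1_c = oracle_f1(center)
    g2_c = subgrad_f2(center)
    bundle = [(center.copy(), f1_c, g1_c)]                # always contains the center's cut
    for _ in range(max_iter):
        x_new, _ = solve_master(bundle, g2_c, center, mu, lb, ub)
        if np.linalg.norm(x_new - center) <= tol:         # cheap and reliable stopping test
            break
        f1_new, g1_new = oracle_f1(x_new)
        if serious_test_13(f1_new, f1_c, g2_c, x_new, center, kappa, mu_lb):
            # serious step: move the center, compute a new subgradient of f2, reset the bundle
            center, f1_c = x_new, f1_new
            g2_c = subgrad_f2(center)
            bundle = [(center.copy(), f1_new, g1_new)]
            mu = max(mu_lb, 0.5 * mu)                     # the prox-parameter may decrease here
        else:
            # null step: enrich the bundle; the prox-parameter is not allowed to decrease
            bundle.append((x_new.copy(), f1_new, g1_new))
            mu = min(mu_ub, 1.2 * mu)
    return center
```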

4 Convergence analysis

Let \({\mathcal {L}}\subset \{0,1,2,\ldots \}\) denote the index set gathering the serious steps: \(\ell \in {\mathcal {L}}\) implies that \(x^{k(\ell )}\) is the \(\ell \mathrm{th}\) stability center. Throughout this section we will use the notation \(i(\ell )=k(\ell +1)-1\) to refer to the algorithm’s iteration yielding a serious step. Then for such iterations subproblem (8) reads as

$$\begin{aligned} x^{k(\ell +1)}= \arg \min _{x \in X} \;{\check{f}}_1^{i(\ell )}(x) -\langle g_2^{k(\ell )},x-x^{k(\ell )}\rangle +\mu _{i(\ell )}D(x, x^{k(\ell )}). \end{aligned}$$

Our goal is to show that any cluster point \({\bar{x}} \in X\) of the sequence of stability centers \(\{x^{k(\ell )}\}_\ell \) generated by Algorithm 1 is a critical point of (1), i.e., a point satisfying (3).

We start with the following result showing that if the algorithm performs only finitely many steps, then the last stability center is a critical point if \(\delta _{{{\,\mathrm{Tol}\,}}}=0\).

Lemma 2

Assume that \(\delta _{{{\,\mathrm{Tol}\,}}}=0\) and suppose that Algorithm 1 stops at iteration k. Then the last stability center \(x^{k(\ell )}\) is a critical point of (1).

Proof

Convexity of \(f_1\) and feasibility of \(x^{k(\ell )}\) imply that

$$\begin{aligned}&\min _{x \in X} {\check{f}}_1^k(x) - \langle g_2^{k(\ell )},x-x^{k(\ell )}\rangle + \mu _k D(x,x^{k(\ell )})\le \nonumber \\&\min _{x \in X} f_1(x) -\langle g_2^{k(\ell )},x-x^{k(\ell )}\rangle + \mu _k D(x,x^{k(\ell )})\le f_1(x^{k(\ell )}). \end{aligned}$$
(14)

In addition, the point \(x^{k+1}\) solves subproblem (8) (the first minimization problem in (14)), and the optimal value of this problem is \({\check{f}}_1^k(x^{k+1}) - \langle g_2^{k(\ell )},x^{k+1}-x^{k(\ell )}\rangle + \mu _k D(x^{k+1},x^{k(\ell )})\). Hence, if the algorithm stops at iteration k we have that \(x^{k+1}=x^{k(\ell )}\) and

$$\begin{aligned} {\check{f}}_1^k(x^{k+1}) - \langle g_2^{k(\ell )},x^{k+1}-x^{k(\ell )}\rangle + \mu _k D(x^{k+1},x^{k(\ell )})= {\check{f}}_1^k(x^{k(\ell )}) =f_1(x^{k(\ell )}), \end{aligned}$$

where the last equality follows from the assumption that \(k(\ell ) \in {\mathcal {B}}_1^k\). We have thus shown that

$$\begin{aligned} f_1(x^{k(\ell )}) \le \min _{x \in X} \;f_1(x) -\langle g_2^{k(\ell )},x-x^{k(\ell )}\rangle + \mu _k D(x,x^{k(\ell )}) \le f_1(x^{k(\ell )}), \end{aligned}$$

i.e., the point \( x^{k+1}=x^{k(\ell )}\) also solves \(\min _{x \in X} \;f_1(x) -\langle g_2^{k(\ell )},x-x^{k(\ell )}\rangle + \mu _k D(x,x^{k(\ell )})\). The optimality condition of this problem reads as

$$\begin{aligned} 0 \in \partial f_1(x^{k+1}) - g_2^{k(\ell )} + \mu _k \left( \nabla \omega (x^{k+1}) - \nabla \omega (x^{k(\ell )})\right) + N_X(x^{k+1}). \end{aligned}$$

Since \(x^{k+1}=x^{k(\ell )}\) the above inclusion is equivalent to \(g_2^{k(\ell )} \in \partial f_1(x^{k(\ell )}) + N_X(x^{k(\ell )})\), which gives (3) because \(g_2^{k(\ell )} \in \partial f_2(x^{k(\ell )})\). \(\square \)

From now on we assume that \(\delta _{{{\,\mathrm{Tol}\,}}}=0\) and that the algorithm loops indefinitely.

4.1 Infinitely many serious steps

In what follows we assume that the algorithm generates infinitely many serious steps, i.e., \(|{\mathcal {L}}| =\infty \). The following result shows that \(p^{k(\ell +1)}+s^{k(\ell +1)}\) defined in Proposition 1 is an approximate subgradient of the function \(f_1(x)+i_X(x)\) at the point \(x=x^{k(\ell +1)}\).

Lemma 3

Let \(k(\ell +1)\) be the iteration index of the \((\ell +1)\)th stability center and \(i(\ell ) = k(\ell +1) -1\) be the iteration index in which \(x^{k(\ell +1)}\) is determined. Assume \(\mu _k\le \overline{\mu }<\infty \) and that \(\{x^{k(\ell )}\}_{{\ell \in {\mathcal {L}}}}\) is a bounded sequence. Let \(p^{k(\ell +1)} \in \partial {\check{f}}_1^{i(\ell )} (x^{k(\ell +1)})\) and \(s^{k(\ell +1)} \in N_X(x^{k(\ell +1)})\) be as in (10), and denote \(\beta ^{k(\ell +1)}: = p^{k(\ell +1)}+ s^{k(\ell +1)}\). Then there exist constants \(M,L>0\) such that

$$\begin{aligned} \Arrowvert \beta ^{k(\ell )} \Arrowvert _2\le M \quad \hbox { and }\quad \Arrowvert g_1^{k(\ell )} \Arrowvert _2 \le L\hbox { for all }{\ell \in {\mathcal {L}}}. \end{aligned}$$

Moreover, \(f_1(x^{k(\ell +1)})\ge {\check{f}}_1^{i(\ell )}(x^{k(\ell +1)})\ge f_1(x^{k(\ell )}) -L\Arrowvert x^{k(\ell +1)}-x^{k(\ell )} \Arrowvert _2\) and

$$\begin{aligned} \beta ^{k(\ell +1)} \in \partial _{ e_{i(\ell )}} [f_1(x^{k(\ell )})+i_X(x^{k(\ell )})] \quad \hbox {with} \quad e_{i(\ell )} = (M+L)\Arrowvert x^{k(\ell )}-x^{k(\ell +1)} \Arrowvert _2. \end{aligned}$$

Proof

It follows from (10) and the triangle inequality that

$$\begin{aligned} \Arrowvert p^{k(\ell +1)}+s^{k(\ell +1)} \Arrowvert _2= & {} \Arrowvert g_2^{k(\ell )} +\mu _k\left( \nabla \omega (x^{k(\ell )}) - \nabla \omega (x^{k(\ell +1)}) \right) \Arrowvert _2 \le \Arrowvert g_2^{k(\ell )} \Arrowvert _2 \\&+\,\overline{\mu }\Arrowvert \nabla \omega (x^{k(\ell )}) - \nabla \omega (x^{k(\ell +1)}) \Arrowvert _2. \end{aligned}$$

Since \(\{x^{k(\ell )}\}_{{\ell \in {\mathcal {L}}}}\) is bounded and the DC components \(f_1\) and \(f_2\) are finite-valued convex functions on \(\varOmega \supset X\), then [20, Theorem  3.1.2] ensures that \(\{g_1^{k(\ell )}\}_{{\ell \in {\mathcal {L}}}}\) and \(\{g_2^{k(\ell )}\}_{{\ell \in {\mathcal {L}}}}\) are bounded sequences as well. In particular, there exists \(L>0\) such that \(\Arrowvert g_1^{k(\ell )} \Arrowvert _2\le L\) for all \(\ell \in {\mathcal {L}}\). Moreover, the existence of a finite constant \(M>0\) bounding \(\Arrowvert p^{k(\ell +1)}+s^{k(\ell +1)} \Arrowvert _2\) is ensured because \(\overline{\mu }< \infty \) and \(\nabla \omega \) is a continuous map.

Recall that \(N_X(x^{k(\ell +1)})=\partial i_X(x^{k(\ell +1)})\) because \(x^{k(\ell +1)}\in X\). Then \(s^{k(\ell +1)}\in \partial i_X(x^{k(\ell +1)})\) and therefore \(i_X(x)\ge i_X(x^{k(\ell +1)}) + \langle s^{k(\ell +1)},x-x^{k(\ell +1)}\rangle \) for all \(x \in {\mathbb {R}}^n\). Since \(p^{k(\ell +1)}\in \partial {\check{f}}_1^{i(\ell )}(x^{k(\ell +1)})\) we get

$$\begin{aligned} f_1(x) + i_X(x)\ge & {} {\check{f}}_1^{i(\ell )}(x) +i_X(x)\\\ge & {} [{\check{f}}_1^{i(\ell )}(x^{k(\ell +1)}) + \langle p^{k(\ell +1)},x-x^{k(\ell +1)}\rangle ]\\&+\, [i_X(x^{k(\ell +1)}) + \langle s^{k(\ell +1)},x-x^{k(\ell +1)}\rangle ]\\= & {} {\check{f}}_1^{i(\ell )}(x^{k(\ell +1)})+i_X(x^{k(\ell +1)}) + \langle \beta ^{k(\ell +1)},x-x^{k(\ell +1)}\rangle . \end{aligned}$$

As \(x^{k(\ell )}\in X\) for all \(\ell \), then \(i_X(x^{k(\ell +1)})=0\). By utilizing this in the above inequality we obtain

$$\begin{aligned} f_1(x) + i_X(x)\ge & {} {\check{f}}_1^{i(\ell )}(x^{k(\ell +1)}) + \langle \beta ^{k(\ell +1)},x-x^{k(\ell +1)}\rangle \\= & {} f_1(x^{k(\ell )}) + \langle \beta ^{k(\ell +1)},x-x^{k(\ell )}\rangle + \\&+\,\langle \beta ^{k(\ell +1)},x^{k(\ell )}-x^{k(\ell +1)}\rangle + {\check{f}}_1^{i(\ell )}(x^{k(\ell +1)}) - f_1(x^{k(\ell )}), \end{aligned}$$

which gives by the Cauchy–Schwarz inequality

$$\begin{aligned} f_1(x) + i_X(x)\ge & {} f_1(x^{k(\ell )}) + \langle \beta ^{k(\ell +1)},x-x^{k(\ell )}\rangle \nonumber \\&- M\Arrowvert x^{k(\ell )}-x^{k(\ell +1)} \Arrowvert _2+{\check{f}}_1^{i(\ell )}(x^{k(\ell +1)}) - f_1(x^{k(\ell )}). \end{aligned}$$
(15)

Note that Algorithm 1 keeps in the bundle the index \(k(\ell )\) of the current stability center. Therefore,

$$\begin{aligned} {\check{f}}_1^{i(\ell )}(x^{k(\ell +1)}) = \max _{j \in {\mathcal {B}}_1^{i(\ell )}} \{f_1(x^j) + \langle g_1^{j},x^{k(\ell +1)}-x^j \rangle \}\ge f_1(x^{k(\ell )}) + \langle g_1^{k(\ell )},x^{k(\ell +1)}-x^{k(\ell )}\rangle . \end{aligned}$$

Again by the Cauchy–Schwarz inequality we get \({\check{f}}_1^{i(\ell )}(x^{k(\ell +1)}) \ge f_1(x^{k(\ell )}) - L\Arrowvert x^{k(\ell +1)}-x^{k(\ell )} \Arrowvert _2\). Thus it follows from (15) that (because \(i_X(x^{k(\ell )}) =0\))

$$\begin{aligned} f_1(x) + i_X(x)\ge & {} f_1(x^{k(\ell )}) + \langle \beta ^{k(\ell +1)},x-x^{k(\ell )}\rangle - (M+L)\Arrowvert x^{k(\ell )}-x^{k(\ell +1)} \Arrowvert _2 \\= & {} f_1(x^{k(\ell )}) + i_X(x^{k(\ell )}) + \langle \beta ^{k(\ell +1)},x-x^{k(\ell )}\rangle - e_{i(\ell )}. \end{aligned}$$

Since \(x \in {\mathbb {R}}^n\) is an arbitrary point we conclude that \(\beta ^{k(\ell +1)} \in \partial _{ e_{i(\ell )}} [f_1(x^{k(\ell )}) + i_X(x^{k(\ell )}) ]\). The inequalities \(f_1(x^{k(\ell +1)})\ge {\check{f}}_1^{i(\ell )}(x^{k(\ell +1)})\ge f_1(x^{k(\ell )}) -L\Arrowvert x^{k(\ell +1)}-x^{k(\ell )} \Arrowvert _2\) follow trivially from the already shown inequality \({\check{f}}_1^{i(\ell )}(x^{k(\ell +1)}) \ge f_1(x^{k(\ell )}) - L\Arrowvert x^{k(\ell +1)}-x^{k(\ell )} \Arrowvert _2\). \(\square \)

Suppose that \(\{x^{k(\ell )}\}_{{\ell \in {\mathcal {L}}}}\) is bounded, \({\bar{x} \in X}\) is one of its cluster points, and \(\Arrowvert x^{k(\ell +1)}-x^{k(\ell )} \Arrowvert _2 \rightarrow 0\). Then the above lemma ensures that \(\{p^{k(\ell )}+s^{k(\ell )}\}_{{\ell \in {\mathcal {L}}}}\) is a bounded sequence and that any cluster point \({\bar{\beta }}\) of \(\{p^{k(\ell )}+s^{k(\ell )}\}_{{\ell \in {\mathcal {L}}}}\) satisfies \({\bar{\beta }} \in \partial [f_1({\bar{x}}) + i_X({\bar{x}})] \) as a consequence of [21, Proposition 4.1.1]. This crucial property is employed in the following proposition to establish that any cluster point \({\bar{x}} \in X\) of \(\{x^{k(\ell )}\}_{{\ell \in {\mathcal {L}}}}\) is a critical point of (1).

Proposition 2

Assume that the level set \(\{x\in X: f(x)\le f(x^0)\}\) is bounded and that Algorithm 1 performs infinitely many serious steps, i.e., \(|{\mathcal {L}}|=\infty \) and \(\ell \rightarrow \infty \). Then any cluster point \({\bar{x}} \in X\) of the sequence \(\{x^{k(\ell )}\}_{{\ell \in {\mathcal {L}}}}\) is a critical point of problem (1).

Proof

Lemma 1 shows that \( f(x^{k(\ell )}) - f(x^{k(\ell +1)}) \ge \kappa {\underline{\mu }}D(x^{k(\ell +1)},x^{k(\ell )}) \) and therefore the sequence \(\{f(x^{k(\ell )})\}_{{\ell \in {\mathcal {L}}}}\) is strictly decreasing, which in turn implies that \(\{x^{k(\ell )}\}_{{\ell \in {\mathcal {L}}}}\) is a bounded sequence by the assumption of having a bounded level set. Continuity of \(f\) ensures that the level set is also closed, and hence compact. The Weierstrass theorem implies that the optimal value of (1) is finite. Therefore, by summing the above inequality with respect to \(\ell \) we get

$$\begin{aligned} \infty > {f(x^{k(0)}) - \min _X f(x) \ge } f(x^{k(0)}) - \lim _{\ell \rightarrow \infty }f(x^{k(\ell +1)})= & {} \sum _{\ell =0}^\infty [f(x^{k(\ell )}) - f(x^{k(\ell +1)})]\\\ge & {} \kappa {\underline{\mu }}\sum _{\ell =0}^\infty D(x^{k(\ell +1)},x^{k(\ell )}). \end{aligned}$$

Then it follows from the definition of D in (5) and the equivalence of norms in \({\mathbb {R}}^n\) that

$$\begin{aligned} \infty >\sum _{\ell =0}^\infty \Arrowvert x^{k(\ell +1)}-x^{k(\ell )} \Arrowvert _2^2, \quad \hbox {and consequently}\quad \lim _{\ell \rightarrow \infty } \Arrowvert x^{k(\ell +1)} - x^{k(\ell )} \Arrowvert _2 =0. \end{aligned}$$

Lemma 3 ensures that \(\beta ^{k(\ell +1)}=p^{k(\ell +1)}+s^{k(\ell +1)} \in \partial _{ e_{i(\ell )}} [f_1(x^{k(\ell )})+i_X(x^{k(\ell )})]\) with \( e_{i(\ell )} = (M+L)\Arrowvert x^{k(\ell +1)}-x^{k(\ell )} \Arrowvert _2\) for two (possibly unknown) constants \(M,L>0\). Note also that \(\lim _{\ell \rightarrow \infty } e_{i(\ell )}=0\) because \(\lim _{\ell \rightarrow \infty } \Arrowvert x^{k(\ell +1)} - x^{k(\ell )} \Arrowvert _2 =0\). Thus, there exist subsets \({\mathcal {L}}'' \subset {\mathcal {L}}' \subset {\mathcal {L}}=\{0,1,\ldots \}\) such that \(\{x^{k(\ell )}\}_{{\ell \in {\mathcal {L}}'}}\) converges to a point \({\bar{x}} \in X\) (because \(\{x^{k(\ell )}\}_{{\ell \in {\mathcal {L}}}}\) is bounded and X is closed) and \(\{\beta ^{k(\ell +1)}\}_{{\ell \in {\mathcal {L}}''}}\) converges to a point \({\bar{\beta }} \in \partial [f_1({\bar{x}})+i_X(\bar{x})]\) (see Proposition 4.1.1 in [21] for more details). In order to show that \({\bar{x}}\) is a critical point of problem (1), we only need to prove that \(\lim _{\ell \in {\mathcal {L}}''} g_2^{k(\ell )} = {\bar{\beta }}\). This latter result follows directly from Eq. (10), continuity of \(\nabla \omega \) and the inequality \(\mu _k \le \overline{\mu }< \infty \):

$$\begin{aligned} {\bar{\beta }} = \lim _{\ell \in {\mathcal {L}}''} [p^{k(\ell +1)}+s^{k(\ell +1)}]= & {} \lim _{\ell \in {\mathcal {L}}''} g_2^{k(\ell )} \\&- \lim _{\ell \in {\mathcal {L}}''} \mu _{i(\ell )}\left( \nabla \omega (x^{k(\ell )})-\nabla \omega (x^{k(\ell +1)})\right) = \lim _{\ell \in {\mathcal {L}}''} g_2^{k(\ell )}, \end{aligned}$$

showing that \({\bar{\beta }} \) also belongs to \(\partial f_2({\bar{x}})\). This concludes the proof. \(\square \)

4.2 Finitely many serious steps (infinitely many null steps)

In this section, we assume that after the \({{\hat{\ell }}}\mathrm{th}\)-stability center \(x^{k({\hat{\ell }})} = {\hat{x}}\) only null steps are performed. Notice that in this case \( g_2^{k({\hat{\ell }})}={\hat{g}}_2\) is fixed and therefore Algorithm 1 behaves exactly as a convex bundle algorithm with the master program given by

$$\begin{aligned} \min _{x \in X} f_k^{\mathtt{model}}(x) + \mu _kD(x,{\hat{x}}), \end{aligned}$$

where \(f_k^{\mathtt{model}}(x):={\check{f}}_1^k(x)-\langle {\hat{g}}_2,x\rangle \) is a cutting-plane model for the convex function \(f_1(x)-\langle {\hat{g}}_2,x\rangle \). Thus it follows from the convergence analysis of convex bundle methods that the sequence of iterates generated after the last serious step converges to the last stability center: \(\lim _{k\rightarrow \infty } x^{k+1}= {\hat{x}}\). Indeed, if \(D(x,y) = \Arrowvert x-y \Arrowvert _2^2\) this claim follows directly from [45, Proposition 4.4]. For the more general setting of a Bregman function \(D(\cdot , \cdot )\) but with \(X={\mathbb {R}}^n\), the result \(\lim _{k\rightarrow \infty } x^{k+1}= {\hat{x}}\) can be justified by [12, Theorem 5.8] if the prox-parameter \(\mu _k>0\) is fixed after finitely many steps of the algorithm. Overall, we have the following result.

Lemma 4

Let \({\hat{x}}= x^{k({\hat{\ell }})}\) be the last stability center generated by Algorithm 1 and assume that \(\{\mu _k\}_{k\ge k(\hat{\ell })}\) is a nondecreasing sequence contained in \([{\underline{\mu }},\,\overline{\mu }]\). Then

$$\begin{aligned} \lim _{k\rightarrow \infty } x^{k+1}= {\hat{x}}. \end{aligned}$$

Since Algorithm 1 does not rely on the same assumptions as either [45, Proposition 4.4] or [12, Theorem 5.8], we provide, for the sake of completeness, the proof of Lemma 4 in the “Appendix”. The following result is a version of Lemma 3 for the sequence of null iterates.

Lemma 5

Let \({\hat{x}}= x^{k({\hat{\ell }})}\) be the last stability center generated by Algorithm 1, and for \(k\ge k({\hat{\ell }})\) let \(p^{k+1}\in \partial {\check{f}}_1^k(x^{k+1})\) and \(s^{k+1}\in N_X(x^{k+1})\) be as in (10). If \(\{\mu _k\}_{k\ge k({\hat{\ell }})}\) is nondecreasing, then there exist constants \(M,L>0\) such that \(\Arrowvert p^{k+1}+s^{k+1} \Arrowvert _2\le M\) and \(\Arrowvert p^{k+1} \Arrowvert _2\le L\) for all \(k\ge k({\hat{\ell }})\). Moreover, \(\beta ^{k+1} =p^{k+1} +s^{k+1}\in \partial _{ e_k} [f_1({\hat{x}})+i_X({\hat{x}})]\), where \( e_k = (M+L)\Arrowvert x^{k+1} - {\hat{x}} \Arrowvert _2\) for all \(k\ge k({\hat{\ell }})\).

Proof

It follows from (10) that \(\Arrowvert p^{k+1}+s^{k+1} \Arrowvert _2 \le \Arrowvert {\hat{g}}_2 \Arrowvert _2 +\overline{\mu }\Arrowvert \nabla \omega ({\hat{x}}) - \nabla \omega (x^{k+1}) \Arrowvert _2\) for \(k\ge k({\hat{\ell }})\). Since \(\{\mu _k\}_{k \ge k({\hat{\ell }})}\) is nondecreasing, Lemma 8 (in the “Appendix”) ensures that \(\{x^k\}_{{k> k({\hat{\ell }})}}\) is a bounded sequence (therefore \(\{x^k\}_k\) is bounded itself). Then there exists a finite constant \(M>0\) bounding \(\Arrowvert p^{k+1}+s^{k+1} \Arrowvert _2\) (because \(\nabla \omega \) is continuous). As \(f_1\) is a finite-valued convex function on \(\varOmega \supset X\), its subdifferential is locally bounded [20, Theorem 3.1.2]. Then there exists a constant \(L>0\) such that \(\Arrowvert g_1^k \Arrowvert _2 \le L\) for all k. We recall that the subdifferential of \({\check{f}}_1^k\) at x is the convex hull of the set \(\{g_1^j: \, {\check{f}}_1^k(x)=f_1(x^j)+\langle g_1^j,x-x^j\rangle \}\). Thus every subgradient of \({\check{f}}_1^k\) is also bounded in norm by L, which in turn implies that \({\check{f}}_1^k\) (regardless of the iteration k) has L as a Lipschitz constant.

Recall that \(N_X(x^{k+1})=\partial i_X(x^{k+1})\) because \(x^{k+1}\in X\). Then \(s^{k+1} \in \partial i_X(x^{k+1})\) and therefore \(i_X(x)\ge i_X(x^{k+1}) + \langle s^{k+1},x-x^{k+1}\rangle \) for all \(x \in {\mathbb {R}}^n\). Since \(p^{k+1} \in \partial {\check{f}}_1^k(x^{k+1})\) and \(i_X(x^k)=0\) for all k we get (for \(k\ge k({\hat{\ell }})\))

$$\begin{aligned} f_1(x) + i_X(x)\ge & {} {\check{f}}_1^k(x) +i_X(x)\\\ge & {} [{\check{f}}_1^k(x^{k+1}) + \langle p^{k+1},x-x^{k+1}\rangle ] + [i_X(x^{k+1}) + \langle s^{k+1},x-x^{k+1}\rangle ]\\= & {} {\check{f}}_1^k(x^{k+1}) + \langle \beta ^{k+1},x-x^{k+1}\rangle \\= & {} f_1({\hat{x}}) + \langle \beta ^{k+1},x-{\hat{x}}\rangle + \langle \beta ^{k+1},{\hat{x}}-x^{k+1}\rangle + {\check{f}}_1^k(x^{k+1}) - f_1({\hat{x}}) \\\ge & {} f_1({\hat{x}}) + \langle \beta ^{k+1},x-{\hat{x}}\rangle -M\Arrowvert {\hat{x}}-x^{k+1} \Arrowvert _2 - [f_1({\hat{x}})-{\check{f}}_1^k(x^{k+1})]. \end{aligned}$$

Since by construction \(k({\hat{\ell }}) \in {\mathcal {B}}_1^k\) for all \(k\ge k({\hat{\ell }})\), we have that \({\check{f}}_1^k({\hat{x}}) = f_1({\hat{x}})\). Then \(f_1({\hat{x}})-{\check{f}}_1^k(x^{k+1}) = {\check{f}}_1^k({\hat{x}})-{\check{f}}_1^k(x^{k+1}) \le L \Arrowvert {\hat{x}}-x^{k+1} \Arrowvert _2\). This shows that

$$\begin{aligned} f_1(x)+i_X(x)\ge f_1({\hat{x}}) +i_X({\hat{x}}) + \langle \beta ^{k+1},x-{\hat{x}}\rangle - e_k\quad \forall \; x \in {\mathbb {R}}^n, \end{aligned}$$

because \(i_X({\hat{x}})=0\). Therefore, \(\beta ^{k+1} \in \partial _{ e_k}[f_1({\hat{x}}) +i_X({\hat{x}})]\) for all \(k\ge k({\hat{\ell }})\). \(\square \)

We are now in a position to prove that the last stability center \({\hat{x}}\) is a critical point of problem (1).

Proposition 3

Let \({\hat{x}}= x^{k({\hat{\ell }})}\) be the last stability center generated by Algorithm 1 and assume that \(\{\mu _k\}_{k\ge k(\hat{\ell })}\) is a nondecreasing sequence contained in \([{\underline{\mu }},\,\overline{\mu }]\). Then the sequence \(\{x^{k+1}\}_{{k}}\) converges to \({\hat{x}}\) and \({\bar{x}} = {\hat{x}} \) is a critical point of problem (1).

Proof

Lemma 4 gives \(\lim _{k\rightarrow \infty } x^{k+1}= {\hat{x}}\). It follows from (10), continuity of the mapping \(\nabla \omega \) and \(\mu _k\le \overline{\mu }\) that

$$\begin{aligned} \lim _{k\rightarrow \infty } \beta ^{k+1} = \lim _{k\rightarrow \infty } [p^{k+1} +s^{k+1}]= {\hat{g}}_2 - \lim _{k\rightarrow \infty } \mu _k \left( \nabla \omega (x^{k+1}) -\nabla \omega ({\hat{x}}) \right) ={\hat{g}}_2 \in \partial f_2({\hat{x}}). \end{aligned}$$

Lemma 5 ensures that \(\beta ^{k+1} \in \partial _{ e_k} [f_1({\hat{x}})+i_X({\hat{x}})]\) with \( e_k = (M+L)\Arrowvert x^{k+1} - {\hat{x}} \Arrowvert _2\). Since \( \lim _{k\rightarrow \infty } e_k = 0\) due to \(\lim _{k\rightarrow \infty } x^{k+1}= {\hat{x}}\), then \({\hat{g}}_2 =\lim _{k\rightarrow \infty } \beta ^{k+1}\in \partial [f_1({\hat{x}})+i_X({\hat{x}})]\) and (3) is satisfied with \({\bar{x}} = {\hat{x}}\). \(\square \)

4.3 Convergence analysis: main result

Convergence analysis of Algorithm 1 is summarized in the following theorem.

Theorem 1

Consider Algorithm 1 and suppose that the level set \(\{x\in X: f(x)\le f(x^0)\}\) is bounded. If the stopping-test tolerance \(\delta _{{{\,\mathrm{Tol}\,}}}=0\), then any cluster point \({\bar{x}}\) of the sequence of stability centers \(\{x^{k(\ell )}\}_{{\ell \in {\mathcal {L}}}}\) generated by the algorithm satisfies (3). Moreover, if the stopping-test tolerance \(\delta _{{{\,\mathrm{Tol}\,}}}>0\) then the algorithm stops after finitely many steps with an approximate critical point \(\bar{x}=x^{k(\ell )}\). In addition, if \(\nabla \omega (\cdot )\) is a locally Lipschitz continuous map, then the approximate critical point \({\bar{x}}=x^{k(\ell )}\) satisfies

$$\begin{aligned} \partial _{s_1\delta _{{{\,\mathrm{Tol}\,}}}} [f_1({\bar{x}})+i_X({\bar{x}})] \cap [\partial f_2(\bar{x}) + B(0;s_2\delta _{{{\,\mathrm{Tol}\,}}})]\ne \emptyset , \end{aligned}$$

where \(s_1,s_2>0\) are two constants and \(B(0;s_2\delta _{{{\,\mathrm{Tol}\,}}})\) is the closed ball in \({\mathbb {R}}^n\) with center at zero and with radius \(s_2\delta _{{{\,\mathrm{Tol}\,}}}\).

Proof

First, suppose that \(\delta _{{{\,\mathrm{Tol}\,}}}=0\). If Algorithm 1 stops at iteration k then Lemma 2 ensures that the last stability center is a critical point of (1). Suppose now that Algorithm 1 does not stop. If infinitely many serious steps are generated, then Proposition 2 gives the result. Otherwise there will be a finite number of serious steps and Proposition 3 ensures that the last stability center is a critical point.

Moreover, suppose that \(\delta _{{{\,\mathrm{Tol}\,}}}>0\). If infinitely many stability centers are generated then Proposition 2 shows that \(\lim _{\ell \rightarrow \infty } \Arrowvert x^{k(\ell +1)}-x^{k(\ell )} \Arrowvert _2=0\). Otherwise, if there are only finitely many stability centers, then Proposition 3 ensures that \(\lim _{k \rightarrow \infty } \Arrowvert x^{k+1}-{\hat{x}} \Arrowvert _2=0\), where \({\hat{x}}= x^{k(\ell )}\) is the last stability center. In either case, the stopping test of Algorithm 1 will be triggered after finitely many iterations if \(\delta _{{{\,\mathrm{Tol}\,}}}>0\). In this case, \(\Arrowvert x^{k+1}-x^{k(\ell )} \Arrowvert _2\le \delta _{{{\,\mathrm{Tol}\,}}}\) and therefore Lemma 5 gives

$$\begin{aligned} p^{k+1}+s^{k+1}\in \partial _{(M+L)\delta _{{{\,\mathrm{Tol}\,}}}} [f_1(x^{k(\ell )})+i_X(x^{k(\ell )})]. \end{aligned}$$

Furthermore, Eq. (10) and the assumption on \(\nabla \omega \) yield

$$\begin{aligned} \Arrowvert p^{k+1}+s^{k+1} -g_2^{k(\ell )} \Arrowvert _2=\mu _k\Arrowvert \nabla \omega (x^{k+1})-\nabla \omega (x^{k(\ell )}) \Arrowvert _2 \le s_2\Arrowvert x^{k+1}-x^{k(\ell )} \Arrowvert _2\le s_2\delta _{{{\,\mathrm{Tol}\,}}}. \end{aligned}$$

These properties show that \({\bar{x}} = x^{k(\ell )}\) is an approximate critical point of problem (1) with \(s_1 = M+L\) and \(s_2=\bar{\mu }\,L_\omega \), where \(L_\omega \) is the Lipschitz constant of \(\nabla \omega \) on the bounded sequence \(\{x^k\}_{{k}}\). \(\square \)

4.4 Some comments on convex and DC models

Algorithm 1 employs the convex model \({\check{f}}_1^k(x)-\bar{f}_2^{k(\ell )}(x)\), with

$$\begin{aligned} {\bar{f}}_2^{k(\ell )}(x):=f_2(x^{k(\ell )})+\langle g_2^{k(\ell )},x-x^{k(\ell )}\rangle , \end{aligned}$$

to approximate the DC function \(f\), where \({\check{f}}_1^k\) is a cutting-plane model of \(f_1\). Instead of considering only one linearization for the second DC component, one could also gather a bundle of information \({\mathcal {B}}_2^k \subset \{1,2,\ldots , k\}\) for \(f_2\) and define a cutting-plane model

$$\begin{aligned} {\check{f}}_2^k(x):=\max _{j \in {\mathcal {B}}_2^k} {\bar{f}}_2^j (x). \end{aligned}$$

This provides a DC model \({\check{f}}_1^k-\check{f}_2^k\) for \(f\) that is expected to be better than the convex one \({\check{f}}_1^k(x)-\bar{f}_2^{k(\ell )}(x)\). Both publications [16] and [27] consider the DC model \({\check{f}}_1^k(\cdot )-\check{f}_2^k(\cdot )\) in their proximal bundle algorithms for unconstrained DC programs. For instance, the method of [27] defines trial points by solving globally the following DC subproblem

$$\begin{aligned} x^{k+1}\in \arg \min _{x \in {\mathbb {R}}^n} {\check{f}}_1^k(x)-\check{f}_2^k(x)+ \mu _k\Arrowvert x - x^{k(\ell )} \Arrowvert _2^2. \end{aligned}$$
(16)

Since \(\check{f}_2^k\) is a polyhedral function, a global solution to the above subproblem can be obtained by solving \(|{\mathcal {B}}_2^k|\) strictly convex QPs

$$\begin{aligned} \min _{j \in {\mathcal {B}}_2^k} \left\{ \min _{x \in {\mathbb {R}}^n} {\check{f}}_1^k(x)-\bar{f}_2^j(x)+ \mu _k\Arrowvert x - x^{k(\ell )} \Arrowvert _2^2\right\} . \end{aligned}$$

As a result, if the bundle of information \({\mathcal {B}}_2^k\) is large then each iteration of the algorithm in [27] can become too time-consuming. For this reason, the size of the bundle \({\mathcal {B}}_2^k\) should be kept small. The proximal bundle method of [27] is shown to converge to a critical point of an unconstrained DC program even when \({\mathcal {B}}_2^k\) has only two indexes, i.e., only two QPs need to be solved per iteration. In contrast, Algorithm 1 requires solving only one convex subproblem (a QP if \(\omega (\cdot )=\Arrowvert \cdot \Arrowvert _2^2/2\) and X is a polyhedron) per iteration, which corresponds to taking \({\mathcal {B}}_2^k = \{k(\ell )\}\) for all k. This is a feature of practical interest if X contains some conic/quadratic constraints.
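For illustration, the enumeration used by [27] to minimize the DC model in (16) can be sketched as follows; the bundle data structures and the use of cvxpy are our own choices, and the snippet is only meant to make the count of \(|{\mathcal {B}}_2^k|\) QPs explicit.

```python
import cvxpy as cp
import numpy as np

def minimize_dc_model(bundle1, bundle2, center, mu):
    """Globally minimize the DC model in (16): one strictly convex QP per cut of f2."""
    best_val, best_x = np.inf, None
    for (y_j, f2_yj, g2_j) in bundle2:                 # enumerate the linearizations of f2
        x, r = cp.Variable(center.size), cp.Variable()
        cuts = [f1_xi + g1_i @ (x - x_i) <= r for (x_i, f1_xi, g1_i) in bundle1]
        lin2 = f2_yj + g2_j @ (x - y_j)                # \bar f_2^j(x)
        prob = cp.Problem(cp.Minimize(r - lin2 + mu * cp.sum_squares(x - center)), cuts)
        prob.solve()
        if prob.value < best_val:
            best_val, best_x = prob.value, x.value
    return best_x
```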

5 A proximal bundle method for finding d-stationary points

In this section, we consider a particular case of problem (1) in which the second DC component is the pointwise maximum of finitely many differentiable convex functions \(\psi _i: \varOmega \rightarrow {\mathbb {R}}\):

$$\begin{aligned} \min _{x \in X} f(x), \quad \hbox {with}\quad f(x) = f_1(x) - f_2(x)\, \quad \hbox {and} \quad f_2(x):= \max _{1\le i\le N } \psi _i(x). \end{aligned}$$
(17)

Inspired by the work [38], the following modification of Algorithm 1 may solve several subproblems of type (8) at certain iterations to generate a subsequence of stability centers \(\{x^{k(\ell )}\}_{\ell \in {\mathcal {L}}}\) that converges to a d-stationary point \({\bar{x}} \) of problem (17), i.e., \({\bar{x}}\) satisfying (2). To this end, we explore the structure of \(f_2\). It is well known that the subdifferential of \(f_2\) at any given point \(x \in X \subset \varOmega \) is the convex hull of gradients of functions \(\psi _i\) that are active:

$$\begin{aligned} \partial f_2(x) = \mathtt{conv}(\{\nabla \psi _i(x)\}_{i \in A(x)}), \quad \hbox {with}\quad A(x) := \{1\le i\le N:\, \psi _i(x) =f_2(x)\}\quad \hbox {for all }\; x \in X. \end{aligned}$$
(18)

As a result, if \({\bar{x}}\in X\) satisfies

$$\begin{aligned} \nabla \psi _i({\bar{x}}) \in \partial [f_1({\bar{x}}) + i_X({\bar{x}})] \quad {\hbox {for all }\; i \in A({\bar{x}})}, \end{aligned}$$

then (2) holds from the convexity of \( \partial [f_1(\bar{x}) + i_X({\bar{x}})]\) and thus \({\bar{x}}\) is a d-stationary point of (17). For technical reasons, Algorithm 2 below makes use of the following relaxation of A(x):

$$\begin{aligned} A_\epsilon (x)=\{1\le i \le N:\, \psi _i(x) \ge f_2(x)- \epsilon \},\quad \hbox {with }\epsilon >0. \end{aligned}$$

To avoid critical points that are not d-stationary the parameter \(\epsilon \) above must be strictly positive [38, Example 4].
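Computing this relaxed active set is straightforward; a small helper is sketched below (the list psis of the functions \(\psi _i\) and the helper's name are ours).

```python
import numpy as np

def eps_active_set(psis, x, eps):
    """Return A_eps(x) = {i : psi_i(x) >= f_2(x) - eps} as a list of indices."""
    vals = np.array([psi(x) for psi in psis])
    return [i for i, v in enumerate(vals) if v >= vals.max() - eps]
```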

[Figure c: statement of Algorithm 2, the proximal bundle variant for computing d-stationary points of (17); its master problem is referred to as (19) in the text.]

Note that if the number N of functions \(\psi _i\) in (17) is equal to one (i.e., \(f_2= \psi _1\)), then Algorithm 2 becomes essentially Algorithm 1 (the only difference being the descent test). However, if \(N>1\) then Algorithm 2 solves \(|A_{\epsilon }(x^k)|\) subproblems per iteration.

Moreover, some vectors \(\nabla \psi _i(x^{k(\ell )})\) with \(i \in A_\epsilon (x^{k(\ell )})\) may not belong to \( \partial f_2(x^{k(\ell )})\) because \(\epsilon >0\). As a result, the new trial point \(x^{k+1}\), the solution of (19) for some \(i^* \in A_\epsilon (x^{k(\ell )})\), may not be generated by a true subgradient \(\nabla \psi _{i^*}(x^{k(\ell )})\) of \(f_2\) at \(x^{k(\ell )}\). Since Lemma 1 relies on the fact that \(g_2^{k(\ell )} \in \partial f_2(x^{k(\ell )})\), we cannot employ the descent test (13) with \(g_2^{k(\ell )}\) replaced with \( \nabla \psi _{i^*}(x^{k(\ell )})\): even if

$$\begin{aligned} f_1(x^{k+1}) \le f_1(x^{k(\ell )}) + \langle \nabla \psi _{i^*},x^{k+1}- x^{k(\ell )}\rangle -\kappa {\underline{\mu }}D(x^{k+1},x^{k(\ell )}) \end{aligned}$$

holds true, nothing ensures that \(f(x^{k+1}) \le f(x^{k(\ell )})-\kappa {\underline{\mu }}D(x^{k+1},x^{k(\ell )})\). This is why we have replaced the descent test (13) with the more direct one given in (12). A downside of such a replacement is that we need to call an oracle for \(f_2\) at every trial point, in contrast to Algorithm 1, which never evaluates \(f_2\) (but computes one of its subgradients at serious steps).
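A sketch of one iteration of Algorithm 2, under the same illustrative assumptions as the earlier snippets (\(\omega (\cdot )=\Arrowvert \cdot \Arrowvert _2^2/2\), the helpers solve_master and eps_active_set, and value oracles for \(f_1\) and the \(\psi _i\)), could therefore read as follows; it solves one master problem per index in \(A_\epsilon (x^{k(\ell )})\), selects the index \(i^*\) with the smallest optimal value, and then applies the direct descent test (12).

```python
import numpy as np

def algorithm2_iteration(bundle1, psis, grad_psis, f1, center, mu, mu_lb, kappa, eps, lb, ub):
    """One iteration of Algorithm 2 (sketch): returns the trial point, i*, and the step type."""
    best_val, best_x, i_star = np.inf, None, None
    for i in eps_active_set(psis, center, eps):
        g2_i = grad_psis[i](center)
        y_i, _ = solve_master(bundle1, g2_i, center, mu, lb, ub)
        # optimal value of the subproblem for index i (model of f1, linearized psi_i, prox term)
        val = (max(f1_xj + g1_j @ (y_i - x_j) for (x_j, f1_xj, g1_j) in bundle1)
               - g2_i @ (y_i - center) + 0.5 * mu * np.sum((y_i - center) ** 2))
        if val < best_val:
            best_val, best_x, i_star = val, y_i, i
    f_center = f1(center) - max(psi(center) for psi in psis)
    f_new = f1(best_x) - max(psi(best_x) for psi in psis)
    D = 0.5 * np.sum((best_x - center) ** 2)
    serious = f_new <= f_center - kappa * mu_lb * D         # descent test (12) on f itself
    return best_x, i_star, serious
```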

In order to analyze the convergence of Algorithm 2, we rely on the convergence analysis of Algorithm 1. We start with the following lemma.

Lemma 6

Let \(x^{k+1}\) and \(i^*\) be as defined in Algorithm 2. Then for all \( i \in A_\epsilon (x^{k(\ell )})\)

$$\begin{aligned}&{\check{f}}_1^k(x^{k+1})-\langle \nabla \psi _{i^*}(x^{k(\ell )}),x^{k+1}-x^{k(\ell )}\rangle + \mu _k\, D(x^{k+1},x^{k(\ell )})\nonumber \\&\quad \le \min _{x \in X}f_1(x)-\langle \nabla \psi _i(x^{k(\ell )}),x-x^{k(\ell )}\rangle + \overline{\mu }\, D(x,x^{k(\ell )}) \le f_1(x^{k(\ell )}). \end{aligned}$$
(20)

Proof

It follows from the definition of \(x^{k+1}\) that \({\check{f}}_1^k(x^{k+1})-\langle \nabla \psi _{i^*}(x^{k(\ell )}),x^{k+1}-x^{k(\ell )}\rangle + \mu _k\, D(x^{k+1},x^{k(\ell )}) = \min _{x \in X} {\check{f}}_1^k(x)-\langle \nabla \psi _{i^*}(x^{k(\ell )}),x-x^{k(\ell )}\rangle + \mu _k\, D(x,x^{k(\ell )})\), which is in turn less than or equal to

$$\begin{aligned} {\check{f}}_1^k(y_i)-\langle \nabla \psi _i(x^{k(\ell )}),y_i-x^{k(\ell )}\rangle + \mu _k\, D(y_i,x^{k(\ell )})\quad \forall \, i \in A_\epsilon (x^{k(\ell )}) \end{aligned}$$

due to the definition of \(i^*\) and \(y_i\) given in Algorithm 2. Again, the definition of \(y_i\) and the inequalities \({\check{f}}_1^k(x) \le f_1(x)\) and \(\mu _k\le \overline{\mu }\) ensure the first inequality in (20). The last one follows from feasibility of \(x^{k(\ell )}\). \(\square \)

Theorem 2

Consider Algorithm 2 applied to problem (17) and suppose that \(\delta _{{{\,\mathrm{Tol}\,}}}=0\) and the level set \(\{x\in X: f(x)\le f(x^0)\}\) is bounded. Then any cluster point \({\bar{x}}\) of the sequence of stability centers \(\{x^{k(\ell )}\}_{\ell \in {\mathcal {L}}}\) generated by the algorithm is a d-stationary point, i.e., \({\bar{x}}\) satisfies (2).

Proof

We split the proof into three main parts: the first part considers the case in which Algorithm 2 stops with \(\delta _{{{\,\mathrm{Tol}\,}}}=0\), the second one analyzes the case of infinitely many serious steps, and finally the last part assumes that after a last serious step the algorithm loops forever generating only null steps.

First part Suppose that Algorithm 2 stops at iteration k. Then \(x^{k+1}=x^{k(\ell )}\) and Lemma 6 gives

$$\begin{aligned} {\check{f}}_1^k(x^{k(\ell )})\le & {} \min _{x \in X}f_1(x)-\langle \nabla \psi _i(x^{k(\ell )}),x-x^{k(\ell )}\rangle + \mu _k\, D(x,x^{k(\ell )}) \\\le & {} f_1(x^{k(\ell )}) \, \,\forall i \in A_\epsilon (x^{k(\ell )}). \end{aligned}$$

As \(k(\ell ) \in {\mathcal {B}}_1^k\), then \({\check{f}}_1^k(x^{k(\ell )})=f_1(x^{k(\ell )})\) and the above relation ensures that \(x^{k+1}=x^{k(\ell )}\) solves

$$\begin{aligned}&\min _{x \in X}f_1(x)-\langle \nabla \psi _i(x^{k(\ell )}),x-x^{k(\ell )}\rangle + \mu _k\, D(x,x^{k(\ell )}) \quad \forall i \in A(x^{k(\ell )}) \subset A_\epsilon (x^{k(\ell )}),\;\hbox { i.e.,}\\&0 \in \partial f_1(x^{k+1}) - \nabla \psi _i(x^{k(\ell )}) + \mu _k \left( \nabla \omega (x^{k+1}) - \nabla \omega (x^{k(\ell )}) \right) + N_X(x^{k+1})\quad \forall i \in A(x^{k(\ell )}). \end{aligned}$$

Since \(x^{k+1}=x^{k(\ell )}\), then \(\nabla \psi _i(x^{k(\ell )}) \in \partial f_1(x^{k(\ell )}) + N_X(x^{k(\ell )})\) for all \(i \in A(x^{k(\ell )})\), which by (18) implies that \({\bar{x}}= x^{k(\ell )}\) satisfies (2). In what follows we suppose that the algorithm does not stop.

Second part Let us suppose that Algorithm 2 generates an infinite sequence of serious steps. By summing the inequality \(f(x^{k(\ell +1)}) \le f(x^{k(\ell )})-\kappa {\underline{\mu }}D(x^{k(\ell +1)},x^{k(\ell )})\) over \(\ell \in {\mathcal {L}}=\{0,1,2\ldots \}\) and using the fact that \(\{x\in X: f(x)\le f(x^0)\}\) is a bounded set, we conclude that \(\{x^{k(\ell )}\}_{{\ell \in {\mathcal {L}}}}\) is a bounded sequence. Moreover, \(\sum _{\ell =0}^\infty \Arrowvert x^{k(\ell +1)}- x^{k(\ell )} \Arrowvert _2^2 < \infty \) by (5) and the equivalence of norms in \({\mathbb {R}}^n\). Boundedness of \(\{x^{k(\ell )}\}_{{\ell \in {\mathcal {L}}}}\) ensures that there exists an index set \({\mathcal {L}}' \subset {{\mathcal {L}}}\) such that the sequence \(\{x^{k(\ell )}\}_{{\ell \in {\mathcal {L}}'}}\) converges to a point \({\bar{x}} \in X\). Continuity of \(f_2\) and the assumption that \(\epsilon >0\) yield the inclusion \(A({\bar{x}}) \subset A_\epsilon (x^{k(\ell )})\) for all \(\ell \in {\mathcal {L}}'\) large enough. In what follows, let \(i(\ell )=k(\ell +1)-1\). The definition of \(x^{k(\ell +1)}\) (\(=x^{k+1}= y_{i^*})\) and (20) give

$$\begin{aligned}&{\check{f}}_1^{i(\ell )}(x^{k(\ell +1)})-\langle \nabla \psi _{i^*}(x^{k(\ell )}),x^{k(\ell +1)}-x^{k(\ell )}\rangle + \mu _{i(\ell )}\, D(x^{k(\ell +1)},x^{k(\ell )}) \nonumber \\&\quad \le f_1(x)-\langle \nabla \psi _i(x^{k(\ell )}),x-x^{k(\ell )}\rangle + \overline{\mu }\, D(x,x^{k(\ell )}) \end{aligned}$$
(21)

for all \(x \in X\), \(i \in A({\bar{x}})\) and \(\ell \in {\mathcal {L}}'\) large enough. Algorithm 2 keeps in the bundle the index \(k(\ell )\) of the last serious step. As a result,

$$\begin{aligned} {\check{f}}_1^{i(\ell )}(x^{k(\ell +1)}) = \max _{j \in {\mathcal {B}}_1^{i(\ell )}} \{f_1(x^j) + \langle g_1^{j},x^{k(\ell +1)}-x^j \rangle \}\ge f_1(x^{k(\ell )}) + \langle g_1^{k(\ell )},x^{k(\ell +1)}-x^{k(\ell )}\rangle . \end{aligned}$$

By the Cauchy–Schwarz inequality (and remembering that \(\{x^{k(\ell )}\}_{{\ell \in {\mathcal {L}}}}\) is bounded and \(f_1\) is convex) we get \({\check{f}}_1^{i(\ell )}(x^{k(\ell +1)}) \ge f_1(x^{k(\ell )}) - \Arrowvert g_1^{k(\ell )} \Arrowvert _2\Arrowvert x^{k(\ell +1)}-x^{k(\ell )} \Arrowvert _2\ge f_1(x^{k(\ell )}) - L\Arrowvert x^{k(\ell +1)}-x^{k(\ell )} \Arrowvert _2\). Combining this inequality with (21) yields

$$\begin{aligned}&f_1(x^{k(\ell )}) - L\Arrowvert x^{k(\ell +1)}-x^{k(\ell )} \Arrowvert _2 -\langle \nabla \psi _{i^*}(x^{k(\ell )}),x^{k(\ell +1)}-x^{k(\ell )}\rangle + \mu _{i(\ell )}\, D(x^{k(\ell +1)},x^{k(\ell )}) \nonumber \\&\quad \le f_1(x)-\langle \nabla \psi _i(x^{k(\ell )}),x-x^{k(\ell )}\rangle + \overline{\mu }\, D(x,x^{k(\ell )}) \end{aligned}$$
(22)

for all \(x\in X,\, i \in A({\bar{x}})\) and \(\ell \in {\mathcal {L}}'\) large enough. By going to the limit with \(\ell \in {\mathcal {L}}'\) tending to infinity (and recalling that \(\lim _{\ell \rightarrow \infty }\Arrowvert x^{k(\ell +1)}-x^{k(\ell )} \Arrowvert _2=0\) because \(\sum _{\ell =0}^\infty \Arrowvert x^{k(\ell +1)}- x^{k(\ell )} \Arrowvert _2^2 < \infty \)) we obtain

$$\begin{aligned} f_1({\bar{x}})\le f_1(x)-\langle \nabla \psi _i({\bar{x}}),x-{\bar{x}}\rangle + \overline{\mu }\, D(x,{\bar{x}})\,, \quad \quad \forall \; x\in X,\; \forall i \in A({\bar{x}}). \end{aligned}$$

This shows that \({\bar{x}}\) solves all the \(|A({\bar{x}})|\) subproblems \( \min _{x \in X} \;f_1(x)-\langle \nabla \psi _i({\bar{x}}),x-{\bar{x}}\rangle + \overline{\mu }\, D(x,{\bar{x}}) \), i.e., \(\nabla \psi _i({\bar{x}}) \in \partial f_1({\bar{x}}) +N_X({\bar{x}})\) for all \(i \in A({\bar{x}})\), thus implying that \({\bar{x}}\) is a d-stationary point of (17).

Third part Let us consider the case of finitely many serious steps. Accordingly, there exists a last stability center \({\hat{x}}=x^{k({\hat{\ell }})}\) and \(f({\hat{x}})< f(x^{k+1}) + \kappa \,{\underline{\mu }}\, D(x^{k+1},{\hat{x}})\) for all \(k \ge k({\hat{\ell }})\). As mentioned right before Lemma 6 on page 13, this implies that a point \(x^{k+1}\) for \(k\ge k({\hat{\ell }})\) does not satisfy test (13) with \(g_2^{k({\hat{\ell }})}= \nabla \psi _i({\hat{x}})\) for any \(i \in A({\hat{x}})\) [although (13) may hold for some \(g_2^{k({\hat{\ell }})}= \nabla \psi _j({\hat{x}})\) with \(j \in A_\epsilon ({\hat{x}})\backslash A({\hat{x}})\)]. Hence, by seeing the null iterates \(x^{k+1}=y_{i^*}\) with \(i^* \in A_\epsilon ({\hat{x}})\backslash A({\hat{x}})\) as mere points enriching the bundle \({\mathcal {B}}_1^k\), Lemma 8 and Proposition 3 apply: \(\lim _{k\rightarrow \infty } [f_1(x^{k+1})- {\check{f}}_1^k(x^{k+1})]=0 \), \(\lim _{k\rightarrow \infty } x^{k+1}= {\hat{x}}\) and \({\hat{x}}\) is a critical point. It remains to show that \({\hat{x}}\) is indeed a d-stationary point of (17). To this end, consider the following inequality extracted from (20):

$$\begin{aligned}&{\check{f}}_1^{k}(x^{k+1})-\langle \nabla \psi _{i^*}({\hat{x}}),x^{k+1}-{\hat{x}}\rangle + \mu _{k}\, D(x^{k+1},{\hat{x}})\\&\quad \le f_1(x)-\langle \nabla \psi _i({\hat{x}}),x-{\hat{x}}\rangle + \overline{\mu }\, D(x,{\hat{x}})\,, \end{aligned}$$

for all \(x \in X\) and all \(i \in A({\hat{x}})\). By passing to the limit with k tending to infinity we obtain

$$\begin{aligned} f_1({\hat{x}})= & {} \lim _{k\rightarrow \infty } [{\check{f}}_1^{k}(x^{k+1})-\langle \nabla \psi _{i^*}({\hat{x}}),x^{k+1}-{\hat{x}}\rangle + \mu _{k}\, D(x^{k+1},{\hat{x}})]\\\le & {} f_1(x)-\langle \nabla \psi _i({\hat{x}}),x-{\hat{x}}\rangle + \overline{\mu }\, D(x,{\hat{x}}) \end{aligned}$$

for all \( x \in X\) and \(i \in A({\hat{x}})\). This shows that \({\hat{x}}\) solves all the \(|A({\hat{x}})|\) subproblems

$$\begin{aligned} \min _{x \in X} \;f_1(x)-\langle \nabla \psi _i({\hat{x}}),x-{\hat{x}}\rangle + \overline{\mu }\, D(x,{\hat{x}})\quad {\hbox {for }i \in A({\hat{x}}),} \end{aligned}$$

thus implying that \({\bar{x}}={\hat{x}}\) is a d-stationary point of (17) (see (18)).

In all the cases, if \(\delta _{{{\,\mathrm{Tol}\,}}}>0\) the algorithm stops after finitely many steps. \(\square \)

6 Numerical experiments

We assess the numerical performance of Algorithms 1 and 2 on some academic DC problems. For comparison purposes, we consider two variants of Algorithm 1 with the Bregman function \(D(x,y)=\Arrowvert x-y \Arrowvert _2^2/2\), two implementations of the classic DC algorithm (DCA) of [42], one implementation of the proximal linearized algorithm [41, Algorithm 1], the DC proximal bundle method of [27], and an algorithm for nonsmooth and nonconvex optimization [35]. We also compare the computational behavior of Algorithm 1 against the results reported in the paper [16]. The considered solvers are as follows:

  • PBM1: Algorithm 1.

  • PBM-d: Algorithm 2.

  • PBM3: Algorithm 1 with subproblem (9) replaced with the DC subproblem (16), where the model for \(f_2\) has three pieces: three linearizations computed with the last three stability centers.

  • DCA-CPM: this is an implementation of the DCA algorithm of [42], with trial points defined as

    $$\begin{aligned} x^{k+1}\in \arg \min _{x \in X} f_1(x)-\langle g_2^k,x\rangle \quad {\hbox { with}\ g_2^k \in \partial f_2(x^k)}. \end{aligned}$$
    (23)

    The convex subproblem (23) is solved by an implementation of Kelley's cutting-plane method [28]. We have made use of warm starts, meaning that when a new subgradient \(g_2^{k+1}\in \partial f_2(x^{k+1})\) is computed the solver re-uses the cutting-plane model for \(f_1\) constructed at the previous iteration. A minimal sketch of the outer DCA loop is given right after this solver list.

  • DCA-LBM: as DCA-CPM, but with the convex subproblem (23) solved by an implementation of the level bundle method of [34].

  • PLM: this is an implementation of the proximal linearized algorithm in [41, Algorithm 1], with trial points defined as in (6). The convex subproblem (6) is solved by replacing \(f_1\) with its cutting-plane approximation, which is iteratively improved in the same way as in the classical proximal bundle method. As in DCA-CPM, we have employed warm starts.

  • HANSO: Hybrid Algorithm for Non-Smooth Optimization [35]. This is a matlab package based on the BFGS and gradient sampling methods for nonsmooth and nonconvex unconstrained optimization problems. The package (version 2.2) is freely available from its developers at the link: www.cs.nyu.edu/overton/software/hanso.

  • PBDC: the proximal bundle method of [27], whose Fortran code is freely available from its developers at the link: http://napsu.karmitsa.fi/pbdc/.
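To fix ideas, the following minimal matlab sketch outlines the outer loop (23) shared by DCA-CPM and DCA-LBM; the routines solve_subproblem and oracle_f2 are hypothetical placeholders for, respectively, the inner solver of (23) (cutting plane or level bundle, possibly warm-started) and an oracle returning a subgradient of \(f_2\).

% Minimal sketch of the outer DCA loop (23); solve_subproblem and
% oracle_f2 are hypothetical user-supplied routines.
function x = dca_sketch(x0, solve_subproblem, oracle_f2, tol, maxit)
  x = x0;
  for k = 1:maxit
    g2   = oracle_f2(x);             % g2 in the subdifferential of f2 at x^k
    xnew = solve_subproblem(g2);     % argmin_{x in X} f1(x) - <g2, x>
    if norm(xnew - x) <= tol         % stopping test employed in Sect. 6
      x = xnew;  return
    end
    x = xnew;
  end
end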

Except for solvers HANSO and PBDC, which are open-source packages, we have implemented the considered solvers in matlab (version 2017a) using the Gurobi (version 7.5.1, www.gurobi.com) solver for LP, QP and quadratically-constrained problems. Solver HANSO was employed with its default parameters, except for the memory of its BFGS algorithm, which was set to 50 for problems whose dimension is greater than or equal to 1000. Solver PBDC was used with its default parameters, except for its stopping-test tolerance (user_crit_tol), which was increased to \(0.06\cdot n\). The other solvers employ the stopping test \(\Arrowvert x^{k+1}- x^{k(\ell )} \Arrowvert _2 \le \delta _{{{\,\mathrm{Tol}\,}}}\), with tolerance \(\delta _{{{\,\mathrm{Tol}\,}}}=10^{-4}\). In our implementation, the solvers PBM1 and PBM3 consider the test (12) with \(\kappa =0.1\) and \({\underline{\mu }}= 10^{-5}\) for updating stability centers. This implies that the number of evaluations of \(f_1\) is the same as that of \(f_2\). The maximum size of the bundle of information for solvers PBM1, PBM3 and PBM-d was set to \(\max \{100,\,\min \{n+5,1000\}\}\). Moreover, only active indices were kept in the bundle after a serious step. The matlab codes as well as the considered test problems are freely available at the link: http://www.oliveira.mat.br/solvers.

Numerical experiments were performed on a computer with an Intel(R) Core(TM) i7-6820HQ CPU @ 2.70 GHz and 32 GB of RAM, under Windows 7 (64 bits). The Fortran code of PBDC was compiled and executed in a virtual machine running Linux Ubuntu. The total CPU time allowed for each solver on each test problem was set to 3600 s.

For the sake of comparison, we also present some results from the paper [16]. Since we have not run the solver DCPCA, it does not make sense to compare CPU times. However, the solution quality and the number of subgradient evaluations can be compared, keeping in mind that DCPCA, PBDC and the matlab solvers employ different stopping tests.

6.1 Unconstrained DC programs

We consider all the unconstrained DC programs reported in [2, 16, 27]. Their formulation, optimal values and initial points are given in [27].

Since the inner algorithms employed by solvers DCA-CPM and DCA-LBM require the feasible set to be compact, we have added the bounds \(-100\le x\le 100\) to the test problems when running these two solvers (the other solvers did not consider this change).

Table 1 reports numerical experiments on several instances of the considered unconstrained DC problems. The first column indicates the problem, the second states the dimension of the problem and the third column reports the optimal value. The remaining columns present the obtained function value \(f(x^{k(\ell )})\), number of subgradient evaluations for DC components and CPU time (in seconds) for all the considered solvers. We emphasize that:

  • PBM1 and PBM3 The number of subgradient evaluations \(\# g_1\) is equal to the number of function evaluations of \(f_1\). It also coincides with the number of iterations performed by the algorithm. The number of subgradient evaluations \(\# g_2\) of the second DC component coincides with the number of serious steps of the algorithm.

  • DCA-CPM, DCA-LBM and PLM The number of subgradient evaluations \(\# g_1\) is equal to the number of function evaluations of \(f_1\). Moreover, \(\# g_2\) coincides with the number of iterations.

  • HANSO The number of subgradient evaluations is equal to the number of function evaluations. The solver does not exploit the DC decomposition of the objective function.

  • PBDC The number of function \(f\) evaluations can be slightly larger than \(\# g_1\).

  • DCPCA The number of evaluations of \(f_1\) is greater than the number of subgradient evaluations due to the line-search.

Table 1 Unconstrained DC programs

Notice that the solution quality of solvers PBM1 and PBM3 is comparable to that of solvers PBDC and DCPCA. These four DC bundle solvers provided more precise solutions than the solvers DCA-CPM, DCA-LBM, PLM and the nonconvex solver HANSO. Moreover, the bundle solvers were successful in computing a critical point of problem (1) in all instances (except solver PBDC, which exceeded the CPU time limit when dealing with problem 4, \(n=750\)). Except for problem 10 with \({n=5}\), solvers PBM1, PBM3 and DCPCA found the best known function values of the considered problems.

Solver HANSO (for general nonconvex optimization problems) was the only solver that could globally solve problem 3 with \(n=5\). Overall, this solver performed a larger number of function/subgradient evaluations than the other solvers, which exploited the DC structure of the objective function. Solver DCPCA required fewer subgradient evaluations of \(f_1\) than solver PBM1 did; however, the former needed more subgradient evaluations of \(f_2\).

Notice also that when employing a convex cutting-plane model (solver PBM1) the total number of iterations and the computational burden do not necessarily increase when compared to PBM3, PBDC and DCPCA. Moreover, the quality of the final solution candidates does not deteriorate.

We summarize Table 1 in Fig. 1 by employing performance profiles [11]. For example, let the criterion be CPU time. For each solver, we plot the proportion of problems that are solved within a factor \(\gamma \) of the time required by the best algorithm. More specifically, denoting by \(t_s(p)\) the time spent by solver s to solve problem p and by \(t^*(p)\) the best time for the same problem among all the solvers, the proportion of problems solved by s within a factor \(\gamma \) is

$$\begin{aligned} \rho _s(\gamma ) := \displaystyle \frac{\text {number of problems } p \text { such that } t_s(p) \le \gamma \, t^*(p)}{\text {total number of problems}} . \end{aligned}$$

Therefore, the value \(\rho _s(1)\) gives the probability that solver s is the best with respect to the given criterion. Furthermore, unless \(t_s(p)=\infty \) (which means that solver s failed to solve problem p), it holds that \(\lim _{\gamma \rightarrow \infty }\rho _s(\gamma )=1\). Thus, the higher the line, the better the solver (by this criterion).
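For completeness, the profiles can be computed with a few lines of matlab. In the sketch below, T is a hypothetical nprob-by-nsolver matrix with T(p,s) the value of the chosen criterion for solver s on problem p (set to Inf when s fails on p), and gammas is a vector of factors \(\gamma \).

% Minimal sketch of the performance-profile computation.
function rho = performance_profile(T, gammas)
  tbest = min(T, [], 2);                           % best value per problem
  rho   = zeros(numel(gammas), size(T, 2));
  for k = 1:numel(gammas)
    rho(k, :) = sum(T <= gammas(k) * tbest, 1) / size(T, 1);
  end
end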

Fig. 1 Performance profile of the results presented in Table 1

Figure 1 shows that the four proximal bundle solvers perform better than the other solvers in terms of both the number of subgradient evaluations of \(f_1\) and robustness. However, in terms of subgradient evaluations of \(f_2\) the DCA solvers DCA-CPM and DCA-LBM performed better. Overall, solver PBM1 presents a good compromise between CPU time and robustness.

6.2 Linearly-constrained DC programs

In this subsection, we consider two convex-constrained DC programs obtained by approximating chance-constrained problems of the form

$$\begin{aligned} \left\{ \begin{array}{llll} \min \nolimits _x &{} \langle q, x\rangle \\ \hbox {s.t.} &{} Ax= b\\ &{} {\mathbb {P}}[c(x,\xi )\le 0]\ge p\\ &{} {\underline{x}} \le x \le {\overline{x}}\,, \end{array}\right. \end{aligned}$$
(24)

where \(\xi \in \varXi \subset {\mathbb {R}}^m\) is a random vector having probability measure \({\mathbb {P}}\), \(A\in {\mathbb {R}}^{s\times n}\) and \(b\in {\mathbb {R}}^s\). Parameter \(p \in (0,1)\) is a confidence level and \(c:{\mathbb {R}}^n\times \varXi \rightarrow {\mathbb {R}}\) is a given DC function (not necessarily differentiable). It is well known that the above problem may fail to be convex even when \(c(\cdot ,\cdot )\) is convex in both arguments. Moreover, evaluating the probability function at a given point involves computing a multidimensional integral. This is a difficult task when the random vector has large dimension [47], and therefore Monte-Carlo simulation is an important alternative to approximate \({\mathbb {P}}\). We refer to [19, 39, 44, 46] for more details on chance-constrained programming.

As discussed in [25], if \(c(\cdot ,\xi )\) is a DC function the probability constraint \({\mathbb {P}}[c(x,\xi )\le 0]\ge p\) can be approximated by the DC constraint \({\mathbb {E}}[c(x,\xi )+t]^+ - {\mathbb {E}}[c(x,\xi )]^+ \le t (1-p)\), where \(t\approx 0\) is a positive parameter, \([a]^+:=\max \{a,0\}\) and \({\mathbb {E}}[\cdot ]\) is the expected value operator w.r.t. \({\mathbb {P}}\). By penalizing this constraint with a penalty parameter \(\rho >0\) we get the following approximation of (24):

$$\begin{aligned} \left\{ \begin{array}{llll} \displaystyle \min _{x} &{}\langle {q}, x\rangle + \rho \Big [{\mathbb {E}}[c(x,\xi )+t]^+ - {\mathbb {E}}[c(x,\xi )]^+ - t (1-p)\Big ]^+\\ \hbox {s.t.} &{} Ax= b\,\\ &{}{\underline{x}} \le x \le {\overline{x}}, \end{array}\right. \end{aligned}$$
(25)

that can be written as a DC program

$$\begin{aligned} \left\{ \begin{array}{llll} \displaystyle \min _{x} &{}f_1(x)-f_2(x)\\ \hbox {s.t.} &{} Ax= b\\ &{} {\underline{x}} \le x \le {\overline{x}}, \end{array}\right. \end{aligned}$$

with \(f_1(x)=\langle q, x\rangle + \rho \max \left( {\mathbb {E}}[c(x,\xi )+t]^+ - t (1-p),\, {\mathbb {E}}[c(x,\xi )]^+\right) \) and \(f_2(x)= \rho {\mathbb {E}}[c(x,\xi )]^+\). The expectations \({\mathbb {E}}[c(x,\xi )+t]^+\) and \({\mathbb {E}}[c(x,\xi )]^+\) can be approximated by Monte-Carlo simulation [25]: in our numerical experiments we have used a fixed sample of 10,000 scenarios randomly generated according to the distribution of \(\xi \).
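The sample-average evaluation of the two DC components of (25) is straightforward; the matlab sketch below assumes a user-supplied routine cfun(x, xi) returning \(c(x,\xi )\) and an N-by-m matrix Xi of sampled scenarios (both hypothetical names).

% Sketch of the sample-average oracles for the DC components of (25).
function [f1val, f2val] = dc_components(x, q, rho, t, p, Xi, cfun)
  N  = size(Xi, 1);
  cx = zeros(N, 1);
  for s = 1:N
    cx(s) = cfun(x, Xi(s, :));          % c(x, xi^s)
  end
  Eplus = mean(max(cx + t, 0));         % approximates E[c(x,xi)+t]^+
  Ezero = mean(max(cx, 0));             % approximates E[c(x,xi)]^+
  f1val = q' * x + rho * max(Eplus - t * (1 - p), Ezero);
  f2val = rho * Ezero;
end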

6.2.1 PlanToy: an academic chance-constrained planning problem

We consider a small example of a management problem coming from the energy industry. The problem consists of planning two fictitious refineries producing two types of fuel to meet a demand that is deterministic in the first month of planning, but uncertain in the second one. The decision maker wishes to make an optimal decision on the amount of petrol processed by the two refineries (\(x_1\) and \(x_2\) for the first month, and \(x_5\) and \(x_6\) for the second one), the amount of stored petrol (\(x_3\) and \(x_7\)), and the amount of imported petrol (\(x_4\) and \(x_8\)). The storage and importation decisions must ensure that the second-month demand is satisfied with probability p. The fictitious planning problem (PlanToy) reads as

$$\begin{aligned} \left\{ \begin{array}{llll} \displaystyle \min _{x \ge 0} &{} 2x_1 + 3 x_2 +0.5x_3 + 12x_4 +2x_5 + 3x_6 +0.5x_7 + 12.5x_8 \\ \hbox {s.t.} &{} 2x_1 + 6x_2 = 190 \\ &{} 3x_1+2.8x_2 = 168 \\ &{} x_1+x_2+x_3 -x_4= 60 \\ &{} -x_3+ x_5+x_6+x_7 -x_8= 47.16 \\ &{} {\mathbb {P}}[c(x,\xi )\le 0]\ge p\\ &{} x_3 \le 10\,, x_7 \le 10\,, \end{array}\right. \end{aligned}$$
(26)

where \(c(x,\xi )=\max \{\xi _1 - 2x_5 -6x_6,\, \xi _2 - 3x_5 -2.8x_6\}\), and \(\xi =(\xi _1,\xi _2)\) is a random vector (of fuel demand) following a normal distribution with mean \({\mathbb {E}}[\xi ]=(193, 178)\) and covariance matrix

$$\begin{aligned} C= \left( \begin{array}{ll} 9 &{}\quad \mathtt{Cov}(\xi _1,\xi _2)\\ \mathtt{Cov}(\xi _2,\xi _1) &{}\quad 10.24\\ \end{array}\right) . \end{aligned}$$

Under this assumption the probability constraint \({\mathbb {P}}[c(x,\xi )\le 0]\ge p\) can be replaced with the convex one \(\phi (x)\le 0\), with \(\phi (x):= \log (p) - \log ({\mathbb {P}}[c(x,\xi )\le 0])\) [39]. For comparison purposes, the resulting convex optimization problem was solved by using the matlab optimization routine fmincon. We employed the matlab statistical function mvncdf to evaluate the probability function:

$$\begin{aligned} {\mathbb {P}}[c(x,\xi )\le 0] \quad = \quad {\mathbb {P}}\begin{bmatrix} \!\!\!\!\xi _1 \le 2x_5 +6x_6\\ \xi _2 \le 3x_5 +2.8x_6 \end{bmatrix}. \end{aligned}$$
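For a candidate decision x and a covariance coefficient Cov of the instance at hand, this evaluation can be performed, for instance, with the following call (a sketch relying on the Statistics and Machine Learning Toolbox):

mu    = [193 178];
Sigma = [9 Cov; Cov 10.24];
% joint probability P[xi1 <= 2*x5 + 6*x6, xi2 <= 3*x5 + 2.8*x6]
prob  = mvncdf([2*x(5) + 6*x(6), 3*x(5) + 2.8*x(6)], mu, Sigma);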

Such a convex reformulation is possible thanks to the assumption on the probability distribution of the random vector \(\xi \). For more general (and possibly more realistic) distributions, PlanToy cannot be recast into a convex formulation. We refer the interested reader to the textbook [39] for a comprehensive discussion on convexity of chance-constrained programs.

Regardless of convexity, evaluating the probability function is a difficult task in general. Therefore, the Monte-Carlo approach and the DC approximation discussed above become attractive in the large-scale setting.

By varying the confidence level \(p \in \{0.5,\,0.6,\,0.7,\,0.8,\,0.9,\, 0.95\}\) and the covariance coefficient \(\mathtt{Cov}:=\mathtt{Cov}(\xi _1,\xi _2) \in \{-4.8, \, 0,\, 4.8 \}\) we ended up with 18 variants of PlanToy. Table 2 reports the results obtained by applying the solvers PBM1, PBM3 and PLM to these instances. Like PLM, both solvers DCA-CPM and DCA-LBM had a similarly unsatisfactory performance on these instances. Solvers HANSO and PBDC were not applied because they do not handle constrained problems.

As starting points for these solvers, we have considered the solution of the simpler individual chance-constrained program obtained from (26) by replacing the joint probability constraint \({\mathbb {P}}[c(x,\xi )\le 0] \ge p\) with the individual ones \({\mathbb {P}}_1[\xi _1 \le 2x_5 +6x_6]\ge p\) and \({\mathbb {P}}_2[\xi _2 \le 3x_5 +2.8x_6]\ge p\), where \(\xi _1\sim N(193,9)\) and \(\xi _2\sim N(178,10.24)\) due to the assumptions on the joint distribution of \(\xi =(\xi _1,\xi _2)\). By making use of p-quantiles (which can be easily computed by standard statistical software), the above constraints are in fact linear ones:

$$\begin{aligned} 2x_5 +6x_6\ge {\mathbb {P}}_1^{-1}[p] \quad \hbox {and}\quad 3x_5 +2.8x_6 \ge {\mathbb {P}}_2^{-1}[p]. \end{aligned}$$

Therefore, problem (26) with \({\mathbb {P}}[c(x,\xi )\le 0] \ge p\) replaced with these two linear constraints is a mere linear programming problem approximating the nonlinear one (26) (see [39] for more details). This is why we consider such an approximation only for computing an initial point for our solvers.
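For illustration, the p-quantiles defining these linear constraints can be computed as follows (a sketch; recall that \(N(\cdot ,\cdot )\) above is parameterized by the variance, while matlab's norminv expects the standard deviation):

q1 = norminv(p, 193, sqrt(9));       % = P1^{-1}[p]
q2 = norminv(p, 178, sqrt(10.24));   % = P2^{-1}[p]
% the starting point x^0 is then the solution of the LP obtained from (26)
% by imposing 2*x(5) + 6*x(6) >= q1 and 3*x(5) + 2.8*x(6) >= q2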

Table 2 Linearly-constrained DC problem: numerical results on 18 instances of the PlanToy problem. All the solvers were initialized with the same starting point

Solver PLM failed to globally solve all the considered instances of PlanToy: the method stopped at a critical point of the penalized DC program (25) that is not even feasible. On the other hand, the proximal bundle solvers were successful.

We care to mention that the obtained function value has three sources of inaccuracy: (1) the approximation of the probability distribution by a sample of 10,000 scenarios, (2) the approximation of the characteristic function \(\mathbf{1}_{[0,\infty )}(z)\) by \(([z+t]^+ - [z]^+)/t\), and (3) the solvers' tolerances. As a result, one cannot expect the obtained function values to coincide with the (approximate) optimal value \({\bar{f}}\). Nevertheless, the function values computed by solvers PBM1 and PBM3 are very close to the optimal values and the obtained solutions are (nearly) feasible: columns 7–9 of Table 2 report the probability of the computed point \({\bar{x}}\) satisfying the random constraint \(c(x,\xi )\le 0\).

Once again, PBM1 (with a convex model) was as precise as PBM3 (with a DC model) but faster than the latter on these instances of PlanToy.

6.2.2 A norm optimization problem with chance constraints

We now consider a DC approximation of the chance-constrained program

$$\begin{aligned} \left\{ \begin{array}{llll} \min \nolimits _{x\in {\mathbb {R}}^n_+} &{} -\Arrowvert x \Arrowvert _1 \\ \hbox {s.t.} &{} {\mathbb {P}}[\Arrowvert \xi \, x \Arrowvert _2 \le 10]\ge p \end{array}\right. \end{aligned}$$
(27)

given in [25, Sect. 5.1]. This problem fits (24) with \(c(x,\xi )=\max _{i=1,\ldots ,10}\{\sum _{j=1}^n\xi _{ij}^2x_j^2 -100\}\), where \(\xi _{ij}\) (\(i=1,\ldots ,10\) and \(j=1,\ldots ,n\)) are independent and identically distributed standard normal random variables. The DC reformulation of this problem is

$$\begin{aligned} \min _{x\in {\mathbb {R}}^n_+} f_1(x) -f_2(x), \end{aligned}$$

with \(f_1(x)=-\sum _{i=1}^n x_i +\rho \max ({\mathbb {E}}[c(x,\xi )+t]^+ - t(1-p),\, {\mathbb {E}}[c(x,\xi )]^+)\) and \(f_2(x)= \rho \,{\mathbb {E}}[c(x,\xi )]^+ \).

In order to approximate the expectation \({\mathbb {E}}\), we have used Monte-Carlo simulation with \(N=10{,}000\) scenarios. As discussed in [25], when \(\xi _{ij}\) are independent and identically distributed standard normal random variables the global solution and optimal value of (27) are, respectively,

$$\begin{aligned} {\bar{x}}_i = \frac{10}{\sqrt{F_{\chi _n^2}^{-1}(p^{1/10})}},\quad i=1,\ldots ,n, \quad \hbox {and}\quad {\bar{f}} = -\frac{10\,n}{\sqrt{F_{\chi _n^2}^{-1}(p^{1/10})}} , \end{aligned}$$

where \(F_{\chi _n^2}^{-1}\) denotes the inverse distribution function of a Chi-square distribution with n degrees of freedom. Table 3 reports some results obtained by applying the six solvers to 18 instances of (27).
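For reference, the closed-form solution above can be evaluated with a couple of matlab commands (a sketch, with n and p the dimension and confidence level of the instance and chi2inv from the Statistics and Machine Learning Toolbox):

xbar = (10 / sqrt(chi2inv(p^(1/10), n))) * ones(n, 1);
fbar = -sum(xbar);      % optimal value of (27) attained at xbar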

Table 3 Linearly-constrained DC problem: numerical results on 18 instances of the norm optimization problem with chance constraint. All the solvers were initialized with the same starting points, chosen randomly from \([0,\,0.2]^n\)

Since computing the probability \({\mathbb {P}}[c(x,\xi )\le 0]\) exactly is a difficult task, we do not report in Table 3 the probability value \({\mathbb {P}}[c({\bar{x}},\xi )\le 0]\) at the obtained candidate solutions. However, we estimated this probability by the DC approximation mentioned above. We observed that in all the considered instances the value \(({\mathbb {E}}[c({\bar{x}},\xi )+t]^+- {\mathbb {E}}[c({\bar{x}},\xi )]^+)/{t}\) coincides with the given confidence level p for the solvers PBM1 and PBM3. This ensures that the provided candidate solutions are feasible for the DC formulation above and, therefore, nearly feasible for the considered norm optimization problem with chance constraints.

Once again, we cannot expect the obtained function values to coincide with the optimal value \({\bar{f}}\) due to the reasons previously mentioned. However, we can see from Table 3 that the bundle solvers provided good estimates of the optimal values. Solvers DCA-LBM and PLM computed critical points for all problem instances; however, they failed to compute global solutions. In order to apply HANSO, we dropped the (inactive) constraint \(x\ge 0\). With this modification, HANSO could solve to optimality only the smaller instances and find a critical point for the medium-size ones, but could not solve the larger instances within the time limit of 1 h. The reason is that HANSO requires many function evaluations: the considered function requires Monte-Carlo simulation and is therefore expensive to evaluate. Table 4 reports the number of subgradient evaluations.

Table 4 Norm optimization problem with chance constraint: number of subgradient evaluations

6.3 Quadratically-constrained DC programs

We now consider the following quadratically-constrained DC program (QCDC)

$$\begin{aligned} {\bar{f}}:= \left\{ \begin{array}{llll} \displaystyle \min _{x\in {\mathbb {R}}^n} &{} \displaystyle \min _{j\in \{1,\ldots ,m\}} \left\{ \frac{1}{2}\Arrowvert x-c^j \Arrowvert _2^2\right\} \\ \hbox {s.t.} &{} \frac{1}{2}\sum \nolimits _{i=1}^n \alpha _i x_i^2 \le \frac{1}{2}K^2\,, \end{array} \right. \end{aligned}$$
(28)

with given data \(K \in {\mathbb {R}}_+\), \(\alpha \in {\mathbb {R}}^n_+\) and \(c^j \in {\mathbb {R}}^n\), \(j=1,\ldots ,m\). The optimal value (and a solution) of the above QCDC can be found by individually solving the m quadratically-constrained quadratic programs (QCQPs)

$$\begin{aligned} v_j :=\left\{ \begin{array}{llll} \displaystyle \min _{x\in {\mathbb {R}}^n} &{} \frac{1}{2}\Arrowvert x-c^j \Arrowvert _2^2\\ \hbox {s.t.} &{} \frac{1}{2}\sum \nolimits _{i=1}^n \alpha _i x_i^2 \le \frac{1}{2}K^2\,, \end{array} \right. \end{aligned}$$

and taking \({\bar{f}} = \displaystyle \min _{j\in \{1,\ldots ,m\}} v_j\). Problem (28) has the following DC representation

$$\begin{aligned} \left\{ \begin{array}{llll} \displaystyle \min _{x\in {\mathbb {R}}^n} &{}f_1(x)-f_2(x)\\ \hbox {s.t.} &{} \frac{1}{2}\sum \nolimits _{i=1}^n \alpha _i x_i^2 \le \frac{1}{2}K^2\,, \end{array} \right. \end{aligned}$$

with

$$\begin{aligned} \displaystyle f_1(x):=\frac{1}{2}\sum _{j=1}^m \Arrowvert x-c^j \Arrowvert _2^2 \quad \hbox {and} \quad f_2(x):=\frac{1}{2}\max _{l\in \{1,\ldots ,m\}}\left\{ \sum _{j\ne l} \Arrowvert x-c^j \Arrowvert _2^2\right\} . \end{aligned}$$
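A minimal sketch of the corresponding oracles is given below, where C is the n-by-m matrix whose columns are the centers \(c^j\) (implicit expansion, available since matlab R2016b, is used):

sq = @(x, C) sum((x - C).^2, 1);                    % 1-by-m vector of ||x - c^j||_2^2
f1 = @(x, C) 0.5 * sum(sq(x, C));
f2 = @(x, C) 0.5 * (sum(sq(x, C)) - min(sq(x, C))); % = 0.5*max_l sum_{j~=l} ||x - c^j||_2^2
% hence f1(x,C) - f2(x,C) = 0.5*min_j ||x - c^j||_2^2, the objective of (28)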

Several instances of problem (28) were generated by the following scheme: we set \(K=10\), \(m=5\), and \(\alpha _i\) was drawn randomly and uniformly from [0, 1] for all \(i=1,\ldots ,n\). Vectors \(c^j\), \(j=1,\ldots ,m\), were constructed in such a way that no \(c^j\) is feasible for (28):

$$\begin{aligned} c^j:= (1+{s_j})K \frac{{\tilde{c}}^j}{\sqrt{\sum _{i=1}^n \alpha _i {({\tilde{c}}_i^j)}^2}}, \end{aligned}$$

where \({\tilde{c}}_i^j\), \(i=1,\ldots ,n\), \(j=1,\ldots ,m\), were randomly generated following a standard normal distribution and \(s_j\) was drawn uniformly from the set \( \{1,2,3,4,5\}\).
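A sketch of this instance generator is as follows (n is a hypothetical dimension):

n = 100;  m = 5;  K = 10;
alpha = rand(n, 1);                         % alpha_i drawn uniformly from [0,1]
C = zeros(n, m);
for j = 1:m
  ctil    = randn(n, 1);                    % tilde{c}^j standard normal
  sj      = randi(5);                       % s_j uniform on {1,...,5}
  C(:, j) = (1 + sj) * K * ctil / sqrt(sum(alpha .* ctil.^2));
end
% each column satisfies 0.5*sum(alpha.*C(:,j).^2) = 0.5*((1+sj)*K)^2 > 0.5*K^2,
% so no center c^j is feasible for (28)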

For the QCDC problem (28), the proximal bundle method subproblem (9) becomes a QCQP (once again with the choice \(D(x,{\hat{x}}):=\Arrowvert x-{\hat{x}} \Arrowvert _2^2/2\)) that can be solved by specialized algorithms. In this study we applied Gurobi. Table 5 reports some numerical results obtained by applying solvers PBM1, PBM3, DCA-CPM and PLM to some instances of (28).

Table 5 Quadratically constrained DC programs: function value, subgradient evaluations and CPU time

Solver DCA-CPM (respectively PLM) solves # \(g_2\) linear (respectively quadratic) optimization subproblems with quadratic constraints, one per iteration. Solver PBM1 (respectively PBM3) requires solving # \(g_1\) (respectively three times # \(g_1\)) QCQPs. This explains why PBM1 is faster than the other considered solvers. Once again, the proximal bundle solvers were successful in computing global solutions.

6.4 Computing d-stationary points

In this section, we illustrate the performance of Algorithm 2 on two additional test problems whose second DC component is the pointwise maximum of finitely many differentiable functions. We start with the following two-dimensional problem

$$\begin{aligned} \min _{x\in {\mathbb {R}}^ 2} f_1(x)-f_2(x)\quad \hbox {s.t.}\quad -x_1+x_2\le 1, \end{aligned}$$
(29)

with \(f_1(x)=1.2\max \{0.1x_1^2 + 0.005x_2^2, 0.005x_1^2 + 0.1x_2^2\}\) and \(f_2(x)=\max \{-x_1,-0.3x_2,0\}\). Note that the point \({\tilde{x}}=(0,0)\) is critical for this problem; however, \({\tilde{x}}\) is not d-stationary since small negative perturbations of its components yield smaller function values. Figure 2a presents eight different sequences obtained by PBM1 initialized with eight different starting points. In Fig. 2b we consider PBM-d with the same initial points. Notice that PBM-d does not stop at the critical point \(\tilde{x}=(0,0)\): near \({\tilde{x}}\) the algorithm solves two QPs (19) and can thus escape from \({\tilde{x}}\).
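A quick matlab check illustrates why \({\tilde{x}}=(0,0)\) is not d-stationary: small negative perturbations of either component (which remain feasible for (29)) decrease the objective.

f1 = @(x) 1.2 * max(0.1*x(1)^2 + 0.005*x(2)^2, 0.005*x(1)^2 + 0.1*x(2)^2);
f2 = @(x) max([-x(1), -0.3*x(2), 0]);
f  = @(x) f1(x) - f2(x);
f([0; 0])          % = 0 at the critical point
f([-1e-3; 0])      % approximately -1.0e-3 < 0
f([0; -1e-3])      % approximately -3.0e-4 < 0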

Fig. 2 Level curves of \(f(x)=f_1(x)-f_2(x)\), with \(f_1(x)=1.2\max \{0.1x_1^2 + 0.005x_2^2, 0.005x_1^2 + 0.1x_2^2\}\) and \(f_2(x)=\max \{-x_1,-0.3x_2,0\}\)

A comparison of PBM-d with other solvers is given in Table 6. We do not report CPU time because all the solvers computed a critical point in less than half a second. Like PBM-d, solver DCA-LBM also computed d-stationary points for problem (29) regardless of the initial point.

Table 6 Problem (29) with eight different starting points. Solvers PBM-d and DCA-LBM computed d-stationary points regardless of the considered initial point \(x^0\)

We have also considered an unconstrained variant of problem (29). For the same starting points the obtained results are similar to the ones presented in Table 6, with the difference that the global solution is \({\bar{x}} \approx (-\,4.1667,\,0.000)\) (with optimal value approximately \(-2.08333\)). In this setting, we have also considered the solver PBDC, which succeeded in computing the global solution for every starting point of Table 6. However, if one starts near zero [e.g. \(x^0=(0.001,0.001)\)], then PBDC converges to the critical point (0, 0), whereas PBM-d always converges to a d-stationary point (either the global or the local solution of the problem).

Finally, we consider the following test problem

$$\begin{aligned} \min _{x \in {\mathbb {R}}^n} \sum _{i=1}^n |x_i + (-1)^i| + 2\min _{{i \in \{1,2,\ldots ,n\}}}\{x_i\}, \end{aligned}$$
(30)

which has the DC decomposition

$$\begin{aligned} f_1(x)=\displaystyle \sum _{i=1}^n |x_i + (-1)^i| + 2\displaystyle \sum _{i=1}^n x_i\hbox { and }f_2(x)=2\displaystyle \max _{{i \in \{1,2,\ldots ,n\}}} \Big \{\sum _{j=1,\, j\ne i}^n x_j\Big \}. \end{aligned}$$

Its optimal value is \(-4\), attained at \({\bar{x}}_i = (-1)^{i+1}\) for all indices but one even index j, for which \({\bar{x}}_j = -3\). A critical point (but not a d-stationary one) is given by \({\tilde{x}}_i = (-1)^{i+1}\) for all indices but one odd index j, for which \({\tilde{x}}_j = -1\). At this point, the function value is zero. Table 7 reports some numerical experiments obtained by applying seven solvers to problem (30).
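Assuming the formulation of (30) given above, the two function values can be checked directly (a sketch with a hypothetical even dimension n):

n    = 10;
sgn  = (-1).^(1:n)';                   % the vector of signs (-1)^i
f    = @(x) sum(abs(x + sgn)) + 2*min(x);
xbar = -sgn;  xbar(2) = -3;            % x_i = (-1)^{i+1}, even index j = 2 set to -3
xtil = -sgn;  xtil(1) = -1;            % x_i = (-1)^{i+1}, odd index j = 1 set to -1
f(xbar)     % = -4  (optimal value)
f(xtil)     % =  0  (critical but not d-stationary)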

Table 7 Problem (30) with nine different instances. All solvers were initialized with the vector \(x^0_i = 10\), \(i=1, \ldots ,n\). Except for PLM, all the solvers required less than one second of CPU time to compute a critical point

Note that solvers PBM1, PBM3 and DCA-LBM computed critical points that are not d-stationary for problem (30). On the other hand, the solvers PBM-d, PLM and HANSO succeeded in computing global solutions for all instances. We also highlight the good performance of solver PBDC, which succeeded in computing global solutions in six out of nine cases.

7 Concluding remarks

Inspired by Proximal Linearized Methods, this work proposes and analyzes two proximal bundle algorithms for dealing with convex-constrained nonsmooth DC programs. Neither line-searches nor estimates of Lipschitz constants of the DC components are required by the algorithms, which possess a reliable and straightforward stopping test. Moreover, the given algorithms consider convex models of the underlying DC function. The first algorithm is shown to generate a subsequence of points converging to a critical point of the underlying DC problem. The second algorithm is proved to generate a sequence of descent steps converging to a d-stationary point, which is the sharpest stationarity notion in DC programming. However, this requires us to assume that the second DC component \(f_2\) is the pointwise maximum of N differentiable functions. The price paid for this stronger result is the solution of possibly several (but no more than N) master programs at certain iterations.

Numerical experiments on approximately one hundred instances of different academic DC programs indicate that employing a convex model for approximating the DC objective function can be a simpler alternative to DC models for dealing with DC problems via proximal bundle methods.