Abstract
We consider the problem of minimizing the difference of two nonsmooth convex functions over a simple convex set. To deal with this class of nonsmooth and nonconvex optimization problems, we propose new proximal bundle algorithms and show that the given approaches generate subsequences of iterates that converge to critical points. Trial points are obtained by solving strictly convex master programs defined by the sum of a convex cutting-plane model and a freely chosen Bregman function. In the unconstrained case with the Bregman function being the Euclidean distance, new iterates are solutions of strictly convex quadratic programs of limited size. Stronger convergence results (d-stationarity) can be achieved depending on (a) further assumptions on the second DC component of the objective function and (b) solving possibly more than one master program at certain iterations. The given approaches are validated by encouraging numerical results on some academic DC programs.
1 Introduction
In this work, we are concerned with local optimization methods to address the following class of nonsmooth and nonconvex problems:
\(\min _{x \in X} \; f(x):= f_1(x)-f_2(x), \qquad (1)\)
where \(\emptyset \ne X \subset {\mathbb {R}}^n\) is a simple closed convex set (typically, but not necessarily a polyhedral one) contained in an open set \(\varOmega \subset {\mathbb {R}}^n\), and \(f_1: \varOmega \rightarrow {\mathbb {R}}\) and \(f_2: \varOmega \rightarrow {\mathbb {R}}\) are both convex and possibly nonsmooth functions. The mapping \(f\) is the difference of two convex functions and it is called a DC function while \(f_1\) and \(f_2\) are its DC components. Accordingly, problem (1) is a convex-constrained DC program.
The past few years have witnessed a substantial development in the area of DC programming. This class of problems forms an important sub-field of nonconvex programming and has been receiving much attention from the mathematical programming community [14, 15, 22, 27, 31, 32, 38, 41,42,43]. We refer to [43, Part II] for a comprehensive presentation of several algorithms designed to find global solutions. Yet, local optimization methods play an important role in global optimization because algorithms of the latter class typically employ local methods to find stationary/critical points that feed a certain search strategy for global solutions. A non-exhaustive list of applications of DC programming fitting the above formulation includes production–transportation planning problems [23], location planning problems [43, Chapter 5], physical-layer security in digital communication systems [38], cluster analysis [3, 29], sensor covering [1], and engineering problems [15, 43].
A well-known method for dealing with the optimization problem (1) is the DC Algorithm—DCA—of [31, 42], which iteratively linearizes the second DC component, yielding convex optimization subproblems that are solved to define trial points. Another important algorithm for DC programming is the Proximal Linearized Method—PLM—[7, 38, 41], which can be seen as a regularized variant of DCA: the convex optimization subproblem is augmented with a Bregman function to prevent the tailing-off effect that makes calculations unstable as the iteration process progresses.
The main disadvantage of both DCA and PLM is the need to solve a convex nonsmooth program exactly at every iteration. As an attempt to overcome this issue, the recent work [41] investigates an inexact version of PLM that requires the convex subproblems to be solved only asymptotically up to optimality, with subgradients of \(f_1\) satisfying a certain proximity property with respect to the previously computed subgradient of \(f_2\) [41, Equation (15)].
In this work, we propose more sophisticated and implementable variants of PLM for handling problem (1). The new algorithms belong to the bundle method family [21, Chapter XV], a class of methods proposed by C. Lemaréchal in the 70s [33]. Bundle methods are among the most efficient algorithms for solving nonsmooth convex optimization problems. This class of methods constitutes a very active area of research in the nonsmooth optimization community [8,9,10, 12, 13, 30, 48]. Extensions of proximal bundle algorithms to nonsmooth nonconvex programs have been investigated by different authors in [14, 17, 18, 36, 37] and references therein.
As far as the DC setting is concerned, the insightful paper [27] proposes a proximal bundle method for finding critical points of unconstrained DC programs, i.e., problem (1) with \(X={\mathbb {R}}^n\). The authors consider a DC piecewise linear model approximating the objective function \(f\) and compute trial points by globally minimizing such a model over \({\mathbb {R}}^n\): this task amounts to solving a fixed number of quadratic programs (QPs) per iteration. In general terms, the number of QPs solved per iteration increases with the DC model’s size [27, p. 514].
The recent paper [16] also proposes a bundle method employing a DC piecewise model to deal with unconstrained DC programs. Differently from [27], which globally minimizes the DC model to define new iterates, the DC model in [16] is tackled by means of two auxiliary QPs that have different local approximation properties. These quadratic programs must be solved at every iteration to compute descent directions, along which a line-search is performed to define trial points.
Inspired by these methods for unconstrained DC programs, we propose in this work two proximal bundle methods for finding critical and d(irectional)-stationary points (see definitions in Sect. 2 below) of DC problems (1). The given variants relate to both proximal linearized methods of [7, 41] and proximal bundle algorithms of [16, 27]. Instead of a more difficult (but possibly more accurate) nonconvex approximation of \(f\), we employ a simple convex piecewise-linear model. No line-search or estimate of Lipschitz constants of the DC components is required. Moreover, both variants employ a reliable and straightforward stopping test. We care to mention that the master program defining trial points in our algorithms is not necessarily a QP but a more general strictly convex program exploiting possible compelling structures of the feasible set X. This is computationally useful when X has a particular structure (for instance a ball, a simplex, a spectahedron, and other domains [4, Sect. 2.3]) and a specialized solver for handling the master problem is at our disposal. We assume throughout this paper that such a specialized solver is available.
As standard in the DC literature, our first algorithm is shown to generate a subsequence of points that converges to a critical point of (1). Under the hypothesis that \(f_2\) in (1) is the pointwise maximum of \(N\) known differentiable functions, our second algorithm is ensured to compute a d-stationary point, which is the sharpest stationarity definition in DC programming [38, Sect. 3.2]. The price to obtain this stronger result is the solution of possibly several (but no more than N) master programs at certain iterations. We care to mention that the bundle method of [26] is ensured to compute Clarke stationary points of unconstrained DC programs. To this end, the authors of [26] assume that the subdifferentials of the DC components at any point \(x\in {\mathbb {R}}^n\) are polytopes. This is, for instance, the case when both \(f_1\) and \(f_2\) are the pointwise maximum of finitely many differentiable functions. Clarke stationarity is a stronger property than criticality, but weaker than d-stationarity [38].
The remainder of this work is organized as follows: some well-known results and properties of DC programming are recalled in Sect. 2. The first proximal bundle method for finding critical points is presented in Sect. 3. Its convergence analysis is given in Sect. 4. The second proximal bundle algorithm for finding d-stationary points of a particular class of problem (1) is introduced in Sect. 5, together with its convergence analysis. We report some numerical experiments in Sect. 6, and Sect. 7 closes the paper with some final remarks.
2 Basic concepts and properties of DC programming
As mentioned in [31], DC programming is an extension of convex programming that is vast enough to cover almost all nonconvex optimization problems, but still allows the use of powerful tools from convex analysis and convex optimization. As in nonconvex nonsmooth optimization, many definitions of stationary points exist. Below we list some of them.
A point \({\bar{x}} \in X\) is called a local minimizer of problem (1) if there exists a neighborhood \(V\subset X\) of \({\bar{x}}\) such that \(f({\bar{x}})\le f(x)\) for all \(x \in V\). As convex functions are directionally differentiable in the interior of their domains [6, Proposition 2.2.7], the directional derivative
\(f_i'(x;d):=\lim _{t \downarrow 0} \frac{f_i(x+td)-f_i(x)}{t}\)
for the DC component \(f_i\), \(i=1,2\), is well defined for all x in the open set \(\varOmega \subset \mathtt{dom}\, f_i\) and all \(d \in {\mathbb {R}}^n\). It is well known that \(f_i'(x;d)=\max _{g \in \partial f_i(x)} \langle g,d\rangle \), where
\(\partial f_i(x):=\{g \in {\mathbb {R}}^n: f_i(y)\ge f_i(x)+\langle g,y-x\rangle \ \text { for all } y \in \varOmega \}\)
is the subdifferential of \(f_i\) at point x. For \(\epsilon \ge 0\), the inexact (approximate) subdifferential is denoted by
\(\partial _\epsilon f_i(x):=\{g \in {\mathbb {R}}^n: f_i(y)\ge f_i(x)+\langle g,y-x\rangle -\epsilon \ \text { for all } y \in \varOmega \}.\)
Since the DC components are locally Lipschitz continuous, DC functions are locally Lipschitz continuous as well. Thus, the directional and Clarke directional derivatives of a DC function are well defined for all \(x\in \varOmega \):
\(f'(x;d)=f_1'(x;d)-f_2'(x;d) \quad \text {and}\quad f^0(x;d):=\limsup _{y \rightarrow x,\; t \downarrow 0} \frac{f(y+td)-f(y)}{t}.\)
It can be shown that if \( {\bar{x}} \in X\subset \varOmega \) is a local minimizer of \(f\) over X, then \({\bar{x}} \in X\) is a d(irectional)-stationary point, i.e., \(f'({\bar{x}};(x- \bar{x}))\ge 0\) for all \(x\in X\). For DC programs this definition is equivalent to
\(f_1'({\bar{x}};(x-{\bar{x}}))\ge f_2'({\bar{x}};(x-{\bar{x}})) \quad \text { for all } x \in X.\)
The above inequality is equivalent to \(f_1'({\bar{x}};(x-{\bar{x}}))\ge \max _{g_2 \in \partial f_2({\bar{x}})}\langle g_2,x -{\bar{x}}\rangle \) for all \(x \in X\). In other words, \({\bar{x}} \in X\) is a d-stationary point of (1) if for all \(g_2 \in \partial f_2({\bar{x}})\)
\(f_1'({\bar{x}};(x-{\bar{x}}))\ge \langle g_2,x-{\bar{x}}\rangle \quad \text { for all } x \in X.\)
It follows from convexity of \(f_1\) that \({\bar{x}} \in X\) satisfies the above inequality if and only if
\({\bar{x}} \in \arg \min _{x \in X}\, [\,f_1(x)-\langle g_2,x\rangle \,] \quad \text { for all } g_2 \in \partial f_2({\bar{x}}).\)
This shows that \({\bar{x}} \in X\) is a d-stationary point of (1) if
\(\partial f_2({\bar{x}}) \subset \partial f_1({\bar{x}}) + N_X({\bar{x}}),\)
where \(N_X({\bar{x}})\) is the normal cone of X at \({\bar{x}}\) and \(i_X\) is the indicator function of set X. The equality \(\partial f_1({\bar{x}}) + N_X({\bar{x}})= \partial [f_1({\bar{x}})+i_X({\bar{x}})]\) follows from convexity of \(X \subset \varOmega \subset \texttt {dom}(f_1)\) and the convexity of \(f_1\) [40]. Notice that verifying the above characterization of d-stationarity is impossible in many cases of interest. Hence, one generally employs a weaker notion of stationarity: a point \({\bar{x}} \in X\) is called a critical point of problem (1) if
\(\partial f_2({\bar{x}}) \cap \left[ \partial f_1({\bar{x}}) + N_X({\bar{x}})\right] \ne \emptyset . \qquad (3)\)
Another useful notion is Clarke stationarity: a point \({\bar{x}} \in X\) is a c(larke)-stationary point of problem (1) if
\(f^0({\bar{x}};(x-{\bar{x}}))\ge 0 \quad \text { for all } x \in X.\)
It follows from the inequality \(f^0(x;d) \ge f'(x;d)\) that every d-stationary point is also c-stationary. The reverse implication is not always true [38, Example 2]. The following sequence of implications related to problem (1) can be found in the DC programming literature (see for instance [16, 22, 26, 31, 38, 43]):
\(\text {local minimizer} \;\Longrightarrow \; \text {d-stationary} \;\Longrightarrow \; \text {c-stationary} \;\Longrightarrow \; \text {critical}.\)
If \(f_1\) is a continuously differentiable function on \(\varOmega \), then the Clarke subdifferential of \(f\) satisfies \(\partial ^C f(x)=\nabla f_1(x)-\partial f_2(x)\) and criticality coincides with c-stationarity.
If \(f_2\) is a continuously differentiable function on \(\varOmega \), then \(\partial f_2(x)=\{\nabla f_2(x)\}\) is a singleton and criticality coincides with d-stationarity.
If \(f_2\) is a polyhedral function \(f_2(x)=\max _{i=1,\ldots , N} \{\langle a_i, x\rangle +b_i\}\) (where \(a_i\in {\mathbb {R}}^n\) and \(b_i \in {\mathbb {R}}\)), then \({\bar{x}} \in X\) is a local minimizer if and only if \({\bar{x}}\) is a d-stationary point [31, Theorem 1]. Moreover,
\(\min _{x \in X}\, [f_1(x)-f_2(x)] \;=\; \min _{i=1,\ldots ,N}\; \min _{x \in X}\, [\,f_1(x)-\langle a_i,x\rangle -b_i\,],\)
showing that a global solution of the DC program \(\min _{x \in X} [f_1(x)-f_2(x)]\) can be obtained by solving N convex programs \(\min _{x \in X} [f_1(x)-\langle a_i, x\rangle ] -b_i\).
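This enumeration is easy to illustrate. The sketch below uses hypothetical one-dimensional data (the instance, the interval \(X\) and all coefficients are illustrative, not taken from the paper): it minimizes each of the N convex branches \(f_1(x)-a_ix-b_i\) over X in closed form and keeps the best candidate, which is a global solution of the DC program by the min–min interchange above.

```python
X = (-2.0, 2.0)                       # simple convex set: an interval (illustrative)
f1 = lambda x: x * x                  # first DC component (illustrative)
cuts = [(2.0, -1.0), (-1.0, 0.0)]     # (a_i, b_i) so that f2(x) = max_i a_i*x + b_i
f2 = lambda x: max(a * x + b for a, b in cuts)
f = lambda x: f1(x) - f2(x)           # the DC objective

def solve_branch(a, b):
    """Minimize x^2 - a*x - b over X: unconstrained minimizer a/2, clipped to X."""
    x = min(max(a / 2.0, X[0]), X[1])
    return x, f1(x) - a * x - b

# One convex program per affine piece of f2; the best branch gives a global solution.
candidates = [solve_branch(a, b) for a, b in cuts]
x_star, val = min(candidates, key=lambda t: t[1])
print(x_star, val)   # → -0.5 -0.25, the global minimum of f over X
```

Each subproblem here is a smooth convex quadratic, so it is solvable in closed form; for a general convex \(f_1\) each branch would be handled by a convex solver.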
3 Proximal bundle method for convex constrained DC programming
Let \(\omega : \varOmega \rightarrow {\mathbb {R}}\) be a twice differentiable and strongly convex function on X with parameter \(\rho >0\), w.r.t. the norm \(\Arrowvert \cdot \Arrowvert _p\) (\(p \in [1,\infty ]\)), that is
\(\omega (y)\ge \omega (x)+\langle \nabla \omega (x),y-x\rangle +\frac{\rho }{2}\Arrowvert y-x \Arrowvert _p^2 \quad \text { for all } x,y \in X.\)
We can rewrite the DC function \(f=f_1-f_2\) in (1) for a given \(\omega \) as
Let \(y \in X\), \(g_2\in \partial f_2(y)\) and \(\nabla \omega (y)\) be given. Since convexity of \(f_2\) gives \(-f_2(x)\le -f_2(y)-\langle g_2,x-y\rangle \), we can overestimate \(f\) by the following convex function
\(f(x)\ \le \ f_1(x)-f_2(y)-\langle g_2,x-y\rangle + D(x,y),\)
where \(D(\cdot ,\cdot )\) is the (nonnegative) Bregman function
\(D(x,y):= \omega (x)-\omega (y)-\langle \nabla \omega (y),x-y\rangle . \qquad (5)\)
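As an illustration, the sketch below evaluates the Bregman function \(D(x,y)=\omega (x)-\omega (y)-\langle \nabla \omega (y),x-y\rangle \) for two classical choices of \(\omega \) (both illustrative, not prescribed by the paper): the Euclidean case, which recovers \(\Arrowvert x-y \Arrowvert _2^2/2\), and the negative entropy, which recovers the Kullback–Leibler divergence.

```python
import math

# D(x,y) = w(x) - w(y) - <grad_w(y), x - y>, computed coordinate-wise.
def bregman(w, grad_w, x, y):
    return w(x) - w(y) - sum(g * (xi - yi) for g, xi, yi in zip(grad_w(y), x, y))

# w(x) = ||x||_2^2 / 2  ->  D(x,y) = ||x - y||_2^2 / 2 (the Euclidean case in the text)
w_euc = lambda x: sum(xi * xi for xi in x) / 2.0
gw_euc = lambda x: list(x)

# w(x) = sum_i x_i log x_i (negative entropy, on positive vectors)
# -> D(x,y) reduces to the Kullback-Leibler divergence when sum(x) == sum(y)
w_ent = lambda x: sum(xi * math.log(xi) for xi in x)
gw_ent = lambda x: [math.log(xi) + 1.0 for xi in x]

x, y = [0.2, 0.8], [0.5, 0.5]         # illustrative points
d_euc = bregman(w_euc, gw_euc, x, y)  # = ||x - y||^2 / 2 = 0.09
d_ent = bregman(w_ent, gw_ent, x, y)  # = KL(x || y) > 0
print(d_euc, d_ent)
```

Convexity of \(\omega \) makes \(D(\cdot ,y)\) nonnegative and zero exactly at \(x=y\), which is what the overestimation argument above relies on.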
It is shown in [7] that the following Proximal Linearized Method generates a sequence of trial points \(x^{k+1}\) whose cluster points (if any) are critical to problem (1):
\(x^{k+1} \in \arg \min _{x \in X}\, f_1(x)-\langle g_2^{k},x-x^{k}\rangle + \mu _k D(x,x^{k}), \quad \text {with } g_2^{k} \in \partial f_2(x^{k}). \qquad (6)\)
The work [41] studies the above algorithm with the standard choice \(\omega (\cdot )=\Arrowvert \cdot \Arrowvert _2^2/2\) (Euclidean norm). With this same regularizing function the authors of [38] propose two algorithms (akin to the above one) to find d-stationary points of (1). Both results in [41] and [38] can be extended to a more general strongly convex function \(\omega \) without much difficulty [7].
Notice that the above proximal algorithm requires solving a constrained convex nonsmooth program exactly at every iteration. This can be a difficult task when dealing with real-life DC programs, mainly if \(f_1\) is only assessed via a black-box/oracle. In order to accelerate the optimization process, the authors of [41] investigate an inexact version of the above algorithm with \(\omega (\cdot )=\Arrowvert \cdot \Arrowvert ^2_2/2\). In what follows we propose two proximal bundle algorithms that do not require solving subproblem (6) exactly and, differently from [41], make no assumption on the computed subgradient \(g_1^{k+1} \in \partial f_1(x^{k+1})\). The first of these methods yields critical points, whereas the second algorithm, presented in Sect. 5, finds d-stationary points under the more restrictive assumption that \(f_2\) is the pointwise maximum of finitely many convex differentiable functions.
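Before introducing the bundle variants, the basic PLM iteration can be sketched on a toy instance. All data below are illustrative (they do not come from the paper): \(f_1(x)=x^2\), \(f_2(x)=|x|\), \(X={\mathbb {R}}\) and \(\omega (\cdot )=\Arrowvert \cdot \Arrowvert _2^2/2\), for which subproblem (6) has a closed-form solution.

```python
# Minimal PLM sketch for f(x) = x^2 - |x| (illustrative instance).
# Each iteration solves min_x f1(x) - g2*(x - xk) + mu/2*(x - xk)^2 exactly;
# with f1(x) = x^2 the unique solution is x = (g2 + mu*xk)/(2 + mu).

def plm(x0, mu=1.0, iters=60):
    x = x0
    for _ in range(iters):
        g2 = 1.0 if x > 0 else (-1.0 if x < 0 else 0.0)  # a subgradient of |x| at x
        x = (g2 + mu * x) / (2.0 + mu)                   # exact subproblem solution
    return x

x_bar = plm(1.0)
print(x_bar)   # ≈ 0.5, a critical (here even d-stationary) point of x^2 - |x|
```

From any positive start the iterates contract toward the fixed point \(x=1/2\) at the linear rate \(\mu /(2+\mu )\); the bundle methods below remove the need to solve each subproblem exactly.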
3.1 A proximal bundle method for DC programs
Let k denote an iteration counter and let \(g_1^j \in \partial f_1(x^j)\) and \(g_2^j \in \partial f_2(x^j)\) be subgradients of the DC components calculated during an iteration \(j \in \{0,1,2,\ldots \}\). Convexity of \(f_1\) implies that the linearization
\(\bar{f}_1^j(x):= f_1(x^j) + \langle g_1^j ,x-x^j\rangle \)
approximates \(f_1(x)\) from below. As a result, we can construct a cutting-plane model
\({\check{f}}_1^k(x):= \max _{j \in {\mathcal {B}}_1^k} \bar{f}_1^j(x),\)
where \({\mathcal {B}}_1^k \subset \{0,1,\ldots , k\}\) is the index set containing the bundle of information of \(f_1\). By following the general ideas of bundle methods we replace \(f_1\) in the master program (6) with its cutting-plane model \({\check{f}}_1^k\). Since \({\check{f}}_1^k\) can be a rough approximation of \(f_1\) at certain iterations k, the trial point \(x^{k+1}\) obtained from (6) with \(f_1\) replaced by \({\check{f}}_1^k\) can be far away from the solution of (6). In order to diminish the impact of coarse approximations of \(f_1\) along the iterative process, we shall regularize the resulting master program by keeping trial points near to a certain stability center \(x^{k(\ell )}\in X\), where the index \(\ell \) counts the number of times that such a center has been updated and \(k(\ell )\) is the iteration in which the center is obtained. The stability center \(x^{k(\ell )}\) is some previous iterate, usually the “best” point generated by the iterative process so far. Accordingly, we replace subproblem (6) with
\(x^{k+1} := \arg \min _{x \in X}\, {\check{f}}_1^k(x)-\langle g_2^{k(\ell )},x-x^{k(\ell )}\rangle + \mu _k D(x,x^{k(\ell )}), \qquad (8)\)
where \(\mu _k>0\) is a prox-parameter determining the influence of the Bregman function D on the next trial point \(x^{k+1}\). In terms of the optimal solution \(x^{k+1}\), subproblem (8) is equivalent to
\(\min _{(x,r) \in X \times {\mathbb {R}}}\ r-\langle g_2^{k(\ell )},x-x^{k(\ell )}\rangle + \mu _k D(x,x^{k(\ell )}) \quad \text {s.t.}\quad f_1(x^j)+\langle g_1^j,x-x^j\rangle \le r, \ \ j \in {\mathcal {B}}_1^k. \qquad (9)\)
Notice that if X is a polyhedron and \(\omega (\cdot )=\Arrowvert \cdot \Arrowvert _2^2/2\), then \(D(x,x^{k(\ell )})=\Arrowvert x-x^{k(\ell )} \Arrowvert ^2_2/2\) and subproblem (9) is a convex quadratic problem (QP).
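The cutting-plane machinery itself is simple to sketch. Below, an illustrative component \(f_1(x)=x^2\) (not from the paper) and a three-point bundle define \({\check{f}}_1^k\); the model underestimates \(f_1\) everywhere and interpolates it at the bundle points.

```python
# Cutting-plane model check_f1(x) = max_j [ f1(x^j) + g1^j * (x - x^j) ]
# built from bundle information {(x^j, f1(x^j), g1^j)}. Illustrative data.

f1 = lambda x: x * x        # first DC component (illustrative)
g1 = lambda x: 2.0 * x      # its (here unique) subgradient

bundle = [-1.0, 0.5, 1.5]                        # points x^j visited so far
cuts = [(f1(xj), g1(xj), xj) for xj in bundle]   # stored linearizations

def model(x):
    """Pointwise maximum of the bundle linearizations."""
    return max(fj + gj * (x - xj) for fj, gj, xj in cuts)

print(model(0.0), f1(0.0))   # model value vs true value at a non-bundle point
```

Because each linearization lies below \(f_1\) by convexity, so does their pointwise maximum; this is the property exploited by the master program (8)–(9).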
Proposition 1
Given a stability center \(x^{k(\ell )}\in X\) and a prox-parameter \(\mu _k >0\), let \(x^{k+1}\) be the unique solution of subproblem (9). Assume that either X is polyhedral or an appropriate constraint qualification [21] holds in (9), then there exist \(s^{k+1} \in N_X(x^{k+1})\) and \(\alpha _j \ge 0\) with \(\sum _{j \in {\mathcal {B}}_1^k} \alpha _j =1\) such that \(\sum _{j \in {\mathcal {B}}_1^k} \alpha _j g_1^{j}:= p^{k+1} \in \partial {\check{f}}_1^k(x^{k+1})\) and
\(g_2^{k(\ell )}-\mu _k \left( \nabla \omega (x^{k+1})-\nabla \omega (x^{k(\ell )})\right) = p^{k+1}+s^{k+1}. \qquad (10)\)
Moreover, the aggregate linearization \(\bar{f}_1^{-k}(x):= {\check{f}}_1^k(x^{k+1})+\langle p^{k+1},x-x^{k+1}\rangle \) satisfies \(\bar{f}_1^{-k}(x)\le f_1(x)\) for all \(x \in \varOmega \).
Proof
The assumption on X ensures the existence of Lagrange multipliers \(\alpha _j\ge 0\) associated to the constraints \(\bar{f}_1^j(x)=f_1(x^j) + \langle g_1^j ,x-x^j\rangle \le r\), \(j \in {\mathcal {B}}_1^k\). Hence, the optimality conditions of (9) read as:
\(\left( \begin{array}{c} 0 \\ 0 \end{array} \right) \in \left( \begin{array}{c} -g_2^{k(\ell )}+\mu _k\left( \nabla \omega (x^{k+1})-\nabla \omega (x^{k(\ell )})\right) +\sum _{j \in {\mathcal {B}}_1^k}\alpha _j g_1^j + N_X(x^{k+1}) \\ 1-\sum _{j \in {\mathcal {B}}_1^k}\alpha _j \end{array} \right) .\)
The above inclusion implies that \(\sum _{j \in {\mathcal {B}}_1^k} \alpha _j =1\) and that there exists \(s^{k+1} \in N_X(x^{k+1})\) such that \(s^{k+1} = g_2^{k(\ell )} -\mu _k (\nabla \omega (x^{k+1}) - \nabla \omega (x^{k(\ell )}))- \sum _{j \in {\mathcal {B}}_1^k} \alpha _j g_1^j\), which is exactly (10) with \(p^{k+1}= \sum _{j \in {\mathcal {B}}_1^k} \alpha _j g_1^j\). Note that \(p^{k+1}\) is a convex combination of active subgradients of \({\check{f}}_1^k\) at \(x^{k+1}\). As a result, \(p^{k+1} \in \partial {\check{f}}_1^k(x^{k+1})\) [5, Lemma 10.8]. The inequality \(\bar{f}_1^{-k}(x)\le f_1(x)\) holds because \( p^{k+1} \in \partial {\check{f}}_1^k(x^{k+1})\) and \({\check{f}}_1^k{(\cdot )} \le f_1{(\cdot )}\). \(\square \)
If \(X={\mathbb {R}}^n\) and \(\omega (x)=\Arrowvert x \Arrowvert ^2_2/2\) then the solution \(x^{k+1}\) of (8) is, from Eq. (10), \(x^{k+1}= x^{k(\ell )}+\frac{1}{\mu _k}(g_2^{k(\ell )}- \sum _{j \in {\mathcal {B}}_1^k} \alpha _j g_1^{j})\). Furthermore, the Lagrange multipliers \(\alpha _j\) (with \(j \in {\mathcal {B}}_1^k\)) can be obtained by solving the dual QP of (8), that has dimension \(|{\mathcal {B}}_1^k|\) (see [5, Lemma 10.8]). Therefore, by keeping the size of \({\mathcal {B}}_1^k\) bounded we also keep the method’s memory (number of elements in the bundle) limited.
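The dual viewpoint can be sketched numerically. With two cuts the dual variable reduces to a scalar \(t=\alpha _1\in [0,1]\); the snippet below (illustrative 1-D data, with ternary search standing in for a dual QP solver) maximizes the concave dual function and recovers the trial point from the formula above.

```python
# Illustrative data: two cuts of f1(x) = x^2 taken at -1 and 1, stability
# center xc, a fixed subgradient g2 of f2 at xc, and prox-parameter mu.
xc, g2, mu = 0.2, 0.5, 2.0
cuts = [(1.0, -2.0, -1.0), (1.0, 2.0, 1.0)]      # (f1(x^j), g1^j, x^j)
lin = lambda c, x: c[0] + c[1] * (x - c[2])      # linearization bar_f1^j(x)

def x_of(t):
    """Primal recovery x = xc + (g2 - sum_j alpha_j*g1^j)/mu for alpha=(t, 1-t)."""
    g_mix = t * cuts[0][1] + (1 - t) * cuts[1][1]
    return xc + (g2 - g_mix) / mu

def dual(t):
    """Dual function: master objective evaluated at the recovered point."""
    x = x_of(t)
    return (t * lin(cuts[0], x) + (1 - t) * lin(cuts[1], x)
            - g2 * (x - xc) + mu / 2 * (x - xc) ** 2)

lo, hi = 0.0, 1.0                                # maximize the concave dual on [0,1]
for _ in range(100):
    m1, m2 = lo + (hi - lo) / 3, hi - (hi - lo) / 3
    if dual(m1) < dual(m2):
        lo = m1
    else:
        hi = m2
t_star = (lo + hi) / 2
x_new = x_of(t_star)
print(t_star, x_new)   # ≈ 0.275 and ≈ 0.0 (the kink of the model)
```

The recovered point lands exactly where the two cuts of the model intersect, matching a direct minimization of the strictly convex master program.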
Once the trial point \(x^{k+1}\) is computed by a specialized solver (for QP, quadratically constrained QP, conic programming, etc.), a classification rule decides when to update \(x^{k(\ell )}\). For a given \(\kappa \in (0,1)\) and a lower bound \({\underline{\mu }}>0\) of the prox-parameter \(\mu _k\), a possible rule is as follows: if
\(f(x^{k+1})\le f(x^{k(\ell )})-\kappa \, {\underline{\mu }}\, D(x^{k+1},x^{k(\ell )}), \qquad (12)\)
then a serious step is performed and we set \(x^{k(\ell +1)} := x^{k+1}\) and \(\ell :=\ell +1\). Otherwise, a null step is performed and both stability center and counter \(\ell \) remain unchanged. A more economical rule in terms of evaluations of \(f_2\) is to test the condition
\(f_1(x^{k+1})\le f_1(x^{k(\ell )})+\langle g_2^{k(\ell )},x^{k+1}-x^{k(\ell )}\rangle -\kappa \, {\underline{\mu }}\, D(x^{k+1},x^{k(\ell )}). \qquad (13)\)
If it holds, we perform a serious step; otherwise, a null step. Note that (13) takes into account only the first DC component \(f_1\) and not the DC function \(f\). Moreover, the above inequality does not necessarily imply that \(f_1(x^{k+1})\le f_1(x^{k(\ell )})\) due to the term \(\langle g_2^{k(\ell )},x^{k+1}- x^{k(\ell )}\rangle \). However, when (13) holds the DC function decreases by at least \(\kappa {\underline{\mu }}D(x^{k(\ell +1)},x^{k(\ell )})\).
Lemma 1
If (13) holds then \(f(x^{k(\ell )})\ge f(x^{k(\ell +1)})+\kappa {\underline{\mu }}D(x^{k(\ell +1)},x^{k(\ell )}). \)
Proof
When inequality (13) is satisfied we obtain \(x^{k(\ell +1)}=x^{k+1}\) and thus
\(f_1(x^{k(\ell +1)})\le f_1(x^{k(\ell )})+\langle g_2^{k(\ell )},x^{k(\ell +1)}-x^{k(\ell )}\rangle -\kappa \, {\underline{\mu }}\, D(x^{k(\ell +1)},x^{k(\ell )}).\)
Convexity of \(f_2\) yields \(f_2(x^{k(\ell +1)})\ge f_2(x^{k(\ell )}) +\langle g_2^{k(\ell )},x^{k(\ell +1)} - x^{k(\ell )}\rangle \). By summing these two inequalities we obtain
\(f_1(x^{k(\ell +1)})+f_2(x^{k(\ell )})\le f_1(x^{k(\ell )})+f_2(x^{k(\ell +1)})-\kappa \, {\underline{\mu }}\, D(x^{k(\ell +1)},x^{k(\ell )}),\)
i.e., \(f_1(x^{k(\ell )})-f_2(x^{k(\ell )}) \ge f_1(x^{k(\ell +1)})-f_2(x^{k(\ell +1)}) +\kappa {\underline{\mu }}D(x^{k(\ell +1)},x^{k(\ell )})\) as stated. \(\square \)
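A quick numeric check of rule (13) and Lemma 1 on illustrative data (\(f_1(x)=x^2\), \(f_2(x)=|x|\), Euclidean Bregman function; none of these choices come from the paper):

```python
# Rule (13) classifies a trial point using only f1; Lemma 1 then guarantees
# sufficient decrease of the whole DC function f = f1 - f2. Illustrative data:
f1, f2 = lambda x: x * x, lambda x: abs(x)
D = lambda x, y: (x - y) ** 2 / 2      # Euclidean Bregman function
kappa, mu_low = 0.5, 1.0               # kappa in (0,1), lower bound on mu_k

xc = 1.0        # stability center
g2 = 1.0        # subgradient of f2 at xc
x_new = 0.5     # candidate trial point

lhs = f1(x_new)
rhs = f1(xc) + g2 * (x_new - xc) - kappa * mu_low * D(x_new, xc)
serious = lhs <= rhs                   # descent test (13): only f1 is evaluated

f = lambda x: f1(x) - f2(x)
# Lemma 1: a serious step decreases f by at least kappa*mu_low*D(x_new, xc)
print(serious, f(xc) - f(x_new) >= kappa * mu_low * D(x_new, xc))   # → True True
```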
We are now in the position to present our first algorithm, which makes use of (13) to update stability centers. As a result, there is no need to evaluate \(f_2\) at \(x^{k+1}\): only subgradients of \(f_2\) are computed during serious steps, i.e., when (13) is satisfied. During null steps (when (13) does not hold) only the first DC component \(f_1\) needs to be assessed through an oracle that returns at \(x^{k+1}\) the value of the function and one of its subgradients. A similar algorithm can be stated with rule (12) instead, but with evaluations of \(f_2\) at every iteration.
After a null step, the bundle of information \({\mathcal {B}}_1^k\) can have as few as three linearizations: the new linearization calculated at \(x^{k+1}\), the linearization formed at the current stability center \(x^{k(\ell )}\) and the aggregate linearization \(\bar{f}_1^{-k}\) defined in Proposition 1. Right after a serious step the bundle size can be reset to only one linearization, as is standard in proximal bundle methods for convex optimization. Notice also that the proximal parameter \(\mu _k\) is forbidden to decrease after a null step. This is crucial to prove convergence of the algorithm, which stops when the next iterate approximately coincides with the current stability center. This is a cheap and reliable stopping test as shown in Theorem 1 below.
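To fix ideas, Algorithm 1 can be sketched end-to-end on a one-dimensional instance. Everything below is illustrative: the instance \(f_1(x)=x^2\), \(f_2(x)=|x|\), \(X=[-2,2]\), the parameter choices, and the use of ternary search in place of a QP solver for the master program. The sketch also simplifies bundle management (it keeps all cuts after null steps rather than aggregating).

```python
# Minimal sketch of Algorithm 1 with w(x) = x^2/2, hence D(x,y) = (x-y)^2/2.
f1 = lambda x: x * x
g1 = lambda x: 2.0 * x
g2_of = lambda x: 1.0 if x > 0 else (-1.0 if x < 0 else 0.0)  # subgradient of |x|
X = (-2.0, 2.0)
D = lambda x, y: (x - y) ** 2 / 2

def solve_master(bundle, xc, g2, mu):
    """Minimize max_j cuts - g2*(x - xc) + mu*D(x, xc) over X by ternary search."""
    phi = lambda x: (max(fj + gj * (x - xj) for fj, gj, xj in bundle)
                     - g2 * (x - xc) + mu * D(x, xc))
    lo, hi = X
    for _ in range(200):
        m1, m2 = lo + (hi - lo) / 3, hi - (hi - lo) / 3
        if phi(m1) < phi(m2):
            hi = m2
        else:
            lo = m1
    return (lo + hi) / 2

def algorithm1(x0, mu=1.0, kappa=0.1, tol=1e-8, max_iter=100):
    xc, g2 = x0, g2_of(x0)
    bundle = [(f1(x0), g1(x0), x0)]          # one linearization of f1
    for _ in range(max_iter):
        x_new = solve_master(bundle, xc, g2, mu)
        if abs(x_new - xc) <= tol:           # stopping test: trial point ~ center
            return xc
        # descent test (13): only f1 is evaluated at the trial point
        if f1(x_new) <= f1(xc) + g2 * (x_new - xc) - kappa * mu * D(x_new, xc):
            xc, g2 = x_new, g2_of(x_new)     # serious step: move center, new g2
            bundle = [(f1(xc), g1(xc), xc)]  # reset bundle to one linearization
        else:
            bundle.append((f1(x_new), g1(x_new), x_new))  # null step: enrich model
    return xc

x_bar = algorithm1(1.5)
print(x_bar)   # ≈ 0.5, a critical point of x^2 - |x| on [-2, 2]
```

On this instance the method performs one null step, one serious step, and then the master solution coincides with the center, triggering the stopping test at the critical point \(x=1/2\).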
4 Convergence analysis
Let \({\mathcal {L}}\subset \{0,1,2,\ldots \}\) denote the index set gathering the serious steps: \(\ell \in {\mathcal {L}}\) implies that \(x^{k(\ell )}\) is the \(\ell \mathrm{th}\) stability center. Throughout this section we will use the notation \(i(\ell )=k(\ell +1)-1\) to refer to the algorithm’s iteration yielding a serious step. Then for such iterations subproblem (8) reads as
\(x^{k(\ell +1)} = \arg \min _{x \in X}\, {\check{f}}_1^{i(\ell )}(x)-\langle g_2^{k(\ell )},x-x^{k(\ell )}\rangle + \mu _{i(\ell )} D(x,x^{k(\ell )}).\)
Our goal is to show that any cluster point \({\bar{x}} \in X\) of the sequence of stability centers \(\{x^{k(\ell )}\}_\ell \) generated by Algorithm 1 is a critical point of (1), i.e., a point satisfying (3).
We start with the following result showing that if the algorithm performs only finitely many steps, then the last stability center is a critical point if \(\delta _{{{\,\mathrm{Tol}\,}}}=0\).
Lemma 2
Assume that \(\delta _{{{\,\mathrm{Tol}\,}}}=0\) and suppose that Algorithm 1 stops at iteration k. Then the last stability center \(x^{k(\ell )}\) is a critical point of (1).
Proof
Convexity of \(f_1\) and feasibility of \(x^{k(\ell )}\) imply that
\(f_1(x^{k(\ell )}) \ \ge \ \min _{x \in X}\, f_1(x)-\langle g_2^{k(\ell )},x-x^{k(\ell )}\rangle +\mu _k D(x,x^{k(\ell )}) \ \ge \ \min _{x \in X}\, {\check{f}}_1^k(x)-\langle g_2^{k(\ell )},x-x^{k(\ell )}\rangle +\mu _k D(x,x^{k(\ell )}).\)
In addition, the point \(x^{k+1}\) solves the first subproblem (8) and the optimal value of this problem is \({\check{f}}_1^k(x^{k+1}) - \langle g_2^{k(\ell )},x^{k+1}-x^{k(\ell )}\rangle + \mu _k D(x^{k+1},x^{k(\ell )})\). Hence, if the algorithm stops at iteration k we have that \(x^{k+1}=x^{k(\ell )}\) and
\({\check{f}}_1^k(x^{k+1}) - \langle g_2^{k(\ell )},x^{k+1}-x^{k(\ell )}\rangle + \mu _k D(x^{k+1},x^{k(\ell )}) = {\check{f}}_1^k(x^{k(\ell )}) = f_1(x^{k(\ell )}),\)
where the last equality follows from the assumption that \(k(\ell ) \in {\mathcal {B}}_1^k\). We have thus shown that
\(\min _{x \in X}\, f_1(x)-\langle g_2^{k(\ell )},x-x^{k(\ell )}\rangle +\mu _k D(x,x^{k(\ell )}) = f_1(x^{k(\ell )}),\)
i.e., the point \( x^{k+1}=x^{k(\ell )}\) also solves \(\min _{x \in X} \;f_1(x) -\langle g_2^{k(\ell )},x-x^{k(\ell )}\rangle + \mu _k D(x,x^{k(\ell )})\). The optimality condition of this problem reads as
\(g_2^{k(\ell )} \in \partial f_1(x^{k+1})+\mu _k\left( \nabla \omega (x^{k+1})-\nabla \omega (x^{k(\ell )})\right) +N_X(x^{k+1}).\)
Since \(x^{k+1}=x^{k(\ell )}\) the above inclusion is equivalent to \(g_2^{k(\ell )} \in \partial f_1(x^{k(\ell )}) + N_X(x^{k(\ell )})\), which gives (3) because \(g_2^{k(\ell )} \in \partial f_2(x^{k(\ell )})\). \(\square \)
From now on we assume that \(\delta _{{{\,\mathrm{Tol}\,}}}=0\) and that the algorithm loops indefinitely.
4.1 Infinitely many serious steps
In what follows we assume that the algorithm generates infinitely many serious steps, i.e., \(|{\mathcal {L}}| =\infty \). The following result shows that \(p^{k(\ell +1)}+s^{k(\ell +1)}\) defined in Proposition 1 is an approximate subgradient of the function \(f_1(x)+i_X(x)\) at the point \(x=x^{k(\ell +1)}\).
Lemma 3
Let \(k(\ell +1)\) be the iteration index of the \((\ell +1)\)th stability center and \(i(\ell ) = k(\ell +1) -1\) be the iteration index in which \(x^{k(\ell +1)}\) is determined. Assume \(\mu _k\le \overline{\mu }<\infty \) and that \(\{x^{k(\ell )}\}_{{\ell \in {\mathcal {L}}}}\) is a bounded sequence. Let \(p^{k(\ell +1)} \in \partial {\check{f}}_1^{i(\ell )} (x^{k(\ell +1)})\) and \(s^{k(\ell +1)} \in N_X(x^{k(\ell +1)})\) be as in (10), and denote \(\beta ^{k(\ell +1)}: = p^{k(\ell +1)}+ s^{k(\ell +1)}\). Then there exist constants \(M,L>0\) such that
\(\Arrowvert \beta ^{k(\ell +1)} \Arrowvert _2\le M \quad \text {and}\quad \Arrowvert g_1^{k(\ell )} \Arrowvert _2\le L \quad \text {for all } \ell \in {\mathcal {L}}.\)
Moreover, \(f_1(x^{k(\ell +1)})\ge {\check{f}}_1^{i(\ell )}(x^{k(\ell +1)})\ge f_1(x^{k(\ell )}) -L\Arrowvert x^{k(\ell +1)}-x^{k(\ell )} \Arrowvert _2\) and
\(\beta ^{k(\ell +1)} \in \partial _{ e_{i(\ell )}}\left[ f_1(x^{k(\ell )})+i_X(x^{k(\ell )})\right] \quad \text {with } \ e_{i(\ell )}=(M+L)\Arrowvert x^{k(\ell +1)}-x^{k(\ell )} \Arrowvert _2.\)
Proof
It follows from (10) and the triangle inequality that
\(\Arrowvert p^{k(\ell +1)}+s^{k(\ell +1)} \Arrowvert _2 \le \Arrowvert g_2^{k(\ell )} \Arrowvert _2 + \overline{\mu }\, \Arrowvert \nabla \omega (x^{k(\ell +1)})-\nabla \omega (x^{k(\ell )}) \Arrowvert _2.\)
Since \(\{x^{k(\ell )}\}_{{\ell \in {\mathcal {L}}}}\) is bounded and the DC components \(f_1\) and \(f_2\) are finite-valued convex functions on \(\varOmega \supset X\), then [20, Theorem 3.1.2] ensures that \(\{g_1^{k(\ell )}\}_{{\ell \in {\mathcal {L}}}}\) and \(\{g_2^{k(\ell )}\}_{{\ell \in {\mathcal {L}}}}\) are bounded sequences as well. In particular, there exists \(L>0\) such that \(\Arrowvert g_1^{k(\ell )} \Arrowvert _2\le L\) for all \(\ell \in {\mathcal {L}}\). Moreover, the existence of a finite constant \(M>0\) bounding \(\Arrowvert p^{k(\ell +1)}+s^{k(\ell +1)} \Arrowvert _2\) is ensured because \(\overline{\mu }< \infty \) and \(\nabla \omega \) is a continuous map.
Recall that \(N_X(x^{k(\ell +1)})=\partial i_X(x^{k(\ell +1)})\) because \(x^{k(\ell +1)}\in X\). Then \(s^{k(\ell +1)}\in \partial i_X(x^{k(\ell +1)})\) and therefore \(i_X(x)\ge i_X(x^{k(\ell +1)}) + \langle s^{k(\ell +1)},x-x^{k(\ell +1)}\rangle \) for all \(x \in {\mathbb {R}}^n\). Since \(p^{k(\ell +1)}\in \partial {\check{f}}_1^{i(\ell )}(x^{k(\ell +1)})\) we get
As \(x^{k(\ell )}\in X\) for all \(\ell \), then \(i_X(x^{k(\ell +1)})=0\). By utilizing this in the above inequality we obtain
which gives by the Cauchy–Schwarz inequality
Note that Algorithm 1 keeps in the bundle the index \(k(\ell )\) of the current stability center. Therefore,
Again by the Cauchy–Schwarz inequality we get \({\check{f}}_1^{i(\ell )}(x^{k(\ell +1)}) \ge f_1(x^{k(\ell )}) - L\Arrowvert x^{k(\ell +1)}-x^{k(\ell )} \Arrowvert _2\). Thus it follows from (15) that (because \(i_X(x^{k(\ell )}) =0\))
Since \(x \in {\mathbb {R}}^n\) is an arbitrary point we conclude that \(\beta ^{k(\ell +1)} \in \partial _{ e_{i(\ell )}} [f_1(x^{k(\ell )}) + i_X(x^{k(\ell )}) ]\). The inequalities \(f_1(x^{k(\ell +1)})\ge {\check{f}}_1^{i(\ell )}(x^{k(\ell +1)})\ge f_1(x^{k(\ell )}) -L\Arrowvert x^{k(\ell +1)}-x^{k(\ell )} \Arrowvert _2\) follow trivially from the already shown inequality \({\check{f}}_1^{i(\ell )}(x^{k(\ell +1)}) \ge f_1(x^{k(\ell )}) - L\Arrowvert x^{k(\ell +1)}-x^{k(\ell )} \Arrowvert _2\). \(\square \)
Suppose that \(\{x^{k(\ell )}\}_{{\ell \in {\mathcal {L}}}}\) is bounded, \({\bar{x} \in X}\) is one of its cluster points, and \(\Arrowvert x^{k(\ell +1)}-x^{k(\ell )} \Arrowvert _2 \rightarrow 0\). Then the above lemma ensures that \(\{p^{k(\ell )}+s^{k(\ell )}\}_{{\ell \in {\mathcal {L}}}}\) is a bounded sequence and that any cluster point \({\bar{\beta }}\) of \(\{p^{k(\ell )}+s^{k(\ell )}\}_{{\ell \in {\mathcal {L}}}}\) satisfies \({\bar{\beta }} \in \partial [f_1({\bar{x}}) + i_X({\bar{x}})] \) as a consequence of [21, Proposition 4.1.1]. This crucial property is employed in the following proposition to establish that any cluster point \({\bar{x}} \in X\) of \(\{x^{k(\ell )}\}_{{\ell \in {\mathcal {L}}}}\) is a critical point of (1).
Proposition 2
Assume that the level set \(\{x\in X: f(x)\le f(x^0)\}\) is bounded and that Algorithm 1 performs infinitely many serious steps, i.e., \(|{\mathcal {L}}|=\infty \) and \(\ell \rightarrow \infty \). Then any cluster point \({\bar{x}} \in X\) of the sequence \(\{x^{k(\ell )}\}_{{\ell \in {\mathcal {L}}}}\) is a critical point of problem (1).
Proof
Lemma 1 shows that \( f(x^{k(\ell )}) - f(x^{k(\ell +1)}) \ge \kappa {\underline{\mu }}D(x^{k(\ell +1)},x^{k(\ell )}) \) and therefore the sequence \(\{f(x^{k(\ell )})\}_{{\ell \in {\mathcal {L}}}}\) is strictly decreasing, which in turn implies that \(\{x^{k(\ell )}\}_{{\ell \in {\mathcal {L}}}}\) is a bounded sequence by the assumption of having a bounded level set. Continuity of \(f\) ensures that the level set is also closed, and hence compact. The Weierstrass theorem implies that the optimal value of (1) is finite. Therefore, by summing the above inequality with respect to \(\ell \) we get
\(\kappa \, {\underline{\mu }} \sum _{\ell \in {\mathcal {L}}} D(x^{k(\ell +1)},x^{k(\ell )}) \ \le \ f(x^{k(0)})-\min _{x \in X} f(x) \ < \ \infty .\)
Then it follows from the definition of D in (5) and the equivalence of norms in \({\mathbb {R}}^n\) that
\(\lim _{\ell \rightarrow \infty } \Arrowvert x^{k(\ell +1)}-x^{k(\ell )} \Arrowvert _2 = 0.\)
Lemma 3 ensures that \(\beta ^{k(\ell +1)}=p^{k(\ell +1)}+s^{k(\ell +1)} \in \partial _{ e_{i(\ell )}} [f_1(x^{k(\ell )})+i_X(x^{k(\ell )})]\) with \( e_{i(\ell )} = (M+L)\Arrowvert x^{k(\ell +1)}-x^{k(\ell )} \Arrowvert _2\) for two (possibly unknown) constants \(M,L>0\). Note also that \(\lim _{\ell \rightarrow \infty } e_{i(\ell )}=0\) because \(\lim _{\ell \rightarrow \infty } \Arrowvert x^{k(\ell +1)} - x^{k(\ell )} \Arrowvert _2 =0\). Thus, there exist subsets \({\mathcal {L}}'' \subset {\mathcal {L}}' \subset {\mathcal {L}}=\{0,1,\ldots \}\) such that \(\{x^{k(\ell )}\}_{{\ell \in {\mathcal {L}}'}}\) converges to a point \({\bar{x}} \in X\) (because \(\{x^{k(\ell )}\}_{{\ell \in {\mathcal {L}}}}\) is bounded and X is closed) and \(\{\beta ^{k(\ell +1)}\}_{{\ell \in {\mathcal {L}}''}}\) converges to a point \({\bar{\beta }} \in \partial [f_1({\bar{x}})+i_X(\bar{x})]\) (see Proposition 4.1.1 in [21] for more details). In order to show that \({\bar{x}}\) is a critical point of problem (1), we only need to prove that \(\lim _{\ell \in {\mathcal {L}}''} g_2^{k(\ell )} = {\bar{\beta }}\). This latter result follows directly from Eq. (10), continuity of \(\nabla \omega \) and the inequality \(\mu _k \le \overline{\mu }< \infty \):
\(\lim _{\ell \in {\mathcal {L}}''} g_2^{k(\ell )} = \lim _{\ell \in {\mathcal {L}}''} \left[ \beta ^{k(\ell +1)} + \mu _{i(\ell )}\left( \nabla \omega (x^{k(\ell +1)})-\nabla \omega (x^{k(\ell )})\right) \right] = {\bar{\beta }},\)
showing that \({\bar{\beta }} \) also belongs to \(\partial f_2({\bar{x}})\). This concludes the proof. \(\square \)
4.2 Finitely many serious steps (infinitely many null steps)
In this section, we assume that after the \({{\hat{\ell }}}\mathrm{th}\) stability center \(x^{k({\hat{\ell }})} = {\hat{x}}\) only null steps are performed. Notice that in this case \( g_2^{k({\hat{\ell }})}={\hat{g}}_2\) is fixed and therefore Algorithm 1 behaves exactly as a convex bundle algorithm with the master program given by
\(x^{k+1} = \arg \min _{x \in X}\, f_k^{\mathtt{model}}(x) + \mu _k D(x,{\hat{x}}),\)
where \(f_k^{\mathtt{model}}(x):={\check{f}}_1^k(x)-\langle {\hat{g}}_2,x\rangle \) is a cutting-plane model for the convex function \(f_1(x)-\langle {\hat{g}}_2,x\rangle \). Thus it follows from the convergence analysis of convex bundle methods that the sequence of iterates generated after the last serious step converges to the last stability center: \(\lim _{k\rightarrow \infty } x^{k+1}= {\hat{x}}\). Indeed, if \(D(x,y) = \Arrowvert x-y \Arrowvert _2^2\) this claim follows directly from [45, Proposition 4.4]. For the more general setting of a Bregman function \(D(\cdot , \cdot )\) but with \(X={\mathbb {R}}^n\), the result \(\lim _{k\rightarrow \infty } x^{k+1}= {\hat{x}}\) can be justified by [12, Theorem 5.8] if the prox-parameter \(\mu _k>0\) is fixed after finitely many steps of the algorithm. Overall, we have the following result.
Lemma 4
Let \({\hat{x}}= x^{k({\hat{\ell }})}\) be the last stability center generated by Algorithm 1 and assume that \(\{\mu _k\}_{k\ge k(\hat{\ell })}\) is a nondecreasing sequence contained in \([{\underline{\mu }},\,\overline{\mu }]\). Then
\(\lim _{k \rightarrow \infty } x^{k+1} = {\hat{x}}.\)
Since Algorithm 1 does not consider the same assumptions as either [45, Proposition 4.4] or [12, Theorem 5.8], we provide, for the sake of completeness, the proof of Lemma 4 in the “Appendix”. The following result is a version of Lemma 3 for the sequence of null iterates.
Lemma 5
Let \({\hat{x}}= x^{k({\hat{\ell }})}\) be the last stability center generated by Algorithm 1 and, for \(k\ge k({\hat{\ell }})\), let \(p^{k+1}\in \partial {\check{f}}_1^k(x^{k+1})\) and \(s^{k+1}\in N_X(x^{k+1})\) be as in (10). If \(\{\mu _k\}_{k\ge k({\hat{\ell }})}\) is nondecreasing, then there exist constants \(M,L>0\) such that \(\Arrowvert p^{k+1}+s^{k+1} \Arrowvert _2\le M\) and \(\Arrowvert p^{k+1} \Arrowvert _2\le L\) for all \(k\ge k({\hat{\ell }})\). Moreover, \(\beta ^{k+1} =p^{k+1} +s^{k+1}\in \partial _{ e_k} [f_1({\hat{x}})+i_X({\hat{x}})]\), where \( e_k = (M+L)\Arrowvert x^{k+1} - {\hat{x}} \Arrowvert _2\) for all \(k\ge k({\hat{\ell }})\).
Proof
It follows from (10) that \(\Arrowvert p^{k+1}+s^{k+1} \Arrowvert _2 \le \Arrowvert {\hat{g}}_2 \Arrowvert _2 +\overline{\mu }\Arrowvert \nabla \omega ({\hat{x}}) - \nabla \omega (x^{k+1}) \Arrowvert _2\) for \(k\ge k({\hat{\ell }})\). Since \(\{\mu _k\}_{k \ge k({\hat{\ell }})}\) is nondecreasing, Lemma 8 (in the “Appendix”) ensures that \(\{x^k\}_{{k> k({\hat{\ell }})}}\) is a bounded sequence (therefore \(\{x^k\}_k\) is bounded itself). Then there exists a finite constant \(M>0\) bounding \(\Arrowvert p^{k+1}+s^{k+1} \Arrowvert _2\) (because \(\nabla \omega \) is continuous). As \(f_1\) is a finite-valued convex function on \(\varOmega \supset X\), its subdifferential is locally bounded [20, Theorem 3.1.2]. Then there exists a constant \(L>0\) such that \(\Arrowvert g_1^k \Arrowvert _2 \le L\) for all k. We recall that the subdifferential of \({\check{f}}_1^k\) at x is the convex hull of the set \(\{g_1^j: \, {\check{f}}_1^k(x)=f_1(x^j)+\langle g_1^j,x-x^j\rangle \}\). Thus the subdifferential of \({\check{f}}_1^k\) at every point is also bounded by L, which in turn implies that \({\check{f}}_1^k\) (regardless of the iteration k) admits L as a Lipschitz constant.
Recall that \(N_X(x^{k+1})=\partial i_X(x^{k+1})\) because \(x^{k+1}\in X\). Then \(s^{k+1} \in \partial i_X(x^{k+1})\) and therefore \(i_X(x)\ge i_X(x^{k+1}) + \langle s^{k+1},x-x^{k+1}\rangle \) for all \(x \in {\mathbb {R}}^n\). Since \(p^{k+1} \in \partial {\check{f}}_1^k(x^{k+1})\) and \(i_X(x^k)=0\) for all k we get (for \(k\ge k({\hat{\ell }})\))
Since by construction \(k({\hat{\ell }}) \in {\mathcal {B}}_1^k\) for all \(k\ge k({\hat{\ell }})\), we have that \({\check{f}}_1^k({\hat{x}}) = f_1({\hat{x}})\). Then \(f_1({\hat{x}})-{\check{f}}_1^k(x^{k+1}) = {\check{f}}_1^k({\hat{x}})-{\check{f}}_1^k(x^{k+1}) \le L \Arrowvert {\hat{x}}-x^{k+1} \Arrowvert _2\). This shows that
because \(i_X({\hat{x}})=0\). Therefore, \(\beta ^{k+1} \in \partial _{ e_k}[f_1({\hat{x}}) +i_X({\hat{x}})]\) for all \(k\ge k({\hat{\ell }})\). \(\square \)
We are now in the position to prove that the last stability center \({\hat{x}}\) is a critical point of problem (1).
Proposition 3
Let \({\hat{x}}= x^{k({\hat{\ell }})}\) be the last stability center generated by Algorithm 1 and assume that \(\{\mu _k\}_{k\ge k(\hat{\ell })}\) is a nondecreasing sequence contained in \([{\underline{\mu }},\,\overline{\mu }]\). Then the sequence \(\{x^{k+1}\}_{{k}}\) converges to \({\hat{x}}\) and \({\bar{x}} = {\hat{x}} \) is a critical point of problem (1).
Proof
Lemma 4 gives \(\lim _{k\rightarrow \infty } x^{k+1}= {\hat{x}}\). It follows from (10), continuity of the mapping \(\nabla \omega \) and \(\mu _k\le \overline{\mu }\) that
Lemma 5 ensures that \(\beta ^{k+1} \in \partial _{ e_k} [f_1({\hat{x}})+i_X({\hat{x}})]\) with \( e_k = (M+L)\Arrowvert x^{k+1} - {\hat{x}} \Arrowvert _2\). Since \( \lim _{k\rightarrow \infty } e_k = 0\) due to \(\lim _{k\rightarrow \infty } x^{k+1}= {\hat{x}}\), then \({\hat{g}}_2 =\lim _{k\rightarrow \infty } \beta ^{k+1}\in \partial [f_1({\hat{x}})+i_X({\hat{x}})]\) and (3) is satisfied with \({\bar{x}} = {\hat{x}}\). \(\square \)
4.3 Convergence analysis: main result
Convergence analysis of Algorithm 1 is summarized in the following theorem.
Theorem 1
Consider Algorithm 1 and suppose that the level set \(\{x\in X: f(x)\le f(x^0)\}\) is bounded. If the stopping-test tolerance \(\delta _{{{\,\mathrm{Tol}\,}}}=0\), then any cluster point \({\bar{x}}\) of the sequence of stability centers \(\{x^{k(\ell )}\}_{{\ell \in {\mathcal {L}}}}\) generated by the algorithm satisfies (3). Moreover, if the stopping-test tolerance \(\delta _{{{\,\mathrm{Tol}\,}}}>0\) then the algorithm stops after finitely many steps with an approximate critical point \(\bar{x}=x^{k(\ell )}\). In addition, if \(\nabla \omega (\cdot )\) is a locally Lipschitz continuous map, then the approximate critical point \({\bar{x}}=x^{k(\ell )}\) satisfies
where \(s_1,s_2>0\) are two constants and \(B(0;s_2\delta _{{{\,\mathrm{Tol}\,}}})\) is the closed ball in \({\mathbb {R}}^n\) with center at zero and with radius \(s_2\delta _{{{\,\mathrm{Tol}\,}}}\).
Proof
First, suppose that \(\delta _{{{\,\mathrm{Tol}\,}}}=0\). If Algorithm 1 stops at iteration k then Lemma 2 ensures that the last stability center is a critical point of (1). Suppose now that Algorithm 1 does not stop. If infinitely many serious steps are generated, then Proposition 2 gives the result. Otherwise there will be a finite number of serious steps and Proposition 3 ensures that the last stability center is a critical point.
Moreover, suppose that \(\delta _{{{\,\mathrm{Tol}\,}}}>0\). If infinitely many stability centers are generated, then Proposition 2 shows that \(\lim _{\ell \rightarrow \infty } \Arrowvert x^{k(\ell +1)}-x^{k(\ell )} \Arrowvert _2=0\). Otherwise, if there are only finitely many stability centers, then Proposition 3 ensures that \(\lim _{k \rightarrow \infty } \Arrowvert x^{k+1}-{\hat{x}} \Arrowvert _2=0\), where \({\hat{x}}= x^{k(\ell )}\) is the last stability center. In any case, the stopping test of Algorithm 1 will be triggered after finitely many iterations if \(\delta _{{{\,\mathrm{Tol}\,}}}>0\). In this case, \(\Arrowvert x^{k+1}-x^{k(\ell )} \Arrowvert _2\le \delta _{{{\,\mathrm{Tol}\,}}}\) and therefore Lemma 5 gives
Furthermore, Eq. (10) and the assumption on \(\nabla \omega \) yield
These properties show that \({\bar{x}} = x^{k(\ell )}\) is an approximate critical point of problem (1) with \(s_1 = M+L\) and \(s_2=\bar{\mu }\,L_\omega \), where \(L_\omega \) is the Lipschitz constant of \(\nabla \omega \) on a bounded set containing the sequence \(\{x^k\}_{{k}}\). \(\square \)
4.4 Some comments on convex and DC models
Algorithm 1 employs the convex model \({\check{f}}_1^k(x)-\bar{f}_2^{k(\ell )}(x)\), with

$$\begin{aligned} \bar{f}_2^{k(\ell )}(x):= f_2(x^{k(\ell )})+\langle g_2^{k(\ell )},x-x^{k(\ell )}\rangle ,\quad g_2^{k(\ell )} \in \partial f_2(x^{k(\ell )}), \end{aligned}$$
to approximate the DC function \(f\), where \({\check{f}}_1^k\) is a cutting-plane model of \(f_1\). Instead of considering only one linearization for the second DC component, one could also gather a bundle of information \({\mathcal {B}}_2^k \subset \{1,2,\ldots , k\}\) for \(f_2\) and define a cutting-plane model

$$\begin{aligned} \check{f}_2^k(x):=\max _{j \in {\mathcal {B}}_2^k}\,\{f_2(x^j)+\langle g_2^j,x-x^j\rangle \},\quad g_2^j \in \partial f_2(x^j). \end{aligned}$$
This provides a DC model \({\check{f}}_1^k-\check{f}_2^k\) for \(f\) that is expected to be better than the convex one \({\check{f}}_1^k(x)-\bar{f}_2^{k(\ell )}(x)\). Both [16] and [27] consider the DC model \({\check{f}}_1^k(\cdot )-\check{f}_2^k(\cdot )\) in their proximal bundle algorithms for unconstrained DC programs. For instance, the method of [27] defines trial points by globally solving the following DC subproblem
Since \(\check{f}_2^k\) is a polyhedral function, a global solution to the above subproblem can be obtained by solving \(|{\mathcal {B}}_2^k|\) strictly convex QPs
As a result, if the bundle of information \({\mathcal {B}}_2^k\) is large, then each iteration of the algorithm in [27] can be too time-consuming. For this reason, the size of the bundle \({\mathcal {B}}_2^k\) should be kept small. The proximal bundle method of [27] is shown to converge to a critical point of an unconstrained DC program even when \({\mathcal {B}}_2^k\) has only two indexes, i.e., only two QPs need to be solved per iteration. In contrast, Algorithm 1 requires solving only one convex subproblem (a QP if \(\omega (\cdot )=\Arrowvert \cdot \Arrowvert _2^2/2\) and X is a polyhedron) per iteration, which amounts to taking \({\mathcal {B}}_2^k = \{k(\ell )\}\) for all k. This is a feature of practical interest if X contains some conic/quadratic constraints.
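To make the enumeration argument concrete, here is a small self-contained 1-D sketch with made-up cutting planes (all data hypothetical): because \(\check{f}_2^k\) is a pointwise maximum of affine pieces, minimizing the DC model plus a Euclidean proximal term is equivalent to solving one convex subproblem per piece and keeping the best value.

```python
import numpy as np

# Hypothetical cutting planes a_i*x + b_i for f1 and c_j*x + d_j for f2 (1-D demo)
a, b = np.array([1.0, -0.5]), np.array([0.0, 0.3])
c, d = np.array([0.8, -1.0]), np.array([0.1, 0.0])
mu, xhat = 2.0, 0.5                                  # prox parameter and center

xs = np.linspace(-3.0, 3.0, 60001)                   # fine grid, for illustration
model1 = np.max(np.outer(xs, a) + b, axis=1)         # cutting-plane model of f1
model2 = np.max(np.outer(xs, c) + d, axis=1)         # cutting-plane model of f2
prox = 0.5 * mu * (xs - xhat) ** 2

# Nonconvex DC master problem, minimized directly
dc_val = np.min(model1 - model2 + prox)

# Enumeration: one convex subproblem per linearization of f2, keep the best
piece_vals = [np.min(model1 - (c[j] * xs + d[j]) + prox) for j in range(len(c))]
enum_val = min(piece_vals)
```

The two values coincide because the inner \(\max_j\) and the outer minimization commute in this way; with \(|{\mathcal {B}}_2^k|\) pieces the cost is \(|{\mathcal {B}}_2^k|\) convex solves, which is precisely why the bundle for \(f_2\) is kept small.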
5 A proximal bundle method for finding d-stationary points
In this section, we consider a particular case of problem (1) in which the second DC component is the pointwise maximum of finitely many differentiable convex functions \(\psi _i: \varOmega \rightarrow {\mathbb {R}}\):

$$\begin{aligned} \min _{x \in X}\; f(x)=f_1(x)-f_2(x),\quad \text{ with } f_2(x):=\max _{i=1,\ldots ,N}\psi _i(x). \end{aligned}$$(17)
Inspired by the work [38], the following modification of Algorithm 1 may solve several subproblems of type (8) at certain iterations to generate a subsequence of stability centers \(\{x^{k(\ell )}\}_{\ell \in {\mathcal {L}}}\) that converges to a d-stationary point \({\bar{x}} \) of problem (17), i.e., \({\bar{x}}\) satisfying (2). To this end, we exploit the structure of \(f_2\). It is well known that the subdifferential of \(f_2\) at any given point \(x \in X \subset \varOmega \) is the convex hull of gradients of functions \(\psi _i\) that are active:

$$\begin{aligned} \partial f_2(x) = {{\,\mathrm{conv}\,}}\{\nabla \psi _i(x):\, i \in A(x)\},\quad A(x):=\{i:\, \psi _i(x)=f_2(x)\}. \end{aligned}$$(18)
As a result, if \({\bar{x}}\in X\) satisfies

$$\begin{aligned} \nabla \psi _i({\bar{x}}) \in \partial f_1({\bar{x}}) + N_X({\bar{x}}) \quad \text{ for } \text{ all } i \in A({\bar{x}}), \end{aligned}$$
then (2) holds from the convexity of \( \partial [f_1(\bar{x}) + i_X({\bar{x}})]\) and thus \({\bar{x}}\) is a d-stationary point of (17). For technical reasons, Algorithm 2 below makes use of the following relaxation of A(x):

$$\begin{aligned} A_\epsilon (x):=\{i \in \{1,\ldots ,N\}:\, \psi _i(x)\ge f_2(x)-\epsilon \}. \end{aligned}$$
To avoid critical points that are not d-stationary, the parameter \(\epsilon \) above must be strictly positive [38, Example 4].
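The index sets are simple to form in code. The sketch below uses hypothetical values \(\psi_i(x)\) and our reading of the relaxation, \(A_\epsilon(x)=\{i:\,\psi_i(x)\ge f_2(x)-\epsilon\}\):

```python
import numpy as np

# For f2(x) = max_i psi_i(x): A(x) collects the maximizing (active) indices,
# and A_eps(x) also keeps the nearly-active ones (eps-relaxation).
def active_sets(psi_vals, eps):
    f2 = np.max(psi_vals)
    A = set(np.flatnonzero(psi_vals >= f2 - 1e-12))    # active indices
    A_eps = set(np.flatnonzero(psi_vals >= f2 - eps))  # eps-active superset
    return A, A_eps

psi = np.array([1.0, 0.995, 0.2])    # hypothetical values psi_i(x) at some x
A, A_eps = active_sets(psi, eps=0.01)
```

With \(\epsilon = 0.01\), the second piece is \(\epsilon\)-active but not active, which is exactly the situation discussed below: its gradient need not be a subgradient of \(f_2\) at the point.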
Note that if the number N of functions \(\psi _i\) in (17) is equal to one (i.e., \(f_2= \psi _1\)), then Algorithm 2 becomes essentially Algorithm 1 (with the only difference in the descent test). However, if \(N>1\) then Algorithm 2 solves \(|A_{\epsilon }(x^k)|\) subproblems per iteration.
Moreover, some vectors \(\nabla \psi _i(x^{k(\ell )})\) with \(i \in A_\epsilon (x^{k(\ell )})\) may not belong to \( \partial f_2(x^{k(\ell )})\) because \(\epsilon >0\). As a result, the new trial point \(x^{k+1}\), solution of (19) for some \(i^* \in A_\epsilon (x^{k(\ell )})\), may not be generated by a true subgradient \(\nabla \psi _{i^*}(x^{k(\ell )})\) of \(f_2\) at \(x^{k(\ell )}\). Since Lemma 1 relies on the fact that \(g_2^{k(\ell )} \in \partial f_2(x^{k(\ell )})\), we cannot employ the descent test (13) with \(g_2^{k(\ell )}\) replaced with \( \nabla \psi _{i^*}(x^{k(\ell )})\): even if
holds true, nothing ensures that \(f(x^{k+1}) \le f(x^{k(\ell )})-\kappa {\underline{\mu }}D(x^{k+1},x^{k(\ell )})\). This is why we have replaced the descent test (13) with the more direct one given in (12). A downside of such a replacement is that we need to call an oracle for \(f_2\) at every trial point, in contrast to Algorithm 1, which never evaluates \(f_2\) (but computes one of its subgradients at serious steps).
In order to analyze the convergence of Algorithm 2, we rely on the convergence analysis of Algorithm 1. We start with the following lemma.
Lemma 6
Let \(x^{k+1}\) and \(i^*\) be as defined in Algorithm 2. Then for all \( i \in A_\epsilon (x^{k(\ell )})\)
Proof
It follows from the definition of \(x^{k+1}\) that \({\check{f}}_1^k(x^{k+1})-\langle \nabla \psi _{i^*}(x^{k(\ell )}),x^{k+1}-x^{k(\ell )}\rangle + \mu _k\, D(x^{k+1},x^{k(\ell )}) = \min _{x \in X} {\check{f}}_1^k(x)-\langle \nabla \psi _{i^*}(x^{k(\ell )}),x-x^{k(\ell )}\rangle + \mu _k\, D(x,x^{k(\ell )})\), which is in turn less than or equal to
due to the definition of \(i^*\) and \(y_i\) given in Algorithm 2. Again, the definition of \(y_i\) and the inequalities \({\check{f}}_1^k(x) \le f_1(x)\) and \(\mu _k\le \overline{\mu }\) ensure the first inequality in (20). The last one follows from feasibility of \(x^{k(\ell )}\). \(\square \)
Theorem 2
Consider Algorithm 2 applied to problem (17) and suppose that \(\delta _{{{\,\mathrm{Tol}\,}}}=0\) and the level set \(\{x\in X: f(x)\le f(x^0)\}\) is bounded. Then any cluster point \({\bar{x}}\) of the sequence of stability centers \(\{x^{k(\ell )}\}_{\ell \in {\mathcal {L}}}\) generated by the algorithm is a d-stationary point, i.e., \({\bar{x}}\) satisfies (2).
Proof
We split the proof into three main parts: the first part considers the case in which Algorithm 2 stops with \(\delta _{{{\,\mathrm{Tol}\,}}}=0\), the second one analyzes the case of infinitely many serious steps, and finally the last part assumes that after a last serious step the algorithm loops forever, generating only null steps.
First part Suppose that Algorithm 2 stops at iteration k. Then \(x^{k+1}=x^{k(\ell )}\) and Lemma 6 gives
As \(k(\ell ) \in {\mathcal {B}}_1^k\), then \({\check{f}}_1^k(x^{k(\ell )})=f_1(x^{k(\ell )})\) and the above relation ensures that \(x^{k+1}=x^{k(\ell )}\) solves
Since \(x^{k+1}=x^{k(\ell )}\), then \(\nabla \psi _i(x^{k(\ell )}) \in \partial f_1(x^{k(\ell )}) + N_X(x^{k(\ell )})\) for all \(i \in A(x^{k(\ell )})\), which by (18) implies that \({\bar{x}}= x^{k(\ell )}\) satisfies (2). In what follows we suppose that the algorithm does not stop.
Second part Let us suppose that Algorithm 2 generates an infinite sequence of serious steps. By summing the inequality \(f(x^{k(\ell +1)}) \le f(x^{k(\ell )})-\kappa {\underline{\mu }}D(x^{k(\ell +1)},x^{k(\ell )})\) over \(\ell \in {\mathcal {L}}=\{0,1,2\ldots \}\) and using the fact that \(\{x\in X: f(x)\le f(x^0)\}\) is a bounded set, we conclude that \(\{x^{k(\ell )}\}_{{\ell \in {\mathcal {L}}}}\) is a bounded sequence. Moreover, \(\sum _{\ell =0}^\infty \Arrowvert x^{k(\ell +1)}- x^{k(\ell )} \Arrowvert _2^2 < \infty \) by (5) and the equivalence of norms in \({\mathbb {R}}^n\). Boundedness of \(\{x^{k(\ell )}\}_{{\ell \in {\mathcal {L}}}}\) ensures that there exists an index set \({\mathcal {L}}' \subset {{\mathcal {L}}}\) such that the sequence \(\{x^{k(\ell )}\}_{{\ell \in {\mathcal {L}}'}}\) converges to a point \({\bar{x}} \in X\). Continuity of \(f_2\) and the assumption that \(\epsilon >0\) yield the inclusion \(A({\bar{x}}) \subset A_\epsilon (x^{k(\ell )})\) for all \(\ell \in {\mathcal {L}}'\) large enough. In what follows, let \(i(\ell )=k(\ell +1)-1\). The definition of \(x^{k(\ell +1)}\) (\(=x^{k+1}= y_{i^*})\) and (20) give
for all \(x \in X\), \(i \in A({\bar{x}})\) and \(\ell \in {\mathcal {L}}'\) large enough. Algorithm 2 keeps in the bundle the index \(k(\ell )\) of the last serious step. As a result,
By the Cauchy–Schwarz inequality (and remembering that \(\{x^{k(\ell )}\}_{{\ell \in {\mathcal {L}}}}\) is bounded and \(f_1\) is convex) we get \({\check{f}}_1^{i(\ell )}(x^{k(\ell +1)}) \ge f_1(x^{k(\ell )}) - \Arrowvert g_1^{k(\ell )} \Arrowvert _2\Arrowvert x^{k(\ell +1)}-x^{k(\ell )} \Arrowvert _2\ge f_1(x^{k(\ell )}) - L\Arrowvert x^{k(\ell +1)}-x^{k(\ell )} \Arrowvert _2\). Combining this inequality with (21) yields
for all \(x\in X,\, i \in A({\bar{x}})\) and \(\ell \in {\mathcal {L}}'\) large enough. By going to the limit with \(\ell \in {\mathcal {L}}'\) tending to infinity (and recalling that \(\lim _{\ell \rightarrow \infty }\Arrowvert x^{k(\ell +1)}-x^{k(\ell )} \Arrowvert _2=0\) because \(\sum _{\ell =0}^\infty \Arrowvert x^{k(\ell +1)}- x^{k(\ell )} \Arrowvert _2^2 < \infty \)) we obtain
This shows that \({\bar{x}}\) solves all the \(|A({\bar{x}})|\) subproblems \( \min _{x \in X} \;f_1(x)-\langle \nabla \psi _i({\bar{x}}),x-{\bar{x}}\rangle + \overline{\mu }\, D(x,{\bar{x}}) \), i.e., \(\nabla \psi _i({\bar{x}}) \in \partial f_1({\bar{x}}) +N_X({\bar{x}})\) for all \(i \in A({\bar{x}})\), implying thus that \({\bar{x}}\) is a d-stationary point of (17).
Third part Let us consider the case of finitely many serious steps. Accordingly, there exists a last stability center \({\hat{x}}=x^{k({\hat{\ell }})}\) and \(f({\hat{x}})< f(x^{k+1}) + \kappa \,{\underline{\mu }}\, D(x^{k+1},{\hat{x}})\) for all \(k \ge k({\hat{\ell }})\). As mentioned right before Lemma 6 on page 13, this implies that a point \(x^{k+1}\) for \(k\ge k({\hat{\ell }})\) does not satisfy test (13) with \(g_2^{k({\hat{\ell }})}= \nabla \psi _i({\hat{x}})\) for any \(i \in A({\hat{x}})\) [although (13) may hold for some \(g_2^{k({\hat{\ell }})}= \nabla \psi _j({\hat{x}})\) with \(j \in A_\epsilon ({\hat{x}})\backslash A({\hat{x}})\)]. Hence, by seeing the null iterates \(x^{k+1}=y_{i^*}\) with \(i^* \in A_\epsilon ({\hat{x}})\backslash A({\hat{x}})\) as mere points enriching the bundle \({\mathcal {B}}_1^k\), Lemma 8 and Proposition 3 apply: \(\lim _{k\rightarrow \infty } [f_1(x^{k+1})- {\check{f}}_1^k(x^{k+1})]=0 \), \(\lim _{k\rightarrow \infty } x^{k+1}= {\hat{x}}\) and \({\hat{x}}\) is a critical point. It remains to show that \({\hat{x}}\) is indeed a d-stationary point of (17). To this end, consider the following inequality extracted from (20):
for all \(x \in X\) and all \(i \in A({\hat{x}})\). By passing to the limit with k tending to infinity we obtain
for all \( x \in X\) and \(i \in A({\hat{x}})\). This shows that \({\hat{x}}\) solves all the \(|A({\hat{x}})|\) subproblems
thus implying that \({\bar{x}}={\hat{x}}\) is a d-stationary point of (17) (see (18)).
In all cases, if \(\delta _{{{\,\mathrm{Tol}\,}}}>0\), the algorithm stops after finitely many steps. \(\square \)
6 Numerical experiments
We assess the numerical performance of Algorithms 1 and 2 on some academic DC problems. For comparison purposes, we consider two variants of Algorithm 1 with the Bregman function \(D(x,y)=\Arrowvert x-y \Arrowvert _2^2/2\), two implementations of the classic DC algorithm (DCA) of [42], one implementation of the proximal linearized algorithm [41, Algorithm 1], the DC proximal bundle method of [27], and an algorithm for nonsmooth and nonconvex optimization [35]. We also compare the computational behavior of Algorithm 1 against the results reported in [16]. The considered solvers are as follows:
-
PBM1: Algorithm 1.
-
PBM-d: Algorithm 2.
-
PBM3: Algorithm 1 with subproblem (9) replaced with the DC subproblem (16), where the model for \(f_2\) has three pieces: three linearizations computed with the last three stability centers.
-
DCA-CPM: this is an implementation of the DCA algorithm of [42], with trial points defined as
$$\begin{aligned} x^{k+1}\in \arg \min _{x \in X} f_1(x)-\langle g_2^k,x\rangle \quad {\hbox { with}\ g_2^k \in \partial f_2(x^k)}. \end{aligned}$$(23)

The convex subproblem (23) is solved by an implementation of Kelley’s cutting-plane method [28]. We have made use of warm starts, meaning that when a new subgradient \(g_2^{k+1}\in \partial f_2(x^{k+1})\) is computed, the solver re-uses the cutting-plane model for \(f_1\) constructed at the previous iteration.
-
DCA-LBM: as DCA-CPM, but with the convex subproblem (23) solved by an implementation of the level bundle method of [34].
-
PLM: this is an implementation of the proximal linearized algorithm in [41, Algorithm 1], with trial points defined as in (6). The convex subproblem (6) is solved by replacing \(f_1\) with its cutting-plane approximation, which is iteratively improved in the same way as in the classical proximal bundle method. As in DCA-CPM, we have employed warm starts.
-
HANSO: Hybrid Algorithm for Non-Smooth Optimization [35]. This is a matlab package based on the BFGS and gradient-sampling methods for nonsmooth and nonconvex unconstrained optimization problems. The package (version 2.2) is made freely available by its developers at www.cs.nyu.edu/overton/software/hanso.
-
PBDC: the proximal bundle method of [27], whose Fortran code is made freely available by its developers at http://napsu.karmitsa.fi/pbdc/.
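The inner loop used by DCA-CPM can be sketched compactly. Below is a minimal 1-D implementation of Kelley's cutting-plane method on a box, applied to an illustrative objective (the actual solver handles the general convex subproblem (23); the data here are hypothetical):

```python
import numpy as np
from scipy.optimize import linprog

# Minimal 1-D Kelley cutting-plane method: build a piecewise-linear lower
# model from subgradient linearizations and minimize it by an LP in (x, r).
def kelley(f, subgrad, lo, hi, x0, tol=1e-6, max_it=50):
    cuts = []                       # pairs (g_j, h_j) with cut r >= g_j*x + h_j
    x = x0
    for _ in range(max_it):
        fx, g = f(x), subgrad(x)
        cuts.append((g, fx - g * x))
        # LP: minimize r  s.t.  g_j*x - r <= -h_j  for all cuts,  lo <= x <= hi
        A = [[g_j, -1.0] for g_j, _ in cuts]
        b = [-h_j for _, h_j in cuts]
        res = linprog([0.0, 1.0], A_ub=A, b_ub=b,
                      bounds=[(lo, hi), (None, None)])
        x, r = res.x
        if fx - r <= tol:           # model lower bound meets the function value
            break
    return x

xstar = kelley(lambda x: abs(x - 1.0), lambda x: np.sign(x - 1.0),
               -100.0, 100.0, x0=-100.0)
```

For this piecewise-linear example the method finds the exact minimizer after a few cuts; for general convex objectives, only the gap \(f(x_k)-r_k\) is driven below the tolerance.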
Except for HANSO and PBDC, which are open-source packages, we have implemented the considered solvers in matlab (version 2017a) using the Gurobi solver (version 7.5.1, www.gurobi.com) for LP, QP and quadratically-constrained problems. Solver HANSO was employed with its default parameters, except for the memory of its BFGS algorithm, which was set to 50 for problems whose dimension is greater than or equal to 1000. Solver PBDC was used with its default parameters, except for its stopping-test tolerance (user_crit_tol), which was increased to \(0.06\cdot n\). The other solvers employ the stopping test \(\Arrowvert x^{k+1}- x^{k(\ell )} \Arrowvert _2 \le \delta _{{{\,\mathrm{Tol}\,}}}\), with tolerance \(\delta _{{{\,\mathrm{Tol}\,}}}=10^{-4}\). In our implementation, the solvers PBM1 and PBM3 consider the test (12) with \(\kappa =0.1\) and \({\underline{\mu }}= 10^{-5}\) for updating stability centers. This implies that the number of evaluations of \(f_1\) is the same as for \(f_2\). The maximum size of the bundle of information for solvers PBM1, PBM3 and PBM-d was set to \(\max \{100,\,\min \{n+5,1000\}\}\). Moreover, only active indexes were kept in the bundle after a serious step. The matlab codes, as well as the considered test problems, are freely available at http://www.oliveira.mat.br/solvers.
Numerical experiments were performed on a computer with Intel(R) Core(TM), i7-6820HQ, CPU @ 2.70 GHz, 32G (RAM), under Windows 7, 64Bits. The Fortran code of PBDC was compiled and executed in a Virtual Machine running Linux Ubuntu. The total CPU time allowed for each solver on each test problem was set to 3600 s.
For the sake of comparison, we also present some results from [16]. Since we have not run the solver DCPCA, it does not make sense to compare CPU times. However, the solution quality and the number of subgradient evaluations can be compared, keeping in mind that DCPCA, PBDC and the matlab solvers employ different stopping tests.
6.1 Unconstrained DC programs
We consider all the unconstrained DC programs reported in [2, 16, 27]. Their formulation, optimal values and initial points are given in [27].
Since the inner algorithms employed by solvers DCA-CPM and DCA-LBM require the feasible set to be compact, we have added the bounds \(-100\le x\le 100\) to the test problems when running these two solvers (the other solvers did not consider this change).
Table 1 reports numerical experiments on several instances of the considered unconstrained DC problems. The first column indicates the problem, the second states the dimension of the problem and the third column reports the optimal value. The remaining columns present the obtained function value \(f(x^{k(\ell )})\), number of subgradient evaluations for DC components and CPU time (in seconds) for all the considered solvers. We emphasize that:
-
PBM1 and PBM3 The number of subgradient evaluations \(\# g_1\) is equal to the number of function evaluations of \(f_1\). It also coincides with the number of iterations performed by the algorithm. The number of subgradient evaluations \(\# g_2\) of the second DC component coincides with the number of serious steps of the algorithm.
-
DCA-CPM, DCA-LBM and PLM The number of subgradient evaluations \(\# g_1\) is equal to the number of function evaluations of \(f_1\). Moreover, \(\# g_2\) coincides with the number of iterations.
-
HANSO The number of subgradient evaluations is equal to the number of function evaluations. The solver does not exploit the DC decomposition of the objective function.
-
PBDC The number of function \(f\) evaluations can be slightly bigger than \(\# g_1\).
-
DCPCA The number of evaluations of \(f_1\) is greater than the number of subgradient evaluations due to the line-search.
Notice that the solution quality of solvers PBM1 and PBM3 is comparable to that of solvers PBDC and DCPCA. These four DC bundle solvers provided more precise solutions than the solvers DCA-CPM, DCA-LBM, PLM and the nonconvex solver HANSO. Moreover, the bundle solvers were successful in computing a critical point of problem (1) in all instances (except solver PBDC, which exceeded the CPU time limit when dealing with problem 4, \(n=750\)). Except for problem 10 with \({n=5}\), solvers PBM1, PBM3 and DCPCA found the best known function values of the considered problems.
Solver HANSO (for general nonconvex optimization problems) was the only solver that could globally solve problem 3 with \(n=5\). Overall, this solver performed a larger number of function/subgradient evaluations than the other solvers, which exploited the DC structure of the objective function. Solver DCPCA required fewer subgradient evaluations of \(g_1\) than solver PBM1 did; however, the former needed more subgradient evaluations of \(g_2\).
Notice also that when employing a convex cutting-plane model (solver PBM1) the total number of iterations and the computational burden do not necessarily increase when compared to PBM3, PBDC and DCPCA. Moreover, the quality of the final solution candidates is not deteriorated.
We summarize Table 1 in Fig. 1 by employing performance profiles [11]. For example, let the criterion be CPU time. For each solver, we plot the proportion of problems that are solved within a factor \(\gamma \) of the time required by the best algorithm. More specifically, denoting by \(t_s(p)\) the time spent by solver s to solve problem p and by \(t^*(p)\) the best time for the same problem among all the solvers, the proportion of problems solved by s within a factor \(\gamma \) is

$$\begin{aligned} \rho _s(\gamma ) = \frac{\left| \{p:\, t_s(p)\le \gamma \, t^*(p)\}\right| }{\hbox {number of problems}}. \end{aligned}$$
Therefore, the value \(\rho _s(1)\) gives the probability that solver s is the best with respect to a given criterion. Furthermore, unless \(t_s(p)=\infty \) (which means that solver s failed to solve problem p), it follows that \(\lim _{\gamma \rightarrow \infty }\rho _s(\gamma )=1\). Thus, the higher the line, the better the solver (by this criterion).
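For concreteness, the profile \(\rho_s(\gamma)\) can be computed from a matrix of per-problem timings; the numbers below are hypothetical.

```python
import numpy as np

# T[p, s] = time of solver s on problem p; np.inf would mark a failure
# (here all entries are finite, so the per-problem best is well defined).
def perf_profile(T, gammas):
    best = np.min(T, axis=1, keepdims=True)            # t*(p) for each problem
    ratios = T / best                                  # t_s(p) / t*(p)
    return np.array([[np.mean(ratios[:, s] <= g) for g in gammas]
                     for s in range(T.shape[1])])      # rows: rho_s over gammas

T = np.array([[1.0, 2.0],
              [3.0, 6.0],
              [5.0, 2.5]])
rho = perf_profile(T, gammas=[1.0, 2.0])
# solver 0 is best on 2 of 3 problems, so rho_0(1) = 2/3; both reach 1 at gamma = 2
```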
Figure 1 shows that the four proximal bundle solvers perform better than the other solvers in terms of both the number of subgradient evaluations of \(f_1\) and robustness. However, in terms of subgradient evaluations of \(f_2\), the DCA solvers DCA-CPM and DCA-LBM performed better. Overall, solver PBM1 presents a good compromise between CPU time and robustness.
6.2 Linearly-constrained DC programs
In this subsection, we consider two convex constrained DC programs obtained by approximating chance-constrained problems of the form
where \(\xi \in \varXi \subset {\mathbb {R}}^m\) is a random vector having probability measure \({\mathbb {P}}\), \(A\in {\mathbb {R}}^{s\times n}\) and \(b\in {\mathbb {R}}^s\). Parameter \(p \in (0,1)\) is a confidence level and \(c:{\mathbb {R}}^n\times \varXi \rightarrow {\mathbb {R}}\) is a given DC function (not necessarily differentiable). It is well known that the above problem may fail to be convex even when \(c(\cdot ,\cdot )\) is convex in both arguments. Moreover, evaluating the probability function at a given point involves computing a multidimensional integral. This is a difficult task when the random vector has large dimension [47], and therefore Monte-Carlo simulation is an important alternative for approximating \({\mathbb {P}}\). We refer to [19, 39, 44, 46] for more details on chance-constrained programming.
As discussed in [25], if \(c(\cdot ,\xi )\) is a DC function, the probability constraint \({\mathbb {P}}[c(x,\xi )\le 0]\ge p\) can be approximated by the DC constraint \({\mathbb {E}}[c(x,\xi )+t]^+ - {\mathbb {E}}[c(x,\xi )]^+ \le t (1-p)\), where \(t\approx 0\) is a positive parameter, \([a]^+:=\max \{a,0\}\) and \({\mathbb {E}}[\cdot ]\) is the expected value operator w.r.t. \({\mathbb {P}}\). By penalizing this constraint with \(\rho >0\) we get the following approximation of (24):
that can be written as a DC program
with \(f_1(x)=\langle q, x\rangle + \rho \max \left( {\mathbb {E}}[c(x,\xi )+t]^+ - t (1-p),\, {\mathbb {E}}[c(x,\xi )]^+\right) \) and \(f_2(x)= \rho {\mathbb {E}}[c(x,\xi )]^+\). The expectations \({\mathbb {E}}[c(x,\xi )+t]^+\) and \({\mathbb {E}}[c(x,\xi )]^+\) can be approximated by Monte-Carlo simulation [25]: in our numerical experiments we have used a fixed sample of 10,000 scenarios randomly generated according to the distribution of \(\xi \).
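A sampled check of the DC surrogate constraint is short once the scenarios are drawn. The sketch below uses a toy scalar constraint \(c(x,\xi)=\xi-x\) and a standard normal \(\xi\) (both hypothetical), with the 10,000-scenario sample size used in the text:

```python
import numpy as np

rng = np.random.default_rng(0)          # fixed seed for reproducibility
def c(x, xi):                           # toy DC constraint: c(x, xi) <= 0 iff xi <= x
    return xi - x

t, p, x = 0.05, 0.9, 1.5
xi = rng.normal(size=10_000)            # Monte-Carlo sample of the uncertainty
lhs = (np.mean(np.maximum(c(x, xi) + t, 0.0))
       - np.mean(np.maximum(c(x, xi), 0.0)))
feasible = lhs <= t * (1.0 - p)         # sampled DC surrogate of P[c <= 0] >= p
```

Here \(\mathbb{P}[\xi\le 1.5]\approx 0.933\ge 0.9\), so the surrogate accepts the point; the quantity `lhs / t` is the sampled smoothed approximation of \(\mathbb{P}[c(x,\xi )> 0]\).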
6.2.1 PlanToy: an academic chance-constrained planning problem
We consider a small example of a management problem coming from the energy industry. The problem consists of planning two fictitious refineries for producing two types of fuel to meet a demand that is deterministic in the first month of planning, but uncertain in the second one. The decision maker wishes to make an optimal decision on the amounts of petrol processed by the two refineries (\(x_1\) and \(x_2\) for the first month, and \(x_5\) and \(x_6\) for the second one), the stored petrol (\(x_3\) and \(x_7\)), and the imports (\(x_4\) and \(x_8\)). The storage and importation decisions must ensure that the second-month demand is satisfied with probability p. The fictitious planning problem (PlanToy) reads as
where \(c(x,\xi )=\max \{\xi _1 - 2x_5 -6x_6,\, \xi _2 - 3x_5 -2.8x_6\}\), and \(\xi =(\xi _1,\xi _2)\) is a random vector (of fuel demand) following a normal distribution with mean \({\mathbb {E}}[\xi ]=(193, 178)\) and covariance matrix
Under this assumption, the probability constraint \({\mathbb {P}}[c(x,\xi )\le 0]\ge p\) can be replaced with the convex one \(\phi (x)\le 0\), with \(\phi (x):= \log (p) - \log ({\mathbb {P}}[c(x,\xi )\le 0])\) [39]. For comparison purposes, the resulting convex optimization problem was solved using the matlab optimization routine fmincon. We employed Matlab’s statistical function mvncdf to evaluate the probability function:
Such a convex reformulation is possible thanks to the assumption on the probability distribution of the random vector \(\xi \). For more general (and possibly more realistic) distributions, PlanToy cannot be recast as a convex problem. We refer the interested reader to the textbook [39] for a comprehensive discussion on convexity of chance-constrained programs.
Regardless of convexity, evaluating the probability function is a difficult task in general. Therefore, the Monte-Carlo approach and the DC approximation discussed above become attractive in the large-scale setting.
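For readers reproducing this outside matlab, a Python analogue of the mvncdf evaluation is sketched below; the mean vector comes from the text, the covariance diagonal (9 and 10.4) matches the marginals stated for the individual constraints, and the off-diagonal entry is the coefficient Cov varied in the experiments.

```python
import numpy as np
from scipy.stats import multivariate_normal

# P[c(x, xi) <= 0] = P[xi_1 <= 2*x5 + 6*x6, xi_2 <= 3*x5 + 2.8*x6],
# a bivariate normal CDF under the distributional assumption of PlanToy.
def prob_satisfied(x5, x6, cov_off=0.0):
    mean = np.array([193.0, 178.0])
    cov = np.array([[9.0, cov_off], [cov_off, 10.4]])
    point = np.array([2.0 * x5 + 6.0 * x6, 3.0 * x5 + 2.8 * x6])
    return multivariate_normal(mean=mean, cov=cov).cdf(point)
```

The convex constraint used with fmincon is then \(\phi(x)=\log(p)-\log(\) this probability \()\le 0\).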
By varying the confidence level \(p \in \{0.5,\,0.6,\,0.7,\,0.8,\,0.9,\, 0.95\}\) and the covariance coefficient \(\mathtt{Cov}:=\mathtt{Cov}(\xi _1,\xi _2) \in \{-4.8, \, 0,\, 4.8 \}\) we ended up with 18 variants of PlanToy. Table 2 reports the results obtained by applying the solvers PBM1, PBM3 and PLM to these instances. Like PLM, both solvers DCA-CPM and DCA-LBM had a similar, unsatisfactory performance on these instances. Solvers HANSO and PBDC were not applied because they do not handle constrained problems.
As starting points for these solvers, we have considered the solution of the simpler individual chance-constrained program obtained from (26) by replacing the joint probability constraint \({\mathbb {P}}[c(x,\xi )\le 0] \ge p\) with the individual ones \({\mathbb {P}}_1[\xi _1 \le 2x_5 +6x_6]\ge p\) and \({\mathbb {P}}_2[\xi _2 \le 3x_5 +2.8x_6]\ge p\), where \(\xi _1\sim N(193,9)\) and \(\xi _2\sim N(178,10.4)\) due to the assumptions on the joint distribution of \(\xi =(\xi _1,\xi _2)\). By making use of p-quantiles (which can be easily computed by standard statistical software), the above constraints are in fact linear ones:
Therefore, problem (26) with \({\mathbb {P}}[c(x,\xi )\le 0] \ge p\) replaced with these two linear constraints is a mere linear programming problem approximating the nonlinear problem (26) (see [39] for more details). This is why we consider such an approximation only for computing an initial point for our solvers.
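The p-quantiles turning the individual chance constraints into linear ones are directly available in any statistics library; for instance, with scipy (interpreting the second parameter of \(N(\cdot ,\cdot )\) as a variance):

```python
from scipy.stats import norm

p = 0.9
q1 = norm.ppf(p, loc=193.0, scale=9.0 ** 0.5)    # p-quantile of xi_1 ~ N(193, 9)
q2 = norm.ppf(p, loc=178.0, scale=10.4 ** 0.5)   # p-quantile of xi_2 ~ N(178, 10.4)
# The individual constraints become the linear inequalities
#   2*x5 + 6*x6 >= q1   and   3*x5 + 2.8*x6 >= q2.
```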
Solver PLM failed to globally solve all the considered instances of PlanToy: the method stopped at a critical point of the penalized DC program (25) that is not even feasible. On the other hand, the proximal bundle solvers were successful.
We care to mention that the obtained function value has three sources of inaccuracy: (1) the approximation of the probability distribution by a sample of 10,000 scenarios, (2) the approximation of the characteristic function \(\mathbf{1}_{[0,\infty )}(z)\) by \(([z+t]^+ - [z]^+)/t\), and (3) the solvers’ tolerances. As a result, one cannot expect the obtained function values to coincide with the (approximate) optimal value \({\bar{f}}\). Nevertheless, the function values computed by solvers PBM1 and PBM3 are very close to the optimal values and the obtained solutions are (nearly) feasible: columns 7–9 of Table 2 report the probability of the computed point \({\bar{x}}\) satisfying the random constraint \(c(x,\xi )\le 0\).
Once again, PBM1 (with a convex model) was as precise as PBM3 (with a DC model) but faster than the latter on these instances of PlanToy.
6.2.2 A norm optimization problem with chance constraints
We now consider a DC approximation of the chance-constrained program
given in [25, Sect. 5.1]. This problem fits (24) with \(c(x,\xi )=\max _{i=1,\ldots ,10}\{\sum _{j=1}^n\xi _{ij}^2x_j^2 -100\}\), where \(\xi _{ij}\) (\(i=1,\ldots ,10\) and \(j=1,\ldots ,n\)) are independent and identically distributed standard normal random variables. The DC reformulation of this problem is
with \(f_1(x)=-\sum _{i=1}^n x_i +\rho \max ({\mathbb {E}}[c(x,\xi )+t]^+ - t(1-p),\, {\mathbb {E}}[c(x,\xi )]^+)\) and \(f_2(x)= \rho \,{\mathbb {E}}[c(x,\xi )]^+ \).
In order to approximate the expectation \({\mathbb {E}}\), we have used Monte-Carlo simulation with \(N=10{,}000\) scenarios. As discussed in [25], when \(\xi _{ij}\) are independent and identically distributed standard normal random variables the global solution and optimal value of (27) are, respectively,
where \(F_{\chi _n^2}^{-1}\) denotes the inverse distribution function of a Chi-square distribution with n degrees of freedom. Table 3 reports some results obtained by applying the six solvers to 18 instances of (27).
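For reference, the quoted closed form can be evaluated with a few lines of scipy. The expression below is our hedged reconstruction of the display quoted from [25]: by symmetry \(\bar{x}=a\,\mathbf{1}\) with \(a=\sqrt{100/F_{\chi _n^2}^{-1}(p^{1/10})}\), since the 10 rows of \(\xi\) are independent and each row constraint reads \(a^2\chi _n^2\le 100\), and then \(\bar{f}=-n\,a\).

```python
from scipy.stats import chi2

# Hedged reconstruction of the closed-form solution of (27): each of the 10
# independent row constraints must hold with probability p**(1/10).
def norm_problem_solution(n, p):
    a = (100.0 / chi2.ppf(p ** 0.1, df=n)) ** 0.5   # common value of all x_bar_j
    return a, -n * a                                 # (entry of x_bar, f_bar)

a, fbar = norm_problem_solution(n=10, p=0.9)
```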
Since computing the probability \({\mathbb {P}}[c(x,\xi )\le 0]\) exactly is a difficult task, we do not report in Table 3 the probability value \({\mathbb {P}}[c({\bar{x}},\xi )\le 0]\) at the obtained candidate solutions. However, we estimated this probability by the DC approximation mentioned above. We observed that in all the considered instances the value \(1-({\mathbb {E}}[c({\bar{x}},\xi )+t]^+- {\mathbb {E}}[c({\bar{x}},\xi )]^+)/{t}\) coincides with the given confidence level p for the solvers PBM1 and PBM3. This ensures that the provided candidate solutions are feasible for the DC formulation above and, therefore, nearly feasible for the considered norm optimization problem with chance constraints.
Once again, we cannot expect the obtained function values to coincide with the optimal value \({\bar{f}}\), for the reasons previously mentioned. However, we can see from Table 3 that the bundle solvers provided good estimates of the optimal values. Solvers DCA-LBM and PLM computed critical points for all problem instances; however, they failed to compute global solutions. In order to apply HANSO, we dropped the (inactive) constraint \(x\ge 0\). Even so, HANSO could solve to optimality only the smaller instances and find a critical point for the medium-size ones, but could not solve the larger instances within a time limit of 1 h. The reason is that HANSO requires many function evaluations, and the considered function requires Monte-Carlo simulation and is therefore expensive to evaluate. Table 4 reports the number of subgradient evaluations.
6.3 Quadratically-constrained DC programs
We now consider the following quadratically-constrained DC program (QCDC)
with given data \(K \in {\mathbb {R}}_+\), \(\alpha \in {\mathbb {R}}^n_+\) and \(c^j \in {\mathbb {R}}^n\), \(j=1,\ldots ,m\). The optimal value (and a solution) of the above QCDC can be found by solving individually the m quadratically-constrained quadratic programs (QCQPs)
and taking \({\bar{f}} = \displaystyle \min _{j\in \{1,\ldots ,m\}} v_j\). Problem (28) has the following DC representation
with
Several instances of problem (28) were generated by the following scheme: we set \(K=10\), \(m=5\), and drew each \(\alpha _i\) uniformly at random from \([0,1]\), \(i=1,\ldots ,n\). The vectors \(c^j\), \(j=1,\ldots ,m\), were constructed in such a way that no \(c^j\) is feasible for (28):
where each \({\tilde{c}}_i^j\), \(i=1,\ldots ,n\), \(j=1,\ldots ,m\), was drawn from a standard normal distribution and each \(s_j\) uniformly from the set \(\{1,2,3,4,5\}\).
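The generation scheme just described can be sketched as follows. Note that the display defining \(c^j\) from \({\tilde{c}}^j\) and \(s_j\) is not reproduced in this excerpt, so the normalization used below (rescaling \({\tilde{c}}^j\) to length \(s_j\)) is an illustrative stand-in, not the paper's exact formula:

```python
import numpy as np

def make_qcdc_instance(n, m=5, K=10, seed=0):
    """Random data (K, alpha, c, s) for an instance of problem (28).

    The exact mapping (c_tilde^j, s_j) -> c^j is not shown in the excerpt;
    the length-s_j rescaling below is an assumed stand-in.
    """
    rng = np.random.default_rng(seed)
    alpha = rng.uniform(0.0, 1.0, size=n)          # alpha_i drawn uniformly from [0, 1]
    c_tilde = rng.standard_normal((m, n))          # c_tilde_i^j standard normal
    s = rng.integers(1, 6, size=m)                 # s_j uniform on {1, ..., 5}
    c = s[:, None] * c_tilde / np.linalg.norm(c_tilde, axis=1, keepdims=True)
    return K, alpha, c, s

K, alpha, c, s = make_qcdc_instance(n=50)
print(K, alpha.shape, c.shape)                     # -> 10 (50,) (5, 50)
```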
For the QCDC problem (28), the proximal bundle method subproblem (9) becomes a QCQP (once again with the choice \(D(x,{\hat{x}}):=\Arrowvert x-{\hat{x}} \Arrowvert _2^2/2\)) that can be solved by specialized algorithms. In this study we applied Gurobi. Table 5 reports some numerical results obtained by applying solvers PBM1, PBM3, DCA-CPM and PLM to some instances of (28).
Solver DCA-CPM (respectively PLM) solves # \(g_2\) linear (respectively quadratic) optimization subproblems with quadratic constraints per iteration. Solver PBM1 (respectively PBM3) requires solving # \(g_1\) (respectively three times # \(g_1\)) QCQPs. This explains why PBM1 is faster than the other considered solvers. Once again, the proximal bundle solvers were successful in computing global solutions.
6.4 Computing d-stationary points
In this section, we illustrate the performance of Algorithm 2 on two additional test problems whose second DC component is the pointwise maximum of finitely many differentiable functions. We start with the following two-dimensional problem
with \(f_1(x)=1.2\max \{0.1x_1^2 + 0.005x_2^2,\, 0.005x_1^2 + 0.1x_2^2\}\) and \(f_2(x)=\max \{-x_1,\,-0.3x_2,\,0\}\). Note that the point \({\tilde{x}}=(0,0)\) is critical for this problem; however, \({\tilde{x}}\) is not d-stationary, since small negative perturbations of its components yield smaller function values. Figure 2a presents eight different sequences obtained by PBM1 initialized with eight different starting points. In Fig. 2b we consider PBM-d with the same initial points. Notice that PBM-d does not stop at the critical point \({\tilde{x}}=(0,0)\): near \({\tilde{x}}\) the algorithm solves two QPs (19) and can escape from \({\tilde{x}}\).
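Since the DC components of this example are given explicitly, the claim that \({\tilde{x}}=(0,0)\) is critical but not d-stationary is easy to check numerically: the value at the origin is zero, while a small negative perturbation of both components already decreases the objective (the perturbation size below is an arbitrary illustrative choice):

```python
import numpy as np

def f(x):
    """Objective of problem (29): f = f1 - f2 with the DC components given above."""
    f1 = 1.2 * max(0.1 * x[0]**2 + 0.005 * x[1]**2,
                   0.005 * x[0]**2 + 0.1 * x[1]**2)
    f2 = max(-x[0], -0.3 * x[1], 0.0)
    return f1 - f2

# At (0, 0): f1 vanishes quadratically, while f2 grows linearly along the
# direction (-1, -1), so a small negative perturbation lowers f.
eps = 1e-2
print(f(np.zeros(2)), f(np.array([-eps, -eps])) < 0.0)   # -> 0.0 True
```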
A comparison of PBM-d with other solvers is given in Table 6. We do not report CPU times because all the solvers computed a critical point in less than half a second. Like PBM-d, solver DCA-LBM also computed d-stationary points for problem (29) regardless of the initial point.
We have also considered an unconstrained variant of problem (29). For the same starting points, the obtained results are similar to those presented in Table 6, with the difference that the global solution is \({\bar{x}} \approx (-4.1667,\,0.000)\) (with optimal value approximately \(-2.08333\)). In this setting, we have also considered the solver PBDC, which succeeded in computing the global solution for every starting point of Table 6. However, if one starts near zero [e.g. \(x^0=(0.001,0.001)\)], then PBDC converges to the critical point \((0,0)\), whereas PBM-d always converges to a d-stationary point (either the global or the local solution of the problem).
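The reported optimal value of the unconstrained variant can be confirmed directly from the DC components stated above. Taking \(-25/6\) as the exact value behind the reported \(-4.1667\) (an assumption consistent with the printed precision):

```python
def f(x1, x2):
    """Objective of the unconstrained variant of problem (29)."""
    f1 = 1.2 * max(0.1 * x1**2 + 0.005 * x2**2, 0.005 * x1**2 + 0.1 * x2**2)
    f2 = max(-x1, -0.3 * x2, 0.0)
    return f1 - f2

# Candidate minimizer (-25/6, 0): f1 = 0.12 * (25/6)^2 = 25/12, f2 = 25/6,
# so f = -25/12, matching the reported value -2.08333.
x_bar = (-25.0 / 6.0, 0.0)
print(round(f(*x_bar), 5))   # -> -2.08333
```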
Finally, we consider the following test problem
which has the DC decomposition
Its optimal value is \(-4\), attained at \({\bar{x}}\) with \({\bar{x}}_i = (-1)^{i+1}\) for all indices except one even index j, for which \({\bar{x}}_j = -3\). A critical point (but not a d-stationary one) is \({\tilde{x}}\) with \({\tilde{x}}_i = (-1)^{i+1}\) for all indices except one odd index j, for which \({\tilde{x}}_j = -1\). At this point, the function value is zero. Table 7 reports some numerical experiments obtained by applying seven solvers to problem (30).
Note that solvers PBM1, PBM3 and DCA-LBM computed critical points that are not d-stationary for problem (30). On the other hand, solvers PBM-d, PLM and HANSO succeeded in computing global solutions for all instances. Solver PBDC also performed very well, computing global solutions in six out of nine cases.
7 Concluding remarks
Inspired by proximal linearized methods, this work proposes and analyzes two proximal bundle algorithms for convex-constrained nonsmooth DC programs. Neither line searches nor estimates of Lipschitz constants of the DC components are required by the algorithms, which possess a reliable and straightforward stopping test. Moreover, the given algorithms employ convex models of the underlying DC function. The first algorithm is shown to generate a subsequence of points converging to a critical point of the underlying DC problem. The second algorithm is proved to generate a sequence of descent steps converging to a d-stationary point, the sharpest stationarity notion in DC programming. However, this requires the assumption that the second DC component \(f_2\) is the pointwise maximum of N differentiable functions. The price for this stronger result is the solution of possibly several (but no more than N) master programs at certain iterations.
Numerical experiments on approximately one hundred instances of different academic DC programs indicate that employing a convex model for approximating the DC objective function can be a simpler alternative to DC models for dealing with DC problems via proximal bundle methods.
Notes
1. We have used \(\rho =10^4\times n\) in our numerical experiments.
2. Probability computed using Matlab's mvncdf function.
References
Astorino, A., Miglionico, G.: Optimizing sensor cover energy via DC programming. Optim. Lett. 10(2), 355–368 (2016)
Bagirov, A.M.: A method for minimization of quasidifferentiable functions. Optim. Methods Softw. 17(1), 31–60 (2002)
Bagirov, A.M., Yearwood, J.: A new nonsmooth optimization algorithm for minimum sum-of-squares clustering problems. Eur. J. Oper. Res. 170(2), 578–596 (2006)
Ben-Tal, A., Nemirovski, A.: Non-Euclidean restricted memory level method for large-scale convex optimization. Math. Program. 102, 407–456 (2005)
Bonnans, J., Gilbert, J., Lemaréchal, C., Sagastizábal, C.: Numerical Optimization: Theoretical and Practical Aspects, 2nd edn. Springer, Berlin (2006)
Clarke, F.H.: Optimization and Nonsmooth Analysis. SIAM, Philadelphia (1990). https://doi.org/10.1137/1.9781611971309
Cruz Neto, J.X., Oliveira, P.R., Soubeyran, A., Souza, J.C.O.: A generalized proximal linearized algorithm for DC functions with application to the optimal size of the firm problem. Ann. Oper. Res. (2018). https://doi.org/10.1007/s10479-018-3104-8
de Oliveira, W., Solodov, M.: A doubly stabilized bundle method for nonsmooth convex optimization. Math. Program. 156(1), 125–159 (2016)
de Oliveira, W.: Target radius methods for nonsmooth convex optimization. Oper. Res. Lett. 45(6), 659–664 (2017)
de Oliveira, W., Sagastizábal, C., Lemaréchal, C.: Convex proximal bundle methods in depth: a unified analysis for inexact oracles. Math. Program. Ser. B 148, 241–277 (2014)
Dolan, E.D., Moré, J.J.: Benchmarking optimization software with performance profiles. Math. Program. 91, 201–213 (2002)
Frangioni, A.: Generalized bundle methods. SIAM J. Optim. 13(1), 117–156 (2002)
Frangioni, A., Gorgone, E.: Generalized bundle methods for sum-functions with “easy” components: applications to multicommodity network design. Math. Program. 145(1), 133–161 (2014)
Fuduli, A., Gaudioso, M., Giallombardo, G.: A DC piecewise affine model and a bundling technique in nonconvex nonsmooth minimization. Optim. Methods Softw. 19(1), 89–102 (2004)
Gaudioso, M., Giallombardo, G., Miglionico, G.: Minimizing piecewise-concave functions over polyhedra. Math. Oper. Res. 43(2), 580–597 (2018)
Gaudioso, M., Giallombardo, G., Miglionico, G., Bagirov, A.M.: Minimizing nonsmooth DC functions via successive DC piecewise-affine approximations. J. Glob. Optim. 71(1), 37–55 (2018)
Hare, W., Sagastizábal, C.: A redistributed proximal bundle method for nonconvex optimization. SIAM J. Optim. 20(5), 2442–2473 (2010)
Hare, W., Sagastizábal, C., Solodov, M.: A proximal bundle method for nonsmooth nonconvex functions with inexact information. Comput. Optim. Appl. 63(1), 1–28 (2016)
Henrion, R.: A Critical Note on Empirical (Sample Average, Monte Carlo) Approximation of Solutions to Chance Constrained Programs (Chapter 3 in [24]). Springer, Berlin (2013)
Hiriart-Urruty, J.B., Lemaréchal, C.: Convex Analysis and Minimization Algorithms I. Grundlehren der mathematischen Wissenschaften, vol. 305, 2nd edn. Springer, Berlin (1996)
Hiriart-Urruty, J.B., Lemaréchal, C.: Convex Analysis and Minimization Algorithms II. Grundlehren der mathematischen Wissenschaften, vol. 306, 2nd edn. Springer, Berlin (1996)
Hiriart-Urruty, J.B.: Generalized Differentiability/Duality and Optimization for Problems Dealing with Differences of Convex Functions, pp. 37–70. Springer, Berlin, Heidelberg (1985)
Holmberg, K., Tuy, H.: A production–transportation problem with stochastic demand and concave production costs. Math. Program. 85(1), 157–179 (1999)
Hömberg, D., Tröltzsch, F. (eds.): System Modeling and Optimization. IFIP Advances in Information and Communication, vol. 391. Springer, Berlin (2013)
Hong, L.J., Yang, Y., Zhang, L.: Sequential convex approximations to joint chance constrained programs: A Monte Carlo approach. Oper. Res. 59(3), 617–630 (2011)
Joki, K., Bagirov, A., Karmitsa, N., Mäkelä, M.M., Taheri, S.: Double bundle method for finding Clarke stationary points in nonsmooth DC programming. SIAM J. Optim. 28(2), 1892–1919 (2018)
Joki, K., Bagirov, A.M., Karmitsa, N., Mäkelä, M.M.: A proximal bundle method for nonsmooth DC optimization utilizing nonconvex cutting planes. J. Glob. Optim. 68(3), 501–535 (2017)
Kelley, J.E.: The cutting-plane method for solving convex programs. J. Soc. Ind. Appl. Math. 8(4), 703–712 (1960)
Khalaf, W., Astorino, A., d’Alessandro, P., Gaudioso, M.: A DC optimization-based clustering technique for edge detection. Optim. Lett. 11(3), 627–640 (2017)
Kiwiel, K.C.: A proximal bundle method with approximate subgradient linearizations. SIAM J. Optim. 16(4), 1007–1023 (2006)
Le Thi, H.A., Tao, P.D.: DC programming in communication systems: challenging problems and methods. Vietnam J. Comput. Sci. 1(1), 15–28 (2014)
Le Thi, H.A., Pham Dinh, T., Ngai, H.V.: Exact penalty and error bounds in DC programming. J. Glob. Optim. 52(3), 509–535 (2012)
Lemaréchal, C.: An algorithm for minimizing convex functions. Inf. Process. 1, 552–556 (1974)
Lemaréchal, C., Nemirovskii, A., Nesterov, Y.: New variants of bundle methods. Math. Program. 69(1), 111–147 (1995)
Lewis, A.S., Overton, M.L.: Nonsmooth optimization via quasi-Newton methods. Math. Program. 141(1), 135–163 (2013)
Mäkelä, M.M., Miettinen, M., Lukšan, L., Vlček, J.: Comparing nonsmooth nonconvex bundle methods in solving hemivariational inequalities. J. Glob. Optim. 14(2), 117–135 (1999)
Noll, D., Apkarian, P.: Spectral bundle methods for non-convex maximum eigenvalue functions: first-order methods. Math. Program. 104(2), 701–727 (2005)
Pang, J.S., Razaviyayn, M., Alvarado, A.: Computing B-stationary points of nonsmooth DC programs. Math. Oper. Res. 42(1), 95–118 (2017)
Prékopa, A.: Stochastic Programming. Kluwer, Dordrecht (1995)
Rockafellar, R.: Convex Analysis, 1st edn. Princeton University Press, Princeton (1970)
Souza, J.C.O., Oliveira, P.R., Soubeyran, A.: Global convergence of a proximal linearized algorithm for difference of convex functions. Optim. Lett. 10(7), 1529–1539 (2016)
Tao, P.D., Le Thi, H.A.: Convex analysis approach to DC programming: theory, algorithms and applications. Acta Math. Vietnam. 22(1), 289–355 (1997)
Tuy, H.: Convex Analysis and Global Optimization. Springer Optimization and Its Applications, 2nd edn. Springer, Berlin (2016)
van Ackooij, W.: Eventual convexity of chance constrained feasible sets. Optimization 64(5), 1263–1284 (2015)
van Ackooij, W., Cruz, J.B., de Oliveira, W.: A strongly convergent proximal bundle method for convex minimization in Hilbert spaces. Optimization 65(1), 145–167 (2016)
van Ackooij, W., Henrion, R.: Gradient formulae for nonlinear probabilistic constraints with Gaussian and Gaussian-like distributions. SIAM J. Optim. 24(4), 1864–1889 (2014)
van Ackooij, W., de Oliveira, W.: Level bundle methods for constrained convex optimization with various oracles. Comput. Optim. Appl. 57(3), 555–597 (2014)
van Ackooij, W., Sagastizábal, C.: Constrained bundle methods for upper inexact oracles with application to joint chance constrained energy problems. SIAM J. Optim. 24(2), 733–765 (2014)
Acknowledgements
The author is grateful to the reviewers for their remarks and constructive suggestions that considerably improved the original version of this article.
A Appendix
A.1 A self-contained analysis of the sequence of infinitely many null steps generated after the last serious step
We assume that after the \({\hat{\ell }}\mathrm{th}\)-stability center \(x^{k({\hat{\ell }})} = {\hat{x}}\) only null steps are performed, i.e.,
where \({\hat{g}}_2 = g_2^{k({\hat{\ell }})}\) is the last subgradient computed for \(f_2\). Notice that in this case the sequence \(\{\mu _k\}_{k\ge k({\hat{\ell }})}\) is nondecreasing. In what follows we present some auxiliary results to prove Lemma 4. We start by defining the following two useful functions:
Notice that \(F^{-k}\) is twice differentiable (because \(\omega \) defining D is so):
where \(\nabla ^2 \omega (x) \in {\mathbb {R}}^{n\times n}\) is the Hessian of the function \(\omega \). Since \(\omega \) is strongly convex then \(\nabla ^2 \omega (x)\) is positive definite for all \(x \in {\mathbb {R}}^n\). It follows from (10) that \(\nabla F^{-k}(x^{k+1})=0\), i.e., the point \(x^{k+1}\) is the unique minimizer of \( F^{-k}(x)\) over \({\mathbb {R}}^n\). The Taylor and mean value Theorems [5, Sect. 13] give, for some \(z = \lambda x^{k+1}+(1-\lambda ){\hat{x}}\) and \(\lambda \in [0,1]\),
where the inequality is due to the assumption that \(\omega \) is strongly convex with parameter \(\rho >0\) and norm \(\Arrowvert \cdot \Arrowvert _p\) in (5) (so that \(\langle \nabla ^2 \omega (z)(x-x^{k+1}),x-x^{k+1}\rangle \ge \rho \Arrowvert x-x^{k+1} \Arrowvert _p^2\)), and the last equality follows from (31) and (32). The above development is crucial to show the following lemma, which is essentially a reformulation of [10, Lemma 6.3] to our setting.
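The lower bound just derived, \(F^{-k}(x) \ge F^{-k}(x^{k+1}) + \frac{\mu _k \rho }{2}\Arrowvert x-x^{k+1} \Arrowvert _p^2\), can be illustrated numerically in the simplest setting \(\omega (x)=\tfrac{1}{2}\Arrowvert x \Arrowvert _2^2\) (so \(\rho =1\), with \(\mu _k\) absorbed into \(\rho \)). The randomly generated strongly convex quadratic below is a stand-in for \(F^{-k}\), whose definition is not reproduced in this excerpt:

```python
import numpy as np

rng = np.random.default_rng(1)
n, rho = 5, 1.0

# A strongly convex quadratic standing in for F^{-k}: its Hessian A = M M^T + rho*I
# satisfies A >= rho * I, mimicking strong convexity with parameter rho.
M = rng.standard_normal((n, n))
A = M @ M.T + rho * np.eye(n)
b = rng.standard_normal(n)
F = lambda x: 0.5 * x @ A @ x + b @ x
x_star = np.linalg.solve(A, -b)          # unique minimizer (gradient A x + b = 0)

# Check F(x) >= F(x*) + (rho/2) * ||x - x*||^2 at a random test point.
x = rng.standard_normal(n)
print(F(x) >= F(x_star) + 0.5 * rho * np.linalg.norm(x - x_star) ** 2)   # -> True
```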
Lemma 7
Let \({\hat{x}}= x^{k({\hat{\ell }})}\) be the last stability center generated by Algorithm 1, at iteration \(k({\hat{\ell }})\), after which only null steps are performed. Assume also that for \(k\ge k({\hat{\ell }})\) the function \(F^k\) is the model given in (31) and \(x^{k+1}\) is an iterate obtained from a null step. If \(\{\mu _k\}_{k\ge k({\hat{\ell }})}\) is nondecreasing, then
(i) the sequence \(\{ F^{k}(x^{k+1})\}_{k\ge k({\hat{\ell }})}\) is nondecreasing and satisfies
$$\begin{aligned} F^{k}(x^{k+1})+\frac{\mu _k \rho }{2}\Arrowvert x^{k+2}-x^{k+1} \Arrowvert ^2_p \le F^{k+1}(x^{k+2}) \;\hbox { for all } k\ge k({\hat{\ell }}); \end{aligned}$$
(ii) the sequence \(\{F^{k}(x^{k+1})\}_{k\ge k({\hat{\ell }})}\) is bounded from above:
$$\begin{aligned} F^{k}(x^{k+1})+\frac{\mu _k \rho }{2}\Arrowvert {\hat{x}}-x^{k+1} \Arrowvert ^2_p \le f_1({\hat{x}}) \;\hbox { for all } k\ge k({\hat{\ell }}); \end{aligned}$$
(iii) the following inequality holds for all \(k\ge k({\hat{\ell }})\):
$$\begin{aligned} {\check{f}}_1^{k}(x^{k+1})-{\check{f}}_1^{k-1}(x^k)\le & {} F^{k}(x^{k+1}) -F^{k-1}(x^k) + \mu _{k-1}[D(x^k,{\hat{x}}) \\&- D(x^{k+1},{\hat{x}})] -\langle {\hat{g}}_2,x^k-x^{k+1}\rangle . \end{aligned}$$
Proof
Algorithm 1 ensures that the aggregate index \(-k\) enters the bundle in every null step. In particular \(-k \in {\mathcal {B}}_1^{{k+1}}\) for all \(k\ge k({\hat{\ell }})\). Then for all \(x \in X\) and all \(k\ge k({\hat{\ell }})\) we have that
where the first inequality is due to \(s^{k+1}\in N_X(x^{k+1})\), the second follows from inequality \(\bar{f}_1^{-k}{(\cdot )}\le {\check{f}}_1^{k+1}{(\cdot )}\) ensured because \(-k \in {\mathcal {B}}_1^{{k+1}}\), and the last inequality follows from the assumption \(\mu _{k+1}\ge \mu _{k}\). Set \(x=x^{k+2}\) in (33) to obtain (i), and \(x={\hat{x}}\) to obtain \(F^{k}(x^{k+1}) + \frac{\mu _k\rho }{2}\Arrowvert {\hat{x}}-x^{k+1} \Arrowvert ^2_p\le F^{-k}({\hat{x}}) \le F^{k+1}({\hat{x}})= {\check{f}}_1^{k+1}({\hat{x}}) \le f_1({\hat{x}})\). To show (iii), note that for all \(k\ge k({\hat{\ell }})\)
where the inequality is due to \(\mu _k\ge \mu _{k-1}\) and the last equality follows from (31) (with k therein replaced with \(k-1\)). The result thus follows. \(\square \)
Given the above properties, the following lemma shows that the cutting-plane model \({\check{f}}_1^k\) asymptotically approximates the DC component \(f_1\) on the sequence of null iterates.
Lemma 8
Under the assumptions of Lemma 7, Algorithm 1 ensures that \(\{x^k\}_{{k}}\) is a bounded sequence and
Proof
Lemma 7(i) ensures that the sequence \(\{F^k(x^{k+1})\}_{{k\ge k({\hat{\ell }})}}\) is nondecreasing. Thus, there exists a constant \(C>0\) such that \(F^k(x^{k+1}) \ge -C\) for all \(k\ge k({\hat{\ell }})\). Using Lemma 7(ii) we conclude that
showing that the sequence \(\{\Arrowvert {\hat{x}}-x^k \Arrowvert _p\}_{k\ge k({\hat{\ell }})}\) is bounded because \(\{\mu _k\}_{k\ge k({\hat{\ell }})}\) is nondecreasing. Accordingly, \(\{x^k\}_{k}\) is also bounded. It follows from Lemma 7(ii) that the sequence \(\{F^k(x^{k+1})\}_{k\ge k({\hat{\ell }})}\) is bounded from above by \(f_1({\hat{x}})\). Lemma 7(i) shows that \(\{F^k(x^{k+1})\}_{k\ge k({\hat{\ell }})}\) is nondecreasing and hence
where the second limit follows from Lemma 7(i) and the assumption that \(\{\mu _k\}_{{k\ge k({\hat{\ell }})}}\) is nondecreasing. Note that, by (5),
Since \(\{\mu _k\}_{{k}}\) is a bounded sequence and \(\omega \) is a continuous function, we conclude that
The inclusion \(k \in {\mathcal {B}}_1^k\) implies \( f_1(x^k)+\langle g_1^k,x-x^k\rangle =\bar{f}_1^k(x)\le {\check{f}}_1^k(x) \; \hbox { for all }x\in {\mathbb {R}}^n. \) Setting \(x=x^{k+1}\) in this inequality yields \( f_1(x^k)=\bar{f}_1^k(x^{k+1}) + \langle g_1^k,x^k-x^{k+1}\rangle \). Therefore, for \(k>k({\hat{\ell }})\)
where the last inequality is due to Lemma 7(iii). Passing to the limit as \(k\rightarrow \infty \) in the above inequalities, taking into account (35) and (36), and remembering that \(\{x^k\}_{k}\) and \(\{g_1^k\}_{k}\) are bounded sequences, we conclude that \(\limsup _{k\rightarrow \infty } [f_1(x^k)-{\check{f}}_1^{k-1}(x^k)]\le 0\). Since \(f_1\) is convex we have \(f_1(x^k)\ge {\check{f}}_1^{k-1}(x^k)\), and the result follows. \(\square \)
A.2 Proof of Lemma 4
Suppose that \({\hat{x}}\) is not a cluster point of \(\{x^{k+1}\}_{{k\ge k({\hat{\ell }})}}\). Then there would exist \(\epsilon >0\) and an index \({\tilde{k}}\ge k({\hat{\ell }})\) such that
for all indices \(k+1\ge {\tilde{k}}\). It follows from Lemma 8 that there exists an index \({\bar{k}}\ge k({\hat{\ell }})\) such that
The definition of \(x^{k+1}\), the feasibility of \({\hat{x}}\), and the inequality \({\check{f}}_1^k(\cdot )\le f_1(\cdot )\) yield the inequality
Since by assumption \(\kappa \in (0,1)\) and \(\mu _k \ge {\underline{\mu }}\), taking \(k+1 > \max \{{\tilde{k}},\,{\bar{k}}\}\) we get
showing that \(x^{k+1}\) satisfies the descent test (13), i.e. \(x^{k+1}\) becomes the new stability center. But this contradicts the fact that \({\hat{x}}\) is the last stability center. Hence, the sequence \(\{x^{k+1}\}_{{k}}\) has a subsequence that converges to \({\hat{x}}\), i.e., \(\lim _{k\in {\mathcal {K}}} x^{k} ={\hat{x}}\) for some index set \({\mathcal {K}}\subset {\{k({\hat{\ell }})+1,k({\hat{\ell }})+2,\ldots \}}\). We now proceed to show that indeed the whole sequence converges to \({\hat{x}}\): it follows from (31) that
and, therefore, \(\lim _{k\in {\mathcal {K}}} F^{k-1}(x^k) = f_1({\hat{x}})\) from Lemma 8. Lemma 7(i) shows that \(\{F^{k-1}(x^k)\}_{{k\ge k({\hat{\ell }})}}\) is nondecreasing and hence \(\lim _{k\rightarrow \infty } F^{k}(x^{k+1}) =\lim _{k\in {\mathcal {K}}} F^{k-1}(x^k)= f_1({\hat{x}})\). This property combined with (34) shows that the whole sequence \(\{x^k\}_{{k}}\) converges to \({\hat{x}}\). \(\square \)
Cite this article
de Oliveira, W. Proximal bundle methods for nonsmooth DC programming. J Glob Optim 75, 523–563 (2019). https://doi.org/10.1007/s10898-019-00755-4