Abstract
A method, called an augmented subgradient method, is developed to solve unconstrained nonsmooth difference of convex (DC) optimization problems. At each iteration of this method, search directions are found using several subgradients of the first DC component and one subgradient of the second DC component of the objective function. The method applies an Armijo-type line search procedure to find the next iteration point. It is proved that the sequence of points generated by the method converges to a critical point of the unconstrained DC optimization problem. The performance of the method is demonstrated using academic test problems with nonsmooth DC objective functions and compared with that of two general nonsmooth optimization solvers and five solvers specifically designed for unconstrained DC optimization. Computational results show that the developed method is efficient and robust for solving nonsmooth DC optimization problems.
1 Introduction
Consider an unconstrained optimization problem
where \(f:\mathbb {R}^n\rightarrow \mathbb {R}\) is, in general, nonsmooth and is expressed as a difference of two convex functions \(f_1,f_2:\mathbb {R}^n\rightarrow \mathbb {R}\):
Some important practical problems can be formulated as a nonsmooth DC program of the form (1). They include supervised data classification [6], cluster analysis [11, 32], clusterwise linear regression analysis [10, 12] and edge detection [27] problems. Moreover, the DC optimization methods can be applied to solve a broad class of nonsmooth nonconvex optimization problems [38].
Different stationarity conditions, such as inf-stationarity (also called d-stationarity), Clarke stationarity and criticality conditions, have been studied for unconstrained DC minimization problems [25]. In the paper [33], the notion of B-stationary points of nonsmooth DC problems was defined and its relation with the criticality condition was investigated.
The special structure of DC problems stimulates the design of efficient methods that utilize this structure. To date, DC optimization problems have been considered in the context of both local and global optimization. Several methods have been developed to solve these problems globally [23, 37]. To the best of our knowledge, the difference of convex algorithm (DCA) is the first local search algorithm for solving DC optimization problems. This algorithm was introduced in [35] and further studied in [4, 5].
Over the last decade the development of local search methods for unconstrained nonsmooth DC optimization has attracted noticeable attention. A short survey of such methods can be found in [19]. Broadly, methods for local DC optimization can be divided into three groups. The first group consists of methods which are extensions of bundle methods for convex problems. They include the codifferential method [13], the proximal bundle method with the concave affine model [22], the proximal bundle method for DC optimization [17, 24] and the double bundle method [25]. In these methods piecewise linear underestimates of both DC components, or subgradients of these components, are used to compute search directions.
The second group consists of the DCA, its modifications and extensions. Although the DCA performs well in practice, its convergence can be fairly slow for some particular problems. Various modifications of the DCA have been developed to improve its convergence. The inertial DCA was developed in [20]. In papers [2, 3], the boosted DC algorithm (BDCA) was proposed to minimize smooth DC functions. The BDCA accelerates the convergence of the DCA using an additional line search step. More precisely, it performs a line search at the point generated by the DCA which leads to a larger decrease in the objective value at each iteration. In [1], the BDCA is combined with a simple derivative free optimization method which allows one to force the d-stationarity (lack of descent direction) at the obtained point. To avoid the difficulty of solving the DCA’s subproblem, in [18] the first DC component is replaced with a convex model and the second DC component is used without any approximation.
The third group consists of algorithms that at each iteration apply the convex piecewise linear model of the first DC component and one subgradient of the second DC component to compute search directions. These algorithms differ from the DCA-type algorithms as at each iteration they update the model of the first DC component and calculate the new subgradient of the second component. They are also distinctive from methods of the first group because they use one subgradient of the second DC component whereas the latter methods may use more than one subgradient of this component to build its piecewise linear model. A representative of this group is the aggregate subgradient method for DC optimization developed in [9]. In this algorithm the aggregate subgradient of the first DC component and one subgradient of the second component are used to compute search directions.
In this paper, we introduce a new method which belongs to the third group. First, we define augmented subgradients of convex functions. Using them, we construct the model of the first DC component. This model and one subgradient of the second DC component are utilized to compute search directions. Then we design a new method which uses these search directions and applies an Armijo-type line search. We prove that the sequence of points generated by this method converges to critical points of the DC function. We utilize nonsmooth DC optimization test problems, including those with a large number of variables, to demonstrate the performance of the developed method and to compare it with two general methods of nonsmooth optimization and five methods of nonsmooth DC optimization. Our results show that, in general, the developed method is more robust than the other methods used in the numerical experiments.
The rest of the paper is organized as follows. In Sect. 2, we provide necessary notations and some preliminaries on the nonsmooth analysis and DC optimization. In Sect. 3, the new method is introduced and its convergence is studied. Results of numerical experiments are reported in Sect. 4, and concluding remarks are given in Sect. 5.
2 Preliminaries
In this section we provide the notations that will be used throughout the paper and recall some concepts of nonsmooth analysis and DC optimization. For more details on nonsmooth analysis, we refer to [7, 8].
In what follows, \(\mathbb {R}^n\) is the n-dimensional Euclidean space, \(u^Tv =\sum _{i=1}^n u_i v_i\) is the inner product of vectors \(u,v\in \mathbb {R}^n\), and \(\Vert \cdot \Vert \) is the associated Euclidean norm. The origin of \(\mathbb {R}^n\) is denoted by \(0_n\). The unit sphere is denoted by \(S_1=\{d \in \mathbb {R}^n:\Vert d\Vert = 1\}\). For \(x \in \mathbb {R}^n\) and \(\varepsilon >0\), \(B_\varepsilon (x)\) is an open ball of the radius \(\varepsilon >0\) centred at x. The convex hull of a set is denoted by “\(\mathop {\mathrm {conv}}\)” and “cl” stands for the closure of a set.
The function \(f:\mathbb {R}^n \rightarrow \mathbb {R}\) is locally Lipschitz continuous on \(\mathbb {R}^n\) if for every \(x \in \mathbb {R}^n\) there exist a Lipschitz constant \(L>0\) and \(\varepsilon >0\) such that \(|f(y)-f(z)| \le L\Vert y-z\Vert \) for all \(y, z \in B_\varepsilon (x).\)
The subdifferential of a convex function \(f:\mathbb {R}^n \rightarrow \mathbb {R}\) at a point \(x \in \mathbb {R}^n\) is the set
Each vector \(\xi \in \partial f(x)\) is called a subgradient of f at x. The subdifferential \(\partial f(x)\) is a nonempty, convex and compact set such that \(\partial f(x) \subseteq B_L(0)\), where \(L>0\) is the Lipschitz constant of f. In addition, the subdifferential mapping \(\partial f(x)\) is upper semicontinuous.
For \(\varepsilon \ge 0\), the \(\varepsilon \)-subdifferential of a convex function f at a point \(x\in \mathbb {R}^n\) is defined by
Each vector \(\xi _\varepsilon \in \partial _\varepsilon f(x)\) is called an \(\varepsilon \)-subgradient of f at x. The set \(\partial _\varepsilon f(x)\) contains the subgradient information from some neighborhood of x as the following theorem shows.
Theorem 1
[8] Let \(f:\mathbb {R}^n\rightarrow \mathbb {R}\) be a convex function with the Lipschitz constant \(L>0\) at a point \(x\in \mathbb {R}^n\). Then for \(\varepsilon \ge 0\)
The following theorem presents a necessary condition for Problem (1).
Theorem 2
[5, 36] If \(x^*\in \mathbb {R}^n\) is a local minimizer of Problem (1), then
A point \(x^*\), satisfying the condition (2), is called inf-stationary for Problem (1). The optimality condition (2) is, in general, hard to verify and is usually relaxed to the following condition [5, 36]:
A point \(x^*\in \mathbb {R}^n\) satisfying (3) is called a critical point of the function f. In a similar way, we can define an \(\varepsilon \)-critical point. Let \(\varepsilon \ge 0\). A point \(x^*\in \mathbb {R}^n\) is said to be an \(\varepsilon \)-critical point of f if [34]
3 Augmented subgradient method
In this section we describe the new method and study its convergence. We start with the definition of augmented subgradients of convex functions.
3.1 Augmented subgradients
Let \(\varphi : \mathbb {R}^n \rightarrow \mathbb {R}\) be a convex function and \(U \subset \mathbb {R}^n\) be a compact set. Take any \(x, y \in U\); in particular, y may coincide with x. Let \(\xi \in \partial \varphi (y)\) be any subgradient computed at the point y. The linearization error of this subgradient at the point x is defined as:
It is clear that
Definition 1
A vector \(u \in \mathbb {R}^{n+1}\) is called an augmented subgradient of the convex function \(\varphi \) at the point x with respect to the compact set \(U \subset \mathbb {R}^n\) if there exists \(y \in U\) such that \(u = \big (\alpha (x,y,\xi ), \xi \big ),\) where \(\xi \in \partial \varphi (y)\) and \(\alpha (x,y,\xi )\) is defined by (5).
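For illustration (this is not the paper's code), the linearization error and an augmented subgradient can be computed for a simple convex function; the choice \(\varphi (x) = \Vert x\Vert _1\) with the sign-based subgradient is our assumed example:

```python
import numpy as np

def phi(x):
    # Assumed example of a convex function: the l1-norm.
    return np.abs(x).sum()

def subgrad_phi(y):
    # One subgradient of the l1-norm at y: sign(y) (0 where y_i = 0).
    return np.sign(y)

def linearization_error(x, y, xi):
    # alpha(x, y, xi) = phi(x) - phi(y) - xi^T (x - y), cf. (5);
    # nonnegative by convexity of phi.
    return phi(x) - phi(y) - xi @ (x - y)

def augmented_subgradient(x, y):
    # Augmented subgradient u = (alpha(x, y, xi), xi) in R^{n+1}
    # for some y in U, as in Definition 1.
    xi = subgrad_phi(y)
    return np.concatenate(([linearization_error(x, y, xi)], xi))
```

Taking \(y = x\) gives \(\alpha = 0\), so the pair \((0, \xi )\) with \(\xi \in \partial \varphi (x)\) is recovered, in line with the remark below.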
Note that there is some similarity between augmented subgradients and elements of codifferentials considered in [15, 40].
The set \(D_U\varphi (x)\) of all augmented subgradients at the point x with respect to the set U is defined as:
For \(u \in D_U\varphi (x)\) we will use the notation \(u = (\alpha , \xi )\), where \(\alpha \ge 0\) and \(\xi \in \mathbb {R}^n\). If we take \(y=x\), then \(\alpha =0\). Therefore, the set \(D_U\varphi (x)\) also contains elements of the form \((0,\xi )\), where \(\xi \in \partial \varphi (x)\).
Proposition 1
The set \(D_U\varphi (x)\) is compact and convex for any \(x \in U\).
Proof
The proof follows from the definition of the set \(D_U\varphi (x)\) and from the fact that the subdifferential of the function \(\varphi \) is bounded on bounded sets. \(\square \)
It follows from (6) that for any \(x,y \in U\) we have
Consider the following two sets at the point \(x \in U\):
Take any \(x \in \mathbb {R}^n\), \(\tau > 0\) and consider the closed ball \(\bar{B}_\tau (x).\) Let \(U=\bar{B}_\tau (x)\) and \(D_U\varphi (x)\) be the set of augmented subgradients at the point x with respect to U.
Proposition 2
Let \(x \in \mathbb {R}^n, ~\tau > 0, ~U = \bar{B}_\tau (x)\) and \(L >0\) be a Lipschitz constant of \(\varphi \) on \(\bar{B}_\tau (x)\). Then for \(\varepsilon > 2L\tau \) we have
Proof
The result for \(\alpha \) follows from its definition. Furthermore, Theorem 1 implies that if \(\varepsilon > 2L\tau ,\) then \(\partial \varphi (y) \subseteq \partial _\varepsilon \varphi (x)\) for all \(y \in \bar{B}_\tau (x).\) Then the proof is obtained from the definition of the set C(x). \(\square \)
One interesting case is when \(U = \{x\}\). In this case \(\alpha = 0\), and therefore, we have
Next we consider the DC function \(f(x) = f_1(x) - f_2(x)\). Again assume that \(U \subset \mathbb {R}^n\) is a compact set and \(x \in U\). Define the set of augmented subgradients \(D_U f_1(x)\) for the function \(f_1\) and the set \(D_x f_2(x)\) for the function \(f_2\). Construct the set
Proposition 3
Let \(f:\mathbb {R}^n \rightarrow \mathbb {R}\) be a DC function and \(U \subset \mathbb {R}^n\) be a compact set. Then at a point \(x \in U\) we have
Proof
Let \(y \in U\) be any point. Taking into account the convexity of the function \(f_2\) and applying (6) to the function \(f_1\) at y we get that for \(\xi _1 \in \partial f_1(y), ~\xi _2 \in \partial f_2(x)\) the following holds:
It is clear that \(\big (\alpha (x,y,\xi _1), \xi _1 -\xi _2\big ) \in D_U f(x)\), and thus the proof is complete. \(\square \)
The following corollary can be easily obtained from Proposition 3.
Corollary 1
Let \(x \in \mathbb {R}^n\) and \(U = \bar{B}_\tau (x), ~\tau > 0\). Then for any \(d \in S_1\) we have
3.2 The model
To find search directions for Problem (1) we use the following model of the objective function f. Take any point \(x \in \mathbb {R}^n\) and a number \(\tau \in (0,1]\). Consider the closed ball \(\bar{B}_\tau (x)\). Compute the set \(D_U f(x)\) at x for \(U=\bar{B}_\tau (x)\). Then we replace the function f on U by the following function:
Proposition 4
The function \(\hat{f}\) is convex, \(f(y) \le \hat{f}(y)\) for all \(y \in U\) and \(\hat{f}(x) = f(x)\).
Proof
Convexity of the function \(\hat{f}\) follows from its definition. For any \(y \in \bar{B}_\tau (x)\) there exist \(d \in S_1\) and \(\lambda \in [0,\tau ]\) such that \(y = x+\lambda d\). Then by applying Corollary 1 we get that \(f(y) \le \hat{f}(y)\) for all \(y \in U\). Finally, for the point x we have
Since \((0,\xi ) \in D_U f(x)\) and \(\alpha \ge 0\) for all \((\alpha ,\xi ) \in D_U f(x)\) we have
Therefore, \(\hat{f}(x) = f(x)\). \(\square \)
Proposition 5
For any \(y \in U\) we have
where \(\hat{R}(y) = \big \{(\alpha , \xi ) \in D_U f(x): ~\hat{f}(y) = f(x) - \alpha + \xi ^T (y-x)\big \}\). In particular,
Proof
The function \(\hat{f}\) is convex and all functions under maximum in its expression are linear with respect to y. Then the expression for the subdifferential \(\partial \hat{f}(y)\) follows from Theorem 3.23 [8]. Since \(\hat{f}(x) = f(x)\) at the point x we get that \(\hat{R}(x) = \big \{(\alpha , \xi ) \in D_U f(x): ~\alpha =0\big \}\). Then we obtain the expression (11) for the subdifferential \(\partial \hat{f}(x)\). \(\square \)
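The model analyzed in Propositions 4 and 5 is the pointwise maximum of the linear functions \(y \mapsto f(x) - \alpha + \xi ^T (y-x)\) over \((\alpha , \xi ) \in D_U f(x)\). A minimal sketch evaluating it over a finite collection of augmented subgradients (restricting to a finite set is our simplification for illustration):

```python
import numpy as np

def model_value(y, x, f_x, aug_subgrads):
    # f_hat(y) = max over (alpha, xi) of f(x) - alpha + xi^T (y - x),
    # evaluated over a finite collection of augmented subgradients,
    # each given as a pair (alpha, xi).
    return max(f_x - alpha + xi @ (y - x) for alpha, xi in aug_subgrads)
```

Since some pair has \(\alpha = 0\) and all \(\alpha \ge 0\), evaluating the model at \(y = x\) returns \(f(x)\), matching Proposition 4.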
Results from Propositions 4 and 5 show that the set \(D_U f(x)\) can be used to find search directions for the DC function f, as demonstrated in the next proposition.
Proposition 6
Assume that the set \(D_U f(x)\) is constructed using \(U=\bar{B}_\tau (x),\) \(~\tau \in (0,1]\). In addition assume \(0_{n+1} \notin D_U f(x)\) and
Then
where \(\overline{d} = -\Vert \overline{w}\Vert ^{-1}\overline{\xi }\).
Proof
The set \(D_U f(x)\) is compact and convex. Then the necessary condition for a minimum implies that
Taking into account that \(\Vert \overline{w}\Vert ^2 = \overline{\alpha }^2+\Vert \overline{\xi }\Vert ^2\) we get
First we show that \(\overline{\xi } \ne 0_n\). Assume the contrary, that is \(\overline{\xi }=0_n\). Since \(\overline{w}\ne 0_{n+1}\) it follows that \(\overline{\alpha }\ne 0\) and therefore, \(\overline{\alpha } > 0\). It follows from (13) that \(0< \overline{\alpha } \le \alpha \) for all \((\alpha ,\xi )\in D_U f(x)\) which is a contradiction since the element \((0, \xi )\) is included in \(D_U f(x)\). Therefore \(\overline{\xi }\ne 0_n\) which implies that \(\overline{d} \ne 0_n\).
Dividing both sides of (13) by \(-\Vert \overline{w}\Vert \), we obtain
Note that \(\Vert \overline{w}\Vert ^{-1} \overline{\alpha } \in (0,1)\). Since \(\tau \in (0,1]\) it is obvious that \(\Vert \overline{w}\Vert ^{-1} \tau \overline{\alpha } \in (0,1)\). Furthermore, since \(\alpha \ge 0\) for all \((\alpha ,\xi )\in D_U f(x)\) we get
where the last inequality comes from (14). From here and by applying (9), we obtain that \(f(x+\tau \overline{d})-f(x)\le -\tau \Vert \overline{w}\Vert \). This means that \(\overline{d}\) is a descent direction of the function f at the point \(x \in \mathbb {R}^n\). \(\square \)
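The search direction in Proposition 6 is built from the minimum-norm element \(\overline{w} = (\overline{\alpha }, \overline{\xi })\) of a compact convex set; in the algorithm this set is a polytope spanned by finitely many augmented subgradients. A minimal sketch, using Frank–Wolfe with exact line search over the convex hull of a finite set (our choice of solver for illustration; the implementation described later uses the QP algorithm of [39]):

```python
import numpy as np

def min_norm_point(W, iters=500):
    # Minimize ||w||^2 over conv{W[0], ..., W[m-1]} (rows of W) by
    # Frank-Wolfe: pick the vertex minimizing the linearized objective,
    # then take an exact line-search step toward it.
    w = W[0].astype(float).copy()
    for _ in range(iters):
        s = W[np.argmin(W @ w)]            # vertex minimizing grad(||w||^2) . s
        d = s - w
        dd = d @ d
        if dd == 0.0:
            break
        gamma = np.clip(-(w @ d) / dd, 0.0, 1.0)  # exact minimizer on [0, 1]
        if gamma == 0.0:
            break
        w = w + gamma * d
    return w

def search_direction(w_bar):
    # d_bar = -||w_bar||^{-1} * xi_bar, where w_bar = (alpha_bar, xi_bar).
    return -w_bar[1:] / np.linalg.norm(w_bar)
```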
3.3 The proposed algorithm
Now we introduce the augmented subgradient method (asm-dc) for solving the DC optimization problem (1). Let the search control parameter \( c_1 \in (0,1)\), the line search parameter \(c_2 \in (0, c_1]\), the decrease parameter \(c_3 \in (0,1)\) and the tolerance \(\theta > 0\) for the criticality condition be given.
Some explanations of Algorithm 1 are in order. This algorithm consists of outer and inner loops. In the inner loop we compute augmented subgradients of the DC components and the subset \(\hat{D}_l^k\) of the set of augmented subgradients (Steps 1 and 4). We use this subset to formulate a quadratic programming problem to find search directions (Step 2). This problem can be rewritten as:
Here the set \(\hat{D}_l^k\) is a polytope. There are several algorithms for solving such problems [21, 29, 39]. The algorithm exits the inner loop when either the distance between the set of augmented subgradients and the origin is sufficiently small (that is, condition (15) is satisfied), or we find a direction of sufficient decrease (Step 4). In the first case the current value of the parameter \(\tau \) does not allow the solution to be improved and should be updated. In the second case the line search procedure is applied.
In the outer loop we update the parameters of the method (Step 5) and apply the line search procedure to find a new solution (Step 6). Since it is not easy to find the exact value of the step-length \(\sigma _l\) in Step 6, we propose the following scheme to estimate it in the implementation of the algorithm. It is clear that \(\sigma _l \ge \tau _l\). Consider the sequence \(\{\bar{\sigma }_i\},~i=1,2,\ldots \), where \(\bar{\sigma }_1 = \tau _l\) and \(\bar{\sigma }_i = 2^{i-1} \bar{\sigma }_1, ~i=1,2,\ldots \). We compute the largest \(i \ge 1\) satisfying the condition in Step 6. Then \(\bar{\sigma }_l = 2^{i-1} \tau _l\) is the estimate of \(\sigma _l\). The algorithm stops when the values of its parameters become sufficiently small (Step 5). This means that an approximate criticality condition is achieved.
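The doubling scheme for estimating the step-length can be sketched as follows. The descent test used here, an Armijo-type inequality \(f(x+\sigma d) \le f(x) - c_2 \sigma \Vert \overline{w}\Vert \), is our assumption for illustration; the exact condition is the one in Step 6 of Algorithm 1.

```python
def estimate_step(f, x, d, f_x, w_norm, tau, c2):
    # Doubling scheme: sigma_1 = tau, sigma_i = 2^{i-1} * tau.  Starting
    # from sigma = tau (assumed to satisfy the test already, since d was
    # accepted as a descent direction), keep doubling while the trial
    # step 2*sigma still satisfies the assumed descent condition
    #   f(x + sigma*d) <= f(x) - c2 * sigma * ||w_bar||.
    sigma = tau
    while f([xi + 2.0 * sigma * di for xi, di in zip(x, d)]) \
            <= f_x - c2 * (2.0 * sigma) * w_norm:
        sigma *= 2.0
    return sigma
```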
3.4 Global convergence
In this section we study the global convergence of Algorithm 1. First, we prove that for each \(l \ge 0\) the number of inner iterations of this algorithm is finite. At the point \(x_l \in \mathbb {R}^n\) for a given \(\tau _l>0,~l > 0\) define the following numbers:
It follows from Proposition 2 that \(K_l < \infty \) for all \(l>0\).
Proposition 7
Let \(\delta _l \in (0,K_l), l>0\). Then the inner loop at the l-th iteration of the asm-dc stops after at most \(m_l\) iterations, where
Proof
We prove the proposition by contradiction. This means that at the l-th iteration, for any \(k \ge 1\), the condition (15) is not satisfied and the condition (16) is satisfied, that is, \(\Vert \overline{w}^k_l\Vert \ge \delta _l\) and
First, we show that the new augmented subgradient
computed at Step 4 does not belong to the set \(\hat{D}^k_l\). Assume the contrary, that is \(w^{k+1}_l \in \hat{D}^k_l.\) It follows from the definition (5) of the linearization error that
Since \(\xi _{2l} \in \partial f_2(x_l)\) the subgradient inequality implies that
Subtracting this from (19) we get
Applying (18) and taking into account that \(\xi ^{k+1}_l = \xi _{1l}^{k+1} - \xi _{2l}\), we have
or
which, by taking into account that \(c_1\in (0,1)\), can be rewritten as
On the other side, since \(\Vert \overline{w}_l^k\Vert ^2 = \min \big \{\Vert w\Vert ^2:~w \in \hat{D}_l^k \big \}\) it follows from the necessary condition for a minimum that \(\Vert \overline{w}^k_l\Vert ^2 \le w^T\overline{w}^k_l\) for all \(w \in \hat{D}^k_l\), where \(w=(\alpha ,\xi )\) and \(\overline{w}^k_l=(\overline{\alpha }^k_l,\overline{\xi }^k_l)\). Hence we obtain
In particular, for \(w^{k+1}_l = (\alpha ^{k+1}_l,\xi ^{k+1}_l) \in \hat{D}^k_l\) we have \(\Vert \overline{w}^k_l\Vert ^2 \le \overline{\alpha }^k_l\alpha ^{k+1}_l + \Big (\overline{\xi }^k_l \Big )^T \xi ^{k+1}_l.\) Since \(\overline{\alpha }^k_l\le \Vert \overline{w}^k_l\Vert \) and \(\alpha ^{k+1}_l\ge 0\) we get \(\overline{\alpha }^k_l\alpha ^{k+1}_l\le \alpha ^{k+1}_l\Vert \overline{w}^k_l\Vert \). Next taking into account that \(\tau _l \in (0,1)\), we have
which contradicts (20). Thus, \(w^{k+1}_l \notin \hat{D}^k_l\).
Note that for all \(t\in [0,1]\) we have
It follows from (20) that
In addition, the definition of \(K_l\) implies that \(\Vert w^{k+1}_l-\overline{w}^k_l\Vert \le 2K_l\). Then we get
Let \(t_0=(1-c_1)(2K_l)^{-2}\Vert \overline{w}^k_l\Vert ^2\). Note that \(t_0\in (0,1),\) and for \(t = t_0\) we have
Since \(\Vert \overline{w}^k_l\Vert \ge \delta _l\) we get
Furthermore, since \(\delta _l \in (0, K_l)\) it follows from (17) that \(M_l \in (0,1)\) and \(\Vert \overline{w}^{k+1}_l\Vert ^2 < M_l \Vert \overline{w}^k_l\Vert ^2\). Then we have
Hence \(\Vert \overline{w}^{m_l}_l\Vert < \delta _l\) is satisfied if \(M_l^{m_l-1} K_l^2 < \delta _l^2\). This inequality must hold after at most \(m_l\) iterations, where
This completes the proof.\(\square \)
Definition 2
A point \(x \in \mathbb {R}^n\) is called a \((\tau , \delta )\)-critical point of Problem (1) if
Proposition 8
Let \(\tau , \delta > 0\) be given numbers, \(x \in \mathbb {R}^n\) be a \((\tau ,\delta )\)-critical point of Problem (1) and \(L_1 > 0\) be a Lipschitz constant of the function \(f_1\) at the point x. Then there exists \(\xi _2 \in \partial f_2(x)\) such that for any \(\varepsilon > 2L_1\tau \)
Proof
Let \(U = \bar{B}_\tau (x)\). If x is the \((\tau ,\delta )\)-critical point, then there exists \(\overline{w} \in D_U f(x)\) such that \( \Vert \overline{w}\Vert < \delta \). The element \(\overline{w}\) can be represented as
for some \((\overline{\alpha }, \overline{\xi }_1) \in D_U f_1(x)\) and \((0,\overline{\xi }_2) \in D_x f_2(x)\). Hence, \(\Vert \overline{\xi }_1 - \overline{\xi }_2\Vert <\delta \). It follows from Proposition 2 that \(\overline{\xi }_1 \in \partial _\varepsilon f_1(x)\) for all \(\varepsilon > 2L_1\tau \). In addition \(\overline{\xi }_2\in \partial f_2(x)\). Then we get the proof. \(\square \)
It follows from Proposition 8 that for sufficiently small \(\tau \) and \(\delta \) a \((\tau , \delta )\)-critical point can be interpreted as an approximate critical point of Problem (1). Furthermore, note that if at the l-th iteration (\(l > 0\)) of Algorithm 1 the inner loop terminates with the condition (15), then \(x_l\) is a \((\tau _l,\delta _l)\)-critical point of Problem (1).
Next, we prove the convergence of Algorithm 1. For a starting point \(x_0 \in \mathbb {R}^n\) consider the level set \(\mathcal{L}_f(x_0)\) of the function f:
Theorem 3
Suppose that the level set \(\mathcal{L}_f(x_0)\) is bounded and \(\theta =0\) in Algorithm 1. Then Algorithm 1 generates the sequence of approximate critical points whose accumulation points are critical points of Problem (1).
Proof
The asm-dc consists of inner and outer loops. The values of parameters \(\tau _l\) and \(\delta _l\) are constants in the inner loop. In addition, these parameters are also constants in Steps 6 and 7. It follows from Proposition 7 that for given \(\tau _l\) and \(\delta _l\), the inner loop stops after a finite number of iterations.
Since the set \(\mathcal{L}_f(x_0)\) is bounded (and closed, hence compact), the function f is bounded from below. This means that \(f_* = \min \big \{f(x):~x \in \mathbb {R}^n\big \} > -\infty \). For given \(\tau _l\) and \(\delta _l\) the number of executions of Steps 6 and 7 is finite. To prove this, we assume the contrary, that is, that the number of executions of these steps is infinite. Then for some \(k \ge 1\) the direction \(\overline{d}^k_l\), computed in Step 3, is a direction of sufficient decrease at \(x_l\) satisfying the following condition:
Since \(\Vert \overline{w}^k_l\Vert \ge \delta _l\), \(\sigma _l \ge \tau _l\) and \(c_2 \in (0, c_1]\) we have
From here we get
Then we have \(f(x_l) \rightarrow -\infty \) as \(l \rightarrow \infty \), which contradicts the condition \(f_* > -\infty \). This means that for fixed values of \(\tau _l\) and \(\delta _l\), after a finite number of iterations the inner loop terminates with the condition (15) and \(x_l\) is a \((\tau _l,\delta _l)\)-critical point.
Thus, the parameters \(\tau _l\) and \(\delta _l\) are updated in Step 5 after a finite number of outer iterations. Denote these iterations by \(l_j, ~j=1,2,\ldots \); then the \(x_{l_j}\) are \((\tau _{l_j},\delta _{l_j})\)-critical points. It follows from Proposition 8 that for some \(\xi _{2l_j} \in \partial f_2(x_{l_j})\)
It is obvious that \(x_{l_j} \in \mathcal{L}_f(x_0)\) for all \( j=1,2,\ldots \). The compactness of the set \(\mathcal{L}_f(x_0)\) implies that the sequence \(\{x_{l_j}\}\) has at least one accumulation point. Let \(x^*\) be an accumulation point and for the sake of simplicity assume that \(x_{l_j} \rightarrow x^*\) as \(j \rightarrow \infty \).
Upper semicontinuity of the subdifferential and \(\varepsilon \)-subdifferential mappings implies that for any \(\eta > 0\) there exist \(\gamma _1 > 0\) and \(\gamma _2 > 0\) such that
for all \(y \in B_{\gamma _1}(x^*)\) and \(\varepsilon \in [0, \gamma _2)\). Since \(x_{l_j} \rightarrow x^*\) and \(\tau _{l_j} \rightarrow 0\) as \(j \rightarrow \infty \) there exists \(j_\eta > 0\) such that \(x_{l_j} \in B_{\gamma _1}(x^*)\) and \(\tau _{l_j} \in [0, \gamma _2)\) for all \(j > j_\eta \). Moreover, we can choose the sequence \(\{\varepsilon _j\}\) so that \(\varepsilon _j \in [0, \gamma _2)\) for all \(j > j_\eta \) and \(\varepsilon _j \rightarrow 0\) as \(j \rightarrow \infty \). Then, it follows from (21) and the first part of (22) that for any \(j > j_\eta \) there exists \(\xi _{2l_j} \in \partial _\varepsilon f_1(x_{l_j})\) such that
On the other hand, from the second part of (22) we get that for any \(\xi _{2l_j} \in \partial f_2(x_{l_j})\) there exists \(\bar{\xi }_2 \in \partial f_2(x^*)\) such that \(\Vert \xi _{2l_j} - \bar{\xi }_2\Vert < \eta \). Without loss of generality assume that \(\xi _{2l_j} \rightarrow \bar{\xi }_2 \in \partial f_2(x^*)\) as \(j \rightarrow \infty \). Since \(\delta _{l_j} \rightarrow 0\) as \(j \rightarrow \infty \) we get that
Taking into account that \(\eta > 0\) is arbitrary we have \(\bar{\xi }_2 \in \partial f_1(x^*)\), that is, \(\partial f_1(x^*) \cap \partial f_2(x^*) \ne \emptyset \). This completes the proof. \(\square \)
Remark 1
The “escape procedure”, introduced in [25], is able to escape from critical points which are not Clarke stationary. The developed algorithm can be combined with this procedure to design an algorithm for finding Clarke stationary points of Problem (1).
4 Numerical results
Using academic test problems, we evaluate the performance of the developed method and compare it with other similar nonsmooth optimization methods. We utilize 30 academic test problems which are instances of 14 base problems with various dimensions (ranging from 2 to 100): Problems 1–10 are from [24] and Problems 11–14 are described in [26]. In addition, we use three large-scale nonsmooth DC optimization problems in our numerical experiments. We design these problems so that different types of DC components are considered:
-
\(P_{L1}\): the function \(f_1\) is smooth and the function \(f_2\) is nonsmooth;
-
\(P_{L2}\): the function \(f_1\) is nonsmooth and the function \(f_2\) is smooth;
-
\(P_{L3}\): both functions \(f_1\) and \(f_2\) are nonsmooth.
The DC components of these test problems, \(f(x)= f_1(x) - f_2(x)\), are given below:
-
\(P_{L1}.\)
$$\begin{aligned} f_1(x)&= \sum _{i=1}^{n-1} \Big ( (x_i - 2)^2 + (x_{i+1} - 1)^2 + \exp (2x_i - x_{i+1}) \Big ),\\ f_2(x)&= \sum _{i=1}^{n-1} \max \Big ( (x_i - 2)^2 + (x_{i+1} - 1)^2, \exp (2x_i - x_{i+1}) \Big ). \end{aligned}$$Starting point: \(x = (x_1,\ldots ,x_n) = (3,\ldots ,3)\).
-
\(P_{L2}.\)
$$\begin{aligned} f_1(x)&= n \max _{i=1,\ldots ,n} \Big (\sum _{j=1}^n \frac{x_j^2}{i+j-1} \Big ),\\ f_2(x)&= \sum _{i=1}^n \sum _{j=1}^n \frac{x_j^2}{i+j-1}. \end{aligned}$$Starting point: \(x = (x_1,\ldots ,x_n) = (2,\ldots ,2)\).
-
\(P_{L3}.\)
$$\begin{aligned} f_1(x)&= n \max _{i=1,\ldots ,n} \Big |\sum _{j=1}^n \frac{x_j}{i+j-1} \Big |,\\ f_2(x)&= \sum _{i=1}^n \Big |\sum _{j=1}^n \frac{x_j}{i+j-1} \Big |. \end{aligned}$$Starting point: \(x = (x_1,\ldots ,x_n) = (2,\ldots ,2)\).
We set the number of variables to \(n=200,500,1000,1500,2000,3000,5000\) for these three test problems.
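As a sanity check of the construction, the DC components of \(P_{L1}\) can be coded directly; at the starting point \(x = (3,\ldots ,3)\) each term of \(f_2\) equals the exponential term of \(f_1\) (since \(e^{3} > 5\)), so \(f(x) = 5(n-1)\):

```python
import math

def f1_PL1(x):
    # First DC component of P_L1 (smooth).
    return sum((x[i] - 2) ** 2 + (x[i + 1] - 1) ** 2
               + math.exp(2 * x[i] - x[i + 1]) for i in range(len(x) - 1))

def f2_PL1(x):
    # Second DC component of P_L1 (nonsmooth, termwise maximum).
    return sum(max((x[i] - 2) ** 2 + (x[i + 1] - 1) ** 2,
                   math.exp(2 * x[i] - x[i + 1])) for i in range(len(x) - 1))

def f_PL1(x):
    # DC objective f = f1 - f2 of problem P_L1.
    return f1_PL1(x) - f2_PL1(x)
```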
4.1 Solvers and parameters
We apply seven nonsmooth optimization solvers for comparison. Among these solvers five are designed for solving DC optimization problems and two of them are general solvers for nonsmooth nonconvex optimization problems. These solvers are:
-
Aggregate subgradient method for DC optimization (AggSub) [9];
-
Proximal bundle method for DC optimization (PBDC) [24];
-
Proximal bundle method for nonsmooth DC programming (PBMDC) [17];
-
Sequential DC programming with cutting-plane models (SDCA-LP) [18];
-
Hybrid Algorithm for Nonsmooth Optimization (HANSO 2.2) [14, 28].
The quadratic programming problem in Step 2 of the asm-dc is solved using the algorithm from [39]. Parameters in the asm-dc are chosen as follows: \(c_1=0.2\), \(c_2=0.05\), \(c_3=0.1\), \(\theta =5 \cdot 10^{-7}\), \(\tau _{l+1}=0.1\tau _l\) with \(\tau _1=1,\) and \(\delta _l=10^{-7}\) for all l. The algorithms asm-dc, AggSub, PBDC, DCA and PBM are implemented in Fortran 95 and compiled using the gfortran compiler. For HANSO, we use the MATLAB implementation of HANSO version 2.2, which is available at
“https://cs.nyu.edu/overton/software/hanso/”. For PBMDC and SDCA-LP, we utilize the MATLAB implementations which are available at
“http://www.oliveira.mat.br/solvers”. Note that the SDCA-LP requires the feasible set to be compact. However, all test problems considered in this paper are unconstrained. In order to apply the SDCA-LP to these problems, we define a large n-dimensional box around the starting point and consider only those points, generated by the SDCA-LP, that belong to this box. The parameters of all these algorithms are chosen according to the values recommended in their references. In addition, we impose a time limit on all algorithms: three hours for solvers implemented in MATLAB and half an hour for those implemented in Fortran.
4.2 Evaluation measures and notations
We use the number of function evaluations, the number of subgradient evaluations and the final value of the objective function obtained by algorithms to report the results of our experiments. The following notations are used:
-
P is the label of a test problem;
-
n is the number of variables in the problem;
-
\(f_{best}^s\) is the best value of the objective function obtained by a solver “s”.
We say that a solver “s” finds a solution with respect to a tolerance \(\beta _1\) if
where \(f_{opt}\) is the value of the objective function at one of the known local minimizers. Otherwise, we say that the solver fails. We set \(\beta _1 = 10^{-4}.\)
Performance profiles, introduced in [16], are also applied to analyze the results. Recall that in the performance profiles, the value of \(\rho _s(\tau )\) at \(\tau =0\) shows the ratio of the number of test problems for which the solver s uses the least computational time, function or subgradient evaluations to the total number of test problems. The value of \(\rho _s(\tau )\) at the rightmost abscissa gives the ratio of the number of test problems that the solver s can solve to the total number of test problems; this value indicates the robustness of the solver s. Furthermore, the higher a particular curve, the better the corresponding solver. Note that the values of \(\tau \) are scaled using the natural logarithm.
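As an illustration of how such profiles are computed, a minimal sketch using the standard performance-ratio definition of [16] follows; the \(e^{\tau }\) threshold matches the natural-log scaling of \(\tau \) mentioned above (the specific cost data are hypothetical):

```python
import math

def performance_profile(costs, taus):
    # costs[p][s]: cost (e.g. function evaluations) of solver s on
    # problem p; use math.inf for a failure.  Assumes each problem is
    # solved by at least one solver and costs are positive.
    # Returns rho[s]: for each tau in taus, the fraction of problems
    # with performance ratio r_{p,s} = costs[p][s] / min_s' costs[p][s']
    # at most exp(tau) (tau on the natural-log scale).
    n_p, n_s = len(costs), len(costs[0])
    ratios = [[c / min(row) for c in row] for row in costs]
    return [[sum(1 for p in range(n_p) if ratios[p][s] <= math.exp(t)) / n_p
             for t in taus] for s in range(n_s)]
```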
In addition, we report the CPU time (the computational time in seconds) required by the developed algorithm to find a solution, as well as results on the quality of solutions obtained by different algorithms. The quality of solutions is determined by the values of the objective function at these solutions. More precisely, consider two different local minimizers \(x_1\) and \(x_2\) of Problem (1). The local minimizer \(x_1\) is a better quality solution than the local minimizer \(x_2\) if \(f(x_1) < f(x_2)\). The notation (S, B, W) is used to report the total number of test problems on which the asm-dc finds solutions of similar (S), better (B) and worse (W) quality than the other algorithms. The accuracy of solutions is defined using (23), where \(\beta _1 = 10^{-4}\).
4.3 Results
We present the results of our numerical experiments as follows. First, we give the results obtained by the algorithms for the test problems 1–14, given in [24, 26], with one starting point. We report the values of the objective functions, the number of function evaluations and the number of subgradient evaluations. Second, we present the corresponding results for these test problems with 20 random starting points. Next, we give the average CPU time required by the asm-dc on the test problems 1–14 with both one and 20 starting points, and also discuss the quality of solutions obtained by the algorithms. Finally, we present the results for the three large-scale test problems introduced above and report the values of the objective functions, the number of function evaluations and the number of subgradient evaluations required by the algorithms.
Numerical results for Problems 1–14 with one starting point. We summarize these results in Tables 1–3 and Fig. 1. Table 1 presents the best values of the objective function obtained by the solvers. We use “\(^*\)” to mark values for which a solver fails to find the local or global minimizer. Results from this table show that the asm-dc is more reliable in finding the global minimizer than all other solvers. It is able to find the optimal solutions for all test problems considered in our experiments, whereas the other solvers fail for some test problems with different numbers of variables. More specifically, the AggSub fails to find the optimal solution in Problem 5, and also in Problem 14 with \(n=5,10,50,100.\) The PBDC fails in Problems 5 and 13, and also in Problem 12 with \(n=50,100\). The DCA fails in Problem 10 with \(n=5,10,50,100.\) The PBMDC fails in Problems 7 and 13, in Problem 12 with \(n=5,10,50,100\), and in Problem 14 with \(n=100.\) The SDCA-LP fails in Problems 1, 8, 9 and 13, in Problem 10 with \(n=50,100\), and in Problem 12 with \(n=10,50,100.\) The PBM fails in Problems 10 and 12 with \(n=100\), and also in Problem 14 with \(n=10,50.\) The HANSO fails in Problems 2 and 3, and also in Problem 14 with \(n=2,10,50,100.\)
Tables 2 and 3 present the number of function and subgradient evaluations for Problems 1–14, respectively. We denote by “−” the cases in which a solver fails to find a solution. It can be observed from these tables that the numbers required by the asm-dc are reasonable and comparable with those required by the AggSub and HANSO algorithms. In most cases the asm-dc requires a larger number of function and subgradient evaluations than the PBDC, DCA, PBMDC, SDCA-LP and PBM algorithms. This is due to the fact that the asm-dc requires a large number of inner iterations, since it does not use augmented subgradients from previous iterations.
Next, we demonstrate the efficiency and robustness of the algorithms using the performance profiles. The results with one starting point are depicted in Fig. 1. We use the number of function evaluations required by the algorithms to draw the performance profiles. The number of subgradient evaluations follows trends similar to those of the number of function evaluations, and thus we omit these results. For the algorithms based on the DC structure of the problem, we use the average number of function evaluations of the DC components. For the other algorithms, we use the number of evaluations of the objective function. From Fig. 1, it can be observed that the asm-dc is the most robust for the collection of test problems used in the numerical experiments. However, it requires more function evaluations than the others, with the exception of the AggSub algorithm, again because of the large number of inner iterations performed without reusing augmented subgradients from previous iterations.
Numerical results for Problems 1–14 with 20 random starting points. Performance profiles using Problems 1–14 with 20 random starting points and the number of function evaluations are presented in Fig. 2. The results in this figure indicate a trend similar to that of Fig. 1, which confirms the results obtained by the algorithms with only one starting point. Performance profiles using the number of subgradient evaluations follow similar trends for all solvers, and therefore we omit these results.
Computational time and results on the quality of solutions. In Table 4, for a given number n of variables, we report the average CPU time in seconds required by the asm-dc for Problems 1–14 (30 problems, considering different numbers of variables) with one starting point and also with 20 random starting points. We can see that the CPU time required by the asm-dc for test problems with \(n=2,3,4\) variables is negligible (reported as zero). We do not include the CPU time required by the other algorithms since they are implemented on different platforms.
Table 5 presents the comparison of the quality of solutions obtained by the asm-dc with those obtained by the other algorithms. Problems 1–14 (30 problems, considering different numbers of variables) with 20 random starting points are used (note that the total number of cases is 600). We can see that the asm-dc outperforms all other solvers, except the PBDC, in terms of its ability to find high quality solutions. For instance, the asm-dc obtains a better quality solution than the AggSub in 131 cases, whereas the AggSub finds a better quality solution in only 23 cases; they find similar solutions in 446 cases with respect to the tolerance given above. These results demonstrate that the asm-dc has better global search properties than the other solvers, except the PBDC.
Numerical results for large-scale problems. We report the results obtained by different algorithms for the three large-scale test problems described above. Note that since the conventional code of HANSO runs out of memory for problems \(P_{L2}\) and \(P_{L3}\) with \(n \ge 2000\) and \(n \ge 1500\), respectively, we use its limited memory version. The results are given in Tables 6–8, where the objective function values, the number of function evaluations and the number of subgradient evaluations are presented. For the test problem \(P_{L1}\), considering different numbers of variables, the best values of the objective function obtained by the asm-dc and the AggSub are much smaller than those obtained by the other solvers. The solvers PBMDC and SDCA-LP fail to find any solution in the given time frame. For the test problem \(P_{L2}\), all solvers, except the SDCA-LP, are able to find the optimal solution with the given tolerance. The SDCA-LP fails to produce accurate solutions within the given time limit for all numbers of variables. In the test problem \(P_{L3}\) with \(n=200,~500\), the solvers asm-dc, PBDC, PBMDC, SDCA-LP and PBM are able to find accurate solutions. For this test problem with \(n \ge 1000\), the PBDC finds the most accurate solutions, and the asm-dc is less accurate than the PBDC, PBMDC, PBM and SDCA-LP. The other algorithms fail to find accurate solutions. Overall, the asm-dc is the most robust method for solving the large-scale problems used in the numerical experiments.
From Tables 7 and 8 we can see that the PBMDC requires the least number of both function and subgradient evaluations among the solvers when it is able to find a solution. On average, the asm-dc requires fewer function and subgradient evaluations than the AggSub and slightly fewer than the DCA. The asm-dc requires more function and subgradient evaluations than the PBDC, PBMDC, SDCA-LP, PBM and HANSO; however, in some cases the difference between the asm-dc and these five methods is not significant. Summarizing the results from Tables 7 and 8 and taking into account the number of variables, we can say that the asm-dc requires a reasonable number of function and subgradient evaluations on large-scale problems.
5 Conclusions
A new method, named the augmented subgradient method (asm-dc), is introduced for solving nonsmooth unconstrained difference of convex (DC) optimization problems. First, we define the set of augmented subgradients using subgradients of DC components and their linearization errors. Then a finite number of elements from this set are used to compute search directions, and the new method is designed using these search directions. It is proved that the sequence of points generated by the asm-dc converges to critical points of the DC problem.
We evaluate the performance of the asm-dc using 14 small- and medium-sized and three large-scale nonsmooth DC optimization test problems. These test problems are also used to compare the performance of the asm-dc with that of five nonsmooth DC optimization methods and two general nonsmooth optimization methods. We use different evaluation measures, including performance profiles, to compare the solvers. Computational results clearly demonstrate that the asm-dc is the most robust method for the test problems used in the numerical experiments. In comparison with most other methods, the asm-dc requires more function and subgradient evaluations; nevertheless, the developed method uses modest CPU time even for large-scale problems. Results demonstrate that the asm-dc, as a rule, finds solutions of either similar or better quality than the other methods.
References
Artacho, F.J.A., Campoy, R., Vuong, P.T.: Using positive spanning sets to achieve d-stationarity with the boosted DC algorithm. Vietnam J. Math. 48(2), 363–376 (2020)
Artacho, F.J.A., Fleming, R.M.T., Vuong, P.T.: Accelerating the DC algorithm for smooth functions. Math. Program. 169, 95–118 (2018)
Artacho, F.J.A., Vuong, P.T.: The boosted difference of convex functions algorithm for nonsmooth functions. SIAM J. Optim. 30(1), 980–1006 (2020)
An, L.T.H., Tao, P.D., Ngai, H.V.: Exact penalty and error bounds in DC programming. J. Glob. Optim. 52(3), 509–535 (2012)
An, L.T.H., Tao, P.D.: The DC (difference of convex functions) programming and DCA revisited with DC models of real world nonconvex optimization problems. Ann. Oper. Res. 133, 23–46 (2005)
Astorino, A., Fuduli, A., Gaudioso, M.: DC models for spherical separation. J. Glob. Optim. 48(4), 657–669 (2010)
Bagirov, A.M., Gaudioso, M., Karmitsa, N., Mäkelä, M.M., Taheri, S. (eds.): Numerical Nonsmooth Optimization: State of the Art Algorithms. Springer, Berlin (2020)
Bagirov, A.M., Karmitsa, N., Mäkelä, M.M.: Introduction to Nonsmooth Optimization: Theory, Practice and Software. Springer, New York (2014)
Bagirov, A.M., Taheri, S., Joki, K., Karmitsa, N., Mäkelä, M.M.: Aggregate subgradient method for nonsmooth DC optimization. Optim. Lett. 15(1), 83–96 (2020)
Bagirov, A.M., Taheri, S., Cimen, E.: Incremental DC optimization algorithm for large-scale clusterwise linear regression. J. Comput. Appl. Math. 389, 113323 (2021)
Bagirov, A.M., Taheri, S., Ugon, J.: Nonsmooth DC programming approach to the minimum sum-of-squares clustering problems. Pattern Recognit. 53, 12–24 (2016)
Bagirov, A.M., Ugon, J.: Nonsmooth DC programming approach to clusterwise linear regression: optimality conditions and algorithms. Optim. Methods Softw. 33(1), 194–219 (2018)
Bagirov, A.M., Ugon, J.: Codifferential method for minimizing nonsmooth DC functions. J. Glob. Optim. 50, 3–22 (2011)
Burke, J.V., Lewis, A.S., Overton, M.L.: A robust gradient sampling algorithm for nonsmooth, nonconvex optimization. SIAM J. Optim. 15(3), 751–779 (2005)
Demyanov, V.F., Rubinov, A.M.: Constructive Nonsmooth Analysis. Peter Lang, Frankfurt a. M. (1995)
Dolan, E.D., Moré, J.J.: Benchmarking optimization software with performance profiles. Math. Program. 91(2), 201–213 (2002)
de Oliveira, W.: Proximal bundle methods for nonsmooth DC programming. J. Glob. Optim. 75, 523–563 (2019)
de Oliveira, W.: Sequential difference-of-convex programming. J. Optim. Theory Appl. 186(3), 936–959 (2020)
de Oliveira, W.: The ABC of DC programming. Set-Valued Var. Anal. 28, 679–706 (2020)
de Oliveira, W., Tcheou, M.P.: An inertial algorithm for DC programming. Set-Valued Var. Anal. 27, 895–919 (2019)
Frangioni, A.: Solving semidefinite quadratic problems within nonsmooth optimization algorithms. Comput. Oper. Res. 21, 1099–1118 (1996)
Gaudioso, M., Giallombardo, G., Miglionico, G., Bagirov, A.M.: Minimizing nonsmooth DC functions via successive DC piecewise-affine approximations. J. Glob. Optim. 71(1), 37–55 (2018)
Horst, R., Thoai, N.V.: DC programming: overview. J. Optim. Theory Appl. 103, 1–43 (1999)
Joki, K., Bagirov, A.M., Karmitsa, N., Mäkelä, M.M.: A proximal bundle method for nonsmooth DC optimization utilizing nonconvex cutting planes. J. Glob. Optim. 68, 501–535 (2017)
Joki, K., Bagirov, A.M., Karmitsa, N., Mäkelä, M.M., Taheri, S.: Double bundle method for finding Clarke stationary points in nonsmooth DC programming. SIAM J. Optim. 28(2), 1892–1919 (2018)
Joki, K., Bagirov, A.M., Karmitsa, N., Mäkelä, M.M., Taheri, S.: Double bundle method for finding Clarke stationary points in nonsmooth DC optimization. Technical reports 1173, Turku Center for Computer Science (TUCS), Turku (2017)
Khalaf, W., Astorino, A., D’Alessandro, P., Gaudioso, M.: A DC optimization-based clustering technique for edge detection. Optim. Lett. 11(3), 627–640 (2017)
Lewis, A.S., Overton, M.L.: Nonsmooth optimization via quasi-Newton methods. Math. Program. 141(1–2), 135–163 (2013)
Luksan, L.: Dual method for solving a special problem of quadratic programming as a subproblem at linearly constrained nonlinear minimax approximation. Kybernetika 20(6), 445–457 (1984)
Mäkelä, M.M.: Multiobjective proximal bundle method for nonconvex nonsmooth optimization: Fortran subroutine MPBNGC 2.0. Reports of the Department of Mathematical Information Technology, Series B. Scientific Computing B 13/2003, University of Jyväskylä, Jyväskylä (2003)
Mäkelä, M.M., Neittaanmäki, P.: Nonsmooth Optimization: Analysis and Algorithms with Applications to Optimal Control. World Scientific Publishing Co., Singapore (1992)
Ordin, B., Bagirov, A.M.: A heuristic algorithm for solving the minimum sum-of-squares clustering problems. J. Glob. Optim. 61, 341–361 (2015)
Pang, J.S., Razaviyayn, M., Alvarado, A.: Computing B-stationary points of nonsmooth DC programs. Math. Oper. Res. 42(1), 95–118 (2017)
Sun, W.Y., Sampaio, R.J.B., Candido, M.A.B.: Proximal point algorithm for minimization of DC functions. J. Comput. Math. 21(4), 451–462 (2003)
Tao, P.D., Souad, E.B.: Algorithms for solving a class of nonconvex optimization problems: methods of subgradient. North-Holland Mathematics Studies. Fermat Days 85: Mathematics for Optimization. 129, 249–271 (1986)
Toland, J.F.: On subdifferential calculus and duality in nonconvex optimization. Bull. Soc. Math. France Mémoire. 60, 177–183 (1979)
Tuy, H.: Convex Analysis and Global Optimization. Kluwer, Dordrecht (1998)
van Ackooij, W., de Oliveira, W.: Nonsmooth and nonconvex optimization via approximate difference-of-convex decompositions. J. Optim. Theory Appl. 182, 49–80 (2019)
Wolfe, P.: Finding the nearest point in a polytope. Math. Program. 11, 128–149 (1976)
Zaffaroni, A.: Continuous approximations, codifferentiable functions and minimization methods. In: Demyanov, V.F., Rubinov, A.M. (eds.) Quasidifferentiability and Related Topics, Nonconvex Optimization and Its Applications, pp. 361–391. Kluwer Academic Publishers, Dordrecht (2000)
Acknowledgements
The research by Dr. A.M. Bagirov is supported by the Australian Government through the Australian Research Council’s Discovery Projects funding scheme (Project No. DP190100580) and the research by Dr. N. Hoseini Monjezi is supported by the National Elite Foundation of Iran. The authors would like to thank the handling editor and three anonymous referees for their comments that helped to improve the quality of the paper.
Bagirov, A.M., Hoseini Monjezi, N. & Taheri, S. An augmented subgradient method for minimizing nonsmooth DC functions. Comput Optim Appl 80, 411–438 (2021). https://doi.org/10.1007/s10589-021-00304-4