1 Introduction

Consider an unconstrained optimization problem

$$\begin{aligned} {\left\{ \begin{array}{ll} \text {minimize} \quad &{} f(x) \\ \text {subject to} \quad &{} x \in \mathbb {R}^n, \end{array}\right. } \end{aligned}$$
(1)

where \(f:\mathbb {R}^n\rightarrow \mathbb {R}\) is, in general, nonsmooth and is expressed as a difference of two convex functions \(f_1,f_2:\mathbb {R}^n\rightarrow \mathbb {R}\):

$$\begin{aligned} f (x)=f_1(x)-f_2(x). \end{aligned}$$

Some important practical problems can be formulated as a nonsmooth DC program of the form (1). They include supervised data classification [6], cluster analysis [11, 32], clusterwise linear regression analysis [10, 12] and edge detection [27] problems. Moreover, DC optimization methods can be applied to solve a broad class of nonsmooth nonconvex optimization problems [38].
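As a small illustration (ours, not taken from the paper), consider the piecewise linear function \(f(x)=|x|-|x-1|\): both DC components are convex, while their difference is neither convex nor concave.

```python
# Illustration (ours): a nonsmooth DC function written as a difference of
# two convex components, f(x) = f1(x) - f2(x) with f1(x) = |x|, f2(x) = |x - 1|.
def f1(x):
    return abs(x)

def f2(x):
    return abs(x - 1.0)

def f(x):
    return f1(x) - f2(x)

# f is piecewise linear: f(x) = -1 for x <= 0, f(x) = 2x - 1 on [0, 1] and
# f(x) = 1 for x >= 1, so f itself is neither convex nor concave.
print(f(-2.0), f(0.5), f(2.0))  # -1.0 0.0 1.0
```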

Different stationarity conditions, such as inf-stationarity (also called d-stationarity), Clarke stationarity and criticality conditions, have been studied for unconstrained DC minimization problems [25]. In the paper [33], the notion of B-stationary points of nonsmooth DC problems was defined and its relation with the criticality condition was investigated.

The special structure of DC problems stimulates the design of efficient methods that exploit it. To date, DC optimization problems have been considered in the context of both local and global optimization. Several methods have been developed to solve these problems globally [23, 37]. To the best of our knowledge, the difference of convex algorithm (DCA) was the first local search algorithm for solving DC optimization problems. This algorithm was introduced in [35] and further studied in [4, 5].

Over the last decade the development of local search methods for unconstrained nonsmooth DC optimization has attracted noticeable attention. A short survey of such methods can be found in [19]. Broadly, methods for local DC optimization can be divided into three groups. The first group consists of methods which are extensions of the bundle methods for convex problems. They include the codifferential method [13], the proximal bundle method with the concave affine model [22], the proximal bundle method for DC optimization [17, 24] and the double bundle method [25]. In these methods, piecewise linear underestimates of both DC components or subgradients of these components are used to compute search directions.

The second group consists of the DCA, its modifications and extensions. Although the DCA performs well in practice, its convergence can be fairly slow for some particular problems. Various modifications of the DCA have been developed to improve its convergence. The inertial DCA was developed in [20]. In papers [2, 3], the boosted DC algorithm (BDCA) was proposed to minimize smooth DC functions. The BDCA accelerates the convergence of the DCA using an additional line search step. More precisely, it performs a line search at the point generated by the DCA, which leads to a larger decrease in the objective value at each iteration. In [1], the BDCA is combined with a simple derivative free optimization method, which makes it possible to guarantee d-stationarity (the absence of descent directions) at the obtained point. To avoid the difficulty of solving the DCA’s subproblem, in [18] the first DC component is replaced with a convex model while the second DC component is used without any approximation.

The third group consists of algorithms that at each iteration apply the convex piecewise linear model of the first DC component and one subgradient of the second DC component to compute search directions. These algorithms differ from the DCA-type algorithms in that at each iteration they update the model of the first DC component and calculate a new subgradient of the second component. They also differ from the methods of the first group because they use only one subgradient of the second DC component, whereas the latter methods may use more than one subgradient of this component to build its piecewise linear model. A representative of this group is the aggregate subgradient method for DC optimization developed in [9]. In this algorithm, the aggregate subgradient of the first DC component and one subgradient of the second component are used to compute search directions.

In this paper, we introduce a new method which belongs to the third group. First, we define augmented subgradients of convex functions. Using them, we construct a model of the first DC component. This model and one subgradient of the second DC component are utilized to compute search directions. Then we design a new method which uses these search directions and applies an Armijo-type line search. We prove that the sequence of points generated by this method converges to critical points of the DC function. We utilize nonsmooth DC optimization test problems, including those with a large number of variables, to demonstrate the performance of the developed method and to compare it with two general methods of nonsmooth optimization and five methods of nonsmooth DC optimization. Our results show that, in general, the developed method is more robust than the other methods used in the numerical experiments.

The rest of the paper is organized as follows. In Sect. 2, we provide necessary notations and some preliminaries on the nonsmooth analysis and DC optimization. In Sect. 3, the new method is introduced and its convergence is studied. Results of numerical experiments are reported in Sect. 4, and concluding remarks are given in Sect. 5.

2 Preliminaries

In this section we provide the notations that will be used throughout the paper and recall some concepts of nonsmooth analysis and DC optimization. For more details on nonsmooth analysis, we refer to [7, 8].

In what follows, \(\mathbb {R}^n\) is the n-dimensional Euclidean space, \(u^Tv =\sum _{i=1}^n u_i v_i\) is the inner product of vectors \(u,v\in \mathbb {R}^n\), and \(\Vert \cdot \Vert \) is the associated Euclidean norm. The origin of \(\mathbb {R}^n\) is denoted by \(0_n\). The unit sphere is denoted by \(S_1=\{d \in \mathbb {R}^n:\Vert d\Vert = 1\}\). For \(x \in \mathbb {R}^n\) and \(\varepsilon >0\), \(B_\varepsilon (x)\) is the open ball of radius \(\varepsilon \) centred at x. The convex hull of a set is denoted by “\(\mathop {\mathrm {conv}}\)” and “cl” stands for the closure of a set.

The function \(f:\mathbb {R}^n \rightarrow \mathbb {R}\) is locally Lipschitz continuous on \(\mathbb {R}^n\) if for every \(x \in \mathbb {R}^n\) there exist a Lipschitz constant \(L>0\) and \(\varepsilon >0\) such that \(|f(y)-f(z)| \le L\Vert y-z\Vert \) for all \(y, z \in B_\varepsilon (x).\)

The subdifferential of a convex function \(f:\mathbb {R}^n \rightarrow \mathbb {R}\) at a point \(x \in \mathbb {R}^n\) is the set

$$\begin{aligned} \partial f(x) = \Big \{\xi \in \mathbb {R}^n:~f(y) \ge f(x) + \xi ^T (y-x),~\forall y \in \mathbb {R}^n \Big \}. \end{aligned}$$

Each vector \(\xi \in \partial f(x)\) is called a subgradient of f at x. The subdifferential \(\partial f(x)\) is a nonempty, convex and compact set such that \(\partial f(x) \subseteq \bar{B}_L(0_n)\), where \(L>0\) is the Lipschitz constant of f at x. In addition, the subdifferential mapping \(x \mapsto \partial f(x)\) is upper semicontinuous.

For \(\varepsilon \ge 0\), the \(\varepsilon \)-subdifferential of a convex function f at a point \(x\in \mathbb {R}^n\) is defined by

$$\begin{aligned} \partial _\varepsilon f(x) = \Big \{\xi _\varepsilon \in \mathbb {R}^n:~f(y)\ge f(x)+\xi _\varepsilon ^T (y-x)-\varepsilon ,~\forall y\in \mathbb {R}^n \Big \}. \end{aligned}$$

Each vector \(\xi _\varepsilon \in \partial _\varepsilon f(x)\) is called an \(\varepsilon \)-subgradient of f at x. The set \(\partial _\varepsilon f(x)\) contains the subgradient information from some neighborhood of x as the following theorem shows.
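For example (our own check, not from the paper), for \(f(y)=|y|\) and \(x>0\) one can verify that \(\xi _\varepsilon = 1-\varepsilon /x\) is an \(\varepsilon \)-subgradient at x even though it is not an ordinary subgradient there:

```python
# Numerical check (ours): for f(y) = |y| and x > 0, xi_eps = 1 - eps/x satisfies
# the eps-subgradient inequality |y| >= |x| + xi_eps*(y - x) - eps for all y,
# although xi_eps is not a subgradient of |.| at x when eps > 0.
x, eps = 1.0, 0.5
xi_eps = 1.0 - eps / x   # = 0.5
slack = min(abs(y) - (abs(x) + xi_eps * (y - x) - eps)
            for y in [i / 100.0 for i in range(-500, 501)])
print(slack >= 0.0)  # True: the inequality holds on the grid (it is tight at y = 0)
```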

Theorem 1

[8] Let \(f:\mathbb {R}^n\rightarrow \mathbb {R}\) be a convex function with the Lipschitz constant \(L>0\) at a point \(x\in \mathbb {R}^n\). Then for \(\varepsilon \ge 0\)

$$\begin{aligned} \partial f(y)\subseteq \partial _\varepsilon f(x),\quad \forall y \in B_{\frac{\varepsilon }{2L}}(x). \end{aligned}$$

The following theorem presents a necessary optimality condition for Problem (1).

Theorem 2

[5, 36] If \(x^*\in \mathbb {R}^n\) is a local minimizer of Problem (1), then

$$\begin{aligned} \partial f_2(x^*)\subseteq \partial f_1(x^*). \end{aligned}$$
(2)

A point \(x^*\), satisfying the condition (2), is called inf-stationary for Problem (1). The optimality condition (2) is, in general, hard to verify and is usually relaxed to the following condition [5, 36]:

$$\begin{aligned} \partial f_1(x^*)\cap \partial f_2(x^*)\ne \emptyset . \end{aligned}$$
(3)

A point \(x^*\in \mathbb {R}^n\), satisfying (3), is called a critical point of the function f. In a similar way, we can define an \(\varepsilon \)-critical point. Let \(\varepsilon \ge 0\). A point \(x^*\in \mathbb {R}^n\) is said to be an \(\varepsilon \)-critical point of f if [34]

$$\begin{aligned} \partial _\varepsilon f_1(x^*) \cap \partial _\varepsilon f_2(x^*) \ne \emptyset . \end{aligned}$$
(4)
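A standard illustration (ours): take \(f_1\equiv 0\) and \(f_2(x)=|x|\), so \(f(x)=-|x|\). At \(x^*=0\) the condition (3) holds while (2) fails, so \(x^*\) is critical but not inf-stationary; indeed it is a global maximizer, not a minimizer.

```python
# Illustrative check (ours) at x* = 0 for f = f1 - f2 with f1(x) = 0, f2(x) = |x|:
# partial f1(0) = {0} and partial f2(0) = [-1, 1], represented as intervals.
sub_f1 = (0.0, 0.0)    # [0, 0]
sub_f2 = (-1.0, 1.0)   # [-1, 1]

# Criticality (3): the two subdifferentials intersect.
critical = max(sub_f1[0], sub_f2[0]) <= min(sub_f1[1], sub_f2[1])

# Inf-stationarity (2): partial f2(0) must be contained in partial f1(0).
inf_stationary = sub_f2[0] >= sub_f1[0] and sub_f2[1] <= sub_f1[1]

print(critical, inf_stationary)  # True False: critical, yet not inf-stationary
```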

3 Augmented subgradient method

In this section we describe the new method and study its convergence. We start with the definition of augmented subgradients of convex functions.

3.1 Augmented subgradients

Let \(\varphi : \mathbb {R}^n \rightarrow \mathbb {R}\) be a convex function and \(U \subset \mathbb {R}^n\) be a compact set. Take any \(x, y \in U\); in particular, \(y = x\) is allowed. Let \(\xi \in \partial \varphi (y)\) be any subgradient computed at the point y. The linearization error of this subgradient at the point x, which is nonnegative by the convexity of \(\varphi \), is defined as:

$$\begin{aligned} \alpha (x,y,\xi ) = \varphi (x) - \Big [\varphi (y) + \xi ^T (x - y)\Big ] \ge 0. \end{aligned}$$
(5)

It is clear that

$$\begin{aligned} \varphi (y) - \varphi (x) = -\alpha (x,y,\xi ) + \xi ^T (y-x). \end{aligned}$$
(6)

Definition 1

A vector \(u \in \mathbb {R}^{n+1}\) is called an augmented subgradient of the convex function \(\varphi \) at the point x with respect to the compact set \(U \subset \mathbb {R}^n\) if there exists \(y \in U\) such that \(u = \big (\alpha (x,y,\xi ), \xi \big ),\) where \(\xi \in \partial \varphi (y)\) and \(\alpha (x,y,\xi )\) is defined by (5).

Note that there is some similarity between augmented subgradients and elements of codifferentials considered in [15, 40].

The set \(D_U\varphi (x)\) of all augmented subgradients at the point x with respect to the set U is defined as:

$$\begin{aligned} D_U\varphi (x) = \mathop {\mathrm {cl}}\mathop {\mathrm {conv}}\Big \{u \in \mathbb {R}^{n+1}:~\exists \big (y\in U, \xi \in \partial \varphi (y)\big ),~u = \big (\alpha (x,y,\xi ), \xi \big ) \Big \}. \end{aligned}$$

For \(u \in D_U\varphi (x)\) we will use the notation \(u = (\alpha , \xi )\), where \(\alpha \ge 0\) and \(\xi \in \mathbb {R}^n\). If we take \(y=x\), then \(\alpha =0\). Therefore, the set \(D_U\varphi (x)\) also contains elements of the form \((0,\xi )\), where \(\xi \in \partial \varphi (x)\).
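As a concrete sketch (ours), for the smooth convex function \(\varphi (x)=x^2\) on the real line the subgradient at y is \(2y\), and the linearization error reduces to \(\alpha (x,y,\xi )=(x-y)^2\), so the augmented subgradients take the form \(u=((x-y)^2,\,2y)\):

```python
# Sketch (ours): augmented subgradients of phi(x) = x**2 with respect to
# U = [x - tau, x + tau].  Here partial phi(y) = {2y}, and the linearization
# error (5) simplifies to alpha(x, y, xi) = x**2 - y**2 - 2y*(x - y) = (x - y)**2.
def augmented_subgradient(x, y):
    xi = 2.0 * y                          # the subgradient of phi at y
    alpha = x**2 - (y**2 + xi * (x - y))  # linearization error at x
    return (alpha, xi)

x, tau = 1.0, 0.5
for y in [0.5, 0.75, 1.0, 1.25, 1.5]:     # sample points of U = [0.5, 1.5]
    alpha, xi = augmented_subgradient(x, y)
    assert alpha == (x - y) ** 2 and alpha >= 0.0
print(augmented_subgradient(1.0, 0.5))    # (0.25, 1.0); y = x gives (0.0, 2.0)
```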

Proposition 1

The set \(D_U\varphi (x)\) is compact and convex for any \(x \in U\).

Proof

The proof follows from the definition of the set \(D_U\varphi (x)\) and from the fact that the subdifferential of the function \(\varphi \) is bounded on bounded sets. \(\square \)

It follows from (6) that, writing \(u = (\alpha ,\xi )\), for any \(x,y \in U\) we have

$$\begin{aligned} \varphi (y) - \varphi (x) \le \max _{u \in D_U\varphi (x)} ~\Big [-\alpha + \xi ^T (y-x) \Big ]. \end{aligned}$$
(7)

Consider the following two sets at the point \(x \in U\):

$$\begin{aligned} A(x)&= \Big \{\alpha \ge 0: ~\exists ~\big (y \in U,~\xi \in \partial \varphi (y)\big ), (\alpha ,\xi ) \in D_U\varphi (x) \Big \},\\ C(x)&= \Big \{\xi \in \mathbb {R}^n: ~\exists ~ \alpha \ge 0, (\alpha ,\xi ) \in D_U\varphi (x) \Big \}. \end{aligned}$$

Take any \(x \in \mathbb {R}^n\), \(\tau > 0\) and consider the closed ball \(\bar{B}_\tau (x).\) Let \(U=\bar{B}_\tau (x)\) and \(D_U\varphi (x)\) be the set of augmented subgradients at the point x with respect to U.

Proposition 2

Let \(x \in \mathbb {R}^n, ~\tau > 0, ~U = \bar{B}_\tau (x)\) and \(L >0\) be a Lipschitz constant of \(\varphi \) on \(\bar{B}_\tau (x)\). Then for \(\varepsilon > 2L\tau \) we have

$$\begin{aligned} 0 \le \alpha \le 2L \tau ~~\forall \alpha \in A(x)~~\text{ and }~~C(x) \subseteq \partial _\varepsilon \varphi (x). \end{aligned}$$

Proof

The bound for \(\alpha \) follows from the Lipschitz continuity of \(\varphi \) and the fact that \(\Vert \xi \Vert \le L\): for any \(y \in \bar{B}_\tau (x)\) and \(\xi \in \partial \varphi (y)\) we have \(\alpha (x,y,\xi ) = \varphi (x) - \varphi (y) + \xi ^T (y-x) \le L\Vert x-y\Vert + L\Vert y-x\Vert \le 2L\tau \). Furthermore, Theorem 1 implies that if \(\varepsilon > 2L\tau ,\) then \(\partial \varphi (y) \subseteq \partial _\varepsilon \varphi (x)\) for all \(y \in \bar{B}_\tau (x).\) Then the proof is obtained from the definition of the set C(x). \(\square \)

One interesting case is when \(U = \{x\}\). In this case \(\alpha = 0\), and therefore, we have

$$\begin{aligned} D_U \varphi (x) \equiv D_x \varphi (x) = \Big \{u \in \mathbb {R}^{n+1}:~u=(0,\xi ), ~\xi \in \partial \varphi (x) \Big \}. \end{aligned}$$

Next we consider the DC function \(f(x) = f_1(x) - f_2(x)\). Again assume that \(U \subset \mathbb {R}^n\) is a compact set and \(x \in U\). Define the set of augmented subgradients \(D_U f_1(x)\) for the function \(f_1\) and the set \(D_x f_2(x)\) for the function \(f_2\). Construct the set

$$\begin{aligned} D_U f(x) = D_U f_1(x) - D_x f_2(x). \end{aligned}$$

Proposition 3

Let \(f:\mathbb {R}^n \rightarrow \mathbb {R}\) be a DC function and \(U \subset \mathbb {R}^n\) be a compact set. Then at a point \(x \in U\) we have

$$\begin{aligned} f(y) \le f(x) + \max _{(\alpha , \xi ) \in D_U f(x)} \Big \{-\alpha + \xi ^T (y-x) \Big \},~y \in U. \end{aligned}$$
(8)

Proof

Let \(y \in U\) be any point. Taking into account the convexity of the function \(f_2\) and applying (6) to the function \(f_1\) at y we get that for \(\xi _1 \in \partial f_1(y), ~\xi _2 \in \partial f_2(x)\) the following holds:

$$\begin{aligned} f(y)&= f_1(y) - f_2(y) \\&\le f_1(x) - \alpha (x,y,\xi _1) + \xi _1^T (y-x) - \Big (f_2(x) + \xi _2^T (y-x) \Big ) \\&= \Big (f_1(x) - f_2(x) \Big ) - \alpha (x,y,\xi _1) + (\xi _1 - \xi _2)^T (y-x) \\&= f(x) + \Big (-\alpha (x,y,\xi _1) + (\xi _1 - \xi _2)^T (y-x) \Big ). \end{aligned}$$

It is clear that \(\big (\alpha (x,y,\xi _1), \xi _1 -\xi _2\big ) \in D_U f(x)\), and thus the proof is complete. \(\square \)

The following corollary can be easily obtained from Proposition 3.

Corollary 1

Let \(x \in \mathbb {R}^n\) and \(U = \bar{B}_\tau (x), ~\tau > 0\). Then for any \(d \in S_1\) we have

$$\begin{aligned} f(x+ \lambda d) \le f(x) + \max _{(\alpha , \xi ) \in D_U f(x)} \Big \{-\alpha + \lambda \xi ^T d \Big \}~~\forall \lambda \in [0,\tau ]. \end{aligned}$$
(9)

3.2 The model

In order to find search directions in Problem (1) we use the following model of the objective function f. Take any point \(x \in \mathbb {R}^n\) and a number \(\tau \in (0,1]\). Consider the closed ball \(\bar{B}_\tau (x)\). Compute the set \(D_U f(x)\) at x for \(U=\bar{B}_\tau (x)\). Then we replace the function f on U by the following function:

$$\begin{aligned} \hat{f}(y) = f(x) + \max _{(\alpha , \xi ) \in D_U f(x)} \Big \{-\alpha + \xi ^T (y-x) \Big \},~y \in U. \end{aligned}$$
(10)
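A tiny sketch (ours) of this model for \(f(y)=|y|\) (that is, \(f_1=|\cdot |\) and \(f_2\equiv 0\)) at \(x=0\) with \(\tau =1\): every linearization error vanishes, the generators of \(D_U f(0)\) reduce to \((0,\pm 1)\), and the model reproduces f exactly on U.

```python
# Sketch (ours): the model f_hat for f(y) = |y| (f1 = |.|, f2 = 0) at x = 0
# with tau = 1, U = [-1, 1].  Every linearization error vanishes here, so
# D_U f(0) is generated by the two augmented subgradients (0, -1) and (0, 1).
D = [(0.0, -1.0), (0.0, 1.0)]
x, fx = 0.0, 0.0

def f_hat(y):
    return fx + max(-alpha + xi * (y - x) for alpha, xi in D)

for y in [-1.0, -0.3, 0.0, 0.7, 1.0]:
    assert f_hat(y) == abs(y)   # f <= f_hat holds on U, here with equality
print(f_hat(x))  # 0.0 = f(x): the model is exact at the reference point
```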

Proposition 4

The function \(\hat{f}\) is convex, \(f(y) \le \hat{f}(y)\) for all \(y \in U\) and \(\hat{f}(x) = f(x)\).

Proof

Convexity of the function \(\hat{f}\) follows from its definition. For any \(y \in \bar{B}_\tau (x)\) there exist \(d \in S_1\) and \(\lambda \in [0,\tau ]\) such that \(y = x+\lambda d\). Then by applying Corollary 1 we get that \(f(y) \le \hat{f}(y)\) for all \(y \in U\). Finally, for the point x we have

$$\begin{aligned} \hat{f}(x) = f(x) + \max _{(\alpha , \xi ) \in D_U f(x)} \{-\alpha \}. \end{aligned}$$

Since \((0,\xi ) \in D_U f(x)\) and \(\alpha \ge 0\) for all \((\alpha ,\xi ) \in D_U f(x)\) we have

$$\begin{aligned} \max _{(\alpha , \xi ) \in D_U f(x)} \{-\alpha \} = 0. \end{aligned}$$

Therefore, \(\hat{f}(x) = f(x)\). \(\square \)

Proposition 5

For any \(y \in U\) we have

$$\begin{aligned} \partial \hat{f}(y) = \mathop {\mathrm {conv}}\Big \{\xi \in \mathbb {R}^n:~\exists ~ \alpha \ge 0: (\alpha , \xi ) \in \hat{R}(y) \Big \}, \end{aligned}$$

where \(\hat{R}(y) = \big \{(\alpha , \xi ) \in D_U f(x): ~\hat{f}(y) = f(x) - \alpha + \xi ^T (y-x)\big \}\). In particular,

$$\begin{aligned} \partial \hat{f}(x) = \mathop {\mathrm {conv}}\Big \{\xi \in \mathbb {R}^n:~(0, \xi ) \in D_U f(x) \Big \}. \end{aligned}$$
(11)

Proof

The function \(\hat{f}\) is convex and all functions under maximum in its expression are linear with respect to y. Then the expression for the subdifferential \(\partial \hat{f}(y)\) follows from Theorem 3.23 [8]. Since \(\hat{f}(x) = f(x)\) at the point x we get that \(\hat{R}(x) = \big \{(\alpha , \xi ) \in D_U f(x): ~\alpha =0\big \}\). Then we obtain the expression (11) for the subdifferential \(\partial \hat{f}(x)\). \(\square \)

Results from Propositions 4 and 5 show that the set \(D_U f(x)\) can be used to find search directions for the DC function f, as demonstrated in the next proposition.

Proposition 6

Assume that the set \(D_U f(x)\) is constructed using \(U=\bar{B}_\tau (x),\) \(~\tau \in (0,1]\). In addition assume \(0_{n+1} \notin D_U f(x)\) and

$$\begin{aligned} \Vert \overline{w}\Vert ^2 = \min \Big \{\Vert w\Vert ^2:~w \in D_U f(x) \Big \} > 0,\quad \text {with}\quad \overline{w} = (\overline{\alpha },\overline{\xi }). \end{aligned}$$
(12)

Then

$$\begin{aligned} f(x+\tau \overline{d})-f(x)\le -\tau \Vert \overline{w}\Vert , \end{aligned}$$

where \(\overline{d} = -\Vert \overline{w}\Vert ^{-1}\overline{\xi }\).

Proof

The set \(D_U f(x)\) is compact and convex. Then the necessary condition for a minimum implies that

$$\begin{aligned} \overline{w}^T(w - \overline{w})\ge 0,\quad \forall w \in D_U f(x). \end{aligned}$$

Taking into account that \(\Vert \overline{w}\Vert ^2 = \overline{\alpha }^2+\Vert \overline{\xi }\Vert ^2\) we get

$$\begin{aligned} \overline{\alpha }^2+\Vert \overline{\xi }\Vert ^2 \le \alpha \overline{\alpha }+\xi ^T\overline{\xi },\quad \forall w=(\alpha ,\xi ) \in D_U f(x). \end{aligned}$$
(13)

First we show that \(\overline{\xi } \ne 0_n\). Assume the contrary, that is \(\overline{\xi }=0_n\). Since \(\overline{w}\ne 0_{n+1}\) it follows that \(\overline{\alpha }\ne 0\) and therefore, \(\overline{\alpha } > 0\). It follows from (13) that \(0< \overline{\alpha } \le \alpha \) for all \((\alpha ,\xi )\in D_U f(x)\) which is a contradiction since the element \((0, \xi )\) is included in \(D_U f(x)\). Therefore \(\overline{\xi }\ne 0_n\) which implies that \(\overline{d} \ne 0_n\).

Dividing both sides of (13) by \(-\Vert \overline{w}\Vert \), we obtain

$$\begin{aligned} -\frac{\alpha \overline{\alpha }}{\Vert \overline{w}\Vert }+\xi ^T\overline{d} \le -\Vert \overline{w}\Vert ,\quad \forall w = (\alpha ,\xi ) \in D_U f(x). \end{aligned}$$
(14)

Note that \(\Vert \overline{w}\Vert ^{-1} \overline{\alpha } \in [0,1)\). Since \(\tau \in (0,1]\) it follows that \(\Vert \overline{w}\Vert ^{-1} \tau \overline{\alpha } \in [0,1)\). Furthermore, since \(\alpha \ge 0\) for all \((\alpha ,\xi )\in D_U f(x)\) we get

$$\begin{aligned} - \alpha + \tau \xi ^T\overline{d}&\le \frac{-\tau \alpha \overline{\alpha }}{\Vert \overline{w}\Vert } + \tau \xi ^T\overline{d}\\&=\tau \Big (\frac{-\alpha \overline{\alpha }}{\Vert \overline{w}\Vert }+\xi ^T\overline{d}\Big )\\&\le -\tau \Vert \overline{w}\Vert , \end{aligned}$$

where the last inequality comes from (14). From here, by applying (9), we obtain that \(f(x+\tau \overline{d})-f(x)\le -\tau \Vert \overline{w}\Vert \). This means that \(\overline{d}\) is a descent direction of the function f at the point \(x \in \mathbb {R}^n\). \(\square \)
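To make the construction concrete, here is a small numerical sketch (ours, not the paper's implementation) for \(f(x)=|x|-x^2/2\) at \(x=0.2\) with \(\tau =0.25\): we take two sampled augmented subgradients (the algorithm likewise works with finite subsets of \(D_U f(x)\)), project the origin onto their convex hull to obtain \(\overline{w}\), and check the decrease predicted by Proposition 6.

```python
import math

# Numerical sketch (ours): the descent direction of Proposition 6 for
# f(x) = |x| - x**2/2 (f1 = |.|, f2 = x**2/2) at x = 0.2 with tau = 0.25.
def f(x):
    return abs(x) - 0.5 * x * x

x, tau = 0.2, 0.25
# Two sampled augmented subgradients of f at x w.r.t. U = [-0.05, 0.45],
# using xi_2 = f2'(x) = 0.2:
#   y = x = 0.2 : xi_1 = 1,  alpha = 0    ->  w1 = (0.0,  0.8)
#   y = -0.05   : xi_1 = -1, alpha = 0.4  ->  w2 = (0.4, -1.2)
w1, w2 = (0.0, 0.8), (0.4, -1.2)

# Min-norm element of conv{w1, w2}: exact projection of the origin onto a segment.
d0, d1 = w1[0] - w2[0], w1[1] - w2[1]
t = min(1.0, max(0.0, (w1[0] * d0 + w1[1] * d1) / (d0 * d0 + d1 * d1)))
w_bar = ((1 - t) * w1[0] + t * w2[0], (1 - t) * w1[1] + t * w2[1])
norm_w = math.hypot(w_bar[0], w_bar[1])
d_bar = -w_bar[1] / norm_w   # d_bar = -xi_bar / ||w_bar||

# Decrease predicted by Proposition 6: f(x + tau*d_bar) - f(x) <= -tau*||w_bar||.
decrease = f(x + tau * d_bar) - f(x)
print(decrease <= -tau * norm_w)  # True
```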

3.3 The proposed algorithm

Now we introduce the augmented subgradient method (asm-dc) for solving the DC optimization problem (1). Let the search control parameter \( c_1 \in (0,1)\), the line search parameter \(c_2 \in (0, c_1]\), the decrease parameter \(c_3 \in (0,1)\) and the tolerance \(\theta > 0\) for the criticality condition be given.

Algorithm 1 (asm-dc)

Some explanations of Algorithm 1 are in order. This algorithm consists of outer and inner loops. In the inner loop we compute augmented subgradients of the DC components and the subset \(\hat{D}_l^k\) of the set of augmented subgradients (Steps 1 and 4). We use this subset to formulate a quadratic programming problem to find search directions (Step 2). This problem can be rewritten as:

$$\begin{aligned} {\left\{ \begin{array}{ll} \text {minimize} \quad &{} \Vert w\Vert ^2 \\ \text {subject to} \quad &{} w \in \hat{D}_l^k. \end{array}\right. } \end{aligned}$$

Here the set \(\hat{D}_l^k\) is a polytope. There are several algorithms for solving such problems [21, 29, 39]. The algorithm exits the inner loop when either the distance between the set of augmented subgradients and the origin is sufficiently small (that is, the condition (15) is satisfied), or a direction of sufficient decrease is found (Step 4). In the first case the current value of the parameter \(\tau \) does not allow any improvement of the solution and should be updated. In the second case the line search procedure is applied.

In the outer loop we update the parameters of the method (Step 5) and apply the line search procedure to find a new solution (Step 6). Since it is not easy to find the exact value of the step-length \(\sigma _l\) in Step 6, we propose the following scheme to estimate it in the implementation of the algorithm. It is clear that \(\sigma _l \ge \tau _l\). Consider the sequence \(\{\bar{\sigma }_i\},~i=1,2,\ldots \), where \(\bar{\sigma }_1 = \tau _l\) and \(\bar{\sigma }_i = 2^{i-1} \bar{\sigma }_1, ~i=1,2,\ldots \). We compute the largest \(i \ge 1\) satisfying the condition in Step 6. Then \(\bar{\sigma }_l = 2^{i-1} \tau _l\) is the estimate of \(\sigma _l\). The algorithm stops when the values of its parameters become sufficiently small (Step 5). This means that an approximate criticality condition is achieved.
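The doubling scheme above can be sketched as follows (our reading of the scheme; the names f, x, d, tau, c2 and norm_w are placeholders for \(f\), \(x_l\), \(\overline{d}^k_l\), \(\tau _l\), \(c_2\) and \(\Vert \overline{w}^k_l\Vert \)):

```python
# Sketch (ours) of the doubling scheme for estimating sigma_l in Step 6: starting
# from sigma_1 = tau, keep doubling while the sufficient-decrease condition
# f(x) - f(x + sigma*d) >= c2 * sigma * norm_w still holds, and return the last
# step of the form 2**(i-1) * tau that satisfies it.
def estimate_step(f, x, d, tau, c2, norm_w):
    sigma, fx = tau, f(x)
    if fx - f(x + sigma * d) < c2 * sigma * norm_w:
        return 0.0  # sigma_1 = tau itself fails (should not occur after Step 3)
    while fx - f(x + 2.0 * sigma * d) >= 2.0 * c2 * sigma * norm_w:
        sigma *= 2.0
    return sigma

# Hypothetical usage: f(x) = x**2 at x = 8 with descent direction d = -1.
sigma = estimate_step(lambda z: z * z, 8.0, -1.0, 0.5, 0.1, 1.0)
print(sigma)  # 8.0
```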

3.4 Global convergence

In this section we study the global convergence of Algorithm 1. First, we prove that for each \(l \ge 0\) the number of inner iterations of this algorithm is finite. At the point \(x_l \in \mathbb {R}^n\), for given \(\tau _l>0\) and \(\delta _l>0,~l > 0,\) define the following numbers:

$$\begin{aligned} K_l = \max \Big \{\Vert w\Vert :~w\in D_U f(x_l) \Big \},~~M_l = 1 - \Big ((1 - c_1)(2K_l)^{-1}\delta _l\Big )^2. \end{aligned}$$
(17)

It follows from Proposition 2 that \(K_l < \infty \) for all \(l>0\).

Proposition 7

Let \(\delta _l \in (0,K_l), l>0\). Then the inner loop at the l-th iteration of the asm-dc stops after at most \(m_l\) iterations, where

$$\begin{aligned} m_l \le \frac{2\log _2 (\delta _l/K_l)}{\log _2(M_l)} + 1. \end{aligned}$$

Proof

We prove the proposition by contradiction. Assume that at the l-th iteration, for any \(k \ge 1\), the condition (15) is not satisfied and the condition (16) is satisfied, that is \(\Vert \overline{w}^k_l\Vert \ge \delta _l\) and

$$\begin{aligned} f(x_l + \tau _l d^{k+1}_l) - f(x_l)> -c_1\tau _l \Vert \overline{w}^k_l\Vert ,~\forall k \ge 1. \end{aligned}$$
(18)

First, we show that the new augmented subgradient

$$\begin{aligned} w^{k+1}_l \equiv (\alpha ^{k+1}_l, \xi ^{k+1}_l) = (\alpha ^{k+1}_l, \xi _{1l}^{k+1} - \xi _{2l})=(\alpha ^{k+1}_l, \xi _{1l}^{k+1}) - (0,\xi _{2l}), \end{aligned}$$

computed at Step 4 does not belong to the set \(\hat{D}^k_l\). Assume the contrary, that is \(w^{k+1}_l \in \hat{D}^k_l.\) It follows from the definition (5) of the linearization error that

$$\begin{aligned} f_1(x_l + \tau _l d^{k+1}_l) - f_1(x_l) = -\alpha ^{k+1}_l + \tau _l \Big (\xi ^{k+1}_{1l} \Big )^T d^{k+1}_l. \end{aligned}$$
(19)

Since \(\xi _{2l} \in \partial f_2(x_l)\) the subgradient inequality implies that

$$\begin{aligned} f_2(x_l+\tau _l d^{k+1}_l) - f_2(x_l) \ge \tau _l\xi _{2l}^T d^{k+1}_l. \end{aligned}$$

Subtracting this from (19) we get

$$\begin{aligned} f(x_l + \tau _l d^{k+1}_l) - f(x_l) \le \tau _l \Big (\xi _{1l}^{k+1} - \xi _{2l} \Big )^T d^{k+1}_l - \alpha ^{k+1}_l. \end{aligned}$$

Applying (18) and taking into account that \(\xi ^{k+1}_l = \xi _{1l}^{k+1} - \xi _{2l}\), we have

$$\begin{aligned} -c_1 \tau _l \Vert \overline{w}^k_l\Vert < \tau _l \Big (\xi ^{k+1}_l \Big )^T d^{k+1}_l - \alpha ^{k+1}_l, \end{aligned}$$

or

$$\begin{aligned} -c_1\Vert \overline{w}^k_l\Vert ^2 < -\Big (\xi ^{k+1}_l \Big )^T \overline{\xi }^k_l - \frac{\alpha ^{k+1}_l}{\tau _l} \Vert \overline{w}^k_l\Vert , \end{aligned}$$

which, by taking into account that \(c_1\in (0,1)\), can be rewritten as

$$\begin{aligned} \Big (\xi ^{k+1}_l \Big )^T \overline{\xi }^k_l + \frac{\alpha ^{k+1}_l}{\tau _l} \Vert \overline{w}^k_l\Vert< c_1\Vert \overline{w}^k_l\Vert ^2 < \Vert \overline{w}^k_l\Vert ^2. \end{aligned}$$
(20)

On the other side, since \(\Vert \overline{w}_l^k\Vert ^2 = \min \big \{\Vert w\Vert ^2:~w \in \hat{D}_l^k \big \}\) it follows from the necessary condition for a minimum that \(\Vert \overline{w}^k_l\Vert ^2 \le w^T\overline{w}^k_l\) for all \(w \in \hat{D}^k_l\), where \(w=(\alpha ,\xi )\) and \(\overline{w}^k_l=(\overline{\alpha }^k_l,\overline{\xi }^k_l)\). Hence we obtain

$$\begin{aligned} \Vert \overline{w}^k_l\Vert ^2 \le \overline{\alpha }^k_l \alpha + \Big (\overline{\xi }^k_l \Big )^T \xi , \quad \forall w=(\alpha ,\xi ) \in \hat{D}^k_l. \end{aligned}$$

In particular, for \(w^{k+1}_l = (\alpha ^{k+1}_l,\xi ^{k+1}_l) \in \hat{D}^k_l\) we have \(\Vert \overline{w}^k_l\Vert ^2 \le \overline{\alpha }^k_l\alpha ^{k+1}_l + \Big (\overline{\xi }^k_l \Big )^T \xi ^{k+1}_l.\) Since \(\overline{\alpha }^k_l\le \Vert \overline{w}^k_l\Vert \) and \(\alpha ^{k+1}_l\ge 0\) we get \(\overline{\alpha }^k_l\alpha ^{k+1}_l\le \alpha ^{k+1}_l\Vert \overline{w}^k_l\Vert \). Next, taking into account that \(\tau _l \in (0,1]\), we have

$$\begin{aligned} \Vert \overline{w}^k_l\Vert ^2 \le \alpha ^{k+1}_l\Vert \overline{w}^k_l\Vert + \Big (\overline{\xi }^k_l \Big )^T \xi ^{k+1}_l \le \frac{\alpha ^{k+1}_l}{\tau _l} \Vert \overline{w}^k_l\Vert + \Big (\overline{\xi }^k_l \Big )^T \xi ^{k+1}_l, \end{aligned}$$

which contradicts (20). Thus, \(w^{k+1}_l \notin \hat{D}^k_l\).

Note that for all \(t\in [0,1]\) we have

$$\begin{aligned} \Vert \overline{w}^{k+1}_l\Vert ^2&\le \Vert tw^{k+1}_l + (1-t)\overline{w}^k_l\Vert ^2 \\&= \Vert \overline{w}^k_l\Vert ^2 + 2t\Big (\overline{w}^k_l \Big )^T(w^{k+1}_l - \overline{w}^k_l) + t^2\Vert w^{k+1}_l-\overline{w}^k_l\Vert ^2. \end{aligned}$$

It follows from (20) that

$$\begin{aligned} \Big (\overline{w}^k_l \Big )^T w^{k+1}_l&= \overline{\alpha }^k_l\alpha ^{k+1}_l + \Big (\overline{\xi }^k_l \Big )^T \xi ^{k+1}_l \\&<\frac{\alpha ^{k+1}_l}{\tau _l} \Vert \overline{w}^k_l\Vert +\Big (\overline{\xi }^k_l \Big )^T \xi ^{k+1}_l \\&< c_1\Vert \overline{w}^k_l\Vert ^2. \end{aligned}$$

In addition, the definition of \(K_l\) implies that \(\Vert w^{k+1}_l-\overline{w}^k_l\Vert \le 2K_l\). Then we get

$$\begin{aligned} \Vert \overline{w}^{k+1}_l\Vert ^2&< \Vert \overline{w}^k_l\Vert ^2 + 2tc_1\Vert \overline{w}^k_l\Vert ^2 - 2t\Vert \overline{w}^k_l\Vert ^2 + 4K_l^2t^2\\&= \Vert \overline{w}^k_l\Vert ^2 - 2t(1 - c_1)\Vert \overline{w}^k_l\Vert ^2 + 4K_l^2t^2. \end{aligned}$$

Let \(t_0=(1-c_1)(2K_l)^{-2}\Vert \overline{w}^k_l\Vert ^2\). Note that \(t_0\in (0,1),\) and for \(t = t_0\) we have

$$\begin{aligned} \Vert \overline{w}^{k+1}_l\Vert ^2 < \Big (1 - \Big ((1 - c_1)(2K_l)^{-1}\Vert \overline{w}^k_l\Vert \Big )^2 \Big )\Vert \overline{w}^k_l\Vert ^2. \end{aligned}$$

Since \(\Vert \overline{w}^k_l\Vert \ge \delta _l\) we get

$$\begin{aligned} \Vert \overline{w}^{k+1}_l\Vert ^2 < \Big (1 - \left( (1-c_1)(2K_l)^{-1}\delta _l\right) ^2\Big )\Vert \overline{w}^k_l\Vert ^2. \end{aligned}$$

Furthermore, since \(\delta _l \in (0, K_l)\) it follows from (17) that \(M_l \in (0,1)\) and \(\Vert \overline{w}^{k+1}_l\Vert ^2 < M_l \Vert \overline{w}^k_l\Vert ^2\). Then we have

$$\begin{aligned} \Vert \overline{w}^{m_l}_l\Vert ^2< M_l \Vert \overline{w}^{m_l-1}_l\Vert ^2< \ldots< M_l^{m_l-1} \Vert \overline{w_l}^1\Vert ^2 < M_l^{m_l-1} K_l^2. \end{aligned}$$

Hence \(\Vert \overline{w}^{m_l}_l\Vert < \delta _l\) is satisfied if \(M_l^{m_l-1} K_l^2 < \delta _l^2\). This inequality must hold after at most \(m_l\) iterations, where

$$\begin{aligned} m_l \le \frac{2\log _2 (\delta _l/K_l)}{\log _2 (M_l)} + 1. \end{aligned}$$

This completes the proof.\(\square \)
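To get a feel for this bound, the following sketch (ours) evaluates it for the sample values \(K_l=10\), \(\delta _l=0.1\) and \(c_1=0.2\); since \(M_l\) is then very close to one, the bound is finite but pessimistic.

```python
import math

# Illustration (ours): evaluating the inner-iteration bound of Proposition 7
# for sample values K_l = 10, delta_l = 0.1 and c1 = 0.2.
K, delta, c1 = 10.0, 0.1, 0.2
M = 1.0 - ((1.0 - c1) * delta / (2.0 * K)) ** 2       # M_l from (17), close to 1
m_bound = 2.0 * math.log2(delta / K) / math.log2(M) + 1.0
print(0.0 < M < 1.0, m_bound > 0.0)  # True True: the bound is finite but large here
```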

Definition 2

A point \(x \in \mathbb {R}^n\) is called a \((\tau , \delta )\)-critical point of Problem (1) if, for \(U = \bar{B}_\tau (x)\),

$$\begin{aligned} \min \Big \{\Vert w\Vert :~w\in D_U f(x) \Big \} \le \delta . \end{aligned}$$

Proposition 8

Let \(\tau , \delta > 0\) be given numbers, \(x \in \mathbb {R}^n\) be a \((\tau ,\delta )\)-critical point of Problem (1) and \(L_1 > 0\) be a Lipschitz constant of the function \(f_1\) at the point x. Then there exists \(\xi _2 \in \partial f_2(x)\) such that for any \(\varepsilon > 2L_1\tau \)

$$\begin{aligned} \xi _2 \in \partial _\varepsilon f_1(x) + B_\delta (0_n). \end{aligned}$$

Proof

Let \(U = \bar{B}_\tau (x)\). If x is a \((\tau ,\delta )\)-critical point, then there exists \(\overline{w} \in D_U f(x)\) such that \( \Vert \overline{w}\Vert < \delta \). The element \(\overline{w}\) can be represented as

$$\begin{aligned}\overline{w} = (\overline{\alpha },\overline{\xi }_1) - (0,\overline{\xi }_2) = (\overline{\alpha },\overline{\xi }_1 - \overline{\xi }_2),\end{aligned}$$

for some \((\overline{\alpha }, \overline{\xi }_1) \in D_U f_1(x)\) and \((0,\overline{\xi }_2) \in D_x f_2(x)\). Hence, \(\Vert \overline{\xi }_1 - \overline{\xi }_2\Vert <\delta \). It follows from Proposition 2 that \(\overline{\xi }_1 \in \partial _\varepsilon f_1(x)\) for all \(\varepsilon > 2L_1\tau \). In addition, \(\overline{\xi }_2\in \partial f_2(x)\). Setting \(\xi _2 = \overline{\xi }_2\) completes the proof. \(\square \)

It follows from Proposition 8 that for sufficiently small \(\tau \) and \(\delta \) a \((\tau , \delta )\)-critical point can be interpreted as an approximate critical point of Problem (1). Furthermore, note that if at the l-th iteration (\(l > 0\)) of Algorithm 1 the inner loop terminates with the condition (15), then \(x_l\) is a \((\tau _l,\delta _l)\)-critical point of Problem (1).

Next, we prove the convergence of Algorithm 1. For a starting point \(x_0 \in \mathbb {R}^n\) consider the level set \(\mathcal{L}_f(x_0)\) of the function f:

$$\begin{aligned} \mathcal{L}_f(x_0) = \left\{ x \in \mathbb {R}^n:~f(x) \le f(x_0) \right\} . \end{aligned}$$

Theorem 3

Suppose that the level set \(\mathcal{L}_f(x_0)\) is bounded and \(\theta =0\) in Algorithm 1. Then Algorithm 1 generates a sequence of approximate critical points whose accumulation points are critical points of Problem (1).

Proof

The asm-dc consists of inner and outer loops. The values of the parameters \(\tau _l\) and \(\delta _l\) remain constant in the inner loop, and also in Steps 6 and 7. It follows from Proposition 7 that for given \(\tau _l\) and \(\delta _l\), the inner loop stops after a finite number of iterations.

Since the set \(\mathcal{L}_f(x_0)\) is bounded, and hence (being closed) compact, the function f is bounded from below on it; since any point with \(f(x) < f(x_0)\) belongs to \(\mathcal{L}_f(x_0)\), this means that \(f_* = \min \big \{f(x):~x \in \mathbb {R}^n\big \} > -\infty \). For given \(\tau _l\) and \(\delta _l\) the number of executions of Steps 6 and 7 is finite. To prove this, assume the contrary, that is, that the number of executions of these steps is infinite. Then for some \(k \ge 1\) the direction \(\overline{d}^k_l\), computed in Step 3, is a direction of sufficient decrease at \(x_l\) satisfying the following condition:

$$\begin{aligned} f(x_l+\tau _l \overline{d}^k_l) - f(x_l) \le -c_1 \tau _l \Vert \overline{w}^k_l\Vert . \end{aligned}$$

Since \(\Vert \overline{w}^k_l\Vert \ge \delta _l\), \(\sigma _l \ge \tau _l\) and \(c_2 \in (0, c_1]\) we have

$$\begin{aligned} f(x_{l+1})-f(x_l) = f(x_l+\sigma _l \overline{d}^k_l) - f(x_l) \le -c_2\sigma _l\Vert \overline{w}^{k}_l\Vert \le -c_2\tau _l\delta _l. \end{aligned}$$

Applying this inequality recursively, we obtain

$$\begin{aligned} f(x_{l+1}) \le f(x_0) - (l+1)c_2 \tau _l \delta _l. \end{aligned}$$

Then \(f(x_l) \rightarrow -\infty \) as \(l \rightarrow \infty \), which contradicts the condition \(f_* > -\infty \). This means that for fixed values of \(\tau _l\) and \(\delta _l\), after a finite number of iterations the inner loop terminates with the condition (15) and \(x_l\) is a \((\tau _l,\delta _l)\)-critical point.

Thus, the parameters \(\tau _l\) and \(\delta _l\) are updated in Step 5 after a finite number of outer iterations. Denote these iterations by \(l_j,~j=1,2,\ldots \), so that each \(x_{l_j}\) is a \((\tau _{l_j},\delta _{l_j})\)-critical point. It follows from Proposition 8 that for some \(\xi _{2l_j} \in \partial f_2(x_{l_j})\)

$$\begin{aligned} \xi _{2l_j} \in \partial _\varepsilon f_1(x_{l_j}) + B_{\delta _{l_j}}(0_n), ~\varepsilon > 2L_1 \tau _{l_j}. \end{aligned}$$
(21)

It is obvious that \(x_{l_j} \in \mathcal{L}_f(x_0)\) for all \( j=1,2,\ldots \). The compactness of the set \(\mathcal{L}_f(x_0)\) implies that the sequence \(\{x_{l_j}\}\) has at least one accumulation point. Let \(x^*\) be an accumulation point and for the sake of simplicity assume that \(x_{l_j} \rightarrow x^*\) as \(j \rightarrow \infty \).

Upper semicontinuity of the subdifferential and \(\varepsilon \)-subdifferential mappings implies that for any \(\eta > 0\) there exist \(\gamma _1 > 0\) and \(\gamma _2 > 0\) such that

$$\begin{aligned} \partial _\varepsilon f_1(y) \subset \partial f_1(x^*) + B_\eta (0_n)~~\text{ and }~~\partial f_2(y) \subset \partial f_2(x^*) + B_\eta (0_n), \end{aligned}$$
(22)

for all \(y \in B_{\gamma _1}(x^*)\) and \(\varepsilon \in [0, \gamma _2)\). Since \(x_{l_j} \rightarrow x^*\) and \(\tau _{l_j} \rightarrow 0\) as \(j \rightarrow \infty \) there exists \(j_\eta > 0\) such that \(x_{l_j} \in B_{\gamma _1}(x^*)\) and \(\tau _{l_j} \in [0, \gamma _2)\) for all \(j > j_\eta \). Moreover, we can choose the sequence \(\{\varepsilon _j\}\) so that \(\varepsilon _j \in [0, \gamma _2)\) for all \(j > j_\eta \) and \(\varepsilon _j \rightarrow 0\) as \(j \rightarrow \infty \). Then, it follows from (21) and the first part of (22) that for any \(j > j_\eta \)

$$\begin{aligned} \xi _{2l_j} \in \partial f_1(x^*) + B_{\eta + \delta _{l_j}}(0_n). \end{aligned}$$

On the other hand, from the second part of (22) we get that for any \(\xi _{2l_j} \in \partial f_2(x_{l_j})\) there exists \(\bar{\xi }_2 \in \partial f_2(x^*)\) such that \(\Vert \xi _{2l_j} - \bar{\xi }_2\Vert < \eta \). Without loss of generality, assume that \(\xi _{2l_j} \rightarrow \bar{\xi }_2 \in \partial f_2(x^*)\) as \(j \rightarrow \infty \). Since \(\delta _{l_j} \rightarrow 0\) as \(j \rightarrow \infty \), we get that

$$\begin{aligned} \bar{\xi }_2 \in \partial f_1(x^*) + B_{2\eta }(0_n). \end{aligned}$$

Taking into account that \(\eta > 0\) is arbitrary, we have \(\bar{\xi }_2 \in \partial f_1(x^*)\), that is, \(\partial f_1(x^*) \cap \partial f_2(x^*) \ne \emptyset \). This completes the proof. \(\square \)

Remark 1

The “escape procedure”, introduced in [25], is able to escape from critical points which are not Clarke stationary. The developed algorithm can be combined with this procedure to design an algorithm for finding Clarke stationary points of Problem (1).

4 Numerical results

Using a set of academic test problems, we evaluate the performance of the developed method and compare it with similar nonsmooth optimization methods. We utilize 30 academic test problems, which are instances of 14 problems with various dimensions (ranging from 2 to 100): Problems 1–10 are from [24] and Problems 11–14 are described in [26]. In addition, we use three large-scale nonsmooth DC optimization problems in our numerical experiments. We design these problems so that different types of DC components are considered:

  • \(P_{L1}\): the function \(f_1\) is smooth and the function \(f_2\) is nonsmooth;

  • \(P_{L2}\): the function \(f_1\) is nonsmooth and the function \(f_2\) is smooth;

  • \(P_{L3}\): both functions \(f_1\) and \(f_2\) are nonsmooth.

The DC components of these test problems, \(f(x)= f_1(x) - f_2(x)\), are given below:

  • \(P_{L1}.\)

    $$\begin{aligned} f_1(x)&= \sum _{i=1}^{n-1} \Big ( (x_i - 2)^2 + (x_{i+1} - 1)^2 + \exp (2x_i - x_{i+1}) \Big ),\\ f_2(x)&= \sum _{i=1}^{n-1} \max \Big ( (x_i - 2)^2 + (x_{i+1} - 1)^2, \exp (2x_i - x_{i+1}) \Big ). \end{aligned}$$

    Starting point: \(x = (x_1,\ldots ,x_n) = (3,\ldots ,3)\).

  • \(P_{L2}.\)

    $$\begin{aligned} f_1(x)&= n \max _{i=1,\ldots ,n} \Big (\sum _{j=1}^n \frac{x_j^2}{i+j-1} \Big ),\\ f_2(x)&= \sum _{i=1}^n \sum _{j=1}^n \frac{x_j^2}{i+j-1}. \end{aligned}$$

    Starting point: \(x = (x_1,\ldots ,x_n) = (2,\ldots ,2)\).

  • \(P_{L3}.\)

    $$\begin{aligned} f_1(x)&= n \max _{i=1,\ldots ,n} \Big |\sum _{j=1}^n \frac{x_j}{i+j-1} \Big |,\\ f_2(x)&= \sum _{i=1}^n \Big |\sum _{j=1}^n \frac{x_j}{i+j-1} \Big |. \end{aligned}$$

    Starting point: \(x = (x_1,\ldots ,x_n) = (2,\ldots ,2)\).

We set the number of variables to \(n=200,500,1000,1500,2000,3000,5000\) for these three test problems.
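As an illustration, the DC components of \(P_{L1}\) can be sketched in a few lines; this is our own verification code, not part of the solvers compared below, and the function names are ours. Since \(\max (a,b)+\min (a,b)=a+b\) termwise, the objective \(f = f_1 - f_2\) reduces to the sum of the termwise minima of the quadratic and exponential parts.

```python
import numpy as np

def f1_PL1(x):
    """Smooth DC component of P_L1: sum of the quadratic and exponential terms."""
    x = np.asarray(x, dtype=float)
    quad = (x[:-1] - 2.0) ** 2 + (x[1:] - 1.0) ** 2
    expo = np.exp(2.0 * x[:-1] - x[1:])
    return float(np.sum(quad + expo))

def f2_PL1(x):
    """Nonsmooth DC component of P_L1: termwise maximum of the same two terms."""
    x = np.asarray(x, dtype=float)
    quad = (x[:-1] - 2.0) ** 2 + (x[1:] - 1.0) ** 2
    expo = np.exp(2.0 * x[:-1] - x[1:])
    return float(np.sum(np.maximum(quad, expo)))

def f_PL1(x):
    """Objective f = f1 - f2; termwise it equals min(quad, expo)."""
    return f1_PL1(x) - f2_PL1(x)
```

At the recommended starting point \(x=(3,\ldots ,3)\) with \(n=2\), the quadratic term equals 5 and the exponential term equals \(e^3 \approx 20.09\), so \(f = 5\).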

4.1 Solvers and parameters

We apply seven nonsmooth optimization solvers for comparison. Five of them are designed for solving DC optimization problems and the other two are general solvers for nonsmooth nonconvex optimization problems. These solvers are:

  • Aggregate subgradient method for DC optimization (AggSub) [9];

  • Proximal bundle method for DC optimization (PBDC) [24];

  • DC Algorithm (DCA) [4, 5];

  • Proximal bundle method for nonsmooth DC programming (PBMDC) [17];

  • Sequential DC programming with cutting-plane models (SDCA-LP) [18];

  • Proximal bundle method (PBM) [30, 31];

  • Hybrid Algorithm for Nonsmooth Optimization (HANSO 2.2) [14, 28].

The quadratic programming problem in Step 2 of the asm-dc is solved using the algorithm from [39]. Parameters in the asm-dc are chosen as follows: \(c_1=0.2\), \(c_2=0.05\), \(c_3=0.1\), \(\theta=5 \cdot 10^{-7}\), \(\tau _{l+1}=0.1\tau _l\) with \(\tau _1=1\), and \(\delta _l=10^{-7}\) for all l. The algorithms asm-dc, AggSub, PBDC, DCA and PBM are implemented in Fortran 95 and compiled using the gfortran compiler. For HANSO, we use the MATLAB implementation of version 2.2, which is available at https://cs.nyu.edu/overton/software/hanso/. For PBMDC and SDCA-LP, we utilize the MATLAB implementations available at http://www.oliveira.mat.br/solvers. Note that the SDCA-LP requires the feasible set to be compact, whereas all test problems considered in this paper are unconstrained. In order to apply the SDCA-LP to these problems, we define a large n-dimensional box around the starting point and consider only those points generated by the SDCA-LP that belong to this box. The parameters of all algorithms are chosen according to the values recommended in their references. In addition, we impose a time limit on all algorithms: three hours for the solvers implemented in MATLAB and half an hour for those implemented in Fortran.
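The box workaround used for the SDCA-LP can be sketched as follows; the paper does not specify the box size, so the radius and the helper names below are our assumptions for illustration.

```python
import numpy as np

def make_box(x0, radius=1.0e3):
    """Build a large box [x0 - radius, x0 + radius] componentwise around the
    starting point; the radius value here is an assumption for illustration."""
    x0 = np.asarray(x0, dtype=float)
    return x0 - radius, x0 + radius

def in_box(x, lo, hi):
    """Accept only iterates lying inside the box; points generated outside
    the box are discarded, as described in the text."""
    x = np.asarray(x, dtype=float)
    return bool(np.all((lo <= x) & (x <= hi)))
```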

4.2 Evaluation measures and notations

We use the number of function evaluations, the number of subgradient evaluations and the final value of the objective function obtained by algorithms to report the results of our experiments. The following notations are used:

  • P is the label of a test problem;

  • n is the number of variables in the problem;

  • \(f_{best}^s\) is the best value of the objective function obtained by a solver “s”.

We say that a solver “s” finds a solution with respect to a tolerance \(\beta _1\) if

$$\begin{aligned} 0 \le \frac{f_{best}^s - f_{opt}}{|f_{opt}|+1} \le \beta _1, \end{aligned}$$
(23)

where \(f_{opt}\) is the value of the objective function at one of the known local minimizers. Otherwise, we say that the solver fails. We set \(\beta _1 = 10^{-4}.\)
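The success criterion (23) is straightforward to implement; the function name and default tolerance below are illustrative.

```python
def solver_succeeds(f_best, f_opt, beta1=1e-4):
    """Criterion (23): success if the scaled gap between the solver's best
    value f_best and the known local-minimum value f_opt lies in [0, beta1].
    A negative gap also counts as failure, mirroring the two-sided
    inequality in (23)."""
    gap = (f_best - f_opt) / (abs(f_opt) + 1.0)
    return 0.0 <= gap <= beta1
```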

Performance profiles, introduced in [16], are also applied to analyze the results. Recall that in a performance profile, the value of \(\rho _s(\tau )\) at \(\tau =0\) gives the ratio of the number of test problems for which the solver s uses the least computational time, function evaluations or subgradient evaluations to the total number of test problems. The value of \(\rho _s(\tau )\) at the rightmost abscissa gives the ratio of the number of test problems that the solver s can solve to the total number of test problems; this value indicates the robustness of the solver s. Furthermore, the higher a particular curve, the better the corresponding solver. Note that the values of \(\tau \) are scaled using the natural logarithm.
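A minimal sketch of how such performance profiles can be computed from a table of per-problem costs, assuming failed runs are encoded as infinite cost and at least one solver succeeds on each problem (the function name is ours):

```python
import numpy as np

def performance_profile(costs):
    """Compute log-scaled performance profiles in the sense of [16].

    costs: (n_problems, n_solvers) array of a cost measure (e.g. the number
    of function evaluations), with np.inf marking a failed run; each problem
    is assumed to be solved by at least one solver. Returns the log-scaled
    ratio values taus and, for each solver s, the step function rho_s
    evaluated at each value in taus.
    """
    costs = np.asarray(costs, dtype=float)
    best = costs.min(axis=1, keepdims=True)        # best cost per problem
    ratios = costs / best                          # performance ratios >= 1
    taus = np.log(np.unique(ratios[np.isfinite(ratios)]))
    log_ratios = np.log(ratios)                    # inf stays inf (failures)
    rho = np.array([[np.mean(log_ratios[:, s] <= t) for t in taus]
                    for s in range(costs.shape[1])])
    return taus, rho
```

For example, with two solvers and three problems where solver 2 fails on the last problem, the rightmost value of each curve recovers the fraction of problems that the solver solves at all.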

In addition, we report the CPU time (in seconds) required by the developed algorithm to find a solution, as well as results on the quality of solutions obtained by different algorithms. The quality of solutions is determined by the values of the objective function at these solutions. More precisely, consider two different local minimizers \(x_1\) and \(x_2\) of Problem (1). The local minimizer \(x_1\) is a better quality solution than \(x_2\) if \(f(x_1) < f(x_2)\). The notation (SBW) is used to report the total numbers of test problems in which the asm-dc finds solutions of similar (S), better (B) and worse (W) quality than the other algorithms. The accuracy of solutions is defined using (23) with \(\beta _1 = 10^{-4}\).

4.3 Results

We present the results of our numerical experiments as follows. First, we give the results obtained by the algorithms for test problems 1–14 with the single starting points given in [24, 26], reporting the objective function values and the numbers of function and subgradient evaluations. Second, we present the same measures for these test problems with 20 random starting points. Next, we give the average CPU time required by the asm-dc on test problems 1–14 with both one and 20 starting points, and also discuss the quality of solutions obtained by the algorithms. Finally, we present the results for the three large-scale test problems introduced above, again reporting the objective function values and the numbers of function and subgradient evaluations.

Numerical results for Problems 1–14 with one starting point. We summarize these results in Tables 1–3 and Fig. 1. Table 1 presents the best values of the objective function obtained by the solvers. We use “\(^*\)” to mark values for which a solver fails to find a local or global minimizer. Results from this table show that the asm-dc is more reliable in finding the global minimizer than all other solvers. It finds the optimal solutions for all test problems considered in our experiment, whereas the other solvers fail for some test problems with different numbers of variables. More specifically, the AggSub fails to find the optimal solution in Problem 5, and also in Problem 14 with \(n=5,10,50,100.\) The PBDC fails in Problems 5 and 13, and also in Problem 12 with \(n=50,100\). The DCA fails in Problem 10 with \(n=5,10,50,100.\) The PBMDC fails in Problems 7 and 13, in Problem 12 with \(n=5,10,50,100\), and in Problem 14 with \(n=100.\) The SDCA-LP fails in Problems 1, 8, 9 and 13, in Problem 10 with \(n=50,100\), and in Problem 12 with \(n=10,50,100.\) The PBM fails in Problems 10 and 12 with \(n=100\), and also in Problem 14 with \(n=10,50.\) The HANSO fails in Problems 2 and 3, and also in Problem 14 with \(n=2,10,50,100.\)

Table 1 Best values of objective functions for Problems 1–14 obtained by solvers

Tables 2 and 3 present the numbers of function and subgradient evaluations for Problems 1–14, respectively. We denote by “−” the cases in which a solver fails to find a solution. It can be observed from these tables that the numbers required by the asm-dc are reasonable and comparable with those required by the AggSub and HANSO algorithms. In most cases the asm-dc requires more function and subgradient evaluations than the PBDC, DCA, PBMDC, SDCA-LP and PBM algorithms. This is due to the fact that the asm-dc requires a large number of inner iterations, since it does not use augmented subgradients from previous iterations.

Table 2 Number of function evaluations for Problems 1–14
Table 3 Number of subgradient evaluations for Problems 1–14

Next, we demonstrate the efficiency and robustness of the algorithms using performance profiles. The results with one starting point are depicted in Fig. 1. We utilize the number of function evaluations required by the algorithms to draw the performance profiles. The numbers of subgradient evaluations follow trends similar to those of the numbers of function evaluations, and thus we omit these results. For the algorithms based on the DC structure of the problem, we use the average number of function evaluations of the two DC components. For the other algorithms, we use the number of function evaluations of the objective function. From Fig. 1, it can be observed that the asm-dc is the most robust for the collection of test problems used in the numerical experiments. However, it requires more function evaluations than the others, with the exception of the AggSub algorithm, for the reason noted above: the asm-dc performs a large number of inner iterations because it does not reuse augmented subgradients from previous iterations.

Fig. 1
figure 1

Number of function evaluations for Problems 1–14 with one starting point

Numerical results for Problems 1–14 with 20 random starting points. Performance profiles for Problems 1–14 with 20 random starting points, based on the number of function evaluations, are presented in Fig. 2. The results from this figure indicate a trend similar to that of Fig. 1, confirming the results obtained by the algorithms with only one starting point. Performance profiles based on the number of subgradient evaluations follow similar trends for all solvers, and therefore we omit these results.

Fig. 2
figure 2

Number of function evaluations for Problems 1–14 with 20 random starting points

Computational time and results on the quality of solutions. In Table 4, for a given number n of variables, we report the average CPU time in seconds required by the asm-dc for Problems 1–14 (30 problems considering different dimensions) with one starting point and also with 20 random starting points. We can see that the CPU time required by the asm-dc for test problems with \(n=2,3,4\) variables is negligible (reported as zero). We do not include the CPU time required by the other algorithms since they are implemented on different platforms.

Table 4 CPU time (in seconds) required by asm-dc

Table 5 presents the comparison of the quality of solutions obtained by the asm-dc with those obtained by the other algorithms. Problems 1–14 (30 problems considering different numbers of variables) with 20 random starting points are used; note that the total number of cases is 600. We can see that the asm-dc outperforms all other solvers, except the PBDC, in terms of its ability to find high quality solutions. For instance, the asm-dc obtains a better quality solution than the AggSub in 131 cases, whereas the AggSub finds a better quality solution in 23 cases; they find similar solutions in 446 cases with respect to the tolerance given above. These results demonstrate that the asm-dc has better global search properties than the other solvers, except the PBDC.

Table 5 Comparison of the quality of solutions obtained by asm-dc and other algorithms

Numerical results for large-scale problems. We report the results obtained by the different algorithms for the three large-scale test problems described above. Note that since the conventional code of HANSO runs out of memory for problems \(P_{L2}\) and \(P_{L3}\) with \(n \ge 2000\) and \(n \ge 1500\), respectively, we use its limited memory version. The results are given in Tables 6–8, where the objective function values and the numbers of function and subgradient evaluations are presented. For the test problem \(P_{L1}\), considering different numbers of variables, the best values of the objective function obtained by the asm-dc and the AggSub are much smaller than those obtained by the other solvers. The solvers PBMDC and SDCA-LP fail to find any solution in the given time frame. For the test problem \(P_{L2}\), all solvers, except the SDCA-LP, are able to find the optimal solution with the given tolerance. The SDCA-LP fails to produce accurate solutions within the given time limit for all numbers of variables. In the test problem \(P_{L3}\) with \(n=200,~500\), the solvers asm-dc, PBDC, PBMDC, SDCA-LP and PBM are able to find accurate solutions. For this test problem with \(n \ge 1000\), the PBDC finds the most accurate solutions and the asm-dc is less accurate than the PBDC, PBMDC, PBM and SDCA-LP. The other algorithms fail to find accurate solutions. Overall, the asm-dc is the most robust method for solving the large-scale problems used in the numerical experiments.

Table 6 Best values of objective functions obtained by solvers for large-scale problems

From Tables 7 and 8 we can see that the PBMDC requires the smallest numbers of both function and subgradient evaluations among all solvers whenever it is able to find the solution. On average, the asm-dc requires fewer function and subgradient evaluations than the AggSub and slightly fewer than the DCA. The asm-dc requires more function and subgradient evaluations than the PBDC, PBMDC, SDCA-LP, PBM and HANSO; however, in some cases the difference between the asm-dc and these five methods is not significant. Summarizing the results from Tables 7 and 8, and taking into account the number of variables, we can say that the asm-dc requires a reasonable number of function and subgradient evaluations for large-scale problems.

Table 7 Number of function evaluations for large-scale problems
Table 8 Number of subgradient evaluations for large-scale test problems

5 Conclusions

A new method, called the augmented subgradient method (asm-dc), is introduced for solving nonsmooth unconstrained difference of convex (DC) optimization problems. First, we define the set of augmented subgradients using subgradients of the DC components and their linearization errors. Then a finite number of elements from this set are used to compute search directions, and the new method is designed using these search directions. It is proved that the accumulation points of the sequence generated by the asm-dc are critical points of the DC problem.

We evaluate the performance of the asm-dc using 14 small and medium-sized and three large-scale nonsmooth DC optimization test problems. These test problems are also used to compare the performance of the asm-dc with that of five nonsmooth DC optimization methods and two general nonsmooth optimization methods. We use different evaluation measures, including performance profiles, to compare the solvers. Computational results clearly demonstrate that the asm-dc is the most robust method for the test problems used in the numerical experiments. In comparison with most other methods, the asm-dc requires more function and subgradient evaluations; nevertheless, it uses modest CPU time even for large-scale problems. The results also demonstrate that, as a rule, the asm-dc finds solutions of either similar or better quality than the other methods.