Abstract
In this paper, we propose an alternating linearization bundle method for minimizing the sum of a nonconvex function and a convex function, neither of which is necessarily differentiable. The nonconvex function is first locally “convexified” by adding a quadratic term, and a cutting-planes model of this local convexification is then generated. The convex function is assumed to be “simple” in the sense that finding its proximal-like point is relatively easy. At each iteration, the method solves two subproblems whose objectives alternately employ linearizations of the cutting-planes model and of the convex objective function. It is proved that the sequence of iteration points converges to a stationary point. Numerical results demonstrate the good performance of the method.
1 Introduction
In this paper, we consider the structured nonconvex minimization problem
where \(f: \mathbb{R}^{n}\rightarrow\mathbb{R}\) is possibly a nonconvex nonsmooth function and \(h: \mathbb{R}^{n}\rightarrow(-\infty,\infty]\) is a closed proper convex function.
Problems of the form (1) often arise in practice, for example in signal processing, image reconstruction, engineering design, and optimal control. Three typical examples are given below.
Example 1
(Unconstrained transformation of a constrained problem)
Consider the constrained problem
where f is possibly a nonsmooth nonconvex function and C is a convex subset of \(\mathbb{R}^{n}\). Problem (2) can be written equivalently as
where \(\imath_{C}\) is the indicator function of C, i.e., \(\imath _{C}(x)\) equals 0 on C and infinity elsewhere. Clearly, problem (3) is a special case of problem (1) with \(h(x)=\imath_{C}(x)\). We note that the proximal point of \(\imath_{C}\) can easily be calculated or even has a closed-form solution if C has some special structure.
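For instance, when C is a Euclidean ball, the proximal point of \(\imath_{C}\) is simply the projection onto the ball. A minimal sketch in Python (the function name and list-based vectors are our own illustrative choices):

```python
import math

def prox_indicator_ball(x, a, b):
    """Proximal point of the indicator of C = {w : ||w - a|| <= b}.
    Minimizing i_C(w) + (R/2)||w - x||^2 over w is, for any R > 0,
    exactly the Euclidean projection of x onto the ball."""
    d = [xi - ai for xi, ai in zip(x, a)]
    norm = math.sqrt(sum(di * di for di in d))
    if norm <= b:
        return list(x)          # x already lies in C
    t = b / norm                # scale the offset back onto the sphere
    return [ai + t * di for ai, di in zip(a, d)]
```

For example, projecting \((3,0)^{T}\) onto the unit ball centered at the origin gives \((1,0)^{T}\).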
Example 2
(Nonconvex regularization of a convex function)
Consider the \(l_{q}\ (0< q<1)\) regularization problem
which has many practical applications in compressed sensing and imaging science (see e.g., [1]), where \(\Vert x \Vert _{q}=(\sum_{i=1}^{n} \vert x_{i} \vert ^{q})^{1/q}\). The objective function of problem (4) is also the sum of a convex function and a nonconvex function.
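A quick numerical check (in Python; illustrative only) confirms the nonconvexity of the regularizer \(\sum_{i}\vert x_{i}\vert^{q}\) for \(q=1/2\): the value at a midpoint exceeds the average of the endpoint values, which a convex function never allows.

```python
def lq_power(x, q):
    """Evaluate sum_i |x_i|^q, the q-th power of the l_q quasi-norm (0 < q < 1)."""
    return sum(abs(xi) ** q for xi in x)

q = 0.5
x, y = [1.0, 0.0], [0.0, 1.0]
mid = lq_power([(xi + yi) / 2 for xi, yi in zip(x, y)], q)  # value at the midpoint
avg = (lq_power(x, q) + lq_power(y, q)) / 2                 # average of the endpoints
# mid = 2 * 0.5**0.5 (about 1.414) exceeds avg = 1.0, violating convexity
```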
Example 3
(Convex regularization of a nonconvex function)
Hare et al. [2] studied the function of the form
where \(f_{i}(x): \mathbb{R}^{n}\rightarrow\mathbb{R}, i=1,\ldots,n\) are Ferrier polynomials defined as
It is well known that \(f(x)=\sum^{n}_{i=1} \vert f_{i}(x) \vert \) is a nonconvex nonsmooth function, and \(h(x)=\frac{1}{2} \Vert x \Vert ^{2}\) is a simple convex function.
The methods for minimizing the sum of two functions have been well studied during the past several decades, and different methods have been developed depending on the properties of the two functions; see e.g., [3–12]. In particular, Kiwiel [9] proposed an alternating linearization bundle method for the sum of two convex functions, one of which is “simple” (i.e., minimizing this function plus a separable convex quadratic function is “easy”). Goldfarb et al. [7] proposed fast alternating linearization methods for the sum of two convex functions, both of which are “simple”. Li et al. [10] presented a proximal alternating linearization method for the sum of two nonconvex functions, based on the assumption that the proximal points of the two functions can easily be calculated. Attouch et al. [3] and Bolte et al. [4] considered a broad class of nonconvex and nonsmooth minimization problems, which includes minimizing the sum of two nonconvex functions as a special case; there the proximal alternating minimization technique is used and the Kurdyka–Łojasiewicz property is assumed.
In this paper, we consider minimizing the sum of a nonconvex function and a convex function in the form of (1). In particular, we assume that f is lower-\(C^{2}\) and h is “simple” in the sense that minimizing h plus a quadratic term is relatively easy. The method presented in this paper can be viewed as a generalized version of the methods given in [9] and [13]. On the one hand, we generalize the method of [9] from minimizing the sum of two convex functions to minimizing the sum of a nonconvex function and a convex function. On the other hand, we generalize the method of [13] from minimizing a single nonconvex nonsmooth function to minimizing the sum of two functions.
Our method will produce three sequences of points: \(\{z^{\ell}\}\), \(\{ y^{\ell}\}\) and \(\{x^{k(\ell)}\}\), where \(\{z^{\ell}\}\) is the sequence of proximal points, \(\{y^{\ell}\}\) is the sequence of trial points, and \(\{x^{k(\ell)}\}\) is the sequence of stability centers (i.e., \(x^{k(\ell)}\in\{y^{\ell}\}\) is the “best” point obtained so far for iteration ℓ, which will be abbreviated as \(x^{k}\) if there is no confusion). More precisely, our method will alternately solve the following two subproblems:
where \(\check{\varphi}_{\ell}\) is a cutting-planes model [14, 15] of the local convexification function of f at iteration ℓ, which is based on the idea of the redistributed proximal bundle method in [13] and will be made more precise later; \(\bar{h}_{\ell-1}\) is a linearization of h at iteration \(\ell-1\); \(\bar{\varphi}_{\ell}\) is a linearization of \(\check{\varphi }_{\ell}\); \(\mu_{\ell}\) is the proximal parameter. Our convergence analysis shows that, under suitable assumptions, any accumulation point of the sequence \(\{x^{k}\}\) is a stationary point of F if there is an infinite number of serious steps; otherwise, the last stability center is a stationary point of F.
This paper is organized as follows. In Sect. 2, we review some basic definitions and results required for this work. In Sect. 3, we present the alternating linearization bundle method for problem (1). Section 4 examines the convergence properties of the algorithm. Some preliminary numerical results are given in Sect. 5. The Euclidean inner product in \(\mathbb{R}^{n}\) is denoted by \(\langle x, y\rangle=x^{T}y\), and the associated norm by \(\Vert \cdot \Vert \).
2 Preliminaries
In this section, we recall some basic definitions and results that are closely relevant to our method, which can be found in [13, 16, 17].
-
The limiting subdifferential of f at x̄ is defined by
$$\partial f(\bar{x}):=\limsup_{x\rightarrow\bar{x},\, f(x)\rightarrow f(\bar{x})} \hat{\partial}f(x), $$where \(\hat{\partial}f(\bar{x})\) is the regular subdifferential defined by
$$\hat{\partial}f(\bar{x}):=\biggl\{ g\in\mathbb{R}^{n}: \liminf_{x\rightarrow\bar{x},\, x\neq\bar{x}}\frac{f(x)-f(\bar{x})-\langle g,x-\bar {x}\rangle}{ \Vert x-\bar{x} \Vert }\geq0\biggr\} . $$An element \(g\in\partial f(\bar{x})\) is called a subgradient of f at x̄.
-
The function f is prox-bounded if there exists \(R\geq0\) such that the function \(f(\cdot)+\frac{1}{2}R \Vert \cdot \Vert ^{2}\) is bounded below. The corresponding threshold \(r_{pb}\geq0\) is the smallest value such that \(f(\cdot)+\frac{1}{2}R \Vert \cdot \Vert ^{2}\) is bounded below for every \(R>r_{pb}\).
-
The function f is lower-\(C^{2}\) on an open set V if for each \(\bar{x}\in V\) there is a neighborhood \(V'\) of x̄ upon which a representation \(f(x)=\max_{t\in T} f_{t}(x)\) holds, where T is a compact set and the functions \(f_{t}\) are of class \(C^{2}\) on V such that \(f_{t}\), \(\nabla f_{t}\) and \(\nabla^{2}f_{t}\) depend continuously on \((t,x)\in T\times V\).
-
The proximal point mapping of the function f at the point \(x\in\mathbb{R}^{n}\) is defined by
$$p_{R}f(x):=\arg\min_{\omega\in\mathbb{R}^{n}} \biggl\{ f(\omega )+ \frac{1}{2}R \Vert \omega-x \Vert ^{2} \biggr\} . $$
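For a concrete 1-D instance, the proximal point of \(f(\omega)=\vert\omega\vert\) has the familiar soft-threshold form. The sketch below (illustrative Python; names are our own) computes it and verifies it against the definition by brute force:

```python
def prox_abs(x, R):
    """p_R f(x) for f(w) = |w|: argmin_w |w| + (R/2)(w - x)^2,
    which is the soft-threshold of x by 1/R."""
    t = 1.0 / R
    if x > t:
        return x - t
    if x < -t:
        return x + t
    return 0.0

# Brute-force check against the definition on a fine grid.
R, x = 2.0, 1.3
grid = [i / 10000.0 for i in range(-30000, 30001)]
w_star = min(grid, key=lambda w: abs(w) + 0.5 * R * (w - x) ** 2)
```

Here `prox_abs(1.3, 2.0)` gives 0.8, matching the grid minimizer `w_star`.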
Lemma 1
([16])
Suppose that the function f is lower-\(C^{2}\) on V and \(\bar{x}\in V\). Then there exist \(\varepsilon>0, K>0\), and \(\rho>0\) such that
-
(i)
for any point \(x^{0}\) and parameter \(R\geq\rho\) the function \(f+\frac{1}{2}R \Vert \cdot-x^{0} \Vert ^{2}\) is convex and finite valued on the closed ball \(\bar{B}_{\varepsilon }(\bar{x})\), and
-
(ii)
the function f is Lipschitz continuous with constant K on \(\bar{B}_{\varepsilon}(\bar{x})\).
Theorem 1
([16])
Suppose that the lower semicontinuous function f is prox-bounded with threshold \(r_{pb}\) and lower-\(C^{2}\) on V. Let \(\bar{x}\in V\) and let \(\varepsilon>0,K>0\) and \(\rho>0\) be given by Lemma 1. Then x̄ is a stationary point of f if and only if \(\bar {x}=p_{R}f(\bar{x})\) for any \(R>R_{\bar{x}}:=\max\{4K/\varepsilon, \rho, r_{pb}\}\).
Assumption 1
([13])
Given \(x^{0}\in\mathbb{R}^{n}\) and \(M_{0}\geq0\), there exist an open bounded set \(\mathcal{O}\) and a function H such that \(\mathcal{L}_{0}:=\{x\in\mathbb{R}^{n}: f(x)\leq f(x^{0})+M_{0}\}\subset\mathcal{O}\), and H is lower-\(C^{2}\) on \(\mathcal{O}\) with \(H\equiv f\) on \(\mathcal{L}_{0}\).
Theorem 2
([13])
For a function f satisfying Assumption 1, the following results hold:
-
(i)
The level set \(\mathcal{L}_{0}\) is nonempty and compact.
-
(ii)
There exists \(\rho^{\mathrm{id}}>0\) such that, for any \(\rho\geq\rho^{\mathrm{id}}\) and any given \(y\in\mathcal{L}_{0}\), the function \(f+\frac{1}{2}\rho \Vert \cdot-y \Vert ^{2}\) is convex on \(\mathcal{L}_{0}\).
-
(iii)
The function f is Lipschitz continuous on \(\mathcal{L}_{0}\).
3 The alternating linearization bundle method
3.1 Motivation and framework
The classic proximal point algorithm (see e.g. [18]) for solving problem (1) generates the new iterate by
where \(R_{\ell}>0\) is the proximal parameter.
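Conceptually, the scheme reads \(y^{\ell+1}=\arg\min_{\omega}\{F(\omega)+\frac{1}{2}R_{\ell}\Vert\omega-y^{\ell}\Vert^{2}\}\). A toy 1-D sketch in Python illustrates this (brute-force subproblem solves on a grid, with the simple convex choice \(F(\omega)=\vert\omega\vert\); purely illustrative, not the method proposed in this paper):

```python
def ppa_step(y, R, F, grid):
    """One proximal point iteration: minimize F(w) + (R/2)(w - y)^2,
    here by brute force on a grid (the conceptual scheme, not a practical solver)."""
    return min(grid, key=lambda w: F(w) + 0.5 * R * (w - y) ** 2)

F = lambda w: abs(w)                          # simple convex model problem
grid = [i / 100.0 for i in range(-500, 501)]
traj, y = [3.0], 3.0
for _ in range(4):
    y = ppa_step(y, 1.0, F, grid)
    traj.append(y)
# each iteration soft-thresholds by 1/R = 1: 3 -> 2 -> 1 -> 0 -> 0
```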
However, since f is a nonconvex function, solving problem (8) may not be easy and is usually as difficult as the original problem (1). Therefore, we will tackle the difficulty via the following three steps.
1. Generate the local convexification function of the nonconvex function f. Following the redistribution idea of [13], we split the prox-parameter \(R_{\ell}\) into two nonnegative dynamic parameters \(\eta_{\ell}\) and \(\mu_{\ell}\) satisfying \(R_{\ell}=\eta_{\ell}+\mu_{\ell}\). Then problem (8) (after replacing \(y^{\ell}\) by \(x^{k}\)) can be written as
where
is called the local convexification function of f, since it is convex whenever \(\eta_{\ell}\) is large enough (see Theorem 2).
2. Generate the cutting-planes model of \(\varphi_{\ell}\). Let ℓ be the current iteration index, \(y^{i}\), \(i\in J_{\ell }\subseteq\{0,1,\ldots,\ell\}\) be trial points generated in the previous iterations, and \(g_{f}^{i}\in\partial f(y^{i})\). Define the cutting-planes model of \(\varphi_{\ell}\) by
where \(g_{f}^{i}=g_{f}(y^{i})\in\partial f(y^{i})\). Therefore, we obtain an approximate version of problem (9) as follows:
3. Apply the alternating linearization strategy to solve problem (12). Since problem (12) may still be difficult, motivated by the idea of the alternating linearization bundle method [9], we alternately solve the following two subproblems:
The above two subproblems are much easier to solve; their objective functions alternately employ linear models of \(h(\cdot)\) and \(\check{\varphi}_{\ell}(\cdot)\), respectively.
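To make the cutting-planes construction concrete, consider a 1-D sketch in Python (illustrative names; each bundle element contributes the tangent plane of the convexified function at a trial point, and the model is the pointwise maximum of these planes):

```python
def cp_model(w, xk, bundle, eta):
    """Cutting-planes model of phi(w) = f(w) + (eta/2)(w - xk)^2 (1-D sketch).
    bundle: list of (y, fy, g) with fy = f(y) and g a subgradient of f at y;
    the model is the max of the tangent planes of phi at the trial points."""
    return max(fy + 0.5 * eta * (y - xk) ** 2
               + (g + eta * (y - xk)) * (w - y)
               for y, fy, g in bundle)

# Nonconvex f(x) = -x^2 with center xk = 0: choosing eta = 3 (above the
# convexity threshold 2) makes phi(w) = 0.5 * w^2 convex, so the tangent
# planes, and hence the model, minorize phi.
f = lambda x: -x * x
g = lambda x: -2 * x          # derivative of f
xk, eta = 0.0, 3.0
bundle = [(y, f(y), g(y)) for y in (-1.0, 0.0, 1.0)]
```

For example, `cp_model(0.0, xk, bundle, eta)` returns \(f(x^{k})=0\), and the model stays below \(\varphi(w)=0.5w^{2}\) at every point tested.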
3.2 Further description via bundle terminologies
Bundle methods [19–21] are among the most robust and reliable methods for general nonsmooth optimization problems, and can be considered stabilized variants of the cutting-planes method [14, 15]. In general, for a convex function h, bundle methods store the trial points \(y^{i}\), \(i\in J_{\ell}\), together with their function values and subgradients, in a bundle of information:
and a point \(x^{k}:=x^{k(\ell)}\) (called the stability center), which is the “best” point obtained so far. A storage-saving form of (15) (referred to the current stability center \(x^{k}\)) is given by
where \(\partial_{e}h\) is the e-subdifferential of h in convex analysis, and \(e^{i,k}_{h}\) are the linearization errors of h defined by
Following the notation above, the bundle information of the function \(\varphi_{\ell}(\cdot)\) can be written as (see also [13]):
where \(e^{i,k}_{f}\) and \(d^{k}_{i}\) are the linearization errors of f and \(\frac{1}{2} \Vert \cdot-x^{k} \Vert ^{2}\), respectively, \(g^{i}_{f}\) is a subgradient of f at \(y^{i}\), and \(\Delta^{k}_{i}\) is the gradient of \(\frac{1}{2} \Vert \cdot-x^{k} \Vert ^{2}\) at \(y^{i}\). From (17), we know that \(e^{i,k}_{f}\), \(d^{k}_{i}\) and \(\Delta^{k}_{i}\) depend on the point \(x^{k}\), so they should be updated whenever a new stability center is generated (details are given below).
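In a 1-D setting, the three bundle quantities in (17) can be computed directly. The sketch below (illustrative names) also shows the point made later in this section: \(e^{i,k}_{f}\) may be negative when f is nonconvex, and the convexification term compensates for this.

```python
def bundle_elements(f_xk, xk, y, fy, g):
    """Bundle triple (e_f, d, Delta) of a trial point y w.r.t. center xk (1-D):
    e_f   -- linearization error of f at y (can be negative when f is nonconvex),
    d     -- linearization error of (1/2)(. - xk)^2, always (1/2)(y - xk)^2 >= 0,
    Delta -- gradient of (1/2)(. - xk)^2 at y."""
    e_f = f_xk - fy - g * (xk - y)
    d = 0.5 * (y - xk) ** 2
    delta = y - xk
    return e_f, d, delta

# f(x) = -x^2, center xk = 0, trial point y = 1 with subgradient g = f'(1) = -2:
e_f, d, delta = bundle_elements(0.0, 0.0, 1.0, -1.0, -2.0)
# e_f = -1 < 0, but e_f + eta * d = 0.5 >= 0 once eta = 3 exceeds the
# convexity threshold of f.
```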
By the optimality conditions of subproblem (13), there exists a multiplier vector \((\alpha^{\ell}_{i}, i\in J_{\ell})\in S^{\ell}\) such that
where \(S^{\ell}\) denotes the unit simplex in \(\mathbb{R}^{ \vert J_{\ell } \vert }\) and \(g^{\ell-1}_{h}=\nabla\bar{h}_{\ell-1}(z^{\ell+1})\).
As the iterations proceed, the number of elements in the bundle may grow without bound, which could lead to serious storage and computation problems. The subgradient aggregation strategy [22] is the most popular and efficient way to overcome this difficulty. We use the notation \(g^{-\ell}_{\eta_{\ell}}\) to denote the aggregate subgradient, i.e.,
Define the strongly active set of subgradients by
Then the corresponding aggregate bundle elements are given by
Therefore
Here, as in [13, 23], we use negative index −ℓ to express the aggregate bundle elements, hence \(J_{\ell}\subseteq\{ -\ell,-\ell+1,\dots,0,1,\dots,\ell-1,\ell\}\) in general.
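The effect of aggregation can be seen on affine planes directly: any convex combination of cutting planes is again a single plane that minorizes the pointwise maximum of the originals. A small illustrative check (our own naming):

```python
def aggregate_plane(planes, alphas):
    """Convex combination of affine planes, each given as (intercept, slope).
    The multipliers alphas lie in the unit simplex, so the aggregate is a
    single plane lying below the pointwise maximum of the originals."""
    assert abs(sum(alphas) - 1.0) < 1e-12 and all(a >= 0 for a in alphas)
    b = sum(a * p[0] for a, p in zip(alphas, planes))
    m = sum(a * p[1] for a, p in zip(alphas, planes))
    return b, m

planes = [(0.0, 1.0), (1.0, -1.0)]          # model(w) = max(w, 1 - w)
b, m = aggregate_plane(planes, [0.5, 0.5])  # aggregate plane: 0.5 + 0 * w
```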
By making use of the notations above, the cutting-planes model \(\check {\varphi}_{\ell}\) in (11) can be rewritten as:
Note that, for all \(j\in J^{\mathrm{act}}_{\ell}\) we have
and the aggregate model of \(\check{\varphi}_{\ell}\) in \(J_{\ell}\) is
For a new stability center \(x^{k+1}\), the bundle elements can be updated by (see [13])
On the other hand, since f is possibly nonconvex, the linearization errors \(e^{i,k}_{f}+\eta_{\ell}d^{k}_{i}, i\in J_{\ell}\) may be negative, and therefore the model \(\check{\varphi}_{\ell}\) is not necessarily a lower approximation to \(\varphi_{\ell}\). In the case of \(e^{i,k}_{f}+\eta_{\ell}d^{k}_{i}\geq0\), one has
In order to ensure that the linearization errors are all nonnegative, the convexification parameter \(\eta_{\ell}\) should be adjusted to asymptotically estimate the ideal convexity threshold \(\rho^{\mathrm{id}}\) in Theorem 2. Hare et al. [2] suggested a lower bound for \(\eta_{\ell }\) as follows:
which guarantees that \(e^{i,k}_{f}+\eta d^{k}_{i}\geq0\) for all \(i\in J_{\ell}\) whenever \(\eta\geq\eta^{\min}_{\ell}\).
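One concrete realization of this requirement (a sketch of our own, not necessarily the exact update used in the algorithm) simply takes the smallest η making every shifted linearization error nonnegative:

```python
def eta_lower_bound(bundle):
    """Smallest eta >= 0 such that e_f + eta * d >= 0 for every bundle
    element (e_f, d); elements with d = 0 are assumed to have e_f >= 0.
    A concrete realization of the lower bound described above."""
    bound = 0.0
    for e_f, d in bundle:
        if d > 0:
            bound = max(bound, -e_f / d)
    return bound

bundle = [(-1.0, 0.5), (0.2, 1.0), (0.0, 0.0)]
eta = eta_lower_bound(bundle)   # 2.0, driven by the negative error -1.0
```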
Finally, in our algorithm, we define the predicted descent \(\delta _{\ell}\) and the linearization error \(\varepsilon_{\ell}\) as follows:
For a fixed parameter \(\kappa\in(0,1)\), a descent step is taken if
holds, and the stability center is updated by \(x^{k+1}=y^{\ell+1}\). Otherwise, a null step occurs, and the aggregate linearization and the new linearization are used to produce a better model \(\check{\varphi}_{\ell+1}\).
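In code, the serious/null decision is a one-line test. The sketch below transcribes the descent inequality in the form it appears later in the convergence analysis, \(F(y^{\ell+1})\leq F(x^{k})-\kappa\delta_{\ell}\) (function and argument names are our own):

```python
def descent_test(F_new, F_center, delta, kappa):
    """Serious-step test: accept y as the new stability center iff the
    achieved decrease reaches the fraction kappa of the predicted descent:
        F(y) <= F(x_k) - kappa * delta,   kappa in (0, 1)."""
    return F_new <= F_center - kappa * delta

# predicted descent delta = 1.0, kappa = 0.1:
serious = descent_test(0.5, 1.0, 1.0, 0.1)   # decrease 0.5 >= 0.1 -> serious step
null = descent_test(0.95, 1.0, 1.0, 0.1)     # decrease 0.05 < 0.1 -> null step
```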
3.3 The algorithm
Algorithm 1
Step 0. (Initialization). Select a starting point \(y^{0}\) and set \(x^{0}=y^{0}\). Set parameters \(M>0\), \(R_{0}>0\), \(\kappa\in(0,1)\), \(\epsilon\geq0\), and \(\Gamma\geq1\). Initialize the iteration counter \(\ell=0\), the descent step counter \(k:=k(\ell)=0\) with \(i_{0}=0\). Set \((\mu_{0}, \eta_{0})=(R_{0}, 0)\) and \(J_{0}:=\{0\}\). Compute \(f(x^{0}), g^{0}_{f}\in\partial f(x^{0})\) and the bundle information \((e^{0,0}_{f}, d^{0}_{0}, \triangle^{0}_{0}):=(0, 0, 0)\). Set \(s^{-1}_{h}=g^{0}_{h}\in\partial h(x^{0})\).
Step 1. Find \(z^{\ell+1}\) by solving subproblem (13), and set
Step 2. Find \(y^{\ell+1}\) by solving subproblem (14), and set
Step 3. (Stopping criterion). Compute \(f(y^{\ell +1})\), \(h(y^{\ell+1})\), \(g_{f}^{\ell+1}\in\partial f(y^{\ell+1})\) and \(g_{h}^{\ell+1}\in\partial h(y^{\ell+1})\). If \(\delta_{\ell} \leq \epsilon\), then STOP. Otherwise, compute the new bundle elements by
Select a new index set \(J_{\ell+1}\) satisfying
Step 4. (Descent test). If (28) holds, declare a descent step, set \(k(\ell +1)=k+1, i_{k+1}=\ell+1, x^{k+1}=y^{\ell+1}\), and update the bundle elements by (23). Otherwise, declare a null step, and set \(k(\ell+1)=k(\ell)\).
Step 5. (Update η). Update the convexification parameter by
Step 6. (Update μ). If \(F(y^{\ell +1})>F(x^{k})+M\), then the objective increase is unacceptable, let \(\mu_{\ell+1}:=\Gamma\mu_{\ell}\) and loop to Step 1; otherwise, set \(\mu_{\ell+1}:=\mu_{\ell}\).
Step 7. (Loop). Increase ℓ by 1 and go to Step 1.
Remark 1
(1) The predicted descent \(\delta_{\ell}\) and the linearization error \(\varepsilon_{\ell}\) are nonnegative (details are given below); (2) in Step 6, if \(F(y^{\ell+1})>F(x^{k})+M\) holds, then the current model is considered “bad”, so we become more “conservative” and increase the proximal parameter μ by setting \(\mu_{\ell+1}=\Gamma\mu_{\ell}\). In the next section, we prove that μ is increased only finitely many times; (3) the parameters \(\mu_{\ell}\) and \(\eta_{\ell}\) eventually remain fixed.
Lemma 2
The predicted descent \(\delta_{\ell}\) and the linearization error \(\varepsilon_{\ell}\) are nonnegative, and satisfy
Proof
In Step 2, from (30) and (18) we know \(g^{\ell-1}_{h}=s^{\ell-1}_{h}\), hence
Next, we prove \(\delta_{\ell}\geq0\) and \(\varepsilon_{\ell}\geq 0\). From (26) and (29) we have
Letting \(j=-\ell\) in (22) and using (34), we obtain
On the other hand, from (30), we have
Hence, by combining (36), (37) and (30), (35) can be written as
In Step 5, the update for \(\eta_{\ell}\) is done to ensure \(\eta _{\ell}\geq\eta^{\min}_{\ell}\) for all iterations, so that \(e^{-\ell}_{f}+\eta_{\ell}d^{k}_{-\ell}\geq0\). Therefore, the predicted descent \(\delta_{\ell}\geq0\) since h is convex.
For \(\varepsilon_{\ell}\), from (27), one has
Similar to (36), we have
Thus, we have
Equation (33) follows immediately from (39) and (42). □
From (33), we know that \(\delta_{\ell}\geq\varepsilon_{\ell}\). Therefore, if \(\delta_{\ell}\leq\epsilon\), then \(\varepsilon_{\ell}\leq\epsilon\), so we only use \(\delta_{\ell}\leq\epsilon\) as the termination criterion in Step 3.
Lemma 3
The vectors \(s_{\varphi}^{\ell}\) and \(s_{h}^{\ell}\) of (30) and (29) are in fact subgradients, i.e.,
Furthermore, we have
Proof
Let \(\phi^{\ell}_{f}\) and \(\phi^{\ell}_{h}\) denote the objectives of (13) and (14), respectively, i.e.,
By (13), (29) and the optimality condition of (45), we have
which implies \(s_{\varphi}^{\ell}\in\partial\check{\varphi}_{\ell }(z^{\ell+1})\). Similarly, by (14) and the optimality condition of (46), we obtain
which implies \(s_{h}^{\ell}\in\partial h_{\ell}(y^{\ell+1})\). So (43) holds.
4 Convergence
In this section, we will study the convergence properties of Algorithm 1. Firstly, based on the objective function of problem (1), we need to slightly modify Assumption 1 as follows.
Assumption 2
Given \(x^{0}\in\mathbb{R}^{n}\) and \(M_{0}\geq0\), there exist an open bounded set \(\mathcal{O}\) and a function H such that \(\mathcal{L}_{0}:=\{x\in\mathbb{R}^{n}: F(x)\leq F(x^{0})+M_{0}\}\subset\mathcal{O}\), and H is lower-\(C^{2}\) on \(\mathcal{O}\) with \(H\equiv f\) on \(\mathcal{L}_{0}\).
For convenience, we assume that Assumption 2 holds throughout the rest of convergence analysis.
In addition, from [24] we know that, if f is a locally Lipschitz continuous function, then the subgradients of f are locally bounded, i.e.,
Further, as in [9], it follows that the model subgradients \(s^{\ell}_{\varphi}\) in (43) satisfy
Remark 2
Note that (47) implies that \(\{g^{\ell}_{\varphi}:=g^{\ell}_{f}+\eta_{\ell}\triangle^{\ell}_{i}\}\) \((g^{\ell}_{\varphi}\in\partial\check{\varphi}_{\ell})\) is bounded if \(\{y^{\ell}\}\) is bounded, since \(\{\triangle^{\ell}_{i}\}\) in (17) is bounded whenever \(\{y^{\ell}\}\) is bounded. Since \(s^{\ell}_{\varphi}\in\partial\check{\varphi}_{\ell}\), we have \(s^{\ell}_{\varphi}\in \operatorname{conv}\{g^{j}_{\varphi}\}_{j\in J_{\ell}}\), and thus \(\Vert s^{\ell}_{\varphi} \Vert \leq\max^{\ell}_{j=1} \Vert g^{j}_{\varphi} \Vert \); hence the model \(\check{\varphi}_{\ell}\) satisfies condition (48) automatically when (47) holds.
The following lemma shows the properties of the model function \(\check {\varphi}_{\ell}\), whose proof can be found in [13, 16].
Lemma 4
For the model function \(\check{\varphi}_{\ell}\) and convexification parameter \(\eta_{\ell}\), we have
-
(i)
\(\check{\varphi}_{\ell}\) is a convex function.
-
(ii)
If \(\eta_{\ell}\geq\eta_{\ell}^{\min}\), then
$$ \check{\varphi}_{\ell}\bigl(x^{k}\bigr)\leq f \bigl(x^{k}\bigr). $$(49) -
(iii)
If \(\eta_{\ell+1}=\eta_{\ell}\), and either \(J_{\ell +1}\supseteq J_{\ell}^{\mathrm{act}}\) or \(J_{\ell+1}\supseteq\{-\ell\}\), then
$$\check{\varphi}_{\ell+1}(\cdot)\geq\check{\varphi}_{\ell } \bigl(z^{\ell+1}\bigr)+\bigl\langle s^{\ell}_{\varphi}, \cdot-z^{\ell+1}\bigr\rangle $$if \(y^{\ell+1}\) is a null step.
-
(iv)
If \(J_{\ell}\supseteq\{\ell\}\), then
$$\check{\varphi}_{\ell}(\cdot)\geq f\bigl(y^{\ell}\bigr)+ \frac{1}{2}\eta _{\ell} \bigl\Vert y^{\ell}-x^{k} \bigr\Vert ^{2}+\bigl\langle g^{\ell}_{f}+ \eta_{\ell }\bigl(y^{\ell}-x^{k}\bigr), \cdot-y^{\ell}\bigr\rangle , $$for some \(g^{\ell}_{f}\in\partial f(y^{\ell})\).
-
(v)
If \(\eta_{\ell}\geq\rho^{\mathrm{id}}\), then
$$ \check{\varphi}_{\ell}(\omega)\leq f(\omega)+ \frac{1}{2}\eta _{\ell} \bigl\Vert \omega-x^{k} \bigr\Vert ^{2} \quad\textit{for all } \omega\in\mathcal{L}_{0}. $$(50)
From the updating rule in Step 5 of Algorithm 1, the convexification parameter \(\eta_{\ell}\) is either unchanged or increased. The following lemma, whose proof can be found in [13], shows that \(\eta_{\ell}\) becomes fixed after a finite number of iterations.
Lemma 5
There exist an index \(\ell_{1}\) and a positive constant \(\overline {\eta}>0\) such that
Lemma 6
Suppose that there exists an integer K such that, for all \(\ell\geq K\), only null steps occur without increasing μ. Then the following results hold:
-
(i)
The sequences
$$\begin{aligned} &\biggl\{ \phi^{\ell}_{f}\bigl(z^{\ell+1}\bigr)=\check{ \varphi}_{\ell}\bigl(z^{\ell +1}\bigr)+\bar{h}_{\ell-1} \bigl(z^{\ell+1}\bigr)+ \frac{1}{2}\mu_{\ell} \bigl\Vert z^{\ell+1}-x^{k} \bigr\Vert ^{2}\biggr\} _{\ell\geq K}, \\ &\biggl\{ \phi^{\ell}_{h}\bigl(y^{\ell+1}\bigr)=\bar{ \varphi}_{\ell}\bigl(y^{\ell +1}\bigr)+h\bigl(y^{\ell+1}\bigr)+ \frac{1}{2}\mu_{\ell} \bigl\Vert y^{\ell+1}-x^{k} \bigr\Vert ^{2}\biggr\} _{\ell\geq K} \end{aligned}$$are nondecreasing and convergent.
-
(ii)
The sequences \(\{y^{\ell+1}\}\) and \(\{z^{\ell+1}\}\) are bounded, \(\Vert z^{\ell+1}-y^{\ell+1} \Vert \rightarrow0\) and \(\Vert z^{\ell+2}-y^{\ell+1} \Vert \rightarrow0\) as \(\ell\rightarrow \infty\).
Proof
First, we use partial linearizations of the subproblems to show that (i) holds. Fix \(\ell\geq K\). By the definitions in (13) and (29), we have \(\bar{\varphi}_{\ell}(z^{\ell+1})=\check{\varphi}_{\ell }(z^{\ell+1})\) and
from \(\nabla\bar{\phi}^{\ell}_{f}(z^{\ell+1})=0\). Since \(\bar {\phi}^{\ell}_{f}\) is quadratic and \(\bar{\phi}^{\ell}_{f}(z^{\ell+1})=\phi^{\ell}_{f}(z^{\ell +1})\), by Taylor’s expansion
Similarly, by the definitions in (14) and (30), we have \(\bar{h}_{\ell}(y^{\ell+1})=h(y^{\ell +1})\), and
Next, to bound the objective values of the linearized subproblem (51) and (53) from above, we use \(\bar{\varphi}_{\ell}\leq\check{\varphi}_{\ell}\) and \(\bar{h}_{\ell-1}\leq h\), \(\bar{h}_{\ell}\leq h\) of (44) and \(\check{\varphi}_{\ell}(x^{k})\leq f(x^{k})\) in (ii) of Lemma 4
From (14) and (51), we have \(\bar{\phi}_{f}^{\ell}\leq\phi^{\ell}_{h}\). On the other hand, since only null steps occur, \(x^{k+1}=x^{k}\); the algorithm then ensures that \(\mu_{\ell}=\mu_{\ell+1}\), and \(\bar{\varphi}_{\ell}\leq\check{\varphi}_{\ell+1}\) by (iii) of Lemma 4, so we obtain \(\bar{\phi}^{\ell}_{h}\leq\phi^{\ell+1}_{f}\). By (52) and (54), we see that
In particular, from (57) and (58), we have the relation
which implies that \(\{\phi^{\ell}_{f}(z^{\ell+1})\}_{\ell\geq K}\) and \(\{\phi^{\ell}_{h}(y^{\ell+1})\}_{\ell\geq K}\) are nondecreasing sequences. Together with the bound of \(F(x^{k})\) from (55) and (56), the convergence is established.
For (ii), we proved in (i) the convergence of \(\{\phi^{\ell}_{f}(z^{\ell+1})\}\) and \(\{\phi^{\ell}_{h}(y^{\ell+1})\}\) for \(\ell\geq K\), so they have a common limit, say \(\phi_{\infty}\leq F(x^{k})\), such that
and \(\Vert z^{\ell+1}-y^{\ell+1} \Vert \rightarrow0\) and \(\Vert z^{\ell+2}-y^{\ell+1} \Vert \rightarrow0\) follow from (57) and (58), while \(\{y^{\ell+1}\}\) and \(\{z^{\ell+1}\}\) are bounded by (55) and (56). Then the sequences \(\{g^{\ell}_{f}\}\) and \(\{s^{\ell}_{\varphi}\}\) are bounded by (47) and (48). □
The following lemma shows that μ is increased only finitely many times.
Lemma 7
Suppose that \(i_{k}\in J_{\ell}\), and let \(N_{\ell}\) be the number of times that μ has been increased up to iteration ℓ. Then there exists a positive constant L such that
where \(\lceil a\rceil\) is the smallest integer greater than or equal to a. As a result, there exists an index \(\ell_{2}\) such that
Proof
Let \(\ell_{r}\) be the index corresponding to the rth increase of μ; then, for \(\ell_{r}+1\leq\ell<\ell_{r+1}\), we have
Since \(i_{k}\in J_{\ell}\), from (24), we obtain \(g^{i_{k}}\in\partial\check{\varphi}(x^{k})\) by writing \(i=i_{k}\), and it also holds that \(g_{f}^{i_{k}}\in\partial f(x^{k})\) from (21), so \(\Vert g_{f}^{i_{k}} \Vert \) is bounded. Hence
For \(\ell_{r}+1\leq\ell<\ell_{r+1}\), if \(y^{\ell+1}\) is a null step, then \(g^{\ell}_{f}\) is bounded by Lemma 6; otherwise \(y^{\ell+1}\) is a descent step, and the corresponding subgradient \(g^{i_{k(\ell+1)}}\in\partial f(x^{k+1})\) is also bounded. Therefore, there exists a constant \(L>0\) such that \(\max\{ \Vert g_{f}^{i_{k}} \Vert , \Vert g_{f}^{\ell+1} \Vert , \Vert g^{k}_{h} \Vert , \Vert g^{\ell+1}_{h} \Vert \}\leq\frac{L}{2}\). Thus we have
This together with (61) shows that
Thus, if
then
This means that the number \(N_{\ell}\) of increases of μ satisfies (60). The latter part of the lemma follows immediately. □
Theorem 3
If \(\delta_{\ell}=0\) and \(\eta_{\ell}\geq\rho^{\mathrm{id}}\), then \(x^{k}\) is a stationary point of F.
Proof
From (33), we have the relation
If \(\delta_{\ell}=0\), then \(\varepsilon_{\ell}=0\) and \(s^{\ell }=0\), and therefore
From the last result of Theorem 1, we know that \(x^{k}\) is a stationary point of \(\bar{\varphi}_{\ell}+h\). In addition, from \(\varepsilon_{\ell}=0\) in (27), we have
This together with \(\bar{h}_{\ell}(x^{k})\leq h(x^{k})\) shows that
On one hand, for \(\omega\in\mathcal{L}_{0}\), if \(\eta_{\ell}\geq \rho^{\mathrm{id}}\), we obtain by (62) and (63)
From the convexity of \(\check{\varphi}_{\ell}\) and (50), we have
So, (64) can be written as
On the other hand, for \(\omega\notin\mathcal{L}_{0}\), from (28) we can obtain
Combining (64) and (66), we have
Hence
which together with Theorem 1 shows that \(x^{k}\) is a stationary point of F. □
We are now in a position to present the main convergence result of our algorithm. As usual in bundle methods, two cases are considered: the algorithm generates a finite number of descent steps, or it generates an infinite number of descent steps. We set the stopping parameter \(\epsilon=0\).
Theorem 4
Let η̄ be the stabilized value of the convexification parameter sequence and assume \(\bar{\eta}\geq\rho^{\mathrm{id}}\). Then exactly one of the following mutually exclusive situations holds:
-
(i)
Algorithm 1 generates a finite number of descent steps followed by infinitely many null steps. Let x̄ be the last stability center. Then \(y^{\ell+1}\rightarrow\bar{x}\), and x̄ is a stationary point of F.
-
(ii)
Algorithm 1 generates an infinite sequence \(\{ x^{k}\}\) of stability centers. Then any accumulation point of \(\{x^{k}\}\) is a stationary point of F.
Proof
For (i), without loss of generality, we may assume \(\eta_{\ell}=\overline{\eta}\), \(\mu_{\ell}=\overline{\mu}\), and \(R_{\ell}=\overline{R}\) throughout. In Lemma 6 we showed that the sequences \(\{y^{\ell}\}\) and \(\{z^{\ell}\}\) are bounded with \(\Vert y^{\ell}-z^{\ell} \Vert \rightarrow0\) and \(\Vert z^{\ell+1}-y^{\ell} \Vert \rightarrow0\) as \(\ell\rightarrow\infty\). Therefore \(y^{\ell_{i}}\rightarrow p\) as \(i\rightarrow\infty\) implies \(z^{\ell_{i}}\rightarrow p\) and \(z^{\ell_{i}+1}\rightarrow p\) as \(i\rightarrow\infty\). For \(\omega\in\mathcal{L}_{0}\) near p, by (50) and the convexity of h, we have
Let \(x^{k}=\bar{x}\) and \(\mu_{\ell}=\bar{\mu}\); from (29) and the boundedness of \(s^{\ell}_{\varphi}+s^{\ell-1}_{h}=\overline{\mu}(\bar{x}-z^{\ell+1})\), we know
Note that
Combining (68) and (70), (67) can be written as
By Claim (iv) of Lemma 4 written with \(\ell=\ell_{i}\) for \(\omega=z^{\ell_{i}+1}\), we have the following inequality:
Then
From \(s^{\ell-1}_{h}=\overline{\mu}(\bar{x}-z^{\ell+1})-s^{\ell}_{\varphi}\) and the boundedness of the sequences \(\{\overline{\mu}(\bar{x}-z^{\ell+1})\}\), \(\{s^{\ell}_{\varphi}\}\), \(\{g^{\ell}_{f}\}\) and \(\{y^{\ell}\}\), we know that \(\{s^{\ell-1}_{h}\}\) and \(\{g^{\ell_{i}}_{f}+ \overline{\eta}(y^{\ell_{i}}-\bar{x})\}\) are bounded. Taking the limit as \(i\rightarrow\infty\) and using the continuity of f at p, we obtain
for all \(\omega\in\mathcal{L}_{0}\) near p. Since \(\frac {1}{2}\overline{\eta} \Vert \omega-p \Vert ^{2}=o( \Vert \omega-p \Vert )\), the last inequality means that \(\overline{R}(\bar{x}-p)\in\partial F(p)\) by Definition 8.3 in [17], which implies \(p=p_{\bar{R}}F(\bar{x})\) by Theorem 1. Since f is continuous and condition (50) holds at all accumulation points of \(\{y^{\ell}\}\), the entire sequence \(\{y^{\ell}\}\) converges to the proximal point \(p_{\bar{R}}F(\bar{x})\). Furthermore, evaluating the relations at \(\omega=p\) shows that the following equation holds for the entire sequence by Theorem 2 in [16]:
So as \(\ell\rightarrow\infty\), the whole sequence
Thus, from (26) we have
Since a null step does not satisfy the descent test in Step 4 of the algorithm, we have \(F(y^{\ell+1})> F(\bar{x})-\kappa\delta_{\ell}\). Taking the limit as \(\ell\rightarrow\infty\) gives \(F(p)\geq F(\bar{x})-\kappa(F(\bar{x})-F(p))\), so \(F(\bar{x})\leq F(p)\) since \(\kappa\in(0,1)\). But \(p=p_{\bar{R}}F(\bar{x})\) implies
which shows that \(\bar{x}=p\). That is, \(\bar{x}=p_{\bar{R}}F(\bar {x})\), so x̄ is a stationary point of F from Theorem 1.
For (ii), \(\mathcal{L}_{0}\) is compact and \(\{x^{k}\}\subset\mathcal{L}_{0}\), so the sequence has an accumulation point; that is, there exists an infinite set K such that \(x^{k}\rightarrow\hat{x}\in\mathcal{L}_{0}\) as \(K\ni k\rightarrow\infty\). Since \(x^{k+1}=y^{i_{k+1}}\), let \(j_{k}=i_{k+1}-1\), so that \(x^{k+1}=p_{\bar{\mu}}(\bar{\varphi}_{j_{k}}+h)(x^{k})\). The descent test
implies that, as \(k\rightarrow\infty\), either \(F(x^{k})\searrow -\infty\), or \(\delta_{j_{k}}\rightarrow0\). By Assumption 2, \(F(x^{k})\) is bounded below, therefore, \(\delta _{j_{k}}\rightarrow0\). From (39), this means that
must converge to 0. By
and
in Lemma 6, we have
By (21), \(\check{\varphi}_{j_{k}}(z^{j_{k}+1})-f(x^{k})\rightarrow 0\) as \(k\rightarrow\infty\), from (29) we know that
Therefore,
as \(k\rightarrow\infty\). Consider now \(k\in K\). Since \(\Vert x^{k+1}-x^{k} \Vert = \Vert y^{j_{k}+1}-x^{k} \Vert \rightarrow0\), both \(x^{k+1}\) and \(x^{k}\) converge to \(\hat{x}\) as \(K\ni k\rightarrow \infty\) with
From \(x^{k+1}=p_{\bar{\mu}}(\bar{\varphi}_{j_{k}}+h)(x^{k})\), \(\bar{\eta}\geq\rho^{\mathrm{id}}\), and (50), for all \(\omega\in\mathcal{L}_{0}\),
Therefore, taking the limit \(k\in K\), we have
On the other hand, \(\hat{x}\in\mathcal{L}_{0}\), and for any \(\omega\notin\mathcal{L}_{0}\) it follows that
Hence,
Therefore, \(\hat{x}=p_{\bar{R}}F(\hat{x})\) with \(\bar{R}\geq\rho ^{\mathrm{id}}\), hence x̂ is a stationary point of F from Theorem 1. □
5 Numerical results
This section tests the practical effectiveness of Algorithm 1 on a set of nine problems. The first seven are generalized from the unconstrained versions in [25] by imposing suitable constraints; the remaining two are nonconvex unconstrained problems taken from [13, 26], each of which is the sum of a nonconvex function and a convex function.
All numerical experiments were implemented in MATLAB R2014a on a ThinkPad laptop running Windows 7. The first seven problems have the form of (2) with \(C=\{ x: \Vert x-a \Vert \leq b\}\), where \(a\in\mathbb{R}^{n}\) and \(b>0\) are given below. These problems are transformed into the form of (3) via the indicator function. The detailed data for the seven problems are listed below. For simplicity, we use MATLAB notation: ones(p,q) and zeros(p,q) denote the p-by-q matrices of ones and zeros, respectively.
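Since \(h=\imath_{C}\) with \(C\) a Euclidean ball in these seven problems, the proximal point of \(h\) reduces to the projection onto the ball, which has a closed form. A minimal sketch (the function name is ours):

```python
import numpy as np

def proj_ball(x, a, b):
    """Euclidean projection onto C = {x : ||x - a|| <= b}.
    This is the proximal point of the indicator function i_C at x:
    points inside the ball are unchanged; points outside are pulled
    radially back onto the boundary."""
    d = x - a
    nd = np.linalg.norm(d)
    return x if nd <= b else a + (b / nd) * d
```

For example, projecting the CB2 starting point \(y^{0}=(3,3)^{T}\) onto the unit ball centered at the origin gives \((1/\sqrt{2},1/\sqrt{2})^{T}\).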
CB2: \(f(x)=\max\{x_{1}^{2}+x_{2}^{4}, (2-x_{1})^{2}+(2-x_{2})^{2}, 2e^{x_{2}-x_{1}}\}\), \(y^{0}=(3,3)^{T}\), \(a=(0,0)^{T}\), \(b=1\).
CB3: \(f(x)=\max\{x_{1}^{4}+x_{2}^{2}, (2-x_{1})^{2}+(2-x_{2})^{2}, 2e^{x_{2}-x_{1}}\}\), \(y^{0}=(3,3)^{T}\), \(a=(3,3)^{T}\), \(b=1\).
LQ: \(f(x)=\max\{-x_{1}-x_{2},-x_{1}-x_{2}+x_{1}^{2}+x_{2}^{2}-1\}\), \(y^{0}=(1,1)^{T}\), \(a=(1,-1)^{T}\), \(b=1\).
Mifflin1: \(f(x)=-x_{1}+20\max\{x_{1}^{2}+x_{2}^{2}-1, 0\}\), \(y^{0}=(1.5,0.5)^{T}\), \(a=(-2,2)^{T}\), \(b=1\).
Rosen–Suzuki: \(f(x)=\max_{1\leq i\leq4}f_{i}(x)\), \(y^{0}=(1,2.1,-3,-0.9)^{T}\), \(a=(1,2,3,4)^{T}\), \(b=2\), with
Shor: \(f(x)=\max_{1\leq i\leq10}\{d_{i}\sum^{5}_{j=1}(x_{j}-c_{ij})^{2}\}\), \(y^{0}=\texttt {zeros(10,1)}\), \(a=\texttt {zeros(10,1)}\), \(b=3\), \(d=(1,5,10, 2, 4, 3, 1.7, 2.5, 6, 3.5)^{T}\),
MAXL: \(f(x)=\max_{1\leq i\leq20} \vert x_{i} \vert \), \(a = (-\texttt {ones(1,10)}, \texttt {ones(1,10)})^{T}\), \(b=4\), \(y^{0}=(1,1.1,3, 1.1,5,1.1,7,1.1,9,1.1,-11,0.1,-13,0.1,-15,0.1,-17,0.1,-19,0.1)^{T}\).
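Each of the max-type objectives above supplies, at any point, both a function value and a subgradient (the gradient of a maximizing piece), which is what a bundle method requires from its oracle. A minimal sketch for CB2 (the function name is ours):

```python
import numpy as np

def cb2(x):
    """CB2: f(x) = max{x1^2 + x2^4, (2-x1)^2 + (2-x2)^2, 2*exp(x2-x1)}.
    Returns the value and a subgradient (gradient of a maximizing piece)."""
    x1, x2 = x
    vals = [x1**2 + x2**4,
            (2 - x1)**2 + (2 - x2)**2,
            2 * np.exp(x2 - x1)]
    grads = [np.array([2 * x1, 4 * x2**3]),
             np.array([2 * (x1 - 2), 2 * (x2 - 2)]),
             np.array([-2 * np.exp(x2 - x1), 2 * np.exp(x2 - x1)])]
    i = int(np.argmax(vals))          # index of an active piece
    return vals[i], grads[i]
```

At the starting point \(y^{0}=(3,3)^{T}\) the first piece is active, giving \(f(y^{0})=90\) with subgradient \((6,108)^{T}\).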
The second set of two problems are:
Regular: \(F(x)=\sum_{i=1}^{n} \vert f_{i}(x) \vert +\frac{1}{2} \Vert x \Vert ^{2}\), where the \(f_{i}\) are the Ferrier polynomials.
L-Mifflin: \(F(x)=2(x_{1}^{2}+x_{2}^{2}-1)+1.75 \vert x_{1}^{2}+x_{2}^{2}-1 \vert \).
The nonconvexity of the above two problems can be seen from Fig. 1.
In the tests, the parameters are selected as \(M=5\), \(R_{0}=10\), \(\kappa =0.3\), \(\epsilon=10^{-5}\), \(\Gamma=2\). The numerical results are reported in Tables 1, 2 and 3, using the following notation: n, the dimension of the problem; NI, the number of iterations; ND, the number of descent steps; NF, the number of function evaluations; \(x^{*}\), the approximate optimal solution; \(F^{*}\), the approximate optimal objective value. Table 1 compares Algorithm 1 with PPBM (the algorithm in [27]) on the first seven problems; it shows that Algorithm 1 performs better than PPBM. In Table 2, we compare our algorithm with RedistProx [13] on problem Regular for various n; under the same stopping condition (terminate once NF exceeds 300), the approximate optimal values and accuracies of Algorithm 1 are better than those of RedistProx. Finally, Table 3 reports the approximate optimal solutions and values for problem L-Mifflin with different starting points.
References
Chartrand, R.: Exact reconstruction of sparse signals via nonconvex minimization. IEEE Signal Process. Lett. 14(10), 707–710 (2007)
Hare, W., Sagastizábal, C., Solodov, M.V.: A proximal bundle method for nonsmooth nonconvex functions with inexact information. Comput. Optim. Appl. 63(1), 1–28 (2016)
Attouch, H., Bolte, J., Redont, P., Soubeyran, A.: Proximal alternating minimization and projection methods for nonconvex problems: an approach based on the Kurdyka–Łojasiewicz inequality. Math. Oper. Res. 35(2), 438–457 (2010)
Bolte, J., Sabach, S., Teboulle, M.: Proximal alternating linearized minimization for nonconvex and nonsmooth problems. Math. Program. 146, 459–494 (2014)
Dinh, Q.T., Diehl, M.: Proximal methods for minimizing the sum of a convex function and a composite function (2011). arXiv:1105.0276
Eckstein, J., Svaiter, B.F.: A family of projective splitting methods for the sum of two maximal monotone operators. Math. Program. 111(1), 173–199 (2007)
Goldfarb, D., Ma, S., Scheinberg, K.: Fast alternating linearization methods for minimizing the sum of two convex functions. Math. Program. 141, 349–382 (2013)
Kiwiel, K.C.: A method for minimizing the sum of a convex function and a continuously differentiable function. J. Optim. Theory Appl. 48(3), 437–449 (1986)
Kiwiel, K.C.: An alternating linearization bundle method for convex optimization and nonlinear multicommodity flow problems. Math. Program. 130(1), 59–84 (2011)
Li, D., Pang, L., Chen, S.: A proximal alternating linearization method for nonconvex optimization problems. Optim. Methods Softw. 29(4), 771–785 (2014)
Mine, H., Fukushima, M.: A minimization method for the sum of a convex function and a continuously differentiable function. J. Optim. Theory Appl. 33(1), 9–23 (1981)
Tuy, H., Tam, B.T., Dan, N.D.: Minimizing the sum of a convex function and a specially structured nonconvex function. Optimization 28, 237–248 (1994)
Hare, W.L., Sagastizábal, C.: A redistributed proximal bundle method for nonconvex optimization. SIAM J. Optim. 20(5), 2442–2473 (2010)
Cheney, E.W., Goldstein, A.A.: Newton’s method for convex programming and Tchebycheff approximations. Numer. Math. 1, 253–268 (1959)
Kelley, J.E.: The cutting-plane method for solving convex programs. J. Soc. Ind. Appl. Math. 8, 703–712 (1960)
Hare, W., Sagastizábal, C.A.: Computing proximal points of nonconvex functions. Math. Program. 116(1), 221–258 (2009)
Rockafellar, R.T., Wets, R.J.B.: Variational Analysis. Springer, Berlin (1998)
Rockafellar, R.T.: Monotone operators and the proximal point algorithm. SIAM J. Control Optim. 14(5), 877–898 (1976)
Lemaréchal, C.: An extension of Davidon methods to nondifferentiable problems. Math. Program. Stud. 3, 95–109 (1975)
Wolfe, P.: A method of conjugate subgradients for minimizing nondifferentiable functions. Math. Program. Stud. 3, 145–173 (1975)
Bonnans, J.F., Gilbert, J.C., Lemaréchal, C., Sagastizábal, C.: Numerical Optimization: Theoretical and Practical Aspects, 2nd edn. Springer, Berlin (2006)
Kiwiel, K.C.: Methods of Descent for Nondifferentiable Optimization. Lecture Notes in Mathematics, vol. 1133. Springer, Berlin (1985)
Kiwiel, K.C.: A method of centers with approximate subgradient linearizations for nonsmooth convex optimization. SIAM J. Optim. 18(4), 1467–1489 (2008)
Kiwiel, K.C.: An algorithm for linearly constrained convex nondifferentiable minimization problems. J. Math. Anal. Appl. 105(2), 452–465 (1985)
Lukšan, L., Vlček, J.: Test problems for nonsmooth unconstrained and linearly constrained optimization. Technical Report No. 798, Institute of Computer Science, Academy of Sciences of the Czech Republic, Prague (2000)
Bagirov, A.M., Karmitsa, N., Mäkelä, M.: Introduction to Nonsmooth Optimization: Theory, Practice and Software. Springer, Cham (2014)
Kiwiel, K.C.: A proximal-projection bundle method for Lagrangian relaxation, including semidefinite programming. SIAM J. Optim. 17(4), 1015–1034 (2006)
Acknowledgements
This work was supported by the National Natural Science Foundation of China (11761013, 11771383) and the Guangxi Natural Science Foundation (2013GXNSFAA019013, 2014GXNSFFA118001, 2016GXNSFDA380019).
Author information
Contributions
All authors read and approved the final manuscript. CT mainly contributed to the algorithm design and convergence analysis; JL mainly contributed to the convergence analysis and numerical results; and JJ mainly contributed to the idea of the method and algorithm design.
Ethics declarations
Competing interests
The authors declare that they have no competing interests.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
Cite this article
Tang, C., Lv, J. & Jian, J. An alternating linearization bundle method for a class of nonconvex nonsmooth optimization problems. J Inequal Appl 2018, 101 (2018). https://doi.org/10.1186/s13660-018-1683-1