Abstract
In many engineering applications, it is necessary to minimize smooth functions plus penalty (or regularization) terms that violate smoothness and convexity. Specific algorithms for problems of this type are available in the recent literature. Here, a smooth reformulation is analyzed and its equivalence with the original problem is proved from the points of view of both global and local optimization. Moreover, for the cases in which the objective function is much more expensive than the constraints, model-intensive algorithms, accompanied by their convergence and complexity theories, are introduced. Finally, numerical experiments are presented.
1 Introduction
We are concerned with mathematical problems in which the objective is to find a point x ∈ X such that:
In general, X is a subset of a finite-dimensional space. Constrained and unconstrained minimization problems may be formulated in this way, with D(x) and E(x) representing objective functions and constraints. Problems of the form (1) are almost always solved by means of iterative methods. This means that, given an approximation xk to a solution, the approximation xk+ 1 is obtained by solving:
where \(M_{F, x^{k}}\), \(M_{D, x^{k}}\), and \(M_{E, x^{k}}\) are models of F, D, and E built using knowledge available at xk. For example, if we deal with the constrained optimization problem:
we may identify, say, f with D, g with E, and F with the difference between f(x) and the minimum for each feasible point x, in such a way that problem (2) may consist of the minimization of the quadratic Taylor approximation of f (perhaps plus a Lagrangian term) subject to the linearization of the constraints. This is, essentially, the idea of sequential quadratic programming methods [29].
In many cases, the arguments of F involve a function that is difficult to evaluate (D(x)) and a function that is easy to evaluate (E(x)). In these cases, the computer work associated with the solution of (2) may be overwhelmingly dominated by the evaluation of D(xk) (including, possibly, its derivatives). Moreover, even
and
may be solvable with a computational cost essentially equal to the cost of computing D(xk). For example, in the constrained optimization problem, the evaluation of the objective function could involve a huge collection of data whereas the constraints may be easy to evaluate. In this case, it is probably convenient to employ an iterative scheme based on (5) by means of which we solve, at each iteration, a powerful model of the objective function subject to the true constraints. This is the point of view of [26], where the complexity of a general scheme for constrained optimization based on high-order models of the objective function is analyzed.
The objective of this paper is to exploit the modeling idea (5) for the minimization of nonconvexly regularized problems. The equivalence of this problem with a constrained optimization problem, in which the objective function is generally much more expensive than the constraints, will be exploited, and it will be proved that, for some Taylor-like models under mild assumptions, the resulting method enjoys theoretical convergence and complexity properties that suggest good computational behavior. Illustrative examples that tend to corroborate this hypothesis will be shown.
Given \(\mathcal {C} \subset \mathbb {R}^{n}\), continuously differentiable functions \(f:\mathbb {R}^{n}\rightarrow \mathbb {R}\) and \(g:\mathbb {R}^{n} \to \mathbb {R}^{d}\), a penalty function \(\pi : \mathbb {R} \rightarrow \mathbb {R}_{+}\) such that π(t) = 0 if t ≤ 0, and a regularization parameter σ > 0, we consider the constrained optimization problem given by:
Problem (6) may arise in the process of solving problem:
by means of penalization with respect to the constraints g(x) ≤ 0. In this case, we expect that a solution to (7) should emerge when σ is sufficiently large or, at least, when \(\sigma \to \infty \) (see, for example, [29]). Many times, the objective function of (6) is smooth and \({\pi }^{\prime }(0) = 0\). However, we are especially interested in the case in which \({\pi }^{\prime }(0) > 0\) and, above all, in the case in which \({\pi }^{\prime }(0)\) does not exist and \(\lim _{t \to 0+} {\pi }^{\prime }(t) = \infty \).
For several reasons, solutions to (6) for different values of σ may be meaningful independently of their connection with (7). In fact, many times, the solution to (6) when \(\sigma \to \infty \) is not relevant, and the fulfillment of all the constraints gj(x) ≤ 0 is not desirable or, perhaps, impossible. Instead, the ideal asymptotic problem that one wishes to solve is, roughly speaking:
with the compromise that the number of indices j such that gj(x) > 0 should be small. For example, a typical portfolio problem could require the minimization of the expected loss f(x) subject to standard constraints \(x \in \mathcal {C}\) and the additional requirement that the number of scenarios under which shortfall occurs should be as small as possible [12]. We expect to achieve this objective employing penalty functions that are concave for t ≥ 0 and such that \(\lim _{t \to 0+} {\pi }^{\prime }(t) > 0\).
The case in which \(\mathcal {C}\) is polyhedral and π(t) = tq if t ≥ 0, with q ∈ (0,1), has been considered in [5, 24]. The problem:
called the Lq-regularized optimization problem, has been the object of many studies starting from Tikhonov [30] and the enormous literature on inverse problems. The case q = 2 corresponds to the most classical regularization, intended to handle problems in which the mere minimization of f(x) onto \(\mathcal {C}\) is very ill-conditioned and, so, its exact solution is, frequently, meaningless. The case q = 1 preserves the possible convexity of f(x) and has been shown to induce sparse regularized minimizers of f(x) (see [13]). With q < 1, the possible convexity of the objective function is lost, but the tendency to find sparse solutions x is even more emphatic. Writing gj(x) = xj and gn+j(x) = −xj for \(j = 1, \dots , n\) and defining π(t) = tq for all t ≥ 0, problem (9) takes the form (6); so (9) is a particular case of the problem studied in this paper. Practical applications of problems of the form (6) include machine learning [18], information theory [20], image restoration [5], signal processing [4], variable selection [19], and others. Interesting examples were given in [24, Appx.1].
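A one-dimensional grid search (our illustration; the values a = 0.6 and σ = 0.4 are arbitrary choices) makes the sparsity-promoting effect of q < 1 concrete: for the scalar problem of minimizing ½(x − a)² + σ|x|^q, the case q = 2 only shrinks the minimizer toward zero, q = 1 soft-thresholds it, and q = 1/2 snaps it exactly to zero.

```python
# One-dimensional illustration (ours, not from the references): minimize
#   phi_q(x) = 0.5*(x - a)**2 + sigma*|x|**q
# by grid search for q in {2, 1, 0.5}.  With q = 2 the minimizer is only
# shrunk toward 0; with q = 1 it is soft-thresholded; with q = 1/2 it
# snaps to exactly 0 -- the sparsity-promoting effect of q < 1.
grid = [i / 10000 for i in range(-20000, 20001)]   # step 1e-4, contains 0.0

def argmin_on_grid(q, a, sigma):
    return min(grid, key=lambda x: 0.5 * (x - a) ** 2 + sigma * abs(x) ** q)

a, sigma = 0.6, 0.4
x2 = argmin_on_grid(2.0, a, sigma)   # ridge: shrinks (continuous min: 1/3)
x1 = argmin_on_grid(1.0, a, sigma)   # lasso-like: a - sigma = 0.2
xh = argmin_on_grid(0.5, a, sigma)   # q = 1/2: exactly 0
```

In this instance the continuous minimizers are 1/3 for q = 2 and 0.2 for q = 1, while for q = 1/2 the minimum is attained exactly at the nonsmooth point 0.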
The rest of this paper is organized as follows. In Section 2, we introduce a smooth reformulation of (6) and we prove its equivalence with the original problem. In Section 3, we derive optimality conditions using the reformulation. In Section 4, we introduce a model-intensive algorithm for solving the problem and we prove its convergence and complexity. In Section 5, we introduce a specific algorithm for the case in which the constraints are linear. In Section 6, we present numerical experiments. Finally, in Section 7, we state conclusions and lines for future research.
Notation
∥⋅∥ denotes an arbitrary norm. \(\mathbb {R}_{+}\) denotes the set of nonnegative elements of \(\mathbb {R}\). The i th component of a vector v is denoted vi or [v]i. If \(v \in \mathbb {R}^{n}\) is a vector with components vi, v+ is the vector with components \(\max \limits \{0, v_{i}\}\), \(i = 1, \dots , n\).
2 Reformulation
A smooth reformulation of problem (6) will be introduced in this section. The penalty functions employed in this work will be required to satisfy the assumptions below.
Assumption A1
π is continuous, concave, and strictly increasing for t ≥ 0. Moreover, \({\pi }^{\prime }(t)\) exists and is continuous for all t > 0.
Assumption A2
The restriction of π to \(\mathbb {R}_{+}\) admits an inverse \({\pi }^{-1}: \mathbb {R}_{+} \to \mathbb {R}_{+}\), the derivative (π− 1)′(t) exists for all t > 0, and the right derivative \(({\pi }^{-1})^{\prime }(0)_{+}\) also exists.
Assumptions A1 and A2 are satisfied by penalty functions in practical applications such as ridge regression [21], smoothly clipped absolute deviation penalization (SCAD) [19], minimax concave penalties (MCP) [31], fraction penalty optimization [28], and log-penalty minimization [13]. Note that assumptions A1 and A2 do not require \(\lim _{t \to 0+} {\pi }^{\prime }(t) = \infty \), although this property will be interesting for the purpose of minimizing the number of indices j such that gj(x) > 0.
The possible lack of smoothness of (6) is due to the nonsmoothness of π(t) at t = 0. Therefore, standard methods based on gradient information for solving (6) may be inappropriate. In order to avoid this inconvenience, several approaches have been proposed. Among them, we can mention smoothing approximation methods [5, 16, 24], iterative reweighted algorithms [17, 23, 25], and interior point methods [6, 22]. Our proposal, based on a smooth reformulation, has the advantage that standard well-established algorithms for constrained optimization based on first-order or higher-order derivative information may be used. This is possible thanks to the fact that the function \({\pi }^{-1}: \mathbb {R}_{+} \to \mathbb {R}_{+}\) is continuously differentiable.
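A small numerical check makes this concrete (a sketch under the assumption π(t) = t^q with q = 1/2, so that π^{-1}(y) = y^{1/q} = y²): the derivative of π blows up at 0+, while the inverse is continuously differentiable there with right derivative 0, which is exactly what the reformulation exploits.

```python
# Numerical check (illustration under the assumption pi(t) = t**q with
# q in (0,1)): pi' blows up at 0+, yet pi^{-1}(y) = y**(1/q) is
# continuously differentiable on R_+, with (pi^{-1})'(0)_+ = 0.
q = 0.5
pi = lambda t: t ** q if t > 0 else 0.0
pi_inv = lambda y: y ** (1.0 / q)              # here: y**2

def right_derivative(f, t, h=1e-8):
    return (f(t + h) - f(t)) / h

slope_pi_near_0 = right_derivative(pi, 1e-12)  # huge: pi is nonsmooth at 0
slope_inv_at_0 = right_derivative(pi_inv, 0.0) # ~0: pi^{-1} is smooth at 0
round_trip = pi_inv(pi(0.3))                   # inverse property on R_+
```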
Problem (6) is obviously equivalent to:
If yj > π(gj(x)), taking \(\bar {y}_{j} = \pi (g_{j}(x))\), we obtain that \( f(x)+ \sigma {\sum }_{j=1}^{d} \bar {y}_{j} \leq f(x)+ \sigma {\sum }_{j=1}^{d} {y}_{j}\). Therefore, (10) is equivalent to:
As π(t) = 0 if t < 0, (11) is equivalent to:
If yj ≥ 0 and yj ≥ π(gj(x)), since both yj and π(gj(x)) belong to \(\mathbb {R}_{+}\) and the inverse of π onto \(\mathbb {R}_{+}\) is increasing, we obtain π− 1(yj) ≥ gj(x). Reciprocally, if π− 1(yj) ≥ gj(x) and yj ≥ 0, we obtain yj ≥ π(gj(x)) or gj(x) ≤ 0 in which case yj ≥ π(gj(x)) also holds. Therefore, (12) is equivalent to:
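The chain of equivalences from (6) to (13) can be sanity-checked by brute force on a toy instance (ours, not from the paper: n = d = 1, f(x) = (x − 1)², g(x) = x − 0.5, π(t) = √t so π^{-1}(y) = y², σ = 1, C = [−1,1]); both formulations attain the same optimal value, here at the point where the constraint becomes active.

```python
# Brute-force sanity check (toy instance, not from the paper) that (6)
# and its smooth reformulation (13) share the same optimal value.  Data:
# n = d = 1, f(x) = (x-1)^2, g(x) = x - 0.5, pi(t) = sqrt(t), sigma = 1,
# C = [-1, 1], so pi^{-1}(y) = y^2 and (13) reads
#   min f(x) + y  s.t.  x - 0.5 <= y^2,  y >= 0.
import math

f = lambda x: (x - 1.0) ** 2
g = lambda x: x - 0.5
xs = [i / 100 for i in range(-100, 101)]     # grid on C = [-1, 1]
ys = [j / 100 for j in range(0, 151)]        # grid on y >= 0

# Original nonsmooth problem (6).
min_orig = min(f(x) + math.sqrt(max(0.0, g(x))) for x in xs)

# Smooth reformulation (13): minimize f(x) + y over the feasible pairs.
min_reform = min(f(x) + y for x in xs for y in ys if g(x) <= y * y)
```

Both minima equal 0.25, attained at x = 0.5 (with y = 0 in the reformulation): the square-root penalty makes it optimal to land exactly on the boundary g(x) = 0.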
The theorem below shows that a solution to the original problem (6) can be found by solving its smooth reformulation (13), and that, conversely, a solution to (13) can be obtained from a solution to (6).
Theorem 2.1
Suppose that (x∗,y∗) is a local (global) minimizer of (13). Then, \(y^{*}_{j} = \pi (g_{j}(x^{*}))\) for all \(j = 1,\dots , d\) and x∗ is a local (global) minimizer of (6). Conversely, if x∗ is a local (global) minimizer of (6) and \(y^{*}_{j} = \pi (g_{j}(x^{*}))\) for all \(j = 1, \dots , d\), then (x∗,y∗) is a local (global) minimizer of (13).
Proof 1
Suppose that (x∗,y∗) is a local minimizer of (13). If \(y^{*}_{j} > \pi (g_{j}(x^{*}))\) and \(y^{*}_{j} > y_{j} \geq \pi (g_{j}(x^{*}))\), we have \({\pi }^{-1}(y^{*}_{j}) > {\pi }^{-1}(y_{j}) \geq g_{j}(x^{*})\). Hence, if \(y_{i} = y^{*}_{i}\) for all i≠j, then (x∗,y) is a feasible point of (13) with Φ(x∗,y) < Φ(x∗,y∗); since yj may be taken arbitrarily close to \(y^{*}_{j}\), (x∗,y∗) would not be a local minimizer of (13). This implies that, necessarily, \(y^{*}_{j} = \pi (g_{j}(x^{*}))\) for all \(j = 1,\dots , d\).
Let ε > 0 be such that for all (x,y) such that \(x \in \mathcal {C}\), y ≥ 0, \(g_{j}(x) \leq {\pi }^{-1}(y_{j})\), \(j = 1, \dots , d\), ∥x − x∗∥≤ ε and \(\|y- y^{*}\|_{\infty } \leq \varepsilon \), we have that \(f(x) + \sigma {\sum }_{j=1}^{d} y_{j} \geq f(x^{*}) + \sigma {\sum }_{j=1}^{d} y^{*}_{j}\). By the first part of the proof, \(y^{*}_{j} = \pi (g_{j}(x^{*})) \); therefore, \(f(x) + \sigma {\sum }_{j=1}^{d} y_{j} \geq f(x^{*}) + \sigma {\sum }_{j=1}^{d} \pi (g_{j}(x^{*})) \). Let ε1 ∈ (0,ε] be such that \(\|x - x^{*}\| \leq {\varepsilon }_{1}\) implies that |π(gj(x)) − π(gj(x∗))|≤ ε for all \(j=1,\dots ,d\). Then, if \(\|x - x^{*}\| \leq {\varepsilon }_{1}\), \(f(x) + \sigma {\sum }_{j=1}^{d} \pi (g_{j}(x)) \geq f(x^{*}) + \sigma {\sum }_{j=1}^{d} \pi (g_{j}(x^{*}))\). This implies that x∗ is a local minimizer of (6).
Conversely, suppose that x∗ is a local minimizer of (6). Then, there exists ε > 0 such that for all \(x \in \mathcal {C}\) such that ∥x − x∗∥≤ ε, \(f(x^{*}) + \sigma {\sum }_{j=1}^{d} \pi (g_{j}(x^{*})) \leq f(x) + \sigma {\sum }_{j=1}^{d} \pi (g_{j}(x))\). Therefore, \(f(x^{*}) + \sigma {\sum }_{j=1}^{d} y_{j}^{*} \leq f(x) + \sigma {\sum }_{j=1}^{d} \pi (g_{j}(x))\) for all \(x \in \mathcal {C}\) such that ∥x − x∗∥≤ ε. Thus, \(f(x^{*}) + \sigma {\sum }_{j=1}^{d} y_{j}^{*} \leq f(x) + \sigma {\sum }_{j=1}^{d} y_{j}\) for all \(x \in \mathcal {C}\) such that ∥x − x∗∥≤ ε if yj ≥ 0 and yj ≥ π(gj(x)) for all \(j = 1, \dots , d\). But, due to the increasing property of π− 1 onto \(\mathbb {R}_{+}\), yj ≥ 0 and yj ≥ π(gj(x)) is equivalent to yj ≥ 0 and π− 1(yj) ≥ gj(x). Therefore, \(f(x^{*}) + \sigma {\sum }_{j=1}^{d} y_{j}^{*}\) is less than or equal to \(f(x) + \sigma {\sum }_{j=1}^{d} y_{j}\) for all (x,y) satisfying \(x \in \mathcal {C}\), ∥x − x∗∥≤ ε, and the constraints of (13).
The equivalence proof with respect to global minimizers is similar. □
3 Optimality conditions
Assume that:
where \(H: \mathbb {R}^{n} \to \mathbb {R}^{n_{\mathrm {H}}}\) and \(G: \mathbb {R}^{n} \to \mathbb {R}^{n_{\mathrm {G}}}\) are continuously differentiable. If a local minimizer (x∗,y∗) of (13) satisfies a constraint qualification, then the KKT conditions hold at (x∗,y∗). By Theorem 2.1, if x∗ is a local minimizer of (6), \(y^{*}_{j} = \pi (g_{j}(x^{*}))\) for all \(j = 1, \dots , d\), and a constraint qualification for (13) is fulfilled, then the KKT conditions are satisfied too. The KKT conditions for (13) obviously involve both x∗ and y∗. Since y is merely an auxiliary variable, it is interesting to derive KKT-like conditions in which only the primal variables x∗ are involved. This is the objective of Theorem 3.1 below, which, in turn, generalizes [15, Corollary 2.2] and [24] using straightforward optimization arguments.
Theorem 3.1
Let x∗ be a local minimizer of (6) and assume that π satisfies Assumptions A1 and A2. Define:
Then, x∗ is a local minimizer of the problem:
Moreover, if \(\mathcal {C}\) is defined by H(x) = 0 and G(x) ≤ 0, where \(H:\mathbb {R}^{n} \to \mathbb {R}^{n_{\mathrm {H}}}\) and \(G:\mathbb {R}^{n} \to \mathbb {R}^{n_{\mathrm {G}}}\) are continuously differentiable, and the system of equalities and inequalities:
satisfies a constraint qualification at x∗ then we have that x∗ satisfies the KKT conditions of problem (16).
Proof 2
Let y∗≥ 0 be such that (x∗,y∗) is a local minimizer of (13). By Theorem 2.1, \(y^{*}_{j} = \pi (g_{j}(x^{*}))\) for all \(j = 1, \dots , d\). If j ∈ I(x∗), we have that gj(x) < 0 for all x in a neighborhood of x∗. Moreover, \(y^{*}_{j} = 0\) in this case. Therefore, (x∗,y∗) is a local minimizer of:
So, (x∗,y∗) is a local minimizer of
If j ∈ K(x∗), we have that gj(x) > 0 in a neighborhood of x∗ and π(gj(x)) is continuous and differentiable on that neighborhood. Therefore, x∗ is a local minimizer of the locally smooth problem given by:
Since gj(x∗) ≤ 0 for all j ∈ J(x∗), x∗ is also a local minimizer of:
Moreover, since \(y_{j}^{*} = 0\) for all j ∈ J(x∗), x∗ is a local minimizer of:
Clearly, the variables yj for j ∈ J(x∗) play no role in this problem, so x∗ is a local minimizer of:
Then, if the constraints that define \(\mathcal {C}\) together with the constraints gj(x) ≤ 0, j ∈ J(x∗) satisfy a constraint qualification at x∗, the KKT conditions of (16) hold at x∗. This completes the proof. □
Observe that (16) is a constrained optimization problem of the form
where \(f_{\text {obj}}: \mathbb {R}^{n} \to \mathbb {R}\), \(h_{\mathrm {E}}:\mathbb {R}^{n} \to \mathbb {R}^{n_{\mathrm {E}}}\), and \(h_{\mathrm {I}}: \mathbb {R}^{n} \to \mathbb {R}^{n_{\mathrm {I}}}\) are continuously differentiable. It is well known that, independently of constraint qualifications, every local minimizer w∗ of (18) satisfies the Approximate KKT (AKKT) optimality condition [2, 7], which means that, given εsol > 0, εkkt > 0, εfeas > 0, and εcomp > 0, there exist \(w \in \mathbb {R}^{n}\), \(\lambda \in \mathbb {R}^{n_{\mathrm {E}}}\), and \({\mu } \in \mathbb {R}^{n_{\mathrm {I}}}_{+}\) such that \(\|w - w^{*}\| \leq {\varepsilon }_{\text {sol}}\),
and
If \(w \in \mathbb {R}^{n}\) is such that there exist \(\lambda \in \mathbb {R}^{n_{\mathrm {E}}}\) and \({\mu } \in \mathbb {R}^{n_{\mathrm {I}}}_{+}\) fulfilling (19), (20), and (21), we say that w is an (εfeas/εcomp/εkkt)-AKKT point of (18).
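The definition can be turned into a straightforward check. The sketch below encodes our reading of conditions (19)-(21), whose displays are not reproduced here: approximate stationarity of the Lagrangian (infinity norm), approximate feasibility, and vanishing multipliers on inequality constraints that are inactive beyond the tolerance εcomp. All function and Jacobian callbacks are hypothetical caller-supplied objects, not part of the paper.

```python
# Sketch (our reading of the AKKT definition) of checking whether a
# candidate w with multipliers lam, mu is an
# (eps_feas / eps_comp / eps_kkt)-AKKT point of
#   min f(w)  s.t.  hE(w) = 0,  hI(w) <= 0.
def is_akkt(grad_f, jac_hE, jac_hI, hE, hI, w, lam, mu,
            eps_feas, eps_comp, eps_kkt):
    n = len(w)
    # Stationarity residual: grad f + JE^T lam + JI^T mu ~ 0.
    r = list(grad_f(w))
    JE, JI = jac_hE(w), jac_hI(w)
    for i, li in enumerate(lam):
        for k in range(n):
            r[k] += li * JE[i][k]
    for i, mi in enumerate(mu):
        for k in range(n):
            r[k] += mi * JI[i][k]
    stationary = max(abs(v) for v in r) <= eps_kkt
    # Approximate feasibility.
    feas = (all(abs(v) <= eps_feas for v in hE(w)) and
            all(v <= eps_feas for v in hI(w)))
    # Approximate complementarity: multipliers vanish on inequality
    # constraints that are inactive beyond the tolerance eps_comp.
    comp = all(mi == 0.0 for mi, v in zip(mu, hI(w)) if v < -eps_comp)
    return stationary and feas and all(mi >= 0.0 for mi in mu) and comp
```

For instance, for min w² subject to w − 1 ≤ 0, the point w = 0 with μ = 0 passes the check, while w = 1 with μ = 0 fails stationarity.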
Applying the AKKT definition to (16), we obtain the following result, independently of constraint qualifications.
Theorem 3.2
Let x∗ be a local minimizer of (6) and assume that π satisfies Assumptions A1 and A2. Let \(\mathcal {C}\) be defined by H(x) = 0 and G(x) ≤ 0, where \(H:\mathbb {R}^{n} \to \mathbb {R}^{n_{\mathrm {H}}}\) and \(G: \mathbb {R}^{n} \to \mathbb {R}^{n_{\mathrm {G}}}\) are continuously differentiable. Assume that εsol > 0, εkkt > 0, εfeas > 0, and εcomp > 0 are arbitrary. Then, there exist \(x \in \mathbb {R}^{n}\), \(\lambda _{\mathrm {H}} \in \mathbb {R}^{n_{\mathrm {H}}}\), \({\mu }_{\mathrm {G}} \in \mathbb {R}^{n_{\mathrm {G}}}_{+}\), and \({\mu }_{j} \in \mathbb {R}_{+}\) for all j with gj(x∗) = 0, such that:
and
Proof 3
By Theorem 3.1, x∗ is a local minimizer of (16). Then, the thesis follows from (19), (20), and (21). □
4 Model-intensive algorithms
As we mentioned in Section 2, problem (13) can be solved using standard methods for smooth constrained optimization. In many applications, the cost of evaluating f(x) exceeds, by far, the cost of evaluating the constraints. For these situations, “model intensive” (MI) methods are recommendable. At each iteration of such an algorithm, a model of the objective function is built and this model is minimized using the original constraints (instead of their linear approximations as in SQP and other popular approaches). The idea is to exploit the model as much as possible so that, hopefully, it will not be necessary to evaluate the expensive function many times.
The most popular model for an arbitrary sufficiently smooth function \(f: \mathbb {R}^{n} \to \mathbb {R}\) comes from considering its p th Taylor approximation:
so that:
Observe that \(T_{1}(\bar x, x) = \nabla f(\bar x)^{T}(x - \bar x)\) and \(T_{2}(\bar x, x)=\nabla f(\bar x)^{T}(x - \bar x)+\frac {1}{2}(x - \bar x)^{T} \nabla ^{2} f(\bar x) (x - \bar x)\). Consequently, the model for \(f(x) + \sigma {\sum }_{j=1}^{d} y_{j}\) will be \(f(\bar x) + T_{p}(\bar x, x) + \sigma {\sum }_{j=1}^{d} y_{j}\).
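As a quick check of the model quality that motivates this choice (our example, with f(x) = e^x standing in for an expensive objective), the error of the second-order Taylor model decays like |x − x̄|^{p+1} = h³, so halving h divides the error by roughly 2³ = 8:

```python
# Illustration of the p-th order Taylor model for a scalar function
# (f(x) = exp(x) as a stand-in for an expensive objective):
# T_2(xb, x) = f'(xb)(x - xb) + 0.5 f''(xb)(x - xb)^2, and the model
# error f(x) - f(xb) - T_2(xb, x) shrinks like |x - xb|^{p+1} = h^3.
import math

def T2(xb, x):
    h = x - xb
    return math.exp(xb) * h + 0.5 * math.exp(xb) * h * h  # f' = f'' = exp

xb = 0.3
err = lambda h: abs(math.exp(xb + h) - math.exp(xb) - T2(xb, xb + h))
ratio = err(0.1) / err(0.05)   # ~ 2**3 = 8 for an order-(p+1) = 3 error
```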
In this section, \(M_{\bar x, \bar y}(x,y)\) will be a model of \({\Phi }(x, y) - {\Phi }(\bar x, \bar y)\) for all \(\bar x, x \in \mathbb {R}^{n}\) and \(\bar y, y \in \mathbb {R}^{d}\). Therefore, \({\Phi }(x, y) \approx {\Phi }(\bar x, \bar y) + M_{\bar x, \bar y}(x, y)\). Let us define, for all δ ≥ 0:
and D = D0. The following algorithm is similar to Algorithm 3.1 of [11].
Algorithm 4.1
Assume that \(p \in \{ 1, 2, 3, \dots \}\), η > 0, α > 0, \(\rho _{\min \limits } > 0\), τ2 ≥ τ1 > 1, 𝜃 > 0, and (x0,y0) ∈ Dη/4 are given with \({y^{0}_{j}} = \pi (g_{j}(x^{0}))\), \(j = 1, \dots , d\). Initialize \(k \leftarrow 0\).
- Step 1.: Set \(\rho \leftarrow 0\).
- Step 2.: Compute \(x^{\text{trial}} \in \mathbb {R}^{n}\) and \(y^{\text{trial}} \in \mathbb {R}^{d}\) such that
$$ (x^{\text{trial}}, y^{\text{trial}}) \in D_{\eta_{k}} \text{ with } \eta_{k} = \eta \left( \frac{k+1}{k+2} \right) $$ (26)
and
$$ M_{x^{k}, y^{k}}(x^{\text{trial}}, y^{\text{trial}}) + \rho \|(x^{\text{trial}} - x^{k}, y^{\text{trial}} - y^{k})\|^{p+1} \leq 0. $$ (27)
- Step 3.: Test the condition
$$ {\Phi} (x^{\text{trial}}, y^{\text{trial}}) \leq {\Phi}(x^{k}, y^{k}) - \alpha \| (x^{\text{trial}} - x^{k}, y^{\text{trial}} - y^{k}) \|^{p+1}. $$ (28)
If (28) holds, accept the trial point \((x^{\text{trial}}, y^{\text{trial}})\), define \(\rho_{k} = \rho\), \(x^{k+1} = x^{\text{trial}}\), \(y^{k+1} = y^{\text{trial}}\), set \(k \leftarrow k+1\), and go to Step 1. Otherwise, update \(\rho \leftarrow \max \{ \rho_{\min}, \tau \rho \}\) with \(\tau \in [\tau_{1},\tau_{2}]\), and go to Step 2.
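The control flow of Algorithm 4.1 can be sketched as follows. This is a structural sketch only: the norm is taken Euclidean, the subproblem solver satisfying (26)-(27) is abstracted into a caller-supplied oracle, and the safeguard and stagnation stops are ours, not part of the algorithm.

```python
# Structural sketch of the outer loop of Algorithm 4.1.  Phi is the
# objective of (13); solve_subproblem(z, rho) is a user-supplied oracle
# returning a trial point meeting (26)-(27) for the current rho.
def mi_regularized_loop(Phi, solve_subproblem, z0,
                        alpha=1e-4, rho_min=1e-3, tau=2.0,
                        p=2, max_iter=50):
    z = list(z0)
    history = [Phi(z)]
    for k in range(max_iter):
        rho = 0.0                                  # Step 1
        while True:
            z_trial = solve_subproblem(z, rho)     # Step 2: (26)-(27)
            step = dist(z_trial, z)
            # Step 3: sufficient-decrease test (28).
            if Phi(z_trial) <= Phi(z) - alpha * step ** (p + 1):
                z = z_trial
                history.append(Phi(z))
                break
            rho = max(rho_min, tau * rho)          # reject: increase rho
            if rho > 1e12:                         # safeguard (ours)
                return z, history
        if step < 1e-10:                           # stagnation stop (ours)
            return z, history
    return z, history

def dist(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b)) ** 0.5
```

With a toy oracle that simply halves the current point for Φ(z) = ∥z∥², each trial satisfies (28) immediately and the loop drives the iterates to zero.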
Under the model-intensive strategy, conditions (26) and (27) may be obtained by solving the subproblem:
with rather high precision, since we assume that the cost of evaluating the model is negligible with respect to the cost of evaluating f. Note that the feasibility required at xtrial is looser than the precision with which xk satisfies those constraints. Therefore, it is plausible to require the fulfillment of (26) and (27) by means of a standard constrained optimization algorithm. This means that, although conditions (27) and (28) are trivially satisfied by xtrial = xk and ytrial = yk, it is reasonable to expect that a nontrivial solution decreasing Φ can be obtained with nonzero increments.
Complexity results for Algorithm 4.1 are presented below. Before that, the standard assumptions (see, for example, [10, 14]) are given.
Assumption A3
For all (xk,yk) generated by Algorithm 4.1, we have that
Assumption A4
There exists L1 > 0 such that, for every (xk,yk) calculated by Algorithm 4.1 and for every (xtrial,ytrial) satisfying (27), we have that:
Assumption A4 is valid if M is obtained using the p th Taylor approximation and the derivatives of order p of f are Lipschitz continuous. See [3, 10, 14].
Assumption A5
There exists L2 > 0 such that, for every xk, yk, xk+ 1, and yk+ 1 obtained by Algorithm 4.1, we have that:
As in the case of Assumption A4, Assumption A5 also holds if we use pth-order Taylor approximations for computing M and the pth-order derivatives of f are Lipschitz continuous.
Assumption A6 below says that (29) must be approximately solved. Before stating this assumption, let us recall that, assuming that ∥(x − xk,y − yk)∥p+1 is continuously differentiable with respect to (x,y), (29) is a constrained optimization problem of the form (18).
Assumption A6
We say that this assumption holds at iteration k of Algorithm 4.1 if the function ∥(x − xk,y − yk)∥p+ 1 is continuously differentiable with respect to (x,y) and (xtrial,ytrial) is an (εfeas/εcomp/εkkt)-AKKT point of (29) with εfeas = ηk, εcomp = η, and εkkt = 𝜃∥(xtrial − xk,ytrial − yk)∥p.
Theorem 4.2
Assume that the sequence {(xk,yk)} is generated by Algorithm 4.1, \({\Phi }_{\text {target}} \in \mathbb {R}\), ε > 0, and \(L = \max \limits \{L_{1}, L_{2}\}\). Define:
Suppose that Assumptions A3, A4, A5, and A6 hold for all:
Then, the number of iterations k such that:
and (xk+ 1,yk+ 1) is not an (ηk/η/ε)-AKKT point of (13) is bounded above by \(K(x^{0}, y^{0}, {\Phi }_{\text {target}}, \alpha , p, L, \tau _{2}, \theta )\). Moreover, the number of functional evaluations per iteration is bounded above by:
Proof 4
The desired result follows as in [11, Thm.3.1]. □
Theorem 4.3
Suppose that Assumptions A3, A4, and A5 hold, and the function ∥(x − xk,y − yk)∥p+ 1 is continuously differentiable with respect to (x,y). Assume, moreover, that for all \(k = 0, 1, 2, \dots \) and εfeas > 0, εcomp > 0, and εkkt > 0, we are able to compute, using a suitable algorithm, an (εfeas/εcomp/εkkt)-AKKT point of (29) such that (27) holds. Then, there are two possibilities:
1. Assumption A6 holds for all \(k \leq K(x^{0}, y^{0}, {\Phi }_{\text {target}}, \alpha , p, L, \tau _{2}, \theta ) \).
2. There exists \(k \leq K(x^{0}, y^{0}, {\Phi }_{\text {target}}, \alpha , p, L, \tau _{2}, \theta ) \) such that (xk,yk) is an AKKT point of (13).
Proof 5
Assume that (xk,ℓ,yk,ℓ) is a sequence generated by an algorithm that, when applied to the problem:
satisfies the hypotheses of the theorem. By construction, \((x^{k}, y^{k}) \in D_{\eta _{k-1}}\). Therefore the minimum of \(M_{x^{k}, y^{k}}(x, y ) + \rho \|(x-x^{k}, y-y^{k})\|^{p+1}\) onto \(D_{\eta _{k-1}}\) is non-positive, as well as the minimum of \(M_{x^{k}, y^{k}}(x, y ) + \rho \|(x-x^{k}, y-y^{k})\|^{p+1}\) onto \(D_{\eta _{k}}\).
By the fulfillment of the AKKT optimality conditions, we have two possibilities:
1. There exists an iterate (xk,ℓ,yk,ℓ) which is an (εfeas/εcomp/εkkt)-AKKT point of (29) with εfeas = ηk, εcomp = η, and εkkt = 𝜃∥(xk,ℓ − xk,yk,ℓ − yk)∥p.
2. There exist infinitely many iterates (xk,ℓ,yk,ℓ) that are (εfeas/εcomp/εkkt)-AKKT points of (29) with εfeas = 1/ℓ < ηk, εcomp = 1/ℓ < η, and εkkt = 1/ℓ but
$$ \theta \|(x^{k,\ell} - x^{k}, y^{k,\ell} - y^{k})\| < 1/\ell. $$ (32)
In the first case, we can choose xtrial = xk,ℓ and ytrial = yk,ℓ, satisfying Assumption A6 and (27). In the second case, (32) implies that:
Therefore, (xk,yk) satisfies the AKKT optimality condition for (31). Since the first derivatives of Φ(x,y) at (xk,yk) coincide with the first derivatives of the objective function of (31), it turns out that (xk,yk) was an AKKT point for (31). In other words, Assumption A6 did not hold because (xk,yk) already was an approximate solution of the original problem. This completes the proof. □
Theorems 4.2 and 4.3 say that after at most the number of iterations given by (30), one finds an iterate that satisfies the constraints of (13) with tolerance η such that the objective function value is smaller than or equal to Φtarget; or, alternatively, we find an iterate that satisfies KKT conditions with tolerance η for feasibility, tolerance η for complementarity, and tolerance ε for optimality. The conclusion of these theorems is that, ultimately, Assumption A6 is not necessary since, if it holds for all \(k \leq K(x^{0}, y^{0}, {\Phi }_{\text {target}}, \alpha , p, L, \tau _{2}, \theta )\), it guarantees that an approximate solution is found for some \(k \leq K(x^{0}, y^{0}, {\Phi }_{\text {target}}, \alpha , p, L, \tau _{2}, \theta )\). But, if Assumption A6 does not hold for some \(k \leq K(x^{0}, y^{0}, {\Phi }_{\text {target}}, \alpha , p, L, \tau _{2}, \theta )\), the iterate xk is an approximate solution. This is stated in the following corollary.
Corollary 4.4
Suppose that the hypotheses of Theorem 4.3 hold. Then, one of the following statements is true:
1. There exists \(k \leq K(x^{0}, y^{0}, {\Phi }_{\text {target}}, \alpha , p, L, \tau _{2}, \theta )\) such that
$$ {\Phi}(x^{k+1}, y^{k+1}) \leq {\Phi}_{\text{target}} $$
and (xk+1,yk+1) is an (ηk/η/ε)-AKKT point of (13).
2. There exists \(k \leq K(x^{0}, y^{0}, {\Phi }_{\text {target}}, \alpha , p, L, \tau _{2}, \theta ) \) such that (xk,yk) is an AKKT point of (13).
5 Linearly constrained problems
In this section, we consider the case in which G(x), H(x), and gj(x), \(j=1,\dots ,d\), are affine functions. Thus, their first derivatives are constant. Consequently, we denote:
for all \(x \in \mathbb {R}^{n}\). Therefore, for all \(\bar x \in \mathbb {R}^{n}\),
The constraints H(x) = 0 and G(x) ≤ 0 are now linear but the constraints \(g_{j}(x)-{\pi }^{-1}(y_{j}) \leq 0\) of (13) are not, due to the nonlinearity of π− 1. Therefore, the set \(\mathcal {C}\) is a polytope but the set D0, defined by (24) and (25), is not.
Of course, we may use Algorithm 4.1 for solving (13) in this case, but the presence of linearity suggests that linear constraints could be satisfied exactly at the solution of each subproblem, and that subproblems might be tackled by a linearly constrained optimization method. For achieving this improvement, the nonlinear constraints \(g_{j}(x)-{\pi }^{-1}(y_{j}) \leq 0\) must be linearized. This amounts to replacing, at each subproblem, each constraint \(g_{j}(x)-{\pi }^{-1}(y_{j}) \leq 0\) with:
By the convexity of the function π− 1, the fulfillment of (33) implies that \(g_{j}(x)-{\pi }^{-1}(y_{j}) \leq 0\). Therefore, the feasible set defined by H(x) = 0, G(x) ≤ 0, and (33) for \(j = 1, \dots , d\) is contained in D0.
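For the sample penalty π(t) = √t, so that π^{-1}(y) = y², this containment is easy to verify numerically: the tangent of the convex function π^{-1} never exceeds the function itself, hence any point satisfying the linearized constraint (33) also satisfies the original one.

```python
# Check (under the sample penalty pi(t) = sqrt(t), so pi^{-1}(y) = y**2)
# that the linearization used in (33) underestimates pi^{-1}: satisfying
# g_j(x) <= lin(y, yk) then forces g_j(x) <= pi^{-1}(y) as well.
pi_inv = lambda y: y * y
lin = lambda y, yk: pi_inv(yk) + 2.0 * yk * (y - yk)  # tangent model at yk

tangent_below = all(
    lin(0.01 * y, 0.01 * yk) <= pi_inv(0.01 * y) + 1e-12  # rounding slack
    for y in range(201) for yk in range(201)
)
```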
Algorithm 5.1
This algorithm is identical to Algorithm 4.1 except that condition (26) is replaced by:
Of course, the parameter η is not necessary anymore since we assume that linear constraints can be satisfied exactly using well-established constrained optimization methods.
Assumptions A3, A4, and A5 remain exactly as in Section 4. However, Assumption A6 needs to be replaced in order to take into account that the subproblem now has only linear constraints and that d of those constraints are linear approximations of the true ones.
Assumption A7
At each iteration k of Algorithm 5.1, the function \(\|(x-x^{k}, y-y^{k})\|^{p+1}\) is continuously differentiable with respect to (x,y) and \((x^{k+1}, y^{k+1})\) satisfies the KKT conditions for the linearly constrained problem that consists of minimizing \(M_{x^{k}, y^{k}}(x, y) + \rho \|(x - x^{k}, y - y^{k})\|^{p+1}\) subject to H(x) = 0, G(x) ≤ 0, and \(g_{j}(x) - [{\pi }^{-1}({y_{j}^{k}}) + ({\pi }^{-1})'({y_{j}^{k}}) (y_{j} - {y_{j}^{k}})] \leq 0\), \(j=1, \dots , d\).
Assumption A7 is plausible because every minimizer of linearly constrained optimization problems satisfies KKT conditions.
According to Assumption A7, the increment (xk+ 1,yk+ 1) at each iteration of Algorithm 5.1 must satisfy the following conditions:
Recall that (x,y) satisfies the KKT conditions of (13) if conditions (35)–(44) hold with (x,y) replacing (xk+ 1,yk+ 1), except that (35) and (36) must be replaced by:
and
whereas (39) and (40) must be replaced by:
and
Therefore, the question is whether (35)–(44) implies some relaxed version of (45)–(48) for x = xk+1 and y = yk+1.
As in [11, Lemma 3.2], by Assumption A4, it follows that the values of ρ used at each subproblem of Algorithm 5.1 are bounded. By Assumption A5, (35) implies that:
Since ρ is bounded and \(|\frac {\partial }{\partial y_{j}} \|(x - x^{k}, y - y^{k})\|^{p+1}| \leq O(\|(x^{k+1} - x^{k}, y^{k+1} - y^{k})\|^{p})\), (36) implies that:
By (39) and the convexity of π− 1, we have that (47) holds with x = xk+ 1 and y = yk+ 1. Thus,
For proving approximate complementarity, we need a new assumption regarding the replacement of π− 1 with its linear approximation.
Assumption A8
There exists cπ > 0 such that, for all z,y ≥ 0,
By Assumption A8, we have that \(g_{j}(x^{k+1}) - {\pi }^{-1}(y_{j}^{k+1}) < - c_{\pi } |y_{j}^{k+1} - {y_{j}^{k}}|^{2}\) implies that \(g_{j}(x^{k+1}) - [{\pi }^{-1}({y_{j}^{k}}) + ({\pi }^{-1})^{\prime }({y_{j}^{k}}) (y^{k+1}_{j} - {y_{j}^{k}})] < 0\); so, by (40), βj = 0.
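For the sample penalty π(t) = √t, Assumption A8 can be checked explicitly under our reading of it as a bound on the linearization error of π^{-1}: with π^{-1}(y) = y², the error is exactly (z − y)², so the assumption holds with c_π = 1 and, in fact, with equality.

```python
# For pi(t) = sqrt(t) we have pi^{-1}(y) = y**2, and the linearization
# error pi^{-1}(z) - [pi^{-1}(y) + (pi^{-1})'(y)(z - y)] equals (z - y)**2
# exactly, so (our reading of) Assumption A8 holds with c_pi = 1.
pi_inv = lambda s: s * s
remainder = lambda z, y: pi_inv(z) - (pi_inv(y) + 2.0 * y * (z - y))

c_pi = 1.0
a8_holds = all(
    abs(remainder(0.05 * z, 0.05 * y) - c_pi * (0.05 * z - 0.05 * y) ** 2)
    <= 1e-12
    for z in range(41) for y in range(41)
)
```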
Theorem 5.2
Suppose that Assumptions A3, A4, A5, A7, and A8 hold, \(L = \max \limits \{L_{1}, L_{2}\}\), the sequence {(xk,yk)} is generated by Algorithm 5.1, \({\Phi }_{\text {target}} \in \mathbb {R}\), and ε > 0. Then, the number of iterations k such that:
and (xk+ 1,yk+ 1) is not an (ηk/η/ε)-AKKT point of (13) is not greater than
Moreover, the number of functional evaluations per iteration is bounded above by:
Proof 6
As in [11, Lemma 3.1], we obtain that:
As in [11, Lemma 3.1], we prove that the sequence of penalty parameters is bounded.
By the arguments presented above, we have that, for all k, there exist \(\lambda = \lambda ^{k} \in \mathbb {R}^{n_{\mathrm {H}}}\), \({\mu }_ = {\mu }_{k} \in \mathbb {R}^{n_{\mathrm {G}}}_{+}\), \(\beta = \beta ^{k} \in \mathbb {R}^{d}_{+}\), and \(\gamma = \gamma ^{k} \in \mathbb {R}^{d}_{+}\) such that:
By the sufficient descent condition and (53), there exists c∇ > 0 such that, for all k:
Therefore, given ε > 0, the number of iterations such that \({\Phi }(x^{k+1}, y^{k+1}) > {\Phi }_{\text {target}}\) and
is, at most,
Given ξ > 0, by the descent condition (28), the number of iterations such that ∥(xk+1 − xk,yk+1 − yk)∥p+1 > ξ cannot exceed \(({\Phi }(x^{0}, y^{0}) - {\Phi }_{\text {target}} ) / (\alpha \xi )\). Equivalently, the number of iterations such that ∥(xk+1 − xk,yk+1 − yk)∥2 > ξ2/(p+1) cannot exceed the same quantity; that is, the number of iterations such that ∥(xk+1 − xk,yk+1 − yk)∥2 > ξ cannot exceed \(({\Phi }(x^{0}, y^{0}) - {\Phi }_{\text {target}} ) / (\alpha \xi ^{\frac {p+1}{2}} )\). Applying this bound with ξ/cπ in place of ξ, the number of iterations such that cπ∥(xk+1 − xk,yk+1 − yk)∥2 > ξ cannot exceed \(({\Phi }(x^{0}, y^{0}) - {\Phi }_{\text {target}} ) / (\alpha (\xi /c_{\pi })^{\frac {p+1}{2}})\). Therefore, after at most
iterations, we have that all the iterates satisfy the approximate complementarity condition:
for all \(j = 1, \dots , d\). This completes the proof. □
6 Numerical experiments
In this section, we present numerical experiments with the penalty function \(\pi (\cdot )=\sqrt {\cdot }\). In Section 6.1, we illustrate the way in which the use of this penalty function promotes sparsity. In Section 6.2, we consider all the minimization problems from the Moré-Garbow-Hillstrom collection [27], penalized with the term \(\pi (\cdot )=\sqrt {\cdot }\). The problems are solved in two different ways: on the one hand, the reformulated problems are solved with the nonlinear programming solver Algencan [1, 7]; on the other hand, the problems are solved with Algorithm 4.1. The experiments aim to simulate the situation in which a model-intensive algorithm is used to solve a problem with a costly objective function and a nonconvex regularization term.
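For concreteness, using the notation of Section 6.2 (where f is the objective of each test problem and σ > 0 is the penalty parameter), the penalized problems have the form:

```latex
\min_{x \in \mathbb{R}^n} \; f(x) + \sigma \sum_{i=1}^{n} \pi(|x_i|),
\qquad \pi(\cdot) = \sqrt{\cdot},
```

a regularization term that is nonsmooth at \(x_i = 0\) and nonconvex, which is precisely the difficulty that the smooth reformulation removes.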
6.1 Numerical illustration
In this section, we present numerical experiments that illustrate the way in which the penalty function \(\pi (\cdot )=\sqrt {\cdot }\) promotes sparsity. Three toy problems are reformulated and solved with the general-purpose nonlinear programming solver Algencan [1, 7], which applies to smooth problems.
-
Problem 1. Regularized gradient minimization
The continuous version of this problem is to find a function \(u:[0,1] \to \mathbb {R}\) that solves the problem:
where \(\bar u(t) = 400\) if t ∈ [0,0.5] and \(\bar u(t) = 0\) otherwise. In addition, we want u to coincide with \(\bar u\) “as much as possible.” After discretization, the problem becomes:
where \(\bar u_{i} = 400\) for \(i=1,\dots ,50\) and \(\bar u_{i}=0\) for \(i=51,\dots ,100\), with the additional constraint of having \(u_{i} = \bar u_{i}\), for \(i=1,\dots ,100\), as many times as possible.
Solutions to the discretized version of the problem can be found by solving the reformulated problem given by:
for different values of σ > 0. Note that the reformulation corresponds to penalizing \(\pi (|u_{i} - \bar u_{i}|)\), \(i=1,\dots ,100\), with \(\pi (\cdot )=\sqrt {\cdot }\). Figure 1 shows a graphical representation of solutions to problems with σ ∈{0,1,2,3,4,5,10,20,30,40,50,100}. Note that the larger σ is, the larger the number of constraints of the form \(u_{i} = \bar u_{i}\) that are satisfied.
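To see why the square-root penalty drives many residuals exactly to zero, consider a hypothetical scalar version of Problem 1: minimize \((u-a)^2 + \sigma \sqrt{|u|}\) for a data value a. The brute-force grid search below is purely illustrative (it is not the method used in this paper): small data values are thresholded exactly to zero, while large ones are only slightly shrunk.

```python
import numpy as np

def sqrt_penalized_min(a, sigma):
    """Grid-search minimizer of phi(u) = (u - a)**2 + sigma * sqrt(|u|)."""
    u = np.linspace(-2.0, 2.0, 400001)  # fine grid with step 1e-5
    phi = (u - a) ** 2 + sigma * np.sqrt(np.abs(u))
    return u[np.argmin(phi)]

# A small data value is sent (numerically) to zero: sparsity.
print(sqrt_penalized_min(0.3, 1.0))   # ~0
# A large data value survives, only slightly shrunk.
print(sqrt_penalized_min(1.5, 1.0))   # ~1.28
```

The thresholding effect is exactly the “as many constraints \(u_i = \bar u_i\) as possible” behavior observed in Fig. 1; a convex penalty such as \(|u|\) or \(u^2\) would instead shrink every residual without producing exact zeros for moderate σ.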
-
Problem 2. Two-dimensional dam
This problem is a two-dimensional version of Problem 1. The discretized version of the problem is given by:
where
with the additional constraint of having \(u_{ij} = \bar u_{ij}\), for \(i=1,\dots ,30\) and \(j=1,\dots ,20\), as many times as possible. Solutions to the problem can be found by solving the reformulated problem given by:
for different values of σ > 0. Note that the reformulation corresponds to penalizing \(\pi (|u_{ij} - \bar u_{ij}|)\), \(i=1,\dots ,30\), \(j=1,\dots ,20\), with \(\pi (\cdot )=\sqrt {\cdot }\). Figure 2 shows a graphical representation of solutions to problems with \(\sigma \in \{10^{8}, 10^{7}, \dots ,\) \(1, 10^{-1}, \dots , 10^{-8}\}\). Once again, the larger σ is, the larger the number of constraints of the form \(u_{ij} = \bar u_{ij}\) that are satisfied.
-
Problem 3. Mass transportation
Assume that in a rectangle of nx × ny pixels, to each pixel (i,j) there are associated functions P(i,j) ≥ 0 and Q(i,j) ≥ 0 such that:
P(i,j) represents white mass and Q(i,j) represents red mass. The goal is to find a function u((i,j),(k,ℓ)) ≥ 0 that represents the transportation of white mass from pixel (i,j) to pixel (k,ℓ). There is no transportation of red mass. The transported-mass function u must be such that, at every pixel, the amount of white mass equals the amount of red mass. Therefore, for every pixel (i,j) ∈ V, we must have that:
where \(V = \{ 1, 2, \dots , n_{x} \} \times \{ 1, 2, \dots , n_{y} \}\). With these constraints, the objective is to minimize the number of non-null transports, i.e., to minimize the number of u((i,j),(k,ℓ)), for (i,j)≠(k,ℓ) ∈ V, that are positive.
The reformulated problem is given by:
It corresponds to penalizing \({\sum }_{(i,j) \neq (k,\ell ) \in V} \sqrt { u((i,j),(k,\ell )) }\), which represents the cost of the transportation process, its concavity reflecting economy of scale (decreasing marginal cost of transportation). Figure 3 illustrates solutions to a small instance of Problem 3 with σ = 0 and σ > 0.
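A one-line check of the economy-of-scale effect (illustrative only, assuming a separable per-route cost \(\sqrt{u}\)): since the square root is concave with \(\sqrt{0}=0\), consolidating mass on one route is cheaper than splitting it over several, which is why minimizers use few non-null transports.

```python
import math

def route_cost(masses, sigma=1.0):
    # Total concave transportation cost: sigma times the sum of
    # sqrt(mass) over the routes actually used.
    return sigma * sum(math.sqrt(u) for u in masses)

m = 4.0
# Shipping all the mass on one route beats splitting it over two:
print(route_cost([m]))           # 2.0
print(route_cost([m / 2, m / 2]))  # ~2.83
```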
6.2 Numerical comparison
We implemented the model-intensive Algorithm 4.1 in Fortran. At Step 2, a pair (xtrial,ytrial) satisfying (26) and (27) is computed by tackling problem:
where
with the help of Algencan [1, 7]. By definition, \(\eta _{k} \in [\frac {1}{2}\eta ,\eta )\) for all k. Thus, in order to satisfy (26), it is enough to ask Algencan to stop at a point satisfying the constraints with precision \(\varepsilon _{\text {feas}}\) for any \({\varepsilon }_{\text {feas}} \leq \frac {1}{2}\eta \). Algencan also requires tolerances \(\varepsilon _{\text {compl}}\) and \(\varepsilon _{\text {opt}}\) for complementarity and optimality measures, respectively. (See [8, §5.1] for details.) Since (27) is satisfied by (x,y) = (0,0), which is a feasible point of problem (65), in practice it is expected to be satisfied at the final iterate of Algencan. In the numerical experiments, in Algorithm 4.1, we arbitrarily set p ∈{2,3}, τ1 = τ2 = 100, \(\rho _{\min }=10^{-8}\), \(\alpha = 10^{-8}\), \(\eta = 10^{-6}\), and \(\theta = 10^{6}\); and, in Algencan, \(\varepsilon _{\text {feas}} = \varepsilon _{\text {compl}} = \varepsilon _{\text {opt}} = 10^{-8}\). In order to guarantee the fulfillment of Assumption A6, Algencan should have been modified to accept a tolerance \(\varepsilon _{\text {opt}}\) that depends on its current iterate, i.e., not a constant. Instead of doing that, we kept Algencan as it is and observed in practice that Assumption A6 would have been satisfied with \(\theta = 10^{6}\), which justifies that choice.
In the numerical experiments, we considered the 18 unconstrained minimization problems of the Moré-Garbow-Hillstrom collection [27], which consist of minimizing f(x), penalized with the term \(\sigma {\sum }_{i=1}^{n} \pi (|x_{i}|)\) with \(\pi (\cdot )=\sqrt {\cdot }\) and \(\sigma = 10^{-8}\). Note that, when applying Algorithm 4.1 with p = 3, evaluating \(T_{p}(x^{k},x)\) requires the third-order derivatives of f, which were taken from [9]. These problems were solved in two different ways. On the one hand, the problems were reformulated as in Section 6.1 and solved with Algencan. On the other hand, they were solved with Algorithm 4.1 with p ∈{2,3}. The same stopping criterion was adopted in both cases, namely, the approximate satisfaction of the optimality conditions given by:
where p = 2n, \(g_{j}(x,y) := x_{j} - {y_{j}^{2}}\) and \(g_{n+j}(x,y) := - x_{j} - {y_{j}^{2}}\) for \(j=1,\dots ,n\), PΩ is the projection operator onto Ω, \({\Omega }=\left \{ (x,y) \in \mathbb {R}^{2n} \,|\, y \geq 0 \right \}\), and \(\varepsilon = 10^{-6}\). Emulating the situation to which Algorithm 4.1 is applicable, we considered the number of evaluations of f as the performance metric. Equivalent solutions were found with the two approaches in all the problems. Table 1 and Fig. 4 show the results. They show that Algorithm 4.1 with p = 2 and p = 3 used, on average, 72% and 82% of the number of functional evaluations used by Algencan, respectively. Algorithm 4.1 with p = 2 used more evaluations than Algencan in 2 problems, the same number in 3 problems, and fewer evaluations in the other 13 problems. The apparent inferiority of Algorithm 4.1 with p = 3, with respect to the case p = 2, can be attributed to the lack of parameter tuning, since the arbitrary value of the (dimensional) sufficient decrease parameter \(\alpha = 10^{-8}\) in (28) has a different meaning when p = 2 and when p = 3.
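The constraints in the stopping criterion encode the nonsmooth penalty smoothly: \(g_{j}(x,y) \le 0\) and \(g_{n+j}(x,y) \le 0\) together with y ≥ 0 force \(y_{j}^{2} \ge |x_{j}|\), so the smallest feasible \(y_{j}\) is \(\sqrt{|x_{j}|}\), and minimizing \(\sigma \sum_j y_j\) over feasible y recovers \(\sigma \sum_j \sqrt{|x_j|}\). A minimal numerical check of this equivalence (a sketch for illustration, not part of the authors' Fortran implementation):

```python
import numpy as np

def smooth_penalty_value(x, sigma):
    """Value of sigma * sum(y) at the smallest y >= 0 feasible for
    x_j - y_j**2 <= 0 and -x_j - y_j**2 <= 0."""
    y = np.sqrt(np.abs(x))          # smallest feasible y_j is sqrt(|x_j|)
    assert np.all(x - y**2 <= 1e-12)    # g_j(x, y) <= 0
    assert np.all(-x - y**2 <= 1e-12)   # g_{n+j}(x, y) <= 0
    return sigma * y.sum()

x = np.array([1.0, -4.0, 0.0, 0.25])
# Matches the nonsmooth penalty sigma * sum(sqrt(|x_j|)):
print(smooth_penalty_value(x, 2.0))  # 7.0
```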
7 Final remarks
In this work, we have introduced a smooth reformulation for constrained smooth problems with nonsmooth regularizations. The reformulation is entirely equivalent to the original problem both from the global and the local points of view. Moreover, we were able to prove optimality conditions in which auxiliary variables do not appear at all.
Our main interest now lies in the application of these techniques to real problems in which the evaluation of the objective function is overwhelmingly more expensive than the evaluation of the constraints. This type of function appears when PDE calculations are involved in the objective function evaluation, or when the objective function is related to some phenomenon that takes place in real time. In these cases, model-intensive algorithms such as the ones introduced in this paper may be useful, so it is relevant to show that, from the theoretical point of view (convergence and complexity), these algorithms are well supported. Of course, smooth reformulations are valuable in these cases because one may take advantage of well-established constrained optimization software.
References
Andreani, R., Birgin, E.G., Martínez, J.M., Schuverdt, M.L.: On augmented Lagrangian methods with general lower-level constraints. SIAM J. Optim. 18, 1286–1309 (2008)
Andreani, R., Haeser, G., Martínez, J.M.: On sequential optimality conditions for smooth constrained optimization. Optimization 60, 627–641 (2011)
Bertsekas, D.P.: Nonlinear Programming. Athena Scientific, Belmont (1999)
Bian, W., Chen, X.: From sparse solutions of systems of equations to sparse modeling of signals and images. SIAM Rev. 51, 34–81 (2009)
Bian, W., Chen, X.: Linearly constrained non-Lipschitz optimization for image restoration. SIAM J. Imaging Sci. 8, 2294–2322 (2015)
Bian, W., Chen, X., Ye, Y.: Complexity analysis of interior point algorithms for non-Lipschitz and nonconvex minimization. Math. Program. 149, 301–327 (2015)
Birgin, E.G., Martínez, J.M.: Practical Augmented Lagrangian Methods for Constrained Optimization. Society for Industrial and Applied Mathematics, Philadelphia (2014)
Birgin, E.G., Martínez, J.M.: Complexity and performance of an augmented Lagrangian algorithm. Optimization Methods and Software, to appear. https://doi.org/10.1080/10556788.2020.1746962
Birgin, E.G., Gardenghi, J.L., Martínez, J.M., Santos, S.A.: On the use of third-order models with fourth-order regularization for unconstrained optimization. Optimization Letters, to appear. https://doi.org/10.1007/s11590-019-01395-z
Birgin, E.G., Gardenghi, J.L., Martínez, J.M., Santos, S.A., Toint, P.L.: Worst-case evaluation complexity for unconstrained nonlinear optimization using higher order regularized models. Math. Program. 163, 359–368 (2017)
Birgin, E.G., Martínez, J.M.: On regularization and active-set methods with complexity for constrained optimization. SIAM J. Optim. 28, 1367–1395 (2018)
Browne, S.: The risk and rewards of minimizing shortfall probability. J. Portf. Manag. 25, 76–85 (1999)
Candes, E., Wakin, M., Boyd, S.: Enhancing sparsity by reweighted L1 minimization. J. Fourier Anal. Appl. 14, 877–905 (2008)
Cartis, C., Gould, N.I.M., Toint, Ph.L.: Second-order optimality and beyond: characterization and evaluation complexity in convexly-constrained nonlinear optimization. Found. Comput. Math. 18, 1073–1107 (2018)
Chen, X., Guo, L., Lu, Z., Ye, J.: An augmented Lagrangian method for non-Lipschitz nonconvex programming. SIAM J. Numer. Anal. 55, 168–193 (2017)
Chen, X., Niu, L., Yuan, Y.: Optimality conditions and a smoothing trust region Newton method for non-Lipschitz optimization. SIAM J. Optim. 23, 1528–1552 (2013)
Chen, X., Zhou, W.: Convergence of the reweighted L1 minimization algorithm for L2-Lp minimization. Comput. Optim. Appl. 59, 47–61 (2014)
Cortes, C., Vapnik, V.: Support vector networks. Mach. Learn. 20, 273–297 (1995)
Fan, J., Li, R.: Variable selection via nonconcave penalized likelihood and its oracle properties. J. Am. Stat. Assoc. 96, 1348–1360 (2001)
Fang, J., Peng, H.: Nonconcave penalized likelihood with a diverging number of parameters. Ann. Stat. 32, 928–961 (2004)
Frank, L.E., Friedman, J.H.: A statistical view of some chemometrics regression tools. Technometrics 35, 109–135 (1993)
Haeser, G., Liu, H., Ye, Y.: Optimality condition and complexity analysis for linearly-constrained optimization without differentiability on the boundary. Mathematical Programming, to appear. https://doi.org/10.1007/s10107-018-1290-4
Lai, M., Wang, J.: An unconstrained Lq minimization with 0 < q ≤ 1 for sparse solution of underdetermined linear systems. SIAM J. Optim. 21, 82–101 (2010)
Liu, Y.F., Ma, S., Dai, Y.H., Zhang, S.: A smoothing SQP framework for a class of composite Lq minimization over polyhedron. Math. Program. 158, 467–500 (2016)
Lu, Z.: Iterative reweighted minimization methods for Lp regularized unconstrained nonlinear programming. Math. Program. 147, 277–307 (2014)
Martínez, J.M.: On high-order model regularization for constrained optimization. SIAM J. Optim. 27, 2447–2458 (2017)
Moré, J.J., Garbow, B.S., Hillstrom, K. E.: Testing unconstrained optimization software. ACM Trans. Math. Softw. 7, 17–41 (1981)
Nikolova, M., Ng, M.K., Zhang, S., Ching, W.K.: Efficient reconstruction of piecewise constant images using nonsmooth nonconvex minimization. SIAM J. Imaging Sci. 1, 2–25 (2008)
Nocedal, J., Wright, S.J.: Numerical optimization. Springer-Verlag, New York (2006)
Vogel, C.R.: Computational Methods for Inverse Problems. Society for Industrial and Applied Mathematics, Philadelphia (2002)
Zhang, C.H.: Nearly unbiased variable selection under minimax nonconcave penalty. Ann. Stat. 38, 894–942 (2010)
Funding
This work was supported by FAPESP (grants 2013/07375-0, 2016/01860-1, and 2018/24293-0) and CNPq (grants 438185/2018-8, 302538/2019-4, and 302682/2019-8).
Birgin, E.G., Martínez, J.M. & Ramos, A. On constrained optimization with nonconvex regularization. Numer Algor 86, 1165–1188 (2021). https://doi.org/10.1007/s11075-020-00928-3