1 Introduction

We are concerned with mathematical problems in which the objective is to find a point \(x \in X\) such that:

$$ F (D(x), E(x)) = 0. $$
(1)

In general, X is a subset of a finite-dimensional space. Constrained and unconstrained minimization problems may be formulated in this way, with D(x) and E(x) representing objective functions and constraints. Problems of the form (1) are almost always solved by means of iterative methods. This means that, given an approximation xk to a solution, the approximation xk+ 1 is obtained by solving:

$$ M_{F, x^{k}} (M_{D, x^{k}}(x), M_{E, x^{k}}(x)) = 0, $$
(2)

where \(M_{F, x^{k}}\), \(M_{D, x^{k}}\), and \(M_{E, x^{k}}\) are models of F, D, and E built using knowledge available at xk. For example, if we deal with the constrained optimization problem:

$$ \text{ Minimize } f(x) \text{ subject to } g(x) \leq 0, $$
(3)

we may identify, say, f with D, g with E, and F with the difference between f(x) and the minimum for each feasible point x, in such a way that problem (2) may consist of the minimization of the quadratic Taylor approximation of f (perhaps plus a Lagrangian term) subject to the linearization of the constraints. This is, essentially, the idea of sequential quadratic programming methods [29].
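For illustration only (this particular subproblem is not used in the sequel), with the identification above and a quadratic model of f, an SQP-type instance of (2) at \(x^{k}\) may read:

$$ \text{Minimize } \nabla f(x^{k})^{T} (x - x^{k}) + \frac{1}{2} (x - x^{k})^{T} B_{k} (x - x^{k}) \text{ subject to } g(x^{k}) + g^{\prime}(x^{k}) (x - x^{k}) \leq 0, $$

where \(B_{k}\) stands, for example, for \(\nabla^{2} f(x^{k})\) or for the Hessian of a Lagrangian at \(x^{k}\); solving this subproblem plays the role of solving (2).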

In many cases, the arguments of F involve a function that is difficult to evaluate (D(x)) and a function that is easy to evaluate (E(x)). In these cases, the computer work associated with the solution of (2) may be overwhelmingly dominated by the evaluation of D(xk) (including, possibly, its derivatives). Moreover, even

$$ M_{F, x^{k}} (M_{D, x^{k}}(x), E(x)) = 0 $$
(4)

and

$$ F(M_{D, x^{k}}(x), E(x)) = 0 $$
(5)

may be solvable with a computational cost essentially equal to the cost of computing D(xk). For example, in the constrained optimization problem, the evaluation of the objective function could involve a huge collection of data whereas the constraints may be easy to evaluate. In this case, it is probably convenient to employ an iterative scheme based on (5) by means of which we solve, at each iteration, a powerful model of the objective function subject to the true constraints. This is the point of view of [26], where the complexity of a general scheme for constrained optimization based on high-order models of the objective function is analyzed.

The objective of this paper is to exploit the modeling idea (5) for the minimization of nonconvexly regularized problems. The equivalence of this problem with a constrained optimization problem in which the objective function is generally much more expensive than the constraints will be exploited, and it will be proved that, for some Taylor-like models under mild assumptions, the resulting method enjoys theoretical convergence and complexity properties that suggest a good computational behavior. Some illustrative examples that tend to corroborate this hypothesis will be shown.

Given \(\mathcal {C} \subset \mathbb {R}^{n}\), continuously differentiable functions \(f:\mathbb {R}^{n}\rightarrow \mathbb {R}\) and \(g:\mathbb {R}^{n} \to \mathbb {R}^{d}\), a penalty function \(\pi : \mathbb {R} \rightarrow \mathbb {R}_{+}\) such that π(t) = 0 if t ≤ 0, and a regularization parameter σ > 0, we consider the constrained optimization problem given by:

$$ \begin{array}{@{}rcl@{}} &&\text{Minimize } \quad f(x) + \sigma \sum\limits_{j=1}^{d} \pi(g_{j}(x))\\ &&\text{subject to }\quad x \in \mathcal{C}. \end{array} $$
(6)

Problem (6) may arise in the process of solving problem:

$$ \text{Minimize } f(x) \text{ subject to } g(x) \leq 0, x \in \mathcal{C} $$
(7)

by means of penalization with respect to the constraints g(x) ≤ 0. In this case, we expect that a solution to (7) should emerge when σ is sufficiently large or, at least, when \(\sigma \to \infty \) (see, for example, [29]). Many times, the objective function of (6) is smooth and \({\pi }^{\prime }(0) = 0\). However, we are especially interested in the case in which \({\pi }^{\prime }(0) > 0\) and, above all, in the case in which \({\pi }^{\prime }(0)\) does not exist and \(\lim _{t \to 0+} {\pi }^{\prime }(t) = \infty \).

For several reasons, solutions to (6) for different values of σ may be meaningful independently of its connection with (7). In fact, many times, the solution to (6) when \(\sigma \to \infty \) is not relevant and the fulfillment of all the constraints gj(x) ≤ 0 is not desirable or, perhaps, impossible. Instead, the ideal asymptotic problem that one wishes to solve is, roughly speaking:

$$ \text{Minimize } f(x) \text{ subject to } x \in \mathcal{C} $$
(8)

with the compromise that the number of indices j such that gj(x) > 0 should be small. For example, a typical portfolio problem could require the minimization of the expected loss f(x) subject to standard constraints \(x \in \mathcal {C}\) and the additional requirement that the number of scenarios under which shortfall occurs should be as small as possible [12]. We expect to achieve this objective employing penalty functions that are concave for t ≥ 0 and such that \(\lim _{t \to 0+} {\pi }^{\prime }(t) > 0\).

The case in which \(\mathcal {C}\) is polyhedral and π(t) = tq if t ≥ 0, with q ∈ (0,1), has been considered in [5, 24]. The problem:

$$ \text{Minimize } f(x) + \sigma \sum\limits_{j=1}^{n} |x_{j}|^{q} \text{ subject to } x \in \mathcal{C}, $$
(9)

called the Lq-regularized optimization problem, has been the object of many studies starting from Tikhonov [30] and the enormous literature on inverse problems. The case q = 2 corresponds to the most classical regularization, intended to handle problems in which the mere minimization of f(x) over \(\mathcal {C}\) is very ill-conditioned and, so, its exact solution is, frequently, meaningless. The case q = 1 preserves the possible convexity of f(x) and has been shown to induce sparse regularized minimizers of f(x) (see [13]). With q < 1, the possible convexity of the objective function is lost, but the tendency to find sparse solutions x is even more emphatic. Writing gj(x) = xj and gn+j(x) = −xj for \(j = 1, \dots , n\) and defining π(t) = tq for all t ≥ 0, problem (9) takes the form (6); so (9) is a particular case of the problem studied in this paper. Practical applications of problems of the form (6) include machine learning [18], information theory [20], image restoration [5], signal processing [4], variable selection [19], and others. Interesting examples were given in [24, Appx.1].
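As a quick sanity check of this identification (a minimal sketch; the data and function names below are ours), the following snippet verifies numerically that \({\sum }_{j=1}^{n} |x_{j}|^{q}\) coincides with \({\sum }_{j=1}^{2n} \pi (g_{j}(x))\) for the choices gj(x) = xj, gn+j(x) = −xj, and π(t) = tq for t ≥ 0:

```python
import numpy as np

def pi(t, q=0.5):
    """Penalty pi(t) = t^q for t >= 0 and pi(t) = 0 otherwise."""
    return np.where(t > 0.0, np.maximum(t, 0.0) ** q, 0.0)

def lq_term(x, q=0.5):
    """Regularization term of (9): sum_j |x_j|^q."""
    return np.sum(np.abs(x) ** q)

def penalty_form(x, q=0.5):
    """Same term written as in (6) with g_j(x) = x_j and g_{n+j}(x) = -x_j."""
    g = np.concatenate([x, -x])
    return np.sum(pi(g, q))

x = np.array([1.5, 0.0, -0.3, 2.0])
assert np.isclose(lq_term(x), penalty_form(x))
```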

The rest of this paper is organized as follows. In Section 2, we introduce a smooth reformulation of (6) and we prove its equivalence with the original problem. In Section 3, we derive optimality conditions using the reformulation. In Section 4, we introduce a model-intensive algorithm for solving the problem and we prove its convergence and complexity. In Section 5, we introduce a specific algorithm for the case in which the constraints are linear. In Section 6, we present numerical experiments. Finally, in Section 7, we state conclusions and lines for future research.

Notation

∥⋅∥ denotes an arbitrary norm. \(\mathbb {R}_{+}\) denotes the set of nonnegative elements of \(\mathbb {R}\). The i th component of a vector v is denoted vi or [v]i. If \(v \in \mathbb {R}^{n}\) is a vector with components vi, v+ is the vector with components \(\max \limits \{0, v_{i}\}\), \(i = 1, \dots , n\).

2 Reformulation

A smooth reformulation of problem (6) will be introduced in this section. The penalty functions employed in this work will be required to satisfy the assumptions below.

Assumption A1

π is continuous, concave, and strictly increasing for t ≥ 0. Moreover, \({\pi }^{\prime }(t)\) exists and is continuous for all t > 0.

Assumption A2

The restriction of π to \(\mathbb {R}_{+}\) admits an inverse \({\pi }^{-1}: \mathbb {R}_{+} \to \mathbb {R}_{+}\), the derivative \(({\pi }^{-1})^{\prime }(t)\) exists for all t > 0, and the right derivative \(({\pi }^{-1})^{\prime }(0)_{+}\) also exists.

Assumptions A1 and A2 are satisfied by penalty functions in practical applications such as ridge regression [21], smoothly clipped absolute deviation penalization (SCAD) [19], minimax concave penalties (MCP) [31], fraction penalty optimization [28], and log-penalty minimization [13]. Note that assumptions A1 and A2 do not require \(\lim _{t \to 0+} {\pi }^{\prime }(t) = \infty \), although this property will be interesting for the purpose of minimizing the number of indices j such that gj(x) > 0.

The possible lack of smoothness of (6) is due to the nonsmoothness of π(t) at t = 0. Therefore, standard methods based on gradient information for solving (6) may be inappropriate. In order to avoid this inconvenience, several approaches were proposed. Among them, we can mention smoothing approximation methods [5, 16, 24], iterative reweighted algorithms [17, 23, 25], and interior point methods [6, 22]. Our proposal, based on a smooth reformulation, has the advantage that standard well-established algorithms for constrained optimization based on first-order or higher-order derivative information may be used. This is possible thanks to the fact that the function \({\pi }^{-1}: \mathbb {R}_{+} \to \mathbb {R}_{+}\) is continuously differentiable.
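As a concrete instance (the one used in the experiments of Section 6), take \(\pi (t)=\sqrt {t}\) for t ≥ 0. Assumptions A1 and A2 hold and

$$ {\pi}^{-1}(y) = y^{2}, \qquad ({\pi}^{-1})^{\prime}(y) = 2y, \qquad ({\pi}^{-1})^{\prime}(0)_{+} = 0, $$

so that the constraint \(g_{j}(x)-{\pi }^{-1}(y_{j}) \leq 0\) appearing in the reformulation (13) below becomes the smooth inequality \(g_{j}(x) - {y_{j}^{2}} \leq 0\).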

Problem (6) is obviously equivalent to:

$$ \text{Minimize } f(x) + \sigma \sum\limits_{j=1}^{d} y_{j} \text{ subject to } y_{j} = \pi(g_{j}(x)) \text{ for } j=1,\dots,d \text{ and } x \in \mathcal{C}. $$
(10)

If yj > π(gj(x)), taking \(\bar {y}_{j} = \pi (g_{j}(x))\), we obtain that \( f(x)+ \sigma {\sum }_{j=1}^{d} \bar {y}_{j} \leq f(x)+ \sigma {\sum }_{j=1}^{d} {y}_{j}\). Therefore, (10) is equivalent to:

$$ \text{Minimize } f(x) + \sigma \sum\limits_{j=1}^{d} y_{j} \text{ subject to } y_{j} \geq \pi(g_{j}(x)) \text{ for } j=1,\dots,d \text{ and } x \in \mathcal{C}. $$
(11)

As π(t) = 0 if t < 0, (11) is equivalent to:

$$ \begin{array}{@{}rcl@{}} &&\text{Minimize } f(x) + \sigma \sum\limits_{j=1}^{d} y_{j} \text{ subject to } y_{j} \geq 0, y_{j} \geq \pi(g_{j}(x)) \text{ for } j=1,\dots,d\\ &&\text{ and } x \in \mathcal{C}. \end{array} $$
(12)

If yj ≥ 0 and yj ≥ π(gj(x)), since both yj and π(gj(x)) belong to \(\mathbb {R}_{+}\) and the inverse of π on \(\mathbb {R}_{+}\) is increasing, we obtain π− 1(yj) ≥ gj(x). Reciprocally, if π− 1(yj) ≥ gj(x) and yj ≥ 0, we obtain yj ≥ π(gj(x)) or gj(x) ≤ 0, in which case yj ≥ π(gj(x)) also holds. Therefore, (12) is equivalent to:

$$ \begin{array}{@{}rcl@{}} &&\text{Minimize} \quad {\Phi}(x,y) := f(x) + \sigma \sum\limits_{j=1}^{d} y_{j}\\ &&\text{subject to } \quad x \in \mathcal{C},\\ && \qquad\quad\qquad g_{j}(x)-{\pi}^{-1}(y_{j}) \leq 0 \text{ for } j=1, \dots, d,\\ && \qquad\qquad\quad y \geq 0. \end{array} $$
(13)
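As a small illustration (a sketch only; the data, the choice \(f(x) = \frac {1}{2}\|Ax-b\|^{2}\), \(\mathcal {C} = \mathbb {R}^{n}\), and the use of a general-purpose SLSQP solver are ours and do not reproduce the experiments of Section 6), the following snippet solves an instance of (13) with \(\pi (t)=\sqrt {t}\), gj(x) = xj, and gn+j(x) = −xj, so that the constraints read \(-{y_{j}^{2}} \leq x_{j} \leq {y_{j}^{2}}\), y ≥ 0:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n = 5
A, b = rng.standard_normal((8, n)), rng.standard_normal(8)
sigma = 0.5

def phi(z):
    """Objective of the reformulation (13): f(x) + sigma * sum(y)."""
    x, y = z[:n], z[n:]
    r = A @ x - b
    return 0.5 * r @ r + sigma * np.sum(y)

# Constraints of (13) with pi(t) = sqrt(t):  -y_j^2 <= x_j <= y_j^2,  y >= 0.
cons = [
    {"type": "ineq", "fun": lambda z: z[n:] ** 2 - z[:n]},  # x_j - y_j^2 <= 0
    {"type": "ineq", "fun": lambda z: z[n:] ** 2 + z[:n]},  # -x_j - y_j^2 <= 0
    {"type": "ineq", "fun": lambda z: z[n:]},               # y >= 0
]

z0 = np.concatenate([np.zeros(n), np.ones(n)])
res = minimize(phi, z0, constraints=cons, method="SLSQP")
x_star = res.x[:n]
print(x_star,
      0.5 * np.sum((A @ x_star - b) ** 2) + sigma * np.sum(np.sqrt(np.abs(x_star))))
```

Any smooth constrained optimization solver can be used in the same way; this is precisely the practical advantage of the reformulation.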

The theorem below shows that a solution to the original problem (6) can be found by solving its smooth reformulation (13). (The reciprocal is also true, as shown in the following theorem.)

Theorem 2.1

Suppose that \((x^{*}, y^{*})\) is a local (global) minimizer of (13). Then, \(y^{*}_{j} = \pi (g_{j}(x^{*}))\) for all \(j = 1,\dots , d\) and \(x^{*}\) is a local (global) minimizer of (6). Reciprocally, if \(x^{*}\) is a local (global) minimizer of (6) and \(y^{*}_{j} = \pi (g_{j}(x^{*}))\) for all \(j = 1, \dots , d\), then \((x^{*}, y^{*})\) is a local (global) minimizer of (13).

Proof 1

Suppose that \((x^{*}, y^{*})\) is a local minimizer of (13). If \(y^{*}_{j} > \pi (g_{j}(x^{*}))\) and \(y^{*}_{j} > y_{j} \geq \pi (g_{j}(x^{*}))\), we have \({\pi }^{-1}(y^{*}_{j}) > {\pi }^{-1}(y_{j}) \geq g_{j}(x^{*})\). Therefore, if \(y_{i} = y^{*}_{i}\) for all i ≠ j, we have that \((x^{*}, y)\) is a feasible point of (13) and \({\Phi }(x^{*}, y) < {\Phi }(x^{*}, y^{*})\). Therefore, \((x^{*}, y^{*})\) would not be a local minimizer of (13). This implies that, necessarily, \(y^{*}_{j} = \pi (g_{j}(x^{*}))\) for all \(j = 1,\dots , d\).

Let ε > 0 be such that for all (x,y) such that \(x \in \mathcal {C}\), y ≥ 0, \(g_{j}(x) \leq {\pi }^{-1}(y_{j})\), \(j = 1, \dots , d\), \(\|x - x^{*}\| \leq \varepsilon \) and \(\|y- y^{*}\|_{\infty } \leq \varepsilon \), we have that \(f(x) + \sigma {\sum }_{j=1}^{d} y_{j} \geq f(x^{*}) + \sigma {\sum }_{j=1}^{d} y^{*}_{j}\). By the first part of the proof, \(y^{*}_{j} = \pi (g_{j}(x^{*})) \); therefore, \(f(x) + \sigma {\sum }_{j=1}^{d} y_{j} \geq f(x^{*}) + \sigma {\sum }_{j=1}^{d} \pi (g_{j}(x^{*})) \). Let ε1 ∈ (0,ε] be such that \(\|x - x^{*}\| \leq {\varepsilon }_{1}\) implies that \(|\pi (g_{j}(x)) - \pi (g_{j}(x^{*}))| \leq \varepsilon \) for all \(j=1,\dots ,d\). Then, if \(\|x - x^{*}\| \leq {\varepsilon }_{1}\), \(f(x) + \sigma {\sum }_{j=1}^{d} \pi (g_{j}(x)) \geq f(x^{*}) + \sigma {\sum }_{j=1}^{d} \pi (g_{j}(x^{*}))\). This implies that \(x^{*}\) is a local minimizer of (6).

Reciprocally, suppose that \(x^{*}\) is a local minimizer of (6). Therefore, there exists ε > 0 such that for all \(x \in \mathcal {C}\) such that \(\|x - x^{*}\| \leq \varepsilon \), \(f(x^{*}) + \sigma {\sum }_{j=1}^{d} \pi (g_{j}(x^{*})) \leq f(x) + \sigma {\sum }_{j=1}^{d} \pi (g_{j}(x))\). Therefore, \(f(x^{*}) + \sigma {\sum }_{j=1}^{d} y_{j}^{*} \leq f(x) + \sigma {\sum }_{j=1}^{d} \pi (g_{j}(x))\) for all \(x \in \mathcal {C}\) such that \(\|x - x^{*}\| \leq \varepsilon \). Thus, \(f(x^{*}) + \sigma {\sum }_{j=1}^{d} y_{j}^{*} \leq f(x) + \sigma {\sum }_{j=1}^{d} y_{j}\) for all \(x \in \mathcal {C}\) such that \(\|x - x^{*}\| \leq \varepsilon \) if yj ≥ 0 and yj ≥ π(gj(x)) for all \(j = 1, \dots , d\). But, due to the increasing property of π− 1 on \(\mathbb {R}_{+}\), yj ≥ 0 and yj ≥ π(gj(x)) is equivalent to yj ≥ 0 and π− 1(yj) ≥ gj(x). Therefore, \(f(x^{*}) + \sigma {\sum }_{j=1}^{d} y_{j}^{*}\) is less than or equal to \(f(x) + \sigma {\sum }_{j=1}^{d} y_{j}\) for all (x,y) satisfying \(x \in \mathcal {C}\), \(\|x - x^{*}\| \leq \varepsilon \), and the constraints of (13).

The equivalence proof with respect to global minimizers is similar. □

3 Optimality conditions

Assume that:

$$ \mathcal{C} = \{x \in \mathbb{R}^{n} | H(x) = 0 \text{ and } G(x) \leq 0\}, $$
(14)

where \(H: \mathbb {R}^{n} \to \mathbb {R}^{n_{\mathrm {H}}}\) and \(G: \mathbb {R}^{n} \to \mathbb {R}^{n_{\mathrm {G}}}\) are continuously differentiable. If a local minimizer \((x^{*}, y^{*})\) of (13) satisfies a constraint qualification, then the KKT conditions hold at \((x^{*}, y^{*})\). By Theorem 2.1, if \(x^{*}\) is a local minimizer of (6), \(y^{*}_{j} = \pi (g_{j}(x^{*}))\) for all \(j = 1, \dots , d\), and a constraint qualification for (13) is fulfilled, then the KKT conditions are satisfied too. The KKT conditions for (13) obviously involve both x and y. Since y is merely an auxiliary variable, it is interesting to derive KKT-like conditions in which only the primal variables x are involved. This is the objective of Theorem 3.1 below, which, in turn, generalizes [15, Corollary 2.2] and [24] using straightforward optimization arguments.

Theorem 3.1

Let \(x^{*}\) be a local minimizer of (6) and assume that π satisfies Assumptions A1 and A2. Define:

$$ \begin{array}{@{}rcl@{}} I(x^{*})&:=&\{j | g_{j}(x^{*})<0\}, J(x^{*}):=\{j | g_{j}(x^{*})=0\}, \text{ and }\\ K(x^{*})&:=&\{j | g_{j}(x^{*})>0\}. \end{array} $$
(15)

Then, \(x^{*}\) is a local minimizer of the problem:

$$ \text{Minimize } f(x) + \sigma \sum\limits_{j \in K(x^{*})} \pi(g_{j}(x)) \text{ subject to } x \in \mathcal{C} \text{ and } g_{j}(x) \leq 0 \forall j \in J(x^{*}). $$
(16)

Moreover, if \(\mathcal {C}\) is defined by H(x) = 0 and G(x) ≤ 0, where \(H:\mathbb {R}^{n} \to \mathbb {R}^{n_{\mathrm {H}}}\) and \(G:\mathbb {R}^{n} \to \mathbb {R}^{n_{\mathrm {G}}}\) are continuously differentiable, and the system of equalities and inequalities:

$$ H(x)=0, G(x) \leq 0, g_{j}(x) \leq 0, \forall j \in J(x^{*}) $$
(17)

satisfies a constraint qualification at \(x^{*}\), then we have that \(x^{*}\) satisfies the KKT conditions of problem (16).

Proof 2

Let \(y^{*} \geq 0\) be such that \((x^{*}, y^{*})\) is a local minimizer of (13). By Theorem 2.1, \(y^{*}_{j} = \pi (g_{j}(x^{*}))\) for all \(j = 1, \dots , d\). If \(j \in I(x^{*})\), we have that gj(x) < 0 for all x in a neighborhood of \(x^{*}\). Moreover, \(y^{*}_{j} = 0\) in this case. Therefore, \((x^{*}, y^{*})\) is a local minimizer of:

$$ \text{Minimize } f(x) + \sigma \sum\limits_{j \notin I(x^{*})} y_{j} \text{ subject to } x \!\in\! \mathcal{C}, y_{j} \!\geq\! \pi(g_{j}(x)) \text{ and } y_{j} \!\geq\! 0 \forall j \notin I(x^{*}). $$

So, \((x^{*}, y^{*})\) is a local minimizer of

$$ \text{Minimize } f(x) + \sigma \sum\limits_{j \notin I(x^{*})} y_{j} \text{ subject to } x \in \mathcal{C}, y_{j} = \pi(g_{j}(x)) \text{ and } y_{j} \!\geq\! 0 \forall j \!\notin\! I(x^{*}). $$

If \(j \in K(x^{*})\), we have that gj(x) > 0 in a neighborhood of \(x^{*}\) and π(gj(x)) is continuous and differentiable on that neighborhood. Therefore, \(x^{*}\) is a local minimizer of the locally smooth problem given by:

$$ \text{Minimize } f(x) + \sigma \left[ \sum\limits_{j \in J(x^{*})} y_{j} + \sum\limits_{j \in K(x^{*})} \pi(g_{j}(x)) \right] $$
$$ \text{subject to } x \in \mathcal{C}, y_{j} = \pi(g_{j}(x)) \text{ and } y_{j} \geq 0 \forall j \in J(x^{*}). $$

Since \(g_{j}(x^{*}) \leq 0\) for all \(j \in J(x^{*})\), \(x^{*}\) is also a local minimizer of:

$$ \text{Minimize } f(x) + \sigma \left[ \sum\limits_{j \in J(x^{*})} y_{j} + \sum\limits_{j \in K(x^{*})} \pi(g_{j}(x)) \right] $$
$$ \text{subject to } x \in \mathcal{C}, g_{j}(x) \leq 0, y_{j} = \pi(g_{j}(x)) \text{ and } y_{j} \geq 0 \forall j \in J(x^{*}). $$

Moreover, since \(y_{j}^{*} = 0\) for all \(j \in J(x^{*})\), \(x^{*}\) is a local minimizer of:

$$ \begin{array}{@{}rcl@{}} &&\text{Minimize } f(x) + \sigma \sum\limits_{j \in K(x^{*})} \pi(g_{j}(x)) \text{ subject to } x \in \mathcal{C}, g_{j}(x) \leq 0, \\ && y_{j} = \pi(g_{j}(x)) \text{ and } y_{j} \geq 0 \forall j \in J(x^{*}). \end{array} $$

Clearly, the variables yj for \(j \in J(x^{*})\) play no role in this problem, so \(x^{*}\) is a local minimizer of:

$$ \text{Minimize } f(x) + \sigma \sum\limits_{j \in K(x^{*})} \pi(g_{j}(x)) \text{ subject to } x \in \mathcal{C} \text{ and } g_{j}(x) \leq 0 \forall j \in J(x^{*}). $$

Then, if the constraints that define \(\mathcal {C}\) together with the constraints gj(x) ≤ 0, \(j \in J(x^{*})\), satisfy a constraint qualification at \(x^{*}\), the KKT conditions of (16) hold at \(x^{*}\). This completes the proof. □

Observe that (16) is a constrained optimization problem of the form

$$ \text{Minimize } f_{\text{obj}}(w) \text{ subject to } h_{\mathrm{E}}(w) = 0 \text{ and } h_{\mathrm{I}}(w) \leq 0, $$
(18)

where \(f_{\text {obj}}: \mathbb {R}^{n} \to \mathbb {R}\), \(h_{\mathrm {E}}:\mathbb {R}^{n} \to \mathbb {R}^{n_{\mathrm {E}}}\), and \(h_{\mathrm {I}}: \mathbb {R}^{n} \to \mathbb {R}^{n_{\mathrm {I}}}\) are continuously differentiable. It is well known that, independently of constraint qualifications, every local minimizer \(w^{*}\) of (18) satisfies the Approximate KKT (AKKT) optimality condition [2, 7], which means that, given εsol > 0, εkkt > 0, εfeas > 0, and εcomp > 0, there exist \(w \in \mathbb {R}^{n}\), \(\lambda \in \mathbb {R}^{n_{\mathrm {E}}}\), and \(\mu \in \mathbb {R}^{n_{\mathrm {I}}}_{+}\) such that \(\|w - w^{*}\| \leq {\varepsilon }_{\text {sol}}\),

$$ \|h_{\mathrm{E}}(w)\|_{\infty} \leq {\varepsilon}_{\text{feas}}, \|h_{\mathrm{I}}(w)_{+}\|_{\infty} \leq {\varepsilon}_{\text{feas}}, $$
(19)
$$ {\mu}_{j} \leq {\varepsilon}_{\text{comp}} \text{ for all } j \text{ such that } [h_{\mathrm{I}}(w)]_{j} < - {\varepsilon}_{\text{feas}}, $$
(20)

and

$$ \|\nabla f_{\text{obj}}(w) + \nabla h_{\mathrm{E}}(w) \lambda + \nabla h_{\mathrm{I}}(w) \mu \| \leq {\varepsilon}_{\text{kkt}}. $$
(21)

If \(w \in \mathbb {R}^{n}\) is such that there exist \(\lambda \in \mathbb {R}^{n_{\mathrm {E}}}\) and \(\mu \in \mathbb {R}^{n_{\mathrm {I}}}_{+}\) fulfilling (19), (20), and (21), we say that w is an (εfeas/εcomp/εkkt)-AKKT point of (18).
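The definition above is straightforward to verify computationally. The sketch below (our own helper, with illustrative names) checks (19), (20), and (21) for a candidate triple (w, λ, μ) with μ ≥ 0:

```python
import numpy as np

def is_akkt_point(w, lam, mu, grad_f, h_E, h_I, jac_hE, jac_hI,
                  eps_feas, eps_comp, eps_kkt):
    """Check (19), (20), and (21) for a candidate (w, lam, mu) with mu >= 0.

    h_E, h_I return constraint values; jac_hE, jac_hI return Jacobians
    whose rows are the constraint gradients."""
    hE, hI = h_E(w), h_I(w)
    feas = max(np.linalg.norm(hE, np.inf),
               np.linalg.norm(np.maximum(hI, 0.0), np.inf)) <= eps_feas  # (19)
    comp = np.all(mu[hI < -eps_feas] <= eps_comp)                        # (20)
    grad_lag = grad_f(w) + jac_hE(w).T @ lam + jac_hI(w).T @ mu
    kkt = np.linalg.norm(grad_lag) <= eps_kkt                            # (21)
    return feas and comp and kkt
```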

Applying the AKKT definition to (16), we obtain the following result, independently of constraint qualifications.

Theorem 3.2

Let \(x^{*}\) be a local minimizer of (6) and assume that π satisfies Assumptions A1 and A2. Let \(\mathcal {C}\) be defined by H(x) = 0 and G(x) ≤ 0, where \(H:\mathbb {R}^{n} \to \mathbb {R}^{n_{\mathrm {H}}}\) and \(G: \mathbb {R}^{n} \to \mathbb {R}^{n_{\mathrm {G}}}\) are continuously differentiable. Assume that εsol > 0, εkkt > 0, εfeas > 0, and εcomp > 0 are arbitrary. Then, there exist \(x \in \mathbb {R}^{n}\), \(\lambda _{\mathrm {H}} \in \mathbb {R}^{n_{\mathrm {H}}}\), \({\mu }_{\mathrm {G}} \in \mathbb {R}^{n_{\mathrm {G}}}_{+}\), and \({\mu }_{j} \in \mathbb {R}_{+}\) for all j such that \(g_{j}(x^{*}) = 0\), such that:

$$ \|x - x^{*}\| \leq {\varepsilon}_{\text{sol}}, $$
$$ \|H(x)\| \leq {\varepsilon}_{\text{feas}}, \|G(x)_{+}\| \leq {\varepsilon}_{\text{feas}}, $$
$$ g_{j}(x) \leq {\varepsilon}_{\text{feas}} \text{ for all } j \text{ such that } g_{j}(x^{*})=0, $$
$$ \begin{array}{@{}rcl@{}} &&\left\| \nabla f(x) + \sigma \sum\limits_{g_{j}(x^{*}) > 0} {\pi}^{\prime}(g_{j}(x)) \nabla g_{j}(x) + \nabla H(x) \lambda_{\mathrm{H}} + \nabla G(x) {\mu}_{\mathrm{G}} \right.\\ &&\quad\left.+ \sum\limits_{g_{j}(x^{*}) = 0} {\mu}_{j} \nabla g_{j}(x) \right\| \leq {\varepsilon}_{\text{kkt}}, \end{array} $$
$$ [{\mu}_{G}]_{j} \leq {\varepsilon}_{\text{comp}} \text{ for all } j \text{ such that } G_{j}(x) < - {\varepsilon}_{\text{feas}}, $$

and

$$ {\mu}_{j} \leq {\varepsilon}_{\text{comp}} \text{ for all } j \text{ such that } g_{j}(x) < - {\varepsilon}_{\text{feas}}. $$

Proof 3

By Theorem 3.1, \(x^{*}\) is a local minimizer of (16). Then, the thesis follows from (19), (20), and (21). □

4 Model-intensive algorithms

As we mentioned in Section 2, problem (13) can be solved using standard methods for smooth constrained optimization. In many applications, the cost of evaluating f(x) exceeds, by far, the cost of evaluating the constraints. For these situations, “model intensive” (MI) methods are recommendable. At each iteration of such an algorithm, a model of the objective function is built and this model is minimized using the original constraints (instead of their linear approximations as in SQP and other popular approaches). The idea is to exploit the model as much as possible so that, hopefully, it will not be necessary to evaluate the expensive function many times.

The most popular model for an arbitrary sufficiently smooth function \(f: \mathbb {R}^{n} \to \mathbb {R}\) comes from considering its p th Taylor approximation:

$$ T_{p}(\bar x, x) := \sum\limits_{j=1}^{p} \frac{1}{j!} \left( (x_{1} - \bar x_{1}) \frac{\partial}{\partial x_{1}} + {\dots} + (x_{n} - \bar x_{n}) \frac{\partial}{\partial x_{n}} \right)^{j} f(\bar x), $$
(22)

so that:

$$ f(x) = f(\bar x) + T_{p}(\bar x, x) + o(\|x - \bar x\|^{p}). $$
(23)

Observe that \(T_{1}(\bar x, x) = \nabla f(\bar x)^{T}(x - \bar x)\) and \(T_{2}(\bar x, x)=\nabla f(\bar x)^{T}(x - \bar x)+\frac {1}{2}(x - \bar x)^{T} \nabla ^{2} f(\bar x) (x - \bar x)\). Consequently, the model for \(f(x) + \sigma {\sum }_{j=1}^{d} y_{j}\) will be \(f(\bar x) + T_{p}(\bar x, x) + \sigma {\sum }_{j=1}^{d} y_{j}\).
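For instance, a second-order instance (p = 2) of this model can be assembled from the value, gradient, and Hessian of f at \(\bar x\); the sketch below (illustrative names, toy data) does exactly that:

```python
import numpy as np

def second_order_model(f_bar, grad_bar, hess_bar, x_bar, sigma):
    """Model of f(x) + sigma*sum(y) given by f(x_bar) + T_2(x_bar, x) + sigma*sum(y)."""
    def T2(x):
        s = np.asarray(x) - x_bar
        return grad_bar @ s + 0.5 * s @ hess_bar @ s
    def model(x, y):
        return f_bar + T2(x) + sigma * np.sum(y)
    return model

# Toy check with f(x) = x_1^2 + x_2^4 at x_bar = (1, 1):
x_bar = np.array([1.0, 1.0])
f_bar, grad_bar = 2.0, np.array([2.0, 4.0])
hess_bar = np.diag([2.0, 12.0])
m = second_order_model(f_bar, grad_bar, hess_bar, x_bar, sigma=0.1)
print(m(np.array([1.1, 0.9]), np.zeros(2)))
```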

In this section, \(M_{\bar x, \bar y}(x,y)\) will be a model of \({\Phi }(x, y) - {\Phi }(\bar x, \bar y)\) for all \(\bar x, x \in \mathbb {R}^{n}\) and \(\bar y, y \in \mathbb {R}^{d}\). Therefore, \({\Phi }(x, y) \approx {\Phi }(\bar x, \bar y) + M_{\bar x, \bar y}(x, y)\). Let us define, for all δ ≥ 0:

$$ \mathcal{C}_{\delta} = \{x \in \mathbb{R}^{n} | \| H(x) \|_{\infty} \leq \delta \text{ and } \|G(x)_{+}\|_{\infty} \leq \delta\}, $$
(24)
$$ D_{\delta} = \{ (x, y) \in \mathbb{R}^{n} \times \mathbb{R}^{d} | x \in \mathcal{C}_{\delta}, y \geq 0, \text{ and } g_{j}(x) - {\pi}^{-1}(y_{j}) \leq \delta, j=1, \dots, d \}, $$
(25)

and D = D0. The following algorithm is similar to Algorithm 3.1 of [11].

Algorithm 4.1

Assume that \(p \in \{ 1, 2, 3, \dots \}\), η > 0, α > 0, \(\rho _{\min \limits } > 0\), \(\tau _{2} \geq \tau _{1} > 1\), 𝜃 > 0, and \((x^{0},y^{0}) \in D_{\eta /4}\) are given with \({y^{0}_{j}} = \pi (g_{j}(x^{0}))\), \(j = 1, \dots , d\). Initialize \(k \leftarrow 0\).

Step 1.:

Set \(\rho \leftarrow 0\).

Step 2.:

Compute \(x^{\text {trial}} \in \mathbb {R}^{n}\) and \(y^{\text {trial}} \in \mathbb {R}^{d}\) such that

$$ (x^{\text{trial}}, y^{\text{trial}}) \in D_{\eta_{k}} \text{ with } \eta_{k} = \eta \left( \frac{k+1}{k+2} \right) $$
(26)

and

$$ M_{x^{k}, y^{k}}(x^{\text{trial}}, y^{\text{trial}}) + \rho \|(x^{\text{trial}} - x^{k}, y^{\text{trial}} - y^{k})\|^{p+1} \leq 0. $$
(27)
Step 3.:

Test the condition

$$ {\Phi} (x^{\text{trial}}, y^{\text{trial}}) \leq {\Phi}(x^{k}, y^{k}) - \alpha \| (x^{\text{trial}} - x^{k}, y^{\text{trial}} - y^{k}) \|^{p+1}. $$
(28)

If (28) holds, accept the trial point (xtrial,ytrial), define ρk = ρ, xk+ 1 = xtrial, yk+ 1 = ytrial, set \(k \leftarrow k+1\), and go to Step 1. Otherwise, update \(\rho \leftarrow \max \limits \{ \rho _{\min \limits }, \tau \rho \}\) with τ ∈ [τ1,τ2], and go to Step 2.

Under the model-intensive strategy, conditions (26) and (27) may be obtained by solving the subproblem:

$$ \text{Minimize } M_{x^{k},y^{k}}(x,y) + \rho \| (x-x^{k}, y-y^{k}) \|^{p+1} \text{ subject to } (x, y) \in D, $$
(29)

with rather high precision, since we assume that the cost of evaluating the model is negligible with respect to the cost of evaluating f. Note that the feasibility required at xtrial is looser than the precision with which xk satisfies those constraints. Therefore, it is plausible to require the fulfillment of (26) and (27) by means of a standard constrained optimization algorithm. This means that, although conditions (27) and (28) are trivially satisfied by xtrial = xk and ytrial = yk, it is reasonable to expect that a nontrivial solution with decrease of Φ could be obtained with non-null increments.
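To fix ideas, the outer loop of Algorithm 4.1 can be sketched as follows (a minimal illustration; `solve_subproblem` stands for any routine that returns a point satisfying (26)–(27), for example by approximately solving (29), and the default parameter values are ours):

```python
import numpy as np

def algorithm_41(phi, solve_subproblem, x0, y0, p=2, alpha=1e-8,
                 rho_min=1e-8, tau=10.0, max_iter=50):
    """Sketch of Algorithm 4.1; phi evaluates the (expensive) objective of (13)."""
    xk, yk = np.asarray(x0, float), np.asarray(y0, float)
    phi_k = phi(xk, yk)
    for k in range(max_iter):
        rho = 0.0                                         # Step 1
        while True:
            xt, yt = solve_subproblem(xk, yk, rho, k)     # Step 2: (26)-(27)
            step = np.linalg.norm(np.concatenate([xt - xk, yt - yk]))
            phi_t = phi(xt, yt)
            if phi_t <= phi_k - alpha * step ** (p + 1):  # Step 3: test (28)
                xk, yk, phi_k = xt, yt, phi_t             # accept the trial point
                break
            rho = max(rho_min, tau * rho)                 # otherwise increase rho
    return xk, yk
```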

Complexity results for Algorithm 4.1 are presented below. Before that, the standard assumptions (see, for example, [10, 14]) are given.

Assumption A3

For all (xk,yk) generated by Algorithm 4.1, we have that

$$ \nabla {\Phi}(x^{k}, y^{k}) = \nabla M_{x^{k}, y^{k}}(x, y)|_{(x, y) = (x^{k}, y^{k})}. $$

Assumption A4

There exists L1 > 0 such that, for every (xk,yk) calculated by Algorithm 4.1 and for every (xtrial,ytrial) satisfying (27), we have that:

$$ {\Phi}(x^{\text{trial}}, y^{\text{trial}})\leq {\Phi}(x^{k}, y^{k}) + M_{x^{k}, y^{k}}(x^{\text{trial}}, y^{\text{trial}})+L_{1}\|(x^{\text{trial}}-x^{k}, y^{\text{trial}}-y^{k})\|^{p+1}. $$

Assumption A4 is valid if M is obtained using the p th Taylor approximation and the derivatives of order p of f are Lipschitz continuous. See [3, 10, 14].

Assumption A5

There exists L2 > 0 such that, for every xk, yk, xk+ 1, and yk+ 1 obtained by Algorithm 4.1, we have that:

$$ \|\nabla_{x,y} {\Phi}(x^{k+1}, y^{k+1})-\nabla_{x,y} M_{x^{k}, y^{k}}(x^{k+1}, y^{k+1} )\| \leq L_{2}\|(x^{k+1}-x^{k}, y^{k+1}-y^{k})\|^{p}. $$

As in the case of Assumption A4, Assumption A5 also holds if we use Taylor p th approximations for computing M and the p th order derivatives of f are Lipschitz-continuous.

Assumption A6 below says that (29) must be approximately solved. Before stating this assumption, let us recall that, assuming that \(\|(x - x^{k}, y - y^{k})\|^{p+1}\) is continuously differentiable with respect to (x,y), (29) is a constrained optimization problem of the form (18).

Assumption A6

We say that this assumption holds at iteration k of Algorithm 4.1 if the function \(\|(x - x^{k}, y - y^{k})\|^{p+1}\) is continuously differentiable with respect to (x,y) and (xtrial,ytrial) is an (εfeas/εcomp/εkkt)-AKKT point of (29) with εfeas = ηk, εcomp = η, and \({\varepsilon }_{\text {kkt}} = \theta \|(x^{\text {trial}}-x^{k}, y^{\text {trial}}-y^{k})\|^{p}\).

Theorem 4.2

Assume that the sequence {(xk,yk)} is generated by Algorithm 4.1, \({\Phi }_{\text {target}} \in \mathbb {R}\), ε > 0, and \(L = \max \limits \{L_{1}, L_{2}\}\). Define:

$$ \begin{array}{@{}rcl@{}} &&K(x^{0}, y^{0}, {\Phi}_{\text{target}}, \alpha, p, L, \tau_{2}, \theta)\\ &=& \left\lfloor \left( {\Phi}(x^{0}, y^{0}) - {\Phi}_{\text{target}} \right) \left( \frac{ \alpha^{p/(p+1)} \varepsilon}{L + \tau_{2} \left( L + \alpha \right) (p+1) + \theta} \right)^{-(p+1)/p} \right\rfloor. \end{array} $$
(30)

Suppose that Assumptions A3, A4, A5, and A6 hold for all:

$$ k \leq K(x^{0}, y^{0}, {\Phi}_{\text{target}}, \alpha, p, L, \tau_{2}, \theta). $$

Then, the number of iterations k such that:

$$ {\Phi}(x^{k+1}, y^{k+1}) > {\Phi}_{\text{target}} $$

and (xk+ 1,yk+ 1) is not an (ηk/η/ε)-AKKT point of (13) is bounded above by \(K(x^{0}, y^{0}, {\Phi }_{\text {target}}, \alpha , p, L, \tau _{2}, \theta )\). Moreover, the number of functional evaluations per iteration is bounded above by:

$$ \left\lfloor \log_{\tau_{1}}\left( \frac{\tau_{2}(L+\alpha)}{\rho_{\min}} \right) \right\rfloor + 1. $$

Proof 4

The desired result follows as in [11, Thm.3.1]. □
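To give an idea of the magnitude of the last bound, with the parameter values used in Section 6 (τ1 = τ2 = 100, \(\rho _{\min \limits } = \alpha = 10^{-8}\)) and an assumed Lipschitz constant L = 10 (an illustrative value, not taken from the experiments), one gets:

```python
import math

tau1, tau2, rho_min, alpha, L = 100.0, 100.0, 1e-8, 1e-8, 10.0
bound = math.floor(math.log(tau2 * (L + alpha) / rho_min, tau1)) + 1
print(bound)  # 6 evaluations of Phi per iteration, at most
```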

Theorem 4.3

Suppose that Assumptions A3, A4, and A5 hold, and the function \(\|(x - x^{k}, y - y^{k})\|^{p+1}\) is continuously differentiable with respect to (x,y). Assume, moreover, that for all \(k = 0, 1, 2, \dots \) and εfeas > 0, εcomp > 0, and εkkt > 0, we are able to compute, using a suitable algorithm, an (εfeas/εcomp/εkkt)-AKKT point of (29) such that (27) holds. Then, there are two possibilities:

  1. 1.

    Assumption A6 holds for all \(k \leq K(x^{0}, y^{0}, {\Phi }_{\text {target}}, \alpha , p, L, \tau _{2}, \theta ) \).

  2. 2.

    There exists \(k \leq K(x^{0}, y^{0}, {\Phi }_{\text {target}}, \alpha , p, L, \tau _{2}, \theta ) \) such that (xk,yk) is an AKKT point of (13).

Proof 5

Assume that \((x^{k,\ell }, y^{k,\ell })\) is a sequence generated by an algorithm that, when applied to the problem:

$$ \text{Minimize } M_{x^{k}, y^{k}}(x, y ) + \rho \|(x-x^{k}, y-y^{k})\|^{p+1} \text{ subject to } (x, y) \in D_{\eta_{k-1}}, $$
(31)

satisfies the hypotheses of the theorem. By construction, \((x^{k}, y^{k}) \in D_{\eta _{k-1}}\). Therefore the minimum of \(M_{x^{k}, y^{k}}(x, y ) + \rho \|(x-x^{k}, y-y^{k})\|^{p+1}\) onto \(D_{\eta _{k-1}}\) is non-positive, as well as the minimum of \(M_{x^{k}, y^{k}}(x, y ) + \rho \|(x-x^{k}, y-y^{k})\|^{p+1}\) onto \(D_{\eta _{k}}\).

By the fulfillment of the AKKT optimality conditions, we have two possibilities:

  1. 1.

    There exists an iterate \((x^{k,\ell }, y^{k,\ell })\) which is an (εfeas/εcomp/εkkt)-AKKT point of (29) with εfeas = ηk, εcomp = η, and \({\varepsilon }_{\text {kkt}} = \theta \|(x^{k,\ell } - x^{k}, y^{k,\ell } - y^{k})\|^{p}\).

  2. 2.

    There exist infinitely many iterates \((x^{k,\ell }, y^{k,\ell })\) that are (εfeas/εcomp/εkkt)-AKKT points of (29) with \({\varepsilon }_{\text {feas}} = 1/\ell < \eta _{k}\), \({\varepsilon }_{\text {comp}} = 1/\ell < \eta \), and \({\varepsilon }_{\text {kkt}} = 1/\ell \) but

    $$ \theta \|(x^{k,\ell} - x^{k}, y^{k,\ell} - y^{k})\|^{p} < 1/\ell. $$
    (32)

In the first case, we can choose \(x^{\text {trial}} = x^{k,\ell }\) and \(y^{\text {trial}} = y^{k,\ell }\), satisfying Assumption A6 and (27). In the second case, (32) implies that:

$$ \lim_{\ell \to \infty} (x^{k,\ell}, y^{k,\ell}) = (x^{k}, y^{k}). $$

Therefore, (xk,yk) satisfies the AKKT optimality condition for (31). Since the first derivatives of Φ(x,y) at (xk,yk) coincide with the first derivatives of the objective function of (31), it turns out that (xk,yk) is an AKKT point for (13). In other words, Assumption A6 did not hold because (xk,yk) already was an approximate solution of the original problem. This completes the proof. □

Theorems 4.2 and 4.3 say that after at most the number of iterations given by (30), one finds an iterate that satisfies the constraints of (13) with tolerance η such that the objective function value is smaller than or equal to Φtarget; or, alternatively, we find an iterate that satisfies KKT conditions with tolerance η for feasibility, tolerance η for complementarity, and tolerance ε for optimality. The conclusion of these theorems is that, ultimately, Assumption A6 is not necessary since, if it holds for all \(k \leq K(x^{0}, y^{0}, {\Phi }_{\text {target}}, \alpha , p, L, \tau _{2}, \theta )\), it guarantees that an approximate solution is found for some \(k \leq K(x^{0}, y^{0}, {\Phi }_{\text {target}}, \alpha , p, L, \tau _{2}, \theta )\). But, if Assumption A6 does not hold for some \(k \leq K(x^{0}, y^{0}, {\Phi }_{\text {target}}, \alpha , p, L, \tau _{2}, \theta )\), the iterate xk is an approximate solution. This is stated in the following corollary.

Corollary 4.4

Suppose that the hypotheses of Theorem 4.3 hold. Then, one of the following statements is true:

  1. 1.

    There exists \(k \leq K(x^{0}, y^{0}, {\Phi }_{\text {target}}, \alpha , p, L, \tau _{2}, \theta )\), such that

    $$ {\Phi}(x^{k+1}, y^{k+1}) \leq {\Phi}_{\text{target}} $$

    or (xk+ 1,yk+ 1) is an (ηk/η/ε)-AKKT point of (13).

  2. 2.

    There exists \(k \leq K(x^{0}, y^{0}, {\Phi }_{\text {target}}, \alpha , p, L, \tau _{2}, \theta ) \) such that (xk,yk) is an AKKT point of (13).

5 Linearly constrained problems

In this section, we consider the case in which G(x), H(x), and gj(x), \(j=1,\dots ,d\), are affine functions. Thus, their first derivatives are constant. Consequently, we denote:

$$ A_{\mathrm{G}} = G^{\prime}(x), A_{\mathrm{H}} = H^{\prime}(x), {a_{j}^{T}} = g^{\prime}_{j}(x), j = 1, \dots, d, $$

for all \(x \in \mathbb {R}^{n}\). Therefore, for all \(\bar x \in \mathbb {R}^{n}\),

$$ \begin{array}{@{}rcl@{}} G(x) &=& A_{\mathrm{G}}(x - \bar x) + G(\bar x), H(x) = A_{\mathrm{H}}(x - \bar x) + H(\bar x), \\ g_{j}(x) &=& {a_{j}^{T}} (x - \bar x) + g_{j}(\bar x), j = 1, \dots, d. \end{array} $$

The constraints H(x) = 0 and G(x) ≤ 0 are now linear but the constraints \(g_{j}(x)-{\pi }^{-1}(y_{j}) \leq 0\) of (13) are not, due to the nonlinearity of π− 1. Therefore, the set \(\mathcal {C}\) is a polytope but the set D0, defined by (24) and (25), is not.

Of course, we may use Algorithm 4.1 for solving (13) in this case, but the presence of linearity suggests that the linear constraints could be satisfied exactly at the solution of each subproblem and that the subproblems might be tackled by a linearly constrained optimization method. For achieving this improvement, the nonlinear constraints \(g_{j}(x)-{\pi }^{-1}(y_{j}) \leq 0\) must be linearized. This amounts to replacing, at each subproblem, each constraint \(g_{j}(x)-{\pi }^{-1}(y_{j}) \leq 0\) with:

$$ g_{j}(x) - [{\pi}^{-1}({y_{j}^{k}}) + ({\pi}^{-1})'({y_{j}^{k}}) (y_{j} - {y_{j}^{k}})] \leq 0. $$
(33)

By the convexity of the function π− 1, the fulfillment of (33) implies that \(g_{j}(x)-{\pi }^{-1}(y_{j}) \leq 0\). Therefore, the feasible set defined by H(x) = 0, G(x) ≤ 0, and (33) for \(j = 1, \dots , d\) is contained in D0.
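For the penalty \(\pi (t)=\sqrt {t}\) of Section 6, for which \({\pi }^{-1}(y) = y^{2}\), the linearized constraint (33) reads:

$$ g_{j}(x) - \left[ ({y_{j}^{k}})^{2} + 2 {y_{j}^{k}} (y_{j} - {y_{j}^{k}}) \right] = g_{j}(x) - 2 {y_{j}^{k}} y_{j} + ({y_{j}^{k}})^{2} \leq 0, $$

which is linear in (x,y) whenever gj is affine and implies \(g_{j}(x) - {y_{j}^{2}} \leq 0\), since \({y_{j}^{2}} \geq 2 {y_{j}^{k}} y_{j} - ({y_{j}^{k}})^{2}\).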

Algorithm 5.1

This algorithm is identical to Algorithm 4.1 except that condition (26) is replaced by:

$$ x^{\text{trial}} \in \mathcal{C}, \quad y^{\text{trial}} \geq 0, \quad \text{ and } \quad g_{j}(x^{\text{trial}}) - [{\pi}^{-1}({y_{j}^{k}}) + ({\pi}^{-1})'({y_{j}^{k}}) (y_{j}^{\text{trial}} - {y_{j}^{k}})] \leq 0, j = 1, \dots, d. $$
(34)

Of course, the parameter η is not necessary anymore since we assume that linear constraints can be satisfied exactly using well-established constrained optimization methods.

Assumptions A3, A4, and A5 remain exactly as in Section 4. However, Assumption A6 needs to be replaced in order to take into account that the subproblem now has only linear constraints and that d of these constraints are linear approximations of the true ones.

Assumption A7

At each iteration k of Algorithm 5.1, the function \(\|(x-x^{k}, y-y^{k})\|^{p+1}\) is continuously differentiable with respect to (x,y) and \((x^{k+1}, y^{k+1})\) satisfies the KKT conditions for the linearly constrained problem that consists of minimizing \(M_{x^{k}, y^{k}}(x, y) + \rho \|(x - x^{k}, y - y^{k})\|^{p+1}\) subject to H(x) = 0, G(x) ≤ 0, y ≥ 0, and \(g_{j}(x) - [{\pi }^{-1}({y_{j}^{k}}) + ({\pi }^{-1})'({y_{j}^{k}}) (y_{j} - {y_{j}^{k}})] \leq 0\), \(j=1, \dots , d\).

Assumption A7 is plausible because every minimizer of linearly constrained optimization problems satisfies KKT conditions.

According to Assumption A7, the new iterate (xk+ 1,yk+ 1) computed at each iteration of Algorithm 5.1 must satisfy the following conditions:

$$ \begin{array}{@{}rcl@{}} &&\nabla_{x} \left[ M_{x^{k}, y^{k}} (x^{k+1}, y^{k+1}) + \rho \|(x^{k+1} - x^{k}, y^{k+1} - y^{k})\|^{p+1} \right]\\ &&+ A_{\mathrm{H}}^{T} \lambda + A_{\mathrm{G}}^{T} \mu + \sum\limits_{j=1}^{d} a_{j} \beta_{j} = 0, \end{array} $$
(35)
$$ \rho \frac{\partial}{\partial y_{j}} \|(x^{k+1} - x^{k}, y^{k+1} - y^{k})\|^{p+1} + {\sigma}_{j} - ({\pi}^{-1})'(y^{k+1}_{j}) \beta_{j} - {\gamma}_{j} = 0, j = 1, \dots, d, $$
(36)
$$ H(x^{k+1}) = 0, $$
(37)
$$ G(x^{k+1}) \leq 0, $$
(38)
$$ g_{j}(x^{k+1}) - \left[ {\pi}^{-1}({y_{j}^{k}}) + ({\pi}^{-1})'({y_{j}^{k}}) (y^{k+1}_{j} - {y_{j}^{k}}) \right] \leq 0, j = 1, \dots, d, $$
(39)
$$ \beta_{j} = 0 \text{ whenever } g_{j}(x^{k+1}) - [{\pi}^{-1}({y_{j}^{k}}) + ({\pi}^{-1})'({y_{j}^{k}}) (y^{k+1}_{j} - {y_{j}^{k}})] < 0, $$
(40)
$$ y^{k+1} \geq 0, $$
(41)
$$ {\mu}_{j} = 0 \text{ for all } j \text{ such that } G(x^{k+1})_{j} < 0, $$
(42)
$$ {\gamma}_{j} = 0 \text{ for all } j \text{ such that } y^{k+1}_{j} > 0, $$
(43)
$$ \mu, \beta, \gamma \geq 0. $$
(44)

Recall that (x,y) satisfies the KKT conditions of (13) if conditions (35)–(44) hold with (x,y) replacing (xk+ 1,yk+ 1), except that (35) and (36) must be replaced by:

$$ \nabla f(x)+ A_{\mathrm{H}}^{T} \lambda + A_{\mathrm{G}}^{T} \mu + \sum\limits_{j=1}^{d} a_{j} \beta_{j} = 0 $$
(45)

and

$$ {\sigma}_{j} - ({\pi}^{-1})'(y_{j}) \beta_{j} - {\gamma}_{j} = 0, j = 1, \dots, d, $$
(46)

whereas (39) and (40) must be replaced by:

$$ g_{j}(x) - {\pi}^{-1}(y_{j}) \leq 0 $$
(47)

and

$$ \beta_{j} = 0 \text{ whenever } g_{j}(x) - {\pi}^{-1}(y_{j}) < 0. $$
(48)

Therefore, the question is whether (35)–(44) imply some relaxed version of (45)–(48) for x = xk+ 1 and y = yk+ 1.

As in [11, Lemma 3.2], by Assumption A4, it follows that the values of ρ used at each subproblem of Algorithm 5.1 are bounded. By Assumption A5, (35) implies that:

$$ \| \nabla f(x^{k+1}) + A_{\mathrm{H}}^{T} \lambda + A_{\mathrm{G}}^{T} \mu + \sum\limits_{j=1}^{d} a_{j} \beta_{j} \| \leq O(\|(x^{k+1} - x^{k}, y^{k+1} - y^{k})\|^{p}). $$
(49)

By the boundedness of ρ and the fact that \(|\frac {\partial }{\partial y_{j}} \|(x - x^{k}, y - y^{k})\|^{p+1}|\) evaluated at (xk+ 1,yk+ 1) is \(O(\|(x^{k+1} - x^{k}, y^{k+1} - y^{k})\|^{p})\), (36) implies that:

$$ | {\sigma}_{j} - ({\pi}^{-1})^{\prime}(y^{k+1}_{j}) \beta_{j} - {\gamma}_{j}| \leq O(\|(x^{k+1} - x^{k}, y^{k+1} - y^{k})\|^{p}), j = 1, \dots, d. $$
(50)

By (39) and the convexity of π− 1, we have that (47) holds with x = xk+ 1 and y = yk+ 1. Thus,

$$ g_{j}(x^{k+1}) - {\pi}^{-1}(y_{j}^{k+1}) \leq 0, j = 1, \dots, d. $$
(51)

For proving approximate complementarity, we need a new assumption regarding the replacement of π− 1 with its linear approximation.

Assumption A8

There exists cπ > 0 such that, for all z,y ≥ 0,

$$ | {\pi}^{-1}(y) - \left[ {\pi}^{-1}(z) + ({\pi}^{-1})^{\prime}(z) (y - z) \right] | \leq c_{\pi} |y - z|^{2}. $$
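For instance, with \(\pi (t)=\sqrt {t}\) one has \({\pi }^{-1}(y) = y^{2}\) and Assumption A8 holds with cπ = 1, because:

$$ | y^{2} - \left[ z^{2} + 2 z (y - z) \right] | = (y - z)^{2}. $$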

By Assumption A8, we have that \(g_{j}(x^{k+1}) - {\pi }^{-1}(y_{j}^{k+1}) < - c_{\pi } |y_{j}^{k+1} - {y_{j}^{k}}|^{2}\) implies that \(g_{j}(x^{k+1}) - [{\pi }^{-1}({y_{j}^{k}}) + ({\pi }^{-1})^{\prime }({y_{j}^{k}}) (y^{k+1}_{j} - {y_{j}^{k}})] < 0\); so, by (40), βj = 0.

Theorem 5.2

Suppose that Assumptions A3, A4, A5, A7, and A8 hold, \(L = \max \limits \{L_{1}, L_{2}\}\), the sequence {(xk,yk)} is generated by Algorithm 5.1, \({\Phi }_{\text {target}} \in \mathbb {R}\), and ε > 0. Then, the number of iterations k such that:

$$ {\Phi}(x^{k+1}, y^{k+1}) > {\Phi}_{\text{target}} $$

and (xk+ 1,yk+ 1) is not an (ηk/η/ε)-AKKT point of (13) is not greater than

$$ \left\lfloor \left( {\Phi}(x^{0}, y^{0}) - {\Phi}_{\text{target}} \right) \left( \frac{ \alpha^{p/(p+1)} \varepsilon}{L + \tau_{2} \left( L + \alpha \right) (p+1) + \theta} \right)^{-(p+1)/p} \right\rfloor. $$
(52)

Moreover, the number of functional evaluations per iteration is bounded above by:

$$ \left\lfloor \log_{\tau_{1}}\left( \frac{\tau_{2}(L+\alpha)}{\rho_{\min}} \right) \right\rfloor + 1. $$

Proof 6

As in [11, Lemma 3.1], we obtain that:

$$ \lim_{k \to \infty} \|x^{k+1}-x^{k}\| = \lim_{k \to \infty} \|y^{k+1}-y^{k}\|=0. $$

As in [11, Lemma 3.1], we prove that the sequence of regularization parameters ρ is bounded.

By the arguments presented above, we have that, for all k, there exist \(\lambda = \lambda ^{k} \in \mathbb {R}^{n_{\mathrm {H}}}\), \(\mu = \mu ^{k} \in \mathbb {R}^{n_{\mathrm {G}}}_{+}\), \(\beta = \beta ^{k} \in \mathbb {R}^{d}_{+}\), and \(\gamma = \gamma ^{k} \in \mathbb {R}^{d}_{+}\) such that:

$$ \| \nabla f(x^{k+1}) + A_{\mathrm{H}}^{T} \lambda + A_{\mathrm{G}}^{T} \mu + \sum\limits_{j=1}^{d} a_{j} \beta_{j}\| \leq O(\|(x^{k+1}, y^{k+1})-(x^{k}, y^{k})\|^{p}), $$
(53)
$$ {\sigma}_{j} - ({\pi}^{-1})'(y^{k+1}_{j}) \beta_{j} - {\gamma}_{j} = 0, j = 1, \dots, d, $$
(54)
$$ H(x^{k+1}) = 0, $$
(55)
$$ G(x^{k+1}) \leq 0, $$
(56)
$$ g_{j}(x^{k+1}) - {\pi}^{-1}(y^{k+1}_{j}) \leq 0, j = 1, \dots, d, $$
(57)
$$ \beta_{j} = 0 \text{ whenever } g_{j}(x^{k+1}) - {\pi}^{-1}(y^{k+1}_{j}) < - c_{\pi} |y_{j}^{k+1} - {y_{j}^{k}}|^{2}, $$
(58)
$$ y^{k+1} \geq 0, $$
(59)
$$ {\mu}_{j} = 0 \text{ for all } j \text{ such that } G(x^{k+1})_{j} < 0, $$
(60)
$$ {\gamma}_{j} = 0 \text{ for all } j \text{ such that } y^{k+1}_{j} > 0, $$
(61)
$$ \mu, \beta, \gamma \geq 0. $$
(62)

By the sufficient descent condition and (53), there exists \(c_{\nabla } > 0\) such that, for all k:

$$ {\Phi}(x^{k+1}, y^{k+1}) \leq {\Phi}(x^{k}, y^{k}) - c_{\nabla} \left\| \nabla f(x^{k+1}) + A_{\mathrm{H}}^{T} \lambda + A_{\mathrm{G}}^{T} \mu + \sum\limits_{j=1}^{d} a_{j} \beta_{j} \right\|^{\frac{p+1}{p}}. $$
(63)

Therefore, given ε > 0, the number of iterations such that \({\Phi }(x^{k+1}, y^{k+1}) > {\Phi }_{\text {target}}\) and

$$ \left\| \nabla f(x^{k+1}) + A_{\mathrm{H}}^{T} \lambda + A_{\mathrm{G}}^{T} \mu + \sum\limits_{j=1}^{d} a_{j} \beta_{j} \right\| > \varepsilon $$
(64)

is, at most,

$$ ({\Phi}(x^{0}, y^{0}) - {\Phi}_{\text{target}}) \times O(\varepsilon^{-\frac{p+1}{p}}). $$

Given ξ > 0, by the descent condition (28), each iteration such that \(\|(x^{k+1} - x^{k}, y^{k+1} - y^{k})\|^{p+1} > \xi \) decreases Φ by more than αξ; hence, the number of such iterations cannot exceed \(({\Phi }(x^{0}, y^{0}) - {\Phi }_{\text {target}} ) / (\alpha \xi )\). Since \(c_{\pi } \|(x^{k+1} - x^{k}, y^{k+1} - y^{k})\|^{2} > \xi \) is equivalent to \(\|(x^{k+1} - x^{k}, y^{k+1} - y^{k})\|^{p+1} > (\xi /c_{\pi })^{(p+1)/2}\), the number of iterations such that \(c_{\pi } \|(x^{k+1} - x^{k}, y^{k+1} - y^{k})\|^{2} > \xi \) cannot exceed \(({\Phi }(x^{0}, y^{0}) - {\Phi }_{\text {target}} ) / (\alpha (\xi /c_{\pi })^{\frac {p+1}{2}})\). Therefore, after at most

$$ \frac{ ({\Phi}(x^{0}, y^{0}) - {\Phi}_{\text{target}})}{ \alpha (\xi/c_{\pi})^{\frac{p+1}{2}}} $$

iterations, we have that all the iterates satisfy the approximate complementarity condition:

$$ \beta_{j} = 0 \text{ whenever } g_{j}(x^{k+1}) - {\pi}^{-1}(y^{k+1}_{j}) < - \xi $$

for all \(j = 1, \dots , d\). This completes the proof. □

6 Numerical experiments

In this section, we present numerical experiments with the penalty function \(\pi (\cdot )=\sqrt {\cdot }\). In Section 6.1, we aim to illustrate in which way the use of this penalty function promotes sparsity. In Section 6.2, we consider all the minimization problems from the Moré-Garbow-Hillstrom collection [27], penalized with the term \(\pi (\cdot )=\sqrt {\cdot }\). Problems are solved in two different ways. On the one hand, reformulated problems are solved with the nonlinear programming solver Algencan [1, 7]. On the other hand, problems are solved with Algorithm 4.1. Experiments aim to simulate the situation in which a model-intensive algorithm is used to solve a problem with a costly objective function and a nonconvex regularization term.

6.1 Numerical illustration

In this section, we present numerical experiments that illustrate in which way the use of the penalty function \(\pi (\cdot )=\sqrt {\cdot }\) promotes sparsity. Three toy problems are reformulated and solved with the general-use nonlinear programming solver Algencan [1, 7] that applies to smooth problems.

  • Problem 1. Regularized gradient minimization

The continuous version of this problem is to find a function \(u:[0,1] \to \mathbb {R}\) that solves the problem:

$$ \text{Minimize } \| \nabla u \|^{2} \text{ subject to } {{\int}_{0}^{1}} u(t) dt = {{\int}_{0}^{1}} \bar u(t) dt, $$

where \(\bar u(t) = 400\) if t ∈ [0,0.5] and \(\bar u(t) = 0\) otherwise. In addition, we aim for u to coincide with \(\bar u\) “as much as possible.” After discretization, the problem becomes:

$$ \text{Minimize } \frac{1}{2} {\sum}_{i=2}^{99} (u_{i-1} - u_{i+1})^{2} \text{ subject to } {\sum}_{i=1}^{100} u_{i} = {\sum}_{i=1}^{100} \bar u_{i}, $$

where \(\bar u_{i} = 400\) for \(i=1,\dots ,50\) and \(\bar u_{i}=0\) for \(i=51,\dots ,100\), with the additional constraint of having \(u_{i} = \bar u_{i}\), for \(i=1,\dots ,100\), as many times as possible.

Solutions to the discretized version of the problem can be found by solving the reformulated problem given by:

$$ \begin{array}{@{}rcl@{}} &&{\text{Minimize } \frac{1}{2} \sum\nolimits_{i=2}^{99} (u_{i-1} - u_{i+1})^{2} + \sigma \sum\nolimits_{i=1}^{100} y_{i}}\\ &&\text{subject to } \qquad\qquad\quad {\sum\nolimits_{i=1}^{100} u_{i}} = \sum\nolimits_{i=1}^{100} \bar u_{i}\\ && \qquad\qquad-{y_{i}^{2}} \leq u_{i} - \bar u_{i} \leq {y_{i}^{2}}, \quad y_{i} \geq 0, \quad i=1,\dots,100 \end{array} $$

for different values of σ > 0. Note that the reformulation corresponds to penalizing \(\pi (|u_{i} - \bar u_{i}|)\), \(i=1,\dots ,100\), with \(\pi (\cdot )=\sqrt {\cdot }\). Figure 1 shows a graphical representation of solutions to problems with σ ∈{0,1,2,3,4,5,10,20,30,40,50,100}. Note that the larger σ is, the larger the number of constraints of the form \(u_{i} = \bar u_{i}\) that are satisfied.
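A scaled-down version of this experiment can be reproduced with any general nonlinear programming code. The sketch below (20 grid points instead of 100; the value of σ and the use of SLSQP are our illustrative choices, not the setup actually used with Algencan) sets up the reformulated problem directly:

```python
import numpy as np
from scipy.optimize import minimize

m, sigma = 20, 5.0
u_bar = np.where(np.arange(m) < m // 2, 400.0, 0.0)

def objective(z):
    u, y = z[:m], z[m:]
    return 0.5 * np.sum((u[:-2] - u[2:]) ** 2) + sigma * np.sum(y)

cons = [
    {"type": "eq", "fun": lambda z: np.sum(z[:m]) - np.sum(u_bar)},   # mass constraint
    {"type": "ineq", "fun": lambda z: z[m:] ** 2 - (z[:m] - u_bar)},  # u - u_bar <= y^2
    {"type": "ineq", "fun": lambda z: z[m:] ** 2 + (z[:m] - u_bar)},  # u_bar - u <= y^2
]
bounds = [(None, None)] * m + [(0.0, None)] * m                       # y >= 0
z0 = np.concatenate([u_bar, np.ones(m)])
res = minimize(objective, z0, constraints=cons, bounds=bounds,
               method="SLSQP", options={"maxiter": 500})
u = res.x[:m]
print(int(np.sum(np.isclose(u, u_bar, atol=1e-4))))  # how many u_i coincide with u_bar_i
```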

Fig. 1
figure 1

Graphical representation of solutions to problem 1 for different values of σ

Fig. 2
figure 2

Graphical representation of solutions to Problem 2 for different values of σ

  • Problem 2. Two-dimensional dam

This problem is a two-dimensional version of Problem 1. The discretized version of the problem is given by:

$$ \begin{array}{@{}rcl@{}} && \text{Minimize } \frac{1}{2} \sum\limits_{i=2}^{29} \sum\limits_{j=2}^{19} (4 u_{ij} - u_{i,j-1} - u_{i,j+1} - u_{i-1,j} - u_{i+1,j} )^{2}\\ &&\text{ subject to } \sum\limits_{i=1}^{30} \sum\limits_{j=1}^{20} u_{ij} = \sum\limits_{i=1}^{30} \sum\limits_{j=1}^{20} \bar u_{ij}, \end{array} $$

where

$$ \bar u_{ij} = \left\{ \begin{array}{rl} 400, & \text{if } j \leq 30 \left[ \frac{1}{2} + \frac{1}{10} \sin\left( 2 \pi \left( \frac{i - 1}{30 - 1} \right) \right) \right]\\ 0, & \text{otherwise}, \end{array} \right. $$

with the additional constraint of having \(u_{ij} = \bar u_{ij}\), for \(i=1,\dots ,30\) and \(j=1,\dots ,20\), as many times as possible. Solutions to the problem can be found by solving the reformulated problem given by:

$$ \begin{array}{@{}rcl@{}} &&{\text{Minimize } \frac{1}{2} \sum\nolimits_{i=2}^{29} \sum\nolimits_{j=2}^{19} (4 u_{ij} - u_{i,j-1} - u_{i,j+1} - u_{i-1,j} - u_{i+1,j})^{2} + \sigma \sum\nolimits_{i=1}^{30} \sum\nolimits_{j=1}^{20} y_{ij}}\\ &&\text{subject to } \qquad \sum\nolimits_{i=1}^{30} \sum\nolimits_{j=1}^{20} u_{ij} = \sum\nolimits_{i=1}^{30} \sum\nolimits_{j=1}^{20} \bar u_{ij} \\ &&\qquad\qquad -y_{ij}^{2} \leq u_{ij} - \bar u_{ij} \leq y_{ij}^{2}, \quad y_{ij} \geq 0, \quad i=1,\dots,30, j=1,\dots,20, \end{array} $$

for different values of σ > 0. Note that the reformulation corresponds to penalizing \(\pi (|u_{ij} - \bar u_{ij}|)\), \(i=1,\dots ,30\), \(j=1,\dots ,20\), with \(\pi (\cdot )=\sqrt {\cdot }\). Figure 2 shows a graphical representation of solutions to problems with \(\sigma \in \{10^{8}, 10^{7}, \dots ,\) \(1, 10^{-1}, \dots , 10^{-8}\}\). Once again, the larger σ is, the larger the number of constraints of the form \(u_{ij} = \bar u_{ij}\) that are satisfied.

  • Problem 3. Mass transportation

Assume that in a rectangle of nx × ny pixels, to each pixel (i,j) there are associated functions P(i,j) ≥ 0 and Q(i,j) ≥ 0 such that:

$$ \sum\limits_{i=1}^{n_{x}} \sum\limits_{j=1}^{n_{y}} P(i,j) = \sum\limits_{i=1}^{n_{x}} \sum\limits_{j=1}^{n_{y}} Q(i,j). $$

P(i,j) represents white mass and Q(i,j) represents red mass. The goal is to find a function u((i,j),(k,ℓ)) ≥ 0 that represents the transportation of white mass from pixel (i,j) to pixel (k,ℓ). There is no transportation of red mass. The function of transported mass u must be such that, at every pixel, the amount of white mass is equal to the amount of red mass. Therefore, for every pixel (i,j) ∈ V, we must have that:

$$ Q(i,j) = P(i,j) + \sum\limits_{(k,\ell) \in V, (k,\ell) \neq (i,j)} u((k,\ell),(i,j)) - u((i,j),(k,\ell)), $$

where \(V = \{ 1, 2, \dots , n_{x} \} \times \{ 1, 2, \dots , n_{y} \}\). With these constraints, the objective is to minimize the number of non-null transports, i.e., to minimize the number of u((i,j),(k,ℓ)), for (i,j)≠(k,ℓ) ∈ V, that are positive.

The reformulated problem is given by:

$$ \begin{array}{@{}rcl@{}} &&\text{Minimize} \qquad \sigma \sum\nolimits_{(i,j) \neq (k,\ell) \in V} y_{ijk\ell}\\ &&\text{subject to} \qquad Q(i,j) = P(i,j) + \sum\nolimits_{(k,\ell) \in V, (k,\ell) \neq (i,j)} u((k,\ell),(i,j))\\ && \qquad\qquad\qquad- u((i,j),(k,\ell)), \forall (i,j) \in V\\ && \qquad\qquad\qquad 0 \leq u((i,j),(k,\ell)) \leq y_{ijk\ell}^{2}, \quad y_{ijk\ell} \geq 0, \quad \forall (i,j) \neq (k,\ell) \in V.\\ \end{array} $$

It corresponds to penalizing \({\sum }_{(i,j) \neq (k,\ell ) \in V} \sqrt { u((i,j),(k,\ell )) }\), which represents the cost of the transportation process, its concavity with respect to each transported amount representing economy of scale (decreasing marginal cost of transportation). Figure 3 illustrates solutions to a small instance of Problem 3 with σ = 0 and σ > 0.

Fig. 3
figure 3

Graphical representation of solutions to a small instance of Problem 3

6.2 Numerical comparison

We implemented the model-intensive Algorithm 4.1 in Fortran. At Step 2, a pair (xtrial,ytrial) satisfying (26) and (27) is computed by tackling problem:

$$ \begin{array}{@{}rcl@{}} &&\text{Minimize} \quad M_{x^{k},y^{k}}(x,y) + \rho \| (x-x^{k},y-y^{k})\|^{p+1}\\ &&\text{subject to} \quad -{y_{i}^{2}} \leq x_{i} \leq {y_{i}^{2}} \text{ for } i=1,\dots,n \text{ and } y \geq 0, \end{array} $$
(65)

where

$$ M_{x^{k},y^{k}}(x,y) = T_{p}(x^{k},x) + \sigma \sum\limits_{i=1}^{n} [y-y^{k}]_{i}, $$

with the help of Algencan [1, 7]. By definition, \(\eta _{k} \in [\frac {1}{2}\eta ,\eta )\) for all k. Thus, in order to satisfy (26), it is enough to ask Algencan to stop at a point satisfying the constraints with precision εfeas for any \({\varepsilon }_{\text {feas}} \leq \frac {1}{2}\eta \). Algencan also requires tolerances εcompl and εopt for the complementarity and optimality measures, respectively. (See [8, §5.1] for details.) Since (27) is satisfied by (x,y) = (0,0), which is a feasible point of problem (65), in practice, it is expected to be always satisfied at the final iterate of Algencan. In the numerical experiments, in Algorithm 4.1, we arbitrarily set p ∈{2,3}, τ1 = τ2 = 100, \(\rho _{\min \limits }=10^{-8}\), α = 10− 8, η = 10− 6, and 𝜃 = 106; and, in Algencan, εfeas = εcompl = εopt = 10− 8. In order to guarantee the fulfillment of Assumption A6, Algencan should have been modified to accept a tolerance εopt that depends on its current iterate, i.e., not a constant one. Instead of doing that, we kept Algencan as it is and observed in practice that Assumption A6 would have been satisfied setting 𝜃 = 106, which justifies that choice.

In the numerical experiments, we considered the 18 unconstrained minimization problems of the Moré-Garbow-Hillstrom collection [27] that consist in minimizing f(x), penalized with the term \(\sigma {\sum }_{i=1}^{n} \pi (|x_{i}|)\) with \(\pi (\cdot )=\sqrt {\cdot }\) and σ = 10− 8. Note that, when applying Algorithm 4.1 with p = 3, evaluating Tp(xk,x) requires the third-order derivatives of f that were taken from [9]. These problems were solved in two different ways. On the one hand, problems were reformulated as in Section 6.1 and solved with Algencan. On the other hand, they were solved with Algorithm 4.1 with p ∈{2,3}. The same stopping criterion was adopted in both cases, namely, the approximate satisfaction of the optimality conditions given by:

$$ \begin{array}{@{}rcl@{}} \| g(x,y)_{+} \|_{\infty} & \leq & \varepsilon \\ \left\| P_{\Omega}\left( (x,y) - \left[ \nabla {\Phi}(x,y) + \sum\limits_{j=1}^{2n} {\mu}_{j} \nabla g_{j}(x,y) \right] \right) - (x,y) \right\|_{\infty} & \leq & \varepsilon \\ \max_{j=1,\dots,2n}\{ \min\{ -g_{j}(x,y), {\mu}_{j} \}\} & \leq & \varepsilon \end{array} $$

where \(g_{j}(x,y) := x_{j} - {y_{j}^{2}}\) and \(g_{n+j}(x,y) := - x_{j} - {y_{j}^{2}}\) for \(j=1,\dots ,n\), PΩ is the projection operator onto Ω, and \({\Omega }=\left \{ (x,y) \in \mathbb {R}^{2n} | y \geq 0 \right \}\), with ε = 10− 6. Emulating the situation to which Algorithm 4.1 is applicable, we considered the number of evaluations of f as the performance metric. Equivalent solutions were found with the two approaches in all the problems. Table 1 and Fig. 4 show the results. They show that Algorithm 4.1 with p = 2 and p = 3 used, on average, 72% and 82% of the number of functional evaluations used by Algencan, respectively. Algorithm 4.1 with p = 2 used a larger number of evaluations than Algencan in 2 problems, the same number in 3 problems, and fewer evaluations in the other 13 problems. The apparent inferiority of Algorithm 4.1 with p = 3, with respect to the case with p = 2, can be attributed to the (lack of) parameter tuning, since the arbitrary value of the (dimensional) sufficient decrease parameter α = 10− 8 in (28) has a different meaning when p = 2 and p = 3.
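For completeness, the three measures of the stopping criterion above can be computed as in the following sketch (our own helper; it assumes the reformulation with \(g_{j}(x,y) = x_{j} - {y_{j}^{2}}\) and \(g_{n+j}(x,y) = -x_{j} - {y_{j}^{2}}\), and the projection onto Ω obtained by clipping y at zero):

```python
import numpy as np

def stopping_measures(x, y, mu, grad_f, sigma):
    """Feasibility, projected-gradient and complementarity measures for the
    reformulated problem with constraints x_j - y_j^2 <= 0 and -x_j - y_j^2 <= 0."""
    n = x.size
    g = np.concatenate([x - y ** 2, -x - y ** 2])
    # gradient of the Lagrangian with respect to (x, y)
    grad_x = grad_f(x) + mu[:n] - mu[n:]
    grad_y = sigma * np.ones(n) - 2.0 * y * (mu[:n] + mu[n:])
    z = np.concatenate([x, y])
    step = z - np.concatenate([grad_x, grad_y])
    proj = np.concatenate([step[:n], np.maximum(step[n:], 0.0)])  # projection onto Omega
    feas = np.linalg.norm(np.maximum(g, 0.0), np.inf)
    opt = np.linalg.norm(proj - z, np.inf)
    comp = np.max(np.minimum(-g, mu))
    return feas, opt, comp
```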

Table 1 Comparison of the performance of Algencan and Algorithm 4.1 with p ∈{2,3} when applied to solving the Moré-Garbow-Hillstrom minimization problems with the penalty term given by \(\pi (\cdot )=\sqrt {\cdot }\)
Fig. 4
figure 4

Graphical representation of the performance of Algencan and Algorithm 4.1 with p ∈{2,3} when applied to solving the Moré-Garbow-Hillstrom minimization problems with the penalty term given by \(\pi (\cdot )=\sqrt {\cdot }\)

7 Final remarks

In this work, we have introduced a smooth reformulation for constrained smooth problems with nonsmooth regularizations. The reformulation is entirely equivalent to the original problem both from the global and the local points of view. Moreover, we were able to prove optimality conditions in which auxiliary variables do not appear at all.

Our main interest now lies in the application of these techniques to real problems in which the evaluation of the objective function is overwhelmingly more expensive than the evaluation of the constraints. This type of function appears when PDE calculations are involved in the objective function evaluation and when the objective function is related to some phenomenon that takes place in real time. In these cases, model-intensive algorithms such as the ones introduced in this paper may be useful. So, it is interesting to show that, from the theoretical point of view (convergence and complexity), these algorithms are well supported. Of course, smooth reformulations are valuable in these cases because one may take advantage of well-established constrained optimization software.