1 Introduction

Practical methods for solving nonlinearly constrained optimization problems are iterative. Given an iterate, one tries to find a better approximation to the solution taking into account feasibility and optimality criteria. Several traditional nonlinear programming methods preserve feasibility of all the iterates, so that the improvement criterion relies only on objective function values [19]. These methods are very effective in many cases, but they tend to be slow in the presence of constraints with high curvature. Modern inexact-restoration (I-R) methods, on the other hand, admit infeasibility of iterates and employ inexact nonspecific restoration procedures, which may be adapted to different classes of problems. See [10–23].

The common features of the I-R methods mentioned above are the following:

  • Given the current iterate, a sufficiently more feasible point (inexactly restored point) is computed by any arbitrary method. This is the Feasibility Phase of the I-R iteration.

  • At the Optimality Phase of the I-R iteration, one minimizes (approximately) a suitable objective function (the Lagrangian or some approximation) for obtaining a trial point. The trial point is compared with the current iterate in terms of feasibility and optimality (using merit functions as in [10, 11, 19], or filters, as in [12, 18]). If the trial point is acceptable, we define it as the new iterate. Otherwise, we find a new trial point “closer” to the inexactly restored point. The fact that the new trial point should be “closer” to the inexactly restored point, instead of closer to the current iterate, is crucial in the philosophy of I-R methods, because ultimately the inexactly restored point is considered to be better than the current iterate.

The first I-R methods [10, 11] motivated the introduction of a sequential optimality condition called AGP (approximate gradient projection) [24]. For several years, it was thought that AGP was equivalent to the approximate KKT condition (AKKT) implicitly used to define stopping criteria in nonlinear programming algorithms. Surprisingly, it was recently shown [25] that AGP is stronger than AKKT; moreover, one of the AGP variations, called linear AGP (L-AGP), is stronger than AGP. These theoretical facts support one of the decisions taken in our I-R method: When linear constraints are present in the problem, feasibility with respect to these constraints should be satisfied at all the iterations.

Fischer and Friedlander [19] introduced a new line search by means of which an I-R method exhibits interesting global convergence properties. The resulting approach turns out to be simpler than more classical approaches based on trust regions or filters. The main convergence result of [19] states that, under proper assumptions, the search direction used at the optimality phase tends to zero. This result complements the suggestion of Gomes-Ruggiero, Martínez and Santos [17] who claim that by means of the spectral gradient choice [26–30] of the first optimality-phase trial point, I-R turns out to be the natural generalization to nonconvex and constrained optimization of the spectral projected gradient (SPG) method [31–33], which is currently used for large-scale convex and constrained minimization.

In [17], spectral gradient tangent directions and the trust-region globalization approach of [10] were invoked for proving convergence. With the aim of getting closer to quadratic convergence [13], we use here a search direction that comes from solving a strictly convex Newton-related quadratic minimization problem. The resulting method is the natural generalization of the variable metric method [34] for nonlinear programming in the same sense that the method [17] is a generalization of SPG.

In this work, we introduce an improved line search I-R algorithm based on the Fischer–Friedlander approach. The proposed method is more flexible than the approaches in [10, 11, 19], when dealing with sufficient decrease. Several typical I-R algorithmic requirements are relaxed and, therefore, larger steps are more likely to be accepted by the algorithm. We also employ approximations of Lagrange multipliers in the merit function in order to add relevant information when solving the problem. An optimization procedure for general constrained optimization problems is introduced.

This paper is organized as follows. In Sect. 2, we introduce the new I-R method and we prove that it is well defined. In Sect. 3, we state the assumptions that will allow us to prove convergence starting from arbitrary initial points, and we prove feasibility and optimality of the limit points. The plausibility of the assumptions used to prove convergence is discussed in Sect. 4. Conclusions are given in Sect. 5.

2 Basic Method

In this section, we consider the constrained optimization problem in the form:

$$\begin{aligned} \mathrm{min }~f(x) \;\;\hbox { s.t. }\;\; h(x) = 0, \; x \in \Omega , \end{aligned}$$
(1)

where \(\Omega \subset \mathbb {R}^n\) is a polytope, \(f: \mathbb {R}^n \rightarrow \mathbb {R}\), \(h: \mathbb {R}^n \rightarrow \mathbb {R}^m\), and both \(f\) and \(h\) are smooth. Every smooth, finite-dimensional constrained optimization problem may be written in the form (1), using slack variables if necessary.
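For instance, a smooth inequality constraint \(g(x) \le 0\) can be brought to the form (1) by means of a slack variable \(z\):

$$\begin{aligned} g(x) \le 0 \quad \Longleftrightarrow \quad g(x) + z = 0, \quad z \ge 0, \end{aligned}$$

where the bound \(z \ge 0\) is incorporated into the set \(\Omega \).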

For all \(x \in \Omega \) and \( \lambda \in \mathbb {R}^m\), we define the Lagrangian \(L(x, \lambda )\) by

$$\begin{aligned} L(x, \lambda ) := f(x) + \sum _{i=1}^m \lambda _i h_i(x). \end{aligned}$$

Given a penalty parameter \( \theta \in [0, 1]\), we consider, for all \(x \in \Omega , \lambda \in \mathbb {R}^m\), the following merit function that consists of a convex combination of the Lagrangian function and the infeasibility measure:

$$\begin{aligned} \Phi (x, \lambda , \theta ) := \theta L(x, \lambda ) + (1 - \theta ) \Vert h(x)\Vert . \end{aligned}$$
(2)

Note that, when \(\theta =1\), the merit function coincides with the Lagrangian function and, when \(\theta =0\), it coincides with the infeasibility measure. In these cases, we are neglecting either the objective function or the feasibility at the point \(x\). In our method \(\theta \) is always smaller than one, and we will prove that it is also bounded away from zero. In this paper, the symbol \(\Vert \cdot \Vert \) denotes the Euclidean norm, although many times it can be replaced by an arbitrary norm.

The merit function (2) is essentially the sharp Lagrangian defined in [35] and employed in [3638], in the context of nonsmooth optimization, and in [11] in the I-R framework.
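For illustration only, the following minimal sketch shows how the Lagrangian and the merit function (2) can be evaluated in code; the callables f and h, and all identifiers, are ours and not part of the algorithm specification.

```python
import numpy as np

def lagrangian(f, h, x, lam):
    """L(x, lam) = f(x) + lam^T h(x)."""
    return f(x) + np.dot(lam, h(x))

def merit(f, h, x, lam, theta):
    """Sharp-Lagrangian merit function (2):
    Phi(x, lam, theta) = theta * L(x, lam) + (1 - theta) * ||h(x)||."""
    return theta * lagrangian(f, h, x, lam) + (1.0 - theta) * np.linalg.norm(h(x))
```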

Algorithm 2.1 below is a flexible form of the I-R method with the line search procedure of [19]. One improvement here with respect to [19] is that we are able to use Lagrange multipliers in the algorithm (thus employing (2)) preserving convergence. Moreover, sufficient descent requirements are relaxed so that acceptance of “Pure Newtonian” steps will be more frequent.

Algorithm 2.1

Flexible inexact-restoration with sharp Lagrangian.

Step 0 Initialization

As initial approximation we choose, arbitrarily, \(x^0 \in \Omega \) and \(\lambda ^{0} \in \mathbb {R}^m\). We initialize \(\theta _{-1} \in \; ]0, 1[\) and \(k \leftarrow 0\).

Step 1 Restoration step

If \(\Vert h(x^k)\Vert =0\), define \(y^{k+1} := x^k\). Otherwise, try to compute \(y^{k+1} \in \Omega \) such that

$$\begin{aligned} \Vert h(y^{k+1})\Vert < \Vert h(x^k)\Vert . \end{aligned}$$
(3)

If this is not possible, Step 1 is deemed unsuccessful and the algorithm stops.

Step 2 Estimation of Lagrange Multipliers

Compute \(\lambda ^{k+1} \in \mathbb {R}^m\). (Practical effective ways for this choice will be given later.)

Step 3 Descent direction

Compute \(d^k \in \mathbb {R}^n\) such that \(y^{k+1} + d^k \in \Omega \),

$$\begin{aligned} \nabla L\big (y^{k+1}, \lambda ^{k+1}\big )^T d^k < 0, \;\; \text{ if } \;\; d^k \ne 0, \end{aligned}$$
(4)

and, for some \(c_k > 0\) and all small enough \(t > 0\):

$$\begin{aligned} \Vert h\big (y^{k+1} + t d^k\big )\Vert \le \Vert h\big (y^{k+1}\big )\Vert + c_k t^2 \Vert d^k\Vert ^2. \end{aligned}$$
(5)

Step 4 Globalization

Step 4.1 Sufficient decrease parameters

If \(\Vert h(x^k)\Vert = \Vert h\big (y^{k+1}\big )\Vert = 0\), take \(r_k \in \; ]0,1[\). Else, define \(r_k \ne 0\) such that

$$\begin{aligned} r_k \in \bigg [ \frac{\Vert h\big (y^{k+1}\big )\Vert }{\Vert h(x^k)\Vert }, 1 \bigg [ . \end{aligned}$$
(6)

Set

$$\begin{aligned} s_k \in [0,r_k[. \end{aligned}$$
(7)

Step 4.2 Penalty parameter computation

Compute \(\theta _k\) as the supremum of the values of \(\theta \in [0, \theta _{k-1}]\) such that

$$\begin{aligned} \Phi \big (y^{k+1}, \lambda ^{k+1}, \theta \big ) \le \Phi \big (x^k, \lambda ^{k}, \theta \big ) + \frac{1-s_k}{2}\big (\Vert h\big (y^{k+1}\big )\Vert - \Vert h(x^k)\Vert \big ). \end{aligned}$$
(8)

Step 4.3 Line search

If \(\Vert d^k\Vert =0\) we define \(t_k := 1\).

Otherwise, by means of some backtracking procedure, compute \(t_k \in \; ]0, 1]\) such that

$$\begin{aligned} L\big (y^{k+1} + t_k d^k, \lambda ^{k+1}\big ) < L\big (y^{k+1}, \lambda ^{k+1}\big ) \end{aligned}$$
(9)

and

$$\begin{aligned} \Phi \big (y^{k+1} + t_k d^k, \lambda ^{k+1}, \theta _k\big ) \le \Phi \big (x^k, \lambda ^{k}, \theta _k\big ) + \frac{1-r_k}{2} \big (\Vert h\big (y^{k+1}\big )\Vert - \Vert h(x^k)\Vert \big ). \end{aligned}$$
(10)

Moreover, the backtracking procedure should be such that either \(t_k = 1\) or there exists \(\bar{t}_k \in [t_k, 10 t_k]\) such that

$$\begin{aligned} L\big (y^{k+1} + \bar{t}_k d^k, \lambda ^{k+1}\big ) \ge L\big (y^{k+1}, \lambda ^{k+1}\big ) \end{aligned}$$
(11)

or

$$\begin{aligned} \Phi \big (y^{k+1} + \bar{t}_k d^k, \lambda ^{k+1}, \theta _k\big ) > \Phi \big (x^k, \lambda ^{k}, \theta _k\big ) + \frac{1-r_k}{2} \big (\Vert h\big (y^{k+1}\big )\Vert - \Vert h(x^k)\Vert \big ). \end{aligned}$$
(12)

Step 4.4 Iteration update

Set

$$\begin{aligned} x^{k+1} = y^{k+1} + t_k d^k, \end{aligned}$$
(13)

update \(k \leftarrow k+1\) and go to Step 1.

We have defined Algorithm 2.1 without an explicit stopping criterion, which, of course, is necessary in practical implementations. This means that, in principle, the algorithm generates an infinite sequence whose theoretical properties are proved in Sects. 2 and 3; in an implementation, suitable criteria for stopping the execution of the algorithm must be added.
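To fix ideas, the following Python sketch outlines one possible organization of the iteration loop of Algorithm 2.1. The subroutines restore, estimate_multipliers, descent_direction, update_theta, and line_search are placeholders for the choices discussed in Sects. 3 and 4, the parameter values are illustrative, and the stopping test is of the practical kind mentioned above; none of this is prescribed by the algorithm itself.

```python
import numpy as np

def inexact_restoration(h, x0, lam0, restore, estimate_multipliers,
                        descent_direction, update_theta, line_search,
                        theta0=0.9, tol=1e-8, max_iter=200):
    """Driver loop in the spirit of Algorithm 2.1 (flexible I-R, sharp Lagrangian).
    The objective f enters only through the user-supplied subroutines; all
    parameter values and the stopping test are illustrative, not prescribed."""
    x, lam, theta = np.asarray(x0, float), np.asarray(lam0, float), float(theta0)
    for k in range(max_iter):
        # Step 1: restoration, condition (3); 'restore' returns None on failure.
        y = x if np.linalg.norm(h(x)) == 0.0 else restore(x)
        if y is None:
            raise RuntimeError("restoration failure at iteration %d" % k)
        # Step 2: Lagrange multiplier estimate.
        lam_new = estimate_multipliers(y, lam)
        # Step 3: descent direction for the Lagrangian, conditions (4)-(5).
        d = descent_direction(y, lam_new)
        # Step 4.1: sufficient-decrease parameters (6)-(7), illustrative choice.
        nhx, nhy = np.linalg.norm(h(x)), np.linalg.norm(h(y))
        r_k = 0.99 if nhx == 0.0 else max(nhy / nhx, 0.99)
        s_k = 0.0
        # Step 4.2: penalty parameter satisfying (8).
        theta = update_theta(theta, x, lam, y, lam_new, s_k)
        # Step 4.3: backtracking line search for (9)-(10).
        t = 1.0 if np.linalg.norm(d) == 0.0 else line_search(x, lam, y, lam_new, d, theta, r_k)
        # Step 4.4: iteration update.
        x_new = y + t * d
        # Practical stopping test (Algorithm 2.1 itself has no stopping criterion).
        if np.linalg.norm(h(x_new)) <= tol and np.linalg.norm(x_new - x) <= tol:
            return x_new, lam_new
        x, lam = x_new, lam_new
    return x, lam
```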

Algorithm 2.1 is a simple inexact-restoration algorithm, where even the condition of \(d^k\) being in the tangent subspace is relaxed. However, it is interesting to show that this simplified method is well defined, in the sense that an iteration can be completed whenever the inexactly restored point \(y^{k+1}\) can be computed. In the following theorem, we make no use of Lipschitz conditions, neither on the objective function nor on the constraints.

Theorem 2.1

For all \(x^k \in \Omega \), if the point \(y^{k+1}\) at Step 1 of Algorithm 2.1 is successfully computed, then the iterate \(x^{k+1}\) is well defined.

Proof

By the hypothesis, we have that Step 1 is well defined. Steps 2 and 3 can be completed observing that \(\lambda ^{k+1}=0 \in \mathbb {R}^m\) and \(d^k=0 \in \mathbb {R}^n\) are always acceptable. Although these options will not be our choice, this ensures that both steps are well defined. We will show now that the other computations required in the iteration are possible, whenever \(\Vert h\big (y^{k+1}\big )\Vert < \Vert h(x^k)\Vert \) or \(y^{k+1} = x^k\).

Computation of \(\theta _k\): If \(y^{k+1} = x^k\), we have that \(\Vert h\big (y^{k+1}\big )\Vert = \Vert h(x^k)\Vert = 0\) and, thus, \(L\big (x^k,\lambda \big )=L\big (y^{k+1},\mu \big )\) for all \(\lambda ,\mu \in \mathbb {R}^m\). This means that (8) is obvious for all \(\theta \) and so \(\theta _{k}=\theta _{k-1}\).

If \(y^{k+1} \ne x^k\), by (3), we have that \(\Vert h(y^{k+1})\Vert < \Vert h(x^k)\Vert \) and, since \(0\le s_k <1\),

$$\begin{aligned} \Vert h\big (y^{k+1}\big )\Vert - \Vert h(x^k)\Vert < \frac{1-s_k}{2} \big ( \Vert h(y^{k+1})\Vert - \Vert h(x^k)\Vert \big ). \end{aligned}$$

Therefore, for small enough \(\theta \), we have that

$$\begin{aligned} \Vert h(y^{k+1})\Vert - \Vert h(x^k)\Vert&\le \frac{\theta }{1-\theta } \left[ L(x^k, \lambda ^{k}) - L(y^{k+1}, \lambda ^{k+1})\right] \\&+ \frac{1-s_k}{2 (1-\theta )} \big ( \Vert h(y^{k+1})\Vert - \Vert h(x^k)\Vert \big ). \end{aligned}$$

This implies that, for small enough \(\theta > 0\), (8) takes place. By direct calculations, we note that the parameter \(\theta _k\) can be computed in the following way: If (8) is verified with \(\theta = \theta _{k-1}\), then take \(\theta _k = \theta _{k-1}\). Otherwise, we compute

$$\begin{aligned} \theta _k = \frac{(1+s_k) \big ( \Vert h(x^k)\Vert - \Vert h(y^{k+1})\Vert \big )}{2 \big [ L(y^{k+1}, \lambda ^{k+1}) - L(x^k, \lambda ^{k}) + \Vert h(x^k)\Vert - \Vert h(y^{k+1})\Vert \big ]}. \end{aligned}$$
(14)

Computation of \(t_k\): We only need to consider the case in which \(\Vert d^k\Vert \ne 0\).

By (4), we have that (9) necessarily holds for small enough \(t\). Let us prove now that (10) also holds for small \(t\).

Assume first that \(y^{k+1} = x^k\). Then \(\Vert h(y^{k+1})\Vert = \Vert h(x^k)\Vert = 0\) and \(L(x^k,\lambda )=f(x^k)\) for all \(\lambda \in \mathbb {R}^m\). Thus

$$\begin{aligned} \Phi \big (x^k,\lambda ^{k},\theta _k\big )+\frac{1-r_k}{2} \big (\Vert h\big (y^{k+1}\big )\Vert - \Vert h\big (x^k\big )\Vert \big ) =\theta _k L\big (x^k,\lambda ^k\big )=\theta _k f\big (y^{k+1}\big ), \end{aligned}$$
(15)

and, by (5),

$$\begin{aligned} \Vert h\big (y^{k+1} + t d^k\big )\Vert \le c_k t^2 \Vert d^k\Vert ^2. \end{aligned}$$
(16)

Now, by (4), for small enough \(t\), we have that

$$\begin{aligned} L\big (y^{k+1} + t d^k, \lambda ^{k+1}\big )&\le L\big (y^{k+1}, \lambda ^{k+1}\big ) + \frac{t}{2} \nabla L\big (y^{k+1}, \lambda ^{k+1}\big )^T d^k\nonumber \\&= f\big (y^{k+1}\big )+\frac{t}{2} \nabla L\big (y^{k+1}, \lambda ^{k+1}\big )^T d^k \end{aligned}$$
(17)

and

$$\begin{aligned} \frac{\theta _k}{2} \nabla L\big (x^k, \lambda ^{k+1}\big )^T d^k+\big (1-\theta _k\big )c_k t \Vert d^k\Vert ^2 <0. \end{aligned}$$
(18)

By (15), (16), (17), and (18), for small enough \(t\), we have that

$$\begin{aligned} \Phi \big (y^{k+1} + t d^k, \lambda ^{k+1}, \theta _k\big )&= \theta _k L\big (y^{k+1} + t d^k, \lambda ^{k+1}\big ) + \left( 1 - \theta _k\right) \Vert h\big (y^{k+1} + t d^k\big )\Vert \\&\le \theta _k\Big (f\big (y^{k+1}\big )+\frac{t}{2} \nabla L\big (y^{k+1}, \lambda ^{k+1}\big )^T d^k\Big )+(1-\theta _k)c_k t^2 \Vert d^k\Vert ^2\\&\le \Phi \big (x^k,\lambda ^{k},\theta _k\big )+\frac{1-r_k}{2} \big (\Vert h\big (y^{k+1}\big )\Vert - \Vert h\big (x^k\big )\Vert \big ). \end{aligned}$$

So we conclude that (10) holds for small enough \(t\) when \(y^{k+1} = x^k\).

Now, let us show that \(t_k\) is well defined in the case that \(\Vert h(y^{k+1})\Vert < \Vert h(x^k)\Vert \). In this case, by (7), we have that

$$\begin{aligned} \frac{1-s_k}{2}\big (\Vert h(y^{k+1})\Vert - \Vert h(x^k)\Vert \big ) < \frac{1-r_k}{2} \big (\Vert h(y^{k+1})\Vert -\Vert h(x^k)\Vert \big ). \end{aligned}$$
(19)

By (8) and (19),

$$\begin{aligned} \Phi \big (y^{k+1}, \lambda ^{k+1}, \theta _k\big ) < \Phi \big (x^k, \lambda ^{k}, \theta _k\big ) + \frac{1-r_k}{2} \big (\Vert h(y^{k+1})\Vert -\Vert h(x^k)\Vert \big ). \end{aligned}$$

By continuity,

$$\begin{aligned} \Phi \big (y^{k+1} + t d^k, \lambda ^{k+1}, \theta _k\big ) \le \Phi \big (x^k, \lambda ^{k}, \theta _k\big ) + \frac{1-r_k}{2} \big (\Vert h(y^{k+1})\Vert -\Vert h(x^k)\Vert \big ) \end{aligned}$$

for small enough \(t\). This ensures that \(t_k\) can be computed in a finite time by means of a backtracking procedure. This completes the proof. \(\square \)

Observe that (10) is easier to fulfill if \(r_k\) is chosen close to \(1\) in (6). This means that the chances of accepting large steps \(t_k\) increase if we choose \(r_k \approx 1\). Moreover, the gap in (19) is larger as \(s_k\) is closer to zero, which also favors the acceptance of large steps. On the other hand, \(\theta _k\), the weight of the Lagrangian in the merit function, is larger as \(s_k\) is closer to \(r_k\). Since the direction \(d^k\) is a descent direction for the Lagrangian, it is natural that large values of \(\theta _k\) increase the chances of accepting large steps. We believe that the best choice of \(s_k\) strongly depends on the nonlinearity of the constraints in (1). For simplicity, in our implementation we take the largest gap in (19) by setting \(s_k=0\) for all \(k\). This flexible way of handling sufficient decrease allows the algorithm to take larger steps than the ones in [19], where \(r_k\) and \(s_k\) are equal and fixed for every \(k\).

Theorem 2.1 showed that, for well-definedness, the following conditions are essential: (a) simple descent of \(\Vert h(y^{k+1})\Vert \) with respect to \(\Vert h(x^k)\Vert \); (b) the direction \(d^k\) should be quasi-tangent (5); and (c) \(d^k\) must be a descent direction for the Lagrangian \(L\).

3 Assumptions and Global Convergence

The global convergence proof for Algorithm 2.1 requires a problem assumption (3.1) and several algorithmic assumptions (3.2–3.6).

Assumption 3.1

The problem should be such that Lipschitz conditions hold on the gradients of both \(f\) and \(h\). Namely, there exists \( \eta > 0\) such that, for all \(x, y \in \Omega \),

$$\begin{aligned} \Vert \nabla f(y) - \nabla f(x)\Vert \le \eta \Vert y - x\Vert \; \text { and }\; \Vert \nabla h(y) - \nabla h(x)\Vert \le \eta \Vert y - x\Vert . \end{aligned}$$
(20)

The algorithmic assumptions involve conditions that should be satisfied by the different steps of the algorithm in order to guarantee global convergence. For the introduction of these assumptions, we will denote the Euclidean projection of \(x\) on a non-empty, closed, and convex set \(\Omega \) by \(P_\Omega (x)\) and \(\{x^k\}, \{y^{k}\}, \{\lambda ^{k}\}\) will be the sequences generated by Algorithm 2.1. Note that these sequences have infinitely many terms, except in the case when the restoration step fails and (3) cannot be obtained for some \(k\).

Assumption 3.2

For all \(k \in \mathbb {N}\), Step 1 of the Algorithm is successful and there exist \(r \in [0, 1[\) and \(\beta > 0\) such that

$$\begin{aligned} r_k \le r \end{aligned}$$
(21)

and

$$\begin{aligned} L\big (y^{k+1}, \lambda ^{k+1}\big ) - L\big (x^k, \lambda ^{k}\big ) \le \beta \Vert h(x^k)\Vert . \end{aligned}$$
(22)

Conditions (21) and (22) involve the restoration step. Condition (21) states that sufficient uniform feasibility improvement should be obtained by the restoration procedure. In most I-R implementations, \(r\) is an algorithmic parameter and the algorithm stops, declaring “restoration failure”, if the restoration procedure fails to satisfy (21) after reasonable computer time. Here we adopt the practical point of view that even conservative algorithmic parameters \(r\) (say \(r = 0.99\)) could be excessively strict at some iterations of the algorithm, which, however, could converge smoothly under simple descent requirements such as (3). On the other hand, the requirement (21) for some unknown value of \(r\) is usually satisfied under regularity assumptions on the constraints. Since \(y^{k+1}=x^k\) if \(\Vert h(x^k)\Vert =0\), and \(r_k\) is defined as in (6) if \(y^{k+1}\ne x^k\), condition (21) implies that

$$\begin{aligned} \Vert h(y^{k+1})\Vert -\Vert h(x^k)\Vert \le r_k\Vert h(x^k)\Vert -\Vert h(x^k)\Vert \le -(1-r)\Vert h(x^k)\Vert . \end{aligned}$$
(23)

Condition (22) requires that the deterioration of the Lagrangian at \((y^{k+1}, \lambda ^{k+1})\) should be smaller than a multiple of the infeasibility \(\Vert h(x^k)\Vert \). If the Lagrange multiplier estimates are bounded, and under suitable Lipschitz assumptions, condition (22) is implied by \(\Vert y^{k+1} - x^k\Vert = O(\Vert h(x^k)\Vert )\). This means that the distance between \(x^k\) and the restored point \(y^{k+1}\) should be proportional to the infeasibility measure at \(x^k\). Providing suitable safeguarding parameters \(\beta \) is even harder than in the case of (21), because (22) is scale-dependent. In practical terms, Assumption 3.2 says that we believe that the restoration algorithm employed is reasonable enough so that sufficient improvement and bounded-distance requirements are automatically satisfied.
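For instance, if \(\Vert y^{k+1}-x^k\Vert \le C \Vert h(x^k)\Vert \), the multiplier estimates satisfy \(\Vert \lambda ^{k}\Vert , \Vert \lambda ^{k+1}\Vert \le \Lambda \), and \(L_f\) is a Lipschitz constant for \(f\) on \(\Omega \) (the constants \(C\), \(\Lambda \), and \(L_f\) are introduced here only for this illustration), then, by (3) and the Cauchy–Schwarz inequality,

$$\begin{aligned} L\big (y^{k+1},\lambda ^{k+1}\big ) - L\big (x^k,\lambda ^{k}\big )&= f\big (y^{k+1}\big )-f\big (x^k\big )+\big (\lambda ^{k+1}\big )^T h\big (y^{k+1}\big )-\big (\lambda ^{k}\big )^T h\big (x^k\big )\\&\le L_f C \Vert h(x^k)\Vert + \Lambda \Vert h\big (y^{k+1}\big )\Vert + \Lambda \Vert h(x^k)\Vert \le \big (L_f C + 2 \Lambda \big ) \Vert h(x^k)\Vert , \end{aligned}$$

so that (22) holds with \(\beta = L_f C + 2 \Lambda \).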

Assumption 3.3

For all \(k \in \mathbb {N}\), we choose \(d^k\) such that \(y^{k+1} + d^k \in \Omega \) and \(\nabla h(y^{k+1})^T d^k = 0\). Moreover, we assume that there exists \(\sigma > 0\) such that

$$\begin{aligned} \nabla L\big (y^{k+1},\lambda ^{k+1}\big )^Td^k \le -\sigma \Vert d^k\Vert ^2 \end{aligned}$$
(24)

and

$$\begin{aligned} \sigma \Vert P_k\big (y^{k+1} - \nabla L\big (y^{k+1}, \lambda ^{k+1}\big )\big ) - y^{k+1}\Vert \le \Vert d^k\Vert \end{aligned}$$
(25)

for all \(k \in \mathbb {N}\), where \(P_k\) denotes the Euclidean projection on the polytope defined by \(y \in \Omega \) and \(\nabla h(y^{k+1})^T (y-y^{k+1}) = 0\). As we will see in Sect. 4, there is no loss of generality in the use of the same \(\sigma \) in (24) and (25).

In practice, the direction \(d^k\) will be obtained as the solution of the following problem:

$$\begin{aligned}&\mathrm{min }~\, \frac{1}{2} d^T H_k d + \nabla L\big (y^{k+1}, \lambda ^{k+1}\big )^T d \end{aligned}$$
(26)
$$\begin{aligned}&\mathrm{s.t. }~\,\nabla h\big (y^{k+1}\big )^T d = 0, \;\; y^{k+1} + d \in \Omega , \end{aligned}$$
(27)

where \(H_k\) is an approximation to the Hessian of the Lagrangian. Since \(\Omega \) is a polytope, we have that (26)–(27) is a quadratic programming problem. Note that solving (26)–(27) is equivalent to minimizing \(\frac{1}{2} d^T H_k d + \nabla f(y^{k+1})^T d\) subject to (27).

Note that the requirement (24) is not necessarily satisfied by the solution of (26)–(27), unless we impose some additional conditions on \(H_k\). We will show in Sect. 4 that Assumption 3.3 holds if the eigenvalues of the matrix \(Z_k^T H_k Z_k\) lie in a positive interval \([\sigma _\mathrm{min}, \sigma _\mathrm{max}]\), where the columns of \(Z_k\) form an orthonormal basis of the null-space of \(\nabla h(y^{k+1})^T\). In implementations, we will define \(H_k\) as the Hessian of the approximate Lagrangian, testing further the descent condition and switching to a safe positive definite matrix, if necessary.
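As an illustration, when the polytope constraint \(y^{k+1}+d \in \Omega \) is inactive (or simply ignored), the subproblem (26)–(27) reduces to an equality-constrained quadratic program that can be solved through its KKT linear system. The following sketch (NumPy; all identifiers are ours) assumes that \(\nabla h(y^{k+1})\) has full column rank and that \(H_k\) is positive definite on the corresponding null space; the multipliers returned for the linearized constraints may also serve as the estimate \(\lambda ^{k+1}\), as discussed in Sect. 4.

```python
import numpy as np

def tangent_qp_direction(H, A, g):
    """Solve  min 0.5*d^T H d + g^T d   s.t.  A^T d = 0,
    i.e., (26)-(27) with the polytope constraint ignored (illustration only).
    H: (n, n) symmetric, positive definite on the null space of A^T;
    A: (n, m), gradient of h at the restored point; g: (n,), gradient of L (or f).
    Returns the direction d and the multipliers mu of the linearized constraints."""
    n, m = A.shape
    K = np.block([[H, A], [A.T, np.zeros((m, m))]])
    rhs = np.concatenate([-g, np.zeros(m)])
    sol = np.linalg.solve(K, rhs)
    return sol[:n], sol[n:]
```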

Assumption 3.4

There exists \(\gamma > 0\) such that, for all \(k \in \mathbb {N}\),

$$\begin{aligned} L\big (y^{k+1} + t_k d^k, \lambda ^{k+1}\big ) \le L\big (y^{k+1}, \lambda ^{k+1}\big ) - \gamma t_k \Vert d^k\Vert ^2. \end{aligned}$$
(28)

Note that (28) states that the Lagrangian, whose quadratic approximation is minimized in (26)–(27), should decrease along the direction \(d^k\). The sufficient decrease condition, which depends on \(\gamma \), holds for small enough \(t_k\) under the choice (26)–(27) with safeguards on the eigenvalues of \(H_k\). As we saw in Theorem 2.1, condition (10) also holds for small enough \(t_k\). Therefore, Assumption 3.4 suggests a safeguarded backtracking procedure that aims to satisfy, simultaneously, (10) and (28).
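A minimal sketch of a backtracking procedure compatible with Step 4.3 and with Assumption 3.4 is the following; L_fun and merit_fun evaluate \(L\) and \(\Phi \), and the constant gamma and the reduction factor 1/10 (which keeps the rejected trial step within \([t_k, 10 t_k]\)) are illustrative choices, not values prescribed by the algorithm.

```python
def backtracking(L_fun, merit_fun, x, lam, y, lam_new, d, theta, r_k, nh_x, nh_y,
                 gamma=1e-4, factor=0.1, t_min=1e-16):
    """Backtracking for Step 4.3: find t in ]0, 1] satisfying the sufficient
    decrease (28) of the Lagrangian (which implies (9) when d != 0) and the
    merit condition (10).  x, y, d are NumPy arrays; nh_x = ||h(x^k)||,
    nh_y = ||h(y^{k+1})||."""
    d_norm2 = float(d @ d)
    L_y = L_fun(y, lam_new)
    rhs_merit = merit_fun(x, lam, theta) + 0.5 * (1.0 - r_k) * (nh_y - nh_x)
    t = 1.0
    while t >= t_min:
        trial = y + t * d
        if (L_fun(trial, lam_new) <= L_y - gamma * t * d_norm2       # (28)
                and merit_fun(trial, lam_new, theta) <= rhs_merit):  # (10)
            return t
        t *= factor
    raise RuntimeError("line search failed to find an acceptable step")
```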

Assumption 3.5

The Lagrange multiplier estimates \(\{\lambda ^{k}\}\) lie in a compact set.

By Assumption 3.1, we have that \(\nabla L(\cdot ,\lambda ): \Omega \rightarrow \mathbb {R}^n\) is Lipschitz continuous in \(\Omega \) for all fixed \(\lambda \in \mathbb {R}^m\). By Assumption 3.5, we can assume that the Lipschitz constant for \(\nabla L(\cdot ,\lambda ^{k})\) is the same for all \(k \in \mathbb {N}\). Without any loss of generality, we will also denote by \(\eta \) this Lipschitz constant, that is, for all \(x,y \in \Omega \) and \(k \in \mathbb {N}\),

$$\begin{aligned} \Vert \nabla L(y,\lambda ^k) - \nabla L(x,\lambda ^k)\Vert \le \eta \Vert y - x\Vert . \end{aligned}$$
(29)

Assumption 3.6

The Lagrange multiplier estimates \(\{\lambda ^{k}\}\) are such that

$$\begin{aligned} \lim _{k \rightarrow \infty } P_{\Omega } (y^{k}-\nabla f(y^{k})-\nabla h(y^{k}) \lambda ^{k+1})-y^{k}=0. \end{aligned}$$
(30)

As we will see in Sect. 4, this assumption can be fulfilled if we take \(\lambda ^{k+1}\) as the vector of Lagrange multipliers associated with the constraints \(\nabla h(y^{k})^T d = 0\) in the KKT conditions of the quadratic subproblem (26)–(27).

In the next lemma, we prove that the sequence of penalty parameters is bounded away from zero. This means that the objective function always has a significant weight in the merit criterion.

Lemma 3.1

Suppose that Assumption 3.2 holds. Then, the sequence \(\{\theta _k\}\) is non-increasing and bounded away from zero.

Proof

By direct calculations, (8) is equivalent to

$$\begin{aligned}&\theta \big [L\big (y^{k+1},\lambda ^{k+1}\big ) - L\big (x^k,\lambda ^{k}\big ) + \Vert h(x^k)\Vert - \Vert h\big (y^{k+1}\big )\Vert \big ]\nonumber \\&\quad \le \frac{(1+s_k)}{2} \big (\Vert h(x^k)\Vert - \Vert h(y^{k+1})\Vert \big ). \end{aligned}$$
(31)

If \(\big [L\big (y^{k+1},\lambda ^{k+1}\big ) - L\big (x^k,\lambda ^{k}\big ) + \Vert h(x^k)\Vert - \Vert h\big (y^{k+1}\big )\Vert \big ] \le 0\), then (31) holds for all \(\theta \ge 0\), thus \(\theta _k=\theta _{k-1}\).

If \([L(y^{k+1},\lambda ^{k+1}) - L(x^k,\lambda ^{k}) + \Vert h(x^k)\Vert - \Vert h(y^{k+1})\Vert ] > 0\), then \(y^{k+1} \ne x^k\) and, consequently,

$$\begin{aligned} \frac{(1+s_k) \big (\Vert h(x^k)\Vert - \Vert h(y^{k+1})\Vert \big )}{ 2 \big [ L(y^{k+1},\lambda ^{k+1}) - L\big (x^k,\lambda ^{k}\big ) + \Vert h(x^k)\Vert - \Vert h\big (y^{k+1}\big )\Vert \big ]} > 0. \end{aligned}$$

In this case, we have that

$$\begin{aligned} {\theta }_{k} = \min \left\{ \theta _{k-1}, \frac{(1+s_k) (\Vert h(x^k)\Vert - \Vert h(y^{k+1})\Vert )}{ 2\big [ L\big (y^{k+1},\lambda ^{k+1}\big ) - L\big (x^k,\lambda ^{k}\big ) + \Vert h(x^k)\Vert - \Vert h\big (y^{k+1}\big )\Vert \big ]} \right\} . \end{aligned}$$
(32)

By the updating rule of the penalty parameter, we have that the sequence \(\{\theta _k\}\) is non-increasing. It remains to be proved that \(\{\theta _k\}\) is bounded away from zero. For this purpose, it is sufficient to show that \(\theta _{k}\) is greater than a fixed positive number when \(\theta _k \ne \theta _{k-1}\). In this case, we have

$$\begin{aligned} \frac{1}{{\theta }_{k}}&= \frac{ 2 \big [ L\big (y^{k+1},\lambda ^{k+1}\big ) - L\big (x^k,\lambda ^{k}\big ) + \Vert h(x^k)\Vert - \Vert h\big (y^{k+1}\big )\Vert \big ]}{(1+s_k) \big (\Vert h(x^k)\Vert - \Vert h(y^{k+1})\Vert \big )}\\&= \frac{2}{1+s_k} \left[ \frac{L\big (y^{k+1},\lambda ^{k+1}\big ) - L\big (x^k,\lambda ^{k}\big )}{\Vert h(x^k)\Vert - \Vert h(y^{k+1})\Vert } + 1 \right] . \end{aligned}$$

Thus, by Assumption 3.2, (7), and (23),

$$\begin{aligned} \frac{1}{{\theta }_{k}} \le 2\bigg [\frac{\beta \Vert h(x^k)\Vert }{(1-r)\Vert h(x^k)\Vert } + 1\bigg ] = 2 \bigg [\frac{\beta }{1-r} + 1 \bigg ]. \end{aligned}$$
(33)

This implies that the sequence \(\{1 / {\theta }_{k}\}\) is bounded. Therefore, the sequence \(\{\theta _k\}\) is non-increasing and bounded away from zero. \(\square \)
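In an implementation, the update (32) of Step 4.2 can be coded directly; a minimal sketch (assuming the Lagrangian values and infeasibility norms have been computed beforehand; all identifiers are ours) follows.

```python
def update_theta(theta_prev, L_x, L_y, nh_x, nh_y, s_k):
    """Penalty parameter update of Step 4.2 via (31)-(32).
    L_x = L(x^k, lam^k), L_y = L(y^{k+1}, lam^{k+1}),
    nh_x = ||h(x^k)||, nh_y = ||h(y^{k+1})||."""
    denom = L_y - L_x + nh_x - nh_y
    if denom <= 0.0:
        # (31) holds for every theta >= 0, so theta_k = theta_{k-1}.
        return theta_prev
    return min(theta_prev, (1.0 + s_k) * (nh_x - nh_y) / (2.0 * denom))
```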

The following lemma ensures that any limit point of the sequence generated by Algorithm 2.1 is feasible.

Lemma 3.2

Suppose that Assumptions 3.2 and 3.5 hold. Then, the series \(\sum _{k=0}^\infty \Vert h(x^k)\Vert \) is convergent.

Proof

By condition (10) and Assumption 3.2, for all \(k \in \mathbb {N}\) one has that

$$\begin{aligned} \Phi \big (x^{k+1},\lambda ^{k+1}, \theta _{k}\big ) \le \Phi \big (x^k, \lambda ^{k},\theta _{k}\big ) + \frac{1-r}{2} \big (\Vert h(y^{k+1})\Vert - \Vert h(x^k)\Vert \big ). \end{aligned}$$

Therefore, by (23),

$$\begin{aligned} \Phi \big (x^{k+1}, \lambda ^{k+1}, \theta _{k}\big ) \le \Phi \big (x^k, \lambda ^{k},\theta _{k}\big ) - \frac{(1-r)^2}{2} \Vert h(x^k)\Vert . \end{aligned}$$
(34)

Let us define \(\rho _k := (1 - \theta _k)/\theta _k\) for all \(k \in \{-1,0,1,2,\ldots \}\). By Lemma 3.1, there exists \(\bar{\theta } > 0\) such that \(\theta _k \ge \bar{\theta }\) for all \(k \in \mathbb {N}\). This implies that \(\rho _k \le 1/\bar{\theta }-1\) for all \(k \in \mathbb {N}\). Since \(\{\rho _k\}\) is bounded and non-decreasing, it follows that

$$\begin{aligned} \sum _{k = 0}^\infty \big (\rho _{k} - \rho _{k-1}\big ) = \lim _{k \rightarrow \infty } \rho _{k}-\rho _{-1} < \infty . \end{aligned}$$
(35)

By compactness, the sequence \(\{\Vert h(x^k)\Vert \}\) is bounded. Therefore, by (35), there exists \(c > 0\) such that

$$\begin{aligned} \sum _{k = 0}^\infty (\rho _{k} - \rho _{k-1}) \Vert h(x^k)\Vert \le c < \infty . \end{aligned}$$
(36)

Now, by (34),

$$\begin{aligned} L\big (x^{k+1},\lambda ^{k+1}\big ) + \frac{1 - \theta _{k}}{\theta _{k}} \Vert h(x^{k+1})\Vert&\le \;L\big (x^{k},\lambda ^{k}\big ) + \frac{1 - \theta _{k}}{\theta _{k}} \Vert h(x^{k})\Vert \\&- \frac{(1-r)^2}{2 \theta _{k}} \Vert h(x^k)\Vert . \end{aligned}$$

Since \(0<\theta _{k}<1\), we have that \(\frac{(1-r)^2}{2} < \frac{(1-r)^2}{2 \theta _{k}}\), so

$$\begin{aligned} L\big (x^{k+1},\lambda ^{k+1}\big ) + \rho _{k} \Vert h\big (x^{k+1}\big )\Vert \le L\big (x^{k},\lambda ^{k}\big ) + \rho _{k} \Vert h\big (x^{k}\big )\Vert - \frac{(1-r)^2}{2 } \Vert h(x^k)\Vert . \end{aligned}$$

Therefore, for all \(k \in \mathbb {N}\),

$$\begin{aligned}&L\big (x^{k+1},\lambda ^{k+1}\big )+ \rho _{k} \Vert h\big (x^{k+1}\big )\Vert \le L\big (x^{k},\lambda ^{k}\big ) + \rho _{k-1} \Vert h\big (x^{k}\big )\Vert \\&\quad + \big (\rho _{k} - \rho _{k-1}\big ) \Vert h(x^k)\Vert - \frac{(1-r)^2}{2 } \Vert h(x^k)\Vert . \end{aligned}$$

Thus, for all \(k \in \mathbb {N}\), we have

$$\begin{aligned} L\big (x^{k+1},\lambda ^{k+1}\big )\!+ \!\rho _{k} \Vert h\big (x^{k+1}\big )\Vert&\le L\big (x^{0},\lambda ^{0}\big ) \!+\! \rho _{-1} \Vert h\big (x^{0}\big )\Vert \!+\! \sum _{j=0}^{k} \big (\rho _{j} - \rho _{j-1}\big ) \Vert h(x^{j})\Vert \\&- \frac{(1-r)^2}{2 } \sum _{j=0}^{k} \Vert h(x^{j})\Vert . \end{aligned}$$

Therefore, by (36),

$$\begin{aligned} L\big (x^{k+1},\lambda ^{k+1}\big )+ \rho _{k} \Vert h\big (x^{k+1}\big )\Vert&\le L\big (x^{0},\lambda ^{0}\big ) + \rho _{-1} \Vert h\big (x^{0}\big )\Vert \\&+ \,c - \frac{(1-r)^2}{2 } \sum _{j=0}^{k} \Vert h(x^{j})\Vert . \end{aligned}$$

Thus,

$$\begin{aligned} \frac{(1-r)^2}{2 }\! \sum _{j=0}^{k} \Vert h(x^{j})\Vert&\le - \big [L\big (x^{k+1}\!,\lambda ^{k+1}\big )+ \rho _{k} \Vert h\big (x^{k+1}\big )\Vert \big ] + L\big (x^{0}\!,\lambda ^{0}\big )\\&+\, \rho _{-1} \Vert h\big (x^{0}\big )\Vert + c. \end{aligned}$$

Since the functions \(L\) and \(h\) are continuous, by Assumption 3.5 and the compactness of \(\Omega \), we have that the sequences \(\{L(x^{k},\lambda ^{k})\}\) and \(\{\Vert h(x^k)\Vert \}\) are bounded. Since \(\{\rho _k\}\) is also bounded, it follows that the series \(\sum _{k=0}^\infty \Vert h(x^k)\Vert \) is convergent. \(\square \)

The following lemma shows that, if the problem is sufficiently smooth, then the direction on the tangent subspace ensures a uniformly bounded deterioration of the feasibility. The lemma also guarantees that the step size will not tend to zero. This means that Algorithm 2.1 will not produce excessively small steps.

Lemma 3.3

Suppose that Assumptions 3.1–3.3 and 3.5 hold, and let \(t_k\) be computed as in Step 4.3 of Algorithm 2.1. Then, there exist \(c>0\) and \(\bar{t}>0\) such that, for all \(k \in \mathbb {N}\), condition (5) holds with \(c_k = c\) and \(t_k \ge \bar{t}\).

Proof

Given any continuously differentiable function \(F: \mathbb {R}^n \rightarrow \mathbb {R}^p\), by the Fundamental Theorem of Calculus, we have that

$$\begin{aligned} F(y+td)=F(y)+t \nabla F(y)^Td+ t \int _0^1 \left( \nabla F(y+t \xi d)-\nabla F(y) \right) ^Td \, d \xi . \end{aligned}$$
(37)

Applying (37) with \(F(y+td)= h(y^{k+1}+td^k)\), by Assumptions 3.1 and 3.2, the triangle inequality, and the Cauchy-Schwarz inequality, we have that, for \(c=\frac{\eta }{2}\),

$$\begin{aligned} \Vert h\big (y^{k+1}+td^k\big )\Vert&\le \Vert h\big (y^{k+1}\big )\Vert + t\int _0^1 \eta \Vert y^{k+1}+t\xi d^k-y^{k+1}\Vert \Vert d^k\Vert \, d \xi \nonumber \\&\le \Vert h\big (y^{k+1}\big )\Vert + c t^2 \Vert d^k\Vert ^2. \end{aligned}$$
(38)

Thus (5) holds for \(c_k=\frac{\eta }{2}\), independently of \(k\) and for all \(t\ge 0\).

Now, applying (37) with \(F(y+td)= L\big (y^{k+1}+td^k,\lambda ^{k+1}\big )\), by Assumptions 3.1–3.3, 3.5, and (29),

$$\begin{aligned}&L\big (y^{k+1}+td^k,\lambda ^{k+1}\big )-L\big (y^{k+1},\lambda ^{k+1}\big )=t \nabla L\big (y^{k+1},\lambda ^{k+1}\big )^Td^k\\&\quad +\, t \int _0^1 \bigg (\nabla L\big (y^{k+1}+t \xi d^k,\lambda ^{k+1}\big ) -\,\nabla L\big (y^{k+1}\!,\lambda ^{k+1}\big )\bigg )^T\!d^k \, d \xi \\&\quad \le -\sigma t \Vert d^k\Vert ^2+ t\int _0^1 \eta \Vert y^{k+1} +\,\,t \xi d^k-y^{k+1}\Vert \Vert d^k\Vert \, d \xi \\&\quad \le - \left( \sigma -\frac{\eta t}{2} \right) t \Vert d^k\Vert ^2. \end{aligned}$$

Therefore, if \(\gamma _1 \le \frac{\sigma }{2}\) and \(\tau =\frac{\sigma }{2 \eta }\), we have that

$$\begin{aligned} L\big (y^{k+1}+td^k,\lambda ^{k+1}\big ) \le L\big (y^{k+1},\lambda ^{k+1}\big )-\gamma _1 t \Vert d^k\Vert ^2, \end{aligned}$$
(39)

for all \(t \le \tau \).

If \(\Vert d^k\Vert \ne 0\), condition (39) ensures that (9) holds whenever \(t_k \le \tau \). Let us prove that we can also satisfy (10) for any \(t_k\) in a specific interval. In fact, we will prove that any \(t_k \le \min \left\{ \frac{\bar{\theta }\gamma _1}{c(1-\bar{\theta })},\tau \right\} \), where \(\bar{\theta }\) is defined in Lemma 3.1, is acceptable in Step 4.3 of Algorithm 2.1. Since \(r_k \ge 0\), by (8), (19), (38), (39), and Lemma 3.1, we have that

$$\begin{aligned}&\Phi \big (y^{k+1}+t d^k,\lambda ^{k+1},\theta _{k}\big ) - \Phi \big (x^k,\lambda ^{k},\theta _{k}\big )\\&\quad = \Phi \big (y^{k+1}+t d^k,\lambda ^{k+1},\theta _{k}\big ) -\Phi \big (y^{k+1},\lambda ^{k+1},\theta _{k}\big )+\Phi \big (y^{k+1},\lambda ^{k+1},\theta _{k}\big ) - \Phi \big (x^k,\lambda ^{k},\theta _{k}\big )\\&\quad \le \theta _{k} \big ( L\big (y^{k+1}+td^k,\lambda ^{k+1}\big ) - L\big (y^{k+1},\lambda ^{k+1}\big ) \big ) + \big (1-\theta _{k}\big )\big ( \Vert h\big (y^{k+1}+t d^k\big )\Vert - \Vert h\big (y^{k+1}\big )\Vert \big )\\&\qquad +\frac{1-s_k}{2 }\big (\Vert h(y^{k+1})\Vert -\Vert h(x^k)\Vert \big )\\&\quad \le -\theta _{k} \gamma _1 t \Vert d^k\Vert ^2+ (1-\theta _{k})ct^2\Vert d^k\Vert ^2+\frac{1-r_k}{2 }\big (\Vert h(y^{k+1})\Vert -\Vert h(x^k)\Vert \big )\\&\quad \le -\bar{\theta } \gamma _1 t \Vert d^k\Vert ^2+ (1-\bar{\theta })ct^2\Vert d^k\Vert ^2+\frac{1-r_k}{2 }\big (\Vert h(y^{k+1})\Vert -\Vert h(x^k)\Vert \big )\\&\quad \le \big ( (1-\bar{\theta })ct-\bar{\theta } \gamma _1\big ) t \Vert d^k\Vert ^2+\frac{1-r_k}{2 }\big (\Vert h(y^{k+1})\Vert -\Vert h(x^k)\Vert \big )\\&\quad \le \frac{1-r_k}{2 }\big (\Vert h(y^{k+1})\Vert -\Vert h(x^k)\Vert \big ). \end{aligned}$$

Thus, either \(t_k = 1\) or the backtracking procedure rejected a trial step \(\bar{t}_k \in [t_k, 10 t_k]\) satisfying (11) or (12); by (39) and the computations above, such a \(\bar{t}_k\) must be greater than \(\min \left\{ \frac{\bar{\theta }\gamma _1}{c(1-\bar{\theta })},\tau \right\} \). Hence \(t_{k} \ge \bar{t} := \min \left\{ \frac{\bar{\theta }\gamma _1}{10 c(1-\bar{\theta })}, \frac{\tau }{10}, 1 \right\} \) for all \(k \in \mathbb {N}\). \(\square \)

Lemma 3.4

Suppose that Assumptions 3.1–3.5 hold. Then \(\lim _{k \rightarrow \infty } \Vert d^k\Vert =0\).

Proof

By Lemma 3.3, there exists \(\bar{t}>0\) such that \(t_k \ge \bar{t}\) for all \(k \in \mathbb {N}\). By (22) and (28),

$$\begin{aligned} L\big (x^{k+1},\lambda ^{k+1}\big )-L\big (x^k,\lambda ^{k}\big )&= L\big (x^{k+1},\lambda ^{k+1}\big ) -L\big (y^{k+1},\lambda ^{k+1}\big )\\&+\,\, L\big (y^{k+1},\lambda ^{k+1}\big )-L\big (x^k,\lambda ^{k}\big )\\&\le -\gamma t_k\Vert d^k\Vert ^2+\beta \Vert h(x^k)\Vert \le -\gamma \bar{t}\Vert d^k\Vert ^2+\beta \Vert h(x^k)\Vert . \end{aligned}$$

By Lemma 3.2 there exists \(c\) such that \(\sum _{k=0}^\infty \Vert h(x^k)\Vert =c\). Thus

$$\begin{aligned} L\big (x^{l+1},\lambda ^{l+1}\big )-L\big (x^0,\lambda ^{0}\big )&= \sum _{k=0}^l\big (L\big (x^{k+1},\lambda ^{k+1}\big )-L\big (x^k,\lambda ^{k}\big )\big ) \le -\gamma \bar{t}\sum _{k=0}^l\Vert d^k\Vert ^2\\&+\,\,\beta \sum _{k=0}^l \Vert h(x^k)\Vert \le -\gamma \bar{t} \sum _{k=0}^l\Vert d^k\Vert ^2+\beta c. \end{aligned}$$

Since \(f\) and \(h\) are continuous on the compact set \(\Omega \), and \(\{\lambda ^{k+1}\}\) remains in a compact set, the left-hand side is bounded below. Hence the series \(\sum _{k=0}^\infty \Vert d^k\Vert ^2\) is convergent, and thus \(\{\Vert d^k\Vert \}\) converges to zero. \(\square \)

The next lemma and its corollary are used to prove the optimality of every limit point of the sequence generated by Algorithm 2.1.

Lemma 3.5

Consider \(P_k\) as in Assumption 3.3 and let \(y^{k+1} \in \Omega \). Then

$$\begin{aligned} P_k\big (y^{k+1}-\nabla L\big (y^{k+1},\lambda \big )\big )=P_k\big (y^{k+1}-\nabla L\big (y^{k+1},\mu \big )\big ), \end{aligned}$$

for all \(\lambda \), \(\mu \in \mathbb {R}^m\).

Proof

Defining \(S_k := \{d \; : \; \nabla h(y^{k+1})^Td=0\}\), we have that \(\nabla h(y^{k+1})(\lambda -\mu ) \in S_k^\perp \). Since \(\nabla L\big (y^{k+1},\lambda \big )=\nabla L(y^{k+1},\mu )+ \nabla h\big (y^{k+1}\big )\big (\lambda -\mu \big )\), the result follows from the Projection Theorem (see [39], Proposition B.11, item (b), p. 704). \(\square \)

Corollary 3.1

Let \(x^*\) be a feasible point of (1) such that

$$\begin{aligned} \lim _{k \rightarrow \infty } y^{k+1}=x^* \text{ and } \lim _{k \rightarrow \infty } P_k(y^{k+1}-\nabla L(y^{k+1},\lambda ^{k+1}))-y^{k+1}=0, \end{aligned}$$

for some \(\{y^{k+1}\} \subset \Omega \) and \(\{\lambda ^{k+1}\} \subset \mathbb {R}^m\). Then \(x^*\) satisfies the L-AGP optimality condition [25].

Proof

By hypothesis, \(x^*\) is feasible. Moreover, by Lemma 3.5,

$$\begin{aligned}&\lim _{k \rightarrow \infty } P_k(y^{k+1}\!-\nabla f(y^{k+1}))-y^{k+1} \!=\!\lim _{k \rightarrow \infty } P_k(y^{k+1}-\nabla L(y^{k+1},0))-y^{k+1}&\\&= \lim _{k \rightarrow \infty } P_k(y^{k+1}-\nabla L(y^{k+1},\lambda ^{k+1}))-y^{k+1} = 0.&\end{aligned}$$

Thus, \(x^*\) satisfies the L-AGP optimality condition. \(\square \)

The next lemma will be used to prove the convergence of the Lagrange multiplier estimates \(\{\lambda ^{k}\}\).

Lemma 3.6

Let \(x^*\) be a feasible point such that

$$\begin{aligned} \lim _{k \rightarrow \infty } y^{k}=x^* \text{ and } \lim _{k \rightarrow \infty } P_{\Omega }(y^{k}-\nabla L(y^{k},\lambda ^{k+1}))-y^{k}=0, \end{aligned}$$

for some \(\{y^{k}\} \subset \Omega \) and \(\{\lambda ^{k+1}\} \subset \mathbb {R}^m\). Then, if \(x^*\) satisfies the Mangasarian–Fromovitz constraint qualification, we have that the KKT conditions hold in \(x^*\). Moreover, all the limit points of \(\{\lambda ^{k+1}\}\) are Lagrange multipliers associated with the equality constraints in \(x^*\).

Proof

Let \(z^{k}:=P_\Omega (y^k-\nabla L(y^k,\lambda ^{k+1}))\); hence \(\lim _{k \rightarrow \infty } (z^k-y^k) =0\) and \(z^k\) is the solution of the problem:

$$\begin{aligned} \mathrm{min }~\,\frac{1}{2}\Vert z-(y^k-\nabla L(y^k,\lambda ^{k+1}))\Vert ^2 \hbox { s.t. } z\in \Omega . \end{aligned}$$

Write \(\Omega =\{x\in \mathbb {R}^n : Ax=b,\; x\ge 0\}\). Then, for all \(k = 0, 1, 2,\ldots \), there exist \(\alpha ^k\) and \(\beta ^k\) such that

$$\begin{aligned} \beta ^k\ge 0, \;\;\;\; (z^k)^T\beta ^k=0, \end{aligned}$$
(40)

and

$$\begin{aligned} z^k- y^k+\nabla L(y^k,\lambda ^{k+1})+A^T\alpha ^k-\beta ^k=0. \end{aligned}$$
(41)

Defining \(M_k:=\max \{\Vert \lambda ^{k+1}\Vert _\infty ,\Vert \alpha ^k\Vert _\infty ,\Vert \beta ^k\Vert _\infty \}\), we have that \(\{M_k\}\) is bounded. Otherwise, we could divide (41) by \(M_k\) and take the limit on a suitable subsequence to obtain a contradiction with the Mangasarian–Fromovitz condition.

Since \(\{M_k\}\) is bounded, if \(\lambda ^*\) is a limit point of \(\{\lambda ^{k+1}\}\), there exists \(K \subseteq \mathbb {N}\) such that \(\lim _{k \in K} \lambda ^{k+1}=\lambda ^*\), \(\lim _{k \in K} \alpha ^{k}=\alpha ^*\), and \(\lim _{k \in K} \beta ^{k}=\beta ^*\). Taking limits for \(k \in K\) in (40) and (41), we verify that \(\lambda ^*\) is a Lagrange multiplier associated with the constraints \(h(x)=0\) in \(x^*\). \(\square \)

The following theorem summarizes the convergence properties of Algorithm 2.1 under Assumptions 3.1–3.6.

Theorem 3.1

Suppose that Assumptions 3.1–3.6 hold. Then

  1. For all \(k \in \mathbb {N}\), \(x^{k}\) is well defined.

  2. There exists \(\bar{\theta }>0\) such that \(\theta _k \ge \bar{\theta }\) for all \(k \in \mathbb {N}\).

  3. \(\lim _{k \rightarrow \infty } \Vert h(x^k)\Vert = \lim _{k \rightarrow \infty } \Vert h(y^{k})\Vert = 0\) and any cluster point of \(\{x^k\}\) or \(\{y^{k}\}\) is feasible.

  4. \(\lim _{k \rightarrow \infty } d^k = 0\).

  5. \(\lim _{k \rightarrow \infty } \Vert y^{k} - x^{k}\Vert = 0\).

  6. The sequences \(\{x^k\}\) and \(\{y^{k}\}\) admit the same cluster points.

  7. Limit points of \(\{x^k\}\) satisfy the L-AGP optimality condition.

  8. If a limit point \(x^*\) satisfies the Constant Positive Generators (CPG) constraint qualification [40], then the KKT conditions hold at \(x^*\).

  9. If a limit point \(x^*\) satisfies the Mangasarian–Fromovitz constraint qualification, then the sequence \(\{\lambda ^k\}\) admits a limit point \(\lambda ^*\), which is a Lagrange multiplier associated with \(\nabla h(x^*)\).

Proof

Well-definedness of \(x^{k}\) has been proved in Theorem 2.1. The existence of \(\bar{\theta }\) follows from Lemma 3.1.

By Lemma 3.2 and (3),

$$\begin{aligned} \lim _{k \rightarrow \infty } \Vert h\big (y^{k+1}\big )\Vert \le \lim _{k \rightarrow \infty } \Vert h\big (x^k\big )\Vert =0, \end{aligned}$$

and thus, by the continuity of \(h\), any cluster point of \(\{x^k\}\) or \(\{y^{k}\}\) is feasible.

The fact that \(\lim _{k \rightarrow \infty } d^k=0\) follows from Lemma 3.4.

By item 4 and the fact that \(t_k \in [0,1]\), we have that

$$\begin{aligned} \lim _{k \rightarrow \infty } \Vert y^{k+1}-x^{k+1}\Vert = \lim _{k \rightarrow \infty } \Vert y^{k+1}-\big (y^{k+1}+t_kd^k\big )\Vert \le \lim _{k \rightarrow \infty }\Vert d^k \Vert =0. \end{aligned}$$

By item 5 and the triangle inequality,

$$\begin{aligned} \lim _{k \in K \subseteq \mathbb {N}} \Vert x^{k}-z\Vert =0 \Leftrightarrow \lim _{k \in K \subseteq \mathbb {N}} \Vert y^{k}-z\Vert =0. \end{aligned}$$

Thus, the sequences \(\{x^k \}\) and \(\{y^{k} \}\) admit the same cluster points.

Let \(x^*\) be a limit point of \(\{x^k \}\). Then, by item 6, there is an infinite set \(K \subseteq \mathbb {N}\) such that \(\lim _{k \in K } y^{k+1}=x^*\). By Assumption 3.3 and item 4,

$$\begin{aligned} \lim _{k \in K} \Vert P_k\big (y^{k+1} - \nabla L\big (y^{k+1}, \lambda ^{k+1}\big )\big ) - y^{k+1}\Vert =0. \end{aligned}$$

By item 3, we have that \(x^*\) is feasible. Thus we conclude, by Corollary 3.1, that \(x^*\) satisfies the L-AGP optimality condition.

Statement 8 is a consequence of item 7 and the fact that sequential optimality conditions such as L-AGP imply KKT under the CPG constraint qualification [25, 40].

Finally, suppose that \(\lim _{k \in K}y^{k}=x^*\) and the Mangasarian–Fromovitz constraint qualification holds in \(x^*\). Since the sequence \(\{\lambda ^{k+1}\}\) lies in a compact set, we have that this sequence admits at least one limit point. Then, by Lemma 3.6, we conclude that any limit point of \(\{\lambda ^{k+1}\}\) with \(k \in K\) is a Lagrange multiplier associated with \(\nabla h(x^*)\). \(\square \)

Remark 3.1

The CPG constraint qualification is weaker than CRSC [40], (R)CPLD [41–43], (R)CRCQ [44], and the Mangasarian–Fromovitz constraint qualification.

4 Plausibility of Assumptions

One of our main assumptions is that, given an infeasible point \(x^k\), a less infeasible point can always be computed at the Restoration phase. Clearly, this assumption is not always true. For example, if the feasible set is empty and \(x^k\) is a minimizer of the infeasibility, then it would not be possible to be successful at the Restoration phase.

However, if a good algorithm is used at the Restoration phase, we expect that either the improvement in feasibility will be achieved or the problem is likely to be infeasible.

Lipschitz conditions on the gradients of \(f\) and \(h\), assumed in Assumption 3.1, are standard in smooth optimization. Convexity conditions on the problem (26)–(27) are also usual in quadratic programming analyses. After minimizing \(\frac{1}{2}d^T H_{k-1} d +\nabla f(y^{k})^T d\) subject to \(y^{k} + d \in \Omega ,\; \nabla h(y^{k})^T d = 0\), a natural way to estimate the Lagrange multipliers is taking \(\lambda ^{k+1}\) as the Lagrange multipliers associated with the constraints \(\nabla h(y^{k})^Td=0\). In this case, it is plausible to assume that \(\lambda ^{k+1}\) lies in a compact set if the Mangasarian–Fromovitz constraint qualification holds at \(y^{k}\). Thus, if the sequence \(\{\Vert H_k d^k\Vert \}\) is uniformly bounded, we have that Assumption 3.5 holds under the Mangasarian–Fromovitz constraint qualification on \(\Omega \). An easy alternative to guarantee that Assumption 3.5 holds is to use safeguards on the Lagrange multipliers in the implementation.
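For instance, the safeguard mentioned above can be as simple as projecting the estimate onto a large box (the bound lam_max below is an arbitrary large constant, not a value suggested in the paper):

```python
import numpy as np

def safeguard_multipliers(lam, lam_max=1e8):
    """Keep the multiplier estimates in a compact set (Assumption 3.5)."""
    return np.clip(lam, -lam_max, lam_max)
```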

Assumption 3.2 asks for a sufficient improvement of feasibility with a bounded deterioration of the Lagrangian. This assumption can be satisfied under a typical Lipschitz hypothesis and regularity conditions on the feasible set. In Lemma 2 of [19], it is proved that, if the Mangasarian–Fromovitz constraint qualification holds at all points of \(\Omega \), then the closest feasible point to \(x^k\) fulfills the bounded deterioration for the objective function \(f\), and so any \(r_k>0\) is a possible choice for all \(k\). A practical restoration procedure is shown in Lemma 6.1 of [11], if the Linear Independence constraint qualification holds on \(\Omega \). This result asserts that, given \(r \in \; ]0,1[\), there is a neighborhood \(N\) of the feasible set such that, if \(x^k \in N\), then the linearization of the constraints at \(x^k\), \(T_k := \{ y \in \Omega : h(x^k)+\nabla h(x^k)^T(y-x^k)=0\},\) is not empty and, denoting by \(y^{k+1}\) the point of \(T_k\) closest to \(x^k\), we have that \(\Vert h(y^{k+1})\Vert \le r \Vert h(x^k)\Vert \). Moreover, under Lipschitz conditions, bounded deterioration for the objective function holds. In both restoration procedures, if the Lagrange multipliers \(\lambda ^k\) remain in a compact set (Assumption 3.5), then the bounded deterioration also holds for the Lagrangian.
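A minimal sketch of this restoration step, ignoring the polytope \(\Omega \) for simplicity, computes the minimum-norm correction onto the linearization \(T_k\); here J denotes the \(m \times n\) Jacobian of \(h\) at \(x^k\) (that is, \(\nabla h(x^k)^T\) in the notation of the paper) and all identifiers are ours.

```python
import numpy as np

def linearized_restoration(x, h_val, J):
    """One restoration step: the point of {y : h(x^k) + J (y - x^k) = 0}
    closest to x^k (the constraint y in Omega is ignored in this sketch)."""
    # Minimum-norm solution of J * delta = -h(x^k).
    delta, *_ = np.linalg.lstsq(J, -h_val, rcond=None)
    return x + delta
```

In a real implementation the correction must, of course, also keep \(y^{k+1} \in \Omega \).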

The plausibility of Assumption 3.3 is strictly linked with our proposal in the present paper of choosing \(d^k\) as the solution of the quadratic programming problem:

$$\begin{aligned} \mathrm{min } \nabla L\big (y^{k+1}, \lambda ^{k+1}\big )^T d + \frac{1}{2}d^T H_k d \;\hbox { s.t. } y^{k+1} + d \in \Omega ,\; \nabla h\big (y^{k+1}\big )^T d = 0, \end{aligned}$$
(42)

where \(H_k\) is symmetric, \(Z_k^T H_k Z_k\) is positive definite, and the columns of \(Z_k\) form an orthonormal basis of the null-space of \(\nabla h(y^{k+1})^T\). In addition, we assume that the eigenvalues of \(Z_k^T H_k Z_k\) lie in a positive interval \([\sigma _\mathrm{min}, \sigma _\mathrm{max}]\).

By selecting \(d^k\) in this way, since \(d=0\) is feasible to (42), we have that

$$\begin{aligned} \nabla L\big (y^{k+1}, \lambda ^{k+1}\big )^T d^k + \frac{1}{2}(d^k)^T H_k d^k \le 0. \end{aligned}$$

Therefore,

$$\begin{aligned} \nabla L\big (y^{k+1}, \lambda ^{k+1}\big )^T d^k \le -\frac{\sigma _\mathrm{min}}{2} \Vert d^k\Vert ^2. \end{aligned}$$

In order to prove condition (25), let us define

$$\begin{aligned} D_k := \{d \in \mathbb {R}^n \;:\; \nabla h\big (y^{k+1}\big )^T d = 0 \text{ and } y^{k+1}+d \in \Omega \}. \end{aligned}$$
(43)

Since \(d^k\) is the solution of the linearly constrained problem (42),

$$\begin{aligned} P_{D_k}\big (d^k-H_kd^k-\nabla L\big (y^{k+1},\lambda ^{k+1}\big )\big )-d^k=0. \end{aligned}$$

Changing variables (\(y=y^{k+1}+d\)) and using the fact that the projections are non-expansive, we have

$$\begin{aligned}&\Vert P_k \big (y^{k+1}-\nabla L\big (y^{k+1},\lambda ^{k+1}\big )\big )-y^{k+1}\Vert =\Vert P_{D_k} \big (-\nabla L\big (y^{k+1},\lambda ^{k+1}\big )\big )\Vert \\&\quad = \Vert P_{D_k}\big (-\nabla L\big (y^{k+1},\lambda ^{k+1}\big )\big )-P_{D_k}\big (d^k-H_kd^k-\nabla L\big (y^{k+1},\lambda ^{k+1}\big )\big )\\&\quad +\,P_{D_k}\big (d^k-H_kd^k-\nabla L\big (y^{k+1},\lambda ^{k+1}\big )\big )\Vert \le \Vert P_{D_k} \big (-\nabla L\big (y^{k+1},\lambda ^{k+1}\big )\big )\\&\quad -\,P_{D_k}\big (d^k\!-\!H_kd^k-\nabla L\big (y^{k+1},\lambda ^{k+1}\big )\big )\Vert \!+\!\Vert P_{D_k}\big (d^k\!-\!H_kd^k\!-\!\nabla L\big (y^{k+1},\lambda ^{k+1}\big )\big )\Vert \\&\quad \le \Vert d^k-H_kd^k\Vert +\Vert d^k\Vert \le \big (2+\sigma _\mathrm{max}\big ) \Vert d^k\Vert . \end{aligned}$$

So, defining \(\sigma := \min \left\{ \frac{\sigma _\mathrm{min}}{2}, \frac{1}{2+\sigma _\mathrm{max}} \right\} \), we have that Assumption 3.3 holds.

From Lemma 3.3, we can see that condition (39) holds for \(\gamma _1=\frac{\sigma _\mathrm{min}}{4}\) and sufficiently small \(t\). Thus, if \(\sigma _\mathrm{min}\) is known, then we can ensure that Assumption 3.4 holds by testing the sufficient decrease condition at Step 4.3 of Algorithm 2.1. Moreover, even if \(\sigma _\mathrm{min}\) is not known, it is possible to do a double backtracking, in \(t\) and \(\sigma \), to ensure that Assumption 3.4 holds [45]. However, this does not seem to be either efficient or necessary in practice, so we only test simple decrease at Step 4.3 of Algorithm 2.1.

Finally, to prove the plausibility of Assumption 3.6, we consider that \(\lambda ^{k+1}\) is chosen as the vector of Lagrange multipliers associated with the constraints \(\nabla h(y^{k})^T d = 0\) in the problem:

$$\begin{aligned} \mathrm{min } \frac{1}{2}d^T H_{k-1} d +\nabla f(y^{k})^T d \;\; \mathrm{s.t.}\;\; y^{k} + d \in \Omega ,\; \nabla h(y^{k})^T d = 0. \end{aligned}$$

In this case, denoting \(D_k:=\{d \; : \; y^{k}+d \in \Omega \}\), we have that

$$\begin{aligned} P_{D_k}\big (d^{k-1}-H_{k-1}d^{k-1}-\nabla f\big (y^{k}\big )-\nabla h\big (y^{k}\big )\lambda ^{k+1}\big )-d^{k-1}=0. \end{aligned}$$

Therefore,

$$\begin{aligned}&\Vert P_{\Omega }\big (y^{k}-\nabla f(y^{k})-\nabla h(y^{k})\lambda ^{k+1}\big )-y^{k}\Vert = \Vert P_{D_k}\big (-\nabla f(y^{k})-\nabla h(y^{k})\lambda ^{k+1}\big )\Vert \\&\quad \le \Vert P_{D_k}\big (-\nabla f(y^{k})-\nabla h(y^{k})\lambda ^{k+1}\big )- P_{D_k}\big (d^{k-1}-H_{k-1}d^{k-1}-\nabla f(y^{k})-\nabla h(y^{k})\lambda ^{k+1}\big ) \Vert \\&\qquad +\Vert P_{D_k}\big (d^{k-1}-H_{k-1}d^{k-1}-\nabla f(y^{k})-\nabla h(y^{k})\lambda ^{k+1}\big ) \Vert \\&\quad \le \Vert d^{k-1}-H_{k-1}d^{k-1}\Vert +\Vert d^{k-1}\Vert . \end{aligned}$$

Thus, considering that the sequence \(\{\Vert H_k \Vert \}\) is uniformly bounded and that Assumptions 3.1–3.5 hold, by Lemma 3.4, we have that

$$\begin{aligned} \lim _{k \rightarrow \infty } P_{\Omega } (y^{k}-\nabla f(y^{k})-\nabla h(y^{k}) \lambda ^{k+1})-y^{k}=0, \end{aligned}$$

which means that Assumption 3.6 holds.

5 Conclusions

The Inexact-Restoration method introduced in this paper employs an improved version of the Fischer–Friedlander line search approach, combined with the use of Lagrange multiplier estimates both in the Optimality Phase subproblem and in the merit function. The new method improves on previous approaches in several theoretical aspects and is appropriate for tackling many problems in which there is a natural way to restore feasibility.

Its application to general constrained optimization problems remains to be investigated. In the general case, some standard (perhaps under-determined Newtonian) method should be defined for the restoration phase and convergence should be proved taking into account the characteristics of that method.

Another possible approach for general problems would be to introduce objective function information in the restoration phase, aiming to provide better initial approximations for the second-phase subproblems. Much theoretical and practical research can be expected along these lines.

Future research will also include the application of alternative I-R approaches to multiobjective optimization.