1 Introduction

Sequential quadratic programming (SQP) methods are an important class of methods for minimizing a smooth nonlinear function subject to both equality and inequality constraints. This paper concerns the local convergence properties of a new stabilized SQP method for the solution of a nonlinear optimization problem written in the form

$$\begin{aligned} \displaystyle {\mathop {\mathrm{minimize}}\limits _{x \in \mathbb {R}^n}}\;\;f(x) \;\;\;\hbox {subject to}\;\;c(x) = 0, \quad x\ge 0, \end{aligned}$$
(NP)

where \(c :\mathbb {R}^n \mapsto \mathbb {R}^m\) and \(f :\mathbb {R}^n \mapsto \mathbb {R}\) are twice-continuously differentiable. For problem (NP), the vector g(x) is used to denote \(\nabla \!f(x)\), the gradient of f at x. The matrix J(x) denotes the \(m\times n\) constraint Jacobian, which has ith row \(\nabla \!c_i(x)^T\), the gradient of the ith constraint function \(c_i\) at x. The Lagrangian associated with (NP) is \(L(x, y,z) = f(x) - c(x)^T\!y -z^T\!x\), where y and z are m- and n-vectors of dual variables associated with the equality constraints and nonnegativity constraints, respectively. The Hessian of the Lagrangian with respect to x is denoted by \(H(x,y) = \nabla ^2\!f(x) - \sum _{i=1}^m y_i \nabla ^2\!c_i(x)\).
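To make the notation concrete, the following NumPy sketch assembles the Lagrangian Hessian \(H(x,y)\) from user-supplied second derivatives; the callables f_hess and c_hess are illustrative stand-ins, not an interface defined in the paper.

```python
import numpy as np

def lagrangian_hessian(x, y, f_hess, c_hess):
    """H(x, y) = Hess f(x) - sum_i y_i Hess c_i(x).

    f_hess(x) returns the n-by-n Hessian of f, and c_hess is a list of
    callables returning the Hessians of the constraint functions c_i.
    """
    H = np.array(f_hess(x), dtype=float)
    for yi, ci_hess in zip(y, c_hess):
        H -= yi * np.asarray(ci_hess(x))
    return H
```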

At each iteration of a conventional line-search merit-function SQP method, a sufficient decrease in a merit function is obtained by performing a line search in the direction of a solution of a quadratic programming (QP) subproblem in which a local quadratic model of the Lagrangian is minimized subject to the linearized constraints. The merit function is designed to provide a measure of the quality of a given point as an estimate of a solution of the nonlinearly constrained problem. (For a recent survey of SQP methods, see Gill and Wong [17].) Stabilized sequential quadratic programming (sSQP) methods are designed to improve the poor local convergence rate that can occur when a conventional SQP method is applied to an ill-posed or degenerate problem. Given an estimate \((x_k,y_k)\) in the neighborhood of a primal-dual solution \((x^*,y^*)\) of problem (NP), sSQP methods compute a new solution estimate based on the properties of a QP subproblem of the form

$$\begin{aligned}&{\mathop {\mathrm{minimize}}\limits _{x,y}} \;\;g(x_k)^T\!(x-x_k) + {\textstyle \frac{1}{2}}(x-x_k)^T H(x_k,y_k) (x-x_k) + {\textstyle \frac{1}{2}}\mu _k\Vert y\Vert ^2\nonumber \\&\hbox {subject to}\;\;c(x_k) + J(x_k)(x-x_k) + \mu _k(y - y_k) = 0, \;\;\;x \ge 0, \end{aligned}$$
(1)

where \(\mu _k\) is a positive scalar of the order of the distance of \((x_k,y_k)\) to the set of solutions of (NP). The QP subproblem associated with a conventional SQP method corresponds to the value \(\mu _k = 0\). The terms in the objective and constraints of (1) associated with \(\mu _k\) serve to bound the change in the dual variables and provide a sequence of iterates with fast local convergence regardless of whether or not the active-constraint gradients are linearly dependent. The first sSQP method was proposed by Wright [32], who established a superlinear rate of convergence of the solutions \(\{(x_k,y_k)\}\) of (1) under the assumptions of strict complementarity and the satisfaction of the Mangasarian-Fromovitz constraint qualification. These assumptions were relaxed by Hager [19], and more recently by Fernández and Solodov [9], and Solodov and Izmailov [24]. Independently, Fischer [10] proposed an algorithm in which an auxiliary QP problem is solved for the multiplier estimate of the conventional QP subproblem. This method also has superlinear convergence under appropriate assumptions. The analysis of a conventional sSQP method concerns the sequence \(\{(x_k,y_k)\}\) of solutions of the QP subproblem (1). Other methods related to sSQP identify an estimate of the optimal active set and then solve an equality constrained or inequality constrained QP defined in terms of a subset of the constraints. Constraints omitted from the estimated active set are allowed to be violated slightly. Wright [33, 34] includes only a subset of the linearized constraints in an inequality constrained sSQP subproblem. Wright [35], and Oberlin and Wright [31] use an auxiliary inequality constrained subproblem to estimate the optimal active set and then solve an sSQP subproblem with only equality constraints. Izmailov and Solodov [21] also use an auxiliary subproblem, but solve an unstabilized equality constrained problem using a rank detection method to treat any linear dependence in the linearized constraints.

All of these sSQP methods can be shown to exhibit fast local convergence under suitable assumptions. It should be emphasized that, with the notable exception of Wright [35], previous analyses of sSQP methods do not pertain to a consistent, well-defined algorithm. They show only that if a specific local solution of a nonconvex QP subproblem is found, then these solutions converge at a superlinear rate. Unfortunately, in a practical method, there is no guarantee that a nonconvex QP solver will find the specific solution required for the theory. This problem is in addition to the well-known difficulties associated with solving a nonconvex QP, i.e., the potential for multiple and unbounded solutions. (See Kungurtsev [27, Chapter 5] for a discussion of these issues.)

Although sSQP methods exhibit fast local convergence, they come with little global convergence theory, so that stabilized methods must start by solving the QP subproblem associated with a conventional (globally convergent) SQP method and switch to the stabilized QP strategy when it is determined that the iterates are in the proximity of a solution. Moreover, as mentioned above, many sSQP methods require the solution of an auxiliary inequality-constrained subproblem at each outer iteration, usually a linear program (LP).

In this paper we consider the local convergence properties of a globally convergent sSQP method that does not require a switch to a conventional SQP method or the solution of an auxiliary inequality constrained subproblem. The method is based on using a primal-dual augmented Lagrangian merit function in conjunction with a line search to enforce global convergence. At each iteration, an estimate of the solution is computed by minimizing a strictly convex local quadratic model of the augmented Lagrangian subject to simple bound constraints. This subproblem is formally equivalent to a QP problem that is closely related to the QP subproblem associated with sSQP.

The principal contributions are the following. (i) A local descent step is proposed that is based on allowing a small relaxation of the optimality conditions for the bound-constrained subproblem. It is shown that this step provides iterates that are equivalent to those from a conventional sSQP method when close to the solution. This equivalence holds under conditions that are no stronger than those required to establish the superlinear convergence of a conventional sSQP method. (ii) A local convergence analysis is given that does not require the assumption of a constraint qualification or strict complementarity condition. (iii) It is shown that a step length of one is selected in the limit, which implies that the method does not suffer from the Maratos effect (see Maratos [28]). As far as we are aware, this is the only stabilized SQP method with this property. (iv) Although exact second derivatives are used, the method does not require the solution of a nonconvex QP subproblem, a problem that is known to be NP-hard. In addition, the local convergence theory makes no assumptions about which local solution of the QP subproblem is computed. (v) Preliminary numerical results indicate that the method has good global and local convergence properties for degenerate problems under weak regularity assumptions. Overall, the local analysis of this paper and the global analysis of [14] imply that the proposed method has the same strong first- and second-order global convergence properties that have been established for augmented Lagrangian methods, yet is able to transition seamlessly to sSQP with fast local convergence in the neighborhood of a solution.

The remainder of the paper is organized as follows. This section concludes with a summary of the notation. Section 2 contains a description of the second-order primal-dual sSQP method. The local convergence properties of the method are established in Sect. 3. In Sect. 4, methods are discussed for solving the sSQP subproblems, and numerical results are provided. Although this paper describes the method in its entirety, the reader is referred to [14] for a complete analysis of the global convergence, as well as additional details of the method that are not related to the local analysis.

Unless explicitly indicated otherwise, \(\Vert \cdot \Vert \) denotes the vector two-norm or its induced matrix norm. Given vectors a and b with the same dimension, the vector with ith component \(a_i b_i\) is denoted by \(a{{\varvec{\cdot }}}b\). Similarly, \(\min (a,b)\) is the vector with components \(\min (a_i,b_i)\). The vectors e and \(e_j\) denote, respectively, the column vector of ones and the jth column of the identity matrix I. The dimensions of e, \(e_i\) and I are defined by the context. The set of integers \(\{1, 2, \dots , n\}\) is denoted by \(1\,{:}\,n\). Given vectors x and y, the vector consisting of the elements of x augmented by the elements of y is denoted by (x, y). The value of a scalar-, vector- or matrix-valued function F with arguments x and y will be written as either F(x, y) or F(v), where v is the vector (x, y). The ith component of a vector labeled with a subscript will be denoted by \([\,\,\cdot \,\,]_i\), e.g., \([\,v\,]_i\) is the ith component of the vector v. For a given \(\ell \)-vector u and index set \(\mathcal {S}\), the quantity \([\,u\,]_{\scriptscriptstyle \mathcal {S}}\) denotes the subvector of components \(u_j\) such that \(j\in \{1, 2, \dots , \ell \}\cap \mathcal {S}\). Similarly, if M is a symmetric \(\ell \times \ell \) matrix, then \([\,M\,]_{\scriptscriptstyle \mathcal {S}}\) denotes the symmetric matrix with elements \(m_{ij}\) for \(i, j\in \{1, 2, \dots , \ell \}\cap \mathcal {S}\). Let \(\{\,\alpha _j\,\}_{j\ge 0}\) be a sequence of scalars, vectors or matrices and let \(\{\, \beta _j \,\}_{j\ge 0}\) be a sequence of positive scalars. If there exists a positive constant \(\gamma \) such that \(\Vert \alpha _j\Vert \le \gamma \beta _j\), we write \(\alpha _j = O(\beta _j)\). If there exists a sequence \(\{\,\gamma _j\,\} \rightarrow 0\) such that \(\Vert \alpha _j\Vert \le \gamma _j \beta _j\), we write \(\alpha _j = o(\beta _j)\). If there exist positive constants \(\gamma _1\) and \(\gamma _2\) such that \(\gamma _1 \beta _j \le \Vert \alpha _j\Vert \le \gamma _2 \beta _j\), we write \(\alpha _j = \varTheta \!\big (\beta _j\big )\).

2 The primal-dual stabilized SQP algorithm

The proposed algorithm is designed to find first- and second-order KKT pairs associated with problem (NP). A vector \(x^*\) is a first-order KKT point for problem (NP) if there exists a dual vector \(y^*\) such that \(r(x^*\!,y^*)= 0\), where

$$\begin{aligned} r(x,y) = \left\| \big ( c(x), \min \big (x,g(x)-J(x)^T\!y\big )\big ) \right\| . \end{aligned}$$
(2)

Any \((x^*\!,y^*)\) satisfying \(r(x^*\!,y^*)= 0\) is called a first-order KKT pair. For arbitrary vectors x and y of appropriate dimension, the scalar \(r(x,y)\) provides a practical estimate of the distance of (x, y) to a first-order KKT pair of problem (NP). If, in addition, \((x^*\!,y^*)\) satisfies the condition \(p^T\!H(x^*\!,y^*) p \ge 0\) for all p such that \(J(x^*) p = 0\), with \(p_i \ge 0\) for all i such that \(x^*_i = 0\), then \((x^*\!,y^*)\) is referred to as a second-order KKT pair. In general, the Lagrange multiplier associated with a first-order KKT point is not unique, and the set of Lagrange multiplier vectors is given by

$$\begin{aligned} \mathcal {Y}(x^*) = \{ y\in \mathbb {R}^m :(x^*\!,y) \;\;\text {satisfies}\;\; r(x^*\!,y)= 0 \}. \end{aligned}$$
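For illustration, a direct transcription of the residual (2) in NumPy is given below; the arguments g, J and cx denote the quantities \(g(x)\), \(J(x)\) and \(c(x)\) and are conventions of this sketch rather than an interface defined in the paper.

```python
import numpy as np

def kkt_residual(x, y, g, J, cx):
    """The first-order KKT residual r(x, y) of (2)."""
    # min(x, g - J^T y) vanishes iff x >= 0, g - J^T y >= 0, and the
    # complementarity condition x_i [g - J^T y]_i = 0 holds for every i.
    stationarity = np.minimum(x, g - J.T @ y)
    return np.linalg.norm(np.concatenate([cx, stationarity]))
```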

The algorithm is based on replacing problem (NP) by a sequence of problems

$$\begin{aligned} {\mathop {\mathrm{minimize}}\limits _{x\in \mathbb {R}^n,y\in \mathbb {R}^m}}\;\;M(x,y \mathop {;}y_k^{\scriptscriptstyle E },\mu _k^{}) \;\;\;\hbox {subject to}\;\;x \ge 0, \end{aligned}$$
(3)

where \(M(x,y \mathop {;}y_k^{\scriptscriptstyle E },\mu _k^{})\) is the primal-dual function

$$\begin{aligned} M(x,y \mathop {;}y_k^{\scriptscriptstyle E },\mu _k^{}) {=} f(x) - c(x)^T\!y_k^{\scriptscriptstyle E }+ \frac{1}{2\mu _k}\Vert c(x)\Vert ^2 + \frac{1}{2\mu _k}\Vert c(x)+\mu _k^{}(y - y_k^{\scriptscriptstyle E })\Vert ^2, \end{aligned}$$
(4)

with \(\mu _k\) a positive penalty parameter and \(y_k^{\scriptscriptstyle E }\) an estimate of a Lagrange multiplier vector for problem (NP). The method has an inner/outer iteration structure in which each outer iteration involves the minimization of a quadratic model of M subject to the nonnegativity constraints. The inner iterations are then those of the active-set method used to find an approximate bound-constrained minimizer of the quadratic model. If the Hessian of M is not positive definite, a direction of negative curvature for M is computed. A direction obtained by solving the QP subproblem is combined with the direction of negative curvature (if one is computed) to give a search direction for a line search designed to find a step of sufficient decrease in \(M(x,y \mathop {;}y_k^{\scriptscriptstyle E },\mu _k^{})\).

Each outer iteration involves the definition of two related QP subproblems associated with the primal-dual function (4). The objective function in both subproblems is defined in terms of the gradient \(\nabla \!M\) and a matrix that approximates the Hessian \(\nabla ^2\!M\). For given values of \(y^{\scriptscriptstyle E }\) and \(\mu \), the gradient \(\nabla \!M(x,y \mathop {;}y^{\scriptscriptstyle E },\mu )\) and Hessian \(\nabla ^2\!M(x,y \mathop {;}y^{\scriptscriptstyle E },\mu )\) at (x, y) may be written in the form

$$\begin{aligned} \begin{pmatrix}g(x) - J(x)^T\!\big (\pi (x \mathop {;}y^{\scriptscriptstyle E },\mu ) + (\pi (x \mathop {;}y^{\scriptscriptstyle E },\mu ) - y)\big ) \\ \mu (y - \pi (x \mathop {;}y^{\scriptscriptstyle E },\mu )) \end{pmatrix}, \end{aligned}$$

and

$$\begin{aligned} \begin{pmatrix} H\big (x,\pi (x \mathop {;}y^{\scriptscriptstyle E },\mu ) + (\pi (x \mathop {;}y^{\scriptscriptstyle E },\mu ) - y)\big )+\frac{2}{\mu } J(x)^T\!J(x) &{}\;\;J(x)^T \\ J(x) &{}\;\;\mu I \end{pmatrix}, \end{aligned}$$

where \(\pi \) is the vector-valued function \(\pi (x \mathop {;}y^{\scriptscriptstyle E },\mu ) = y^{\scriptscriptstyle E }- c(x)/\mu \).
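A short sketch evaluating M and the gradient displayed above, under exactly these definitions, is given below; the argument names (fx, g, J, cx for \(f(x)\), \(g(x)\), \(J(x)\), \(c(x)\)) are illustrative.

```python
import numpy as np

def merit_and_gradient(x, y, yE, mu, fx, g, J, cx):
    """The primal-dual function M of (4) and its gradient.

    A sketch under the stated definitions, not the implementation
    of [14].
    """
    pi = yE - cx / mu                          # pi(x; yE, mu)
    M = (fx - cx @ yE + (cx @ cx) / (2.0 * mu)
         + np.sum((cx + mu * (y - yE))**2) / (2.0 * mu))
    grad_x = g - J.T @ (2.0 * pi - y)          # g - J^T(pi + (pi - y))
    grad_y = mu * (y - pi)
    return M, np.concatenate([grad_x, grad_y])
```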

Let \((x_k,y_k)\) be the kth estimate of a primal-dual solution of (NP). Let v and \(v_k\) denote the \((n+m)\)-vectors of primal-dual variables (x, y) and \((x_k,y_k)\). Given a second penalty parameter \(\mu _k^{\scriptscriptstyle R }\) such that \(0 < \mu _k^{\scriptscriptstyle R }\le \mu _k^{}\), the change in M at \(v_k\) may be approximated by the quadratic function \({\mathcal Q}_k^{}(v \mathop {;}y_k^{\scriptscriptstyle E }, \mu _k^{\scriptscriptstyle R })\), where

$$\begin{aligned} {\mathcal Q}_k^{}(v \mathop {;}y^{\scriptscriptstyle E }, \mu ^{\scriptscriptstyle R }) = \nabla \!M(v_k^{}\mathop {;}y^{\scriptscriptstyle E },\mu ^{\scriptscriptstyle R })^T\!(v-v_k^{}) + {\textstyle \frac{1}{2}}(v-v_k^{})^T B(v_k^{}\mathop {;}\mu ^{\scriptscriptstyle R })(v-v_k^{}), \end{aligned}$$
(5)

and the matrix \(B(v_k^{}\mathop {;}\mu _k^{\scriptscriptstyle R })\) is obtained by replacing \(\pi (x_k^{}\mathop {;}y_k^{\scriptscriptstyle E },\mu _k^{\scriptscriptstyle R })\) by \(y_k\) in the leading block of the Hessian matrix \(\nabla ^2\!M(x_k^{},y_k^{}\mathop {;}y_k^{\scriptscriptstyle E },\mu _k^{\scriptscriptstyle R })\), i.e.,

$$\begin{aligned} B(x_k^{},y_k^{}\mathop {;}\mu _k^{\scriptscriptstyle R }) = \begin{pmatrix} H(x_k,y_k) + \frac{2}{\mu _k^{\scriptscriptstyle R }}J(x_k)^T\!J(x_k) &{} J(x_k)^T\\ J(x_k) &{} \mu _k^{\scriptscriptstyle R }I \end{pmatrix}. \end{aligned}$$
(6)

The matrix \(B(x_k^{},y_k^{}\mathop {;}\mu _k^{\scriptscriptstyle R })\) is independent of \(\pi \) and therefore does not involve \(y_k^{\scriptscriptstyle E }\). If \((x^*\!,y^*)\) satisfies certain second-order sufficient conditions for an optimal solution of problem (NP), then, for the values \(v_k = (x^*,y^*)\) and \(y_k^{\scriptscriptstyle E }= y_k^{}\), there exists a positive \(\bar{\mu }\) such that for all \(0<\mu _k^{\scriptscriptstyle R }<\bar{\mu }\), the point \((x^*\!,y^*)\) satisfies the second-order sufficient optimality conditions for the QP subproblem

$$\begin{aligned} {\mathop {\mathrm{minimize}}\limits _{v}} \;\;{\mathcal Q}_k^{}(v \mathop {;}y_k^{\scriptscriptstyle E }, \mu _k^{\scriptscriptstyle R }) \;\;\hbox {subject to}\;\;[\,v\,]_i \ge 0, \quad i = 1 \,{:}\,n \end{aligned}$$
(7)

(see Gill, Kungurtsev and Robinson [14]). The benefit of using \(B(x_k^{},y_k^{}\mathop {;}\mu _k^{\scriptscriptstyle R })\) and not \(\nabla ^2\!M(x_k^{},y_k^{}\mathop {;}y_k^{\scriptscriptstyle E },\mu _k^{\scriptscriptstyle R })\) in the definition of the quadratic function (5) is that the QP subproblem (7) is formally equivalent to the QP subproblem

$$\begin{aligned} \begin{aligned} {\mathop {\mathrm{minimize}}\limits _{x,y}}&\;\;g(x_k)^T\!(x-x_k) + {\textstyle \frac{1}{2}}(x-x_k)^T\!H(x_k,y_k)(x-x_k) + {\textstyle \frac{1}{2}}\mu _k^{\scriptscriptstyle R }\Vert y\Vert ^2\\ \hbox {subject to}&\;\;c(x_k) + J(x_k)(x-x_k) + \mu _k^{\scriptscriptstyle R }(y - y_k^{\scriptscriptstyle E }) = 0, \;\;\;x \ge 0 \end{aligned} \end{aligned}$$
(8)

(see Gill and Robinson [16]). A comparison of this subproblem and (1) indicates that setting \(y_k^{\scriptscriptstyle E }= y_k^{}\) in the definition of (5) and forcing \(\mu _k^{\scriptscriptstyle R }\rightarrow 0\) as \((x_k,y_k)\) converges to a primal-dual solution \((x^*,y^*)\) will induce the method to behave like an sSQP method and thereby inherit the same fast local convergence rate.

At the outermost level, the method may be regarded as a primal-dual augmented Lagrangian method for which the parameters \(\{ y_k^{\scriptscriptstyle E }\}\) and \(\{ \mu _k \}\) are adjusted to give global convergence. However, the sequence of penalty parameters \(\{ \mu _k^{\scriptscriptstyle R }\}\) is chosen in such a way that, in the neighborhood of a solution, the search direction is equivalent to that defined by an sSQP method. In this context, \(\mu _k^{\scriptscriptstyle R }\) plays the role of a regularization or stabilization parameter rather than a penalty parameter, thereby providing an \(O(\mu _k^{\scriptscriptstyle R })\) estimate of the conventional SQP direction (see Gill and Robinson [16]).

The next four sections provide some additional details of the algorithm, with an emphasis on those aspects related to the local convergence analysis. More details of the computation, including a step-by-step description of the main algorithms, may be found in Gill, Kungurtsev and Robinson [14]. In Sect. 2.1 we provide details of how the parameters \(y_k^{\scriptscriptstyle E }\), \(\mu _k^{}\) and \(\mu _k^{\scriptscriptstyle R }\) are defined. In Sect. 2.2 we consider the definition of the QP subproblem and show that although the QP (7) cannot be used directly as a local quadratic model of M, it forms the basis for two approximate convex QP subproblems, one with inequality constraints, and the other with only equality constraints. In Sect. 2.3 we give a brief outline of the flexible line search. Finally, Sect. 2.4 provides a brief summary of the algorithm.

2.1 Definition of the penalty parameters and multiplier estimate

At the start of the kth outer iteration, \((x_k,y_k)\) is known, as well as the regularization parameter \(\mu _{k-1}^{\scriptscriptstyle R }\) and penalty parameter \(\mu _{k-1}\). The first step is to compute \(y_k^{\scriptscriptstyle E }\) and \(\mu _k^{\scriptscriptstyle R }\) for the new iteration. These parameters are defined in terms of an estimate of the optimal active set of problem (NP). This estimate involves a positive scalar \(\epsilon \) that reflects the distance of (x, y) to a first-order optimal pair for problem (NP). The \(\epsilon \)-active set is defined as

$$\begin{aligned} \mathcal {A}_\epsilon (x,y,\mu ) = \big \{ i :x_i \le \epsilon , \;\;\text {with}\;\; \epsilon \equiv \min \big ( \,\epsilon _a^{},\,\max \big (\mu ,r(x,y)^\gamma \big )\,\big )\big \}, \end{aligned}$$
(9)

where \(\gamma \) and \(\epsilon _a\) are fixed scalars satisfying \(0<\gamma <1\) and \(0< \epsilon _a <1\), and \(r(x,y)\) is the nonnegative scalar of (2). Similarly, the \(\epsilon \)-free set is the complement of \(\mathcal {A}_\epsilon \) in \(\{1, 2, \dots , n+m\}\), i.e.,

$$\begin{aligned} \mathcal {F}_\epsilon (x,y,\mu ) = \{1,2,\dots , n+m\}\setminus \mathcal {A}_\epsilon (x,y,\mu ). \end{aligned}$$
(10)
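A minimal sketch of the two index sets, assuming the residual \(r(x,y)\) of (2) has already been evaluated and with illustrative values for \(\gamma \) and \(\epsilon _a\), is the following.

```python
def eps_active_free(x, m, mu, r, gamma=0.5, eps_a=0.5):
    """The eps-active set (9) and eps-free set (10).

    x is the primal iterate, m the number of equality constraints,
    mu the current regularization parameter, and r = r(x, y).
    """
    n = len(x)
    eps = min(eps_a, max(mu, r**gamma))
    A = [i for i in range(n) if x[i] <= eps]
    F = [i for i in range(n + m) if i not in set(A)]  # complement in 1:n+m
    return A, F
```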

The calculation of \(y_k^{\scriptscriptstyle E }\) and \(\mu _k^{\scriptscriptstyle R }\) also requires the scalar \(\xi _k\) (\(\xi _k \ge 0\)), which is an estimate of the magnitude of the “most negative” eigenvalue of \(B\!_{\scriptscriptstyle {\mathcal {F}_\epsilon }^{}\,}(v_k^{}\mathop {;}\mu _{k-1}^{\scriptscriptstyle R })\). The scalar \(\xi _k\) is computed as part of the scalar-vector pair \((\xi _k^{}, s_k^{(1)})\) such that

$$\begin{aligned} s_k^{(1)T}\! B(v_k^{}\mathop {;}\mu _{k-1}^{\scriptscriptstyle R })s_k^{(1)} = -\xi _k^{}\Vert u_k^{(1)}\Vert ^2, \end{aligned}$$
(11)

where \(u_k^{(1)}\) is the vector of first n components of \(s_k^{(1)}\). If \(\xi _k^{}=0\), then \(s_k^{(1)}=0\). If \(B\!_{\scriptscriptstyle {\mathcal {F}_\epsilon }^{}\,}(v_k^{}\mathop {;}\mu _{k-1}^{\scriptscriptstyle R })\) is positive definite then \((\xi _k^{},s_k^{(1)}) = 0\). (The calculation of \(\xi _k\) is discussed further in [14, Algorithm 1] and Sect. 2.2.) The values of \(y_k^{\scriptscriptstyle E }\) and \(\mu _k^{\scriptscriptstyle R }\) depend on scalars \(\phi _{{\scriptscriptstyle V}\!,\,k-1}^{\max }\), \(\phi _{{\scriptscriptstyle O}\!,\,k-1}^{\max }\) and \(\tau _{k-1}\) defined below. The magnitudes of \(\phi _{{\scriptscriptstyle V}\!,\,k-1}^{\max }\), \(\phi _{{\scriptscriptstyle O}\!,\,k-1}^{\max }\) and \(\tau _{k-1}\) reflect the distance of \((x_k,y_k)\) to an optimal point.

The multiplier estimate \(y_k^{\scriptscriptstyle E }\) is set to \(y_k\) if \((x_k,y_k)\) gives an improvement in a measure of the distance to a second-order solution \((x^*,y^*)\). The algorithm uses the feasibility and optimality measures \(\eta (x_k)\) and \(\omega (x_k,y_k,\xi _k)\) such that

$$\begin{aligned} \eta (x_k) &= \Vert c(x_k)\Vert , \;\;\text {and} \\ \omega (x_k,y_k,\xi _k) &= \max \left( \left\| \min (x_k^{},\,g(x_k^{})-J(x_k^{})^T\!y_k^{})\right\| , \;\xi _k^{}\right) . \end{aligned}$$
(12)
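In NumPy these measures read as follows, with g, J and cx standing for \(g(x_k)\), \(J(x_k)\) and \(c(x_k)\).

```python
import numpy as np

def eta_omega(x, y, xi, g, J, cx):
    """Feasibility and optimality measures (12)."""
    eta = np.linalg.norm(cx)
    omega = max(np.linalg.norm(np.minimum(x, g - J.T @ y)), xi)
    return eta, omega
```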

Given \(\eta (x_k)\) and \(\omega (x_k,y_k,\xi _k)\), weighted combinations of the feasibility and optimality measures are defined as

$$\begin{aligned} \phi _{\scriptscriptstyle V}(x_k,y_k)&= \eta (x_k) + \beta \omega (x_k,y_k,\xi _k), \;\;\text {and} \\ \phi _{\scriptscriptstyle O}(x_k,y_k,\xi _k)&= \beta \eta (x_k) + \omega (x_k,y_k,\xi _k), \end{aligned}$$

where \(\beta \) is a fixed scalar such that \(0 < \beta \ll 1\). (With this notation, “V” indicates a measure of the constraint violations and “O” denotes a measure of the distance to optimality.) The assignment \(y_k^{\scriptscriptstyle E }= y_k^{}\) is made if

$$\begin{aligned} \phi _{\scriptscriptstyle V}(v_k) \le {\textstyle \frac{1}{2}}\phi _{{\scriptscriptstyle V}\!,\,k-1}^{\max }\qquad \text {or}\qquad \phi _{\scriptscriptstyle O}(v_k,\xi _k) \le {\textstyle \frac{1}{2}}\phi _{{\scriptscriptstyle O}\!,\,k-1}^{\max }. \end{aligned}$$
(13)

The point \((x_k,y_k)\) is called a “V-iterate” if it satisfies the bound on \(\phi _{\scriptscriptstyle V}(v_k)\), and an “O-iterate” if it satisfies the bound on \(\phi _{\scriptscriptstyle O}(v_k,\xi _k)\). A “V-O iterate” is a point at which one or both of these conditions holds, and the associated iteration is called a “V-O iteration.” For a V-O iteration, new values are given by \(\tau _k = {\textstyle \frac{1}{2}}\tau _{k-1}\), and \(\phi _{{\scriptscriptstyle V}\!,\,k}^{\max }= {\textstyle \frac{1}{2}}\phi _{{\scriptscriptstyle V}\!,\,k-1}^{\max }\) or \(\phi _{{\scriptscriptstyle O}\!,\,k}^{\max }= {\textstyle \frac{1}{2}}\phi _{{\scriptscriptstyle O}\!,\,k-1}^{\max }\), depending on which of the inequalities in (13) holds. Also, the new regularization parameter is

$$\begin{aligned} \mu _k^{\scriptscriptstyle R }= {\left\{ \begin{array}{ll} \min \big (\,\mu ^{\scriptscriptstyle R }_0,\; \max \big (r_k^{}, \xi _k^{}\big )^\gamma \,\big ) &{} \;\;\text {if}\;\; \max \big (r_k^{}, \xi _k^{}\big ) > 0; \\ {\textstyle \frac{1}{2}}\mu _{k-1}^{\scriptscriptstyle R } &{} \;\;\text {otherwise}, \end{array}\right. } \end{aligned}$$
(14)

where \(r_k = r(x_k,y_k)\) is defined in (2).

If the conditions for a V-O iteration do not hold, a test is made to determine if \((x_k,y_k)\) is an approximate second-order solution of the problem

$$\begin{aligned} {\mathop {\mathrm{minimize}}\limits _{x,y}} \;\;M(x,y\mathop {;}y_{k-1}^{\scriptscriptstyle E },\mu _{k-1}^{\scriptscriptstyle R }) \;\;\;\hbox {subject to}\;\;x\ge 0. \end{aligned}$$
(15)

In particular, \((x_k,y_k)\) is tested using the conditions:

$$\begin{aligned} \Vert \min \big (x_k^{}, \nabla \!_x M(x_k^{},y_k^{}\mathop {;}y_{k-1}^{\scriptscriptstyle E },\mu _{k-1}^{\scriptscriptstyle R })\big )\Vert&\le \tau _{k-1}^{}, \end{aligned}$$
(16a)
$$\begin{aligned} \Vert \nabla \!_y M(x_k^{},y_k^{}\mathop {;}y_{k-1}^{\scriptscriptstyle E },\mu _{k-1}^{\scriptscriptstyle R })\Vert&\le \tau _{k-1}^{} \mu _{k-1}^{\scriptscriptstyle R }, \;\;\text {and}\;\; \end{aligned}$$
(16b)
$$\begin{aligned} \xi _k^{}&\le \tau _{k-1}^{}, \end{aligned}$$
(16c)

where \(\tau _{k-1}\) is a positive tolerance. If these conditions are satisfied, then \((x_k,y_k)\) is called an “M-iterate” and the parameters are updated as in a conventional augmented Lagrangian method, with the multiplier estimate \(y_{k-1}^{\scriptscriptstyle E }\) replaced by the safeguarded value

$$\begin{aligned} y_k^{\scriptscriptstyle E }= \max \big ( -y_{{\max }}e, \; \min (\, y_k, \; y_{{\max }}e\, ) \,\big ) \end{aligned}$$
(17)

for some large positive scalar constant \(y_{{\max }}\), and the new regularization parameter is given by

$$\begin{aligned} \mu _k^{\scriptscriptstyle R }= {\left\{ \begin{array}{ll} \min \big (\,{\textstyle \frac{1}{2}}\mu _{k-1}^{\scriptscriptstyle R },\; \max \big (r_k^{}, \xi _k^{}\big )^\gamma \,\big ), &{} \;\;\text {if}\;\; \max (r_k^{},\xi _k^{}) > 0; \\ {\textstyle \frac{1}{2}}\mu _{k-1}^{\scriptscriptstyle R }, &{} \;\;\text {otherwise}. \end{array}\right. } \end{aligned}$$
(18)

In addition, a new tolerance \(\tau _k\) is computed such that \(\tau _k = {\textstyle \frac{1}{2}}\tau _{k-1}\).

Finally, if neither (13) nor (16) are satisfied, then \(y_k^{\scriptscriptstyle E }= y_{k-1}^{\scriptscriptstyle E }\), \(\mu _k^{\scriptscriptstyle R }= \mu _{k-1}^{\scriptscriptstyle R }\), \(\phi _{{\scriptscriptstyle V}\!,\,k}^{\max }= \phi _{{\scriptscriptstyle V}\!,\,k-1}^{\max }\), \(\phi _{{\scriptscriptstyle O}\!,\,k}^{\max }= \phi _{{\scriptscriptstyle O}\!,\,k-1}^{\max }\), and \(\tau _k = \tau _{k-1}\). As the multiplier estimates and regularization parameter are fixed at their current values in this case, \((x_k,y_k)\) is called an “F-iterate”.
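The classification logic of this section may be summarized by the following schematic sketch. The state container st and the helper is_M_iterate, which tests conditions (16a)–(16c), are hypothetical stand-ins; the authoritative procedure is [14, Algorithm 5].

```python
import numpy as np

def classify_and_update(st, y, r, xi, eta, omega, is_M_iterate, beta=1e-3):
    """Classify (x_k, y_k) as a V-O, M-, or F-iterate and update
    yE, muR, tau, phiV_max, phiO_max accordingly."""
    phiV = eta + beta * omega                 # phi_V(x_k, y_k)
    phiO = beta * eta + omega                 # phi_O(x_k, y_k, xi_k)
    t = max(r, xi)
    if phiV <= 0.5 * st.phiV_max or phiO <= 0.5 * st.phiO_max:  # (13): V-O
        st.yE = np.array(y, dtype=float)
        if phiV <= 0.5 * st.phiV_max:
            st.phiV_max *= 0.5
        else:
            st.phiO_max *= 0.5
        st.tau *= 0.5
        st.muR = min(st.muR0, t**st.gamma) if t > 0 else 0.5 * st.muR   # (14)
    elif is_M_iterate(st):                                      # (16): M-iterate
        st.yE = np.clip(y, -st.y_max, st.y_max)                 # safeguard (17)
        st.muR = min(0.5 * st.muR, t**st.gamma) if t > 0 else 0.5 * st.muR  # (18)
        st.tau *= 0.5
    # otherwise an F-iterate: all parameters keep their current values
```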

2.2 Definition of the quadratic model and line-search direction

The bound-constrained problem (7) is not suitable for the calculation of a search direction because \(B(v_k^{}\mathop {;}\mu _k^{\scriptscriptstyle R })\) is not positive definite in general. A nonconvex QP can have many local minima and may be unbounded. In addition, the certification of a second-order solution of a nonconvex QP is computationally intractable in certain situations. These difficulties are avoided by approximating subproblem (7) by the convex QP

$$\begin{aligned} {\mathop {\mathrm{minimize}}\limits _{v}}\;\;\widehat{\mathcal {Q}}_k^{}(v \mathop {;}y_k^{\scriptscriptstyle E }, \mu _k^{\scriptscriptstyle R }) \;\;\;\hbox {subject to}\;\;[\,v\,]_i \ge 0, \quad i = 1 \,{:}\,n, \end{aligned}$$
(19)

where \(\widehat{\mathcal {Q}}_k^{}(v \mathop {;}y_k^{\scriptscriptstyle E }, \mu _k^{\scriptscriptstyle R })\) is the strictly convex quadratic model

$$\begin{aligned} \widehat{\mathcal {Q}}_k^{}(v \mathop {;}y_k^{\scriptscriptstyle E }, \mu _k^{\scriptscriptstyle R }) = \nabla \!M(v_k^{}\mathop {;}y_k^{\scriptscriptstyle E },\mu _k^{\scriptscriptstyle R })^T\!(v-v_k) + {\textstyle \frac{1}{2}}(v-v_k)^T \widehat{B}(v_k^{}\mathop {;}\mu _k^{\scriptscriptstyle R })(v-v_k), \end{aligned}$$
(20)

with \(\widehat{B}(v_k^{}\mathop {;}\mu _k^{\scriptscriptstyle R })\) a positive-definite approximation of \(B(v_k^{}\mathop {;}\mu _k^{\scriptscriptstyle R })\) of the form

$$\begin{aligned} \widehat{B}(v_k^{}\mathop {;}\mu _k^{\scriptscriptstyle R }) = \begin{pmatrix}\widehat{H}(x_k,y_k) + \frac{2}{\mu _k^{\scriptscriptstyle R }} J(x_k)^T\!J(x_k)&{} J(x_k)^T \\ J(x_k) &{} \mu _k^{\scriptscriptstyle R }I \end{pmatrix}, \end{aligned}$$
(21)

where \(\widehat{H}(x_k,y_k)\) is defined so that the matrix \(\widehat{B}(x_k^{},y_k^{}\mathop {;}\mu _k^{\scriptscriptstyle R })\) is positive definite, and \(\widehat{B}\!_{\scriptscriptstyle {\mathcal {F}_\epsilon }^{}\,}(x_k^{},y_k^{}\mathop {;}\mu _k^{\scriptscriptstyle R })\) is equal to \(B\!_{\scriptscriptstyle {\mathcal {F}_\epsilon }^{}\,}(x_k^{},y_k^{}\mathop {;}\mu _k^{\scriptscriptstyle R })\) if \(B\!_{\scriptscriptstyle {\mathcal {F}_\epsilon }^{}\,}(x_k^{},y_k^{}\mathop {;}\mu _k^{\scriptscriptstyle R })\) is positive definite. The matrix \(\widehat{B}\) is computed by a process known as “convexification” (see [16, Sect. 4] for details). If the unique solution of the subproblem (19) is denoted by \(\widehat{v}_k\), then the associated direction vector starting from \(v_k\) is given by \(d_k = \widehat{v}_k - v_k\). The vector \(d_k\) found by solving (19) is known as the global descent direction because of its crucial role in the proof of global convergence.
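The convexification procedure itself is outside the scope of this paper, but a minimal sketch conveys the idea: shift the leading block by a multiple of the identity until \(\widehat{B}\) is positive definite. The shift strategy below is an illustration only; the refined scheme of [16, Sect. 4] modifies far fewer entries.

```python
import numpy as np

def convexified_B(H, J, muR, tau0=1e-8):
    """A diagonal-shift sketch of the positive-definite B-hat of (21).

    B-hat is positive definite iff the Schur complement
    H-hat + (1/muR) J^T J of its muR*I block is positive definite,
    which is the condition tested by the Cholesky factorization below.
    """
    n, m = H.shape[0], J.shape[0]
    tau = 0.0
    while True:
        H_hat = H + tau * np.eye(n)
        try:
            np.linalg.cholesky(H_hat + (J.T @ J) / muR)  # Schur complement
            break
        except np.linalg.LinAlgError:
            tau = max(2.0 * tau, tau0)                   # increase the shift
    return np.block([[H_hat + 2.0 * (J.T @ J) / muR, J.T],
                     [J, muR * np.eye(m)]])
```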

An important property of the proposed method is the ability to compute a direction \(d_k\) from an alternative QP subproblem that has only equality constraints. The optimality conditions for the QP subproblem (7) at an optimal point \(\widehat{v}_k = v_k + d_k\) are given by

$$\begin{aligned}{}[\,\nabla \!{\mathcal Q}_k^{}(v_k^{}+ d_k^{}\mathop {;}y_k^{\scriptscriptstyle E }, \mu _k^{\scriptscriptstyle R })\,]_{\scriptscriptstyle \mathcal {F}}= 0, \quad [\,\nabla \!{\mathcal Q}_k^{}(v_k^{}+ d_k^{}\mathop {;}y_k^{\scriptscriptstyle E }, \mu _k^{\scriptscriptstyle R })\,]_{\scriptscriptstyle \mathcal {A}}\ge 0, \;\;\text {and}\;\; [\,v_k + d_k\,]_i \ge 0 \;\;\text {for}\;\; i = 1 \,{:}\,n, \end{aligned}$$
(22)

where \([\,\cdot \,]_{\scriptscriptstyle \mathcal {A}}\) and \([\,\cdot \,]_{\scriptscriptstyle \mathcal {F}}\) denote vectors with components from the active/free sets

$$\begin{aligned} \mathcal {A}(x) = \{ i :[\,x\,]_i=0 \} \quad \text {and}\quad \mathcal {F}(x) = \{ 1 \,{:}\,n+m \}\setminus \mathcal {A}(x), \end{aligned}$$
(23)

at \(\widehat{v}_k = v_k + d_k\). If strict complementarity does not hold for (NP), then some of the components of \(y^*\) associated with variables on their bounds may be zero, in which case some QPs defined at \(x_k\) near \(x^*\) may have multipliers that are close to zero. In this situation the QP algorithm may remove active-set indices associated with small negative multipliers at one outer iteration, only to add them again at the next. This inefficiency is prevented by using an approximate QP solution in which small negative multipliers are regarded as being optimal.

If \(B\!_{\scriptscriptstyle \mathcal {F}_\epsilon }\) is positive definite and \(v_k\) is a V-O iterate (in which case \(y_k^{\scriptscriptstyle E }=y_k^{}\)), the solution of the equality-constraint QP subproblem

$$\begin{aligned} {\mathop {\mathrm{minimize}}\limits _{v}}\;\;{\mathcal Q}_k^{}(v \mathop {;}y_k^{\scriptscriptstyle E }, \mu _k^{\scriptscriptstyle R }) \;\;\;\hbox {subject to}\;\;[\,v\,]_{\scriptscriptstyle \mathcal {A}_\epsilon }= 0, \end{aligned}$$
(24)

is unique. As in the case of a global descent direction, the solution \(\widehat{v}_k\) may be defined in terms of a step \(d_k\) from the point \(v_k\) using the optimality conditions

$$\begin{aligned}{}[\,v_k^{}+ d_k^{}\,]_{\scriptscriptstyle {\mathcal {A}_\epsilon }^{}\,}= 0, \qquad [\,\nabla \!{\mathcal Q}_k^{}(v_k^{}+ d_k^{}\mathop {;}y_k^{\scriptscriptstyle E }, \mu _k^{\scriptscriptstyle R })\,]_{\scriptscriptstyle {\mathcal {F}_\epsilon }^{}\,}= 0, \end{aligned}$$
(25)

with no nonnegativity restriction on the components of the gradient vector \([\,\nabla \!{\mathcal Q}_k^{}(v_k^{}+ d_k^{}\mathop {;}y_k^{\scriptscriptstyle E }, \mu _k^{\scriptscriptstyle R })\,]_{\scriptscriptstyle {\mathcal {A}_\epsilon }^{}\,}\). The unique direction satisfying these equations is referred to as the local descent direction. When computed, it is used as the vector \(d_k\) in the line search only if certain conditions hold. Let

$$\begin{aligned} t_k = r(x_k,y_k)^\lambda , \quad \text {where}\quad 0< \lambda< \min \{\gamma ,1-\gamma \} < 1, \end{aligned}$$
(26)

and \(\gamma \) is the parameter used in the definition (9) of the \(\epsilon \)-active set. The local descent direction \(d_k\) satisfying (25) is used in the line search when

$$\begin{aligned}{}[\,v_k + d_k\,]_i \ge 0, \;\;i=1 \,{:}\,n, \qquad [\,\nabla \!{\mathcal Q}_k^{}(v_k^{}+ d_k^{}\mathop {;}y_k^{\scriptscriptstyle E }, \mu _k^{\scriptscriptstyle R })\,]_{\scriptscriptstyle {\mathcal {A}_\epsilon }^{}\,}\ge -t_k^{}e, \;\;\text {and}\;\; \nabla \!M_k^T\!d_k^{}< 0. \end{aligned}$$
(27)

These conditions may be satisfied at any iterate, but are most likely to be satisfied in the neighborhood of a solution. If the local descent direction does not satisfy the conditions (27) and is therefore not selected for the line search, it is used to initialize the active-set method for solving (19). In this sense, the equality-constrained subproblem (24) is not an auxiliary subproblem, but one that must be solved anyway as part of the solution of the QP subproblem (19) (for more details, see Sect. 4).

The line-search direction \(\varDelta v_k\) is the sum of two vectors \(d_k\) and \(s_k\). The vector \(d_k\) is either the global descent direction or local descent direction as computed above. The vector \(s_k\), if nonzero, is a direction of negative curvature for the quadratic model \({\mathcal Q}_k^{}(v \mathop {;}y_{k-1}^{\scriptscriptstyle E }, \mu _{k-1}^{\scriptscriptstyle R })\). The vector \(s_k\) has the form \(s_k = (u_k,w_k)\) and is a scalar multiple of the vector \(s_k^{(1)}\) of (11) defined such that

$$\begin{aligned} s_k^T B(v_k^{}\mathop {;}\mu _k^{\scriptscriptstyle R })s_k^{}\le 0, \;\;\nabla \!M(v_k^{}\mathop {;}y_k^{\scriptscriptstyle E }, \mu _k^{\scriptscriptstyle R })^T\!s_k^{}\le 0, \;\;\text {and}\;\; [\,v_k + d_k + s_k\,]_i \ge 0, \;\;i = 1 \,{:}\,n. \end{aligned}$$
(28)

The direction \(s_k\) is zero if no negative curvature is detected, but \(s_k\) must be nonzero if \(\xi _k > 0\) and \(d_k=0\) (see [14, Lemma 2.2]), which ensures that the line-search direction is nonzero at a first-order stationary point \(v_k\) at which \(B\!_{\scriptscriptstyle {\mathcal {F}_\epsilon }^{}\,}(x_k^{},y_k^{}\mathop {;}\mu _{k-1}^{\scriptscriptstyle R })\) is not positive semidefinite.

2.3 Computation of the line-search step

Once the directions \(d_k\) and \(s_k\) have been computed, a flexible line search is performed based on the search direction \(\varDelta v_k = d_k + s_k\). (The idea of a flexible line search was proposed by Curtis and Nocedal [4] in the context of minimizing an \(l_1\) penalty function, and extended to the augmented Lagrangian function by Gill and Robinson [16].)

For a given line-search penalty parameter \(\mu \), an Armijo condition is used to define a reduction in the function \(\varPsi _k(\alpha \mathop {;}\mu ) = M(v_k^{}+ \alpha \varDelta v_k^{}\mathop {;}y_k^{\scriptscriptstyle E },\mu )\) that is at least as good as the reduction in the line-search model function

$$\begin{aligned} \psi _k(\alpha \mathop {;}\mu ,\ell _k) = \varPsi _k(0 \mathop {;}\mu ) + \alpha \varPsi _k'(0 \mathop {;}\mu ) + {\textstyle \frac{1}{2}}(\ell _k-1)\alpha ^2 \min \big ( 0, \varDelta v_k^T\!B(x_k^{}, y_k^{}\mathop {;}\mu _{k-1}^{\scriptscriptstyle R }) \varDelta v_k^{}\big ), \end{aligned}$$
(29)

where \(\varPsi _k'\) denotes the derivative with respect to \(\alpha \). The scalar \(\ell _k\) is either 1 or 2, depending on the order of the line-search model function. The value \(\ell _k = 1\) implies that \(\psi _k\) is an affine function, which gives a first-order line-search model. The value \(\ell _k = 2\) defines a quadratic \(\psi _k\) and gives a second-order line-search model. The first-order line-search model is used when \(d_k \ne 0\), \(s_k = 0\), and \((x_k,y_k)\) is a V-O iterate. This is crucial for the proof that the line-search algorithm returns the step length of one in the neighborhood of a second-order solution (see Theorem 2 below).

Given a fixed parameter \(\gamma _{\scriptscriptstyle S}\in (0,{\textstyle \frac{1}{2}})\), the flexible line search attempts to compute an \(\alpha _k\) that satisfies the modified Armijo condition

$$\begin{aligned} \varPsi _k^{}(0 \mathop {;}\mu _k^{\scriptscriptstyle F }) - \varPsi _k(\alpha _k^{}\mathop {;}\mu _k^{\scriptscriptstyle F }) \ge \gamma _{\scriptscriptstyle S}^{}\big (\,\psi _k^{}(0 \mathop {;}\mu _k^{\scriptscriptstyle R },\ell _k^{}) - \psi _k^{}(\alpha _k^{}\mathop {;}\mu _k^{\scriptscriptstyle R },\ell _k^{})\,\big ) \end{aligned}$$
(30)

for some \(\mu _k^{\scriptscriptstyle F }\in [\mu _k^{\scriptscriptstyle R },\mu _k^{}]\). The required step is found by repeatedly reducing \(\alpha _k\) by a constant factor until \(\rho _k(\alpha _k \mathop {;}\mu _k,\ell _k)\ge \gamma _{\scriptscriptstyle S}\) or \(\rho _k(\alpha _k^{}\mathop {;}\mu _k^{\scriptscriptstyle R },\ell _k)\ge \gamma _{\scriptscriptstyle S}\), where

$$\begin{aligned} \rho _k(\alpha \mathop {;}\mu ,\ell ) = \big (\varPsi _k(0 \mathop {;}\mu ) - \varPsi _k(\alpha \mathop {;}\mu )\big )/ \big (\,\psi _k^{}(0 \mathop {;}\mu _k^{\scriptscriptstyle R },\ell ) - \psi _k^{}(\alpha \mathop {;}\mu _k^{\scriptscriptstyle R },\ell )\,\big ). \end{aligned}$$

(Just prior to the line search, the line-search penalty parameter \(\mu _k\) is increased if necessary to ensure that \(\mu _k^{}\ge \mu _k^{\scriptscriptstyle R }\), i.e., \(\mu _k^{}=\max (\mu _k^{\scriptscriptstyle R },\mu _k^{})\).)
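A schematic transcription of this backtracking loop is given below; Psi and psi evaluate the merit function and the model (29), and the parameter values are illustrative (the full procedure is [14, Algorithm 4]).

```python
def flexible_linesearch(Psi, psi, mu, muR, gamma_S=0.25, shrink=0.5,
                        alpha_min=1e-12):
    """Backtracking loop for the flexible Armijo condition (30).

    Psi(alpha, mu_val) evaluates M(v_k + alpha*dv_k; yE_k, mu_val);
    psi(alpha) evaluates the model (29) defined with mu_k^R.
    """
    alpha = 1.0
    while alpha > alpha_min:
        model_decrease = psi(0.0) - psi(alpha)   # denominator of rho_k
        for mu_F in (mu, muR):                   # any mu_F in [muR, mu] works
            if Psi(0.0, mu_F) - Psi(alpha, mu_F) >= gamma_S * model_decrease:
                return alpha, mu_F
        alpha *= shrink                          # reduce by a constant factor
    return 0.0, muR
```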

The Armijo procedure is not executed in two situations. First, if \(d_k = s_k = 0\), then the step length is set at \(\alpha _k = 1\). Second, \(\alpha _k\) is set to zero if \(d_k = 0\), \(\nabla \!M(v_k^{}\mathop {;}y_k^{\scriptscriptstyle E },\mu _k^{\scriptscriptstyle R })^T s_k^{}= 0\), and the magnitude of the curvature of the merit function in the direction of \(s_k\) is not sufficiently large compared to \(\xi _k\), the magnitude of the curvature of the quadratic model. The magnitude of the negative curvature is considered to be insufficient if \(-s_k^T \nabla ^2\!M(v_k^{}\mathop {;}y_k^{\scriptscriptstyle E },\mu _k^{\scriptscriptstyle R })s_k^{}/\Vert u_k^{}\Vert ^2 \le \gamma _{\scriptscriptstyle S}\xi _k\), where \(u_k\) is the vector of first n components of \(s_k\). In either case, \(v_{k+1} = v_k\), and the regularization parameter used at the next iteration must satisfy \(\mu _k^{\scriptscriptstyle R }< \mu _{k-1}^{\scriptscriptstyle R }\) (see Lemmas 2.3(2) and 2.4(3) of [14]).

Once \(\alpha _k\) has been found, the next penalty parameter is set as

$$\begin{aligned} \mu _{k+1} = {\left\{ \begin{array}{ll} \mu _k, &{} \hbox {if} \; \rho _k(\alpha _k \mathop {;}\mu _k,\ell _k) \ge \gamma _{\scriptscriptstyle S}, \; \hbox {or} \; d_k = s_k = 0, \; \hbox {or} \; \alpha _k = 0; \\ \max \big ( {\textstyle \frac{1}{2}}\mu _k^{},\mu _k^{\scriptscriptstyle R }\big ), &{} \hbox {otherwise.} \end{array}\right. } \end{aligned}$$
(31)

The aim is to decrease the penalty parameter only when the merit function computed with \(\mu _k\) is not sufficiently reduced by the trial step.
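In code, the update (31) is a one-line test; the flag step_skipped below bundles the cases \(d_k = s_k = 0\) and \(\alpha _k = 0\).

```python
def next_penalty(mu, muR, rho, gamma_S, step_skipped):
    """Penalty update (31): mu is kept unless the merit decrease
    measured by rho was insufficient."""
    if rho >= gamma_S or step_skipped:
        return mu
    return max(0.5 * mu, muR)
```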

2.4 Algorithm summary

The computation associated with the kth iteration of the main algorithm may be arranged into seven principal steps. (A schematic sketch of the resulting outer iteration is given after the list.)

  1.

    Given \((x_k,y_k)\) and the regularization parameter \(\mu _{k-1}^{\scriptscriptstyle R }\) from the previous iteration, compute \(\mathcal {F}\!_\epsilon ^{}(x_k^{},y_k^{},\mu _{k-1}^{\scriptscriptstyle R })\) and \(B(v_k^{}\mathop {;}\mu _{k-1}^{\scriptscriptstyle R })\). Compute the nonnegative scalar \(\xi _k^{}\) and vector \(s_k^{(1)}\) such that \(s_k^{(1)T}\! B(v_k^{}\mathop {;}\mu _{k-1}^{\scriptscriptstyle R })s_k^{(1)} = -\xi _k^{}\Vert u_k^{(1)}\Vert ^2\), where \(\xi _k \ge 0\) and \(u_k^{(1)}\) is the vector of first n components of \(s_k^{(1)}\). If \(\xi _k^{}>0\), then \(\xi _k^{}\) approximates the magnitude of the “most negative” or “least” eigenvalue of \(B\!_{\scriptscriptstyle {\mathcal {F}_\epsilon }^{}\,}(v_k^{}\mathop {;}\mu _{k-1}^{\scriptscriptstyle R })\). If \(\xi _k^{}=0\), then \(s_k^{(1)}=0\). If \(B\!_{\scriptscriptstyle {\mathcal {F}_\epsilon }^{}\,}(v_k^{}\mathop {;}\mu _{k-1}^{\scriptscriptstyle R })\) is positive definite then \((\xi _k^{},s_k^{(1)}) = 0\). (See [14, Algorithm 1].)

  2.

    Terminate if the following conditions hold:

    $$\begin{aligned} r(x_k,y_k) \le \tau _{\scriptstyle \mathrm {stop}}, \;\;\;\xi _k \le \tau _{\scriptstyle \mathrm {stop}}, \quad \text {and}\quad \mu _{k-1}^{\scriptscriptstyle R } \le \tau _{\scriptstyle \mathrm {stop}}, \end{aligned}$$
    (32)

    where \(\tau _{\scriptstyle \mathrm {stop}}\) is a preassigned stopping criterion. If these conditions are satisfied, \(x_k\) is an approximate second-order KKT point.

  3.

    Compute \(y_k^{\scriptscriptstyle E }\) and \(\mu _k^{\scriptscriptstyle R }\) for the kth iteration based on the values \(\xi _k\), \(r(x_k,y_k)\), \(y_{k-1}^{\scriptscriptstyle E }\), \(\mu _{k-1}^{\scriptscriptstyle R }\), \(\phi _{{\scriptscriptstyle V}\!,\,k-1}^{\max }\), \(\phi _{{\scriptscriptstyle O}\!,\,k-1}^{\max }\) and \(\tau _{k-1}^{}\). Compute new values for \(\phi _{{\scriptscriptstyle V}\!,\,k}^{\max }\), \(\phi _{{\scriptscriptstyle O}\!,\,k}^{\max }\), \(\tau _k^{}\). (See Steps 13–24 of Algorithm 5 [14].)

  4.

    Terminate if \(x_k\) is an M-iterate such that

    $$\begin{aligned} \min \big (\Vert c(x_k)\Vert ,\tau _{\scriptstyle \mathrm {stop}}\big ) > \mu _k^{\scriptscriptstyle R }, \;\;\text {and}\;\; \Vert \min \big (x_k,J(x_k)^T\!c(x_k)\big )\Vert \le \tau _{\scriptstyle \mathrm {stop}}. \end{aligned}$$
    (33)

    If these conditions are satisfied, \(x_k\) is an approximate infeasible stationary point of the problem of minimizing \(\Vert c(x)\Vert ^2\) subject to \(x\ge 0\).

  5.

    Compute a positive-definite matrix \(\widehat{B}(v_k^{}\mathop {;}\mu _k^{\scriptscriptstyle R })\) such that \(\widehat{B}\!_{\scriptscriptstyle {\mathcal {F}_\epsilon }^{}\,}(x_k^{},y_k^{}\mathop {;}\mu _k^{\scriptscriptstyle R })=B\!_{\scriptscriptstyle {\mathcal {F}_\epsilon }^{}\,}(x_k^{},y_k^{}\mathop {;}\mu _k^{\scriptscriptstyle R })\) if the matrix \(B\!_{\scriptscriptstyle {\mathcal {F}_\epsilon }^{}\,}(x_k^{},y_k^{}\mathop {;}\mu _k^{\scriptscriptstyle R })\) is positive definite. Compute \(d_k = \widehat{v}_k - v_k\), where \(\widehat{v}_k\) is the solution of either the equality-constraint QP subproblem (24) or the strictly convex QP subproblem (19). In either case, \(d_k\) has the form \(d_k = (p_k,q_k)\), where the primal components \(p_k\) satisfy \(x_k + p_k \ge 0\). (See [14, Algorithm 2].)

  6.

    Rescale the direction \(s_k^{(1)}\) to give a feasible direction of negative curvature \(s_k = (u_k,w_k)\) satisfying (28). (See [14, Algorithm 3].)

  7.

    Perform a flexible line search along the vector \(\varDelta v_k = s_k + d_k = (u_k + p_k, w_k + q_k)\). (See [14, Algorithm 4].) Update the line-search penalty parameter \(\mu _k\) using (31).
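The following skeleton, promised above, strings the seven steps together. The bundle ops of callables is a hypothetical interface standing in for the procedures of [14, Algorithms 1–5]; it is a sketch of the control flow only.

```python
def outer_iteration(v, state, ops):
    """One outer iteration of the method, following steps 1-7 above."""
    # 1. Curvature estimate (11) for B on the eps-free set.
    xi, s1 = ops.curvature(v, state.muR)
    # 2. Convergence test (32).
    if ops.kkt_converged(v, xi, state):
        return v, 'approximate second-order KKT point'
    # 3. Classify the iterate and update yE_k, muR_k, tau_k (Sect. 2.1).
    ops.update_parameters(v, xi, state)
    # 4. Infeasible-stationary-point test (33).
    if ops.infeasible_stationary(v, state):
        return v, 'approximate infeasible stationary point'
    # 5. Descent direction from the equality QP (24) or the convex QP (19).
    d = ops.descent_direction(v, state)
    # 6. Feasible direction of negative curvature satisfying (28).
    s = ops.rescale_curvature(s1, v, d, state)
    # 7. Flexible line search along dv = d + s; penalty update (31).
    alpha = ops.line_search(v, d, s, state)
    return v + alpha * (d + s), None
```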

3 Local convergence

The analysis involves second-order sufficient conditions defined in terms of the sets of strongly-active variables \(\mathcal {A}_{\scriptscriptstyle +}\) and weakly-active variables \(\mathcal {A}_{\scriptscriptstyle 0}\):

$$\begin{aligned} \mathcal {A}_{\scriptscriptstyle +}(x,y) &= \{ i \in \mathcal {A}(x):[\,g(x) - J(x)^T\!y\,]_i > 0 \}, \\ \mathcal {A}_{\scriptscriptstyle 0}(x,y) &= \{ i \in \mathcal {A}(x):[\,g(x) - J(x)^T\!y\,]_i = 0 \}. \end{aligned}$$
(34)

Definition 1

(Second-order sufficient conditions (SOSC)) A primal-dual pair \((x^*\!,y^*)\) satisfies the second-order sufficient optimality conditions for problem (NP) if it is a first-order KKT pair (i.e., \(r(x^*\!,y^*)= 0\)) and

$$\begin{aligned} p^T\!H(x^*\!,y^*) p > 0 \;\;\text {for all}\;\; p\in \mathcal {C}(x^*\!,y^*)\setminus \{ 0 \}, \end{aligned}$$
(35)

where \(\mathcal {C}(x^*\!,y^*)=\mathrm{null}\!\big (J(x^*)\big )\cap \{p :p_i=0 \;\;\text {for}\;\; i\in \mathcal {A}_{\scriptscriptstyle +}(x^*\!,y^*), \;p_i\ge 0 \;\;\text {for}\;\; i\in \mathcal {A}_{\scriptscriptstyle 0}(x^*\!,y^*)\,\}\) is the critical cone.
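As an illustration, the SOSC can be verified numerically in the strictly complementary case, where \(\mathcal {A}_{\scriptscriptstyle 0}(x^*\!,y^*)\) is empty and the critical cone is a subspace; the sketch below makes this simplifying assumption (the general cone requires a copositivity test).

```python
import numpy as np

def sosc_strict(H, J, active, tol=1e-10):
    """Check (35) when A_0 is empty, so that the critical cone is the
    subspace {p : J p = 0, p_i = 0 for i in A(x*)}."""
    n = H.shape[0]
    free = [i for i in range(n) if i not in set(active)]
    if not free:
        return True                       # the cone is {0}
    JF = J[:, free]
    # Orthonormal basis Z for null(J_F) from the SVD of J_F.
    U, s, Vt = np.linalg.svd(JF)
    rank = int(np.sum(s > tol * max(JF.shape)))
    Z = Vt[rank:].T
    if Z.size == 0:
        return True                       # null(J_F) is trivial
    Hred = Z.T @ H[np.ix_(free, free)] @ Z
    return bool(np.min(np.linalg.eigvalsh(0.5 * (Hred + Hred.T))) > tol)
```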

The analysis of Gill, Kungurtsev and Robinson [14] establishes that the global convergence behavior of the method falls into one of two cases, depending on whether the set of V-O iterates is infinite or finite. If there are infinitely many V-O iterates, there exists a subsequence with limit point \(x^*\) that is either a first-order KKT point, or fails to satisfy the constant positive generator constraint qualification (CPGCQ). Moreover, if the Mangasarian-Fromovitz constraint qualification (MFCQ) holds at \(x^*\), then the associated subsequence of dual estimates is bounded with limit point \(y^*\) such that \((x^*\!,y^*)\) is a first-order KKT pair for problem (NP). If the weak constant rank condition (WCRC) holds in addition to the MFCQ (in which case the CPGCQ holds automatically), then \((x^*\!,y^*)\) is a second-order KKT pair. In the case that the set of V-O iterates is finite, there are infinitely many M-iterates, and every limit point \(x^*\) of this sequence is an infeasible stationary point.

The local convergence analysis given here focuses on sequences that converge to a first- or second-order KKT pair. (An analysis of the rate of convergence associated with sequences converging to locally infeasible points is beyond the scope of this paper.)

The results established in this section require three standing assumptions.

Assumption 1

f and c are twice Lipschitz-continuously differentiable.

Assumption 2

The index set \(\mathcal {S}\) of V-O iterates, i.e., the set of indices k at which condition (13) holds, is infinite, and there exists a subsequence \(\mathcal {S}_*\subseteq \mathcal {S}\), such that \(\lim _{k\in \mathcal {S}_*} (x_k,y_k) = (x^*\!,y^*)\), with \((x^*\!,y^*)\) a first-order KKT pair for problem (NP). (This assumption requires that the finite termination conditions (32) and (33) are omitted.)

Assumption 3

If \((x^*\!,y^*)\) is the first-order KKT pair in Assumption 2, then

  (i)

    there exists a compact set \(\varLambda (x^*)\subseteq \mathcal {Y}(x^*)\) such that \(y^*\) belongs to the (nonempty) interior of \(\varLambda (x^*)\) relative to \(\mathcal {Y}(x^*)\); and

  (ii)

    \((x^*\!,y)\) satisfies the SOSC of Definition 1 for every \(y\in \varLambda (x^*)\).

The key part of Assumption 3 is the existence of the compact set \(\varLambda (x^*)\), which guarantees that the closest point in \(\mathcal {Y}(x^*)\) to every element \(y_k\) of the subsequence \(\{\, y_k \,\}\) satisfying \(\lim _{k\rightarrow \infty }y_k =y^*\) is also in \(\varLambda (x^*)\) for k sufficiently large. This is equivalent to there being a set \(\mathcal {K}\), open relative to \(\mathcal {Y}(x^*)\), such that \(y^*\in \mathcal {K}\subset \varLambda (x^*)\). This, in turn, is equivalent to the assumption that the affine hulls of \(\varLambda (x^*)\) and \(\mathcal {Y}(x^*)\) are identical, with \(y^*\) in the relative interior of \(\varLambda (x^*)\). (For example, if \(m=3\) and \(\mathcal {Y}(x^*)\) is a ray of the form \(y=a+b t\) for \(a, b\in \mathbb {R}^3\), \(t\in (-\infty ,\infty )\), then \(\varLambda (x^*)\) could be a closed interval relative to the ray, e.g., \(\varLambda (x^*) = \{ y :y=a+bt \;\;\text {for}\;\; t\in [t_1,t_2]\,\}\).) Note that the set of multipliers \(\mathcal {Y}(x^*)\) need not be bounded. The second-order sufficient conditions need hold only for multipliers in a compact subset of \(\mathcal {Y}(x^*)\).

For any y, compactness of \(\varLambda (x^*)\) in Assumption 3 implies the existence of a vector \(y^*_{\scriptscriptstyle P}(y)\in \varLambda (x^*)\) that minimizes the distance from y to \(\varLambda (x^*)\), i.e.,

$$\begin{aligned} y^*_{\scriptscriptstyle P}(y) \in \mathop {\mathrm{Argmin}}\limits _{\bar{y}\in \varLambda (x^*)}\;\Vert y - \bar{y}\Vert . \end{aligned}$$
(36)

The existence of a vector \(y^*_{\scriptscriptstyle P}(y)\) implies that the distance \(\delta (x,y)\) of any primal-dual point (x, y) to the primal-dual solution set \(\mathcal {V}(x^*) = \{ x^*\} \times \varLambda (x^*)\) associated with \(x^*\) may be written in the form

$$\begin{aligned} \delta (x,y) = \min _{(\bar{x},\bar{y}) \in \mathcal {V}(x^*)} \Vert (x - \bar{x},y-\bar{y})\Vert = \Vert (x - x^*, y - y^*_{\scriptscriptstyle P}(y))\Vert . \end{aligned}$$
(37)

The pair \(\big (x^*\!,y^*_{\scriptscriptstyle P}(y)\big )\) satisfies the second-order sufficient conditions as a result of Assumption 3(ii). The following result shows that the proximity measure \(r(x,y)\) may be used as a surrogate for \(\delta (x,y)\) near \((x^*,y^*)\).

Lemma 1

([35, Theorem 3.2]) There exists a positive scalar \(\kappa \equiv \kappa (\varLambda (x^*))\) such that \(r(x_k,y_k)\in \big [\,\delta (x_k,y_k)/\kappa ,\, \delta (x_k,y_k)\kappa \,\big ]\) for all \(k\in \mathcal {S}_*\) sufficiently large.

Proof

Under the assumptions used here, the result follows from Theorem 3.2 of Wright [35], where Lemmas 2.1 and 2.2 of Gill, Kungurtsev and Robinson [13] are used to establish that the exact and estimated distance of \((x_k,y_k)\) to the primal-dual solution set used in [35] are equivalent (up to a scalar multiple) to the values \(\delta (x_k,y_k)\) and \(r(x_k,y_k)\) given here. \(\square \)

The principal steps of the local convergence analysis are summarized as follows. First, the properties of iterates with indices \(k\in \mathcal {S}_*\subseteq \mathcal {S}\) are considered. It is shown that for all \(k\in \mathcal {S}_*\) sufficiently large, the following results hold.

  (a)

    The active set at \(x^*\) is identified correctly by the \(\epsilon \)-active set, and the direction \(s_k\) of negative curvature is zero.

  (b)

    A local descent direction \(d_k\) is computed, and the conditions \([\,v_k + d_k\,]_i\ge 0, \;i = 1 \,{:}\,n, \;\;\nabla \!M_k^T\!d_k^{}< 0, \;\;\nabla \!{\mathcal Q}_k^{}(v_k^{}+ d_k^{}\mathop {;}y_k^{\scriptscriptstyle E }, \mu _k^{\scriptscriptstyle R })_{\scriptscriptstyle {\mathcal {A}_\epsilon }^{}\,}\ge -t_k^{}e\) are satisfied, i.e., the local descent direction is selected for the line search.

  (c)

    The unit step is accepted by the flexible line-search, and the variables active at \(x^*\) are the same as those active at \(x_{k+1}\).

Once (a)–(c) are established, the next step is to show that \((x_{k+1},y_{k+1})\) is a V-iterate. This implies that the arguments may be repeated at \(x_{k+1}\), and all iterates must be in \(\mathcal {S}_*\) for k sufficiently large. The final step is to show that the iterates are identical to those generated by an sSQP method for which superlinear convergence has been established.

The first result shows that for \(k\in \mathcal {S}_*\) sufficiently large, the set \({\mathcal {A}_\epsilon ^{}}\) correctly estimates the active set at \(x^*\). Moreover, for these iterations, the search direction does not include a contribution from the direction of negative curvature.

Lemma 2

The following results hold for all \(k\in \mathcal {S}_*\subseteq \mathcal {S}\) sufficiently large.

  (i)

    The measure \(r(x_k,y_k)\) of the distance to a first-order KKT point converges to zero, i.e., \(\lim _{k\in \mathcal {S}}\, r(x_k,y_k) = 0\).

  (ii)

    The \(\epsilon \)-active sets satisfy \({\mathcal {A}_\epsilon ^{}}(x_k^{},y_k^{},\mu _{k-1}^{\scriptscriptstyle R })={\mathcal {A}_\epsilon ^{}}(x_k^{},y_k^{},\mu _k^{\scriptscriptstyle R })=\mathcal {A}(x^*)\).

  (iii)

    The \(\epsilon \)-free sets satisfy \(\mathcal {F}\!_\epsilon ^{}(x_k^{},y_k^{},\mu _{k-1}^{\scriptscriptstyle R }) = \mathcal {F}\!_\epsilon ^{}(x_k^{},y_k^{},\mu _k^{\scriptscriptstyle R }) = \mathcal {F}(x^*)\).

  (iv)

    If the suffix “\(\scriptstyle \mathcal {F}\)” denotes the components corresponding to the set \(\mathcal {F}(x^*)\), then \(B_{\scriptscriptstyle \mathcal {F}}^{}(v_k^{}\mathop {;}\mu _{k-1}^{\scriptscriptstyle R })\) is positive definite, with \(s_k^{(1)} = 0\) and \(\xi _k^{}= 0\).

  (v)

    \(B\!_{\scriptscriptstyle \mathcal {F}_\epsilon }(v_k^{}\mathop {;}\mu _k^{\scriptscriptstyle R })\) is positive definite and a local descent direction is computed.

  (vi)

    The feasible direction of negative curvature \(s_k\) is zero.

Proof

A point \((x_k,y_k)\) is designated as a V-O iterate if the optimality and feasibility measures satisfy condition (13). In this case \(y_k^{\scriptscriptstyle E }\) is set to \(y_k\), and the values for \(\phi _{{\scriptscriptstyle V}\!,\,k}^{\max }\) or \(\phi _{{\scriptscriptstyle O}\!,\,k}^{\max }\) are decreased by a fixed factor. It follows that on the infinite set \(\mathcal {S}\) of V-O iterates, the condition (13) must hold infinitely often and at least one of the functions \(\phi _{\scriptscriptstyle V}(v_k)\) or \(\phi _{\scriptscriptstyle O}(v_k,\xi _k)\) must go to zero. The definitions of \(\phi _{\scriptscriptstyle V}(v_k)\) and \(\phi _{\scriptscriptstyle O}(v_k,\xi _k)\) in terms of the feasibility and optimality measures \(\eta (x_k)\) and \( \omega (x_k,y_k,\xi _k)\) imply that \(\lim _{k\in \mathcal {S}} \eta (x_k) = 0\) and \(\lim _{k\in \mathcal {S}} \omega (x_k,y_k,\xi _k) = 0\). The definition (2) of \(r(x_k,y_k)\) implies that \(\lim _{k\in \mathcal {S}} r(x_k,y_k) = 0\), which proves part (i). As \(r(x_k,y_k)\) goes to zero, Theorem 3.6(2) of [14] implies \(\lim _{k\in \mathcal {S}}\, \max \big (\mu _{k-1}^{\scriptscriptstyle R },r(x_k^{},y_k^{})^\gamma \big ) = \lim _{k\in \mathcal {S}}\, \max \big (\mu _k^{\scriptscriptstyle R }, r(x_k^{},y_k^{})^\gamma \big ) = 0\). If these limits are combined with (9), we obtain the inclusions \({\mathcal {A}_\epsilon ^{}}(x_k^{},y_k^{},\mu _{k-1}^{\scriptscriptstyle R })\subseteq \mathcal {A}(x^*)\) and \({\mathcal {A}_\epsilon ^{}}(x_k^{},y_k^{},\mu _k^{\scriptscriptstyle R })\subseteq \mathcal {A}(x^*)\) for \(k\in \mathcal {S}\) sufficiently large.

For the reverse inclusion, (9) together with \(\max \big (\mu _{k-1}^{\scriptscriptstyle R },r(x_k^{},y_k^{})^\gamma \big ) \ge r(x_k,y_k)^\gamma \) and \(\max \big (\mu _k^{\scriptscriptstyle R },r(x_k^{},y_k^{})^\gamma \big ) \ge r(x_k,y_k)^\gamma \), imply that for \(k\in \mathcal {S}\) sufficiently large, \(\mathcal {A}_{\gamma }(x_k,y_k) = \big \{ i :x_i \le r(x_k,y_k)^\gamma \big \}\) satisfies \(\mathcal {A}_{\gamma }(x_k,y_k)\subseteq {\mathcal {A}_\epsilon ^{}}(x_k^{},y_k^{},\mu _{k-1}^{\scriptscriptstyle R })\) and \(\mathcal {A}_{\gamma }(x_k,y_k)\subseteq {\mathcal {A}_\epsilon ^{}}(x_k^{},y_k^{},\mu _k^{\scriptscriptstyle R })\). The set \(\mathcal {A}_{\gamma }(x_k,y_k)\) is an active-set estimator that is equivalent (in the sense of Gill, Kungurtsev and Robinson [13, Lemma 2.2]) to the active-set estimator used by Wright [35], and Facchinei, Fischer, and Kanzow [8]. This equivalence allows the application of Theorem 3.3 of [35] to obtain the inclusions \(\mathcal {A}(x^*) \subseteq \mathcal {A}_{\gamma }(x_k,y_k) \subseteq {\mathcal {A}_\epsilon ^{}}(x_k^{},y_k^{},\mu _{k-1}^{\scriptscriptstyle R })\) and \(\mathcal {A}(x^*) \subseteq \mathcal {A}_{\gamma }(x_k,y_k) \subseteq {\mathcal {A}_\epsilon ^{}}(x_k^{},y_k^{},\mu _k^{\scriptscriptstyle R })\), which completes the proof of part (ii). Part (iii) follows directly from (ii) and the definition of the \(\epsilon \)-free set in (10).

For the proof of (iv) it is assumed that \(k\in \mathcal {S}_*\subseteq \mathcal {S}\) is sufficiently large that (ii) and (iii) hold. From Assumption 3, \((x^*,y^*)\) satisfies the SOSC and consequently, \(d^T\!H(x^*\!,y^*)d > 0\) for all \(d\ne 0\) such that \(J(x^*)d = 0\) and \(d_i = 0\) for every \(i\in \mathcal {A}(x^*)\), i.e., \(d_{\scriptscriptstyle \mathcal {F}}^T H\!_{\scriptscriptstyle \mathcal {F}}^{}(x^*\!,y^*) d_{\scriptscriptstyle \mathcal {F}}^{}> 0\) for all \(d_{\scriptscriptstyle \mathcal {F}}\ne 0\) satisfying \(J_{\scriptscriptstyle \mathcal {F}}(x^*) d_{\scriptscriptstyle \mathcal {F}}= 0\), where the suffix “\(\scriptstyle \mathcal {F}\)” denotes quantities associated with indices in \(\mathcal {F}(x^*)\). Under this assumption, together with the results of part (iii), Lemma 2.2 of [16], Lemma 3 of [19], and [14, part (2) of Theorem 3.6] imply that \(B_{\scriptscriptstyle \mathcal {F}}^{}(v_k^{}\mathop {;}\mu _{k-1}^{\scriptscriptstyle R })\) is positive definite for all \(k\in \mathcal {S}_*\) sufficiently large. If this matrix is positive definite, then \(s_k^{(1)} = 0\) and \(\xi _k^{}= 0\), as required.

As \(\{\,\mu _k^{\scriptscriptstyle R }\,\} \rightarrow 0\) (see [14, Theorem 3.6, part (2)]), an argument similar to that used to establish (iv) shows that \(B\!_{\scriptscriptstyle \mathcal {F}_\epsilon }(v_k^{}\mathop {;}\mu _k^{\scriptscriptstyle R })\) is positive definite for the same values of k (see Gill and Robinson [16, Lemma 2.2]). As \(B\!_{\scriptscriptstyle \mathcal {F}_\epsilon }(v_k^{}\mathop {;}\mu _k^{\scriptscriptstyle R })\) is positive definite for every \(k\in \mathcal {S}_*\subseteq \mathcal {S}\), and k is a V-O iterate by definition, the conditions that initiate the solution of the equality-constrained QP (24) are satisfied, and a local descent direction is computed. This proves part (v). Finally, part (iv) implies that \(s_k^{(1)}\) and its scaled counterpart \(s_k\) are zero, which proves part (vi).\(\square \)
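
To make the estimator in part (ii) concrete, the following minimal sketch computes the set \(\mathcal {A}_{\gamma }(x_k,y_k) = \{ i : x_i \le r(x_k,y_k)^\gamma \}\) used in the proof; it assumes that the residual \(r(x_k,y_k)\) of (2) is available as a scalar, and the function name and test data are hypothetical.

```python
import numpy as np

def active_set_estimate(x, r_xy, gamma=0.5):
    # A_gamma(x, y) = { i : x_i <= r(x, y)^gamma }, with r(x, y) the
    # primal-dual residual of (2) and gamma in (0, 1) as in (14).
    return np.where(x <= r_xy**gamma)[0]

# Hypothetical data: x near a minimizer whose components 1 and 3 are active.
x = np.array([0.7, 1.0e-6, 0.9, 2.0e-7])
print(active_set_estimate(x, r_xy=1.0e-4))   # -> [1 3]
```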

The next result shows that \(d_k\) is nonzero for certain types of iteration.

Lemma 3

For all \(k\in \mathcal {S}_*\subseteq \mathcal {S}\) sufficiently large, it must hold that either \(d_k \ne 0\) or \((x_k,y_k) = (x^*\!,y^*)\).

Proof

The result holds trivially if \(d_k \ne 0\) for all \(k\in \mathcal {S}_*\) sufficiently large. Otherwise, there exists an infinite sequence \(\mathcal {S}_2 \subseteq \mathcal {S}_*\) such that \(d_k = 0\) for every \(k\in \mathcal {S}_2\). Parts (ii) and (vi) of Lemma 2 imply that \({\mathcal {A}_\epsilon ^{}}(x_k^{},y_k^{},\mu _k^{\scriptscriptstyle R }) = \mathcal {A}(x^*)\) and \(s_k = 0\) for all \(k\in \mathcal {S}_2\) sufficiently large. Every \(k\in \mathcal {S}_2\) is a V-O iterate and there must exist an index \(k_2\in \mathcal {S}_2\) sufficiently large that

$$\begin{aligned} d_{k_2}^{}= & {} s_{k_2}^{}= 0, \;\;(x_{k_2+1}^{},y_{k_2+1}^{}) = (x_{k_2}^{},y_{k_2}^{}),\;\;\nonumber \\ y^{\scriptscriptstyle E }_{k_2}= & {} y_{k_2}^{}, \;\;\text {and}\;\; {\mathcal {A}_\epsilon ^{}}(x_{k_2}^{},y_{k_2}^{},\mu ^{\scriptscriptstyle R }_{k_2}) = \mathcal {A}(x^*). \end{aligned}$$
(38)

As \(d_{k_2}=0\), parts (ia) and (ib) of Lemma 2.3 in [14] give \(r(x_{k_2},y_{k_2}) = 0\), which implies that \((x_{k_2},y_{k_2})\) is a first-order KKT point for both problem (NP) and the problem of minimizing \(M(x,y\mathop {;}y^{\scriptscriptstyle E }_{k_2},\mu ^{\scriptscriptstyle R }_{k_2})\) subject to \(x \ge 0\). From (38) it must hold that \(r(x_{k_2+1},y_{k_2+1}) = 0\), and parts (iii) and (iv) of Lemma 2 imply that \(B_{\scriptscriptstyle \mathcal {F}}(x_{k_2+1}^{},y_{k_2+1}^{}\mathop {;}\mu ^{\scriptscriptstyle R }_{k_2})\) is positive definite, with \(\xi _{k_2+1} = 0\) and \(s^{(1)}_{k_2+1} = 0\). It follows that \(\phi _{\scriptscriptstyle V}(x_{k_2+1},y_{k_2+1}) = 0\), and \(k_2+1\) is a V-iterate from condition (13). As a result, \(y^{\scriptscriptstyle E }_{k_2+1} = y^{\scriptscriptstyle E }_{k_2}\) and \(\mu ^{\scriptscriptstyle R }_{k_2+1} = {\textstyle \frac{1}{2}}\mu ^{\scriptscriptstyle R }_{k_2}\), which implies that the primal-dual pair \((x_{k_2+1},y_{k_2+1}) = (x_{k_2},y_{k_2})\) is not only a first-order KKT point for problem (NP), but also a first-order solution of the problem of minimizing \(M(x,y\mathop {;}y^{\scriptscriptstyle E }_{k_2+1},\mu ^{\scriptscriptstyle R }_{k_2+1})\) subject to \(x \ge 0\). In particular, it must hold that \(d_{k_2+1} = 0\), and \(s_{k_2+1} = 0\) because \(\xi _{k_2+1} = 0\). Similarly, it must hold that \({\mathcal {A}_\epsilon ^{}}(x_{k_2+1}^{},y_{k_2+1}^{},\mu ^{\scriptscriptstyle R }_{k_2+1}) = \mathcal {A}(x^*)\).

This argument may be repeated at every \((x_k,y_k)\) such that \(k\ge k_2+1\), and it must hold that \((x_k,y_k) = (\bar{x},\bar{y})\) for some \((\bar{x},\bar{y})\), and that \({\mathcal {A}_\epsilon ^{}}(x_k^{},y_k^{},\mu _k^{\scriptscriptstyle R }) = \mathcal {A}(x^*)\) for every \(k \ge k_2\). It then follows from Assumption 3 that \((\bar{x},\bar{y}) = (x^*\!,y^*)\), which completes the proof. \(\square \)

For a local convergence analysis, Lemma 3 implies that there is no loss of generality in making the following additional standing assumption.

Assumption 4

The vector \(d_k\) is nonzero for all \(k\in \mathcal {S}_*\subseteq \mathcal {S}\) sufficiently large.

Lemma 4

It must hold that \(\mu _k^{\scriptscriptstyle R }=r(x_k,y_k)^\gamma > 0\) for all \(k\in \mathcal {S}_*\subseteq \mathcal {S}\) sufficiently large.

Proof

Part (iv) of Lemma 2 gives \(\xi _k = 0\) for all \(k\in \mathcal {S}_*\subseteq \mathcal {S}\) sufficiently large. In addition, \(r(x_k,y_k)\) must be nonzero, otherwise the definition of \(r(x_k,y_k)\) would imply that \(c(x_k) = 0\), \(y_k^{\scriptscriptstyle E }= y_k^{}\) (because \(k\in \mathcal {S}\)), \(\pi (x_k^{},y_k^{\scriptscriptstyle E },\mu _k^{\scriptscriptstyle R }) = y_k^{}\), \(\nabla \!_y M(x_k^{},y_k^{}\mathop {;}y_k^{\scriptscriptstyle E },\mu _k^{\scriptscriptstyle R }) = 0\), and \(\min \big (x_k,\nabla \!_xM(x_k^{},y_k^{}\mathop {;}y_k^{\scriptscriptstyle E },\mu _k^{\scriptscriptstyle R })\big ) = 0\). In other words, if \(r(x_k,y_k)\) is zero, then \((x_k,y_k)\) satisfies the first-order conditions for a minimizer of \(M(x,y\mathop {;}y_k^{\scriptscriptstyle E },\mu _k^{\scriptscriptstyle R })\) subject to \(x \ge 0\). This implies that there is no nonzero descent direction at \((x_k,y_k)\), which contradicts Assumption 4. It follows that \(r(x_k,y_k)\) is nonzero. The values \(\xi _k = 0\) and \(r(x_k,y_k)>0\) in the definition of \(\mu _k^{\scriptscriptstyle R }\) in (14), and part (i) of Lemma 2 imply that \(\mu _k^{\scriptscriptstyle R }=r(x_k^{},y_k^{})^\gamma \) for \(\gamma \in (0,1)\) and \(k\in \mathcal {S}_*\subseteq \mathcal {S}\) sufficiently large. \(\square \)

Much of the local convergence analysis involves establishing that, in the limit, the algorithm computes and accepts the local descent direction at every iteration. The next result concerns the properties of the equality-constrained subproblem for the local descent direction.

Lemma 5

If \(v_k = (x_k,y_k)\) is a point at which the conditions for the calculation of a local descent direction are satisfied, then the following results hold.

(i):

The bound-constrained problem (24) for the local descent direction is equivalent to the stabilized QP subproblem

$$\begin{aligned} \begin{aligned} {\mathop {\mathrm{minimize}}\limits _{x,y}}&\;\;g(x_k)^T\!(x-x_k) + {\textstyle \frac{1}{2}}(x-x_k)^T\!H(x_k,y_k)(x-x_k) + {\textstyle \frac{1}{2}}\mu _k^{\scriptscriptstyle R }\Vert y\Vert ^2\\ \hbox {subject to}&\;\;c(x_k^{}) + J(x_k^{})(x-x_k^{}) + \mu _k^{\scriptscriptstyle R }(y - y_k^{}) = 0, \quad E_{\scriptscriptstyle \mathcal {A}_\epsilon }^T x = 0, \end{aligned} \end{aligned}$$
(39)

where \(E_{\scriptscriptstyle \mathcal {A}_\epsilon }\) is the matrix of columns of the identity matrix with indices in the \(\epsilon \)-active set \(\mathcal {A}_\epsilon \).

(ii):

If \(d_k = (p_k,q_k)\) is the local descent direction, and \(z_k = g(x_k) - J(x_k)^T\!y_k\), then the optimal solution to (39) may be written as \((x_k+p_k\), \(y_k+q_k\), \([\,z_k\,]_{\scriptscriptstyle \mathcal {A}_\epsilon }+w_k)\), where \((p_k,q_k,w_k)\) satisfy the nonsingular equations

$$\begin{aligned} \begin{pmatrix} H(x_k,y_k) &{} J(x_k)^T\!&{} E_{\scriptscriptstyle \mathcal {A}_\epsilon }\\ J(x_k) &{} -\mu _k^{\scriptscriptstyle R }I &{} 0 \\ E_{\scriptscriptstyle \mathcal {A}_\epsilon }^T &{} 0 &{} 0 \end{pmatrix} \begin{pmatrix}\,p_k \\ - q_k \\ - w_k \end{pmatrix} = -\begin{pmatrix} g(x_k)-J(x_k)^T\!y_k - z^p_k \\ c(x_k) \\ [\,x_k\,]_{\scriptscriptstyle \mathcal {A}_\epsilon }\end{pmatrix}, \end{aligned}$$

with \(z^p_k = E_{\scriptscriptstyle {\mathcal {A}_\epsilon }^{}\,}\!E_{\scriptscriptstyle \mathcal {A}_\epsilon }^T z_k^{}\), i.e., \(z^p_k\) is the projection of \(z_k\) onto \(\mathrm{range}(E_{\scriptscriptstyle \mathcal {A}_\epsilon })\).

Proof

Part (i) follows from the specialization of Result 2.1 of Gill and Robinson [15] to the equality-constraint case. The equations of part (ii) are then the optimality conditions associated with (39). It remains to show that the equations are nonsingular. The vector \((p_k,q_k)\) is the unique solution of (39) if the primal-dual Hessian of problem (39) is positive definite on the null-space of the constraints, which in this case is the set of vectors satisfying \(J(x_k^{}) p + \mu _k^{\scriptscriptstyle R }q = 0\) and \(E_{\scriptscriptstyle \mathcal {A}_\epsilon }^T p = 0\). This corresponds to the requirement that

$$\begin{aligned}&\begin{pmatrix} p_{\scriptscriptstyle \mathcal {F}_\epsilon }\\ q \end{pmatrix}^T \begin{pmatrix} H\!_{\scriptscriptstyle \mathcal {F}_\epsilon }(x_k,y_k) &{} 0 \\ 0 &{} \mu _k^{\scriptscriptstyle R }I \end{pmatrix} \begin{pmatrix} p_{\scriptscriptstyle \mathcal {F}_\epsilon }\\ q \end{pmatrix} \\&\quad = p_{\scriptscriptstyle \mathcal {F}_\epsilon }^T H\!_{\scriptscriptstyle {\mathcal {F}_\epsilon }^{}\,}(x_k^{},y_k^{}) p_{\scriptscriptstyle {\mathcal {F}_\epsilon }^{}\,}+ \frac{1}{\mu _k^{\scriptscriptstyle R }}p_{\scriptscriptstyle \mathcal {F}_\epsilon }^T J_{\scriptscriptstyle {\mathcal {F}_\epsilon }^{}\,}(x_k^{})^T\!J_{\scriptscriptstyle {\mathcal {F}_\epsilon }^{}\,}(x_k^{}) p_{\scriptscriptstyle {\mathcal {F}_\epsilon }^{}\,}> 0. \end{aligned}$$

Gill and Robinson [15, Lemma 2.2] show that \(H\!_{\scriptscriptstyle {\mathcal {F}_\epsilon }^{}\,}(x_k^{},y_k^{}) + (1/\mu _k^{\scriptscriptstyle R })J_{\scriptscriptstyle {\mathcal {F}_\epsilon }^{}\,}(x_k^{})^T\!J_{\scriptscriptstyle {\mathcal {F}_\epsilon }^{}\,}(x_k^{})\) is positive definite if \(B\!_{\scriptscriptstyle \mathcal {F}_\epsilon }\) is positive definite, which is one of the conditions that must be satisfied for a local descent direction to be computed. \(\square \)
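
As an illustration of Lemma 5(ii), the following sketch assembles and solves the system (40) for \((p_k,q_k,w_k)\) using dense linear algebra; a practical implementation would use a symmetric indefinite factorization of the KKT matrix. All names are hypothetical, and the inputs are the problem data evaluated at \((x_k,y_k)\).

```python
import numpy as np

def local_descent_direction(H, J, c, g, x, y, A_eps, mu):
    # Solve the KKT system of Lemma 5(ii) (cf. (40)).  The unknown vector
    # is (p, -q, -w), so the signs of q and w are flipped after the solve.
    n, m = H.shape[0], J.shape[0]
    E = np.eye(n)[:, A_eps]              # columns of I with indices in A_eps
    nA = E.shape[1]
    z = g - J.T @ y                      # z_k = g(x_k) - J(x_k)^T y_k
    zp = E @ (E.T @ z)                   # projection of z_k onto range(E)
    K = np.block([[H,   J.T,                E],
                  [J,  -mu * np.eye(m),    np.zeros((m, nA))],
                  [E.T, np.zeros((nA, m)), np.zeros((nA, nA))]])
    rhs = -np.concatenate([z - zp, c, x[A_eps]])
    sol = np.linalg.solve(K, rhs)
    p, q, w = sol[:n], -sol[n:n+m], -sol[n+m:]
    return p, q, w
```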

The next result shows that two of the three conditions in (27) for acceptance of the local descent direction hold for all \(k\in \mathcal {S}_*\) sufficiently large.

Lemma 6

For all \(k\in \mathcal {S}_*\subseteq \mathcal {S}\) sufficiently large, a local descent direction \(d_k=(p_k,q_k)\) is computed that satisfies the following conditions:

(i):

\(\max \{\Vert p_k\Vert ,\Vert q_k\Vert \} = O\big (\delta (x_k,y_k)\big )\); and

(ii):

\(x_k+p_k\ge 0\), \([\,\nabla \!{\mathcal Q}_k^{}(v_k^{}+ d_k^{}\mathop {;}y_k^{\scriptscriptstyle E }, \mu _k^{\scriptscriptstyle R })\,]_{\scriptscriptstyle {\mathcal {A}_\epsilon }^{}\,}\ge -t_k^{}e\), where \(t_k\) is the positive feasibility parameter (26), and \([\,\cdot \,]_{\scriptscriptstyle \mathcal {A}_\epsilon }\) denotes the vector of components with indices in the \(\epsilon \)-active set \({\mathcal {A}_\epsilon ^{}}(x_k^{},y_k^{},\mu _k^{\scriptscriptstyle R })\).

Proof

Lemma 5 implies that the local descent direction \((p_k,q_k)\) satisfies

$$\begin{aligned} \begin{pmatrix} H(x_k,y_k) &{} J(x_k)^T\!&{} E_{\scriptscriptstyle \mathcal {A}_\epsilon }\\ J(x_k) &{} -\mu _k^{\scriptscriptstyle R }I &{} 0 \\ E_{\scriptscriptstyle \mathcal {A}_\epsilon }^T &{} 0 &{} 0 \end{pmatrix} \begin{pmatrix}\,p_k \\ - q_k \\ - w_k \end{pmatrix} = -\begin{pmatrix} g(x_k)-J(x_k)^T\!y_k - z^p_k \\ c(x_k)\\ [\,x_k\,]_{\scriptscriptstyle \mathcal {A}_\epsilon }\end{pmatrix}, \end{aligned}$$
(40)

where \([\,z_k\,]_{\scriptscriptstyle \mathcal {A}_\epsilon }+ w_k\) is the vector of multipliers for \(E_{\scriptscriptstyle \mathcal {A}_\epsilon }^T x = 0\) of problem (39). Let \(\widetilde{\mu }_k\) denote the scalar \(\widetilde{\mu }(x_k,y_k,z_k) = \Vert (g(x_k)-J(x_k)^T\!y_k - z^p_k, c(x_k), [\,x_k\,]_{\scriptscriptstyle \mathcal {A}_\epsilon })\Vert _1\). Equations (40) constitute a perturbation of the linear system

$$\begin{aligned} \begin{pmatrix} H(x_k,y_k) &{} J(x_k)^T &{} E_{\scriptscriptstyle \mathcal {A}_\epsilon }\\ J(x_k) &{} -\widetilde{\mu }_k I&{} 0 \\ E_{\scriptscriptstyle \mathcal {A}_\epsilon }^T &{} 0 &{} -\widetilde{\mu }_k I \end{pmatrix} \begin{pmatrix} \,\widetilde{p}_k \\ - \widetilde{q}_k \\ - \widetilde{w}_k \end{pmatrix} = -\begin{pmatrix} g(x_k) - J(x_k)^T\!y_k - z^p_k \\ c(x_k) \\ [\,x_k\,]_{\scriptscriptstyle \mathcal {A}_\epsilon }\end{pmatrix}, \end{aligned}$$
(41)

which characterize the optimality conditions for the sSQP subproblem associated with the equality constrained problem

$$\begin{aligned} {\mathop {\mathrm{minimize}}\limits _{x}}\;f(x) \;\;\hbox {subject to}\;\;c(x)= 0, \;\;\text {and}\;\; [\,x\,]_{\scriptscriptstyle {\mathcal {A}_\epsilon }^{}\,}= E_{\scriptscriptstyle \mathcal {A}_\epsilon }^T x = 0. \end{aligned}$$
(42)

The matrix of (41) is nonsingular and the equations have a unique solution (see Izmailov and Solodov [24, Lemma 2]). In addition, it follows from Wright [35, Lemma 4.1], Gill, Kungurtsev and Robinson [13, Lemma 2.3], and Lemma 1 that the unique solution of (41) satisfies

$$\begin{aligned} \Vert (\widetilde{p}_k,\widetilde{q}_k)\Vert \le \Vert (\widetilde{p}_k,\widetilde{q}_k,\widetilde{w}_k)\Vert = O(\widetilde{\mu }_k) = O\big (\delta (x_k,y_k)\big ) = O\big (r(x_k,y_k)\big ). \end{aligned}$$
(43)

The underlying quadratic program associated with (40) satisfies the second-order sufficient conditions for optimality. Under this condition, Izmailov [20, Theorem 2.3] establishes the Lipschitz error bound for the perturbed solutions as

$$\begin{aligned} \Vert (p_k - \widetilde{p}_k, q_k - \widetilde{q}_k)\Vert&\le \Vert (p_k - \widetilde{p}_k, q_k - \widetilde{q}_k, w_k - \widetilde{w}_k)\Vert \\&= O(\Vert \widetilde{\mu }_k \widetilde{w}_k + \big ( \mu _k^{\scriptscriptstyle R }-\widetilde{\mu }_k^{}\big )(q_k - \widetilde{q}_k)\Vert ). \end{aligned}$$

Lemma 4 gives \(\mu _k^{\scriptscriptstyle R }=r(x_k,y_k)^\gamma \) for \(\gamma \in (0,1)\). It then follows from Lemma 2.3 of Gill, Kungurtsev and Robinson [13], the bound (43) and Lemma 1 that

$$\begin{aligned} \Vert (p_k - \widetilde{p}_k, q_k - \widetilde{q}_k)\Vert = O\big (\delta (x_k,y_k) + r(x_k,y_k)^\gamma \Vert q_k - \widetilde{q}_k\Vert \big ). \end{aligned}$$
(44)

The triangle inequality, (44), and (43) imply the existence of constants \(\kappa _1\) and \(\kappa _2\) that satisfy

$$\begin{aligned} \Vert p_k\Vert +\Vert q_k\Vert&\le \Vert p_k-\widetilde{p}_k\Vert +\Vert q_k - \widetilde{q}_k\Vert +\Vert \widetilde{p}_k\Vert +\Vert \widetilde{q}_k\Vert \end{aligned}$$
(45)
$$\begin{aligned}&\le \kappa _1 \delta (x_k,y_k) + \kappa _2 r(x_k,y_k)^\gamma \Vert q_k-\widetilde{q}_k\Vert . \end{aligned}$$
(46)

Part (i) of Lemma 2 implies that \(1-\kappa _2r(x_k,y_k)^\gamma \ge {\textstyle \frac{1}{2}}\) for \(k\in \mathcal {S}_*\) sufficiently large. This inequality may be used to derive the bound

$$\begin{aligned}&\Vert p_k-\widetilde{p}_k\Vert + {\textstyle \frac{1}{2}}\Vert q_k-\widetilde{q}_k\Vert + \Vert \widetilde{p}_k\Vert + \Vert \widetilde{q}_k\Vert \\&\quad \le \Vert p_k-\widetilde{p}_k\Vert + \big (1-\kappa _2 r(x_k,y_k)^\gamma \big )\Vert q_k-\widetilde{q}_k\Vert + \Vert \widetilde{p}_k\Vert +\Vert \widetilde{q}_k\Vert . \end{aligned}$$

This upper bound may be simplified using the bound on \(\Vert p_k-\widetilde{p}_k\Vert +\Vert q_k - \widetilde{q}_k\Vert +\Vert \widetilde{p}_k\Vert +\Vert \widetilde{q}_k\Vert \) from (45)–(46), giving

$$\begin{aligned} \Vert p_k-\widetilde{p}_k\Vert + {\textstyle \frac{1}{2}}\Vert q_k-\widetilde{q}_k\Vert + \Vert \widetilde{p}_k\Vert +\Vert \widetilde{q}_k\Vert \le \kappa _1\delta (x_k,y_k). \end{aligned}$$

The quantity \({\textstyle \frac{1}{2}}(\Vert p_k\Vert +\Vert q_k\Vert )\) may be bounded using arguments similar to those used for (45). In this case,

$$\begin{aligned} {\textstyle \frac{1}{2}}(\Vert p_k\Vert +\Vert q_k\Vert ) \le \Vert p_k-\widetilde{p}_k\Vert + {\textstyle \frac{1}{2}}\Vert q_k - \widetilde{q}_k\Vert +\Vert \widetilde{p}_k\Vert +\Vert \widetilde{q}_k\Vert \le \kappa _1\delta (x_k,y_k), \end{aligned}$$

which implies that \(\max \{\,\Vert p_k\Vert , \Vert q_k\Vert \,\} = O\big (\delta (x_k,y_k)\big )\), and proves part (i).

The second inequality to be established for part (ii) may be written equivalently as \([\,\nabla \!M_k + B_k d_k \,]_{\scriptscriptstyle \mathcal {A}_\epsilon }\ge -t_k e\), where \(\nabla \!M_k = \nabla \!M(v_k^{}\mathop {;}y_k^{\scriptscriptstyle E },\mu _k^{\scriptscriptstyle R })\) and \(B_k = B(v_k^{},\mu _k^{\scriptscriptstyle R })\). The proof requires estimates of the components of \([\,\nabla \!M_k + B_k d_k \,]_{\scriptscriptstyle \mathcal {A}_\epsilon }\). After simplification, the substitution of the quantities \(B_k\), \(\nabla \!M_k\) and \(d_k = (p_k,q_k)\), together with the identity \(J(x_k^{})p_k^{}+ \mu _k^{\scriptscriptstyle R }q_k^{}= - c(x_k^{})\) from (40), gives

$$\begin{aligned}{}[\,\nabla \!M_k + B_k d_k \,]_{\scriptscriptstyle \mathcal {A}_\epsilon }= \Big [z_k + \frac{1}{\mu _k^{\scriptscriptstyle R }} J(x_k)^T\!c(x_k) + H(x_k,y_k) p_k + \frac{1}{\mu _k^{\scriptscriptstyle R }} J(x_k)^T\!J(x_k) p_k\Big ]_{\scriptscriptstyle \mathcal {A}_\epsilon }, \end{aligned}$$
(47)

where \(z_k = g(x_k) - J(x_k)^T y_k\). The first part of the proof involves the estimation of a lower bound on the vector \(z_k + (1/\mu _k^{\scriptscriptstyle R }) J(x_k)^T c(x_k)\). The definition of \(y^*_{\scriptscriptstyle P}(\cdot )\) and the fact that \((x^*\!,y^*)\) is a first-order KKT pair for problem (NP) imply that the vector \(g(x^*) - J(x^*)^T\!y^*_{\scriptscriptstyle P}(y_k^{})\) is nonnegative, with

$$\begin{aligned} -[\,z_k\,]_i&= - [\,g(x_k)-J(x_k)^T\!y_k\,]_i \\&\le - \big [ g(x_k)-J(x_k)^T\!y_k - \big (g(x^*) - J(x^*)^T\!y^*_{\scriptscriptstyle P}(y_k^{})\big )\big ]_i^{}\\&\le - [\,g(x_k)-J(x_k)^T\!y_k + J(x_k)^T\!y^*_{\scriptscriptstyle P}(y_k^{}) -J(x_k)^T\!y^*_{\scriptscriptstyle P}(y_k^{}) \\&\quad \qquad - g(x^*)+J(x^*)^T\!y^*_{\scriptscriptstyle P}(y_k^{})\,]_i^{}. \end{aligned}$$

From Assumptions 1–3, \(\Vert J(x_k)\Vert \) is bounded independently of k and the functions g and J are Lipschitz continuous. It follows that there exist positive constants \(\kappa _3\), \(\kappa _4\), and \(\kappa _5\) such that

$$\begin{aligned} -[\,z_k\,]_i \le \kappa _3\Vert x_k-x^*\Vert +\kappa _4\Vert y_k-y^*_{\scriptscriptstyle P}(y_k)\Vert \le \kappa _5 \delta (x_k,y_k), \end{aligned}$$
(48)

where the last inequality follows from the definition (37) of \(\delta (x_k,y_k)\). As the sequence of iterates satisfies \(\lim _{k\in \mathcal {S}_*} (x_k,y_k) = (x^*\!,y^*)\) and \(\lim _{k\in \mathcal {S}_*} y^*_{\scriptscriptstyle P}(y_k^{}) = y^*\), for \(k\in \mathcal {S}_*\) sufficiently large, the assumptions of Lemma 1 apply, and

$$\begin{aligned} -[\,z_k\,]_i \le \kappa _5 \delta (x_k,y_k) \le \kappa _6 r(x_k,y_k), \end{aligned}$$
(49)

for some positive constant \(\kappa _6\). The combination of the inequality (49), the definition of \(r(x_k,y_k)\), and the result \(\mu _k^{\scriptscriptstyle R }=r(x_k,y_k)^\gamma \) of Lemma 4 implies that there exists a positive constant \(\kappa _7\) such that

$$\begin{aligned} \Big [z_k + \frac{1}{\mu _k^{\scriptscriptstyle R }} J(x_k)^T\!c(x_k)\Big ]_i&\ge -\kappa _6r(x_k,y_k) - \frac{\Vert J(x_k)\Vert _1 r(x_k,y_k)}{r(x_k,y_k)^\gamma } \nonumber \\&= -\kappa _6r(x_k,y_k) - \Vert J(x_k)\Vert _1r(x_k,y_k)^{1-\gamma } \nonumber \\&\ge -\kappa _7r(x_k,y_k)^{1-\gamma } \ge -{\textstyle \frac{1}{2}}r(x_k,y_k)^\lambda , \end{aligned}$$
(50)

for all i, and every \(k\in \mathcal {S}_*\) sufficiently large, where the last inequality follows from the assumption \(0< \lambda< \min \{\gamma ,1-\gamma \} < 1\).

The \((1/\mu _k^{\scriptscriptstyle R }) J(x_k^{})^T\!J(x_k^{}) p_k^{}\) term of (47) may be bounded in a similar way using the definition \(\mu _k^{\scriptscriptstyle R }=r(x_k,y_k)^\gamma \) and the bound on \(\Vert p_k\Vert \) from part (i). The assumption that \(H(x_k,y_k)\) and \(J(x_k)\) are bounded, the estimate \(\delta (x_k,y_k) = O(r(x_k,y_k))\) of Lemma 1, and the definition of \({\mathcal {A}_\epsilon ^{}}(x_k^{},y_k^{},\mu _k^{\scriptscriptstyle R })\) give

$$\begin{aligned} \big [H(x_k,y_k) p_k + (1/\mu _k^{\scriptscriptstyle R }) J(x_k)^T\!J(x_k) p_k\big ]_i = O\big (r(x_k,y_k)^{1-\gamma }\big ) \le {\textstyle \frac{1}{2}}r(x_k,y_k)^\lambda , \end{aligned}$$

for all \(k\in \mathcal {S}_*\) sufficiently large. This inequality with (47) and (50) gives

$$\begin{aligned}&[\,\nabla \!M_k + B_k d_k \,]_{\scriptscriptstyle \mathcal {A}_\epsilon }\\&\quad \ge \Big [z_k + \frac{1}{\mu _k^{\scriptscriptstyle R }} J(x_k)^T\!c(x_k)\Big ]_{\scriptscriptstyle \mathcal {A}_\epsilon }-\Big \Vert \Big [H(x_k,y_k) p_k + \frac{1}{\mu _k^{\scriptscriptstyle R }} J(x_k)^T\!J(x_k) p_k\Big ]_{\scriptscriptstyle \mathcal {A}_\epsilon }\Big \Vert _\infty e\\&\quad \ge -r(x_k,y_k)^\lambda e = -t_k e, \end{aligned}$$

for all \(k\in \mathcal {S}_*\) sufficiently large, which proves the second result of part (ii).

The first result of Lemma 2(iii) implies that \(\mathcal {F}\!_\epsilon ^{}(x_k^{},y_k^{},\mu _k^{\scriptscriptstyle R }) = \mathcal {F}(x^*)\) for \(k\in \mathcal {S}_*\) sufficiently large. If the limit \(\lim _{k\in \mathcal {S}_*} [\,x_k\,]_{\scriptscriptstyle \mathcal {F}_\epsilon }= [\,x^*\,]_{\scriptscriptstyle \mathcal {F}}>0\) is used in conjunction with the definition \([\,x_k + p_k\,]_{\scriptscriptstyle \mathcal {A}_\epsilon }= 0\), and the estimate \(\Vert [\,p_k\,]_{\scriptscriptstyle \mathcal {F}_\epsilon }\Vert =\Vert [\,p_k\,]_{\scriptscriptstyle \mathcal {F}}\Vert = O\big (\delta (x_k,y_k)\big )\) of part (i), it follows that \(x_k + p_k \ge 0\) for \(k\in \mathcal {S}_*\) sufficiently large, as required. \(\square \)
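
The two inequalities established in Lemma 6(ii) are the first two of the three acceptance conditions (27), and may be checked directly once the step has been computed. A minimal sketch, assuming the gradient \(\nabla \!{\mathcal Q}_k^{}(v_k^{}+ d_k^{}\mathop {;}y_k^{\scriptscriptstyle E },\mu _k^{\scriptscriptstyle R })\) is available as the array gradQ_new (all names hypothetical):

```python
import numpy as np

def acceptance_tests(x, p, gradQ_new, A_eps, r_xy, lam):
    # Lemma 6(ii): x_k + p_k >= 0 and, on the epsilon-active set,
    # [grad Q_k(v_k + d_k)]_i >= -t_k, with t_k = r(x_k, y_k)^lambda and
    # 0 < lambda < min(gamma, 1 - gamma) as in (26).
    t_k = r_xy**lam
    feasible = bool(np.all(x + p >= 0.0))
    eps_active_ok = bool(np.all(gradQ_new[A_eps] >= -t_k))
    return feasible and eps_active_ok
```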

Part (ii) of Lemma 6 implies that two of the three conditions needed for the acceptance of the local descent direction are satisfied. It remains to show that the third condition \(\nabla \!M_k^T d_k^{}< 0\) holds. Two technical results, Lemmas 7 and 8 below, are required.

Lemma 7

For all \(k\in \mathcal {S}_*\subseteq \mathcal {S}\) sufficiently large, a local descent direction \(d_k=(p_k,q_k)\) is computed such that \((\widehat{x}_k,\widehat{y}_k) = (x_k+p_k,y_k+q_k)\) satisfies

$$\begin{aligned} \delta (\widehat{x}_k,\widehat{y}_k) = \Vert \widehat{x}_k^{}- x^*\Vert +\Vert \widehat{y}_k^{}- y^*_{\scriptscriptstyle P}(\widehat{y}_k^{})\Vert = O\big (\delta (x_k,y_k)^{1+\gamma }\big ), \end{aligned}$$
(51)

with \(y^*_{\scriptscriptstyle P}(\cdot )\) defined in (36).

Proof

The proof uses Izmailov [20, Theorem 2.3] to provide a bound on the change in the solution of a problem perturbed by a quantity \(\varepsilon \). If the second-order sufficient conditions hold at a primal-dual solution \((x^*,y^*)\) of a problem P, then the primal-dual solution \((\widetilde{x},\widetilde{y})\) of a perturbed problem \(P(\varepsilon )\) satisfies

$$\begin{aligned} \Vert \widetilde{x}- x^*\Vert + \inf _{y\in \mathcal {Y}(x^*)} \Vert \widetilde{y}- y\Vert = O(\Vert \varepsilon \Vert ). \end{aligned}$$
(52)

For the purposes of this theorem, the unperturbed problem is an equality-constrained variant of problem (NP) for which the optimal active set has been identified. Parts (ii) and (iii) of Lemma 2 imply that \(\mathcal {A}(x^*)={\mathcal {A}_\epsilon ^{}}(x_k^{},y_k^{},\mu _k^{\scriptscriptstyle R })\), and \(\mathcal {F}(x^*) = \mathcal {F}\!_\epsilon ^{}(x_k^{},y_k^{},\mu _k^{\scriptscriptstyle R })\) for \(k\in \mathcal {S}_*\) sufficiently large. Let \(E_{\scriptscriptstyle \mathcal {A}}\) denote the matrix of columns of the identity matrix with indices in \(\mathcal {A}(x^*)\). At any iteration with \(k\in \mathcal {S}_*\), consider the perturbed problem

$$\begin{aligned} {\mathop {\mathrm{minimize}}\limits _{x}}\;\;f(x) + x^T\!\varepsilon _k^{(1)} \;\;\;\hbox {subject to}\;\;c(x) + \varepsilon _k^{(2)} = 0, \;\;\;E_{\scriptscriptstyle \mathcal {A}}^T x = 0, \end{aligned}$$
(53)

where \(\varepsilon _k^{(1)}\) and \(\varepsilon _k^{(2)}\) are perturbation vectors such that \(\varepsilon _k = \big (\varepsilon _k^{(1)},\varepsilon _k^{(2)}\big )\) with

$$\begin{aligned} \begin{pmatrix} \varepsilon _k^{(1)} \\ \varepsilon _k^{(2)} \end{pmatrix} {=} \begin{pmatrix} g(x_k) {-} J(x_k)^T \widehat{y}_k - (g(\widehat{x}_k) - J(\widehat{x}_k)^T \widehat{y}_k) {+} H(x_k,y_k)(\widehat{x}_k - x_k) \\ c(x_k) + J(x_k)(\widehat{x}_k - x_k) - c(\widehat{x}_k) + \mu _k^{\scriptscriptstyle R }(\widehat{y}_k^{}- y_k^{\scriptscriptstyle E })\end{pmatrix}. \end{aligned}$$
(54)

The following argument shows that the perturbations go to zero as \(k\rightarrow \infty \) for \(k\in \mathcal {S}_*\). Part (i) of Lemma 6 implies that \(\lim _{k\in \mathcal {S}_*} (\widehat{x}_k-x_k,\widehat{y}_k-y_k) = \lim _{k\in \mathcal {S}_*} (p_k,q_k) = 0\). Also, as \(\lim _{k\in \mathcal {S}_*} (x_k,y_k) = (x^*\!,y^*)\) and \(y_k^{\scriptscriptstyle E }= y_k^{}\) for \(k\in \mathcal {S}_*\), it must be the case that \(\lim _{k\in \mathcal {S}_*} \varepsilon _k = 0\).

The proof of (51) is based on applying the bound (52) for the values \((\widetilde{x},\widetilde{y}) = (\widehat{x}_k,\widehat{y}_k)\). In this case, under Assumption 3, it holds that

$$\begin{aligned} \delta (\widehat{x}_k,\widehat{y}_k) {=} \Vert \widehat{x}_k - x^*\Vert {+} \Vert \widehat{y}_k - y^*_{\scriptscriptstyle P}(\widehat{y}_k^{})\Vert {=} \Vert \widehat{x}_k - x^*\Vert {+} \inf _{y\in \varLambda (x^*)} \Vert \widehat{y}_k - y\Vert {=} O(\Vert \varepsilon _k\Vert ). \end{aligned}$$

Three conditions must be established in order to apply this bound. First, \((x^*,y^*)\) must satisfy the second-order sufficient conditions for the equality-constrained problem (53) with \(\varepsilon _k = 0\). Second, \((\widehat{x}_k,\widehat{y}_k)\) must be an optimal solution for the perturbed problem (53) with perturbation (54). Third, the perturbation (54) must be bounded in terms of \(\delta (x_k,y_k)\).

For the first part it must be shown that \((x^*\!,y^*)\) satisfies the second-order sufficient conditions for problem (53) with no perturbation. The first-order KKT conditions for (53) are

$$\begin{aligned} g(x)-J(x)^T\!y + \varepsilon _k^{(1)} - E_{\scriptscriptstyle \mathcal {A}}z_{\scriptscriptstyle \mathcal {A}}= 0, \quad c(x) + \varepsilon _k^{(2)} = 0, \quad \text {and}\quad E_{\scriptscriptstyle \mathcal {A}}^T x = 0. \end{aligned}$$
(55)

If \(\varepsilon _k = 0\) then \((x^*\!,y^*)\) satisfies these conditions, which implies that the primal-dual pair \((x^*\!,y^*)\) is a first-order KKT point. The second-order conditions for problem (NP) imply that \(p^T\!H(x^*\!,y^*) p > 0\) for all p such that \(J(x^*)p=0\) and \(p_i=0\) for every \(i\in \mathcal {A}(x^*)\). These conditions also apply for problem (53) when \(\varepsilon _k = 0\), which implies that \((x^*\!,y^*)\) satisfies the second-order sufficient conditions for the unperturbed problem.

Next, it must be shown that \((\widehat{x}_k,\widehat{y}_k)\) is an optimal solution for the problem (53) with perturbation (54). By definition, the point \((\widehat{x}_k,\widehat{y}_k)\) satisfies the optimality conditions for the equality-constrained problem (24). If \(y_k^{\scriptscriptstyle E }= y_k^{}\), then these conditions are

$$\begin{aligned}&g(x_k) + H(x_k,y_k)(\widehat{x}_k - x_k) - J(x_k)^T\!y_k - E_{\scriptscriptstyle \mathcal {A}}z_{\scriptscriptstyle \mathcal {A}}= 0,\nonumber \\&c(x_k) + J(x_k)(\widehat{x}_k - x_k) +\mu _k^{\scriptscriptstyle R }(\widehat{y}_k^{}- y_k^{}) = 0, \quad \text {and}\quad E_{\scriptscriptstyle \mathcal {A}}^T \widehat{x}_k^{}= 0, \end{aligned}$$
(56)

where \(z_{\scriptscriptstyle \mathcal {A}}= [\,z_k\,]_{\scriptscriptstyle \mathcal {A}}\) with \(z_k = g(x_k) - J(x_k)^T y_k\) (cf. (40)). These identities may be used to show that \((\widehat{x}_k,\widehat{y}_k)\) satisfies the optimality conditions (55) with \(\varepsilon _k\) defined as in (54).

It remains to bound the perturbation norm \(\Vert \varepsilon _k\Vert \) from (54). The Taylor-series expansions of \(g(\widehat{x}_k) = g(x_k+p_k)\) and \(J(\widehat{x}_k) = J(x_k+p_k)\), together with the assumption that \(\{\,\nabla ^2\!c_i(x_k)\,\}_{k\in \mathcal {S}_*}\) is bounded, give

$$\begin{aligned}&g(x_k) - g(x_k + p_k) + H(x_k,y_k)p_k - (J(x_k) - J(x_k + p_k))^T \widehat{y}_k \nonumber \\&\quad \!\!= \sum _{i=1}^m [\,\widehat{y}_k - y_k\,]_i \nabla ^2\!c_i(x_k) p_k + O(\Vert p_k\Vert ^2) = O\big (\Vert p_k\Vert \Vert \widehat{y}_k - y_k\Vert ) + O(\Vert p_k\Vert ^2\big ),\nonumber \\ \end{aligned}$$
(57)

which bounds the norm of the first block of (54).

Three properties of the iterates are needed to bound the norm of the second block. First, a Taylor-series expansion of \(c(x_k+p_k)\) gives \(c(x_k)-c(x_k+p_k)+J(x_k)p_k = O(\Vert p_k\Vert ^2)\). Second, as \(\mathcal {S}_*\) contains only V-O iteration indices, the rule for updating \(y_k^{\scriptscriptstyle E }\) described in Sect. 2.1 gives \(y_k^{\scriptscriptstyle E }= y_k^{}\) for all \(k\in \mathcal {S}_*\). Third, Lemma 4 gives \(\mu _k^{\scriptscriptstyle R }= r(x_k^{},y_k^{})^{\gamma }\), which implies that \(\mu _k^{\scriptscriptstyle R }\Vert \widehat{y}_k^{}- y_k^{}\Vert = r(x_k,y_k)^\gamma \Vert \widehat{y}_k-y_k\Vert \). The combination of these results gives \(\Vert \varepsilon _k\Vert = O(\Vert p_k\Vert ^2) + O(\Vert p_k\Vert \Vert \widehat{y}_k-y_k\Vert ) + O(r(x_k,y_k)^\gamma \Vert \widehat{y}_k-y_k\Vert )\). Writing \(q_k = \widehat{y}_k-y_k\), using the results that \(r(x_k,y_k) = O( \delta (x_k,y_k))\) (from Lemma 1) and that \(\max \{\Vert p_k\Vert ,\Vert q_k\Vert \} = O\big (\delta (x_k,y_k)\big )\) (from Lemma 6(i)), and the definition \(0< \gamma < 1\), gives \(\Vert \varepsilon _k\Vert = O\big (\delta (x_k,y_k)^2 + \delta (x_k,y_k)^{1+\gamma }\big ) = O\big (\delta (x_k,y_k)^{1+\gamma }\big )\), which gives the required bound (51).\(\square \)

The second technical lemma concerns the properties of the vector of approximate multipliers \(\pi (x_k^{}\mathop {;}y_k^{\scriptscriptstyle E },\mu _k^{\scriptscriptstyle R })\).

Lemma 8

Let \(\pi _k\) denote \(\pi (x_k^{}\mathop {;}y_k^{\scriptscriptstyle E },\mu _k^{\scriptscriptstyle R })\). For every \(k\in \mathcal {S}_*\subseteq \mathcal {S}\) it holds that

  1. (i)

    \(\Vert y_k - \pi _k\Vert = O\big (\Vert c(x_k^{})\Vert /\mu _k^{\scriptscriptstyle R }\big )\) and

  2. (ii)

    \(\Vert \nabla ^2\!M(v_k^{}\mathop {;}y_k^{\scriptscriptstyle E }, \mu _k^{\scriptscriptstyle R }) - B_k^{}\Vert = O\big (\Vert c(x_k^{})\Vert /\mu _k^{\scriptscriptstyle R }\big )\).

Moreover, \(\lim _{k\in \mathcal {S}_*} \Vert y_k - \pi _k\Vert = 0\) and \(\lim _{k\in \mathcal {S}_*} \Vert \nabla ^2\!M(v_k^{}\mathop {;}y_k^{\scriptscriptstyle E }, \mu _k^{\scriptscriptstyle R }) - B_k^{}\Vert = 0\).

Proof

As \(y_k^{}= y_k^{\scriptscriptstyle E }\) for all \(k\in \mathcal {S}_*\subseteq \mathcal {S}\), the definition of \(\pi _k\) gives \(\Vert y_k - \pi _k\Vert = \Vert c(x_k^{})\Vert /\mu _k^{\scriptscriptstyle R }\). This estimate in conjunction with the definitions of \(\nabla ^2\!M\) and B implies that part (ii) also holds.

Lemma 4 and part (i) of Lemma 2 give \(\lim _{k\in \mathcal {S}_*} r(x_k,y_k) = 0\), with \(\mu _k^{\scriptscriptstyle R }=r(x_k,y_k)^\gamma \) and \(1-\gamma > 0\) for all \(k\in \mathcal {S}_*\subseteq \mathcal {S}\) sufficiently large. These results may be combined to give

$$\begin{aligned} 0 \le \lim _{k\in \mathcal {S}_*} \frac{\Vert c(x_k)\Vert }{\mu _k^{\scriptscriptstyle R }} \le \lim _{k\in \mathcal {S}_*} \frac{r(x_k,y_k)}{\mu _k^{\scriptscriptstyle R }} = \lim _{k\in \mathcal {S}_*} \frac{r(x_k,y_k)}{r(x_k,y_k)^\gamma } = \lim _{k\in \mathcal {S}_*} r(x_k,y_k)^{1-\gamma } = 0. \end{aligned}$$

It follows from (i) that \(\lim _{k\in \mathcal {S}_*} \Vert y_k - \pi _k\Vert = 0\). Also, as \(\{\,\nabla ^2\!c_i(x_k)\,\}_{k\in \mathcal {S}_*}\) is bounded, it must hold that \(\lim _{k\in \mathcal {S}_*} \Vert \nabla ^2\!M(v_k^{}\mathop {;}y_k^{\scriptscriptstyle E }, \mu _k^{\scriptscriptstyle R }) - B_k^{}\Vert = 0\). \(\square \)
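
The identity underlying part (i) is immediate from the definition of \(\pi \); a tiny numerical sketch with hypothetical data:

```python
import numpy as np

# pi(x; y^E, mu) = y^E - c(x)/mu, specialized to y^E = y_k as in Lemma 8.
y = np.array([1.0, -0.5])
c_x = np.array([1.0e-6, -2.0e-6])        # c(x_k), small near a solution
mu = 1.0e-2                              # mu_k^R = r(x_k, y_k)^gamma
pi = y - c_x / mu
# ||y_k - pi_k|| = ||c(x_k)||/mu_k^R, the quantity bounded in parts (i)-(ii).
assert np.isclose(np.linalg.norm(y - pi), np.linalg.norm(c_x) / mu)
```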

Given Lemmas 7 and 8, we show that the last of the conditions in (27) required for the acceptance of the local descent direction is satisfied, i.e., that the local descent direction is a descent direction for the merit function.

Lemma 9

For any \(\bar{\sigma }\) satisfying \(0<\bar{\sigma }<1\), and all \(k\in \mathcal {S}_*\subseteq \mathcal {S}\) sufficiently large, a local descent direction \(d_k=(p_k,q_k)\) is computed that satisfies

$$\begin{aligned} \nabla \!M(v_k^{}\mathop {;}y_k^{\scriptscriptstyle E },\mu _k^{\scriptscriptstyle R })^T\!d_k^{}\le -\bar{\sigma }d_k^T\!B_k^{}d_k^{}- \bar{c}\Vert d_k^{}\Vert ^2 \;\;\text {and}\;\; \nabla \!M(v_k^{}\mathop {;}y_k^{\scriptscriptstyle E },\mu _k^{\scriptscriptstyle R })^T\!d_k^{}< 0, \end{aligned}$$
(58)

for some positive constant \(\bar{c}\). In particular, \(d_k\) is a strict descent direction for \(M(v \mathop {;}y_k^{\scriptscriptstyle E },\mu _k^{\scriptscriptstyle R })\) at \(v_k\).

Proof

Throughout the proof, the gradient \(\nabla \!M(x_k^{},y_k^{}\mathop {;}y_k^{\scriptscriptstyle E },\mu _k^{\scriptscriptstyle R })\) and approximate Hessian \(B(x_k^{},y_k^{}\mathop {;}\mu _k^{\scriptscriptstyle R })\) are denoted by \(\nabla \!M_k\) and \(B_k\), respectively. In addition, it is assumed that \(k\in \mathcal {S}_*\subseteq \mathcal {S}\) is sufficiently large that parts (ii) and (iii) of Lemma 2 hold; i.e., \({\mathcal {A}_\epsilon ^{}}(x_k^{},y_k^{},\mu _k^{\scriptscriptstyle R }) = \mathcal {A}(x^*)\), and \(\mathcal {F}\!_\epsilon ^{}(x_k^{},y_k^{},\mu _k^{\scriptscriptstyle R }) = \mathcal {F}(x^*)\). With this assumption, \([\,B_k\,]_{\scriptscriptstyle \mathcal {A}}\), \([\,B_k\,]_{\scriptscriptstyle \mathcal {F}}\) and \([\,B_k\,]_{\scriptscriptstyle \mathcal {A},\,\mathcal {F}}\) denote the rows and columns of the matrix \(B_k\) associated with the index sets \(\mathcal {A}(x^*)\) and \(\mathcal {F}(x^*)\).

The definition of \(d_k\) from (25) gives \([\,\nabla \!M_k + B_k d_k\,]_{\scriptscriptstyle \mathcal {F}}= 0\), or equivalently

$$\begin{aligned}{}[\,B_k^{}\,]_{\scriptscriptstyle \mathcal {F}}^{} [\,d_k^{}\,]_{\scriptscriptstyle \mathcal {F}}^{} + [\,B_k^{}\,]_{\scriptscriptstyle \mathcal {A},\mathcal {F}}^T [\,d_k^{}\,]_{\scriptscriptstyle \mathcal {A}}^{} = - [\,\nabla \!M_k^{}\,]_{\scriptscriptstyle \mathcal {F}}. \end{aligned}$$
(59)

Similarly, the scalar \(d_k^T B_k^{}d_k^{}\) may be written in the form

$$\begin{aligned} d_k^T\!B_k^{}d_k^{}= [\,d_k^{}\,]_{\scriptscriptstyle \mathcal {F}}^T [\,B_k^{}\,]_{\scriptscriptstyle \mathcal {F}}^{} [\,d_k^{}\,]_{\scriptscriptstyle \mathcal {F}}^{} + (2 [\,B_k^{}\,]_{\scriptscriptstyle \mathcal {A},\mathcal {F}}^{}[\,d_k^{}\,]_{\scriptscriptstyle \mathcal {F}} + [\,B_k^{}\,]_{\scriptscriptstyle \mathcal {A}} [\,d_k^{}\,]_{\scriptscriptstyle \mathcal {A}})^T [\,d_k^{}\,]_{\scriptscriptstyle \mathcal {A}}^{}. \end{aligned}$$
(60)

Combining (59) and (60) yields

$$\begin{aligned} -[\,\nabla \!M_k^{}\,]_{\scriptscriptstyle \mathcal {F}}^T [\,d_k^{}\,]_{\scriptscriptstyle \mathcal {F}}^{}&= d_k^T B_k^{}d_k^{}- ([\,B_k^{}\,]_{\scriptscriptstyle \mathcal {A},\mathcal {F}}^{}[\,d_k^{}\,]_{\scriptscriptstyle \mathcal {F}}^{} + [\,B_k^{}\,]_{\scriptscriptstyle \mathcal {A}} [\,d_k^{}\,]_{\scriptscriptstyle \mathcal {A}})^T [\,d_k^{}\,]_{\scriptscriptstyle \mathcal {A}} \nonumber \\&= d_k^T B_k^{}d_k^{}- [\,B_k^{}d_k^{}\,]_{\scriptscriptstyle \mathcal {A}}^T [\,d_k^{}\,]_{\scriptscriptstyle \mathcal {A}}^{}, \end{aligned}$$
(61)

which implies that, for any \(\bar{\sigma }\) satisfying \(0<\bar{\sigma }<1\), it must hold that

$$\begin{aligned} \nabla \!M_k^T\!d_k^{}+ \bar{\sigma }d_k^T\!B_k^{}d_k^{}= (\bar{\sigma }- 1)d_k^T B_k^{}d_k^{}+ [\,B_k^{}d_k^{}\,]_{\scriptscriptstyle \mathcal {A}}^T [\,d_k^{}\,]_{\scriptscriptstyle \mathcal {A}}^{} + [\,\nabla \!M_k^{}\,]_{\scriptscriptstyle \mathcal {A}}^T [\,d_k^{}\,]_{\scriptscriptstyle \mathcal {A}}^{}. \end{aligned}$$
(62)

The proof involves constructing a bound on each of the terms of the right-hand side of this identity. These bounds are characterized in terms of the index sets \(\mathcal {A}_{\scriptscriptstyle +}(x^*,y^*)\) and \(\mathcal {A}_{\scriptscriptstyle 0}(x^*,y^*)\) defined in (34), together with the set \(\mathcal {F}_0(x^*,y^*)= \mathcal {A}_{\scriptscriptstyle 0}(x^*,y^*)\cup \mathcal {F}(x^*,y^*)\). In what follows, \([\,B_k\,]_{\scriptscriptstyle \mathcal {A}_{\scriptscriptstyle +}}\) and \([\,B_k\,]_{\scriptscriptstyle \mathcal {F}_0}\) denote the matrices of rows and columns of \(B_k\) associated with the index sets \(\mathcal {A}_{\scriptscriptstyle +}\) and \(\mathcal {F}_0\), with similar definitions for \([\,B_k\,]_{\scriptscriptstyle \mathcal {A}_{\scriptscriptstyle 0}}\) and \([\,B_k\,]_{\scriptscriptstyle \mathcal {A}_{\scriptscriptstyle +},\mathcal {F}_0}\), etc. The index sets \(\mathcal {F}_0\) and \(\mathcal {A}_{\scriptscriptstyle +}\) define a partition of \(\{1, 2, \dots , n+m\}\), and \(d_k^T\!B_k^{}d_k^{}\) may be partitioned analogously to (60) as

$$\begin{aligned} d_k^T\!B_k^{}d_k^{}= [\,d_k^{}\,]_{\scriptscriptstyle \mathcal {F}_0}^T [\,B_k^{}\,]_{\scriptscriptstyle \mathcal {F}_0}^{} [\,d_k^{}\,]_{\scriptscriptstyle \mathcal {F}_0}^{} + ([\,B_k^{}\,]_{\scriptscriptstyle \mathcal {A}_{\scriptscriptstyle +}}^{} [\,d_k^{}\,]_{\scriptscriptstyle \mathcal {A}_{\scriptscriptstyle +}}^{} + 2 [\,B_k^{}\,]_{\scriptscriptstyle \mathcal {A}_{\scriptscriptstyle +},\mathcal {F}_0} [\,d_k^{}\,]_{\scriptscriptstyle \mathcal {F}_0})^T [\,d_k^{}\,]_{\scriptscriptstyle \mathcal {A}_{\scriptscriptstyle +}}^{}. \end{aligned}$$
(63)

The second-order sufficient conditions given in Definition 1, [14, Theorem 1.3 and part 2 of Theorem 3.6], together with a continuity argument imply that, for all \(k\in \mathcal {S}_*\) sufficiently large, \(B_k\) is uniformly positive definite when restricted to the set \(\mathcal {C}= \{(p,q)\in \mathbb {R}^{n+m}: p_{\scriptscriptstyle \mathcal {A}_{\scriptscriptstyle +}}=0 \;\text {and}\;p_{\scriptscriptstyle \mathcal {A}_{\scriptscriptstyle 0}}\ge 0 \}\). The relation \((-d)^T\!B_k (-d) = d^T\!B_k d\) implies that if d satisfies \(d_{\scriptscriptstyle \mathcal {A}_{\scriptscriptstyle 0}}\le 0\) and \(d_{\scriptscriptstyle \mathcal {A}_{\scriptscriptstyle +}}=0\), then \(d^T\!B_k d> 0\). For the particular vector \(d= (0, [\,d_k^{}\,]_{\scriptscriptstyle \mathcal {A}_{\scriptscriptstyle 0}}, [\,d_k^{}\,]_{\scriptscriptstyle \mathcal {F}})= (0, [\,d_k^{}\,]_{\scriptscriptstyle \mathcal {F}_0})\) for which \([\,d_k^{}\,]_{\scriptscriptstyle \mathcal {A}_{\scriptscriptstyle 0}}\le 0\), it follows that

$$\begin{aligned}{}[\,d_k^{}\,]_{\scriptscriptstyle \mathcal {F}_0}^T [\,B_k^{}\,]_{\scriptscriptstyle \mathcal {F}_0}^{} [\,d_k^{}\,]_{\scriptscriptstyle \mathcal {F}_0}^{} \ge \kappa _8\Vert [\,d_k^{}\,]_{\scriptscriptstyle \mathcal {F}_0}^{}\Vert ^2, \;\;\text {for some}\; \kappa _8\in (0,1), \end{aligned}$$
(64)

and all \(k\in \mathcal {S}_*\) sufficiently large. This inequality provides a bound on the first term on the right-hand side of (63). An estimate of the second and third terms may be determined using a bound on the magnitude of the components of \([\,B_k d_k\,]_{\scriptscriptstyle \mathcal {A}}\), where, by definition,

$$\begin{aligned}{}[\,B_k d_k\,]_{\scriptscriptstyle \mathcal {A}_{\scriptscriptstyle +}} = \Big [\Big (H(x_k,y_k) + \frac{2}{\mu _k^{\scriptscriptstyle R }} J(x_k)^T\!J(x_k)\Big ) p_k + J(x_k)^T\!q_k \Big ]_{\scriptscriptstyle \mathcal {A}_{\scriptscriptstyle +}}. \end{aligned}$$

For sufficiently large \(k\in \mathcal {S}_*\), Lemma 4 gives \(\mu _k^{\scriptscriptstyle R }= r(x_k,y_k)^\gamma \). Also, as \(\Vert J(x_k)\Vert \) and \(\Vert H(x_k,y_k)\Vert \) are bounded on \(\mathcal {S}\), it follows from the bounds on \(\Vert p_k\Vert \) and \(\Vert q_k\Vert \) from Lemma 6(i), and the equivalence \(r(x_k,y_k) = \varTheta \!\big (\delta (x_k,y_k)\big )\) of Lemma 1, that the magnitude of the components of \([\,B_k d_k\,]_{\scriptscriptstyle \mathcal {A}_{\scriptscriptstyle +}}\) is estimated by

$$\begin{aligned} \Vert [\,B_k d_k\,]_{\scriptscriptstyle \mathcal {A}_{\scriptscriptstyle +}}\Vert = O(r(x_k,y_k)^{1-\gamma }) = O(\delta (x_k,y_k)^{1-\gamma }). \end{aligned}$$
(65)

A similar argument gives the bound

$$\begin{aligned} \big |\big ([\,B_k^{}\,]_{\scriptscriptstyle \mathcal {A}_{\scriptscriptstyle +}} [\,d_k^{}\,]_{\scriptscriptstyle \mathcal {A}_{\scriptscriptstyle +}} + 2[\,B_k^{}\,]_{\scriptscriptstyle \mathcal {A}_{\scriptscriptstyle +},\mathcal {F}_0} [\,d_k^{}\,]_{\scriptscriptstyle \mathcal {F}_0}\big )^T [\,d_k^{}\,]_{\scriptscriptstyle \mathcal {A}_{\scriptscriptstyle +}}\big | = O(\delta (x_k,y_k)^{1-\gamma }\,\Vert [\,d_k^{}\,]_{\scriptscriptstyle \mathcal {A}_{\scriptscriptstyle +}}\Vert ). \end{aligned}$$
(66)

The application of the bound (64) and estimate (66) to (63) gives

$$\begin{aligned} -d_k^T\!B_k^{}d_k^{}\le -\kappa _8 \Vert [\,d_k^{}\,]_{\scriptscriptstyle \mathcal {F}_0}\Vert ^2 + \kappa _9 \delta (x_k,y_k)^{1-\gamma } \Vert [\,d_k^{}\,]_{\scriptscriptstyle \mathcal {A}_{\scriptscriptstyle +}}\Vert , \end{aligned}$$
(67)

for some positive \(\kappa _9\) independent of k, which serves to bound \((\bar{\sigma }- 1)d_k^T\!B_k^{}d_k^{}\), the first term of the right-hand side of (62).

The second and third terms of (62) are estimated by bounding components from the index set \(\mathcal {A}_{\scriptscriptstyle +}\). The estimate (65) gives

$$\begin{aligned}{}[\,B_k^{}d_k^{}\,]_{\scriptscriptstyle \mathcal {A}_{\scriptscriptstyle +}}^T [\,d_k^{}\,]_{\scriptscriptstyle \mathcal {A}_{\scriptscriptstyle +}}^{} \le \kappa _{10} \delta (x_k,y_k)^{1-\gamma } \Vert [\,d_k^{}\,]_{\scriptscriptstyle \mathcal {A}_{\scriptscriptstyle +}}\Vert , \;\;\text {for some} \; \kappa _{10}\in (0,1). \end{aligned}$$
(68)

A Taylor-series expansion of \(\nabla \!M(v_k^{}\mathop {;}y^{\scriptscriptstyle E },\mu _k^{\scriptscriptstyle R })\) at \(y^{\scriptscriptstyle E }= y_k^{\scriptscriptstyle E }\) (\(= y_k^{})\) gives

$$\begin{aligned} \nabla \!M_k = \nabla \!M(v_k^{}\mathop {;}y^*+ (y_k^{}- y^*),\mu _k^{\scriptscriptstyle R }) = \nabla \!M(v_k^{}\mathop {;}y^*,\mu _k^{\scriptscriptstyle R }) + O(\Vert y_k-y^*\Vert ). \end{aligned}$$
(69)

A Taylor-series expansion of \([\,\nabla \!M(v \mathop {;}y^*,\mu _k^{\scriptscriptstyle R })\,]_{\scriptscriptstyle \mathcal {A}_{\scriptscriptstyle +}}^T [\,d_k^{}\,]_{\scriptscriptstyle \mathcal {A}_{\scriptscriptstyle +}}^{}\) at \(v = v^*\) gives

$$\begin{aligned}&[\, \nabla \!M(v_k^{}\mathop {;}y^*,\mu _k^{\scriptscriptstyle R })\,]_{\scriptscriptstyle \mathcal {A}_{\scriptscriptstyle +}}^T [\,d_k^{}\,]_{\scriptscriptstyle \mathcal {A}_{\scriptscriptstyle +}}^{} \\&= [\,d_k^{}\,]_{\scriptscriptstyle \mathcal {A}_{\scriptscriptstyle +}}^T [\,\nabla \!M(v^*+ (v_k^{}- v^*) \mathop {;}y^*,\mu _k^{\scriptscriptstyle R })\,]_{\scriptscriptstyle \mathcal {A}_{\scriptscriptstyle +}}^{}\\&= [\,d_k^{}\,]_{\scriptscriptstyle \mathcal {A}_{\scriptscriptstyle +}}^T [\,\nabla \!M(v^*\mathop {;}y^*,\mu _k^{\scriptscriptstyle R })\,]_{\scriptscriptstyle \mathcal {A}_{\scriptscriptstyle +}}^{}+ O\Big (\frac{1}{\mu _k^{\scriptscriptstyle R }}\Vert [\,d_k^{}\,]_{\scriptscriptstyle \mathcal {A}_{\scriptscriptstyle +}}^{}\Vert \,\Vert v_k-v^*\Vert \Big ). \end{aligned}$$

In order to bound the last term on the right-hand side, we substitute the value \(\mu _k^{\scriptscriptstyle R }= r(x_k,y_k)^\gamma \) implied by Lemma 4, and apply the estimate \(r(x_k,y_k) = \varTheta \!\big (\delta (x_k,y_k)\big )\) from Lemma 1. If the resulting value is used with the value \(\Vert [\,d_k^{}\,]_{\scriptscriptstyle \mathcal {A}_{\scriptscriptstyle +}}^{}\Vert = O(\Vert d_k\Vert ) = O(\delta (x_k,y_k))\) of Lemma 6(i), then it follows that \([\, \nabla \!M(v_k^{}\mathop {;}y^*,\mu _k^{\scriptscriptstyle R })\,]_{\scriptscriptstyle \mathcal {A}_{\scriptscriptstyle +}}^T [\,d_k^{}\,]_{\scriptscriptstyle \mathcal {A}_{\scriptscriptstyle +}}^{} = [\,d_k^{}\,]_{\scriptscriptstyle \mathcal {A}_{\scriptscriptstyle +}}^T [\,\nabla \!M(v^*\mathop {;}y^*,\mu _k^{\scriptscriptstyle R })\,]_{\scriptscriptstyle \mathcal {A}_{\scriptscriptstyle +}}^{}+ O\big (\delta (x_k,y_k)^{1-\gamma }\Vert v_k-v^*\Vert \big )\). This estimate can be combined with (69) to obtain

$$\begin{aligned}{}[\,\nabla \!M_k\,]_{\scriptscriptstyle \mathcal {A}_{\scriptscriptstyle +}}^T [\,d_k^{}\,]_{\scriptscriptstyle \mathcal {A}_{\scriptscriptstyle +}}^{}= & {} [\,d_k^{}\,]_{\scriptscriptstyle \mathcal {A}_{\scriptscriptstyle +}}^T [\,\nabla \!M(v^*\mathop {;}y^*,\mu _k^{\scriptscriptstyle R })\,]_{\scriptscriptstyle \mathcal {A}_{\scriptscriptstyle +}} + O(\delta (x_k,y_k)^{1-\gamma }\Vert v_k-v^*\Vert ) \nonumber \\&+ O(\Vert [\,d_k^{}\,]_{\scriptscriptstyle \mathcal {A}_{\scriptscriptstyle +}}^{}\Vert \,\Vert y_k-y^*\Vert ). \end{aligned}$$
(70)

As \(v^*= (x^*\!,y^*)\) is a primal-dual KKT pair for problem (NP), it follows from the definition of \(\mathcal {A}_{\scriptscriptstyle +}\) that \([\,\nabla \!M(v^*\mathop {;}y^*,\mu _k^{\scriptscriptstyle R })\,]_{\scriptscriptstyle \mathcal {A}_{\scriptscriptstyle +}} = [\,g(x^*)-J(x^*)^T\!y^*\,]_{\mathcal {A}_{\scriptscriptstyle +}} > 0\). Combining this with \([\,d_k^{}\,]_{\scriptscriptstyle \mathcal {A}_{\scriptscriptstyle +}}\le 0\) from the first equality of (25) yields

$$\begin{aligned}{}[\,\nabla \!M(v^*\mathop {;}y^*,\mu _k^{\scriptscriptstyle R })\,]_{\scriptscriptstyle \mathcal {A}_{\scriptscriptstyle +}}^T [\,d_k^{}\,]_{\scriptscriptstyle \mathcal {A}_{\scriptscriptstyle +}}^{} \le -\kappa _{11} \Vert [\,d_k^{}\,]_{\scriptscriptstyle \mathcal {A}_{\scriptscriptstyle +}}\Vert \;\;\text {for some positive}\;\; \kappa _{11}. \end{aligned}$$
(71)

As \(\gamma <1\), the limit \(\delta (x_k,y_k)\rightarrow 0\) and estimates (70)–(71) imply that the inequality \([\,\nabla \!M_k\,]_{\scriptscriptstyle \mathcal {A}_{\scriptscriptstyle +}}^T [\,d_k^{}\,]_{\scriptscriptstyle \mathcal {A}_{\scriptscriptstyle +}}^{} \le -{\textstyle \frac{1}{2}}\kappa _{11} \Vert [\,d_k^{}\,]_{\scriptscriptstyle \mathcal {A}_{\scriptscriptstyle +}}\Vert \) holds for \(k\in \mathcal {S}_*\) sufficiently large. The combination of this inequality with (68) gives

$$\begin{aligned}{}[\,B_k^{}d_k^{}\,]_{\scriptscriptstyle \mathcal {A}_{\scriptscriptstyle +}}^T [\,d_k^{}\,]_{\scriptscriptstyle \mathcal {A}_{\scriptscriptstyle +}}^{} + [\,\nabla \!M_k^{}\,]_{\scriptscriptstyle \mathcal {A}_{\scriptscriptstyle +}}^T [\,d_k^{}\,]_{\scriptscriptstyle \mathcal {A}_{\scriptscriptstyle +}}^{}\le & {} \kappa _{10} \delta (x_k^{},y_k^{})^{1-\gamma } \Vert [\,d_k^{}\,]_{\scriptscriptstyle \mathcal {A}_{\scriptscriptstyle +}}\Vert \nonumber \\&- {\textstyle \frac{1}{2}}\kappa _{11} \Vert [\,d_k^{}\,]_{\scriptscriptstyle \mathcal {A}_{\scriptscriptstyle +}}\Vert , \end{aligned}$$
(72)

for all \(k\in \mathcal {S}_*\) sufficiently large.

Finally, consider the last two terms of (62) associated with the set \(\mathcal {A}_0\). As \(k\in \mathcal {S}\), it holds that \(y_k^{\scriptscriptstyle E }= y_k^{}\) and \(\pi _k^{}= \pi (x_k^{}\mathop {;}y_k^{\scriptscriptstyle E },\mu _k^{\scriptscriptstyle R }) = y_k^{}- c(x_k^{})/\mu _k^{\scriptscriptstyle R }\). Let \(\widetilde{y}_k\) denote the vector \(\widetilde{y}_k = \pi _k+(\pi _k-y_k) = y_k^{}- 2 c(x_k^{})/\mu _k^{\scriptscriptstyle R }\). The definitions of \(\nabla \!M_k\) and \(B_k\), together with the definition \(d_k = (p_k,q_k)\) and the identity \(c(x_k^{}) + J(x_k^{}) p_k^{}+ \mu _k^{\scriptscriptstyle R }q_k^{}= 0\) from (39) give

$$\begin{aligned}&[\,\nabla \!M_k + B_k d_k\,]_{\scriptscriptstyle \mathcal {A}_{\scriptscriptstyle 0}} \\&= [\,g(x_k) - J(x_k)^T \widetilde{y}_k+H(x_k,y_k)p_k + \frac{2}{\mu _k^{\scriptscriptstyle R }} J(x_k)^T\!J(x_k) p_k+ J(x_k)^T\!q_k \,]_{\scriptscriptstyle \mathcal {A}_{\scriptscriptstyle 0}} \\&= [\,g(x_k)-J(x_k)^T\widetilde{y}_k+H(x_k,y_k)p_k- \frac{2}{\mu _k^{\scriptscriptstyle R }} J(x_k)^T\!c(x_k)-J(x_k)^T\!q_k\,]_{\scriptscriptstyle \mathcal {A}_{\scriptscriptstyle 0}} \\&= [\,g(x_k)-J(x_k)^T y_k+H(x_k,y_k)p_k-J(x_k)^T\!q_k\,]_{\scriptscriptstyle \mathcal {A}_{\scriptscriptstyle 0}}. \end{aligned}$$

It follows from the previous displayed equation and a Taylor-series expansion with respect to x of \(g(x)-J(x)^T\!(y_k+q_k)\) that

$$\begin{aligned}{}[\,\nabla \!M_k + B_k d_k\,]_{\scriptscriptstyle \mathcal {A}_{\scriptscriptstyle 0}}&= \big [g(x_k+p_k)-J(x_k+p_k)^T\!(y_k+q_k) + o(\Vert (p_k,q_k)\Vert )\big ]_{\scriptscriptstyle \mathcal {A}_{\scriptscriptstyle 0}} \nonumber \\&= \big [g(\widehat{x}_k)-J(\widehat{x}_k)^T \widehat{y}_k^{}+ o(\Vert (p_k,q_k)\Vert )\big ]_{\scriptscriptstyle \mathcal {A}_{\scriptscriptstyle 0}}, \end{aligned}$$
(73)

where \((\widehat{x}_k,\widehat{y}_k) = (x_k + p_k,y_k + q_k)\). Part (ii) of Lemma 6 then gives

$$\begin{aligned} r(\widehat{x}_k,\widehat{y}_k)&\ge \big |\min \big ([\,\widehat{x}_k\,]_i, [\,g(\widehat{x}_k)-J(\widehat{x}_k)^T \widehat{y}_k^{}\,]_i \big )\big |\nonumber \\&= \big |\min (0,[\,g(\widehat{x}_k)-J(\widehat{x}_k)^T \widehat{y}_k^{}\,]_i) \big |, \;\;\text {for all}\,\, i\in \mathcal {A}_{\scriptscriptstyle 0}. \end{aligned}$$
(74)

There are two possible cases for each \(i\in \mathcal {A}_{\scriptscriptstyle 0}\), depending on the sign of \([\,g(\widehat{x}_k)-J(\widehat{x}_k)^T \widehat{y}_k\,]_i\). If \([\,g(\widehat{x}_k)-J(\widehat{x}_k)^T \widehat{y}_k\,]_i \ge 0\), then the property that \( [\,d_k\,]_i\le 0\) for every \(i\in \mathcal {A}\) implies that \([\,g(\widehat{x}_k)-J(\widehat{x}_k)^T \widehat{y}_k\,]_i^{}[\,d_k\,]_i^{}\le 0\). The expression for \([\,\nabla \!M_k+B_k d_k\,]_i [\,d_k\,]_i\) from (73), and the result that \(\Vert (p_k,q_k)\Vert = O\big (\delta (x_k,y_k)\big )\) from Lemma 6(i) gives

$$\begin{aligned}{}[\,\nabla \!M_k+B_k d_k\,]_i [\,d_k\,]_i&= \big [g(\widehat{x}_k)-J(\widehat{x}_k)^T \widehat{y}_k\big ]_i [\,d_k^{}\,]_i^{}+ o(\Vert (p_k,q_k)\Vert ) [\,d_k^{}\,]_i^{}\\&= o\big (\delta (x_k,y_k)\big )\big | [\,d_k\,]_i \big |. \end{aligned}$$

Alternatively, if \(i\in \mathcal {A}_{\scriptscriptstyle 0}\) and \([\,g(\widehat{x}_k)-J(\widehat{x}_k)^T \widehat{y}_k\,]_i < 0\), then

$$\begin{aligned} \begin{aligned}&[\,\nabla \!M_k+B_k d_k\,]_i [\,d_k\,]_i\\ \;\;&= [\,g(\widehat{x}_k)-J(\widehat{x}_k)^T \widehat{y}_k + o(\Vert (p_k,q_k)\Vert )\,]_i [\,d_k\,]_i \\&\le r(\widehat{x}_k,\widehat{y}_k)\big |[\,d_k\,]_i\big | + o\big (\delta (x_k,y_k)\big )\big |[\,d_k\,]_i\big |&\;\;&\big (\text {(74) and Lemma~6(i)}\big ) \\&\le \kappa \delta (\widehat{x}_k,\widehat{y}_k)\big |[\,d_k\,]_i\big |+ o\big (\delta (x_k,y_k)\big )\big |[\,d_k\,]_i\big |&\;\;&\big (\text {Lemma~1}\big ) \\&= O\big (\delta (x_k,y_k)^{1+\gamma }\big )|[\,d_k\,]_i| + o\big (\delta (x_k,y_k)\big )\big |[\,d_k\,]_i\big |&\;\;&\big (\text {Lemma~7}\big ) \\&= o\big (\delta (x_k,y_k)\big )\big |[\,d_k\,]_i\big |. \end{aligned} \end{aligned}$$

A combination of the two cases provides the estimate

$$\begin{aligned}{}[\,\nabla \!M_k + B_k d_k\,]_{\scriptscriptstyle \mathcal {A}_{\scriptscriptstyle 0}}^T [\,d_k^{}\,]_{\scriptscriptstyle \mathcal {A}_{\scriptscriptstyle 0}}^{}\le o\big (\delta (x_k,y_k)\big )\,\Vert [\,d_k^{}\,]_{\scriptscriptstyle \mathcal {A}_{\scriptscriptstyle 0}}^{}\Vert . \end{aligned}$$
(75)

It now follows from (62), (67), (72), (75), and \(\lim _{k\in \mathcal {S}_*} d_k = 0\) that there exist positive constants \(\kappa _{12}\), \(\kappa _{13}\), and \(\kappa _{14}\) such that

$$\begin{aligned} \nabla \!M_k^T\!d_k^{}+\bar{\sigma }d_k^T\!B_k^{}d_k^{}\le & {} -\kappa _{12}\Vert [\,d_k^{}\,]_{\scriptscriptstyle \mathcal {F}_0}\Vert ^2 + \kappa _{13} \delta (x_k,y_k)^{1-\gamma } \Vert [\,d_k^{}\,]_{\scriptscriptstyle \mathcal {A}_{\scriptscriptstyle +}}\Vert \\&- \kappa _{14}\Vert [\,d_k^{}\,]_{\scriptscriptstyle \mathcal {A}_{\scriptscriptstyle +}}\Vert + o\big (\delta (x_k,y_k)\big )\,\Vert [\,d_k^{}\,]_{\scriptscriptstyle \mathcal {A}_{\scriptscriptstyle 0}}\Vert . \end{aligned}$$

As \(\lim _{k\in \mathcal {S}_*}\delta (x_k,y_k) = 0\), it must hold that \(\kappa _{13} \delta (x_k,y_k)^{1-\gamma } \le {\textstyle \frac{1}{2}}\kappa _{14}\) for all \(k\in \mathcal {S}_*\) sufficiently large, which gives

$$\begin{aligned} \nabla \!M_k^T\!d_k^{}+\bar{\sigma }d_k^T\!B_k^{}d_k^{}\le & {} -\kappa _{12}\Vert [\,d_k^{}\,]_{\scriptscriptstyle \mathcal {F}_0}\Vert ^2 - {\textstyle \frac{1}{2}}\kappa _{14} \Vert [\,d_k^{}\,]_{\scriptscriptstyle \mathcal {A}_{\scriptscriptstyle +}}\Vert \nonumber \\&+ o\big (\delta (x_k,y_k)\big )\,\Vert [\,d_k^{}\,]_{\scriptscriptstyle \mathcal {A}_{\scriptscriptstyle 0}}\Vert . \end{aligned}$$
(76)

The next step is to show that the right-hand side of (76) is bounded above by a negative multiple of \(\Vert d_k\Vert ^2\). Consider the sequence \(v^p_k = \big (x^*\!, y^*_{\scriptscriptstyle P}(\widehat{y}_k^{})\big )\), where \(y^*_{\scriptscriptstyle P}(\cdot )\) is given by (36) and is such that \(v^p_k\) satisfies the second-order sufficient conditions for all k. The triangle inequality and substitution of \(\widehat{v}_k\) for \(v_k + d_k\) yields

$$\begin{aligned} \Vert v_k^{}- v^p_k\Vert = \Vert v_k^{}+ d_k^{}- v^p_k - d_k^{}\Vert = \Vert \widehat{v}_k^{}- v^p_k - d_k^{}\Vert \le \Vert \widehat{v}_k^{}- v^p_k\Vert + \Vert d_k^{}\Vert . \end{aligned}$$
(77)

By definition, \(\Vert \widehat{v}_k^{}- v^p_k\Vert = \delta (\widehat{x}_k,\widehat{y}_k)\), and the estimate \(\delta (\widehat{x}_k,\widehat{y}_k) = o\big (\delta (x_k,y_k)\big )\) given by Lemma 7 implies that \(\delta (\widehat{x}_k,\widehat{y}_k) \le {\textstyle \frac{1}{2}}\delta (x_k,y_k)\) for k sufficiently large. In addition, the definition of \(\delta (x_k,y_k)\) is such that \(\delta (x_k,y_k) \le \Vert v_k^{}- v^p_k\Vert \). If these inequalities are used to estimate \(\Vert d_k\Vert \) in (77), then

$$\begin{aligned} -\Vert d_k^{}\Vert \le \Vert \widehat{v}_k^{}- v^p_k\Vert - \Vert v_k^{}- v^p_k\Vert \le -{\textstyle \frac{1}{2}}\delta (x_k,y_k). \end{aligned}$$
(78)

Consider the inequality (76). Suppose that k is large enough that the bound \(\kappa _{12}\Vert [\,d_k^{}\,]_{\scriptscriptstyle \mathcal {F}_0}\Vert \le \frac{1}{4}\kappa _{14}\) holds. Standard norm inequalities applied in conjunction with the estimates \(\Vert d_k\Vert \le \Vert [\,d_k^{}\,]_{\scriptscriptstyle \mathcal {F}_0}\Vert + \Vert [\,d_k^{}\,]_{\scriptscriptstyle \mathcal {A}_{\scriptscriptstyle +}}\Vert \), \(\Vert [\,d_k^{}\,]_{\scriptscriptstyle \mathcal {A}_{\scriptscriptstyle 0}}\Vert \le \Vert [\,d_k^{}\,]_{\scriptscriptstyle \mathcal {F}_0}\Vert \), and \(\Vert d_k^{}\Vert \ge {\textstyle \frac{1}{2}}\delta (x_k,y_k)\) from (78), give

$$\begin{aligned}&-\kappa _{12}\Vert [\,d_k^{}\,]_{\scriptscriptstyle \mathcal {F}_0}\Vert ^2 - {\textstyle \frac{1}{2}}\kappa _{14} \Vert [\,d_k^{}\,]_{\scriptscriptstyle \mathcal {A}_{\scriptscriptstyle +}}\Vert + o\big (\delta (x_k^{},y_k^{})\big )\,\Vert [\,d_k^{}\,]_{\scriptscriptstyle \mathcal {A}_{\scriptscriptstyle 0}}\Vert \\&\quad \le -\kappa _{12}\Vert [\,d_k^{}\,]_{\scriptscriptstyle \mathcal {F}_0}\Vert ^2 -{\textstyle \frac{1}{4}}\kappa _{14} \Vert [\,d_k^{}\,]_{\scriptscriptstyle \mathcal {A}_{\scriptscriptstyle +}}\Vert -{\textstyle \frac{1}{2}}\kappa _{12} \Vert [\,d_k^{}\,]_{\scriptscriptstyle \mathcal {F}_0}\Vert \,\Vert [\,d_k^{}\,]_{\scriptscriptstyle \mathcal {A}_{\scriptscriptstyle +}}\Vert \\&\qquad + o\big (\delta (x_k,y_k)\big )\,\Vert [\,d_k^{}\,]_{\scriptscriptstyle \mathcal {A}_{\scriptscriptstyle 0}}\Vert \\&\quad \le - {\textstyle \frac{1}{2}}\kappa _{12} \Vert [\,d_k^{}\,]_{\scriptscriptstyle \mathcal {F}_0}\Vert ^2 -{\textstyle \frac{1}{4}}\kappa _{14}\Vert [\,d_k^{}\,]_{\scriptscriptstyle \mathcal {A}_{\scriptscriptstyle +}}\Vert - {\textstyle \frac{1}{2}}\kappa _{12} \Vert d_k\Vert \,\Vert [\,d_k^{}\,]_{\scriptscriptstyle \mathcal {F}_0}\Vert \\&\qquad + o\big (\delta (x_k,y_k)\big )\,\Vert [\,d_k^{}\,]_{\scriptscriptstyle \mathcal {A}_{\scriptscriptstyle 0}}\Vert \\&\quad \le - {\textstyle \frac{1}{4}}\kappa _{14} \Vert [\,d_k^{}\,]_{\scriptscriptstyle \mathcal {A}_{\scriptscriptstyle +}}\Vert - {\textstyle \frac{1}{2}}\kappa _{12}\Vert [\,d_k^{}\,]_{\scriptscriptstyle \mathcal {F}_0}\Vert ^2 - {\textstyle \frac{1}{4}}\kappa _{12} \delta (x_k,y_k)\,\Vert [\,d_k^{}\,]_{\scriptscriptstyle \mathcal {F}_0}\Vert \\&\qquad + o\big (\delta (x_k,y_k)\big )\,\Vert [\,d_k^{}\,]_{\scriptscriptstyle \mathcal {A}_{\scriptscriptstyle 0}}\Vert \\&\quad \le - {\textstyle \frac{1}{4}}\kappa _{14} \Vert [\,d_k^{}\,]_{\scriptscriptstyle \mathcal {A}_{\scriptscriptstyle +}}\Vert - {\textstyle \frac{1}{2}}\kappa _{12}\Vert [\,d_k^{}\,]_{\scriptscriptstyle \mathcal {F}_0}\Vert ^2 - {\textstyle \frac{1}{4}}\kappa _{12} \delta (x_k,y_k)\,\Vert [\,d_k^{}\,]_{\scriptscriptstyle \mathcal {A}_{\scriptscriptstyle 0}}\Vert \\&\qquad + o\big (\delta (x_k,y_k)\big )\,\Vert [\,d_k^{}\,]_{\scriptscriptstyle \mathcal {A}_{\scriptscriptstyle 0}}\Vert \\&\quad \le - {\textstyle \frac{1}{4}}\kappa _{14} \Vert [\,d_k^{}\,]_{\scriptscriptstyle \mathcal {A}_{\scriptscriptstyle +}}\Vert - {\textstyle \frac{1}{2}}\kappa _{12}\Vert [\,d_k^{}\,]_{\scriptscriptstyle \mathcal {F}_0}\Vert ^2\\&\quad \le - {\textstyle \frac{1}{4}}\kappa _{14} \Vert [\,d_k^{}\,]_{\scriptscriptstyle \mathcal {A}_{\scriptscriptstyle +}}\Vert ^2 - {\textstyle \frac{1}{2}}\kappa _{12}\Vert [\,d_k^{}\,]_{\scriptscriptstyle \mathcal {F}_0}\Vert ^2. \end{aligned}$$

These inequalities, when used with (76), imply that

$$\begin{aligned} \nabla \!M_k^T\!d_k^{}+\bar{\sigma }d_k^T\!B_k^{}d_k^{}&\le - \kappa _{12}\Vert [\,d_k^{}\,]_{\scriptscriptstyle \mathcal {F}_0}\Vert ^2 - {\textstyle \frac{1}{2}}\kappa _{14} \Vert [\,d_k^{}\,]_{\scriptscriptstyle \mathcal {A}_{\scriptscriptstyle +}}\Vert + o\big (\delta (x_k^{},y_k^{})\big )\,\Vert [\,d_k^{}\,]_{\scriptscriptstyle \mathcal {A}_{\scriptscriptstyle 0}}\Vert \nonumber \\&\le - \bar{c}\,\Vert d_k\Vert ^2, \end{aligned}$$
(79)

with \(\bar{c}= \min \{{\textstyle \frac{1}{4}}\kappa _{14}, {\textstyle \frac{1}{2}}\kappa _{12} \}\). This establishes the first part of (58).

To establish the second part of (58), note that the bounds on \(\nabla \!M_k^T\!d_k^{}+\bar{\sigma }d_k^T\!B_k^{}d_k^{}\) and \(d_k^T\!B_k^{}d_k^{}\) given by (76) and (67) imply that

$$\begin{aligned} \nabla \!M_k^T\!d_k^{}&= \nabla \!M_k^T\!d_k^{}+ \bar{\sigma }d_k^T\!B_k^{}d_k^{}-\bar{\sigma }d_k^T\!B_k^{}d_k^{}\nonumber \\&\le - \kappa _{12}\Vert [\,d_k^{}\,]_{\scriptscriptstyle \mathcal {F}_0}\Vert ^2 - {\textstyle \frac{1}{2}}\kappa _{14} \Vert [\,d_k^{}\,]_{\scriptscriptstyle \mathcal {A}_{\scriptscriptstyle +}}\Vert + o\big (\delta (x_k,y_k)\big )\,\Vert [\,d_k^{}\,]_{\scriptscriptstyle \mathcal {A}_{\scriptscriptstyle 0}}\Vert \nonumber \\&\quad \qquad - \bar{\sigma }\kappa _{8} \Vert [\,d_k^{}\,]_{\scriptscriptstyle \mathcal {F}_0}\Vert ^2 + \bar{\sigma }\kappa _{9} \delta (x_k,y_k)^{1-\gamma }\Vert [\,d_k^{}\,]_{\scriptscriptstyle \mathcal {A}_{\scriptscriptstyle +}}\Vert . \end{aligned}$$
(80)

As \(\lim _{k\in \mathcal {S}_*} d_k = 0\), it follows that \(\bar{\sigma }\kappa _{9} \delta (x_k,y_k)^{1-\gamma } \le \frac{1}{4}\kappa _{14}\) for all \(k\in \mathcal {S}_*\) sufficiently large, and the bound (80) may be written in the form \(\nabla \!M_k^T\!d_k^{}\le - (\kappa _{12}+\bar{\sigma }\kappa _{8})\Vert [\,d_k^{}\,]_{\scriptscriptstyle \mathcal {F}_0}\Vert ^2 - {\textstyle \frac{1}{4}}\kappa _{14} \Vert [\,d_k^{}\,]_{\scriptscriptstyle \mathcal {A}_{\scriptscriptstyle +}}\Vert + o\big (\delta (x_k,y_k)\big )\,\Vert [\,d_k^{}\,]_{\scriptscriptstyle \mathcal {A}_{\scriptscriptstyle 0}}\Vert \), which is the inequality (76) with different positive constants. If the argument used to derive (79) is repeated for this inequality, it follows that there is a positive \(\widehat{c}\) such that \(\nabla \!M_k^T\!d_k^{}\le -\widehat{c}\,\Vert d_k\Vert ^2\). From Assumption 4, \(d_k\) is nonzero, which implies that \(d_k\) is a strict descent direction for \(M(v\mathop {;}y_k^{\scriptscriptstyle E },\mu _k^{\scriptscriptstyle R })\) at \(v_k\).\(\square \)

Lemma 9 establishes that the third condition in (27) needed for the acceptance of the local descent direction \(d_k\) holds for all \(k\in \mathcal {S}_*\) sufficiently large.

Theorem 1

For all \(k\in \mathcal {S}_*\subseteq \mathcal {S}\) sufficiently large, it holds that:

  1. (i)

    a local descent direction \(d_k = (p_k,q_k)\) is computed;

  2. (ii)

    \(v_k + d_k\) is feasible, \([\,\nabla \!{\mathcal Q}_k^{}(v_k^{}+ d_k^{}\mathop {;}y_k^{\scriptscriptstyle E }, \mu _k^{\scriptscriptstyle R })\,]_{\scriptscriptstyle {\mathcal {A}_\epsilon }^{}\,}\ge -t_k^{}e\), and \(\nabla \!M_k^T\!d_k^{}< 0\), i.e., all three conditions (27) are satisfied; and

  3. (iii)

    \({\mathcal {A}_\epsilon ^{}}(x_k^{},y_k^{},\mu _k^{\scriptscriptstyle R })=\mathcal {A}(x^*) = \mathcal {A}(x_k+p_k)\).

Proof

Part (i) follows from Lemma 6. Part (ii) follows from Lemmas 6(ii) and 9. It remains to prove part (iii). The equality \({\mathcal {A}_\epsilon ^{}}(x_k^{},y_k^{},\mu _k^{\scriptscriptstyle R })=\mathcal {A}(x^*)\) is established in Lemma 2(ii). Suppose that \(i\in \mathcal {A}(x^*) = {\mathcal {A}_\epsilon ^{}}(x_k^{},y_k^{},\mu _k^{\scriptscriptstyle R })\). The definition of the local descent direction \(d_k\) in (25) implies that \([\,x_k+p_k\,]_i = 0\), which gives \(i\in \mathcal {A}(x_k+p_k)\). For the reverse inclusion, suppose that \(i\notin \mathcal {A}(x^*)\), i.e., \(x^*_i > 0\). In this case, the assumption that \(\lim _{k\in \mathcal {S}_*} x_k = x^*\) implies that \([\,x_k\,]_i \ge {\textstyle \frac{1}{2}}x^*_i\) for all \(k\in \mathcal {S}_*\) sufficiently large. Part (i) of Lemma 6 gives \(\max \{\Vert p_k\Vert ,\Vert q_k\Vert \} = O\big (\delta (x_k,y_k)\big )\), and the assumption \(\lim _{k\in \mathcal {S}_*} (x_k,y_k) = (x^*\!,y^*)\) implies that \(\lim _{k\in \mathcal {S}_*} \delta (x_k,y_k) = 0\). It follows that \(\lim _{k\in \mathcal {S}_*} p_k = 0\), with \([\,x_k^{}+ p_k^{}\,]_i^{}\ge {\textstyle \frac{1}{2}}x^*_i + [\,p_k^{}\,]_i^{}\ge \frac{1}{3}x^*_i > 0\) for all \(k\in \mathcal {S}_*\) sufficiently large, which means that \(i\notin \mathcal {A}(x_k+p_k)\). \(\square \)

The next result shows that the flexible line search returns the unit step length for all \(k\in \mathcal {S}_*\) sufficiently large. Lemma 2(vi) and Theorem 1 imply that \(s_k = 0\) and the line-search direction \(\varDelta v_k = d_k\) is a nonzero local descent direction for every \(k\in \mathcal {S}_*\) sufficiently large. In this case the modified Armijo procedure is executed at every \(v_k\) with \(\ell _k = 1\), and reduces to finding an \(\alpha _k\) that satisfies the condition

$$\begin{aligned} M(v_k^{}\mathop {;}y_k^{\scriptscriptstyle E },\mu _k^{\scriptscriptstyle R }) - M(v_k^{}+ \alpha _k^{}d_k^{}\mathop {;}y_k^{\scriptscriptstyle E },\mu _k^{\scriptscriptstyle R }) \ge - \gamma _{\scriptscriptstyle S}^{}\alpha _k^{}\nabla \!M(v_k^{}\mathop {;}y_k^{\scriptscriptstyle E },\mu _k^{\scriptscriptstyle R })^T\!d_k^{}, \end{aligned}$$
(81)

with the merit function defined using either \(\mu = \mu _k\) or \(\mu = \mu _k^{\scriptscriptstyle R }\).
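For illustration, the following Python sketch carries out this backtracking procedure. It is a minimal sketch under stated assumptions rather than a description of the pdSQP implementation: the merit function `M` and its gradient `grad_M` are assumed to be supplied as callables for a fixed choice of \(y_k^{\scriptscriptstyle E }\) and \(\mu \), and step halving is used as a representative contraction rule.

```python
import numpy as np

def armijo_backtrack(M, grad_M, v_k, d_k, gamma_S=0.25, alpha_min=1e-10):
    """Backtracking search for a step satisfying the Armijo condition (81).

    M, grad_M : merit function and gradient for fixed (yE_k, mu); both
                callables of v (assumed names, not the pdSQP interface).
    d_k       : a descent direction, so that grad_M(v_k) @ d_k < 0.
    gamma_S   : Armijo parameter, 0 < gamma_S < 1/2.
    """
    gTd = grad_M(v_k) @ d_k              # directional derivative (negative)
    M0 = M(v_k)
    alpha = 1.0                          # the unit step is tried first
    while alpha >= alpha_min:
        # Condition (81): M(v_k) - M(v_k + alpha d_k) >= -gamma_S alpha gTd
        if M0 - M(v_k + alpha * d_k) >= -gamma_S * alpha * gTd:
            return alpha
        alpha *= 0.5                     # contract and try again
    raise RuntimeError("no acceptable step found")
```

Theorem 2 below asserts that, for all \(k\in \mathcal {S}_*\) sufficiently large, this loop returns immediately with \(\alpha _k = 1\).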

Theorem 2

The line search gives \(\alpha _k=1\) for all \(k\in \mathcal {S}_*\subseteq \mathcal {S}\) sufficiently large.

Proof

Throughout the proof, the quantities \(M(v\mathop {;}y_k^{\scriptscriptstyle E },\mu _k^{\scriptscriptstyle R })\), \(\nabla \!M(v\mathop {;}y_k^{\scriptscriptstyle E },\mu _k^{\scriptscriptstyle R })\), and \(\nabla ^2\!M(v_k^{}\mathop {;}y_k^{\scriptscriptstyle E },\mu _k^{\scriptscriptstyle R })\) are denoted by M(v), \(\nabla \!M(v)\), and \(\nabla ^2\!M_k\). Assumption 4 and part (vi) of Lemma 2 imply that the first-order line-search model is used for all \(k\in \mathcal {S}_*\subseteq \mathcal {S}\) sufficiently large, i.e., the quantity \(\ell _k\) is set to one. A Taylor-series expansion of \(M(v_k+d_k)\) gives

$$\begin{aligned} M(v_k+d_k)&= M(v_k) + \nabla \!M(v_k^{})^T\!d_k^{}+ {\textstyle \frac{1}{2}}d_k^T \nabla ^2\!M_k^{}d_k^{}+ O\Big (\frac{1}{\mu _k^{\scriptscriptstyle R }}\Vert d_k^{}\Vert ^3\Big )\\&= M(v_k) + \nabla \!M(v_k^{})^T\!d_k^{}+ {\textstyle \frac{1}{2}}d_k^T \nabla ^2\!M_k^{}d_k^{}+ O\big (\delta (x_k,y_k)^{1-\gamma }\Vert d_k^{}\Vert ^2\big ), \end{aligned}$$

where the bound on the last term follows from the sequence of estimates \((1/\mu _k^{\scriptscriptstyle R })\Vert d_k^{}\Vert = r(x_k,y_k)^{-\gamma } \Vert d_k\Vert = O\big (\delta (x_k,y_k)^{-\gamma }\big ) \Vert d_k\Vert = O\big (\delta (x_k,y_k)^{1-\gamma }\big )\) derived in Lemmas 4, 1, and 6(i).

Let the scalar \(\bar{\sigma }\) of Lemma 9 be defined so that \((1-\gamma _{\scriptscriptstyle S})\bar{\sigma }= {\textstyle \frac{1}{2}}\), where \(\gamma _{\scriptscriptstyle S}\) (\(0< \gamma _{\scriptscriptstyle S}< {\textstyle \frac{1}{2}}\)) is the parameter used for the modified Armijo condition (30) defined with \(\ell _k = 1\). With this definition, \(\bar{\sigma }\) satisfies \(0<\bar{\sigma }<1\), and Lemma 9 with the particular value \(\bar{\sigma }= {\textstyle \frac{1}{2}}(1-\gamma _{\scriptscriptstyle S})^{-1}\) gives

$$\begin{aligned} \begin{aligned}&M(v_k+d_k)-M(v_k)-\gamma _{\scriptscriptstyle S}\nabla \!M(v_k)^T\!d_k \\&\quad = (1-\gamma _{\scriptscriptstyle S})\nabla \!M(v_k^{})^T\!d_k^{}+ {\textstyle \frac{1}{2}}d_k^T \nabla ^2\!M_k^{}d_k^{}+ O\big (\delta (x_k,y_k)^{1-\gamma }\Vert d_k^{}\Vert ^2\big ) \\&\quad \le [{\textstyle \frac{1}{2}}-(1-\gamma _{\scriptscriptstyle S})\bar{\sigma }] d_k^T\!B_k^{}d_k^{}-(1-\gamma _{\scriptscriptstyle S})\bar{c}\, \Vert d_k\Vert ^2 \\&\qquad + {\textstyle \frac{1}{2}}\Vert \nabla ^2\!M_k-B_k\Vert \,\Vert d_k\Vert ^2 + O\big (\delta (x_k,y_k)^{1-\gamma }\Vert d_k^{}\Vert ^2\big )\\&\quad = - (1 - \gamma _{\scriptscriptstyle S})\bar{c}\,\Vert d_k\Vert ^2 + {\textstyle \frac{1}{2}}\Vert \nabla ^2\!M_k-B_k\Vert \,\Vert d_k\Vert ^2 + O\big (\delta (x_k,y_k)^{1-\gamma }\Vert d_k^{}\Vert ^2\big ), \end{aligned} \end{aligned}$$

for all \(k\in \mathcal {S}_*\) sufficiently large. The global convergence property of Assumption 3(2) implies that \(\lim _{k\in \mathcal {S}_*} \delta (x_k,y_k) = 0\), which gives \(\lim _{k\in \mathcal {S}_*} d_k = 0\) from part (i) of Lemma 6. In addition, Lemma 8 implies that \(\lim _{k\in \mathcal {S}_*} \Vert \nabla ^2\!M_k-B_k\Vert = 0\). The combination of these results gives the estimate

$$\begin{aligned} M(v_k+d_k)-M(v_k)-\gamma _{\scriptscriptstyle S}\nabla \!M(v_k)^T\!d_k \le -(1-\gamma _{\scriptscriptstyle S})\bar{c}\, \Vert d_k\Vert ^2 + o(\Vert d_k\Vert ^2) < 0, \end{aligned}$$

for all \(k\in \mathcal {S}_*\) sufficiently large. As \(\lim _{k\in \mathcal {S}_*} d_k = 0\) (see Lemma 6(i)), the computation of \(\alpha _k = 1\) follows from the previous displayed inequality and the Armijo condition (81). \(\square \)

Next it is shown that the properties established in Lemmas 1–9 and Theorems 1–2 hold for every k sufficiently large, not just those in the set \(\mathcal {S}_*\subseteq \mathcal {S}\).

Theorem 3

For any positive \(\epsilon \) sufficiently small, and any \(\rho \) such that \(1< \rho < 1 + \gamma \), there exists a V-iteration index \(k_{\scriptscriptstyle V}= k_{\scriptscriptstyle V}(\epsilon )\) such that the following results hold for every \(k \ge k_{\scriptscriptstyle V}\):

  1. (i)

    \(\Vert (x_k - x^*\!,y_k-y^*)\Vert \le \epsilon \);

  2. (ii)

    \(\delta (x_{k+1},y_{k+1}) \le \delta (x_k,y_k)^\rho \);

  3. (iii)

    k is a V-iterate; and

  4. (iv)

the results of Lemmas 1–9 and Theorems 1–2 hold.

Proof

Let the positive scalar \(\epsilon \) be sufficiently small that the results of Lemmas 1–9 and Theorems 1–2 hold for every V-O iterate \((x_k,y_k)\) satisfying \(\Vert (x_k - x^*\!,y_k-y^*)\Vert \le \epsilon \). (The proof of (iv) establishes that these results hold for every k sufficiently large.)

Let \((x_k,y_k)\) be a primal-dual iterate with \(k\in \mathcal {S}_*\). Theorem 1 implies that the unit step is accepted in the line search, in which case \((x_{k+1},y_{k+1}) = (x_k+p_k,y_k + q_k)\). Let \(\kappa \) be the positive scalar defined in Lemma 1. Similarly, let \(c_1\) (\(c_1>0\)) and \(c_2\) (\(c_2\ge 1\)) denote constants such that

$$\begin{aligned}&\max \{\Vert x_{k+1} - x_k\Vert ,\Vert y_{k+1} - y_k\Vert \} \le c_1\delta (x_k,y_k), \quad \text {and}\quad \nonumber \\&\quad \delta (x_{k+1},y_{k+1}) \le c_2\delta (x_k,y_k)^{1+\gamma }. \end{aligned}$$
(82)

(The existence of \(c_1\) and \(c_2\) is implied by the results of Lemmas 6(i) and 7.)

If \(\rho \) is any scalar satisfying \(1< \rho < 1 + \gamma \), let \(k_{\scriptscriptstyle V}= k_{\scriptscriptstyle V}(\epsilon )\) be an index in \(\mathcal {S}_*\subseteq \mathcal {S}\) that is sufficiently large that \((x_{k_{\scriptscriptstyle V}},y_{k_{\scriptscriptstyle V}})\) is a V-iterate and satisfies

$$\begin{aligned} \max \left\{ \, \Vert x_{k_{\scriptscriptstyle V}} - x^*\Vert ,\, \Vert y_{k_{\scriptscriptstyle V}}-y^*\Vert , \, 2c_1\delta _{\scriptscriptstyle V},\, 2c_1\delta _{\scriptscriptstyle V}^{\rho }/(1-\delta _{\scriptscriptstyle V}^{\rho }) \, \right\}&\le {\textstyle \frac{1}{4}}\epsilon , \quad \text {and} \end{aligned}$$
(83)
$$\begin{aligned} \max \left\{ \, 2\kappa ^{\rho +2}\delta _{\scriptscriptstyle V}^{\rho -1}/\beta , \, c_2^{}\delta _{\scriptscriptstyle V}^{1+\gamma -\rho }, \, \delta _{\scriptscriptstyle V}^{\rho }\, \right\}&\le 1, \end{aligned}$$
(84)

where \(\delta _{\scriptscriptstyle V}= \delta (x_{k_{\scriptscriptstyle V}},y_{k_{\scriptscriptstyle V}})\), and \(\beta \) (\(0< \beta < 1\)) is the weight used in the definitions of \(\phi _{\scriptscriptstyle V}(x,y)\) and \(\phi _{\scriptscriptstyle O}(x,y)\). The following argument shows that an index \(k_{\scriptscriptstyle V}\) satisfying these conditions must exist. As \(\lim _{k\in \mathcal {S}_*} (x_k,y_k) = (x^*\!,y^*)\), the feasibility and optimality measures (12) satisfy \(\lim _{k\in \mathcal {S}_*} \phi _{\scriptscriptstyle V}(x_k,y_k) = 0\) and \(\lim _{k\in \mathcal {S}_*} \phi _{\scriptscriptstyle O}(x_k,y_k) = 0\). As Assumption 3(2) implies that there are infinitely many V-O iterates, and the condition \(\phi _{\scriptscriptstyle V}(v_k) \le {\textstyle \frac{1}{2}}\phi _{{\scriptscriptstyle V}\!,\,k}^{\max }\) for a V-iteration is checked before the condition for an O-iteration, there must be infinitely many V-iterates. In addition, as \(\lim _{k\in \mathcal {S}_*} \delta (x_k,y_k) = 0\), there must be an index \(k=k_{\scriptscriptstyle V}\) such that \(\delta _{\scriptscriptstyle V}= \delta (x_k,y_k)\) is sufficiently small to give (83) and (84).

An inductive argument is used to prove that parts (i)–(iv) hold for all \(k \ge k_{\scriptscriptstyle V}\). The base case is \(k = k_{\scriptscriptstyle V}\). The definition of \(k_{\scriptscriptstyle V}\) implies that \(k=k_{\scriptscriptstyle V}\) is a V-iteration index, and it follows trivially that part (iii) holds. Moreover, the assumption (83) and standard norm inequalities yield

$$\begin{aligned} \Vert (x_{k_{\scriptscriptstyle V}}-x^*\!,y_{k_{\scriptscriptstyle V}}-y^*)\Vert \le \Vert x_{k_{\scriptscriptstyle V}}-x^*\Vert + \Vert y_{k_{\scriptscriptstyle V}}-y^*\Vert \le {\textstyle \frac{1}{4}}\epsilon + {\textstyle \frac{1}{4}}\epsilon < \epsilon , \end{aligned}$$
(85)

which establishes part (i) for \(k=k_{\scriptscriptstyle V}\). It follows immediately from (85) and the choice of \(\epsilon \) that part (iv) holds for \(k=k_{\scriptscriptstyle V}\). As part (iv) holds for \(k=k_{\scriptscriptstyle V}\), the bounds (82) and (84) may be combined to give \(\delta (x_{k_{\scriptscriptstyle V}+1},y_{k_{\scriptscriptstyle V}+1}) \le c_2^{}\delta _{\scriptscriptstyle V}^{1+\gamma } = c_2^{}\delta _{\scriptscriptstyle V}^{1+\gamma -\rho }\delta _{\scriptscriptstyle V}^\rho \le \delta _{\scriptscriptstyle V}^\rho \), which establishes part (ii) for \(k = k_{\scriptscriptstyle V}\). This completes the base case \(k = k_{\scriptscriptstyle V}\).

The inductive hypothesis is that (i)–(iv) hold for every iterate k such that \(k_{\scriptscriptstyle V}\le k \le k_{\scriptscriptstyle V}+ j - 1\). Under this hypothesis, it must be shown that (i)–(iv) hold for \(k = k_{\scriptscriptstyle V}+j\). For part (i), standard norm inequalities give

$$\begin{aligned}&\left\| \begin{pmatrix}x_{k_{\scriptscriptstyle V}+j}-x^*\\ y_{k_{\scriptscriptstyle V}+j}-y^*\end{pmatrix}\right\| \le \Vert x_{k_{\scriptscriptstyle V}+j}-x^*\Vert + \Vert y_{k_{\scriptscriptstyle V}+j}-y^*\Vert \\&= \Big \Vert \sum _{l=0}^{j-1} (x_{k_{\scriptscriptstyle V}+l+1}-x_{k_{\scriptscriptstyle V}+l}) + x_{k_{\scriptscriptstyle V}} - x^*\Big \Vert + \Big \Vert \sum _{l=0}^{j-1} (y_{k_{\scriptscriptstyle V}+l+1}-y_{k_{\scriptscriptstyle V}+l}) + y_{k_{\scriptscriptstyle V}} - y^*\Big \Vert \\&\le \sum _{l=0}^{j-1} \big (\Vert x_{k_{\scriptscriptstyle V}+l+1}-x_{k_{\scriptscriptstyle V}+l}\Vert + \Vert y_{k_{\scriptscriptstyle V}+l+1}-y_{k_{\scriptscriptstyle V}+l}\Vert \big ) + \Vert x_{k_{\scriptscriptstyle V}} - x^*\Vert + \Vert y_{k_{\scriptscriptstyle V}} - y^*\Vert \\&\le 2c_1 \sum _{l=0}^{j-1} \delta (x_{k_{\scriptscriptstyle V}+l},y_{k_{\scriptscriptstyle V}+l}) + {\textstyle \frac{1}{2}}\epsilon , \end{aligned}$$

where the first inequality of (82) has been used to bound each of the terms in the summation, and the term \(\Vert x_{k_{\scriptscriptstyle V}} - x^*\Vert + \Vert y_{k_{\scriptscriptstyle V}} - y^*\Vert \) is estimated by (85). It follows from the inductive hypothesis for part (ii) and (83) that

$$\begin{aligned} \left\| \begin{pmatrix}x_{k_{\scriptscriptstyle V}+j}-x^*\\ y_{k_{\scriptscriptstyle V}+j}-y^*\end{pmatrix}\right\| \le 2c_1\Big [\delta _{\scriptscriptstyle V}+ \sum _{i=1}^{j-1}\delta _{\scriptscriptstyle V}^{i\rho }\Big ] + {\textstyle \frac{1}{2}}\epsilon < 2c_1\Big [\delta _{\scriptscriptstyle V}+ \frac{\delta _{\scriptscriptstyle V}^\rho }{1-\delta _{\scriptscriptstyle V}^\rho }\Big ] + {\textstyle \frac{1}{2}}\epsilon \le \epsilon , \end{aligned}$$

which establishes that part (i) holds for \(k = k_{\scriptscriptstyle V}+j\).

The next stage involves establishing that part (iii) holds for \(k = k_{\scriptscriptstyle V}+j\). For all \(k\ge k_{\scriptscriptstyle V}\), it holds that \(\xi _k = 0\) and the feasibility measure \(\phi _{\scriptscriptstyle V}\) satisfies

$$\begin{aligned} \beta r(x_k,y_k) \le \phi _{\scriptscriptstyle V}(x_k,y_k) = \eta (x_k) + \beta \omega (x_k,y_k,\xi _k) \le 2r(x_k,y_k) \le 2\kappa \delta (x_k,y_k), \end{aligned}$$

where the last inequality follows from Lemma 1. Applying these inequalities at \((x_{k_{\scriptscriptstyle V}+j},y_{k_{\scriptscriptstyle V}+j})\), together with Lemma 1 and the induction assumption (ii) at \((x_{k_{\scriptscriptstyle V}+j-1},y_{k_{\scriptscriptstyle V}+j-1})\), gives

$$\begin{aligned}&\phi _{\scriptscriptstyle V}(x_{k_{\scriptscriptstyle V}+j},y_{k_{\scriptscriptstyle V}+j}) \nonumber \\&\quad \le 2\kappa \delta (x_{k_{\scriptscriptstyle V}+j},y_{k_{\scriptscriptstyle V}+j}) \le 2\kappa \delta (x_{k_{\scriptscriptstyle V}+j-1},y_{k_{\scriptscriptstyle V}+j-1})^\rho \nonumber \\&\quad \le 2\kappa ^{\rho +1} r(x_{k_{\scriptscriptstyle V}+j-1},y_{k_{\scriptscriptstyle V}+j-1})^\rho \nonumber \\&\quad = 2\kappa ^{\rho +1} r(x_{k_{\scriptscriptstyle V}+j-1},y_{k_{\scriptscriptstyle V}+j-1})^{\rho -1}r(x_{k_{\scriptscriptstyle V}+j-1},y_{k_{\scriptscriptstyle V}+j-1}) \nonumber \\&\quad \le (2\kappa ^{\rho +1}/\beta ) r(x_{k_{\scriptscriptstyle V}+j-1},y_{k_{\scriptscriptstyle V}+j-1})^{\rho -1}\phi _{\scriptscriptstyle V}(x_{k_{\scriptscriptstyle V}+j-1},y_{k_{\scriptscriptstyle V}+j-1}). \end{aligned}$$
(86)

If \(\phi _{{\scriptscriptstyle V}\!,\,k-1}^{\max }\) denotes the parameter used in condition (13) to test for a V-iterate, then the assumption that \((x_{k_{\scriptscriptstyle V}+j-1},y_{k_{\scriptscriptstyle V}+j-1})\) is a V-iterate implies that the inequality \(\phi _{\scriptscriptstyle V}(x_{k_{\scriptscriptstyle V}+j-1},y_{k_{\scriptscriptstyle V}+j-1}) \le {\textstyle \frac{1}{2}}\phi _{{\scriptscriptstyle V}\!,k_{\scriptscriptstyle V}+j-1}^{\max }\) holds. This allows the bound (86) to be extended so that

$$\begin{aligned} \phi _{\scriptscriptstyle V}(x_{k_{\scriptscriptstyle V}+j},y_{k_{\scriptscriptstyle V}+j})&\le (\kappa ^{\rho +1}/\beta ) r(x_{k_{\scriptscriptstyle V}+j-1},y_{k_{\scriptscriptstyle V}+j-1})^{\rho -1}\phi _{{\scriptscriptstyle V}\!,k_{\scriptscriptstyle V}+j-1}^{\max }\\&\le (\kappa ^{\rho +2}/\beta ) \delta (x_{k_{\scriptscriptstyle V}+j-1},y_{k_{\scriptscriptstyle V}+j-1})^{\rho -1}\phi _{{\scriptscriptstyle V}\!,k_{\scriptscriptstyle V}+j-1}^{\max }\\&\le (\kappa ^{\rho +2}\delta _{\scriptscriptstyle V}^{\rho -1}/\beta ) \phi _{{\scriptscriptstyle V}\!,k_{\scriptscriptstyle V}+j-1}^{\max } \le {\textstyle \frac{1}{2}}\phi _{{\scriptscriptstyle V}\!,k_{\scriptscriptstyle V}+j-1}^{\max }. \end{aligned}$$

The last of these inequalities follows from (84) and implies that \(k_{\scriptscriptstyle V}+j\) is a V-iterate. This establishes that part (iii) holds for \(k = k_{\scriptscriptstyle V}+j\), as required. Part (iv) then follows immediately from the choice of \(\epsilon \) and the fact that (i) and (iii) hold at \(k = k_{\scriptscriptstyle V}+j\).

It remains to show that (ii) holds for \(k = k_{\scriptscriptstyle V}+ j\). It follows from the bound (84) and the definition of \(\rho \) \((\rho > 1)\) that \(c_2^{}(\delta _{\scriptscriptstyle V}^{j\rho })^{1+\gamma -\rho } \le c_2^{}\delta _{\scriptscriptstyle V}^{\rho (1+\gamma -\rho )} \le c_2^{}\delta _{\scriptscriptstyle V}^{1+\gamma -\rho } \le 1\). This inequality, the induction hypotheses of parts (ii) and (iv), and Lemma 7 together give

$$\begin{aligned} \delta (x_{k_{\scriptscriptstyle V}+j+1},y_{k_{\scriptscriptstyle V}+j+1})&\le c_2^{}\delta (x_{k_{\scriptscriptstyle V}+j},y_{k_{\scriptscriptstyle V}+j})^{1+\gamma } \\&= c_2^{}\delta (x_{k_{\scriptscriptstyle V}+j},y_{k_{\scriptscriptstyle V}+j})^{1+\gamma -\rho }\delta (x_{k_{\scriptscriptstyle V}+j},y_{k_{\scriptscriptstyle V}+j})^\rho \\&\le c_2^{}(\delta _{\scriptscriptstyle V}^{j\rho })^{1+\gamma -\rho }\delta (x_{k_{\scriptscriptstyle V}+j},y_{k_{\scriptscriptstyle V}+j})^\rho \le \delta (x_{k_{\scriptscriptstyle V}+j},y_{k_{\scriptscriptstyle V}+j})^\rho , \end{aligned}$$

which shows that part (ii) holds for \(k = k_{\scriptscriptstyle V}+j\). This completes the proof.\(\square \)

It remains to establish the rate of convergence of the primal-dual iterates to \((x^*\!,y^*)\). The proof is based on showing that the iterates are equivalent to those of an sSQP method for which superlinear convergence has been established.

Theorem 4

The iterates satisfy \(\lim _{k\rightarrow \infty } (x_k,y_k) = (x^*\!,y^*)\) and the convergence rate is superlinear.

Proof

As \(\epsilon >0\) was arbitrary in Theorem 3, it follows that \(\lim _{k\rightarrow \infty } (x_k,y_k) = (x^*\!,y^*)\). It remains to show that the convergence rate is superlinear. Theorem 3(iii) shows that the iterates generated by the algorithm are all V-iterates for k sufficiently large. Moreover, Theorem 3(iv) implies that Lemmas 1–9 and Theorems 1–2 hold for all k sufficiently large (not just for \(k\in \mathcal {S}_*\subseteq \mathcal {S}\)). It follows that for all k sufficiently large: (a) \(\mu _k^{\scriptscriptstyle R }=r(x_k,y_k)^\gamma \) (from Lemma 4); (b) \(\mathcal {A}(x^*) = \mathcal {A}(x_k) = {\mathcal {A}_\epsilon ^{}}(x_k^{},y_k^{},\mu _k^{\scriptscriptstyle R })\) (from Lemma 2(ii)); and (c) \((x_{k+1},y_{k+1}) = (x_k + p_k,y_k + q_k)\) with every direction \((p_k,q_k)\) a local descent direction (from Theorems 2 and 1(i)–(iii)). The combination of these results gives \([\,x_k\,]_{\scriptscriptstyle \mathcal {A}}= 0\) for all k sufficiently large, where the suffix “\(\scriptstyle \mathcal {A}\)” denotes the components with indices in the optimal active set \(\mathcal {A}(x^*)\). It follows that the sequence \((x_k,y_k)\) is identical to the sequence generated by a conventional sSQP method applied to the equality-constrained problem (42), i.e., the iterates correspond to applying a conventional sSQP method to problem (NP) after the active set has been correctly estimated (the associated stabilized QP subproblem is defined in the statement of Lemma 5). The superlinear rate of convergence of the iterates now follows, for example, from [24, Theorem 1]. \(\square \)

4 Numerical experiments

This section concerns an implementation of the algorithm described in Sect. 2, and includes the results of some numerical experiments designed to illustrate the behavior of the algorithm on degenerate problems. Sections 4.1–4.4 evaluate the performance of the method on problems that exhibit various forms of degeneracy. All the results are from a variant of the method that does not test for a direction of negative curvature until a first-order stationary point is located. Both the global and local convergence analyses remain valid for this variant.

From a numerical stability perspective, it is important that every computation be performed without forming the matrix \(B(v_k^{}\mathop {;}\mu )\) given by (6) explicitly. All the relevant properties of the matrix B may be determined from the matrix

$$\begin{aligned} \begin{pmatrix} H(x,y) &{} J(x)^T \\ J(x) &{} -\mu I \end{pmatrix}, \end{aligned}$$

which is said to have “regularized KKT form.” In particular, each iteration involves the factorization of a matrix of the form

$$\begin{aligned} K\!_{\scriptscriptstyle \mathcal {F}_\epsilon }= \begin{pmatrix} H\!_{\scriptscriptstyle \mathcal {F}_\epsilon }(x,y) &{} \,J\!_{\scriptscriptstyle \mathcal {F}_\epsilon }(x)^T \\ J\!_{\scriptscriptstyle \mathcal {F}_\epsilon }(x) &{} - \mu _k^{\scriptscriptstyle R }I \end{pmatrix}. \end{aligned}$$
(87)

The (implicitly defined) positive-definite matrix \(\widehat{B}(v_k^{}\mathop {;}\mu _k^{\scriptscriptstyle R })\) (21) associated with the bound-constrained QP problem (19) is obtained by using a pre-convexification scheme. Specifically, the positive-definite matrix \(\widehat{H}\) of (21) has the form \(\widehat{H}(x_k,y_k)= H(x_k,y_k) + E_k + D_k\) for some positive-semidefinite matrix \(E_k\) and positive-semidefinite diagonal matrix \(D_k\), as described in [16, Sect. 4]. If the matrix formed from the \(\epsilon \)-free rows and columns of B is positive definite (see (6)), then \(E_k\) is zero, in which case the (implicit) \(\widehat{B}\!_{\scriptscriptstyle {\mathcal {F}_\epsilon }^{}\,}(x_k^{},y_k^{}\mathop {;}\mu _k^{\scriptscriptstyle R })\) is equal to \(B\!_{\scriptscriptstyle {\mathcal {F}_\epsilon }^{}\,}(x_k^{},y_k^{}\mathop {;}\mu _k^{\scriptscriptstyle R })\) and the regularized KKT equations remain unmodified (see the equations (93) below). The calculation of the matrix \(E_k\) is based on an LBL\(^T\) factorization of a matrix of the form (87). The factorization also provides the direction of negative curvature \(s_k^{(1)}\) (11) used to compute \(\xi _k\) (see, e.g., Forsgren [11], Forsgren and Gill [12], and Kungurtsev [27, Chapter 9]).
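The inertia computation can be illustrated as follows. The sketch below, written in Python for concreteness, assembles the matrix (87) and reads off its inertia from a symmetric-indefinite factorization; scipy's `ldl` routine is used as a stand-in for the LBL\(^T\) factorization employed by pdSQP, and the names `H_F`, `J_F`, and `muR` are assumptions of this sketch.

```python
import numpy as np
from scipy.linalg import ldl

def regularized_kkt(H_F, J_F, muR):
    """Assemble the regularized KKT matrix (87) for the eps-free variables."""
    m = J_F.shape[0]
    return np.block([[H_F,            J_F.T],
                     [J_F, -muR * np.eye(m)]])

def kkt_inertia(K):
    """Return (n_pos, n_neg, n_zero) for the symmetric matrix K.

    The inertia is read from the eigenvalues of the block-diagonal factor
    D (1x1 and 2x2 blocks) of an LDL^T factorization of K.
    """
    _, D, _ = ldl(K)
    eigs = np.linalg.eigvalsh(D)
    tol = 1e-12 * max(1.0, np.abs(eigs).max())
    return ((eigs > tol).sum(), (eigs < -tol).sum(),
            (np.abs(eigs) <= tol).sum())
```

A Schur-complement argument shows that \(K\!_{\scriptscriptstyle \mathcal {F}_\epsilon }\) has inertia \((n_{\scriptscriptstyle \mathcal {F}_\epsilon }, m, 0)\) if and only if \(H\!_{\scriptscriptstyle \mathcal {F}_\epsilon }+ (1/\mu _k^{\scriptscriptstyle R }) J\!_{\scriptscriptstyle \mathcal {F}_\epsilon }^T J\!_{\scriptscriptstyle \mathcal {F}_\epsilon }\) is positive definite, which is how a factorization of (87) reveals the relevant properties of B without forming B explicitly.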

Solution of the QP subproblem. Let \(\widehat{\mathcal {Q}}_k(v)\) denote the convex QP objective (20) defined with parameters \(y_k^{\scriptscriptstyle E }\) and \(\mu _k^{\scriptscriptstyle R }\). Given an initial feasible point \(\widehat{v}_k^{(0)}\) for problem (19) (i.e., a point such that \([\,\widehat{v}_k^{(0)}\,]_i \ge 0\), \(i=1 \,{:}\,n\)), a typical active-set method generates a feasible sequence \(\{\,\widehat{v}_k^{(j)}\,\}_{j>0}\) such that \(\widehat{\mathcal {Q}}_k^{}(\widehat{v}_k^{(j)}) \le \widehat{\mathcal {Q}}_k^{}(\widehat{v}_k^{(j-1)})\) and \(\widehat{v}_k^{(j)}\) minimizes \(\widehat{\mathcal {Q}}_k^{}(v)\) on a “working set” \(\mathcal {W}_j\) of variables fixed at their bounds. An iterate \(\widehat{v}_k^{(j)}\) is optimal for (19) if the Lagrange multipliers for the bound constraints in the working set are nonnegative, i.e.,

$$\begin{aligned}{}[\,\nabla \!\widehat{\mathcal {Q}}_k^{}(\widehat{v}_k^{(j)})\,]_{\scriptscriptstyle \mathcal {W}_{j}} = [\,\nabla \!M(v_k^{}\mathop {;}y_k^{\scriptscriptstyle E },\mu _k^{\scriptscriptstyle R }) + \widehat{B}(v_k^{}\mathop {;}\mu _k^{\scriptscriptstyle R })(\widehat{v}_k^{(j)} - v_k^{})\,]_{\scriptscriptstyle \mathcal {W}_{j}} \ge 0, \end{aligned}$$
(88)

where the suffix “\(\scriptstyle \mathcal {W}_j\)” denotes the vector of components with indices in the working set \(\mathcal {W}_j\). The initial working set \(\mathcal {W}_0\) is defined as the \(\epsilon \)-active set \({\mathcal {A}_\epsilon ^{}}(x_k^{},y_k^{},\mu _k^{\scriptscriptstyle R })\). The first step is to move the current iterate \(v_k\) onto the bounds in the working set. This gives the first feasible point \(\widehat{v}_k^{(0)}\) such that

$$\begin{aligned}{}[\,\widehat{v}_k^{(0)}\,]_{\scriptscriptstyle {\mathcal {A}_\epsilon }^{}\,}= 0 \quad \text {and}\quad [\,\widehat{v}_k^{(0)}\,]_{\scriptscriptstyle {\mathcal {F}_\epsilon }^{}\,}= [\,v_k^{}\,]_{\scriptscriptstyle {\mathcal {F}_\epsilon }^{}\,}, \end{aligned}$$
(89)

where the suffices “\(\scriptstyle \mathcal {A}_\epsilon \)” and “\(\scriptstyle \mathcal {F}_\epsilon \)” refer to the components associated with the \(\epsilon \)-active and \(\epsilon \)-free sets at \((x_k,y_k)\). In general, \(\widehat{v}_k^{(0)}\) will not minimize \(\widehat{\mathcal {Q}}_k(v)\) on \(\mathcal {W}_0\), and an estimate of the next QP iterate \(\widehat{v}_k^{(1)}\) is found by minimizing \(\widehat{\mathcal {Q}}_k(v)\) subject to \([\,v\,]_{\scriptscriptstyle \mathcal {W}_{0}} = 0\). If the primal components of this solution are feasible for \(x\ge 0\), then the solution is used to define \(\widehat{v}_k^{(1)}\). Otherwise, one of the violated bounds is added to the working set and the iteration is repeated. Eventually, the working set will include enough bounds to define an appropriate minimizer \(\widehat{v}_k^{(1)}\). If \(\widehat{v}_k^{(1)}\) does not satisfy the gradient condition (88), then the index of a variable with a negative gradient is selected for deletion from \(\mathcal {W}_1\).
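The following Python sketch traces the active-set iteration just described for the convex bound-constrained subproblem (19). It is an illustrative sketch, not the pdSQP solver: dense linear algebra is used, `g` stands for \(\nabla \!M(v_k^{}\mathop {;}y_k^{\scriptscriptstyle E },\mu _k^{\scriptscriptstyle R })\), `Bhat` for the positive-definite matrix \(\widehat{B}\), `bounded` for the indices of the bound-constrained (primal) components, and `W0` for the initial working set \(\mathcal {W}_0 = {\mathcal {A}_\epsilon ^{}}(x_k^{},y_k^{},\mu _k^{\scriptscriptstyle R })\).

```python
import numpy as np

def bound_qp_active_set(g, Bhat, v0, bounded, W0, max_iter=200):
    """Primal active-set sketch for problem (19):
        minimize   g@(v - v0) + 0.5*(v - v0) @ Bhat @ (v - v0)
        subject to v[i] >= 0 for i in `bounded`.
    Assumes Bhat is positive definite and W0 is a subset of `bounded`.
    """
    n = len(v0)
    W = set(W0)
    v = v0.copy()
    v[list(W)] = 0.0                     # step (89): move onto the bounds
    for _ in range(max_iter):
        free = np.array(sorted(set(range(n)) - W), dtype=int)
        grad = g + Bhat @ (v - v0)       # gradient of the QP objective at v
        p = np.zeros(n)
        p[free] = np.linalg.solve(Bhat[np.ix_(free, free)], -grad[free])
        # bounds that block the full step to the subspace minimizer
        block = [i for i in free if i in bounded and v[i] + p[i] < 0.0]
        if not block:
            v = v + p                    # subspace minimizer is feasible
            grad = g + Bhat @ (v - v0)
            neg = [i for i in W if grad[i] < 0.0]
            if not neg:
                return v, W              # multiplier test (88) holds
            W.remove(min(neg, key=lambda i: grad[i]))   # delete a bound
        else:
            j = min(block, key=lambda i: -v[i] / p[i])  # nearest bound
            v = v + (-v[j] / p[j]) * p   # largest feasible step
            v[j] = 0.0
            W.add(j)
    raise RuntimeError("active-set iteration limit reached")
```

The subspace solve is repeated from scratch here for clarity; a practical implementation updates a factorization of \(\widehat{B}\) as indices enter and leave the working set.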

Computation of the local descent direction. Here, \(v_k + d_k\) is a solution of the equality-constrained subproblem (24) and must satisfy the optimality conditions (25). Let \({\mathcal Q}_k(v)\) denote the QP objective (5) defined with parameters \(y_k^{\scriptscriptstyle E }\) and \(\mu _k^{\scriptscriptstyle R }\). The vector \(d_k\) is computed in the form \(d_k^{}= \widehat{v}_k^{(0)} + \varDelta \widehat{v}_k^{(0)} - v_k^{}\), where \(\widehat{v}_k^{(0)}\) is the feasible point (89) and \(\varDelta \widehat{v}_k^{(0)}\) is defined uniquely by the equations

$$\begin{aligned}{}[\,\varDelta \widehat{v}_k^{(0)}\,]_{\scriptscriptstyle {\mathcal {A}_\epsilon }^{}\,}=0, \;\;\text {and}\;\; B\!_{\scriptscriptstyle {\mathcal {F}_\epsilon }^{}\,}[\,\varDelta \widehat{v}_k^{(0)}\,]_{\scriptscriptstyle {\mathcal {F}_\epsilon }^{}\,}=-[\,\nabla \!{\mathcal Q}_k^{}(\widehat{v}_k^{(0)})\,]_{\scriptscriptstyle {\mathcal {F}_\epsilon }^{}\,}. \end{aligned}$$
(90)

The definition of \(\widehat{v}_k^{(0)}\) from (89) together with the form of the \(\epsilon \)-free and \(\epsilon \)-active components of \(d_k\) yields

$$\begin{aligned}{}[\,d_k\,]_{\scriptscriptstyle \mathcal {F}_\epsilon }= [\,\varDelta \widehat{v}_k^{(0)}\,]_{\scriptscriptstyle \mathcal {F}_\epsilon }\;\;\text {and}\;\; [\,d_k\,]_{\scriptscriptstyle \mathcal {A}_\epsilon }= -[\,v_k\,]_{\scriptscriptstyle \mathcal {A}_\epsilon }= -[\,x_k\,]_{\scriptscriptstyle \mathcal {A}_\epsilon }\le 0, \end{aligned}$$
(91)

where the last inequality follows from the feasibility of \(x_k\) with respect to the bounds. The benefit of computing \(d_k\) in this form is that the vector \(\widehat{v}_k^{(0)} + \varDelta \widehat{v}_k^{(0)}\) is an initial estimate of \(\widehat{v}_k^{(1)}\) used in the active-set method for solving the inequality constrained QP (19). (The conditions necessary for the computation of the local descent direction include the fact that \(B\!_{\scriptscriptstyle \mathcal {F}_\epsilon }\) must be positive definite, which implies that \(\widehat{B}\!_{\scriptscriptstyle {\mathcal {F}_\epsilon }^{}\,}= B\!_{\scriptscriptstyle {\mathcal {F}_\epsilon }^{}\,}\).) It follows that if the local descent direction does not satisfy the conditions (27) and is not suitable for the line search, it may be used to initialize the active-set method for solving (19).

The system of equations for \([\,\varDelta \widehat{v}_k^{(0)}\,]_{\scriptscriptstyle {\mathcal {F}_\epsilon }^{}\,}\) in (90) may be written in regularized KKT form as follows. Consider the matrix \(U_{\scriptscriptstyle \mathcal {F}_\epsilon }= \left( {\begin{matrix} I &{} -(2/\mu _k^{\scriptscriptstyle R }) J_{\scriptscriptstyle \mathcal {F}_\epsilon }(x_k)^T \\ 0 &{} I \end{matrix}}\right) \), where \(J_{\scriptscriptstyle \mathcal {F}_\epsilon }(x_k)\) denotes the matrix of \(\epsilon \)-free columns of \(J(x_k)\). The matrix \(U_{\scriptscriptstyle \mathcal {F}_\epsilon }\) is nonsingular and can be applied to both sides of (90) without changing the solution. Using the definitions (91) and performing some simplification yields

$$\begin{aligned}&\begin{pmatrix} H\!_{\scriptscriptstyle \mathcal {F}_\epsilon }(x_k,y_k) &{} J_{\scriptscriptstyle \mathcal {F}_\epsilon }(x_k^{})^T \\ J_{\scriptscriptstyle \mathcal {F}_\epsilon }(x_k) &{} -\mu _k^{\scriptscriptstyle R }I \end{pmatrix} \begin{pmatrix} [\,p_k\,]_{\scriptscriptstyle \mathcal {F}_\epsilon }\\ - q_k \end{pmatrix} \nonumber \\&\quad = -\begin{pmatrix} [\,g(x_k^{}) + H(x_k^{},y_k^{})(\widehat{x}_k^{(0)} - x_k^{}) - J(x_k)^T\!y_k\,]_{\scriptscriptstyle \mathcal {F}_\epsilon }\\ c(x_k^{}) + J(x_k^{})(\widehat{x}_k^{(0)} - x_k^{}) + \mu _k^{\scriptscriptstyle R }(y_k^{}- y_k^{\scriptscriptstyle E }) \end{pmatrix}, \end{aligned}$$
(92)

where \(p_k\) and \(q_k\) are the vectors of primal and dual components of \(d_k\), and \(H\!_{\scriptscriptstyle \mathcal {F}_\epsilon }(x_k,y_k)\) is the matrix of \(\epsilon \)-free rows and columns of \(H(x_k,y_k)\).

The local convergence analysis of Sect. 3 implies that for k sufficiently large, it must hold that \({\mathcal {A}_\epsilon ^{}}(x_k^{},y_k^{},\mu _k^{\scriptscriptstyle R }) = \mathcal {A}(x^*)\), \([\,x_k\,]_{\scriptscriptstyle \mathcal {A}_\epsilon }= 0\), and \(y_k^{\scriptscriptstyle E }= y_k^{}\). It follows that \(\widehat{x}_k^{(0)} = x_k^{}\) and the equations (92) become

$$\begin{aligned} \begin{pmatrix} H\!_{\scriptscriptstyle \mathcal {F}_\epsilon }(x_k,y_k) &{} J_{\scriptscriptstyle \mathcal {F}_\epsilon }(x_k)^T \\ J_{\scriptscriptstyle \mathcal {F}_\epsilon }(x_k) &{} -\mu _k^{\scriptscriptstyle R }I \end{pmatrix} \begin{pmatrix} \,[\,p_k\,]_{\scriptscriptstyle \mathcal {F}_\epsilon }\\ - q_k \end{pmatrix} = -\begin{pmatrix} [\,g(x_k) - J(x_k)^T\!y_k\,]_{\scriptscriptstyle {\mathcal {F}_\epsilon }^{}\,}\\ c(x_k) \end{pmatrix}, \end{aligned}$$
(93)

i.e., the dual-regularized Newton equations for minimizing M on \({\mathcal {A}_\epsilon ^{}}\).
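As a concrete illustration, the following Python sketch assembles and solves the system (93) for the \(\epsilon \)-free primal components and the dual step. It is a dense-algebra sketch under assumed names (`gF`, `c`, `H_F`, `J_F`, `muR`); pdSQP instead works with a single factorization of (87), as discussed above.

```python
import numpy as np

def local_descent_direction(gF, c, H_F, J_F, muR):
    """Solve the regularized Newton equations (93).

    gF : eps-free components of g(x_k) - J(x_k)^T y_k
    c  : c(x_k)
    Returns the eps-free primal step p_F and the dual step q.
    """
    nF, m = H_F.shape[0], J_F.shape[0]
    K = np.block([[H_F,            J_F.T],
                  [J_F, -muR * np.eye(m)]])
    sol = np.linalg.solve(K, -np.concatenate([gF, c]))
    p_F, q = sol[:nF], -sol[nF:]         # (93) is solved for (p_F, -q)
    return p_F, q
```

The \(\epsilon \)-active components of the primal step are not computed: by (91) they are fixed at \(-[\,x_k\,]_{\scriptscriptstyle \mathcal {A}_\epsilon }\).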

Parameter definitions. The numerical experiments were performed using pdSQP, a preliminary implementation of the method written in MATLAB [29]. The control parameters and their initial values are specified in Table 1. If pdSQP did not converge within \(k_{{\max }}= 1000\) iterations, it was considered to have failed. The tests used to terminate the algorithm at an approximate solution or an infeasible stationary point are given by (32) and (33), respectively.

Table 1 Control parameters and initial values required by Algorithm pdSQP

4.1 Degenerate CUTEst problems

The local rate of convergence of algorithm pdSQP was investigated for a set of degenerate problems from the CUTEst test set [18]. In particular, 84 problems were identified for which the active-constraint Jacobian is numerically rank deficient at the computed solution. In addition, 56 problems have at least one negligible multiplier associated with a variable on its bound. Here, a multiplier is considered negligible if it is less than \(\tau _{\scriptstyle \mathrm {stop}}\) in absolute value. A zero multiplier associated with an active constraint implies that the property of strict complementarity does not hold. A total of 26 problems were identified that fail both the linear independence constraint qualification (LICQ) and strict complementarity.
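The two tests used to identify these problems may be summarized by the following Python sketch; the tolerance handling and scaling are assumptions made here for concreteness, not a record of the actual test harness.

```python
import numpy as np

def classify_degeneracy(J_active, z_active, tau_stop=1e-8):
    """Classify a computed solution by the two degeneracy tests above.

    J_active : Jacobian of the constraints active at the solution
    z_active : multipliers of the variables on their bounds
    """
    sv = np.linalg.svd(J_active, compute_uv=False)
    licq_fails = sv.min() <= tau_stop * max(1.0, sv.max())  # rank deficiency
    strict_comp_fails = bool(np.any(np.abs(z_active) < tau_stop))
    return licq_fails, strict_comp_fails
```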

For these degenerate problems, the order of convergence was estimated by

$$\begin{aligned} \hbox {EOC} = \log r(x_{k_f},y_{k_f}) / \log r(x_{k_f-1},y_{k_f-1}), \end{aligned}$$
(94)

where \(k_f\) denotes the final computed iterate. The results are given in Table 2. The column headed “Last is global” contains the statistics for problems for which the final search direction is a global descent direction. The column headed “Last is local” gives the statistics for problems for which the final direction is a local descent direction. The column headed “Last two are local” contains the statistics for problems for which the final two descent steps are local descent directions. The values in parentheses indicate the number of problems that satisfy the weak second-order sufficient optimality conditions, i.e., the Hessian of the Lagrangian is positive definite on the null space of the active constraint Jacobian matrix. In the implementation considered here, this property is considered to hold if the smallest eigenvalue of \(Z^T\!H_{\scriptscriptstyle \mathcal {F}_\epsilon }Z\) is greater than \(\tau _{\scriptstyle \mathrm {stop}}\), where the columns of Z form a basis for the null space of \(J_{\scriptscriptstyle \mathcal {F}_\epsilon }\).
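For reference, (94) amounts to the following one-line computation; the function name is an assumption of this sketch.

```python
import numpy as np

def estimated_order_of_convergence(r_last, r_prev):
    """EOC = log r(x_kf, y_kf) / log r(x_{kf-1}, y_{kf-1}), per (94).

    Both residuals are assumed to lie in (0, 1), so both logarithms are
    negative; values well above one indicate superlinear convergence.
    """
    return np.log(r_last) / np.log(r_prev)
```

For example, successive residuals of \(10^{-4}\) and \(10^{-8}\) give \(\hbox {EOC} = 2\), consistent with quadratic convergence.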

Table 2 Estimated order of convergence for pdSQP on the degenerate CUTEst problems

Table 2 shows that if the LICQ does not hold but strict complementarity does, then local descent steps are computed in the final stages and contribute to a superlinear rate of convergence. Moreover, superlinear convergence is typical even when the local descent step is not computed. This observation is consistent with [27, Chapter 8], which shows that the iterates generated by algorithm pdSQP of Gill and Robinson [15] converge superlinearly when both the second-order sufficient conditions for optimality and the property of strict complementarity hold. The results are more mixed on those problems for which pdSQP converges to a solution at which strict complementarity fails.

4.2 The degenerate problems of Mostafa, Vicente, and Wright

In [30], Mostafa, Vicente and Wright analyze the performance of an sSQP algorithm proposed by Wright [34] that estimates the weakly and strongly active multipliers. The authors demonstrate that the algorithm is robust in general and converges rapidly on a specified collection of 12 degenerate problems that includes some of the original Hock-Schittkowski problems; several Hock-Schittkowski problems modified to include redundant constraints; and several problems drawn from the literature (see the reference [30] for additional details). All 12 problems have either a rank-deficient Jacobian or at least one weakly active multiplier at the solution.

Algorithm pdSQP was tested on ten of the twelve problems that could be coded directly or obtained from other sources. Of the ten cases, pdSQP converges superlinearly on seven problems, converges linearly on two problems, and fails to converge on one problem. These results appear to be similar to those obtained by Mostafa, Vicente and Wright using their code sSQPa [30].

4.3 Degenerate MPECs

Mathematical programs with equilibrium constraints (MPECs) are optimization problems that have variational inequalities as constraints. Various reformulations of MPECs as nonlinear programs (see Baumrucker, Renfro and Biegler [3]) include complementarity constraints that do not satisfy the LICQ or the MFCQ. This is generally recognized as the main source of difficulty for conventional nonlinear solvers. In the case of pdSQP, the violation of the MFCQ implies that [14, Theorem 3.16] cannot be used to ensure the existence of limit points of the sequence of dual variables. As a consequence, the primal-dual iterates of pdSQP may never enter a region of superlinear convergence. Nonetheless, as MPECs constitute an important and challenging class of problems, this section includes results from pdSQP on a large set of MPECs.

Algorithm pdSQP was evaluated on 86 MPECs obtained from Sven Leyffer at Argonne National Laboratory. Many of these problems are included in the MPECLib library [5], which is a varied collection of MPECs from both theoretical and practical models. pdSQP solved 78 of the 86 problems.

As discussed above, the theoretical results of Sect. 3 do not guarantee that the primal-dual iterates will enter a region in which local descent steps are used. In order to study this possibility, Table 3 gives the EOC rates defined in (94) for all of the MPEC problems. The results indicate that, consistent with the theory, the last search direction is a global descent direction in 23 cases. Nonetheless, 20 of these cases still converge at a superlinear rate. By comparison, of the 55 problems for which the last direction is a local descent direction, superlinear convergence occurs in 52 cases.

Table 3 Estimated order of convergence for pdSQP on the MPEC test set

4.4 Degenerate problems from the DEGEN test set

In a series of numerical tests, Izmailov and Solodov [22, 23, 25] demonstrate that Newton-like algorithms such as SQP or inexact SQP methods tend to generate dual iterates that converge to critical multipliers, when they exist. (Critical multipliers are those multipliers \(y\in \mathcal {Y}(x^*)\) for which the regularized KKT matrix (87) is singular at \(x^*\).) This is significant because dual convergence to critical multipliers will result in a linear rate of convergence [23]. However, Izmailov [23] shows that an implementation of a conventional sSQP algorithm is less likely to exhibit this behavior, although poor performance can still occur in a small number of cases. This has motivated the use of sSQP subproblems as a way of accelerating local convergence in the presence of critical multipliers, although such algorithms have had mixed results in practice (see, e.g., Izmailov [26]). The purpose of this section is to use a subset of the DEGEN test set to investigate the performance of pdSQP on problems with critical multipliers. The subset of problems consists of those considered by Izmailov [22], and Izmailov and Solodov [23].
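A numerical test for criticality can be sketched as follows, in Python: a multiplier estimate is flagged when the KKT matrix built from the \(\epsilon \)-free rows and columns, with the dual regularization removed (\(\mu = 0\)), is numerically singular at the solution. This diagnostic is an illustration only; the experiments reported below instead compare the final multiplier estimate with known critical multipliers.

```python
import numpy as np

def is_numerically_critical(H_F, J_F, tol=1e-8):
    """Flag y as (numerically) critical: the unregularized KKT matrix at
    the solution is singular to within a relative tolerance."""
    m = J_F.shape[0]
    K = np.block([[H_F, J_F.T],
                  [J_F, np.zeros((m, m))]])
    sv = np.linalg.svd(K, compute_uv=False)
    return sv.min() <= tol * max(1.0, sv.max())
```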

Table 4 gives the estimated orders of convergence for these problems. The results are separated based on the following properties: (i) no critical multipliers exist; (ii) critical multipliers exist but the limit point \(y^*\) is not critical; and (iii) the limit point \(y^*\) is critical. The summaries indicate which optimal multipliers (if any) are critical. If the final multiplier estimate is within \(10^{-3}\) of a critical multiplier, the multiplier is designated as critical. Table 4 shows that, empirically, pdSQP converges superlinearly on 45 of the 51 problems that do not have critical multipliers. For the 58 problems that have critical multipliers, pdSQP converges to a critical multiplier for 46 of them, and for those 46 problems the rate of convergence is typically slower. The slower rate supports the theory in [23], but the results also indicate that, on this test set, pdSQP often converges to critical multipliers when they are present.

Table 4 Estimated order of convergence (EOC) of pdSQP on the DEGEN test set

5 Conclusions

This paper considers the local convergence analysis and some aspects of the numerical performance of an sSQP method for nonlinearly constrained optimization. The method appears to constitute the first algorithm with provable convergence to second-order points as well as a superlinear rate of convergence. The method is formulated as a stabilized SQP method with an implicit safeguarding strategy based on minimizing a bound-constrained primal-dual augmented Lagrangian. The method involves a flexible line search along a direction formed from an approximate solution of a regularized quadratic programming subproblem and, when one exists, a direction of negative curvature for the primal-dual augmented Lagrangian. With an appropriate choice of termination condition, the method terminates in a finite number of iterations under weak assumptions on the problem functions. Safeguarding becomes relevant only when the iterates are converging to an infeasible stationary point of the norm of the constraint violations. Otherwise, the method terminates with a point that either satisfies the second-order necessary conditions for optimality, or fails to satisfy a weak second-order constraint qualification. In the former case, superlinear local convergence is established by using an approximate solution of the stabilized QP subproblem that guarantees that the optimal active set, once correctly identified, remains active regardless of the presence of weakly active multipliers. It is shown that the method has superlinear local convergence under the assumption that limit points become close to a solution set containing multipliers satisfying the second-order sufficient conditions for optimality. This rate of convergence is obtained without the need to solve a nonconvex QP subproblem, or impose restrictions on which local minimizer of the QP is found. For example, it is not necessary to compute the QP solution closest to the current solution estimate.

Numerical results on a variety of problems indicate that the method performs relatively well compared to a state-of-the-art SQP method. Superlinear convergence is typical, even for problems that do not satisfy standard constraint qualifications. Results are more mixed for problems that do not satisfy the property of strict complementarity.

The proposed method is based on the beneficial properties of dual regularization, which makes it necessary to assume a second-order sufficient condition to rule out the possibility of critical multipliers at the solution. Future research will focus on primal regularization techniques that allow superlinear convergence when critical multipliers are present. For a local algorithm framework based on primal regularization, see Facchinei, Fischer and Herrich [6, 7].