1 Introduction

In general, algorithms for nonlinear optimization are iterative. A simplified model is solved at each iteration with the goal of building a sequence of iterates converging to a solution of the original problem. Usually, the more accurately one solves the subproblems, the better the quality of the solution found. If one requires subproblems to be solved up to first-order stationarity then, under reasonable assumptions, the final solution is also first-order stationary. In this paper, our goal is to increase the quality of the final solution to second-order stationarity by obtaining second-order stationary points of the subproblems. This is particularly relevant in some machine and statistical learning problems [48], where a second-order stationary point is a provably good substitute for an actual solution, or in problems with multiple non-optimal first-order stationary points.

We will consider a large class of Augmented Lagrangian algorithms, in the sense that we do not specify which constraints are penalized (upper-level constraints) and which constraints are kept in the subproblems (lower-level constraints). Theoretically, the division into lower- and upper-level constraints can be done arbitrarily [24]. However, many practical motivations exist that drive the separation of the constraints into those two sets. The Augmented Lagrangian iterates are feasible, or approximately feasible, with respect to the lower-level constraints; while the upper-level constraints are, in general, satisfied only in the limit. It may be the case, depending on the application, that the objective function is meaningless when the lower-level constraints are violated; so they must be satisfied at all iterations. In [26], lower-level constraints are called non-relaxable, while upper-level constraints are called relaxable. An extreme case is when the objective function is not defined outside the lower-level constraints. In practice, the choice of lower- and upper-level constraints is strongly based on the possibility of solving subproblems with lower-level constraints. It may also be a matter of convenience if, for instance, in a particular application, a well-established and tuned solver exists and an additional constraint that the solver is not capable of dealing with must be considered. In this situation, it would be natural to adopt an Augmented Lagrangian algorithm penalizing the new constraint and using the well-established solver for the subproblems. Two particular examples of this situation follow. The first one is the important problem of minimizing a matrix function subject to an orthogonality constraint. Fast tailored algorithms exist for this problem, but they cannot be easily generalized when additional constraints are present. (See [37] and references therein for details.)
A second example is bi-level optimization (see, for example, [35]), where one wants to minimize an objective function under the constraint that some variables are a solution to an optimization problem parameterized by the other variables. When additional standard constraints are present, they are natural candidates to be upper-level constraints; since there are efficient algorithms for bi-level optimization problems without additional constraints.

First-order Augmented Lagrangian methods in which the lower-level constraints are bound constraints can be found in [5, 32], while second-order ones can be found in [6]. More recently, an Augmented Lagrangian method that keeps equality constraints at the lower-level set and penalizes the remaining constraints (including the bound constraints) was presented in [19]. This motivated us to further develop the global convergence theory of Augmented Lagrangian methods with arbitrary constraints at the lower-level set.

In first-order optimization methods, the Karush–Kuhn–Tucker (KKT) conditions are the standard first-order stationarity concept. In the second-order case, the situation is not straightforward. In theory, many second-order optimality conditions can be derived, with or without the presence of constraint qualifications and with distinct properties. Conditions may rely on (i) generalized multipliers associated with the objective function (also called Fritz John multipliers [29, Theorem 3.50]), (ii) the whole set of Lagrange multipliers, (iii) only one Lagrange multiplier [3, 15, 57], (iv) the critical cone, or (v) the critical subspace (weak critical cone). See [16] for a discussion and more details on second-order optimality conditions.

The second-order condition that will be considered in the present work is the weak second-order condition (WSOC). It states that the Hessian of the Lagrangian at a KKT point is positive semidefinite on the subspace orthogonal to the gradients of the active constraints. Several reasons lead us to consider this second-order condition. Usually, algorithms build a single approximation of a Lagrange multiplier and not the whole set of Lagrange multipliers; hence, conditions depending on the whole set of Lagrange multipliers may be hard to check in practice. We also do not consider Fritz John multipliers, since algorithms for solving subproblems usually treat the objective function and the constraints in different manners, in a way that a multiplier associated with the objective function is not present. Finally, checking the positive semidefiniteness of a matrix over a subspace is a relatively easy task; in contrast, checking the positive semidefiniteness of a matrix on a pointed cone is an NP-hard problem [56]. These considerations make WSOC a natural second-order condition whose approximate satisfaction may be adopted as a stopping criterion for practical algorithms. Furthermore, to the authors' knowledge, no algorithm with global convergence to points satisfying a stronger second-order stationarity concept is known. Moreover, the discussion in [40], where a very simple algorithm is shown to converge to a point that violates a strong optimality condition, suggests that such algorithms may not exist.

Second-order methods for constrained optimization have been devised in [6, 14, 38, 39, 55]. In particular, the methods proposed in [6, 55] are Augmented Lagrangian methods. The method proposed in the present work differs from those methods in at least two aspects: it has the flexibility introduced by the general lower-level approach, and its global convergence theory relies on weak assumptions. In particular, previous results are based on the Mangasarian-Fromowitz constraint qualification, which is equivalent to the boundedness of the set of Lagrange multipliers; in this work, global convergence is guaranteed even when the solution has unbounded Lagrange multipliers. When the algorithm converges to an infeasible point, it is shown that the limit point is second-order stationary for the problem of minimizing an infeasibility measure, which also enhances previous results. Since subproblems are not solved exactly, the presented analysis relies on a perturbation of the second-order optimality conditions defined in [9].

The rest of this work is organized as follows. Section 2 describes the Augmented Lagrangian framework and gives some basic definitions. Section 3 presents the global convergence results. A discussion and numerical illustrative examples are given in Sect. 4. Section 5 gives some conclusions.

Notation

We denote by \(\mathbb {R}^{n}\) the n-dimensional Euclidean space. By \(\mathbb {R}^{n}_{+}\) we denote the set of elements of \(\mathbb {R}^{n}\) for which each component is non-negative. The notation \(\max \{0,a\}\) represents the maximum between \(a \in \mathbb {R}\) and 0. If \(a \in \mathbb {R}^{n}\), \(\max \{0,a\}\) is to be taken componentwise. The symbol \(\Vert \cdot \Vert \) stands for the Euclidean norm and \(\Vert \cdot \Vert _{\infty }\) stands for the supremum norm. Given a set-valued mapping \(\mathcal {F}: \mathbb {R}^{s} \rightrightarrows \mathbb {R}^{\ell }\), the sequential Painlevé-Kuratowski outer limit of \(\mathcal {F}(z)\) as \(z \rightarrow z^{*}\) is denoted by

$$\begin{aligned} \limsup _{z \rightarrow z^{*}} \mathcal {F}(z) = \{w^{*} \in \mathbb {R}^{\ell }: \exists \; (z^{k},w^{k})\rightarrow (z^{*},w^{*})\quad \text{ with } w^{k} \in \mathcal {F}(z^{k}) \}. \end{aligned}$$

We say that \(\mathcal {F}\) is outer semicontinuous at \(z^{*}\) if \(\limsup _{z \rightarrow z^{*}} \mathcal {F}(z) \subseteq \mathcal {F}(z^{*})\).

2 Augmented Lagrangian framework

Consider the optimization problem

$$\begin{aligned} \hbox {Minimize } f(x) \hbox { subject to } x \in \Omega _L \cap \Omega _U, \end{aligned}$$
(OP)

where \(\Omega _U = \{ x \in \mathbb {R}^n \; | \; H(x)=0, G(x) \le 0 \}\) and \(\Omega _L = \{ x \in \mathbb {R}^n \; | \; h(x)=0, g(x) \le 0\}\). We assume that \(f:\mathbb {R}^n\rightarrow \mathbb {R}\), \(h:\mathbb {R}^n\rightarrow \mathbb {R}^{m}\), \(g:\mathbb {R}^n\rightarrow \mathbb {R}^{p}\), \(H:\mathbb {R}^n\rightarrow \mathbb {R}^{\bar{m}}\) and \(G:\mathbb {R}^n\rightarrow \mathbb {R}^{\bar{p}}\) are twice continuously differentiable.

Constraints in \(\Omega _U\) are the so-called upper-level constraints and they will be penalized with the use of the Powell-Hestenes-Rockafellar (PHR) Augmented Lagrangian function [43, 59, 61] defined by

$$\begin{aligned} x\mapsto L(x,\bar{v},\bar{u};\rho ) = f(x) + \frac{\rho }{2} \left( \left\| H(x) + \frac{\bar{v}}{\rho } \right\| ^2 + \left\| \max \left\{ 0, G(x) + \frac{\bar{u}}{\rho } \right\} \right\| ^2 \right) . \end{aligned}$$

The main algorithm consists in a sequence of approximate minimizations of the Augmented Lagrangian \(L(x,\bar{v},\bar{u};\rho )\) over the set \(\Omega _L\), followed by an update of the multipliers \(\bar{v}\in \mathbb {R}^{\bar{m}}\) and \(\bar{u}\in \mathbb {R}^{\bar{p}}_+\) and of the penalty parameter \(\rho >0\). This means that the penalty parameter \(\rho \) and the Lagrange multiplier approximations \(\bar{v}\) and \(\bar{u}\) drive the feasibility with respect to \(\Omega _U\); while feasibility with respect to the lower-level constraints \(\Omega _L\) is kept at each iteration. Each approximate minimization of \(L(x,\bar{v},\bar{u};\rho )\) is called an outer iteration. For updating the multiplier approximations, we adopt the classical first-order update [17, 43, 59] but, following [4,5,6], we consider a safeguarded multiplier estimate lying in a compact box. To update the penalty parameter \(\rho \), we consider a slight generalization of the update in [4,5,6] (see also [24]). We now describe the Augmented Lagrangian framework.

Algorithm 2.1. Augmented Lagrangian framework [4, 5, 24]

Let \(v_{\min } < v_{\max }\), \(u_{\max }>0\), \(\gamma >1\), \(\tau \in [0,1)\), \(\bar{v}^1 \in [v_{\min },v_{\max }]^{\bar{m}}\), and \(\bar{u}^1 \in [0,u_{\max }]^{\bar{p}}\) be given. Initialize \(k \leftarrow 1\).

Step 1 :

Find an approximate solution \(x^k\) to

$$\begin{aligned} \text{ Minimize } L(x,\bar{v}^k,\bar{u}^k;\rho _k) \text{ subject } \text{ to } x \in \Omega _L. \end{aligned}$$
(1)
Step 2 :

Define

$$\begin{aligned} v^k = \bar{v}^k+\rho _k H(x^k) \text{ and } u^k = \max \{0,\bar{u}^k+\rho _k G(x^k)\} \end{aligned}$$

and compute \(\bar{v}^{k+1}\in [v_{\min },v_{\max }]^{\bar{m}}\) and \(\bar{u}^{k+1}\in [0,u_{\max }]^{\bar{p}}\) (typically, as the projections of \(v^{k}\) and \(u^{k}\) onto the corresponding safeguarded boxes).

Step 3 :

Set

$$\begin{aligned} V^k=\max \left\{ G(x^k),-\bar{u}^k/\rho _k \right\} . \end{aligned}$$

If \(k=1\) or

$$\begin{aligned} \max \{\Vert H(x^k)\Vert _\infty ,\Vert V^k\Vert _\infty \} \le \tau \max \{\Vert H(x^{k-1})\Vert _\infty ,\Vert V^{k-1}\Vert _\infty \} \end{aligned}$$

then choose \(\rho _{k+1}\) satisfying \(\rho _{k+1}\ge \rho _k\). Otherwise, choose \(\rho _{k+1}\) such that \(\rho _{k+1}\ge \gamma \rho _k\).

Step 4 :

Set \(k \leftarrow k+1\) and go to Step 1.

The measure \(V^k\) is a joint measure of feasibility (with respect to the inequality constraints) and complementarity. Thus, \(\max \{\Vert H(x^k)\Vert _\infty ,\Vert V^k\Vert _\infty \}\) measures the failure of feasibility and complementarity at the iterate \(x^k\). Clearly, \(\max \{\Vert H(x^k)\Vert _\infty ,\Vert V^k\Vert _\infty \}=0\) if, and only if, \(H(x^k)=0\), \(G(x^k)\le 0\), and \((\bar{u}^k)^T G(x^k) = 0\). In order to control the growth of the penalty parameter \(\rho _k\), it is increased only when this joint measure of feasibility and complementarity is not sufficiently reduced. Safeguarded Lagrange multiplier approximations are considered in order to obtain convergence to a global minimizer when subproblems are globally solved. See [5] or [24] for further details on this framework. Note that, despite the safeguarded strategy, the Lagrange multiplier approximations \(v^k\) and \(u^k\) may still be unbounded. See [45] for a comparison of Augmented Lagrangian methods with and without safeguarded Lagrange multipliers.
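For concreteness, the outer loop of Algorithm 2.1 can be sketched in a few lines of code. The sketch below is applied to a toy problem of our own choosing, not an example from the paper: minimize \((x_1-2)^2+(x_2-2)^2\) penalizing the upper-level constraint \(x_1+x_2=2\), with the lower-level set \(\Omega _L=[0,1.5]^2\) handled by a projected-gradient inner solver; the parameter values \(\tau =0.5\), \(\gamma =10\), \(\rho _1=10\) are illustrative.

```python
import numpy as np

# Toy data: f(x) = ||x - (2,2)||^2, upper-level H(x) = x1 + x2 - 2 = 0,
# lower-level set Omega_L = box [0, 1.5]^2 (kept feasible at every iterate).
grad_f = lambda x: 2.0 * (x - np.array([2.0, 2.0]))
H = lambda x: x[0] + x[1] - 2.0
grad_H = np.array([1.0, 1.0])

def solve_subproblem(x, v, rho, iters=500):
    """Approximately minimize L(x,v;rho) over the box by projected gradient."""
    step = 1.0 / (2.0 + 2.0 * rho)                 # 1 / Lipschitz constant of grad L
    for _ in range(iters):
        g = grad_f(x) + (v + rho * H(x)) * grad_H  # gradient of the PHR function
        x = np.clip(x - step * g, 0.0, 1.5)        # projection onto Omega_L
    return x

x, v, rho = np.zeros(2), 0.0, 10.0
tau, gamma, v_box = 0.5, 10.0, 100.0
infeas_prev = np.inf
for k in range(10):
    x = solve_subproblem(x, v, rho)
    v = np.clip(v + rho * H(x), -v_box, v_box)  # safeguarded multiplier update
    infeas = abs(H(x))                          # only equality infeasibility here
    if infeas > tau * infeas_prev:              # insufficient progress: increase rho
        rho *= gamma
    infeas_prev = infeas

print(x)  # approaches the solution (1, 1), with multiplier v near 2
```

In this example the multiplier update drives \(v\) to the exact multiplier 2 and the feasibility measure shrinks by roughly a factor of 11 per outer iteration, so the test at Step 3 always passes and the penalty parameter stays bounded.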

In order to formulate precisely our general Augmented Lagrangian framework, we must clarify what is meant by an approximate solution to subproblem (1). For a first-order Augmented Lagrangian method, this means that \(x^k\) is an \(\varepsilon _k\)-KKT point, for a given sequence of tolerances \(\varepsilon _k\rightarrow 0^+\), as defined below. Let us consider subproblem (1) with a general objective function, that is, let us consider the general optimization problem with a continuously differentiable objective function \(F:\mathbb {R}^n\rightarrow \mathbb {R}\) given by

$$\begin{aligned} \text{ Minimize } F(x) \text{ subject } \text{ to } h(x)=0 \text{ and } g(x) \le 0. \end{aligned}$$
(GOP)

Definition 2.1

(\(\varepsilon \)-KKT) Given \(\varepsilon \ge 0\), we say that \(x\in \mathbb {R}^n\) is an \(\varepsilon \)-KKT point for the problem (GOP) if there exist \(\lambda \in \mathbb {R}^{m}\) and \(\mu \in \mathbb {R}^{p}_+\) such that

$$\begin{aligned} \Vert h(x)\Vert \le \varepsilon , \Vert \max \{0,g(x)\}\Vert \le \varepsilon , \end{aligned}$$
(2)
$$\begin{aligned} \left\| \nabla F(x) + \sum _{i=1}^{m}\lambda _i\nabla h_i(x) + \sum _{i=1}^{p} \mu _i \nabla g_i(x) \right\| \le \varepsilon , \end{aligned}$$
(3)
$$\begin{aligned} \mu _i=0 \text{ whenever } g_i(x) < -\varepsilon ,\quad i=1,\dots ,p. \end{aligned}$$
(4)

Note that \(\varepsilon \)-KKT for \(\varepsilon =0\) is equivalent to the KKT conditions and that (2), (3), and (4) are perturbed measures of feasibility, optimality, and complementarity, respectively. An important property of \(\varepsilon \)-KKT points is that, given a local minimizer \(x^*\) of (GOP), for every \(\varepsilon >0\), there exists a point \(x_\varepsilon \in \mathbb {R}^n\) with \(\Vert x^*-x_\varepsilon \Vert <\varepsilon \) such that \(x_\varepsilon \) is an \(\varepsilon \)-KKT point. That is, even though \(x^*\) may not be a KKT point, due, for instance, to the failure of a constraint qualification, it can be arbitrarily approximated by \(\varepsilon \)-KKT points. This motivates the following definition. See [7].
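The \(\varepsilon \)-KKT conditions (2)-(4) are straightforward to verify numerically once a candidate point and multipliers are at hand. The minimal sketch below uses problem data of our own choosing (a linear objective with one ball constraint; not an example from the paper), whose KKT point is \((-1,-1)\) with multiplier \(\mu =1/2\).

```python
import numpy as np

def is_eps_kkt(x, lam, mu, eps, h, Jh, g, Jg, grad_F):
    """Check conditions (2)-(4): approximate feasibility, stationarity,
    and complementarity, all with the same tolerance eps."""
    feas = (np.linalg.norm(h(x)) <= eps and
            np.linalg.norm(np.maximum(0.0, g(x))) <= eps)
    grad_lag = grad_F(x) + Jh(x).T @ lam + Jg(x).T @ mu
    stat = np.linalg.norm(grad_lag) <= eps
    compl = all(mu[i] == 0.0 for i in range(len(mu)) if g(x)[i] < -eps)
    return feas and stat and compl

# Toy problem: minimize x1 + x2 subject to g(x) = x1^2 + x2^2 - 2 <= 0.
grad_F = lambda x: np.array([1.0, 1.0])
h  = lambda x: np.zeros(0)                 # no equality constraints
Jh = lambda x: np.zeros((0, 2))
g  = lambda x: np.array([x[0]**2 + x[1]**2 - 2.0])
Jg = lambda x: np.array([[2.0 * x[0], 2.0 * x[1]]])

x_star = np.array([-1.0, -1.0])            # KKT point, mu = 1/2
print(is_eps_kkt(x_star, np.zeros(0), np.array([0.5]), 1e-8,
                 h, Jh, g, Jg, grad_F))    # True
```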

Definition 2.2

(Approximate-KKT [7]) We say that \(x^*\in \mathbb {R}^n\) is an Approximate-KKT (AKKT) point for the problem (GOP) if there exist sequences \(\varepsilon _k\rightarrow 0^+\) and \(x^k\rightarrow x^*\) such that \(x^k\) is \(\varepsilon _k\)-KKT for all k. Equivalently, a feasible point \(x^*\) is an AKKT point if there exist sequences \(x^k\rightarrow x^*\), \(\{\lambda ^k\}\subset \mathbb {R}^{m}\), and \(\{\mu ^k\}\subset \mathbb {R}^{p}_+\) such that

$$\begin{aligned} \nabla F(x^k) + \sum _{i=1}^{m}\lambda _i^k\nabla h_i(x^k) + \sum _{i \in \mathcal{A}(x^*)} \mu _i^k\nabla g_i(x^k) \rightarrow 0, \end{aligned}$$

where \(\mathcal{A}(x^*) = \{ i \in \{ 1,\ldots ,p \} \; | \; g_i(x^*)=0 \}\).

Theorem 2.1

([7]) If \(x^*\) is a local minimizer then \(x^*\) is an AKKT point.

Thus, the AKKT condition is a true optimality condition, independently of the validity of any constraint qualification. Theorem 2.1 justifies the stopping criterion based on the KKT condition which is used in several methods for solving constrained optimization problems, i.e., stopping when an \(\varepsilon \)-KKT point is found for a given \(\varepsilon >0\) small enough. A local minimizer may not be a KKT point, but one can always find a point sufficiently close to the local minimizer satisfying the KKT conditions approximately, which makes it reasonable to accept such a point as a possible solution.

The global convergence theory of the first-order version of Algorithm 2.1 is based on the optimality condition AKKT as described below.

Theorem 2.2

([5]) Let \(\varepsilon _k\rightarrow 0^+\) be a sequence of tolerances. Assume that \(\{x^k\}\) is a sequence generated by Algorithm 2.1 and that, at Step 1, \(x^k\) is computed as an \(\varepsilon _k\)-KKT point for problem (1), that is, for every k, (2)–(4) hold with \(x=x^k\), \(\varepsilon =\varepsilon _k\), and \(F(x)=L(x,\bar{v}^k,\bar{u}^k;\rho _k)\). Then, every limit point \(x^*\) of \(\{x^k\}\) is an AKKT point of the infeasibility optimization problem

$$\begin{aligned} \text{ Minimize } \Vert H(x)\Vert ^2 + \Vert \max \{0,G(x)\}\Vert ^2 \text{ subject } \text{ to } x \in \Omega _L. \end{aligned}$$
(IOP)

It is worth noting that the sequence of tolerances \(\{ \varepsilon _k\}\) may be defined in an external or an internal way (depending on the feasibility of the current iterate). For further details on adaptive ways of choosing the subproblem tolerances \(\varepsilon _k\), see, for example, [21, 25, 36].

Theorem 2.3

([5]) Assume that \(\{x^k\}\) is a sequence generated by Algorithm 2.1 as in Theorem 2.2. Then, every feasible limit point \(x^*\in \Omega _L\cap \Omega _U\) of \(\{x^k\}\) is an AKKT point of the original optimization problem (OP).

In order to measure the strength of the optimality condition AKKT, it can be compared with classical optimality conditions based on constraint qualifications. This can be achieved with the following result.

Definition 2.3

(CCP [11]) We say that the Cone Continuity Property (CCP) with respect to the constraints of problem (GOP) holds at a feasible point \(x^*\) when the set-valued mapping \(x\mapsto K(x)\) is outer semicontinuous at \(x^*\), that is, \(\limsup _{x\rightarrow x^*}K(x)\subseteq K(x^*)\), where

$$\begin{aligned} K(x) = \left\{ \sum _{i=1}^{m}\lambda _i\nabla h_i(x) + \sum _{i \in \mathcal{A}(x^*)} \mu _i\nabla g_i(x) \; | \; \mu \in \mathbb {R}^{p}_+, \lambda \in \mathbb {R}^{m} \right\} . \end{aligned}$$

Theorem 2.4

([11]) The Cone Continuity Property holds at \(x^*\) with respect to the constraints of problem (GOP) if, and only if, for every objective function F(x) in (GOP) such that \(x^*\) is AKKT, \(x^*\) is also a KKT point.

In particular, assuming that \(x^*\) satisfies CCP with respect to the constraints in \(\Omega _L\) and with respect to \(\Omega _L\cap \Omega _U\) allows us to rewrite Theorems 2.2 and 2.3 obtaining true KKT points and, therefore, these results can be compared with other results that rely on a constraint qualification. Note that Theorems 2.1 and 2.4 imply that CCP is a constraint qualification and that CCP is weaker than most of the other constraint qualifications used to analyze global convergence of algorithms. It is strictly weaker than CPG [10], CRSC [10], (R)CPLD [8, 60], (R)CRCQ [44, 54], MFCQ [49], and LICQ [18]. For details, see [11] and the references therein.

The next section is devoted to developing an analogous second-order global convergence theory for the Augmented Lagrangian method with general lower-level constraints when subproblems are approximately solved up to second-order.

3 A second-order Augmented Lagrangian

In this section, we consider that, at Step 1 of Algorithm 2.1, the approximate solution \(x^k\) is obtained taking into account second-order information. To be precise, let us consider again subproblem (1) with a general objective function given in (GOP).

For \(\lambda \in \mathbb {R}^m\) and \(\mu \in \mathbb {R}^p_+\), let us consider the Lagrangian function \(x \mapsto \ell (x,\lambda ,\mu )\) associated with (GOP) given by

$$\begin{aligned} \ell (x,\lambda ,\mu ) = F(x) + \sum _{i=1}^{m}\lambda _i h_i(x) + \sum _{i =1}^p \mu _i g_i(x). \end{aligned}$$

We recall that WSOC holds at a feasible point \(x^{*}\) for the problem (GOP) if there exist Lagrange multipliers \((\lambda ,\mu ) \in \mathbb {R}^m\times \mathbb {R}^p_+\), with \(\mu _i=0\) whenever \(g_i(x^*)<0\), such that

$$\begin{aligned} \nabla F(x^{*}) + \sum _{i=1}^{m}\lambda _{i}\nabla h_{i}(x^{*})+ \sum _{i \in \mathcal{A}(x^*)} \mu _{i} \nabla g_{i}(x^{*}) = 0 \end{aligned}$$
(5)

and, in addition, for each d in the critical subspace

$$\begin{aligned} S = \{ d \in \mathbb {R}^n \; | \; \nabla h_i(x^*)^\mathtt {T}d=0,\quad i=1,\dots ,m, \text{ and } \nabla g_i(x^*)^\mathtt {T}d=0,\quad i \in \mathcal{A}(x^*) \}, \end{aligned}$$

we have that

$$\begin{aligned} d^T \left( \nabla ^{2} F(x^{*}) + \sum _{i=1}^m \lambda _{i} \nabla ^{2} h_{i}(x^{*}) + \sum _{i \in \mathcal{A}(x^*)} \mu _i \nabla ^{2} g_{i}(x^{*}) \right) d \ge 0. \end{aligned}$$
(6)

Similarly to the KKT conditions, WSOC holds at a local minimizer only under some constraint qualification. However, not all constraint qualifications serve this purpose. For instance, WSOC holds under LICQ but not under MFCQ. See [16] for more details on WSOC. Based on this notion of second-order stationarity, a second-order version of AKKT (called AKKT2) was presented in [9]. Its definition is given below.

Definition 3.1

(AKKT2 [9]) We say that the feasible point \(x^*\in \mathbb {R}^n\) fulfills AKKT2 with respect to (GOP) if there exist sequences \(x^k\rightarrow x^*\), \(\{\lambda ^k\}\subset \mathbb {R}^m\), \(\{\mu ^k\}\subset \mathbb {R}^p_+\), \(\{\theta ^k\}\subset \mathbb {R}^m_+\), \(\{\eta ^k\}\subset \mathbb {R}^p_+\), and \(\delta _k\rightarrow 0^+\), with \(\mu ^k_i=0\) and \(\eta ^k_i=0\) whenever \(g_i(x^*)<0\), such that

$$\begin{aligned} \nabla \ell (x^k,\lambda ^k,\mu ^k) = \nabla F(x^k) + \sum _{i=1}^{m} \lambda ^k_i \nabla h_i(x^k) + \sum _{i \in \mathcal{A}(x^*)} \mu ^k_i \nabla g_i(x^k) \rightarrow 0 \end{aligned}$$
(7)

and

$$\begin{aligned} \nabla ^2 \ell (x^k,\lambda ^k,\mu ^k) + \sum _{i=1}^m \theta _i^k\nabla h_i(x^k) \nabla h_i(x^k)^T + \sum _{i \in \mathcal{A}(x^*)} \eta _i^k \nabla g_i(x^k)\nabla g_i(x^k)^T + \delta _k \mathcal {I} \end{aligned}$$
(8)

is positive semidefinite for all k, where \(\mathcal {I}\) is the identity matrix.

In [9], it was proved that AKKT2 is an optimality condition, that is, when \(x^*\) is a local minimizer, there must exist sequences given by Definition 3.1, even if WSOC fails at \(x^*\). Thus, AKKT2 justifies a practical stopping criterion based on the approximate fulfillment of WSOC. In this section, we prove global convergence results for Algorithm 2.1 based on the optimality condition AKKT2.

In order to measure the strength of AKKT2, we recall the definition of the second-order cone-continuity property (CCP2) from [9].

Definition 3.2

(CCP2 [9]) We say that CCP2 holds at a feasible point \(x^*\) with respect to the problem (GOP) when the set-valued mapping \(x\mapsto K_2(x)\) is outer semicontinuous at \(x^*\), i.e. \(\limsup _{x\rightarrow x^*}K_2(x)\subseteq K_2(x^*)\), where \(K_2(x)\) is the cone given by

$$\begin{aligned} \bigcup _{\begin{array}{c}(\lambda ,\mu ) \in \mathbb {R}^{m}\times \mathbb {R}_{+}^{p}\\ \mu _{i}=0, \; i \not \in \mathcal{A}(x^*)\end{array}} \left\{ \begin{array}{c} \hbox {all }\left( w , W \right) \hbox { with }w\in \mathbb {R}^n\hbox { and }W\hbox { a symmetric }n\times n\hbox { matrix such that}\\ w+\sum _{i=1}^{m} \lambda _{i} \nabla h_{i}(x) + \sum _{i \in \mathcal{A}(x^*)} \mu _{i} \nabla g_{i}(x)=0,\\ d^T \left( W + \sum _{i=1}^{m} \lambda _{i} \nabla ^{2} h_{i}(x) + \sum _{i \in \mathcal{A}(x^*)} \mu _{i}\nabla ^{2} g_{i}(x) \right) d \ge 0 \quad \hbox { for all } d \in S(x,x^*) \end{array} \right\} \end{aligned}$$
(9)

and

$$\begin{aligned}&S(x,x^*) \nonumber \\&\quad = \left\{ d \in \mathbb {R}^n \; | \; \nabla h_i(x)^T d = 0,\quad i=1,\dots ,m, \text{ and } \nabla g_i(x)^T d = 0,\quad i \in \mathcal{A}(x^*) \right\} \nonumber \\ \end{aligned}$$
(10)

is the perturbed critical subspace around the point \(x^{*}\).

The cone \(K_{2}(x^*)\) allows us to rewrite WSOC in a geometrical way. In fact, a feasible point \(x^*\) satisfies WSOC if, and only if, there exists \((\lambda ,\mu ) \in \mathbb {R}^m\times \mathbb {R}^p_+\), with \(\mu _i=0\) whenever \(g_i(x^*)<0\), such that \(\nabla \ell (x^*,\lambda ,\mu )=0\) and \(\nabla ^2 \ell (x^*,\lambda ,\mu )\) is positive semidefinite over the critical subspace \(S(x^*,x^*) = S\); this can be stated compactly as \((\nabla F(x^*), \nabla ^2 F(x^*)) \in K_2(x^*)\). The following theorem states precisely the strength of AKKT2.

Theorem 3.1

([9]) CCP2 holds at \(x^*\) with respect to the constraints of problem (GOP) if, and only if, for all objective functions F(x) in (GOP) such that \(x^*\) is AKKT2, \(x^*\) also satisfies WSOC.

In [9], it was proved that CCP2 is strictly weaker than the joint condition MFCQ+WCR, CRCQ [44], and RCRCQ [54] and that it can be used in the global convergence analysis of the second-order algorithms defined in [6, 12, 39]. In particular, in view of Theorem 3.1, global convergence results under AKKT2 can be restated in terms of the usual WSOC under CCP2. This means that even when MFCQ fails, and hence the solution does not have bounded multipliers, a classical global convergence result is still available.

We turn now to the issue of developing an alternative definition of AKKT2 that relies on \(\varepsilon \) perturbations of the second-order stationarity notion WSOC and that can be checked without the knowledge of the limit point \(x^*\). This would allow us to provide a tolerance for the solvability of the subproblems at Step 1 of Algorithm 2.1 that guarantees that limit points of a sequence generated by the algorithm satisfy AKKT2.

Lemma 3.1

([17, 34]) Let P be a symmetric \(n \times n\) matrix and \(a_{1},\dots ,a_{r} \in \mathbb {R}^{n}\). Define the linear subspace \(\mathcal {C}=\{d \in \mathbb {R}^{n} \; | \; a_{i}^T d =0, i =1,\dots ,r \}\). Suppose that \(v^T P v>0\) for all \(v \in \mathcal {C}\). Then, there exist positive scalars \(c_i\), \(i=1,\dots ,r\), such that \(P + \sum _{i=1}^r c_i a_i a_i^T\) is positive definite.
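Lemma 3.1 can be illustrated numerically. In the sketch below (the matrices are our own illustrative choice), \(P=\mathrm{diag}(1,-1)\) is indefinite on \(\mathbb {R}^2\) but positive definite on \(\mathcal {C}=\{d \; | \; a_1^T d = 0\}\) with \(a_1=(0,1)^T\), and doubling a scalar \(c_1\) until \(P + c_1 a_1 a_1^T\) becomes positive definite terminates, as the lemma guarantees.

```python
import numpy as np

P = np.diag([1.0, -1.0])     # indefinite on R^2
a1 = np.array([0.0, 1.0])    # C = {d : a1^T d = 0} = span{(1,0)}; d^T P d = d1^2 > 0 on C

# P itself is not positive definite ...
assert np.linalg.eigvalsh(P).min() < 0

# ... but P + c1 * a1 a1^T is, once c1 is large enough (Lemma 3.1).
c1 = 1.0
while np.linalg.eigvalsh(P + c1 * np.outer(a1, a1)).min() <= 0:
    c1 *= 2.0
print(c1, np.linalg.eigvalsh(P + c1 * np.outer(a1, a1)).min())  # 2.0 1.0
```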

Now, we provide an equivalent definition of AKKT2 based on \(\varepsilon \) perturbations. This equivalence can be used as stopping criterion for constrained optimization algorithms.

Theorem 3.2

A point \(x^*\) satisfies AKKT2 for problem (GOP) if, and only if, there are sequences \(x^k\rightarrow x^*\), \(\{\lambda ^k\}\subset \mathbb {R}^m\), \(\{\mu ^k\}\subset \mathbb {R}^p_+\), and \(\varepsilon _k\rightarrow 0^+\) such that

$$\begin{aligned}&\Vert h(x^k) \Vert \le \varepsilon _k, \Vert \max \{0,g(x^k)\}\Vert \le \varepsilon _k, \end{aligned}$$
(11)
$$\begin{aligned}&\Vert \nabla \ell (x^k,\lambda ^k,\mu ^k)\Vert = \left\| \nabla F(x^k) + \sum _{i=1}^{m} \lambda ^k_i \nabla h_i(x^k) + \sum _{i=1}^{p} \mu ^k_i \nabla g_i(x^k) \right\| \le \varepsilon _k,\qquad \end{aligned}$$
(12)
$$\begin{aligned}&\mu ^k_i = 0 \text{ whenever } g_i(x^k) < -\varepsilon _k, \end{aligned}$$
(13)

and

$$\begin{aligned} d^T \nabla ^2 \ell (x^k,\lambda ^k,\mu ^k) d \ge -\varepsilon _k \Vert d\Vert ^2, \end{aligned}$$
(14)

for all \(d \in S_k\), where \(S_k\) is the subspace given by

$$\begin{aligned} S_k = \left\{ d \in \mathbb {R}^n \; \left| \begin{array}{l} \nabla h_i(x^k)^T d = 0, i=1,\dots ,m, \text{ and } \\ \nabla g_i(x^k)^T d = 0, i=1,\dots ,p \text{ such } \text{ that } g_i(x^k) \ge -\varepsilon _k \end{array} \right. \right\} . \end{aligned}$$
(15)

Proof

Let \(x^*\) be an AKKT2 point and let \(\{x^k\}\), \(\{\lambda ^k\}\), \(\{\mu ^k\}\), \(\{\theta ^k\}\), \(\{\eta ^k\}\), and \(\{\delta _k\}\) be the sequences given by Definition 3.1. Define

$$\begin{aligned} \varepsilon _k = \max \left\{ \delta _k, \Vert h(x^k)\Vert , \Vert \max \{ 0, g(x^k) \} \Vert , \Vert \nabla \ell (x^k,\lambda ^k,\mu ^k)\Vert , \{ -g_i(x^k), i \in \mathcal{A}(x^*) \} \right\} . \end{aligned}$$

Clearly, we have that \(\varepsilon _k \rightarrow 0^+\) and that (11) and (12) hold. Condition (13) also holds: in fact, if for some index \(i_{0}\) we have \(g_{i_{0}}(x^k)<-\varepsilon _k\), or, equivalently, \(-g_{i_{0}}(x^k)>\varepsilon _k\ge \max _{i \in \mathcal{A}(x^*)} \{ -g_i(x^k) \}\), then \(i_0\not \in \mathcal{A}(x^*)\), and thus \(\mu _{i_{0}}^k=0\) by the definition of the sequence \(\{\mu ^k\}\). Let us note that, for k large enough, \(S_k=S(x^k,x^*)\). This comes from the fact that \(g_i(x^*)=0\) if, and only if, \(g_i(x^k)\ge -\varepsilon _k\) for k large enough. The implication \(g_i(x^k)<-\varepsilon _k\Rightarrow g_i(x^*)<0\) has already been proved; the converse is a simple consequence of the continuity of \(g_i\) and \(\varepsilon _k\rightarrow 0^+\). Note that \(S(x^k,x^*)\subseteq S_k\) holds eventually for any sequence \(\varepsilon _k\rightarrow 0^+\), while the reverse inclusion is a consequence of this particular definition of \(\varepsilon _k\). For k large enough, let \(d \in S_k = S(x^k,x^*)\). Using (8) and the definition of \(S(x^k,x^*)\), we get \(d^T \nabla ^2 \ell (x^k,\lambda ^k,\mu ^k) d \ge - \delta _k \Vert d\Vert ^2 \ge -\varepsilon _k \Vert d\Vert ^2\), so (14) holds.

Now let us assume that \(x^*\) is such that there are sequences \(x^k \rightarrow x^*\), \(\{\lambda ^k\}\subset \mathbb {R}^m\), \(\{\mu ^k\}\subset \mathbb {R}^p_+\), and \(\varepsilon _k \rightarrow 0^+\) such that (11)–(14) hold. The continuity of h and g together with (11) imply that \(x^*\) is feasible. Also, if \(g_i(x^*)<0\), then \(g_i(x^k)<-\varepsilon _k\) for k large enough. Hence, (12) and (13) imply (7). From (14), for k large enough, the matrix \(P_k = \nabla ^2 \ell (x^k,\lambda ^k,\mu ^k) + ( \varepsilon _k + \frac{1}{k} ) \mathcal {I}\) is positive definite on the subspace \(S_k \supseteq S(x^k,x^*)\). Hence, defining \(\delta _k = \varepsilon _k + \frac{1}{k} \rightarrow 0^+\), Lemma 3.1 applied to the matrix \(P_k\) and the subspace \(S(x^k,x^*)\) implies the existence of \(\theta _i^k\), \(i=1,\dots ,m\), and \(\eta _i^k\), \(i \in \mathcal{A}(x^*)\), such that (8) holds. \(\square \)

Note that a stronger equivalence can be proved without using the same tolerance \(\varepsilon _k\) in all equations (11)–(14), as long as all the considered tolerances go to zero. Hence, Theorem 3.2 suggests the following stopping criterion for methods solving constrained optimization problems with convergence to second-order stationary points.

Given tolerances \(\varepsilon _{{\mathrm {feas}}}>0\), \(\varepsilon _{{\mathrm {opt}}}>0\), \(\varepsilon _{{\mathrm {compl}}}>0\), and \(\varepsilon _{{\mathrm {curv}}}>0\) for feasibility, optimality, complementarity, and curvature, respectively, accept \(x^k\) as an approximation to a solution when

$$\begin{aligned}&\Vert h(x^k)\Vert \le \varepsilon _{{\mathrm {feas}}}, \Vert \max \{0,g(x^k)\}\Vert \le \varepsilon _{{\mathrm {feas}}}, \end{aligned}$$
(16)
$$\begin{aligned}&\Vert \nabla \ell (x^k,\lambda ^k,\mu ^k)\Vert \le \varepsilon _{{\mathrm {opt}}}, \end{aligned}$$
(17)
$$\begin{aligned}&\mu ^k_i=0 \hbox { whenever } g_i(x^k) < -\varepsilon _{{\mathrm {compl}}}, \end{aligned}$$
(18)

and

$$\begin{aligned} d^T \nabla ^2 \ell (x^k,\lambda ^k,\mu ^k) d \ge -\varepsilon _{{\mathrm {curv}}} \Vert d\Vert ^2, \end{aligned}$$
(19)

for all \(d \in \mathbb {R}^n \; \text{ with } \; \nabla h_i(x^k)^T d = 0, i=1,\dots ,m, \text{ and } \nabla g_i(x^k)^T d = 0 \text{ whenever } g_i(x^k) \ge -\varepsilon _{{\mathrm {compl}}}\). A verification of (19) can be done at a mild computational cost, for instance, by computing the smallest eigenvalue of the matrix \(\nabla ^2\ell (x^k,\lambda ^k,\mu ^k)\) reduced to the corresponding subspace (see, for example, [47] and the references therein). Similarly to the first-order case, when (11)–(14) hold for some \(\lambda ^k\in \mathbb {R}^m\) and \(\mu ^k\in \mathbb {R}^p_+\), we say that \(x^k\) is \(\varepsilon _k\)-KKT2 for problem (GOP). Note that when x is \(\varepsilon \)-KKT2 with \(\varepsilon =0\), x satisfies WSOC.
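One possible realization of this verification is sketched below, with illustrative data of our own choosing: an orthonormal basis Z of the subspace is obtained from the SVD of the matrix whose rows are the gradients of the nearly active constraints, and (19) reduces to checking that the smallest eigenvalue of \(Z^T \nabla ^2\ell \, Z\) is at least \(-\varepsilon _{{\mathrm {curv}}}\).

```python
import numpy as np

def wsoc_curvature_ok(hess_lag, A, eps_curv):
    """Check d^T hess_lag d >= -eps_curv ||d||^2 for all d with A d = 0,
    where the rows of A are the gradients of the (nearly) active constraints."""
    if A.shape[0] == 0:
        Z = np.eye(hess_lag.shape[0])
    else:
        _, s, Vt = np.linalg.svd(A)
        rank = int(np.sum(s > 1e-12))
        Z = Vt[rank:].T              # orthonormal basis of the null space of A
    if Z.shape[1] == 0:              # trivial subspace: nothing to check
        return True
    reduced = Z.T @ hess_lag @ Z     # Hessian reduced to the subspace
    return np.linalg.eigvalsh(reduced).min() >= -eps_curv

# Hessian indefinite on R^2, yet (19) holds on the subspace {d : d2 = 0}.
hess = np.diag([1.0, -1.0])
A = np.array([[0.0, 1.0]])                              # single active gradient
print(wsoc_curvature_ok(hess, A, 1e-8))                 # True: reduced Hessian is [1]
print(wsoc_curvature_ok(hess, np.zeros((0, 2)), 1e-8))  # False on the whole space
```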

In order to prove our second-order global convergence results based on AKKT2, a natural way is to assume that subproblem (1) is approximately solved up to second-order, that is, \(x^k\) is \(\varepsilon _k\)-KKT2 with \(\varepsilon _k\rightarrow 0^+\). This is reasonable since it can be expected that algorithms with convergence to second-order stationary points will generate AKKT2 limit points, see [9]. But there is a minor issue concerning the possible lack of second-order differentiability of the objective function \(F(x) = L(x,\bar{v}^k,\bar{u}^k;\rho _k)\) of subproblems (1). The problem here is that \(\max \{ 0, G_i(x) + \bar{u}_i^k/\rho _k \}^2\) may fail to be twice differentiable at points x where \(G_i(x) + \bar{u}_i^k/\rho _k = 0\). To precisely formulate our results, let us consider the general subproblem (GOP) where the objective function F(x) is of the PHR form, that is,

$$\begin{aligned} F(x) = f_0(x) + \frac{1}{2} \sum _{i=1}^r \max \{ 0, f_i(x) \}^2, \end{aligned}$$

where \(f_i:\mathbb {R}^n\rightarrow \mathbb {R}\) are twice continuously differentiable for \(i=0,1,\dots ,r\). Following [6, Prop. 1], we can see that the matrix

$$\begin{aligned} \nabla ^2 F_\varepsilon (x)= & {} \nabla ^2 f_0(x) + \sum _{i=1}^r \max \{ 0, f_i(x) \} \nabla ^2 f_i(x)\nonumber \\&+ \sum _{f_i(x)\ge -\varepsilon } \nabla f_i(x) \nabla f_i(x)^T, \quad \varepsilon \ge 0, \end{aligned}$$
(20)

can play the role of a Hessian matrix of F in its Taylor expansion. In particular, second-order optimality conditions for the minimization of F(x) can be formulated as if F(x) were twice continuously differentiable using \(\nabla ^2 F_\varepsilon (x)\) instead of the Hessian of F(x). The optimality condition depends on \(\varepsilon \ge 0\) and is more stringent the closer \(\varepsilon \) is to zero. When F is twice continuously differentiable, \(\nabla ^2 F = \nabla ^2 F_0\), that is, the Hessian of F at x coincides with the expression in (20) with \(\varepsilon =0\).
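The matrix (20) can be assembled directly from the data of the PHR function. The sketch below uses a hypothetical callable-based interface for the functions \(f_i\) and their derivatives (an assumption of ours, chosen only for illustration).

```python
import numpy as np

def generalized_hessian(x, eps, hess_f0, fs, grads, hessians):
    """Assemble the matrix (20) for the PHR function
    F(x) = f_0(x) + 0.5 * sum_i max(0, f_i(x))**2.

    hess_f0, and the entries of fs, grads, hessians, are callables
    returning f_i(x), its gradient and its Hessian (hypothetical
    interface for this sketch)."""
    H = np.array(hess_f0(x), dtype=float)
    for f_i, g_i, h_i in zip(fs, grads, hessians):
        v = f_i(x)
        H = H + max(0.0, v) * np.asarray(h_i(x))   # vanishes when f_i(x) < 0
        if v >= -eps:                              # eps-enlarged active set
            gi = np.asarray(g_i(x))
            H = H + np.outer(gi, gi)               # rank-one curvature term
    return H
```

With \(\varepsilon = 0\) this reproduces \(\nabla^2 F_0(x)\); larger \(\varepsilon\) only enlarges the index set of rank-one terms, which makes the resulting second-order test easier to satisfy.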

We will say that \(x^*\) is an AKKT2 point for problem (GOP) with the PHR objective function F(x) when there are appropriate sequences such that (11)–(14) hold with \(\nabla ^2 F(x^k)\) replaced by \(\nabla ^2 F_{\varepsilon _k}(x^k)\), or, what can be shown to be equivalent, when (7)–(8) hold with \(\nabla ^2F(x^k)\) replaced by

$$\begin{aligned} \nabla ^{2} f_0(x^k) + \sum _{i=1}^r \max \{ 0, f_i(x^k) \} \nabla ^2 f_i(x^k) + \sum _{\{ i \; | \; f_i(x^*) \ge 0\}} \nabla f_i(x^k) \nabla f_i(x^k)^T. \end{aligned}$$

It is easy to see from the proof of Theorem 3.3 in [9] that AKKT2 for PHR functions in the way that we have defined is an optimality condition. Also, Theorem 3.1 still holds for PHR functions, where \(\nabla ^2F(x)\) in the definition of WSOC is replaced by \(\nabla ^2F_0(x)\).

In the following theorem we prove that limit points of the second-order Augmented Lagrangian algorithm are second-order stationary points for the problem of minimizing the squared infeasibility measure, which is a PHR function. This extends previous results where only first-order conditions were obtained in [6] for the particular case of a compact box at the lower-level.

Theorem 3.3

Assume that \(\varepsilon _k \rightarrow 0^+\) is a sequence of given tolerances and that \(\{x^k\}\) is a sequence generated by Algorithm 2.1 in such a way that, at Step 1, \(x^k\) is \(\varepsilon _k\)-KKT2 for subproblem (1), i.e. (11)–(14) hold with \(F(x) = L(x,\bar{v}^k,\bar{u}^k;\rho _k)\) and \(\nabla ^2 F(x^k) = \nabla ^2 F_{\varepsilon _k}(x^k)\). Then, every limit point \(x^*\) of \(\{x^k\}\) is an AKKT2 point for the infeasibility optimization problem

$$\begin{aligned} \text{ Minimize } \Vert H(x)\Vert ^2 + \Vert \max \{0,G(x)\}\Vert ^2 \text{ subject } \text{ to } x \in \Omega _L. \end{aligned}$$
(IOP)

Proof

Assume without loss of generality that \(x^k\rightarrow x^*\). From (11), we get \(x^*\in \Omega _L\). Now, we will examine two cases depending on whether the sequence \(\{\rho _k\}\) is bounded or not.

If \(\{\rho _k\}\) is bounded, Step 3 of Algorithm 2.1 ensures that \(\max \{\Vert H(x^k)\Vert _\infty ,\Vert V^k\Vert _\infty \}\rightarrow 0\). Hence \(H(x^k)\rightarrow 0\) and \(V^k=\max \{ G(x^k), -\bar{u}^k / \rho _k \}\rightarrow 0\). Thus, \(H(x^*)=0\), \(G(x^*)\le 0\), and \(x^*\) is a global minimizer of (IOP), in particular, an AKKT2 point.

Now, we consider the case in which the sequence \(\{\rho _k\}\) is unbounded. By direct calculation, and using the notation \(\nabla ^2 L(x^k,\bar{v}^k,\bar{u}^k;\rho _k)=\nabla ^2 L_{\varepsilon _k}(x^k,\bar{v}^k,\bar{u}^k;\rho _k)\), we get

$$\begin{aligned} \nabla L(x^k,\bar{v}^k,\bar{u}^k;\rho _k)= & {} \nabla f(x^k) + \sum _{i=1}^{\bar{m}} (\bar{v}_i^k+\rho _kH_i(x^k)) \nabla H_i(x^k)\\&+\sum _{i=1}^{\bar{p}} \max \{0,\bar{u}_i^k+\rho _kG_i(x^k)\} \nabla G_i(x^k) \end{aligned}$$

and

$$\begin{aligned} \nabla ^2 L(x^k,\bar{v}^k,\bar{u}^k;\rho _k)= & {} \nabla ^2 f(x^k) + \sum _{i=1}^{\bar{m}} (\bar{v}_i^k+\rho _kH_i(x^k)) \nabla ^2 H_i(x^k)\\&+ \sum _{i=1}^{\bar{p}} \max \{0,\bar{u}_i^k+\rho _kG_i(x^k)\} \nabla ^2 G_i(x^k) \\&+ \sum _{i=1}^{\bar{m}} \rho _k \nabla H_i(x^k) \nabla H_i(x^k)^T \\&+ \sum _{\{i \; | \; \bar{u}_i^k+\rho _k G_i(x^k) \ge - \varepsilon _k\}} \rho _k \nabla G_i(x^k) \nabla G_i(x^k)^T. \end{aligned}$$

Substituting \(\nabla L(x^k,\bar{v}^k,\bar{u}^k;\rho _k)\) and \(\nabla ^2 L(x^k,\bar{v}^k,\bar{u}^k;\rho _k)\) in (12) and (14), we get

$$\begin{aligned} \Vert L_k\Vert \le \varepsilon _k \end{aligned}$$
(21)

and

$$\begin{aligned} d^T M_k d \ge -\varepsilon _k \Vert d\Vert ^2 \text{ for } \text{ all } d \in S_k, \end{aligned}$$
(22)

where

$$\begin{aligned} S_k= & {} \{ d \in \mathbb {R}^n \; | \; \nabla h_i(x^k)^T d = 0, \; i=1,\dots ,m, \hbox { and } \nabla g_i(x^k)^T d = 0 \text{ whenever } g_i(x^k) \ge - \varepsilon _k \},\\ L_{k}= & {} \nabla f(x^k) +\sum _{i=1}^{\bar{m}} (\bar{v}_i^k + \rho _k H_i(x^k)) \nabla H_i(x^k)\\&+ \sum _{i=1}^{\bar{p}} \max \{0, \bar{u}_i^k + \rho _k G_i(x^k) \} \nabla G_i(x^k) \\&+ \sum _{i=1}^{m} \lambda ^k_i \nabla h_i(x^k) + \sum _{i=1}^{p} \mu ^k_i \nabla g_i(x^k),\\ M_k= & {} \nabla ^2 f(x^k) +\sum _{i=1}^{\bar{m}} (\bar{v}_i^k + \rho _k H_i(x^k)) \nabla ^2 H_i(x^k) \\&+ \sum _{i=1}^{\bar{p}} \max \{0, \bar{u}_i^k + \rho _kG_i(x^k)\} \nabla ^2 G_i(x^k) + \sum _{i=1}^{\bar{m}} \rho _k \nabla H_i(x^k) \nabla H_i(x^k)^T \\&+ \sum _{\{ i \; | \; \bar{u}_i^k + \rho _k G_i(x^k) \ge -\varepsilon _k\}} \rho _k \nabla G_i(x^k) \nabla G_i(x^k)^T \\&+ \sum _{i=1}^m \lambda _i^k \nabla ^2 h_i(x^k) + \sum _{i=1}^p \mu _i^k \nabla ^2 g_i(x^k). \end{aligned}$$

Now, we proceed by dividing (21) and (22) by \(\rho _k\). Noting that \(\rho _{k}^{-1}(\nabla f(x^k), \nabla ^2 f(x^k), \bar{v}^k, \bar{u}^k, \varepsilon _k)\) vanishes in the limit, given that \(\{\rho _k\}\) is unbounded and \(\{ \bar{v}^k \}\) and \(\{ \bar{u}^k \}\) are bounded by Step 2 of Algorithm 2.1, we get

$$\begin{aligned}&\left\| \sum _{i=1}^{\bar{m}} H_i(x^k)\nabla H_i(x^k) + \sum _{i=1}^{\bar{p}} \max \{0,G_i(x^k)\} \nabla G_i(x^k) + \sum _{i=1}^{m} \frac{\lambda ^k_i}{\rho _k} \nabla h_i(x^k)\right. \\&\qquad \left. + \sum _{i=1}^{p} \frac{\mu ^k_i}{\rho _k} \nabla g_i(x^k) \right\| \le \varepsilon '_k \end{aligned}$$

and

$$\begin{aligned} \begin{array}{l} d^T \Big ( \sum _{i=1}^{\bar{m}} H_i(x^k) \nabla ^2 H_i(x^k) + \sum _{i=1}^{\bar{p}} \max \{ 0, G_i(x^k) \} \nabla ^2 G_i(x^k)\\ \quad + \sum _{i=1}^{\bar{m}} \nabla H_i(x^k) \nabla H_i(x^k)^T + \sum _{\{ i \; | \; \bar{u}_i^k + \rho _kG_i(x^k) \ge -\varepsilon _k \}} \nabla G_i(x^k) \nabla G_i(x^k)^T \\ \quad + \sum _{i=1}^m \frac{\lambda _i^k}{\rho _k} \nabla ^2 h_i(x^k) + \sum _{i=1}^p \frac{\mu _i^k}{\rho _k} \nabla ^2 g_i(x^k) \Big ) d \ge -\varepsilon ''_k \Vert d\Vert ^2\quad \text{ for } \text{ all } d \in S_k, \end{array} \end{aligned}$$
(23)

for some \(\varepsilon '_k\rightarrow 0^+\) and \(\varepsilon ''_k\rightarrow 0^+\). Since, on the one hand, if \(G_i(x^*)<0\) then we have that \(\max \{0,G_i(x^k)\}=0\) and that \(\bar{u}_i^k+\rho _kG_i(x^k)<-\varepsilon _k\) for sufficiently large k, and, on the other hand, \(\nabla G_i(x^k)\nabla G_i(x^k)^T\) is positive semidefinite, we can replace

$$\begin{aligned} \sum _{i=1}^{\bar{p}} \max \{ 0, G_i(x^k) \} \nabla ^2 G_i(x^k) \quad \text{ and } \sum _{\{ i \; | \; \bar{u}_i^k + \rho _kG_i(x^k) \ge -\varepsilon _k \}} \nabla G_i(x^k) \nabla G_i(x^k)^T \end{aligned}$$

by

$$\begin{aligned} \sum _{\{ i \; | \; G_i(x^*) \ge 0\}} \max \{ 0, G_i(x^k)\} \nabla ^2 G_i(x^k) \quad \text{ and } \sum _{\{ i \; | \; G_i(x^*) \ge 0\}} \nabla G_i(x^k) \nabla G_i(x^k)^T \end{aligned}$$

in (23), respectively, yielding that \(x^*\) is an AKKT2 point of (IOP). \(\square \)

Now, it remains to establish the relationship between the sequence of iterates generated by Algorithm 2.1 and the original problem (OP). The next theorem states this precisely.

Theorem 3.4

Let \(\{x^k\}\) be a sequence generated by Algorithm 2.1 as in Theorem 3.3. Then, every feasible limit point \(x^*\in \Omega _L\cap \Omega _U\) of \(\{x^k\}\) is an AKKT2 point of the original problem (OP).

Proof

As usual, there is no loss of generality if we assume that \(\{x^k\}\) converges to \(x^* \in \Omega _L \cap \Omega _U\). Following the proof of Theorem 3.3, we get that (21) and (22) hold, where \(L_k\) and \(M_k\) are defined as in the proof of Theorem 3.3.

Set \(v^k = \bar{v}^k + \rho _k H(x^k)\) and \(u^k = \max \{ 0, \bar{u}^k + \rho _k G(x^k) \} \ge 0\). Now, we will prove that \(u_i^k = 0\) whenever \(G_{i}(x^{*}) < 0\). We have two cases depending on whether the sequence of penalty parameters \(\{\rho _k\}\) is bounded or not.

If \(\{\rho _k\}\) is bounded then Step 3 of the algorithm implies that \(V^k = \max \{ G(x^k), - \bar{u}^k / \rho _k \} \rightarrow 0\). Hence, if \(G_{i_{0}}(x^*) < 0\), for some \(i_0\), we have that \(\bar{u}_{i_0}^k \rightarrow 0\) and \(\bar{u}_{i_0}^k + \rho _k G_{i_0}(x^k) < -\varepsilon _k\) for k large enough. Thus, for every i such that \(G_{i}(x^*) < 0\), we get \(u_i^k = 0\). If \(\{\rho _k\}\) is unbounded, we also have \(\bar{u}_i^k + \rho _k G_i(x^k) < -\varepsilon _k\) for k large enough whenever \(G_i(x^*) < 0\). From both cases, we get that if \(G_{i}(x^{*}) < 0\) then \(u_i^k=0\) for k large enough.

Since for k large enough, \(u_i^k=0\), \(\bar{u}_i^k + \rho _k G_i(x^k) < -\varepsilon _k\) whenever \(G_i(x^*)<0\), and \(\mu _i^k = 0\) if \(g_i(x^*) < 0\), we can write \(L_k\) and \(M_k\) as follows:

$$\begin{aligned} L_k= & {} \nabla f(x^k) + \sum _{i=1}^{\bar{m}} v_i^k \nabla H_i(x^k) + \sum _{\{ i \; | \; G_i(x^*)=0 \}} u_i^k \nabla G_i(x^k) \\&\quad + \sum _{i=1}^{m} \lambda ^k_i \nabla h_i(x^k) + \sum _{\{ i \; | \; g_i(x^*) = 0 \}} \mu ^k_i \nabla g_i(x^k), \\ M_k= & {} \nabla ^2 f(x^k) + \sum _{i=1}^{\bar{m}} v_i^k \nabla ^2 H_i(x^k) + \sum _{\{ i \; | \; G_i(x^*) = 0 \}} u_i^k \nabla ^2 G_i(x^k) \\&\quad + \sum _{i=1}^m \lambda _i^k \nabla ^2 h_i(x^k) + \sum _{\{ i \; | \; g_i(x^*) = 0 \}} \mu _i^k \nabla ^2 g_i(x^k) \\&\quad + \sum _{i=1}^{\bar{m}} \rho _k \nabla H_i(x^k) \nabla H_i(x^k)^T + \sum _{\{ i \; | \; G_i(x^*) = 0 \}} \rho _k \nabla G_i(x^k) \nabla G_i(x^k)^T. \end{aligned}$$

Furthermore, by (21) and (22), we obtain that \(L_k \rightarrow 0\) and \(d^T M_k d \ge -\varepsilon _k \Vert d\Vert ^2\) for all \(d \in S_k\), where \(S_k\) is the subspace given in (15). To end the proof, consider the subspace \(W_k\) defined as

$$\begin{aligned} W_k = \left\{ d \in \mathbb {R}^{n} \left| \begin{array}{l} \nabla H_i(x^k)^T d = 0,\quad i=1,\dots ,\bar{m},\quad \nabla G_i(x^k)^T d = 0\quad \text{ whenever } G_i(x^*) = 0, \\ \nabla h_i(x^k)^T d = 0,\quad i=1,\dots ,m,\quad \text{ and } \nabla g_i(x^k)^T d = 0,\quad \text{ whenever } g_i(x^*) = 0 \end{array} \right. \right\} . \end{aligned}$$

Clearly, if \(g_i(x^k) \ge -\varepsilon _k\) for all k sufficiently large, then \(g_i(x^*) \ge 0\) and, by feasibility of \(x^*\), \(g_i(x^*) = 0\). Thus, \(W_k \subseteq S_k\). Then, the matrix \(P_k = M_k + ( \varepsilon _k + \frac{1}{k} ) \mathcal {I}\) is positive definite on \(W_k\). We get the desired result applying Lemma 3.1 to the matrix \(P_k\) and the subspace \(W_k\). \(\square \)

As a consequence of Theorems 3.1, 3.3, and 3.4 we obtain the following result.

Corollary 3.1

Let \(\{x^k\}\) be a sequence generated by Algorithm 2.1 in such a way that, at Step 1, we require that \(x^k\) is \(\varepsilon _k\)-KKT2 for \(\varepsilon _k \rightarrow 0^+\). Let \(x^*\) be a limit point of \(\{x^k\}\).

  1.

    If \(x^*\) satisfies CCP2 for the infeasibility optimization problem (IOP) then \(x^*\) satisfies WSOC with respect to (IOP).

  2.

    If \(x^*\) is feasible and satisfies CCP2 for the original problem (OP) then \(x^*\) satisfies WSOC with respect to (OP).

Corollary 3.1 allows us to obtain a second-order stationarity result even when the set of Lagrange multipliers at a solution is unbounded, that is, when MFCQ fails. Furthermore, since RCRCQ implies CCP2, the theory developed above guarantees convergence to second-order stationary points when all the constraints are given by affine functions.

In order to get a stronger global convergence theory, we can impose a slightly stronger condition to accept \(x^k\) as an approximate solution to the kth subproblem, as well as an additional smoothness assumption on the upper-level constraints. In [13], it has been suggested to replace (4) in the definition of \(\varepsilon \)-KKT by the stronger condition given by

$$\begin{aligned} | \lambda _i h_i(x) |< \varepsilon ,\quad i=1,\dots ,m,\quad \text{ and } | \mu _i g_i(x) | < \varepsilon , \quad i=1,\dots ,p. \end{aligned}$$
(24)

When (2), (3), and (24) hold for some \(\lambda \in \mathbb {R}^m\) and \(\mu \in \mathbb {R}^p_+\), x is called an \(\varepsilon \)-CAKKT point, where the letter C stands for “complementarity”. A limit of \(\varepsilon \)-CAKKT points with \(\varepsilon \rightarrow 0^+\) is called a CAKKT point. In [13], it was proved that CAKKT is an optimality condition and that feasible limit points of a general lower-level Augmented Lagrangian are CAKKT points of the original problem, given that an additional smoothness assumption on the upper-level constraints is satisfied. In order to obtain those results, it was also assumed that each \(x^k\) is an \(\varepsilon _k\)-KKT point and that the lower-level constraints are simple in the sense that Lagrange multipliers approximations associated with them can be assumed to be bounded.

Let us consider now the generalization of CAKKT to second-order given in [42]. We say that \(x\in \mathbb {R}^n\) is an \(\varepsilon \)-CAKKT2 point if, for some \(\lambda \in \mathbb {R}^m\), \(\mu \in \mathbb {R}^p_+\), \(\theta \in \mathbb {R}^m_+\), and \(\eta \in \mathbb {R}^p_+\), conditions (2), (3), and (24) hold together with

$$\begin{aligned} \nabla ^2 \ell (x,\lambda ,\mu ) + \sum _{i=1}^m \theta _i \nabla h_i(x) \nabla h_i(x)^T + \sum _{i=1}^p \eta _i \nabla g_i(x) \nabla g_i(x)^T + \varepsilon \mathcal {I} \end{aligned}$$
(25)

being positive semidefinite and

$$\begin{aligned} | \theta _i h_i(x)^2 |< \varepsilon ,\quad i=1,\dots ,m,\quad \text{ and } | \eta _i g_i(x)^2 | < \varepsilon ,\quad i=1,\dots ,p. \end{aligned}$$
(26)

A limit of \(\varepsilon \)-CAKKT2 points with \(\varepsilon \rightarrow 0^+\) is called a CAKKT2 point. In [42] it was proved that CAKKT2 is an optimality condition strictly stronger than CAKKT and AKKT2 together. Let us present now the global convergence result of Algorithm 2.1 when subproblems are solved obtaining, at each iteration, \(\varepsilon _k\)-CAKKT or \(\varepsilon _k\)-CAKKT2 points with \(\varepsilon _k \rightarrow 0^+\), under the same additional smoothness property on the upper-level constraints needed in the first-order case.
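The \(\varepsilon\)-CAKKT2 test, for given candidate multipliers, amounts to checking (24), (26), and positive semidefiniteness of the matrix (25). A minimal numpy sketch follows (the interface is ours; the remaining \(\varepsilon\)-KKT tests (2)–(3) on the Lagrangian gradient and on feasibility are assumed to be checked separately).

```python
import numpy as np

def is_eps_cakkt2(eps, h_val, g_val, jac_h, jac_g, hess_ell,
                  lam, mu, theta, eta):
    """Check (24), (26) and positive semidefiniteness of (25).

    jac_h, jac_g: Jacobians whose rows are the constraint gradients;
    lam, mu, theta, eta: candidate multipliers (mu, theta, eta >= 0).
    """
    # (24) strengthened complementarity
    c24 = (np.all(np.abs(lam * h_val) < eps)
           and np.all(np.abs(mu * g_val) < eps))
    # (26) second-order complementarity for theta and eta
    c26 = (np.all(np.abs(theta * h_val**2) < eps)
           and np.all(np.abs(eta * g_val**2) < eps))
    # (25) regularized second-order matrix
    n = hess_ell.shape[0]
    M = hess_ell + eps * np.eye(n)
    for t, row in zip(theta, jac_h):
        M = M + t * np.outer(row, row)
    for e, row in zip(eta, jac_g):
        M = M + e * np.outer(row, row)
    psd = np.linalg.eigvalsh(M).min() >= -1e-12   # PSD up to roundoff
    return c24 and c26 and psd
```

Note that, for a fixed \(\varepsilon\), enlarging \(\theta\) or \(\eta\) helps the semidefiniteness of (25) but is limited by (26), which couples the multipliers with the squared constraint values.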

Theorem 3.5

Assume that \(\varepsilon _k \rightarrow 0^+\) is a sequence of given tolerances and that \(\{x^k\}\) is a sequence generated by Algorithm 2.1 where, at Step 1, \(x^k\) is \(\varepsilon _k\)-CAKKT2 (or \(\varepsilon _k\)-CAKKT) for subproblem (1), with \(F(x) = L(x,\bar{v}^k,\bar{u}^k;\rho _k)\) and \(\nabla ^2 F(x^k) = \nabla ^2 F_{\varepsilon _k}(x^k)\). Then, any limit point \(x^*\) of \(\{x^k\}\) is a CAKKT2 point (or CAKKT point, respectively) for the infeasibility optimization problem (IOP). If, in addition, we assume the generalized Łojasiewicz inequality, namely, that there exist \(\delta >0\) and a function \(\varphi : B(x^*,\delta ) \rightarrow \mathbb {R}\) with \(\varphi (x)\rightarrow 0\) as \(x\rightarrow x^*\) such that, for all \(x \in B(x^*,\delta )\),

$$\begin{aligned} |\Phi (x)-\Phi (x^*)|\le \varphi (x)\Vert \nabla \Phi (x)\Vert , \end{aligned}$$

where \(\Phi (x)=\Vert H(x)\Vert ^2+\Vert \max \{0,G(x)\}\Vert ^2\), then, when \(x^*\) is feasible for the original problem (OP), \(x^*\) is a CAKKT2 point (or CAKKT point, respectively) of the original problem (OP).

Proof

For the lower-level constraints, the additional property follows easily from the additional requirements (24) and (26) for the subproblems. When \(x^*\) is feasible, under the generalized Łojasiewicz inequality, we can follow the proof of [13, Theorem 5.1] to see that \(\rho _k H_i(x^k)^2 \rightarrow 0\), \(i=1,\dots ,\bar{m}\), \(\rho _k G_i(x^k)^2 \rightarrow 0\), \(i=1,\dots ,\bar{p}\), \(v^k_i H_i(x^k) \rightarrow 0\), \(i=1,\dots ,\bar{m}\), and \(u^k_i G_i(x^k) \rightarrow 0\), \(i=1,\dots ,\bar{p}\). Now, defining \(\theta ^k_i = \rho _k\), \(i=1,\dots ,\bar{m}\), and \(\eta ^k_i = \rho _k\) if \(\bar{u} _i^k + \rho _k G_i(x^k) \ge -\varepsilon _k\), \(\eta ^k_i = 0\) otherwise, we can see that \(\theta ^k_i H_i(x^k)^2 \rightarrow 0\), \(i=1,\dots ,\bar{m}\), and \(\eta ^k_i G_i(x^k)^2 \rightarrow 0\), \(i=1,\dots ,\bar{p}\), as in the proof of [42, Theorem 3.1]. \(\square \)

We end this section with a detailed comparison of our results with the global convergence results presented for the Augmented Lagrangian method introduced in [6]. The following improvements apply: (a) a weaker constraint qualification is employed, (b) lower-level constraints can be arbitrary, (c) subproblems can be solved slightly more loosely, and (d) a second-order stationarity result also holds for the infeasibility problem.

The requirements employed in this work for an approximate solution to the subproblems are slightly weaker than the ones required in [6, Theorem 2], where the particular case of a compact box in the lower level is considered. The condition described in [6] for solving the subproblems corresponds to replacing the set \(S_k\) by the larger one \(\{ d \in \mathbb {R}^n \; | \; \nabla h_i(x^k)^T d = 0, i=1,\dots ,m, \text{ and } \nabla g_i(x^k)^T d = 0 \text{ whenever } g_i(x^k) \ge 0 \}\), which requires more from the subproblem solver. However, it should be noted that one could modify the definition of AKKT2 by replacing the range of the last sum in (8) from \(\mathcal{A}(x^*) = \{ i \in \{1,\dots ,p\} \; | \; g_i(x^*) = 0\}\) to the smaller index set \(\{ i \in \{1,\dots ,p\} \; | \; g_i(x^k) \ge 0\}\), arriving at a stronger optimality condition. The condition considered in [6] to accept \(x^k\) as an approximate solution to the subproblem is precisely the \(\varepsilon _k\)-perturbation of this modified AKKT2 condition (analogous to the \(\varepsilon _k\)-perturbation of AKKT2 considered in Theorem 3.2). The fact that this is an optimality condition follows from the proof of [9, Theorem 3.3]. The point is that it is not the case that feasible limit points of the Augmented Lagrangian method fulfill this stronger optimality condition, independently of the requirements made to consider \(x^k\) an approximate solution to the kth subproblem. This makes the formulation introduced in the present work more natural, in the sense that the (weaker) criterion used for the resolution of the subproblems coincides with the optimality condition present in the global convergence theorem.

Similarly, with respect to the definition of AKKT2 for objective functions of the PHR form, we note that, although the optimality condition obtained by replacing \(\nabla ^2 F(x^k)\) with \(\nabla ^2 F_0(x^k)\) is stronger, using \(\nabla ^2 F_{\varepsilon _k}(x^k)\) requires less from an approximate solution to a subproblem, which makes it easier for the subproblem solver to stop and is also compatible with the AKKT2 definition for PHR functions.

4 Discussion and illustrative numerical examples

In many practical constrained optimization problems the constraints can be divided into two classes: those that may be satisfied only at the end of the optimization process and those that should be satisfied at every stage of it. The reason is that the fulfillment of the first type of constraints is not mandatory, for many different reasons; while the fulfillment of the second type cannot be avoided, for example, because they involve definitions of variables. Intermediate results in which the first constraints are not fully satisfied may be useful in practice precisely because the second ones are. In those cases it is not sensible to deal with the two classes of constraints in identical ways, which means that, in the terminology of Augmented Lagrangian methods, the mandatory constraints should be permanently included in the so-called lower-level set, or classified as subproblem constraints. See [24]. Therefore, at each iteration of the Augmented Lagrangian method one should solve a subproblem in which the only constraints are those of the second class.

The family of Augmented Lagrangian methods introduced in the present work possesses two desirable features at the same time—convergence to second-order stationary points and general constraints in the so-called lower-level set. In the present section, we illustrate with numerical examples the possible advantages of both features. However, it should be noted that the implementation of this kind of method is limited by the existence of a method able to tackle the subproblems. Candidates should exhibit convergence to second-order stationary points and should be able to deal with the lower-level constraints by preserving feasibility of the iterates. While, on the one hand, feasible second-order methods for bound-constrained minimization exist (see, for example, [6]); on the other hand, feasible second-order methods with arbitrary constraints are rare. In the present section, we focus on situations in which the lower-level constraints are given by Euclidean balls and, therefore, subproblems can be solved by the trust-region method introduced in [50, 51].

4.1 Avoiding convergence to global maximizers

Regarding the convergence to second-order stationary points, it is not easy to exhibit numerical experiments in which a method with convergence to second-order stationary points finds better quality solutions (feasible points with smaller objective function value) than those found by a method with convergence to first-order stationary points. A recent numerical experiment presented in [23], considering all the 87 unconstrained problems of the CUTEst collection [41] with available second-order derivatives, illustrates this fact. In this experiment it was observed that every time a first-order stopping criterion was satisfied, a second-order stopping criterion was satisfied as well at the same time. The consequence of this was that the first- and second-order methods being compared obtained equivalent results in all the problems in which both satisfied a stopping criterion related to success (independently of being a first- or a second-order stopping criterion). Ad hoc problems in which a first-order method converges to a global maximizer were presented in [6] and [23], but this behavior is hardly observed in massive numerical experiments. A similar example follows.

Consider the problem

$$\begin{aligned} \text{ Minimize } x \text{ subject } \text{ to } x + y^2 = 0 \quad \text{ and }\quad x^2 + y^2 \le 1. \end{aligned}$$

It has a global maximizer at \(x=y=0\) and global minimizers at \(x=(1-\sqrt{5})/2\) and \(y=\pm \sqrt{(\sqrt{5}-1)/2}\). Algencan [5] (see also [24]) is a well-established implementation of an Augmented Lagrangian method that keeps the bound constraints as lower-level constraints (and penalizes any other type of constraint) and solves the subproblems with a variety of methods for bound-constrained minimization, all of them with convergence to first-order stationary points. When Algencan is applied to the problem above starting from \((2,0)^T\), due to the problem's symmetry with respect to the x-axis, it converges to the global maximizer \((x,y)^T=(0,0)^T\) with \(\lambda _1=-1\) and \(\mu _1=0\). (A similar behavior would be expected from any other first-order method.) Consider now the possibility of applying an Augmented Lagrangian method that keeps \(x^2 + y^2 \le 1\) as a lower-level constraint and finds second-order stationary points of the subproblems. The subproblem at the kth iteration is of the form

$$\begin{aligned} \text{ Minimize } x + \frac{\rho _k}{2} \left( x + y^2 \right) ^2 \text{ subject } \text{ to } x^2 + y^2 \le 1, \end{aligned}$$

where, with the simple purpose of simplifying the presentation, we consider \(\bar{v}^k_1 =0\) for all k (assuming that parameters \(\bar{v}_{\min }\) and \(\bar{v}_{\max }\) are such that \(0 \in [\bar{v}_{\min },\bar{v}_{\max }]\)). Figure 1 shows a sequence of approximate second-order stationary points for the subproblems, for increasing values of \(\rho _k\), that converges to one of the global minimizers. In theory, a global minimizer would also be obtained by penalizing both constraints, if one were able to find second-order stationary points for the subproblems. However, in practice, if a “small” penalty parameter is used for the first subproblem, the greediness phenomenon described below impairs convergence in this case.
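For this example the subproblems can be solved essentially in closed form. On the branch \(y \ne 0\) with the ball constraint active, eliminating the multiplier from the subproblem KKT system (still with \(\bar{v}^k_1=0\)) gives \(\mu = 1/(1-2x)\), \(x^2+y^2=1\), and the scalar equation \(x + 1 - x^2 + 1/(\rho_k(1-2x)) = 0\), which has a root in \((-1,0)\) whenever \(\rho_k > 1/3\); for smaller \(\rho_k\) this branch is empty, consistent with \(x^1 \approx (-1,0)^T\) in Fig. 1. The sketch below (our derivation, plain bisection) reproduces the tail of such a sequence.

```python
import math

def subproblem_branch_root(rho, tol=1e-12):
    """Solve x + 1 - x**2 + 1/(rho*(1 - 2*x)) = 0 on (-1, 0) by bisection.

    This scalar equation comes from eliminating the multiplier of the
    ball constraint from the subproblem KKT system on the branch y != 0
    (with bar v_1^k = 0); a root exists in (-1, 0) whenever rho > 1/3."""
    f = lambda x: x + 1.0 - x * x + 1.0 / (rho * (1.0 - 2.0 * x))
    lo, hi = -1.0, 0.0
    assert f(lo) < 0.0 < f(hi)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if f(mid) < 0.0 else (lo, mid)
    return 0.5 * (lo + hi)

# Outer loop over increasing penalty parameters: the subproblem
# solutions approach the global minimizer with x = (1 - sqrt(5))/2.
for rho in (1.0, 10.0, 100.0, 1000.0):
    x = subproblem_branch_root(rho)
    y = math.sqrt(1.0 - x * x)    # the ball constraint is active here
    print(rho, x, y)
```

As \(\rho_k\) grows, the root tends to the negative solution of \(x + 1 - x^2 = 0\), i.e. \(x = (1-\sqrt 5)/2\), matching the limit reported in Fig. 1.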

Fig. 1
figure 1

Sequence \(\{ x^k \}\) of second-order stationary points of the kth subproblem for increasing values of \(\rho _k\). The whole sequence satisfies the lower-level constraint \(x^2 + y^2 \le 1\). The infeasible point \(x^1 \approx (-1,0)^T\) (with respect to the upper-level constraint \(x+y^2=0\)) corresponds to a small value of \(\rho _1 = 0.1\); while the final iterate \(\approx ( (1-\sqrt{5})/2, -((\sqrt{5}-1)/2)^{1/2} )^T\) corresponds to \(\rho _k=100\) and it satisfies the penalized constraint with tolerance \(10^{-8}\)

4.2 Avoiding the greediness phenomenon

Penalty and Augmented Lagrangian methods may be affected by the behavior of the objective function outside the feasible region. If the objective function takes very low values at infeasible points, iterates of the subproblem solver may be attracted by undesirable minimizers or they may even diverge, especially at the first outer iterations, and overall convergence may fail to occur. This phenomenon was named “greediness” in [31], where a Proximal Augmented Lagrangian method was proposed as an alternative to overcome it. In [18, pp. 418–419], it was suggested as a remedy that an external penalty function with exponent bigger than 2 should be used; while, in [20], an outer trust-region approach for the same purpose was introduced. None of these alternatives is necessary if the greediness phenomenon is caused by the penalization of a constraint that can be kept at the lower-level set.

Consider the problem

$$\begin{aligned} \text{ Minimize } x^5 \text{ subject } \text{ to } x^2 \le 1 \text{ and } x \le 0. \end{aligned}$$

If the bound constraint is kept at the lower-level set and the ball constraint is penalized, as it is the case in Algencan, subproblems are of the form

$$\begin{aligned} \text{ Minimize } x^5 + \frac{\rho _k}{2} \left( x^2 - 1 \right) ^2 \text{ subject } \text{ to } x \le 0. \end{aligned}$$

Clearly, subproblems are unbounded from below for any value of \(\rho _k\) and, therefore, Augmented Lagrangian theory does not help to predict the practical behavior of the method in this case. Even adding an artificial lower bound of the form \(-10^{20} \le x\) may not help to avoid the greediness phenomenon. If \(x=-10^{20}\) is obtained as a solution to the first subproblem and, as is usual in practice, it is used as the initial guess for solving the next subproblem, many outer iterations and increases of the penalty parameter will be needed to obtain a solution to the original problem. Moreover, it is well known that large penalty parameters promote ill-conditioned subproblems, numerical issues, and, eventually, practical failure of the inner solver to satisfy the subproblems' stopping criterion. This is what happens with Algencan. On the other hand, if the ball constraint can be kept as a lower-level constraint, subproblems are of the form

$$\begin{aligned} \text{ Minimize } x^5 + \frac{\rho _k}{2} \max \{0, x \}^2 \text{ subject } \text{ to } x^2 \le 1 \end{aligned}$$

and, in this particular case, convergence to the global minimizer \(x=-1\) is expected to occur at the first outer iteration.
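The two situations can be contrasted numerically. The snippet below (with an arbitrary illustrative value \(\rho = 100\)) checks that the penalized subproblem objective decreases without bound as \(x \to -\infty\), while the lower-level formulation attains its minimum at \(x=-1\) on a grid over \([-1,1]\).

```python
# Penalizing the ball constraint: the subproblem objective
# q(x) = x**5 + (rho/2)*(x**2 - 1)**2 is unbounded below on x <= 0
# for every rho, because x**5 dominates the quartic penalty term.
rho = 100.0                      # arbitrary illustrative value
q = lambda x: x**5 + 0.5 * rho * (x**2 - 1.0)**2
assert q(-10.0) > q(-100.0) > q(-1000.0)   # decreases without bound

# Keeping the ball constraint at the lower level: on [-1, 1] the
# subproblem objective p(x) = x**5 + (rho/2)*max(0, x)**2 attains its
# minimum at the global minimizer x = -1 of the original problem.
p = lambda x: x**5 + 0.5 * rho * max(0.0, x)**2
grid = [i / 1000.0 for i in range(-1000, 1001)]
assert min(grid, key=p) == -1.0
```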

4.3 Dealing with functions that are undefined outside a given domain

Consider an objective function \(f: \mathbb {R}^n \rightarrow \mathbb {R}\) that should be minimized within \(\Omega _1 \cap \Omega _2\) and such that f or its derivatives are not well-defined outside \(\Omega _1\). In this case, there appears to be no choice other than setting the lower-level set \(\Omega _L = \Omega _1\) and the upper-level set \(\Omega _U = \Omega _2\); i.e. subproblems must consist of minimizing the Augmented Lagrangian that penalizes \(\Omega _2\) subject to \(\Omega _1\). Three simple examples follow.

Example 1

In [27], a location problem was introduced. The problem consists in finding locations \(c_0, c_1, \dots , c_N \in \mathbb {R}^2\) such that \(\sum _{i=1}^N \Vert c_0 - c_i \Vert _2\) is minimized subject to \(c_i \in C_i \cap \Omega _i\) for \(i=0,\dots ,N\), where \(C_i = \{ x \in \mathbb {R}^2 \; | \; \Vert x - \bar{c}_i \Vert \le r_i \}\) for given constants \(\bar{c}_i \in \mathbb {R}^2\) and \(r_i > 0\), and it is assumed that \(C_0 \cap C_i = \emptyset \) for all \(i>0\). The constraints \(\Omega _i\) are some additional constraints that must be satisfied by each location \(c_i\), \(i=0,\dots ,N\). If all constraints are penalized, subproblems are unconstrained and they are not differentiable if \(c_0=c_i\) for some \(i>0\). This inconvenience is avoided if the constraints \(c_i \in C_i\) for \(i=0,\dots ,N\) are kept as lower-level constraints.

Example 2

Similar examples occur when the potential energy function \(\sum _{i \ne j} 1 / \Vert p_i - p_j \Vert ^2\) is minimized in order to obtain “well-distributed” points \(p_i\) within a given region (see, for example, [58], where this function was considered as an ingredient for solving packing problems). It may be the case that keeping some constraints in the lower-level set avoids the regions where the objective function is not well-defined (\(p_i=p_j\) for some \(i \ne j\)).

Example 3

There exists a huge variety of packing problems. They basically consist in placing items within a container without overlapping. One of the simplest packing problems consists in maximizing the radius r of N identical circles that must be packed within a given container. On the one hand, this simple geometrical problem has received the attention of several researchers in the last decades (see [28] and the references therein). On the other hand, this problem is an over-simplification of a packing problem that applies to building initial configurations for molecular dynamics simulations [52, 53] (see also [24, p. 165]).

Consider the problem of packing identical circles within the “ring-shape” region displayed in Fig. 2. The problem can be modelled as

$$\begin{aligned} \begin{array}{ll} \text{ Maximize } &{} r \\ \text{ subject } \text{ to } &{} ( x_i - x_j )^2 + ( y_i - y_j )^2 \ge (2r)^2 \quad \text{ for } i=1,\dots ,N,\quad j=i+1,\dots ,N,\\ &{} x_i^2 + y_i^2 \le (R_{{\mathrm {out}}}-r)^2\quad \text{ for } i=1,\dots ,N,\\ &{}x_i^2 + y_i^2 \ge (R_{{\mathrm {in}}}+r)^2 \quad \text{ for } i=1,\dots ,N.\\ \end{array} \end{aligned}$$

The first constraint avoids overlapping between the circles being packed, while the remaining two constraints state that the items must be placed within the ring-shaped container with inner radius \(R_{{\mathrm {in}}}\) and outer radius \(R_{{\mathrm {out}}}\). The constraint \(x_i^2 + y_i^2 \ge (R_{{\mathrm {in}}}+r)^2\) may also be written as

$$\begin{aligned} \log [ (R_{{\mathrm {out}}}-r)^2 - ( x_i^2 + y_i^2 ) ] \le c, \end{aligned}$$
(27)

where \(c = \log [ (R_{{\mathrm {out}}}-r)^2 - (R_{{\mathrm {in}}}+r)^2 ]\). This equivalent constraint is not well-defined unless \(x_i^2 + y_i^2 < (R_{{\mathrm {out}}}-r)^2\) holds and, therefore, the constraint \(x_i^2 + y_i^2 \le (R_{{\mathrm {out}}}-r)^2\) should be kept as a lower-level constraint. Of course, (27) is an unfortunate modelling choice; but it illustrates a situation that may occur in more complex real applications.
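For illustration, the model above can be solved for a tiny instance with a general-purpose solver. In the sketch below (SciPy assumed; N = 3 is hypothetical, while \(R_{\mathrm{in}}=0.7\) and \(R_{\mathrm{out}}=2\) are taken from Fig. 2), the width of the ring is the binding constraint, so the solver should recover \(r = (R_{\mathrm{out}}-R_{\mathrm{in}})/2\).

```python
import numpy as np
from scipy.optimize import minimize

# Small instance: N = 3 circles, ring with R_in = 0.7, R_out = 2.
N, R_in, R_out = 3, 0.7, 2.0

def unpack(z):
    return z[0], z[1:].reshape(N, 2)   # radius r and the N centers

cons = []
# non-overlap: (x_i - x_j)^2 + (y_i - y_j)^2 - (2r)^2 >= 0
for i in range(N):
    for j in range(i + 1, N):
        cons.append({'type': 'ineq',
                     'fun': lambda z, i=i, j=j:
                         np.sum((unpack(z)[1][i] - unpack(z)[1][j])**2)
                         - (2 * z[0])**2})
# containment in the ring: outer and inner circle constraints
for i in range(N):
    cons.append({'type': 'ineq',
                 'fun': lambda z, i=i:
                     (R_out - z[0])**2 - np.sum(unpack(z)[1][i]**2)})
    cons.append({'type': 'ineq',
                 'fun': lambda z, i=i:
                     np.sum(unpack(z)[1][i]**2) - (R_in + z[0])**2})

# start from N small circles equally spaced in the middle of the ring
rho0 = 0.5 * (R_in + R_out)
angles = 2 * np.pi * np.arange(N) / N
z0 = np.concatenate(([0.1],
                     (rho0 * np.column_stack((np.cos(angles),
                                              np.sin(angles)))).ravel()))

res = minimize(lambda z: -z[0], z0, constraints=cons, method='SLSQP')
r_opt = res.x[0]   # for N = 3 the ring width binds: r_opt close to 0.65
```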

Fig. 2

\(N=10\) identical circles of maximal radius \(r \approx 0.4586\) packed within a ring-shaped container given by an inner circle with radius \(R_{{\mathrm {in}}}=0.7\) and an outer circle with radius \(R_{{\mathrm {out}}}=2\)

5 Conclusions

Typical implementations of Augmented Lagrangian methods consider subproblems with simple constraints such as bound constraints (as is the case of the well-known packages Algencan [5] and Lancelot [33]) or linear constraints (see, for example, [22]). However, in a recent work [19], the general lower-level theory of the first-order Augmented Lagrangian developed in [5] was exploited to make a radical change to the classical approach, by penalizing the box constraints and keeping the equality constraints as subproblems’ constraints. Similarly to interior-point approaches, a Newtonian method applied to the KKT conditions of the subproblems was implemented, with numerical results comparable to the classical approach employed by Algencan. This state of affairs motivated us to further develop the second-order theory of Augmented Lagrangian methods considered in [6], including general lower-level constraints and using modern second-order optimality conditions [9, 42] associated with weak constraint qualifications.
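As a purely illustrative sketch (not the algorithm of [19] or Algencan), the classical first-order Augmented Lagrangian scheme with a single penalized equality constraint and bound-constrained subproblems can be written in a few lines; the subproblem solver, problem data, and tolerances below are all hypothetical choices.

```python
import numpy as np
from scipy.optimize import minimize

# Minimize f(x) = (x_0 - 2)^2 + (x_1 - 2)^2 subject to the upper-level
# (penalized) constraint h(x) = x_0 + x_1 - 1 = 0 and the lower-level
# bound constraints 0 <= x <= 1, kept in the subproblems (L-BFGS-B).
f = lambda x: (x[0] - 2)**2 + (x[1] - 2)**2
h = lambda x: x[0] + x[1] - 1.0

x, lam, rho = np.zeros(2), 0.0, 10.0
for _ in range(30):
    # Augmented Lagrangian for the current multiplier and penalty parameter
    aug = lambda x: f(x) + lam * h(x) + 0.5 * rho * h(x)**2
    x = minimize(aug, x, method='L-BFGS-B', bounds=[(0, 1), (0, 1)]).x
    lam += rho * h(x)                 # first-order multiplier update
    if abs(h(x)) <= 1e-6:             # (approximately) feasible: stop
        break
    rho *= 2.0                        # tighten the penalty otherwise
```

For this instance the iterates approach \(x = (0.5, 0.5)\), where the bounds are inactive and the penalized constraint holds.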

Several recently-developed affordable algorithms for unconstrained optimization [1, 2, 6, 23, 30, 46] compute the least eigenvalue of the Hessian matrix at each iteration, and the corresponding eigenvector when necessary, in order to improve the complexity of finding a first-order stationary point or to guarantee convergence to a second-order stationary point. Dealing with second-order information is not straightforward in general constrained optimization, but an analogous strategy is to compute directions of negative curvature in the kernel of the Jacobian of the active lower-level constraints. In this case, one should expect that a second-order stopping criterion can be checked at a reasonable cost. This paper has provided an Augmented Lagrangian framework to deal with constraints in those cases. Our future research will focus on developing an affordable second-order variant of the Newtonian Augmented Lagrangian with equality-constrained subproblems developed in [19], exploiting directions of negative curvature and the general theory developed in this paper.
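When a basis for the kernel of the active-constraint Jacobian is available, this check reduces to an eigenvalue computation on the reduced Hessian. The following sketch (toy data at a hypothetical iterate; NumPy and SciPy assumed) illustrates the idea.

```python
import numpy as np
from scipy.linalg import null_space

# Toy data: Hessian of the Lagrangian and Jacobian of the active
# lower-level constraints at a hypothetical iterate.
H = np.diag([1.0, -2.0, 3.0])
A = np.array([[1.0, 1.0, 1.0]])

Z = null_space(A)                      # orthonormal basis of ker(A)
lam, V = np.linalg.eigh(Z.T @ H @ Z)   # spectrum of the reduced Hessian
lam_min = lam[0]

# If lam_min < 0, the second-order condition fails, and d below is a
# unit direction of negative curvature satisfying A d = 0.
d = Z @ V[:, 0]
```

The cost is dominated by the null-space computation and a small symmetric eigenvalue problem, which supports the expectation that the second-order stopping criterion can be checked at a reasonable cost.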