1 Introduction

The aim of this chapter is to introduce the reader to the most important results and algorithms in unconstrained and constrained optimization. Since it would be impossible to cover the wide range of results in this area, we concentrate on a few fundamental topics that are also at the basis of the new results, obtained using \(\textcircled {1}\), introduced in the subsequent chapters. The interested reader can refer to various classical and recent books to deepen their knowledge. We omit almost all proofs of the results, but complete references are always provided.

After introducing the concepts of convex sets and functions, optimality conditions for unconstrained optimization are presented in Sect. 3, together with the most important algorithmic techniques: the gradient method, the conjugate gradient method, and Newton's and Quasi-Newton methods. In the subsequent Sect. 4 we focus on optimality conditions for constrained optimization and on the construction of the dual of nonlinear optimization problems. Finally, some important algorithms for constrained nonlinear optimization problems are presented.

Three important aspects of optimization are not presented here: global optimization, multi–objective optimization, and non–smooth optimization. In this chapter we only discuss first and second order optimality conditions for local optima, and all algorithms compute a local minimum, a stationary point (in the unconstrained case), or a Karush–Kuhn–Tucker point (in the constrained case). Global optimization is an important area, theoretically and practically, for which several different approaches have been suggested: space covering methods, trajectory methods, random sampling, random search, etc. A basic reference on various aspects of global optimization is [40], while practical applications are discussed in [51]. Finally, a comprehensive archive of online information can be found at the web page http://www.globaloptimization.org/.

Many optimization problems are multi–objective in nature, and different techniques have been proposed to deal with this important aspect of optimization. In general, due to the presence of conflicting objectives, there is no single solution that simultaneously optimizes all the objectives; hence, the aim is to determine non–dominated, Pareto optimal solutions [16, 23]. Two general approaches to multi–objective optimization exist: combine the individual objective functions into a single function, or move all but one objective to the constraint set. More recently, algorithms based on different meta-heuristics have also been proposed in the literature [41].

Non–smooth optimization problems arise in many important practical applications. Here it is assumed that the function is continuous, but not necessarily differentiable. The methods for non–smooth optimization can be roughly divided into two main classes: subgradient methods and bundle methods. Both classes are based on the assumption that, at each point, the objective function value and one subgradient can be computed. A classical book on this topic is [43]. A survey on different numerical methods for non–smooth optimization and the most recent developments is presented in [5]. Finally, a recent compact survey on non–smooth optimization can be found in [28].

We briefly describe our notation now. All vectors are column vectors and will be indicated with lower case Latin letters (x, y, \(\ldots \)). Subscripts indicate components of a vector, while superscripts are used to identify different vectors. Matrices will be indicated with upper case Roman letters (A, B, \(\ldots \)). For an \({m \times n}\) matrix A, \(A_{ij}\) is the element in the ith row, jth column, \(A_{.j}\) is the j–th column of A, while \(A_{i.}\) is its i–th row. The set of real numbers and the set of nonnegative real numbers will be denoted by \(\mathrm{I}\!\mathrm{R}\) and \(\mathrm{I}\!\mathrm{R}_+\), respectively. The space of the n–dimensional vectors with real components will be indicated by \(\mathrm{I}\!\mathrm{R}^n\), and \(\mathrm{I}\!\mathrm{R}^n_+\) is an abbreviation for the nonnegative orthant in \(\mathrm{I}\!\mathrm{R}^n\). The symbol \(\left\| x \right\| \) indicates the norm of the vector x. In particular, the Euclidean norm is denoted by \(\left\| x \right\| _2\), and \(\left\| x \right\| ^2_2 = {x}^T{x}\). Superscript \(^T\) indicates transpose. The scalar product of two vectors x and y in \(\mathrm{I}\!\mathrm{R}^n\) will be denoted by \({x}^T{y}\). The space of the \(m\times n\) matrices with real components will be indicated by \(\mathrm{I}\!\mathrm{R}^{m\times n}\). The rank of a matrix A will be indicated by \(\mathrm {rank}A\). A square matrix \(A \in \mathrm{I}\!\mathrm{R}^{n\times n}\) is positive semidefinite if \({x}^T{Ax} \ge 0\) for all \(x \in \mathrm{I}\!\mathrm{R}^n\), and positive definite if \({x}^T{Ax} > 0\) for all \(0 \ne x \in \mathrm{I}\!\mathrm{R}^n\). If \(f: S \subseteq \mathrm{I}\!\mathrm{R}^n \rightarrow \mathrm{I}\!\mathrm{R}\!\cup \!\{\pm \infty \}\), \(\mathop {\mathrm {dom}}f \) is the set of points x for which f(x) is defined and \(f(x)\in \mathrm{I}\!\mathrm{R}\). The gradient \(\nabla f(x)\) of a continuously differentiable function \(f:\mathrm{I}\!\mathrm{R}^n \rightarrow \mathrm{I}\!\mathrm{R}\) at a point \(x\in \mathrm{I}\!\mathrm{R}^n\) is a column vector with components \(\left[ \nabla f(x)\right] _j = \frac{\partial f(x)}{\partial x_j}\). For a twice differentiable function \(f:\mathrm{I}\!\mathrm{R}^n \rightarrow \mathrm{I}\!\mathrm{R}\), the Hessian \(\nabla ^2 f(x)\) belongs to \(\mathrm{I}\!\mathrm{R}^{n\times n}\) and \(\left[ \nabla ^2 f(x)\right] _{ij} = \frac{\partial ^2 f(x)}{\partial x_i \partial x_j}\). If \(F:\mathrm{I}\!\mathrm{R}^n \rightarrow \mathrm{I}\!\mathrm{R}^m\) is a continuously differentiable vector–valued function, then \(\nabla F(x) \in \mathrm{I}\!\mathrm{R}^{m\times n} \) denotes the Jacobian matrix of F at \(x\in \mathrm{I}\!\mathrm{R}^n\). Here and throughout, the symbols \(:=\) and \(=:\) denote definition of the term on the left and the right sides of each symbol, respectively.

2 Convex Sets and Functions

Definition 1

([56]) A set \(C \subseteq \mathrm{I}\!\mathrm{R}^n\) is a convex set if

$$\begin{aligned} x^1, x^2 \in C, \quad \lambda \in [0, 1] \quad \quad \implies \quad \quad (1- \lambda ) x^1 + \lambda x^2 \in C. \end{aligned}$$

Therefore, for a convex set C, the segment joining any two distinct points in C is entirely contained in C.

Definition 2

A point \(x \in \mathrm{I}\!\mathrm{R}^n\) is a convex combination of \(x^1, x^2, \ldots , x^k \in \mathrm{I}\!\mathrm{R}^n\) if there exist scalars \(\lambda _1, \lambda _2, \ldots , \lambda _k\) such that

$$\begin{aligned} x = \lambda _1 x^1 + \lambda _2 x^2 + \ldots + \lambda _k x^k \mathop { \text{ with } } \lambda _1 + \lambda _2 + \ldots + \lambda _k = 1,\; \lambda _j \ge 0, j = 1, \ldots , {k}. \end{aligned}$$

The concept of convex hull of a set is extremely important in optimization.

Definition 3

([46, Definition 3.1.16]) Given \(S \subseteq \mathrm{I}\!\mathrm{R}^n\) the convex hull of S is defined as

$$ \mathop {\text{ conv }}S :=\left\{ x \in \mathrm{I}\!\mathrm{R}^n: \begin{array}{l} x = \lambda _1 x^1 + \lambda _2 x^2 + \ldots + \lambda _k x^k \\ x^1,\ldots , x^k \in S \\ \lambda _j \in [0, 1],\ j = 1, \ldots , {k} \\ \lambda _1 + \lambda _2 + \ldots + \lambda _k = 1 \end{array}\right\} $$

Clearly, if S is a convex set, then \(S = \mathop {\text{ conv }}S \).
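Membership in the convex hull of a finite set can be tested computationally: by Definition 3, \(x \in \mathop {\text{ conv }}S\) for \(S = \{x^1, \ldots , x^k\}\) if and only if the linear system \(\sum _j \lambda _j x^j = x\), \(\sum _j \lambda _j = 1\), \(\lambda \ge 0\) is feasible. The following is a minimal sketch of such a test, assuming NumPy and SciPy are available; the helper name in_convex_hull is ours.

```python
import numpy as np
from scipy.optimize import linprog

def in_convex_hull(points, x):
    """Test x in conv{points} (Definition 3) via linear feasibility:
    find lambda >= 0 with sum(lambda) = 1 and points.T @ lambda = x."""
    k = points.shape[0]
    A_eq = np.vstack([points.T, np.ones((1, k))])   # combination + weights summing to 1
    b_eq = np.append(x, 1.0)
    res = linprog(c=np.zeros(k), A_eq=A_eq, b_eq=b_eq, bounds=[(0, None)] * k)
    return res.status == 0                          # feasible <=> x in conv(points)

# The convex hull of the unit square's vertices is the square itself.
square = np.array([[0., 0.], [1., 0.], [0., 1.], [1., 1.]])
print(in_convex_hull(square, np.array([0.5, 0.5])))  # True
print(in_convex_hull(square, np.array([1.5, 0.5])))  # False
```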

Hyperplanes

$$\begin{aligned} H :=\left\{ x \in \mathrm{I}\!\mathrm{R}^n: {p}^T{x} = \mu \right\} , \end{aligned}$$

and halfspaces

$$\begin{aligned} S :=\left\{ x \in \mathrm{I}\!\mathrm{R}^n: {p}^T{x} \le \mu \right\} \end{aligned}$$

are examples of convex sets (here \(0 \ne p \in \mathrm{I}\!\mathrm{R}^n\), and \(\mu \in \mathrm{I}\!\mathrm{R}\)). The set of symmetric positive semidefinite matrices is a convex subset of \(\mathrm{I}\!\mathrm{R}^{n\times n}\), the set of square matrices of dimension n. Another interesting example of a convex set is the norm cone

$$\begin{aligned} C :=\left\{ \left[ \begin{array}{c} x\\ t \end{array}\right] \in \mathrm{I}\!\mathrm{R}^{n+1}: \left\| x \right\| _2 \le t\right\} \subseteq \mathrm{I}\!\mathrm{R}^{n+1}. \end{aligned}$$

Definition 4

([46, Definition 4.1.1]) Let \(f: \mathrm{I}\!\mathrm{R}^n \rightarrow \mathrm{I}\!\mathrm{R}\!\cup \!\{+\infty \}\). The function f is convex if \(\mathop {\mathrm {dom}}f\) is a convex set and

$$\begin{aligned} f \Bigl [(1 -\lambda )x +\lambda y \Bigr ] \le (1-\lambda )f(x) + \lambda f(y) \ \ \forall x, y \in \mathop {\mathrm {dom}}f , \ \lambda \in [0,1] \end{aligned}$$
(1)

The following propositions provide necessary and sufficient conditions for convexity for differentiable and twice differentiable functions.

Proposition 1

([8, 3.1.3], [46, Theorem 6.1.2]) Let \(f:\mathrm{I}\!\mathrm{R}^n \rightarrow \mathrm{I}\!\mathrm{R}\!\cup \!\{+\infty \}\) be a differentiable function. The function f is convex if and only if \(\mathop {\mathrm {dom}}f\) is convex and

$$\begin{aligned} f(y) \ge f(x) + \nabla f(x)^{T} (y-x) \quad \forall x, y \in \mathop {\mathrm {dom}}f. \end{aligned}$$
(2)

Proposition 2

([8, 3.1.4], [46, Theorem 6.3.1]) Let \(f: \mathrm{I}\!\mathrm{R}^{n} \rightarrow \mathrm{I}\!\mathrm{R}\!\cup \!\{+\infty \}\), be twice differentiable. The function f is convex if and only if \(\mathop {\mathrm {dom}}f\) is convex and \(\nabla ^{2} f(x)\) is positive semidefinite \(\forall x \in \mathop {\mathrm {dom}}f\).

Finally, let us introduce two additional, weaker classes of functions whose importance will become clear when necessary optimality conditions for unconstrained optimization problems are introduced.

Definition 5

([46, Definition 9.3.1]) Let \(f: \mathrm{I}\!\mathrm{R}^n \rightarrow \mathrm{I}\!\mathrm{R}\!\cup \!\{+\infty \}\) be a differentiable function. The function f is pseudo–convex if

$$\begin{aligned} x, y \in \mathop {\mathrm {dom}}f, \quad {\nabla f(x)}^T{(y-x)} \ge 0 \Rightarrow f(y)\ge f(x) \end{aligned}$$

A differentiable convex function is also pseudo–convex.

Definition 6

([8, 3.4.1], [46, Definition 9.1.1]) Let \(f: \mathrm{I}\!\mathrm{R}^n \rightarrow \mathrm{I}\!\mathrm{R}\!\cup \!\{+\infty \}\). The function f is quasi–convex if

$$\begin{aligned} x, y \in \mathop {\mathrm {dom}}f, \quad f \bigl ((1 -\lambda )x +\lambda y \bigr ) \le \max \bigl \{f(x), f(y) \bigr \} \mathop { \text{ for } } \lambda \in [0,1] \end{aligned}$$

Clearly, a convex function is also quasi–convex. Moreover, a differentiable pseudo-convex function is also quasi–convex. The converse is not true.

For simplicity, in the sequel, we will only consider functions whose domain is \(\mathrm{I}\!\mathrm{R}^n\).

3 Unconstrained Optimization

In this section we present optimality conditions and algorithms for the unconstrained optimization problem

$$\begin{aligned} \min _{\begin{array}{c} x \in \mathcal {F} \end{array}} f(x) \end{aligned}$$
(3)

where \(f:\mathrm{I}\!\mathrm{R}^n \rightarrow \mathrm{I}\!\mathrm{R}\) is the objective function and the feasible set \(\mathcal {F} \subseteq \mathrm{I}\!\mathrm{R}^n \) is an open subset of \(\mathrm{I}\!\mathrm{R}^n\) (note that \(\mathcal {F}\) may coincide with \(\mathrm{I}\!\mathrm{R}^n\)).

First, let us introduce the following definitions.

Definition 7

([6, Definition 3.4.1]) Let \(\bar{x} \in \mathcal {F}\). The point \(\bar{x}\) is a global minimum of Problem (3) if

$$\begin{aligned} \ f(\bar{x}) \le f(x),\ \forall x \in \mathcal {F}. \end{aligned}$$

The point \(\bar{x}\) is a strict global minimum of Problem (3) if

$$\begin{aligned} f(\bar{x}) < f(x),\ \forall x \in \mathcal {F}, \ x\ne \bar{x}. \end{aligned}$$

Definition 8

([6, Definition 3.4.1]) Let \(\bar{x} \in \mathcal {F}\). The point \(\bar{x}\) is a local minimum of Problem (3) if \(\exists \; \gamma > 0\) such that

$$\begin{aligned} f(\bar{x}) \le f(x) \quad \forall x \in \mathcal {F} \mathop { \text{ and } } \left\| x -\bar{x} \right\| \le \gamma . \end{aligned}$$

The point \(\bar{x}\) is a strict local minimum if \(\exists \ \gamma > 0\) such that

$$\begin{aligned} f(\bar{x})< f(x) \quad \forall \ x \in \mathcal {F}:\; 0 < \left\| x -\bar{x} \right\| \le \gamma . \end{aligned}$$

The point \(\bar{x}\) is a strong or isolated local minimum  if \(\exists \ \gamma > 0\) such that \(\bar{x}\) is the only local minimum in \(\{ x \in \mathrm{I}\!\mathrm{R}^n: \left\| x -\bar{x} \right\| \le \gamma \} \cap \mathcal {F} \).

If \(\bar{x}\) is a strong local minimum, then it is also a strict local minimum. On the other hand, a strict local minimum is not necessarily a strong local minimum. For example, the function [7, p. 21]

$$ f(x) = \left\{ \begin{array}{ll} x^2 \left( \sqrt{2} - \sin \left( \frac{4}{3} \pi - \sqrt{3} \ln (x^2) \right) \right) &{} \mathop { \text{ if } } x \ne 0 \\ 0 &{} \mathop {\text{ otherwise }} \end{array}\right. $$

has a strict local minimum at \(x = 0\) that is not a strong (isolated) local minimum. In the remainder of the section, without loss of generality, we assume that \(\mathcal {F} = \mathrm{I}\!\mathrm{R}^n \).
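Returning to the example above, a quick numerical check (a sketch assuming NumPy; the sampling grid is arbitrary) confirms that \(f(x) > 0 = f(0)\) for \(x \ne 0\), while the oscillating sine term keeps producing further local minima in every neighborhood of the origin.

```python
import numpy as np

def f(x):
    """The example of [7, p. 21]: f(0) = 0 and, for x != 0,
    f(x) = x^2 (sqrt(2) - sin(4*pi/3 - sqrt(3) * ln(x^2)))."""
    x = np.asarray(x, dtype=float)
    out = np.zeros_like(x)
    nz = x != 0
    out[nz] = x[nz] ** 2 * (np.sqrt(2)
              - np.sin(4 * np.pi / 3 - np.sqrt(3) * np.log(x[nz] ** 2)))
    return out

xs = np.linspace(-0.5, 0.5, 200001)
print(f(xs).min() >= 0.0)  # True: since sqrt(2) > 1 >= sin(.), f > 0 off 0,
                           # so x = 0 is a strict local minimum, while the
                           # oscillation creates local minima in every
                           # neighborhood of 0, so it is not isolated.
```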

3.1 Necessary and Sufficient Optimality Conditions

Consider the nonlinear optimization problem (3) where \(f:\mathrm{I}\!\mathrm{R}^n \rightarrow \mathrm{I}\!\mathrm{R}\) is a continuously differentiable function. The directional derivative [3, Chap. 8] of f at \(\bar{x}\) along the direction d is given by

$$\displaystyle \lim _{t \rightarrow 0^+} \frac{ f(\bar{x}+td) -f(\bar{x})}{t} = {\nabla f(\bar{x})}^T{d}. $$

A direction \(d\in \mathrm{I}\!\mathrm{R}^n\) is a descent direction for f(x) at \(\bar{x}\) if

$$\begin{aligned} \exists \ \delta >0 : f(\bar{x} + td)< f(\bar{x}) \ \ \ \forall \ 0<t \le \delta \end{aligned}$$
(4)

If f is continuously differentiable, then d is a descent direction at \(\bar{x}\) if and only if \({\nabla f(\bar{x})}^T{d} < 0\).

If \(x^*\) is a local minimum, then no descent direction exists at \(x^*\). The following theorems provide first and second order necessary optimality conditions for a point \(x^*\).

Theorem 1

Let \(f:\mathrm{I}\!\mathrm{R}^n \rightarrow ~\mathrm{I}\!\mathrm{R}\) be a continuously differentiable function and let \(x^*\in \mathrm{I}\!\mathrm{R}^n\). If \(x^*\) is a local minimum then

$$\begin{aligned} \nabla f(x^*) =0. \end{aligned}$$
(5)

The above result can be easily proved by contradiction; in fact, if \(\nabla f(x^*) \ne 0\), then \(d = -\nabla f(x^*)\) is a descent direction. This observation is the basis of the most widely used technique for function minimization: the gradient method.

Theorem 2

([35, Proposition 2.5]) Let \(f:\mathrm{I}\!\mathrm{R}^n \rightarrow \mathrm{I}\!\mathrm{R}\) be twice continuously differentiable and let \(x^*\in \mathrm{I}\!\mathrm{R}^n\). If \(x^*\) is a local minimum then

$$\begin{aligned} \nabla f(x^*) =0 \end{aligned}$$
(6)
$$\begin{aligned} \nabla ^2 f(x^*) \mathop { \text{ is } \text{ positive } \text{ semidefinite. }} \end{aligned}$$
(7)

The above conditions cannot simply be reversed to obtain sufficient optimality conditions. In fact, positive definiteness of the Hessian matrix is required to obtain a strict local minimum.

Theorem 3

([35, Proposition 2.6]) Let \(f: \mathrm{I}\!\mathrm{R}^n \rightarrow \mathrm{I}\!\mathrm{R}\) be twice continuously differentiable and let \(x^*\in \mathrm{I}\!\mathrm{R}^n\). Suppose that:

$$\begin{aligned} \nabla f(x^*) =0 \end{aligned}$$
(8)
$$\begin{aligned} \nabla ^2 f(x^*) \mathop { \text{ is } \text{ positive } \text{ definite. }} \end{aligned}$$
(9)

Then \(x^*\) is a strict local minimum.

Conditions for global optima are much more complex to obtain. However, if \(f:\mathrm{I}\!\mathrm{R}^n \rightarrow \mathrm{I}\!\mathrm{R}\) is a differentiable pseudo–convex function, then a local minimum is also a global minimum [46, Theorem 9.3.7].

3.2 Algorithms for Unconstrained Optimization

A very general minimization algorithm is presented in Algorithm 1 where

$$\begin{aligned} \Omega :=\left\{ x \in \mathrm{I}\!\mathrm{R}^n : \nabla f(x) = 0\right\} \end{aligned}$$
(10)

is the set of stationary points. The algorithm, starting from the initial point \(x^0\), constructs either a finite sequence terminating in a stationary point or an infinite sequence that, under suitable conditions, converges to a stationary point or at least has an accumulation point that is a stationary point.

[Algorithm 1: general minimization algorithm]

Given the current point \(x^k\) and a search direction \(d^k\), the new point \(x^{k+1}\) is obtained by moving along \(d^k\) with a stepsize \(\alpha _k\).

The following general theorem establishes conditions on the direction \(d^k\) and the stepsize \(\alpha _k\) ensuring that, in the case an infinite sequence is obtained by Algorithm 1, the sequence has at least one accumulation point and each accumulation point is a stationary point. Before stating the theorem, it is necessary to introduce the concept of forcing function.

Definition 9

([50, Definition 14.2.1]) A function \(\sigma : \mathrm{I}\!\mathrm{R}_+ \rightarrow \mathrm{I}\!\mathrm{R}_+\) is a forcing function if

$$ \left. \begin{array}{c} \forall \mathop { \text{ sequence } } \{t_k\} \subseteq \mathrm{I}\!\mathrm{R}_+ \\ \\ \displaystyle \lim _k \sigma (t_k) = 0 \end{array}\right\} \implies \lim _k t_k = 0. $$

Note that any non–decreasing function \(\sigma : \mathrm{I}\!\mathrm{R}_+ \rightarrow \mathrm{I}\!\mathrm{R}_+\) such that \(\sigma (0) = 0\) and \(\sigma (t) > 0\) when \(t >0\) is a forcing function. The functions

$$\begin{aligned} \sigma (t) = t,\quad \quad \sigma (t) = ct^q \quad \text{ with } c>0,\ q>0 \end{aligned}$$

are examples of forcing functions.

Theorem 4

Let \(\{x^k\}\) be obtained by Algorithm 1 and assume that

  1. (i)

    the level set

    $$ \mathcal {L}_0 :=\left\{ x \in \mathrm{I}\!\mathrm{R}^n: f(x) \le f(x^0) \right\} $$

    is compact,

  2. (ii)

    \(d^k \ne 0\) when \(\nabla f(x^{k}) \ne 0\),

  3. (iii)

    \(f(x^{k+1}) \le f(x^{k})\),

  4. (iv)

    if \(\nabla f(x^{k}) \ne 0\) for all k then

    $$\begin{aligned} \lim _k \frac{{\nabla f(x^{k})}^T{d^k}}{\left\| d^k \right\| } = 0, \end{aligned}$$
  5. (v)

    when \(d^k \ne 0\)

    $$\begin{aligned} \frac{|{\nabla f(x^{k})}^T{d^k}|}{\left\| d^k \right\| } \ge \sigma \left( \left\| \nabla f(x^{k}) \right\| \right) \end{aligned}$$

    for some forcing function \(\sigma \).

Then, either the algorithm terminates after a finite number of iterations in a stationary point or an infinite sequence \(\{x^k\}\) is generated that satisfies the following properties:

  1. (a)

    the sequence \(\{x^k\}\) remains in \(\mathcal {L}_0\) and has at least one accumulation point;

  2. (b)

    each accumulation point of \(\{x^k\}\) belongs to \(\mathcal {L}_0\);

  3. (c)

    the sequence of real numbers \(\left\{ f(x^{k}) \right\} \) converges;

  4. (d)

    \(\displaystyle \lim _k \left\| \nabla f(x^{k}) \right\| = 0\);

  5. (e)

    each accumulation point \(x^*\) of the sequence \(\{x^k\}\) is a stationary point.

With reference to Theorem 4,

  • Conditions (iii) and (v) can be guaranteed by choosing a suitable descent direction \(d^k\) and a line–search along this direction. In particular, using the Euclidean norm and choosing \(\sigma (t) = ct\) (with \(c >0 \)), Condition (v) can be written as

    $$\begin{aligned} {\nabla f(x^{k})}^T{d^k} \le -c \left\| d^k \right\| _2 \left\| \nabla f(x^{k}) \right\| _2, \end{aligned}$$

    that is, when \(x^k\) is not a stationary point, a direction \(d^k\) must be chosen such that

    $$\begin{aligned} \frac{{\nabla f(x^{k})}^T{d^k}}{\left\| d^k \right\| _2 \left\| \nabla f(x^{k}) \right\| _2} \le -c. \end{aligned}$$

    Geometrically, the above condition requires that the cosine of the angle between the direction \(d^k\) and the direction \(-\nabla f(x^k)\) (the “antigradient”) be bounded below by a positive constant independent of k. This implies that \(d^k\) cannot become orthogonal to the gradient direction as k goes to infinity. If \(\sigma (t) = c t^q\) (with \(c >0\)), Condition (v) becomes, instead,

    $$\begin{aligned} \frac{{\nabla f(x^{k})}^T{d^k}}{\left\| d^k \right\| } \le -c \left\| \nabla f(x^{k}) \right\| ^q. \end{aligned}$$
  • Condition (iv) can be guaranteed using specific safeguard rules on the line–search along the direction \(d^k\).

From the above considerations, the importance of inexact line–search procedures for determining the stepsize \(\alpha _k\) is clear.

A simple, but also very effective, line–search procedure is the Armijo backtracking method [4, 7, p. 29]. Given \(x^k\) and a descent search direction \(d^k\), the algorithm starts with a fixed large value of \(\alpha \) and decreases it (that is, the new value of \(\alpha \) is obtained by multiplying the current value by a constant \(\delta < 1\)) until the stopping criterion

$$\begin{aligned} f(x^k + \alpha d^k) \le f(x^k) + \gamma \alpha {\nabla f(x^k)}^T{d^k} \end{aligned}$$
(11)

is satisfied, where \(\gamma \in (0, \frac{1}{2})\). In other words, the Armijo procedure chooses as \(\alpha _k\) the maximum value of \(\alpha \) in the set

$$\begin{aligned} S = \left\{ \alpha : \alpha = \delta ^l \alpha _{\text {init}}, \; l = 0, 1, \ldots \right\} \end{aligned}$$

for which Condition (11) holds.
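A minimal sketch of the Armijo backtracking rule just described, assuming NumPy and callables for f and its gradient; the parameters alpha_init, delta, and gamma mirror \(\alpha _{\text {init}}\), \(\delta \), and \(\gamma \) in (11).

```python
import numpy as np

def armijo(f, grad_f, x, d, alpha_init=1.0, delta=0.5, gamma=1e-4):
    """Backtracking line search: shrink alpha by the factor delta until
    the sufficient-decrease condition (11) holds.
    Assumes d is a descent direction, i.e. grad_f(x) @ d < 0."""
    fx = f(x)
    slope = grad_f(x) @ d          # directional derivative, negative by assumption
    alpha = alpha_init
    while f(x + alpha * d) > fx + gamma * alpha * slope:
        alpha *= delta             # try the next element of the set S
    return alpha
```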

The following proposition demonstrates that, if the search direction \(d^k\) is suitably chosen and the stepsize is obtained according to the Armijo rule, then conditions (iii) and (iv) in Theorem 4 are automatically satisfied.

Proposition 3

Let \(f: \mathrm{I}\!\mathrm{R}^n \rightarrow \mathrm{I}\!\mathrm{R}\) be continuously differentiable. Suppose that:

  1. (i)

    \(\mathcal {L}_0\) is a compact set;

  2. (ii)

    \({\nabla f\left( x^{k}\right) }^T{d^k} < 0 \; \forall k\);

  3. (iii)

    there exists a forcing function \(\sigma : \mathrm{I}\!\mathrm{R}_+ \rightarrow \mathrm{I}\!\mathrm{R}_+\) such that

    $$\begin{aligned} \left\| d^k \right\| \ge \sigma \left( \frac{|{\nabla f\left( x^{k}\right) }^T{d^k}|}{\left\| d^k \right\| }\right) . \end{aligned}$$

Then, if \(\alpha _k\) is chosen according to the Armijo procedure, and \(x^{k+1} = x^k + \alpha _k d^k\)

  1. (a)

    \(f\left( x^{k+1}\right) < f\left( x^{k}\right) \);

  2. (b)

    \(\displaystyle \lim _k \frac{|{\nabla f\left( x^{k}\right) }^T{d^k}|}{\left\| d^k \right\| } = 0. \)

A different line–search stopping criterion was proposed by Goldstein [31], considering the two lines

$$ y = f(x^{k})+ \gamma _{1} \alpha {\nabla f(x^{k})}^T{d^{k}},$$
$$ y = f(x^{k})+ \gamma _{2} \alpha {\nabla f(x^{k})}^T{d^{k}} $$

where \(0< \gamma _{1}< \gamma _{2}< \frac{1}{2}\). The chosen stepsize \(\alpha _{k} > 0\) must satisfy, instead of (11), the following conditions:

$$\begin{aligned} f(x^{k}+ \alpha _{k}d^{k})\le f(x^{k})+ \gamma _{1} \alpha _{k} {\nabla f(x^{k})}^T{d^{k}}, \end{aligned}$$
(12a)
$$\begin{aligned} f(x^{k}+ \alpha _{k}d^{k})\ge f(x^{k})+ \gamma _{2} \alpha _{k} {\nabla f(x^{k})}^T{d^{k}}. \end{aligned}$$
(12b)

From a geometrical point of view, this corresponds to choosing a value of \(\alpha _{k}\) for which \(f(x^k + \alpha _k d^k)\) lies between the two straight lines with slopes \(\gamma _{1} {\nabla f(x^{k})}^T{d^{k}}\) and \(\gamma _{2}{\nabla f(x^{k})}^T{d^{k}}\) passing through the point \((0,f(x^{k}))\).

Moreover, let \(\gamma _{1}\) and \(\gamma _{2}\) be such that \(0< \gamma _{1}< \gamma _{2} < 1\). The step length \(\alpha _{k}\) satisfies the Wolfe conditions [62] if the following conditions hold:

$$\begin{aligned} f(x^{k} + \alpha _{k} d^{k} ) \le f(x^{k}) +\gamma _{1}\alpha _{k} { \nabla f(x^{k})}^T{d^{k}} , \end{aligned}$$
(13a)
$$\begin{aligned} {\nabla f(x^{k} + \alpha _{k} d^{k} )}^T{d^k} \ge \gamma _{2}{\nabla f(x^{k})}^T{d^{k}}. \end{aligned}$$
(13b)

It is possible to show that, when \({d^k}\) is a suitably chosen descent direction, results similar to those in Proposition 3 can be obtained with both the Goldstein and the Wolfe line–search methods.
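Checking whether a candidate stepsize satisfies the Wolfe conditions (13a)–(13b) is straightforward; the sketch below (assuming NumPy) does exactly that, while finding such a stepsize in practice requires a bracketing procedure not shown here.

```python
import numpy as np

def satisfies_wolfe(f, grad_f, x, d, alpha, gamma1=1e-4, gamma2=0.9):
    """Return True if alpha satisfies the Wolfe conditions (13a)-(13b),
    with 0 < gamma1 < gamma2 < 1."""
    slope = grad_f(x) @ d                                          # < 0 for a descent d
    sufficient_decrease = f(x + alpha * d) <= f(x) + gamma1 * alpha * slope  # (13a)
    curvature = grad_f(x + alpha * d) @ d >= gamma2 * slope                  # (13b)
    return sufficient_decrease and curvature
```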

Various other line–search procedures have been proposed in the literature, including procedures that do not use information on the gradient of the function [15, 44] and/or non–monotone procedures [33, 34, 63], for which monotonicity of the objective function values along the generated sequence is not required.

3.3 The Gradient Method

In the gradient method (also known as the steepest descent method) the search direction is given by

$$\begin{aligned} d^k = -\nabla f(x^k). \end{aligned}$$
(14)

For this choice of the search direction, both Condition (iii) and Condition (v) in Theorem 4 are trivially satisfied with \(\sigma (t) = t\) and, therefore, the method is globally convergent when one of the line–search procedures outlined in the previous subsection is applied. However, the gradient method, even when applied to the minimization of a strictly convex quadratic function

$$\begin{aligned} \min _x \frac{1}{2}{x}^T{Mx} + {q}^T{x} \end{aligned}$$

(\(M \in \mathrm{I}\!\mathrm{R}^{n \times n}\) positive definite, and \(q \in \mathrm{I}\!\mathrm{R}^n\)) with exact line–search, exhibits only a linear rate of convergence. The theorem below shows that the gradient method converges at least linearly.

Theorem 5

([45, Sect. 7.6, p. 218]) Let \(M\in \mathrm{I}\!\mathrm{R}^{n \times n}\) be a symmetric positive definite matrix and let \(0 < \lambda _1 \le \lambda _2 \le \ldots \le \lambda _n\) be the eigenvalues of M, let \(q \in \mathrm{I}\!\mathrm{R}^n\) and let \(f(x) :=\frac{1}{2}{x}^T{Mx} + {q}^T{x}\). Let \(\{x^k\}\) be obtained by the gradient method with exact line–search:

$$\begin{aligned} x^{k+1} = x^k - \frac{{\left( M x^k + q\right) }^T{\left( M x^k + q\right) }}{{\left( M x^k + q\right) }^T{M\left( M x^k + q\right) }} \left( M x^k + q \right) . \end{aligned}$$

Then

$$\begin{aligned} \left\| x^{k+1} - x^* \right\| _M \le \frac{\lambda _n-\lambda _1}{\lambda _n+\lambda _1}\left\| x^{k} - x^* \right\| _M \end{aligned}$$
(15a)

and

$$\begin{aligned} \left\| x^{k+1} - x^* \right\| _2 \le \left( \frac{\lambda _n}{\lambda _1}\right) ^{\frac{1}{2}} \frac{\lambda _n-\lambda _1}{\lambda _n+\lambda _1}\left\| x^{k} - x^* \right\| _2 \end{aligned}$$
(15b)

where \(x^*= -M^{-1}q\) is the unique minimum point.

Moreover, it is not difficult to construct examples (see [8, Example 9.3.2, p. 469]) of quadratic problems where the inequalities (15) above are satisfied as equalities, demonstrating that the gradient method has, in fact, exactly a linear rate of convergence.
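The exact–stepsize iteration of Theorem 5 is easy to reproduce numerically. The following sketch (assuming NumPy; the problem data are chosen only for illustration) runs it on an ill-conditioned quadratic, where the error contracts roughly by the factor \((\lambda _n-\lambda _1)/(\lambda _n+\lambda _1)\) per iteration.

```python
import numpy as np

def gradient_method_quadratic(M, q, x0, iters=50):
    """Steepest descent with exact line search for
    f(x) = 0.5 x^T M x + q^T x (M symmetric positive definite)."""
    x = x0.astype(float)
    for _ in range(iters):
        g = M @ x + q                        # gradient of f at x
        x = x - (g @ g) / (g @ (M @ g)) * g  # exact stepsize, as in Theorem 5
    return x

M = np.diag([1.0, 10.0])                     # lambda_1 = 1, lambda_n = 10
q = np.array([1.0, 1.0])
x_star = -np.linalg.solve(M, q)              # unique minimum point
x = gradient_method_quadratic(M, q, np.zeros(2))
print(np.linalg.norm(x - x_star))            # contraction factor ~ 9/11 per step
```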

3.4 Newton's Method

As stated before, the gradient method exhibits a slow rate of convergence. Here we consider a method that, instead, has a quadratic rate of convergence at the expense of a higher cost per iteration: in addition to the gradient, the Hessian must also be computed at each point. Moreover, only local convergence can be established, that is, convergence is guaranteed only if the initial point \(x^0\) is chosen sufficiently close to the stationary point \(x^*\).

Let \(f: \mathrm{I}\!\mathrm{R}^n \rightarrow \mathrm{I}\!\mathrm{R}\) be twice continuously differentiable. From Taylor’s series expansion we have that

$$\begin{aligned} f(x + s) = f(x) + {\nabla f(x)}^T{s} + \frac{1}{2}{s}^T{\nabla ^2 f(x)s} + \beta (x,s) \end{aligned}$$

where

$$\begin{aligned} \lim _{s\rightarrow 0}\frac{\beta (x,s)}{\left\| s \right\| ^2} = 0. \end{aligned}$$

In Newton’s method, at each iteration a quadratic approximation of the function f around the current point \(x^k\) is constructed:

$$\begin{aligned} q_k(s) :=f(x^k) + {\nabla f(x^k)}^T{s} + \frac{1}{2}{s}^T{\nabla ^2 f(x^k)s}, \end{aligned}$$

and the new point \(x^{k+1}\) is obtained as \(x^{k+1} = x^k + s^k\) where

$$\begin{aligned} s^k \in \mathop {\text{ argmin }}_s q_k(s). \end{aligned}$$

The following theorem provides conditions for local convergence of Newton's method for minimization problems.

Theorem 6

([25, Theorem 3.1.1], [35, Proposition 7.2]) Let \(f: \mathrm{I}\!\mathrm{R}^n \rightarrow \mathrm{I}\!\mathrm{R}\) be a twice continuously differentiable function, and assume that

  1. (a)

    there exists \(x^*\in \mathrm{I}\!\mathrm{R}^n\): \(\nabla f(x^*) = 0\),

  2. (b)

    \(\nabla ^2 f(x^*)\) is nonsingular,

  3. (c)

    \(\nabla ^2 f(x)\) is a Lipschitz–continuous function, i.e.,

    $$\begin{aligned} \exists \; L >0: \forall x, y \in \mathrm{I}\!\mathrm{R}^n \quad \quad \left\| \nabla ^2 f(x)-\nabla ^2 f(y) \right\| \le L \left\| x-y \right\| . \end{aligned}$$

Then, there exists \(B(x^*, \rho ) :=\left\{ x \in \mathrm{I}\!\mathrm{R}^n: \left\| x-x^* \right\| \le \rho \right\} \) with \(\rho >0\) such that, if \(x^0 \in B(x^*, \rho )\),

  1. (i)

    the Newton iterate is well defined,

  2. (ii)

    the sequence \(\left\{ x^k \right\} \) remains in \(B(x^*, \rho )\),

  3. (iii)

    the sequence \(\left\{ x^k \right\} \) converges to \(x^*\) with at least a quadratic rate of convergence

    $$\begin{aligned} \left\| x^{k+1} - x^* \right\| \le \alpha \left\| x^k-x^* \right\| ^2, \mathop { \text{ for } \text{ some } } \alpha > 0. \end{aligned}$$

3.5 The Conjugate Gradient Method

The slow convergence of the gradient method and the heavy requirements of the fast-converging Newton's method, where, at each iteration, it is necessary to calculate the Hessian matrix, set the stage for new classes of methods that only require computing the gradient of the function to be minimized and, at the same time, exhibit a q–superlinear rate of convergence [50]. In this section the important class of conjugate gradient methods is introduced, while the next section is devoted to Quasi-Newton methods.

Definition 10

Let \(M \in \mathrm{I}\!\mathrm{R}^{n \times n}\) be a symmetric positive definite matrix. The n nonzero vectors \(d^0, d^1, \ldots , d^{n-1}\) in \(\mathrm{I}\!\mathrm{R}^n\) are conjugate directions with respect to the matrix M if

$$\begin{aligned} {d^i}^T{M d^j} = 0 \quad \quad \forall \; i, j = 0, 1, \ldots , n-1, i \ne j. \end{aligned}$$
(16)

Consider the quadratic function

$$\begin{aligned} f(x) = \frac{1}{2}{x}^T{Mx} + {q}^T{x} \end{aligned}$$
(17)

where \(M \in \mathrm{I}\!\mathrm{R}^{n \times n}\) is a symmetric positive definite matrix and \(q \in \mathrm{I}\!\mathrm{R}^n\); the following algorithm utilizes n conjugate directions to determine the unique point minimizing f(x).

[Algorithm 2: conjugate directions method]

One important result for this algorithm is that, for strictly convex quadratic problems, convergence is achieved in a finite number of iterations when conjugate directions are utilized as search directions.

Theorem 7

Consider the quadratic function \(f(x) = \frac{1}{2}{x}^T{Mx} + {q}^T{x} \) where \(M \in \mathrm{I}\!\mathrm{R}^{n \times n}\) is a symmetric positive definite matrix and \(q \in \mathrm{I}\!\mathrm{R}^n\). Starting from any \(x^0 \in \mathrm{I}\!\mathrm{R}^n\), apply Algorithm 2. Then \(x^{k+1}\) is the minimizer of f(x) over the affine set

$$\begin{aligned} S_k :=\left\{ x \in \mathrm{I}\!\mathrm{R}^n: x = x^0 + \sum _{j=0}^k \lambda _j d^j, \; \lambda _j \in \mathrm{I}\!\mathrm{R}\right\} \end{aligned}$$

for \(k = 0, 1, \ldots , n-1\). Moreover, \(x^n = -M^{-1}q\) is the unique minimizer of f(x) over \(\mathrm{I}\!\mathrm{R}^n\).

The above Algorithm 2 allows different choices for the set of conjugate directions. An interesting choice is to link these conjugate directions to the gradient of the function f(x). Algorithm 3 explores this possibility.

[Algorithm 3: conjugate gradient algorithm]

Theorem 8

([25, Theorem 4.1.1]) Consider the quadratic function \(f(x) = \frac{1}{2}{x}^T{Mx} + {q}^T{x} \) where \(M \in \mathrm{I}\!\mathrm{R}^{n \times n}\) is a symmetric positive definite matrix, and \(q \in \mathrm{I}\!\mathrm{R}^n\). The Conjugate Gradient Algorithm 3 terminates after \(m \le n\) iterations and for all \(i = 0, \ldots , m\)

$$\begin{aligned} {d^i}^T{M d^j}= 0, \quad \quad j = 0, \ldots , i-1, \end{aligned}$$
(18a)
$$\begin{aligned} {\nabla f(x^i)}^T{\nabla f(x^j)}= 0, \quad \quad j = 0, \ldots , i-1, \end{aligned}$$
(18b)
$$\begin{aligned} {\nabla f(x^i)}^T{d^i} = -{\nabla f(x^i)}^T{\nabla f(x^i)}. \end{aligned}$$
(18c)

The theorem above shows that the directions generated by Algorithm 3 are conjugate directions with respect to the matrix M and, therefore, finite termination of the algorithm follows from Theorem 7.

It is important to note that

$$\begin{aligned} \alpha _k = -\frac{{\nabla f(x^k)}^T{d^k}}{{d^k}^T{Md^k}} = \frac{{\nabla f(x^k)}^T{\nabla f(x^{k})}}{{d^k}^T{Md^k}} > 0. \end{aligned}$$
(19)

Moreover, from \(x^k = x^{k-1} + \alpha _{k-1} d^{k-1}\) it follows that

$$\begin{aligned} \alpha _{k-1} M d^{k-1} = Mx^k - Mx^{k-1} = \nabla f(x^k) - \nabla f(x^{k-1}) \end{aligned}$$

and hence

$$ {\nabla f(x^k)}^T{Md^{k-1}} = \frac{1}{\alpha _{k-1}}{\nabla f(x^k)}^T{\left( \nabla f(x^k) - \nabla f(x^{k-1})\right) } = \frac{1}{\alpha _{k-1}}{\nabla f(x^k)}^T{\nabla f(x^k)}.$$

Now, from (19), \(\alpha _{k-1} {d^{k-1}}^T{Md^{k-1}} = {\nabla f(x^{k-1})}^T{\nabla f(x^{k-1})}\), and hence

$$\begin{aligned} \beta _{k-1} = \frac{{\nabla f(x^k)}^T{Md^{k-1}}}{{d^{k-1}}^T{Md^{k-1}}} = \frac{1}{\alpha _{k-1}}\, \frac{{\nabla f(x^k)}^T{\nabla f(x^k)}}{{d^{k-1}}^T{Md^{k-1}}} = \frac{{\nabla f(x^k)}^T{\nabla f(x^k)}}{{\nabla f(x^{k-1})}^T{\nabla f(x^{k-1})}} =: \beta ^\text {FR}_{k-1} \end{aligned}$$
(20)

This formula was first proposed by Fletcher and Reeves [27]. Alternative choices for \(\beta _{k-1}\), all equivalent in the case of strictly convex quadratic problems, are

$$\begin{aligned} \beta ^\text {PRP}_{k-1} = \frac{{\nabla f(x^{k})}^T{\Bigl [\nabla f(x^{k})- \nabla f(x^{k-1})\Bigr ]}}{{\nabla f(x^{k-1})}^T{\nabla f(x^{k-1})}} . \end{aligned}$$
(21a)
$$\begin{aligned} \beta ^\text {HS}_{k-1} =- \frac{{\nabla f(x^{k})}^T{\Bigl [\nabla f(x^{k})- \nabla f(x^{k-1})\Bigr ]}}{{d^{k-1}}^T{\Bigl [\nabla f(x^{k})- \nabla f(x^{k-1})\Bigr ]}} \end{aligned}$$
(21b)

proposed, respectively, by Polak and Ribière [52] and Polyak [53], and by Hestenes and Stiefel [39].
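A compact sketch (assuming NumPy) of the conjugate gradient iteration of Algorithm 3 for the quadratic (17), using the stepsize (19) and the Fletcher–Reeves coefficient (20); by Theorem 8 it terminates in at most n iterations in exact arithmetic.

```python
import numpy as np

def conjugate_gradient(M, q, x0, tol=1e-12):
    """Minimize f(x) = 0.5 x^T M x + q^T x (M symmetric positive
    definite): d^0 = -grad f(x^0), alpha_k from (19), beta_k from
    the Fletcher-Reeves formula (20)."""
    x = x0.astype(float)
    g = M @ x + q                             # gradient of f
    d = -g
    for _ in range(len(q)):
        if np.linalg.norm(g) <= tol:
            break
        alpha = (g @ g) / (d @ (M @ d))       # exact stepsize, Eq. (19)
        x = x + alpha * d
        g_new = M @ x + q
        beta = (g_new @ g_new) / (g @ g)      # beta^FR, Eq. (20)
        d = -g_new + beta * d
        g = g_new
    return x

M = np.array([[4.0, 1.0], [1.0, 3.0]])
q = np.array([-1.0, -2.0])
print(conjugate_gradient(M, q, np.zeros(2)))  # equals -inv(M) @ q in 2 steps
```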

For general non–quadratic functions, there is no guarantee that the conjugate gradient method will terminate in a finite number of steps. Moreover, it is extremely difficult to solve the one–dimensional subproblem exactly, and inexact line–search methods such as the Armijo, Goldstein, or Wolfe methods must be utilized. Moreover, every n iterations, or whenever a non–descent direction is generated, a restart step should be performed using the negative gradient as search direction. Computational results, however, show that the use of a restarting procedure is not convenient, and it is better to suitably modify the formula for \(\beta _{k-1}\) and to choose a specific line-search procedure to globalize the overall scheme.

The first global convergence result for the Fletcher-Reeves method with inexact line search was given by Al-Baali [2]. Under strong Wolfe conditions, he demonstrated that the method generates sufficient descent directions and, therefore, global convergence can be established.

For the Polak-Ribière-Polyak method, convergence for general nonlinear functions is not guaranteed. While global convergence can be proved in the case of strongly convex functions, there are examples of non–strongly convex functions for which the method may not converge, even with an exact line search. Instead, convergence can be proved when

$$ \beta ^\text {PRP+}_{k-1} = \max \{\beta ^\text {PRP}_{k-1},0\} $$

is utilized [29]. Additional information on conjugate gradient method for general nonlinear functions can be found in [35, 38, 54].

3.6 Quasi-Newton Methods

As already noted, Newton's method is locally convergent at a quadratic rate and requires, at each iteration, the computation of the Hessian. On the other hand, the gradient method, for which global convergence can be established, only requires the gradient of the function; however, its convergence can be quite slow (no faster than linear, in general).

The aim of Quasi-Newton methods is to define procedures that

  • do not require the computation of the Hessian,

  • exhibit a q–superlinear rate of convergence.

Two classes of Quasi–Newton methods have been studied in the literature:

  • direct methods for which

    $$ \left\{ \begin{array}{l} x^{k+1} = x^k - \left[ B^k \right] ^{-1} \nabla f(x^k),\\ \\ B^{k+1} = B^k + \Delta B^k, \qquad\qquad\quad B^{k+1} \approx \nabla ^2 f(x^{k+1}); \end{array}\right. $$
  • inverse methods where

    $$ \left\{ \begin{array}{l} x^{k+1} = x^k - H^k \nabla f(x^k),\\ \\ H^{k+1} = H^k + \Delta H^k, \qquad\qquad\quad H^{k+1} \approx \left[ \nabla ^2 f(x^{k+1})\right] ^{-1}. \end{array}\right. $$

In order to derive updating formulas for \(B^k\) and \(H^k\), consider again the quadratic function

$$\begin{aligned} f(x) = \frac{1}{2}{x}^T{Mx} + {q}^T{x} \end{aligned}$$

where \(M \in \mathrm{I}\!\mathrm{R}^{n \times n}\) is a symmetric positive definite matrix, and \(q \in \mathrm{I}\!\mathrm{R}^n\). Let now \(x, y \in \mathrm{I}\!\mathrm{R}^n\), then

$$\begin{aligned} \nabla f(y) = \nabla f(x) + M (y-x) \end{aligned}$$

or equivalently

$$\begin{aligned} M^{-1} \left( \nabla f(y) - \nabla f(x) \right) = y - x. \end{aligned}$$

Therefore, since \(B^{k+1}\) (resp. \(H^{k+1}\)) must be an approximation of M (resp. \(M^{-1}\)), it is reasonable to require that

$$\begin{aligned} \nabla f(x^{k+1}) - \nabla f(x^k) = B^{k+1} \left( x^{k+1}-x^k\right) \end{aligned}$$
(22a)

or

$$\begin{aligned} H^{k+1}\left( \nabla f(x^{k+1}) - \nabla f(x^k)\right) =x^{k+1}-x^k. \end{aligned}$$
(22b)

After defining \(\gamma ^k :=\nabla f(x^{k+1}) - \nabla f(x^k)\) and \(\delta ^k :=x^{k+1}-x^k\) the above conditions become

$$\begin{aligned} \gamma ^k = B^{k+1} \delta ^k \end{aligned}$$
(23a)

and

$$\begin{aligned} H^{k+1} \gamma ^k = \delta ^k. \end{aligned}$$
(23b)

These conditions are known as “secant conditions”. Furthermore, it is reasonable to require that \(B^{k+1}\) (resp. \(H^{k+1}\)) be symmetric and as close as possible to \(B^{k}\) (resp. \(H^{k}\)), imposing that it differ from \(B^{k}\) (resp. \(H^{k}\)) by a matrix of rank 1 or 2, together with a minimality condition in some norm, for example the Frobenius norm [18]. The generic direct Quasi–Newton method is reported in Algorithm 4.

[Algorithm 4: generic direct Quasi–Newton method]

A similar algorithm can easily be constructed for a generic inverse Quasi–Newton method.

In the sequel, unless strictly necessary we will drop the superscripts k and \(k+1\); the current matrix will be indicated simply by B (resp. H) while the updated matrix will be indicated by \(\bar{B} \) (resp. \(\bar{H} \)). A similar rule will be used for \(\gamma ^k\) and \(\delta ^k\).

The simplest direct updating formula for B is the rank-1 updating formula

$$\begin{aligned} \bar{B} = B + \rho u v^T \end{aligned}$$
(24)

where \(\rho \in \mathrm{I}\!\mathrm{R}\) and \(u, v \in \mathrm{I}\!\mathrm{R}^n\), \(u, v \ne 0\).

For the symmetric case (i.e., \(u=v\)), we have

$$\begin{aligned} \bar{B} = B + \rho u u^T \end{aligned}$$
(25)

and, imposing the secant condition \(\bar{B} \delta = \gamma \), either \(\bar{B} = B\) (which happens if \(B \delta = \gamma \) already) or, under the hypothesis that \({\left( \gamma - B \delta \right) }^T{\delta } \ne 0\), the following formula, known as the Symmetric Rank-1 (SR1) updating formula, is obtained

$$\begin{aligned} \bar{B} = B + \frac{\left( \gamma - B \delta \right) \left( \gamma - B \delta \right) ^T}{{\left( \gamma - B \delta \right) }^T{\delta } }. \end{aligned}$$
(26)

This formula was first proposed by Davidon [14] and later rediscovered by Broyden [10].

Using \(H = B^{-1}\) and the Sherman–Morrison–Woodbury [32, 37] formula, it is possible to derive the inverse Symmetric Rank–1 updating scheme:

$$\begin{aligned} \bar{H} = H + \frac{{\left( \delta - H \gamma \right) }{\left( \delta - H \gamma \right) }^T}{{\left( \delta - H \gamma \right) }^T{\gamma }}. \end{aligned}$$
(27)

Note that in this case the secant condition completely determines the updating formula under the hypothesis that \({\left( \gamma - B \delta \right) }^T{\delta } \ne 0\;\) (resp. \({\left( H \gamma - \delta \right) }^T{\gamma } \ne 0\)). However, there is no guarantee that the resulting search direction is a descent direction.
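The SR1 updates (26) and (27) translate into one-line rank-1 corrections. The sketch below (assuming NumPy) also skips the update when the denominator is dangerously small, a standard practical safeguard that is not part of the formulas themselves.

```python
import numpy as np

def sr1_update(B, delta, gamma, eps=1e-8):
    """Symmetric Rank-1 update, Eq. (26): returns B_bar with
    B_bar @ delta = gamma; skips the update when the denominator
    (gamma - B delta)^T delta is too small."""
    r = gamma - B @ delta
    denom = r @ delta
    if abs(denom) <= eps * np.linalg.norm(r) * np.linalg.norm(delta):
        return B                     # secant already holds, or update unsafe
    return B + np.outer(r, r) / denom

def sr1_inverse_update(H, delta, gamma, eps=1e-8):
    """Inverse SR1 update, Eq. (27): returns H_bar with H_bar @ gamma = delta."""
    r = delta - H @ gamma
    denom = r @ gamma
    if abs(denom) <= eps * np.linalg.norm(r) * np.linalg.norm(gamma):
        return H
    return H + np.outer(r, r) / denom
```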

The symmetric rank–1 updating formula (27) has a very interesting behaviour when applied to a quadratic function.

Theorem 9

([17, Theorem 7.1]) Let \(A \in \mathrm{I}\!\mathrm{R}^{n\times n}\) be symmetric and nonsingular. Let \(\{s^0, s^1, \ldots , s^m\}\) be vectors spanning \(\mathrm{I}\!\mathrm{R}^n\) and let \(y^k = A s^k, k = 0, \ldots , m\). Again, let \(H^0 \) be a symmetric \(n\times n\) real matrix and for \(k = 0, \ldots , {m}\) let

$$\begin{aligned} H^{k+1} = H^k + \frac{{\left( s^k - H^k y^k\right) }{\left( s^k - H^k y^k \right) }^T}{{\left( s^k - H^k y^k\right) }^T{y^k}} \end{aligned}$$

assuming that \({\left( s^k - H^k y^k\right) }^T{y^k} \ne 0\). Then

$$\begin{aligned} H^{m+1} = A^{-1}. \end{aligned}$$

The theorem above shows that, given a nonsingular matrix A, the inverse \(A^{-1}\) can be calculated using the SR1 formula, which only involves matrix–vector multiplications.

For general nonlinear functions, under specific assumptions, including that

$$\begin{aligned} \left| {\left( \gamma ^k - B^k \delta ^k\right) }^T{\delta ^k} \right| \ge c \left\| \gamma ^k - B^k \delta ^k \right\| _2 \left\| \delta ^k \right\| _2 \end{aligned}$$

and that the sequence \(\{x^k\}\) has at least one accumulation point \(x^*\), it holds [13, Theorem 2] that

$$\begin{aligned} \lim _k \left\| B^k - \nabla ^2 f(x^*) \right\| = 0. \end{aligned}$$

In the non–symmetric case (mostly utilized for solving systems of nonlinear equations), the updating formula requires choosing both vectors u and v, and the secant condition does not uniquely define them. Therefore, the new matrix \(\bar{B} \) is required to be as close as possible to the current matrix B. The following theorem shows that a rank-1 updating formula is obtained when the closest matrix (in Frobenius norm) to the current matrix B is calculated among all matrices for which the secant condition is satisfied.

Theorem 10

([17, Theorem 4.1], [18, Lemma 8.1.1]) Let \(B \in \mathrm{I}\!\mathrm{R}^{n \times n}\) and let \(\gamma , \delta \in \mathrm{I}\!\mathrm{R}^n\). Then, the solution of the minimization problem

$$\begin{aligned} \begin{array}{ll} \displaystyle \mathop {\text{ min }}_{A} & \left\| A - B \right\| _{\text {F}} \\ \mathop {\text{ subject } \text{ to }} & \gamma = A \delta \end{array} \end{aligned}$$
(28)

is given by

$$\begin{aligned} \bar{B} = B + \frac{\left( \gamma - B \delta \right) \delta ^T}{{\delta }^T{\delta }}.\end{aligned}$$
(29)

Using again the Sherman–Morrison–Woodbury formula, it is possible (under the hypothesis that \({\delta }^T{H\gamma } \ne 0\)) to derive the inverse updating formula:

$$\begin{aligned} \bar{H} = H + \frac{\left( \delta - H \gamma \right) \delta ^T H}{{\delta }^T{H\gamma }} \end{aligned}$$
(30)

The updating formula (30) was first proposed by Broyden [9] in the context of solving systems of nonlinear equations. Locally q–superlinear convergence for this method was then proved by Broyden, Dennis and Moré in [12]; see also [19, Theorem 3.3.3] and [19, Theorem 8.2.2].
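As an illustration of the inverse update (30) in its original setting, the following sketch (assuming NumPy; the test system and the choice of the initial \(H^0\) as the inverse Jacobian at \(x^0\) are ours) applies Broyden's method to a small system F(x) = 0.

```python
import numpy as np

def broyden(F, x0, H0=None, tol=1e-10, max_iter=100):
    """Broyden's method for F(x) = 0 using the inverse update (30):
    H_bar = H + (delta - H gamma) delta^T H / (delta^T H gamma)."""
    x = x0.astype(float)
    H = np.eye(len(x)) if H0 is None else H0.copy()
    Fx = F(x)
    for _ in range(max_iter):
        if np.linalg.norm(Fx) <= tol:
            break
        delta = -H @ Fx                       # quasi-Newton step
        x_new = x + delta
        F_new = F(x_new)
        gamma = F_new - Fx
        denom = delta @ (H @ gamma)
        if abs(denom) > 1e-14:                # hypothesis of (30)
            H += np.outer(delta - H @ gamma, delta @ H) / denom
        x, Fx = x_new, F_new
    return x

# Solve x0^2 + x1^2 = 1, x0 - x1 = 0 (solution: both equal to 1/sqrt(2)).
F = lambda x: np.array([x[0]**2 + x[1]**2 - 1.0, x[0] - x[1]])
x0 = np.array([1.0, 0.5])
J0 = np.array([[2 * x0[0], 2 * x0[1]], [1.0, -1.0]])  # Jacobian at x0
print(broyden(F, x0, H0=np.linalg.inv(J0)))           # ~ [0.7071, 0.7071]
```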

The generic rank–2 updating formula is given by

$$\begin{aligned} \bar{B} = B + a u u^T + b v v^T \end{aligned}$$

where \(a, b \in \mathrm{I}\!\mathrm{R}\) and \(u,v \in \mathrm{I}\!\mathrm{R}^n\); similar rank–2 updating formulas can be defined for \(\bar{H} \).

The use of a rank–2 updating formula makes it possible to impose important conditions on the updated matrix, in addition to the secant property, such as symmetry and positive definiteness. Theorem 11 shows that the closest matrix to B for which symmetry and the secant condition are satisfied can be obtained via a rank–2 modification. Here a weighted Frobenius norm is utilized, where W is the weight matrix:

$$ \left\| A-B \right\| _{\text {W,F}} :=\left\| W(A-B)W \right\| _{\text {F}}. $$

Theorem 11

([17, Theorem 7.3]) Let \(B \in \mathrm{I}\!\mathrm{R}^{n \times n}\) be a symmetric matrix and let \(c, \delta , \gamma \in \mathrm{I}\!\mathrm{R}^n\) with \({c}^T{\delta } >0\). Let \(W\in R^{n\times n}\) be a symmetric nonsingular matrix such that \(Wc = W^{-1}\delta \). Then, the unique solution of

$$\begin{aligned} \begin{array}{ll} \displaystyle \mathop {\text{ min }}_{A} & \left\| A-B \right\| _{\text {W,F}} \\ \mathop {\text{ subject } \text{ to }} & A \delta = \gamma , \quad A \mathop {\text{ symmetric }} \end{array} \end{aligned}$$
(31)

is given by

$$\begin{aligned} \bar{B} = B + \frac{\left( \gamma - B \delta \right) c^T + c\left( \gamma - B \delta \right) ^T }{{c}^T{\delta }} - \frac{{\left( \gamma - B \delta \right) }^T{\delta }}{\left( {c}^T{\delta }\right) ^2} cc^T. \end{aligned}$$
(32)

The above theorem leaves room for suitable choices of the matrix W and the vector c, which can be utilized to impose positive definiteness of the updating formula, i.e., conditions that ensure that the matrix \(\bar{B} \) is positive definite provided that the same property holds for B. Positive definiteness of the matrix \(\bar{B} \) ensures that the search direction utilized in the Quasi–Newton method (see Algorithm 4) is a descent direction.

In view of the fact that \(\bar{B} \delta = \gamma \), if \(\bar{B} \) is positive definite then \(0 < {\delta }^T{\bar{B} \delta } = {\delta }^T{\gamma }\). It is not difficult to show that this condition is also sufficient for the existence of a positive definite \(\bar{B} \) [17, Theorem 7.5].

A simple, but extremely interesting, choice for the vector c is \(c = \gamma \). This specific choice leads to the updating formula

$$\begin{aligned} \bar{B} = B + \frac{\left( \gamma - B \delta \right) \gamma ^T + \gamma \left( \gamma - B \delta \right) ^T }{{\gamma }^T{\delta }} - \frac{{\left( \gamma - B \delta \right) }^T{\delta }}{\left( {\gamma }^T{\delta }\right) ^2} \gamma \gamma ^T. \end{aligned}$$
(33)

Again, using the Sherman–Morrison–Woodbury formula, it is possible to compute the corresponding inverse updating formula (here \(H = B^{-1}\) and \(\bar{H} = \bar{B} ^{-1}\))

$$\begin{aligned} \bar{H} = H + \frac{{\delta }{\delta ^T}}{{\gamma }^T{\delta }} - \frac{H\gamma \gamma ^T H}{{\gamma }^T{H\gamma }}. \end{aligned}$$
(34)

These formulas, known as the direct and inverse DFP formulas, were first proposed by Davidon [14] and later rediscovered by Fletcher and Powell, who also clarified and improved them [26].

Moreover, it is also possible to construct updating formulas directly for H, similarly requiring that \(\bar{H} \) be symmetric, that the secant condition be satisfied, and that \(\bar{H} \) be as close as possible to H in a specific norm, as shown in the theorem below.

Theorem 12

Let \(H \in \mathrm{I}\!\mathrm{R}^{n \times n}\) be a symmetric matrix and let \(d, \delta , \gamma \in \mathrm{I}\!\mathrm{R}^n\) with \({d}^T{\gamma } >0\). Let \(W \in R^{n\times n}\) be a symmetric nonsingular matrix such that \(Wd = W^{-1}\gamma \). Then, the unique solution of

$$\begin{aligned} \begin{array}{ll} \displaystyle \mathop {\text{ min }}_{A} & \left\| W(A-H)W \right\| _{\text {F}} \\ \mathop {\text{ subject } \text{ to }} & A \gamma = \delta , \quad A \mathop {\text{ symmetric }} \end{array} \end{aligned}$$
(35)

is given by

$$\begin{aligned} \bar{H} = H + \frac{\left( \delta - H \gamma \right) d^T + d\left( \delta - H \gamma \right) ^T }{{d}^T{\gamma }} - \frac{{\left( \delta - H \gamma \right) }^T{\gamma }}{\left( {d}^T{\gamma }\right) ^2} dd^T. \end{aligned}$$
(36)

Similarly to the direct case, it is possible to impose that the new matrix \(\bar{H} \) be positive definite. Moreover, the choice \(d = \delta \) leads to the following updating formula

$$\begin{aligned} \bar{H} = H + \frac{\left( \delta - H \gamma \right) \delta ^T + \delta \left( \delta - H \gamma \right) ^T }{{\delta }^T{\gamma }} - \frac{{\left( \delta - H \gamma \right) }^T{\gamma }}{\left( {\delta }^T{\gamma }\right) ^2} \delta \delta ^T. \end{aligned}$$
(37)

Using the Sherman–Morrison–Woodbury formula the corresponding direct updating formula can be obtained

$$\begin{aligned} \bar{B} = B + \frac{{\gamma }{\gamma ^T}}{{\delta }^T{\gamma }} - \frac{B\delta \delta ^T B}{{\delta }^T{B\delta }}. \end{aligned}$$
(38)

The above formulas are known as the (inverse and direct) BFGS updating formulas, after Broyden [11], Fletcher [24], Goldfarb [30] and Shanno [57], who discovered them.
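The inverse BFGS update (37) is easily embedded in a minimal quasi-Newton loop; the sketch below (assuming NumPy, with Armijo backtracking as in Sect. 3.2 and a standard guard that skips the update unless \({\delta }^T{\gamma } > 0\), so that H stays positive definite) is only illustrative.

```python
import numpy as np

def bfgs_inverse_update(H, delta, gamma):
    """Inverse BFGS update, Eq. (37); requires delta^T gamma > 0."""
    r = delta - H @ gamma
    dg = delta @ gamma
    return (H + (np.outer(r, delta) + np.outer(delta, r)) / dg
              - (r @ gamma) / dg**2 * np.outer(delta, delta))

def bfgs(f, grad_f, x0, tol=1e-8, max_iter=500):
    """Minimal inverse quasi-Newton loop with Armijo backtracking."""
    x = x0.astype(float)
    H = np.eye(len(x))
    g = grad_f(x)
    for _ in range(max_iter):
        if np.linalg.norm(g) <= tol:
            break
        d = -H @ g                      # descent direction (H positive definite)
        alpha, fx, slope = 1.0, f(x), g @ d
        while alpha > 1e-12 and f(x + alpha * d) > fx + 1e-4 * alpha * slope:
            alpha *= 0.5                # Armijo backtracking, Condition (11)
        x_new = x + alpha * d
        g_new = grad_f(x_new)
        delta, gamma = x_new - x, g_new - g
        if delta @ gamma > 1e-12:       # curvature guard: keeps H positive definite
            H = bfgs_inverse_update(H, delta, gamma)
        x, g = x_new, g_new
    return x

# Example: minimize the Rosenbrock function, whose minimum is at (1, 1).
f = lambda x: (1 - x[0])**2 + 100 * (x[1] - x[0]**2)**2
g = lambda x: np.array([-2 * (1 - x[0]) - 400 * x[0] * (x[1] - x[0]**2),
                        200 * (x[1] - x[0]**2)])
print(bfgs(f, g, np.array([-1.2, 1.0])))  # ~ [1, 1]
```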

Starting from the DFP and BFGS updating formulas, a whole class of updating methods (known as the Broyden class) can be constructed:

$$\begin{aligned} \bar{H} ^{\phi } = (1-\phi ) \bar{H} ^{\text {DFP}} + \phi \bar{H} ^{\text {BFGS}} \end{aligned}$$

where \(\bar{H} ^{\text {DFP}}\) and \(\bar{H} ^{\text {BFGS}}\) are given by (34) and (37) respectively, and \(\phi \in \mathrm{I}\!\mathrm{R}\).

Note that \(\bar{H} ^{\phi }\) can be easily rewritten as

$$\begin{aligned} \bar{H} ^{\phi } = \bar{H} ^{\text {DFP}} + \phi v v^T \end{aligned}$$

where

$$ v = \Bigl ( {\gamma }^T{H \gamma }\Bigr )^{\frac{1}{2}} \Biggl ( \frac{\delta }{{\gamma }^T{\delta }} - \frac{H\gamma }{{\gamma }^T{H \gamma }} \Biggr ) $$

clearly showing the relationship between \(\bar{H} ^{\phi }\) and \(\bar{H} ^{\text {DFP}}\).
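A direct transcription of this blend (a sketch assuming NumPy, reusing the DFP update (34) and the vector v defined above):

```python
import numpy as np

def dfp_inverse_update(H, delta, gamma):
    """Inverse DFP update, Eq. (34)."""
    Hg = H @ gamma
    return (H + np.outer(delta, delta) / (gamma @ delta)
              - np.outer(Hg, Hg) / (gamma @ Hg))

def broyden_class_update(H, delta, gamma, phi):
    """Broyden class: H_bar^phi = H_bar^DFP + phi * v v^T, with v as in
    the text; phi = 0 gives DFP, phi = 1 gives BFGS."""
    gHg = gamma @ (H @ gamma)
    v = np.sqrt(gHg) * (delta / (gamma @ delta) - (H @ gamma) / gHg)
    return dfp_inverse_update(H, delta, gamma) + phi * np.outer(v, v)
```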

The generic Quasi–Newton algorithm using the Broyden updating scheme is reported in Algorithm 5 (here we return to using the index k).

[Algorithm 5: Quasi–Newton method with Broyden class updating]

For quadratic strictly convex problems

$$ \min f(x) :=\frac{1}{2}{x}^T{Mx} + {q}^T{x} $$

where \(M\in \mathrm{I}\!\mathrm{R}^{n\times n}\) is a positive definite matrix and \(q \in \mathrm{I}\!\mathrm{R}^n\), finite termination is guaranteed when \(\phi _k \ge 0\) and \(\alpha _k\) is obtained via exact line–search [17, Theorem 8.1], [59, Theorem 5.2.1]. Finite termination follows from the observation that for all \(k=0, \ldots , n\)

$$\begin{aligned} {\left( x^{k+1}-x^k\right) }^T{M\left( x^{i+1}-x^i\right) } = 0 , \quad \forall i < k \end{aligned}$$

i.e., the directions \(\delta ^0 = x^{1}-x^0, \delta ^1 = x^{2}-x^1, \ldots , \delta ^k = x^{k+1}-x^k\) are M-conjugate and \(x^{k+1}\) minimizes f(x) over the affine set

$$\begin{aligned} S_k :=\left\{ x \in \mathrm{I}\!\mathrm{R}^n: x = x^0 + \sum _{j=0}^k \sigma _j \delta ^j, \; \sigma _j \in \mathrm{I}\!\mathrm{R}\right\} \end{aligned}$$

The following theorem, originally due to Powell, shows global convergence for rank–2 symmetric updating schemes for generic convex optimization problems.

Theorem 13

([17, Theorem 8.2]) Let \(f:\mathrm{I}\!\mathrm{R}^n \rightarrow \mathrm{I}\!\mathrm{R}\) be twice differentiable and convex. Let \(x^0 \in \mathrm{I}\!\mathrm{R}^n\) and assume that the level set \(\mathcal {L}_{0}\) is bounded. Let \(H^0\) be symmetric and positive definite and let \(\{x^k\}\) be generated by Algorithm 5 with either

  • \(\phi _k =0\) (DFP updating formula) and exact line–search, or

  • \(\phi _k =1\) (BFGS updating formula) and inexact line–search satisfying the Wolfe conditions.

Then, for any \(\epsilon >0\) there exists k such that \(\left\| \nabla f(x^k) \right\| < \epsilon \).

Dixon [22] demonstrated that, in the case of exact line–search, the sequence \(\{x^k\}\) is independent of \(\phi _k \ge 0\). Furthermore, in [24] it is shown that, for stability reasons, it is better to choose \(\phi _k \in [0,1]\). However, from a computational point of view, the choice \(\phi _k =1\) appears better than \(\phi _k =0\).

Finally, q–superlinear convergence of rank–2 updating methods is obtained in the case \(\phi _k = 0\) or \(\phi _k =1\) [17, Theorem 8.9] if the stepsize \(\alpha _k\) is chosen according to the Wolfe rule (in which case it is possible to show that \(\alpha _k = 1\) for \(k \ge k_0\)) provided that \(\displaystyle \sum _{k=1}^{+\infty }\left\| x^k - x^* \right\| < +\infty \).

We refer the interested reader to [35, 59] for additional insight on Quasi–Newton methods.

4 Constrained Optimization

In this section we concentrate our attention on the constrained optimization problem

$$\begin{aligned} \begin{array}{ll} \displaystyle \mathop {\text{ min }}_{x} & f(x) \\ \mathop {\text{ subject } \text{ to }} & c_i(x) \le 0 \quad \quad i = 1, \ldots , {m} \\ & c_i(x) = 0 \quad \quad i = {m+1}, \ldots , m + h \end{array} \end{aligned}$$
(39)

where \(f:\mathrm{I}\!\mathrm{R}^n \rightarrow \mathrm{I}\!\mathrm{R}\) is the objective function and \(c_i:\mathrm{I}\!\mathrm{R}^n \rightarrow \mathrm{I}\!\mathrm{R}, i = 1, \ldots , {m+h}\) are the constraints defining the feasible region X

$$\begin{aligned} X :=\left\{ x \in \mathrm{I}\!\mathrm{R}^n: c_i(x) \le 0, i = 1, \ldots , {m} \mathop { \text{ and } } c_i(x) = 0, i = {m+1}, \ldots , m + h \right\} . \end{aligned}$$

For simplicity, we will always assume that all the functions are at least twice continuously differentiable.

Definition 11

A feasible point \(\bar{x}\) is a local minimum if \(\exists \; \gamma > 0\) such that

$$\begin{aligned} f(\bar{x}) \le f(x) \quad \forall x \in X \mathop { \text{ and } } \left\| x -\bar{x} \right\| \le \gamma . \end{aligned}$$

More generally, a point \(\bar{x}\) is a local minimum if there exists a neighborhood \(\mathcal {I}\) of \(\bar{x}\) such that

$$\begin{aligned} f(\bar{x}) \le f(x) \quad \forall \ x \in X \cap \mathcal {I}. \end{aligned}$$

A point \(\bar{x}\) is a strict local minimum if there exists a neighborhood \(\mathcal {I}\) of \(\bar{x}\) such that

$$\begin{aligned} f(\bar{x}) < f(x) \quad \forall \ x \in X \cap \mathcal {I}, x \ne \bar{x}. \end{aligned}$$

A point \(\bar{x}\) is a strong or isolated local minimum if there exists a neighborhood \(\mathcal {I}\) of \(\bar{x}\) such that \(\bar{x}\) is the only local minimum in \(X \cap \mathcal {I}\).

For the above minimization problem, the Lagrangian function is defined as follows:

$$\begin{aligned} L(x,\lambda ) :=f(x) + \sum _{i=1}^{m+h} \lambda _i c_i(x) \end{aligned}$$
(40)

where \(\lambda \in \mathrm{I}\!\mathrm{R}^{m+h}\). First and second order necessary and sufficient optimality conditions will be expressed in terms of the above Lagrangian function.

Another important concept in constrained optimization is that of active constraints.

Definition 12

Let \(\bar{x}\) be a feasible point for the constrained optimization problem (39), i.e., \(\bar{x}\in X\). The set of active constraints at \(\bar{x}\) is

$$ {\mathcal {A}}\left( \bar{x}\right) :=\Bigl \{i: 1 \le i \le m: c_i(\bar{x}) = 0 \Bigr \} \cup \Bigl \{ m+1, \ldots , m + h\Bigr \}.$$
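Computationally, the active set is determined by evaluating the inequality constraints at \(\bar{x}\) within a tolerance, since exact zeros are rare in floating point; a small sketch assuming NumPy (the helper name and the example constraints are ours):

```python
import numpy as np

def active_set(c_funcs, x_bar, m, h, tol=1e-8):
    """Indices (1-based, as in Definition 12) of active constraints:
    inequality constraints c_i(x) <= 0 holding with (near) equality,
    plus all equality constraints i = m+1, ..., m+h."""
    active = [i + 1 for i in range(m) if abs(c_funcs[i](x_bar)) <= tol]
    return active + list(range(m + 1, m + h + 1))

# Example: X = {x : x1 + x2 - 1 <= 0, -x1 <= 0}; both are active at (0, 1).
cs = [lambda x: x[0] + x[1] - 1.0, lambda x: -x[0]]
print(active_set(cs, np.array([0.0, 1.0]), m=2, h=0))  # [1, 2]
```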

4.1 Necessary and Sufficient Optimality Conditions

In order to derive optimality conditions for the constrained optimization problem (39), two sets must be introduced: the tangent cone and the set of linearized feasible directions.

Definition 13

Let \(\bar{x}\in X\). A direction \(d \in \mathrm{I}\!\mathrm{R}^n\) is tangent  to the set X at \(\bar{x}\) if there exists a sequence \(\left\{ z^k \right\} \subseteq X\) with \(\lim _k z^k = \bar{x}\) and a sequence \({\left\{ \theta _k \right\} }\) of positive scalars with \(\lim _k \theta _k = 0 \) such that

$$\begin{aligned} \lim _k \frac{z^k - \bar{x}}{\theta _k} = d. \end{aligned}$$
(41)

The set of all tangent vectors to X at \(\bar{x}\) is called the tangent cone at \(\bar{x}\) and is denoted by \({\mathcal {T}}_{X}\left( \bar{x}\right) \).

Definition 14

Let \(\bar{x}\in X\) and let \({\mathcal {A}}\left( \bar{x}\right) \) be the set of indices of active constraints. The set of linearized feasible directions \({\mathcal {F}}\left( \bar{x}\right) \) is the set

$$ {\mathcal {F}}\left( \bar{x}\right) :=\left\{ d \in \mathrm{I}\!\mathrm{R}^n: \begin{array}{l} {\nabla c_i(\bar{x})}^T{d} \le 0, \quad i = 1, \ldots , {m}, \; i \in {\mathcal {A}}\left( \bar{x}\right) , \\ {\nabla c_i(\bar{x})}^T{d} = 0, \quad i = m+1, \ldots , m + h \end{array}\right\} .$$

Clearly, if \(d \in {\mathcal {F}}\left( \bar{x}\right) \), then \(\alpha d \in {\mathcal {F}}\left( \bar{x}\right) \) for all nonnegative \(\alpha \); hence the set \({\mathcal {F}}\left( \bar{x}\right) \) is also a cone.

These two sets can be viewed as local approximations of the feasible region X around \(\bar{x}\).

Note that, while the tangent cone only depends on the geometry of the feasible region, the set of linearized feasible directions depends on the specific formulation of the constraints utilized to define the feasible region. Moreover, for all \(\bar{x}\in X\), \({\mathcal {T}}_{X}\left( \bar{x}\right) \subseteq {\mathcal {F}}\left( \bar{x}\right) \).

A fundamental question is under which conditions the two sets coincide, that is \({\mathcal {T}}_{X}\left( \bar{x}\right) = {\mathcal {F}}\left( \bar{x}\right) \) or, to be more precise, under which conditions the dual cones of the two sets coincide.Footnote 3 Since the seminal work of Kuhn and Tucker [42], a number of different (inter–related) Constraint Qualification conditions have been proposed in the literature, both for the case of only inequality constraints and for the more general case of equality and inequality constraints (see [60] and references therein, including the two schemes showing the relationships between the different Constraint Qualification conditions, and [46]).

In [60] the various Constraint Qualification conditions are partitioned into four levels, with the weakest (least stringent) conditions at level 1 and the strongest (most stringent) conditions at level 4. In the following, we state some of the most common Constraint Qualification conditions.

Definition 15

(Guignard’s Constraint Qualification, GCQ) [36] Let \(\bar{x}\in X\). Guignard’s Constraint Qualification is satisfied at \(\bar{x}\) if \({\mathcal {F}}\left( \bar{x}\right) = \mathop {\text{ cl }}\left( \mathop {\text{ conv }}\left( {\mathcal {T}}_{X}\left( \bar{x}\right) \right) \right) \).Footnote 4

The GCQ is considered the weakest possible Constraint Qualification.

Definition 16

(Abadie’s Constraint Qualification, ACQ) [1] Let \(\bar{x}\in X\). Abadie’s Constraint Qualification is satisfied at \(\bar{x}\) if \({\mathcal {T}}_{X}\left( \bar{x}\right) = {\mathcal {F}}\left( \bar{x}\right) \).

Since \({\mathcal {F}}\left( \bar{x}\right) \) is a convex cone, Abadie’s Constraint Qualification requires that \({\mathcal {T}}_{X}\left( \bar{x}\right) \) be a convex cone. Hence this condition is stronger than Guignard’s Constraint Qualification.

Definition 17

(Slater’s Constraint Qualification, SCQ) [58] Let \(\bar{x}\in X\). Slater’s Constraint Qualification is satisfied at \(\bar{x}\) if

  • \(c_i(x)\) is pseudo–convex at \(\bar{x}\) for all \(1 \le i \le m\) and \(i \in {\mathcal {A}}\left( \bar{x}\right) \),

  • \(c_i(x)\) is both quasi–convex and quasi–concaveFootnote 5 at \(\bar{x}\) for all \(m+1 \le i \le m+h\),

  • the vectors \( \bigl \{\nabla c_i\left( \bar{x}\right) , \quad m+1 \le i \le m+h \bigr \}\) are linearly independent,

  • there exists \(\hat{x}\) such that \(c_i\left( \hat{x}\right) <0\), for all \(1 \le i \le m\) and \(i \in {\mathcal {A}}\left( \bar{x}\right) \) and \(c_i\left( \hat{x}\right) = 0\) for all \(m+1 \le i \le m+h\).

Definition 18

(Mangasarian–Fromovitz’s Constraint Qualification, MFCQ) [47, 3.4–3.6] Let \(\bar{x}\in X\). The Mangasarian–Fromovitz Constraint Qualification is satisfied at \(\bar{x}\) if

  • the vectors \( \bigl \{\nabla c_i\left( \bar{x}\right) , \quad m+1 \le i \le m+h \bigr \}\) are linearly independent,

  • there exists d such that \({\nabla c_i\left( \bar{x}\right) }^T{d} <0\), for all \(1 \le i \le m\) and \(i \in {\mathcal {A}}\left( \bar{x}\right) \) and \({\nabla c_i\left( \bar{x}\right) }^T{d} = 0\) for all \(m+1 \le i \le m+h\).

Definition 19

(Linear Independence Constraint Qualification, LICQ) [60, 5.4.6] Let \(\bar{x}\in X\). The Linear Independence Constraint Qualification is satisfied at \(\bar{x}\) if the vectors

$$\begin{aligned} \Bigl \{\nabla c_i\left( \bar{x}\right) , 1 \le i \le m, \quad i \in {\mathcal {A}}\left( \bar{x}\right) \Bigr \} \cup \Bigl \{\nabla c_i\left( \bar{x}\right) , m+1 \le i \le m+h \Bigr \} \end{aligned}$$

are linearly independent.

Both LICQ and SCQ imply MFCQ. Moreover, MFCQ implies ACQ. Therefore, if any of these Constraint Qualification conditions hold at a feasible point \(\bar{x}\), then \({\mathcal {T}}_{X}\left( \bar{x}\right) = {\mathcal {F}}\left( \bar{x}\right) \).
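
As a quick numerical illustration (a sketch of ours; the constraint functions, the test point, and the rank-based test are illustrative choices, not part of the original text), LICQ can be checked by assembling the gradients of the active constraints into a matrix and verifying that it has full row rank:

```python
import numpy as np

def licq_holds(grad_c, active_idx, x):
    """Check LICQ at x: the gradients of the active constraints must be
    linearly independent, i.e. the matrix stacking them row-wise has
    full row rank."""
    if not active_idx:
        return True  # no active constraints: LICQ holds trivially
    G = np.array([grad_c[i](x) for i in active_idx])  # one gradient per row
    return np.linalg.matrix_rank(G) == G.shape[0]

# Example: c1(x) = x1^2 + x2^2 - 1 <= 0 and c2(x) = -x2 <= 0,
# both active at x = (1, 0).
grad_c = {0: lambda x: np.array([2 * x[0], 2 * x[1]]),
          1: lambda x: np.array([0.0, -1.0])}
x_bar = np.array([1.0, 0.0])
print(licq_holds(grad_c, [0, 1], x_bar))  # True: gradients (2,0) and (0,-1)
```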

The following lemma provides necessary conditions for a local optimal solution.

Lemma 1

Let \(x^*\) be a local optimal solution of Problem (39). Then

$$\begin{aligned} {\nabla f(x^*)}^T{d} \ge 0 \quad \quad \forall \; d \in {\mathcal {T}}_{X}\left( x^*\right) \end{aligned}$$
(42)

Proof

Suppose, by contradiction, that at \(x^*\) there is a direction \(d\in {\mathcal {T}}_{X}\left( x^*\right) \) such that

$$\begin{aligned} {\nabla f(x^*)}^T{d} < 0. \end{aligned}$$

Since \(d \in {\mathcal {T}}_{X}\left( x^*\right) \), by Definition 13 there exist a sequence \(\left\{ z^k \right\} \subseteq X\) converging to \(x^*\) and a sequence \({\left\{ \theta _k \right\} }\) of positive scalars converging to zero such that \(d^k :=\left( z^k - x^*\right) /\theta _k\) converges to d, i.e., \(z^k = x^*+ \theta _k d^k \in X \) for all k. Therefore,

$$\begin{aligned} f(z^k)&= f(x^*+ \theta _k d^k) = f(x^*) + {\nabla f(x^*)}^T{\left( z^k - x^*\right) } + o\left( \left\| z^k - x^* \right\| \right) \\&= f(x^*) + \theta _k {\nabla f(x^*)}^T{d^k} + o(\theta _k). \end{aligned}$$

But

$$\begin{aligned} \lim _k {\nabla f(x^*)}^T{d^k} = {\nabla f(x^*)}^T{d} < 0, \end{aligned}$$

and hence

$$\begin{aligned} f(z^k) - f(x^*) = \theta _k {\nabla f(x^*)}^T{d^k} + o(\theta _k) < 0 \end{aligned}$$

for k sufficiently large. Therefore, for each neighborhood of \(x^*\), there exists a sufficiently large index k such that the point \(z^k\) belongs to such neighborhood and, hence, \(x^*\) is not a local minimum.

When Constraint Qualification conditions are satisfied, the necessary optimality conditions can be expressed in a more convenient way, in terms of the Karush–Kuhn–Tucker conditions. This result utilizes Motzkin’s theorem of the alternative; see [46, Chap. 2].

Theorem 14

([46, 7.3.7][49, Theorem 12.1]) Let \(x^*\) be a local minimum for the constrained optimization Problem (39) and assume that at \(x^*\) some Constraint Qualification conditions are satisfied. Then there exists a Lagrange multiplier vector \(\lambda ^*\) such that the following conditions, known as Karush–Kuhn–Tucker (or KKT) conditions, hold:

$$\begin{aligned} \nabla _x L(x^*, \lambda ^*) = 0, \end{aligned}$$
(43a)
$$\begin{aligned} c_i(x^*) \le 0 \quad \forall \, i = 1, \ldots , m, \end{aligned}$$
(43b)
$$\begin{aligned} c_i(x^*) = 0 \quad \forall \, i = m+1, \ldots , m+h, \end{aligned}$$
(43c)
$$\begin{aligned} \lambda ^*_i \ge 0 \quad \forall \, i = 1, \ldots , m, \end{aligned}$$
(43d)
$$\begin{aligned} \lambda ^*_i c_i(x^*) = 0 \quad \forall \, i = 1, \ldots , m, \end{aligned}$$
(43e)

where L(., .) is the Lagrangian function (40).

Note that Condition (43a) can be rewritten as

$$\begin{aligned} \nabla f(x^*) + \sum _{i=1}^{m+h} \lambda _i^* \nabla c_i(x^*) = 0. \end{aligned}$$

Conditions (43e) are called complementarity conditions: they impose that, for all \(i = 1, \ldots , {m}\), either \(\lambda _i^*=0 \) or \(c_i(x^*) = 0\) (or both).
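
The KKT conditions (43) lend themselves to a direct numerical check. The following minimal sketch (ours; the test problem \(\min x_1^2 + x_2^2\) subject to \(x_1 + x_2 \ge 1\), the function names, and the calling convention are illustrative choices) computes the residuals of stationarity, feasibility, dual feasibility, and complementarity at a candidate pair \((x^*, \lambda ^*)\):

```python
import numpy as np

def kkt_residuals(grad_f, c, grad_c, x, lam, m):
    """Residuals of the KKT conditions (43): the first m constraints are
    inequalities c_i(x) <= 0, the remaining ones equalities c_i(x) = 0."""
    # (43a) stationarity of the Lagrangian in x
    stat = grad_f(x) + sum(lam[i] * grad_c[i](x) for i in range(len(lam)))
    cvals = np.array([ci(x) for ci in c])
    ineq_feas = np.maximum(cvals[:m], 0.0)   # (43b) violation
    eq_feas = np.abs(cvals[m:])              # (43c) violation
    dual_feas = np.maximum(-lam[:m], 0.0)    # (43d) violation
    compl = np.abs(lam[:m] * cvals[:m])      # (43e) violation
    return stat, ineq_feas, eq_feas, dual_feas, compl

# min x1^2 + x2^2  s.t.  c1(x) = 1 - x1 - x2 <= 0
grad_f = lambda x: 2 * x
c = [lambda x: 1 - x[0] - x[1]]
grad_c = [lambda x: np.array([-1.0, -1.0])]
x_star, lam_star = np.array([0.5, 0.5]), np.array([1.0])
print(kkt_residuals(grad_f, c, grad_c, x_star, lam_star, m=1))
# all residuals are (numerically) zero, so (x*, lambda*) is a KKT pair
```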

In order to derive second order necessary and sufficient conditions, we need to introduce first the concept of critical cone.

Definition 20

Let \(x^*\) be a local minimum for the constrained optimization Problem (39) and suppose that the pair \(\left( x^*, \lambda ^*\right) \) satisfies the Karush–Kuhn–Tucker conditions (43) for some \(\lambda ^*\in \mathrm{I}\!\mathrm{R}^{m+h}\). The Critical Cone is defined as:

$$\begin{aligned} \begin{array}{ll} \mathcal {C}\left( {x^*}, {\lambda ^*} \right) &{} :=\Bigl \{ d \in {\mathcal {F}}\left( x^*\right) : {\nabla c_i(x^*)}^T{d} = 0, \\ &{} \quad \quad \mathop {\text{ for } \text{ all } } i = 1, \ldots , {m}, i \in {\mathcal {A}}\left( x^*\right) \mathop { \text{ with } } \lambda _i^* > 0 \Bigr \}. \end{array} \end{aligned}$$
(44)

Equivalently, \( d \in \mathcal {C}\left( {x^*}, {\lambda ^*} \right) \) if and only if

$$\begin{aligned} {\nabla c_i(x^*)}^T{d} = 0 \quad \quad i = 1, \ldots , m, \; i \in {\mathcal {A}}\left( x^*\right) \mathop { \text{ with } } \lambda _i^* > 0 \end{aligned}$$
(45a)
$$\begin{aligned} {\nabla c_i(x^*)}^T{d} \le 0 \quad \quad i = 1, \ldots , m, \; i \in {\mathcal {A}}\left( x^*\right) \mathop { \text{ with } } \lambda _i^* = 0 \end{aligned}$$
(45b)
$$\begin{aligned} {\nabla c_i(x^*)}^T{d} = 0 \quad \quad i = m+1, \ldots , m+h \end{aligned}$$
(45c)

Note that the Hessian of the Lagrangian function \(L\left( x, \lambda \right) \) is given by

$$\begin{aligned} \nabla ^2_{xx} L\left( x, \lambda \right) = \nabla ^2 f(x) + \sum _{i=1}^{m+h} \lambda _i \nabla ^2 c_i(x). \end{aligned}$$

This Hessian is of fundamental importance in the second order necessary and sufficient conditions.

Theorem 15

([25, Theorem 9.3.1] [49, Theorem 12.5]) Let \(x^*\) be a local minimum for the constrained optimization Problem (39) and assume that at \(x^*\) some Constraint Qualification conditions are satisfied. Then, there exists a Lagrange multiplier vector \(\lambda ^*\) such that the Karush–Kuhn–Tucker conditions (43) are satisfied and

$$\begin{aligned} {d}^T{\nabla ^2_{xx} L\left( x^*, \lambda ^* \right) d} \ge 0, \mathop { \text{ for } \text{ all } } d \in \mathcal {C}\left( {x^*}, {\lambda ^*} \right) . \end{aligned}$$
(46)

Theorem 16

([25, Theorem 9.3.2] [49, Theorem 12.6]) Let \(x^*\) be a feasible point for the constrained optimization Problem (39) and assume that there exists a vector \(\lambda ^*\) such that the pair \(\left( x^*, \lambda ^*\right) \) satisfies the Karush–Kuhn–Tucker conditions (43). Furthermore, suppose that

$$\begin{aligned} {d}^T{\nabla ^2_{xx} L\left( x^*, \lambda ^* \right) d} > 0, \mathop { \text{ for } \text{ all } } 0 \ne d \in \mathcal {C}\left( {x^*}, {\lambda ^*} \right) . \end{aligned}$$
(47)

Then, \(x^*\) is a strict local minimum of the constrained optimization Problem (39).

As in the unconstrained case, there is a significant difference between the necessary and the sufficient optimality conditions: the necessary condition requires the Hessian of the Lagrangian to be positive semidefinite on the Critical Cone, whereas the sufficient condition requires it to be positive definite on the same cone.
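
A small example (ours, for illustration) shows why definiteness is required only on the Critical Cone and not on the whole space. Consider

$$\begin{aligned} \min _{x \in \mathrm{I}\!\mathrm{R}^2} \; x_1^2 - x_2^2 \quad \mathop {\text{ subject } \text{ to }} \quad c_1(x) = x_2 = 0. \end{aligned}$$

At \(x^*= (0,0)\) the Karush–Kuhn–Tucker conditions (43) hold with \(\lambda ^*_1 = 0\), and \(\nabla ^2_{xx} L\left( x^*, \lambda ^* \right) = \mathrm {diag}(2,-2)\) is indefinite on \(\mathrm{I}\!\mathrm{R}^2\). The Critical Cone, however, is \(\mathcal {C}\left( {x^*}, {\lambda ^*} \right) = \{d : d_2 = 0\}\), on which \({d}^T{\nabla ^2_{xx} L\left( x^*, \lambda ^* \right) d} = 2 d_1^2 > 0\) for all \(d \ne 0\). Hence, by Theorem 16, \(x^*\) is a strict local minimum, as expected since \(f(x) = x_1^2\) on the feasible set.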

4.2 Duality in Constrained Optimization

The concept of duality is central in constrained optimization as well as in other fields of Operations Research (e.g., in Linear programming) and, in general, in mathematics. Let

$$\begin{aligned} f^* = \inf _{x \in X} f(x).\end{aligned}$$
(48)

The dual function is given by

$$\begin{aligned} q(\lambda ) :=\inf _x L(x,\lambda ) \end{aligned}$$
(49)

where \(L(x,\lambda )\) is the Lagrangian function (40). Note that this dual function is a concave function (i.e., \(-q(\lambda )\) is a convex function) and its domain \(\mathcal {D}\) is a convex set. The dual problem is defined as

$$\begin{aligned} \begin{array}{cl} \displaystyle \sup _{\lambda } &{} q(\lambda ) \\ &{} \lambda _i \ge 0, \quad i = 1, \ldots , m. \end{array} \end{aligned}$$
(50)

Moreover, let

$$\begin{aligned} q^* = \sup _{\lambda } \Bigl \{ q(\lambda ) : \lambda _i \ge 0,\; i = 1, \ldots , m \Bigr \}. \end{aligned}$$
(51)

It is not difficult to show that (Weak Duality), if \(\bar{\lambda }\in \mathrm{I}\!\mathrm{R}^{m+h}\) with \(\bar{\lambda }_i \ge 0\), \(i=1, \ldots , {m}\) (that is, \(\bar{\lambda }\) is dual feasible), and \(\bar{x}\in X\) (that is, \(\bar{x}\) is primal feasible), then

$$\begin{aligned} q(\bar{\lambda })&= \inf _x L(x,\bar{\lambda }) \le L(\bar{x}, \bar{\lambda }) = f(\bar{x}) + \sum _{i=1}^{m+h} \bar{\lambda }_i c_i(\bar{x}) \\&= f(\bar{x}) + \sum _{i=1}^{m} \bar{\lambda }_i c_i(\bar{x}) \le f(\bar{x}) \end{aligned}$$

(the equality constraints vanish at the primal feasible point \(\bar{x}\), while \(\bar{\lambda }_i c_i(\bar{x}) \le 0\) for the inequality constraints) and hence

$$\begin{aligned} q^* \le f^*. \end{aligned}$$
(52)
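
As a one–dimensional illustration (our own example), consider \(\min x^2\) subject to \(c_1(x) = 1 - x \le 0\). The dual function (49) is

$$\begin{aligned} q(\lambda ) = \inf _x \bigl \{ x^2 + \lambda (1-x) \bigr \} = \lambda - \frac{\lambda ^2}{4}, \end{aligned}$$

attained at \(x = \lambda /2\). Maximizing over \(\lambda \ge 0\) gives \(\lambda ^* = 2\) and \(q^* = 1 = f(1) = f^*\): weak duality (52) holds with equality, consistently with the convexity results below (the problem is convex and SCQ holds, e.g., at \(\hat{x} = 2\)).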

A more convenient form for the dual problem can be derived for convex problems. The theorem below shows the relationship between points satisfying Karush–Kuhn–Tucker conditions and optimal solutions of the dual problem. Moreover, it shows that in convex optimization there is no duality gap.

Theorem 17

([49, Theorem 12.12]) Let \(x^*\) be an optimal solution of the convex optimization problem (39) where \(f:\mathrm{I}\!\mathrm{R}^n \rightarrow \mathrm{I}\!\mathrm{R}\) and \(c_i:\mathrm{I}\!\mathrm{R}^n \rightarrow \mathrm{I}\!\mathrm{R}, i = 1, \ldots , {m}\) are convex functions and \(c_i:\mathrm{I}\!\mathrm{R}^n \rightarrow \mathrm{I}\!\mathrm{R},\; i = m+1, \ldots , {m+h}\) are linear functions. Assume, further, that at \(x^*\) some Constraint Qualification conditions are satisfied. Then, there exists \(\lambda ^*\) such that the pair \((x^*, \lambda ^*)\) satisfies the Karush–Kuhn–Tucker conditions (43) and \(\lambda ^*\) solves the dual problem (50), i.e.,

$$\begin{aligned} \begin{array}{cl} q(\lambda ^*) = \displaystyle \max _{\lambda } &{} q(\lambda ) \\ &{} \lambda _i \ge 0, i = 1, \ldots , {m}.\end{array}\end{aligned}$$
(53)

Moreover,

$$\begin{aligned} f^* = f(x^*) = L(x^*, \lambda ^*) = q(\lambda ^*) = q^*. \end{aligned}$$

To obtain, at least partially, a converse of this result, strict convexity is required.

Theorem 18

([49, Theorem 12.13]) Let \(x^*\) be an optimal solution of the convex optimization problem (39) where \(f:\mathrm{I}\!\mathrm{R}^n \rightarrow \mathrm{I}\!\mathrm{R}\) and \(c_i:\mathrm{I}\!\mathrm{R}^n \rightarrow \mathrm{I}\!\mathrm{R}, i = 1, \ldots , {m}\) are convex functions and \(c_i:\mathrm{I}\!\mathrm{R}^n \rightarrow \mathrm{I}\!\mathrm{R},\; i = m+1, \ldots , {m+h}\) are linear functions. Assume, further, that at \(x^*\) some Constraint Qualification conditions are satisfied. Moreover, suppose that \(\bar{\lambda }\) is a solution of the dual problem (50) and that the function \(L(x,\bar{\lambda })\) is strictly convex in x and

$$\begin{aligned} L(\bar{x}, \bar{\lambda }) = \inf _x L(x,\bar{\lambda }). \end{aligned}$$

Then \(x^*= \bar{x}\) and \(f(x^*) = L(x^*,\bar{\lambda })\).

The two theorems above are the basis for a different, more convenient, form for the dual problem, known as Wolfe dual [61]:

$$\begin{aligned} \begin{array}{cl} \displaystyle \max _{x,\, \lambda } &{} L(x,\lambda ) \\ \mathop {\text{ subject } \text{ to }} &{} \nabla _x L(x,\lambda ) = 0, \\ &{} \lambda _i \ge 0, \quad i = 1, \ldots , m. \end{array} \end{aligned}$$
(54)

For this dual optimization problem it is possible to show that, if \(x^*\) is a local minimum at which Constraint Qualification conditions are satisfied, there exists \(\lambda ^*\in \mathrm{I}\!\mathrm{R}^{m+h}\) such that the pair \((x^*, \lambda ^*)\) satisfies the Karush–Kuhn–Tucker conditions (43) and, furthermore, solves the Wolfe dual problem (54).
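
For instance, for the convex problem \(\min \frac{1}{2} \left\| x \right\| ^2_2\) subject to \({a}^T{x} \le b\) (an illustrative instance of ours), the stationarity constraint \(\nabla _x L(x,\lambda ) = x + \lambda a = 0\) in (54) gives \(x = -\lambda a\), and the Wolfe dual reduces to the one–dimensional concave problem

$$\begin{aligned} \max _{\lambda \ge 0} \; L(-\lambda a, \lambda ) = \max _{\lambda \ge 0} \; \Bigl \{ -\frac{\lambda ^2}{2} \left\| a \right\| ^2_2 - \lambda b \Bigr \}, \end{aligned}$$

whose optimal value coincides with the primal one: \(q^* = f^* = 0\) if \(b \ge 0\) (take \(\lambda = 0\)), and \(q^* = f^* = b^2 / \bigl ( 2\left\| a \right\| ^2_2\bigr ) \) with \(\lambda ^* = -b/\left\| a \right\| ^2_2\) if \(b < 0\).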

4.3 Penalty and Augmented Lagrangian Methods

An important and well studied class of methods for solving nonlinear optimization problems is based on the idea of replacing the original constrained problem by a single unconstrained problem or by a sequence of unconstrained problems, whose objective functions contain terms penalizing the violation of the original constraints.

An important issue, both from a theoretical and a practical point of view, is to determine if the minimizer of the penalty function and the solution of the original optimization problem coincide. This property is called exactness of the penalty function [20].

The simplest approach is to use a quadratic term to penalize the violation of the constraints. For the constrained problem (39), the quadratic penalty function is given by

$$\begin{aligned} \psi (x,\mu ) :=f(x) + \mu \Biggl [\sum _{i=1}^m \max \{0, c_i(x)\}^2 + \sum _{i=m+1}^{m+h} c_i(x)^2 \Biggr ], \end{aligned}$$
(55)

where \(\mu >0\) is the penalty parameter. If the original functions f(x) and \(c_i(x),\; i = 1, \ldots , {m+h}\) are sufficiently smooth, the function \(\psi (x,\mu )\) is continuously differentiable (and twice continuously differentiable if \(m=0\), i.e., if there are only equality constraints) and, hence, standard algorithms for unconstrained minimization can be applied to calculate its minimizer.

However, when quadratic penalty terms are utilized, it is necessary to drive the penalty parameter to \(+\infty \). A general scheme proceeds as follows:

  • choose a priori a sequence \(\{\mu _k\} \rightarrow +\infty \),

  • for each value of \(\mu _k\), calculate \(x^{\mu _{k}}\), a minimizer of \(\psi (x,\mu _k)\).

The procedure terminates when the violation of the constraints at \(x^{\mu _{k}}\) is sufficiently small; a minimal sketch of this scheme is given below.
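
The following Python sketch implements the scheme under simplifying assumptions (SciPy's general-purpose minimize as the unconstrained solver; the initial parameter \(\mu _0\), the growth rate, and the tolerance are arbitrary illustrative choices, not prescribed by the text):

```python
import numpy as np
from scipy.optimize import minimize

def quadratic_penalty(f, c_ineq, c_eq, x0, mu0=1.0, rate=10.0, tol=1e-6,
                      max_outer=20):
    """Minimal quadratic-penalty scheme based on (55): minimize psi(x, mu)
    for an increasing sequence mu_k until the constraint violation at the
    subproblem minimizer is sufficiently small."""
    x, mu = np.asarray(x0, dtype=float), mu0
    for _ in range(max_outer):
        def psi(x):
            pen = sum(max(0.0, ci(x)) ** 2 for ci in c_ineq)
            pen += sum(ce(x) ** 2 for ce in c_eq)
            return f(x) + mu * pen
        x = minimize(psi, x).x                  # unconstrained subproblem
        viol = max([max(0.0, ci(x)) for ci in c_ineq] +
                   [abs(ce(x)) for ce in c_eq] + [0.0])
        if viol <= tol:
            break
        mu *= rate                              # drive the parameter to +inf
    return x

# Example: min x1^2 + x2^2  s.t.  x1 + x2 = 1  (solution (0.5, 0.5))
x = quadratic_penalty(lambda x: x @ x, [], [lambda x: x[0] + x[1] - 1],
                      x0=[0.0, 0.0])
print(x)
```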

For this simple scheme it is possible to show [25, Theorem 12.1.1] [49, Theorem 17.1] that

  1. \(\{\psi (x^{\mu _{k}},\mu _k)\}\) is non–decreasing,

  2. \(\{f(x^{\mu _{k}})\}\) is non–increasing,

  3. the constraint violation is non–increasing,

  4. every accumulation point \(x^*\) of the sequence \(\{x^{\mu _{k}}\}\) is a solution of Problem (39).

Another widely used penalty function is the 1–norm penalty function defined as

$$\begin{aligned} \phi (x,\mu ) :=f(x) + \mu \Biggl [ \sum _{i=1}^m \max \{0, c_i(x)\} + \sum _{i=m+1}^{m+h} \left| c_i(x) \right| \Biggr ]. \end{aligned}$$
(56)

This penalty function is exact in the sense that there exists a finite \(\mu ^*>0\) such that for all values of \(\mu \ge \mu ^*\), if \(x^*\) is a strict local solution of the nonlinear problem (39) at which first-order necessary optimality conditions are satisfied, then \(x^*\) is a local minimizer of \(\phi (x,\mu )\) [49, Theorem 17.3]. However, the 1–norm penalty function is non–smooth and, therefore, specific algorithms must be utilized and convergence is in many cases quite slow.
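
The finite threshold \(\mu ^*\) can be seen on a one–dimensional example of ours: for \(\min x\) subject to \(c_1(x) = -x \le 0\), the solution is \(x^*= 0\) with multiplier \(\lambda ^*_1 = 1\), and

$$\begin{aligned} \phi (x,\mu ) = x + \mu \max \{0, -x\} = \left\{ \begin{array}{ll} x, &{} x \ge 0, \\ (1-\mu ) x, &{} x < 0. \end{array}\right. \end{aligned}$$

For every \(\mu > \mu ^* = 1\) the point \(x^*= 0\) is the global minimizer of \(\phi (\cdot ,\mu )\), whereas for \(\mu < 1\) the penalty function is unbounded below.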

Continuously differentiable exact penalty functions can also be constructed [20] by using an Augmented Lagrangian function that includes additional terms penalizing the violation of the Karush–Kuhn–Tucker conditions. Under specific assumptions, stationary points, local minimizers, and global minimizers of the Augmented Lagrangian function exactly correspond to Karush–Kuhn–Tucker points, local solutions, and global solutions of the constrained problem [21].

4.4 Sequential Quadratic Programming

A very effective method for solving constrained optimization problems is based on sequentially solving quadratic subproblems (Sequential Quadratic Programming, SQP). The idea behind the SQP approach is to construct, at each iteration, a quadratic approximation of Problem (39) around the current point \(x^k\), and then use the minimizer of this subproblem as the new iterate \(x^{k+1}\) [49, Chap. 18].

More specifically, at the current point \(x^k\) we construct a quadratic approximation of the problem: the objective function is replaced by a quadratic model built with the Hessian of the Lagrangian function with respect to the x variables, and the constraints are replaced by their linear approximations:

$$\begin{aligned} \begin{array}{lc} {\displaystyle \mathop {\text{ min }}_{d}} &{} {f(x^k) + {\nabla f(x^k)}^T{d} + \frac{1}{2}{d}^T{\nabla ^2_{xx} L(x^k, \lambda ^k)}{d}} \\ {\mathop {\text{ subject } \text{ to }}} &{} {\begin{array}{lcl} {\nabla c_i(x^k)}^T{d} + c_i(x^k) &{}\le 0,&{} \quad i = 1, \ldots , m \\ {\nabla c_i(x^k)}^T{d} + c_i(x^k) &{} = 0, &{}\quad i = m+1, \ldots , m+h. \end{array}} \end{array} \end{aligned}$$
(57)

Very fast and efficient algorithms exist for solving the above problem; they produce the vector \(d^k\), the multipliers \(\lambda ^{k+1}\) associated with the linearized constraints, and an estimate of the active constraints.Footnote 6 The new point is then obtained as \(x^{k+1} = x^k + d^k\).
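
For the equality-constrained case (\(m = 0\)), the subproblem (57) reduces to an equality-constrained quadratic program whose Karush–Kuhn–Tucker conditions form a linear system in \((d, \lambda ^{k+1})\). The following sketch (ours; the function names and the test problem are illustrative assumptions) computes one SQP step this way:

```python
import numpy as np

def sqp_eq_step(grad_f, H_lag, c_eq, A_eq):
    """One SQP step, equality constraints only: solve the KKT system of the
    quadratic subproblem (57),
        [ H   A^T ] [ d   ]   [ -grad_f ]
        [ A    0  ] [ lam ] = [ -c      ],
    where H is the Hessian of the Lagrangian and the rows of A are the
    constraint gradients, all evaluated at the current iterate x^k."""
    n, p = H_lag.shape[0], A_eq.shape[0]
    K = np.block([[H_lag, A_eq.T],
                  [A_eq, np.zeros((p, p))]])
    rhs = np.concatenate([-grad_f, -c_eq])
    sol = np.linalg.solve(K, rhs)
    return sol[:n], sol[n:]  # step d and new multiplier estimate

# Example: min x1^2 + x2^2  s.t.  x1 + x2 - 1 = 0, from x^k = (0, 0):
d, lam = sqp_eq_step(grad_f=np.array([0.0, 0.0]),
                     H_lag=2 * np.eye(2),
                     c_eq=np.array([-1.0]),
                     A_eq=np.array([[1.0, 1.0]]))
print(d, lam)  # d = (0.5, 0.5): one step reaches the solution exactly
```

On a problem that is itself an equality-constrained quadratic program, a single step recovers the exact solution, which is consistent with the fast local convergence of SQP discussed below.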

Under specific assumptions, it is possible to show that if \(x^*\) is a local solution of Problem (39) at which, for some \(\lambda ^*\), the Karush–Kuhn–Tucker conditions are satisfied, and if \((x^k,\lambda ^k)\) is sufficiently close to \((x^*, \lambda ^*)\), then there is a local solution of the subproblem (57) whose active set is the same as the active set of the nonlinear optimization problem (39) at \(x^*\) [55].

The correct identification of the active constraints when \(x^k\) is sufficiently close to \(x^*\) is at the basis of the proof of local convergence for the SQP method.