1 Introduction

Penalty methods represent an important class of techniques for solving constrained optimization problems. In their simplest form, the penalty functions on which they are based are composed of two parts: the original objective function and a penalty term, usually multiplied by a positive scalar called the penalty parameter, which penalizes the violation of the constraints. Using penalty functions, the original constrained optimization problem can be transformed into an “equivalent” unconstrained problem. Two major issues must be taken into account when constructing such penalty functions: exactness and differentiability. Roughly speaking, exactness requires that a global or local solution of the unconstrained problem, or a stationary point of it, correspond to a global or local minimum of the constrained problem or to a point satisfying the Karush–Kuhn–Tucker conditions. Using the Euclidean norm, it is quite simple to construct continuously differentiable, or at least differentiable, penalty functions, as long as the functions in the original problem are sufficiently smooth. However, in these cases the penalty parameter must be driven to \(+\infty \), thus generating numerical instability. Using the 1–norm, instead, it is possible to construct penalty functions that are exact, that is, they do not require the penalty parameter to go to \(+\infty \). The difficulty in this case arises from the non–differentiability of the resulting function.

Recently, Sergeyev introduced a new approach to infinite and infinitesimal numbers. The proposed numeral system is based on \(\textcircled {1}\), the number of elements of \(\mathrm{\mathrm I}\!\mathrm{\mathrm N}\), the set of natural numbers. We refer the reader to Chap. 1 for insights on the arithmetic of infinity and the properties of \(\textcircled {1}\). Here we want to stress that \(\textcircled {1}\) is not a symbol and is not used to perform symbolic calculations. In fact, the new numeral \(\textcircled {1}\) is a natural number, and it has both cardinal and ordinal properties, exactly as the “standard”, finite natural numbers. Moreover, the proposed approach is far apart from non–standard analysis, as clearly shown in [30]. A comprehensive description of the grossone–based methodology can also be found in [29].

In this chapter we discuss the use of \(\textcircled {1}\) in constructing exact differentiable penalty functions for the case of equality constraints only, the general case of equality and inequality constraints, and quadratic problems. Using this novel penalty function, it is possible to recover the solution of the constrained problem from the finite term (in its \(\textcircled {1}\) expansion) of the optimal solution of the unconstrained problem. Moreover, the Lagrangian dual variables are also obtained automatically and at no additional cost, simply by considering the \(\textcircled {1}^{-1}\) grossdigits in the expansion of the constraints in terms of \(\textcircled {1}\).

While this chapter concentrates only on the use of \(\textcircled {1}\) to define novel exact penalty functions for constrained optimization problems, it must be noted that the use of \(\textcircled {1}\) has been beneficial in many other areas of optimization. Already in [8], the authors demonstrated how the classical simplex method for linear programming can be modified using \(\textcircled {1}\) to overcome the difficulties due to degenerate steps. Along this line of research, more recently in [4], the authors proposed the Infinitely-Big-M method, a revisitation of the Big–M method for the Infinity Computer. Various different optimization problems have been successfully tackled using this new methodology: multiobjective optimization problems [3, 5, 6, 21], the use of negative curvature directions in large-scale unconstrained optimization [11, 12], and variable metric methods in nonsmooth optimization [16]. Recently, this computational methodology has also been utilized in the field of Machine Learning, allowing the construction of new spherical separations for classification problems [2] and of novel sparse Support Vector Machines [10]. Furthermore, the use of \(\textcircled {1}\) has given rise to a variety of applications in several fields of pure and applied mathematics, providing new and alternative approaches. Here we only mention numerical differentiation [26], ordinary differential equations [1, 20, 31], hyperbolic geometry [23], infinite series and the Riemann zeta function [25, 27], biology [28], and cellular automata [7].

We now briefly describe our notation. All vectors are column vectors and will be indicated with lower case Latin letters (x, y, \(\ldots \)). Subscripts indicate components of a vector, while superscripts are used to identify different vectors. Matrices will be indicated with upper case Latin letters (A, B, \(\ldots \)). The set of real numbers and the set of nonnegative real numbers will be denoted by \(\mathrm{\mathrm I}\!\!\!\mathrm{\mathrm R}\) and \(\mathrm{\mathrm I}\!\!\!\mathrm{\mathrm R}_+\) respectively. The space of the n–dimensional vectors with real components will be indicated by \(\mathrm{\mathrm I}\!\!\!\mathrm{\mathrm R}^n\), and the space of the \(m\times n\) matrices with real components by \(\mathrm{\mathrm I}\!\!\!\mathrm{\mathrm R}^{m\times n}\). Superscript \(^T\) indicates transpose. The scalar product of two vectors x and y in \(\mathrm{\mathrm I}\!\!\!\mathrm{\mathrm R}^n\) will be denoted by \({x}^T{y}\). The norm of a vector x will be indicated by \(\left\| x \right\| \). Given \(f:\mathrm{\mathrm I}\!\!\!\mathrm{\mathrm R}^n \rightarrow \mathrm{\mathrm I}\!\!\!\mathrm{\mathrm R}\), the gradient \(\nabla f(x)\) of f at a point \(x\in \mathrm{\mathrm I}\!\!\!\mathrm{\mathrm R}^n\) is a column vector with \(\left[ \nabla f(x)\right] _j = \frac{\partial f(x)}{\partial x_j}\).

For what is necessary in this chapter, in this new positional numeral system with base \(\textcircled {1}\) a value C is expressed as

$$ C = C^{(1)}\textcircled {1} + C^{(0)} + C^{(-1)}\textcircled {1}^{-1} + C^{(-2)}\textcircled {1}^{-2} + \cdots $$

Here and throughout, the symbol \(:=\) indicates that the term on its left is defined by the expression on its right, while \(=:\) indicates that the term on its right is defined by the expression on its left.
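To fix ideas, the following Python sketch models this truncated positional representation: a grossnumber is stored as a map from grosspowers to grossdigits, and sums and products act on the grossdigits while keeping different grosspowers separated. It is only an illustration of the representation used in this chapter (the class name and interface are ours, and it is not the arithmetic of the Infinity Computer).

```python
# Minimal illustrative model of the positional representation with base (1):
# a number is a dictionary {grosspower: grossdigit}; this is NOT the Infinity
# Computer arithmetic, just a sketch sufficient for the examples in this chapter.

class Gross:
    def __init__(self, digits=None):
        # e.g. Gross({1: 2.0, 0: 3.0, -1: 0.5}) represents 2(1) + 3 + 0.5(1)^-1
        self.d = {p: c for p, c in (digits or {}).items() if c != 0.0}

    def __add__(self, other):
        out = dict(self.d)
        for p, c in other.d.items():
            out[p] = out.get(p, 0.0) + c
        return Gross(out)

    def __mul__(self, other):
        # grosspowers add, grossdigits multiply; digits of different powers never mix
        out = {}
        for p, a in self.d.items():
            for q, b in other.d.items():
                out[p + q] = out.get(p + q, 0.0) + a * b
        return Gross(out)

    def digit(self, power):
        """Grossdigit associated with a given grosspower (0.0 if absent)."""
        return self.d.get(power, 0.0)

    def __repr__(self):
        terms = [f"{c}*G^{p}" for p, c in sorted(self.d.items(), reverse=True)]
        return " + ".join(terms) if terms else "0"


# (2 + 3(1)^-1) * (1 + (1)^-1) = 2 + 5(1)^-1 + 3(1)^-2
x = Gross({0: 2.0, -1: 3.0})
y = Gross({0: 1.0, -1: 1.0})
print(x * y)              # 2.0*G^0 + 5.0*G^-1 + 3.0*G^-2
print((x * y).digit(-1))  # 5.0
```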

2 Exact Penalty Methods

In this section we will utilize the novel approach to infinite and infinitesimal numbers proposed by Sergeyev (see Footnote 1) to construct exact differentiable penalty functions for nonlinear optimization problems.

Consider the constrained optimization problem

$$\begin{aligned} \begin{array}{ll} {\displaystyle \mathrm{min}_{x}} & {f(x)} \\ { \mathrm{subject\,to}} & c_i(x) \le 0, \quad i = 1, \ldots , {m}, \\ & c_i(x) = 0, \quad i = {m+1}, \ldots , m + h \end{array} \end{aligned}$$
(1)

where \(f:\mathrm{\mathrm I}\!\!\!\mathrm{\mathrm R}^n \rightarrow \mathrm{\mathrm I}\!\!\!\mathrm{\mathrm R}\) is the objective function and \(c_i:\mathrm{\mathrm I}\!\!\!\mathrm{\mathrm R}^n \rightarrow \mathrm{\mathrm I}\!\!\!\mathrm{\mathrm R}, i = 1, \ldots , {m+h}\) are the constraints defining the feasible region X

$$ X :=\left\{ x \in \mathrm{\mathrm I}\!\!\!\mathrm{\mathrm R}^n: c_i(x) \le 0,\; i = 1, \ldots , {m}, \ \mathrm{and}\ c_i(x) = 0,\; i = {m+1}, \ldots , m + h \right\} . $$

For simplicity, we will assume that all the functions are at least twice continuously differentiable.

Let \(\bar{x}\) be a feasible point for the above problem. The set of active constraints at \(\bar{x}\) is defined as

$$ {\mathcal A}\left( \bar{x}\right) :=\Bigl \{i: 1 \le i \le m, c_i(\bar{x}) = 0 \Bigr \} \cup \Bigl \{ m+1, \ldots , m + h\Bigr \}. $$

In nonlinear optimization a key role is played by the Constraint Qualification conditions, which ensure that the tangent cone and the cone of linearized feasible directions coincide and allow necessary and sufficient optimality conditions to be expressed in terms of the well known Karush–Kuhn–Tucker conditions (see Footnote 2). For the reader’s convenience, we recall here the fundamental Linear Independence Constraint Qualification that will be heavily utilized, in this form or in a modified form, in this chapter.

Definition 1

The Linear Independence Constraint Qualification (LICQ) condition is said to hold true at \(\bar{x}\in X\) if the gradients of the active constraints at \(\bar{x}\) are linearly independent.

Note that weaker Constraint Qualification conditions can be imposed [32]. In [34] various Constraint Qualification conditions are stated and categorized into four levels by their relative strengths from weakest (less stringent) to strongest (more stringent, but easier to check).

The Lagrangian function associated with Problem (1) is given by

$$\begin{aligned} L(x,\lambda ) :=f(x) + \sum _{i=1}^{m+h} \lambda _i c_i(x) . \end{aligned}$$
(2)

Necessary and sufficient optimality conditions can be written in terms of the Lagrangian function. If \(x^*\in X\) is a local minimum of Problem (1) at which the LICQ condition holds true, then there exists a vector \(\lambda ^*\) such that the pair \(\left( x^*, \lambda ^*\right) \) is a Karush–Kuhn–Tucker point.
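For later reference, and using the Lagrangian function (2), the Karush–Kuhn–Tucker conditions for Problem (1) can be stated explicitly; we report here the standard formulation, for the reader’s convenience: there exists \(\lambda ^* \in \mathrm{\mathrm I}\!\!\!\mathrm{\mathrm R}^{m+h}\) such that

$$\begin{aligned}&\nabla f(x^*) + \sum _{i=1}^{m+h} \lambda ^*_i \nabla c_i(x^*) = 0, \\&c_i(x^*) \le 0, \quad \lambda ^*_i \ge 0, \quad \lambda ^*_i \, c_i(x^*) = 0, \quad i = 1, \ldots , m, \\&c_i(x^*) = 0, \quad i = m+1, \ldots , m + h. \end{aligned}$$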

Different algorithms have been proposed in the literature for finding a local minimum of Problem (1). Among the most effective methods, penalty methods play a crucial role, obtaining the solution of the constrained problem by solving a single unconstrained optimization problem or a sequence of them.

Let \(\delta : \mathrm{\mathrm I}\!\!\!\mathrm{\mathrm R}^n \rightarrow \mathrm{\mathrm I}\!\!\!\mathrm{\mathrm R}_+\) be a function, continuously differentiable when possible, such that

$$ \delta (x) \left\{ \begin{array}{ll} = 0 & \mathrm{if}\ x \in X \\ > 0 & \mathrm{otherwise.} \end{array}\right. $$

Then the constrained optimization problem (1) can be replaced by the following unconstrained problem

$$\begin{aligned} \mathrm{min}_{x}\ {\psi ( x, \mu ),} \end{aligned}$$
(3)

where

$$\begin{aligned} \psi (x,\mu ) :=f(x) + \mu \delta (x), \end{aligned}$$
(4)

and \(\mu \) is a positive real number. Different choices for the function \(\delta (x)\) lead to different penalty methods. A convenient and widely utilized choice is

$$\begin{aligned} \delta (x) = \displaystyle \frac{1}{2}\sum _{i=1}^m \mathrm{max} \bigl \{c_{i}(x), 0 \bigr \}^2 + \frac{1}{2}\sum _{i=m+1}^{m+h} \left( c_{i}(x)\right) ^2 \end{aligned}$$
(5)

which is differentiable but not twice differentiable. Therefore, this choice for \(\delta (x)\) does not allow the use of some of the most effective algorithms available for unconstrained optimization.
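As a concrete illustration of (4) and (5), the following sketch evaluates the penalty term and the corresponding function \(\psi (x, \mu )\) for a generic problem described by lists of inequality and equality constraint functions; the function names and the problem data at the end are purely illustrative choices of ours.

```python
import numpy as np

def delta(x, ineq, eq):
    """Penalty term (5): ineq and eq are lists of constraint functions c_i."""
    viol_ineq = sum(max(c(x), 0.0) ** 2 for c in ineq)
    viol_eq = sum(c(x) ** 2 for c in eq)
    return 0.5 * viol_ineq + 0.5 * viol_eq

def psi(x, mu, f, ineq, eq):
    """Penalty function (4): original objective plus mu times the penalty term."""
    return f(x) + mu * delta(x, ineq, eq)

# Illustrative data:  min x1 + x2  subject to  x1^2 + x2^2 - 2 = 0  and  x1 - 1 <= 0
f = lambda x: x[0] + x[1]
eq = [lambda x: x[0] ** 2 + x[1] ** 2 - 2.0]
ineq = [lambda x: x[0] - 1.0]
print(psi(np.array([0.0, 0.0]), 10.0, f, ineq, eq))   # 0 + 10 * (0.5 * 4) = 20.0
```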

One key issue in exterior penalty methods is exactness. Roughly speaking, the penalty function is exact if, for some finite value of the parameter \(\mu \), a local (global) minimum of it corresponds to a local (global) minimum point of the constrained problem. As noted in [14] this characterization is satisfactory only when both the constrained problem and the penalty function are convex, while in the non–convex case a more precise definition of exactness is necessary [13, 14].

Unfortunately, for the penalty function (5) this exactness property does not hold, not even in the case of equality constraints only, that is, when \(m=0\). In [15, p. 279] a simple 1–dimensional counter-example is reported, showing that the solution of the constrained problem is only obtained when \(\mu \rightarrow +\infty \). Exact penalty functions can be constructed using the 1–norm instead of the squared 2–norm utilized in (5) to penalize the violation of the constraints. However, in these cases the resulting function is nondifferentiable, and ad-hoc methods must be utilized, for which convergence is often quite slow. Furthermore, exact differentiable penalty functions can be constructed [13] by introducing in the objective function additional terms related to first order optimality conditions, thus making the objective function more complicated to manage.

Sequential penalty methods require the solution of a sequence of minimization problems with increasing values of the parameter \(\mu \), as shown in Algorithm 1.

Algorithm 1 (sequential penalty method)

In the algorithm above, the point \(x^{\mu _{k}}\) obtained at iteration k can be used as the starting point for the minimization problem at iteration \(k+1\). Note that Step 4 cannot, in general, be completed in a finite number of steps and, hence, the algorithm must be modified so that, at each iteration, only an approximation of the optimal solution is computed.
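A minimal sketch of such a sequential scheme is reported below, using SciPy’s unconstrained solver for the inner minimization (Step 4 is therefore only approximated, in the spirit of the remark above); the update rule for \(\mu \), the stopping test and the problem data are illustrative choices of ours, not the actual statement of Algorithm 1.

```python
import numpy as np
from scipy.optimize import minimize

def sequential_penalty(f, delta, x0, mu0=1.0, growth=10.0, tol=1e-8, max_outer=20):
    """Sketch of a sequential penalty method: approximately minimize
    psi(x, mu_k) = f(x) + mu_k * delta(x) for increasing mu_k,
    warm-starting each inner solve at the previous point."""
    x, mu = np.asarray(x0, dtype=float), mu0
    for _ in range(max_outer):
        res = minimize(lambda z: f(z) + mu * delta(z), x, method="BFGS")
        x = res.x                     # approximate minimizer of psi(., mu_k)
        if delta(x) <= tol:           # illustrative stopping test: almost feasible
            break
        mu *= growth                  # increase the penalty parameter
    return x, mu

# Example: min -x1 - x2  subject to  x1^2 + x2^2 - 1 = 0 (the example of Sect. 3)
f = lambda x: -x[0] - x[1]
delta = lambda x: 0.5 * (x[0] ** 2 + x[1] ** 2 - 1.0) ** 2
x_approx, mu_final = sequential_penalty(f, delta, x0=[0.5, 0.0])
print(x_approx)    # approaches [1/sqrt(2), 1/sqrt(2)] as mu grows
```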

In [8] a novel exact differentiable penalty method is introduced using the numeral grossone. In the next three sections, the equality constrained case, the general case with equality and inequality constraints, and the quadratic case will be discussed. Finally, a simple new non–monotone algorithmic scheme is proposed for the minimization of penalty functions based on \(\textcircled {1}\).

3 Equality Constraints Case

Consider the constrained optimization problem with only equality constraints (that is \(m = 0\)):

$$\begin{aligned} \begin{array}{ll} {\displaystyle \mathrm{min}_{x}} & {f(x)} \\ { \mathrm{subject\,to}} & c_i(x) = 0, \quad i = 1, \ldots , {h} . \end{array} \end{aligned}$$
(6)

Let

$$ \delta (x) = \frac{1}{2}\sum _{i= 1}^{ h} \left( c_{i}(x)\right) ^2 $$

and let \(\psi ( x, \mu )\) be given by (4). In this case, it is possible to show [15, Theorem 12.1.1] that for the sequence constructed by Algorithm 1

  1. \( \{ \psi ( x^{\mu _{k}}, \mu _k) \}\) is monotonically non–decreasing,

  2. \( \{ \delta (x^{\mu _{k}}) \}\) is monotonically non–increasing,

  3. \( \{f(x^{\mu _{k}})\}\) is monotonically non–decreasing.

Moreover, \(\displaystyle \lim _k c_i(x^{\mu _{k}}) = 0\), \(i = 1, \ldots , {h}\) and each accumulation point \(x^*\) of the sequence \(\{x^{\mu _{k}}\}_k\) solves Problem (6).

For the equality constrained case, the penalty function proposed in [8] is defined as follows:

$$\begin{aligned} \mathrm{min}_x \quad \psi (x) :=f(x) + \frac{\textcircled {1}}{2} \sum _{i= 1}^{ h} \left( c_{i}(x)\right) ^2. \end{aligned}$$
(7)

Under the hypothesis that any generic point x, the function value f(x), and the generic constraint \(c_i(x)\), as well as the gradients \(\nabla f(x)\) and \(\nabla c_i(x)\), have a finite part (i.e., the grossdigit corresponding to grosspower 0) and, possibly, infinitesimal terms only, it is possible to show [8, Theorem 3.3] that if \(x^*\) is a stationary point for (7) then \(\left( x^{*(0)}, \lambda ^*\right) \) is a KKT point for (6), where \(x^{*(0)}\) is the finite term in the representation of \(x^*\) and, for \(i = 1, \ldots , {h}\), \(\lambda ^*_i = c_i^{(-1)}(x^{*})\), where \(c_i^{(-1)}(x^*)\) is the grossdigit corresponding to the grosspower \(-1\) in the representation of \(c_i(x^*)\).

The above result strongly relies on the fact that at \(x^*\) the LICQ condition is satisfied. The proof is based on the observation that grossdigits “do not mix”, that is, they are kept well separated in the computations. Therefore, setting a grossnumber to 0 is equivalent to setting to zero all the grossdigits in its representation.

The fundamental aspect of this result is that, by solving the unconstrained optimization problem (whose objective function is twice continuously differentiable), an optimal solution of the constrained problem is obtained. Moreover, the multipliers associated with the equality constraints are automatically recovered at no additional cost, just from the representation of \(c_i(x^*)\) in terms of powers of \(\textcircled {1}\).

In [9] two simple examples are discussed, showing the importance of constraint qualification conditions. Here, we propose a novel example, similar in spirit to the first example discussed in [9]. The problem was originally studied in [15, pp. 279–280] to show the effectiveness and the weaknesses of sequential penalty methods.

Consider the problem

$$\begin{aligned} \begin{array}{ll} {\displaystyle \mathrm{min}_{x}} & {-x_1-x_2} \\ { \mathrm{subject\,to}} & x_1^2 + x_2^2 - 1 = 0. \end{array} \end{aligned}$$
(8)

The feasible region and the optimal solution for this problem are shown in Fig. 1.

Fig. 1 Feasible region and optimal solution for Problem (8)

The Lagrangian function is given by

$$ L(x,\lambda ) = -x_1-x_2 + \lambda \left( x_1^2 + x_2^2 - 1\right) $$

and the optimal solution is \(x^* = \displaystyle \left[ \begin{array}{c}\displaystyle \frac{1}{\sqrt{2}} \\ \displaystyle \frac{1}{\sqrt{2}} \end{array}\right] \). Moreover, it is not difficult to show that the pair \(\left( x^*, \lambda ^*= \displaystyle \frac{1}{\sqrt{2}}\right) \) satisfies the KKT conditions.

In Fig. 2 the contour plots for the function \(\psi ( x, \mu )\) for different values of \(\mu \) are shown.

Fig. 2 Contour plots of \(\psi (x, \mu )\) for \(\mu = 0.1, 1, 5, 20 \)

The penalty function we construct is:

$$\begin{aligned} \psi \left( x, \textcircled {1}\right) :=-x_1-x_2 + \frac{\textcircled {1}}{2} \left( x_1^2 + x_2^2 - 1\right) ^2. \end{aligned}$$
(9)

The First–Order Optimality Conditions \(\nabla \psi \left( x, \textcircled {1}\right) = 0\) are:

$$\begin{aligned} \left\{ \begin{array}{l} -1 + 2 \textcircled {1} x_1 \left( x_1^2 + x_2^2 - 1\right) = 0,\\ \\ -1 + 2 \textcircled {1} x_2 \left( x_1^2 + x_2^2 - 1\right) = 0. \end{array}\right. \end{aligned}$$
(10)

By symmetry,

$$ x_1^* = x_2^* = R = R^{(0)} + \textcircled {1}^{-1} R^{(-1)} + \textcircled {1}^{-2} R^{(-2)} + \ldots $$

and from (10) we have that

$$\begin{aligned} -1 + 4 \textcircled {1} R \left[ R^2 - \frac{1}{2}\right] = 0. \end{aligned}$$
(11)

From (11), by equating the term of order \(\textcircled {1}\) to 0, we obtain:

$$ 4 R^{(0)} \left[ \left( R^{(0)}\right) ^2 - \frac{1}{2}\right] = 0 $$

from which either \(R^{(0)} = \frac{1}{\sqrt{2}}\) or \(R^{(0)} = -\frac{1}{\sqrt{2}}\) or \(R^{(0)} = 0\). The first choice for \(R^{(0)}\) corresponds to a minimum of the unconstrained problem (and also of the constrained problem (8)), while the second corresponds to a maximum. Finally, the third choice of \(R^{(0)}\) corresponds to a spurious solution whose presence is due to the fact that, at the point \(\hat{x}= \left[ \begin{array}{c} 0 \\ 0 \end{array}\right] \), the LICQ condition is not satisfied. In fact \(\nabla c_1(\hat{x}) = \left[ \begin{array}{c} 0 \\ 0 \end{array}\right] \). Fixing now \(R^{(0)} = \frac{1}{\sqrt{2}}\), and equating to 0 the finite terms in (11), i.e., the terms corresponding to \(\textcircled {1}^0\), we obtain:

$$ -1 + 4 \frac{1}{\sqrt{2}} 2 \frac{1}{\sqrt{2}} R^{(-1)} = 0, $$

from which \(R^{(-1)} = \frac{1}{4}\) and hence

$$ x^*_1 = x^*_2 = R = \frac{1}{\sqrt{2}} + \frac{1}{4} \textcircled {1}^{-1} + \cdots $$

where all remaining terms are infinitesimal of higher order.

Moreover,

$$\begin{aligned} \left( x^*_1\right) ^2 + \left( x^*_2\right) ^2 - 1&= 2R^2 - 1 = 2 \left[ \frac{1}{2}+ \frac{2}{4\sqrt{2}} \textcircled {1}^{-1} + \frac{1}{16} \textcircled {1}^{-2} + \cdots \right] -1 \\&= \frac{1}{\sqrt{2}} \textcircled {1}^{-1} + \cdots \end{aligned}$$

where again we neglect infinitesimals of order higher than \( \textcircled {1}^{-1}\).

As expected from the theory, the \(\textcircled {1}^{-1}\) term in the representation of the (unique) constraint provides automatically, and at no additional cost, the value of the Lagrangian multiplier.
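This derivation can also be checked numerically by replacing \(\textcircled {1}\) with a large but finite value \(\mu \): at the corresponding stationary point one expects \(x_1 = x_2 \approx \frac{1}{\sqrt{2}} + \frac{1}{4}\mu ^{-1}\) and \(\mu \left( x_1^2+x_2^2-1\right) \approx \frac{1}{\sqrt{2}} = \lambda ^*\). The following sketch (a check of ours, solving the scalar stationarity condition (11) with \(\mu \) in place of \(\textcircled {1}\)) confirms this:

```python
import numpy as np
from scipy.optimize import brentq

mu = 1.0e6                              # large finite surrogate for (1)
# Stationarity condition (11) with (1) replaced by mu:  -1 + 4*mu*R*(R^2 - 1/2) = 0
g = lambda R: -1.0 + 4.0 * mu * R * (R ** 2 - 0.5)

R = brentq(g, 0.7, 0.8)                 # the root close to 1/sqrt(2)
lam = mu * (2.0 * R ** 2 - 1.0)         # mu times the constraint value

print(R, 1.0 / np.sqrt(2.0) + 0.25 / mu)   # R matches 1/sqrt(2) + (1/4) * mu^(-1)
print(lam, 1.0 / np.sqrt(2.0))             # multiplier estimate matches lambda* = 1/sqrt(2)
```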

Here we want to stress the importance of Constraint Qualification conditions which, when not satisfied, could lead to spurious results, as shown by this example.

This situation is even better clarified in the second example from [9]:

$$\begin{aligned} \begin{array}{ll} {\displaystyle \mathrm{min}_{x}} & {x_1+x_2} \\ { \mathrm{subject\,to}} & \left( x_1^2 + x_2^2 - 2\right) ^2 = 0. \end{array} \end{aligned}$$
(12)

Here the objective function and the feasible region are the same as in the first example in [9], and the optimal solution is \(x^* = \left[ \begin{array}{c} -1 \\ -1 \end{array}\right] \), which for that first formulation is associated with the Lagrangian multiplier \(\lambda ^*= \frac{1}{2}\). However, in this case

$$ \nabla c_1(x) = \left[ \begin{array}{c} 4x_1 \left( x_1^2 + x_2^2 - 2\right) \\ \\ 4x_2 \left( x_1^2 + x_2^2 - 2\right) \end{array}\right] $$

which, calculated at the optimal solution, gives:

$$ \nabla c_1\left( \left[ \begin{array}{c} -1 \\ -1 \end{array}\right] \right) = \left[ \begin{array}{c} 0 \\ 0 \end{array}\right] . $$

Therefore, the LICQ condition is not satisfied and the Karush–Kuhn–Tucker conditions

$$\begin{aligned} \left\{ \begin{array}{c} 1 - 4 \lambda x_1 \left( x_1^2 + x_2^2 - 2\right) = 0\\ 1 - 4 \lambda x_2 \left( x_1^2 + x_2^2 - 2\right) = 0 \\ \left( x_1^2 + x_2^2 - 2\right) ^2 = 0 \end{array}\right. \end{aligned}$$
(13)

have no solution. In this case (see [9]) the optimal solution \(x^*= \left[ \begin{array}{c} - 1 \\ - 1 \end{array} \right] \) of Problem (12) cannot be recovered from the exact penalty function

$$ \psi (x,\textcircled {1}) = x_1+x_2 + \frac{\textcircled {1}}{2} \left( x_1^2 + x_2^2 - 2\right) ^4 $$

using first order optimality conditions.

4 Equality and Inequality Constraints Case

Consider now the more general nonlinear optimization problem (1) that includes equality and inequality constraints. In this case, the exact penalty function proposed in [8] is given by

$$\begin{aligned} \psi (x,\textcircled {1}) = f(x) + \frac{\textcircled {1}}{2} \displaystyle \sum _{i=1}^m \max \{0, c_{i}(x)\}^{2} + \frac{\textcircled {1}}{2} \displaystyle \sum _{i= m+1}^{ m+h} c_{i}(x)^2. \end{aligned}$$
(14)

In order to derive the correspondence between stationary points of (14) and KKT points for Problem (1) a Modified LICQ condition is introduced in [8].

Definition 2

Let \(\bar{x}\in \mathrm{\mathrm I}\!\!\!\mathrm{\mathrm R}^n\). The Modified LICQ (MLICQ) condition is said to hold true at \(\bar{x}\) if the vectors

$$ \Bigl \{\nabla c_i(\bar{x}),\ {i: 1 \le i \le m \ \mathrm{and}\ c_i(\bar{x})\ge 0} \Bigr \} \cup \Bigl \{\nabla c_i(\bar{x}),\ {i = m+1, \ldots , m+h} \Bigr \} $$

are linearly independent.

If the above condition is satisfied and \(x^*\) is a stationary point for the unconstrained problem

$$ \min _x \psi (x,\textcircled {1}) $$

then it is again possible to show [8, Theorem 3.4] that the pair \(\left( x^{*(0)}, \lambda ^*\right) \) is a KKT point for Problem (1), where \(x^{*(0)}\) is the finite term in the representation of \(x^*\) and, for \(i = 1, \ldots , {m+h}\), \(\lambda ^*_i = c_i^{(-1)}(x^{*})\), where \(c_i^{(-1)}(x^*)\) is the grossdigit corresponding to the grosspower \(-1\) in the representation of \(c_i(x^*)\).
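The recovery of the multipliers can again be mimicked numerically by replacing \(\textcircled {1}\) in (14) with a large but finite \(\mu \): since \(c_i(x^*)\) has a \(\lambda ^*_i \textcircled {1}^{-1}\) leading term, the quantities \(\mu \max \{0, c_i(x^{\mu })\}\) (inequalities) and \(\mu \, c_i(x^{\mu })\) (equalities) approximate \(\lambda ^*_i\). The sketch below illustrates this correspondence on a small problem of our own choosing; it is not an implementation of the method of [8].

```python
import numpy as np
from scipy.optimize import minimize

mu = 1.0e4   # large finite surrogate for (1), for illustration only

# Illustrative problem:  min (x1+1)^2 + (x2+1)^2
#                        s.t. x1 - 0.2 <= 0,   x1 + x2 - 1 = 0,
# whose KKT multipliers are lambda1* = 1.2 (inequality) and lambda2* = -3.6 (equality).
f  = lambda x: (x[0] + 1.0) ** 2 + (x[1] + 1.0) ** 2
c1 = lambda x: x[0] - 0.2             # inequality constraint, c1(x) <= 0
c2 = lambda x: x[0] + x[1] - 1.0      # equality constraint,  c2(x)  = 0

psi  = lambda x: f(x) + 0.5 * mu * max(0.0, c1(x)) ** 2 + 0.5 * mu * c2(x) ** 2
grad = lambda x: np.array([2.0 * (x[0] + 1.0) + mu * max(0.0, c1(x)) + mu * c2(x),
                           2.0 * (x[1] + 1.0) + mu * c2(x)])

res = minimize(psi, x0=np.zeros(2), jac=grad, method="BFGS")
x = res.x
print(x)                         # close to the constrained solution (0.2, 0.8)
print(mu * max(0.0, c1(x)))      # close to lambda1* =  1.2
print(mu * c2(x))                # close to lambda2* = -3.6
```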

5 Quadratic Case

In this section we apply the new exact penalty function to the quadratic problem

$$\begin{aligned} \begin{array}{ll} {\displaystyle \mathrm{min}_{x}} & {\frac{1}{2}{x}^T{Mx}+ {q}^T{x}} \\ { \mathrm{subject\,to}} & Ax = b, \\ & x \ge 0, \end{array} \end{aligned}$$
(15)

where \(M \in \mathrm{\mathrm I}\!\!\!\mathrm{\mathrm R}^{n\times n}\) is positive definite, \(A \in \mathrm{\mathrm I}\!\!\!\mathrm{\mathrm R}^{m\times n}\) with \(\mathrm {rank}(A) = m\), \(q \in \mathrm{\mathrm I}\!\!\!\mathrm{\mathrm R}^n\), and \(b \in \mathrm{\mathrm I}\!\!\!\mathrm{\mathrm R}^m\). We assume that the feasible region is not empty.

For this linearly constrained problem, Constraint Qualification conditions are always satisfied and the Karush–Kuhn–Tucker conditions [22] are:

$$\begin{aligned} \begin{array}{rcl} M x + q - A^T u - v & = & 0, \\ Ax - b & = & 0, \\ x &\ge & 0, \\ v &\ge & 0, \\ {x}^T{v} &=& 0. \end{array} \end{aligned}$$
(16)

For this quadratic problem, the new exact penalty function using \(\textcircled {1}\) is given by:

$$\begin{aligned} \psi (x,\textcircled {1}) :=\frac{1}{2}{x}^T{Mx} + {q}^T{x} + \frac{\textcircled {1}}{2} \left\| Ax - b \right\| ^2_2 + \frac{\textcircled {1}}{2} \left\| \max \{0, -x\} \right\| ^2_2 \end{aligned}$$
(17)

and the corresponding unconstrained problem is:

$$\begin{aligned} \min _x \psi (x,\textcircled {1}). \end{aligned}$$
(18)

The first order optimality conditions can be written as follows

$$\begin{aligned} 0 = \nabla \psi (x,\textcircled {1}) = M x + q + \textcircled {1} A^T \left( Ax - b \right) - \textcircled {1} \max \{0, -x\}. \end{aligned}$$
(19)

Lemma 1 in [9] shows that the function

$$ \delta (x) = \frac{1}{2} \left\| Ax - b \right\| ^2_2 + \frac{1}{2} \left\| \max \{0, -x\} \right\| ^2_2 $$

is convex, and \(\nabla \delta (x) = 0\) if and only if \(Ax=b\) and \(x \ge 0\).

Based on this lemma it is possible to show that, if \(x^*\) is a stationary point for the unconstrained problem (18), then \(\left( x^{*(0)},\pi ^*,\mu ^*\right) \) is a Karush–Kuhn–Tucker point for Problem (15), where again \(x^{*(0)}\) is the finite term in the representation of \(x^*\), and, taking into account the linearity of the original constraints,

  • \(\pi ^*= A x^{*(-1)} + b^{(-1)} \), where \(x^{*(-1)}\) (resp. \(b^{(-1)}\)) is the grossdigit corresponding to the grossterm \(\textcircled {1}^{-1}\) in the representation of \(x^*\) (resp. b), and

  • \(\mu ^*_j = \max \{0, -x^{*(-1)}_j\}\).

Once again, the proof is based on the fact that during the computation grossdigits do not mix; a numerical illustration, based on a finite surrogate of \(\textcircled {1}\), is sketched at the end of this section. The results then follow from (19)

  • by first setting to 0 the terms with grosspower \(\textcircled {1}\), which provides the feasibility of \(x^{*(0)}\) (second and third Karush–Kuhn–Tucker conditions in (16)),

  • and then equating to zero the finite terms (that is, the terms with grosspower \(\textcircled {1}^0\)), thus obtaining the first, fourth and last Karush–Kuhn–Tucker conditions in (16).
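To illustrate this recovery numerically, the sketch below solves a small instance of (15) by minimizing (17) with a large finite \(\mu \) in place of \(\textcircled {1}\); comparing (19) with (16), the approximate multipliers are then read off as \(u \approx -\mu \left( Ax^{\mu } - b\right) \) and \(v \approx \mu \max \{0, -x^{\mu }\}\). The problem data are of our own choosing and the sketch is only an illustration, not an implementation of the method of [8, 9].

```python
import numpy as np
from scipy.optimize import minimize

# Illustrative instance of (15):  min 1/2 x^T M x + q^T x  s.t.  A x = b,  x >= 0,
# whose solution is x* = (0, 1) with KKT multipliers (16) u* = -3 and v* = (1, 0).
M = np.array([[2.0, 0.0], [0.0, 2.0]])
q = np.array([-2.0, -5.0])
A = np.array([[1.0, 1.0]])
b = np.array([1.0])

mu = 1.0e5                                    # large finite surrogate for (1)

def psi(x):
    # penalty function (17) with (1) replaced by mu
    return (0.5 * x @ M @ x + q @ x
            + 0.5 * mu * np.sum((A @ x - b) ** 2)
            + 0.5 * mu * np.sum(np.maximum(0.0, -x) ** 2))

def grad(x):
    # gradient (19) with (1) replaced by mu
    return M @ x + q + mu * A.T @ (A @ x - b) - mu * np.maximum(0.0, -x)

res = minimize(psi, x0=np.zeros(2), jac=grad, method="BFGS")
x = res.x
print(x)                          # close to x* = (0, 1), with a tiny constraint violation
print(-mu * (A @ x - b))          # close to u* = -3
print(mu * np.maximum(0.0, -x))   # close to v* = (1, 0)
```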

6 A General Scheme for the New Exact Penalty Function

In the previous sections, we have shown how the solution of the constrained minimization problem (1) can be obtained by solving an unconstrained minimization problem that uses \(\textcircled {1}\). Then, any standard optimization algorithm can be utilized to obtain a stationary point of these problems, from which the solution of the constrained problem as well as the multipliers can be easily obtained.

In this section a simple novel general algorithmic scheme is proposed to solve the unconstrained minimization problems arising from penalizing the constraints using \(\textcircled {1}\), as constructed in the previous sections. The algorithm belongs to the class of non–monotone descent algorithms [17,18,19, 33, 35], and does not require that each newly calculated point be necessarily better than the current one. This property is important since, for a descent algorithm applied to the minimization of \(\psi (x,\textcircled {1})\), once a feasible point is obtained all the points generated subsequently should remain feasible; this, in general, could be quite complicated to ensure in practice.

For simplicity, consider the generic minimization problem:

$$ \min _x f(x) $$

where f is continuously differentiable and

$$\begin{aligned} f(x) = \textcircled {1} f^{(1)}(x) + f^{(0)}(x) + \textcircled {1}^{-1} f^{(-1)}(x) + \ldots \end{aligned}$$
(20)
$$\begin{aligned} \nabla f(x) = \textcircled {1} \nabla f^{(1)}(x) + \nabla f^{(0)}(x) + \textcircled {1}^{-1} \nabla f^{(-1)}(x) + \ldots \end{aligned}$$
(21)

The proposed algorithm utilizes (as customary in non–monotone algorithms) two sequences \(\{t_k\}_k\) and \(\{l_k\}_k\) of nonnegative integers such that

$$\begin{aligned} t_0 = 0, \quad t_{k+1} \le \min \left\{ t_k + 1, T \right\} , \end{aligned}$$
$$\begin{aligned} l_0 = 0, \quad l_{k+1} \le \min \left\{ l_k + 1, L \right\} , \end{aligned}$$

where T and L are two fixed positive integers.

At the generic iteration k, having \(x^k\), it is necessary first to verify if the stopping criterion

$$\begin{aligned} \nabla f^{(1)}(x^k) = 0 \quad \mathrm{and} \quad \nabla f^{(0)}(x^k) = 0 \end{aligned}$$

is satisfied. Otherwise, the next iterate \(x^{k+1}\) is calculated in the following way:

  • if \( \nabla f^{(1)}(x^k) \ne 0\), choose \(x^{k+1}\) such that

    $$\begin{aligned} f^{(1)}(x^{k+1}) \le f^{(1)}(x^{k}) + \sigma \left( \left\| \nabla f^{(1)}(x^k) \right\| \right) , \end{aligned}$$

    and

    $$\begin{aligned} f^{(0)}(x^{k+1}) \le \max _{0 \le j \le l_k} f^{(0)}(x^{k-j}) + \sigma \left( \left\| \nabla f^{(0)}(x^k) \right\| \right) ; \end{aligned}$$
  • If \( \nabla f^{(1)}(x^k) = 0\), choose \(x^{k+1}\) such that

    $$\begin{aligned} f^{(0)}(x^{k+1}) \le f^{(0)}(x^{k}) + \sigma \left( \left\| \nabla f^{(0)}(x^k) \right\| \right) , \end{aligned}$$
    $$\begin{aligned} f^{(1)}(x^{k+1}) \le \max _{0 \le j \le t_k} f^{(1)}(x^{k-j}) \end{aligned}$$

where \(\sigma (.)\) is a forcing function [24].

In other words, when \( \nabla f^{(1)}(x^k) \ne 0\) the infinite term cannot grow by more than a quantity that depends on the norm of \(\nabla f^{(1)}(x^k)\). The finite term, instead, cannot exceed the worst value of \(f^{(0)}(x^{k-j})\) over the previous \(l_k\) steps plus a quantity that depends on the norm of \(\nabla f^{(0)}(x^k)\).

Instead, when \( \nabla f^{(1)}(x^k) = 0\), the finite term \(f^{(0)}(x^{k+1})\) cannot grow by more than a quantity that depends on the norm of \(\nabla f^{(0)}(x^k)\), and the infinite term \(f^{(1)}(x^{k+1})\) cannot exceed the worst of the previous \(t_k\) values \(f^{(1)}(x^{k-j})\).
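A sketch of the acceptance test implied by these conditions is reported below. It only decides whether a candidate point \(x^{k+1}\) is acceptable, given the histories of the grossdigits \(f^{(1)}\) and \(f^{(0)}\); the forcing function used is an illustrative choice of ours, and the sketch is not a complete implementation of the scheme.

```python
import numpy as np

def acceptable(x_k, x_new, f1, f0, grad_f1, grad_f0, hist1, hist0, t_k, l_k,
               sigma=lambda s: 1e-4 * s ** 2):
    """Non-monotone acceptance test for a candidate point x_new, as stated above.

    f1, f0           : grossdigits of f for grosspowers 1 and 0 (callables)
    grad_f1, grad_f0 : their gradients (callables)
    hist1, hist0     : past values [f1(x^0), ..., f1(x^k)] and [f0(x^0), ..., f0(x^k)]
    t_k, l_k         : current memory lengths
    sigma            : a forcing function (this particular choice is only an example)
    """
    g1 = np.linalg.norm(grad_f1(x_k))
    g0 = np.linalg.norm(grad_f0(x_k))

    if g1 > 0.0:
        # infinite term: current value plus a quantity depending on ||grad f1(x^k)||
        ok1 = f1(x_new) <= hist1[-1] + sigma(g1)
        # finite term: worst of the last l_k values plus a quantity depending on ||grad f0(x^k)||
        ok0 = f0(x_new) <= max(hist0[-(l_k + 1):]) + sigma(g0)
    else:
        # finite term: current value plus a quantity depending on ||grad f0(x^k)||
        ok0 = f0(x_new) <= hist0[-1] + sigma(g0)
        # infinite term: not worse than the worst of the last t_k values
        ok1 = f1(x_new) <= max(hist1[-(t_k + 1):])
    return ok1 and ok0
```

In a full implementation, a line search or trust–region step would generate candidate points until one satisfies this test, and the histories and the memory lengths \(t_k\), \(l_k\) would then be updated.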

In order to demonstrate convergence of the above scheme, we need to consider two different cases.

  • Case 1: there exists \(\bar{k}\) such that \(\nabla f^{(1)}(x^k) = 0 \), for all \( k \ge \bar{k}\). Then

    $$\begin{aligned} f^{(1)}(x^{k+1}) \le \max _{0 \le j \le t_k} f^{(1)}(x^{k-j}), \quad \quad k \ge \bar{k}. \end{aligned}$$

    Therefore, in this case:

    $$\begin{aligned} \max _{0 \le i \le T} f^{(1)}(x^{\bar{k}+Tj+i}) \le \max _{0 \le i \le T} f^{(1)}(x^{\bar{k}+T(j-1)+i}) \end{aligned}$$

    and the sequence \(\displaystyle \left\{ \max _{0 \le i \le T} f^{(1)}(x^{\bar{k}+Tj+i}) \right\} _j\) is monotonically non–increasing. Moreover,

    $$\begin{aligned} f^{(0)}(x^{k+1}) \le f^{(0)}(x^{k}) + \sigma \left( \left\| \nabla f^{(0)}(x^k) \right\| \right) , \quad \quad k \ge \bar{k}. \end{aligned}$$

    Assuming that the level sets of \(f^{(1)}\) and \(f^{(0)}\) corresponding to \(x^{0}\) are compact, the sequence \(\{x^{k}\}_k\) has at least one accumulation point \(x^*\), and any accumulation point (according to the second condition) satisfies \(\nabla f^{(0)}(x^*) = 0 \) in addition to \(\nabla f^{(1)}(x^*) = 0 \).

  • Case 2: there exists a subsequence \(j_k\) such that \(\nabla f^{(1)}(x^{j_k}) \ne 0\). In this case we have that

    $$\begin{aligned} f^{(1)}(x^{j_k+1}) \le f^{(1)}(x^{j_k}) + \sigma \left( \left\| \nabla f^{(1)}(x^{j_k}) \right\| \right) . \end{aligned}$$

    Again, taking into account that it is possible that, for some index i bigger than \(j_k\), \(\nabla f^{(1)}(x^{i}) = 0\),

    $$ \max _{ 0\le i \le T} f^{(1)}(x^{j_k+T j +i}) \le \max _{ 0\le i \le T} f^{(1)}(x^{j_k+T(j-1) +i}) + \sigma \left( \left\| \nabla f^{(1)}(x^{j_k}) \right\| \right) $$

    and hence, assuming again that the level sets for \(f^{(1)}(x^{0})\) and \(f^{(0)}(x^{0})\) are compact sets, we have that \(\nabla f^{(1)}(x^{j_k})\) goes to 0. Moreover,

    $$ \max _{ 0\le i \le L} f^{(0)}(x^{j_k+Lj +i}) \le \max _{ 0\le i \le L} f^{(0)}(x^{j_k+L(j-1) +i}) + \sigma \left( \left\| \nabla f^{(0)}(x^{j_k}) \right\| \right) $$

    and hence also \(\nabla f^{(0)}(x^{j_k})\) goes to 0.

7 Conclusions

Penalty methods are an important and widely studied class of algorithms in nonlinear optimization. Using penalty functions, the solution of the original constrained optimization problem can be obtained by solving an unconstrained problem. The main issues with this class of methods are exactness and differentiability. In this chapter we presented some recent developments on penalty functions based on the use of \(\textcircled {1}\). The solution of the proposed unconstrained problem provides not only the solution of the original problem but also the Lagrangian dual variables associated with the constraints, at no additional cost, from the expansion of the constraints in terms of \(\textcircled {1}\). Some simple examples are also reported, showing the effectiveness of the method. Finally, a general non–monotone scheme is presented for the minimization of functions that include \(\textcircled {1}\) grossterms.