1 Introduction

Given a vector x of n components, the \(l_0\) pseudo-norm

$$\begin{aligned} \left\| {x} \right\| _0 :=\mathop {{\mathrm{number~of~nonzero~components~in }}} x , \end{aligned}$$

has often been used in optimization problems arising in various fields. However, the introduction of \(\left\| {x} \right\| _0\) makes these problems extremely hard to solve, so approximations and iterative schemes have been proposed in the scientific literature to solve them efficiently.

The use of this pseudo-norm arises in many different fields, such as machine learning, signal processing, pattern recognition, portfolio optimization, the subset selection problem in regression and elastic-net regularization. Cardinality-constrained optimization problems are difficult to solve, and a common approach is to apply global discrete optimization techniques.

Quite recently, Sergeyev, in a book and in a series of papers, proposed a novel approach to infinite and infinitesimal numbers. By introducing the new numeral grossone (indicated by \(\textcircled {1}\)), defined as the number of elements of the set of natural numbers, Sergeyev demonstrated how it is possible to operate with finite, infinite and infinitesimal quantities using the same arithmetic. This new numerical system allows one to treat infinite and infinitesimal numbers as particular cases of a single structure and offers a new view and an alternative approach to many fundamental aspects of mathematics such as limits, derivatives, sums of series and so on.

The aim of this paper is to show how this new numeral system and in particular \(\textcircled {1}\) can be used in different optimization problems, by replacing the \(l_0\) pseudo-norm \(\left\| {x} \right\| _0\) with

$$\begin{aligned} \left\| {x} \right\| _{0,\textcircled {1}^{-1}} :=\sum _{i=1}^{n} \frac{x_i^2}{x_i^2 + \textcircled {1}^{-1}}. \end{aligned}$$

Indeed, the literature contains many contributions on approximating the \(l_0\) pseudo-norm. For example, in Rinaldi et al. (2010), two new smooth approximations of the \(l_0\) pseudo-norm are presented, and other approximations are recalled in the following.

The paper is organized as follows. In Sect. 2, the new numeral system is presented by describing its main properties: Infinity, Identity and Divisibility. Moreover, the new positional numeral system and the concept of gross-number are discussed. In Sect. 3, the properties of \(\left\| {x} \right\| _0\) are introduced and some approximations proposed in the literature are presented. The definition of \(\left\| {x} \right\| _{0,{\textcircled {1}}^{-1}}\) is then introduced, and it is shown that \(\left\| {x} \right\| _{0}\) and \(\left\| {x} \right\| _{0,{\textcircled {1}}^{-1}}\) coincide in their finite part and may differ only by infinitesimal terms. Two different applications of \(\left\| {x} \right\| _{0,\textcircled {1}^{-1}}\) are presented in Sects. 4 and 5. The first, studied in Sect. 4, concerns elastic net regularization and an algorithm for solving the optimization problem

$$\begin{aligned} \min _x \frac{1}{2}\left\| {Ax - b} \right\| _2^2 + \lambda _0 \left\| {x} \right\| _{0,\textcircled {1}^{-1}} + \frac{\lambda _2}{2} \left\| {x} \right\| _2^2. \end{aligned}$$

In Sect. 5, the newly proposed definition of \(\left\| {x} \right\| _{0,\textcircled {1}^{-1}}\) is used in classification problems using sparse support vector machines (SVMs). In particular, we suggest an interpretation of an updating rule already proposed in the literature based on the KKT (Karush–Kuhn–Tucker) conditions and the expansion of gross-numbers.

We briefly describe some notation used throughout the paper. With \({\text {I}}\!{\text {N}}\), we indicate the set of natural numbers. Given \(n,m \in {\text {I}}\!{\text {N}}\), let \({\text {I}}\!{\text {R}}^n\) be the space of the n-dimensional vectors with real components and let \({\text {I}}\!{\text {R}}^{m\times n}\) be the space of matrices with real elements, m rows and n columns. All vectors are column vectors and are indicated with lower case Latin letters (e.g., x, y, z \( \in {\text {I}}\!{\text {R}}^n\)). Subscripts indicate components of a vector, while superscripts are used to identify different vectors. Matrices are indicated with upper case Roman letters (e.g., A, B \(\in {\text {I}}\!{\text {R}}^{m \times n}\)). If \(A \in {\text {I}}\!{\text {R}}^{m \times n}\), \(A^T_{i \cdot }\) is the ith row of A. The symbol \(\left\| {x} \right\| \) indicates the norm of a vector x. Specific norms or parameters of the norm are indicated with subscripts. The scalar product of two vectors x, y in \({\text {I}}\!{\text {R}}^n\) is denoted by \({x}^T{y}\), while in a generic Hilbert space we use the notation \(\left\langle x,y \right\rangle \). The symbol \(:=\) denotes a definition. The gradient of a continuously differentiable function \(f:{\text {I}}\!{\text {R}}^n \rightarrow {\text {I}}\!{\text {R}}\) at a point \(x\in {\text {I}}\!{\text {R}}^n\) is indicated by \(\nabla f(x)\).

2 The algebra of \(\textcircled {1}\)

The numeral system, originally proposed by Sergeyev (2001, 2009, 2017), is based on the numeral \(\textcircled {1}\) (called grossone) defined as the number of elements of the set \({\text {I}}\!{\text {N}}\). This new definition of an infinite unit allows one to work numerically with infinities and infinitesimals. In particular, the numerical system built upon \(\textcircled {1}\) makes it possible to treat infinite and infinitesimal numbers in a unique framework and to work with all of them numerically.

For instance, using \(\textcircled {1}\)-based numerals, it is possible to execute arithmetic operations with floating-point numbers and to assign concrete infinite and infinitesimal values to variables. Moreover, \(\textcircled {1}\) allows one to count more precisely the number of elements of infinite sets, extending the traditional set theory based on Cantor’s cardinals. For example, the set of even numbers and the set of integers, which traditional cardinality identifies both as countable, have in this new numeral system \(\frac{\textcircled {1}}{2}\) and \(2 \textcircled {1}+1\) elements, respectively.

The new computational methodology has been successfully applied in several fields of pure and applied mathematics, offering new and alternative approaches. Here, we only mention (numerical) differentiation (Sergeyev 2011a), ordinary differential equations (Sergeyev et al. 2016; Amodio et al. 2017), optimization (Cococcioni et al. 2018; De Cosmis and De Leone 2012; De Leone 2018; De Leone et al. 2018; Gaudioso et al. 2018; Sergeyev et al. 2018), hyperbolic geometry (Margenstern 2012), infinite series and the Riemann zeta function (Sergeyev 2009, 2011b), biological processes and cellular automata (D’Alotto 2013). For a survey of the various aspects and applications of \(\textcircled {1}\), we refer the interested reader to Sergeyev (2001, 2010, 2016, 2017, 2019), Caldarola (2018), Calude and Dumitrescu (2020), Iudin et al. (2012), Lolli (2015), Rizza (2019) and to the references therein.

Following the procedure used in the past when the numeral 0 (zero) was introduced to extend the natural numbers to the integers, Sergeyev has introduced the new numeral \(\textcircled {1}\).

In particular, \(\textcircled {1}\) is introduced by adding the Infinite Unit Axiom postulate (IUA) to the axioms of real numbers. The IUA postulate is composed of three parts: Infinity, Identity and Divisibility:

  1. Infinity: any finite natural number n is less than grossone, i.e., \(n<\textcircled {1}\), \(\forall n \in {\text {I}}\!{\text {N}}\).

  2. Identity: the following relations link \(\textcircled {1}\) to the identity elements 0 and 1

    $$\begin{aligned} 0 \cdot {\textcircled {1}}&={\textcircled {1}}\cdot 0=0, \quad {\textcircled {1}} - {\textcircled {1}}=0, \quad \frac{\textcircled {1}}{\textcircled {1}}=1,\nonumber \\&\quad {\textcircled {1}}^{0} =1, \quad 1 ^{{\textcircled {1}}} =1, \quad 0^{{\textcircled {1}}} =0. \end{aligned}$$
    (1)
  3. Divisibility: for any finite natural number n, the sets \({\text {I}}\!{\text {N}}_{k,n}\), \(1 \le k \le n\), called the nth parts of the set \({\text {I}}\!{\text {N}}\) of natural numbers and defined as

    $$\begin{aligned} {\text {I}}\!{\text {N}}_{k,n}&= \left\{ k,k+n, k+2n,k+3n, \ldots \right\} , 1 \le k \le n,\nonumber \\&\quad \quad \displaystyle \bigcup _{k=1}^{n}{\text {I}}\!{\text {N}}_{k,n}= {\text {I}}\!{\text {N}}, \end{aligned}$$
    (2)

    have the same number of elements indicated by the numeral \(\frac{\textcircled {1}}{n}\). Note that \(\frac{\textcircled {1}}{n}\) is larger than any finite number.

Since this postulate is added to the standard axioms of real numbers, all standard properties (i.e., commutative and associative properties, distributive property of multiplication over addition, existence of inverse elements with respect to addition and multiplication, ...) also apply to \(\textcircled {1}\) and to grossone-based numerals. On the other hand, since in this framework it is possible to execute arithmetical operations with a variety of different infinities and infinitesimals, indeterminate forms as well as various kinds of divergences are not present when working with any (finite, infinite and infinitesimal) numbers of the new numerical system.

In this new positional numeral system, a gross-number (or gross-scalar) C can be represented similarly to a traditional positional numeral system, but with base number \(\textcircled {1}\), that is:

$$\begin{aligned}&C=c_{p_{m}} {\textcircled {1}}^{p_{m}}+\cdots + c_{p_{1}} {\textcircled {1}}^{p_{1}}+ c_{p_{0}} {\textcircled {1}}^{p_{0}}\nonumber \\&\quad + c_{p_{-1}} {\textcircled {1}}^{p_{-1}}+\cdots + c_{p_{-k}} {\textcircled {1}}^{p_{-k}}, \end{aligned}$$
(3)

where \(m,k\in {\text {I}}\!{\text {N}}\); for \(i=-k, -(k-1),\ldots ,-1,0, 1,\ldots ,m-1,m\), the numerals \(c_{p_{i}}\) are floating-point numbers and the exponents \(p_{i}\) are gross-numbers such that

$$\begin{aligned}&p_{m}>p_{m-1}\cdots>p_{1}>p_{0}=0>p_{-1} \nonumber \\&\quad>\cdots>p_{-(k-1)}>p_{-k}. \end{aligned}$$
(4)

A gross-number is called finite if \(m=k=0\); it is called infinite if \(m>0\); and it is called infinitesimal if \(m=0\), \(c_{p_0}=0\) and \(k>0\). The exponents \(p_{i}\), \(i=-k, -(k-1),\ldots ,-1,0, 1,\ldots ,m-1,m\), are called gross-powers and can be finite, infinite or infinitesimal. In (3), all numerals \(c_{p_{i}} \ne 0\), \(i=-k, -(k-1),\ldots ,-1,0, 1,\ldots ,m-1,m\), are called gross-digits and belong to a traditional numeral system (for example, floating-point numbers).

We note that in this new numeral system, the record

$$\begin{aligned} C=c_{p_{m}} {\textcircled {1}}^{p_{m}}\ldots c_{p_{1}} {\textcircled {1}}^{p_{1}} c_{p_{0}} {\textcircled {1}}^{p_{0}} c_{p_{-1}} {\textcircled {1}}^{p_{-1}}\ldots c_{p_{-k}} {\textcircled {1}}^{p_{-k}}, \end{aligned}$$
(5)

represents the number C. Infinitesimal numbers are represented by numerals C having only negative (finite or infinite) gross-powers. The infinitesimal number \({\textcircled {1}}^{-1}\) satisfies \({\textcircled {1}}^{-1} {\textcircled {1}}={\textcircled {1}} {\textcircled {1}}^{-1}=1\). Note that no infinitesimal is equal to zero; in particular, \({\textcircled {1}}^{-1}=\frac{1}{\textcircled {1}}>0\) because it is the result of a division between two positive gross-numbers. In the following, we consider only gross-numbers having representation (3) with \(p_i\in {\mathbb {Z}}\), \(i=-k,\dots ,m-1,m\).
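
To make the representation (3) concrete, the following minimal Python sketch (our own illustration, not related to the Infinity Computer or to any existing grossone software) stores a gross-number as a map from integer gross-powers to gross-digits, as assumed above, and implements addition and multiplication term by term.

```python
from collections import defaultdict

class GrossNumber:
    """Toy model of a gross-number: a finite sum of terms c * G**p,
    where G stands for grossone, stored as {gross-power p: gross-digit c}.
    Powers are restricted to plain integers, as assumed in the text."""

    def __init__(self, terms=None):
        self.terms = {p: c for p, c in (terms or {}).items() if c != 0}

    def __add__(self, other):
        out = defaultdict(float)
        for p, c in list(self.terms.items()) + list(other.terms.items()):
            out[p] += c
        return GrossNumber(out)

    def __mul__(self, other):
        out = defaultdict(float)
        for p1, c1 in self.terms.items():
            for p2, c2 in other.terms.items():
                out[p1 + p2] += c1 * c2   # G**p1 * G**p2 = G**(p1 + p2)
        return GrossNumber(out)

    def finite_part(self):
        return self.terms.get(0, 0.0)

    def is_infinitesimal(self):
        return bool(self.terms) and max(self.terms) < 0

    def __repr__(self):
        parts = [f"{c}*G^{p}" for p, c in sorted(self.terms.items(), reverse=True)]
        return " + ".join(parts) if parts else "0"

# Example: (2 + 3*G^-1) * G  =  2*G^1 + 3*G^0, whose finite part is 3.
x = GrossNumber({0: 2, -1: 3})
g = GrossNumber({1: 1})          # grossone itself
print(x * g, (x * g).finite_part())
```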

We conclude this section by observing that the Infinity Computer is a new kind of supercomputer able to execute computations with finite, infinite and infinitesimal numbers numerically (not symbolically), using \(\textcircled {1}\) and the new numeral system. For more details, see Sergeyev (2001) and the references therein.

3 The \(l_0\) pseudo-norm and \(\textcircled {1}\)

In many problems in optimization and numerical analysis, it is extremely important to obtain a vector with the smallest possible number of components different from zero.

In Rinaldi et al. (2010), the problem of determining a vector belonging to a polyhedral set and having the minimum number of nonzero components is studied, and two smooth approximations of the \(l_0\) pseudo-norm are proposed. The general optimization problem with cardinality constraints is considered in Burdakov et al. (2016), where a reformulation as a smooth optimization problem is proposed. In Pham Dinh and Le Thi (2014) and Gotoh et al. (2018), the cardinality-constrained optimization problem is reformulated using a DC (difference of convex functions) approach. We refer the interested reader to Gotoh et al. (2018) for additional references to optimization problems where sparsity of the solution is required.

A vector having the minimum number of nonzero components can generally be obtained by adding to the original problem a further term that penalizes the number of components different from zero, or a term that approximately achieves the same goal.

Let \(x\in {\text {I}}\!{\text {R}}^n\). The \(l_0\) pseudo-norm is defined as

$$\begin{aligned} \left\| {x} \right\| _0 :=\mathop {{\mathrm{number~of~nonzero~components~in }}} x = \sum _{i=1}^{n} 1_{x_i \ne 0}, \end{aligned}$$
(6)

where \(1_a\) is the characteristic (indicator) function, equal to 1 if \(a \ne 0\) and zero otherwise. To be precise, \(\left\| {\cdot } \right\| _0\) is called the \(l_0\) pseudo-norm since it is not a norm. In fact, for \(x \in {\text {I}}\!{\text {R}}^n\), \(x\ne 0\) and \(0 \ne \lambda \in {\text {I}}\!{\text {R}}\), we have:

$$\begin{aligned} \left\| {\lambda x} \right\| _0 = \left\| {x} \right\| _0 , \end{aligned}$$

and hence, \(\left\| {\lambda x} \right\| _0 = \left| \lambda \right| \left\| {x} \right\| _0 \) if and only if \(\left| \lambda \right| = 1\).

In the following sections, we present some specific uses of the \(l_0\) pseudo-norm in regularization and sparse-solution problems.

In Natarajan (1995), it is shown that computing sparse approximate solutions to linear systems is NP-hard. Moreover, in Amaldi and Kann (1998) it is shown that, for systems of linear relations, the problems of determining a solution violating the minimum number of relations (when the system itself is infeasible) and determining a solution with as few nonzero variables as possible (when it is feasible) are both NP-hard, and various strong bounds on the approximability of different variants of these problems are discussed.

Therefore, various approximations of \( \left\| {x} \right\| _0 \) have been proposed in the literature. In the context of feature selection and machine learning, in Bradley and Mangasarian (1998) the following approximation is proposed:

$$\begin{aligned} \left\| {x} \right\| _0 \approx \sum _{i=1}^n \left( 1 - e^{-\alpha \left| x_i \right| } \right) , \end{aligned}$$

where \(\alpha \) is a given positive number, for which the authors suggest the value 5.

In Li and Ye (2017), in the context of elastic net regularization (discussed in detail in Sect. 4), the authors proposed the following approximation:

$$\begin{aligned} \left\| {x} \right\| _0 \approx \left\| {x} \right\| _{0,\delta } :=\sum _{i=1}^{n} \frac{x_i^2}{x_i^2 + \delta }, \end{aligned}$$
(7)

where \(\delta >0 \) and smaller positive values of \(\delta \) provide a better approximation of \(\left\| {x} \right\| _0\).
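
As a quick numerical illustration (our own sketch, with arbitrarily chosen test values), the following Python snippet evaluates (7) for decreasing values of \(\delta \):

```python
import numpy as np

def l0_delta(x, delta):
    """Smooth approximation (7) of the l0 pseudo-norm."""
    x = np.asarray(x, dtype=float)
    return float(np.sum(x**2 / (x**2 + delta)))

x = [2.0, 0.0, -0.5, 0.0, 1e-3]          # true l0 pseudo-norm = 3
for delta in (1e-1, 1e-3, 1e-6, 1e-9):
    print(delta, l0_delta(x, delta))
# The value tends to 3 as delta -> 0; when delta is comparable with the
# square of the smallest nonzero component (1e-6 here), that component
# is only partially counted.
```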

Following this suggestion and by using the new numeral system, we propose to approximate the quantity \(\left\| {x} \right\| _0\) with

$$\begin{aligned} \left\| {x} \right\| _{0,\textcircled {1}^{-1}} :=\sum _{i=1}^{n} \frac{x_i^2}{x_i^2 + \textcircled {1}^{-1}}. \end{aligned}$$
(8)

Let us study in detail the connections between \( \left\| {x} \right\| _0\) and \(\left\| {x} \right\| _{0,\textcircled {1}^{-1}}\). Let \(\displaystyle \psi (t):=\frac{t^2}{t^2 + \textcircled {1}^{-1} } \). Hence, \(\displaystyle \left\| {x} \right\| _{0,\textcircled {1}^{-1}} = \sum _{i=1}^{n} \psi (x_i) \). For \(i=1,\ldots ,n\), we assume

$$\begin{aligned} x_{i} = x^{(0)}_{i} + R_{i} \textcircled {1}^{-1} , \end{aligned}$$

where \(R_{i}\) includes only finite and infinitesimal terms.

When \(x_i^{(0)} = 0 \)

$$\begin{aligned} \psi (x_i)= & {} \frac{R_i^2\textcircled {1}^{-2}}{R_i^2\textcircled {1}^{-2} +\textcircled {1}^{-1} } = \textcircled {1}^{-1} \frac{R_i^2\textcircled {1}^{-1}}{R_i^2\textcircled {1}^{-2} +\textcircled {1}^{-1} } \\= & {} \textcircled {1}^{-1} \frac{R_i^2}{R_i^2\textcircled {1}^{-1} + 1 } = 0 \textcircled {1}^{0} + R_i' \textcircled {1}^{-1} , \end{aligned}$$

where \(R_i'\) includes only finite and infinitesimal terms. Instead, when \(x_i^{(0)} \ne 0 \)

$$\begin{aligned} \psi (x_i)= & {} \frac{\left( x_i^{(0)} + R_i\textcircled {1}^{-1}\right) ^2 }{\left( x_i^{(0)} + R_i\textcircled {1}^{-1}\right) ^2 +\textcircled {1}^{-1} } \\= & {} 1 - \frac{\textcircled {1}^{-1}}{\left( x_i^{(0)} + R_i\textcircled {1}^{-1}\right) ^2 +\textcircled {1}^{-1} }= 1 + S_i' \textcircled {1}^{-1}, \end{aligned}$$

where again \(S_i'\) includes only finite and infinitesimal terms.

Therefore,

$$\begin{aligned} \left\| {x} \right\| _{0,\textcircled {1}^{-1}} = \left\| {x} \right\| _0 + T \textcircled {1}^{-1}, \end{aligned}$$
(9)

for some gross-number T which includes only finite and infinitesimal terms. Hence, the finite parts of \( \left\| {x} \right\| _0 \) and \(\left\| {x} \right\| _{0,\textcircled {1}^{-1}} \) coincide.
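
As a simple worked example (our own, with a purely finite vector), take \(x = (2, 0, -1) \in {\text {I}}\!{\text {R}}^3\); then

$$\begin{aligned} \psi (2) = \frac{4}{4+\textcircled {1}^{-1}} = 1 - \frac{1}{4}\textcircled {1}^{-1} + \cdots , \quad \psi (0) = 0, \quad \psi (-1) = \frac{1}{1+\textcircled {1}^{-1}} = 1 - \textcircled {1}^{-1} + \cdots , \end{aligned}$$

so that \(\left\| {x} \right\| _{0,\textcircled {1}^{-1}} = 2 - \frac{5}{4}\textcircled {1}^{-1} + \cdots \), whose finite part is exactly \(\left\| {x} \right\| _0 = 2\), in agreement with (9).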

4 Elastic net regularization and \(\textcircled {1}\)

Given a matrix \(A \in {\text {I}}\!{\text {R}}^{m\times n}\) and a vector \(b \in {\text {I}}\!{\text {R}}^m\), in many important applications it is essential to determine a solution \(x\in {\text {I}}\!{\text {R}}^n\) of the system of linear equations \(Ax = b\) with the smallest number of nonzero components:

$$\begin{aligned} {\begin{array}{lc} \min _{x} &{} \left\| {x} \right\| _0, \\ \hbox {subject to } &{} Ax=b. \end{array}} \end{aligned}$$

The associated generalized elastic net regularization problem (Li and Ye 2017) is

$$\begin{aligned} \min _x \frac{1}{2}\left\| {Ax - b} \right\| _2^2 + \lambda _0 \left\| {x} \right\| _0 + \frac{\lambda _2}{2} \left\| {x} \right\| _2^2, \end{aligned}$$
(10)

where \(\lambda _0 > 0\) and \(\lambda _2 > 0\) are regularization parameters. In Li and Ye (2017), the authors suggest substituting \(\left\| {x} \right\| _0\) with \(\left\| {x} \right\| _{0,\delta }\), as defined in (7), for fixed positive \(\delta \), and propose a convergent algorithm for the solution of the corresponding optimization problem. Clearly, the obtained solution only approximates the optimal solution of (10).

We propose to use in (10) the approximation (8) of \(\left\| {x} \right\| _0\). By replicating some of the proofs in Li and Ye (2017), a convergent algorithm can be constructed. Moreover, thanks to relation (9), apart from terms of order \(\textcircled {1}^{-1}\) or below, the obtained solution also solves problem (10).

In detail, consider the problem

$$\begin{aligned} \min _x f(x) , \end{aligned}$$

where

$$\begin{aligned} f(x)&:=\frac{1}{2}\left\| {Ax - b} \right\| _2^2 + \lambda _0 \left\| {x} \right\| _{0,\textcircled {1}^{-1}} + \frac{\lambda _2}{2} \left\| {x} \right\| _2^2 \nonumber \\&= \frac{1}{2}{\left( Ax - b\right) }^T{\left( Ax - b\right) } + \lambda _0 \sum _{i=1}^{n}\frac{x_i^2}{x_i^2 + \textcircled {1}^{-1}} + \frac{\lambda _2}{2} {x}^T{x}. \end{aligned}$$
(11)

Let \(D(x) \in {\text {I}}\!{\text {R}}^{n\times n}\) be given by

$$\begin{aligned} D_{ii}(x) = \frac{2 \textcircled {1}^{-1}}{\left( \left( x_i\right) ^2 + \textcircled {1}^{-1} \right) ^2}, \;\;\; D_{ij}(x) = 0,\; i \ne j. \end{aligned}$$
(12)

Then,

$$\begin{aligned} \nabla f(x) = \Biggl ( A^T A + \lambda _2 I + \lambda _0 D(x) \Biggr ) x - A^Tb. \end{aligned}$$

Following Li and Ye (2017), the iterative scheme we propose is given in Algorithm 4.1; moreover, Lemma 1 is the basis for establishing the convergence result.

Algorithm 4.1 [figure]: given the current iterate \(x^k\), the next iterate \(x^{k+1}\) is obtained by solving the linear system \(\bigl (A^T A + \lambda _2 I + \lambda _0 D(x^k) \bigr ) x^{k+1} = A^T b\) (Eq. (13)), until a suitable stopping criterion is met.
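
A minimal numerical sketch of this scheme follows (our own illustration: since floating-point arithmetic cannot represent \(\textcircled {1}^{-1}\), a small finite \(\delta \) is used in its place, which effectively reduces the iteration to the approximation (7) of Li and Ye (2017)):

```python
import numpy as np

def elastic_net_l0(A, b, lam0, lam2, delta=1e-8, max_iter=100, tol=1e-10):
    """Iterative scheme of Algorithm 4.1: starting from x^0 = 0, repeatedly
    solve (A^T A + lam2*I + lam0*D(x^k)) x^{k+1} = A^T b, where D(x) is the
    diagonal matrix (12) and delta is a finite stand-in for grossone^-1."""
    n = A.shape[1]
    AtA, Atb = A.T @ A, A.T @ b
    x = np.zeros(n)
    for _ in range(max_iter):
        D = np.diag(2.0 * delta / (x**2 + delta) ** 2)
        x_new = np.linalg.solve(AtA + lam2 * np.eye(n) + lam0 * D, Atb)
        if np.linalg.norm(x_new - x) <= tol:
            return x_new
        x = x_new
    return x

# Example on a small random problem with a sparse ground truth.
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 10))
x_true = np.zeros(10)
x_true[[1, 4]] = [1.5, -2.0]
b = A @ x_true
print(np.round(elastic_net_l0(A, b, lam0=1e-3, lam2=1e-3), 3))
```

A natural stopping criterion is the distance between successive iterates, which Lemma 1 below bounds in terms of the decrease of f.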

Lemma 1

Let f be given by (11). Let \(x^{k+1}\) be obtained from \(x^{k}\) by solving the system of Eq. (13). Then

$$\begin{aligned} f\left( x^k\right) - f\left( x^{k+1}\right) \ge \frac{1}{2}\left\| {A x^k - A x^{k+1}} \right\| _2^2 + \frac{\lambda _2}{2} \left\| {x^k -x^{k+1}} \right\| _2^2, \end{aligned}$$
(14)

and hence

$$\begin{aligned}&\left\| {A x^k - A x^{k+1}} \right\| _2^2 \le 2 \left( f\left( x^k\right) - f\left( x^{k+1}\right) \right) , \end{aligned}$$
(15)
$$\begin{aligned}&\left\| {x^k -x^{k+1}} \right\| _2^2 \le \frac{2}{\lambda _2} \left( f\left( x^k\right) - f\left( x^{k+1}\right) \right) . \end{aligned}$$
(16)

Proof

See “Appendix A.” \(\square \)

The lemma above shows that the sequence \(\{f(x^k)\}\) is a non-increasing sequence. We are now ready to state the following convergence theorem.

Theorem 1

Let \(\mathcal {L}_0 :=\{x: f(x) \le f(x^0) \}\) be a compact set, and let \(\{x^k\}\) be the sequence produced by the iterative scheme (13). Then

  1. the sequence \(\{x^k\}\) is entirely contained in \(\mathcal {L}_0\);

  2. the sequence \(\left\{ x^{k}\right\} \) has at least one accumulation point;

  3. each accumulation point of \(\{x^k\}\) belongs to \(\mathcal {L}_0\);

  4. each accumulation point \({x}^{*}\) satisfies the condition

    $$\begin{aligned} \Biggl (A^T A + \lambda _2 I + \lambda _0 D({x}^{*}) \Biggr ) {x}^{*}= A^T b , \end{aligned}$$

    and hence is a stationary point of f.

Proof

First, from condition (14) in Lemma 1 we have \(f\left( x^{k+1}\right) \le f\left( x^k\right) \), and hence the entire sequence \(\{x^k\}\) is contained in \(\mathcal {L}_0\). Moreover, the sequence \( \left\{ f\left( x^k\right) \right\} \) is a bounded non-increasing sequence, and hence it is convergent. The existence of accumulation points for the sequence \(\{x^k\}\) follows from the compactness of \(\mathcal {L}_0\). Now let \({x}^{*}\) be an accumulation point of \(\{x^k\}\) and \(\left\{ x^{k_l}\right\} \) be a subsequence indexed by l converging to \({x}^{*}\). From (16), it follows that

$$\begin{aligned} \left\| {x^{k_{l}} - x^{k_{l}+1} } \right\| _2^2 \le \displaystyle \frac{2}{\lambda _2} \Bigl ( f\left( x^{k_{l}}\right) - f\left( x^{k_{l}+1}\right) \Bigr ). \end{aligned}$$

The right-hand side converges to 0, and hence the sequence \(\left\{ x^{k_{l}+1}\right\} \) also converges to the accumulation point \({x}^{*}\). Moreover, from (13),

$$\begin{aligned} \Biggl (A^T A + \lambda _2 I + \lambda _0 D(x^{k_l}) \Biggr ) x^{k_{l}+1} = A^T b , \end{aligned}$$

and hence,

$$\begin{aligned} \Biggl (A^T A + \lambda _2 I + \lambda _0 D({x}^{*}) \Biggr ) {x}^{*}= A^T b . \end{aligned}$$

Therefore, \({x}^{*}\) is a stationary point of f. \(\square \)

5 Sparse support vector machine

In this section, we show how \(\textcircled {1}\) and the results of Sect. 3 can be used in the context of sparse support vector machines (SSVMs).

Assume that empirical data (the training set) \((x^i, y_i), \) \(i = 1, \ldots , {l},\) are given, where \(x^i \in {\text {I}}\!{\text {R}}^n\) and \(y_{i} \in \{-1,1\}\), \(i = 1, \ldots , {l}\). Note that when the index i is used as a superscript the corresponding object is an input, while when it is used as a subscript the corresponding object is an output. The aim is to determine a hyperplane (and hence a vector \(w\in {\text {I}}\!{\text {R}}^n\) and a scalar \(\theta \)) such that:

$$\begin{aligned} {w}^T{x^i} + \theta > 0 \mathop {{\mathrm{ when }}} y_i = 1 \mathop {{\mathrm{ and }}} {w}^T{x^i} + \theta < 0 \mathop {{\mathrm{ when }}} y_i = -1. \end{aligned}$$

The classification function is

$$\begin{aligned} h(x) = \text {sign}\left( {w}^T{x} + \theta \right) . \end{aligned}$$

Classification in the feature space (instead of the original space) requires introducing a map

$$\begin{aligned} \phi : {\text {I}}\!{\text {R}}^n \mapsto \mathcal {E}, \end{aligned}$$

where \(\mathcal {E}\) is a Hilbert space with scalar product \(\left\langle \cdot ,\cdot \right\rangle \). In classical SVMs [see Cortes and Vapnik (1995), Cristianini and Shawe-Taylor (2000), Smola and Schölkopf (2004) and references therein], the construction of the optimal hyperplane requires solving the following (primal) optimization problem

$$\begin{aligned} {\begin{array}{ll} \min \limits _{w, \theta , \xi } &{} \frac{1}{2}\left\langle w,w \right\rangle + C {e}^T{\xi }, \\ \hbox {subject to } &{}y_i \left( \left\langle w,\phi (x^i) \right\rangle + \theta \right) \ge 1 - \xi _i , \quad i = 1, \ldots , {l}, \\ &{} \xi _i \ge 0, \quad i = 1, \ldots , {l}, \end{array}}\nonumber \\ \end{aligned}$$
(17)

where e is a vector with all elements equal to 1 and C is a positive scalar.

The corresponding dual problem is

$$\begin{aligned} {\begin{array}{ll} \min \limits _{\alpha } &{} \frac{1}{2}{\alpha }^T{Q\alpha } - {e}^T{\alpha }, \\ \hbox {subject to } &{} {y}^T{\alpha } = 0 , \\ &{} 0 \le \alpha \le Ce, \end{array}} \end{aligned}$$
(18)

where

$$\begin{aligned} Q_{ij} = y_i y_j K_{ij}, \mathop {{\mathrm{ and }}} K_{ij}= K(x^i, x^j) :=\left\langle \phi (x^i),\phi (x^j) \right\rangle . \end{aligned}$$

The function

$$\begin{aligned} K: {\text {I}}\!{\text {R}}^n\times {\text {I}}\!{\text {R}}^n \rightarrow {\text {I}}\!{\text {R}}, \end{aligned}$$

is called the kernel function. It is well known that, for the construction of the dual problem and the classification function, complete knowledge of the function \(\phi (\cdot ) \) is not necessary: only the quantities \(K_{ij}=\left\langle \phi (x^i),\phi (x^j) \right\rangle \) are needed. In fact, from the KKT conditions,

$$\begin{aligned}&w = \sum _{i=1}^{l}\alpha _i y_i \phi (x^i) \end{aligned}$$
(19)

and the classification function is

$$\begin{aligned} h(x)= & {} \text {sign}\Bigl (\left\langle w,\phi (x) \right\rangle + \theta \Bigr ) \\= & {} \text {sign}\left( \sum _{i=1}^{l}\alpha _i y_i \left\langle \phi (x^i),\phi (x) \right\rangle + \theta \right) . \end{aligned}$$

For a sparse representation of the SVM, the vector w is substituted by its expansion in terms of the vector \(\alpha \). Moreover, let \(K_{i.}\) be the column vector that corresponds to the ith row of the matrix \((K_{ij})\); note that

$$\begin{aligned} {K_{i.}}^T{\alpha }+ \theta= & {} \sum _{j=1}^{l} K_{ij} \alpha _j + \theta \\= & {} \sum _{j=1}^{l} \left\langle \phi (x^i),\phi (x^j) \right\rangle \alpha _j + \theta \\= & {} \left\langle \phi (x^i),\sum _{j=1}^{l} \phi (x^j)\alpha _j \right\rangle + \theta . \end{aligned}$$

In Huang et al. (2010), the authors consider the following optimization problem instead of (17), obtained by replacing \(\frac{1}{2}\left\langle w,w \right\rangle \) with \(\left\| {\alpha } \right\| _0\) and by using the expansion (19) of w in terms of \(\alpha \):

$$\begin{aligned} {\begin{array}{ll} \min _{\alpha , \theta , \xi } &{} \displaystyle \left\| {\alpha } \right\| _0 + C {e}^T{\xi }, \\ \hbox {subject to } &{} y_i \Bigl [{K_{i.}}^T{\alpha }+ \theta \Bigr ] \ge 1 - \xi _i,\;\;\; i = 1, \ldots , {l}, \\ &{} \xi \ge 0. \end{array}} \end{aligned}$$
(20)

In problem (20) the term \(\left\| {\alpha } \right\| _0\) is then replaced by \(\frac{1}{2}{\alpha }^T{\Lambda \alpha }\), where \(\Lambda \) is the diagonal matrix with \(\Lambda _{ii} = \lambda _i\), \(i = 1, \ldots , {l}\).

The following iterative scheme was proposed in Huang et al. (2010) to solve the above problem. In particular, given a very small positive value \(\epsilon \), in Algorithm 5.1 the new value \(\lambda ^{k+1}_r\) for \(\lambda _r\) is set to \(1/\left( \alpha ^k_r\right) ^2\) if \(\left| \alpha ^k_r \right| \) is “significantly” different from zero; otherwise, \(\lambda ^{k+1}_r = 1/\epsilon ^2\).

Algorithm 5.1 [figure]: at each iteration k, problem (21) is obtained from (20) by replacing \(\left\| {\alpha } \right\| _0\) with \(\frac{1}{2}{\alpha }^T{\Lambda ^k \alpha }\); its solution \(\alpha ^k\) is then used to update the weights \(\lambda ^{k+1}_r\) as described above.

The KKT conditions for Problem (21) are

$$\begin{aligned}&\displaystyle {\Lambda ^k} \alpha - \sum _{j=1}^{l} \beta _j y_j K_{j.} = 0 ,\end{aligned}$$
(22a)
$$\begin{aligned}&{\beta }^T{y} = 0, \end{aligned}$$
(22b)
$$\begin{aligned}&Ce - \beta \ge 0 , \end{aligned}$$
(22c)
$$\begin{aligned}&\beta \ge 0 , \end{aligned}$$
(22d)
$$\begin{aligned}&{\beta }^T{\left( Ce - \beta \right) } = 0, \end{aligned}$$
(22e)

where \(\beta \) is the vector of multipliers associated with the constraints \(y_i \Bigl [{K_{i.}}^T{\alpha }+ \theta \Bigr ] \ge 1 - \xi _i,\;\;\; i = 1, \ldots , {l}\).

From (22a) it follows that

$$\begin{aligned} {\lambda ^k_r} \alpha _r = \sum _{j=1}^{l} \beta _j y_j K_{jr} = {\bar{K}_{r.}}^T{\beta },\quad r = 1, \ldots , {l}, \end{aligned}$$

where \(\bar{K}_{rj} = y_j K_{jr}\), with \(r,j = 1, \ldots , {l}\).

Once again, instead of \(\left\| {\alpha } \right\| _0\), we use \(\left\| {\alpha } \right\| _{0,\textcircled {1}^{-1}}\) and we propose to solve the following \(\textcircled {1}\)–Sparse SVM problem in place of (20):

$$\begin{aligned} {\begin{array}{ll} \min _{\alpha , \theta , \xi } &{} \frac{\textcircled {1}}{2} \left\| {\alpha } \right\| _{0,\textcircled {1}^{-1}} + C {e}^T{\xi }, \\ \hbox {subject to }&{} y_i \Bigl [{K_{i.}}^T{\alpha }+ \theta \Bigr ] \ge 1 - \xi _i,\;\;\; i = 1, \ldots , {l},\\ &{} \xi \ge 0. \end{array}} \end{aligned}$$
(23)

Using (8), let

$$\begin{aligned} h(\alpha ):= & {} \frac{\textcircled {1}}{2} \left\| {\alpha } \right\| _{0,\textcircled {1}^{-1}} \\= & {} \frac{\textcircled {1}}{2} \sum _{j=1}^{l} \frac{\alpha _j^2}{\alpha _j^2 + \textcircled {1}^{-1}} = \frac{\textcircled {1}}{2} \sum _{j=1}^{l} \frac{\alpha _j^2 + \textcircled {1}^{-1} - \textcircled {1}^{-1} }{\alpha _j^2 + \textcircled {1}^{-1}} \\= & {} \frac{l}{2} \textcircled {1} - \sum _{j=1}^{l}\frac{\textcircled {1}}{2} \frac{\textcircled {1}^{-1}}{\alpha _j^2 + \textcircled {1}^{-1}} \\= & {} \frac{l}{2} \textcircled {1} -\frac{1}{2} \sum _{j=1}^{l} \frac{1}{\alpha _j^2 + \textcircled {1}^{-1}}. \end{aligned}$$

Then,

$$\begin{aligned} \Bigl [\nabla h(\alpha )\Bigr ]_r = \frac{\alpha _r }{\left( \alpha _r^2 + \textcircled {1}^{-1}\right) ^2} . \end{aligned}$$
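
This follows by differentiating the last expression for h term by term, since the first summand \(\frac{l}{2} \textcircled {1}\) is constant:

$$\begin{aligned} \frac{\partial }{\partial \alpha _r}\left( -\frac{1}{2}\, \frac{1}{\alpha _r^2 + \textcircled {1}^{-1}}\right) = \frac{\alpha _r}{\left( \alpha _r^2 + \textcircled {1}^{-1}\right) ^2}. \end{aligned}$$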

The KKT conditions for the above problem (23) are

$$\begin{aligned}&\displaystyle \nabla h(\alpha ) - \sum _{j=1}^{l} \beta _j y_j K_{j.} = 0 , \end{aligned}$$
(24a)
$$\begin{aligned}&{\beta }^T{y} = 0, \end{aligned}$$
(24b)
$$\begin{aligned}&Ce - \beta \ge 0 , \end{aligned}$$
(24c)
$$\begin{aligned}&\beta \ge 0 , \end{aligned}$$
(24d)
$$\begin{aligned}&{\beta }^T{\left( Ce - \beta \right) } = 0. \end{aligned}$$
(24e)

Note that conditions (24a) can be rewritten as

$$\begin{aligned} \displaystyle \frac{1}{\displaystyle \left( \alpha _r^2 + \textcircled {1}^{-1}\right) ^2} \alpha _r = {\bar{K}_{r.}}^T{\beta }, \quad r = 1, \ldots , {l} . \end{aligned}$$
(25)

From the above formula, it is “natural” to set the new value for \(\lambda _r\)

$$\begin{aligned} {\lambda ^{k+1}_r} = \displaystyle \frac{1}{\displaystyle \left( \alpha _r^2 + \textcircled {1}^{-1}\right) ^2}, \quad r = 1, \ldots , {l}. \end{aligned}$$

Now, for \( r = 1, \ldots , {l}\), let

$$\begin{aligned} \alpha _r = \alpha _r^{(0)} + \alpha _r^{(1)} \textcircled {1}^{-1} + \ldots = \alpha _r^{(0)} + A \textcircled {1}^{-1}, \end{aligned}$$

with A a gross-number including only finite and infinitesimal terms. When \(\alpha _r^{(0)} = 0\)

$$\begin{aligned} \alpha _r^2 + \textcircled {1}^{-1} = \textcircled {1}^{-1} + A^2 \textcircled {1}^{-2}, \end{aligned}$$

and

$$\begin{aligned} \frac{1}{\Bigl (\alpha _r^2 + \textcircled {1}^{-1} \Bigr )^2} = \frac{1}{\Bigl ( \textcircled {1}^{-1} + A^2 \textcircled {1}^{-2} \Bigr )^2} = \frac{1}{\textcircled {1}^{-2}} + A' \frac{1}{\textcircled {1}^{-1}}, \end{aligned}$$
(26)

with \(A'\) finite or infinitesimal. Instead, when \(\alpha _r^{(0)} \ne 0\)

$$\begin{aligned} \alpha _r^2 + \textcircled {1}^{-1} = \left( \alpha _r^{(0)}\right) ^2 + A'' \textcircled {1}^{-1}, \end{aligned}$$

and

$$\begin{aligned} \frac{1}{\Bigl (\alpha _r^2 + \textcircled {1}^{-1} \Bigr )^2} = \frac{1}{\left( \alpha _r^{(0)}\right) ^4 + A''' \textcircled {1}^{-1}}, \end{aligned}$$
(27)

with \(A'',\ A'''\) finite or infinitesimal.

Formulas (26) and (27) mimic almost perfectly the updating formulas for \(\lambda ^{k+1}\) proposed in Huang et al. (2010) in order to solve (20) by using problem (21). The main difference is that, in the nonzero case, there is a power 4 instead of 2. However, using \(\textcircled {1}\) and considering problem (23), we can shed some light on the updating formulas for \(\lambda \), which otherwise appear quite arbitrary. The term \(\frac{1}{\textcircled {1}^{-2}}\) perfectly corresponds to the updating formula \(\lambda ^{k+1}_r = \frac{1}{\epsilon ^2} \). In the other case, the updating formula \(\lambda ^{k+1}_r = \frac{1}{\bar{\alpha }_r^2}\) is replaced by the term \(\frac{1}{\left( \alpha _r^{(0)}\right) ^4 }\).

The use of \( \textcircled {1}\) allows both formulas to be easily obtained from the KKT condition (24a) and from observations on the expansion of the gross-number \(\alpha \).
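
The following short Python sketch (our own illustration; \(\epsilon \) plays the double role of the threshold for “significantly nonzero” and of a finite stand-in for \(\textcircled {1}^{-1}\), and the outer reweighted QP solve of (21) is not shown) contrasts the update of Huang et al. (2010) with the update suggested by the analysis above:

```python
import numpy as np

def lambda_update_huang(alpha, eps):
    """Huang et al. (2010): lambda_r = 1/alpha_r^2 if |alpha_r| > eps,
    otherwise 1/eps^2."""
    a = np.where(np.abs(alpha) > eps, alpha, eps)   # avoid division by zero
    return np.where(np.abs(alpha) > eps, 1.0 / a**2, 1.0 / eps**2)

def lambda_update_grossone(alpha, eps):
    """Update suggested by (25)-(27): 1/eps^2 (standing in for 1/grossone^-2)
    when the finite part of alpha_r is (numerically) zero, and
    1/alpha_r^4 otherwise."""
    a = np.where(np.abs(alpha) > eps, alpha, eps)
    return np.where(np.abs(alpha) > eps, 1.0 / a**4, 1.0 / eps**2)

# Both rules penalize (near-)zero coefficients very heavily, pushing them
# further towards zero at the next reweighted solve.
alpha = np.array([1.5, 0.0, -0.3, 1e-9])
print(lambda_update_huang(alpha, eps=1e-6))
print(lambda_update_grossone(alpha, eps=1e-6))
```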

6 Conclusions

In this paper, we have presented some possible uses in optimization problems of the novel approach to infinite and infinitesimal numbers proposed by Sergeyev (2001, 2009, 2017). In particular, optimization problems including a smoothed \(l_0\) penalty and classification problems involving sparse support vector machines are studied. In order to avoid the difficulties due to the use of the \(l_0\) pseudo-norm of a vector, we propose to approximate the \(l_0\) pseudo-norm by a smooth function defined in terms of \(\textcircled {1}\). The results obtained using this new approximation in the two optimization problems match those presented in the literature. The new \(\textcircled {1}\)-based methodology may therefore represent a fruitful and promising tool to be exploited within other classification and regression problems, offering new views and different perspectives.