1 Introduction

Given a vector x of n components, the \(l_0\) pseudo-norm

$$\begin{aligned} \left\| {x} \right\| _0 :=\mathop {{\mathrm{number~of~nonzero~components~in }}} x , \end{aligned}$$

has often been used in optimization problems arising in various fields. However, the introduction of \(\left\| {x} \right\| _0\) makes these problems extremely hard to solve, so approximations and iterative schemes have been proposed in the scientific literature to solve them efficiently.

The use of this pseudo-norm arises in many different fields, such as machine learning, signal processing, pattern recognition, portfolio optimization, the subset selection problem in regression and elastic-net regularization. Cardinality-constrained optimization problems are difficult to solve, and a common approach is to apply global discrete optimization techniques.

Quite recently, Sergeyev, in a book and in a series of papers, proposed a novel approach to infinite and infinitesimal numbers. By introducing the new numeral grossone (indicated by \(\textcircled {1}\)), defined as the number of elements of the set of natural numbers, Sergeyev demonstrated how it is possible to operate with finite, infinite and infinitesimal quantities using the same arithmetic. This new numerical system allows one to treat infinite and infinitesimal numbers as particular cases of a single structure and offers a new view and an alternative approach to many fundamental aspects of mathematics such as limits, derivatives, sums of series and so on.

The aim of this paper is to show how this new numeral system and in particular \(\textcircled {1}\) can be used in different optimization problems, by replacing the \(l_0\) pseudo-norm \(\left\| {x} \right\| _0\) with

$$\begin{aligned} \left\| {x} \right\| _{0,\textcircled {1}^{-1}} :=\sum _{i=1}^{n} \frac{x_i^2}{x_i^2 + \textcircled {1}^{-1}}. \end{aligned}$$

Indeed, the literature contains many contributions on approximating the \(l_0\) pseudo-norm. For example, in Rinaldi et al. (2010), two new smooth approximations of the \(l_0\) pseudo-norm are presented, and other approximations are recalled in the following.

The paper is organized as follows. In Sect. 2, the new numeral system is presented by describing its main properties: Infinity, Identity and Divisibility. Moreover, the new positional numeral system and the concept of gross-number are discussed. In Sect. 3, the properties of \(\left\| {x} \right\| _0\) are introduced and some approximations proposed in the literature are presented. The definition of \(\left\| {x} \right\| _{0,{\textcircled {1}}^{-1}}\) is then introduced, and it is shown that \(\left\| {x} \right\| _{0}\) and \(\left\| {x} \right\| _{0,{\textcircled {1}}^{-1}}\) coincide in their finite part and may differ only by infinitesimal terms. Two different applications of \(\left\| {x} \right\| _{0,\textcircled {1}^{-1}}\) are presented in Sects. 4 and 5. The first, studied in Sect. 4, concerns elastic net regularization and an algorithm for solving the optimization problem

$$\begin{aligned} \min _x \frac{1}{2}\left\| {Ax - b} \right\| _2^2 + \lambda _0 \left\| {x} \right\| _{0,\textcircled {1}^{-1}} + \frac{\lambda _2}{2} \left\| {x} \right\| _2^2. \end{aligned}$$

In Sect. 5, the newly proposed definition of \(\left\| {x} \right\| _{0,\textcircled {1}^{-1}}\) is used in classification problems using sparse support vector machines (SVMs). In particular, we suggest an interpretation of an updating rule already proposed in the literature based on the KKT (Karush–Kuhn–Tucker) conditions and the expansion of gross-numbers.

We briefly describe some notation used throughout the paper. With \({\text {I}}\!{\text {N}}\), we indicate the set of natural numbers. Given \(n,m \in {\text {I}}\!{\text {N}}\), let \({\text {I}}\!{\text {R}}^n\) be the space of the n-dimensional vectors with real components and let \({\text {I}}\!{\text {R}}^{m\times n}\) be the space of matrices with real elements, m rows and n columns. All vectors are column vectors and are indicated with lower case Latin letters (e.g., x, y, z \( \in {\text {I}}\!{\text {R}}^n\)). Subscripts indicate components of a vector, while superscripts are used to identify different vectors. Matrices are indicated with upper case Roman letters (e.g., A, B \(\in {\text {I}}\!{\text {R}}^{m \times n}\)). If \(A \in {\text {I}}\!{\text {R}}^{m \times n}\), \(A^T_{i \cdot }\) is the ith row of A. The symbol \(\left\| {x} \right\| \) indicates the norm of a vector x. Specific norms or parameters of the norm are indicated with subscripts. The scalar product of two vectors x, y in \({\text {I}}\!{\text {R}}^n\) is denoted by \({x}^T{y}\), while in a generic Hilbert space we use the notation \(\left\langle x,y \right\rangle \). The symbol \(:=\) denotes a definition. The gradient of a continuously differentiable function \(f:{\text {I}}\!{\text {R}}^n \rightarrow {\text {I}}\!{\text {R}}\) at a point \(x\in {\text {I}}\!{\text {R}}^n\) is indicated by \(\nabla f(x)\).

2 The algebra of \(\textcircled {1}\)

The numeral system, originally proposed by Sergeyev (2001, 2009, 2017), is based on the numeral \(\textcircled {1}\) (called grossone) defined as the number of elements of the set \({\text {I}}\!{\text {N}}\). This new definition of an infinite unit allows one to work numerically with infinities and infinitesimals. In particular, the numerical system built upon \(\textcircled {1}\) makes it possible to treat infinite and infinitesimal numbers in a unique framework and to work with all of them numerically.

For instance, using \(\textcircled {1}\)-based numerals, it is possible to execute arithmetic operations with floating-point numbers and to assign concrete infinite and infinitesimal values to variables. Moreover, \(\textcircled {1}\) allows one to count more precisely the number of elements of infinite sets, extending the traditional set theory based on Cantor’s cardinals. For example, the set of even numbers and the set of integers, which traditional cardinality identifies both as countable, have in this new numeral system \(\frac{\textcircled {1}}{2}\) and \(2 \textcircled {1}+1\) elements, respectively.

The new computational methodology has been successfully applied in several fields of pure and applied mathematics, offering new and alternative approaches. Here, we only mention (numerical) differentiation (Sergeyev 2011a), ordinary differential equations (Sergeyev et al. 2016; Amodio et al. 2017), optimization (Cococcioni et al. 2018; De Cosmis and De Leone 2012; De Leone 2018; De Leone et al. 2018; Gaudioso et al. 2018; Sergeyev et al. 2018), hyperbolic geometry (Margenstern 2012), infinite series and the Riemann zeta function (Sergeyev 2009, 2011b), biological processes and cellular automata (D’Alotto 2013). For a survey of the various aspects and applications of \(\textcircled {1}\), we refer the interested reader to Sergeyev (2001, 2010, 2016, 2017, 2019), Caldarola (2018), Calude and Dumitrescu (2020), Iudin et al. (2012), Lolli (2015), Rizza (2019) and to the references therein.

Following the procedure used in the past when the numeral 0 (zero) was introduced to extend the natural numbers to the integers, Sergeyev has introduced the new numeral \(\textcircled {1}\).

In particular, \(\textcircled {1}\) is introduced by adding the Infinite Unit Axiom postulate (IUA) to the axioms of real numbers. The IUA postulate is composed of three parts: Infinity, Identity and Divisibility:

  1. Infinity: any finite natural number n is less than grossone, i.e., \(n<\textcircled {1}\), \(\forall n \in {\text {I}}\!{\text {N}}\).

  2. Identity: the following relations link \(\textcircled {1}\) to the identity elements 0 and 1

    $$\begin{aligned} 0 \cdot {\textcircled {1}}&={\textcircled {1}}\cdot 0=0, \quad {\textcircled {1}} - {\textcircled {1}}=0, \quad \frac{\textcircled {1}}{\textcircled {1}}=1,\nonumber \\&\quad {\textcircled {1}}^{0} =1, \quad 1 ^{{\textcircled {1}}} =1, \quad 0^{{\textcircled {1}}} =0. \end{aligned}$$
    (1)
  3. Divisibility: for any finite natural number n, the sets \({\text {I}}\!{\text {N}}_{k,n}\), \(1 \le k \le n\), called the nth parts of the set \({\text {I}}\!{\text {N}}\) of natural numbers and defined as

    $$\begin{aligned} {\text {I}}\!{\text {N}}_{k,n}&= \left\{ k,k+n, k+2n,k+3n, \ldots \right\} , 1 \le k \le n,\nonumber \\&\quad \quad \displaystyle \bigcup _{k=1}^{n}{\text {I}}\!{\text {N}}_{k,n}= {\text {I}}\!{\text {N}}, \end{aligned}$$
    (2)

    have the same number of elements indicated by the numeral \(\frac{\textcircled {1}}{n}\). Note that \(\frac{\textcircled {1}}{n}\) is larger than any finite number.

Since this postulate is added to the standard axioms of real numbers, all standard properties (i.e., commutative and associative properties, distributive property of multiplication over addition, existence of inverse elements with respect to addition and multiplication, ...) also apply to \(\textcircled {1}\) and to grossone-based numerals. On the other hand, since in this framework it is possible to execute arithmetical operations with a variety of different infinities and infinitesimals, indeterminate forms as well as various kinds of divergences are not present when working with any (finite, infinite and infinitesimal) numbers of the new numerical system.

In this new positional numeral system, a gross-number (or gross-scalar) C can be represented similarly to a traditional positional numeral system, but with base number \(\textcircled {1}\), that is:

$$\begin{aligned}&C=c_{p_{m}} {\textcircled {1}}^{p_{m}}+\cdots + c_{p_{1}} {\textcircled {1}}^{p_{1}}+ c_{p_{0}} {\textcircled {1}}^{p_{0}}\nonumber \\&\quad + c_{p_{-1}} {\textcircled {1}}^{p_{-1}}+\cdots + c_{p_{-k}} {\textcircled {1}}^{p_{-k}}, \end{aligned}$$
(3)

where \(m,k\in {\text {I}}\!{\text {N}}\); for \(i=-k, -(k-1),\ldots ,-1,0, 1,\ldots ,m-1,m\), the numerals \(c_{p_{i}}\) are floating-point numbers and the exponents \(p_{i}\) are gross-numbers such that

$$\begin{aligned}&p_{m}>p_{m-1}\cdots>p_{1}>p_{0}=0>p_{-1} \nonumber \\&\quad>\cdots>p_{-(k-1)}>p_{-k}. \end{aligned}$$
(4)

A gross-number is called finite if \(m=k=0\); it is called infinite if \(m>0\); and it is called infinitesimal if \(m=0\), \(c_{p_0}=0\) and \(k>0\). The exponents \(p_{i}\), \(i=-k, -(k-1),\ldots ,-1,0, 1,\ldots ,m-1,m\), are called gross-powers and can be finite, infinite or infinitesimal. In (3), all numerals \(c_{p_{i}} \ne 0\), \(i=-k, -(k-1),\ldots ,-1,0, 1,\ldots ,m-1,m\), are called gross-digits and belong to a traditional numeral system (for example, floating-point numbers).

We note that in this new numeral system, the record

$$\begin{aligned} C=c_{p_{m}} {\textcircled {1}}^{p_{m}}\ldots c_{p_{1}} {\textcircled {1}}^{p_{1}} c_{p_{0}} {\textcircled {1}}^{p_{0}} c_{p_{-1}} {\textcircled {1}}^{p_{-1}}\ldots c_{p_{-k}} {\textcircled {1}}^{p_{-k}}, \end{aligned}$$
(5)

represents the number C. Infinitesimal numbers are represented by numerals C having only negative (finite or infinite) gross-powers. The infinitesimal number \({\textcircled {1}}^{-1}\) satisfies \({\textcircled {1}}^{-1} {\textcircled {1}}={\textcircled {1}} {\textcircled {1}}^{-1}=1\). Note that no infinitesimal is equal to zero; in particular, \({\textcircled {1}}^{-1}=\frac{1}{\textcircled {1}}>0\) because it is the result of a division between two positive gross-numbers. In the following, we consider only gross-numbers having representation (3) with \(p_i\in {\mathbb {Z}}\), \(i=-k,\dots ,m-1,m\).
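
To make the representation (3) concrete, the following minimal Python sketch (our own illustration, not related to the Infinity Computer or to any existing grossone software) stores a gross-number as a map from integer gross-powers to gross-digits, as assumed above, and implements addition and multiplication term by term.

```python
from collections import defaultdict

class GrossNumber:
    """Toy model of a gross-number: a finite sum of terms c * G**p,
    where G stands for grossone, stored as {gross-power p: gross-digit c}.
    Powers are restricted to plain integers, as assumed in the text."""

    def __init__(self, terms=None):
        self.terms = {p: c for p, c in (terms or {}).items() if c != 0}

    def __add__(self, other):
        out = defaultdict(float)
        for p, c in list(self.terms.items()) + list(other.terms.items()):
            out[p] += c
        return GrossNumber(out)

    def __mul__(self, other):
        out = defaultdict(float)
        for p1, c1 in self.terms.items():
            for p2, c2 in other.terms.items():
                out[p1 + p2] += c1 * c2   # G**p1 * G**p2 = G**(p1 + p2)
        return GrossNumber(out)

    def finite_part(self):
        return self.terms.get(0, 0.0)

    def is_infinitesimal(self):
        return bool(self.terms) and max(self.terms) < 0

    def __repr__(self):
        parts = [f"{c}*G^{p}" for p, c in sorted(self.terms.items(), reverse=True)]
        return " + ".join(parts) if parts else "0"

# Example: (2 + 3*G^-1) * G  =  2*G^1 + 3*G^0, whose finite part is 3.
x = GrossNumber({0: 2, -1: 3})
g = GrossNumber({1: 1})          # grossone itself
print(x * g, (x * g).finite_part())
```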

We conclude this section by observing that the Infinity Computer is a new kind of supercomputer able to execute computations with finite, infinite and infinitesimal numbers numerically (not symbolically), using \(\textcircled {1}\) and the new numeral system. For more details, see Sergeyev (2001) and the references therein.

3 The \(l_0\) pseudo-norm and \(\textcircled {1}\)

In many problems in optimization and numerical analysis, it is extremely important to obtain a vector with the smallest possible number of components different from zero.

In Rinaldi et al. (2010), the problem of determining a vector belonging to a polyhedral set and having the minimum number of nonzero components is studied, and two smooth approximations of the \(l_0\) pseudo-norm are proposed. The general optimization problem with cardinality constraints is considered in Burdakov et al. (2016), where a reformulation as a smooth optimization problem is proposed. In Pham Dinh and Le Thi (2014) and Gotoh et al. (2018), the cardinality-constrained optimization problem is reformulated using a DC (difference of convex functions) approach. We refer the interested reader to Gotoh et al. (2018) for additional references to optimization problems where sparsity of the solution is required.

A vector having the minimum number of nonzero components can generally be obtained by adding to the original problem a further term that penalizes the number of components different from zero, or a term that approximately achieves the same goal.

Let \(x\in {\text {I}}\!{\text {R}}^n\). The \(l_0\) pseudo-norm is defined as

$$\begin{aligned} \left\| {x} \right\| _0 :=\mathop {{\mathrm{number~of~nonzero~components~in }}} x = \sum _{i=1}^{n} 1_{x_i \ne 0}, \end{aligned}$$
(6)

where \(1_a\) is the characteristic (indicator) function, equal to 1 if \(a \ne 0\) and zero otherwise. To be precise, \(\left\| {\cdot } \right\| _0\) is called the \(l_0\) pseudo-norm since it is not a norm. In fact, for \(x \in {\text {I}}\!{\text {R}}^n\), \(x\ne 0\) and \(0 \ne \lambda \in {\text {I}}\!{\text {R}}\), we have:

$$\begin{aligned} \left\| {\lambda x} \right\| _0 = \left\| {x} \right\| _0 , \end{aligned}$$

and hence, \(\left\| {\lambda x} \right\| _0 = \left| \lambda \right| \left\| {x} \right\| _0 \) if and only if \(\left| \lambda \right| = 1\).

In the following sections, we present some specific uses of the \(l_0\) pseudo-norm in regularization and sparse-solution problems.

In Natarajan (1995), it is shown that computing sparse approximate solutions to linear systems is NP-hard. Moreover, in Amaldi and Kann (1998) it is shown that, for systems of linear relations, the problems of determining a solution violating the minimum number of relations (when the system itself is infeasible) and determining a solution with as few nonzero variables as possible (when it is feasible) are both NP-hard, and various strong bounds on the approximability of different variants of these problems are discussed.

Therefore, various approximations of \( \left\| {x} \right\| _0 \) have been proposed in the literature. In the context of feature selection and machine learning, in Bradley and Mangasarian (1998) the following approximation is proposed:

$$\begin{aligned} \left\| {x} \right\| _0 \approx \sum _{i=1}^n \left( 1 - e^{-\alpha \left| x_i \right| } \right) , \end{aligned}$$

where \(\alpha \) is a given positive number, for which the authors suggest the value 5.

In Li and Ye (2017), in the context of elastic net regularization (discussed in detail in Sect. 4), the authors proposed the following approximation:

$$\begin{aligned} \left\| {x} \right\| _0 \approx \left\| {x} \right\| _{0,\delta } :=\sum _{i=1}^{n} \frac{x_i^2}{x_i^2 + \delta }, \end{aligned}$$
(7)

where \(\delta >0 \) and smaller positive values of \(\delta \) provide a better approximation of \(\left\| {x} \right\| _0\).
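
As a quick numerical illustration (our own sketch, with arbitrarily chosen test values), the following Python snippet evaluates (7) for decreasing values of \(\delta \):

```python
import numpy as np

def l0_delta(x, delta):
    """Smooth approximation (7) of the l0 pseudo-norm."""
    x = np.asarray(x, dtype=float)
    return float(np.sum(x**2 / (x**2 + delta)))

x = [2.0, 0.0, -0.5, 0.0, 1e-3]          # true l0 pseudo-norm = 3
for delta in (1e-1, 1e-3, 1e-6, 1e-9):
    print(delta, l0_delta(x, delta))
# The value tends to 3 as delta -> 0; when delta is comparable with the
# square of the smallest nonzero component (1e-6 here), that component
# is only partially counted.
```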

Following this suggestion and by using the new numeral system, we propose to approximate the quantity \(\left\| {x} \right\| _0\) with

$$\begin{aligned} \left\| {x} \right\| _{0,\textcircled {1}^{-1}} :=\sum _{i=1}^{n} \frac{x_i^2}{x_i^2 + \textcircled {1}^{-1}}. \end{aligned}$$
(8)

Let us study in detail the connections between \( \left\| {x} \right\| _0\) and \(\left\| {x} \right\| _{0,\textcircled {1}^{-1}}\). Let \(\displaystyle \psi (t):=\frac{t^2}{t^2 + \textcircled {1}^{-1} } \). Hence, \(\displaystyle \left\| {x} \right\| _{0,\textcircled {1}^{-1}} = \sum _{i=1}^{n} \psi (x_i) \). For \(i=1,\ldots ,n\), we assume

$$\begin{aligned} x_{i} = x^{(0)}_{i} + R_{i} \textcircled {1}^{-1} , \end{aligned}$$

where \(R_{i}\) includes only finite and infinitesimal terms.

When \(x_i^{(0)} = 0 \)

$$\begin{aligned} \psi (x_i)= & {} \frac{R_i^2\textcircled {1}^{-2}}{R_i^2\textcircled {1}^{-2} +\textcircled {1}^{-1} } = \textcircled {1}^{-1} \frac{R_i^2\textcircled {1}^{-1}}{R_i^2\textcircled {1}^{-2} +\textcircled {1}^{-1} } \\= & {} \textcircled {1}^{-1} \frac{R_i^2}{R_i^2\textcircled {1}^{-1} + 1 } = 0 \textcircled {1}^{0} + R_i' \textcircled {1}^{-1} , \end{aligned}$$

where \(R_i'\) includes only finite and infinitesimal terms. Instead, when \(x_i^{(0)} \ne 0 \)

$$\begin{aligned} \psi (x_i)= & {} \frac{\left( x_i^{(0)} + R_i\textcircled {1}^{-1}\right) ^2 }{\left( x_i^{(0)} + R_i\textcircled {1}^{-1}\right) ^2 +\textcircled {1}^{-1} } \\= & {} 1 - \frac{\textcircled {1}^{-1}}{\left( x_i^{(0)} + R_i\textcircled {1}^{-1}\right) ^2 +\textcircled {1}^{-1} }= 1 + S_i' \textcircled {1}^{-1}, \end{aligned}$$

where again \(S_i'\) includes only finite and infinitesimal terms.

Therefore,

$$\begin{aligned} \left\| {x} \right\| _{0,\textcircled {1}^{-1}} = \left\| {x} \right\| _0 + T \textcircled {1}^{-1}, \end{aligned}$$
(9)

for some gross-number T which includes only finite and infinitesimal terms. Hence, the finite parts of \( \left\| {x} \right\| _0 \) and \(\left\| {x} \right\| _{0,\textcircled {1}^{-1}} \) coincide.
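
As a simple worked example (our own, with a purely finite vector), take \(x = (2, 0, -1) \in {\text {I}}\!{\text {R}}^3\); then

$$\begin{aligned} \psi (2) = \frac{4}{4+\textcircled {1}^{-1}} = 1 - \frac{1}{4}\textcircled {1}^{-1} + \cdots , \quad \psi (0) = 0, \quad \psi (-1) = \frac{1}{1+\textcircled {1}^{-1}} = 1 - \textcircled {1}^{-1} + \cdots , \end{aligned}$$

so that \(\left\| {x} \right\| _{0,\textcircled {1}^{-1}} = 2 - \frac{5}{4}\textcircled {1}^{-1} + \cdots \), whose finite part is exactly \(\left\| {x} \right\| _0 = 2\), in agreement with (9).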

4 Elastic net regularization and \(\textcircled {1}\)

Given a matrix \(A \in {\text {I}}\!{\text {R}}^{m\times n}\) and a vector \(b \in {\text {I}}\!{\text {R}}^m\), in many important applications it is essential to determine a solution \(x\in {\text {I}}\!{\text {R}}^n\) of the system of linear equations \(Ax = b\) with the smallest number of nonzero components:

$$\begin{aligned} {\begin{array}{lc} \min _{x} &{} \left\| {x} \right\| _0, \\ \hbox {subject to } &{} Ax=b. \end{array}} \end{aligned}$$

The associated generalized elastic net regularization problem (Li and Ye 2017) is

$$\begin{aligned} \min _x \frac{1}{2}\left\| {Ax - b} \right\| _2^2 + \lambda _0 \left\| {x} \right\| _0 + \frac{\lambda _2}{2} \left\| {x} \right\| _2^2, \end{aligned}$$
(10)

where \(\lambda _0 > 0\) and \(\lambda _2 > 0\) are regularization parameters. In Li and Ye (2017), the authors suggest substituting \(\left\| {x} \right\| _0\) with \(\left\| {x} \right\| _{0,\delta }\), as defined in (7), for fixed positive \(\delta \), and propose a convergent algorithm for the solution of the corresponding optimization problem. Clearly, the obtained solution only approximates the optimal solution of (10).

We propose to use in (10) the approximation (8) of \(\left\| {x} \right\| _0\). By replicating some of the proofs in Li and Ye (2017), a convergent algorithm can be constructed. Moreover, thanks to relation (9), apart from terms of order \(\textcircled {1}^{-1}\) or below, the obtained solution also solves problem (10).

In detail, consider the problem

$$\begin{aligned} \min _x f(x) , \end{aligned}$$

where

$$\begin{aligned} f(x)&:=\frac{1}{2}\left\| {Ax - b} \right\| _2^2 + \lambda _0 \left\| {x} \right\| _{0,\textcircled {1}^{-1}} + \frac{\lambda _2}{2} \left\| {x} \right\| _2^2 \nonumber \\&= \frac{1}{2}{\left( Ax - b\right) }^T{\left( Ax - b\right) } + \lambda _0 \sum _{i=1}^{n}\frac{x_i^2}{x_i^2 + \textcircled {1}^{-1}} + \frac{\lambda _2}{2} {x}^T{x}. \end{aligned}$$
(11)

Let \(D(x) \in {\text {I}}\!{\text {R}}^{n\times n}\) be given by

$$\begin{aligned} D_{ii}(x) = \frac{2 \textcircled {1}^{-1}}{\left( \left( x_i\right) ^2 + \textcircled {1}^{-1} \right) ^2}, \;\;\; D_{ij}(x) = 0,\; i \ne j. \end{aligned}$$
(12)

Then,

$$\begin{aligned} \nabla f(x) = \Biggl ( A^T A + \lambda _2 I + \lambda _0 D(x) \Biggr ) x - A^Tb. \end{aligned}$$

Following Li and Ye (2017), the iterative scheme we propose is given in Algorithm 4.1; moreover, Lemma 1 is the basis for establishing the convergence result.

Algorithm 4.1 [figure]: given the current iterate \(x^k\), the next iterate \(x^{k+1}\) is obtained by solving the linear system \(\bigl (A^T A + \lambda _2 I + \lambda _0 D(x^k) \bigr ) x^{k+1} = A^T b\) (Eq. (13)), until a suitable stopping criterion is met.
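
A minimal numerical sketch of this scheme follows (our own illustration: since floating-point arithmetic cannot represent \(\textcircled {1}^{-1}\), a small finite \(\delta \) is used in its place, which effectively reduces the iteration to the approximation (7) of Li and Ye (2017)):

```python
import numpy as np

def elastic_net_l0(A, b, lam0, lam2, delta=1e-8, max_iter=100, tol=1e-10):
    """Iterative scheme of Algorithm 4.1: starting from x^0 = 0, repeatedly
    solve (A^T A + lam2*I + lam0*D(x^k)) x^{k+1} = A^T b, where D(x) is the
    diagonal matrix (12) and delta is a finite stand-in for grossone^-1."""
    n = A.shape[1]
    AtA, Atb = A.T @ A, A.T @ b
    x = np.zeros(n)
    for _ in range(max_iter):
        D = np.diag(2.0 * delta / (x**2 + delta) ** 2)
        x_new = np.linalg.solve(AtA + lam2 * np.eye(n) + lam0 * D, Atb)
        if np.linalg.norm(x_new - x) <= tol:
            return x_new
        x = x_new
    return x

# Example on a small random problem with a sparse ground truth.
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 10))
x_true = np.zeros(10)
x_true[[1, 4]] = [1.5, -2.0]
b = A @ x_true
print(np.round(elastic_net_l0(A, b, lam0=1e-3, lam2=1e-3), 3))
```

A natural stopping criterion is the distance between successive iterates, which Lemma 1 below bounds in terms of the decrease of f.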

Lemma 1

Let f be given by (11). Let \(x^{k+1}\) be obtained from \(x^{k}\) by solving the system of Eq. (13). Then

$$\begin{aligned} f\left( x^k\right) - f\left( x^{k+1}\right) \ge \frac{1}{2}\left\| {A x^k - A x^{k+1}} \right\| _2^2 + \frac{\lambda _2}{2} \left\| {x^k -x^{k+1}} \right\| _2^2, \end{aligned}$$
(14)

and hence

$$\begin{aligned}&\left\| {A x^k - A x^{k+1}} \right\| _2^2 \le 2 \left( f\left( x^k\right) - f\left( x^{k+1}\right) \right) , \end{aligned}$$
(15)
$$\begin{aligned}&\left\| {x^k -x^{k+1}} \right\| _2^2 \le \frac{2}{\lambda _2} \left( f\left( x^k\right) - f\left( x^{k+1}\right) \right) . \end{aligned}$$
(16)

Proof

See “Appendix A.” \(\square \)

The lemma above shows that the sequence \(\{f(x^k)\}\) is a non-increasing sequence. We are now ready to state the following convergence theorem.

Theorem 1

Let \(\mathcal {L}_0 :=\{x: f(x) \le f(x^0) \}\) be a compact set, and let \(\{x^k\}\) be the sequence produced by the iterative scheme (13). Then

  1. the sequence \(\{x^k\}\) is entirely contained in \(\mathcal {L}_0\);

  2. the sequence \(\left\{ x^{k}\right\} \) has at least one accumulation point;

  3. each accumulation point of \(\{x^k\}\) belongs to \(\mathcal {L}_0\);

  4. each accumulation point \({x}^{*}\) satisfies the condition

    $$\begin{aligned} \Biggl (A^T A + \lambda _2 I + \lambda _0 D({x}^{*}) \Biggr ) {x}^{*}= A^T b , \end{aligned}$$

    and hence is a stationary point of f.

Proof

First, from condition (14) in Lemma 1 we have \(f\left( x^{k+1}\right) \le f\left( x^k\right) \), and hence the entire sequence \(\{x^k\}\) is contained in \(\mathcal {L}_0\). Moreover, the sequence \( \left\{ f\left( x^k\right) \right\} \) is a bounded non-increasing sequence, and hence it is convergent. The existence of accumulation points for the sequence \(\{x^k\}\) follows from the compactness of \(\mathcal {L}_0\). Now let \({x}^{*}\) be an accumulation point of \(\{x^k\}\) and \(\left\{ x^{k_l}\right\} \) be a subsequence indexed by l converging to \({x}^{*}\). From (16), it follows that

$$\begin{aligned} \left\| {x^{k_{l}} - x^{k_{l}+1} } \right\| _2^2 \le \displaystyle \frac{2}{\lambda _2} \Bigl ( f\left( x^{k_{l}}\right) - f\left( x^{k_{l}+1}\right) \Bigr ). \end{aligned}$$

The right-hand side converges to 0, and hence the sequence \(\left\{ x^{k_{l}+1}\right\} \) also converges to the accumulation point \({x}^{*}\). Moreover, from (13),

$$\begin{aligned} \Biggl (A^T A + \lambda _2 I + \lambda _0 D(x^{k_l}) \Biggr ) x^{k_{l}+1} = A^T b , \end{aligned}$$

and hence,

$$\begin{aligned} \Biggl (A^T A + \lambda _2 I + \lambda _0 D({x}^{*}) \Biggr ) {x}^{*}= A^T b . \end{aligned}$$

Therefore, \({x}^{*}\) is a stationary point of f. \(\square \)

5 Sparse support vector machine

In this section, we show how \(\textcircled {1}\) and the results of Sect. 3 can be used in the context of sparse support vector machines (SSVMs).

Assume that empirical data (the training set) \((x^i, y_i), \) \(i = 1, \ldots , {l},\) are given, where \(x^i \in {\text {I}}\!{\text {R}}^n\) and \(y_{i} \in \{-1,1\}\), \(i = 1, \ldots , {l}\). Note that when the index i is used as a superscript the corresponding object is an input, while when it is used as a subscript the corresponding object is an output. The aim is to determine a hyperplane (and hence a vector \(w\in {\text {I}}\!{\text {R}}^n\) and a scalar \(\theta \)) such that:

$$\begin{aligned} {w}^T{x^i} + \theta > 0 \mathop {{\mathrm{ when }}} y_i = 1 \mathop {{\mathrm{ and }}} {w}^T{x^i} + \theta < 0 \mathop {{\mathrm{ when }}} y_i = -1. \end{aligned}$$

The classification function is

$$\begin{aligned} h(x) = \text {sign}\left( {w}^T{x} + \theta \right) . \end{aligned}$$

Classification in the feature space (instead of the original space) requires introducing a map

$$\begin{aligned} \phi : {\text {I}}\!{\text {R}}^n \mapsto \mathcal {E}, \end{aligned}$$

where \(\mathcal {E}\) is a Hilbert space with scalar product \(\left\langle \cdot ,\cdot \right\rangle \). In classical SVMs [see Cortes and Vapnik (1995), Cristianini and Shawe-Taylor (2000), Smola and Schölkopf (2004) and references therein], the construction of the optimal hyperplane requires solving the following (primal) optimization problem

$$\begin{aligned} {\begin{array}{ll} \min \limits _{w, \theta , \xi } &{} \frac{1}{2}\left\langle w,w \right\rangle + C {e}^T{\xi }, \\ \hbox {subject to } &{}y_i \left( \left\langle w,\phi (x^i) \right\rangle + \theta \right) \ge 1 - \xi _i , \quad i = 1, \ldots , {l}, \\ &{} \xi _i \ge 0, \quad i = 1, \ldots , {l}, \end{array}}\nonumber \\ \end{aligned}$$
(17)

where e is a vector with all elements equal to 1 and C is a positive scalar.

The corresponding dual problem is

$$\begin{aligned} {\begin{array}{ll} \min \limits _{\alpha } &{} \frac{1}{2}{\alpha }^T{Q\alpha } - {e}^T{\alpha }, \\ \hbox {subject to } &{} {y}^T{\alpha } = 0 , \\ &{} 0 \le \alpha \le Ce, \end{array}} \end{aligned}$$
(18)

where

$$\begin{aligned} Q_{ij} = y_i y_j K_{ij}, \mathop {{\mathrm{ and }}} K_{ij}= K(x^i, x^j) :=\left\langle \phi (x^i),\phi (x^j) \right\rangle . \end{aligned}$$

The function

$$\begin{aligned} K: {\text {I}}\!{\text {R}}^n\times {\text {I}}\!{\text {R}}^n \rightarrow {\text {I}}\!{\text {R}}, \end{aligned}$$

is called the kernel function. It is well known that, for the construction of the dual problem and the classification function, complete knowledge of the function \(\phi (\cdot ) \) is not necessary: only the quantities \(K_{ij}=\left\langle \phi (x^i),\phi (x^j) \right\rangle \) are needed. In fact, from the KKT conditions,

$$\begin{aligned}&w = \sum _{i=1}^{l}\alpha _i y_i \phi (x^i) \end{aligned}$$
(19)

and the classification function is

$$\begin{aligned} h(x)= & {} \text {sign}\Bigl (\left\langle w,\phi (x) \right\rangle + \theta \Bigr ) \\= & {} \text {sign}\left( \sum _{i=1}^{l}\alpha _i y_i \left\langle \phi (x^i),\phi (x) \right\rangle + \theta \right) . \end{aligned}$$

For a sparse representation of the SVM, the vector w is substituted by its expansion in terms of the vector \(\alpha \). Moreover, let \(K_{i.}\) be the column vector that corresponds to the ith row of the matrix \((K_{ij})\); note that

$$\begin{aligned} {K_{i.}}^T{\alpha }+ \theta= & {} \sum _{j=1}^{l} K_{ij} \alpha _j + \theta \\= & {} \sum _{j=1}^{l} \left\langle \phi (x^i),\phi (x^j) \right\rangle \alpha _j + \theta \\= & {} \left\langle \phi (x^i),\sum _{j=1}^{l} \phi (x^j)\alpha _j \right\rangle + \theta . \end{aligned}$$

In Huang et al. (2010), the authors consider the following optimization problem instead of (17), obtained by replacing \(\frac{1}{2}\left\langle w,w \right\rangle \) with \(\left\| {\alpha } \right\| _0\) and by using the expansion (19) of w in terms of \(\alpha \):

$$\begin{aligned} {\begin{array}{ll} \min _{\alpha , \theta , \xi } &{} \displaystyle \left\| {\alpha } \right\| _0 + C {e}^T{\xi }, \\ \hbox {subject to } &{} y_i \Bigl [{K_{i.}}^T{\alpha }+ \theta \Bigr ] \ge 1 - \xi _i,\;\;\; i = 1, \ldots , {l}, \\ &{} \xi \ge 0. \end{array}} \end{aligned}$$
(20)

In problem (20) the term \(\left\| {\alpha } \right\| _0\) is then replaced by \(\frac{1}{2}{\alpha }^T{\Lambda \alpha }\), where \(\Lambda \) is the diagonal matrix with \(\Lambda _{ii} = \lambda _i\), \(i = 1, \ldots , {l}\).

The following iterative scheme was proposed in Huang et al. (2010) to solve the above problem. In particular, given a very small positive value \(\epsilon \), in Algorithm 5.1 the new value \(\lambda ^{k+1}_r\) for \(\lambda _r\) is set to \(1/\left( \alpha ^k_r\right) ^2\) if \(\left| \alpha ^k_r \right| \) is “significantly” different from zero; otherwise, \(\lambda ^{k+1}_r = 1/\epsilon ^2\).

Algorithm 5.1 [figure]: at each iteration k, problem (21) is obtained from (20) by replacing \(\left\| {\alpha } \right\| _0\) with \(\frac{1}{2}{\alpha }^T{\Lambda ^k \alpha }\); its solution \(\alpha ^k\) is then used to update the weights \(\lambda ^{k+1}_r\) as described above.

The KKT conditions for Problem (21) are

$$\begin{aligned}&\displaystyle {\Lambda ^k} \alpha - \sum _{j=1}^{l} \beta _j y_j K_{j.} = 0 ,\end{aligned}$$
(22a)
$$\begin{aligned}&{\beta }^T{y} = 0, \end{aligned}$$
(22b)
$$\begin{aligned}&Ce - \beta \ge 0 , \end{aligned}$$
(22c)
$$\begin{aligned}&\beta \ge 0 , \end{aligned}$$
(22d)
$$\begin{aligned}&{\beta }^T{\left( Ce - \beta \right) } = 0, \end{aligned}$$
(22e)

where \(\beta \) is the vector of multipliers associated with the constraints \(y_i \Bigl [{K_{i.}}^T{\alpha }+ \theta \Bigr ] \ge 1 - \xi _i,\;\;\; i = 1, \ldots , {l}\).

From (22a) it follows that

$$\begin{aligned} {\lambda ^k_r} \alpha _r = \sum _{j=1}^{l} \beta _j y_j K_{jr} = {\bar{K}_{r.}}^T{\beta },\quad r = 1, \ldots , {l}, \end{aligned}$$

where \(\bar{K}_{rj} = y_j K_{jr}\), with \(r,j = 1, \ldots , {l}\).

Once again, instead of \(\left\| {\alpha } \right\| _0\), we use \(\left\| {\alpha } \right\| _{0,\textcircled {1}^{-1}}\) and we propose to solve the following \(\textcircled {1}\)–Sparse SVM problem in place of (20):

$$\begin{aligned} {\begin{array}{ll} \min _{\alpha , \theta , \xi } &{} \frac{\textcircled {1}}{2} \left\| {\alpha } \right\| _{0,\textcircled {1}^{-1}} + C {e}^T{\xi }, \\ \hbox {subject to }&{} y_i \Bigl [{K_{i.}}^T{\alpha }+ \theta \Bigr ] \ge 1 - \xi _i,\;\;\; i = 1, \ldots , {l},\\ &{} \xi \ge 0. \end{array}} \end{aligned}$$
(23)

Using (8), let

$$\begin{aligned} h(\alpha ):= & {} \frac{\textcircled {1}}{2} \left\| {\alpha } \right\| _{0,\textcircled {1}^{-1}} \\= & {} \frac{\textcircled {1}}{2} \sum _{j=1}^{l} \frac{\alpha _j^2}{\alpha _j^2 + \textcircled {1}^{-1}} = \frac{\textcircled {1}}{2} \sum _{j=1}^{l} \frac{\alpha _j^2 + \textcircled {1}^{-1} - \textcircled {1}^{-1} }{\alpha _j^2 + \textcircled {1}^{-1}} \\= & {} \frac{l}{2} \textcircled {1} - \sum _{j=1}^{l}\frac{\textcircled {1}}{2} \frac{\textcircled {1}^{-1}}{\alpha _j^2 + \textcircled {1}^{-1}} \\= & {} \frac{l}{2} \textcircled {1} -\frac{1}{2} \sum _{j=1}^{l} \frac{1}{\alpha _j^2 + \textcircled {1}^{-1}}. \end{aligned}$$

Then,

$$\begin{aligned} \Bigl [\nabla h(\alpha )\Bigr ]_r = \frac{\alpha _r }{\left( \alpha _r^2 + \textcircled {1}^{-1}\right) ^2} . \end{aligned}$$
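
This follows by differentiating the last expression for h term by term, since the first summand \(\frac{l}{2} \textcircled {1}\) is constant:

$$\begin{aligned} \frac{\partial }{\partial \alpha _r}\left( -\frac{1}{2}\, \frac{1}{\alpha _r^2 + \textcircled {1}^{-1}}\right) = \frac{\alpha _r}{\left( \alpha _r^2 + \textcircled {1}^{-1}\right) ^2}. \end{aligned}$$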

The KKT conditions for the above problem (23) are

$$\begin{aligned}&\displaystyle \nabla h(\alpha ) - \sum _{j=1}^{l} \beta _j y_j K_{j.} = 0 , \end{aligned}$$
(24a)
$$\begin{aligned}&{\beta }^T{y} = 0, \end{aligned}$$
(24b)
$$\begin{aligned}&Ce - \beta \ge 0 , \end{aligned}$$
(24c)
$$\begin{aligned}&\beta \ge 0 , \end{aligned}$$
(24d)
$$\begin{aligned}&{\beta }^T{\left( Ce - \beta \right) } = 0. \end{aligned}$$
(24e)

Note that conditions (24a) can be rewritten as

$$\begin{aligned} \displaystyle \frac{1}{\displaystyle \left( \alpha _r^2 + \textcircled {1}^{-1}\right) ^2} \alpha _r = {\bar{K}_{r.}}^T{\beta }, \quad r = 1, \ldots , {l} . \end{aligned}$$
(25)

From the above formula, it is “natural” to set the new value for \(\lambda _r\)

$$\begin{aligned} {\lambda ^{k+1}_r} = \displaystyle \frac{1}{\displaystyle \left( \alpha _r^2 + \textcircled {1}^{-1}\right) ^2}, \quad r = 1, \ldots , {l}. \end{aligned}$$

Now, for \( r = 1, \ldots , {l}\), let

$$\begin{aligned} \alpha _r = \alpha _r^{(0)} + \alpha _r^{(1)} \textcircled {1}^{-1} + \ldots = \alpha _r^{(0)} + A \textcircled {1}^{-1}, \end{aligned}$$

with A a gross-number including only finite and infinitesimal terms. When \(\alpha _r^{(0)} = 0\)

$$\begin{aligned} \alpha _r^2 + \textcircled {1}^{-1} = \textcircled {1}^{-1} + A^2 \textcircled {1}^{-2}, \end{aligned}$$

and

$$\begin{aligned} \frac{1}{\Bigl (\alpha _r^2 + \textcircled {1}^{-1} \Bigr )^2} = \frac{1}{\Bigl ( \textcircled {1}^{-1} + A^2 \textcircled {1}^{-2} \Bigr )^2} = \frac{1}{\textcircled {1}^{-2}} + A' \frac{1}{\textcircled {1}^{-1}}, \end{aligned}$$
(26)

with \(A'\) finite or infinitesimal. Instead, when \(\alpha _r^{(0)} \ne 0\)

$$\begin{aligned} \alpha _r^2 + \textcircled {1}^{-1} = \left( \alpha _r^{(0)}\right) ^2 + A'' \textcircled {1}^{-1}, \end{aligned}$$

and

$$\begin{aligned} \frac{1}{\Bigl (\alpha _r^2 + \textcircled {1}^{-1} \Bigr )^2} = \frac{1}{\left( \alpha _r^{(0)}\right) ^4 + A''' \textcircled {1}^{-1}}, \end{aligned}$$
(27)

with \(A'',\ A'''\) finite or infinitesimal.

Formulas (26) and (27) mimic almost perfectly the updating formulas for \(\lambda ^{k+1}\) proposed in Huang et al. (2010) in order to solve (20) by using problem (21). The main difference is that, in the nonzero case, there is a power 4 instead of 2. However, using \(\textcircled {1}\) and considering problem (23), we can shed some light on the updating formulas for \(\lambda \), which otherwise appear quite arbitrary. The term \(\frac{1}{\textcircled {1}^{-2}}\) perfectly corresponds to the updating formula \(\lambda ^{k+1}_r = \frac{1}{\epsilon ^2} \). In the other case, the updating formula \(\lambda ^{k+1}_r = \frac{1}{\bar{\alpha }_r^2}\) is replaced by the term \(\frac{1}{\left( \alpha _r^{(0)}\right) ^4 }\).

The use of \( \textcircled {1}\) allows both formulas to be easily obtained from the KKT condition (24a) and from observations on the expansion of the gross-number \(\alpha \).
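
The following short Python sketch (our own illustration; \(\epsilon \) plays the double role of the threshold for “significantly nonzero” and of a finite stand-in for \(\textcircled {1}^{-1}\), and the outer reweighted QP solve of (21) is not shown) contrasts the update of Huang et al. (2010) with the update suggested by the analysis above:

```python
import numpy as np

def lambda_update_huang(alpha, eps):
    """Huang et al. (2010): lambda_r = 1/alpha_r^2 if |alpha_r| > eps,
    otherwise 1/eps^2."""
    a = np.where(np.abs(alpha) > eps, alpha, eps)   # avoid division by zero
    return np.where(np.abs(alpha) > eps, 1.0 / a**2, 1.0 / eps**2)

def lambda_update_grossone(alpha, eps):
    """Update suggested by (25)-(27): 1/eps^2 (standing in for 1/grossone^-2)
    when the finite part of alpha_r is (numerically) zero, and
    1/alpha_r^4 otherwise."""
    a = np.where(np.abs(alpha) > eps, alpha, eps)
    return np.where(np.abs(alpha) > eps, 1.0 / a**4, 1.0 / eps**2)

# Both rules penalize (near-)zero coefficients very heavily, pushing them
# further towards zero at the next reweighted solve.
alpha = np.array([1.5, 0.0, -0.3, 1e-9])
print(lambda_update_huang(alpha, eps=1e-6))
print(lambda_update_grossone(alpha, eps=1e-6))
```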

6 Conclusions

In this paper, we have presented some possible uses in optimization problems of the novel approach to infinite and infinitesimal numbers proposed by Sergeyev (2001, 2009, 2017). In particular, optimization problems including a smoothed \(l_0\) penalty and classification problems involving sparse support vector machines are studied. In order to avoid the difficulties due to the use of the \(l_0\) pseudo-norm of a vector, we propose to approximate the \(l_0\) pseudo-norm by a smooth function defined in terms of \(\textcircled {1}\). The results obtained using this new approximation in the two optimization problems match those presented in the literature. The new \(\textcircled {1}\)-based methodology may therefore represent a fruitful and promising tool to be exploited within other classification and regression problems, offering new views and different perspectives.