1 Introduction

One of the challenges in Machine Learning is handling input datasets with a very large number of features. Feature selection has been widely studied to address this challenge. Its goals are to remove irrelevant and redundant features, reduce storage space and execution time, and avoid the curse of dimensionality, thereby improving prediction performance (Le Thi et al. 2008b).

Research on feature-selection methods has been very active in recent years; an excellent review can be found in the book by Guyon et al. (2006). Generally speaking, feature-selection methods can be classified into three categories: filter, wrapper, and embedded approaches. Wrapper methods exploit a machine learning algorithm to evaluate the usefulness of features. Filter methods rank the features according to some discrimination measure and select the highest-ranked features without using any learning algorithm; they rely on underlying characteristics of the training data to evaluate the relevance of a feature or feature set through independent measures such as distance, correlation, and consistency measures (Chen et al. 2006). The wrapper approach generally produces better feature subsets but runs much more slowly than a filter. In contrast to the filter and wrapper approaches, the embedded approach does not separate learning from feature selection; it integrates feature selection into model building. For instance, for feature selection in classification, an embedded method uses a machine learning algorithm to search for a classifier that uses as few features as possible, whereas a filter method selects features by optimizing a measure criterion and then builds a classifier on the selected features.

Feature selection is often applied to high-dimensional data prior to classification learning. In this paper, we are interested in the feature selection task for Multi-class Support Vector Machine (MSVM) in the framework of embedded approaches. The objective is to simultaneously select a subset of features (representative features) and construct a good classifier, i.e. we search for a sparse MSVM. Whereas most feature selection methods were initially developed for binary Support Vector Machine (SVM) classification (see e.g. Bradley and Mangasarian 1998; Fan and Li 2001; Hermes and Buhmann 2000; Hui 2006; Le Thi et al. 2008a, b; Neumann et al. 2005; Rakotomamonjy 2003; Wang et al. 2007), several extensions to feature selection for MSVM have recently been investigated (see e.g. Cai et al. 2011; Chapelle 2008; Chen et al. 2006; Deng et al. 2013; Duan et al. 2005; Huang et al. 2013; Hsu and Lin 2002; Lee et al. 2006, 2004; Li et al. 2004; Wang and Shen 2003; Weston and Watkins 1999; Weston et al. 2003; Wu et al. 2007; Zhang et al. 2008; Zou 2006; Zhou and Tuck 2007; Zhou et al. 2010). However, the extension from the binary case to the multi-category case (Q classes) is not trivial. Indeed, as will be seen in the next section, the decision rule in MSVM has to be determined from Q decision functions (not a single decision function as in SVM), and each selected feature is associated with Q coefficients, one in each of the Q decision functions.

For the feature selection purpose, we use a natural concept dealing with sparsity, namely the zero norm (denoted \(\ell _{0}\) or \(\Vert .\Vert _{0}\)). The zero norm of a vector is defined as the number of its nonzero components. The function \(\ell _{0}\), apparently very simple, is lower-semicontinuous on \(\mathbb {R}^{n},\) but its discontinuity at the origin makes nonconvex programs involving \(\Vert .\Vert _{0}\) challenging. During the last two decades, research has been very active on optimization models and methods involving the zero norm. Existing works can be divided into three categories according to the way the zero norm is treated: convex approximation (the \(\ell _{0}\)-norm is replaced by a convex function, for instance the \(\ell _{1}\)-norm Tibshirani 1996 or the conjugate function Pham Dinh and Le Thi 2014), nonconvex approximation (a continuous nonconvex function is used instead of the \(\ell _{0}\)-norm; commonly used sparsity-inducing functions are introduced in Bradley and Mangasarian (1998), Fan and Li (2001), Le Thi (2012), Le Thi et al. (2015b), Peleg and Meir (2008), Weston et al. (2003)), and nonconvex exact reformulation (with the binary variables \(u_{i}=0\) if \(w_{i}=0\) and \(u_{i}=1\) otherwise, the original problem is formulated as a combinatorial optimization problem which is equivalently reformulated as a continuous nonconvex program via exact penalty techniques (Le Thi et al. 2015a)). An extensive overview of these approaches can be found in Le Thi et al. (2015b). When the objective function (besides the \(\ell _{0}\)-term) is convex, convex approximation techniques result in a convex optimization problem which is, in this sense, "easy" to solve.

Whilst nonconvex approximation approaches have been widely used for feature selection in SVM, most works on feature selection in MSVM are based on convex approximations, especially on the \(\ell _1\) regularization. In this work, we study two models of sparse MSVM using the \(\ell _{0}\) and the \(\ell _{2}\)\(\ell _{0}\) regularizations. Our motivation for considering the \(\ell _{2}\)\(\ell _{0}\) regularization is that it can reduce overfitting. Due to the \(\ell _{0}\) term, the resulting problems are nonsmooth and nonconvex. We tackle these problems by two nonconvex approaches, both based on Difference of Convex functions (DC) programming and DC Algorithms (DCA), powerful tools in nonconvex programming which were introduced by Pham Dinh Tao in their preliminary form in 1985, have been extensively developed since 1994 by Le Thi Hoai An and Pham Dinh Tao, and have now become classic and increasingly popular (see, e.g. Le Thi and Pham Dinh 2005; Le Thi 2005; Pham Dinh and Le Thi 1997, 1998, and references therein).

Our motivations for using the \(\ell _0\)-norm are multiple. Firstly, even though the \(\ell _1\)-norm is the simplest way to deal with sparsity, it encourages sparsity only in some cases under restrictive assumptions (see Gribonval and Nielsen 2003). In particular, for the feature selection purpose, the \(\ell _1\) penalty has been shown to be, in certain cases, inconsistent and biased (Zou 2006). Secondly, the \(\ell _0\)-norm is the most natural and suitable concept for modelling sparsity, and nonconvex approximations of the \(\ell _0\)-norm are, in general, deeper than the \(\ell _1\)-norm and can therefore produce better sparsity. In particular, for feature selection in SVM, solutions of the \(\ell _0\)-norm penalty problem have been shown to be much sparser than those of the \(\ell _1\)-norm approach in several previous works (see e.g. Le Thi et al. 2008a, 2015a, b; Ong and Le Thi 2013). Thirdly, although we are faced with nonconvex problems, the power of DCA can be exploited to solve these hard problems efficiently, knowing that DCA has been successfully developed in a variety of works in Machine Learning (see e.g. Collobert et al. 2006; Krause and Singer 2004; Le Thi et al. 2006, 2007, 2008a, b, 2015a, b; Le Thi and Phan 2016a, b; Liu et al. 2005; Liu and Shen 2006; Ronan et al. 2006 and the list of references in Le Thi (2005)), in particular for feature selection in SVM (Le Thi et al. 2008a, b, 2015a, b; Neumann et al. 2005; Ong and Le Thi 2013).

In our first approach, the \(\ell _{0}\)-norm is approximated by a DC function, which leads to a DC program for which a DCA scheme is investigated. This general DCA scheme is developed for various sparsity-inducing DC approximation functions: the piecewise exponential function (Bradley and Mangasarian 1998), the SCAD penalty function (Fan and Li 2001), the logarithm function (Weston et al. 2003), the capped-\(\ell _{1}\) function (Peleg and Meir 2008) and the piecewise linear function recently proposed in Le Thi (2012). In the second approach, the original problem is equivalently reformulated, via an exact penalty technique in DC programming (Le Thi et al. 2012), as a DC program. Hence, using a unified DC programming framework, we cast all solution methods into DCA, and the convergence of our algorithms is then guaranteed by the general convergence results of the generic DCA scheme. Specific convergence properties of each DCA scheme are also studied. We perform comparative numerical experiments on 12 versions of DCA based algorithms, with various approximation functions as well as with the exact continuous reformulation. We are interested in several questions from both algorithmic and numerical points of view: which regularization is better, \(\ell _0\) or \(\ell _2\)\(\ell _0\)? Which approach should be recommended, the nonconvex approximation or the nonconvex exact penalty reformulation? And among the nonconvex approximation approaches, which of the sparsity-inducing functions gives the best approximation? Such questions are useful for researchers when choosing, among the various versions offered in this paper, the algorithm to apply to their problems.

The remainder of the paper is organized as follows. Section 2 introduces the two models of sparse MSVM using the \(\ell _0\) and \(\ell _2\)\(\ell _0\) regularizations, followed by a brief presentation of DC programming and DCA. The approximation approach is presented in Sect. 3 while the exact penalty approach is developed in Sect. 4. Computational experiments are reported in Sect. 5 and finally Sect. 6 concludes the paper.

2 Models and methodology

2.1 Sparse MSVM models

To begin, let us introduce the MSVM model proposed by Weston and Watkins (1999), a direct approach (without using binary SVMs) to multi-class learning, known to be appropriate for capturing correlations between the different classes. It can be described as follows.

Let \({\mathcal {X}}\) be a set of vectors in \({\mathrm {I\!R}}^{d}\) and \(\mathcal {Y} =\{1,\ldots ,Q\}\) be a set of class labels. We are given a training dataset \({\mathcal {D}}=\{(x_{1},y_{1}),(x_{2},y_{2}),\ldots ,(x_{n},y_{n})\}\in {\mathbb {R}}^{n\times (d+1)}\), where \(x_{i}\in \mathcal {X}\), \(y_{i}\in \mathcal {Y}\), \(i=1,\ldots ,n\). The task is to learn a classification rule \(f:\mathcal {X}\mapsto \mathcal {Y}\) that maps an element x to a class label \(y\in \mathcal {Y}\).

In a more natural way than the classical binary-SVM-based approaches to multi-class classification, Weston and Watkins (1999) proposed to construct a piecewise linear separation that yields the decision function:

$$\begin{aligned} f(x)=\arg \max \limits _{1\le i\le Q}f_{i}(x), \end{aligned}$$
(1)

where \(f_{i}\) stands for the hyperplane \(f_{i}(x)=\langle w_{i},x\rangle +b_{i}\), with \(w_{i}\in \mathbb {R}^{d},b_{i}\in {\mathbb {R}}\), \(i=1,\ldots ,Q\). Let \(w=\left( w_{1},w_{2},\ldots ,w_{Q}\right) \) be the vector in \({\mathbb {R}}^{Q\times d}\) and let \(\varvec{b}=(b_{i})_{i=1}^{Q}\in \mathbb {R}^{Q}\). Then the MSVM model given in Weston and Watkins (1999), the first “all-together” implementation of multi-class SVM, is a single optimization problem of the form:

$$\begin{aligned} \min \left\{ C\sum _{i=1}^{n}\sum _{k\ne y_{i}}\xi _{ik}+\sum _{k=1}^{Q}\Vert w_{k}\Vert _{2}^{2}:(w,b,\xi )\in \varOmega \right\} , \end{aligned}$$
(2)

where

$$\begin{aligned} \varOmega =\left\{ \begin{array}{ll} (w,b,\xi )\in \mathbb {R}^{Q\times d}\times \mathbb {R}^{Q}\times \mathbb {R} _{+}^{n\times Q}:&{} \\ \langle w_{y_{i}}-w_{k},x_{i}\rangle +b_{y_{i}}-b_{k}\ge 1-\xi _{ik},&{}\forall 1\le i\le n,1\le k\ne y_{i}\le Q \end{array} \right\} , \end{aligned}$$

and \(\xi \in \mathbb {R}_{+}^{n\times Q}\) is the matrix of slack variables. In the objective function, \(C\sum _{i=1}^{n}\sum _{k\ne y_{i}}\xi _{ik}\) is the hinge loss term, which represents the training classification errors. The remaining term is the regularization term, and C is a parameter that controls the trade-off between the hinge loss and the regularization term.
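As an illustration of the decision rule (1), once a pair (w, b) has been learned, a prediction is simply an argmax over the Q linear scores. Below is a minimal Python sketch (the helper name and array layout are ours, not part of the original formulation):

```python
import numpy as np

def msvm_predict(W, b, X):
    """Decision rule (1): assign each sample to the class whose score <w_k, x> + b_k is largest.

    W : array of shape (Q, d), row k is the class weight vector w_k
    b : array of shape (Q,), the biases b_k
    X : array of shape (n, d), samples as rows
    """
    scores = X @ W.T + b          # shape (n, Q): f_k(x_i) for every sample and class
    return np.argmax(scores, axis=1)
```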

For feature selection in MSVM, we consider the two sparse MSVM models obtained from (2) by replacing the second term of the objective function with the \(\ell _{0}\) or the \(\ell _{2}\)\(\ell _{0}\) regularization, which leads to the so-called \(\ell _{0}\)-MSVM and \(\ell _{2}\)\(\ell _{0}\)-MSVM problems, defined respectively by

$$\begin{aligned} \min \limits _{(w,b,\xi )\in \varOmega }C\sum _{i=1}^{n}\sum _{k\ne y_{i}}\xi _{ik}+\sum _{k=1}^{Q}\Vert w_{k}\Vert _{0}\quad (\ell _{0}-MSVM) \end{aligned}$$
(3)

and

$$\begin{aligned} \min \limits _{(w,b,\xi )\in \varOmega }C\sum _{i=1}^{n}\sum _{k\ne y_{i}}\xi _{ik}+\beta \sum _{k=1}^{Q}\Vert w_{k}\Vert _{2}^{2}+\sum _{k=1}^{Q}\Vert w_{k}\Vert _{0}\quad (\ell _{2}-\ell _{0}-MSVM). \end{aligned}$$
(4)

The backbone of our methods is DC programming and DCA, of which a brief overview is given below.

2.2 A brief presentation of DC programming and DCA

DC programming and DCA constitute the backbone of smooth/nonsmooth nonconvex programming and global optimization. A general DC program takes the form:

$$\begin{aligned} \inf \{F(x):=G(x) - H(x):x\in \mathbb {R}^n\},\quad (P_{dc}) \end{aligned}$$

where G and H are lower semicontinuous proper convex functions on \(\mathrm {I\!R}^n\). Such a function F is called a DC function, \(G-H\) is a DC decomposition of F, and G and H are the DC components of F. A convex constraint \(x \in C\) can be incorporated into the objective function of (\(P_{dc}\)) by using the indicator function of C, denoted \(\chi _C\), which is defined by \(\chi _C(x)=0\) if \(x \in C\) and \(+\infty \) otherwise:

$$\begin{aligned} \inf \{f(x):=G(x)-H(x):x\in C \} =\inf \{\chi _{C}(x)+G(x)-H(x):x\in \mathrm { I\!R}^{n}\}. \end{aligned}$$

A convex function \(\theta \) is called polyhedral convex if it is the sum of the maximum of a finite set of affine functions and the indicator function of a nonempty polyhedral convex set K, i.e.,

$$\begin{aligned} \theta (x)=\max _{i=1,\ldots ,m} \{\langle a_{i},x\rangle +b_{i}\} + \chi _K(x), \quad a_{i}\in \mathbb {R}^{n},\ b_{i}\in \mathbb {R}. \end{aligned}$$

A polyhedral DC program occurs when either G or H is polyhedral convex. This class of DC programs, which is frequently encountered in practice, enjoys interesting properties (from both theoretical and practical viewpoints) concerning local optimality and the convergence of DCA (Le Thi and Pham Dinh 2005; Pham Dinh and Le Thi 1997).

A point \(x^{*}\) is said to be a local minimizer of \(G-H\) if \( G(x^{*})-H(x^{*})\) is finite and there exists a neighbourhood \( \mathcal {U}\) of \(x^{*}\) such that

$$\begin{aligned} G(x^{*})-H(x^{*})\le G(x)-H(x),\ \ \forall x\in \mathcal {U}. \end{aligned}$$
(5)

The necessary local optimality condition for the (primal) DC program \((P_{dc})\) is given by

$$\begin{aligned} \emptyset \ne \partial H(x^{*})\subset \partial G(x^{*}). \end{aligned}$$
(6)

The condition (6) is also sufficient (for local optimality) in many important classes of DC programs, for example when \((P_{dc})\) is a polyhedral DC program with H a polyhedral convex function, or when F is locally convex at \(x^*\) (see Le Thi and Pham Dinh 2005; Pham Dinh and Le Thi 1997, 1998).

A point \(x^{*}\) is said to be a critical point of \(G-H\) if

$$\begin{aligned} \partial H(x^{*})\cap \partial G(x^{*})\ne \emptyset . \end{aligned}$$
(7)

The relation (7) is in fact the generalized KKT condition for \( (P_{dc})\) and \(x^{*}\) is also called a generalized KKT point.

DCA is based on local optimality conditions and duality in DC programming. The main idea of DCA is simple: each iteration of DCA approximates the concave part \(-H\) by its affine majorization (that corresponds to taking \( y^l \in \partial H(x^l))\) and minimizes the resulting convex function.

The generic DCA scheme can be described as follows:

[Algorithm: generic DCA scheme]
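In code, the generic scheme amounts to the following loop; this is a minimal Python sketch assuming two user-supplied oracles (both hypothetical): a subgradient oracle for H and a solver for the convex subproblem.

```python
import numpy as np

def generic_dca(x0, subgrad_H, solve_convex_subproblem, max_iter=200, tol=1e-6):
    """Generic DCA scheme for min_x G(x) - H(x) (sketch).

    subgrad_H(x)               -> some y in the subdifferential of H at x
    solve_convex_subproblem(y) -> a minimizer of the convex program min_x G(x) - <y, x>
    """
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        y = subgrad_H(x)                      # y^l in dH(x^l): affine majorization of -H
        x_next = solve_convex_subproblem(y)   # x^{l+1} minimizes G(x) - <y^l, x>
        if np.linalg.norm(x_next - x) <= tol * (1.0 + np.linalg.norm(x)):
            return x_next                     # stop when the iterates stabilize
        x = x_next
    return x
```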

Convergence properties of DCA and its theoretical basis have been described in Le Thi and Pham Dinh (2005), Pham Dinh and Le Thi (1997, 1998). It is worthwhile to recall the following properties, which are useful in the next section (for simplicity's sake, we omit here the dual part of these properties).

  i) DCA is a descent method (without line search): the sequence \(\{G(x^l) - H(x^l)\}\) is decreasing.

  ii) If \(G(x^{l+1}) - H(x^{l+1}) = G(x^l) - H(x^l)\), then \(x^l\) is a critical point of \(G-H\). In such a case, DCA terminates at the l-th iteration.

  iii) If the optimal value \(\alpha \) of problem \((P_{dc})\) is finite and the sequence \(\{x^l\}\) is bounded, then every limit point \(x^*\) of \(\{x^l\}\) is a critical point of \(G-H\).

  iv) DCA has linear convergence for general DC programs, and finite convergence for polyhedral DC programs.

  v) If H is polyhedral convex and differentiable at \(x^{*}\), then \(x^{*}\) is a local minimizer of \((P_{dc})\).

A deeper insight into DCA has been given in Le Thi and Pham Dinh (2005), Pham Dinh and Le Thi (1997), Pham Dinh and Le Thi (1998), Pham Dinh and Le Thi (2014). In particular, it is crucial to note the main features of DCA: DCA is constructed from the DC components and their conjugates, not from the DC function f itself, which has infinitely many DC decompositions, and there are as many DCA versions as there are DC decompositions. Such decompositions play a critical role in determining the speed of convergence, stability, robustness, and globality of the computed solutions. It is therefore important to study various equivalent DC forms of a DC program. This flexibility of DC programming and DCA is of particular interest from both a theoretical and an algorithmic point of view. Moreover, with suitable DC decompositions, DCA generates most standard algorithms in convex and nonconvex optimization. For a complete study of DC programming and DCA the reader is referred to Le Thi and Pham Dinh (2005), Pham Dinh and Le Thi (1997), Pham Dinh and Le Thi (1998), Pham Dinh and Le Thi (2014) and the references therein.

In the last decade, a variety of works in Machine Learning based on DC programming and DCA have been developed. The efficiency and scalability of DCA have been demonstrated in many works (see e.g. Collobert et al. 2006; Krause and Singer 2004; Le Thi et al. 2006, 2007, 2008a, b, 2015a, b; Le Thi and Phan 2016a, b; Liu et al. 2005; Liu and Shen 2006; Neumann et al. 2005; Ong and Le Thi 2013; Le Thi and Pham Dinh 2005; Pham Dinh and Le Thi 1997, 1998, 2014; Ronan et al. 2006).

3 DC approximation approaches

To simplify the presentation, we consider the following common optimization problem:

$$\begin{aligned} \min \left\{ F(w,b,\xi )+\sum _{k=1}^{Q}\Vert w_{k}\Vert _{0}:X=(w,b,\xi )\in \varOmega \right\} , \end{aligned}$$
(8)

where F stands for \(F_{1}\) in the \(\ell _{0}\)-MSVM problem, and for \(F_{2}\) in the \(\ell _{2}\)\(\ell _{0}\)-MSVM problem, say

$$\begin{aligned}&\displaystyle F_{1}(w,b,\xi ):=C\sum _{i=1}^{n}\sum _{k\ne y_{i}}\xi _{ik}, \end{aligned}$$
(9)
$$\begin{aligned}&\displaystyle F_{2}(w,b,\xi ):=C\sum _{i=1}^{n}\sum _{k\ne y_{i}}\xi _{ik}+\beta \sum _{k=1}^{Q}\Vert w_{k}\Vert _{2}^{2}. \end{aligned}$$
(10)

Here \(F_{1}\) is a linear function while \(F_{2}\) is a quadratic convex function.

We introduce in this section a class of DC approximation functions of the \(\ell _0\) norm. Define the step function \(s:\mathbb {R}\rightarrow \mathbb {R}\) by

$$\begin{aligned} s(x)=1 \text { for } x\ne 0\quad \text {and}\quad s(x)=0 \text { for } x=0. \end{aligned}$$

Then for \(X \in \mathbb {R}^n\) we have \(\Vert X\Vert _{0}=\sum _{i=1}^{n}s(X_{i})\). Let \(\varphi _\theta :\mathbb {R} \rightarrow \mathbb {R} \) be a function depending on the parameter \(\theta \) which approximates s(x), say

$$\begin{aligned} \lim _{\theta \rightarrow +\infty }\varphi _\theta (x) = s(x), \, \forall x \in \mathbb { R}. \end{aligned}$$
(11)

To simplify the presentation, we omit in the sequel the parameter \(\theta \) when this does not cause any ambiguity. Suppose that \(\varphi \) can be expressed as a DC function of the form

$$\begin{aligned} \varphi (x) = g(x) - h(x),\ x\in \mathbb {R} \end{aligned}$$
(12)

where g and h are convex functions. Using this approximation, the \(\ell _0\) term in (8) can be written as

$$\begin{aligned} \sum _{k=1}^Q\Vert w_{k}\Vert _0 \approx \sum _{k=1}^Q\sum _{j=1}^d \varphi (w_{kj}) = \sum _{k=1}^Q\sum _{j=1}^d g(w_{kj}) - \sum _{k=1}^Q\sum _{j=1}^d h(w_{kj}), \end{aligned}$$
(13)

and the problem (8) can be represented as follows:

$$\begin{aligned} \min \left\{ G(X) - H(X):X \in \mathbb {R}^{Q \times d}\times \mathbb {R} ^{Q}\times \mathbb {R}^{n \times Q} \right\} , \end{aligned}$$
(14)

where

$$\begin{aligned} G(X) = \chi _\varOmega (X) + F(X) + \sum _{k=1}^Q\sum _{j=1}^d g(w_{kj}); \quad H(X) = \sum _{k=1}^Q\sum _{j=1}^d h(w_{kj}). \end{aligned}$$

Since the functions F, g and h are convex, G and H are convex too. Therefore (14) is a DC program. Thanks to the general DCA scheme given in Sect. 2, DCA applied to (14) can be described as follows.

[Algorithm DCA-dcApp: generic DCA scheme applied to the approximate problem (14)]

We now consider commonly used sparsity-inducing DC approximation functions \(\varphi \) and develop the corresponding DCA-dcApp to solve the resulting DC programs.

First of all, we observe that the implementations of Algorithm DCA-dcApp for the different functions \(\varphi \) differ from one another only in the computation of \(\overline{w}_{kj}^{l}\in \partial h(w_{kj}^{l})\) in step 1 and in the subproblem (15) in step 2. On the other hand, with the same approximation function \(\varphi \), DCA-dcApp applied to the \(\ell _{0}\)-MSVM and \(\ell _{2}\)\(\ell _{0}\)-MSVM problems shares the same step 1 and differs only in step 2. We detail below the computation of \(\overline{w}_{kj}^{l}\in \partial h(w_{kj}^{l})\) in step 1 and the subproblem (15) in step 2 of Algorithm DCA-dcApp for each specific case, as sketched in code right after this paragraph.
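To make this two-step structure concrete, here is a minimal sketch of DCA-dcApp in which the two approximation-specific ingredients are passed in as callables (names are ours); each approximation below only has to provide the componentwise subgradient of h and the solver of its convex subproblem.

```python
import numpy as np

def dca_dcapp(w0, subgrad_h, solve_subproblem, max_iter=200, tol=1e-6):
    """Skeleton of DCA-dcApp (sketch only).

    subgrad_h(w)            -> array with entries w_bar_kj in dh(w_kj), computed componentwise (step 1)
    solve_subproblem(w_bar) -> (w, b, xi) solving the convex subproblem of step 2
    """
    w = np.asarray(w0, dtype=float)
    b = xi = None
    for _ in range(max_iter):
        w_bar = subgrad_h(w)                     # step 1: linearize the concave part -h
        w_next, b, xi = solve_subproblem(w_bar)  # step 2: solve the resulting convex program
        if np.linalg.norm(w_next - w) <= tol * (1.0 + np.linalg.norm(w)):
            return w_next, b, xi
        w = w_next
    return w, b, xi
```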

3.1 A piecewise exponential approximation

The piecewise exponential function introduced in Bradley and Mangasarian (1998) is defined as follows.

$$\begin{aligned} \varphi (x)=\left\{ \begin{array}{ll} 1-e^{-\alpha x} &{}\text {if }x\ge 0, \\ 1-e^{\alpha x} &{}\text {if }x < 0, \end{array} \right. \quad \alpha > 0. \end{aligned}$$

\(\varphi \) can be expressed as a DC function of the form

$$\begin{aligned} \varphi (x) = g(x) - h(x), \quad g(x) = \max \{\alpha x, -\alpha x\} ,\quad h(x) = \left\{ \begin{array}{ll} \alpha x - 1 + e^{-\alpha x} &{} \text {if } x\ge 0 \\ -\alpha x - 1 + e^{\alpha x} &{} \text {if } x \le 0. \\ \end{array} \right. \end{aligned}$$

Clearly, the function h is differentiable and the computation of \(\overline{w}_{kj}^{l} = \nabla h(w_{kj}^{l})\) in step 1 of DCA-dcApp is given by

$$\begin{aligned} \overline{w}_{kj}^l = \left\{ \begin{array}{ll} \alpha \left( 1 - e^{-\alpha w_{kj}^l}\right) &{} \text {if}\,\,w_{kj}^l \ge 0\\ -\alpha \left( 1 - e^{\alpha w_{kj}^l}\right) &{} \text {if}\,\,w_{kj}^l < 0, \\ \end{array} \right. \quad k = 1,\ldots ,Q, \, j = 1,\ldots ,d. \end{aligned}$$
(16)
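Formula (16) is straightforward to evaluate componentwise; a sketch in Python (alpha is the PiE parameter \(\alpha \)):

```python
import numpy as np

def subgrad_h_pie(w, alpha):
    """Gradient (16) of the PiE component h: alpha * sign(w) * (1 - exp(-alpha * |w|))."""
    return alpha * np.sign(w) * (1.0 - np.exp(-alpha * np.abs(w)))
```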

Step 2 of DCA-dcApp applied to the \(\ell _0\)-MSVM problem (3) consists in solving the following convex program

$$\begin{aligned} \min _{(w,b,\xi ) \in \varOmega } C\sum _{i=1}^n \sum _{k\ne y_i}\xi _{ik} + \sum _{k=1}^Q\sum _{j=1}^d \max (\alpha w_{kj}, -\alpha w_{kj}) - \sum _{k=1}^Q\sum _{j=1}^d \overline{w}_{kj}^l w_{kj} \end{aligned}$$
(17)

which can be equivalently transformed into the following linear program

$$\begin{aligned} \min _{w,b,\xi ,t} \left\{ \begin{array}{ll} C\sum _{i=1}^n \sum \nolimits _{k\ne y_i}\xi _{ik} + \sum \nolimits _{k=1}^Q\sum \nolimits _{j=1}^d t_{kj} - \sum \nolimits _{k=1}^Q\sum \nolimits _{j=1}^d\overline{w}_{kj}^l w_{kj} \\ \text {s.t. } (w,b,\xi ) \in \varOmega ,\quad t \ge \alpha w,\, t \ge -\alpha w. \end{array} \right. \end{aligned}$$
(18)
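To give an idea of how the subproblem (18) can be set up in practice, here is a minimal sketch using the cvxpy modelling library (a hypothetical helper; the implementation reported in Sect. 5 solves the explicit LP with CPLEX instead):

```python
import cvxpy as cp

def solve_pie_subproblem(X, y, Q, C, alpha, w_bar):
    """Linear program (18): step 2 of DCA-dcApp with the PiE approximation (sketch).

    X : (n, d) training samples, y : (n,) integer labels in {0, ..., Q-1},
    w_bar : (Q, d) array computed in step 1.
    """
    n, d = X.shape
    w = cp.Variable((Q, d))
    b = cp.Variable(Q)
    xi = cp.Variable((n, Q), nonneg=True)   # slack variables xi_ik >= 0
    t = cp.Variable((Q, d))                 # t_kj models alpha * |w_kj|

    constraints = [t >= alpha * w, t >= -alpha * w]
    hinge = []
    for i in range(n):
        for k in range(Q):
            if k == y[i]:
                continue
            # <w_{y_i} - w_k, x_i> + b_{y_i} - b_k >= 1 - xi_{ik}   (the set Omega)
            constraints.append((w[y[i]] - w[k]) @ X[i] + b[y[i]] - b[k] >= 1 - xi[i, k])
            hinge.append(xi[i, k])

    objective = cp.Minimize(C * cp.sum(cp.hstack(hinge))
                            + cp.sum(t)
                            - cp.sum(cp.multiply(w_bar, w)))
    cp.Problem(objective, constraints).solve()
    return w.value, b.value, xi.value
```

For the \(\ell _2\)\(\ell _0\) variant (19), one would simply add the term \(\beta \sum _k\Vert w_k\Vert _2^2\) (e.g. `beta * cp.sum_squares(w)`) to the objective.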

Finally, DCA-dcApp for solving the \(\ell _0\)-MSVM problem (3) with piecewise exponential (PiE) approximation can be described as follows:

[Algorithm 1: \(\ell _0\)-DCA_PiE]

For the \(\ell _2\)\(\ell _0\)-MSVM problem (4), as indicated above, DCA-dcApp differs from Algorithm 1 only in the convex subproblem of step 2, which is now defined by

$$\begin{aligned} \small \min _{w,b,\xi ,t} \left\{ \begin{array}{ll} \beta \sum \nolimits _{k=1}^{Q}\Vert w_{k}\Vert _{2}^{2}+ C\sum \nolimits _{i=1}^n \sum \nolimits _{k\ne y_i}\xi _{ik} + \sum \nolimits _{k=1}^Q\sum \nolimits _{j=1}^d t_{kj} - \sum \nolimits _{k=1}^Q\sum \nolimits _{j=1}^d \overline{w}_{kj}^l w_{kj} \\ \text {s.t. } (w,b,\xi ) \in \varOmega ,\, t \ge \alpha w,\quad t \ge -\alpha w. \end{array} \right. \end{aligned}$$
(19)

Hence, DCA-dcApp for solving the \(\ell _2\)\(\ell _0\)-MSVM problem (4) with PiE approximation is depicted below:

[Algorithm 2: \(\ell _2\)\(\ell _0\)-DCA_PiE]

3.2 Capped-\(\ell _1\) approximation

The Capped-\(\ell _1\) approximation proposed in Peleg and Meir (2008) is given by

$$\begin{aligned} \varphi (x)=\min \{1, \alpha |x|\} = 1 + \alpha |x| - \max \{1, \alpha |x|\}, \quad \alpha >0. \end{aligned}$$

Let g and h be the functions defined by

$$\begin{aligned} g(x) = 1 + \alpha |x|, \quad h(x) = \max \{1, \alpha |x|\}. \end{aligned}$$

Clearly, \(\varphi = g - h\) and g, h are convex. Therefore \(\varphi \) is a DC function. The computation of \(\overline{w}_{kj}^{l} \in \partial h(w_{kj}^{l})\) in step 1 of DCA-dcApp is given by

$$\begin{aligned} \overline{w}_{kj}^l = \left\{ \begin{array}{l l} 0 &{} \text { if } -1/\alpha \le w_{kj}^l \le 1/\alpha \\ \alpha &{} \text { if } w_{kj}^l > 1/\alpha \\ -\alpha &{} \text { if } w_{kj}^l < -1/\alpha \\ \end{array} \quad k=1,\ldots ,Q, \, j=1,\ldots ,d. \right. \end{aligned}$$
(20)
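Formula (20) translates directly into code; a sketch:

```python
import numpy as np

def subgrad_h_capl1(w, alpha):
    """A subgradient (20) of h(x) = max(1, alpha*|x|): 0 on [-1/alpha, 1/alpha], +/-alpha outside."""
    return np.where(np.abs(w) <= 1.0 / alpha, 0.0, alpha * np.sign(w))
```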

Now, the convex subproblem in step 2 of DCA-dcApp becomes

$$\begin{aligned} \min \left\{ C\sum _{i=1}^n \sum _{k\ne y_i}\xi _{ik} + \sum _{k=1}^Q\sum _{j=1}^d \alpha |w_{kj}| - \sum _{k=1}^Q\sum _{j=1}^d \overline{w}_{kj}^l w_{kj}: (w,b,\xi ) \in \varOmega \right\} . \end{aligned}$$
(21)

The latter problem, like (17), is equivalent to the following linear program

$$\begin{aligned} \min _{w, b, \xi , t} \left\{ \begin{array}{l} C\sum \nolimits _{i=1}^n \sum \nolimits _{k\ne y_i}\xi _{ik} + \sum \nolimits _{k=1}^Q\sum \nolimits _{j=1}^d t_{kj} - \sum \nolimits _{k=1}^Q\sum \nolimits _{j=1}^d \overline{w}_{kj}^l w_{kj} \\ \text {s.t. } (w, b, \xi )\in \varOmega ,\quad t \ge \alpha w, t \ge -\alpha w. \end{array} \right. \end{aligned}$$
(22)

Hence, DCA-dcApp applied to (3) with the Capped-\(\ell _1\) approximation is given below.

[Algorithm 3: \(\ell _0\)-DCA_Cap-l1]

In the case of the \(\ell _2\)\(\ell _0\)-MSVM problem (4), the convex subproblem takes the form

$$\begin{aligned} \min _{w, b, \xi , t} \left\{ \begin{array}{l} C\sum \nolimits _{i=1}^n \sum \nolimits _{k\ne y_i}\xi _{ik} + \beta \sum \nolimits _{k=1}^Q \Vert w_k\Vert _2^2 + \sum \nolimits _{k=1}^Q\sum \nolimits _{j=1}^d t_{kj} - \sum \nolimits _{k=1}^Q\sum \nolimits _{j=1}^d \overline{w}_{kj}^l w_{kj} \\ \text {s.t. } (w, b, \xi )\in \varOmega ,\quad t \ge \alpha w, t \ge -\alpha w. \end{array} \right. \end{aligned}$$
(23)

Then DCA-dcApp for solving (4) with Capped-\(\ell _1\) approximation can be presented as follows.

[Algorithm 4: \(\ell _2\)\(\ell _0\)-DCA_Cap-l1]

3.3 A new piecewise linear approximation

We now consider a new and efficient approximation of the \(\ell _0\)-norm introduced in Le Thi (2012) (see also Le Thi et al. 2015b). Let a and b be two constants such that \(0 \le a < b\). The approximate function \(\varphi (x)\) is then defined as follows (Le Thi 2012):

$$\begin{aligned} \varphi (x)= \min \left\{ 1, \max \left\{ 0, \frac{|x| - a}{b-a} \right\} \right\} . \end{aligned}$$

The function \(\varphi (x)\) can be expressed as

$$\begin{aligned} \begin{array}{ll} \varphi (x)&{} = 1 + \max \left( 0, \frac{|x| - a}{b-a}\right) - \max \left( 1, \frac{|x| - a}{b-a}\right) \\ &{} = \left( 1+\frac{1}{b-a}\max (a, |x|)\right) - \left( \frac{1}{b-a} \max (b, |x|)\right) . \end{array} \end{aligned}$$

Let g and h be the functions defined by

$$\begin{aligned} g(x) = 1 + \frac{1}{b-a}\max (a, |x|);\quad h(x) = \frac{1}{b-a} \max (b, |x|). \end{aligned}$$

They are clearly convex, and so, \(\varphi (x) = g(x) - h(x)\) is a DC function.

Similarly to the case of the Capped-\(\ell _1\) approximation, the computation of \(\overline{w}_{kj}^{l} \in \partial h(w_{kj}^{l})\) in step 1 of DCA-dcApp is given by

$$\begin{aligned} \overline{w}_{kj}^l = \left\{ \begin{array}{ll} 0 &{} \text {if } | w_{kj}^l| \le b \\ \frac{1}{b-a} &{} \text {if } w_{kj}^l > b \\ \frac{-1}{b-a} &{} \text {if } w_{kj}^l < -b \end{array}\right. \quad k = 1,\ldots ,Q,\, j=1,\ldots ,d. \end{aligned}$$
(24)
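Formula (24) can be coded as follows (a sketch; a_par and b_par denote the approximation parameters a and b, renamed to avoid a clash with the bias b):

```python
import numpy as np

def subgrad_h_pil(w, a_par, b_par):
    """A subgradient (24) of h(x) = max(b, |x|) / (b - a)."""
    return np.where(np.abs(w) <= b_par, 0.0, np.sign(w) / (b_par - a_par))
```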

The convex subproblem at step 2 of DCA-dcApp now takes the form

$$\begin{aligned} \min _{(w,b,\xi ) \in \varOmega } \frac{1}{b-a} \sum _{k=1}^Q\sum _{j=1}^d \max (a, | w_{kj}|) + C\sum _{i=1}^n \sum _{k\ne y_i}\xi _{ik} - \sum _{k=1}^Q \sum _{j=1}^d \overline{w}_{kj}^l w_{kj}, \end{aligned}$$
(25)

which is equivalent to the following linear program

$$\begin{aligned} \min _{w,b,\xi ,t} \left\{ \begin{array}{l} C\sum \nolimits _{i=1}^n \sum \nolimits _{k\ne y_i}\xi _{ik} + \frac{1}{b-a} \sum \nolimits _{k=1}^Q\sum \nolimits _{j=1}^d t_{kj} - \sum \nolimits _{k=1}^Q \sum \nolimits _{j=1}^d \overline{w}_{kj}^l w_{kj}\\ \text {s.t. } (w,b,\xi ) \in \varOmega ,\quad t \ge a, t \ge w, t \ge - w. \end{array} \right. \end{aligned}$$
(26)

The description of the DCA-dcApp for solving problem (3) with the piecewise linear (PiL) approximation is shown below.

[Algorithm 5: \(\ell _0\)-DCA_PiL]

For the \(\ell _2\)\(\ell _0\)-MSVM problem (4), at step 2 of DCA-dcApp we have to solve the following convex quadratic program

$$\begin{aligned} \small \min _{w,b,\xi ,t} \left\{ \begin{array}{l} \frac{1}{b-a} \sum \nolimits _{k=1}^Q\sum \nolimits _{j=1}^d t_{kj} + \beta \sum \nolimits _{k=1}^Q \Vert w_k\Vert _2^2+ C\sum \nolimits _{i=1}^n \sum \nolimits _{k\ne y_i}\xi _{ik} - \sum \nolimits _{k=1}^Q \sum \nolimits _{j=1}^d \overline{w}_{kj}^l w_{kj}\\ \text {s.t. } (w,b,\xi ) \in \varOmega ,\quad t \ge a,\, t \ge w,\, t \ge - w \end{array} \right. . \end{aligned}$$
(27)

The description of DCA-dcApp for solving (4) with the PiL approximation is given in Algorithm 6 below.

[Algorithm 6: \(\ell _2\)\(\ell _0\)-DCA_PiL]

3.4 SCAD (Smoothly clipped absolute deviation) approximation

The well-known SCAD penalty function as studied in Fan and Li (2001) is defined by

$$\begin{aligned} \varphi (x) = \left\{ \begin{array}{ll} \lambda x &{} \text {if } 0\le x \le \lambda ,\\ - \frac{x^2-2\alpha \lambda x + \lambda ^2}{2(\alpha -1)} &{} \text {if } \lambda< x \le \alpha \lambda ,\\ \frac{(\alpha +1)\lambda ^2}{2} &{} \text {if } x > \alpha \lambda ,\\ \varphi (-x) &{} \text {if } x < 0, \end{array}\right. \end{aligned}$$

where \(\lambda > 0\) and \(\alpha > 2\) are parameters. Let g and h be the functions given by

$$\begin{aligned} g(x) = \lambda |x| \quad \text {and} \quad h(x) = \left\{ \begin{array}{ll} 0 &{} \text {if } 0 \le x \le \lambda \\ \frac{1}{2(\alpha -1)} (x-\lambda )^2 &{} \text {if } \lambda< x \le \alpha \lambda \\ \lambda x - \frac{(\alpha +1)\lambda ^2}{2} &{} \text {if } x > \alpha \lambda \\ h(-x) &{} \text {if } x < 0. \end{array}\right. \end{aligned}$$

It is easy to verify that g and h are convex functions and \(\varphi (x) = g(x) - h(x)\), so \(\varphi \) is a DC function. Moreover, h is differentiable and the computation of \(\overline{w}_{kj}^l = \nabla h(w^l_{kj})\) in step 1 of DCA-dcApp is given by the following expression

$$\begin{aligned} \overline{w}_{kj}^l = \left\{ \begin{array}{ll} 0 &{} \text {if } |w_{kj}^l| \le \lambda \\ (w_{kj}^l - \lambda )/(\alpha -1) &{} \text {if } \lambda< w_{kj}^l \le \alpha \lambda \\ (w_{kj}^l + \lambda )/(\alpha -1) &{} \text {if } -\alpha \lambda \le w_{kj}^l< -\lambda \\ \lambda &{} \text {if } w_{kj}^l > \alpha \lambda \\ -\lambda &{} \text {if } w_{kj}^l < - \alpha \lambda \end{array}\right. \quad k = 1,\ldots ,Q;\, j=1,\ldots ,d. \end{aligned}$$
(28)
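Expression (28) can be evaluated componentwise, for instance as follows (a sketch; lam and alpha stand for the SCAD parameters \(\lambda \) and \(\alpha \)):

```python
import numpy as np

def grad_h_scad(w, lam, alpha):
    """Gradient (28) of the SCAD component h, evaluated componentwise."""
    conditions = [
        np.abs(w) <= lam,                   # flat part of h: gradient 0
        (w > lam) & (w <= alpha * lam),     # quadratic part, positive side
        (w < -lam) & (w >= -alpha * lam),   # quadratic part, negative side
        w > alpha * lam,                    # linear part, positive side
        w < -alpha * lam,                   # linear part, negative side
    ]
    values = [0.0, (w - lam) / (alpha - 1), (w + lam) / (alpha - 1), lam, -lam]
    return np.select(conditions, values)
```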

As in the cases of the PiL and Capped-\(\ell _1\) approximations, step 2 of DCA-dcApp applied to problem (3) consists in solving the following linear program

$$\begin{aligned} \min _{w, b, \xi , t} \left\{ \begin{array}{l} C\sum \nolimits _{i=1}^n \sum \nolimits _{k\ne y_i}\xi _{ik} + \sum \nolimits _{k=1}^Q\sum \nolimits _{j=1}^d t_{kj} - \sum \nolimits _{k=1}^Q\sum \nolimits _{j=1}^d \overline{w}_{kj}^l w_{kj} \\ \text {s.t. } (w, b, \xi )\in \varOmega ,\quad t \ge \lambda w, t \ge -\lambda w. \end{array} \right. \end{aligned}$$
(29)

Finally, the DCA-dcApp applied to the \(\ell _0\)-MSVM problem (3) with the SCAD approximation is described as follows:

[Algorithm 7: \(\ell _0\)-DCA_SCAD]

For problem (4), we only need to replace the linear program in step 2 of Algorithm 7 with the following convex quadratic program

$$\begin{aligned} \min _{w, b, \xi , t} \left\{ \begin{array}{l} C\sum \nolimits _{i=1}^n \sum \nolimits _{k\ne y_i}\xi _{ik} + \beta \sum \nolimits _{k=1}^Q \Vert w_k\Vert _2^2 + \sum \nolimits _{k=1}^Q\sum \nolimits _{j=1}^d t_{kj} - \sum \nolimits _{k=1}^Q\sum \nolimits _{j=1}^d \overline{w}_{kj}^l w_{kj} \\ \text {s.t. } (w, b, \xi )\in \varOmega ,\quad t \ge \lambda w, t \ge -\lambda w. \end{array} \right. \end{aligned}$$
(30)

Hence, the DCA-dcApp for solving the \(\ell _2\)\(\ell _0\)-MSVM problem (4) with the SCAD approximation is presented in Algorithm 8 below.

[Algorithm 8: \(\ell _2\)\(\ell _0\)-DCA_SCAD]

3.5 A logarithm approximation

Consider now the logarithm (Log) approximation function defined by

$$\begin{aligned} \varphi (x) = \rho _\tau \log \left( 1+\frac{|x|}{ \tau }\right) , \end{aligned}$$

where \(\rho _\tau = \frac{1}{\log \left( 1+\frac{1}{\tau }\right) }\) with \(\tau > 0\) being a parameter. This approximation has been used for feature selection in Weston et al. (2003) and compressed sensing in Candès et al. (2008).

Since \(\log (\cdot )\) is an increasing function, by introducing the variable \(v \in \mathbb {R}^{Q \times d}_+\) we can rewrite the problem (8) as

$$\begin{aligned} \min _{(w,b,\xi ,v)\in \widetilde{\varOmega }} F(w,b,\xi ) + \sum \limits _{k=1}^Q \sum \limits _{j=1}^d \varphi (v_{kj}), \end{aligned}$$
(31)

where \(\widetilde{\varOmega } = \{(w,b,\xi ,v): (w,b,\xi ) \in \varOmega , -v \le w \le v\}\).

Let \(g(x) = 0\) and \(h(x) = -\rho _\tau \log (1+ x/\tau )\). Clearly, \(\varphi (x) = g(x) - h(x)\) and g, h are convex functions on \(\mathbb {R}_+\). Then we can express the problem (31) as a DC program

$$\begin{aligned} \min \left\{ \widetilde{G}(\widetilde{X}) - \widetilde{H}(\widetilde{X}): \widetilde{X} = (w,b,\xi ,v) \in \mathbb {R}^{Q \times d+Q+n\times Q+Q\times d} \right\} , \end{aligned}$$
(32)

where \(\widetilde{G}(\widetilde{X}) = \chi _{\widetilde{\varOmega }}(\widetilde{X} ) + F(w,b,\xi )\) and \(\widetilde{H}(\widetilde{X}) = \sum _{k=1}^Q \sum _{j=1}^d h(v_{kj})\) are convex functions on \(\widetilde{\varOmega }\).

DCA applied to problem (31) amounts to computing, at each iteration l, a subgradient \(\widetilde{Y}^l=(0, 0, 0, \overline{v}^l) \in \partial \widetilde{H}(\widetilde{X}^l)\) with \( \overline{v}_{kj}^l \in \partial h(v^l_{kj})\,(k=1,\ldots ,Q,\, j=1,\ldots ,d)\) and then solving the convex program

$$\begin{aligned} \min \{F(w,b,\xi ) - \langle \widetilde{Y}^l,\widetilde{X} \rangle : \widetilde{X}=(w, b, \xi ,v) \in \widetilde{\varOmega }\}. \end{aligned}$$
(33)

It can be seen that \(\widetilde{H}\) is differentiable and \(\nabla \widetilde{ H}(w,b,\xi ,v^l) = (0,0,0,\overline{v}^l)\), where

$$\begin{aligned} \overline{v}_{kj}^l = \nabla h\left( v_{kj}^l\right) = - \frac{\rho _\tau }{v_{kj}^l + \tau }, \quad k=1,\ldots ,Q,\,j=1,\ldots ,d. \end{aligned}$$
(34)
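The gradient (34) is immediate to compute; a sketch (with \(\rho _\tau = 1/\log (1+1/\tau )\) as defined above):

```python
import numpy as np

def grad_h_log(v, tau):
    """Gradient (34) of h(x) = -rho_tau * log(1 + x/tau), evaluated at v >= 0."""
    rho_tau = 1.0 / np.log(1.0 + 1.0 / tau)
    return -rho_tau / (v + tau)
```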

For the \(\ell _0\)-MSVM problem (3), i.e. F stands for \(F_1\) in (31), problem (33) becomes the following linear program

$$\begin{aligned} \min \left\{ C\sum _{i=1}^n \sum _{k\ne y_i}\xi _{ik} - \sum _{k=1}^Q \sum _{j=1}^d \overline{v}_{kj}^l v_{kj}: (w,b,\xi ,v)\in \widetilde{\varOmega } \right\} . \end{aligned}$$
(35)

Hence, DCA for solving the \(\ell _0\)-MSVM problem (3) with Log approximation can be described as follows.

[Algorithm 9: \(\ell _0\)-DCA_Log]

For the \(\ell _2\)\(\ell _0\)-MSVM problem (4), F will stand for \(F_2\) in (31). Thus, problem (33) now becomes the following quadratic program

$$\begin{aligned} \min \left\{ C\sum _{i=1}^n \sum _{k\ne y_i}\xi _{ik} + \beta \sum _{k=1}^Q \Vert w_k\Vert _2^2 - \sum _{k=1}^Q \sum _{j=1}^d \overline{v}_{kj}^l v_{kj}: (w,b,\xi , v)\in \widetilde{\varOmega } \right\} . \end{aligned}$$
(36)

Consequently, DCA for solving the \(\ell _2\)\(\ell _0\)-MSVM problem (4) with Log approximation is presented below.

[Algorithm 10: \(\ell _2\)\(\ell _0\)-DCA_Log]

3.6 Convergence properties

Theorem 1

(Convergence properties of the above 10 DCA based algorithms)

  (i) The sequence \(\{G(X^l) - H(X^l)\}\) (resp. \(\{\widetilde{G}(\widetilde{X}^l) - \widetilde{H}(\widetilde{X}^l)\}\)) is monotonically decreasing.

  (ii) In algorithms \(\ell _0\)-DCA_PiE, \(\ell _0\)-DCA_Cap-l1, \(\ell _2\)\(\ell _0\)-DCA_Cap-l1, \(\ell _0\)-DCA_PiL, \(\ell _2\)\(\ell _0\)-DCA_PiL and \(\ell _0\)-DCA_SCAD (resp. algorithm \(\ell _0\)-DCA_Log), the sequences \(\{X^l = (w^l, \xi ^l, b^l)\}\) and \(\{Y^l = (\overline{w}^l, 0, 0)\}\) (resp. \(\{\widetilde{X}^l = (w^l, b^l,\xi ^l, v^l)\}\) and \(\{\widetilde{Y}^l=(0, 0, 0, \overline{v}^l)\}\)) converge, respectively, to \(X^* = (w^*, \xi ^*, b^*)\) and \(Y^* = (\overline{w}^*, 0, 0)\) (resp. \(\widetilde{X}^* = (w^*, b^*,\xi ^*, v^*)\) and \(\widetilde{Y}^*=(0, 0, 0, \overline{v}^*)\)) after a finite number of iterations. Moreover, \(X^*\) (resp. \(\widetilde{X}^*\)) is almost always a local minimizer of problem (3) (resp. (31)). In particular, for algorithms \(\ell _0\)-DCA_Cap-l1 and \(\ell _2\)\(\ell _0\)-DCA_Cap-l1 (resp. \(\ell _0\)-DCA_PiL and \(\ell _2\)\(\ell _0\)-DCA_PiL), if

    $$\begin{aligned} w^*_{kj} \notin \left\{ -\frac{1}{\alpha },\frac{1}{\alpha } \right\} \left( \text {resp. } w^*_{kj} \notin \left\{ -b,b \right\} \right) \quad \forall k=1,\ldots ,Q,\,j=1,\ldots ,d, \end{aligned}$$
    (37)

  then \(X^*\) is actually a local minimizer of problem (3).

Proof

(i) is a consequence of DCA's convergence property i) mentioned in Sect. 2. We now prove (ii).

For algorithm \(\ell _0\)-DCA_Cap-l1, the second DC component of (14), say H, is polyhedral convex, so (14) is a polyhedral DC program. By the convergence properties iv) and v) of DCA mentioned in Sect. 2, \(\ell _0\)-DCA_Cap-l1 has finite convergence; moreover, if H is differentiable at \(X^*\), e.g. if condition (37) holds, then \(X^*\) is a local minimizer of problem (3). Since a polyhedral convex function is almost always differentiable, i.e. it is differentiable everywhere except on a set of measure zero, we can say that \(X^*\) is almost always a local minimizer of problem (3).

The same arguments also apply to \(\ell _2\)\(\ell _0\)-DCA_Cap-l1, \(\ell _0\)-DCA_PiL and \(\ell _2\)\(\ell _0\)-DCA_PiL.

For algorithm \(\ell _0\)-DCA_PiE, since the first DC component G is polyhedral convex, (14) is a polyhedral DC program, so \(\ell _0\)-DCA_PiE has finite convergence. We consider the dual problem of (14), in which the second DC component \(G^*\), the conjugate of G, is polyhedral convex. By the same arguments as for algorithm \(\ell _0\)-DCA_Cap-l1, we conclude that \(Y^*\) is almost always a local minimizer of the dual problem. According to the property of transportation of local minimizers in DC programming (see Pham Dinh and Le Thi 1997; Le Thi and Pham Dinh 2005), we deduce that \(X^*\) is almost always a local minimizer of problem (3).

For algorithms \(\ell _0\)-DCA_SCAD and \(\ell _0\)-DCA_Log, the proof is similar. \(\square \)

4 A continuous reformulation approach via exact penalty techniques

In this section, we equivalently reformulate problem (8) as a DC program and develop two DCA based algorithms to solve it. Denote by e the vector of ones in the appropriate space.

Let \(u\in \mathbb {R}^{Q\times d}\) be the binary variable defined by:

$$\begin{aligned} u_{kj}=\left\{ \begin{array}{ll} 1 &{} \text {if } w_{kj}\ne 0 \\ 0 &{} \text {if } w_{kj}=0, \end{array}\right. \quad \forall k=1,\ldots ,Q;\,j=1,\ldots ,d. \end{aligned}$$
(38)

We have

$$\begin{aligned} \sum _{k=1}^{Q}\Vert w_{k}\Vert _{0}=\sum _{k=1}^{Q}\sum \limits _{j=1}^{d}u_{kj}=\langle e,u\rangle . \end{aligned}$$
(39)

Suppose that \(\varOmega \) is bounded in the variable w, i.e. \(\varOmega \subset [-\mathcal {B},\mathcal {B}]^{Q\times d}\times \mathbb {R}^{Q}\times \mathbb {R}_{+}^{n\times Q}\) for some \(\mathcal {B}>0\). Then the problem (8) can be expressed as

$$\begin{aligned} \left\{ \begin{array}{cl} \min \nolimits _{(w,b,\xi ,u)} &{} F(w,b,\xi )+\langle e,u\rangle \\ \text {s.t.} &{} (w,b,\xi )\in \varOmega , \\ &{} \mid w_{kj}\mid \le \mathcal {B}u_{kj},~u_{kj}\in \{0,1\},\quad \forall k=1,\ldots ,Q;\, j=1,\ldots ,d. \end{array} \right. \end{aligned}$$
(40)

Let \(\overline{p}:\mathbb {R}^{Q\times d}\times \mathbb {R}^{Q}\times \mathbb {R }^{n\times Q}\times [0,1]^{Q\times d}\rightarrow \mathbb {R}\) be the function defined as \(\overline{p}(w,b,\xi ,u)=p(u)\) with \(p:\mathbb {R} ^{Q\times d}\rightarrow \mathbb {R},~p(u):=\) \(\sum \nolimits _{k=1}^{Q}\sum \nolimits _{j=1}^{d}u_{kj}(1-u_{kj}),\) and let \(\Lambda \) be the polyhedral convex set determined by

$$\begin{aligned} \Lambda :=\left\{ (w,b,\xi ,u)\in \varOmega \times [0,1]^{Q\times d}:|w_{kj}|\le \mathcal {B}u_{kj},\,\forall k=1,\ldots ,Q;\,j=1,\ldots ,d\right\} . \end{aligned}$$
(41)

We observe that \(\overline{p}\) is concave and \(\overline{p}(w,b,\xi ,u)\ge 0\) for all \((w,b,\xi ,u)\in \Lambda \). By virtue of the exact penalty technique developed recently in Le Thi et al. (2012), the problem (40) is equivalent to the following continuous optimization problem for sufficiently large penalty parameters \(\eta >\eta _{0}\ge 0\):

$$\begin{aligned} \min \left\{ F(w,b,\xi )+\langle e,u\rangle +\eta p(u):\mathcal {X}=(w,b,\xi ,u)\in \Lambda \right\} \end{aligned}$$
(42)

We now investigate a DCA scheme for solving (42). Let \(\overline{G}\) and \(\overline{H}\) be the functions defined by

$$\begin{aligned} \overline{G}(\mathcal {X}):=F(X)+\chi _{\Lambda }(\mathcal {X}) \end{aligned}$$

and

$$\begin{aligned} \overline{H}(\mathcal {X}):=\eta \sum _{k=1}^{Q}\sum _{j=1}^{d}u_{kj}^{2}-(\eta +1)\sum _{k=1}^{Q}\sum _{j=1}^{d}u_{kj}. \end{aligned}$$

The problem (42) can be expressed as:

$$\begin{aligned} \min \left\{ \overline{G}(\mathcal {X})-\overline{H}(\mathcal {X}):\mathcal {X} =(w,b,\xi ,u)\in \mathbb {R}^{Q\times d+Q+n\times Q+Q\times d}\right\} . \end{aligned}$$
(43)

Obviously, \(\overline{G}\) and \(\overline{H}\) are convex functions and so (43) is a DC program. DCA applied to (43) consists of computing, at each iteration l,

$$\begin{aligned} \mathcal {Y}^{l}\in \partial \overline{H}(\mathcal {X}^{l}),~\mathcal {X}^{l+1}\in \arg \min \left\{ \overline{G}(\mathcal {X})-\ \langle \mathcal {Y}^{l},\mathcal {X}\rangle :\mathcal {X} \in \Lambda \right\} . \end{aligned}$$

Clearly, \(\overline{H}\) is differentiable and \(\mathcal {Y}^{l}=\nabla \overline{H}(\mathcal {X}^{l})\) can be computed directly

$$\begin{aligned} \mathcal {Y}^{l}=(0,0,0,\overline{u}^{l}),\text { with } \overline{u} _{kj}^{l}=2\eta u_{kj}^{l}-(\eta +1),\quad \forall k=1,\ldots ,Q;\,j=1,\ldots ,d. \end{aligned}$$
(44)

Then \(\mathcal {X}^{l+1}=(w^{l+1},b^{l+1},\xi ^{l+1},u^{l+1})\) is an optimal solution of the convex optimization problem

$$\begin{aligned} \min \left\{ F(X)-\langle \overline{u}^{l},u\rangle :\mathcal {X}\in \Lambda \right\} . \end{aligned}$$
(45)

Hence, the DCA applied to (43) when \(F=F_{1}\) (the \(\ell _{0}\) -MSVM problem) is described below.

[Algorithm 11: \(\ell _{0}\)-DCA-Econ]
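As an illustration, one iteration of \(\ell _{0}\)-DCA-Econ, i.e. the gradient (44) followed by the linear subproblem (45), can be sketched as follows (a hypothetical helper using cvxpy, not the CPLEX-based implementation reported in Sect. 5):

```python
import cvxpy as cp

def econ_iteration(X, y, Q, C, eta, B, u_prev):
    """One iteration of l0-DCA-Econ (sketch): compute (44), then solve (45)."""
    n, d = X.shape
    u_bar = 2.0 * eta * u_prev - (eta + 1.0)        # formula (44)

    w = cp.Variable((Q, d))
    b = cp.Variable(Q)
    xi = cp.Variable((n, Q), nonneg=True)
    u = cp.Variable((Q, d))

    # (w, b, xi, u) in Lambda: the MSVM constraints plus u in [0, 1] and |w_kj| <= B * u_kj
    constraints = [u >= 0, u <= 1, w <= B * u, -w <= B * u]
    hinge = []
    for i in range(n):
        for k in range(Q):
            if k == y[i]:
                continue
            constraints.append((w[y[i]] - w[k]) @ X[i] + b[y[i]] - b[k] >= 1 - xi[i, k])
            hinge.append(xi[i, k])

    objective = cp.Minimize(C * cp.sum(cp.hstack(hinge)) - cp.sum(cp.multiply(u_bar, u)))
    cp.Problem(objective, constraints).solve()
    return w.value, b.value, xi.value, u.value
```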

Remark 1

Note that since \(\overline{G}\) is polyhedral convex, (43) is a polyhedral DC program and therefore \(\ell _{0}\)-DCA-Econ has finite convergence. Furthermore, by considering the dual problem of (43), in which the second DC component is polyhedral convex, and using property v) mentioned in Sect. 2, we can prove that \(\ell _{0}\)-DCA-Econ converges almost always to a local minimizer of (43).

Similarly, the DCA applied to (43) when \(F=F_{2}\) (the \( \ell _{2}\)\(\ell _{0}\)-MSVM problem) is described as follows:

[Algorithm 12: \(\ell _{2}\)\(\ell _{0}\)-DCA-Econ]

5 Numerical experiments

We implemented the algorithms in the Visual C++ 6.0 environment and performed the experiments on an Intel Core\(^{\mathrm{TM}}\) i7 (\(2\times 2.2\) GHz) processor with 4 GB RAM. The purpose is to clarify several questions: which regularization, \(\ell _0\) or \(\ell _2\)\(\ell _0\), is better? Which approach should be recommended, the DC approximation or the continuous exact reformulation? Among the DC approximation approaches, which sparsity-inducing function gives the best approximation? Is there consistency among the features selected by the different DCA based algorithms?

5.1 Datasets

We consider eight popular datasets often used for feature selection: Lung Cancer (LUN), Optical Recognition of Handwritten Digits (OPT), Libras Movement (MOV), Semeion Handwritten Digit (SEM), Multiple Features (MFE), CNAE-9 (CNA), Internet Advertisement (ADV), and ADN. The first seven datasets are taken from the UCI Machine Learning Repository, while the last can be found at ftp://genbank.bio.net. Each dataset is divided into a training set and a test set, except for the LUN dataset, which contains a very small number of samples; its whole dataset is therefore used as both the training set and the test set. For the OPT dataset, both the training set and the test set are provided in the UCI Machine Learning Repository. For the other six datasets, the training and test sets are randomly sampled from the original data, with 60 % for training and the remaining 40 % for testing. We repeat this random procedure 10 times and obtain 10 pairs of training/test sets for each dataset. These datasets are described in detail in Table 1.

Table 1 The description of the datasets

5.2 Experiment setting

The CPLEX 12.5 solver is used to solve linear and convex quadratic problems.

The parameters are taken as follows: for all methods, the most appropriate values of the parameter C are chosen by five-fold cross-validation; \(\beta \), the coefficient of the \(\ell _{2}\) term, is set to 0.01; the tolerance in the stopping criterion of DCA is set to \(10^{-6}\).

Observe that, theoretically speaking, the larger \(\alpha \) is, the better the DC approximation of the \(\ell _{0}\)-norm, while in practice, when \(\alpha \) is large, the algorithms sometimes yield bad local minima. Hence, we use an \(\alpha \)-updating procedure during the algorithms. Starting from a small value \(\alpha _{0}\), we increase \(\alpha \) at each iteration l by \(\alpha _{l+1}=\alpha _{l}+\varDelta \alpha \) until a given threshold \(\overline{\alpha }\) is reached. For \(\ell _{0}\)-DCA_PiE and \(\ell _{2}\)\(\ell _{0}\)-DCA_PiE, \(\alpha _{0}=1.5\), \(\varDelta \alpha =0.5\) and \(\overline{\alpha }=5.5\). For \(\ell _{0}\)-DCA_Cap-l1 and \(\ell _{2}\)\(\ell _{0}\)-DCA_Cap-l1 we set \(\alpha _{0}\in \{0.7,0.8,0.9\}\), \(\varDelta \alpha =0.2\) and \(\overline{\alpha }=5.5\). For \(\ell _{0}\)-DCA_PiL and \(\ell _{2}\)\(\ell _{0}\)-DCA_PiL, we take \(a=10^{-6}\) and \(b\in \{10^{-4},10^{-3},\ldots ,10^{-1},0.2,0.3\}\). The parameters \(\alpha \) and \(\lambda \) in \(\ell _{0}\)-DCA_SCAD and \(\ell _{2}\)\(\ell _{0}\)-DCA_SCAD are set to 3.4 and 0.4, respectively, as proposed by Fan and Li (2001). In \(\ell _{0}\)-DCA_Log and \(\ell _{2}\)\(\ell _{0}\)-DCA_Log, the parameter \(\tau \) is set to \(10^{-4}\). Finally, for \(\ell _{0}\)-DCA_Econ and \(\ell _{2}\)\(\ell _{0}\)-DCA_Econ, the starting value of \(\eta \) is chosen in \(\{10,20,\ldots ,50\}\) and \(\eta \) is doubled at each iteration until it reaches 10,000. The parameter \(\mathcal {B}\) is set to 1000.

The starting points \(w^{0}\), \(u^{0}\) and \(v^{0}\) of the DCA based algorithms are randomly chosen in \([-0.5,0.5]^{Q\times d}\).

We compare our methods with one of the best algorithms for feature selection in MSVM, called the Adaptive Sub-Norm (ASN) method (see Zhang et al. 2008 for more details). To select relevant features, we first compute the feature ranking score \(c_{j},j=1,\ldots ,d\) for each feature (Duan et al. 2005) as follows

$$\begin{aligned} c_{j}=\sum _{i=1}^{Q}|w_{ij}|. \end{aligned}$$

This ranking score is then normalized: letting \(\varsigma =\max _{j}c_{j}\), we set \(c_{j}:=\frac{c_{j}}{\varsigma }\) for each \(j=1,\ldots ,d\). We then remove the features j for which \(c_{j}\) is smaller than a given threshold (0.01 in our experiments). After removing features, to compute the accuracy of the classifier, we apply the \(\ell _{2}\)-MSVM (2) again on the new training sets and calculate the classification accuracy on the new test sets.
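A sketch of this ranking-and-thresholding step (hypothetical helper; W is the learned \(Q\times d\) weight matrix):

```python
import numpy as np

def selected_features(W, threshold=0.01):
    """Feature ranking score c_j = sum_i |w_ij|, normalized by its maximum;
    features whose normalized score falls below the threshold are removed."""
    c = np.sum(np.abs(W), axis=0)        # c_j, j = 1, ..., d
    c = c / np.max(c)                    # normalize so that max_j c_j = 1
    return np.where(c >= threshold)[0]   # indices of the retained features
```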

5.3 Numerical results

The comparative results of the 12 versions of DCA based algorithms and the competing ASN method are presented in Table 2 (the number and percentage of selected features) and Table 3 (the classification accuracy and the CPU time in seconds). On each pair of training/test sets, each algorithm is run 10 times. In these tables, the average result (over 10 runs for LUN and OPT, and over 100 runs for the six remaining datasets) and its standard deviation are reported.

Table 2 Number and % of selected features of 12 versions of DCA applied on two regularization models (\(\ell _0\) and \(\ell _2\)\(\ell _0\)) with 5 approximations (PiE, Cap-l1, PiL, SCAD, Log) and the exact penalty continuous approach (Econ), and the Adaptive Sub-norm (ASN)—one of the best related algorithms
Table 3 Classification accuracy and CPU time of 12 versions of DCA applied on two regularization models (\(\ell _0\) and \(\ell _2\)\(\ell _0\)) with 5 approximations (PiE, Cap-l1, PiL, SCAD, Log) and the exact penalty continuous approach (Econ), and the Adaptive Sub-norm (ASN)—one of the best related algorithms

For ease of comparison, the number of selected features and the corresponding classification accuracy for the \(\ell _{0}\) model are shown in Figs. 1 and 2.

Fig. 1 Selected features (%) of \(\ell _0\)-DCAs and ASN

Fig. 2 Accuracy of classifiers (%) of \(\ell _0\)-DCAs and ASN

It is interesting to study the consistency of the features selected by the different DCA based algorithms. In the last six columns of Table 4 we report, respectively, the number of features selected by six (6Algo), five (5Algo), four (4Algo), three (3Algo), two (2Algo) and one (1Algo) algorithm(s). Here "Best" (resp. "Worst") stands for the number of features selected by the best (resp. worst) algorithm in terms of sparsity.

Table 4 Frequency of selected features by 6 DCA based algorithms

5.4 Comments on numerical results

Generally speaking, all the DCA based algorithms applied to the \(\ell _{0}\) models (\(\ell _{0}\)-DCA) give good sparsity of solutions. The percentage of selected features varies from 0.33 to 52.49 %. For the \(\ell _{2}\)\(\ell _{0}\) models, the DCA based algorithms (\(\ell _2\)\(\ell _0\)-DCA) give results similar to (but slightly worse than) those of \(\ell _{0}\)-DCA. Both \(\ell _{0}\)-DCA and \(\ell _2\)\(\ell _0\)-DCA not only provide good performance in terms of feature selection, but also give a high classification accuracy (from 68.75 to 96.88 % for the \(\ell _0\) model and from 68.75 to 96.89 % for the \(\ell _2\)\(\ell _0\) model).

More specifically, it is worth mentioning the following observations, which help to answer the questions raised at the beginning of this section.

5.4.1 DCA based algorithms versus ASN

All the DCA based algorithms are better than ASN in terms of feature selection. The gain in sparsity (the difference in the percentage of selected features) of \(\ell _{0}\)-DCA (resp. \(\ell _2\)\(\ell _0\)-DCA) varies from 1.2 to 46.47 % (resp. from 0.47 to 46.45 %). The classification accuracy of the DCA based algorithms is also better than that of ASN in most cases (except for the logarithm approximation Log), and the best DCA version is always better than ASN on all datasets. More precisely, \(\ell _{0}\)-DCA (resp. \(\ell _2\)\(\ell _0\)-DCA) with the PiE, Cap-l1, PiL, SCAD approximations and the exact penalty continuous approach (Econ) is better than ASN on 6/8, 6/8, 4/8, 7/8, and 5/8 (resp. 6/8, 6/8, 5/8, 7/8, and 6/8) datasets, respectively. As for the computation time, all \(\ell _{0}\)-DCAs are generally faster than ASN. More precisely, \(\ell _{0}\)-DCA_Log and \(\ell _{0}\)-DCA_Econ are faster than ASN on 8/8 datasets and the 4 remaining \(\ell _{0}\)-DCAs are faster than ASN on 6/8 datasets. The gain ratio reaches 43.32 (\(\ell _{0}\)-DCA_PiL on the MOV dataset). Likewise, all (resp. 5/6) \(\ell _2\)\(\ell _0\)-DCAs are faster than ASN on the ADV (resp. OPT) dataset. On the remaining datasets, some \(\ell _2\)\(\ell _0\)-DCAs are more expensive than ASN: 3/6 (resp. 4/6) \(\ell _2\)\(\ell _0\)-DCAs are slower than ASN on the CNA and MFE (resp. MOV and SEM) datasets. The LUN dataset is very small, hence the differences in CPU time between the algorithms are not significant.

5.4.2 Approximation approaches versus the exact penalty approach

Numerically, the results are comparable in terms of sparsity and classification accuracy, i.e. there is no great difference between these two approaches. This can be justified by a theoretical result proved in Le Thi et al. (2015b): with a suitable parameter, the resulting approximation problems using Cap-l1 or SCAD are equivalent to the exact penalized continuous problem (see Le Thi et al. 2015b for more details). Meanwhile, in some cases the exact penalty approach is very fast; the gain ratio can reach 12 (SEM dataset, compared with \(\ell _{0}\)-DCA_PiE).

5.4.3 \(\ell _0\) regularization versus \(\ell _2\)\(\ell _0\) regularization

With the same approximation function, \(\ell _{0}\)-DCAs are better than \(\ell _{2}\)\(\ell _{0}\)-DCAs in terms of sparsity on 7/8 (resp. 4/8) datasets with PiE, Cap-l1, PiL, SCAD (resp. Log). Likewise, \(\ell _{0}\)-DCA_Econ is better than \(\ell _2\)\(\ell _{0}\)-DCA_Econ on 6/8 datasets. Meanwhile, the classification accuracy of \(\ell _{2}\)\(\ell _{0}\)-DCAs is slightly better than that of \(\ell _{0}\)-DCAs (on 6/8 datasets with PiE, SCAD and Econ, 5/8 datasets with Capped-\(\ell _{1}\) and PiL, and 4/8 datasets with Log). This result justifies the utility of the \(\ell _2\)\(\ell _{0}\) regularization, in particular to overcome overfitting. As for speed, not surprisingly, the \(\ell _{0}\)-DCAs are faster than the \(\ell _{2}\)\(\ell _{0}\)-DCAs, because \(\ell _{0}\)-DCAs solve one linear program while \(\ell _{2}\)\(\ell _{0}\)-DCAs solve one convex quadratic program at each iteration.

5.4.4 About sparse inducing functions

The classification accuracies of the DCA based algorithms with different approximation functions are comparable: the difference from one to another is less than 3 %, except for the SEM dataset, where Cap-l1 is considerably better than the others (by 6.45 to 10.66 %). Cap-l1, followed by PiL, gives the best accuracy in most cases.

However, the approximation functions have a more considerable influence on the sparsity: the difference in the number of selected features varies from 1.16 to 18 %.

Overall, Cap-l1 seems to be the best approximation in the sense that it realizes a good trade-off between sparsity and accuracy. Once again, this result confirms the consistency property of Cap-l1 proved in Le Thi et al. (2015b).

Concerning the computation time, the comparative results are quite different among the eight datasets and there is no "winner" on all of them. Each of \(\ell _{0}\)-DCA_Log and \(\ell _{2}\)\(\ell _{0}\)-DCA_Log is the fastest on 3 datasets. For the remaining datasets, each of DCA-PiE, DCA-Cap-l1, DCA-PiL and DCA-SCAD is the fastest once.

5.4.5 The consistency of the selected features from different versions of DCA

The results in Table 4 show that the features selected by different versions of DCA are consistent. Some features are selected by all or most of the algorithms. The high selection frequency of these features allows us to identify "important" features.

5.4.6 The connection between the testing datasets and different versions of DCA

We identify only one "strong" connection: \(\ell _0\)-DCA_Cap-l1 and \(\ell _2\)\(\ell _0\)-DCA_Cap-l1 are preferable for the SEM dataset. Indeed, the DCA_Cap-l1 algorithms are considerably better than the others in both classification accuracy (by 6.47 to 10.66 %) and sparsity (by 11.94 to 18.04 %). We can also say that \(\ell _0\)-DCA_PiL is preferable for the MFE dataset, since it is better than the others by 2.91 to 5.44 % in sparsity, and in terms of classification accuracy \(\ell _0\)-DCA_PiL and the exact penalty approach give the best results (with a small difference of 0.53 %).

As for the relation between the datasets and the CPU running time, for multi-class classification (i.e. when the number of classes is at least 3), the CPU time depends on the size of the dataset (the number of classes \(\times \) the number of features \(\times \) the number of samples in the training set). In general, with the same model (\(\ell _{0}\) or \(\ell _2\)\(\ell _{0}\) regularization), the larger the dataset, the higher the computation time. Indeed, we observe that for all algorithms (except \(\ell _{0}\)-DCA-Cap-l1) the MFE dataset (\(10 \times 649 \times 1200\)) is the most expensive among the 8 datasets, followed by SEM (\(10 \times 256 \times 960\)) and then OPT (\(10 \times 63 \times 3823\)). The CNA dataset (\(9 \times 856 \times 464\)) is special: its size is quite large but it is not expensive.

Finally, we observe that the DCA based algorithms work well on both balanced (the sizes of the classes are relatively equal) and unbalanced datasets. For instance, these algorithms give good results for the ADV dataset, whose two classes have sizes of 319 and 1960, respectively.

6 Conclusion

We have developed efficient approaches based on DC programming and DCA for feature selection in multi-class support vector machines. Based on appropriate approximation functions of the zero norm and an exact penalty technique, the \(\ell _{0}\)-MSVM and \(\ell _{2}\)\(\ell _{0}\)-MSVM problems are reformulated as DC programs. It turns out that the corresponding DCA consists in solving, at each iteration, one linear program (for the \(\ell _{0}\) regularization) or one convex quadratic program (for the \(\ell _{2}\)\(\ell _{0}\) regularization). Moreover, several of the DCA based algorithms converge, after a finite number of iterations, almost always to a local solution. Numerical results on several real datasets showed the robustness and effectiveness of the DCA based schemes. We are convinced that DCA is a promising approach for feature selection in MSVM.

By proposing twelve DCA based algorithms, we offer users a large choice of suitable algorithms according to their learning situations. From both theoretical and algorithmic points of view, \(\ell _0\)-DCA is a very efficient approach because it results in polyhedral DC programs for which the corresponding DCA requires solving one linear program at each iteration and converges after a finite number of iterations, almost always to a local minimizer rather than merely a critical point. In particular, sparsity-inducing functions such as Capped-\(\ell _1\) and the piecewise linear function are interesting since an explicit local sufficient optimality condition is available. Overall, to obtain a good trade-off between sparsity and accuracy with a fast algorithm, \(\ell _{0}\)-DCA-Cap-l1 as well as the exact penalty approach \(\ell _{0}\)-DCA-Econ are the best choices. In particular, the Capped-\(\ell _1\) approximation, followed by the piecewise linear approximation, is highly recommended, while the logarithm function should be avoided when classification accuracy is the most important criterion; \(\ell _{2}\)\(\ell _{0}\)-DCA is not recommended either when sparsity is the most important criterion or for massive datasets.

Another important contribution of this paper lies in the determination of the most important features of a dataset via the high selection frequency of features obtained when running several DCA based algorithms.

This research can be extended to the nonlinearly separable case by introducing a kernel into the MSVM model. Work in this direction is in progress.