1 Introduction

Microarray technology can simultaneously measure the expression levels of thousands of genes, which can be used for identifying and classifying cancer diseases [6, 16, 19, 20, 26, 27, 36, 37, 39]. Many methods have been applied to cancer classification, such as support vector machines (SVMs) [4, 16] and k-nearest neighbor (kNN) [4, 19]. Generally, a DNA microarray data set consists of a large number of genes and a relatively small number of samples, and not all of the genes are useful for cancer classification tasks. To find the useful genes in gene expression data, it is therefore necessary to perform gene selection (feature selection) before cancer classification.

Many gene selection methods have been proposed, such as genetic algorithm-based methods [19, 26] and SVM-based methods [4, 10, 12, 16,17,18, 38, 46]. In this paper, we focus on gene selection methods based on SVMs. An embedded feature selection algorithm based on SVMs was proposed in [18]; it adaptively identifies important features by introducing data-driven weights, and implements classification and gene selection simultaneously. However, this method requires tuning several parameters, and its performance largely depends on them. The \(\ell_{2,0}\) SVM, a variant of SVM, can also be applied to feature selection [31]. For this kind of algorithm, how to design the objective function based on the standard SVM is the key and difficult issue.

Lee and Leu proposed a hybrid feature selection algorithm based on SVM for microarray data analysis [17]. Specifically, this method uses a genetic algorithm to generate a number of gene subsets, the chi-square test to select a proper number of top-ranked genes, and SVM to verify the efficiency of the selected genes. Weston et al. proposed a wrapper feature selection algorithm based on SVM, which finds useful genes by minimizing bounds on the leave-one-out error using gradient descent [38].

Guyon et al. proposed the SVM-RFE (recursive feature elimination) algorithm, which is the most representative gene selection algorithm based on SVMs [16]. Duan et al. presented a new gene selection method based on SVM-RFE, called MPSVM-RFE; at each step, it trains multiple linear SVMs on subsamples of the training data and computes the feature ranking scores from a statistical analysis of the weight vectors [10]. To deal with multi-class cancer classification tasks, a multi-class SVM-RFE (MSVM-RFE) [46], a multiple SVM-RFE (MPSVM-RFE) [40] and a multiple support vector data description recursive feature elimination (MSVDD-RFE) [4] have been developed. Once gene selection is achieved, a classifier is applied subsequently. Although SVM-RFE methods can select useful genes for cancer classification, their computational complexity is very high when the number of genes is large.

Fortunately, the 1-norm SVM can be applied to perform fast gene selection [12] and feature selection for conjoint analysis [22]. The 1-norm SVM is a variant of SVM [2, 23, 43, 45, 47]. It is well known that the objective function of SVM consists of a 2-norm regularization penalty and the hinge loss [13, 29, 34, 35]; the 1-norm SVM instead simultaneously minimizes a 1-norm regularization penalty and the hinge loss. Since the 1-norm regularization penalty, as a substitute for the 0-norm regularization penalty, can also induce a sparse solution, it has been shown that the 1-norm SVM yields a sparser representation than SVM [43].

Fung and Mangasarian used the Newton method to minimize an exterior penalty problem constructed from the dual of the 1-norm SVM, a method called NLPSVM, and applied NLPSVM to gene selection [12]. Since NLPSVM with the linear kernel directly picks out the useful genes and does not use the recursive feature elimination scheme, it is much faster than SVM-RFE methods. In addition, NLPSVM can classify microarray data in its own right. As a variant of the 1-norm SVM, the 1-norm SVM with a squared loss (1-norm SVMSL) was proposed in [42]. Since the 1-norm SVMSL uses orthogonal matching pursuit (OMP) to find an approximate solution, its training speed is very fast. However, the approximate solution may cause a loss in classification performance.

To select genes quickly and achieve better classification performance, this paper designs a cancer classification framework. The 1-norm SVMSL is first used to select useful genes, and a subsequent classifier is then applied to the reduced data. We perform extensive experiments on DNA microarray data sets.

The rest of the paper is organized as follows. Section 2 addresses related work on gene selection based on SVMs. Section 3 presents the framework of this paper. Experimental results on real-world data sets are given in Section 4. Section 5 concludes the paper.

2 Related work

This section introduces gene selection methods based on SVMs; we discuss SVM-RFE and the 1-norm SVM in turn. Generally, cancer classification can be described as follows. Given a set of training samples \(X=\{(\textbf {x}_{i},y_{i})\}_{i = 1}^{n}\), where \({{\textbf {x}}_{i}} \in {\mathbb {R}^{D}}\) denotes the gene expression of a patient, \(y_{i}\in\{-1,+1\}\) is the label of \(\textbf{x}_{i}\), \(D\) is the number of genes (or the dimension of samples), and \(n\) is the number of samples, we try to find genes in this training set that are useful for discriminating cancer diseases. The meaning of the label \(y_{i}\) differs across datasets. In the Breast Cancer dataset [33], for example, the label “+1” means “non-relapse” and the label “−1” means “relapse”.

2.1 SVM-RFE

The support vector machine, a state-of-the-art learner in machine learning and pattern recognition, is famous for its good generalization performance, sparse model representation, and nonlinear learning ability [13, 29, 34, 35].

As a classifier, SVM requires a set of training samples X, and solves the following primal optimization problem:

$$\begin{array}{@{}rcl@{}} \min_{\textbf{w},b,{\xi}}& &\frac{1}{2}\|\textbf{w}\|^{2}_{2}+C\sum\limits_{i=1}^{n}\xi_{i} \\ subject~to & &y_{i}\left( \textbf{w}^{T}\textbf{x}_{i}+b\right)\geq 1-\xi_{i}, \\ & & \xi_{i} \geq 0, i=1,\cdots, n \end{array} $$
(1)

where \(\xi_{i}\) is a slack variable, \(\|\cdot\|_{2}\) denotes the 2-norm of a vector, \(C>0\) is the regularization factor which controls the balance between the empirical risk and the capacity of SVM, and \(\textbf{w}\) and \(b\) are respectively the weight vector and the threshold in the hypothesis function

$$ f\left( \textbf{x}\right)=\textbf{w}^{T}\textbf{x}+b $$
(2)

where \(\textbf{w}=[w_{1},w_{2},\cdots,w_{D}]^{T}\). Once we have \(\textbf{w}\) and \(b\), we can assign a label to any sample \(\textbf{x}\) using

$$ f^{*}(\textbf{x})=sign\left( f\left( \textbf{x}\right)\right)=sign\left( \textbf{w}^{T}\textbf{x}+b\right) $$
(3)

where \(sign(\cdot)\) is the sign function.

To implement gene selection for cancer classification, SVM-RFE was proposed in [16]. In a sequential backward elimination manner, SVM-RFE iteratively deletes redundant genes and retains useful ones. In each iteration, SVM-RFE trains a linear SVM model on the reserved genes and deletes one gene according to the weight scores \(c_{j}={w_{j}^{2}}\), \(j=1,\cdots,D\). The higher \(c_{j}\) is, the greater the importance of the \(j\)th gene. The detailed algorithm for SVM-RFE is shown in Algorithm 1 [16]. Note that the order of elements in \(R\) is very important in Step 2(d): the last element in \(R\) corresponds to the gene with the highest score, i.e., the most important one.

A novel gene selection method, MPSVM-RFE, was proposed based on SVM-RFE [10]. In each iteration, MPSVM-RFE computes the ranking scores from a statistical analysis of the weight vectors of multiple linear SVMs trained on subsets of the original training data, where cross-validation (CV) is adopted to resample the training data. Thus, the computational complexity of MPSVM-RFE is much higher than that of SVM-RFE.

Algorithm 1 SVM-RFE [16]
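
Since the Algorithm 1 figure is not reproduced here, the following Python sketch illustrates the elimination loop just described; it removes one gene per iteration and uses scikit-learn's LinearSVC, which is our illustrative choice rather than the authors' MATLAB/C implementation.

```python
import numpy as np
from sklearn.svm import LinearSVC

def svm_rfe(X, y, C=1.0):
    """Return gene indices of X (n_samples x n_genes) ordered from least
    to most important; y holds labels in {-1, +1}."""
    surviving = list(range(X.shape[1]))   # genes still under consideration
    ranking = []                          # filled from least important onward
    while surviving:
        w = LinearSVC(C=C).fit(X[:, surviving], y).coef_.ravel()
        worst = int(np.argmin(w ** 2))    # ranking score c_j = w_j^2
        ranking.append(surviving.pop(worst))
    return ranking

# The last d entries of the returned list are the d top-ranked genes,
# matching the use of R described in Step 2(d).
```

Retraining a linear SVM once per eliminated gene over more than ten thousand genes makes the cost of this loop apparent, which is the complexity issue raised in Section 1.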

2.2 1-norm SVM

Since the 1-norm SVM yields sparse solutions, the linear 1-norm SVM can be used for feature or variable selection [3]. As we know, there are several different LP forms of the 1-norm SVM [2, 8, 23, 24, 43, 45, 47]; however, these forms have been proved essentially equivalent under proper parameter choices. Here we introduce the LP form presented in [43], which can be solved by the simplex method, the interior-point method, NLPSVM [12], the column generation Newton (CGN) method [44] and the row-column generation (RCG) method [41]. Among these methods, NLPSVM is relatively fast in the linear case; CGN and RCG were proposed to improve the training speed in the nonlinear case.

Consider the training sample set of two classes \(X=\{(\textbf{x}_{1},y_{1}),(\textbf{x}_{2},y_{2}),\ldots,(\textbf{x}_{n},y_{n})\}\), where \(\textbf{x}_{i}\in \mathbb {R}^{D}\) and \(y_{i}\in\{+1,-1\}\). The 1-norm SVM simultaneously minimizes the 1-norm regularization penalty and the hinge loss, which can be expressed as a linear programming problem. The primal problem of the 1-norm SVM is:

$$ \begin{array}{ll} \min \hspace{0.2cm} &\sum\limits_{j=1}^{D}\left( \alpha^{+}_{j}+\alpha^{-}_{j}\right)+ \sigma\left( \beta^{+}+\beta^{-}\right) + C\sum\limits_{i=1}^{n} \gamma_{i} \\ s.t.\hspace{0.2cm} & y_{i}\left[\textbf{x}_{i}^{T}\left( \boldsymbol{\upalpha}^{+} - \boldsymbol{\upalpha}^{-}\right) + \left( \beta^{+} - \beta^{-}\right)\right] \geq 1-\gamma_{i} \\ &\beta^{+},\beta^{-} \geq 0, \gamma_{i}\geq 0, i=1, \cdots, n\\ & \alpha_{j}^{+},\alpha_{j}^{-} \geq 0, j=1, \cdots, D \end{array} $$
(4)

where \(C>0\) is the regularization factor, \(\sigma>0\) is a small constant, \(\boldsymbol {\upalpha }^{+}=\left [\alpha _{1}^{+},\cdots ,\alpha _{D}^{+}\right ]^{T}\), \(\boldsymbol {\upalpha }^{-}=\left [\alpha _{1}^{-},\cdots ,\alpha _{D}^{-}\right ]^{T}\), \(\beta ^{+}\in \mathbb {R}\) and \(\beta ^{-}\in \mathbb {R}\) are model coefficients for the 1-norm SVM, and \(\boldsymbol {\upgamma }=\left [\gamma _{1},\cdots ,\gamma _{n}\right ]^{T}\in \mathbb {R}^{n}\) is a loss vector.

Naturally, (4) can be rewritten in matrix form as

$$ \begin{array}{ll} \min_{\textbf{u}}\hspace{0.2cm} &\textbf{c}^{T}\textbf{u}\\ s.t.\hspace{0.2cm}&\textbf{A}\textbf{u}\geq \textbf{b}\\ &\textbf{u}\geq \textbf{0} \end{array} $$
(5)

where the variable vector is \(\textbf{u}=\left[(\boldsymbol{\upalpha}^{+})^{T},(\boldsymbol{\upalpha}^{-})^{T},\beta^{+},\beta^{-},\boldsymbol{\upgamma}^{T}\right]^{T}\), \(\textbf{c}=\left[\textbf{1}^{T},\textbf{1}^{T},\sigma,\sigma,C\textbf{1}^{T}\right]^{T}\), \(\textbf{b}=\textbf{1}\), the constraint coefficient matrix is \(\textbf{A}=[\textbf{D}_{y}\textbf{X},-\textbf{D}_{y}\textbf{X},\textbf{y},-\textbf{y},\textbf{I}]\), \(\textbf {X}\in \mathbb {R}^{n \times D}\) is the sample matrix in which each sample is a row vector, \(\textbf{y}=[y_{1},y_{2},\cdots,y_{n}]^{T}\), \(\textbf{D}_{y}\) is the diagonal matrix whose diagonal is \(\textbf{y}\), and \(\textbf{1}\) and \(\textbf{I}\) are the all-ones column vector and the identity matrix of appropriate size, respectively.

The decision function of the linear 1-norm SVM takes the form:

$$\begin{array}{@{}rcl@{}} f^{*}(\textbf{x})&=&sign\left( \left( \boldsymbol{\upalpha}^{+}-\boldsymbol{\upalpha}^{-}\right)^{T}\textbf{x}+(\beta^{+}-\beta^{-})\right)\\ &=&sign\left( \boldsymbol{\upalpha}^{T}\textbf{x}+\beta\right) \end{array} $$
(6)

where \(\boldsymbol{\upalpha}=\boldsymbol{\upalpha}^{+}-\boldsymbol{\upalpha}^{-}\) and \(\beta=\beta^{+}-\beta^{-}\).

Once the problem (5) is solved, we obtain \(\boldsymbol{\upalpha}\) and \(\beta\). Since most elements of \(\boldsymbol{\upalpha}\) are zero, gene selection is implemented automatically by searching for nonzero coefficients in \(\boldsymbol{\upalpha}\): if \(\alpha_{j}\neq 0\), then the \(j\)th gene is selected.
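
To make (5) concrete, the sketch below assembles A, c and b from a training set, solves the LP, and reads the selected genes off the nonzero entries of α. The paper solves this LP with GLPK or NLPSVM; the use of SciPy's linprog and all variable names here are our illustrative assumptions.

```python
import numpy as np
from scipy.optimize import linprog

def one_norm_svm(X, y, C=1.0, sigma=1e-4):
    """Linear 1-norm SVM via the LP (5). X: (n, D); y: entries in {-1, +1}."""
    n, D = X.shape
    DyX = y[:, None] * X                              # D_y X
    # u = [alpha+; alpha-; beta+; beta-; gamma], all >= 0
    A = np.hstack([DyX, -DyX, y[:, None], -y[:, None], np.eye(n)])
    c = np.concatenate([np.ones(2 * D), [sigma, sigma], C * np.ones(n)])
    # linprog expects A_ub u <= b_ub, so negate the constraint A u >= b = 1
    res = linprog(c, A_ub=-A, b_ub=-np.ones(n), bounds=(0, None))
    u = res.x
    alpha = u[:D] - u[D:2 * D]                        # alpha = alpha+ - alpha-
    beta = u[2 * D] - u[2 * D + 1]                    # beta = beta+ - beta-
    selected = np.flatnonzero(np.abs(alpha) > 1e-8)   # genes with alpha_j != 0
    return alpha, beta, selected
```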

The 1-norm SVM solved by NLPSVM is much faster than SVM-RFE for gene selection since it does not require a recursive process. The complexity of each iteration of NLPSVM is \(O(\min(n,D)^{3})\) when the Sherman-Morrison-Woodbury identity is applied, where \(n\) is the number of samples and \(D\) is the number of genes.

3 Framework of the proposed method

The classification framework based on the 1-norm SVMSL is shown in Fig. 1, where the linear SVM is taken as an illustration of the subsequent classifier, \(X\) is the given training sample set, and \(X^{\prime}\) is the reduced training sample set. We first introduce the 1-norm SVMSL, and then discuss the gene selection and classification stages in turn.

Fig. 1 Classification framework based on the 1-norm SVMSL

3.1 1-norm SVMSL

Given a training set \(X=\{(\textbf {x}_{i},y_{i})\}_{i=1}^{n}\) with \(\textbf{x}_{i}\in \mathbb {R}^{D}\), our goal in the gene selection stage is to select \(d\) genes out of \(D\), where \(d\ll D\). The 1-norm SVMSL is applied to select these \(d\) genes.

By replacing the hinge loss in (4) with the squared loss, we obtain the optimization problem of the 1-norm SVMSL:

$$\begin{array}{@{}rcl@{}} &&\min\left\|\boldsymbol{\upalpha}^{+}\right\|_{1}\,+\,\left\|\boldsymbol{\upalpha}^{-}\right\|_{1}\,+\,\sigma\left( \beta^{+}+\beta^{-}\right)+ C \left\|\textbf{1}- \textbf{D}_{y}\left( \textbf{X}\boldsymbol{\upalpha}+\beta\right)\right\|_{2}^{2} \\ &&s.t.~\beta^{+}\!\geq\! 0,\beta^{-} \geq 0, \boldsymbol{\upalpha}^{+}\geq \textbf{0},\boldsymbol{\upalpha}^{-} \geq \textbf{0} \end{array} $$
(7)

where \(\|\cdot\|_{1}\) denotes the 1-norm of a vector. Absorbing \(\sigma\) into a uniform 1-norm weight \(\tau\), the standard matrix formulation of (7) can be expressed as

$$ \begin{array}{ll} \underset{\textbf{u}}{\min} \hspace{0.2cm} &\tau \|\textbf{u}\|_{1}+ \frac{1}{2} \left\|\textbf{c}- \textbf{Au}\right\|_{2}^{2} \\ s.t. & \textbf{u} \geq \textbf{0} \end{array} $$
(8)

where \(\textbf{u}=\left[(\boldsymbol{\upalpha}^{+})^{T},(\boldsymbol{\upalpha}^{-})^{T},\beta^{+},\beta^{-}\right]^{T}\), \(\textbf{A}=[\textbf{D}_{y}\textbf{X},-\textbf{D}_{y}\textbf{X},\textbf{y},-\textbf{y}]\), \(\textbf{c}=\textbf{1}\), and \(\tau>0\) is a positive real parameter which plays the same role as the parameter \(C\) in both SVM and the 1-norm SVM.

In sparse signal reconstruction, problem (8) is equivalent, under some conditions, to both the least absolute shrinkage and selection operator (LASSO) and the quadratically constrained linear problem (QCLP) [1]. Thus, orthogonal matching pursuit (OMP), which has been applied to approximately solve QCLP [7, 9, 32], can also be used to approximately solve the 1-norm SVMSL. In addition, OMP needs only the parameter \(d\), instead of \(\tau\), when solving (8).
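
The following sketch shows how an OMP-style greedy loop with at most d steps can approximately solve (8); since u must be nonnegative, we refit the active set with nonnegative least squares at each step. This loop is our own reconstruction (the paper uses the OMP routine in the SparseLab package), so the function name and the NNLS refit are assumptions.

```python
import numpy as np
from scipy.optimize import nnls

def omp_nonneg(A, c, d):
    """Greedily minimize ||c - A u||_2 with u >= 0 and at most d nonzeros."""
    m = A.shape[1]
    u = np.zeros(m)
    active = []                           # indices of the selected columns
    residual = c.astype(float).copy()
    for _ in range(d):
        corr = A.T @ residual             # correlation with current residual
        corr[active] = -np.inf            # never reselect an active column
        j = int(np.argmax(corr))
        if corr[j] <= 0:                  # no column can reduce the residual
            break
        active.append(j)
        u_act, _ = nnls(A[:, active], c)  # nonnegative LS on the active set
        residual = c - A[:, active] @ u_act
        u[:] = 0.0
        u[active] = u_act
    return u
```

Because d caps the number of greedy steps, it directly bounds the number of selected columns, which is why the method saturates at a maximum number of selected genes in the experiments below.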

3.2 Gene selection and classification

When problem (8) is solved approximately by OMP, we obtain the coefficient vector \(\boldsymbol{\upalpha}\). Similar to NLPSVM, the 1-norm SVMSL can select genes automatically. However, since the solution of the 1-norm SVMSL is only approximate, the classification performance obtained by directly using \(\boldsymbol{\upalpha}\) is not consistent with that generated by an exact solution. The experimental results in [44] also show that the performance of approximate algorithms can be better or worse than that of exact algorithms.

To avoid this situation, we can use a subsequent classifier, such as a support vector machine or nearest neighbor, to classify the reduced dataset. As shown in Fig. 1, the framework consists of a training process and a test process, which are discussed in turn below.

The training process is given in Algorithm 2, which requires training both the 1-norm SVMSL and the linear SVM. Let \(X=\{(\textbf {x}_{i},y_{i})\}_{i=1}^{n}\) denote the given training sample set, and assume that we need to select \(d\) genes out of \(D\), where \(d\ll D\). The 1-norm SVMSL is first trained on \(X\) and outputs the selected gene subset \(S\) with \(|S|=d\). According to the subset \(S\), we select the useful genes and generate a new training set \(X^{\prime }=\{(\textbf {x}^{\prime }_{i},y_{i})\}_{i=1}^{n}\), where \(\textbf {x}^{\prime }_{i}\in \mathbb {R}^{d}\). The subsequent classifier, for example the linear SVM, is then trained on \(X^{\prime}\).

For an unseen sample \(\textbf{x}\), we first reduce it to \(\textbf{x}^{\prime}\) by employing \(S\), and then apply the trained model to \(\textbf{x}^{\prime}\) to assign a class label to \(\textbf{x}\).

Algorithm 2 Training process of the proposed framework
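
The Algorithm 2 figure is likewise not reproduced; under the same assumptions as the earlier sketches (omp_nonneg from Section 3.1, and scikit-learn's LinearSVC standing in for the subsequent classifier), the two training stages and the test step can be sketched as follows.

```python
import numpy as np
from sklearn.svm import LinearSVC

def train(X, y, d):
    """Stage 1: select genes with the 1-norm SVMSL; stage 2: train linear SVM."""
    n, D = X.shape
    DyX = y[:, None] * X
    A = np.hstack([DyX, -DyX, y[:, None], -y[:, None]])  # matrix A in (8)
    u = omp_nonneg(A, np.ones(n), d)                     # c = 1 in (8)
    alpha = u[:D] - u[D:2 * D]
    S = np.flatnonzero(np.abs(alpha) > 1e-8)             # selected gene subset S
    # |S| can fall slightly below d if OMP picks the beta+/beta- columns
    clf = LinearSVC().fit(X[:, S], y)                    # train on reduced set X'
    return S, clf

def predict(S, clf, x):
    """Reduce an unseen sample x to x' via S, then assign it a class label."""
    return clf.predict(x[S].reshape(1, -1))[0]
```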

4 Results and discussions

This section discusses experimental results of the proposed method on cancer classification using gene expression data and compares it with classical gene selection methods, such as SVM-RFE.

4.1 Experiment settings

Four public gene microarray datasets are used to validate the performance of the compared methods: Leukemia-ALLAML [14], Lung Cancer [15], Central Nervous System (CNS) [25], and Prostate Cancer [28]. These four datasets can be downloaded from http://datam.i2r.a-star.edu.sg/datasets/krbd/. All data are normalized so that each gene has mean 0 and variance 1, following the procedure in [11].
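
This gene-wise normalization amounts to the following two lines; the guard against constant genes is our own precaution, not part of the procedure in [11].

```python
import numpy as np

def standardize_genes(X):
    """Scale each gene (column of X) to mean 0 and variance 1."""
    mu, sd = X.mean(axis=0), X.std(axis=0)
    return (X - mu) / np.where(sd == 0, 1.0, sd)   # avoid dividing by zero
```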

Since a subsequent classifier is required, we provide two classifiers, SVM and nearest neighbor (NN), to observe the performance of the 1-norm SVMSL. The compared methods include SVM-RFE+SVM/NN [16], SVDD-RFE+SVM/NN [4], the 1-norm SVM [12, 43], the 1-norm SVMSL [42], the 1-norm SVM+SVM/NN, and the proposed 1-norm SVMSL+SVM/NN. Since the 1-norm SVM is similar to the 1-norm SVMSL, SVM or NN can also be appended to it, which gives the compared method 1-norm SVM+SVM/NN. Among the many methods for solving the 1-norm SVM, we choose two: the simplex method and the Newton method. The 1-norm SVM with the simplex method yields an exact solution, while the Newton method yields an approximate one. For simplicity, the 1-norm SVM with the simplex method is still denoted by the 1-norm SVM; the 1-norm SVM with the Newton method is NLPSVM, proposed in [12].

All these gene selection methods use the linear kernel. The regularization parameter C for SVM-RFE is selected from the set {0.5, 1, 2, 10, 100}, for SVDD-RFE from the set {1/n, 0.1, 0.5, 1}, and for the 1-norm SVM from the set {0.5, 1, 2, 10, 100}, according to empirical evidence. The regularization parameter is selected by 3-fold cross-validation on the training set with the corresponding classifier; for example, C for SVM-RFE is selected using SVM. For the 1-norm SVMSL, the only parameter is d, which determines the number of selected genes.

We use recall to evaluate the performance of these methods. The recall of the \(j\)th class is defined as

$$ Recall_{j}=\frac{TP_{j}}{TP_{j}+FN_{j}} $$
(9)

where \(TP_{j}\) and \(FN_{j}\) are the numbers of correctly and incorrectly classified samples in class \(j\), respectively.
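
Per-class recall is a direct transcription of (9); a minimal sketch:

```python
import numpy as np

def recall_per_class(y_true, y_pred):
    """Recall_j = TP_j / (TP_j + FN_j) for every class label j."""
    return {j: float(np.mean(y_pred[y_true == j] == j))
            for j in np.unique(y_true)}
```

The average recall reported in the tables and figures below can then be taken as the mean of these per-class values.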

All programs are written in MATLAB and/or C. Both SVM and SVDD are implemented using the LIBSVM package [5], the 1-norm SVM is implemented using the GLPK package [21], and NLPSVM is implemented using the code provided in [12]. In addition, the 1-norm SVMSL is solved using OMP from the SparseLab package [30]. All numerical experiments are performed on a personal computer with a 2.93 GHz Intel(R) Core(TM)2 Duo CPU and 2 GB of memory, running Windows 7 with the MATLAB 7.10 compiler installed.

4.2 Lung cancer

The Lung cancer dataset consists of 181 tissue samples, of which 31 are malignant pleural mesothelioma (MPM) and 150 are lung adenocarcinoma (ADCA). The training set contains 32 of them, 16 MPM and 16 ADCA. The remaining 149 samples are used for testing, with 15 MPM and 134 ADCA. Each sample is described by 12533 genes.

4.2.1 SVM classifier

On the training set, we determine the parameters: C = 10 for SVM-RFE, C = 1 for SVDD-RFE, and C = 2 for the 1-norm SVM, NLPSVM and the 1-norm SVMSL. Note that SVM-RFE and SVDD-RFE can rank genes, but the 1-norm SVMSL and the 1-norm SVM cannot. However, the 1-norm SVMSL has a free parameter d which represents the number of selected genes. Thus, we can select the last d genes in R for the first two methods and set different values of d for the 1-norm SVMSL. We let d vary from 1 to 400, following the experimental setting in [46]. The average recall vs. d for the three methods is shown in Fig. 2.

Fig. 2 Average recall vs. gene number with the linear SVM on the Lung cancer dataset

When d is small, say d < 20, the 1-norm SVMSL is better than SVM-RFE and SVDD-RFE. When d > 25, the performance of the 1-norm SVMSL is unchanged; in fact, the maximum number of genes the 1-norm SVMSL selects on the Lung cancer dataset is 29, so its performance is stable for d ≥ 29. The 1-norm SVM automatically determines the number of selected genes, namely the number of nonzero coefficients in its model. On this dataset, 24 genes are selected by the 1-norm SVM, and 33 by NLPSVM.

Table 1 lists the best average performance with the corresponding gene number and CPU time for all these methods, where the CPU time includes gene selection, classifier training and testing. Table 1 indicates that the 1-norm SVMSL can pick out genes very quickly, and that the use of the subsequent classifier improves the average recall from 98.99% to 99.25%. In fact, for all the 1-norm SVM-like methods, the use of SVM improves classification performance. The 1-norm SVMSL is almost four orders of magnitude faster than SVM-RFE, three orders faster than SVDD-RFE, two orders faster than NLPSVM, and one order faster than the 1-norm SVM. In addition, note that the number of useful genes is very small for the 1-norm SVMSL: only three.

Table 1 Comparison of gene selection methods with SVM on the Lung cancer dataset

4.2.2 NN classifier

The subsequent classifier used here is NN. Similarly, we let d vary from 1 to 400 for the three methods SVM-RFE, SVDD-RFE and the 1-norm SVMSL. The average recall vs. d is shown in Fig. 3. We can see that the 1-norm SVMSL has good average recall when d is small, say d < 10.

Fig. 3 Average recall vs. gene number with NN on the Lung cancer dataset

Table 2 lists the best average performance with the corresponding gene number and CPU time for all these methods, where the CPU time includes the gene selection and classification processes. Since the 1-norm SVM, NLPSVM and the 1-norm SVMSL implement gene selection and classification at the same time, their results are the same as those in Table 1. Unfortunately, the use of NN does not improve the classification performance as we would expect; only for the 1-norm SVM does NN improve classification performance. The ranking of CPU times is the same as in the previous experiment.

Table 2 Comparison of gene selection methods with NN on the Lung cancer dataset

4.2.3 Selected genes by the 1-norm SVMSL

On the Lung cancer dataset, the 1-norm SVMSL selects at most 29 genes based on the training set. However, Tables 1 and 2 show that only three or four genes are enough for discriminating ADCA from MPM: 3 for the SVM classifier and 4 for the NN classifier. Table 3 shows information on the four selected genes, where the first three are for SVM only. These genes are protein-coding genes from Homo sapiens.

Table 3 Genes selected by the 1-norm SVMSL on the Lung cancer dataset

Gene selection methods aim to select genes that are weakly correlated with each other so as to reduce redundancy. Statistical methods offer a way to measure the linear correlation between genes. For any gene pair \((G_{i},G_{j})\), we can compute the correlation coefficient between them, namely,

$$ \rho_{ij}=\frac{Cov(G_{i},G_{j})}{\sqrt{D(G_{i})D(G_{j})}} $$
(10)

where \(Cov(\cdot,\cdot)\) denotes the covariance function, \(D(\cdot)\) denotes the variance function, and \(0\leq\rho_{ij}\leq 1\). \(G_{i}\) is linearly uncorrelated with \(G_{j}\) if and only if \(\rho_{ij}=0\).

Based on Table 1, we check the linear correlation of the selected genes corresponding to the best performance. To see the coefficient distribution, we discretize the correlation coefficients into five intervals, [0.0,0.2), [0.2,0.4), [0.4,0.6), [0.6,0.8) and [0.8,1), and compute the probability of each interval. Table 4 gives the coefficient distribution and the average correlation coefficient, which is defined as

$$ \bar{\rho}=\frac{2}{d(d-1)}\sum\limits_{i=1}^{d-1}\sum\limits_{j=i+1}^{d}\rho_{ij} $$
(11)

The number of gene pairs in Table 4 is equal to d(d − 1)/2, where the value of d is given in Table 1. The correlation coefficients for the 1-norm SVMSL are all less than 0.4, which indicates that this method indeed selects weakly correlated genes. Table 2 leads to a similar conclusion.
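
The quantities in Table 4 follow directly from (10) and (11); the sketch below computes the pairwise coefficients, their average, and the interval distribution from the expression matrix restricted to the selected genes. Taking the absolute value reflects the convention 0 ≤ ρ_ij ≤ 1 stated above; function and variable names are ours.

```python
import numpy as np

def correlation_stats(Xs):
    """Xs: (n_samples, d) expression matrix of the d selected genes."""
    R = np.abs(np.corrcoef(Xs, rowvar=False))  # |rho_ij| for all gene pairs
    iu = np.triu_indices_from(R, k=1)          # the d(d-1)/2 distinct pairs
    rho = R[iu]
    rho_bar = rho.mean()                       # average coefficient, eq. (11)
    counts, _ = np.histogram(rho, bins=[0.0, 0.2, 0.4, 0.6, 0.8, 1.0])
    return rho_bar, counts / rho.size          # distribution over 5 intervals
```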

Table 4 Average correlation coefficient and distribution of best gene subsets on the Lung cancer dataset

In summary, we have the following conclusions based on the experimental results on the Lung cancer dataset:

  • Although SVM-RFE+SVM/NN has the best average recall, slightly superior to that of the 1-norm SVMSL+SVM/NN, the 1-norm SVMSL is vastly faster than SVM-RFE. In addition, the 1-norm SVMSL+SVM/NN achieves its best performance with a small number of genes, which is useful for speeding up the test process and reducing data storage.

  • Compared with SVDD-RFE, the 1-norm SVMSL has an advantage in both speed and performance.

  • The 1-norm SVMSL has performance similar to the 1-norm SVM and much better than NLPSVM, while needing less time for gene selection than both.

  • As the subsequent classifier, SVM is much better than NN because SVM improves the average recall of the 1-norm SVMSL.

  • The genes selected by the 1-norm SVMSL are mostly weakly correlated, whereas some gene pairs selected by other methods are highly correlated. For example, 0.79% of the gene pairs selected by SVM-RFE have large correlation coefficients, lying in the interval [0.8,0.9).

4.3 Leukemia-ALLAML

The Leukemia-ALLAML dataset has 72 samples belonging to two classes, ALL (acute lymphoblastic leukemia) and AML (acute myeloid leukemia). The training set consists of 38 bone marrow samples (27 ALL and 11 AML), described by 7129 probes from 6817 human genes. In addition, 34 test samples are provided, with 20 ALL and 14 AML.

We again let d vary from 1 to 400. The average recall vs. d obtained by SVM-RFE, SVDD-RFE and the 1-norm SVMSL is shown in Fig. 4. We reach a similar conclusion as before: the 1-norm SVMSL achieves its best performance when d is small.

Fig. 4 Average recall vs. gene number on the Leukemia-ALLAML dataset

Table 5 shows the best average performance with the corresponding gene number and CPU time for all these methods. Table 5 also implies that the 1-norm SVMSL can select gene subsets quickly. SVM-RFE has the best average recall among all the methods, followed by the 1-norm SVMSL+SVM and the 1-norm SVM. In addition, NN cannot improve the performance of the 1-norm SVM, NLPSVM or the 1-norm SVMSL, but SVM can improve the performance of both NLPSVM and the 1-norm SVMSL.

Table 5 Comparison on the Leukemia-ALLAML dataset

On this dataset, the 1-norm SVMSL selects at most 32 genes based on the training set. Among these 32 genes, four are enough for discriminating ALL from AML. Table 6 shows information on the four genes. The first two genes, ZYX and MPO, often appear in visualizations; see http://www.biolab.si/supp/bi-cancer/projections/info/leukemia.htm.

Table 6 Genes selected by the 1-norm SVMSL on the Leukemia-ALLAML dataset

Now, we validate the linear correlation of the best gene subsets, i.e., the subsets with the highest average recall obtained by SVM. Table 7 lists the coefficient distribution and the average correlation coefficient. The correlation coefficients for the 1-norm SVMSL all lie between 0.2 and 0.4, which shows that the genes selected by the 1-norm SVMSL are not strongly correlated.

Table 7 Average correlation coefficient and distribution of best gene subsets on the Leukemia-ALLAML dataset

The experimental conclusions on the Leukemia-ALLAML dataset are consistent with those on the Lung cancer dataset, so we do not discuss them further.

4.4 More datasets

In the following, we perform experiments on two more gene expression datasets, CNS and Prostate tumor, which are described as follows:

  • The CNS dataset contains 60 patient samples: 21 survivors (alive after treatment) and 39 failures (succumbed to the disease). There are 7129 genes in the dataset. The training set consists of the first 10 survivors and 30 failures; the other 11 survivors and 9 failures are test points.

  • The Prostate tumor dataset has a training set containing 52 prostate tumor samples and 50 non-tumor (labeled “Normal”) prostate samples with around 12600 genes. An independent set of test samples is also provided, which comes from a different experiment and has a nearly 10-fold difference in overall microarray intensity from the training data. We have removed the extra genes contained in the test samples. There are 25 tumor and 9 normal samples in the test set.

As shown above, NN does not outperform SVM. Thus, we only consider SVM as the subsequent classifier. The parameter settings are the same as before.

The average recall vs. d obtained by SVM-RFE, SVDD-RFE and the 1-norm SVMSL is shown in Fig. 5. If the number of selected genes is large enough, SVDD-RFE can obtain good performance on the CNS and Prostate datasets. The 1-norm SVMSL still achieves its best performance when d is small.

Fig. 5 Average recall vs. gene number when taking SVM as the classifier

Tables 8 and 9 show the experimental results on the two datasets. Without gene selection, the average recall obtained by SVM is very low on both datasets. SVDD-RFE is comparable to SVM-RFE when SVDD-RFE selects more genes. On both datasets, the 1-norm SVMSL+SVM is much better than the other 1-norm SVM-like methods.

Table 8 Comparison on the CNS dataset
Table 9 Comparison on the Prostate tumor dataset

The correlation coefficient distributions of the selected genes are shown in Fig. 6. Again, the linear correlation between the genes selected by the 1-norm SVMSL is very low on both the CNS and Prostate tumor datasets.

Fig. 6 Correlation coefficient distribution

5 Conclusions

This paper proposes a new strategy for cancer classification based on gene expression data. Fast gene selection is implemented by the 1-norm SVMSL, a variant of the 1-norm SVM. Although the 1-norm SVMSL can perform gene selection and classification at the same time, it uses OMP to find an approximate solution, which may not be exact enough for classification. Thus, a subsequent classifier is used to classify the samples described by the genes selected by the 1-norm SVMSL. In theory, any classifier suited to cancer classification can serve as the subsequent classifier; experimental results show that SVM is a proper choice.

In our experiments, we compare the 1-norm SVMSL with SVM-RFE, SVDD-RFE, the 1-norm SVM and NLPSVM on four datasets. SVM-RFE outperforms the other methods in classification performance on three of the four datasets, but it takes a long time to rank genes, more than 1,000 seconds for 12533 genes; note that the 1-norm SVMSL is at least four orders of magnitude faster than SVM-RFE. In addition, the 1-norm SVMSL+SVM achieves acceptable performance relative to SVM-RFE+SVM. Compared with the 1-norm SVM and NLPSVM, the 1-norm SVMSL is almost an order of magnitude faster, and on the four datasets it is better than or comparable to both. Finally, we test the linear correlation between the selected genes; among these gene selection methods, the linear correlation is weakest for the 1-norm SVMSL. In other words, the genes selected by the 1-norm SVMSL are less correlated and less redundant.

It is notable that the 1-norm SVMSL needs only a small number of genes to achieve its best performance, which contributes to fast testing and low storage. However, we cannot yet determine how many genes should be selected, i.e., the value of d in the 1-norm SVMSL. If we select all the genes the 1-norm SVMSL generates, we cannot guarantee the classification performance, especially on the Prostate tumor dataset. In future work, we will try to find a way to determine the optimal gene number for the 1-norm SVMSL.