1 Introduction

Feature selection methods have drawn widespread attention in the field of gene expression data analysis [3, 7, 9, 11, 12, 16–18, 21–23, 25]. One of their important applications is human cancer classification, where they provide the basis for identifying crucial genes related to human cancers. DNA microarray gene expression data are widely used for the human cancer classification problem. Microarray gene expression data consist of a large number of genes (dimensions) compared to the number of samples or feature vectors. The high dimensionality of the feature vectors degrades the generalization performance of the classifier and increases its computational complexity. This problem is popularly known as the small sample size (SSS) problem in the literature [10]. Feature selection can be used here to retain only a few useful features and discard the others, thereby reducing the complexity in addition to identifying important genes. The different approaches used in feature selection can be broadly grouped into two categories: the filter approach and the wrapper approach. The filter approach is classifier independent whereas the wrapper approach is classifier dependent. Wrapper methods try to optimize the performance of a specific classifier and select the features yielding the best generalization performance. Regarding generalization, some new techniques [27, 28, 30] based on maximizing uncertainty or combining multiple reducts of rough sets have recently been proposed to improve the generalization of the classifier. In addition, some feature selection methods based on genetic algorithms have been proposed in the literature [5, 26].

In this paper we propose a feature selection method based on the null linear discriminant analysis (LDA) technique [4]. The null LDA method is a feature extraction method; here we extend the notion to the feature selection domain. The proposed method falls under the wrapper approach category and is computationally efficient, simple and easy to implement. In this work, it has been applied to a number of DNA microarray gene expression datasets to demonstrate its classification effectiveness. In addition, the biological significance of the genes selected by the proposed method is presented.

2 Proposed gene selection method

The null LDA technique [4] finds the orientation or transformation matrix W in two stages. In the first stage, the data are projected onto the null space of the within-class scatter matrix \( {\mathbf{S}}_{W} \); i.e., \( {\mathbf{S}}_{W} {\mathbf{W}} = 0 \). In the second stage it finds the \( {\mathbf{W}} \) that maximizes the determinant of the transformed between-class scatter matrix, \( \left| {{\mathbf{W}}^{\text{T}} {\mathbf{S}}_{B} {\mathbf{W}}} \right| \). The second stage is commonly implemented through the principal component analysis (PCA) method. Thus, in the null space method W is found as

$$ {\mathbf{W}} = {\mathbf{arg}}\mathop {{\mathbf{max}}}\limits_{{|{\mathbf{W}}^{\text{T}} {\mathbf{S}}_{W} {\mathbf{W}}| = 0}} |{\mathbf{W}}^{\text{T}} {\mathbf{S}}_{B} {\mathbf{W}}| $$

The orientation W projects the vectors onto the reduced feature space. The training vectors of a given class merge into a single vector in this reduced feature space; i.e., the class-conditional variances of the features in the reduced feature space are zero. The orientation \( {\mathbf{W}} = \left[ {{\mathbf{w}}_{1} ,{\mathbf{w}}_{2} , \ldots ,{\mathbf{w}}_{h} } \right] \) has h column vectors, where 1 ≤ h ≤ c − 1 and c is the number of classes. When the dimensionality d of the original feature space is very large in comparison to the number of training vectors n, the evaluation of the null space becomes computationally prohibitive, as the eigenvalue decomposition of such a large d × d matrix leads to serious numerical and memory problems. However, methods are available for computing the orientation W efficiently [14, 29].
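For concreteness, the following Python sketch shows the two-stage null LDA computation in its direct form. It is a minimal illustration, not the efficient algorithm of [14, 29]: it assumes d is small enough for a dense d × d eigendecomposition, and the function name null_lda and the numerical tolerance are our own choices.

```python
import numpy as np

def null_lda(X, y):
    """Direct two-stage null LDA (sketch).
    X: (n, d) data matrix, y: (n,) integer class labels.
    Returns W with at most c - 1 columns."""
    classes = np.unique(y)
    n, d = X.shape
    mu = X.mean(axis=0)
    Sw = np.zeros((d, d))
    Sb = np.zeros((d, d))
    for c in classes:
        Xc = X[y == c]
        mu_c = Xc.mean(axis=0)
        Sw += (Xc - mu_c).T @ (Xc - mu_c)                  # within-class scatter
        Sb += len(Xc) * np.outer(mu_c - mu, mu_c - mu)     # between-class scatter
    # Stage 1: basis Z of the null space of S_W (eigenvectors with ~zero eigenvalue)
    eigval, eigvec = np.linalg.eigh(Sw)
    Z = eigvec[:, eigval < 1e-10 * eigval.max()]
    # Stage 2: PCA of the projected between-class scatter, i.e. maximize |W^T S_B W|
    eigval_b, eigvec_b = np.linalg.eigh(Z.T @ Sb @ Z)
    h = len(classes) - 1
    W = Z @ eigvec_b[:, ::-1][:, :h]                       # top h directions
    return W
```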

In order to introduce our gene selection method, let us first consider the two-class case. Let \( \chi \) be a set of n training vectors in a d-dimensional feature space. The set \( \chi \) can be subdivided into two subsets \( \chi_{1} = \left\{{{\mathbf{x}}_{1}^{1},{\mathbf{x}}_{2}^{1}, \ldots,{\mathbf{x}}_{{n_{1}}}^{1}} \right\} \) and \( \chi_{2} = \left\{{{\mathbf{x}}_{1}^{2},{\mathbf{x}}_{2}^{2}, \ldots,{\mathbf{x}}_{{n_{2}}}^{2}} \right\}, \) where subset \( \chi_{i} \) belongs to the i-th class and consists of \( n_{i} \) samples such that \( n = n_{1} + n_{2} \); the training vector \( {\mathbf{x}}_{j}^{i} \) represents the j-th sample of the i-th class. Let \( {\varvec{\mu}}_{i} \) be the centroid of \( \chi_{i} \) and \( {\varvec{\mu}} \) be the centroid of \( \chi \).

Since the training vectors of a class get merged into a single vector in the reduced feature space, we can write training vectors of class-1 as:

$$ y = {\mathbf{w}}^{\text{T}} {\mathbf{x}}_{j}^{1} \quad {\text{for }}j = 1 \ldots n_{1} , $$
(1)

where \( {\mathbf{w}} \in \mathbb{R}^{d} \) is any column vector of W. Using Eq. 1, we get

$$ {\mathbf{w}}^{\text{T}} \left( {{\mathbf{x}}_{1}^{1} - {\mathbf{x}}_{2}^{1} } \right) = 0 $$
(2)

If we retain \( {\mathbf{x}}_{1}^{1} \) in Eq. 2 and subtract from it each of the remaining samples of the same class, we obtain in total \( n_{1} - 1 \) homogeneous equations; i.e., \( {\mathbf{w}}^{\text{T}} \left( {{\mathbf{x}}_{1}^{1} - {\mathbf{x}}_{j}^{1} } \right) = 0 \) for \( j = 2 \ldots n_{1} \). It can be seen from the Appendix that the projection of any sample of a given class onto the null space of \( {\mathbf{S}}_{W} \) is independent of the sample chosen. Thus, we retain the first sample of a class without affecting the statistical stability. Summing these \( n_{1} - 1 \) equations and dividing by n, we get

$$ \begin{gathered} \frac{1}{n}{\mathbf{w}}^{\text{T}} \left( {n_{1} {\mathbf{x}}_{1}^{1} - \sum\nolimits_{j = 1}^{{{\text{n}}_{1} }} {{\mathbf{x}}_{j}^{1} } } \right) = 0 \hfill \\ {\text{or}}\quad {\mathbf{w}}^{\text{T}} \left[ {\frac{{n_{1} }}{n}{\mathbf{x}}_{1}^{1} - \frac{{n_{1} }}{n}{\varvec{\mu}}_{1} } \right] = 0 \hfill \\ \end{gathered} $$
(3)

In a similar way, equation for class-2 can be written as

$$ {\mathbf{w}}^{\text{T}} \left[ {\frac{{n_{2} }}{n}{\mathbf{x}}_{1}^{2} - \frac{{n_{2} }}{n}{\varvec{\mu}}_{2} } \right] = 0 $$
(4)

Since the mean of all the training vectors can be written as \( {\varvec{\mu}} = \left( {n_{1} {\varvec{\mu}}_{1} + n_{2} {\varvec{\mu}}_{2} } \right)/n, \) summing Eqs. 3 and 4 gives

$$ {\mathbf{w}}^{\text{T}} \left( {{\bar{\mathbf{x}}} - {\varvec{\mu}}} \right) = 0, $$
(5)

where \( {\bar{\mathbf{x}}} = \left( {n_{1} {\mathbf{x}}_{1}^{1} + n_{2} {\mathbf{x}}_{1}^{2} } \right)/n. \) Eq. 5 can be written over all d features in summation form as

$$ \sum\limits_{j = 1}^{d} {w_{j} \left( {\bar{x}_{j} - \mu_{j} } \right) = 0} $$
(6)

where \( w_{j} \), \( \bar{x}_{j} \) and \( \mu_{j} \) are the j-th components of w, \( {\bar{\mathbf{x}}} \) and \( {\varvec{\mu}} \), respectively. It can be observed from Eq. 6 that if the j-th term is very small, i.e., \( \left| {w_{j} \left( {\bar{x}_{j} - \mu_{j} } \right)} \right| \approx 0, \) it does not contribute much to making the overall sum equal to zero, and thus the j-th feature can be discarded without sacrificing much information. If the quantities \( \left| {w_{j} \left( {\bar{x}_{j} - \mu_{j} } \right)} \right| \) are arranged in descending order such that

$$ \left| {w_{1} \left( {\bar{x}_{1} - \mu_{1} } \right)} \right| \ge \left| {w_{2} \left( {\bar{x}_{2} - \mu_{2} } \right)} \right| \ge \cdots \ge \left| {w_{d} \left( {\bar{x}_{d} - \mu_{d} } \right)} \right| $$

then discarding the bottom features (for which \( \left| {w_{j} \left( {\bar{x}_{j} - \mu_{j} } \right)} \right| \approx 0 \)) will not greatly affect the qualitative importance of the retained top r features.

The two-class case can be easily extended to the multi-class case. If there are c classes and the number of training vectors in class i is \( n_{i} \), where i = 1, 2, …, c, then \( {\bar{\mathbf{x}}} \) in Eq. 5 can be computed as

$$ {\bar{\mathbf{x}}} = \frac{1}{n}\sum\limits_{i = 1}^{c} {n_{i} {\mathbf{x}}_{1}^{i} } \quad {\text{where }}n = \sum\limits_{i = 1}^{c} {n_{i} } $$
(7)

and μ is the mean of all the training vectors. We derived the method by using the first training vector from each class. However, one can take any training vector from each of the classes.
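The ranking criterion of Eqs. 5–7 can be computed directly once a null LDA direction w is available. The sketch below is our own minimal illustration for a single column w of W (the setting of the derivation above); the function name rank_genes and the use of the first training vector of each class are assumptions consistent with the text, not a reproduction of Table 1.

```python
import numpy as np

def rank_genes(X, y, w):
    """Score each gene by |w_j * (xbar_j - mu_j)| (Eqs. 5-7) and return the
    gene indices sorted in descending order of the score.
    X: (n, d) training data, y: (n,) class labels, w: (d,) null LDA direction."""
    classes = np.unique(y)
    n, d = X.shape
    mu = X.mean(axis=0)                    # overall centroid
    xbar = np.zeros(d)                     # Eq. 7: weighted first sample of each class
    for c in classes:
        Xc = X[y == c]
        xbar += len(Xc) * Xc[0]
    xbar /= n
    scores = np.abs(w * (xbar - mu))
    order = np.argsort(scores)[::-1]       # descending order, as in the text
    return order, scores
```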

The selected number of features r should lie in the range n − c < r < d, since r ≤ n − c would make the within-class scatter matrix non-singular, in which case the null LDA method cannot be applied. There are several ways of choosing the value of r. One way is to take the argument of the median of \( \left| {w_{j} \left( {\bar{x}_{j} - \mu_{j} } \right)} \right| \) for j = 1 … d; i.e., \( r_{1} = \arg \,median_{j = 1 \ldots d} \left[ {\left| {w_{j} \left( {\bar{x}_{j} - \mu_{j} } \right)} \right|} \right], \) which discards approximately 50% of the features. The reduction can be continued further as \( r_{2} = \arg \,median_{{j = 1 \ldots r_{1} }} \left[ {\left| {w_{j} \left( {\bar{x}_{j} - \mu_{j} } \right)} \right|} \right] \), which discards approximately 75% of the features. Alternatively, a cross-validation procedure (illustrated later in this section) can be applied to find the optimum value of r from the training data. The method is summarized in Table 1.
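The median-based choice can be illustrated with the short sketch below, under the assumption that the scores are already sorted in descending order; the helper name median_r is ours.

```python
import numpy as np

def median_r(sorted_scores, rounds=1):
    """Median-based choice of r. sorted_scores must be in descending order;
    each round sets r to the position of the median of the currently retained
    scores, discarding roughly half of the remaining genes (50%, 75%, ...)."""
    r = len(sorted_scores)
    for _ in range(rounds):
        current = sorted_scores[:r]
        # index of the score closest to the median of the retained set, i.e. arg median
        r = int(np.argmin(np.abs(current - np.median(current)))) + 1
    return r
```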

Table 1 Gene selection method

The proposed gene selection method does not evaluate the performance of every single gene separately. It takes all the genes at once and ranks them to select the best ones. Therefore, the genes can be selected with very low computational complexity, \( O(dn^{2} ) \). The effectiveness of the method is discussed in the experimentation section.

If the optimum value of r is required, the following k-fold cross-validation procedure can be used [20] (a code sketch is given after the list):

  • Step 1 Partition training data randomly into k roughly equal segments.

  • Step 2 Hold out one segment as validation data and the remaining k − 1 segments as learning data from the training data.

  • Step 3 For each candidate r (where n − c < r < d), apply the proposed method (shown in Table 1) to the learning data to find the top r features.

  • Step 4 Use the validation data to compute the classification accuracy for each candidate value of r. Store the obtained classification accuracies.

  • Step 5 Repeat steps 1–4 N times.

  • Step 6 Evaluate average classification accuracy over N repetitions.

  • Step 7 Plot a curve of average classification accuracy as a function of r.

  • Step 8 The argument of maximum average classification accuracy will be the optimum r value.
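A minimal sketch of this procedure is given below, assuming the null_lda and rank_genes helpers from the earlier sketches and a 1-nearest-neighbor classifier with Euclidean distance, as used in the experiments of Sect. 3; the candidate values in r_grid should satisfy n − c < r < d.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier

def choose_r(X, y, r_grid, k=5, repeats=10):
    """k-fold cross-validation for the number of retained genes r (Steps 1-8).
    Returns the r with the highest average validation accuracy and the
    accuracy curve over r_grid."""
    acc = np.zeros(len(r_grid))
    for _ in range(repeats):                               # Steps 5-6: N repetitions
        for learn_idx, val_idx in KFold(n_splits=k, shuffle=True).split(X):
            Xl, yl = X[learn_idx], y[learn_idx]            # learning data
            Xv, yv = X[val_idx], y[val_idx]                # validation data
            w = null_lda(Xl, yl)[:, 0]
            order, _ = rank_genes(Xl, yl, w)               # Step 3: rank genes
            for i, r in enumerate(r_grid):                 # Step 4: accuracy per r
                top = order[:r]
                clf = KNeighborsClassifier(n_neighbors=1).fit(Xl[:, top], yl)
                acc[i] += np.mean(clf.predict(Xv[:, top]) == yv)
    acc /= repeats * k
    return r_grid[int(np.argmax(acc))], acc                # Step 8: optimum r
```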

3 Experimentation

Four DNA microarray gene expression datasets are utilized in this work to show the effectiveness of the proposed method. The description of the datasets is given as follows:

Acute leukemia dataset [12]: this dataset consists of DNA microarray gene expression data of human acute leukemia for cancer classification. Two types of acute leukemia data are provided for classification namely acute lymphoblastic leukemia (ALL) and acute myeloid leukemia (AML). The dataset is subdivided into 38 training samples and 34 test samples. The training set consists of 38 bone marrow samples (27 ALL and 11 AML) over 7,129 probes. The testing set consists of 34 samples with 20 ALL and 14 AML, prepared under different experimental conditions. All the samples have 7,129 dimensions and all are numeric.

SRBCT dataset [15]: the small round blue-cell tumor dataset consists of 83 samples with each having 2,308 genes. This is a four class classification problem. The tumors are Burkitt lymphoma (BL), the Ewing family of tumors (EWS), neuroblastoma (NB) and rhabdomyosarcoma (RMS). There are 63 samples for training and 20 samples for testing. The training set consists of 8, 23, 12 and 20 samples of BL, EWS, NB and RMS respectively. The test set consists of 3, 6, 6 and 5 samples of BL, EWS, NB and RMS respectively.

MLL leukemia dataset [2]: this dataset has three classes namely ALL, MLL and AML. The training set contains 57 leukemia samples (20 ALL, 17 MLL and 20 AML) whereas the test set contains 15 samples (4 ALL, 3 MLL and 8 AML). The dimension of MLL dataset is 12,582.

Lung Dataset [13]: this dataset contains gene expression levels of malignant pleural mesothelioma (MPM) and adenocarcinoma (ADCA) of the lung. There are 181 tissue samples (31 MPM and 150 ADCA). The training set contains 32 of them, 16 MPM and 16 ADCA. The remaining 149 samples are used for testing. Each sample is described by the expression values of 12,533 genes.

In order to study the performance of the proposed feature selection method, we first evaluate the classification accuracy on the original feature set using null LDA method and nearest neighbor classifier (with Euclidean distance measure). Then we select the features using the proposed method and apply null LDA method and nearest neighbor classifier to see if there is any degradation in the classification accuracy. The results are summarized in Table 2.

Table 2 Comparison between the classification accuracy obtained using the selected features (by the proposed method) and using original feature space

It can be observed from Table 2 that for most of the datasets there is no degradation in classification accuracy until the number of features is reduced to 500, which is a substantial amount of reduction. For MLL leukemia there is no degradation at all in the performance even when the number of features is reduced to 100; i.e., after 99.2% feature reduction. On the other hand, for the acute leukemia dataset the classification accuracy actually improves to the perfect level (100%) when only 100 genes (1.4% of the original dimension) are selected. This shows that discarding unimportant genes can help improve the classification performance. In a similar way, the classification accuracy for the SRBCT dataset using only 100 genes is 95% (4.33% of the original dimension). Furthermore, on the Lung Cancer dataset the classification accuracy with 100 selected genes is 97.32% (0.79% of the original dimension), which is only fractionally lower than the classification accuracy obtained with the original dimension. It can be concluded from the table that a large number of unimportant genes can be discarded without significant loss of discriminant information.

Table 3 shows the comparison of classification accuracy with other feature selection algorithms [24] on the SRBCT and MLL leukemia datasets. It can be observed from Table 3 that the null space based feature selection method outperforms many other state-of-the-art feature selection methods. The proposed method achieves 95% and 100% classification accuracy on SRBCT using 100 genes and 500 genes, respectively. Some of the other feature selection methods (like information gain + SVM random, twoing rule + SVM random, etc.) also achieve 95% classification accuracy, but using 150 genes. Similarly, the proposed method achieves 100% classification accuracy on the MLL leukemia dataset using only 100 genes.

Table 3 Comparison of feature selection methods on SRBCT and MLL leukemia datasets

Table 4 shows the performance in terms of classification accuracy on the acute leukemia dataset. The proposed method achieves 100% classification accuracy using 100 genes and is better than all the other presented methods. The prediction strength + SVM method achieves between 88.2 and 94.1% using 25–1,000 genes. RCBT, discretization + decision tree and rough sets achieve optimum classification accuracies of 91.2% using 10–40, 1,038 and 9 genes, respectively. Though some of these methods use a small number of genes, they are computationally intensive.

Table 4 Comparison of feature selection methods on acute leukemia dataset

Table 5 shows the performance on Lung Cancer dataset. The null space based feature selection method exhibits 97.3% classification accuracy. The classification accuracy on Lung Cancer dataset is 0.6% lower than RCBT and PCL methods. However, it can be made equal by increasing the number of selected genes.

Table 5 Comparison of feature selection methods on lung cancer dataset

In general, it can be concluded from Tables 3, 4 and 5 that the null space based feature selection method exhibits promising results.

In order to assess the reliability of the null space based feature selection method, we conducted sensitivity testing [1, 6] on the test data. The sensitivity of a class is defined as the number of true positives divided by the sum of the number of true positives and the number of false negatives. Table 6 shows the sensitivity of the method on all four datasets. In the table, \( s_{i} \) (for i = 1 … c) denotes the sensitivity of class i, s = min(\( s_{i} \)) is the minimum sensitivity and \( g = \left( {\prod\nolimits_{i = 1}^{c} {s_{i} } } \right)^{1/c} \) is the c-th root of the product of the \( s_{i} \), where \( 0 \le s_{i} \le 1 \). The minimum sensitivity s can be considered as a complementary measure of classification accuracy whose value should be high for a good method [6]. The term g can be considered as a measure of the imbalance of classification accuracy among the classes [1]. It can be observed from the table that both terms (s and g) achieve perfect results for the null space based feature selection method on the acute leukemia and MLL leukemia datasets; on the Lung Cancer dataset the method misclassified four test samples in class 2 and on the SRBCT dataset it misclassified one test sample in class 3. Overall, the high values of these terms indicate that the method's ability to identify cancer classes is very reliable.

Table 6 Sensitivity testing on all the four datasets
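The two summary terms are straightforward to compute from the predicted labels; the short sketch below (our own helper, not taken from the cited works) returns the per-class sensitivities together with s and g.

```python
import numpy as np

def sensitivities(y_true, y_pred, classes):
    """Per-class sensitivity s_i = TP / (TP + FN), the minimum sensitivity
    s = min_i s_i, and the geometric mean g = (prod_i s_i)^(1/c)."""
    s = np.array([np.mean(y_pred[y_true == c] == c) for c in classes])
    return s, float(s.min()), float(np.prod(s) ** (1.0 / len(classes)))
```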

Moreover, it is interesting to examine the biological significance of the features selected by the null space based feature selection algorithm. We use the acute leukemia data as a prototype to show the biological significance using Ingenuity Pathway Analysis (IPA). The 100 features selected by the proposed algorithm are used for this purpose. The top five high-level biological functions obtained are shown in Fig. 1. In the figure, the y-axis denotes the negative logarithm of the p-values and the x-axis denotes the high-level functions.

Fig. 1
figure 1

Top five high level biological functions on selected 100 features of acute leukemia by the null space based feature selection method

Since the cancer functions are of paramount interest, we investigated them further. There are 66 cancer functions obtained from the experiment. Leukemia is selected from these 66 cancer functions and shown in Table 7. In the table, the p-values and the number of selected genes are given for the selected functions. The genes selected by the proposed method yield significant p-values above the threshold (as specified in IPA). This shows that the features selected by the proposed method contain useful information for discriminatory purposes and have biological significance.

Table 7 Cancer functions

In order to check the robustness of the proposed method, we carried out a sensitivity analysis. First we selected the top 100 genes using the proposed method on a given dataset. We then contaminated the dataset by adding Gaussian noise and applied the method again to find the top 100 genes. The generated noise levels are 1, 2 and 5% of the standard deviation of the original expression values. The number of genes common to the selections before and after contamination is noted. In addition, the classification accuracy is also noted. This contamination, selection of genes and computation of classification accuracy is repeated 50 times. The average number of common genes and the average classification accuracy (in percent) over the 50 iterations are reported in Table 8. It can be observed from the table that the proposed method achieves high classification accuracy under noisy conditions. Also, the method is able to recover the majority of the originally selected genes in the presence of noise.

Table 8 Sensitivity analysis for null space based feature selection method on gene expression datasets at different noise levels
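The contamination step of this analysis can be sketched as follows. The helper noise_overlap is our own, it reuses the null_lda and rank_genes sketches from Sect. 2, and it assumes the noise is scaled by the per-gene standard deviation (the scaling convention is our assumption).

```python
import numpy as np

def noise_overlap(X, y, noise_pct, n_top=100, n_rep=50, seed=0):
    """Robustness check: add zero-mean Gaussian noise with standard deviation
    noise_pct (e.g. 0.01, 0.02, 0.05) times the per-gene std of X, re-select
    the top n_top genes, and report the average overlap with the genes
    selected on the clean data over n_rep repetitions."""
    rng = np.random.default_rng(seed)
    w = null_lda(X, y)[:, 0]
    clean_top = set(rank_genes(X, y, w)[0][:n_top])
    overlaps = []
    for _ in range(n_rep):
        Xn = X + rng.normal(0.0, noise_pct * X.std(axis=0), size=X.shape)
        wn = null_lda(Xn, y)[:, 0]
        noisy_top = set(rank_genes(Xn, y, wn)[0][:n_top])
        overlaps.append(len(clean_top & noisy_top))
    return float(np.mean(overlaps))
```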

4 Conclusion

In this paper we proposed a null space based feature selection method. The proposed method effectively selects important genes, as demonstrated on several DNA microarray gene expression datasets. Comparisons with several other feature selection methods have shown that the proposed method yields better classification accuracy. The implementation of the method is also quite simple and the computation is fast. Finally, the genes selected by the proposed method have biological significance, as demonstrated by the functional analysis, and can therefore contribute positively towards the detection of significant biological phenomena.