
1 Introduction

With the rapid development of genomics, proteomics and metabolomics, these techniques have been widely applied in the study of pathology, diagnostics and prognosis. Since bioinformatic data are often high dimensional and contain noisy and redundant variables, finding the features of interest to build an efficient classification model has become very important. Many feature selection methods, such as Support Vector Machine-Recursive Feature Elimination (SVM-RFE) [1], Random Forests (RF) [2], Genetic Algorithm (GA) [3], Relief-F [4], and Mutual Information (MI) [5, 6], have been applied to select meaningful feature subsets from high dimensional data and to induce classification models with high performance [7, 8].

SVM [9] is a supervised machine learning technique that is well suited to analyzing high dimensional data [10]. SVM was originally proposed for binary problems; multi-class problems can be handled by means of "one-versus-all" and "one-versus-one" strategies [11], among others. SVM-RFE [12] is a popular feature selection approach based on SVM. It computes the weights of the features according to the trained SVM model and iteratively removes the features with the smallest weights. GA is a stochastic global search technique [13] that has shown promising performance, and many feature selection techniques have been proposed based on GA [14, 15].

To filter out noisy and redundant data simultaneously, several techniques have been proposed, such as min-redundancy and max-relevance (mRMR) [16], a method combining SVM-RFE and the correlation coefficient [17], a method combining SVM-RFE with mRMR [18], and a dynamic weighting-based feature selection algorithm [19].

To select a meaningful feature subset from high dimensional data, this paper proposes a new feature selection method based on feature grouping and GA (FS-FGGA). It removes the irrelevant features which have little relevance with the class label, groups the remaining features, and applies GA to search for the optimal combination of features from the different groups. Applications on eight public datasets verify the effectiveness of FS-FGGA.

2 Methods

To improve the performance of the learning model, FS-FGGA selects meaningful, non-redundant features from the original data. It eliminates irrelevant features by symmetrical uncertainty [20, 21] and groups the remaining features according to the relevance among them. Features lying in the same group carry similar information about the class label, so each group contributes one feature to the final feature subset. However, selecting different features from each group may induce different learning models with different classification performance, so GA is adopted to search for the optimal combination. Fig. 1 shows the main framework of FS-FGGA.

Fig. 1. Framework of FS-FGGA.

2.1 Symmetrical Uncertainty

Symmetrical Uncertainty (SU) [20, 21] is an effective technique for measuring the correlation between two random variables. Let X and Y be two variables; their correlation SU(X, Y) is defined as follows:

$$ SU(X,Y) = 2\cdot \frac{IG(X |Y)}{H(X) + H(Y)}. $$
(1)

where H(X) is the entropy of X and IG(X|Y) = H(X) - H(X|Y) is the information gain, which reflects the additional information about X provided by Y.

Let F = {f_1, f_2, …, f_n} denote the feature set and C denote the class label. In order to filter out the irrelevant features, FS-FGGA adopts symmetrical uncertainty, SU(f_i, C) (1 ≤ i ≤ n), to measure the relation between feature f_i ∈ F and the class label C. If SU(f_i, C) is low enough, i.e. lower than a threshold σ, feature f_i has little relevance with the class label and is removed from the data [20, 21].
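For concreteness, a minimal Python sketch of this filtering step is given below. It assumes the features have already been discretized (see Sect. 3.2) and computes IG(X|Y) through the joint entropy, which equals H(X) - H(X|Y); the helper names are illustrative assumptions, not the authors' C++ implementation.

```python
import numpy as np
from collections import Counter

def entropy(values):
    """Shannon entropy (in bits) of a discrete variable."""
    counts = np.array(list(Counter(values).values()), dtype=float)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def symmetrical_uncertainty(x, y):
    """SU(X, Y) = 2 * IG(X|Y) / (H(X) + H(Y)), using IG(X|Y) = H(X) + H(Y) - H(X, Y)."""
    hx, hy = entropy(x), entropy(y)
    hxy = entropy(list(zip(x, y)))        # joint entropy H(X, Y) from paired symbols
    ig = hx + hy - hxy                    # equals H(X) - H(X|Y)
    return 2.0 * ig / (hx + hy) if (hx + hy) > 0 else 0.0

def filter_irrelevant(X_disc, y, sigma):
    """Return indices of features with SU(f_i, C) >= sigma, plus all SU values."""
    su = np.array([symmetrical_uncertainty(X_disc[:, j], y)
                   for j in range(X_disc.shape[1])])
    return np.where(su >= sigma)[0], su
```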

2.2 Grouping Features

Fast Correlation-Based Filter (FCBF) [21] is an efficient feature selection technique. It analyzes relevance by symmetrical uncertainty and removes redundant data by means of the Approximate Markov blanket (AMB). For two different features f_i ∈ F and f_j ∈ F (1 ≤ i, j ≤ n, i ≠ j), f_i is an Approximate Markov blanket [21] of f_j if and only if

$$ SU(f_{i}, C) \ge SU(f_{j}, C) \quad \text{and} \quad SU(f_{i}, f_{j}) \ge SU(f_{j}, C). $$
(2)

FS-FGGA groups the features according to the AMB criterion: features that are mutually redundant according to FCBF [21] are put into the same group, as sketched below.
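One plausible way to implement this grouping, reusing the symmetrical_uncertainty helper from the sketch above, is the greedy pass below: features are visited in decreasing order of SU(f_i, C), and a feature joins the first group whose representative is an Approximate Markov blanket of it (Eq. 2); otherwise it opens a new group. The visiting order and the choice of the first member as representative are assumptions made for illustration.

```python
def group_by_amb(X_disc, feature_idx, su_with_class):
    """Group features so that each group's representative is an AMB (Eq. 2) of its members."""
    # visit features from the most to the least relevant to the class label
    order = sorted(feature_idx, key=lambda j: su_with_class[j], reverse=True)
    groups = []                      # each group is a list of indices; groups[k][0] is its representative
    for j in order:
        placed = False
        for g in groups:
            rep = g[0]
            # SU(rep, C) >= SU(j, C) holds by construction (rep was visited earlier),
            # so only the second condition of Eq. 2 needs to be checked.
            if symmetrical_uncertainty(X_disc[:, rep], X_disc[:, j]) >= su_with_class[j]:
                g.append(j)
                placed = True
                break
        if not placed:
            groups.append([j])
    return groups
```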

2.3 Searching the Optimal Feature Subset by GA

FCBF produces a feature subset formed by picking the center feature of each group [21], but the center may change as the training samples change [22]. The ensemble correlation-based gene selection (ECBGS) method [23] uses different starting points and selects the best feature subset according to the corresponding classification performance.

Let FG = {FG_1, FG_2, …, FG_k} denote the set of feature groups. Since the features in the same group contain similar information, only one feature is picked from each group to constitute the selected feature subset. Furthermore, different combinations of features drawn from the groups may yield different classification performance, so FS-FGGA applies GA to search for the optimal one. Initially, FS-FGGA randomly selects one feature from each group to form a feature subset as an individual, and repeats this operation to build the initial population of the GA. The flow chart of searching for the optimal feature subset is shown in Fig. 2.

Fig. 2. Flow chart of searching the best feature subset.

The fitness of an individual in the population is assessed by the classification accuracy rate of SVM. Roulette wheel selection is adopted to select parents from the population, and a single-point crossover operation and a single-point mutation are applied to produce the offspring individuals.
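This search loop can be sketched as follows, using scikit-learn's SVC (which wraps LIBSVM) for the fitness evaluation. The population size, number of generations, crossover and mutation rates, and the early-stopping fitness of 0.95 follow the settings given later in Sect. 3.2, while the cross-validation inside the fitness function and the exact operator details are assumptions for illustration rather than the authors' implementation.

```python
import random
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def fitness(individual, X, y, kernel="rbf"):
    """Fitness of an individual = cross-validated SVM accuracy on its feature subset."""
    return cross_val_score(SVC(kernel=kernel), X[:, list(individual)], y, cv=5).mean()

def roulette_select(population, fits):
    """Roulette wheel selection: probability of being picked is proportional to fitness."""
    r, acc = random.uniform(0, sum(fits)), 0.0
    for ind, f in zip(population, fits):
        acc += f
        if acc >= r:
            return ind
    return population[-1]

def ga_search(groups, X, y, pop_size=100, generations=50, pc=0.8, pm=0.01, target=0.95):
    """Pick one feature per group with a simple GA and return the best combination found."""
    # each individual holds one feature index chosen from each group
    population = [tuple(random.choice(g) for g in groups) for _ in range(pop_size)]
    best, best_fit = None, -1.0
    for _ in range(generations):
        fits = [fitness(ind, X, y) for ind in population]
        i_best = int(np.argmax(fits))
        if fits[i_best] > best_fit:
            best, best_fit = population[i_best], fits[i_best]
        if best_fit >= target:                           # early stop on high fitness
            break
        offspring = []
        while len(offspring) < pop_size:
            c1 = list(roulette_select(population, fits))
            c2 = list(roulette_select(population, fits))
            if len(groups) > 1 and random.random() < pc: # single-point crossover
                point = random.randrange(1, len(groups))
                c1[point:], c2[point:] = c2[point:], c1[point:]
            for child in (c1, c2):
                if random.random() < pm:                 # single-point mutation: re-draw one group's feature
                    k = random.randrange(len(groups))
                    child[k] = random.choice(groups[k])
                offspring.append(tuple(child))
        population = offspring[:pop_size]
    return best, best_fit
```

Given the feature groups produced in Sect. 2.2, a call such as `ga_search(groups, X, y)` returns the best one-feature-per-group subset found and its fitness.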

3 Results and Discussion

3.1 Performance Metrics

A feature selection technique aims at selecting a feature subset with high classification ability. Meanwhile, the stability of the method is also very important. This study applies the classification accuracy and the stability to evaluate the performance of the methods. The percentage of overlapping features related (POFR) [24] is used to measure the stability of a method. It is defined as follows [24]:

$$ POFR_{{F_{1} F_{2} }} = \frac{{\left| {F_{1} \cap F_{2} } \right| + \left| {R_{{F_{1} F_{2} }} } \right|}}{{\left| {F_{1} } \right|}}. $$
(3)
$$ POFR_{{F_{2} F_{1} }} = \frac{{\left| {F_{1} \cap F_{2} } \right| + \left| {R_{{F_{2} F_{1} }} } \right|}}{{\left| {F_{2} } \right|}}. $$
(4)

where F_1 and F_2 are two feature subsets selected by different runs of an algorithm, |F_1| (or |F_2|) is the number of features in F_1 (or F_2), and \( R_{F_1 F_2} \) (or \( R_{F_2 F_1} \)) is the set of features in F_1 (or F_2) that are not in F_2 (or F_1) but have a strong correlation with at least one feature in F_2 (or F_1). The greater the value is, the more stable the feature selection algorithm is.
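Eqs. 3 and 4 translate directly into code. The sketch below leaves the "strong correlation" test as a pluggable predicate, since its exact threshold is defined in [24] rather than here.

```python
def pofr(F1, F2, strongly_correlated):
    """POFR of Eq. 3: overlap plus correlated substitutes, relative to |F1|.

    F1, F2              -- sets of feature indices selected by two runs of the algorithm
    strongly_correlated -- predicate(f, g) -> bool deciding whether features f and g
                           are strongly correlated (threshold left to the caller)
    """
    overlap = F1 & F2
    # R_{F1 F2}: features of F1 absent from F2 that correlate strongly with some feature of F2
    r12 = {f for f in F1 - F2 if any(strongly_correlated(f, g) for g in F2)}
    return (len(overlap) + len(r12)) / len(F1)
```

The symmetric quantity of Eq. 4 is obtained as `pofr(F2, F1, strongly_correlated)`.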

3.2 Experiment

To demonstrate the effectiveness of FS-FGGA, it is compared with SVM-RFE and ECBGS on eight public microarray datasets, which are gene expression data from various human cancers. Table 1 shows the basic information of the eight public datasets. Among them, Adenocarcinoma, Leukemia 2, Lymphoma 1 and Srbct datasets are from http://ligarto.org/rdiaz/Papers/rfVS/randomForestVarSel.html, and the other four datasets come from http://linus.nci.nih.gov/~brb/DataArchive_New.html.

Table 1. The basic information of the eight public datasets

Auto scaling is used to reduce the differences in magnitude among the features. To calculate SU, equal width discretization (EWD) [25, 26] is adopted, where the real-valued data are divided into h intervals of equal width between the minimum and the maximum value (h is set to 3 in the experiments).
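For reference, the equal-width discretization applied before computing SU can be written as follows; this is a standard formulation of EWD, not taken from the authors' code.

```python
import numpy as np

def equal_width_discretize(x, h=3):
    """Map a real-valued feature to bin indices 0 .. h-1 using h equal-width intervals."""
    x = np.asarray(x, dtype=float)
    lo, hi = x.min(), x.max()
    if hi == lo:                                   # constant feature: single bin
        return np.zeros(len(x), dtype=int)
    edges = np.linspace(lo, hi, h + 1)[1:-1]       # the h - 1 interior cut points
    return np.digitize(x, edges)
```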

Parameter σ for FS-FGGA and ECBGS is set as follows:

$$ \sigma = 0.5 \cdot (SU_{\max} - SU_{\min}). $$
(5)

where SU_max and SU_min are the maximal and minimal SU values of the features with respect to the class label, respectively.

For FS-FGGA, the maximal number of iterations and the population size are set to 50 and 100, respectively. The crossover probability and the mutation rate are set to 0.8 and 0.01, respectively. The GA search stops when the number of generations reaches the maximal number of iterations or the best fitness reaches 0.95.

Ten-fold cross-validation was run ten times. SVM is adopted as the classification method, with the RBF kernel function and the LINEAR kernel function used respectively. The source code of SVM is from http://www.csie.ntu.edu.Tw/~cjlin/libsvm and the other algorithms were written in C++.

3.3 Results and Discussion

Tables 2 and 3 show the comparison of the three methods in terms of average classification accuracy rates. Bold face indicates the highest accuracy rate among the three methods on a dataset. The last row (W/T/L) of the two tables counts the number of wins/ties/losses compared with FS-FGGA over all datasets. It can be seen that FS-FGGA is superior to the other two feature selection methods in most cases.

Table 2. The comparison on SVM with RBF kernel function
Table 3. The comparison on SVM with LINEAR kernel function

In comparison with SVM-RFE, FS-FGGA ties with SVM-RFE with the RBF kernel function (Table 2), but it shows a clear superiority over SVM-RFE with the LINEAR kernel function (Table 3), where FS-FGGA wins against SVM-RFE seven times. With the LINEAR kernel function, the average classification accuracy rate of SVM-RFE equals that of FS-FGGA only on the Adenocarcinoma data, but the standard deviation of SVM-RFE is 1.55% higher than that of FS-FGGA.

In comparison with ECBGS, the average classification accuracy rates of FS-FGGA are higher than those of ECBGS on all eight datasets with the RBF kernel function (Table 2). With the LINEAR kernel function (Table 3), FS-FGGA wins against ECBGS seven times; only on the Leukemia 1 data is the average classification accuracy rate of ECBGS slightly higher than that of FS-FGGA.

Tables 4 and 5 show the average POFR of the three feature selection algorithms. From the two tables, the FS-FGGA algorithm is more stable than the other two algorithms in the majority of cases.

Table 4. The average POFR using RBF kernel function
Table 5. The average POFR using LINEAR kernel function

4 Conclusions

This paper proposes a new feature selection method based on feature grouping and genetic algorithm (FS-FGGA). The method effectively eliminates irrelevant features and reduces redundant features. Applications on eight public microarray datasets show the effectiveness of FS-FGGA: in most cases it selects more discriminative feature subsets and builds more efficient classification models than SVM-RFE and ECBGS.