
1 Introduction

With the rapid development of genomics, proteomics and metabolomics, these techniques have been widely applied in the study of pathology, diagnostics and prognosis. Since bioinformatic data are often high dimensional and contain noisy and redundant variables, finding the features of interest to build an efficient classification model has become very important. Many feature selection methods, such as Support Vector Machine-Recursive Feature Elimination (SVM-RFE) [1], Random Forests (RF) [2], Genetic Algorithm (GA) [3], Relief-F [4], and Mutual Information (MI) [5, 6], have been applied to select meaningful feature subsets from high dimensional data and to induce classification models with high performance [7, 8].

SVM [9] is a supervised machine learning technique that is well suited to analyzing high dimensional data [10]. SVM was originally proposed for binary problems; multi-class problems can be handled by means of "one-versus-all" and "one-versus-one" strategies [11], among others. SVM-RFE [12] is a popular feature selection approach based on SVM. It computes the weights of the features according to the trained SVM model and iteratively removes the features with the smallest weights. GA is a stochastic global search technique [13] that has shown promising performance, and many feature selection techniques have been proposed based on GA [14, 15].

To filter out noisy and redundant data simultaneously, several techniques have been proposed, such as min-redundancy and max-relevance (mRMR) [16], a method combining SVM-RFE and the correlation coefficient [17], a method combining SVM-RFE with mRMR [18], and a dynamic weighting-based feature selection algorithm [19].

To select a meaningful feature subset from high dimensional data, this paper proposes a new feature selection method based on feature grouping and GA (FS-FGGA). It removes the irrelevant features which have little relevance with the class label, groups the remaining features, and applies GA to search for the optimal combination of features from the different groups. Applications on eight public datasets verify the effectiveness of FS-FGGA.

2 Methods

To improve the performance of the learning model, FS-FGGA selects meaningful, non-redundant features from the original data. It eliminates irrelevant features by symmetrical uncertainty [20, 21] and groups the remaining features according to the relevance among them. Features lying in the same group carry similar information about the class label, so each group contributes one feature to the final feature subset. However, selecting different features from each group may induce different learning models with different classification performance, so GA is adopted to search for the optimal combination. Fig. 1 shows the main framework of FS-FGGA.

Fig. 1. Framework of FS-FGGA.

2.1 Symmetrical Uncertainty

Symmetrical Uncertainty (SU) [20, 21] is an effective technique for measuring the correlation between two random variables. Let X and Y be two variables; their correlation SU(X, Y) is defined as follows:

$$ SU(X,Y) = 2\cdot \frac{IG(X |Y)}{H(X) + H(Y)}. $$
(1)

where H(X) is the entropy of X and IG(X|Y) = H(X) - H(X|Y) is the information gain, which reflects the additional information about X provided by Y.

Let F = {f_1, f_2, …, f_n} denote the feature set and C denote the class label. In order to filter out the irrelevant features, FS-FGGA adopts symmetrical uncertainty, SU(f_i, C) (1 ≤ i ≤ n), to measure the relation between feature f_i ∈ F and the class label C. If SU(f_i, C) is low enough, i.e. lower than a threshold σ, feature f_i has little relevance with the class label and is removed from the data [20, 21].
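For concreteness, a minimal Python sketch of this filtering step is given below. It assumes the features have already been discretized (see Sect. 3.2) and computes IG(X|Y) through the joint entropy, which equals H(X) - H(X|Y); the helper names are illustrative assumptions, not the authors' C++ implementation.

```python
import numpy as np
from collections import Counter

def entropy(values):
    """Shannon entropy (in bits) of a discrete variable."""
    counts = np.array(list(Counter(values).values()), dtype=float)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def symmetrical_uncertainty(x, y):
    """SU(X, Y) = 2 * IG(X|Y) / (H(X) + H(Y)), using IG(X|Y) = H(X) + H(Y) - H(X, Y)."""
    hx, hy = entropy(x), entropy(y)
    hxy = entropy(list(zip(x, y)))        # joint entropy H(X, Y) from paired symbols
    ig = hx + hy - hxy                    # equals H(X) - H(X|Y)
    return 2.0 * ig / (hx + hy) if (hx + hy) > 0 else 0.0

def filter_irrelevant(X_disc, y, sigma):
    """Return indices of features with SU(f_i, C) >= sigma, plus all SU values."""
    su = np.array([symmetrical_uncertainty(X_disc[:, j], y)
                   for j in range(X_disc.shape[1])])
    return np.where(su >= sigma)[0], su
```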

2.2 Grouping Features

Fast Correlation-Based Filter (FCBF) [21] is an efficient feature selection technique. It analyzes relevance by symmetrical uncertainty and removes redundant data by means of the Approximate Markov blanket (AMB). For two different features f_i ∈ F and f_j ∈ F (1 ≤ i, j ≤ n, i ≠ j), f_i is an Approximate Markov blanket [21] of f_j if and only if

$$ SU(f_{i}, C) \ge SU(f_{j}, C) \quad \text{and} \quad SU(f_{i}, f_{j}) \ge SU(f_{j}, C). $$
(2)

FS-FGGA groups the features according to the AMB criterion: features that are mutually redundant according to FCBF [21] are put into the same group, as sketched below.
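One plausible way to implement this grouping, reusing the symmetrical_uncertainty helper from the sketch above, is the greedy pass below: features are visited in decreasing order of SU(f_i, C), and a feature joins the first group whose representative is an Approximate Markov blanket of it (Eq. 2); otherwise it opens a new group. The visiting order and the choice of the first member as representative are assumptions made for illustration.

```python
def group_by_amb(X_disc, feature_idx, su_with_class):
    """Group features so that each group's representative is an AMB (Eq. 2) of its members."""
    # visit features from the most to the least relevant to the class label
    order = sorted(feature_idx, key=lambda j: su_with_class[j], reverse=True)
    groups = []                      # each group is a list of indices; groups[k][0] is its representative
    for j in order:
        placed = False
        for g in groups:
            rep = g[0]
            # SU(rep, C) >= SU(j, C) holds by construction (rep was visited earlier),
            # so only the second condition of Eq. 2 needs to be checked.
            if symmetrical_uncertainty(X_disc[:, rep], X_disc[:, j]) >= su_with_class[j]:
                g.append(j)
                placed = True
                break
        if not placed:
            groups.append([j])
    return groups
```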

2.3 Searching the Optimal Feature Subset by GA

FCBF produces a feature subset formed by picking the center feature of each group [21], but the center may change as the training samples change [22]. The ensemble correlation-based gene selection (ECBGS) method [23] uses different starting points and selects the best feature subset according to the corresponding classification performance.

Let FG = {FG_1, FG_2, …, FG_k} denote the set of feature groups. Since the features in the same group contain similar information, only one feature is picked from each group to constitute the selected feature subset. Furthermore, different combinations of features drawn from the groups may yield different classification performance, so FS-FGGA applies GA to search for the optimal one. Initially, FS-FGGA randomly selects one feature from each group to form a feature subset as an individual, and repeats this operation to build the initial population of the GA. The flow chart of searching for the optimal feature subset is shown in Fig. 2.

Fig. 2. Flow chart of searching the best feature subset.

The fitness of an individual in the population is assessed by the classification accuracy rate of SVM. Roulette wheel selection is adopted to select parents from the population, and a single-point crossover operation and a single-point mutation are applied to produce the offspring individuals.
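This search loop can be sketched as follows, using scikit-learn's SVC (which wraps LIBSVM) for the fitness evaluation. The population size, number of generations, crossover and mutation rates, and the early-stopping fitness of 0.95 follow the settings given later in Sect. 3.2, while the cross-validation inside the fitness function and the exact operator details are assumptions for illustration rather than the authors' implementation.

```python
import random
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def fitness(individual, X, y, kernel="rbf"):
    """Fitness of an individual = cross-validated SVM accuracy on its feature subset."""
    return cross_val_score(SVC(kernel=kernel), X[:, list(individual)], y, cv=5).mean()

def roulette_select(population, fits):
    """Roulette wheel selection: probability of being picked is proportional to fitness."""
    r, acc = random.uniform(0, sum(fits)), 0.0
    for ind, f in zip(population, fits):
        acc += f
        if acc >= r:
            return ind
    return population[-1]

def ga_search(groups, X, y, pop_size=100, generations=50, pc=0.8, pm=0.01, target=0.95):
    """Pick one feature per group with a simple GA and return the best combination found."""
    # each individual holds one feature index chosen from each group
    population = [tuple(random.choice(g) for g in groups) for _ in range(pop_size)]
    best, best_fit = None, -1.0
    for _ in range(generations):
        fits = [fitness(ind, X, y) for ind in population]
        i_best = int(np.argmax(fits))
        if fits[i_best] > best_fit:
            best, best_fit = population[i_best], fits[i_best]
        if best_fit >= target:                           # early stop on high fitness
            break
        offspring = []
        while len(offspring) < pop_size:
            c1 = list(roulette_select(population, fits))
            c2 = list(roulette_select(population, fits))
            if len(groups) > 1 and random.random() < pc: # single-point crossover
                point = random.randrange(1, len(groups))
                c1[point:], c2[point:] = c2[point:], c1[point:]
            for child in (c1, c2):
                if random.random() < pm:                 # single-point mutation: re-draw one group's feature
                    k = random.randrange(len(groups))
                    child[k] = random.choice(groups[k])
                offspring.append(tuple(child))
        population = offspring[:pop_size]
    return best, best_fit
```

Given the feature groups produced in Sect. 2.2, a call such as `ga_search(groups, X, y)` returns the best one-feature-per-group subset found and its fitness.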

3 Results and Discussion

3.1 Performance Metrics

A feature selection technique aims at selecting a feature subset with high classification ability. Meanwhile, the stability of the method is also very important. This study applies the classification accuracy and the stability to evaluate the performance of the methods. The percentage of overlapping features related (POFR) [24] is used to measure the stability of a method. It is defined as follows [24]:

$$ POFR_{{F_{1} F_{2} }} = \frac{{\left| {F_{1} \cap F_{2} } \right| + \left| {R_{{F_{1} F_{2} }} } \right|}}{{\left| {F_{1} } \right|}}. $$
(3)
$$ POFR_{{F_{2} F_{1} }} = \frac{{\left| {F_{1} \cap F_{2} } \right| + \left| {R_{{F_{2} F_{1} }} } \right|}}{{\left| {F_{2} } \right|}}. $$
(4)

where F_1 and F_2 are two feature subsets selected by different runs of an algorithm, |F_1| (or |F_2|) is the number of features in F_1 (or F_2), and \( R_{F_1 F_2} \) (or \( R_{F_2 F_1} \)) is the set of features in F_1 (or F_2) that are not in F_2 (or F_1) but have a strong correlation with at least one feature in F_2 (or F_1). The greater the value is, the more stable the feature selection algorithm is.
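Eqs. 3 and 4 translate directly into code. The sketch below leaves the "strong correlation" test as a pluggable predicate, since its exact threshold is defined in [24] rather than here.

```python
def pofr(F1, F2, strongly_correlated):
    """POFR of Eq. 3: overlap plus correlated substitutes, relative to |F1|.

    F1, F2              -- sets of feature indices selected by two runs of the algorithm
    strongly_correlated -- predicate(f, g) -> bool deciding whether features f and g
                           are strongly correlated (threshold left to the caller)
    """
    overlap = F1 & F2
    # R_{F1 F2}: features of F1 absent from F2 that correlate strongly with some feature of F2
    r12 = {f for f in F1 - F2 if any(strongly_correlated(f, g) for g in F2)}
    return (len(overlap) + len(r12)) / len(F1)
```

The symmetric quantity of Eq. 4 is obtained as `pofr(F2, F1, strongly_correlated)`.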

3.2 Experiment

To demonstrate the effectiveness of FS-FGGA, it is compared with SVM-RFE and ECBGS on eight public microarray datasets, which are gene expression data from various human cancers. Table 1 shows the basic information of the eight public datasets. Among them, Adenocarcinoma, Leukemia 2, Lymphoma 1 and Srbct datasets are from http://ligarto.org/rdiaz/Papers/rfVS/randomForestVarSel.html, and the other four datasets come from http://linus.nci.nih.gov/~brb/DataArchive_New.html.

Table 1. The basic information of the eight public datasets

Auto scaling is used to reduce the differences in magnitude among the features. To calculate SU, equal width discretization (EWD) [25, 26] is adopted, where the real-valued data are divided into h intervals of equal width between the minimum and the maximum value (h is set to 3 in the experiments).
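For reference, the equal-width discretization applied before computing SU can be written as follows; this is a standard formulation of EWD, not taken from the authors' code.

```python
import numpy as np

def equal_width_discretize(x, h=3):
    """Map a real-valued feature to bin indices 0 .. h-1 using h equal-width intervals."""
    x = np.asarray(x, dtype=float)
    lo, hi = x.min(), x.max()
    if hi == lo:                                   # constant feature: single bin
        return np.zeros(len(x), dtype=int)
    edges = np.linspace(lo, hi, h + 1)[1:-1]       # the h - 1 interior cut points
    return np.digitize(x, edges)
```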

Parameter σ for FS-FGGA and ECBGS is set as follows:

$$ \sigma = 0.5 \cdot (SU_{\max} - SU_{\min}). $$
(5)

where SU_max and SU_min are the maximal and minimal SU values of the features with respect to the class label, respectively.

For FS-FGGA, the maximal number of iterations and the population size are set to 50 and 100, respectively. The crossover probability and the mutation rate are set to 0.8 and 0.01, respectively. The GA search stops when the number of generations reaches the maximal number of iterations or the best fitness reaches 0.95.

Ten-fold cross-validation was run ten times. SVM is adopted as the classification method, with the RBF kernel function and the LINEAR kernel function used respectively. The source code of SVM is from http://www.csie.ntu.edu.Tw/~cjlin/libsvm and the other algorithms were written in C++.

3.3 Results and Discussion

Tables 2 and 3 show the comparison of the three methods in terms of average classification accuracy rates. Bold face indicates the highest accuracy rate among the three methods on a dataset. The last row (W/T/L) of the two tables counts the number of wins/ties/losses compared with FS-FGGA over all datasets. It can be seen that FS-FGGA is superior to the other two feature selection methods in most cases.

Table 2. The comparison on SVM with RBF kernel function
Table 3. The comparison on SVM with LINEAR kernel function

In comparison with SVM-RFE, FS-FGGA ties with SVM-RFE with the RBF kernel function (Table 2), but it shows a clear superiority over SVM-RFE with the LINEAR kernel function (Table 3), where FS-FGGA wins against SVM-RFE seven times. With the LINEAR kernel function, the average classification accuracy rate of SVM-RFE equals that of FS-FGGA only on the Adenocarcinoma data, but the standard deviation of SVM-RFE is 1.55% higher than that of FS-FGGA.

In comparison with ECBGS, the average classification accuracy rates of FS-FGGA are higher than those of ECBGS on all eight datasets with the RBF kernel function (Table 2). With the LINEAR kernel function (Table 3), FS-FGGA wins against ECBGS seven times; only on the Leukemia 1 data is the average classification accuracy rate of ECBGS slightly higher than that of FS-FGGA.

Tables 4 and 5 show the average POFR of the three feature selection algorithms. From the two tables, the FS-FGGA algorithm is more stable than the other two algorithms in the majority of cases.

Table 4. The average POFR using RBF kernel function
Table 5. The average POFR using LINEAR kernel function

4 Conclusions

This paper proposes a new feature selection method based on feature grouping and genetic algorithm (FS-FGGA). The method effectively eliminates irrelevant features and reduces redundant features. Applications on eight public microarray datasets show the effectiveness of FS-FGGA: in most cases it selects more discriminative feature subsets and builds more efficient classification models than SVM-RFE and ECBGS.