Abstract
Feature selection is a commonly used data preprocessing technique that selects a subset of informative attributes or variables with which to build models describing the data. By removing redundant, irrelevant, or noisy features, feature selection can improve both the predictive accuracy and the comprehensibility of predictors or classifiers. Many feature selection algorithms with different selection criteria have been introduced by researchers. However, no single criterion has been found to be best for all applications. In this paper, we propose a framework based on a genetic algorithm (GA) for feature subset selection that combines various existing feature selection methods. The advantages of this approach include the ability to accommodate multiple feature selection criteria and to find small subsets of features that perform well for the particular inductive learning algorithm used to build the classifier. We conducted experiments using three data sets and three existing feature selection methods. The experimental results demonstrate that our approach is robust and effective at finding subsets of features with higher classification accuracy and/or smaller size than those found by each individual feature selection algorithm.
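The general idea of GA-based feature subset selection can be illustrated with a minimal Python sketch. It is not the implementation evaluated in the paper: the abstract does not specify the encoding, operators, or fitness function, so everything below is an assumption made for illustration. Each individual is a bit mask over features, the initial population is seeded with hypothetical masks standing in for the outputs of three existing selectors, and the fitness rewards hold-out accuracy of a toy nearest-centroid classifier while penalizing subset size. The names `make_data`, `accuracy`, `fitness`, `run_ga`, and `size_penalty` are all invented here.

```python
import random

def make_data(rng, n=60, n_informative=3, n_noise=9):
    """Toy two-class data; only the first n_informative features carry signal."""
    X, y = [], []
    for i in range(n):
        label = i % 2
        row = [rng.gauss(2.0 * label, 1.0) for _ in range(n_informative)]
        row += [rng.gauss(0.0, 1.0) for _ in range(n_noise)]
        X.append(row)
        y.append(label)
    return X, y

def accuracy(mask, X, y):
    """Hold-out accuracy of a nearest-centroid classifier on the selected features."""
    cols = [j for j, bit in enumerate(mask) if bit]
    if not cols:
        return 0.0
    half = len(X) // 2
    centroids = {}
    for c in (0, 1):
        rows = [X[i] for i in range(half) if y[i] == c]
        centroids[c] = [sum(r[j] for r in rows) / len(rows) for j in cols]
    correct = 0
    for i in range(half, len(X)):
        dist = {c: sum((X[i][j] - m) ** 2 for j, m in zip(cols, centroids[c]))
                for c in (0, 1)}
        correct += min(dist, key=dist.get) == y[i]
    return correct / (len(X) - half)

def fitness(mask, X, y, size_penalty=0.1):
    """Assumed fitness: accuracy minus a penalty proportional to subset size."""
    return accuracy(mask, X, y) - size_penalty * sum(mask) / len(mask)

def run_ga(X, y, seeds, rng, pop_size=20, generations=30, p_mut=0.05):
    """Evolve feature masks, starting from seed masks plus random individuals."""
    n = len(X[0])
    pop = [list(s) for s in seeds]
    while len(pop) < pop_size:
        pop.append([rng.randint(0, 1) for _ in range(n)])
    for _ in range(generations):
        scored = sorted(pop, key=lambda m: fitness(m, X, y), reverse=True)
        parents = scored[: pop_size // 2]      # truncation selection
        children = [scored[0][:]]              # elitism: keep the current best
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n)          # one-point crossover
            child = a[:cut] + b[cut:]
            child = [bit ^ (rng.random() < p_mut) for bit in child]  # bit-flip mutation
            children.append(child)
        pop = children
    return max(pop, key=lambda m: fitness(m, X, y))

rng = random.Random(0)
X, y = make_data(rng)
# Hypothetical masks standing in for subsets chosen by three existing selectors.
seeds = [
    [1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
    [1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0],
]
best = run_ga(X, y, seeds, rng)
print("selected features:", [j for j, bit in enumerate(best) if bit])
print("fitness:", round(fitness(best, X, y), 3))
```

Because the best individual is carried over unchanged in each generation, the returned mask is never worse, under this fitness, than the best seed. Seeding the population is only one plausible way to combine existing selectors; other schemes are equally consistent with the abstract.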
Cite this article
Tan, F., Fu, X., Zhang, Y. et al. A genetic algorithm-based method for feature subset selection. Soft Comput 12, 111–120 (2008). https://doi.org/10.1007/s00500-007-0193-8
DOI: https://doi.org/10.1007/s00500-007-0193-8