Abstract
Feature selection (FS) is employed to make text classification (TC) more effective. Well-known FS metrics such as information gain (IG) and odds ratio (OR) rank terms without considering term interactions. Building classifiers with FS algorithms that account for term interactions can yield better performance, but their computational complexity is a concern, which has motivated two-stage algorithms such as information gain–principal component analysis (IG–PCA). Random forests-based feature selection (RFFS), which builds on Breiman's random forests, has demonstrated outstanding performance in capturing gene–gene relations in bioinformatics, but its usefulness for TC is less explored. RFFS has few control parameters, is resistant to overfitting and thus generalizes well to new data, and requires neither a separate test set nor conventional cross-validation to estimate accuracy. This paper investigates the working of RFFS for TC and compares its performance against IG, OR and IG–PCA. We carry out experiments on four widely used text datasets using naive Bayes and support vector machines as classifiers. RFFS achieves macro-F1 values higher than the other FS algorithms in 73% of the experimental instances. We also analyze the performance of RFFS for TC in terms of its parameters and the class skews of the datasets, and report interesting findings.
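To make the univariate-versus-multivariate contrast concrete, the following minimal sketch ranks terms of a toy binary term–document matrix two ways: with a univariate mutual-information score (a stand-in for IG) and with random-forest feature importances (the RFFS idea). This is our illustration in scikit-learn, not the authors' implementation (the paper used MATLAB/CLOP); the synthetic data, variable names, and parameter choices are ours. The class label is built from an interaction of two terms, which a term-by-term score cannot see but a forest fitted on the full term set can.

```python
# Illustrative sketch only: RFFS-style ranking vs. a univariate IG-style ranking
# on synthetic binary term-presence data (assumptions: scikit-learn stack; toy data).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
n_docs, n_terms = 300, 40
X = rng.integers(0, 2, size=(n_docs, n_terms)).astype(float)  # term presence/absence
y = X[:, 0].astype(int) ^ X[:, 1].astype(int)  # class set by an interaction of terms 0 and 1

# Univariate ranking: mutual information scores each term in isolation,
# so an XOR-style interaction between terms contributes little to either score.
ig_scores = mutual_info_classif(X, y, discrete_features=True, random_state=0)
ig_rank = np.argsort(ig_scores)[::-1]

# Multivariate ranking: random-forest impurity importances are accumulated
# while the ensemble fits all terms jointly, so interacting terms can get credit.
rf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X, y)
rf_rank = np.argsort(rf.feature_importances_)[::-1]

print("top terms by IG-style score:", ig_rank[:5])
print("top terms by RF importance :", rf_rank[:5])
```

Selecting the top-k terms from `rf_rank` and retraining a classifier on that reduced matrix mirrors the FS-then-classify pipeline evaluated in the paper.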
References
Yang, Y.; Pedersen, J.O.: A comparative study on feature selection in text categorization. In: Proceedings of the 14th international conference on machine learning, pp. 412–420 (1997)
Joachims, T.: Text categorization with support vector machines: Learning with many relevant features. In: Proceedings of the 10th European conference on machine learning, pp. 137–142 (1998)
Sebastiani F.: Machine learning in automated text categorization. ACM Comput. Surv. 34, 1–47 (2002)
Uguz H.: A two-stage feature selection method for text categorization by using information gain, principal component analysis and genetic algorithm. Knowl. Based Syst. 24, 1024–1032 (2011)
Montanes E., Diaz I., Ranilla J., Combarro E.F., Fernandez J.: Scoring and selecting terms for text categorization. IEEE Intell. Syst. 20, 40–47 (2005)
Manning C.D., Raghavan P., Schütze H.: Introduction to Information Retrieval. Cambridge University Press, New York (2008)
Joachims T.: Learning to Classify Text Using Support Vector Machines. Kluwer Academic Publishers, Dordrecht (2002)
Aggarwal, C.C.; Zhai, C.: A survey of text classification algorithms. In: Aggarwal, C.C.; Zhai, C. (eds.) Mining Text Data, pp. 163–222. Springer, Berlin (2012)
Forman G.: An extensive empirical study of feature selection metrics for text classification. J. Mach. Learn. Res. 3, 1289–1305 (2003)
Zhang W., Yoshida T., Tang X.: A comparative study of tf*idf, lsi and multi-words for text classification. Expert Syst. Appl. 38, 2758–2765 (2011)
Badawi D., Altincay H.: A novel framework for termset selection and weighting in binary text classification. Eng. Appl. Artif. Intell. 35, 38–53 (2014)
Uysal A.K., Gunal S.: A novel probabilistic feature selection method for text classification. Knowl. Based Syst. 36, 226–235 (2012)
Meng J., Lin H., Yu Y.: A two-stage feature selection method for text categorization. Comput. Math. Appl. 62, 2793–2800 (2011)
Yu, L.; Liu, H.: Feature selection for high-dimensional data: a fast correlation based filter solution. In: Proceedings of 20th international conference on machine learning, pp. 856–863 (2003)
Javed K., Babri H.A., Saeed M.: Impact of a metric of association between two variables on performance of filters for binary data. Neurocomputing 143, 248–260 (2014)
Koller, D.; Sahami, M.: Toward optimal feature selection. Technical report 1996–77. Stanford InfoLab (1996)
Hall M., Holmes G.: Benchmarking attribute selection techniques for discrete class data mining. IEEE Trans. Knowl. Data Eng. 15, 1437–1447 (2003)
Makrehchi, M.: Feature ranking for text classifiers. Ph.D. thesis, Department of Electrical and Computer Engineering, University of Waterloo, Waterloo, Ontario, Canada (2007)
Javed K., Babri H.A., Saeed M.: Feature selection based on class-dependent densities for high-dimensional binary data. IEEE Trans. Knowl. Data Eng. 24, 465–477 (2012)
Uysal A.K., Gunal S.: Text classification using genetic algorithm oriented latent semantic features. Exp. Syst. Appl. 41, 5938–5947 (2014)
Alpaydin E.: Introduction to Machine Learning, 2nd edition. The MIT Press, Cambridge (2010)
Saeed M., Javed K., Babri H.A.: Machine learning using Bernoulli mixture models: clustering, rule extraction and dimensionality reduction. Neurocomputing 119, 366–374 (2013)
Guyon I., Gunn S., Nikravesh M., Zadeh L.A.: Feature Extraction: Foundations and Applications (Studies in Fuzziness and Soft Computing). Springer, New York (2006)
Duch, W.: Filter methods. In: Guyon, I.; Gunn, S.; Nikravesh, M.; Zadeh, L.A. (eds.) Feature Extraction: Foundations and Applications, pp. 89–117. Springer, New York (2006)
Guyon, I.; Bitter, H.M.; Ahmed, Z.; Brown, M.; Heller, J.: Multivariate non-linear feature selection with kernel methods. In: Nikravesh, M.; Zadeh, L.; Kacprzyk, J. (eds.) Soft Computing for Information Processing and Analysis, Studies in Fuzziness and Soft Computing, vol. 164, pp. 313–326. Springer, Berlin (2005)
Zheng Z., Wu X., Srihari R.: Feature selection for text categorization on imbalanced data. SIGKDD Explor. Newsl. 6, 80–89 (2004)
Mladenic, D.; Grobelnik, M.: Feature selection for unbalanced class distribution and naive Bayes. In: Proceedings of the 6th international conference on machine learning, pp. 258–267 (1999)
Kohavi R., John G.H.: Wrappers for feature subset selection. Artif. Intell. 97, 273–324 (1997)
Das, S.: Filters, Wrappers, and a boosting based hybrid for feature selection. In: Proceedings of the 18th international conference on machine learning, pp. 74–81 (2001)
Breiman L.: Random forests. Mach. Learn. 45, 5–32 (2001)
Cutler D., Edwards T.C., Beard K., Cutler A., Hess K., Gibson J., Lawler J.: Random forests for classification in ecology. Ecology 88, 2783–2792 (2007)
Dìaz-Uriarte, R.; Alvarez de Andrès, S.: Gene selection and classification of microarray data using random forest. BMC Bioinform. 7, 3 (2006). doi:10.1186/1471-2105-7-3
Rodenburg, W.; Heidema, A.; Boer, J.; Bovee-Oudenhoven, I.; Feskens, E.; Mariman, E.; Keijer, J.: A framework to identify physiological responses in microarray-based gene expression studies: selection and interpretation of biologically relevant genes. Physiol. Genom. 33, 78–90 (2008).
Mitchell T.M.: Machine Learning. McGraw-Hill, Inc., New York (1997)
Scholkopf B., Smola A.: Learning with Kernels. MIT Press, Cambridge (2002)
Breiman L.: Bagging predictors. Mach. Learn. 24, 123–140 (1996)
Breiman L., Friedman J.H., Olshen R.A., Stone C.J.: Classification and Regression Trees. Chapman & Hall, New York (1984)
Liaw A., Wiener M.: Classification and regression by randomForest. R News 2, 18–22 (2002)
Strobl, C.; Boulesteix, A.L.; Zeileis, A.; Hothorn, T.: Bias in random forest variable importance measures: illustrations, sources and a solution. BMC Bioinform. 8, 25 (2007). doi:10.1186/1471-2105-8-25
Breiman, L.: Manual on setting up, using, and understanding random forests v3.1. Technical report (2002)
Genuer R., Poggi J.M., Tuleau-Malot C.: Variable selection using random forests. Pattern Recognit. Lett. 31, 2225–2236 (2010)
Chen, C.; Liaw, A.; Breiman, L.: Using random forest to learn imbalanced data. www.stat.berkeley.edu/tech-reports/666.pdf (2004)
Hapfelmeier A., Ulm K.: A new variable selection approach using random forests. Comput. Stat. Data Anal. 60, 50–69 (2013)
Amaratunga D., Cabrera J., Yung-Seop L.: Enriched random forests. Bioinformatics 24(18), 2010–2014 (2008)
Neumayer, R.: Clustering based ensemble classification for spam filtering. In: Proceedings of the 7th workshop on data analysis (2006)
Abdel-Aal R.E.: GMDH-based feature ranking and selection for improved classification of medical data. J. Biomed. Inf. 38, 456–468 (2005)
Tang, R.; Sinnwell, J.P.; Li, J.; Rider, D.N.; De Andrade, M.; Biernacka, J.M.: Identification of genes and haplotypes that predict rheumatoid arthritis using random forests. BMC Proc. Genet. Anal. Workshop 16(Suppl 7), S68 (2009)
Javed K., Maruf S., Babri H.A.: A two-stage Markov blanket based feature selection algorithm for text classification. Neurocomputing 157, 91–104 (2015)
Saffari, A.; Guyon, I.: Quick start guide for challenge learning object package (CLOP). Technical report. Graz University of Technology and Clopinet (2006)
MathWorks. MATLAB: The language of technical computing (2010)
Cardoso-Cachopo, A.: Improving methods for single-label text categorization. Ph.D. thesis, Instituto Superior Tecnico, Universidade Tecnica de Lisboa, Portugal (2007)
Porter M.F.: An algorithm for suffix stripping. Program 14, 130–137 (1980)
Chen J., Huang H., Tian S., Qu Y.: Feature selection for text classification with Naïve Bayes. Expert Syst. Appl. 36, 5432–5435 (2009)
Maruf, S., Javed, K. & Babri, H.A. Improving Text Classification Performance with Random Forests-Based Feature Selection. Arab J Sci Eng 41, 951–964 (2016). https://doi.org/10.1007/s13369-015-1945-x