Abstract
Feature selection (FS) is employed to make text classification (TC) more effective. Well-known FS metrics such as information gain (IG) and odds ratio (OR) rank terms without considering term interactions. Building classifiers with FS algorithms that account for term interactions can yield better performance, but their computational complexity is a concern, which has motivated two-stage algorithms such as information gain–principal component analysis (IG–PCA). Random forests-based feature selection (RFFS), which builds on Breiman's random forests, has demonstrated outstanding performance in capturing gene–gene relations in bioinformatics, but its usefulness for TC is less explored. RFFS has few control parameters, is resistant to overfitting and thus generalizes well to new data, and requires neither a separate test set nor conventional cross-validation to estimate accuracy. This paper investigates the working of RFFS for TC and compares its performance against IG, OR and IG–PCA. We carry out experiments on four widely used text datasets using naive Bayes and support vector machines as classifiers. RFFS achieves macro-F1 values higher than the other FS algorithms in 73% of the experimental instances. We also analyze the performance of RFFS for TC in terms of its parameters and the class skews of the datasets, and report interesting findings.
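To make the univariate-versus-multivariate contrast concrete, the following minimal sketch ranks terms of a toy binary term–document matrix two ways: with a univariate mutual-information score (a stand-in for IG) and with random-forest feature importances (the RFFS idea). This is our illustration in scikit-learn, not the authors' implementation (the paper used MATLAB/CLOP); the synthetic data, variable names, and parameter choices are ours. The class label is built from an interaction of two terms, which a term-by-term score cannot see but a forest fitted on the full term set can.

```python
# Illustrative sketch only: RFFS-style ranking vs. a univariate IG-style ranking
# on synthetic binary term-presence data (assumptions: scikit-learn stack; toy data).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
n_docs, n_terms = 300, 40
X = rng.integers(0, 2, size=(n_docs, n_terms)).astype(float)  # term presence/absence
y = X[:, 0].astype(int) ^ X[:, 1].astype(int)  # class set by an interaction of terms 0 and 1

# Univariate ranking: mutual information scores each term in isolation,
# so an XOR-style interaction between terms contributes little to either score.
ig_scores = mutual_info_classif(X, y, discrete_features=True, random_state=0)
ig_rank = np.argsort(ig_scores)[::-1]

# Multivariate ranking: random-forest impurity importances are accumulated
# while the ensemble fits all terms jointly, so interacting terms can get credit.
rf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X, y)
rf_rank = np.argsort(rf.feature_importances_)[::-1]

print("top terms by IG-style score:", ig_rank[:5])
print("top terms by RF importance :", rf_rank[:5])
```

Selecting the top-k terms from `rf_rank` and retraining a classifier on that reduced matrix mirrors the FS-then-classify pipeline evaluated in the paper.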
References
Yang, Y.; Pedersen, J.O.: A comparative study on feature selection in text categorization. In: Proceedings of the 14th international conference on machine learning, pp. 412–420 (1997)
Joachims, T.: Text categorization with support vector machines: Learning with many relevant features. In: Proceedings of the 10th European conference on machine learning, pp. 137–142 (1998)
Sebastiani F.: Machine learning in automated text categorization. ACM Comput. Surv. 34, 1–47 (2002)
Uguz H.: A two-stage feature selection method for text categorization by using information gain, principal component analysis and genetic algorithm. Knowl. Based Syst. 24, 1024–1032 (2011)
Montanes E., Diaz I., Ranilla J., Combarro E.F., Fernandez J.: Scoring and selecting terms for text categorization. IEEE Intell. Syst. 20, 40–47 (2005)
Manning C.D., Raghavan P., Schütze H.: Introduction to Information Retrieval. Cambridge University Press, New York (2008)
Joachims T.: Learning to Classify Text Using Support Vector Machines. Kluwer Academic Publishers, Dordrecht (2002)
Aggarwal, C.C.; Zhai, C.: A survey of text classification algorithms. In: Aggarwal, C.C.; Zhai, C. (eds.) Mining Text Data, pp. 163–222. Springer, Berlin (2012)
Forman G.: An extensive empirical study of feature selection metrics for text classification. J. Mach. Learn. Res. 3, 1289–1305 (2003)
Zhang W., Yoshida T., Tang X.: A comparative study of tf*idf, lsi and multi-words for text classification. Expert Syst. Appl. 38, 2758–2765 (2011)
Badawi D., Altincay H.: A novel framework for termset selection and weighting in binary text classification. Eng. Appl. Artif. Intell. 35, 38–53 (2014)
Uysal A.K., Gunal S.: A novel probabilistic feature selection method for text classification. Knowl. Based Syst. 36, 226–235 (2012)
Meng J., Lin H., Yu Y.: A two-stage feature selection method for text categorization. Comput. Math. Appl. 62, 2793–2800 (2011)
Yu, L.; Liu, H.: Feature selection for high-dimensional data: a fast correlation based filter solution. In: Proceedings of 20th international conference on machine learning, pp. 856–863 (2003)
Javed K., Babri H.A., Saeed M.: Impact of a metric of association between two variables on performance of filters for binary data. Neurocomputing 143, 248–260 (2014)
Koller, D.; Sahami, M.: Toward optimal feature selection. Technical report 1996–77. Stanford InfoLab (1996)
Hall M., Holmes G.: Benchmarking attribute selection techniques for discrete class data mining. IEEE Trans. Knowl. Data Eng. 15, 1437–1447 (2003)
Makrehchi, M.: Feature ranking for text classifiers. Ph.D. thesis, Department of Electrical and Computer Engineering, University of Waterloo, Waterloo, Ontario, Canada (2007)
Javed K., Babri H.A., Saeed M.: Feature selection based on class-dependent densities for high-dimensional binary data. IEEE Trans. Knowl. Data Eng. 24, 465–477 (2012)
Uysal A.K., Gunal S.: Text classification using genetic algorithm oriented latent semantic features. Exp. Syst. Appl. 41, 5938–5947 (2014)
Alpaydin E.: Introduction to Machine Learning, 2nd edition. The MIT Press, Cambridge (2010)
Saeed M., Javed K., Babri H.A.: Machine learning using Bernoulli mixture models: clustering, rule extraction and dimensionality reduction. Neurocomputing 119, 366–374 (2013)
Guyon I., Gunn S., Nikravesh M., Zadeh L.A.: Feature Extraction: Foundations and Applications (Studies in Fuzziness and Soft Computing). Springer, New York (2006)
Duch, W.: Filter methods. In: Guyon, I.; Gunn, S.; Nikravesh, M.; Zadeh, L.A. (eds.) Feature Extraction: Foundations and Applications, pp. 89–117. Springer, New York (2006)
Guyon, I.; Bitter, H.M.; Ahmed, Z.; Brown, M.; Heller, J.: Multivariate non-linear feature selection with kernel methods. In: Nikravesh, M.; Zadeh, L.; Kacprzyk, J. (eds.) Soft Computing for Information Processing and Analysis, Studies in Fuzziness and Soft Computing, vol. 164, pp. 313–326. Springer, Berlin (2005)
Zheng Z., Wu X., Srihari R.: Feature selection for text categorization on imbalanced data. SIGKDD Explor. Newsl. 6, 80–89 (2004)
Mladenic, D.; Grobelnik, M.: Feature selection for unbalanced class distribution and naive Bayes. In: Proceedings of the 6th international conference on machine learning, pp. 258–267 (1999)
Kohavi R., John G.H.: Wrappers for feature subset selection. Artif. Intell. 97, 273–324 (1997)
Das, S.: Filters, Wrappers, and a boosting based hybrid for feature selection. In: Proceedings of the 18th international conference on machine learning, pp. 74–81 (2001)
Breiman L.: Random forests. Mach. Learn. 45, 5–32 (2001)
Cutler D., Edwards T.C., Beard K., Cutler A., Hess K., Gibson J., Lawler J.: Random forests for classification in ecology. Ecology 88, 2783–2792 (2007)
Dìaz-Uriarte, R.; Alvarez de Andrès, S.: Gene selection and classification of microarray data using random forest. BMC Bioinform. 7, 3 (2006). doi:10.1186/1471-2105-7-3
Rodenburg, W.; Heidema, A.; Boer, J.; Bovee-Oudenhoven, I.; Feskens, E.; Mariman, E.; Keijer, J.: A framework to identify physiological responses in microarray-based gene expression studies: selection and interpretation of biologically relevant genes. Physiol. Genom. 33, 78–90 (2008).
Mitchell T.M.: Machine Learning. McGraw-Hill, Inc., New York (1997)
Scholkopf B., Smola A.: Learning with Kernels. MIT Press, Cambridge (2002)
Breiman L.: Bagging predictors. Mach. Learn. 24, 123–140 (1996)
Breiman L., Friedman J.H., Olshen R.A., Stone C.J.: Classification and Regression Trees. Chapman & Hall, New York (1984)
Liaw A., Wiener M.: Classification and regression by randomForest. R News 2, 18–22 (2002)
Strobl, C.; Boulesteix, A.L.; Zeileis, A.; Hothorn, T.: Bias in random forest variable importance measures: illustrations, sources and a solution. BMC Bioinform. 8, 25 (2007). doi:10.1186/1471-2105-8-25
Breiman, L.: Manual on setting up, using, and understanding random forests v3.1. Technical report (2002)
Genuer R., Poggi J.M., Tuleau-Malot C.: Variable selection using random forests. Pattern Recognit. Lett. 31, 2225–2236 (2010)
Chen, C.; Liaw, A.; Breiman, L.: Using random forest to learn imbalanced data. www.stat.berkeley.edu/tech-reports/666.pdf (2004)
Hapfelmeier A., Ulm K.: A new variable selection approach using random forests. Comput. Stat. Data Anal. 60, 50–69 (2013)
Amaratunga D., Cabrera J., Yung-Seop L.: Enriched random forests. Bioinformatics 24(18), 2010–2014 (2008)
Neumayer, R.: Clustering based ensemble classification for spam filtering. In: Proceedings of the 7th workshop on data analysis (2006)
Abdel-Aal R.E.: GMDH-based feature ranking and selection for improved classification of medical data. J. Biomed. Inf. 38, 456–468 (2005)
Tang, R.; Sinnwell, J.P.; Li, J.; Rider, D.N.; De Andrade, M.; Biernacka, J.M.: Identification of genes and haplotypes that predict rheumatoid arthritis using random forests. BMC Proc. Genet. Anal. Workshop 16(Suppl 7), S68 (2009)
Javed K., Maruf S., Babri H.A.: A two-stage Markov blanket based feature selection algorithm for text classification. Neurocomputing 157, 91–104 (2015)
Saffari, A.; Guyon, I.: Quick start guide for challenge learning object package (CLOP). Technical report. Graz University of Technology and Clopinet (2006)
MathWorks. MATLAB: The language of technical computing (2010)
Cardoso-Cachopo, A.: Improving methods for single-label text categorization. Ph.D. thesis, Instituto Superior Tecnico, Universidade Tecnica de Lisboa, Portugal (2007)
Porter M.F.: An algorithm for suffix stripping. Program 14, 130–137 (1980)
Chen J., Huang H., Tian S., Qu Y.: Feature selection for text classification with Naïve Bayes. Expert Syst. Appl. 36, 5432–5435 (2009)
Maruf, S., Javed, K. & Babri, H.A. Improving Text Classification Performance with Random Forests-Based Feature Selection. Arab J Sci Eng 41, 951–964 (2016). https://doi.org/10.1007/s13369-015-1945-x