Abstract
Variable selection is one of the main problems faced by data mining and machine learning techniques. These techniques are often based, more or less explicitly, on some measure of variable importance. This paper considers Total Decrease in Node Impurity (TDNI) measures, a popular class of variable importance measures defined in the field of decision trees and tree-based ensemble methods, such as Random Forests and Gradient Boosting Machines. Despite their wide use, some measures of this class are known to be biased, and several correction strategies have been proposed. The aim of this paper is twofold: first, to investigate the source and the characteristics of bias in TDNI measures using the notions of informative and uninformative splits; second, to extend a bias-correction algorithm, recently proposed for the Gini measure in the context of classification, to the entire class of TDNI measures and to investigate its performance in the regression framework on simulated and real data.
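To make the quantities in the abstract concrete, the sketch below computes an unnormalized TDNI importance directly from the node impurities of a fitted tree ensemble and then applies a pseudovariable-style bias correction: row-permuted copies of the predictors are uninformative by construction, so the importance they accumulate estimates the bias component of the measure. This is a minimal illustration in Python with scikit-learn, not the paper's implementation; the helper tdni_importance, the simulated data, and the one-shot permutation correction are assumptions made for the example, and the paper's algorithm differs in its details.

```python
# Minimal sketch, assuming a regression setting and scikit-learn's tree
# internals; tdni_importance, the simulated data, and the one-shot
# permutation correction are illustrative, not the paper's algorithm.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

def tdni_importance(forest, n_features):
    """Unnormalized TDNI: sum over all internal nodes of all trees of the
    weighted decrease in node impurity, credited to the splitting variable."""
    imp = np.zeros(n_features)
    for est in forest.estimators_:
        t = est.tree_
        n = t.weighted_n_node_samples
        for node in range(t.node_count):
            left, right = t.children_left[node], t.children_right[node]
            if left == -1:  # leaf: no split, hence no impurity decrease
                continue
            imp[t.feature[node]] += (n[node] * t.impurity[node]
                                     - n[left] * t.impurity[left]
                                     - n[right] * t.impurity[right])
    return imp / len(forest.estimators_)

# Simulated data: X0 is informative; X1 (a high-cardinality noise
# variable, a typical source of selection bias) and X2 are not.
n = 500
X = np.column_stack([rng.normal(size=n),
                     rng.integers(0, 50, size=n).astype(float),
                     rng.normal(size=n)])
y = 2.0 * X[:, 0] + rng.normal(size=n)

# Pseudovariables: a row-permuted copy of X is unrelated to y by
# construction, so its TDNI estimates the bias of the measure.
Z = X[rng.permutation(n), :]
forest = RandomForestRegressor(n_estimators=200, random_state=0)
forest.fit(np.hstack([X, Z]), y)

imp = tdni_importance(forest, 2 * X.shape[1])
corrected = imp[:X.shape[1]] - imp[X.shape[1]:]  # bias-corrected TDNI
print(np.round(corrected, 2))
```

Subtracting the pseudovariable importances is the simplest form of the correction; repeating it over several independent permutations and averaging the results reduces its variance.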
Cite this article
Sandri, M., Zuccolotto, P.: Analysis and correction of bias in Total Decrease in Node Impurity measures for tree-based algorithms. Stat. Comput. 20, 393–407 (2010). https://doi.org/10.1007/s11222-009-9132-0