Abstract
Decision tree (DT) induction is among the most popular data mining techniques. An important component of DT induction algorithms is the splitting method, and the most commonly used methods belong to the Conditional Entropy (CE) family. However, it is well known that no single splitting method gives the best performance for all problem instances. In this paper we explore the relative performance of the Conditional Entropy family and another family based on the Class-Attribute Mutual Information (CAMI) measure. Our results suggest that while some datasets are insensitive to the choice of splitting method, others are very sensitive to it. For example, some CAMI family methods may be more appropriate than the popular Gain Ratio (GR) method for datasets with nominal predictor attributes, and are competitive with the GR method for datasets where all predictor attributes are numeric. Given that it is never known beforehand which splitting method will lead to the best DT for a given dataset, and given the relatively good performance of the CAMI methods, it seems appropriate to suggest that splitting methods from the CAMI family be included in data mining toolsets.
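To make the two families of splitting criteria concrete, the following is a minimal sketch of how entropy-based split scores can be computed for a nominal attribute. The Gain Ratio follows Quinlan's standard definition (information gain divided by the split information H(A)); the CAMI-style score shown here is one common normalization of the class-attribute mutual information, I(C;A)/H(C,A) — the exact normalization used in the paper's CAMI family is not given in the abstract, so treat that line as an illustrative assumption rather than the authors' definition.

```python
import math
from collections import Counter

def entropy(items):
    """Shannon entropy (in bits) of a sequence of hashable items."""
    n = len(items)
    return -sum((c / n) * math.log2(c / n) for c in Counter(items).values())

def split_scores(attr_values, class_labels):
    """Return (information gain, gain ratio, normalized MI) for a split
    on a nominal attribute. attr_values[i] and class_labels[i] describe
    the same record."""
    n = len(class_labels)
    h_c = entropy(class_labels)                     # H(C)
    # Partition the class labels by attribute value.
    groups = {}
    for a, c in zip(attr_values, class_labels):
        groups.setdefault(a, []).append(c)
    h_c_given_a = sum(len(g) / n * entropy(g) for g in groups.values())  # H(C|A)
    info_gain = h_c - h_c_given_a                   # I(C;A) = H(C) - H(C|A)
    split_info = entropy(attr_values)               # H(A)
    gain_ratio = info_gain / split_info if split_info > 0 else 0.0
    # CAMI-style normalization (assumption): I(C;A) / H(C,A)
    h_joint = entropy(list(zip(attr_values, class_labels)))
    cami = info_gain / h_joint if h_joint > 0 else 0.0
    return info_gain, gain_ratio, cami
```

A splitting method then simply picks the candidate attribute maximizing its chosen score; the paper's point is that which score to maximize is dataset-dependent.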
Author information
Kweku-Muata Osei-Bryson is Professor of Information Systems at Virginia Commonwealth University, where he also served as the Coordinator of the Ph.D. program in Information Systems during 2001–2003. Previously he was Professor of Information Systems and Decision Analysis in the School of Business at Howard University, Washington, DC, U.S.A. He has also worked as an Information Systems practitioner in both industry and government. He holds a Ph.D. in Applied Mathematics (Management Science & Information Systems) from the University of Maryland at College Park, a M.S. in Systems Engineering from Howard University, and a B.Sc. in Natural Sciences from the University of the West Indies at Mona. He currently does research in various areas including: Data Mining, Expert Systems, Decision Support Systems, Group Support Systems, Information Systems Outsourcing, Multi-Criteria Decision Analysis. His papers have been published in various journals including: Information & Management, Information Systems Journal, Information Systems Frontiers, Business Process Management Journal, International Journal of Intelligent Systems, IEEE Transactions on Knowledge & Data Engineering, Data & Knowledge Engineering, Information & Software Technology, Decision Support Systems, Information Processing and Management, Computers & Operations Research, European Journal of Operational Research, Journal of the Operational Research Society, Journal of the Association for Information Systems, Journal of Multi-Criteria Decision Analysis, Applications of Management Science. Currently he serves as an Associate Editor of the INFORMS Journal on Computing, and is a member of the Editorial Board of the Computers & Operations Research journal.
Kendall E. Giles received the BS degree in Electrical Engineering from Virginia Tech in 1991, the MS degree in Electrical Engineering from Purdue University in 1993, the MS degree in Information Systems from Virginia Commonwealth University in 2002, and the MS degree in Computer Science from Johns Hopkins University in 2004. Currently he is a PhD student (ABD) in Computer Science at Johns Hopkins, and is a Research Assistant in the Applied Mathematics and Statistics department. He has over 15 years of work experience in industry, government, and academic institutions. His research interests can be partially summarized by the following keywords: network security, mathematical modeling, pattern classification, and high dimensional data analysis.
Cite this article
Osei-Bryson, KM., Giles, K. Splitting methods for decision tree induction: An exploration of the relative performance of two entropy-based families. Inf Syst Front 8, 195–209 (2006). https://doi.org/10.1007/s10796-006-8779-8