Abstract
Classification is a widely used technique in data mining, and achieving reasonable classifier performance is an increasingly important goal. This paper aims to show empirically how classifier performance can be improved by knowledge-driven data preparation that draws on business, data and methodological know-how. To illustrate the variety of knowledge-driven approaches, we first introduce an advanced framework that breaks the data preparation phase of the CRISP-DM process model down into four hierarchy levels. The first three levels reflect methodological knowledge; the last level clarifies the use of business and data know-how. We then present insights from a case study that shows the effect of variable derivation as a subtask of data preparation. The impact of nine derivation approaches and four combinations of them on classifier performance is assessed on a real-world dataset, using decision trees as classifiers and gains charts as the performance measure. The results indicate that our approach improves classifier performance.
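As a rough illustration of the performance measure named above, the following sketch computes the points of a cumulative gains chart from class labels and classifier scores. It is not the authors' implementation: the function name `cumulative_gains`, the toy labels, and both score lists are hypothetical and stand in for a decision tree scored without and with a derived variable.

```python
from typing import List, Sequence, Tuple


def cumulative_gains(
    labels: Sequence[int], scores: Sequence[float], bins: int = 10
) -> List[Tuple[float, float]]:
    """Return one (share of cases selected, share of positives captured) pair per bin."""
    if len(labels) != len(scores):
        raise ValueError("labels and scores must have the same length")
    total_pos = sum(labels)
    if total_pos == 0:
        raise ValueError("at least one positive case is required")

    # Rank cases from the highest to the lowest score.
    ranked = [
        y for _, y in sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    ]
    n = len(ranked)

    points = []
    for b in range(1, bins + 1):
        cutoff = round(n * b / bins)      # number of top-ranked cases selected
        captured = sum(ranked[:cutoff])   # positives among the selected cases
        points.append((cutoff / n, captured / total_pos))
    return points


# Hypothetical toy data: 10 cases, 3 of them positive (label 1).
labels = [1, 0, 1, 0, 0, 1, 0, 0, 0, 0]
# Assumed scores from a baseline model and from a model using a derived variable.
baseline_scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.6, 0.2, 0.5, 0.1, 0.05]
derived_scores = [0.9, 0.3, 0.8, 0.4, 0.2, 0.7, 0.1, 0.5, 0.05, 0.6]

baseline_gains = cumulative_gains(labels, baseline_scores, bins=5)
derived_gains = cumulative_gains(labels, derived_scores, bins=5)

for (share, base), (_, derived) in zip(baseline_gains, derived_gains):
    print(f"top {share:.0%}: baseline {base:.2f}, with derived variable {derived:.2f}")
```

A gains curve that rises faster in the first bins means more positives are captured among the top-ranked cases, which is how two data preparation variants could be compared under this measure.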
Copyright information
© 2012 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Welcker, L., Koch, S., Dellmann, F. (2012). Improving Classifier Performance by Knowledge-Driven Data Preparation. In: Perner, P. (ed.) Advances in Data Mining. Applications and Theoretical Aspects. ICDM 2012. Lecture Notes in Computer Science, vol 7377. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-31488-9_13
DOI: https://doi.org/10.1007/978-3-642-31488-9_13
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-31487-2
Online ISBN: 978-3-642-31488-9
eBook Packages: Computer Science, Computer Science (R0)