Improving Classifier Performance by Knowledge-Driven Data Preparation

Welcker, Laura; Koch, Stephan; Dellmann, Frank

doi:10.1007/978-3-642-31488-9_13


Part of the book series: Lecture Notes in Computer Science (LNAI, volume 7377)

Included in the following conference series:

Industrial Conference on Data Mining

Abstract

Classification is a widely used technique in data mining, and achieving reasonable classifier performance is an increasingly important goal. This paper aims to show empirically how classifier performance can be improved by knowledge-driven data preparation that draws on business, data, and methodological know-how. To illustrate the variety of knowledge-driven approaches, we first introduce an advanced framework that breaks down the data preparation phase of the CRISP-DM process model into four hierarchy levels. The first three levels reflect methodological knowledge; the last level clarifies the use of business and data know-how. Furthermore, we present insights from a case study that show the effect of variable derivation as a subtask of data preparation. The impact of nine derivation approaches and four combinations of them on classifier performance is assessed on a real-world dataset, using decision trees as the classifier and gains charts as the performance measure. The results indicate that our approach improves classifier performance.
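The evaluation idea in the abstract, comparing gains charts before and after variable derivation, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy customer records, the derived ratio variable (orders per month), and the function name `cumulative_gains` are hypothetical and are not taken from the paper. For brevity, the records are ranked directly by the variable's value instead of by the output of a decision tree.

```python
from typing import List, Sequence


def cumulative_gains(scores: Sequence[float], labels: Sequence[int],
                     n_bins: int = 10) -> List[float]:
    """Return the cumulative share of positives captured after each bin.

    Records are sorted by descending score and split into n_bins groups of
    nearly equal size; entry i is the fraction of all positives found in
    the top (i + 1) bins.
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    total_pos = sum(labels)
    if total_pos == 0:
        raise ValueError("at least one positive label is required")

    # Rank records from highest to lowest score.
    pairs = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    ranked = [label for _, label in pairs]

    n = len(ranked)
    gains = []
    for b in range(1, n_bins + 1):
        cutoff = round(b * n / n_bins)  # number of records in the top b bins
        gains.append(sum(ranked[:cutoff]) / total_pos)
    return gains


# Hypothetical toy data: (orders, months_as_customer, responded)
customers = [
    (12, 24, 0), (3, 2, 1), (8, 4, 1), (5, 20, 0), (2, 1, 1),
    (10, 30, 0), (6, 3, 1), (1, 12, 0), (4, 16, 0), (9, 6, 1),
]
labels = [c[2] for c in customers]
raw_score = [c[0] for c in customers]             # raw variable: orders
derived_score = [c[0] / c[1] for c in customers]  # derived: orders per month

raw_gains = cumulative_gains(raw_score, labels, n_bins=5)
derived_gains = cumulative_gains(derived_score, labels, n_bins=5)
print("raw:    ", raw_gains)
print("derived:", derived_gains)
```

On this made-up data the derived ratio ranks all responders ahead of the non-responders, so its gains curve rises faster than that of the raw variable. This only shows how such a comparison is read; it says nothing about the results reported in the paper.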



Author information

Authors and Affiliations

Münster University of Applied Sciences, Münster, Germany
Laura Welcker & Frank Dellmann
BBDO Proximity GmbH, Hamburg, Germany
Stephan Koch


Editor information

Editors and Affiliations

Institute of Computer Vision and Applied Computer Sciences, IBaI, Kohlenstraße 2, 04107, Leipzig, Germany
Petra Perner


About this paper

Cite this paper

Welcker, L., Koch, S., Dellmann, F. (2012). Improving Classifier Performance by Knowledge-Driven Data Preparation. In: Perner, P. (ed.) Advances in Data Mining. Applications and Theoretical Aspects. ICDM 2012. Lecture Notes in Computer Science, vol 7377. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-31488-9_13


DOI: https://doi.org/10.1007/978-3-642-31488-9_13
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-31487-2
Online ISBN: 978-3-642-31488-9
eBook Packages: Computer Science, Computer Science (R0)
