Where are the large and difficult datasets?

Jamain, Adrien; Hand, David J.

doi:10.1007/s11634-009-0037-8

Regular Article
Published: 14 March 2009

Volume 3, pages 25–38 (2009)

Advances in Data Analysis and Classification

Abstract

A great many comparative performance assessments of classification rules have been undertaken, ranging from small ones involving just one or two methods, to large ones involving many tens of methods. We are undertaking a meta-analytic study of these studies, attempting to distil some overall conclusions. This paper describes just one of our observations. The dataset analysed in this paper contains 5,203 error rates taken from 45 articles and describing 146 datasets. One curious general relationship which was persistent in our data, despite the fact that we were looking at results mixed between distributions rather than conditional on distributions, was that error rate decreased with increasing dataset size. We believe this to be an artefact of the way datasets are collected by the research community.
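
As a rough illustration of the pooled relationship described in the abstract, one way to look for a monotonic association between dataset size and reported error rate is a rank correlation. The sketch below is not the authors' analysis: the (size, error rate) pairs are invented placeholders rather than any of the 5,203 collected error rates, and the helper names `ranks` and `spearman` are hypothetical.

```python
# Illustrative only: the (size, error rate) pairs below are made-up
# placeholders, NOT values from the paper's collection of results.

def ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical pooled records: (number of instances, reported error rate).
records = [(150, 0.30), (690, 0.22), (2000, 0.15), (20000, 0.08), (58000, 0.03)]
sizes = [size for size, _ in records]
errors = [err for _, err in records]

rho = spearman(sizes, errors)
print(f"Spearman rho between size and error rate: {rho:.2f}")
```

A negative coefficient on pooled records of this kind would be consistent with the pattern the abstract reports, but on its own it cannot separate a genuine effect of sample size from the collection artefact the authors suspect.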

Author information

Authors and Affiliations

BNP-Paribas, 10 Harewood Avenue, London, NW1 6AA, UK
Adrien Jamain
Department of Mathematics, Institute for Mathematical Sciences, Imperial College, London, SW7 2AZ, UK
David J. Hand

Corresponding author

Correspondence to Adrien Jamain.

About this article

Cite this article

Jamain, A., Hand, D.J. Where are the large and difficult datasets? Adv Data Anal Classif 3, 25–38 (2009). https://doi.org/10.1007/s11634-009-0037-8

Received: 21 August 2007
Revised: 14 January 2009
Accepted: 20 February 2009
Published: 14 March 2009
Issue Date: June 2009
DOI: https://doi.org/10.1007/s11634-009-0037-8
