Abstract
This paper reviews the suitability of standard machine learning algorithms, which were developed mainly in the context of small data sets, for application to large data sets. Sampling and parallelisation have proved useful means of reducing computation time when learning from large data sets. However, such methods assume that algorithms designed for what are now considered small data sets are also fundamentally suitable for large data sets. It is plausible that optimal learning from large data sets requires a different type of algorithm than optimal learning from small data sets. This paper investigates one respect in which data set size may affect the requirements of a learning algorithm: the bias plus variance decomposition of classification error. Experiments show that learning from large data sets may be more effective with an algorithm that places greater emphasis on bias management rather than variance management.
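For context, the decomposition referred to above can be illustrated with one widely used formulation of the bias plus variance decomposition for zero-one loss, in the style of Kohavi and Wolpert; the notation below is a sketch of that standard formulation and is not drawn from the paper itself. Writing P(y|x) for the true class distribution at a point x and \hat{P}(y|x) for the distribution of the learner's prediction at x over training sets, the expected classification error decomposes as

\[
  \mathrm{E}[\mathrm{error}] \;=\; \sum_{x} P(x)\,\bigl(\sigma^{2}_{x} + \mathrm{bias}^{2}_{x} + \mathrm{variance}_{x}\bigr),
\]
\[
  \sigma^{2}_{x} = \tfrac{1}{2}\Bigl(1 - \sum_{y} P(y \mid x)^{2}\Bigr), \quad
  \mathrm{bias}^{2}_{x} = \tfrac{1}{2}\sum_{y}\bigl(P(y \mid x) - \hat{P}(y \mid x)\bigr)^{2}, \quad
  \mathrm{variance}_{x} = \tfrac{1}{2}\Bigl(1 - \sum_{y}\hat{P}(y \mid x)^{2}\Bigr).
\]

Under this view, the variance term measures how much the learner's predictions fluctuate across different training samples, and such fluctuation tends to shrink as training sets grow, which is consistent with the abstract's suggestion that bias management becomes relatively more important when learning from large data sets.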
Copyright information
© 2002 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Brain, D., Webb, G.I. (2002). The Need for Low Bias Algorithms in Classification Learning from Large Data Sets. In: Elomaa, T., Mannila, H., Toivonen, H. (eds) Principles of Data Mining and Knowledge Discovery. PKDD 2002. Lecture Notes in Computer Science, vol 2431. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45681-3_6
DOI: https://doi.org/10.1007/3-540-45681-3_6
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-44037-6
Online ISBN: 978-3-540-45681-0