Abstract
Probabilistic classifiers are developed by assuming generative models that are product distributions over the original attribute space (as in naive Bayes) or over more involved spaces (as in general Bayesian networks). While this paradigm has proven experimentally successful on real-world applications despite its vastly simplified probabilistic assumptions, the question of why these approaches work is still open.
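To make the product-distribution assumption concrete, the following is a minimal naive Bayes sketch over binary attributes; the helper names and toy data are hypothetical and not taken from the paper. The class-conditional model keeps only one Bernoulli marginal per attribute and scores a class by the product of those marginals and the class prior.

```python
import numpy as np

# A minimal sketch of the product-distribution (naive Bayes) assumption
# described above; all data and parameter choices here are illustrative.

def fit_naive_bayes(X, y, alpha=1.0):
    """Estimate class priors and per-attribute Bernoulli marginals with Laplace smoothing."""
    classes = np.unique(y)
    priors = np.array([(y == c).mean() for c in classes])
    # marginals[c, j] = P(x_j = 1 | class c): one marginal per attribute, no joint table.
    marginals = np.array([(X[y == c].sum(axis=0) + alpha) / ((y == c).sum() + 2 * alpha)
                          for c in classes])
    return classes, priors, marginals

def predict(x, classes, priors, marginals):
    """Score each class by the product of its attribute marginals times the prior."""
    likelihood = np.prod(np.where(x == 1, marginals, 1 - marginals), axis=1)
    return classes[np.argmax(priors * likelihood)]

# Hypothetical toy data: 6 samples, 3 binary attributes.
X = np.array([[1, 0, 1], [1, 1, 1], [0, 0, 1], [0, 1, 0], [0, 0, 0], [1, 0, 0]])
y = np.array([1, 1, 1, 0, 0, 0])
model = fit_naive_bayes(X, y)
print(predict(np.array([1, 0, 1]), *model))
```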
This paper resolves this question. We show that almost all joint distributions with a given set of marginals (i.e., all distributions that could have given rise to the learned classifier) or, equivalently, almost all data sets that yield this set of marginals, are very close (in terms of distributional distance) to the product distribution on the marginals; the number of these distributions decreases exponentially with their distance from the product distribution. Consequently, as we show, for almost all joint distributions with this set of marginals, the penalty incurred in using the product of the marginals rather than the true joint distribution is small. In addition to resolving the puzzle surrounding the success of probabilistic classifiers, our results contribute to understanding the tradeoffs in developing probabilistic classifiers and will help in developing better classifiers.
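The counting claim, that for fixed marginals the number of joint distributions (equivalently, of data sets) at a given distance from the product distribution falls exponentially in that distance, can be checked empirically in a toy setting. The sketch below is an illustration under assumed parameters (two binary variables, a hypothetical data-set size and marginal counts), not the paper's construction or proof: it enumerates every 2x2 contingency table consistent with the fixed margins, counts the ordered data sets realizing each table via the multinomial coefficient, and prints the log fraction of data sets against the table's KL divergence from the product of the marginals.

```python
from math import comb, log
import numpy as np

# Toy counting sketch (not the paper's proof): fix the marginal counts of two
# binary attributes over m samples, enumerate every consistent joint table,
# and relate the number of data sets realizing each table to its KL divergence
# from the product of the marginals. All parameter values are assumptions.

m = 200                 # data set size (illustrative)
row_ones = 120          # number of samples with X = 1 (fixed marginal)
col_ones = 80           # number of samples with Y = 1 (fixed marginal)

def kl_to_product(table, m):
    """KL divergence of the empirical joint from the product of its marginals."""
    p = table / m
    px = p.sum(axis=1, keepdims=True)
    py = p.sum(axis=0, keepdims=True)
    q = px * py
    mask = p > 0
    return float((p[mask] * np.log(p[mask] / q[mask])).sum())

results = []
# With both margins fixed, a 2x2 table has one free cell; sweep its feasible values.
for n11 in range(max(0, row_ones + col_ones - m), min(row_ones, col_ones) + 1):
    n10 = row_ones - n11
    n01 = col_ones - n11
    n00 = m - n11 - n10 - n01
    table = np.array([[n00, n01], [n10, n11]])
    # Log of the number of ordered data sets with exactly these counts (multinomial coefficient).
    log_count = log(comb(m, n11)) + log(comb(m - n11, n10)) + log(comb(m - n11 - n10, n01))
    results.append((kl_to_product(table, m), log_count))

results.sort()
total = np.logaddexp.reduce([c for _, c in results])
for d, c in results[::8]:
    print(f"KL from product = {d:.3f}   log fraction of data sets = {c - total:8.1f}")
```

Under this toy setup, the fraction of data sets decays roughly like exp(-m times the KL divergence), which is the flavor of concentration the abstract describes.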