Abstract
Classification problems have a long history in the machine learning literature. One of the simplest, yet most consistently well-performing, families of classifiers is the Naïve Bayes model. However, an inherent problem with these classifiers is the assumption that all attributes used to describe an instance are conditionally independent given the class of that instance. When this assumption is violated (as is often the case in practice), classification accuracy can suffer from “information double-counting” and from omitted interactions among attributes.
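Concretely, the Naïve Bayes assumption means the class posterior factorizes as P(C | a1, …, an) ∝ P(C) ∏i P(ai | C). The following is a minimal illustrative sketch of classification under this factorization; it is not the authors' implementation, and the function name and toy probability tables are our own:

```python
import numpy as np

def naive_bayes_posterior(priors, cond_probs, instance):
    """Class posterior under the Naive Bayes assumption:
    P(C | a_1..a_n) is proportional to P(C) * prod_i P(a_i | C)."""
    scores = np.log(priors)
    for attr, value in enumerate(instance):
        # Each attribute contributes independently given the class, so
        # correlated attributes get their evidence "double-counted".
        scores = scores + np.log(cond_probs[attr][:, value])
    scores -= scores.max()          # shift for numerical stability
    post = np.exp(scores)
    return post / post.sum()

# Two classes, two binary attributes; all probabilities are made up.
priors = np.array([0.6, 0.4])
cond_probs = [np.array([[0.9, 0.1],   # P(A1 | C): rows index the class
                        [0.3, 0.7]]),
              np.array([[0.8, 0.2],   # P(A2 | C)
                        [0.4, 0.6]])]
print(naive_bayes_posterior(priors, cond_probs, instance=[0, 1]))
```

Working in log space avoids underflow when many attributes are multiplied together; the shift by the maximum score leaves the normalized posterior unchanged.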
In this paper we focus on a relatively new class of models, termed Hierarchical Naïve Bayes models. These models extend the modeling flexibility of Naïve Bayes models by introducing latent variables that relax some of the independence assumptions. We propose a simple algorithm for learning Hierarchical Naïve Bayes models in the context of classification. Experimental results show that the learned models can significantly improve classification accuracy compared to other frameworks.
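To illustrate the role of the latent variables (a hedged sketch under our own assumptions, not the paper's algorithm): a latent variable H can be placed between the class C and a set of correlated attributes, so that the attributes are independent given H rather than given C, and H is summed out at classification time. The structure C → H → {A1, A2}, the function, and the probability tables below are purely illustrative:

```python
import numpy as np

def hnb_posterior(prior_c, p_h_given_c, p_a1_given_h, p_a2_given_h, a1, a2):
    """Class posterior for a toy hierarchical model C -> H -> {A1, A2}:
    the latent variable H absorbs the dependence between A1 and A2, so
    P(a1, a2 | c) = sum_h P(h | c) * P(a1 | h) * P(a2 | h)."""
    # Likelihood of both attributes given each class, with H summed out.
    lik = np.einsum('ch,h,h->c',
                    p_h_given_c,
                    p_a1_given_h[:, a1],
                    p_a2_given_h[:, a2])
    post = prior_c * lik
    return post / post.sum()

# Illustrative tables: 2 classes, 2 latent states, 2 binary attributes.
prior_c = np.array([0.5, 0.5])
p_h_given_c = np.array([[0.9, 0.1],    # P(H | C): rows index the class
                        [0.2, 0.8]])
p_a_given_h = np.array([[0.95, 0.05],  # P(A | H): rows index H
                        [0.10, 0.90]])
print(hnb_posterior(prior_c, p_h_given_c, p_a_given_h, p_a_given_h, a1=0, a2=0))
```

Because A1 and A2 are now mediated by H, their shared evidence enters the posterior once through H instead of being counted twice, which is the intuition behind relaxing the Naïve Bayes independence assumption.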
Additional information
Editor: Peter Flach
Cite this article
Langseth, H., Nielsen, T.D. Classification using Hierarchical Naïve Bayes models. Mach Learn 63, 135–159 (2006). https://doi.org/10.1007/s10994-006-6136-2