Abstract
Bayesian belief nets (BNs) are often used for classification tasks—typically to return the most likely class label for each specified instance. Many BN-learners, however, attempt to find the BN that maximizes a different objective function—viz., likelihood, rather than classification accuracy—typically by first learning an appropriate graphical structure, then finding the parameters for that structure that maximize the likelihood of the data. As these parameters may not maximize the classification accuracy, “discriminative parameter learners” follow the alternative approach of seeking the parameters that maximize conditional likelihood (CL), over the distribution of instances the BN will have to classify. This paper first formally specifies this task, shows how it extends standard logistic regression, and analyzes its inherent sample and computational complexity. We then present a general algorithm for this task, ELR, that applies to arbitrary BN structures and that works effectively even when given incomplete training data. Unfortunately, ELR is not guaranteed to find the parameters that optimize conditional likelihood; moreover, even the optimal-CL parameters need not have minimal classification error. This paper therefore presents empirical evidence that ELR produces effective classifiers, often superior to the ones produced by the standard “generative” algorithms, especially in common situations where the given BN-structure is incorrect.
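To make the parameter-learning task concrete, the sketch below works through the simplest case the abstract mentions: a naive-Bayes structure over binary features, where maximizing conditional likelihood (CL) coincides with logistic regression. Each CPT row is parameterized through a softmax, so gradient ascent on the conditional log-likelihood keeps the probabilities normalized. This is an illustrative reconstruction, not the authors' ELR implementation (which also handles arbitrary structures and incomplete data); the function name fit_cl_naive_bayes, the learning rate, and the iteration count are assumptions made for the example.

import numpy as np

def softmax(z, axis=-1):
    # numerically stable softmax
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def fit_cl_naive_bayes(X, y, n_classes, lr=0.5, n_iters=300):
    """Gradient ascent on the conditional log-likelihood sum_n log P(y_n | x_n)
    for a naive-Bayes belief net. X: (N, D) binary features; y: (N,) labels."""
    N, D = X.shape
    beta_c = np.zeros(n_classes)          # logits for the class prior P(C)
    beta_f = np.zeros((n_classes, D, 2))  # logits for each CPT row P(X_d | C)
    onehot = np.eye(2)[X]                 # (N, D, 2) one-hot feature values
    for _ in range(n_iters):
        log_pc = np.log(softmax(beta_c))              # (C,)
        log_pf = np.log(softmax(beta_f, axis=-1))     # (C, D, 2)
        # joint[n, c] = log P(c) + sum_d log P(x_nd | c)
        feat = log_pf[:, np.arange(D)[None, :], X]    # (C, N, D)
        joint = log_pc[None, :] + feat.sum(axis=2).T  # (N, C)
        # gradient signal of CL: r[n, c] = 1[c = y_n] - P(c | x_n)
        r = -softmax(joint, axis=1)
        r[np.arange(N), y] += 1.0
        # analytic gradients through the softmax parameterization
        grad_c = r.sum(axis=0)                        # (C,)
        p_f = softmax(beta_f, axis=-1)
        grad_f = np.einsum('nc,ndv->cdv', r, onehot) - grad_c[:, None, None] * p_f
        beta_c += lr * grad_c / N
        beta_f += lr * grad_f / N
    return softmax(beta_c), softmax(beta_f, axis=-1)

# Toy usage on synthetic data: labels correlated with two of four features.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(500, 4))
y = (X[:, 0] | X[:, 2]).astype(int)
priors, cpts = fit_cl_naive_bayes(X, y, n_classes=2)

For this fixed structure the learned classifier is equivalent to a logistic regression model, as the paper discusses; for general structures and incomplete data the objective is harder, which is why ELR, as the abstract notes, is only guaranteed to reach a local optimum of the conditional likelihood.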
References
Abe, N., Takeuchi, J., & Warmuth, M. (1991). Polynomial learnability of probabilistic concepts with respect to the Kullback-Leibler divergence. In Conference on Learning Theory.
Binder, J., Koller, D., Russell, S., & Kanazawa, K. (1997). Adaptive probabilistic networks with hidden variables. Machine Learning, 29, 213–244.
Bishop, C. (1998). Neural networks for pattern recognition. Oxford: Oxford University Press.
Blake, C., & Merz, C. (2000). UCI repository of machine learning databases, http://www.ics.uci.edu/~mlearn/MLRepository.html.
Boyd, S., & Vandenberghe, L. (2004). Convex optimization. Cambridge University Press.
Buntine, W. (1996). A guide to the literature on learning probabilistic networks from data. IEEE Transactions on Knowledge and Data Engineering, 8, 195–210.
Cerquides, J., & de Mántaras, R. L. (2003). Tractable Bayesian learning of tree augmented naïve Bayes models. In International Conference on Machine Learning (pp. 75–82).
Cheng, J., & Greiner, R. (1999). Comparing Bayesian network classifiers. In Uncertainty in Artificial Intelligence.
Cheng, J., Greiner, R., Kelly, J., Bell, D., & Liu, W. (2002). Learning Bayesian networks from data: An information-theory based approach. Artificial Intelligence, 137, 43–90.
Chickering, D. M., Geiger, D., & Heckerman, D. (1994). Learning Bayesian networks is NP-hard. Technical Report MSR-TR-94-17, Microsoft Research.
Chou, W., Juang, B., & Lee, C. (1992). Segmental GPD training of HMM based speech recognizer. In International Conference on Acoustics, Speech and Signal Processing, vol. 1 (pp. 473–476).
Chow, C., & Liu, C. (1968). Approximating discrete probability distributions with dependence trees. IEEE Transactions on Information Theory, 14, 462–467.
Cooper, G. (1990). The computational complexity of probabilistic inference using Bayesian belief networks. Artificial Intelligence, 42, 393–405.
Cooper, G., & Herskovits, E. (1992). A Bayesian method for the induction of probabilistic networks from data. Machine Learning, 9, 309–347.
Cox, D. R., & Snell, E. J. (1989). Analysis of binary data. London: Chapman & Hall.
Darwiche, A. (2000). A differential approach to inference in Bayesian networks. In Uncertainty in Artificial Intelligence.
Dasgupta, S. (1997). The sample complexity of learning fixed-structure Bayesian networks. Machine Learning, 29, 165–180.
Dash D., & Cooper, G. (2002). Exact model averaging with naïve Bayesian classifiers. In International Conference on Machine Learning (pp. 91–98).
Dawid, A. P. (1976). Properties of diagnostic data distributions. Biometrics, 32, 647–658.
Dempster, A., Laird, N., & Rubin, D. (1977). Maximum likelihood from incomplete data via the EM algorithm (with discussion). Journal of the Royal Statistical Society, Series B, 39, 1–38.
Duda, R., & Hart, P. (1973). Pattern classification and scene analysis. Wiley.
Edwards, D., & Lauritzen, S. (2001). The TM algorithm for maximising a conditional likelihood function. Biometrika, 88, 961–972.
Fayyad, U., & Irani, K. (1993). Multi-interval discretization of continuous-valued attributes for classification learning. In International Joint Conference on Artificial Intelligence.
Friedman, N., Geiger, D., & Goldszmidt, M. (1997). Bayesian network classifiers. Machine Learning, 29, 131–163.
Greiner, R., Grove, A., & Schuurmans, D. (1997). Learning Bayesian nets that perform well. In Uncertainty in Artificial Intelligence.
Greiner, R., & Zhou, W. (2002). Structural extension to logistic regression: Discriminant parameter learning of belief net classifiers. In National Conference on Artificial Intelligence (AAAI).
Grossman, D., & Domingos, P. (2004). Learning Bayesian network classifiers by maximizing conditional likelihood. In International Conference on Machine Learning.
Hagan, M., Demuth, H., & Beale, M. (1996). Neural network design. Boston, MA: PWS Publishing.
Heckerman, D. E. (1998). A tutorial on learning with Bayesian networks. In M. I. Jordan (Ed.), Learning in graphical models.
Jaakkola, T., Meila, M., & Jebara, T. (2000). Maximum entropy discrimination. In Neural Information Processing Systems.
Jordan, M. (1995). Why the logistic function? A tutorial discussion on probabilities and neural networks.
Kohavi, R. (1995). A study of cross-validation and bootstrap for accuracy estimation and model selection. In International Joint Conference on Artificial Intelligence.
Kohavi, R., & John, G. H. (1997). Wrappers for feature subset selection. Artificial Intelligence, 97, 273–324.
Kontkanen, P., Myllymäki, P., Silander, T., & Tirri, H. (1999). On supervised selection of Bayesian networks. In Uncertainty in Artificial Intelligence (pp. 334–342).
Lafferty, J., McCallum, A., & Pereira, F. (2001). Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In International Conference on Machine Learning.
Lauritzen, S. (1995). The EM algorithm for graphical association models with missing data. Computational Statistics and Data Analysis, 19, 191–201.
Little, R. J. A., & Rubin, D. B. (1987). Statistical analysis with missing data. New York: Wiley.
McCullagh, P., & Nelder, J. (1989). Generalized linear models. Chapman and Hall.
Minka, T. (2001). Algorithms for maximum-likelihood logistic regression. Technical report, CMU CALD, http://www.stat.cmu.edu/~minka/papers/logreg/minka-logreg.pdf.
Mitchell, T. M. (1997). Machine learning. McGraw-Hill.
Nadeau, C., & Bengio, Y. (2003). Inference for the generalization error. Machine Learning, 52, 239–281.
Nesterov, Y., & Nemirovskii, A. (1994). Interior-point polynomial algorithms in convex programming. Society for Industrial and Applied Mathematics.
Ng, A., & Jordan, M. (2001). On discriminative versus generative classifiers: A comparison of logistic regression and naive Bayes. In Neural Information Processing Systems.
Pearl, J. (1988). Probabilistic reasoning in intelligent systems: Networks of plausible inference. San Mateo: Morgan Kaufmann.
Press, W. H., Flannery, B. P., Teukolsky, S. A., & Vetterling, W. T. (2002). Numerical recipes in C. Cambridge University Press, http://www.nr.com/.
Ripley, B. (1996). Pattern recognition and neural networks, Cambridge, UK: Cambridge University Press.
Roos, T., Wettig, H., Grünwald, P., Myllymäki, P., & Tirri, H. (2005). On discriminative Bayesian network classifiers and logistic regression. Machine Learning, 59, 269–298.
Schlüter, R., Macherey, W., Kanthak, S., Ney, H., & Welling, L. (1997). Comparison of optimization methods for discriminative training criteria. In Proc. of European Conference on Speech Communication and Technology.
Shen, B., Su, X., Greiner, R., Musilek, P., & Cheng, C. (2003). Discriminative parameter learning of general Bayesian network classifiers. In International Conference on Tools with Artificial Intelligence.
Sundberg, R. (2002). The convergence rate of the TM algorithm of Edwards and Lauritzen. Biometrika, 89, 478–483.
Van Allen, T., & Greiner, R. (2000). Model selection criteria for learning belief nets: An empirical comparison. In International Conference on Machine Learning (pp. 1047–1054).
Zhou, W. (2002). Discriminative learning of Bayesian net parameters. Master’s thesis, Dept of Computing Science, University of Alberta.
Additional information
Editors:
Pedro Larrañaga, Jose A. Lozano, Jose M. Peña and Iñaki Inza
This paper extends the earlier results that appear in Greiner and Zhou (2002) and Shen et al. (2003).
Electronic supplementary material
About this article
Cite this article
Greiner, R., Su, X., Shen, B. et al. Structural Extension to Logistic Regression: Discriminative Parameter Learning of Belief Net Classifiers. Mach Learn 59, 297–322 (2005). https://doi.org/10.1007/s10994-005-0469-0