Abstract
An extension of the latent class model is presented for clustering categorical data by relaxing the classical class-conditional independence assumption on the variables. Within each class, the model groups the variables into blocks that are mutually independent, while variables inside the same block may be dependent, so that the main intra-class correlations are captured. The dependency between variables grouped in the same block of a class is modelled by mixing two extreme distributions: the independence distribution and the maximum dependency distribution. When the variables are dependent given the class, this approach is expected to reduce the biases of the latent class model, since it yields a meaningful dependency model with only a few additional parameters. The parameters are estimated by maximum likelihood by means of an EM algorithm. Moreover, a Gibbs sampler is used for model selection in order to overcome the computational intractability of the combinatorial problems raised by the block structure search. Two applications on medical and biological data sets show the relevance of this new model. The results strengthen the view that the model is meaningful and that it reduces the biases induced by the conditional independence assumption of the latent class model.
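As a rough illustration of the within-block mixture described in the abstract, the sketch below builds the joint distribution of a single block of two categorical variables as a weighted combination of an independence distribution and a fully dependent one. It is a simplified sketch, not the parametrization used in the article: the function name `block_pmf`, the symbols `alpha_1`, `alpha_2`, `tau` and `rho`, and the choice of a "diagonal" maximum dependency distribution (both variables share the same number of levels and the second is fully determined by the first) are all assumptions made here for illustration.

```python
import numpy as np


def block_pmf(alpha_1, alpha_2, tau, rho):
    """Joint pmf of one block of two categorical variables with m levels each.

    alpha_1, alpha_2 : marginal pmfs used by the independence component.
    tau              : pmf over the m matched cells of the fully dependent
                       component (illustrative assumption, see lead-in).
    rho              : weight of the maximum dependency component, in [0, 1].
    """
    alpha_1 = np.asarray(alpha_1, dtype=float)
    alpha_2 = np.asarray(alpha_2, dtype=float)
    tau = np.asarray(tau, dtype=float)

    # Independence component: product of the two marginals.
    independence = np.outer(alpha_1, alpha_2)
    # Simplified maximum dependency component: all mass on the diagonal,
    # so knowing the first variable determines the second.
    max_dependency = np.diag(tau)

    # Mixture of the two extreme distributions.
    return (1.0 - rho) * independence + rho * max_dependency


joint = block_pmf([0.5, 0.5], [0.5, 0.5], tau=[0.5, 0.5], rho=0.4)
print(joint)
```

With `rho = 0` the block reduces to the usual latent class (independence) case, and with `rho = 1` the two variables are deterministically linked. Under the block structure, the class-conditional distribution of all variables would then be the product of such block distributions.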
Additional information
The authors thank the Editor and the three anonymous referees for their useful comments and references. The authors are grateful to the Genes Diffusion corporation for providing the data set, and especially to its members Amélie Vallée, Julie Hamon and Claude Grenier. The authors are also grateful to Parmeet Bhatia, Modal Team engineer, and Stéphane Chrétien for their valuable assistance. This work was financed by DGA and Inria.
Cite this article
Marbac, M., Biernacki, C. & Vandewalle, V. Model-Based Clustering for Conditionally Correlated Categorical Data. J Classif 32, 145–175 (2015). https://doi.org/10.1007/s00357-015-9180-4