Model-Based Clustering for Conditionally Correlated Categorical Data

Marbac, Matthieu; Biernacki, Christophe; Vandewalle, Vincent

doi:10.1007/s00357-015-9180-4

Published: 09 July 2015

Volume 32, pages 145–175 (2015)

Journal of Classification

Matthieu Marbac¹,
Christophe Biernacki² &
Vincent Vandewalle³

Abstract

An extension of the latent class model is presented for clustering categorical data by relaxing the classical “class conditional independence assumption” of variables. Within each class, the model groups the variables into blocks that are independent of one another, while variables within the same block may be dependent; this captures the main intra-class correlations. The dependency between variables in the same block of a class is accounted for by mixing two extreme distributions: the independence distribution and the maximum dependency distribution. When the variables are dependent given the class, this approach is expected to reduce the biases of the latent class model, since it yields a meaningful dependency model with only a few additional parameters. The parameters are estimated by maximum likelihood via an EM algorithm. Moreover, a Gibbs sampler is used for model selection in order to overcome the computational intractability of the combinatorial problem raised by the search for the block structure. Two applications to medical and biological data sets show the relevance of this new model. The results support the view that the model is meaningful and that it reduces the biases induced by the conditional independence assumption of the latent class model.
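
The idea of mixing the two extreme distributions within a block can be illustrated with a minimal sketch. It assumes a single class and a block of two categorical variables, and it assumes that under maximum dependency the level of the first variable fully determines the level of the second. The function name `block_pmf` and the parameter names `alpha`, `beta`, `tau`, `sigma` and `rho` are hypothetical and are not the notation of the article, whose exact parametrization may differ.

```python
from itertools import product


def block_pmf(alpha, beta, tau, sigma, rho):
    """Joint pmf of two categorical variables in one block, for one class.

    alpha, beta: level probabilities of each variable under independence.
    tau:   level probabilities of the first variable under maximum dependency.
    sigma: sigma[h] is the level of the second variable forced by level h
           of the first one (assumed form of maximum dependency).
    rho:   weight of the maximum dependency component, between 0 and 1.
    """
    pmf = {}
    for h, j in product(range(len(alpha)), range(len(beta))):
        independent = alpha[h] * beta[j]
        # Under maximum dependency, only the pairs (h, sigma[h]) carry mass.
        max_dependent = tau[h] if sigma[h] == j else 0.0
        pmf[(h, j)] = (1.0 - rho) * independent + rho * max_dependent
    return pmf


# Hypothetical parameters for a block of two binary variables.
pmf = block_pmf(
    alpha=[0.6, 0.4],
    beta=[0.5, 0.5],
    tau=[0.7, 0.3],
    sigma=[0, 1],
    rho=0.25,
)
total = sum(pmf.values())  # the mixture is itself a probability distribution
```

With `rho = 0` the block reduces to the conditional independence of the classical latent class model, and with `rho = 1` the second variable becomes a deterministic function of the first. In this sketch a single extra weight per block moves between the two extremes, which is one way to read the claim that only a few additional parameters are needed.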

Author information

Authors and Affiliations

Inria Lille and DGA, 40 avenue Halley, 59000, Lille, France
Matthieu Marbac
University Lille 1, CNRS and Inria, Cité Scientifique, 59650, Villeneuve-d’Ascq, France
Christophe Biernacki
University Lille 2 and Inria, EA 2694, 40 avenue Halley, 59000, Lille, France
Vincent Vandewalle

Corresponding author

Correspondence to Matthieu Marbac.

Additional information

The authors thank the Editor and the three anonymous referees for their useful comments and references. The authors are grateful to the Genes Diffusion corporation for the provision of the data set and especially its members: Amélie Vallée, Julie Hamon and Claude Grenier. The authors are also grateful to Parmeet Bhatia, Modal Team engineer, and Stéphane Chrétien for their valuable assistance. This work was financed by DGA and Inria.

About this article

Cite this article

Marbac, M., Biernacki, C. & Vandewalle, V. Model-Based Clustering for Conditionally Correlated Categorical Data. J Classif 32, 145–175 (2015). https://doi.org/10.1007/s00357-015-9180-4

Issue Date: July 2015
DOI: https://doi.org/10.1007/s00357-015-9180-4
