Abstract
We propose a method for selecting variables in latent class analysis, which is the most common model-based clustering method for discrete data. The method assesses a variable’s usefulness for clustering by comparing two models, given the clustering variables already selected. In one model the variable contributes information about cluster allocation beyond that contained in the already selected variables, and in the other model it does not. A headlong search algorithm is used to explore the model space and select clustering variables. In simulated datasets we found that the method selected the correct clustering variables, and also led to improvements in classification performance and in accuracy of the choice of the number of classes. In two real datasets, our method discovered the same group structure with fewer variables. In a dataset from the International HapMap Project consisting of 639 single nucleotide polymorphisms (SNPs) from 210 members of different groups, our method discovered the same group structure with a much smaller number of SNPs.
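The search strategy described above can be illustrated with a minimal sketch. The abstract does not give the comparison criterion or the stopping rules, so everything specific here is an assumption: `evidence` is a hypothetical callable that returns the evidence in favour of the model in which a variable adds clustering information given the variables already selected, `threshold` is an assumed cut-off, and the toy scores are invented purely for illustration. No latent class model is fitted.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical signature: evidence(variable, already_selected) -> float.
# Positive values favour the model in which the variable carries
# clustering information beyond the already selected variables.
Evidence = Callable[[str, Tuple[str, ...]], float]


def headlong_search(variables: Sequence[str],
                    evidence: Evidence,
                    threshold: float = 0.0,
                    max_iter: int = 100) -> List[str]:
    """Greedy 'headlong' selection: act on the first variable that
    qualifies instead of scanning for the best one (assumed rules)."""
    selected: List[str] = []
    candidates: List[str] = list(variables)
    for _ in range(max_iter):
        changed = False
        # Inclusion step: add the first candidate with enough evidence.
        for v in candidates:
            if evidence(v, tuple(selected)) > threshold:
                selected.append(v)
                candidates.remove(v)
                changed = True
                break
        # Removal step: drop the first selected variable that no longer
        # adds information given the other selected variables.
        for v in list(selected):
            others = tuple(u for u in selected if u != v)
            if evidence(v, others) <= threshold:
                selected.remove(v)
                candidates.append(v)
                changed = True
                break
        if not changed:
            break
    return selected


# Invented toy scores: "d" is informative on its own but redundant
# once "a" has been selected; "c" is never informative.
TOY_SCORES: Dict[str, float] = {"a": 5.0, "b": 3.0, "c": -2.0, "d": 4.0}


def toy_evidence(variable: str, selected: Tuple[str, ...]) -> float:
    if variable == "d" and "a" in selected:
        return -1.0
    return TOY_SCORES[variable]


result = headlong_search(["d", "a", "b", "c"], toy_evidence)
```

In this toy run, `d` is picked up first, then dropped once `a` enters, which is the behaviour the two-model comparison is meant to capture. In a real analysis, `evidence` would be replaced by a comparison of two fitted latent class models.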
Cite this article
Dean, N., Raftery, A.E. Latent class analysis variable selection. Ann Inst Stat Math 62, 11–35 (2010). https://doi.org/10.1007/s10463-009-0258-9
DOI: https://doi.org/10.1007/s10463-009-0258-9