Abstract
This paper addresses practical issues in k-means cluster analysis or segmentation with mixed types of variables and missing values. A more general k-means clustering procedure is developed that is suitable for use with very large datasets, such as arise in data mining and survey analysis. An exact assignment test guarantees that the algorithm will converge, and the detection of outliers allows the densest regions of the sample space to be mapped by tessellations of tightly-specified spherical clusters. A summary tree is obtained for the resulting k-cluster partition.
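Two of the ideas named above, tolerating missing values and setting outliers aside, can be illustrated with a minimal sketch. This is not the paper's procedure: the abstract does not specify the exact assignment test, the treatment of mixed variable types, or the summary tree, so none of these is reproduced here. The sketch uses plain batch reassignment and rests on three assumptions of its own: missing values are coded as NaN, every variable is already numeric (a categorical variable would need to be coded first, for example as indicator columns), and a case counts as an outlier when its distance to the nearest centre exceeds a user-supplied `threshold`. The names `masked_sq_dist`, `observed_mean`, `kmeans_sketch` and `threshold` are hypothetical.

```python
import numpy as np

def masked_sq_dist(X, centres):
    """Mean squared difference over the variables observed in each row.

    Averaging over the observed variables (an assumption of this sketch)
    keeps distances comparable between rows with different numbers of
    missing values.
    """
    observed = ~np.isnan(X)                          # shape (n, p)
    filled = np.where(observed, X, 0.0)
    diff = filled[:, None, :] - centres[None, :, :]  # shape (n, k, p)
    diff = diff * observed[:, None, :]               # ignore missing variables
    n_obs = np.maximum(observed.sum(axis=1), 1)      # avoid division by zero
    return (diff ** 2).sum(axis=2) / n_obs[:, None]

def observed_mean(rows, fallback):
    """Column means over observed values; use `fallback` where none exist."""
    observed = ~np.isnan(rows)
    counts = observed.sum(axis=0)
    sums = np.where(observed, rows, 0.0).sum(axis=0)
    return np.where(counts > 0, sums / np.maximum(counts, 1), fallback)

def kmeans_sketch(X, centres, threshold=np.inf, max_iter=100):
    """Batch k-means with NaN-aware distances and an outlier threshold.

    Returns (labels, centres); a label of -1 marks a case left unassigned
    because its distance to the nearest centre exceeds `threshold`.
    """
    X = np.asarray(X, dtype=float)
    centres = np.array(centres, dtype=float)         # work on a copy
    labels = np.full(len(X), -2)                     # sentinel: no pass yet
    for _ in range(max_iter):
        d = masked_sq_dist(X, centres)
        new_labels = d.argmin(axis=1)
        new_labels[d.min(axis=1) > threshold] = -1   # set outliers aside
        if np.array_equal(new_labels, labels):
            break                                    # assignments are stable
        labels = new_labels
        for j in range(len(centres)):
            members = X[labels == j]
            if len(members):
                centres[j] = observed_mean(members, centres[j])
    return labels, centres

# Two small groups with missing entries, plus one remote case.
X = np.array([[0.0, 0.0], [0.1, np.nan], [0.0, 0.2],
              [5.0, 5.0], [5.1, 5.0], [np.nan, 4.9],
              [100.0, 100.0]])
labels, centres = kmeans_sketch(X, centres=[[0.0, 0.0], [5.0, 5.0]],
                                threshold=1.0)
```

Leaving remote cases unassigned keeps them from pulling the centres away from the dense regions, which is the effect the abstract attributes to outlier detection. The threshold rule and the batch update used here are illustrative choices only, and the loop is capped by `max_iter` because this sketch makes no convergence claim of its own.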
Copyright information
© 2003 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Wishart, D. (2003). k-Means Clustering with Outlier Detection, Mixed Variables and Missing Values. In: Schwaiger, M., Opitz, O. (eds) Exploratory Data Analysis in Empirical Research. Studies in Classification, Data Analysis, and Knowledge Organization. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-55721-7_23
DOI: https://doi.org/10.1007/978-3-642-55721-7_23
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-44183-0
Online ISBN: 978-3-642-55721-7
eBook Packages: Springer Book Archive