Abstract
We establish an affine equivariant, constrained heteroscedastic model and criterion with trimming for clustering contaminated, grouped data. We show existence of the maximum likelihood estimator, propose a method for determining an appropriate constraint, and design a strategy for finding reasonable clusterings. Finally, we compute breakdown points of the estimated parameters, thereby showing asymptotic robustness of the method.
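The abstract does not spell out the constraint it mentions. As a rough illustration only, the sketch below assumes it is the HDBT ratio listed in the keywords, and assumes that ratio is the largest constant c for which V_j - c V_l is positive semidefinite for every pair of cluster scatter matrices V_j, V_l. Both the definition and the function name `hdbt_ratio` are assumptions made for this example, not taken from the article.

```python
import numpy as np


def hdbt_ratio(scatter_matrices):
    """Hypothetical helper: largest c with V_j - c * V_l positive
    semidefinite for all pairs (j, l) of the given matrices.

    Assumes every matrix is symmetric positive definite.
    """
    ratio = 1.0  # the pair j == l always allows c = 1
    for j, v_j in enumerate(scatter_matrices):
        for l, v_l in enumerate(scatter_matrices):
            if j == l:
                continue
            # Cholesky factor of V_l, so that V_l = L @ L.T
            chol = np.linalg.cholesky(v_l)
            # Whiten V_j by V_l: L^{-1} V_j L^{-T}
            tmp = np.linalg.solve(chol, v_j)
            whitened = np.linalg.solve(chol, tmp.T).T
            # The smallest eigenvalue is the largest admissible c for this pair
            ratio = min(ratio, float(np.linalg.eigvalsh(whitened).min()))
    return ratio


if __name__ == "__main__":
    equal = [np.eye(2), np.eye(2)]
    unequal = [np.eye(2), 4.0 * np.eye(2)]
    print(hdbt_ratio(equal))    # equal scatter matrices
    print(hdbt_ratio(unequal))  # scales differ by a factor of 4
```

Under this assumed definition, equal scatter matrices give a ratio of 1 (the homoscedastic case), and the ratio shrinks toward 0 as the cluster scales drift apart. The ratio is unchanged when all data are transformed by the same invertible affine map, which is in line with the affine equivariance claimed in the abstract.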
Gallegos, M.T., Ritter, G. Trimming algorithms for clustering contaminated grouped data and their robustness. Adv Data Anal Classif 3, 135–167 (2009). https://doi.org/10.1007/s11634-009-0044-9
DOI: https://doi.org/10.1007/s11634-009-0044-9
Keywords
- Statistical clustering
- Robust clustering
- Trimming algorithm
- Breakdown points
- Heteroscedasticity
- HDBT ratio