Abstract
We consider a framework of sample-based clustering. In this setting, the input to a clustering algorithm is a sample generated i.i.d. by some unknown, arbitrary distribution. Based on such a sample, the algorithm has to output a clustering of the full domain set, which is evaluated with respect to the underlying distribution. We provide general conditions on clustering problems that imply the existence of sampling-based clustering algorithms that approximate the optimal clustering. We show that K-median clustering, as well as the K-means and Vector Quantization problems, satisfy these conditions. Our results apply to the combinatorial optimization setting where, assuming that sampling uniformly over an input set can be done in constant time, we get a sampling-based algorithm for the K-median and K-means clustering problems that finds an almost optimal set of centers in time depending only on the confidence and accuracy parameters of the approximation, but independent of the input size. Furthermore, in the Euclidean input case, the dependence of the running time of our algorithm on the Euclidean dimension is only linear. Our main technical tool is a uniform convergence result for center-based clustering that can be viewed as showing that the effective VC-dimension of k-center clustering equals k.
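The sampling idea in the abstract can be illustrated with a minimal sketch. This is not the paper's algorithm: the function names, the sample size `m`, the restriction of candidate centers to sampled points, and the exhaustive search over k-subsets are all assumptions made for illustration, and `m` is chosen by hand rather than derived from any accuracy or confidence bound.

```python
import itertools
import math
import random


def k_median_cost(points, centers):
    """Average Euclidean distance from each point to its nearest center."""
    return sum(min(math.dist(p, c) for c in centers) for p in points) / len(points)


def sample_based_k_median(data, k, m, rng):
    """Pick k centers using only m points drawn uniformly at random from data.

    Assumption (not from the source): candidate centers are restricted to the
    sampled points, and the best k-subset is found by exhaustive search.
    """
    # Draw m points i.i.d. (with replacement); each draw takes constant time.
    sample = [rng.choice(data) for _ in range(m)]
    candidates = sorted(set(sample))
    if len(candidates) < k:
        raise ValueError("sample contains fewer than k distinct points")
    # Minimize the empirical cost on the sample; the full data is never scanned.
    return min(
        itertools.combinations(candidates, k),
        key=lambda centers: k_median_cost(sample, centers),
    )


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical input: two well-separated Gaussian blobs in the plane.
    data = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(5000)]
    data += [(rng.gauss(50, 1), rng.gauss(0, 1)) for _ in range(5000)]
    centers = sample_based_k_median(data, k=2, m=40, rng=rng)
    # Scanning the full data here is only for evaluation, not part of the algorithm.
    print(centers, k_median_cost(data, centers))
```

In this sketch the work done by `sample_based_k_median` depends on `m` and `k` but not on `len(data)`, which mirrors the input-size independence described above. The exhaustive search is exponential in `k` and is used only to keep the example short.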
References
Anthony, M., & Bartlett, P. L. (1999). Neural network learning: Theoretical foundations. Cambridge University Press.
Bartlett, P., Linder, T., & Lugosi, G. (1998). The minimax distortion redundancy in empirical quantizer design. IEEE Transactions on Information Theory, 44, 1802–1813.
Ben-David, S. (2004). A framework for statistical clustering with a constant time approximation algorithm for K-median clustering. In Proceedings of the 17th Annual Conference on Learning Theory, COLT’04, Springer.
Buhmann, J. (1998). Empirical risk approximation: An induction principle for unsupervised learning. Technical Report IAI-TR-98-3, Institut für Informatik III, Universität Bonn.
Czumaj, A., & Sohler, C. (2004). Sublinear-time approximation for clustering via random samples. In Proceedings of the 31st International Colloquium on Automata, Languages and Programming (ICALP’04), LNCS 3142:396–407.
Mettu, R. R., & Plaxton, C. G. (2004). Optimal time bounds for approximate clustering. Machine Learning, 56, 35–60.
Meyerson, A., O’Callaghan, L., & Plotkin, S. (2004). A k-median algorithm with running time independent of data size. Journal of Machine Learning, Special Issue on Theoretical Advances in Data Clustering (MLJ).
Mishra, N., Oblinger, D., & Pitt, L. (2001). Sublinear time approximate clustering. In Proceedings of the Symposium on Discrete Algorithms, SODA (pp. 439–447).
Pollard, D. (1982). Quantization and the method of k-means. IEEE Transactions on Information Theory, 28, 199–205.
Smola, A. J., Mika, S., & Schölkopf, B. (1998). Quantization functionals and regularized principal manifolds. NeuroCOLT Technical Report Series NC2-TR-1998-028.
de la Vega, F., Karpinski, M., Kenyon, C., & Rabani, Y. (2003). Approximation schemes for clustering problems. In Proceedings of the Symposium on Theory of Computing, STOC’03.
Additional information
Editors: Olivier Bousquet and Andre Elisseeff
A preliminary version of this work appeared in the proceedings of COLT’04 (Ben-David, 2004).
This work is supported in part by the Multidisciplinary University Research Initiative (MURI) under the Office of Naval Research Contract N00014-00-1-0564.
About this article
Cite this article
Ben-David, S. A framework for statistical clustering with constant time approximation algorithms for K-median and K-means clustering. Mach Learn 66, 243–257 (2007). https://doi.org/10.1007/s10994-006-0587-3
DOI: https://doi.org/10.1007/s10994-006-0587-3