A k-Median Algorithm with Running Time Independent of Data Size

Meyerson, Adam; O'Callaghan, Liadan; Plotkin, Serge

doi:10.1023/B:MACH.0000033115.78247.f0

A k-Median Algorithm with Running Time Independent of Data Size

Published: July 2004

Volume 56, pages 61–87, (2004)
Cite this article

Download PDF

Machine Learning Aims and scope Submit manuscript

A k-Median Algorithm with Running Time Independent of Data Size

Download PDF

Adam Meyerson,
Liadan O'Callaghan &
Serge Plotkin

925 Accesses
33 Citations
Explore all metrics

Abstract

We give a sampling-based algorithm for the k-Median problem, with running time O(k \((\frac{{k^2 }}{ \in } \log k)^2 \) log\((\frac{k}{ \in } \log k)\)), where k is the desired number of clusters and ∈ is a confidence parameter. This is the first k-Median algorithm with fully polynomial running time that is independent of n, the size of the data set. It gives a solution that is, with high probability, an O(1)-approximation, if each cluster in some optimal solution has Ω\((\frac{{n \in }}{k})\) points. We also give weakly-polynomial-time algorithms for this problem and a relaxed version of k-Median in which a small fraction of outliers can be excluded. We give near-matching lower bounds showing that this assumption about cluster size is necessary. We also present a related algorithm for finding a clustering that excludes a small number of outliers.

Artificial Intelligence

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

References

Alon, N., Dar, S., Parnas, M., & Ron, D. (2000). Testing of clustering. In Proc. FOCS.
Arora, S., Raghavan, P., & Rao, S. (1998). Approximation schemes for euclidean k-medians and related problems. In Proc. STOC.
Arya, V., Garg, N., Khandekar, R., Meyerson, A., Munagala, K., & Pandit, V. (2001). Local search heuristics for k-median and facility location problems. In Proc. STOC.
Bansal, N., Blum, A., & Chawla, S. (2002). Correlation clustering. In Proc. FOCS.
Bradley, P. S., Fayyad, U. M., & Reina, C. (1998). Scaling clustering algorithms to large databases. In Proc. KDD.
Charikar, M., Chekuri, C., Feder, T., & Motwani, R. (1997). Incremental clustering and dynamic information retrieval. In Proc. STOC.
Charikar, M., & Guha, S. (1999). Improved combinatorial algorithms for the facility location and k-median problems. In Proc. FOCS.
Charikar, M., Guha, S., Tardos, É., & Shmoys, D. (2002). A constant-factor approximation algorithm for the k-median problem. Journal of Computer and System Sciences, 65:1, 129–149.
Google Scholar
Charikar, M., Khuller, S., Mount, D., & Narasimhan, G. (2001). Algorithms for facility location problems with outliers. In Proc. SODA.
Charikar, M., & Panigrahy, R. (2001). Clustering to minimize the sum of cluster diameters. In Proc. 33rd STOC.
Dasgupta, S. (1999). Learning mixtures of gaussians. In Proc. FOCS.
Dempster, A. P., Laird, N. M., & Rubin, D. B. (1984). Maximum-likelihood from incomplete data via the EM algorithm. J. Royal Stat. Soc. Ser. B, 39.
Duda, R. O., & Hart, P. E. (1972). Pattern classification and scene analysis. Wiley.
Feder, T., & Greene, D. (1988). Optimal algorithms for approximate clustering. In Proc. STOC.
Gonzalez, T. F. (1985). Clustering to minimize the maximum intercluster distance. Theoretical Computer Science, 38, 2–3.
Google Scholar
Guha, S. (2000). Approximation algorithms for facility location problems. Ph.D. Thesis, Stanford University.
Guha, S., Mishra, N., Motwani, R., & O'Callaghan, L. (2000). Clustering data streams. In Proc. FOCS.
Guha, S., Rastogi, R., & Shim, K. (1998). CURE: An efficient clustering algorithm for large databases. In Proc. SIGMOD.
Hartigan, J. A. (1975). Clustering algorithms. John Wiley and Sons.
Hochbaum, D. S., & Shmoys, D. B. (1985). A best possible approximation algorithm for the k-center problem. Mathematics of Operations Research, 10, 180–184.
Google Scholar
Indyk, P. (1999). Sublinear time algorithms for metric space problems. In Proc. STOC.
Jain, K., Mahdian, M., Markakis, E., Saberi, A., & Vazirani, V. (to appear). Greedy facility location algorithms analyzed using dual-fitting with factor-revealing LP. Journal of the ACM.
Jain, K., & Vazirani, V. (1999). Primal-dual approximation algorithms for metric facility location and k-median problems. In Proc. FOCS.
Joze-Hkajavi, N., & Salem, K. (1998). Two-phase clustering of large datasets. Tech. rpt. CS-98-27, Univ. of Waterloo.
Lin, J.-H., & Vitter, J. S. (1992). ε-Approximations with minimum packing constraint violation. In Proc. STOC.
Marroquin, J., & Girosi, F. (1993). Some extensions of the k-means algorithm for image segmentation and pattern recognition. AI Memo 1390, MIT AI Lab..
Mettu, R., & Plaxton, C. G. (2002). Optimal time bounds for approximate clustering. In Proc. 18th UAI.
Meyerson, A. (2001). Online facility location. In Proc. FOCS.
Mishra, N., Oblinger, D., & Pitt, L. (2001). Sublinear time approximate clustering. In Proc. SODA.
Palmer, C. R., & Faloutsos, C. (2000). Density biased sampling: An improved method for data mining and clustering. In Proc. SIGMOD.
Pelleg, D., & Moore, A. W. (1999). Accelerating exact k-means algorithms with geometric reasoning. In Proc. KDD.
Pitt, L., & Reinke, R. E. (1988). Criteria for polynomial-time (conceptual) clustering. Machine Learning, 2, 4.
Google Scholar
Redner, R. A., & Walker, H. F. (1984). Mixture densities, maximum likelihood and the EM algorithm. SIAM Review, 26, 2.
Google Scholar
Steinbach, M., Karypis, G., & Kumar, V. (2000). A comparison of document clustering techniques. Tech rept. No. 00-034, Univ. of Minn.
Thorup, M. (2001). Quick k-median, k-center, and facility location for sparse graphs. In Proc. ICALP.
Young, N. E. (2000). k-medians, facility location, and the Chernoff-Wald bound. In Proc. SODA (pp. 86–95).
Zhang, T., Ramakrishnan, R., & Livny, M. (1996). BIRCH: An efficient data clustering method for very large databases. In Proc. SIGMOD (pp. 103–114).

Download references

Authors

Adam Meyerson
View author publications
You can also search for this author in PubMed Google Scholar
Liadan O'Callaghan
View author publications
You can also search for this author in PubMed Google Scholar
Serge Plotkin
View author publications
You can also search for this author in PubMed Google Scholar

Rights and permissions

Reprints and permissions

About this article

Cite this article

Meyerson, A., O'Callaghan, L. & Plotkin, S. A k-Median Algorithm with Running Time Independent of Data Size. Machine Learning 56, 61–87 (2004). https://doi.org/10.1023/B:MACH.0000033115.78247.f0

Download citation

Issue Date: July 2004
DOI: https://doi.org/10.1023/B:MACH.0000033115.78247.f0

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

A k-Median Algorithm with Running Time Independent of Data Size

Abstract

Article PDF

Similar content being viewed by others

k-Median/Means with Outliers Revisited: A Simple Fpt Approximation

Faster Algorithms for the Constrained k-means Problem

The p-Median Problem

References

Rights and permissions

About this article

Cite this article

Navigation

A k-Median Algorithm with Running Time Independent of Data Size

Abstract

Article PDF

Similar content being viewed by others

k-Median/Means with Outliers Revisited: A Simple Fpt Approximation

Faster Algorithms for the Constrained k-means Problem

The p-Median Problem

Explore related subjects

References

Rights and permissions

About this article

Cite this article

Share this article

Search

Navigation