A Fast Parallel Clustering Algorithm for Large Spatial Databases

Xu, Xiaowei; Jäger, Jochen; Kriegel, Hans-Peter

doi:10.1007/0-306-47011-X_3

Xiaowei Xu³,
Jochen Jäger⁴ &
Hans-Peter Kriegel⁴

401 Accesses
33 Citations

Abstract

The clustering algorithm DBSCAN relies on a density-based notion of clusters and is designed to discover clusters of arbitrary shape as well as to distinguish noise. In this paper, we present PDBSCAN, aparallel version ofthis algorithm. We use the ‘shared-nothing’ architecture withmultiple computers interconnected through a network. A fundamental component of a shared-nothing system is its distributed data structure. We introduce the dR*-tree, a distributed spatial index structure in which the data is spread among multiple computers and the indexes of the data are replicated on every computer. We implemented our method using a number of workstations connected via Ethernet (10 Mbit). A performance evaluation shows that PDBSCAN offers nearly linear speedup and has excellent scaleup and sizeup behavior.

This work was performed while the author was still working at the Institute for Computer Science, University of Munich.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Subscribe and save

Springer+ Basic

$34.99 /Month

Get 10 units per month
Download Article/Chapter or eBook
1 Unit = 1 Article or 1 Chapter
Cancel anytime

Buy Now

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 84.99; Price excludes VAT (USA)

Softcover Book: USD 109.99; Price excludes VAT (USA)

Hardcover Book: USD 109.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Preview

Unable to display preview. Download preview PDF.

References

Agrawal, R. and Shafer, J.C. 1996. Parallel mining of association rules: design, implementation, and experience. IBM Research Report.
Google Scholar
Beckmann, N., Kriegel, H.-P., Schneider, R., and Seeger, B. 1990. The R*-tree: an efficient and robust access method for points and rectangles. Proc. ACM SIGMOD Int. Conf. on Management of Data. Atlantic City, NJ, pp. 322–331.
Google Scholar
Berchtold S., Keim D.A., and Kriegel, H.-P. 1996. The X-tree: an index structure for high-dimensional data. Proc. 22nd Int. Conf. on Very Large Data Bases, Bombay, India, Morgan Kaufmann, pp. 28–39.
Google Scholar
Bially, T. 1969. Space-filling curves: their generation and their application to bandwidth reduction. IEEE Trans. on Information Theory, IT-15(6):658–664.
Google Scholar
Cheung, D.W., Han, J., Ng, V.T., Fu, A.W., and Fu, Y. 1996. A fast distributed algorithm for mining association rules. Proc. Int. Conf. on Parallel and Distributed Information System (PDIS’96). Miami Beach, FL, USA.
Google Scholar
Ester, M., Kriegel, H.-P., Sander, J., and Xu, X. 1996. A density-based algorithm for discovering clusters in large spatial databases with noise. Proc. 2nd Int. Conf. on Knowledge Discovery and Data Mining. Portland, OR, pp. 226–231.
Google Scholar
Ester, M., Kriegel, H.-P., and Xu, X. 1995. A database interface for clustering in large spatial databases. Proc. 1st Int. Conf. on Knowledge Discovery and Data Mining. Montreal, Canada, 1995, pp. 94–99.
Google Scholar
Faloutsos, C. and Roseman, S. 1989. Fractals for secondary key retrieval. Proc. 8th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS), pp. 247–252.
Google Scholar
Fayyad, U., Piatetsky-Shapiro, G., and Smyth, P. 1996. Knowledge discovery anddatamining: towards a unifying framework. Proc. 2nd Int. Conf. on Knowledge Discovery and Data Mining. Portland, OR, pp. 82–88.
Google Scholar
Geist, A., Beguelin, A., Dongama, J., Jiang, W., Manchek, R., and Sunderam, V. 1996. PVM: Parallel Virtual Machine — A User’s Guide and Tutorial for Networked Parallel Computing. Cambridge, MA, London, England: The MIT Press, 3rd printing.
Google Scholar
Gueting, R.H. 1994. An introduction to spatial database systems. The VLDB Journal, 3(4):357–399.
Google Scholar
Jaja, J. 1992. An Introduction to Parallel Algorithms. Addison-Wesley Publishing Company, pp. 61–65.
Google Scholar
Kamel, I. and Faloutsos, C. 1993. On packing R-trees. Proc. 2nd Int. Conf. on Information and Knowledge Management (CIKM).
Google Scholar
Li, X. and Fang, Z. 1989. Parallel clustering algorithms. Parallel Computing, 11:275–290.
MathSciNet Google Scholar
Matheus, C.J., Chan, P.K., and Piatetsky-Shapiro, G. 1993. Systems for knowledge discovery in databases. IEEE Transactions on Knowledge and Data Engineering, 5(6):903–913.
Article Google Scholar
Mehta, M. and DeWitt, D.J. 1997. Data placement in shared-nothing parallel database systems. VLDB Journal, 6:53–72.
Google Scholar
Olson, C.F. 1995. Parallel algorithms for hierarchical clustering. Parallel Computing, 21(8):1313–1325.
Article MATH MathSciNet Google Scholar
Park, J.-S., Chen, M.-S., and Yu, P.S. 1995. An effective hash based algorithm formining association rules. Proc. ACM SIGMOD Int. Conf. on Management of Data. San Jose, CA, pp. 175–186.
Google Scholar
Pfitzner, D.W., Salmon, J.K., and Sterling, T. 1998. Halo World: Tools forParallel Cluster Finding in Astrophysical N-body Simulations. Data Mining and Knowledge Discovery. Kluwer Academic Publishers, Vol. 2,No. 2, pp. 419–438.
Google Scholar
Rasmussen, E.M. and Willett, P. 1989. Efficiency ofhierarchical agglomerative clustering using the ICL distributed array processor. Journal of Documentation, 45(1): 1–24.
Google Scholar
Richards, A.J. 1983. Remote Sensing Digital Image Analysis. An Introduction, Berlin: Springer.
Google Scholar
Sander, J., Ester, M., Kriegel, H.-P., and Xu, X. 1998. Density-Based Clustering in Spatial Databases: The Algorithm GDBSCAN and Its Applications. Data Mining and Knowledge Discovery, Kluwer Academic Publishers, vol. 2, pp. 1–27.
Google Scholar
Stonebraker, M. 1986. The case for shared nothing. Database Eng., 9(1).
Google Scholar
Stonebraker, M., Frew, J., Gardels, K., and Meredith, J. 1993. The SEQUOIA 2000 storage benchmark. Proc. ACM SIGMOD Int. Conf. on Management of Data. Washington, DC, pp. 2–11.
Google Scholar
Xu, X. 1999. Efficient Clustering for Knowledge Discovery in Spatial Databases. Shaker, Germany: Aachen.
Google Scholar
Xu, X., Ester, M., Kriegel, H.-P., and Sander, J. 1998. A distribution-based clustering algorithm for mining in large spatial databases. 14th Int. Conf. on Data Engineering (ICDE’98). Orlando, FL.
Google Scholar
Zhang, T., Ramakrishnan, R., and Livny, M. 1998. BIRCH: A New Data Clustering Algorithmand Its Applications, Kluwer Academic Publishers, pp. 1–40
Google Scholar

Download references

Author information

Authors and Affiliations

Corporate Technology, Siemens AG, Otto-Hahn-Ring 6, D-81730, Munchen, Germany
Xiaowei Xu
Institute for Computer Science, University of Munich, Oettingenstr. 67, D-80538, Munchen, Germany
Jochen Jäger & Hans-Peter Kriegel

Authors

Xiaowei Xu
View author publications
You can also search for this author in PubMed Google Scholar
Jochen Jäger
View author publications
You can also search for this author in PubMed Google Scholar
Hans-Peter Kriegel
View author publications
You can also search for this author in PubMed Google Scholar

Editor information

Editors and Affiliations

Imperial College, UK
Yike Guo
University of Illinois at Chicago, USA
Robert Grossman

Rights and permissions

Reprints and permissions

Copyright information

About this chapter

Cite this chapter

Xu, X., Jäger, J., Kriegel, HP. (1999). A Fast Parallel Clustering Algorithm for Large Spatial Databases. In: Guo, Y., Grossman, R. (eds) High Performance Data Mining. Springer, Boston, MA. https://doi.org/10.1007/0-306-47011-X_3

Download citation

DOI: https://doi.org/10.1007/0-306-47011-X_3
Publisher Name: Springer, Boston, MA
Print ISBN: 978-0-7923-7745-0
Online ISBN: 978-0-306-47011-0
eBook Packages: Springer Book Archive

Publish with us

Policies and ethics