Accelerating BIRCH for Clustering Large Scale Streaming Data Using CUDA Dynamic Parallelism

Dong, Jianqiang; Wang, Fei; Yuan, Bo

doi:10.1007/978-3-642-41278-3_50


Part of the book series: Lecture Notes in Computer Science (LNISA, volume 8206)

Included in the following conference series:

International Conference on Intelligent Data Engineering and Automated Learning

Abstract

In the big data era, the ability to mine and analyze large-scale datasets is imperative. As data become more abundant than ever, data-driven methods are playing a critical role in areas such as decision support and business intelligence. In this paper, we demonstrate how state-of-the-art GPUs and the Dynamic Parallelism feature of the latest CUDA platform can bring significant benefits to BIRCH, one of the best-known clustering techniques for streaming data. Experimental results show that, on a number of benchmark problems, the GPU-accelerated BIRCH can be made up to 154 times faster than the CPU version, with good scalability and high accuracy. Our work suggests that massively parallel GPU computing is a promising and effective solution to the challenges of big data.

Author information

Authors and Affiliations

Intelligent Computing Lab, Division of Informatics, Graduate School at Shenzhen, Tsinghua University, Shenzhen, 518055, P.R. China
Jianqiang Dong, Fei Wang & Bo Yuan

Editor information

Editors and Affiliations

School of Electrical and Electronic Engineering, University of Manchester, UK
Hujun Yin
University of Science and Technology of China, Hefei, China
Ke Tang
Nanjing University, Nanjing, China
Yang Gao
Ostfalia University of Applied Sciences, 38302, Wolfenbüttel, Germany
Frank Klawonn
Kyungpook National University, 702-701, Buk-Gu, Daegu, Korea
Minho Lee
Nature Inspired Computational and Applications Laboratory, School of Computer Science and Technology, University of Science and Technology of China, 230027, Hefei, China
Thomas Weise
University of Science and Technology of China, 230017, Hefei, China
Bin Li
CERCIA, School of Computer Science, University of Birmingham, B15 2TT, Edgbaston, Birmingham, UK
Xin Yao

About this paper

Cite this paper

Dong, J., Wang, F., Yuan, B. (2013). Accelerating BIRCH for Clustering Large Scale Streaming Data Using CUDA Dynamic Parallelism. In: Yin, H., et al. Intelligent Data Engineering and Automated Learning – IDEAL 2013. IDEAL 2013. Lecture Notes in Computer Science, vol 8206. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-41278-3_50

DOI: https://doi.org/10.1007/978-3-642-41278-3_50
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-41277-6
Online ISBN: 978-3-642-41278-3
eBook Packages: Computer Science, Computer Science (R0)
