Efficient Results Merging for Parallel Data Clustering Using MapReduce

Bousbaci, Abdelhak; Kamel, Nadjet

doi:10.1007/978-3-319-40162-1_38

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 474)

Abstract

Data clustering partitions data into subgroups using a distance measure. Clustering a large volume of data requires considerable execution time, and several works have proposed parallelism to overcome this problem. One parallel technique partitions the data and processes each partition separately; the results obtained from each partition are then merged to produce the final cluster configuration. An inappropriate merging technique leads to inaccurate final centroids and mediocre clustering quality. In this paper, we propose two merging techniques to improve the clustering quality.

In the first solution, the results are merged using the K-means algorithm; in the second, using a genetic algorithm. The results demonstrate the efficiency of the proposed strategies.
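
The partition-then-merge idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it is a single-process stand-in for the MapReduce setting, written under the assumption that "merging with K-means" means running K-means again over the pooled centroids of all partitions. The function names `kmeans` and `merge_with_kmeans`, the synthetic data, and all parameter values are hypothetical choices made for illustration.

```python
import random
from math import dist  # Python 3.8+


def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's K-means on a list of equal-length tuples; returns k centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initial centroids drawn from the data
    for _ in range(iters):
        # Assignment step: group each point under its nearest centroid.
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, centroids[i]))
            groups[nearest].append(p)
        # Update step: move each centroid to the mean of its group
        # (an empty group keeps its previous centroid).
        updated = [
            tuple(sum(axis) / len(g) for axis in zip(*g)) if g else centroids[i]
            for i, g in enumerate(groups)
        ]
        if updated == centroids:  # assignments stopped changing
            break
        centroids = updated
    return centroids


def merge_with_kmeans(partitions, k, seed=0):
    """Cluster each partition on its own, then cluster the pooled local centroids."""
    local_centroids = []
    for idx, part in enumerate(partitions):  # "map" phase, one task per partition
        local_centroids.extend(kmeans(part, k, seed=seed + idx))
    # "reduce" phase: K-means over the k * len(partitions) local centroids.
    return kmeans(local_centroids, k, seed=seed)


# Hypothetical data: three partitions, each holding points from two
# well-separated groups centred near (0, 0) and (10, 10).
rng = random.Random(1)
partitions = [
    [(rng.gauss(0, 0.5), rng.gauss(0, 0.5)) for _ in range(50)]
    + [(rng.gauss(10, 0.5), rng.gauss(10, 0.5)) for _ in range(50)]
    for _ in range(3)
]
merged = merge_with_kmeans(partitions, k=2)
print(sorted(merged))
```

In this sketch the merge step treats every local centroid as an equally weighted point. A variant that weights each centroid by the size of its cluster would be a reasonable alternative, but the abstract does not say which choice the paper makes.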

Author information

Authors and Affiliations

LRIA, Computer Science Department, USTHB Algiers, Bab Ezzouar, Algeria
Abdelhak Bousbaci & Nadjet Kamel
Computer Science Department, Faculty of Sciences, UFAS Setif, Setif, Algeria
Nadjet Kamel

Corresponding author

Correspondence to Abdelhak Bousbaci.

Editor information

Editors and Affiliations

Faculty of Engineering, Osaka Institute of Technology, Osaka, Japan
Sigeru Omatu
Faculty of Computer Science & Informatio, Universiti Teknologi Malaysia (UTM), Baharu, Malaysia
Ali Selamat
Department of Electronics and Compu, Koszalin University of Technology, Koszalin, Poland
Grzegorz Bocewicz
Faculty of Electrical Engineering and Cs, Kielce University of Technology, Kielce, Poland
Paweł Sitek
Faculty of Engineering and Science, Aalborg University, Aalborg, Denmark
Izabela E. Nielsen
ETS Ingeniería Informática, University of Sevilla, Sevilla, Spain
Julián A. García García
Departamento de Inteligencia Artificial, Universidad Politécnica de Madrid, Madrid, Spain
Javier Bajo

About this paper

Cite this paper

Bousbaci, A., Kamel, N. (2016). Efficient Results Merging for Parallel Data Clustering Using MapReduce. In: Omatu, S., et al. Distributed Computing and Artificial Intelligence, 13th International Conference. Advances in Intelligent Systems and Computing, vol 474. Springer, Cham. https://doi.org/10.1007/978-3-319-40162-1_38

DOI: https://doi.org/10.1007/978-3-319-40162-1_38
Published: 01 June 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-40161-4
Online ISBN: 978-3-319-40162-1
eBook Packages: Engineering, Engineering (R0)
