Abstract
Clustering is an essential data mining tool for analyzing big data. Applying clustering techniques to big data is difficult, however, because of the new challenges that big data raises. Because big data refers to terabytes and petabytes of data, and clustering algorithms carry high computational costs, the question is how to deploy clustering techniques on big data and still obtain results in a reasonable time. This study reviews the trend and progress of clustering algorithms in coping with big data challenges, from the earliest proposed algorithms to today’s novel solutions. The algorithms, and the challenges they target in order to improve on earlier clustering methods, are introduced and analyzed. A possible future path toward more advanced algorithms is then outlined, based on today’s available technologies and frameworks.
Copyright information
© 2014 Springer International Publishing Switzerland
About this paper
Cite this paper
Shirkhorshidi, A.S., Aghabozorgi, S., Wah, T.Y., Herawan, T. (2014). Big Data Clustering: A Review. In: Murgante, B., et al. Computational Science and Its Applications – ICCSA 2014. ICCSA 2014. Lecture Notes in Computer Science, vol 8583. Springer, Cham. https://doi.org/10.1007/978-3-319-09156-3_49
DOI: https://doi.org/10.1007/978-3-319-09156-3_49
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-09155-6
Online ISBN: 978-3-319-09156-3
eBook Packages: Computer Science (R0)