Abstract
Big data mining is the process of extracting hidden knowledge from high-volume, high-variety, and high-velocity data environments to support decision-making systems. In big data settings, classical data mining algorithms face challenges in memory utilization, speed-up, scale-up, computing cost, efficiency, and effectiveness. Data volume is the prime attribute of big data mining and is responsible for the variety- and velocity-related challenges. An intelligent big data mining process combines classical data mining with statistics in single-machine and multiple-machine execution environments. Sampling is a data reduction technique that addresses volume-related challenges: it can improve the speed, scalability, flexibility, accuracy, quality, efficiency, and memory utilization of a data mining algorithm, independently of that algorithm's characteristics. This paper proposes a systematic sampling-based big data mining model built on K-means clustering, called SYK-means (systematic sampling-based K-means). The SYK-means algorithm is compared experimentally with the RSK-means (random sampling-based K-means) and classical K-means algorithms, with respect to selecting a sample of a given size versus using the entire data set. In these experiments, SYK-means achieved better effectiveness and efficiency as measured by the R-squared, root-mean-square standard deviation, Davies-Bouldin, Calinski-Harabasz, and Silhouette coefficient validation indices, together with CPU time and convergence.
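The abstract does not spell out the SYK-means procedure, so the following is only a minimal sketch of the general idea under stated assumptions: draw a linear systematic sample (every k-th record after a random start), run a plain Lloyd-style K-means on that sample, and then assign every record to its nearest sample-derived centroid. The function names (`systematic_sample`, `kmeans`, `sampled_kmeans`), the random centroid initialization, and the final full-data assignment step are illustrative choices, not the authors' implementation.

```python
import random


def systematic_sample(data, sample_size, seed=None):
    """Linear systematic sampling: every step-th record after a random start."""
    n = len(data)
    if not 0 < sample_size <= n:
        raise ValueError("sample_size must be between 1 and len(data)")
    step = n // sample_size                        # sampling interval
    start = random.Random(seed).randrange(step)    # random start in [0, step)
    return [data[start + i * step] for i in range(sample_size)]


def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def nearest(point, centroids):
    """Index of the centroid closest to point."""
    return min(range(len(centroids)),
               key=lambda j: squared_distance(point, centroids[j]))


def kmeans(points, k, max_iter=100, seed=None):
    """Plain Lloyd iteration; initial centroids are k randomly chosen points."""
    centroids = random.Random(seed).sample(points, k)
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        # Mean of each cluster; an empty cluster keeps its previous centroid.
        updated = [
            tuple(sum(col) / len(c) for col in zip(*c)) if c else centroids[j]
            for j, c in enumerate(clusters)
        ]
        if updated == centroids:                   # no centroid moved
            break
        centroids = updated
    return centroids


def sampled_kmeans(data, k, sample_size, seed=None):
    """Assumed pipeline: cluster a systematic sample, then label all records."""
    sample = systematic_sample(data, sample_size, seed)
    centroids = kmeans(sample, k, seed=seed)
    labels = [nearest(p, centroids) for p in data]
    return centroids, labels
```

Only `sample_size` records take part in the iterative step, which is where the reduction in computing cost would come from. One known caveat of systematic sampling is that a record order with a periodic pattern related to the sampling interval can bias the sample, so the ordering of the input may matter in practice.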
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Pandey, K.K., Shukla, D. (2022). Approximate Partitional Clustering Through Systematic Sampling in Big Data Mining. In: Dubey, H.M., Pandit, M., Srivastava, L., Panigrahi, B.K. (eds) Artificial Intelligence and Sustainable Computing. Algorithms for Intelligent Systems. Springer, Singapore. https://doi.org/10.1007/978-981-16-1220-6_19
DOI: https://doi.org/10.1007/978-981-16-1220-6_19
Publisher Name: Springer, Singapore
Print ISBN: 978-981-16-1219-0
Online ISBN: 978-981-16-1220-6
eBook Packages: Intelligent Technologies and Robotics (R0)