Abstract
Big data mining is a modern research area concerned with extracting knowledge for decision-making systems, and clustering is the data mining technique that groups data into classes without any prior knowledge. Big data mining deals with high-volume, high-variety, and high-velocity data sets, which is why classical data mining algorithms face problems related to speed-up, scale-up, memory utilization, computing efficiency, and effectiveness. Data volume is the primary attribute of big data mining and is responsible for the variety- and velocity-related challenges. Sampling is a data reduction technique that handles volume-related challenges in both single-machine and multiple-machine execution environments for big data mining tasks such as classification and clustering. A good sampling process improves the speed, scalability, flexibility, accuracy, quality, efficiency, and memory utilization of a data mining technique, regardless of that technique's characteristics. This paper proposes two sampling approaches for big data mining with the K-Means clustering algorithm: SYK-Means (Systematic Sampling-based K-Means) and SSYK-Means (Stratified Systematic Sampling-based K-Means). In the experiments, SYK-Means and SSYK-Means were compared with classical K-Means and achieved better effectiveness and efficiency as measured by R-square, Davies-Bouldin, Calinski-Harabasz, Silhouette Coefficient, CPU time, and convergence indices.
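The general idea of clustering a systematic or stratified systematic sample instead of the full data set can be sketched as follows. This is a minimal illustration and not the authors' SYK-Means or SSYK-Means procedure, which the abstract does not spell out. The function names, the choice of strata (equal-size bands of the data sorted by the first coordinate), the proportional allocation, and the synthetic two-blob data are all assumptions made for the example.

```python
import random


def systematic_sample(points, n, rng):
    """Take every step-th point after a random start, step = len(points) // n."""
    step = max(1, len(points) // n)
    start = rng.randrange(step)
    return points[start::step][:n]


def stratified_systematic_sample(points, n, n_strata, rng):
    """Systematic sampling inside each stratum, with proportional allocation.

    Assumption: strata are equal-size bands of the data sorted by the first
    coordinate. The abstract does not say how the strata are formed.
    """
    ordered = sorted(points, key=lambda p: p[0])
    size = -(-len(ordered) // n_strata)  # ceiling division
    strata = [ordered[i:i + size] for i in range(0, len(ordered), size)]
    sample = []
    for stratum in strata:
        quota = max(1, round(n * len(stratum) / len(ordered)))
        sample.extend(systematic_sample(stratum, quota, rng))
    return sample


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, rng, iters=50):
    """Plain Lloyd iterations on the given points; returns the k centroids."""
    centroids = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: sq_dist(p, centroids[j]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous centroid.
        updated = [
            tuple(sum(col) / len(c) for col in zip(*c)) if c else centroids[j]
            for j, c in enumerate(clusters)
        ]
        if updated == centroids:
            break
        centroids = updated
    return centroids


# Synthetic data (assumption): two well-separated 2-D Gaussian blobs.
rng = random.Random(0)
data = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(200)]
data += [(rng.gauss(10, 1), rng.gauss(10, 1)) for _ in range(200)]

sample = stratified_systematic_sample(data, 40, 4, rng)
centroids = kmeans(sample, 2, rng)
# Approximate clustering: label every point of the full data set by its
# nearest sample-derived centroid.
labels = [min(range(2), key=lambda j: sq_dist(p, centroids[j])) for p in data]
print(len(sample), centroids)
```

Here K-Means iterates over 40 sampled points rather than all 400, and the full data set is touched only once, in the final labelling pass. Swapping `stratified_systematic_sample` for `systematic_sample` gives the unstratified variant of the same sketch.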
Copyright information
© 2021 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Pandey, K.K., Shukla, D. (2021). Stratification to Improve Systematic Sampling for Big Data Mining Using Approximate Clustering. In: Agrawal, S., Kumar Gupta, K., H. Chan, J., Agrawal, J., Gupta, M. (eds) Machine Intelligence and Smart Systems. Algorithms for Intelligent Systems. Springer, Singapore. https://doi.org/10.1007/978-981-33-4893-6_30
DOI: https://doi.org/10.1007/978-981-33-4893-6_30
Publisher Name: Springer, Singapore
Print ISBN: 978-981-33-4892-9
Online ISBN: 978-981-33-4893-6
eBook Packages: Intelligent Technologies and Robotics (R0)