Stratification to Improve Systematic Sampling for Big Data Mining Using Approximate Clustering

Pandey, Kamlesh Kumar; Shukla, Diwakar

doi:10.1007/978-981-33-4893-6_30

Part of the book series: Algorithms for Intelligent Systems (AIS)

Abstract

Big data mining is a modern research area concerned with extracting knowledge for decision-making systems, and clustering is a data mining technique that groups data into classes without any prior knowledge. Big data mining deals with high-volume, high-variety, and high-velocity data sets, which is why classical data mining algorithms face problems related to speed-up, scale-up, memory utilization, computing efficiency, and effectiveness. Data volume is the primary attribute of big data and is responsible for the variety- and velocity-related challenges. Sampling is a data reduction technique that handles volume-related challenges in both single-machine and multiple-machine execution environments for big data mining tasks such as classification and clustering. A good sampling process improves the speed, scalability, flexibility, accuracy, quality, efficiency, and memory utilization of any data mining technique, regardless of that technique's characteristics. This paper proposes two sampling approaches for big data mining using the K-Means clustering algorithm: SYK-Means (Systematic Sampling-based K-Means) and SSYK-Means (Stratified Systematic Sampling-based K-Means). In experiments, SYK-Means and SSYK-Means were compared with classical K-Means and achieved better effectiveness and efficiency as measured by the R-squared, Davies-Bouldin, Calinski-Harabasz, Silhouette Coefficient, CPU time, and convergence indices.
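The abstract does not spell out the two algorithms, so the following is only a minimal sketch of the general idea: draw a systematic sample (or a systematic sample within each stratum) and run K-Means on the sample rather than on the full data set. It uses textbook systematic sampling, proportional allocation across strata, and a plain Lloyd-style K-Means; the function names, the `stratum_of` key function, and the allocation rule are assumptions made for illustration and are not taken from the chapter.

```python
import random


def systematic_sample(data, n, rng):
    """Pick n records: a random start, then every k-th record (k = len(data) // n)."""
    if not 0 < n <= len(data):
        raise ValueError("n must be between 1 and len(data)")
    k = len(data) // n            # sampling interval
    start = rng.randrange(k)      # random start in [0, k)
    return [data[start + i * k] for i in range(n)]


def stratified_systematic_sample(data, n, stratum_of, rng):
    """Split data into strata, then sample systematically inside each stratum.

    stratum_of is a hypothetical key function mapping a record to its stratum.
    Sizes use proportional allocation, so the total can differ slightly from n
    because of rounding.
    """
    strata = {}
    for row in data:
        strata.setdefault(stratum_of(row), []).append(row)
    sample = []
    for rows in strata.values():
        share = max(1, round(n * len(rows) / len(data)))
        sample.extend(systematic_sample(rows, min(share, len(rows)), rng))
    return sample


def kmeans(points, k, rng, iters=20):
    """Plain Lloyd-style K-Means on tuples of floats; returns k centroids."""
    centroids = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
            clusters[nearest].append(p)
        # Recompute each centroid as the mean of its cluster; keep the old
        # centroid if a cluster ended up empty.
        centroids = [
            tuple(sum(dim) / len(cl) for dim in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
    return centroids


if __name__ == "__main__":
    rng = random.Random(42)
    # Synthetic two-blob data, used only to exercise the functions.
    data = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(5000)]
    data += [(rng.gauss(8, 1), rng.gauss(8, 1)) for _ in range(5000)]
    sample = stratified_systematic_sample(data, 500, lambda r: r[0] > 4, rng)
    print(len(sample), kmeans(sample, 2, rng))
```

Clustering the sample costs far fewer distance computations per iteration than clustering the full data set, which is the efficiency argument the abstract makes; how well the sample's centroids approximate the full-data centroids depends on the data ordering and on how the strata are chosen.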

Author information

Authors and Affiliations

Department of Computer Science and Applications, Dr. Hari singh Gour Vishwavidyalaya, Sagar, India
Kamlesh Kumar Pandey & Diwakar Shukla

Corresponding author

Correspondence to Kamlesh Kumar Pandey.

Editor information

Editors and Affiliations

University Institute of Technology, Rajiv Gandhi Proudyogiki Vishwavidyalaya, Bhopal, Madhya Pradesh, India
Shikha Agrawal
Rustamji Institute of Technology, Gwalior, Madhya Pradesh, India
Kamlesh Kumar Gupta
King Mongkut’s University of Technology Thonburi, Bangkok, Thailand
Jonathan H. Chan
School of Information Technology, Rajiv Gandhi Proudyogiki Vishwavidyalaya, Bhopal, Madhya Pradesh, India
Jitendra Agrawal
Vikrant Institute of Technology and Management, Gwalior, Madhya Pradesh, India
Manish Gupta

Cite this paper

Pandey, K.K., Shukla, D. (2021). Stratification to Improve Systematic Sampling for Big Data Mining Using Approximate Clustering. In: Agrawal, S., Kumar Gupta, K., H. Chan, J., Agrawal, J., Gupta, M. (eds) Machine Intelligence and Smart Systems. Algorithms for Intelligent Systems. Springer, Singapore. https://doi.org/10.1007/978-981-33-4893-6_30

DOI: https://doi.org/10.1007/978-981-33-4893-6_30
Published: 09 April 2021
Publisher Name: Springer, Singapore
Print ISBN: 978-981-33-4892-9
Online ISBN: 978-981-33-4893-6
eBook Packages: Intelligent Technologies and Robotics (R0)
