Stratification to Improve Systematic Sampling for Big Data Mining Using Approximate Clustering

  • Conference paper
Machine Intelligence and Smart Systems

Part of the book series: Algorithms for Intelligent Systems (AIS)

Abstract

Big data mining is a modern research area concerned with extracting knowledge for decision-making systems, and clustering is the data mining technique that predicts the classes of data without any prior knowledge. Big data mining deals with high-volume, high-variety, and high-velocity data sets, which is why classical data mining algorithms face problems related to speed-up, scale-up, memory utilization, computing efficiency, and effectiveness. Data volume is the primary attribute of big data and is responsible for the variety- and velocity-related challenges. Sampling is a data reduction technique that handles volume-related challenges in both single-machine and multiple-machine execution environments for big data mining tasks such as classification and clustering. A good sampling process improves the speed, scalability, flexibility, accuracy, quality, efficiency, and memory utilization of a data mining technique, regardless of that technique's characteristics. This paper proposes two sampling approaches for big data mining using the K-Means clustering algorithm: SYK-Means (Systematic Sampling-based K-Means) and SSYK-Means (Stratified Systematic Sampling-based K-Means). In the experiments, SYK-Means and SSYK-Means were compared with classical K-Means and achieved better effectiveness and efficiency as measured by R-squared, Davies-Bouldin, Calinski-Harabasz, Silhouette Coefficient, CPU time, and convergence indices.
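
The two sampling ideas named in the abstract can be illustrated with a minimal, standard-library Python sketch. It is not the authors' implementation: the abstract does not say how strata are formed, so stratifying on the first feature, the equal-sized strata, the step size of 10, the synthetic two-blob data, and every function name below are assumptions made purely for illustration. The sketch clusters only the sample with plain Lloyd's K-Means and then assigns the full data set to the resulting centroids.

```python
import random


def systematic_sample(points, step, start=0):
    """Take every `step`-th point, beginning at index `start`."""
    if step < 1 or not 0 <= start < step:
        raise ValueError("need step >= 1 and 0 <= start < step")
    return points[start::step]


def stratified_systematic_sample(points, n_strata, step, key=lambda p: p[0]):
    """Sort by `key`, cut into nearly equal strata, sample each systematically.

    Stratifying on the first feature is an assumption of this sketch.
    """
    ordered = sorted(points, key=key)
    size = -(-len(ordered) // n_strata)  # ceiling division
    sample = []
    for i in range(0, len(ordered), size):
        sample.extend(systematic_sample(ordered[i:i + size], step))
    return sample


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(points, centroids):
    """Return, for each point, the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda j: squared_distance(p, centroids[j]))
        for p in points
    ]


def kmeans(points, k, iterations=20, seed=0):
    """Plain Lloyd's K-Means on a list of equal-length tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # random initial centroids
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p, j in zip(points, assign(points, centroids)):
            clusters[j].append(p)
        new_centroids = [
            tuple(sum(c) / len(members) for c in zip(*members))
            if members else centroids[j]  # keep the centroid of an empty cluster
            for j, members in enumerate(clusters)
        ]
        if new_centroids == centroids:  # converged
            break
        centroids = new_centroids
    return centroids


# Hypothetical demo data: two Gaussian blobs of 500 points each.
rng = random.Random(42)
data = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(500)]
data += [(rng.gauss(10, 1), rng.gauss(10, 1)) for _ in range(500)]
rng.shuffle(data)

sy_sample = systematic_sample(data, step=10)
ssy_sample = stratified_systematic_sample(data, n_strata=4, step=10)

# Cluster the small sample, then label the full data set with its centroids.
ssy_centroids = kmeans(ssy_sample, k=2)
labels = assign(data, ssy_centroids)
```

The design point the sketch tries to show is that K-Means runs on roughly one tenth of the data, while the final assignment pass over the full data set is a single nearest-centroid scan. Quality indices such as those listed in the abstract would then be computed on `labels` to compare against K-Means run on all points.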

Author information

Corresponding author

Correspondence to Kamlesh Kumar Pandey.

Copyright information

© 2021 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

Cite this paper

Pandey, K.K., Shukla, D. (2021). Stratification to Improve Systematic Sampling for Big Data Mining Using Approximate Clustering. In: Agrawal, S., Kumar Gupta, K., H. Chan, J., Agrawal, J., Gupta, M. (eds) Machine Intelligence and Smart Systems. Algorithms for Intelligent Systems. Springer, Singapore. https://doi.org/10.1007/978-981-33-4893-6_30
