Abstract
Communication, digital, and real-time applications are now in widespread use. These technologies change the speed at which data is generated, the format of the data, and the way it is stored and managed. The common challenges of big data mining relate to data volume, which is in turn indirectly related to data variety and data velocity. Big data analysis strategies that address volume-related challenges include divide-and-conquer, feature selection, parallel processing, granular computing, incremental learning, and sampling. These strategies reduce the data to be mined and also categorize its variety. This paper uses sampling for data reduction, and stratified sampling in particular for categorization, because stratified sampling can categorize data efficiently. Drawing on theoretical, practical, and existing research perspectives, the paper covers big data and its characteristics, big data mining, big data reduction strategies, and sampling techniques for big data, and it designs a model for data reduction and categorization through stratified sampling. The model describes a new data mining technique based on stratified sampling, named "stratified sampling-based (data mining algorithm name)". The model is illustrated with the partitioning-based K-means clustering algorithm, which under this model is called SSBKM.
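The abstract does not give the steps of SSBKM, so the following is only a minimal sketch of the general idea it names: reduce the data by stratified sampling, then run K-means on the reduced set. It assumes each record already carries a stratum key and uses proportional allocation (the same sampling fraction in every stratum). The function names `stratified_sample` and `kmeans`, the `fraction` parameter, and the synthetic data are all hypothetical and are not taken from the paper.

```python
import random
from collections import defaultdict


def stratified_sample(records, stratum_of, fraction, seed=0):
    """Group records by stratum and draw the same fraction from each group.

    Assumption: proportional allocation; the paper may allocate differently.
    Returns a dict mapping stratum key -> list of sampled records.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for rec in records:
        strata[stratum_of(rec)].append(rec)
    sample = {}
    for key, members in strata.items():
        # Keep at least one record so that no stratum vanishes from the sample.
        n = max(1, round(fraction * len(members)))
        sample[key] = rng.sample(members, n)
    return sample


def kmeans(points, k, iterations=20, seed=0):
    """Plain Lloyd-style K-means on tuples of floats; returns k centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Assign the point to the centroid with the smallest squared distance.
            nearest = min(
                range(k),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])),
            )
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster; keep it if the cluster is empty.
        centroids = [
            tuple(sum(dim) / len(c) for dim in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids


# Synthetic two-stratum data set (illustrative only, not from the paper).
gen = random.Random(42)
data = [("a", (gen.gauss(0, 1), gen.gauss(0, 1))) for _ in range(600)]
data += [("b", (gen.gauss(8, 1), gen.gauss(8, 1))) for _ in range(400)]

# Reduce 1000 records to roughly 10% per stratum, then cluster the reduced set.
sample = stratified_sample(data, stratum_of=lambda r: r[0], fraction=0.1)
reduced = [point for members in sample.values() for _, point in members]
centroids = kmeans(reduced, k=2)
print(len(reduced), centroids)
```

Because every stratum contributes in proportion to its size, the reduced set keeps the category mix of the full data while K-means processes only a fraction of the records. In practice the stratum key would have to be derived from the data, which this sketch does not attempt.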
Copyright information
© 2020 Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Pandey, K.K., Shukla, D. (2020). Stratified Sampling-Based Data Reduction and Categorization Model for Big Data Mining. In: Bansal, J., Gupta, M., Sharma, H., Agarwal, B. (eds) Communication and Intelligent Systems. ICCIS 2019. Lecture Notes in Networks and Systems, vol 120. Springer, Singapore. https://doi.org/10.1007/978-981-15-3325-9_9
DOI: https://doi.org/10.1007/978-981-15-3325-9_9
Publisher Name: Springer, Singapore
Print ISBN: 978-981-15-3324-2
Online ISBN: 978-981-15-3325-9
eBook Packages: Intelligent Technologies and Robotics (R0)