Abstract

This chapter provides a comprehensive overview of data preprocessing techniques and tools in the context of web and social media analytics. As data volume and complexity from various sources grow, effective data preprocessing becomes crucial for extracting valuable insights and knowledge. This chapter covers vital steps in data preprocessing, including characterizing data, reducing dimensionality, data transformation, and data enrichment and validation. By following these steps and utilizing appropriate techniques and tools, you can improve the quality of your data, enhance the effectiveness of your analytics efforts, and make better-informed decisions. Moreover, this chapter aims to equip you with the necessary knowledge to effectively tackle complex and noisy data, enabling you to unlock your organization’s full potential for data mining and analytics.

Author information

Corresponding author

Correspondence to Bernard J. Jansen.

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this chapter

Cite this chapter

Jansen, B.J., Aldous, K.K., Salminen, J., Almerekhi, H., Jung, Sg. (2024). Data Preprocessing. In: Understanding Audiences, Customers, and Users via Analytics. Synthesis Lectures on Information Concepts, Retrieval, and Services. Springer, Cham. https://doi.org/10.1007/978-3-031-41933-1_6

  • DOI: https://doi.org/10.1007/978-3-031-41933-1_6

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-41932-4

  • Online ISBN: 978-3-031-41933-1

  • eBook Packages: Synthesis Collection of Technology (R0)
