Aggregation of Semantically Similar News Articles with the Help of Embedding Techniques and Unsupervised Machine Learning Algorithms: A Machine Learning Application with Semantic Technologies

  • Chapter
Integrating Meta-Heuristics and Machine Learning for Real-World Optimization Problems

Part of the book series: Studies in Computational Intelligence (SCI, volume 1038)

Abstract

Business news supports leaders and entrepreneurs in everyday decision-making, including setting corporate strategy, making marketing decisions, planning operations, and investing in human capital. It gives them a broad picture of what is happening in the corporate world, tracks mergers and takeovers, and keeps interested readers informed. Staying up to date on corporate business has therefore become essential. However, there are many news websites, and the same story is often published on each of them under a slightly different title. As a consequence, readers spend more time searching for information than they have available to catch up on the news. It would therefore help such readers if semantically similar news articles from different websites were grouped into clusters, so that reading one article per cluster is sufficient. This chapter explains a few approaches to aggregating similar news articles. The first step is to collect the data: while the model is being developed, data is collected from sources such as Kaggle and UCI, and once the model is developed, real-time data can be collected using the news API. The second step is to preprocess the collected data, which involves subtasks such as tokenization, stop-word removal, stemming/lemmatization, and case transformation. The third step is to embed the text as vectors using embedding techniques such as Bag-of-Words, TF-IDF, and Word2Vec. The fourth step is to cluster the embedded vectors using unsupervised learning algorithms such as K-means and agglomerative clustering. The final step is to compare the various combinations of embedding techniques and clustering algorithms.
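The preprocessing, embedding, and clustering steps named in the abstract can be sketched in a few lines of standard-library Python. This is only a minimal illustration, not the authors' implementation: it assumes TF-IDF as the embedding and average-linkage agglomerative clustering with a cosine-similarity cutoff, and the stop-word list, the `threshold` value, the function names, and the four sample headlines are all invented for the example.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "the", "to", "of", "in", "for", "and", "on", "is", "its"}


def preprocess(text):
    """Case transformation, tokenization, and stop-word removal."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict: term -> weight) per document."""
    tokenized = [preprocess(d) for d in docs]
    n = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for toks in tokenized for term in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        # Smoothed IDF keeps a non-zero weight for terms found in every document.
        vectors.append({t: (c / len(toks)) * (math.log((1 + n) / (1 + df[t])) + 1)
                        for t, c in tf.items()})
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def agglomerative(vectors, threshold=0.3):
    """Average-linkage agglomerative clustering.

    Repeatedly merges the two most similar clusters and stops once the best
    remaining similarity falls below `threshold` (a hypothetical cutoff).
    """
    clusters = [[i] for i in range(len(vectors))]

    def sim(a, b):
        total = sum(cosine(vectors[i], vectors[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > 1:
        best, pair = -1.0, None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                s = sim(clusters[x], clusters[y])
                if s > best:
                    best, pair = s, (x, y)
        if best < threshold:
            break
        x, y = pair
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return [sorted(c) for c in clusters]


# Invented headlines: two reports of one merger, two of one rate decision.
headlines = [
    "Acme Corp to acquire Globex in 2 billion dollar merger",
    "Globex merger: Acme Corp agrees 2 billion dollar deal",
    "Central bank raises interest rates to curb inflation",
    "Interest rates raised by central bank as inflation climbs",
]
clusters = agglomerative(tfidf_vectors(headlines))
print(clusters)  # [[0, 1], [2, 3]]
```

Each inner list holds the indices of headlines judged to report the same story, so a reader would only need one article from each. Swapping `tfidf_vectors` for another embedding, or `agglomerative` for another clustering routine, gives the kind of combination that the final comparison step would evaluate.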

Author information

Corresponding author

Correspondence to Nitesh Tarbani.

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this chapter

Cite this chapter

Tarbani, N., Wadhva, K. (2022). Aggregation of Semantically Similar News Articles with the Help of Embedding Techniques and Unsupervised Machine Learning Algorithms: A Machine Learning Application with Semantic Technologies. In: Houssein, E.H., Abd Elaziz, M., Oliva, D., Abualigah, L. (eds) Integrating Meta-Heuristics and Machine Learning for Real-World Optimization Problems. Studies in Computational Intelligence, vol 1038. Springer, Cham. https://doi.org/10.1007/978-3-030-99079-4_5
