Abstract
Business news supports leaders and entrepreneurs in everyday decision-making, including setting corporate strategy, making marketing decisions, planning operations, and investing in human capital. It gives them an overview of what is happening in the corporate world, tracking mergers and takeovers and keeping interested readers informed. Today, it is essential for people to keep themselves updated about corporate business. However, there are many news websites, and the same story is often published on each of them with a slightly changed title. As a consequence, people spend far more time searching for information than they have available to catch up on the news. It would therefore be very helpful if semantically similar news articles from different websites could be grouped into clusters, so that reading a single item from each cluster is sufficient. This chapter explains a few approaches to aggregating similar news articles. The first step is to collect the data: for model development, data is collected from sources such as Kaggle and UCI, and once the model is developed, real-time data can be collected using the news API. The second step is to preprocess the collected data, which involves subtasks such as tokenization, stop-word removal, stemming/lemmatization, and case transformation. The third step is to embed the text as vectors using techniques such as Bag-of-Words, TF-IDF, and Word2Vec. The fourth step is to cluster the embedded vectors using unsupervised learning algorithms such as K-means and agglomerative clustering. The final step is to compare the various combinations of embedding techniques and clustering algorithms.
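The preprocessing, embedding, and clustering steps described in the abstract can be sketched roughly as follows. This is a minimal illustration and not the chapter's implementation: it picks one combination (TF-IDF with single-link agglomerative clustering on cosine similarity), uses only the Python standard library, and skips stemming/lemmatization. The stop-word list, the similarity threshold of 0.3, the function names, and the sample headlines are all assumptions made up for this example.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not the chapter's list).
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "to", "for", "and", "is", "by", "with"}


def preprocess(text):
    """Case transformation, tokenization, and stop-word removal."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict: term -> weight) per document."""
    tokenized = [preprocess(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Smoothed IDF keeps a non-zero weight for terms found in every document.
        vectors.append(
            {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1.0) for t, c in tf.items()}
        )
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def cluster(docs, threshold=0.3):
    """Single-link agglomerative clustering with a similarity cut-off.

    Two clusters are merged whenever any pair of their members has
    cosine similarity >= threshold. Returns lists of document indices.
    """
    vectors = tfidf_vectors(docs)
    clusters = [[i] for i in range(len(docs))]
    merged = True
    while merged:
        merged = False
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                if any(
                    cosine(vectors[i], vectors[j]) >= threshold
                    for i in clusters[a]
                    for j in clusters[b]
                ):
                    clusters[a] += clusters[b]
                    del clusters[b]
                    merged = True
                    break
            if merged:
                break
    return [sorted(c) for c in clusters]


# Hypothetical headlines: two reworded versions of each of two stories.
HEADLINES = [
    "Acme Corp acquires Globex in 2 billion dollar merger",
    "Globex to be acquired by Acme Corp in merger deal",
    "Central bank raises interest rates to curb inflation",
    "Interest rates raised by central bank as inflation climbs",
]

if __name__ == "__main__":
    for group in cluster(HEADLINES):
        print([HEADLINES[i] for i in group])
```

On these sample headlines the sketch groups the two merger headlines together and the two interest-rate headlines together, so a reader would need only one item from each group. A threshold-based agglomerative method is used here because it does not require fixing the number of clusters in advance; K-means, which the abstract also names, would need that number as an input.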
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this chapter
Cite this chapter
Tarbani, N., Wadhva, K. (2022). Aggregation of Semantically Similar News Articles with the Help of Embedding Techniques and Unsupervised Machine Learning Algorithms: A Machine Learning Application with Semantic Technologies. In: Houssein, E.H., Abd Elaziz, M., Oliva, D., Abualigah, L. (eds) Integrating Meta-Heuristics and Machine Learning for Real-World Optimization Problems. Studies in Computational Intelligence, vol 1038. Springer, Cham. https://doi.org/10.1007/978-3-030-99079-4_5
DOI: https://doi.org/10.1007/978-3-030-99079-4_5
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-99078-7
Online ISBN: 978-3-030-99079-4
eBook Packages: Intelligent Technologies and Robotics (R0)