Abstract
The increasing number of news sources makes news analysis difficult and increases the need for automated systems. This paper presents a system that clusters news items with similar content in real time. The system uses the Apache Solr database to capture texts from news sources and its MoreLikeThis (MLT) search component to retrieve the 5 most similar items from thousands of previously recorded news items. Each incoming item is added to the cluster of the most similar of the 5 items obtained by this pre-filtering. The main problem addressed in this study is therefore finding the item that most closely resembles the incoming one. A 2-step approach is proposed for this. Most news published by different media sources is produced by copying, extending, or shortening the same text. For this reason, a ‘citation rate’ is first calculated between news pairs. If any news pairs exceed the citation threshold, the pair with the highest citation rate is placed in the same cluster. Otherwise, numerical representations (embeddings) of the texts at different levels are used to determine semantic similarity. The results show that the proposed 2-stage approach reduces the sensitivity of embeddings at different levels to text length, achieving up to a 7.6% improvement over clustering with embeddings alone. With real-time clustering and an F-score above 90%, the proposed system is suitable for real-life applications.
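The two-stage decision described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not define the citation rate, so `citation_rate` here is a hypothetical word n-gram overlap; `cosine` is a bag-of-words stand-in for the embedding similarity; both thresholds and the "return `None` to start a new cluster" convention are assumptions; and the candidate list is passed in directly instead of being fetched from Solr MLT.

```python
from collections import Counter
from math import sqrt


def shingles(text, n=3):
    """Return the set of word n-grams in a text."""
    words = text.lower().split()
    if len(words) < n:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def citation_rate(a, b, n=3):
    """Hypothetical definition (assumption): share of the shorter text's
    word n-grams that also occur in the other text."""
    sa, sb = shingles(a, n), shingles(b, n)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / min(len(sa), len(sb))


def cosine(a, b):
    """Bag-of-words cosine similarity, a stand-in for embedding similarity."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    norm = sqrt(sum(v * v for v in ca.values())) * sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0


def assign_cluster(new_text, candidates, citation_threshold=0.5, semantic_threshold=0.5):
    """Pick a cluster for new_text.

    candidates: list of (cluster_id, text) pairs, e.g. the 5 pre-filtered hits.
    Returns a cluster id, or None to signal that a new cluster is needed
    (an assumed convention; thresholds are illustrative values).
    """
    if not candidates:
        return None
    # Stage 1: prefer the candidate with the highest citation rate.
    rate, cid = max(((citation_rate(new_text, t), c) for c, t in candidates),
                    key=lambda pair: pair[0])
    if rate > citation_threshold:
        return cid
    # Stage 2: fall back to semantic similarity.
    sim, cid = max(((cosine(new_text, t), c) for c, t in candidates),
                   key=lambda pair: pair[0])
    return cid if sim > semantic_threshold else None


candidates = [
    ("c1", "the central bank raised interest rates by fifty basis points on thursday"),
    ("c2", "the football team won the championship final last night"),
]
extended = candidates[0][1] + " according to officials"
print(assign_cluster(extended, candidates))
```

Because the overlap is normalised by the shorter text, a lengthened or shortened copy of an article still scores highly in stage 1, which is the case the abstract says plain embeddings handle poorly.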
Acknowledgment
We would like to thank Interpress Media Monitoring Agency, which provided the news data for the Turkish News Texts Corpus and the Grouped Turkish News Texts Test Set.
Copyright information
© 2021 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Kömeçoğlu, Y., Kömeçoğlu, B.B., Yılmaz, B. (2021). Real-Time News Grouping: Detecting the Same-Content News on Turkish News Stream. In: Allahviranloo, T., Salahshour, S., Arica, N. (eds) Progress in Intelligent Decision Science. IDS 2020. Advances in Intelligent Systems and Computing, vol 1301. Springer, Cham. https://doi.org/10.1007/978-3-030-66501-2_2
DOI: https://doi.org/10.1007/978-3-030-66501-2_2
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-66500-5
Online ISBN: 978-3-030-66501-2
eBook Packages: Intelligent Technologies and Robotics (R0)