Abstract
Nowadays there is an important need by journalists and media monitoring companies to cluster news in large amounts of web articles, in order to ensure fast access to their topics or events of interest. Our aim in this work is to identify groups of news articles that share a common topic or event, without a priori knowledge of the number of clusters. The estimation of the correct number of topics is a challenging issue, due to the existence of “noise”, i.e. news articles which are irrelevant to all other topics. In this context, we introduce a novel density-based news clustering framework, in which the assignment of news articles to topics is done by the well-established Latent Dirichlet Allocation, but the estimation of the number of clusters is performed by our novel method, called “DBSCAN-Martingale”, which allows for extracting noise from the dataset and progressively extracts clusters from an OPTICS reachability plot. We evaluate our framework and the DBSCAN-Martingale on the 20newsgroups-mini dataset and on 220 web news articles, which are references to specific Wikipedia pages. Among twenty methods for news clustering, without knowing the number of clusters k, the framework of DBSCAN-Martingale provides the correct number of clusters and the highest Normalized Mutual Information.
Access provided by Autonomous University of Puebla. Download to read the full chapter text
Chapter PDF
Similar content being viewed by others
References
Aggarwal, C.C., Zhai, C.: A survey of text clustering algorithms. In: Mining Text Data, pp. 77–128. Springer, US (2012)
Ankerst, M., Breunig, M.M., Kriegel, H.P., Sander, J.: OPTICS: ordering points to identify the clustering structure. In: ACM Sigmod Record, vol. 28(2), pp. 49–60. ACM, June 1999
Ball, G.H., Hall, D.J.: ISODATA, a novel method of data analysis and pattern classification. Stanford Research Institute (NTIS No. AD 699616) (1965)
Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet Allocation. The Journal of Machine Learning Research 3, 993–1022 (2003)
Campello, R.J.G.B., Moulavi, D., Sander, J.: Density-based clustering based on hierarchical density estimates. In: Pei, J., Tseng, V.S., Cao, L., Motoda, H., Xu, G. (eds.) PAKDD 2013, Part II. LNCS, vol. 7819, pp. 160–172. Springer, Heidelberg (2013)
Caliński, T., Harabasz, J.: A dendrite method for cluster analysis. Communications in Statistics-Theory and Methods 3(1), 1–27 (1974)
Charrad, M., Ghazzali, N., Boiteau, V., Niknafs, A.: NbClust: an R package for determining the relevant number of clusters in a data set. Journal of Statistical Software 61(6), 1–36 (2014)
Davies, D.L., Bouldin, D.W.: A cluster separation measure. IEEE Transactions on Pattern Analysis and Machine Intelligence 2, 224–227 (1979)
Duda, R.O., Hart, P.E.: Pattern Classification and Scene Analysis, vol. 3. Wiley, New York (1973)
Dunn, J.C.: Well-separated clusters and optimal fuzzy partitions. Journal of cybernetics 4(1), 95–104 (1974)
Doob, J.L.: Stochastic Processes, vol. 101. Wiley, New York (1953)
Ester, M., Kriegel, H. P., Sander, J., Xu, X.: A density-based algorithm for discovering clusters in large spatial databases with noise. In: Kdd, vol. 96(34), pp. 226–231, August 1996
Frey, T., Van Groenewoud, H.: A cluster analysis of the D2 matrix of white spruce stands in Saskatchewan based on the maximum-minimum principle. The Journal of Ecology, 873–886 (1972)
Halkidi, M., Vazirgiannis, M., Batistakis, Y.: Quality scheme assessment in the clustering process. In: Zighed, D.A., Komorowski, J., Żytkow, J.M. (eds.) PKDD 2000. LNCS (LNAI), vol. 1910, pp. 265–276. Springer, Heidelberg (2000)
Hartigan, J.A.: Clustering Algorithms. John Wiley Sons, New York (1975)
Hubert, L., Arabie, P.: Comparing partitions. Journal of Classification 2(1), 193–218 (1985)
Hubert, L.J., Levin, J.R.: A general statistical framework for assessing categorical clustering in free recall. Psychological Bulletin 83(6), 1072 (1976)
Kaufman, L., Rousseeuw, P.J.: Finding groups in data. An introduction to cluster analysis. Wiley Series in Probability and Mathematical Statistics. Applied Probability and Statistics, 1st edn. Wiley, New York (1990)
Krzanowski, W.J., Lai, Y.T.: A criterion for determining the number of groups in a data set using sum-of-squares clustering. Biometrics, 23–34 (1988)
Kulis, B., Jordan, M.I.: Revisiting k-means: New algorithms via Bayesian nonparametrics. arXiv preprint (2012). arXiv:1111.0352
Kumar, A., Daumé, H.: A co-training approach for multi-view spectral clustering. In: Proceedings of the 28th International Conference on Machine Learning (ICML 2011), pp. 393–400 (2011)
McClain, J.O., Rao, V.R.: Clustisz: A program to test for the quality of clustering of a set of objects. Journal of Marketing Research (pre-1986) 12(000004), 456 (1975)
Milligan, G.W., Cooper, M.C.: An examination of procedures for determining the number of clusters in a data set. Psychometrika 50(2), 159–179 (1985)
Porter, M.F.: An algorithm for suffix stripping. Program 14(3), 130–137 (1980)
Qian, M., Zhai, C.: Unsupervised feature selection for multi-view clustering on text-image web news data. In: Proceedings of the 23rd ACM International Conference on Information Knowledge Management, pp. 1963–1966. ACM, November 2014
Ratkowsky, D.A., Lance, G.N.: A criterion for determining the number of groups in a classification. Australian Computer Journal 10(3), 115–117 (1978)
Sander, J., Qin, X., Lu, Z., Niu, N., Kovarsky, A.: Automatic extraction of clusters from hierarchical clustering representations. In: Advances in Knowledge Discovery and Data Mining, pp. 75–87. Springer, Heidelberg (2003)
Schneider, J., Vlachos, M.: Fast parameterless density-based clustering via random projections. In: Proceedings of the 22nd ACM International Conference on Information Knowledge Management, pp. 861–866. ACM (2013)
Teh, Y.W., Jordan, M.I., Beal, M.J., Blei, D.M.: Hierarchical dirichlet processes. Journal of the American Statistical Association 101(476) (2006)
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2016 Springer International Publishing Switzerland
About this paper
Cite this paper
Gialampoukidis, I., Vrochidis, S., Kompatsiaris, I. (2016). A Hybrid Framework for News Clustering Based on the DBSCAN-Martingale and LDA. In: Perner, P. (eds) Machine Learning and Data Mining in Pattern Recognition. MLDM 2016. Lecture Notes in Computer Science(), vol 9729. Springer, Cham. https://doi.org/10.1007/978-3-319-41920-6_13
Download citation
DOI: https://doi.org/10.1007/978-3-319-41920-6_13
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-41919-0
Online ISBN: 978-3-319-41920-6
eBook Packages: Computer ScienceComputer Science (R0)