Abstract
Semantic-based document clustering has been a challenging problem over the past few years, and its effectiveness depends on how the underlying content and its similarity metrics are modeled. Existing metrics evaluate pairwise text similarity based on text content, which is referred to as content similarity. These measures rely on term co-occurrence and ignore the semantic relationships among words. Although several research works have addressed this problem, we propose a novel similarity measure that exploits an external knowledge base, Wikipedia, to enhance the document clustering task. Wikipedia articles and their main categories are used to predict the semantic concepts of documents and to affiliate the documents with those concepts. In this measure, we incorporate context similarity by constructing a vector for each document in which each dimension represents the content similarity between that document and another document in the collection. Experimental results on the TREC blog dataset confirm that the context similarity measure can significantly improve the precision of document clustering.
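The context similarity idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a plain bag-of-words representation and cosine similarity as the content similarity, it omits the Wikipedia concept mapping entirely, and it includes each document's similarity to itself in the vector. The function names (`term_vector`, `content_similarity`, `context_vectors`, `context_similarity`) and the toy documents are hypothetical.

```python
import math
from collections import Counter


def term_vector(text):
    """Bag-of-words term counts (assumed stand-in for the paper's content representation)."""
    return Counter(text.lower().split())


def content_similarity(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def context_vectors(docs):
    """One vector per document; entry j is its content similarity to document j."""
    terms = [term_vector(d) for d in docs]
    return [[content_similarity(ti, tj) for tj in terms] for ti in terms]


def cosine_dense(u, v):
    """Cosine similarity between two equal-length dense vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def context_similarity(docs):
    """Pairwise similarity of the context vectors (assumed here to be cosine)."""
    ctx = context_vectors(docs)
    n = len(ctx)
    return [[cosine_dense(ctx[i], ctx[j]) for j in range(n)] for i in range(n)]


# Hypothetical toy collection: two programming posts, two football posts.
docs = [
    "python code bug",
    "python code test",
    "football match goal",
    "football match score",
]
sim = context_similarity(docs)
```

In this toy collection, two documents that relate to the rest of the collection in the same way receive a high context similarity even though their term overlap is only partial, while documents from different topics stay at zero. The resulting matrix `sim` could serve as the affinity matrix for a graph-based clustering step such as spectral partitioning, although that step is not shown here.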
Copyright information
© 2011 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Ayyasamy, R.K., Alhashmi, S.M., Eu-Gene, S., Tahayna, B. (2011). Clustering Blogs Using Document Context Similarity and Spectral Graph Partitioning. In: Wang, Y., Li, T. (eds) Knowledge Engineering and Management. Advances in Intelligent and Soft Computing, vol 123. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-25661-5_60
DOI: https://doi.org/10.1007/978-3-642-25661-5_60
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-25660-8
Online ISBN: 978-3-642-25661-5
eBook Packages: Engineering, Engineering (R0)