Clustering Blogs Using Document Context Similarity and Spectral Graph Partitioning

Ayyasamy, Ramesh Kumar; Alhashmi, Saadat M.; Eu-Gene, Siew; Tahayna, Bashar

doi:10.1007/978-3-642-25661-5_60

Part of the book series: Advances in Intelligent and Soft Computing (AINSC, volume 123)

Abstract

Semantic-based document clustering has been a challenging problem over the past few years, and its performance depends on how the underlying content and its similarity metrics are modeled. Existing metrics evaluate pairwise text similarity based on text content, which is referred to as content similarity. These measures rely on word co-occurrences and ignore the semantics among words. Although several research efforts have addressed this problem, we propose a novel similarity measure that exploits an external knowledge base, Wikipedia, to enhance the document clustering task. Wikipedia articles and the main categories are used to predict the semantic concepts of documents and to affiliate the documents with those concepts. In this measure, we incorporate context similarity by constructing, for each document, a vector in which each dimension represents the content similarity between that document and another document in the collection. Experimental results on the TREC blog dataset confirm that the context similarity measure can significantly improve the precision of document clustering.
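The context-similarity idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes plain cosine similarity over toy term-weight vectors as the content similarity, and it leaves out the Wikipedia-based concept mapping entirely. The function names and the sample data are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def content_similarity_matrix(docs):
    """Pairwise content similarity (assumed here to be cosine over term weights)."""
    return [[cosine(a, b) for b in docs] for a in docs]


def context_similarity_matrix(docs):
    """Compare documents by their context vectors.

    Row i of the content matrix is the context vector of document i: one
    dimension per document in the collection, holding the content similarity
    to that document. Context similarity is the cosine between two such rows.
    """
    content = content_similarity_matrix(docs)
    return [[cosine(row_i, row_j) for row_j in content] for row_i in content]


# Hypothetical term-weight vectors: documents 0 and 2 share no terms,
# but both overlap with document 1.
docs = [
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 1.0, 1.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
]

content = content_similarity_matrix(docs)
context = context_similarity_matrix(docs)

print(content[0][2])  # content similarity of documents 0 and 2
print(context[0][2])  # context similarity of documents 0 and 2
```

In this toy collection, documents 0 and 2 have zero content similarity because they share no terms, yet their context similarity is positive because both resemble document 1. A matrix of this kind could then serve as the edge weights of a document graph for spectral partitioning, although how the chapter builds that graph is not stated in the abstract.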

Author information

Authors and Affiliations

School of Information Technology, Monash University, Malaysia
Ramesh Kumar Ayyasamy, Saadat M. Alhashmi & Bashar Tahayna
School of Business, Monash University, Malaysia
Siew Eu-Gene

Editor information

Editors and Affiliations

Department of Computer Science and Engineering, Shanghai Jiao Tong University, 800 Dongchuan Road, 200240, Shanghai, China
Yinglin Wang
School of Information Science and Technology, Southwest Jiaotong University, 610031, Chengdu, Sichuan Province, China
Tianrui Li

About this paper

Cite this paper

Ayyasamy, R.K., Alhashmi, S.M., Eu-Gene, S., Tahayna, B. (2011). Clustering Blogs Using Document Context Similarity and Spectral Graph Partitioning. In: Wang, Y., Li, T. (eds) Knowledge Engineering and Management. Advances in Intelligent and Soft Computing, vol 123. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-25661-5_60

DOI: https://doi.org/10.1007/978-3-642-25661-5_60
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-25660-8
Online ISBN: 978-3-642-25661-5
eBook Packages: Engineering, Engineering (R0)
