Abstract
This paper shows how Wikipedia and the semantic knowledge it contains can be exploited for document clustering. We first create a concept-based document representation by mapping the terms and phrases within documents to their corresponding articles (or concepts) in Wikipedia. We also developed a similarity measure that evaluates the semantic relatedness between concept sets for two documents. We test the concept-based representation and the similarity measure on two standard text document datasets. Empirical results show that although further optimizations could be performed, our approach already improves upon related techniques.
Access provided by Autonomous University of Puebla. Download to read the full chapter text
Chapter PDF
Similar content being viewed by others
Keywords
- Independent Component Analysis
- Semantic Similarity
- Semantic Relatedness
- Independent Component Analysis
- Cluster Document
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.
References
Banerjee, S., Ramanathan, K., Gupta, A.: Clustering Short Texts using Wikipedia. In: Proceedings of the SIGIR, pp. 787–788. ACM, New York (2007)
Deerwester, S., Dumais, S.T., Furnas, G.W., Landauer, T., Harshman, R.: Indexing by Latent Semantic Analysis. Journal of the American Society for Information Science 41(6), 391–407 (1990)
Gabrilovich, E., Markovitch, S.: Overcoming the Brittleness Bottleneck using Wikipedia: Enhancing Text Categorization with Encyclopedic Knowledge. In: Proceedings of AAAI, pp. 1301–1306. AAAI, Menlo Park (2006)
Hotho, A., Staab, S., Stumme, G.: WordNet improves Text Document Clustering. In: Proceedings of SIGIR Semantic Web Workshop, pp. 541–544. ACM, New York (2003)
Hu, J., Fang, L., Cao, Y., Zeng, H.J., Li, H., Yang, Q., Chen, Z.: Enhancing Text Clustering by Leveraging Wikipedia Semantics. In: Proceedings of SIGIR, pp. 179–186. ACM, New York (2008)
Huang, A., Milne, D., Frank, E., Witten, I.H.: Clustering Documents with Active Learning using Wikipedia. In: Proceedings of ICDM, pp. 839–844. IEEE, Los Alamitos (2008)
Hyvärinen, A., Karhunen, J., Oja, E.: Independent Component Analysis. Wiley Interscience, Hoboken (2001)
Kolenda, T., Hansen, L.K.: Independent Components in Text. In: Girolami, M. (ed.) Advances in Independent Component Analysis, ch. 13, pp. 235–256. Springer, Heidelberg (2000)
Milne, D., Witten, I.H.: Learning to Link with Wikipedia. In: Proceedings of CIKM, pp. 509–518. ACM, New York (2008)
Milne, D., Witten, I.H.: An Effective, Low-Cost Measure of Semantic Relatedness obtained from Wikipedia Links. In: Proceedings of AAAI Workshop on Wikipedia and Artificial Intelligence (WIKIAI), pp. 25–30. AAAI, Menlo Park (2008)
Minier, Z., Bodo, Z., Csato, L.: Wikipedia-Based Kernels for Text Categorization. In: Proceedings of SYNASC, pp. 157–164. IEEE, Los Alamitos (2007)
Recupero, D.R.: A New Unsupervised Method for Document Clustering by Using WordNet Lexical and Conceptual Relations. Information Retrieval 10, 563–579 (2007)
van Rijsbergen, C.J.: Information Retrieval. Butterworths, London (1979)
Torkkola, K.: Discriminative Features for Document Classification. In: Proceedings of ICPR, pp. 10472–10475. IEEE, Los Alamitos (2002)
Wang, P., Hu, J., Zeng, H.J., Chen, L., Chen, Z.: Improving Text Classification by Using Encyclopedia Knowledge. In: Proceedings of ICDM, pp. 332–341. IEEE, Los Alamitos (2007)
Author information
Authors and Affiliations
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2009 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Huang, A., Milne, D., Frank, E., Witten, I.H. (2009). Clustering Documents Using a Wikipedia-Based Concept Representation. In: Theeramunkong, T., Kijsirikul, B., Cercone, N., Ho, TB. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2009. Lecture Notes in Computer Science(), vol 5476. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-01307-2_62
Download citation
DOI: https://doi.org/10.1007/978-3-642-01307-2_62
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-01306-5
Online ISBN: 978-3-642-01307-2
eBook Packages: Computer ScienceComputer Science (R0)