Abstract
The introduction of hierarchical thesauri (HT) that contain significant semantic information has led researchers to investigate their potential for improving the performance of text classification by extending the traditional “bag of words” representation with syntactic and semantic relationships among words. In this paper we address this problem by proposing a Word Sense Disambiguation (WSD) approach based on the intuition that words occurring close together in a document are also close in the HT graph. We argue that the high precision exhibited by our WSD algorithm on various human-disambiguated benchmark datasets makes it appropriate for the classification task. Moreover, we define a semantic kernel, based on the general concept of Generalized Vector Space Model (GVSM) kernels, that captures the semantic relations contained in the hierarchical thesaurus. Finally, we conduct experiments on various corpora and, using the SVM algorithm, achieve a systematic improvement in classification accuracy, especially when the training set is small.
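To make the idea of a GVSM-style semantic kernel concrete, the following is a minimal sketch and not the kernel defined in the paper. It assumes the general GVSM form k(d1, d2) = d1 P Pᵀ d2ᵀ, where P maps terms to concepts. The vocabulary, the concept set, and the matrix `P` are hypothetical toy values chosen only for illustration.

```python
# Hypothetical toy vocabulary and concept set (assumptions, not from the paper).
TERMS = ["car", "automobile", "banana"]
CONCEPTS = ["vehicle", "fruit"]

# P[i][j] = 1.0 if term i is attached to concept j in the toy thesaurus.
P = [
    [1.0, 0.0],  # car        -> vehicle
    [1.0, 0.0],  # automobile -> vehicle
    [0.0, 1.0],  # banana     -> fruit
]


def bag_of_words(tokens):
    """Return raw term counts over TERMS for a tokenized document."""
    return [float(tokens.count(term)) for term in TERMS]


def project(doc_vec, term_concept):
    """Map a term-space vector into concept space (doc_vec times P)."""
    n_concepts = len(term_concept[0])
    return [
        sum(doc_vec[i] * term_concept[i][j] for i in range(len(doc_vec)))
        for j in range(n_concepts)
    ]


def linear_kernel(d1, d2):
    """Plain bag-of-words inner product."""
    return sum(a * b for a, b in zip(d1, d2))


def gvsm_kernel(d1, d2, term_concept):
    """GVSM-style kernel: inner product of the concept-space projections."""
    return linear_kernel(project(d1, term_concept), project(d2, term_concept))


d_car = bag_of_words(["car"])
d_auto = bag_of_words(["automobile"])
d_banana = bag_of_words(["banana"])
```

In this toy setting the documents "car" and "automobile" share no terms, so their linear kernel value is zero, while the GVSM-style kernel is positive because both terms map to the same concept. A kernel of this form is positive semidefinite by construction, which is what allows it to be used with an SVM.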
Copyright information
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Mavroeidis, D., Tsatsaronis, G., Vazirgiannis, M., Theobald, M., Weikum, G. (2005). Word Sense Disambiguation for Exploiting Hierarchical Thesauri in Text Classification. In: Jorge, A.M., Torgo, L., Brazdil, P., Camacho, R., Gama, J. (eds) Knowledge Discovery in Databases: PKDD 2005. PKDD 2005. Lecture Notes in Computer Science, vol 3721. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11564126_21
DOI: https://doi.org/10.1007/11564126_21
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-29244-9
Online ISBN: 978-3-540-31665-7
eBook Packages: Computer Science, Computer Science (R0)