Abstract
Previous research on advanced representations for document retrieval has shown that state-of-the-art statistical models are not improved by a variety of linguistic representations. Phrases, word senses and syntactic relations derived by Natural Language Processing (NLP) techniques were found to be ineffective at increasing retrieval accuracy. For Text Categorization (TC), fewer and less definitive studies on the use of advanced document representations are available, as it is a relatively new research area (compared to document retrieval).
In this paper, advanced document representations are investigated. Extensive experimentation with two representative classifiers, Rocchio and SVM, together with a careful analysis of the literature, was carried out to study how some NLP techniques used for indexing impact TC. Cross validation over four different corpora in two languages allowed us to gather overwhelming evidence that complex nominals, proper nouns and word senses are not adequate to improve TC accuracy.
Copyright information
© 2004 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Moschitti, A., Basili, R. (2004). Complex Linguistic Features for Text Classification: A Comprehensive Study. In: McDonald, S., Tait, J. (eds) Advances in Information Retrieval. ECIR 2004. Lecture Notes in Computer Science, vol 2997. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-24752-4_14
DOI: https://doi.org/10.1007/978-3-540-24752-4_14
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-21382-6
Online ISBN: 978-3-540-24752-4
eBook Packages: Springer Book Archive