Abstract
This paper compares the suitability of different document representations for automatic document classification, investigating a whole range of representations between bag-of-words and bag-of-phrases. We examine some of their statistical properties and determine, for each representation, the optimal choice of classification parameters and the effect of term selection.
Phrases are represented by an abstraction called Head/Modifier (HM) pairs. Rather than simply throwing phrases and keywords together, we start with pure HM pairs and gradually add more keywords to the document representation. We use classification on keywords as the baseline and compare it with the contribution of the pure HM pairs to classification accuracy, and with the incremental contributions from heads and modifiers. Finally, we measure the accuracy achieved with all words and all HM pairs combined, which turns out to be only marginally above the baseline.
We conclude that even the most careful term selection cannot overcome the differences in document frequency between phrases and words, and we propose the use of term clustering to make phrases more cooperative.
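The range of representations described above, and the gap in document frequency between phrases and words, can be illustrated with a minimal sketch. It is not the authors' implementation: the toy corpus, the `[head,modifier]` string notation, and the helper names `represent` and `document_frequency` are assumptions made for illustration, and the HM pairs are written by hand rather than produced by a parser.

```python
from collections import Counter

# Hypothetical toy corpus: each document is a hand-written list of
# (head, modifier) pairs. A real system would obtain these from a parser.
DOCS = [
    [("classification", "document"), ("selection", "term")],
    [("classification", "text"), ("selection", "feature")],
    [("retrieval", "document"), ("clustering", "term")],
]


def represent(pairs, with_heads=False, with_modifiers=False):
    """Return the set of terms for one document.

    Pure HM pairs are always included; heads and modifiers can be
    added as separate keyword terms.
    """
    terms = {f"[{head},{modifier}]" for head, modifier in pairs}
    if with_heads:
        terms |= {head for head, _ in pairs}
    if with_modifiers:
        terms |= {modifier for _, modifier in pairs}
    return terms


def document_frequency(docs, **options):
    """Count, for every term, the number of documents that contain it."""
    df = Counter()
    for pairs in docs:
        df.update(represent(pairs, **options))
    return df


# Pure HM pairs only, then HM pairs plus heads and modifiers as keywords.
pure_df = document_frequency(DOCS)
full_df = document_frequency(DOCS, with_heads=True, with_modifiers=True)

for term, count in sorted(full_df.items(), key=lambda item: (-item[1], item[0])):
    print(f"{count}  {term}")
```

In this toy corpus every HM pair occurs in a single document, while words such as `classification` recur across documents. This is the kind of document frequency difference between phrases and words that the abstract refers to, shown here only at toy scale.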
© 2003 Springer-Verlag Berlin Heidelberg
Cite this paper
Koster, C.H.A., Seutter, M. (2003). Taming Wild Phrases. In: Sebastiani, F. (eds) Advances in Information Retrieval. ECIR 2003. Lecture Notes in Computer Science, vol 2633. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-36618-0_12
DOI: https://doi.org/10.1007/3-540-36618-0_12
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-01274-0
Online ISBN: 978-3-540-36618-8
eBook Packages: Springer Book Archive