Abstract
This paper introduces a new criterion for term selection, based on the notion of uncertainty. Term selection according to this criterion proceeds by eliminating noisy terms on a class-by-class basis, rather than by selecting the most significant ones. Uncertainty-based term selection (UC) is compared with a number of other criteria, such as Information Gain (IG), simplified χ² (SX), Term Frequency (TF) and Document Frequency (DF), in a text categorization setting. Experiments on data sets with different properties (Reuters-21578, patent abstracts and patent applications) and with two different algorithms (Winnow and Rocchio) show that UC is not the most aggressive term selection criterion, but that its effect is quite stable across data sets and algorithms. This makes it a good candidate for a general “install-and-forget” term selection mechanism. We also describe and evaluate a hybrid term selection technique, which first applies UC to eliminate noisy terms and then uses another criterion to select the best terms.
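The two-stage hybrid idea (first eliminate noisy terms for a class, then rank the survivors with a second criterion) can be illustrated with a minimal Python sketch. The abstract does not define the uncertainty measure, so the noise test is passed in as a hypothetical predicate `is_noisy`; the `low_df` stand-in used here is a simple document-frequency cutoff and is not the UC criterion. Information Gain is used as the second stage, and documents are assumed to be sets of terms with binary labels for a single class. The names `hybrid_select`, `is_noisy` and `low_df` are illustrative only.

```python
import math


def entropy(pos, neg):
    """Entropy in bits of a two-way split with the given counts."""
    total = pos + neg
    h = 0.0
    for n in (pos, neg):
        if n:
            p = n / total
            h -= p * math.log2(p)
    return h


def information_gain(docs, labels, term):
    """Information Gain of a term's presence/absence for one class.

    docs   -- list of sets of terms
    labels -- list of 0/1 flags: does the document belong to the class?
    """
    n = len(docs)
    pos = sum(labels)
    with_t = [y for d, y in zip(docs, labels) if term in d]
    without_t = [y for d, y in zip(docs, labels) if term not in d]
    h_cond = 0.0
    for part in (with_t, without_t):
        if part:
            p = sum(part)
            h_cond += len(part) / n * entropy(p, len(part) - p)
    return entropy(pos, n - pos) - h_cond


def hybrid_select(docs, labels, is_noisy, k):
    """Stage 1: drop terms flagged as noisy. Stage 2: keep the top k by IG."""
    vocab = set().union(*docs)
    kept = [t for t in vocab if not is_noisy(t, docs, labels)]
    # Sort by decreasing IG; ties are broken alphabetically for determinism.
    kept.sort(key=lambda t: (-information_gain(docs, labels, t), t))
    return kept[:k]


def low_df(term, docs, labels, min_df=2):
    """Placeholder noise test (assumption): NOT the paper's uncertainty measure."""
    return sum(term in d for d in docs) < min_df


docs = [
    {"patent", "claim"},
    {"patent", "device"},
    {"market", "stock"},
    {"market", "claim"},
]
labels = [1, 1, 0, 0]  # membership of each document in one class
print(hybrid_select(docs, labels, low_df, k=2))  # ['market', 'patent']
```

For a multi-class collection, the selection would be run once per class with that class's binary labels, in line with the class-by-class elimination described above.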
Copyright information
© 2002 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Peters, C., Koster, C.H.A. (2002). Uncertainty-Based Noise Reduction and Term Selection in Text Categorization. In: Crestani, F., Girolami, M., van Rijsbergen, C.J. (eds) Advances in Information Retrieval. ECIR 2002. Lecture Notes in Computer Science, vol 2291. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45886-7_17
DOI: https://doi.org/10.1007/3-540-45886-7_17
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-43343-9
Online ISBN: 978-3-540-45886-9
eBook Packages: Springer Book Archive