Abstract
The Winnow family of learning algorithms copes well with large numbers of features and is tolerant of variations in document length, which makes it suitable for classifying large collections of large documents, such as patent applications.
Both the large size of the documents and the large number of available training documents for each class make this classification task qualitatively different from the classification of short documents (newspaper articles or medical abstracts) with few training examples, as exemplified by the TREC evaluations.
This note describes recent experiments with Winnow on two large corpora of patent applications supplied by the European Patent Office (EPO). We find that the multi-classification of patent applications is much less accurate than the mono-classification of similar documents. We describe a potential pitfall in multi-classification and show ways to improve its accuracy. We argue that the inherently greater noise in multi-class labeling is the reason that multi-classification is harder than mono-classification.
Copyright information
© 2004 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Koster, C.H.A., Seutter, M., Beney, J. (2004). Multi-classification of Patent Applications with Winnow. In: Broy, M., Zamulin, A.V. (eds) Perspectives of System Informatics. PSI 2003. Lecture Notes in Computer Science, vol 2890. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-39866-0_53
DOI: https://doi.org/10.1007/978-3-540-39866-0_53
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-20813-6
Online ISBN: 978-3-540-39866-0
eBook Packages: Springer Book Archive