Abstract
The ability to cheaply train text classifiers is critical to their use in information retrieval, content analysis, natural language processing, and other tasks involving data that is partly or fully textual. An algorithm for sequential sampling during machine learning of statistical classifiers was developed and tested on a newswire text categorization task. This method, which we call uncertainty sampling, reduced the amount of training data that would have to be manually classified to achieve a given level of effectiveness by as much as 500-fold.
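The abstract does not give the selection rule or the classifier, so the following is only a minimal sketch of the general idea, under stated assumptions: a probabilistic classifier is trained on a small labeled seed set, the unlabeled items whose predicted class probability is closest to 0.5 are sent to a human for labeling, and the classifier is retrained. The one-feature logistic model, the function names, and the parameters (`seed`, `rounds`, `batch_size`) are illustrative choices, not taken from the paper.

```python
import math

def predict_proba(model, x):
    """Probability of the positive class under a one-feature logistic model."""
    w, b = model
    return 1.0 / (1.0 + math.exp(-(w * x + b)))

def fit_logistic(xs, ys, steps=500, lr=0.5):
    """Fit p(y=1|x) = sigmoid(w*x + b) by plain gradient descent (assumed learner)."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        gw = gb = 0.0
        for x, y in zip(xs, ys):
            err = predict_proba((w, b), x) - y
            gw += err * x
            gb += err
        w -= lr * gw / n
        b -= lr * gb / n
    return w, b

def most_uncertain(model, pool, candidates, batch_size):
    """Pick the candidate indices whose predicted probability is closest to 0.5."""
    ranked = sorted(candidates, key=lambda i: abs(predict_proba(model, pool[i]) - 0.5))
    return ranked[:batch_size]

def uncertainty_sampling(pool, oracle, seed, rounds, batch_size):
    """Alternate between training and querying labels for the least certain items."""
    labeled = {i: oracle(pool[i]) for i in seed}  # oracle stands in for a human labeler

    def refit():
        idx = sorted(labeled)
        return fit_logistic([pool[i] for i in idx], [labeled[i] for i in idx])

    model = refit()
    for _ in range(rounds):
        unlabeled = [i for i in range(len(pool)) if i not in labeled]
        if not unlabeled:
            break
        for i in most_uncertain(model, pool, unlabeled, batch_size):
            labeled[i] = oracle(pool[i])
        model = refit()
    return model, labeled

# Toy usage: 101 points in [-1, 1]; the hypothetical oracle marks x > 0.2 as positive.
pool = [i / 50 - 1 for i in range(101)]
oracle = lambda x: 1 if x > 0.2 else 0
model, labeled = uncertainty_sampling(pool, oracle, seed=[0, 100], rounds=5, batch_size=2)
print(len(labeled), "of", len(pool), "items were labeled")
```

In this toy run only the seed items plus two queried items per round receive labels. The rest of the pool is never shown to the oracle, which is the source of the labeling savings the abstract describes.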
Copyright information
© 1994 Springer-Verlag London Limited
About this paper
Cite this paper
Lewis, D.D., Gale, W.A. (1994). A Sequential Algorithm for Training Text Classifiers. In: Croft, B.W., van Rijsbergen, C.J. (eds) SIGIR ’94. Springer, London. https://doi.org/10.1007/978-1-4471-2099-5_1
DOI: https://doi.org/10.1007/978-1-4471-2099-5_1
Publisher Name: Springer, London
Print ISBN: 978-3-540-19889-5
Online ISBN: 978-1-4471-2099-5
eBook Packages: Springer Book Archive