Abstract
Text categorization presents unique challenges: the data sets contain a large number of attributes and training samples, the attributes are interdependent, and the categories are multi-modal. Existing classification techniques have limited applicability to data sets of this nature. In this paper, we present Weight Adjusted k-Nearest Neighbor (WAKNN) classification, which learns feature weights using a greedy hill-climbing technique. We also present two performance optimizations of WAKNN that improve its computational performance by a few orders of magnitude without compromising classification quality. We experimentally evaluated WAKNN on 52 document data sets from a variety of domains and compared its performance against several classification algorithms, such as C4.5, RIPPER, Naive Bayesian, PEBLS, and VSM. Experimental results on these data sets confirm that WAKNN consistently outperforms the other classification algorithms.
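The abstract does not spell out the weight-learning procedure, so the following is only a minimal sketch of the general idea (a k-nearest-neighbor classifier whose per-feature weights are tuned by greedy hill climbing), not the authors' algorithm or their optimizations. Everything specific here is an assumption made for illustration: the weighted cosine similarity, the leave-one-out accuracy objective, the multiplicative step factors 0.2 and 5.0, and all function names.

```python
import math


def weighted_cosine(x, y, w):
    """Cosine similarity after scaling each feature by its weight (assumed measure)."""
    dot = sum(wi * wi * xi * yi for xi, yi, wi in zip(x, y, w))
    nx = math.sqrt(sum((wi * xi) ** 2 for xi, wi in zip(x, w)))
    ny = math.sqrt(sum((wi * yi) ** 2 for yi, wi in zip(y, w)))
    return dot / (nx * ny) if nx and ny else 0.0


def knn_predict(query, docs, labels, w, k, skip=None):
    """Similarity-weighted vote among the k most similar training documents.

    `skip` excludes one training index, which enables leave-one-out evaluation.
    """
    sims = [(weighted_cosine(query, d, w), labels[i])
            for i, d in enumerate(docs) if i != skip]
    sims.sort(key=lambda pair: pair[0], reverse=True)
    votes = {}
    for sim, label in sims[:k]:
        votes[label] = votes.get(label, 0.0) + sim
    return max(votes, key=votes.get)


def loo_accuracy(docs, labels, w, k):
    """Leave-one-out accuracy on the training set (assumed objective function)."""
    hits = sum(knn_predict(d, docs, labels, w, k, skip=i) == labels[i]
               for i, d in enumerate(docs))
    return hits / len(docs)


def hill_climb_weights(docs, labels, k=3, factors=(0.2, 5.0), max_rounds=10):
    """Greedy hill climbing over feature weights.

    Each round tries scaling every single weight by every factor, keeps the
    one change that improves the objective the most, and stops as soon as
    no single change improves it.
    """
    w = [1.0] * len(docs[0])
    best = loo_accuracy(docs, labels, w, k)
    for _ in range(max_rounds):
        best_move = None
        for j in range(len(w)):
            for f in factors:
                trial = w[:]
                trial[j] *= f
                score = loo_accuracy(docs, labels, trial, k)
                if score > best:
                    best, best_move = score, trial
        if best_move is None:
            break  # local optimum: no single-weight change helps
        w = best_move
    return w, best
```

Calling `hill_climb_weights(docs, labels, k=1)` on a small list of term-frequency vectors returns the learned weights together with their leave-one-out accuracy, which by construction is never lower than the accuracy with uniform weights. Each round re-evaluates the classifier once per feature and factor, so a naive search like this one is expensive on large vocabularies.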
© 2001 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Han, E.H., Karypis, G., Kumar, V. (2001). Text Categorization Using Weight Adjusted k-Nearest Neighbor Classification. In: Cheung, D., Williams, G.J., Li, Q. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2001. Lecture Notes in Computer Science, vol 2035. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45357-1_9
DOI: https://doi.org/10.1007/3-540-45357-1_9
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-41910-5
Online ISBN: 978-3-540-45357-4