Abstract
The k-nearest neighbour (k-NN) technique is a simple and intuitively appealing method for classification, owing largely to its interpretable nature. However, choosing an appropriate distance function for k-NN can be challenging, and an inferior choice can make the classifier highly vulnerable to noise in the data. In this paper, we propose a new method for determining a good distance function for k-NN. Our method is based on the area under the Receiver Operating Characteristic (ROC) curve, a well-known measure of the quality of binary classifiers. It computes weights for the distance function from the ROC properties of an appropriate neighbourhood around the instances whose distance is being computed. We experimentally compare the effect of our scheme with a number of other well-known k-NN distance metrics, as well as with a range of different classifiers. Experiments show that our method can substantially boost the classification performance of the k-NN algorithm. Furthermore, in a number of cases our technique even delivers better accuracy than state-of-the-art non-k-NN classifiers, such as support vector machines.
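To make the general idea concrete, the following is a minimal sketch of ROC-based feature weighting for k-NN. It is not the paper's exact algorithm: here each feature's weight is derived from its global AUC (via the Mann-Whitney U statistic), whereas the paper computes weights from ROC properties within a neighbourhood of the instances being compared. All function names are illustrative.

```python
def feature_auc(values, labels):
    """AUC of a single feature used as a scorer of the positive class (label 1),
    computed via the Mann-Whitney U statistic with average ranks for ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0          # average 1-based rank for tied block
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    pos_ranks = [r for r, y in zip(ranks, labels) if y == 1]
    n_pos = len(pos_ranks)
    n_neg = len(labels) - n_pos
    u = sum(pos_ranks) - n_pos * (n_pos + 1) / 2.0
    return u / (n_pos * n_neg)

def roc_weights(X, y):
    """One weight per feature: |2*AUC - 1|, so an uninformative feature
    (AUC near 0.5) contributes little to the distance."""
    return [abs(2.0 * feature_auc(col, y) - 1.0) for col in zip(*X)]

def knn_predict(X_train, y_train, x, k, w):
    """Weighted-Euclidean k-NN majority vote."""
    def dist(a):
        return sum(wi * (ai - xi) ** 2 for wi, ai, xi in zip(w, a, x)) ** 0.5
    nearest = sorted(zip(X_train, y_train), key=lambda t: dist(t[0]))[:k]
    votes = [label for _, label in nearest]
    return max(set(votes), key=votes.count)
```

Under this simplification, a feature that cleanly separates the two classes (AUC near 1) dominates the distance, while a noisy feature (AUC near 0.5) is effectively suppressed, which is the intuition behind using ROC properties to build the metric.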
Copyright information
© 2008 Springer-Verlag Berlin Heidelberg
Cite this paper
Hassan, M.R., Hossain, M.M., Bailey, J., Ramamohanarao, K. (2008). Improving k-Nearest Neighbour Classification with Distance Functions Based on Receiver Operating Characteristics. In: Daelemans, W., Goethals, B., Morik, K. (eds) Machine Learning and Knowledge Discovery in Databases. ECML PKDD 2008. Lecture Notes in Computer Science(), vol 5211. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-87479-9_50
Print ISBN: 978-3-540-87478-2
Online ISBN: 978-3-540-87479-9