Abstract
The presence of missing values in a dataset can affect the performance of a classifier constructed using that dataset as a training sample. Several methods have been proposed to treat missing data; the most frequently used one deletes every instance that contains at least one missing feature value. In this paper we carry out experiments with twelve datasets to evaluate the effect on the misclassification error rate of four methods for dealing with missing values: case deletion, mean imputation, median imputation, and KNN (k-nearest-neighbor) imputation. The classifiers considered were Linear Discriminant Analysis (LDA), which is a parametric classifier, and the KNN classifier, which is nonparametric.
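To make the four treatments concrete, the following is a minimal sketch in plain Python. It is not the authors' implementation: the encoding of missing entries as `None`, the function names, the toy data, and the details of the KNN variant (Euclidean distance over jointly observed features, mean of the donors' values) are all assumptions chosen for illustration.

```python
import math
from statistics import mean, median

MISSING = None  # assumption: missing entries are encoded as None


def case_deletion(rows):
    """Drop every instance that has at least one missing feature value."""
    return [row for row in rows if all(v is not MISSING for v in row)]


def column_impute(rows, stat):
    """Replace each missing value with a statistic (e.g. mean or median)
    of the observed values in the same feature column."""
    n_features = len(rows[0])
    fill = []
    for j in range(n_features):
        observed = [row[j] for row in rows if row[j] is not MISSING]
        fill.append(stat(observed))
    return [[fill[j] if v is MISSING else v for j, v in enumerate(row)]
            for row in rows]


def knn_impute(rows, k=2):
    """Fill each missing value with the mean of that feature over the k
    nearest instances in which the feature is observed (hypothetical variant)."""

    def distance(a, b):
        # Euclidean distance restricted to features observed in both rows,
        # normalised by the number of such features.
        shared = [(x, y) for x, y in zip(a, b)
                  if x is not MISSING and y is not MISSING]
        if not shared:
            return math.inf
        return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    result = []
    for i, row in enumerate(rows):
        new_row = list(row)
        for j, v in enumerate(row):
            if v is not MISSING:
                continue
            donors = [other for m, other in enumerate(rows)
                      if m != i and other[j] is not MISSING]
            donors.sort(key=lambda other: distance(row, other))
            nearest = donors[:k]
            if nearest:  # leave the gap if no instance can donate a value
                new_row[j] = mean(d[j] for d in nearest)
        result.append(new_row)
    return result


# Toy data (made up for illustration): the second instance lacks feature 1.
rows = [
    [1.0, 2.0],
    [2.0, None],
    [3.0, 6.0],
    [10.0, 20.0],
]

deleted = case_deletion(rows)
mean_filled = column_impute(rows, mean)
median_filled = column_impute(rows, median)
knn_filled = knn_impute(rows, k=2)
```

On this toy data, case deletion keeps three of the four instances, median imputation fills the gap with 6.0, and the KNN sketch fills it with 4.0, the average over the two closest instances. The mean-imputed value is pulled upward by the outlying fourth instance, which illustrates why the choice of treatment can matter to a downstream classifier.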
© 2004 Springer-Verlag Berlin Heidelberg
Cite this paper
Acuña, E., Rodriguez, C. (2004). The Treatment of Missing Values and its Effect on Classifier Accuracy. In: Banks, D., McMorris, F.R., Arabie, P., Gaul, W. (eds) Classification, Clustering, and Data Mining Applications. Studies in Classification, Data Analysis, and Knowledge Organisation. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-17103-1_60
DOI: https://doi.org/10.1007/978-3-642-17103-1_60
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-22014-5
Online ISBN: 978-3-642-17103-1
eBook Packages: Springer Book Archive