Abstract
Using five medical datasets, we examined the influence of missing values on true positive rates and classification accuracy. We randomly marked progressively more values as missing and measured the effect on classification accuracy. Classification was performed with nearest neighbour searching when none, 10%, 20%, 30% or more of the values were missing. We also used discriminant analysis and the naïve Bayesian method for classification. We found that for a two-class dataset, even with as many as 20–30% of values missing, results almost as good as those with no missing values could still be produced. If there are more than two classes, more than 10–20% missing values are probably too many, at least for small classes with relatively few cases. The more classes there are, and the more their sizes differ, the more sensitive a classification task is to missing values. On the other hand, when values are missing not fully at random but according to actual distributions shaped by some selection or other non-random cause, classification can tolerate even high numbers of missing values for some datasets.
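The experimental loop described above (mark a growing fraction of values as missing at random, then classify with nearest neighbour searching) can be illustrated with a minimal Python sketch. Everything specific in it is an assumption made for illustration and is not taken from the article: the function names `mask_at_random`, `distance` and `loo_accuracy` are hypothetical, the distance simply skips attributes missing in either case and rescales to the full attribute count, evaluation is leave-one-out, and synthetic two-class Gaussian data stands in for the medical datasets.

```python
import math
import random


def mask_at_random(rows, fraction, rng):
    """Return a copy of rows with `fraction` of all cells replaced by None."""
    cells = [(i, j) for i, row in enumerate(rows) for j in range(len(row))]
    n_missing = round(fraction * len(cells))
    masked = [list(row) for row in rows]
    for i, j in rng.sample(cells, n_missing):
        masked[i][j] = None
    return masked


def distance(a, b):
    """Euclidean distance over attributes observed in both cases.

    Assumption for this sketch: the sum of squares is rescaled to the full
    attribute count; if no attribute is shared, the distance is infinite.
    """
    shared = [(x - y) ** 2 for x, y in zip(a, b)
              if x is not None and y is not None]
    if not shared:
        return math.inf
    return math.sqrt(sum(shared) * len(a) / len(shared))


def loo_accuracy(rows, labels):
    """Leave-one-out accuracy of a 1-nearest-neighbour classifier."""
    correct = 0
    for i, query in enumerate(rows):
        nearest = min((j for j in range(len(rows)) if j != i),
                      key=lambda j: distance(query, rows[j]))
        correct += labels[nearest] == labels[i]
    return correct / len(rows)


# Synthetic stand-in data: two classes, 30 cases each, 6 numeric attributes.
rng = random.Random(0)
rows = [[rng.gauss(c * 3.0, 1.0) for _ in range(6)]
        for c in (0, 1) for _ in range(30)]
labels = [c for c in (0, 1) for _ in range(30)]

for fraction in (0.0, 0.1, 0.2, 0.3):
    masked = mask_at_random(rows, fraction, rng)
    print(f"{fraction:.0%} missing: accuracy {loo_accuracy(masked, labels):.2f}")
```

Masking here is fully random over all cells, which corresponds only to the random-missingness setting; the non-random case mentioned at the end of the abstract would need a masking rule tied to the data, which this sketch does not attempt.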
Cite this article
Juhola, M., Laurikkala, J. Missing values: how many can they be to preserve classification reliability? Artif Intell Rev 40, 231–245 (2013). https://doi.org/10.1007/s10462-011-9282-2
DOI: https://doi.org/10.1007/s10462-011-9282-2