Missing values: how many can they be to preserve classification reliability?

Juhola, Martti; Laurikkala, Jorma

doi:10.1007/s10462-011-9282-2

Published: 21 August 2011

Volume 40, pages 231–245 (2013)

Artificial Intelligence Review

Martti Juhola¹ & Jorma Laurikkala¹

Abstract

Using five medical datasets, we examined the influence of missing values on true positive rates and classification accuracy. We randomly marked an increasing proportion of values as missing and tested the effect on classification accuracy. Classification was performed with nearest neighbour searching when 0, 10, 20, 30% or more of the values were missing. We also used discriminant analysis and the naïve Bayesian method for classification. We found that for a two-class dataset, results almost as good as those obtained with no missing values could still be produced even with 20–30% of values missing. If there are more than two classes, more than 10–20% missing values are probably too many, at least for small classes with relatively few cases. The more classes there are, and the more they differ in size, the more sensitive a classification task is to missing values. On the other hand, when values are missing not fully at random but according to actual distributions shaped by some selection or other non-random cause, classification can tolerate even high numbers of missing values for some datasets.
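
The experimental procedure summarised in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The function names, the synthetic two-class data, the leave-one-out protocol and the rule of skipping attributes that are missing in either case are all assumptions made for illustration. The sketch marks a chosen fraction of values as missing completely at random, classifies each case by its nearest neighbour, and reports accuracy together with the true positive rate of each class.

```python
import random
from collections import defaultdict


def mask_values(rows, fraction, rng):
    """Return a copy of rows with about `fraction` of all values set to None."""
    cells = [(i, j) for i, row in enumerate(rows) for j in range(len(row))]
    n_missing = round(fraction * len(cells))
    masked = [list(row) for row in rows]
    for i, j in rng.sample(cells, n_missing):
        masked[i][j] = None
    return masked


def distance(a, b):
    """Mean squared difference over attributes observed in both cases.

    Assumption: attributes missing in either case are skipped, and a pair
    with no shared attributes is treated as infinitely distant.
    """
    shared = [(x - y) ** 2 for x, y in zip(a, b)
              if x is not None and y is not None]
    return sum(shared) / len(shared) if shared else float("inf")


def leave_one_out(rows, labels):
    """1-nearest-neighbour leave-one-out: returns (accuracy, TPR per class)."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for i, (row, label) in enumerate(zip(rows, labels)):
        # Nearest other case under the missing-aware distance.
        nearest = min((k for k in range(len(rows)) if k != i),
                      key=lambda k: distance(row, rows[k]))
        totals[label] += 1
        hits[label] += int(labels[nearest] == label)
    accuracy = sum(hits.values()) / len(rows)
    tpr = {c: hits[c] / totals[c] for c in totals}
    return accuracy, tpr


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical synthetic data: two classes, four numeric attributes.
    rows = ([[rng.gauss(0, 1) for _ in range(4)] for _ in range(20)]
            + [[rng.gauss(3, 1) for _ in range(4)] for _ in range(20)])
    labels = [0] * 20 + [1] * 20
    for fraction in (0.0, 0.1, 0.2, 0.3):
        acc, tpr = leave_one_out(mask_values(rows, fraction, rng), labels)
        print(f"{fraction:.0%} missing: accuracy={acc:.2f}, TPR={tpr}")
```

Reporting the true positive rate per class, and not only overall accuracy, matters for the multi-class case described above, since a small class can lose most of its correct classifications while overall accuracy changes little.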

Author information

Authors and Affiliations

Department of Computer Sciences, University of Tampere, 33014, Tampere, Finland
Martti Juhola & Jorma Laurikkala

Corresponding author

Correspondence to Martti Juhola.

About this article

Cite this article

Juhola, M., Laurikkala, J. Missing values: how many can they be to preserve classification reliability? Artif Intell Rev 40, 231–245 (2013). https://doi.org/10.1007/s10462-011-9282-2

Issue Date: October 2013
