1 Introduction

Many of the multivariate data sets collected today have unobserved or missing values scattered throughout, often with no particular pattern of occurrence. Despite the frequent occurrence of missing data, many machine learning algorithms assume that there is no particular significance in the fact that an observation has an attribute value missing: the value is simply unknown, and the missing value is handled in a simple way.

With many classification algorithms, a common approach is to replace the missing values in the data set with some plausible value and then analyse the resulting completed data set using standard algorithms. The procedure of replacing the missing values with some value is known as imputation.

Some algorithms treat a missing attribute as a value in its own right, whilst other classifiers, such as Naïve Bayes, ignore the missing data: if an attribute is missing, the likelihood is calculated from the observed attributes, and there is no need to impute a value. Decision trees, such as C4.5 and J48 [12, 13, 23], cope with a missing value for an observation by notionally splitting the observation into pieces with a weighted split and sending part down each branch of the tree.

However, the effect the treatment of missing values has on the performance of classifiers is not well understood. The estimation of the missing values can introduce additional biases into the data depending on the imputation method used and affect the classification of the observations.

This paper analyses the effect that several commonly used methods of imputation have on the accuracy of classification when classifying data that has a known classification. In Sect. 2, we review the mechanisms that can lead to data being missing, and in Sect. 3, we review the basic strategies for handling missing data. In Sect. 4, the imputation methods considered in this paper are examined, and in Sect. 5, the effect the imputation methods have on the classification accuracy of several data sets is assessed.

2 Missing Data Mechanisms

Rubin [16] proposed treating missing data indicators as random variables and assigning them a distribution. Depending on the distribution of the indicator, Rubin [16] defined three basic mechanisms:

  1.

    Missing completely at random (MCAR). If the missingness does not depend on the values of the data, either missing or observed, then the data are MCAR.

  2.

    Missing at random (MAR). If the missingness depends only on the data that are observed but not on the components that are missing, the data are MAR.

  3.

    Not missing at random (NMAR). If the distribution of the missing data depends on missing values in the data matrix, then the mechanism is NMAR.

Knowledge of the mechanism that led to the values being missing is important in choosing an appropriate analysis to use for the data [10]. Hence it is important to consider how the classifier handles the missing data to avoid bias being introduced into the knowledge induced from that classifier.

3 Strategies for Handling Missing Data

There are several basic strategies that can be used to deal with missing data in classification studies. Some of these methods were developed in the context of sample surveys and can have some disadvantages in classification.

3.1 Complete Case Analysis

Complete case analysis (also known as elimination) is an approach in which observations that have any missing attributes are deleted from the data set. This strategy may be satisfactory with small amounts of missing data; however, with large amounts of missing data, considerable sample size can be lost. The critical concern with this strategy is that it can lead to biased estimates, as it requires the assumption that the complete cases are a random subsample of the original observations. The completely recorded cases frequently differ from the original sample.
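
A minimal sketch of complete case analysis (the rows below are illustrative): any row with a missing attribute, represented here by None, is deleted before analysis.

```python
rows = [
    [5.1, 3.5, 1.4],
    [4.9, None, 1.4],   # dropped: second attribute missing
    [4.7, 3.2, None],   # dropped: third attribute missing
    [4.6, 3.1, 1.5],
]

# Keep only the rows in which every attribute is observed.
complete_cases = [r for r in rows if all(v is not None for v in r)]
```

Here half the sample is discarded even though only two individual values are missing, which illustrates how quickly the sample size can shrink.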

3.2 Available Case Analysis

Available case analysis is another approach that can be used. As this procedure uses all observations that have values for a particular attribute, no observed values are discarded. However, the sample base changes from attribute to attribute depending on the pattern of missing data, and hence any statistics calculated can be based on different numbers of observations. The main disadvantage of this approach is that it can lead to covariance and correlation matrices that are not positive definite; see, for example, [7]. This approach is used, for example, by Bayesian classifiers.
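
The shifting sample base can be seen in a short sketch (our own illustrative data): each attribute's mean is computed from whatever observations record that attribute, so different statistics can rest on different subsets of rows.

```python
rows = [
    [5.1, 3.5],
    [4.9, None],
    [None, 3.2],
    [4.6, 3.1],
]

def available_mean(rows, j):
    """Mean of attribute j over the observations that record it."""
    vals = [r[j] for r in rows if r[j] is not None]
    return sum(vals) / len(vals)

mean_0 = available_mean(rows, 0)  # based on rows 1, 3 and 4
mean_1 = available_mean(rows, 1)  # based on a *different* set of rows
```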

3.3 Weighting Procedures

Weighting procedures are another approach to dealing with missing data and are frequently used in the analysis of survey data. In survey data, the sampled units are weighted by their design weight, which is inversely proportional to the probability of selection. Weighting procedures for non-response modify these weights in an attempt to adjust for non-response as if it were part of the sample design.

3.4 Imputation Procedures

Imputation procedures, in which the missing data values are replaced with some value, are another commonly used strategy for dealing with missing values. These procedures result in a hypothetical ‘complete’ data set that causes no problems for the analysis. Many machine learning algorithms are designed to use either a complete case analysis or an imputation procedure.

Imputation methods often involve replacing the missing values with estimated values based on information that is in the data set. Many imputation methods are restricted to coping with one type of variable (i.e. either categorical or continuous) and make assumptions about the distribution of the data or of subsets of variables. The performance of classifiers with imputed data is unreliable, and it is hard to distinguish situations in which the methods work from those in which they fail. When imputation is used, it is easy to forget that the data are incomplete [6]. Nevertheless, imputation methods are commonly used in classification algorithms, and many options are available for imputation.

Imputation using a model-based approach is another popular strategy for handling missing data. A predictive model is created to estimate the values to be imputed for the missing values. With regression imputation, the attribute with missing data is used as the response attribute, and the remaining attributes are used as input for the predictive model. Maximum likelihood estimation using the EM algorithm [5] is one of the recommended missing data techniques in the methodological literature. This method assumes that the underlying model for the observed data is Gaussian.

Rather than imputing a single value for each missing data value, multiple imputation procedures are also commonly used. With this method, the missing values are imputed with values drawn randomly (with replacement) from a fitted distribution for that attribute, and this is repeated N times. The classifier is applied to each of the N “complete” data sets and the misclassification error is calculated. The misclassification error rates are averaged to provide a single misclassification error estimate, and their spread provides an estimate of the variance of the error rate.
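
The procedure above can be sketched as follows. This is a simplified illustration: the missing values are drawn from the observed values of the attribute rather than a fitted distribution, and `classifier_error` is a hypothetical stand-in for training and scoring a real classifier.

```python
import random
import statistics

random.seed(1)

column = [2.0, None, 3.0, None, 5.0, 7.0]
observed = [v for v in column if v is not None]

def impute_once(col, pool):
    """One imputation: fill each gap with a random draw (with
    replacement) from the pool of observed values."""
    return [random.choice(pool) if v is None else v for v in col]

def classifier_error(col):
    """Hypothetical placeholder for a real classifier's error."""
    return abs(statistics.mean(col) - 4.0)

# Repeat N times, score each completed data set, then pool the results.
N = 10
errors = [classifier_error(impute_once(column, observed)) for _ in range(N)]
error_estimate = statistics.mean(errors)      # single pooled estimate
error_variance = statistics.variance(errors)  # between-imputation spread
```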

Iterative regression imputation is not restricted to data having a multivariate normal distribution and can cope with mixed data. For the estimation, regression methods are usually applied in an iterative manner where each iteration uses one variable as an outcome and the remaining variables as predictors. If the outcome has any missing values, the predicted values from the regression are imputed. Iterations end when all variables in the data frame have served as an outcome.

4 Methods Used to Deal with Missing Values

The methods used in this paper for imputing the missing values are now described.

4.1 Mean and Median Imputation

Imputation of the missing value by the mean, median or mode of the attribute is a commonly used approach. These types of imputation ignore any relationships between the variables. It is well known that mean imputation underestimates the variance-covariance matrix of the data [17]. The authors of [10] also point out that with mean imputation the distribution of the “new values” is an incorrect representation of the population values, as the shape of the distribution is distorted by adding values at the mean. Mean and median imputation can only be used on continuous attributes; for categorical attributes, the mode is often imputed while mean or median imputation is used for the continuous attributes.
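
A minimal sketch of these imputations (the values are illustrative): the mean or median fills gaps in a continuous attribute, and the mode fills gaps in a categorical one.

```python
import statistics

heights = [1.62, None, 1.75, 1.80, None, 1.68]   # continuous attribute
colours = ["red", "blue", None, "red", "red"]     # categorical attribute

def impute_numeric(col, stat=statistics.mean):
    """Fill every gap with one statistic of the observed values."""
    fill = stat([v for v in col if v is not None])
    return [fill if v is None else v for v in col]

def impute_mode(col):
    """Fill every gap with the most frequent observed value."""
    fill = statistics.mode([v for v in col if v is not None])
    return [fill if v is None else v for v in col]

mean_filled = impute_numeric(heights)                       # fills with 1.7125
median_filled = impute_numeric(heights, statistics.median)  # fills with 1.715
mode_filled = impute_mode(colours)                          # fills with "red"
```

Note how every gap receives the same value, which is why the shape of the distribution is distorted by a spike at the mean (or median).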

4.2 Hot Deck Imputation

Hot deck imputation is another imputation method that is commonly used, especially in survey samples, and it can cope with both continuous and categorical attributes. Hot deck imputation involves replacing the missing values using values from one or more similar instances that are in the same classification group. There are various forms of hot deck imputation commonly used. Random hot deck imputation involves replacing the missing value with a randomly selected value from the pool of potential donor values. Other methods known as deterministic hot deck imputation involve replacing the missing values with those from a single donor, often the nearest neighbour that is determined using some distance measure. Hot deck imputation has an advantage in that it does not rely on model fitting for the missing value that is to be imputed and thus is potentially less sensitive to model misspecification than an imputation method based on a parametric model. Further details on hot deck imputation can be found, for example, in [2].
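
Both variants can be sketched briefly. The records, groups and the use of age as the matching attribute below are our own illustrative choices.

```python
import random

random.seed(2)

records = [
    {"group": "A", "age": 30, "income": 40.0},
    {"group": "A", "age": 32, "income": 42.0},
    {"group": "A", "age": 55, "income": None},   # recipient with a gap
    {"group": "B", "age": 54, "income": 90.0},   # other group: not a donor
]

def donors(records, group):
    """Potential donors: same classification group, value observed."""
    return [r for r in records
            if r["group"] == group and r["income"] is not None]

def random_hot_deck(recipient, records):
    """Random hot deck: a randomly selected donor value."""
    return random.choice(donors(records, recipient["group"]))["income"]

def nearest_hot_deck(recipient, records):
    """Deterministic hot deck: the nearest-neighbour donor on age."""
    pool = donors(records, recipient["group"])
    return min(pool, key=lambda r: abs(r["age"] - recipient["age"]))["income"]

recipient = records[2]
imputed_value = nearest_hot_deck(recipient, records)   # donor aged 32
```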

4.3 kth Nearest Neighbour Imputation

The kth nearest neighbour algorithm is another method that can be used for imputation of missing values. This approach can predict both categorical and continuous attributes, can easily handle observations that have multiple missing values and takes the correlation structure of the data into account. The algorithm requires the specification of the number of neighbours, k, and of the distance function to be used, and it searches through the entire data set for the most similar instances.
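
A sketch of kNN imputation for a continuous attribute (Euclidean distance and the averaging of neighbour values are common choices, but other distance functions and aggregations are possible):

```python
import math

rows = [
    [1.0, 2.0, 3.0],
    [1.1, 2.1, 3.2],
    [9.0, 9.0, 9.5],
    [1.05, 2.05, None],   # row to impute: third attribute missing
]

def knn_impute(target, rows, k=2):
    """Fill target's gaps with the mean of its k nearest complete rows,
    measuring distance over the attributes target has observed."""
    observed = [j for j, v in enumerate(target) if v is not None]
    missing = [j for j, v in enumerate(target) if v is None]
    donors = [r for r in rows if all(v is not None for v in r)]

    def dist(r):
        return math.sqrt(sum((r[j] - target[j]) ** 2 for j in observed))

    neighbours = sorted(donors, key=dist)[:k]
    filled = list(target)
    for j in missing:
        filled[j] = sum(r[j] for r in neighbours) / k
    return filled

imputed_row = knn_impute(rows[3], rows)   # neighbours are the first two rows
```

For a categorical attribute, a majority vote over the k neighbours would replace the mean.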

4.4 Iterative Model-Based Imputation

EM-based stepwise regression imputation was proposed by Templ et al. [20] as a method for handling missing data. This technique is an iterative model-based imputation (IRMI) that uses standard and robust methods and has the advantage that it can cope with mixed data. In the first step of the algorithm, the missing values are initialised using either mean or kNN imputation. The attributes are sorted in decreasing order of the original amount of missing values; after sorting, we have

$$\displaystyle{ M\left (x_{1}\right ) \geq M\left (x_{2}\right ) \geq M\left (x_{3}\right ) \geq \cdots \geq M\left (x_{p}\right ) }$$
(1)

where \(M\left (x_{j}\right )\) denotes the amount of missing values in attribute j and \(x_{j}\) is the jth column of the data matrix. The algorithm then proceeds iteratively, with one variable acting as the response and the remaining variables as the predictors in each step.
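
A single regression step of this kind can be sketched for two continuous attributes. The real algorithm loops over all variables, handles mixed data and offers robust regression; plain least squares on illustrative data is used here purely for intuition.

```python
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 4.0, None, 8.1, 9.9]   # y acts as the response and has a gap

def ols_fit(xs, ys):
    """Least-squares slope and intercept from the complete pairs."""
    pairs = [(a, b) for a, b in zip(xs, ys) if b is not None]
    n = len(pairs)
    mx = sum(a for a, _ in pairs) / n
    my = sum(b for _, b in pairs) / n
    sxy = sum((a - mx) * (b - my) for a, b in pairs)
    sxx = sum((a - mx) ** 2 for a, _ in pairs)
    slope = sxy / sxx
    return slope, my - slope * mx

# Regress the response on the predictor, then impute the fitted value
# wherever the response is missing.
slope, intercept = ols_fit(x, y)
y_filled = [slope * a + intercept if b is None else b for a, b in zip(x, y)]
```

In the full algorithm, this step is repeated with each variable in turn serving as the response until the imputed values stabilise.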

The authors [20, 22] compared their algorithm with IVEware [14], an algorithm that also performs iterative regression imputation. IRMI has advantages over IVEware with regard to the stability of the initial values, the robustness of the imputed values and the lack of a requirement for at least one fully observed variable [22]. With IRMI imputation, the user can also use least trimmed squares (LTS) regression (see, for example, [15]), MM estimation [24] and M estimation [8].

4.5 Factorial Analysis for Mixed Data Imputation

Imputation of missing values for mixed categorical and continuous data using the principal component method “factorial analysis for mixed data” (FAMD) was proposed by Josse and Husson [9]; see also [3]. The algorithm imputes the missing values using either an iterative FAMD algorithm based on the EM algorithm or a regularised version of that algorithm.

4.6 Random Forest Imputation

Random forest imputation was proposed by Stekhoven and Bühlmann [19] as a method for dealing with missing values in mixed-type data. The algorithm begins by making an initial guess for the missing values in the data matrix, for example using mean imputation or some other imputation method; this gives the initial imputed data matrix. A random forest is then trained on the observed values, the missing values are predicted from this forest and the imputed matrix is updated. This procedure is repeated until the difference between the updated imputed data matrix and the previous one increases for the first time for both the categorical and the continuous variables. For continuous variables, the performance of the imputation is assessed using a normalised root mean squared error [11], and for categorical variables, the proportion of falsely classified entries over the categorical missing values is used. Good performance of the algorithm gives a value close to 0 and bad performance a value close to 1. The algorithm proposed by Stekhoven and Bühlmann [19] is implemented in the R package missForest [18], which also gives an estimate of the imputation error based on the out-of-bag error estimate from the random forest.
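
The two performance measures described above can be sketched directly (the formulas follow the usual definitions; the variable names are ours):

```python
import math

def nrmse(imputed, truth):
    """Normalised root mean squared error for continuous attributes:
    RMSE of the imputed values scaled by the standard deviation of the
    true values, so good imputation gives a value near 0."""
    n = len(truth)
    mean_t = sum(truth) / n
    var_t = sum((t - mean_t) ** 2 for t in truth) / n
    mse = sum((i - t) ** 2 for i, t in zip(imputed, truth)) / n
    return math.sqrt(mse / var_t)

def pfc(imputed, truth):
    """Proportion of falsely classified entries for categorical
    attributes, again with values near 0 indicating good imputation."""
    wrong = sum(1 for i, t in zip(imputed, truth) if i != t)
    return wrong / len(truth)
```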

5 The Analysis

Four classical machine learning data sets, listed in Table 1, were analysed. Note that the prostate cancer data set of [4] listed in [1] contains information collected from 506 individuals, some of whom have missing values. This paper reports a complete case classification of the 12 pretrial attributes: individuals who had missing values in any of the pretrial attributes were omitted from further analysis, leaving 475 of the original 506 individuals.

Table 1 Datasets analysed

For each data set, missing values were created such that the probability p of an attribute value being missing was independent of all other data values, where p = 0.10, 0.20, 0.30 and 0.50. This was repeated 20 times. The missing values generated in this fashion are missing completely at random, and the missing data mechanism is ignorable [10]. As some of the amounts of missing data are fairly extreme, this should provide a good test of the imputation methods and of their effect on the accuracy of a classifier.
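
One way such an MCAR pattern could be produced is sketched below (the paper does not show its generation code; this is only an illustration of the mechanism):

```python
import random

random.seed(3)

def amputate(rows, p):
    """Delete each attribute value independently with probability p,
    which yields a missing-completely-at-random pattern."""
    return [[None if random.random() < p else v for v in row]
            for row in rows]

# Illustrative 100 x 4 data matrix.
data = [[float(i + j) for j in range(4)] for i in range(100)]
masked = amputate(data, p=0.30)
missing_fraction = sum(v is None for row in masked for v in row) / 400
```

Over the 400 entries, the realised missing fraction fluctuates around the nominal p, which is why the paper repeats the generation 20 times.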

The missing values in each data set were imputed using mean imputation, median imputation, k nearest neighbour (kNN) imputation with k = 5, hot deck imputation (HotD), iterative regression imputation (IRMI), the principal component method “factorial analysis for mixed data” (FAMD) and random forest imputation (MissForest). The missing values were imputed using the R packages VIM [21], HotDeckImputation, missMDA and missForest. For data sets containing mixed data, the mode was imputed for the categorical attributes when using mean imputation for the continuous attributes.

The resulting “complete” data sets were analysed using the WEKA [23] experimenter with ten repetitions of tenfold cross-validation, using the commonly used machine learning classifiers listed in Table 2. The mean percentage of observations correctly classified was recorded for each of the four amounts of missingness and each of the data sets analysed (see Figs. 1, 2, 3, 4).

Fig. 1
figure 1

Comparison of the imputation methods for Fisher’s Iris data

Fig. 2
figure 2

Comparison of the imputation methods for Pima Indian data

Fig. 3
figure 3

Comparison of the imputation methods for prostate cancer data

Fig. 4
figure 4

Comparison of the imputation methods for wine data

Table 2 Classifiers used

It can be seen in Fig. 1 that, as the percentage of missing values increased, imputing the missing values in Fisher’s Iris data using mean, median, kNN and IRMI imputation resulted in a decrease in the percentage of observations correctly classified. With this data set, however, FAMD, MissForest and Hot Deck imputation gave similar percentages of observations correctly classified regardless of the amount of missingness in the data, and these percentages were similar to that for the complete data.

Figure 2 shows that mean, median and kNN imputation gave similar percentages of observations correctly classified at each of the four missing data percentages, with the percentage correctly classified falling as the amount of missingness increased for all the classifiers applied. FAMD, MissForest and Hot Deck imputation had similar mean percentages of observations correctly classified regardless of the amount of missingness in the data. Overall, for the Pima Indian data, IRMI, FAMD, MissForest and Hot Deck imputation consistently gave the highest mean percentage of observations correctly classified, similar to that for the complete data.

It can be seen in Fig. 3 that FAMD, MissForest and Hot Deck imputation had similar mean percentages of the observations correctly classified regardless of the amount of missingness in the data with the percentage of observations correctly classified similar to that for the complete data. Figure 4 shows that IRMI, FAMD, MissForest and Hot Deck imputations had similar mean percentages of the observations correctly classified regardless of the amount of missingness in the data, with the percentage correctly classified similar to that for the complete data.

6 Discussion

For all data sets analysed, mean, median and kNN imputation gave a similar mean percentage of observations correctly classified, and this percentage decreased as the percentage of missing data increased. In general, Hot Deck, IRMI, FAMD and MissForest imputation had the highest mean percentage of observations correctly classified and performed similarly regardless of the amount of missing data imputed.

The investigations have shown that the type of method used for imputing missing values in a data set can have an effect on the accuracy of classification. Future research needs to be undertaken on the effect of imputation on the accuracy of classification on data that has more than three classes.