
4.1 Introduction

Many existing industrial and research data sets contain MVs in their attribute values. Intuitively, a MV is just a value for an attribute that was not introduced or was lost in the recording process. There are various reasons for their existence, such as manual data entry procedures, equipment errors and incorrect measurements. The presence of such imperfections usually requires a preprocessing stage in which the data is prepared and cleaned [71], in order to be useful to and sufficiently clear for the knowledge extraction process. The simplest way of dealing with MVs is to discard the examples that contain them. However, this method is practical only when the data contains a relatively small number of examples with MVs and when analysis of the complete examples will not lead to serious bias during the inference [54].

MVs make data analysis difficult. Their presence can also pose serious problems for researchers. In fact, inappropriate handling of MVs in the analysis may introduce bias, can result in misleading conclusions being drawn from a research study, and can also limit the generalizability of the research findings [96]. Three types of problems are usually associated with MVs in DM [5]:

  1. loss of efficiency;

  2. complications in handling and analyzing the data; and

  3. bias resulting from differences between missing and complete data.

Recently some authors have tried to estimate how many MVs are needed to noticeably harm the prediction accuracy in classification [45].

Usually the treatment of MVs in DM can be handled in three different ways [27]:

  • The first approach is to discard the examples with MVs in their attributes. Deleting attributes with elevated levels of MVs also falls into this category.

  • Another approach is the use of maximum likelihood procedures, where the parameters of a model for the complete portion of the data are estimated, and later used for imputation by means of sampling.

  • Finally, the imputation of MVs is a class of procedures that aims to fill in the MVs with estimated ones. In most cases, a data set’s attributes are not independent from each other. Thus, through the identification of relationships among attributes, MVs can be determined.

We will focus our attention on the use of imputation methods. A fundamental advantage of this approach is that the MV treatment is independent of the learning algorithm used. For this reason, the user can select the most appropriate method for each situation faced. There is a broad family of imputation methods, from simple imputation techniques like mean substitution or KNN to those which analyze the relationships between attributes, such as SVM-based methods, clustering-based methods, logistic regressions, maximum likelihood procedures and multiple imputation [6, 26].

The use of imputation methods for MVs is a task with a well established background. The first formal studies can be traced back several decades. The work of [54] laid the foundation for further work in this topic, especially in statistics. From this work, imputation techniques based on sampling from estimated data distributions followed, distinguishing between single imputation procedures (like Expectation-Maximization (EM) procedures [81]) and multiple imputation ones [82], the latter being more reliable and powerful but more difficult and restrictive to apply.

These imputation procedures became very popular for quantitative data, and were therefore easily adopted in other fields of knowledge, like bioinformatics [49, 62, 93], climatic science [85], medicine [94], etc. The imputation methods proposed in each field are adapted to the common characteristics of the data analyzed in it. With the popularization of the DM field, many studies on the treatment of MVs arose in this topic, particularly in the classification task. Some of the existing imputation procedures from other fields have been adapted for use in classification, for example by extending them to deal with qualitative data, while many specific approaches have also been proposed.

4.2 Assumptions and Missing Data Mechanisms

It is important to categorize the mechanisms which lead to the introduction of MVs [54]. The assumptions we make about the missingness mechanism and the pattern of MVs can affect which treatment method could be correctly applied, if any.

When thinking about the missing data mechanism, the probability distributions that underlie the recording of rectangular data sets should be taken into account, where the rows denote different registers, instances or cases, and the columns are the features or variables. A common assumption is that the instances are all independent and identically distributed (i.i.d.) draws of some multivariate probability distribution. This assumption is also made by Schafer in [82], where the schematic representation followed is depicted in Fig. 4.1.

Fig. 4.1 Data set with MVs denoted with a ‘?’

Let \(X\) be the \(n\,\times \,m\) rectangular matrix of data; we usually denote by \(x_i\) the \(i\)th row of \(X\). Under the i.i.d. assumption, the probability function of the complete data can be written as follows:

$$\begin{aligned} P(X|\theta ) = \prod _{i=1}^n f(x_i | \theta ), \end{aligned}$$
(4.1)

where \(f\) is the probability function for a single case and \(\theta \) represents the parameters of the model that yield such a particular instance of data. The main problem is that the particular parameter values \(\theta \) for the given data are very rarely known. For this reason, authors usually overcome this problem by considering distributions that are commonly found in nature and whose properties are well known. The three distributions that stand out among these are:

  1. the multivariate normal distribution in the case of only real valued features;

  2. the multinomial model for cross-classified categorical data (including loglinear models) when the data consists of nominal features; and

  3. mixed models for combined normal and categorical features in the data [50, 55].

If we call \(X_{obs}\) the observed part of \(X\) and denote the missing part by \(X_{mis}\), so that \(X = (X_{obs},X_{mis})\), we can provide a first intuitive definition of what missing at random (MAR) means. Informally speaking, when the probability that an observation is missing may depend on \(X_{obs}\) but not on \(X_{mis}\), we can state that the missing data is missing at random.

In the case of the MAR missing data mechanism, given particular values for a set of features belonging to \(X_{obs}\), the distribution of the rest of the features is the same among the observed cases as it is among the missing cases. Following Schafer’s example based on [79], let us suppose that we have an \(n\,\times \,p\) matrix \(B\) of indicator variables whose values are \(1\) or \(0\) when the elements of \(X\) are observed or missing, respectively. The distribution of \(B\) is related to \(X\) and to some unknown parameters \(\zeta \), so we have a probability model for \(B\) described by \(P(B |X, \zeta )\). The MAR assumption means that this distribution does not depend on \(X_{mis}\):

$$\begin{aligned} P(B|X_{obs},X_{mis},\zeta ) = P(B|X_{obs},\zeta ). \end{aligned}$$
(4.2)

Please be aware that MAR does not suggest that the missing data values constitute just another possible sample from the probability distribution. This stronger condition is known as missing completely at random (MCAR). Actually, MCAR is a special case of MAR in which the distribution of an example having a MV for an attribute depends neither on the observed nor on the unobserved data. Following the previous notation, we can say that

$$\begin{aligned} P(B|X_{obs},X_{mis},\zeta ) = P(B|\zeta ). \end{aligned}$$
(4.3)

Although there will generally be some loss of information, comparable results can be obtained with missing data by carrying out the same analysis that would have been performed with no MVs. In practice this means that, under MCAR, the analysis of only those units with complete data gives valid inferences.

Please note that MCAR is more restrictive than MAR. MAR requires only that the MVs behave like a random sample of all values in some particular subclasses defined by observed data. In such a way, MAR allows the probability of a missing datum to depend on the datum itself, but only indirectly through the observed values. Recently a software package has been published in which the MCAR condition can be tested [43].

A third case arises when MAR does not apply, as the MV depends both on the rest of the observed values and on the value itself. That is,

$$\begin{aligned} P(B|X_{obs},X_{mis},\zeta ) \end{aligned}$$
(4.4)

is the actual probability estimation. This model is usually called not missing at random (NMAR) or missing not at random (MNAR) in the literature. This model of missingness is a challenge for the user, as the only way to obtain an unbiased estimate is to model the missingness as well. This is a very complex task in which we should create a model accounting for the missing data that is later incorporated into a more complex model used to estimate the MVs. However, even when we cannot account for the missingness model, the introduced bias may still be small enough. In [23] the reader can find an example of how to perform this.

4.3 Simple Approaches to Missing Data

In this section we introduce the simplest methods used to deal with MVs. As they are very simple, they usually do not take the missingness mechanism into account and blindly perform the operation.

The simplest approach is do not impute (DNI). As its name indicates, all the MVs remain unreplaced, so the DM algorithm must use its default MV strategy if present. Often the objective is to verify whether imputation methods allow the classification methods to perform better than when using the original data sets. As a guideline, a previous study of imputation methods is presented in [37]. As an alternative for those learning methods that cannot deal with an explicit MV notation (as a special value, for instance), another approach is to convert the MVs to a new value (encode them into a new numerical value), but such a simplistic method has been shown to lead to serious inference problems [82].

A very common approach in the specialized literature, even nowadays, is to apply case deletion or ignore missing (IM). Using this method, all instances with at least one MV are discarded from the data set. Although IM often results in a substantial decrease in the sample size available for the analysis, it does have important advantages. In particular, under the assumption that data is MCAR, it leads to unbiased parameter estimates. Unfortunately, even when the data is MCAR there is a loss of power using this approach, especially if we have to rule out a large number of subjects. And when the data is not MCAR, it biases the results. For example, when low income individuals are less likely to report their income level, the resulting mean is biased in favor of higher incomes. The alternative approaches discussed below should be considered as a replacement for IM.

Often seen as a good choice, the substitution of MVs with the global most common attribute value for nominal attributes and the global average value for numerical attributes (MC) [36] is widely used, especially when many instances in the data set contain MVs and applying DNI would result in a very reduced and unrepresentative pre-processed data set. This method is very simple: for nominal attributes, the MV is replaced with the most common attribute value, and numerical values are replaced with the average of all values of the corresponding attribute.

A variant of MC is the concept most common attribute value for nominal attributes, and concept average value for numerical attributes (CMC) [36]. As in MC, the MV is replaced by the most repeated value if nominal, or by the mean value if numerical, but considering only the instances with the same class as the reference instance.
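As an illustration, the following is a minimal pandas sketch of MC and CMC; the function names and the toy data frame are ours, and nominal attributes are assumed to be stored in non-numeric columns:

```python
import pandas as pd

def mc_impute(df: pd.DataFrame) -> pd.DataFrame:
    """MC: fill numeric columns with the global mean, nominal ones with the mode."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].mean())
        else:
            out[col] = out[col].fillna(out[col].mode().iloc[0])
    return out

def cmc_impute(df: pd.DataFrame, class_col: str) -> pd.DataFrame:
    """CMC: same as MC, but the statistics are computed within each class."""
    parts = [mc_impute(g) for _, g in df.groupby(class_col)]
    return pd.concat(parts).sort_index()

df = pd.DataFrame({"colour": ["red", None, "red", "blue"],
                   "size": [1.0, 2.0, None, 4.0],
                   "class": ["a", "a", "b", "b"]})
print(cmc_impute(df, "class"))  # 'colour' filled with "red" (class a), 'size' with 4.0 (class b)
```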

Older and rarely used DM approaches may be put under this category. For example, hot deck imputation goes back over 50 years and was used quite successfully by the Census Bureau and others. It is referred to from time to time [84] and thus it is interesting to describe it here, partly for historical reasons and partly because it represents an approach of replacing data that is missing.

Hot deck has its origins in the surveys made in the USA in the 40s and 50s, when most people felt impelled to participate in filling in surveys. As a consequence little data was missing, and when any registers were effectively missing, a random complete case from the same survey was used to substitute the MVs. This process can be simulated nowadays by clustering over the complete data and associating the instance with a cluster. Any complete example from the cluster can be used to fill in the MVs [6]. Cold deck is similar to hot deck, but the cases or instances used to fill in the MVs come from a different source. Traditionally this meant that the case used to fill in the data was obtained from a different survey. Some authors have recently assessed the limitations imposed on the donors (the instances used to substitute the MVs) [44].

4.4 Maximum Likelihood Imputation Methods

At the same time that Rubin et al. formalized the concept of missing data introduction mechanisms described in Sect. 4.2, they advised against using case deletion (IM) as a methodology to deal with MVs. However, using MC or CMC techniques is not much better than replacing MVs with fixed values, as they completely ignore the mechanisms that yield the data values. In an ideal and rare case where the parameters of the data distribution \(\theta \) were known, a sample from such a distribution, conditioned on the other attributes’ values or not depending on whether MCAR, MAR or NMAR applies, would be a suitable imputed value for the missing one. The problem is that the parameters \(\theta \) are rarely known and also very hard to estimate [38].

In a simple case such as flipping a coin, \(P(heads) = \theta \) and \(P(tails) = 1 - \theta \). Depending on whether the coin is rigged or not, the value of \(\theta \) may vary and is thus unknown. Our only choice is to flip the coin several times, say \(n\), obtaining \(h\) heads and \(n-h\) tails. An estimate of \(\theta \) would be \(\widehat{\theta } = h/n\).

More formally, the likelihood of \(\theta \) is obtained from a binomial distribution \(P(h|\theta ) = \left( {\begin{array}{c}n\\ h\end{array}}\right) \theta ^h (1-\theta )^{n-h}\). Our \(\widehat{\theta }\) can be proven to be the maximum likelihood estimate of \(\theta \).

So the next question arises: to solve a maximum likelihood type problem, can we analytically maximize the likelihood function? We have shown that it works with one dimensional Bernoulli problems like the coin toss, and it also works with a one dimensional Gaussian by finding the \(\mu \) and \(\sigma \) parameters. To illustrate the latter case, let us assume that we have the samples 1, 4, 7, 9 obtained from a normal distribution and, for the sake of simplicity, we only want to estimate the population mean, that is, in this simplistic case \(\theta = \mu \). The maximum likelihood problem here is to choose a specific value of \(\mu \) and compute \( p(1)\cdot p(4)\cdot p(7) \cdot p(9)\). Intuitively one can say that this probability would be very small if we fixed \(\mu = 10\) and would be higher for \(\mu = 4\) or \(\mu = 5\). The value of \(\mu \) that produces the maximum product of combined probabilities is what we call the maximum likelihood estimate of \(\mu = \theta \). In our case the maximum likelihood estimate is the sample mean \(\mu = 5.25\), and if the variance is added to the problem, it can again be solved using the sample variance as the best estimator.
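This example can be reproduced numerically. The following sketch (a deliberately naive grid search, not how maximum likelihood estimates are computed in practice) evaluates the log-likelihood of the four samples over candidate values of \(\mu \) and recovers the sample mean:

```python
import numpy as np
from scipy.stats import norm

samples = np.array([1.0, 4.0, 7.0, 9.0])
sigma = samples.std()  # fix sigma for simplicity; we only estimate mu

# Log-likelihood of N(mu, sigma^2) for each candidate mu on a grid
mus = np.linspace(0.0, 10.0, 1001)
loglik = [norm.logpdf(samples, loc=mu, scale=sigma).sum() for mu in mus]

print(mus[int(np.argmax(loglik))])  # ~5.25, the sample mean, as argued above
```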

In real world data things are not that easy. We may have distributions that are not well behaved or that have too many parameters, making the actual solution computationally too complex. Having a likelihood function made of a mixture of 100 100-dimensional Gaussians would yield 10,000 parameters, and thus direct trial-and-error maximization is not feasible. The way to deal with such complexity is to introduce hidden variables in order to simplify the likelihood function and, in our case as well, to account for MVs. The observed variables are those that can be directly measured from the data, while hidden variables influence the data but are not trivial to measure. An example of an observed variable would be whether it is sunny today, whereas a hidden variable can be \(P(sunny\; today|sunny\; yesterday)\).

Even simplifying with hidden variables does not allow us to reach the solution in a single step. The most common approach in these cases is to use an iterative procedure in which we obtain some parameter estimates, use a regression technique to impute the values and repeat. However, as the imputed values will depend on the estimated parameters \(\theta \), they will not add any useful information to the process and can be ignored. There are several techniques to obtain maximum likelihood estimators. The most well known and simplest is the EM algorithm presented in the next section.

4.4.1 Expectation-Maximization (EM)

In a nutshell, the EM algorithm estimates the parameters of a probability distribution. In our case this can be achieved from incomplete data. It iteratively maximizes the likelihood of the observed data \(X_{obs}\), considered as a function of the parameters [20].

That is, we want to model the dependent random variables: the observed variable \(a\) and the hidden variable \(b\) that generates \(a\). We assume that a set of unknown parameters \(\theta \) governs the probability distributions \(P_{\theta }(a)\), \(P_{\theta }(b)\). As an iterative process, the EM algorithm consists of two steps that are repeated until convergence: the expectation step (E-step) and the maximization step (M-step).

The E-step computes the expectation of \(log P_{\theta }(b,a)\):

$$\begin{aligned} Q(\theta ,\theta ') = \sum _b P_{\theta '}(b|a) log P_{\theta }(b,a), \end{aligned}$$
(4.5)

where \(\theta '\) are the current distribution parameters. Please note that we are using the \(log\). The reason for this is that we need to multiply the probabilities of each observed value for a specific set of parameters. But multiplying several probabilities will soon yield a very small number and thus produce a loss of precision in a computer due to limited digital accuracy. A typical solution is then to use the \(log\) of these probabilities and to look for the maximum log likelihood. As the logs will be negative, we are looking for the set of parameters whose likelihood is as close to 0 as possible. In the M-step we try to find the \(\theta \) that maximizes \(Q\).

How can we find the \(\theta \) that maximizes \(Q\)? To resolve the M-step, let us review conditional expectation, where \(A\) and \(B\) are random variables drawn from distributions \(P(a)\) and \(P(b)\). The conditional distribution is given by \(P(b|a) = \frac{P(b,a)}{P(a)}\), and \(E[B] = \sum _b P(b)b\). For a function \(h(B)\) depending on \(B\), the expectation is trivially obtained as \(E[h(B)] = \sum _b P(b)h(b)\). For a particular value \(A=a\), the conditional expectation is \(E[h(B)|a] = \sum _b P(b|a)h(b)\).

Remember that we want to pick a \(\theta \) that maximizes the log likelihood of the observed (\(a\)) and the unobserved \((b)\) variables given an observed variable \(a\) and the previous parameters \(\theta '\). The conditional expectation of \(log P_{\theta }(b,a)\) given \(a\) and \(\theta '\) is

$$\begin{aligned} E[log P(b,a|\theta )| a,\theta ']&= \sum _b P(b|a,\theta ') log P(b,a|\theta ) \end{aligned}$$
(4.6)
$$\begin{aligned}&= \sum _b P_{\theta '}(b|a) log P_{\theta }(b,a). \end{aligned}$$
(4.7)

The key is that if \( \sum _b P_{\theta '}(b|a) log P_{\theta }(b,a) > \sum _b P_{\theta '} (b|a) log P_{\theta '}(b,a)\) then \(P_{\theta }(a) > P_{\theta '}(a)\). If we can improve the expectation of the log likelihood, EM is improving the model of the observed variable \(a\).

In any real world problem, we do not have a single observation but a series of instances \(x_1, \ldots , x_n\). Assuming they are i.i.d., we can sum over all points to compute the expectation:

$$\begin{aligned} Q(\theta ,\theta ') = \sum _{i=1}^n \sum _b P_{\theta '} (b|x_i) log P_{\theta } (b,x_i) \end{aligned}$$
(4.8)

The EM algorithm is not perfect: it can get stuck in local maxima, and it also depends on the initial \(\theta \) value. The latter issue is usually resolved by using a bootstrap process in order to choose a correct initial \(\theta \). Also, the reader may have noticed that we have not talked about any imputation yet. The reason is that EM is a meta algorithm that must be adapted to a particular application.

To use EM for imputation we first need to choose a plausible set of parameters, that is, we need to assume that the data follows a probability distribution, which is usually seen as a drawback of this kind of method. The EM algorithm works better with probability distributions that are easy to maximize, such as Gaussian mixture models. In [85] an EM approach using the multivariate Gaussian distribution is proposed, as multivariate Gaussian data can be parameterized by the mean and the covariance matrix.

In each iteration of the EM algorithm for imputation, the estimates of the mean \(\mu \) and the covariance \(\varSigma \) are revised in three phases. These parameters are used to apply a regression over the MVs using the complete data. In the first phase, for each instance with MVs, the regression parameters \(B\) for the MVs are calculated from the current estimates of the mean and covariance matrix and the available complete data. Next, the MVs are imputed with their conditional expectation values computed from the available complete values and the estimated regression coefficients:

$$\begin{aligned} x_{mis} = \mu _{mis} + (x_{obs} - \mu _{obs}) B + e, \end{aligned}$$
(4.9)

where the instance \(x\) of \(n\) attributes is separated into the observed values \(x_{obs}\) and the missing ones \(x_{mis}\). The mean and covariance matrix are separated in the same way. The residual \(e \in \mathbb {R}^{1 \times n_{mis}}\) is assumed to be a random vector with mean zero and unknown covariance matrix. These two phases complete the E-step. Please note that for the iteration of the algorithm the imputation is not strictly needed, as only the estimates of the mean and covariance matrix are, as well as the regression parameters. But our ultimate goal is to have our data set filled, so we use the latest regression parameters to create the best imputed values so far.

In the third phase the M-step is completed by re-estimating the mean and covariance matrix. The mean is taken as the sample mean of the completed data set, and the covariance is estimated as the sum of the sample covariance matrix of the completed data set and the covariance matrices of the imputation errors, as shown in [54]. That is:

$$\begin{aligned} \widehat{B} = \widehat{\varSigma }_{obs,obs}^{-1} \widehat{\varSigma }_{obs,mis} \end{aligned}$$
(4.10)

and

$$\begin{aligned} \widehat{C} = \widehat{\varSigma }_{mis,mis} - \widehat{\varSigma }_{mis,obs} \widehat{\varSigma }_{obs,obs}^{-1} \widehat{\varSigma }_{obs,mis}. \end{aligned}$$
(4.11)

The hat accent \(\widehat{A}\) designates an estimate of a quantity \(A\). After updating \(B\) and \(C\) the mean and covariance matrix must be updated with

$$\begin{aligned} \widehat{\mu }^{(t+1)} = \frac{1}{n} \sum _{i=1}^n X_i \end{aligned}$$
(4.12)

and

$$\begin{aligned} \widehat{\varSigma }^{(t+1)} = \frac{1}{\tilde{n}} \sum _{i=1}^n \left[ \widehat{S}_i^{(t)} - (\widehat{\mu }^{(t+1)})^T\widehat{\mu }^{(t+1)} \right] , \end{aligned}$$
(4.13)

where, for each instance \(x = X_i\), the conditional expectation \(\widehat{S}_i^{(t)}\) of the cross-products is composed of three parts: the part that involves only the available values in the instance,

$$\begin{aligned} E(x_{obs}^T x_{obs} | x_{obs}; \widehat{\mu }^{(t)},\widehat{\varSigma }^{(t)}) = x_{obs}^T x_{obs} \end{aligned}$$
(4.14)

which is left unchanged; the cross part \(E(x_{obs}^T x_{mis} | x_{obs}; \widehat{\mu }^{(t)},\widehat{\varSigma }^{(t)}) = x_{obs}^T \widehat{x}_{mis}\), which involves the imputed values; and the part that involves only the missing values,

$$\begin{aligned} E(x_{mis}^T x_{mis} | x_{obs}; \widehat{\mu }^{(t)},\widehat{\varSigma }^{(t)}) = \widehat{x}_{mis}^T \widehat{x}_{mis} + \widehat{C}, \end{aligned}$$
(4.15)

which is the sum of the cross-product of the imputed values and the residual covariance matrix \(\widehat{C} = Cov(x_{mis},x_{mis} | x_{obs};\widehat{\mu }^{(t)},\widehat{\varSigma }^{(t)})\), the conditional covariance matrix of the imputation error. The normalization constant \(\tilde{n}\) of the covariance matrix estimate [Eq. (4.13)] is the number of degrees of freedom of the sample covariance matrix of the completed data set.

The first estimation of the mean and covariance matrix needs to rely on a completely observed data set. One solution in [85] is to fill the data set with the initial estimates of the mean and covariance matrices. The process ends when the estimates of the mean and covariance matrix change less than a predefined threshold. Please note that this EM approach is only well suited for numeric data sets, which constitutes a limitation for the application of EM, although an extension for mixed numerical and nominal attributes can be found in [82].
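A minimal numpy sketch of this regression-based EM imputation follows, assuming multivariate normal data. The residual term \(e\) and the correction \(\widehat{C}\) of Eqs. (4.11) and (4.15) are omitted for brevity, so this is conditional-mean imputation rather than the full, regularized algorithm of [85]:

```python
import numpy as np

def em_impute(X, n_iter=100, tol=1e-6):
    """Sketch of regression-based EM imputation under a multivariate normal
    assumption (cf. Eqs. 4.9-4.13). The residual e and the correction C are
    omitted, so each E-step performs conditional-mean imputation."""
    X = np.asarray(X, dtype=float)
    miss = np.isnan(X)
    X_filled = np.where(miss, np.nanmean(X, axis=0), X)  # initial mean fill
    for _ in range(n_iter):
        mu = X_filled.mean(axis=0)                       # M-step: mean (Eq. 4.12)
        S = np.cov(X_filled, rowvar=False)               # M-step: covariance
        X_new = X_filled.copy()
        for i in range(X.shape[0]):
            m = miss[i]
            if not m.any():
                continue
            o = ~m
            # B = S_oo^{-1} S_om (Eq. 4.10), then Eq. (4.9) without e
            B = np.linalg.solve(S[np.ix_(o, o)], S[np.ix_(o, m)])
            X_new[i, m] = mu[m] + (X_filled[i, o] - mu[o]) @ B
        if np.max(np.abs(X_new - X_filled)) < tol:       # stop when stable
            return X_new
        X_filled = X_new
    return X_filled
```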

The EM algorithm is still valid nowadays, but it is usually part of a larger system in which it helps to evolve some distributions, like the GTM neural networks in [95]. Research is still being carried out on EM algorithms, in which their limitations are being addressed and they are applied to new fields like semi-supervised learning [97]. The most well known version of EM for real valued data sets is the one introduced in [85], where the basic EM algorithm presented here is extended with a regularization parameter.

4.4.2 Multiple Imputation

One big problem of maximum likelihood methods like EM is that they tend to underestimate the inherent errors produced by the estimation process, formally the standard errors. The Multiple Imputation (MI) approach was designed to take this into account and be a less biased imputation method, at the cost of being computationally expensive. MI is a Monte Carlo approach described very well by [80] in which we generate multiple imputed values from the observed data, in a very similar way to the EM algorithm: it fills in the incomplete data by repeatedly solving the observed-data problem. But there is a significant difference between the two methods: while EM generates a single imputation in each step from the parameters estimated at that step, MI performs several imputations that yield several complete data sets.

This repeated imputation can be done thanks to the use of Markov Chain Monte Carlo methods, as the several imputations are obtained by introducing a random component, usually from a standard normal distribution. In a more advanced fashion, MI also considers that the parameter estimates are in fact sample estimates. Thus, the parameters are not directly estimated from the available data but, as the process continues, they are drawn from their Bayesian posterior distributions given the data at hand. These assumptions mean that MI should be applied only when the MCAR or MAR missingness mechanisms hold.

As a result, Eq. (4.9) can be applied with slight changes: the \(e\) term is now a sample from a standard normal distribution, and it is applied more than once to obtain several imputed values for a single MV. As indicated in the previous paragraph, MI has a Bayesian nature that forces the user to specify a prior distribution for the parameters \(\theta \) of the model from which the \(e\) term is drawn. In practice, it is stressed in [83] that the results depend more on the choice of the distribution for the data than on the distribution for \(\theta \). Unlike the single imputation performed by EM, where only one imputed value for each MV is created (and thus only one value of \(e\) is drawn), MI will create several versions of the data set, where the observed data \(X_{obs}\) is essentially the same, but the imputed values for \(X_{mis}\) will be different in each data set created. This process is usually known as data augmentation (DA) [91], as depicted in Fig. 4.2.

Fig. 4.2 Multiple imputation process by data augmentation. Every MV denoted by a ‘?’ is replaced by several imputed and different values that will be used to continue the process later

Surprisingly, not many imputation steps are needed. Rubin claims in [80] that only 3–5 steps are usually needed. He states that the efficiency of the final estimation built upon \(m\) imputations is approximately

$$ \left( 1 + \frac{\gamma }{m}\right) ^{-1}, $$

where \(\gamma \) is the fraction of missing data in the data set. With 30 % of MVs in each data set, which is quite a high amount, 5 final data sets achieve 94 % efficiency. Increasing the number to \(m = 10\) only raises the efficiency to 97 %, a small gain for double the computational effort.
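These figures are easy to check numerically; a one-function sketch:

```python
def mi_efficiency(gamma: float, m: int) -> float:
    """Rubin's approximate relative efficiency of m imputations,
    for a fraction gamma of missing information."""
    return 1.0 / (1.0 + gamma / m)

print(mi_efficiency(0.3, 5), mi_efficiency(0.3, 10))  # ~0.94, ~0.97
```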

To start we need an estimation of the mean and covariance matrices. A good approach is to take them from the solution provided by an EM algorithm once its values have stabilized at the end of its execution [83]. Then the DA process starts by alternately filling the MVs and then making inferences about the unknown parameters in a stochastic fashion. First, DA imputes the MVs using the current values of the parameters, and then draws new parameter values from their Bayesian posterior distribution given the observed and imputed data. Concatenating this process of simulating the MVs and the parameters is what creates a Markov chain that will converge at some point. The distribution of the parameters \(\theta \) will stabilize to the posterior distribution averaged over the MVs, and the distribution of the MVs will stabilize to a predictive distribution: the proper distribution needed to draw values for the MIs.

Large rates of MVs in the data sets will cause the convergence to be slow. However, the meaning of convergence is different to that used in EM. In EM, the parameter estimates have converged when they change less than a threshold from one iteration to the next. In DA, the distribution of the parameters does not change across iterations, but the random parameter values themselves actually continue changing, which makes the convergence of DA more difficult to assess than for EM. In [83] the authors propose to reinterpret convergence in DA in terms of lack of serial dependence: DA can be said to have converged by \(k\) cycles if the value of any parameter at iteration \(t \in \{1,2,\ldots \}\) is statistically independent of its value at iteration \(t+k\). As the authors show in [83], the DA algorithm usually converges under these terms in as many or fewer cycles than EM.

The value \(k\) is interesting, because it establishes when we should stop the Markov chain in order to have MIs that are independent draws from the missing data predictive distribution. A typical process is to perform \(m\) runs, each one of length \(k\). That is, for each imputation from 1 to \(m\), we perform the DA process during \(k\) cycles. It is a good idea not to choose too low a \(k\) value: after convergence the process simply remains stationary, whereas with low \(k\) values the \(m\) imputed data sets will not be truly independent. Remember that we do not need a high \(m\) value, so \(k\) acts as the true measure of computational effort.

Once the \(m\) MI data sets have been created, they can be analyzed by any standard complete-data method. For example, we can apply a linear or logistic regression, a classifier or any other technique to each one of the \(m\) data sets, and the variability of the \(m\) results obtained will reflect the uncertainty of the MVs. It is common to combine the results following the rules provided by Rubin [80], which act as measures of ordinary sample variation, to obtain a single inferential statement of the parameters of interest.

Rubin’s rules to obtain an overall set of estimated coefficients and standard errors proceed as follows. Let \(\widehat{R}\) denote the estimate of interest and \(U\) its estimated variance, \(R\) being either an estimated regression coefficient or a kernel parameter of an SVM, whatever applies. Once the MIs have been obtained, we will have \(\widehat{R}_1,\widehat{R}_2,\ldots ,\widehat{R}_m\) estimates and their respective variances \(U_1,U_2,\ldots ,U_m\). The overall estimate, occasionally called the MI estimate, is given by

$$\begin{aligned} \overline{R} = \frac{1}{m} \sum _{i=1}^m \widehat{R}_i . \end{aligned}$$
(4.16)

The variance for the estimate has two components: the variability within each data set and across data sets. The within imputation variance is simply the average of the estimated variances:

$$\begin{aligned} \overline{U} = \frac{1}{m} \sum _{i=1}^m U_i, \end{aligned}$$
(4.17)

whereas the between imputation variance is the sample variance of the proper estimates:

$$\begin{aligned} B = \frac{1}{m-1} \sum _{i=1}^m (\widehat{R}_i - \overline{R})^2. \end{aligned}$$
(4.18)

The total variance \(T\) is the corrected sum of these two components with a factor that accounts for the simulation error in \(\widehat{R}\),

$$\begin{aligned} T = \overline{U} + \left( 1 + \frac{1}{m} \right) B. \end{aligned}$$
(4.19)

The square root of \(T\) is the overall standard error associated with \(\overline{R}\). If no MVs were present in the original data set, all \(\widehat{R}_1,\widehat{R}_2,\ldots ,\widehat{R}_m\) would be the same, so \(B = 0\) and \(T = \overline{U}\). The magnitude of \(B\) with respect to \(\overline{U}\) indicates how much information is contained in the missing portion of the data set relative to the observed part.

In [83] the authors elaborate more on the confidence intervals extracted from \(\overline{R}\) and on how to test the null hypothesis \(R = 0\) by comparing the ratio \(\frac{\overline{R}}{\sqrt{T}}\) with a Student’s \(t\)-distribution with degrees of freedom

$$\begin{aligned} df = (m-1) \left( 1+ \frac{m \overline{U}}{(m+1) B} \right) ^2, \end{aligned}$$
(4.20)

in case the reader would like to further explore how this test can be used to check whether the number of imputations \(m\) was large enough.
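A sketch of Rubin’s pooling rules (Eqs. 4.16–4.20) in Python; the example estimates below are made up for illustration, and \(B\) is assumed to be nonzero:

```python
import numpy as np

def pool_rubin(estimates, variances):
    """Combine m per-imputation estimates and variances with Rubin's rules."""
    R = np.asarray(estimates, dtype=float)
    U = np.asarray(variances, dtype=float)
    m = len(R)
    R_bar = R.mean()                  # overall (MI) estimate, Eq. (4.16)
    U_bar = U.mean()                  # within-imputation variance, Eq. (4.17)
    B = R.var(ddof=1)                 # between-imputation variance, Eq. (4.18)
    T = U_bar + (1 + 1 / m) * B       # total variance, Eq. (4.19)
    df = (m - 1) * (1 + m * U_bar / ((m + 1) * B)) ** 2  # Eq. (4.20)
    return R_bar, np.sqrt(T), df

# e.g. a regression coefficient fitted on m = 5 imputed data sets
est, se, df = pool_rubin([0.52, 0.48, 0.55, 0.50, 0.49], [0.010] * 5)
```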

The MI algorithm has been widely used in many research fields. Focusing on DM, methods to increase the robustness of MI [19], to alleviate the parameter selection process [35] and to improve Rubin’s rules for aggregating models [86] have been proposed. Extensions to new problems like one-class classification [48] can be found, as well as hybridizations with innovative techniques such as Gray System Theory [92]. Implementing MI is not trivial, and reputable implementations can be found in statistical packages such as R [9] (see Chap. 10) and Stata [78].

4.4.3 Bayesian Principal Component Analysis (BPCA)

The MV estimation method based on BPCA [62] consists of three elementary processes: (1) principal component (PC) regression, (2) Bayesian estimation, and (3) an EM-like repetitive algorithm. In the following we describe each of these processes.

4.4.3.1 PC Regression

For the time being, we consider a situation where there are no MVs. PCA represents the variation of \(D\)-dimensional example vectors \(y\) as a linear combination of principal axis vectors \(w_l\ (1 \le l \le K)\) whose number is relatively small \((K < D)\):

$$\begin{aligned} y = \sum _{l=1}^K x_l w_l + \epsilon \end{aligned}$$
(4.21)

The linear coefficients \(x_l\ (1 \le l \le K)\) are called factor scores. \(\epsilon \) denotes the residual error. Using a specifically determined number \(K\), PCA obtains \(x_l\) and \(w_l\) such that the sum of squared errors \(\parallel \epsilon \parallel ^2\) over the whole data set \(Y\) is minimized.

When there are no MVs, \(x_l\) and \(w_l\) are calculated as follows. The covariance matrix \(S\) for the example vectors \(y_i\ (1 \le i \le N)\) is given by

$$\begin{aligned} S = \frac{1}{N} \sum _{i=1}^N (y_i - \mu ) (y_i -\mu )^T, \end{aligned}$$
(4.22)

where \(\mu \) is the mean vector of \(y\): \(\mu = (1/N)\sum _{i=1}^N y_i\). \(T\) denotes the transpose of a vector or a matrix. For convenience of description, \(Y\) is assumed to be row-wise normalized by a preprocess, so that \(\mu = 0\) holds. With this normalization, the result of PCA is identical to that of SVD.

Let \(\lambda _1 \ge \lambda _2 \ge \cdots \ge \lambda _D\) and \(u_1, u_2, \ldots , u_D\) denote the eigenvalues and the corresponding eigenvectors, respectively, of \(S\). We also define the \(l\)th principal axis vector by \(w_l = \sqrt{\lambda _l}\, u_l\). With these notations, the \(l\)th factor score for an example vector \(y\) is given by \(x_l = (w_l/\lambda _l)^T y\). Now we assume the existence of MVs. In PC regression, the missing part \(y_{miss}\) in the example vector \(y\) is estimated from the observed part \(y_{obs}\) by using the PCA result. Let \(w_{obs}^l\) and \(w_{miss}^l\) be the parts of each principal axis \(w_l\) corresponding to the observed and missing parts, respectively, in \(y\). Similarly, let \(W = (W_{obs},W_{miss})\), where \(W_{obs}\) or \(W_{miss}\) denotes the matrix whose column vectors are \(w_{obs}^1 , \ldots ,w_{obs}^K\) or \(w_{miss}^1 ,\ldots ,w_{miss}^K\), respectively.

Factor scores \(x = (x_1,\ldots , x_K)\) for the example vector \(y\) are obtained by minimization of the residual error

$$\begin{aligned} err = \parallel y_{obs} - W_{obs} x \parallel ^2. \end{aligned}$$

This is a well-known regression problem, and the least square solution is given by

$$\begin{aligned} x = (W_{obs}^T W_{obs})^{-1} W_{obs}^T y_{obs}. \end{aligned}$$

Using \(x\), the missing part is estimated as

$$\begin{aligned} y_{miss} = W_{miss} x \end{aligned}$$
(4.23)

In the PC regression above, \(W\) should be known beforehand. Later, we will discuss the way to determine this parameter.
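A numpy sketch of PC regression for a single incomplete vector, assuming the complete data \(Y\) is already centered (\(\mu = 0\)) as stated in the text:

```python
import numpy as np

def pc_regression_impute(Y_complete, y, K):
    """Estimate the missing part of the row vector y from K principal axes
    learned on complete, centered data (Sect. 4.4.3.1)."""
    S = np.cov(Y_complete, rowvar=False)            # Eq. (4.22), with mu = 0
    lam, U = np.linalg.eigh(S)                      # eigenpairs of S
    lam = np.clip(lam, 0.0, None)                   # guard tiny negative values
    order = np.argsort(lam)[::-1][:K]               # K leading eigenpairs
    W = U[:, order] * np.sqrt(lam[order])           # w_l = sqrt(lambda_l) u_l
    miss = np.isnan(y)
    W_obs, W_mis = W[~miss], W[miss]
    # Least-squares factor scores x = (W_obs^T W_obs)^{-1} W_obs^T y_obs
    x, *_ = np.linalg.lstsq(W_obs, y[~miss], rcond=None)
    y_hat = y.copy()
    y_hat[miss] = W_mis @ x                         # Eq. (4.23)
    return y_hat
```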

4.4.3.2 Bayesian Estimation

A parametric probabilistic model called probabilistic PCA (PPCA) has been proposed. The probabilistic model is based on the assumption that the residual error \(\epsilon \) and the factor scores \(x_l\ (1 \le l \le K)\) in Eq. (4.21) obey normal distributions:

$$\begin{aligned} p(x)&= \mathcal {N}_K (x | 0, I_K),\\ p(\epsilon )&= \mathcal {N}_D (\epsilon | 0,(1/\tau )I_D), \end{aligned}$$

where \(\mathcal {N}_K (x |\mu ,\varSigma )\) denotes a \(K\)-dimensional normal distribution for \(x\), whose mean and covariance are \(\mu \) and \(\varSigma \), respectively. \(I_K\) is a \((K \times K)\) identity matrix and \(\tau \) is a scalar inverse variance of \(\epsilon \). In this PPCA model, a complete log-likelihood function is written as:

$$\begin{aligned} ln\; p(y, x| \theta )&\equiv ln\; p(y,x | W,\mu ,\tau ) \\&= -\frac{\tau }{2} \parallel y - W x - \mu \parallel ^2 - \frac{1}{2} \parallel x \parallel ^2 + \frac{D}{2} ln\; \tau - \frac{K + D}{2} ln\; 2 \pi , \end{aligned}$$

where \(\theta \equiv {W,\mu , \tau }\) is the parameter set. Since the maximum likelihood estimation of the PPCA is identical to PCA, PPCA is a natural extension of PCA to a probabilistic model.

We present here the Bayesian estimation method for PPCA given by the authors of [62]. Bayesian estimation obtains the posterior distribution of \(\theta \) and \(X\) according to Bayes’ theorem:

$$\begin{aligned} p(\theta ,X | Y) \propto p(Y, X | \theta ) p(\theta ). \end{aligned}$$
(4.24)

\(p(\theta )\) is called a prior distribution, which denotes a priori preference for parameter \(\theta \). The prior distribution is a part of the model and must be defined before estimation. We assume conjugate priors for \(\tau \) and \(\mu \), and a hierarchical prior for \(W\), namely, the prior for \(W, p(W|\tau ,\alpha )\), is parameterized by a hyperparameter \(\alpha \in \mathbb {R}^K\).

$$\begin{aligned} p(\theta | \alpha )&\equiv p(\mu , W, \tau | \alpha ) = p(\mu | \tau )p(\tau )\prod _{j=1}^K p(w_j | \tau , \alpha _j), \\ p(\mu | \tau )&= \mathcal {N}(\mu | \overline{\mu }_0, (\gamma _{\mu _0} \tau )^{-1} I_m), \\ p(w_j | \tau , \alpha _j)&= \mathcal {N}(w_j | 0, (\alpha _j \tau )^{-1} I_m), \\ p(\tau )&= \mathcal {G}(\tau | \overline{\tau }_0, \gamma _{\tau _0}) \end{aligned}$$

\(\mathcal {G}(\tau | \overline{\tau },\gamma _\tau )\) denotes a Gamma distribution with hyperparameters \(\overline{\tau }\) and \(\gamma _\tau \):

$$\begin{aligned} \mathcal {G}(\tau | \overline{\tau },\gamma _\tau ) \equiv \frac{(\gamma _\tau \overline{\tau }^{-1})^{\gamma _\tau }}{\varGamma (\gamma _\tau )} \exp \left[ -\gamma _\tau \overline{\tau }^{-1}\tau + (\gamma _\tau -1) ln \tau \right] \end{aligned}$$

where \(\varGamma (\cdot )\) is a Gamma function.

The variables used in the above priors, \(\gamma _{\mu 0}\), \(\overline{\mu }_0\), \(\gamma _{\tau 0}\) and \(\overline{\tau }_0\) are deterministic hyperparameters that define the prior. Their actual values should be given before the estimation. We set \(\gamma _{\mu 0} = \gamma _{\tau 0} = 10^{-10}\), \(\overline{\mu }_0 = 0\) and \(\overline{\tau }_0 = 1\), which corresponds to an almost non-informative prior.

Assuming the priors and given a whole data set \(Y = \{y_i\}\), the type-II maximum likelihood hyperparameter \(\alpha _{ML-II}\) and the posterior distribution of the parameter, \(q(\theta ) = p(\theta |Y, \alpha _{ML-II})\), are obtained by Bayesian estimation.

The hierarchical prior \(p(W|\alpha , \tau )\), which is called an automatic relevance determination (ARD) prior, has an important role in BPCA. The \(j\)th principal axis \(w_j\) has a Gaussian prior, and its variance \(1/(\alpha _j \tau )\) is controlled by a hyperparameter \(\alpha _j\), which is determined by type-II maximum likelihood estimation from the data. When the Euclidean norm of the principal axis, \(\parallel w_j\parallel \), is small relative to the noise variance \(1/\tau \), the hyperparameter \(\alpha _j\) gets large and the principal axis \(w_j\) shrinks nearly to 0. Thus, redundant principal axes are automatically suppressed.

4.4.3.3 EM-Like Repetitive Algorithm

If we know the true parameter \(\theta _{true}\), the posterior of the MVs is given by

$$\begin{aligned} q(Y_{miss}) = p(Y_{miss} | Y_{obs}, \theta _{true}), \end{aligned}$$

which produces equivalent estimation to the PC regression. Here, \(p(Y_{miss}|Y_{obs}, \theta _{true})\) is obtained by marginalizing the likelihood (4.24) with respect to the observed variables \(Y_{obs}\). If we have the parameter posterior \(q(\theta )\) instead of the true parameter, the posterior of the MVs is given by

$$\begin{aligned} q(Y_{miss}) = \int d\theta q(\theta ) p(Y_{miss} | Y_{obs}, \theta ), \end{aligned}$$

which corresponds to the Bayesian PC regression. Since we naturally do not know the true parameter, we conduct BPCA. Although the parameter posterior \(q(\theta )\) can be easily obtained by Bayesian estimation when a complete data set \(Y\) is available, we assume that only a part of \(Y\), \(Y_{obs}\), is observed and the rest, \(Y_{miss}\), is missing. In that situation, it is necessary to obtain \(q(\theta )\) and \(q(Y_{miss})\) simultaneously.

We use a variational Bayes (VB) algorithm in order to execute Bayesian estimation for both the model parameter \(\theta \) and the MVs \(Y_{miss}\). Although the VB algorithm resembles the EM algorithm, which obtains maximum likelihood estimators for \(\theta \) and \(Y_{miss}\), it instead obtains the posterior distributions for \(\theta \) and \(Y_{miss}\), \(q(\theta )\) and \(q(Y_{miss})\), by a repetitive algorithm.

The VB algorithm is implemented as follows: (a) the posterior distribution of the MVs, \(q(Y_{miss})\), is initialized by imputing each of the MVs to the instance-wise average; (b) the posterior distribution of the parameter \(\theta \), \(q(\theta )\), is estimated using the observed data \(Y_{obs}\) and the current posterior distribution of the MVs, \(q(Y_{miss})\); (c) the posterior distribution of the MVs, \(q(Y_{miss})\), is estimated using the current \(q(\theta )\); (d) the hyperparameter \(\alpha \) is updated using both the current \(q(\theta )\) and the current \(q(Y_{miss})\); (e) steps (b)–(d) are repeated until convergence.

The VB algorithm has been proved to converge to a locally optimal solution. Although the convergence to the global optimum is not guaranteed, the VB algorithm for BPCA almost always converges to a single solution. This is probably because the objective function of BPCA has a simple landscape. As a consequence of the VB algorithm, therefore, \(q(\theta )\) and \(q(Y_{miss})\) are expected to approach the global optimal posteriors.

Then, the MVs in the data matrix are imputed to the expectation with respect to the estimated posterior distribution:

$$\begin{aligned} \widehat{Y}_{miss} = \int Y_{miss}\, q(Y_{miss})\, dY_{miss}. \end{aligned}$$
(4.25)

4.5 Imputation of Missing Values. Machine Learning Based Methods

The imputation methods presented in Sect. 4.4 originated in statistics, and thus they model the relationships between the values by searching for the hidden probability distributions. In Artificial Intelligence, modeling the unknown relationships between attributes and inferring the implicit information contained in a sample data set has been done using ML models. Many authors soon noticed that the same process that is carried out to predict a continuous or a nominal value from a previous learning process in regression or classification can be applied to predict MVs. The use of ML methods for imputation spares us from searching for the estimated underlying distribution of the data, but these methods are still subject to the MAR assumption in order to be applied correctly.

Batista [6] tested the classification accuracy of two popular classifiers (C4.5 and CN2), considering KNN as an imputation method (KNNI) alongside MC. Both the CN2 and C4.5 (like [37]) algorithms have their own MV estimation. Their study shows that KNNI results in good accuracy, but only when the attributes are not highly correlated to each other. Related to this work, [1] investigated the effect of four methods that deal with MVs. As in [6], they use KNNI and two other imputation methods (MC and median imputation). They also use the KNN and Linear Discriminant Analysis classifiers. The results of their study show that no significantly harmful effect in accuracy is obtained from the imputation procedure. In addition, they state that the KNNI method is more robust to increasing amounts of MVs in the data set than the other compared methods.

The idea of using ML or Soft Computing techniques as imputation methods spread from this point on. Li et al. [53] use a fuzzy clustering method: Fuzzy K-Means imputation (FKMI). They compare FKMI with mean substitution and K-Means imputation (KMI). Using a Root Mean Square Error analysis, they state that the basic KMI algorithm outperforms the MC method. Experiments also show that the overall performance of the FKMI method is better than that of the basic KMI method, particularly when the percentage of MVs is high. Feng et al. [29] use an SVM for filling in MVs (SVMI), but they do not compare it with any other imputation methods. Furthermore, they state that in this case enough complete examples without MVs should be selected as the training data set.

In the following we describe the main details of the most used imputation methods based on ML techniques. We have tried to stay as close as possible to the original notation used by the authors, so the interested reader can easily continue his or her exploration of the details in the corresponding paper.

4.5.1 Imputation with K-Nearest Neighbor (KNNI)

Using this instance-based algorithm, every time a MV is found in a current instance, KNNI computes the KNN and imputes a value obtained from them. For nominal values, the most common value among all neighbors is taken, and for numerical values the average value is used. Therefore, a proximity measure between instances needs to be defined. The Euclidean distance (a particular case of the \(L_p\) norm distance) is the most commonly used in the literature.

In order to estimate a MV \(y_{ih}\) in the \(i\)th example vector \(y_i\) by KNNI [6], we first select the \(K\) examples whose attribute values are most similar to \(y_i\). Next, the MV is estimated as the average of the corresponding entries in the selected \(K\) example vectors. When there are other MVs in \(y_i\) and/or \(y_j\), their treatment requires some heuristics. The missing entry \(y_{ih}\) is estimated as the average:

$$\begin{aligned} \widehat{y}_{ih} = \frac{\sum _{j \in I_{Kih}} y_{jh}}{|I_{Kih}|}, \end{aligned}$$
(4.26)

where \(I_{Kih}\) is the index set of the KNN examples of the \(i\)th example; if \(y_{jh}\) is missing, the \(j\)th example is excluded from \(I_{Kih}\). Note that KNNI has no theoretical criteria for selecting the best K-value, so the K-value has to be determined empirically.
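A minimal numpy sketch of KNNI for numeric data follows. The distance is the mean of squared differences over the attributes observed in both rows (a heuristic of ours to normalize when instances share different numbers of observed attributes), and at least \(k\) donors are assumed to exist for every MV:

```python
import numpy as np

def knn_impute(X, k=5):
    """Sketch of KNNI (Eq. 4.26) for numeric data with NaN as the MV marker."""
    X = np.asarray(X, dtype=float)
    miss = np.isnan(X)
    X_out = X.copy()
    for i in np.where(miss.any(axis=1))[0]:
        for h in np.where(miss[i])[0]:
            # candidate donors: other rows with attribute h observed
            donors = np.where(~miss[:, h])[0]
            donors = donors[donors != i]
            dists = []
            for j in donors:
                shared = ~miss[i] & ~miss[j]
                dists.append(np.mean((X[i, shared] - X[j, shared]) ** 2)
                             if shared.any() else np.inf)
            nearest = donors[np.argsort(dists)[:k]]
            X_out[i, h] = X[nearest, h].mean()  # Eq. (4.26)
    return X_out
```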

4.5.2 Weighted Imputation with K-Nearest Neighbour (WKNNI)

The weighted KNN method [93] selects the instances with values similar (in terms of distance) to those of the incomplete instance, so it can impute as KNNI does. However, the estimated value now takes into account the different distances to the neighbors, using a weighted mean or the most repeated value according to a similarity measure. The similarity measure \(s_i (y_j)\) between two examples \(y_i\) and \(y_j\) is based on the Euclidean distance calculated over the attributes observed in both. We define the measure as follows:

$$\begin{aligned} 1/s_i(y_j) = \sum _{h \in O_i \bigcap O_j} (y_{ih} - y_{jh})^2, \end{aligned}$$
(4.27)

where \(O_i = \{h \mid \text {the } h\text {th component of } y_i \text { is observed}\}\).

The missing entry \(y_{ih}\) is estimated as the average weighted by the similarity measure:

$$\begin{aligned} \widehat{y}_{ih} = \frac{\sum _{j \in I_{Kih}} s_i(y_j) y_{jh}}{\sum _{j \in I_{Kih}} s_i(y_j)}, \end{aligned}$$
(4.28)

where \(I_{Kih}\) is the index set of the KNN examples of the \(i\)th example; if \(y_{jh}\) is missing, the \(j\)th example is excluded from \(I_{Kih}\). As with KNNI, there are no theoretical criteria for selecting the best K-value, so it has to be determined empirically.
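Given the neighbors selected as in the KNNI sketch above, the WKNNI estimate of a single entry only changes the averaging step; a short sketch:

```python
import numpy as np

def wknni_entry(y_i, neighbours, h):
    """WKNNI estimate (Eqs. 4.27-4.28) of the missing entry h of y_i, given
    k neighbour rows already selected as in KNNI (attribute h observed)."""
    num = den = 0.0
    for y_j in neighbours:
        shared = ~np.isnan(y_i) & ~np.isnan(y_j)
        s = 1.0 / max(np.sum((y_i[shared] - y_j[shared]) ** 2), 1e-12)  # Eq. (4.27)
        num += s * y_j[h]
        den += s
    return num / den  # Eq. (4.28)
```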

4.5.3 K-means Clustering Imputation (KMI)

In K-means clustering [53], the intra-cluster dissimilarity is measured by the summation of the distances between the objects and the centroid of the cluster they are assigned to. A cluster centroid represents the mean value of the objects in the cluster. Given a set of objects, the overall objective of clustering is to divide the data set into groups based on the similarity of objects, and to minimize the intra-cluster dissimilarity. Once the clusters have converged, the last process is to fill in all the non-reference attributes for each incomplete object based on the cluster information. Data objects that belong to the same cluster are taken to be nearest neighbors of each other, and KMI applies a nearest neighbor algorithm to replace the MVs, in a similar way to KNNI.

Given a set of \(N\) objects \(X = \{x_1, x_2,\ldots , x_N\}\), where each object has \(S\) attributes, we use \(x_{ij}\ (1 \le i \le N \text { and } 1 \le j \le S)\) to denote the value of attribute \(j\) in object \(x_i\). Object \(x_i\) is called a complete object if \(\{x_{ij} \ne \phi \mid \forall \, 1 \le j \le S\}\), and an incomplete object if \(\{x_{ij} = \phi \mid \exists \, 1 \le j \le S\}\); in the latter case we say object \(x_i\) has a MV on attribute \(j\). For any incomplete object \(x_i\), we use \(R =\{j \mid x_{ij} \ne \phi , 1 \le j \le S\}\) to denote the set of attributes whose values are available, and these attributes are called reference attributes. Our objective is to obtain the values of the non-reference attributes for the incomplete objects. By the K-means clustering method, we divide the data set \(X\) into \(K\) clusters, and each cluster is represented by the centroid of the set of objects in the cluster. Let \(V = \{v_1,\ldots ,v_K\}\) be the set of \(K\) clusters, where \(v_k\ (1 \le k \le K)\) represents the centroid of cluster \(k\). Note that \(v_k\) is also a vector in an \(S\)-dimensional space. We use \(d(v_k,x_i)\) to denote the distance between centroid \(v_k\) and object \(x_i\).

KMI can be divided into three processes. First, \(K\) complete data objects are randomly selected as the \(K\) centroids. Second, the partition is iteratively modified to reduce the sum of the distances of each object from the centroid of the cluster to which it belongs. The process terminates once the summation of distances falls below a user-specified threshold \(\varepsilon \) (e.g. \(\varepsilon = 100\)), or no change in the centroids was made in the last iteration. The last process is to fill in all the non-reference attributes for each incomplete object based on the cluster information. Data objects that belong to the same cluster are taken as nearest neighbors of each other, and a nearest neighbor algorithm is applied to replace the missing data. The Euclidean distance is used as the distance measure.
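A numpy sketch of KMI under these assumptions. For brevity, the final nearest-neighbor step is replaced by imputing directly from the centroid of the instance's cluster, a simple stand-in for the procedure described above; at least \(K\) complete rows are assumed to exist:

```python
import numpy as np

def kmeans_impute(X, K=3, max_iter=100, seed=0):
    """Sketch of KMI: K-means on mean-filled data, then each MV is replaced
    using the instance's cluster (here, directly by the centroid entry)."""
    X = np.asarray(X, dtype=float)
    miss = np.isnan(X)
    Xf = np.where(miss, np.nanmean(X, axis=0), X)        # provisional fill
    rng = np.random.default_rng(seed)
    complete = np.where(~miss.any(axis=1))[0]            # assumes >= K complete rows
    centroids = Xf[rng.choice(complete, size=K, replace=False)]
    for _ in range(max_iter):
        d = np.linalg.norm(Xf[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        new_c = np.array([Xf[labels == c].mean(axis=0) if (labels == c).any()
                          else centroids[c] for c in range(K)])
        if np.allclose(new_c, centroids):                # centroids stabilized
            break
        centroids = new_c
    X_out = X.copy()
    X_out[miss] = centroids[labels][miss]                # fill from cluster info
    return X_out
```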

4.5.4 Imputation with Fuzzy K-means Clustering (FKMI)

In fuzzy clustering, each data object has a membership function which describes the degree to which it belongs to a certain cluster. We now extend the original K-means clustering method to a fuzzy version to impute missing data [1, 53]. The reason for applying the fuzzy approach is that fuzzy clustering provides a better description tool when the clusters are not well-separated, as is the case in missing data imputation. Moreover, the original K-means clustering may be trapped in a local minimum if the initial points are not selected properly. However, the continuous membership values in fuzzy clustering make the resulting algorithms less susceptible to getting stuck in a local minimum.

The membership function describing the degree to which data object \(x_i\) belongs to cluster \(v_k\) is defined in the next equation:

$$\begin{aligned} U(v_k,x_i) = \frac{d(v_k,x_i)^{-2/(m-1)}}{\sum _{j=1}^K d(v_j,x_i)^{-2/(m-1)}} \end{aligned}$$
(4.29)

where \(m > 1\) is the fuzzifier and \(\sum _{j=1}^K U(v_j,x_i) = 1\) for any data object \(x_i\ (1 \le i \le N)\). Now we cannot simply compute the cluster centroids by the mean values. Instead, we need to consider the membership degree of each data object. Equation (4.30) provides the formula for cluster centroid computation:

$$\begin{aligned} v_k = \frac{\sum _{i=1}^N U(v_k,x_i) \times x_i}{\sum _{i=1}^N U(v_k,x_i)} \end{aligned}$$
(4.30)

Since there are unavailable data in incomplete objects, we use only reference attributes to compute the cluster centroids.

The algorithm for missing data imputation with the fuzzy K-means clustering method also has three processes. Note that in the initialization process, we pick \(K\) centroids which are evenly distributed to avoid a local minimum situation. In the second process, we iteratively update the membership functions and centroids until the overall distance meets the user-specified distance threshold \(\varepsilon \). In this process, we cannot assign a data object to a concrete cluster represented by a cluster centroid (as is done in the basic K-means clustering algorithm), because each data object belongs to all \(K\) clusters with different membership degrees. Finally, we impute the non-reference attributes for each incomplete object. We replace the non-reference attributes of each incomplete data object \(x_i\) based on the information about the membership degrees and the values of the cluster centroids, as shown in the next equation:

$$\begin{aligned} x_{i,j} = \sum _{k=1}^K U(x_i,v_k) \times v_{k,j}, \quad \text {for any non-reference attribute } j \notin R \end{aligned}$$
(4.31)
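A sketch of the two FKMI formulas, assuming a provisionally filled numeric matrix \(X\) and current centroids (the iterative centroid update of Eq. (4.30) proceeds analogously to KMI):

```python
import numpy as np

def fuzzy_memberships(X, centroids, m=2.0):
    """Membership degrees U(v_k, x_i) of Eq. (4.29), Euclidean distance,
    computed on a provisionally filled (fully numeric) X."""
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    d = np.maximum(d, 1e-12)                       # guard against zero distance
    inv = d ** (-2.0 / (m - 1.0))
    return inv / inv.sum(axis=1, keepdims=True)    # each row sums to 1

def fkmi_fill(x_i, U_i, centroids, miss_i):
    """Impute the non-reference attributes of one object via Eq. (4.31)."""
    x = x_i.copy()
    x[miss_i] = (U_i[:, None] * centroids).sum(axis=0)[miss_i]
    return x
```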

4.5.5 Support Vector Machines Imputation (SVMI)

Support Vector Machines Imputation (SVMI) [29] is an SVM regression based algorithm to fill in MVs: it swaps the roles of the attributes, using the decision attribute (output or class) as a condition (input) attribute and the condition attribute containing MVs as the decision attribute, so SVM regression can be used to predict the missing condition attribute values. SVM regression estimation seeks to estimate functions

$$\begin{aligned} f(x) = (wx) + b,\qquad w,x \in \mathbb {R}^n, b \in \mathbb {R} \end{aligned}$$
(4.32)

based on data

$$\begin{aligned} (x_1, y_1),\ldots ,(x_l,y_l) \in \mathbb {R}^n \times \mathbb {R} \end{aligned}$$
(4.33)

by minimizing the regularized risk functional

$$\begin{aligned} \parallel W \parallel ^2 / 2 + C \bullet R_{emp}^\varepsilon \end{aligned}$$
(4.34)

where \(C\) is a constant determining the trade-off between minimizing the training error, or empirical risk

$$\begin{aligned} R_{emp}^\varepsilon = \frac{1}{l} \sum _{i=1}^l \left| y_i - f(x_i) \right| _\varepsilon \end{aligned}$$
(4.35)

and the model complexity term \(\parallel W \parallel ^2\). Here, we use the so-called \(\varepsilon \)-insensitive loss function

$$\begin{aligned} \left| y - f(x) \right| _\varepsilon = \max \{0, | y-f(x)| - \varepsilon \} \end{aligned}$$
(4.36)

The main insight of statistical learning theory is that in order to obtain a small risk, one needs to control both training error and model complexity, i.e. explain the data with a simple model. The minimization of Eq. (4.34) is equivalent to the following constrained optimization problem [17]: minimize

$$\begin{aligned} \tau (w, \xi ^{(*)}) = \frac{1}{2} \parallel w \parallel ^2 +\, C \frac{1}{l} \sum _{i=1}^l (\xi _i + \xi _i^*) \end{aligned}$$
(4.37)

subject to the following constraints

$$\begin{aligned} ((w \bullet x_i)+b)-y_i \le \varepsilon + \xi _i \end{aligned}$$
(4.38)
$$\begin{aligned} y_i - ((w \bullet x_i)+b) \le \varepsilon + \xi _i^* \end{aligned}$$
(4.39)
$$\begin{aligned} \xi _i^{(*)} \ge 0,\quad \varepsilon \ge 0 \end{aligned}$$
(4.40)

As mentioned above, at each point \(x_i\) we allow an error of magnitude \(\varepsilon \). Errors above \(\varepsilon \) are captured by the slack variables \(\xi _i^{(*)}\) (see constraints 4.38 and 4.39). They are penalized in the objective function via the regularization parameter \(C\), chosen a priori.

In the \(\nu \)-SVM the size of \(\varepsilon \) is not defined a priori but is itself a variable. Its value is traded off against model complexity and slack variables via a constant \(\nu \in (0,1]\): minimize

$$\begin{aligned} \tau (W, \xi ^{(*)},\varepsilon ) = \frac{1}{2} \parallel W \parallel ^2 + C \bullet (\nu \varepsilon + \frac{1}{l} \sum _{i=1}^l (\xi _i + \xi _i^*)) \end{aligned}$$
(4.41)

subject to the constraints (4.38)–(4.40). Using Lagrange multiplier techniques, one can show [17] that the minimization of Eq. (4.37) under the constraints (4.38)–(4.40) results in a convex optimization problem with a global minimum. The same is true for the optimization problem (4.41) under the constraints (4.38)–(4.40). At the optimum, the regression estimate can be shown to take the form

$$\begin{aligned} f(x) = \sum _{i=1}^l (\alpha _i^* - \alpha _i)(x_i \bullet x) + b \end{aligned}$$
(4.42)

In most cases, only a subset of the coefficients \((\alpha _i^* - \alpha _i)\) will be nonzero. The corresponding examples \(x_i\) are termed support vectors (SVs). The coefficients and the SVs, as well as the offset \(b\), are computed by the \(\nu \)-SVM algorithm. In order to move from linear (as in Eq. 4.42) to nonlinear functions the following generalization can be made: we map the input vectors \(x_i\) into a high-dimensional feature space \(Z\) through some a priori chosen nonlinear mapping \(\varPhi :X_i \rightarrow Z_i\). We then solve the optimization problem (4.41) in the feature space \(Z\). In this case, the inner product of the input vectors \((x_i \bullet x)\) in Eq. (4.42) is replaced by the inner product of their images in feature space \(Z\), \((\varPhi (x_i) \bullet \varPhi (x))\). The calculation of the inner product in a high-dimensional space is computationally very expensive. Nevertheless, under general conditions (see [17] and references therein) these expensive calculations can be reduced significantly by using a suitable function \(k\) such that

$$\begin{aligned} (\varPhi (x_i) \bullet \varPhi (x)) = k(x_i, x), \end{aligned}$$
(4.43)

leading to nonlinear regressions functions of the form:

$$\begin{aligned} f(x) = \sum _{i=1}^l (\alpha _i^* - \alpha _i) k (x_i,x) + b \end{aligned}$$
(4.44)

The nonlinear function \(k\) is called a kernel [17]. We mostly use a Gaussian kernel

$$\begin{aligned} k(x,y) = \exp (- \parallel x - y \parallel ^2 / (2 \sigma _{kernel}^2)) \end{aligned}$$
(4.45)

We can use SVM regression [29] to predict the missing condition attribute values. To do this, we first select the examples in which there are no missing attribute values. Next, we set one of the condition attributes (input attributes), some of whose values are missing, as the decision attribute (output attribute), and, conversely, use the original decision attribute as a condition attribute. Finally, we use SVM regression to predict the values of the new decision attribute.
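This role-swapping procedure can be sketched with scikit-learn's SVR, assuming numeric attributes, a numerically encoded class vector, and that the rows to impute are complete except for the target attribute; the function name and parameter values are illustrative:

```python
import numpy as np
from sklearn.svm import SVR

def svmi_impute_column(X, y, col):
    """Sketch of SVMI for one numeric attribute `col`: train SVR on
    the complete rows, using the remaining inputs plus the class
    label y (numeric encoding assumed) as condition attributes."""
    miss = np.isnan(X[:, col])
    other = np.delete(np.arange(X.shape[1]), col)
    complete = ~np.isnan(X).any(axis=1)
    # condition attributes: remaining inputs plus the decision attribute
    F_train = np.column_stack([X[complete][:, other], y[complete]])
    t_train = X[complete, col]
    model = SVR(kernel="rbf", C=1.0, epsilon=0.1).fit(F_train, t_train)
    rows = np.where(miss)[0]
    # assumes these rows have no other MVs besides `col`
    F_pred = np.column_stack([X[rows][:, other], y[rows]])
    X_imp = X.copy()
    X_imp[rows, col] = model.predict(F_pred)
    return X_imp
```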

4.5.6 Event Covering (EC)

Based on the work of Wong et al. [99], a mixed-mode probability model is approximated by a discrete one. First, we discretize the continuous components using a minimum loss of information criterion. Treating a mixed-mode feature \(n\)-tuple as a discrete-valued one, the authors propose a new statistical approach for the synthesis of knowledge based on cluster analysis: (1) detect from the data patterns which indicate statistical interdependency; (2) group the given data into clusters based on the detected interdependency; and (3) interpret the underlying patterns for each of the clusters identified. The method of synthesis is based on the authors' event-covering approach. With the developed inference method, we are able to estimate the MVs in the data.

The cluster initiation process involves the analysis of the nearest neighbour distance distribution on a subset of samples, the selection of which is based on a mean probability criterion. Let \(X=(X_1,X_2,\ldots ,X_n)\) be a random \(n\)-tuple of related variables and \(x=(x_1,x_2,\ldots ,x_n)\) be its realization. Then a sample can be represented as \(x\). Let \(S\) be an ensemble of observed samples represented as \(n\)-tuples. The nearest-neighbour distance of a sample \(x_i\) to a set of examples \(S\) is defined as:

$$\begin{aligned} D(x_i,S) = \min _{x_j \in S,\, x_i \ne x_j} d(x_i,x_j) \end{aligned}$$
(4.46)

where \(d(x_i,x_j)\) is a distance measure. Since we are using discrete values, we have adopted the Hamming distance. Let \(C\) be a set of examples forming a simple cluster. We define the maximum within-cluster nearest-neighbour distance as

$$\begin{aligned} D_c^* = \max _{x_i \in C} D(x_i,C) \end{aligned}$$
(4.47)

\(D_c^*\) reflects an interesting characteristic of the cluster configuration: that is, the smaller the \(D_c^*\), the denser the cluster.
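For discrete-valued tuples, Eqs. (4.46) and (4.47) can be computed directly; the helper names in this small NumPy sketch are our own:

```python
import numpy as np

def hamming(a, b):
    # Hamming distance between two discrete-valued tuples
    return int(np.sum(np.asarray(a) != np.asarray(b)))

def nn_distance(x, S):
    # D(x, S) of Eq. (4.46): distance from x to its nearest neighbour in S
    return min(hamming(x, s) for s in S if not np.array_equal(x, s))

def max_within_cluster_nn(C):
    # D_C^* of Eq. (4.47): the smaller it is, the denser the cluster
    return max(nn_distance(x, C) for x in C)
```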

Using a mean probability criterion to select a similar subset of examples, the isolated samples can be easily detected by observing the wide gaps in the nearest-neighbour distance space. The probability distribution from which the criterion is derived for the samples can be estimated using a second-order probability estimation. An estimation of \(P(x)\) known as the dependence tree product approximation can be expressed as:

$$\begin{aligned} \widehat{P}(x) = \prod _{j=1}^n P(x_{m_j} | x_{m_{k(j)}}), \quad 0 \le k(j) < j \end{aligned}$$
(4.48)

where (1) the index set \(\{m_1,m_2,\ldots ,m_n\}\) is a permutation of the integer set \(\{1,2,\ldots ,n\}\), (2) the ordered pairs \((x_{m_j},x_{m_{k(j)}})\) are chosen so that they form the set of branches of a spanning tree defined on \(X\) with their summed MI maximized, and (3) \(P(x_{m_1} | x_{m_0}) = P(x_{m_1})\). The probability defined above is known to be the best second-order approximation of the high-order probability distribution. Then, corresponding to each \(x\) in the ensemble, a probability \(P(x)\) can be estimated.

In general, it is more likely for samples of relatively high probability to form clusters. By introducing the mean probability below, samples can be divided into two subsets: those above the mean and those below. Samples above the mean will be considered first for cluster initiation.

Let \(S = \{x\}\). The mean probability is defined as

$$\begin{aligned} \mu _s = \sum _{x \in S} P(x) / |S| \end{aligned}$$
(4.49)

where \(|S|\) is the number of samples in \(S\). For more details on probability estimation with the dependence tree product approximation, please refer to [13].

When distance is considered for cluster initiation, we can use the following criteria in assigning a sample \(x\) to a cluster.

  1. If there exists more than one cluster, say \(\{C_k | k = 1,2,\ldots \}\), such that \(D(x,C_k) \le D^*\) for all of them, then all these clusters can be merged together.

  2. If exactly one cluster \(C_k\) exists such that \(D(x,C_k) \le D^*\), then \(x\) can be grouped into \(C_k\).

  3. If \(D(x,C_k) > D^*\) for all clusters \(C_k\), then \(x\) may not belong to any cluster.

To avoid including the distance calculation of outliers, we use a simple method suggested in [99] which assigns to \(D^*\) the maximum of all nearest-neighbor distances in \(L\), provided there is a sample in \(L\) having a nearest-neighbor distance value of \(D^*-1\) (with the distance values rounded to the nearest integer value).

After finding the initial clusters along with their membership, the regrouping process is essentially an inference process for estimating the cluster label of a sample. Let \(C=\{a_{c1},a_{c2},\ldots ,a_{cq}\}\) be the set of labels of all possible clusters to which \(x\) can be assigned. For \(X_k\) in \(X\), we can form a contingency table between \(X_k\) and \(C\). Let \(a_{ks}\) and \(a_{cj}\) be possible outcomes of \(X_k\) and \(C\) respectively, and let \(obs(a_{ks})\) and \(obs(a_{cj})\) be the respective marginal frequencies of their observed occurrences. The expected relative frequency of \((a_{ks},a_{cj})\) is expressed as:

$$\begin{aligned} exp(a_{ks},a_{cj}) = \frac{obs(a_{ks}) \times obs(a_{cj})}{|S|} \end{aligned}$$
(4.50)

Let \(obs(a_{ks},a_{cj})\) represent the actual observed frequency of \((a_{ks},a_{cj})\) in \(S\). The expression

$$\begin{aligned} D = \sum _{j=1}^q \frac{(obs(a_{ks},a_{cj}) - exp(a_{ks},a_{cj}))^2}{exp(a_{ks},a_{cj})} \end{aligned}$$
(4.51)

summing over the outcomes of \(C\) in the contingency table, possesses an asymptotic chi-squared property with \((q - 1)\) degrees of freedom. \(D\) can then be used in a criterion for testing the statistical dependency between \(a_{ks}\) and \(C\) at a presumed significance level, as described below. For this purpose, we define a mapping

$$\begin{aligned} h_k^c (a_{ks},C) = \left\{ \begin{array}{l@{\quad }l} 1, &{} \mathrm{if}\,D > \chi ^2(q-1); \\ 0, &{} \mathrm{otherwise.} \end{array} \right. \end{aligned}$$
(4.52)

where \(\chi ^2(q-1)\) is the tabulated chi-squared value. The subset of selected events of \(X_k\), which has statistical interdependency with \(C\), is defined as

$$\begin{aligned} E_k^c = \left\{ a_{ks} | h_k^c (a_{ks},C) = 1 \right\} \end{aligned}$$
(4.53)

We call \(E_k^c\) the covered event subset of \(X_k\) with respect to \(C\). Likewise, the covered event subset \(E_c^k\) of \(C\) with respect to \(X_k\) can be defined. After finding the covered event subsets \(E_c^k\) and \(E_k^c\) for a variable pair \((C,X_k)\), information measures can be used to detect the statistical pattern of these subsets. An interdependence redundancy measure between \(X_k^c\) and \(C^k\) can be defined as

$$\begin{aligned} R(X_k^c,C^k) = \frac{I(X_k^c,C^k)}{H(X_k^c,C^k)} \end{aligned}$$
(4.54)

where \(I(X_k^c,C^k)\) is the expected MI and \(H(X_k^c,C^k)\) is the Shannon’s entropy defined respectively on \(X_k^c\) and \(C^k\):

$$\begin{aligned} I(X_k^c,C^k) = \sum _{a_{cu}\in E_c^k} \sum _{a_{ks} \in E_k^c} P(a_{cu},a_{ks}) \log \frac{P(a_{cu},a_{ks})}{P(a_{cu}) P(a_{ks})} \end{aligned}$$
(4.55)

and

$$\begin{aligned} H(X_k^c,C^k) = - \sum _{a_{cu}\in E_c^k} \sum _{a_{ks} \in E_k^c} P(a_{cu},a_{ks}) \log P(a_{cu},a_{ks}). \end{aligned}$$
(4.56)

The interdependence redundancy measure has a chi-squared distribution:

$$\begin{aligned} I(X_k^c,C^k) \sim \frac{\chi _{df}^2}{2 |S| H(X_k^c,C^k)} \end{aligned}$$
(4.57)

where \(df\) is the corresponding degrees of freedom, having the value \((|E_c^k| -1)(|E_k^c| -1)\). A chi-squared test is then used to select interdependent variables in \(X\) at a presumed significance level.

The cluster regrouping process uses an information measure to regroup data iteratively. Wong et al. have proposed an information measure called normalized surprisal (NS) to indicate the significance of joint information. Using this measure, the information conditioned by an observed event \(x_k\) is weighted according to \(R(X_k^c, C^k)\), its measure of interdependency with the cluster label variable. Therefore, the higher the interdependency of a conditioning event, the more relevant the event is. NS measures the joint information of a hypothesized value based on the selected set of significant components. It is defined as

$$\begin{aligned} NS(a_{cj} | x'(a_{cj})) = \frac{I(a_{cj} | x'(a_{cj}))}{m \left( \sum _{k=1}^m R(X_k^c,C^k) \right) } \end{aligned}$$
(4.58)

where \(I(a_{cj} | x'(a_{cj}))\) is the summation of the weighted conditional information defined on the incomplete probability distribution scheme as

$$\begin{aligned} I(a_{cj} | x'(a_{cj}))&= \sum _{k=1}^m R(X_k^c,C^k) I(a_{cj} | x_k)\nonumber \\&= \sum _{k=1}^m R(X_k^c,C^k) \left( -\log \, \frac{P(a_{cj} | x_k)}{\sum _{a_{cu}\in E_c^k} P(a_{cu} | x_k)} \right) \end{aligned}$$
(4.59)

To render the calculation meaningful in the incomplete probability scheme formulation, \(x_k\) is selected only if

$$\begin{aligned} \sum _{a_{cu} \in E_c^k} P(a_{cu} | x_k) > T \end{aligned}$$
(4.60)

where \(T \ge 0\) is a size threshold for meaningful estimation. NS can be used in a decision rule in the regrouping process. Let \(C = \{a_{c1},\ldots ,a_{cq}\}\) be the set of possible cluster labels. We assign \(a_{cj}\) to \(x_e\) if

$$\begin{aligned} NS(a_{cj} | x'(a_{cj})) = \min _{a_{cu} \in C} NS(a_{cu} | x'(a_{cu})). \end{aligned}$$

If no component is selected with respect to all hypothesized cluster labels, or if there is more than one label associated with the same minimum NS, then the sample is assigned a dummy label, indicating that the estimated cluster label is still uncertain. Also, zero probabilities may be encountered in the probability estimation; in that case an unbiased probability estimate based on entropy minimax is used. In the regrouping algorithm, the cluster label for each sample is estimated iteratively until a stable set of label assignments is attained.

Once the clusters are stable, we take the examples with MVs. We then use the distance \(D(x_i, S) = \min _{x_j \in S,\, x_i \ne x_j} d(x_i,x_j)\) to find the nearest cluster \(C_i\) to each such instance. From this cluster we compute the centroid \(x'\) such that

$$\begin{aligned} D(x', C_i) < D(x_i,C_i) \end{aligned}$$
(4.61)

for all instances \(x_i\) of the cluster \(C_i\). Once the centroid is obtained, the MV of the example is imputed with the value of the corresponding attribute of \(x'\).

4.5.7 Singular Value Decomposition Imputation (SVDI)

In this method, we employ singular value decomposition (4.62) to obtain a set of mutually orthogonal expression patterns that can be linearly combined to approximate the values of all attributes in the data set [93]. These patterns, which in this case are identical to the principal components of the data value matrix, are further referred to as eigenvectors.

$$\begin{aligned} A_{m \times n} = U_{m \times m} \varSigma _{m \times n} V_{n \times n}^T. \end{aligned}$$
(4.62)

Matrix \(V^T\) now contains the eigenvectors, whose contribution to the expression in the eigenspace is quantified by the corresponding eigenvalues on the diagonal of matrix \(\varSigma \). We then identify the most significant eigenvectors by sorting them according to their corresponding eigenvalues. Although it has been shown that a few significant eigenvectors are sufficient to describe most of the expression data, the exact fraction of eigenvectors best for estimation needs to be determined empirically.

Once the \(k\) most significant eigenvectors of \(V^T\) are selected, we estimate a MV \(j\) in example \(i\) by first regressing this attribute value against the \(k\) eigenvectors and then using the coefficients of the regression to reconstruct \(j\) from a linear combination of the \(k\) eigenvectors. The \(j\)th value of example \(i\) and the \(j\)th values of the \(k\) eigenvectors are not used in determining these regression coefficients. It should be noted that SVD can only be performed on complete matrices; therefore we initially substitute the row average for all MVs in matrix \(A\), obtaining \(A'\). We then utilize a regularized EM method to arrive at the final estimate, as follows. Each MV in \(A'\) is estimated using the above algorithm, and the procedure is then repeated on the newly obtained matrix until the total change in the matrix falls below the empirically determined (by the authors of [93]) threshold of \(0.01\) (denoted as the stagnation tolerance in the EM algorithm).
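The following NumPy sketch reproduces the overall SVDI loop (row-average start, regression on the \(k\) dominant right singular vectors, iteration until stagnation); it is a simplification under illustrative defaults, not the exact regularized EM of [93]:

```python
import numpy as np

def svd_impute(A, k=5, tol=0.01, max_iter=50):
    """Sketch of SVDI: row-average initial fill, then iterate
    SVD -> per-entry regression until the total change in the
    matrix falls below the stagnation tolerance."""
    A = A.astype(float)
    miss = np.isnan(A)
    row_mean = np.nanmean(A, axis=1)
    A_hat = np.where(miss, row_mean[:, None], A)   # initial fill A'
    for _ in range(max_iter):
        A_prev = A_hat.copy()
        _, _, Vt = np.linalg.svd(A_hat, full_matrices=False)
        Vk = Vt[:k]                 # k most significant eigenvectors
        for i, j in zip(*np.where(miss)):
            # regress row i on the k eigenvectors; column j (and all
            # other missing positions of row i) is excluded here
            obs = ~miss[i]
            coef, *_ = np.linalg.lstsq(Vk[:, obs].T, A_hat[i, obs],
                                       rcond=None)
            A_hat[i, j] = coef @ Vk[:, j]
        if np.abs(A_hat - A_prev).sum() < tol:     # stagnation check
            break
    return A_hat
```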

4.5.8 Local Least Squares Imputation (LLSI)

In this method, proposed in [49], a target instance with MVs is represented as a linear combination of similar instances. Rather than using all available instances in the data, only similar instances, selected according to a similarity measure, are used; for that reason the method has the "local" connotation. There are two steps in LLSI. The first step is to select \(k\) instances by the \(L_2\)-norm. The second step is regression and estimation, regardless of how the \(k\) instances are selected. A heuristic \(k\) parameter selection method is used by the authors.

Throughout this section, we will use \(X \in \mathbb {R}^{m \times n}\) to denote a data set with \(m\) instances and \(n\) attributes. Since LLSI was proposed for microarrays, it is assumed that \(m \gg n\). In the data set \(X\), a row \(\mathbf {x}_i^T \in \mathbb {R}^{1 \times n}\) represents the values of the \(i\)th instance over the \(n\) attributes:

$$ X = \left( \begin{array}{c} x_1^T\\ \vdots \\ x_m^T \end{array} \right) \in \mathbb {R}^{m \times n} $$

A MV in the \(l\)th location of the \(i\)th instance is denoted as \(\alpha \), i.e.

$$ X(i,l) = \mathbf {x}_i(l) = \alpha $$

For simplicity we first assume that there is a MV in the first position of the first instance, i.e.

$$ X(1,1) = \mathbf {x}_1(1) = \alpha . $$

4.5.8.1 Selecting the Instances

To recover a MV \(\alpha \) in the first location \(\mathbf {x}_1(1)\) of \(\mathbf {x}_1\) in \(X \in \mathbb {R}^{m \times n}\), the KNN instance vectors for \(\mathbf {x}_1\),

$$ \mathbf {x}_{S_i}^T \in \mathbb {R}^{1 \times n},\quad 1 \le i \le k, $$

are found, based on the \(L_2\)-norm. In this process of finding the similar instances, the first component of each instance is ignored, due to the fact that \(\mathbf {x}_1(1)\) is missing. The variant of LLSimpute that uses Pearson's correlation coefficient to select the \(k\) instances can be consulted in [49].

4.5.8.2 Local Least Squares Imputation

As imputation can be performed regardless of how the \(k\) instances are selected, we present only the imputation based on the \(L_2\)-norm, for simplicity. Based on these \(k\) neighboring instance vectors, the matrix \(A \in \mathbb {R}^{k \times (n-1)}\) and the two vectors \(\mathbf {b} \in \mathbb {R}^{k \times 1}\) and \(\mathbf {w} \in \mathbb {R}^{(n-1) \times 1}\) are formed. The \(k\) rows of the matrix \(A\) consist of the KNN instances \(\mathbf {x}^T_{S_i} \in \mathbb {R}^{1 \times n}\), \(1 \le i \le k\), with their first values deleted; the elements of the vector \(\mathbf {b}\) consist of the first components of the \(k\) vectors \(\mathbf {x}^T_{S_i}\); and the elements of the vector \(\mathbf {w}\) are the \(n - 1\) elements of the instance vector \(\mathbf {x}_1\) whose missing first item is deleted. After the matrix \(A\) and the vectors \(\mathbf {b}\) and \(\mathbf {w}\) are formed, the least squares problem is formulated as

$$\begin{aligned} \min _{\mathbf {z}} || A^T \mathbf {z} - \mathbf {w} ||_2 \end{aligned}$$
(4.63)

Then, the MV \(\alpha \) is estimated as a linear combination of the first values of the \(k\) instances

$$\begin{aligned} \alpha = \mathbf {b}^T \mathbf {z} = \mathbf {b}^T (A^T)^\dagger \mathbf {w}, \end{aligned}$$
(4.64)

where \((A^T)^\dagger \) is the pseudoinverse of \(A^T\).

For example, assume that the target instance \(\mathbf {x}_1\) has a MV in its first position among a total of six attributes. If the MV is to be estimated by the \(k\) similar instances, the matrix \(A\) and the vectors \(\mathbf {b}\) and \(\mathbf {w}\) are constructed as

$$\begin{aligned} \left( \begin{array}{c} \mathbf {x}_1^T\\ \mathbf {x}_{S_1}^T\\ \vdots \\ \mathbf {x}_{S_k}^T \end{array} \right)&= \left( \begin{array}{cc} \alpha &{} \mathbf {w}^T\\ \mathbf {b} &{} A \end{array} \right) \nonumber \\&= \left( \begin{array}{cccccc} \alpha &{} \mathbf {w}_1 &{} \mathbf {w}_2 &{} \mathbf {w}_3 &{} \mathbf {w}_4 &{} \mathbf {w}_5\\ \mathbf {b}_1 &{} A_{1,1} &{} A_{1,2} &{} A_{1,3} &{} A_{1,4} &{} A_{1,5} \\ \vdots &{} \vdots &{} \vdots &{} \vdots &{} \vdots &{} \vdots \\ \mathbf {b}_k &{} A_{k,1} &{} A_{k,2} &{} A_{k,3} &{} A_{k,4} &{} A_{k,5} \\ \end{array} \right) \nonumber \end{aligned}$$

where \(\alpha \) is the MV and \(\mathbf {x}^T_{S_1},\ldots ,\mathbf {x}^T_{S_k}\) are the instances similar to \(\mathbf {x}^T_1\). The second through last components of each neighbor instance form the row vectors \(\mathbf {a}^T_i,\; 1 \le i \le k\), of the matrix \(A\). The vector \(\mathbf {w}\) of the known elements of the target instance \(\mathbf {x}_1\) can be represented as a linear combination

$$ \mathbf {w} \simeq \mathbf {z}_1\mathbf {a}_1 + \mathbf {z}_2 \mathbf {a}_2 + \cdots + \mathbf {z}_k\mathbf {a}_k $$

where \(\mathbf {z}_i\) are the coefficients of the linear combination, found from the least squares formulation (4.63). Accordingly, the MV \(\alpha \) in \(\mathbf {x}_1\) can be estimated by

$$ \alpha = \mathbf {b}^T \mathbf {z} = \mathbf {b}_1\mathbf {z}_1 + \mathbf {b}_2 \mathbf {z}_2 + \cdots + \mathbf {b}_k\mathbf {z}_k $$

Now, we deal with the case in which there is more than one MV in an instance vector. In this case, to recover a total of \(q\) MVs in any of the locations of the instance \(\mathbf {x}_1\), first, the KNN instance vectors for \(\mathbf {x}_1\),

$$ \mathbf {x}_{S_i}^T \in \mathbb {R}^{1 \times n}, \quad 1 \le i \le k, $$

are found. In this process of finding the similar instances, the \(q\) components of each instance at the \(q\) locations of the MVs in \(\mathbf {x}_1\) are ignored. Then, based on these \(k\) neighboring instance vectors, a matrix \(A \in \mathbb {R}^{k \times (n-q)}\), a matrix \(B \in \mathbb {R}^{k \times q}\) and a vector \(\mathbf {w} \in \mathbb {R}^{(n-q) \times 1}\) are formed. The \(i\)th row vector \(\mathbf {a}^T_i\) of the matrix \(A\) consists of the \(i\)th nearest-neighbor instance \(\mathbf {x}^T_{S_i} \in \mathbb {R}^{1 \times n},\; 1 \le i \le k\), with its elements at the \(q\) locations of the MVs of \(\mathbf {x}_1\) excluded. The \(j\)th column vector of the matrix \(B\) consists of the values at the \(j\)th location of the MVs \((1 \le j \le q)\) of the \(k\) vectors \(\mathbf {x}^T_{S_i}\). The elements of the vector \(\mathbf {w}\) are the \(n - q\) elements of the instance vector \(\mathbf {x}_1\) whose missing items are deleted. After the matrices \(A\) and \(B\) and the vector \(\mathbf {w}\) are formed, the least squares problem is formulated as

$$\begin{aligned} \min _{\mathbf {z}} ||A^T \mathbf {z} - \mathbf {w} ||_2 \end{aligned}$$
(4.65)

Then, the vector \(\mathbf {u} = (\alpha _1, \alpha _2, \ldots , \alpha _q)^T\) of \(q\) MVs can be estimated as

$$\begin{aligned} \mathbf {u} = \left( \begin{array}{c} \alpha _1\\ \vdots \\ \alpha _q \end{array} \right) = B^T \mathbf {z} = B^T (A^T)^\dagger \mathbf {w}, \end{aligned}$$
(4.66)

where \((A^T)^\dagger \) is the pseudoinverse of \(A^T\).
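A compact NumPy sketch of LLSI for one target row follows, covering the general case of \(q\) MVs via Eqs. (4.65) and (4.66); instances are assumed to lie in rows, and taking the \(k\) neighbors from the complete rows only is our simplification:

```python
import numpy as np

def llsi_impute_row(X, i, k=10):
    """Sketch of LLSI for target row i of X: the q missing entries
    are recovered as B^T z, with z the least squares solution of
    A^T z ~ w (Eqs. 4.65-4.66)."""
    miss = np.isnan(X[i])
    if not miss.any():
        return X[i]
    candidates = np.where(~np.isnan(X).any(axis=1))[0]
    # k nearest complete instances by L2-norm on the observed positions
    d = np.linalg.norm(X[candidates][:, ~miss] - X[i, ~miss], axis=1)
    S = candidates[np.argsort(d)[:k]]
    A = X[S][:, ~miss]          # k x (n - q): positions known in the target
    B = X[S][:, miss]           # k x q: positions missing in the target
    w = X[i, ~miss]             # the n - q known values of the target
    z, *_ = np.linalg.lstsq(A.T, w, rcond=None)   # min ||A^T z - w||_2
    x_imp = X[i].copy()
    x_imp[miss] = B.T @ z       # u = B^T (A^T)^+ w
    return x_imp
```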

4.5.9 Recent Machine Learning Approaches to Missing Values Imputation

Although we have tried to provide an extensive introduction to the most used and basic imputation methods based on ML techniques, there is a great number of journal publications showing their application and particularization to real world problems. In Table 4.1 we give the reader a summary of the latest and most important imputation methods available at the date of publication, including both extensions of the methods introduced above and completely novel proposals.

Table 4.1 Recent and most well-known imputation methods involving ML techniques

4.6 Experimental Comparative Analysis

In this section we aim to provide the reader with a general overview of the behavior and properties of all the imputation methods presented above. However, this is not an easy task. The main question is: what is a good imputation method?

As multiple imputation is a very resource consuming approach, we will focus on the single imputation methods described in this chapter.

4.6.1 Effect of the Imputation Methods in the Attributes’ Relationships

From an unsupervised data point of view, those imputation methods able to generate values close to the true but unknown MV should be the best. This idea has been explored in the literature by using complete data sets and then artificially introducing MVs. Please note that such a mechanism acts as an MCAR MV generation mechanism, validating the use of imputation methods. Imputation methods are then applied to the data, and the distance from each estimate to the original (and known) value is measured. Authors usually choose the mean square error (MSE) or root mean square error (RMSE) to quantify and compare the imputation methods over a set of data sets [6, 32, 41, 77].

On the other hand, other problems arise when we do not have the original values or the problem is supervised. In classification, for example, what is demanded is to impute values that will constitute an easier and more generalizable problem. As a consequence, in this paradigm a good imputation method is one that enables the classifier to obtain better accuracy. This is harder to measure, as we are relating two different values: the MV itself and the class label assigned to the example. Neither MSE nor RMSE can provide us with this kind of information.

One way to measure how good the imputation is for the supervised task is to use Wilson's noise ratio. This measure, proposed by [98], estimates the noise in the data set. For each instance of interest, the method looks for its KNN (using the Euclidean distance), and uses the class labels of such neighbors to classify the considered instance. If the instance is not correctly classified, then the variable \(noise\) is increased by one unit. Therefore, the final noise ratio will be

$$ \mathrm{Wilson's\ noise\ ratio} = \frac{noise}{\#\,\mathrm{instances\ in\ the\ data\ set}}$$

After imputing a data set with different imputation methods, we can measure how disturbing the imputation method is for the classification task. Thus by using Wilson’s noise ratio we can observe which imputation methods reduce the impact of the MVs as a noise, and which methods produce noise when imputing.
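A minimal sketch of Wilson's noise ratio with scikit-learn follows; refitting the classifier with each instance left out is our simple (if slow) way of excluding an instance from its own neighborhood:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def wilson_noise_ratio(X, y, k=5):
    """Sketch of Wilson's noise ratio: each instance is classified by
    its k nearest neighbors (Euclidean distance); misclassified
    instances count as noise."""
    X, y = np.asarray(X), np.asarray(y)
    n = len(X)
    noise = 0
    for i in range(n):
        mask = np.arange(n) != i        # leave the instance itself out
        knn = KNeighborsClassifier(n_neighbors=k).fit(X[mask], y[mask])
        if knn.predict(X[i:i + 1])[0] != y[i]:
            noise += 1
    return noise / n
```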

Another approach is to use mutual information (MI), which is considered to be a good indicator of relevance between two random variables [18]. Recently, the use of the MI measure in FS has become well known and has been seen to be successful [51, 52, 66]. The computation of the MI measure for continuous attributes has been tackled by [51], allowing us to compute the MI measure not only on nominal-valued data sets.

In our approach, we calculate the MI between each input attribute and the class attribute, obtaining a set of values, one for each input attribute. In the next step we compute the ratio of each of these values for the data set imputed with a given imputation method with respect to the non-imputed data set. The average of these ratios shows whether the imputation of the data set produces a gain in information:

$$ \text {Avg. MI Ratio} = \frac{\sum _{x_i \in X} \frac{MI_\alpha (x_i)+1}{MI(x_i)+1}}{|X|} $$

where \(X\) is the set of input attributes, \(MI_\alpha (x_i)\) represents the MI value of the \(i\)th attribute in the imputed data set and \(MI(x_i)\) is the MI value of the \(i\)th input attribute in the non-imputed data set. We have also applied the Laplace correction, adding 1 to both numerator and denominator, as an MI value of zero is possible for some input attributes.

The calculation of \(MI(x_i)\) depends on the type of attribute \(x_i\). If the attribute \(x_i\) is nominal, the MI between \(x_i\) and the class label \(Y\) is computed as follows:

$$ MI_{nominal}(x_i) = I(x_i; Y) = \sum _{z \in x_i} \sum _{y \in Y} p(z,y) \log _2 \frac{p(z,y)}{p(z) p(y)}. $$

On the other hand, if the attribute \(x_i\) is numeric, we have used the Parzen window density estimate as shown in [51] considering a Gaussian window function:

$$ MI_{numeric}(x_i) = I(x_i; Y) = H(Y) - H(Y|x_i); $$

where \(H(Y)\) is the entropy of the class label

$$ H(Y) = - \sum _{y \in Y} p(y) log_2 p(y); $$

and \(H(Y|x_i)\) is the conditional entropy

$$ H(Y|x_i) = - \sum _{z \in x_i} \sum _{y \in Y} p(z,y) \log _2 p(y|z).$$

Considering that each sample has the same probability, applying the Bayes rule and approximating \(p(y|z)\) by the Parzen window, we get:

$$ \hat{H}(Y|x_i) = - \sum _{j=1}^n \frac{1}{n} \sum _{y=1}^N \hat{p}(y|z_j) \log _2 \hat{p}(y|z_j)$$

where \(n\) is the number of instances in the data set, \(N\) is the total number of class labels and \(\hat{p}(y|z)\) is

$$ \hat{p}(y|z) = \frac{\sum _{i \in I_c} \exp \left( - \frac{(z-z_i)^T\varSigma ^{-1}(z-z_i)}{2 h^2} \right) }{\sum _{k=1}^N \sum _{i \in I_k} \exp \left( - \frac{(z-z_i)^T\varSigma ^{-1}(z-z_i)}{2 h^2} \right) }.$$

In this case, \(I_c\) is the set of indices of the training examples belonging to class \(c\), and \(\varSigma \) is the covariance of the random variable \((z-z_i)\).
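For nominal attributes, the average MI ratio with the Laplace correction can be sketched as follows; note that scikit-learn's mutual_info_score uses natural logarithms rather than \(\log _2\) (a constant factor), and the handling of the remaining MVs in the non-imputed data is left to the caller:

```python
import numpy as np
from sklearn.metrics import mutual_info_score

def avg_mi_ratio(X_imp, X_orig, y):
    """Sketch of the average MI ratio with Laplace correction,
    assuming nominal (discrete-coded) attributes in both the imputed
    data X_imp and the original data X_orig."""
    ratios = []
    for j in range(X_orig.shape[1]):
        mi_imp = mutual_info_score(X_imp[:, j], y)    # MI_alpha(x_j)
        mi_orig = mutual_info_score(X_orig[:, j], y)  # MI(x_j)
        ratios.append((mi_imp + 1.0) / (mi_orig + 1.0))
    return float(np.mean(ratios))
```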

Let us consider all the single imputation methods presented in this chapter. For the sake of simplicity we will omit the multiple imputation approaches, as they would require us to select a probability model for every data set, which would be infeasible. In Table 4.2 we have summarized the Wilson's noise ratio values for 21 data sets with MVs from those presented in Sect. 2.1. We must point out that the results of Wilson's noise ratio and MI are relative to a given data set. Hence, the characteristics of the data themselves appear to determine the values of these measures.

Table 4.2 Wilson’s noise ratio values and average MI values per data set

Looking at the results in Table 4.2 we can observe which imputation methods reduce the impact of the MVs as noise, and which methods produce noise when imputing. In addition, the MI ratio allows us to relate the attributes to the imputation results. A value of the MI ratio higher than 1 indicates that the imputation is capable of relating more of the attributes individually to the class labels. A value lower than 1 indicates that the imputation method is adversely affecting the relationship between the individual attributes and the class label.

If we consider the average MI ratio in Table 4.2 we can observe that the average ratios are usually close to 1; that is, the use of imputation methods appears to harm the relationship between the class label and the input attributes little or not at all, even improving it in some cases. However, the MI considers only one attribute at a time, and therefore the relationships between the input attributes are ignored. The imputation methods estimate the MVs using such relationships and can thus afford improvements in the performance of the classifiers. Hence the highest average MI ratios could be related to those methods which obtain better estimates for the MVs while maintaining the degree of relationship between the class labels and the isolated input attributes. It is interesting to note that the MI ratio values do not appear to be as highly data dependent as Wilson's noise ratio, as the values for all the data sets are closer to each other.

If we count the methods with the lowest Wilson's noise ratio in each data set in Table 4.2, we find that the CMC method is first, being the lowest one 12 times, and the EC method is second, being the lowest one 9 times. If we count the methods with the highest MI ratio in each data set, the EC method has the highest ratio for 7 data sets and is therefore first. The CMC method has the highest ratio for 5 data sets and is second in this case. The next question arises immediately: are these methods also the best for the performance of the learning methods applied afterwards? We try to answer this question in the following.

4.6.2 Best Imputation Methods for Classification Methods

Our aim is to use the imputed data sets from the previous Sect. 4.6.1 as the input for a series of well-known classifiers in order to shed light on the question: which is the best imputation method? Let us consider a wide range of classifiers grouped by their nature, as this will help us limit the comparisons needed to be made. We have grouped them into three sub-categories. In Table 4.3 we summarize the classification methods we have used, organized into these three categories. The description of the categories is as follows:

  • The first group is the Rule Induction Learning category. This group refers to algorithms which infer rules using different strategies.

  • The second group represents the Black Box Methods. It includes ANNs, SVMs and statistical learning.

  • The third and last group corresponds to the Lazy Learning (LL) category. This group incorporates methods which do not create any model, but use the training data to perform the classification directly.

Table 4.3 Classifiers used by categories

Some methods do not work with numerical attributes (CN2, AQ and Naïve-Bayes). In order to discretize the numerical values, we have used the well-known discretizer proposed by [28]. For the SVM methods (C-SVM, \(\nu \)-SVM and SMO), we have applied the usual preprocessing found in the literature for these methods [25]: normalizing the numerical attributes to the \([0,1]\) range and binarizing the nominal attributes. Some of the classification methods presented have their own MV treatment that is triggered when no imputation is made (DNI): C4.5 uses a probabilistic approach to handle MVs and CN2 applies the MC method by default in these cases. For ANNs, [24] proposed replacing MVs with zero so as not to activate the neuron that the MV feeds into.

Since showing all the detailed accuracy values for each fold, data set, imputation method and classifier would take too much space, we have used Wilcoxon's signed rank test to summarize them. For each classifier, we have compared every imputation method with all the others in pairs. Every time the classifier obtains a better accuracy value for one imputation method than for another and the statistical test yields a \(p\)-value \(< 0.1\), we count it as a win for the former imputation method; otherwise (\(p\)-value \(\ge 0.1\)) we count it as a tie.
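A sketch of this pairwise win counting with SciPy's Wilcoxon signed-rank test follows; breaking the significant comparisons in favor of the method with the higher mean accuracy is our illustrative choice:

```python
import numpy as np
from scipy.stats import wilcoxon

def pairwise_wins(acc, alpha=0.1):
    """Sketch of the pairwise Wilcoxon comparison. `acc` maps each
    imputation method name to its vector of accuracies over the same
    data sets; a win is counted when the test is significant at alpha."""
    methods = list(acc)
    wins = {m: 0 for m in methods}
    for i, a in enumerate(methods):
        for b in methods[i + 1:]:
            _, p = wilcoxon(acc[a], acc[b])
            if p < alpha:
                # the method with the higher mean accuracy takes the win;
                # otherwise the comparison is counted as a tie
                winner = a if np.mean(acc[a]) > np.mean(acc[b]) else b
                wins[winner] += 1
    return wins
```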

In the case of rule induction learning, in Table 4.4 we show the average ranking of each imputation method for every classifier belonging to this group. We can observe that, for the rule induction learning classifiers, the imputation methods FKMI, SVMI and EC perform best. The differences between these three methods in average rankings are low. Thus we can consider that these three imputation methods are the most suitable for this kind of classifier. They are well separated from the other imputation methods and we cannot choose the best method from among these three. On the other hand, BPCA and DNI are the worst methods.

Table 4.4 Average ranks for the Rule Induction Learning methods
Table 4.5 Average ranks for the Black Box methods

In Table 4.5 we can observe the rankings associated with the methods belonging to the black box category. As can be appreciated, for the black box methods the differences between imputation methods are even more evident. We can select the EC method as the best solution, as its ranking differs by almost 1 from that of KMI, which stands as the second best. This difference increases when considering the third best, FKMI. No other family of classifiers presents such a gap in the rankings. Therefore, for this family of classification methods we could, with some confidence, establish the EC method as the best choice. The DNI and IM methods are among the worst. This means that for the black box methods the use of some kind of MV treatment is mandatory, and the EC method is the most suitable one. As with the rule induction learning methods, the BPCA method is the worst choice, with the highest ranking.

Table 4.6 Average ranks for the Lazy Learning methods

Finally, the results for the last group, Lazy Learning, are presented in Table 4.6. For the LL models, the MC method is the best, with the lowest average ranking. The CMC method, which is relatively similar to MC, also obtains a low rank, very close to MC's. Only the FKMI method obtains a rank low enough to be compared with the MC and CMC methods. The rest of the imputation methods are far from these lowest ranks, with almost two points of difference in the ranking. Again, the DNI and IM methods obtain high rankings. The DNI method is one of the worst, with only the BPCA method performing worse. As with the black box models, the imputation methods produce a significant improvement in the accuracy of these classification methods.

4.6.3 Interesting Comments

In this last section we have carried out an experimental comparison among the imputation methods presented in this chapter, trying to identify the best imputation choice by means of non-parametric statistical testing. The results obtained concur with previous studies:

  • The imputation methods which fill in the MVs outperform case deletion (the IM method) and the lack of imputation (the DNI method).

  • There is no universal imputation method which performs best for all classifiers.

In Sect. 4.6.1 we have analyzed the influence of the imputation methods on the data with respect to two measures: Wilson's noise ratio and the average MI ratio. The first quantifies the noise induced by the imputation method in the instances which contain MVs. The second examines the increment or decrement in the relationship of the isolated input attributes with the class label. We have observed that the CMC and EC methods are the ones which introduce the least noise and best maintain the MI.

According to the results in Sect. 4.6.2, a particular analysis of the MV treatment methods conditioned on the classification method's group seems necessary. Thus, we can stress the recommended imputation algorithms based on the classification method's type: the FKMI imputation method for the rule induction learning group, the EC method for the black box models and the MC method for the lazy learning models. We can confirm the positive effect of the imputation methods on the classifiers' behavior, and the presence of imputation methods more suitable for some particular classifier categories than others.