1 Introduction

Data collection, storage, and analysis have become crucial for the decision-making processes of modern organizations. Typically, various sources and techniques (including surveys, interviews, and sensors) are used for data collection, and the collected data are then generally integrated into a single data set for data mining purposes [23]. For example, different types of sensors such as weather stations are typically used to collect temperature, humidity, and wind speed data in a habitat monitoring system (HMS). Various factors, including human error and misunderstanding, equipment malfunction, and faulty data transmission, can cause data corruption and missing values during the whole process of data collection, storage, and preparation. Approximately 5 % or more of the data values can be lost (missing) unless extreme care is taken by the organizations [32, 35, 40, 48].

In this study, we consider a data set \(D_F\) to be a two-dimensional table, where rows represent the records (\(R = \{R_1, R_2, \ldots , R_n\}\)) and columns represent the attributes (\(A = \{A_1, A_2, \ldots , A_M\}\)). A record \(R_x \in D_F\) has a set of \(M\) attributes \(A = \{A_1, A_2, \ldots , A_M\}\) (i.e., \(|A| = M\)) and represents an individual, such as a patient in the patient data set of a hospital. An attribute \(A_i\) represents a piece of information, such as the age of a record. Each attribute \(A_i\) has a domain. If \(A_i\) is a categorical attribute, then the domain of \(A_i\) is \(A_i =\{A_{i1}, A_{i2},\ldots , A_{ij}\}\), and if \(A_i\) is a numerical attribute, then \(A_i = [low, up]\), where \(low\) and \(up\) are the lower and upper limits of \(A_i\). By \(R_{xi}\), we mean the value of the \(i\)th attribute of the \(x\)th record, and by a missing value of \(R_{xi}\), we mean that the value \(R_{xi}\) is missing. A data set \(D_F\) may have some missing values in it. Note that in this study, we do not consider time series data.

Use of poor-quality data, having missing and incorrect values, can result in inaccurate and nonsensical conclusions, making the whole process of data collection and analysis useless for the users [16, 36]. Therefore, in order to deal with inaccurate and missing values, it is extremely important to have an effective data preprocessing framework [8, 17, 28, 29]. One important data preprocessing task is the imputation/estimation of missing values as accurately as possible. A number of imputation methods have been proposed recently [12, 14, 22, 31, 37–39, 43, 52, 54, 55].

The imputation performance generally depends on the selection of a suitable technique [51]. Different techniques take different approaches for imputation of a missing value. While imputing a missing value, some techniques including EMI [43] and Mean Imputation [22] use the whole data set \(D_F\). On the other hand, some techniques including DMI [35], KMI [31], LLSI [25], ILLSI [9], and IBLLS [12] use only a portion (i.e., a horizontal segment) \(D_a \subset D_F\) where the records \(R_x \in D_a; \forall x\) are similar to the record \(R_y\) having a missing value.

In this paper, we propose a novel imputation technique called A Fuzzy Expectation Maximization and Fuzzy Clustering-based Missing Value Imputation Framework for Data Pre-processing (FEMI). The basic idea of the technique is to impute/estimate a missing value \(R_{xi}\) of a record \(R_x\) using the records that are similar to \(R_x\). Our technique considers the fuzzy nature of a data set, where a record \(R_x\) has an association (i.e., a membership degree) with each cluster instead of one and only one cluster. A cluster is a portion \(D_a\) containing a group of records. Similar records are grouped together in a cluster, and dissimilar records are placed in different clusters.

FEMI uses two levels of fuzziness. In the first level, it considers that the record \(R_x\) (having a missing value \(R_{xi}\)) has a fuzzy nature in the sense that it has membership degrees \(U_{xl}\) with all clusters \(C_l; \forall l\), instead of \(R_x\) having a complete (100 %) association with one cluster \(C_l\) and zero association with all other clusters \(C_j;\forall j \ne l\). The missing value of \(R_x\) is estimated by considering each cluster \(C_l\) separately. Therefore, if there are \(k\) clusters, we get \(k\) imputed values for \(R_{xi}\). A final imputation is then computed as the weighted average of all \(k\) imputed values, using the imputed values (\(v_1, v_2, \ldots , v_k\)) and the membership degrees (\(U_{x1}, U_{x2}, \ldots , U_{xk}\)).

In the second level of fuzziness, FEMI considers that all records \(R_y \in D_F; \forall y\) (not just \(R_x\)) have a fuzzy association with all the \(k\) clusters. Therefore, while imputing the missing value \(R_{xi}\) according to a cluster \(C_l\), FEMI uses a novel fuzzy imputation technique (as explained in Eqs. (6), (7) and (8)) that considers this second level of fuzziness. The fuzzy imputation technique considers all records \(R_y; \forall y\) and their membership degrees \(U_{yl}; \forall y\) with \(C_l\). Note that all records \(R_y; \forall y\) are used for all clusters \(C_l; \forall l\), but each cluster produces an imputed value that is different from the imputed values produced by the other clusters due to the difference in membership degrees of the records with the clusters.

The main contributions of the technique are as follows: 1. The overall framework for imputing the missing values, 2. The use of multiple clusters \(C_l;\forall l\) for imputing a missing value \(R_{xi}\), 3. Combining the imputed values (\(v_1,v_2, \ldots , v_k\)) considering the membership degrees (\(U_{x1},U_{x2}, \ldots , U_{xk} \)) of the record \(R_x\) (that has the missing value \(R_{xi}\)), and 4. Imputing the missing value \(R_{xi}\) using a fuzzy imputation technique [see Eqs. (6), (7), and (8)] applied on each cluster.

We evaluate FEMI on eight (8) natural data sets (available from the UCI Machine Learning Repository [15]) by comparing its performance with the performance of five high-quality existing techniques, namely EMI [22, 43], GkNN [53], FKMI [27, 31], SVR [44, 49] and IBLLS [12], which have been argued to be better than many other existing techniques including Bayesian principal component analysis (BPCA) [33], LLSI [25], and ILLSI [9]. Two evaluation criteria, namely root mean squared error (RMSE) and mean absolute error (MAE), are used. Our experimental results indicate (based on statistical analyses, namely the \(t\) test and confidence interval analysis) that FEMI performs significantly better than EMI, GkNN, FKMI, SVR, and IBLLS.

The organization of the paper is as follows. Section 2 presents a literature review. Our technique (FEMI) is presented in Sect. 3. Section 4 presents experimental results, and Sect. 5 gives concluding remarks.

2 Background study

A number of missing value imputation techniques have recently been proposed [12, 14, 22, 30, 31, 43, 52, 54, 55]. A few of the existing techniques are nearest neighbor (NN), linear interpolation (LIN), cubic spline interpolation, regression-based expectation maximization (REGEM) imputation, self-organizing map (SOM), and multilayer perceptron (MLP) [22]. However, many existing techniques cannot handle a data set having both numerical and categorical attributes [47].

A relatively simple technique imputes a missing value of an attribute using the mean of all available values of that attribute [43]. A more advanced technique called \(k\)-nearest neighbor imputation (kNNI) [4] first finds the \(k\) most similar records in the data set using the Euclidean distance measure. It then imputes a missing categorical value belonging to an attribute by using the most frequent value, of the same attribute, within the \(k\)-nearest neighbor (\(k\)-NN) records. For imputing a numerical value, the technique utilizes the attribute mean value of the \(k\)-NN records. Experimental results show that kNNI performs better than a technique using mean/mode imputation on the whole data set, instead of on the horizontal segment containing the \(k\)-NN records. kNNI is simple in the sense that it does not require building explicit models such as decision trees or forests. However, it needs to search the whole data set as many times as there are records with missing values in order to find the nearest neighbors of each such record. Therefore, the technique can be expensive for a large data set [4, 50].
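To make the kNNI procedure concrete, a minimal sketch is given below (Python/NumPy). It is our own illustration rather than the implementation of [4]: the function name is hypothetical, only numerical attributes are handled (so the mode rule for categorical values is omitted), and distances are computed over the attributes that are available in the incomplete record.

```python
import numpy as np

def knn_impute(data, k=5):
    """Impute numerical missing values (np.nan) with the mean of the k most
    similar complete records, measured by Euclidean distance over the
    attributes that are available in the incomplete record."""
    data = data.astype(float)
    complete = data[~np.isnan(data).any(axis=1)]            # records with no missing values
    for x, row in enumerate(data):
        miss = np.isnan(row)
        if not miss.any():
            continue
        dists = np.sqrt(((complete[:, ~miss] - row[~miss]) ** 2).sum(axis=1))
        neighbours = complete[np.argsort(dists)[:k]]         # the k nearest complete records
        data[x, miss] = neighbours[:, miss].mean(axis=0)     # attribute mean of the neighbours
    return data

# toy usage: the third record has a missing second attribute
D = np.array([[1.0, 2.0], [1.1, 2.1], [0.9, np.nan], [5.0, 6.0]])
print(knn_impute(D, k=2))
```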

Instead of repeatedly finding the \(k\)-NN records for each record having a missing value, another existing technique called “K-means Clustering-based Imputation (KMI)” [27, 31] uses clusters of records for imputation. It first divides a given data set \(D_F\) into two data sets, namely \(D_C\) and \(D_I\), where \(D_C\) contains complete (no missing) records and \(D_I\) contains records having missing values. KMI then partitions the data set \(D_C\) into k (user-defined) clusters using the well-known K-means clustering approach. It then assigns each record belonging to \(D_I\) to the cluster with which it has the minimum distance. The missing value is then imputed through an educated guess based on the records of the cluster. The clustering is done only once for imputing all records belonging to \(D_I\).

In order to impute numerical missing values of a record \(R_i\), a recent technique called “Iterative Bi-Cluster based Local Least Square Imputation” (IBLLS) [12] divides a data set into both horizontal and vertical segments. It first finds the \(k\) most similar records for \(R_i\). Within the \(k\) most similar records, it then calculates the correlation information, \(Q\), between the attributes whose values are available in \(R_i\) and the attributes whose values are missing in \(R_i\). IBLLS then uses the correlation information, \(Q\), in order to re-calculate the \(k\) most similar records for \(R_i\).

IBLLS then detects the attributes (within the \(k\) most similar records) that have high correlations with the attribute having a missing value in \(R_i\). It thus partitions the data set horizontally (using the \(k\) most similar records) and vertically (using the most correlated attributes). IBLLS then applies the Local Least Squares framework [25] to the partition in order to impute a missing value of \(R_i\). IBLLS repeats this procedure for imputing the other missing values (if any) of \(R_i\). Similarly, IBLLS imputes the other records of the data set having missing values.

IBLLS is an iterative method. In each iteration, it checks how well the imputed value agrees with the correlation matrix of the attributes within the data segment that is partitioned both horizontally and vertically. If the imputed value of the current iteration agrees better than the imputed value of the previous iteration, then it replaces the previous imputed value by the new imputed value; otherwise, it keeps the previous imputed value. The process of imputation and updating missing values continues iteratively until the change in the degree of agreement (between an imputed value and the correlation matrix) over two consecutive iterations falls below a user-defined threshold.

Another iterative method called Expectation Maximization Imputation (EMI) [22, 43] relies on the basic concept of the well-known expectation maximization (EM) algorithm [6, 14]. The EM algorithm uses two main steps known as the expectation step (E-Step) and the maximization step (M-Step)  [6, 14]. For the imputation of missing values, the EM algorithm in the E-Step first computes the mean and covariance values of a data set based on the available (that is, non-missing) values. It then imputes the missing values based on the estimated mean and covariance values. The imputed values are the best possible results according to the maximum likelihood approach based on the available information, which in this case is the mean and covariance values. The EM algorithm then goes to the M-Step in order to update the mean and covariance values by taking the imputed values into consideration. It then again uses the E-Step to make a better imputation of the values using the updated mean and covariance values. The steps continue iteratively as long as the imputation quality keeps improving.

Expectation maximization imputation (EMI) uses the mean and covariance matrix of a whole data set (not a segment) in order to impute a numerical missing value. Let \(x_a\) be the vector containing the available values of \(R_i\). If \(R_i\) has four available values, then the vector \(x_a\) has four elements, where the first element contains the first available value of \(R_i\). Let \(x_m\) be the vector that will contain the imputed values for the missing values of \(R_i\). If \(R_i\) has three missing values, then \(x_m\) has three elements, where the first element will store the imputed value of the first missing value of \(R_i\). Let \(\mu _m\) be the mean vector of the attributes having missing values for a record \(R_i\). For example, if \(R_i\) has three missing values, then \(\mu _m\) is a vector having three elements. The first element contains the average value (over all records of \(D_F\)) of the first attribute for which \(R_i\) has a missing value. Similarly, the second and third elements contain the averages of the second and third attributes for which \(R_i\) has missing values. Let \(\mu _a\) be the mean vector of the attributes without missing values for a record \(R_i\). Let \(\mu (=\mu _a \cup \mu _m)\) be the mean vector covering both the attributes with available values and the attributes with missing values. Let \(B\) be a regression coefficient matrix, \(B= \theta ^{-1}_{aa} \theta _{am}\), where \(\theta _{aa}\) is the covariance matrix of the attributes having available values for \(R_i\). For example, if \(R_i\) has four available values, then \(\theta _{aa}\) is a \(4 \times 4\) matrix, where the element \(\sigma _{pq}\) represents the covariance between the \(p\)th and \(q\)th attributes for which \(R_i\) has available values.

In the E-Step, EMI imputes the missing values \(x_m\) of \(R_i\) based on the available values \(x_a\) of \(R_i\), mean vectors \(\mu _a\) and \(\mu _m\), and coefficient matrix \(B\) as shown in Eq. (1). During the imputation, EMI considers the correlations among the attributes.

In Eq. (1), EMI considers that the deviation of a missing value \(R_{ij}\in x_m\) from the mean value of the \(j\)th attribute \(\mu _j\in \mu _m\) is proportional to the deviation of an available value \(R_{il}\in x_a\) from the mean value of the \(l\)th attribute \(\mu _l\in \mu _a\), when the correlation between the \(j\)th and the \(l\)th attribute is high. The missing values \(x_m\) of \(R_i\) are imputed using Eq. (1) as follows  [43].

$$\begin{aligned} x_{m}= \mu _{m} + (x_{a}-\mu _{a})B+ e \end{aligned}$$
(1)

where \(e\) is a residual error with mean zero and unknown covariance matrix [43]. The \(e\) value is obtained by randomizing the covariances of the attributes as explained in Step 4 of Sect. 3.3. The \(e\) value is used only in the first iteration (i.e., in the first execution of the E-step) of the EMI algorithm.

In the M-Step, EMI again estimates the mean vector \(\mu \) and the coefficient matrix \(B\) considering the imputed data set. The objective of this step is to maximize the imputation quality of the E-Step.

EMI repeats the E-Step and M-Step until the difference between the mean (and covariance matrix) of the current iteration and the mean (and covariance matrix) of the previous iteration is less than user-defined thresholds.
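To illustrate how Eq. (1) is used in the E-step, a simplified single-record sketch is given below (Python/NumPy). It is a sketch under stated assumptions rather than the full EMI algorithm: the function name is ours, the residual term \(e\) is omitted, and the surrounding E-/M-step iteration that re-estimates the mean vector and covariance matrix is not shown.

```python
import numpy as np

def em_impute_record(record, mu, cov):
    """One E-step imputation of a single record (Eq. (1) without the residual
    term e): x_m = mu_m + (x_a - mu_a) B, where B = inv(theta_aa) theta_am."""
    miss = np.isnan(record)
    x_a = record[~miss]
    mu_a, mu_m = mu[~miss], mu[miss]
    cov_aa = cov[np.ix_(~miss, ~miss)]     # covariances among the available attributes
    cov_am = cov[np.ix_(~miss, miss)]      # covariances between available and missing attributes
    B = np.linalg.solve(cov_aa, cov_am)    # regression coefficient matrix
    imputed = record.copy()
    imputed[miss] = mu_m + (x_a - mu_a) @ B
    return imputed

# toy usage: mean vector and covariance matrix estimated from the data set
mu = np.array([2.0, 3.0, 4.0])
cov = np.array([[1.0, 0.8, 0.2],
                [0.8, 1.0, 0.1],
                [0.2, 0.1, 1.0]])
print(em_impute_record(np.array([2.5, np.nan, 3.9]), mu, cov))
```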

3 A novel missing value imputation framework

3.1 The basic contributions

We propose a technique called A Fuzzy Expectation Maximization and Fuzzy Clustering-based Missing Value Imputation Framework for Data Pre-processing (FEMI) that makes use of a fuzzy clustering technique and a fuzzy expectation maximization algorithm for imputation. Before we discuss the technique in detail, we first introduce the basic concepts and contributions of the proposed framework as follows.

If a numerical attribute value of a record is missing, then it can be imputed based on the available attribute values of the record and the correlations among the attributes in the data set \(D_F\), as shown for the EMI technique [43] in Sect. 2. We argue that the imputation accuracy is likely to be high when all records \(R_x \in D_F;\forall x\) are very similar to each other and the correlations among the attributes are high. By similar records, we mean records having similar attribute values, resulting in low distances among them. When the records are very similar to each other, the total number of possible values/options for an attribute is low. A low number of possible values helps to achieve high imputation accuracy. Similarly, when the correlation between two attributes is high, then with an increase in the value of one attribute, the value of the other attribute also increases (or decreases in the case of a high negative correlation). Therefore, by knowing the value of one attribute, it is possible to impute the value of the other attribute more accurately. Based on this understanding, we aim to first find a group of similar records (from \(D_F\)) with high correlations among the attributes and then apply an imputation technique within the group. In order to find similar records, we perform a clustering on the records. Note that the existing EMI [14, 22, 43] uses all records of the whole data set for imputing a missing value \(R_{xi}\) and therefore does not use any clusters.

If we perform a clustering on the records of \(D_F\), then we can expect to get similar records in a cluster and dissimilar records in different clusters [11, 26]. We argue that the records grouped together in a cluster (i.e., similar records) should typically also have a high correlation between two attributes of the data set. An initial experiment is carried out on the Yeast data set [15] in order to empirically assess the validity of this argument. We first compute the correlation \(C_{i,j}\) between two attributes \(A_i\) and \(A_j\) based on all records of the Yeast data set. We then group the records of the Yeast data set into ten clusters. The correlation \(C_{i,j}^l\) between the attributes \(A_i\) and \(A_j\) is then computed for the records belonging to a cluster \(C_l\). If \(C_{i,j}^l > C_{i,j}\), then we have a stronger/higher correlation between \(A_i\) and \(A_j\) within the cluster \(C_l\) than over the whole data set \(D_F\). Since Yeast has 8 attributes, there are 28 possible pairs of attributes and, therefore, 28 correlations \(C_{i,j}\). Table 1 shows the number of pairs of attributes having higher correlations within a cluster. Column A shows the cluster ID, and Column B presents the number of attribute pairs (out of 28 pairs) having a stronger correlation within the cluster than for the whole data set. It clearly shows that attributes have higher correlations within the clusters than in the whole data set. Column C presents the number of records in each cluster.

Table 1 Correlation analysis on the Yeast data set

We also calculate the similarity of a record \(R_x\) with another record \(R_y\) for all \(R_x\) and \(R_y\). All these similarities are then used to calculate the average similarity \(S_D\) among the records for the whole data set, as shown by the left bar graph in Fig. 1. The average similarity \(S_l\) for the records within a cluster \(C_l\) is also calculated. We then compute the average similarity \(S_c = \frac{\sum \nolimits _{l =1 }^k{S_l}}{k}\), where \(k\) is the total number of clusters. Figure 1 shows that the similarity among the records within the clusters is higher than the similarity among the records for the whole data set.

Fig. 1 Similarity analysis on Yeast data set

Sometimes, the records in a data set do not have a clear separation among them, and therefore, obvious boundaries do not exist among the clusters. For example, Fig. 2 shows an example (two-dimensional) data set, where each dot represents a record \(R_x\). Each record \(R_x\) has two attribute values: one along the \(X\)-axis and the other along the \(Y\)-axis. There are perhaps four clusters in the four corners of the example data set (see Fig. 2). However, it can be extremely difficult to draw the exact boundaries of the clusters. For example, the records just outside the outer circle (in the top left corner of Fig. 2) are almost as strongly attached to the cluster (shown in the top left corner) as the records just inside the circle. Additionally, the records also have some association with the other clusters. Therefore, in the fuzzy clustering approach [11, 26], a record is assigned to all clusters with different membership degrees, based on the argument that a record may not just belong to one or the other cluster; rather, it can have different degrees of association with all clusters.

Fig. 2 Basic concepts of FEMI

In this study, we propose two levels of fuzziness. For the first level of fuzziness, we consider that a record \(R_x\) having a missing value has a membership degree (i.e., a fuzzy association) with each cluster. The fuzzy association of a record with all clusters can be computed using a fuzzy clustering technique such as GFCM [26], as briefly explained in Sect. 3.2. Therefore, instead of imputing a missing value of \(R_x\) based on just one cluster, we use all \(k\) clusters and thereby produce \(k\) imputed values (\(v_1\), \(v_2\), \(\ldots \), \(v_k\)). We then compute the final imputation using \(v_1\), \(v_2\), \(\ldots \), \(v_k\) and the membership degrees (\(U_{x1}\),\(U_{x2}\), \(\ldots \), \(U_{xk}\)) of \(R_x\) with the \(k\) clusters. The cluster with which the record \(R_x\) has a higher membership degree has more influence on the imputation than a cluster with a lower membership degree.

The second level of fuzziness is that while imputing a missing value (\(R_{xi}=v_l\)) based on a cluster \(C_l\), we consider that each record of the data set (\(R_x \in D_F;\forall x\)) has a fuzzy association with \(C_l\). Therefore, while calculating the mean vector for the missing values \(\mu _m\), mean vector for the available values \(\mu _a\) and regression coefficient matrix \(B\) (see Eq. (1)) for \(C_l\), we take all records of a data set into consideration according to their membership degrees with \(C_l\). Hence, we modify Eq. (1) as shown in Eqs. (6), (7) and (8).

Let us give a logical justification of the second level of fuzziness using Fig. 2. Let us assume that a record \(R_x\) has a missing value \(R_{xi}\) and is located somewhere within the smaller/inner circle in the top left corner of Fig. 2. Let us also assume that \(R_{xi}\) is the \(Y\)-axis value of \(R_x\) in the two-dimensional example data set. According to our framework, we need to find the records similar to \(R_x\), i.e., we need to find the cluster to which \(R_x\) belongs. If we consider the records within the inner circle as the cluster, then we get a different set of \(\mu _a\), \(\mu _m\) and \(B\) values than the values we get if we consider the records within the outer circle as the cluster. Since the \(Y\)-axis value of \(R_x\) is missing, \(\mu _a\) is the average of the \(X\)-axis values of all records within a cluster. It is not trivial to determine which set of values (for \(\mu _a\), \(\mu _m\) and \(B\)) is the most useful or sensible, since for different cluster boundaries we get different values. It is not possible to determine the exact boundaries of the clusters, that is, it is not possible to identify the exact set of records belonging to a cluster for the example data set. Therefore, while calculating \(\mu _a\), \(\mu _m\), and \(B\) for \(C_l\), we consider all records of the data set and their membership degrees with \(C_l\), as shown in Eq. (6).

Hence, when a data set has the fuzzy nature (as shown in Fig. 2), we expect our novel technique to perform better than even the existing techniques that use similar records either by K-nearest neighbors or hard clustering. Besides, our technique should not be disadvantaged even for a data set that does not have the fuzzy nature since our technique can adjust its behavior accordingly through the use of membership degrees. If a data set is completely non-fuzzy, then the records belonging to a cluster will have a membership degree equal to 1 for the cluster and zero for all other clusters.

The novel concept of the two levels of fuzziness, the necessary modifications of Eq. (1) [see Eqs. (6), (7) and (8)] in order to support the fuzzy EMI, and the overall framework are the basic contributions of this study. Note that the existing EM algorithms [14, 22, 43] do not consider the fuzzy clustering approach and fuzzy EMI imputation while imputing missing values. In the following section, we briefly introduce an existing fuzzy clustering technique called GFCM [26] to give the interested reader a clear understanding of fuzzy clustering.

3.2 A general fuzzy C-means (GFCM) [26] clustering algorithm

Clustering algorithms can be grouped into two categories, namely hard clustering and fuzzy (soft) clustering [3]. In hard clustering, a record \(R_i\) of a data set \(D_F\) belongs to one and only one cluster, namely the cluster to which \(R_i\) is the most similar. However, in fuzzy clustering, \(R_i\) has a certain probability (called the membership degree) of belonging to each of the clusters. The membership degree \(U_{ik}\) of the record \(R_i\) with the cluster \(C_k\) can vary between 0 and 1. The value \(U_{ik}=1\) indicates a complete association between \(R_i\) and \(C_k\), and \(U_{ik}=0\) indicates a complete absence of any association between \(R_i\) and \(C_k\). Moreover, the total association of \(R_i\) with the k clusters (i.e., \(C_1, C_2, \ldots , C_k\)) is equal to 1 (i.e., \(\sum _{j=1}^{k}{U_{ij}}=1\)) [5, 26].

We now discuss the general fuzzy C-means (GFCM) [26] clustering algorithm as follows. Let a data set \(D_F\) have \(N\) records and \(M\) attributes. GFCM first requires a user-defined value for \(k\), based on which it groups the records of \(D_F\) into \(k\) clusters (i.e., \(C_1, C_2, \ldots , C_k\)). It then randomly assigns a membership degree to each record \(R_i\) for each cluster in such a way that \(\sum _{j=1}^{k}{U_{ij}}=1\). Therefore, for \(k\) clusters, a record \(R_i\) has \(k\) membership degrees (i.e., {\(U_{i1},U_{i2},\ldots ,U_{ik}\)}), where \(\sum _{j=1}^{k}{U_{ij}}=1\). The membership degrees of all records with all clusters can be stored in a two-dimensional matrix \(U\) having \(N\) rows and \(k\) columns, where \(U_{ij}\) contains the membership degree of the \(i\)th record with the \(j\)th cluster.

Based on \(U\), GFCM then calculates the center of each cluster. Let \(V =\{V_1, V_2, \ldots , V_k\}\) be a set of \(k\) centers for \(k\)-clusters. The center \(V_j\in V\) of the \(j\)th cluster contains \(M\) values \(\{v_{j1},v_{j2},\ldots ,v_{jM}\}\) for the \(M\) attributes of the data set, where \(v_{jl}\) is the \(l\)th attribute value of the \(j\)th cluster center.

While computing the attribute values of a cluster center, GFCM applies different approaches for numerical and categorical attributes. For a numerical attribute \(A_l\in A\), GFCM calculates \(v_{jl}\) by taking the weighted average of the \(l\)th attribute values for all records of \(D_F\). The weighted average is computed considering the membership degrees of the records with the \(j\)th cluster as shown in Eq. (2).

$$\begin{aligned} v_{jl}= \frac{\sum _{i=1}^{N}{U^m_{ij}R_{il}}}{{\sum _{i=1}^{N}}{U^m_{ij}}} \end{aligned}$$
(2)

where \(m\) is a fuzzification coefficient, which is greater than 1.0 and the default value of which is 1.3 [26].

For a categorical attribute \(A_l\in A\), GFCM considers that the \(l\)th attribute value of a cluster center actually contains all domain values of \(A_l\) instead of just one of the domain values. Each of the domain values of \(A_l\) has a probability of being the actual value of the \(l\)th attribute for the center. For example, let us assume that the domain size of \(A_l\) is three meaning that \(A_l\) has three domain values say \(x\), \(y\) and \(z\). GFCM calculates the probability \(v^x_{jl}\) of the domain value \(x\) being the actual value of \(v_{jl}\) as shown in Eq. (3). Similarly, it also calculates the probabilities \(v^y_{jl}\) and \(v^z_{jl}\).

$$\begin{aligned} v^x_{jl}=\frac{\sum \nolimits _{i=1}^{N}{\left( U^m_{ij}|R_{il}=x\right) }}{\sum \nolimits _{i=1}^{N}{U^m_{ij}}} \end{aligned}$$
(3)

The random assignment of the membership degrees \(U_{ij};\forall i,j\) and the calculation of the cluster centers \(V = \{V_1, V_2, \ldots , V_k\}\) are considered to be the first iteration of GFCM. After the completion of the first iteration, GFCM then enters into the second iteration, where it recalculates the membership degrees [see Eq. (4)] and the cluster centers as explained above. The membership degree \(U_{ij}\) of a record \(R_i\) with a cluster \(C_j\) is calculated using the similarity of \(R_i\) with \(C_j\). The record having high similarity with a cluster will have high membership degree with the cluster [26].

$$\begin{aligned} U_{ij}= \frac{1}{\sum \nolimits _{z=1}^{k} \left( \frac{\sum \nolimits _{l=1}^{M} {\delta (R_{il},v_{jl})^2}}{\sum \nolimits _{l=1}^{M}{\delta (R_{il},v_{zl})^2}}\right) ^{\frac{1}{(m-1)}}} \end{aligned}$$
(4)

where \(\delta (R_{il},v_{jl})\) is the dissimilarity between \(R_{il}\) and \(v_{jl}\).

Based on the updated (recalculated) membership degrees, a set of cluster centers are calculated using Eqs. (2) and (3), as explained before. The process of recalculating the membership degrees and the cluster centers continues iteratively until a termination condition is met.

Each iteration of GFCM is expected to improve the cluster quality over the previous iteration. The termination condition of GFCM is met when the improvement of cluster quality stops over two consecutive iterations. The cluster quality is evaluated through the dissimilarity (as calculated by Eq. (5)) of the records within a cluster. The lower the total dissimilarity, the better the cluster quality. Equation (5) computes the total dissimilarity of the clustering solution of the \(t\)th iteration.

$$\begin{aligned} d_t= \sum _{j=1}^{k}\sum _{i=1}^{N}{U^m_{ij}} \sum _{l=1}^{M}{\delta (R_{il},v_{jl})^2} \end{aligned}$$
(5)
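To make the GFCM loop concrete, a minimal sketch for numerical attributes only is given below (Python/NumPy). It is our own simplified illustration of Eqs. (2), (4) and (5): the categorical center of Eq. (3) is omitted, the function name and the stopping tolerance are assumptions, and the squared Euclidean distance is used as the dissimilarity \(\delta^2\).

```python
import numpy as np

def gfcm_numerical(X, k, m=1.3, tol=1e-6, max_iter=100, seed=0):
    """Fuzzy clustering of numerical records: returns cluster centers V (k x M)
    and the membership matrix U (N x k), where each row of U sums to 1."""
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    U = rng.random((N, k))
    U /= U.sum(axis=1, keepdims=True)                 # random memberships summing to 1 per record
    d_prev = np.inf
    for _ in range(max_iter):
        Um = U ** m
        V = (Um.T @ X) / Um.sum(axis=0)[:, None]      # Eq. (2): membership-weighted cluster centers
        dist2 = ((X[:, None, :] - V[None, :, :]) ** 2).sum(axis=-1)  # squared dissimilarity to centers
        dist2 = np.maximum(dist2, 1e-12)              # guard against a record sitting on a center
        ratio = (dist2[:, :, None] / dist2[:, None, :]) ** (1.0 / (m - 1))
        U = 1.0 / ratio.sum(axis=2)                   # Eq. (4): updated membership degrees
        d_t = (U ** m * dist2).sum()                  # Eq. (5): total dissimilarity of this iteration
        if abs(d_prev - d_t) < tol:                   # stop when cluster quality stops improving
            break
        d_prev = d_t
    return V, U
```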

3.3 The main steps of our proposed technique

We now introduce the main steps of the FEMI framework. We also present an overall block diagram of the FEMI framework, as shown in Fig. 3. Besides, we present a running example to illustrate the steps of FEMI. Note that the missing values of the running example (the main purpose of which is to illustrate the main steps clearly) may appear to be straightforward and could be imputed using a simple functional dependency analysis, without requiring the sophistication of FEMI. However, real data sets are typically a lot more complicated than the running example, requiring the extra elegance of FEMI, as evident from the experimental results presented in Sect. 4.

Step 1: Copy a full data set \(D_F\) into \(D_N\) and normalize all numerical attributes of \(D_N\) within a range between 0 and 1.

Step 2: Divide the data set \(D_N\) into two sub data sets \(D_C\) (having only records without missing values) and \(D_I\) (having only records with missing values).

Step 3: Find membership degrees of all records of \(D_C\) and \(D_I\) with all clusters.

Step 4: Apply our novel FuzzyEM method to impute numerical missing values using all clusters.

Step 5: Find the combined imputed value of a numerical attribute. Find the imputed value of a categorical attribute.

Step 6: Combine records to form a completed data set (\(D_F^\prime \)) without any missing values.

Fig. 3 The overall block diagram of our proposed technique

Algorithm 1 (the FEMI algorithm) and Algorithm 2 (Procedure FuzzyEM)

Step 1: Copy a full data set \(D_F\) into \(D_N\) and normalize all numerical attributes of \(D_N\) within a range between 0 and 1.

We first make a copy of a data set \(D_F\), having \(|R|\) records and \(|A|\) attributes, into \(D_N\). \(R = \{R_1, R_2, \ldots , R_{n}\}\) is the set of records and \(A = \{A_1, A_2, \ldots , A_{m}\}\) is the set of attributes in \(D_F\). We then normalize each numerical attribute \(A_i \in A\) of \(D_N\) into a range between 0 and 1, as shown in Step 1 of the FEMI algorithm (Algorithm 1). Table 2a shows an example data set \(D_F\). The data set \(D_F\) has 15 records and 4 attributes, of which two are numerical and two are categorical. The records R2, R8, R10, and R13 have missing values for the attributes Education (Edu.), Position (Pos.), Age and Salary (in thousand), respectively. The normalized data set \(D_N\) is shown in Table 2b.

Table 2 A sample data set \(D_{F}\) and normalized data set \(D_{N}\)

Step 2: Divide the data set \(D_N\) into two sub data sets \(D_C\) (having only records without missing values) and \(D_I\) (having only records with missing values).

In this step, we divide the data set \(D_N\) into two sub data sets namely \(D_C\) and \(D_I\), where \(D_C\) contains \(|R^\prime |\) records without missing values (see Table 2c) and \(D_I\) contains \((|R|-|R^\prime |)\) records with missing values (see Table 2d) as shown in the FEMI algorithm (Step 2 of Algorithm 1).

Step 3: Find membership degrees of all records of \(D_C\) and \(D_I\) with all clusters.

In Step 3, we apply a fuzzy clustering technique such as GFCM [26] on \(D_C\), with a user-defined number of clusters k as an input, in order to produce a set of k cluster centers V = \(\{V_1, V_2, \ldots , V_k\}\), as shown in Step 3 of Algorithm 1. The clustering technique also produces a membership degree matrix \(U^C\) having \(|R^\prime |\) rows and k columns, where \(U^C_{ij}\) is the membership degree of the ith record (\(R_i \in D_C\)) with the jth cluster.

Using the membership degree calculation equation of the fuzzy clustering technique, we then find the membership degree matrix \(U^I\) for all records of \(D_I\) and the same k clusters having the same cluster centers V = \(\{V_1, V_2,\) \(\ldots , V_k\}\) as mentioned before. \(U^I_{lm}\) is the membership degree of the lth record (\(R_l \in D_I\)) with the mth cluster. The membership degree of a record with a cluster is inversely proportional to the distance between the record and the center of the cluster. We next combine \(U^C\) and \(U^I\) into \(U\), which has \(|R|\) rows and k columns.

For the example data set (as shown in Table 2c), let us use k = 3 in order to produce 3 clusters. The membership degrees of each record \(R_i \in D_C\) with the 3 clusters are shown in Table 3a. Each cluster has a centroid, which contains the weighted mean of each numerical attribute and the frequency of each attribute value of a categorical attribute, as shown in Table 3b.

Table 3 Membership degree of the records of \(D_{C}\) and \(D_{I}\) with three clusters and clusters centroids \(V\)

We then calculate the membership degrees of each record \(R_l \in D_I\) with a cluster \(C_j;\forall j\), using the centroid of \(C_j\) and the available values of \(R_l\). The membership degrees are calculated based on the equations used in GFCM [26]. In our example, the membership degrees \(U^I\) of \(R_l \in D_I; \forall R_l\) with the 3 clusters are shown in Table 3c. Finally, we combine the membership degrees \(U^C\) and \(U^I\) into \(U\), where \(U_{ij}\) is the membership degree of the ith record \(R_i \in D_F\) with the jth cluster. Note that the membership degree of a record \(R_i \in D_F\) with a cluster \(C_j\) is the same as the membership degree of the record \(R^N_i \in D_N\), where \(R^N_i\) is the normalized form of \(R_i\).

Step 4: Apply our novel FuzzyEM method to impute numerical missing values using all clusters.

In this step, we impute the missing numerical attribute values using a novel fuzzy expectation maximization approach, which is a modification (fuzzy version) of an existing approach [43]. The basic idea of this step is to use the membership degrees of the records with a cluster in order to impute the missing values, as shown in Step 4 of the FEMI algorithm (see Algorithm 1) and Procedure FuzzyEM (see Algorithm 2). The missing values are imputed using the membership degrees of all clusters, one by one. For example, we impute the missing values using the membership degrees of the records with Cluster 1 and store the imputed data set in \(D^1_I\). Similarly, we produce \(D^2_I\), \(D^3_I\), \(\ldots \), \(D^k_I\) from Cluster 2, Cluster 3, \(\ldots \), Cluster \(k\), respectively.

FuzzyEM takes four input parameters, namely the index \(k\) of a cluster, the data set \(D_I\) having records with missing values, the full data set \(D_F\), and the membership degree matrix \(U\). We next describe the proposed iterative imputation process as follows.

In this step, we first denormalize the \(D_I\) data set, where each record has one or more missing values. Let \(R_i\) be a record of \(D_I\). \(R_i\) has \(m\) numerical attributes with missing values and \(a\) numerical attributes with available values.

Let \(\mu ^k_{mis}\) and \(\mu ^k_{av}\) be the fuzzy mean vectors of the \(m\) attributes with missing values and the \(a\) attributes with available values, respectively. The fuzzy mean is calculated considering the membership degree of each record; it is the average value weighted by the membership degrees, as shown in Eq. (6). We calculate the fuzzy mean, \(\mu ^k_p\), of the pth attribute according to the kth cluster as follows. Note that the fuzzy mean of an attribute according to a cluster can be different from the fuzzy mean of the attribute according to another cluster.

$$\begin{aligned} \mu ^{k}_{p}=\frac{\sum ^{|R|}_{i=1} U_{ik} R_{ip}}{\sum ^{|R|}_{i=1}U_{ik}} \end{aligned}$$
(6)

where \(|R|\) is the number of records in \(D_F\), \(U_{ik}\) is the membership degree of the ith record with the kth cluster, and \(R_{ip}\) is the pth attribute value of the ith record.

Using \(\mu ^k_{mis}\) and \(\mu ^k_{av}\), we now calculate the fuzzy regression coefficient matrix \(\mathbf B =(\theta ^{k}_{aa})^{-1} \theta ^k_{am}\), where the element \(\sigma ^k_{pq}\), of the fuzzy covariance matrix \(\theta ^{k}_{aa}\) or \(\theta ^{k}_{am}\), is the covariance between the \(p\)th and \(q\)th attributes according to the \(k\)th cluster and is calculated as follows. It is a covariance value weighted by the membership degrees of the records with the cluster, as shown in Eq. (7). Note that the covariance of two attributes according to a cluster can be different from the covariance of the attributes according to a different cluster.

$$\begin{aligned} \sigma ^{k}_{pq} = \frac{\sum ^{|R|}_{i=1}U_{ik} (R_{ip}-\mu ^k_{p})(R_{iq}-\mu ^k_{q})}{\sum ^{|R|}_{i=1}U_{ik}} \end{aligned}$$
(7)

Now, let \(x_m\) and \(x_a\) be the vectors of missing attribute values and available attribute values in \(R_i\), respectively. We impute the missing values of \(x_m\) as follows.

$$\begin{aligned} x_{m}= \mu ^{k}_{mis} + (x_{a}-\mu ^{k}_{av})\mathbf B + e \end{aligned}$$
(8)

where \(e\) is a residual error with mean zero and unknown covariance matrix \(Q(=\theta _{mm} - \theta _{ma} \theta ^{-1}_{aa} \theta _{am})\). Following an existing approach [43], we use a residual error \(e=[\mu _{0}+H Z^{T} ]^{T}\), where \(\mu _{0}\) is a zero mean vector (i.e., all its elements are equal to zero), \(H\) is the Cholesky decomposition of the covariance matrix \(Q\), and \(Z\) is a vector of Gaussian random values with mean zero and variance equal to one. Since \(H\) is multiplied by \(Z\) in the calculation of \(e\), the \(e\) value is obtained through a randomization of the Cholesky decomposition of the covariances. Note that the residual error \(e\) is used in Eq. (8) for the first iteration only.

After imputing all records \(R_i \in D_I; \forall i\), we get an imputed data set \(D^k_I\). We replace the missing values of \(D_F\) with the corresponding imputed values in \(D^k_I\). We then re-calculate \(\mu ^k_{mis}\), \(\mu ^k_{av}\) and \(\mathbf B \) using the updated \(D_F\). We next re-impute \(x_m\) using Eq. (8) for all the records of \(D_I\). We repeat this process until the change of the means and covariances in \(D_F\) over two consecutive iterations is less than a user-defined threshold \(\epsilon \). We use \(10^{-10}\) as the value of the termination threshold (\(\epsilon \)) in the experiments. The Procedure FuzzyEM then returns the imputed data set \(D^k_I\). Following [43], we use the residual error only in the first iteration.
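To summarize Eqs. (6), (7) and (8) in code, a simplified sketch of one FuzzyEM pass for a single cluster is given below (Python/NumPy, numerical attributes only). It is a sketch under stated assumptions, not the full procedure of Algorithm 2: the residual term \(e\), the denormalization, and the outer convergence loop over the means and covariances are omitted, the missing cells are provisionally filled with column means only to compute the fuzzy statistics, and the function name is ours.

```python
import numpy as np

def fuzzy_em_impute(D, u_k):
    """One imputation pass according to one cluster.
    D   : (|R| x M) data matrix with np.nan for missing values
    u_k : (|R|,)   membership degrees of all records with cluster k
    Returns a copy of D with the missing entries of each record imputed
    via x_m = mu_mis + (x_a - mu_av) B (Eqs. (6)-(8), residual e omitted)."""
    D = D.astype(float)
    filled = np.where(np.isnan(D), np.nanmean(D, axis=0), D)   # provisional fill for the statistics
    w = u_k / u_k.sum()
    mu = w @ filled                                            # Eq. (6): fuzzy mean vector
    centred = filled - mu
    cov = (centred * w[:, None]).T @ centred                   # Eq. (7): fuzzy covariances
    out = D.copy()
    for i, row in enumerate(D):
        miss = np.isnan(row)
        if not miss.any():
            continue
        B = np.linalg.solve(cov[np.ix_(~miss, ~miss)],         # B = inv(theta_aa) theta_am
                            cov[np.ix_(~miss, miss)])
        out[i, miss] = mu[miss] + (row[~miss] - mu[~miss]) @ B  # Eq. (8)
    return out
```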

We impute our example data set according to three clusters using the Algorithm 2. The imputed data sets are shown in Table 4a–c. For easy understanding, the imputed values are presented in bold font.

Table 4 Imputed data set \(D_{I}\)

Note that for a non-fuzzy data set where \(U_{ik}\) can be either 1 or 0, \(\mu _{mis}^k\) (see Eq. (6)) becomes equal to \(\mu _m\) (see Eq. (1)) for some \(k\). Similarly, \(\mu _{av}^k\) becomes equal to \(\mu _a\) and \(\sigma _{pq}^k\) becomes equal to \(\sigma _{pq}\) for some \(k\). Therefore, Eq. (8) becomes equal to Eq. (1), suggesting that the existing EMI [43] technique is a special case of the proposed FEMI algorithm.

Step 5: Find the combined imputed value of a numerical attribute. Find the imputed value of a categorical attribute.

In this step, we impute both numerical and categorical missing values of \(D_I\) (see Fig. 3 and Step 5 of Algorithm 1). For imputing a numerical missing value of \(D_I\), we combine all \(D^s_I;\forall s \in K\), where \(K =\{1,2,\ldots ,k\}\). Let \(R_{ip}\) be the pth attribute value (numerical) of the ith record in \(D_I\). Let \(v^s_{ip}\) be the pth attribute value of the ith record in the data set \(D^s_I\), which is imputed according to the sth cluster as explained in Step 4. If \(R_{ip}\) is a missing value, then it is computed as follows.

$$\begin{aligned} R_{ip}= \frac{\sum ^{k}_{s=1}U_{is}v^{s}_{ip}}{\sum ^{k}_{s=1}U_{is}}, \quad \hbox {where}\ U_{is} \in U^I \end{aligned}$$
(9)

We now explain our approach toward imputing a categorical attribute value. Many fuzzy clustering techniques such as GFCM produce a fuzzy seed (center) for a cluster, where the seed contains each value of the domain of an attribute along with a confidence degree [24, 26]. The confidence degree of an attribute value \(p_l\) in a cluster \(s\) is the sum of the membership degrees, with the cluster, of the records having \(p_l\). So the confidence degree \(C^k_{p_l}\) for the value \(p_l\) of the \(p\)th attribute (categorical) in the \(k\)th cluster can be calculated as follows.

$$\begin{aligned} C^k_{p_l}= \sum ^{|R|}_{i=1}U_{ik}|v^k_{ip}=p_l \end{aligned}$$
(10)

Similarly, the confidence degrees of all domain values of an attribute can be calculated. A fuzzy seed of a cluster can contain all the domain values of a categorical attribute and their corresponding confidence degrees with the cluster. Naturally, the value having the highest confidence degree is considered to be the most likely value of the attribute in the cluster.

While imputing a missing value of a record \(R_i\), we calculate the vote (\(G_{p_l}\)) for a value \(p_l\) by multiplying its confidence degree (\(C^s_{p_l}\)) in terms of the \(s\)-th cluster, and the membership degree \(U_{is}\) of \(R_i\) with the \(s\)-th cluster, for all clusters as follows.

$$\begin{aligned} G_{p_l}= \sum ^{k}_{s=1}C^s_{p_l}\times U_{is} \end{aligned}$$
(11)

Similarly, we calculate \(G_{p_l}; \forall p_l \in P\), where \(P\) is the domain of the \(p\)-th attribute. Finally, the value having the maximum vote is considered to be the imputed value \(R_{ip}\) for the \(p\)-th attribute of the \(i\)-th record. The final imputed data set of \(D_I\) (in our example) is shown in Table 4d.
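A minimal sketch of this combination step is shown below (Python, with hypothetical function names). For a numerical cell, it takes the membership-weighted average of the \(k\) cluster-wise imputations (Eq. (9)); for a categorical cell, it accumulates a vote for each domain value from the cluster confidence degrees weighted by the record's membership degrees (Eq. (11)) and picks the value with the maximum vote.

```python
import numpy as np

def combine_numerical(v, u):
    """Eq. (9): weighted average of the k cluster-wise imputed values v,
    weighted by the record's membership degrees u with the k clusters."""
    v, u = np.asarray(v, dtype=float), np.asarray(u, dtype=float)
    return (u * v).sum() / u.sum()

def impute_categorical(confidence, u):
    """Eqs. (10)-(11): confidence[s][value] is the confidence degree of a
    domain value in cluster s; the value with the maximum vote is imputed."""
    votes = {}
    for s, conf_s in enumerate(confidence):
        for value, c in conf_s.items():
            votes[value] = votes.get(value, 0.0) + c * u[s]   # Eq. (11)
    return max(votes, key=votes.get)

# toy usage with k = 3 clusters and a categorical domain {x, y}
print(combine_numerical([0.40, 0.55, 0.50], [0.7, 0.2, 0.1]))   # 0.44
print(impute_categorical([{"x": 2.1, "y": 0.4},
                          {"x": 0.3, "y": 1.8},
                          {"x": 0.5, "y": 0.6}],
                         [0.7, 0.2, 0.1]))                      # "x"
```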

Step 6: Combine records to form a completed data set (\(D_F^\prime \)) without any missing values.

We finally combine denormalized \(D_C\) and imputed \(D_I\) in order to form \(D^\prime _F\), which is the imputed data set.

3.4 A few possible approaches for automatically determining the number of clusters, \(k\)

Since FEMI requires a number of clusters \(k\) for the fuzzy clustering technique (as discussed in Step 3 of Sect. 3.3), a suitable approach (for automatically determining the most appropriate \(k\) value) could be incorporated into the FEMI algorithm. In the following paragraphs, we discuss a few possible approaches to automatically determine the \(k\) value. However, they need to be carefully evaluated in order to find the best of them, and we plan to carry out this evaluation in our future study. In the experiments of this study, we use k = 20 based on our initial empirical analysis, as discussed in Sect. 4.4.

Approach 1: The full data set \(D_F\) can be divided into two (mutually exclusive) horizontal segments \(D_C\) (having only records without any missing value/s) and \(D_I\) that has only the records with missing value/s. Artificial missing values can be created in \(D_C\) randomly, using the same approach that we have taken in this study to simulate missing values (see Sect. 4.2). The artificial missing values can then be imputed many times, where each time FEMI can use different \(k\) values. Finally, the \(k\) value resulting in the best imputation accuracy can be chosen as the \(k\) value for FEMI in order to impute the real missing values. Note that the process of generating artificial missing values and imputation can be run many times, and the average result for each \(k\) value can be used for finding the best \(k\).

Approach 2: Instead of GFCM [26], an existing fuzzy clustering algorithm such as FBSA [45], that automatically finds the \(k\) value, can be used. Note that FBSA finds the best \(k\) value by comparing the quality of clusters obtained for different \(k\) values.

Approach 3: The k-means algorithm [21] can be applied many times for different \(k\) values and thereby produce different clustering results. All clustering results can be evaluated using any evaluation metric such as the Silhouette coefficient [41] and Davies–Bouldin Index (DBI) [13]. The \(k\) with the best clustering result can then be used in FEMI for the imputation of missing values.

Approach 4: Instead of k-means a basic fuzzy c-means such as FCM [5] can be applied many times in order to find the best \(k\) value. This approach does not require FEMI to use the GFCM technique for clustering since a fuzzy clustering technique like FCM computes the membership degrees as well. The set of membership degrees for the best \(k\) value can be directly used in the FEMI algorithm.

An advantage of the first approach is that the best \(k\) value is determined by evaluating the imputation accuracy instead of evaluating the cluster quality for each \(k\) value, as required by the other approaches. The \(k\) value determined based on the imputation accuracy may produce a better end result than the \(k\) value determined based on the cluster quality. However, a disadvantage of the first approach is its time complexity, since it requires running the complete imputation many times in order to determine the best \(k\) value. An advantage of the second approach is a relatively lower time complexity than the first approach, as it does not require running all steps of FEMI in order to identify the best \(k\) value. Since k-means is well known for its low time complexity [20], the third approach can also enjoy its simplicity in terms of time complexity compared with the first approach. Since the fourth approach skips the step involving GFCM, it can save some computation time. However, a possible disadvantage of Approach 2, Approach 3, and Approach 4 can be the use of clustering quality instead of imputation quality.

3.5 Complexity analysis

We now analyze complexity for FEMI, EMI, GkNN, FKMI, SVR, and IBLLS. We consider that we have a data set with \(n\) records, and \(m\) attributes. We also consider that there are \(n_I\) records with one or more missing values, and \(n_c\) records (\(n_c = n - n_I\)) with no missing values. FEMI uses a fuzzy clustering technique such as GFCM [26] to create clusters with a user-defined number of clusters \(k\).

Complexity of Step 1 for normalizing all numerical attributes is \(O(nm)\). In Step 2, the complexity for preparing \(D_I\) and \(D_C\) is \(O(nm)\). The complexity of Step 3 is dominated by the GFCM algorithm [26] which has a complexity \(O(k^2n_cmd)\), where we consider that the domain size of each attribute is \(d\).

Step 4 uses the \(FuzzyEM()\) procedure, which has a complexity \(O(nn_Im^2 + n_Im^3)\). \(FuzzyEM()\) is applied repeatedly \(k\) times, which makes the complexity of this step \(O(knn_Im^2 + kn_Im^3)\). The complexity for imputing missing values in Step 5 is \(O(knn_Im)\). The complexity of Step 6 for denormalizing all numerical attributes is \(O(n_cm)\).

Therefore, the overall complexity of FEMI is \(O(nm+k^2n_cmd+knn_Im^2+kn_Im^3+knn_Im)\). However, the \(k\) and \(d\) values are typically very small, especially compared with \(n\). Besides, we can also consider \(n_I\) to be very small and therefore \(n_c \approx n\). Hence, the overall complexity of FEMI is \(O(nm^2 + m^3)\). Moreover, for low-dimensional data sets such as those used in this study, the complexity is \(O(n)\). We estimate the complexities of EMI, FKMI, SVR and IBLLS (i.e., the techniques that we use in the experiments of this study) as \(O(nm^2+m^3)\), \(O(nm)\), \(O(n^3m)\) and \(O(n^3m^2+nm^4)\), respectively. Moreover, the complexity of GkNN is \(O(n^2m\log n)\) [53]. This is also reflected in the execution time analysis in the next section (see Table 8).

4 Experimental result

We compare FEMI with four existing techniques, namely EMI [22, 43], GkNN [53], FKMI [27, 31] and IBLLS [12]. Moreover, we use the Java implementation (LibSVM [10]) of an existing technique, SVR [44, 49]. These existing techniques have been shown to be better than Bayesian principal component analysis (BPCA) [33], LLSI [25], and ILLSI [9].

4.1 Data sets

We apply the techniques on eight real data sets, namely the Adult, Chess, Yeast, Contraceptive Method Choice (CMC), GermanCA, Pima, Housing and Autompg data sets, which are available from the UCI Machine Learning Repository [15]. A brief description of the data sets is presented in Table 5. For instance, the Adult data set has 32,561 records, 6 numerical attributes, and 9 categorical attributes. A number of its records have missing values. We first remove all records having missing values and thereby get a pure data set having 30,162 records without any missing values. In our experiments, we use the pure data set.

Table 5 Data sets at a glance

4.2 Simulation of missing values

In pure data sets, we artificially create missing values, which are then imputed by different techniques. Since the original values of the artificially created missing data are known to us, we can evaluate the performances of the techniques.

Both the amount and type of missing data influence the imputation performance [22]. Missing data can have many different types. For example, in one scenario (type), we may have a data set where a record has at most one missing value, and in another scenario, we may have records with multiple missing values, but both data sets may have the same number of total missing values. Moreover, the probability of a value being missing typically does not depend on the missing value itself [42, 43], and hence, missing values often can have a random nature, which can be difficult to formulate. Therefore, in this experiment, we use various types of missing values such as simple, medium, complex and blended, as explained below.

A record can have at most one missing value for a simple pattern, whereas in a medium pattern, if a record has any missing value, then it has a minimum of 2 and a maximum of 50 % of the attributes with missing values [22, 35, 40]. Similarly, a record having missing values in a complex pattern has a minimum of 50 % and a maximum of 80 % of the attributes with missing values. In a blended pattern, we have a mixture of records from all three other patterns. A blended pattern contains 25 % of the records having missing values in the simple pattern, 50 % in the medium pattern and 25 % in the complex pattern. The blended pattern simulates a natural scenario where we may expect a combination of all three missing patterns.

For each of the missing patterns, we use different missing ratios (1, 3, 5 and 10 %), where an x % missing ratio means that x % of the total attribute values (not records) of a data set are missing. For example, if a data set has 5 records and 20 attributes, then a missing ratio of 10 % means that 10 values, out of the total 100 (\(=5\times 20\)) attribute values, are missing. For a 10 % missing ratio and the simple missing pattern, the total number of records having missing values may exceed the total number of records in some data sets. In the experiments, we therefore use a 6 % missing ratio (instead of a 10 % missing ratio) for the simple missing pattern, for all data sets.

Moreover, we use two types of missing models, namely Overall and Uniformly Distributed (UD). In the overall model, missing values are not equally spread out among the attributes, and in the worst-case scenario, all missing values can belong to a single attribute. However, in the UD model, each attribute has an equal number of missing values.
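As an illustration of how such missing values can be simulated, a simplified sketch is given below (Python/NumPy). It is our own example covering only the simple pattern under the overall model; the function name, the seed handling, and the rounding rule are assumptions.

```python
import numpy as np

def simulate_simple_overall(D, missing_ratio=0.05, seed=0):
    """Set missing_ratio of all attribute values to np.nan (overall model),
    with at most one missing value per record (simple pattern)."""
    rng = np.random.default_rng(seed)
    D = D.astype(float).copy()
    n, m = D.shape
    n_missing = int(round(missing_ratio * n * m))
    assert n_missing <= n, "simple pattern allows at most one missing value per record"
    rows = rng.choice(n, size=n_missing, replace=False)   # distinct records
    cols = rng.integers(0, m, size=n_missing)             # one attribute per chosen record
    D[rows, cols] = np.nan
    return D
```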

Note that there are 32 combinations (id \(1, 2, \ldots , 32\)) of missing ratio, missing model, and missing pattern. For each combination, we create 10 data sets with missing values. For example, for the combination having a “1 %” missing ratio, the “overall” missing model, and the “simple” missing pattern (id \(1\), see Table 6), we generate 10 data sets with missing values. We therefore create altogether 320 data sets for each natural data set, namely Adult, Chess, Yeast, Contraceptive Method Choice (CMC), GermanCA, Pima, Housing and Autompg.

Table 6 Performance of FEMI, GkNN, FKMI, SVR, EMI, and IBLLS based on RMSE, and MAE for 32 missing combinations on Adult data set

4.3 Evaluation criteria

The imputation accuracy of FEMI is evaluated using two well-known evaluation criteria namely root mean squared error (RMSE) and mean absolute error (MAE).

We now define the evaluation criteria briefly. Let \(N\) be the number of artificially created missing values, \(O_{i}\) (\(1 \le i \le N\)) be the actual value of the \(i\)th artificially created missing value, and \(P_{i}\) be the imputed value of the \(i\)th missing value.

The most commonly used imputation performance indicator is the root mean squared error (RMSE) [22], which aims to measure the average difference between the actual and imputed values, as shown in Eq. (12). Its value ranges from 0 to \(\infty \), where a lower value indicates a better matching.

$$\begin{aligned} \hbox {RMSE}=\left( \frac{1}{N} \sum \limits _{i=1}^N \left[ P_i-O_i \right] ^2 \right) ^\frac{1}{2} \end{aligned}$$
(12)

The mean absolute error (MAE) [22] determines the closeness between actual and imputed values. Similar to RMSE, its value ranges from 0 to \(\infty \), where a lower value indicates a better matching.

$$\begin{aligned} \hbox {MAE}=\frac{1}{N} \sum \limits _{i=1}^N |P_i-O_i| \end{aligned}$$
(13)
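A direct implementation of Eqs. (12) and (13) is straightforward; the following minimal sketch (Python/NumPy, function names ours) computes both criteria from the actual and imputed values.

```python
import numpy as np

def rmse(actual, imputed):
    """Eq. (12): root mean squared error between actual and imputed values."""
    actual, imputed = np.asarray(actual, float), np.asarray(imputed, float)
    return np.sqrt(np.mean((imputed - actual) ** 2))

def mae(actual, imputed):
    """Eq. (13): mean absolute error between actual and imputed values."""
    actual, imputed = np.asarray(actual, float), np.asarray(imputed, float)
    return np.mean(np.abs(imputed - actual))

print(rmse([1.0, 2.0, 3.0], [1.1, 1.8, 3.3]), mae([1.0, 2.0, 3.0], [1.1, 1.8, 3.3]))
```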

4.4 Justification of \(k\)

FEMI uses a user-defined value \(k\) (the number of clusters), as explained in Sect. 3. In order to explore a suitable value for \(k\), we first evaluate the performance of FEMI on the Credit Approval data set and the CMC data set using different values of \(k\), as shown in Fig. 4. For both data sets, \(k=20\) gives the best result based on the evaluation criterion RMSE. Note that \(k=20\) also gives the best result based on MAE for these data sets. Therefore, we use a default value of \(k\) equal to \(20\) for the experiments in this study.

Fig. 4 Performance of FEMI based on RMSE for different number of clusters. a Credit Approval data set, b CMC data set

4.5 Experimental result analysis for the adult data set

In Table 6, we present the performance of FEMI, GkNN, FKMI, SVR, EMI and IBLLS on the Adult data set, based on RMSE and MAE for the 32 missing combinations. The average values of the performance indicators over the 10 data sets having missing values for each combination of missing ratio, missing model, and missing pattern are presented in Table 6. For example, there are 10 data sets having missing values with the combination (\(id=1\)) of the “1 %” missing ratio, “Overall” missing model and “Simple” missing pattern. The average RMSE for the data sets having \(id=1\) is 0.103 for FEMI, as reported in Table 6. The average RMSE values for the data sets having \(id=1\) are 0.156, 0.182, 0.120, 0.123 and 0.149 for GkNN, FKMI, SVR, EMI, and IBLLS, respectively. Bold values in the table indicate the best results among the techniques. FEMI performs significantly better than all other techniques on this data set. In all 32 missing combinations, FEMI performs better than all other techniques in terms of both evaluation criteria.

4.6 Statistical significance analysis on Adult data sets

We now present the confidence interval analysis on the Adult data set to evaluate the statistical significance of the superiority of FEMI over the five existing techniques, as evident from our empirical assessment. However, we first briefly explain the essence of a confidence interval and its interpretation.

Suppose that we have ten different copies of the Adult data set with different missing values, and we impute the missing values of each data set using FEMI. We thereby get ten imputed data sets. We can evaluate the imputation quality of the ten data sets through a metric such as RMSE. Therefore, we get ten RMSE values and can calculate their mean. Let us call this mean RMSE value the population mean. Now, if we collect another set of ten copies of the Adult data set with missing values and impute them using FEMI, we are likely to get a different population mean. An obvious question then is which of the two population means is the correct one. Is either of them the same as the true mean? The true mean is the theoretical population mean obtained from the imputation of an infinite number of data sets [46].

The 95 % confidence interval analysis allows us to compute an interval (i.e., an upper and a lower limit) around a mean value such that, if we run the test 100 times (where each test imputes, say, ten data sets and thereby obtains a population mean) and compute the 100 corresponding intervals, then the true mean falls within the interval in 95 of the tests. By running a test, we mean collecting ten data sets with missing values and creating ten imputed data sets. Therefore, a 95 % confidence interval tells us that the true mean falls within the interval obtained from a test with 95 % probability [46].

If we have ten data sets with missing values that are imputed by two different imputation techniques \(X\) and \(Y\), then we obtain ten imputed data sets for each technique. We can then compute the population mean and confidence interval for each technique. If the population mean of \(X\) is higher than the population mean of \(Y\) and the confidence intervals do not overlap, then there is a 95 % chance that the true mean of \(X\) is higher than the true mean of \(Y\). That is, \(Y\) is clearly better than \(X\) (since a lower RMSE value is better) with 95 % probability.
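The following Python sketch illustrates one common way of computing such a 95 % confidence interval from the ten per-data-set RMSE values (using the Student \(t\) critical value) and of checking whether two intervals overlap. The exact procedure used in the paper may differ slightly, and the sample values below are arbitrary.

```python
import numpy as np
from scipy import stats

def confidence_interval(values, confidence=0.95):
    """Mean and confidence interval of a sample of RMSE values
    (one value per imputed data set). Returns (mean, low, high)."""
    values = np.asarray(values, float)
    mean = values.mean()
    sem = stats.sem(values)                                  # standard error of the mean
    half_width = sem * stats.t.ppf((1 + confidence) / 2, df=len(values) - 1)
    return mean, mean - half_width, mean + half_width

def intervals_overlap(ci_x, ci_y):
    """True if the two (mean, low, high) intervals overlap."""
    return not (ci_x[2] < ci_y[1] or ci_y[2] < ci_x[1])

# Example: ten RMSE values per technique on the same ten data sets.
rmse_x = [0.155, 0.160, 0.149, 0.158, 0.152, 0.161, 0.157, 0.150, 0.154, 0.159]
rmse_y = [0.104, 0.101, 0.106, 0.103, 0.100, 0.107, 0.102, 0.105, 0.099, 0.103]
ci_x, ci_y = confidence_interval(rmse_x), confidence_interval(rmse_y)
print(ci_x, ci_y, intervals_overlap(ci_x, ci_y))
```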

We now present the 95 % confidence interval analysis of FEMI and the other techniques in terms of RMSE and MAE for all 32 missing combinations in Fig. 5. It is clear from the figure that FEMI performs better (i.e., a better average value and no overlap of confidence intervals) than the other techniques for all missing combinations. We can also see from the figures that IBLLS generally performs worse for high missing ratios, whereas FEMI maintains almost the same performance even for high missing ratios.

Fig. 5 95 % confidence interval analysis on the Adult data set. a RMSE, b MAE

4.7 Statistical significance analysis for all data sets

We present the 95 % confidence interval analysis in terms of RMSE for the seven remaining data sets in Fig. 6. FEMI performs significantly better (i.e., a better average value and no overlap of confidence intervals) than the other five techniques in terms of RMSE for all missing combinations in the Chess (Fig. 6a), Yeast (Fig. 6b), GermanCA (Fig. 6d), and Pima (Fig. 6e) data sets. In the CMC (Fig. 6c), Housing (Fig. 6f), and Autompg (Fig. 6g) data sets, FEMI performs significantly better than EMI and IBLLS for all missing combinations except for those marked by the circles.

Fig. 6 95 % confidence interval analysis on the Chess, Yeast, CMC, GermanCA, Pima, Housing, and Autompg data sets in terms of RMSE. a Chess data set. b Yeast data set. c CMC data set. d GermanCA data set. e Pima data set. f Housing data set. g Autompg data set

We also present the number of overlapping cases between FEMI and the other techniques for both evaluation criteria in Table 7. Out of 256 cases (i.e., 32 combinations for each of the 8 data sets), the confidence interval of FEMI overlaps with those of the other techniques in only 38 and 51 cases in terms of RMSE and MAE, respectively (see Table 7).

Table 7 The number of overlapping cases (out of the total of 256 cases for each evaluation criterion) between FEMI and other techniques in terms of 95 % confidence interval analysis

From Fig. 6, we observe that overlapping generally occurs for low missing ratios and the simple missing pattern. Although GkNN, FKMI, SVR, IBLLS, and EMI generally perform worse than FEMI for all patterns, they perform comparatively better in the simple pattern than in the medium and complex patterns. FEMI, however, performs almost equally well for low and high missing ratios.

All six techniques show a similar tendency of better performance for the simple pattern than for the medium and complex patterns. In the simple pattern, a record can have at most one missing value. Therefore, an imputation technique can take advantage of a higher number of available values in a record in order to impute its missing value. On the other hand, in the medium and complex patterns, a record may have more than one missing value. When a record has a large number of missing values, an imputation technique has fewer available values in the record with which to make a precise estimate of the missing values. An extreme example is the case where all values of a record are missing. Naturally, the imputation accuracy drops as the number of missing values in a record increases. However, the imputation techniques generally perform better for the blended pattern (as evident from Fig. 6) due to the presence of the simple pattern within the blended pattern, as explained in Sect. 4.2.

In Fig. 7, we present the overall average values (i.e., the average over all 32 combinations) for the techniques on all eight data sets. FEMI clearly performs better than the five other techniques.

Fig. 7 Performance comparison on eight data sets. a Performance on the Adult, Chess, Yeast, and CMC data sets. b Performance on the GermanCA, Pima, Housing, and Autompg data sets

Figure 8 shows the percentage of combinations (out of the 256 combinations for all data sets) in which FEMI, GkNN, FKMI, SVR, EMI, and IBLLS perform the best. For example, FEMI performs the best in 88.67 % of the combinations in terms of RMSE, as shown in Fig. 8a.

Fig. 8 Score comparison on eight data sets. a RMSE. b MAE

We now present another statistical significance analysis, namely the \(t\) test, for all 32 missing combinations of all data sets (Fig. 9). Before presenting the results, we first briefly explain the \(t\) test analysis. Like the confidence interval test, a \(t\) test is used to evaluate the significance of the superiority of a technique \(X\) over another technique \(Y\) [46]. Let \(X\) and \(Y\) be evaluated through RMSE. For 32 runs (i.e., the sample size \(n_X=32\)) of \(X\), we get 32 different RMSE values for \(X\). Similarly, if we run the technique \(Y\) 32 times (i.e., the sample size \(n_Y=32\)), then we get 32 different RMSE values for \(Y\). Let \(\overline{X}\) and \(\overline{Y}\) be the averages of \(X\) and \(Y\), respectively, and \(s_X\) and \(s_Y\) be the variances of \(X\) and \(Y\), respectively. Also, let \(df_X(=n_X-1)\) and \(df_Y(=n_Y-1)\) be the degrees of freedom for \(X\) and \(Y\), respectively. Using the averages, variances, and degrees of freedom, we can calculate a \(t\) value for \(X\) and \(Y\) as follows [2].

$$\begin{aligned} t_{XY}=\frac{\overline{X}-\overline{Y}}{\sqrt{\frac{s_X}{df_X}+\frac{s_Y}{df_Y}}} \end{aligned}$$
(14)

The statistical superiority of \(X\) over \(Y\) is evaluated by comparing the \(t_{XY}\) value with the t(ref) value. The t(ref) value can be obtained from the Student \(t\) distribution table [1] using the degrees of freedom and the confidence level. For example, the t(ref) value of the two-tailed test at significance level \(p=0.05\) (i.e., 95 % confidence level) and degrees of freedom = 31 (since \(n_X=n_Y=32\)) is 1.96 [1]. The superiority is statistically significant if the \(t_{XY}\) value is greater than the t(ref) value (which is 1.96 in this case); since a lower RMSE is better, a significantly positive \(t_{XY}\) indicates that \(Y\) outperforms \(X\).
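For illustration, the \(t\) value of Eq. (14) can be computed as in the following Python sketch, which follows the equation literally (sample variance divided by the degrees of freedom).

```python
import numpy as np

def t_value(x, y):
    """t statistic of Eq. (14): difference of means divided by the combined
    standard error, using sample variances and degrees of freedom n-1."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    df_x, df_y = len(x) - 1, len(y) - 1
    return (x.mean() - y.mean()) / np.sqrt(x.var(ddof=1) / df_x + y.var(ddof=1) / df_y)

# x and y would each hold one RMSE value per missing combination (32 runs each).
# The resulting t value is then compared against the tabulated t(ref) for the
# chosen significance level and degrees of freedom.
```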

Fig. 9 \(t\) test analysis on eight data sets. a \(t\) test analysis on the Adult, Chess, Yeast, and CMC data sets. b \(t\) test analysis on the GermanCA, Pima, Housing, and Autompg data sets

Figure 9 presents the \(t\) test results for all 32 missing combinations of all data sets. If a \(t_{XY}\) value (denoted as the \(t\) value in the figure) is greater than the t(ref) value, then it indicates the superiority of FEMI over the other technique being evaluated. Please note that in this experiment, we evaluate the superiority of FEMI over the existing techniques at the 99.5 % confidence level, since we use \(p = 0.005\). Figure 9 therefore demonstrates a considerably better performance of FEMI over the other techniques, based on both evaluation criteria, for all eight data sets except for a few cases indicated by a downward arrow in the figure.

4.8 Execution time complexity analysis

We now present the average execution time (in milliseconds) over the 320 data sets (32 combinations \(\times \) 10 data sets per combination) with missing values for each real data set in Table 8. We carry out the experiments using two different machines; however, for any one data set, we use the same machine for all techniques. Machine 1 has 4 \(\times \) 8-core Intel E7-8837 Xeon processors and 256 GB RAM. Machine 2 has an Intel Core i5 processor running at 2.67 GHz and 4 GB RAM. The last row of Table 8 indicates that FEMI takes less time than IBLLS, SVR, and FKMI, whereas it takes considerably more time than EMI, which is the cost it pays for a significantly better imputation quality. Moreover, the execution time of FEMI is comparable with that of GkNN, while FEMI performs significantly better than GkNN, as shown in Figs. 7 and 8.
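As an aside, the reported timings can be reproduced in spirit with a simple wall-clock measurement such as the sketch below; `impute_fn` is a placeholder for any of the compared techniques, not a function from their published implementations.

```python
import time

def average_runtime_ms(impute_fn, data_sets, **kwargs):
    """Average wall-clock execution time (in milliseconds) of an imputation
    routine over a collection of data sets with missing values."""
    elapsed = []
    for data in data_sets:
        start = time.perf_counter()
        impute_fn(data, **kwargs)
        elapsed.append((time.perf_counter() - start) * 1000.0)
    return sum(elapsed) / len(elapsed)
```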

Table 8 Average execution time (in ms) of different techniques on the eight data sets

For the CMC data set, we next create 30 data sets with missing values for one missing combination (id = 27, missing ratio = 10 %, missing model = Overall, and missing pattern = Complex) and apply FEMI, EMI, and IBLLS on all 30 data sets. We then calculate the average execution time over the 30 runs for each of the three techniques. FEMI requires less time than IBLLS, but more time than EMI (see Table 9).

Table 9 Average execution time (in ms) on CMC data set for a complex pattern (30 runs)

For the same missing combination, we also perform a \(t\) test analysis for both evaluation criteria, as shown in Fig. 10. FEMI performs significantly better than both other techniques in terms of RMSE and MAE.

Fig. 10 \(t\) test analysis on the CMC data set for a complex pattern (30 runs)

4.9 Experimentation on categorical missing value imputation for all data sets

Unlike EMI and IBLLS, FEMI can impute categorical missing values in addition to numerical missing values. Therefore, in order to evaluate the performance of FEMI for imputing categorical values, we now compare it with GkNN, FKMI, and SVR, which also impute categorical missing values. Figure 11 shows that FEMI achieves lower RMSE (Fig. 11a) and MAE (Fig. 11b) values than GkNN and FKMI for all eight data sets. FEMI also performs better than SVR in terms of RMSE and MAE for all data sets except Yeast and Autompg. For each data set, the RMSE and MAE values are computed over all 32 combinations. Note that for RMSE and MAE, a lower value indicates a better imputation.

Fig. 11 Performance comparison on eight data sets in terms of categorical imputation. a RMSE, b MAE

Note that for categorical attributes, FEMI provides the votes \(V_{p_l}; \forall p_l \in P\), where \(P\) is the domain of the \(p\)th attribute (see Eq. 11). For the RMSE and MAE calculation in this study (see Eqs. 12 and 13), the value \(p_l\) having the highest vote is considered as the imputed value \(P_i\) of the \(i\)th missing value. For this calculation, if the imputed value \(P_i\) and the actual/observed value \(O_i\) are the same, then the distance between them (\(P_i - O_i\)) is considered 0, and otherwise 1. However, if the votes for two possible values (such as \(p_l \in P\) and \(p_m \in P\)) are both high, then a user is less certain about the imputed value \(P_i\). When the votes are close to each other, a user is more uncertain than when only one value has a high vote and all other values have very low votes. This uncertainty can be captured through the entropy of the votes. In this study, we considered the value having the highest vote as the imputed value only for the sake of the RMSE and MAE calculation. However, FEMI actually provides a user with all the votes, and the user can make his/her own decision on the imputation.
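A minimal sketch of this vote-based treatment of categorical attributes is given below; the dictionary representation of the votes and the example categories are assumptions made purely for illustration.

```python
def impute_from_votes(votes):
    """Given the votes {category: vote} for a categorical missing value,
    return the category with the highest vote (the value used for RMSE/MAE here)."""
    return max(votes, key=votes.get)

def categorical_distance(imputed, observed):
    """0/1 distance used in Eqs. (12)-(13) for categorical attributes."""
    return 0.0 if imputed == observed else 1.0

# Example votes for one missing categorical value:
votes = {"low": 0.2, "medium": 0.7, "high": 0.1}
P_i = impute_from_votes(votes)                     # "medium"
print(P_i, categorical_distance(P_i, "medium"))    # distance 0.0 if the observed value is "medium"
```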

We understand that when the observed value \(O_i\) does not receive the highest vote, the following two cases are not the same: in the first case, the observed value \(O_i\) receives a high vote, and in the second case, it receives a low vote. Obviously, the first case is better than the second in terms of imputation accuracy. In order to take this into account in the RMSE and MAE calculation, we use all the votes and thereby calculate the probability of the actual value being imputed, \(p(O_i) = \frac{V_{p_m} \,|\, p_m = O_i}{\sum _{p_l \in P} V_{p_l}}\). The best imputation is when \(p(O_i) = 1\) and \(p(p_l) = 0; \forall p_l \ne O_i\). Therefore, we introduce two new metrics called new RMSE (see Eq. 15) and new MAE (see Eq. 16), where \(N\) is the total number of missing values.

$$\begin{aligned} n\hbox {RMSE}&= \left( \frac{1}{N} \sum \limits _{i=1}^N \left[ 1-p(O_i) \right] ^2 \right) ^\frac{1}{2} \end{aligned}$$
(15)
$$\begin{aligned} n\hbox {MAE}&= \frac{1}{N} \sum \limits _{i=1}^N |1-p(O_i)| \end{aligned}$$
(16)
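The two new metrics can be computed from the votes as in the following sketch, where each missing value is represented by its vote dictionary and its observed category (both assumed formats used only for illustration).

```python
import numpy as np

def p_of_observed(votes, observed):
    """Probability assigned to the observed category: its vote divided by the
    sum of all votes, i.e., the p(O_i) used in Eqs. (15)-(16)."""
    total = sum(votes.values())
    return votes.get(observed, 0.0) / total if total > 0 else 0.0

def n_rmse(vote_lists, observed_values):
    probs = np.array([p_of_observed(v, o) for v, o in zip(vote_lists, observed_values)])
    return np.sqrt(np.mean((1.0 - probs) ** 2))            # Eq. (15)

def n_mae(vote_lists, observed_values):
    probs = np.array([p_of_observed(v, o) for v, o in zip(vote_lists, observed_values)])
    return np.mean(np.abs(1.0 - probs))                    # Eq. (16)

# Example: two missing categorical values with their vote distributions.
votes = [{"a": 0.8, "b": 0.2}, {"a": 0.3, "b": 0.7}]
observed = ["a", "b"]
print(n_rmse(votes, observed), n_mae(votes, observed))
```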

In Table 10, we present the overall average values (averaged over all 32 combinations) of RMSE, \(n\hbox {RMSE}\), MAE, and \(n\hbox {MAE}\) for FEMI and DMI on the CMC and Housing data sets. The results indicate a strong superiority of FEMI over DMI for categorical imputation.

Table 10 Overall performance comparison of FEMI and DMI for categorical imputation based on RMSE, \(n\)RMSE, MAE and \(n\)MAE

5 Conclusion

In this paper, we propose a framework for missing value imputation using a fuzzy clustering technique and a novel fuzzy expectation maximization algorithm. The basic idea of the technique is to make an educated guess for a missing value using the most similar records. It takes the fuzzy nature of clustering into consideration while identifying the group of most similar records: a record is associated with every cluster, each with some degree of membership. Moreover, while imputing a missing value based on a group, it also considers the fuzzy membership of all records belonging to the group. It therefore uses a novel fuzzy expectation maximization algorithm to impute missing values.

The proposed technique uses a user-defined value for \(k\) (i.e., the number of clusters), the default value of which is \(k=20\). We use \(k=20\) in all experiments. While a data set specific \(k\) value would favor our technique, it achieves a higher imputation quality than the existing techniques even with the default \(k\) value. Moreover, in this study (see Sect. 3.4), we present a number of possible approaches for finding a suitable \(k\) value automatically. Note that the main focus of the proposed technique is the imputation of missing values, not the clustering of records. We use clustering only to find groups of similar records that are likely to have high correlations among the attributes. Therefore, it is acceptable if we do not find the actual clusters precisely (that is, the best clustering solution with the best \(k\)) as long as the clusters found contain similar records. For example, if there are actually three clusters in a data set (i.e., \(k = 3\)) while FEMI finds two groups of records within each cluster and thereby obtains six groups (i.e., \(k = 6\)), the records within each group are still similar to each other. Therefore, the imputation quality of the proposed technique should still remain high. Hence, choosing a high \(k\) value (such as \(k = 20\)) can be considered acceptable when typical data sets [34], such as those used in this study, have a low number of clusters. Our future research plans include the automation of the \(k\) value, as indicated in Sect. 3.4.

We compare the proposed technique with five other high-quality existing techniques. In our experiments, we use eight publicly available natural data sets and two evaluation criteria. The data sets used in the experiments of this study contain non-time-series data. However, our technique can also be used on time series data provided that the data set has two or more attributes; two or more attributes are required to facilitate the correlation calculation needed by the fuzzy EMI component of FEMI. Since FEMI was not tailored for time series data, it does not take advantage of time information while imputing missing values. Nevertheless, many imputation techniques similar to our approach have been applied to time series data [7, 18, 19].

The experimental results indicate that the proposed technique performs significantly better than the other five techniques. We use average values, confidence interval analysis, and \(t\) test analysis to compare the performance of our technique with the others. We also analyze the execution times: while the proposed technique takes less time than IBLLS, SVR, and FKMI, it takes more time than EMI. Our future research plans include reducing the time complexity of the technique and improving its imputation accuracy.