1 Introduction

Machine learning and data mining algorithms are frequently used for knowledge discovery in databases, since they constitute non-trivial processes for exploring new facts and identifying useful relationships or patterns in data [5, 21, 26]. However, analysts working with real-world databases or datasets constantly encounter data imperfection in the form of incompleteness [3, 8], and plenty of remedies have been devised to tackle this unfavorable phenomenon. Even though missing data might not cause serious trouble when the missing ratio is not significantly high, leaving it untreated is hardly an ideal way to ensure data quality. Some argue that fragmentary records should simply be discarded without further consideration. This view has an obvious shortcoming: the observed, factual values of the same instance are discarded along with the missing ones [11, 18, 35]. In high-missing-rate environments, such a strategy is therefore unreasonable and infeasible. Consequently, substitution or replacement of missing values, termed imputation, has drawn increasing attention. In broad outline, the available methods can be separated into two categories: single imputation and multiple imputation. Single imputation, i.e. filling in precisely one value for each missing one, intuitively has many appealing features: standard complete-data methods can be applied directly, and the substantial effort required to create imputations needs to be carried out only once [23, 47]. At the same time, it cannot provide valid standard errors and confidence intervals, because it ignores the uncertainty implicit in the fact that the imputed values are not the actual values. Conversely, multiple imputation generates a set of simulated values for each missing item in order to properly reflect the uncertainty attached to missing data [42, 58]. More importantly, multiple imputation accounts for missing data by restoring not only the natural variability in the missing data but also the uncertainty introduced by the interpolation process. This type of data imputation has been advocated as a statistically sound approach, but so far its use has been limited mainly to the social and medical sciences.

Multiple imputation has recently emerged as an interesting and quite visible direction in missing data analysis. These more sophisticated approaches are advantageous over conventional techniques because they require less stringent assumptions and mitigate the pitfalls of traditional methods [19, 36, 49]. Nevertheless, most existing solutions remain defective in the following respects: (a) clustering complete instances together with incomplete instances violates the formation of good clusters; in other words, clustering over all instances generates biased values because of the imperfection [57, 59]; (b) common distance metrics such as Minkowski's \(L_{p}\) (p=1,2,∞) and Cosine Correlation (CC) [34, 41] are too imprecise to scale the dissimilarity among instances; (c) current methodologies are hardly applicable to handling missing aerospace data because of their limited validity.

In this paper, data imputation is formulated as the estimation of missing values through multiple clustering-based operations. The prime contributions of this paper can be described as follows: (a) dividing the non-missing items into a finite number of well-partitioned clusters confines the completion to the most appropriate, tailored region; (b) Gray System Theory (GST), which captures the situational variation of a curve, characterizes the relative discrepancy between instances more precisely; (c) MIGEC performs more accurately on real aerospace datasets than other multiple imputation strategies.

The rest of this paper is organized as follows: Sect. 2 briefly introduces both the missingness mechanisms and the typical treatments of missing data, and then reviews the related literature on imputation. In Sect. 3, the detailed process of the MIGEC algorithm is illustrated in three primary procedural sub-items. Section 4 presents a series of experimental results on both UCI datasets and empirical aerospace datasets to compare the performance with other state-of-the-art approaches, including both single and multiple imputation. Finally, conclusions are given in Sect. 5.

2 Related work

2.1 Missingness mechanisms

Before discussing the different handling options, it is necessary to have a solid comprehension of the missingness mechanisms, because the performance of a method depends not only on the amount of absent data but also on the characteristics of the missing data patterns. According to the taxonomy of Rubin and his colleagues, the mechanisms are categorized as follows [5, 19, 31, 48]:

  1. Missing Completely At Random (MCAR). The MCAR refers to the case that the distribution of an example having a missing value for an attribute does not depend on either the observed data or the missing data. The probability that units provide data on a particular variable thus does not depend on the value of that variable or on the value of any other variable. An example of the mechanism is that a laboratory sample is dropped, so the resulting observation is missing.

  2. Missing At Random (MAR). When the distribution of an example having a missing value for an attribute depends on the observed data but not on the missing data, the mechanism is MAR. Because the probability of a value being missing generally depends on observed values, it does not correspond to the intuitive notion of 'random'. For example, income may be more likely to be missing for the more educated while education is fully observed.

  3. Not Missing At Random (NMAR). It implies that the pattern of data missingness is non-random and is unpredictable from the other variables in the database. If data are NMAR, then even after accounting for all the available observed information, the reason for observations being missing still depends on the unseen observations themselves. In other words, the missing entry depends on the observed data as well as on the value that is missing.

These terms have precise probabilistic and mathematical meanings, since they explain why the data are missing. Moreover, the conceptual descriptions state the relationships between observed variables and the probability of missing data, so they must be taken into account in missing data analysis.

2.2 Methods for missing data analysis

Current approaches to processing missing data can be roughly divided into three categories: tolerance, ignoring and imputation-based procedures.

2.2.1 Tolerance

This straightforward method keeps the source entries in their incomplete form. It can be a practical and computationally cheap solution, but it requires the subsequent techniques to work robustly even when data quality remains low [20, 48].

2.2.2 Ignoring

Ignoring missing data often refers to "Case Deletion". It is the most frequently applied procedure, but it tends to diminish data quality. Its strength lies in its ease of application: elements with missing values are deleted in one of two manners [15, 55]:

  (a) List-wise/Case-wise Deletion: any element with missing data for any of the attributes being examined is indiscriminately deleted from the database.

  (b) Pairwise Deletion: incomplete cases are removed on an analysis-by-analysis basis, such that any given case may contribute to some analyses but not to others.

2.2.3 Imputation

Mean/Mode Substitution (MMS): It replaces each missing value with the mean (the arithmetic average, for quantitative attributes) or the mode (the most frequent value, for qualitative attributes) of all known values of that attribute, computed over all observations or over a subgroup. Replacing all missing records with a single value distorts the input data distribution [9, 22].

Hot-deck/Cold-deck Imputation [9, 21]: Given an incomplete pattern, Hot-Deck Imputation (HDI) replaces the missing data with the values from the input vector that is closest in terms of the attributes that are known in both patterns. This method attempts to preserve the distribution by substituting different observed values for each missing item. Another possibility is the Cold-Deck Imputation (CDI) method, which is similar to hot deck but the data source must be other than the current dataset. For example, in a survey context, the external source can be a previous realization of the same survey.

Regression imputation: This method uses multiple linear regression to obtain estimates of the missing values. It is applied by estimating a regression equation for each variable, using the others as predictors. This solves the problems concerning variance and covariance raised by the previous method, but it biases all the variables if they are not linearly related. Possible errors arise from inserting highly correlated predictors to estimate the variables. The advantage of this method is that existing relationships between the variables can be used to calculate missing data, but it is rarely used because it amplifies the correlation between variables [18, 44, 56].

Expectation Maximization Estimation (EME): This algorithm handles parameter estimation in the presence of missing data, based on the Expectation-Maximization (EM) algorithm proposed by Dempster, Laird and Rubin. Such methods are generally superior to case deletion because they utilize all the observed data. However, they rely on a strict assumption about the model distribution of the variables, such as a multivariate normal model, which is highly sensitive to outliers [17, 24].

Machine-learning-based imputation: It learns the characteristics of the unknown data of interest from processed sample data. The essence is to automatically learn from samples for complex pattern recognition and to intelligently predict the missing values. The methods mainly include decision-tree-based, association-rule-based and clustering-based imputation [39, 43, 55].

Multiple imputation: Several ordered choices, usually likelihood-based, are computed for each missing value. Each of the two or more resulting complete data sets is then analyzed using standard complete-data methods, and the analyses are combined to reflect both the inter-imputation and the intra-imputation variability [30, 40, 46].

2.2.4 State of the art for missing data imputation

Historically, people have relied on diverse ad hoc techniques to deal with missing data, and the related methodologies have developed substantially and rapidly during the last decades. Baraldi A. and Enders C. [5] compared the conventional literature with the most recent methodological research. They pointed out that the traditional techniques only work in limited circumstances, under the strict assumptions formulated by Little and Rubin [35]. Magnani [38] also surveyed the main missing data techniques, including conventional methods, global imputation, local imputation, parameter estimation and direct management of missing data, and tried to highlight the advantages and disadvantages under all kinds of missing data mechanisms. Clearly, the major problem of these techniques is their reliance on strong model assumptions. Chen and Chen [11] developed a method for estimating null values in which a fuzzy similarity matrix is used to represent fuzzy relations; the method deals with one missing value in an attribute. The K Nearest Neighbors (KNN) [8, 22] is another prevailing means of missing data completion, with variants such as Sequential KNN (SKNN) and Iterative KNN (IKNN). Li D. et al. [33] represented the missing data as intervals, which were clustered by a nearest-neighbor-intervals-based Fuzzy C-Means (FCM), resulting in interval cluster prototypes that reflect both the uncertainty and the shape of clusters. García-Laencina, P.J. et al. [22] established a new KNN variant that selects the k nearest cases considering the relevance between the input and target attributes, classifying and imputing missing data with a mutual-information-based distance metric. Bose S. et al. [8] devised a local interpolation based imputation method which each time generates a similarity sub-matrix around a missing position in a target gene and interpolates the missing data, rather than using all genes and samples. Soft computing techniques have also been introduced to seek solutions. García, J.C.F. et al. [23] presented a genetic algorithm that takes advantage of its flexibility and non-linear capability and obtained successful results even without conditional information. Huang and Lee [27] employed a gray-based nearest neighbor method to handle the missing data problem: gray association analysis determines the nearest neighbors of an instance with missing values, and the unknown values are then inferred from the known attribute values of these neighbors. Chen and Huang [12] used weighted fuzzy rules to estimate null values in relational databases. Di Nuovo, A.G. [15] compared four FCM-based solutions in a psychological research environment; the results revealed that FCM based on the Optimal Completion Strategy (FCMOCS) leads to effective data imputation instead of deleting elements with missing values. The theoretical underpinnings of multiple imputation are Bayesian [52]. Hruschka Jr. et al. [26] used Bayesian networks to fill in missing values in a hybrid model, which applies a clustering genetic algorithm to objects without missing values and generates Bayesian networks to substitute the missing values. Predictive Mean Matching (PMM) [10, 16] fills in the blanks based on combined residuals, with residual-value forecasts reflecting the uncertainty; the distribution of residuals can be either normal or non-normal, yet the random error term is often difficult to determine.
Propensity Score (PS) [37, 45] treats the conditional probability of missingness given the observed covariates in a provided time-scale. Each variable with missing scores is assigned a propensity that reflects the probability of the observations being lost. The observations are categorized into groups according to the propensity scores, and an approximate Bayesian bootstrap imputation is then applied within each group. Markov Chain Monte Carlo (MCMC) [2, 7] is another kind of Bayesian inference, in which the posterior distribution is explored. Zhang C. et al. [53] generalized random regression estimation with a method named "Clustering-based Random Imputation (CRI)", which fills the unknown values with plausible ones generated from the same cluster using a kernel-based random method, after splitting the raw data into complete and incomplete sets. Clustering-based Multiple Imputation (CMI) [58] was designed to make inference for the missing data by kernel-function nonparametric random imputation after k-means clustering. Zhang S. et al. [56] utilized the information within the incomplete instances from the second imputation iteration onwards. Their Non-Parametric Iterative Imputation (NIIA) is an improvement of classic multiple imputation based on kernel functions. The experimental results on UCI datasets showed that NIIA can capture the distribution of a dataset even when there is no prior knowledge of it.

3 The MIGEC algorithm

The global procedure of the MIGEC algorithm is schematized in Fig. 1, and each of the key components is detailed in the following subsections.

Fig. 1 The flowchart of the MIGEC algorithm

Our method seeks to make full use of the uncorrupted information at the instance level. For this reason, the items of the raw data set are divided into two disjoint subsets, namely the complete dataset and the incomplete dataset. This separation is expected to minimize the negative impact of the information loss caused by missing values. On one hand, the objects of the complete set are grouped into a number of clusters via FCM. On the other hand, the items in the incomplete set are reordered by missing severity, so that the record with the fewest missing parametric values is allocated first to the closest group, as quantified by the GST-based distance metric. Next, each missing attributive value of the record is estimated by the proposed multiple imputation (comprising the initial and successive stages). The imputed item is then included in the complete set and the original copy is removed from the incomplete set. The next element in the rearranged incomplete data set is processed in the same way until no more elements remain in that set. This strategy ensures that all missing-valued instances are processed in the most similar region, utilizing the known information as much as possible.

3.1 The clustering strategy

Clustering divides target elements into classes or clusters so that items in the same class are as similar as possible, while items in different classes are as dissimilar as possible. This is important because it categorizes the raw samples into groups with a high level of intra-cohesion and inter-separation. In hard clustering (such as k-means [1]), data are partitioned into distinct clusters, where each item belongs to exactly one cluster. Soft clustering, in contrast, assigns each data element a set of membership levels, allowing it to belong to more than one cluster. Moreover, soft clustering and its variants have been successfully applied to a variety of domains.

The specific clustering schema utilizes the standard FCM [6, 25], which aims to minimize the following objective function with respect to the fuzzy memberships \(U^{(r)} = [u_{ij}^{(r)}]\) and the cluster centroids \(C^{(r)} = [c_{j}^{(r)}]\): \(J = \sum_{j = 1}^{G} \sum_{i = 1}^{M} (u_{ij}^{(r)})^{s}\,d(x_{i},c_{j}^{(r)})\). In this expression, r is the ordinal number of the iteration, \(x_{i}\) and \(c_{j}^{(r)}\) respectively denote the ith complete data instance and the centroid of the jth cluster, d(⋅,⋅) is the distance metric between two instances, and \(u_{ij}^{(r)}\) is the degree of membership of the ith instance in the jth cluster under the "fuzzifier" s, while G defines the total number of clusters and M represents the number of complete data instances.

The FCM can be summarized in 4 steps:

  Step 1. Randomly initialize the matrix \(U^{(0)}\), which satisfies \(\sum_{j = 1}^{G} u_{ij}^{(0)} = 1,\ i=1,2,\ldots,M\).

  Step 2. At the rth iteration (r>0), calculate the centroids \(C^{(r)}\):

    $$ c_{j}^{(r)} = \frac{\sum_{i = 1}^{M} (u_{ij}^{(r)})^{s}x_{i}}{\sum_{i = 1}^{M} (u_{ij}^{(r)})^{s}} $$
    (1)
  Step 3. Update the membership matrix \(U^{(r)}\):

    $$ u_{ij}^{(r)} = \Biggl[\sum_{g = 1}^{G} \biggl(\frac{d(x_{i},c_{j}^{(r)})}{d(x_{i},c_{g}^{(r)})}\biggr)^{\frac{2}{s - 1}}\Biggr]^{-1} $$
    (2)
  Step 4. Check whether the following STOP conditions are satisfied. If not, return to Step 2; otherwise the iterative procedure immediately ends with the formed clusters.

    $$\bigl \Vert U^{(r)} - U^{(r - 1)} \bigr \Vert < \varepsilon;\quad \varepsilon > 0 $$

    or r accumulatively reaches the predefined number R.
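
For concreteness, a minimal NumPy sketch of the four steps above is given below. The function name fcm, the use of squared Euclidean distances for d(⋅,⋅) and the default parameter values are illustrative choices for this sketch, not details prescribed by the paper.

```python
import numpy as np

def fcm(X, n_clusters, fuzzifier=2.0, max_iter=100, eps=1e-5, seed=0):
    """Fuzzy C-Means over the complete instances X (shape M x N), Steps 1-4 above."""
    rng = np.random.default_rng(seed)
    M = X.shape[0]
    # Step 1: random membership matrix whose rows sum to one.
    U = rng.random((M, n_clusters))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(max_iter):
        Um = U ** fuzzifier
        # Step 2: centroids as fuzzily weighted means of the instances (Eq. (1)).
        C = (Um.T @ X) / Um.sum(axis=0)[:, None]
        # Step 3: membership update (Eq. (2)); d holds squared Euclidean distances,
        # so the exponent 2/(s-1) on distances becomes 1/(s-1) here.
        d = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2) + 1e-12
        inv = d ** (-1.0 / (fuzzifier - 1.0))
        U_new = inv / inv.sum(axis=1, keepdims=True)
        # Step 4: stop once the membership matrix barely changes.
        if np.linalg.norm(U_new - U) < eps:
            U = U_new
            break
        U = U_new
    return U, C
```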

3.2 The classification of incomplete instances

GST was established by Deng [13, 14] together with Gray Control Theory. Both branches suit situations where partial information is difficult to acquire and where the information is extensionally explicit or intensionally implicit. Within GST, the concept of Gray Relational Analysis (GRA) remains crucial; it includes the Gray Relational Coefficient (GRC) and the Gray Relational Grade (GRG). GRA scales the influence of a compared series on the reference series in the gray space without any prior assumption about the distribution type. Furthermore, it can generate satisfactory outcomes among target objects even if their number is small or their variability is great. Thus, the correlation can be regarded as a new distance metric, where a smaller distance indicates stronger influence.

Two steps for distance calculation are needed in a missing environment:

Step 1. The initialization processing is indispensable in order to map the original data into a particular interval and to eliminate outliers. The GRC is then formulated by (3):

$$ GRC_{p}\bigl(x_{k}^{mis},c_{i}\bigr) = \frac{\min_{i}\min_{p}|x_{kp}^{mis} - c_{ip}| + \zeta \max_{i}\max_{p}|x_{kp}^{mis} - c_{ip}|}{|x_{kp}^{mis} - c_{ip}| + \zeta \max_{i}\max_{p}|x_{kp}^{mis} - c_{ip}|},\quad p = 1,2,\ldots,N,\ 0 \le \zeta \le 1 $$
(3)

In (3), \(x_{k}^{mis}\) is the kth incomplete instance and p indexes the attributes with non-missing values, while \(c_{i}\) denotes the centroid of the ith cluster. In other words, the calculation happens only when the pth attributive value of \(x_{k}^{mis}\) exists. Another important parameter is ζ∈(0,1], which controls the level of differences with respect to the relational coefficient. When ζ=0, the comparison environment disappears; on the contrary, ζ=1 means the comparison environment remains unchanged. A proper value of ζ can favorably manage the impact of the maximum value in the matrix. Nevertheless, no method for selecting the optimum value has been established so far. Instead, researchers usually set it empirically to 0.5 or learn an optimized value from experimental results [28].

Step 2. Integrating each parameter’s GRC between an incomplete instance and the reference, the GRG is calculated in (4).

$$ GRG\bigl(x_{k}^{mis},c_{i}\bigr) = \frac{1}{N}\sum_{p = 1}^{N} GRC_{p}\bigl(x_{k}^{mis},c_{i}\bigr),\quad i = 1,2,\ldots,G $$
(4)

In terms of the maximal value of GRG, each incomplete instance is individually incorporated into the closest cluster [6, 54].

From the above descriptions, it should be pointed out that the rationale of the GRA is to quantify the similarity and the degree of compactness for the different items based on their geometric relationship.
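
The GST-based assignment can be condensed into a short sketch. The code below assumes missing entries are marked with NaN and that the centroids come from the FCM step of Sect. 3.1; the function name and the default ζ=0.5 are illustrative choices rather than requirements of the method.

```python
import numpy as np

def grey_relational_grade(x_mis, centroids, zeta=0.5):
    """Return the GRG (Eqs. (3)-(4)) between one incomplete instance and every centroid.

    x_mis     : (N,) instance with np.nan at missing positions
    centroids : (G, N) cluster centroids obtained from FCM on the complete set
    """
    observed = ~np.isnan(x_mis)                               # use non-missing attributes only
    diff = np.abs(x_mis[observed] - centroids[:, observed])   # (G, n_observed)
    d_min, d_max = diff.min(), diff.max()                     # extrema over clusters and attributes
    grc = (d_min + zeta * d_max) / (diff + zeta * d_max)      # Eq. (3), one coefficient per attribute
    return grc.mean(axis=1)                                   # Eq. (4): average over attributes

# The incomplete instance is assigned to the cluster with the largest GRG, e.g.:
# closest = int(np.argmax(grey_relational_grade(x_mis, centroids)))
```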

3.3 The entropy-based multiple imputation

Each time an instance has been assigned to the most proximate group, an internal multiple imputation strategy starts as follows:

3.3.1 First imputation

The MMS is employed to initialize the missing values in the first imputation. This simple technique performs well only when the data are normally distributed. Nevertheless, it is believed that initializing the missing values by MMS before the multiple imputation produces excellent performance, even without any prior knowledge of the distribution pattern [56].
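
As a rough illustration of this initialization step, the sketch below fills each missing position with the column mean or mode computed over the cluster members; the helper name and the nominal-attribute indicator are assumptions made for the example, and nominal values are assumed to be numerically encoded.

```python
import numpy as np

def mms_initialize(x_mis, cluster_data, column_is_nominal):
    """Fill the NaN positions of x_mis with the column mean or mode of cluster_data."""
    x = x_mis.copy()
    for j in np.flatnonzero(np.isnan(x)):
        col = cluster_data[:, j]
        col = col[~np.isnan(col)]
        if column_is_nominal[j]:
            vals, counts = np.unique(col, return_counts=True)
            x[j] = vals[np.argmax(counts)]        # mode for qualitative attributes
        else:
            x[j] = col.mean()                     # mean for quantitative attributes
    return x
```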

3.3.2 Successive imputation

In the authors' opinion, the imputation of a specific missing instance can benefit from all the other instances within the particular cluster. As a result, a method involving all the instances via entropy is proposed. In this context, the term "entropy" refers to Shannon's entropy [29, 50], which quantifies the expected value of the information contained in a message; a broad distribution represents more uncertainty than a sharply peaked one. It is used here to determine the relative importance of each criterion in the matrix, as follows.

\(R=(r_{ij})_{m \times n}\) denotes the data matrix of the cluster to which \(x_{i}^{mis} \in X_{ic}\) has been attached; that is, it includes m−1 complete elements and one initialized element.

Step 1. Calculate the entropy value of the fth parameter [29]:

$$ \begin{aligned}[b] &I_{f} = - k\,h_{f}\ln h_{f},\\ &\quad k = \frac{1}{\ln m},\quad h_{f} = \frac{\Vert r_{fl} - r_{il} \Vert }{\sum_{i = 1}^{m} \Vert r_{fl} - r_{il} \Vert },\quad (l \ne j) \end{aligned} $$
(5)

\(I_{f}\) measures the decision information that the fth parameter contains.

Step 2. Compute the coefficient of difference for the fth parameter:

$$ t_{f} = 1 - I_{f},\quad f = 1,2,\ldots,n $$
(6)

\(t_{f}\) represents the inherent contrast intensity of the fth parameter. A greater value of \(t_{f}\) signifies greater significance of that parameter.

Step 3. Elicit the coefficient of weight for the fth parameter:

$$ w_{f} = \frac{t_{f}}{\sum_{f = 1}^{n} t_{f}} $$
(7)

Step 4. Estimate the jth attributive missing value of \(x_{i}^{mis}\):

$$ x_{ij}^{mis} = \sum_{q = 1,q \ne j}^{n} w_{q}x_{iq}^{mis} $$
(8)

If the estimated value of the instance differs from the value calculated in the previous iteration by more than a tolerable interval, or the number of iterations has not yet reached the threshold, operations (5) to (8) continue iteratively. Otherwise, the iterative process terminates and the assessed value is taken as the imputed one. The imputed instance is then aggregated into the corresponding cluster and the centroids are updated.
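
The successive imputation can be sketched as follows, under the reading that the entropy weights of Eqs. (5)-(7) are computed per attribute across the m cluster members and that the matrix has been rescaled to positive values beforehand; the function name and this particular normalization are assumptions of the sketch, not prescriptions of the paper.

```python
import numpy as np

def entropy_impute_step(R, row, j):
    """One pass of Eqs. (5)-(8): re-estimate attribute j of instance `row` in cluster matrix R."""
    m, n = R.shape
    k = 1.0 / np.log(m)
    cols = [f for f in range(n) if f != j]              # all attributes except the missing one
    # Step 1: entropy of each remaining attribute across the m cluster members;
    # R is assumed to have been rescaled to positive values in advance.
    P = R[:, cols] / R[:, cols].sum(axis=0, keepdims=True)
    I = -k * np.sum(P * np.log(np.clip(P, 1e-12, None)), axis=0)   # Eq. (5)
    t = 1.0 - I                                          # Eq. (6): coefficient of difference
    w = t / t.sum()                                      # Eq. (7): entropy weights
    # Step 4: weighted combination of the instance's other attribute values (Eq. (8)).
    return float(np.dot(w, R[row, cols]))

# The step is repeated until the estimate changes by less than a tolerance
# or a maximum number of iterations is reached.
```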

3.4 The framework of the MIGEC algorithm

Figure a The framework of the MIGEC algorithm (pseudocode)
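
Since the pseudocode figure is not reproduced here, the following compact sketch wires the helpers sketched earlier (fcm, grey_relational_grade, mms_initialize, entropy_impute_step) into the loop of Fig. 1. It assumes purely numeric attributes with NaN marking missing entries, and the convergence handling and parameter defaults are illustrative.

```python
import numpy as np

def migec(X, n_clusters, zeta=0.5, max_passes=30, tol=1e-4):
    """Impute all NaN entries of X (numeric attributes assumed), following Fig. 1."""
    X = np.asarray(X, dtype=float).copy()
    missing_mask = np.isnan(X)
    complete = X[~missing_mask.any(axis=1)]
    incomplete_idx = np.flatnonzero(missing_mask.any(axis=1))
    # Cluster only the complete instances (Sect. 3.1).
    U, centroids = fcm(complete, n_clusters)
    labels = np.argmax(U, axis=1)
    clusters = {g: complete[labels == g] for g in range(n_clusters)}
    # Handle incomplete instances starting with the least severely missing one.
    for i in sorted(incomplete_idx, key=lambda i: missing_mask[i].sum()):
        # GST-based assignment to the closest cluster (Sect. 3.2).
        g = int(np.argmax(grey_relational_grade(X[i], centroids, zeta)))
        # First imputation by MMS, then entropy-based successive imputation (Sect. 3.3).
        x0 = mms_initialize(X[i], clusters[g], np.zeros(X.shape[1], dtype=bool))
        R = np.vstack([clusters[g], x0])
        for j in np.flatnonzero(missing_mask[i]):
            for _ in range(max_passes):
                prev = R[-1, j]
                R[-1, j] = entropy_impute_step(R, R.shape[0] - 1, j)
                if abs(R[-1, j] - prev) < tol:
                    break
        # Absorb the imputed instance into its cluster and refresh the centroid.
        clusters[g], centroids[g], X[i] = R, R.mean(axis=0), R[-1]
    return X
```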

4 Experimental evaluation

In this section, the assessment criteria are first explained in terms of the types of the attributes in Sect. 4.1. The general effectiveness of our algorithmic approach is then presented in Sect. 4.2.1 through comparative experiments on two UCI datasets [4], where it remains superior to other methods. Section 4.2.2 shows that the technique also outperforms the aforementioned approaches when MIGEC is applied to a real case analysis on two aerospace datasets.

4.1 The evaluation criterion

4.1.1 Missing data on numeric attributes

The Root Mean Square Error (RMSE) is used to evaluate the predictive ability of the various data imputation algorithms when the attributes are quantitative:

$$ \mathit{RMSE} = \sqrt{\frac{1}{m}\sum _{i = 1}^{m} (e_{i} - \tilde{e}_{i} )^{2}} $$
(9)

where \(e_{i}\) is the original value, \(\tilde{e}_{i}\) is the predicted plausible value and m is the total number of estimations. A larger value of RMSE indicates a less accurate algorithm [58].

4.1.2 Missing data on nominal attributes

The performances of the algorithms for categorical attributes are appraised by the Classification Accuracy (CA):

$$ CA = \frac{1}{n}\sum_{i = 1}^{n} l(EC_{i},TC_{i}) $$
(10)

where \(EC_{i}\) and \(TC_{i}\) are the estimated and the true class label for the ith missing value, respectively, and n indicates the total number of missing values. The function l(x,y)=1 if x=y, otherwise l(x,y)=0. For this reason, a larger value of CA indicates more correctly imputed values [56].
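
Both criteria are straightforward to compute; a small sketch of Eqs. (9)-(10), with illustrative array names, is:

```python
import numpy as np

def rmse(true_values, imputed_values):
    """Root Mean Square Error for numeric attributes (Eq. (9))."""
    true_values, imputed_values = np.asarray(true_values), np.asarray(imputed_values)
    return float(np.sqrt(np.mean((true_values - imputed_values) ** 2)))

def classification_accuracy(true_labels, imputed_labels):
    """Classification Accuracy for nominal attributes (Eq. (10))."""
    true_labels, imputed_labels = np.asarray(true_labels), np.asarray(imputed_labels)
    return float(np.mean(true_labels == imputed_labels))
```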

4.2 Empirical result analysis

4.2.1 UCI datasets

Two UCI datasets, i.e. Wine and Thyroid Disease, are selected to test the validity of the algorithms. Wine contains 178 instances and 13 attributes. The variable values are either real or integer. Thyroid Disease includes 7200 instances and 21 attributes. The multivariate factual data are either categorical or real.

Missing data generation

To intrinsically examine the effectiveness and validity and to ensure the systematic nature of the research, we artificially generated missing data in the complete datasets at four distinct missing ratios, i.e. 5 %, 10 %, 15 % and 20 %, under three different mechanisms, namely MCAR, MAR and NMAR, following the procedure used by Twala [51, 59].
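
Only the MCAR case is simple enough to sketch without modelling assumptions; the MAR and NMAR mechanisms follow Twala's procedure [51, 59] and are not reproduced here. A hedged example of masking values completely at random at a given ratio:

```python
import numpy as np

def make_mcar(X, missing_rate, seed=0):
    """Return a copy of X with each cell set to NaN independently with probability missing_rate."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float).copy()
    mask = rng.random(X.shape) < missing_rate
    X[mask] = np.nan
    return X
```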

Parametric values determination

Before the comparative demonstrations, it is requisite to select the optimum values for the fuzzifier s and for ζ. In this section, both are determined by practical testing within specific intervals (s∈(1,2), ζ∈(0,1)). Here, we choose a typical scenario, "MAR with a 10 % missing rate and three clusters", and the variation process is displayed in Table 1. The parametric values for the other circumstances can be found in a similar way.

Table 1 The impact on RMSE in Wine at MAR with 10 % missing rate by s and ζ

From Table 1, it can be seen that the selected values of s and ζ influence the precision of the imputation to some extent. The worst situation, which yields the maximum RMSE, usually occurs near the upper and/or lower boundaries of the intervals. When s=1.3 and ζ=0.5, the RMSE reaches its minimum, so these are taken as the optimum parametric values.

Distance metric selection

To clarify how different distance metrics influence the accuracy of the results, the test is carried out under MAR. The GST-based metric is then compared in practice with both the Minkowski distance family (namely Manhattan Distance (MD), Euclidean Distance (ED) and Chebyshev Distance (CD)) and CC.

As seen in Fig. 2, the GST-based distance metric generates the least bias at different missing rates compared with both the Minkowski \(L_{p}\) metrics and CC. In particular, the discrimination is even more significant when GST is contrasted with CD, according to either RMSE or CA.

Fig. 2 The performances of distance metrics on (a) Wine and (b) Thyroid Disease

Comparative experiments

To make the comparisons as extensive as possible, we select seven other approaches: MMS, HDI, Garcia's KNN Imputation with Mutual Information [22] (denoted as KNNMI), FCMOCS [15], CRI [53], CMI [58] and NIIA [56]. These include both single imputation methods (i.e. MMS, HDI, KNNMI and FCMOCS) and multiple imputation methods (i.e. CRI, CMI and NIIA).

The experimental data provided in Table 2 and Table 3 illustrate some phenomena that we would like to discuss as follows:

  (a) An increasing proportion of missing instances deteriorates the accuracy of the interpolation in terms of either RMSE or CA. This indicates that incomplete values negatively impact the completion; in other words, more available information improves the precision of the final predictions.

  (b) For each individual method, the best result (namely, the minimal value of RMSE or the maximal value of CA) at the same missing ratio always appears when data are NMAR distributed, whilst MCAR yields the opposite.

  (c) Concerning imputation types, the performance of the single imputation techniques (MMS, HDI, KNNMI and FCMOCS) stays inferior to that of the multiple imputation ones (NIIA, CMI, CRI and MIGEC). There are clear improvements between the two categories of methods. Therefore, imputing the absent value multiple times can significantly alleviate the biased effect of single imputation.

  (d) It can be clearly observed that MMS performs worst while MIGEC performs best, with an average absolute difference beyond 0.020 (measured by RMSE in Table 2) or 0.030 (measured by CA in Table 3). Nonetheless, MIGEC incorporates MMS as its first step. Hence, it is a feasible and proper option to use MMS to initialize the unknown data before the subsequent multiple completion.

    Table 2 RMSE on Wine under different mechanisms with varying missing rates
    Table 3 CA on Thyroid Disease under different mechanisms with varying missing rates
  (e) For the multiple imputation methodologies, although the results of NIIA, CMI, CRI and MIGEC overlap slightly at some points, the three clustering-based methods (CMI, CRI and MIGEC) outperform the non-clustered one (NIIA). Hence, clustering helps ameliorate the accuracy of the prediction by narrowing the potential space of the target value.

  (f) Among the clustering-based solutions, CMI works slightly worse than those that strictly split missing and non-missing values before dividing them into a finite set of groups. It can thus be presumed that information loss disturbs the formation of the clusters.

According to the factual values and the above analysis, MIGEC generally performs better than the other seven approaches under any missingness mechanism, no matter whether the data are numeric or categorical.

4.2.2 Aerospace datasets

Aerospace is a special but critical field, associated with both national defense and residents' daily life. In this paper, it refers to the diverse procedures of spacecraft. Like a great number of other businesses or industries, aerospace analysis confronts the data quality problem as well. The specific issue derives from numerous facets, such as difficulties in sensing the target objects under physical environment restrictions, or in data acquisition owing to autonomous units' regulations. The loss of information blocks further handling or analysis, for instance curve fitting and rule mining. Nonetheless, open-access research on aerospace data imputation remains scarce because of its confidentiality and particularity. We therefore try our best to explore some of the known parts, drawing on the current accomplishments in other industries, e.g. medicine or biology, together with our aforementioned MIGEC [32].

To capture the result of the data imputation accurately, we chose complete data entries and artificially simulated the missing situation in accordance with Missing data generation. In this part, the missingness is established under MAR at a 15 % missing rate on the two datasets, and the three multiple imputation methods from Comparative experiments (NIIA, CMI and CRI) are selected as the competitors of MIGEC.

The Remote Controlling for Spacecraft Flying (RCSF) dataset comprises the data produced in real time by one particular unmanned spaceship flying in outer space under remote control by experts on the ground. Owing to the huge amount of raw data, we extract only the data produced within one minute. The experiment is subsequently conducted on 953 records with 20 continuous attributes.

When MIGEC is applied to the RCSF dataset, the maximum number of iterations over all the clusters is 18 loops, which is fewer than NIIA's 27, CMI's 22 and CRI's 25 iterations respectively in Fig. 3(a). Moreover, its RMSE is slightly lower than that of the other counterparts.

Fig. 3 The RMSE influenced by (a) imputation times and (b) number of clusters on the RCSF dataset

Since these techniques are all variants of clustering principles, the relationship between RMSE and the number of clusters should be discussed for all of them except the non-clustering NIIA. In Fig. 3(b), when the whole data set is agglomerated into 6 groups, the RMSE of MIGEC declines to its minimum; by contrast, CMI performs best with 5 clusters while CRI requires 6 partitions.

The Spacecraft Overall Mechanical Design (SOMD) dataset comprises the data related to the assembling and fabrication of one specific model of manned spaceship. Both numeric and categorical values are mixed in the dataset. The total number of instances exceeds 300,000; 1,221 elements with 30 variables, belonging to a certain step of the entire manufacturing process, are chosen.

As Fig. 4(a) shows, the algorithms improve their CA as the number of iterations grows, until convergence emerges. Concurrently, MIGEC attains the best CA with the fewest repetitions compared with the other opponents on SOMD.

Fig. 4 The CA influenced by (a) imputation times and (b) number of clusters on the SOMD dataset

The CA fluctuates irregularly in the interval [0.83, 0.88] as the number of clusters rises, and MIGEC approaches its optimal CA with 11 clusters in Fig. 4(b). Generally, CMI and CRI undulate in a CA range inferior to MIGEC's, and each demands a different optimal number of clusters.

5 Conclusion

Revisiting various missing data analysis techniques, this study advocates clustering-based imputation via partitioning the original data into two non-overlapping subsets, i.e. the missing-valued subset and the complete-valued subset. Iterative imputation is then performed within the categorized groups after each missing-valued entry has been merged into the most homogeneous cluster through the GST-based distance metric. The experiments demonstrate that MIGEC operates better than the existing methods, such as MMS, HDI, KNNMI, FCMOCS, NIIA, CMI and CRI, in terms of RMSE (for continuous missing attributes) and CA (for discrete missing attributes) at different missing ratios on two canonical UCI datasets. In particular, MIGEC has been successfully applied to the aerospace datasets. The RMSE and CA as functions of the iteration count indicate that MIGEC converges more rapidly than the other iterative imputation techniques, with better accuracy, in the real application environment. Ongoing research focuses on how to infer and impute missing values more effectively when the dimensionality becomes high.