
1 Introduction

Missing data can occur in data records for various reasons, such as data entry errors, system failures, or survey respondents who avoid answering certain questions. Various methods have been proposed to deal with the missing data problem. The standard technique is to discard observations or variables that contain missing values. This deletion method is inappropriate when the proportion of missing values is high: it yields inefficient parameter estimates, and the estimated results tend to be underestimated. To deal with these issues, imputation methods can be used to substitute missing values with plausible values. For example, single mean imputation replaces the missing values with the mean, median or mode. However, this simple approach produces biased analysis results. The multiple imputation method introduced in [1] is a more complex approach in which missing data are filled in by drawing multiple sets of complete data that contain different plausible values. This method is complicated and computationally expensive [2], especially for large data sets, because the execution proceeds through three phases over several iterations. Improved versions of single imputation, such as conditional mean imputation, which combines statistical and machine learning methods with multivariate Gaussian mixture models (GMM) [3], have gained interest in recent years [4].

The conditional mean imputation (also known as ordinary least squares, OLS) or regression imputation can preserve the data distribution, according to Di Zio [5]. The conventional OLS model \( \hat{y}_{i} = \beta_{0} + \sum\limits_{j = 1}^{J} {\beta_{j} x_{ij} } + \varepsilon_{i} \) requires a random error \( \varepsilon_{i} \), which can be obtained in two ways [6]: (1) draw a random error under the assumption that the errors are independent and identically distributed, following a Gaussian distribution with zero mean and finite variance; (2) draw a random error with replacement from the empirical distribution of the estimated residuals \( \varepsilon_{i} = y_{i} - \hat{y}_{i} \) [7]. In method (1), the generated \( \varepsilon_{i} \) can be either too large or too small even when the normality assumption is met, which creates a sparsity problem. In method (2), the sparsity of the data becomes inconsistent if the data distribution contains several clusters, each with a different density. This sparsity creates problems such as an increase in the variance between the imputed and original data.
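To make the two error-generation options concrete, the following is a minimal Python sketch drawing \( \varepsilon_{i} \) both parametrically and by resampling residuals; the data and variable names are illustrative, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Fitted values and residuals from some OLS fit (illustrative data).
y = np.array([2.1, 2.9, 4.2, 5.1, 5.8])
y_hat = np.array([2.0, 3.0, 4.0, 5.0, 6.0])
residuals = y - y_hat

sigma2 = residuals.var(ddof=1)  # estimated error variance

# (1) Parametric draw: epsilon ~ N(0, sigma^2), i.i.d.
eps_parametric = rng.normal(0.0, np.sqrt(sigma2), size=3)

# (2) Non-parametric draw: resample estimated residuals with replacement.
eps_resampled = rng.choice(residuals, size=3, replace=True)
```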

The conditional mean imputation proposed in [5] does not consider adding the residuals. Although this method may preserve the data distribution, it underestimates the variability, introduces bias into the imputed data, and yields highly inaccurate imputed values. Additional steps are required to reduce the sparsity induced by the random error \( \varepsilon_{i} \) generated in OLS and thereby obtain better predictions of the missing values.

The main objective of this study is to investigate the random error and to employ the wild bootstrap [8, 9] in missing data prediction using regression imputation on the Gaussian mixture model. The wild bootstrap improves the variance estimation when the data are heteroscedastic [8, 9]. Further details about the wild bootstrap approach are discussed in the next section, which introduces the modelling framework.

In this paper, we apply the wild bootstrap to the single imputation technique for missing value prediction, since the GMM framework is flexible enough to learn multimodal data distributions. We combine the GMM with the proposed missing data prediction method, and we employ the wild bootstrap to investigate the effect of the sparsity of imputed data across different mixture data distributions. We aim to show that single imputation can perform well, as well as multiple imputation (MI) does. We assume that the data are missing at random (MAR).

This paper is organized as follows: in Sect. 2, we present the Gaussian mixture model framework and the proposed regression imputation with wild bootstrap technique. In Sect. 3 we discuss the experimental evaluation and experimental results. Section 4 concludes the paper and identifies further directions for research and study.

2 Modelling Framework

GMM is a powerful probabilistic model used for prediction and particularly for data clustering [5]. The model is flexible enough to learn different data distributions by fitting probability density functions (PDF) to represent different clusters [3]. The well-known strategy for finding the Maximum Likelihood (ML) parameter estimates uses the Expectation-Maximization (EM) algorithm [10]. GMM applications to missing data problems have been studied extensively, for example in [4, 5, 11].

2.1 Definitions

Suppose the data set \( {\mathbf{X}} \) consists of \( N \) independent and identically distributed (i.i.d.) data points, each described by a \( p \)-dimensional vector, so that \( {\mathbf{X}} \) forms an \( N \times p \) data matrix.

Figure 1 illustrates a data set that contains missing values (highlighted with NA in the relevant cells). Let \( {\mathbf{X}} = \left\{ {X_{1} ,X_{2} , \ldots ,X_{p} } \right\} \) be the random variables of the \( N \times p \) data matrix. In the imputation process, Rao and Shao [12] suggested creating a set of respondents \( {\mathbf{X}}^{O} \) and a set of non-respondents \( {\mathbf{X}}^{M} \) separately. The variable \( {\mathbf{X}}^{O} \) denotes the \( n_{1} \times p \) matrix, where \( n_{1} \) is the number of observed rows, while \( {\mathbf{X}}^{M} \) denotes the \( n_{0} \times p \) matrix, where \( n_{0} = N - n_{1} \) is the number of missing values occurring in \( {\mathbf{x}}_{l} \). The vector \( {\mathbf{x}}_{l} \) of size \( n_{1} \times 1 \) contains the observed data of the variable to be imputed, and \( n_{0} \) is the number of missing values in \( {\mathbf{x}}_{l} \).

Fig. 1. A sample data set with missing values

2.2 Multivariate Gaussian Mixture Model

Maximum Likelihood (ML) estimation of the parameters of the multivariate GMM is carried out using the Expectation-Maximization (EM) algorithm [10]. The data in a GMM are modelled as a mixture of \( K \) Gaussian components, estimated as follows:

$$ f\left( {{\mathbf{x}};{\varvec{\Phi}}} \right) = \sum\limits_{k = 1}^{K} {\pi_{k} } f({\mathbf{x}}\,|\,\varvec{\theta}_{k} ) $$
(1)

where \( f({\mathbf{x}}\,|\,{\varvec{\uptheta}}_{k} ) \) is the density of the p-variate Gaussian distribution for the \( k \)th component. The vector \( {\varvec{\Phi}} \) contains the full set of parameters of the mixture model, \( {\varvec{\Phi}} = (\pi_{1} , \ldots ,\pi_{K} ;{\varvec{\uptheta}}_{1} , \ldots ,{\varvec{\uptheta}}_{K} ) \), where \( {\varvec{\uptheta}}_{k} \) comprises the unknown mean vector \( {\varvec{\upmu}}_{k} \) and covariance matrix \( {\varvec{\Sigma}}_{k} \).

The mixing coefficients (or weights) \( \pi_{k} \) of the \( k \)th component must satisfy \( 0 < \pi_{k} < 1 \) and \( \sum\nolimits_{k = 1}^{K} {\pi_{k} = 1} \). The GMM is a flexible model in which no particular column vector needs to be designated as an input or an output.
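To make Eq. (1) concrete, a small sketch evaluating a two-component mixture density is shown below; the parameter values are those of the artificial case study in Sect. 3.1, and `gmm_pdf` is an illustrative helper name.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Parameters Phi of a two-component bivariate GMM (values from Sect. 3.1).
weights = [0.5, 0.5]  # mixing coefficients pi_k, summing to 1
means = [np.array([4.0, 2.0]), np.array([-2.0, 6.0])]
covs = [np.array([[1.0, -0.7], [-0.7, 1.0]]),
        np.array([[3.0, 0.9], [0.9, 3.0]])]

def gmm_pdf(x, weights, means, covs):
    """Mixture density f(x; Phi) = sum_k pi_k f(x | theta_k), Eq. (1)."""
    return sum(pi * multivariate_normal.pdf(x, mean=mu, cov=S)
               for pi, mu, S in zip(weights, means, covs))

print(gmm_pdf(np.array([1.0, 4.0]), weights, means, covs))
```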

2.3 The General EM Algorithm

The EM algorithm is a statistical tool for finding maximum likelihood estimates of a model's parameters, such as means, variances, covariances and regression coefficients. The optimisation algorithm introduced by Dempster et al. [10] starts with an initial estimate of \( {\varvec{\Phi}} \) and iterates until the convergence criteria are satisfied. Each iteration has two steps, known as the E-step and the M-step. The E-step computes the membership probability \( \tau_{ik} \) of each data point \( x_{i} \) for each mixture component \( k \). The M-step updates the parameters \( {\varvec{\Phi}} \) of each of the \( K \) Gaussian components. Letting \( q \) denote the iteration counter, the expected values of the posterior distribution are computed by:

$$ \hat{\tau }_{ik}^{(q)} = \frac{{\hat{\pi }_{k} f({\mathbf{x}}_{i}^{o} \,|\, {\hat{{\varvec{\upmu}}}}_{k} , {\hat{{\varvec{\Sigma}}}}_{k} )}}{{\sum\limits_{j = 1}^{K} {\hat{\pi }_{j} f({\mathbf{x}}_{i}^{o}\, |\, {\hat{{\varvec{\upmu}}}}_{j} , {\hat{{\varvec{\Sigma}}}}_{j} )} }} $$
(2)

In the M-step, we use the expected values of the posterior distribution (2) to re-estimate the means, covariances and mixing coefficients. The new set of parameters \( {\varvec{\Phi}}^{(q + 1)} \) is updated as follows:

$$ \hat{\pi }_{k}^{(q + 1)} = \frac{{N_{k} }}{N}\,{\text{for}}\,k\, = \,1, \ldots ,K, $$
(3)
$$ \hat{\mu }_{k}^{{\left( {q + 1} \right)}} = \frac{1}{{N_{k} }}\mathop \sum \limits_{i = 1}^{N} \tau_{ik} {\hat{\mathbf{x}}}_{ik} $$
(4)
$$ {\hat{{\varvec{\Sigma}}}}_{k}^{(q + 1)} = \frac{1}{{N_{k} }}\sum\limits_{i = 1}^{N} {\tau_{ik} [({\hat{\mathbf{x}}}_{ik} - {\varvec{\upmu}}_{k} )({\hat{\mathbf{x}}}_{ik} - {\varvec{\upmu}}_{k} )^{T} + {\hat{{\varvec{\Sigma}}}}_{ik}^{MM} ]} $$
(5)

where \( N_{k} = \sum\nolimits_{i = 1}^{N} {\tau_{ik} } \). The algorithm then iterates the E-step and the M-step until convergence is achieved.
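A minimal Python sketch of this loop on complete data follows; as an assumption for brevity it omits the missing-data correction term \( {\hat{{\varvec{\Sigma}}}}_{ik}^{MM} \) of Eq. (5) and uses caller-supplied initial parameters (e.g. from K-means).

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, means, covs, weights, max_iter=200, tol=1e-6):
    """Standard EM for a GMM on complete data; the missing-data
    correction term of Eq. (5) is omitted in this sketch."""
    N, K = X.shape[0], len(weights)
    prev_ll = -np.inf
    for _ in range(max_iter):
        # E-step: posterior responsibilities tau_ik, Eq. (2).
        dens = np.column_stack([w * multivariate_normal.pdf(X, m, S)
                                for w, m, S in zip(weights, means, covs)])
        tau = dens / dens.sum(axis=1, keepdims=True)
        # M-step: update weights, means and covariances, Eqs. (3)-(5).
        Nk = tau.sum(axis=0)
        weights = Nk / N
        means = [tau[:, k] @ X / Nk[k] for k in range(K)]
        covs = [(tau[:, k, None] * (X - means[k])).T @ (X - means[k]) / Nk[k]
                for k in range(K)]
        # Convergence check on the observed-data log-likelihood.
        ll = np.log(dens.sum(axis=1)).sum()
        if abs(ll - prev_ll) < tol:
            break
        prev_ll = ll
    return weights, means, covs
```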

2.4 The Least Squares Method

The conditional mean imputation is also known as regression imputation [13]. The imputed values are regressed on the independent variables \( {\mathbf{X}}_{p} \). Consider the following linear regression model:

$$ x_{il} = \beta_{0} + \beta_{1} x_{i} + \varepsilon_{i} ,\,\,\,i = 1,2, \ldots ,n $$
(6)

where the response variable \( x_{il} \) is predicted from the regression coefficients \( \beta_{0} \) and \( \beta_{1} \) with i.i.d., uncorrelated random errors \( \upvarepsilon_{\text{i}} \sim{\text{N}}\left( {0,\upsigma^{2} } \right) \). The matrix form of Eq. (6) is as follows:

$$ {\mathbf{x}}_{l} = \left[ {\begin{array}{*{20}c} {x_{1l} } \\ : \\ {x_{Nl} } \\ \end{array} } \right],\,\,{\mathbf{X}} = \left[ {\begin{array}{*{20}c} 1 & {x_{11} } & \ldots & {x_{1p} } \\ : & : & \ldots & : \\ 1 & {x_{N1} } & \ldots & {x_{Np} } \\ \end{array} } \right],\,\,{\varvec{\upbeta}} = \left[ {\begin{array}{*{20}c} {\beta_{0} } \\ : \\ {\beta_{p} } \\ \end{array} } \right],\,\,{\varvec{\upvarepsilon}} = \left[ {\begin{array}{*{20}c} {\varepsilon_{1} } \\ : \\ {\varepsilon_{N} } \\ \end{array} } \right] $$

In general, \( {\mathbf{x}}_{l} \) is an \( N \times 1 \) vector of the dependent variable containing missing values, \( {\mathbf{X}} \) is the \( N \times (p + 1) \) design matrix of observed variables (including the intercept column of ones), \( {\varvec{\upbeta}} \) is a \( (p + 1) \times 1 \) vector of regression coefficients and \( {\varvec{\upvarepsilon}} \) is an \( N \times 1 \) vector of random errors. The general least squares estimator of \( {\varvec{\upbeta}} \) based on the observed values is:

$$ {{\hat{\varvec{\upbeta}}}}^{O} = \left( {{\mathbf{X}}^{T} {\mathbf{X}}} \right)^{ - 1} {\mathbf{X}}^{T} {\mathbf{x}}_{l} $$
(7)

In the presence of missing data, the imputed values are obtained by the conditional mean imputation technique, which generates imputed values from the regression equation calculated in (7), as discussed in [13, 14]. The random error component \( \varepsilon_{i} \) can be generated either by drawing from \( {\text{N}}\left( {0,\upsigma^{2} } \right) \) or by resampling the residuals.
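A hedged sketch of this imputation step follows; `regression_impute` is an illustrative helper, not code from the paper, and the error term is drawn by either of the two options above.

```python
import numpy as np

def regression_impute(X_obs, y_obs, X_mis, rng, use_residuals=True):
    """Impute missing y-values via OLS regression (Eqs. (6)-(7)),
    adding a random error drawn parametrically or from the residuals."""
    n1 = X_obs.shape[0]
    A = np.column_stack([np.ones(n1), X_obs])        # design matrix + intercept
    beta = np.linalg.lstsq(A, y_obs, rcond=None)[0]  # OLS estimator, Eq. (7)
    resid = y_obs - A @ beta
    A_mis = np.column_stack([np.ones(X_mis.shape[0]), X_mis])
    if use_residuals:
        # Option (2): resample estimated residuals with replacement.
        eps = rng.choice(resid, size=X_mis.shape[0], replace=True)
    else:
        # Option (1): parametric N(0, sigma^2) draw.
        eps = rng.normal(0.0, resid.std(ddof=A.shape[1]), size=X_mis.shape[0])
    return A_mis @ beta + eps
```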

2.5 Fundamentals of the Bootstrap Method

The bootstrap is a non-parametric resampling technique proposed by Efron [15] for estimating standard errors and confidence intervals for various types of distributions. The method was extended in [16, 17] to generate the random error \( \varepsilon_{i} \) in the regression model. Let \( {\mathbf{X}} = \left\{ {{\mathbf{x}}_{1} ,{\mathbf{x}}_{2} , \ldots ,{\mathbf{x}}_{{n_{1} }} } \right\} \) be a random sample from a p-variate normal distribution, where \( n_{1} \) is the size of the observed data \( {\mathbf{X}}^{O} \) as shown in Fig. 1. Let \( {\mathbf{X}}^{{(b_{k} )}} \) denote the bootstrap resample generated by sampling with replacement from the original data \( {\mathbf{X}}_{k} \), where \( b = 1, \ldots ,B \) counts the bootstrap draws and \( k \) refers to the current Gaussian component. In this study, resampling and parameter estimation are performed on the observed data \( {\mathbf{X}}_{k}^{O} \), where the superscript O refers to observed data.
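A minimal sketch of this resampling step on synthetic data: each replicate \( {\mathbf{X}}^{(b)} \) is drawn row-wise with replacement, here to estimate the standard error of the mean vector.

```python
import numpy as np

rng = np.random.default_rng(0)
X_obs = rng.normal(size=(100, 2))  # observed rows X^O (illustrative)

B = 200
boot_means = []
for b in range(B):
    idx = rng.integers(0, len(X_obs), size=len(X_obs))  # rows with replacement
    X_b = X_obs[idx]                                    # bootstrap resample
    boot_means.append(X_b.mean(axis=0))

# Bootstrap estimate of the standard error of the mean vector.
se = np.std(boot_means, axis=0, ddof=1)
```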

2.6 The Wild Bootstrap

Wu [8] introduced the wild bootstrap to deal with the heteroscedasticity issue. Later, a better approximation of the wild bootstrap was proposed by Liu [9]. The wild bootstrap is based on a modification of the residual bootstrap in least squares estimation. Wu [8] improved the resampling of residuals with replacement by drawing a value \( t_{i}^{*} \) that follows a standard normal distribution with zero mean and unit variance:

$$ x_{il}^{b} = x_{i}^{T} {\hat{\upbeta}} + t_{i}^{*} \frac{{\hat{\varepsilon }_{i} }}{{\sqrt {1 - w_{i} } }} $$
(8)

where \( w_{i} = x_{i}^{T} ({\mathbf{X}}^{T} {\mathbf{X}})^{ - 1} x_{i} \) is the leverage of observation \( i \). However, the error variance of \( t_{i}^{*} \hat{\varepsilon }_{i} \) is inconsistent. Therefore, the authors in [18] proposed to compute \( t_{i}^{*} \) by drawing a sample \( a_{i} \) with replacement:

$$ t_{i}^{*} = a_{i} = \frac{{\hat{\varepsilon }_{i} - \bar{\hat{\varepsilon }}}}{{\sqrt {n_{1k}^{ - 1} \sum\nolimits_{i = 1}^{{n_{1k} }} {(\hat{\varepsilon }_{i} - \bar{\hat{\varepsilon }})^{2} } } }} $$
(9)

where \( \bar{\hat{\varepsilon }} = n_{1k}^{ - 1} \sum\nolimits_{i = 1}^{{n_{1k} }} {\hat{\varepsilon }_{i} } \).
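A sketch of Wu's construction under these definitions; `leverages`, `wu_t` and `wu_errors` are illustrative helper names, and the design matrix `A` is assumed to include the intercept column.

```python
import numpy as np

def leverages(A):
    """Diagonal of the hat matrix: w_i = x_i^T (X^T X)^(-1) x_i."""
    return np.diag(A @ np.linalg.inv(A.T @ A) @ A.T)

def wu_t(resid, rng):
    """Draw t_i* by resampling the standardized residuals a_i of Eq. (9)
    with replacement."""
    a = (resid - resid.mean()) / resid.std(ddof=0)
    return rng.choice(a, size=len(resid), replace=True)

def wu_errors(resid, A, rng):
    """Wild-bootstrap error term t_i* eps_i / sqrt(1 - w_i) from Eq. (8)."""
    return wu_t(resid, rng) * resid / np.sqrt(1.0 - leverages(A))
```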

The second wild bootstrap technique employed in this study is Liu's bootstrap [9]. Liu [9] modified Wu's [8] \( t_{i}^{*} \) by resampling a set of centred residuals with zero mean and unit variance whose third central moment equals one. Liu proposed two procedures for drawing the random numbers \( t_{i}^{*} \); we adopt the second, as it is appropriate for the normal distribution. Liu's bootstrap draws the random numbers:

$$ t_{i} = D_{1} D_{2} - E(D_{1} )E(D_{2} ) $$
(10)

where \( D_{1} \) and \( D_{2} \) are i.i.d. normal random variables with means \( 0.5(\sqrt {17/6} \; + \;\sqrt {1/6} ) \) and \( 0.5(\sqrt {17/6} \; - \;\sqrt {1/6} ) \) respectively, and variance 0.5.
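A sketch of Liu's second procedure, Eq. (10); `liu_t` is an illustrative helper name.

```python
import numpy as np

def liu_t(n, rng):
    """Liu's draws t_i = D1*D2 - E(D1)E(D2), Eq. (10): D1 and D2 are
    independent normals with the means below and variance 0.5, giving
    t_i with zero mean, unit variance and third central moment one."""
    m1 = 0.5 * (np.sqrt(17.0 / 6.0) + np.sqrt(1.0 / 6.0))
    m2 = 0.5 * (np.sqrt(17.0 / 6.0) - np.sqrt(1.0 / 6.0))
    d1 = rng.normal(m1, np.sqrt(0.5), size=n)
    d2 = rng.normal(m2, np.sqrt(0.5), size=n)
    return d1 * d2 - m1 * m2
```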

2.7 The Non-parametric Wild Bootstrap Applied in Missing Data Imputation

The bootstrap procedure based on the resampling approach in the GMM is described in the following steps (a code sketch follows the list):

1. Initiate the set of parameters \( {\varvec{\Phi}} \) with the K-means algorithm.

2. Compute the residuals for each Gaussian component:

   a. Fit the Gaussian mixture model using the parameter values from step 1.

   b. Compute the residuals \( {\hat{\varvec{\upvarepsilon}}}_{k} = {\mathbf{x}}_{lk} - {\mathbf{X}}_{k} {{\hat{\varvec{\upbeta}}}}_{k} \), where \( k = 1, \ldots ,K \) indexes the Gaussian components.

3. For \( b = 1, \ldots ,B \):

   a. Draw a vector of \( n_{1k} \) i.i.d. samples from \( {\hat{\varvec{\upvarepsilon}}}_{k} \) by simple random sampling with replacement, where \( {\hat{\varvec{\upvarepsilon}}}_{k} \) comes from step 2b, following either Wu's [8] or Liu's [9] bootstrap procedure as discussed in Sect. 2.6.

   b. Fit the Gaussian mixture model using the parameter values from step 1.

   c. In the E-step, compute the posterior probability vector \( \tau_{ik} \) in Eq. (2) on the observed data.

   d. In the M-step,

      i. Impute the missing values of size \( n_{0k} \) using the linear regression model (6) based on the OLS estimator \( {\hat{\varvec{\upbeta}}}^{{(O_{k} )}} \) in (7):

         $$ x_{il} = \hat{\beta }_{0}^{{(O_{k} )}} + \hat{\beta }_{1}^{{(O_{k} )}} x_{i} + t_{i}^{*} \hat{\varepsilon }_{i} /\sqrt {1 - w_{i} } $$

         where the multiplier \( t_{i}^{*} \) is obtained from the draw in step 3a.

      ii. Update the new parameters \( {\varvec{\Phi}} \) for each component in the GMM as shown in (3), (4) and (5).
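As a hedged end-to-end sketch of these steps (not the authors' Matlab code), the following simplification replaces the EM posteriors of Eq. (2) with a single K-means partition, omits the parameter updates of Eqs. (3)-(5), and averages the B bootstrap imputations; `wild_bootstrap_impute` and `draw_t` are illustrative names, with `draw_t` standing for the Wu or Liu multiplier draw sketched in Sect. 2.6.

```python
import numpy as np
from sklearn.cluster import KMeans

def wild_bootstrap_impute(X, miss_col, K, B, draw_t, rng):
    """Simplified sketch of the Sect. 2.7 procedure; assumes missingness
    occurs in a single column. draw_t(resid, rng) returns the t_i*
    multipliers, e.g. wu_t or lambda r, g: liu_t(len(r), g)."""
    pred = np.delete(X, miss_col, axis=1)   # complete predictor columns
    obs = ~np.isnan(X[:, miss_col])
    # Step 1: initialise component memberships with K-means
    # (a stand-in for the EM posteriors of Eq. (2)).
    lab = KMeans(n_clusters=K, n_init=10).fit_predict(pred)
    for k in range(K):
        o = obs & (lab == k)                # observed rows, component k
        m = ~obs & (lab == k)               # missing rows, component k
        if not m.any() or o.sum() < 3:
            continue
        A = np.column_stack([np.ones(o.sum()), pred[o]])
        beta = np.linalg.lstsq(A, X[o, miss_col], rcond=None)[0]  # Eq. (7)
        resid = X[o, miss_col] - A @ beta                         # step 2b
        w = np.diag(A @ np.linalg.inv(A.T @ A) @ A.T)             # leverages
        A_m = np.column_stack([np.ones(m.sum()), pred[m]])
        acc = np.zeros(m.sum())
        for _ in range(B):                                        # step 3
            i = rng.integers(0, o.sum(), size=m.sum())            # step 3a
            t = draw_t(resid, rng)[i]
            acc += A_m @ beta + t * resid[i] / np.sqrt(1.0 - w[i])
        X[m, miss_col] = acc / B            # average of the B imputations
    return X
```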

3 Experiments and Discussion of Results

In this section, the numerical results are presented on real and simulated datasets.

3.1 Experimental Setup

Dataset: We evaluated the methods against various criteria on one real dataset and one artificial dataset, each with two variables and two Gaussian classes. The first case study is the Old Faithful Geyser dataset [19]. This dataset contains 272 records of the waiting time between geyser eruptions (waiting) and the duration of eruptions (eruptions) in Yellowstone National Park, USA.

For the artificial case study, 1000 observations are randomly sampled from two Gaussian classes with different mean locations and positive and negative correlations. Data are drawn from normal distributions with the following parameters:

$$ \pi_{1} = 0.5,\,\pi_{2} = 0.5 $$
$$ \mu_{1} = (4,2)^{{\prime }} ,\,\mu_{2} = ( - 2,6)^{{\prime }} $$
$$ \Sigma _{1} = \left( {\begin{array}{*{20}c} 1 & { - 0.7} \\ { - 0.7} & 1 \\ \end{array} } \right),\,\Sigma _{2} = \left( {\begin{array}{*{20}c} 3 & {0.9} \\ {0.9} & 3 \\ \end{array} } \right) $$
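A sketch of this sampling step in Python (the seed is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1000
pi = [0.5, 0.5]
means = [np.array([4.0, 2.0]), np.array([-2.0, 6.0])]
covs = [np.array([[1.0, -0.7], [-0.7, 1.0]]),
        np.array([[3.0, 0.9], [0.9, 3.0]])]

# Draw each observation's component label, then sample from that Gaussian.
labels = rng.choice(2, size=n, p=pi)
X = np.array([rng.multivariate_normal(means[k], covs[k]) for k in labels])
```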

Software: the experiments were conducted using Matlab version 2017a. The proposed method is compared with the multiple imputation available in the R package Amelia II. The comparisons are based on artificial missing data generated with different missing data percentages (MDP): 5%, 10%, 15% and 20%.
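One plausible way to generate the artificial missing data at a given MDP is sketched below; note that masking uniformly at random is an MCAR-style simplification of the MAR assumption stated in Sect. 1, and `make_missing` is an illustrative helper.

```python
import numpy as np

def make_missing(X, mdp, col, rng):
    """Mask MDP * 100 percent of one column's entries as NaN."""
    X = X.astype(float).copy()
    n0 = int(round(mdp * len(X)))   # number of missing cells n0
    idx = rng.choice(len(X), size=n0, replace=False)
    X[idx, col] = np.nan
    return X

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))      # placeholder data
versions = {mdp: make_missing(X, mdp, col=1, rng=rng)
            for mdp in (0.05, 0.10, 0.15, 0.20)}
```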

Imputation implementation: the missing data are imputed by regression imputation. Prior to the imputation process, the K-means algorithm is used to determine initial values of the mixing proportions \( \pi_{k} \), means \( \mu_{k} \) and covariance matrices \( \Sigma _{k} \) in the GMM. The stopping criterion is a threshold: iteration stops when the change between successive iterations is less than \( 10^{ - 6} \).
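A sketch of this initialisation, assuming scikit-learn's KMeans as a stand-in for the K-means implementation used in the experiments:

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_init(X, K):
    """Initial GMM parameters from a K-means partition:
    mixing proportions, means and covariance matrices."""
    labels = KMeans(n_clusters=K, n_init=10).fit_predict(X)
    pis = [np.mean(labels == k) for k in range(K)]
    mus = [X[labels == k].mean(axis=0) for k in range(K)]
    sigmas = [np.cov(X[labels == k].T) for k in range(K)]
    return pis, mus, sigmas
```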

Evaluation criteria: these experiments are designed to measure the performance and prediction accuracy between predicted and actual values. RMSE, employed by most missing data imputation studies, computes the deviation between predicted and actual values; the greater the deviation, the greater the variance between them, so a lower value indicates better performance:

$$ RMSE = \sqrt {\frac{{\sum\nolimits_{i = 1}^{N} {(\hat{y}_{i} - y_{i} )^{2} } }}{N}} $$
(11)

MAPE was used to measure the average relative error of the imputation accuracy:

$$ MAPE = \frac{100}{N} \times \sum\nolimits_{i = 1}^{N} {\left| {\frac{{y_{i} - \hat{y}_{i} }}{{y_{i} }}} \right|} $$
(12)

MAE was used to measure the average magnitude of the individual imputation errors:

$$ MAE = \frac{1}{N} \times \sum\nolimits_{i = 1}^{N} {\left| {y_{i} - \hat{y}_{i} } \right|} $$
(13)

R-squared values were used to describe the goodness of fit of the regression models, i.e. the proportion of variance in the dependent variable explained by the model. The range of R-squared is between 0 and 1:

$$ R^{2} = 1 - \frac{{\sum\nolimits_{i = 1}^{n} {(y_{i} - \hat{y}_{i} )^{2} } }}{{\sum\nolimits_{i = 1}^{n} {(y_{i} - \bar{y})^{2} } }} $$
(14)
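Eqs. (11)-(14) translate directly into code; a minimal sketch:

```python
import numpy as np

def rmse(y, y_hat):   # Eq. (11)
    return np.sqrt(np.mean((y_hat - y) ** 2))

def mape(y, y_hat):   # Eq. (12)
    return 100.0 * np.mean(np.abs((y - y_hat) / y))

def mae(y, y_hat):    # Eq. (13)
    return np.mean(np.abs(y - y_hat))

def r_squared(y, y_hat):  # Eq. (14)
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot
```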

3.2 Experimental Results

In this study, we compare the imputation accuracy using MAPE and MAE, and measure the performance using RMSE and R-squared, for three methods: single regression imputation combined with Wu's wild bootstrap, single regression imputation combined with Liu's wild bootstrap, and MI. The better results are highlighted in bold font.

Table 1 summarizes the performance and prediction accuracy of the three methods on the Old Faithful Geyser dataset, while Table 2 shows the estimates on the randomly generated data. In terms of RMSE, the proposed methods perform better than MI, with clear differences in all MDP proportions. At 5% MDP, Wu's and Liu's methods yielded 7.8225 and 7.8879 respectively, while MI gave 9.8719. Likewise, at 10%, 15% and 20% MDP, Wu's and Liu's methods outperformed MI: Wu's method gave 7.0955, 6.6819 and 6.7349, and Liu's method gave 7.8746, 7.0150 and 7.2354, whereas MI gave higher RMSE values of 8.4187, 8.7004 and 8.9103 respectively.

Table 1. The MAPE, MAE, R-square and RMSE estimates on the Old Faithful Geyser dataset
Table 2. The MAPE, MAE, R-square and RMSE estimates on the randomly generated data

The R-squared values quantify the proportion of variance in the response variable explained by the independent variables; a larger R-squared means more variability is explained by the linear regression model. The R-squared results in Table 1 show that the proposed methods give the best performance on the Faithful data set, with 0.6338 at 5% MDP, followed by 0.6836, 0.7683 and 0.7127 for Wu's method and 0.6894, 0.6869 and 0.7050 for Liu's method at 10%, 15% and 20% MDP respectively. The R-squared values obtained by the proposed methods on the randomly generated data in Table 2 are below 0.6 for all MDP percentages. In contrast, the MI in Amelia explains less variance than the proposed methods in all MDP proportions, with R-squared values ranging from 0.03 to 0.2.

The imputation accuracy is measured by the average relative and absolute errors between the predicted missing data and the original data, using the mean absolute percentage error (MAPE) and the mean absolute error (MAE).

The MAE results in Table 1 show that Wu's and Liu's methods consistently outperformed the MI method on the Old Faithful Geyser dataset. In Table 2, Liu's method offered consistently better accuracy than the MI method, whereas Wu's method showed inconsistent improvement over MI in the average error magnitude on the randomly generated data.

As can be observed from the MAPE values in Table 1, the proposed Wu's and Liu's methods achieved better imputation than MI on the Old Faithful Geyser data set.

Meanwhile, the MAPE values in Table 2 show that Liu's method outperformed the MI method more consistently than Wu's method did.

Plots of the results shown in Fig. 2 compare the outcomes of the multiple imputation technique in the R package Amelia II and the proposed methods.

Fig. 2. The scatter plot of two datasets using R Amelia II and the proposed methods

4 Conclusions

In this paper, we proposed a single imputation method that incorporates the wild bootstrap in order to create variability in the imputed data, as Multiple Imputation (MI), for example, does. MI has long been the preferred method for handling missing data problems over single imputation methods.

The imputation process in MI involves several steps, while single imputation is simpler to implement. The missing data in MI are imputed M times with different plausible values and combined appropriately in the analysis stage. The sparsity of the imputed data is a matter of concern because it is reflected in the variance and measurement error between predicted and original data. Thus, the main purpose of this comparison is to show that single imputation in the Gaussian mixture model can perform as well as MI.

The performance of the method is measured by RMSE, R-squared, MAE, and MAPE. Based on the results, we conclude that single missing data imputation combined with the wild bootstrap is preferable to the MI technique for data containing several Gaussian distributions. Furthermore, the imputation process on the Gaussian mixture model helps preserve the original data distribution.

Since this study is implemented on bivariate data with two Gaussian components, in future work we will focus on multivariate data with multiple Gaussian components.