Abstract
Missing data imputation aims to fill the unobserved regions of real-world data with accurate, complete values. Although many current methods have made remarkable advances, local homogeneous regions, especially at boundaries, and the rationality of the imputed values remain the two most challenging issues. To address these issues, we propose a novel Generative Adversarial Guider Imputation Network (GAGIN) based on the generative adversarial network (GAN) for unsupervised imputation, composed of a Global-Impute-Net (GIN), a Local-Impute-Net (LIN) and an Impute Guider Model (IGM). The GIN looks at the entire missing regions to generate and impute data as a whole. To make the GIN results rational, the IGM captures coherent information between the global and local levels and guides the LIN to look only at a small area centered on each missing region. After these three modules, the local imputed results are concatenated with the global imputed results, which imputes rational values and refines local details from rough to accurate. Comprehensive experiments demonstrate that our proposed method is significantly superior to three state-of-the-art approaches and seven traditional methods: our best RMSE surpasses the second-best method by 17.3% on numeric datasets and by 24.1% on the image dataset. Besides, an extensive ablation study validates the contribution of each component to missing data imputation.
1 Introduction
Missing data imputation is an important and common topic in the real world, aiming to fill uncollected and unobserved regions with rational values. Many imputation approaches have been proposed to handle data containing missing observations, such as multivariate time series imputation [1,2,3,4,5], image imputation [6,7,8,9], regression imputation [10, 11], and sentence completion [12,13,14], to name just a few.
To deal with missing data imputation, traditional methods can be categorized into two types: (1) simple statistical imputation methods and (2) machine learning-based imputation methods. However, these methods tend to change the original data distribution and rely on the assumption that features are correlated [44, 45].
Recently, a few effective missing data imputation methods based on the prevalent generative adversarial network (GAN) [15] have been proposed [16,17,18]. Most of these methods utilize generators and discriminators to learn the information of unobserved regions: the generator repeatedly generates and imputes missing data to deceive the discriminator, while the discriminator tries to distinguish imputed components from observed ones. Although these methods enhance the characteristic expression and follow the data distribution better than traditional methods, local homogeneous regions, especially at boundaries (as shown in the blue and red boxes in Fig. 1), and the rationality of the imputed values are still challenging issues that negatively impact imputation results. Essentially, two main reasons make these issues hard to solve. First, GAN-based methods pay more attention to making the distribution of generated data approximate the distribution of real data as a whole, so the details receive insufficient attention [41, 47, 50]. Hence, the detailed local representation of the imputation is still not accurately captured. Second, existing methods initially feed the model with random noise, ignoring the guiding relationship between different levels from global to local [15, 46, 48, 49]. Therefore, we focus on both the global and local regions and exploit this information to combine them effectively.
In this paper, we propose GAGIN, a novel unsupervised GAN-based imputation model for missing data, which consists of a Global-Impute-Net (GIN), a Local-Impute-Net (LIN) and an Impute Guider Model (IGM). The GIN captures the global distribution of the entire dataset and initially generates the imputation as a whole. Considering cluttered local areas and implausible results, we design the IGM to stretch information from global to local and guide the LIN to refine local regions, especially the boundaries of imputation results. GAGIN learns the guiding information between global and local and refines local regions, enhancing imputation performance. Hence, the proposed GAGIN imputes more rational values and improves inadequate local regions from rough to accurate.
To sum up, the significant contributions of our proposed methods can be summarized as follows:
-
We propose a novel GAGIN for missing data imputation. The network is equipped with three sub-networks: the GIN generates the imputation as a whole, while the LIN, guided by the IGM, refines local regions, especially inadequately imputed areas.
-
Comparisons with ten other missing data imputation methods verify the effectiveness of GAGIN. The experimental results illustrate that our method outperforms the other state-of-the-art approaches on both numeric and image datasets. Furthermore, a comprehensive ablation study demonstrates the effectiveness and superiority of the IGM and LIN for missing data imputation.
2 Related work
2.1 Traditional methods
Existing traditional missing data imputation methods can be categorized into two classes. The first class comprises statistical imputation methods such as zero imputation, mean imputation, and most-common-value imputation [19]. The second class comprises machine learning-based imputation algorithms. Multivariate Imputation by Chained Equations (MICE) [20] fills the missing data using an iterative regression model. The MissForest algorithm [21] treats known variables as features and each missing variable as a label, updating the missing values predicted by a random forest. The matrix completion algorithm [22] factorizes the incomplete dataset into two low-rank matrices and uses their product to impute the missing data. The Expectation–Maximization (EM) algorithm [23], consisting of an "expectation" step and a "maximization" step, iteratively updates model parameters and imputed data. The K-Nearest Neighbor (KNN) algorithm [24] uses the mean value of the k nearest neighbors to fill missing data.
Although these methods are somewhat effective, statistical and machine learning-based imputation methods have various drawbacks. Chief among them are the underuse of available information and the distortion of the original data distribution. Furthermore, MICE, MissForest, EM, KNN, etc., all rest on the assumptions that the data are missing at random and that features are correlated.
2.2 Generative adversarial networks (GANs)-based methods
In recent years, generative adversarial network (GAN) schemes [15, 25,26,27] have significantly advanced missing data imputation. Yoon et al. [18] proposed an imputation network (GAIN) that employs a hint vector to complete the missing data; it is trained with a discriminator that tries to identify which components of the completed data were imputed. GAIN improves imputation performance compared to traditional methods, but its main drawback is its limitation in dealing with high-dimensional datasets or high missing rates. Li et al. [17] proposed a GAN architecture (MisGAN) and an imputation method based on it. MisGAN, consisting of two generators and two discriminators, learns a complete-data distribution; for imputation, another pair of generator and discriminator is used. Although MisGAN handles high-dimensional incomplete data, it tends to neglect the correspondence between imputations and ground truth. Neither work models local regions explicitly, so local details are lacking. Yoon and Sull [16] proposed a generative adversarial multiple imputation network (GAMIN), which generates candidate imputations and presents a confidence prediction method to perform reliable multiple imputation. GAMIN has made tremendous advances at high missing rates; however, the missing rate of real datasets rarely exceeds 80%. This work studies local regions to a certain extent, but because its GAN design transmits limited information from global to local, the solution remains unsatisfactory.
3 The proposed GAGIN
In this paper, we propose a generative adversarial guider imputation network (GAGIN) for missing data imputation. Our design addresses the two problems discussed above: global-to-local details and conditional guiding. We introduce our model in this section, and in Sect. 4 we provide a theoretical discussion supporting our design. The proposed GAGIN receives data missing completely at random and outputs the imputation using the guider concept. Figure 2 illustrates the overall architecture of the proposed GAGIN, which involves a Global-Impute-Net (GIN), a Local-Impute-Net (LIN) and an Impute Guider Model (IGM). Section 3.1 describes the GIN, the IGM and LIN are explained in Sects. 3.2 and 3.3, and the model training for imputation is explained in Sect. 3.4.
3.1 Global-impute-net (GIN)
We design GIN to focus on the entire missing regions to generate and impute data as a whole. After this network simulates the global information, it also needs to pass the information to the local network through the following guider. The design of GIN is as follows:
In the generator, we take the missing data \(\tilde{X}\), the missing mask M, and a noise variable Z as input, and output a vector of generated data \(\overline{X}\). The completed data vector \(\hat{X}\) is obtained by taking the observed part of \(\tilde{X}\) and replacing each missing region with the corresponding value of \(\overline{X}\). Thus, we define \(\overline{X}\) and \(\hat{X}\) in Eqs. (1) and (2) as below:
where \(G_{{\text{g}}}\) is defined as a function transforming the unobserved data to generated data for every component and \(\odot\) denotes element-wise multiplication.
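The completion in Eq. (2) simply keeps observed components and takes missing ones from the generator output. A minimal NumPy sketch (function and variable names are ours, not the paper's):

```python
import numpy as np

def complete(x_tilde, mask, x_bar):
    """Eq. (2): keep observed entries of x_tilde (mask == 1) and
    fill missing entries (mask == 0) with the generator output x_bar."""
    return mask * x_tilde + (1.0 - mask) * x_bar

# toy example with one missing component
x_tilde = np.array([0.2, 0.0, 0.7])   # 0.0 is a placeholder for the missing value
mask    = np.array([1.0, 0.0, 1.0])
x_bar   = np.array([0.3, 0.5, 0.6])   # generator output for every component
print(complete(x_tilde, mask, x_bar))  # observed values kept, missing slot filled
```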
A global discriminator \(D_{g}\) is used to train the GIN; it tries to determine whether each component of its input was imputed or not. The missing mask M and \(\hat{X}\) are combined and fed into \(D_{g}\), which outputs a value in the [0,1] range. Thus, the loss function for the adversarial global impute net is defined in Eq. (3) as follows:
where the i-th component of \(D_{{\text{g}}}^{{\text{i}}}\) corresponds to the probability that the i-th component of \(\hat{X}\) was observed.
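Eq. (3) follows a GAIN-style component-wise cross-entropy; a sketch under that assumption (the exact weighting in the paper may differ):

```python
import numpy as np

def gin_adversarial_loss(d_out, mask, eps=1e-8):
    """Component-wise cross-entropy: observed entries (mask == 1) should be
    scored near 1, imputed entries (mask == 0) near 0.  The discriminator
    maximizes the negative of this quantity."""
    return -np.mean(mask * np.log(d_out + eps)
                    + (1.0 - mask) * np.log(1.0 - d_out + eps))

mask = np.array([1.0, 0.0, 1.0])
good = np.array([0.9, 0.1, 0.9])   # discriminator close to the truth: low loss
bad  = np.array([0.5, 0.5, 0.5])   # uninformative discriminator: higher loss
print(gin_adversarial_loss(good, mask), gin_adversarial_loss(bad, mask))
```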
3.2 Impute guider model (IGM)
For missing data imputation, it is critical for imputers to explore the felicitous structure and the appropriate local information, especially in smooth homogeneous regions. However, previous methods learn a variety of global information and treat all characteristics without distinction, so the finer details are ignored. In using GAN to solve such a problem, the difficulty lies in how to use the global information as the pre-condition of the local distribution to guide the generation. It is well known that GANs cannot directly use complex conditions, and the output of GIN lies in \(R^{n}\), which is not suitable as a condition. For this purpose we design the IGM module.
Based on the above observation, we propose the IGM to guide the LIN according to the GIN result \(\hat{X}\) and the missing mask, so that local regions become refinable and the imputed values remain reasonable. The global imputation result therefore acts as a prior that leads the generation and adjusts the local imputation results. Each piece of local information is extracted from the intermediate imputation guider by fully connected layers. To model the impute guider of the intermediate results \(\widehat{{x_{{{\text{cdd}}}} }} \in R^{{{\text{local}}}} \times R^{{{\text{local}}}}\), our proposed IGM can be summarized as the following three steps, as illustrated in Fig. 3: (1) dividing the whole GIN result into a few partitions and searching the candidate local region \(f_{{{\text{search}}}}^{{{\text{local}}}} \left( \cdot \right)\) via Eq. (4); (2) digging the inter-imputation relationship ε from the extracted \(x_{{{\text{cdd}}}}\) via the multilayer perceptron \(f_{{{\text{FC}}}} \left( \cdot \right)\) in Eq. (5); (3) fusing the missing mask and the intermediate results via Eq. (6).
where \(f_{{{\text{FC}}}} \left( \cdot \right)\) is two fully connected layers with activation function relu, \(E\left( \cdot \right)\) represents expanding the spatial dimension of \({\upvarepsilon }\) to that of M, and \(C\left( \cdot \right)\) is the concatenation operation.
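The three steps above can be sketched as follows; the patch-selection criterion, layer sizes, and weights are illustrative assumptions, since only the overall structure is specified here:

```python
import numpy as np

rng = np.random.default_rng(0)

def search_local(x_hat, mask, patch):
    """Step 1 (f_search^local): split the GIN result into patches and pick a
    candidate region; choosing the patch with most missing entries is our
    assumption about the criterion."""
    starts = range(0, len(x_hat) - patch + 1, patch)
    best = max(starts, key=lambda s: np.sum(1.0 - mask[s:s + patch]))
    return x_hat[best:best + patch]

def f_fc(x, w1, b1, w2, b2):
    """Step 2 (f_FC): two fully connected layers with ReLU, producing the
    inter-imputation relationship epsilon."""
    return np.maximum(0.0, x @ w1 + b1) @ w2 + b2

def igm(x_hat, mask, patch=4, hidden=8):
    x_cdd = search_local(x_hat, mask, patch)             # step 1
    w1, b1 = rng.normal(size=(patch, hidden)), np.zeros(hidden)
    w2, b2 = rng.normal(size=(hidden, 1)), np.zeros(1)
    eps = f_fc(x_cdd, w1, b1, w2, b2)                    # step 2
    eps_expanded = np.broadcast_to(eps, mask.shape)      # E(.): expand to mask shape
    return np.concatenate([mask, eps_expanded])          # step 3, C(.): fuse

x_hat = rng.uniform(size=12)
mask  = (rng.uniform(size=12) > 0.5).astype(float)
guide = igm(x_hat, mask)
print(guide.shape)
```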
Here we provide a linear/nonlinear module after the guider, allowing the model to perform a linear or nonlinear conversion. Traditional theory holds that a nonlinear kernel function increases the representation ability after conversion. We examine this module in the theoretical analysis and in the experiments, respectively.
3.3 Local-impute-net (LIN)
Benefiting from the results of the IGM, the proposed LIN pays more attention to the partial regions related to the missing locations. The GIN takes the full data as input to recognize the global consistency of the scene, while the LIN, guided by the IGM, focuses on a small region around the inadequately imputed area to refine the quality of more detailed appearance. Similar to Eq. (3), the adversarial local impute net loss is defined as follows:
Finally, the outputs of the global and the local discriminators are concatenated together into a single vector, which is then processed by a single fully connected layer, to output a continuous value. A sigmoid transfer function [40] is used so that this value is in the [0,1] range and represents the probability that the data are real, rather than imputed.
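The fusion described above can be sketched as follows (feature sizes and weights are illustrative assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fused_score(global_feat, local_feat, w, b):
    """Concatenate the global and local discriminator outputs, pass the result
    through a single fully connected layer, and squash it with a sigmoid so it
    lies in [0, 1]: the probability that the data are real rather than imputed."""
    return sigmoid(np.concatenate([global_feat, local_feat]) @ w + b)

rng = np.random.default_rng(0)
global_feat = rng.normal(size=64)     # hypothetical discriminator feature sizes
local_feat  = rng.normal(size=64)
w, b = rng.normal(size=128), 0.0
p = fused_score(global_feat, local_feat, w, b)
print(p)
```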
3.4 Model training for imputation
We design and jointly optimize two training loss functions for the proposed GAGIN: the observation loss for training stability and the adversarial loss for improving imputation performance.
The objective of imputation is to minimize the difference between the imputed values of the observed components and the real observed values. For GIN, the observation loss is given by:
where d represents the distance between the imputed data and the real data.
Similarly, we obtain the observation loss for the LIN and Eq. (9) shows the whole observation loss.
As mentioned above, GIN and the LIN are both adversarial networks and the missing data imputation is well trained between generators and discriminators. To obtain global consistency and finer details, we define the adversarial loss as follows:
The overall imputation losses jointing these two functions are defined as below:
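A sketch of the joint objective, assuming a squared-error distance d restricted to observed components and an unspecified weighting alpha between the two loss families (both assumptions of ours):

```python
import numpy as np

def observation_loss(x_true, x_imputed, mask):
    """Mean squared distance over observed components only: the imputed values
    of observed entries should match the real observed values."""
    return np.sum(mask * (x_true - x_imputed) ** 2) / np.sum(mask)

def total_loss(obs_g, obs_l, adv_g, adv_l, alpha=1.0):
    """Joint objective: global/local observation losses plus global/local
    adversarial losses.  The weight alpha is illustrative, not from the paper."""
    return (obs_g + obs_l) + alpha * (adv_g + adv_l)

x_true    = np.array([0.2, 0.4, 0.7])
x_imputed = np.array([0.2, 0.9, 0.6])
mask      = np.array([1.0, 0.0, 1.0])
print(observation_loss(x_true, x_imputed, mask))  # only the observed 0.1 error counts
```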
The pseudo-code is presented in Algorithm 1.
4 Theoretical analysis
In this section, we discuss the global-to-local problem and the conditional guiding problem for GANs via theoretical analysis of the following points: (1) the imputation problem can be transformed and solved by simulating the global distribution with a GAN; (2) introducing localization results in a better imputation with a smaller imputation risk function, which proves the advantage of global-to-local imputation; (3) our design solves the regional distribution simulation in the local GAN under the global condition.
4.1 Risk function for imputation
We first define the risk function of the overall imputation. Let the observed value be \(x \in X\), the missing value be \(y \in Y\), and the target value be \(y_{0} \in Y\). The goal of imputation is to estimate the missing value through the observed value, which means to evaluate the conditional distribution function \(F\left( {y{|}x} \right)\). The objective can be expressed as the following formula with the conditional mathematical expectation function:
Given the set of functions \(f\left( {x,\alpha } \right)\), \(\alpha \in \Lambda_{\alpha }\), with \(f\left( {x,\alpha } \right) \in L_{2} \left( P \right)\), if the regression \(r\left( x \right)\) belongs to this set, i.e. \(r\left( x \right) = f\left( {x,\alpha^{*} } \right)\) for some \(\alpha^{*} \in \Lambda_{\alpha }\), then the regression problem is transformed into solving the following functional. The risk function for imputation can be expressed as:
And the solution for such risk function is to minimize the above risk function so that
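Consistent with the surrounding definitions (the target value \(y_0\), the regression \(r(x)\), and the weighted integral appearing later in the error term), the risk functional (13) and its minimizer (14) take the standard regression form [42]; the equation bodies are reconstructed here under that assumption:

```latex
% Risk functional for imputation (Eq. 13)
R(\alpha) = \int \bigl( y_0 - f(x,\alpha) \bigr)^2 \, dF(x,y)

% Minimizing it recovers the regression function (Eq. 14)
f(x,\alpha^{*}) = r(x) = \int y \, dF(y \mid x), \qquad
\alpha^{*} = \operatorname*{arg\,min}_{\alpha \in \Lambda_{\alpha}} R(\alpha)
```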
Equation (14) is a traditional regression problem [42] that can be represented and solved by the risk function on the functional space; it also serves as a theoretical description of our imputation task.
Via the definition of the empirical risk function, solving (13) amounts to finding \({\upalpha }^{*}\) in \(\Lambda_{{\upalpha }}\) as in (14). However, in rather complicated scenes such as imputation, solving (13) through the empirical risk function is not a good option [2, 6, 9]. The difficulty then shifts from finding \({\upalpha }^{*}\) to estimating \(F\left( {x,y} \right)\), which provides a theoretical basis for using GAN to solve imputation, because a GAN can simulate exactly this joint distribution [15, 41].
4.2 Estimation of \({\varvec{F}}\left( {{\varvec{x}},{\varvec{y}}} \right)\) in GAN Imputation
In this part, we show that a GAN can simulate \(F\left( {x,y} \right)\) well, and that the simulated \(F\left( {x,y} \right)\) can in turn solve imputation.
In imputation, however, the distribution \(F\left( {x,y} \right)\) is unknown, and we can only observe empirical samples of \((y|x)\) to estimate \(F\left( {x,y} \right)\). We define a density functional \(p\left( {x,y,\beta } \right)\), where \(\beta \in {\Lambda }_{\beta }\); if \(\beta^{*}\) is found, we have
Based on the Bretagnolle–Huber inequality, we have
By letting the \(\beta = D \in {\Lambda }_{{\mathfrak{D}}}\) as a discriminator neural network in a functional space, we can obtain from Eq. (18) as follows:
which coincidentally turns out to be a GAN loss function:
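For reference, the loss referred to here is the standard GAN minimax objective of [15]; the body of Eq. (20) is reconstructed under that assumption:

```latex
\min_{G} \max_{D} \;
\mathbb{E}_{(x,y) \sim F(x,y)} \bigl[ \log D(x,y) \bigr]
+ \mathbb{E}_{z \sim p_{z}} \bigl[ \log \bigl( 1 - D(G(z)) \bigr) \bigr]
```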
We show below that fitting the imputation with \(p_{g} \left( {x,y,D} \right)\) is equivalent to the original fitting of Eq. (13), so the imputation problem can be solved by GAN.
Plugging Eq. (20) into Eq. (13), we find that fitting an imputation function \(f\left( {x,{\upalpha }} \right)\) by GAN with an estimated probability \(p\left( {x,y,D} \right)\) is to:
The first term is the error term caused by the inaccurate distribution estimation, which can be seen from Eq. (18), \(\int \left( {y_{0} - f\left( {x,{\upalpha }} \right)} \right)^{2} \left[ {p\left( {x,y,D_{0} } \right) - p\left( {x,y,D} \right) } \right]d\left( {x,y} \right) \to 0\).
So, we obtain:
We use the simulated distribution generated by the GAN to fit the imputation, which is equivalent to the original fit. This also circumvents the fact that \(F\left( {x,y} \right)\) is high-dimensional and unknown.
4.3 Advantages of localization imputation
The above conclusion shows that using GAN to simulate a global probability distribution is a reasonable way to solve the imputation. We now further show that if the local simulation method is used, the final overall loss becomes smaller.
For the risk function of Eq. (22), suppose we define the region we want to impute as \(A_{i}\), and let the regions other than \(A_{i}\) be \(A_{j \ne i}\), s.t. \({ }A_{i} \cap A_{j \ne i} = \phi\) and \(\cup_{i} A_{i} = A_{all} = I\). If the data are regionalized and imputation is performed per region, the risk function of each area \(R_{{A_{i} }}\) is defined as:
where \(L_{2}\) is the norm-2 distance and is a concave function, \(F_{{A_{i} }} \left( {x,y} \right)\) is a margin distribution of \(F\left( {x,y} \right)\), \(F_{{A_{i} }} \left( {x,y} \right) = \mathop \int \nolimits_{i \ne j} F\left( {x,y} \right)dA_{j}\). Here we use \(dA_{j}\) to simplify \(d\left( {x,y} \right)\), s.t. \(\left( {x,y} \right) \in A_{j}\) without confusion.
Hence with a localization imputation, the entire risk function is
Given that \({L}_{2}\) is a concave function, by Jensen's inequality we have
Given the fact that \(A_{i} \in A_{all}\) and \(L_{2} \left( \cdot \right) \ge 0\), we have
Equation (26) proves that when the regionalization strategy is used, the overall risk of regionalized imputation is less than or equal to the risk of global imputation.
4.4 Localization imputation in GAN with the global Information
We have shown that localization usually has a smaller risk function. In practice, however, we have globally observed samples. Compared with Eq. (26) in Sect. 4.3, the problem we need to solve is fitting \(F\left( {x,y{|}x_{0} } \right)\), which is a simulation problem of conditional probability.
As is well known, GANs are not good at conditional probability simulation (for example, the classical CGAN can only take a one-dimensional label as its condition [43]), while the condition in imputation is a high-dimensional set of observed samples. To solve the difficulty of fitting \(F\left( {x,y{|}x_{0} } \right)\), we support our design through the following theoretical discussion.
We define the result of global imputation net (GIN) as \(\hat{y} \in Y\):
\({\hat{\text{y}}}\) is not the optimal solution for the fitting, since localization admits a smaller risk function; an optimization result from the localization risk function would be better. Hence we discuss how to obtain a \(\hat{y}_{{A_{i} }}\) satisfying the conditional density function \(p(y|x_{{A_{i} }} ,x_{A\,j \ne i} )\).
For the local imputation net (LIN), taking global-to-local information into account, we let the input noise be an encoded representation of \(\hat{y}_{{A_{i} }}\). Without loss of generality, denote the encoding by an operator T, the impute guider model (IGM) in our framework.
where Z is the input noise for the local GAN and f can be a linear or nonlinear function. When a nonlinear activation is used, the guider is closer to an encoder. Interestingly, subsequent heuristic experiments show that a linear function has better characterization ability. We suspect this is because a linear transformation retains more "original information" as a prior, whereas a nonlinear transformation introduces more disturbance. The relevant results are given in Sect. 5.4.
If the local generator in LIN is considered as an encoder \({\text{T}}^{\prime }\), then we will find that
Given the fact that
So we finally have
This is a conditional probability from \({\text{x}}_{{A_{i} }}\) and \({\text{x}}_{{A_{j} }}\) to the localized imputation result \(\overline{y}_{{A_{i} }}\), and \(\overline{y}_{{A_{i} }}\) exploits the global information \(x_{{A_{j} }}\).
Equation (31) theoretically supports how the proposed method uses the global imputation (GIN) result as a condition to generate the local imputation (LIN). Hence, we have solved the problem of conditional simulation in local imputation.
In summary, we have discussed that GAN can be used to simulate F(x,y) to solve the imputation problem and localized imputation can theoretically bring better solutions. Finally, the proposed method can use global information as a condition to complete the localized imputation.
5 Experiments
5.1 Datasets and experimentation details
5.1.1 Datasets
We evaluate our method both on numeric datasets and on image dataset available for missing data imputation tasks: UCI Machine Learning Repository [28] and MNIST [29]. The UCI maintains 559 datasets as a service to the machine learning community. Like [18], we select four real-world datasets (Breast, Spam, Credit, Letter) to evaluate the imputation performance quantitatively, as shown in Table 1. MNIST is a dataset of handwritten digits images of size 28 × 28 containing 70,000 images. We use the provided 50,000 as training set, 10,000 as validation set, and the remaining 10,000 images as testing set. Tenfold cross-validation is applied.
5.1.2 Experimentation details
For our training set, the values of each numeric dataset and the image dataset are rescaled to [0, 1]. We simulate missingness by dropping each value independently at a given missing rate; the dropout missing rates in our experiments range from 10 to 80% with a step of 10%. During training, the GAGIN parameters are initialized with Xavier initialization [30, 37]. The noise vector Z has the same dimension as the input data. The whole network is trained by the Adam optimizer [31] with learning rate 1e-3 and a batch size of 128, and the learning process is stopped at 10k iterations. We implement our model in the TensorFlow [32] framework using Python 3.7 [38], and our experiments run on an Nvidia RTX 2080Ti GPU [39]. To demonstrate the proposed model's performance fairly, all compared methods are implemented with the same FC architecture, using only fully connected layers for both the generators and the discriminators.
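Xavier (Glorot) initialization [30] draws each weight uniformly with a bound determined by the layer's fan-in and fan-out; a small sketch:

```python
import numpy as np

def xavier_uniform(fan_in, fan_out, rng=None):
    """Xavier/Glorot uniform initialization: U(-limit, limit) with
    limit = sqrt(6 / (fan_in + fan_out))."""
    if rng is None:
        rng = np.random.default_rng(0)
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

w = xavier_uniform(784, 128)   # e.g. a first FC layer for flattened 28x28 inputs
print(w.shape)
```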
5.2 Evaluation metrics
To be fair to all methods, we use a unified evaluation protocol to quantitatively analyze the results of all missing data imputation methods. Inspired by [16,17,18], we choose root mean square error (RMSE) [33, 36] and Fréchet inception distance (FID) [34, 35] as the evaluation metrics to compare performance with the state-of-the-art imputation methods.
The RMSE between the real data and imputed data of missing data imputation is defined as:
where N represents the number of samples.
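A masked-RMSE sketch; restricting the error to the originally missing components follows common practice in imputation benchmarks and is our assumption here:

```python
import numpy as np

def rmse_missing(x_true, x_imputed, mask):
    """RMSE computed over the originally missing components (mask == 0)."""
    miss = 1.0 - mask
    return np.sqrt(np.sum(miss * (x_true - x_imputed) ** 2) / np.sum(miss))

x_true    = np.array([0.2, 0.4, 0.7, 0.1])
x_imputed = np.array([0.2, 0.6, 0.7, 0.4])
mask      = np.array([1.0, 0.0, 1.0, 0.0])
print(rmse_missing(x_true, x_imputed, mask))  # errors 0.2 and 0.3 on missing slots
```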
The FID is a measure of the similarity between the generated images and real ones, which is defined as below:
where \(\left( {i,r} \right)\) denotes the imputed image and the real image, \(\mu_{{\text{i}}}\) and \(\sum_{{\text{i}}}\) are the mean and covariance matrix of the imputed image's feature vectors (and likewise \(\mu_{{\text{r}}}\) and \(\sum_{{\text{r}}}\) for the real image), and \({\text{Tr}}\) denotes the trace of the matrix.
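Given pre-extracted feature vectors (standard FID uses Inception features), the formula above can be computed as follows; this is a generic sketch, not the paper's implementation:

```python
import numpy as np
from scipy.linalg import sqrtm

def fid(feat_imputed, feat_real):
    """Frechet inception distance between two sets of feature vectors:
    ||mu_i - mu_r||^2 + Tr(S_i + S_r - 2 (S_i S_r)^{1/2})."""
    mu_i, mu_r = feat_imputed.mean(axis=0), feat_real.mean(axis=0)
    cov_i = np.cov(feat_imputed, rowvar=False)
    cov_r = np.cov(feat_real, rowvar=False)
    covmean = sqrtm(cov_i @ cov_r)
    if np.iscomplexobj(covmean):          # drop tiny imaginary parts from sqrtm
        covmean = covmean.real
    diff = mu_i - mu_r
    return float(diff @ diff + np.trace(cov_i + cov_r - 2.0 * covmean))

rng = np.random.default_rng(0)
feats = rng.normal(size=(200, 4))
print(fid(feats, feats))   # identical feature sets give a distance near 0
```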
5.3 Performance comparison on UCI and MNIST missing dataset
In the experiments, GAGIN is compared against 10 methods: three state-of-the-art methods (GAIN, MisGAN, GAMIN) and seven traditional methods (zero-imputation, mean-imputation, MICE, MissForest, Matrix, EM, KNN). For fairness, we either run the authors' released code directly or reimplement each method following the original paper. Before starting the experiments, we standardized the input datasets.
5.3.1 Evaluation on various missing rates and dimensions for UCI missing dataset
Table 2 shows the comparison between the other imputation methods and our proposed GAGIN in the last row. We conduct two experiments, varying the missing rate and the number of dimensions on the Credit dataset. Figure 4 shows the RMSE performance for missing rates from 10 to 80% with a step of 10%. The blue columns show the traditional imputation methods, while the green polylines represent the GAN-based imputation methods. Even though the RMSE of every algorithm increases as the missing rate increases, our GAGIN consistently outperforms the benchmarks across the entire range of missing rates, since it captures the global information and learns the local relationships of the unobserved values under the guidance of the IGM. Hence, our method imputes the missing data with more accurate values.
We also investigate the influence of dimensionality on the different methods, as shown in Fig. 5. The RMSE decreases as the number of dimensions increases. The red line shows the superior RMSE of our method across datasets of different dimensionality. We conclude that the proposed GAGIN is also robust to the number of dimensions.
5.3.2 Quantitative evaluation on MNIST missing dataset
The comparison of evaluation metrics is shown in Table 3, with our method's scores in the last column. For RMSE, GAGIN surpasses the second-best method by 24.1% on the MNIST dataset with a 50% missing rate. Moreover, our method attains the lowest FID, 33.2% lower than the second-best method. To show the comparison intuitively, we calculate the FID score of all methods under independent dropout with missing rates from 10 to 80%; the results are illustrated in Fig. 6. The blue polylines show the traditional imputation methods, while the green polylines represent the GAN-based imputation methods. In all cases, our proposed method evidently achieves the best imputation FID.
When the missing rate is low (10%-60%), more of the overall distribution is retained, and the GAN methods perform well owing to simulating the global information well. However, as the missing rate increases, the global information decreases, and the traditional algorithms are greatly affected (the observed neighborhood of each missing point contains more missing values). When the missing rate exceeds 60%, the RMSE rises sharply. Nevertheless, our method retains the best performance, as the global-to-local design gives a better solution and optimizes the details. This observation is also corroborated by the subsequent t-test experiment. Consequently, GAGIN outperforms all the other methods on missing data imputation.
5.3.3 Qualitative comparison on MNIST missing dataset
Figure 7 shows the imputation results generated by different methods on the MNIST dataset with a 50% missing rate. Columns (c) to (e) are the traditional methods, i.e. zero-imputation, mean-imputation and matrix-based imputation, while columns (f) to (h) are the recent GAN-based methods, i.e. GAIN, MisGAN and GAMIN. Intuitively, zero-imputation and mean-imputation cannot produce valuable imputations, and matrix-based imputation generates blurred images with star-like artifacts. Furthermore, images imputed by the GAN-based methods possess unclear and insufficient boundaries in detail. It is apparent from Fig. 7 that the imputation results of our proposed method present the clearest and most precise boundaries in visual quality.
5.4 Ablation analysis
5.4.1 Linear or nonlinear GIN
In our framework, we use a linear/nonlinear module after the IGM. Here we heuristically choose two nonlinear functions as the kernel transform after the guider and compare them to the linear transform; the results are listed in the last three columns of Tables 2 and 3. Interestingly, although adding a nonlinear kernel is generally considered to increase the representation ability of a module, in our study the nonlinear functions perform worse. This is because the guider obtains information from global random variables as a prior to generate the local part; a linear transformation suffices to introduce some uncertainty/variance while preserving the solid global-to-local relationship, whereas nonlinear functions extremely disturb this situation, especially for digital data, and excessively weaken the prior passed by the guider. On the other hand, this proves the effectiveness of the guider-conducted prior we designed. In sum, this part of the work is enlightening: finding a nonlinear function that simultaneously offers better representation and preserves more prior information is left for future research. We therefore retain this module in the final model design.
5.4.2 Effectiveness of local-impute-net (LIN)
As mentioned, the goal of the LIN is to optimize the local boundary details and refine inadequately imputed areas. To validate its effectiveness, we remove the entire LIN and directly concatenate the results of the IGM and GIN as the final outputs. The RMSE with LIN (row GAGIN) and without LIN (denoted "w/o LIN") are reported in Table 4. The proposed GAGIN with LIN works better than without LIN. In addition, even without the LIN our method achieves the best performance among the state-of-the-art and traditional methods in Table 3, which also highlights the effectiveness of our other subnetworks (i.e. GIN and IGM). Figure 8 shows the visual results generated by our method without LIN and with LIN (denoted "w/ LIN"). In terms of visual results, (d) is more concrete than (c), illustrating that the LIN refines boundary details and effectively promotes the final imputation outputs.
5.4.3 Analysis of impute guider model (IGM)
To explore the importance of the IGM (i.e. \(\hat{X}\), \(f_{\text{search}}^{\text{local}}\) and \(f_{\text{FC}}\)), we conduct experiments with four different instances, addressing three questions: (1) why feed \(\hat{X}\), generated by the GIN, into the IGM as input; (2) how to choose the candidate local region for the impute guider model; and (3) how to exploit the inter-imputation relationship of the extracted region. Based on these considerations, the ablation studies on the input \(\hat{X}\), \(f_{\text{search}}^{\text{local}}\) and \(f_{\text{FC}}\) are shown in the "effectiveness of IGM" part of Table 4. To investigate the importance of the GIN's information, we first replace the input with random noise (denoted as “w/o \(\hat{X}\)”). To verify the effectiveness of the impute guider model, we obtain the RMSE and FID results without \(f_{\text{search}}^{\text{local}}\) and \(f_{\text{FC}}\) (denoted as “w/o \(f_{\text{search}}^{\text{local}}\) + w/o \(f_{\text{FC}}\)”) by replacing these modules with corresponding fully connected layers. For the local region choice, the impute guider model is equipped only with \(f_{\text{FC}}\) (denoted as “w/ \(f_{\text{FC}}\)”). For mining the inter-imputation relationship, we add only \(f_{\text{search}}^{\text{local}}\) (denoted as “w/ \(f_{\text{search}}^{\text{local}}\)”). Compared with the basic model (w/o \(f_{\text{search}}^{\text{local}}\) + w/o \(f_{\text{FC}}\)), our IGM decreases FID by up to 1.944 and RMSE by up to 0.0722. Hence, \(\hat{X}\), \(f_{\text{search}}^{\text{local}}\) and \(f_{\text{FC}}\) significantly improve the results.
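One plausible reading of \(f_{\text{search}}^{\text{local}}\) (our own sketch; the paper's exact selection criterion may differ) is a sliding-window search that picks the candidate region containing the most missing entries, i.e. the region most in need of local refinement:

```python
import numpy as np

def search_local(mask, k):
    """Return the top-left corner of the k-by-k window covering the most
    missing entries (mask == 0). A hypothetical stand-in for f_search^local."""
    h, w = mask.shape
    best, best_rc = -1, (0, 0)
    for r in range(h - k + 1):
        for c in range(w - k + 1):
            missing = int((mask[r:r + k, c:c + k] == 0).sum())
            if missing > best:
                best, best_rc = missing, (r, c)
    return best_rc

mask = np.ones((5, 5), dtype=int)
mask[2:4, 3:5] = 0               # missing block in the lower-right
print(search_local(mask, 2))     # -> (2, 3)
```

The selected window is centered on the missing-focused region, matching the paper's description of the LIN looking "only at a small area centered at the missing focused regions".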
5.4.4 Statistical T test to ensure the superiority of GAGIN
The t test is a standard statistical method for comparing two independent samples. To assess whether the imputed data and the real data are similar, we use the t test to quantify the significance of the difference between the two types of samples in missing data imputation. The test results for 10–80% missing rates on the MNIST dataset are presented in Table 5. From the table, we see that as the missing rate increases, the p value gets smaller while the t value gets larger. When the p value > 0.05, the null hypothesis is accepted, i.e., the difference between the two samples is not significant, and vice versa. When the missing rate is between 10 and 60%, the p value is greater than 0.05, indicating that the generated imputed data are similar to the real data; the lower the missing rate, the more similar they are. When the missing rate is 70% or higher, the p value < 0.05, so there is a certain gap between the generated sample and the real sample. The t test further confirms the statistical significance of our proposed method.
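The decision rule can be reproduced in a few lines (our own sketch using Welch's two-sample t statistic; the paper does not specify which t-test variant was used, and the sample values below are hypothetical — in practice `scipy.stats.ttest_ind` returns the matching p value directly):

```python
import numpy as np

def welch_t(a, b):
    """Welch's two-sample t statistic (no equal-variance assumption)."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    va, vb = a.var(ddof=1), b.var(ddof=1)  # unbiased sample variances
    return (a.mean() - b.mean()) / np.sqrt(va / len(a) + vb / len(b))

real = [2.0, 4.0, 6.0]     # hypothetical real-data sample
imputed = [1.0, 3.0, 5.0]  # hypothetical imputed sample
t = welch_t(real, imputed)
print(round(t, 4))  # -> 0.6124

# Under the paper's protocol: if the p value derived from t exceeds 0.05,
# the null hypothesis is accepted, i.e. imputed and real data are deemed
# not significantly different.
```

A small |t| (and hence large p value) corresponds to the 10–60% missing-rate regime in Table 5, where imputed and real samples are statistically indistinguishable.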
6 Conclusions
In this paper, we propose a novel generative adversarial guider imputation network (GAGIN) for missing data imputation. To solve the interference of local clutter and the inaccurate imputation boundary details, we design a Global-Impute-Net (GIN), a Local-Impute-Net (LIN) and an Impute Guider Model (IGM).
After the GIN generates and imputes data globally, the LIN is assigned to capture and refine local details under the guidance of the IGM. Comprehensive experiments indicate that our proposed method is superior for missing data imputation. However, the method still needs improvement to cover all realistic settings. Future work will investigate the performance of GAGIN on other missing-data types (MAR, MNAR). Furthermore, we plan to add an absolute guide imputation to further enhance the performance of our method.
References
Fortuin V, Baranchuk D, Rätsch G, et al. (2020) Gp-vae: Deep probabilistic time series imputation[C]//International Conference on artificial intelligence and statistics. PMLR, pp 1651–1661
Luo Y, Zhang Y, Cai X, Yuan X (2019) E2GAN: End-to-end generative adversarial network for multivariate time series imputation. In: Proceedings of the 28th international joint conference on artificial intelligence (IJCAI-19)
Rubanova Y, Chen R T Q, Duvenaud D. 2019 Latent odes for irregularly-sampled time series[J]. arXiv preprint arXiv:1907.03907
Liu Y, Yu R, Zheng S, et al. (2019) NAOMI: Non-autoregressive multiresolution sequence imputation[J]. arXiv preprint arXiv:1901.10946
Fedus W, Goodfellow I, Dai AM (2018) MaskGAN: Better text generation via filling in the ______[J]. arXiv preprint arXiv:1801.07736
Lee D, Kim J, Moon W J, et al. 2019 CollaGAN: Collaborative GAN for missing image data imputation[C] In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp 2487–2496
Becker P, Pandya H, Gebhardt G, et al. 2019 Recurrent kalman networks: Factorized inference in high-dimensional deep feature spaces[C]//International conference on machine learning. PMLR pp 544–552
Dalca AV, Bouman KL, Freeman WT et al (2018) Medical image imputation from image collections[J]. IEEE Trans Med Imaging 38(2):504–514
Lee D, Moon W J, Ye J C. 2019 Which contrast does matter? towards a deep understanding of MR contrast using collaborative GAN[J]. arXiv preprint arXiv:1905.04105
Khosravi P, Liang Y, Choi Y J, et al. 2019 What to expect of classifiers? reasoning about logistic regression with missing features[J]. arXiv preprint arXiv:1903.01620
Cortes D. 2019 Imputing missing values with unsupervised random trees[J]. arXiv preprint arXiv:1911.06646
Brown T B, Mann B, Ryder N, et al. 2020 Language models are few-shot learners[J]. arXiv preprint arXiv:2005.14165
Tran K, Bisazza A, Monz C. 2016 Recurrent memory networks for language modeling[J]. arXiv preprint arXiv:1601.01272
Zhang X, Lu L, Lapata M. 2015 Top-down tree long short-term memory networks[J]. arXiv preprint arXiv:1511.00060
Goodfellow IJ, Pouget-Abadie J, Mirza M (2014) Generative Adversarial Networks. Adv Neural Inf Process Syst 3:2672–2680
Yoon S, Sull S (2020) GAMIN: Generative adversarial multiple imputation network for highly missing data. In: 2020 IEEE/CVF conference on computer vision and pattern recognition (CVPR). IEEE
Li SC-X, Jiang B, Marlin B (2019) MisGAN: Learning from incomplete data with generative adversarial networks
Yoon J, Jordon J, van der Schaar M (2018) GAIN: Missing data imputation using generative adversarial nets. In: International conference on machine learning, pp 5675–5684
Kantardzic M (2011) Data mining: concepts, models, methods, and algorithms
White I R, Royston P, Wood A M (2011) Multiple imputation using chained equations: issues and guidance for practice. Statistic Med 30(4):377–399
Stekhoven DJ, Bühlmann P (2011) Missforest–non-parametric missing value imputation for mixed-type data. Bioinformatics 28(1):112–118
Acar E, Dunlavy DM, Kolda TG, Mørup M (2010) Scalable tensor factorizations with missing data. In: Proceedings of the 2010 SIAM international conference on data mining, pp 701–712. SIAM
García-Laencina PJ, Sancho-Gómez J-L, Figueiras-Vidal AR (2010) Pattern classification with missing data: a review[J]. Neural Comput Appl 19(2):263–282
Hudak A T, Crookston N L, Evans J S, Hall D E, Falkowski M J (2008) Nearest neighbor imputation of species-level, plot-scale forest structure attributes from lidar data. Remote Sens Environ 112(5):2232–2245
Li M, Lin J, Ding Y, et al. 2020 Gan compression: Efficient architectures for interactive conditional gans[C]//Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp 5284–5294
Shen Y, Gu J, Tang X, et al. 2020 Interpreting the latent space of gans for semantic face editing[C]//Proceedings of the IEEE/CVF Conference on computer vision and pattern recognition. pp 9243–9252
Daras G, Odena A, Zhang H, et al. 2020 Your local GAN: Designing two dimensional local attention mechanisms for generative models[C]//Proceedings of the IEEE/CVF Conference on computer vision and pattern recognition. pp 14531–14539
Lichman M. 2013 UCI machine learning repository. URL http://archive.ics.uci.edu/ml.
LeCun Y, Cortes C (2010) MNIST handwritten digit database. URL http://yann.lecun.com/exdb/mnist/.
Glorot X, Bengio Y (2010) Understanding the difficulty of training deep feedforward neural networks. J Mach Learn Res 9:249–256
Kingma DP, Ba J (2014) Adam: A method for stochastic optimization[J]. arXiv preprint arXiv:1412.6980
Abadi M, Barham P, Chen J, Chen Z, Davis A, Dean J, Devin M, Ghemawat S, Irving G, Isard M, et al. 2016 Tensorflow: a system for large-scale machine learning
Balakrishnan S, Chopra S (2012) Collaborative ranking. In: WSDM '12, pp 143–152. ACM
Heusel M, Ramsauer H, Unterthiner T, Nessler B, Hochreiter S. 2017 Gans trained by a two time-scale update rule converge to a local nash equilibrium. Neural Information Processing Systems, pp 6626–6637
Xu Q, et al. (2018) An empirical study on evaluation metrics of generative adversarial networks. arXiv preprint arXiv:1806.07755
Chai T, Draxler RR (2014) Root mean square error (RMSE) or mean absolute error (MAE)?–Arguments against avoiding RMSE in the literature. Geoscientific model development 7(3):1247–1250
Kumar S K. 2017 On weight initialization in deep neural networks[J]. arXiv preprint arXiv:1704.08863
Pajankar A (2021) Useful unix commands and tools[M]//Practical Linux with Raspberry Pi OS. Apress, Berkeley, CA, pp 81–89
Kumar N. 2019 Neural network implementation using CUDA[D]
Yin X et al (2003) A flexible sigmoid function of determinate growth. Annals of Botany 91(3):361–371
Gulrajani I, Ahmed F, Arjovsky M, et al. 2017 Improved Training of Wasserstein GANs[J]. arXiv preprint arXiv:1704.00028v3
Vapnik V. 2013 The nature of statistical learning theory[M]. Springer science & business media
Mirza M, Osindero S. 2014 Conditional generative adversarial nets[J]. arXiv preprint arXiv:1411.1784
Liu Y, Gopalakrishnan V (2017) An overview and evaluation of recent machine learning imputation methods using cardiac imaging data. Data 2(1):8
Jerez José M et al (2010) Missing data imputation using statistical and machine learning methods in a real breast cancer problem. Artifi Intell Med 50(2):105–115
Li L, Fu H, Xu X (2021) Active learning with sampling by joint global-local uncertainty for salient object detection. Neural Comput Applic. https://doi.org/10.1007/s00521-021-06395-8
Ma X, Li X, Zhou Y et al (2021) Image smoothing based on global sparsity decomposition and a variable parameter. Comp Visual Media 7:483–497
Wang Q, Hu X, Gao Q et al (2014) Global–local fisher discriminant approach for face recognition. Neural Comput Applic 25:1137–1144
Cheng Y, Song F, Qian K (2021) Missing multi-label learning with non-equilibrium based on two-level autoencoder. Appl Intell 51:6997–7015
Raja PS, Sasirekha K, Thangavel K (2020) A novel fuzzy rough clustering parameter-based missing value imputation. Neural Comput Applic 32:10033–10050
Acknowledgements
This work was supported by the Qian Xuesen Laboratory of Space Technology, CAST (GZZKFJJ2020002), and the National Key Research and Development Program of China under grant number 2018hjyzkfkt-002.
Author information
Authors and Affiliations
Contributions
Wei Wang was involved in supervision and project administration. Yimeng Chai was involved in methodology, software, and writing—original draft. Yue Li was involved in conceptualization, methodology, and writing—review & editing.
Corresponding author
Ethics declarations
Conflict of Interest
We declare that we have no financial or personal relationships with other people or organizations that could inappropriately influence our work, and no professional or other personal interest of any nature in any product, service and/or company that could be construed as influencing the position presented in, or the review of, this manuscript.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
About this article
Cite this article
Wang, W., Chai, Y. & Li, Y. GAGIN: generative adversarial guider imputation network for missing data. Neural Comput & Applic 34, 7597–7610 (2022). https://doi.org/10.1007/s00521-021-06862-2
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s00521-021-06862-2