1 Introduction

Missing data imputation is an important and common problem in real-world applications, aiming to fill uncollected and unobserved regions with plausible values. Many imputation approaches have been proposed to handle data containing missing observations, such as multivariate time series imputation [1,2,3,4,5], image imputation [6,7,8,9], regression imputation [10, 11], and sentence completion [12,13,14], to cite just a few.

Traditional methods for missing data imputation can be categorized into two types: (1) simple statistical imputation methods and (2) machine learning-based imputation methods. However, these methods tend to distort the original data distribution and rely on the assumption that the data exhibit correlations among their characteristics [44, 45].

Recently, several effective missing data imputation methods based on the now prevalent generative adversarial network (GAN) [15] have been proposed [16,17,18]. Most of these methods use a generator and a discriminator to learn the information of the unobserved regions: the generator repeatedly generates values to impute the missing data and deceive the discriminator, while the discriminator tries to distinguish imputed components from observed ones. Although these methods enhance the feature representation and follow the data distribution better than traditional methods, local homogeneous regions, especially near boundaries (as shown in the blue and red boxes in Fig. 1), and the plausibility of the imputed values remain challenging issues that negatively impact imputation results. Essentially, two main reasons make these issues difficult to solve. First, GAN-based methods focus on making the distribution of the generated data approximate the distribution of the real data as a whole, so the details receive insufficient attention [41, 47, 50]; hence, the detailed local representation of the imputation is still not accurately captured. Second, existing methods feed the model with random noise initially, which ignores guidance between different levels from global to local [15, 46, 48, 49]. Therefore, we focus on both the global and local regions and exploit the information effectively to combine them well.

Fig. 1

Some examples of challenges in missing data imputation. a Ground truth (GT) and e missing mask of which missing components are colored black. Visualization of b Zero imputation, f Mean imputation, c GAIN [18], g GAMIN [16], d MisGAN [17] and h Our proposed GAGIN

In this paper, we propose GAGIN, a novel unsupervised GAN-based imputation model for missing data, which consists of a Global-Impute-Net (GIN), a Local-Impute-Net (LIN) and an Impute Guider Model (IGM). GIN captures the global distribution from the entire dataset and initially generates the imputation as a whole. To handle cluttered local areas and improper results, we design IGM to propagate information from global to local and guide LIN to refine the local regions, especially the boundaries of the imputation results. GAGIN thus learns guiding information between the global and local levels and refines local regions, enhancing imputation performance. Hence, the proposed GAGIN imputes more plausible values and progressively refines inadequate local regions from rough to accurate.

To sum up, the significant contributions of our proposed methods can be summarized as follows:

  • We propose a novel GAGIN for missing data imputation. The network is equipped with three sub-networks: the GIN generates the imputation as a whole, while the LIN, guided by the IGM, refines the local regions, especially inadequately imputed areas.

  • We compare our method with 10 other missing data imputation methods to verify the effectiveness of GAGIN. The experimental results show that our method outperforms the other state-of-the-art approaches on both numeric and image datasets. Furthermore, a comprehensive ablation study demonstrates the effectiveness and superiority of the IGM and LIN for missing data imputation.

2 Related work

2.1 Traditional methods

Existing traditional missing data imputation methods can be categorized into two classes. The first comprises statistical imputation methods such as zero imputation, mean imputation, and most-frequent-value imputation [19]. The second comprises machine learning-based imputation algorithms. Multivariate Imputation by Chained Equations (MICE) [20] fills the missing data by iteratively fitting regression models. The MissForest algorithm [21] treats the known variables as features and the missing variable as the label, and updates the missing values predicted by a random forest. The matrix completion algorithm [22] factorizes the incomplete dataset into two low-rank matrices and uses their product to impute the missing data. The Expectation–Maximization (EM) algorithm [23], consisting of an "expectation" step and a "maximization" step, iteratively updates the model parameters and the imputed data. The K-Nearest Neighbor (KNN) algorithm [24] fills missing data with the mean value of the k nearest neighbors.
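As a minimal illustration of these baselines, the sketch below applies mean, KNN and MICE-style imputation with scikit-learn; the array `X_missing`, with NaNs marking missing entries, is a hypothetical placeholder rather than a dataset used in this paper.

```python
# A minimal sketch of the traditional baselines using scikit-learn.
# `X_missing` is a hypothetical array whose missing entries are NaN.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import SimpleImputer, KNNImputer, IterativeImputer

rng = np.random.default_rng(0)
X_missing = rng.normal(size=(100, 5))
X_missing[rng.random(X_missing.shape) < 0.2] = np.nan  # 20% MCAR dropout

mean_filled = SimpleImputer(strategy="mean").fit_transform(X_missing)      # mean imputation
knn_filled = KNNImputer(n_neighbors=5).fit_transform(X_missing)            # KNN imputation
mice_filled = IterativeImputer(max_iter=10).fit_transform(X_missing)       # MICE-style imputation
```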

Although these methods are somewhat effective in imputing missing data, statistical and machine learning-based imputation methods have several drawbacks. Their main shortcomings are that they do not fully utilize the available information and that they alter the original data distribution. Furthermore, MICE, MissForest, EM, KNN, etc., all rest on the assumption that the data are missing at random and that the characteristics are correlated.

2.2 Generative adversarial networks (GANs)-based methods

In recent years, generative adversarial network (GAN) schemes [15, 25,26,27] have significantly advanced missing data imputation. Yoon et al. [18] proposed the generative adversarial imputation network (GAIN), which employs a hint vector to complete the missing data; it is trained with a discriminator that tries to determine which components of the completed data were imputed. GAIN improves imputation performance over the traditional methods, but its main drawback is its limited ability to deal with high-dimensional datasets or high missing rates. Li et al. [17] proposed MisGAN, a GAN architecture together with an imputation method built on it. MisGAN, consisting of two generators and two discriminators, learns a complete-data distribution; for imputation, another pair of generator and discriminator is used. Although MisGAN handles high-dimensional incomplete data, it tends to neglect the correspondence between imputations and the ground truth. Neither of these works considers a local design, so local details are lacking. Yoon and Sull [16] proposed the generative adversarial multiple imputation network (GAMIN), which generates candidate imputations and uses a confidence prediction method to perform reliable multiple imputation. GAMIN makes substantial advances at high missing rates; however, the missing rate of real datasets rarely exceeds 80%. This work studies locality to a certain extent, but because global-to-local information transmission in the GAN is limited, its solution remains unsatisfactory.

3 The proposed GAGIN

In this paper, we propose the generative adversarial guider imputation network (GAGIN) for missing data imputation. Our design addresses the two problems identified above: global-to-local details and conditional guiding. We introduce the model in this section, and in Sect. 4 we provide a theoretical discussion supporting our design. The proposed GAGIN takes data that are missing completely at random as input and outputs the imputation using the guiding concept. Figure 2 illustrates the overall architecture of the proposed GAGIN, which involves a Global-Impute-Net (GIN), a Local-Impute-Net (LIN) and an Impute Guider Model (IGM). Section 3.1 describes the GIN, and the IGM and LIN are detailed in Sects. 3.2 and 3.3, respectively. The model training for imputation is explained in Sect. 3.4.

Fig. 2

The overall architecture of Generative Adversarial Guider Imputation Network (GAGIN)

3.1 Global-impute-net (GIN)

We design the GIN to focus on the entire set of missing regions and to generate and impute data as a whole. After this network models the global information, it passes the information to the local network through the guider described below. The design of the GIN is as follows:

In the generator, we take the incomplete data \(\tilde{X}\), the missing mask M, and a noise variable Z as input, and output a vector of generated data \(\overline{X}\). The completed data vector \(\hat{X}\) is obtained by taking the observed components of \(\tilde{X}\) and replacing each missing component with the corresponding value of \(\overline{X}\). Thus, we define \(\overline{X}\) and \(\hat{X}\) in Eqs. (1) and (2) as follows:

$$\overline{X} = G_{g} \left( {\tilde{X},M,\left( {1 - M} \right) \odot Z} \right)$$
(1)
$$\hat{X} = M \odot \tilde{X} + \left( {1 - M} \right) \odot \overline{X}$$
(2)

where \(G_{{\text{g}}}\) is defined as a function transforming the unobserved data to generated data for every component and \(\odot\) denotes element-wise multiplication.
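To make Eqs. (1) and (2) concrete, the following is a minimal TensorFlow sketch of how the generator input is assembled and how the completed vector \(\hat{X}\) is composed. The generator `make_generator` and the 2-D tensor shapes are illustrative assumptions; the paper only states that generators and discriminators use fully connected layers.

```python
import tensorflow as tf

# Hypothetical fully connected generator G_g (layer widths are illustrative).
def make_generator(dim):
    return tf.keras.Sequential([
        tf.keras.layers.Dense(dim, activation="relu"),
        tf.keras.layers.Dense(dim, activation="sigmoid"),  # data rescaled to [0, 1]
    ])

def gin_forward(g_net, x_tilde, mask, noise):
    """Eq. (1): X_bar = G_g(X_tilde, M, (1 - M) ⊙ Z); Eq. (2): X_hat."""
    g_input = tf.concat([x_tilde, mask, (1.0 - mask) * noise], axis=1)
    x_bar = g_net(g_input)                              # generated data
    x_hat = mask * x_tilde + (1.0 - mask) * x_bar       # keep observed, fill missing
    return x_bar, x_hat
```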

A global discriminator \(D_{g}\) is used to train the GIN; it tries to determine whether each component of its input is imputed or not. The missing mask M and \(\hat{X}\) are concatenated and fed into \(D_{g}\), which outputs a value in the [0, 1] range. Thus, the loss function for the adversarial global impute net is defined in Eq. (3) as follows:

$$L_{{{\text{adv}}}}^{{\text{g}}} = \mathop \sum \limits_{i = 1}^{s} L_{{{\text{adv}}}} \left( { G_{{\text{g}}}^{{\text{i}}} ; D_{{\text{g}}}^{i} \left( {\hat{X},M} \right)} \right)$$
(3)

where the i-th output \(D_{{\text{g}}}^{{\text{i}}}\) corresponds to the probability that the i-th component of \(\hat{X}\) was observed.

3.2 Impute guider model (IGM)

For missing data imputation, it is critical to capture the appropriate structure and local information, especially in smooth homogeneous regions. However, previous methods learn a variety of global information and treat all characteristics without distinction, so that finer details are ignored. When using a GAN to solve such a problem, the difficulty lies in how to use the global information as a pre-condition of the local distribution to guide the generation. GANs cannot directly use complex conditions, and the output of the GIN lies in \(R^{n}\), which is too high-dimensional to serve as a condition. For this reason, we design the IGM module.

Based on the above observation, we propose the IGM to guide the LIN according to the GIN result \(\hat{X}\) and the missing mask, so that the local regions become refinable and the imputed values remain reasonable. The global imputation result thus acts as a prior that leads the generation and adjusts the local imputation result. Each piece of local information is extracted from the intermediate imputation guider by fully connected layers. To model the impute guider of the intermediate result \(\widehat{{x_{{{\text{cdd}}}} }} \in R^{{{\text{local}}}} \times R^{{{\text{local}}}}\), the proposed IGM proceeds in the following three steps, as illustrated in Fig. 3: (1) dividing the whole GIN result into a few partitions and searching for a candidate local region with \(f_{{{\text{search}}}}^{{{\text{local}}}} \left( \cdot \right)\) via Eq. (4); (2) digging out the inter-imputation relationship ε from the extracted \(x_{{{\text{cdd}}}}\) via the multilayer perceptron \(f_{{{\text{FC}}}} \left( \cdot \right)\) in Eq. (5); (3) fusing the missing mask and the intermediate result via Eq. (6).

$$x_{{{\text{cdd}}}} = f_{{{\text{search}}}}^{{{\text{local}}}} \left( {\hat{X},M} \right)$$
(4)
$$\varepsilon = f_{{{\text{FC}}}} \left( {x_{{{\text{cdd}}}} } \right)$$
(5)
$$\widehat{{x_{{{\text{cdd}}}} }} = \left( {C\left( {E\left( \varepsilon \right),M} \right)} \right)$$
(6)

where \(f_{{{\text{FC}}}} \left( \cdot \right)\) consists of two fully connected layers with ReLU activation, \(E\left( \cdot \right)\) expands the spatial dimension of \({\upvarepsilon }\) to that of M, and \(C\left( \cdot \right)\) is the concatenation operation.
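The following minimal sketch illustrates Eqs. (4)–(6) under assumptions not fixed in the text: the candidate local region is taken as a fixed-size crop around the centroid of the missing components, the image size is MNIST-like, and the layer widths are arbitrary. It is intended only to show the search → FC → expand-and-concatenate flow of the IGM.

```python
import tensorflow as tf

H = W = 28  # illustrative image size (e.g., MNIST)

def search_local(x_hat, mask, size=7):
    """Eq. (4): extract a candidate local region around the missing components.
    For illustration we crop a size x size window centred on the centroid of the
    missing entries of the first sample; the paper does not fix this search rule."""
    missing = tf.cast(tf.where(mask[0, :, :, 0] < 0.5), tf.float32)
    center = tf.reduce_mean(missing, axis=0)
    r0 = int(tf.clip_by_value(center[0] - size // 2, 0, H - size))
    c0 = int(tf.clip_by_value(center[1] - size // 2, 0, W - size))
    return x_hat[:, r0:r0 + size, c0:c0 + size, :]

# Eq. (5): two fully connected layers with ReLU activation (widths illustrative).
f_fc = tf.keras.Sequential([
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(H * W, activation="relu"),
])

def igm(x_hat, mask):
    x_cdd = search_local(x_hat, mask)            # Eq. (4)
    eps = f_fc(x_cdd)                            # Eq. (5)
    eps = tf.reshape(eps, (-1, H, W, 1))         # E(.): expand eps to the shape of M
    return tf.concat([eps, mask], axis=-1)       # Eq. (6): C(E(eps), M)
```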

Fig. 3

The workflow of our Impute Guider Model (IGM), where E and C denote the expanding and concatenation operations, respectively

Here we insert a linear/nonlinear module after the guider, allowing the model to perform a linear or nonlinear transformation. Conventional wisdom holds that a nonlinear kernel function increases the representation ability after such a transformation. We examine this module in the theoretical analysis and in the experiments, respectively.

3.3 Local-impute-net (LIN)

Benefiting from the result of the IGM, the proposed LIN pays more attention to the partial regions around the missing locations. The GIN takes the full data as input to recognize the global consistency of the scene, while the LIN, guided by the IGM, focuses on a small region around the inadequately imputed area to refine the quality of the detailed appearance. Similar to Eq. (3), the adversarial local impute net loss is defined as follows:

$$L_{{{\text{adv}}}}^{{\text{l}}} = \mathop \sum \limits_{i = 1}^{s} L_{{{\text{adv}}}} \left( { G_{{\text{l}}}^{{\text{i}}} \left( {\widehat{{x_{{{\text{cdd}}}} }}} \right) ; D_{{\text{l}}}^{{\text{i}}} \left( {\hat{X},M} \right)} \right)$$
(7)

Finally, the outputs of the global and the local discriminators are concatenated into a single vector, which is then processed by a single fully connected layer to output a continuous value. A sigmoid transfer function [40] is applied so that this value lies in the [0, 1] range and represents the probability that the data are real rather than imputed.
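As a minimal sketch of this fusion step (the tensors `d_global` and `d_local` standing in for the two discriminators' outputs are hypothetical), the combination could look as follows:

```python
import tensorflow as tf

# Fuse the global and local discriminator outputs into one realness score.
fusion_head = tf.keras.layers.Dense(1, activation="sigmoid")  # single FC layer + sigmoid [40]

def fused_score(d_global, d_local):
    joint = tf.concat([d_global, d_local], axis=1)  # concatenate the two output vectors
    return fusion_head(joint)                       # probability that the data are real
```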

3.4 Model training for imputation

We design and jointly optimize two training loss functions for the proposed GAGIN: the observation loss for training stability and the adversarial loss for improving imputation performance.

The objective of imputation is to minimize the difference between the imputed values of the observed components and the real observed values. For GIN, the observation loss is given by:

$$L_{{{\text{obs}}}}^{{\text{g}}} = d\left( {M \odot \tilde{X},M \odot G_{g} \left( {\tilde{X},M,\left( {1 - M} \right) \odot Z} \right)} \right)$$
(8)

where d denotes a distance measure between the imputed data and the real observed data.

Similarly, we obtain the observation loss for the LIN, and Eq. (9) gives the overall observation loss.

$${\mathcal{L}}_{{{\text{obs}}}} = L_{{{\text{obs}}}}^{{\text{g}}} + L_{{{\text{obs}}}}^{{\text{l}}}$$
(9)

As mentioned above, the GIN and the LIN are both adversarial networks, and the missing data imputation is trained through the interplay between the generators and the discriminators. To obtain global consistency and finer details, we define the adversarial loss as follows:

$${\mathcal{L}}_{{{\text{adv}}}} = L_{{{\text{adv}}}}^{{\text{g}}} + L_{{{\text{adv}}}}^{{\text{l}}} = \mathop \sum \limits_{i = 1}^{s} L_{{{\text{adv}}}} \left( { G_{{\text{g}}}^{{\text{i}}} ; D_{{\text{g}}}^{{\text{i}}} \left( {\hat{X},M} \right)} \right) + \mathop \sum \limits_{i = 1}^{s} L_{{{\text{adv}}}} \left( { G_{{\text{l}}}^{{\text{i}}} \left( {\widehat{{x_{{{\text{cdd}}}} }}} \right) ; D_{{\text{l}}}^{{\text{i}}} \left( {\hat{X},M} \right)} \right)$$
(10)

The overall imputation loss, combining these two terms, is defined as below:

$${\mathcal{L}}_{{{\text{imp}}}} = {\mathcal{L}}_{{{\text{adv}}}} + {\mathcal{L}}_{{{\text{obs}}}}$$
(11)

The pseudo-code is presented in Algorithm 1.

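The following is a hedged sketch of one possible training loop implementing Eqs. (8)–(11). The networks `g_gin`, `d_gin`, `g_lin`, `d_lin`, the IGM function `igm`, the loss weighting and the update schedule are illustrative assumptions rather than the paper's exact implementation.

```python
import tensorflow as tf

opt_g = tf.keras.optimizers.Adam(1e-3)   # learning rate from Sect. 5.1.2
opt_d = tf.keras.optimizers.Adam(1e-3)
bce = tf.keras.losses.BinaryCrossentropy()

def train_step(x_tilde, mask, g_gin, d_gin, g_lin, d_lin, igm):
    noise = tf.random.uniform(tf.shape(x_tilde))
    with tf.GradientTape() as tape_g, tf.GradientTape() as tape_d:
        # --- GIN: global generation and completion, Eqs. (1)-(2) ---
        x_bar = g_gin(tf.concat([x_tilde, mask, (1.0 - mask) * noise], axis=-1))
        x_hat = mask * x_tilde + (1.0 - mask) * x_bar
        # --- IGM + LIN: guided local refinement, Eqs. (4)-(7) ---
        x_local = g_lin(igm(x_hat, mask))
        # --- Observation loss, Eqs. (8)-(9): squared error on observed entries ---
        l_obs = tf.reduce_mean(mask * (x_tilde - x_bar) ** 2) \
              + tf.reduce_mean(mask * (x_tilde - x_local) ** 2)
        # --- Adversarial loss, Eq. (10): discriminators predict observed vs. imputed ---
        p_g = d_gin(tf.concat([x_hat, mask], axis=-1))
        p_l = d_lin(tf.concat([x_local, mask], axis=-1))
        l_d = bce(mask, p_g) + bce(mask, p_l)               # discriminator objective
        l_adv = bce(tf.ones_like(mask), p_g) + bce(tf.ones_like(mask), p_l)
        l_imp = l_adv + l_obs                               # Eq. (11)
    gen_vars = g_gin.trainable_variables + g_lin.trainable_variables
    dis_vars = d_gin.trainable_variables + d_lin.trainable_variables
    opt_g.apply_gradients(zip(tape_g.gradient(l_imp, gen_vars), gen_vars))
    opt_d.apply_gradients(zip(tape_d.gradient(l_d, dis_vars), dis_vars))
    return l_imp
```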

4 Theoretical analysis

In this section, we analyze the global-to-local problem and the conditional guiding problem for GANs from the following perspectives: (1) the imputation problem can be transformed into, and solved by, the simulation of the global distribution with a GAN; (2) introducing localization yields a better imputation with a smaller imputation risk function, which proves the advantage of global-to-local imputation; (3) our design solves the regional distribution simulation in the local GAN under the global condition.

4.1 Risk function for imputation

We first define the risk function of the overall imputation. Let the observed value be \(x \in X\), the missing value be \(y \in Y\), and the target value be \(y_{0} \in Y\). The goal of imputation is to estimate the missing value from the observed value, i.e., to evaluate the conditional distribution function \(F\left( {y{|}x} \right)\). The objective can be expressed via the conditional expectation function:

$$r\left( x \right) = \int ydF\left( {y{|}x} \right)$$
(12)

Given the function set \(f\left( {x,{\upalpha }} \right)\), \({\upalpha } \in \Lambda_{{\upalpha }}\), with \(f\left( {x,{\upalpha }} \right) \in L_{2} \left( P \right)\), if the regression \(r\left( x \right)\) belongs to this set for some \({\upalpha } \in \Lambda_{{\upalpha }}\), then the regression problem reduces to minimizing the following functional. The risk function for imputation can be expressed as:

$$R\left( {\upalpha } \right) = \int \left( {y_{0} - f\left( {x,{\upalpha }} \right)} \right)^{2} dF\left( {x,y} \right)$$
(13)

The solution is obtained by minimizing this risk function, so that

$$f\left( {x,{\upalpha }^{*} } \right),\quad {\upalpha }^{*} = \arg \mathop {\min }\limits_{{{\upalpha } \in \Lambda_{{\upalpha }} }} R\left( {\upalpha } \right)$$
(14)

Equation (14) is a traditional regression problem [42] that can be represented and solved through the risk function on the functional space; it also serves as a theoretical description of our imputation task.

By the definition of the empirical risk function, solving (13) amounts to finding \({\upalpha }^{*}\) in \(\Lambda_{{\upalpha }}\) as in (14). However, in rather complicated settings such as imputation, solving (13) through the empirical risk function is not a good option [2, 6, 9]. The difficulty then shifts from finding \({\upalpha }^{*}\) to estimating \(F\left( {x,y} \right)\), which provides the theoretical basis for using a GAN to solve imputation, because a GAN can only model the joint distribution [15, 41].

4.2 Estimation of \({\varvec{F}}\left( {{\varvec{x}},{\varvec{y}}} \right)\) in GAN Imputation

In this part, we show that a GAN can well simulate \(F\left( {x,y} \right)\), and that the simulated \(F\left( {x,y} \right)\) can in turn be used to solve the imputation.

In imputation, however, the distribution \(F\left( {x,y} \right)\) is unknown, and we can only observe empirical samples of \((y|x)\) to estimate \(F\left( {x,y} \right)\). We define a density functional \(p\left( {x,y,\beta } \right)\), where \(\beta \in {\Lambda }_{\beta }\); if \(\beta^{*}\) is found, we have

$$F\left( {x,y} \right) = \int_{ - \infty ,}^{x,} {\int_{ - \infty }^{y} {p\left( {u,v,\beta^{*} } \right)dudv} }$$
(15)
$$R_{{{\text{prop}}}} \left( \beta \right) = - \int lnp\left( {x,y,\beta } \right)dF\left( {x,y} \right)$$
(16)
$$p\left( {x,y,\beta^{*} } \right) = \arg \mathop {\min }\limits_{{\beta \in \Lambda_{\beta } }} R_{{{\text{prop}}}} \left( \beta \right) = \arg \mathop {\min }\limits_{{\beta \in \Lambda_{\beta } }} \left( { - \int \ln p\left( {x,y,\beta } \right)dF\left( {x,y} \right)} \right)$$
(17)

Based on Bretagnolle–Huber inequality, there is

$$\int \left| {p\left( {x,y,\beta } \right) - p\left( {x,y,\beta_{0} } \right)} \right|dx \le 2\sqrt {1 - e^{{R\left( {\beta_{0} } \right) - R\left( \beta \right)}} }$$
$$s.t.\;R\left( {\beta^{*} } \right) - \mathop {\inf }\limits_{{\forall \beta \in \Lambda_{\beta } }} R\left( \beta \right) < \epsilon$$
(18)

By letting \(\beta = D \in {\Lambda }_{{\mathfrak{D}}}\), where D is a discriminator neural network in a functional space, we obtain from Eq. (18):

$$\int \left| {p_{{\text{g}}} \left( {x,y,D} \right) - p\left( {x,y,D_{0} } \right)} \right|dx$$
(19)

which coincides with a GAN loss function:

$$\int \left( {p_{{\text{g}}} \left( {x,y,D} \right)} \right)dx - \int \left( {p_{{{\text{real}}}} \left( {x,y,D} \right)} \right)dx$$
$$\quad = E_{{g\left( z \right)\sim p_{{\text{g}}} }} \left( {D\left( {g\left( z \right)} \right)} \right) - E_{{\left( {X,Y} \right)\sim p_{{{\text{real}}}} }} \left( {D\left( {X,Y} \right)} \right)$$
(20)

We show below that fitting the imputation with \(p_{g} \left( {x,y,D} \right)\) yields a result equivalent to the original fit in Eq. (13), so the imputation problem can be solved by a GAN.

Plugging Eq. (20) into Eq. (13), fitting an imputation function \(f\left( {x,{\upalpha }} \right)\) by a GAN with an estimated probability \(p\left( {x,y,D} \right)\) amounts to:

$$\begin{aligned} R\left( {\upalpha } \right) = & \int \left( {y_{0} - f\left( {x,{\upalpha }} \right)} \right)^{2} dF\left( {x,y} \right) \\ = & \int \left( {y_{0} - f\left( {x,a} \right)} \right)^{2} p\left( {x,y,\beta { }} \right)d\left( {x,y} \right) \\ = & \int \left( {y_{0} - f\left( {x,{\upalpha }} \right)} \right)^{2} p\left( {x,y,D_{0} } \right)d\left( {x,y} \right) \\ = & \int \left( {y_{0} - f\left( {x,{\upalpha }} \right)} \right)^{2} \left[ {p\left( {x,y,D_{0} } \right) - p\left( {x,y,D} \right) + p\left( {x,y,D} \right)} \right]d\left( {x,y} \right) \\ = & \int \left( {y_{0} - f\left( {x,{\upalpha }} \right)} \right)^{2} \left[ {p\left( {x,y,D_{0} } \right) - p\left( {x,y,D} \right) } \right]d\left( {x,y} \right) + \int \left( {y_{0} - f\left( {x,{\upalpha }} \right)} \right)^{2} \left[ {p\left( {x,y,D} \right)} \right]d\left( {x,y} \right) \\ \end{aligned}$$
(21)

The first term is the error term caused by the inaccurate distribution estimation, which can be seen from Eq. (18), \(\int \left( {y_{0} - f\left( {x,{\upalpha }} \right)} \right)^{2} \left[ {p\left( {x,y,D_{0} } \right) - p\left( {x,y,D} \right) } \right]d\left( {x,y} \right) \to 0\).

So, we obtain:

$$R\left( {\upalpha } \right) = \int \left( {y_{0} - f\left( {x,{\upalpha }} \right)} \right)^{2} \left[ {p\left( {x,y,D} \right)} \right]d\left( {x,y} \right)$$
(22)

Thus, using the simulated distribution generated by the GAN to fit the imputation is equivalent to the original fit. It also resolves the difficulty that \(F\left( {x,y} \right)\) is high-dimensional and unknown.

4.3 Advantages of localization imputation

The above conclusion shows that using a GAN to simulate the global probability distribution is a reasonable way to solve the imputation. We now further show that if the local simulation method is used, the final overall loss is smaller.

For the risk function of Eq. (22), define the region we want to impute as \(A_{i}\), and let the regions other than \(A_{i}\) be \(A_{j \ne i}\), s.t. \(A_{i} \cap A_{j \ne i} = \phi\) and \(\cup_{i} A_{i} = A_{all} = I\). If this area is regionalized and imputation is performed per region, the risk function of each region \(R_{{A_{i} }}\) is defined as:

$$R_{{A_{i} }} \left( {\upalpha } \right) = \int L_{2} \left( {y_{0} - f\left( {x,{\upalpha }} \right)} \right)dF_{{A_{i} }} \left( {x,y} \right)$$
(23)

where \(L_{2}\) is the norm-2 distance and is a concave function, and \(F_{{A_{i} }} \left( {x,y} \right)\) is a marginal distribution of \(F\left( {x,y} \right)\), \(F_{{A_{i} }} \left( {x,y} \right) = \mathop \int \nolimits_{i \ne j} F\left( {x,y} \right)dA_{j}\). Here we use \(dA_{j}\) as shorthand for \(d\left( {x,y} \right)\), s.t. \(\left( {x,y} \right) \in A_{j}\), without confusion.

Hence, with localized imputation, the entire risk function is

$$R_{{{\text{all}}}} \left( {\upalpha } \right) = \int R_{{A_{i} }} dA_{{\text{i}}} = \mathop \int \limits_{i} L_{2} \left( {y_{0} - f\left( {x,{\upalpha }} \right)} \right)dF_{{A_{i} }} \left( {x,y} \right)dA_{{\text{i}}} = \mathop \int \limits_{i} \left[ {\mathop \int \limits_{i \ne j} L_{2} \left( {y_{0} - f\left( {x,{\upalpha }} \right)} \right)F\left( {x,y} \right)dA_{{\text{j}}} } \right]dA_{{\text{i}}}$$
(24)

Given that \({L}_{2}\) is a concave function, Jensen's inequality gives

$$\begin{aligned} R_{{{\text{all}}}} \left( {\upalpha } \right) = & \mathop \int \limits_{{\text{i}}} \left[ {\mathop \int \limits_{i \ne j} L_{2} \left( {y_{0} - f\left( {x,{\upalpha }} \right)} \right)dF_{{A_{i} }} \left( {x,y} \right)} \right]dA_{{\text{i}}} \le \mathop \int \limits_{i} L_{2} \left[ {\mathop \int \limits_{i \ne j} \left( {y_{0} - f\left( {x,{\upalpha }} \right)} \right)dF_{{A_{i} }} \left( {x,y} \right)} \right]dA_{{\text{i}}} \\ = & \mathop \int \limits_{{\text{i}}} L_{2} \left[ {\mathop \int \limits_{all} \left( {y_{0} - f\left( {x,{\upalpha }} \right)} \right)F\left( {x,y} \right)dA_{all} - \mathop \int \limits_{{{\text{all}}}} \left( {y_{0} - f\left( {x,{\upalpha }} \right)} \right)F\left( {x,y} \right)dA_{i} } \right]dA_{{\text{i}}} \\ = & \mathop \int \limits_{{\text{i}}} L_{2} \left[ {\mathop \int \limits_{{{\text{all}}}} \left( {y_{0} - f\left( {x,{\upalpha }} \right)} \right)F\left( {x,y} \right)dA_{all} } \right]dA_{{\text{i}}} - \mathop \int \limits_{{\text{i}}} L_{2} \left[ {\mathop \int \limits_{{{\text{all}}}} \left( {y_{0} - f\left( {x,{\upalpha }} \right)} \right)F\left( {x,y} \right)dA_{{\text{i}}} } \right]dA_{{\text{i}}} \\ \end{aligned}$$
(25)

Given the fact that \(A_{i} \in A_{all}\) and \(L_{2} \left( \cdot \right) \ge 0\), we have

$$R_{{{\text{all}}}} \left( {\upalpha } \right) \le \int \left( {y_{0} - f\left( {x,{\upalpha }} \right)} \right)^{2} dF\left( {x,y} \right) = R\left( {\upalpha } \right)$$
(26)

Equation (26) proves that when the regionalization strategy is used, the overall risk of regionalized imputation is less than or equal to the risk of global imputation.

4.4 Localization imputation in GAN with the global Information

We have shown that localization usually has a smaller risk function. In practice, however, we only have globally observed samples. Building on Eq. (26), the problem we need to solve is fitting \(F\left( {x,y{|}x_{0} } \right)\), which is a conditional-probability simulation problem.

As is well known, GANs are not good at conditional probability simulation (for example, the classical CGAN can only take a one-dimensional condition such as a label [43]), while the condition in the imputation design is a high-dimensional set of observed samples. We next support our design through a theoretical discussion of how to solve the difficulty of fitting \(F\left( {x,y{|}x_{0} } \right)\); we make the following design.

We define the result of global imputation net (GIN) as \(\hat{y} \in Y\):

$$\hat{y} = r\left( x \right) = \int ydF\left( {x,y{|}x = x_{0} } \right)$$
(27)

\(\hat{y}\) is not the optimal solution for the fit, since localization yields a smaller risk function, and an optimization result from the localized risk function would be better. Hence, we discuss how to obtain \(\hat{y}_{{A_{i} }}\) satisfying a conditional density function \(p(y|x_{{A_{i} }} ,x_{A\,j \ne i} )\).

For the local impute net (LIN), to take the global-to-local information into account, we let the input noise be an encoder representation of \(\hat{y}_{{A_{i} }}\). Without loss of generality, take the encoding as an operator T, which is the impute guider model (IGM) in our framework.

$$Z \, = f(\hat{y}_{{A_{i} }} *T)$$
(28)

where Z is the input noise for the local GAN and f can be a linear or nonlinear function. When a nonlinear activation is used, the guider is closer to an encoder. Interestingly, in subsequent heuristic experiments we find that a linear function has better characterization ability. We suspect this is because a linear transformation retains more of the "original information" as a prior, while a nonlinear transformation introduces more disturbance. The relevant results are given in Sect. 5.4.
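As a small illustration of this choice (the operator T realized as a single dense layer and its width are hypothetical), the two variants of the guider transform in Eq. (28) could be written as:

```python
import tensorflow as tf

# Hypothetical guider operator T realized as one fully connected layer.
T_linear = tf.keras.layers.Dense(64, activation=None)     # linear f: keeps the prior largely intact
T_relu = tf.keras.layers.Dense(64, activation="relu")     # nonlinear f: found to perform worse (Sect. 5.4)

def guider_noise(y_hat_local, use_nonlinear=False):
    """Eq. (28): Z = f(y_hat_{A_i} * T), the input noise for the local GAN."""
    return (T_relu if use_nonlinear else T_linear)(y_hat_local)
```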

If the local generator in LIN is considered as another encoder \({\text{T}}^{\prime }\), then we find that

$$\overline{y}_{{A_{i} }} = Z*T^{\prime } = \hat{y}_{{A_{i} }} *T*T^{\prime }$$
(29)

Given the fact that

$$\hat{y}_{{A_{i} }} \sim F(y_{{A_{i} }} |x_{{A_{i} }} ,x_{{A_{j \ne i} }} )$$
(30)

So we finally have

$$\overline{y}_{{A_{i} }} \sim F\left[ {\left( { y_{{A_{i} }} {|}x_{{A_{i} }} ,x_{{A_{j} }} } \right)*T*T^{\prime}} \right]$$
(31)

It is a conditional probability from \({\text{x}}_{{A_{i} }}\) and \({\text{x}}_{{A_{j} }}\) to the localized imputation result \(\overline{y}_{{A_{i} }}\). \(\overline{y}_{{A_{i} }}\) uses the global information \(x_{{A_{j} }}\).

Equation (31) theoretically supports how the proposed method uses the global imputation (GIN) result as a condition to generate the local imputation (LIN). Hence, we have solved the problem of conditional simulation in local imputation.

In summary, we have shown that a GAN can be used to simulate F(x,y) to solve the imputation problem, and that localized imputation theoretically yields better solutions. Finally, the proposed method can use the global information as a condition to complete the localized imputation.

5 Experiments

5.1 Datasets and experimentation details

5.1.1 Datasets

We evaluate our method on both numeric datasets and an image dataset available for missing data imputation tasks: the UCI Machine Learning Repository [28] and MNIST [29]. The UCI repository maintains 559 datasets as a service to the machine learning community. Following [18], we select four real-world datasets (Breast, Spam, Credit, Letter) to evaluate the imputation performance quantitatively, as shown in Table 1. MNIST is a dataset of 70,000 handwritten digit images of size 28 × 28. We use 50,000 images as the training set, 10,000 as the validation set, and the remaining 10,000 as the testing set. Tenfold cross-validation is applied.

Table 1 UCI Datasets used in the experiments

5.1.2 Experimentation details

For our training sets, the values of each numeric dataset and the image dataset are rescaled to [0, 1]. We simulate missingness so that each value is missing independently with the given missing rate. The dropout missing rates in our experiments are set from 10 to 80% with a step of 10%. During training, the GAGIN parameters are initialized with Xavier initialization [30, 37]. The noise vector Z has the same dimension as the input data. The whole network is trained with the Adam optimizer [31] with learning rate 1e-3 and batch size 128, and we stop training after 10 k iterations. We implement our model in the TensorFlow framework [32] using Python 3.7 [38], and our experiments run on an Nvidia RTX 2080Ti GPU [39]. To demonstrate the model's performance fairly, all compared methods are implemented with the same FC architecture, i.e., only fully connected layers for both the generators and the discriminators.
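As a minimal sketch of this setup (the data shape and the dense layer below are hypothetical), the independent-dropout mask and the stated initializer and optimizer could be configured as follows:

```python
import numpy as np
import tensorflow as tf

def mcar_mask(shape, missing_rate, seed=0):
    """Each entry is dropped independently with probability `missing_rate` (MCAR);
    1 marks an observed component, 0 a missing one."""
    rng = np.random.default_rng(seed)
    return (rng.random(shape) >= missing_rate).astype(np.float32)

mask = mcar_mask((128, 784), missing_rate=0.5)             # e.g., 50% dropout on flattened MNIST

xavier = tf.keras.initializers.GlorotUniform()             # Xavier initialization [30, 37]
dense = tf.keras.layers.Dense(256, kernel_initializer=xavier)
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3)   # settings from Sect. 5.1.2
```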

5.2 Evaluation metrics

To be fair to all methods, we use unified evaluation metrics to quantitatively analyze the results of all missing data imputation methods. Inspired by [16,17,18], we choose the root mean square error (RMSE) [33, 36] and the Frechet inception distance (FID) [34, 35] as the evaluation metrics to compare against the state-of-the-art imputation methods.

The RMSE between the real data and imputed data of missing data imputation is defined as:

$${\text{RMSE}} = \sqrt {\frac{1}{N}\mathop \sum \limits_{i = 1}^{N} \left( {{\text{real}}_{{\text{i}}} - {\text{imputed}}_{{\text{i}}} } \right)^{2} }$$
(32)

where N represents the number of samples.

The FID is a measure of the similarity between the generated images and real ones, which is defined as below:

$${\text{FID}}\left( {i,r} \right) = \left| {\left| {\mu_{{\text{i}}} - \mu_{{\text{r}}} } \right|} \right|_{2}^{2} + Tr\left( {\sum_{{\text{i}}} + \sum_{{\text{r}}} - 2\left( {\sum_{{\text{i}}} \sum_{{\text{r}}} } \right)^{\frac{1}{2}} } \right)$$
(33)

where \(\left( {i,r} \right)\) denote the imputed image and the real image, \(\mu_{{\text{i}}}\) and \(\sum_{{\text{i}}}\) are the mean vector and covariance matrix of the imputed images' feature vectors, and likewise for \(\mu_{{\text{r}}}\) and \(\sum_{{\text{r}}}\). \({\text{Tr}}\) denotes the trace of a matrix.
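A minimal sketch of Eqs. (32) and (33) follows; the feature extraction step that usually precedes FID (an Inception network) is abstracted away here, and the arrays of values and feature vectors are hypothetical inputs.

```python
import numpy as np
from scipy.linalg import sqrtm

def rmse(real, imputed):
    """Eq. (32): root mean square error over N samples."""
    return np.sqrt(np.mean((real - imputed) ** 2))

def fid(feat_imputed, feat_real):
    """Eq. (33): Frechet inception distance between two sets of feature vectors
    (each array has shape [num_samples, feature_dim])."""
    mu_i, mu_r = feat_imputed.mean(axis=0), feat_real.mean(axis=0)
    cov_i = np.cov(feat_imputed, rowvar=False)
    cov_r = np.cov(feat_real, rowvar=False)
    cov_sqrt = sqrtm(cov_i @ cov_r)
    if np.iscomplexobj(cov_sqrt):            # numerical noise can introduce tiny imaginary parts
        cov_sqrt = cov_sqrt.real
    return np.sum((mu_i - mu_r) ** 2) + np.trace(cov_i + cov_r - 2.0 * cov_sqrt)
```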

5.3 Performance comparison on UCI and MNIST missing dataset

In the experiments, GAGIN is compared against 10 methods, including three state-of-the-art methods (GAIN, MisGAN, GAMIN) and seven traditional methods (zero imputation, mean imputation, MICE, MissForest, Matrix, EM, KNN). For fairness, for all compared methods we either use the authors' released code directly or reimplement the method according to the authors' description. Before starting the experiments, we standardize the input datasets.

5.3.1 Evaluation on various missing rates and dimensions for UCI missing dataset

Table 2 shows the comparison between the other imputation methods and our proposed GAGIN (last row). We conduct two experiments, in which we vary the missing rate and the number of dimensions on the Credit dataset. Figure 4 shows the RMSE performance for missing rates from 10 to 80% with a step of 10%. The blue columns show the traditional imputation methods, while the green polylines represent the GAN-based imputation methods. Even though the RMSE of every algorithm increases as the missing rate grows, GAGIN consistently outperforms the benchmarks across the entire range of missing rates, since it captures the global information and learns the local relationship of the unobserved values under the guidance of the IGM. Moreover, our method imputes the missing data with more accurate values.

Table 2 Comparison of the RMSE of the different methods on the UCI datasets with a 20% missing rate (Average ± Std of RMSE)
Fig. 4

Comparison of the RMSE performance by the different methods on Credit dataset for various missing rates from 10 to 80% with a step of 10%. The smaller the RMSE, the better the results

We also investigate the influence of dimensionality on the different methods, as shown in Fig. 5. The RMSE decreases as the number of dimensions increases. The red line shows the superior RMSE of our method across the different dimensionalities. We conclude that the proposed GAGIN is also robust to the number of dimensions.

Fig. 5

Comparison of the RMSE performance by the different methods on the Credit dataset for various dimension numbers

5.3.2 Quantitative evaluation on MNIST missing dataset

The comparison of evaluation metrics is shown in Table 3, with our method's scores in the last column. In terms of RMSE, GAGIN surpasses the second-best method by 24.1% on MNIST with a 50% missing rate. Moreover, our method achieves the lowest FID, 33.2% lower than the second-best method. To show the comparison intuitively, we compute the FID scores of all methods under independent dropout with missing rates from 10 to 80%. The FID results are illustrated in Fig. 6. The blue polylines show the traditional imputation methods, while the green polylines represent the GAN-based imputation methods. In all cases, our proposed method clearly achieves the best imputation FID.

Table 3 Comparison of the evaluation metrics of the different methods on MNIST dataset with 50% missing rate
Fig. 6

Comparison of the FID score by the different methods trained on MNIST dataset for various missing rates from 10 to 80% with a step of 10%. The smaller the FID, the better the results

When the missing rate is low (10%-60%), more of the overall distribution is retained, and the GAN-based methods perform well because they simulate the global information well. However, as the missing rate increases, less global information remains, and the traditional algorithms are strongly affected (the neighborhood of each observation contains more missing points). When the missing rate exceeds 60%, the RMSE rises sharply. Nevertheless, our method still performs best, as the global-to-local strategy gives a better solution and optimizes the details. This observation is also supported by the subsequent t test experiment. Consequently, GAGIN outperforms all the other methods on missing data imputation.

5.3.3 Qualitative comparison on MNIST missing dataset

Figure 7 shows the imputation results generated by the different methods on the MNIST dataset with a 50% missing rate. Columns (c) to (e) are the traditional methods, i.e., zero imputation, mean imputation and matrix-based imputation, while columns (f) to (h) are the recent GAN-based imputation methods GAIN, MisGAN and GAMIN. Intuitively, zero imputation and mean imputation cannot produce useful imputations. The matrix-based imputation generates blurred images with star-like artifacts. Furthermore, the images imputed by the GAN-based methods have unclear and insufficiently detailed boundaries. It is apparent from Fig. 7 that the imputation results of the proposed method show the clearest and most precise boundaries in visual quality.

Fig. 7

Imputation results of 50% dropout missing: a groundtruth, b missing mask of which missing components are colored black, c impute with zero value, d impute with mean value, e impute with Matrix algorithm, f GAIN based imputation, g MisGAN based imputation, h GAMIN based imputation, i Ours GAGIN based imputation. The c to e and f to h columns respectively show the traditional methods and GAN based methods for imputation

5.4 Ablation analysis

5.4.1 Linear or nonlinear GIN

In our framework, we use a linear/nonlinear module after the IGM. Here we heuristically choose two nonlinear functions as the kernel transform after the guider and compare them with a linear transform; the results are listed in the last three columns of Tables 2 and 3. Interestingly, although adding a nonlinear kernel is generally considered to increase the representation ability of a module, in our study the nonlinear functions perform worse. This is because the guider obtains information from global random variables as a prior to generate the local part. To avoid potential overfitting, a permutation is sufficient to introduce some uncertainty/variance while preserving a solid global-to-local relationship. Nonlinear functions strongly disturb this, especially for numeric data, and excessively weaken the prior passed by the guider. On the other hand, this confirms the effectiveness of the prior conduction through the guider that we designed. In summary, this part of the work is enlightening: we do not rule out that a nonlinear function offering both better representation and more prior information may be found in future research. Therefore, we retain this module in the final model design.

5.4.2 Effectiveness of local-impute-net (LIN)

As mentioned, the goal of our LIN is to optimize the local boundary details and refine the inadequately imputed area. To validate the effectiveness of LIN, we remove LIN entirely and directly concatenate the outputs of IGM and GIN as the final results. The RMSE with LIN (row GAGIN) and without LIN (denoted "w/o LIN") is reported in Table 4. The proposed GAGIN with LIN works better than the variant without LIN. In addition, even without LIN, our method achieves the best performance compared with the other state-of-the-art and traditional methods in Table 3, which also highlights the effectiveness of our other subnetworks (i.e., GIN and IGM). Figure 8 shows the visual results generated by our method without LIN and with LIN (denoted "w/ LIN"). In terms of visual quality, (d) is more concrete than (c), illustrating that LIN refines the boundary details and effectively improves the final imputation outputs.

Table 4 Ablation analysis of our proposed GAGIN using FID and RMSE
Fig. 8

Visual comparison of missing data imputation results without and with LIN

5.4.3 Analysis of impute guider model (IGM)

To explore the importance of the IGM (i.e., the input \(\hat{X}\), \(f_{{{\text{search}}}}^{{{\text{local}}}}\) and \(f_{{{\text{FC}}}}\)), we conduct experiments with four different instances, considering (1) why \(\hat{X}\) generated by the GIN is fed into the IGM as input; (2) how to choose the candidate local region for the impute guider model; and (3) how to dig out the inter-imputation relationship from the extracted region. The corresponding ablation studies on the input \(\hat{X}\), \(f_{{{\text{search}}}}^{{{\text{local}}}}\) and \(f_{{{\text{FC}}}}\) are shown in the effectiveness-of-IGM part of Table 4. To investigate the importance of the GIN's information, we first replace the input with random noise (denoted "w/o \(\hat{X}\)"). To verify the effectiveness of the impute guider model, we obtain the RMSE and FID results without \(f_{{{\text{search}}}}^{{{\text{local}}}}\) and \(f_{{{\text{FC}}}}\) (denoted "w/o \(f_{{{\text{search}}}}^{{{\text{local}}}}\) + w/o \(f_{{{\text{FC}}}}\)") by replacing these modules with the corresponding fully connected layers. For the local region choice, the impute guider model is equipped only with \(f_{{{\text{FC}}}}\) (denoted "w/ \(f_{{{\text{FC}}}}\)"). For digging out the relationship, we add only \(f_{{{\text{search}}}}^{{{\text{local}}}}\) (denoted "w/ \(f_{{{\text{search}}}}^{{{\text{local}}}}\)"). Compared with the basic model (w/o \(f_{{{\text{search}}}}^{{{\text{local}}}}\) + w/o \(f_{{{\text{FC}}}}\)), our IGM decreases the FID by up to 1.944 and reduces the RMSE by up to 0.0722. Hence, \(\hat{X}\), \(f_{{{\text{search}}}}^{{{\text{local}}}}\) and \(f_{{{\text{FC}}}}\) significantly improve the results.

5.4.4 Statistical T test to ensure the superiority of GAGIN

The t test is a common statistical method for comparing two independent samples. To assess whether the imputed data and the real data are similar, we use the t test to quantify the significance of the difference between the two samples in missing data imputation. The test results for missing rates of 10–80% on the MNIST dataset are presented in Table 5. From the table, we see that as the missing rate increases, the p value gets smaller and the t value gets larger. When the p value > 0.05, we accept the null hypothesis and consider the difference between the two samples to be insignificant, and vice versa. When the missing rate is between 10 and 60%, the p value is greater than 0.05, indicating that the imputed data are similar to the real data, and the lower the missing rate, the more similar they are. When the missing rate exceeds 70%, the p value < 0.05, and we consider that there is a certain gap between the generated sample and the real sample. The t test further confirms the statistical significance of our proposed method.
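A minimal sketch of this check follows; the flattened arrays of real and imputed values are hypothetical inputs, and the two-sample t test is computed with SciPy.

```python
import numpy as np
from scipy import stats

def imputation_ttest(real_values, imputed_values, alpha=0.05):
    """Two-sample t test between real and imputed values; p > alpha means the
    difference between the two samples is not significant."""
    t_value, p_value = stats.ttest_ind(real_values, imputed_values)
    return t_value, p_value, p_value > alpha

rng = np.random.default_rng(0)
real = rng.normal(size=1000)
imputed = real + rng.normal(scale=0.1, size=1000)  # hypothetical imputed sample
print(imputation_ttest(real, imputed))
```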

Table 5 Statistical t-test of our proposed GAGIN using t-value and p-value

6 Conclusions

In this paper, we propose a novel generative adversarial guider imputation network (GAGIN) for missing data imputation. To address the interference of local clutter and inaccurate boundary details in imputation, we design a Global-Impute-Net (GIN), a Local-Impute-Net (LIN) and an Impute Guider Model (IGM).

After the GIN generates and imputes the data as a whole, the LIN captures and refines the local details under the guidance of the IGM. Comprehensive experiments indicate that the proposed method is superior for missing data imputation. However, the method still needs improvement for fully realistic settings. Future work will investigate the performance of GAGIN on other missing data mechanisms (MAR, MNAR). Furthermore, we plan to add an additional absolute guide imputation to enhance the performance of our method.