1 Introduction

Missing data are a significant problem in statistical analysis across virtually all applications, arising from causes ranging from equipment failure to human error. Data play a crucial role in artificial intelligence, and high-quality data directly affect the quality of the knowledge extracted from them. Because data quality is an essential indicator of data value, the large volumes of data collected in real production settings are prone to quality problems. Moreover, a large amount of low-quality data lowers the density of useful information and can bias the final analysis, preventing it from fully leveraging the information inherent in the data. Correctly and efficiently handling missing data is therefore critical [1, 2].

Current state-of-the-art missing data imputation algorithms fall into two main categories: discriminative and generative. Discriminative algorithms include Multiple Imputation by Chained Equations (MICE) [3], MissForest [4], and matrix completion [5]. Generative algorithms include K-Nearest Neighbor (KNN) [6], Expectation Maximization (EM) [7], and machine learning-based algorithms such as Auto-encoders (AE) [8] and GAN [9]. However, current generative algorithms have limitations: they rely on assumptions about the data distribution [10], making them less suitable for datasets that mix categorical and continuous variables.

In earlier statistical studies, sample sizes were often small and the proportion of missing data was low, so researchers typically discarded or manually processed the missing records based on subjective judgment. With the advent of the big data era, however, data dimensions have skyrocketed and the proportion of missing data has grown, making manual processing inefficient. Discarding incomplete records loses a significant amount of information [11] and introduces systematic differences between incomplete and complete observations, so analyzing such data is likely to lead to wrong conclusions. As data volume and dimensionality increase, statistical learning methods are widely used for classification [12, 13], prediction [14, 15], and dimensionality reduction [16]. However, when the input data contain missing values, most statistical learning methods become unusable. Only specialized methods can handle missing values directly, such as decision-tree-based methods (a Cox regression model combined with a decision tree [17], Branch-Exclusive Splits Trees (BEST) [18], and other decision-tree variants [19]) and random-forest-based methods (depth-weighted prevalence for a random forest tree ensemble [20], PhyloMissForest [21], a logistic ridge regression and random forest ensemble model [22], and random forests with assignment of missing entries [23]), but these still have limitations, such as a significant degradation in prediction accuracy.

Imputation of missing data has long been a difficult and active problem. Various computational methods have been proposed to address missing data in fields including medicine [24], image mapping [25], and finance [26, 27]. In a study tackling the missing MRI problem, Sharma et al. used a GAN and designed a multi-input, multi-output network model to fill in the missing information, showing that GAN has good applicability to multimodal problems [28].

Using machine learning methods to handle missing data has become an active research area; however, existing methods still face limitations. Generative Adversarial Imputation Networks (GAIN) have gained attention since their proposal in 2018 [29]. GAIN leverages the capabilities of GAN to learn the original data distribution for imputation, surpassing traditional statistical methods and other machine learning methods in dealing with missing data. However, modeling a single distribution over the entire dataset ignores the class-specific features of the data. To address this, Awan et al. [30] proposed Conditional Generative Adversarial Imputation Networks (CGAIN), which use class-specific distributions for the missing data to produce the best estimate of the missing values.

Additionally, PC-GAIN, an unsupervised missing data imputation method, addresses a shortcoming of GAIN, which overlooks the latent class information reflecting the relationships between samples; PC-GAIN uses this latent class information to further improve imputation accuracy [31]. The Generative Adversarial Guider Imputation Network, proposed in 2022, focuses on unsupervised interpolation to handle locally homogeneous regions, particularly at the boundaries [32]. Wu et al. proposed a method based on the Fuzzy c-Means algorithm and GAIN to exploit information from local samples [33]. Zhao et al. introduced Multiple Generative Adversarial Imputation Networks, an imputation method based on data attributes [34]. Another study uses deep metric learning and MisGAN for multi-task missing data imputation [35]. For continuous missing values in time series, Wang et al. proposed CWGAIN-GP, a Wasserstein GAIN with gradient penalty [36]. However, the discriminator of WGAN-GP usually fails to maintain continuity in the peripheral region of the true sample distribution.

However, the above GAIN-based imputation methods do not resolve the original problems of gradient vanishing and mode collapse in GAN. These limitations can degrade model performance and lead to overfitting, and the imputed data may lack quality and diversity. For example, the generators of PC-GAIN and MisGAN may get stuck in a loop of generating similar samples, resulting in a lack of diversity in the imputation results, while CWGAIN-GP is sensitive to noise and outliers, leading to unstable or inaccurate imputations. This study proposes MGAIN to overcome these challenges. Unlike traditional imputation methods and other popular imputation methods, MGAIN incorporates a least squares loss to address gradient vanishing and uses a dual discriminator to mitigate mode collapse, ensuring high quality and diversity of the output data. The motivation of this paper is therefore to use MGAIN to address the problems of traditional imputation methods and GAN-based approaches, enhancing the quality and diversity of missing data imputation and offering a more reliable and effective solution for real-world data processing tasks.

Missing data are classified mainly into three categories: missing completely at random, missing at random, and missing not at random [37]. Data missing completely at random are missing independently of both observed and unobserved variables and do not introduce bias into the sample. For data missing at random, the probability of missingness is unrelated to the missing values themselves and depends only on the partially observed data. Data missing not at random occur when the missingness is related to the values taken by the incomplete variables. This paper focuses on missing completely at random (MCAR) data.

Fig. 1 Graphical abstract

Fig. 2 Framework diagram of GAN and its variants

Therefore, we followed the approach depicted in Fig. 1. First, we processed the original datasets with four missing rates: 0.2, 0.4, 0.6, and 0.8. The missing data were then preprocessed, and the datasets were imputed and compared using different imputation methods. Finally, conclusions were drawn from the experiments and results. The main contributions of this paper are as follows:

  1. A novel missing data imputation method, MGAIN, is developed to address the issue of missing data and the challenges associated with incomplete data.

  2. A combination of a least squares loss function and a dual discriminator is proposed to overcome mode collapse and gradient vanishing in GAN; we also demonstrate the feasibility of this approach theoretically.

  3. We extensively tested the proposed model on diverse datasets to assess its effectiveness and generalizability. The experimental results demonstrate the model's potential to address the challenges of missing data estimation in various practical applications.

The remainder of this paper is organized as follows. In Section 2, we review GAN and its variants. Section 3 provides a detailed overview of the proposed method. Section 4 presents the theoretical analysis. Section 5 presents the experimental methodology and the experimental results. Finally, Section 6 summarizes the paper.

2 Methodology

In this section, we review the GAN-based model principles and imputation problems on which our work builds. We introduce GAN, Least Squares Generative Adversarial Networks (LSGAN), and Dual Discriminator Generative Adversarial Networks (D2GAN), as shown in Fig. 2. LSGAN and D2GAN are variants of GAN, and we utilize both variants for missing data imputation. To the best of our knowledge, this is the first time that both variants have been applied to missing data.

2.1 GAN

Goodfellow et al. (2014) introduced GAN, which consists of a generator and a discriminator (Fig. 2(a)), both composed of multilayer perceptrons [9]. The generator learns the original data distribution to create artificial (fake) samples that resemble real samples, while the discriminator distinguishes real samples from generated ones. The core of every GAN is that the generator and discriminator play against each other: the two-player game continues until the generator produces samples that the discriminator can no longer determine to be real or fake.

\(x \sim P_{data}\) denotes the original data distribution, and \(z \sim P_{z}\) is noise, generally drawn from a uniform or normal distribution. The generator and discriminator are abbreviated as G and D, respectively. The noise z is fed into G, which produces an output G(z). The input to D then consists of two parts, x and G(z), and D tries to determine whether each input is real or fake. The discriminator therefore solves a binary classification problem, using a Sigmoid function that produces outputs between 0 and 1. The GAN framework corresponds to a minimax two-player game with the following objective function L(D, G):

$$\begin{aligned} \min _{G} \max _{D} L(D, G)=E_{x \sim P_{data}}[\log D(x)]+E_{z \sim P_{z}}[\log (1-D(G(z)))]. \end{aligned}$$
(1)
$$\begin{aligned} L_{G}=E_{z \sim P_{z}}[\log (1-D(G(z)))]. \end{aligned}$$
(2)
$$\begin{aligned} L_{D}=-E_{x \sim P_{data}}[\log D(x)]-E_{z \sim P_{z}}[\log (1-D(G(z)))]. \end{aligned}$$
(3)

where \(L_{G}\) and \(L_{D}\) are the objective functions of G and D, respectively.
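
For concreteness, the following is a minimal sketch, assuming PyTorch, of how the losses in (2) and (3) can be computed for one batch; the network architectures, dimensions, and variable names are illustrative placeholders rather than the configuration used in this paper.

```python
import torch
import torch.nn as nn

# Placeholder generator and discriminator (illustrative shapes only).
G = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 8))               # noise (16-d) -> sample (8-d)
D = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 1), nn.Sigmoid())  # sample -> probability

x = torch.randn(32, 8)   # a batch of stand-in "real" samples
z = torch.rand(32, 16)   # noise z ~ P_z (uniform here)
eps = 1e-8               # numerical stability inside the logarithms

# L_D = -E[log D(x)] - E[log(1 - D(G(z)))], Eq. (3); G is detached when updating D.
d_loss = -(torch.log(D(x) + eps).mean() + torch.log(1 - D(G(z).detach()) + eps).mean())

# L_G = E[log(1 - D(G(z)))], Eq. (2); minimized with respect to G.
g_loss = torch.log(1 - D(G(z)) + eps).mean()
```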

As a powerful deep learning model, GAN has achieved great success in many tasks, but it also has shortcomings and challenges. During training, the generator may fall into mode collapse, producing samples that lack diversity and repeat the same patterns too often, which limits the quality of the generated results. Gradient vanishing is another common problem that hinders convergence and training effectiveness and tends to cause training instability. Overall, while GAN has made significant progress in generative tasks, it still faces challenges and requires further research and improvement to address these issues.

2.2 LSGAN

LSGAN is a variant of GAN proposed in 2017 [38] and shown in Fig. 2(b). The cross-entropy loss function used in the original GAN leads to gradient vanishing; to solve this problem, LSGAN adopts a least squares loss function. LSGAN has two benefits over the conventional GAN: first, it can generate higher-quality images; second, its learning process is more stable. The objective functions of LSGAN are as follows:

$$\begin{aligned} \min _{D} L_{D}=\frac{1}{2} E_{x \sim P_{data}}\left[ (D(x)-b)^{2}\right] +\frac{1}{2} E_{z \sim P_{z}}\left[ (D(G(z))-a)^{2}\right] . \end{aligned}$$
(4)
$$\begin{aligned} \min _{G} L_{G}=\frac{1}{2} E_{z \sim P_z}\left[ (D(G(z))-c)^{2}\right] . \end{aligned}$$
(5)

where \(L_{D}\) and \(L_{G}\) are the objective functions of D and G, respectively. a and b are the labels of the generated data and the real data, respectively, and c denotes the value that the generator wants the discriminator to assign to the generated data. In this paper, \(a=0\) and \(b=c=1\).
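
As a companion to the sketch above, the least squares losses in (4) and (5) can be written as follows; again, the networks and data are placeholders, and \(a=0\), \(b=c=1\) as in the text.

```python
import torch
import torch.nn as nn

# Placeholder networks; the least squares discriminator outputs a raw score (no Sigmoid).
G = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 8))
D = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 1))

x = torch.randn(32, 8)   # stand-in "real" batch
z = torch.rand(32, 16)   # noise z ~ P_z
a, b, c = 0.0, 1.0, 1.0

d_loss = 0.5 * ((D(x) - b) ** 2).mean() + 0.5 * ((D(G(z).detach()) - a) ** 2).mean()  # Eq. (4)
g_loss = 0.5 * ((D(G(z)) - c) ** 2).mean()                                            # Eq. (5)
```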

Compared with the traditional GAN, LSGAN is more stable during training and avoids the training instability that is common in traditional GAN. LSGAN has a positive impact on both the generator and the discriminator, improving the overall training effect and the quality of the generated results. However, LSGAN still suffers from mode collapse, resulting in a lack of diversity in the generated results. On the whole, LSGAN has advantages over the traditional GAN in terms of stability and quality of the generated results, but it also has disadvantages, which must be weighed according to the specific task and dataset. For some generation tasks, LSGAN may be a valid choice.

2.3 D2GAN

D2GAN [39], proposed by Nguyen et al. (2017), differs from the original GAN in that it contains one generator and two discriminators, as shown in Fig. 2(c). It combines the Kullback-Leibler (KL) divergence and the reverse KL divergence into a unified objective function, exploiting the complementary statistical properties of these divergences to spread the estimated density effectively and alleviate the mode collapse problem. The objective function \(L\left( G, D_{1}, D_{2}\right) \) of D2GAN can be expressed as follows:

$$\begin{aligned} \begin{aligned} \min _{G} \max _{D_{1}, D_{2}} L\left( G, D_{1}, D_{2}\right) =&\alpha \times E_{x \sim P_{data}}\left[ \log D_{1}(x)\right] +E_{z \sim P_{z}}\left[ -D_{1}(G(z))\right] \\ &+E_{x \sim P_{data}}\left[ -D_{2}(x)\right] +\beta \times E_{z \sim P_z}\left[ \log D_{2}(G(z))\right] \end{aligned}. \end{aligned}$$
(6)

where \(D_{1}\), \(D_{2}\) and G denote discriminator 1, discriminator 2, and generator, respectively. \(\alpha \) and \(\beta \) are hyperparameters, \( 0<\alpha \le 1 \), \( 0<\beta \le 1 \). The role of \(\alpha \) and \(\beta \) is to control the effect of KL divergence and inverse KL divergence on the optimization problem.

D2GAN can efficiently learn multimodal data distributions through the three-way confrontation between the generator and the two discriminators. \(D_{1}\) gives high scores to samples from the original data distribution \(P_{data}\), whereas \(D_{2}\) gives high scores to samples from the generated distribution \(P_{G}\), and vice versa. \(D_{1}\) and \(D_{2}\) do not share their parameters.
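
A minimal sketch of the objective in (6), under the same placeholder assumptions as the previous snippets (Softplus outputs keep the discriminator scores positive so the logarithms are defined); \(\alpha \) and \(\beta \) are set arbitrarily here.

```python
import torch
import torch.nn as nn

# Placeholder networks; D1 and D2 output positive scores via Softplus.
G  = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 8))
D1 = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 1), nn.Softplus())
D2 = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 1), nn.Softplus())

x = torch.randn(32, 8)    # stand-in "real" batch
z = torch.rand(32, 16)    # noise z ~ P_z
alpha, beta, eps = 0.5, 1.0, 1e-8

x_fake = G(z)
# D1 and D2 maximize L(G, D1, D2) of Eq. (6); training minimizes the negated value.
L = (alpha * torch.log(D1(x) + eps).mean() - D1(x_fake.detach()).mean()
     - D2(x).mean() + beta * torch.log(D2(x_fake.detach()) + eps).mean())
d_loss = -L
# G minimizes the terms of (6) that depend on it.
g_loss = -D1(x_fake).mean() + beta * torch.log(D2(x_fake) + eps).mean()
```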

By introducing two discriminators, D2GAN can evaluate the realism of the generated samples more effectively, encouraging the generator to produce more realistic and higher-quality samples. Moreover, D2GAN reduces the risk of mode collapse and prevents the generator from falling into local optima in which it generates repeated samples. However, D2GAN still faces difficulties such as gradient vanishing and training instability. Overall, D2GAN improves the performance of the generator and yields higher-quality generated samples, but it also faces challenges such as increased training complexity.

The current GAN and its variants suffer from problems such as gradient vanishing, mode collapse, and training instability, which limit their performance and applications. We propose a new GAN variant to address these issues, improving model stability and generation quality, and apply it to missing data. Applying MGAIN to missing data can effectively improve data generation and increase the model's ability to handle missing data. This GAN variant introduces new ideas and methods for addressing missing and incomplete data in practical problems and has broad application prospects and research value.

3 Proposed method

GAIN is a noteworthy GAN-based method for dealing with missing data. Yoon et al. treat each value in the incomplete data, whether missing or not, as a category label that constitutes the missing mask and, combining this with the Conditional GAN (CGAN), propose the GAIN model to impute missing data; their experiments show that the method outperforms traditional imputation methods. However, GAIN suffers from the common problems of GAN, i.e., gradient vanishing and mode collapse. To solve these problems, a new GAN-based imputation method is proposed in this paper, as shown in Fig. 3. Its structure and theory are described in detail below.

Fig. 3 Framework of MGAIN

3.1 Inputs of the model

First, define the original data as \(X=\left( X_{1},X_{2},X_{3},...,X_{n}\right) \). X is a random variable in the n-dimensional space \(\mathbb {X}=\left( \mathbb {X}_{1}\times \mathbb {X}_{2}\times \mathbb {X}_{3}\times ...\times \mathbb {X}_{n}\right) \), where n denotes the dimension of the data. Define M to be the mask vector, \(M=\left( M_{1},M_{2},M_{3},...,M_{n}\right) \), which takes values in \(\{0,1\}^n\). Define a new random variable \(\tilde{X}=\left( \tilde{X}_{1},\tilde{X}_{2},\tilde{X}_{3},...,\tilde{X}_{n}\right) \), where \(\tilde{X}\) is a random variable in the n-dimensional space \(\tilde{\mathbb {X}}=\left( \tilde{\mathbb {X}}_{1}\times \tilde{\mathbb {X}}_{2}\times \tilde{\mathbb {X}}_{3}\times ...\times \tilde{\mathbb {X}}_{n}\right) \), obtained from the following equation:

$$\begin{aligned} \tilde{X}_{i}= {\left\{ \begin{array}{ll} X_{i}, if M_{i} =1\\ *, otherwise \end{array}\right. }. \end{aligned}$$
(7)

where \(M_{i} =1\) means that the ith component is observed (not missing); otherwise, it is missing.

The goal of imputation is to estimate the missing values in \(\tilde{X}\), i.e., to generate samples according to the conditional distribution of X given \(\tilde{X}\), \(P \left( X|\tilde{X}=\tilde{X}_{i}\right) \).
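
As a concrete illustration, the following sketch (an assumed NumPy helper, not code from the paper) builds an MCAR mask M and the partially observed matrix \(\tilde{X}\) of (7), with NaN standing in for the placeholder \(*\).

```python
import numpy as np

def make_mcar(X: np.ndarray, missing_rate: float, seed: int = 0):
    """Build an MCAR mask M (1 = observed, 0 = missing) and X_tilde as in Eq. (7)."""
    rng = np.random.default_rng(seed)
    M = (rng.uniform(size=X.shape) > missing_rate).astype(float)
    X_tilde = np.where(M == 1, X, np.nan)   # np.nan plays the role of "*"
    return X_tilde, M

X = np.random.rand(100, 8)                  # toy complete data: 100 samples, 8 features
X_tilde, M = make_mcar(X, missing_rate=0.2)
```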

3.2 Generator

From Fig. 3 we can see that the input to the generator consists of the trio of the (partially observed) data matrix \(\tilde{X}\), the mask matrix M, and the random matrix Z (i.e., noise). Similar to the data and mask matrices, the random matrix is an n-dimensional vector \(Z=\left( Z_{1},Z_{2},Z_{3},...,Z_{n}\right) \).

The generator G is equivalent to a mapping function \(\tilde{\mathbb {X}}\times \{0,1\}^n\times [0,1]^n \rightarrow \mathbb {X}\). Now define two random variables:

$$\begin{aligned} \begin{aligned} \bar{X}&=G\left( \tilde{X},M,\left( 1-M\right) \odot Z\right) \\ \hat{X}&=M\odot \tilde{X} + \left( 1-M\right) \odot \bar{X}. \end{aligned} \end{aligned}$$
(8)

where \(\odot \) denotes element-wise multiplication, \(\bar{X}\) denotes the missing-value portion computed by the generator, and \(\hat{X}\) is composed of the two variables \(\tilde{X}\) and \(\bar{X}\). That is, the missing entries are filled with the values \(\bar{X}\) inferred by the generator, while the remaining, non-missing entries consist of the original observations.
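
The combination in (8) can be sketched as follows (an assumed helper continuing the NumPy example above; G_out stands in for the generator output \(G(\tilde{X},M,(1-M)\odot Z)\)).

```python
import numpy as np

def impute(X_tilde: np.ndarray, M: np.ndarray, G_out: np.ndarray) -> np.ndarray:
    """Eq. (8): keep observed entries, fill missing ones with the generator output."""
    X_obs = np.nan_to_num(X_tilde)        # replace the "*" (NaN) placeholders before mixing
    return M * X_obs + (1 - M) * G_out    # X_hat
```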

3.3 Discriminator

Unlike other GAN-based methods for missing data, the method proposed in this paper uses two discriminators. GAIN, for example, has only one discriminator but two generators (a generator and a hint generator), which gives rise to the mode collapse problem; to mitigate this problem, a dual-discriminator approach is used in this study.

When dealing with missing data, the discriminator of the GAN does not judge whether the whole input is real (output 0 or 1), but rather determines which components are observed and which are generated; this is equivalent to predicting the mask vector. Similarly to the generator, the discriminator D is a mapping function \(\mathbb {X}\rightarrow [0,1]^n\). The ith component of \(D\left( \hat{x} \right) \) is the probability that the ith element of \(\hat{x}\) is observed.

3.4 Hint generator

Similarly to the hint mechanism of Yoon et al., our approach includes a hint mechanism. The hint is represented as a random variable H taking values in the hint space \(\mathbb {H}\). The hint vector supports D by revealing which components are observed, allowing D to concentrate on deciding whether the remaining components are observed or imputed. With the hint, the discriminator becomes a mapping \(\mathbb {X}\times \mathbb {H}\rightarrow [0,1]^n\). The hint mechanism is necessary because G can reproduce several distributions for which D cannot distinguish real values from imputed ones; providing H to D restricts the solution to a single distribution. H is obtained using (9).

$$\begin{aligned} H=B\odot M+0.5\odot \left( 1-B \right) . \end{aligned}$$
(9)

where \(B\in \{0,1\}^n\) is the random variable obtained by uniformly sampling k from \(\{1,2,...,n\}\) and applying (10).

$$\begin{aligned} B_{i}= {\left\{ \begin{array}{ll} 1, if\;i\not = k\\ 0, if\;i= k \end{array}\right. }. \end{aligned}$$
(10)
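
The hint construction in (9) and (10) can be sketched as follows (an assumed NumPy helper with 0-based indexing; names and shapes are illustrative).

```python
import numpy as np

def make_hint(M: np.ndarray, seed: int = 0) -> np.ndarray:
    """Eq. (9)-(10): reveal the mask except at one component k per sample, where H = 0.5."""
    rng = np.random.default_rng(seed)
    n_samples, n_dims = M.shape
    B = np.ones_like(M)
    k = rng.integers(0, n_dims, size=n_samples)   # one hidden component per sample (0-based)
    B[np.arange(n_samples), k] = 0                # B_i = 0 where i == k, else 1 (Eq. 10)
    return B * M + 0.5 * (1 - B)                  # Eq. (9)
```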

3.5 Objective function

Inspired by GAN and its variants as well as GAIN, the objective function of our MGAIN method has two parts. First, we train \(D_{1}\) and \(D_{2}\) to maximize the probability of correctly predicting the mask M; second, we train G to minimize the probability that the discriminators predict M correctly. The overall objective function and loss function of the MGAIN method are given in (11) and (12).

$$\begin{aligned} \min _{G} \max _{D_{1}, D_{2}} V\left( G, D_{1}, D_{2}\right) . \end{aligned}$$
(11)
$$\begin{aligned} \begin{aligned} V\left( G, D_{1}, D_{2}\right) =&\frac{\alpha }{2} \times E_{x \sim P_{data}}\left[ \ M \left( D_{1}(\hat{X},H) -1\right) ^2 \right] \\+&\frac{1}{2}E_{z \sim P_{z}}\left[ (1-M)\left( D_{1}(\hat{X},H)\right) ^2\right] \\+&\frac{1}{2}\times E_{x \sim P_{data}}\left[ \ M \left( D_{2}(\hat{X},H) -1\right) ^2 \right] \\+&\frac{\beta }{2} \times E_{z \sim P_z}\left[ \left( 1-M\right) \left( D_{2}(\hat{X},H) \right) ^2\right] \end{aligned}. \end{aligned}$$
(12)

where \(\alpha \) and \(\beta \) are hyperparameters, \(0<\alpha , \beta \le 1\), whose role is to control the effect of minimizing the loss on the optimization problem. \(D_{1}\) and \(D_{2}\) represent discriminator 1 and discriminator 2, respectively. As in D2GAN, \(D_{1}\) gives high scores to data from the original data distribution \(P_{data}\), whereas \(D_{2}\) gives high scores to data from the generated data distribution \(P_{G}\), and vice versa. \(D_{1}\) and \(D_{2}\) do not share their parameters.
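
A minimal sketch of the discriminator objective \(V(G,D_{1},D_{2})\) in (12), assuming PyTorch; the discriminator architectures, the concatenation of \(\hat{X}\) and H as their input, and all dimensions are illustrative assumptions rather than the exact configuration used in the paper.

```python
import torch
import torch.nn as nn

n = 8  # illustrative data dimension
D1 = nn.Sequential(nn.Linear(2 * n, 64), nn.ReLU(), nn.Linear(64, n), nn.Sigmoid())
D2 = nn.Sequential(nn.Linear(2 * n, 64), nn.ReLU(), nn.Linear(64, n), nn.Sigmoid())

def objective_V(X_hat, M, H, alpha=0.5, beta=1.0):
    """Eq. (12): least squares terms for both discriminators, weighted by the mask."""
    inp = torch.cat([X_hat, H], dim=1)       # each discriminator sees X_hat together with the hint
    d1, d2 = D1(inp), D2(inp)
    return (alpha / 2 * (M * (d1 - 1) ** 2).mean()
            + 0.5 * ((1 - M) * d1 ** 2).mean()
            + 0.5 * (M * (d2 - 1) ** 2).mean()
            + beta / 2 * ((1 - M) * d2 ** 2).mean())
```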

D2GAN can mitigate mode collapse but cannot avoid gradient vanishing and instability, whereas LSGAN can solve the gradient vanishing problem but hardly avoids mode collapse, i.e., the lack of diversity in the generated samples. Therefore, this paper weighs the advantages and disadvantages of the two and combines D2GAN and LSGAN to propose a new generative adversarial imputation network, a model that not only mitigates mode collapse but also solves the gradient vanishing problem.

As in [29], the loss function of G consists of two parts: the loss on the estimated (imputed) values and the loss on the observed values. Unlike [29], however, the loss function of G involves two discriminators and the least squares loss. The combined loss function \(V_{G}\) is given in (13).

$$\begin{aligned} \begin{aligned} V_{G}=&\frac{1}{2}E_{z \sim P_{z}}\left[ (1-M)\left( D_{1}(\hat{X},H)\right) ^2\right] \\&\qquad \qquad \qquad \qquad +\frac{\beta }{2} E_{z \sim P_z}\left[ M\left( D_{2}(\hat{X},H) \right) ^2\right] \\&+\lambda \sum _{i=1}^{n}m_{i}L_{obs}(x_{i},x_{i}^{'}).\end{aligned}. \end{aligned}$$
(13)

where \(\lambda \) and \(\beta \) are hyperparameters and \(m_{i}\) is an element of the mask matrix M. In this paper, \(\lambda =0.4\). \(L_{obs}(x_{i},x_{i}^{'})\) is defined by (14):

$$\begin{aligned} L_{obs}(x_{i},x_{i}^{'})= {\left\{ \begin{array}{ll} (x_{i}-x_{i}^{'})^2, if\;x_{i}\; is\;continuous\\ -x_{i}log(x_{i}^{'}), if\;x_{i}\;is\;binary \end{array}\right. }. \end{aligned}$$
(14)
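
A minimal sketch of the generator loss \(V_{G}\) in (13) and (14), under the same placeholder assumptions as the previous snippet; here X_obs is the observed data with missing entries zero-filled and is_binary is a boolean tensor marking binary columns (both are illustrative names, not from the paper).

```python
import torch

def generator_loss(X_hat, X_obs, M, H, D1, D2, is_binary, beta=1.0, lam=0.4):
    """Eq. (13): adversarial terms plus the reconstruction loss L_obs of Eq. (14)."""
    inp = torch.cat([X_hat, H], dim=1)
    d1, d2 = D1(inp), D2(inp)
    adv = 0.5 * ((1 - M) * d1 ** 2).mean() + beta / 2 * (M * d2 ** 2).mean()
    eps = 1e-8
    sq = (X_obs - X_hat) ** 2                 # continuous components, top case of Eq. (14)
    ce = -X_obs * torch.log(X_hat + eps)      # binary components, bottom case of Eq. (14)
    L_obs = torch.where(is_binary, ce, sq)
    return adv + lam * (M * L_obs).mean()     # reconstruction applied to observed entries only
```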
Table 1 Datasets

To be concrete, the pseudo-code of the MGAIN algorithm is shown in Algorithm 1.

Algorithm 1 MGAIN

4 Theoretical analysis

We now provide a formal theoretical analysis of the proposed model, showing that when G, \(D_{1}\), and \(D_{2}\) have sufficient capacity, i.e., in the nonparametric limit, G can recover the data distribution at the optimal point by minimizing the divergence between the model and the data distribution. We first consider the optimization problem with respect to (w.r.t.) the discriminators for a fixed generator.

Proposition 1

Minimizing \(V\left( G,D_{1},D_{2}\right) \) for a given generator yields the following closed-form optimal discriminator:

$$\begin{aligned} \begin{aligned} D_{1}^{*}&=\frac{\alpha M P_{data} }{\alpha M P_{data}+(1-M)P_{z}}\\ D_{2}^{*}&=\frac{ M P_{data} }{ M P_{data}+\beta (1-M)P_{z}} \end{aligned} \end{aligned}$$

where \(V(G,D_{1},D_{2})\) is the overall objective function of MGAIN, and \(D_{1}^{*}\) and \(D_{2}^{*}\) are the optimal values of \(D_{1}\) and \(D_{2}\). \(\alpha \) and \(\beta \) are the hyperparameters mentioned in Section 3.5, M is the mask matrix mentioned in Section 3.2, and \(P_{data}\) and \(P_{z}\) represent the original data distribution and the noise distribution, respectively.

Proof

$$\begin{aligned} V\left( G, D_{1}, D_{2}\right) =&\frac{\alpha }{2} E_{x \sim P_{data}}\left[ M \left( D_{1}(\hat{X},H) -1\right) ^2 \right] \\ +&\frac{1}{2}E_{z \sim P_{z}}\left[ (1-M)\left( D_{1}(\hat{X},H)\right) ^2\right] \\ +&\frac{1}{2} E_{x \sim P_{data}}\left[ M \left( D_{2}(\hat{X},H) -1\right) ^2 \right] \\ +&\frac{\beta }{2} E_{z \sim P_z}\left[ \left( 1-M\right) \left( D_{2}(\hat{X},H) \right) ^2\right] \\ =&\int \frac{\alpha }{2}P_{data}(x)\left[ M \left( D_{1}(\hat{X},H) -1\right) ^2 \right] dx\\ +&\int \frac{1}{2}P_{z}(z)\left[ (1-M)\left( D_{1}(\hat{X},H)\right) ^2\right] dz\\ +&\int \frac{1}{2}P_{data}(x)\left[ M \left( D_{2}(\hat{X},H) -1\right) ^2 \right] dx\\ +&\int \frac{\beta }{2}P_{z}(z)\left[ \left( 1-M\right) \left( D_{2}(\hat{X},H) \right) ^2\right] dz \end{aligned}$$

Let \(\frac{\partial V\left( G, D_{1}, D_{2}\right) }{\partial D_{1}}=0\), \(\frac{\partial V\left( G, D_{1}, D_{2}\right) }{\partial D_{2}}=0\), the optimal \(D_{1}\) and \(D_{2}\), that is, \(D_{1}^{*}\) and \(D_{2}^{*}\), are obtained.

Next, fixing \(D_{1}=D_{1}^{*}\) and \(D_{2}=D_{2}^{*}\) and optimizing \(V\left( G, D_{1}^{*}, D_{2}^{*}\right) \) with respect to G yields the optimal generator \(G^{*}\). \(\square \)

5 Experiments

5.1 Datasets and evaluation metrics

5.1.1 Datasets

We tested the proposed MGAIN method on several publicly available real-world datasets provided by the University of California, Irvine (UCI) Machine Learning Repository and on the MNIST database of handwritten digits provided by Yann LeCun. These datasets are listed in Table 1. We compare our method with the state-of-the-art GAIN method and other popular imputation methods, and we evaluate it under different proportions of missing data, ranging from 20% to 80%. In all experiments, the missing entries were removed completely at random (MCAR).

The missing rate (MR) of the data can be expressed by the following equation:

$$\begin{aligned} MR=\frac{Number\;of \; missing\;values}{Total\;number\;of \;samples}. \end{aligned}$$
(15)
Table 2 Main parameters of MGAIN
Table 3 RMSE for different \(\alpha \) and \(\beta \)

5.1.2 Evaluation metrics

The experiments are based on real-world datasets. We use the root mean square error (RMSE) to evaluate imputation performance:

$$\begin{aligned} RMSE=\sqrt{\frac{1}{N}\sum _{i=1}^{N}(y^{i}-\hat{y}^{i})^2}. \end{aligned}$$
(16)

where \(y^{i}\) and \(\hat{y}^{i}\) are the real and generated values, respectively, and N is the number of data points. To evaluate the models, this work compares the RMSE of all models on test data. This is the same evaluation metric used by Yoon et al. [29] and Stefenon et al. [40].
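
For reference, a minimal sketch (an assumed NumPy helper, not the paper's code) that evaluates the RMSE of (16) over the imputed entries only, i.e., the positions where the mask M is 0, which is the usual choice when evaluating imputation.

```python
import numpy as np

def rmse(y_true: np.ndarray, y_imputed: np.ndarray, M: np.ndarray) -> float:
    """RMSE over the originally missing positions (M == 0)."""
    miss = (M == 0)
    return float(np.sqrt(np.mean((y_true[miss] - y_imputed[miss]) ** 2)))
```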

Table 4 RMSE for MR=0.2
Table 5 RMSE for MR=0.4

Prediction performance is evaluated using the Area Under the Receiver Operating Characteristic Curve (AUC). AUC is defined as the area under the ROC curve, where the ROC (receiver operating characteristic) curve plots the TPR on the vertical axis against the FPR on the horizontal axis. TPR is the true positive rate, i.e., the proportion of true positive cases correctly predicted as positive; FPR is the false positive rate, i.e., the proportion of true negative cases incorrectly predicted as positive.

$$\begin{aligned} TPR=\frac{TP}{TP+FN}. \end{aligned}$$
(17)
$$\begin{aligned} FPR=\frac{FP}{FP+TN}. \end{aligned}$$
(18)

where TP, FP, TN, and FN denote the numbers of true positives, false positives, true negatives, and false negatives in the samples, respectively. Higher AUC values correspond to higher model classification accuracy.
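
As an illustration of this evaluation step, the sketch below (with stand-in data and assumed scikit-learn usage) fits a logistic regression classifier, the predictor used in Section 5.4, on imputed data and reports the AUC.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Stand-in "imputed" feature matrices and binary labels (illustrative only).
X_train, y_train = np.random.rand(200, 8), np.random.randint(0, 2, 200)
X_test,  y_test  = np.random.rand(100, 8), np.random.randint(0, 2, 100)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
auc = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])
```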

5.2 Design of experiments

To compare with existing missing value imputation methods, 10 datasets and 8 existing imputation methods were selected to validate the effectiveness of MGAIN. The number of missing entries is determined by the specified missing rate, and these entries are then randomly assigned as missing in the complete data to generate the incomplete data. We performed 5-fold cross-validation in our experiments. Each experiment is repeated ten times, and the average performance is reported.

In Table 2, Batch size is the batch size, epochs is the number of training iterations of the network, hint rate is the proportion used by the hint mechanism, and MR is the missing rate defined in (15). \(\alpha \) and \(\beta \) are the hyperparameters in (12), and \(\lambda \) is the hyperparameter in (13). The Adam optimizer was used during the training of the MGAIN model.

5.3 Quantitative analysis of MGAIN

First, we performed experiments to find appropriate values of \(\alpha \) and \(\beta \), and then compared the RMSE and AUC with those of other missing data imputation methods. This experiment was conducted at a missing rate of 0.2, and the results are shown in Table 3. The best results among all the experiments in this section are highlighted in bold.

According to Table 3, the proposed model performs best at \(\alpha =0.5\), \(\beta =1\). Therefore, \(\alpha =0.5\) and \(\beta =1\) are used in the following comparison experiments with other methods.

Table 6 RMSE for MR=0.6
Table 7 RMSE for MR=0.8

We compared our proposed MGAIN method with the Mean, KNN, MICE, MissForest, EM, Auto-encoder, GAIN, PC-GAIN, and CWGAIN-GP methods. Mean, KNN, MICE, and EM are popular statistical imputation methods, whereas MissForest, Auto-encoder, GAIN, PC-GAIN, and CWGAIN-GP were the best-performing methods in recent studies. We performed experiments at missing rates of 0.2, 0.4, 0.6, and 0.8; the results are shown in Tables 4, 5, 6 and 7.

The proposed MGAIN model exhibited the best performance at MR=0.2. For example, MGAIN reduced the RMSE by up to 0.2029 on the Beijing Air Quality dataset, because EM did not handle missing values effectively on time series data. Although CWGAIN-GP was introduced in 2024 specifically to deal with missing time series data, MGAIN outperformed it, showing that MGAIN can handle tabular data and also has some generalization ability on time series data.

Table 5 presents the RMSE results at MR=0.4. The proposed model outperformed both classical and GAIN-based methods. For example, our model demonstrated superiority on the Spam and BreastCancer datasets, reducing the RMSE by 0.088 and 0.1158, respectively, compared with the worst model. Additionally, the errors of all models were generally larger on the two largest datasets (MNIST and News) than on the other datasets; however, our model performed best on these two datasets, demonstrating the capability of MGAIN to handle large amounts of data.

At MR=0.6, the experimental results demonstrated the effectiveness of the proposed method on data with a high missing rate. First, the RMSE of MGAIN was smaller than that of all other methods; for example, on the News dataset, the RMSE of MGAIN was 0.0049 lower than that of the second-best model, PC-GAIN. Second, as the missing rate increased from 0.2 to 0.6, the increase in MGAIN's RMSE was small: on the Letter dataset it increased by 0.02, from 0.1214 to 0.1414, while MissForest's RMSE increased by 0.1404, from 0.1948 to 0.3352, suggesting that MGAIN handles missing data stably.

We also conducted experiments at MR=0.8; the results are shown in Table 7. MGAIN performed excellently at MR=0.8: on the Beijing Air Quality dataset its RMSE was 0.1776 lower than that of the latest model, CWGAIN-GP, indicating that CWGAIN-GP is unsuitable for handling data with a high missing rate. During the experiments, the running time of PC-GAIN was longer than that of the other models. To better illustrate the superiority of the proposed model, the Spam, Default, and News datasets are visualized as examples in Fig. 4.

As shown in Fig. 4, our model exhibited strong validity and stability at all four missing rates (0.2, 0.4, 0.6, and 0.8). In contrast, the other models fluctuated noticeably as the MR changed, particularly at higher MR, whereas the proposed model was not overly affected by changes in MR. MGAIN showed a lower RMSE than the other models under the different MR, and its RMSE values remained close to one another, indicating a certain stability.

5.4 Prediction performance of MGAIN

We also compared the prediction accuracy on data imputed by the different methods, and MGAIN demonstrated the best prediction accuracy. For this purpose, we chose AUC as the performance metric. For fairness, we used a logistic regression (LR) prediction model on the Default, News, BreastCancer, and Credit datasets with MR of 0.4 and 0.8. The results are shown in Tables 8 and 9.

From Table 8, MGAIN is the optimal method for prediction after missing data imputation, showing the best prediction accuracy. However, the improvement in prediction accuracy was not always significant, even when imputation accuracy improved greatly. For example, on the BreastCancer dataset, the AUC after MGAIN imputation was 0.9943, whereas the AUC after imputation by the second-best model, PC-GAIN, was 0.9913; MGAIN improves the AUC by only 0.003, possibly because the 0.6 of the data that is observed already provides enough information to predict the labels. Therefore, we conducted another comparison of the prediction performance with 0.8 of the data missing, and the results are shown in Table 9.

Fig. 4 Analysis of various MR

Table 8 AUC predicted at MR=0.4
Table 9 AUC predicted at MR=0.8
Fig. 5 The AUC performance with MR=0.4 (a) and MR=0.8 (b)

Table 10 Ablation experiment

The results in Table 9 demonstrate that MGAIN remains effective at a high MR compared with the other models. For instance, on the BreastCancer dataset, the AUC after MGAIN imputation was 0.0068 higher than that after PC-GAIN imputation, a larger gap than at MR=0.4. On the News dataset, MGAIN improved over the worst model, Mean, by 0.4244, and on the Credit dataset, MGAIN improved the AUC over the second-best model, PC-GAIN, by 0.0311.

In contrast, the latest model, CWGAIN-GP, consistently performed poorly, suggesting that it is ineffective at handling data with a high MR, which is consistent with the conclusion drawn in Section 5.3. MGAIN outperformed the other models in imputation performance and also exhibited an advantage in prediction performance. Figure 5 visualizes the results of Tables 8 and 9, showing that the proposed model performed better than the other models, especially on the Default, News, and Credit datasets. The smaller margin on the BreastCancer dataset is attributed to its size of only 569 samples, smaller than the other datasets, which provides less information for the estimation model. Nonetheless, the predictive performance of MGAIN is still better than that of the other models.

5.5 Ablation study

To validate the effectiveness of incorporating the least squares loss and the dual discriminator into the model, we performed ablation experiments; the results are shown in Table 10. This experiment was conducted at MR = 0.2.

Table 10 shows that adding the least squares loss and the dual discriminator is reasonable and practical. On the News dataset, the RMSE of GAIN (with least squares loss) and GAIN (with dual discriminator) was not lower than that of GAIN, whereas the RMSE of MGAIN was lower than that of GAIN, suggesting that adding either strategy alone does not fully address gradient vanishing and mode collapse and thus yields a higher RMSE. LSGAN solves the gradient vanishing problem but not mode collapse, and D2GAN alleviates mode collapse but still faces gradient vanishing, so adding the least squares loss or the dual discriminator alone does not significantly improve the results. This study therefore combines their advantages so that the proposed MGAIN alleviates mode collapse and avoids gradient vanishing simultaneously, leading to better overall performance. As a result, MGAIN achieves high imputation accuracy and good post-imputation prediction accuracy owing to its improved ability to learn from the original incomplete data.

6 Conclusions

The current methods for dealing with missing data are limited, and effective imputation models are lacking, especially for high missing rates. Traditional imputation methods have limitations that can affect model accuracy and stability, and advanced GAN-based imputation methods also face problems such as gradient vanishing and mode collapse. Therefore, this paper takes missing data as the research object and constructs the MGAIN model, introducing the least squares loss and a dual discriminator to solve these problems. In the empirical analysis, MGAIN reduced the RMSE by up to 0.2029 when the missing rate was 0.2 and by up to 0.2166 when the missing rate was 0.8. In addition, we evaluated prediction on the imputed data, and the experimental results demonstrated that MGAIN outperformed traditional and GAN-based imputation methods. For example, on the Credit dataset, MGAIN improved the prediction performance by 13.65% compared with the state-of-the-art CWGAIN-GP model when the missing rate was 0.8. In summary, the proposed model demonstrates excellent superiority and stability in the experimental results, indicating that MGAIN has important application prospects in real-world scenarios with high missing rates and can provide an effective solution for data analysis and prediction tasks.