1 Introduction

The class imbalance problem was originally introduced by Japkowicz [1]. It refers to the classification scenario in which one class is represented by a large number of samples while the other is represented by only a few. Class imbalance problems are pervasive in many real-world applications, such as software defect prediction [2], machinery fault diagnosis [3], and spam filtering [4]. Class imbalance problems include two-class imbalance problems and multi-class imbalance problems. Since most existing solutions for multi-class imbalance problems first use class decomposition schemes to divide a multi-class problem into multiple two-class problems and then conquer each two-class imbalanced subproblem [5, 6], this paper focuses on the two-class imbalance problem. In the two-class imbalance problem, the minority class is also called the positive class, and the majority class is also called the negative class. In the past two decades, many solutions to the two-class imbalance problem have been proposed. SMOTE is the most influential oversampling method [7]; it balances an imbalanced dataset by generating synthetic positive class samples on the line segments between each positive class sample and its k nearest neighbors of the same class. SMOTE and its variants have the following three drawbacks due to the mechanism by which they generate synthetic samples:

  1. the probability distribution of the minority class samples is not considered;

  2. the generated minority class samples lack diversity;

  3. the generated minority class samples overlap heavily when oversampling is performed many times to balance them with the majority class samples.

In order to overcome these three drawbacks, inspired by the idea of the generative adversarial network (GAN) [8], we propose a framework that includes an oversampling method and a two-class imbalanced data classification approach based on classifier fusion via fuzzy integral. The main contributions of this paper are threefold:

  1. We propose an oversampling method based on an improved GAN model. The improvement lies in introducing a regularization term of intra-class divergence into the loss function of the GAN, and in replacing the discriminator of the GAN with a classifier whose output is a vector with three entries: the probabilities that a predicted sample belongs to the majority class, the minority class, or the set of generated samples.

  2. Based on the proposed oversampling method, we propose a two-class imbalanced data classification approach based on classifier fusion via fuzzy integral. The fuzzy integral can well model the interactions among base classifiers that are not independent, since all balanced data subsets used for training the base classifiers include the same oversampled positive class samples. The proposed ensemble approach can enhance the classification accuracy of the positive class samples.

  3. Extensive experiments are conducted to compare the proposed methods with 11 SMOTE related and 4 GAN related state-of-the-art approaches on 5 measures: MMD-score, Silhouette-score, F-measure, G-means, and AUC-area. The experimental results demonstrate that the proposed methods are more effective and efficient than the compared approaches.

The rest of this paper is organized as follows. In Sect. 2, we review the works related to two-class imbalanced data classification. In Sect. 3, we describe the details of the proposed methods. In Sect. 4, the experimental results and analyses are presented. Finally, we conclude our work in Sect. 5.

2 Related works

Many methods have been proposed for addressing two-class imbalanced data classification. These methods can be classified into three categories: data-level methods, algorithm-level methods, and ensemble methods. Considering that this paper focuses on data-level and ensemble methods, we only provide a brief review of algorithm-level methods; a comprehensive review of algorithm-level approaches can be found in [9, 10].

The basic idea of algorithm-level methods is to modify existing classification algorithms to adapt them to the scenario of imbalanced data classification. The most common modification strategy is to introduce a cost-sensitive mechanism into traditional classification algorithms. The pioneering work on cost-sensitive methods for the class imbalance problem was presented by Sun et al. [11]. They introduced cost items into the famous ensemble algorithm AdaBoost and proposed the AdaC algorithm family. Other representative works published in recent years are reported in [12,13,14]. Khan et al. [12] proposed a cost-sensitive deep neural network model which can automatically learn good features [15,16,17,18,19] from imbalanced data by jointly optimizing the class correlation losses and network parameters. Tao et al. [13] proposed a self-adaptive cost weights-based support vector machine (SVM) and a cost-sensitive ensemble approach for imbalanced data classification. Wang et al. [14] proposed a cost-sensitive fuzzy multiple kernel learning method that addresses the imbalance problem by introducing fuzzy memberships to characterize imbalanced data; the method obtained favorable classification performance on imbalanced datasets.

The basic idea of data-level methods is to preprocess the original imbalanced dataset to balance the distribution of samples between the two classes, by undersampling majority samples or oversampling minority samples. Some empirical comparisons demonstrate that oversampling is much more effective than undersampling [2, 20,21,22]. Among the oversampling methods, SMOTE [23] is the most influential approach. Since SMOTE was proposed in 2002, many oversampling approaches have followed. Based on k-means clustering and SMOTE, Douzas et al. [24] proposed an oversampling method which can avoid the generation of noise and effectively overcome imbalances between and within classes. Douzas and Bacao [25] proposed a geometric SMOTE which generates synthetic samples in a geometric region of the input space; the region is a hyper-sphere around each selected minority instance. Maldonado et al. [26] studied the SMOTE oversampling strategy for high-dimensional datasets, and proposed an alternative distance metric for the computation of the neighbors of each minority sample. Susan and Kumar [27] combined undersampling and oversampling, and proposed a three-step intelligent pruning strategy of majority and minority samples for learning from imbalanced datasets. Mathew et al. [28] proposed a weighted kernel-based SMOTE (WKSMOTE) approach, which generates synthetic positive class samples in feature space; WKSMOTE can overcome the limitation of the linear interpolation of SMOTE. Based on WKSMOTE, Raghuwanshi and Shukla [29] proposed a SMOTE based class-specific extreme learning machine, which exploits the benefits of both minority oversampling and class-specific regularization. Pan et al. [30] proposed two oversampling methods: one is an adaptive SMOTE, which improves SMOTE by adaptively selecting groups of inner and danger data from the minority class; the other adopts Gaussian oversampling, which provides a novel division strategy for sampling regions and makes sampling more reasonable. Zhang and Li [31] proposed an approach to balance different class samples by creating synthetic samples through random walks from the real data. Han et al. [32] presented a Gaussian mixture model based combined resampling approach. The resampling approach first determines the number of samples of the majority class and the minority class using a sampling factor; then, to balance the dataset, Gaussian mixture clustering is used for undersampling the majority class samples, and the synthetic minority oversampling technique is used for the rest of the samples. Zhang et al. [33] investigated the classification of high-dimensional class imbalanced datasets and proposed an algorithm to improve the performance of SMOTE by adopting an adaptive oversampling rate. Elreedy and Atiya [21] presented a theoretical and experimental analysis of the SMOTE method; specifically, they explored how faithfully it emulates the underlying density, and analyzed the effect of different factors on generation accuracy, such as the dimension, the size of the training set, and the considered number of neighbors K. Fernández et al. [22] presented a comprehensive survey on SMOTE-based approaches, in which the progress and challenges of SMOTE-based approaches over fifteen years (from 2003 to 2018) are well summarized.

In recent years, the generative adversarial network (GAN) has become a popular research topic in deep learning. Some researchers have used the generation mechanism of GAN to generate synthetic positive class samples for balancing imbalanced datasets. For instance, inspired by the idea of the auxiliary classifier GAN (AC-GAN) [34], Ali-Gombe and Elyan proposed an improved model, the multiple fake class GAN (MFC-GAN) [35], and used it to handle the imbalanced data classification problem. MFC-GAN differs from AC-GAN in that it uses multiple fake classes rather than a single fake class as in AC-GAN. Furthermore, MFC-GAN can preserve the structure of the minority classes by learning the correct data distribution, which is an intriguing property. Douzas and Bacao [36] applied the conditional GAN (cGAN) to binary class imbalanced datasets, where the conditional GAN conditions on the class labels of the imbalanced datasets; the resulting generative model is then used to create artificial data for the minority class. Zheng et al. [37] introduced a gradient penalty into the conditional Wasserstein GAN [38] and proposed a synthetic oversampling approach for imbalanced datasets. Different from these existing methods, the novelty of our proposed method lies in the following three aspects: (a) introducing intra-class divergence as a regularization term in the loss function of the GAN to guarantee the diversity of the synthetic samples; (b) introducing MMD-score and Silhouette-score to measure diversity and separability, both of which have a great influence on the performance of imbalanced data classification; (c) replacing the discriminator of the GAN with a classifier whose output is a vector with three entries: the probabilities that a predicted sample belongs to the majority class, the minority class, or the set of generated samples.

Ensemble methods usually combine data-level and algorithm-level approaches to handle the class imbalance problem. Based on SMOTE combined with an Adaboost SVM ensemble integrated with time weighting (ADASVM-TW), Sun et al. [39] proposed two class-imbalanced dynamic financial distress prediction approaches: one is a simple integration model of SMOTE with ADASVM-TW, and the other is an embedding integration model of SMOTE with ADASVM-TW. González et al. [40] explored the effectiveness of the switching technique for the classification of highly imbalanced problems, and proposed a switching-based ensemble which selects the switched examples based on the nearest enemy distance. Gutiérrez-López et al. [41] also investigated the impact of the switching technique on class imbalance learning, proposed an asymmetric binary label switching algorithm to resist binary imbalance, and presented a theoretical analysis concluding that asymmetric switching binary classifiers offer an intrinsic resistance to imbalance effects. Raghuwanshi and Shukla [42] proposed an ensemble approach using a reduced kernelized weighted extreme learning machine as the base classifier to solve the class imbalance problem effectively. Hsiao et al. [43] proposed a method named MTSbag for class imbalance problems; MTSbag integrates the Mahalanobis–Taguchi system (MTS) and bagging-based ensemble learning to enhance the ability of conventional MTS in handling imbalanced data. Zhai et al. [44] combined an oversampling method with ensemble learning and proposed a MapReduce-based imbalanced large-scale data classification approach; the oversampling method is based on the enemy nearest neighbor. In this paper, we present a classifier fusion approach based on fuzzy integral for imbalanced data classification. Since the fusion method can well model the interactions among the base classifiers by using the fuzzy integral as an ensemble tool, the proposed approach can effectively enhance the generalization performance of the classification algorithm.

3 The proposed framework

In this section, we present the proposed framework for addressing the two-class imbalance problem. The framework includes an oversampling method which is based on an improved GAN model, and a two-class imbalanced data classification approach which is based on classifier fusion via fuzzy integral.

3.1 Oversampling method based on an improved GAN model

GAN is a generative model which consists of two neural networks G and D (see Fig. 1). G is a generator network whose input, denoted by \({\varvec{z}}\), is drawn from a known noise prior distribution \(p_{noise}\), and whose output is denoted by \({\varvec{x}}'\). D is a discriminator network whose input includes the generated data \({\varvec{x}}'\) and the real data \({\varvec{x}}\). The distribution of \({\varvec{x}}\) is denoted by \(p_{data}\), which is unknown, and the distribution of \({\varvec{x}}'\) is denoted by \(p_{gen}\). The output of the discriminator D is a probability which indicates the degree of support that the input comes from \(p_{data}\) rather than from \(p_{gen}\).

Fig. 1 The architecture of the generative adversarial network

Since GAN is a probabilistic generative model, it is natural to use GAN to generate synthetic positive class samples for addressing the two-class imbalanced data classification problem. However, we found that if we only learn the distribution of the positive class samples using GAN, it is easy to incur overlap between the positive and negative class samples. In addition, since GAN is prone to mode collapse, the synthetic positive class samples generated by GAN lack diversity. In this section, we present the proposed oversampling method, based on an improved GAN model, to deal with these two problems.

In the proposed method, we improve the GAN model in two aspects: (1) We replace the discriminator of GAN with a classifier C (see Fig. 2), whose output is \(p_{pos}\) for positive class samples, \(p_{neg}\) for negative class samples, and \(p_{g}\) for samples generated by generator G. In the adversarial training process of generator G and classifier C, we want the samples generated by G to fool the classifier C; namely, when a generated sample is fed to the classifier, we want the output to be close to \(p_{pos}\). Classifier C can not only learn the distribution of samples, but also learn a good classification boundary between the positive and negative classes. (2) We introduce a regularization term of intra-class divergence into the loss function of the GAN, which enhances the diversity of the samples generated by G and helps avoid mode collapse.

Fig. 2 The architecture of the improved generative adversarial network

Let \(S=S^+\cup S^-\), where \(S^+\) and \(S^-\) denote the positive class and negative class respectively, let \(S_{up}^+\) be the oversampled positive class, and let \({\varvec{m}}\) and \({\varvec{m}}'\) be the mean vectors of the positive class samples and the oversampled positive class samples respectively. The loss function of the improved GAN is given by Eq. (1):

$$\begin{aligned} L(G({\varvec{z}}))\nonumber \\= & {} \frac{1}{|S^{+}|}\sum \limits _{{\varvec{x}}\in S^{+}}({\varvec{x}}-{\varvec{m}})({\varvec{x}}-{\varvec{m}})^{\text {T}}\nonumber \\&+\frac{1}{|S_{up}^{+}|}\sum \limits _{G({\varvec{z}})\in S_{up}^{+}}(G({\varvec{z}})-{\varvec{m}}')(G({\varvec{z}})-{\varvec{m}}')^{\text {T}} \end{aligned}$$
(1)
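To make the regularization term concrete, the following is a minimal NumPy sketch of the intra-class divergence in Eq. (1), reading \(({\varvec{x}}-{\varvec{m}})({\varvec{x}}-{\varvec{m}})^{\text {T}}\) as the squared distance of a (row-vector) sample to its class mean; the function names and the use of plain NumPy arrays are our own illustrative choices, not part of the original implementation.

```python
import numpy as np

def intra_class_divergence(samples):
    """Mean squared distance of the samples to their mean vector,
    i.e. one summand of Eq. (1)."""
    m = samples.mean(axis=0)                     # class mean vector
    diff = samples - m                           # (x - m) for every sample
    return np.mean(np.sum(diff * diff, axis=1))  # average of (x - m)(x - m)^T

def divergence_loss(positive_samples, generated_samples):
    """L(G(z)) in Eq. (1): divergence of the real positive class plus
    divergence of the oversampled (generated) positive class."""
    return (intra_class_divergence(positive_samples)
            + intra_class_divergence(generated_samples))
```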

The objective functions of C and G of the improved GAN model are given by Eqs. (2) and (3) respectively.

$$\begin{aligned}&\max \limits _{C}J=J_{1}+ J_{2}+ J_{3} \end{aligned}$$
(2)
$$\begin{aligned}&\max \limits _{G}L=J_{4}+ \lambda L(G({\varvec{z}})) \end{aligned}$$
(3)

where \(\lambda\) is a parameter, and

$$\begin{aligned} J_{1}&=E_{{\varvec{x}}\sim p_{neg}}\log C_{1}({\varvec{x}})+E_{{\varvec{x}}\sim p_{neg}}\log (1-C_{2}({\varvec{x}}))\nonumber \\&\quad + E_{{\varvec{x}}\sim p_{neg}}\log (1-C_{3}({\varvec{x}})) \end{aligned}$$
(4a)
$$\begin{aligned} J_{2}&=E_{{\varvec{x}}\sim p_{pos}}\log C_{2}({\varvec{x}})+E_{{\varvec{x}}\sim p_{pos}}\log (1-C_{1}({\varvec{x}}))\nonumber \\&\quad + E_{{\varvec{x}}\sim p_{pos}}\log (1-C_{3}({\varvec{x}}))\end{aligned}$$
(4b)
$$\begin{aligned} J_{3}&=E_{{\varvec{x}}\sim p_{g}}\log C_{3}({\varvec{x}})+E_{{\varvec{x}}\sim p_{g}}\log (1-C_{1}({\varvec{x}}))\nonumber \\&\quad + E_{{\varvec{x}}\sim p_{g}}\log (1-C_{2}({\varvec{x}})) \end{aligned}$$
(4c)
$$\begin{aligned} J_{4}&=E_{{\varvec{x}}\sim p_{g}}\log C_{2}({\varvec{x}})-E_{{\varvec{x}}\sim p_{g}}\log C_{1}({\varvec{x}})\nonumber \\&\quad - E_{{\varvec{x}}\sim p_{g}}\log C_{3}({\varvec{x}}) \end{aligned}$$
(4d)
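For illustration, the following sketch assembles the classifier and generator objectives of Eqs. (2)–(4d) from the three classifier outputs; the array names (`c_on_neg`, `c_on_pos`, `c_on_gen`) and the NumPy formulation are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def j_term(c_out, true_idx, eps=1e-8):
    """One of J1, J2, J3 in Eqs. (4a)-(4c): reward the true entry of the
    classifier output and penalise the other two entries.
    c_out: array of shape (batch, 3) holding (C1, C2, C3) = (p_neg, p_pos, p_g)."""
    others = [i for i in range(3) if i != true_idx]
    return (np.mean(np.log(c_out[:, true_idx] + eps))
            + sum(np.mean(np.log(1.0 - c_out[:, i] + eps)) for i in others))

def classifier_objective(c_on_neg, c_on_pos, c_on_gen):
    """J = J1 + J2 + J3 in Eq. (2), maximised with respect to classifier C."""
    return j_term(c_on_neg, 0) + j_term(c_on_pos, 1) + j_term(c_on_gen, 2)

def generator_objective(c_on_gen, divergence, lam, eps=1e-8):
    """J4 + lambda * L(G(z)) in Eqs. (3) and (4d), maximised with respect to G."""
    j4 = (np.mean(np.log(c_on_gen[:, 1] + eps))    # push generated samples toward "positive"
          - np.mean(np.log(c_on_gen[:, 0] + eps))  # and away from "negative"
          - np.mean(np.log(c_on_gen[:, 2] + eps))) # and away from "generated"
    return j4 + lam * divergence
```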

In the adversarial learning process, G attempts to generate diverse positive class samples and expects that C will categorize the generated samples into the minority class, while C attempts to correctly classify the positive, negative and generated samples. It can be proved that the optimal C results in the following formula (5).

$$\begin{aligned} L=-KL\left( p_{g}\parallel p_{pos}\right) +H\left( p_{g}, p_{neg}\right) \end{aligned}$$
(5)

where \(KL\left( p_{g}\parallel p_{pos}\right)\) is the KL divergence between \(p_{g}\) and \(p_{pos}\), and \(H\left( p_{g}, p_{neg}\right)\) is the cross entropy between \(p_{g}\) and \(p_{neg}\). In the following, we prove that Eq. (5) holds. Because the intra-class divergence term is not related to the classifier C, for \(C_{i}({\varvec{x}}), 1\le i\le 3\), we can obtain the following equation.

$$\begin{aligned} \begin{aligned} J(C_{1}({\varvec{x}}))&=E_{{\varvec{x}}\sim p_{neg}}\log C_{1}({\varvec{x}})\\&\quad +E_{{\varvec{x}}\sim p_{pos}}\log (1-C_{1}({\varvec{x}}))+E_{{\varvec{x}}\sim p_{g}}\log (1-C_{1}({\varvec{x}}))\\&=\int \left( p_{neg}\log C_{1}({\varvec{x}})+p_{pos}\log (1-C_{1}({\varvec{x}}))\right. \\&\quad \left. +p_{g}\log (1-C_{1}({\varvec{x}}))\right) d{\varvec{x}} \end{aligned} \end{aligned}$$

Taking the partial derivative of the integrand with respect to \(C_{1}({\varvec{x}})\) and setting it equal to zero, we have the following equation.

$$\begin{aligned} \frac{p_{neg}}{C_{1}({\varvec{x}})}-\frac{p_{pos}}{1-C_{1}({\varvec{x}})}-\frac{p_{g}}{1-C_{1}({\varvec{x}})}=0 \end{aligned}$$

Hence,

$$\begin{aligned} C_{1}^{*}({\varvec{x}})=\frac{p_{neg}}{p_{neg}+p_{pos}+p_{g}} \end{aligned}$$

Similarly, we have,

$$\begin{aligned} \begin{aligned}&C_{2}^{*}({\varvec{x}})=\frac{p_{pos}}{p_{neg}+p_{pos}+p_{g}}\\&C_{3}^{*}({\varvec{x}})=\frac{p_{g}}{p_{neg}+p_{pos}+p_{g}} \end{aligned} \end{aligned}$$

Substituting \(C_{1}^{*}({\varvec{x}})\), \(C_{2}^{*}({\varvec{x}})\) and \(C_{3}^{*}({\varvec{x}})\) into L [i.e. (4d)], we have

$$\begin{aligned} \begin{aligned} L&=E_{{\varvec{x}}\sim p_{g}}\log C_{2}^{*}({\varvec{x}})\\&\quad -E_{{\varvec{x}}\sim p_{g}}\log C_{1}^{*}({\varvec{x}})- E_{{\varvec{x}}\sim p_{g}}\log C_{3}^{*}({\varvec{x}})\\&=\int \left( p_{g}\log p_{pos}\right) d{\varvec{x}}-\int \left( p_{g}\log p_{neg}\right) d{\varvec{x}}\\&\quad -\int \left( p_{g}\log p_{g}\right) d{\varvec{x}}\\&=\int \left( p_{g}\log \frac{p_{pos}}{p_{g}}\right) d{\varvec{x}} -\int \left( p_{g}\log p_{neg}\right) d{\varvec{x}}\\&=-KL\left( p_{g}\parallel p_{pos}\right) +H\left( p_{g}, p_{neg}\right) \end{aligned} \end{aligned}$$

Note: (1) For \(KL\left( p_{g}\parallel p_{pos}\right)\), since \(p_{pos}\) is fixed, we want \(p_{g}\) to be as close to \(p_{pos}\) as possible. Note that \(KL(\cdot \parallel \cdot )\) is not symmetric; for different optimization objectives, the results are different (see Fig. 3). Obviously, we should adopt the optimization objective given in Fig. 3b. (2) The cross entropy \(H\left( p_{g}, p_{neg}\right)\) is used to distinguish the generated samples from the negative class samples as much as possible. (3) In some cases, the number of positive class samples is too small to train a model, so we train the model in an incremental iterative mode. The pseudo code of the proposed oversampling algorithm is given in Algorithm 1.

Fig. 3 a Optimization for \(\mathop {argmin}_{q}KL(p\parallel q)\); b optimization for \(\mathop {argmin}_{q}KL(q\parallel p)\)

Algorithm 1 The proposed oversampling algorithm (pseudo code)
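Since Algorithm 1 is only available as a figure in the original, the following is a rough Keras/TensorFlow sketch of the adversarial training loop described above; the model-building helpers (`build_generator`, `build_classifier`), the optimizer choices, and the single-batch update scheme are our own assumptions rather than the authors' exact implementation.

```python
import tensorflow as tf

def oversample_with_gando(S_pos, S_neg, n_iter, n_generate, lam,
                          build_generator, build_classifier, noise_dim=100):
    """Rough sketch of Algorithm 1: adversarially train G and C, then use G
    to generate the synthetic positive class samples S_up^+."""
    G = build_generator()    # maps noise z (noise_dim) -> synthetic sample
    C = build_classifier()   # maps sample -> (p_neg, p_pos, p_g), softmax output
    opt_G, opt_C = tf.keras.optimizers.Adam(), tf.keras.optimizers.Adam()
    eps = 1e-8

    for _ in range(n_iter):
        z = tf.random.normal((len(S_pos), noise_dim))

        # --- update classifier C: maximise J1 + J2 + J3 (Eq. 2) ---
        with tf.GradientTape() as tape:
            gen = G(z, training=True)
            c_neg, c_pos, c_gen = C(S_neg), C(S_pos), C(gen)
            j = (tf.reduce_mean(tf.math.log(c_neg[:, 0] + eps)
                                + tf.math.log(1 - c_neg[:, 1] + eps)
                                + tf.math.log(1 - c_neg[:, 2] + eps))
                 + tf.reduce_mean(tf.math.log(c_pos[:, 1] + eps)
                                  + tf.math.log(1 - c_pos[:, 0] + eps)
                                  + tf.math.log(1 - c_pos[:, 2] + eps))
                 + tf.reduce_mean(tf.math.log(c_gen[:, 2] + eps)
                                  + tf.math.log(1 - c_gen[:, 0] + eps)
                                  + tf.math.log(1 - c_gen[:, 1] + eps)))
            loss_C = -j                                  # maximise j
        opt_C.apply_gradients(zip(tape.gradient(loss_C, C.trainable_variables),
                                  C.trainable_variables))

        # --- update generator G: maximise J4 + lambda * L(G(z)) (Eq. 3) ---
        # (the real-positive part of Eq. (1) is constant w.r.t. G, so only the
        #  divergence of the generated batch is used here)
        with tf.GradientTape() as tape:
            gen = G(z, training=True)
            c_gen = C(gen)
            j4 = tf.reduce_mean(tf.math.log(c_gen[:, 1] + eps)
                                - tf.math.log(c_gen[:, 0] + eps)
                                - tf.math.log(c_gen[:, 2] + eps))
            divergence = tf.reduce_mean(tf.reduce_sum(
                (gen - tf.reduce_mean(gen, axis=0)) ** 2, axis=1))
            loss_G = -(j4 + lam * divergence)
        opt_G.apply_gradients(zip(tape.gradient(loss_G, G.trainable_variables),
                                  G.trainable_variables))

    z = tf.random.normal((n_generate, noise_dim))
    return G(z, training=False).numpy()                  # oversampled positives
```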

3.2 Two-class imbalanced data classification approach based on classifier fusion via fuzzy integral

On the basis of the above oversampling method, we propose a two-class imbalanced data classification approach based on classifier fusion via fuzzy integral [45]. The proposed approach includes the following two stages:

(1) Construct balanced training sets and train base classifiers

In this stage, we first partition \(S^{-}\) into l subsets \(S_{1}^{-}, S_{2}^{-}, \ldots , S_{l}^{-}\), where \(l=\frac{|S^{-}|}{|S_{up}^{+}|}\). Next, we construct l balanced training sets \(S_{i}=S_{i}^{-}\cup S_{up}^{+}\), \(1\le i\le l\). Finally, we train l classifiers \(C=\{C_{1}, C_{2},\ldots , C_{l}\}\) on the l balanced training sets. The l classifiers are fused for imbalanced data classification via fuzzy integral in the next stage.
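This first stage can be summarised by the short sketch below, which partitions the negative class, builds the l balanced training sets, and trains one SVM per set (SVMs are the base classifiers used for the numeric datasets, see Sect. 4.3); the helper name and the use of scikit-learn's `SVC` with `probability=True` are our own illustrative assumptions.

```python
import numpy as np
from sklearn.svm import SVC

def train_base_classifiers(S_neg_X, S_up_pos_X):
    """Stage 1: partition S^- into l subsets, pair each with S_up^+,
    and train one base classifier per balanced training set."""
    l = len(S_neg_X) // len(S_up_pos_X)            # l = |S^-| / |S_up^+|
    neg_subsets = np.array_split(S_neg_X, l)       # S_1^-, ..., S_l^-
    classifiers = []
    for S_i_neg in neg_subsets:
        X = np.vstack([S_i_neg, S_up_pos_X])       # S_i = S_i^- U S_up^+
        y = np.hstack([np.zeros(len(S_i_neg)),     # 0: negative class
                       np.ones(len(S_up_pos_X))])  # 1: positive class
        clf = SVC(probability=True).fit(X, y)      # base classifier C_i
        classifiers.append(clf)
    return classifiers
```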

(2) Fuse the trained base classifiers via fuzzy integral

As a classifier fusion method, the fuzzy integral is distinguished from other fusion methods by an intriguing property: it can well model the interactions among the base classifiers, including positive and negative interactions. This is the reason why we select the fuzzy integral to fuse the trained base classifiers.

Let \(D=\{({\varvec{x}}_i,y_i)\mid {\varvec{x}}_i\in R^{d},y_i\in Y, 1\le i\le n\}\) be a training set, \(Y=\{\omega _{1},\omega _{2},\ldots ,\omega _{k}\}\) be a set of class labels, and \(C=\{C_{1}, C_{2},\ldots , C_{l}\}\) be a set of classifiers trained on D or on subsets of D. For any \({\varvec{x}} \in R^{d}\), the output of classifier \(C_i\) is a k-dimensional vector \((p_{i1}({\varvec{x}}),p_{i2}({\varvec{x}}),\ldots ,p_{ik}({\varvec{x}}))\), where \(p_{ij}({\varvec{x}}) \in [0,1]\) \((1\le i \le l; 1\le j \le k)\) denotes the support degree given by classifier \(C_{i}\) to the hypothesis that \({\varvec{x}}\) comes from class \(\omega _{j}\), and \(\sum _{j=1}^{k}p_{ij}({\varvec{x}})=1\).

Given \(C=\{C_{1}, C_{2},\ldots , C_{l}\}\), \(Y=\{\omega _{1},\omega _{2},\ldots ,\omega _{k}\}\), and an arbitrary test sample \({\varvec{x}}\), the following matrix is called the decision matrix with respect to \({\varvec{x}}\):

$$\begin{aligned} DM({\varvec{x}})= \begin{bmatrix} p_{11}({\varvec{x}}) &{}\cdots &{}p_{1j}({\varvec{x}}) &{}\cdots &{}p_{1k}({\varvec{x}})\\ \vdots &{} &{}\vdots &{} &{}\vdots \\ p_{i1}({\varvec{x}}) &{}\cdots &{}p_{ij}({\varvec{x}}) &{}\cdots &{}p_{ik}({\varvec{x}})\\ \vdots &{} &{}\vdots &{} &{}\vdots \\ p_{l1}({\varvec{x}}) &{}\cdots &{}p_{lj}({\varvec{x}}) &{}\cdots &{}p_{lk}({\varvec{x}}) \end{bmatrix} \end{aligned}$$
(6)

In the matrix \(DM({\varvec{x}})\), the ith row is the output of classifier \(C_{i}\), and the jth column contains the support degrees from classifiers \(C_{1}, C_{2},\ldots ,C_{l}\) for class \(\omega _{j}\).

Let P(C) be the power set of C. A fuzzy measure on C is a set function \(g: P(C) \rightarrow [0, 1]\) which satisfies the following two conditions:

  1. \(g(\varnothing )=0\), \(g(C)=1\);

  2. For all \(A, B \subseteq C\), if \(A \subset B\), then \(g(A)\le g(B)\).

For \(A, B \subseteq C\) with \(A \cap B= \varnothing\), g is called a \(\lambda\)-fuzzy measure if it satisfies the following condition:

$$\begin{aligned} g(A \cup B)=g(A)+g(B)+\lambda g(A)g(B) \end{aligned}$$
(7)

where \(\lambda > -1\) and \(\lambda \ne 0\).

The value of \(\lambda\) can be determined by solving the following Eq. (8).

$$\begin{aligned} \lambda +1=\prod _{i=1}^{l}(1+\lambda g_{i}) \end{aligned}$$
(8)

where \(g_{i}=g(\{C_{i}\})\) is usually determined by the following formula (9) [46]:

$$\begin{aligned} g_i=\frac{p_i}{\sum _{j=1}^{l}p_{j}}\delta . \end{aligned}$$
(9)

where \(\delta \in [0,\, 1]\) and \(p_i\) is the testing accuracy or validation accuracy of classifier \(C_i\) \((1\le i\le l)\).
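As a concrete illustration, the sketch below computes the fuzzy densities \(g_i\) by formula (9) and then solves Eq. (8) for \(\lambda\) numerically; the use of SciPy's `brentq` root finder and the chosen search brackets are our own assumptions, since Eq. (8) only constrains \(\lambda > -1\), \(\lambda \ne 0\).

```python
import numpy as np
from scipy.optimize import brentq

def fuzzy_densities(accuracies, delta=1.0):
    """Formula (9): g_i proportional to the validation accuracy of C_i."""
    p = np.asarray(accuracies, dtype=float)
    return delta * p / p.sum()

def solve_lambda(g):
    """Solve Eq. (8): lambda + 1 = prod_i (1 + lambda * g_i), lambda > -1, != 0."""
    g = np.asarray(g, dtype=float)
    f = lambda lam: np.prod(1.0 + lam * g) - lam - 1.0
    if np.isclose(g.sum(), 1.0):
        return 0.0                            # measure is already additive
    if g.sum() > 1.0:
        return brentq(f, -1.0 + 1e-10, -1e-10)  # non-zero root lies in (-1, 0)
    return brentq(f, 1e-10, 1e6)              # otherwise in (0, +inf); upper bound is heuristic
```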

Let \(h:C \rightarrow [0, 1]\) be a function defined on C. The Choquet fuzzy integral of function h with respect to g is defined by the following Eq. (10).

$$\begin{aligned} (C)\int h \,dg =\sum \limits _{i=2}^{l+1}\left( h(C_{i-1})-h(C_{i})\right) g(F_{i-1}) \end{aligned}$$
(10)

where \(h(C_{1})\ge h(C_{2})\ge \cdots \ge h(C_{l})\), \(h(C_{l+1})=0\), \(F_{i-1}=\{C_1, C_2, \ldots , C_{i-1}\}\).

Given a test instance \({\varvec{x}}\), when we use the fuzzy integral to fuse the l base classifiers \(C_{1}, C_{2},\ldots , C_{l}\) for classifying \({\varvec{x}}\), the process includes three steps. First, compute the decision matrix \(DM({\varvec{x}})\). Second, sort the jth \((1\le j\le k)\) column of \(DM({\varvec{x}})\) in descending order to obtain \((p_{i_{1}j}, p_{i_{2}j}, \ldots , p_{i_{l}j})\). Finally, calculate the support degree \(p_{j}({\varvec{x}})\) by the following formula (11).

$$\begin{aligned} p_{j}({\varvec{x}})=\sum _{t=2}^{l+1}\left( p_{i_{t-1}j}({\varvec{x}})-p_{i_{t}j}({\varvec{x}})\right) g(F_{t-1}) \end{aligned}$$
(11)
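The three-step fusion procedure of formula (11) can be sketched as follows; the nested measures \(g(F_{t})\) are built incrementally from the densities \(g_i\) and \(\lambda\) via the rule of Eq. (7), and the function names are our own illustrative choices.

```python
import numpy as np

def measures_of_prefixes(g_sorted, lam):
    """g(F_1), ..., g(F_l) for the nested subsets F_t = {C_{i_1}, ..., C_{i_t}},
    built incrementally with the lambda-rule of Eq. (7)."""
    values, acc = [], None
    for gi in g_sorted:
        acc = gi if acc is None else acc + gi + lam * acc * gi
        values.append(acc)
    return values

def fuse_by_choquet(DM, g, lam):
    """Formula (11): fuse the decision matrix DM (l x k) into one support
    degree per class and return the fused vector and predicted class index."""
    l, k = DM.shape
    fused = np.zeros(k)
    for j in range(k):
        order = np.argsort(-DM[:, j])        # sort column j in descending order
        p = np.append(DM[order, j], 0.0)     # p_{i_1 j} >= ... >= p_{i_l j}, p_{i_{l+1} j} = 0
        gF = measures_of_prefixes(np.asarray(g)[order], lam)
        fused[j] = sum((p[t - 1] - p[t]) * gF[t - 1] for t in range(1, l + 1))
    return fused, int(np.argmax(fused))
```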

The pseudo code of the proposed two-class imbalanced data classification algorithm based on classifier fusion via fuzzy integral is given in Algorithm 2.

Algorithm 2 The proposed two-class imbalanced data classification algorithm based on classifier fusion via fuzzy integral (pseudo code)

4 Experimental results and analyses

4.1 Datasets and experimental environments

To demonstrate the superiority of the proposed framework, denoted by GANDO (generative adversarial network based diverse oversampling), we conducted extensive experiments on 11 datasets, including 8 numeric datasets and 3 image datasets. We use the 8 numeric datasets to compare GANDO with 11 SMOTE related state-of-the-art approaches, namely SMOTE [23], B-SMOTE [47], ADASYN [48], CCR [49], ANS [50], K-SMOTE [24], NRPSOS [51], OUPS [52], GAN [8], AC-GAN [34], and MFC-GAN [35], and we use the 3 image datasets to compare GANDO with 4 GAN related state-of-the-art methods, namely AUGMENT [53], GAN [8], AC-GAN [34], and MFC-GAN [35]. The 8 numeric datasets include 1 artificial dataset, 4 KEEL datasets [54], and 3 liver datasets [55]. The basic information of the 11 datasets is given in Table 1. All experiments were carried out on the same hardware platform with an Intel(R) Core(TM) i7-6600k CPU @ 3.10 GHz, 16.0 G memory, and a 64-bit Mac operating system. The programming environment consists of PyCharm Community Edition 2017.1.1, scikit-learn, smote-variants and keras. Our code is publicly available at https://github.com/xichie/oversample.

Table 1 The basic information of the 11 datasets

In Table 1, \(\text {IR}=\frac{|S^{-}|}{|S^{+}|}\). Gaussian is a two-dimensional artificial dataset whose two classes follow two Gaussian distributions; the mean vectors and covariance matrices of the two Gaussian distributions are given in Table 2. The artificial dataset Gaussian is used for illustrating the feasibility of the proposed approach and for visualizing the generated synthetic samples.

Table 2 The mean vectors and covariance matrices of two Gaussian distributions

The three well known image datasets are not imbalanced, so we transform them into imbalanced ones. The purpose of selecting these three datasets is to demonstrate the feasibility and effectiveness of the proposed method on image data.

MNIST is a handwritten digit dataset which includes 70,000 \(28\times 28\) grayscale images; the training and test sets contain 60,000 and 10,000 images respectively. We randomly select 100 images from the digit-zero class as positive class samples, and put the images of the other classes together as the negative class.

Fashion-MNIST is similar to MNIST; it also includes 70,000 \(28\times 28\) grayscale images of fashion products from 10 categories. We randomly select 100 images from the T-Shirt class as positive class samples, and put the images of the other classes together as the negative class.

Cifar10 consists of 60,000 \(32\times 32\) colour images from 10 object classes, with 6000 images per class; the training and test sets contain 50,000 and 10,000 images respectively. We randomly select 100 images from the airplane class as positive class samples, and put the images of the other classes together as the negative class.
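As an illustration of how such an imbalanced version can be constructed, the sketch below uses `keras.datasets` (part of the environment listed in Sect. 4.1) to build the MNIST variant with 100 positive (digit-zero) samples; the fixed random seed is our own addition for reproducibility and is not stated in the original.

```python
import numpy as np
from tensorflow.keras.datasets import mnist

def build_imbalanced_mnist(n_positive=100, seed=0):
    """Positive class: 100 randomly chosen training images of digit 0.
    Negative class: all training images of the remaining digits."""
    (x_train, y_train), _ = mnist.load_data()
    rng = np.random.default_rng(seed)
    zero_idx = np.where(y_train == 0)[0]
    pos_idx = rng.choice(zero_idx, size=n_positive, replace=False)
    x_pos = x_train[pos_idx]           # S^+ (100 images)
    x_neg = x_train[y_train != 0]      # S^- (all other digits)
    return x_pos, x_neg
```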

4.2 Performance evaluation measures

The performance evaluation measures used include MMD-score [56], Silhouette-score [57], F-measure [58], G-mean [58], and AUC-area [58]. The MMD is a statistic for measuring the mean squared difference between two sets of samples. Given two sets of samples \(\mathbf {X}=\{\mathbf {x}_{i}\},\,1\le i\le n\) and \(\mathbf {Y}=\{\mathbf {y}_{j}\},\,1\le j\le m\), the MMD of \(\mathbf {X}\) and \(\mathbf {Y}\) is defined by Eq. (12).

$$\begin{aligned} \begin{aligned} \text {MMD}&=\biggl \Vert \frac{1}{n}\sum _{i=1}^{n}\phi (\mathbf {x}_{i}) -\frac{1}{m}\sum _{j=1}^{m}\phi (\mathbf {y}_{j})\biggr \Vert ^{2}\\&=\frac{1}{n^2}\sum _{i=1}^{n}\sum _{i'=1}^{n}\phi (\mathbf {x}_{i})^T\phi (\mathbf {x}_{i'})\\&\quad -\frac{2}{nm}\sum _{i=1}^{n}\sum _{j=1}^{m}\phi (\mathbf {x}_{i})^T\phi (\mathbf {y}_{j})\\&\quad +\frac{1}{m^2}\sum _{j=1}^{m}\sum _{j'=1}^{m}\phi (\mathbf {y}_{j})^T\phi (\mathbf {y}_{j'}) \end{aligned} \end{aligned}$$
(12)

In Eq. (12), \(\phi (\cdot )\) is a kernel mapping; using the kernel trick, Eq. (12) can be written as Eq. (13).

$$\begin{aligned} \text {MMD}= & {} \frac{1}{n^2}\sum _{i=1}^{n}\sum _{i'=1}^{n}k(\mathbf {x}_{i}, \mathbf {x}_{i'})\nonumber \\&-\frac{2}{nm}\sum _{i=1}^{n}\sum _{j=1}^{m}k(\mathbf {x}_{i}, \mathbf {y}_{j})\nonumber \\&+\frac{1}{m^2}\sum _{j=1}^{m}\sum _{j'=1}^{m}k(\mathbf {y}_{j}, \mathbf {y}_{j'}) \end{aligned}$$
(13)
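A direct implementation of the kernelised MMD in Eq. (13) might look as follows; the RBF kernel and its default bandwidth are illustrative choices, since the kernel used for the MMD-score is not restated here.

```python
from sklearn.metrics.pairwise import rbf_kernel

def mmd_score(X, Y, gamma=None):
    """Squared MMD between sample sets X (n x d) and Y (m x d), Eq. (13)."""
    Kxx = rbf_kernel(X, X, gamma=gamma)    # k(x_i, x_i')
    Kyy = rbf_kernel(Y, Y, gamma=gamma)    # k(y_j, y_j')
    Kxy = rbf_kernel(X, Y, gamma=gamma)    # k(x_i, y_j)
    return Kxx.mean() - 2.0 * Kxy.mean() + Kyy.mean()
```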

The Silhouette coefficient (Silhouette-score) is an evaluation index of clustering algorithms. Given a sample \(\mathbf {x}\) which belongs to cluster A, the Silhouette coefficient of \(\mathbf {x}\) is defined by Eq. (14).

$$\begin{aligned} s(\mathbf {x})=\frac{b(\mathbf {x})-a(\mathbf {x})}{\max \{a(\mathbf {x}), b(\mathbf {x})\}} \end{aligned}$$
(14)

where \(a(\mathbf {x})\) is the average dissimilarity of sample \(\mathbf {x}\) to all other samples of A, and \(b(\mathbf {x})=\min _{\text {C}\ne \text {A}}d(\mathbf {x}, \text {C})\), where \(d(\mathbf {x}, \text {C})\) is the average dissimilarity of sample \(\mathbf {x}\) to all samples of cluster C. With respect to a cluster (or a set) A, the Silhouette coefficient of A is \(s(\text {A})=\frac{1}{|A|}\sum _{\mathbf {x}\in \text {A}}s(\mathbf {x})\). From Eq. (14), it is easy to see that the value of \(s(\mathbf {x})\) lies in \([-1,1]\), and the closer the value of \(s(\mathbf {x})\) is to 1, the better the separability.
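In practice, the average Silhouette coefficient of Eq. (14) can be obtained directly from scikit-learn (part of the experimental environment in Sect. 4.1); the sketch below scores how well the generated positive samples separate from the negative class, and the two-cluster label assignment is our own illustrative framing of the separability measure.

```python
import numpy as np
from sklearn.metrics import silhouette_score

def separability(S_neg, S_generated):
    """Average Silhouette coefficient, Eq. (14): values closer to 1 mean the
    generated positive samples are better separated from the negative class."""
    X = np.vstack([S_neg, S_generated])
    labels = np.hstack([np.zeros(len(S_neg)), np.ones(len(S_generated))])
    return silhouette_score(X, labels)
```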

4.3 Network architecture and parameter settings

For the two different kinds of datasets, we employ different network architectures and parameter settings. Specifically, for the 8 numeric datasets, the generator and the classifier are both single hidden layer feedforward neural networks, and the dimension of the noise \({\varvec{z}}\) is uniformly set to 100. The other parameters, including the number of hidden nodes of the generator (denoted by \(\sharp\)HNodesG), the number of hidden nodes of the classifier (denoted by \(\sharp\)HNodesC), the number of iterations (n), the number of training rounds (k), the weight parameter \(\lambda\), and the number of samples generated per oversampling step (denoted by \(\sharp\)Oversampling), are given in Table 3. In the second stage, we use support vector machines as the base classifiers, which are fused via fuzzy integral to classify the two-class imbalanced data.

Table 3 Model parameter settings used for 8 numeric datasets

For the 3 image datasets, because MNIST and Fashion-MNIST are single channel grayscale image datasets, the generator and classifier both use the same fully connected networks; since Cifar10 is a three channel color image dataset, the generator and classifier both use convolutional neural networks. ADAM is used as the optimization method, and the mini-batch size is 100. The network structures of G and C are given in Table 4, and the other model parameter settings are given in Table 5. In the second stage, we use two different neural networks as the base classifiers, which are fused via fuzzy integral to classify the two-class imbalanced data. For MNIST and Fashion-MNIST, we use the same neural network as the base classifier, whereas a different neural network is employed for Cifar10; the architectures of the two neural networks are given in Table 6. Regarding the parameter choice, we use a grid search strategy and pick the values which result in the best performance, for example, the numbers of hidden nodes of the generator and classifier networks used for the 8 numeric datasets (Table 3) and for the 3 image datasets (Table 5). For each dataset, we determine the suitable numbers of hidden nodes by grid search over the same interval [50, 150].
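The grid search over the number of hidden nodes can be sketched as a plain loop over candidate values in [50, 150]; the step size and the `evaluate_model` helper are our own assumptions, since the paper does not specify them.

```python
def grid_search_hidden_nodes(evaluate_model, candidates=range(50, 151, 10)):
    """Pick the hidden-layer size in [50, 150] that yields the best score.
    evaluate_model(n_hidden) is assumed to train the model with n_hidden
    hidden nodes and return a validation score (e.g. F-measure)."""
    best_n, best_score = None, float("-inf")
    for n_hidden in candidates:
        score = evaluate_model(n_hidden)
        if score > best_score:
            best_n, best_score = n_hidden, score
    return best_n, best_score
```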

Table 4 The network structures of generator and classifier used for 3 image datasets
Table 5 Model parameter settings used for 3 image datasets
Table 6 The architectures of the two different neural networks

4.4 Comparisons with 11 SMOTE related state-of-the-art approaches on the 8 numeric datasets

We use 5-fold cross validation to experimentally compare the proposed method GANDO with the 11 SMOTE related state-of-the-art approaches on 5 measures: MMD-score, Silhouette-score, F-measure, G-means, and AUC-area, and we visualize the generated synthetic samples on the artificial dataset to demonstrate the effectiveness and superiority of the proposed approach GANDO. The experimental results of MMD-score compared with the 11 SMOTE related state-of-the-art approaches on the 8 numeric datasets are given in Table 7, and the corresponding results of Silhouette-score are given in Table 8.

Table 7 Experimental comparison of MMD-score on the 8 numeric datasets

From the experimental results listed in Table 7, the MMD-scores of the proposed method GANDO on 7 numeric datasets are greater than those of the 10 SMOTE related state-of-the-art approaches, which means that the positive class samples oversampled by GANDO have better diversity than those of the 10 SMOTE related state-of-the-art approaches. This conclusion is further confirmed by the visualization of the generated synthetic positive class samples on the artificial dataset (see Fig. 4). In Fig. 4, the yellow “−” represents a negative class sample, the blue “+” represents a positive class sample, and the red “+” represents a generated positive class sample. It can be seen from Fig. 4 that the samples generated by the proposed method GANDO have better diversity than those of the 11 SMOTE related state-of-the-art approaches. Although MFC-GAN has good diversity, it has poor separability, i.e. the generated synthetic positive class samples overlap with the original negative samples. K-SMOTE is an exception in that it cannot generate synthetic positive class samples on the artificial dataset. This is due to its oversampling mechanism: K-SMOTE first uses k-means to cluster the artificial dataset; then, for each cluster, it calculates the cluster's IR and selects the clusters whose IR is less than a threshold for oversampling with SMOTE. In our experiments, the threshold is set to 2.0. Since the IR of each cluster is greater than 2.0, no oversampling is performed.

It is well known that the better the diversity of the generated synthetic positive class samples, the better their quality. High quality synthetic positive class samples can effectively expand the training field of the positive class samples and effectively improve the performance of the proposed classification algorithm. This point is confirmed by the experimental results on the three classification performance metrics: F-measure, G-means, and AUC-area (see Tables 9, 10, 11). The reason why the proposed method GANDO can generate synthetic positive class samples with good diversity is that we introduce a regularization term of intra-class divergence into the loss function of the improved GAN model.

Fig. 4 The visualization of the generated synthetic positive class samples on the artificial dataset

Table 8 Experimental comparison of Silhouette-score on the 8 numeric datasets

From the experimental results listed in Table 8, the Silhouette-scores of the proposed method GANDO on 7 numeric datasets are greater than those of the 10 SMOTE related state-of-the-art approaches, which demonstrates that the positive class samples oversampled by GANDO also have better separability than those of the 10 SMOTE related state-of-the-art approaches. This conclusion is further confirmed by the visualization of the generated synthetic positive class samples on the artificial dataset (see Fig. 4). It can be seen from Fig. 4 that the samples generated by the proposed method GANDO have better separability than those of the 11 SMOTE related state-of-the-art approaches. Although B-SMOTE, ANS, and NRPSOS have good separability, they have low diversity.

Table 9 Experimental comparisons of F-measure on the 8 numeric datasets
Table 10 Experimental comparison of G-mean on the 8 numeric datasets
Table 11 Experimental comparison of AUC-area on the 8 numeric datasets

The experimental results of F-measure, G-means and AUC-area compared with the 11 SMOTE related state-of-the-art approaches on the 8 numeric datasets are given in Tables 9, 10 and 11 respectively. From these results, it is observed that (a) the F-measure values of the proposed method GANDO are greater than those of the 11 SMOTE related state-of-the-art approaches on all 8 numeric datasets; (b) the G-means and AUC-area values of the proposed method GANDO are greater than those of the 11 SMOTE related state-of-the-art approaches on 6 and 7 numeric datasets respectively. Overall, the proposed method GANDO outperforms the 11 SMOTE related state-of-the-art approaches in terms of F-measure, G-means, and AUC-area. We believe the reasons include the following three points:

  1. Introducing the regularization term of intra-class divergence into the loss function of the GAN guarantees good diversity of the generated synthetic positive class samples. Good diversity can effectively expand the training field of the positive class samples.

  2. Introducing the Silhouette-score guarantees good separability between the generated synthetic positive class samples and the negative class samples, and the combination of MMD-score and Silhouette-score further improves the quality of the generated synthetic positive class samples, all of which contribute to the good performance of the proposed method GANDO.

  3. Since the base classifiers are trained on balanced training sets containing the same set of oversampled positive class samples, intrinsic interactions exist among the base classifiers. The interactions may be positively correlated, in which case the base classifiers enhance each other, or negatively correlated, in which case the base classifiers suppress each other. The fuzzy integral can well model these interactions, which enhances the generalization performance of the ensemble classifier.

From the experimental results on F-measure and G-mean listed in Tables 9 and 10, we find that some traditional methods (e.g. ADASYN, CCR, and ANS) obtained exceptional results on the liver 1 dataset; we believe the reason is that this dataset has a very high IR. Nevertheless, the proposed method GANDO obtained competitive results on this severely imbalanced dataset.

4.5 Comparisons with the 4 GAN related state-of-the-art methods on the 3 image datasets

It is well known that GAN can generate realistic images and can thus be viewed as a data augmentation technique, while oversampling is also a form of data augmentation. In order to further demonstrate the effectiveness of GANDO for classifying imbalanced image datasets, we conduct experiments on three famous image datasets to compare GANDO with 4 GAN related state-of-the-art methods: AUGMENT, GAN, AC-GAN, and MFC-GAN. The experimental results of the 5 evaluation measures compared with the 4 GAN related state-of-the-art approaches on the 3 image datasets are listed in Tables 12, 13, 14, 15 and 16.

Table 12 Experimental comparison of MMD-score on the 3 image datasets
Table 13 Experimental comparison of Silhouette-score on the 3 image datasets
Table 14 Experimental comparison of F-measure on the 3 image datasets
Table 15 Experimental comparison of G-mean on the 3 image datasets
Table 16 Experimental comparison of AUC-area on the 3 image datasets

From the experimental results of MMD-score listed in Table 12, we find that the proposed method GANDO obtained the maximum on Fashion-MNIST, while AC-GAN obtained the other two maxima, on MNIST and Cifar10. However, AC-GAN has poor separability on the three image datasets, while GANDO has much better separability than AC-GAN (see Table 13). In other words, GANDO achieves a better tradeoff between diversity and separability, which results in GANDO outperforming AC-GAN in classification performance, as confirmed by the experimental results of F-measure, G-means, and AUC-area listed in Tables 14, 15 and 16. From the experimental results of Silhouette-score listed in Table 13, we find that the proposed method GANDO obtained two maxima, on MNIST and Cifar10, while MFC-GAN obtained the other maximum, on Fashion-MNIST. Comparing GANDO and MFC-GAN on the three classification performance measures, i.e. F-measure, G-means, and AUC-area, GANDO is superior to MFC-GAN on the two image datasets MNIST and Cifar10, while MFC-GAN is superior to GANDO on Fashion-MNIST. In summary, GANDO outperforms the other 4 GAN related state-of-the-art methods.

5 Conclusions

Based on an improved GAN model and a classifier fusion mechanism via fuzzy integral, a framework for classifying imbalanced data was proposed in this paper. The framework contains an oversampling method and an ensemble classification approach for the classification of imbalanced data. The oversampling method is based on the improved GAN model, which introduces a regularization term of intra-class divergence into the loss function of the GAN and replaces the discriminator of the GAN with a classifier with three outputs. The ensemble classification approach is based on the fuzzy integral. Since the base classifiers are trained on balanced training sets containing the same positive class set, there are intrinsic interactions among the base classifiers. The fuzzy integral can well model these interactions and thus effectively enhances the classification performance. The proposed classification framework has four advantages: (1) It can generate synthetic positive class samples with good diversity and good separability. (2) The improved GAN model can effectively avoid mode collapse. (3) It has good classification generalization performance due to diverse oversampling and controllable separability. (4) It is effective not only for datasets with a medium imbalance ratio, but also for datasets with a very high imbalance ratio. Promising future works of this study include (1) extending GANDO to multi-class imbalanced data classification; (2) expanding the scalability of GANDO to imbalanced big data scenarios.