1 Introduction

The class imbalance problem was originally introduced by Japkowicz [1]. It refers to the classification scenario in which one class is represented by a large number of samples while the other is represented by only a few. Class imbalance problems are pervasive in many real-world applications, such as software defect prediction [2], machinery fault diagnosis [3], and spam filtering [4]. Class imbalance problems include two-class imbalance problems and multi-class imbalance problems. Since most existing solutions for multi-class imbalance problems first use class decomposition schemes to divide a multi-class problem into multiple two-class problems and then conquer each two-class imbalanced subproblem [5, 6], this paper focuses on the two-class imbalance problem. In the two-class imbalance problem, the minority class is also called the positive class, and the majority class is also called the negative class. In the past two decades, many solutions to the two-class imbalance problem have been proposed. SMOTE is the most influential oversampling method [7]; it balances an imbalanced dataset by generating synthetic positive class samples on the line segments between each positive class sample and its k nearest neighbors of the same class. SMOTE and its variants have the following three drawbacks due to the mechanism by which they generate synthetic samples:

  1. the probability distribution of the minority class samples is not considered;

  2. the generated minority class samples lack diversity;

  3. the generated minority class samples overlap heavily when oversampling is performed many times to balance them with the majority class samples.

In order to overcome these three drawbacks, inspired by the idea of the generative adversarial network (GAN) [8], we propose a framework that includes an oversampling method and a two-class imbalanced data classification approach based on classifier fusion via fuzzy integral. The main contributions of this paper are threefold:

  1. We propose an oversampling method based on an improved GAN model. The improvement lies in introducing a regularization term of intra-class divergence into the loss function of the GAN, and in replacing the discriminator of the GAN with a classifier whose output is a vector with three entries: the probabilities that a predicted sample belongs to the majority class, the minority class, or the set of generated samples.

  2. Based on the proposed oversampling method, we propose a two-class imbalanced data classification approach based on classifier fusion via fuzzy integral. The fuzzy integral can well model the interactions among base classifiers that are not independent, since all balanced data subsets used for training the base classifiers include the same oversampled positive class samples. The proposed ensemble approach can enhance the classification accuracy of the positive class samples.

  3. Extensive experiments are conducted to compare the proposed methods with 11 SMOTE related and 4 GAN related state-of-the-art approaches on 5 measures: MMD-score, Silhouette-score, F-measure, G-means, and AUC-area. The experimental results demonstrate that the proposed methods are more effective and efficient than the compared approaches.

The rest of this paper is organized as follows. In Sect. 2, we review the works related to two-class imbalanced data classification. In Sect. 3, we describe the details of the proposed methods. In Sect. 4, the experimental results and analyses are presented. Finally, we conclude our work in Sect. 5.

2 Related works

Many methods have been proposed for addressing two-class imbalanced data classification. These methods can be classified into three categories: data-level methods, algorithm-level methods, and ensemble methods. Considering that this paper focuses on data-level and ensemble methods, we only provide a brief review of algorithm-level methods; a comprehensive review of algorithm-level approaches can be found in [9, 10].

The basic idea of algorithm-level methods is to modify existing classification algorithms to adapt them to the scenario of imbalanced data classification. The most common modification strategy is to introduce a cost-sensitive mechanism into traditional classification algorithms. The pioneering work on cost-sensitive methods for the class imbalance problem was presented by Sun et al. [11]. They introduced cost items into the famous ensemble algorithm AdaBoost and proposed the AdaC algorithm family. Other representative works published in recent years are reported in [12,13,14]. Khan et al. [12] proposed a cost-sensitive deep neural network model which can automatically learn good features [15,16,17,18,19] from imbalanced data by jointly optimizing the class correlation losses and network parameters. Tao et al. [13] proposed a self-adaptive cost weights-based support vector machine (SVM) and a cost-sensitive ensemble approach for imbalanced data classification. Wang et al. [14] proposed a cost-sensitive fuzzy multiple kernel learning method that addresses the imbalance problem by introducing fuzzy memberships to characterize imbalanced data; the method obtained favorable classification performance on imbalanced datasets.

The basic idea of data-level methods is to preprocess the original imbalanced dataset to balance the distribution of samples between the two classes, by undersampling majority samples or oversampling minority samples. Some empirical comparisons demonstrate that oversampling is much more effective than undersampling [2, 20,21,22]. Among the oversampling methods, SMOTE [23] is the most influential approach. Since SMOTE was proposed in 2002, many oversampling approaches have followed. Based on k-means clustering and SMOTE, Douzas et al. [24] proposed an oversampling method which can avoid the generation of noise and effectively overcome imbalances between and within classes. Douzas and Bacao [25] proposed a geometric SMOTE which generates synthetic samples in a geometric region of the input space; the region is a hyper-sphere around each selected minority instance. Maldonado et al. [26] studied the SMOTE oversampling strategy for high-dimensional datasets, and proposed an alternative distance metric for the computation of the neighbors of each minority sample. Susan and Kumar [27] combined undersampling and oversampling, and proposed a three-step intelligent pruning strategy of majority and minority samples for learning from imbalanced datasets. Mathew et al. [28] proposed a weighted kernel-based SMOTE (WKSMOTE) approach, which generates synthetic positive class samples in feature space; WKSMOTE can overcome the limitation of the linear interpolation of SMOTE. Based on WKSMOTE, Raghuwanshi and Shukla [29] proposed a SMOTE based class-specific extreme learning machine, which exploits the benefits of both minority oversampling and class-specific regularization. Pan et al. [30] proposed two oversampling methods: one is an adaptive SMOTE, which improves SMOTE by adaptively selecting groups of inner and danger data from the minority class; the other adopts Gaussian oversampling, which provides a novel division strategy for sampling regions and makes sampling more reasonable. Zhang and Li [31] proposed an approach to balance different class samples by creating synthetic samples through random walks from the real data. Han et al. [32] presented a Gaussian mixture model based combined resampling approach. The resampling approach first determines the number of samples of the majority class and the minority class using a sampling factor; then, to balance the dataset, Gaussian mixture clustering is used for undersampling the majority class samples, and the synthetic minority oversampling technique is used for the rest of the samples. Zhang et al. [33] investigated the classification of high-dimensional class imbalanced datasets and proposed an algorithm to improve the performance of SMOTE by adopting an adaptive oversampling rate. Elreedy and Atiya [21] presented a theoretical and experimental analysis of the SMOTE method; specifically, they explored how faithfully it emulates the underlying density, and analyzed the effect of different factors on generation accuracy, such as the dimension, the size of the training set, and the considered number of neighbors K. Fernández et al. [22] presented a comprehensive survey on SMOTE-based approaches, in which the progress and challenges of SMOTE-based approaches over fifteen years (from 2003 to 2018) are well summarized.

In recent years, the generative adversarial network (GAN) has become a popular research topic in deep learning. Some researchers have used the generation mechanism of GAN to generate synthetic positive class samples for balancing imbalanced datasets. For instance, inspired by the idea of the auxiliary classifier GAN (AC-GAN) [34], Ali-Gombe and Elyan proposed an improved model, the multiple fake class GAN (MFC-GAN) [35], and used it to handle the imbalanced data classification problem. MFC-GAN differs from AC-GAN in that it uses multiple fake classes rather than a single fake class as in AC-GAN. Furthermore, MFC-GAN can preserve the structure of the minority classes by learning the correct data distribution, which is an intriguing property. Douzas and Bacao [36] applied the conditional GAN (cGAN) to binary class imbalanced datasets, where the conditional GAN conditions on the class labels of the imbalanced datasets; the resulting generative model is then used to create artificial data for the minority class. Zheng et al. [37] introduced a gradient penalty into the conditional Wasserstein GAN [38] and proposed a synthetic oversampling approach for imbalanced datasets. Different from these existing methods, the novelty of our proposed method lies in the following three aspects: (a) introducing intra-class divergence as a regularization term in the loss function of the GAN to guarantee the diversity of the synthetic samples; (b) introducing MMD-score and Silhouette-score to measure diversity and separability, both of which have a great influence on the performance of imbalanced data classification; (c) replacing the discriminator of the GAN with a classifier whose output is a vector with three entries: the probabilities that a predicted sample belongs to the majority class, the minority class, or the set of generated samples.

Ensemble methods usually combine data-level and algorithm-level approaches to handle the class imbalance problem. Based on SMOTE combined with an Adaboost SVM ensemble integrated with time weighting (ADASVM-TW), Sun et al. [39] proposed two class-imbalanced dynamic financial distress prediction approaches: one is a simple integration model of SMOTE with ADASVM-TW, and the other is an embedding integration model of SMOTE with ADASVM-TW. González et al. [40] explored the effectiveness of the switching technique for the classification of highly imbalanced problems, and proposed a switching-based ensemble which selects the switched examples based on the nearest enemy distance. Gutiérrez-López et al. [41] also investigated the impact of the switching technique on class imbalance learning, proposed an asymmetric binary label switching algorithm to resist binary imbalance, and presented a theoretical analysis concluding that asymmetric switching binary classifiers offer an intrinsic resistance to imbalance effects. Raghuwanshi and Shukla [42] proposed an ensemble approach using a reduced kernelized weighted extreme learning machine as the base classifier to solve the class imbalance problem effectively. Hsiao et al. [43] proposed a method named MTSbag for class imbalance problems; MTSbag integrates the Mahalanobis–Taguchi system (MTS) and bagging-based ensemble learning to enhance the ability of conventional MTS in handling imbalanced data. Zhai et al. [44] combined an oversampling method with ensemble learning and proposed a MapReduce-based imbalanced large-scale data classification approach; the oversampling method is based on the enemy nearest neighbor. In this paper, we present a classifier fusion approach based on fuzzy integral for imbalanced data classification. Since the fusion method can well model the interactions among the base classifiers by using the fuzzy integral as an ensemble tool, the proposed approach can effectively enhance the generalization performance of the classification algorithm.

3 The proposed framework

In this section, we present the proposed framework for addressing the two-class imbalance problem. The framework includes an oversampling method which is based on an improved GAN model, and a two-class imbalanced data classification approach which is based on classifier fusion via fuzzy integral.

3.1 Oversampling method based on an improved GAN model

GAN is a generative model which consists of two neural networks G and D (see Fig. 1). G is a generator network whose input, denoted by \({\varvec{z}}\), is drawn from a known noise prior distribution \(p_{noise}\), and whose output is denoted by \({\varvec{x}}'\). D is a discriminator network whose input includes the generated data \({\varvec{x}}'\) and the real data \({\varvec{x}}\). The distribution of \({\varvec{x}}\) is denoted by \(p_{data}\), which is unknown, and the distribution of \({\varvec{x}}'\) is denoted by \(p_{gen}\). The output of the discriminator D is a probability which indicates the degree of support that the input comes from \(p_{data}\) rather than from \(p_{gen}\).

Fig. 1 The architecture of the generative adversarial network

Since GAN is a probabilistic generative model, it is natural to use GAN to generate synthetic positive class samples for addressing the two-class imbalanced data classification problem. However, we found that if we only learn the distribution of the positive class samples using GAN, it is easy to incur overlap between the positive and negative class samples. In addition, since GAN is prone to mode collapse, the synthetic positive class samples generated by GAN lack diversity. In this section, we present the proposed oversampling method, based on an improved GAN model, to deal with these two problems.

In the proposed method, we improve the GAN model in two aspects: (1) We replace the discriminator of GAN with a classifier C (see Fig. 2), whose output is \(p_{pos}\) for positive class samples, \(p_{neg}\) for negative class samples, and \(p_{g}\) for samples generated by generator G. In the adversarial training process of generator G and classifier C, we want the samples generated by G to fool the classifier C; namely, when a generated sample is fed to the classifier, we want the output to be close to \(p_{pos}\). Classifier C can not only learn the distribution of samples, but also learn a good classification boundary between the positive and negative classes. (2) We introduce a regularization term of intra-class divergence into the loss function of the GAN, which enhances the diversity of the samples generated by G and helps avoid mode collapse.

Fig. 2 The architecture of the improved generative adversarial network

Let \(S=S^+\cup S^-\), where \(S^+\) and \(S^-\) denote the positive class and negative class respectively, let \(S_{up}^+\) be the oversampled positive class, and let \({\varvec{m}}\) and \({\varvec{m}}'\) be the mean vectors of the positive class samples and the oversampled positive class samples respectively. The loss function of the improved GAN is given by Eq. (1):

$$\begin{aligned} L(G({\varvec{z}}))\nonumber \\= & {} \frac{1}{|S^{+}|}\sum \limits _{{\varvec{x}}\in S^{+}}({\varvec{x}}-{\varvec{m}})({\varvec{x}}-{\varvec{m}})^{\text {T}}\nonumber \\&+\frac{1}{|S_{up}^{+}|}\sum \limits _{G({\varvec{z}})\in S_{up}^{+}}(G({\varvec{z}})-{\varvec{m}}')(G({\varvec{z}})-{\varvec{m}}')^{\text {T}} \end{aligned}$$
(1)
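To make the regularization term concrete, the following is a minimal NumPy sketch of the intra-class divergence in Eq. (1), reading \(({\varvec{x}}-{\varvec{m}})({\varvec{x}}-{\varvec{m}})^{\text {T}}\) as the squared distance of a (row-vector) sample to its class mean; the function names and the use of plain NumPy arrays are our own illustrative choices, not part of the original implementation.

```python
import numpy as np

def intra_class_divergence(samples):
    """Mean squared distance of the samples to their mean vector,
    i.e. one summand of Eq. (1)."""
    m = samples.mean(axis=0)                     # class mean vector
    diff = samples - m                           # (x - m) for every sample
    return np.mean(np.sum(diff * diff, axis=1))  # average of (x - m)(x - m)^T

def divergence_loss(positive_samples, generated_samples):
    """L(G(z)) in Eq. (1): divergence of the real positive class plus
    divergence of the oversampled (generated) positive class."""
    return (intra_class_divergence(positive_samples)
            + intra_class_divergence(generated_samples))
```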

The objective functions of C and G of the improved GAN model are given by Eqs. (2) and (3) respectively.

$$\begin{aligned}&\max \limits _{C}J=J_{1}+ J_{2}+ J_{3} \end{aligned}$$
(2)
$$\begin{aligned}&\max \limits _{G}L=J_{4}+ \lambda L(G({\varvec{z}})) \end{aligned}$$
(3)

where \(\lambda\) is a parameter, and

$$\begin{aligned} J_{1}&=E_{{\varvec{x}}\sim p_{neg}}\log C_{1}({\varvec{x}})+E_{{\varvec{x}}\sim p_{neg}}\log (1-C_{2}({\varvec{x}}))\nonumber \\&\quad + E_{{\varvec{x}}\sim p_{neg}}\log (1-C_{3}({\varvec{x}})) \end{aligned}$$
(4a)
$$\begin{aligned} J_{2}&=E_{{\varvec{x}}\sim p_{pos}}\log C_{2}({\varvec{x}})+E_{{\varvec{x}}\sim p_{pos}}\log (1-C_{1}({\varvec{x}}))\nonumber \\&\quad + E_{{\varvec{x}}\sim p_{pos}}\log (1-C_{3}({\varvec{x}}))\end{aligned}$$
(4b)
$$\begin{aligned} J_{3}&=E_{{\varvec{x}}\sim p_{g}}\log C_{3}({\varvec{x}})+E_{{\varvec{x}}\sim p_{g}}\log (1-C_{1}({\varvec{x}}))\nonumber \\&\quad + E_{{\varvec{x}}\sim p_{g}}\log (1-C_{2}({\varvec{x}})) \end{aligned}$$
(4c)
$$\begin{aligned} J_{4}&=E_{{\varvec{x}}\sim p_{g}}\log C_{2}({\varvec{x}})-E_{{\varvec{x}}\sim p_{g}}\log C_{1}({\varvec{x}})\nonumber \\&\quad - E_{{\varvec{x}}\sim p_{g}}\log C_{3}({\varvec{x}}) \end{aligned}$$
(4d)
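For illustration, the following sketch assembles the classifier and generator objectives of Eqs. (2)–(4d) from the three classifier outputs; the array names (`c_on_neg`, `c_on_pos`, `c_on_gen`) and the NumPy formulation are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def j_term(c_out, true_idx, eps=1e-8):
    """One of J1, J2, J3 in Eqs. (4a)-(4c): reward the true entry of the
    classifier output and penalise the other two entries.
    c_out: array of shape (batch, 3) holding (C1, C2, C3) = (p_neg, p_pos, p_g)."""
    others = [i for i in range(3) if i != true_idx]
    return (np.mean(np.log(c_out[:, true_idx] + eps))
            + sum(np.mean(np.log(1.0 - c_out[:, i] + eps)) for i in others))

def classifier_objective(c_on_neg, c_on_pos, c_on_gen):
    """J = J1 + J2 + J3 in Eq. (2), maximised with respect to classifier C."""
    return j_term(c_on_neg, 0) + j_term(c_on_pos, 1) + j_term(c_on_gen, 2)

def generator_objective(c_on_gen, divergence, lam, eps=1e-8):
    """J4 + lambda * L(G(z)) in Eqs. (3) and (4d), maximised with respect to G."""
    j4 = (np.mean(np.log(c_on_gen[:, 1] + eps))    # push generated samples toward "positive"
          - np.mean(np.log(c_on_gen[:, 0] + eps))  # and away from "negative"
          - np.mean(np.log(c_on_gen[:, 2] + eps))) # and away from "generated"
    return j4 + lam * divergence
```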

In the adversarial learning process, G attempts to generate diverse positive class samples and expects that C will categorize the generated samples into the minority class, while C attempts to correctly classify the positive, negative and generated samples. It can be proved that the optimal C results in the following formula (5).

$$\begin{aligned} L=-KL\left( p_{g}\parallel p_{pos}\right) +H\left( p_{g}, p_{neg}\right) \end{aligned}$$
(5)

where \(KL\left( p_{g}\parallel p_{pos}\right)\) is the KL divergence between \(p_{g}\) and \(p_{pos}\), and \(H\left( p_{g}, p_{neg}\right)\) is the cross entropy between \(p_{g}\) and \(p_{neg}\). In the following, we prove that Eq. (5) holds. Because the intra-class divergence term is not related to the classifier C, for \(C_{i}({\varvec{x}}), 1\le i\le 3\), we can obtain the following equation.

$$\begin{aligned} \begin{aligned} J(C_{1}({\varvec{x}}))&=E_{{\varvec{x}}\sim p_{neg}}\log C_{1}({\varvec{x}})\\&\quad +E_{{\varvec{x}}\sim p_{pos}}\log (1-C_{1}({\varvec{x}}))+E_{{\varvec{x}}\sim p_{g}}\log (1-C_{1}({\varvec{x}}))\\&=\int \left( p_{neg}\log C_{1}({\varvec{x}})+p_{pos}\log (1-C_{1}({\varvec{x}}))\right. \\&\quad \left. +p_{g}\log (1-C_{1}({\varvec{x}}))\right) d{\varvec{x}} \end{aligned} \end{aligned}$$

Taking the partial derivative of the integrand with respect to \(C_{1}({\varvec{x}})\) and setting it equal to zero, we have the following equation.

$$\begin{aligned} \frac{p_{neg}}{C_{1}({\varvec{x}})}-\frac{p_{pos}}{1-C_{1}({\varvec{x}})}-\frac{p_{g}}{1-C_{1}({\varvec{x}})}=0 \end{aligned}$$

Hence,

$$\begin{aligned} C_{1}^{*}({\varvec{x}})=\frac{p_{neg}}{p_{neg}+p_{pos}+p_{g}} \end{aligned}$$

Similarly, we have,

$$\begin{aligned} \begin{aligned}&C_{2}^{*}({\varvec{x}})=\frac{p_{pos}}{p_{neg}+p_{pos}+p_{g}}\\&C_{3}^{*}({\varvec{x}})=\frac{p_{g}}{p_{neg}+p_{pos}+p_{g}} \end{aligned} \end{aligned}$$

Substituting \(C_{1}^{*}({\varvec{x}})\), \(C_{2}^{*}({\varvec{x}})\) and \(C_{3}^{*}({\varvec{x}})\) into L [i.e. (4d)], we have

$$\begin{aligned} \begin{aligned} L&=E_{{\varvec{x}}\sim p_{g}}\log C_{2}^{*}({\varvec{x}})\\&\quad -E_{{\varvec{x}}\sim p_{g}}\log C_{1}^{*}({\varvec{x}})- E_{{\varvec{x}}\sim p_{g}}\log C_{3}^{*}({\varvec{x}})\\&=\int \left( p_{g}\log p_{pos}\right) d{\varvec{x}}-\int \left( p_{g}\log p_{neg}\right) d{\varvec{x}}\\&\quad -\int \left( p_{g}\log p_{g}\right) d{\varvec{x}}\\&=\int \left( p_{g}\log \frac{p_{pos}}{p_{g}}\right) d{\varvec{x}} -\int \left( p_{g}\log p_{neg}\right) d{\varvec{x}}\\&=-KL\left( p_{g}\parallel p_{pos}\right) +H\left( p_{g}, p_{neg}\right) \end{aligned} \end{aligned}$$

Note: (1) For \(KL\left( p_{g}\parallel p_{pos}\right)\), since \(p_{pos}\) is fixed, we want \(p_{g}\) to be as close to \(p_{pos}\) as possible. Note that \(KL(\cdot \parallel \cdot )\) is not symmetric; for different optimization objectives, the results are different (see Fig. 3). Obviously, we should adopt the optimization objective given in Fig. 3b. (2) The cross entropy \(H\left( p_{g}, p_{neg}\right)\) is used to distinguish the generated samples from the negative class samples as much as possible. (3) In some cases, the number of positive class samples is too small to train a model, so we train the model in an incremental iterative mode. The pseudo code of the proposed oversampling algorithm is given in Algorithm 1.

Fig. 3 a Optimization for \(\mathop {argmin}_{q}KL(p\parallel q)\); b optimization for \(\mathop {argmin}_{q}KL(q\parallel p)\)

Algorithm 1 The proposed oversampling algorithm (pseudo code)
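Since Algorithm 1 is only available as a figure in the original, the following is a rough Keras/TensorFlow sketch of the adversarial training loop described above; the model-building helpers (`build_generator`, `build_classifier`), the optimizer choices, and the single-batch update scheme are our own assumptions rather than the authors' exact implementation.

```python
import tensorflow as tf

def oversample_with_gando(S_pos, S_neg, n_iter, n_generate, lam,
                          build_generator, build_classifier, noise_dim=100):
    """Rough sketch of Algorithm 1: adversarially train G and C, then use G
    to generate the synthetic positive class samples S_up^+."""
    G = build_generator()    # maps noise z (noise_dim) -> synthetic sample
    C = build_classifier()   # maps sample -> (p_neg, p_pos, p_g), softmax output
    opt_G, opt_C = tf.keras.optimizers.Adam(), tf.keras.optimizers.Adam()
    eps = 1e-8

    for _ in range(n_iter):
        z = tf.random.normal((len(S_pos), noise_dim))

        # --- update classifier C: maximise J1 + J2 + J3 (Eq. 2) ---
        with tf.GradientTape() as tape:
            gen = G(z, training=True)
            c_neg, c_pos, c_gen = C(S_neg), C(S_pos), C(gen)
            j = (tf.reduce_mean(tf.math.log(c_neg[:, 0] + eps)
                                + tf.math.log(1 - c_neg[:, 1] + eps)
                                + tf.math.log(1 - c_neg[:, 2] + eps))
                 + tf.reduce_mean(tf.math.log(c_pos[:, 1] + eps)
                                  + tf.math.log(1 - c_pos[:, 0] + eps)
                                  + tf.math.log(1 - c_pos[:, 2] + eps))
                 + tf.reduce_mean(tf.math.log(c_gen[:, 2] + eps)
                                  + tf.math.log(1 - c_gen[:, 0] + eps)
                                  + tf.math.log(1 - c_gen[:, 1] + eps)))
            loss_C = -j                                  # maximise j
        opt_C.apply_gradients(zip(tape.gradient(loss_C, C.trainable_variables),
                                  C.trainable_variables))

        # --- update generator G: maximise J4 + lambda * L(G(z)) (Eq. 3) ---
        # (the real-positive part of Eq. (1) is constant w.r.t. G, so only the
        #  divergence of the generated batch is used here)
        with tf.GradientTape() as tape:
            gen = G(z, training=True)
            c_gen = C(gen)
            j4 = tf.reduce_mean(tf.math.log(c_gen[:, 1] + eps)
                                - tf.math.log(c_gen[:, 0] + eps)
                                - tf.math.log(c_gen[:, 2] + eps))
            divergence = tf.reduce_mean(tf.reduce_sum(
                (gen - tf.reduce_mean(gen, axis=0)) ** 2, axis=1))
            loss_G = -(j4 + lam * divergence)
        opt_G.apply_gradients(zip(tape.gradient(loss_G, G.trainable_variables),
                                  G.trainable_variables))

    z = tf.random.normal((n_generate, noise_dim))
    return G(z, training=False).numpy()                  # oversampled positives
```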

3.2 Two-class imbalanced data classification approach based on classifier fusion via fuzzy integral

On the basis of the above oversampling method, we propose a two-class imbalanced data classification approach based on classifier fusion via fuzzy integral [45]. The proposed approach includes the following two stages:

(1) Construct balanced training sets and train base classifiers

In this stage, we first partition \(S^{-}\) into l subsets \(S_{1}^{-}, S_{2}^{-}, \ldots , S_{l}^{-}\), where \(l=\frac{|S^{-}|}{|S_{up}^{+}|}\). Next, we construct l balanced training sets \(S_{i}=S_{i}^{-}\cup S_{up}^{+}\), \(1\le i\le l\). Finally, we train l classifiers \(C=\{C_{1}, C_{2},\ldots , C_{l}\}\) on the l balanced training sets. The l classifiers are fused for imbalanced data classification via fuzzy integral in the next stage.
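This first stage can be summarised by the short sketch below, which partitions the negative class, builds the l balanced training sets, and trains one SVM per set (SVMs are the base classifiers used for the numeric datasets, see Sect. 4.3); the helper name and the use of scikit-learn's `SVC` with `probability=True` are our own illustrative assumptions.

```python
import numpy as np
from sklearn.svm import SVC

def train_base_classifiers(S_neg_X, S_up_pos_X):
    """Stage 1: partition S^- into l subsets, pair each with S_up^+,
    and train one base classifier per balanced training set."""
    l = len(S_neg_X) // len(S_up_pos_X)            # l = |S^-| / |S_up^+|
    neg_subsets = np.array_split(S_neg_X, l)       # S_1^-, ..., S_l^-
    classifiers = []
    for S_i_neg in neg_subsets:
        X = np.vstack([S_i_neg, S_up_pos_X])       # S_i = S_i^- U S_up^+
        y = np.hstack([np.zeros(len(S_i_neg)),     # 0: negative class
                       np.ones(len(S_up_pos_X))])  # 1: positive class
        clf = SVC(probability=True).fit(X, y)      # base classifier C_i
        classifiers.append(clf)
    return classifiers
```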

(2) Fuse the trained base classifiers via fuzzy integral

As a classifier fusion method, the fuzzy integral is distinguished from other fusion methods by an intriguing property: it can well model the interactions among the base classifiers, including positive and negative interactions. This is the reason why we select the fuzzy integral to fuse the trained base classifiers.

Let \(D=\{({\varvec{x}}_i,y_i)\mid {\varvec{x}}_i\in R^{d},y_i\in Y, 1\le i\le n\}\) be a training set, \(Y=\{\omega _{1},\omega _{2},\ldots ,\omega _{k}\}\) be a set of class labels, and \(C=\{C_{1}, C_{2},\ldots , C_{l}\}\) be a set of classifiers trained on D or on subsets of D. For any \({\varvec{x}} \in R^{d}\), the output of classifier \(C_i\) is a k-dimensional vector \((p_{i1}({\varvec{x}}),p_{i2}({\varvec{x}}),\ldots ,p_{ik}({\varvec{x}}))\), where \(p_{ij}({\varvec{x}}) \in [0,1]\) \((1\le i \le l; 1\le j \le k)\) denotes the support degree given by classifier \(C_{i}\) to the hypothesis that \({\varvec{x}}\) comes from class \(\omega _{j}\), and \(\sum _{j=1}^{k}p_{ij}({\varvec{x}})=1\).

Given \(C=\{C_{1}, C_{2},\ldots , C_{l}\}\), \(Y=\{\omega _{1},\omega _{2},\ldots ,\omega _{k}\}\), and an arbitrary test sample \({\varvec{x}}\), the following matrix is called the decision matrix with respect to \({\varvec{x}}\):

$$\begin{aligned} DM({\varvec{x}})= \begin{bmatrix} p_{11}({\varvec{x}}) &{}\cdots &{}p_{1j}({\varvec{x}}) &{}\cdots &{}p_{1k}({\varvec{x}})\\ \vdots &{} &{}\vdots &{} &{}\vdots \\ p_{i1}({\varvec{x}}) &{}\cdots &{}p_{ij}({\varvec{x}}) &{}\cdots &{}p_{ik}({\varvec{x}})\\ \vdots &{} &{}\vdots &{} &{}\vdots \\ p_{l1}({\varvec{x}}) &{}\cdots &{}p_{lj}({\varvec{x}}) &{}\cdots &{}p_{lk}({\varvec{x}}) \end{bmatrix} \end{aligned}$$
(6)

In the matrix \(DM({\varvec{x}})\), the ith row is the output of classifier \(C_{i}\), and the jth column contains the support degrees from classifiers \(C_{1}, C_{2},\ldots ,C_{l}\) for class \(\omega _{j}\).

Let P(C) be the power set of C. A fuzzy measure on C is a set function \(g: P(C) \rightarrow [0, 1]\) which satisfies the following two conditions:

  1. \(g(\varnothing )=0\), \(g(C)=1\);

  2. For all \(A, B \subseteq C\), if \(A \subset B\), then \(g(A)\le g(B)\).

For \(A, B \subseteq C\) with \(A \cap B= \varnothing\), g is called a \(\lambda\)-fuzzy measure if it satisfies the following condition:

$$\begin{aligned} g(A \cup B)=g(A)+g(B)+\lambda g(A)g(B) \end{aligned}$$
(7)

where \(\lambda > -1\) and \(\lambda \ne 0\).

The value of \(\lambda\) can be determined by solving the following Eq. (8).

$$\begin{aligned} \lambda +1=\prod _{i=1}^{l}(1+\lambda g_{i}) \end{aligned}$$
(8)

where \(g_{i}=g(\{C_{i}\})\) is usually determined by the following formula (9) [46]:

$$\begin{aligned} g_i=\frac{p_i}{\sum _{j=1}^{l}p_{j}}\delta . \end{aligned}$$
(9)

where \(\delta \in [0,\, 1]\) and \(p_i\) is the testing accuracy or validation accuracy of classifier \(C_i\) \((1\le i\le l)\).
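As a concrete illustration, the sketch below computes the fuzzy densities \(g_i\) by formula (9) and then solves Eq. (8) for \(\lambda\) numerically; the use of SciPy's `brentq` root finder and the chosen search brackets are our own assumptions, since Eq. (8) only constrains \(\lambda > -1\), \(\lambda \ne 0\).

```python
import numpy as np
from scipy.optimize import brentq

def fuzzy_densities(accuracies, delta=1.0):
    """Formula (9): g_i proportional to the validation accuracy of C_i."""
    p = np.asarray(accuracies, dtype=float)
    return delta * p / p.sum()

def solve_lambda(g):
    """Solve Eq. (8): lambda + 1 = prod_i (1 + lambda * g_i), lambda > -1, != 0."""
    g = np.asarray(g, dtype=float)
    f = lambda lam: np.prod(1.0 + lam * g) - lam - 1.0
    if np.isclose(g.sum(), 1.0):
        return 0.0                            # measure is already additive
    if g.sum() > 1.0:
        return brentq(f, -1.0 + 1e-10, -1e-10)  # non-zero root lies in (-1, 0)
    return brentq(f, 1e-10, 1e6)              # otherwise in (0, +inf); upper bound is heuristic
```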

Let \(h:C \rightarrow [0, 1]\) be a function defined on C. The Choquet fuzzy integral of function h with respect to g is defined by the following Eq. (10).

$$\begin{aligned} (C)\int h \,dg =\sum \limits _{i=2}^{l+1}\left( h(C_{i-1})-h(C_{i})\right) g(F_{i-1}) \end{aligned}$$
(10)

where \(h(C_{1})\ge h(C_{2})\ge \cdots \ge h(C_{l})\), \(h(C_{l+1})=0\), \(F_{i-1}=\{C_1, C_2, \ldots , C_{i-1}\}\).

Given a test instance \({\varvec{x}}\), when we use the fuzzy integral to fuse the l base classifiers \(C_{1}, C_{2},\ldots , C_{l}\) for classifying \({\varvec{x}}\), the process includes three steps. First, compute the decision matrix \(DM({\varvec{x}})\). Second, sort the jth \((1\le j\le k)\) column of \(DM({\varvec{x}})\) in descending order to obtain \((p_{i_{1}j}, p_{i_{2}j}, \ldots , p_{i_{l}j})\). Finally, calculate the support degree \(p_{j}({\varvec{x}})\) by the following formula (11).

$$\begin{aligned} p_{j}({\varvec{x}})=\sum _{t=2}^{l+1}\left( p_{i_{t-1}j}({\varvec{x}})-p_{i_{t}j}({\varvec{x}})\right) g(F_{t-1}) \end{aligned}$$
(11)
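The three-step fusion procedure of formula (11) can be sketched as follows; the nested measures \(g(F_{t})\) are built incrementally from the densities \(g_i\) and \(\lambda\) via the rule of Eq. (7), and the function names are our own illustrative choices.

```python
import numpy as np

def measures_of_prefixes(g_sorted, lam):
    """g(F_1), ..., g(F_l) for the nested subsets F_t = {C_{i_1}, ..., C_{i_t}},
    built incrementally with the lambda-rule of Eq. (7)."""
    values, acc = [], None
    for gi in g_sorted:
        acc = gi if acc is None else acc + gi + lam * acc * gi
        values.append(acc)
    return values

def fuse_by_choquet(DM, g, lam):
    """Formula (11): fuse the decision matrix DM (l x k) into one support
    degree per class and return the fused vector and predicted class index."""
    l, k = DM.shape
    fused = np.zeros(k)
    for j in range(k):
        order = np.argsort(-DM[:, j])        # sort column j in descending order
        p = np.append(DM[order, j], 0.0)     # p_{i_1 j} >= ... >= p_{i_l j}, p_{i_{l+1} j} = 0
        gF = measures_of_prefixes(np.asarray(g)[order], lam)
        fused[j] = sum((p[t - 1] - p[t]) * gF[t - 1] for t in range(1, l + 1))
    return fused, int(np.argmax(fused))
```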

The pseudo code of the proposed two-class imbalanced data classification algorithm based on classifier fusion via fuzzy integral is given in Algorithm 2.

Algorithm 2 The proposed two-class imbalanced data classification algorithm based on classifier fusion via fuzzy integral (pseudo code)

4 Experimental results and analyses

4.1 Datasets and experimental environments

To demonstrate the superiority of the proposed framework, denoted by GANDO (generative adversarial network based diverse oversampling), we conducted extensive experiments on 11 datasets, including 8 numeric datasets and 3 image datasets. We use the 8 numeric datasets to compare GANDO with 11 SMOTE related state-of-the-art approaches, namely SMOTE [23], B-SMOTE [47], ADASYN [48], CCR [49], ANS [50], K-SMOTE [24], NRPSOS [51], OUPS [52], GAN [8], AC-GAN [34], and MFC-GAN [35], and we use the 3 image datasets to compare GANDO with 4 GAN related state-of-the-art methods, namely AUGMENT [53], GAN [8], AC-GAN [34], and MFC-GAN [35]. The 8 numeric datasets include 1 artificial dataset, 4 KEEL datasets [54], and 3 liver datasets [55]. The basic information of the 11 datasets is given in Table 1. All experiments were carried out on the same hardware platform with an Intel(R) Core(TM) i7-6600k CPU @ 3.10 GHz, 16.0 G memory, and a 64-bit Mac operating system. The programming environment consists of PyCharm Community Edition 2017.1.1, scikit-learn, smote-variants and keras. Our code is publicly available at https://github.com/xichie/oversample.

Table 1 The basic information of the 11 datasets

In Table 1, \(\text {IR}=\frac{|S^{-}|}{|S^{+}|}\). Gaussian is a two-dimensional artificial dataset whose two classes follow two Gaussian distributions; the mean vectors and covariance matrices of the two Gaussian distributions are given in Table 2. The artificial dataset Gaussian is used for illustrating the feasibility of the proposed approach and for visualizing the generated synthetic samples.

Table 2 The mean vectors and covariance matrices of two Gaussian distributions

The three well known image datasets are not imbalanced, so we transform them into imbalanced ones. The purpose of selecting these three datasets is to demonstrate the feasibility and effectiveness of the proposed method on image data.

MNIST is a handwritten digit dataset which includes 70,000 \(28\times 28\) grayscale images; the training and test sets contain 60,000 and 10,000 images respectively. We randomly select 100 images from the digit-zero class as positive class samples, and put the images of the other classes together as the negative class.

Fashion-MNIST is similar to MNIST; it also includes 70,000 \(28\times 28\) grayscale images of fashion products from 10 categories. We randomly select 100 images from the T-Shirt class as positive class samples, and put the images of the other classes together as the negative class.

Cifar10 consists of 60,000 \(32\times 32\) colour images from 10 object classes, with 6000 images per class; the training and test sets contain 50,000 and 10,000 images respectively. We randomly select 100 images from the airplane class as positive class samples, and put the images of the other classes together as the negative class.
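As an illustration of how such an imbalanced version can be constructed, the sketch below uses `keras.datasets` (part of the environment listed in Sect. 4.1) to build the MNIST variant with 100 positive (digit-zero) samples; the fixed random seed is our own addition for reproducibility and is not stated in the original.

```python
import numpy as np
from tensorflow.keras.datasets import mnist

def build_imbalanced_mnist(n_positive=100, seed=0):
    """Positive class: 100 randomly chosen training images of digit 0.
    Negative class: all training images of the remaining digits."""
    (x_train, y_train), _ = mnist.load_data()
    rng = np.random.default_rng(seed)
    zero_idx = np.where(y_train == 0)[0]
    pos_idx = rng.choice(zero_idx, size=n_positive, replace=False)
    x_pos = x_train[pos_idx]           # S^+ (100 images)
    x_neg = x_train[y_train != 0]      # S^- (all other digits)
    return x_pos, x_neg
```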

4.2 Performance evaluation measures

The performance evaluation measures used include MMD-score [56], Silhouette-score [57], F-measure [58], G-mean [58], and AUC-area [58]. The MMD is a statistic for measuring the mean squared difference between two sets of samples. Given two sets of samples \(\mathbf {X}=\{\mathbf {x}_{i}\},\,1\le i\le n\) and \(\mathbf {Y}=\{\mathbf {y}_{j}\},\,1\le j\le m\), the MMD of \(\mathbf {X}\) and \(\mathbf {Y}\) is defined by Eq. (12).

$$\begin{aligned} \begin{aligned} \text {MMD}&=\biggl \Vert \frac{1}{n}\sum _{i=1}^{n}\phi (\mathbf {x}_{i}) -\frac{1}{m}\sum _{j=1}^{m}\phi (\mathbf {y}_{j})\biggr \Vert ^{2}\\&=\frac{1}{n^2}\sum _{i=1}^{n}\sum _{i'=1}^{n}\phi (\mathbf {x}_{i})^T\phi (\mathbf {x}_{i'})\\&\quad -\frac{2}{nm}\sum _{i=1}^{n}\sum _{j=1}^{m}\phi (\mathbf {x}_{i})^T\phi (\mathbf {y}_{j})\\&\quad +\frac{1}{m^2}\sum _{j=1}^{m}\sum _{j'=1}^{m}\phi (\mathbf {y}_{j})^T\phi (\mathbf {y}_{j'}) \end{aligned} \end{aligned}$$
(12)

In Eq. (12), \(\phi (\cdot )\) is a kernel mapping; using the kernel trick, Eq. (12) can be written as Eq. (13).

$$\begin{aligned} \text {MMD}= & {} \frac{1}{n^2}\sum _{i=1}^{n}\sum _{i'=1}^{n}k(\mathbf {x}_{i}, \mathbf {x}_{i'})\nonumber \\&-\frac{2}{nm}\sum _{i=1}^{n}\sum _{j=1}^{m}k(\mathbf {x}_{i}, \mathbf {y}_{j})\nonumber \\&+\frac{1}{m^2}\sum _{j=1}^{m}\sum _{j'=1}^{m}k(\mathbf {y}_{j}, \mathbf {y}_{j'}) \end{aligned}$$
(13)
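A direct implementation of the kernelised MMD in Eq. (13) might look as follows; the RBF kernel and its default bandwidth are illustrative choices, since the kernel used for the MMD-score is not restated here.

```python
from sklearn.metrics.pairwise import rbf_kernel

def mmd_score(X, Y, gamma=None):
    """Squared MMD between sample sets X (n x d) and Y (m x d), Eq. (13)."""
    Kxx = rbf_kernel(X, X, gamma=gamma)    # k(x_i, x_i')
    Kyy = rbf_kernel(Y, Y, gamma=gamma)    # k(y_j, y_j')
    Kxy = rbf_kernel(X, Y, gamma=gamma)    # k(x_i, y_j)
    return Kxx.mean() - 2.0 * Kxy.mean() + Kyy.mean()
```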

The Silhouette coefficient (Silhouette-score) is an evaluation index of clustering algorithms. Given a sample \(\mathbf {x}\) which belongs to cluster A, the Silhouette coefficient of \(\mathbf {x}\) is defined by Eq. (14).

$$\begin{aligned} s(\mathbf {x})=\frac{b(\mathbf {x})-a(\mathbf {x})}{\max \{a(\mathbf {x}), b(\mathbf {x})\}} \end{aligned}$$
(14)

where \(a(\mathbf {x})\) is the average dissimilarity of sample \(\mathbf {x}\) to all other samples of A, and \(b(\mathbf {x})=\min _{\text {C}\ne \text {A}}d(\mathbf {x}, \text {C})\), where \(d(\mathbf {x}, \text {C})\) is the average dissimilarity of sample \(\mathbf {x}\) to all samples of cluster C. With respect to a cluster (or a set) A, the Silhouette coefficient of A is \(s(\text {A})=\frac{1}{|A|}\sum _{\mathbf {x}\in \text {A}}s(\mathbf {x})\). From Eq. (14), it is easy to see that the value of \(s(\mathbf {x})\) lies in \([-1,1]\), and the closer the value of \(s(\mathbf {x})\) is to 1, the better the separability.
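In practice, the average Silhouette coefficient of Eq. (14) can be obtained directly from scikit-learn (part of the experimental environment in Sect. 4.1); the sketch below scores how well the generated positive samples separate from the negative class, and the two-cluster label assignment is our own illustrative framing of the separability measure.

```python
import numpy as np
from sklearn.metrics import silhouette_score

def separability(S_neg, S_generated):
    """Average Silhouette coefficient, Eq. (14): values closer to 1 mean the
    generated positive samples are better separated from the negative class."""
    X = np.vstack([S_neg, S_generated])
    labels = np.hstack([np.zeros(len(S_neg)), np.ones(len(S_generated))])
    return silhouette_score(X, labels)
```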

4.3 Network architecture and parameter settings

For the two different kinds of datasets, we employ different network architectures and parameter settings. Specifically, for the 8 numeric datasets, the generator and the classifier are both single hidden layer feedforward neural networks, and the dimension of the noise \({\varvec{z}}\) is uniformly set to 100. The other parameters, including the number of hidden nodes of the generator (denoted by \(\sharp\)HNodesG), the number of hidden nodes of the classifier (denoted by \(\sharp\)HNodesC), the number of iterations (n), the number of training rounds (k), the weight parameter \(\lambda\), and the number of samples generated per oversampling step (denoted by \(\sharp\)Oversampling), are given in Table 3. In the second stage, we use support vector machines as the base classifiers, which are fused via fuzzy integral to classify the two-class imbalanced data.

Table 3 Model parameter settings used for 8 numeric datasets

For the 3 image datasets, because MNIST and Fashion-MNIST are single channel grayscale image datasets, the generator and classifier both use the same fully connected networks; since Cifar10 is a three channel color image dataset, the generator and classifier both use convolutional neural networks. ADAM is used as the optimization method, and the mini-batch size is 100. The network structures of G and C are given in Table 4, and the other model parameter settings are given in Table 5. In the second stage, we use two different neural networks as the base classifiers, which are fused via fuzzy integral to classify the two-class imbalanced data. For MNIST and Fashion-MNIST, we use the same neural network as the base classifier, whereas a different neural network is employed for Cifar10; the architectures of the two neural networks are given in Table 6. Regarding the parameter choice, we use a grid search strategy and pick the values which result in the best performance, for example, the numbers of hidden nodes of the generator and classifier networks used for the 8 numeric datasets (Table 3) and for the 3 image datasets (Table 5). For each dataset, we determine the suitable numbers of hidden nodes by grid search over the same interval [50, 150].
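The grid search over the number of hidden nodes can be sketched as a plain loop over candidate values in [50, 150]; the step size and the `evaluate_model` helper are our own assumptions, since the paper does not specify them.

```python
def grid_search_hidden_nodes(evaluate_model, candidates=range(50, 151, 10)):
    """Pick the hidden-layer size in [50, 150] that yields the best score.
    evaluate_model(n_hidden) is assumed to train the model with n_hidden
    hidden nodes and return a validation score (e.g. F-measure)."""
    best_n, best_score = None, float("-inf")
    for n_hidden in candidates:
        score = evaluate_model(n_hidden)
        if score > best_score:
            best_n, best_score = n_hidden, score
    return best_n, best_score
```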

Table 4 The network structures of generator and classifier used for 3 image datasets
Table 5 Model parameter settings used for 3 image datasets
Table 6 The architectures of the two different neural networks

4.4 Comparisons with 11 SMOTE related state-of-the-art approaches on the 8 numeric datasets

We use 5-fold cross validation to experimentally compare the proposed method GANDO with the 11 SMOTE related state-of-the-art approaches on 5 measures: MMD-score, Silhouette-score, F-measure, G-means, and AUC-area, and we visualize the generated synthetic samples on the artificial dataset to demonstrate the effectiveness and superiority of the proposed approach GANDO. The experimental results of MMD-score compared with the 11 SMOTE related state-of-the-art approaches on the 8 numeric datasets are given in Table 7, and the corresponding results of Silhouette-score are given in Table 8.

Table 7 Experimental comparison of MMD-score on the 8 numeric datasets

From the experimental results listed in Table 7, the MMD-scores of the proposed method GANDO on 7 numeric datasets are greater than those of the 10 SMOTE related state-of-the-art approaches, which means that the positive class samples oversampled by GANDO have better diversity than those of the 10 SMOTE related state-of-the-art approaches. This conclusion is further confirmed by the visualization of the generated synthetic positive class samples on the artificial dataset (see Fig. 4). In Fig. 4, the yellow “−” represents a negative class sample, the blue “+” represents a positive class sample, and the red “+” represents a generated positive class sample. It can be seen from Fig. 4 that the samples generated by the proposed method GANDO have better diversity than those of the 11 SMOTE related state-of-the-art approaches. Although MFC-GAN has good diversity, it has poor separability, i.e. the generated synthetic positive class samples overlap with the original negative samples. K-SMOTE is an exception in that it cannot generate synthetic positive class samples on the artificial dataset. This is due to its oversampling mechanism: K-SMOTE first uses k-means to cluster the artificial dataset; then, for each cluster, it calculates the cluster's IR and selects the clusters whose IR is less than a threshold for oversampling with SMOTE. In our experiments, the threshold is set to 2.0. Since the IR of each cluster is greater than 2.0, no oversampling is performed.

It is well known that the better the diversity of the generated synthetic positive class samples, the better their quality. High quality synthetic positive class samples can effectively expand the training field of the positive class samples and effectively improve the performance of the proposed classification algorithm. This point is confirmed by the experimental results on the three classification performance metrics: F-measure, G-means, and AUC-area (see Tables 9, 10, 11). The reason why the proposed method GANDO can generate synthetic positive class samples with good diversity is that we introduce a regularization term of intra-class divergence into the loss function of the improved GAN model.

Fig. 4 The visualization of the generated synthetic positive class samples on the artificial dataset

Table 8 Experimental comparison of Silhouette-score on the 8 numeric datasets

From the experimental results listed in Table 8, the Silhouette-scores of the proposed method GANDO on 7 numeric datasets are greater than those of the 10 SMOTE related state-of-the-art approaches, which demonstrates that the positive class samples oversampled by GANDO also have better separability than those of the 10 SMOTE related state-of-the-art approaches. This conclusion is further confirmed by the visualization of the generated synthetic positive class samples on the artificial dataset (see Fig. 4). It can be seen from Fig. 4 that the samples generated by the proposed method GANDO have better separability than those of the 11 SMOTE related state-of-the-art approaches. Although B-SMOTE, ANS, and NRPSOS have good separability, they have low diversity.

Table 9 Experimental comparisons of F-measure on the 8 numeric datasets
Table 10 Experimental comparison of G-mean on the 8 numeric datasets
Table 11 Experimental comparison of AUC-area on the 8 numeric datasets

The experimental results of F-measure, G-means and AUC-area compared with the 11 SMOTE related state-of-the-art approaches on the 8 numeric datasets are given in Tables 9, 10 and 11 respectively. From these results, it is observed that (a) the F-measure values of the proposed method GANDO are greater than those of the 11 SMOTE related state-of-the-art approaches on all 8 numeric datasets; (b) the G-means and AUC-area values of the proposed method GANDO are greater than those of the 11 SMOTE related state-of-the-art approaches on 6 and 7 numeric datasets respectively. Overall, the proposed method GANDO outperforms the 11 SMOTE related state-of-the-art approaches in terms of F-measure, G-means, and AUC-area. We believe the reasons include the following three points:

  1. Introducing the regularization term of intra-class divergence into the loss function of the GAN guarantees good diversity of the generated synthetic positive class samples. Good diversity can effectively expand the training field of the positive class samples.

  2. Introducing the Silhouette-score guarantees good separability between the generated synthetic positive class samples and the negative class samples, and the combination of MMD-score and Silhouette-score further improves the quality of the generated synthetic positive class samples, all of which contribute to the good performance of the proposed method GANDO.

  3. Since the base classifiers are trained on balanced training sets containing the same set of oversampled positive class samples, intrinsic interactions exist among the base classifiers. The interactions may be positively correlated, in which case the base classifiers enhance each other, or negatively correlated, in which case the base classifiers suppress each other. The fuzzy integral can well model these interactions, which enhances the generalization performance of the ensemble classifier.

From the experimental results on F-measure and G-mean listed in Tables 9 and 10, we find that some traditional methods (e.g. ADASYN, CCR, and ANS) obtained exceptional results on the liver 1 dataset; we believe the reason is that this dataset has a very high IR. Nevertheless, the proposed method GANDO obtained competitive results on this severely imbalanced dataset.

4.5 Comparisons with the 4 GAN related state-of-the-art methods on the 3 image datasets

It is well known that GAN can generate realistic images and can thus be viewed as a data augmentation technique, while oversampling is also a form of data augmentation. In order to further demonstrate the effectiveness of GANDO for classifying imbalanced image datasets, we conduct experiments on three famous image datasets to compare GANDO with 4 GAN related state-of-the-art methods: AUGMENT, GAN, AC-GAN, and MFC-GAN. The experimental results of the 5 evaluation measures compared with the 4 GAN related state-of-the-art approaches on the 3 image datasets are listed in Tables 12, 13, 14, 15 and 16.

Table 12 Experimental comparison of MMD-score on the 3 image datasets
Table 13 Experimental comparison of Silhouette-score on the 3 image datasets
Table 14 Experimental comparison of F-measure on the 3 image datasets
Table 15 Experimental comparison of G-mean on the 3 image datasets
Table 16 Experimental comparison of AUC-area on the 3 image datasets

From the experimental results of MMD-score listed in Table 12, we find that the proposed method GANDO obtained the maximum on Fashion-MNIST, while AC-GAN obtained the other two maxima, on MNIST and Cifar10. However, AC-GAN has poor separability on the three image datasets, while GANDO has much better separability than AC-GAN (see Table 13). In other words, GANDO achieves a better tradeoff between diversity and separability, which results in GANDO outperforming AC-GAN in classification performance, as confirmed by the experimental results of F-measure, G-means, and AUC-area listed in Tables 14, 15 and 16. From the experimental results of Silhouette-score listed in Table 13, we find that the proposed method GANDO obtained two maxima, on MNIST and Cifar10, while MFC-GAN obtained the other maximum, on Fashion-MNIST. Comparing GANDO and MFC-GAN on the three classification performance measures, i.e. F-measure, G-means, and AUC-area, GANDO is superior to MFC-GAN on the two image datasets MNIST and Cifar10, while MFC-GAN is superior to GANDO on Fashion-MNIST. In summary, GANDO outperforms the other 4 GAN related state-of-the-art methods.

5 Conclusions

Based on an improved GAN model and a classifier fusion mechanism via fuzzy integral, a framework for classifying imbalanced data was proposed in this paper. The framework contains an oversampling method and an ensemble classification approach for the classification of imbalanced data. The oversampling method is based on the improved GAN model, which introduces a regularization term of intra-class divergence into the loss function of the GAN and replaces the discriminator of the GAN with a classifier with three outputs. The ensemble classification approach is based on the fuzzy integral. Since the base classifiers are trained on balanced training sets containing the same positive class set, there are intrinsic interactions among the base classifiers. The fuzzy integral can well model these interactions and thus effectively enhances the classification performance. The proposed classification framework has four advantages: (1) It can generate synthetic positive class samples with good diversity and good separability. (2) The improved GAN model can effectively avoid mode collapse. (3) It has good classification generalization performance due to diverse oversampling and controllable separability. (4) It is effective not only for datasets with a medium imbalance ratio, but also for datasets with a very high imbalance ratio. Promising future works of this study include (1) extending GANDO to multi-class imbalanced data classification; (2) expanding the scalability of GANDO to imbalanced big data scenarios.