1 Introduction

Classification is an essential task in knowledge discovery and data mining, and it becomes particularly challenging when the data is imbalanced. Most real-world datasets are imbalanced to some degree: examples include identifying rare diseases such as cancer in medicine [1] or detecting fraudulent transactions in banks [2]. Learning from datasets in which one class (the minority) has far fewer samples than another (the majority) produces classifiers biased toward the majority class. Traditional algorithms therefore perform poorly on imbalanced data and assign most samples to the majority class. On such datasets, the overall prediction accuracy can exceed 90% while the accuracy on the minority class remains much lower. However, in imbalanced classification the minority class is usually the more valuable one, and the goal is to learn its examples accurately. Hence, reliable new methods are needed so that models can identify these rare but useful examples.

Techniques for the class imbalance problem are divided into data-level and algorithm-level methods [3]. Data-level methods rebalance the class distribution by manipulating the data space. Over-sampling of the minority class and under-sampling of the majority class are data-level methods: over-sampling balances the class distribution by adding examples to the minority class, while under-sampling balances the dataset by removing samples from the majority class [4]. Several techniques exist for both. The simplest way to rebalance an imbalanced dataset is random sampling, which comes in two forms: random over-sampling and random under-sampling. Random over-sampling repeats samples from the minority class until the desired class ratio is obtained [5]; likewise, random under-sampling randomly deletes samples of the majority class to achieve the desired ratio [5]. More advanced over-sampling and under-sampling methods exist [6], and several combinations of these techniques have also improved classification performance on imbalanced datasets [7,8,9]. Algorithm-level methods force algorithms to learn minority class samples, for example by adding a penalty cost. Cost-sensitive learning [10] and recognition-based learning [11] are algorithm-level methods.
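As a concrete illustration, the following minimal NumPy sketch shows the two random sampling strategies described above; the function names, the toy data, and the target sizes are ours, not taken from the cited works.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_oversample(X_min, target_size):
    # Repeat randomly chosen minority samples until target_size is reached.
    extra = rng.integers(0, len(X_min), size=target_size - len(X_min))
    return np.vstack([X_min, X_min[extra]])

def random_undersample(X_maj, target_size):
    # Randomly keep only target_size samples of the majority class.
    keep = rng.choice(len(X_maj), size=target_size, replace=False)
    return X_maj[keep]

# Toy imbalanced data: 100 majority vs. 10 minority samples.
X_maj, X_min = rng.normal(size=(100, 2)), rng.normal(size=(10, 2)) + 3.0
print(random_oversample(X_min, 100).shape)   # (100, 2)
print(random_undersample(X_maj, 10).shape)   # (10, 2)
```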

Under-sampling and over-sampling each have advantages and disadvantages. The main disadvantage of under-sampling is the loss of information contained in the samples deleted from the training data [12]; on the other hand, as the training set shrinks, execution speed increases. Over-sampling avoids data loss, but enlarging the training set slows execution. An over-sampling algorithm may add duplicate or newly generated minority samples, and beyond the increased execution time, excessive repetition of examples can lead to over-fitting [13].

Boosting is another method used to improve classification performance on imbalanced data [4]. It can be applied to both balanced and imbalanced data. AdaBoost [14] is one of the most common boosting algorithms; it iteratively produces a set of models, increasing the weights of misclassified examples at each iteration so that in subsequent iterations they have a better chance of being selected and learned. At the end of the iterations, all classifiers vote to classify unseen samples. When dealing with class imbalance, misclassified minority instances receive more weight in subsequent iterations, so the algorithm pays more attention to them and learns them better. Boosting can be viewed as an advanced form of data sampling [14] and can be applied in two ways: re-weighting or re-sampling [15]. In re-weighted boosting, the modified example weights are passed directly to the base learner at each iteration. Since not all learning algorithms can use this weight information [16], re-sampled boosting is often more appropriate. In re-sampled boosting, instead of passing the sample weights to the learner, the training data is re-sampled according to those weights: a new training dataset is created by sampling (with replacement or with new samples), in which higher-weight samples appear several times, so the classifier is biased toward these examples and learns them better (a sketch follows below). Several re-sampled boosting algorithms for imbalanced data have recently been proposed, such as RUSBoost [17] and SMOTEBoost [18]. RUSBoost applies under-sampling during the boosting process, randomly removing samples from the majority class; as noted above, such removal risks losing information. SMOTEBoost instead applies over-sampling during boosting to generate synthetic minority samples, although this may lead to over-fitting.
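To make the re-sampling variant concrete, the sketch below draws a bootstrap training set in proportion to the current boosting weights; this is a generic illustration of weight-based re-sampling, not the exact routine of any cited algorithm.

```python
import numpy as np

def resample_by_weight(X, y, weights, rng):
    # Samples with larger boosting weights are more likely to be drawn,
    # so the base learner sees them repeated and is biased toward them.
    p = weights / weights.sum()
    idx = rng.choice(len(X), size=len(X), replace=True, p=p)
    return X[idx], y[idx]

rng = np.random.default_rng(0)
X, y = rng.normal(size=(8, 2)), np.array([0, 0, 0, 0, 0, 0, 1, 1])
w = np.array([1, 1, 1, 1, 1, 1, 5, 5], dtype=float)  # minority samples up-weighted
X_res, y_res = resample_by_weight(X, y, w, rng)
print(y_res)  # the two minority examples now appear several times on average
```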

To partially alleviate these problems, this paper first proposes a novel technique for under-sampling the majority class in imbalanced datasets. We then propose a new boosting-based algorithm for learning from imbalanced datasets, named OUBoost, which combines the proposed Peak under-sampling algorithm with an over-sampling technique (SMOTE) in the boosting procedure. We present a comprehensive description of OUBoost and compare its performance with other boosting-based algorithms, such as SMOTEBoost and RUSBoost, as well as state-of-the-art algorithms. We evaluate the algorithms using several metrics, including Recall, MCC, G-mean, and F-score, on 30 imbalanced datasets, and we also report time comparisons and statistical tests.

The rest of this paper is organized as follows: Sect. 2 reviews related work, and Sect. 3 explains the proposed algorithms. Sections 4 and 5 provide the details of the experiments and the experimental setup. We conclude in Sect. 6.

2 Related work

Recently, considerable research has addressed imbalanced data, and various techniques have been proposed to deal with the class imbalance problem [19]. These methods operate at the data level or the algorithm level. Data-level methods sample the minority or majority class to reduce the imbalance.

Chawla et al. [18] propose the SMOTEBoost algorithm, which addresses the imbalance problem during the boosting process by creating synthetic examples from the minority class. Following SMOTEBoost, RUSBoost (Random Under-Sampling Boosting) was introduced to manage the imbalance problem by deleting samples from the majority class during boosting [17].

In [20], Random Hybrid Sampling based on Boosting (RHSBoost) is used to handle the imbalance problem. This algorithm employs random under-sampling and random over-sampling within the boosting algorithm, allowing each base classifier to focus on minority class examples while using a newly balanced training set that retains the characteristics of the original data. The technique achieves stable and impressive classification performance on real-world data.

Popel et al. [21] introduce a Hybrid Under-Sampling based on Boosting (HUSBoost) approach to manage imbalanced data. The algorithm consists of three main steps: cleaning, data balancing, and classification. HUSBoost uses Tomek links to clean the data by removing noisy or overlapping samples; the dataset is then divided into balanced subsets by random sampling without replacement, and classification is performed. Unlike traditional algorithms, which optimize overall accuracy and often ignore minority class samples, such methods aim to learn the minority class properly.

In [22], LIUBoost combines a sampling technique with cost-sensitive learning in the boosting process. The algorithm divides majority and minority samples into categories and assigns high costs to difficult samples in imbalanced datasets. Although cost-sensitive methods suffer from the difficulty of assigning domain-specific costs, and over-sampling methods risk over-fitting and increased execution time, the results of LIUBoost are significant.

In [23], a novel ensemble method is proposed for classifying imbalanced data. Its base classifier is a reduced kernelized weighted extreme learning machine (WELM), which handles the class imbalance problem more efficiently. The algorithm generates balanced training subsets using random undersampling, whose samples serve as the centroids of the reduced kernelized WELM classifier. The base classifiers are generated sequentially: the majority class samples misclassified by the first base classifier, together with all minority class samples, are selected as the centroids of the next base classifier, so the centroid samples differ across classifiers.

MTSbag combines the Mahalanobis–Taguchi system (MTS) and bagging-based ensemble learning to extend the ability of conventional MTS to handle imbalanced data classification [24]. MTS is strong at addressing class imbalance, and bagging can reduce the learning bias of classification algorithms; MTSbag is therefore a useful method, especially for datasets with high imbalance levels.

SMOTE-CSELM is a novel SMOTE-based class-specific extreme learning machine introduced to manage imbalanced data classification [25]. It is a variant of the class-specific extreme learning machine (CS-ELM) that gains the advantages of both minority oversampling and class-specific regularization. SMOTE-CSELM uses the synthetic minority oversampling technique (SMOTE), which increases the influence of minority class samples on the decision region of the classifier, but it suffers from fluctuations caused by the random initialization of the weights between the input and hidden layers.

In [26], a new technique is developed to address the shortcomings of SMOTE-CSELM. It uses a SMOTE-based class-specific kernelized extreme learning machine (SMOTE-CSKELM) with a Gaussian kernel function to map the input data to the feature space. SMOTE-CSKELM offers the advantages of minority oversampling and class-specific regularization coefficients, and its Gaussian kernel function avoids the non-optimal hidden node problem. SMOTE is used to generate synthetic minority class samples to balance the training dataset.

Jiang et al. [27] propose a boosting-based algorithm to handle imbalanced data. It combines static and dynamic re-sampling techniques and uses fuzzy entropy and fuzzy support in a boosting-based random forest (FESBoost). Static re-sampling decreases the imbalance ratio of the dataset, while dynamic re-sampling updates the training data. FESBoost uses the density peak clustering algorithm (DPCA) to select representative samples that are effective for training.

DBRF is a Density-Based Random Forest algorithm developed to improve prediction performance on imbalanced datasets [28]. It detects borderline samples using a density-based method and augments them, uses two different random forest classifiers to model the augmented boundary samples and the original dataset, and determines the final output with a bagging technique. DBRF can thus resolve the problem of classifying minority samples located on the class boundary.

In [29], a novel ensemble classification method based on kernel density estimation (KDE) is proposed to handle imbalanced data. The method trains each tree in the ensemble on uniquely generated, synthetically balanced data. KDE offers a natural and effective way to generate new minority samples for balancing subsets by estimating the underlying distribution of the data.

K-Means-SMOTE–ENN is a new hybrid bag-boost method introduced to improve resampling techniques on noisy imbalanced datasets [30]. It combines a hybrid bag-boost model of decision trees with a hybrid K-Means SMOTE–edited nearest neighbor (ENN) resampling technique to address noisy class imbalance problems, using the edited nearest neighbor rule as an undersampling method to remove noise-inducing samples.

In [31], a new classification algorithm based on diverse sample generation and classifier fusion is proposed to address drawbacks of SMOTE such as the lack of diversity and strong overlap among generated minority samples. The algorithm uses a generative adversarial network (GAN)-based framework comprising an oversampling method and a two-class imbalanced data classification approach: the oversampling method is based on an improved GAN model, and the classification approach fuses classifiers through a fuzzy integral that can model interactions between base classifiers trained on different balanced subsets.

Most of the aforementioned methods suggest that hybrid approaches are useful for learning minority class samples and can be embedded in the boosting process for imbalanced data classification. In this paper, we propose a novel boosting-based method using a new under-sampling technique based on the Peak clustering algorithm.

3 Boosting-based approach to imbalanced data

As mentioned earlier, the number of examples per class differs in many application domains, leading to the imbalance problem in machine learning. Traditional classifiers cannot properly classify minority class examples in imbalanced datasets: although the overall accuracy on such datasets is high, most minority instances are misclassified, so accuracy is not a proper evaluation metric. To handle this issue, we focus on boosting. A boosting algorithm increases the weights of misclassified examples during the boosting process, so it learns more from hard examples; this is why boosting is a useful approach for classifying imbalanced data. As mentioned, SMOTEBoost and RUSBoost are popular boosting-based methods for imbalanced data. RUSBoost, which balances the data by randomly deleting samples, improves minority class recall but loses many majority class samples. The results reported in [18] show that SMOTEBoost is an effective approach.

In this paper, we propose a new boosting-based algorithm along with a novel under-sampling approach using the Peak clustering method to handle imbalanced data. The Peak under-sampling algorithm performs under-sampling by detecting clusters and selecting useful samples with maximum density and maximum distance from the minority class.

3.1 Density-peak clustering

In [32], a new clustering algorithm named density-peak clustering (DPC) is introduced to identify clusters in datasets with complex structures. DPC finds the correct number of clusters under two assumptions: cluster centers are surrounded by samples with lower local density, and they are relatively far from any sample with higher local density. For each sample \(j\) in the dataset, two measures are computed: its local density \(\eta_j\) and its distance \(\mu_j\) from higher-density samples. The density \(\eta_j\) is defined as follows:

$$\eta_j = \sum_{k \in S,\, k \ne j} \Omega\left(d(j,k) - t_r\right)$$
(1)

where \(d(j,k)\) is the distance between samples \(j\) and \(k\), \(S\) is the dataset, and \(t_r\) is a tuning parameter (a cutoff radius). \(t_r\) is chosen so that the average number of neighbors is about 2% of \(m\), where \(m\) is the total number of samples in the dataset.

$$\Omega(y)=\begin{cases}1 & y<0\\ 0 & \text{otherwise}\end{cases}$$
(2)

In general, \(\eta_j\) is the number of samples in the vicinity of sample \(j\) within radius \(t_r\) [33]. The distance \(\mu_j\) of sample \(j\) is the minimum distance from \(j\) to any other sample of higher density, as defined in (3). For the sample with the maximum local density, \(\mu_j\) is instead the maximum distance between \(j\) and any other point:

$$\mu_j=\begin{cases}\max\left\{d(j,k)\mid k\in S\right\} & \text{if } \forall k\in S,\ \eta_j\ge\eta_k\\ \min\left\{d(j,k)\mid \eta_j<\eta_k,\ k\in S\right\} & \text{otherwise}\end{cases}$$
(3)

Then, to determine the cluster centers from the density and distance values of each sample, a new factor \(\lambda_j\) is defined [32]. The cluster centers are the samples with both large \(\mu\) and large \(\eta\), computed as:

$$\lambda_j = \mu_j \times \eta_j$$
(4)

Next, the λ-values are sorted in descending order, and samples with anomalously high λ-values are selected as cluster centers. Finally, all remaining samples are assigned to clusters using the measure defined in (5): each sample is assigned via its nearest higher-density neighbor, and for the sample with the maximum density, \(\Psi_j\) is set to \(j\):

$$\Psi_j=\begin{cases}j & \text{if } \forall k\in S,\ \eta_j\ge\eta_k\\ \underset{k}{\operatorname{argmin}}\left\{d(j,k)\mid \eta_j<\eta_k\right\} & \text{otherwise}\end{cases}$$
(5)
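The following sketch computes the DPC quantities of Eqs. (1)-(5) with SciPy; the radius value, the number of centers picked at the end, and the simplified nearest-center assignment are illustrative assumptions, not prescribed by [32].

```python
import numpy as np
from scipy.spatial.distance import cdist

def dpc_measures(X, t_r):
    d = cdist(X, X)                            # pairwise distances d(j, k)
    eta = (d < t_r).sum(axis=1) - 1            # Eqs. (1)-(2): neighbors within t_r, excluding self
    mu = np.empty(len(X))
    for j in range(len(X)):
        higher = np.where(eta > eta[j])[0]     # samples with higher local density
        # Eq. (3): max distance for the densest sample, else min distance to a denser one
        mu[j] = d[j].max() if higher.size == 0 else d[j, higher].min()
    return eta, mu, eta * mu                   # Eq. (4): lambda = mu * eta

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(3, 0.3, (50, 2))])
eta, mu, lam = dpc_measures(X, t_r=0.5)
centers = np.argsort(-lam)[:2]                 # anomalously high lambda -> cluster centers
# Simplified assignment to the nearest center; Eq. (5) assigns each sample to
# the cluster of its nearest higher-density neighbor instead.
labels = np.argmin(cdist(X, X[centers]), axis=1)
```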

3.2 Proposed Peak-based undersampling algorithm

As mentioned earlier, under-sampling and over-sampling methods can be used in the boosting process to improve classifier performance on imbalanced datasets. Inspired by our previous work [34], which was developed for intrusion detection, we here propose a new general formulation of under-sampling for imbalanced datasets. The proposed under-sampling technique uses the following steps:

1. In the main dataset, separate the majority class and the minority class. Let \(S_{maj}\) be the majority class with size E and \(S_{min}\) the minority class with size F.

2. Apply the proposed under-sampling algorithm to \(S_{maj}\) to generate \(e\) clusters of the majority class.

3. Select effective samples of \(S_{maj}\) with high density and high distance from \(S_{min}\).

To implement the proposed approach, we define two measures, \(dens_i\) and \(dist_i\), where \(dens_i\) is the density of \(cluster_i\) (\(i \in \{1, 2, \ldots, e\}\)), computed as the sum of the local densities \(\eta_j\) of the samples in \(cluster_i\):

$$dens_i=\sum_{j \in cluster_i}\eta_j$$
(6)

We then define \(dist_i\), the distance between the center of \(cluster_i\) and the minority class, calculated as:

$$dist_i=\sum_{q \in S_{min}} d(p,q),\quad p = \text{centroid of } cluster_i$$
(7)

Given the density and distance values of the clusters, we select the most effective samples using a new measure \(H_i\), which favors samples with maximum density and maximum distance from \(S_{min}\). For each sample in \(S_{maj}\), \(H_i\) is computed as:

$$H_i = u \times dens_i + v \times dist_i$$
(8)

where u + v = 1. Finally, we select samples with maximum \(H\)-values that are not close to each other. To select the samples from \(H\), the new quantity L is computed as follows:

$$L =\left\{x_i \mid x_i, x_j \in H,\ d(i,j) > \theta\right\}$$
(9)

where \(\theta\) is a tuning parameter, and the D samples with \(d(i,j) > \theta\) are selected as \(S_{effective\_maj}\). In the final step, \(S_{effective\_maj}\) and \(S_{min}\) are merged to generate a modified new dataset. This new dataset may still be imbalanced, but its imbalance ratio is lower than that of the original dataset; the target ratio is selected based on the minority class size. Figure 1 shows the proposed Peak-based undersampling algorithm, and a minimal code sketch follows the figure.

Fig. 1
figure 1

The proposed Peak-based undersampling algorithm
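A minimal sketch of the Peak-based undersampling step is given below, assuming a clustering of the majority class (labels and centroids) is already available from DPC; the values of u, θ, the local-density radius t_r, and the number of retained samples are illustrative, not the paper's tuned settings.

```python
import numpy as np
from scipy.spatial.distance import cdist

def peak_undersample(X_maj, X_min, labels, centroids,
                     u=0.5, theta=0.5, n_keep=50, t_r=0.5):
    v = 1.0 - u                                        # u + v = 1 (Eq. 8)
    eta = (cdist(X_maj, X_maj) < t_r).sum(axis=1) - 1  # local densities (Eq. 1)
    H = np.empty(len(X_maj))
    for c in np.unique(labels):
        members = labels == c
        dens = eta[members].sum()                      # cluster density (Eq. 6)
        dist = cdist(centroids[c][None], X_min).sum()  # distance to minority class (Eq. 7)
        H[members] = u * dens + v * dist               # score per sample (Eq. 8)
    keep = []
    for j in np.argsort(-H):                           # highest H-values first
        # Eq. (9): keep only samples farther than theta from already kept ones.
        if all(np.linalg.norm(X_maj[j] - X_maj[k]) > theta for k in keep):
            keep.append(j)
        if len(keep) == n_keep:
            break
    return X_maj[keep]   # S_effective_maj, merged later with S_min
```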

3.3 Proposed OUBoost algorithm

In this paper, we propose a combination of under-sampling and over-sampling methods within the boosting process. We first divide the given dataset into the minority class and the majority class. We then use over-sampling, via the SMOTE algorithm, to generate synthetic data from the minority class; this generated data is added to the original dataset. Next, the proposed Peak-based under-sampling algorithm is applied to the majority class: according to the sample ratio of the minority class, useful and reliable majority class examples with maximum distance from the class border are selected. Thus, instead of removing samples from the majority class, useful and reliable samples are selected and placed in a temporary new dataset together with all minority class samples. This new dataset is then passed to the weak learners. At each iteration, a new temporary dataset is generated from the original dataset for learning and is discarded at the end of the iteration. This process is repeated until the imbalance ratio reaches the desired adjustable value. Finally, the final model is formed by voting over the classifiers.

The main idea of our proposed approach, which differs from the state-of-the-art algorithms, is that each model constructed during the boosting procedure uses its own dataset generated from S by the proposed Peak-based under-sampling and over-sampling in the current iteration. The approach, however, only updates the weights of the original dataset S at each boosting iteration. An overview of the proposed algorithm is depicted in Fig. 2.

Fig. 2
figure 2

The general overview of the proposed boosting-based algorithm

The proposed OUBoost algorithm is presented in Fig. 3. We now explain it in more detail. Let S be the original dataset and D the weights of the examples in S. Let T denote the number of iterations, \(S_t^{\prime}\) the temporary dataset, and \(D_t^{\prime}\) the weights of the examples at iteration t.

Fig. 3
figure 3

The proposed OUBoost

The original dataset S consists of examples {(\(x_1\), \(y_1\)), …, (\(x_m\), \(y_m\))} where \(x_i \in X\) and \(y_i \in Y = \{0, 1\}\); \(C_P\) denotes the minority (positive) class and \(C_N\) the majority (negative) class, with \(C_P < C_N\). Initially, each sample is assigned weight 1/m, where m is the total number of examples.

We now describe the proposed OUBoost algorithm in boosting form. We first generate P synthetic samples from the minority class \(C_P\) and add them to the original dataset S. We then select N useful samples from the majority class using the proposed Peak-based undersampling algorithm and create a new dataset \(S_t^{\prime}\) from all minority class examples together with the selected majority examples. The weak learner is trained on \(S_t^{\prime}\), and the weak hypothesis \(h_t\) is obtained as:

$$h_t: X \times Y \to [0, 1]$$
(10)

Next, the weighted error rate \({\epsilon }_{t}\) for the original dataset S and weight distribution D is computed as follows:

$$\epsilon_t=\sum_{(i,y):\, y_i\ne y} D_t(i)\left(1-h_t(x_i,y_i)+h_t(x_i,y)\right)$$
(11)

Then, the weight update parameter \(\alpha\) is calculated based on \({\epsilon }_{t}\) as:

$${\alpha }_{t}=\frac{{\epsilon }_{t}}{1-{\epsilon }_{t}}$$
(12)

Finally, the weights of the samples are updated and normalized in D as:

$${D}_{t+1}(i)={D}_{t}(i){\alpha }_{t}^{\frac{1}{2}\left(1+{h}_{t}\left({x}_{i},{y}_{i}\right)-{h}_{t}\left({x}_{i},y:y\ne {y}_{i}\right)\right)}$$
(13)
$$D_{t+1}(i)=\frac{D_{t+1}(i)}{Z_t},\quad Z_t=\sum_{i} D_{t+1}(i)$$
(14)

At the end of T iterations, the final output of model \(H(x)\) is obtained as:

$$H(x)=\underset{y\in Y}{\mathrm{argmax}}\sum_{t=1}^{T} \quad {h}_{t}(x,y)\mathrm{log}\frac{1}{{\alpha }_{t}}$$
(15)

Note that when generating the temporary new dataset for training in each iteration, the numbers of samples selected from the majority and minority classes are not necessarily equal: forcing them to be equal could discard a lot of majority class data.

In fact, the temporary new dataset generated in each iteration can itself be imbalanced, as long as its imbalance ratio is lower than that of the original dataset. The ratio can be adjusted depending on the nature of the dataset; for example, if the original imbalance ratio is 20, each iteration can generate a temporary new dataset with an imbalance ratio below 20. Reducing the imbalance ratio enables classifiers to learn properly from the modified datasets; this is a hyper-parameter that can be tuned to the data. Because over-sampling increases the number of minority class samples, the number of samples selected from the majority class also increases when creating the temporary new dataset. By integrating these samples into the newly generated dataset, the classifiers are trained on different data in each iteration, which may improve classification performance.
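The sketch below strings the steps of Eqs. (10)-(15) together for binary labels y ∈ {0, 1}, encoded as an integer array. It uses scikit-learn's decision tree and imbalanced-learn's SMOTE, represents the Peak-based undersampling of Sect. 3.2 by a caller-supplied function, and applies the usual 1/2 factor of the AdaBoost.M2 pseudo-loss so the error stays in [0, 1]; the hyper-parameter values are illustrative only.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from imblearn.over_sampling import SMOTE

def ouboost_fit(X, y, undersample, T=10):
    """undersample(X, y) stands in for the Peak-based step of Sect. 3.2
    and must return a reduced (X', y') with a lower imbalance ratio."""
    m = len(X)
    D = np.full(m, 1.0 / m)                      # initial weights 1/m
    models, alphas = [], []
    for t in range(T):
        X_os, y_os = SMOTE().fit_resample(X, y)  # oversample the minority class
        X_t, y_t = undersample(X_os, y_os)       # temporary dataset S'_t
        h = DecisionTreeClassifier(max_depth=3).fit(X_t, y_t)
        p = h.predict_proba(X)                   # h_t(x, y) in [0, 1] (Eq. 10)
        p_true = p[np.arange(m), y]
        p_false = p[np.arange(m), 1 - y]
        eps = 0.5 * np.sum(D * (1 - p_true + p_false))   # pseudo-loss (Eq. 11)
        eps = np.clip(eps, 1e-10, 1 - 1e-10)             # guard degenerate cases
        alpha = eps / (1 - eps)                          # (Eq. 12)
        D = D * alpha ** (0.5 * (1 + p_true - p_false))  # (Eq. 13)
        D = D / D.sum()                                  # normalize by Z_t (Eq. 14)
        models.append(h)
        alphas.append(alpha)
    return models, alphas

def ouboost_predict(models, alphas, X):
    # Weighted vote of Eq. (15): argmax_y sum_t h_t(x, y) * log(1 / alpha_t)
    votes = sum(np.log(1.0 / a) * h.predict_proba(X) for h, a in zip(models, alphas))
    return votes.argmax(axis=1)
```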

4 Experiments

In the experiments, we compare the performance of the proposed OUBoost algorithm with several other boosting-based algorithms. In this section, we first introduce the used datasets in the experiments. We then address the experimental setup, baseline methods, and evaluation metrics.

4.1 Datasets

The performance of the different methods on imbalanced data is measured using datasets that vary in size and imbalance ratio, as well as synthetic datasets.

4.1.1 Real-world datasets

We use 22 real-world imbalanced datasets: seven from the UCI machine learning repository [35], ten from the KEEL dataset repository [36], and five from the Machine Learning Mastery repository [37], all commonly used in related studies. Table 1 provides the specifications of these datasets, including the number of samples, the number of attributes, the numbers of majority and minority class samples, the minority class name, the imbalance ratio, and the number of classes. The datasets cover a wide variety of sizes, imbalance ratios, and application domains. Since this article focuses on two-class datasets and binary classification, we convert multiclass datasets into two classes: the Glass, Wine, Wheat-seeds, Satimage, Iris, Abalone, and Yeast datasets are converted using the one-versus-all method, labeling the smallest class as the minority class and the rest as the majority class [38], as sketched below.

Table 1 Real-world datasets
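The one-versus-all conversion amounts to the following helper (the function name is ours):

```python
import numpy as np

def smallest_class_vs_rest(y):
    # Label the smallest class as the minority (1) and all other classes as the majority (0).
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[counts.argmin()]
    return (y == minority).astype(int)

y = np.array(["A", "A", "A", "B", "B", "C"])   # toy labels: C is the smallest class
print(smallest_class_vs_rest(y))               # [0 0 0 0 0 1]
```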

4.1.2 Large-scale datasets

Table 2 introduces the three large imbalanced datasets used in our experiments. The CICIDS2017 dataset was developed by the Canadian Institute for Cybersecurity at the University of New Brunswick. It includes a variety of network traffic, both benign and malicious. The dataset contains 2,830,743 samples in total, of which 172,848 belong to the minority class [39].

Table 2 Large-scale datasets

The mammography dataset includes 40,000 mammograms performed between January 2005 and December 2008 on women in the Breast Cancer Surveillance Consortium [40]. Of these 40,000 samples, 259 are positive examples of breast cancer.

The credit card fraud dataset includes transactions made with credit cards in September 2013 by European cardholders [41]. It covers transactions that occurred over two days, with 492 frauds out of 284,807 transactions. The dataset is highly imbalanced: the minority class (frauds) accounts for 0.172% of all transactions.

4.1.3 Synthetic datasets

In this section, we generate five synthetic datasets with various known effects, such as dataset size, noisy samples, outliers, and imbalance ratios, and use them to demonstrate the properties of the proposed algorithm. The characteristics of the synthetic datasets are summarized in Table 3. Figure 4 shows these datasets, where red points represent majority class samples and blue points represent minority class samples. In this figure, (a) corresponds to the original synthetic datasets and (b) depicts the first temporary new dataset created in the first iteration of the proposed algorithm; these temporary datasets contain a combination of minority and majority class samples.

Table 3 Synthetic datasets
Fig. 4
figure 4

The synthetic datasets used in our experiments: a the original synthetic dataset, b the first new dataset, and c the final new dataset created by the OUBoost algorithm

To generate a temporary new dataset in each boosting iteration, synthetic minority class samples are first generated by the SMOTE over-sampling technique, and majority class samples are then selected by the proposed undersampling method, which picks samples with maximum density and maximum distance from the minority class. Figure 4c shows the final temporary new dataset created by the proposed algorithm, which exhibits a clear boundary between the minority and majority classes. As can be seen across all datasets, the proposed algorithm separates minority and majority class samples with a safe margin. This is the advantage of OUBoost: it helps classifiers properly learn rare minority class samples while preserving the essential information of the majority class in a reduced size. Comparing Fig. 4a and c shows that OUBoost selects the majority class samples that contain useful information instead of using all samples from that class.

4.2 Experimental setup

All experiments are performed using ten-fold cross-validation: the dataset is divided into ten parts, and in each of ten runs, nine parts are used for training and the remaining part for testing, so that every part serves as test data exactly once. The learning rate is 0.3 in all experiments, and the base learner for all methods is a Decision Tree. A sketch of this protocol follows.
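The sketch below implements the protocol with scikit-learn; we use stratified folds, which preserve the class ratio in each fold (a common choice for imbalanced data, assumed here rather than stated in the text), and F-score as the example metric.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import f1_score

def ten_fold_scores(X, y):
    scores = []
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    for train, test in cv.split(X, y):         # each part serves as test data once
        model = DecisionTreeClassifier().fit(X[train], y[train])
        scores.append(f1_score(y[test], model.predict(X[test])))
    return np.mean(scores), np.std(scores)
```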

4.3 Baseline methods

Since our proposed method is based on boosting, we use state-of-the-art boosting-based methods for comparison: SMOTEBoost, RUSBoost, RHSBoost, and FESBoost. We also include the standard boosting model in the comparisons to show the performance gain of the proposed algorithm.

4.4 The used evaluation metrics

As mentioned earlier, traditional classifiers cannot properly learn the minority class in imbalanced datasets, because the number of minority class samples is much smaller than that of majority class samples; accuracy therefore cannot be used to evaluate model performance. In this paper, we use several evaluation metrics to measure the classification performance of the algorithms: Recall, MCC, G-mean, and F-score. We also report time comparisons and use statistical tests to provide a statistical basis for the main comparisons.

4.4.1 Recall

Recall measures the proportion of positive examples that are correctly classified:

$$Recall=\frac{TP}{TP+FN}$$
(16)

The best value is 1, and the worst value is 0.

4.4.2 F-Score

The F-measure [42] is the weighted harmonic mean of precision and recall, computed as:

$$F\text{-}measure=\frac{(1+\beta^{2})\,\text{Precision}\times \text{Recall}}{\beta^{2}\,\text{Precision}+\text{Recall}}$$
(17)

This is an appropriate evaluation metric for imbalanced datasets. F1 (the β = 1 case) ranges over [0, 1], where 1 indicates perfect classification and 0 total failure.

4.4.3 G-mean

Geometric Mean (G-mean) is a good choice for imbalanced classification that measures the balance between classification performance in the majority and minority classes. A low G-mean value indicates poor performance in classifying positive samples even if the negative samples are classified correctly. This metric is important in preventing the over-fitting of the majority class and the under-fitting of the minority class. G-mean [43] is defined as:

$$G\text{-}mean=\sqrt{\frac{TP}{TP+FN}\times\frac{TN}{TN+FP}}$$
(18)

4.4.4 MCC

The Matthews Correlation Coefficient (MCC) is a measure for evaluating the performance of boosting algorithms that is especially useful in imbalanced data classification [42]. It ranges between −1 and 1, where a score of 1 represents perfect prediction, 0 random prediction, and −1 total disagreement between predictions and true labels [44]. It is obtained from the confusion matrix as follows:

$$MCC=\frac{TP\times TN-FP\times FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}$$
(19)
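All four metrics follow directly from the confusion-matrix counts, as in this small helper (Eqs. (16)-(19)); F1 is the β = 1 case of the F-measure, and the counts in the usage line are toy numbers:

```python
import numpy as np

def imbalance_metrics(tp, fn, tn, fp):
    recall = tp / (tp + fn)                                  # Eq. (16)
    precision = tp / (tp + fp)
    f1 = 2 * precision * recall / (precision + recall)       # Eq. (17) with beta = 1
    gmean = np.sqrt(recall * tn / (tn + fp))                 # Eq. (18)
    mcc = (tp * tn - fp * fn) / np.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))       # Eq. (19)
    return recall, f1, gmean, mcc

print(imbalance_metrics(tp=40, fn=10, tn=930, fp=20))
# recall = 0.80 for a classifier that finds 40 of 50 minority samples
```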

4.4.5 Statistical tests

We use statistical tests to evaluate the performance of the proposed algorithm [45]. First, we use the Friedman test, a non-parametric equivalent of repeated-measures ANOVA. Its null hypothesis states that all algorithms perform equally; rejecting it indicates a difference in performance. The test ranks the algorithms by their performance on each dataset, assigning rank 1 to the best algorithm, rank 2 to the second best, and so on; the significance level is set to 0.05. Next, we apply a post-hoc test using Holm's method to compare the algorithms pairwise [46], checking the hypotheses ordered by their p-values.
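The following sketch runs the same two-stage analysis with SciPy and statsmodels on toy scores; the pairwise comparisons here use the Wilcoxon signed-rank test as a stand-in for the rank-based statistic of the Friedman/Holm procedure, so treat it as an illustration rather than the exact computation of [45, 46].

```python
import numpy as np
from scipy.stats import friedmanchisquare, wilcoxon
from statsmodels.stats.multitest import multipletests

# One toy metric value per dataset (rows) for each algorithm (keys).
scores = {
    "OUBoost":    [0.81, 0.77, 0.90, 0.68, 0.85, 0.72],
    "SMOTEBoost": [0.74, 0.70, 0.86, 0.61, 0.80, 0.66],
    "RUSBoost":   [0.70, 0.69, 0.83, 0.60, 0.78, 0.64],
}

stat, p = friedmanchisquare(*scores.values())
print(f"Friedman test: p = {p:.4f}")           # reject the null hypothesis if p < 0.05

# Holm post-hoc with OUBoost as the control method.
control = np.asarray(scores["OUBoost"])
others = {k: v for k, v in scores.items() if k != "OUBoost"}
pvals = [wilcoxon(control, np.asarray(v)).pvalue for v in others.values()]
reject, p_holm, _, _ = multipletests(pvals, alpha=0.05, method="holm")
for name, ph, r in zip(others, p_holm, reject):
    print(f"OUBoost vs {name}: adjusted p = {ph:.4f}, reject = {r}")
```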

5 Results

This section consists of three parts. First, we analyze the performance of the proposed algorithm on several synthetic datasets with known effects such as noisy samples, outliers, different sizes, and different imbalance ratios. Second, the performance of the algorithms is compared on 22 real-world datasets using evaluation metrics including MCC, G-mean, and F-score. Third, the performance of the algorithms is analyzed on three large imbalanced datasets with different imbalance ratios. The best results are boldfaced in the tables.

5.1 The results of synthetic datasets

In the first part of the experiments, we use the five synthetic datasets introduced in Sect. 4.1, generated with known effects such as noisy samples, outliers, different sizes, and different imbalance ratios. To analyze the performance of the proposed algorithm, we report MCC, G-mean, and F-score values and compare them with those of the other algorithms. Tables 4, 5, and 6 show the results based on the MCC, G-mean, and F-score metrics, respectively. As can be seen, as the size and imbalance ratio of the dataset increase, the proposed method achieves accurate classification and better performance than the others in most cases. Moreover, despite the noisy samples and outliers in synthetic datasets 1, 4, and 5, OUBoost is not adversely affected and outperforms the other algorithms.

Table 4 MCC averages of synthetic datasets
Table 5 Gmean averages of synthetic datasets
Table 6 Fscore averages of synthetic datasets

5.2 The results of Real-world datasets

In the second part, we report the classification performance of the algorithms on the real-world datasets using several evaluation metrics, including MCC, G-mean, and F-score; we also compare execution times and analyze statistical test results. Tables 7, 8, 9, 10, 11, 12, 13, 14, 15, and 16 give the results of these experiments. In Tables 7, 8, 9, and 10, the first column shows the dataset name, and the remaining columns show the classification performance of each algorithm under the given metric. We test AdaBoost, SMOTEBoost, RUSBoost, RHSBoost, FESBoost, and the proposed algorithm. In all tables and figures, the classification results of AdaBoost are included to show the improvement of the specialized algorithms on imbalanced data.

Table 7 MCC averages of Real-world datasets under 6 algorithms
Table 8 G-mean averages of Real-world datasets under 6 algorithms
Table 9 F-Score averages of Real-world datasets under 6 algorithms

5.3 MCC results

In this experiment, we compare the performance of the proposed algorithm with the state-of-the-art methods using the MCC metric. Table 7 shows the results of each algorithm on each dataset; the best classification performance is boldfaced per dataset. The results show that the proposed algorithm performs better than the other boosting-based algorithms on 17 out of 22 datasets. We further observe that OUBoost outperforms AdaBoost on nearly all datasets.

5.4 G-mean results

In this experiment, we report the results for the G-mean metric. Table 8 shows that OUBoost outperforms all five boosting algorithms (AdaBoost, SMOTEBoost, RUSBoost, FESBoost, and RHSBoost) on 18 out of 22 datasets. The improvement of OUBoost on most of the datasets is significant, and similar results are observed with the other evaluation metrics.

5.5 F-score results

The F-score metric is used to evaluate performance in this experiment, with the results reported in Table 9. As can be seen, OUBoost performs better than the other boosting algorithms on 19 out of 22 datasets. A high F-score indicates that both the minority recall and the majority recall are high.

Based on these evaluation metrics, we can be reasonably confident that the proposed algorithm works better than the other methods on most datasets, although each of these algorithms performs well in imbalanced data classification.

5.6 Execution time

Table 10 shows the execution time of the algorithms on the 22 real-world imbalanced datasets. Execution times are reported in milliseconds (ms) and include both the training and testing phases. As expected, RUSBoost is the fastest algorithm due to its random elimination operation; AdaBoost, SMOTEBoost, and RHSBoost follow with the second, third, and fourth lowest execution times, respectively. FESBoost and the proposed OUBoost run slower than the others when the imbalance ratio is high, which stems from the DPC-based step of detecting the majority class samples in clusters with maximum distance from the minority class. However, when working with imbalanced data, the correct classification of minority class samples matters more than execution time.

Table 10 Execution time of Real-world datasets under 6 algorithms
Table 11 Friedman test based on MCC
Table 12 Friedman test based on Gmean
Table 13 Friedman test based on Fscore
Table 14 Holm post-hoc test based on the MCC (Using OUBoost as control method)
Table 15 Holm post-hoc test based on the Gmean (Using OUBoost as control method)
Table 16 Holm post-hoc test based on the Fscore (Using OUBoost as control method)

5.7 Statistical test results

In this section, we analyze the results of the statistical tests on the MCC, G-mean, and F-score values from Tables 7, 8, and 9. Tables 11, 12, and 13 show the ranking of each algorithm according to the Friedman test based on MCC, G-mean, and F-score, respectively. As can be seen, the proposed method attains the best rank among the algorithms, and FESBoost is the second best. Tables 14, 15, and 16 show the results of Holm's test, which rejects all hypotheses because the corresponding p-values are smaller than 0.05. We conclude that the performance of OUBoost differs significantly from that of each of the other boosting-based algorithms in classifying imbalanced data.

5.8 The results of large-scale datasets

In the third part of the experiments, we use several datasets with high imbalance ratios to compare the proposed algorithm with the other boosting-based algorithms, reporting the recall and F-score of the minority class. For this purpose, we use the three large datasets with high imbalance ratios; from each, we create five new datasets with IRs of 5, 15, 25, 50, and 100 and evaluate the algorithms on them. Figures 5, 6, 7, 8, 9, and 10 show the results of these experiments.

Fig. 5
figure 5

The minority class recall averages of different algorithms on Mammography datasets

Fig. 6
figure 6

The minority class recall averages of different algorithms on Credit card fraud datasets

Fig. 7
figure 7

The minority class recall averages of different algorithms on CICIDS2017 datasets

Fig. 8
figure 8

The minority class F-score averages of different algorithms on Mammography datasets

Fig. 9
figure 9

The minority class F-score averages of different algorithms on Credit card fraud datasets

Fig. 10
figure 10

The minority class F-score averages of different algorithms on CICIDS2017 datasets

5.9 Recall results

In this experiment, we report the recall of the minority class to evaluate the algorithms under different imbalance ratios. Figures 5, 6, and 7 show the minority class recall of the algorithms on the different datasets with various IRs; the horizontal axis shows the IR and the vertical axis the minority class recall. As shown in Figs. 5, 6, and 7, our proposed algorithm effectively learns the minority class and improves its recall on imbalanced datasets.

As the figures show, increasing the IR reduces the performance of all algorithms in classifying minority class samples. This is expected: as the IR grows, the minority class becomes much smaller than the majority class and harder to learn. Nevertheless, at every IR the proposed algorithm clearly outperforms the other algorithms, and it yields particularly good improvements at high imbalance ratios on most of the datasets.

5.10 F-score results

In this experiment, we report the results for the F-score metric; a good F-score requires both good precision and good recall. In Figs. 8, 9, and 10, the horizontal axis shows the IR and the vertical axis the minority class F-score. The results show that OUBoost outperforms all five boosting algorithms (AdaBoost, SMOTEBoost, RUSBoost, FESBoost, and RHSBoost) on these datasets.

As these figures show, the OUBoost algorithm improves the classification performance of imbalanced learning and gives the best overall results.

According to the figures, RUSBoost performs relatively better at high IRs, while SMOTEBoost loses performance. The other hybrid algorithms, FESBoost and RHSBoost, maintain their efficiency as the IR increases. This indicates that hybrid methods are more appropriate for classifying datasets with high imbalance ratios.

6 Conclusions

In this paper, we first proposed a new technique for undersampling the majority class in imbalanced datasets. The proposed Peak-based undersampling algorithm uses density-peak clustering to select the majority class samples with the highest density and distance from the minority class, intelligently discarding low-value and noisy majority samples while preserving useful and informative samples for learning. We then proposed OUBoost, a novel boosting-based algorithm that combines the proposed Peak-based undersampling method with an oversampling technique in the boosting process. When creating the temporary new dataset in each iteration, the number of majority class samples can be determined based on the minority class ratio, so the temporary new datasets have lower imbalance ratios than the original dataset.

In addition, we investigated boosting-based algorithms for imbalanced data, analyzing their performance on 30 imbalanced datasets with various characteristics such as dataset size, imbalance ratio, noisy samples, and outliers, using several evaluation metrics including Recall, MCC, G-mean, and F-score, along with time comparisons and statistical tests. The results show that hybrid methods, including the proposed OUBoost algorithm, work better than simple boosting methods: using oversampling and undersampling simultaneously in the boosting process yields better results than either alone. While oversampling is an effective approach to the imbalanced data problem, undersampling alone does not yield acceptable results compared to the other methods, owing to the loss of information from the majority class.

Possible future work for this study includes analyzing the impact of other oversampling and undersampling methods on the performance of boosting-based algorithms to deal with the imbalance problem.