1 Introduction

Classification is an essential task in knowledge discovery and data mining, and it becomes particularly challenging when the data is imbalanced. Most real-world datasets are imbalanced to some degree: examples include identifying rare diseases such as cancer in medicine [1] or detecting fraudulent transactions in banks [2]. Learning from datasets in which one class (the minority) has far fewer samples than another (the majority) produces classifiers biased toward the majority class. Traditional algorithms therefore perform poorly on imbalanced data and assign most samples to the majority class. On such datasets, the overall prediction accuracy can exceed 90% while the accuracy on the minority class remains much lower. However, in imbalanced classification the minority class is usually the more valuable one, and the goal is to learn its examples accurately. Hence, reliable new methods are needed so that models can identify these rare but useful examples.

Techniques for the class imbalance problem are divided into data-level and algorithm-level methods [3]. Data-level methods rebalance the class distribution by manipulating the data space. Over-sampling of the minority class and under-sampling of the majority class are data-level methods: over-sampling balances the class distribution by adding examples to the minority class, while under-sampling balances the dataset by removing samples from the majority class [4]. Several techniques exist for both. The simplest way to rebalance an imbalanced dataset is random sampling, which comes in two forms: random over-sampling and random under-sampling. Random over-sampling repeats samples from the minority class until the desired class ratio is obtained [5]; likewise, random under-sampling randomly deletes samples of the majority class to achieve the desired ratio [5]. More advanced over-sampling and under-sampling methods exist [6], and several combinations of these techniques have also improved classification performance on imbalanced datasets [7,8,9]. Algorithm-level methods force algorithms to learn minority class samples, for example by adding a penalty cost. Cost-sensitive learning [10] and recognition-based learning [11] are algorithm-level methods.
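As a concrete illustration, the following minimal NumPy sketch shows the two random sampling strategies described above; the function names, the toy data, and the target sizes are ours, not taken from the cited works.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_oversample(X_min, target_size):
    # Repeat randomly chosen minority samples until target_size is reached.
    extra = rng.integers(0, len(X_min), size=target_size - len(X_min))
    return np.vstack([X_min, X_min[extra]])

def random_undersample(X_maj, target_size):
    # Randomly keep only target_size samples of the majority class.
    keep = rng.choice(len(X_maj), size=target_size, replace=False)
    return X_maj[keep]

# Toy imbalanced data: 100 majority vs. 10 minority samples.
X_maj, X_min = rng.normal(size=(100, 2)), rng.normal(size=(10, 2)) + 3.0
print(random_oversample(X_min, 100).shape)   # (100, 2)
print(random_undersample(X_maj, 10).shape)   # (10, 2)
```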

Under-sampling and over-sampling each have advantages and disadvantages. The main disadvantage of under-sampling is the loss of information contained in the samples deleted from the training data [12]; on the other hand, as the training set shrinks, execution speed increases. Over-sampling avoids data loss, but enlarging the training set slows execution. An over-sampling algorithm may add duplicate or newly generated minority samples, and beyond the increased execution time, excessive repetition of examples can lead to over-fitting [13].

Boosting is another method used to improve classification performance on imbalanced data [4]. It can be applied to both balanced and imbalanced data. AdaBoost [14] is one of the most common boosting algorithms; it iteratively produces a set of models, increasing the weights of misclassified examples at each iteration so that in subsequent iterations they have a better chance of being selected and learned. At the end of the iterations, all classifiers vote to classify unseen samples. When dealing with class imbalance, misclassified minority instances receive more weight in subsequent iterations, so the algorithm pays more attention to them and learns them better. Boosting can be viewed as an advanced form of data sampling [14] and can be applied in two ways: re-weighting or re-sampling [15]. In re-weighted boosting, the modified example weights are passed directly to the base learner at each iteration. Since not all learning algorithms can use this weight information [16], re-sampled boosting is often more appropriate. In re-sampled boosting, instead of passing the sample weights to the learner, the training data is re-sampled according to those weights: a new training dataset is created by sampling (with replacement or with new samples), in which higher-weight samples appear several times, so the classifier is biased toward these examples and learns them better (a sketch follows below). Several re-sampled boosting algorithms for imbalanced data have recently been proposed, such as RUSBoost [17] and SMOTEBoost [18]. RUSBoost applies under-sampling during the boosting process, randomly removing samples from the majority class; as noted above, such removal risks losing information. SMOTEBoost instead applies over-sampling during boosting to generate synthetic minority samples, although this may lead to over-fitting.
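To make the re-sampling variant concrete, the sketch below draws a bootstrap training set in proportion to the current boosting weights; this is a generic illustration of weight-based re-sampling, not the exact routine of any cited algorithm.

```python
import numpy as np

def resample_by_weight(X, y, weights, rng):
    # Samples with larger boosting weights are more likely to be drawn,
    # so the base learner sees them repeated and is biased toward them.
    p = weights / weights.sum()
    idx = rng.choice(len(X), size=len(X), replace=True, p=p)
    return X[idx], y[idx]

rng = np.random.default_rng(0)
X, y = rng.normal(size=(8, 2)), np.array([0, 0, 0, 0, 0, 0, 1, 1])
w = np.array([1, 1, 1, 1, 1, 1, 5, 5], dtype=float)  # minority samples up-weighted
X_res, y_res = resample_by_weight(X, y, w, rng)
print(y_res)  # the two minority examples now appear several times on average
```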

To partially alleviate these problems, this paper first proposes a novel technique for under-sampling the majority class in imbalanced datasets. We then propose a new boosting-based algorithm for learning from imbalanced datasets, named OUBoost, which combines the proposed Peak under-sampling algorithm with an over-sampling technique (SMOTE) in the boosting procedure. We present a comprehensive description of OUBoost and compare its performance with other boosting-based algorithms, such as SMOTEBoost and RUSBoost, as well as state-of-the-art algorithms. We evaluate the algorithms using several metrics, including Recall, MCC, G-mean, and F-score, on 30 imbalanced datasets, and we also report time comparisons and statistical tests.

The rest of this paper is organized as follows: Sect. 2 reviews related work, and Sect. 3 explains the proposed algorithms. Sections 4 and 5 provide the details of the experiments and the experimental setup. We conclude in Sect. 6.

2 Related work

Recently, considerable research has addressed imbalanced data, and various techniques have been proposed to deal with the class imbalance problem [19]. These methods operate at the data level or the algorithm level. Data-level methods sample the minority or majority class to reduce the imbalance.

Chawla et al. [18] propose the SMOTEBoost algorithm, which addresses the imbalance problem during the boosting process by creating synthetic examples from the minority class. Following SMOTEBoost, RUSBoost (Random Under-Sampling Boosting) was introduced to manage the imbalance problem by deleting samples from the majority class during boosting [17].

In [20], Random Hybrid Sampling based on Boosting (RHSBoost) is used to handle the imbalance problem. This algorithm employs random under-sampling and random over-sampling within the boosting algorithm, allowing each base classifier to focus on minority class examples while using a newly balanced training set that retains the characteristics of the original data. The technique achieves stable and impressive classification performance on real-world data.

Popel et al. [21] introduce a Hybrid Under-Sampling based on Boosting (HUSBoost) approach to manage imbalanced data. The algorithm consists of three main steps: cleaning, data balancing, and classification. HUSBoost uses Tomek links to clean the data by removing noisy or overlapping samples; the dataset is then divided into balanced subsets by random sampling without replacement, and classification is performed. Unlike traditional algorithms, which optimize overall accuracy and often ignore minority class samples, such methods aim to learn the minority class properly.

In [22], LIUBoost combines a sampling technique with cost-sensitive learning in the boosting process. The algorithm divides majority and minority samples into categories and assigns high costs to difficult samples in imbalanced datasets. Although cost-sensitive methods suffer from the difficulty of assigning domain-specific costs, and over-sampling methods risk over-fitting and increased execution time, the results of LIUBoost are significant.

In [23], a novel ensemble method is proposed for classifying imbalanced data. Its base classifier is a reduced kernelized weighted extreme learning machine (WELM), which handles the class imbalance problem more efficiently. The algorithm generates balanced training subsets using random undersampling, whose samples serve as the centroids of the reduced kernelized WELM classifier. The base classifiers are generated sequentially: the majority class samples misclassified by the first base classifier, together with all minority class samples, are selected as the centroids of the next base classifier, so the centroid samples differ across classifiers.

MTSbag combines the Mahalanobis–Taguchi system (MTS) and bagging-based ensemble learning to extend the ability of conventional MTS to handle imbalanced data classification [24]. MTS is strong at addressing class imbalance, and bagging can reduce the learning bias of classification algorithms; MTSbag is therefore a useful method, especially for datasets with high imbalance levels.

SMOTE-CSELM is a novel SMOTE-based class-specific extreme learning machine introduced to manage imbalanced data classification [25]. It is a variant of the class-specific extreme learning machine (CS-ELM) that gains the advantages of both minority oversampling and class-specific regularization. SMOTE-CSELM uses the synthetic minority oversampling technique (SMOTE), which increases the influence of minority class samples on the decision region of the classifier, but it suffers from fluctuations caused by the random initialization of the weights between the input and hidden layers.

In [26], a new technique is developed to address the shortcomings of SMOTE-CSELM. It uses a SMOTE-based class-specific kernelized extreme learning machine (SMOTE-CSKELM) with a Gaussian kernel function to map the input data to the feature space. SMOTE-CSKELM offers the advantages of minority oversampling and class-specific regularization coefficients, and its Gaussian kernel function avoids the non-optimal hidden node problem. SMOTE is used to generate synthetic minority class samples to balance the training dataset.

Jiang et al. [27] propose a boosting-based algorithm to handle imbalanced data. It combines static and dynamic re-sampling techniques and uses fuzzy entropy and fuzzy support in a boosting-based random forest (FESBoost). Static re-sampling decreases the imbalance ratio of the dataset, while dynamic re-sampling updates the training data. FESBoost uses the density peak clustering algorithm (DPCA) to select representative samples that are effective for training.

DBRF is a Density-Based Random Forest algorithm developed to improve prediction performance on imbalanced datasets [28]. It detects borderline samples using a density-based method and augments them, uses two different random forest classifiers to model the augmented boundary samples and the original dataset, and determines the final output with a bagging technique. DBRF can thus resolve the problem of classifying minority samples located on the class boundary.

In [29], a novel ensemble classification method based on kernel density estimation (KDE) is proposed to handle imbalanced data. The method trains each tree in the ensemble on uniquely generated, synthetically balanced data. KDE offers a natural and effective way to generate new minority samples for balancing subsets by estimating the underlying distribution of the data.

K-Means-SMOTE–ENN is a new hybrid bag-boost method introduced to improve resampling techniques on noisy imbalanced datasets [30]. It combines a hybrid bag-boost model of decision trees with a hybrid K-Means SMOTE–edited nearest neighbor (ENN) resampling technique to address noisy class imbalance problems, using the edited nearest neighbor rule as an undersampling method to remove noise-inducing samples.

In [31], a new classification algorithm based on diverse sample generation and classifier fusion is proposed to address drawbacks of SMOTE such as the lack of diversity and strong overlap among generated minority samples. The algorithm uses a generative adversarial network (GAN)-based framework comprising an oversampling method and a two-class imbalanced data classification approach: the oversampling method is based on an improved GAN model, and the classification approach fuses classifiers through a fuzzy integral that can model interactions between base classifiers trained on different balanced subsets.

Most of the aforementioned methods suggest that hybrid approaches are useful for learning minority class samples and can be embedded in the boosting process for imbalanced data classification. In this paper, we propose a novel boosting-based method using a new under-sampling technique based on the Peak clustering algorithm.

3 Boosting-based approach to imbalanced data

As mentioned earlier, the number of examples per class differs in many application domains, leading to the imbalance problem in machine learning. Traditional classifiers cannot properly classify minority class examples in imbalanced datasets: although the overall accuracy on such datasets is high, most minority instances are misclassified, so accuracy is not a proper evaluation metric. To handle this issue, we focus on boosting. A boosting algorithm increases the weights of misclassified examples during the boosting process, so it learns more from hard examples; this is why boosting is a useful approach for classifying imbalanced data. As mentioned, SMOTEBoost and RUSBoost are popular boosting-based methods for imbalanced data. RUSBoost, which balances the data by randomly deleting samples, improves minority class recall but loses many majority class samples. The results reported in [18] show that SMOTEBoost is an effective approach.

In this paper, we propose a new boosting-based algorithm along with a novel under-sampling approach using the Peak clustering method to handle imbalanced data. The Peak under-sampling algorithm performs under-sampling by detecting clusters and selecting useful samples with maximum density and maximum distance from the minority class.

3.1 Density-peak clustering

In [32], a new clustering algorithm named density-peak clustering (DPC) is introduced to identify clusters in datasets with complex structures. DPC finds the correct number of clusters under two assumptions: cluster centers are surrounded by samples with lower local density, and they are relatively far from any sample with higher local density. For each sample \(j\) in the dataset, two measures are computed: its local density \(\eta_j\) and its distance \(\mu_j\) from higher-density samples. The density \(\eta_j\) is defined as follows:

$$\eta_j = \sum_{k \in S,\, k \ne j} \Omega\left(d(j,k) - t_r\right)$$
(1)

where \(d(j,k)\) is the distance between samples \(j\) and \(k\), \(S\) is the dataset, and \(t_r\) is a tuning parameter (a cutoff radius). \(t_r\) is chosen so that the average number of neighbors is about 2% of \(m\), where \(m\) is the total number of samples in the dataset.

$$\Omega(y)=\begin{cases}1 & y<0\\ 0 & \text{otherwise}\end{cases}$$
(2)

In general, \(\eta_j\) is the number of samples in the vicinity of sample \(j\) within radius \(t_r\) [33]. The distance \(\mu_j\) of sample \(j\) is the minimum distance from \(j\) to any other sample of higher density, as defined in (3). For the sample with the maximum local density, \(\mu_j\) is instead the maximum distance between \(j\) and any other point:

$$\mu_j=\begin{cases}\max\left\{d(j,k)\mid k\in S\right\} & \text{if } \forall k\in S,\ \eta_j\ge\eta_k\\ \min\left\{d(j,k)\mid \eta_j<\eta_k,\ k\in S\right\} & \text{otherwise}\end{cases}$$
(3)

Then, to determine the cluster centers from the density and distance values of each sample, a new factor \(\lambda_j\) is defined [32]. The cluster centers are the samples with both large \(\mu\) and large \(\eta\), computed as:

$$\lambda_j = \mu_j \times \eta_j$$
(4)

Next, the λ-values are sorted in descending order, and samples with anomalously high λ-values are selected as cluster centers. Finally, all remaining samples are assigned to clusters using the measure defined in (5): each sample is assigned via its nearest higher-density neighbor, and for the sample with the maximum density, \(\Psi_j\) is set to \(j\):

$$\Psi_j=\begin{cases}j & \text{if } \forall k\in S,\ \eta_j\ge\eta_k\\ \underset{k}{\operatorname{argmin}}\left\{d(j,k)\mid \eta_j<\eta_k\right\} & \text{otherwise}\end{cases}$$
(5)
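The following sketch computes the DPC quantities of Eqs. (1)-(5) with SciPy; the radius value, the number of centers picked at the end, and the simplified nearest-center assignment are illustrative assumptions, not prescribed by [32].

```python
import numpy as np
from scipy.spatial.distance import cdist

def dpc_measures(X, t_r):
    d = cdist(X, X)                            # pairwise distances d(j, k)
    eta = (d < t_r).sum(axis=1) - 1            # Eqs. (1)-(2): neighbors within t_r, excluding self
    mu = np.empty(len(X))
    for j in range(len(X)):
        higher = np.where(eta > eta[j])[0]     # samples with higher local density
        # Eq. (3): max distance for the densest sample, else min distance to a denser one
        mu[j] = d[j].max() if higher.size == 0 else d[j, higher].min()
    return eta, mu, eta * mu                   # Eq. (4): lambda = mu * eta

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(3, 0.3, (50, 2))])
eta, mu, lam = dpc_measures(X, t_r=0.5)
centers = np.argsort(-lam)[:2]                 # anomalously high lambda -> cluster centers
# Simplified assignment to the nearest center; Eq. (5) assigns each sample to
# the cluster of its nearest higher-density neighbor instead.
labels = np.argmin(cdist(X, X[centers]), axis=1)
```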

3.2 Proposed Peak-based undersampling algorithm

As mentioned earlier, under-sampling and over-sampling methods can be used in the boosting process to improve classifier performance on imbalanced datasets. Inspired by our previous work [34], which was developed for intrusion detection, we here propose a new general formulation of under-sampling for imbalanced datasets. The proposed under-sampling technique uses the following steps:

1. In the main dataset, separate the majority class and the minority class. Let \(S_{maj}\) be the majority class with size E and \(S_{min}\) the minority class with size F.

2. Apply the proposed under-sampling algorithm to \(S_{maj}\) to generate \(e\) clusters of the majority class.

3. Select effective samples of \(S_{maj}\) with high density and high distance from \(S_{min}\).

To implement the proposed approach, we define two measures, \(dens_i\) and \(dist_i\), where \(dens_i\) is the density of \(cluster_i\) (\(i \in \{1, 2, \ldots, e\}\)), computed as the sum of the local densities \(\eta_j\) of the samples in \(cluster_i\):

$$dens_i=\sum_{j \in cluster_i}\eta_j$$
(6)

We then define \(dist_i\), the distance between the center of \(cluster_i\) and the minority class, calculated as:

$$dist_i=\sum_{q \in S_{min}} d(p,q),\quad p = \text{centroid of } cluster_i$$
(7)

Given the density and distance values of the clusters, we select the most effective samples using a new measure \(H_i\), which favors samples with maximum density and maximum distance from \(S_{min}\). For each sample in \(S_{maj}\), \(H_i\) is computed as:

$$H_i = u \times dens_i + v \times dist_i$$
(8)

where u + v = 1. Finally, we select samples with maximum \(H\)-values that are not close to each other. To select the samples from \(H\), the new quantity L is computed as follows:

$$L =\left\{x_i \mid x_i, x_j \in H,\ d(i,j) > \theta\right\}$$
(9)

where \(\theta\) is a tuning parameter, and the D samples with \(d(i,j) > \theta\) are selected as \(S_{effective\_maj}\). In the final step, \(S_{effective\_maj}\) and \(S_{min}\) are merged to generate a modified new dataset. This new dataset may still be imbalanced, but its imbalance ratio is lower than that of the original dataset; the target ratio is selected based on the minority class size. Figure 1 shows the proposed Peak-based undersampling algorithm, and a minimal code sketch follows the figure.

Fig. 1
figure 1

The proposed Peak-based undersampling algorithm
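A minimal sketch of the Peak-based undersampling step is given below, assuming a clustering of the majority class (labels and centroids) is already available from DPC; the values of u, θ, the local-density radius t_r, and the number of retained samples are illustrative, not the paper's tuned settings.

```python
import numpy as np
from scipy.spatial.distance import cdist

def peak_undersample(X_maj, X_min, labels, centroids,
                     u=0.5, theta=0.5, n_keep=50, t_r=0.5):
    v = 1.0 - u                                        # u + v = 1 (Eq. 8)
    eta = (cdist(X_maj, X_maj) < t_r).sum(axis=1) - 1  # local densities (Eq. 1)
    H = np.empty(len(X_maj))
    for c in np.unique(labels):
        members = labels == c
        dens = eta[members].sum()                      # cluster density (Eq. 6)
        dist = cdist(centroids[c][None], X_min).sum()  # distance to minority class (Eq. 7)
        H[members] = u * dens + v * dist               # score per sample (Eq. 8)
    keep = []
    for j in np.argsort(-H):                           # highest H-values first
        # Eq. (9): keep only samples farther than theta from already kept ones.
        if all(np.linalg.norm(X_maj[j] - X_maj[k]) > theta for k in keep):
            keep.append(j)
        if len(keep) == n_keep:
            break
    return X_maj[keep]   # S_effective_maj, merged later with S_min
```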

3.3 Proposed OUBoost algorithm

In this paper, we propose a combination of under-sampling and over-sampling methods within the boosting process. We first divide the given dataset into the minority class and the majority class. We then use over-sampling, via the SMOTE algorithm, to generate synthetic data from the minority class; this generated data is added to the original dataset. Next, the proposed Peak-based under-sampling algorithm is applied to the majority class: according to the sample ratio of the minority class, useful and reliable majority class examples with maximum distance from the class border are selected. Thus, instead of removing samples from the majority class, useful and reliable samples are selected and placed in a temporary new dataset together with all minority class samples. This new dataset is then passed to the weak learners. At each iteration, a new temporary dataset is generated from the original dataset for learning and is discarded at the end of the iteration. This process is repeated until the imbalance ratio reaches the desired adjustable value. Finally, the final model is formed by voting over the classifiers.

The main idea of our proposed approach, which differs from the state-of-the-art algorithms, is that each model constructed during the boosting procedure uses its own dataset generated from S by the proposed Peak-based under-sampling and over-sampling in the current iteration. The approach, however, only updates the weights of the original dataset S at each boosting iteration. An overview of the proposed algorithm is depicted in Fig. 2.

Fig. 2
figure 2

The general overview of the proposed boosting-based algorithm

The proposed OUBoost algorithm is presented in Fig. 3. We now explain it in more detail. Let S be the original dataset and D the weights of the examples in S. Let T denote the number of iterations, \(S_t^{\prime}\) the temporary dataset, and \(D_t^{\prime}\) the weights of the examples at iteration t.

Fig. 3
figure 3

The proposed OUBoost

The original dataset S consists of examples {(\(x_1\), \(y_1\)), …, (\(x_m\), \(y_m\))} where \(x_i \in X\) and \(y_i \in Y = \{0, 1\}\); \(C_P\) denotes the minority (positive) class and \(C_N\) the majority (negative) class, with \(C_P < C_N\). Initially, each sample is assigned weight 1/m, where m is the total number of examples.

We now describe the proposed OUBoost algorithm in boosting form. We first generate P synthetic samples from the minority class \(C_P\) and add them to the original dataset S. We then select N useful samples from the majority class using the proposed Peak-based undersampling algorithm and create a new dataset \(S_t^{\prime}\) from all minority class examples together with the selected majority examples. The weak learner is trained on \(S_t^{\prime}\), and the weak hypothesis \(h_t\) is obtained as:

$$h_t: X \times Y \to [0, 1]$$
(10)

Next, the weighted error rate \({\epsilon }_{t}\) for the original dataset S and weight distribution D is computed as follows:

$$\epsilon_t=\sum_{(i,y):\, y_i\ne y} D_t(i)\left(1-h_t(x_i,y_i)+h_t(x_i,y)\right)$$
(11)

Then, the weight update parameter \(\alpha\) is calculated based on \({\epsilon }_{t}\) as:

$${\alpha }_{t}=\frac{{\epsilon }_{t}}{1-{\epsilon }_{t}}$$
(12)

Finally, the weights of the samples are updated and normalized in D as:

$${D}_{t+1}(i)={D}_{t}(i){\alpha }_{t}^{\frac{1}{2}\left(1+{h}_{t}\left({x}_{i},{y}_{i}\right)-{h}_{t}\left({x}_{i},y:y\ne {y}_{i}\right)\right)}$$
(13)
$$D_{t+1}(i)=\frac{D_{t+1}(i)}{Z_t},\quad Z_t=\sum_{i} D_{t+1}(i)$$
(14)

At the end of T iterations, the final output of model \(H(x)\) is obtained as:

$$H(x)=\underset{y\in Y}{\mathrm{argmax}}\sum_{t=1}^{T} \quad {h}_{t}(x,y)\mathrm{log}\frac{1}{{\alpha }_{t}}$$
(15)

Note that when generating the temporary new dataset for training in each iteration, the numbers of samples selected from the majority and minority classes are not necessarily equal: forcing them to be equal could discard a lot of majority class data.

In fact, the temporary new dataset generated in each iteration can itself be imbalanced, as long as its imbalance ratio is lower than that of the original dataset. The ratio can be adjusted depending on the nature of the dataset; for example, if the original imbalance ratio is 20, each iteration can generate a temporary new dataset with an imbalance ratio below 20. Reducing the imbalance ratio enables classifiers to learn properly from the modified datasets; this is a hyper-parameter that can be tuned to the data. Because over-sampling increases the number of minority class samples, the number of samples selected from the majority class also increases when creating the temporary new dataset. By integrating these samples into the newly generated dataset, the classifiers are trained on different data in each iteration, which may improve classification performance.
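The sketch below strings the steps of Eqs. (10)-(15) together for binary labels y ∈ {0, 1}, encoded as an integer array. It uses scikit-learn's decision tree and imbalanced-learn's SMOTE, represents the Peak-based undersampling of Sect. 3.2 by a caller-supplied function, and applies the usual 1/2 factor of the AdaBoost.M2 pseudo-loss so the error stays in [0, 1]; the hyper-parameter values are illustrative only.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from imblearn.over_sampling import SMOTE

def ouboost_fit(X, y, undersample, T=10):
    """undersample(X, y) stands in for the Peak-based step of Sect. 3.2
    and must return a reduced (X', y') with a lower imbalance ratio."""
    m = len(X)
    D = np.full(m, 1.0 / m)                      # initial weights 1/m
    models, alphas = [], []
    for t in range(T):
        X_os, y_os = SMOTE().fit_resample(X, y)  # oversample the minority class
        X_t, y_t = undersample(X_os, y_os)       # temporary dataset S'_t
        h = DecisionTreeClassifier(max_depth=3).fit(X_t, y_t)
        p = h.predict_proba(X)                   # h_t(x, y) in [0, 1] (Eq. 10)
        p_true = p[np.arange(m), y]
        p_false = p[np.arange(m), 1 - y]
        eps = 0.5 * np.sum(D * (1 - p_true + p_false))   # pseudo-loss (Eq. 11)
        eps = np.clip(eps, 1e-10, 1 - 1e-10)             # guard degenerate cases
        alpha = eps / (1 - eps)                          # (Eq. 12)
        D = D * alpha ** (0.5 * (1 + p_true - p_false))  # (Eq. 13)
        D = D / D.sum()                                  # normalize by Z_t (Eq. 14)
        models.append(h)
        alphas.append(alpha)
    return models, alphas

def ouboost_predict(models, alphas, X):
    # Weighted vote of Eq. (15): argmax_y sum_t h_t(x, y) * log(1 / alpha_t)
    votes = sum(np.log(1.0 / a) * h.predict_proba(X) for h, a in zip(models, alphas))
    return votes.argmax(axis=1)
```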

4 Experiments

In the experiments, we compare the performance of the proposed OUBoost algorithm with several other boosting-based algorithms. In this section, we first introduce the used datasets in the experiments. We then address the experimental setup, baseline methods, and evaluation metrics.

4.1 Datasets

The performance of the different methods on imbalanced data is measured using datasets that vary in size and imbalance ratio, as well as synthetic datasets.

4.1.1 Real-world datasets

We use 22 real-world imbalanced datasets: seven from the UCI machine learning repository [35], ten from the KEEL dataset repository [36], and five from the Machine Learning Mastery repository [37], all commonly used in related studies. Table 1 provides the specifications of these datasets, including the number of samples, the number of attributes, the numbers of majority and minority class samples, the minority class name, the imbalance ratio, and the number of classes. The datasets cover a wide variety of sizes, imbalance ratios, and application domains. Since this article focuses on two-class datasets and binary classification, we convert multiclass datasets into two classes: the Glass, Wine, Wheat-seeds, Satimage, Iris, Abalone, and Yeast datasets are converted using the one-versus-all method, labeling the smallest class as the minority class and the rest as the majority class [38], as sketched below.

Table 1 Real-world datasets
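The one-versus-all conversion amounts to the following helper (the function name is ours):

```python
import numpy as np

def smallest_class_vs_rest(y):
    # Label the smallest class as the minority (1) and all other classes as the majority (0).
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[counts.argmin()]
    return (y == minority).astype(int)

y = np.array(["A", "A", "A", "B", "B", "C"])   # toy labels: C is the smallest class
print(smallest_class_vs_rest(y))               # [0 0 0 0 0 1]
```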

4.1.2 Large-scale datasets

Table 2 introduces the three large imbalanced datasets used in our experiments. The CICIDS2017 dataset was developed by the Canadian Institute for Cybersecurity at the University of New Brunswick. It includes a variety of network traffic, both benign and malicious. The dataset contains 2,830,743 samples in total, of which 172,848 belong to the minority class [39].

Table 2 Large-scale datasets

The mammography dataset includes 40,000 mammograms performed between January 2005 and December 2008 on women in the Breast Cancer Surveillance Consortium [40]. Of these 40,000 samples, 259 are positive examples of breast cancer.

The credit card fraud dataset includes transactions made with credit cards in September 2013 by European cardholders [41]. It covers transactions that occurred over two days, with 492 frauds out of 284,807 transactions. The dataset is highly imbalanced: the minority class (frauds) accounts for 0.172% of all transactions.

4.1.3 Synthetic datasets

In this section, we generate five synthetic datasets with various known effects, such as dataset size, noisy samples, outliers, and imbalance ratios, and use them to demonstrate the properties of the proposed algorithm. The characteristics of the synthetic datasets are summarized in Table 3. Figure 4 shows these datasets, where red points represent majority class samples and blue points represent minority class samples. In this figure, (a) corresponds to the original synthetic datasets and (b) depicts the first temporary new dataset created in the first iteration of the proposed algorithm; these temporary datasets contain a combination of minority and majority class samples.

Table 3 Synthetic datasets
Fig. 4
figure 4

The synthetic datasets used in our experiments: a the original synthetic dataset, b the first new dataset, and c the final new dataset created by the OUBoost algorithm

To generate a temporary new dataset in each boosting iteration, synthetic minority class samples are first generated by the SMOTE over-sampling technique, and majority class samples are then selected by the proposed undersampling method, which picks samples with maximum density and maximum distance from the minority class. Figure 4c shows the final temporary new dataset created by the proposed algorithm, which exhibits a clear boundary between the minority and majority classes. As can be seen across all datasets, the proposed algorithm separates minority and majority class samples with a safe margin. This is the advantage of OUBoost: it helps classifiers properly learn rare minority class samples while preserving the essential information of the majority class in a reduced size. Comparing Fig. 4a and c shows that OUBoost selects the majority class samples that contain useful information instead of using all samples from that class.

4.2 Experimental setup

All experiments are performed using ten-fold cross-validation: the dataset is divided into ten parts, and in each of ten runs, nine parts are used for training and the remaining part for testing, so that every part serves as test data exactly once. The learning rate is 0.3 in all experiments, and the base learner for all methods is a Decision Tree. A sketch of this protocol follows.
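The sketch below implements the protocol with scikit-learn; we use stratified folds, which preserve the class ratio in each fold (a common choice for imbalanced data, assumed here rather than stated in the text), and F-score as the example metric.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import f1_score

def ten_fold_scores(X, y):
    scores = []
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    for train, test in cv.split(X, y):         # each part serves as test data once
        model = DecisionTreeClassifier().fit(X[train], y[train])
        scores.append(f1_score(y[test], model.predict(X[test])))
    return np.mean(scores), np.std(scores)
```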

4.3 Baseline methods

Since our proposed method is based on boosting, we use state-of-the-art boosting-based methods for comparison: SMOTEBoost, RUSBoost, RHSBoost, and FESBoost. We also include the standard boosting model in the comparisons to show the performance gain of the proposed algorithm.

4.4 The used evaluation metrics

As mentioned earlier, traditional classifiers cannot properly learn the minority class in imbalanced datasets, because the number of minority class samples is much smaller than that of majority class samples; accuracy therefore cannot be used to evaluate model performance. In this paper, we use several evaluation metrics to measure the classification performance of the algorithms: Recall, MCC, G-mean, and F-score. We also report time comparisons and use statistical tests to provide a statistical basis for the main comparisons.

4.4.1 Recall

Recall measures the proportion of positive examples that are correctly classified:

$$Recall=\frac{TP}{TP+FN}$$
(16)

The best value is 1, and the worst value is 0.

4.4.2 F-Score

The F-measure [42] is the weighted harmonic mean of precision and recall, computed as:

$$F\text{-}measure=\frac{(1+\beta^{2})\,\text{Precision}\times \text{Recall}}{\beta^{2}\,\text{Precision}+\text{Recall}}$$
(17)

This is an appropriate evaluation metric for imbalanced datasets. F1 (the β = 1 case) ranges over [0, 1], where 1 indicates perfect classification and 0 total failure.

4.4.3 G-mean

Geometric Mean (G-mean) is a good choice for imbalanced classification that measures the balance between classification performance in the majority and minority classes. A low G-mean value indicates poor performance in classifying positive samples even if the negative samples are classified correctly. This metric is important in preventing the over-fitting of the majority class and the under-fitting of the minority class. G-mean [43] is defined as:

$$G\text{-}mean=\sqrt{\frac{TP}{TP+FN}\times\frac{TN}{TN+FP}}$$
(18)

4.4.4 MCC

The Matthews Correlation Coefficient (MCC) is a measure for evaluating the performance of boosting algorithms that is especially useful in imbalanced data classification [42]. It ranges between −1 and 1, where a score of 1 represents perfect prediction, 0 random prediction, and −1 total disagreement between predictions and true labels [44]. It is obtained from the confusion matrix as follows:

$$MCC=\frac{TP\times TN-FP\times FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}$$
(19)
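All four metrics follow directly from the confusion-matrix counts, as in this small helper (Eqs. (16)-(19)); F1 is the β = 1 case of the F-measure, and the counts in the usage line are toy numbers:

```python
import numpy as np

def imbalance_metrics(tp, fn, tn, fp):
    recall = tp / (tp + fn)                                  # Eq. (16)
    precision = tp / (tp + fp)
    f1 = 2 * precision * recall / (precision + recall)       # Eq. (17) with beta = 1
    gmean = np.sqrt(recall * tn / (tn + fp))                 # Eq. (18)
    mcc = (tp * tn - fp * fn) / np.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))       # Eq. (19)
    return recall, f1, gmean, mcc

print(imbalance_metrics(tp=40, fn=10, tn=930, fp=20))
# recall = 0.80 for a classifier that finds 40 of 50 minority samples
```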

4.4.5 Statistical tests

We use statistical tests to evaluate the performance of the proposed algorithm [45]. First, we use the Friedman test, a non-parametric equivalent of repeated-measures ANOVA. Its null hypothesis states that all algorithms perform equally; rejecting it indicates a difference in performance. The test ranks the algorithms by their performance on each dataset, assigning rank 1 to the best algorithm, rank 2 to the second best, and so on; the significance level is set to 0.05. Next, we apply a post-hoc test using Holm's method to compare the algorithms pairwise [46], checking the hypotheses ordered by their p-values.
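The following sketch runs the same two-stage analysis with SciPy and statsmodels on toy scores; the pairwise comparisons here use the Wilcoxon signed-rank test as a stand-in for the rank-based statistic of the Friedman/Holm procedure, so treat it as an illustration rather than the exact computation of [45, 46].

```python
import numpy as np
from scipy.stats import friedmanchisquare, wilcoxon
from statsmodels.stats.multitest import multipletests

# One toy metric value per dataset (rows) for each algorithm (keys).
scores = {
    "OUBoost":    [0.81, 0.77, 0.90, 0.68, 0.85, 0.72],
    "SMOTEBoost": [0.74, 0.70, 0.86, 0.61, 0.80, 0.66],
    "RUSBoost":   [0.70, 0.69, 0.83, 0.60, 0.78, 0.64],
}

stat, p = friedmanchisquare(*scores.values())
print(f"Friedman test: p = {p:.4f}")           # reject the null hypothesis if p < 0.05

# Holm post-hoc with OUBoost as the control method.
control = np.asarray(scores["OUBoost"])
others = {k: v for k, v in scores.items() if k != "OUBoost"}
pvals = [wilcoxon(control, np.asarray(v)).pvalue for v in others.values()]
reject, p_holm, _, _ = multipletests(pvals, alpha=0.05, method="holm")
for name, ph, r in zip(others, p_holm, reject):
    print(f"OUBoost vs {name}: adjusted p = {ph:.4f}, reject = {r}")
```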

5 Results

This section consists of three parts. First, we analyze the performance of the proposed algorithm on several synthetic datasets with known effects such as noisy samples, outliers, different sizes, and different imbalance ratios. Second, the performance of the algorithms is compared on 22 real-world datasets using evaluation metrics including MCC, G-mean, and F-score. Third, the performance of the algorithms is analyzed on three large imbalanced datasets with different imbalance ratios. The best results are boldfaced in the tables.

5.1 The results of synthetic datasets

In the first part of the experiments, we use the five synthetic datasets introduced in Sect. 4.1, generated with known effects such as noisy samples, outliers, different sizes, and different imbalance ratios. To analyze the performance of the proposed algorithm, we report MCC, G-mean, and F-score values and compare them with those of the other algorithms. Tables 4, 5, and 6 show the results based on the MCC, G-mean, and F-score metrics, respectively. As can be seen, as the size and imbalance ratio of the dataset increase, the proposed method achieves accurate classification and better performance than the others in most cases. Moreover, despite the noisy samples and outliers in synthetic datasets 1, 4, and 5, OUBoost is not adversely affected and outperforms the other algorithms.

Table 4 MCC averages of synthetic datasets
Table 5 Gmean averages of synthetic datasets
Table 6 Fscore averages of synthetic datasets

5.2 The results of Real-world datasets

In the second part, we report the classification performance of the algorithms on the real-world datasets using several evaluation metrics, including MCC, G-mean, and F-score; we also compare execution times and analyze statistical test results. Tables 7, 8, 9, 10, 11, 12, 13, 14, 15, and 16 give the results of these experiments. In Tables 7, 8, 9, and 10, the first column shows the dataset name, and the remaining columns show the classification performance of each algorithm under the given metric. We test AdaBoost, SMOTEBoost, RUSBoost, RHSBoost, FESBoost, and the proposed algorithm. In all tables and figures, the classification results of AdaBoost are included to show the improvement of the specialized algorithms on imbalanced data.

Table 7 MCC averages of Real-world datasets under 6 algorithms
Table 8 G-mean averages of Real-world datasets under 6 algorithms
Table 9 F-Score averages of Real-world datasets under 6 algorithms

5.3 MCC results

In this experiment, we compare the performance of the proposed algorithm with the state-of-the-art methods using the MCC metric. Table 7 shows the results of each algorithm on each dataset; the best classification performance is boldfaced per dataset. The results show that the proposed algorithm performs better than the other boosting-based algorithms on 17 out of 22 datasets. We further observe that OUBoost outperforms AdaBoost on nearly all datasets.

5.4 G-mean results

In this experiment, we report the results for the G-mean metric. Table 8 shows that OUBoost outperforms all five boosting algorithms (AdaBoost, SMOTEBoost, RUSBoost, FESBoost, and RHSBoost) on 18 out of 22 datasets. The improvement of OUBoost on most of the datasets is significant, and similar results are observed with the other evaluation metrics.

5.5 F-score results

The F-score metric is used to evaluate performance in this experiment, with the results reported in Table 9. As can be seen, OUBoost performs better than the other boosting algorithms on 19 out of 22 datasets. A high F-score indicates that both the minority recall and the majority recall are high.

Based on these evaluation metrics, we can be reasonably confident that the proposed algorithm works better than the other methods on most datasets, although each of these algorithms performs well in imbalanced data classification.

5.6 Execution time

Table 10 shows the execution time of the algorithms on the 22 real-world imbalanced datasets. Execution times are reported in milliseconds (ms) and include both the training and testing phases. As expected, RUSBoost is the fastest algorithm due to its random elimination operation; AdaBoost, SMOTEBoost, and RHSBoost follow with the second, third, and fourth lowest execution times, respectively. FESBoost and the proposed OUBoost run slower than the others when the imbalance ratio is high, which stems from the DPC-based step of detecting the majority class samples in clusters with maximum distance from the minority class. However, when working with imbalanced data, the correct classification of minority class samples matters more than execution time.

Table 10 Execution time of Real-world datasets under 6 algorithms
Table 11 Friedman test based on MCC
Table 12 Friedman test based on Gmean
Table 13 Friedman test based on Fscore
Table 14 Holm post-hoc test based on the MCC (Using OUBoost as control method)
Table 15 Holm post-hoc test based on the Gmean (Using OUBoost as control method)
Table 16 Holm post-hoc test based on the Fscore (Using OUBoost as control method)

5.7 Statistical test results

In this section, we analyze the results of the statistical tests on the MCC, G-mean, and F-score values from Tables 7, 8, and 9. Tables 11, 12, and 13 show the ranking of each algorithm according to the Friedman test based on MCC, G-mean, and F-score, respectively. As can be seen, the proposed method attains the best rank among the algorithms, and FESBoost is the second best. Tables 14, 15, and 16 show the results of Holm's test, which rejects all hypotheses because the corresponding p-values are smaller than 0.05. We conclude that the performance of OUBoost differs significantly from that of each of the other boosting-based algorithms in classifying imbalanced data.

5.8 The results of large-scale datasets

In the third part of the experiments, we use several datasets with high imbalance ratios to compare the proposed algorithm with the other boosting-based algorithms, reporting the recall and F-score of the minority class. For this purpose, we use the three large datasets with high imbalance ratios; from each, we create five new datasets with IRs of 5, 15, 25, 50, and 100 and evaluate the algorithms on them. Figures 5, 6, 7, 8, 9, and 10 show the results of these experiments.

Fig. 5
figure 5

The minority class recall averages of different algorithms on Mammography datasets

Fig. 6
figure 6

The minority class recall averages of different algorithms on Credit card fraud datasets

Fig. 7
figure 7

The minority class recall averages of different algorithms on CICIDS2017 datasets

Fig. 8
figure 8

The minority class F-score averages of different algorithms on Mammography datasets

Fig. 9
figure 9

The minority class F-score averages of different algorithms on Credit card fraud datasets

Fig. 10
figure 10

The minority class F-score averages of different algorithms on CICIDS2017 datasets

5.9 Recall results

In this experiment, we report the recall of the minority class to evaluate the algorithms under different imbalance ratios. Figures 5, 6, and 7 show the minority class recall of the algorithms on the different datasets with various IRs; the horizontal axis shows the IR and the vertical axis the minority class recall. As shown in Figs. 5, 6, and 7, our proposed algorithm effectively learns the minority class and improves its recall on imbalanced datasets.

As the figures show, increasing the IR reduces the performance of all algorithms in classifying minority class samples. This is expected: as the IR grows, the minority class becomes much smaller than the majority class and harder to learn. Nevertheless, at every IR the proposed algorithm clearly outperforms the other algorithms, and it yields particularly good improvements at high imbalance ratios on most of the datasets.

5.10 F-score results

In this experiment, we report the results for the F-score metric; a good F-score requires both good precision and good recall. In Figs. 8, 9, and 10, the horizontal axis shows the IR and the vertical axis the minority class F-score. The results show that OUBoost outperforms all five boosting algorithms (AdaBoost, SMOTEBoost, RUSBoost, FESBoost, and RHSBoost) on these datasets.

As these figures show, the OUBoost algorithm improves the classification performance of imbalanced learning and gives the best overall results.

According to the figures, RUSBoost performs relatively better at high IRs, while SMOTEBoost loses performance. The other hybrid algorithms, FESBoost and RHSBoost, maintain their efficiency as the IR increases. This indicates that hybrid methods are more appropriate for classifying datasets with high imbalance ratios.

6 Conclusions

In this paper, we first proposed a new technique for undersampling the majority class in imbalanced datasets. The proposed Peak-based undersampling algorithm uses density-peak clustering to select the majority class samples with the highest density and distance from the minority class, intelligently discarding low-value and noisy majority samples while preserving useful and informative samples for learning. We then proposed OUBoost, a novel boosting-based algorithm that combines the proposed Peak-based undersampling method with an oversampling technique in the boosting process. When creating the temporary new dataset in each iteration, the number of majority class samples can be determined based on the minority class ratio, so the temporary new datasets have lower imbalance ratios than the original dataset.

In addition, we investigated boosting-based algorithms for imbalanced data, analyzing their performance on 30 imbalanced datasets with various characteristics such as dataset size, imbalance ratio, noisy samples, and outliers, using several evaluation metrics including Recall, MCC, G-mean, and F-score, along with time comparisons and statistical tests. The results show that hybrid methods, including the proposed OUBoost algorithm, work better than simple boosting methods: using oversampling and undersampling simultaneously in the boosting process yields better results than either alone. While oversampling is an effective approach to the imbalanced data problem, undersampling alone does not yield acceptable results compared to the other methods, owing to the loss of information from the majority class.

Possible future work for this study includes analyzing the impact of other oversampling and undersampling methods on the performance of boosting-based algorithms to deal with the imbalance problem.