1 Introduction

In binary classification, the number of samples in one class may be much smaller than that in the other, leading to the so-called class imbalance problem (He & Ma, 2013; Fernández et al., 2018a). The class with the smaller number of samples is generally labeled as the minority (or positive) class and the other as the majority (or negative) class. When other distribution-dependent factors, such as the scarcity of instances representing infrequent sub-concepts and overlaps between the minority and majority classes, are combined with the bias towards the majority class due to imbalance, test samples are generally classified as the majority class (Napierala & Stefanowski, 2016). Most conventional learners achieve poor performance on the minority class since they compute decision regions by considering the number of classification errors on the whole training data (Jo & Japkowicz, 2004). The imbalance problem is generally tackled using two main approaches (Gong & Kim, 2017). The first is based on resampling the minority samples to balance the training data (Bej et al., 2021). After balancing by resampling, some parts of the feature space where the minority samples were outnumbered by the negatives will be learned as positive decision regions. Numerous techniques have been proposed for synthetic sample generation, most of which are variants of the well-known synthetic minority oversampling technique (SMOTE) (Chawla et al., 2002; Fernández et al., 2018b; Kovács, 2019). In this approach, new instances are generated on the lines connecting existing minority samples. In particular, given a minority point named the seed, a sample is randomly selected from its k nearest minority samples. Then, a synthetic sample is generated on the line joining these two points (a minimal sketch is given below). This process is repeated until the desired number of synthetic samples is obtained. Such data-level approaches are also combined with ensembling techniques to improve generalization performance (Ghaderi Zefrehi & Altınçay, 2020; Galar et al., 2013; Haixiang et al., 2017; Díez-Pastor et al., 2015). As an alternative approach, algorithm-level modifications of existing classifiers have been addressed. For instance, in cost-sensitive decision trees, the cost of misclassifying a minority class sample is set to be higher than that of a majority sample (Ling et al., 2006). Likewise, the penalty of misclassifying minority class samples by a support vector machine classifier is set to be higher than that of the majority class (Veropoulos et al., 1999). Similarly, a weighted cross-entropy loss is employed for XGBoost so that misclassified positive samples incur a larger loss than misclassified negatives (Wang et al., 2020).
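To make the interpolation step of SMOTE concrete, a minimal sketch is given below. This is an illustrative rendering rather than the reference implementation; the function and parameter names are our own.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_sketch(X_min, n_synthetic, k=5, seed=0):
    """Generate synthetic minority samples by interpolating between a
    randomly chosen seed and one of its k nearest minority neighbors."""
    rng = np.random.default_rng(seed)
    # k + 1 neighbors are queried because each sample is its own nearest neighbor
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, neighbors = nn.kneighbors(X_min)
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.integers(len(X_min))               # pick a minority seed
        j = neighbors[i][rng.integers(1, k + 1)]   # skip the self-neighbor at position 0
        gap = rng.random()                         # interpolation factor in [0, 1)
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.vstack(synthetic)
```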

SMOTE has been criticized for several shortcomings (Barua et al., 2014), which can be itemized as follows: (1) The samples in dense clusters will have neighbors that are likely to be from the same clusters. In such cases, the synthetic samples will be near replicas of the seed samples lying in those dense clusters. (2) The relative positions and distances of the minority samples are not considered in the k-nearest-neighbor-based selection. Because of this, the selected samples may lie in incorrect regions. For instance, the seed sample may be noisy or an outlier and hence located in a region that is far away from the other minority samples. In such a case, the synthetic samples may also lie away from the other minority examples. (3) The nearest neighbor of a seed may be in a different cluster. In such cases, the synthetic sample may be located in the space of the majority class. (4) Due to linear interpolation-based sample generation, the true distribution of the minority samples will be distorted (Halimu & Kasem, 2021; Xie et al., 2022). If a classifier takes into account the variance of the minority samples, incorrect variance estimates will create additional challenges during testing (Blagus & Lusa, 2013). Moreover, the training samples will not be independent after balancing, violating the independence assumption, if employed by the classifier (Blagus & Lusa, 2013). (5) Using an a priori fixed k value is another weakness, since the variation of sample density across different regions is ignored. (6) Not all samples are equally important from the oversampling point of view. For instance, samples located close to the decision boundary are harder to classify, but SMOTE does not differentiate between samples (Barua et al., 2014; Napierala & Stefanowski, 2016).

Solutions for binary imbalanced classification tasks are generally easier to formulate than those for the multi-class setting. For instance, samples can be categorized by taking into account their positions with respect to the decision boundary, such as borderline, safe or outlier (Napierala & Stefanowski, 2016). Since there are two classes, each class is either the minority or the majority. Multi-class problems are more challenging, mainly due to the more complex relations between the classes than in the binary case (Krawczyk, 2016; Lango & Stefanowski, 2022). Moreover, learning the decision boundaries is expected to be more difficult in the multi-class case. Binary imbalanced classification has attracted more interest, and many variants of SMOTE have been proposed for this purpose, each addressing some of its shortcomings (Fernández et al., 2018b; Kovács, 2019). Most of these techniques employ alternative strategies for selecting and/or weighting the minority samples during resampling. Two well-known and frequently mentioned methods are Borderline-SMOTE and ADASYN (Han et al., 2005; He et al., 2008). In Borderline-SMOTE, a subset of the minority samples is considered as seed samples. For instance, if all k nearest neighbors of a minority sample are from the majority class, it is assumed to be noise and is not considered during oversampling. A sample in the remaining set is considered as a seed only if most of its nearest samples are from the majority class. This criterion is expected to be satisfied mainly by the borderline samples, and hence the learner will learn the decision boundary more accurately. A variant of Borderline-SMOTE which also takes the neighboring majority samples into account in generating synthetic samples has been proposed (Han et al., 2005). ADASYN, on the other hand, assigns different weights to the minority samples (He et al., 2008). The likelihood that a sample will be selected as a seed is proportional to its weight. The weight of a sample is computed by taking into account the number of majority samples in its nearest-neighbor set, the aim being to focus on the samples that are harder to classify (a sketch of this weighting is given below). However, noisy samples located in the majority class region may also get large weights, leading to the introduction of additional noisy samples (Barua et al., 2014). SMOTE may generate synthetic samples that overlap with the majority class space, leading to expanded minority clusters. Data cleaning is an alternative technique used for post-processing the balanced datasets to correct the distortions of the class clusters. For instance, in SMOTE-ENN, any instance that is not correctly labeled by its three nearest neighbors is deleted (Batista et al., 2004). Similarly, SMOTE-TomekLinks discards the instances that form Tomek links (Batista et al., 2004). CCR is another data-cleaning based approach where, before oversampling, the neighborhood of each minority sample is cleaned by removing majority class instances (Koziarski & Wożniak, 2017).
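The ADASYN weighting mentioned above can be sketched as follows; this is a minimal rendering under our own naming, which computes, for each minority sample, the fraction of majority samples among its k nearest neighbors and normalizes these values into seed-selection probabilities.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def adasyn_weights(X, y, k=5, minority_label=1):
    """Weight each minority sample by the fraction of majority samples
    among its k nearest neighbors (normalized to sum to one)."""
    X_min = X[y == minority_label]
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)   # search over the whole set
    _, neighbors = nn.kneighbors(X_min)
    # drop the self-neighbor in column 0, then count majority neighbors
    ratios = np.array([(y[row[1:]] != minority_label).mean() for row in neighbors])
    return ratios / ratios.sum()   # assumes at least one majority neighbor exists
```

Samples surrounded by many majority neighbors receive large weights and are oversampled more heavily, which is also why noisy minority samples lying inside the majority region can attract large weights.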

The ultimate goal of oversampling is to force the learner to label the feature space containing minority samples as the positive class, even though some of these regions might be outnumbered by the negative samples. In fact, this can be achieved using an alternative approach. Rather than increasing the number of positive samples in the subspace where they appear, we propose to reposition the negative samples towards their centroid so as to reduce the number of majority samples in the subspace where the minority samples exist. Similarly, the positives are repositioned, but in small steps, to effectively deal with different types of minority samples such as borderline or outlying instances. This approach has several advantages over oversampling-based approaches. For instance, small displacements keep the positive borderline samples close to their original positions and hence do not distort the boundary. After repositioning, the outliers, which are more likely to lie in the majority class space, will not mislead the learner as their original forms would, since they will be closer to the other positives. Moreover, irrelevant positive regions are not generated since the minority instances are only slightly modified. Consequently, assumptions made by some classifiers about the original class distribution, such as normality of feature values, are not violated. The problem of hard-to-classify minority samples is also addressed by reducing the number of negatives in the corresponding subspaces.

The first step of the proposed MAjority and MInority rePOsitioning Technique (MaMiPot) is to train the learner on the given dataset to identify the misclassified samples, that is, the set of false positives (sFP) and the set of false negatives (sFN). The initial set of seed samples is computed as the union of sFP and sFN. The algorithm repositions the seed samples in sFP towards the centroid of the true negatives by linear interpolation. Similarly, the samples in sFN are repositioned by employing the centroid of the correctly classified positive samples. The learner is re-trained on the updated training set and evaluated on the original training data to check for improvement in the performance metric. The procedure is repeated until no further performance gain is achieved.

The following section presents a brief review of resampling-based techniques that alleviate the aforementioned shortcomings of SMOTE. The complete description of the proposed imbalance learning approach is presented in Section 3. The experimental work conducted on 52 datasets is presented in Section 4. The conclusions drawn are given in Section 5.

2 Literature review

More than a hundred variants of SMOTE have been proposed since it was published in 2002 (Fernández et al., 2018b; Kovács, 2019). These methods mainly differ in the approaches utilized for selecting and/or weighting the minority seed samples, the methods that replace linear interpolation for generating new samples, and the elimination of existing noisy samples and of incorrect synthetic samples located in the majority class region. In this brief review, some of these variants are presented to highlight the solutions they propose for one or more shortcomings of SMOTE.

In SMOTEFUNA, the sample furthest from the seed is selected and the synthetic sample is generated within the cuboidal space between them, which allows producing diverse synthetic instances (Tarawneh et al., 2020). This technique is argued to have several advantages. For instance, the synthetic samples are not expected to be close to their seeds when the seed is from a dense cluster. Hence, the shortcoming mentioned in item (1) is alleviated. Moreover, since only the furthest instance is considered, setting the value of the tuning parameter k is not needed.

In order to avoid employing noisy instances as seeds, as mentioned in item (2), DBSMOTE utilizes a clustering algorithm (Bunkhumpornpat et al., 2012). The samples that do not lie in a cluster are labeled as noisy instances and are not utilized in oversampling. In MSMOTE, if all k neighbors of a minority sample belong to the majority class, it is labeled as a noisy instance and ignored (Liang et al., 2009). MWMOTE checks the neighbors of all minority samples (Barua et al., 2014). The samples which do not have any minority example in their neighbor set are discarded. Hence, such samples will be considered neither as seed nor as neighbor during linear interpolation. In Safe-Level SMOTE, the safety of the two samples selected for linear interpolation is computed by checking their neighbors (Bunkhumpornpat et al., 2009). The synthetic sample is generated closer to the safer instance.

As a solution to the weakness mentioned in item (3), Hu et al. (2014) proposed to use a classifier to evaluate the validity of each synthetic sample. A classifier is initially trained using the training samples. Two randomly selected minority samples are employed to generate a new synthetic sample by linear interpolation. This sample is accepted if the confidence score of the classifier is within a predetermined confidence interval. In GSMOTE, a safe area is defined for each minority sample (Douzas & Bação, 2019). The safe area represents the feature space where the generated samples are not noisy. In the SMOTEFUNA algorithm, the distances of each synthetic sample from the nearest minority and nearest majority samples are utilized to evaluate its validity (Tarawneh et al., 2020). In particular, if a synthetic sample is closer to its negative nearest neighbor, it is discarded. In RBO, the synthetic samples are generated by computing Gaussian radial basis function-based potentials of the samples in both the positive and negative classes (Koziarski et al., 2019). The difficult regions are identified using these potential surfaces and are oversampled. ADPCHFO, proposed by Tao et al. (2020), applies oversampling by employing samples within the same clusters. Density peak clustering is applied to cluster the minority samples. Using clusters allows the generation of samples within valid regions. In order to avoid overfitting due to duplicate samples, a heuristic filter is applied to eliminate overlapping instances.

GSMOTE addresses the shortcoming mentioned in item (4) by using geometric regions rather than linear interpolation for generating synthetic instances (Douzas & Bação, 2019). For each minority sample, a safe radius is computed. Based on this value, sample generation takes place in a truncated hyper-spheroid. Xie et al. (2022) proposed the use of a Gaussian distribution around the seed samples for generating synthetic instances. For a given seed, a random direction originating from the seed is initially selected. The new sample is generated in that direction, at a distance drawn from a Gaussian distribution whose parameters are estimated from the training data. In ROSE, synthetic instances are sampled from a kernel density estimate centered at the selected samples (Menardi & Torelli, 2014). For both GSMOTE and ROSE, setting the value of the tuning parameter k is not needed. Hence, the shortcoming mentioned in item (5) is avoided. SWIM employs the density of the well-represented majority class for generating new samples (Bellinger et al., 2020). In particular, the synthetic samples have approximately the same density as the seed.

In order to address the shortcoming mentioned in item (6), ADASYN selects the seeds by considering weights computed for each minority sample. The weight is proportional to the number of neighbors from the majority class. Hence, the learner is expected to focus on the hard-to-classify samples. With the same aim of emphasizing hard-to-classify samples, SDSMOTE identifies the borderline samples by defining a degree of support for the minority samples (Li et al., 2014). In MWMOTE, the hard-to-learn samples are selected by considering the distance of each minority sample to its closest majority instance (Barua et al., 2014).

Many other variants of SMOTE have also been proposed. For instance, SMOTE-IPF addresses the shortcoming in item (2) by removing both the noisy samples in the original data and the noisy synthetic samples (Sáez et al., 2015). TRIM-SMOTE preprocesses the minority instances to generate multiple seed sets to be utilized for generating synthetic instances (Puntumapon & Waiyamai, 2012). In the Polynom-fit-SMOTE method, synthetic instances are generated using polynomial fitting (Gazzah et al., 2008). MCT generates synthetic samples using cloning, where the instances on the borderline are cloned less than those internal to the minority class region (Jiang et al., 2015). SMOTE-D considers the distances between each seed and its k nearest neighbors to calculate the number of synthetic instances between each sample pair (Torres et al., 2016). Cluster-SMOTE aims at generating artificial samples within the major clusters of the minority class (Cieslak et al., 2006). CURE-SMOTE is another clustering-based approach (Li & Fan, 2017). The minority samples are clustered to identify small clusters, which are assumed to be noisy and removed. MDO generates synthetic samples in dense regions of the minority class and hence reduces the risk of generating instances in the majority class space (Abdi & Hashemi, 2016). In ANS, the number of neighbors considered, i.e., k, is not fixed but dynamically calculated for each minority sample (Siriseriwan & Sinapiromsaran, 2017). In kmeans-SMOTE, the training data is clustered in an unsupervised way (Douzas et al., 2018). The clusters including a small proportion of minority samples are not selected for oversampling. Moreover, sparse minority clusters are assigned more synthetic samples than dense ones.

Balancing has also been addressed by analyzing the types of samples in the minority class. For instance, the minority samples are categorized as safe, rare, borderline and outlier, and the weights of a one-class classifier are adjusted based on these types (Krawczyk et al., 2014). Sampling by analyzing the neighborhood of the positive samples has also been proposed (Błaszczyński & Stefanowski, 2015). A bagging-based ensemble of classifiers is constructed where each bootstrap is formed by mainly including hard-to-learn samples. In particular, the probability of drawing an instance depends on its difficulty of classification, which is quantified using the number of neighbors from the majority class.

3 The proposed approach: MaMiPot

The proposed approach is based on repositioning the training samples. In order to understand the main idea behind this approach, consider the toy example given in Fig. 1. Due to class imbalance, most of the positives (represented by red diamonds) are expected to be misclassified, as shown in the left part of the figure. In other words, the decision region for the minority class covers a much smaller space than the actual space where the positive samples appear. Such solutions generally lead to low classification accuracy on the positive samples (i.e., sensitivity) but high accuracy on the negatives (i.e., specificity). As shown in the upper right part of the figure, the solution offered by SMOTE is to generate synthetic minority samples to enlarge the positive decision region, leading to a more acceptable solution. As an alternative, the idea behind MaMiPot is presented in the lower right part of the figure. The positive and negative samples are repositioned for the same purpose of enlarging the minority decision region. The repositioned instances are shown using the symbol of the corresponding class but without its fill color. Since the majority samples overlapping with the minority are moved away from the minority region, minimizing the classification error will correspond to labeling a wider space as positive. Moreover, since the distribution of the minority samples is not distorted, the borderline is computed more accurately. Although the motivation behind moving the majority samples is intuitively reasonable, the repositioning of the minority instances needs further clarification. As mentioned above, one of the main challenges in SMOTE is outliers. These samples may lead to generating synthetic instances in the majority class region and hence severely mislead the learner. To alleviate this problem, the proposed approach repositions the minority samples as well. However, in order not to distort the distribution of the minority samples, the repositioning is performed in smaller steps for the minority class.

Fig. 1

Comparison of SMOTE and MaMiPot on a toy example

There are two main issues that need to be addressed for the implementation of this idea: the selection of the samples to be moved and the directions in which they will be moved. In MaMiPot, the decision boundary computed on the original training data is employed for guiding the selection and repositioning tasks. The details of the proposed method are presented in Algorithm 1. The first step of the algorithm is to train a classifier. In Step 2, the classifier is tested using the training data. The next step, Step 3, is to obtain the decision boundary by computing the best-fitting decision threshold, τopt. The performance metric is a design parameter of the algorithm. Hence, τopt should be defined by taking into account the metric, M. Using the selected threshold, the value of the metric, Mopt, is recorded. The selection of τopt is explained in detail below.

Using the decision boundary obtained, the training samples are partitioned into four sets. sTP denotes the set of correctly classified minority samples (i.e. true positives). sFN represents the set of misclassified minority samples (i.e. false negatives). The set sTN includes correctly classified majority samples (i.e. true negatives). sFP denotes the set of misclassified majority samples (i.e. false positives). The next step is to compute the centroids of the correctly classified minority samples denoted by μP and the correctly classified majority samples denoted by μN.

Algorithm 1

MaMiPot.

The samples in sFN and sFP are the candidates to be repositioned. In the first part of the iterations, starting on line 11, the misclassified minority samples in sFN are repositioned towards the centroid of the correctly classified minority samples using the positive repositioning factor, αP. \(\widehat {sFN}\) is computed as the updated set after repositioning. In this study, we assume that the correctly classified samples form a single cluster.
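Written out explicitly, the repositioning applied here amounts to the linear interpolation

$$ \widehat{x} = x + \alpha_{P} \, (\mu_{P} - x), \qquad x \in sFN, $$

so that a single step shrinks the distance of x to μP by the factor 1 − αP; the update applied to the samples in sFP in the second part of the iteration is analogous, with αN and μN.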

It should be noted that, at this step, only the locations of the samples in the set sFN are changed. As a condition for performing the repositioning, we check whether the number of correctly classified samples is above a threshold denoted by γ. The motivation for this condition can be explained as follows. The centroid is computed using only the correctly classified minority samples. In the MaMiPot algorithm, the centroids are used as representatives of the potential regions to be enlarged. By moving samples towards these points, the aim is to force the learner to focus on that part of the feature space. Hence, for a centroid to convey a potential for the learner to discover additional minority regions, a predefined number of samples, denoted by γ, are required to be already correctly classified. For instance, if all minority samples are misclassified, none of them is repositioned.

In the second part of the iterations, starting on line 16, the misclassified majority samples are repositioned towards the centroid of the correctly classified majority samples using the negative repositioning factor, αN. The new positions are recorded, yielding the updated set \(\widehat {sFP}\). As in the case of the minority samples, we check whether the number of correctly classified majority samples is above the threshold γ. The new training set Tnew, containing the correctly classified samples (i.e., sTP and sTN) and the repositioned samples (i.e., \(\widehat {sFN}\) and \(\widehat {sFP}\)), is used to construct a new classifier, Cnew, as presented on line 21. This classifier is tested on the original training set T to evaluate whether the performance is improved. If so, Tnew is recorded as Topt and the iterations are repeated.

During the iterations, the centroids are not updated. The main reason can be explained as follows: the ultimate goal of the algorithm is to reposition individual samples, not entire classes. Allowing changes in the direction of repositioning may lead to translation of the classes in the feature space, which is not desired. Moreover, such an operation would not be rewarded, since the repositioning done in each iteration is evaluated using the original training set on line 22 of the algorithm.

The repositioning factors αP and αN represent the fraction of the distance traveled on the lines connecting the minority and majority instances, respectively, to the corresponding centroids. In order not to distort the distribution of the minority samples, αP should be kept small. In other words, the minority samples must be repositioned in small steps. On the other hand, larger displacements should be allowed by employing large αN values. In fact, αP is mainly used to deal with the outliers. During the iterations, they are moved towards the centroid computed for the minority class. Figure 2 presents the execution of MaMiPot, where αN = 0.9 and αP = 0.1. The left part of the figure shows the original dataset. The minority samples are presented using the symbol ‘◇’ filled with red. The repositioned instances are plotted using non-filled symbols. As can be seen, the positive and negative samples in the overlapping regions are moved towards the corresponding centroids. Consequently, the classifier is expected to increase the number of true positives. It can also be seen that the minority samples are moved only slightly.

Fig. 2

Repositioning of positive and negative samples using MaMiPot. The design parameters were set as αN = 0.9 and αP = 0.1

If three consecutive updates do not lead to further performance improvement, the iterations are terminated and Topt is returned as the best-fitting training set. Remember that the centroids μP and μN are kept fixed during the iterations. Since αN is selected as a large value, after the first iteration the repositioned negatives will already be close to μN. Consequently, the remaining room for repositioning is small for the majority samples after the first iteration. On the other hand, since the minority samples are repositioned in small steps, they can continue to move as the number of iterations increases. Consequently, the first priority of the algorithm is to solve the problems due to imbalance by repositioning the negatives. If no further performance gain is achieved, by running for two more iterations, MaMiPot tries the lower-priority option and targets further improvements by repositioning the minority samples. When αP = 0.1, the distance between a minority sample and its centroid will be reduced by 1 − (0.9)^3 ≈ 0.27 (i.e., 27%) after three iterations, which is much less than the repositioning of the majority samples in the first iteration (i.e., 0.90). This is consistent with our main target of avoiding altering the distribution of the positive samples. Alternative values for the number of iterations can be empirically evaluated for the task under concern.
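Putting the pieces together, the procedure can be sketched as follows. This is a compact re-implementation of Algorithm 1 under our reading of the text: the classifier factory, the metric function and the label encoding (1 for the minority class, feature matrix as a float array) are placeholder assumptions, and the threshold-based decision step is abstracted into predict().

```python
import numpy as np

def mamipot_sketch(X, y, make_clf, metric, alpha_p=0.1, alpha_n=0.9,
                   gamma=1, patience=3):
    """Sketch of MaMiPot: iteratively move misclassified samples towards the
    fixed centroids of the correctly classified samples of their own class."""
    pos, neg = (y == 1), (y == 0)
    clf = make_clf().fit(X, y)
    pred = clf.predict(X)                                  # Step 2: test on training data
    best_score, X_cur, X_best, stall = metric(y, pred), X.copy(), X.copy(), 0

    # centroids of the correctly classified samples; fixed across iterations
    tp, tn = pos & (pred == 1), neg & (pred == 0)
    mu_p = X[tp].mean(axis=0) if tp.sum() >= gamma else None
    mu_n = X[tn].mean(axis=0) if tn.sum() >= gamma else None

    while stall < patience:
        pred_cur = clf.predict(X_cur)
        sFN = pos & (pred_cur == 0)                        # misclassified minority
        sFP = neg & (pred_cur == 1)                        # misclassified majority
        if mu_p is not None:
            X_cur[sFN] += alpha_p * (mu_p - X_cur[sFN])    # small minority steps
        if mu_n is not None:
            X_cur[sFP] += alpha_n * (mu_n - X_cur[sFP])    # large majority steps
        clf = make_clf().fit(X_cur, y)                     # retrain on updated set
        score = metric(y, clf.predict(X))                  # always score on original T
        if score > best_score:
            best_score, X_best, stall = score, X_cur.copy(), 0
        else:
            stall += 1                                     # stop after 3 stalls
    return X_best, y
```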

The performance metric is another design parameter of the algorithm. Accuracy is not an appropriate measure in imbalance learning. Since the number of minority samples is generally small, very high accuracy values can be obtained even by classifying all samples as negative. Instead, metrics that measure the performance separately for the different classes are employed. However, since the performances on the positive and negative classes compete, trade-off forms of these metrics are generally utilized, namely F-score, G-mean and AUC. F-score is defined as the harmonic mean of precision and recall, where precision is the ratio of correctly classified positive instances to the number of all samples classified as positive (i.e., TP / (TP + FP)) and recall is the percentage of correctly classified positives (i.e., sensitivity, TP / (TP + FN)):

$$ F\text{-score} = 2 \times \frac{Precision \times Recall}{Precision + Recall}. $$
(1)

G-mean, which is defined as the geometric mean of sensitivity (i.e., recall) and specificity (i.e., the classification accuracy on the negative class), is calculated as

$$ G\text{-mean} = \sqrt{Sensitivity \times Specificity}. $$
(2)

AUC, the third metric used, is computed as the area under the receiver operating characteristic curve. As mentioned above, the main idea of MaMiPot is to reposition the misclassified minority and majority samples. Since AUC is threshold-free, it can be used as a design parameter only if the samples to be repositioned are determined using a heuristic approach. For instance, the majority samples appearing in the highest P ranks and the minority samples appearing in the lowest N ranks can be selected. Alternatively, as done in several variants of SMOTE, the neighborhoods of the minority samples or the density of the minority samples in different clusters can be considered. On the other hand, F-score and G-mean can be employed directly since they are threshold-dependent. In this study, in order not to bias the results by re-running the algorithm for each metric, the algorithm is run only for F-score and the performance scores of all three metrics are reported. In a recent study, the following threshold is proposed for F-score (Collell et al., 2018):

$$ \tau_{opt} = \frac{1}{2} \bigg (\frac{P}{P+N} + 0.5\bigg ) $$
(3)

where 0.5 is the maximum possible threshold value for maximizing F-score and P/(P + N) is the a priori probability of the minority class (Lipton et al., 2014). The midpoint between these two terms is proposed as the threshold for F-score. For instance, for a training set with P = 10 and N = 90, τopt = (0.1 + 0.5)/2 = 0.3. In our simulations, we used this threshold value.

In general, since the minority samples are outnumbered by the majority, we expect low recall values (Erenel & Altınçay, 2013). In other words, the classification performance on the positive samples is generally low. The main reason for this is the bias in favor of the majority class samples. On the other hand, precision will be higher than when employing a balanced dataset. This can be explained as follows: after balancing, the decision region for the minority class enlarges, including more correctly classified minority samples and hence yielding higher recall. However, the number of false positives will also increase. Due to the imbalance, we expect a larger increase in FP than in TP, leading to a smaller precision score. Let the balancing factor β be defined as the proportion of the synthetic samples to the difference in the class sizes, β = Nsyn/(Nmaj − Nmin), where Nsyn denotes the number of synthetic samples generated in the case of Nmaj majority and Nmin minority instances. For instance, with Nmaj = 900 and Nmin = 100, β = 0.25 corresponds to generating Nsyn = 200 synthetic samples. In order to tackle the low recall problem, SMOTE oversamples with a balancing factor equal to one, whereas MaMiPot aims to alleviate the imbalance problem by moving the negatives away from the positives. When the minority class is small, if the performance metric is selected as recall, a large degradation in precision may occur. In order to avoid this, the algorithm should be run using a trade-off metric such as F-score or G-mean.

For a better understanding of the way MaMiPot contributes to the performance scores for different metrics, both the class imbalance and the characteristics of the metrics should be taken into consideration. As emphasized in the recent study by Soleymani et al. (2020), both AUC and G-mean favor increasing the number of true positives, even at the cost of a much higher number of misclassified instances from the majority class. This can be understood by considering the toy example presented in Fig. 3. Assume that there are 10 minority and 100 majority test samples. Using a learner trained on the original, unbalanced training set, let the numbers of true positives and true negatives be 5 and 90, respectively. After oversampling the minority class, we expect a higher true positive count. However, this can be achieved at the expense of true negatives. Let the number of true positives increase by only one while the number of true negatives decreases by 5. This corresponds to an increase in sensitivity from 5/10 to 6/10 and a decrease in specificity from 90/100 to 85/100. When compared to employing the original training data, although the total number of misclassified samples increases by 5 − 1 = 4, G-mean increases from 0.67 to 0.71. Hence, G-mean favors increasing the number of true positives.

Fig. 3

An example to illustrate the effect of balancing in terms of the metric values

MaMiPot improves the performance on the minority class mainly by repositioning the majority samples. However, this is done in a global way. In particular, the algorithm does not focus on any specific region by evaluating the clustering of the samples. Similarly, in SMOTE, the whole feature space of the minority class benefits from oversampling. On the other hand, in some variants of SMOTE, oversampling is applied in a selected subspace. For instance, kmeans-SMOTE considers the density of minority samples to select the clusters to be oversampled. Since MaMiPot is not cluster- or region-based, some minority samples may not benefit from repositioning. Oversampling the minority class has the potential to be a complementary tool for such samples. In this setting, MaMiPot is utilized as a preprocessing step before oversampling, providing several advantages for the oversampler. For instance, due to the repositioning of the majority samples by MaMiPot, a balancing factor smaller than one is expected to be better-fitting, leading to less distortion of the minority distribution when compared to SMOTE and most of its variants. Moreover, utilizing a small balancing factor will result in a similar imbalance for both training and test data, which is another important concern in imbalance learning (Pozzolo et al., 2015). Finally, by repositioning outlying minority samples towards the centroid of the correctly classified instances, the risk of generating synthetic samples in invalid regions is reduced.

Precision replaces specificity in the case of F-score (Soleymani et al., 2020). Since the minority class generally includes a small number of samples, if the increase in the true positive rate due to oversampling is achieved at the expense of precision through an increased number of false positives, the F-score will degrade. As shown in Fig. 3, F-score decreases from 0.40 to 0.39. Consequently, when F-score is the performance metric, SMOTE is expected to contribute to MaMiPot only if it is run using a small balancing factor. The left part of Fig. 4 presents the execution of SMOTE on the data given in Fig. 2. The dataset obtained after running MaMiPot using αN = 0.9 and αP = 0.1, followed by oversampling using SMOTE with β = 0.5, is illustrated on the right. It can be seen in the figure that, since many majority samples are repositioned, a smaller number of synthetic samples is expected to be sufficient for achieving a reasonable recall value.
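The figures in this toy example are easy to reproduce; the short check below recomputes the G-mean and F-score values quoted above from the confusion counts in Fig. 3.

```python
from math import sqrt

def g_mean_and_f_score(tp, fn, tn, fp):
    recall = tp / (tp + fn)                 # sensitivity
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    g = sqrt(recall * specificity)
    f = 2 * precision * recall / (precision + recall)
    return round(g, 2), round(f, 2)

print(g_mean_and_f_score(tp=5, fn=5, tn=90, fp=10))   # (0.67, 0.4)  before balancing
print(g_mean_and_f_score(tp=6, fn=4, tn=85, fp=15))   # (0.71, 0.39) after balancing
```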

Fig. 4

Balancing the data using SMOTE (left) and oversampling the positive samples with β = 0.5 after repositioning by MaMiPot (right). The design parameters of MaMiPot were set as αN = 0.9 and αP = 0.1

4 Experimental work

In order to evaluate the proposed approach, experiments are conducted using three classifiers, namely a support vector machine (SVM) with a radial basis function kernel, naive Bayes (NB) and a decision tree (J48), which are widely employed for imbalanced datasets (Haixiang et al., 2017). Since the computational cost of SVM is high, based on empirical evidence from preliminary experiments, we fixed its parameters as σ = 0.25 and C = 0.25 in all simulations. The default settings are utilized for J48, i.e., a confidence factor of 0.25 and a minimum of two instances per leaf. The experiments are conducted on the KEEL collection (Alcalá-Fdez et al., 2011). Fourteen datasets having imbalance ratios less than 5 are discarded. All of the remaining 52 datasets, having imbalance ratios between 5.14 and 129.44, are considered. The datasets are presented in Table 1.

Table 1 The datasets selected from KEEL repository

For each dataset, the experiments are repeated 20 times and the average scores are reported. In each experiment, 5 × 2-fold cross validation is applied. In particular, the data is randomly split into two equal-sized parts five times, and each part is used once for training and once for testing. The performance of the proposed method is compared with SMOTE and nineteen extensions, namely ADASYN (He et al., 2008), Borderline-SMOTE1 (Han et al., 2005), DBSMOTE (Bunkhumpornpat et al., 2012), GSMOTE (Douzas & Bação, 2019), SMOTEFUNA (Tarawneh et al., 2020), MWMOTE (Barua et al., 2014), ANS (Siriseriwan & Sinapiromsaran, 2017), CCR (Koziarski & Wożniak, 2017), CURE-SMOTE (Li & Fan, 2017), kmeans-SMOTE (Douzas et al., 2018), SMOTE-D (Torres et al., 2016), MDO (Abdi & Hashemi, 2016), MCT (Jiang et al., 2015), SMOTE-ENN (Batista et al., 2004), SMOTE-TomekLinks (Batista et al., 2004), Polynom-fit-SMOTE (Gazzah et al., 2008), SMOTE-IPF (Sáez et al., 2015), TRIM-SMOTE (Puntumapon & Waiyamai, 2012) and Cluster-SMOTE (Cieslak et al., 2006).
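A minimal sketch of the 5 × 2-fold protocol described above is given below, assuming a scikit-learn-style classifier; the resample argument stands for any balancing scheme (e.g., MaMiPot, optionally followed by SMOTE) and is a placeholder.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import f1_score

def five_by_two_cv(X, y, make_clf, resample, base_seed=0):
    """5 x 2-fold CV: five random stratified halvings of the data; each half
    serves once as the training set and once as the test set."""
    scores = []
    for rep in range(5):
        skf = StratifiedKFold(n_splits=2, shuffle=True, random_state=base_seed + rep)
        for tr, te in skf.split(X, y):
            X_tr, y_tr = resample(X[tr], y[tr])     # balance the training half only
            clf = make_clf().fit(X_tr, y_tr)
            scores.append(f1_score(y[te], clf.predict(X[te])))
    return float(np.mean(scores))
```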

In the first set of experiments, MaMiPot is run using five balancing factors, β ∈ {0, 0.25, 0.50, 0.75, 1}. As mentioned above, αP and αN, which lie in the interval (0, 1), represent the fraction of the distance traveled on the lines connecting the minority and majority instances to the centroids of the corresponding classes. To reposition the minority instances in small steps, we set αP = 0.1. On the other hand, we set αN = 0.9 to allow larger displacements for the majority class. A large value of αN is expected to allow MaMiPot to terminate in a small number of iterations. Since the minority class is small in many datasets, we set γ = 1. When β = 0, only MaMiPot is applied. For β > 0, SMOTE is applied after MaMiPot, where the number of synthetic samples is determined by β. The results obtained are presented in Fig. 5. The rightmost column presents the specificity, recall and precision scores. As expected, increasing the number of minority samples by applying SMOTE results in decreased precision but increased recall when compared to using the original training set. Since oversampling reduces the bias towards the majority class, specificity, which denotes the performance on the majority samples, decreases as β increases.

Fig. 5

Average F-score, G-mean and AUC values (in leftmost column) and, sensitivity (recall), specificity and precision scores (in rightmost column) obtained by MaMiPot (β = 0) and oversampling after MaMiPot using different balancing factor values, β > 0. Each row corresponds to a different classifier

The leftmost column in Fig. 5 presents the performance scores obtained for the three trade-off metrics and several balancing factors. It can be seen in the figure that applying oversampling after MaMiPot degrades the average F-score for NB. On the other hand, G-mean improves for all three classifiers when β = 0.25. Similarly, a notable improvement is observed for AUC in the case of J48. For β > 0.25, no notable improvement is observed for SVM and NB. When all metrics are considered, it can be argued that β = 0.25 is a reasonable setting.

The average F-score, G-mean and AUC scores obtained using the 52 KEEL datasets are presented in Table 2. The results obtained for both β = 0 and oversampling using β = 0.25 are reported. For each classifier and metric, the highest score is presented in boldface and the second highest score is underlined. The table shows that, using β = 0.25, our method achieves the highest average scores for all three metrics with SVM. Similarly, it provides better scores than all reference methods for F-score and G-mean using NB. Another observation is that, for all three performance metrics, the scores obtained for SVM are superior to those obtained for both NB and J48. For G-mean, the importance of applying oversampling after repositioning by MaMiPot is evident from the scores obtained using all classifiers. On the other hand, when the average performance over all folds is considered, MaMiPot is not among the top-ranked systems for J48.

Table 2 The average performance scores obtained using SVM, NB and J48

For a deeper evaluation of MaMiPot, the performance scores obtained for different β values are compared in terms of the average ranks over all folds. In particular, a rank score is assigned to each β value in each fold. Then, the sums of these ranks over all folds are calculated for each β. After sorting in ascending order, the overall ranks are recorded. Table 3 presents the overall ranks corresponding to each classifier and performance metric pair. The last column presents the row-wise averages. β = 0.25 is found to be the top-ranked setting when the ranking performance is evaluated over all metrics and classifiers. It can be seen in the table that, when G-mean and AUC are considered, J48 benefits more from higher β values than SVM and NB do.
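The rank aggregation described above can be sketched as follows; the layout of the score matrix is our assumption.

```python
import numpy as np
from scipy.stats import rankdata

def overall_ranks(scores):
    """scores: (n_folds, n_settings) array, one column per beta value.
    Rank the settings within each fold (rank 1 = best score), sum the
    per-fold ranks, and rank the settings by these sums."""
    per_fold = np.apply_along_axis(lambda row: rankdata(-row), 1, scores)
    return rankdata(per_fold.sum(axis=0))  # smaller rank sum -> better overall rank
```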

Table 3 The rankings of MaMiPot for different β values

Taking into account the ranking scores presented in Table 3, we set β = 0.25 in the remaining experiments. Table 4 presents the overall rank of each scheme for each classifier and performance metric pair. The last column presents the row-wise averages. It can be seen in Table 4 that MaMiPot using β = 0.25 is the top-ranked technique when the ranking performance is evaluated over all metrics and classifiers. It should be noted that, when evaluated using F-score, MaMiPot is top-ranked for SVM and second-ranked for both NB and J48. The ranks achieved using J48 are low for G-mean and AUC. In a recent study, the performances of seven classifiers were evaluated on datasets grouped according to their characteristics (Napierala & Stefanowski, 2016). Considering the set of five nearest neighbors, rare and outlying instances are defined as those having one or zero minority instances among their neighbors, respectively. A sample is labeled as safe if at most one of its neighbors is from the majority class. It is shown that SVM performs better than J48 on datasets which include safe and borderline instances (Napierala & Stefanowski, 2016). It can be argued that, by repositioning the majority samples away from the minority class, the training sets are transformed into safer forms, leading to better performance for SVM when compared to J48.

Table 4 The rankings of different schemes according to their fold-based performance scores for SVM, NB and J48

It is well known that decision trees are unstable classifiers, whereas SVM and NB are stable (Ting et al., 2011). Training unstable classifiers like decision trees is challenging when the training set size is small because a small perturbation in the training set may result in a highly different model, leading to substantial differences in the performance scores obtained on unseen test sets (Skurichina & Duin, 2002; Dietterich, 2000). As a matter of fact, the standard deviations of the performance scores obtained from different folds are expected to be larger for unstable classifiers. Indeed, the variation in the predicted labels due to small perturbations of the training sets is effectively exploited in constructing classifier ensembles (Breiman, 1996; Bauer & Kohavi, 1999). In this approach, the predictions of unstable learners such as decision trees are combined to obtain more accurate decisions (Skurichina & Duin, 2000). Table 5 presents the average standard deviations over 20 repetitions and all folds for MaMiPot. It can be seen that the standard deviations are much larger for J48. For a better evaluation of the relative performances of the schemes considered, 5-fold cross-validation is also performed. In this approach, 80% of the data is employed during training instead of 50%. The average scores and the corresponding ranks are presented in Tables 6 and 7, respectively. It can be seen that, for J48, the average ranks of MaMiPot improve from 10 to 3 for G-mean and from 13 to 10 for AUC. For the stable classifiers SVM and NB, only the G-mean rank of NB changes, from 1 to 3. Taking into account the sizes of the datasets employed and the observations listed above, we argue that the inferior rankings achieved using J48 can be attributed to its unstable behavior on small datasets.

Table 5 The average standard deviations over all folds for MaMiPot
Table 6 The average performance scores obtained using SVM, NB and J48 using 5-fold cross validation
Table 7 The rankings of different schemes according to their fold-based performance scores for SVM, NB and J48 using 5-fold cross validation

The KEEL repository includes datasets having a wide range of imbalance ratios. Similarly, the numbers of positive samples and the dimensionalities of the feature vectors differ. In order to investigate the relative performance of the different techniques on datasets having different characteristics, the average ranks over different KEEL subsets are computed. The results are presented in Table 8. The scores reported for IR ≥ 10 are obtained using the datasets for which the imbalance ratio is greater than or equal to ten. Similarly, the column labeled P > 30 is for the datasets where the number of positive samples is above 30. In the last two columns, the datasets are partitioned according to the number of available features, d. MaMiPot is top-ranked for highly imbalanced datasets. Similarly, it is the best-performing scheme when the number of positive samples is small. This is in line with the main target of alleviating the imbalance problem without altering the distribution of the minority class.

Table 8 The ranks on different subsets of KEEL datasets

As another comparison of the different approaches, the Nemenyi test is performed, as shown in Figs. 6, 7 and 8, respectively, for SVM, NB and J48 (Demšar, 2006). The black lines connect the methods for which the difference of mean ranks is smaller than the critical difference (CD). The test is done in a pairwise manner, for the three classifiers and the three performance metrics. In part (a) of each figure, F-scores are compared. Parts (b) and (c) compare G-mean and AUC scores, respectively. It can be observed that significantly better scores than most of the reference techniques are obtained for the majority of the classifier/metric pairs, especially for SVM and NB. Although MaMiPot is not among the top-ranked schemes for G-mean and AUC when J48 is used, it can be seen in Fig. 8 that the only method consistently in the top five for all three metrics is SMOTE-IPF. However, this technique is not ranked in the top five for any metric in the case of SVM. Table 9 presents the ties of the method that is top-ranked according to Table 2 for each classifier and performance metric. Pairwise comparisons between individual systems are performed using the Nemenyi post-hoc test, and the pairs having p-value ≥ 0.05 are reported. The corresponding p-values are also provided. It can be seen that MaMiPot is tied with the winner, TRIM-SMOTE, in the case of NB and F-score. Similarly, MaMiPot is tied with the winner, Borderline-SMOTE1, in the case of J48 and F-score.
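For reference, in the Nemenyi test the mean ranks of two methods are significantly different if they differ by at least the critical difference

$$ CD = q_{\alpha} \sqrt{\frac{k(k+1)}{6N}}, $$

where k is the number of compared methods, N is the number of datasets and q_α is the critical value based on the Studentized range statistic (Demšar, 2006).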

Fig. 6

Statistical test results for SVM: (a) Using F-scores, (b) Using G-mean scores (c) Using AUC scores

Fig. 7

Statistical test results for NB: (a) Using F-scores, (b) Using G-mean scores (c) Using AUC scores

Fig. 8

Statistical test results for J48: (a) Using F-scores, (b) Using G-mean scores (c) Using AUC scores

Table 9 The ties of the method providing the highest average rank over all folds for each classifier and performance metric pair

5 Conclusions and future work

In this study, repositioning of the training instances is proposed as an alternative approach for imbalance learning. The main motivation behind the proposed approach is to effectively learn the minority decision regions without distorting the true distribution. The proposed approach is evaluated on 52 datasets covering a wide range of imbalance ratios. The experiments conducted using three different classifiers have shown that MaMiPot achieves the best average ranking score when evaluated in terms of three different performance metrics. The relative performances of MaMiPot and the reference techniques were also compared on six groups of datasets formed by taking into account the imbalance ratios, the numbers of positive samples and the numbers of features. Among these groups, MaMiPot is observed to be particularly effective on datasets having higher imbalance ratios and on those having smaller numbers of positive samples.

In our experimental work, we considered MaMiPot as a preprocessing algorithm only for SMOTE. As a further study, the performance of MaMiPot should also be assessed for the variants of SMOTE. As can be seen on lines 14 and 19 of the algorithm, the misclassified samples are moved towards the centroids denoted by μP and μN. Various alternatives can be considered for this purpose. For instance, the samples may be clustered, as done in some variants of SMOTE, and the samples in sFP and sFN moved towards the nearest cluster. Alternatively, each majority sample in sFP may be moved towards a randomly selected true negative instance in sTN.

MaMiPot should also be evaluated using grouping criteria other than the imbalance ratio. For instance, the datasets can be partitioned into subsets, each of which is dominated by either safe, borderline, outlier or rare samples. After categorizing the datasets, MaMiPot should be run on each group to further explore its strengths and weaknesses. Moreover, MaMiPot can be modified to reposition the majority samples in different ways by taking into account the types of the neighboring minority samples.