1 Introduction

In this paper, we address classification problems involving extreme class imbalance. We define extreme imbalance as having both an exceptionally high imbalance ratio between the classes (over 1:1000) and a very low absolute number of minority class instances in the training set (often fewer than 20). These challenging properties appear in many important classification domains, such as gamma-ray spectral classification [17], fraud detection [21] and failure prediction [18]. In general, synthetic oversampling has been shown to be very effective for managing class imbalance, and as a result, has received a large portion of the research focus in recent years [4, 6, 7]. These algorithms, however, fail on domains involving extreme imbalance. In spite of the importance of such domains, there remains a dearth of research into improving performance on extremely imbalanced classification problems.

In our previous work, we postulated that in cases of extreme imbalance, there is insufficient information encoded in the minority class to employ standard techniques for synthetic oversampling. Under these circumstances, we demonstrated that superior performance improvements are achieved by employing a majority-focused strategy for generating synthetic minority training examples. We denote this majority-focused strategy by the moniker SWIM: Sampling WIth the Majority [16]. In particular, we proposed an algorithm that generates synthetic minority training examples that (1) are near to their minority seed and (2) have the same Mahalanobis distance to the mean of the majority class as their seed. This ensures that the synthetic instances do not spread into denser regions of the majority class where there is no statistical evidence that they should be.

The intuition behind SWIM is that (1) the synthetic minority instances should be generated in regions of the data-space that have similar densities with respect to the majority class as the real minority instances, and (2) they should be generated in regions that neighbour the real minority instances. Specifically, instead of asking "given the minority class data, where should new minority class instances be generated?", we ask "given the majority class data and the relative position of the minority class instances, where should new minority class instances be generated?". In this work, we generalize SWIM by offering a nonparametric alternative to the Mahalanobis distance, which uses radial basis functions. Moreover, we discuss how SWIM and SMOTE can be formed into a pipeline to combine their relative strengths.

The SWIM framework is presented on the right in Fig. 1 and contrasted with SMOTE on the left. SMOTE is the standard approach to synthetic oversampling. It generates new minority instances by interpolating points at random distances between nearest neighbours in the minority class [6]. In this sense, SMOTE is a minority-focused approach, which we contrast with our majority-focused approach. The minority-focused approach ignores the majority class, thereby rendering it susceptible to generating samples far from the local neighbourhood of the minority class seeds, and in high-density regions deep inside the majority class. This risk increases with the severity of the imbalance. Post hoc cleaning via the removal of Tomek links and the edited nearest-neighbour rule has been proposed to deal with this [19, 22]. In practice, their effectiveness can be limited and hard to predict, and this is exacerbated in cases of extreme imbalance.

Alternatively, the image on the right in Fig. 1 demonstrates that the instances generated by SWIM (red squares) are in the same neighbourhood as their minority class seeds (grey circles), and that they are in regions of the data-space with the same density with respect to the majority class as their minority class seeds (\({\hat{p}}_+({\text {GENERATED}}) = {\hat{p}}_+({\text {SEED}})\)). The density contours are shown as black rings in the image.

Fig. 1 Illustration of the SMOTE procedure generating synthetic minority instances (red squares) inside the majority class (blue circles) on the left, and the SWIM framework using the relative density of the minority class instances (grey circles) to guide the generation process and avoid erroneous regions of the majority class (right) (colour figure online)

We evaluate SWIM and five other state-of-the-art resampling methods on 24 benchmark datasets of extreme imbalance. Our performance analysis shows that SWIM provides greater performance enhancements, in terms of the geometric mean (g-mean), than the existing state-of-the-art oversampling techniques on extremely imbalanced domains.

2 Related work

In this work, we focus on supervised binary classification problems over highly imbalanced domains. The process of binary classification utilizes a training set \(X \in {\mathbb {R}}^{n \times m}\) and corresponding labels \(Y \in \{0,1\}^n\). The objective is to induce a function, \(f(x_i) \rightarrow y_i\), that maps the training instances \(x_i \in X\) to their corresponding class labels \(y_i \in Y\). This problem is made more challenging in imbalanced domains where there are far fewer examples of the minority class \(X_\mathrm{{min}}\) (\(y=1\)) than of the majority class \(X_\mathrm{{maj}}\) (\(y=0\)). This has been shown to cause the induced classifier \(f(\cdot )\) to become biased towards the larger class, thus leading to poor performance [11].

Two paradigms exist for dealing with imbalanced classification problems. When the minority class is rare or unavailable, one-class classification is applied. However, binary learning quickly becomes advantageous as the number of instances increases [5]. This has motivated research into extending the usefulness of binary classifiers to increasingly imbalanced domains. This is done via resampling the training data, cost-adjustment and/or algorithm modification [8, 20]. Resampling is the most commonly applied of these techniques; it is often favoured because it has been shown to produce robust performance gains, and can be applied with any classifier. In this paper, we focus on resampling approaches.

The most basic resampling strategies are random undersampling (RUS), and random oversampling (ROS). These balance class distributions in the training set by randomly discarding instances of the majority class, and/or by randomly replicating instances of the minority class. These strategies, however, suffer from the loss of information and the risk of overfitting, respectively.

To avoid these shortcomings, and to expand the regions of the data-space occupied by the minority training instances, the Synthetic Minority Oversampling TEchnique (SMOTE) was proposed [6]. It produces a balanced training set by interpolating synthetic instances between nearest neighbours in the set of minority class instances in the training set. This procedure relies entirely on the minority class training instances; the outcome is that the resulting synthetic data is situated within the convex hull formed by the minority class. Because the majority class is disregarded, the convex hull may overlap the majority class. In such circumstances, applying SMOTE can actually degrade performance, and this outcome becomes increasingly likely with greater class imbalance.
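For concreteness, the interpolation step described above can be sketched as follows. This is an illustrative NumPy rendering rather than the reference SMOTE implementation; the function name `smote_sketch` and its parameters are ours, and it assumes the minority class contains more than k instances.

```python
import numpy as np

def smote_sketch(X_min, n_samples, k=5, seed=0):
    """Minimal SMOTE-style interpolation between minority nearest neighbours."""
    rng = np.random.default_rng(seed)
    # pairwise Euclidean distances within the minority class
    dist = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)                    # exclude each point as its own neighbour
    neighbours = np.argsort(dist, axis=1)[:, :k]      # k nearest minority neighbours per instance

    synthetic = []
    for _ in range(n_samples):
        i = rng.integers(len(X_min))                  # random minority seed
        j = rng.choice(neighbours[i])                 # random neighbour of that seed
        gap = rng.random()                            # random point on the connecting segment
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.asarray(synthetic)
```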

In order to address this weakness in SMOTE, more recent methods have incorporated the majority class into the resampling process, or used it to clean the resampled training data after SMOTE is applied. Cleaning techniques include the removal of Tomek links and the edited nearest-neighbour rule [19]. The resampling process has also been altered to account for the class density around the minority class instances. This is the case with adaptive synthetic oversampling (ADASYN), borderline SMOTE and the majority weighted minority oversampling technique [3, 9, 10]; the only majority class information used is that which is present within the local neighbourhood of the generated sample. In these methods, the distribution of the minority class remains the key component of the generative process. Consequently, an insufficient number of minority samples will negatively impact the generative process.

In addition to the SMOTE-based methods that rely on the Euclidean distance to the k-nearest neighbours, Abdi and Hashemi [1] proposed the use of the Mahalanobis distance (MD) for synthetic minority oversampling. The fundamental distinction from our method is that they do not utilize the majority class information. Rather, they generate synthetic samples using the MD calculated on the small, and potentially error-prone, minority class training set; new samples are generated at the same MD from the minority class mean as a reference minority point. Therefore, this method is susceptible to failure when minority class data is scarce, as the estimated mean and covariance matrix will be unrepresentative of the latent minority distribution. Radial basis functions have previously been demonstrated to work well for oversampling in cases of moderate class imbalance [12]. This approach utilizes the difference between the densities of the minority and majority classes to identify safe locations from which to generate samples. As with the previously discussed methods, it is limited by its reliance on the minority class.

At their core, all current state-of-the-art oversampling methods still rely on the representativeness of the minority class instances to produce a beneficial synthetic set. In contrast, our method makes no assumptions about the minority class beyond where its existing samples are positioned with respect to the majority class. The information for generating synthetic samples comes from the populous majority class, and thus our method is effective for classification problems in which the minority class is rare, a situation that is both common and of great importance [13].

3 SWIM framework

Instead of relying on the position of and distance between minority class instances, the SWIM framework utilizes the density of each minority class instance with respect to the distribution of the majority class in order to determine where to generate synthetic instances. This is abstractly described in Algorithm 1. The key components of SWIM are the density estimation and the shift procedures. Any appropriate method for density estimation can be applied; in the subsequent sections, we present the use of the Mahalanobis distance and the radial basis function for this purpose. The shift function takes a minority seed and generates a synthetic instance that is close to the seed and has approximately the same density.

Algorithm 1 The SWIM framework
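The loop structure of Algorithm 1 can be summarized with the following schematic sketch. The callables `fit_density` and `shift` stand in for the concrete Mahalanobis and radial basis implementations described in the following sections; this is our illustration of the framework's control flow, not the authors' code.

```python
import numpy as np

def swim_oversample(X_maj, X_min, n_samples, fit_density, shift, seed=0):
    """Schematic SWIM loop: model the majority class, score the minority seeds,
    then repeatedly shift randomly chosen seeds to create synthetic instances."""
    rng = np.random.default_rng(seed)
    density = fit_density(X_maj)                           # line 1: model the majority class
    seed_scores = np.array([density(b) for b in X_min])    # line 2: density of each minority seed

    synthetic = []
    while len(synthetic) < n_samples:                      # lines 3-7: generate until enough samples exist
        i = rng.integers(len(X_min))                       # pick a minority seed at random
        synthetic.append(shift(X_min[i], seed_scores[i], density, rng))
    return np.asarray(synthetic)
```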

3.1 Mahalanobis oversampling

The Mahalanobis distance (MD) provides a fast and efficient means of implementing the SWIM framework. The MD calculates the distance between the mean of the majority class distribution and a minority class seed point while accounting for the covariance of the distribution: given mean \(\mu \) and covariance matrix \(\varSigma \), \(\mathrm{MD}(x) = \sqrt{(x-\mu )^\top \varSigma ^{-1}(x-\mu )}\). Thus, two points have the same MD from the mean if they lie on the same hyperelliptical density contour. This is contrasted with the Euclidean distance in Fig. 2. Given a minority seed, according to the SWIM framework, any point in the nearby regions of the data-space with the same MD can be sampled as a synthetic minority training instance.

Fig. 2 Illustration of the Mahalanobis distance of two points A and B from the mean. Both points have the same Mahalanobis distance, but different Euclidean distances, from the mean

The calculation of the MD requires the mean \(\mu \) and the covariance matrix \(\varSigma \) of the distribution. In practice, however, these parameters are estimated as \({\overline{\mu }}\) and \({\overline{\varSigma }}\) from a sample population. Larger, more representative sets, such as is typical of the majority class training data, produce better estimates of these parameters.

Line 1 of Algorithm 1 for \(\hbox {SWIM}_\mathrm{{MD}}\) involves estimating \({\overline{\mu }}\) and \({\overline{\varSigma }}\) from the majority class instances and applying some preprocessing steps that simplify the data generation procedure. The steps are as follows:

  1. Step 1

    Centre the majority and minority classes: Centring the data simplifies the calculation of the distances; this will be evident in the first step of the generation process. Let \({{\mu }}_\mathbf{a}\) be the feature mean vector of the majority class A. We centre the majority class to have \(\mathbf {0}\) mean, and then centre the minority class using the mean vector of the majority class:

    $$\begin{aligned} \begin{aligned} A_\mathrm{c} = A -{{\mu }}_\mathbf{a} \\ B_\mathrm{c} = B -{{\mu }}_\mathbf{a} \end{aligned} \end{aligned}$$
    (1)
  2. Step 2

    Whiten the minority class: Let \({\varvec{\Sigma }}\) denote the estimated covariance matrix of \(A_\mathrm{c}\), and \({\varvec{\Sigma }}^{-1}\) denote its inverse. \({\varvec{\Sigma }}^{-\frac{1}{2}}\) is the square root of \({\varvec{\Sigma }}^{-1}\). We whiten the centred minority class as:

    $$\begin{aligned} B_\mathrm{w} = B_\mathrm{c}{\varvec{\Sigma }}^{-\frac{1}{2}} \end{aligned}$$
    (2)

    The MD in the original space is equivalent to the Euclidean distance in the whitened space. Thus, whitening simplifies the calculations in the shift procedure for generating synthetic data.

  3. Step 3

    Find feature bounds: These are used to bound the spread of the synthetic samples. For each feature f in \(B_\mathrm{w}\), we find its mean \(\mu _{f}\) and standard deviation \(\sigma _{f}\). We then calculate an upper and lower bound on its value, \(u_{f}\) and \(l_{f}\), as follows:

    $$\begin{aligned} \begin{aligned} u_{f} = \mu _{f} + \alpha \sigma _{f} \\ l_{f} = \mu _{f} - \alpha \sigma _{f} \end{aligned} \end{aligned}$$
    (3)

    \(\alpha \in {\mathbb {R}}\) controls how many standard deviations wide the bounds are. Larger \(\alpha \) values therefore permit a greater spread along the corresponding density contour. A sketch of these preprocessing steps is given below.
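The following sketch implements the three preprocessing steps (Eqs. 1–3), assuming the majority covariance matrix is full rank (the rank-deficient case is handled below via a QR decomposition). Forming \(\varSigma ^{-\frac{1}{2}}\) by eigendecomposition is one possible choice, not necessarily the authors' implementation.

```python
import numpy as np

def swim_md_preprocess(A, B, alpha=0.5):
    """Steps 1-3: centre both classes, whiten the minority class, and compute
    per-feature bounds in the whitened space."""
    mu_a = A.mean(axis=0)
    A_c, B_c = A - mu_a, B - mu_a                        # Eq. (1): centre on the majority mean

    cov = np.cov(A_c, rowvar=False)                      # covariance of the centred majority class
    w, V = np.linalg.eigh(cov)                           # eigendecomposition (assumes full rank)
    inv_sqrt = V @ np.diag(1.0 / np.sqrt(w)) @ V.T       # one way to form Sigma^{-1/2}
    B_w = B_c @ inv_sqrt                                 # Eq. (2): whiten the minority class

    mu_f, sd_f = B_w.mean(axis=0), B_w.std(axis=0)
    lower = mu_f - alpha * sd_f                          # Eq. (3): lower feature bounds
    upper = mu_f + alpha * sd_f                          # Eq. (3): upper feature bounds
    return mu_a, inv_sqrt, B_w, lower, upper
```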

Lines 3–7 of Algorithm 1 are repeated until a user-specified number of synthetic samples has been generated; in practice, we repeat this until the classes are balanced. Within the loop, we select a minority seed \(b_i\) at random and apply the shift procedure, which returns a synthetic instance. The shift procedure for \(\hbox {SWIM}_\mathrm{{MD}}\) is carried out as follows:

  1. Step 1

    Generate new samples: For each feature f, we generate a random number between \(l_{f}\) and \(u_{f}\). Thus, we obtain a sample point \(b_i^\prime \) in the whitened space, where each feature value \(b_{i,f}^\prime \) satisfies \(l_{f} \le b_{i,f}^\prime \le u_{f}\). We then rescale this point so that it lies at the same Euclidean distance from the mean of the majority class as the seed. Since we centred the data, this implies that the new sample will have the same Euclidean norm as the minority seed \(b_i\). Therefore, we transform \(b_i^\prime \) as:

    $$\begin{aligned} b_i^\mathrm{{norm}} = b_i^\prime \frac{\Vert b_i\Vert _2}{\Vert b_i^\prime \Vert _2} \end{aligned}$$
    (4)
  2. Step 2

    Scale the sample back to the original space: \(b_i^\mathrm{{norm}}\) exists in the whitened space, at the same Euclidean distance from the mean vector \(\mathbf {0}\) as the whitened seed \(b_i\). We now transform the point back into the original space. This is done as:

    $$\begin{aligned} b_i^{\prime \prime } = ({\varvec{\Sigma }}^{-\frac{1}{2}})^{-1}b_i^\mathrm{{norm}}, \end{aligned}$$
    (5)

where the synthetic sample \(b_i^{\prime \prime }\) lies on the same density contour as its minority seed instance \(b_i\). Since the data were centred in Step 1 of the preprocessing, the majority class mean \({{\mu }}_\mathbf{a}\) is added back to return the sample to the original coordinates.
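A sketch of the shift procedure, continuing the hypothetical helpers from the preprocessing sketch above (`inv_sqrt`, `mu_a`), is given below. The final addition of `mu_a` reflects the undoing of the centring step, which the equations leave implicit.

```python
import numpy as np

def swim_md_shift(b_w, lower, upper, inv_sqrt, mu_a, rng):
    """Generate one synthetic instance from a whitened minority seed b_w."""
    b_prime = rng.uniform(lower, upper)                                  # Step 1: random point inside the bounds
    b_norm = b_prime * (np.linalg.norm(b_w) / np.linalg.norm(b_prime))   # Eq. (4): rescale to the seed's norm
    b_orig = b_norm @ np.linalg.inv(inv_sqrt)                            # Eq. (5): undo the whitening
    return b_orig + mu_a                                                 # undo the centring of Eq. (1)
```

Plugging `swim_md_preprocess` and `swim_md_shift` into the loop sketched after Algorithm 1 yields a complete \(\hbox {SWIM}_\mathrm{{MD}}\) oversampler.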

As the method involves the computation of matrix inverses, the calculations will fail if there are linearly dependent columns. To handle this case, we check the rank r of the majority class A. If \(r < d\), where d is the dimensionality of A, then we calculate the QR-decomposition of A. The nonzero diagonal entries of the resulting upper-triangular matrix correspond to the linearly independent columns of A. Using the steps outlined above, we can then oversample and classify the data in the sub-space defined by the features represented by these columns.
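The rank check can be sketched as follows, mirroring the description above; in practice a column-pivoted QR decomposition gives a more reliable rank-revealing factorization, but the plain decomposition illustrates the idea. The tolerance `tol` is an assumption on our part.

```python
import numpy as np

def independent_columns(A, tol=1e-10):
    """Indices of linearly independent columns of A, via the R factor of a QR decomposition."""
    d = A.shape[1]
    if np.linalg.matrix_rank(A) == d:
        return np.arange(d)                           # full rank: keep every feature
    _, R = np.linalg.qr(A)                            # A = QR, with R upper triangular
    return np.flatnonzero(np.abs(np.diag(R)) > tol)   # nonzero diagonal entries mark independent columns
```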

3.2 Radial basis oversampling

To implement a nonparametric form of SWIM, we propose the use of a radial basis function (RBF) with a Gaussian kernel to estimate the local density of the minority class samples with respect to the majority class. The Gaussian kernel takes a distance r and a smoothing parameter \(\epsilon \) and is computed as:

$$\begin{aligned} \phi (r) = \exp \left( -(\epsilon r)^2\right) \end{aligned}$$
(6)

For \(\epsilon \) values closer to zero, the result is a flatter, wider basis function, whereas as \(\epsilon \rightarrow \infty \) all of the weight is placed at the sample points. The latter results in a distribution with density spikes at each majority class point. Typically, the optimal choice of \(\epsilon \) is a value close to zero.

As a reasonable default, we set the shape parameter to \(\epsilon =\alpha \sigma d\), where d and \(\sigma \) are the mean and standard deviation of the pairwise distances between points in A, and \(\alpha =-0.5\). This keeps the parameter relatively small, and increasingly so for increasingly sparse data, which smooths the function and leads to better generalization in practice.

To estimate a score for a minority instance \(b \in B\) with respect to the majority class A that is analogous to a density, we sum the RBF values between b and each instance \(a_i \in A\) in the majority class:

$$\begin{aligned} {{rbf}}\_{{{score}}}(b) = \displaystyle \sum _{i=1}^{|A|} \phi (||a_i - b||_2) \end{aligned}$$
(7)

For sample generation, the \({{rbf}}\_{{{score}}}\) can be used to find regions of the data-space that neighbour the minority seed b and have similar estimated densities. In practice, we consider any region of the data-space with an equal or lower \({{rbf}}\_{{{score}}}\) as a candidate location for a synthetic minority instance.
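A direct transcription of Eqs. (6) and (7) is given below; the vectorized form assumes `A` is a NumPy array with one majority instance per row.

```python
import numpy as np

def rbf_score(b, A, epsilon):
    """Eqs. (6)-(7): sum of Gaussian kernel values between point b and every majority instance."""
    r = np.linalg.norm(A - b, axis=1)            # Euclidean distance from b to each majority instance
    return np.exp(-(epsilon * r) ** 2).sum()     # phi(r) = exp(-(epsilon r)^2), summed over A
```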

Fig. 3 Example of potential regions from which to draw synthetic minority samples (colour figure online)

The regions of the data-space that are shaded white in Fig. 3 have the same density as the seed, and the areas shaded red have lower densities. The white and red regions form the set of potential points from which to generate synthetic minority samples. We constrain this set by requiring that the samples be drawn close to the seed.

The \(\hbox {SWIM}_\mathrm{{RBF}}\) implementation of Algorithm 1 does not require learning a model of the majority class instances A in line 1. However, we do apply some initial processing that is worth noting. In particular, we standardize the data to have zero mean and unit standard deviation, and calculate the mean and standard deviation of the distances between majority class instances. These are utilized in setting the \(\epsilon \) parameter and the step size of the Gaussian jitter used in the shift function. At this point, a ball tree or KD tree can also be constructed to improve the efficiency of estimating the density.

The values of the vector d in line 2 of Algorithm 1 are calculated with the RBF approach by applying the \(rbf\_score\) to each minority instance \(b_i \in B\). Given a minority seed \(b_i\), the shift procedure for the RBF implementation of SWIM is shown in Algorithm 2. The process involves sampling the data-space around the seed to find points with approximately the same density as the seed. In particular, for a given seed \(b_i\), a synthetic sample \(b_i^\prime \) is produced by applying Gaussian jitter to the seed in line 2, and calculating the RBF density at the candidate sample point in line 3. This is repeated until a candidate sample with \(b^\prime _{{score}} \le b_{{score}}\) is found. However, we limit the number of attempts to a fixed maximum. If after \(maxAttempts=5\) tries no point with equal or lower density than the seed has been sampled from the neighbourhood of the seed, then the seed itself is replicated. In practice, however, the maximum will rarely be reached if the step size and \(\epsilon \) are sufficiently small.

The generation process is repeated by randomly selecting minority instances and generating a new synthetic instance until the classes are balanced or a user-specified number of instances has been produced.

Algorithm 2 The \(\hbox {SWIM}_\mathrm{{RBF}}\) shift procedure
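The shift loop of Algorithm 2 can be sketched as follows, reusing the `rbf_score` helper above. The `step` argument stands in for the Gaussian-jitter step size derived from the majority-class distance statistics; its exact form is an assumption on our part.

```python
import numpy as np

def swim_rbf_shift(b, b_score, A, epsilon, step, rng, max_attempts=5):
    """Jitter the seed until a candidate with equal or lower rbf_score is found,
    otherwise fall back to replicating the seed."""
    for _ in range(max_attempts):
        candidate = b + rng.normal(scale=step, size=b.shape)    # Gaussian jitter around the seed
        if rbf_score(candidate, A, epsilon) <= b_score:         # same or lower density w.r.t. the majority class
            return candidate
    return b.copy()                                             # replicate the seed after maxAttempts failures
```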

3.3 SWIM demonstration

3.3.1 RBF density estimation

As previously discussed, the \(\epsilon \) parameter of the radial basis function controls the smoothness of the density distribution. As a result, \(\epsilon \) is the parameter that determines where \(\hbox {SWIM}_\mathrm{{RBF}}\) draws its set of potential synthetic samples from. Although we find that setting the \(\epsilon \) value based on the Euclidean distances between majority class instances works well in practice, it is possible to optimize this using cross-validation.

We provide Fig. 4 in order to assist readers in understanding the impact of setting the \(\epsilon \) parameter. This figure shows the resulting density distributions estimated from the majority class samples (red squares) for \(\epsilon \) values \(\{1,1.25,1.5,1.75,2\}\). The effect is that the density distribution is smoother and less moon-shaped for \(\epsilon =1\), whereas \(\epsilon =2\) has more variation in the density and a more exaggerated moon-shape.

By qualitatively examining these plots, it appears that a good \(\epsilon \) value is likely between 1.5 and 1.75. The density distribution for this range of \(\epsilon \) matches the moon-shape well, while the density drops off at a moderate rate away from dense regions of the majority class. In contrast, with smaller \(\epsilon \) values the shape is not well matched, and with larger \(\epsilon \) values the density scores quickly drop away from the majority class samples.

Fig. 4 This figure demonstrates how the increase in the epsilon parameter affects the smoothness of the density function on a toy dataset

3.3.2 Resampling and classification

Figure 5 presents a comparison of the decision surfaces produced by SVM classifiers on the moon-shaped imbalanced dataset without resampling (top left), with SMOTE (top right), with \(\hbox {SWIM}_\mathrm{{MD}}\) (bottom right) and with \(\hbox {SWIM}_\mathrm{{RBF}}\) (bottom left). This figure depicts the training, testing and synthetic instances. This is done in order to paint a clear picture of whether each resampling method generates synthetic instances in regions that are reflective of the underlying distribution. In addition, this enables the reader to see whether the synthetic samples have a positive impact on the induced decision surface (i.e. is the decision surface learned on the training data a good match to the test data?). The minority instances are blue and the majority instances are red in each plot. The training instances are plotted as squares, and the testing instances are plotted as circles. The score in the bottom right corner of each plot shows the geometric mean on a test set.

Fig. 5 This figure demonstrates how the decision surface (shown as the red-blue gradient) of an SVM classifier changes as a result of resampling with SMOTE, \(\hbox {SWIM}_\mathrm{{MD}}\) and \(\hbox {SWIM}_\mathrm{{RBF}}\) on a toy dataset (colour figure online)

This overlay provides evidence regarding the merit of each resampling method. The plots highlight how the different means of generating synthetic samples impact the learned decision surfaces. In particular, they demonstrate that SWIM produces a smoother synthetic minority distribution than SMOTE and that the distribution spreads in directions that have a more beneficial impact on the learned decision boundary.

4 Experiments

4.1 Set-up

We compare the \(\hbox {SWIM}_\mathrm{{MD}}\) and \(\hbox {SWIM}_\mathrm{{RBF}}\) implementations of the SWIM framework to the state-of-the-art resampling methods for class imbalance. These include SMOTE, SMOTE with the removal of Tomek links (SMOTE \(+\) TL), SMOTE with edited nearest neighbours (SMOTE \(+\) ENN), borderline SMOTE (Bord), adaptive synthetic sampling (ADASYN) and no resampling (None). For each of the SMOTE-based methods, \(K=5\) and \(K=7\) were used.

We set \(\alpha =0.5\) and test \(\epsilon = \{0.5,1,3,5\}\) for \(\hbox {SWIM}_\mathrm{{RBF}}\). \(\hbox {SWIM}_\mathrm{{MD}}\) has a single parameter \(\alpha \in {\mathbb {R}}\), which controls the potential step size from the seed. We tested \(\alpha = \{0.25, 0.5, 1\}\).

In each experiment, we test the following base classifiers: naïve Bayes, k-nearest neighbours, decision trees, support vector machines and multi-layer perceptron classifiers. For each classifier, the default settings from the Python Scikit-learn library [15] are used. From the set of base classifiers, we select the best coupling of classifier and resampling method for each domain, and our analysis is performed using these set-ups.

The evaluation is performed on 24 benchmark datasets (see Table 1) from the KEEL repository [2]. From each KEEL dataset, D, we created six sub-problems in which the minority class is artificially down-sampled to 3, 5, 10, 15, 20 and 30 instances. Therefore, we ran experiments on \(24\times 6=144\) imbalanced classification problems. This down-sampling strategy enables us to study the behaviour of each resampling algorithm on increasingly extreme levels of absolute and relative imbalance. We randomly repeat each sub-problem 30 times and record the average performance in terms of the g-mean.

The g-mean provides a combined assessment of accuracy on the target and the outlier class in a single value [14]. Given the accuracy on the target class \(a^+\) and the accuracy on the outlier class \(a^-\), the g-mean for a classification model f on test set X is calculated as: \(\hbox {g-mean}_{f(X)}=\sqrt{a^+ \times a^-}\).
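For reference, a minimal NumPy transcription of this formula is shown below, assuming labels coded as 1 for the target (minority) class and 0 for the outlier (majority) class.

```python
import numpy as np

def g_mean(y_true, y_pred):
    """Geometric mean of the accuracies on the target (y=1) and outlier (y=0) classes."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    acc_pos = np.mean(y_pred[y_true == 1] == 1)   # accuracy on the target class
    acc_neg = np.mean(y_pred[y_true == 0] == 0)   # accuracy on the outlier class
    return float(np.sqrt(acc_pos * acc_neg))
```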

Table 1 This table lists the dimensionality of each dataset used in the evaluation, along with the average number of majority class training instances, and the imbalance ratio for training sets with 5, 15 and 30 minority class instances

4.2 Results

4.3 Rank analysis

Table 2 shows the number of times each resampling method led to the best-performing classifier for experiments with increasingly extreme levels of imbalance. The resampling methods are listed in the first column. Each of the remaining columns corresponds to the sub-problems of training on 3, 5, 10, 15, 20 and 30 minority instances for each of the 24 datasets. The table shows that the SWIM framework (particularly \(\hbox {SWIM}_\mathrm{{RBF}}\)) dominates on the more extreme levels of imbalance. For minority training sizes 20 and 30, the SWIM framework remains competitive, but the results are more mixed.

Table 2 Number of times each resampling method was best across all datasets on each minority training set size

Figure 6 presents box plots of the ranks of each resampling method on the six imbalanced sub-problems. These plots provide additional evidence of the superiority of the SWIM framework on highly imbalanced problems. In particular, they show that \(\hbox {SWIM}_\mathrm{{RBF}}\) and \(\hbox {SWIM}_\mathrm{{MD}}\) have the best average ranks, and that the variances in their ranks are low relative to the other methods. This is most notably the case for minority training sizes of up to 15.

Fig. 6 Boxplots showing the rank of each resampling method over all of the datasets in each sub-problem of minority training sizes 3, 5, 10, 15, 20 and 30. In these figures, SW indicates SWIM and SM indicates SMOTE. The y-axis shows the ranks over which each method is distributed and the x-axis specifies the resampling method

We employed the Friedman test to assess the statistical significance of the rankings for each sub-problem and found that the null hypothesis can be rejected at an \(\alpha \) value of 0.05 for minority training sizes of 3, 5, 10 and 15. The Nemenyi post hoc test finds a statistically significant difference between \(\hbox {SWIM}_\mathrm{{RBF}}\) and all methods except \(\hbox {SWIM}_\mathrm{{MD}}\) for minority training sizes 3, 5 and 10.

4.4 Relative performance analysis

The bar plot in Fig. 7 shows the relative difference in g-mean produced by the best SWIM system versus the best alternative for each dataset with minority training sizes 3, 5, 10 and 20. We have left out sizes 15 and 30 to keep the presentation clear; nonetheless, the trend is consistent across the omitted sizes. The dataset names are stated on the y-axis, and the difference \(\text {g-mean}(\hbox {SWIM}) - \text {g-mean}(\text {Alt})\) is plotted along the x-axis. Bars to the right indicate better performance for SWIM, and those to the left indicate better performance for the alternative resampling method. In order to clarify the performance trends, the datasets are sorted according to the relative differences over minority training sets of size 3.

Fig. 7 Bar plots highlighting the difference in g-mean for the best SWIM system versus the best alternative resampling system (Alt best) on each dataset with minority training sizes 3, 5, 10 and 20. The differences are calculated as \(\text {g-mean}(\hbox {SWIM}) - \text {g-mean}(\hbox {Alt})\) and displayed on the x-axis. Each dataset is specified on the y-axis, and the colours of the bars indicate the minority training size. Positive values indicate better performance by SWIM, whereas negative values indicate better performance by the alternative method

These results once again demonstrate that SWIM performs well on all training sizes, but has a stronger advantage under more extreme imbalance. In addition to winning more often, it also wins by a larger margin on these datasets. The tables showing the g-mean scores for each method are included in the "Appendix".

In the comparison between the MD and RBF implementations of SWIM, the trend is that RBF has an advantage over MD on the more extreme cases of imbalance. This is demonstrated in Fig. 11, which shows the performance of the best classification systems involving \(\hbox {SWIM}_\mathrm{{RBF}}\), \(\hbox {SWIM}_\mathrm{{MD}}\) and no resampling on the sub-problems with minority training sizes 20 and 3. In general, preprocessing with either SWIM implementation beats the baseline in all cases for size 3 and all but one case for size 20. With respect to the SWIM implementations, \(\hbox {SWIM}_\mathrm{{RBF}}\) is best on 17 out of 25 datasets for minority training size 3, and 13 out of 25 times for minority training size 5.

4.5 Performance curves

Figure 8 displays the performance curves for each resampling method on the monk-2 and vowel0 datasets. The plots show the minority training size on the x-axis and the average g-mean on the y-axis. These represent typical examples of how the relative performances of the resampling methods evolve with the changing level of imbalance. In particular, we see that SWIM generally has a strong advantage at the extreme levels of imbalance, and the performances of the resampling methods begin to converge around 20 or 30 minority samples.

Fig. 8 Performance curves showing the relationship between g-mean and the number of minority training instances on the monk-2 (left) and vowel0 (right) datasets for each resampling method

5 Discussion

Using PCA, we are able to identify categories of datasets where SWIM works best, and others where the SMOTE-based alternatives are strong. Figure 7 reveals that the SMOTE-based alternative is better on the Bands and Sonar datasets at all levels of imbalance. The PCA plots for these datasets are presented in Fig. 9.

Fig. 9 PCA plots for the Bands and Sonar datasets

Through our empirical analysis, we have found that the SMOTE-based strategy (particularly with some cleaning) has an advantage over datasets in which the minority class has a single cluster of points that are close together. In these cases, linear interpolation is an effective sampling bias. Heavy overlap of the single minority cluster with the majority class is also a property that appears in the datasets where SMOTE is generally strong. In these cases, generating in dense regions of the majority space that neighbour the minority seeds is essential to improving the classification performance. Alternatively, the SWIM methods avoid higher density areas of the majority class space. Therefore, on domains with these properties, classifiers induced on data from SWIM undergeneralize. It is possible to relax this property of the SWIM algorithm, but we expect that this would often be harmful.

Fig. 10 PCA plots for the ionosphere and heart datasets

SWIM produces good results on most datasets that do not have the aforementioned properties. These represent a large majority of the datasets in our study, and in real-world applications. During our analysis, we have identified, however, that SWIM is particularly strong on datasets in which the minority class data is spread out, and potentially has many clusters. We also note that SWIM performs well when the majority class density is non-uniform between the minority class instances. This is the case in the ionosphere and heart datasets. The PCA plots of these are presented in Fig. 10.

Fig. 11 Comparison of the RBF and Mahalanobis implementations of SWIM. The left plot shows the g-mean for datasets with minority training sizes of 20, and the right plot is for minority training sizes of 3

We compare the performance of \(\hbox {SWIM}_\mathrm{{MD}}\) to \(\hbox {SWIM}_\mathrm{{RBF}}\) in Fig. 11. These results show that in certain cases, \(\hbox {SWIM}_\mathrm{{MD}}\) has a strong advantage over \(\hbox {SWIM}_\mathrm{{RBF}}\). In conjunction with these results, we performed the same empirical analysis using PCA plots to examine the properties of the datasets where \(\hbox {SWIM}_\mathrm{{MD}}\) works better than \(\hbox {SWIM}_\mathrm{{RBF}}\). This confirmed our hypothesis that \(\hbox {SWIM}_\mathrm{{RBF}}\) has a strategic advantage on datasets with more complex, non-Gaussian majority classes, such as Segment0 and spambase. Alternatively, \(\hbox {SWIM}_\mathrm{{MD}}\) performs very well on datasets, such as Coil2000, Monk-2 and Glass, where the distribution is closer to a Gaussian. We present an example PCA plot for each of these in Fig. 12.

Fig. 12 PCA plots for the coil2000 and spambase datasets

5.1 Resampling parameter

The parameters in many resampling methods can have a significant impact on the performance of the induced classifiers. SMOTE, \(\hbox {SWIM}_\mathrm{{MD}}\) and \(\hbox {SWIM}_\mathrm{{RBF}}\) each have parameters (K, \(\alpha \), \(\sigma \)) that impact the spread of the synthetic instances through the data-space. With respect to SMOTE, K is an integer between 1 and one less than the size of the minority class (\(K \in \{1,\ldots ,|B|-1\}\)). Increasing the value of K allows synthetic samples to be generated between a minority class seed and more of its distant minority class neighbours. Larger K-values cause the induced classifier to generalize over a larger area of the data-space for the minority class. Setting K too large, however, can lead to synthetic instances being generated deep inside the majority class space (as we previously demonstrated in Fig. 1).

Similarly to the K value in SMOTE, increasing the \(\alpha \) parameter in \(\hbox {SWIM}_\mathrm{{MD}}\) and the \(\sigma \) value in \(\hbox {SWIM}_\mathrm{{RBF}}\) increases the spread of the synthetic samples, and therefore the generalization of the induced classifier. The key difference is that the SWIM algorithm will only generate synthetic instances in regions of the data-space with similar densities as the minority seeds. Therefore, unlike SMOTE, there is no risk of spreading synthetic samples deep inside the majority class. Because the SWIM algorithm will not generate samples in harmful regions, a naive setting of its spread parameter is less likely to have a negative impact on classifier performance than a naive setting of the K parameter in SMOTE.

Another advantage of the real-valued spread parameter in SWIM, in comparison with SMOTE, is that small adjustments to the parameter cause smooth, small changes in the distribution of the synthetic samples. Alternatively, increasing or decreasing the K value of SMOTE by one can have a drastic impact on the distribution of the synthetic instances. This is particularly the case in domains with extreme imbalance. As a result, the performance of classifiers induced on data including synthetic instances from SWIM is consistent across small changes in the spread parameters, whereas classifier performance may change drastically with the value of K.

5.2 Run-time complexity

An advantage of the SMOTE algorithm is that it has a relatively low run-time complexity. For the basic algorithm, it requires finding the k-nearest minority class neighbours of each minority class instance. For a single minority class instance, its k-nearest neighbours can be found in \({\mathcal {O}}(kn)\), where n is the number of samples and k is the number of neighbours. This is repeated n times (once for each minority class seed), and then a synthetic sample can be generated between the seed and a random neighbour in constant time.

In comparison, the SWIM algorithms are composed of two parts: the modelling step (Algorithm 1) and the shift step (Algorithm 2), which generates a sample from a minority seed.

To efficiently implement \(\hbox {SWIM}_\mathrm{{RBF}}\), the first step involves constructing a ball tree from the majority class instances. This can be done in \({\mathcal {O}}(n \text { log } n)\). In contrast to SMOTE, the n in this case represents the number of instances in the majority class, which is much larger. Once the tree is constructed, the \(rbf\_{score}\) of each candidate synthetic sample is calculated in \({\mathcal {O}}(n \text { log } n)\) time. In the worst case, this is repeated \(m \times maxAttempts\) times, where m is the number of synthetic samples needed, and maxAttempts is the maximum number of times we can attempt to satisfy the condition \(b^\prime _{score} \le b_{score}\) in line 4 of Algorithm 2.

The run-time complexity of \(\hbox {SWIM}_\mathrm{{MD}}\) is dictated by the matrix inversion and multiplication operations. Let a be the number of instances in the majority class, and b be the number of instances in the minority class. Let d be the dimensionality of the data, and \(n = a + b\) be the total number of instances available for training. The complexities of the computation of the mean vector, covariance matrix and the inverse of the covariance matrix are \({\mathcal {O}}(ad)\), \({\mathcal {O}}(ad^2)\) and \({\mathcal {O}}(d^{2.37})\) respectively. The centring step has a complexity of \({\mathcal {O}}(n)\), whereas the computation of the square root and the whitening operation have \({\mathcal {O}}(d^3)\) and \({\mathcal {O}}(bd^2)\) time complexity. Finding the feature bounds has \({\mathcal {O}}(d)\) complexity, and sample generation has \({\mathcal {O}}(btd)\) complexity, where t is the number of samples to generate for each minority class sample. Finally, scaling back the generated samples to the original space involves matrix multiplication with the square root of the inverse, and thus the operation has a complexity of \({\mathcal {O}}(btd^2)\).

Although the run-time complexities of the SWIM algorithms are slightly greater than that of SMOTE, they come with the benefit of better results in cases of extreme imbalance. We argue that this is a necessary trade-off to achieve good results on highly imbalanced problems. Moreover, the resampling is only performed once during the preprocessing stage, and thus has a relatively minor impact on the overall learning cost.

5.3 SWIM-SMOTE cascade ensemble

Given that SWIM has an advantage on the more extreme cases of imbalance, and SMOTE is strong on the less extreme imbalance, a natural question is whether they can be applied as a cascade ensemble to produce better overall results. We are specifically interested in whether we can achieve performance gains by applying SWIM to address the extreme imbalance, and then applying SMOTE to the combined set of real and synthetic minority training examples. The bottom row of Table 3 shows the number of times the cascade ensemble of \(\hbox {SWIM}_\mathrm{{MD}}\) + SMOTE was superior to applying \(\hbox {SWIM}_\mathrm{{MD}}\) or SMOTE alone.

Table 3 Number of times each resampling method was best across all datasets on each minority training set size

The combination leads to improved performance on approximately 10 of the 25 datasets. Interestingly, these results appear to be independent of the minority class training size. This is counter to our hypothesis that the combination would be most helpful on cases of more extreme imbalance. However, more research is required on exactly how to optimize ensembles of resampling systems. We set this out as a future direction for research in class imbalance.
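As a concrete illustration of the cascade described above, the following sketch applies a SWIM implementation first and then SMOTE from the imbalanced-learn library to the partially rebalanced set. The `swim_oversample` callable and the amount of SWIM-generated data `n_swim` are illustrative assumptions, and the sketch presumes the SWIM stage leaves enough minority instances for SMOTE's neighbour search.

```python
import numpy as np
from imblearn.over_sampling import SMOTE

def swim_smote_cascade(X_maj, X_min, swim_oversample, n_swim, random_state=0):
    """Hypothetical SWIM -> SMOTE cascade: SWIM lifts the minority class out of the
    extreme-imbalance regime, then SMOTE interpolates within the enlarged minority set."""
    X_syn = swim_oversample(X_maj, X_min, n_swim)                   # stage 1: SWIM-generated minority instances
    X = np.vstack([X_maj, X_min, X_syn])
    y = np.concatenate([np.zeros(len(X_maj)),                       # majority labelled 0
                        np.ones(len(X_min) + len(X_syn))])          # real + synthetic minority labelled 1
    return SMOTE(random_state=random_state).fit_resample(X, y)      # stage 2: SMOTE balances the rest
```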

6 Conclusion

Extreme class imbalance occurs in a wide variety of important domains. Research related to it, however, has largely failed to produce synthetic oversampling methods that are effective on such domains. To address this, we introduce a framework for synthetic oversampling, SWIM (Sampling WIth the Majority). The key advantage of SWIM over existing methods on extreme imbalance is that SWIM utilizes the information offered by the majority class data to generate synthetic minority class instances. This enables SWIM to generate synthetic data in a manner that leads to a more general decision boundary without encroaching too deeply into the majority class. In contrast, traditional methods, such as SMOTE, apply a minority-focused resampling strategy. This causes them to be heavily impacted by extreme imbalance and often leads to harmful encroachments into the majority class space.

We evaluate our proposed parametric and nonparametric implementations of SWIM (\(\hbox {SWIM}_\mathrm{{MD}}\) and \(\hbox {SWIM}_\mathrm{{RBF}}\)) against the state-of-the-art resampling methods on 24 benchmark datasets of extreme imbalance. Our results, based on the g-mean evaluation metric, show that classifiers trained on datasets preprocessed with SWIM generally rank better than those trained with any other method in cases of extreme imbalance, i.e. when datasets have fewer than 20 minority samples. In the comparison between \(\hbox {SWIM}_\mathrm{{MD}}\) and \(\hbox {SWIM}_\mathrm{{RBF}}\), our results suggest that \(\hbox {SWIM}_\mathrm{{RBF}}\) is robust on each of the evaluated minority class sizes (3–50). However, \(\hbox {SWIM}_\mathrm{{MD}}\) is comparable or better on the larger minority class sizes (30–50).

Future work will explore other methods and aim to derive insights into the relationships between their efficacy for generating samples and the properties of the dataset. Furthermore, the integration of SWIM and SMOTE is an exciting avenue for future work, as it harnesses the strengths of both methods, with SWIM generating enough data to rectify the extreme imbalance, and SMOTE then generating more instances over a less extremely imbalanced dataset.