1 Introduction

Multi-class imbalance problems occur in many real-world applications, such as medical research, information retrieval, oil reservoir identification, and credit rating models, where certain classes known as minority classes possess far fewer instances than other classes known as majority classes. Learning with multi-class imbalanced data is more complicated as the complex structural features composed of multiple majority and minority classes are highly susceptible to class overlapping and overgeneralization (Wang et al., 2012). The learning algorithms that work well for two-class imbalanced data may fail in multi-class imbalance scenarios (Zhou et al., 2006; Hartono et al., 2021). Thus, there is a pressing need to research effective learning methods for multi-class imbalanced datasets.

In imbalanced data learning, oversampling methods have received extensive attention because oversampling a minority class is better than undersampling the majority class, especially for datasets with a high imbalance rate (IR) (García et al., 2012). However, when addressing multi-class imbalanced classification tasks, many oversampling methods face new challenges. On the one hand, a growing body of studies has shown that the distribution of the minority classes significantly affects classification (Rekha et al., 2021; Saez et al., 2015; Wang et al., 2022). If the instances of a minority class are scattered, and some even fall in regions of the majority classes, the blindness of minority class instance selection will exacerbate intra-class imbalance or sample overgeneralization (Shaikh et al., 2019). On the other hand, multi-class imbalanced datasets may contain multiple minority classes, and instances of certain minority classes are easily ignored and misclassified by the algorithm as a result of their skewed distribution relative to the majority classes.

Currently, most existing resampling methods for handling multi-class imbalanced data implement informed resampling strategies based on the local neighborhood characteristics of minority instances to ascertain how difficult they are to identify. However, with multiple majority and minority classes, such strategies can easily distort class information and reduce prediction performance. Therefore, to strengthen the identification of minority instances and improve the efficiency of multi-class imbalance learning, this paper proposes a clustering-based oversampling algorithm (COM) for multi-class imbalance problems from the perspective of the distribution characteristics of the minority classes. COM spatially clusters the minority instances by identifying soft core instances, designs weights based on the distribution of the clusters to sample them, and conducts differentiated oversampling within the selected clusters. In other words, COM focuses on the overall distribution of minority instances while accounting for intra-class imbalance and overgeneralization during oversampling. The contributions of this study are summarized as follows:

  1. Density clustering based on the structural characteristics of the instances is applied to multi-class imbalance learning, which allows the sample information of the minority class to be learned well.

  2. COM effectively avoids the overgeneralization of samples and solves the problem of intra-class imbalance in imbalanced data learning, which greatly improves the efficiency of multi-class imbalance learning.

  3. The effectiveness of the COM algorithm is verified via experiments, which show that COM is superior to the other methods in its average classification ability between any two classes.

2 Related Works

Considering that our focus is multi-class imbalanced learning based on data structure characteristics, we provide a short review of data-level learning techniques in multi-class imbalance scenarios.

For multi-class imbalanced data with complex structures, the intra-class imbalance and overgeneralization that often result from oversampling are the main factors contributing to the difficulty of learning. Unreasonable oversampling methods are likely to yield noisy samples or class overlapping, which negatively impacts the recognition of the minority class. The usual approach is class decomposition, which converts the multi-class imbalance problem into several two-class imbalance problems. Two-class imbalance learning methods based on decomposition schemes such as OAO and OAA (Dong et al., 2022; Kang et al., 2015; Li et al., 2020) are effective in dealing with the original multi-class problem. Wu et al. (2010) studied multi-class imbalance learning by clustering and decomposing the majority class. The majority class is clustered by the k-means algorithm and decomposed into clusters of equal size. Each cluster is then combined with the minority class instances to form several two-class imbalanced datasets, and random oversampling is applied to address the imbalance. Unfortunately, such techniques often result in the loss of class information. The classification effect may decline because the sample information of all classes is not used in training the classifier.

To improve the identification of minority classes, Lin et al. (2013) proposed a memetic algorithm to optimize a radial basis neural network. Next, a dynamic oversampling algorithm uses the SMOTE method to oversample the class with the lowest classification accuracy. Different instances have been given different sampling probabilities based on the multi-layer perceptron classifier (Fernandez et al., 2011). During the training process, a higher sampling probability is given to the minority class to improve the classification accuracy of the minority class instances. The ensemble learning method can also solve multi-class imbalance problems (Krawczyk et al., 2020; Liu et al., 2021). This method generally employs the boosting algorithm, which converts the imbalanced dataset into imbalanced subsets, then implements resampling to train the overall model. Some studies combine the ensemble algorithm with a feature selection algorithm (Guo et al., 2016; Hartono et al., 2021) to solve the problem of overlapping classification boundaries, thus improving the identification of minority class instances. Experiments have shown that this learning technique improves classification but is not independent of specific classifiers.

Abdi and Hashemi (2015) proposed an oversampling method based on the Mahalanobis distance to solve multi-class imbalance problems. Since each synthesized instance is located on the contour line of an ellipse, the synthesized instance and the original instance are guaranteed to have the same distance from the class center. However, this method focuses on the instances in the concentrated area of the minority class and does not sufficiently consider the boundary instances or the small disjuncts within the minority class. Generative direction (Tang et al., 2017) is an increasingly popular approach for avoiding the randomness of synthesized instances. For each minority class instance, k nearest sample points of the same class are selected, so the instance has k different generation directions; directions are then chosen according to their generation weights, and the same number of synthesized samples is introduced along them. However, this method only divides the minority class into outstanding instances and trapped instances (Zhu et al., 2017), which cannot fully reflect the structural characteristics of the minority class.

For multi-class imbalance learning, although a great deal of research has been done, certain aspects could still be improved. Based on the above analysis, the current research strategies do not consider the overall distribution of multi-class imbalanced data and thus either may cause the loss of class information or exacerbate the imbalance problem, affecting the identification of minority class instances.

3 COM: A Clustering-Based Oversampling Algorithm for Multi-class Imbalance Problems

In imbalance learning, one of the main factors causing learning difficulties is the complex distribution characteristics of the dataset. As shown in Fig. 1a, the minority class L1 has a serious intra-class imbalance problem, in which the instances fall into four categories: safe instances, boundary instances, rare instances, and outliers (Napierala et al., 2016). Since there are relatively few instances in the minority class, its distribution cannot be fully expressed; for example, an outlier is most likely a rare but valid instance that cannot be represented by other instances, so it cannot simply be deleted; otherwise, information will be lost.

Fig. 1

A multi-class imbalanced dataset

To reflect the structural characteristics of the minority classes as completely as possible, COM handles the multi-class imbalanced data based on the clustering structure of the minority class. We can observe from Fig. 1b that the density of each cluster is different, and so is the difficulty of learning the instances in each cluster. Randomly synthesizing minority class instances may aggravate the intra-class imbalance, so the synthesis of instances in low-density clusters should be appropriately increased to improve the recognition of such instances.

In addition, in the process of synthesizing minority class instances, if synthetic instances are inserted between clusters, the overgeneralization of samples will inevitably be aggravated, affecting the efficiency of imbalance learning. Therefore, in this study, differentiated oversampling is carried out within each cluster to alleviate intra-class imbalance while avoiding overgeneralization.

Based on the above analysis, the key research ideas of COM can be described as follows:

  1. Minority class instances are clustered using density clustering based on the distribution characteristics of the minority class.

  2. Sampling weights are assigned to each cluster according to the number of instances in the cluster and the distances between them. A high weight is assigned to a cluster with low density, so the oversampling probability of the instances in that cluster will be high, and vice versa.

  3. Various oversampling techniques can be performed in different clusters according to each cluster’s structural characteristics.

3.1 Density Clustering for the Minority Class

Research shows that the primary factor affecting classification is the occurrence of difficult samples in the dataset. Thus, one effective way to address imbalanced learning is to analyze the structural characteristics of imbalanced datasets, especially the instance structure of the minority class.

Let \(D=\left\{\left({x}_{1},{y}_{1}\right),\cdots ,\left({x}_{i},{y}_{i}\right),\cdots ,\left({x}_{n},{y}_{n}\right)\right\}\) be a multi-class imbalanced dataset with sample size n, where \({x}_{i}=({x}_{i1},{x}_{i2},\cdots ,{x}_{id})\) is an instance of dimension d and yi is the class label of instance xi. Let class L in the dataset be a minority class and SL be the set of class L instances.

Definition 1

(Zhu et al., 2017): Soft core instance. For any \({x}_{i}\in {S}^{L}\), if the proportion of L class instances in its k-nearest neighbors set is not less than rTh, then xi is a soft-core instance of class L.

Definition 2

(Zhu et al., 2017): \(g^{L}_{k}\) -neighborhood. For any soft-core instance of class L, xi, its \(g^{L}_{k}\)-neighborhood is defined as an instance set:

$$g^{L}_{k}\left({x}_{i}\right)=\left\{{x}_{i}\right\}\cup \left\{{x}_{j}\in {x}_{i}.{N}_{k}|{x}_{j}\in {S}^{L}\right\}\cup \left\{{x}_{j}\in {x}_{i}.R{N}_{k}|{x}_{j}\; is\; a\; soft\; core\; instance\; of\; class\; L\right\}$$

where Nk and RNk denote k-nearest and reverse k-nearest neighbor sets, respectively.

The \({g}_{k}^{L}\)-neighborhood satisfies reflexivity and symmetry: for any two soft core instances xi and xj, if \({x}_{j}\in {g}_{k}^{L}({x}_{i})\), then \({x}_{i}\in {g}_{k}^{L}({x}_{j})\).
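
To make Definitions 1 and 2 concrete, the following is a minimal NumPy sketch of the k-nearest neighbor set, the reverse k-nearest neighbor set, and the \({g}_{k}^{L}\)-neighborhood; the function names knn_indices and gkl_neighborhood are illustrative rather than part of the original formulation, and the brute-force distance computation is used only for clarity.

```python
import numpy as np

def knn_indices(X, k):
    """Index sets N_k of the k nearest neighbors of every instance (Euclidean distance)."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)              # an instance is not its own neighbor
    return np.argsort(d, axis=1)[:, :k]      # shape (n, k)

def gkl_neighborhood(i, nk, y, L, soft_core):
    """g_k^L-neighborhood of soft-core instance i (Definition 2).

    nk        : k-nearest-neighbor index matrix returned by knn_indices
    soft_core : boolean array marking the soft-core instances of class L
    """
    neigh = {int(i)}
    neigh |= {int(j) for j in nk[i] if y[j] == L}                         # N_k members of class L
    neigh |= {j for j in range(len(nk)) if i in nk[j] and soft_core[j]}   # reverse-N_k soft cores
    return neigh
```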

To better characterize rare instances in the minority class, this paper defines soft core instances as follows.

Definition 3

Soft core instance. For any \({x}_{i}\in {S}^{L}\), if the number of L class instances in its k-nearest neighbors set is not less than 1, then xi is a soft-core instance of class L.

The clustering in this part mainly includes two steps. First, a soft-core instance set \(\Omega\) of the minority class is constructed: for any \({x}_{i}\in {S}^{L}\), if xi meets Definition 3, it is added to set \(\Omega\), and its \({g}_{k}^{L}\)-neighborhood is obtained. Second, density clustering is performed on the minority class: the clusters are formed from the \({g}_{k}^{L}\)-neighborhoods of the instances in set \(\Omega\). For the specific implementation of clustering, refer to the definition of the cluster function in Zhu et al. (2017).

Let the clustering result be \(C=\{{C}_{1},\cdots ,{C}_{i},\cdots ,{C}_{m}\}\). Ci represents the ith cluster, and the cluster label of instance xi is denoted xi.c. The clustering process in the COM oversampling method is then implemented as follows.

Algorithm 1

Clustering algorithm based on the structural characteristics
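
As an illustration, the sketch below gives one possible reading of Algorithm 1: soft-core instances are identified according to Definition 3, and clusters are grown by chaining overlapping \({g}_{k}^{L}\)-neighborhoods using the knn_indices and gkl_neighborhood helpers sketched above. It is an approximation for illustration, not the exact cluster function of Zhu et al. (2017), and cluster_minority is a hypothetical name.

```python
def cluster_minority(X, y, L, k=5):
    """Sketch of Algorithm 1: density clustering of minority class L via soft-core instances.

    Returns a list of clusters (sets of row indices into X); minority instances assigned to
    no cluster can be treated by the caller as single-instance outlier clusters.
    """
    nk = knn_indices(X, k)
    idx_L = np.where(y == L)[0]

    # Definition 3: x_i is a soft-core instance if at least one of its k nearest neighbors is of class L
    soft_core = np.zeros(len(X), dtype=bool)
    soft_core[idx_L] = [np.any(y[nk[i]] == L) for i in idx_L]

    clusters, visited = [], set()
    for i in idx_L:
        if not soft_core[i] or int(i) in visited:
            continue
        cluster, frontier = set(), {int(i)}
        while frontier:                        # grow the cluster by chaining g_k^L-neighborhoods
            p = frontier.pop()
            if p in visited:
                continue
            visited.add(p)
            neigh = gkl_neighborhood(p, nk, y, L, soft_core)
            cluster |= neigh
            frontier |= {q for q in neigh if soft_core[q] and q not in visited}
        clusters.append(cluster)
    return clusters
```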

3.2 Acquiring the Sampling Weights for the Clusters

Sampling weights are assigned to the instance clusters formed by clustering to overcome the intra-class imbalance of the minority class. For a cluster with sparse minority class instances, the instances in the cluster are often difficult to learn; a higher weight is assigned to such clusters to increase the probability of instance synthesis and improve the recognition of these instances. Conversely, a low weight is assigned to a cluster with a high density of minority class instances, because the instances in such a cluster are usually safe and easy to learn. Therefore, the weight of a cluster depends on the density of the minority class samples in the cluster. Cluster weighting consists of measuring the cluster density and converting it into a sampling weight.

3.2.1 Density of Clusters

The cluster density is usually related to the distances between its instances: it is high if the distances between instances are relatively small and low if they are large. To measure the cluster density, the following formula is used to calculate the average distance between instances:

$${\text{avg}}({d}_{{C}_{i}})=\frac{2}{|{C}_{i}|(|{C}_{i}|-1)}\sum_{1\le p<q\le |{C}_{i}|}{\text{dist}}({x}_{p},{x}_{q})$$
(1)

Here, Ci is the ith cluster, |Ci| is the number of instances in the cluster, and \({\text{dist}}\;({x}_{p},{x}_{q})\) is the Euclidean distance between instances xp and xq in the cluster.

The cluster density is also related to the data dimension d. Using the average distance between instances, the cluster density is defined as follows:

$${\text{density}}\left({C}_{i}\right)=\frac{\left|{C}_{i}\right|}{{\left[{\text{avg}}\left({d}_{{C}_{i}}\right)\right]}^{d}}.$$
(2)

After clustering, the outliers of the minority class are often labeled separately; they are usually far away from the instances that form clusters and may even be located in the regions of other classes. Outliers often affect the classification effect. Because an imbalanced dataset contains few minority instances, the distribution of the minority class is not fully represented, and the outliers are likely to be rare instances that are simply under-represented. Especially in datasets with a high IR, deleting outliers directly is likely to cause the loss of important information.

To preserve the integrity of the minority class information, this paper regards each outlier as a separate single-instance cluster. Such a cluster is then given a sampling weight, and increasing the proportion of such instances helps the classifier learn more useful information. Here, the distance between the outlier and its nearest instance is taken as the average distance of instances within the cluster, and formula (2) is then used to calculate the density of this kind of cluster.

3.2.2 Sampling Weight

To overcome the influence of intra-class imbalance on classification, the selection weight of each cluster is defined as \(\text{exp}(-{\text{density}}({C}_{i}))\). Low-density clusters thus have a greater probability of being drawn to generate synthetic minority instances. Conversely, a small probability is given to high-density clusters, because their instances are located in safe areas of the minority class and are not difficult to classify. In this way, imbalance learning accounts for between-class imbalance on the basis of the distribution of the minority class instances while alleviating the influence of intra-class imbalance on the classification effect.

Converting each cluster’s selection weights into a probability distribution is necessary. The sum of the selection weights of all clusters is used to standardize the selection weights of each cluster. That is, the sampling weights of each cluster can be obtained as follows:

$${\text{weight}}\left({C}_{i}\right)=\frac{\text{exp}\left(-{\text{density}}\left({C}_{i}\right)\right)}{\sum\limits_{i=1}^{m}\text{exp}\left(-{\text{density}}\left({C}_{i}\right)\right)}.$$
(3)
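
As an illustration of Eqs. (1)-(3), the sketch below computes the cluster density, including the single-instance (outlier) case described in Section 3.2.1, and converts the densities into sampling weights; the function names cluster_density and sampling_weights are hypothetical.

```python
import numpy as np
from scipy.spatial.distance import pdist, cdist

def cluster_density(X, members):
    """density(C_i) following Eqs. (1)-(2); `members` are the row indices of cluster C_i in X."""
    d = X.shape[1]                                      # data dimension
    pts = X[members]
    if len(members) > 1:
        avg_d = pdist(pts).mean()                       # Eq. (1): average pairwise Euclidean distance
    else:
        # Outlier cluster: use the distance to its nearest instance in the dataset (Section 3.2.1)
        avg_d = cdist(pts, np.delete(X, members, axis=0)).min()
    return len(members) / (avg_d ** d)                  # Eq. (2)

def sampling_weights(densities):
    """Normalized exp(-density) weights, Eq. (3)."""
    w = np.exp(-np.asarray(densities, dtype=float))
    return w / w.sum()
```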

3.3 Generating Synthetic Instances

For a cluster drawn according to the sampling probability, the interpolation method is used to oversample the minority instances in the cluster. Let xi and xj be any two instances; a new instance xs = xi + r (xj − xi) is then interpolated between them, where r is a random number in the interval [0, 1].

Any two instances in a high-density cluster can be chosen during oversampling, and a new instance is synthesized by the above interpolation method. For a cluster formed by an outlier of the minority class, interpolation is performed between the outlier and its nearest instance to avoid overfitting. Since the nearest neighbor of an outlier must belong to another class, r is set as a random number in the interval [0, 1/2] during interpolation to ensure that the synthetic instance is as close as possible to the minority class instance.
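
A minimal sketch of this interpolation rule follows, with r drawn from [0, 1] inside an ordinary cluster and from [0, 1/2] for an outlier cluster; the helper name interpolate is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def interpolate(xi, xj, r_max=1.0):
    """Synthetic instance x_s = x_i + r * (x_j - x_i), with r drawn uniformly from [0, r_max]."""
    r = rng.uniform(0.0, r_max)
    return xi + r * (xj - xi)

# Ordinary cluster: pick any two minority instances of the cluster and use r_max = 1.0.
# Outlier cluster : interpolate from the outlier towards its nearest neighbor with r_max = 0.5,
#                   so the synthetic point stays close to the minority instance.
```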

The specific implementation process of the COM oversampling algorithm is as follows.

Algorithm 2

Clustering-based oversampling algorithm for multi-class imbalance learning (COM)
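
The sketch below ties the pieces together for a single minority class, reusing the hypothetical helpers from the earlier sketches (knn_indices, cluster_minority, cluster_density, sampling_weights, and interpolate); it is an illustrative approximation of Algorithm 2, not a verbatim implementation.

```python
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)

def com_oversample(X, y, L, n_new, k=5):
    """End-to-end sketch of COM for one minority class L: cluster, weight, then interpolate in-cluster."""
    clusters = cluster_minority(X, y, L, k)                              # Section 3.1
    assigned = set().union(*clusters) if clusters else set()
    # Minority instances left in no cluster become single-instance outlier clusters
    clusters += [{int(i)} for i in np.where(y == L)[0] if int(i) not in assigned]

    weights = sampling_weights([cluster_density(X, sorted(c)) for c in clusters])  # Section 3.2

    synthetic = []
    for _ in range(n_new):
        c = sorted(clusters[rng.choice(len(clusters), p=weights)])       # draw a cluster by its weight
        if len(c) >= 2:
            i, j = rng.choice(c, size=2, replace=False)
            synthetic.append(interpolate(X[i], X[j], r_max=1.0))
        else:                                                            # outlier cluster
            others = np.delete(np.arange(len(X)), c)
            nearest = others[cdist(X[c], X[others]).argmin()]
            synthetic.append(interpolate(X[c[0]], X[nearest], r_max=0.5))
    return np.vstack(synthetic)
```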

4 Experimental Study

4.1 Setup

4.1.1 Experimental Data

The multi-class imbalanced datasets selected for the experiment were from the UCI and KEEL databases. Some classes in the selected datasets contain very few instances (e.g., some classes of the original E. coli dataset contain only two or three samples); such small classes were combined with adjacent classes to ensure the classification effect. Specific data characteristics and distribution information are shown in Table 1.

Table 1 Description of characteristics of datasets

The feature description of the data in Table 1 includes the sample size (S), number of features (F), number of classes (C), class distribution, and imbalance rate of the dataset. The class distribution represents the number of instances in each class, the bold font indicates that the class is a minority class, and the corresponding imbalance rate is reflected in the column IRi of the minority class. The IR column is the overall imbalance rate of the multi-class imbalanced dataset.

4.1.2 IR of Multi-class Imbalance Learning

There is no standard definition for the IR of a multi-class imbalanced dataset, and different studies define it according to their own research requirements. The average IR (Zhu et al., 2019) used in our study is defined as follows:

$${\text{IR}}=\frac{1}{l}\sum_{i=1}^{l}{\text{IR}}_{i}, {\text{IR}}_{i}=\frac{{\sum }_{q\ne i}{n}_{q}}{l\times {n}_{i}},$$

where IRi is the imbalance rate of class Li in the dataset, nq and ni are the numbers of instances in classes Lq and Li, respectively, and l is the number of classes. The fewer instances a class contains, the higher the IR for that class. If each class has the same number of instances, then the imbalance rate of each class is

$${\text{IR}}_{i}=\frac{l-1}{l},$$

so both IRi and the average IR are approximately 1 if the number of classes is large enough.

According to the literature, the threshold of IR is set at 1.5. Classes with an imbalance rate higher than 1.5 are minority classes, while the remaining are majority classes. In this study, only minority classes with an imbalance rate higher than 1.5 were oversampled. For simplicity, the target number of instances for each minority class is set as the average sample size of all majority classes.
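
For illustration, the per-class and average IR defined above, together with the 1.5 threshold, can be computed as in the sketch below; the function name class_ir is hypothetical.

```python
import numpy as np

def class_ir(y):
    """Per-class imbalance rate IR_i and the average IR defined above."""
    classes, counts = np.unique(y, return_counts=True)
    l = len(classes)
    ir_i = (counts.sum() - counts) / (l * counts)    # IR_i = (sum of the other class sizes) / (l * n_i)
    return dict(zip(classes, ir_i)), float(ir_i.mean())

# Classes with IR_i > 1.5 are treated as minority classes; each is oversampled up to the
# average size of the majority classes, the target size used in this study.
```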

4.1.3 Base Classifier and Compared Algorithms

To verify the superiority of the COM learning method in classifying multi-class imbalanced data, our study chose several learning methods—random oversampling (ROS), SMOTE (Chawla et al., 2002), Borderline-SMOTE (B-SM) (Han et al., 2005), and ADASYN (He et al., 2008)—and compared their balance effect on the multi-class imbalanced data. The number of nearest neighbors for each of them was optimally selected among 1, 3, 5, and 7.

For a reasonable evaluation of the data-balancing effect of the compared methods, the study selected decision tree (DT), k-nearest neighbor (KNN), and multi-layer perceptrons (MLP) with a single hidden layer as the classifiers. According to the classification effects of classifiers with different evaluation criteria, several resampling methods were compared, and the Euclidean distance was used to measure the distance between instances.

In our experiments, fivefold cross-validation was applied to evaluate the performance of the algorithms; that is, in each fold, 80% of the data were used for training and 20% for testing, and it was ensured that both the training and test sets contained samples of every class. The average over the five folds was used to evaluate each classifier's performance.
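
As an illustration of this protocol, the sketch below runs stratified fivefold cross-validation with scikit-learn, applying the oversampling method to the training folds only (an assumption consistent with, though not explicitly stated in, the description above); the DT classifier and micro-F1 metric are shown as one example combination.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import f1_score

def evaluate(X, y, oversample, n_splits=5):
    """Stratified fivefold CV; `oversample(X, y)` returns a rebalanced training set."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
    scores = []
    for train_idx, test_idx in skf.split(X, y):
        X_tr, y_tr = oversample(X[train_idx], y[train_idx])   # e.g., COM, ROS, SMOTE, ...
        clf = DecisionTreeClassifier().fit(X_tr, y_tr)
        scores.append(f1_score(y[test_idx], clf.predict(X[test_idx]), average='micro'))
    return float(np.mean(scores))                             # average over the five folds
```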

4.2 Experimental Results and Analyses

Since total accuracy is not appropriate for multi-class imbalanced data, the micro-F1, MG, and MAUC values were used to compare the performance of the classifiers. The micro-F1 value is denoted as F1 for simplicity. The results in Table 2 are the average values and average rankings of each resampling method for every combination of the datasets, three evaluation metrics, and three classifiers. In addition, we included the performance of the classifiers when no oversampling was used, so the methods were ranked from 1 to 6. For all three evaluation metrics, a larger value indicates better classification. The resampling methods were ranked according to their classification effect, so a smaller average ranking indicates a better resampling method. The bold font in the table indicates the resampling method with the best effect.

Table 2 Average performance results of the oversampling methods across the datasets

4.2.1 Average Value

It can be seen from the average values in Table 2 that the COM oversampling method has the best effect when DT is used for classification—its average value for each evaluation metric is the highest, indicating that its data balancing is significantly better than that of other resampling methods. When the KNN and MLP classifiers are used, the COM oversampling method shows obvious advantages in two of three evaluation metrics. Its average F1 and MAUC values are the highest.

4.2.2 Mean Ranking

The ranking results in Table 2 show that the COM oversampling method outperforms all other methods for every evaluation metric when the DT classifier is used. For the KNN and MLP classifiers, the COM oversampling method has the best results in two of the three evaluation metrics, similar to the average value scenario. For ease of comparison, the overall average of each resampling method across the three classifiers is also shown in Table 2, from which it can be seen that, under the MG evaluation metric, the COM method does not show a decisive advantage over the other resampling methods, and its overall effect is similar to that of SMOTE. However, under the F1 and MAUC metrics, COM has a clear advantage in both value and ranking.

4.3 Statistical Test of Experimental Results

4.3.1 Friedman Test

To further verify whether the differences in ranking among the resampling methods are statistically significant, the Friedman test, a non-parametric method that tests overall differences through rank sums, was applied to the rankings of the resampling methods. The null hypothesis is that there is no significant difference in the average rankings of the resampling methods. The results in Table 3 show that the null hypothesis is rejected under all three evaluation metrics for the DT and KNN classifiers; that is, the average rankings of the resampling methods differ significantly. When the MLP classifier is used, the Friedman test is not significant for the F1 metric, whereas the results for the other two metrics show a significant difference in average ranking.

Table 3 Results for Friedman’s test
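
For reference, a minimal sketch of this test with scipy follows; the input format (one score per dataset for each compared method, in the same dataset order) is an assumption about how the rankings were organized.

```python
from scipy.stats import friedmanchisquare

def friedman_on_scores(scores_by_method, alpha=0.05):
    """scores_by_method: dict mapping method name -> list of per-dataset scores (same dataset order)."""
    stat, p_value = friedmanchisquare(*scores_by_method.values())
    return stat, p_value, p_value < alpha   # True means the average rankings differ significantly
```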

4.3.2 Multiple Comparisons

Given the statistical significance of the Friedman test results, the mean rankings of the resampling methods were further examined using the multiple comparisons method. Since this research aimed to investigate the performance of the proposed COM oversampling method in solving intra-class imbalance, only the COM method was used as the control in the comparisons. The comparison results are listed in Table 4. In the multiple comparisons, the average ranking difference between methods was compared with the critical value, determined by the number of methods and the number of datasets, for different combinations of classifiers and evaluation metrics. The bold font in the table indicates that the results of the multiple comparisons are statistically significant, i.e., the COM oversampling method is superior to the compared method.

Table 4 Results for multiple comparisons
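
The text does not name the specific post-hoc procedure; a common choice when a single method serves as the control is a rank-based test whose critical difference is \(q_{\alpha}\sqrt{k(k+1)/(6N)}\), and the sketch below assumes that form.

```python
import numpy as np

def critical_difference(n_methods, n_datasets, q_alpha):
    """CD = q_alpha * sqrt(k * (k + 1) / (6 * N)) for rank-based post-hoc comparisons."""
    k, N = n_methods, n_datasets
    return q_alpha * np.sqrt(k * (k + 1) / (6.0 * N))

# COM (the control) is judged significantly better than a compared method when the difference
# of their average rankings exceeds the critical difference for the chosen q_alpha.
```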

For the DT and KNN classifiers, Table 4 shows that the COM oversampling method is significantly better under the F1 and MAUC evaluation metrics. In the experiment on the MLP classifier, the average F1 of the COM oversampling method is the highest; however, the Friedman test result for the F1 metric is not significant, so only the other two metrics are used for multiple comparisons. The results show that the average ranking of the COM method is significantly better than that of the other methods under the MAUC metric.

5 Simulation Experiment of COM Stability

In multi-class imbalance learning, because the number of classes and the IR vary widely, fully synthetic data are limited in their ability to reflect actual problems, whereas simulation experiments based on real datasets can better reflect the applicability of an algorithm.

The effectiveness of the COM algorithm was verified in the preceding experiments. Next, its stability is further analyzed through a simulation experiment on the interference of noise with the learning results. Since most noise in practice is Gaussian white noise, we add Gaussian white noise to the multi-class imbalanced data in Table 1 according to the signal-to-noise ratio (SNR). The SNR is the ratio of signal power to noise power; the smaller the SNR, the more noise in the data, and vice versa.

To analyze the stability of the COM algorithm, the experiments again use the DT, KNN, and MLP classifiers to compare the resampling methods under the different evaluation criteria, with an experimental setup similar to that in Section 4.1.3. The minority classes in the new datasets obtained after adding noise were oversampled, and the experiments were conducted using fivefold cross-validation, where the average of the five trials was used to measure the classification effectiveness of each classifier.

Since a smaller SNR represents more serious noise in the data, the SNR is set to 15 in the experiment to allow a reasonable evaluation of the stability of the COM algorithm. That is, ten virtual multi-class imbalanced datasets are obtained by adding Gaussian white noise with an SNR of 15 to the data in Table 1. The experiments use the oversampling algorithms to balance the virtual datasets and then compare the classification effectiveness of the three classifiers.
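
As an illustration, white Gaussian noise at a target SNR can be added as in the sketch below; it assumes the SNR is specified in decibels and applied per feature, which the text does not state explicitly (for a linear power ratio, the dB conversion would simply be dropped).

```python
import numpy as np

rng = np.random.default_rng(0)

def add_awgn(X, snr_db=15.0):
    """Add white Gaussian noise so that signal power / noise power matches the target SNR (dB)."""
    signal_power = np.mean(X ** 2, axis=0)                     # per-feature signal power
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))     # SNR = P_signal / P_noise
    noise = rng.normal(0.0, np.sqrt(noise_power), size=X.shape)
    return X + noise
```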

The experimental results are shown in Fig. 2. The COM algorithm has an obvious advantage when DT is used for classification, indicating that its data-balancing effect is significantly better than that of the other resampling methods, while for KNN and MLP, the COM algorithm shows a clear advantage under two of the evaluation criteria. These results show that the COM algorithm is relatively stable; that is, the noise does not have a significant effect on its balancing ability. It is worth mentioning that, in terms of the MAUC criterion, the COM algorithm's ability to discriminate between any two classes does not appear outstanding after adding noise, mainly due to the small SNR set in the experiment.

Fig. 2

Average performance results of each oversampling method on the virtual datasets

6 Conclusions

This paper has proposed the COM oversampling algorithm, which focuses on solving intra-class imbalance and sample overgeneralization in multi-class imbalance problems. In the proposed method, the minority class instances are first locally clustered, and the sampling weights of the clusters are then set according to the distribution and density of the clusters. After this, oversampling is performed within the extracted clusters, making full use of the structural characteristics of the minority class instances. The effectiveness of COM was verified on multi-class imbalanced datasets by comparing it with multiple oversampling methods. The experimental results demonstrate that COM can significantly alleviate the influence of intra-class imbalance and overgeneralization and can improve classifiers' average classification ability for any two classes in multi-class imbalance problems; the proposed method outperforms the compared methods in terms of F1, MAUC, and MG.

Although the training of COM is slightly more complex, it is effective in learning the distribution of the minority class and synthesizing minority class instances. In future work, we will continue to explore imbalance learning methods based on data distribution and generalize COM to handle high-dimensional multi-class imbalanced data. In addition, raising the dimensionality of imbalanced data to improve learning performance is a feasible idea, and we will try to combine COM with dimension-raising methods to obtain better performance in multi-class imbalance learning.