Abstract
Imbalanced data classification is a challenging problem with wide applications in machine learning and data mining. Most researchers attempt to solve this problem at the data level or the algorithm level. Nevertheless, these methods have their limitations. In addition, most of them focus on the imbalance in the number of data samples while ignoring the imbalance caused by sample difficulty. Thus, we design a hybrid model to handle this problem. Our model integrates data space improvement, sample selection, sampling strategy, and loss function. To evaluate the performance of our hybrid model, we conduct experiments on several real-world imbalanced datasets. The experimental results show that our hybrid model is effective.
1 Introduction
Imbalanced data classification is challenging [10, 13], and it has wide applications in the machine learning field [3, 11, 19]. The main characteristic of the imbalanced data is its skewed data distribution, which means that most samples belong to one class (the majority class) and the rest belong to the other (the minority class). The skewed data distribution usually leads to conventional machine learning classifiers having poor classification performance.
To address imbalanced data classification, researchers have proposed plenty of methods. Existing methods mainly fall into two categories: data-level techniques and algorithm-level techniques. Data-level techniques address imbalanced data by changing the data distribution. Algorithm-level techniques increase the importance of the minority class by adjusting the learning or decision process.
However, we notice weaknesses in the above existing methods. On the one hand, traditional data-level methods usually do not consider the impact of different types of samples in the imbalanced dataset on model training. The study [16] indicates that some of the samples are useless and even negatively impact model training. On the other hand, traditional algorithm-level methods [6, 8] usually focus on giving a higher loss to the minority class but ignore the impact of sample difficulty.
This paper aims to remedy the above weaknesses from two aspects. Firstly, this paper introduces the concept of "sample classification importance" to select suitable samples for sampling. Intuitively, classification importance represents the importance of a sample for classifier training. For a dataset, we divide all samples into three kinds, i.e., important informative samples, negative informative samples, and general informative samples. Such sample classification importance can guide the selection of suitable samples for sampling to obtain satisfactory results. Secondly, we propose a loss function that is based on sample difficulty. This loss function can give different costs to different samples according to their sample difficulty.
Then, we further propose a hybrid model to solve imbalanced data classification. Our model integrates data space improvement, sample selection, and a loss function based on sample difficulty. Specifically, it contains three blocks: (1) Data space block, which transforms the data space to make samples close to their nearest neighbors belonging to the same class and separates samples from other classes by a large margin. This block makes samples easier to separate. (2) Sample selection block, which finds suitable samples for sampling to obtain a balanced dataset. This block aims to find valuable samples. (3) Sample difficulty block, which applies a novel loss function that adds a larger loss to samples with greater difficulty for training the classifier.
In summary, our contributions lie in the following aspects. (1) Firstly, we propose a new sample selection approach that can use fewer samples but get better classification results. (2) Secondly, we design a novel loss function based on sample difficulty for imbalanced data training. (3) Thirdly, we design a hybrid model that integrates space improvement, sample selection, sampling, and loss function to handle this problem. (4) Finally, experimental results on real-world imbalanced datasets have shown that our hybrid model performs better than competing methods, and each block of our model is valid.
2 Related Work
2.1 Data-Level Methods
Data-level approaches [7] aim to solve imbalanced data by changing the data distribution. They can be further divided into undersampling methods and oversampling methods. Undersampling methods remove majority instances from the original dataset to balance the dataset. The simplest form of undersampling is random undersampling [10], which removes majority instances at random. Unlike undersampling methods, oversampling methods generate minority instances to obtain a balanced dataset. Random oversampling is the most straightforward approach: it randomly replicates minority instances from the original data. In addition, plenty of advanced sampling methods have been designed. SMOTE [5] is a commonly used sampling method that selects close minority instances, draws a line between them, and generates a new instance at a point along that line. ADASYN [9], MWMOTE [1], and AMDO [18] are representative sampling methods that generate synthetic minority instances. However, the weakness of data-level methods is apparent: how to select suitable instances for sampling remains an open question [4].
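The interpolation idea behind SMOTE-style oversampling can be illustrated with a minimal sketch. The function name `smote_like_sample`, the toy data, and the parameter `k` are illustrative assumptions; this is not the reference SMOTE implementation, only the step of placing a new instance on the line between two close minority instances.

```python
import math
import random


def smote_like_sample(minority, k=2, rng=None):
    """Create one synthetic point between a minority sample and one of
    its k nearest minority neighbors (illustrative sketch only)."""
    rng = rng or random.Random(0)
    base = rng.choice(minority)
    # Sort the remaining minority samples by Euclidean distance to the base.
    others = [p for p in minority if p is not base]
    others.sort(key=lambda p: math.dist(base, p))
    neighbor = rng.choice(others[:k])
    # Pick a random position on the segment between base and neighbor.
    t = rng.random()
    return [b + t * (n - b) for b, n in zip(base, neighbor)]


# Toy minority class: the four corners of the unit square.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
synthetic = smote_like_sample(minority, k=2, rng=random.Random(42))
print(synthetic)
```

Because the synthetic point is a convex combination of two existing minority instances, it always lies between them in feature space.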
2.2 Algorithm-Level Methods
Algorithm-level approaches address imbalanced data by increasing the importance of the minority class through adjustments to the learning or decision process. These methods mainly include cost-sensitive learning and novel loss functions. Cost-sensitive learning approaches modify the cost matrix to reduce bias towards the majority class. However, determining a suitable cost matrix is difficult. Researchers have recently designed several new loss functions [6, 8] for training deep neural networks on imbalanced data. The most widely used loss for imbalanced data is the focal loss [15], which assigns a weight to each instance according to how accurately the model predicts it during training.
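As a rough illustration of how a loss can reweight instances by prediction quality, the sketch below implements the binary focal loss of [15] in its unweighted form, \(FL(p_t) = -(1-p_t)^{\gamma}\log p_t\). The function name, the default `gamma`, and the example values are illustrative assumptions.

```python
import math


def focal_loss(y_true, p_pred, gamma=2.0, eps=1e-12):
    """Mean binary focal loss; gamma = 0 reduces to the cross-entropy loss."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        # p_t is the probability the model assigns to the true class.
        p_t = p if y == 1 else 1.0 - p
        # Well-classified samples (p_t close to 1) are down-weighted.
        total += -((1.0 - p_t) ** gamma) * math.log(max(p_t, eps))
    return total / len(y_true)


easy = focal_loss([1], [0.9])   # confidently correct prediction
hard = focal_loss([1], [0.3])   # poorly predicted sample
print(easy, hard)
```

With `gamma = 2`, the confidently predicted sample contributes far less to the loss than the poorly predicted one, which is the behavior the focal loss is designed to produce.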
3 Proposed Method
3.1 Overview
As shown in Fig. 1, our model consists of three blocks: (1) Data space block (DSB), which transforms the data space to make samples close to their nearest neighbors of the same class. This block makes samples easier to separate. (2) Sample selection block (SSB), which finds valuable samples and builds a set from them for sampling. (3) Sample difficulty block (SDB), which applies a novel loss function that adds a larger loss to samples with higher sample difficulty when training the classifier.
3.2 Data Space Block
Our model integrates a data space improvement technique to make imbalanced data easier to separate. In this work, we use the LMNN [17] technique, which learns a linear transformation that pulls samples with the same class label close to the target sample and pushes samples with different class labels away from it, as shown in Fig. 2. The loss function of LMNN is \(\varphi (H)=(1-\mu ) \varphi _{\text{pull}}(H)+\mu \varphi _{\text{push}}(H)\), where H is the linear transformation of the input space and \(\mu \) is a positive real number utilized as the weight. The first part of this loss penalizes large distances between a sample and its k nearest neighbors belonging to the same class, which is defined as \(\varphi _{\text{pull}}(H)=\sum _{p, q \in M(p)}\left\| H\left( s_{p}-s_{q}\right) \right\| ^{2}\), where \(s_{p}\) denotes sample p and M(p) is the set of the k nearest neighbors of sample p with the same class label as p.
The second part penalizes small distances between a sample and samples of other classes, which is defined as:
\(\varphi _{\text{pull}}\) aside, \(\varphi _{\text{push}}(H)=\sum _{p, q \in M(p), l}\left( 1-\delta _{p l}\right) \max \left\{ 1+\left\| H\left( s_{p}-s_{q}\right) \right\| ^{2}-\left\| H\left( s_{p}-s_{l}\right) \right\| ^{2}, 0\right\} \), where \(\delta _{p l}\) indicates whether samples \(s_{p}\) and \(s_{l}\) belong to the same class. If the samples belong to different classes, \(\delta _{p l}=0\); otherwise, \(\delta _{p l}=1\).
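To make the pull and push terms concrete, the following minimal sketch evaluates the LMNN loss for a fixed transformation H on a toy dataset. It only computes the loss and does not optimize H. The function names, the toy data, and the choice of selecting the neighbors M(p) by Euclidean distance in the input space are assumptions made for illustration.

```python
def sq_dist(H, a, b):
    """Squared distance ||H(a - b)||^2 under the linear transformation H."""
    diff = [x - y for x, y in zip(a, b)]
    mapped = [sum(h * d for h, d in zip(row, diff)) for row in H]
    return sum(v * v for v in mapped)


def lmnn_loss(H, samples, labels, k=1, mu=0.5):
    """(1 - mu) * pull + mu * push for a fixed H (no optimization)."""
    n = len(samples)
    identity = [[float(i == j) for j in range(len(samples[0]))]
                for i in range(len(samples[0]))]
    pull, push = 0.0, 0.0
    for p in range(n):
        # M(p): k nearest same-class samples, chosen in the input space.
        same = [q for q in range(n) if q != p and labels[q] == labels[p]]
        same.sort(key=lambda q: sq_dist(identity, samples[p], samples[q]))
        for q in same[:k]:
            d_pq = sq_dist(H, samples[p], samples[q])
            pull += d_pq
            for l in range(n):
                if labels[l] != labels[p]:  # (1 - delta_pl) = 1
                    d_pl = sq_dist(H, samples[p], samples[l])
                    push += max(1.0 + d_pq - d_pl, 0.0)
    return (1.0 - mu) * pull + mu * push


samples = [[0.0, 0.0], [0.5, 0.0], [3.0, 0.0], [3.5, 0.0]]
labels = [0, 0, 1, 1]
eye = [[1.0, 0.0], [0.0, 1.0]]
shrink = [[0.1, 0.0], [0.0, 0.1]]
print(lmnn_loss(eye, samples, labels))     # classes already well separated
print(lmnn_loss(shrink, samples, labels))  # margin violated, push term grows
```

In the first call only the pull term is active because every sample of the other class lies outside the unit margin; shrinking the space in the second call brings the classes together and activates the push term.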
3.3 Sample Selection Block
Traditional data-level methods usually use all samples for sampling to obtain a balanced dataset. However, the study [16] has indicated that not all samples are useful for model training. Thus, it is necessary to distinguish the types of samples and select suitable samples for sampling. In this part, we first introduce the definition of sample classification importance and then propose a sample selection method based on this definition.
Definition: Sample classification importance represents the importance of a sample for the classifier training.
Intuitively, we divide samples into three kinds, i.e., important informative samples, negative informative samples, and general informative samples, as shown in Fig. 3.
Important informative samples: They are the most informative samples during classifier training. As we can see in Fig. 3, important informative samples are usually located close to the classification boundary of the classifier. Increasing the importance of these samples helps improve classification performance [16].
Negative informative samples: By contrast, negative informative samples harm model training. As shown in Fig. 3, negative informative samples are usually caused by indistinguishable noise, which can lead the model to serious overfitting. Thus, we need to reduce the impact of these samples.
General informative samples: Most of the samples are general informative samples that the model can correctly classify, as shown in Fig. 3. Each general informative sample contributes only minor importance. However, their overall contribution is enormous because of their large number. For this type of sample, we only need to keep a small part of them to retain their "skeleton" and prevent overfitting, and we remove most of the rest.
We evaluate sample classification importance based on the kNN method [2]. If all neighbors of a sample belong to a different class, then it is a negative informative sample. On the contrary, if a sample and all of its neighbors belong to the same class, then it is a general informative sample. In all other cases, the sample is regarded as an important informative sample; that is, a sample located on the borderline between different classes has high classification importance. Then, we introduce the sample selection method. According to sample classification importance, a dataset can be divided into three parts: a negative informative set, an important informative set, and a general informative set. We do not use negative informative samples for sampling since they have negative impacts on classifier training. We focus on sampling important informative samples because they are essential in finding the classification boundary. In addition, we use only a small part of the general informative samples for sampling because a small part of them is enough to retain their "skeleton". Based on the above analysis, our sample selection method is shown in detail in Algorithm 1.
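The kNN-based categorization described above can be sketched as follows. This covers only the three-way split of samples, not the full selection procedure of Algorithm 1. The function name, the string labels, the toy one-dimensional data, and the value of `k` are illustrative assumptions.

```python
import math


def classification_importance(samples, labels, k=2):
    """Label each sample as 'negative', 'general', or 'important'
    from the classes of its k nearest neighbors."""
    kinds = []
    for i, s in enumerate(samples):
        # Indices of the k nearest neighbors of sample i (excluding itself).
        neighbors = sorted(
            (j for j in range(len(samples)) if j != i),
            key=lambda j: math.dist(s, samples[j]),
        )[:k]
        n_diff = sum(1 for j in neighbors if labels[j] != labels[i])
        if n_diff == len(neighbors):
            kinds.append("negative")    # every neighbor is from another class
        elif n_diff == 0:
            kinds.append("general")     # every neighbor shares the class
        else:
            kinds.append("important")   # mixed neighborhood: near the boundary
    return kinds


# Class 0 on the left, class 1 on the right, plus one class-0 outlier at 10.0.
samples = [[0.0], [0.9], [2.0], [3.2], [4.0], [5.1], [6.0], [7.2], [10.0]]
labels = [0, 0, 0, 0, 1, 1, 1, 1, 0]
kinds = classification_importance(samples, labels, k=2)
print(kinds)
```

In this toy example the two samples adjacent to the class boundary (3.2 and 4.0) come out as important, the isolated class-0 sample at 10.0 comes out as negative, and the remaining samples are general.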
3.4 Sample Difficulty Block
This block applies a new loss function based on sample difficulty to train the classifier on imbalanced data. We first introduce the sample difficulty and then propose our loss function. Based on the analysis in the sample selection part, it is important to find suitable samples from which the classifier can learn the classification boundary as precisely as possible. In addition, we notice that different suitable samples may have different difficulties in model training. Thus, we propose a method to calculate the level of sample difficulty.
Intuitively, a sample with more nearest neighbors with different class labels will have a high sample difficulty level. Based on this, we provide formula (1) to evaluate the sample difficulty (SD), where k is the number of nearest neighbors. \(kNN(x_{i, j}, D-D_{j})\) is the number of k nearest neighbors of sample \(x_{i, j}\) that do not belong to class j.
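One form consistent with this description, under the assumption that the count of differently labeled neighbors is normalized by k so that the difficulty lies in [0, 1], is the following. The normalization is an assumption made here for illustration.

```latex
SD(x_{i,j}) = \frac{kNN(x_{i,j},\, D - D_{j})}{k}
% Worked example: with k = 7 and 3 of the 7 nearest neighbors
% belonging to other classes, SD = 3/7 \approx 0.43.
```

Under this form, a sample whose neighbors all share its class has SD = 0, and a sample surrounded entirely by other classes has SD = 1.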
Then, we introduce our novel loss, starting from the cross-entropy (CE) loss for classification. For a classification task with p categories, the CE loss is defined as formula (2):
\(L_{CE}=-\frac{1}{n} \sum _{i=1}^{n} \sum _{j=1}^{p} y_{i, j} \log \hat{y}_{i, j} \quad (2)\)
where n is the sample size, \(y_{i,j} \in \{0,1\}\) is the ground truth indicating whether sample i belongs to class j, and \(\hat{y}_{i,j} \in [0,1]\) is the model’s estimated probability that sample i belongs to class j.
Based on the CE loss, we add a factor \(w_{i, j}\) that accounts for the different types of samples in a dataset, as mentioned in the sample selection block. The parameter \(w_{i, j}\) is related to the sample difficulty, and we use formulas (1) and (3) to calculate its value. We then define our sample difficulty loss function as formula (4). By construction, \(w_{i, j}\) assigns a larger loss to samples that are more difficult to train.
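The effect of a difficulty-dependent factor on the CE loss can be sketched in a few lines. The exact mapping from sample difficulty to \(w_{i, j}\) is given by formulas (3) and (4); the sketch below instead assumes a hypothetical mapping w = 1 + SD purely for illustration, and the function name and example values are likewise assumptions.

```python
import math


def difficulty_weighted_ce(y_true, y_prob, difficulty, eps=1e-12):
    """Cross-entropy in which each sample's term is scaled by a weight
    derived from its difficulty. ASSUMPTION: w = 1 + SD (hypothetical)."""
    total = 0.0
    for label, probs, sd in zip(y_true, y_prob, difficulty):
        w = 1.0 + sd  # hypothetical weight; not the paper's formula (3)
        # With one-hot targets, only the true-class probability contributes.
        total += -w * math.log(max(probs[label], eps))
    return total / len(y_true)


y_true = [0, 1]
y_prob = [[0.8, 0.2], [0.4, 0.6]]
plain = difficulty_weighted_ce(y_true, y_prob, [0.0, 0.0])     # ordinary CE
weighted = difficulty_weighted_ce(y_true, y_prob, [0.0, 1.0])  # 2nd sample is hard
print(plain, weighted)
```

With all difficulties set to zero the function reduces to the ordinary CE loss, and raising the difficulty of one sample increases only that sample's contribution.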
4 Experiments
4.1 Data Description and Compared Methods
We employ several real-world imbalanced datasets provided by the imblearn toolbox [14] (these datasets come from the UCI, LIBSVM, and KDD repositories) to test the performance of our hybrid model. These datasets have different characteristics in terms of the number of samples, IR (Imbalance Ratio), and number of features. The detailed information on the datasets is shown in Table 1. Besides, we randomly split each dataset into a training set (60\(\%\)), a validation set (20\(\%\)), and a test set (20\(\%\)).
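A random 60/20/20 split of this kind can be sketched as below. The function name, the seed, and the use of index lists are illustrative assumptions; whether the original split was stratified by class is not stated, so this sketch uses a plain random shuffle.

```python
import random


def split_indices(n, seed=0):
    """Shuffle sample indices and split them 60/20/20 into
    training, validation, and test index lists."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_train = n * 6 // 10   # integer arithmetic avoids float rounding
    n_valid = n * 2 // 10
    train = idx[:n_train]
    valid = idx[n_train:n_train + n_valid]
    test = idx[n_train + n_valid:]
    return train, valid, test


train, valid, test = split_indices(100, seed=0)
print(len(train), len(valid), len(test))
```

For highly imbalanced data, a purely random split can leave very few minority samples in the smaller subsets, which is one reason results are typically averaged over repeated runs.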
We compare our hybrid model with the following methods, including data-level methods: Random oversampling (ROS), MWMOTE [1], ADASYN [9], SMOTE [5], and AMDO [18]; algorithm-level methods: Focal loss [15], Class-balanced loss [6], and DWE loss [8].
4.2 Evaluation Metrics
We employ two commonly used metrics, G-mean and AUC [12], to evaluate the performance of imbalanced data classification. Let FN, FP, TP, and TN be the numbers of false negatives, false positives, true positives, and true negatives. TPR and TNR measure the proportions of correctly classified positive and negative instances, respectively: \(TPR = TP/(TP+FN)\) and \(TNR = TN/(TN+FP)\). G-mean combines TPR and TNR as their geometric mean, \(\text{G-mean}=\sqrt{TPR \times TNR}\). AUC is the area under the receiver operating characteristic (ROC) curve, which plots the true positive rate against the false positive rate. This area describes the trade-off between correctly classified positive instances and incorrectly classified negative instances.
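Both metrics can be computed directly from labels and predictions, as in the minimal sketch below. The function names and the toy values are illustrative; the AUC is computed through its rank interpretation (the probability that a random positive instance receives a higher score than a random negative one, with ties counted as one half), which equals the area under the ROC curve.

```python
import math


def g_mean(y_true, y_pred):
    """Geometric mean of TPR = TP/(TP+FN) and TNR = TN/(TN+FP)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tpr = tp / (tp + fn)
    tnr = tn / (tn + fp)
    return math.sqrt(tpr * tnr)


def auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


y_true = [1, 1, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 1]               # hard class predictions
scores = [0.9, 0.4, 0.1, 0.2, 0.3, 0.6]   # predicted positive-class scores
print(g_mean(y_true, y_pred))
print(auc(y_true, scores))
```

Unlike plain accuracy, G-mean drops to zero whenever one class is never predicted correctly, which makes it sensitive to poor minority-class performance.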
4.3 Implementation Details
We select a multilayer perceptron (MLP) as the classifier and train it for 100 epochs with a batch size of 32 using the TensorFlow framework. The classifier uses Adam as the optimizer, with a learning rate of 0.001. To obtain reliable results, we run all experiments ten times and report the average over the ten runs as the final result. Our model finds suitable samples and evaluates the sample difficulty level based on the kNN method (k = 7).
4.4 Experimental Results
Tables 2 and 3 report the AUC and G-mean values on the imbalanced datasets. From the experimental results, we find that no single method achieves the best performance on all datasets. Nevertheless, our hybrid model performs well in most cases. The reasons that our model performs well lie in the following aspects.
First, we use a data space block to make samples easier to classify. Second, unlike traditional methods for handling imbalance, we select suitable samples for model training through the sample selection block, which retains the critical classification information. Third, our sample difficulty loss function gives each sample a loss corresponding to its sample difficulty. This loss function takes sample difficulty into account and assigns a higher loss to samples that are more difficult and more challenging to distinguish. Combining the findings above, our model is effective for imbalanced data classification.
5 Discussion
5.1 The Impact of Important Informative Samples
In our model, we select suitable samples to train the classifier because such samples are essential for finding the classification boundary. To further illustrate the impact of important informative samples, we run experiments on both the original datasets and datasets that contain only important informative samples. From Table 4, we observe that training the classifier on datasets containing only important informative samples obtains better results than training it on the original datasets, which verifies the effectiveness of important informative samples. In addition, we notice that selecting suitable samples for training improves the classification results while reducing the number of samples used for model training. In summary, selecting suitable samples is a new perspective on imbalanced data classification, which can both reduce the number of samples used for classifier training and improve the performance of the classifier.
5.2 The Impact of Parameters
To analyze the impact of the parameter k in our model, we conduct experiments with k varying from 1 to 13 on three real-world imbalanced datasets. From the experimental results in Fig. 4, we find that the performance of our model is stable as k changes and that the model achieves the best performance when \(k=7\).
5.3 Ablation Study
Our model consists of three blocks: Data Space Block (DSB), Sample Selection Block (SSB), and Sample Difficulty Block (SDB). To analyze the effectiveness of each block, we build three variants of our hybrid model, each named after the block it removes: (1) DSB, which denotes our model without the DSB; (2) SSB, which denotes our model without the SSB; (3) SDB, which denotes our model without the SDB. Fig. 5 shows the experimental results on the abalone-19 and us-crime datasets. We find that all of these variants perform worse than our full model on both datasets, which illustrates that our model effectively integrates the three blocks to take advantage of each. Moreover, the SSB variant (our model without the SSB) performs the worst, which suggests that the SSB has the most critical impact among all blocks.
6 Conclusion
We aim to overcome the weaknesses of existing imbalanced learning methods from the perspectives of sample selection and sample difficulty. First, we divide the samples in an imbalanced dataset into different types according to their impacts on imbalanced data classification. Based on this, we can select suitable samples for sampling. Then, we propose a loss function based on sample difficulty. After that, we design a hybrid model to solve imbalanced data classification. To the best of our knowledge, this is the first model that integrates data space improvement, sample selection, and a sample-difficulty-based loss function for imbalanced data classification. Experiments on real-world imbalanced datasets have shown that our hybrid model performs better than competing methods. The ablation study verifies that each block of the model is valid.
References
[1] Barua, S., Islam, M.M., Yao, X., Murase, K.: MWMOTE-majority weighted minority oversampling technique for imbalanced data set learning. IEEE Trans. Knowl. Data Eng. 26(2), 405–425 (2012)
[2] Borsos, Z., Lemnaru, C., Potolea, R.: Dealing with overlap and imbalance: a new metric and approach. Pattern Anal. Appl. 21(2), 381–395 (2018)
[3] Bugnon, L.A., Yones, C., Milone, D.H., Stegmayer, G.: Deep neural architectures for highly imbalanced data in bioinformatics. IEEE Trans. Neural Netw. Learn. Syst. 31(8), 2857–2867 (2019)
[4] Cao, P., Zhao, D., Zaïane, O.R.: A PSO-based cost-sensitive neural network for imbalanced data classification. In: Li, J., et al. (eds.) PAKDD 2013. LNCS (LNAI), vol. 7867, pp. 452–463. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-40319-4_39
[5] Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P.: SMOTE: synthetic minority over-sampling technique. J. Artif. Intell. Res. 16, 321–357 (2002)
[6] Cui, Y., Jia, M., Lin, T.Y., Song, Y., Belongie, S.: Class-balanced loss based on effective number of samples. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9268–9277 (2019)
[7] Das, B., Krishnan, N.C., Cook, D.J.: RACOG and wRACOG: two probabilistic oversampling techniques. IEEE Trans. Knowl. Data Eng. 27(1), 222–234 (2014)
[8] Fernando, K.R.M., Tsokos, C.P.: Dynamically weighted balanced loss: class imbalanced learning and confidence calibration of deep neural networks. IEEE Trans. Neural Netw. Learn. Syst. (2021)
[9] He, H., Bai, Y., Garcia, E.A., Li, S.: ADASYN: adaptive synthetic sampling approach for imbalanced learning. In: 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence), pp. 1322–1328. IEEE (2008)
[10] He, H., Garcia, E.A.: Learning from imbalanced data. IEEE Trans. Knowl. Data Eng. 21(9), 1263–1284 (2009)
[11] Hu, Y., Zhang, Y., Gong, D., Sun, X.: Multi-participant federated feature selection algorithm with particle swarm optimizaiton for imbalanced data under privacy protection. IEEE Trans. Artif. Intell. (2022)
[12] Johnson, J.M., Khoshgoftaar, T.M.: Survey on deep learning with class imbalance. J. Big Data 6(1), 1–54 (2019)
[13] Krawczyk, B.: Learning from imbalanced data: open challenges and future directions. Progress Artif. Intell. 5(4), 221–232 (2016)
[14] Lemaître, G., Nogueira, F., Aridas, C.K.: Imbalanced-learn: a Python toolbox to tackle the curse of imbalanced datasets in machine learning. J. Mach. Learn. Res. 18(1), 559–563 (2017)
[15] Lin, T.Y., Goyal, P., Girshick, R., He, K., Dollár, P.: Focal loss for dense object detection. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2980–2988 (2017)
[16] Liu, Z., et al.: Self-paced ensemble for highly imbalanced massive data classification. In: 2020 IEEE 36th International Conference on Data Engineering (ICDE), pp. 841–852. IEEE (2020)
[17] Weinberger, K.Q., Saul, L.K.: Distance metric learning for large margin nearest neighbor classification. J. Mach. Learn. Res. 10(2) (2009)
[18] Yang, X., Kuang, Q., Zhang, W., Zhang, G.: AMDO: an over-sampling technique for multi-class imbalanced problems. IEEE Trans. Knowl. Data Eng. 30(9), 1672–1685 (2017)
[19] Zhao, H., Wang, R., Lei, Y., Liao, W.H., Cao, H., Cao, J.: Severity level diagnosis of Parkinson’s disease by ensemble k-nearest neighbor under imbalanced data. Expert Syst. Appl. 189, 116113 (2022)
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Shan, A., Chung, YC. (2023). A Hybrid Model Based on Samples Difficulty for Imbalanced Data Classification. In: Iliadis, L., Papaleonidas, A., Angelov, P., Jayne, C. (eds) Artificial Neural Networks and Machine Learning – ICANN 2023. ICANN 2023. Lecture Notes in Computer Science, vol 14254. Springer, Cham. https://doi.org/10.1007/978-3-031-44207-0_3
DOI: https://doi.org/10.1007/978-3-031-44207-0_3
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-44206-3
Online ISBN: 978-3-031-44207-0
eBook Packages: Computer Science, Computer Science (R0)