
1 Introduction

Imbalanced data classification is challenging [10, 13] and has wide applications in machine learning [3, 11, 19]. The main characteristic of imbalanced data is its skewed distribution: most samples belong to one class (the majority class) and the rest belong to the other (the minority class). This skewed distribution usually causes conventional machine learning classifiers to perform poorly.

To address imbalanced data classification, researchers have proposed plenty of methods. Existing methods fall mainly into two categories: data-level techniques and algorithm-level techniques. Data-level techniques address imbalance by changing the data distribution, whereas algorithm-level techniques increase the importance of the minority class by adjusting the learning or decision process.

However, we notice weaknesses in the above methods. On the one hand, traditional data-level methods usually do not consider how different types of samples in an imbalanced dataset affect model training. The study [16] indicates that some samples are useless and even harm model training. On the other hand, traditional algorithm-level methods [6, 8] usually focus on assigning a higher loss to the minority class but ignore the impact of sample difficulty.

This paper aims to remedy these weaknesses from two aspects. Firstly, it introduces the concept of “sample classification importance” to select suitable samples for sampling. Intuitively, classification importance represents how important a sample is for classifier training. We divide all samples in a dataset into three kinds, i.e., important informative samples, negative informative samples, and general informative samples. This categorization guides the selection of suitable samples for sampling to obtain satisfactory results. Secondly, we propose a loss function based on sample difficulty, which assigns different costs to different samples according to their difficulty.

We then propose a hybrid model for imbalanced data classification that integrates data space improvement, sample selection, and a difficulty-based loss function. Specifically, it contains three blocks: (1) the data space block, which transforms the data space so that samples lie close to their same-class nearest neighbors and are separated from other classes by a large margin, making samples easier to separate; (2) the sample selection block, which finds valuable samples for sampling to obtain a balanced dataset; and (3) the sample difficulty block, which applies a novel loss function that assigns a larger loss to more difficult samples when training the classifier.

In summary, our contributions are as follows. (1) We propose a new sample selection approach that uses fewer samples yet achieves better classification results. (2) We design a novel loss function based on sample difficulty for training on imbalanced data. (3) We design a hybrid model that integrates data space improvement, sample selection, sampling, and the proposed loss function to handle this problem. (4) Experimental results on real-world imbalanced datasets show that our hybrid model outperforms competing methods and that each block of the model is effective.

2 Related Work

2.1 Data-Level Methods

Data-level approaches [7] address imbalanced data by changing the data distribution. They can be further divided into undersampling and oversampling methods. Undersampling methods reduce the number of majority-class instances in the original dataset to balance it. The simplest form is random undersampling [10], which removes majority-class instances at random. In contrast, oversampling methods generate minority-class instances to obtain a balanced dataset. Random oversampling is the most straightforward approach, replicating minority instances drawn at random from the original data. In addition, many advanced sampling methods have been designed. SMOTE [5] is a commonly used method that selects close instances, draws a line between them, and generates a new instance at a point along that line. ADASYN [9], MWMOTE [1], and AMDO [18] are representative methods that generate synthetic minority instances. However, the weakness of data-level methods is apparent: how to select suitable instances for sampling remains an open question [4].

2.2 Algorithm-Level Methods

Algorithm-level approaches address imbalanced data by increasing the importance of the minority class in the learning or decision process. They mainly comprise cost-sensitive learning and novel loss functions. Cost-sensitive learning modifies the cost matrix to reduce bias towards the majority class; however, determining a suitable cost matrix is difficult. Researchers have recently designed several new loss functions [6, 8] for training deep neural networks on imbalanced data. The most widely used is the focal loss [15], which assigns a weight to each instance according to its prediction accuracy during model training.

3 Proposed Method

3.1 Overview

As shown in Fig. 1, our model consists of three blocks: (1) the data space block (DSB), which transforms the data space so that samples lie close to their same-class nearest neighbors, making them easier to separate; (2) the sample selection block (SSB), which finds valuable samples and builds a set from them for sampling; and (3) the sample difficulty block (SDB), which applies a novel loss function that assigns a larger loss to samples with higher difficulty when training the classifier.

Fig. 1. The architecture of our hybrid model

3.2 Data Space Block

Our model integrates a data space improvement technique to make the imbalanced data easier to separate. We use LMNN [17], which learns a transformation that pulls samples with the same class label towards the target sample and pushes samples with different class labels away from it, as shown in Fig. 2. The LMNN objective is \(\varphi (H)=(1-\mu ) \varphi _{\text{pull}}(H)+\mu \varphi _{\text{push}}(H)\), where H is the linear transformation of the input space and \(\mu \) is a positive weighting parameter. The first term penalizes large distances between a sample and its k nearest neighbors of the same class and is defined as \(\varphi _{\text{pull}}(H)=\sum _{p,\, q \in M(p)}\left\| H\left( x_{p}-x_{q}\right) \right\| ^{2}\), where M(p) is the set of k nearest neighbors of sample p that share its class label.

The second term penalizes small distances between a sample and samples of different classes, and is defined as:

\(\varphi _{\text{push}}(H)=\sum _{p,\, q \in M(p),\, l}\left( 1-\delta _{pl}\right) \max \left\{ 1+\left\| H\left( x_{p}-x_{q}\right) \right\| ^{2}-\left\| H\left( x_{p}-x_{l}\right) \right\| ^{2},\, 0\right\} \), where \(\delta _{pl}\) indicates whether samples \(x_{p}\) and \(x_{l}\) belong to the same class: if they belong to different classes, \(\delta _{pl}=0\); otherwise, \(\delta _{pl}=1\).
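To make the objective concrete, the following is a minimal NumPy sketch of \(\varphi (H)\) under the definitions above. The function and variable names are ours; a practical implementation would optimize H with a dedicated solver such as the one in [17] rather than evaluate the loss naively.

```python
import numpy as np

def lmnn_objective(H, X, y, targets, mu=0.5):
    """Evaluate phi(H) = (1 - mu) * phi_pull(H) + mu * phi_push(H).

    H       : (d, d) linear transformation of the input space
    X       : (n, d) samples, y : (n,) class labels
    targets : dict mapping each index p to its k same-class nearest neighbors M(p)
    """
    Xt = X @ H.T                                     # rows are H(x_p)
    pull, push = 0.0, 0.0
    for p, M_p in targets.items():
        for q in M_p:
            d_pq = np.sum((Xt[p] - Xt[q]) ** 2)      # ||H(x_p - x_q)||^2
            pull += d_pq                             # penalize far target neighbors
            for l in np.flatnonzero(y != y[p]):      # only delta_pl = 0 cases
                d_pl = np.sum((Xt[p] - Xt[l]) ** 2)
                push += max(1.0 + d_pq - d_pl, 0.0)  # hinge on margin violations
    return (1.0 - mu) * pull + mu * push
```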

Fig. 2. Comparison between the original data space and the data space transformed by the LMNN algorithm

3.3 Sample Selection Block

Traditional data-level methods usually select all samples for sampling to obtain a balanced dataset. However, the study [16] has indicated that not all samples are useful for model training. Thus, it is necessary to distinguish the types of samples and select suitable ones for sampling. In this part, we first introduce the definition of sample classification importance and then propose a sample selection method based on it.

Definition: Sample classification importance represents the importance of a sample for the classifier training.

Intuitively, we divide samples into three kinds, i.e., important informative samples, negative informative samples, and general informative samples, as shown in Fig. 3.

Important informative samples: These are the most informative samples for classifier training. As shown in Fig. 3, they are usually located close to the classification boundary. Increasing the importance of these samples helps improve classification performance [16].

Negative informative samples: By contrast, negative informative samples harm model training. As shown in Fig. 3, they are usually caused by indistinguishable noise, which can lead the model to severe overfitting. Thus, we need to reduce the impact of these samples.

General informative samples: Most samples are general informative samples that the model can classify correctly, as shown in Fig. 3. Each general informative sample contributes only minor importance, but the overall contribution is large because of their number. For this type, we keep only a small part of them to retain their "skeleton" and prevent overfitting, and remove the rest.

We evaluate sample classification importance with the kNN method [2]. If all neighbors of a sample belong to a different class, it is a negative informative sample. Conversely, if the sample and all its neighbors belong to the same class, it is a general informative sample. In all other cases, the sample is regarded as an important informative sample; such samples receive large importance values because they lie on the borderline between classes. We then introduce the sample selection method. According to sample classification importance, a dataset can be divided into three parts: a negative informative set, an important informative set, and a general informative set. We do not sample from the negative informative samples, since they harm classifier training. We focus on sampling the important informative samples because they are essential for finding the classification boundary. Finally, we sample only a small part of the general informative samples, since a small part suffices to retain their "skeleton". Our sample selection method is shown in Algorithm 1 in detail.

Fig. 3. Illustration of the types of samples

Algorithm 1. The sample selection method
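A possible realization of Algorithm 1 in Python is sketched below. The categorization follows the kNN rule defined above; the fraction of general informative samples retained (keep_general) is an illustrative parameter of ours, as the value is not fixed here.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def select_samples(X, y, k=7, keep_general=0.1, seed=0):
    """Categorize samples by kNN label agreement, then keep all important
    informative samples plus a small 'skeleton' of general informative ones."""
    X, y = np.asarray(X), np.asarray(y)
    rng = np.random.default_rng(seed)
    # k + 1 neighbors because the nearest neighbor of each sample is itself
    _, idx = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
    same = y[idx[:, 1:]] == y[:, None]     # (n, k) neighbor-label agreement

    negative = ~same.any(axis=1)           # all neighbors disagree: noise
    general = same.all(axis=1)             # all neighbors agree: easy sample
    important = ~negative & ~general       # mixed neighborhood: near boundary

    keep = important.copy()                # negative informative samples are dropped
    gen_idx = np.flatnonzero(general)
    if gen_idx.size:                       # retain a small 'skeleton'
        n_keep = max(1, int(keep_general * gen_idx.size))
        keep[rng.choice(gen_idx, size=n_keep, replace=False)] = True
    return X[keep], y[keep]
```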

3.4 Sample Difficulty Block

This block applies a new loss function based on sample difficulty to train the classifier on imbalanced data. We first introduce sample difficulty and then propose our loss function. As discussed in the sample selection block, it is important to find suitable samples from which the classification boundary can be learned as precisely as possible. Moreover, different suitable samples may pose different levels of difficulty in model training. Thus, we propose a method to calculate the level of sample difficulty.

Intuitively, a sample with more nearest neighbors carrying different class labels has a higher difficulty level. Based on this, we provide formula (1) to evaluate the sample difficulty (SD), where k is the number of nearest neighbors and \(kNN(x_{i, j}, D-D_{j})\) is the number of the k nearest neighbors of sample \(x_{i, j}\) that do not belong to class j.

$$\begin{aligned} {\text {SD}}\left( x_{i, j}\right) =\frac{k N N\left( x_{i, j}, D-D_{j}\right) }{k} \end{aligned}$$
(1)
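Formula (1) can be implemented directly with a standard neighbor search; a minimal sketch, assuming scikit-learn and a helper name of our choosing:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def sample_difficulty(X, y, k=7):
    """SD(x) = (number of the k nearest neighbors of x whose class label
    differs from x's own label) / k, as in formula (1)."""
    X, y = np.asarray(X), np.asarray(y)
    _, idx = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
    diff = (y[idx[:, 1:]] != y[:, None]).sum(axis=1)   # skip self at column 0
    return diff / k
```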

We now introduce our novel loss, starting from the cross-entropy (CE) loss for classification. For a classification task with p categories, the CE loss is defined as:

$$\begin{aligned} L_{\textrm{CE}}=-\frac{1}{n} \sum _{i=1}^{n} \sum _{j=1}^{p} y_{i, j} \log \hat{y}_{i, j} \end{aligned}$$
(2)

where n is the sample size, \(y_{i,j} \in \{0,1\}\) indicates whether sample i belongs to class j, and \(\hat{y}_{i,j} \in [0,1]\) is the model’s estimated probability that sample i belongs to class j.

Based on the CE loss, we add a factor that accounts for the different types of samples in a dataset, as discussed in the sample selection block. The weight \(w_{i, j}\) depends on the sample difficulty and is computed via formulas (1) and (3). We then define our sample difficulty loss function as formula (4). Note the key property of this loss: the weight \(w_{i, j}\) assigns a larger loss to samples that are more difficult to classify.

$$\begin{aligned} w_{i,j}=\log (1+SD(x_{i, j})) \end{aligned}$$
(3)
$$\begin{aligned} L_{\textrm{SD}}=-\frac{1}{n} \sum _{i=1}^{n} \sum _{j=1}^{p} w_{i,j} y_{i, j} \log \hat{y}_{i, j} \end{aligned}$$
(4)
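The loss is a weighted cross-entropy and is straightforward to express in TensorFlow (the framework used in Sect. 4.3). Since \(y_{i,j}\) is one-hot, only the weight of the true class matters, so a per-sample weight vector suffices. The sketch below is ours; in Keras, the same effect can be obtained by passing the weights as `sample_weight` to `Model.fit`.

```python
import tensorflow as tf

def sd_loss(y_true, y_pred, w, eps=1e-8):
    """Formula (4): L_SD = -(1/n) * sum_i w_i * sum_j y_ij * log(yhat_ij).

    y_true : (n, p) one-hot labels, y_pred : (n, p) predicted probabilities
    w      : (n,) weights, w_i = log(1 + SD(x_i)) from formula (3)
    """
    ce = -tf.reduce_sum(y_true * tf.math.log(y_pred + eps), axis=-1)
    return tf.reduce_mean(w * ce)
```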

4 Experiments

4.1 Data Description and Compared Methods

We employ several real-world imbalanced datasets provided by the imbalanced-learn toolbox [14] (originally collected from the UCI, LIBSVM, and KDD repositories) to test the performance of our hybrid model. These datasets differ in the number of samples, imbalance ratio (IR), and number of features; detailed information is given in Table 1. We randomly split each dataset into a training set (60\(\%\)), a validation set (20\(\%\)), and a test set (20\(\%\)).
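For reference, these benchmark datasets can be loaded through imbalanced-learn's fetch_datasets and split 60/20/20 as described. The stratified split and fixed seed below are our own choices for reproducibility, not stated requirements.

```python
from imblearn.datasets import fetch_datasets
from sklearn.model_selection import train_test_split

ds = fetch_datasets()["us_crime"]   # one of the benchmarks; also used in Sect. 5.3
X, y = ds.data, ds.target

# 60% train, then split the remaining 40% evenly into validation and test
X_tr, X_tmp, y_tr, y_tmp = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=0)
X_val, X_te, y_val, y_te = train_test_split(
    X_tmp, y_tmp, test_size=0.5, stratify=y_tmp, random_state=0)
```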

Table 1. Summary of imbalanced datasets

We compare our hybrid model with the following methods, including data-level methods: Random oversampling (ROS), MWMOTE [1], ADASYN [9], SMOTE [5], and AMDO [18]; algorithm-level methods: Focal loss [15], Class-balanced loss [6], and DWE loss [8].

4.2 Evaluation Metrics

We employ two commonly used metrics, G-mean and AUC [12], to evaluate the performance of imbalanced data classification. Let FN, FP, TP, and TN denote false negatives, false positives, true positives, and true negatives. TPR and TNR measure the proportions of correctly classified positive and negative instances, respectively, and G-mean combines TPR and TNR. AUC is the area under the receiver operating characteristic curve, which reflects the relationship between the false positive and true positive rates and describes the trade-off between incorrectly classified positive and correctly classified negative instances.

$$\begin{aligned} T N R=\frac{T N}{T N+F P} \end{aligned}$$
(5)
$$\begin{aligned} T P R=\frac{T P}{T P+F N} \end{aligned}$$
(6)
$$\begin{aligned} G- \text{ Mean } =\sqrt{T P R \times T N R} \end{aligned}$$
(7)
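Both metrics are available from standard tooling. A minimal sketch of formulas (5)-(7) plus AUC using scikit-learn, assuming 0/1 labels and taking y_score as the predicted probability of the positive (minority) class:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

def g_mean_auc(y_true, y_score, threshold=0.5):
    """G-mean from formulas (5)-(7) and AUC for a binary task."""
    y_pred = (np.asarray(y_score) >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    tnr = tn / (tn + fp)             # formula (5)
    tpr = tp / (tp + fn)             # formula (6)
    return np.sqrt(tpr * tnr), roc_auc_score(y_true, y_score)  # (7), AUC
```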

4.3 Implementation Details

We select a multilayer perceptron (MLP) as the classifier and train it with a batch size of 32 for 100 epochs using the TensorFlow framework. The classifier uses the Adam optimizer with a learning rate of 0.001. To obtain reliable results, we run every experiment ten times and report the average as the final result. Our model finds suitable samples and evaluates the sample difficulty level with the kNN method (k = 7).
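A Keras version of this training setup is sketched below. The hidden-layer sizes are our assumption, since the MLP architecture is not specified; the sample weights connect to the SD loss of Sect. 3.4.

```python
import numpy as np
import tensorflow as tf

def build_mlp(n_features, n_classes):
    """MLP classifier trained with Adam (lr = 0.001) as stated above;
    the two hidden layers are an assumed architecture."""
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(n_features,)),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(32, activation="relu"),
        tf.keras.layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
                  loss="categorical_crossentropy")
    return model

# model = build_mlp(X_tr.shape[1], 2)
# model.fit(X_tr, Y_tr_onehot, sample_weight=np.log1p(sd),  # w from formula (3)
#           batch_size=32, epochs=100, validation_data=(X_val, Y_val_onehot))
```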

4.4 Experimental Results

Tables 2 and 3 report the AUC and G-mean values on the imbalanced datasets. From the experimental results, we find that no single method achieves the best performance on all datasets. In contrast, our hybrid model achieves decent performance in most cases. The reasons our model performs well lie in the following aspects.

First, the data space block makes samples easier to classify. Second, unlike traditional imbalance resolution methods, we use sample selection to choose suitable samples for model training, which retains the critical classification information. Third, our sample difficulty loss function assigns each sample a loss corresponding to its difficulty, giving a higher loss to samples that are harder to distinguish. Together, these aspects make our model effective for imbalanced data classification.

Table 2. Values of AUC on 17 real-world imbalanced datasets
Table 3. Values of G-mean on 17 real-world imbalanced datasets

5 Discussion

5.1 The Impact of Important Informative Samples

In our model, we select suitable samples to train the classifier because important informative samples are essential for finding the classification boundary. To further illustrate their impact, we run experiments both on the original datasets and on datasets containing only important informative samples. From Table 4, we observe that training the classifier with only important informative samples yields better results than training with the original datasets, which verifies their effectiveness. We also note that selecting suitable samples improves the classification results while reducing the number of samples used for model training. In summary, selecting suitable samples is a new perspective on imbalanced data classification that both reduces the number of training samples and improves classifier performance.

5.2 The Impact of Parameters

To analyze the impact of the parameter k in our model, we conduct experiments varying k from 1 to 13 on three real-world imbalanced datasets. From the results in Fig. 4, we find that the performance of our model is stable as k changes, with \(k=7\) achieving the best performance.

Table 4. The Impact of Important Informative Samples
Fig. 4. Impact of the parameter k in our model

Fig. 5. Ablation study

5.3 Ablation Study

Our model consists of three blocks: the data space block (DSB), the sample selection block (SSB), and the sample difficulty block (SDB). To analyze the effectiveness of each block, we build three variants of our hybrid model: (1) our model without DSB; (2) our model without SSB; and (3) our model without SDB. Fig. 5 shows the experimental results on the abalone-19 and us-crime datasets. All variants perform worse than the full model on both datasets, which illustrates that our model effectively integrates the three blocks and benefits from each. Moreover, the variant without SSB performs worst, which demonstrates that SSB has the most critical impact among the blocks.

6 Conclusion

We aim to overcome the weaknesses of existing imbalanced learning methods from the perspectives of sample selection and sample difficulty. First, we divide the samples of an imbalanced dataset into different types according to their impact on imbalanced data classification; based on this, we can select suitable samples for sampling. Then, we propose a loss function based on sample difficulty. Finally, we design a hybrid model for imbalanced data classification. To the best of our knowledge, this is the first model that integrates data space improvement, sample selection, and a difficulty-based loss function for imbalanced data classification. Experiments on real-world imbalanced datasets show that our hybrid model outperforms competing methods, and the ablation study verifies that each block of the model is effective.