1 Introduction

Class imbalance occurs when the classes of a dataset are not equally represented, i.e. the numbers of samples per class differ widely; many real-world datasets show this imbalance [8, 13, 17, 18]. Since it is extremely common in practice, class imbalance has been widely studied in classical machine learning. Two types of imbalance are commonly distinguished: long-tailed imbalance [15] and step imbalance [3]. In step imbalance, classes are grouped into majority and minority classes; the two groups have different numbers of samples, but the number of samples is equal within the majority classes and equal within the minority classes. In long-tailed imbalance, the class frequency distribution is long-tailed: the samples of a few classes occupy most of the data, while samples of most classes rarely appear. In binary classification, any imbalance is necessarily a step imbalance. This paper focuses on binary classification.

Recently, deep neural networks (DNNs) have been applied to various classification tasks, e.g. image and text classification, and have achieved excellent performance. However, DNNs perform poorly on imbalanced data due to ineffective learning [3, 6]. In binary classification, a DNN-based classifier trained on imbalanced data will favor the negative (majority) class and achieve high accuracy on it, but show much lower accuracy on the positive (minority) class.

Existing methods use two strategies for dealing with imbalanced data [9]: data sampling and algorithmic adjusting. There are two data sampling techniques: over-sampling the positive class and under-sampling the negative class. Each technique has disadvantages: over-sampling can easily cause model over-fitting, because samples are repeatedly duplicated, whereas under-sampling may throw away valuable information and is not practicable for extremely imbalanced data. Algorithmic adjusting changes the learning process so that it gives higher importance to the positive class. One such technique is cost-sensitive learning, which considers the misclassification costs [19]. When applied to a DNN model, learning jointly optimizes the network parameters and the misclassification costs, instead of the network parameters alone [10, 21]; this joint optimization becomes difficult when the imbalance is large [7]. However, recent work has addressed the class imbalance problem without adding parameters [14, 22]. The solutions proposed in [14, 22] let the model optimize just the network parameters: they modify only the loss function and leave the model unchanged, a simple but effective approach. The essential advantage of this strategy is that it is easy to implement and use with existing DNN models.

Here, we studied two well-performing loss functions, mean false error (MFE) [22] and focal loss [14], specially designed to combat the imbalance problem. The two loss functions take different approaches to make the model concentrate more on the positive class. Focal loss differentiates between easy samples (samples with low losses) and hard samples (samples with high losses), so that it can down-weight the loss contribution of easy samples and focus training on hard samples. This gives more importance to the positive class, because most easy samples are in the negative class. Mean false error changes the total error by averaging the negative and positive sample errors separately and summing the two averages. This balances the loss contributions of the two classes and allows the positive class to contribute substantially to the total loss.

Each loss has a drawback. For focal loss, the contribution of the negative class (the class containing most easy samples) to the total loss is reduced; however, the total loss is an average over the whole data, so losses from negative samples can still dominate it. For mean false error, although the total loss is calculated by summing the average losses of both classes, the loss from the negative class can still dominate the overall loss, because of the effect of the easy samples. Moreover, mean false error works best if every batch of training data contains at least one positive sample: if a batch contains no positive sample, the total loss is biased by the average of the negative class, i.e. the easy samples.

To avoid these drawbacks, inspired by the two approaches, we formed a hybrid solution and defined a new loss function, the hybrid loss, in which the advantages of each loss compensate for the drawbacks of the other.

Our main contributions are: Firstly, we explored the ideas behind mean false error and focal loss, to understand how they perform when the data is imbalanced. Secondly, we defined the hybrid loss function, a hybrid of the mean false error and focal loss solutions, which combines the advantages of the two ideas, and we showed that the two loss functions can be combined in an efficient way. Lastly, we tested our hybrid function on image and text datasets, applying a variety of imbalance levels to each dataset.

2 Related Works

2.1 Imbalanced Learning

Anand et al. [2] studied the effect of class imbalance and found that it adversely affects the back-propagation algorithm: the loss of the negative class decreases rapidly, whereas the positive class loss increases significantly in the early iterations, and the network often converges very slowly. This occurs because the negative class completely dominates the network gradient used to update the weights. To deal with this, the contribution of the positive class must be increased and that of the negative class correspondingly decreased.

2.2 Focal Loss

Focal loss, \(FL(p_{t}) = -\alpha _{t}(1 - p_{t})^\gamma \log (p_{t})\), is a modification of the cross entropy loss [14]: a modulating factor \((1 - p_{t})^{\gamma }\) is added to the cross entropy loss. For notational convenience, let p be the predicted probability and y the ground-truth class; \(p_{t}\) is p if \(y = 1\) and \(1 - p\) otherwise. In the focal loss equation, \(\gamma \ge 0\) is a tunable focusing parameter. In practice, \(\alpha _{t}\) is \(\alpha \) if the ground-truth class of the sample is the positive class and \(1 - \alpha \) otherwise.

The motivation for defining the focal loss is that the cross entropy loss cannot correctly balance the weights of positive and negative samples under imbalance. Although adding a weighting factor \(\alpha \) partially addresses the problem, it cannot differentiate between easy and hard samples. Usually, most easy samples come from the negative class, and they contribute heavily to the total loss and dominate the network gradient. In general, hard samples add more discriminative information than easy samples [23], so learning from hard samples is more effective than learning from easy ones. For this reason, the contribution of easy samples needs to be reduced during learning, so that the model can concentrate on the hard samples.

Focal loss was designed to down-weight easy samples by adding the modulating factor to the cross entropy loss. This factor reduces the loss contribution from easy samples and focuses training on hard samples. The total loss of the \(\alpha \)-balanced variant of focal loss is \(l_{FL} = \frac{1}{n} \sum _{i = 1}^{n} -\alpha _{t}^{(i)}(1 - p_{t}^{(i)})^{\gamma }\log (p_{t}^{(i)})\), where n is the number of samples.
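For concreteness, the following is a minimal sketch of this batch-averaged, \(\alpha \)-balanced focal loss in TensorFlow, assuming a model that outputs probabilities \(p \in (0, 1)\) and labels in \(\{0, 1\}\); the function name and the default values of \(\alpha \) and \(\gamma \) are illustrative, not taken from [14].

```python
import tensorflow as tf

def focal_loss(alpha=0.25, gamma=2.0, eps=1e-7):
    """Batch-averaged alpha-balanced focal loss for binary labels in {0, 1}."""
    def loss(y_true, y_pred):
        y_true = tf.cast(y_true, y_pred.dtype)
        p = tf.clip_by_value(y_pred, eps, 1.0 - eps)   # keep log(p_t) finite
        p_t = y_true * p + (1.0 - y_true) * (1.0 - p)  # p_t = p if y = 1, else 1 - p
        alpha_t = y_true * alpha + (1.0 - y_true) * (1.0 - alpha)
        # l_FL = mean over the batch of -alpha_t (1 - p_t)^gamma log(p_t)
        return tf.reduce_mean(-alpha_t * (1.0 - p_t) ** gamma * tf.math.log(p_t))
    return loss
```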

We considered focal loss as a reference for our improved method, described in the next section.

2.3 Mean False Error

Mean false error was derived from the mean squared error (MSE) [22] by splitting the calculation of the total MSE over all samples into a sum of the average losses of the negative and positive samples: \(l_{MFE} = l_{MSE_{-}} + l_{MSE_{+}}\), where \(l_{MSE_{-}} = \frac{1}{n_{-}}\sum _{i=1}^{n_{-}}\frac{1}{2}(y^{(i)} - p^{(i)})^{2}\) and \(l_{MSE_{+}} = \frac{1}{n_{+}}\sum _{i=1}^{n_{+}}\frac{1}{2}(y^{(i)} - p^{(i)})^{2}\). In these equations, \(y^{(i)}\) is the ground-truth class of sample i, and \(n_{-}\) and \(n_{+}\) are the numbers of negative and positive samples, respectively.
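The computation can be written compactly by masking the per-sample squared errors by class, as in this sketch (ours, not from [22]); the guards on the divisions are an implementation detail for batches that happen to lack one class.

```python
import tensorflow as tf

def mean_false_error(y_true, y_pred):
    """l_MFE = l_MSE- + l_MSE+: per-class averages of 0.5 * (y - p)^2, summed."""
    y_true = tf.cast(y_true, y_pred.dtype)
    sq_err = 0.5 * tf.square(y_true - y_pred)
    pos_mask = y_true          # 1 for positive samples, 0 otherwise
    neg_mask = 1.0 - y_true
    # Guard the divisions so a batch missing one class does not produce NaN.
    l_neg = tf.reduce_sum(sq_err * neg_mask) / tf.maximum(tf.reduce_sum(neg_mask), 1.0)
    l_pos = tf.reduce_sum(sq_err * pos_mask) / tf.maximum(tf.reduce_sum(pos_mask), 1.0)
    return l_neg + l_pos
```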

The motivation for introducing mean false error is that the MSE cannot capture losses from the positive class effectively: loss contributions from negative samples overrule the contribution from positive samples, due to the much larger number of negative samples. Mean false error therefore computes the total loss as a sum of separately calculated average losses, one per class. This allows the positive class to contribute more fully to updating the weights of the network. In experiments on various benchmark datasets, Wang et al. [22] showed that mean false error performed better than a plain MSE approach. They further improved mean false error with the mean squared false error (MSFE) [22]. Both of these variants are compared with our hybrid method in Sect. 5.

3 Our Method

The principal advantage of focal loss is that it can control the difference between easy and hard samples and increase the loss contribution of the positive class by reducing the importance of easy samples. A weighting factor is added to the loss to balance the contributions of positive and negative samples. However, since the total loss is an average over both classes, the negative class can still dominate it. The mean false error solution diminishes this effect, because it makes the positive class more important during training.

As we have argued, the advantage of each loss can address the drawback of the other. Hence, to learn from imbalanced data more effectively, we mimicked the mean false error total-loss calculation, summing the separately computed average losses of the two classes: \(l_{Hybrid} = l_{FL_{-}} + l_{FL_{+}}\), where \(l_{FL_{-}} = \frac{1}{n_{-}} \sum _{i = 1}^{n_{-}} -\alpha _{t}^{(i)}(1 - p_{t}^{(i)})^{\gamma }\log (p_{t}^{(i)})\) and \(l_{FL_{+}} = \frac{1}{n_{+}} \sum _{i = 1}^{n_{+}} -\alpha _{t}^{(i)}(1 - p_{t}^{(i)})^{\gamma }\log (p_{t}^{(i)})\) are the average focal losses of the negative and positive classes, respectively.
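A minimal sketch of the hybrid loss, reusing the per-sample focal loss from Sect. 2.2 and averaging it per class as in mean false error; the default hyperparameter values are placeholders:

```python
import tensorflow as tf

def hybrid_loss(alpha=0.25, gamma=2.0, eps=1e-7):
    """l_Hybrid = l_FL- + l_FL+: per-class averages of the focal loss, summed."""
    def loss(y_true, y_pred):
        y_true = tf.cast(y_true, y_pred.dtype)
        p = tf.clip_by_value(y_pred, eps, 1.0 - eps)
        p_t = y_true * p + (1.0 - y_true) * (1.0 - p)
        alpha_t = y_true * alpha + (1.0 - y_true) * (1.0 - alpha)
        fl = -alpha_t * (1.0 - p_t) ** gamma * tf.math.log(p_t)  # per-sample focal loss
        pos_mask, neg_mask = y_true, 1.0 - y_true
        # Average each class separately, as in mean false error, then sum.
        l_neg = tf.reduce_sum(fl * neg_mask) / tf.maximum(tf.reduce_sum(neg_mask), 1.0)
        l_pos = tf.reduce_sum(fl * pos_mask) / tf.maximum(tf.reduce_sum(pos_mask), 1.0)
        return l_neg + l_pos
    return loss
```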

To use the hybrid loss in the back-propagation algorithm, we need its derivative. For focal loss, let \(p = \sigma (x) = \frac{1}{1 + e^{-x}}\) be the output of the logistic function, where x is its input. Lin et al. [14] define the quantity \(x_{t} = yx\), with labels \(y \in \{-1, 1\}\); by the definition of \(p_{t}\) in Sect. 2.2, \(p_{t} = \sigma (x_{t}) = \frac{1}{1 + e^{-yx}}\). Using \(p_{t}\), the derivative of the focal loss with respect to the input of sample i is: \(\frac{\partial l_{FL}}{\partial x_{t}^{(i)}} = \frac{1}{n}\, y^{(i)} (1 - p_{t}^{(i)})^{\gamma } (\gamma p_{t}^{(i)} \log (p_{t}^{(i)}) + p_{t}^{(i)} - 1)\). For mean false error, the derivative is: \(\frac{\partial l_{MFE}}{\partial x^{(i)}} = \frac{\partial l_{MSE_{-}}}{\partial x^{(i)}} + \frac{\partial l_{MSE_{+}}}{\partial x^{(i)}}\), where

$$\begin{aligned} \frac{\partial l_{MSE_{-}}}{\partial x^{(i)}} = - \frac{1}{n_{-}} (y^{(i)} - p^{(i)})\, p^{(i)} (1 - p^{(i)}), \end{aligned}$$
(1)
$$\begin{aligned} \frac{\partial l_{MSE_{+}}}{\partial x^{(i)}} = - \frac{1}{n_{+}} (y^{(i)} - p^{(i)})\, p^{(i)} (1 - p^{(i)}). \end{aligned}$$
(2)

Note that the derivative in (1) is used when sample i is negative, while (2) is used when it is positive.

Following the mean false error derivative, we can define the derivative of the hybrid loss by combining the derivatives of focal loss for the negative and positive classes: \(\frac{\partial l_{Hybrid}}{\partial x_{t}^{(i)}} = \frac{\partial l_{FL_{-}}}{\partial x_{t}^{(i)}} + \frac{\partial l_{FL_{+}}}{\partial x_{t}^{(i)}}\), where

$$\begin{aligned} \frac{\partial l_{FL_{-}}}{\partial x_{t}^{(i)}} = \frac{1}{n_{-}}\, y^{(i)} (1 - p_{t}^{(i)})^{\gamma } (\gamma p_{t}^{(i)} \log (p_{t}^{(i)}) + p_{t}^{(i)} - 1), \end{aligned}$$
(3)
$$\begin{aligned} \frac{\partial l_{FL_{+}}}{\partial x_{t}^{(i)}} = \frac{1}{n_{+}}\, y^{(i)} (1 - p_{t}^{(i)})^{\gamma } (\gamma p_{t}^{(i)} \log (p_{t}^{(i)}) + p_{t}^{(i)} - 1). \end{aligned}$$
(4)

As in mean false error, these derivatives are used for the corresponding samples from each class.
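In practice, frameworks such as TensorFlow differentiate the loss automatically, so these closed forms mainly serve as a sanity check. The sketch below (labels in \(\{-1, +1\}\) as in [14], \(\alpha _{t}\) omitted, logit values illustrative) compares the analytic per-sample gradient with autodiff:

```python
import tensorflow as tf

gamma = 2.0
x = tf.Variable([1.5, -0.3, 0.7])   # logits for three samples
y = tf.constant([1.0, -1.0, -1.0])  # labels in {-1, +1}, as in [14]

with tf.GradientTape() as tape:
    p_t = tf.sigmoid(y * x)         # p_t = sigma(x_t) with x_t = yx
    fl = -(1.0 - p_t) ** gamma * tf.math.log(p_t)
    total = tf.reduce_sum(fl)
auto_grad = tape.gradient(total, x)

# Analytic form y (1 - p_t)^gamma (gamma p_t log(p_t) + p_t - 1), i.e. (3)/(4)
# without the 1/n class-averaging factor, since the loss above is a plain sum.
analytic = y * (1.0 - p_t) ** gamma * (gamma * p_t * tf.math.log(p_t) + p_t - 1.0)
print(tf.reduce_max(tf.abs(auto_grad - analytic)))  # close to zero if they agree
```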

Our hypothesis is that the hybrid loss function will perform better than mean false error and focal loss, because it allows the positive class to contribute to the total loss to its full extent while differentiating between easy and hard samples at the same time.

4 Experimental Framework

4.1 Datasets

We used two benchmark datasets, CIFAR-100 [12] and IMDB review [16]. Both datasets are originally balanced, so we extracted various imbalanced sets from them. (1) Unbalanced Sets from CIFAR-100: CIFAR-100 has 100 classes with 600 images per class, comprising 500 training and 100 testing images. For a fair comparison, we created three sets of data, labeled Household, Tree 1 and Tree 2, following the setting of Wang et al. [22]. Each set has two classes, and the representation of one class was reduced to three imbalance levels: 20%, 10% and 5%. (2) Unbalanced Sets from IMDB Review: IMDB review is a binary sentiment classification dataset containing 25,000 movie reviews for training and 25,000 for testing; each set includes 12,500 positive and 12,500 negative reviews. We created three sets of data by keeping only 20%, 10% and 5% of the positive reviews.
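The papers do not fully specify the subset construction, so the following is one plausible helper (names and seeding are ours) that keeps a fraction of the positive class and all negative samples:

```python
import numpy as np

def make_imbalanced(x, y, pos_label, keep_frac, seed=0):
    """Keep only keep_frac (e.g. 0.20, 0.10, 0.05) of the positive class."""
    rng = np.random.default_rng(seed)
    pos_idx = np.flatnonzero(y == pos_label)
    neg_idx = np.flatnonzero(y != pos_label)
    kept_pos = rng.choice(pos_idx, size=int(len(pos_idx) * keep_frac), replace=False)
    idx = rng.permutation(np.concatenate([neg_idx, kept_pos]))  # shuffle both classes
    return x[idx], y[idx]
```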

4.2 Experiment Settings

Each unbalanced dataset was split into training, validation and test sets, all three with the same imbalance ratio. As both CIFAR-100 and IMDB review already had training and test sets, we held out 20% of the samples from the training set as the validation set. The resulting training and validation sets were used for training the model, and the test set for evaluating the trained model. Each experiment was run five times with different random splits.
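One way to realize such a split, assuming arrays x_train and y_train and a per-run seed run_id (our names, not from the paper), is a stratified hold-out, which preserves the imbalance ratio across the two resulting sets:

```python
from sklearn.model_selection import train_test_split

# Hold out 20% of the training samples for validation; stratify=y_train keeps
# the class (imbalance) ratio identical in the training and validation sets.
x_tr, x_val, y_tr, y_val = train_test_split(
    x_train, y_train, test_size=0.20, stratify=y_train, random_state=run_id)
```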

We used ResNet-50 [5] for image classification and the Transformer [20] presented in the Keras documentation for sentiment classification. Both models were trained with the Adam optimizer [11]. We ran the experiments using TensorFlow [1] and Keras [4].
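As a sketch of how the image model might be assembled, assuming the hybrid_loss factory from Sect. 3; the input shape matches CIFAR images, while the learning rate and loss hyperparameters are placeholders, as the paper does not report them here:

```python
import tensorflow as tf

# ResNet-50 backbone with a single sigmoid output for binary classification.
backbone = tf.keras.applications.ResNet50(
    include_top=False, weights=None, input_shape=(32, 32, 3), pooling="avg")
model = tf.keras.Sequential(
    [backbone, tf.keras.layers.Dense(1, activation="sigmoid")])
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
    loss=hybrid_loss(alpha=0.25, gamma=2.0),   # hybrid loss from Sect. 3
    metrics=[tf.keras.metrics.AUC(name="auc")])
```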

5 Results and Discussions

Table 1 reports the classification performances of the methods used on the CIFAR-100 sets. Our hybrid loss function performed better than the other losses in most cases and achieved the highest \(F_{1}\)-score in all cases.

Table 1. Performance of ResNet-50 with different loss functions. The high \(F_{1}\)-score and AUC demonstrate that the loss function is suited to image classification on imbalanced data

We report the classification performances of Transformer trained using different loss functions in Table 2. The hybrid loss achieved the highest \(F_{1}\)-score and AUC at all imbalance levels.

Table 2. Performance of the Transformer with different loss functions. The high \(F_{1}\)-score and AUC demonstrate that the loss function is suited to sentiment classification on imbalanced data

6 Conclusion

We studied two loss functions, mean false error and focal loss, for training deep neural networks on imbalanced data. As each of the two losses has advantages that can eliminate drawbacks of the other, we hybridized them in a hybrid loss function that applies the total-loss calculation of mean false error to focal loss. Tests of this hybrid loss on image and text classification, at various imbalance levels, showed that networks trained with it were superior to mean false error, mean squared false error and focal loss on the \(F_{1}\)-score, but worse in a few cases on the AUC.

This work focused on improving DNN performance for binary classification: future work will evaluate it on multi-class classification.